Batch System 10 Minute Tutorial

Why use a Batch System?

All codes (i.e. applications) should be run via the batch system on the CSF, DPSF and Zrek. This allows you to choose the most appropriate system resources to run your job – applications usually need different amounts of memory, number of CPU cores or even GPUs. The batch system ensures your job only runs when all of those resources are available. It allocates those resources to your job so that it runs as you requested and no other jobs can grab your resources.

It also ensures fair usage – there are many users and jobs on the system, all making different demands of the resources (memory, CPU cores, networking) and so allowing the batch system to choose exactly when to run your job is the only sensible way of running the system.

Note that running your code or application on the systems’ login nodes is not permitted. The login nodes are for other tasks (transferring files on and off the system, editing jobscripts, submitting jobs to the system). They don’t have a lot of memory nor many cores so trying to run your code there is inefficient and may also adversely affect other users. Applications found running on the login nodes may be killed by the sysadmins without warning.

Please do take the time to learn about the batch system. You can try out the sample job below – it shouldn’t take more than 10 minutes to work through the instructions on this page.

10 Minute Tutorial: Submitting a First Job to the Batch System

Here we describe in detail how to submit a simple, first job to the batch system, SGE. This tutorial can be done on the CSF, DPSF, and Zrek. Please read all of the text, don’t just look for the commands to type, as it will explain why you need to run the commands.

This is a serial job – i.e., uses only one core. This will help you become familiar with the principles of the batch system.

Do not simply run jobs on the login node – use the batch system as described below.

Step 1: Create a Job Description File

The jobscript file is the thing you submit to the job queue. It is just a simple plain-text file. It serves two main purposes:

  1. Specifies the number of CPU cores, memory and other resources you need to run your application.
  2. Specifies the commands needed to run your application and anything else your job will do (e.g., copy files).

A key benefit of the jobscript is that it documents exactly what you did to run your job – no need to remember what you did 6 months ago as it is all there in the jobscript.

Hence jobscripts should be considered part of your work that needs to be kept securely in your home directory. They are a record of how you ran a simulation or analysis, for example, or how you processed a particular dataset. Jobscripts are therefore part of your research methods. The home directory area of storage is a backed-up area. It is strongly recommended that you keep important files in your home area for safe keeping.

First, log in to the CSF (we assume the CSF in these notes but the information is valid for our other batch systems). Once you have logged in you will be at the command-line prompt:

[mxyzabc1@login1(csf) ~]$   you will type your commands here
   ^           ^  ^   ^
   |           |  |   | 
   |           |  |   +--- The directory (folder) you are currently in.
   |           |  |        ~ means your home folder which is your private folder.
   |           |  |
   |           |  +--- Name of the system (csf, dpsf, zrek)
   |           |
   |           +--- Name of the login node (some systems have more than one login node)
   +--- Your username appears here

Now create a directory in your home area for our first test job by running the following commands at the prompt:

# All of these commands are run on the CSF login node at the prompt
mkdir ~/first-job            # Create the directory/folder
cd ~/first-job               # Go in to the directory/folder

The next step is to use gedit, or another editor, on the CSF login node (running text editors on the login node is permitted) to create a file with exactly the following content (see below). You can give the file any name, as long as there are no spaces in the name – on this page we use first-job.txt but Linux doesn’t care what extension you use – .txt or .qsub or .jobscript for example:

# Run this command on the CSF login node at the prompt
gedit first-job.txt
  • Note for Windows users: You can create the jobscript below in Notepad and then transfer the file to the required system if you wish (CSF, DPSF and Zrek all see the same home directory). The file can have any name (we're using first-job.txt but anything will be OK). However, you must run the following command on the login node to convert the file from Windows format to Linux, otherwise the job will report an error when you submit it to the batch system (this is only needed for jobscripts, not any other file):
    # Run this command on the CSF login node at the prompt
    dos2unix first-job.txt
               # or whatever filename you used (we assume notepad adds .txt)

    But we recommend that Windows users install MobaXterm to log in to the remote system (CSF, DPSF, or Zrek). You can then run gedit on the system’s login node and you’ll get a Linux editor very similar to Notepad. The file you write will be saved directly on the remote system and will not need converting with dos2unix because it is already in the correct format.
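To see what dos2unix actually fixes, here is a hedged sketch (demo-job.txt is a made-up filename used only for this demonstration) showing how to spot the carriage-return characters that Windows editors add, and how to strip them with the standard tr tool if dos2unix happens to be unavailable:

```shell
# Create a demo file with Windows (CRLF) line endings, as Notepad would:
printf 'date\r\nsleep 120\r\n' > demo-job.txt

# grep can show whether carriage-return characters are present
# (both lines of this demo file contain one, so this prints 2):
grep -c "$(printf '\r')" demo-job.txt

# Strip the carriage returns with tr -- in essence what dos2unix does:
tr -d '\r' < demo-job.txt > demo-job-unix.txt
```

The converted file demo-job-unix.txt has plain Linux line endings and is safe to submit to the batch system.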

Here’s the jobscript content – put this in the text file you are creating either in gedit (run on the CSF, DPSF or Zrek login node) or notepad (run on your Windows PC):


#!/bin/bash

# -- SGE options (whose lines must begin with #$)

#$ -V                 # Inherit environment settings (e.g., from loaded modulefiles)
#$ -cwd               # Run the job in the current directory

# -- the commands to be executed (programs to be run) on a compute node:

date
hostname
sleep 120
date

Note: lines must NOT be indented in your text file – there should NOT be any spaces at the start of the lines. Cut-n-paste from this web page will work correctly in most browsers in that it won’t copy any leading space.

This BASH script has three parts:

  1. The first line, #!/bin/bash, means that the file you create is treated as a BASH script (scripts in Linux can use several languages, BASH is the one we use for jobscripts).
  2. The lines beginning with #$, are commands to the batch system (SGE) – they provide information about your job:
    • The first, #$ -V, ensures that any settings in your environment on the login node are copied to the compute node when your job runs. In fact the settings are copied as soon as you submit the job so that even if you log out before your job runs, the settings will still be available to your job. Note that the V is UPPERcase.
    • The second, #$ -cwd, ensures that a submitted job runs from the location (directory) from which it was submitted. Without this command, a job will run in your home directory! This will affect where output files are written to and usually where any input files used by your programs are read from.
  3. The remaining lines comprise our computational job – the applications we actually want to run. In this example we have a trivial job which runs simple Linux commands to output the date and time, followed by the name of the compute node on which the job runs, then waits for two minutes and finally outputs the date and time again. In a real jobscript you would do something more interesting and useful – e.g., run MATLAB or Abaqus or a chemistry program.
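The same #$ style is used to request other resources for bigger jobs. The two option lines below are a hedged sketch only – the resource and parallel-environment names (smp.pe, short) are examples whose exact names vary by system, so check the CSF documentation before using them in a real jobscript:

```
# Hedged sketch only -- resource names are site-specific examples;
# check the CSF documentation for the exact names before using these.
#$ -pe smp.pe 4        # example: ask for 4 CPU cores (an SMP parallel environment)
#$ -l short            # example: ask for a short/test-job resource
```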

Step 2: Copy to scratch area

We now copy the jobscript to your scratch area. We recommend you run jobs from the scratch filesystem: it is another area of storage on the CSF that is faster and larger. Your home directory is in an area that has a quota shared amongst everyone in your group – if your job fills up that area you will prevent your colleagues from working! Running jobs in the scratch area avoids this problem. PLEASE NOTE: the scratch area is a temporary area – files older than 3 months can be deleted by the system to free up space. You should always have a copy of important files in your home area (or other research data storage visible on the CSF that your research group may have access to). Think of scratch as fast, temporary storage – if your job reads and writes large files it will be faster if run from scratch.

Let’s copy our jobscript to the scratch area (we keep the original in our home area for safe keeping):

cp first-job.txt ~/scratch

We can now go into the scratch area:

cd ~/scratch

Our scratch directory is now our current working directory. When we submit the job to the batch queue (see next step) it will run in the scratch area – remember the #$ -cwd flag in the jobscript which makes the job run from whichever directory you are in when you submit the job. Any files that the job generates will also be written to the scratch area and if your job wants to read input data files (ours doesn’t in this example) then it would try to read them from the scratch area.

You will notice the prompt on the command-line will change to indicate where you are currently located:

[mxyzabc1@login2 scratch]$ 
                     # The prompt shows your current directory
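The keep-in-home, run-in-scratch, copy-back pattern can be sketched with ordinary shell commands. The directory names below are stand-ins for this demonstration, not the real CSF paths:

```shell
# Stand-ins for the home and scratch areas (illustrative paths only):
mkdir -p demo-home/first-job demo-scratch

# Keep the master copy of the jobscript safe in "home"...
echo 'sleep 120' > demo-home/first-job/first-job.txt

# ...then copy it to "scratch" and work from there, as we did above:
cp demo-home/first-job/first-job.txt demo-scratch/

# After the job finishes, copy important results back for safe keeping:
echo 'important result' > demo-scratch/results.out
cp demo-scratch/results.out demo-home/first-job/
```

On the real system the scratch area is simply ~/scratch, and the copy-back step is the one people most often forget.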

Step 3: Submit the Job to the Batch System

The third step is to submit the job to the batch system. Suppose the above script is saved in a file called first-job.txt. Then the following command will submit your job to the batch system:

qsub first-job.txt
  • Note: If you receive an error message
    bash: qsub: command not found

    then simply run the following command to make the queue commands available to you (then repeat the qsub command):

    module load services/gridscheduler

You’ll see a message printed similar to:

Your job 195501 ("first-job.txt") has been submitted

The job id 195501 is a unique number identifying your job (obviously you will receive a different number). You may use this in other commands later.

Step 4: Check Job Status

To confirm that your job is queued, or perhaps already running, enter the command:

qstat

  • If the job is still queued (waiting to run) the output from qstat will look like
    job-ID  prior    name        user       state  submit/start at      queue             slots ja-task-ID 
    195501  0.00000  first-job.  mxyzabc1   qw     04/03/2013 09:33:37                    1        
  • If your job is already running, the output will look like
    job-ID  prior    name        user       state  submit/start at      queue             slots ja-task-ID 
    195501 0.05350   first-job.  mxyzabc1   r      04/03/2013 09:33:49  serial.q@node332  1        
  • If something is wrong with your jobscript you'll see Eqw meaning an error has occurred – the job will wait forever! Please contact us, stating your job-ID and the system you are logged in to, and we'll let you know what has gone wrong (the most common error is creating the file in Notepad and then forgetting to run dos2unix on the file once it has been transferred to the particular CIR system).
    job-ID  prior    name        user       state  submit/start at      queue             slots ja-task-ID 
    195501 0.05350   first-job.  mxyzabc1   Eqw    04/03/2013 09:33:49                    1        
  • If there is no output, your job has finished.

Step 5: Review Job Results/Output

Each job will output at least two files, one for standard output and one for standard error (instead of printing this output to the console, which isn't possible in the batch system). In this example the standard output file is called:

first-job.txt.o195501
Where the number is your unique jobid. You can read the files using either gedit or via the cat command. E.g.

cat first-job.txt.o195501

which in this case contains the following:

Wed Apr  3 09:33:49 BST 2013
node332
Wed Apr  3 09:36:49 BST 2013

This shows the date, twice with a difference of 120 seconds, and the name of the compute node on which the job ran, as expected (refer back to the commands we ran in our first jobscript).

The standard error file has a similar name:

first-job.txt.e195501

In this case the file is empty, indicating that there were no errors associated with running the job.

Note that the names of the output files begin with the name of your jobscript (first-job.txt in this example) and end with .oJOBID and .eJOBID where JOBID is the unique number reported by qsub. You can change the start of the name of the output files by adding the following line to your jobscript (change myjobname to something meaningful for your job):

#$ -N myjobname
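As a quick illustration of how those filenames are put together (the jobname and jobid values below are just the examples used on this page):

```shell
# Illustrative only: how the batch system composes the output filenames.
JOBNAME=myjobname     # set by '#$ -N' (defaults to the jobscript's filename)
JOBID=195501          # assigned by qsub when you submit the job

echo "${JOBNAME}.o${JOBID}"   # standard output file: myjobname.o195501
echo "${JOBNAME}.e${JOBID}"   # standard error file:  myjobname.e195501
```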


Points to remember

  • Do not simply run your apps on the login node. Write a jobscript and submit it to the batch system. Your app will run on a more powerful node and won’t upset the login node (and the sysadmins!)
  • You can write your jobscript on the login node using gedit.
  • Alternatively if you use notepad on windows ensure you run dos2unix on the jobscript once you’ve transferred it to the CSF.
  • Keep your important files in your home area but copy them to the scratch area and run your jobs from there. Don’t forget to copy important results back to home.
  • Submit the job using qsub
  • Check on the job using qstat
  • Look in the .oNNNNN and .eNNNNN files generated by the job for output and errors.
  • If you have any questions please contact us – we're here to help.

More on Using the Batch System (parallel jobs, GPUs, high-mem)

The batch system, SGE, has a great deal more functionality than described above. The following features are fully documented (with example jobscripts) in the main CSF or DPSF SGE documentation:

  • Running multi-core/SMP (e.g., OpenMP) jobs
  • Running multi-host (e.g., MPI) jobs
  • Running job arrays — submitting many similar jobs by means of just one qsub script/command
  • Running GPU jobs
  • Selecting different Intel/AMD hardware
  • Selecting high-memory hardware

Finally, each centrally installed application has its own webpage where you will find examples of how to submit a job for that specific piece of software, plus any other information relevant to running it in batch, such as extra settings that may be required for it to work. See the CSF or DPSF software pages.

Last modified on July 27, 2018 at 10:29 am by George Leaver