Batch System 10 Minute Tutorial

Introduction

This page offers new CSF users a tutorial that covers usage of the batch system to run a simple job on the CSF.

The tutorial also provides some information about the storage areas on the CSF and some common Linux commands used to manage your files.

After doing the tutorial you’ll be able to use the CSF. A further tutorial is also available for running more complicated parallel jobs.

If you are interested in attending the 1-day Intro to CSF training course which runs a couple of times each semester, please take a look at the course booking page for details of the schedule and availability.

Before we begin the tutorial we’ll explain what the batch system is and why we need to use it.

Some basics: What is a batch system and why use it?

Join the queue…

Initially a batch system can be thought of as a job queue. You submit jobs to the queue and the system will pick them out of the queue to run them.

The jobs will do whatever commands you ask them to do (for example run an app such as a chemistry app, or a bioinformatics app or whatever application is appropriate to your work).

When the jobs finish you should have some new files containing the results!

No GUI

Something to note about batch jobs is that you never see an application’s graphical user interface (GUI), if it has one. Batch jobs run without any interaction – all options / flags / input files etc will be specified on the command-line in a jobscript (more on those later).

When the app is running, all output will be saved to files. This will be a new way of working if you are used to running an app in a desktop environment (e.g., on Windows).

At this point you might be thinking you don’t like the idea of your work (jobs) waiting in a queue. How long will it queue for? Why can’t it just run immediately? Read on to find out more.

Ask for extra memory, cores or a GPU?

Submitting your work as a job to the batch system allows you to specify the system resources needed by your job – the applications you’ll be running on the CSF usually need different amounts of memory, number of CPU cores or even GPUs.

The batch system ensures your job only runs when all of the required resources are available. It then allocates those resources to your job (so that it runs correctly) and no other jobs can grab your resources.

But don’t worry if you’re not sure what resources you’ll need – there are sensible defaults which you can use to begin with. After trying the defaults, you might find your app needs more memory to process your data, or that it can use more CPU cores to make it run faster.

So you might find that your first attempts at running jobs, whether to run simulations or process datasets, don’t complete successfully. Maybe you’ll need to run the jobs again but request more memory so that your app can load a large dataset. Don’t worry – failed jobs don’t do any harm. You can simply delete the output files from these failed jobs (if there are any), modify your jobscript to ask for more resources (more memory, CPUs, …) and then resubmit your jobs.

Fair usage

The batch system also ensures fair usage for you and others – there are many users and jobs on the system, all making different demands of the resources (memory, CPU cores, GPUs) and so allowing the batch system to choose exactly when to run your job is the only sensible way of running the system.

The fact that jobs are starting and finishing all the time means you rarely have to wait very long for your requested resource to become free so that your jobs can start.

There are other factors which control when jobs run (and how many of your jobs can run at the same time) but the use of a job queue should not put you off using the system!

Let the CSF get on with it

An added bonus of a batch system is that once you’ve submitted your jobs to the system, you don’t actually need to remain logged in. You can log off, go home or go to a meeting or do something else with your PC/laptop.

Meanwhile the batch system will run your jobs. It can even email you when a job has finished.

Without a batch system you would have to remain logged in to the CSF until the job had finished, which could be a problem for a simulation that takes several days to complete.

Can I just run my app on the Login Node?

Running your code or application directly on the login nodes is not permitted. The login nodes are for other tasks (transferring files on and off the system, editing jobscripts, submitting jobs to the system). They have neither much memory nor many cores, so trying to run your apps there is inefficient and may also adversely affect other users.

Applications found running on the login nodes may be killed by the sysadmins without warning.

Please do take the time to learn about the batch system. While it may be an unfamiliar way of working initially, particularly if you are used to simply running your apps immediately on a desktop PC, there are actually a lot of benefits to using the batch system – you’ll see it is a very powerful way of working as you begin to do your real work. In this tutorial you can try out the sample job below – it shouldn’t take more than 10 minutes to work through the instructions on this page.

The tutorial assumes you have logged in to the CSF – please see the login instructions for more information.

10 Minute Tutorial: Submitting a First Job to the Batch System

Here we describe in detail how to submit a simple, first job to the batch system (we use a batch system called SGE). Please read all of the text rather than just looking for the commands to type, as it explains why you need to run the commands.

What type of job will we run?

We will run a serial job – i.e., it uses only one CPU core. We’ll see later that many of the real applications on the CSF can use more than one CPU core (a multi-core job) to speed up their processing, giving you the results sooner!

You could also request more memory than the default 4-5GB of RAM. You could also request a GPU (once we have given you access to them).

But initially a simple 1-core (serial) job will help you become familiar with the principles of the batch system. These jobs are very common – you may well want to use this type of job in your real work after the tutorial.

Please remember: Do not simply run jobs on the login node – use the batch system as described below.

Step 0: Create a Folder for the Job Files

In the following steps we will be creating a jobscript file. We will explain more about the file in the next step. The job will also create some files (any output generated by the job is saved to files).

Hence we will create a directory (folder) for the job to keep all of the files together in one place. This is important – you will likely run a lot of jobs on the CSF so it will be easier to manage all of your work if you keep your files tidy.

When you log in to the CSF you are placed in your home directory. This area of storage is private to you and, importantly, is backed-up (not all storage areas on the CSF are backed-up). It is strongly recommended that you keep important files in your home directory for safe keeping – and this includes your jobscripts!

Once you have logged in you’ll be at the command-line prompt:

[mxyzabc1@login1 [csf3] ~]$   you will type your commands here, "at the prompt"
  ^            ^   ^    ^
  |            |   |    | 
  |            |   |    +--- The directory (folder) you are currently in.
  |            |   |         ~ means your home folder which is your private folder.
  |            |   |
  |            |   +--- Name of the system
  |            |
  |            +--- Name of the login node (some systems have more than one login node)
  |
  +--- Your username appears here

Now create a directory (usually referred to as a folder in Windows or MacOS) in your CSF home storage area, for our first test job, by running the following commands at the prompt:

# All of these commands are run on the CSF login node at the prompt
mkdir ~/first-job            # Make (create) the directory (folder)
cd ~/first-job               # Change to (go in to) the directory (folder)

Notice that the prompt has changed to indicate you have moved in to the first-job folder:

[mxyzabc1@login1 [csf3] first-job]$   
                           ^
                           |
                           +--- The prompt shows we are now in the first-job folder

Step 1: Create a “Jobscript” – a job description file

The jobscript file is the thing you submit to the batch system (i.e., the queue of jobs). It is just a simple plain-text file. It serves two main purposes:

  1. It specifies the number of CPU cores, memory and other resources you need to run your application.
  2. It specifies the actual command(s) needed to run your application and anything else your job will do (e.g., copy files).

A key benefit of the jobscript is that it documents exactly what you did to run your job – no need to remember what you did 6 months ago as it is all there in the jobscript. If you ever need to run a job again, or run similar jobs, having the jobscript available is very useful!

Hence jobscripts should be considered part of your work that needs to be kept securely in your home directory. They are a record of how you ran a simulation or analysis, for example, or how you processed a particular dataset. Jobscripts are therefore part of your research methods.

We now use gedit, or another editor, on the CSF login node (running text editors on the login node is permitted) to create a file with exactly the following content (see below). You can name the file anything you like, as long as there are no spaces in the name – in this example we use first-job.txt but Linux doesn’t care what extension you use – .txt or .qsub or .jobscript for example:

# Run this command on the CSF login node at the prompt
gedit first-job.txt
  #
  # Please IGNORE any warnings / messages that appear in the terminal from gedit.
  # For example: (gedit:5246): dconf-WARNING **: .........
  • Note for Windows users: You can create the jobscript below in Notepad and then transfer the file to the CSF, although we don’t recommend this method. The file can have any name (we’re using first-job.txt but anything will be OK – you’ll find that Notepad names files with .txt at the end anyway). However, you must run the following command on the login node to convert the file from Windows format to Linux format, otherwise the job will report an error when you submit it to the batch system (this is only needed for jobscripts, not any other file):
    # Run this command on the CSF login node at the prompt if the jobscript was written in Notepad
    dos2unix first-job.txt
               #
               # or whatever filename you used (we assume Notepad adds .txt)
    

    But we recommend that Windows users install MobaXterm to log in to the CSF. You can then run gedit on the CSF login node and you’ll get a Linux text-editor very similar to Notepad. The file you write will be saved directly on the CSF and will not need converting with dos2unix because it is already in the correct format.
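If you are unsure whether a file still has Windows line endings, standard Linux tools can tell you. The following is a minimal sketch, not an official CSF procedure: it builds a throwaway file (demo-job.txt, a hypothetical name) with Windows line endings, counts the affected lines, and strips the carriage returns with tr. On the CSF, dos2unix does the conversion for you in place; tr is used here only so the example can be tried anywhere.

```shell
# Create a demo file with Windows (CRLF) line endings; the file name is hypothetical.
demo_dir=$(mktemp -d)
printf '#!/bin/bash --login\r\n#$ -cwd\r\n/bin/date\r\n' > "$demo_dir/demo-job.txt"

# A carriage return character, the extra byte that Windows adds to each line.
cr=$(printf '\r')

# Count lines containing a carriage return (0 means the file is in Linux format).
crlf_before=$(grep -c "$cr" "$demo_dir/demo-job.txt" || true)
echo "Lines with Windows endings before conversion: $crlf_before"

# Strip the carriage returns into a new file.
tr -d '\r' < "$demo_dir/demo-job.txt" > "$demo_dir/demo-job.fixed.txt"

crlf_after=$(grep -c "$cr" "$demo_dir/demo-job.fixed.txt" || true)
echo "Lines with Windows endings after conversion: $crlf_after"
```

A count of 0 means the file is ready to submit.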

Here’s the jobscript content – put this in the text file you are creating either in gedit (run on the CSF login node) or Notepad (run on your Windows PC):

#!/bin/bash --login

# SGE options (whose lines must begin with #$)

#$ -cwd               # Run the job in the current directory

# Now the example commands to be executed (programs to be run) on a compute node:
# In your real work, you'll run apps such as a chemistry app, or a bio-inf app.

/bin/date
/bin/hostname
/bin/sleep 120
/bin/date

Note: lines must NOT be indented in your text file – there should NOT be any spaces at the start of the lines. Cut-n-paste from this web page will work correctly in most browsers in that it won’t copy any leading space.

This BASH script has three parts:

  1. The first line, #!/bin/bash --login, means that the file you create is treated as a BASH script. Linux provides several scripting languages but BASH is the one you use at the command-line once you’ve logged in. So we usually use it for jobscripts too. This means that any commands you would normally type at the command-line can also go in to your jobscript to be run as part of a batch job.
  2. The lines beginning with #$ are commands to the batch system – they provide information about your job.

    In this simple jobscript the line #$ -cwd ensures that a submitted job runs from the location (directory) from which it was submitted. Without this command, a job will run in your home directory! This will affect where output files are written to and usually where any input files used by your programs are read from.

  3. The remaining lines comprise our computational job – the applications we actually want to run. In this example we have a trivial job which runs simple Linux commands to output the date and time, followed by the name of the compute node on which the job runs, then waits for two minutes and finally outputs the date and time again. In a real jobscript you would do something more interesting and useful – e.g., run MATLAB or Abaqus or a chemistry program.
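To see the three parts for yourself, the jobscript can be picked apart with head and grep. This is a minimal sketch, assuming the jobscript text shown above; it is written to a temporary file here (a stand-in for first-job.txt) so the commands can be tried anywhere without touching your real files.

```shell
# Write the tutorial jobscript to a temporary file (a stand-in for first-job.txt).
jobscript=$(mktemp)
cat > "$jobscript" <<'EOF'
#!/bin/bash --login

# SGE options (whose lines must begin with #$)

#$ -cwd               # Run the job in the current directory

/bin/date
/bin/hostname
/bin/sleep 120
/bin/date
EOF

# Part 1: the first line names the shell that interprets the script.
first_line=$(head -n 1 "$jobscript")
echo "Interpreter line: $first_line"

# Part 2: lines starting with #$ are read by the batch system (BASH treats them as comments).
directives=$(grep -c '^#\$' "$jobscript" || true)
echo "Batch system option lines: $directives"

# Part 3: lines that are neither blank nor comments are the commands the job runs.
commands=$(grep -c -v -e '^#' -e '^$' "$jobscript" || true)
echo "Command lines: $commands"
```

Nothing is submitted or executed here; the jobscript is only read as text.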

Step 2: Copy to scratch area

We now copy the jobscript to your scratch area.

We recommend you run jobs from the scratch filesystem: it is another area of storage on the CSF that is faster and larger. Your home directory is in an area that has a quota to be shared amongst everyone in your group – if your job fills up that area you will prevent your colleagues from working! Running jobs in the scratch area avoids this problem.

PLEASE NOTE: the scratch area is a temporary area – files unused in the last 3 months can be deleted by the system to free up space. You should always have a copy of important files in your home area (or other research data storage visible on the CSF that your research group may have access to). Think of scratch as fast, temporary storage – if your job reads and writes large files it will be faster if run from scratch.

A good way of working is to create your important files in the home area, then copy them to scratch when you need to use them in your jobs. That way you always have a safe copy in your home area.

So let’s copy our jobscript to the scratch area (we keep the original in our home area for safe keeping):

cp first-job.txt ~/scratch

We can now go in to the scratch area:

cd ~/scratch

Our scratch directory is now our current working directory. When we submit the job to the batch queue (see next step) it will run in the scratch area. Remember, the #$ -cwd flag in the jobscript makes the job run from whichever directory you are in when you submit the job.

Any files that the job generates will also be written to the scratch area and if your job wants to read input data files (ours doesn’t in this example) then it would try to read them from the scratch area.

You will notice the prompt on the command-line will change to indicate where you are currently located:

[mxyzabc1@login2 [csf3] scratch]$ 
                           #
                           # The prompt shows your current directory
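The whole "keep it in home, run it in scratch, copy results back" cycle can be sketched as follows. This is an illustration only: home_area and scratch_area are temporary stand-in directories so the commands can be tried safely anywhere (on the CSF they would be ~/first-job and ~/scratch), and results.dat is a hypothetical output file.

```shell
# Stand-ins for the real storage areas.
home_area=$(mktemp -d)
scratch_area=$(mktemp -d)

# The master copy of the jobscript lives in the (backed-up) home area.
printf '#!/bin/bash --login\n#$ -cwd\n/bin/date\n' > "$home_area/first-job.txt"

# Copy, do not move, so the original stays safe in home.
cp "$home_area/first-job.txt" "$scratch_area/"

# Pretend the job ran in scratch and produced a result file.
echo "example result" > "$scratch_area/results.dat"

# Copy important results back to home once the job has finished.
cp "$scratch_area/results.dat" "$home_area/"

# The home area now holds both the jobscript and the results.
ls "$home_area"
```

Using cp rather than mv in both directions means a file is never held only in scratch.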

Step 3: Submit the Job to the Batch System

Recap: So far we have created a directory for the jobscript in our home area, written a jobscript text file there (where it is stored safely on backed-up storage), then copied it to the fast temporary scratch storage and changed directory to our scratch area where we’ll run the job from.

The next step is to actually submit the job to the batch system. Suppose the above script is saved in a file called first-job.txt. Then the following command will submit your job to the batch system:

qsub first-job.txt

You’ll see a message printed similar to:

Your job 195501 ("first-job.txt") has been submitted

The job id 195501 is a unique number identifying your job (obviously you will receive a different number). You may use this in other commands later.
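If you want to reuse the job ID in later commands, it can be pulled out of the message that qsub prints. This is a minimal sketch, assuming the message has exactly the form shown above; the sample text is hard-coded so the example can be tried without submitting anything.

```shell
# Sample message copied from this tutorial. In real use you could capture it with:
#   submit_msg=$(qsub first-job.txt)
submit_msg='Your job 195501 ("first-job.txt") has been submitted'

# The job ID is the third whitespace-separated word of the message.
job_id=$(echo "$submit_msg" | awk '{print $3}')
echo "Job ID: $job_id"

# The job's output files are named after the jobscript and this ID.
echo "Expect: first-job.txt.o${job_id} and first-job.txt.e${job_id}"
```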

Step 4: Check Job Status

To confirm that your job is queued, or perhaps already running, enter the command

qstat
  • If the job is still queued (waiting to run) the output from qstat will look like
    job-ID prior   name       user     state submit/start at     queue      slots ja-task-ID
    ----------------------------------------------------------------------------------------
    195501 0.00000 first-job. mxyzabc1 qw    04/10/2018 09:33:37            1
    
  • If your job is already running, the output will look like
    job-ID prior   name       user     state submit/start at     queue      slots ja-task-ID
    ----------------------------------------------------------------------------------------
    195501 0.05350 first-job. mxyzabc1 r     04/10/2018 09:33:49 serial.q@n 1
    
  • If your jobs have finished, qstat will show no output – meaning you have no jobs in the queue, either running or waiting.
    [mxyzabc1@login2 [csf3] scratch]$ qstat
    [mxyzabc1@login2 [csf3] scratch]$                 # If no output, all jobs have finished
    
  • If something is wrong with your jobscript you’ll see Eqw, meaning an error has occurred – the job will wait forever! Please contact us at its-ri-team@manchester.ac.uk stating your job-ID and the system you are logged in to and we’ll let you know what has gone wrong.
    job-ID prior   name       user     state submit/start at     queue      slots ja-task-ID
    ----------------------------------------------------------------------------------------
    195501 0.05350 first-job. mxyzabc1 Eqw   04/10/2018 09:33:49            1
    

    HINT: the most common error is creating the file in Notepad on Windows and then forgetting to run dos2unix on the file once it has been transferred to the CSF. If you wrote the jobscript in Notepad you must use dos2unix on it to convert it to Linux format (you can do that now, then resubmit the job by running the qsub command used earlier again).

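The state column is the part of the qstat output to watch. As a rough sketch, the snippet below reads qstat-style text and prints a plain-English meaning for each of the three states described above. The sample lines are copied from this page so the example can be tried without a running job; explain_states is a hypothetical helper name, not a CSF command.

```shell
# Translate the state column (5th field) of qstat-style output into plain English.
explain_states() {
    # Skip the two header lines, then look at each job line.
    awk 'NR > 2 && NF > 0 {
        if ($5 == "qw")       meaning = "waiting in the queue"
        else if ($5 == "r")   meaning = "running"
        else if ($5 == "Eqw") meaning = "error, will not run"
        else                  meaning = "state not covered in this tutorial"
        print $1 ": " meaning
    }'
}

# Sample output taken from the examples above (real use: qstat | explain_states).
sample='job-ID prior   name       user     state submit/start at     queue      slots ja-task-ID
----------------------------------------------------------------------------------------
195501 0.00000 first-job. mxyzabc1 qw    04/10/2018 09:33:37            1'

state_report=$(printf '%s\n' "$sample" | explain_states)
echo "$state_report"
```

If qstat prints nothing at all, the helper prints nothing either, which matches the "all jobs have finished" case.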

Step 5: Review Job Results/Output

Each job will output at least two files, one for standard output and one for standard error (instead of printing this output to the screen, which isn’t possible in the batch system).

Some applications will send general messages to the standard output file and error messages to the standard error file, but that isn’t always the case. You should always check both files for messages.

Let’s list the files in the current directory using the Linux ls command:

ls
first-job.txt  first-job.txt.o195501  first-job.txt.e195501

We can see our original jobscript first-job.txt and two new files, first-job.txt.o195501 and first-job.txt.e195501, that have been generated by the job (remember, the job ID number 195501 will be different for your job!)

In this example the standard output file is called:

first-job.txt.o195501

Here the number is your unique job ID. You can read the files using either gedit or the cat command, e.g.:

cat first-job.txt.o195501

In this case the file contains the following:

Thu Oct  4 09:33:49 BST 2018
node332
Thu Oct  4 09:35:49 BST 2018

This shows the date twice, 120 seconds apart, and the name of the compute node on which the job ran, as expected (refer back to the commands we ran in our first jobscript).

The standard error file has a similar name:

first-job.txt.e195501

In this case the file is empty, indicating that there were no errors associated with running the job. If you see some error messages in your .e file, check you have typed the jobscript correctly – particularly the names of the Linux commands run by the jobscript.
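Checking both files after every job quickly becomes routine, so it can help to script it. The following is a minimal sketch: it uses a temporary directory with two made-up files named like the ones above, and tests whether the error file has any content with the shell's -s test.

```shell
# Build stand-in output files in a temporary directory (on the CSF these already exist).
job_dir=$(mktemp -d)
job_id=195501
echo "node332" > "$job_dir/first-job.txt.o$job_id"
: > "$job_dir/first-job.txt.e$job_id"        # an empty error file

# -s is true only if the file exists and is not empty.
if [ -s "$job_dir/first-job.txt.e$job_id" ]; then
    error_status="error file has content, read it"
else
    error_status="error file is empty"
fi
echo "Job $job_id: $error_status"

# Show the standard output file.
cat "$job_dir/first-job.txt.o$job_id"
```

An empty .e file is a good sign but not proof of success, so the .o file is still worth reading.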

Note that the names of the output files begin with the name of your jobscript (first-job.txt in this example) and end with .oJOBID and .eJOBID where JOBID is the unique number reported by qsub. If you wish to, you can change the start of the name of the output files by adding the following line to your jobscript (change myjobname to something meaningful for your job):

#$ -N myjobname
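Put together with the earlier jobscript, that might look like the sketch below. myjobname is a placeholder and the sleep has been left out to keep the example short. JOB_NAME and JOB_ID are environment variables that SGE sets inside a running job; the fallback values are an addition here so that the script also runs outside the batch system.

```shell
#!/bin/bash --login

#$ -cwd               # Run the job in the current directory
#$ -N myjobname       # Output files become myjobname.oJOBID and myjobname.eJOBID

# Outside the batch system these variables are unset, hence the fallback values.
job_name="${JOB_NAME:-myjobname}"
job_id="${JOB_ID:-JOBID}"
echo "Output for this job goes to ${job_name}.o${job_id}"

/bin/date
```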

You have now successfully run a job on the CSF. It was a simple 1-core job (it used only one CPU core) to run some basic Linux commands. The output of the commands was captured in to the .o file. No errors were generated so the .e file was empty. By changing the Linux commands to something more useful (e.g., to run your favourite chemistry application) you can get lots of real work done on the CSF.

Summary

Points to remember

  • Do not simply run your apps on the login node. Write a jobscript and submit it to the batch system. Your app will run on a more powerful node and won’t upset the login node (and the sysadmins!)
  • You can write your jobscript on the login node using gedit.
  • Alternatively, if you use Notepad on Windows, ensure you run dos2unix on the jobscript once you’ve transferred it to the CSF.
  • Keep your important files in your home area but copy them to the scratch area and run your jobs from there. Don’t forget to copy important results back to home.
  • Submit the job using qsub
  • Check on the job using qstat
  • Look in the .oNNNNN and .eNNNNN files generated by the job for output and errors.
  • If you have any questions please contact us at its-ri-team@manchester.ac.uk – we’re here to help.

More on Using the Batch System (parallel jobs, GPUs, high-mem)

The batch system has a great deal more functionality than described above – by adding more #$ special lines to your jobscript your jobs can make more use of the CSF capabilities. You may wish to try the Parallel Job Tutorial once you are familiar with running serial (1-core) jobs on the CSF.

Other features, such as GPU and high-memory jobs, are fully documented (with example jobscripts) in the CSF SGE documentation.
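As one small illustration of an extra #$ line, the sketch below asks the batch system to send an email when the job ends or is aborted, the feature mentioned near the start of this page. -m and -M are standard SGE qsub options, but how mail is handled on the CSF is an assumption here and the address is a placeholder, so please check the CSF SGE documentation before relying on it.

```shell
#!/bin/bash --login

#$ -cwd
# Send mail when the job ends (e) or is aborted (a).
#$ -m ea
# Placeholder address (an assumption): replace with your own.
#$ -M your.name@example.org

# The job itself: record the start time, then do the real work.
started=$(/bin/date)
echo "Started: $started"
/bin/date
```

Run outside the batch system, the #$ lines are plain comments and only the commands at the bottom execute.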

Application Software

Now that you’ve run a test job you might want to have a look to see whether the application software you intend to use is already installed on the CSF – a lot of apps are already installed!

Each centrally installed application has its own application webpage where you’ll find examples of how to submit a job for that specific piece of software and any other information relevant to running it in batch, such as extra settings that may be required for it to work.

Last modified on April 18, 2024 at 12:42 pm by George Leaver