Batch System 10 Minute Tutorial

Introduction

This page offers new CSF users a tutorial that covers usage of the batch system to run a simple job on the CSF.

The tutorial also provides some information about the storage areas on the CSF and some common Linux commands used to manage your files.

After completing the tutorial you’ll be able to run your own jobs on the CSF.

If you’re a CSF3 user moving to CSF4 this tutorial will allow you to practice writing a jobscript for the SLURM batch system and using the SLURM commands to submit and check on your job.

Before we begin the tutorial we’ll explain what the batch system is and why we need to use it.

Some basics: What is a batch system and why use it?

Join the queue…

Initially a batch system can be thought of as a job queue. You submit jobs to the queue and the system will pick them out of the queue to run them.

The jobs will do whatever commands you ask them to do (for example run an app such as a chemistry app, or a bioinformatics app or whatever application is appropriate to your work).

When the jobs finish you should have some results! Everything is saved to files.

No GUI

Something to note about batch jobs is that you never see an application’s graphical user interface (GUI), if it has one. Batch jobs run without any interaction – all options / flags / input files etc will be specified on the command-line in a jobscript (more on those later).

When the app is running, all output will be saved to files. This will be a new way of working if you are used to running an app in a desktop environment (e.g., on Windows).

At this point you might be thinking you don’t like the idea of your work (jobs) waiting in a queue. How long will it queue for? Why can’t it just run immediately? Read on to find out more.

Ask for extra memory or cores?

Submitting your work as a job to the batch system allows you to specify the system resources needed by your job – the applications you’ll be running on the CSF usually need different amounts of memory or number of CPU cores.

The batch system ensures your job only runs when all of the required resources are available. It then allocates those resources to your job (so that it runs correctly) and no other jobs can grab your resources.

But don’t worry if you’re not sure what resources you’ll need – there are sensible defaults which you can use to begin with. After trying the defaults, you might find your app needs more memory to process your data, or that it can use more CPU cores to make it run faster. No problem – you can edit the jobscript to request more resource (e.g., more memory) then submit the job again.

Note that you can’t do any damage to the CSF – if your first attempt at running your job isn’t correct (e.g., it runs out of memory), just correct the jobscript and resubmit the job. You can submit as many jobs as you like and it is common to run the same job several (or many) times with different input data.
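
As a sketch, a later version of a jobscript might request more resources with extra #SBATCH lines. The values below are purely illustrative, not recommendations – check the CSF documentation for the partitions and limits that apply:

```bash
#!/bin/bash --login

# Illustrative resource requests only - adjust to what your app actually needs:
#SBATCH -p multicore       # run in the multicore partition instead of serial
#SBATCH -n 4               # ask for 4 CPU cores
#SBATCH --mem=8G           # ask for 8GB of memory in total

# ... your application commands go here as before ...
```

If a job fails because these values were wrong, simply edit them and resubmit.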

Fair usage

The batch system also ensures fair usage for you and others – there are many users and jobs on the system, all making different demands of the resources (memory, CPU cores, …) and so allowing the batch system to choose exactly when to run your job is the only sensible way of running the system.

The fact that jobs are starting and finishing all the time means you rarely have to wait very long for your requested resource to become free so that your jobs can start.

There are other factors which control when jobs run (and how many of your jobs can run at the same time) but the use of a job queue should not put you off using the system!

Let the CSF get on with it

An added bonus of a batch system is that once you’ve submitted your jobs to the system, you don’t actually need to remain logged in. You can log off, go home or go to a meeting or do something else with your PC/laptop.

Meanwhile the batch system will run your jobs. It can even email you when a job has finished.
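
As a sketch, email notification is requested with two more #SBATCH lines in the jobscript (the address below is a placeholder – use your own):

```bash
#SBATCH --mail-type=END,FAIL                     # email when the job ends or fails
#SBATCH --mail-user=your.name@manchester.ac.uk   # placeholder - use your own address
```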

Without a batch system you would have to remain logged in to the CSF until the job had finished, which could be a problem for a simulation that takes several days to complete.

Can I just run my app on the Login Node?

Running your code or application directly on the login nodes is not permitted. The login nodes are for other tasks (transferring files on and off the system, editing jobscripts, submitting jobs to the system). They don’t have a lot of memory nor many cores so trying to run your apps there is inefficient and may also adversely affect other users.

Applications found running on the login nodes may be killed by the sysadmins without warning.

Please do take the time to learn about the batch system. While it may be an unfamiliar way of working initially, particularly if you are used to simply running your apps immediately on a desktop PC, there are actually a lot of benefits to using the batch system – you’ll see it is a very powerful way of working as you begin to do your real work. In this tutorial you can try out the sample job below – it shouldn’t take more than 10 minutes to work through the instructions on this page.

The tutorial assumes you’ve logged in to the CSF – please see the login instructions for more information.

10 Minute Tutorial: Submitting a First Job to the Batch System

Here we describe in detail how to submit a simple, first job to the batch system (we use a batch system called SLURM). Please read all of the text, don’t just look for the commands to type, as it explains why you need to run the commands.

What type of job will we run?

We will run a serial job – i.e., it uses only one CPU core. We’ll see later that many of the real applications on the CSF can use more than one CPU core (a multi-core job) to speed up their processing, giving you the results sooner!

You could also request more memory than the default 4GB of RAM.

But initially a simple 1-core (serial) job will help you become familiar with the principles of the batch system. These jobs are very common – you may well want to use this type of job in your real work after the tutorial.

Please remember: Do not simply run jobs on the login node – use the batch system as described below.

Step 0: Create a Folder for the Job Files

In the following steps we will be creating a jobscript file. We will explain more about the file in the next step. The job will also create some files (any output generated by the job is saved to files).

Hence we will create a directory (folder) for the job to keep all of the files together in one place. This is important – you will likely run a lot of jobs on the CSF, so keeping your files tidy will make things easier to manage.

When you log in to the CSF you are placed in your home directory. This area of storage is private to you and, importantly, is backed-up (not all storage areas on the CSF are backed-up). It is strongly recommended that you keep important files in your home directory for safe keeping – and this includes your jobscripts!

Once you’ve logged in you will be at the command-line prompt:

[mxyzabc1@login02 [CSF4] ~]$   you will type your commands here, "at the prompt"
   ^            ^   ^    ^
   |            |   |    | 
   |            |   |    +--- The directory (folder) you are currently in.
   |            |   |         ~ means your home folder which is your private folder.
   |            |   |
   |            |   +--- Name of the system
   |            |
   |            +--- Name of the login node (some systems have more than one login node)
   |
   +--- Your username appears here

Now create a directory (usually referred to as a folder in Windows or MacOS) in your CSF home storage area, for our first test job, by running the following commands at the prompt:

# All of these commands are run on the CSF login node at the prompt
mkdir ~/first-job-csf4            # Make (create) the directory (folder)
cd ~/first-job-csf4               # Change to (go in to) the directory (folder)

Notice that the prompt has changed to indicate you’ve moved in to the first-job-csf4 folder:

[mxyzabc1@login02 [CSF4] ~/first-job-csf4]$   
                            ^
                            |
                            +--- The prompt shows we are now in the first-job-csf4 folder

Step 1: Create a “Jobscript” – a job description file

The jobscript file is the thing you submit to the batch system (i.e., the queue of jobs). It is just a simple plain-text file. It serves two main purposes:

  1. It specifies the number of CPU cores, memory and other resources you need to run your application.
  2. It specifies the actual command(s) needed to run your application and anything else your job will do (e.g., copy files).
A key benefit of the jobscript is that it documents exactly what you did to run your job – no need to remember what you did 6 months ago as it is all there in the jobscript. If you ever need to run a job again, or run similar jobs, having the jobscript available is very useful!

Hence jobscripts should be considered part of your work that needs to be kept securely in your home directory. They are a record of how you ran a simulation or analysis, for example, or how you processed a particular dataset. Jobscripts are therefore part of your research methods.

We now use gedit, or another editor, on the CSF login node (running text editors on the login node is permitted) to create a file with exactly the following content (see below). You can name the file anything you like, as long as there are no spaces in the name – in this example we use first-job.txt but Linux doesn’t care what extension you use – .txt or .sbatch or .jobscript for example:

# Run this command on the CSF login node at the prompt
gedit first-job.txt
  #
  # Please IGNORE any warnings / messages that appear in the terminal from gedit.
  # For example: (gedit:5246): dconf-WARNING **: .........
  • Note for Windows users: You can create the jobscript below in Notepad and then transfer the file to the CSF, although we don’t actually recommend this method. The file can have any name (we’re using first-job.txt but anything will be OK – you’ll find that Notepad names files with .txt at the end anyway). However, you must run the following command on the login node to convert the file from Windows format to Linux format, otherwise the batch system will reject the job when you try to submit it.
    # Run this command on the CSF login node at the prompt if jobscript was written in notepad
    dos2unix first-job.txt
               #
               # or whatever filename you used (we assume notepad adds .txt)
    

    But we recommend that Windows users install MobaXterm to log in to the CSF. You can then run gedit on the CSF login node and you’ll get a Linux editor very similar to Notepad. The file you write will be saved directly on the CSF and will not need converting with dos2unix because it is already in the correct format.
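
If you are unsure whether a file is still in Windows format, the standard Linux file command will tell you. A minimal sketch (the filenames here are just examples created for the demonstration):

```shell
# Create a file with Windows (CRLF) line endings to demonstrate:
printf 'line one\r\nline two\r\n' > windows-style.txt

# 'file' reports "CRLF line terminators" for Windows-format text files:
file windows-style.txt

# After converting (with dos2unix, or tr as a fallback), the warning disappears:
tr -d '\r' < windows-style.txt > linux-style.txt
file linux-style.txt
```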

Here’s the jobscript content – put this in the text file you are creating either in gedit (run on the CSF login node) or notepad (run on your Windows PC):

#!/bin/bash --login

# SLURM options (whose lines must begin with #SBATCH)

# OPTIONAL LINE: default partition is serial
#SBATCH -p serial   # (or --partition=serial)

# OPTIONAL LINE: default is 1 core in serial
#SBATCH -n 1        # (or --ntasks=1) use 1 core

# Now the example commands to be executed (programs to be run) on a compute node:

/bin/date
/bin/hostname
/bin/sleep 120
/bin/date

Note: lines must NOT be indented in your text file – there should NOT be any spaces at the start of the lines. Cut-and-paste from this web page should work correctly in most browsers (it won’t copy any leading spaces).

This BASH script has three parts:

  1. The first line, #!/bin/bash --login, means that the file you create is treated as a BASH script. Linux provides several scripting languages but BASH is the one you use at the command-line once you’ve logged in, and also in jobscripts. This means that any commands you would normally type on the login node can also be used in your jobscript to be run as part of a batch job.
  2. The lines beginning with #SBATCH provide information about your job to the batch system (SLURM) – you use them to request resources (number of cores, memory, etc.). They must appear before the normal commands that your job will actually run.
    • In this simple jobscript the line #SBATCH -p serial indicates the job is a serial job. This is actually optional – without this line it will be assumed the job is a serial job and only one CPU core will be allocated to the job. Several other partitions are available (serial, multicore, multinode) – these are just areas in the CSF dedicated to running certain types of jobs.
    • The line #SBATCH -n 1 indicates that only one CPU core should be allocated to the job. Again, this is actually optional – jobs running in the serial partition always use exactly one CPU core.
  3. The remaining lines comprise our computational job – the applications we actually want to run. In this example we have a trivial job which runs simple Linux commands to output the date and time, followed by the name of the compute node on which the job runs, then waits for two minutes and finally outputs the date and time again. In a real jobscript you would do something more interesting and useful – e.g., run MATLAB or Abaqus or a chemistry program.
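
To make part 3 concrete, here is a sketch of what a jobscript for a real application might look like. The module name, application and input file below are entirely hypothetical – each centrally installed application’s own CSF webpage gives the correct lines to use:

```bash
#!/bin/bash --login
#SBATCH -p serial     # serial partition, 1 core

# Hypothetical example only - see the application's CSF webpage for the real commands:
module load apps/some-chemistry-app/1.2.3    # made-up module name
some-chemistry-app --input my-data.inp       # made-up application and input file
```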

We’ve now written our first jobscript and it is in our private, backed-up home directory. The next section will show how to copy the jobscript to the temporary scratch storage so that we can then submit the job from there.

Step 2: Copy to scratch area

We now copy the jobscript to your scratch area.

We recommend you run jobs from the scratch filesystem: it is another area of storage on the CSF that is faster and larger. Your home directory is in an area that has a quota to be shared amongst everyone in your group – if your job fills up that area you will prevent your colleagues from working! Running jobs in the scratch area avoids this problem.

PLEASE NOTE: the scratch area is a temporary area – files unused in the last 3 months can be deleted by the system to free up space. You should always keep a copy of important files in your home area (or other research data storage visible on the CSF that your research group may have access to). Think of scratch as fast, temporary storage – if your job reads and writes large files it will be faster if run from scratch.

A good way of working is to create your important files in the home area, then copy them to scratch when you need to use them in your jobs. That way you always have a safe copy in your home area.

So let’s copy our jobscript to the scratch area (we keep the original in our home area for safe keeping):

cp first-job.txt ~/scratch

We can now go in to the scratch area:

cd ~/scratch

Our scratch directory is now our current working directory. When we submit the job to the batch queue (see next step) it will run in the scratch area – a job always runs from whichever directory you are in when you submit the job. Any files that the job generates will also be written to there (scratch area in this example) and if your job wants to read input data files (ours doesn’t in this example) then it would try to read them from that directory.

You will notice the prompt on the command-line will change to indicate where you are currently located:

[mxyzabc1@login02 [CSF4] ~/scratch]$
                             #
                             # The prompt shows your current directory

Step 3: Submit the Job to the Batch System

Recap: So far we have created a directory for the jobscript in our home area, written a jobscript text file there (where it is stored safely on backed-up storage), then copied it to the fast temporary scratch storage and changed directory to our scratch area where we’ll run the job from.

The next step is to actually submit the job to the batch system. Suppose the above script is saved in a file called first-job.txt. Then the following command will submit your job to the batch system:

sbatch first-job.txt

You’ll see a message printed similar to:

Submitted batch job 226650

The job id 226650 is a unique number identifying your job (obviously you will receive a different number). You may use this in other commands later.
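
For example, the job id can be used to query or cancel that specific job with standard SLURM commands (substitute your own job id for the one shown):

```bash
squeue -j 226650     # show the status of just this job
scancel 226650       # cancel (delete) the job if you no longer want it to run
```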

Step 4: Check Job Status

To confirm that your job is queued, or perhaps already running, enter the command

squeue

If the job is still pending (waiting to run) the output from squeue will look like the following – notice the ST column:

                                                                                               NODELIST
 JOBID PRIORITY PARTITION NAME     USER     ACCOUNT ST SUBMIT_TIME  START_TIME TIME NODES CPUS (REASON)
226651 0.019104 serial    first-jo mxyzabc1 group01 PD 2/08/21 9:51 N/A        0:00     1    1 (None)

If your job is already running, the output will look like the following – notice the ST and NODELIST columns:

                                                                                               NODELIST
 JOBID PRIORITY PARTITION NAME     USER     ACCOUNT ST SUBMIT_TIME  START_TIME TIME NODES CPUS (REASON)
226652 0.019104 serial    first-jo mxyzabc1 group01 R  2/08/21 9:55 ... 9:55   0:05     1    1 node003

If your jobs have finished, squeue will show no output – meaning you have no jobs in the queue, either running or waiting.

[mxyzabc1@login02 [CSF4] scratch]$ squeue
JOBID PRIORITY PARTITION NAME     USER  ACCOUNT ST SUBMIT_TIME  START_TIME  TIME  NODES  CPUS NODELIST
  #
  # No jobs listed means you have no jobs waiting or running (all jobs have finished)

If something is wrong with your jobscript you’ll see F or some other code. There might also be a REASON to help diagnose the problem. Please contact us at its-ri-team@manchester.ac.uk stating your job-ID and the system you are logged in to and we’ll let you know what has gone wrong.
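
If SLURM accounting is enabled on the system, the sacct command can also show what happened to a job that has already left the queue – a sketch (substitute your own job id; the format fields shown are standard SLURM ones):

```bash
sacct -j 226650 --format=JobID,JobName,State,ExitCode,Elapsed
```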

Step 5: Review Job Results/Output

Each job will output at least one file, containing any output that would normally have been printed to the screen. This can include normal information from your app and also error messages, if any occurred.

Let’s list the files in the current directory using the Linux ls command:

ls
first-job.txt  slurm-226652.out

We can see our original jobscript first-job.txt and a new file slurm-226652.out that has been generated by the job (remember, the job ID number 226652 will be different for your job!)

To look at the contents of the output file:

cat slurm-226652.out

In this example the output file contains:

Mon Aug  2 09:55:49 BST 2021
node003
Mon Aug  2 09:57:49 BST 2021

This shows the date twice, with a difference of 120 seconds (2 minutes), and the name of the compute node on which the job ran, as expected (refer back to the commands we ran in our first jobscript).

Note that the name of the output file is always, by default, slurm-JOBID.out. It can be easier to keep track of which job produced which file if you make the output file name similar to that of your jobscript. You can change the name of the output file by adding the following line to your jobscript:

#SBATCH -o %x.o%j      # %x will be replaced by the jobscript name
                       # %j will be replaced by the JOBID number

This would generate an output file named first-job.txt.o226652 (which will be familiar to CSF3 users).

You’ve now successfully run a job on the CSF. It was a simple 1-core job (it used only one CPU core) to run some basic Linux commands. The output of the commands was captured in to the slurm-226652.out file. By changing the Linux commands to something more useful (e.g., to run your favourite chemistry application) you can get lots of real work done on the CSF.

Step 6: Copy Results back to “home”

Earlier we said that the scratch storage area is temporary (but fast). Hence if we want to keep the results from this job then we should copy them back to the home storage area. Let’s assume we DO want to keep the output from this job. Apart from the usual slurm-NNNNN.out file, it didn’t generate any other files. So we’ll just copy that file back to home:

# Copy from the current scratch dir to the job's directory in home
cp slurm-226652.out ~/first-job-csf4/

That’s it, the output file is now stored in our backed-up home area. We could delete the file from scratch, although sometimes you may wish to leave your files there while you check their contents and possibly use them in future jobs. Remember though, the scratch filesystem will tidy up old files automatically, so at some point they will be deleted.

When you run a real app (e.g., a chemistry app or OpenFOAM) then your jobs may well generate other files (lots of them, possibly large files.) You’ll need to consider more carefully which files you want to keep.
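
A minimal sketch of that kind of selective copy, using shell wildcards. The directory names below are stand-ins created just for the demonstration – on the CSF you would use your real scratch and home paths (e.g., ~/scratch and ~/first-job-csf4):

```shell
# Stand-in directories and files so the sketch is self-contained:
mkdir -p demo-scratch demo-home
touch demo-scratch/slurm-226652.out demo-scratch/results.dat demo-scratch/huge-temp.tmp

# Copy only the files worth keeping (the .out log and the results),
# leaving the temporary working files behind in scratch:
cp demo-scratch/slurm-*.out demo-scratch/*.dat demo-home/

ls demo-home    # shows the two copied files
```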

Summary

Points to remember

  • Do not simply run your apps on the login node. Write a jobscript and submit it to the batch system. Your app will run on a more powerful node and won’t upset the login node (and the sysadmins!)
  • You can write your jobscript on the login node using gedit.
  • Alternatively if you use notepad on windows ensure you run dos2unix on the jobscript once you’ve transferred it to the CSF.
  • Keep your important files in your home area but copy them to the scratch area and run your jobs from there. Don’t forget to copy important results back to home.
  • Submit the job using sbatch
  • Check on the job using squeue
  • Look in the slurm-NNNNN.out file generated by the job for output and errors.
  • If you have any questions please contact us at its-ri-team@manchester.ac.uk – we’re here to help.

More on Using the Batch System (multi-core and multi-node parallel jobs)

The batch system has a great deal more functionality than described above – by adding more #SBATCH special lines to your jobscript, your jobs can make more use of the CSF capabilities, such as multi-core and multi-node parallel jobs.

These features are fully documented (with example jobscripts) in the CSF SLURM documentation.

Finally, each centrally installed application has its own application webpage where you will find examples of how to submit a job for that specific piece of software and any other information relevant to running it in batch such as extra settings that may be required for it to work.

Last modified on November 14, 2023 at 4:18 pm by George Leaver