SGE to Slurm

Audience

The instructions on this page are for users who have been asked to test applications in the Slurm batch system on the upgraded CSF3, thereby helping us to ensure the upgraded system is running as expected.

PLEASE DO NOT SUBMIT A REQUEST ASKING FOR ACCESS TO THE NEW SYSTEM – WE WILL CONTACT YOU AT THE APPROPRIATE TIME!!!
(answering these requests will slow down our upgrade work!)

Introduction

The use of Slurm on CSF3 represents a significant change for CSF users who are used to using the SGE batch system.

While SGE has served us well, Slurm has been widely adopted by many other HPC sites, is under active development and has features and flexibility that we need as we introduce new platforms for the research community at the University.

This page shows the Slurm commands and jobscript options next to their SGE counterparts to help you move from SGE to Slurm.

How can I tell which system I’m using?

As part of the CSF3 upgrade, new login nodes have been installed. If you are on a new login node, you will have access to the Slurm batch system. If you’re on an old login node, you’ll have access to SGE.

To check which system you are on, look at the prompt:

[mabcxyz1@login1 [csf3] ~]$   # Old CSF3 (running SGE) uses red.
[mabcxyz1@login1[csf3] ~]$    # Upgraded CSF3 (running Slurm) uses green.                   

# You can also run:
which qsub       # Tells you where it is on old CSF3, or "no qsub in..." on upgraded CSF3.
which sbatch     # Tells you where it is on upgraded CSF3, or "no sbatch in..." on old CSF3.

What are the key differences?

  1. Slurm uses sbatch, squeue, scancel and srun commands, not qsub, qstat, qdel and qrsh. See below.
  2. The jobscript sentinel is #SBATCH, not #$. See below.
  3. You must explicitly specify the serial partition to run a 1-core job. See below.
  4. You MUST specify a wallclock time limit for your job! There is NO default value applied to a job. The maximum permitted is 7 days in most cases (GPU jobs have a maximum of 4 days). Your job will be rejected if you don’t give it a wallclock time. See below.
  5. The jobscript flags (-pe, -l v100=1 and so on) have changed.
  6. The jobscript variables ($JOB_ID, $NSLOTS and so on) have changed. See below.

Please see below for more details on how to use Slurm, compared to SGE.

Jobscript Special Lines – SGE (#$) vs Slurm (#SBATCH)

Your SGE jobscripts will not run in the Slurm batch system. This is because the SGE jobscript special lines beginning with #$ will be ignored by Slurm. Instead, you should use lines beginning with #SBATCH, and will need to change the options you use on those lines.

Note that it is #SBATCH (short for Slurm BATCH) and NOT #$BATCH. This is an easy mistake to make when you begin to modify your SGE jobscripts. Do not use a $ (dollar) symbol in the Slurm special lines.
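
For example, here is a 4-core request in each system (a minimal sketch, assuming the smp.pe parallel environment name used on the old CSF3; full jobscript examples are given below):

# SGE special lines (old CSF3)
#$ -pe smp.pe 4

# Slurm special lines (upgraded CSF3)
#SBATCH -p multicore
#SBATCH -n 4
# (A complete Slurm jobscript also needs a wallclock limit - see below.)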

Can I use a single jobscript for both SGE and Slurm?

It is possible to have both SGE and Slurm lines in your jobscripts – they will each ignore the other’s special lines.

However, there are some differences in the way multi-core and multi-node jobs are run, so we advise writing new jobscripts for use on the upgraded CSF3.

One suggestion is to name your Slurm jobscripts jobscript.sbatch and your SGE jobscripts jobscript.qsub, but you can, of course, use any naming scheme you like.
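
For illustration, a minimal dual-purpose serial jobscript might look like the following sketch (someapp and its modulefile are placeholders; each scheduler ignores the other’s special lines):

#!/bin/bash --login
# SGE special lines (ignored by Slurm)
#$ -cwd                # Run from the submission directory

# Slurm special lines (ignored by SGE)
#SBATCH -p serial      # Partition is required in Slurm
#SBATCH -t 1-0         # Wallclock limit is required in Slurm (1 day here)

# Clean env, then load the application (placeholder modulefile)
module purge
module load apps/binapps/someapp/1.2.3

someapp.exe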

Examples of SGE jobscripts and their equivalent Slurm jobscript are given below.

The commands used to submit jobs and check on the queue have also changed. See below for the equivalent commands.

Command-line tools – SGE (qsub, …) vs Slurm (sbatch, …)

SGE Commands (old CSF3)

# Batch job submission
qsub jobscript
qsub jobscript arg1 arg2 ...
qsub options -b y executable arg1 ...

# Job queue status
qstat             # Show your jobs (if any)
qstat -u "*"      # Show all jobs
qstat -u username

# Cancel (delete) a job
qdel jobid
qdel jobname
qdel jobid -t taskid
qdel "*"             # Delete all my jobs

# Interactive job
qrsh -l short

# Completed job stats
qacct -j jobid

Slurm Commands (new CSF3)

# Batch job submission
sbatch jobscript
sbatch jobscript arg1 arg2 ...
sbatch options --wrap="executable arg1 ..."

# Job queue status
squeue      # An alias for "squeue --me"
\squeue     # Unaliased squeue shows all jobs
squeue -u username

# Cancel (delete) a job
scancel jobid
scancel -n jobname
scancel jobid_taskid
scancel -u $USER       # Delete all my jobs

# Interactive job (1 hour session)
srun -p serial -t 1:00:00 --pty bash

# Completed job stats
sacct -j jobid
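
For example, the --wrap option can be used to submit a single command without writing a jobscript (a sketch; ./myprog and data.dat are placeholders):

# Submit a 1-hour, 1-core job that runs one command
sbatch -p serial -t 1:00:00 --wrap="./myprog data.dat"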

Job Output Files (stdout and stderr)

SGE job output files, not merged by default (old CSF3)

# Individual (non-array) jobs
jobscriptname.oJOBID
jobscriptname.eJOBID

# Array jobs
jobscriptname.oJOBID.TASKID
jobscriptname.eJOBID.TASKID

Slurm job output files, merged by default (new CSF3)

# Individual (non-array) jobs
slurm-JOBID.out

# Array jobs (see later for more details)
slurm-ARRAYJOBID_TASKID.out

The Slurm files contain both the normal and error output that SGE splits into two separate files.

The naming and merging of the files can be changed using jobscript options (see below) but for now, in the basic jobscripts shown next, we’ll just accept these default names to keep the jobscripts short.

Jobscripts

You will need to rewrite your SGE jobscripts. You could name them somename.sbatch if you like, to make it obvious it is a Slurm jobscript.

Put #SBATCH lines in one block

Please note: all Slurm special lines beginning with #SBATCH must come before ordinary lines that run Linux commands or your application. Any #SBATCH lines appearing after the first non-#SBATCH line will be ignored. For example:

#!/bin/bash --login

# You may put comment lines before and after #SBATCH lines
#SBATCH -n 4
#SBATCH -c 2

# Now the first 'ordinary' line. So no more #SBATCH lines allowed after here
export MY_DATA=~/scratch/data
# We recommend that you purge your module environment before loading further modules in Slurm jobs
module purge
module load myapp/1.2.3

# Any SBATCH lines here will be ignored!
#SBATCH --job-name  new_job_name

./my_app dataset1.dat

Job Wallclock (time) Limits

On the upgraded CSF3 running Slurm, you are required to specify the maximum wallclock time limit for your job.

Previously, on the old CSF3 running SGE, this was optional and would default to the maximum of 7 days (or 4 days for GPU and HPC Pool jobs.)

Why the change? In short, more accurate wallclock times, specified by you, will help with scheduling your job.

In particular, if you know your job will complete within, say, 1 hour or 1 day, specifying that wallclock limit allows the scheduler to squeeze your job onto a compute node that is being held for a much larger, long-running job. Slurm will try to fill cores that would otherwise sit idle while a larger job waits for resources by running short jobs on them.

Of course, you can still specify a maximum wallclock of 7 days, even if your job won’t actually take that long to complete. But you might wait longer in the queue.

You don’t need to be super-accurate, and it is much better to give your job plenty of time to complete. If you don’t give the job enough time, Slurm will kill the job before it completes. So we recommend specifying a number of days.

To specify a wallclock time in a Slurm job:

#SBATCH -t timelimit
## OR
#SBATCH --time=timelimit

where the following are examples of acceptable formats for the timelimit:

#SBATCH -t 10            # minutes
#SBATCH -t 10:30         # minutes:seconds
#SBATCH -t 4:20:00       # hours:minutes:seconds
#SBATCH -t 4-0           # days-hours              # Recommended format
#SBATCH -t 4-10:00       # days-hours:minutes
#SBATCH -t 5-8:15:30     # days-hours:minutes:seconds

If you fail to specify a wallclock time limit, you will get the following error when submitting a job:

sbatch: error: Batch job submission failed: Requested time limit is invalid (missing or exceeds some limit)

Add the -t flag and resubmit the job.
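
Note that sbatch command-line options override the matching #SBATCH lines in the jobscript, so the time limit can also be supplied (or corrected) at submission time, for example:

sbatch -t 2-0 jobscript      # Give the job a 2-day wallclock limit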

Basic Serial (1-core) Jobscript

A basic serial job requires the serial partition to be specified, and a wallclock time limit.

#!/bin/bash --login
#SBATCH -p serial     # Partition name is required (serial will default to 1 core)
#SBATCH -t 4-0        # Job "wallclock" limit is required. Max permitted is 7 days (7-0).
                      # In this example 4-0 is 4 days (0 hours).
                      # Other formats:  min:sec, hrs:min:sec, day-hours (to name a few)
# Clean env
module purge
module load apps/binapps/someapp/1.2.3

# Use $SLURM_NTASKS if you need to say how many cores are being used (1 in this case)
someapp.exe

Basic Multi-core (single AMD 168-core compute node) Parallel Jobscript

The following will run on an AMD 168-core Genoa node, using 96 cores, for 4.5 days.

#!/bin/bash --login
#SBATCH -p multicore   # Partition name is required
#SBATCH -n 96          # (or --ntasks=) Number of cores (2--168 on AMD)
#SBATCH -t 4-12        # Job "wallclock" limit is required. Max permitted is 7 days (7-0).
                       # In this example 4-12 is 4 days 12 hours (4.5 days).

# Clean env
module purge
module load apps/binapps/someapp/1.2.3

# Use $SLURM_NTASKS if you need to say how many cores are being used (96 in this case)
mpirun -n $SLURM_NTASKS someMPIapp.exe

Basic Multi-node Parallel Jobscript

Details of the Slurm job submission config will appear when ready.

Basic Job Array Jobscript

The following will run a 1000-task job array, where each task uses 2 cores on an AMD node and each task can run for up to 2 days.

#!/bin/bash --login
#SBATCH -p multicore   # Partition name is required
#SBATCH -n 2           # (or --ntasks=) Number of cores (2--168 on AMD)
#SBATCH -t 2-0         # Job "wallclock" limit is required. Max permitted is 7 days (7-0).
                       # In this example each job array task gets 2 days (0 hours).
#SBATCH -a 1-1000      # 1000 tasks in the job array, numbered 1..1000.
                       # (NB: tasks can begin from 0 in Slurm, unlike SGE)

# Clean env
module purge
module load apps/gcc/someapp/1.2.3

# Use ${SLURM_ARRAY_TASK_ID} to get the task number (1,2,...,1000 in this example)
./myprog -in data.${SLURM_ARRAY_TASK_ID}.dat -out results.${SLURM_ARRAY_TASK_ID}.dat
               #
               # My input files are named: data.1.dat, data.2.dat, ..., data.1000.dat
               # 1000 tasks (copies of this job) will run.
               # Task 1 will read data.1.dat, task 2 will read data.2.dat, ... 

More Jobscript special lines – SGE vs Slurm

Here are some more example jobscript special lines for achieving common tasks in SGE and Slurm.

Renaming a job and the output .o and .e files

SGE Jobscript

#!/bin/bash --login
...
# Naming the job is optional.
# Default is name of jobscript
# DOES rename .o and .e output files.
#$ -N jobname

# Naming the output files is optional.
# Default is separate .o and .e files:
# jobname.oJOBID and jobname.eJOBID
# Use of '-N jobname' DOES affect those defaults
#$ -o myjob.out
#$ -e myjob.err

# To join .o and .e in to a single file
# similar to Slurm's default behaviour:
#$ -j y

Slurm Jobscript

#!/bin/bash --login
...
# Naming the job is optional.
# Default is name of jobscript
# Does NOT rename .out file.
#SBATCH -J jobname

# Naming the output files is optional.
# Default is a single file for .o and .e:
# slurm-JOBID.out
# Use of '-J jobname' does NOT affect the default
#SBATCH -o myjob.out
#SBATCH -e myjob.err

# Use wildcards to recreate the SGE names
#SBATCH -o %x.o%j      # %x = SLURM_JOB_NAME
#SBATCH -e %x.e%j      # %j = SLURM_JOB_ID

The $SLURM_JOB_NAME variable will tell you the name of your jobscript, unless the -J jobname option is used to rename your job, in which case the variable is set to the value of jobname.

If you wanted to use $SLURM_JOB_NAME to always give you the name of the jobscript from within your job, you would have to remove the -J flag. However, the following command run inside your jobscript will give you the name of the jobscript regardless of whether you use the -J flag or not:

scontrol show jobid $SLURM_JOB_ID | grep Command= | awk -F/ '{print $NF}'
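
For example, to record the jobscript name in your job’s output regardless of any -J renaming (a sketch based on the command above):

# Capture the jobscript filename (the last component of the Command= path)
JOBSCRIPT_NAME=$(scontrol show jobid $SLURM_JOB_ID | grep Command= | awk -F/ '{print $NF}')
echo "This job was submitted using jobscript: $JOBSCRIPT_NAME"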

Renaming an array job output .o and .e files

An array job uses slurm-ARRAYJOBID_TASKID.out as the default output file for each task in the array job. This can be renamed but you need to use the %A and %a wildcards (not %j).

SGE Jobscript

#!/bin/bash --login
...
# An array job (cannot start at 0)
#$ -t 1-1000

# Naming the job is optional.
# Default is name of jobscript
#$ -N jobname

# Naming the output files is optional.
# Default is separate .o and .e files:
# jobname.oJOBID and jobname.eJOBID
# Use of '-N jobname' DOES affect those defaults

# To join .o and .e in to a single file
# similar to Slurm's default behaviour:
#$ -j y

Slurm Jobscript

#!/bin/bash --login
...
# An array job (CAN start at 0)
#SBATCH -a 0-999     # (or --array=0-999)

# Naming the job is optional.
# Default is name of jobscript
#SBATCH -J jobname

# Naming the output files is optional.
# Default is a single file for .o and .e:
# slurm-ARRAYJOBID_TASKID.out
# Use of '-J jobname' does NOT affect the default

# Use wildcards to recreate the SGE names
#SBATCH -o %x.o%A.%a # %x=SLURM_JOB_NAME
#SBATCH -e %x.e%A.%a # %A=SLURM_ARRAY_JOB_ID
                     # %a=SLURM_ARRAY_TASK_ID

Emailing from a job

Slurm can email you when your job begins, ends or fails.

SGE Jobscript

#!/bin/bash --login
...
# Mail events: begin, end, abort
#$ -m bea
#$ -M e.addr@manchester.ac.uk

Slurm Jobscript

#!/bin/bash --login
...
# Mail events: NONE, BEGIN, END, FAIL, ALL
#SBATCH --mail-type=ALL
#SBATCH --mail-user=e.addr@manchester.ac.uk

Note that in Slurm, array jobs only send one email, not an email per job-array task as happens in SGE. If you want an email from every job-array task, add ARRAY_TASKS to the --mail-type flag:

#SBATCH --mail-type=ALL,ARRAY_TASKS
                            #
                            # DO NOT USE IF YOUR ARRAY JOB CONTAINS MORE THAN
                            # 20 TASKS!! THE UoM MAIL ROUTERS WILL BLOCK THE CSF!

But please be aware that you will receive A LOT of email if you run a large job array with this flag enabled.

Job Environment Variables

A number of environment variables are available for use in your jobscripts – these are sometimes useful when creating your own log files, for informing applications how many cores they are allowed to use (we’ve already seen $SLURM_NTASKS in the examples above), and for reading sequentially numbered data files in job arrays. A short example is given after the table below.

SGE Environment Variables

$NSLOTS             # Num cores reserved

$JOB_ID             # Unique jobid number
$JOB_NAME           # Name of job

# For array jobs
$JOB_ID             # Same for all tasks
                    # (e.g., 2173)

$SGE_TASK_ID        # Job array task number*
                    # (e.g., 1,2,3,...)
$SGE_TASK_FIRST     # First task id
$SGE_TASK_LAST      # Last task id
$SGE_TASK_STEPSIZE  # Taskid increment
                    # (default: 1)

# You will be unlikely to use these:
$PE_HOSTFILE        # Multinode host list
$NHOSTS             # Number of nodes in use
$SGE_O_WORKDIR      # Submit directory

Slurm Environment Variables

$SLURM_NTASKS        # Num cores from -n flag
$SLURM_CPUS_PER_TASK # Num cores from -c flag
$SLURM_JOB_ID        # Unique job id number
$SLURM_JOB_NAME      # Name of job

# For array jobs
$SLURM_JOB_ID        # NOT SAME FOR ALL TASKS 
                     # (eg: 2173,2174,2175)
$SLURM_ARRAY_JOB_ID  # SAME for all tasks
                     # (eg: 2173)
$SLURM_ARRAY_TASK_ID # Job array task number*
                     # (eg: 1,2,3,...)
$SLURM_ARRAY_TASK_MIN   # First task id
$SLURM_ARRAY_TASK_MAX   # Last task id
$SLURM_ARRAY_TASK_STEP  # Taskid Increment
                        # (default: 1)
$SLURM_ARRAY_TASK_COUNT # Number of tasks

# You will be unlikely to use these:
$SLURM_JOB_NODELIST  # Multinode host list
$SLURM_JOB_NUM_NODES # Number of nodes in use
$SLURM_SUBMIT_DIR    # Submit directory

Note (*): SGE job-array task IDs are NOT allowed to start from zero – they must start from 1 (or higher). Slurm job array task IDs can start from zero – it is up to you to specify the starting task id.
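
As mentioned above, a common use of these variables is writing your own log files and selecting per-task input files. A minimal sketch (myprog and the data-file naming are placeholders):

#!/bin/bash --login
#SBATCH -p multicore
#SBATCH -n 4
#SBATCH -t 1-0
#SBATCH -a 1-10

# Write a simple per-task log file using the job environment variables
LOGFILE=${SLURM_JOB_NAME}.${SLURM_ARRAY_JOB_ID}.${SLURM_ARRAY_TASK_ID}.log
echo "Task ${SLURM_ARRAY_TASK_ID} of ${SLURM_ARRAY_TASK_COUNT}, using ${SLURM_NTASKS} cores on $(hostname)" > $LOGFILE

# Placeholder application reading a per-task data file
./myprog -in data.${SLURM_ARRAY_TASK_ID}.dat >> $LOGFILE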

Many more environment variables are available for use in your jobscript. The Slurm sbatch manual (also available on the CSF login node by running man sbatch) documents Input and Output environment variables. The input variables can be set by you before submitting a job to set job options (although we recommend not doing this – it is better to put all options in your jobscript so that you have a permanent record of how you ran the job). The output variables can be used inside your jobscript to get information about the job (e.g., number of cores, job name and so on – we have documented several of these above.)
