Slurm Batch Commands (sbatch, squeue, scancel, sacct)
Batch Commands
Your applications should be run in the batch system. You’ll need a jobscript (a plain text file) describing your job – its CPU, memory and possibly GPU requirements, and also the commands you actually want the job to run.
Further details on how to write jobscripts are in the sections on serial jobs, parallel jobs, job-arrays and GPU jobs.
You’ll then use one or more of the following batch system commands to submit your job to the system and check on its status. These commands should be run from the CSF’s login nodes.
Job submission using sbatch
sbatch jobscript - Submit a job to the batch system, usually by submitting a jobscript. Alternatively you can specify job options on the sbatch command-line. We recommend using a jobscript because this allows you to easily reuse it every time you want to run the job; remembering the command-line options you used (possibly months ago) is much more difficult.

The sbatch command will return a unique job-ID number if it accepts the job. You can use this in other commands (see below) and, when requesting support about a job, you should include this number in the details you send in. For example, when submitting a job you will see a message similar to:
[mabcxyz1@login1[csf3] ~]$ sbatch myjobscript
Submitted batch job 373
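A minimal jobscript might look like the following sketch. The partition name serial and the application ./myprog.exe are placeholders – substitute the partition and commands appropriate to your own work (see the partitions page and the serial/parallel job pages for real values):

```shell
#!/bin/bash --login
#SBATCH -p serial     # Partition (queue) name - placeholder, see the CSF3 partitions page
#SBATCH -n 1          # Number of cores (tasks)
#SBATCH -t 1-0        # Wallclock limit: 1 day

# The commands the job will actually run
echo "Job ${SLURM_JOB_ID:-unknown} running on $(hostname)"
./myprog.exe          # Placeholder - replace with your own application
```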
sbatch flags
The sbatch command has an extensive list of command-line flags. In most cases you can also specify these in your jobscripts using one of:
#SBATCH -flag value      # short form flag (e.g., -p)
#SBATCH --flag=value     # long form flag (e.g., --partition)
We provide details of a few flags here:
-p, --partition=partition - Slurm partition (think of this as a queue) that contains a certain type of hardware. Your job will run in this partition. For example, a GPU partition, the multicore AMD job partition or the high memory partition. See the CSF3 partitions page.

-n, --ntasks=NUM - Number of tasks, which can usually be thought of as the number of cores to be used by your job. You should use this flag if running MPI parallel code, but it can also be used for other parallel jobs. See the CSF parallel jobs page.

-c, --cpus-per-task=NUM - Number of cores (which Slurm refers to as CPUs) per task (see -n above). Used to specify the number of cores that each MPI rank should use, so you should use this when running mixed-mode MPI+OpenMP applications. It can also be used when running OpenMP parallel applications. See the CSF parallel jobs page.

-G, --gpus=NUM - Number of GPUs. Note that the GPU type (A100, L40S, and so on) is determined by the GPU partition you submit to. See the CSF3 GPU jobs page.

-t, --time=wallclock - Wallclock time limit for your job. See the CSF3 time limits page.

--mail-type=flag - Specify when you want to receive an email about the job. The flag can be one or more (a comma-separated list) of NONE, BEGIN, END, FAIL, REQUEUE, ALL. Please consider how many jobs you are submitting and how many emails you might receive based on your choice of flags.

--mail-user=emailaddress - Specify the email address to receive emails sent from Slurm. This can be your Manchester address or an external address. To avoid needing to specify this flag on every job, you can place your email address in a file named .forward (yes, with a “dot” at the start of the name). For example, to create this file, run the following command on the login node:

echo my.name@manchester.ac.uk > ~/.forward
#
# Change my.name to YOUR own correct email address!
# This can be a comma-separated list of addresses.

Then you can submit jobs using only the --mail-type flag (see above.)

-d, --dependency=dependencylist - See the Job Dependencies section.

--parsable - For scripting purposes, you may prefer just to receive the jobid number from the sbatch command. Add the --parsable flag to achieve this:

sbatch --parsable myjobscript
12345
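Combining --parsable with --dependency makes it easy to chain jobs from a script. A minimal sketch (the jobscript names are placeholders; note that on federated clusters --parsable can print jobid;clustername, so anything after a semicolon is stripped):

```shell
# Submit the first job and capture just its job ID.
# --parsable prints "jobid" (or "jobid;cluster"), so keep the part before any ';'.
jid=$(sbatch --parsable myjobscript | cut -d';' -f1)

# Submit a second job that starts only if the first finishes successfully.
sbatch --dependency=afterok:"$jid" postprocess.sbatch   # placeholder jobscript name
```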
squeue - Report the current status of your jobs in the batch system (queued/waiting, running, in error, finished). Note that if you see no jobs listed when you run squeue, it means you have no jobs in the system – they have all finished or you haven’t submitted any!

(QOSMaxCpuPerUserLimit) - You have reached a global limit for the type of CPU resources you need. Free-at-the-point-of-use users may see this more frequently than members of contributing groups due to the max-cores-in-use limit applied to all f@pou users.

(AssocGrpGRES) - Your group has reached the maximum number of GPUs it can use.

(Resources) - This usually means all of the requested resources (e.g., CPUs, GPUs) are in use by other jobs – the CSF is simply very busy!

(Priority) - Similar to (Resources). This usually indicates there is high demand for a particular resource and it will take some time for it to be released, e.g. a 168-core job requires a whole AMD node to run.

(MaxCpuPerAccount) - The group you are a part of has reached a global limit.

(DependencyNeverSatisfied) - The job was waiting on an earlier job to achieve some state – e.g., finish successfully. But if that job failed, then the dependency could not be met, so the current job was not allowed to start. To check what happened to the earlier job, have a look at the current job’s dependency info:

squeue
  JOBID PRIORITY  PARTITION NAME        USER     ST SUBMIT_TIME    START_TIME TIME NODES CPUS NODELIST(REASON)
    372 0.0000005 multicore mymulticore mabcxyz1 PD 08/03/25 13:02 N/A        0:00     1    8 (DependencyNeverSatisfied)

scontrol show job 372 | grep Depend
  JobState=PENDING Reason=DependencyNeverSatisfied Dependency=afterok:361(failed)

The afterok:361(failed) tells you that the current job (372) was given a dependency on job 361 and that job 361 needed to finish successfully (afterok) before 372 would be allowed to run. But instead job 361 failed. So job 372’s dependency will never be satisfied and it will never run. You should remove this job from the queue (see scancel below).

(ReqNodeNotAvail, Reserved for maintenance) - The resources you have requested have been flagged for maintenance and as such are temporarily unavailable to your job. Your job will queue until the resource becomes available again. For significant/lengthy maintenance work we will always advise all users in advance by email.
gpusqueue / gpustat - To see a list of GPU jobs you have in the system, you can use the custom gpusqueue command. This runs squeue but only shows GPU jobs and adds in some extra columns to show the types of GPUs requested:

# Show GPU job information
gpusqueue
scancel jobid - Remove your job from the batch system early, either to terminate a running job before it finishes or simply to remove a queued job before it has started running.
sacct -j jobid - Once your job has finished you can use this command to get a summary of information including wall-clock time, max memory consumption and exit status, amongst many other statistics about the job. This is useful for diagnosing why a job failed.

A lot of information about a job is available using this command. To see a list of every possible field, run sacct -e. To have a “long” list of fields automatically displayed when querying a job, use sacct --long -j jobid. Run man sacct for further info about this command.

For info about the extern step that now appears in the sacct output, please see this FAQ answer.

man sbatch
man squeue
man scancel
man sacct
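If the default or --long output is more than you need, a custom field list can be requested with --format. A sketch (the job ID 12345 is a placeholder; the field names are standard sacct fields):

```shell
# Compact summary of a finished job - replace 12345 with your own job ID.
# Elapsed = wallclock used, MaxRSS = peak memory, ExitCode = how the job ended.
sacct -j 12345 --format=JobID,JobName,Partition,Elapsed,MaxRSS,State,ExitCode
```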
Error messages
When submitting a job, if you see the following errors, something is wrong:
sbatch: error: Batch job submission failed: No partition specified or system default partition
You must specify a partition, even for serial jobs. Add to your jobscript: #SBATCH -p partitionname.
sbatch: error: Batch job submission failed: Requested time limit is invalid (missing or exceeds some limit)
You must specify a “wallclock” time limit for your job. The maximum permitted is usually 7 days (or 4 days for GPU and HPC Pool jobs.) Add to your jobscript: #SBATCH -t timelimit.
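Slurm accepts several formats for the -t / --time value. The following jobscript lines are illustrative (use only one of them in a real jobscript):

```shell
#SBATCH -t 30           # 30 minutes (a bare number means minutes)
#SBATCH -t 02:30:00     # 2 hours 30 minutes (hh:mm:ss)
#SBATCH -t 4-00:00:00   # 4 days (d-hh:mm:ss)
#SBATCH -t 7-0          # 7 days (d-hh) - the usual maximum
```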
Job Status using squeue
Examples
In this example squeue returns no output which means you have no jobs in the queue, either running or waiting:
[mabcxyz1@login1[csf3] ~]$ squeue [mabcxyz1@login1[csf3] ~]$
In this example squeue shows we have two jobs running (one using 1 core, the other using 8 cores) and one job waiting (it will use 4 cores when it runs):
[mabcxyz1@login1[csf3] ~]$ squeue
JOBID PRIORITY PARTITION NAME USER ST SUBMIT_TIME START_TIME TIME NODES CPUS NODELIST(REASON)
372 0.0000005 multicore mymulticore mabcxyz1 R 08/03/25 13:02 08/03/25 13:32 2:04 1 8 node1260
371 0.0000005 serial simple.x mabcxyz1 R 09/03/25 14:58 09/03/25 15:02 8:22 1 1 node603
403 0.0000003 himem mypythonjob mabcxyz1 PD 11/03/25 09:25 N/A 0:00 1 4 (Resources)
The key columns are:

JOBID - every job is given a unique job ID number (please tell us this number if requesting support)
NAME - usually the name of your jobscript
ST - job state: R - job is running; PD - job is queued waiting; CG - Completing (contact us, may indicate an error)
START_TIME - if running: date & time the job started; if waiting: N/A
CPUS - number of CPU cores
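You can also narrow the squeue output down; for example (standard squeue flags, with job ID 372 as a placeholder):

```shell
# Show only your own pending (waiting) jobs
squeue --me --states=PENDING

# Show the status of one specific job
squeue -j 372
```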
Reasons for Pending Jobs
If your job is queued you might see one of the reason codes described above under the squeue command – e.g., (Resources) or (Priority) – in the NODELIST(REASON) column of the squeue output.
Changing the squeue output format
You can modify the list of fields (columns) output by the squeue command by setting the $SQUEUE_FORMAT or $SQUEUE_FORMAT2 environment variables. In fact, the default set of columns you see is given by the first variable – it has a default value when you login to the CSF. To see the value, run:
echo $SQUEUE_FORMAT
%.15i %9p %9P %15j %8u %2t %14V %14S %10M %.6D %.5C %R
For more information on the two SQUEUE_FORMAT env vars and the column codes you can use, run man squeue.
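The default format can be extended rather than replaced. For example, the following sketch appends a time-limit column (%l is squeue's format code for the job's time limit; the column width of 10 is illustrative):

```shell
# Add a time-limit column (%10l) to the default squeue output format.
# Run on the login node, or add to ~/.bashrc to make it permanent.
export SQUEUE_FORMAT='%.15i %9p %9P %15j %8u %2t %14V %14S %10M %10l %.6D %.5C %R'
```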
GPU job status using gpusqueue / gpustat
The Research Infrastructure team provide the following extra command:
To monitor running GPU jobs, please see the nvitop command.
Delete a Job using scancel
Also use this if your job goes into an error state or you decide you don’t want a job to run.
Note that if your job is in the CG state, please leave it in the queue if requesting support. It is easier for us to diagnose the error if we can see the job. We may ask you to scancel the job once we have looked at it – there is usually no way to fix an existing job.
For example, maybe you realise you’ve given a job the wrong input parameters causing it to produce junk results. You don’t need to leave it running to completion (which might be hours or days). Instead you can kill the job using scancel. You need to know the job-ID number of the job:
[mabcxyz1@login1[csf3] ~]$ scancel 12345
The job will eventually be deleted (it may take a minute or two for this to happen). Use squeue to check your list of jobs.
To delete all of your jobs:
# Delete all of your jobs. Use with CAUTION!
[mabcxyz1@login1[csf3] ~]$ scancel -u $USER
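scancel can also select jobs more precisely than "all of mine". A sketch using standard scancel flags (the partition name multicore is a placeholder):

```shell
# Delete only your PENDING (queued, not yet running) jobs
scancel --state=PENDING -u $USER

# Delete all of your jobs in one particular partition
scancel -p multicore -u $USER
```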
Please also see the Deleting Job Arrays notes.
Get Info About Finished Jobs
Further Information
Our own documentation throughout this site provides lots of examples of writing jobscripts and how to submit jobs. Slurm also comes with a set of comprehensive man pages. Some of the most useful ones are:
