Batch commands (qsub, qstat, qdel, qacct)

Batch Commands

Your applications should be run in the batch system. You’ll need a jobscript (a plain text file) describing your job – its CPU, memory and possibly GPU requirements, and also the commands you actually want the job to run.

Further details on how to write jobscripts are in the sections on serial jobs, parallel jobs, job-arrays and GPU jobs.

You’ll then use one or more of the following batch system commands to submit your job to the system and check on its status. These commands should be run from the CSF’s login nodes:

qsub jobscript
Submit a job to the batch system, usually by submitting a jobscript. Alternatively you can specify job options on the qsub command-line. We recommend using a jobscript because this allows you to easily reuse your jobscript every time you want to run the job. Remembering the command-line options you used (possibly months ago) is much more difficult.

The qsub command will return a unique job-ID number if it accepts the job. You can use this in other commands (see below) and, when requesting support about a job, you should include this number in the details you send in.

For example, when submitting a job you will see a message similar to:

[mabcxyz1@hlogin2 [csf3] ~]$ qsub myjobscript
Your job 12345 ("myjobscript") has been submitted

For scripting purposes, you may prefer just to receive the jobid number from the qsub command. Add the -terse flag to achieve this:

qsub -terse myjobscript
12345
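In a script, the job-ID printed by -terse can be captured in a shell variable. The following is a minimal sketch rather than a CSF-provided tool: the function name submit_job is hypothetical, and stripping everything after the first dot assumes that job-array submissions report an ID of the form 12345.1-10:1.

```shell
#!/bin/bash
# Hypothetical helper: submit a jobscript and print only the numeric job-ID.
submit_job() {
    local jobscript="$1"
    local out
    # -terse makes qsub print just the job-ID instead of the full message.
    out=$(qsub -terse "$jobscript") || return 1
    # Assumption: job-array submissions print e.g. "12345.1-10:1";
    # keep only the part before the first dot.
    local jobid="${out%%.*}"
    # Reject anything that is not a plain number.
    case "$jobid" in
        ''|*[!0-9]*) echo "Unexpected qsub output: $out" >&2; return 1 ;;
    esac
    echo "$jobid"
}

# Usage (on a login node):
#   jobid=$(submit_job myjobscript) && echo "Submitted job $jobid"
```

Keeping the job-ID in a variable means later commands in the same script (qstat, qdel, qacct) can refer to the job without you having to copy the number by hand.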

qstat
Report the current status of your jobs in the batch system (queued/waiting, running or in error). Finished jobs are not listed, so if you see no jobs when you run qstat it means you have no jobs in the system: they have all finished or you haven’t submitted any!

Some examples:

In this example qstat returns no output which means you have no jobs in the queue, either running or waiting:

[mabcxyz1@hlogin2 [csf3] ~]$ qstat
[mabcxyz1@hlogin2 [csf3] ~]$

In this example qstat shows we have two jobs running (one using 1 core, the other using 28 cores) and one job waiting (it will use 16 cores when it runs):

[mabcxyz1@hlogin2 [csf3] ~]$ qstat
job-ID prior  name    user     state submit/start at    queue                slots ja-task
------------------------------------------------------------------------------------------
10281 0.40000 chemsim mabcxyz1  r    10/26/2018 10:31:02 serial.q@node781.pri  1
10644 0.36474 fsolve  mabcxyz1  r    10/31/2018 14:08:21 parallel.q@node768.p  28
10690 0.35467 fsolve  mabcxyz1  qw   10/31/2018 14:12:19                       16
 #               #              #             ###                              #
 #               #              #              #                               # Number of
 #               #              #              #                               # CPU cores
 #               #              #              #
 #               #              #              # If running: date & time the job started
 #               #              #              # If waiting: date & time job was submitted
 #               #              #
 #               #              # r   - job is running
 #               #              # qw  - job is queued waiting
 #               #              # Eqw - the jobscript has reported an error 
 #               #
 #               # Usually the name of your jobscript
 #
 # Every job is given a unique job ID number
 # (please tell us this number if requesting support)
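
When you have many jobs, a per-state count can be easier to read than the full listing. The sketch below assumes the default column layout shown above (two header lines, with the state in the fifth column); the function name count_job_states is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: summarise qstat output (read from stdin) by job state.
count_job_states() {
    # Skip the two header lines; column 5 holds the state (r, qw, Eqw, ...).
    awk 'NR > 2 && NF >= 5 { n[$5]++ }
         END { for (s in n) print s, n[s] }' | sort
}

# Usage (on a login node):
#   qstat | count_job_states
```

For the three-job listing above this would report one job in state qw and two in state r.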

For more information about monitoring jobs, including how to monitor GPU jobs, please see the job monitoring page.

The following commands are used less frequently but can still be run if you need to:

qdel jobid
Remove your job from the batch system early, either to terminate a running job before it finishes or to remove a queued job before it has started running. Also use this if your job goes into an error state or you decide you don’t want a job to run.

If your job is in the Eqw state and you are requesting support, please leave it in the queue. It is easier for us to diagnose the error if we can see the job. We may ask you to qdel the job once we have looked at it, as there is usually no way to fix an existing job.

For example, maybe you realise you’ve given a job the wrong input parameters causing it to produce junk results. You don’t need to leave it running to completion (which might be hours or days). Instead you can kill the job using qdel. You need to know the job-ID number of the job:

[mabcxyz1@hlogin1 [csf3] ~]$ qdel 12345
mabcxyz1 has registered the job 12345 for deletion

The job will eventually be deleted (it may take a minute or two for this to happen). Use qstat to check your list of jobs.
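
If a script needs to know when the deletion has actually taken effect, it can poll qstat until the job-ID is no longer listed. This is only a sketch: the function name wait_until_gone, the 30-second default interval and the retry limit are illustrative choices, and it assumes the job-ID is the first column of the qstat listing, as in the example earlier on this page.

```shell
#!/bin/bash
# Hypothetical helper: return 0 once job $1 no longer appears in qstat,
# or 1 if it is still listed after $3 checks spaced $2 seconds apart.
wait_until_gone() {
    local jobid="$1" interval="${2:-30}" tries="${3:-10}"
    local i=0
    while [ "$i" -lt "$tries" ]; do
        # Skip the two header lines; succeed if column 1 matches the job-ID.
        if ! qstat | awk -v id="$jobid" 'NR > 2 && $1 == id { found = 1 }
                                         END { exit !found }'; then
            return 0    # job is no longer listed
        fi
        i=$((i + 1))
        sleep "$interval"
    done
    return 1            # still listed after all checks
}

# Usage (on a login node):
#   qdel 12345 && wait_until_gone 12345 && echo "Job 12345 has gone"
```

A generous interval keeps the number of qstat calls low, which is kinder to a shared login node than checking every second.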

qacct -j jobid
Advanced users. Once your job has finished you can use this command to get a summary of the job, including wall-clock time, maximum memory consumption and exit status, amongst many other statistics. This is useful for diagnosing why a job failed.
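
qacct -j prints many lines, one field per line. To pick out just the statistics mentioned above, the output can be filtered. The sketch below assumes the usual SGE field names ru_wallclock, maxvmem and exit_status (check them against man qacct on your system), and the function name job_summary is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read `qacct -j` output on stdin and print only the
# wall-clock time, peak memory and exit status lines.
job_summary() {
    # Assumption: each line is "<field name> <value>", as in standard SGE.
    awk '$1 == "ru_wallclock" || $1 == "maxvmem" || $1 == "exit_status" {
             print $1 ": " $2
         }'
}

# Usage (on a login node, once the job has finished):
#   qacct -j 12345 | job_summary
```

A non-zero exit_status is usually the first thing to look for when working out why a job failed.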

qalter options
Advanced users, not a recommended command. It may be possible to modify a job that is waiting in the queue (e.g., if you forgot to request a high-memory node you could add that option without deleting the job and resubmitting it). However, we recommend that, if you think your job is incorrectly described by the jobscript, you should delete the job from the queue, fix the jobscript and resubmit it. It is much better to have an accurate jobscript that documents how you ran a job and allows you to rerun the job in future.

Further Information

Our own documentation throughout this site provides lots of examples of writing jobscripts and how to submit jobs. SGE also comes with a set of comprehensive man pages. Some of the most useful ones are:

  • man qsub
  • man qstat
  • man qdel
  • man qacct

Last modified on March 25, 2024 at 4:09 pm by George Leaver