Batch commands (qsub, qstat, qdel, qacct)
Your applications should be run in the batch system. You’ll need a jobscript (a plain text file) describing your job – its CPU, memory and possibly GPU requirements, and also the commands you actually want the job to run.
Further details on how to write jobscripts are in the sections on serial jobs, parallel jobs, job-arrays and GPU jobs.
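As a rough illustration of the shape of a jobscript, here is a minimal sketch of a serial job. It assumes the bash shell and uses only the standard SGE `#$ -cwd` directive. Resource requests (cores, memory, GPUs) are left out because they are covered in the sections listed above. The message it prints is purely illustrative.

```shell
#!/bin/bash --login
#$ -cwd             # run the job from the directory it was submitted from

# The commands you want the job to run go below the #$ lines.
# JOB_ID is set by the batch system; the fallback text is only used when
# this file is run outside the batch system (for example, to try it out).
msg="Job ${JOB_ID:-not-in-batch} started on $(uname -n)"
echo "$msg"
```

Saved as a plain text file (for example `myjobscript`), this could then be submitted with the commands described below.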
You’ll then use one or more of the following batch system commands to submit your job to the system and check on its status. These commands should be run from the CSF’s login nodes:
qsub jobscript
- Submit a job to the batch system, usually by submitting a jobscript. Alternatively you can specify job options on the qsub command-line. We recommend using a jobscript because this allows you to easily reuse your jobscript every time you want to run the job. Remembering the command-line options you used (possibly months ago) is much more difficult.
The qsub command will return a unique job-ID number if it accepts the job. You can use this in other commands (see below) and, when requesting support about a job, you should include this number in the details you send in.
For example, when submitting a job you will see a message similar to:
[mabcxyz1@hlogin2 [csf3] ~]$ qsub myjobscript
Your job 12345 ("myjobscript") has been submitted
For scripting purposes, you may prefer just to receive the jobid number from the qsub command. Add the -terse flag to achieve this:
qsub -terse myjobscript
12345
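In a script, the number printed by `qsub -terse` can be captured in a shell variable. The following is a minimal sketch rather than a site-provided tool. The helper name `jobid_from_terse` is hypothetical. It assumes that for job-arrays the terse output has the form `12345.1-10:1` (job-ID followed by the task range), in which case only the part before the first dot is kept.

```shell
#!/bin/bash
# Hypothetical helper (not part of SGE): reduce `qsub -terse` output to the
# bare job-ID. Assumes job-array submissions print "jobid.first-last:step".
jobid_from_terse() {
    local out="$1"
    printf '%s\n' "${out%%.*}"   # drop everything from the first "." onwards
}

# On a login node you might use it like this (needs the batch system):
#   jobid=$(jobid_from_terse "$(qsub -terse myjobscript)")
#   echo "Submitted job $jobid"

jobid_from_terse "12345"          # prints 12345
jobid_from_terse "12345.1-10:1"   # prints 12345
```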
qstat
- Report the current status of your jobs in the batch system (queued/waiting, running, in error, finished). Note that if you see no jobs listed when you run qstat it means you have no jobs in the system – they have all finished or you haven’t submitted any!
Some examples:
In this example qstat returns no output which means you have no jobs in the queue, either running or waiting:
[mabcxyz1@hlogin2 [csf3] ~]$ qstat
[mabcxyz1@hlogin2 [csf3] ~]$
In this example qstat shows we have two jobs running (one using 1 core, the other using 28 cores) and one job waiting (it will use 16 cores when it runs):
[mabcxyz1@hlogin2 [csf3] ~]$ qstat
job-ID  prior   name     user     state submit/start at     queue                slots ja-task
----------------------------------------------------------------------------------------------
 10281  0.40000 chemsim  mabcxyz1 r     10/26/2018 10:31:02 serial.q@node781.pri     1
 10644  0.36474 fsolve   mabcxyz1 r     10/31/2018 14:08:21 parallel.q@node768.p    28
 10690  0.35467 fsolve   mabcxyz1 qw    10/31/2018 14:12:19                         16
The columns of most interest are:
job-ID: every job is given a unique job ID number (please tell us this number if requesting support).
name: usually the name of your jobscript.
state: r means the job is running, qw means the job is queued waiting, Eqw means the jobscript has reported an error.
submit/start at: if running, the date and time the job started; if waiting, the date and time the job was submitted.
slots: the number of CPU cores.
For more information about monitoring jobs, including how to monitor GPU jobs, please see the job monitoring page.
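If you have many jobs in the system, it can help to summarise the qstat listing by state. The sketch below is one possible way to do this with awk. The helper name `count_job_states` is hypothetical, and it assumes the default qstat layout shown in the example above (two header lines, with the state in the fifth column).

```shell
#!/bin/bash
# Hypothetical helper: count jobs per state from `qstat` output on stdin.
# Assumes the default layout: two header lines, state in column 5.
count_job_states() {
    awk 'NR > 2 && NF >= 5 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# On a login node:  qstat | count_job_states
# Here the sample listing from this page is fed in so the sketch runs anywhere:
count_job_states <<'EOF'
job-ID  prior   name     user     state submit/start at     queue                slots ja-task
----------------------------------------------------------------------------------------------
 10281  0.40000 chemsim  mabcxyz1 r     10/26/2018 10:31:02 serial.q@node781.pri     1
 10644  0.36474 fsolve   mabcxyz1 r     10/31/2018 14:08:21 parallel.q@node768.p    28
 10690  0.35467 fsolve   mabcxyz1 qw    10/31/2018 14:12:19                         16
EOF
# prints:
# qw 1
# r 2
```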
The following commands are used less frequently but can still be run if you need to:
qdel jobid
- Remove your job from the batch system early, either to terminate a running job before it finishes or to remove a queued job before it has started running. Also use this if your job goes into an error state or you decide you don’t want a job to run.
Note that if your job is in the Eqw state, please leave it in the queue if requesting support. It is easier for us to diagnose the error if we can see the job. We may ask you to qdel the job once we have looked at it – there is usually no way to fix an existing job.
For example, maybe you realise you’ve given a job the wrong input parameters causing it to produce junk results. You don’t need to leave it running to completion (which might be hours or days). Instead you can kill the job using qdel. You need to know the job-ID number of the job:
[mabcxyz1@hlogin1 [csf3] ~]$ qdel 12345
mabcxyz1 has registered the job 12345 for deletion
The job will eventually be deleted (it may take a minute or two for this to happen). Use qstat to check your list of jobs.
qacct -j jobid
- Advanced users. Once your job has finished you can use this command to get a summary of information for wall-clock time, max memory consumption and exit status amongst many other statistics about the job. This is useful for diagnosing why a job failed.
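Because qacct prints many statistics, you may want to pick out just the few mentioned above. The following is a minimal sketch, assuming the output lists one field per line as a name followed by its value, and that the fields are named `ru_wallclock`, `maxvmem` and `exit_status` (check `man qacct` on the system you are using). The helper name `qacct_summary` and the sample values are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: keep only a few fields from `qacct -j` output on stdin.
# Assumes each field is printed as "field_name   value", one field per line.
qacct_summary() {
    awk '$1 == "ru_wallclock" || $1 == "maxvmem" || $1 == "exit_status" { print $1 "=" $2 }'
}

# On a login node:  qacct -j 12345 | qacct_summary
# Made-up sample values, only to show the shape of the result:
qacct_summary <<'EOF'
jobname      myjobscript
exit_status  0
ru_wallclock 3600
maxvmem      1.200G
EOF
# prints:
# exit_status=0
# ru_wallclock=3600
# maxvmem=1.200G
```

A non-zero exit_status usually indicates that a command in the jobscript failed.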
qalter options
- Advanced users, not a recommended command. It may be possible to modify a job that is waiting in the queue (e.g., if you forgot to request a high-memory node you could add that option without deleting the job and resubmitting it). However, we recommend that, if you think your job is incorrectly described by the jobscript, you should delete the job from the queue, fix the jobscript and resubmit it. It is much better to have an accurate jobscript that documents how you ran a job and allows you to rerun the job in future.
Further Information
Our own documentation throughout this site provides lots of examples of writing jobscripts and how to submit jobs. SGE also comes with a set of comprehensive man pages. Some of the most useful ones are:
man qsub
man qstat
man qdel
man qacct