The CSF2 has been replaced by the CSF3 - please use that system! This documentation may be out of date. Please read the CSF3 documentation instead.
The Batch System (SGE) – Running Jobs
All jobs must be run in the batch system (SGE). This allows you to specify the resources (cores, memory, GPUs) you need for your job and ensures your jobs only run when those resources become available. It also ensures fair usage of the system – there are many users submitting jobs, and those jobs make different demands of the system. The batch system will schedule your jobs according to the resources requested and the size of your group’s contribution to the system.
Applications should not be run on the login node. This is a small, lightweight node (not many cores, small memory) used to access the system, edit files and submit jobs. Many users will be connected to the login node. If you run an application there you may prevent all of those users from doing their work. The sysadmins will kill, without notice, any applications running on the login node.
Please take the time to learn how to submit jobs to the batch system.
Batch Tutorial
If you are unfamiliar with running jobs in a batch system, please see our 10-minute tutorial on running jobs on the CSF.
Batch-system Commands to Manage Jobs
The two most common SGE commands you will use (on the login node) are:
qsub jobscript
to submit a job to the batch system. The jobscript is a simple text file describing your job, which will either be a serial job, a parallel job or a job array. The software pages also give example jobscripts for each application.
qstat
to check on the status of your jobs.
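As a rough illustration of the jobscript mentioned above, a minimal serial jobscript might look like the sketch below. The `#$` lines are standard SGE directives, but the job name `example_job` is hypothetical and the script is not taken from the CSF's own examples, so check the software pages for the settings recommended for your application.

```shell
#!/bin/bash
#$ -S /bin/bash      # run the jobscript with bash
#$ -cwd              # start the job in the directory it was submitted from
#$ -N example_job    # hypothetical job name, shown by qstat

# SGE sets NSLOTS to the number of cores given to the job.
# Fall back to 1 so the script also runs outside the batch system.
cores="${NSLOTS:-1}"

echo "Job started on $(uname -n) using ${cores} core(s)"
```

Saved in a file called jobscript, this would be submitted with qsub jobscript and monitored with qstat, as described above.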
Note: when you try to submit a job you may receive the following error:
bash: qsub: command not found
To fix this, load the batch system modulefile on the login node using:
module load services/gridscheduler
You will then be able to submit jobs, monitor your jobs and so on.
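If you would rather not hit this error at all, one possible approach is to check for qsub before using it and load the modulefile only when it is missing. The following is a minimal sketch, not an official recommendation; the helper name `have_cmd` is hypothetical, and it assumes the `module` command is available in the shell where it runs.

```shell
#!/bin/bash
# have_cmd NAME: succeed if NAME is a command, function or builtin in this shell.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if ! have_cmd qsub; then
    if have_cmd module; then
        # Modulefile name taken from the text above.
        module load services/gridscheduler
    else
        echo "qsub not found and no 'module' command is available" >&2
    fi
fi
```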
Current Policy and Configuration
The system is currently running fairshare: the scheduler attempts to ensure that each contributing research group receives a share of available computational resources which reflects the size of their contribution, integrated over approximately a month (the current half-life is 28 days).
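To give a feel for what a 28-day half-life means, the sketch below computes how much weight past usage would still carry after a given number of days, assuming simple exponential decay. The function name `usage_weight` is hypothetical, and the scheduler's real fairshare calculation involves more than this single factor.

```shell
#!/bin/bash
# usage_weight DAYS: fraction of past usage still counted after DAYS days,
# assuming exponential decay with the 28-day half-life quoted above.
usage_weight() {
    LC_ALL=C awk -v days="$1" -v half_life=28 \
        'BEGIN { printf "%.3f\n", 0.5 ^ (days / half_life) }'
}

for d in 0 14 28 56; do
    echo "after ${d} days: $(usage_weight "$d")"
done
```

Under this assumption, usage from 28 days ago counts for half as much as usage today, and usage from 56 days ago for a quarter.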
The configuration is subject to change and adjustments. Changes are made in line with service requirements and policies set via the CSF User Group.
Intel vs AMD Nodes
Please note that some software or codes will only be suitable for running on the Intel nodes and not the AMD nodes, or vice versa. Where appropriate this will be indicated on the relevant software page.
Time limits
The Intel nodes have a 7-day wallclock limit and the AMD nodes a 4-day limit. Jobs that have not completed within the wallclock limit are automatically terminated by the batch system. Not all applications will save their data at this point. Further advice on time limits.
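Because a job is terminated when it reaches the limit, a long-running jobscript may want to stop and save its own data shortly beforehand. The sketch below converts the two limits to seconds and uses bash's built-in `SECONDS` counter to estimate the time remaining. The 30-minute safety margin and the names `time_left`, `run_one_step` and `save_checkpoint` are hypothetical, and whether your application can be stopped and restarted in this way is an assumption to check.

```shell
#!/bin/bash
# Wallclock limits from the text, converted to seconds.
intel_limit=$(( 7 * 24 * 60 * 60 ))   # 7 days
amd_limit=$(( 4 * 24 * 60 * 60 ))     # 4 days
margin=$(( 30 * 60 ))                 # hypothetical 30-minute safety margin

# time_left LIMIT: seconds remaining before LIMIT minus the safety margin.
# SECONDS is a bash variable counting seconds since this script started.
time_left() {
    echo $(( $1 - margin - SECONDS ))
}

# Hypothetical usage inside a jobscript running on an Intel node:
#   while [ "$(time_left "$intel_limit")" -gt 0 ]; do run_one_step; done
#   save_checkpoint
echo "Intel limit: ${intel_limit}s, AMD limit: ${amd_limit}s"
```

The counter starts when the jobscript starts running, so time spent waiting in the queue is not included.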
Further Information
Our own documentation throughout this site provides lots of examples of writing jobscripts and how to submit jobs. SGE also comes with a set of comprehensive man pages. Some of the most useful ones are:
man qsub
man qstat
man qdel