Running Jobs – The Batch System (SGE)

Why use a batch system?

All jobs must be run in the batch system (SGE). This allows you to specify the resources (cores, memory, GPUs) you need for your jobs and ensures the jobs only run when those resources become available.

It also ensures fair usage of the system – there are many jobs making different demands of the system and many users submitting jobs. The batch system will schedule your jobs according to resources requested and size of your group’s contribution to the system.

Be kind to the login nodes and other users

Applications should not be run directly on the login nodes. These are relatively small, lightweight nodes (few cores, little memory) used to access the system, edit files and submit jobs. Many users are connected to the login nodes at any one time. If you run an application there, you may prevent all of those users from doing their work.

Go through the documentation of the parallel program/code/library

Every parallel program, code or library has its own method for controlling the number of cores/threads it uses.
It is the user's responsibility to read the documentation of any code, library or program they intend to run on an HPC system, to find out how to set or limit the number of cores/threads it will use at runtime, and to adjust those settings before submitting the job. Users must also request the same number of CPU cores in their parallel jobscript: #$ -pe smp.pe [num_cores]
If these settings are not controlled properly, parallel jobs can oversubscribe CPU cores and memory and deprive other jobs running on the same node of the share of resources assigned to them by the batch system. This degrades the performance of all jobs running on that compute node and can also prevent other eligible jobs from running.
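
As an illustration only, the sketch below shows one way to keep an application's thread count in line with the cores requested. It assumes an OpenMP program, which reads the standard OMP_NUM_THREADS variable; other programs use different options, so check their documentation. NSLOTS is the variable SGE sets to the number of cores granted to the job, -cwd is a common SGE option not mentioned above, and my_openmp_app is a hypothetical program name.

```shell
#!/bin/bash
#$ -cwd              # run the job from the directory it was submitted from
#$ -pe smp.pe 4      # request 4 cores in the smp.pe parallel environment

# SGE sets NSLOTS to the number of cores granted to the job.
# Fall back to 1 if the script is run outside the batch system.
export OMP_NUM_THREADS="${NSLOTS:-1}"

# Hypothetical application: replace this line with your own program,
# using whatever thread-control option its documentation describes.
echo "Would run ./my_openmp_app with OMP_NUM_THREADS=${OMP_NUM_THREADS}"
```

Because the thread count is taken from NSLOTS rather than typed in twice, changing the number on the -pe line is enough to keep the request and the application in agreement.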

Please do NOT run your applications, programs or codes on the login nodes or directly on the compute nodes. Submit your work to the batch system.

The sysadmins will kill, without notice, any applications running on the login nodes, and any batch jobs that oversubscribe resources because their parallel options were not set properly.

Do not log in to compute nodes

If you want to diagnose or debug a problem with your application (e.g., quick test runs, trying different parameters, possibly modifying and recompiling code) without using a batch script, please use an interactive job (see qrsh). DO NOT do this type of work directly on the login node – your quick test run may use far more CPU and memory than you think and deprive jobs running on the nodes of the resources assigned to them by the batch system. Such direct logins and related processes will be terminated without warning. Repeated attempts will result in the account being disabled.

Please take the time to learn how to submit jobs to the batch system.

Batch Tutorial

If you are unfamiliar with running jobs in a batch system, please see our 10-minute tutorial on running jobs on the CSF.

Submitting Jobs and Requesting Resources

You will need to write a small jobscript,

gedit myjobscript

which is a simple text file that specifies

  1. Any additional or specific resources your job needs (number of CPU cores, the architecture/type of CPU, memory, GPUs).
    [The default is 1 CPU-core, any Intel CPU type, 4GB RAM, no GPU]
  2. The actual commands / application your job should execute.
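
A minimal sketch of such a jobscript is shown here, assuming the default resources are sufficient so that no extra resource lines are needed. The -cwd option is a common SGE option not mentioned above, and the echo command is only a placeholder for your own application.

```shell
#!/bin/bash
#$ -cwd    # run the job from the directory it was submitted from

# No further "#$" lines: the job uses the default resources.
# NSLOTS is set by SGE inside a job; default to 1 when it is not set.
cores="${NSLOTS:-1}"

# Placeholder for the actual commands / application the job should execute.
echo "Job running on $(uname -n) using ${cores} core(s)"
```

Lines beginning with #$ are read by the batch system as options; the shell itself treats them as ordinary comments.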

Further details on how to write jobscripts, and some example job scripts, are in the sections on serial jobs and parallel jobs. The menu on the left also has pages for more advanced job options. Our software pages also have example jobscripts for each application we have installed.

Then submit the jobscript to the batch system using

qsub myjobscript

You may also wish to check on your job (is it still running?) using

qstat
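
The two commands can also be combined, as in the following sketch. It assumes a standard SGE installation, where qsub -terse prints only the job ID and qstat -j reports on a single job. The check for qsub is included only so the script fails gracefully if run somewhere other than a login node.

```shell
#!/bin/bash
# Submit a jobscript and report on it; intended to be run on a login node.
jobscript="myjobscript"

if command -v qsub >/dev/null 2>&1; then
    # -terse makes qsub print only the job ID, which is easy to capture.
    job_id=$(qsub -terse "$jobscript")
    echo "Submitted job ${job_id}"
    # Show the details of this one job rather than the full queue listing.
    qstat -j "$job_id"
    status="submitted"
else
    echo "qsub not found: run this on a system with SGE installed"
    status="unavailable"
fi
```

Running qstat with no arguments, as above, remains the quickest way to list all of your own jobs.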

See the batch commands for more information.

Last modified on October 14, 2024 at 12:54 pm by Abhijit Ghosh