Short Jobs and Time Limits
Default Maximum Wallclock Time
The maximum wallclock time for a job is currently 7 days (unless otherwise noted in the
parallel environment (PE) table).
The 7-day runtime limit also affects how long your jobs may have to queue. In general, the longer the permitted wallclock time, the longer you may have to wait for your job to start, and a 7-day limit is longer than most HPC systems offer. We find that this suits our users’ workloads, but it does mean that, when the system is busy, you could wait up to 24 hours for some of your jobs to start. Jobs are finishing all the time on the CSF, so new jobs are continually being selected to run, but please be patient if your job does not start immediately.
We cannot extend a job’s maximum runtime once it has been submitted to the batch system.
Specifying a Shorter Wallclock Time
If you know how much wallclock time your job needs, we recommend stating it at submission, as this helps the SLURM scheduler decide which jobs to run. Unlike on some HPC systems, this is optional. To set a limit, add the following line to your jobscript:
#SBATCH -t hh:mm:ss # (--time=hh:mm:ss)
where hh is the number of hours, mm is the number of minutes and ss is the number of seconds.
Other acceptable time formats are:
#SBATCH -t minutes
#SBATCH -t minutes:seconds
#SBATCH -t hours:minutes:seconds
#SBATCH -t days-hours
#SBATCH -t days-hours:minutes
#SBATCH -t days-hours:minutes:seconds
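To check how one of these formats will be read, the rules can be sketched as a small bash function that converts a time string into seconds. This is only an illustration: the function name slurm_time_to_seconds is hypothetical and is not part of SLURM, and it assumes the input is already a well-formed time string in one of the formats listed above.

```shell
#!/bin/bash
# Hypothetical helper (not a SLURM command): convert a time string in one of
# the accepted -t formats into a number of seconds.
slurm_time_to_seconds() {
    local spec=$1 days=0 h=0 m=0 s=0
    if [[ $spec == *-* ]]; then
        # days-hours[:minutes[:seconds]]
        days=${spec%%-*}
        spec=${spec#*-}
        IFS=: read -r h m s <<< "$spec"
    else
        # minutes | minutes:seconds | hours:minutes:seconds
        local a b c
        IFS=: read -r a b c <<< "$spec"
        if [[ -n $c ]]; then
            h=$a; m=$b; s=$c
        elif [[ -n $b ]]; then
            m=$a; s=$b
        else
            m=$a
        fi
    fi
    # The 10# prefix forces base 10 so that values such as 08 are not read as octal.
    echo $(( 10#$days * 86400 + 10#${h:-0} * 3600 + 10#${m:-0} * 60 + 10#${s:-0} ))
}
```

For instance, slurm_time_to_seconds 10 treats a single number as minutes, while slurm_time_to_seconds 2-0 treats the value after the dash as hours.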
For example, to give a limit of 10 minutes, add the following line to your batch script:
#SBATCH -t 10
An example of 6 hours:
#SBATCH -t 06:00:00
An example of 2 days:
#SBATCH -t 2-0
Note that when the job time limit is reached, the batch system sends a soft kill signal (SIGTERM). Some applications detect this signal and shut down cleanly, for example by saving their current state and results before exiting, but this depends on your application’s capabilities. Some applications can checkpoint and later be restarted from a known state. Please consult the manual for your software for more information.
If the job has not shut down 5 minutes after receiving the soft kill signal, a hard kill signal (SIGKILL) is sent and the job is killed immediately.
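If your application handles SIGTERM itself, a jobscript can pass the signal on to it. The following is a minimal sketch rather than a recommended recipe: it assumes the soft kill signal is delivered to the batch shell running the jobscript, the handler name on_sigterm is hypothetical, and sleep 1 is only a stand-in for your real application.

```shell
#!/bin/bash
#SBATCH -t 01:00:00

terminated=0

# Hypothetical handler: runs when the batch shell receives SIGTERM.
on_sigterm() {
    terminated=1
    echo "SIGTERM received, asking the application to stop" >&2
    # Forward the signal to the application if it has been started.
    if [[ -n ${app_pid:-} ]]; then
        kill -TERM "$app_pid" 2>/dev/null
    fi
    return 0
}
trap on_sigterm TERM

# Stand-in for the real application; replace 'sleep 1' with your own command.
# Running it in the background lets bash act on the trap straight away,
# instead of only after the application has finished.
sleep 1 &
app_pid=$!

# 'wait' returns early when a trapped signal arrives ...
wait "$app_pid"
if (( terminated )); then
    # ... so wait again to give the application time to finish shutting down.
    wait "$app_pid"
    echo "Application stopped after SIGTERM" >&2
fi
```

The application is started in the background because bash delays running a trap until any foreground command has finished, which could use up the whole grace period before the handler is called.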
Short Jobs
A separate short job area does NOT exist on CSF4 at the moment (on CSF3 some nodes are reserved for short jobs). However, adding a runtime limit to your job, as detailed above, may allow it to be selected to run sooner than other jobs in the system. If you know that one hour, for example, will be enough for your job to complete its work, add
#SBATCH -t 01:00:00
to your jobscript.