Short Jobs and Time Limits (Slurm)
There is NO Default Job Wallclock Time
The upgraded CSF now requires that you specify the maximum wallclock time that your job will be allowed to run for. If your job is still running after this amount of time, the system will kill the job.
There is NO default value, but there is a maximum you are allowed to specify. (In the SGE batch system, the default was 7 days if you didn’t specify a wallclock time limit.)
Why the change? This improves job scheduling: Slurm may be able to run your job sooner if it can fit it in before other jobs are expected to start using similar resources. The more realistic the wallclock time you give, the better Slurm can schedule jobs.
You don’t need to be super accurate! If you’re not sure how long your job will take, you should err on the side of caution and give it plenty of wallclock time. The maximum permitted is 7 days in most cases (4 days for GPU jobs). Consult the Partitions page to see the per-partition limits.
Note: We cannot extend a job’s wallclock limit once it has been submitted to the batch system (although see “Can the time limit on a submitted job be changed?” below for changes you may be able to make yourself).
If you fail to specify the wallclock time, you’ll receive the error:
sbatch: error: Batch job submission failed: Requested time limit is invalid (missing or exceeds some limit)
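For reference, a minimal jobscript that satisfies this requirement might look like the following sketch. The partition, core count and program name are placeholders, not CSF recommendations; use the values appropriate to your own work.

#!/bin/bash --login
#SBATCH -p multicore      # placeholder partition name - see the Partitions page
#SBATCH -n 4              # placeholder: 4 cores
#SBATCH -t 1-0            # REQUIRED: maximum wallclock time (here 1 day, 0 hours)

./my_program              # placeholder for your application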
What happens when the Wallclock limit is reached?
When the job’s wallclock time limit is reached, the batch system sends a soft kill signal (SIGTERM) to your job. Some applications will detect this and shut down gracefully, for example saving their current state and results before exiting, but this depends on your application’s capabilities. Some applications can then use this “checkpoint” to restart from a known point.
But most apps will simply exit! You might be left with an incomplete, empty or corrupt output file.
Please consult the manual for your software for more information.
If the job hasn’t shut down within 30 seconds of receiving the soft kill signal, a hard kill signal (SIGKILL) will be sent and the job will be killed immediately.
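If your application can save its own state, one possible pattern (a sketch only; the signal name, warning time and filenames below are assumptions, not CSF settings) is to ask Slurm to send an extra warning signal shortly before the limit and trap it in your jobscript:

#!/bin/bash --login
#SBATCH -t 0-1                   # 1 hour wallclock limit
#SBATCH --signal=B:USR1@120      # ask Slurm to signal the batch script ~120s before the limit

# Hypothetical clean-up: copy partial results somewhere safe before the job is killed.
save_state() {
    echo "Time limit approaching - saving partial results"
    cp my_results.dat my_results.partial    # my_results.dat is a placeholder filename
    exit 1
}
trap save_state USR1 TERM

# Run the real work in the background and wait for it, so that bash can act on the
# trapped signal as soon as it arrives rather than after the program finishes.
./my_program &                   # placeholder for your application
wait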
Specifying a Wallclock Time
To set the max wallclock for your job, add the following to your jobscript:
#SBATCH -t d-hh:mm:ss # (--time=d-hh:mm:ss)
where d is the number of days, hh the number of hours, mm the number of minutes and ss the number of seconds.
Other acceptable time formats are:
#SBATCH -t minutes
#SBATCH -t minutes:seconds
#SBATCH -t hours:minutes:seconds
#SBATCH -t days-hours                 # Recommended format (e.g., 4-0 for 4 days)
#SBATCH -t days-hours:minutes
#SBATCH -t days-hours:minutes:seconds
For example, to give a limit of 10 minutes, add the following line to your batch script:
#SBATCH -t 10
An example of 6 hours:
#SBATCH -t 06:00:00

#### OR use 0 days, 6 hours:
#SBATCH -t 0-6
An example of 2 days:
#SBATCH -t 2-0
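The time limit can also be given on the sbatch command line rather than in the jobscript; a value given on the command line overrides any #SBATCH -t line in the script. Here jobscript.sh is a placeholder name for your own jobscript:

# Submit with a 2-day limit without editing the jobscript
sbatch -t 2-0 jobscript.sh

# Long-form equivalent
sbatch --time=2-0 jobscript.sh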
Can the time limit on a submitted job be changed?
Yes, you can increase or decrease the time you have requested thus:
scontrol update jobid=1234567 TimeLimit=4-0
Be sure to replace the jobid with the number of your job and set the TimeLimit to the new wallclock you require (the above example sets 4 days).
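To check the limit currently applied to a job (for example, after changing it), standard Slurm commands can be used; 1234567 is again a placeholder job id:

# Show the job's record and pick out the line containing its TimeLimit
scontrol show job 1234567 | grep -i timelimit

# Or show just the job id, its time limit (%l) and the time remaining (%L)
squeue -j 1234567 -o "%.10i %.12l %.12L"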
I receive an error ReqNodeNotAvail when submitting jobs
If your job won’t submit and you see the error
Job unable to run due to ReqNodeNotAvail, reserved for maintenance
this means a maintenance period has been scheduled. The wallclock time limit you have specified means that the job would not finish before the maintenance period begins.
Slurm would then have to kill the job, so instead it refuses to let you submit it.
The solution is to specify a shorter wallclock time limit using the #SBATCH -t timelimit directive (or see above for how to modify already-queued jobs).
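To see when the maintenance reservation starts, and hence how much wallclock you can request and still finish beforehand, you can list the reservations Slurm knows about (jobscript.sh is a placeholder name):

# List current and upcoming reservations, including maintenance windows
scontrol show reservation

# Then resubmit with a wallclock that fits before the reservation begins, e.g. 1 day
sbatch -t 1-0 jobscript.sh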
Short Jobs
A separate short job area does NOT exist on the upgraded CSF3. However, by adding a runtime limit to your job, as detailed above, your job may be selected to run sooner than other jobs in the system. If you know that one hour, for example, will be enough for your job to complete its work, add
#SBATCH -t 01:00:00
to your jobscript.
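For example, a complete short jobscript might look like this sketch (the partition and program names are placeholders for your own settings):

#!/bin/bash --login
#SBATCH -p serial         # placeholder partition name - see the Partitions page
#SBATCH -t 01:00:00       # the job needs at most 1 hour

./my_short_program        # placeholder for your application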