Monitoring Jobs
Monitoring Existing Jobs
You can use srun to monitor existing jobs. It will log in to the allocated resource on the compute node where the job is running and give you an interactive session there.
You will need to know the JobID number of the job you wish to monitor, then run:
srun --jobid JOBID --pty bash
If you’ll be using a GUI tool to monitor your job, use:
srun-x11 --jobid JOBID # NO "--pty bash" needed for srun-x11
To limit the amount of time your interactive session will run for, add the -t timespec flag to the srun command. For example, -t 10 for 10 minutes.
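Putting these together, a monitoring session limited to 10 minutes would be started with:
srun --jobid JOBID -t 10 --pty bash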
GPU jobs
If you are running a GPU job, you can now run nvidia-smi to get some information about your GPU usage.
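For example, from inside the interactive session on the GPU node you can take a one-off snapshot, or (assuming the standard Linux watch utility is available on the node) refresh it every few seconds:
nvidia-smi                # snapshot of current GPU utilisation and memory
watch -n 5 nvidia-smi     # refresh every 5 seconds; press Ctrl-C to stop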
Ending your monitoring session
Run exit to end your interactive monitoring session. This will NOT terminate your batch job. You'll return to the login node.
Job Statistics
A completed (successful) job
You can see the job statistics, such as peak memory usage, with the seff command, passing in a JOBID:
[mabcxyz1@login1[csf3] ~]$ seff 12345
Job ID: 12345
Cluster: csf3.man.alces.network
User/Group: username/xy01
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2 # 2 CPUs requested in the jobscript
CPU Utilized: 00:04:13
CPU Efficiency: 49.41% of 00:08:32 core-walltime # 50% CPU usage suggests only 1 CPU was needed
Job Wall-clock time: 00:04:16
Memory Utilized: 21.45 GB # Peak memory usage
Memory Efficiency: 33.5% of 64.00 GB # A low memory efficiency means this job did NOT need
# to use the himem partition. You should check this.
To check a specific job array task, use a JOBID of the form jobid_taskid:
seff 12345_501
Alternatively, use the sacct command to obtain various statistics about a job:
sacct -j 12345
sacct -j 12345 -o maxrss # Or, to get just the memory usage
The sacct command offers lots of options; use man sacct to get more info.
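As an illustration, the -o / --format option takes a comma-separated list of fields; the fields chosen below are just one possible selection:
sacct -j 12345 -o JobID,Elapsed,MaxRSS,ReqMem,State,ExitCode # elapsed time, peak memory, requested memory, final state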
Depending on the software you are using, you may also find memory usage reported in your output files.
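For example, a quick way to look for such a report in a job's output file (the filename and search term here are only illustrative):
grep -i "memory" slurm-12345.out # case-insensitive search for memory-related lines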
A terminated (out of memory) job
If at any point while the job is running its peak memory usage goes above the limit the job is permitted to use, the batch system will terminate the job.
The seff command will show:
[mabcxyz1@login1[csf3] ~]$ seff 12345
State: OUT_OF_MEMORY (exit code 0)
You may see the following in your slurm-12345.out file:
[mabcxyz1@login1[csf3] ~]$ cat slurm-12345.out
/var/spool/slurmd/job12345/slurm_script: line 4: 1851022 Killed ./some-app.exe -in data.dat -out results.dat
slurmstepd: error: Detected 1 oom_kill event in StepId=12345.batch. Some of the step tasks have been OOM Killed.
#
# OOM is "out of memory" - this means Slurm killed your job
# because it tried to use more memory than allowed.
You will need to resubmit your job, either requesting more cores (if using the standard partitions) or using a high memory partition.
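As a rough sketch of the jobscript changes (the exact partition names and the way memory scales with cores are site-specific, so treat these lines as illustrative rather than exact):
#SBATCH -n 4      # request more cores on a standard partition; the memory limit usually scales with the core count
#SBATCH -p himem  # OR: submit to the high memory partition flagged in the seff output above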