Monitoring Jobs
Monitoring Existing Jobs
To monitor the resource usage of a running job, you’ll need to access the compute node where that job is running. You can then run commands such as top or htop (for CPU/host monitoring) or nvitop (for GPU monitoring).
There are two ways to access the compute node where your job is running – see below.
Please note: if you don’t have a job running on a compute node, you will not be able to access that compute node.
Using ssh
It is now possible (and permitted) to access compute nodes using ssh to monitor your jobs.
# On the login node, find out where your GPU job is running
squeue
  JOBID PRIORITY PARTITION NAME  USER     ST ... NODELIST
 123456 0.000054 gpuA      myjob mabcxyz1 R  ... node860

# Now access the compute node
ssh node860

# Run your monitoring command - for example:
top
htop

# For Nvidia GPU jobs:
module load tools/bintools/nvitop
nvitop

# To return to the login node:
exit
Using srun
You can also use srun to log in to the node where the job is running, which gives you an interactive Slurm session on that node:
srun --jobid JOBID --pty bash
If you’ll be using a GUI tool to monitor your job, use:
srun-x11 --jobid JOBID # NO "--pty bash" needed for srun-x11
To limit the amount of time your interactive session will run for, add the -t timespec flag to the srun command. For example: -t 10 for 10 minutes.
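For example, a monitoring session on the job’s node limited to 10 minutes (substitute your own job ID for JOBID):

srun --jobid JOBID -t 10 --pty bash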
Ending your monitoring session
Run exit to end your interactive monitoring session. This will NOT terminate your batch job. You’ll return to the login node.
Job Statistics
A completed (successful) job
You can see the job stats – e.g., peak memory usage – with the seff command, passing in a JOBID:
[mabcxyz1@login1[csf3] ~]$ seff 12345
Job ID: 12345
Cluster: csf3.man.alces.network
User/Group: username/xy01
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2 # 2 CPUs requested in the jobscript
CPU Utilized: 00:04:13
CPU Efficiency: 49.41% of 00:08:32 core-walltime # <50% CPU usage suggests only 1 CPU was needed
Job Wall-clock time: 00:04:16
Memory Utilized: 21.45 GB # Peak memory usage
Memory Efficiency: 33.5% of 64.00 GB # A low memory efficiency means this job did NOT need
# to use the himem partition. You should check this.
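The efficiency figures are simply the resources used divided by the resources requested. In the example above:

CPU Efficiency    = 00:04:13 / (2 cores x 00:04:16 wall-clock) = 253 s / 512 s ≈ 49.4%
Memory Efficiency = 21.45 GB / 64.00 GB ≈ 33.5%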
To check a specific job array task, use a JOBID of the form jobid_taskid:
seff 12345_501
Alternatively, use the sacct command to obtain various stats about a job:
sacct -j 12345

# Or to just get the memory usage
sacct -j 12345 -o maxrss
The sacct command offers lots of options – use man sacct to get more info.
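For example, to report the elapsed time, peak memory and exit status together (the fields below are standard sacct format options; see man sacct for the full list):

sacct -j 12345 -o JobID,Elapsed,MaxRSS,State,ExitCode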
Depending on the software you are using, you may also find memory usage reported in the job’s output files.
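For example, if your application writes its own memory summary you could search the Slurm output file for it (the search string here is only illustrative; the exact wording depends on the application):

grep -i memory slurm-12345.out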
A terminated (out of memory) job
If, at any point while the job is running, the job’s memory usage goes above the limit that the job is permitted to use, the job will be terminated by the batch system.
The seff command will show:
[mabcxyz1@login1[csf3] ~]$ seff 12345
State: OUT_OF_MEMORY (exit code 0)
You may see the following in your slurm-12345.out file:
[mabcxyz1@login1[csf3] ~]$ cat slurm-12345.out
/var/spool/slurmd/job12345/slurm_script: line 4: 1851022 Killed ./some-app.exe -in data.dat -out results.dat
slurmstepd: error: Detected 1 oom_kill event in StepId=12345.batch. Some of the step tasks have been OOM Killed.
#
# OOM is "out of memory" - this means Slurm killed your job
# because it tried to use more memory than allowed.
You will need to resubmit your job, either requesting more cores (if using the standard partitions) or using a high memory partition.
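For example, the relevant jobscript changes might look like the sketch below (the core count and the high memory partition name are assumptions; check the CSF3 partition documentation for the correct values):

#!/bin/bash
# Option 1: standard partitions - request more cores (the memory available to the job scales with the core count)
#SBATCH -n 8
# Option 2 (instead of the above): submit to a high memory partition - the name "himem" is assumed here
##SBATCH -p himem

./some-app.exe -in data.dat -out results.dat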
