Monitoring Jobs

Monitoring Existing Jobs

You can use srun to monitor existing jobs. It will log in to the resources allocated to your job on the compute node where the job is running and give you an interactive session there.

Your interactive session will consume some of the resources allocated to your batch job. This may adversely affect your batch job.

You will need to know the JobID number of the job you wish to monitor, then run:

srun --jobid JOBID --pty bash
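
If you do not know the JobID, the standard Slurm squeue command will list your jobs along with their JobIDs:

squeue -u $USER           # list your queued and running jobs and their JobIDs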

If you’ll be using a GUI tool to monitor your job, use:

srun-x11 --jobid JOBID            # NO "--pty bash" needed for srun-x11

To limit how long your interactive session will run, add the -t timespec flag to the srun command. For example: -t 10 for 10 minutes.
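
For example, to start a monitoring session that ends after at most 10 minutes:

srun --jobid JOBID -t 10 --pty bash     # interactive monitoring session limited to 10 minutes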

GPU jobs

If monitoring a GPU job, you can now run nvidia-smi in the interactive session to check the job's GPU usage.
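
For example, from within the interactive session:

nvidia-smi                  # one-off snapshot of GPU usage and memory
watch -n 5 nvidia-smi       # refresh the display every 5 seconds (press Ctrl-C to stop)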

Ending your monitoring session

Run exit to end your interactive monitoring session. This will NOT terminate your batch job. You’ll return to the login node.
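
For example (the compute-node hostname shown here is illustrative):

[mabcxyz1@node001[csf3] ~]$ exit       # ends the monitoring session only
[mabcxyz1@login1[csf3] ~]$             # back on the login node; the batch job carries on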

Job Statistics

A completed (successful) job

You can see the job stats – e.g., peak memory usage – with the seff command, passing in a JOBID:

[mabcxyz1@login1[csf3] ~]$ seff 12345
Job ID: 12345
Cluster: csf3.man.alces.network
User/Group: username/xy01
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2                                   # 2 CPUs requested in the jobscript
CPU Utilized: 00:04:13
CPU Efficiency: 49.41% of 00:08:32 core-walltime    # 50% CPU usage suggests only 1 CPU was needed
Job Wall-clock time: 00:04:16
Memory Utilized: 21.45 GB              # Peak memory usage
Memory Efficiency: 33.5% of 64.00 GB   # A low memory efficiency means this job did NOT need
                                       # to use the himem partition. You should check this.

To check a specific jobarray task, use a JOBID of the form jobid_taskid:

seff 12345_501

Alternatively, use the sacct command to obtain various stats about a job:

sacct -j 12345

# Or to just get the memory usage
sacct -j 12345 -o maxrss

The sacct command offers lots of options – use man sacct to get more info.
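
For example, to show a few commonly useful fields (field names are listed in the sacct man page):

sacct -j 12345 -o jobid,state,elapsed,maxrss,reqmem    # state, runtime, peak and requested memory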

Depending on the software you are using you may also find memory usage reported in output files.
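
For example, you could search the job's output file for memory-related lines (whether anything is reported depends entirely on the application):

grep -i memory slurm-12345.out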

A terminated (out of memory) job

If at any point while the job is running its memory usage exceeds the limit the job is permitted to use, the batch system will terminate the job.

The seff command will show:

[mabcxyz1@login1[csf3] ~]$ seff 12345
State: OUT_OF_MEMORY (exit code 0)

You may see the following in your slurm-12345.out file:

[mabcxyz1@login1[csf3] ~]$  cat slurm-12345.out

/var/spool/slurmd/job12345/slurm_script: line 4: 1851022 Killed             ./some-app.exe -in data.dat -out results.dat
slurmstepd: error: Detected 1 oom_kill event in StepId=12345.batch. Some of the step tasks have been OOM Killed.
                               #
                               # OOM is "out of memory" - this means Slurm killed your job
                               # because it tried to use more memory than allowed.

You will need to resubmit your job, either requesting more cores (if using the standard partitions) or using a high memory partition.
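
For example, in your jobscript you might change the resource request (the standard partition name below is illustrative; check the CSF3 documentation for the partitions available to you):

#SBATCH -p multicore      # example standard partition - requesting more cores gives the job more memory
#SBATCH -n 4              # e.g., 4 cores instead of 2

# Or, if the job genuinely needs a large amount of memory per core:
#SBATCH -p himem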
