<h1>Monitoring Jobs</h1>
<h2>Monitoring Existing Jobs</h2>
<p>To monitor the resource usage of a running job, you'll need to access the compute node where that job is running. This then allows you to run commands such as <code>top</code> or <code>htop</code> (for CPU/host monitoring), or <code>nvitop</code> (for GPU monitoring).</p>
<p>There are two ways to access the compute node where your job is running; see below.</p>
<p>Please note: if you <em>don't</em> have a job running on a compute node, you will <em>not</em> be able to access that compute node.</p>
<h3>Using ssh</h3>
<p>It is now possible (and permitted) to access compute nodes using <code>ssh</code> to monitor your jobs.</p>
<pre>
# On the login node, find out where your job is running
squeue
  JOBID  PRIORITY  PARTITION  NAME   USER      ST     ...      NODELIST
 123456  0.000054  gpuA       myjob  mabcxyz1  R      ...
                                                          <em><strong>node860</strong></em>

# Now access the compute node
ssh <em><strong>node860</strong></em>

# Run your monitoring command - for example:
top
htop
# For Nvidia GPU jobs:
module load tools/bintools/nvitop
nvitop

# To return to the login node:
exit
</pre>
<h3>Using srun</h3>
<p>You can also use <code>srun</code> to log in to the node where the job is running, which gives you an interactive Slurm session on the node:</p>
<pre>
srun --jobid <em>JOBID</em> --pty bash
</pre>
<p>If you'll be using a GUI tool to monitor your job, use:</p>
<pre>
srun-x11 --jobid <em>JOBID</em>            # NO "--pty bash" needed for srun-x11
</pre>
<p>To limit the amount of time your interactive session will run for, add the <code>-t <em>timespec</em></code> flag to the <code>srun</code> command. For example: <code>-t 10</code> for 10 minutes.</p>
<h3>Ending your monitoring session</h3>
<p>Run <code>exit</code> to end your interactive monitoring session. This will NOT terminate your batch job.
You'll return to the login node.</p>
<h2>Job Statistics</h2>
<h3>A completed (successful) job</h3>
<p>You can see the job stats, e.g., <em>peak</em> memory usage, with the <code>seff</code> command, passing in a JOBID:</p>
<pre>[<em>mabcxyz1</em>@login1[csf3] ~]$ seff <em>12345</em>
Job ID: 12345
Cluster: csf3.man.alces.network
User/Group: <em>username</em>/<em>xy</em>01
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2                                   # 2 CPUs requested in the jobscript
CPU Utilized: 00:04:13
CPU Efficiency: 49.41% of 00:08:32 core-walltime    # &lt;50% CPU usage suggests only 1 CPU was needed
Job Wall-clock time: 00:04:16
<strong>Memory Utilized: 21.45 GB</strong>              # <strong>Peak memory usage</strong>
Memory Efficiency: <strong>33.5%</strong> of 64.00 GB   # A low memory efficiency means this job did NOT need
                                       # to use the himem partition.
                                       # You should check this.
</pre>
<p>To check a specific job array task, use a JOBID of the form <em>jobid_taskid</em>:</p>
<pre>seff 12345_501
</pre>
<p>Alternatively, use the <code>sacct</code> command to obtain various stats about a job:</p>
<pre>sacct -j <em>12345</em>

# Or to get just the memory usage
sacct -j <em>12345</em> -o maxrss
</pre>
<p>The <code>sacct</code> command offers many options; use <code>man sacct</code> for more info.</p>
<p>Depending on the software you are using, you may also find memory usage reported in its output files.</p>
<h3>A terminated (out of memory) job</h3>
<p>If, at any point while the job is running, its peak memory usage goes <em>above</em> the limit that the job is permitted to use, the job will be terminated by the batch system.</p>
<p>The <code>seff</code> command will show:</p>
<pre>[<em>mabcxyz1</em>@login1[csf3] ~]$ seff 12345
State: OUT_OF_MEMORY (exit code 0)
</pre>
<p>You may see the following in your <code>slurm-12345.out</code> file:</p>
<pre>[<em>mabcxyz1</em>@login1[csf3] ~]$ cat slurm-12345.out

/var/spool/slurmd/job12345/slurm_script: line 4: 1851022 Killed             ./some-app.exe -in data.dat -out results.dat
slurmstepd: error: Detected 1 oom_kill event in StepId=12345.batch.
Some of the step tasks have been OOM Killed.
                               #
                               # OOM is "out of memory" - this means Slurm killed your job
                               # because it tried to use more memory than allowed.
</pre>
<p>You will need to resubmit your job, either requesting more cores (if using the standard partitions) or using a <a href="/csf3/batch-slurm/high-memory-jobs-slurm/">high memory partition</a>.</p>
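<p>The "Memory Utilized" and "Memory Efficiency" figures that <code>seff</code> prints can be reproduced from <code>sacct</code>'s <code>MaxRSS</code> value. Below is a minimal sketch of that arithmetic; the <code>MaxRSS</code> value is hard-coded (a hypothetical sample matching the 21.45 GB / 64 GB example above) so the snippet runs without a cluster. On a real system you would substitute the output of <code>sacct -j JOBID -o maxrss --noheader</code>.</p>

```shell
# Hypothetical MaxRSS value as sacct typically reports it (KiB, "K" suffix).
# On a real cluster: maxrss=$(sacct -j JOBID -o maxrss --noheader | head -1)
maxrss="22491955K"        # ~21.45 GB peak memory usage
requested_gb=64           # memory requested in the jobscript

# Convert KiB to GB and compute the efficiency figure that seff prints.
echo "$maxrss" | awk -v req="$requested_gb" '{
    kib = $1 + 0                      # awk numeric coercion drops the "K"
    gb  = kib / (1024 * 1024)
    printf "Memory Utilized: %.2f GB\n", gb
    printf "Memory Efficiency: %.1f%% of %d GB\n", 100 * gb / req, req
}'
```

<p>Note that <code>sacct</code> may report <code>MaxRSS</code> with other unit suffixes (e.g. <code>M</code>, <code>G</code>); a robust script would normalise the units before doing this calculation.</p>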