{"id":9711,"date":"2025-05-06T17:07:32","date_gmt":"2025-05-06T16:07:32","guid":{"rendered":"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/?page_id=9711"},"modified":"2026-02-24T09:50:08","modified_gmt":"2026-02-24T09:50:08","slug":"gpu-jobs-slurm","status":"publish","type":"page","link":"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/batch-slurm\/gpu-jobs-slurm\/","title":{"rendered":"Nvidia GPU Jobs (Slurm)"},"content":{"rendered":"<h2>Access<\/h2>\n<p>This page covers the Nvidia A100 (80GB), A100 (40GB) and L40S GPUs in Slurm.<\/p>\n<div class=\"hint\">Notes for Slurm users &#8211; please read:<\/p>\n<ol class=\"gaplist\">\n<li><strong>All users have access to the A100 (80GB) and L40S GPUs<\/strong> &#8211; you <em>do not<\/em> need to submit a ticket to request access.<\/li>\n<li>The A100 (<strong>40<\/strong>GB) GPUs are limited to a small number of users from a specific group.<\/li>\n<li>The H200 GPUs have a specific <a href=\"h200\/#access\">access policy<\/a> (please check before requesting access.)<\/li>\n<li>See the <a href=\"..\/partitions\/#gpunodes\">Slurm Partitions page<\/a> for the available hardware.<\/li>\n<li>GPUs now run in &#8220;DEFAULT&#8221; compute mode, not &#8220;EXCLUSIVE_PROCESS&#8221;. Hence you can run multiple processes on the GPUs assigned to your job. Slurm will prevent other jobs from accessing the GPUs assigned to your job.<\/li>\n<li><strong>Feb 2026 update<\/strong>: a change has been made to the way in which the 4-GPU free-at-point-of-use GPU limits are applied &#8211; <a href=\"#f@pou\">see below for details<\/a>.\n<\/ol>\n<\/div>\n<h3 id=\"f@pou\">Free at Point of Use access<\/h3>\n<p>Please use the <code>gpuA<\/code> and\/or <code>gpuL<\/code> paritions.<\/p>\n<p><strong>20-Feb-2026: The current limits are<\/strong>: up to <strong>four A100(80GB) or L40S GPUs in use at any one time<\/strong>. These can now be all of the same GPU type. For example you can use 4 x A100(80GB) GPUs or 4 x L40S GPUs or a mixture of the two. You can submit as many jobs as you wish &#8211; the system will run them within the limits. <\/p>\n<p>Users that have previous had temporary increased limits via the <code>gpuax4<\/code> or <code>gpulx4<\/code> QOSs should no longer use those QOSs. Simply submit your 4xGPU jobs and the new f@pou limits will allow the jobs to run.<\/p>\n<p><strong>PLEASE NOTE:<\/strong> Now that the free-at-point-of-use limits have been modified to allow 4 of the same type of GPUs to be used, <strong>we will NOT accept requests for access to more than 4 GPUs.<\/strong> <\/p>\n<h3>Contributor access<\/h3>\n<p>Members of groups that have contributed GPUs will be informed of their limits when CSF accounts are created.<\/p>\n<h3>A100 (40GB) GPU access<\/h3>\n<p>The A100-40G nodes have been funded by a specific research group and so access is <em>very<\/em> restricted. You will be informed if you have access to these GPUs.<\/p>\n<h2>GPU batch job submission (Slurm)<\/h2>\n<p>For jobs that require GPUs &#8211; running on one or more GPUs in a single compute node. A jobscript template is shown below. Please also consult the <a href=\"\/csf3\/batch-slurm\/partitions\">Partitions<\/a> page for details on available compute resources.<\/p>\n<p><em><strong>Please also consult the <a href=\"\/csf3\/software\/applications#gpu\">software page<\/a> for the code \/ application you are running for advice on running that application<\/strong><\/em>.<\/p>\n<p>A GPU job script will run in the directory (folder) from which you submit the job. 
<h3>Contributor access<\/h3>\n<p>Members of groups that have contributed GPUs will be informed of their limits when CSF accounts are created.<\/p>\n<h3>A100 (40GB) GPU access<\/h3>\n<p>The A100-40G nodes have been funded by a specific research group and so access is <em>very<\/em> restricted. You will be informed if you have access to these GPUs.<\/p>\n<h2>GPU batch job submission (Slurm)<\/h2>\n<p>For jobs that require GPUs &#8211; running on one or more GPUs in a single compute node &#8211; a jobscript template is shown below. Please also consult the <a href=\"\/csf3\/batch-slurm\/partitions\">Partitions<\/a> page for details on available compute resources.<\/p>\n<p><em><strong>Please also consult the <a href=\"\/csf3\/software\/applications#gpu\">software page<\/a> for the code \/ application you are running, for advice specific to that application<\/strong><\/em>.<\/p>\n<p>A GPU job script will run in the directory (folder) from which you submit the job. The jobscript takes the form:<\/p>\n<pre>#!\/bin\/bash --login\r\n### <strong>Choose ONE of the following partitions depending on your permitted access<\/strong>\r\n#SBATCH -p gpuA              # A100 (80GB) GPUs  [up to 12 CPU cores per GPU permitted]\r\n#SBATCH -p gpuA40GB          # A100 (40GB) GPUs  [up to 12 CPU cores per GPU permitted]\r\n#SBATCH -p gpuL              # L40S GPUs         [up to 12 CPU cores per GPU permitted]\r\n### Required flags\r\n#SBATCH -G <em>N<\/em>                 # (or --gpus=<em>N<\/em>) Number of GPUs \r\n#SBATCH -t 1-0               # Wallclock timelimit (1-0 is one day, 4-0 is max permitted)\r\n### Optional flags\r\n#SBATCH -n <em>numcores<\/em>          # (or --ntasks=) Number of CPU (host) cores (default is 1)\r\n                             # See above for number of cores per GPU you can request.\r\n                             # Also affects host RAM allocated to job unless --mem=<em>num<\/em> used.\r\n\r\nmodule purge\r\nmodule load libs\/cuda\/<em>x.y.z<\/em>  # See below for specific versions\r\n\r\necho \"Job is using $SLURM_GPUS GPU(s) with ID(s) $CUDA_VISIBLE_DEVICES and $SLURM_NTASKS CPU core(s)\"\r\n<em>gpuApp args ...<\/em>\r\n<\/pre>\n<p>Note that the amount of <em>host<\/em> RAM your job has access to depends on the number of <em>CPU cores<\/em> you request, unless you request a specific amount of host memory for your job using the <code>--mem=<em>num<\/em>G<\/code> or <code>--mem-per-gpu=<em>num<\/em>G<\/code> flags. Note that the default units of memory are megabytes if no units are given (use <code>G<\/code> for gigabytes.)<\/p>\n<table class=\"striped\">\n<tbody>\n<tr>\n<th>GPU<\/th>\n<th>Max host cores per GPU<\/th>\n<th>Host RAM per core (GB)<\/th>\n<th>Max host RAM per GPU (GB)<\/th>\n<\/tr>\n<tr>\n<td>A100(80GB)<\/td>\n<td>12<\/td>\n<td>10.4<\/td>\n<td>125<\/td>\n<\/tr>\n<tr>\n<td>A100(40GB)<\/td>\n<td>12<\/td>\n<td>10.4<\/td>\n<td>125<\/td>\n<\/tr>\n<tr>\n<td>L40S<\/td>\n<td>12<\/td>\n<td>10.4<\/td>\n<td>125<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n
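<p>For example, a hedged sketch of requesting host memory explicitly rather than via the core count (the application name is a placeholder):<\/p>\n<pre>#!\/bin\/bash --login\r\n#SBATCH -p gpuA\r\n#SBATCH -G 1\r\n#SBATCH -t 1-0\r\n#SBATCH -n 2                 # Only 2 CPU cores are needed...\r\n#SBATCH --mem=64G            # ...but request 64GB of host RAM explicitly\r\n                             # (--mem-per-gpu=64G is an alternative that scales RAM with GPUs)\r\n\r\nmodule purge\r\nmodule load libs\/cuda\r\n\r\n<em>gpuApp args ...<\/em>\r\n<\/pre>\n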
<h2>Available Hardware and Resources<\/h2>\n<p>Please see the <a href=\"\/csf3\/batch-slurm\/partitions\/#gpunodes\">Partitions<\/a> page for details on available compute resources.<\/p>\n<h2>Software Applications<\/h2>\n<p>A range of GPU capable software is available on the CSF.<\/p>\n<p><a href=\"\/csf3\/software\/applications\/#gpu\">List of installed GPU capable software.<\/a><br \/>\n<a href=\"\/csf3\/software\/applications\/#ml\">List of installed Machine Learning specific software.<\/a><\/p>\n<p><a name=\"a100hw\"><\/a><\/p>\n<h2>GPU Hardware and Driver<\/h2>\n<p>The CSF (Slurm) contains the following GPU nodes, which offer different types of Nvidia GPUs and host CPUs.<\/p>\n<p><strong>For the <em>current<\/em> list of resources please see<\/strong> the <a href=\"..\/partitions\">Slurm Partitions page<\/a>.<\/p>\n<p><strong>The following information is for reference only &#8211; it does not mean that all of these nodes are currently available in Slurm!<\/strong><\/p>\n<p><strong>***NO LONGER IN SERVICE***<\/strong> 17 GPU nodes each hosting 4 x Nvidia v100 GPUs (16GB GPU RAM) giving a total of 68 v100 GPUs. The node spec is:<\/p>\n<ul>\n<li>4 x NVIDIA v100 SXM2 16GB GPU (Volta architecture \u2013 hardware v7.0, compute architecture <code>sm_70<\/code>)<\/li>\n<li>Some GPU hosts: 2 x 16-core Intel Xeon Gold 6130 &#8220;Skylake&#8221; 2.10GHz<\/li>\n<li>Some GPU hosts: 2 x 16-core Intel Xeon Gold 5218 &#8220;Cascade Lake&#8221; 2.30GHz<\/li>\n<li>192 GB RAM (host)<\/li>\n<li>1.6TB NVMe local storage, 182GB local SSD storage<\/li>\n<li>CUDA Driver 535.154.05<\/li>\n<\/ul>\n<p>19 GPU nodes each hosting 4 x Nvidia A100 GPUs (80GB GPU RAM) giving a total of 76 A100 GPUs. The node spec is:<\/p>\n<ul>\n<li>4 x NVIDIA HGX A100 SXM4 80GB GPU (Ampere architecture \u2013 hardware v8.0, compute architecture <code>sm_80<\/code>)<\/li>\n<li>2 x 24-core AMD Epyc 7413 &#8220;Milan&#8221; 2.65GHz<\/li>\n<li>512 GB RAM (host)<\/li>\n<li>1.6TB local NVMe storage, 364GB local SSD storage<\/li>\n<li>CUDA Driver 535.154.05<\/li>\n<\/ul>\n<p>2 GPU nodes each hosting 4 x Nvidia A100 GPUs (<strong>40GB<\/strong> GPU RAM) giving a total of 8 A100_40G GPUs. The node spec is:<\/p>\n<ul>\n<li>4 x NVIDIA A100 SXM4 40GB GPU (Ampere architecture \u2013 hardware v8.0, compute architecture <code>sm_80<\/code>)<\/li>\n<li>2 x 24-core AMD Epyc 7413 &#8220;Milan&#8221; 2.65GHz<\/li>\n<li>512 GB RAM (host)<\/li>\n<li>1.6TB local NVMe storage, 364GB local SSD storage<\/li>\n<li>CUDA Driver 535.154.05<\/li>\n<\/ul>\n<p>21 GPU nodes each hosting 4 x Nvidia L40S GPUs (48GB GPU RAM) giving a total of 84 L40S GPUs. The node spec is:<\/p>\n<ul>\n<li>4 x NVIDIA L40S 48GB GPU (Ada Lovelace architecture \u2013 hardware v8.9, compute architecture <code>sm_89<\/code>)<\/li>\n<li>2 x 24-core Intel Xeon(R) Gold 6442Y &#8220;Sapphire Rapids&#8221; 2.6GHz<\/li>\n<li>512 GB RAM (host)<\/li>\n<li>28TB local \/tmp storage<\/li>\n<li>CUDA Driver 535.183.01<\/li>\n<\/ul>\n
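<p>Once a job is running you can confirm which GPU type and driver it landed on &#8211; a minimal sketch using the standard Nvidia tool from within a jobscript or interactive session:<\/p>\n<pre># Report the GPU model, memory and driver version for the GPUs assigned to the job\r\nnvidia-smi --query-gpu=name,memory.total,driver_version --format=csv\r\n<\/pre>\n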
<h3>Fast NVMe storage on the node<\/h3>\n<p>The very fast, local-to-node NVMe storage is available as <code>$TMPDIR<\/code> on each node. This environment variable gives the name of a temporary directory which is created by the batch system at the start of your job. You must access this from your jobscript &#8211; i.e., <em>on the node<\/em>, not on the login node. See <a href=\"#nvme\">below<\/a> for advice on how to use this in your jobs.<\/p>\n<p><strong>This directory (and all files in it) will be deleted automatically at the end of your job by the batch system.<\/strong><\/p>\n<p>Reminder: The above storage area is local to the compute node where your job is running. You will <em>not<\/em> be able to access the files in the temporary storage from the login node.<\/p>\n<div class=\"hint\"><em>Batch jobs running on the GPU nodes have a maximum runtime of <strong>4 days<\/strong>.<br \/>\nInteractive GPU jobs have a maximum runtime of <strong>1 day<\/strong>.<\/em><\/div>\n<h2>Job Basics<\/h2>\n<p>Batch and interactive jobs can be run. You must specify how many GPUs your job requires <strong>AND<\/strong> how many CPU cores you need for the <em>host<\/em> code.<\/p>\n<p>A job can use up to 12 CPU cores <em>per<\/em> A100, A100_40G or L40S GPU. See below for example jobscripts.<\/p>\n<p>A GPU jobscript should be of the form:<\/p>\n<pre>#!\/bin\/bash --login\r\n### <strong>Choose ONE of the following partitions depending on your permitted access<\/strong>\r\n#SBATCH -p gpuA              # A100 (80GB) GPUs\r\n#SBATCH -p gpuA40GB          # A100 (40GB) GPUs\r\n#SBATCH -p gpuL              # L40S GPUs\r\n### Required flags\r\n#SBATCH -G <em>N<\/em>                 # (or --gpus=<em>N<\/em>) Number of GPUs \r\n#SBATCH -t 1-0               # Wallclock timelimit (1-0 is one day, 4-0 is max permitted)\r\n### Optional flags\r\n#SBATCH -n <em>numcores<\/em>          # (or --ntasks=) Number of CPU (host) cores (default is 1)\r\n\r\nmodule purge\r\nmodule load libs\/cuda\r\n<\/pre>\n<p>See <a href=\"#simplejob\">below<\/a> for a simple GPU job that you can run.<\/p>\n<h3>Runtime Limits<\/h3>\n<p>The maximum runtimes on the GPUs are as follows:<\/p>\n<ul>\n<li>batch jobs: 4 days<\/li>\n<li>interactive jobs: 1 day<\/li>\n<\/ul>\n<h3>CUDA Libraries<\/h3>\n<p>You will most likely need the CUDA software environment for your job, whether your application is pre-compiled (e.g., a python app) or an application you have written yourself and compiled using the Nvidia <code>nvcc<\/code> compiler. Please see our <a href=\"\/csf3\/software\/libraries\/cuda\">CUDA libraries documentation<\/a> for advice on compiling your own code.<\/p>\n<p>To always use the most up-to-date version installed use:<\/p>\n<pre># The main CUDA library and compiler (other libs have separate modulefiles - see below)\r\nmodule load libs\/cuda\r\n\r\n# Alternatively use the Nvidia HPC SDK which provides a complete set of CUDA libraries and tools\r\nmodule load libs\/nvidia-hpc-sdk\r\n<\/pre>\n<p>Use <code>module show libs\/cuda<\/code> to see what version is provided.<\/p>\n<p>If your application requires a specific version, or you want to fix on a specific version for reproducibility reasons, use:<\/p>\n<pre>module load libs\/cuda\/12.8.1\r\nmodule load libs\/cuda\/12.4.1        # A100 only. Please also load at least compilers\/gcc\/6.4.0\r\n\r\n# Older versions from CSF3 (SGE) are also available, but we recommend using the newer versions.\r\n# To see available versions:\r\nmodule avail libs\/cuda\r\n<\/pre>\n<p>The Nvidia cuDNN, NCCL and TensorRT libraries are also available. See:<\/p>\n<pre>module avail libs\/cuDNN\r\nmodule avail libs\/nccl\r\nmodule avail libs\/tensorrt\r\n<\/pre>\n<p>For more information on available libraries and how to compile CUDA code please see our <a href=\"\/csf3\/software\/libraries\/cuda\/\">CUDA page<\/a>.<\/p>\n
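<p>As a quick illustration of compiling your own CUDA code for the GPUs described above &#8211; a minimal sketch (<code>myapp.cu<\/code> is a placeholder source file; see the CUDA page for full guidance):<\/p>\n<pre>module load libs\/cuda\/12.8.1\r\n\r\n# Build for both the A100 (sm_80) and L40S (sm_89) compute architectures\r\nnvcc -O2 -gencode arch=compute_80,code=sm_80 \\\r\n         -gencode arch=compute_89,code=sm_89 \\\r\n         -o myapp myapp.cu\r\n<\/pre>\n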
<h3>Which GPUs will your job use (CUDA_VISIBLE_DEVICES)<\/h3>\n<p>When a job or interactive session runs, <strong>the batch system will set the environment variable <code>$CUDA_VISIBLE_DEVICES<\/code> to a comma-separated list of GPU IDs assigned to your job<\/strong>. The IDs <strong>always begin<\/strong> at 0: a single-GPU job sees ID 0, a two-GPU job sees 0,1, a three-GPU job sees 0,1,2, and so on. This differs from the SGE batch system, where the IDs did not always begin at zero.<\/p>\n<p>The CUDA library will read this variable automatically and so most CUDA applications already installed on the CSF will simply use the correct GPUs.<\/p>\n<p>The <code>SLURM_GPUS<\/code> variable gives the number of GPUs you requested for your job.<\/p>\n
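<p>Because the GPUs run in &#8220;DEFAULT&#8221; compute mode, one possible pattern is to start one background process per assigned GPU from within your jobscript &#8211; a minimal sketch (<code>gpuApp<\/code> is a placeholder application):<\/p>\n<pre># Start one copy of the app on each GPU assigned to this job\r\nfor id in ${CUDA_VISIBLE_DEVICES\/\/,\/ }; do\r\n    CUDA_VISIBLE_DEVICES=$id .\/gpuApp &amp;\r\ndone\r\nwait    # Wait for all of the background processes to finish\r\n<\/pre>\n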
<h3 id=\"simplejob\">A Simple First Job &#8211; deviceQuery<\/h3>\n<p>Create a jobscript as follows:<\/p>\n<pre class=\"slurm\">#!\/bin\/bash --login\r\n#SBATCH -G 1       # 1 GPU\r\n#SBATCH -t 5       # Job will run for at most 5 minutes\r\n#SBATCH -n 8       # (or --ntasks=) Optional number of cores. The amount of host RAM\r\n                   # available to your job is affected by this setting.\r\n\r\necho \"Job is using $SLURM_GPUS GPU(s) with ID(s) $CUDA_VISIBLE_DEVICES and $SLURM_NTASKS CPU core(s)\"\r\n\r\n# Get the CUDA software libraries and applications \r\nmodule purge\r\nmodule load libs\/cuda\r\n\r\n# Run the Nvidia app that reports GPU statistics\r\ndeviceQuery\r\n<\/pre>\n<p>Submit the job using <code>sbatch jobscript<\/code>. It will print out hardware statistics about the GPU device.<\/p>\n<p>See below for more complex jobscripts.<\/p>\n<p><a name=\"nvme\"><\/a><\/p>\n<h3>NVMe fast local storage<\/h3>\n<p>The GPU host nodes contain a 1.6TB NVMe storage card. This is faster than SSD storage (and faster than your <em>scratch<\/em> area and the <em>home<\/em> storage area).<\/p>\n<p>This extra storage on the GPU nodes is accessible via the environment variable <code>$TMPDIR<\/code>:<\/p>\n<pre>cd $TMPDIR\r\n<\/pre>\n<p>This will access a private directory, which is specific to your job, in the <code>\/tmp<\/code> area <em>on the compute node<\/em> where your job is running (please do not use <code>\/tmp<\/code> directly).<\/p>\n<p>The actual name of the directory contains your job id number for the current job, so it will be unique to each job. It will be something like <code>\/tmp\/slurm.4619712<\/code>, but you can always use the <code>$TMPDIR<\/code> environment variable to access this rather than the actual directory name.<\/p>\n<p><strong>This directory (and all files in it) will be deleted automatically at the end of your job by the batch system.<\/strong><\/p>\n<p>It is highly recommended (especially for machine learning workloads) that you copy your data to <code>$TMPDIR<\/code> at the start of the job, process it from there and copy any results back to your <code>~\/scratch<\/code> area at the end of the job. If your job performs a lot of I\/O (e.g., reading large datasets, writing results) then doing so from <code>$TMPDIR<\/code> on the GPU nodes will be faster. Even with the cost of copying data to and from the NVMe cards (<code>$TMPDIR<\/code>), using this area during the job usually provides a good speed-up.<\/p>\n<p>Remember that <code>$TMPDIR<\/code> is <em>local<\/em> to the node. So after your job has finished, you will not be able to access any files saved on the GPU node&#8217;s NVMe drive from the login node (i.e., <code>$TMPDIR<\/code> on the login node points to the login node&#8217;s local hard-disk, whereas <code>$TMPDIR<\/code> on the GPU node points to the GPU node&#8217;s local NVMe drive.) So you <em>must<\/em> ensure you do any file transfers back to the usual <code>~\/scratch<\/code> area (or your <em>home<\/em> area) <em>within the jobscript<\/em>.<\/p>\n<p>Here is an example of copying data to the <code>$TMPDIR<\/code> area at the start of the job, processing the data and then cleaning up at the end of the job:<\/p>\n<pre class=\"slurm\">#!\/bin\/bash --login \r\n#SBATCH -p gpuX              # Select the type of GPU (where X = A, L or A40GB)\r\n#SBATCH -G 1                 # 1 GPU\r\n#SBATCH -n 8                 # Select the no. of CPU (host) cores\r\n#SBATCH -t 2-0               # A wallclock limit is required. Max permitted is 4 days (4-0).\r\n\r\nmodule purge\r\nmodule load libs\/cuda\/12.8.1\r\nmodule load your\/cuda\/app\r\n\r\n# Copy a directory of files from scratch to the GPU node's local NVMe storage\r\ncp -r ~\/scratch\/dataset1\/ $TMPDIR\r\n\r\n# Process the data with a GPU app, from within the local NVMe storage area\r\ncd  $TMPDIR\/dataset1\/\r\nsome_GPU_app  -i input.dat  -o results.dat\r\n\r\n# Copy the result file back to the main scratch area\r\ncp results.dat ~\/scratch\/dataset1\/\r\n\r\n# Or to copy an entire directory back:\r\ncp -r resultsdir ~\/scratch\/dataset1\/\r\n\r\n# The batch system will automatically delete the contents of $TMPDIR at the end of your job.\r\n<\/pre>\n<p>The above jobscript can be in your <em>home<\/em> or <em>scratch<\/em> storage. Submit it from there.<\/p>\n
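<p>Two small, optional additions to the jobscript above that can help when staging data &#8211; a hedged sketch (directory names are placeholders): check the free space before copying, and stream datasets that contain very many small files through <code>tar<\/code> rather than copying them file-by-file:<\/p>\n<pre># Check how much space is free in the job's $TMPDIR area\r\ndf -h $TMPDIR\r\n\r\n# Datasets made up of many small files often stage faster as a single tar stream\r\ntar -C ~\/scratch -cf - dataset1 | tar -C $TMPDIR -xf -\r\n<\/pre>\n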
<h3>Etiquette<\/h3>\n<p>All users are reminded to log out of their interactive GPU session when it is no longer required. This will free up the GPU for other users. If an interactive GPU session is found to be idle for significant periods, making no use of the GPU, it may be killed. Interactive sessions should not be used to reserve a GPU for future use &#8211; only request a GPU when you need to use it.<\/p>\n<p>Batch jobs that only use CPU cores should not be submitted to GPU nodes. If such jobs are found they will be killed and access to GPU nodes may be removed. There are plenty of CPU-only nodes on which such jobs will run.<\/p>\n<h2>Monitoring GPU jobs<\/h2>\n<p>You can now <code>ssh<\/code> to the compute-node where the GPU job is running &#8211; your <code>ssh<\/code> session will be <em>adopted<\/em> into the job. You can only <code>ssh<\/code> to a compute-node where you have a running job!<\/p>\n<pre>\r\n# On the login node, find out where your job is running:\r\nsqueue\r\n  JOBID  PRIORITY  PARTITION  NAME   USER      ST     ...      NODELIST\r\n 123456  0.000054  gpuA       myjob  mabcxyz1  R      ...      <strong>node860<\/strong>\r\n\r\n# Now ssh to the node:\r\nssh <strong>node860<\/strong>\r\n\r\n# Now run nvitop or nvidia-smi, or other GPU monitoring \/ debugging tools\r\nmodule load tools\/bintools\/nvitop\r\nnvitop\r\n\r\nnvidia-smi\r\n\r\n# To return to the login node\r\nexit\r\n<\/pre>\n<p>Note that <code>nvitop<\/code> will download the necessary python packages the first time you use it. These will be stored in your <code>~\/.cache\/uv\/<\/code> directory, but they do not take up a lot of storage.<\/p>\n<p>Alternatively, you can use <code>srun<\/code> to access the node where your job is running by specifying the JOBID of the job:<\/p>\n<pre>\r\n# Login to the node where your job is running (will only give you access to your job's GPU)\r\nsrun --jobid=<em>12345<\/em> --pty bash\r\n\r\n# Use the standard Nvidia app:\r\nnvidia-smi\r\n\r\n# Or use nvitop:\r\nmodule load tools\/bintools\/nvitop\r\nnvitop\r\n\r\n# To return to the login node\r\nexit\r\n<\/pre>\n
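<p>If you want a continuously refreshing view without <code>nvitop<\/code>, one possibility (a minimal sketch) is the built-in loop flag of <code>nvidia-smi<\/code>:<\/p>\n<pre># Refresh the nvidia-smi output every 5 seconds (Ctrl-C to stop)\r\nnvidia-smi -l 5\r\n<\/pre>\n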
<h2>Batch Jobs &#8211; Example Jobscripts<\/h2>\n<p>The following section provides sample jobscripts for various combinations of the number of GPUs and CPU cores requested.<\/p>\n<p>Note that in the examples below, we load modulefiles inside the jobscript, rather than on the login node. This is so we have a complete record in the jobscript of how we ran the job.<\/p>\n<h3>Single GPU, Single CPU-core<\/h3>\n<p>The simplest case &#8211; a single-GPU, single-CPU-core jobscript:<\/p>\n<pre>#!\/bin\/bash --login\r\n#SBATCH -p gpuL               # L40S GPUs\r\n#SBATCH -G 1                  # 1 GPU\r\n#SBATCH -t 1-0                # Wallclock limit (1-0 is 1 day, 4-0 is the max permitted)\r\n\r\n# Latest version of CUDA (add any other modulefiles you require)\r\nmodule purge\r\nmodule load libs\/cuda\r\n\r\necho \"Job is using $SLURM_GPUS GPU(s) with ID(s) $CUDA_VISIBLE_DEVICES and $SLURM_NTASKS CPU core(s)\"\r\n\r\n# Run an application (this Nvidia app will report info about the GPU). Replace with your app.\r\ndeviceQuery\r\n<\/pre>\n<h3>Single GPU, Multi CPU-cores<\/h3>\n<p>Even when using a single GPU, you may need more than one CPU core if your host-code uses OpenMP, for example, to do some parallel processing on the CPU. You can request up to 12 CPU cores <em>per<\/em> A100 &#038; L40S GPU. For example:<\/p>\n<pre>#!\/bin\/bash --login\r\n#SBATCH -p gpuL               # L40S GPUs\r\n#SBATCH -G 1                  # 1 GPU\r\n#SBATCH -t 1-0                # Wallclock limit (1-0 is 1 day, 4-0 is the max permitted)\r\n#SBATCH -n 1                  # One Slurm task\r\n#SBATCH -c 8                  # 8 CPU cores available to the host code.\r\n                              # Can use up to 12 CPUs with an A100 GPU.\r\n                              # Can use up to 12 CPUs with an L40S GPU.\r\n\r\n# Latest version of CUDA\r\nmodule purge\r\nmodule load libs\/cuda\r\n\r\necho \"Job is using $SLURM_GPUS GPU(s) with ID(s) $CUDA_VISIBLE_DEVICES and $SLURM_CPUS_PER_TASK CPU core(s)\"\r\n\r\n# This example uses OpenMP for multi-core host code - tell OpenMP how many CPU cores to use.\r\nexport OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK\r\n\r\n.\/mySimpleGPU_OpenMP_app\r\n<\/pre>\n<h3>Multi GPU, Single CPU-core<\/h3>\n<p>A multi-GPU job should request the required number of GPUs and, optionally, up to 12 CPU cores <em>per<\/em> A100 &#038; L40S GPU.<\/p>\n<p>For example a 2-GPU job that runs serial host code on one CPU core would be:<\/p>\n<pre>#!\/bin\/bash --login\r\n#SBATCH -p gpuL               # L40S GPUs\r\n#SBATCH -G 2                  # 2 GPUs\r\n#SBATCH -n 1                  # One Slurm task (the serial host code runs on one CPU core)\r\n\r\n# Latest version of CUDA\r\nmodule purge\r\nmodule load libs\/cuda\r\n\r\necho \"Job is using $SLURM_GPUS GPU(s) with ID(s) $CUDA_VISIBLE_DEVICES and $SLURM_NTASKS CPU core(s)\"\r\n\r\n.\/myMultiGPUapp.exe\r\n<\/pre>\n<h3>Multi GPU, Multi CPU-cores<\/h3>\n<p>Finally a multi-GPU job that also uses multiple CPU cores for the host code (up to 12 CPUs <em>per<\/em> A100 &#038; L40S GPU) would be:<\/p>\n<pre>#!\/bin\/bash --login\r\n#SBATCH -p gpuL               # L40S GPUs\r\n#SBATCH -G 2                  # 2 GPUs\r\n#SBATCH -n 1                  # One Slurm task\r\n#SBATCH -c 16                 # 16 CPU cores available to the host code\r\n                              # Can use up to 12 CPUs per GPU with an A100 GPU.\r\n                              # Can use up to 12 CPUs per GPU with an L40S GPU.\r\n\r\n# Latest version of CUDA\r\nmodule purge\r\nmodule load libs\/cuda\r\n\r\necho \"Job is using $SLURM_GPUS GPU(s) with ID(s) $CUDA_VISIBLE_DEVICES and $SLURM_NTASKS x $SLURM_CPUS_PER_TASK CPU core(s)\"\r\n\r\n# This example uses OpenMP for multi-core host code - tell OpenMP how many CPU cores to use.\r\nexport OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK\r\n\r\n# Will use $SLURM_CPUS_PER_TASK CPU cores via OpenMP\r\n.\/myMultiGPU_OpenMP_app\r\n<\/pre>\n<h3>Multi GPU, Multi CPU-cores for MPI Apps<\/h3>\n<p>Multi-GPU applications are often implemented using the MPI library &#8211; each MPI process (aka <em>rank<\/em>) uses a GPU to speed up its computation.<\/p>\n<p>The GPUs (in Slurm) run in <code>Default<\/code> compute mode, meaning <em>multiple processes<\/em> can use a GPU at any one time. However, other users&#8217; jobs will NOT be able to access your job&#8217;s GPUs. But this allows you to run multiple processes on your assigned GPUs.<\/p>\n<p>You can also run processes on multiple GPUs, if your job has requested more than one GPU.<\/p>\n<p>The following CUDA-aware versions of the OpenMPI library are available. 
This will usually give better performance when your application uses MPI to transfer data from one GPU to another (note that the openmpi modulefile will automatically load the cuda modulefile):<\/p>\n<pre># GCC Compiler\r\nmodule load mpi\/gcc\/openmpi\/5.0.7-cuda-gcc-14.2.0    # CUDA 12.8.1\r\n\r\n# Intel Compiler\r\nmodule load mpi\/intel-oneapi-2024.2.0\/openmpi\/5.0.7-cuda    # CUDA 12.8.1\r\n<\/pre>\n<p>Note that when running multi-GPU jobs using MPI you usually start one MPI process per GPU. For example:<\/p>\n<pre>#!\/bin\/bash --login\r\n#SBATCH -p gpuL      # L40S GPUs\r\n#SBATCH -G 4         # A 4-GPU request (<strong>Note: not all users have rights to run 4 GPUs.<\/strong>)\r\n#SBATCH -n 4         # 4 CPU (host) cores. We'll run 4 MPI processes.\r\n#SBATCH -t 1-0       # A 1-day wallclock limit. Max permitted is 4-0 (4 days.)\r\n\r\n# MPI library (which also loads the cuda modulefile)\r\nmodule purge\r\nmodule load mpi\/gcc\/openmpi\/5.0.7-cuda-gcc-14.2.0\r\n\r\necho \"Job is using $SLURM_GPUS GPU(s) with ID(s) $CUDA_VISIBLE_DEVICES and $SLURM_NTASKS CPU core(s)\"\r\n\r\n# In this example we start one MPI process per GPU. We could use $SLURM_NTASKS or $SLURM_GPUS (both = 4)\r\n# It is assumed the application will ensure each MPI process uses a different GPU. For example\r\n# MPI rank 0 will use GPU 0, MPI rank 1 will use GPU 1 and so on.\r\nmpirun -n $SLURM_GPUS .\/myMultiGPU_MPI_app\r\n\r\n# If your application does <em>not<\/em> map MPI ranks to GPUs correctly, you can try the following method\r\n# where we explicitly inform each rank which GPU to use via the CUDA_VISIBLE_DEVICES variable\r\n# (each \":\"-separated app context starts one rank):\r\nmpirun -n 1 -x CUDA_VISIBLE_DEVICES=0 .\/myMultiGPU_MPI_app : \\\r\n       -n 1 -x CUDA_VISIBLE_DEVICES=1 .\/myMultiGPU_MPI_app : \\\r\n       -n 1 -x CUDA_VISIBLE_DEVICES=2 .\/myMultiGPU_MPI_app : \\\r\n       -n 1 -x CUDA_VISIBLE_DEVICES=3 .\/myMultiGPU_MPI_app\r\n<\/pre>\n
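<p>A common alternative &#8211; a hedged sketch, assuming the OpenMPI modulefiles above &#8211; is a small wrapper script that maps each MPI rank to one GPU via OpenMPI&#8217;s <code>OMPI_COMM_WORLD_LOCAL_RANK<\/code> variable (the wrapper name is a placeholder; make it executable with <code>chmod +x gpu_bind.sh<\/code>):<\/p>\n<pre>#!\/bin\/bash\r\n# gpu_bind.sh - give each local MPI rank its own GPU (GPU IDs begin at 0, matching the rank numbers)\r\nexport CUDA_VISIBLE_DEVICES=$OMPI_COMM_WORLD_LOCAL_RANK\r\nexec \"$@\"\r\n\r\n# Then, in the jobscript, launch the real app through the wrapper:\r\n# mpirun -n $SLURM_GPUS .\/gpu_bind.sh .\/myMultiGPU_MPI_app\r\n<\/pre>\n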
<p>It is also possible for each MPI process to be multi-threaded (implemented using OpenMP, for example, to create multiple threads).<\/p>\n<p>An alternative method, which allows multiple MPI processes to run on the <em>same<\/em> GPU, is now available &#8211; please see the section below on the <a href=\"#mps\">Nvidia MPS<\/a> facility.<\/p>\n<h2 id=\"gpuinter\">Interactive Jobs<\/h2>\n<p>You mainly use interactive jobs to run a GPU app that has a GUI, or to log in to a GPU node to do app development and testing.<\/p>\n<p>Interactive jobs should be run using <code>srun<\/code> (not <code>sbatch<\/code>) from the login node as follows.<\/p>\n<div class=\"note\">We <strong>strongly advise<\/strong> that you use batch jobs rather than interactive jobs. Provided you have batch jobs in the queue, ready and waiting to be run, the system can select your jobs 24 hours a day. But interactive jobs require you to be logged in to the CSF and working at the terminal. You will get more work done on the system using batch jobs &#8211; the batch queues never need to go to sleep!<\/div>\n<h3>Single GPU, Single CPU-core logging in to GPU node<\/h3>\n<p>Here we request an interactive session using 1 GPU and 1 CPU core, logging in to the node:<\/p>\n<pre># Using long flag names\r\nsrun --partition=gpuL --gpus=1 --ntasks=1 --time=1-0 --pty bash\r\n\r\n# Using short flag names\r\nsrun -p gpuL -G 1 -n 1 -t 1-0 --pty bash\r\n<\/pre>\n<p>The above command will place you in your <em>current<\/em> directory when it logs you in to the GPU node.<\/p>\n<p>GPU <code>srun<\/code> jobs are limited to 24 hours.<\/p>\n<h3>Multi GPU, Multi CPU-cores logging in to GPU node<\/h3>\n<p>Here we start an interactive session requesting 2 GPUs and 4 CPU cores, logging in to the node:<\/p>\n<pre># Using long flag names\r\nsrun --partition=gpuL --gpus=2 --ntasks=4 --time=1-0 --pty bash\r\n\r\n# Using short flag names\r\nsrun -p gpuL -G 2 -n 4 -t 1-0 --pty bash\r\n<\/pre>\n<p><a name=\"mps\"><\/a><\/p>\n<h2>Nvidia Multi-Process Service (MPS)<\/h2>\n<p>The GPUs (in Slurm) all use <code>Default<\/code> compute mode &#8211; meaning <em>multiple processes<\/em> can access a GPU. This differs from the SGE batch system. Hence the use of MPS is now somewhat redundant &#8211; you can simply start multiple processes on your allocated GPUs.<\/p>\n<p>However, should you wish to use MPS to match your earlier SGE usage, you can do so in Slurm.<\/p>\n<p>The Nvidia <a href=\"https:\/\/docs.nvidia.com\/deploy\/mps\/index.html\">Multi-Process Service<\/a> (MPS) allows multiple <em>processes<\/em> to use the <em>same<\/em> GPU. You might want to do this for small MPI jobs, where each MPI process does not require the resources of an entire GPU. Hence all of the MPI processes could &#8220;fit&#8221; on a single GPU. Alternatively, if you have a lot of small jobs to run, you might be able to start multiple copies of the executable, all using the <em>same<\/em> GPU. Using MPI (<code>mpirun<\/code>) would be one method of doing this, even if the app itself is not an MPI job.<\/p>\n<p>An extra flag is required to start the MPS facility on the node allocated to your job. Hence you should add:<\/p>\n<pre>--extra=mps<\/pre>\n<p>to your jobscript (or <code>srun<\/code> command.)<\/p>\n<p>Note that you should still request enough CPU cores on which to run multiple processes. Even a GPU app does some work on the CPU and so if you are going to run several copies of an app, you should request the correct number of CPU cores so that each instance of your app has its own core(s) to run on. The examples below request 8 CPU cores (<code>-n 8<\/code>) so that we can run 8 copies of a GPU-capable application.<\/p>\n
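<p>If your copies are independent processes rather than MPI ranks, a plain bash loop inside the jobscript is one possible approach &#8211; a minimal sketch (<code>myGPUapp<\/code> and its input \/ output files are placeholders; the job would request 1 GPU, 8 CPU cores and the <code>--extra=mps<\/code> flag described above):<\/p>\n<pre># Run 8 copies of a GPU app on the same GPU, one copy per requested CPU core\r\nfor i in $(seq 1 8); do\r\n    .\/myGPUapp -i input${i}.dat -o results${i}.dat &amp;\r\ndone\r\nwait    # Wait for all 8 copies to finish before the job ends\r\n<\/pre>\n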
<p>The following example demonstrates running the <code>simpleMPI<\/code> example found in the CUDA SDK on a single GPU. Multiple MPI processes are started and they all run on the same GPU. Without MPS, a <em>GPU per MPI process<\/em> would be required (see later for what happens if we run the same job without using MPS.)<\/p>\n<pre>#!\/bin\/bash --login\r\n#SBATCH -p gpuL                   # L40S GPUs\r\n#SBATCH -G 1                      # 1 GPU\r\n#SBATCH -n 8                      # We want a CPU core for each process (see below)\r\n<strong>#SBATCH --extra=mps               # Extra flag to enable Nvidia MPS<\/strong>\r\n\r\n# Load a CUDA-aware MPI modulefile which will also load a cuda modulefile\r\nmodule purge\r\nmodule load mpi\/gcc\/openmpi\/5.0.7-cuda-gcc-14.2.0\r\n\r\n# Let's take a copy of the already-compiled simpleMPI example (the whole folder)\r\n# Not available in SLURM - TO-DO!\r\ncp -a $CUDA_SDK\/0_Simple\/simpleMPI\/ .\r\ncd simpleMPI\r\n\r\n# Now run <strong>more than 1<\/strong> copy of the app. In fact we run with 8 MPI processes\r\n# (Slurm knows you've requested 8 CPU-cores)\r\n# But we are only using 1 GPU, not 8! So all processes will use the <em>same<\/em> GPU.\r\nmpirun .\/simpleMPI\r\n<\/pre>\n<p>Submit the above jobscript using <code>sbatch <em>jobscript<\/em><\/code>. The job output will be something similar to:<\/p>\n<pre>Running on 8 nodes\r\nAverage of square roots is: 0.667337\r\nPASSED\r\n<\/pre>\n<p>You can also use the MPS facility with interactive jobs:<\/p>\n<pre># At the CSF login node, start an interactive session, requesting one GPU, 8 CPU cores and enabling the MPS facility\r\n[<em>username<\/em>@login1 [csf3] ~]$ srun -p gpuL -G 1 -n 8 -t 10 --extra=mps --pty bash\r\n\r\n# Wait until you are logged in to a GPU node, then:\r\nmodule purge\r\nmodule load mpi\/gcc\/openmpi\/5.0.7-cuda-gcc-14.2.0\r\ncp -a $CUDA_SDK\/0_Simple\/simpleMPI .\r\ncd simpleMPI\r\n\r\n# Run more MPI processes than the 1 GPU we requested - MPS lets them share the GPU.\r\n# (Slurm knows you've requested 8 CPU-cores, so mpirun starts 8 processes.)\r\nmpirun .\/simpleMPI\r\n\r\n# Return to the login node\r\nexit\r\n<\/pre>\n<h2>Profiling Tools<\/h2>\n<p>A number of profiling tools are available to help analyse and optimize your CUDA applications. We provide instructions on how to run (start) these tools below. Please note that instructions on how to <em>use<\/em> these tools are beyond the scope of this webpage. You should consult the <a href=\"https:\/\/docs.nvidia.com\/cuda\/profiler-users-guide\/index.html#profiling-overview\">Nvidia profiling documentation<\/a> for detailed instructions on how to use the tools listed below.<\/p>\n
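<p>To collect profiling data non-interactively (for example, inside a batch job) the command-line tools can write a report file for later inspection &#8211; a hedged sketch, assuming the tools are on your PATH via the cuda modulefile and that <code>myapp<\/code> is a placeholder application:<\/p>\n<pre>module load libs\/cuda\/12.8.1\r\n\r\n# Nsight Systems CLI: record an application timeline to a report file for later viewing\r\nnsys profile -o myreport .\/myapp\r\n\r\n# Nsight Compute CLI: record kernel-level metrics to a report file\r\nncu -o mykernels .\/myapp\r\n<\/pre>\n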
<p>We give the command name of each tool below. If running the profiler tool through its graphical user interface (GUI) or interactively on the command-line (i.e., not in a batch job, which would collect profiling data without any interaction) then you <strong>must<\/strong> start an interactive session on a backend GPU node using the commands:<\/p>\n<pre># On the CSF login node, request an interactive session on a GPU node\r\nsrun -p gpuL -G 1 -n 1 -t 1-0 --pty bash  # Can instead use gpuA for the A100 GPUs\r\n\r\n# Wait to be logged in to the node, then run:\r\nmodule load libs\/cuda\/12.8.1                # Choose your required version\r\n<em>name-of-profiler-tool<\/em>                         # See below for the command names\r\n<\/pre>\n<h3>Nsight Compute<\/h3>\n<p>The Nvidia <a href=\"https:\/\/docs.nvidia.com\/nsight-compute\/\">Nsight Compute profiling tools<\/a> are installed as of toolkit version 10.0.130 and later.<br \/>\nTo run the profiler for CUDA toolkit versions up to 11.7.0 use:<\/p>\n<pre>module load libs\/cuda\/11.7.0\r\n# Command-line version\r\nnv-nsight-cu-cli\r\n# GUI version\r\nsrun-x11 --partition=xxxX --gpus=x --ntasks=x --time=x-x\r\nnv-nsight-cu\r\n<\/pre>\n<p>From CUDA toolkit version 12.0.1 or later use:<\/p>\n<pre>module load libs\/cuda\/12.8.1\r\n# Command-line version\r\nncu\r\n# GUI version\r\nsrun-x11 --partition=xxxX --gpus=x --ntasks=x --time=x-x\r\nncu-ui\r\n<\/pre>\n<h3>Nsight Systems<\/h3>\n<p>The Nvidia <a href=\"https:\/\/developer.nvidia.com\/nsight-systems\">Nsight Systems<\/a> performance analysis tool, designed to visualize an application\u2019s algorithms, is installed as of toolkit version 10.1.168. To run the profiler:<\/p>\n<pre>nsight-sys\r\n<\/pre>\n<p>Nvidia recommend you use the above newer tools for profiling rather than the following older tools, although these tools are still available and may be familiar to you.<\/p>\n<h3>Visual Profiler<\/h3>\n<p>The Nvidia <a href=\"https:\/\/docs.nvidia.com\/cuda\/profiler-users-guide\/index.html#visual\">Visual Profiler<\/a> is installed as of toolkit version 7.5.18 and later. To run the profiler:<\/p>\n<pre>nvvp\r\n<\/pre>\n<p>Note that the Nvidia Visual Profiler <code>nvvp<\/code> can be used to view results collected by the <code>nvprof<\/code> command-line tool (see below). Hence you could use the <code>nvprof<\/code> command in a batch job, which will save profiling data to file, then view the results at a later time using the <code>nvvp<\/code> tool.<\/p>\n<h3>nvprof Command-line Profiler<\/h3>\n<p>The Nvidia command-line <a href=\"https:\/\/docs.nvidia.com\/cuda\/profiler-users-guide\/index.html#nvprof-overview\">nvprof profiler<\/a> is installed as of toolkit version 7.5.18 and later. To run the profiler:<\/p>\n<pre>nvprof\r\n<\/pre>\n<p>Note that the Nvidia Visual Profiler <code>nvvp<\/code> (see above) can be used to view results collected by the <code>nvprof<\/code> command-line tool. Hence you could use the <code>nvprof<\/code> command in a batch job, which will save profiling data to file, then view the results at a later time using the <code>nvvp<\/code> tool.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Access This page covers the Nvidia A100 (80GB), A100 (40GB) and L40S GPUs in Slurm. Notes for Slurm users &#8211; please read: All users have access to the A100 (80GB) and L40S GPUs &#8211; you do not need to submit a ticket to request access. The A100 (40GB) GPUs are limited to a small number of users from a specific group. The H200 GPUs have a specific access policy (please check before requesting access.) 
See.. <a href=\"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/batch-slurm\/gpu-jobs-slurm\/\">Read more &raquo;<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"parent":9105,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-9711","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/wp-json\/wp\/v2\/pages\/9711","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/wp-json\/wp\/v2\/comments?post=9711"}],"version-history":[{"count":23,"href":"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/wp-json\/wp\/v2\/pages\/9711\/revisions"}],"predecessor-version":[{"id":11929,"href":"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/wp-json\/wp\/v2\/pages\/9711\/revisions\/11929"}],"up":[{"embeddable":true,"href":"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/wp-json\/wp\/v2\/pages\/9105"}],"wp:attachment":[{"href":"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/wp-json\/wp\/v2\/media?parent=9711"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}