Nvidia GPU jobs
Access
This page covers the Nvidia v100, A100 and L40S GPUs.
Access to the GPU nodes is not automatic.
Jobs will not run unless you have been added to a GPU group.
Please read about the different levels of access below then contact us via our help form to request access.
Your jobs will wait forever if you have not been granted GPU access!
Contributor Access
If you are a member of a contributing research group (i.e., your PI / Supervisor has funded some GPUs in the system) then please email your request to use the GPUs to the email address below, cc-ing your supervisor.
If your research group is interested in contributing funds for GPU nodes please contact us via our help form for more information.
Not sure if your PI/supervisor has contributed? We can check this for you – please contact us to request GPU access. If you are not a member of a contributing group, most people will be able to use the free-at-point-of-use contribution.
Free at Point of Use Access
There are also free-at-the-point-of-use v100 and A100 GPUs available, with job limits in place. These GPUs were funded by the University as part of the Research Lifecycle Programme, in particular Change Project M&K.
If you would like to access these GPUs please log a ticket via our help form and provide some brief information about what you wish to use them for.
Free-at-point-of-use job limits v100 GPUs: Each person in the free-at-point-of-use group may have two v100 GPUs in use at any one time, provided there are resources available. The maximum number that can be used by this group in total at any one time is 36, again, provided there are resources available.
A100 GPU access
L40S GPU access
Updates
September 2024
New Nvidia L40s GPUs available. PLEASE NOTE: access is currently limited to people associated with Prof. Magnus Rattray and/or Dr. Syed Bakker as part of the Bioinformatics Core Facility. All requests for access need to be approved by Prof. Rattray / Syed.
July 2024
Essential maintenance to the cluster is required this month. As such, we will be draining nodes of jobs in batches, performing the maintenance on those nodes, then putting them back into service. This will reduce the availability of nodes at any one time, but ensures there are always some nodes available to you.
We strongly advise that you use batch jobs rather than interactive sessions at this time. Batch jobs can be selected to run at any time, 24 hours a day, whereas interactive jobs can only be used when you are logged in to the cluster! By submitting batch jobs you increase the amount of work you can do on the system – the batch queues never need to go to sleep!
May 2024
Nvidia CUDA driver on the GPU nodes updated to allow use of the CUDA 12.x toolkit (currently 12.2.2). The driver version of upgraded nodes is 535.154.05.
Feb 2023
Nvidia CUDA driver on the GPU nodes updated to allow use of the CUDA 12.x toolkit (currently 12.0.1). The driver version of upgraded nodes is 525.85.12.
All nodes have been upgraded. You no longer need to add -l cuda12 to land on an upgraded node.
The default CUDA toolkit remains at 11.6.2. Hence:
module load libs/cuda # Loads the current default on any GPU node - 11.6.2
To use the 12.2.2 toolkit, please specify the version explicitly on your module load command:
module load libs/cuda/12.2.2
# You may also need a GCC compiler module to use at least GCC v6. The CUDA samples in this version
# have been compiled with GCC 6.4.0, so if you are testing those, please also do:
module load compilers/gcc/6.4.0
To use the 12.0.1 toolkit, please specify the version explicitly on your module load command:
module load libs/cuda/12.0.1 # To use the newer toolkit please add the version number
Once all GPU nodes have the new driver installed we will make 12.0.1 the default toolkit version. This will be announced on the CSF MOTD.
May 2022
7th May 2022: The Nvidia CUDA driver on all GPU compute nodes has been upgraded to allow use of the CUDA 11.6.x toolkit. The driver version is now 510.47.03.
Software Applications
A range of GPU capable software is available on the CSF.
List of installed GPU capable software.
List of installed Machine Learning specific software.
GPU Hardware and Driver
The CSF contains the following GPU nodes, which offer different types of Nvidia GPUs and host CPUs.
17 GPU nodes each hosting 4 x Nvidia v100 GPUs (16GB GPU RAM) giving a total of 68 v100 GPUs. The node spec is:
- 4 x NVIDIA v100 SXM2 16GB GPU (Volta architecture – hardware v7.0, compute architecture sm_70)
- Some GPU hosts: 2 x 16-core Intel Xeon Gold 6130 “Skylake” 2.10GHz
- Some GPU hosts: 2 x 16-core Intel Xeon Gold 5218 “Cascade Lake” 2.30GHz
- 192 GB RAM (host)
- 1.6TB NVMe local storage, 182GB local SSD storage
- CUDA Driver 535.154.05
16 GPU nodes each hosting 4 x Nvidia A100 GPUs (80GB GPU RAM) giving a total of 64 A100 GPUs. The node spec is:
- 4 x NVIDIA HGX A100 SXM4 80GB GPU (Ampere architecture – hardware v8.0, compute architecture sm_80)
- 2 x 24-core AMD Epyc 7413 “Milan” 2.65GHz
- 512 GB RAM (host)
- 1.6TB local NVMe storage, 364GB local SSD storage
- CUDA Driver 535.154.05
3 GPU nodes each hosting 4 x Nvidia L40S GPUs (48GB GPU RAM) giving a total of 12 L40S GPUs. The node spec is:
- 4 x NVIDIA L40S 48GB GPU (Ada Lovelace architecture – hardware v8.9, compute architecture sm_89)
- 2 x 24-core Intel Xeon(R) Gold 6442Y “Sapphire Rapids” 2.6GHz
- 512 GB RAM (host)
- 28TB local /tmp storage
- CUDA Driver 535.183.01
Fast NVMe storage on the node
The very fast, local-to-node NVMe storage is available as $TMPDIR on each node. This environment variable gives the name of a temporary directory which is created by the batch system at the start of your job. You must access this from your jobscript (i.e., on the node, not on the login node). See below for advice on how to use this in your jobs.
This directory (and all files in it) will be deleted automatically at the end of your job by the batch system.
Reminder: The above storage area is local to the compute node where your job is running. You will not be able to access the files in the temporary storage from the login node.
Interactive GPU jobs have a maximum runtime of 1 day.
Job Basics
Batch and interactive jobs can be run. You must specify how many GPUs your job requires AND how many CPU cores you need for the host code.
A job can use up to 8 CPU cores per v100 GPU or up to 12 CPU cores per A100 or L40S GPU. See below for example jobscripts.
A GPU jobscript should be of the form:
#!/bin/bash --login
#$ -cwd
# Choose ONE of the following depending on your permitted access
# The -l nvidia_v100 (or just v100) or -l nvidia_a100 (or just a100) or -l nvidia_l40s (or just l40s)
# is mandatory. M can be 1 to 4 (depending on user limits). If '=M' is missing
# then 1 will be used.
#$ -l nvidia_v100=M
#$ -l nvidia_a100=M           # RESTRICTED ACCESS - PLEASE SEE EARLIER
#$ -l nvidia_l40s=M           # VERY RESTRICTED ACCESS - PLEASE SEE EARLIER
# The -pe line is optional. Number of CPU cores N can be 2..32 (v100 GPU) or 2..48 (A100 GPU)
# or 2..48 (L40S GPU) i.e., max 8 per v100 GPU, max 12 per A100 or L40S GPU. Will be a serial job if
# this line is missing.
#$ -pe smp.pe N
See below for a simple GPU job that you can run (once your account has had GPU-access enabled.)
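The per-GPU CPU core limits above can be turned into a quick check before you write the -pe line. The following is a minimal sketch only: max_cores is a hypothetical helper name, not a command installed on the CSF, and the limits it uses are simply the 8 (v100) and 12 (A100 or L40S) cores per GPU stated above.

```shell
# Hypothetical helper (not a CSF command): prints the largest N that can be
# used on the '-pe smp.pe N' line for a given GPU type and number of GPUs.
max_cores() {
    local gpu_type=$1 ngpus=$2 per_gpu
    case "$gpu_type" in
        v100|nvidia_v100)                  per_gpu=8  ;;
        a100|nvidia_a100|l40s|nvidia_l40s) per_gpu=12 ;;
        *) echo "unknown GPU type: $gpu_type" >&2; return 1 ;;
    esac
    echo $(( per_gpu * ngpus ))
}

max_cores v100 2    # prints 16
max_cores a100 4    # prints 48
```

For example, a 2 x v100 job could use at most -pe smp.pe 16.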
Runtime Limits
The maximum runtimes on the GPUs are as follows:
- batch jobs: 4 days
- interactive jobs: 1 day
CUDA Libraries
You will most likely need the CUDA software environment for your job, whether your application is pre-compiled (e.g., a python app) or an application you have written yourself and compiled using the Nvidia nvcc compiler. Please see our CUDA libraries documentation for advice on compiling your own code.
To use the default version installed (which is not necessarily the newest version available) use:
# The main CUDA library and compiler (other libs have separate modulefiles - see below)
module load libs/cuda
# Alternatively use the Nvidia HPC SDK which provides a complete set of CUDA libraries and tools
module load libs/nvidia-hpc-sdk
Use module show libs/cuda to see what version is provided.
If your application requires a specific version, or you want to fix on a specific version for reproducibility reasons, use:
module load libs/cuda/12.2.2       # Please also load at least compilers/gcc/6.4.0
module load libs/cuda/12.0.1
module load libs/cuda/11.6.2       # This is the default for 'module load libs/cuda'
module load libs/cuda/11.2.0
module load libs/cuda/11.1.1
module load libs/cuda/11.0.3
module load libs/cuda/10.1.243
module load libs/cuda/10.1.168
module load libs/cuda/9.2.148
module load libs/cuda/9.1.85
module load libs/cuda/9.0.176
module load libs/cuda/8.0.61
module load libs/cuda/7.5.18
# To see available versions:
module avail libs/cuda
The Nvidia cuDNN, NCCL and TensorRT libraries are also available. See:
module avail libs/cuDNN
module avail libs/nccl
module avail libs/tensorrt
For more information on available libraries and how to compile CUDA code please see our CUDA page.
Which GPUs will your job use (CUDA_VISIBLE_DEVICES)
When a job or interactive session runs, the batch system will set the environment variable $CUDA_VISIBLE_DEVICES to a comma-separated list of GPU IDs assigned to your job, where IDs can be one or more of 0,1,2,3 (for example 2 for a single GPU job which is using the GPU with id 2, or 0,3 for a 2-GPU job – the IDs might not be contiguous). The CUDA library will read this variable automatically and so most CUDA applications already installed on the CSF will simply use the correct GPUs.
You may have to tell your application how many GPUs to use (e.g., with a command-line flag – please check the application’s documentation). The batch system sets the variable $NGPUS to the number of GPUs you requested. Both of the environment variables $CUDA_VISIBLE_DEVICES and $NGPUS can be used in your jobscript. See the example jobscripts below for how this can be useful.
When developing your own CUDA applications the device IDs used in the cudaSetDevice() function should run from 0 to NGPUS-1. An ID of 0 means “use the GPU whose ID is listed first in the $CUDA_VISIBLE_DEVICES list” and so on. The CUDA library will then map 0 (and so on) to the correct physical GPU assigned to your job.
If an application insists that you give it a flag specifying GPU IDs to use (some apps do, some don’t) then try using 0 for a single GPU job, 0,1 for a two-GPU job and so on. As with cudaSetDevice(), a value of 0 means “use the GPU whose ID is listed first in the $CUDA_VISIBLE_DEVICES list”, and the CUDA library will map it to the correct physical GPU assigned to your job.
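As an illustration of the mapping described above, the following sketch builds the list of IDs to give to an application from the list the batch system assigned. It is not part of the CSF software: logical_gpu_ids is a hypothetical helper name, and the 0,3 fallback value is only an example so that the snippet can be tried outside a job.

```shell
# Hypothetical helper: given the comma-separated list of physical GPU IDs
# assigned to the job, print the IDs an application should be told to use,
# i.e. 0,1,...,N-1 where N is the number of GPUs in the list.
logical_gpu_ids() {
    local -a physical_ids
    local ids="" i
    IFS=',' read -r -a physical_ids <<< "$1"
    for (( i = 0; i < ${#physical_ids[@]}; i++ )); do
        ids+="${ids:+,}$i"
    done
    echo "$ids"
}

# Inside a job $CUDA_VISIBLE_DEVICES is set by the batch system.
# The fallback value 0,3 is an example only.
assigned=${CUDA_VISIBLE_DEVICES:-0,3}
echo "Physical GPU IDs assigned to the job: $assigned"
echo "IDs to give to the application:       $(logical_gpu_ids "$assigned")"
```

With an assigned list of 0,3 the application would be given 0,1, which the CUDA library then maps to physical GPUs 0 and 3.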
A Simple First Job – deviceQuery
Create a jobscript as follows:
#!/bin/bash --login
#$ -cwd
#$ -l v100           # Will give us 1 GPU to use in our job
                     # No -pe line hence a serial (1 CPU-core) job
                     # Can instead use 'a100' for the A100 GPUs (if permitted!)
                     # Can instead use 'l40s' for the L40S GPUs (if permitted!)

echo "Job is using $NGPUS GPU(s) with ID(s) $CUDA_VISIBLE_DEVICES and $NSLOTS CPU core(s)"

# Get the CUDA software libraries and applications
module load libs/cuda

# Run the Nvidia app that reports GPU statistics
deviceQuery
Submit the job using qsub jobscript. It will print out hardware statistics about the GPU device.
See below for more complex jobscripts.
NVMe fast local storage
The GPU host nodes contain a 1.6TB NVMe storage card. This is faster than SSD storage (and faster than your scratch area and the home storage area).
This extra storage on the GPU nodes is accessible via the environment variable $TMPDIR:
cd $TMPDIR
This will access a private directory, which is specific to your job, in the /tmp area (please do not use /tmp directly).
The actual name of the directory contains your job id number for the current job, so it will be unique to each job. It will be something like /tmp/4619712.1.nvidiagpu.q, but you can always use the $TMPDIR environment variable to access this rather than the actual directory name.
This directory (and all files in it) will be deleted automatically at the end of your job by the batch system.
It is highly recommended (especially for machine learning workloads) that you copy your data to $TMPDIR at the start of the job, process it from there and copy any results back to your ~/scratch area at the end of the job. If your job performs a lot of I/O (e.g., reading large datasets, writing results) then doing so from $TMPDIR on the GPU nodes will be faster. Even with the cost of copying data to and from the NVMe cards ($TMPDIR), using this area during the job usually provides good speed-up.
Remember that $TMPDIR is local to the node. So after your job has finished, you will not be able to access any files saved on the GPU node’s NVMe drive from the login node (i.e., $TMPDIR on the login node points to the login node’s local hard-disk, whereas $TMPDIR on the GPU node points to the GPU node’s local NVMe drive). So you must ensure you do any file transfers back to the usual ~/scratch area (or your home area) within the jobscript.
Here is an example of copying data to the $TMPDIR area at the start of the job, processing the data and then cleaning up at the end of the job:
#!/bin/bash --login
#$ -cwd
#$ -l nvidia_v100         # Can instead use 'nvidia_a100' for the A100 GPUs (if permitted!)
                          # Can instead use 'nvidia_l40s' for the L40S GPUs (if permitted!)

# Copy a directory of files from scratch to the GPU node's local NVMe storage
cp -r ~/scratch/dataset1/ $TMPDIR

# Process the data with a GPU app, from within the local NVMe storage area
cd $TMPDIR/dataset1/
some_GPU_app -i input.dat -o results.dat

# Copy the results back to the main scratch area
cp results.dat ~/scratch/dataset1/

# The batch system will automatically delete the contents of $TMPDIR at the end of your job.
The above jobscript can be in your home or scratch storage. Submit it from there.
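One thing to be aware of with a simple copy, process, copy-back jobscript is that anything left in $TMPDIR is lost if the application fails part-way through, because the batch system deletes the directory when the job ends. The sketch below shows one possible way to guard against that. It is an illustration only: run_staged is a hypothetical helper function (not a CSF command), and the dataset directory, results file name and application are passed in as arguments rather than being known to it.

```shell
# Hypothetical helper: run a command inside a node-local copy of a dataset and
# copy one results file back to the dataset directory, even if the command fails.
#   $1 = dataset directory on shared storage (e.g. ~/scratch/dataset1)
#   $2 = node-local work area (use "$TMPDIR" inside a job)
#   $3 = name of the results file that the command writes
#   remaining arguments = the command to run from inside the local copy
run_staged() {
    local dataset=$1 workdir=$2 results=$3
    shift 3
    local staged
    staged="$workdir/$(basename "$dataset")"

    # Stage the data on to the node-local storage
    cp -r "$dataset" "$workdir/" || return 1

    # Run in a subshell so the caller's working directory is unchanged
    ( cd "$staged" && "$@" )
    local status=$?

    # Copy the results back if the command produced any, even on failure
    if [ -f "$staged/$results" ]; then
        cp "$staged/$results" "$dataset/"
    fi
    return "$status"
}

# In a jobscript this might be called as follows (some_GPU_app is the
# placeholder application name used in the jobscript above):
#   run_staged ~/scratch/dataset1 "$TMPDIR" results.dat some_GPU_app -i input.dat -o results.dat
```

The function returns the exit status of the application, so the jobscript can still detect a failed run after the partial results have been saved.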
Monitoring GPU jobs
We have written a script to help monitor your GPU jobs. You can run the following on the CSF login nodes:
# Run the following commands on the CSF login node

# Get a list of your GPU jobs in the batch system (similar to qstat)
gpustat

# Display the status of the GPUs in use by one of your running jobs (job id needed).
# This shows the default output of nvidia-smi from the GPU node where your job is running.
# Replace jobid with your job id number (e.g., 12345)
gpustat -j jobid

# Display the status of the GPUs in use by one of your job-array tasks (job id & task id needed)
# This shows the default output of nvidia-smi from the GPU node where your job is running.
# Replace jobid with your job id number (e.g., 12345)
# Replace taskid with your job array task id number (e.g., 1)
gpustat -j jobid -t taskid

# Continuously sample the GPU utilization and amount of memory used / free every second.
# This uses nvidia-smi --query-gpu=fields to display various stats in CSV format.
# Press Ctrl+C to stop sampling
gpustat -j jobid [-t taskid] -s 2

# Continuously sample the GPU utilization and memory utilization (time spent reading/writing).
# This uses nvidia-smi pmon to display various stats in CSV format (runs for a max of 5 mins.)
# Press Ctrl+C to stop sampling
gpustat -j jobid [-t taskid] -s 3

# Note that '-s 1' gives the default pretty-printed nvidia-smi output. Use gpustat -h for help.
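If you save CSV samples to a file, a few lines of awk can summarise them. This sketch assumes each line has the form "GPU utilisation, memory used" with no header and no units, which is what nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader,nounits prints. The columns printed by gpustat -s 2 may differ, so please check a sample of the output first. summarise_gpu_samples is a hypothetical name, not a CSF command.

```shell
# Hypothetical helper: read "utilisation, memory-used" samples on stdin and
# print the number of samples, the mean GPU utilisation and the peak memory.
summarise_gpu_samples() {
    awk -F', *' '
        NF >= 2 { util += $1; if ($2 + 0 > peak) peak = $2 + 0; n++ }
        END     { if (n) printf "samples=%d mean_util=%.1f%% peak_mem=%dMiB\n", n, util / n, peak }
    '
}

# Possible usage, where samples.csv is a file of samples you have saved:
#   summarise_gpu_samples < samples.csv
```

A low mean utilisation over a long run can suggest the job is limited by the host code or by I/O rather than by the GPU.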
Etiquette
All users are reminded to log out of their interactive GPU session when it is no longer required. This will free up the GPU for other users. If an interactive GPU session is found to be idle for significant periods, making no use of the GPU, it may be killed. Interactive sessions should not be used to reserve a GPU for future use – only request a GPU when you need to use it.
Batch jobs that only use CPU cores should not be submitted to GPU nodes. If such jobs are found they will be killed and access to GPU nodes may be removed. There are plenty of CPU-only nodes on which jobs will run.
Batch Jobs – Example Jobscripts
The following section provides sample jobscripts for various combinations of number of GPUs requested and CPU cores requested.
Note that in the examples below, we load modulefiles inside the jobscript, rather than on the login node. This is so we have a complete record in the jobscript of how we ran the job. There are two things we need to do to make the module command work inside the jobscript: add --login to the first line, then remove any #$ -V line from the jobscript.
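The two requirements above are easy to forget when adapting an older jobscript. As a minimal sketch, the following function checks a jobscript for both. check_jobscript is a hypothetical helper, not a CSF tool, and it only detects a -V flag that appears on a '#$' line of its own.

```shell
# Hypothetical helper: returns 0 if the jobscript has --login on its first
# line and no '#$ -V' line, otherwise prints what is wrong and returns 1.
check_jobscript() {
    local script=$1 bad=0
    if ! head -n 1 "$script" | grep -q -- '--login'; then
        echo "$script: first line does not contain --login" >&2
        bad=1
    fi
    if grep -Eq '^#\$[[:space:]]+-V([[:space:]]|$)' "$script"; then
        echo "$script: contains a '#\$ -V' line, please remove it" >&2
        bad=1
    fi
    return "$bad"
}

# Possible usage before submitting (jobscript is your own file name):
#   check_jobscript jobscript && qsub jobscript
```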
Single GPU, Single CPU-core
The simplest case – a single-GPU, single-CPU-core jobscript:
#!/bin/bash --login
#$ -cwd
#$ -l nvidia_v100=1       # If no =N given, 1 GPU will be assumed.
                          # (Can use 'v100' instead of 'nvidia_v100'.)
                          # No '#$ -pe' line given means a 1-core job.
                          # Can instead use 'nvidia_a100' for the A100 GPUs (if permitted!)
                          # Can instead use 'nvidia_l40s' for the L40S GPUs (if permitted!)

# Default version of CUDA (add any other modulefiles you require)
module load libs/cuda

echo "Job is using $NGPUS GPU(s) with ID(s) $CUDA_VISIBLE_DEVICES and $NSLOTS CPU core(s)"

# Run an application (this Nvidia app will report info about the GPU). Replace with your app.
deviceQuery
Single GPU, Multi CPU-cores
Even when using a single GPU, you may need more than one CPU core if your host-code uses OpenMP, for example, to do some parallel processing on the CPU. You can request up to 8 CPU cores per v100 GPU and up to 12 CPU cores per A100 or L40S GPU. For example:
#!/bin/bash --login
#$ -cwd
#$ -l v100           # A 1-GPU request (v100 is just a shorter name for nvidia_v100)
                     # Can instead use 'a100' for the A100 GPUs (if permitted!)
                     # Can instead use 'l40s' for the L40S GPUs (if permitted!)
#$ -pe smp.pe 8      # 8 CPU cores available to the host code.
                     # Can use up to 12 CPUs with an A100 GPU.
                     # Can use up to 12 CPUs with an L40S GPU.

# Default version of CUDA
module load libs/cuda

echo "Job is using $NGPUS GPU(s) with ID(s) $CUDA_VISIBLE_DEVICES and $NSLOTS CPU core(s)"

# This example uses OpenMP for multi-core host code - tell OpenMP how many CPU cores to use.
export OMP_NUM_THREADS=$NSLOTS
./mySimpleGPU_OpenMP_app
Multi GPU, Single CPU-core
A multi-GPU job should request the required number of GPUs and optionally up to 8 CPU cores per v100 GPU and up to 12 CPU cores per A100 or L40S GPU.
For example a 2-GPU job that runs serial host code on one CPU core would be:
#!/bin/bash --login
#$ -cwd
#$ -l nvidia_v100=2       # A 2-GPU job. Can use 'v100' instead of 'nvidia_v100'.
                          # Can instead use 'nvidia_a100' for the A100 GPUs (if permitted!)
                          # Can instead use 'nvidia_l40s' for the L40S GPUs (if permitted!)
                          # No -pe smp.pe line means it is a serial job.

# Default version of CUDA
module load libs/cuda

echo "Job is using $NGPUS GPU(s) with ID(s) $CUDA_VISIBLE_DEVICES and $NSLOTS CPU core(s)"

./myMultiGPUapp.exe
Multi GPU, Multi CPU-cores
Finally a multi-GPU job that also uses multiple CPU cores for the host code (up to 8 CPUs per v100 GPU, up to 12 CPUs per A100 or L40S GPU) would be:
#!/bin/bash --login
#$ -cwd
#$ -l v100=4         # A 4-GPU request (v100 is just a shorter name for nvidia_v100)
                     # Can instead use 'a100' for the A100 GPUs (if permitted!)
                     # Can instead use 'l40s' for the L40S GPUs (if permitted!)
#$ -pe smp.pe 32     # Let's use the max 8 CPUs per GPU (32 cores in total)
                     # Can use a maximum of 48 CPUs with 4 x A100 GPUs.
                     # Can use a maximum of 48 CPUs with 4 x L40S GPUs.

# Default version of CUDA
module load libs/cuda

echo "Job is using $NGPUS GPU(s) with ID(s) $CUDA_VISIBLE_DEVICES and $NSLOTS CPU core(s)"

# This example uses OpenMP for multi-core host code - tell OpenMP how many CPU cores to use.
export OMP_NUM_THREADS=$NSLOTS
./myMultiGPU_OpenMP_app
Multi GPU, Multi CPU-cores for MPI Apps
Multi-GPU applications are often implemented using the MPI library – each MPI process (aka rank) uses a GPU to speed up its computation.
Our GPUs are run in EXCLUSIVE_PROCESS mode, meaning only one process can use a GPU at any one time. This is to protect your GPUs from other jobs – if another job accidentally tries to use a GPU in use by your job, the other job will fail. But it also means that, when running an MPI application, each GPU can only be used by one MPI process. Hence you will usually request the same number of GPUs and CPUs.
The following CUDA-aware versions of the OpenMPI libraries are available. These will usually give better performance when your application uses MPI to transfer data from one GPU to another (note that the openmpi modulefile will automatically load the cuda modulefile):
# GCC Compiler
module load mpi/gcc/openmpi/4.0.1-cuda               # CUDA 10.1.168

# Intel Compiler
module load mpi/intel-18.0/openmpi/4.0.1-cuda        # CUDA 10.1.168
module load mpi/intel-17.0/openmpi/3.1.3-cuda        # CUDA 9.2.148
module load mpi/intel-17.0/openmpi/3.1.1-cuda        # CUDA 9.2.148
Note that when running multi-GPU jobs using MPI you usually start one MPI process per GPU. For example:
#!/bin/bash --login
#$ -cwd
#$ -l v100=4         # A 4-GPU request (v100 is just a shorter name for nvidia_v100)
                     # Can instead use 'a100' for the A100 GPUs (if permitted!)
#$ -pe smp.pe 4      # Our MPI code only uses one MPI process (hence one CPU core) per GPU

# MPI library (which also loads the cuda modulefile)
module load mpi/intel-18.0/openmpi/4.0.1-cuda

echo "Job is using $NGPUS GPU(s) with ID(s) $CUDA_VISIBLE_DEVICES and $NSLOTS CPU core(s)"

# In this example we start one MPI process per GPU. We could use $NSLOTS or $NGPUS (both = 4)
# It is assumed the application will ensure each MPI process uses a different GPU. For example
# MPI rank 0 will use GPU 0, MPI rank 1 will use GPU 1 and so on.
mpirun -n $NGPUS ./myMultiGPU_MPI_app

# If your application does not map MPI ranks to GPUs correctly, you can try the following method
# where we explicitly inform each rank which GPU to use via the CUDA_VISIBLE_DEVICES variable.
# Each part of the command starts one process (-n 1), giving 4 processes in total:
mpirun -n 1 -x CUDA_VISIBLE_DEVICES=0 ./myMultiGPU_MPI_app : \
       -n 1 -x CUDA_VISIBLE_DEVICES=1 ./myMultiGPU_MPI_app : \
       -n 1 -x CUDA_VISIBLE_DEVICES=2 ./myMultiGPU_MPI_app : \
       -n 1 -x CUDA_VISIBLE_DEVICES=3 ./myMultiGPU_MPI_app
Note that it is possible to use a multi-threaded application (implemented using OpenMP, for example, to create multiple threads). Even though the GPUs are in EXCLUSIVE_PROCESS mode, the threads of a single process can all use the same GPU or can each use a different GPU. This is one way to increase the usage of a single GPU if your CUDA kernels do not fully utilise a GPU.
An alternative method, which allows multiple MPI processes to run on the same GPU, is now available – please see the section below on the Nvidia MPS facility.
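If your application does not map MPI ranks to GPUs itself, another option is a small wrapper that gives each rank exactly one GPU taken from the list assigned by the batch system. This is a sketch under stated assumptions, not a CSF-provided tool: gpu_for_rank and run_with_one_gpu are hypothetical names, and the rank is read from OMPI_COMM_WORLD_LOCAL_RANK, an environment variable that OpenMPI sets for each process it starts.

```shell
# Hypothetical helper: print the entry of a comma-separated GPU ID list that
# corresponds to a local MPI rank (wrapping round if there are more ranks
# than GPUs).
gpu_for_rank() {
    local -a ids
    IFS=',' read -r -a ids <<< "$1"
    echo "${ids[$(( $2 % ${#ids[@]} ))]}"
}

# Hypothetical wrapper: run a command so that it can see only the one GPU
# chosen for this MPI rank. The fallback values (0) apply outside a job only.
run_with_one_gpu() {
    local rank=${OMPI_COMM_WORLD_LOCAL_RANK:-0}
    local assigned=${CUDA_VISIBLE_DEVICES:-0}
    CUDA_VISIBLE_DEVICES=$(gpu_for_rank "$assigned" "$rank") "$@"
}

# Possible usage: save these functions in a script (gpu_wrapper.sh is an
# example name) that ends with the line
#   run_with_one_gpu "$@"
# and then, in the jobscript:
#   mpirun -n $NGPUS ./gpu_wrapper.sh ./myMultiGPU_MPI_app
```

Because each process then sees a single GPU, the application should use device 0 in every rank. Unlike hard-coding IDs 0 to 3, this also works when the job has been given fewer than four GPUs with non-contiguous IDs such as 0,3.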
Interactive Jobs
You mainly use interactive jobs to run a GPU app that has a GUI or to log in to a GPU node to do app development and testing.
Interactive jobs should be done using qrsh (not qsub) from the login node as follows.
Note: Unlike the CSF2, there is no need to add the -l inter flag to the qrsh command-line. The use of qrsh indicates you require an interactive job.
Single GPU, Single CPU-core logging in to GPU node
Here we request an interactive session using 1-GPU and 1-CPU core, logging in to the node:
# You must supply the name of the shell to login with (bash in this example)
qrsh -l v100 bash

# Wait until you are logged in to a GPU node (e.g. node800).
# You can now load modulefiles (e.g., libs/cuda) and run your apps.
# The CUDA_VISIBLE_DEVICES environment variable says which GPUs you have access to.

# Can instead use a100 for the A100 GPUs (if permitted!)
qrsh -l a100 bash

# Can instead use l40s for the L40S GPUs (if permitted!)
qrsh -l l40s bash
The above command will place you in your home directory when it logs you in to the GPU node. If you wish to remain in the directory from where you run the qrsh, add the -cwd flag to the command:
qrsh -l v100 -cwd bash

# Can instead use a100 for the A100 GPUs (if permitted!)
qrsh -l a100 -cwd bash

# Can instead use l40s for the L40S GPUs (if permitted!)
qrsh -l l40s -cwd bash
GPU qrsh jobs are limited to 24 hours.
Multi GPU, Multi CPU-cores logging in to GPU node
Here we start an interactive session requesting 2-GPUs and 4-CPU cores, logging in to the node:
qrsh -l v100=2 -pe smp.pe 4 bash

# Wait until you are logged in to a GPU node (e.g. node800).
# You can now load modulefiles (e.g., libs/cuda) and run your apps.
# The CUDA_VISIBLE_DEVICES environment variable says which GPUs you have access to.
Multi GPU, Multi CPU-cores running app on GPU node
Here we request 2 x v100 GPUs and 16 CPU cores (max for the v100, it could be up to 24 CPU cores for the A100 GPUs), running my own executable/binary (-b y) from the current directory (-cwd) inheriting the modulefile settings from the login node (-V). My sample application takes some custom flags that it understands:
module load libs/cuda/10.1.168
qrsh -l v100=2 -pe smp.pe 16 -b y -cwd -V ./myMultiGPUapp -in mydata.dat -out myresults.dat

# Can instead use a100 for the A100 GPUs (if permitted!) and up to 24 CPU cores for 2 x A100 GPUs
qrsh -l a100=2 -pe smp.pe 24 -b y -cwd -V ./myMultiGPUapp -in mydata.dat -out myresults.dat

# Can instead use l40s for the L40S GPUs (if permitted!) and up to 24 CPU cores for 2 x L40S GPUs
qrsh -l l40s=2 -pe smp.pe 24 -b y -cwd -V ./myMultiGPUapp -in mydata.dat -out myresults.dat
Nvidia Multi-Process Service (MPS)
Our GPUs all use EXCLUSIVE_PROCESS mode – meaning only one process can access a GPU. This protects the GPU allocated to your job from being used accidentally by another job that might be running on the same compute node.
But there are times when you might want to run more than one process (app) on the GPU allocated to your job.
The Nvidia Multi-Process Service (MPS) allows multiple processes to use the same GPU. You might want to do this for small MPI jobs, where each MPI process does not require the resources of an entire GPU. Hence all of the MPI processes could “fit” on a single GPU. Alternatively, if you have a lot of small jobs to run, you might be able to start multiple copies of the executable, all using the same GPU. Using MPI (mpirun) would be one method of doing this, even if the app itself is not an MPI job.
An extra flag is required to start the NVMPS facility on the node allocated to your job. Hence you should add:
#$ -ac nvmps
to your jobscript (or qsub command).
Note that you should still request enough CPU cores on which to run multiple processes. Even a GPU app does some work on the CPU and so if you are going to run several copies of an app, you should request the correct number of CPU cores so that each instance of your app has its own core(s) to run on. The examples below request 8 CPU cores (-pe smp.pe 8) so that we can run 8 copies of a GPU-capable application.
The following example demonstrates running the simpleMPI example found in the CUDA SDK on a single GPU. Multiple MPI processes are started and they all run on the same GPU. Without MPS, a GPU per MPI process would be required (see later for what happens if we run the same job without using MPS).
#!/bin/bash --login
#$ -cwd
#$ -l v100=1         # We request only 1 GPU
#$ -pe smp.pe 8      # We want a CPU core for each process (see below)
#$ -ac nvmps         # Extra flag to enable Nvidia MPS

# Load a CUDA-aware MPI modulefile which will also load a cuda modulefile
module load mpi/gcc/openmpi/4.0.1-cuda-ucx

# Let's take a copy of the already-compiled simpleMPI example (the whole folder)
cp -a $CUDA_SDK/0_Simple/simpleMPI/ .
cd simpleMPI

# Now run more than 1 copy of the app. In fact we run with 8 MPI processes
# ($NSLOTS is replaced by the number of cores requested on the -pe line above.)
# But we are only using 1 GPU, not 8! So all processes will use the same GPU.
mpirun -n $NSLOTS ./simpleMPI
# Without the '-ac nvmps' flag at the top of the jobscript, this command would fail!
# That's because it launches 8 copies of an app on the same GPU, and this
# is not allowed unless the Nvidia MPS feature is turned on. See below for the output
# you will see if we run without MPS.
Submit the above jobscript using qsub jobscript. The job output will be something similar to:
Running on 8 nodes
Average of square roots is: 0.667337
PASSED
You can also use the NV MPS facility with interactive jobs:
# At the CSF login node, start an interactive session, requesting one GPU, 8 CPU cores and enable the NVMPS facility
[username@hlogin1 [csf3] ~]$ qrsh -l v100=1 -pe smp.pe 8 -ac nvmps bash
# Wait until you are logged in to a GPU node, then:
module load mpi/gcc/openmpi/4.0.1-cuda-ucx
cp -a $CUDA_SDK/0_Simple/simpleMPI .
cd simpleMPI
# Run more MPI processes than the 1 GPU we requested. This will only work when
# the interactive session is using the NV MPS facility. The $NSLOTS will be replaced
# with 8 in this example because we requested 8 CPU cores on the qrsh line.
mpirun -n $NSLOTS ./simpleMPI
# Return to the login node
exit
If we run the above job without using the -ac nvmps flag, the job will fail, because the single GPU that we request will only accept one MPI process (because we run our GPUs in EXCLUSIVE_PROCESS mode). Doing so will give you an error in the job output:
# Job output when trying to run multiple processes on a single GPU without using MPS facility
CUDA error calling "cudaMalloc((void **)&deviceInputData, dataSize * sizeof(float))", code is 46
CUDA error calling "cudaMalloc((void **)&deviceInputData, dataSize * sizeof(float))", code is 46
...
Test FAILED
Test FAILED
...
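For the case mentioned earlier of many small non-MPI runs sharing one GPU, plain shell job control is an alternative to mpirun. The following is a sketch only. It assumes the jobscript has requested -ac nvmps and one CPU core per copy as described above. run_copies is a hypothetical helper, and passing the copy number to the application as its final argument is an assumption made for illustration.

```shell
# Hypothetical helper: start N copies of a command in the background, passing
# each copy its number (1..N) as a final argument, then wait for them all.
# Returns non-zero if any copy failed.
run_copies() {
    local n=$1
    shift
    local i pid failed=0
    local -a pids=()
    for (( i = 1; i <= n; i++ )); do
        "$@" "$i" &
        pids+=("$!")
    done
    for pid in "${pids[@]}"; do
        wait "$pid" || failed=$(( failed + 1 ))
    done
    [ "$failed" -eq 0 ]
}

# Possible usage in a jobscript that requested '-pe smp.pe 8' and '-ac nvmps'
# (my_small_gpu_app is a placeholder for your own application):
#   run_copies "$NSLOTS" ./my_small_gpu_app
```

Waiting for every copy before the jobscript exits matters, because the batch system ends the job, and deletes $TMPDIR, as soon as the jobscript finishes.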
Profiling Tools
A number of profiling tools are available to help analyse and optimize your CUDA applications. We provide instructions on how to run (start) these tools below. Please note that instructions on how to use these tools are beyond the scope of this webpage. You should consult the Nvidia profiling documentation for detailed instructions on how to use the tools listed below.
We give the command name of each tool below. If running the profiler tool through its graphical user interface (GUI) or interactively on the command-line (i.e., not in a batch job which would be collecting profiling data without any interaction) then you must start an interactive session on a backend GPU node using the commands:
# On the CSF login node, request an interactive session on a GPU node
qrsh -l v100=1 bash

# Can instead use a100 for the A100 GPUs (if permitted!)
qrsh -l a100=1 bash

# Wait to be logged in to the node, then run:
module load libs/cuda/10.1.168        # Choose your required version
name-of-profiler-tool                 # See below for the command names
Nsight Compute
The Nvidia Nsight Compute profile tools are installed as of toolkit version 10.0.130 and later. To run the profiler:
nv-nsight-cu           # GUI version
nv-nsight-cu-cli       # Command-line version
Nsight Systems
The Nvidia Nsight Systems performance analysis tool, designed to visualize an application’s algorithms, is installed as of toolkit version 10.1.168. To run the profiler:
nsight-sys
Nvidia recommend you use the above newer tools for profiling rather than the following older tools, although these tools are still available and may be familiar to you.
Visual Profiler
The Nvidia Visual Profiler is installed as of toolkit version 7.5.18 and later. To run the profiler:
nvvp
Note that the Nvidia Visual Profiler nvvp can be used to view results collected by the nvprof command-line tool (see below). Hence you could use the nvprof command in a batch job which will save profiling data to file, then view the results at a later time using the nvvp tool.
nvprof Command-line Profiler
The Nvidia command-line nvprof profiler is installed as of toolkit version 7.5.18 and later. To run the profiler:
nvprof
Note that the Nvidia Visual Profiler nvvp (see above) can be used to view results collected by the nvprof command-line tool. Hence you could use the nvprof command in a batch job which will save profiling data to file, then view the results at a later time using the nvvp tool.