Nvidia GPU jobs

Access

This page covers the Nvidia v100, A100 and L40S GPUs.

Access to the GPU nodes is not automatic. Jobs will not run unless you have been added to a GPU group. Please read about the different levels of access below then contact us via its-ri-team@manchester.ac.uk to request access.

Contributor Access

If you are a member of a contributing research group (i.e., your PI / Supervisor has funded some GPUs in the system) then please email your request to use the GPUs to the email address below, cc-ing your supervisor.

If your research group is interested in contributing funds for GPU nodes please contact its-ri-team@manchester.ac.uk for more information.

Not sure if your PI/supervisor has contributed? See below for the free-at-point-of-use details. Please email us and we’ll determine your level of access.

Free at Point of Use Access

There are also free-at-the-point-of-use v100 and A100 GPUs available, with job limits in place. These GPUs were funded by the University as part of the Research Lifecycle Programme, in particular Change Project M&K.

If you would like to access these GPUs please email a request to its-ri-team@manchester.ac.uk and provide some brief information about what you wish to use them for.

Free-at-point-of-use job limits (v100 GPUs): each person in the free-at-point-of-use group may have two v100 GPUs in use at any one time, provided resources are available. The group as a whole may use a maximum of 36 v100 GPUs at any one time, again provided resources are available.

A100 GPU access

The A100 GPU nodes have mostly been funded by specific research groups and so access is restricted. There is a small amount of free-at-point-of-use access available. However, only users with a specific requirement to use A100 GPUs will be granted access. If there is a specific feature your code requires, or you can demonstrate that your code exceeds the GPU memory of the v100 GPU, then we will consider access. Please see below for further details on the nodes.

L40S GPU access

The L40S GPU nodes have been funded by a specific research group and so access is very restricted. PLEASE NOTE: access is currently limited to people associated with Prof. Magnus Rattray and/or Dr. Syed Bakker as part of the Bioinformatics Core Facility. All requests for access need to be approved by Prof. Rattray / Syed.

Updates

September 2024

New Nvidia L40s GPUs available. PLEASE NOTE: access is currently limited to people associated with Prof. Magnus Rattray and/or Dr. Syed Bakker as part of the Bioinformatics Core Facility. All requests for access need to be approved by Prof. Rattray / Syed.

July 2024

Essential maintenance to the cluster is required this month. As such, we will be draining nodes of jobs in batches, performing the maintenance on those nodes, then putting them back into service. This will reduce the availability of nodes at any one time, but ensures there are always some nodes available to you.

We strongly advise that you use batch jobs rather than interactive sessions at this time. Batch jobs can be selected to run at any time, 24 hours a day, whereas interactive jobs can only be used when you are logged in to the cluster. By submitting batch jobs you increase the amount of work you can do on the system – the batch queues never need to go to sleep!

May 2024

Nvidia CUDA driver on the GPU nodes updated to allow use of the CUDA 12.x toolkit (currently 12.2.2). The driver version of upgraded nodes is 535.154.05.

Feb 2023

Nvidia CUDA driver on the GPU nodes updated to allow use of the CUDA 12.x toolkit (currently 12.0.1). The driver version of upgraded nodes is 525.85.12.

All nodes have been upgraded. You no longer need to add -l cuda12 to land on an upgraded node.

The default CUDA toolkit remains at 11.6.2. Hence:

module load libs/cuda             # Loads the current default on any GPU node - 11.6.2

To use the 12.2.2 toolkit, please specify the version explicitly on your module load command:

module load libs/cuda/12.2.2

# You may also need a GCC compiler modulefile providing at least GCC v6. The CUDA samples in this version
# have been compiled with GCC 6.4.0, so if you are testing those, please also do:
module load compilers/gcc/6.4.0

To use the 12.0.1 toolkit, please specify the version explicitly on your module load command:

module load libs/cuda/12.0.1      # To use the newer toolkit please add the version number

Once all GPU nodes have the new driver installed we will make 12.0.1 the default toolkit version. This will be announced on the CSF MOTD.

May 2022

7th May 2022: The Nvidia CUDA driver on all GPU compute nodes has been upgraded to allow use of the CUDA 11.6.x toolkit. The driver version is now 510.47.03.

Software Applications

A range of GPU capable software is available on the CSF.

List of installed GPU capable software.
List of installed Machine Learning specific software.

GPU Hardware and Driver

The CSF contains the following GPU nodes, which offer different types of Nvidia GPUs and host CPUs.

17 GPU nodes each hosting 4 x Nvidia v100 GPUs (16GB GPU RAM) giving a total of 68 v100 GPUs. The node spec is:

  • 4 x NVIDIA v100 SXM2 16GB GPU (Volta architecture – hardware v7.0, compute architecture sm_70)
  • Some GPU hosts: 2 x 16-core Intel Xeon Gold 6130 “Skylake” 2.10GHz
  • Some GPU hosts: 2 x 16-core Intel Xeon Gold 5128 “Cascade Lake” 2.30GHz
  • 192 GB RAM (host)
  • 1.6TB NVMe local storage, 182GB local SSD storage
  • CUDA Driver 535.154.05

16 GPU nodes each hosting 4 x Nvidia A100 GPUs (80GB GPU RAM) giving a total of 64 A100 GPUs. The node spec is:

  • 4 x NVIDIA HGX A100 SXM4 80GB GPU (Ampere architecture – hardware v8.0, compute architecture sm_80)
  • 2 x 24-core AMD Epyc 7413 “Milan” 2.65GHz
  • 512 GB RAM (host)
  • 1.6TB local NVMe storage, 364GB local SSD storage
  • CUDA Driver 535.154.05

3 GPU nodes hosting 4 x Nvidia L40s GPUs (48GB GPU RAM) giving a total of 12 L40s GPUs. The node spec is:

  • 4 x NVIDIA L40S 48GB GPU (Ada Lovelace architecture – hardware v8.9, compute architecture sm_89)
  • 2 x 24-core Intel Xeon(R) Gold 6442Y “Sapphire Rapids” 2.6GHz
  • 512 GB RAM (host)
  • 28TB local /tmp storage
  • CUDA Driver 535.183.01

Fast NVMe storage on the node

The very fast, local-to-node NVMe storage is available as $TMPDIR on each node. This environment variable gives the name of a temporary directory which is created by the batch system at the start of your job. You must access this from your jobscript (i.e., on the node, not on the login node). See below for advice on how to use this in your jobs. This directory (and all files in it) will be deleted automatically at the end of your job by the batch system.

Reminder: The above storage area is local to the compute node where your job is running. You will not be able to access the files in the temporary storage from the login node.

Batch jobs running on the GPU nodes have a maximum runtime of 4 days.
Interactive GPU jobs have a maximum runtime of 1 day.

Job Basics

Batch and interactive jobs can be run. You must specify how many GPUs your job requires AND how many CPU cores you need for the host code.

A job can use up to 8 CPU cores per v100 GPU or up to 12 CPU cores per A100 or L40S GPU. See below for example jobscripts.

A GPU jobscript should be of the form:

#!/bin/bash --login
#$ -cwd

# Choose ONE of the following depending on your permitted access
# The -l nvidia_v100 (or just v100) or -l nvidia_a100 (or just a100) or -l nvidia_l40s (or just l40s)
# is mandatory. M can be 1 to 4 (depending on user limits). If '=M' is missing
# then 1 will be used.
#$ -l nvidia_v100=M
#$ -l nvidia_a100=M           # RESTRICTED ACCESS - PLEASE SEE EARLIER
#$ -l nvidia_l40s=M           # VERY RESTRICTED ACCESS - PLEASE SEE EARLIER

# The -pe line is optional. Number of CPU cores N can be 2..32 (v100 GPU) or 2..48 (A100 GPU)
# or 2..48 (L40S GPU) i.e., max 8 per v100 GPU, max 12 per A100 or L40s GPU. Will be a serial job if
# this line is missing.
#$ -pe smp.pe N

See below for a simple GPU job that you can run (once your account has had GPU-access enabled.)

Runtime Limits

The maximum runtimes on the GPUs are as follows:

  • batch jobs: 4 days
  • interactive jobs: 1 day

CUDA Libraries

You will most likely need the CUDA software environment for your job, whether your application is pre-compiled (e.g., a python app) or an application you have written yourself and compiled using the Nvidia nvcc compiler. Please see our CUDA libraries documentation for advice on compiling your own code.

To always use the most up-to-date version installed use:

# The main CUDA library and compiler (other libs have separate modulefiles - see below)
module load libs/cuda

# Alternatively use the Nvidia HPC SDK which provides a complete set of CUDA libraries and tools
module load libs/nvidia-hpc-sdk

Use module show libs/cuda to see what version is provided.

If your application requires a specific version, or you want to fix on a specific version for reproducibility reasons, use:

module load libs/cuda/12.2.2        # Please also load at least compilers/gcc/6.4.0
module load libs/cuda/12.0.1
module load libs/cuda/11.6.2        # This is the default for 'module load libs/cuda'
module load libs/cuda/11.2.0
module load libs/cuda/11.1.1
module load libs/cuda/11.0.3
module load libs/cuda/10.1.243
module load libs/cuda/10.1.168
module load libs/cuda/9.2.148  
module load libs/cuda/9.1.85
module load libs/cuda/9.0.176
module load libs/cuda/8.0.61
module load libs/cuda/7.5.18

# To see available versions:
module avail libs/cuda

The Nvidia cuDNN, NCCL and TensorRT libraries are also available. See:

module avail libs/cuDNN
module avail libs/nccl
module avail libs/tensorrt

For more information on available libraries and how to compile CUDA code please see our CUDA page.
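
As a minimal sketch of compiling your own code (the file names myapp.cu and myapp are placeholders), load a toolkit and a suitable GCC modulefile, then target the compute architecture of the GPU you intend to run on (sm_70 for the v100, sm_80 for the A100, sm_89 for the L40S, as listed in the hardware section above):

module load libs/cuda/12.2.2
module load compilers/gcc/6.4.0       # nvcc needs a suitable host compiler

# Build for the v100; add further -gencode options for the A100 (sm_80) or L40S (sm_89)
nvcc -o myapp myapp.cu -gencode arch=compute_70,code=sm_70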

Which GPUs will your job use (CUDA_VISIBLE_DEVICES)

When a job or interactive session runs, the batch system will set the environment variable $CUDA_VISIBLE_DEVICES to a comma-separated list of GPU IDs assigned to your job, where IDs can be one or more of 0,1,2,3 (for example 2 for a single GPU job which is using the GPU with id 2, or 0,3 for a 2-GPU job – the IDs might not be contiguous.) The CUDA library will read this variable automatically and so most CUDA applications already installed on the CSF will simply use the correct GPUs.

You may have to tell your application how many GPUs to use (e.g., with a command-line flag – please check the application’s documentation). The batch system sets the variable $NGPUS to the number of GPUs you requested. Both of the environment variables $CUDA_VISIBLE_DEVICES and $NGPUS can be used in your jobscript. See the example jobscripts below for how this can be useful.

When developing your own CUDA applications, the device IDs used in the cudaSetDevice() function should run from 0 to NGPUS-1. An ID of 0 means "use the GPU whose ID is listed first in the $CUDA_VISIBLE_DEVICES list", and so on. The CUDA library will then map 0 (and so on) to the correct physical GPU assigned to your job.

If an application insists that you give it a flag specifying the GPU IDs to use (some apps do, some don't) then try using 0 for a single-GPU job, 0,1 for a two-GPU job, and so on. As above, the CUDA library will map these app-local IDs to the correct physical GPUs assigned to your job.
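
For example, a minimal sketch of constructing such a flag in a jobscript, where my_gpu_app and its --devices option are hypothetical (check your application's documentation for the real option name):

# Build an app-local device list "0,1,...,NGPUS-1" and pass it to the (hypothetical) flag
DEVLIST=$(seq -s, 0 $((NGPUS-1)))     # e.g. "0" for a 1-GPU job, "0,1" for a 2-GPU job
./my_gpu_app --devices $DEVLIST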

A Simple First Job – deviceQuery

Create a jobscript as follows:

#!/bin/bash --login
#$ -cwd
#$ -l v100               # Will give us 1 GPU to use in our job
                         # No -pe line hence a serial (1 CPU-core) job
                         # Can instead use 'a100' for the A100 GPUs (if permitted!)
                         # Can instead use 'l40s' for the L40S GPUs (if permitted!)

echo "Job is using $NGPUS GPU(s) with ID(s) $CUDA_VISIBLE_DEVICES and $NSLOTS CPU core(s)"

# Get the CUDA software libraries and applications 
module load libs/cuda

# Run the Nvidia app that reports GPU statistics
deviceQuery

Submit the job using qsub jobscript. It will print out hardware statistics about the GPU device.

See below for more complex jobscripts.

NVMe fast local storage

The GPU host nodes contain a 1.6TB NVMe storage card. This is faster than SSD storage (and faster than your scratch area and the home storage area).

This extra storage on the GPU nodes is accessible via the environment variable $TMPDIR:

cd $TMPDIR

This will access a private directory, which is specific to your job, in the /tmp area (please do not use /tmp directly).

The actual name of the directory contains your job id number for the current job, so it will be unique to each job. It will be something like /tmp/4619712.1.nvidiagpu.q, but you can always use the $TMPDIR environment variable to access this rather than the actual directory name.

It is highly recommended (especially for machine learning workloads) that you copy your data to $TMPDIR at the start of the job, process it from there and copy any results back to your ~/scratch area at the end of the job. If your job performs a lot of I/O (e.g., reading large datasets, writing results) then doing so from $TMPDIR on the GPU nodes will be faster. Even with the cost of copying data to and from the NVMe cards ($TMPDIR), using this area during the job usually provides good speed-up.

Remember that $TMPDIR is local to the node. So after your job has finished, you will not be able to access any files saved on the GPU node’s NVMe drive from the login node (i.e., $TMPDIR on the login node points to the login node’s local hard-disk, whereas $TMPDIR on the GPU node points to the GPU node’s local NVMe drive.) So you must ensure you do any file transfers back to the usual ~/scratch area (or your home area) within the jobscript.

Here is an example of copying data to the $TMPDIR area at the start of the job, processing the data and then cleaning up at the end of the job:

#!/bin/bash --login
#$ -cwd
#$ -l nvidia_v100            # Can instead use 'nvidia_a100' for the A100 GPUs (if permitted!)
                             # Can instead use 'nvidia_l40s' for the L40S GPUs (if permitted!)


# Copy a directory of files from scratch to the GPU node's local NVMe storage
cp -r ~/scratch/dataset1/ $TMPDIR

# Process the data with a GPU app, from within the local NVMe storage area
cd $TMPDIR/dataset1/
some_GPU_app -i input.dat -o results.dat

# Copy the results back to the main scratch area
cp results.dat ~/scratch/dataset1/

# The batch system will automatically delete the contents of $TMPDIR at the end of your job.

The above jobscript can be in your home or scratch storage. Submit it from there.

Monitoring GPU jobs

We have written a script to help monitor your GPU jobs. You can run the following on the CSF login nodes:

# Get a list of your GPU jobs in the batch system (similar to qstat)
gpustat

# Display the status of the GPUs in use by one of your running jobs (job id needed).
# This shows the default output of nvidia-smi from the GPU node where your job is running.
gpustat -j jobid
             #
             # Replace jobid with your job id number (e.g., 12345)

# Display the status of the GPUs in use by one of your job-array tasks (job id & task id needed)
# This shows the default output of nvidia-smi from the GPU node where your job is running.
gpustat -j jobid -t taskid
             #        # 
             #        # Replace taskid with your job array task id
             #        # number (e.g., 1)
             #
             # Replace jobid with your job id number (e.g., 12345)

# Continuously sample the GPU utilization and amount of memory used / free every second.
# This uses nvidia-smi --query-gpu=fields to display various stats in CSV format.
# Press Ctrl+C to stop sampling
gpustat -j jobid [-t taskid] -s 2

# Continuously sample the GPU utilization and memory utilization (time spent reading/writing).
# This uses nvidia-smi pmon to display various stats in CSV format (runs for a max of 5 mins.)
# Press Ctrl+C to stop sampling
gpustat -j jobid [-t taskid] -s 3

# Note that '-s 1' gives the default pretty-printed nvidia-smi output. Use gpustat -h for help.

Etiquette

All users are reminded to log out of their interactive GPU session when it is no longer required. This will free up the GPU for other users. If an interactive GPU session is found to be idle for significant periods, making no use of the GPU, it may be killed. Interactive sessions should not be used to reserve a GPU for future use – only request a GPU when you need to use it.

Batch jobs that only use CPU cores should not be submitted to GPU nodes. If such jobs are found they will be killed and access to GPU nodes may be removed. There are plenty of CPU-only nodes on which jobs will run.

Batch Jobs – Example Jobscripts

The following section provides sample jobscripts for various combinations of number of GPUs requested and CPU cores requested.

Note that in the examples below, we load modulefiles inside the jobscript, rather than on the login node. This is so we have a complete record in the jobscript of how we ran the job. There are two things we need to do to make the module command work inside the jobscript: add --login to the first line, then remove any #$ -V line from the jobscript.

Single GPU, Single CPU-core

The simplest case – a single-GPU, single-CPU-core jobscript:

#!/bin/bash --login
#$ -cwd
#$ -l nvidia_v100=1           # If no =N given, 1 GPU will be assumed.
                              # (Can use 'v100' instead of 'nvidia_v100'.)
                              # No #$ -pe line given means a 1-core job.
                              # Can instead use 'nvidia_a100' for the A100 GPUs (if permitted!)
                              # Can instead use 'nvidia_l40s' for the L40S GPUs (if permitted!)

# Latest version of CUDA (add any other modulefiles you require)
module load libs/cuda

echo "Job is using $NGPUS GPU(s) with ID(s) $CUDA_VISIBLE_DEVICES and $NSLOTS CPU core(s)"

# Run an application (this Nvidia app will report info about the GPU). Replace with your app.
deviceQuery

Single GPU, Multi CPU-cores

Even when using a single GPU, you may need more than one CPU core if your host-code uses OpenMP, for example, to do some parallel processing on the CPU. You can request up to 8 CPU cores per v100 GPU and up to 12 CPU cores per A100 or L40S GPU. For example:

#!/bin/bash --login
#$ -cwd
#$ -l v100           # A 1-GPU request (v100 is just a shorter name for nvidia_v100)
                     # Can instead use 'a100' for the A100 GPUs (if permitted!)
                     # Can instead use 'l40s' for the L40S GPUs (if permitted!)

#$ -pe smp.pe 8      # 8 CPU cores available to the host code.
                     # Can use up to 12 CPUs with an A100 GPU.
                     # Can use up to 12 CPUs with an L40s GPU.

# Latest version of CUDA
module load libs/cuda

echo "Job is using $NGPUS GPU(s) with ID(s) $CUDA_VISIBLE_DEVICES and $NSLOTS CPU core(s)"

# This example uses OpenMP for multi-core host code - tell OpenMP how many CPU cores to use.
export OMP_NUM_THREADS=$NSLOTS

./mySimpleGPU_OpenMP_app

Multi GPU, Single CPU-core

A multi-GPU job should request the required number of GPUs and, optionally, up to 8 CPU cores per v100 GPU or up to 12 CPU cores per A100 or L40S GPU.

For example a 2-GPU job that runs serial host code on one CPU core would be:

#!/bin/bash --login
#$ -cwd
#$ -l nvidia_v100=2            # A 2-GPU job. Can use 'v100' instead of 'nvidia_v100'.
                               # Can instead use 'nvidia_a100' for the A100 GPUs (if permitted!)
                               # Can instead use 'nvidia_l40s' for the L40S GPUs (if permitted!)
                               # No -pe smp.pe line means it is a serial job.
# Latest version of CUDA
module load libs/cuda

echo "Job is using $NGPUS GPU(s) with ID(s) $CUDA_VISIBLE_DEVICES and $NSLOTS CPU core(s)"

./myMultiGPUapp.exe

Multi GPU, Multi CPU-cores

Finally, a multi-GPU job that also uses multiple CPU cores for the host code (up to 8 CPUs per v100 GPU, up to 12 CPUs per A100 or L40S GPU) would be:

#!/bin/bash --login
#$ -cwd
#$ -l v100=4         # A 4-GPU request (v100 is just a shorter name for nvidia_v100)
                     # Can instead use 'a100' for the A100 GPUs (if permitted!)
                     # Can instead use 'l40s' for the L40S GPUs (if permitted!)
#$ -pe smp.pe 32     # Let's use the max 8 CPUs per GPU (32 cores in total)
                     # Can use a maximum of 48 CPUs with 4 x A100 GPUs.
                     # Can use a maximum of 48 CPUs with 4 x L40S GPUs.

# Latest version of CUDA
module load libs/cuda

echo "Job is using $NGPUS GPU(s) with ID(s) $CUDA_VISIBLE_DEVICES and $NSLOTS CPU core(s)"

# This example uses OpenMP for multi-core host code - tell OpenMP how many CPU cores to use.
export OMP_NUM_THREADS=$NSLOTS

./myMultiGPU_OpenMP_app

Multi GPU, Multi CPU-cores for MPI Apps

Multi-GPU applications are often implemented using the MPI library – each MPI process (aka rank) uses a GPU to speed up its computation.

Our GPUs are run in EXCLUSIVE_PROCESS mode, meaning only one process can use a GPU at any one time. This is to protect your GPUs from other jobs – if another job accidentally tries to use a GPU in use by your job, the other job will fail. But it also means that, when running an MPI application, each GPU can only be used by one MPI process. Hence you will usually request the same number of GPUs and CPUs.

The following CUDA-aware versions of the OpenMPI libraries are available. These will usually give better performance when your application uses MPI to transfer data from one GPU to another (note that the openmpi modulefile will automatically load the cuda modulefile):

# GCC Compiler
module load mpi/gcc/openmpi/4.0.1-cuda               # CUDA 10.1.168

# Intel Compiler
module load mpi/intel-18.0/openmpi/4.0.1-cuda        # CUDA 10.1.168
module load mpi/intel-17.0/openmpi/3.1.3-cuda        # CUDA 9.2.148
module load mpi/intel-17.0/openmpi/3.1.1-cuda        # CUDA 9.2.148

Note that when running multi-GPU jobs using MPI you usually start one MPI process per GPU. For example:

#!/bin/bash --login
#$ -cwd
#$ -l v100=4         # A 4-GPU request (v100 is just a shorter name for nvidia_v100)
                     # Can instead use 'a100' for the A100 GPUs (if permitted!)
#$ -pe smp.pe 4      # Our MPI code only uses one MPI process (hence one CPU core) per GPU

# MPI library (which also loads the cuda modulefile)
module load mpi/intel-18.0/openmpi/4.0.1-cuda

echo "Job is using $NGPUS GPU(s) with ID(s) $CUDA_VISIBLE_DEVICES and $NSLOTS CPU core(s)"

# In this example we start one MPI process per GPU. We could use $NSLOTS or $NGPUS (both = 4)
# It is assumed that the application will ensure each MPI process uses a different GPU. For example
# MPI rank 0 will use GPU 0, MPI rank 1 will use GPU 1 and so on.
mpirun -n $NGPUS ./myMultiGPU_MPI_app

# If your application does not map MPI ranks to GPUs correctly, you can try the following method
# where we explicitly inform each rank which GPU to use via the CUDA_VISIBLE_DEVICES variable
# (one process per app context, hence '-n 1' for each):
mpirun -n 1 -x CUDA_VISIBLE_DEVICES=0 ./myMultiGPU_MPI_app : \
       -n 1 -x CUDA_VISIBLE_DEVICES=1 ./myMultiGPU_MPI_app : \
       -n 1 -x CUDA_VISIBLE_DEVICES=2 ./myMultiGPU_MPI_app : \
       -n 1 -x CUDA_VISIBLE_DEVICES=3 ./myMultiGPU_MPI_app

Note that it is possible to use a multi-threaded application (implemented using OpenMP, for example, to create multiple threads). Even though the GPUs are in EXCLUSIVE_PROCESS mode, multiple OpenMP threads within the same process can access the same GPU (or different GPUs). This is one way to increase the usage of a single GPU if your CUDA kernels do not fully utilise it.

An alternative method, which allows multiple MPI processes to run on the same GPU, is now available – please see the section below on the Nvidia MPS facility.

Interactive Jobs

Interactive jobs are mainly used to run a GPU app that has a GUI, or to log in to a GPU node for app development and testing.

Interactive jobs should be done using qrsh (not qsub) from the login node as follows.

Note: Unlike the CSF2, there is no need to add the -l inter flag to the qrsh command-line. The use of qrsh indicates you require an interactive job.

We strongly advise that you use batch jobs rather than interactive jobs. Provided you have batch jobs in the queue, ready and waiting to be run, the system can select your jobs 24 hours a day, whereas interactive jobs require you to be logged in to the CSF and working at the terminal. You will get more work done on the system using batch jobs – the batch queues never need to go to sleep!

Single GPU, Single CPU-core logging in to GPU node

Here we request an interactive session using 1-GPU and 1-CPU core, logging in to the node

qrsh -l v100 bash
  #            #
  #            # You must supply the name of the shell to login with
  #
  # Wait until you are logged in to a GPU node (e.g. node800).
  # You can now load modulefiles (e.g., libs/cuda) and run your apps.
  # The CUDA_VISIBLE_DEVICES environment variable says which GPUs you have access to.

# Can instead use a100 for the A100 GPUs (if permitted!)
qrsh -l a100 bash

# Can instead use l40s for the L40S GPUs (if permitted!)
qrsh -l l40s bash

The above command will place you in your home directory when it logs you in to the GPU node. If you wish to remain in the directory from where you run the qrsh, add the -cwd flag to the command:

qrsh -l v100 -cwd bash

# Can instead use a100 for the A100 GPUs (if permitted!)
qrsh -l a100 -cwd bash

# Can instead use l40s for the L40S GPUs (if permitted!)
qrsh -l l40s -cwd bash

GPU qrsh jobs are limited to 24 hours.

Multi GPU, Multi CPU-cores logging in to GPU node

Here we start an interactive session requesting 2-GPUs and 4-CPU cores, logging in to the node:

qrsh -l v100=2 -pe smp.pe 4 bash
  #
  # Wait until you are logged in to a GPU node (e.g. node800).
  # You can now load modulefiles (e.g., libs/cuda) and run your apps.
  # The CUDA_VISIBLE_DEVICES environment variable says which GPUs you have access to.

Multi GPU, Multi CPU-cores running app on GPU node

Here we request 2 x v100 GPUs and 16 CPU cores (the maximum for two v100 GPUs; it could be up to 24 CPU cores for two A100 GPUs), running my own executable/binary (-b y) from the current directory (-cwd), inheriting the modulefile settings from the login node (-V). My sample application takes some custom flags that it understands:

module load libs/cuda/10.1.168
qrsh -l v100=2 -pe smp.pe 16 -b y -cwd -V ./myMultiGPUapp -in mydata.dat -out myresults.dat

# Can instead use a100 for the A100 GPUs (if permitted!) and up to 24 CPU cores for 2 x A100 GPUs
qrsh -l a100=2 -pe smp.pe 24 -b y -cwd -V ./myMultiGPUapp -in mydata.dat -out myresults.dat

# Can instead use l40s for the L40S GPUs (if permitted!) and up to 24 CPU cores for 2 x L40s GPUs
qrsh -l l40s=2 -pe smp.pe 24 -b y -cwd -V ./myMultiGPUapp -in mydata.dat -out myresults.dat

Nvidia Multi-Process Service (MPS)

Our GPUs all use EXCLUSIVE_PROCESS mode – meaning only one process can access a GPU. This protects the GPU allocated to your job from being used accidentally by another job that might be running on the same compute node.

But there are times when you might want to run more than one process (app) on the GPU allocated to your job.

The Nvidia Multi-Process Service (MPS) allows multiple processes to use the same GPU. You might want to do this for small MPI jobs, where each MPI process does not require the resources of an entire GPU. Hence all of the MPI processes could “fit” on a single GPU. Alternatively, if you have a lot of small jobs to run, you might be able to start multiple copies of the executable, all using the same GPU. Using MPI (mpirun) would be one method of doing this, even if the app itself is not an MPI job.

An extra flag is required to start the NVMPS facility on the node allocated to your job. Hence you should add:

#$ -ac nvmps

to your jobscript (or qsub command.)

Note that you should still request enough CPU cores on which to run multiple processes. Even a GPU app does some work on the CPU and so if you are going to run several copies of an app, you should request the correct number of CPU cores so that each instance of your app has its own core(s) to run on. The examples below request 8 CPU cores (-pe smp.pe 8) so that we can run 8 copies of a GPU-capable application.

The following example demonstrates running the simpleMPI example found in the CUDA SDK on a single GPU. Multiple MPI processes are started and they all run on the same GPU. Without MPS, a GPU per MPI process would be required (see later for what happens if we run the same job without using MPS.)

#!/bin/bash --login
#$ -cwd
#$ -l v100=1         # We request only 1 GPU
#$ -pe smp.pe 8      # We want a CPU core for each process (see below)
#$ -ac nvmps         # Extra flag to enable Nvidia MPS

# Load a CUDA-aware MPI modulefile which will also load a cuda modulefile
module load mpi/gcc/openmpi/4.0.1-cuda-ucx

# Let's take a copy of the already-compiled simpleMPI example (the whole folder)
cp -a $CUDA_SDK/0_Simple/simpleMPI/ .
cd simpleMPI

# Now run more than 1 copy of the app. In fact we run with 8 MPI processes
# ($NSLOTS is replaced by the number of cores requested on the -pe line above.)
# But we are only using 1 GPU, not 8! So all processes will use the same GPU.
mpirun -n $NSLOTS ./simpleMPI
   #
   # Without the '-ac nvmps' flag at the top of the jobscript, this command would fail!
   # That's because it launches 8 copies of an app on the same GPU, and this
   # is not allowed unless the Nvidia MPS feature is turned on. See below for the output
   # you will see if we run without MPS.

Submit the above jobscript using qsub jobscript. The job output will be something similar to:

Running on 8 nodes
Average of square roots is: 0.667337
PASSED
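
If your application is not an MPI application, another option is to start several independent copies of it in the background from the jobscript, again with the -ac nvmps flag set and with enough CPU cores requested so that each copy has its own core. A minimal sketch (the application name and input/output file names are hypothetical):

# Run $NSLOTS independent copies of a GPU app, all sharing the single requested GPU
for i in $(seq 1 $NSLOTS); do
    ./my_small_gpu_app input_${i}.dat > output_${i}.log &
done
wait      # Wait for all background copies to finish before the job ends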

You can also use the NV MPS facility with interactive jobs:

# On the CSF login node, start an interactive session requesting one GPU and 8 CPU cores, and enable the NVMPS facility
[username@hlogin1 [csf3] ~]$ qrsh -l v100=1 -pe smp.pe 8 -ac nvmps bash

# Wait until you are logged in to a GPU node, then:
module load mpi/gcc/openmpi/4.0.1-cuda-ucx
cp -a $CUDA_SDK/0_Simple/simpleMPI .
cd simpleMPI

# Run more MPI processes than the 1 GPU we requested. This will only work when
# the interactive session is using the NV MPS facility. The $NSLOTS will be replaced
# with 8 in this example because we requested 8 CPU cores on the qrsh line.
mpirun -n $NSLOTS ./simpleMPI

# Return to the login node
exit

If we run the above job without the -ac nvmps flag, the job will fail because the single GPU that we request will only accept one MPI process (our GPUs run in EXCLUSIVE_PROCESS mode). You will see an error in the job output:

# Job output when trying to run multiple processes on a single GPU without using MPS facility
CUDA error calling "cudaMalloc((void **)&deviceInputData, dataSize * sizeof(float))", code is 46
CUDA error calling "cudaMalloc((void **)&deviceInputData, dataSize * sizeof(float))", code is 46
...
Test FAILED
Test FAILED
...

Profiling Tools

A number of profiling tools are available to help analyse and optimize your CUDA applications. We provide instructions on how to run (start) these tools below. Please note that instructions on how to use these tools are beyond the scope of this webpage. You should consult the Nvidia profiling documentation for detailed instructions on how to use the tools listed below.

We give the command name of each tool below. If running the profiler tool through its graphical user interface (GUI) or interactively on the command-line (i.e., not in a batch job which would be collecting profiling data without any interaction) then you must start an interactive session on a backend GPU node using the commands:

# On the CSF login node, request an interactive session on a GPU node
qrsh -l v100=1 bash

# Can instead use a100 for the A100 GPUs (if permitted!)
qrsh -l a100=1 bash

# Wait to be logged in to the node, then run:
module load libs/cuda/10.1.168                # Choose your required version
name-of-profiler-tool                         # See below for the command names

Nsight Compute

The Nvidia Nsight Compute profiling tools are installed as of toolkit version 10.0.130 and later. To run the profiler:

nv-nsight-cu        # GUI version
nv-nsight-cu-cli    # Command-line version
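
The command-line version can also be used in a batch job (or an interactive session) to collect a profile for later inspection in the GUI. A minimal sketch, where ./myGPUapp is a placeholder for your own application:

# Collect a profile non-interactively and write it to a report file
nv-nsight-cu-cli -o myGPUapp_profile ./myGPUapp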

Nsight Systems

The Nvidia Nsight Systems performance analysis tool, designed to visualize an application's algorithms, is installed as of toolkit version 10.1.168. To run the profiler:

nsight-sys

Nvidia recommend you use the above newer tools for profiling rather than the following older tools, although these tools are still available and may be familiar to you.

Visual Profiler

The Nvidia Visual Profiler is installed as of toolkit version 7.5.18 and later. To run the profiler:

nvvp

Note that the Nvidia Visual Profiler nvvp can be used to view results collected by the nvprof command-line tool (see below). Hence you could use the nvprof command in a batch job, which will save profiling data to file, then view the results at a later time using the nvvp tool.

nvprof Command-line Profiler

The Nvidia command-line nvprof profiler is installed as of toolkit version 7.5.18 and later. To run the profiler:

nvprof

Note that the Nvidia Visual Profiler nvvp (see above) can be used to view results collected by the nvprof command-line tool. Hence you could use the nvprof command in a batch job, which will save profiling data to file, then view the results at a later time using the nvvp tool.
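
For example, a minimal sketch of such a batch job (./myGPUapp is a placeholder for your own application; choose whichever CUDA toolkit version you require):

#!/bin/bash --login
#$ -cwd
#$ -l v100=1                        # One GPU for the profiling run

module load libs/cuda/10.1.168      # Choose your required version

# Save the profile to a file which can later be opened in the nvvp GUI
nvprof -o myGPUapp-profile.nvvp ./myGPUapp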
