CUDA, cuDNN, NCCL, TensorRT and HPC-SDK

This page covers compiling CUDA code on the CSF.

Please see the section on running GPU jobs for information on available hardware and how to submit batch jobs and run interactive sessions for programming on the Nvidia GPU nodes.

Nvidia HPC SDK (New!)

As of May 2021 the Nvidia HPC SDK (nvhpc) is installed on the CSF. It provides a complete bundle of the CUDA toolkit (libraries, headers, etc.), the Nvidia compiler tools (including those from the PGI compiler suite, which is now owned by Nvidia), maths libraries, OpenMPI, profilers and code examples. In particular, the nvfortran compiler is now available, providing access to OpenACC.

The Nvidia HPC SDK is therefore an alternative to loading the various individual CUDA modulefiles listed further down this page.

If you wish to carry on using the individual libraries and SDKs (instead of the HPC SDK), please see below. These will continue to work and provide access to the current and earlier versions of CUDA.

Please note: the Nvidia HPC SDK does not include the cuDNN library, so this will still have to be loaded as a separate modulefile (see below).

Set up procedure

Once you have emailed its-ri-team@manchester.ac.uk and been granted access to the GPUs, set up your environment by loading the appropriate module from the following:

# Note that if your code requires modern C++ language features you may need a newer
# version of GCC than the system-wide default 4.8.5. You should load the gcc compiler
# before the Nvidia HPC SDK modulefile. For example
#
# module load compilers/gcc/8.2.0

# Provides CUDA 12.0, compilers, maths libraries, profilers, nvfortran, OpenMPI 3.1.5 and examples
module load libs/nvidia-hpc-sdk/23.1               # Everything in the HPC SDK
module load libs/nvidia-hpc-sdk/23.1-nompi         # Everything but the MPI installation
module load libs/nvidia-hpc-sdk/23.1-nocompiler    # Everything but the Nvidia compilers
                                                   # (you are unlikely to use this one)

# Provides CUDA 11.6.2, compilers, maths libraries, profilers, nvfortran, OpenMPI 3.1.5 and examples
module load libs/nvidia-hpc-sdk/22.3               # Everything in the HPC SDK
module load libs/nvidia-hpc-sdk/22.3-nompi         # Everything but the MPI installation
module load libs/nvidia-hpc-sdk/22.3-nocompiler    # Everything but the Nvidia compilers
                                                   # (you are unlikely to use this one)

# Provides CUDA 11.2, compilers, maths libraries, profilers, nvfortran, OpenMPI 3.1.5 and examples
module load libs/nvidia-hpc-sdk/21.5               # Everything in the HPC SDK
module load libs/nvidia-hpc-sdk/21.5-nompi         # Everything but the MPI installation
module load libs/nvidia-hpc-sdk/21.5-nocompiler    # Everything but the Nvidia compilers
                                                   # (you are unlikely to use this one)

module load libs/nvidia-hpc-sdk/21.3               # Everything in the HPC SDK
module load libs/nvidia-hpc-sdk/21.3-nompi         # Everything but the MPI installation
module load libs/nvidia-hpc-sdk/21.3-nocompiler    # Everything but the Nvidia compilers
                                                   # (you are unlikely to use this one)

For more information on the contents of the HPC SDK please see the Nvidia HPC SDK 21.5 online documentation.
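
As a quick illustration of using the SDK's compilers (a sketch only – myprog.f90 and myapp.cu are hypothetical source files, and the flags shown are the standard Nvidia ones rather than anything CSF-specific):

# Compile an OpenACC Fortran code with nvfortran, targeting the v100 (cc70) and A100 (cc80) GPUs
module load libs/nvidia-hpc-sdk/23.1
nvfortran -acc -gpu=cc70,cc80 -Minfo=accel -o myprog myprog.f90

# The SDK also provides nvcc, so CUDA C/C++ can be compiled in the usual way
nvcc -o myapp myapp.cu -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80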

CUDA SDK

Before the Nvidia HPC SDK was released (see above), the CUDA SDK provided the Nvidia compiler (nvcc), the code profilers and the maths libraries. It is still available via the following modulefiles, which you can of course continue to use rather than the HPC SDK.

Set up procedure

Once you have emailed its-ri-team@manchester.ac.uk and been granted access to the GPUs, set up your environment by loading the appropriate module from the following:

module load libs/cuda/12.0.1
module load libs/cuda/11.7.0
module load libs/cuda/11.6.2         # Also: 'module load libs/cuda'
module load libs/cuda/11.2.0
module load libs/cuda/11.1.1
module load libs/cuda/11.0.3
module load libs/cuda/10.1.243
module load libs/cuda/10.1.168
module load libs/cuda/10.0.130
module load libs/cuda/9.2.148
module load libs/cuda/9.1.85
module load libs/cuda/9.0.176
module load libs/cuda/8.0.61
module load libs/cuda/7.5.18
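
After loading one of these modulefiles you can check that the expected toolkit is on your path, for example:

module load libs/cuda/11.6.2
nvcc --version       # Reports the version of the CUDA toolkit compiler
which nvcc           # Shows that nvcc is being picked up from the loaded modulefile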

Other Libraries

cuDNN – Deep Neural Networks

The Nvidia cuDNN library is also available via the following modulefiles. Before you load one of these you must first load a cuda modulefile or an HPC SDK modulefile from above – the list below indicates which versions of cuda can be used with each cuDNN version:

module load libs/cuDNN/8.5.0        # Load cuda 11.7.0 or 11.6.2 or 11.2.0 or 11.1.1 or 11.0.3 or HPC-SDK first
module load libs/cuDNN/8.4.0        # Load cuda 11.6.2 or 11.2.0 or 11.1.1 or 11.0.3 or HPC-SDK first
module load libs/cuDNN/8.1.0        # Load cuda 11.2.0 or 11.1.1 or 11.0.3 or HPC-SDK first
module load libs/cuDNN/8.0.5        # Load cuda 11.1.1 first
module load libs/cuDNN/8.0.4        # Load cuda 11.0.3 first
module load libs/cuDNN/7.6.5        # Load cuda 10.0.130 or 10.1.168 or 10.1.243 first
module load libs/cuDNN/7.6.2        # Load cuda 10.0.130 or 10.1.168 first
module load libs/cuDNN/7.2.1        # Load cuda 9.2.148 first
module load libs/cuDNN/7.1.3        # Load cuda 9.0.176 or 9.1.85 first
module load libs/cuDNN/6.0.21       # Load cuda 7.5.18 first
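
For example, to compile a code that calls cuDNN (a minimal sketch only – myapp.cu is a hypothetical source file, and we assume the modulefiles set up the include and library search paths):

module load libs/cuda/11.6.2
module load libs/cuDNN/8.4.0
nvcc -o myapp myapp.cu -lcudnn -lcudart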

NCCL – Collective Communication

The Nvidia Collective Communications Library (NCCL) is also available via the following modulefiles. Before you load one of these you must first load a cuda modulefile from above – the list below indicates which versions of cuda can be used with each NCCL version:

module load libs/nccl/2.12.10      # Load cuda 11.6.2 first
module load libs/nccl/2.8.3        # Load cuda 11.2.0 or 11.1.1 or 11.0.3 first
module load libs/nccl/2.5.6        # Load cuda 9.2.148 or 10.0.130 or 10.1.168 or 10.1.243 first
module load libs/nccl/2.4.7        # Load cuda 9.2.148 or 10.0.130 or 10.1.168 first
module load libs/nccl/2.2.13       # Load cuda 8.0.61 or 9.0.176 or 9.2.148 first
module load libs/nccl/7.1.3        # Load cuda 9.0.176 or 9.1.85 first
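
Linking against NCCL is similar to cuDNN (again a sketch only – myapp.cu is a hypothetical source file and the modulefiles are assumed to set up the include and library search paths):

module load libs/cuda/11.6.2
module load libs/nccl/2.12.10
nvcc -o myapp myapp.cu -lnccl -lcudart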

TensorRT

TensorRT provides C++ and Python APIs that allow you to express deep learning models via the Network Definition API, or to load a pre-defined model via its parsers, so that TensorRT can optimize and run them on an Nvidia GPU. It can also be used by TensorFlow to improve latency and throughput for inference on some models.

module load libs/tensorrt/7.2.2    # Load cuda 11.1.1 or 11.0.3 and libs/cuDNN/8.0.5 or libs/cuDNN/8.0.4 first
module load libs/tensorrt/6.0.1    # Load cuda 10.0.130 or 10.1.168 or 10.1.243 and libs/cuDNN/7.6.5 first
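
For example, to build against the TensorRT C++ API (a sketch only – myapp.cpp is a hypothetical source file; libnvinfer is the core TensorRT library and the modulefiles are assumed to set up the header and library search paths):

module load libs/cuda/11.1.1
module load libs/cuDNN/8.0.5
module load libs/tensorrt/7.2.2
g++ -std=c++11 -o myapp myapp.cpp -lnvinfer -lcudart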

Compiling GPU Code

The following sections describe how to compile CUDA and OpenCL code on CSF.

CUDA

CUDA code can be compiled on the login node provided you are using the CUDA runtime library and not the CUDA driver library. The runtime library is used when you let CUDA set up the device automatically – that is, your code assumes the device will be initialised implicitly on the first CUDA runtime call. We recommend this method because it makes it easy for your code to run on the correct GPU device when your job runs on a multi-GPU compute node.

For example:

#include <cuda_runtime.h>

int main( void ) {

   // We assume CUDA will set up the GPU device automatically
   cudaMalloc( ... );
   cudaMemcpy( ... );
   myKernel<<<...>>>( ... );
   cudaMemcpy( ... );
   cudaFree( ... );
   return 0;
}

The CUDA driver library allows much more low-level control of the GPU device (and makes CUDA device set-up more like that of OpenCL). If you use it you must compile on a GPU node, because the driver library is only available on the backend GPU nodes. Driver-API code will contain something like the following:

#include <cuda.h>

int main( void ) {

  // Low-level device setup using the driver API
  cuDeviceGetCount( ... );
  cuDeviceGet( ... );
  cuDeviceGetName( ... );
  cuDeviceComputeCapability( ... );
  ...

  return 0;
}

No matter where you compile your code, you cannot run it on the login node because the login node does not contain any GPUs (see Running the application below for how to run your code).

The CUDA libraries and header files are available in the following directories once you have loaded the CUDA module:

# All nodes
$CUDA_HOME/lib64     # CUDA runtime library, cuBLAS, cuRAND, etc.
$CUDA_HOME/include

# On a GPU node only
/usr/lib64           # CUDA driver library
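
For example, after loading a cuda modulefile you can inspect these locations:

echo $CUDA_HOME                 # Root of the loaded CUDA installation
ls $CUDA_HOME/lib64             # Runtime library, cuBLAS, cuRAND, etc.
ls $CUDA_HOME/include           # CUDA header files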

Compiling CUDA code using nvcc

It is beyond the scope of this page to give a tutorial on CUDA compilation and all of the possible flags that can be passed to the Nvidia compilers. The CUDA GPU Programming SDK, available on the CSF in $CUDA_SDK (see CUDA and OpenCL SDK Examples below), gives many examples of CUDA programs and how to compile them. However, we give a basic compile command below.

Note that the Nvidia v100 GPUs available in CSF3 use the sm_70 architecture and the A100 GPUs use sm_80.

The following can be run on the login node or a GPU node. In most cases it is NOT necessary to be on a GPU node when you compile. A simple compile line to run on the command line would be as follows:

# Note: the v100 GPUs use sm_70, the A100_GPUs use sm_80
nvcc -o myapp myapp.cu -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -I$CUDA_HOME/include -L$CUDA_HOME/lib64 -lcudart

To use the above line in a Makefile, enclose the variable names in brackets as follows

# Simple CUDA Makefile
CC = nvcc
FLAGS = -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80

all: myapp

myapp: myapp.cu
        $(CC) -o myapp myapp.cu $(FLAGS) -I$(CUDA_HOME)/include -L$(CUDA_HOME)/lib64 -lcudart
#
# note: the preceding line must start with a TAB, not 8 spaces. 'make' requires a TAB!

The above two compilation methods use the CUDA runtime library (libcudart) and so can be used to compile on the login node.
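
If your code uses the driver API instead, compile it on a GPU node (for example in an interactive session) and link against the driver library. A sketch only, with myapp.cu as a hypothetical source file:

# Run on a GPU node - the driver library (libcuda) is only present in /usr/lib64 there
nvcc -o myapp myapp.cu -I$CUDA_HOME/include -lcuda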

OpenCL

Please see OpenCL programming on CSF for compiling OpenCL code.

Running the application

All work on the Nvidia GPUs must be via the batch system. There are two types of environments which can be used:

  • Batch: for non-interactive computational work – this should be used where possible.
  • Interactive: for debugging and other necessarily interactive work.

Please see running GPU jobs for information on how to run in both environments.
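
As a rough illustration only (the exact GPU resource-request syntax is given on the running GPU jobs page – the v100 request below is an assumption used purely for illustration), a batch jobscript might look like:

#!/bin/bash --login
#$ -cwd              # Run the job from the current directory
#$ -l v100=1         # Hypothetical resource request for one v100 GPU - check the GPU jobs page

module load libs/cuda/11.6.2
./myapp              # Your compiled CUDA executable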

Profiling Tools

The various Nvidia profiler tools are installed as part of the CUDA toolkit. Please see the CSF documentation on how to run GPU jobs for a list of installed GPU profilers. This page also includes links to the Nvidia profiler online documentation.
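
For example, recent CUDA toolkits include the Nsight Systems command-line profiler, which can be run on a GPU node as part of your job – a sketch only, with myapp as a hypothetical executable:

nsys profile -o myapp-report ./myapp    # Writes a report file that can be opened in the Nsight Systems GUI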

CUDA and OpenCL SDK Examples (e.g., deviceQuery)

The CUDA SDK contains many example CUDA and OpenCL programs which can be compiled and run. A useful one is deviceQuery (and oclDeviceQuery) which gives you lots of information about the Nvidia GPU hardware.

Version 8.0.61 and later

In CUDA 8 and later there is no separate SDK installation directory. Instead, the CUDA toolkit (which provides the nvcc compiler, profilers and numerical libraries) also contains a samples directory. The examples have already been compiled, but you may also take a copy of the samples so that you can modify them. You can access the samples by loading a CUDA modulefile and then going into the directory:

cd $CUDA_SAMPLES

The compiled samples can be found in the following directory:

cd $CUDA_SAMPLES/bin/x86_64/linux/release/

As always, running the samples on the login node won’t work – there’s no GPU there!
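
If you want to modify and rebuild a sample, take your own copy first – a sketch only (the sub-directory layout and build method vary between CUDA versions, so adjust the paths to match your copy):

cp -r $CUDA_SAMPLES ~/cuda-samples
cd ~/cuda-samples/1_Utilities/deviceQuery    # Hypothetical path - check your copy
make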

The CUDA and OpenCL example programs are just like any other GPU code so please see the instructions earlier on running code either in batch or interactively on a GPU node.

Further info

Applications and compilers which can use the Nvidia GPUs are being installed on the CSF. Links to the relevant documentation will be added here as they become available.
