Kaiju Nodes
Specification
Kaiju1-5 are servers, each hosting two Nvidia GPUs. Each host comprises:
- Two six-core Intel Sandy Bridge CPUs
- 32 GB RAM
Nvidia GPUs per host:
- Two Tesla K20c, 5GB GPU memory, 2496 CUDA cores, CUDA compute capability 3.5
For full specifications, run deviceQuery after logging in to a k20 node (see below for how to do this correctly) using the commands:
module load libs/cuda     # Load the most recent CUDA modulefile (other versions of CUDA are available)
deviceQuery
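To see which other CUDA versions are installed, you can list the available modulefiles (a minimal sketch, assuming the libs/cuda naming shown above):
module avail libs/cuda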
The CUDA driver is v390.46.
To assist with running Amber on the GPUs, both GPUs have been configured to provide:
- Persistence mode (nvidia-smi -pm 1)
- Compute-exclusive mode (nvidia-smi -c 3)
These settings are applied at boot (see /etc/rc.local on kaiju).
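In other words, the boot script runs (approximately) the two commands listed above; a minimal sketch of those lines (the actual /etc/rc.local on kaiju may contain additional setup):
nvidia-smi -pm 1    # enable persistence mode so the driver stays loaded between jobs
nvidia-smi -c 3     # set compute-exclusive (EXCLUSIVE_PROCESS) mode: one process per GPU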
Please note that peer-to-peer copying between the two GPUs in a single node is not possible: the motherboards in the nodes do not permit the GPUs to be seated such that they use the same IOH (I/O hub).
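If you want to confirm this yourself, deviceQuery performs a peer-access check when it detects both GPUs (so run it from a k20duo session, see below). A minimal sketch; the exact wording of the output varies between CUDA versions:
module load libs/cuda
deviceQuery | grep -i "peer access"    # expect the report to show that peer access is not available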
Getting Access to Kaiju Nodes
To gain access, please email the ITS RI team at its-ri-team@manchester.ac.uk.
Restrictions on Access
Priority is given to those who funded the system, but other University of Manchester academics and computational researchers may be given access for evaluation and pump-priming purposes.
Accessing the Host Node
Those who have been given access to kaiju can log in to a node for interactive use or submit traditional batch jobs to it (similar to the CSF):
For interactive use
From the Zrek login node, use qrsh to log in to a kaiju node. This will give you a command line on a kaiju node, from which you can run GUI apps or non-GUI compute apps:
- To reserve one (of the two) k20 GPUs in one of the hosts:
qrsh -l k20 bash
- To reserve both k20 GPUs in one of the hosts:
qrsh -l k20duo bash
Reminder: run the above commands on the Zrek login node! No password will be required when connecting from the Zrek login node.
Once you have logged in to a Kaiju node, load any modulefiles (see below) required by your applications.
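Putting the above together, a minimal interactive session might look like this (the deviceQuery step is just an example; substitute your own application):
qrsh -l k20 bash        # run on the Zrek login node
module load libs/cuda   # now on a kaiju node: load the modulefiles you need
deviceQuery             # then run your GPU application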
You can also open more terminals (command-line windows) on that node by running:
xterm &
For traditional batch jobs
From the Zrek login node, batch (non-interactive) jobs can be submitted using qsub jobscript. The jobscript should contain the following line to run on a single GPU:
#$ -l k20
or, to run on both GPUs in the same host:
#$ -l k20duo
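For example, a complete single-GPU jobscript might look like the following minimal sketch (the -cwd option, modulefile, and application name are illustrative; adapt them to your own software):
#!/bin/bash
#$ -cwd                 # run the job from the current directory
#$ -l k20               # request one k20 GPU (use k20duo for both)
module load libs/cuda   # load the modulefiles your application needs
./my_gpu_app            # hypothetical application name - replace with your own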
Once you have submitted the batch job you can even log out of Zrek; the job will remain in the system and will run when a suitable GPU node becomes free.
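You can check on your queued and running jobs from the Zrek login node with the standard batch-system command:
qstat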
Using the GPUs
Once you have been allocated a GPU by either qrsh or qsub you will have exclusive access to that GPU. The environment variable CUDA_VISIBLE_DEVICES will be set to 0, 1, or 0,1 to indicate which GPU(s) in the host you have access to.
Most CUDA-capable applications will use this variable to determine which GPU to use at runtime (e.g., pmemd.cuda and MATLAB will honour this setting). You should NOT assume you have been given the first GPU in the system; another user may be using that one. Hence, if your software allows you to specify a fixed GPU id, you should generally not do so; let the CUDA library use the above environment variable instead.
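For example, with Amber you would simply launch pmemd.cuda and let it pick up the GPU from CUDA_VISIBLE_DEVICES; a minimal sketch with illustrative input/output filenames:
pmemd.cuda -O -i mdin -o mdout -p prmtop -c inpcrd   # no GPU id specified - the allocated GPU is used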
You can determine which GPU you have been allocated as follows:
echo $CUDA_VISIBLE_DEVICES
It will show 0, 1, or 0,1. If you don't see any output then you have logged in to the node incorrectly. Log out (use exit) and log back in again using the method above.
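If you want your own jobscripts to fail early when no GPU has been allocated, a minimal sketch of such a check is:
if [ -z "$CUDA_VISIBLE_DEVICES" ]; then
    echo "No GPU allocated - request one with 'qrsh -l k20 bash' or '#$ -l k20'" >&2
    exit 1
fi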
Note: if you want to open more terminals (command-line windows) to run other programs on the same Kaiju node, simply run the following to get a new window:
xterm &
The CUDA toolkit contains an application named deviceQuery to report device properties. It will display properties for the device(s) you have access to. For example:
module load libs/cuda

# Report what we have been allocated (in this example we requested 'k20duo'):
echo $CUDA_VISIBLE_DEVICES
0,1

deviceQuery | grep ^De
Detected 2 CUDA Capable device(s)
Device 0: "Tesla K20m"
Device 1: "Tesla K20m"
It is possible to check whether a GPU is free by running:
nvidia-smi
This will show stats about the two GPUs. For example, the following output shows that GPU 0 is busy running pmemd.cuda:
[mxyzabc1@kaiju1 ~]$ nvidia-smi
+------------------------------------------------------+
| NVIDIA-SMI 352.29     Driver Version: 352.39         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20m          On   | 0000:04:00.0     Off |                    0 |
| N/A   61C    P0   134W / 225W |    627MB /  4799MB   |     99%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20m          On   | 0000:42:00.0     Off |                    0 |
| N/A   22C    P8    15W / 225W |     13MB /  4799MB   |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0     24458  pmemd.cuda                                          612MB   |
+-----------------------------------------------------------------------------+
The GPUs are set up to run in compute-exclusive mode, so if you try to use a GPU that is already in use your application will fail.
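You can also confirm the compute mode directly (a minimal sketch; the available query options depend on the installed nvidia-smi version):
nvidia-smi --query-gpu=index,name,compute_mode --format=csv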