Kaiju Nodes

Specification

Kaiju1-5 are servers, each hosting two Nvidia GPUs. Each host comprises:

  • Two six-core Intel Sandy Bridge CPUs
  • 32 GB RAM

Nvidia GPUs per host:

  • Two Tesla K20c, 5GB GPU memory, 2496 CUDA cores, CUDA compute capability 3.5

For full specifications, please run deviceQuery after logging in to a k20 node (see below for how to do this correctly) using the commands:

module load libs/cuda            # Load the most recent CUDA modulefile
                                 # (other versions of CUDA are available)
deviceQuery
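Other CUDA versions can be listed and loaded via the module system (the exact versions installed will vary over time, so treat the second line as a template rather than a specific command):

module avail libs/cuda                 # List all installed CUDA modulefiles
module load libs/cuda/<version>        # Load a specific version instead of the most recent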

The CUDA driver is v390.46.

To assist with running Amber on the GPUs, both GPUs have been configured to provide:

  • Persistence (nvidia-smi -pm 1)
  • Compute Exclusive Mode (nvidia-smi -c 3)

These settings are applied at boot (see /etc/rc.local on kaiju).
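You can confirm these settings yourself once logged in to a kaiju node; a minimal check, using nvidia-smi's CSV query output:

nvidia-smi --query-gpu=index,persistence_mode,compute_mode --format=csv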

Please note that peer-to-peer copying between the two GPUs in a single node is not possible. The motherboards in the nodes do not permit the GPUs to be seated such that they use the same IOH (I/O hub).
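If you wish to confirm this yourself, deviceQuery reports whether peer access is possible between each pair of visible devices when you hold both GPUs (i.e. a k20duo session). A sketch:

module load libs/cuda
deviceQuery | grep -i peer       # Expect peer access between the two GPUs to be reported as "No"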

Getting Access to Kaiju Nodes

To gain access, please email the ITS RI team at its-ri-team@manchester.ac.uk.

Restrictions on Access

Priority is given to those who funded the system, but other University of Manchester academics and computational researchers may be granted access for evaluation and pump-priming purposes.

Accessing the Host Node

Those who have been given access to kaiju can log in to a node for interactive use or submit traditional batch jobs to the nodes (similar to the CSF):

For interactive use

From the Zrek login node, use qrsh to log in to a kaiju node. This will give you a command-line on a kaiju node and you can then run GUI apps or non-GUI compute apps:

  • To reserve one (of the two) k20 GPUs in one of the hosts:
    qrsh -l k20 bash
  • To reserve both k20 GPUs in one of the hosts:
    qrsh -l k20duo bash

Reminder: run the above commands on the Zrek login node! No password will be required when connecting from the zrek login node.

Once you have logged in to a Kaiju node, load any modulefiles (see below) required for your applications.
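For example, a minimal interactive session on a single k20 GPU might look like the following (run deviceQuery or your own application as required):

# On the zrek login node: reserve one k20 GPU and start a shell on a kaiju node
qrsh -l k20 bash

# Now on the kaiju node: load the CUDA modulefile and check which GPU we hold
module load libs/cuda
echo $CUDA_VISIBLE_DEVICES
deviceQuery

# When finished, log out to release the GPU
exit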

You can also open more terminals (command-line windows) on that node by running:

xterm &

For traditional batch jobs

From the Zrek login node, batch (non-interactive) jobs can be submitted using qsub jobscript. The jobscript should contain the following line to run on a single GPU:

#$ -l k20

or, to run on both GPUs in the same host:

#$ -l k20duo
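For example, a complete jobscript for a single-GPU run might look like the following sketch (the final application line is illustrative; replace it with your own program, e.g. pmemd.cuda for Amber):

#!/bin/bash
#$ -cwd                    # Run the job from the current directory
#$ -l k20                  # Request one k20 GPU (use k20duo for both GPUs in a host)

module load libs/cuda      # Load the CUDA modulefile (plus any application modulefiles)

echo "Allocated GPU(s): $CUDA_VISIBLE_DEVICES"

# Replace with your own GPU application
deviceQuery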

Once you have submitted the batch job you can even log out of zrek; the job will remain in the queue and the batch system will run it when a suitable GPU node becomes free.
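For example, assuming the jobscript above is saved in a file named jobscript (the filename is illustrative):

qsub jobscript             # Submit the batch job from the zrek login node
qstat                      # Check the state of your queued and running jobs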

Using the GPUs

Once you have been allocated a GPU by either qrsh or qsub you will have exclusive access to that GPU. The environment variable

CUDA_VISIBLE_DEVICES

will be set to either 0 or 1, or 0,1 to indicate which GPU(s) in the host you have access to.

Most CUDA-capable applications will use this variable to determine which GPU to use at runtime (e.g., pmemd.cuda and MATLAB will honour this setting). You should NOT assume you have been given the first GPU in the system – another user may be using that. Hence if you have the option of specifying a fixed GPU id in your software you should generally not do so – let the CUDA library use the above environment variable instead.
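Note that the CUDA runtime renumbers the devices listed in CUDA_VISIBLE_DEVICES, so your application always sees its first (or only) allocated GPU as device 0. For example, if the scheduler has allocated physical GPU 1 to your session:

echo $CUDA_VISIBLE_DEVICES
1
deviceQuery | grep ^Device

   Device 0: "Tesla K20m"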

You can determine which GPU you have been allocated as follows:

echo $CUDA_VISIBLE_DEVICES

It will show either 0 or 1 or 0,1. If you don’t see any output then you have logged in to the node incorrectly. Log out (use exit) and log back in again using the method above.

Note: if you want to open more terminals (command-line windows) to run other programs on the same Kaiju node, simply run

xterm &

to get a new window.

The CUDA toolkit contains an application named deviceQuery to report device properties. It will display properties for the device(s) you have access to. For example:

module load libs/cuda

# Report what we have been allocated (in this example we requested  'k20duo'):
echo $CUDA_VISIBLE_DEVICES
0,1
deviceQuery | grep ^De
  
   Detected 2 CUDA Capable device(s)
   Device 0: "Tesla K20m"
   Device 1: "Tesla K20m"

It is possible to check whether a GPU is free by running:

nvidia-smi

This will show stats about the two GPUs. For example, the following output shows that GPU 0 is busy running pmemd.cuda:

[mxyzabc1@kaiju1 ~]$ nvidia-smi
+------------------------------------------------------+                       
| NVIDIA-SMI 352.29     Driver Version: 352.39         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20m          On   | 0000:04:00.0     Off |                    0 |
| N/A   61C    P0   134W / 225W |      627MB /  4799MB |     99%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20m          On   | 0000:42:00.0     Off |                    0 |
| N/A   22C    P8    15W / 225W |       13MB /  4799MB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0     24458  pmemd.cuda                                           612MB  |
+-----------------------------------------------------------------------------+

The GPUs are set up to run in compute-exclusive mode, so if you try to use a GPU that is already in use your application will fail (typically with a CUDA error along the lines of "all CUDA-capable devices are busy or unavailable").
