Kaiju P2P Node

Specification

Kaiju101 is a server hosting two Nvidia GPUs for peer-to-peer dual-GPU codes. The host comprises:

    • Dell T630 host
    • Two six-core Intel Sandy Bridge CPUs
    • 16 GB RAM (this is half of what is in the other kaiju nodes!)

Nvidia GPUs per host:

  • Two Tesla K20c, 5 GB GPU memory, 2496 CUDA cores, CUDA compute capability 3.5

For full specifications, please run deviceQuery using:

module load libs/cuda          # Load the most recent CUDA modulefile
deviceQuery

The CUDA driver is v390.46.

The peer-to-peer node allows the two GPUs to communicate over the IO-Hub, which should make dual-GPU codes run faster: if the GPUs need to exchange data, they can do so without going through main CPU memory.
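
You can inspect the GPU interconnect topology yourself; for example, recent driver versions provide a topology matrix via nvidia-smi (a quick check only, assuming the installed driver supports the topo option):

# Show how the two GPUs are connected to each other (e.g. PIX / PHB / SYS)
nvidia-smi topo -m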

To assist with running Amber on the GPUs, both GPUs have been configured as follows:

  • Persistence mode is enabled (nvidia-smi -pm 1)
  • Compute-exclusive mode is not enabled, to allow peer-to-peer communication.

These settings are applied at boot (see /etc/rc.local on kaiju).
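
If you want to confirm these settings on the node you have been allocated, nvidia-smi can report them (a convenience check only; the exact query fields available depend on the driver version):

# Report the current persistence and compute modes of both GPUs
nvidia-smi --query-gpu=index,persistence_mode,compute_mode --format=csv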

Getting Access to Kaiju Nodes

To gain access, please email the ITS RI team at its-ri-team@manchester.ac.uk.

Restrictions on Access

Priority is given to those who funded the system, but other University of Manchester academics and computational researchers may be granted access for evaluation and pump-priming purposes.

Accessing the Host Node

Those who have been given access to kaiju can log in to the node for interactive use or submit traditional batch jobs to the node (similar to the CSF):

For interactive use

From the Zrek login node, use qrsh to log in to the kaiju peer-to-peer node. This will give you a command line on the p2p node, from which you can run GUI apps or non-GUI compute apps:

  • To reserve both K20 GPUs in the host:
    qrsh -l k20duo_p2p bash

    Note that we do not allow individual GPUs in a peer-to-peer host to be reserved. You must reserve both GPUs. This node should only be used for peer-to-peer capable applications where two GPUs will be used.

Reminder: run the above command on the Zrek login node! No password will be required when connecting from the zrek login node.

Once you have logged in to a Kaiju node, load any modulefiles (see below) required by your applications.
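
For example, to make the CUDA tools used later on this page available (other application modulefiles can be listed with module avail):

module avail              # List the modulefiles installed on the system
module load libs/cuda     # Load the most recent CUDA modulefile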

You can also open more terminals (command-line windows) on that node by running:

xterm &

To verify that peer-to-peer communication is enabled:

module load libs/cuda

# Run the simple p2p communication test
simpleP2P

# Run the bandwidth test
p2pBandwidthLatencyTest

For traditional batch jobs

From the Zrek login node, batch jobs (non-interactive) can be submitted using qsub jobscript. The jobscript should contain the following line to run on both GPUs in the same host:

#$ -l k20duo_p2p
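
A minimal jobscript might look like the following sketch. Only the k20duo_p2p resource request is taken from this page; the other lines (the -cwd flag and the application name) are illustrative and should be adapted to your own application:

#!/bin/bash
#$ -cwd                    # Run the job from the current working directory
#$ -l k20duo_p2p           # Request both GPUs in the p2p host

module load libs/cuda      # Load the most recent CUDA modulefile

# Run a peer-to-peer capable dual-GPU application (placeholder name)
./my_p2p_app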

Once you have submitted the batch job you can even log out of zrek – the job will remain in the queue and will run when the p2p node becomes free.

Using the GPUs

Once you have been allocated a GPU by either qrsh or qsub, you will have exclusive access to that GPU. The environment variable

CUDA_VISIBLE_DEVICES

will be set to either 0 or 1, or 0,1 to indicate which GPU(s) in the host you have access to. Most applications will use this variable to determine which GPU to use at runtime (e.g., pmemd.cuda and MATLAB will honour this setting).

You can determine which GPU you have been allocated as follows:

echo $CUDA_VISIBLE_DEVICES

It will show either 0 or 1 or 0,1. If you don’t see any output then you have logged in to the node incorrectly. Log out (use exit) and log back in again using the method above.
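
In a script you can use the same variable as a sanity check before launching your application; the following is only a sketch of that idea:

# Stop early if no GPUs have been allocated (CUDA_VISIBLE_DEVICES unset or empty)
if [ -z "$CUDA_VISIBLE_DEVICES" ]; then
    echo "No GPUs allocated - log out and log back in using qrsh as above" >&2
    exit 1
fi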

Note: if you want to open more terminals (command-line windows) to run other programs on the same Kaiju node, simply run

xterm &

to get a new window.

The CUDA toolkit contains an application named deviceQuery to report device properties. It will display properties for the device(s) you have access to. For example:

module load libs/cuda

# Report what we have been allocated (in this example we requested 'k20duo_p2p'):
echo $CUDA_VISIBLE_DEVICES
0,1
deviceQuery | grep ^De

   Detected 2 CUDA Capable device(s)
   Device 0: "Tesla K20m"
   Device 1: "Tesla K20m"

It is possible to check whether a GPU is free by running:

nvidia-smi

This will show stats about the two GPUs. For example, the following output shows that GPU 0 is busy running pmemd.cuda:

[mxyzabc1@kaiju1 ~]$ nvidia-smi
+------------------------------------------------------+                       
| NVIDIA-SMI 352.39     Driver Version: 352.39         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20m          On   | 0000:04:00.0     Off |                    0 |
| N/A   61C    P0   134W / 225W |      627MB /  4799MB |     99%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20m          On   | 0000:42:00.0     Off |                    0 |
| N/A   22C    P8    15W / 225W |       13MB /  4799MB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0     24458  pmemd.cuda                                           612MB  |
+-----------------------------------------------------------------------------+
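
If you only need a quick summary rather than the full table, nvidia-smi can also report selected fields (a convenience only; the fields available depend on the driver version):

# Compact per-GPU summary: index, name, utilisation and memory in use
nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used --format=csv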

Note: as described above, the GPUs in the p2p node are not run in compute-exclusive mode (this is to allow peer-to-peer communication), so the driver will not stop a second application from using a GPU that is already busy. Check nvidia-smi before starting work to make sure both GPUs are free.

Last modified on May 1, 2018 at 8:40 am by George Leaver