Chainer

Overview

Chainer is a flexible framework for neural networks in Python with support for GPU computation. Chainer adopts a Define-by-Run scheme, i.e., the network is defined on the fly via the actual forward computation.
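
For illustration, here is a minimal Define-by-Run sketch (using the Chainer 2+ API; the network and sizes below are arbitrary): the computational graph is recorded while the forward pass runs as ordinary Python code.

import numpy as np
import chainer
import chainer.functions as F
import chainer.links as L

class MLP(chainer.Chain):
    def __init__(self):
        super(MLP, self).__init__()
        with self.init_scope():
            self.l1 = L.Linear(None, 100)   # input size inferred on first call
            self.l2 = L.Linear(100, 10)

    def __call__(self, x):
        # The computational graph is built here, during execution
        h = F.relu(self.l1(x))
        return self.l2(h)

model = MLP()
x = np.random.rand(4, 784).astype(np.float32)
y = model(x)        # forward pass defines the network on the fly
print(y.shape)      # (4, 10)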

Versions 1.22.0, 2.0.0, 3.0.0, 4.2.0, 4.5.0 and 5.2.0 are installed on zrek. The installations support CUDA and cuDNN (and NCCL as of 4.5.0) on the K20 and K40 GPUs.

Restrictions on use

There are no access restrictions on zrek.

Supported Backend Nodes

This application is available on the Nvidia GPU nodes: besso and kaiju{[1-5],101}. Please see the K40 node instructions and the K20 node instructions for how to access the nodes.

Set up procedure

To access the software you must first load one of the modulefiles:

# Python 3.6
module load apps/gcc/python-packages/anaconda3-5.2.0/chainer/5.2.0
module load apps/gcc/python-packages/anaconda3-5.2.0/chainer/4.5.0

# Python 3.5.2
module load apps/gcc/python-packages/anaconda3-4.2.0/chainer/4.2.0
module load apps/gcc/python-packages/anaconda3-4.2.0/chainer/3.0.0

# Python 3.5.1
module load apps/gcc/python-packages/anaconda3-2.4.1/chainer/2.0.0
module load apps/gcc/python-packages/anaconda3-2.4.1/chainer/1.22.0

# Python 2.7.11
module load apps/gcc/python-packages/anaconda-2.5.0/chainer/1.22.0

This will automatically load the corresponding Anaconda Python modulefile (Anaconda3 v5.2.0, v4.2.0, v2.4.1 or Anaconda v2.5.0, which provide Python 3.6, 3.5.2, 3.5.1 and 2.7.11 respectively) and also the latest CUDA and cuDNN modulefiles (run module list after loading the modulefile to confirm).

Running the application

Please do not run Chainer on the login node. Jobs should be run interactively on the backend nodes (via qrsh) or submitted to the compute nodes via batch.

The following instructions describe interactive use on a backend node and batch jobs from the login node.

Interactive use on a Backend Node

To see the commands used to log in to a particular backend node, run the following on the zrek login node:

backends

Once you have logged in to a backend K20 or K40 node (using qrsh) and loaded the modulefile there (see above), run:

python
   #
   # You can now type your Chainer code into python
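
Once inside python, a quick way to confirm that Chainer and the GPU libraries are available (a minimal sketch; the cuda module lives in chainer.backends from Chainer 4.0 onwards and in chainer.cuda in earlier versions):

import chainer
print(chainer.__version__)

from chainer.backends import cuda   # use "from chainer import cuda" before 4.0
print(cuda.available)               # True if the CUDA toolkit is usable
print(cuda.cudnn_enabled)           # True if cuDNN is usable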

Serial batch job submission

Do not log in to a backend node. The job must be submitted from the zrek login node. Ensure you have loaded the correct modulefile on the login node and then create a jobscript similar to the following:

#!/bin/bash
#$ -S /bin/bash
#$ -cwd                   # Run job from directory where submitted
#$ -V                     # Inherit environment (modulefile) settings
#$ -l k20                 # Select a single GPU (Nvidia K20) node
                          # Or use: #$ -l k40

python my_chainer_code.py

Submit your jobscript from the zrek login node using

qsub jobscript

where jobscript is the name of your jobscript.
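
For reference, a minimal sketch of what a script such as my_chainer_code.py might contain (the content below is illustrative only and assumes the Chainer 3+ API); it selects the single GPU assigned to the job, which is always device 0:

import numpy as np
import chainer
from chainer.backends import cuda   # use "from chainer import cuda" before 4.0

cuda.get_device_from_id(0).use()    # the assigned GPU is always device 0
model = chainer.links.Linear(784, 10)
model.to_gpu(0)                     # copy parameters to the GPU

x = cuda.to_gpu(np.random.rand(32, 784).astype(np.float32))
y = model(x)                        # forward pass runs on the GPU
print(y.shape)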

Examples

The Chainer examples are installed in the directory

$CHAINER_EXAMPLES

For example, to see the mnist files:

ls $CHAINER_EXAMPLES/mnist/

Single GPU Example

To run the mnist example interactively on a K20 node, do the following from the zrek login node:

# This command is run on the zrek login node
qrsh -l k20 bash          # Or use k40

# Wait until you are logged in to a GPU node, then
module load apps/gcc/python-packages/anaconda3-5.2.0/chainer/5.2.0

# The example will download some data, so run in your scratch area
cd ~/scratch

# Run the example code on the GPU 
$CHAINER_EXAMPLES/mnist/train_mnist.py --gpu 0
    #                                        #
    #                                        # Always use 0 here (see below)
    #
    # The example will download data then process it on the GPU.
    # Remove the --gpu 0 flag to run on the CPU.

# Return to the login node when done to free the compute node for another user
exit

The --gpu 0 flag tells the code to use the first GPU that the CUDA library can see. When you ran qrsh you were assigned one of the two GPUs in the compute node. To see which, run:

echo $CUDA_VISIBLE_DEVICES
  #
  # This will report 0 or 1

This may show that you have been assigned GPU number 1, i.e., the second GPU in the compute node. Even so, do not pass 1 to the --gpu flag when running the mnist example. Passing 0 makes the example use the first GPU that CUDA can see, and the CUDA_VISIBLE_DEVICES variable determines which physical GPU that is.
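
You can also check from within Python how many GPUs are visible (a sketch assuming CuPy, which the GPU-enabled Chainer installations use, is available on the node):

from chainer.backends import cuda   # use "from chainer import cuda" before 4.0
print(cuda.cupy.cuda.runtime.getDeviceCount())
    # Reports 1 in a single-GPU session; device 0 always refers to the
    # first entry in CUDA_VISIBLE_DEVICES, whichever physical GPU that is.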

Dual GPU Example

To run on two GPUs interactively:

# This command is run on the zrek login node (you must reserve two GPUs)
qrsh -l k20duo bash

# Wait until you are logged in to a GPU node, then load a chainer modulefile:
module load apps/gcc/python-packages/anaconda3-5.2.0/chainer/5.2.0

# The example will download some data, so run in your scratch area
cd ~/scratch

# To confirm you have access to both GPUs:
echo $CUDA_VISIBLE_DEVICES
0,1

# Run the parallel example on both GPUs (use -g and -G to indicate GPU IDs)
$CHAINER_EXAMPLES/mnist/train_mnist_data_parallel.py -g 0 -G 1

# Return to the login node when done to free the compute node for another user
exit
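
The data-parallel example uses Chainer's ParallelUpdater to split each minibatch across the two GPUs. A minimal sketch of the same idea (illustrative code, not the example's actual contents; assumes the Chainer 4+ training API):

import chainer
import chainer.links as L
from chainer import training, iterators, optimizers
from chainer.datasets import get_mnist

train, _ = get_mnist()
model = L.Classifier(L.Linear(784, 10))
optimizer = optimizers.Adam()
optimizer.setup(model)
train_iter = iterators.SerialIterator(train, batch_size=256)

# 'main' and 'second' are the two GPUs exposed via CUDA_VISIBLE_DEVICES
updater = training.updaters.ParallelUpdater(
    train_iter, optimizer, devices={'main': 0, 'second': 1})
trainer = training.Trainer(updater, (1, 'epoch'), out='result')
trainer.run()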

Further info

Updates

None.
