Chainer
Overview
Chainer is a flexible framework for neural networks in Python supporting use of GPUs. Chainer adopts a Define-by-Run scheme, i.e., the network is defined on-the-fly via the actual forward computation.
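As a small illustration of Define-by-Run (not one of the installed examples; the layer sizes and data below are arbitrary), the computational graph is recorded as the forward pass executes:

import numpy as np
import chainer
import chainer.functions as F
import chainer.links as L

x = chainer.Variable(np.random.rand(4, 3).astype(np.float32))
layer = L.Linear(3, 2)        # a small fully-connected layer
y = F.relu(layer(x))          # the graph is built by running this line
loss = F.sum(y)
loss.backward()               # gradients flow back through the graph just recorded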
Versions 1.22.0, 2.0.0, 3.0.0, 4.5.0 and 5.2.0 are installed on zrek. All support CUDA and cuDNN (and, as of 4.5.0, NCCL) on the K20 and K40 GPUs.
Restrictions on use
There are no access restrictions on zrek.
Supported Backend Nodes
This application is available on the Nvidia GPU nodes: besso and kaiju{[1-5],101}. Please see the K40 node instructions and the K20 node instructions for how to access the nodes.
Set up procedure
To access the software you must first load one of the modulefiles:
# Python 3.6
module load apps/gcc/python-packages/anaconda3-5.2.0/chainer/5.2.0
module load apps/gcc/python-packages/anaconda3-5.2.0/chainer/4.5.0

# Python 3.5.2
module load apps/gcc/python-packages/anaconda3-4.2.0/chainer/4.2.0
module load apps/gcc/python-packages/anaconda3-4.2.0/chainer/3.0.0

# Python 3.5.1
module load apps/gcc/python-packages/anaconda3-2.4.1/chainer/2.0.0
module load apps/gcc/python-packages/anaconda3-2.4.1/chainer/1.22.0

# Python 2.7.11
module load apps/gcc/python-packages/anaconda-2.5.0/chainer/1.22.0
This will automatically load the corresponding Anaconda Python modulefile (Anaconda3 v5.2.0, v4.2.0 or v2.4.1, or Anaconda v2.5.0, providing Python 3.6, 3.5.2, 3.5.1 or 2.7.11 respectively) and also the latest CUDA and cuDNN modulefiles (run module list after loading the modulefile to check).
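For example, after loading the 5.2.0 modulefile you could confirm what has been pulled in and that the expected version imports (a quick sanity check; the modulefiles listed will vary with the version chosen):

module list
#
# Should include the anaconda, CUDA and cuDNN modulefiles
python -c "import chainer; print(chainer.__version__)"
#
# Should print 5.2.0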
Running the application
Please do not run Chainer on the login node. Jobs should be run interactively on the backend nodes (via qrsh) or submitted to the compute nodes via batch.
The following instructions describe interactive use on a backend node and batch jobs from the login node.
Interactive use on a Backend Node
To see the commands used to log in to a particular backend node, run the following on the zrek login node:
backends
Once logged in to a backend K20 node or K40 node (using qrsh) and having loaded the modulefile there (see above), run:
python
#
# You can now type your Chainer code into python
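From the interactive prompt, a quick check that Chainer can see CUDA and cuDNN (attribute names as in the 4.x/5.x releases; purely illustrative):

import chainer
chainer.backends.cuda.available        # True if the CUDA toolkit is usable on this node
chainer.backends.cuda.cudnn_enabled    # True if cuDNN was found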
Serial batch job submission
Do not log in to a backend node. The job must be submitted from the zrek login node. Ensure you have loaded the correct modulefile on the login node and then create a jobscript similar to the following:
#!/bin/bash
#$ -S /bin/bash
#$ -cwd              # Run job from directory where submitted
#$ -V                # Inherit environment (modulefile) settings
#$ -l k20            # Select a single GPU (Nvidia K20) node
                     # Or use: #$ -l k40

python my_chainer_code.py
Submit your jobscript from the zrek login node using
qsub jobscript
where jobscript is the name of your jobscript.
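Here my_chainer_code.py stands for your own script. As a minimal sketch (model, sizes and data are placeholders, not a prescribed layout), such a script might push its work onto the GPU assigned to the job like this:

import numpy as np
import chainer
import chainer.links as L

model = L.Linear(3, 2)
model.to_gpu(0)                     # device 0 = the first GPU CUDA can see on the node

x = model.xp.asarray(np.random.rand(4, 3).astype(np.float32))
y = model(x)                        # the forward pass now runs on the GPU
print(y.shape)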
Examples
The Chainer examples are installed in the directory
$CHAINER_EXAMPLES
For example, to see the mnist files:
ls $CHAINER_EXAMPLES/mnist/
Single GPU Example
To run the mnist example interactively on a K20 node, do the following from the zrek login node:
# This command is run on the zrek login node
qrsh -l k20 bash          # Or use k40

# Wait until you are logged in to a GPU node, then
module load apps/gcc/python-packages/anaconda3-5.2.0/chainer/5.2.0

# The example will download some data, so run in your scratch area
cd ~/scratch

# Run the example code on the GPU
$CHAINER_EXAMPLES/mnist/train_mnist.py --gpu 0
#
# Always use 0 here (see below)
#
# The example will download data then process it on the GPU.
# Remove the --gpu 0 flag to run on the CPU.

# Return to the login node when done to free the compute node for another user
exit
The --gpu 0 flag tells the code to use the first GPU that the CUDA library can see. When you ran qrsh you will have been assigned one of the two GPUs in the compute node. To see which, run:
echo $CUDA_VISIBLE_DEVICES
#
# This will report 0 or 1
This could show that you have been assigned GPU number 1, i.e. the second GPU in the compute node. But you should not use 1 with the --gpu flag when running the mnist example. By passing 0 to the example, it will use the first GPU that CUDA can see, which is determined by the CUDA_VISIBLE_DEVICES variable.
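The same convention applies in your own scripts: always select device 0 and let CUDA_VISIBLE_DEVICES determine the physical GPU. A small illustrative snippet (attribute path as in the 4.x/5.x releases; not part of the installed examples):

import os
import chainer

print(os.environ.get("CUDA_VISIBLE_DEVICES"))       # the physical GPU id assigned by qrsh
chainer.backends.cuda.get_device_from_id(0).use()   # from the code's point of view this is always device 0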
Dual GPU Example
To run on two GPUs interactively:
# This command is run on the zrek login node (you must reserve two GPUs)
qrsh -l k20duo bash

# Wait until you are logged in to a GPU node, then load a chainer modulefile:
module load apps/gcc/python-packages/anaconda3-5.2.0/chainer/5.2.0

# The example will download some data, so run in your scratch area
cd ~/scratch

# To confirm you have access to both GPUs:
echo $CUDA_VISIBLE_DEVICES
0,1

# Run the parallel example on both GPUs (use -g and -G to indicate GPU IDs)
$CHAINER_EXAMPLES/mnist/train_mnist_data_parallel.py -g 0 -G 1

# Return to the login node when done to free the compute node for another user
exit
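For your own two-GPU training code, the usual Chainer pattern is a data-parallel updater. A minimal, self-contained sketch with a toy model and random data (everything here is a placeholder, not one of the installed examples; assumes the 5.2.0 modulefile loaded above):

import numpy as np
import chainer
import chainer.links as L
from chainer import training
from chainer.training import updaters

# Toy classification model and random data, purely illustrative
model = L.Classifier(L.Linear(10, 2))
optimizer = chainer.optimizers.SGD()
optimizer.setup(model)
data = [(np.random.rand(10).astype(np.float32), np.int32(i % 2)) for i in range(100)]
train_iter = chainer.iterators.SerialIterator(data, batch_size=10)

# The 'main' copy of the model runs on GPU 0, a replica on GPU 1;
# gradients are accumulated on the 'main' device
updater = updaters.ParallelUpdater(train_iter, optimizer,
                                   devices={'main': 0, 'second': 1})
trainer = training.Trainer(updater, (1, 'epoch'), out='result')
trainer.run()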
Further info
Updates
None.