TensorFlow
Overview
TensorFlow is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. The flexible architecture allows you to deploy computation to one or more CPUs or GPUs.
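The node/edge model can be illustrated in plain Python, independently of TensorFlow itself. This is a conceptual sketch only; the `Node` class and `const` helper below are invented for the illustration and are not part of any TensorFlow API:

```python
# Conceptual sketch of a data flow graph (NOT TensorFlow API):
# nodes are operations, edges carry the values (tensors) between them.

class Node:
    """An operation node; its inputs are edges from upstream nodes."""
    def __init__(self, op, *inputs):
        self.op = op
        self.inputs = inputs

    def run(self):
        # Evaluate upstream nodes first, then apply this node's operation.
        return self.op(*(n.run() for n in self.inputs))

const = lambda v: Node(lambda: v)             # a source node with no inputs
add = Node(lambda x, y: x + y, const(2), const(3))
mul = Node(lambda x, y: x * y, add, const(4))

print(mul.run())  # (2 + 3) * 4 = 20
```

Running the final node pulls values through the whole graph, which is essentially what a TensorFlow session does when it executes an operation.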
Various versions with GPU support (CUDA and cuDNN) are installed – see the modulefiles below.
Restrictions on use
There are no access restrictions on zrek.
Supported Backend Nodes
This application is available on the Nvidia GPU nodes: besso and kaiju{[1-5],101}. Please see the K40 node instructions and the K20 node instructions for how to access the nodes.
Set up procedure
To access the software you must first load one of the following modulefiles:
# Python 3.6, CUDA 9.0.176, cuDNN 7.3.0
module load apps/gcc/tensorflow/1.11.0-py36-gpu   # New

# Python 3.5, CUDA 9.0.176, cuDNN 7.0.3
module load apps/gcc/tensorflow/1.8.0-py35-gpu

# Python 3.5, CUDA 8.0.44, cuDNN 5.1.5
module load apps/gcc/tensorflow/1.2.1-py35-gpu
module load apps/gcc/tensorflow/1.0.0-py35-gpu
module load apps/gcc/tensorflow/0.12.1-py35-gpu
module load apps/gcc/tensorflow/0.11.0-py35-gpu

# Python 3.4, CUDA 7.5.18, cuDNN 4.0.7
module load apps/gcc/tensorflow/0.10.0-py34-gpu
module load apps/gcc/tensorflow/0.9.0rc0-py34-gpu
module load apps/gcc/tensorflow/0.7.1-py34-gpu

# Python 2.7, CUDA 9.0.176, cuDNN 7.0.3
# (module load 1.8.0-py27-gpu is in progress)

# Python 2.7, CUDA 8.0.44, cuDNN 5.1.5
module load apps/gcc/tensorflow/1.2.1-py27-gpu
module load apps/gcc/tensorflow/1.0.0-py27-gpu
module load apps/gcc/tensorflow/0.12.1-py27-gpu

# Python 2.7, CUDA 7.5.18, cuDNN 4.0.7
module load apps/gcc/tensorflow/0.10.0-py27-gpu
module load apps/gcc/tensorflow/0.9.0rc0-py27-gpu
module load apps/gcc/tensorflow/0.8.0-20160504-py27-gpu   # Nightly build version (dated)
module load apps/gcc/tensorflow/0.7.1-py27-gpu
The above modulefiles will load the following modulefiles automatically (you do not need to load these):
- One of the following Anaconda python modulefiles:
  - apps/binapps/anaconda/3/5.2.0 (python 3.6.5)
  - apps/binapps/anaconda/3/4.2.0 (python 3.5.2)
  - apps/binapps/anaconda/3/2.3.0 (python 3.4.3)
  - apps/binapps/anaconda/2.5.0 (python 2.7.11)
- compilers/gcc/6.3.0 (C++11 compatible compiler)
- libs/cuda/9.0.176 (Nvidia CUDA toolkit)
- libs/cuDNN/7.30 (Nvidia cuDNN toolkit)
- compilers/gcc/4.8.2 (C++11 compatible compiler)
- libs/cuda/8.0.44 (Nvidia CUDA toolkit)
- libs/cuDNN/5.1.5 (Nvidia cuDNN toolkit)
- libs/cuda/7.5.18 (Nvidia CUDA toolkit)
- libs/cuDNN/4.0.7 (Nvidia cuDNN toolkit)
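A quick way to confirm that the modulefile has put the expected interpreter on your PATH is to ask Python for its version and compare it with the pyXY tag in the modulefile name. This is just a sanity-check sketch:

```python
# Print the interpreter's (major, minor) version so it can be compared
# with the modulefile name (e.g. ...-py36-gpu should report (3, 6)).
import sys
print(sys.version_info[:2])
```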
Running the application
Please do not run TensorFlow on the login node. Jobs should be run interactively on the backend nodes (via qrsh) or submitted to the compute nodes via batch. The following instructions describe interactive use on a backend node and batch jobs from the login node.
Technical Note (you are not required to do anything – this is for information only)
- We use a modified python executable (a shell script) named python to start the usual Anaconda python interpreter. It actually runs the following:

  LD_PRELOAD=/usr/lib64/librt.so:$TFDIR/fixes/stubs/mylibc.so:$GCCDIR/lib64/libstdc++.so.6 python

- The LD_PRELOAD is needed to load a few libraries that replace system libraries. The pre-compiled TensorFlow installation supplied by Google requires a newer version of GLIBC than is available on the zCSF. We have modified the TensorFlow library _pywrap_tensorflow.so to be less strict about the version of GLIBC present. We then supply some functions that are missing from our older GLIBC library but are required by TensorFlow.
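For information, you can check which C library version a given Python interpreter sees using the standard library. A sketch (the reported version depends entirely on the system you run it on):

```python
# Report the C library this Python was linked against.
# On a Linux system this is typically ("glibc", "<version>").
import platform

libc, version = platform.libc_ver()
print(libc, version)
```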
Interactive use on a Backend Node
To see the commands used to log in to a particular backend node, run the following on the zrek login node:
backends
Once logged in to a backend K20 node or K40 node (using qrsh) and having loaded the modulefile there (see above), run:
python
When TensorFlow runs it will use the GPU assigned to you when you ran qrsh to log in to the backend node. To see which GPU(s) you have been assigned, run the following on the backend node:
echo $CUDA_VISIBLE_DEVICES
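The variable holds a comma-separated list of device IDs, so the number of GPUs assigned can be counted from it. A sketch; the value `0,1` below is only an illustration of what qrsh might set, and on a real backend node you would not set it yourself:

```shell
# Illustrative value; on a backend node this is set for you by qrsh.
CUDA_VISIBLE_DEVICES="0,1"

# Count the comma-separated device IDs to see how many GPUs are assigned.
NGPUS=$(echo "$CUDA_VISIBLE_DEVICES" | awk -F',' '{print NF}')
echo "Assigned $NGPUS GPU(s): $CUDA_VISIBLE_DEVICES"
```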
Single GPU Example
A simple TensorFlow test is as follows. This will use the single GPU assigned to your session:
# Assuming you are at the zrek login node:

# 1. Log in to a backend node (e.g., a K40 node)
qrsh -l k40 bash

# 2. Load the modulefile you require on the backend node. For example:
module load apps/gcc/tensorflow/0.7.1-py34-gpu
# (see the list above for newer versions)

# 3. Start python then enter the commands
python

# Enter the following program
import tensorflow as tf
# You should see:
#   successfully opened CUDA library libcudnn.so locally
# (and other GPU details)...

# Create a graph
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)

# Turn on device placement reporting so we can see where a graph runs
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# You should see:
#   I tensorflow/core/common_runtime/gpu/gpu_device.cc:717] Creating TensorFlow
#   device (/gpu:0) -> (device: 0, name: Tesla K40c, pci bus id: 0000:02:00.0)

# Run the graph. It will report the GPU used to do so.
sess.run(c)
# You should see:
#   b: /job:localhost/replica:0/task:0/gpu:0
#   a: /job:localhost/replica:0/task:0/gpu:0
#   MatMul: /job:localhost/replica:0/task:0/gpu:0
#   Allocating 10.58GiB bytes.
#   GPU 0 memory begins at 0x2047a0000 extends to 0x4a9c53b34
#   array([[ 22.,  28.],
#          [ 49.,  64.]], dtype=float32)

# Exit the python shell
Ctrl-D
Dual-GPU Example
The following example uses both GPUs in a backend node. You must request both GPUs when you log in so that you are given a node on which no other jobs are running.
# Assuming you are at the zrek login node:

# 1. Log in to a backend node (e.g., a dual-GPU K20 node)
qrsh -l k20duo bash
# or alternatively use the GPU-peer-to-peer capable node
qrsh -l k20duo_p2p bash

# 2. Load the modulefile on the backend node
module load apps/gcc/tensorflow/0.7.1-py34-gpu
# (see the list above for newer versions)

# 3. Start python then enter the commands
python

# Enter the following program
import tensorflow as tf
# Will report the loading of CUDA libraries

# Create a graph and use both GPUs
c = []
for d in ['/gpu:0', '/gpu:1']:
    with tf.device(d):
        a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3])
        b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2])
        c.append(tf.matmul(a, b))
with tf.device('/cpu:0'):
    sum = tf.add_n(c)

# Create a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Should report:
#   /job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: Tesla K20c, pci bus
#   id: 0000:02:00.0
#   /job:localhost/replica:0/task:0/gpu:1 -> device: 1, name: Tesla K20c, pci bus
#   id: 0000:04:00.0

# Run the op.
sess.run(sum)
# Will report both GPUs being used
#   array([[  44.,   56.],
#          [  98.,  128.]], dtype=float32)

# Exit the python shell
Ctrl-D
Serial batch job submission
Do not log in to a backend node. The job must be submitted from the zrek login node. Ensure you have loaded the correct modulefile on the login node and then create a jobscript similar to the following:
#!/bin/bash
#$ -S /bin/bash
#$ -cwd        # Run job from directory where submitted
#$ -V          # Inherit environment (modulefile) settings
#$ -l k20      # Select a single GPU (Nvidia K20) node

python my-script.py
Submit your jobscript from the zrek login node using

qsub jobscript

where jobscript is the name of your jobscript.
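The my-script.py named in the jobscript is your own TensorFlow program. A minimal skeleton might start by reporting the GPU the scheduler assigned before doing any real work. The filename and the printed text here are illustrative, not required by the system:

```python
# Hypothetical my-script.py skeleton: report the GPU(s) assigned by the
# batch system before doing any TensorFlow work.
import os

gpus = os.environ.get("CUDA_VISIBLE_DEVICES", "")
print("Assigned GPU(s):", gpus if gpus else "none visible")
# ... TensorFlow code (e.g. import tensorflow as tf) would follow here.
```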
Parallel batch job submission
Do not log in to a backend node. The job must be submitted from the zrek login node. Ensure you have loaded the correct modulefile on the login node and then create a jobscript similar to the following:
#!/bin/bash
#$ -S /bin/bash
#$ -cwd             # Run job from directory where submitted
#$ -V               # Inherit environment (modulefile) settings
#$ -l k20duo_p2p    # Select a dual-GPU (Nvidia K20) peer-to-peer capable node

python my-script.py
Submit your jobscript from the zrek login node using

qsub jobscript

where jobscript is the name of your jobscript.
Web-browser Jupyter Notebook remote usage
It is possible to run TensorFlow in a Jupyter Notebook running on a zrek GPU node. In the example below we use a web browser on a local laptop and set up an SSH tunnel to a zrek K40 GPU node (e.g., besso). We show how to do this for both on-campus and off-campus use.
You will need an SSH program on your local laptop/PC that allows local tunnelling. This is easy with a command-line program (e.g., as found on a Linux PC, a macOS laptop, or in MobaXterm on Windows). Please consult your own ssh program's documentation if not using any of those methods.
On Campus Access
Here we assume your laptop/PC is on campus and so you can log in to zrek directly. You can also use this method from the nyx3/4 linux virtual desktops (see http://ri.itservices.manchester.ac.uk/virtual-desktop-service/x2go/). The steps are:
- Open two terminal windows on your local laptop/PC.
- In terminal window 1 on your laptop/PC run:
- Log in to zrek:
ssh username@zrek.itservices.manchester.ac.uk
- Now log in to a GPU node (we use an Nvidia K40 node):
qrsh -l k40 bash
# Wait until it logs you in to the backend node.
# Your prompt will change when it has done:
#   [username@besso(zrek) ~]$
#
# Make a note of the backend node name you have been
# logged in to (besso in this example). If using a
# K20 GPU this could be kaiju3, for example.
- Now set up TensorFlow (which also gives access to Anaconda python) and start a Jupyter Notebook:
module load apps/gcc/tensorflow/0.10.0-py27-gpu
jupyter-notebook --no-browser --port=7777
# Make a note of this port number; we use it later.
# If you get an error, try the next port up (7778 and so on).
- Note the message printed when the notebook starts. When we have finished with the notebook we'll terminate it by pressing Ctrl+C in this window twice (we'll come back to that later). For now just leave this notebook running.
- In terminal window 2 on your laptop/PC run:
- Tunnel in to zrek:
ssh -L 7777:besso:7777 username@zrek.itservices.manchester.ac.uk
# Replace username with your own IT username.
# Replace besso with the GPU node name you made a note of earlier.
# Replace both occurrences of 7777 with the port number you made a note of earlier.
Note that you do not need to type any commands into this window once you have logged in, but you must keep the window logged in, otherwise your web browser won't be able to contact zrek.
- Now open a web-browser on your laptop/PC and browse to:
http://localhost:7777
# Use the port number we made a note of earlier
You should see your Jupyter Notebook running on the zrek GPU node.
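If the browser cannot connect, one way to check from the laptop that the tunnel is actually forwarding the port is a small socket probe. A sketch; port 7777 is the example port from above:

```python
# Check whether a TCP port on localhost accepts connections
# (i.e. whether the local end of the SSH tunnel is listening).
import socket

def port_open(host, port, timeout=1.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(port_open("localhost", 7777))
```

If this prints False, the tunnel is not up; check that the ssh session in terminal window 2 is still logged in.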
- Start a new Python 2 notebook in the web-browser.
- Enter the following to check you can access TensorFlow:
import tensorflow as tf
and run the command (hit Ctrl+Enter).
That’s it, you now have a local web-browser talking to your Jupyter Notebook running on a zrek GPU node.
When you have finished with the Jupyter Notebook:
- Log out of the Jupyter Notebook using the button in the web-browser.
- In terminal window 1 on your laptop/PC press Ctrl+C twice to stop the Jupyter Notebook server. Then run exit to log out of the GPU node. This will free it up for another user.
- In terminal window 2 on your laptop/PC press Ctrl+D to log out of the tunnelled ssh session.
Further info
Updates
03-mar-17 Tensorflow 1.0.0 added for Python 2.7 and 3.5
16-jan-17 Tensorflow 0.12.1 added for Python 2.7 and 3.5