TensorFlow

Overview

TensorFlow is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. The flexible architecture allows you to deploy computation to one or more CPUs or GPUs.

Various versions with GPU (CUDA and cuDNN support) are installed – see modulefiles below.

Restrictions on use

There are no access restrictions on zrek.

Supported Backend Nodes

This application is available on the Nvidia GPU nodes: besso and kaiju{[1-5],101}. Please see the K40 node instructions and the K20 node instructions for how to access the nodes.

Set up procedure

To access the software you must first load one of the following modulefiles:

# Python 3.6, CUDA 9.0.176, cuDNN 7.3.0
module load apps/gcc/tensorflow/1.11.0-py36-gpu          # New

# Python 3.5, CUDA 9.0.176, cuDNN 7.0.3
module load apps/gcc/tensorflow/1.8.0-py35-gpu
# Python 3.5, CUDA 8.0.44, cuDNN 5.1.5
module load apps/gcc/tensorflow/1.2.1-py35-gpu

module load apps/gcc/tensorflow/1.0.0-py35-gpu
module load apps/gcc/tensorflow/0.12.1-py35-gpu
module load apps/gcc/tensorflow/0.11.0-py35-gpu

# Python 3.4, CUDA 7.5.18, cuDNN 4.0.7
module load apps/gcc/tensorflow/0.10.0-py34-gpu
module load apps/gcc/tensorflow/0.9.0rc0-py34-gpu
module load apps/gcc/tensorflow/0.7.1-py34-gpu

# Python 2.7, CUDA 9.0.176, cuDNN 7.0.3
# (module load 1.8.0-py27-gpu is in progress)

# Python 2.7, CUDA 8.0.44, cuDNN 5.1.5
module load apps/gcc/tensorflow/1.2.1-py27-gpu
module load apps/gcc/tensorflow/1.0.0-py27-gpu
module load apps/gcc/tensorflow/0.12.1-py27-gpu

# Python 2.7, CUDA 7.5.18, cuDNN 4.0.7
module load apps/gcc/tensorflow/0.10.0-py27-gpu
module load apps/gcc/tensorflow/0.9.0rc0-py27-gpu
module load apps/gcc/tensorflow/0.8.0-20160504-py27-gpu        # Nightly build version (dated)
module load apps/gcc/tensorflow/0.7.1-py27-gpu

The above modulefiles will load the following modulefiles automatically (you do not need to load these):

  • One of the following Anaconda python modulefiles:
    • apps/binapps/anaconda/3/5.2.0 (python 3.6.5)
    • apps/binapps/anaconda/3/4.2.0 (python 3.5.2)
    • apps/binapps/anaconda/3/2.3.0 (python 3.4.3)
    • apps/binapps/anaconda/2.5.0 (python 2.7.11)
  • compilers/gcc/6.3.0 (C++11 compatible compiler)
  • libs/cuda/9.0.176 (Nvidia CUDA toolkit)
  • libs/cuDNN/7.3.0 (Nvidia cuDNN toolkit)
  • compilers/gcc/4.8.2 (C++11 compatible compiler)
  • libs/cuda/8.0.44 (Nvidia CUDA toolkit)
  • libs/cuDNN/5.1.5 (Nvidia cuDNN toolkit)
  • libs/cuda/7.5.18 (Nvidia CUDA toolkit)
  • libs/cuDNN/4.0.7 (Nvidia cuDNN toolkit)

Running the application

Please do not run TensorFlow on the login node. Jobs should be run interactively on the backend nodes (via qrsh) or submitted to the compute nodes via batch.

The following instructions describe interactive use on a backend node and batch jobs from the login node.

Technical Note (you are not required to do anything – this is for information only)

  • We use a modified python executable (a shell script) named python to start the usual Anaconda python interpreter. This actually runs the following:
    LD_PRELOAD=/usr/lib64/librt.so:$TFDIR/fixes/stubs/mylibc.so:$GCCDIR/lib64/libstdc++.so.6 python

    The LD_PRELOAD is needed to load a few libraries that replace system libraries. The pre-compiled TensorFlow installation supplied by Google requires a newer version of GLIBC than is available on the zCSF. We have modified the TensorFlow library _pywrap_tensorflow.so to be less strict about the GLIBC version present, and we then supply some functions required by TensorFlow that are missing from our older GLIBC library.
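The wrapper amounts to a short shell script. The following is an illustrative sketch only, not the actual installed script: $TFDIR and $GCCDIR are set when the modulefile loads, and the path to the real Anaconda interpreter (shown here as $ANACONDA_BIN) is hypothetical:

```shell
#!/bin/bash
# Sketch of the python wrapper installed by the modulefile (illustrative).
# $TFDIR and $GCCDIR are set by the modulefile; $ANACONDA_BIN is a
# hypothetical placeholder for the directory of the real python binary.
LD_PRELOAD=/usr/lib64/librt.so:$TFDIR/fixes/stubs/mylibc.so:$GCCDIR/lib64/libstdc++.so.6 \
    exec "$ANACONDA_BIN/python" "$@"
```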

Interactive use on a Backend Node

To see the commands used to log in to a particular backend node, run the following on the zrek login node:

backends

Once logged in to a backend K20 node or K40 node (using qrsh) and having loaded the modulefile there (see above), run:

python

When TensorFlow runs it will use the GPU assigned to you when you ran qrsh to log in to the backend node. To see which GPU(s) you have been assigned run the following on the backend node:

echo $CUDA_VISIBLE_DEVICES
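The same variable can be read from within a Python script. A minimal sketch (the GPU ids are whatever qrsh assigned to your session; the values shown in the comment are illustrative):

```python
import os

# CUDA_VISIBLE_DEVICES holds a comma-separated list of GPU ids, e.g. "0" or "0,1".
# Fall back to an empty string when the variable is unset (e.g. off the backend node).
visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
gpu_ids = [int(g) for g in visible.split(",") if g.strip() != ""]
print("Assigned GPU ids:", gpu_ids)
```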

Single GPU Example

A simple TensorFlow test is as follows. This will use the single GPU assigned to your session:

# Assuming you are at the zrek login node:

# 1. Log in to a backend node (e.g., a K40 node)
qrsh -l k40 bash

# 2. Load the modulefile you require on the backend node. For example:
module load apps/gcc/tensorflow/0.7.1-py34-gpu
                                  #
                                  # See list above for newer versions

# 3. Start python then enter the commands
python

# Enter the following program

import tensorflow as tf
# You should see:
#   successfully opened CUDA library libcudnn.so locally
#   (and other GPU details)...

# Create a graph
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)

# Turn on device placement reporting so we can see where a graph runs
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# You should see:
#   I tensorflow/core/common_runtime/gpu/gpu_device.cc:717] Creating TensorFlow \
#     device (/gpu:0) -> (device: 0, name: Tesla K40c, pci bus id: 0000:02:00.0)

# Run the graph. It will report the GPU used to do so.
sess.run(c)
# You should see
#   b: /job:localhost/replica:0/task:0/gpu:0
#   a: /job:localhost/replica:0/task:0/gpu:0
#   MatMul: /job:localhost/replica:0/task:0/gpu:0
#   Allocating 10.58GiB bytes.
#   GPU 0 memory begins at 0x2047a0000 extends to 0x4a9c53b34
#   array([[ 22.,  28.],
#          [ 49.,  64.]], dtype=float32)

# Exit the python shell
Ctrl-D
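As a sanity check on the result above, the example graph computes an ordinary 2x3 by 3x2 matrix product, which can be reproduced in plain Python without TensorFlow:

```python
# The same matrices as in the TensorFlow example above.
a = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
b = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]

# Plain-Python matrix multiply: c[i][j] = sum over k of a[i][k] * b[k][j]
c = [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(2)]
     for i in range(2)]
print(c)  # [[22.0, 28.0], [49.0, 64.0]]
```

This matches the array returned by sess.run(c) in the session log.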

Dual-GPU Example

The following example uses both GPUs in a backend node. You must request both when you log in to the node so that you are given a node on which no other jobs are running.

# Assuming you are at the zrek login node:

# 1. Log in to a dual-GPU backend node (e.g., a K20 duo node)
qrsh -l k20duo bash

# or alternatively use the GPU-peer-to-peer capable node
qrsh -l k20duo_p2p bash

# 2. Load the modulefile on the backend node
module load apps/gcc/tensorflow/0.7.1-py34-gpu
                                  #
                                  # See list above for newer versions

# 3. Start python then enter the commands
python

# Enter the following program

import tensorflow as tf
# Will report the loading of CUDA libraries

# Create a graph and use both GPUs
c = []
for d in ['/gpu:0', '/gpu:1']:
  with tf.device(d):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3])
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2])
    c.append(tf.matmul(a, b))

with tf.device('/cpu:0'):
  sum = tf.add_n(c)

# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Should report:
#   /job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: Tesla K20c, pci bus 
#   id:0000:02:00.0
#   /job:localhost/replica:0/task:0/gpu:1 -> device: 1, name: Tesla K20c, pci bus id: 
#   0000:04:00.0

# Run the op.
sess.run(sum)
# Will report both GPUs being used
#    array([[  44.,   56.],
#           [  98.,  128.]], dtype=float32)

# Exit the python shell
Ctrl-D
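Each GPU computes the same 2x2 matrix product as in the single-GPU example, and tf.add_n sums the two results elementwise on the CPU, so the final answer is exactly double the single-GPU result. A plain-Python check:

```python
# Per-GPU result, i.e. the matrix product from the single-GPU example.
prod = [[22.0, 28.0],
        [49.0, 64.0]]

# tf.add_n over the two (identical) per-GPU results is an elementwise sum,
# equivalent to doubling each entry here.
total = [[2 * prod[i][j] for j in range(2)] for i in range(2)]
print(total)  # [[44.0, 56.0], [98.0, 128.0]]
```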

Serial batch job submission

Do not log in to a backend node. The job must be submitted from the zrek login node. Ensure you have loaded the correct modulefile on the login node and then create a jobscript similar to the following:

#!/bin/bash
#$ -S /bin/bash
#$ -cwd                   # Run job from directory where submitted
#$ -V                     # Inherit environment (modulefile) settings
#$ -l k20                 # Select a single GPU (Nvidia K20) node

python my-script.py

Submit your jobscript from the zrek login node using

qsub jobscript

where jobscript is the name of your jobscript.
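The jobscript can be extended with other standard Grid Engine options, for example to name the job and merge the error stream into the output file. A sketch (my-script.py stands for your own code):

```shell
#!/bin/bash
#$ -S /bin/bash
#$ -cwd                   # Run job from directory where submitted
#$ -V                     # Inherit environment (modulefile) settings
#$ -l k20                 # Select a single GPU (Nvidia K20) node
#$ -N tf-job              # Name the job (output appears in tf-job.o<jobid>)
#$ -j y                   # Merge stderr into the stdout file

python my-script.py
```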

Parallel batch job submission

Do not log in to a backend node. The job must be submitted from the zrek login node. Ensure you have loaded the correct modulefile on the login node and then create a jobscript similar to the following:

#!/bin/bash
#$ -S /bin/bash
#$ -cwd                   # Run job from directory where submitted
#$ -V                     # Inherit environment (modulefile) settings
#$ -l k20duo_p2p          # Select a dual-GPU (Nvidia K20, peer-to-peer) node

python my-script.py

Submit your jobscript from the zrek login node using

qsub jobscript

where jobscript is the name of your jobscript.

Web-browser Jupyter Notebook remote usage

It is possible to run TensorFlow in a Jupyter Notebook running on a zrek GPU node. In the example below we use a web browser on a local laptop and set up an SSH tunnel to the zrek k40 GPU node (e.g., besso). We show how to do this for on-campus use and off-campus use.

You will need an SSH program on your local laptop/PC that allows local tunnelling. This is easiest with a command-line program (e.g., as found on a Linux PC, a macOS laptop, or in MobaXterm on Windows). Please consult your own SSH program’s documentation if you are not using one of those methods.

On Campus Access

Here we assume your laptop/PC is on campus and so you can log in to zrek directly. You can also use this method from the nyx3/4 linux virtual desktops (see http://ri.itservices.manchester.ac.uk/virtual-desktop-service/x2go/). The steps are:

  1. Open two terminal windows on your local laptop/PC.
  2. In terminal window 1 on your laptop/PC run:
    1. Log in to zrek:
      ssh username@zrek.itservices.manchester.ac.uk
    2. Now log in to a GPU node (we use an Nvidia K40 node):
      qrsh -l k40 bash
      
       # Wait until it logs you in to the backend node.
       # Your prompt will change when it has done:
      
      [username@besso(zrek) ~]$
                  #
                  #
                  # Make a note of the backend node name you have been
                  # logged in to (besso in this example). If using a
                  # K20 GPU this could be kaiju3, for example.
    3. Now set up TensorFlow (which also gives access to Anaconda Python) and start a Jupyter Notebook:
      module load apps/gcc/tensorflow/0.10.0-py27-gpu
      jupyter-notebook --no-browser --port=7777
                                              #
                                              # Make a note of this port number
                                              # we use it later. If you get an error
                                              # try the next port up (7778 and so on)
    4. Note the message printed when the notebook starts. When we have finished with the notebook we’ll terminate it by pressing Ctrl+C in this window twice (we’ll come back to that later). For now just leave this notebook running.
  3. In terminal window 2 on your laptop/PC run:
    1. Tunnel in to zrek:
      ssh -L 7777:besso:7777 username@zrek.itservices.manchester.ac.uk
              #     #     #      #
              #     #     #      # Replace username with your own IT username
              #     #     #
              #     #     # Use the port number you made a note of earlier 
              #     #
              #     # Use the GPU node name you made a note of earlier
              #
              # Use the port number you made a note of earlier

      Note that you do not need to type any commands into this window once you have logged in, but you must keep the window logged in, otherwise your web browser won’t be able to contact zrek.

  4. Now open a web browser on your laptop/PC and browse to:
    http://localhost:7777
                      #
                      #
                      # Use the port number we made a note of earlier

    You should see your Jupyter Notebook running on the zrek GPU node.

  5. Start a new Python 2 notebook in the web-browser.
  6. Enter the following to check you can access tensorflow:
    import tensorflow as tf

    and run the command (hit Ctrl+Enter).

That’s it, you now have a local web-browser talking to your Jupyter Notebook running on a zrek GPU node.
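The on-campus steps above can be condensed as follows (a sketch: replace username, the node name, and the port number with your own values from the steps above):

```shell
# Terminal 1 (laptop/PC): log in, grab a GPU node, start the notebook
ssh username@zrek.itservices.manchester.ac.uk
qrsh -l k40 bash                                   # note the node name, e.g. besso
module load apps/gcc/tensorflow/0.10.0-py27-gpu
jupyter-notebook --no-browser --port=7777

# Terminal 2 (laptop/PC): tunnel the notebook port through the login node
ssh -L 7777:besso:7777 username@zrek.itservices.manchester.ac.uk

# Then browse to http://localhost:7777 on the laptop/PC
```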

When you have finished with the Jupyter Notebook:

  1. Log out of the Jupyter Notebook using the button in the web-browser.
  2. In terminal window 1 on your laptop/PC press Ctrl+C twice to stop the Jupyter Notebook server. Then run exit to log out of the GPU node. This will free it up for another user.
  3. In terminal window 2 on your laptop/PC press Ctrl+D to log out of the tunnelled SSH session.

Further info

Updates

03-mar-17 Tensorflow 1.0.0 added for Python 2.7 and 3.5
16-jan-17 Tensorflow 0.12.1 added for Python 2.7 and 3.5

Last modified on October 22, 2018 at 4:28 pm by George Leaver