Tensorflow

Overview

TensorFlow is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. The flexible architecture allows you to deploy computation to one or more CPUs or GPUs.

See the modulefiles below for available versions.

From Tensorflow version 2.4 onwards, Keras is packaged within the Tensorflow install as tensorflow.keras.

It is also possible to install tensorflow in your own conda environments. This is often the easiest way to obtain a version newer than we’ve installed centrally. See below for complete examples of installing TF 2.15 and TF 2.12.

Restrictions on use

There are no access restrictions on the CSF.

Set up procedure

We now recommend installing a newer version of tensorflow, with newer CUDA libraries, all in a conda environment, which you can do in your home directory. Please follow the step-by-step instructions below for a complete example.

To access the older centrally installed versions, you must first load one of the following modulefiles:

# This is now a fairly old version of Tensorflow. See end of page
# for how to install your own newer version in a conda environment.

# TF 2.8.0, Python 3.9 for GPUs: (uses CUDA 11.2.0, cuDNN 8.1.0, Anaconda3 2021.11)
module load apps/binapps/tensorflow/2.8.0-39-gpu

# TF 2.7.0, Python 3.7 for GPUs: (uses CUDA 11.0.3, cuDNN 8.0.4, Anaconda3 2019.07)
module load apps/binapps/tensorflow/2.7.0-37-gpu

# TF 2.4.0, Python 3.7 for GPUs: (uses CUDA 11.0.3, cuDNN 8.0.4, NCCL 2.5.6, TensorRT 6.0.1, Anaconda3 2019.07) 
module load apps/binapps/tensorflow/2.4.0-37-gpu

# TF 2.3.1, Python 3.7 for GPUs: (uses CUDA 10.1.243, cuDNN 7.6.5, NCCL 2.5.6, TensorRT 6.0.1, Anaconda3 2019.07)
module load apps/binapps/tensorflow/2.3.1-37-gpu

# TF 2.2.0, Python 3.7 for GPUs: (uses CUDA 10.1.243, cuDNN 7.6.5, NCCL 2.5.6, TensorRT 6.0.1, Anaconda3 2019.07)
module load apps/binapps/tensorflow/2.2.0-37-gpu

# TF 2.1.0, Python 3.7 for GPUs: (uses CUDA 10.1.243, cuDNN 7.6.5, NCCL 2.5.6, TensorRT 6.0.1, Anaconda3 2019.07)
module load apps/binapps/tensorflow/2.1.0-37-gpu

# TF 2.0.0, Python 3.7 for GPUs: (uses CUDA 10.0.130, cuDNN 7.6.2, NCCL 2.4.7, Anaconda3 2019.07)
module load apps/binapps/tensorflow/2.0.0-37-gpu

# TF 1.14.0, Python 3.6 for GPUs: (uses CUDA 10.0.130, cuDNN 7.6.2, NCCL 2.2.13, Anaconda3 5.2.0)
module load apps/binapps/tensorflow/1.14.0-36-gpu

# Python 3.6 for GPUs: (uses CUDA 9.0.176, cuDNN 7.3.0, NCCL 2.2.13, Anaconda3 5.2.0)
module load apps/binapps/tensorflow/1.11.0-36-gpu
module load apps/binapps/tensorflow/1.10.1-36-gpu

There are also CPU-only versions available:

# Python 3.9 for CPUs only: (uses Anaconda3 2021.11)
module load apps/binapps/tensorflow/2.8.0-39-cpu
# Python 3.7 for CPUs only: (uses Anaconda3 2019.07)
module load apps/binapps/tensorflow/2.7.0-37-cpu
module load apps/binapps/tensorflow/2.4.0-37-cpu
module load apps/binapps/tensorflow/2.3.1-37-cpu
module load apps/binapps/tensorflow/2.2.0-37-cpu 
module load apps/binapps/tensorflow/2.1.0-37-cpu 
module load apps/binapps/tensorflow/2.0.0-37-cpu 

# Python 3.6 for CPUs only: (uses Anaconda3 5.2.0) 
module load apps/binapps/tensorflow/1.14.0-36-cpu 
module load apps/binapps/tensorflow/1.11.0-36-cpu 
module load apps/binapps/tensorflow/1.10.1-36-cpu

The above modulefiles will load any necessary dependency modulefiles for you. Note that you cannot run the GPU version of tensorflow on a CPU-only node (it must be run on a GPU node).
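
The modulefile names above appear to encode the Tensorflow version, the Python version and the device type (for example, 2.8.0-39-gpu is TF 2.8.0 with Python 3.9 for GPUs). The following is a minimal sketch of reading those three parts from a name. The naming pattern is inferred from the list above, and the function name parse_tf_modulefile is made up for this illustration.

```python
import re

# Pattern inferred from the modulefile names listed on this page, e.g.
# apps/binapps/tensorflow/2.8.0-39-gpu -> TF 2.8.0, Python 3.9, GPU build.
# This is an assumption about the naming scheme, not a documented rule.
_MODULE_RE = re.compile(
    r"^apps/binapps/tensorflow/"
    r"(?P<tf>\d+\.\d+\.\d+)-(?P<pymajor>\d)(?P<pyminor>\d+)-(?P<device>gpu|cpu)$"
)


def parse_tf_modulefile(name):
    """Split a tensorflow modulefile name into its version parts."""
    match = _MODULE_RE.match(name.strip())
    if match is None:
        raise ValueError(f"unrecognised tensorflow modulefile: {name!r}")
    return {
        "tensorflow": match["tf"],
        "python": f"{match['pymajor']}.{match['pyminor']}",
        "device": match["device"],
    }


print(parse_tf_modulefile("apps/binapps/tensorflow/2.8.0-39-gpu"))
```

A check like this can help catch a jobscript that loads a -gpu modulefile but does not request a GPU, or the other way round.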

Running the application on a GPU node

Please do not run Tensorflow on the login node. Jobs should be run interactively on the backend nodes (via qrsh) or submitted to the compute nodes via batch.

Example Tensorflow v2.8 GPU python script

Note that the following example will not work with Tensorflow 1.x due to significant changes in the Tensorflow API.

See https://www.tensorflow.org/tutorials/quickstart/beginner for more information on this code.


Create the following tensorflow example script for use on a GPU node (e.g., my-gpu-script.py):

# Tensorflow example on a GPU
import tensorflow as tf
print("TensorFlow version:", tf.__version__)
print("List of GPUs:", tf.config.list_physical_devices('GPU'))

mnist = tf.keras.datasets.mnist
(x_train, y_train),(x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5, verbose=0)
# This should report a ~98% accuracy
model.evaluate(x_test,  y_test, verbose=2)

You can now run the above script interactively on a GPU node or in batch.

Interactive use on a GPU node

Once you have been granted access to the Nvidia v100 nodes, start an interactive session as follows:

qrsh -l nvidia_v100=1 bash

# Wait until you are logged in to a backend compute node, then:
module load apps/binapps/tensorflow/2.8.0-39-gpu

# Run the above script
python my-gpu-script.py

# Alternatively enter the above script in a python shell:
python
   # Enter each line of the script above - it will execute immediately
   import tensorflow as tf
   ...
   # When finished, exit python
   Ctrl-D

# When finished with your interactive session, return to the login node
exit

Batch usage on a GPU node

Once you have been granted access to the Nvidia v100 nodes, create a jobscript as follows:

#!/bin/bash --login
#$ -cwd                   # Run job from directory where submitted

# If running on a GPU, add:
#$ -l v100=1

#$ -pe smp.pe 8          # Number of cores on a single compute node. GPU jobs can
                         # use up to 8 cores per GPU.

# We now recommend loading the modulefile in the jobscript
module load apps/binapps/tensorflow/2.8.0-39-gpu

# $NSLOTS is automatically set to the number of cores requested on the pe line
# and can be read by your python code.
python my-gpu-script.py

Submit the jobscript using

qsub jobscript

where jobscript is the name of your jobscript file (not your python script file!)
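
The jobscript comments above note that $NSLOTS can be read by your python code, but the GPU example script does not do so. The following is a minimal sketch of reading it safely, falling back to 1 core when the variable is missing or not a positive integer. The function name reserved_cores is made up for this illustration.

```python
import os


def reserved_cores(default=1):
    """Return the core count from NSLOTS, or `default` if it is unusable."""
    value = os.environ.get("NSLOTS", "")
    try:
        cores = int(value)
    except ValueError:
        # NSLOTS is unset or not a number (e.g. running outside the batch system)
        return default
    return cores if cores > 0 else default


print("Using", reserved_cores(), "core(s)")
```

In a GPU job the returned value could, for example, be used to limit CPU-side work such as data loading to the cores you requested. How you apply it depends on your own code.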

Running the application on a CPU node

Please do not run Tensorflow on the login node. Jobs should be run interactively on the backend nodes (via qrsh) or submitted to the compute nodes via batch.

Example Tensorflow v2.x CPU python script

Create the following tensorflow example script for use on a CPU node (e.g., my-cpu-script.py). Note that we determine the number of CPU cores that can be used and instruct Tensorflow to only use that many threads.

# Tensorflow example on a CPU only (not a GPU)
import tensorflow as tf
import os

# Get number of cores reserved by the batch system (NSLOTS is automatically set, or use 1 if not)
NUMCORES=int(os.getenv("NSLOTS",1))
print("Using", NUMCORES, "core(s)" )

tf.config.threading.set_inter_op_parallelism_threads(NUMCORES) 
tf.config.threading.set_intra_op_parallelism_threads(NUMCORES)
tf.config.set_soft_device_placement(True)

# Now run a small TF computation (a 2x3 by 3x2 matrix multiplication) and print the result
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.linalg.matmul(a, b)
print(c)

Interactive use on a Backend CPU-only Node

To request an interactive session on a backend compute node run:

qrsh -l short

# Wait until you are logged in to a backend compute node, then:
module load apps/binapps/tensorflow/2.8.0-39-cpu

# Run the above python script, eg:
python my-cpu-script.py

# Alternatively enter the above script in a python shell:
python
   # Enter each line of the script above - it will execute immediately
   import tensorflow as tf
   ...
   # When finished, exit python
   Ctrl-D

# When finished with your interactive session, return to the login node
exit

Batch usage on a CPU node

Create a jobscript as follows:

#!/bin/bash --login
#$ -cwd                   # Run job from directory where submitted
#$ -pe smp.pe 16          # Number of cores on a single compute node. Can be 2-32 for CPU jobs.
                          # Remove the -pe line completely to run a serial (1-core) job.

# We now recommend loading the modulefile in the jobscript
module load apps/binapps/tensorflow/2.8.0-39-cpu

# $NSLOTS is automatically set to the number of cores requested on the pe line
# and can be read by your python code (see example above).
python my-cpu-script.py

Submit the jobscript using

qsub jobscript

where jobscript is the name of your jobscript file (not your python script file!)

Using Tensorflow in Conda Environments

Conda Environments are a way of installing all of the python packages you need for a project in a directory in your home directory. You can create other conda environments for other projects. This ensures each project is kept separate and the packages for one project do not break those of another. We recommend reading the Anaconda Python CSF page for more info on using conda environments.

June 2023: The proxy is no longer available.
To download packages from external sites (e.g., when creating a conda env), please do so from a batch job or use an interactive session on a backend node by running qrsh -l short or qrsh -l nvidia_v100 bash. You DO NOT then need to load the proxy modulefiles. Please see the qrsh notes for more information on interactive use.

Example 1 – Installing Tensorflow

The following notes have been updated in Feb 2024 for tensorflow 2.15.0 and include a fix for the “TensorRT not found” warning.

# From the login node, start an interactive session
qrsh -l v100 bash

# Now on the GPU node - quite a few steps to install, but it is then easy to use in your jobscripts

# Note that you must NOT have any existing conda environments active.
# If your command prompt looks something like:
(base) [username@node800 [csf3] ~]$
  #
  # Any (name) here is the name of the active conda env.
  #
  # then you need to deactivate the base conda env (or whatever name is showing)
  # using the following 'source deactivate' command:

source deactivate

# Now the prompt shows no conda env. That's correct. We are going to create a new env for tensorflow.
[username@node800 [csf3] ~]$
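
As an alternative to reading the prompt, the check for an active conda env can be sketched in python. This assumes conda sets the CONDA_DEFAULT_ENV environment variable while an env is active, which is standard conda behaviour but has not been verified on the CSF. The function name active_conda_env is made up for this illustration.

```python
import os


def active_conda_env(environ=None):
    """Return the name of the active conda env, or None if there is none."""
    environ = os.environ if environ is None else environ
    # Assumption: conda exports CONDA_DEFAULT_ENV while an env is active.
    name = environ.get("CONDA_DEFAULT_ENV", "").strip()
    return name or None


env_name = active_conda_env()
if env_name is not None:
    print(f"Conda env '{env_name}' is active: run 'source deactivate' first.")
else:
    print("No conda env is active.")
```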

Continue with the tensorflow installation – the following commands were run on a CSF GPU node (e.g., node800 in our example):

module purge
# Use python 3.9 as required by tensorflow (2.15.0 at the time of writing)
module load apps/binapps/anaconda3/2022.10

# Create a conda env with some basic packages needed to install other packages.
# We're using anaconda3 v2022.10 which provides python 3.9.13 (use "python --version").
# We use 'tf' as the env name in this example (you can change this if you want).
python --version
conda create -n tf python=3.9.13
Proceed ([y]/n)? y                          # <--- Press y [return] to proceed

# Now activate the env so we can install other packages in to the env
source activate tf

# Now install latest tensorflow (2.15.0 at time of writing) and cuda inside the env
pip install --isolated --log pip.tf.log tensorflow[and-cuda]
pip install --isolated --log pip.rt.log --extra-index-url https://pypi.nvidia.com tensorrt

# Now fix a warning message that complains about tensorrt.
# I'm not sure that tensorrt is really used but the warning is annoying.
# Based on https://github.com/tensorflow/tensorflow/issues/61468

mkdir -p $CONDA_PREFIX/etc/conda/activate.d
# Note: No single quotes in this command but there are two dirname commands!
echo TENSORRT_PATH=$(dirname $(dirname $(python -c "import tensorrt;print(tensorrt.__file__)")))/tensorrt_libs >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh

# But we do want the single quotes in this command
echo 'export LD_LIBRARY_PATH=$TENSORRT_PATH:$LD_LIBRARY_PATH' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh

# Now while we're installing, apply the above settings manually.
# In future, when you "source activate tf" to activate the env,
# the settings will be applied automatically:
source $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh

# Fix some missing symlinks
pushd $TENSORRT_PATH
# Will need to update these commands in future to match the tensorrt version
ln -s libnvinfer.so.8 libnvinfer.so.8.6.1
ln -s libnvinfer_plugin.so.8 libnvinfer_plugin.so.8.6.1
popd

# Now run a quick test:
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

# Return to the login node
exit
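
The effect of the env_vars.sh settings above can be checked from python while the env is active. Because the file sets TENSORRT_PATH without an export, only LD_LIBRARY_PATH is visible to python, so this minimal sketch looks for a tensorrt_libs directory on LD_LIBRARY_PATH. The function name tensorrt_dirs_on_library_path is made up for this illustration.

```python
import os


def tensorrt_dirs_on_library_path(environ=None):
    """Return the LD_LIBRARY_PATH entries whose last component is tensorrt_libs."""
    environ = os.environ if environ is None else environ
    # LD_LIBRARY_PATH is a colon-separated list of directories on Linux.
    entries = environ.get("LD_LIBRARY_PATH", "").split(":")
    return [
        entry for entry in entries
        if os.path.basename(entry.rstrip("/")) == "tensorrt_libs"
    ]


found = tensorrt_dirs_on_library_path()
if found:
    print("tensorrt_libs found on LD_LIBRARY_PATH:", found[0])
else:
    print("No tensorrt_libs directory on LD_LIBRARY_PATH.")
```

An empty result after "source activate tf" would suggest that env_vars.sh was not written or not sourced.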

Let’s go back to the login node and do a test of everything from the beginning, without any install steps.

# Start a new interactive session
qrsh -l v100 bash
module load apps/binapps/anaconda3/2022.10
source activate tf
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
source deactivate
exit

Let’s also test a batch job. First write a jobscript, e.g., using the command: gedit tf.qsub

#!/bin/bash --login
#$ -cwd
#$ -l v100
#$ -pe smp.pe 8
module load apps/binapps/anaconda3/2022.10
source activate tf
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

Now submit the job:

qsub tf.qsub

You should see the following output in your tf.qsub.o123456 output file:

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

Note that the tf.qsub.e123456 error file will contain some warnings:

2024-02-19 19:45:15.604408: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-19 19:45:15.604463: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-19 19:45:15.606101: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-02-19 19:45:15.615548: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.

They can be ignored.
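
To make real errors easier to spot in a longer error file, the known warnings can be filtered out. The following is a minimal sketch that uses marker strings copied from the sample output above. The function name unexpected_lines is made up for this illustration, and the markers may need updating for other Tensorflow versions.

```python
# Substrings taken from the warnings shown in the sample error file above.
BENIGN_MARKERS = (
    "Unable to register cuDNN factory",
    "Unable to register cuFFT factory",
    "Unable to register cuBLAS factory",
    "cpu_feature_guard",
    "To enable the following instructions",
)


def unexpected_lines(text):
    """Return the non-empty lines that do not match any known benign warning."""
    return [
        line for line in text.splitlines()
        if line.strip() and not any(marker in line for marker in BENIGN_MARKERS)
    ]


sample = (
    "E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] "
    "Unable to register cuFFT factory: Attempting to register factory\n"
    "Traceback (most recent call last):\n"
)
for line in unexpected_lines(sample):
    print(line)
```

To use this on a real job, read the contents of the error file (e.g., tf.qsub.e123456) and pass the text to unexpected_lines.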

Example 2 – Older notes for Tensorflow 2.12

The following creates a conda env and installs Tensorflow 2.12. Some of these settings may be needed again in future, so we keep the notes here for now.

# Start an interactive session
qrsh -l nvidia_v100 bash

# Note that you must NOT have any existing conda environments active.
# If your command prompt looks something like:
(base) [username@node800 [csf3] ~]$
  #
  # Any (name) here is the name of the active conda env.

then you need to deactivate the base conda env (or whatever name is showing) using the following command:

source deactivate

# Now the prompt shows no conda env
[username@node800 [csf3] ~]$

The following commands were run on a CSF compute node:

module purge
module load apps/binapps/anaconda3/2022.10

# Create a conda env with some basic packages needed to install other packages.
# We're using anaconda3 v2022.10 which provides python 3.9.13 (use "python --version").
# We use 'tensorflow' as the env name in this example (you can change this if you want).
python --version
conda create -n tensorflow python=3.9.13
Proceed ([y]/n)? y                          # <--- Press y [return] to proceed

# Now activate the env so we can install other packages in to the env
source activate tensorflow

# Now install CUDA. The tensorflow website tells you which version is required.
# If you install the wrong version, tensorflow will complain it can't find a GPU.
conda install -c conda-forge cudatoolkit=11.8.0
Proceed ([y]/n)? y                          # <--- Press y [return] to proceed

# now install tensorflow using pip. We use a slightly different command to that
# shown on the tensorflow website to ensure the packages are installed inside our
# conda env:
pip install --isolated nvidia-cudnn-cu11==8.6.0.163 tensorflow==2.12.*

# Extra steps to fix bug in tensorflow 2.11 and 2.12. Hopefully not needed in TF 2.13!!
conda install -c nvidia cuda-nvcc --yes
mkdir -p $CONDA_PREFIX/lib/nvvm/libdevice/
cp -p $CONDA_PREFIX/lib/libdevice.10.bc $CONDA_PREFIX/lib/nvvm/libdevice/

# Finally, deactivate the conda env
source deactivate

# End our interactive session and return to the login node
exit

We can now submit a test job. There are a couple of extra lines needed in the jobscript to set up the environment so that python can find your local installation of tensorflow. Create a jobscript as follows:

#!/bin/bash --login
#$ -pe smp.pe 8          # Use 8 CPU cores per GPU
#$ -l v100=1             # Use one v100 GPU

module load apps/binapps/anaconda3/2022.10
# Activate the conda env. In a jobscript you must use "source activate" not "conda activate"
source activate tensorflow

# Some extra setup needed by tensorflow to get the location of our conda env
CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib:$CUDNN_PATH/lib

# Extra setting needed to fix TF 2.11 and 2.12. Hopefully not needed in TF 2.13!
export XLA_FLAGS=--xla_gpu_cuda_data_dir=$CONDA_PREFIX/lib

# Now run some python code. This is just a simple TF test.
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

# Can deactivate the env at the end of the job
source deactivate

Submit the job to the GPU nodes using:

# You should be back on the login node at this point. Submit the batch job:
qsub jobscript

where jobscript is the name of your jobscript. The output should be:

# Look in the .oNNNNNN file for the output
cat jobscript.oNNNNNN
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

This shows that tensorflow ran and found a GPU. You should also be able to use the Tensorflow Keras example from earlier.

Example 3 – Installing PyPAZ

The following example creates a conda env in which to install a package named PyPAZ, an image processing package that uses tensorflow. It shows how to use pip to install the package within a conda env.

# The following commands were run on the CSF login node
qrsh -l nvidia_v100 bash

# Wait for your interactive session to be scheduled....

# Now on the GPU node:

# At the time of writing (Mar 2023) the following was the most recent anaconda installed on the CSF.
module load apps/binapps/anaconda3/2022.10

# Create a conda env that will use the same version of python as the central anaconda install.
# Otherwise conda will install the very latest version of python, which takes longer.
python --version
  # Python 3.9.13
conda create -n paz python=3.9.13

# Activate the env
source activate paz

# Now install CUDA. The tensorflow website tells you which version is required.
# If you install the wrong version, tensorflow will complain it can't find a GPU.
conda install -c conda-forge cudatoolkit=11.8.0
Proceed ([y]/n)? y                          # <--- Press y [return] to proceed

# now install tensorflow using pip. We use a slightly different command to that
# shown on the tensorflow website to ensure the packages are installed inside our
# conda env:
pip install --isolated nvidia-cudnn-cu11==8.6.0.163 tensorflow==2.12.*

# Install the paz package, telling pip to ignore any local config.
# This ensures the packages are installed inside the current conda env.
pip install --isolated --log ~/pypaz.log pypaz

# Extra steps to fix bug in tensorflow 2.11 and 2.12. Hopefully not needed in TF 2.13!!
conda install -c nvidia cuda-nvcc --yes
mkdir -p $CONDA_PREFIX/lib/nvvm/libdevice/
cp -p $CONDA_PREFIX/lib/libdevice.10.bc $CONDA_PREFIX/lib/nvvm/libdevice/

# Check what has been installed
conda list
  #
  # Note that tensorflow has been installed, so you do not need
  # to use our central installation on the CSF - you now have your
  # own version in the 'paz' conda env

# Let's deactivate the env now we've installed it. We recommend only activating
# conda environments when you want to install packages in them or when running
# jobs for that project.
source deactivate


# We'll now use the GPU node to test the installation
# We recommend you terminate your previous session and start a new one:
exit
qrsh -l nvidia_v100 bash

# Load the modulefile needed to use anaconda (also do this if you submit a jobscript for paz!)
module load apps/binapps/anaconda3/2022.10

# Activate the conda env (must use "source activate" not "conda activate" in a jobscript)
source activate paz

# Some extra setup needed by tensorflow to get the location of our conda env
CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib:$CUDNN_PATH/lib

# Extra setting needed to fix TF 2.11 and 2.12
export XLA_FLAGS=--xla_gpu_cuda_data_dir=$CONDA_PREFIX/lib

# Start python (we'll type python code directly at the prompt.)
python
  
  # Now using the python commands shown on the Paz github page:
  from paz.applications import SSD512COCO
  detect = SSD512COCO()
  exit()

# Return to the login node
exit

Further info

Updates

None.

Last modified on February 28, 2024 at 11:37 am by George Leaver