Running interactive or batch jobs on backend nodes
The following sections assume you have connected to the Zrek login node. Access to backend nodes is only possible from the login node.
Please note: We use SGE commands to access the backend nodes. This is mainly to ensure users have exclusive access to specific GPUs, Xeon Phis and FPGAs in the backend nodes, so that users don't trample on each other's applications. Please see the FAQ below for more explanation of why we have switched to SGE.
To see a reminder of the commands needed to access all available backend nodes, simply run the following on the zrek login node:
# At the zrek login node prompt, run:
backends
Interactive and Batch Jobs
Connecting to a backend node for interactive use
For an interactive session on a backend node, which gives you a command line on that node and allows you to start GUI apps as well as non-GUI compute apps, use:
qrsh -l techflag bash
Note that you must put the bash command at the end of the line.
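For example, a minimal interactive session requesting one K20 GPU (see the techflag table below; the modulefile path and application name are placeholders, as in the jobscript example further down):
qrsh -l k20 bash
# Now on the backend node; SGE has set CUDA_VISIBLE_DEVICES for you
echo $CUDA_VISIBLE_DEVICES
module load path/to/required/app
./myapp.exe arg1 arg2
# Give up the node so that others can use it
exit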
Submitting jobs to a backend node for batch processing
For a traditional batch job, where you write a jobscript and submit it to the queue, after which you can log out of Zrek if you wish and let the system get on with your work (CSF users will be familiar with these), use:
qsub -l techflag [optional-flags] jobscript
or put everything in the jobscript, for example:
#!/bin/bash --login
#$ -S bash
#$ -cwd
#$ -l techflag
module load path/to/required/app
./myapp.exe arg1 arg2
and submit the job using
qsub jobscript
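Once submitted, you can monitor the job with standard SGE tools. A minimal sketch, assuming SGE's default output-file naming (the job id 12345 below is purely illustrative):
qstat
# When the job finishes, SGE writes the job's stdout/stderr to files in
# the submission directory, e.g. for a hypothetical job id 12345:
cat jobscript.o12345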
Techflags to Select a Backend Node
In the qrsh or qsub commands above, use one of the techflag values from the table below to select the accelerator hardware you wish to use:
techflag | Gives you exclusive access to | Total available | Hostnames |
---|---|---|---|
fpga_altera or fpgaalt | One Altera PCIe-385n FPGA | 1 | namazu |
 | One Maxeler FPGA | 1 | merlion |
k20 | One Nvidia K20 GPU | 7 (2 per node) | kaiju1, |
k20duo | Two Nvidia K20 GPUs in the same node | | |
k20duo_p2p | Two Nvidia K20 GPUs capable of CUDA Peer-to-peer in the same node | 2 | kaiju101 |
k40 | One Nvidia K40 GPU | 2 | besso |
xeonphi | One Intel Xeon Phi/MIC | 2 (2 per node) | xenomorph |
xeonphiduo | Two Intel Xeon Phi/MICs in the same node | | |
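For example, to combine these techflags with the commands shown earlier:
# Interactive session with exclusive use of two K20 GPUs in one node
qrsh -l k20duo bash
# Batch job given exclusive use of one K40 GPU
qsub -l k40 jobscript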
NOTE: When you have finished with a node, please end your qrsh session using exit, otherwise you will prevent other users from accessing the node.
Frequently Asked Questions about SGE on Zrek
- Why have we started using SGE?
- As more users access the system, sharing access to the backend nodes becomes problematic: you normally require exclusive access to one or more GPUs or other accelerator cards, and sharing GPUs and accelerators gives non-optimal performance. Using SGE provides an automatic method of granting exclusive access to a backend node.
- Will I have long wait-times to access a node?
- Hopefully not. The number of users on zrek is small at the moment. Please remember to log out of the backend node when you have finished with it so that other users can make use of it.
- Do I need to use batch scripts (like on CSF) to run jobs?
- Not necessarily. Zrek is meant for experimental work, code development and code-compile-run usage, possibly using graphical debuggers and other development tools, as well as running simulations using existing code (e.g., open-source simulation code). Using qrsh will give you an interactive session with a command line for code-compile-run development work, whereas qsub will allow you to submit batch jobs that simply run when the resources are free, so you won't need to keep your desktop logged in to zrek!
- How do I control which GPU or Xeon Phi (MIC) I use if there are two in a node?
- By using SGE we manage that for you. When you access a backend node using qrsh or qsub, the system will automatically allocate you a free GPU (or MIC card). It does this by setting specific environment variables: CUDA_VISIBLE_DEVICES to 0, 1 or 0,1 (and similarly for the Xeon Phis: OFFLOAD_DEVICES). Most CUDA (or Xeon Phi) software will look at these environment variables and only use the permitted devices. See the short sketch after this FAQ for how to inspect these variables.
- Do I still use modulefiles to set up the environment?
- Yes. Load the modulefile after logging in to a backend node using qrsh. If submitting a batch job using qsub, you can load the modulefile in your jobscript or on the login node first.
- The nodes are busy – can I use ssh to log in and quickly compile something?
- Access to the backend nodes must be via the SGE commands (qsub or qrsh). Users found accessing backend nodes in any other way will be banned from the system. Compilation on the zrek login node is permitted, but any compilation that requires a GPU, Xeon Phi or FPGA will not work on the login node.
- When I use qrsh to access a GPU (or Xeon Phi) node, my code says there are no GPUs (or MICs). Why?
- You have probably forgotten the bash command at the end of the qrsh command. We require you to explicitly run the bash shell (on the CSF you don't need to do this) so that the CUDA_VISIBLE_DEVICES or OFFLOAD_DEVICES environment variables can be set up correctly. If you forget the bash command, the environment variables are not set and your code thinks there are no devices present. See above for more details on the qrsh command to use.
- Will a batch job run on a node while I'm using it interactively?
- If there is a free accelerator card (GPU, Phi, FPGA), because (a) you have only requested one of them in your qrsh command and (b) the node has two such cards, then any free card in the same host could be used by a batch job. A batch job and an interactive job will never use the same accelerator card, though.
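A minimal sketch of inspecting the device allocation described above, once you are on a backend node (the values printed depend on what you requested):
# On a GPU node: prints 0, 1 or 0,1
echo $CUDA_VISIBLE_DEVICES
# On a Xeon Phi node:
echo $OFFLOAD_DEVICES
# An empty value usually means you forgot the bash at the end of your qrsh command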