Running interactive or batch jobs on backend nodes
The following sections assume you have connected to the Zrek login node. Access to backend nodes is only possible from the login node.
Please note: We use SGE commands to access the backend nodes. This is mainly to ensure users have exclusive access to specific GPUs, Xeon Phis and FPGAs in the backend nodes, so that users don't trample on each other's applications. Please see the FAQ below for more explanation of why we have switched to SGE.
To see a reminder of the commands needed to access all available backend nodes, simply run the following on the zrek login node:
# At the zrek login node prompt, run:
backends
Interactive and Batch Jobs
Connecting to a backend node for interactive use
For an interactive session on a backend node, which gives you a command line on that node and allows you to start GUI apps as well as non-GUI compute apps, use:
qrsh -l techflag bash
Note that you must put the bash command at the end of the line.
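For example, a minimal interactive session requesting one K20 GPU (see the techflag table below; the modulefile path and application name are placeholders, as in the jobscript example further down):
qrsh -l k20 bash
# Now on the backend node; SGE has set CUDA_VISIBLE_DEVICES for you
echo $CUDA_VISIBLE_DEVICES
module load path/to/required/app
./myapp.exe arg1 arg2
# Give up the node so that others can use it
exit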
Submitting jobs to a backend node for batch processing
For a traditional batch job, where you write a jobscript and submit it to the queue, after which you can log out of Zrek if you wish and let the system get on with your work (CSF users will be familiar with these), use:
qsub -l techflag [optional-flags] jobscript
or put everything in the jobscript, for example:
#!/bin/bash --login
#$ -S bash
#$ -cwd
#$ -l techflag
module load path/to/required/app
./myapp.exe arg1 arg2
and submit the job using
qsub jobscript
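Once submitted, you can monitor the job with standard SGE tools. A minimal sketch, assuming SGE's default output-file naming (the job id 12345 below is purely illustrative):
qstat
# When the job finishes, SGE writes the job's stdout/stderr to files in
# the submission directory, e.g. for a hypothetical job id 12345:
cat jobscript.o12345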
Techflags to Select a Backend Node
In the qrsh or qsub commands above, use one of the techflag values from the table below to select the accelerator hardware you wish to use:
techflag | Gives you exclusive access to | Total available | Hostnames |
---|---|---|---|
fpga_altera or fpgaalt | One Altera PCIe-385n FPGA | 1 | namazu |
 | One Maxeler FPGA | 1 | merlion |
k20 | One Nvidia K20 GPU | 7 (2 per node) | kaiju1, |
k20duo | Two Nvidia K20 GPUs in the same node | | |
k20duo_p2p | Two Nvidia K20 GPUs capable of CUDA Peer-to-peer in the same node | 2 | kaiju101 |
k40 | One Nvidia K40 GPU | 2 | besso |
xeonphi | One Intel Xeon Phi/MIC | 2 (2 per node) | xenomorph |
xeonphiduo | Two Intel Xeon Phi/MICs in the same node | | |
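For example, to combine these techflags with the commands shown earlier:
# Interactive session with exclusive use of two K20 GPUs in one node
qrsh -l k20duo bash
# Batch job given exclusive use of one K40 GPU
qsub -l k40 jobscript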
NOTE: When you have finished with a node, please end your qrsh session using exit, otherwise you will prevent other users from accessing the node.
Frequently Asked Questions about SGE on Zrek
- Why have we started using SGE?
- As more users access the system, sharing access to the backend nodes becomes problematic: you normally require exclusive access to one or more GPUs or other accelerator cards, and sharing GPUs and accelerators gives non-optimal performance. Using SGE provides an automatic method of granting exclusive access to a backend node.
- Will I have long wait-times to access a node?
- Hopefully not. The number of users on zrek is small at the moment. Please remember to log out of the backend node when you have finished with it so that other users can make use of it.
- Do I need to use batch scripts (like on CSF) to run jobs?
- Not necessarily. Zrek is meant for experimental work, code development and code-compile-run usage, possibly using graphical debuggers and other development tools, as well as running simulations using existing code (e.g., open-source simulation code). Using qrsh will give you an interactive session with a command line for code-compile-run development work, whereas qsub will allow you to submit batch jobs that simply run when the resources are free, so you won't need to keep your desktop logged in to zrek!
- How do I control which GPU or Xeon Phi (MIC) I use if there are two in a node?
- By using SGE we manage that for you. When you access a backend node using qrsh or qsub, the system will automatically allocate you a free GPU (or MIC card). It does this by setting specific environment variables: CUDA_VISIBLE_DEVICES to 0, 1 or 0,1 (and similarly for the Xeon Phis: OFFLOAD_DEVICES). Most CUDA (or Xeon Phi) software will look at these environment variables and only use the permitted devices. See the short sketch after this FAQ for how to inspect these variables.
- Do I still use modulefiles to set up the environment?
- Yes. Load the modulefile after logging in to a backend node using qrsh. If submitting a batch job using qsub, you can load the modulefile in your jobscript or on the login node first.
- The nodes are busy – can I use ssh to log in and quickly compile something?
- Access to the backend nodes must be via the SGE commands (qsub or qrsh). Users found accessing backend nodes in any other way will be banned from the system. Compilation on the zrek login node is permitted, but any compilation that requires a GPU, Xeon Phi or FPGA will not work on the login node.
- When I use qrsh to access a GPU (or Xeon Phi) node, my code says there are no GPUs (or MICs). Why?
- You have probably forgotten the bash command at the end of the qrsh command. We require you to explicitly run the bash shell (on the CSF you don't need to do this) so that the CUDA_VISIBLE_DEVICES or OFFLOAD_DEVICES environment variables can be set up correctly. If you forget the bash command, the environment variables are not set and your code thinks there are no devices present. See above for more details on the qrsh command to use.
- Will a batch job run on a node while I'm using it interactively?
- If there is a free accelerator card (GPU, Phi, FPGA), because (a) you have only requested one of them in your qrsh command and (b) the node has two such cards, then any free card in the same host could be used by a batch job. A batch job and an interactive job will never use the same accelerator card, though.
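A minimal sketch of inspecting the device allocation described above, once you are on a backend node (the values printed depend on what you requested):
# On a GPU node: prints 0, 1 or 0,1
echo $CUDA_VISIBLE_DEVICES
# On a Xeon Phi node:
echo $OFFLOAD_DEVICES
# An empty value usually means you forgot the bash at the end of your qrsh command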