Note: the CSF2 has been replaced by the CSF3 and this documentation may be out of date. Please read the CSF3 documentation instead.
OpenMPI on AMD Bulldozer
This page describes how best to compile and run parallel MPI jobs on the AMD Bulldozer architecture compute nodes on the CSF, i.e., how to get the best performance out of these nodes.
Overview
- The CSF AMD Bulldozer nodes each have 64 CPU cores, with 2 GB RAM per core; all are connected via Infiniband.
- Intel compilers do not fully support this architecture.
- AMD recommend the use of the AMD Open64 compiler with the AMD Core Math Library (ACML) for maximum performance. ACML is an implementation of BLAS and LAPACK optimised especially for AMD processors; it also contains other routines, for example FFTs. See the ACML page for more information on using ACML on the CSF (an illustrative linking example is given after this list).
- Compilation and linking of binaries for these nodes should be performed on a dedicated Bulldozer node using qrsh, as described below.
- Job sizes must be a multiple of 64 cores (small jobs using fewer than 64 cores are covered below).
- The maximum runtime for a job is 4 days.
- Binaries compiled for the AMD Bulldozer compute nodes will not run on other nodes. Attempting to run such a binary elsewhere, for example on the Intel nodes, will yield an Illegal instruction error and the program will not run.
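As mentioned in the ACML bullet point above, linking against ACML is usually just a matter of adding the library to the link line. The sketch below is illustrative only: it assumes an ACML modulefile has already been loaded so that the library path is set (see the ACML documentation for the exact modulefile name), and the source file myblas.f90 is a hypothetical example.
# On a Bulldozer node, with an MPI modulefile (see below) and an ACML modulefile loaded
mpif90 myblas.f90 -o myblas -lacml   # -lacml pulls in the BLAS/LAPACK/FFT routines from ACML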
Restrictions on use
Code should only be compiled and executed on AMD Bulldozer nodes. Normally code can be compiled on the login node and tested there using very short test runs (e.g., one minute on fewer than 4 cores). This will not work for AMD Bulldozer codes because the login nodes use the Intel architecture.
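If you are unsure which architecture the node you are currently logged in to uses, one quick check is to look at the CPU model string (the exact model names shown will vary):
# Print the CPU model of the current node
grep -m1 'model name' /proc/cpuinfo
# The login nodes report an Intel model; the Bulldozer nodes report an AMD Opteron model.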
To use MPI you will need to amend your program to include the relevant calls to the MPI library.
Compilation and linking
The required steps are:
- log in to a Bulldozer node dedicated for this procedure;
- load the appropriate environment module;
- compile and link your MPI code;
- log off from the dedicated Bulldozer node.
Example
qrsh -l bulldozer -l short
module load mpi/open64-4.5.2/openmpi/1.6-ib-amd-bd
mpif90 mynameis.f90 -o mynameis
exit
# You are now back on the login node
Running MPI jobs
The required steps are as for other MPI jobs:
- create a suitable qsub script (an example is given below) and save it as, for example, my_open64_mpi_job.qsub
- load the appropriate environment module
- ensure you are on the login node of the CSF (not the dedicated compile/link node)
- submit your job to SGE, for example:
qsub my_open64_mpi_job.qsub
- The maximum runtime for a job is 4 days
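After submitting the job you can monitor it with the usual SGE commands. A brief sketch (the job id in the file names is illustrative):
qstat   # list your queued and running jobs and their states
# When the job finishes, its output is written to files named after the jobscript,
# e.g. my_open64_mpi_job.qsub.o123456 (stdout) and my_open64_mpi_job.qsub.e123456 (stderr)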
Small MPI Jobs (fewer than 64 cores)
Small MPI jobs that don’t use all cores on the node can be run in the smp-64bd.pe
parallel environment. In this case you must load one of the following modulefiles:
# PGI 14.10 compiler
module load mpi/pgi-14.10-acml-fma4/openmpi/1.8.3-amd-bd

# Open64 4.5.2.1 compiler
module load mpi/open64-4.5.2.1/openmpi/1.8.3-amd-bd
module load mpi/open64-4.5.2.1/openmpi/1.6-amd-bd
module load mpi/open64-4.5.2/openmpi/1.6-amd-bd

# Intel compiler (code not as optimized as PGI or Open64)
module load mpi/intel-14.0/openmpi/1.8.3
module load mpi/intel-12.0/openmpi/1.6

# GNU compiler (code not as optimized as PGI or Open64)
module load mpi/gcc/openmpi/1.6
Use the smp-64bd.pe parallel environment in your jobscript. Note that if you use the entire node (all 64 cores) your code may run faster with the -ib modulefiles from the next section: when all 64 cores are used, the MPI processes are pinned to cores, which improves performance (the non-ib modulefiles above disable core pinning, which is on by default as of OpenMPI 1.7.4). When fewer than 64 cores are used, processes cannot be pinned, because a job sharing the node has no way of knowing which cores already have MPI processes pinned to them, so every job would start pinning from the same cores. OpenMPI's task-pinning mechanism is therefore disabled and process placement is left to the operating system.
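To check what binding (if any) OpenMPI applies to your processes, you can add the --report-bindings option to the mpirun line in your jobscript (such as the example below); the bindings are then printed to the job's output/error files:
# Report each MPI process's core binding (or lack of one) at startup
mpirun --report-bindings -n $NSLOTS ./my_app_amd.exe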
#!/bin/bash
#$ -S bash
#$ -cwd                 # Run in current directory
#$ -V                   # Inherit settings from modulefiles
#$ -pe smp-64bd.pe 16   # Small MPI job (can use up to 64 cores)

# $NSLOTS is automatically set to the number you specify on the -pe line
mpirun -n $NSLOTS ./my_app_amd.exe
Large MPI Jobs (64 cores or more)
Large multi-node MPI jobs that use all cores on the node can be run in the orte-64bd-ib.pe
parallel environment. In this case you must load one of the following modulefiles (these can also be used for a single-node 64-core job using all 64 cores of that node in smp-64bd.pe):
# PGI 14.10 compiler
module load mpi/pgi-14.10-acml-fma4/openmpi/1.8.3-ib-amd-bd

# Open64 4.5.2.1 compiler
module load mpi/open64-4.5.2.1/openmpi/1.8.3-ib-amd-bd
module load mpi/open64-4.5.2.1/openmpi/1.6-ib-amd-bd
module load mpi/open64-4.5.2/openmpi/1.6-ib-amd-bd

# Intel compiler (code not as optimized as PGI or Open64)
module load mpi/intel-14.0/openmpi/1.8.3-ib
module load mpi/intel-12.0/openmpi/1.6-ib

# GNU compiler (code not as optimized as PGI or Open64)
module load mpi/gcc/openmpi/1.6-ib
#!/bin/bash
#$ -S bash
#$ -cwd                      # Run in current directory
#$ -V                        # Inherit settings from modulefiles
#$ -pe orte-64bd-ib.pe 128   # Large MPI job (multiples of 64 cores only)

# $NSLOTS is automatically set to the number you specify on the -pe line
mpirun -n $NSLOTS ./my_app_amd.exe
Further information
- Online help via the command line:
man mpif90   # for Fortran MPI
man mpicc    # for C/C++ MPI
man mpirun   # for information on running MPI executables
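The OpenMPI compiler wrappers can also show the underlying compiler command and flags they would use, which can be handy for checking which compiler a loaded modulefile has selected:
mpif90 --showme   # print the full compile/link command line the wrapper would run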