Note: the CSF2 has been replaced by the CSF3 and this documentation may be out of date. Please read the CSF3 documentation instead.
OpenMPI on AMD Bulldozer
This page describes how best to compile and run parallel MPI jobs on the AMD Bulldozer architecture compute nodes on the CSF, i.e., how to get the best performance out of these nodes.
Overview
- The CSF AMD Bulldozer nodes each have 64 CPU cores, with 2 GB RAM per core; all are connected via Infiniband.
- Intel compilers do not fully support this architecture.
- AMD recommend the use of the AMD Open64 compiler with the AMD Core Math Library (ACML) for maximum performance. ACML is an implementation of BLAS and LAPACK optimised especially for AMD processors; it also contains other routines, for example FFTs. See the ACML page for more information on using ACML on the CSF (an illustrative linking example is given after this list).
- Compilation and linking of binaries for these nodes should be performed on a dedicated Bulldozer node using qrsh, as described below.
- Job sizes must be a multiple of 64 cores (small jobs using fewer than 64 cores are covered below).
- The maximum runtime for a job is 4 days.
- Binaries compiled for the AMD Bulldozer compute nodes will not run on other nodes. Attempting to run such a binary elsewhere, for example on the Intel nodes, will yield an Illegal instruction error and the program will not run.
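As mentioned in the ACML bullet point above, linking against ACML is usually just a matter of adding the library to the link line. The sketch below is illustrative only: it assumes an ACML modulefile has already been loaded so that the library path is set (see the ACML documentation for the exact modulefile name), and the source file myblas.f90 is a hypothetical example.
# On a Bulldozer node, with an MPI modulefile (see below) and an ACML modulefile loaded
mpif90 myblas.f90 -o myblas -lacml   # -lacml pulls in the BLAS/LAPACK/FFT routines from ACML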
Restrictions on use
Code should only be compiled and executed on AMD Bulldozer nodes. Normally code can be compiled on the login node and tested there using very short test runs (e.g., one minute on fewer than 4 cores). This will not work for AMD Bulldozer codes because the login nodes use the Intel architecture.
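If you are unsure which architecture the node you are currently logged in to uses, one quick check is to look at the CPU model string (the exact model names shown will vary):
# Print the CPU model of the current node
grep -m1 'model name' /proc/cpuinfo
# The login nodes report an Intel model; the Bulldozer nodes report an AMD Opteron model.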
To use MPI you will need to amend your program to include the relevant calls to the MPI library.
Compilation and linking
The required steps are:
- log in to a Bulldozer node dedicated for this procedure;
- load the appropriate environment module;
- compile and link your MPI code;
- log off from the dedicated Bulldozer node.
Example
qrsh -l bulldozer -l short
module load mpi/open64-4.5.2/openmpi/1.6-ib-amd-bd
mpif90 mynameis.f90 -o mynameis
exit
# You are now back on the login node
Running MPI jobs
The required steps are as for other MPI jobs:
- create a suitable qsub script (an example is given below) and save it as, for example, my_open64_mpi_job.qsub
- load the appropriate environment module
- ensure you are on the login node of the CSF (not the dedicated compile/link node)
- submit your job to SGE, for example:
qsub my_open64_mpi_job.qsub
- The maximum runtime for a job is 4 days
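After submitting the job you can monitor it with the usual SGE commands. A brief sketch (the job id in the file names is illustrative):
qstat   # list your queued and running jobs and their states
# When the job finishes, its output is written to files named after the jobscript,
# e.g. my_open64_mpi_job.qsub.o123456 (stdout) and my_open64_mpi_job.qsub.e123456 (stderr)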
Small MPI Jobs (fewer than 64 cores)
Small MPI jobs that don’t use all cores on the node can be run in the smp-64bd.pe
parallel environment. In this case you must load one of the following modulefiles:
# PGI 14.10 compiler
module load mpi/pgi-14.10-acml-fma4/openmpi/1.8.3-amd-bd

# Open64 4.5.2.1 compiler
module load mpi/open64-4.5.2.1/openmpi/1.8.3-amd-bd
module load mpi/open64-4.5.2.1/openmpi/1.6-amd-bd
module load mpi/open64-4.5.2/openmpi/1.6-amd-bd

# Intel compiler (code not as optimized as PGI or Open64)
module load mpi/intel-14.0/openmpi/1.8.3
module load mpi/intel-12.0/openmpi/1.6

# GNU compiler (code not as optimized as PGI or Open64)
module load mpi/gcc/openmpi/1.6
Use the smp-64bd.pe parallel environment in your jobscript. Note that if you use the entire node (all 64 cores) your code may run faster with the -ib modulefiles from the next section: when all 64 cores are used, the MPI processes are pinned to cores, which improves performance (the non-ib modulefiles above disable core pinning, which is on by default as of OpenMPI 1.7.4). When fewer than 64 cores are used, processes cannot be pinned, because a job sharing the node has no way of knowing which cores already have MPI processes pinned to them, so every job would start pinning from the same cores. OpenMPI's task-pinning mechanism is therefore disabled and process placement is left to the operating system.
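To check what binding (if any) OpenMPI applies to your processes, you can add the --report-bindings option to the mpirun line in your jobscript (such as the example below); the bindings are then printed to the job's output/error files:
# Report each MPI process's core binding (or lack of one) at startup
mpirun --report-bindings -n $NSLOTS ./my_app_amd.exe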
#!/bin/bash
#$ -S bash
#$ -cwd                 # Run in current directory
#$ -V                   # Inherit settings from modulefiles
#$ -pe smp-64bd.pe 16   # Small MPI job (can use up to 64 cores)

# $NSLOTS is automatically set to the number you specify on the -pe line
mpirun -n $NSLOTS ./my_app_amd.exe
Large MPI Jobs (64 cores or more)
Large multi-node MPI jobs that use all cores on the node can be run in the orte-64bd-ib.pe
parallel environment. In this case you must load one of the following modulefiles (these can also be used for a single-node 64-core job using all 64 cores of that node in smp-64bd.pe):
# PGI 14.10 compiler
module load mpi/pgi-14.10-acml-fma4/openmpi/1.8.3-ib-amd-bd

# Open64 4.5.2.1 compiler
module load mpi/open64-4.5.2.1/openmpi/1.8.3-ib-amd-bd
module load mpi/open64-4.5.2.1/openmpi/1.6-ib-amd-bd
module load mpi/open64-4.5.2/openmpi/1.6-ib-amd-bd

# Intel compiler (code not as optimized as PGI or Open64)
module load mpi/intel-14.0/openmpi/1.8.3-ib
module load mpi/intel-12.0/openmpi/1.6-ib

# GNU compiler (code not as optimized as PGI or Open64)
module load mpi/gcc/openmpi/1.6-ib
#!/bin/bash
#$ -S bash
#$ -cwd                      # Run in current directory
#$ -V                        # Inherit settings from modulefiles
#$ -pe orte-64bd-ib.pe 128   # Large MPI job (multiples of 64 cores only)

# $NSLOTS is automatically set to the number you specify on the -pe line
mpirun -n $NSLOTS ./my_app_amd.exe
Further information
- Online help via the command line:
man mpif90   # for Fortran MPI
man mpicc    # for C/C++ MPI
man mpirun   # for information on running MPI executables
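The OpenMPI compiler wrappers can also show the underlying compiler command and flags they would use, which can be handy for checking which compiler a loaded modulefile has selected:
mpif90 --showme   # print the full compile/link command line the wrapper would run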