The Computational Shared Facility 3

Running jobs on The HPC Pool

How to Log In

If you have had a project approved please log in to the CSF3. The HPC Pool is a separate resource, but it shares some aspects of the CSF3 such as login nodes, software installs and filesystems.

Job Throughput and Accounting

Please note: If you are an existing CSF user, jobs run in the HPC Pool do not account against an existing CSF contribution and scheduling of jobs / throughput is not based on existing CSF usage. If you are not part of an existing CSF contributing group your HPC Pool access will also come with ‘free at the point of use’ access to the CSF3 enabled.

All jobs run in the HPC Pool will be required to specify an additional project code in the jobscript. We will tell you what that project code is once your application to use the HPC Pool has been approved.

Jobscript Options

Jobs in the HPC Pool should be submitted to the hpc.pe parallel environment. For example:

#!/bin/bash --login
#$ -cwd
#$ -pe hpc.pe ncores              # ncores: 128 to 1024 in multiples of 32
#$ -P hpc-projectcode             # hpc-projectcode: we will issue you with a project code


# Load the application's modulefile(s) to set up the environment
module load apps/some/name/1.2.3

# Run the app ($NSLOTS is automatically set to the number of cores requested above)
mpirun -n $NSLOTS appname args...

Number of Cores and Memory

A job should request between 128 and 1024 cores (inclusive) in multiples of 32.

A compute node has 32 cores (2 x 16-core sockets) and 192GB RAM, giving 6GB per core. However, only the total memory usage per node is limited: your app may allocate memory as it sees fit (e.g., MPI rank 0 may allocate far more than the other worker ranks), up to the 192GB per-node limit.
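As a quick illustration, the rules above can be checked with a little bash arithmetic before editing a jobscript (check_ncores is a hypothetical helper, not a CSF command):

```shell
#!/bin/bash
# Hypothetical helper: check a core request against the HPC Pool rules
# (128-1024 inclusive, multiple of 32) and report the node count.
check_ncores() {
    local n=$1
    if (( n < 128 || n > 1024 )); then
        echo "invalid: $n is outside 128-1024"; return 1
    fi
    if (( n % 32 != 0 )); then
        echo "invalid: $n is not a multiple of 32"; return 1
    fi
    echo "ok: $n cores = $(( n / 32 )) nodes"
}

check_ncores 128          # ok: 128 cores = 4 nodes
check_ncores 320          # ok: 320 cores = 10 nodes
check_ncores 100 || true  # invalid: 100 is outside 128-1024
```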

Maximum Wallclock time (job runtime)

The maximum job runtime in the HPC Pool is 4 days.

Filesystem

Jobs in the HPC Pool should be run from the scratch filesystem, which is faster than home storage; your jobs will benefit if they read or write large data files. Using scratch also eliminates the risk of filling your group's home storage, particularly if your application generates huge temporary files while a job is running.

Please note: the scratch filesystem's automatic clean-up policy is active. If you have unused scratch files, they will be deleted. Please read the Scratch Cleanup Policy for more information.
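For example, a job can be prepared and submitted from a per-job directory on scratch. The ~/scratch path and the myjob name below are assumptions for illustration; check the CSF filesystem documentation for the actual scratch location on your account:

```shell
#!/bin/bash
# Create a per-job directory on scratch and submit from there, so the
# job's input/output lands on the faster filesystem.
# Assumption: scratch is reachable via ~/scratch (verify on your account).
JOBDIR="$HOME/scratch/myjob"     # "myjob" is a hypothetical job name
mkdir -p "$JOBDIR"
cd "$JOBDIR"
echo "Working directory: $(pwd)"
# qsub jobscript                 # submit here; '#$ -cwd' keeps I/O on scratch
```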

Compilers

All of the compilers available on CSF3 can be used. To see available versions please run:

module search compilers

We recommend using the Intel compiler if your software can be compiled with that compiler:

module load compilers/intel/19.1.2

The following flags will build an executable that supports the AVX-512 vector instructions in the Skylake CPUs used in the HPC Pool:

# The executable will only run on the Skylake CPUs
icc -xCORE-AVX512 ...

# The executable will run on all CSF and HPC Pool CPUs (CPU type detected at runtime)
icc -msse4.2 -axCORE-AVX512,CORE-AVX2,AVX ...

Please see the CSF compiler documentation for more information.

OpenMPI

The current MPI implementation provided by the CSF is available using the modulefile:

module load mpi/intel-19.1/openmpi/4.1.1         # Will also load the Intel compiler modulefile

This modulefile should be loaded both when compiling source code and when running MPI software. For those who are interested: this uses the new UCX communication library.
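For example, compiling an MPI source file would look like the following on a login node. These commands only work on the cluster with the modulefile loaded, and hello_mpi.c is a hypothetical source file; the mpicc and mpif90 wrappers are the standard OpenMPI compiler wrappers, which invoke the Intel compilers loaded by the modulefile:

```shell
# Load the MPI (and hence Intel compiler) environment first
module load mpi/intel-19.1/openmpi/4.1.1

# Compile with the OpenMPI wrapper compilers (hello_mpi.c / hello_mpi.f90
# are hypothetical source files)
mpicc  -O2 -xCORE-AVX512 hello_mpi.c   -o hello_mpi
mpif90 -O2 -xCORE-AVX512 hello_mpi.f90 -o hello_mpi_f
```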

Mixed-mode MPI+OpenMP

Your application software may support the use of mixed-mode MPI+OpenMP. Here, a small number of MPI processes run on each compute node (for example, one or two per node), and each MPI process uses OpenMP threads for multi-core processing. This may improve the performance of your application by reducing the number of MPI processes communicating over the network between nodes. Please check whether your application software supports mixed-mode MPI+OpenMP execution.

If your application does support mixed-mode execution, you will need to add some extra flags to the mpirun command in your jobscript to ensure the MPI processes are placed correctly on the compute nodes. Examples are given below. It is worth noting that the compute nodes in the HPC Pool each contain two 16-core sockets (physical CPUs) which gives 32 cores in total for each compute node.

The number of MPI processes vs OpenMP threads to use depends on your application, the work it is doing and the amount of data you are processing. There is no single configuration that is best for all applications. We recommend trying a few different configurations and timing how long each job takes to run.
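For a 128-core job, the candidate splits can be enumerated quickly; the MPI process count is simply the total core count divided by the threads per process:

```shell
#!/bin/bash
# Enumerate MPI-process / OpenMP-thread splits for a 128-core job.
# Each line is one configuration worth timing.
NCORES=128
for threads in 32 16 8; do
    nprocs=$(( NCORES / threads ))
    echo "$nprocs MPI processes x $threads threads"
done
# 4 MPI processes x 32 threads
# 8 MPI processes x 16 threads
# 16 MPI processes x 8 threads
```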

Example 1: One MPI process per compute node

The following example shows how to place one MPI process on each compute node, with each MPI process using 32 OpenMP threads. The job uses 128 cores in total, which is equivalent to 4 compute nodes (each compute node has 32 cores).

#!/bin/bash --login
#$ -cwd
#$ -pe hpc.pe 128                # This gives the job 4 x 32-core compute nodes
#$ -P hpc-projectcode            # Use your own HPC project code here

module load mpi/intel-19.1/openmpi/4.1.1

# Instruct each MPI process to use 32 OpenMP threads
export OMP_NUM_THREADS=32

# Run 4 MPI processes in total, one per compute node. The --map-by flag describes this distribution.
mpirun -n 4 --map-by ppr:1:node:pe=$OMP_NUM_THREADS app.exe args ...

The above jobscript will run 4 MPI processes (-n 4). The --map-by flag describes how we want the processes distributed across the resources available to our job. In this case ppr:1:node:pe=$OMP_NUM_THREADS means the processes per resource are: 1 process per node (i.e., compute node). Each MPI process will use 32 threads. The :pe=$OMP_NUM_THREADS means we want that 1 MPI process to be bound to 32 physical CPU cores. Without this, the default is for the MPI process to be bound to only the cores in a single socket (CPU) and so you end up with two OpenMP threads running on each core of the first socket in the compute node. By specifying :pe=$OMP_NUM_THREADS we have each OpenMP thread running on its own CPU core – so the cores from both sockets (CPUs) in the compute node are used.
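The arithmetic behind this example can be sketched explicitly; these are just the numbers from the jobscript above:

```shell
#!/bin/bash
# Derive the mpirun -n value and thread count from the -pe request.
NCORES=128                  # from '#$ -pe hpc.pe 128'
CORES_PER_NODE=32
PROCS_PER_NODE=1            # one MPI process per node in this example
NODES=$(( NCORES / CORES_PER_NODE ))
NPROCS=$(( NODES * PROCS_PER_NODE ))
OMP_THREADS=$(( CORES_PER_NODE / PROCS_PER_NODE ))
echo "nodes=$NODES  mpirun -n $NPROCS  OMP_NUM_THREADS=$OMP_THREADS"
# nodes=4  mpirun -n 4  OMP_NUM_THREADS=32
```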

To see how processes are mapped to nodes and cores, add --report-bindings to the mpirun command (before your application's name on the line).

Example 2: One MPI process per socket (CPU)

The following example shows how to place one MPI process on each socket (CPU) in each compute node. A compute node contains two sockets (CPUs) and each socket provides 16 cores. Hence there will be two MPI processes per compute node and each process will use 16 OpenMP threads for multi-core processing.

#!/bin/bash --login
#$ -cwd
#$ -pe hpc.pe 128                # This gives the job 4 x 32-core compute nodes
#$ -P hpc-projectcode            # Use your own HPC project code here

module load mpi/intel-19.1/openmpi/4.1.1

# Instruct each MPI process to use 16 OpenMP threads
export OMP_NUM_THREADS=16

# Run 8 MPI processes in total, one per socket. The --map-by flag describes this distribution.
mpirun -n 8 --map-by ppr:1:socket:pe=$OMP_NUM_THREADS app.exe args ...

The above jobscript will run 8 MPI processes (-n 8). The --map-by flag describes how we want the processes distributed across the resources available to our job. In this case ppr:1:socket:pe=$OMP_NUM_THREADS means the processes per resource are: 1 process per socket (where each compute node has two sockets). Each MPI process will use 16 threads. The :pe=$OMP_NUM_THREADS means we want each MPI process to be bound to 16 physical CPU cores. In this example the :pe=$OMP_NUM_THREADS is optional – the default behaviour is correct because we request that each MPI process uses 16 threads and there are 16 cores in each socket.

To see how processes are mapped to nodes and cores, add --report-bindings to the mpirun command (before your application's name on the line).

Example 3: Two MPI processes per socket (CPU)

The following example shows how to place two MPI processes on each socket (CPU) in each compute node. A compute node contains two sockets (CPUs) and each socket provides 16 cores. Hence there will be four MPI processes per compute node and each process will use 8 OpenMP threads for multi-core processing.

#!/bin/bash --login
#$ -cwd
#$ -pe hpc.pe 128                # This gives the job 4 x 32-core compute nodes
#$ -P hpc-projectcode            # Use your own HPC project code here

module load mpi/intel-19.1/openmpi/4.1.1

# Instruct each MPI process to use 8 OpenMP threads
export OMP_NUM_THREADS=8

# Run 16 MPI processes in total, two per socket. The --map-by flag describes this distribution.
mpirun -n 16 --map-by ppr:2:socket:pe=$OMP_NUM_THREADS app.exe args ...

The above jobscript will run 16 MPI processes (-n 16). The --map-by flag describes how we want the processes distributed across the resources available to our job. In this case ppr:2:socket:pe=$OMP_NUM_THREADS means the processes per resource are: 2 processes per socket (where each compute node has two sockets). Each MPI process will use 8 threads. The :pe=$OMP_NUM_THREADS means we want each MPI process to be bound to 8 physical CPU cores.

To see how processes are mapped to nodes and cores, add --report-bindings to the mpirun command (before your application's name on the line).

Last modified on September 23, 2022 at 2:21 pm by George Leaver