Running jobs on The HPC Pool (Slurm)

Please note: The HPC Pool has now moved to the upgraded CSF3 running Slurm. This page has been updated to show Slurm usage, not SGE.

How to Log In

If you have had a project approved please log in to the CSF3. The HPC Pool is a separate resource, but it shares some aspects of the CSF3 such as login nodes, software installs and filesystems.

Job Throughput and Accounting

Please note: If you are an existing CSF user, jobs run in the HPC Pool are not accounted against an existing CSF contribution, and scheduling / throughput of HPC Pool jobs is not based on existing CSF usage. If you are not part of an existing CSF contributing group, your HPC Pool access will also include ‘free at the point of use’ access to the CSF3.

All jobs run in the HPC Pool must specify an additional project code in the jobscript. We will tell you what that project code is once your application to use the HPC Pool has been approved.
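For reference, the project code is given with the -A (or --account=) flag in the jobscript header, as in the examples below (hpc-proj-name is a placeholder):

#SBATCH -A hpc-proj-name  # Replace with the project code given to you when your application was approved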

Jobscript Options

As with all jobs transitioning from SGE to Slurm, job scripts will need to be updated accordingly. Further information can be found here: SGE to Slurm Reference.
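As a quick illustration of the most common changes (a sketch only – the command names are standard SGE and Slurm, but see the reference page above for the full mapping):

# Submitting and monitoring jobs
qsub jobscript     becomes   sbatch jobscript
qstat              becomes   squeue
qdel <jobid>       becomes   scancel <jobid>

# In the jobscript: Slurm jobs start in the submission directory,
# so the SGE '#$ -cwd' flag is no longer needed.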

Jobs in the HPC Pool should be submitted to the hpcpool Slurm partition. For example:

#!/bin/bash --login
### All of the following flags are required!
#SBATCH -p hpcpool        # The "partition" - named hpcpool
#SBATCH -N 4              # (or --nodes=) Minimum is 4, Max is 32. Job uses 32 cores on each node.
#SBATCH -n 128            # (or --ntasks=) TOTAL number of tasks. Max is 1024.
#SBATCH -t 1-0            # Wallclock limit. 1-0 is 1 day. Maximum permitted is 4-0 (4-days).
#SBATCH -A hpc-proj-name  # Use your HPC project code

module purge
module load apps/binapps/some_mpi_app/1.2.3

# Slurm knows to run $SLURM_NTASKS processes (-n above)
mpirun someMPIapp.exe
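Having saved the above to a file (for example jobscript.sh – a placeholder name), submit and monitor it with the usual Slurm commands:

sbatch jobscript.sh    # Submit the job - Slurm prints the new job ID
squeue -u $USER        # Check the state of your queued / running jobs
scancel <jobid>        # Delete a job if needed, using the ID reported by sbatch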

See below for more complex mixed-mode MPI+OpenMP jobs.

Number of Cores and Memory

A job should request between 128 and 1024 cores (inclusive) in multiples of 32.

A compute node has 32 cores (2 x 16-core sockets) and 192GB of RAM, giving 6GB per core. However, only the total memory usage is limited – your app may allocate memory as it sees fit (e.g., MPI rank 0 may allocate a lot more than the other worker ranks), up to the 192GB per-node limit.
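For example, a 256-core job would request 8 whole compute nodes (only the size-related flags are shown):

#SBATCH -N 8       # 8 nodes x 32 cores per node
#SBATCH -n 256     # 256 tasks in total (8 x 32)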

Maximum Wallclock time (job runtime)

The maximum job runtime in the HPC Pool is 4 days. You must specify the wallclock time limit for your job in the jobscript (see above).
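Slurm accepts several formats for the -t (or --time=) flag, for example:

#SBATCH -t 4-0         # 4 days (the maximum permitted)
#SBATCH -t 2-12        # 2 days and 12 hours
#SBATCH -t 12:00:00    # 12 hours (hours:minutes:seconds)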

Filesystem

Jobs in the HPC Pool should be run from the scratch filesystem. It is faster than home storage, so your jobs will benefit if they read/write large data files. Using scratch also eliminates the risk of filling your group’s home storage, particularly if your application generates huge temporary files while a job is running.

Please note: the scratch filesystem automatic clean-up policy is active. If you have unused scratch files they will be deleted. Please read the Scratch Cleanup Policy for more information.
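For example, assuming the usual ~/scratch link to your scratch area (the directory names below are placeholders – use your own):

cd ~/scratch
mkdir -p my_hpcpool_run
cd my_hpcpool_run
sbatch jobscript.sh    # Output files will be written to this scratch directory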

OpenMPI

You will likely use MPI when developing multi-node (HPC Pool capable) apps. The current MPI implementation provided by the CSF is available via one of the following modulefiles:

# Intel compiler
module load mpi/intel-oneapi-2024.2.0/openmpi/5.0.7    # Also loads the Intel compiler modulefile

# Or GCC compiler
module load mpi/gcc/openmpi/5.0.7-gcc-14.2.0           # Also loads the GCC compiler modulefile

The appropriate modulefile should be loaded both when compiling your source code and when running MPI software. For those who are interested: these installs use the new UCX communication library.

The following flags will build an executable that supports the AVX-512 vector instructions in the Skylake CPUs used in the HPC Pool:

# Intel compiler: The executable will only run on the HPC Pool (Intel) CPUs
mpicc -xCORE-AVX512 ...

# Intel compiler: The executable will run on all CSF (AMD+Intel CPUs) and HPC Pool (Intel) CPUs
mpicc -mavx2 -axCORE-AVX512,CORE-AVX2,AVX ...

Please see the CSF compiler documentation for more information.
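If you compile with the GCC toolchain instead, the equivalent vectorization flag is different. A hedged sketch (check the documentation for your GCC version):

# GCC compiler: target the Skylake CPUs in the HPC Pool (the executable may not run on older CPUs)
mpicc -march=skylake-avx512 ...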

Mixed-mode MPI+OpenMP

Your application software may support the use of mixed-mode MPI+OpenMP. This is where a small number of MPI processes are run on each compute node (for example one or two per compute node). Then each MPI process uses OpenMP threads to do some multi-core processing. This may improve the performance of your application by reducing the number of MPI processes that communicate over the network between the nodes. Please check whether your application software supports mixed-mode MPI+OpenMP execution.

If your application does support mixed-mode execution, you will need to add some extra flags to the mpirun command in your jobscript to ensure the MPI processes are placed correctly on the compute nodes. Examples are given below. It is worth noting that the compute nodes in the HPC Pool each contain two 16-core sockets (physical CPUs) which gives 32 cores in total for each compute node.

The number of MPI processes vs OpenMP threads to use depends on your application, the work it is doing and the amount of data you are processing. There is no single configuration that is best for all applications. We recommend trying a few different configurations and timing how long each job takes to run.

Example 1: One MPI process per compute node

The following example shows how to place one MPI process on each compute node, with each MPI process using 32 OpenMP threads. The job uses 128 cores in total, which is equivalent to 4 compute nodes (each compute node has 32 cores).

#!/bin/bash --login
#SBATCH -p hpcpool        # The "partition" - named hpcpool
#SBATCH -N 4              # (or --nodes=) Minimum is 4, Max is 32. Job uses 32 cores on each node.
#SBATCH -n 4              # (or --ntasks=) TOTAL number of tasks - the MPI processes.
#SBATCH -c 32             # (or --cpus-per-task=) Number of cores per MPI process.
#SBATCH -t 1-0            # Wallclock limit. 1-0 is 1 day. Maximum permitted is 4-0 (4-days).
#SBATCH -A hpc-proj-name  # Use your HPC project code

module purge
module load mpi/intel-oneapi-2024.2.0/openmpi/5.0.7

# SLURM_NTASKS will be set to 4 (-n above)
# SLURM_CPUS_PER_TASK will be set to 32 (-c above)

# Instruct each MPI process to use 32 OpenMP threads
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Run 4 MPI processes in total, one per compute node. The --map-by flag describes this distribution.
mpirun -n $SLURM_NTASKS --map-by ppr:1:node:pe=$OMP_NUM_THREADS app.exe args ...

The above jobscript will run 4 MPI processes (-n 4). The --map-by flag describes how we want the processes distributed across the resources available to our job. In this case ppr:1:node:pe=$OMP_NUM_THREADS means the processes per resource are: 1 process per node (i.e., compute node), with each MPI process bound to 32 physical CPU cores (the :pe=$OMP_NUM_THREADS part). Each MPI process will then use 32 OpenMP threads. Without :pe=$OMP_NUM_THREADS, the default is for each MPI process to be bound only to the cores of a single socket (CPU), so you would incorrectly have two OpenMP threads per core on the first socket of the compute node (overloading that socket). By specifying :pe=$OMP_NUM_THREADS we ensure each OpenMP thread runs on its own core – so the cores of both sockets (CPUs) in the compute node are used.

To see how processes are mapped to nodes / cores add --report-bindings to the mpirun command (before the name of your application appears on the line).
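You may also wish to pin the OpenMP threads within each MPI process. A possible addition to the jobscript, using standard OpenMP environment variables (optional – not something the HPC Pool requires):

# Place one OpenMP thread per core and keep threads close to their parent MPI process
export OMP_PLACES=cores
export OMP_PROC_BIND=close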

Example 2: One MPI process per socket (CPU)

The following example shows how to place one MPI process on each socket (CPU) in each compute node. A compute node contains two sockets (CPUs) and each socket provides 16 cores. Hence there will be two MPI processes per compute node and each process will use 16 OpenMP threads for multi-core processing.

#!/bin/bash --login
#SBATCH -p hpcpool        # The "partition" - named hpcpool
#SBATCH -N 4              # (or --nodes=) Minimum is 4, Max is 32. Job uses 32 cores on each node.
#SBATCH -n 8              # (or --ntasks=) TOTAL number of tasks - the MPI processes.
#SBATCH -c 16             # (or --cpus-per-task=) Number of cores per MPI process.
#SBATCH -t 1-0            # Wallclock limit. 1-0 is 1 day. Maximum permitted is 4-0 (4-days).
#SBATCH -A hpc-proj-name  # Use your HPC project code

module purge
module load mpi/intel-oneapi-2024.2.0/openmpi/5.0.7

# SLURM_NTASKS will be set to 8 (-n above)
# SLURM_CPUS_PER_TASK will be set to 16 (-c above)

# Instruct each MPI process to use 16 OpenMP threads
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Run 8 MPI processes in total, one per socket. The --map-by flag describes this distribution.
mpirun -n $SLURM_NTASKS --map-by ppr:1:socket:pe=$OMP_NUM_THREADS app.exe args ...

The above jobscript will run 8 MPI processes (-n 8). The --map-by flag describes how we want the processes distributed across the resources available to our job. In this case ppr:1:socket:pe=$OMP_NUM_THREADS means the processes per resource are: 1 process per socket (where each compute node has two sockets). Each MPI process will use 16 threads. The :pe=$OMP_NUM_THREADS means we want each MPI process to be bound to 16 physical CPU cores. In this example the :pe=$OMP_NUM_THREADS is optional – the default behaviour is correct because we request that each MPI process uses 16 threads and there are 16 cores in each socket.

To see how processes are mapped to nodes / cores add --report-bindings to the mpirun command (before the name of your application appears on the line).

Example 3: Two MPI processes per socket (CPU)

The following example shows how to place two MPI processes on each socket (CPU) in each compute node. A compute node contains two sockets (CPUs) and each socket provides 16 cores. Hence there will be four MPI processes per compute node and each process will use 8 OpenMP threads for multi-core processing.

#!/bin/bash --login
#SBATCH -p hpcpool        # The "partition" - named hpcpool
#SBATCH -N 4              # (or --nodes=) Minimum is 4, Max is 32. Job uses 32 cores on each node.
#SBATCH -n 16             # (or --ntasks=) TOTAL number of tasks - the MPI processes.
#SBATCH -c 8              # (or --cpus-per-task=) Number of cores per MPI process.
#SBATCH -t 1-0            # Wallclock limit. 1-0 is 1 day. Maximum permitted is 4-0 (4-days).
#SBATCH -A hpc-proj-name  # Use your HPC project code

module purge
module load mpi/intel-oneapi-2024.2.0/openmpi/5.0.7

# SLURM_NTASKS will be set to 16 (-n above)
# SLURM_CPUS_PER_TASK will be set to 8 (-c above)

# Instruct each MPI process to use 8 OpenMP threads
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Run 16 MPI processes in total, two per socket. The --map-by flag describes this distribution.
mpirun -n $SLURM_NTASKS --map-by ppr:2:socket:pe=$OMP_NUM_THREADS app.exe args ...

The above jobscript will run 16 MPI processes (-n 16). The --map-by flag describes how we want the processes distributed across the resources available to our job. In this case ppr:2:socket:pe=$OMP_NUM_THREADS means the processes per resource are: 2 processes per socket (where each compute node has two sockets). Each MPI process will use 8 threads. The :pe=$OMP_NUM_THREADS means we want each MPI process to be bound to 8 physical CPU cores.

To see how processes are mapped to nodes / cores add --report-bindings to the mpirun command (before the name of your application appears on the line).
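To compare how long the different MPI/OpenMP configurations take to run (as suggested above), completed jobs can be inspected with the Slurm sacct command. For example (12345 is a placeholder job ID):

sacct -j 12345 -o JobID,JobName,Elapsed,State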
