New AMD Nodes Sept 2024

September 2024: New AMD “Genoa” compute nodes have been installed in the CSF3. Each node provides:

  • 168 cores per node: 2 x 84-core AMD EPYC 9634 “Genoa” CPUs (2.25GHz clock speed, 384 MB L3 cache per CPU)
  • 8GB RAM per core, 1.5TB total RAM on node
  • 1.7TB local /tmp disk on node

A total of 61 nodes (10,248 cores) have been deployed in two phases. Phase 1 (37 nodes, 6,216 cores) and Phase 2 (24 additional nodes, 4,032 cores) are both now installed.

These AMD nodes were funded through researcher contributions and via a Research Lifecycle Programme business case to replace the oldest, least efficient compute nodes. Hence some rather old Intel nodes will be removed during Phases 1 and 2. See below for details of which nodes have been, and will be, removed from the CSF.

You DO NOT need to request permission from Research IT to use the new nodes – everyone has access to them – simply read below to see what flags to use in your jobscripts.

Free-at-point-of-use users are still currently limited to 32 cores in use at any one time. You may run your jobs on the AMD nodes, but you cannot have more than 32 cores in use (so 32 cores is your maximum individual job size).

mpi-24-ib.pe users should begin using the new AMD nodes as soon as possible – the mpi-24-ib.pe environment will be completely removed in October 2024!! Try your applications on the new nodes now!

Submitting work to the new AMD nodes

Serial batch jobs (1-core)

Serial work is currently limited so that the new nodes can increase the CSF’s parallel job capacity. Please see the short interactive jobs notes below for serial jobs.

Parallel batch jobs – 2 to 168 cores

A new PE flag should be used in your jobscript:

#!/bin/bash --login
#$ -cwd
#$ -pe amd.pe n             # Where n is the number of cores: 2-168
        #
        # Note: It is amd.pe, NOT the usual smp.pe !!!

The usual smp.pe flag is still valid for jobs wanting to use the high-memory nodes, the GPUs and the remaining Intel CPU nodes (e.g., the 32-core Skylake nodes). But if you want to use the new AMD nodes, use amd.pe instead.

Please continue to ensure that you tell your software how many cores have been requested (e.g., using $NSLOTS):

#$ -pe amd.pe 84
...
# OpenMP (multicore apps)
export OMP_NUM_THREADS=$NSLOTS
myOMPexe args...

# MPI (only single-node MPI permitted)
mpirun -n $NSLOTS myMPIexe args...

Notes:

  • The maximum wallclock time for parallel jobs is 7 days.
  • OpenMP and single-node MPI jobs will both work.
  • We are NOT enabling multi-node AMD jobs (i.e. jobs with more than 168 cores using 2 or more compute nodes.)

Some FAQs

Why no serial jobs?

You can run a small number of serial jobs in the short environment – see below.

For now, the focus is very much on parallel jobs as the large core count on the AMD nodes will benefit parallel applications the most.

Why no multi-node AMD jobs?

If you are currently using the mpi-24-ib.pe parallel environment to run multi-node MPI jobs (e.g., 48, 72, …, 120-core jobs) then you can achieve a higher core count (168 cores) with a single AMD node. Your jobs will run faster on a single AMD node than across multiple mpi-24-ib.pe nodes, which are now rather old hardware!

Note that the mpi-24-ib.pe environment will be retired when all 61 new AMD nodes have been installed. You are strongly encouraged to stop using mpi-24-ib.pe now, and instead run on the new AMD nodes.
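For example, a sketch of a minimal change to an existing mpi-24-ib.pe jobscript is shown below (the 120-core request is just an example, and the application name is the placeholder used elsewhere on this page):

# Old jobscript line (120 cores spread across five 24-core nodes):
#$ -pe mpi-24-ib.pe 120

# New jobscript line (120 cores on a single AMD node - up to 168 are available):
#$ -pe amd.pe 120

mpirun -n $NSLOTS myMPIexe args...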

Do I need to recompile / reinstall anything?

In general, no, the applications you’ve been running so far on the CSF should all run fine on the new nodes. Please see below for further advice.

What about the High Memory nodes, HPC Pool and GPU nodes?

These will remain in-service – carry on using them as before.

Short Batch and Interactive Jobs

short and interactive jobs are useful for testing and developing code on the AMD hardware. The queue times for these jobs may be shorter, but the jobs have a shorter maximum runtime.

  • The runtime limit is 1 hour.
  • Serial (1-core) jobs are permitted.
  • There are only two nodes available for short and interactive work.
  • The maximum number of cores any user may use here is 28.
  • This resource is for testing purposes, downloads, software installs/compiles and small pre/post-processing runs only.

Short Batch Jobs – serial (1-core)

To run a serial job on the AMD nodes, you must use the short environment. This is because the bulk of the AMD node workload is expected to be multi-core parallel jobs. Remember that short jobs will only run for 1-hour at most.

A serial (1-core) jobscript should be of the form:

#!/bin/bash --login
#$ -cwd
#$ -l short           # The usual "short flag" - max 1-hour runtime permitted
#$ -l amd             # Use new AMD hardware
                      # Note that NO amd.pe is used for a serial (1-core) job.

mySerialexe args...

Submit the job using qsub myjobscript.

Short Batch Jobs – parallel (2 to 28 cores)

A parallel (2-28 cores) jobscript should be of the form:

#!/bin/bash --login
#$ -cwd
#$ -l short
#$ -pe amd.pe n             # Where n is the number of cores: 2-28
                            # Note that NO "-l amd" is needed for parallel jobs

export OMP_NUM_THREADS=$NSLOTS      # For an OpenMP (multicore) app
myOMPexe args...                    # (or: mpirun -n $NSLOTS myMPIexe args...)

Remember, use amd.pe in your jobscript (instead of the usual smp.pe) where n is the number of cores you require.

Interactive Jobs – serial (1-core)

To start a serial interactive job

# Get a command shell on an AMD node
qrsh -l short -l amd

# Or run an interactive app directly
module load ....
qrsh -l short -l amd -cwd -V myapp.exe

Interactive Jobs – parallel (2 to 28 cores)

The maximum number of cores that can be used by an interactive parallel job is currently 28. If you need more cores, submit a batch job. To start an interactive job on multiple cores:

# Get a command shell on an AMD node
qrsh -l short -pe amd.pe n           # Number of cores n can be 2 -- 28
                   #
                   # Note - it is amd.pe, not smp.pe.

# Or run an interactive app directly
module load ....
qrsh -l short -pe amd.pe n -cwd -V myapp.exe        # Number of cores n can be 2 -- 28
                   #
                   # Note - it is amd.pe, not smp.pe.

where n is the number of cores you require. The $NSLOTS environment variable will be set to the number of cores you requested, and can be used in commands that you then run at the prompt on the compute node.
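For example, once the interactive session starts on the AMD compute node, you might run something like the following at the prompt (placeholder application names as used elsewhere on this page):

# For OpenMP (multicore) apps
export OMP_NUM_THREADS=$NSLOTS
myOMPexe args...

# Or, for a single-node MPI app
mpirun -n $NSLOTS myMPIexe args...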

Change a job waiting in the queue to use new AMD nodes

If you have a job waiting in the queue, which originally requested the smp.pe parallel environment, and you now want to run it using the new AMD nodes instead, we strongly recommend that you delete the job from the queue (using qdel jobid), modify your jobscript to use amd.pe, and then resubmit the job (using qsub).

It is possible to alter the job using the qalter command while it’s waiting in the queue, without modifying the jobscript. But your jobscript file will then still contain the smp.pe setting. If you submit this jobscript again, you’ll have to qalter that job too. It also means that if you refer back to the jobscript in 6 months’ time to see how you generated some results, say, it won’t be obvious that you ran on the AMD hardware.
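If you do decide to alter a pending job rather than resubmit it, a sketch of the two approaches is shown below (the job ID 123456 and the 84-core request are examples only; see man qalter for the full syntax):

# Recommended: delete the pending job, edit the jobscript to use amd.pe, resubmit
qdel 123456
qsub myjobscript

# Alternative: change the parallel environment of the pending job in place
qalter -pe amd.pe 84 123456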

Free at the point of use access

The limit for users with free-at-point-of-use accounts remains 32 cores in use at any one time, even on the new AMD nodes.

Tested software applications

So far, we have NOT recompiled or modified any of the existing applications for use on the AMD nodes. In most cases, the applications will run perfectly well, with good performance, without the need for recompilation.

  • Please use the modulefiles you’ve previously used in your jobs, as advertised on the software specific webpages.
  • We recommend you try some test jobs on the new AMD nodes before starting large, long-running production runs.
  • We will not be recompiling existing applications unless there is a significant issue. We will be looking into optimisation for the new hardware later this year.

It is not possible for us to test every application on the new AMD nodes. The following are believed to work:

Abaqus                           2018, 2023
ADF                              2023.104.nk
AMBER                            16, 20, 22
Anaconda Python                  2023.09
Fluent                           18.1.29.2, 19.3, 19.5
Gaussian                         09, 16
GROMACS                          2024.2
Jupyter Notebook                 6.0.0
LAMMPS                           29.09.21-packs-user
NAMD                             2.13, 2.14, 3.0
NBO                              7.0.8
Openfoam                         v6, v2312
ORCA                             5.02, 5.04, 6.0.0, 6.0.0-avx2
Paraview                         5.11.2
R                                3.6.2, 4.4.0
Singularity/Apptainer            1.3.1
StarCCM                          15.04-double, 18.02-mixed
Turbomole                        7.6-smp, 7.6-mpi
VASP                             6.3.0
VMD                              1.9.3

Compiling your own code

Note that in general, you will not need to recompile or reinstall any applications, python envs, R packages, conda envs and so on. Things will run perfectly well on the new AMD nodes.

The AMD Genoa hardware provides the AVX, AVX2 and AVX-512 vector instructions found in the CSF’s Intel CPUs, so applications are expected to perform at least as well on the new nodes. A full discussion of this hardware is outside the scope of this page, so please see the AMD documentation if you want more in-depth information.

You may wish to compile code, to be optimized a little more for the AMD nodes. We will be providing more information about this in the next few months, but for now, we have some advice below.

We recommend using the GCC 13.3.0 compiler as this supports the AMD znver4 microarchitecture, which enables the AVX-512 extensions.

AMD provide some recommended compiler flags (PDF) to use with various compilers (the GNU Compiler Collection, Intel OneAPI C/C++ and the AMD AOCC compiler). You will need to use at least an architecture flag to enable the AVX-512 extensions available in the Genoa CPUs:

# Gnu compilers
-march=znver4                           # Code will only run on AMD Genoa and Intel Skylake (or newer)
-march=haswell -mtune=znver4            # Code will run on all CSF3 node types, with some further
                                        # tuning for the AVX-512 extensions found in the AMD and
                                        # Intel Skylake nodes where possible. 

# Intel OneAPI compilers
-mavx2 -axCORE-AVX512,CORE-AVX2,AVX     # Code will run on all CSF3 node types, with AVX-512
                                        # instructions enabled if supported

# AMD AOCC compilers (not yet installed on the CSF - coming soon)
-march=znver4                           # Code will only run on AMD Genoa and Intel Skylake (or newer)

# Note that the above flags can be applied when compiling code on the login nodes.
# An alternative is to login to the AMD nodes, using qrsh, and then compile for
# the "current" node's architecture, using:
-march=native

The above PDF provides further optimization flags you may wish to use in addition to the above architecture flags.
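For example, a build on the login node using the recommended GCC version might look like the sketch below (the gcc modulefile name is an assumption here – check module avail gcc for the exact name on the CSF):

# Load a GCC version that understands the znver4 microarchitecture (modulefile name assumed)
module load gcc/13.3.0

# Portable build: runs on all CSF3 node types, with extra tuning for the AMD Genoa nodes
gcc -O2 -march=haswell -mtune=znver4 -o myexe mycode.c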

Having an issue with the new nodes?

The start of the academic year is a busy time for the Research Infrastructure team, so in the first instance, please check whether your question is answered by the notes above.

If not, please let us know if you run into any issues using the new nodes and provide as much information as possible as per our Help Page.

We appreciate your ongoing patience during this particularly busy period for the team.

Decommissioning old nodes

The oldest, least efficient nodes need to be removed from the CSF. Hence there will be reductions in the availability of existing nodes.

mpi-24-ib.pe

mpi-24-ib.pe will be completely removed from service in October 2024. We encourage anyone currently using it to move to the new AMD nodes as soon as possible. The new AMD nodes can run larger jobs than is possible in mpi-24-ib.pe.

Intel nodes – haswell, broadwell, skylake

By the end of October we anticipate that there will be:

  • 21 haswell nodes (24 cores each, 504 cores in total) serving smp.pe and serial
  • 32 broadwell nodes (28 cores each, 896 cores in total) serving smp.pe and possibly serial
  • 41 skylake nodes (32 cores each, 1,312 cores in total) serving smp.pe

If your application is failing on the AMD nodes, please use the above Intel nodes and report the issue to us with as much detail as possible.

Some of the above may have to be decommissioned at a later date. This will be advertised to all users as appropriate.

GPUs and High Memory Nodes

No changes are being made to these parts of the cluster. Please continue to use them as normal.
