New AMD Nodes Sept 2024
September 2024: New AMD “Genoa” compute nodes have been installed in the CSF3. Each node provides:
- 168 cores: 2 x 84-core AMD EPYC 9634 “Genoa” CPUs (2.25 GHz clock speed, 384 MB L3 cache per CPU)
- 8 GB RAM per core, 1.5 TB total RAM per node
- 1.7 TB local /tmp disk per node
A total of 61 nodes (10,248 cores) is being deployed in two phases. Phase 1: 37 nodes (6,216 cores) now installed. Phase 2: a further 24 nodes (4,032 cores) now installed.
These AMD nodes were funded through researcher contributions and via a Research Lifecycle Programme business case to replace the oldest, least efficient compute nodes. Hence some rather old Intel nodes will be removed during Phases 1 and 2. See below for details of which nodes have been, and will be, removed from the CSF.
Free-at-point-of-use users are still limited to 32 cores in use at any one time. You may run your jobs on the AMD nodes, but you cannot have more than 32 cores in use (so 32 is your maximum individual job size).
mpi-24-ib.pe users should begin using the new AMD nodes as soon as possible – the mpi-24-ib.pe environment will be completely removed in October 2024! Try your applications on the new nodes now!
Submitting work to the new AMD nodes
Serial batch jobs (1-core)
Serial work is currently limited so that the new nodes can increase the CSF’s parallel job capacity. Please see the Short Batch and Interactive Jobs notes below for how to run serial jobs.
Parallel batch jobs – 2 to 168 cores
A new PE flag should be used in your jobscript:
    #!/bin/bash --login
    #$ -cwd
    #$ -pe amd.pe n      # Where n is the number of cores: 2-168
    #
    # Note: It is amd.pe, NOT the usual smp.pe !!!
The usual smp.pe flag is still valid for jobs wanting to use the high-memory nodes, the GPUs and the remaining Intel CPU nodes (e.g., the 32-core Skylake nodes). But if you want to use the new AMD nodes, use amd.pe instead.
Please continue to ensure that you tell your software how many cores have been requested (e.g., using $NSLOTS):
    #$ -pe amd.pe 84
    ...

    # OpenMP (multicore apps)
    export OMP_NUM_THREADS=$NSLOTS
    myOMPexe args...

    # MPI (only single-node MPI permitted)
    mpirun -n $NSLOTS myMPIexe args...
Notes:
- The maximum wallclock time for parallel jobs is 7 days.
- OpenMP and single-node MPI jobs will both work.
- We are NOT enabling multi-node AMD jobs (i.e., jobs of more than 168 cores spanning 2 or more compute nodes).
Some FAQs
Why no serial jobs?
You can run a small number of serial jobs in the short environment – see below.
For now, the focus is very much on parallel jobs as the large core count on the AMD nodes will benefit parallel applications the most.
Why no multi-node AMD jobs?
If you are currently using the mpi-24-ib.pe parallel environment to run multi-node MPI jobs (e.g., 48, 72, …, 120-core jobs) then you can achieve a higher core count (168 cores) with a single AMD node. Your jobs will run faster on a single AMD node than across multiple mpi-24-ib.pe nodes, which are now rather old hardware!
Note that the mpi-24-ib.pe environment will be retired when all 61 new AMD nodes have been installed. You are strongly encouraged to stop using mpi-24-ib.pe now, and instead run on the new AMD nodes.
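For example, moving a job from mpi-24-ib.pe to the new nodes is usually just a change to the PE line of your jobscript (the 120-core job size below is purely illustrative):

    # Old jobscript line (mpi-24-ib.pe, spread across several old Intel nodes):
    #$ -pe mpi-24-ib.pe 120

    # New jobscript line (fits on a single AMD node, which accepts 2-168 cores):
    #$ -pe amd.pe 120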
Do I need to recompile / reinstall anything?
In general, no, the applications you’ve been running so far on the CSF should all run fine on the new nodes. Please see below for further advice.
What about the High Memory nodes, HPC Pool and GPU nodes?
These will remain in service – carry on using them as before.
Short Batch and Interactive Jobs
short and interactive jobs are useful for testing and developing code on the AMD hardware. The queue times for these jobs may be shorter, but the jobs have a shorter maximum runtime.
- The runtime limit is 1 hour.
- Serial (1-core) jobs are permitted.
- There are only two nodes available for short and interactive work.
- The maximum number of cores any user may use here is 28.
- This resource is for testing purposes, downloads, software installs/compiles and small pre/post-processing runs only.
Short Batch Jobs – serial (1-core)
To run a serial job on the AMD nodes, you must use the short environment. This is because the bulk of the AMD node workload is expected to be multi-core parallel jobs. Remember that short jobs will only run for 1 hour at most.
A serial (1-core) jobscript should be of the form:
    #!/bin/bash --login
    #$ -cwd
    #$ -l short          # The usual "short" flag - max 1-hour runtime permitted
    #$ -l amd            # Use new AMD hardware

    # Note that NO amd.pe is used for a serial (1-core) job.
    mySerialexe args...
Submit the job using qsub myjobscript.
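For example (the jobscript filename is purely illustrative):

    qsub myjobscript        # submit the serial AMD jobscript shown above
    qstat                   # check whether the job is waiting (qw) or running (r)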
Short Batch Jobs – parallel (2 to 28 cores)
A parallel (2-28 cores) jobscript should be of the form:
    #!/bin/bash --login
    #$ -cwd
    #$ -l short
    #$ -pe amd.pe n      # Where n is the number of cores: 2-28

    # Note that NO "-l amd" is needed for parallel jobs
Remember, use amd.pe in your jobscript (instead of the usual smp.pe), where n is the number of cores you require.
Interactive Jobs – serial (1-core)
To start a serial interactive job:

    # Get a command shell on an AMD node
    qrsh -l short -l amd

    # Or run an interactive app directly
    module load ....
    qrsh -l short -l amd -cwd -V myapp.exe
Interactive Jobs – parallel (2 to 28 cores)
The maximum number of cores that can be used by an interactive parallel job is currently 28. If you need more cores, submit a batch job. To start an interactive job on multiple cores:
    # Get a command shell on an AMD node
    qrsh -l short -pe amd.pe n                      # Number of cores n can be 2 -- 28
    #
    # Note - it is amd.pe, not smp.pe.

    # Or run an interactive app directly
    module load ....
    qrsh -l short -pe amd.pe n -cwd -V myapp.exe    # Number of cores n can be 2 -- 28
    #
    # Note - it is amd.pe, not smp.pe.
where n is the number of cores you require. The $NSLOTS environment variable will be set to the number of cores you requested, and can be used in commands that you then run at the prompt on the compute node.
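For example, once the interactive session has started on the compute node you might run the following at the prompt (the executable names are illustrative, mirroring the batch examples above):

    # OpenMP (multicore) app
    export OMP_NUM_THREADS=$NSLOTS
    ./myOMPexe args...

    # or a single-node MPI app
    mpirun -n $NSLOTS ./myMPIexe args...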
Change a job waiting in the queue to use new AMD nodes
If you have a job waiting in the queue which originally requested the smp.pe parallel environment, and you now want to run it on the new AMD nodes instead, we strongly recommend that you delete the job from the queue (using qdel jobid), modify your jobscript to use amd.pe, and then resubmit the job (using qsub).
It is possible to alter the job using the qalter command while it’s waiting in the queue, without modifying the jobscript. But then your jobscript file will still contain the smp.pe setting, and if you submit this jobscript again you’ll have to qalter that job too. It also means that if you refer back to the jobscript in 6 months’ time to see how you generated some results, say, it won’t be obvious that you ran on the AMD hardware.
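As a sketch of the recommended delete-edit-resubmit approach (the job ID and jobscript name are illustrative):

    qdel 123456          # delete the waiting smp.pe job from the queue
    # Now edit myjobscript: change "#$ -pe smp.pe n" to "#$ -pe amd.pe n"
    qsub myjobscript     # resubmit the job to run on the AMD nodes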
Free at the point of use access
The limit for users with free-at-point-of-use accounts remains 32 cores in use at any one time, even on the new AMD nodes.
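For example, a free-at-point-of-use account could still submit a 32-core job to the AMD nodes (a minimal sketch; the executable name is illustrative):

    #!/bin/bash --login
    #$ -cwd
    #$ -pe amd.pe 32        # 32 cores - the maximum for free-at-point-of-use accounts
    export OMP_NUM_THREADS=$NSLOTS
    myOMPexe args...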
Tested software applications
So far, we have NOT recompiled or modified any of the existing applications for use on the AMD nodes. In most cases, the applications will run perfectly well, with good performance, without the need for recompilation.
- Please use the modulefiles you’ve previously used in your jobs, as advertised on the software specific webpages.
- We recommend you try some test jobs on the new AMD nodes before starting large, long-running production runs.
- We will not be recompiling existing applications unless there is a significant issue. We will be looking into optimisation for the new hardware later this year.
It is not possible for us to test every application on the new AMD nodes. The following are believed to work:
- Abaqus 2018, 2023
- ADF 2023.104.nk
- AMBER 16, 20, 22
- Anaconda Python 2023.09
- Fluent 18.1.29.2, 19.3, 19.5
- Gaussian 09, 16
- GROMACS 2024.2
- Jupyter Notebook 6.0.0
- LAMMPS 29.09.21-packs-user
- NAMD 2.13, 2.14, 3.0
- NBO 7.0.8
- Openfoam v6, v2312
- ORCA 5.02, 5.04, 6.0.0, 6.0.0-avx2
- Paraview 5.11.2
- R 3.6.2, 4.4.0
- Singularity/Apptainer 1.3.1
- StarCCM 15.04-double, 18.02-mixed
- Turbomole 7.6-smp, 7.6-mpi
- VASP 6.3.0
- VMD 1.9.3
Compiling your own code
Note that in general, you will not need to recompile or reinstall any applications, python envs, R packages, conda envs and so on. Things will run perfectly well on the new AMD nodes.
The AMD Genoa hardware provides the avx, avx2 and avx512 vector instructions found in the CSF’s Intel CPUs. So applications are expected to perform at least as well on the new nodes. A full discussion of this hardware is outside of the scope of this page, so please see the AMD documentation if you want more in-depth information.
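If you want to confirm which AVX instruction sets a particular node reports, a quick check using standard Linux commands (nothing CSF-specific) is:

    # Run on a login node, or on an AMD node via qrsh, and compare the output
    grep -o 'avx[0-9a-z_]*' /proc/cpuinfo | sort -u     # lists avx, avx2, avx512f, ...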
You may wish to recompile your own code so that it is optimized a little more for the AMD nodes. We will be providing more information about this in the next few months, but for now we have some advice below.
We recommend using the GCC 13.3.0 compiler as this supports the AMD znver4 microarchitecture, which enables the AVX-512 extensions.
AMD provide some recommended compiler flags (PDF) to use with various compilers (GNU compiler collection, Intel OneAPI C/C++ and the AMD AOCC compiler). You will need to use at least an architecture flag to enable the AVX-512 extensions available in the Genoa CPUs:
    # GNU compilers
    -march=znver4                   # Code will only run on AMD Genoa and Intel Skylake (or newer)

    -march=haswell -mtune=znver4    # Code will run on all CSF3 node types, with some further
                                    # tuning for the AVX-512 extensions found in the AMD and
                                    # Intel Skylake nodes where possible.

    # Intel OneAPI compilers
    -mavx2 -axCORE-AVX512,CORE-AVX2,AVX   # Code will run on all CSF3 node types, with AVX-512
                                          # instructions enabled if supported

    # AMD AOCC compilers (not yet installed on the CSF - coming soon)
    -march=znver4                   # Code will only run on AMD Genoa and Intel Skylake (or newer)

    # Note that the above flags can be applied when compiling code on the login nodes.
    # An alternative is to log in to the AMD nodes, using qrsh, and then compile for
    # the "current" node's architecture, using:
    -march=native
The above PDF provides further optimization flags you may wish to use in addition to the above architecture flags.
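As a minimal sketch of how the GNU flags above might be used (the source and binary names are illustrative, and the exact GCC 13.3.0 modulefile name is an assumption – check module avail):

    # module load <gcc-13.3.0-modulefile>    # exact modulefile name: see "module avail"

    # Compiled on a login node: runs on all CSF3 node types, tuned for the AMD nodes where possible
    gcc -O2 -march=haswell -mtune=znver4 -o myapp myapp.c

    # Compiled on an AMD node (reached via qrsh): optimized for that node's architecture
    gcc -O2 -march=native -o myapp myapp.c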
Having an issue with the new nodes?
The start of the academic year is a busy time for the Research Infrastructure team, so in the first instance please check whether your question is answered by the notes above.
If not, please let us know if you run into any issues using the new nodes and provide as much information as possible as per our Help Page.
We appreciate your ongoing patience during this particularly busy period for the team.
Decommissioning old nodes
The oldest, least efficient nodes need to be removed from the CSF. Hence there will be reductions in the availability of existing nodes.
mpi-24-ib.pe
mpi-24-ib.pe will be completely removed from service in October 2024. We encourage anyone currently using it to move to the new AMD nodes as soon as possible. The new AMD nodes can run larger jobs than is possible in mpi-24-ib.pe.
Intel nodes – haswell, broadwell, skylake
By the end of October we anticipate that there will be:
- 21 haswell nodes (24 cores each, 504 cores in total) serving smp.pe and serial
- 32 broadwell nodes (28 cores each, 896 cores in total) serving smp.pe and possibly serial
- 41 skylake nodes (32 cores each, 1,312 cores in total) serving smp.pe
If your application is failing on the AMD nodes, please use the above Intel nodes and report the issue to us with as much detail as possible.
Some of the above may have to be decommissioned at a later date. This will be advertised to all users as appropriate.
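For example, to fall back to the remaining Intel nodes while you report an AMD problem, a jobscript can simply keep using the usual smp.pe (a sketch; the core count and executable name are illustrative):

    #!/bin/bash --login
    #$ -cwd
    #$ -pe smp.pe 16        # runs on the remaining Intel smp.pe nodes (haswell/broadwell/skylake)
    export OMP_NUM_THREADS=$NSLOTS
    myOMPexe args...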
GPUs and High Memory Nodes
No changes are being made to these parts of the cluster. Please continue to use them as normal.