HPC Pool (Slurm) – User Testing
Introduction
As part of the CSF upgrade work, the HPC Pool and associated hardware will be migrated from SGE to Slurm. The adoption of Slurm on CSF3 represents a significant change for CSF users who are accustomed to using the SGE batch system. This page outlines how to access the HPC Pool via the upgraded CSF3 Slurm Cluster.
Who Can Access the HPC Pool via Slurm
As with the HPC Pool in the existing CSF3 SGE cluster, access is not granted by default.
Only HPC projects that ran batch jobs between 1st August 2024 and April 2025 will be migrated to the new environment.
A complete list of project codes that will have access to the HPC Pool after the maintenance can be found below:
If your project is not listed and you still require access to the HPC Pool, you can request re-enablement via our help form: Requesting Help.
When will the HPC Pool be removed from the SGE Environment?
IMPORTANT: All jobs running on the HPC Pool in the CSF3 SGE cluster will be terminated at 09:00 AM on Wednesday, 23rd April 2025.
After this time, users will no longer be able to access the HPC Pool in the CSF3 SGE cluster. It will take approximately seven days for all hardware to be migrated into the Slurm environment, after which access will be restored and HPC Pool users informed via email.
When Can I Access the HPC Pool in the Upgraded Slurm Environment?
For testing purposes you can access two HPC Pool nodes today!
Two HPC Pool nodes have already been migrated to the Slurm environment. This has enabled Research IT to conduct preliminary testing on several popular software applications to ensure functionality.
Early tests have been successful; however, we strongly encourage existing HPC Pool users to test their codes in the upgraded environment to ensure they work as expected. Any issues should be reported via our help form: Requesting Help.
How Do I Test Jobs in the HPC Pool Using Slurm?
The purpose of testing is to ensure that your jobs can load their modulefiles and start your applications successfully. Jobs are not expected to run in their entirety, nor are they allowed to access more than 64 cores (2 nodes).
As such, for testing purposes, the limits are:
- Jobs can use no more than 64 cores – i.e., a maximum of 2 compute nodes
- Jobs will run for no more than 1 hour
- You cannot have more than 2 jobs in the queue – either running or waiting
These restrictions are in place to give everyone a chance to test their HPC Pool jobs.
Please Note: Once the testing phase is complete and the HPC Pool has been successfully migrated to Slurm, normal limits will be restored, i.e. minimum job size of 128 cores, maximum job size of 1024 cores, and a maximum job runtime of 4 days.
The HPC Pool in Slurm uses the same login node as the rest of the upgraded CSF3 Slurm environment. Instructions for logging in can be found by following this link: Logging In – Upgraded (SLURM) CSF3.
As with all jobs transitioning from SGE to Slurm, job scripts will need to be updated accordingly. Further information can be found here: SGE to Slurm Reference.
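If it helps while updating your scripts, the sketch below shows how some common SGE directives and commands map approximately onto their Slurm equivalents. The SGE to Slurm Reference linked above is the authoritative guide; the parallel environment name shown here is only a placeholder.

# SGE (old CSF3)                   # Slurm (upgraded CSF3) - approximate equivalents
#$ -cwd                            # not required - Slurm starts jobs in the submission directory
#$ -pe some.pe 64                  #SBATCH -n 64           (or --ntasks=64)
#$ -l h_rt=01:00:00                #SBATCH -t 01:00:00     (or --time=01:00:00)
#$ -P hpc-proj-name                #SBATCH -A hpc-proj-name
qsub jobscript                     sbatch jobscript
qstat                              squeue -u $USER
qdel <jobid>                       scancel <jobid>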
The upgraded CSF3 Slurm has a new scratch filesystem. The old scratch system is accessible in read-only mode on the login node and a dedicated file transfer node. Users will need to copy their data from the old scratch file system to the new one. Instructions on how to perform this transfer can be found here: New Scratch Filesystem – Feb 2025
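As a minimal sketch only (the paths below are placeholders; the linked instructions give the actual mount points), copying a directory from the old read-only scratch area to the new scratch filesystem with rsync might look like this when run on the login node or the file transfer node:

# Placeholder paths - replace with the real old-scratch location and your new scratch directory
rsync -av --progress /old-scratch/$USER/my_project/  /new-scratch/$USER/my_project/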
The HPC Pool has its own dedicated Slurm partition. The directive #SBATCH -p hpcpool must be specified in any job script to ensure that jobs run on the HPC Pool nodes. Below is an example of a Slurm job script suitable for testing:
#!/bin/bash --login
#SBATCH -p hpcpool        # The "partition" - named hpcpool
#SBATCH -N 2              # (or --nodes=) Maximum for testing is 2. Job uses 32 cores on each node.
#SBATCH -n 64             # (or --ntasks=) TOTAL number of tasks. Maximum for testing is 64.
#SBATCH -t 01:00:00       # Maximum testing wallclock is 1 hour.
#SBATCH -A hpc-proj-name  # Use your HPC project code

module purge
module load apps/binapps/some_mpi_app/1.2.3

# Slurm knows how many cores to use for mpirun
mpirun someMPIapp.exe
Submit the job using sbatch jobscript.
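A successful submission prints the new job ID, and you can then check the job's state with squeue. For example (the job ID below is illustrative):

sbatch jobscript
# Submitted batch job 123456
squeue -u $USER          # lists your running (R) and pending (PD) jobs
cat slurm-123456.out     # default Slurm output file, written in the submission directory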
If you see the following error when submitting an HPC Pool test job:
sbatch: error: QOSMaxSubmitJobPerUserLimit
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
it means you already have two jobs in the queue (either running or waiting). You cannot have more than 2 jobs while testing the HPC Pool.
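To see which of your jobs are occupying the two test slots, and to free one up if needed, use squeue and scancel (the job ID below is illustrative):

squeue -u $USER     # shows your running and waiting jobs with their job IDs
scancel 123456      # cancels a job you no longer need, freeing a slot for another test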
HPC Pool projects migrated to the upgraded CSF3 environment
You should use one of these codes in your jobscript with the -A flag (see above).
hpc-am-gypsum hpc-am-vdwstructs hpc-ar-m3cfd hpc-ar-uhi hpc-as-thermofluids hpc-cp-memb hpc-ds-dmg hpc-ds-owcm hpc-fs-occp hpc-jh-futuredams hpc-jh-wrg hpc-jk-nmcc hpc-kl-psmc hpc-mcs-weld hpc-ml-acmpr hpc-nc-mrica hpc-nk-fortress hpc-nk-mcdr hpc-nk-smi2 hpc-nk-surfchemad hpc-nk-tshz hpc-pc-goh2o hpc-pc-npm hpc-po-enveng hpc-rb-piezo hpc-rb-topo hpc-rc-atp hpc-support hpc-sz-msss hpc-vf-tmdm hpc-ymi-thermofluids hpc-zhong-dlth hpc-zz-aerosol