HPC Pool (Slurm) – User Testing

Introduction

As part of the CSF upgrade work, the HPC Pool and associated hardware will be migrated from SGE to Slurm. The adoption of Slurm on CSF3 represents a significant change for CSF users who are accustomed to using the SGE batch system. This page outlines how to access the HPC Pool via the upgraded CSF3 Slurm Cluster.

Who Can Access the HPC Pool via Slurm

As with the HPC Pool in the existing CSF3 SGE cluster, access is not granted by default.
Only HPC projects that ran batch jobs between 1st August 2024 and April 2025 will be migrated to the new environment.

A complete list of project codes that will have access to the HPC Pool after the maintenance can be found at the bottom of this page.

If your project is not listed and you still require access to the HPC Pool, you can request re-enablement via our help form: Requesting Help.

When will the HPC Pool be removed from the SGE Environment?

IMPORTANT: All jobs running on the HPC Pool in the CSF3 SGE cluster will be terminated at 09:00 AM on Wednesday, 23rd April 2025.

After this time, users will no longer be able to access the HPC Pool in the CSF3 SGE cluster. It will take approximately seven days for all hardware to be migrated into the Slurm environment, after which access will be restored and HPC Pool users informed via email.

When Can I Access the HPC Pool in the Upgraded Slurm Environment?

For testing purposes you can access two HPC Pool nodes today!

Two HPC Pool nodes have already been migrated to the Slurm environment. This has enabled Research IT to conduct preliminary testing on several popular software applications to ensure functionality.

Early tests have been successful; however, we strongly encourage existing HPC Pool users to test their codes in the upgraded environment to ensure they work as expected. Any issues should be reported via our help form: Requesting Help.

How Do I Test Jobs in the HPC Pool Using Slurm?

The purpose of testing is to ensure that jobs can load their modulefiles and start their applications successfully. Jobs are not expected to run in their entirety, nor are they allowed to use more than 64 cores (2 nodes).

As such, for testing purposes, the limits are:

  • Jobs may use no more than 64 cores – i.e., 2 compute nodes
  • Jobs will run for no more than 1 hour
  • You cannot have more than 2 jobs in the queue – either running or waiting

These restrictions are in place to give everyone a chance to test their HPC Pool jobs.

Please Note: Once the testing phase is complete and the HPC Pool has been successfully migrated to Slurm, normal limits will be restored, i.e. minimum job size of 128 cores, maximum job size of 1024 cores, and a maximum job runtime of 4 days.
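
For reference, once normal limits return, the #SBATCH header of a minimum-sized job might look like the following sketch. It assumes 32 cores per node, as in the test script further below; check the main HPC Pool documentation for the exact settings.

#!/bin/bash --login
#SBATCH -p hpcpool        # The HPC Pool partition
#SBATCH -N 4              # 4 nodes x 32 cores = 128 cores (the normal minimum job size)
#SBATCH -n 128            # Total number of tasks
#SBATCH -t 4-00:00:00     # Up to the normal maximum runtime of 4 days
#SBATCH -A hpc-proj-name  # Use your HPC project code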

The HPC Pool in Slurm uses the same login node as the rest of the upgraded CSF3 Slurm environment. Login instructions can be found by following this link: Logging In – Upgraded (SLURM) CSF3.

As with all jobs transitioning from SGE to Slurm, job scripts will need to be updated accordingly. Further information can be found here: SGE to Slurm Reference.
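
As a rough illustration of the kind of changes involved (the parallel environment name below is a placeholder, not a real CSF3 PE), common SGE directives and commands map approximately to Slurm as follows; see the reference page above for the authoritative list:

# Approximate SGE-to-Slurm equivalents (PE name is a placeholder)
#   #$ -cwd                ->  not needed (Slurm starts jobs in the submission directory)
#   #$ -pe some-mpi.pe 64  ->  #SBATCH -n 64
#   #$ -P hpc-proj-name    ->  #SBATCH -A hpc-proj-name
#   qsub jobscript         ->  sbatch jobscript
#   qstat                  ->  squeue -u $USER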

The upgraded CSF3 Slurm environment has a new scratch filesystem. The old scratch filesystem is accessible in read-only mode on the login node and a dedicated file transfer node. Users will need to copy their data from the old scratch filesystem to the new one. Instructions on how to perform this transfer can be found here: New Scratch Filesystem – Feb 2025.
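
As an illustration only (the directory paths below are placeholders; use the actual old and new scratch locations given on the linked page), a transfer might look something like this, run on the login node or the file transfer node:

# Placeholder paths - substitute the real old (read-only) and new scratch locations
rsync -av /old-scratch/$USER/myproject/ /scratch/$USER/myproject/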

The HPC Pool has its own dedicated Slurm partition: #SBATCH -p hpcpool. This must be specified in any job script to ensure that jobs run on the HPC Pool nodes. Below is an example of a Slurm job script suitable for testing:

#!/bin/bash --login
#SBATCH -p hpcpool        # The "partition" - named hpcpool
#SBATCH -N 2              # (or --nodes=) Maximum for testing is 2. Job uses 32 cores on each node.
#SBATCH -n 64             # (or --ntasks=) TOTAL number of tasks. Maximum for testing is 64.
#SBATCH -t 01:00:00       # Maximum testing wallclock is 1 hour.  
#SBATCH -A hpc-proj-name  # Use your HPC project code

module purge
module load apps/binapps/some_mpi_app/1.2.3

# Slurm knows how many cores to use for mpirun
mpirun someMPIapp.exe

Submit the job using sbatch jobscript.
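
For example, assuming your script is saved as jobscript, you can submit it and then check its status with:

sbatch jobscript               # prints "Submitted batch job <jobid>" on success
squeue -u $USER -p hpcpool     # list your HPC Pool jobs (running or waiting)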

If you see the following error when submitting an HPC Pool test job:

sbatch: error: QOSMaxSubmitJobPerUserLimit
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

it means you already have two jobs in the queue (either running or waiting). You cannot have more than 2 jobs submitted while testing the HPC Pool.
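
If you need to free up one of your two job slots before a job finishes, you can cancel it with scancel (the job ID below is hypothetical; use the ID shown by squeue):

scancel 123456                 # cancel the job with ID 123456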

HPC Pool projects migrated to the upgraded CSF3 environment

You should use one of these project codes in your jobscript with the -A flag (see the example above).

hpc-am-gypsum
hpc-am-vdwstructs
hpc-ar-m3cfd
hpc-ar-uhi
hpc-as-thermofluids
hpc-cp-memb
hpc-ds-dmg
hpc-ds-owcm
hpc-fs-occp
hpc-jh-futuredams
hpc-jh-wrg
hpc-jk-nmcc
hpc-kl-psmc
hpc-mcs-weld
hpc-ml-acmpr
hpc-nc-mrica
hpc-nk-fortress
hpc-nk-mcdr
hpc-nk-smi2
hpc-nk-surfchemad
hpc-nk-tshz
hpc-pc-goh2o
hpc-pc-npm
hpc-po-enveng
hpc-rb-piezo
hpc-rb-topo
hpc-rc-atp
hpc-support
hpc-sz-msss
hpc-vf-tmdm
hpc-ymi-thermofluids
hpc-zhong-dlth
hpc-zz-aerosol
