Upgraded CSF3 Info Hub

Background – 14th May 2025

We are in the process of upgrading the CSF3 from the SGE batch system to the Slurm batch system.

We are also introducing new login nodes and a new scratch filesystem.

Since 7th April, CPU and GPU hardware has been moving from SGE to Slurm in groups, so the amount of compute resource in SGE has been shrinking while the amount in Slurm has been growing.

The team have done a lot of testing of the new setup, but it is impossible for us to test every piece of software or user scenario in advance. If you have complex pipelines, e.g. in Python, you may need to do some re-installation.
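For example, a Python environment built under the old system may reference paths or modules that no longer exist. A minimal sketch of rebuilding one on the upgraded system (the module name, Python version and paths here are assumptions, not the documented CSF3 setup):

    module load python/3.11                  # hypothetical module name - check 'module avail'
    python -m venv ~/venvs/myproject-slurm   # create a fresh virtual environment
    source ~/venvs/myproject-slurm/bin/activate
    pip install -r requirements.txt          # reinstall your pipeline's dependencies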

We want to thank everyone who has transitioned to the upgraded Slurm system in the last 5 weeks and provided us with questions and feedback.

Please DO check the documentation, and try things out, before logging a support ticket.

How To…

Known Issues and Workarounds

We will update the Known Issues and Workarounds page if problems arise. Please check that page before submitting a support ticket.

FAQ

I’m a brand new user and I’m not sure what to do

New users will initially access the “original” CSF3 running the SGE batch system. We recommend that you use the Getting Started section of our website, and in particular work through the batch tutorial as this will teach you the basics of using a system like the CSF3, and the SGE batch system in particular.
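As a flavour of what the tutorial covers, a minimal SGE jobscript looks something like the sketch below; the parallel environment name and core count are assumptions, so follow the tutorial for the exact options to use:

    #!/bin/bash --login
    #$ -cwd              # run from the directory the job was submitted from
    #$ -pe smp.pe 4      # example parallel environment and core count (assumed)
    ./my_program         # replace with your own executable

    # Submit with:       qsub jobscript.sh
    # Check status with: qstat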

Once you have gained some confidence, try the upgraded system.
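The Slurm equivalent of a simple jobscript is sketched below; the partition name and resource requests are assumptions, so check the partition information page for the flags that apply to your work:

    #!/bin/bash --login
    #SBATCH -p multicore     # partition name is an assumption - see the partition docs
    #SBATCH -n 4             # number of cores
    #SBATCH -t 0-2:00        # wallclock limit (2 hours)
    ./my_program             # replace with your own executable

    # Submit with:       sbatch jobscript.sh
    # Check status with: squeue -u $USER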

Is the SGE version of the CSF still running / available?

Yes! But not after 08:30 on Wed 21st May 2025, when:

  • All jobs that are running or queued in SGE will be deleted.
  • No further job submissions to SGE will be possible.
  • All compute nodes, including GPUs, remaining in SGE will be moved to Slurm.
    • These resources are scheduled to come back on-line in Slurm during the week commencing 26th May.
  • The only compute node type not currently available in Slurm is the very high memory, 128GB per core (mem4000) node.
    • There is only one node of this type, and access to it will continue to be restricted.
    • If you have access under SGE, you will have access again once it is ready in Slurm.
You should try your work on Slurm sooner rather than later.
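Before the cut-off it is worth checking what you still have queued or running under SGE, so nothing is lost when those jobs are deleted and you can resubmit the work in Slurm. A minimal sketch:

    # Jobs you still have running or queued under SGE (these will be deleted at the cut-off):
    qstat -u $USER

    # After resubmitting the work in Slurm, confirm it is queued there:
    squeue -u $USER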

How much resource is in SGE and how much in Slurm?

You can find out what is available in Slurm by consulting the partition information page (this also shows the jobscript flags you can use).
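You can also query the partitions directly on the login nodes with the standard Slurm commands; the partition name used below is just an illustrative assumption:

    # Summary of all partitions and node states:
    sinfo

    # More detail on one partition (name is an assumption - use one from the partition page):
    sinfo -p multicore --long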

You can also see a summary of SGE and Slurm resources on the current configuration page (this contains no information about how to submit work to either system – it’s just the system stats).

Running ‘squeue’ shows a very long Start Time – will it really take that long for my job to start?

The estimated start time shown by the ‘squeue’ command should not be considered accurate. It is based on the Wallclock time requested by all of the jobs already queued or running, for all users. Most jobs do not need their full requested Wallclock time – they finish successfully before the limit – and some jobs stop or crash prematurely because of mistakes in the jobscript, code or input data. For these reasons your job might start much sooner than the estimated time shown.
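You can ask Slurm for its current estimate directly, but treat the times as pessimistic upper bounds for the reasons above:

    # Show estimated start times for your pending jobs:
    squeue -u $USER --start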

We encourage all users to submit jobs with reasonably accurate Wallclock times. If everybody submits jobs using the maximum permitted 7-day wallclock time, Slurm will not be able to calculate accurate start times for future jobs because it must assume that all jobs already in the system will need the full 7 days.
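A realistic limit is set with the --time option, either in the jobscript or at submission time; the 12-hour value below is just an example:

    # In the jobscript (format is days-hours:minutes:seconds):
    #SBATCH --time=0-12:00:00

    # Or on the command line, overriding the value in the jobscript:
    sbatch --time=0-12:00:00 jobscript.sh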
