Moving from CSF2 to CSF3

A number of changes have been made on CSF3 that will affect CSF2 users (more information below). When you are given access to CSF3 you will still have CSF2 access for a while, so you can finish off work on CSF2 while checking whether you need to make any changes to your jobscripts, pipelines, or workflow for use on CSF3, and testing those changes.

Please DO test your work on the CSF3 early, before the CSF2 is shut off!

Please note: the scratch filesystem automatic clean-up policy is active on CSF3. If you have old scratch files they may be deleted. Please read the Scratch Cleanup Policy for more information.

Information correct as of 31st July 2019

How to log in and login nodes

Please connect to csf3.itservices.manchester.ac.uk using the same tool you have previously used for CSF2 connections, e.g. MobaXterm (on Windows). From a command line in MobaXterm, or a terminal on Linux or Mac, this would be:

ssh username@csf3.itservices.manchester.ac.uk
       #
       # Use your University IT username (and password when asked)

It is safe to accept a new key if asked.

Once you have logged in you will see the same home folder (and any other additional research data storage areas you have access to) as on CSF2. However, the scratch area is not the same as on CSF2.

services/gridscheduler modulefile error message

When logging in to CSF3, you might see the message:

services(41):ERROR:105: Unable to locate a modulefile for 'services/gridscheduler'

This is due to a setting in your home directory that only applies to CSF2. It can be ignored; it isn’t preventing you from doing anything on CSF3, but it is annoying. You can fix this by editing the ~/.modules file (if you have one) as follows:

gedit ~/.modules

# Carefully add the hostname test shown below to the start of the
# 'module load services/gridscheduler' line so that it reads:
hostname -f | grep -qc csf.com && module load services/gridscheduler

Now when you log in to CSF3 the services/gridscheduler modulefile will NOT be loaded. When you log in to CSF2 it will be loaded.
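
If you would like to check what the hostname test matches before relying on it, you can run it by hand on each system (a quick sketch; the echo message is purely illustrative):

# Run on each system to see what the grep test matches against:
hostname -f

# The && means the module load only runs when the grep finds a match (i.e. on CSF2):
hostname -f | grep -qc csf.com && echo "This looks like CSF2"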

Overview of Changes

The main differences that CSF2 users must consider are:

  1. Some software applications will not initially be available, particularly old versions of applications. We may have upgraded an app to a later version and so it will have a different modulefile name. [more…]
  2. New hardware is available, offering more cores and more memory. Hence you should check what is available – it might allow you to run jobs that you previously could not! [more…]
  3. Some batch system parallel environments (used in jobscripts) may no longer be available. In particular the special PEs used to run applications like Abaqus and StarCCM are no longer needed. AMD compute nodes (and their PEs) are being retired and will not move to CSF3. [more…]
  4. We now recommend loading modulefiles in the jobscript rather than on the login node before submitting a job. Hence we have updated our documentation to reflect this. [more…]
  5. We will NOT be copying your scratch files from CSF2 to CSF3. You should use this as an opportunity to tidy up (or leave behind entirely) your CSF2 scratch files and copy only what you need to the CSF3 scratch storage. [more…]

These changes mean that you may have to adjust some of the settings in your jobscripts. Further information is given about the changes below.

Software

Module names for some applications may be different from those on CSF2. To check for a specific app, do:

module search appname

For example:

module search gromacs

We are in the process of installing software on CSF3, but you may find that not all the applications you use on CSF2 are ready yet. If this is the case, please let us know and we will consider them for our install list. We have a lot of installs to organise; please be patient while we work through them.

If you are a CSF2 user looking for a bioinformatics application, please see below for how to check whether it has been made available as part of the DPSF migration (the DPSF was a system primarily used for bioinformatics work and was moved to CSF3 before CSF2 users were moved). For example, to locate bowtie:

module search bowtie

Many applications or different versions (mostly for bioinformatics) are available on CSF3 by first doing:

module load apps/dpsf                     # Same as using apps/bioinf

Now using module avail and module search keyword will show many more bioinformatics applications.
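
For example, a typical check on a CSF3 login node might look like the following (bowtie is just an example application; the exact modulefiles you see will differ):

module load apps/dpsf            # make the DPSF / bioinformatics modulefiles visible
module search bowtie             # should now find the bowtie modulefile(s)
module avail                     # lists everything, including the extra bioinformatics apps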

Intel Compiler and MPI Library

In particular, the Intel compiler and OpenMPI library are now only available in the following versions:

# Intel compiler modulefile
compilers/intel/17.0.7

# OpenMPI library (Intel and GCC versions). NOTE: there is NO separate -ib modulefile.
mpi/intel-17.0/openmpi/3.1.3
mpi/gcc/openmpi/3.1.3

We have recompiled applications, where needed, to use these new versions. If you have your own source code that you compile on CSF2, you should recompile it on CSF3 with the above versions. If running large parallel jobs across multiple compute nodes you do not need to load a -ib version of the application’s modulefile. The fast InfiniBand network will be used by default for multi-node parallel jobs.
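
For example, if you compile your own MPI code, a minimal recompile on a CSF3 login node might look like this (the source and executable names are placeholders, and you should check whether the MPI modulefile already loads the compiler for you):

module load compilers/intel/17.0.7
module load mpi/intel-17.0/openmpi/3.1.3
mpicc -O2 -o myapp myapp.c       # mpicc wraps the underlying compiler and MPI library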

Further information about some of the available applications can be found in the Software section of the website.

Compute nodes

We currently have 9120 Intel cores (some of which are for High Memory work, and some are assigned to GPU nodes). The full breakdown of node types can be found on the Current System Configuration webpage.

The AMD compute nodes (64-core Bulldozer and 32-core Magny-Cours) are being retired and will not be made available in CSF3. Instead the Intel compute nodes should be used. In particular, members of the School of MACE who mostly use the AMD nodes for Abaqus and StarCCM jobs can run parallel jobs (single-node and larger multi-node) on the Intel nodes; these nodes will usually run jobs faster even though you may be using fewer cores.

We recommend that if you have access to the CSF3 you start to use it as soon as possible.

Batch system

You do not need to load a modulefile to access the batch system commands (e.g. qsub, qstat) on the login node. They are available by default.
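
For example, immediately after logging in you can do the following (the jobscript name is a placeholder):

qstat                            # list your queued and running jobs
qsub my_jobscript                # submit the job described in the file my_jobscript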

Some CSF2 users may have previously added module load services/gridscheduler to one of the files:

.modules
.bashrc
.bash_profile

and on logging in to CSF3 may see this error/warning:

services(41):ERROR:105: Unable to locate a modulefile for 'services/gridscheduler'

You can ignore this message on CSF3, but we recommend you do not change your files while you are still using CSF2 – you will still need to load the modulefile there.

Intel CPU jobs

You may run serial work or parallel work. As usual, for single-node multicore jobs please submit to the parallel environment:

smp.pe              # This PE now accepts jobs with up to 32 cores (previous max 24 on CSF2)

For example:

#!/bin/bash --login
#$ -cwd
#$ -pe smp.pe 8     # EG: 8 cores. Can be 2--32 (a new upper limit!)
...

The multi-node parallel environment is also available. On CSF2 it was called orte-24-ib.pe. On CSF3 it has been renamed to mpi-24-ib.pe, but use of the old name will still work (hence you do not need to edit every jobscript you have!).

As on CSF2, jobs must request multiples of 24 cores (i.e., whole compute nodes) and have a minimum job size of 48 cores (i.e., two compute nodes).

mpi-24-ib.pe        # Jobs will still work if you use orte-24-ib.pe

For example:

#!/bin/bash --login
#$ -cwd
#$ -pe mpi-24-ib.pe 48     # EG: 48 cores. Can be 48--120 in multiples of 24
...

The time limit for all CPU jobs remains at 7 days.
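
For completeness, serial (single-core) jobs need no PE line at all. A minimal serial jobscript might look like this (the modulefile and executable names are placeholders for your own application):

#!/bin/bash --login
#$ -cwd
module load apps/gcc/myfaveapp/1.2.3      # hypothetical modulefile
./myfaveappexe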

AMD CPU jobs

The AMD nodes are being retired and will not be moved to CSF3. You should submit jobs to the (Intel) parallel environments:

# The following single-node multicore PEs for AMD nodes are NOT available.
# Use smp.pe (max 32 cores) instead

smp-32mc.pe
smp-64bd.pe
hp-mpi-smp-64bd.pe
hp-mpi-smp.pe
fluent-smp.pe         # Fluent runs in the ordinary smp.pe on CSF3

# These multi-node PEs for AMD nodes are NOT available.
# Use mpi-24-ib.pe (min 48 cores, max 120, jobs must be a multiple of 24) instead

orte-32-ib.pe
orte-64bd-ib.pe
hp-mpi-32-ib.pe
hp-mpi-64bd-ib.pe

While the Intel nodes have fewer cores per node than the CSF2 AMD nodes, we find that they run jobs much faster. The AMD hardware is almost 7 years old; the Intel nodes are much newer and have features that allow them to outperform the old AMD nodes.

Please check the documentation specific to the software you use for updated batch submission scripts and advice on which PE to use.
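
As a concrete illustration, a multi-node jobscript that used one of the AMD PEs on CSF2 might be changed as follows (the core counts are examples only):

# Old CSF2 AMD request:
#$ -pe orte-64bd-ib.pe 128

# Equivalent CSF3 request (multiples of 24 cores, minimum 48, maximum 120):
#$ -pe mpi-24-ib.pe 120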

If you have previously run large jobs on the CSF2 AMD nodes you may wish to consider applying to use The HPC Pool.

High Memory Nodes

If you need more than 4 to 6GB of memory per core for your job you can request the high memory nodes using ONE of the following options in your jobscript:

#$ -l mem256      # for 16GB per core (16-core nodes)

#$ -l mem512      # for 32GB per core (16-core nodes)

#$ -l mem1024     # for 51GB per core (one 20-core node available, Restricted Access)

High memory parallel jobs should run in the smp.pe parallel environment.
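
For example, a small high-memory parallel job could be requested as follows, assuming the per-core figures above scale with the number of cores requested (the modulefile and executable names are placeholders):

#!/bin/bash --login
#$ -cwd
#$ -l mem512                              # 32GB per core (16-core high memory nodes)
#$ -pe smp.pe 4                           # EG: 4 cores, approximately 128GB in total
module load apps/gcc/myfaveapp/1.2.3      # hypothetical modulefile
./myfaveappexe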

Nvidia GPU jobs

The older Nvidia K20 GPU nodes have NOT been moved to the CSF3. Instead we have much newer Nvidia V100 nodes. Access to these is not automatic. Please see the dedicated GPU jobs page.

Interactive work via qrsh

If you need to run something interactively, please do this using the qrsh facility. The command:

qrsh -l short
    # Note that there is no -l inter required anymore

will log you into a compute node and give you a command line. For more detailed examples, including how to launch your application via qrsh, please see the detailed CSF qrsh documentation.
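
A short interactive session might therefore look like this (the modulefile and executable names are placeholders):

qrsh -l short                             # run on the login node; gives you a shell on a compute node
module load apps/gcc/myfaveapp/1.2.3      # hypothetical modulefile, loaded on the compute node
./myfaveappexe                            # run your application interactively
exit                                      # return to the login node when finished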

Modulefiles in Jobs

We now recommend that you set up your software in the job submission script rather than on the login node prior to job submission. This is so that the jobscript provides a complete record of how you ran the job. You may, of course, still load modulefiles on the login node to check versions, spelling of module names, etc, but you should ensure your job does NOT inherit these settings:

To do this, set the first line of a jobscript to be:

#!/bin/bash --login

Do not use the following (otherwise you inherit settings from the login node)

#$ -V           # remove this from your jobscript

You can now use module load something commands in your jobscript. A very simple example:

#!/bin/bash --login
#$ -cwd

# Can now load the modulefile in the jobscript (the --login above is needed for this)
module load apps/gcc/myfaveapp/1.2.3

./myfaveappexe

You should load only the modulefiles required by your job (some software modulefiles will in turn load others on which they depend).
If you forget to remove #$ -V from your jobscript then anything you have loaded on the login node may also be used, which can cause problems (e.g. clashes between compiler versions).

Scratch

The CSF2 scratch area is separate from the CSF3 scratch area. For performance reasons scratch filesystems are always local to the system they belong to. You may therefore need to copy scratch files from CSF2 to CSF3 (see below).

Please note: The scratch clean-up policy is in force – your scratch files are at risk!

Copying CSF2 scratch files to CSF3

Please do NOT copy everything – it is very unlikely you will need all of your scratch files from CSF2. Copying can be done as a batch job run on CSF2. For example, to copy a folder, and all of the files and folders within that folder, named mysamples1, from your CSF2 scratch to your CSF3 scratch, write a jobscript similar to the following, on CSF2:

#!/bin/bash
#$ -cwd

rsync -av ~/scratch/mysamples1 username@csf3.itservices.manchester.ac.uk:~/scratch/
                                  #
                                  # Replace username with your IT username. 

Submit the job on CSF2 using qsub jobscript where jobscript is the name of the jobscript file.

For transfers of individual, or small collections of files, you can run the above command on the CSF2 login node. For example:

# On the CSF2 login node, copy all .txt files from CSF2 scratch to CSF3 scratch
cd ~/scratch
rsync -av *.txt username@csf3.itservices.manchester.ac.uk:~/scratch/
                    #
                    # Replace username with your IT username.

The scratch clean-up policy remains in force on the CSF3 – files older than 3 months can be deleted by the system.
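
If you want to see which of your CSF3 scratch files are approaching that age, a command such as the following, run on a CSF3 login node, lists files not modified for more than 90 days:

find ~/scratch -type f -mtime +90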

localscratch

You may notice a symlink in your home directory called localscratch. This is an easy means of accessing a directory in /tmp specific to you. It is mostly for use on compute nodes, for jobs that can benefit from using local disk. That local disk is, of course, available only to the compute node on which a job runs; other nodes in the cluster cannot see it or share the files in it, which means more file management is required in jobscripts (see the example below).

It is recommended that you do not use this directory on the login node, as /tmp there is very small and you will easily fill it up and cause issues.
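
As a rough sketch (the file, modulefile and executable names are all placeholders), a jobscript using localscratch typically stages input onto the node-local disk, runs there, then copies results back before the job ends:

#!/bin/bash --login
#$ -cwd
module load apps/gcc/myfaveapp/1.2.3      # hypothetical modulefile

cp mydata.in ~/localscratch/              # stage input onto the node-local disk
cd ~/localscratch
myfaveappexe mydata.in > mydata.out       # assumes the modulefile puts the app on your PATH
cp mydata.out "$SGE_O_WORKDIR"/           # copy results back - the local disk is not shared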

Updates & support

Any major changes or work we plan on doing will be advertised via the appropriate channel.

If you need additional applications please let us know. Depending on what you require, there may be a slightly longer than usual wait for new applications, as our priority is getting the software required by CSF2 users ready and moving hardware.

For any questions, issues, etc. please email:

its-ri-team@manchester.ac.uk
