The Computational Shared Facility 3

User FAQ

If you have a question not covered in the sections below please contact us via its-ri-team@manchester.ac.uk providing as much information as possible about the query.

For questions about contributing to the CSF please see the Contributor FAQ.

FAQs

Logging in

  1. I’ve forgotten my password, have tried to log in several times and now I can’t even get ssh or PuTTY to connect to the CSF. What should I do?
  2. I’m running GlobalProtect while off campus but can’t log in to the CSF. What should I do?

Running Jobs, Jobscripts, Modulefiles

  1. I’m new to the CSF and batch computing – is there a quick tutorial?
  2. Can I quickly run my code or application on the login node – it shouldn’t take long so writing a jobscript seems like a lot of effort?
  3. My job appears to be stuck in the queue and is not running. Why?
  4. My job can’t read my input files. Why?
  5. Should I load modulefiles first then submit a job or load the modulefiles in the jobscript?
  6. I get /bin/sh and module errors in my batch jobs, what should I do?
  7. What is the maximum runtime for a job?
  8. Can I get an email when a job starts, finishes or aborts due to an error?
  9. Can I submit another job from within a jobscript?
  10. Can I make one job wait for a previous job to finish?
  11. I’m getting a display error. I don’t know how to fix it.
  12. Why can I not use watch to monitor qstat?
  13. My job has almost reached the runtime limit, please can you extend the running time on it?
  14. My job is running much slower than I expect, why?
  15. I get host is not a submit host when running qsub. Why?
  16. I get module: command not found and my job fails. Why?

Software and Applications

  1. How do I check if an app / piece of software is installed?
  2. The software I was using on CSF2 is not installed on CSF3. What should I do?

Compiling Software

  1. What does the error forrtl: severe (40): recursive I/O operation, unit -1, file unknown mean when compiling my Fortran code?
  2. ifort gives me an error: cannot find -lm. How do I fix that?

Windows Users

  1. I’m used to windows, not linux. How do I access and use the CSF?

Files and Filesystems

The questions below are just a few of the common ones we get asked. We have a longer and more detailed FAQ on this topic.

  1. How do I download something from a site external to the CSF?
  2. My job can’t read my input files. Why?
  3. I’ve deleted a file. Can you get it back for me?
  4. I’ve got 1000s of files in scratch I want to download. What’s the best way?
  5. How can I free up some space in my home or scratch area?
  6. Some of my scratch files have been deleted! Where have they gone?
  7. I have downloaded a .zip of several datasets but the scratch clean-up keeps deleting them. What can I do?

Answers

Logging in

I’ve forgotten my password, have tried to log in several times and now I can’t even get ssh or PuTTY to connect to the CSF. What should I do?

The CSF uses your central IT password – the same as used for University email, My Manchester, Blackboard and many other systems. Please see the getting started notes for more information.

I’m running GlobalProtect while off campus but can’t log in to the CSF. What should I do?

The CSF does NOT block access from GlobalProtect. To access the CSF from off campus (e.g., at home) you should be running GlobalProtect. But then your Internet Service Provider’s DNS must be able to resolve the CSF address csf3.itservices.manchester.ac.uk to the corresponding IP number(s). Some ISPs cannot do this correctly (because the CSF uses an internal 10.99 IP address for security, which some ISPs do not allow domestic customers to access.)

To test if your home ISP will work correctly with GlobalProtect

Run the following command in a terminal window (e.g., the Terminal app on a Mac, a shell window on Linux or a cmd prompt on Windows):

# Run this command in a Terminal window (Mac/Linux) or a CMD prompt window (MS Windows)
nslookup csf3.itservices.manchester.ac.uk
#
# If your ISP allows GlobalProtect you should see IP numbers:
# 10.99.203....

If it does NOT report the IP numbers of the CSF login nodes then your ISP cannot resolve the name correctly. It does not matter whether GlobalProtect is running or not (try running the above command with and without GlobalProtect – you will get the same result).

What to do if your ISP has this problem

If you are having this problem you could try to change the DNS server used by your WiFi router. Using the public Google DNS server 8.8.8.8 will fix the problem in most cases. However we cannot provide any help doing this because of the large number of WiFi routers available – please consult the documentation that came with your router.

Alternatively, we can set you up on our ssh gateway which provides an alternative method of getting on campus and works with or without GlobalProtect running.

Running Jobs, Jobscripts, Modulefiles

I’m new to the CSF and batch computing – is there a quick tutorial?

Yes, please have a go at our 10 minute tutorial on running a simple job in the batch system.

There is also an Intro to CSF training course run at various times in the year which you may wish to attend (new users and those that wish to refresh their memory of the CSF are welcome).

Can I quickly run my code or application on the login node – it shouldn’t take long so writing a jobscript seems like a lot of effort?

The general answer is no, please do not run on the login node.

Here’s a quick one-liner you can use on the login node to run your code in the batch system without writing a jobscript:

qsub -b y -cwd -V -l short ./my_code.exe optional-args

where optional-args are any flags you normally pass to your application.

As you can see, running in batch is very simple – just a few extra words on the command-line. To help understand what is being done here:

  • The -b y flag tells qsub that the file at the end is a binary (executable) rather than the more usual jobscript file.
  • The -cwd runs your job in the current directory – your app will probably read files from and write files to this directory.
  • The -V (uppercase V) causes the job to inherit the current environment when it runs. This includes any settings made by modulefiles and any other environment variables that are set. The environment is copied immediately (when you run qsub, not when the job runs) so you can even log out of the CSF and the job will still see the current environment settings.
  • The -l short flag will run your app in an area reserved for short jobs (up to 1 hour) which is often enough for quick file conversions, post-processing, or other miscellaneous tasks. If you want to run for longer simply remove the -l option entirely to run in the 7-day environment. See the section on time limits for more info.
  • The ./my_code.exe runs your program, assuming it is located in the current directory. If it is a system-wide application (installed by the sys admins) then you should load the application’s modulefile and then use simply the_app.exe (instead of ./the_app.exe), using the correct name for the application.

There’s a lot more you can do with the batch system so please read through our comprehensive documentation.
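
For comparison, the same quick job written as a minimal jobscript (a sketch using the placeholder names from the one-liner above; save it as, e.g., myjob.sh and submit it with qsub myjob.sh):

```shell
#!/bin/bash --login
#$ -cwd             # run from the current directory
#$ -l short         # up to 1 hour; delete this line for the 7-day environment
./my_code.exe optional-args
```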

My job appears to be stuck in the queue and is not running. Why?

Sometimes when you type qstat you will see your job says Eqw against it. This indicates there is an error with the jobscript. There are four common causes of this issue.

However, if your job is simply sat waiting in the queue (the status is qw) then it is usually because the system is busy. Job scheduling is complicated – there are many rules that determine when your job runs (for example, how much of a share in the system your group has, how much work you have already run this month, how much work other people from your group are currently running – some limits apply to the sum of all people in your group for certain groups).

The configuration of the CSF batch system is very complex. Not all jobs use the same nodes and some jobs can only run on certain nodes. Thus while there may be a lot of jobs queued in the system many of them are waiting for different resources and thus your job is not necessarily waiting behind all the others that have been submitted before it. Some parts of the CSF adjust the number of nodes available as demand changes.

It is almost impossible to say how long a job will be waiting, but given that a job can run for up to 7 days, and there are hundreds of people using the system, a wait of up to 24 hours is actually quite reasonable. It should be noted that if you submit a number of jobs not all of them will start within 24 hours, and that if you already have jobs running (e.g. submitted earlier/on a different day) then newly submitted jobs may have to wait until others you have running complete. We do try to ensure that all users who have submitted work have something running. The only way to guarantee short queue times is to reduce the maximum job runtime, and most users prefer the longer job runtimes.

Please note that when the system is busy it can take longer than 24 hours for large jobs (72+ cores) to start.

Specifying a particular node type (e.g. -l ivybridge) restricts where on the CSF your job can run and thus you may wait longer for that node type to become available. If you request 29-32 cores in smp.pe then your job can only use skylake (i.e. it is restricted as if you specified an architecture). Note: some software requires you to be specific – check the appropriate software documentation before submitting your work.

Are you asking for higher memory nodes (mem256/mem512/mem1024/mem1500)? If so, there are not many of these nodes available and on occasion they can get busy, which results in an increased wait time for them. We also have strict limits on the number of cores a user can run on them; this applies to all users regardless of your group share, but we do try to keep the limits flexible in line with demand.

“Free at the point of use” users are restricted to a maximum of 32 cores in use at any one time (providing resources are available; note that priority for resources goes to contributing groups). Therefore if you have a 12 core job running and a 22 core job waiting, the 12 core job will need to finish before the 22 core job can start. The limit applies across the whole cluster, so if you have serial jobs running then parallel jobs may have to wait as well; for example, if 2 serial jobs are running, a 32 core job will queue until both serial jobs finish. To find out if you are in this category enter the command groups on the CSF login node; if it says fatpou01 then you are a ‘Free at the point of use’ user.

Some users make the mistake of not submitting jobs because they think the CSF looks busy. The CSF is always busy (we don’t want it sat around idle) so the best strategy is to submit your jobs as soon as you can. The CSF can’t schedule your jobs if you don’t have them in the queue! Deleting your job and submitting at another time is also a bad idea as the job priority increases the longer it waits and if you submit a new job it will have a low priority.

We check the balance of and demands on the batch system regularly to try and ensure that wait times are not rising significantly. If you have no jobs running and the jobs you have submitted have waited longer than 24 hours then please email its-ri-team@manchester.ac.uk and we will check if there is an issue with your jobs.

My job can’t read my input files. Why?

If your input files are in the directory from where you are issuing the qsub command then you must ensure your jobscript contains the line:

#$ -cwd              

or your qsub command line has the flag, e.g. qsub -cwd. When the job runs it will run in the current directory. Without the cwd flag the job will run in your home directory even if you issue qsub from another directory. The xxxx.oNNNNN and .eNNNNN files will be created in the directory from where the job runs. Hence if the .o and .e files appear in your home directory unexpectedly then you’ve forgotten the cwd option.

Should I load modulefiles first then submit a job or load the modulefiles in the jobscript?

Both are valid – there are advantages to both.

If loading the modulefiles first (on the login node), submit the jobscript containing the option:

#$ -V          # Inherit the current environment (e.g., modulefile settings)

The advantage of this method is that it is easy to make mistakes when loading modulefiles (e.g., misspelling their names, or not loading all required modulefiles in the correct order). By making these mistakes (and correcting them) on the login node before submitting the job, you can be certain that your environment is correct when the job runs.

If you try to load modulefiles from a jobscript you have no way of knowing if everything is correct until the job runs. If you’ve made a mistake and the job fails you’ll need to correct the jobscript and resubmit it to the queue. So you’ll have to wait all over again for your job to be scheduled by the system!

Note that if you use #$ -V in your jobscript then the batch system copies your environment settings immediately, when you run qsub, not when the job runs. This means that after submitting the job you are free to change your environment (load other modules) or even log out of the CSF and your job will still see the environment settings you set up when you submitted the job.

However the advantage of loading the modulefiles from the jobscript is that you have a complete record of which modulefiles were used for the job. If you need to rerun the job (in six months time for example) you may not be able to remember which modulefiles were used if you loaded them on the login node before submitting the job.

To load a modulefile in your jobscript do the following:

  1. Ensure the first line of the jobscript is #!/bin/bash --login (or -l for short)
  2. Remove the #$ -V flag from your jobscript
  3. Add the module load some/module/to/load/1.2.3 command(s) you would normally run on the command-line.

Remember, if your jobscript fails to set up your environment correctly your job will probably fail. You’ll need to edit your jobscript and resubmit it. So you can actually practice loading the modulefiles on the login node first. By removing the #$ -V line from your jobscript the job will ignore any settings you have made on the login node.
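
Putting the steps above together, a minimal jobscript that loads its own modulefiles might look like this (a sketch; the modulefile and application names are placeholders):

```shell
#!/bin/bash --login
#$ -cwd
# Load modulefiles inside the job - no #$ -V line needed
module load some/module/to/load/1.2.3
./my_app.exe
```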

I get /bin/sh and module errors in my batch jobs, what should I do?

If you have loaded the modulefiles required for your job on the login node before submitting then errors such as:

/bin/sh: module: line 1: syntax error: unexpected end of file
/bin/sh: error importing function definition for `module'

will not affect the running or results of the job. They usually indicate that you load frequently used modulefiles in your .bashrc or .bash_profile files, which the batch system cannot (and most likely does not need to) process. We recommend that you use your .modules environment file instead, although this may not totally eliminate the errors/warnings in jobs.

If you are trying to load modulefiles in the job script then the error may indicate that you have not done it correctly. We have advice about this on our modules page.

What is the maximum runtime for a job?

7 days is the default maximum time a job can run for before the system kills it. However, some parallel environments (PEs) have shorter runtime limits. See the PE table for details.

Can I get an email when a job starts, finishes or aborts due to an error?

Yes! – Use the #$ -m options flag in your jobscript as described in the batch script options page.
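
For example, the following jobscript lines request an email at the beginning and end of the job, and on abort (a sketch; replace the address with your own University email address):

```shell
#$ -m bea                          # mail on (b)egin, (e)nd and (a)bort
#$ -M your.name@manchester.ac.uk   # destination address
```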

Can I submit another job from within a jobscript?

No, this is not possible. You can however use a job dependency to make one job wait for another job to finish. You must run qsub twice to submit two separate jobs but with the job dependency flags the second job will not run until the first has completed.

Can I make one job wait for a previous job to finish?

Yes, you can use a job dependency to make one job wait for another job to finish. You must run qsub twice to submit two separate jobs but with the job dependency flags the second job will not run until the first has completed.
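
For example, using the standard SGE -hold_jid flag with named jobs (a sketch; the jobscript names are placeholders, and you should check the batch documentation for the exact dependency flags supported on the CSF):

```shell
qsub -N step1 first_job.sh
qsub -N step2 -hold_jid step1 second_job.sh   # step2 waits until step1 finishes
```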

I’m getting a display error. I don’t know how to fix it.

This can occur when using tools such as gedit. Make sure that you have connected to the CSF with X11 enabled – see the GUI-based work documentation.

If you are sure that you have the above aspect correct, but you are getting an error like this:

(gedit:28355): Gtk-WARNING **: cannot open display: localhost:12.0

then the most likely cause is a problem with a small hidden file (.Xauthority) in your CSF home directory related to X windows. You should delete the file:

rm .Xauthority

(note the dot at the start of the filename is important), log out of the CSF and back in again. When you log in a new .Xauthority file will be created and you should now be able to use gedit.

Why can I not use watch to monitor qstat?

The watch cmd command repeatedly runs another command cmd for you, every 2 seconds by default. Repeatedly running qstat stresses the batch system so we ask that users do not do this. You may manually run qstat to check the status of your jobs. But there is very little value in doing so repeatedly. If you want to be informed when your job starts and/or finishes, you can ask the system to automatically send you an email – see the section on Batch script options which lists the extra flags you can add to a jobscript to get automatic emails from the job.

We have recently disabled the use of the watch qstat command.

My job has almost reached the runtime limit, please can you extend the running time on it?

Unfortunately, we cannot extend the time limit for running, or queued, jobs. If your job needs more than 7 days you will need to try one of the following options:

  • Can your job use more cores? If so will it complete faster?
  • Can you split your job into smaller chunks and run each of those as its own job (each will then get 7 days)?
  • Does your software have checkpoint restart capabilities? This is where every few hours the job writes its state to a file and then the job can be restarted from a known good point using that file.

Note: The runtime limit on GPU nodes is 4 days.

My job is running much slower than I expect, why?

We often get asked this question in relation to serial work on the CSF. The individual cores in the CSF may be less powerful (i.e. have a lower GHz) than your PC on which you may have run your work before. However, the CSF has a number of advantages over your PC:

  • You can run more jobs on the CSF at the same time than on your PC.
  • Offloading work to the CSF means your PC will be more responsive to other tasks e.g. email, web browsing, writing a paper.
  • If you switch off your PC any work running on it would be lost. Equally this would happen if your PC suffered a hardware fault. That may also take some time to get fixed. The CSF rarely suffers from such issues and if a compute node running your job does fail you can quickly and easily start the job again on another compute node.

Finally, you may want to investigate whether the software you are using can use more than one core – if it can jobs may be able to run faster.
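
For example, an OpenMP application could request several cores in smp.pe and use the $NSLOTS variable set by the batch system (a sketch; the application name is a placeholder, and you should check your software’s documentation for how it controls threading):

```shell
#!/bin/bash --login
#$ -cwd
#$ -pe smp.pe 8                  # request 8 cores
export OMP_NUM_THREADS=$NSLOTS   # $NSLOTS = number of cores granted
./my_threaded_app.exe
```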

I get host is not a submit host when running qsub. Why?

An error message similar to:

Unable to run job: denied: host "node403.pri.csf3.alces.network" is not a submit host

means that you are trying to run qsub on a compute node instead of the login node. You are on the compute node because you have an interactive session running (you ran qrsh -l short). But you cannot submit jobs from here. You must be on the login node to submit jobs.

You should exit your interactive session by running exit or log in to the CSF using another shell window (e.g., another MobaXterm window or another Terminal window on MacOS or the nyx virtual desktop). You can log in to the CSF multiple times if you need more than one window on the login node.

I get module: command not found and my job fails. Why?

This happens when the first line of the jobscript does not contain the required --login flag. Your jobscript should begin with the line:

#!/bin/bash --login

if you are going to load modulefiles inside the jobscript (which we do recommend).

Software and Applications

How do I check if an app / piece of software is installed?

Type (some part of) the name in the Search box above the menu on the left hand side of this page. Alternatively, have a look at the list of applications we have documented. Finally, you could also log in to the CSF and run

module search appname

to check for the modulefile.

The software I was using on CSF2 is not installed on CSF3. What should I do?

First check to see if a newer version is installed on the CSF (see question above). We may have upgraded the version on CSF3. If the software has not been installed, please request it via its-ri-team@manchester.ac.uk and we will look into it. Please note, we do not intend to install old versions of software on CSF3. When moving to CSF3 please take the opportunity to upgrade your work to a newer version of the software if possible.

Compiling Software

What does the error forrtl: severe (40): recursive I/O operation, unit -1, file unknown mean when compiling my Fortran code?

We have seen this with v12 of the Intel compiler when several OpenMP threads attempt to write to standard out concurrently. We have specifically noticed this when linking FORTRAN from C/C++. Ensure that only one thread in your OpenMP code is writing to standard out.

ifort gives me an error: cannot find -lm. How do I fix that?

Please see our Intel Compiler notes for the fix.

Windows Users

I’m used to windows, not linux. How do I access and use the CSF?

Please see the guide for Windows users.

Files and Filesystems

The questions below are just a few of the common ones we get asked. We have a longer and more detailed FAQ on this topic.

How do I download something from a site external to the CSF?

The CSF does not have a direct connection to sites off campus. However, many research groups use tools like svn, git and wget that may need to connect to external repositories. This can be done by configuring the appropriate proxy settings in the software. The two key pieces of information which are normally required for this are:

  • URL: http://proxy.man.ac.uk
  • Port: 3128

The same URL and port can be used for http, https and ftp proxy settings. How each piece of software is configured differs. Please consult your software manual for details. If your software can read environment variables such as HTTP_PROXY (upper or lowercase) then you can set those easily using a modulefile:

module load tools/env/proxy
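
If you prefer to set the variables by hand, the rough equivalent of the modulefile is shown below (a sketch: the modulefile may export additional variables, so check it with module show tools/env/proxy):

```shell
# Manually set the standard proxy variables (lowercase forms shown;
# many tools also read the uppercase HTTP_PROXY etc.)
export http_proxy=http://proxy.man.ac.uk:3128
export https_proxy=http://proxy.man.ac.uk:3128
export ftp_proxy=http://proxy.man.ac.uk:3128
echo "$http_proxy"
```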

We have provided further information for the following tools:

  • svn
  • git
  • wget – we have included the proxy information in the global system configuration on the login node so you do not need to set it yourself. However, if you intend to download large files (50 GB or more) please contact us. This is best done elsewhere, away from the CSF login node, but we can provide you with an alternative.

I’ve deleted a file. Can you get it back for me?

Now that all home directories (and other research-group-owned data areas) are on the Isilon central storage you can recover the files yourself. See the filesystems page for full details.

Scratch is not backed up so files cannot be recovered from there. This is one reason why you should not use scratch for long-term storage – it is for temporary storage only while the job is running.

I’ve got 1000s of files in scratch I want to download. What’s the best way?

Downloading a large number of individual files can be very time consuming and will place a lot of strain on the login node.

First consider whether you need to download the files at all. If they are important result files you should consider keeping them in your home area which is on secure backed-up Isilon storage. Your research group may also have additional Isilon areas for specific research projects or data areas. Downloading to a PC that isn’t backed up could result in data loss if the local PC’s disks fail. If you don’t have enough space in your Isilon area then consider compressing the files (with zip or gzip).

If you still want to download a copy then a better option would be to zip up the files into a single compressed archive. Zip files are common on Windows / MacOS, so if you want to transfer the files to a local Windows / MacOS computer you can create the zip file on the CSF and then download it. Alternatively, if your local PC is running Linux you can create a tar.gz file on the CSF and download that. We advise running the zip app as a batch job to prevent the login node from being overloaded. Here’s how:

# Go to the required location in scratch. For example:
cd ~/scratch/my_data/

# zip up all the files from a sub-directory named 'experiment1'
qsub -b y -cwd -m e -M $USER -l short zip -r my_stuff.zip experiment1

The above command will submit a batch job (without writing a jobscript), run it from the current directory in the short environment, and email you when it has finished. The job will zip up and compress all the files in the experiment1 sub-directory of the scratch/my_data/ directory (change the names to suit your own directory structure). When the job finishes you’ll have a file named my_stuff.zip in your scratch/my_data/ directory which you can then download using WinSCP, scp or your favourite file transfer program from your PC. Alternatively copy the zip file to your home area.
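
If your local PC runs Linux, the tar.gz alternative can be submitted in exactly the same way (same assumptions and placeholder names as the zip example):

```shell
cd ~/scratch/my_data/
qsub -b y -cwd -m e -M $USER -l short tar -czf my_stuff.tar.gz experiment1
```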

How can I free up some space in my home or scratch area?

The obvious answer is to delete unwanted files (use the rm command or your preferred graphical file browser such as that in MobaXterm). However, deleting results and data files is not always possible. There are other ways to reduce your usage:

  1. Compress your files. Many applications write out plain text results files and other log files. These can be huge. Do you need the log file? If not, delete it. But results files will often compress well using gzip myresult.dat (which will create a new, smaller file named myresult.dat.gz). You can still read the file using zless myresult.dat.gz or uncompress it using gunzip myresult.dat.gz.
  2. Delete unwanted job .oNNNNNN and .eNNNNNN output files. Every job will produce an output file capturing what would have been printed to screen when your application ran. The files can contain normal output (the .o file) and error messages (the .e file). Each file will have the unique job number at the end of the name. If you run a lot of jobs (1000s – and many users do!) you will soon have 1000s of files. We’ve seen some directories with millions of these output files! Individually each file is often small but they soon accumulate. They also take up more space on the filesystem than you think (the minimum block size of the storage system is used even if your file is smaller). Please delete unwanted job output files. The following command can be used:
    rm -f *.[oe][0-9]* 
    
  3. Keep your job directories tidy. Deleting files from jobs you ran months ago is never an exciting task – you may have 1000s of output files and nobody likes looking through old files to see if they are needed or not. Deleting unwanted files when the job finishes is the best way to keep your storage areas tidy. You can even put the delete commands (rm) in your jobscript to clean up any junk at the end of a job.
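
A runnable sketch of tips 1 and 2 together, using made-up filenames (safe to try in an empty directory):

```shell
# Tip 1: compress a results file and read it back without uncompressing
echo "final energy = -42.17" > myresult.dat
gzip myresult.dat                 # replaces it with the smaller myresult.dat.gz
zcat myresult.dat.gz              # prints the contents; zless pages them

# Tip 2: delete job output files with the glob pattern (data files survive)
touch sim.o12345 sim.e12345       # stand-ins for old job output files
rm -f *.[oe][0-9]*
ls                                # only myresult.dat.gz remains
```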

Some of my scratch files have been deleted! Where have they gone?

It could be the automatic scratch clean-up policy that has deleted your files:

Please note: the scratch filesystem automatic clean-up policy is active. If you have scratch files unused for 3-months or more they may be deleted. Please read the Scratch Cleanup Policy for more information.

I have downloaded a .zip of several datasets but the scratch clean-up keeps deleting them. What can I do?

When you unzip (or untar) the archive of datasets, the files in the archive will be created with their original timestamps. These could be months or years in the past. The scratch clean-up will then see these old files and delete them.

The solution is to ask unzip (or tar) to extract the files and apply today’s date to them. See here for the extra flag/switch you must add to unzip or tar -xf to extract the files correctly.
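
As an illustration of the timestamp behaviour with tar: the -m flag asks tar not to restore the archived modification times, so extracted files are stamped with today’s date (unzip has a similar -DD flag; confirm the exact switches against the linked page). All names below are examples:

```shell
# Create a file with an old timestamp and archive it
mkdir -p demo
echo "data" > demo/old.dat
touch -d '2015-01-01' demo/old.dat
tar -cf demo.tar demo
rm -rf demo

# Extract with -m so the files get the extraction time, not 2015
tar -xmf demo.tar
ls -l demo/old.dat                # now dated today, safe from the clean-up
```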

Last modified on December 7, 2021 at 9:15 am by Pen Richardson