The CSF2 has been replaced by the CSF3 – please use that system! This documentation may be out of date. Please read the CSF3 documentation instead.
User FAQ
If you have a question not covered in the sections below, please contact us via its-ri-team@manchester.ac.uk, providing as much information as possible about the query.
For questions about contributing to the CSF please see the Contributor FAQ.
Logging in
- I’ve forgotten my password, have tried to log in several times and now I can’t even get ssh or PuTTY to connect to the CSF. What should I do?
- I’m running GlobalProtect while off campus but can’t log in to the CSF. What should I do?
Running Jobs, Jobscripts, Modulefiles
- I’m new to the CSF and batch computing – is there a quick tutorial?
- Can I quickly run my code or application on the login node – it shouldn’t take long so writing a jobscript seems like a lot of effort?
- My job appears to be stuck in the queue and is not running. Why?
- My job can’t read my input files. Why?
- Should I load modulefiles first then submit a job or load the modulefiles in the jobscript?
- I get /bin/sh and module errors in my batch jobs, what should I do?
- What is the maximum runtime for a job?
- Can I get an email when a job starts, finishes or aborts due to an error?
- Can I submit another job from within a jobscript?
- Can I make one job wait for a previous job to finish?
- I’m getting a display error. I don’t know how to fix it.
- Why can I not use watch to monitor qstat?
- My job has almost reached the 7 day limit, please can you extend the running time on it?
- My job is running much slower than I expect, why?
Compiling software
- What does the error forrtl: severe (40): recursive I/O operation, unit -1, file unknown mean?
- ifort gives me an error: cannot find -lm. How do I fix that?
Windows Users
Files and Filesystems
- How do I download something from a site external to the CSF?
- My job can’t read my input files. Why?
- I’ve deleted a file. Can you get it back for me?
- I’ve got 1000s of files in scratch I want to download. What’s the best way?
- How can I free up some space in my home or scratch area?
Answers
All categories
I’ve forgotten my password, have tried to log in several times and now I can’t even get ssh or PuTTY to connect to the CSF. What should I do?
The CSF uses your central IT password – the same as used for University email, My Manchester, Blackboard and many other systems. Please see the getting started notes for more information.
I’m running GlobalProtect while off campus but can’t log in to the CSF. What should I do?
The CSF does NOT block access from GlobalProtect. To access the CSF from off campus (e.g., at home) you should be running GlobalProtect. But then your Internet Service Provider’s DNS must be able to resolve the CSF address csf2.itservices.manchester.ac.uk to the corresponding IP number(s). Some ISPs cannot do this correctly. To test this, run the following command in a terminal window (e.g., the Terminal app on a Mac, a shell window on Linux or a cmd prompt on Windows):
nslookup csf2.itservices.manchester.ac.uk
If it does NOT report the IP numbers of the CSF login nodes then your ISP cannot resolve the name correctly. It does not matter whether GlobalProtect is running or not (try running the above command with and without GlobalProtect – you will get the same result).
If you are having this problem you can try changing the DNS server used by your WiFi router. Using the public Google DNS server 8.8.8.8 will fix the problem in most cases. However, we cannot provide any help doing this because of the large number of WiFi routers available – please consult the documentation that came with your router.
Alternatively, we can set you up on our ssh gateway, which provides an alternative method of getting on campus.
I’m new to the CSF and batch computing – is there a quick tutorial?
Yes, please have a go at our 10 minute tutorial on running a simple job in the batch system.
There is also an Intro to CSF training course run at various times in the year which you may wish to attend (new users and those that wish to refresh their memory of the CSF are welcome).
Can I quickly run my code or application on the login node – it shouldn’t take long so writing a jobscript seems like a lot of effort?
The general answer is no, please do not run on the login node.
Here’s a quick one-liner you can use on the login node to run your code in the batch system without writing a jobscript:
qsub -b y -cwd -V -l short ./my_code.exe optional-args
where optional-args are any flags you normally pass to your application.
As you can see, running in batch is very simple – just a few extra words on the command-line. To help understand what is being done here:
- The -b y flag tells qsub that the file at the end is a binary (executable) rather than the more usual jobscript file.
- The -cwd flag runs your job in the current directory – your app will probably read files from and write files to this directory.
- The -V flag (uppercase V) causes the job to inherit the current environment when it runs. This includes any settings made by modulefiles and any other environment variables that are set. The environment is copied immediately (when you run qsub, not when the job runs) so you can even log out of the CSF and the job will still see the current environment settings.
- The -l short flag will run your app in an area reserved for short jobs (up to 1 hour), which is often enough for quick file conversions, post-processing, or other miscellaneous tasks. If you want to run for longer, simply remove the -l short option entirely to run in the 7-day environment. See the section on time limits for more info.
- The ./my_code.exe runs your program, assuming it is located in the current directory. If it is a system-wide application (installed by the sys admins) then you should load the application’s modulefile and then use simply the_app.exe (instead of ./the_app.exe), but obviously using the correct name for the application.
There’s a lot more you can do with the batch system so please read through our comprehensive documentation.
My job appears to be stuck in the queue and is not running. Why?
Sometimes when you type qstat you will see Eqw against your job. This indicates there is an error with the jobscript. There are four common causes of this issue.
However, if your job is simply sat waiting in the queue (the status is qw) then it is usually because the system is busy. Job scheduling is complicated – there are many rules that determine when your job runs (for example, how much of a share in the system your group has, how much work you have already run this month, how much work other people from your group are currently running – some limits apply to the sum of all people in your group for certain groups).
The configuration of the CSF batch system is very complex. Not all jobs use the same nodes and some jobs can only run on certain nodes. Thus while there may be a lot of jobs queued in the system many of them are waiting for different resources and thus your job is not necessarily waiting behind all the others that have been submitted before it. Some parts of the CSF adjust the number of nodes available as demand changes.
It is almost impossible to say how long a job will be waiting, but given that a job can run for up to 7 days and there are hundreds of people using the system, a wait of up to 24 hours is actually quite reasonable. It should be noted that if you submit a number of jobs, not all of them will start within 24 hours; some may have to wait until others you have running complete, but we try to ensure that all users who have submitted work have something running. The only way to guarantee short queue times is to reduce the maximum job runtime, and most users prefer the longer job runtimes.
Please note that when the system is busy it can take longer than 24 hours for large jobs (72+ cores) to start.
Specifying a particular node type (e.g. -l haswell) restricts where on the CSF your job can run and thus you may wait longer for that node type to become available. If you request 17-24 cores in smp.pe then your job can only use Haswell nodes (i.e. it is restricted as if you specified an architecture). Note: some software requires you to be specific – check the appropriate software documentation before submitting your work.
“Free at the point of use” users are restricted to a maximum of 24 cores in use at any one time (providing resources are available; note that priority for resources goes to contributing groups). Therefore if you have a 12 core job running and a 16 core job waiting, the 12 core job will need to finish before the 16 core job can start. The limit applies across the whole cluster, so if you have serial jobs running then parallel jobs may have to wait as well; for example, if 2 serial jobs are running, a 24 core job will queue until both the serial jobs finish. To find out if you are in this category, enter the command groups on the CSF login node and if it says fatpou01 then you are a ‘Free at the point of use’ user.
Some users make the mistake of not submitting jobs because they think the CSF looks busy. The CSF is always busy (we don’t want it sat around idle) so the best strategy is to submit your jobs as soon as you can. The CSF can’t schedule your jobs if you don’t have them in the queue!
We check the balance of and demands on the batch system regularly to try and ensure that wait times are not rising significantly. If you have no jobs running and the jobs you have submitted have waited longer than 24 hours then please email its-ri-team@manchester.ac.uk and we will check if there is an issue with your jobs.
Dec 2018: Some compute nodes have been moved to CSF3 as part of the upgrade project. As a result queue wait times may increase slightly, but we expect to still have some of your jobs running within 24 hours though you may have fewer jobs than usual running. This will be monitored to try and maintain a good throughput of work for everyone. The benefits of a bigger cluster which is easier to use, manage and support, will outweigh any short term inconvenience.
My job can’t read my input files. Why?
If your input files are in the directory from where you are issuing the qsub command then you must ensure your jobscript contains the line:
#$ -cwd
or your qsub command line has the flag, e.g. qsub -cwd. When the job runs it will run in the current directory. Without the cwd flag the job will run in your home directory even if you issue qsub from another directory. The xxxx.oNNNNN and xxxx.eNNNNN files will be created in the directory from where the job runs. Hence if the .o and .e files appear in your home directory unexpectedly then you’ve forgotten the cwd option.
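A minimal jobscript using this flag might look like the following sketch (the executable and input/output filenames are illustrative):

```shell
#!/bin/bash --login
#$ -cwd     # run the job in the directory from which qsub was issued
# The job's .oNNNNN and .eNNNNN files will also appear in that directory.
./my_code.exe < input.dat > results.dat
```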
Should I load modulefiles first then submit a job or load the modulefiles in the jobscript?
Both are valid – there are advantages to both.
If loading the modulefiles first (on the login node), submit the jobscript containing the option:
#$ -V # Inherit the current environment (e.g., modulefile settings)
The advantage of this method is that mistakes are easy to make when loading modulefiles (e.g., misspelling the names, or not loading all required modulefiles in the correct order). By making these mistakes (and correcting them) on the login node before submitting the job, you can be certain that your environment is correct when the job runs.
If you try to load modulefiles from a jobscript you have no way of knowing if everything is correct until the job runs. If you’ve made a mistake and the job fails you’ll need to correct the jobscript and resubmit it to the queue. So you’ll have to wait all over again for your job to be scheduled by the system!
Note that if you use #$ -V in your jobscript then the batch system copies your environment settings immediately, when you run qsub, not when the job runs. This means that after submitting the job you are free to change your environment (load other modules) or even log out of the CSF and your job will still see the environment settings you set up when you submitted the job.
However the advantage of loading the modulefiles from the jobscript is that you have a complete record of which modulefiles were used for the job. If you need to rerun the job (in six months time for example) you may not be able to remember which modulefiles were used if you loaded them on the login node before submitting the job.
To load a modulefile in your jobscript do the following:
- Ensure the first line of the jobscript is #!/bin/bash --login (or -l for short)
- Remove the #$ -V flag from your jobscript
- Add the module load some/module/to/load/1.2.3 commands you would normally run on the command-line.
Remember, if your jobscript fails to set up your environment correctly your job will probably fail. You’ll need to edit your jobscript and resubmit it. So you can actually practice loading the modulefiles on the login node first. By removing the #$ -V line from your jobscript, the job will ignore any settings you have made on the login node.
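Putting the steps above together, a sketch of a self-contained jobscript might look like this (the modulefile name is the placeholder from the list above – use the one given in your application’s documentation):

```shell
#!/bin/bash --login
#$ -cwd
# No "#$ -V" line here - the environment is set up below instead,
# which leaves a complete record of the modulefiles the job used.
module load some/module/to/load/1.2.3
./my_code.exe
```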
I get /bin/sh and module errors in my batch jobs, what should I do?
If you have loaded the modulefiles required for your job on the login node before submitting, then errors such as:
/bin/sh: module: line 1: syntax error: unexpected end of file
/bin/sh: error importing function definition for `module'
will not affect the running or results of the job. They may be an indication that you have specified modulefiles you frequently use in your .bashrc or .bash_profile files and the batch system/compute node is unable, and most likely does not need to be able, to process them. We recommend that you use your .modules environment file instead, but it may not totally eliminate the errors/warnings in jobs.
If you are trying to load modulefiles in the job script then the error may indicate that you have not done it correctly. We have advice about this on our modules page.
What is the maximum runtime for a job?
7 days is the default maximum time a job can run for before the system kills it. However, some parallel environments (PEs) have shorter runtime limits. See the PE table for details.
Can I get an email when a job starts, finishes or aborts due to an error?
Yes! – Use the #$ -m options flag in your jobscript as described in the batch script options page.
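For example, the following jobscript lines (with a placeholder email address) request an email at the beginning, end and abort of the job:

```shell
#$ -M your.name@manchester.ac.uk   # where to send the emails (placeholder address)
#$ -m bea                          # mail at (b)eginning, (e)nd and (a)bort of the job
```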
Can I submit another job from within a jobscript?
No, this is not possible. You can however use a job dependency to make one job wait for another job to finish. You must run qsub twice to submit two separate jobs, but with the job dependency flags the second job will not run until the first has completed.
Can I make one job wait for a previous job to finish?
Yes, you can use a job dependency to make one job wait for another job to finish. You must run qsub twice to submit two separate jobs, but with the job dependency flags the second job will not run until the first has completed.
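As a sketch, assuming two jobscripts named first.sh and second.sh (the names are illustrative), the dependency can be expressed with qsub’s -hold_jid flag:

```shell
# Give the first job a name so the second job can refer to it
qsub -N job1 first.sh
# job2 is held in the queue (status hqw) until job1 has completed
qsub -N job2 -hold_jid job1 second.sh
```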
I’m getting a display error. I don’t know how to fix it.
This can occur when using tools such as gedit. Make sure that you have connected to the CSF with X windows enabled as per the ‘Fundamentals’ section of the GUI-based work documentation.
If you are sure that you have the above aspect correct, but you are getting an error like this:
(gedit:28355): Gtk-WARNING **: cannot open display: localhost:12.0
then the most likely cause is a problem with a small hidden file (.Xauthority) in your CSF home directory related to X windows. You should delete the file:
rm .Xauthority
(note the dot at the start of the filename is important), then log out of the CSF and back in again. When you log in, a new .Xauthority file will be created and you should now be able to use gedit.
Why can I not use watch to monitor qstat?
The watch cmd command repeatedly runs another command cmd for you, every 2 seconds by default. Repeatedly running qstat stresses the batch system, so we ask that users do not do this. You may manually run qstat to check the status of your jobs, but there is very little value in doing so repeatedly. If you want to be informed when your job starts and/or finishes, you can ask the system to automatically send you an email – see the section on Batch script options which lists the extra flags you can add to a jobscript to get automatic emails from the job.
We have recently disabled the use of the watch qstat command.
My job has almost reached the 7 day limit, please can you extend the running time on it?
Unfortunately, we cannot extend the time limit for running, or queued, jobs. If your job needs more than 7 days you will need to try one of the following options:
- Can your job use more cores? If so, will it complete faster?
- Can you split your job into smaller chunks and run each of those as its own job (each will then get 7 days)?
- Does your software have checkpoint/restart capabilities? This is where, every few hours, the job writes its state to a file, and the job can then be restarted from a known good point using that file.
My job is running much slower than I expect, why?
We often get asked this question in relation to serial work on the CSF. The individual cores in the CSF may be less powerful (i.e. have a lower clock speed) than those in the PC on which you may have run your work before. However, the CSF has a number of advantages over your PC:
- You can run more jobs on the CSF at the same time than on your PC.
- Offloading work to the CSF means your PC will be more responsive to other tasks e.g. email, web browsing, writing a paper.
- If you switch off your PC any work running on it would be lost. Equally this would happen if your PC suffered a hardware fault. That may also take some time to get fixed. The CSF rarely suffers from such issues and if a compute node running your job does fail you can quickly and easily start the job again on another compute node.
Finally, you may want to investigate whether the software you are using can use more than one core – if it can jobs may be able to run faster.
What does the error forrtl: severe (40): recursive I/O operation, unit -1, file unknown mean?
We have seen this with v12 of the Intel compiler when several OpenMP threads attempt to write to standard out concurrently. We have specifically noticed this when linking FORTRAN from C/C++. Ensure that only one thread in your OpenMP code is writing to standard out.
ifort gives me an error: cannot find -lm. How do I fix that?
Please see our Intel Compiler notes for the fix.
I’m used to Windows, not Linux. How do I access and use the CSF?
Please see the guide for Windows users.
How do I download something from a site external to the CSF?
The CSF does not have a direct connection to sites off campus. However, many research groups use tools like svn, git and wget that may need to connect to external repositories. This can be done by configuring the appropriate proxy settings in the software. The two key pieces of information which are normally required for this are:
- URL: http://webproxy.its.manchester.ac.uk
- Port: 3128
The same URL and port can be used for http, https and ftp proxy settings. How each piece of software is configured differs. Please consult your software manual for details. If your software can read environment variables such as HTTP_PROXY (upper or lowercase) then you can set those easily using a modulefile:
module load tools/env/proxy
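If the modulefile is unavailable (or you prefer to set things by hand), the same effect can be sketched with environment variables, using the proxy URL and port above. The lowercase variable names below are a common convention, but check what your particular software actually reads:

```shell
# Point the common proxy environment variables at the campus web proxy
export http_proxy=http://webproxy.its.manchester.ac.uk:3128
export https_proxy=$http_proxy
export ftp_proxy=$http_proxy
```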
We have provided further information for the following tools:
- svn
- git
- wget – we have included the proxy information in the global system configuration on the login node so you do not need to set it yourself. However, if you intend to download large files (50 GB or more) please contact us. This is best done elsewhere, away from the CSF login node, but we can provide you with an alternative.
I’ve deleted a file. Can you get it back for me?
Now that all home directories (and other research-group-owned data areas) are on the Isilon central storage you can recover the files yourself. See the filesystems page for full details.
Scratch is not backed up so files cannot be recovered from there. This is one reason why you should not use scratch for long-term storage – it is for temporary storage only while the job is running.
I’ve got 1000s of files in scratch I want to download. What’s the best way?
Downloading a large number of individual files can be very time consuming and will place a lot of strain on the login node.
First consider whether you need to download the files at all. If they are important result files you should consider keeping them in your home area, which is on secure, backed-up Isilon storage. Your research group may also have additional Isilon areas for specific research projects or data areas. Downloading to a PC that isn’t backed up could result in data loss if the local PC’s disks fail. If you don’t have enough space in your Isilon area then consider compressing the files (with zip or gzip).
If you still want to download a copy then a better option would be to zip up the files into a single compressed archive. Zip files are common on Windows / MacOS so if you want to transfer the files to a local Windows / MacOS computer you can create the zip file on the CSF and then download it. Alternatively, if your local PC is running Linux you can create a tar.gz file on the CSF and download that. We advise running the zip app as a batch job to prevent the login node from being overloaded. Here’s how:
# Go to the required location in scratch. For example:
cd ~/scratch/my_data/
# zip up all the files from a sub-directory named 'experiment1'
qsub -b y -cwd -m e -M $USER -l short zip -r my_stuff.zip experiment1
The above command will submit a batch job (without writing a jobscript), run it from the current directory in the short environment, and will email you when it has finished. The job will zip up and compress all the files in the experiment1 sub-directory of the scratch/my_data/ directory (change the names to suit your own directory structure). When the job finishes you’ll have a file named my_stuff.zip in your scratch/my_data/ directory which you can then download using WinSCP, scp or your other favourite file transfer program from your PC. Alternatively, copy the zip file to your home area.
How can I free up some space in my home or scratch area?
The obvious answer is to delete unwanted files (use the rm command or your preferred graphical file browser such as that in MobaXterm). However, deleting results and data files is not always possible. But there are ways to reduce your usage:
- Compress your files. Many applications write out plain text results files and other log files. These can be huge. Do you need the log file? If not, delete it. The results files will compress well using gzip myresult.dat (which will create a new, smaller file named myresult.dat.gz). You can still read the file using zless myresult.dat.gz or uncompress it using gunzip myresult.dat.gz.
- Delete unwanted job .oNNNNNN and .eNNNNNN output files. Every job will produce an output file capturing what would have been printed to screen when your application ran. The files can contain normal output (the .o file) and error messages (the .e file). Each file will have the unique job number at the end of the name. If you run a lot of jobs (1000s – and many users do!) you will soon have 1000s of files. We’ve seen some directories with millions of these output files! Individually each file is often small but they soon accumulate. They also take up more space on the filesystem than you think (the minimum block size of the storage system is used even if your file is smaller). Please delete unwanted job output files. The following command can be used: rm -f *.o[0-9]* *.e[0-9]*
- Keep your job directories tidy. Deleting files from jobs you ran months ago is never an exciting task – you may have 1000s of output files. Nobody likes looking through old files to see if they are needed or not. Deleting unwanted files when the job finishes is the best way to keep your storage areas tidy. You can even put the delete commands (rm) in your jobscript to clean up any junk at the end of a job.
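As a small illustration of the compression advice above (the results filename and contents are made up):

```shell
# Create a small example results file (illustrative name and contents)
printf 'step 1 energy -10.5\nstep 2 energy -10.7\n' > myresult.dat
# Compress it; this replaces the file with a smaller myresult.dat.gz
gzip myresult.dat
# View the compressed file without uncompressing it (zless works interactively)
zcat myresult.dat.gz
# Restore the original uncompressed file
gunzip myresult.dat.gz
```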