The CSF2 has been replaced by the CSF3 - please use that system! This documentation may be out of date. Please read the CSF3 documentation instead.
Monitoring Jobs (qstat, Eqw state)
Job status information – qstat
The standard SGE command for monitoring jobs is qstat.
qstat gives a quick view of your jobs, and you can also use it to see all jobs running and queued on the system. If you get no output, all your jobs have finished – i.e., you have nothing in the queue, either waiting (qw) or running (r).
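As a rough illustration of reading qstat output, the sketch below counts jobs per state. The function name count_job_states is hypothetical, and it assumes the default column layout shown on this page, where the state (r, qw, Eqw) is the fifth column and the first two lines are headers.

```shell
# Summarise jobs by state from qstat-style output read on stdin.
# Assumption: two header lines, then one line per job with the
# state in the fifth column (as in the example on this page).
count_job_states() {
    awk 'NR > 2 { count[$5]++ }
         END { for (s in count) print s, count[s] }' | sort
}

# Only call qstat where it is actually installed (e.g. a login node).
if command -v qstat >/dev/null 2>&1; then
    qstat | count_job_states
fi
```

If qstat prints nothing, the summary is empty as well, which matches the "no output means nothing in the queue" rule above.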
If you have jobs queued but none running, this may indicate that the CSF is very busy. We usually try to ensure that everyone has some work started within 24 hours of submission. For further details please see the FAQ "My job appears to be stuck in the queue and is not running. Why?".
How busy is the CSF?
It can be difficult to appreciate whether the cluster is busy or not. We are working on ways to give users a better feel for this. However, it is virtually impossible to say when an individual job will run. We recommend that if you have work to run you submit it as soon as possible (e.g. as soon as the necessary input data is in place). If your work is not in the queue it simply can’t be run or even considered by the batch system! The scheduler is very good at prioritising all of the jobs in a fair way.
You will likely waste more time trying to guess a good time to submit than you will actually spend waiting in the queue.
Why does my queued job say ‘Eqw’?
Sometimes you will notice that your job goes into an error state:
[username@login1 ~]$ qstat
job-ID  prior   name   user      state  submit/start at      queue  slots  ja-task-ID
--------------------------------------------------------------------------------------
 79003  0.00752 myjob  username  Eqw    03/07/2012 17:10:22         4
The reason for the error can be obtained with the following command:
qstat -j 79003 | grep error
#
# ...replacing the number with your jobid
#
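If several jobs are in the error state, the same command can be applied to each of them in a loop. The following is only a sketch: error_job_ids is a hypothetical helper, and it assumes the state is the fifth column of the qstat output, as in the example above.

```shell
# Print the IDs of jobs whose state contains "E" (e.g. Eqw), reading
# qstat-style output on stdin. Each ID is printed once, even if it
# appears on several lines (as it can for array jobs).
error_job_ids() {
    awk 'NR > 2 && $5 ~ /E/ && !seen[$1]++ { print $1 }'
}

# Only run the loop where qstat is actually installed.
if command -v qstat >/dev/null 2>&1; then
    for id in $(qstat | error_job_ids); do
        echo "== job $id =="
        qstat -j "$id" | grep error
    done
fi
```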
The most common reasons for a job to be in the error state are:
Missing directory or file
- A directory or file required by the job does not exist. Note that the path in the error may be truncated; it is usually the working directory that has the problem:
error reason 1: 03/09/2012 08:34:22 [1118:25139]: error: can't chdir to \
/users01/username/mydir: No such file or dir
- If you have used qsub options to specify the output directories then those directories must exist before you submit the job, otherwise you will get the slightly cryptic error:
can't open output file "/path/to/somewhere": Is a directory
- Where the directory or file does exist, it may be that you have used spaces or other characters in the name which Linux or the batch system is not able to interpret correctly. This is an issue commonly experienced by Windows users, so please consult our guide to using the CSF from Windows for more information.
- Alternatively check that you have not renamed or deleted the file/directory after submitting the job.
- You will need to delete the job (qdel), correct the directory problem and resubmit.
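One way to avoid this class of error is to check that the directories a job relies on exist before submitting it. The sketch below uses a hypothetical helper, require_dirs, and the paths in the usage comment are placeholders, not real CSF locations.

```shell
# Return non-zero if any of the given directories is missing,
# reporting each missing one on stderr.
require_dirs() {
    missing=0
    for d in "$@"; do
        if [ ! -d "$d" ]; then
            echo "missing directory: $d" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage sketch (placeholder paths): only submit if everything exists.
#   require_dirs "$PWD" "$PWD/logs" && qsub jobscript
```

Creating a missing directory with mkdir -p before submitting is an equally reasonable approach; the check simply makes the problem visible before the job reaches the queue.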
Your submission script was created on a Windows PC, not the CSF/Linux
- You have used a file created on a Windows PC, most likely your submission script. In this case this is the error you will see:
error reason 1: 03/08/2012 05:04:35 [1318:12643]: \
execvp(/opt/gridware/ge/default/spool/node051/job_scripts/79003
- You will need to delete the job (qdel) because the batch system makes a copy of your script which cannot be changed.
- After deleting the job run the command below and then resubmit:
dos2unix jobscript
#
# ...replacing 'jobscript' with the name of your jobscript.
#
- To avoid the problem in future jobs, we recommend using gedit on the CSF to create your job submission scripts.
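Files created on Windows typically end each line with a carriage return followed by a newline, and it is the carriage return that the batch system trips over. As a sketch, a script can be tested for carriage returns before it is submitted; has_crlf is a hypothetical helper name, and 'jobscript' stands for your own file.

```shell
# Succeed (exit status 0) if the file contains any carriage-return
# characters, which is typical of files saved on Windows.
has_crlf() {
    cr=$(printf '\r')
    grep -q "$cr" "$1"
}

# Usage sketch: convert only when needed, then submit.
#   if has_crlf jobscript; then dos2unix jobscript; fi
#   qsub jobscript
```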
Out of disk space
- If your group has run out of home disk space and you are using that area as your working directory then you will not be able to create new files. The error in this case will be similar to:
error reason 1: 06/04/2013 13:51:27 [237659:20139]: error: can't open output file \
"/users44/username/workingdir/myjob.out"
- You will need to delete the job (qdel) and the group will need to reduce disk usage before you can resubmit.
- You should review your own disk usage and delete files where possible as that may help.
- If disk space remains an issue please contact its-ri-team@manchester.ac.uk.
- It is better to run jobs from the scratch filesystem as there is a lot more space available there. See the suggested usage of filesystems documentation for more information.
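When reviewing your own disk usage, it helps to see which files and directories take the most space. The sketch below uses only the standard du, sort and head tools; largest_entries is a hypothetical helper, and it ignores hidden (dot) files at the top level of the directory.

```shell
# List the largest entries directly inside a directory, biggest first.
# Sizes are in KiB as reported by du -sk.
#   $1  directory to inspect (default: current directory)
#   $2  number of entries to show (default: 10)
largest_entries() {
    dir=${1:-.}
    n=${2:-10}
    du -sk "$dir"/* 2>/dev/null | sort -rn | head -n "$n"
}

# Usage sketch: show the five largest items in your home directory.
#   largest_entries "$HOME" 5
```

This only reports your own usage; it does not show how much of the group's allocation is left.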
can’t get password entry
- If your job reports:
can't get password entry for user "zzabc123". Either the user does not exist or NIS error!
then this means that the compute node your job tried to start on has an issue. The sysadmin team will need to identify which node and temporarily remove it from service. We can often clear this error from your job, in which case it does not need to be removed or resubmitted. Our weekday morning checks usually spot and deal with such jobs, but if you have a job with this error that has not been dealt with please email us: its-ri-team@manchester.ac.uk