Job Monitoring and Common Errors (Eqw)
Job status information – qstat
The standard batch system command for monitoring jobs is qstat. It will show a list of your running, waiting and in-error jobs, or no output at all if you have no jobs running or waiting in the batch system.
Output from qstat
The following example shows three jobs of different status:
qstat
job-ID  prior   name       user         state submit/start at     queue             slots ja-task-ID
------------------------------------------------------------------------------------------------
195501  0.05350 my-job-te. mxyzabc1     r     04/10/2018 09:33:49 serial.q@node332      1
195502  0.00000 simulatio. mxyzabc1     qw    04/10/2018 09:33:37                      24
195503  0.05350 first-job. mxyzabc1     Eqw   04/10/2018 09:33:49                       2
  #             #                       #         #                                     #
  #             # Part of the           #         #                                     # Number of
  #             # jobscript name        #         #                                     # cores
  #                                     #         #
  #                                     #         # When the job started (if running)
  # Job ID. Please tell us              #         # When you submitted to the queue (if waiting)
  # this if requesting help             #
                                        #
                                        # r   - a running job
                                        # qw  - a queued job (waiting to be run)
                                        # Eqw - a job in error (basic error when it ran)
                                        #
                                        # Other status flags you may see but are less common
                                        # hqw - a held job (waiting for another to finish)
                                        # Rr  - a rerun job (it automatically requeued itself)
                                        # t   - a transferring job (changing from qw to r)
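If you have many jobs, a per-state summary can be easier to read than the full listing. The following is a minimal sketch, not part of the CSF tooling: the function name count_job_states is hypothetical, and it assumes the default qstat layout shown above (two header lines, with the state in the fifth column).

```shell
# Count jobs per state from qstat-style text on standard input.
# Assumption: default qstat layout, i.e. two header lines and the
# job state (r, qw, Eqw, ...) in column 5.
count_job_states() {
    awk 'NR > 2 && NF >= 5 { count[$5]++ }
         END { for (s in count) print s, count[s] }' | sort
}

# Typical use on a login node:
#   qstat | count_job_states
```

If qstat prints nothing, the function prints nothing either.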
It can be difficult to judge whether the cluster is busy or not, and it is virtually impossible to say exactly when an individual job will run. But the very best strategy for getting your jobs to run is: put them in the queue! Submit your jobs as soon as possible (e.g. once the necessary input data is in place). If your work is not in the queue it simply can’t be run or considered by the batch system! The scheduler is very good at prioritising all of the jobs in a fair way.
Don’t wait for the CSF to appear less busy – it is always busy.
No output from qstat
If qstat returns no output, you have nothing either running or waiting – all of your jobs have finished. Check the results to see if the job has done what you expect:
qstat
# (no output means you have no jobs running or waiting!)
Note that a job in error (Eqw) indicates a basic problem with the jobscript – for example it couldn’t find the program you asked it to run or a directory/folder is missing. If a job runs an application and that application generates an error (e.g., you gave it the wrong input or the wrong flags on the command-line) then the job will not go into error. It will simply finish. You must always check the results of a job to determine whether it did what you wanted. See below for common Eqw errors.
Monitoring GPU Jobs
We have written a script to help monitor your GPU jobs. You can run the following on the CSF login nodes:
# Get a list of your GPU jobs in the batch system (similar to qstat)
gpustat

# Display the status of the GPUs in use by one of your running jobs (job id needed)
gpustat -j jobid
           #
           # Replace jobid with your job id number (e.g., 12345)

# Display the status of the GPUs in use by one of your job-array tasks (job and task id needed)
gpustat -j jobid -t taskid
           #        #
           #        # Replace taskid with your job array task id
           #        # number (e.g., 1)
           #
           # Replace jobid with your job id number (e.g., 12345)
Why does my queued job say ‘Eqw’?
Sometimes you will notice that your job goes into an error state:
[username@login1 [csf3] ~]$ qstat
job-ID  prior   name       user         state submit/start at     queue   slots ja-task-ID
--------------------------------------------------------------------------------------
79003   0.00752 myjob      username     Eqw   03/07/2012 17:10:22             4
The reason for the error can be obtained with the following command:
qstat -j 79003 | grep error
         #
         # Replace the number with your own jobid
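If several of your jobs are in error, the same check can be repeated for each of them. The sketch below is illustrative only: eqw_job_ids is a hypothetical helper name, and it assumes the default qstat layout (job ID in the first column, state in the fifth).

```shell
# Print the IDs of jobs in the Eqw state from qstat-style text on stdin.
# Assumption: default qstat layout with two header lines, job ID in
# column 1 and state in column 5. Each ID is printed once, even if
# several job-array tasks of the same job are in error.
eqw_job_ids() {
    awk 'NR > 2 && $5 == "Eqw" && !seen[$1]++ { print $1 }'
}

# Typical use on a login node: show the error reason for each job in error.
#   for id in $(qstat | eqw_job_ids); do
#       echo "== job $id =="
#       qstat -j "$id" | grep error
#   done
```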
The three most common reasons for a job in error state are:
Can’t chdir
- A directory required by the job does not exist. The path in the error may be truncated; it is usually the working directory that has the problem:
error reason 1: 03/09/2012 08:34:22 [1118:25139]: error: can't chdir to \
                /mnt/iusers01/xy01/username/mydir: No such file or dir
- Where the directory or file does exist, it may be that you have used spaces or other characters in the name which Linux or the batch system is not able to interpret correctly. This is an issue commonly experienced by Windows users, so please consult our guide to using the CSF from Windows for more information.
- Alternatively check that you have not renamed or deleted the file/directory after submitting the job, but before it ran.
- You will need to delete the job (qdel), correct the directory problem and resubmit.
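The directory problems listed above can be checked before you submit. This is a minimal sketch under stated assumptions: check_workdir is a hypothetical helper, and it only tests for whitespace in the path, not for every character the batch system might reject.

```shell
# Check a working directory before submitting a job (hypothetical helper).
# Returns 0 if the directory exists and its path contains no whitespace,
# otherwise prints a message to stderr and returns 1.
check_workdir() {
    cw_dir=$1
    if [ ! -d "$cw_dir" ]; then
        echo "missing directory: $cw_dir" >&2
        return 1
    fi
    case $cw_dir in
        *[[:space:]]*)
            echo "path contains whitespace: $cw_dir" >&2
            return 1
            ;;
    esac
    return 0
}

# Typical use, from the directory you intend to submit from:
#   check_workdir "$PWD" && qsub jobscript
```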
Your submission script was created on Windows
- You have used a file created on a Windows PC and uploaded it to the system, most likely your submission script. In this case this is the error you will see:
error reason 1: 03/08/2012 05:04:35 [1318:12643]: \
                execvp(/opt/site/sge/default/spool/node770/job_scripts/79003
- You will need to delete the job (qdel), run the command below and then resubmit:
dos2unix jobscript
         #
         # ...replacing 'jobscript' with the name of your jobscript.
         #
- We recommend that you use gedit to create jobscripts (text files) directly on the systems and thus avoid this problem in the future.
- Further details on this problem can be found in the guide dedicated to using the CIR from Windows.
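To find out whether a jobscript has Windows line endings before submitting it, you can look for carriage-return characters. The following is a small sketch; has_crlf is a hypothetical helper name, not a command provided on the system.

```shell
# Return 0 if the file contains at least one carriage return (the extra
# character that Windows editors add at the end of each line).
has_crlf() {
    grep -q "$(printf '\r')" "$1"
}

# Typical use:
#   if has_crlf jobscript; then dos2unix jobscript; fi
```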
Out of disk space
- If your group has run out of home disk space and you are using that area as your working directory then you will not be able to create new files. The error in this case will be similar to:
error reason 1: 06/04/2013 13:51:27 [237659:20139]: error: can't open output file \
                "/mnt/iusers01/xy01/username/workingdir/myjob.out"
- You will need to delete the job (qdel) and the group will need to reduce disk usage before you can resubmit.
- You should review your own disk usage and delete files where possible as that may help.
- If disk space remains an issue please contact its-ri-team@manchester.ac.uk.
- It is better to run jobs from the scratch filesystem as there is a lot more space available there. See the suggested usage of filesystems documentation for more information.
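A quick way to see whether new files can be created in a directory is to try writing a small one. This is only a sketch: can_write_in is a hypothetical helper, and how a full or over-quota filesystem reports the failure varies, so a successful probe does not guarantee that a large output file will fit.

```shell
# Return 0 if a small file can be created and written in the directory.
# The probe file is removed again afterwards.
can_write_in() {
    cwi_probe=$(mktemp "$1/.write-test.XXXXXX" 2>/dev/null) || return 1
    if echo "write test" > "$cwi_probe" 2>/dev/null; then
        rm -f "$cwi_probe"
        return 0
    fi
    rm -f "$cwi_probe"
    return 1
}

# Typical use, from the directory the job will write its output to:
#   can_write_in "$PWD" || echo "cannot create files here" >&2
```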
Is the first line of your jobscript correct?
The following error:
error reason 1: 03/05/2020 19:24:24 [417871:190900]: execvlp(/opt/site/sge/default/spool/nodeXXX/job_scripts/1234562, "/opt/site/sge/default/spool/nodeXXX/job_scripts/12345") failed: No such file or directory
usually occurs because your first line is not
#!/bin/bash --login
For example, people often miss out the slash before the word bin.
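The first line can be checked mechanically before submitting. A minimal sketch follows; check_first_line is a hypothetical helper name.

```shell
# Return 0 if the first line of the file is exactly '#!/bin/bash --login'.
check_first_line() {
    cfl_first=$(head -n 1 "$1")
    [ "$cfl_first" = '#!/bin/bash --login' ]
}

# Typical use:
#   check_first_line jobscript || echo "check the first line of jobscript" >&2
```

A jobscript with Windows line endings also fails this check, because the first line then ends with an extra carriage-return character.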