Job Monitoring and Common Errors (Eqw)

Job status information – qstat

The standard batch system command for monitoring jobs is qstat. It shows a list of your running, waiting and in-error jobs, or no output at all if you have no jobs running or waiting in the batch system.

Output from qstat

The following example shows three jobs of different status:

qstat

job-ID  prior   name        user     state  submit/start at      queue         slots ja-task-ID
------------------------------------------------------------------------------------------------
195501 0.05350  my-job-te.  mxyzabc1 r      04/10/2018 09:33:49 serial.q@node332  1
195502 0.00000  simulatio.  mxyzabc1 qw     04/10/2018 09:33:37                   24
195503 0.05350  first-job.  mxyzabc1 Eqw    04/10/2018 09:33:49                   2
  #               #                   #           #                               #
  #               # Part of the       #           #                               # Number of
  #               # jobscript name    #           #                               # cores
  #                                   #           #
  #                                   #           # When the job started (if running)
  # Job ID. Please tell us            #           # When you submitted to the queue (if waiting)
  # this if requesting help           #
                                      #
                                      # r   - a running job
                                      # qw  - a queued job (waiting to be run)
                                      # Eqw - a job in error (basic error when it ran)
                                      #
                                      # Other status flags you may see but are less common
                                      # hqw - a held job (waiting for another to finish)
                                      # Rr  - a rerun job (it automatically requeued itself)
                                      # t   - a transferring job (changing from qw to r)
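
If you have many jobs, qstat accepts some standard Grid Engine options to narrow the listing (run man qstat on a login node for the full list):

# Show only your running jobs
qstat -s r

# Show only your pending (waiting) jobs
qstat -s p

# Show full details of a single job (replace 195501 with your own job ID)
qstat -j 195501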

It can be difficult to tell whether the cluster is busy or not, and it is virtually impossible to say exactly when an individual job will run. But the very best strategy for getting your jobs to run is: put them in the queue! Submit your jobs as soon as possible (e.g., as soon as the necessary input data is in place). If your work is not in the queue it simply cannot be run or considered by the batch system. The scheduler is very good at prioritising all of the jobs in a fair way.

Don’t wait for the CSF to appear less busy – it is always busy.

No output from qstat

If qstat returns no output, you have nothing either running or waiting – all of your jobs have finished. Check the results to see if the job has done what you expect:

qstat

# (no output means you have no jobs running or waiting!)

Note that a job in error (Eqw) indicates a basic problem with the jobscript – for example, it couldn't find the program you asked it to run or a directory/folder is missing. If a job runs an application and that application generates an error (e.g., you gave it the wrong input or the wrong flags on the command line) then the job will not go into error. It will simply finish. You must always check the results of a job to determine whether it did what you wanted, for example by looking at the job's output files as shown below. See further below for common Eqw errors.
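
A quick way to inspect what a finished job produced, assuming the default SGE output file naming (jobname.oJOBID and jobname.eJOBID) and using an illustrative jobscript name first-job.sh with job ID 195503 (substitute your own names and IDs):

# Show the standard output captured from the job (results, progress messages)
cat first-job.sh.o195503

# Show the standard error captured from the job (error messages, warnings)
cat first-job.sh.e195503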

Monitoring GPU Jobs

We have written a script to help monitor your GPU jobs. You can run the following on the CSF login nodes:

# Get a list of your GPU jobs in the batch system (similar to qstat)
gpustat

# Display the status of the GPUs in use by one of your running jobs (job id needed)
gpustat -j jobid
             #
             # Replace jobid with your job id number (e.g., 12345)

# Display the status of the GPUs in use by one of your job-array tasks (job and task id needed)
gpustat -j jobid -t taskid
             #        # 
             #        # Replace taskid with your job array task id
             #        # number (e.g., 1)
             #
             # Replace jobid with your job id number (e.g., 12345)

Why does my queued job say ‘Eqw’?

Sometimes you will notice that your job goes into an error state:

[username@login1 [csf3] ~]$ qstat 
job-ID  prior    name   user      state  submit/start at      queue   slots ja-task-ID
--------------------------------------------------------------------------------------
79003   0.00752  myjob  username  Eqw    03/07/2012 17:10:22          4

The reason for the error can be obtained with the following command:

qstat -j 79003 | grep error
           #
           # Replace the number with your own jobid

The most common reasons for a job in an error state are:

Can’t chdir

  • A directory required by the job does not exist (note that the path in the error may be truncated; it is usually the working directory which has the problem):
    error reason 1: 03/09/2012 08:34:22 [1118:25139]: error: can't chdir to \
      /mnt/iusers01/xy01/username/mydir: No such file or dir
    
  • Where the directory or file does exist, it may be that you have used spaces or other characters in the name which Linux or the batch system cannot interpret correctly. This is an issue commonly experienced by Windows users, so please consult our guide to using the CSF from Windows for more information.
  • Alternatively, check that you have not renamed or deleted the file/directory after submitting the job but before it ran.
  • You will need to delete the job (qdel), correct the directory problem and resubmit (see the sketch below).
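
A minimal recovery sketch is shown below; the job ID, directory and jobscript name are placeholders based on the examples above, so substitute your own values:

# Delete the job stuck in Eqw, fix the missing directory, then resubmit
qdel 79003
mkdir -p /mnt/iusers01/xy01/username/mydir
qsub first-job.sh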

Your submission script was created on Windows

  • You have uploaded a file created on a Windows PC to the system, most likely your submission script. In this case the error you will see is:
    error reason 1: 03/08/2012 05:04:35 [1318:12643]: \
      execvp(/opt/site/sge/default/spool/node770/job_scripts/79003
    
  • You will need to delete the job (qdel), run the command below and then resubmit (a quick way to confirm the problem is shown after this list):
    dos2unix jobscript
               #
               # ...replacing 'jobscript' with the name of your jobscript.
               #
    
  • We recommend that you use gedit to create jobscripts (text files) directly on the system and thus avoid this problem in the future.
  • Further details on this problem can be found in the guide dedicated to using the CSF from Windows.
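
If you are unsure whether a jobscript really has Windows line endings, the standard Linux commands below will reveal them (myjobscript is a placeholder for your own filename):

# A Windows-format file is reported as "... with CRLF line terminators"
file myjobscript

# Windows line endings appear as ^M at the end of each line
cat -A myjobscript

# Convert the file in place, then resubmit with qsub
dos2unix myjobscript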

Out of disk space

  • If your group has run out of home disk space and you are using that area as your working directory then you will not be able to create new files. The error in this case will be similar to:
    error reason 1: 06/04/2013 13:51:27 [237659:20139]: error: can't open output file \
      "/mnt/iusers01/xy01/username/workingdir/myjob.out"
    
  • You will need to delete the job (qdel) and the group will need to reduce disk usage before you can resubmit.
  • You should review your own disk usage and delete files where possible as that may help (some commands for checking usage are shown after this list).
  • If disk space remains an issue please contact its-ri-team@manchester.ac.uk.
  • It is better to run jobs from the scratch filesystem as there is a lot more space available there. See the suggested usage of filesystems documentation for more information.
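
The standard Linux commands below give a rough picture of where disk space has gone. Note that the scratch path used here is an assumption; check the filesystems documentation for the correct location of your scratch area:

# Free space on the filesystem holding your home directory
df -h ~

# Size of each top-level file/directory in your home directory
du -sh ~/*

# Run jobs from scratch instead (path is an assumption, jobscript name is a placeholder)
cd ~/scratch
qsub first-job.sh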

Is the first line of your jobscript correct?

The following error:

error reason 1: 03/05/2020 19:24:24 [417871:190900]: \
  execvlp(/opt/site/sge/default/spool/nodeXXX/job_scripts/1234562, \
  "/opt/site/sge/default/spool/nodeXXX/job_scripts/12345") failed: No such file or directory

usually occurs because your first line is not

#!/bin/bash --login

For example, people often miss out the slash before the word bin.
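
You can check the first line of a jobscript directly on the login node (myjobscript is a placeholder for your own filename):

# Print the first line of the jobscript - it should be exactly:  #!/bin/bash --login
head -1 myjobscript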
