Monitoring Jobs

Job status information – qstat

The standard SGE command for monitoring jobs is qstat.

qstat gives a quick view of your jobs, showing all of your running and queued jobs on the system. If you get no output, all of your jobs have finished, i.e. you have nothing in the queue, either waiting (qw) or running (r).
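
As an illustration of reading the state column, the sketch below counts jobs by state from qstat output supplied on standard input. The function name count_job_states is our own invention, and the sketch assumes the default column layout, in which the state is the fifth field.

```shell
# Count jobs per state from qstat output read on standard input.
# Assumption: default qstat layout, where field 1 is the numeric job ID
# and field 5 is the state (e.g. qw, r, Eqw). Header lines are skipped
# because their first field is not a number.
count_job_states() {
    awk '$1 ~ /^[0-9]+$/ { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# Typical use on a login node (not run here):
#   qstat | count_job_states
```

Each output line gives a state followed by the number of jobs in that state.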

It can be difficult to appreciate whether the cluster is busy or not. We are working on ways to give users a better feel for this. However, it is virtually impossible to say when an individual job will run. We recommend that if you have work to run you submit it as soon as possible (e.g. when you have the necessary input data etc. in place). If your work is not in the queue it simply can’t be run or be considered by the batch system! The scheduler is very good at prioritising all of the jobs in a fair way.

You will likely waste more time trying to guess a good time to submit than you will actually spend waiting in the queue.

Why does my queued job say ‘Eqw’?

Sometimes you will notice that your job goes into an error state:

[username@login1 ~]$ qstat 
job-ID  prior    name   user      state  submit/start at      queue  slots  ja-task-ID
--------------------------------------------------------------------------------------
79003   0.00752  myjob  username  Eqw    03/07/2012 17:10:22         4

The reason for the error can be obtained with the following command:

qstat -j 79003 | grep error
  #
  # ...replacing the number with your jobid
  #
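
If several jobs are in the error state, the lookup above can be repeated for each of them. The following is a minimal sketch, assuming the default qstat column layout (job ID in the first field, state in the fifth); the function names eqw_job_ids and report_eqw_errors are hypothetical helpers, not part of SGE.

```shell
# Print the IDs of jobs whose state is exactly Eqw, reading qstat output
# from standard input. Assumes the default qstat column layout.
eqw_job_ids() {
    awk '$1 ~ /^[0-9]+$/ && $5 == "Eqw" { print $1 }'
}

# Show the error lines for every Eqw job. Calling this requires qstat,
# so it is only defined here and not run.
report_eqw_errors() {
    for id in $(qstat | eqw_job_ids); do
        echo "== job $id =="
        qstat -j "$id" | grep error
    done
}
```

Running report_eqw_errors on a login node would then print the error reason for each affected job in turn.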

The three most common reasons for a job in error state are:

Missing directory or file

  • A directory or file required by the job does not exist (note that the path in the error may be truncated; it is usually the working directory that has a problem):
    error reason 1: 03/09/2012 08:34:22 [1118:25139]: error: can't chdir to \
      /users01/username/mydir: No such file or dir
    
  • Where the directory or file does exist, it may be that you have used spaces or other characters in the name which Linux or the batch system is not able to interpret correctly. This is an issue commonly experienced by Windows users, so please consult our guide to using the CSF from Windows for more information.
  • Alternatively, check that you have not renamed or deleted the file/directory after submitting the job.
  • You will need to delete the job (qdel), correct the directory problem and resubmit.
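
Both problems described above (a missing path, or a name containing spaces) can be checked before submitting. The sketch below is one possible check; check_job_path is a hypothetical helper name, and it only tests for whitespace and existence, not for every character the batch system might reject.

```shell
# Warn about a path that contains whitespace or does not exist.
# Returns 0 if no problem was found, 1 otherwise.
check_job_path() {
    path=$1
    status=0
    case $path in
        *[[:space:]]*)
            echo "warning: '$path' contains whitespace" >&2
            status=1
            ;;
    esac
    if [ ! -e "$path" ]; then
        echo "warning: '$path' does not exist" >&2
        status=1
    fi
    return $status
}

# Example (not run here): check the current working directory
#   check_job_path "$PWD"
```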

Your submission script was created on Windows

  • You have used a file created on a Windows PC and uploaded it to the system, most likely your submission script. In this case you will see the following error:
    error reason 1: 03/08/2012 05:04:35 [1318:12643]: \
      execvp(/opt/gridware/ge/default/spool/node051/job_scripts/79003
    
  • You will need to delete the job (qdel), run the command below and then resubmit:
    dos2unix jobscript
               #
               # ...replacing 'jobscript' with the name of your jobscript.
               #
    
  • We recommend that you use gedit to create jobscripts (text files) directly on the systems and thus avoid this problem in the future.
  • Further details on this problem can be found in the guide dedicated to using the CIR from Windows
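
To find out whether a jobscript has Windows line endings before submitting it, you can look for carriage-return characters in the file. This is a minimal sketch; has_crlf is a hypothetical helper name, and 'jobscript' stands for the name of your own jobscript.

```shell
# Succeed (exit status 0) if the file contains a carriage-return
# character, which is what Windows line endings (CRLF) introduce.
has_crlf() {
    cr=$(printf '\r')
    grep -q "$cr" "$1"
}

# Example (not run here): convert only if needed
#   if has_crlf jobscript; then dos2unix jobscript; fi
```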

Out of disk space

  • If your group has run out of home disk space and you are using that area as your working directory then you will not be able to create new files. The error in this case will be similar to:
    error reason 1: 06/04/2013 13:51:27 [237659:20139]: error: can't open output file \
      "/users44/username/workingdir/myjob.out"
    
  • You will need to delete the job (qdel) and the group will need to reduce disk usage before you can resubmit.
  • You should review your own disk usage and delete files where possible as that may help.
  • If disk space remains an issue please contact its-ri-team@manchester.ac.uk.
  • It is better to run jobs from the scratch filesystem as there is a lot more space available there. See the suggested usage of filesystems documentation for more information.
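
When reviewing your own disk usage, it helps to see which files and directories take up the most space. The sketch below uses the standard du, sort and head commands; largest_entries is a hypothetical helper name, and it does not report your group's quota, only the sizes of the entries in one directory.

```shell
# List the largest entries in a directory, biggest first, with sizes
# in kilobytes. Arguments: directory (default: current directory) and
# number of entries to show (default: 10). Hidden files are not matched
# by the * pattern and so are not listed.
largest_entries() {
    dir=${1:-.}
    count=${2:-10}
    du -sk "$dir"/* 2>/dev/null | sort -rn | head -n "$count"
}

# Example (not run here): the five largest items in your home directory
#   largest_entries "$HOME" 5
```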

Last modified on November 16, 2017 at 4:19 pm by George Leaver