Policy on long-running jobs

The Condor pool contains a limited number of backbone nodes, described on our Understanding the Hardware page, which were purchased to allow researchers to run computations that are unsuitable for our standard Condor nodes. Compared with our standard nodes, the backbone nodes:

  • have more disk space
  • have more memory
  • stay in the Condor pool close to 100% of the time.

There are two issues relating to these backbone nodes which we need to address.

Firstly, some long-running jobs are submitted to the backbone nodes without checkpointing. Checkpointing allows a job to be resumed if it is terminated prematurely, e.g. when hardware fails or is switched off for maintenance. Without checkpointing, a job that is terminated before it completes has to start again from the beginning, so the work already done wastes hardware resources and electricity.

Therefore, we will start contacting users who submit long-running jobs without checkpointing to help them introduce checkpointing, where this is possible.
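As a rough illustration of the idea, the sketch below shows one way a job could checkpoint itself at the application level: it periodically writes its state to a file and, when restarted, picks up from the last saved state instead of from the beginning. This is a minimal sketch under stated assumptions, not the method Condor itself uses. The file name checkpoint.json, the function names, and the stop_after parameter (which only simulates a premature termination) are all hypothetical and chosen for illustration.

```python
import json
import os
import tempfile

CHECKPOINT_FILE = "checkpoint.json"  # hypothetical file name, for illustration only


def load_checkpoint(path):
    """Return the saved state, or a fresh starting state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_step": 0, "total": 0}


def save_checkpoint(path, state):
    """Write the state to a temporary file, then rename it over the checkpoint.

    Renaming is atomic, so an interruption cannot leave a half-written checkpoint.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def run(path, n_steps, checkpoint_every=100, stop_after=None):
    """Sum the integers 0..n_steps-1, saving a checkpoint every few steps.

    stop_after simulates a premature termination after that many steps
    in the current run (an assumption made purely for demonstration).
    """
    state = load_checkpoint(path)
    done_this_run = 0
    while state["next_step"] < n_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return None  # simulated interruption; unsaved progress is lost
        state["total"] += state["next_step"]
        state["next_step"] += 1
        done_this_run += 1
        if state["next_step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]


if __name__ == "__main__":
    print(run(CHECKPOINT_FILE, 1000))
```

If the job is interrupted and then started again with the same checkpoint file, only the steps since the last checkpoint are repeated. The checkpoint interval is a trade-off: saving more often loses less work on interruption but spends more time writing to disk.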

Secondly, at times the backbone nodes are filled by jobs that run for weeks or months before completing. When this occurs, queued jobs requiring backbone nodes may wait a considerable time before starting. Therefore, to ensure fairness for all users, we will start purging long-running jobs from the backbone nodes, but only when this is necessary to free up resources so that queued jobs can start. Realistically, any job that is unlikely to complete within 2 weeks and cannot use checkpointing is probably unsuitable for Condor.

If your research requires long-running jobs without checkpointing, please contact the team at its-ri-team@manchester.ac.uk to discuss your requirements.

For more information on checkpointing, see the Introduction to Checkpointing page.

Last modified on June 27, 2019 at 12:56 pm by Chris Heeley