How to be a good Condor Citizen
Introduction
These notes are intended to help you avoid wasting your time and the University’s electricity (= a lot of money) when using the Condor Pool.
If you use the general-purpose Condor nodes without checkpointing, then during normal term time some jobs may be evicted and restarted from scratch many times before running to completion, and others may be restarted repeatedly without ever completing. Consider the following example cases.
- A job starts running on Monday evening but needs longer than overnight to complete. The job is automatically evicted each weekday morning and restarts from scratch the next evening, until it finally completes over the weekend.
- A job needs 70 hours (i.e. longer than a weekend) to complete. It is restarted from scratch every weekday evening and never completes.
In addition, jobs can be evicted at any time, for example when someone reboots a teaching cluster PC into Windows to use it. Even jobs running on backbone nodes may be evicted if the backbone node fails or needs maintenance.
Clearly there is huge potential to waste electricity and cost the University a lot of money, and for you to wait for jobs that never complete.
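The two example cases above can be worked through with a short simulation. The sketch below is illustrative only: the window lengths (14 hours for each weekday night, 62 hours from Friday evening to Monday morning) are assumptions chosen to match the description, not the pool's actual schedule, and the function names are hypothetical. It also ignores the time needed to write and read checkpoints.

```python
from itertools import cycle

# ASSUMPTION: hours a node stays available before the job is evicted,
# starting from Monday evening. Four weekday nights, then one weekend.
WINDOWS = [14, 14, 14, 14, 62]


def run_without_checkpoint(job_hours, max_windows=20):
    """Job restarts from scratch after every eviction.

    Returns (completed, wasted_hours, windows_used).
    """
    wasted = 0
    for i, window in zip(range(max_windows), cycle(WINDOWS)):
        if job_hours <= window:
            return True, wasted, i + 1
        wasted += window  # everything computed in this window is lost
    return False, wasted, max_windows


def run_with_checkpoint(job_hours, max_windows=20):
    """Job resumes from its saved state after every eviction.

    Returns (completed, wasted_hours, windows_used).
    """
    remaining = job_hours
    for i, window in zip(range(max_windows), cycle(WINDOWS)):
        remaining -= window
        if remaining <= 0:
            return True, 0, i + 1
    return False, 0, max_windows


print(run_without_checkpoint(20))  # (True, 56, 5): done at the weekend, 56 h wasted
print(run_without_checkpoint(70))  # (False, 472, 20): never completes
print(run_with_checkpoint(70))     # (True, 0, 5): done during the first weekend
```

Under these assumptions the 20-hour job wastes 56 node-hours before it completes, and the 70-hour job consumes 472 node-hours in four weeks without producing a result.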
Best practice to get your results as soon as possible
- Testing: before undertaking full production runs (e.g. 1000s of jobs) see what happens when you queue up and run just one job. Repeat your testing whenever you have a new batch of jobs, e.g. a variable has changed or updated software has been installed.
- Log files: after every run has completed check the log files to see how long your job took and what resources it needed. Adjust your jobs accordingly if necessary.
- Error and Output files: read the error and output files to confirm that your jobs ran as expected.
- Checkpointing: if your job runs for more than an hour or two, use some form of checkpointing, i.e. save the state of your computation so that your job can continue from that point after eviction.
- condor_lint: condor_submit has been modified to produce advisory messages when you submit your job. Please read these messages and modify your job if appropriate.
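The checkpointing advice above can be illustrated with a minimal application-level sketch. It assumes a job whose state fits in a small JSON file; the file name, the function names and the `stop_after` parameter (used here only to simulate an eviction) are hypothetical. How the checkpoint file is preserved when a job is evicted depends on how the pool is set up and is not covered here.

```python
import json
import os


def load_state(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def save_state(state, path):
    """Write the checkpoint to a temporary file, then rename it into place.

    The rename is atomic, so an eviction in the middle of a write
    cannot leave a half-written checkpoint behind.
    """
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def run(n_steps, path="checkpoint.json", every=100, stop_after=None):
    """Sum the integers 0..n_steps-1, checkpointing every `every` steps.

    `stop_after` simulates an eviction after that many steps in this run.
    Returns the result, or None if the run was interrupted.
    """
    state = load_state(path)
    done_this_run = 0
    while state["step"] < n_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return None  # simulated eviction: work since last checkpoint is lost
        state["total"] += state["step"]
        state["step"] += 1
        done_this_run += 1
        if state["step"] % every == 0:
            save_state(state, path)
    save_state(state, path)
    return state["total"]
```

A first call such as `run(1000, stop_after=250)` stops early and leaves a checkpoint at step 200; calling `run(1000)` afterwards resumes from that step rather than from zero. The checkpoint interval is a trade-off: saving more often loses less work on eviction but spends more time writing files.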