Checkpointing

Introduction

Checkpointing is the act of saving enough program data to a file so that a computation can be stopped and later restarted from where it left off. Users of long-running computations (e.g. using Amber, Gaussian or MATLAB) should always use checkpointing.

One consideration when checkpointing is how often to save checkpoint data to disk. This frequency is problem-dependent and should be chosen to balance the following criteria:

  1. Ensure the time taken to write checkpoint data to disk is a small percentage of the total run time
  2. Ensure that significant computation is not wasted if a job ends unexpectedly

In general we recommend that users aim to checkpoint every hour on standard nodes and every 2-4 hours on backbone nodes, depending on how long it takes to write the data to disk; in our tests, for example, MATLAB saved 8GB of data as binary in 4 minutes. Checkpoint data should be saved in binary format: this ensures values are stored exactly, and binary files are smaller and quicker to read and write.
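
As an illustration, a minimal application-level checkpointing routine in C might look like the sketch below. The struct layout, the checkpoint.dat filename and the hourly interval are assumptions for illustration only: the state is written in binary with fwrite to a temporary file and then renamed into place, so an interrupted write cannot corrupt the previous checkpoint, and on start-up the program resumes from the checkpoint if one exists.

    /* Minimal application-level checkpointing sketch (illustrative only).
       The state struct, filename and interval are assumptions. */
    #include <stdio.h>
    #include <time.h>

    #define CHECKPOINT_FILE     "checkpoint.dat"
    #define CHECKPOINT_INTERVAL 3600   /* seconds between checkpoints (~hourly) */

    struct state {
        long   step;     /* how far the computation has progressed */
        double result;   /* running result of the computation */
    };

    /* Write the state in binary to a temporary file, then rename it into
       place so the previous checkpoint is replaced atomically. */
    static int save_checkpoint(const struct state *s)
    {
        FILE *f = fopen(CHECKPOINT_FILE ".tmp", "wb");
        if (!f || fwrite(s, sizeof *s, 1, f) != 1) {
            if (f) fclose(f);
            return -1;
        }
        fclose(f);
        return rename(CHECKPOINT_FILE ".tmp", CHECKPOINT_FILE);
    }

    /* Returns 1 if a previous checkpoint was found and loaded, 0 otherwise. */
    static int load_checkpoint(struct state *s)
    {
        FILE *f = fopen(CHECKPOINT_FILE, "rb");
        if (!f) return 0;
        int ok = fread(s, sizeof *s, 1, f) == 1;
        fclose(f);
        return ok;
    }

    int main(void)
    {
        struct state s = { 0, 0.0 };
        if (load_checkpoint(&s))
            printf("Resuming from step %ld\n", s.step);

        time_t last = time(NULL);
        for (; s.step < 1000000000L; s.step++) {
            s.result += 1.0 / (double)(s.step + 1);   /* stand-in for real work */

            if (difftime(time(NULL), last) >= CHECKPOINT_INTERVAL) {
                if (save_checkpoint(&s) != 0)
                    fprintf(stderr, "warning: checkpoint failed\n");
                last = time(NULL);
            }
        }
        printf("result = %f\n", s.result);
        return 0;
    }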

When implementing checkpointing, the submission script must include the has_checkpointing requirement and must also specify that output files are always transferred (on_exit_or_evict); see the C and MATLAB examples (below) for details.
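
The exact syntax belongs with the full C and MATLAB examples referred to above, but a minimal sketch of the relevant submit-description lines is given below. The executable name, the checkpoint filename and the exact ClassAd form of the has_checkpointing requirement are assumptions for illustration; when_to_transfer_output = ON_EXIT_OR_EVICT is the standard Condor setting that transfers output files when a job is evicted as well as when it exits.

    # Sketch of a vanilla universe submit description for a checkpointing job
    universe                = vanilla
    executable              = my_program              # hypothetical executable name
    transfer_output_files   = checkpoint.dat          # hypothetical checkpoint file
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT_OR_EVICT        # transfer output on exit or eviction
    requirements            = (HAS_CHECKPOINTING =?= True)   # assumed form of the requirement
    queue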

Vanilla Universe (application-level) checkpointing

Some software comes with checkpointing built in, so it is always worth checking the software's documentation to see whether this is available.

For additional help and advice please email its-ri-team@manchester.ac.uk.

Standard Universe checkpointing

If you can link your program against the Condor libraries and run it in the Standard Universe then checkpointing can be automatic; a relinking sketch is shown after the list below. This means Standard Universe checkpointing is only possible if you have access to the source code or to linkable object files. There are also two drawbacks:

  1. The program’s state includes its entire memory image, which may be very large. Our tests with Morphy, for example, show that approximately 2GB of data is required, so we cannot support checkpointing for hundreds of such jobs across our network simultaneously.
  2. There are a few restrictions on Standard Universe jobs, as described in the Condor documentation; for example, multi-process jobs are not allowed.
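
As a rough sketch (the program and file names are hypothetical), relinking against the Condor libraries is normally done with condor_compile, and the job is then submitted with universe = standard:

    # Relink the program against the Condor libraries (run at the shell)
    condor_compile gcc -o my_program my_program.c

    # Sketch of the corresponding submit description
    universe   = standard
    executable = my_program
    output     = my_program.out
    error      = my_program.err
    log        = my_program.log
    queue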
