High Throughput Computing using Condor

Transferring data between submitter and your job

It is essential to understand some features of the Condor system; otherwise your jobs may fail, and you may cause problems for other users:

  1. There is no network file system between submitter and the worker nodes. This places a limit on how much data you can transfer to and from your jobs, and it also means you may have to tell the system which files to transfer
  2. The file storage available to your Condor job is small, which places a limit on how much disk space you can use

The following information assumes you are using the vanilla universe, the most commonly used Condor universe. Information on the different universes can be found in the official Condor documentation.

Telling Condor which files to transfer

By default in the vanilla universe, Condor automatically copies only your executable to the worker node. This executable is defined in your Condor submission file; for example, see this tutorial, where the executable file myscript.sh is defined by the line

executable = myscript.sh

This default behaviour can be switched off by adding

transfer_executable = false

to the submission file.
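Disabling executable transfer is useful when the program is already present on the worker nodes. As a sketch (the path and universe line are illustrative, not from the tutorial):

```
# Sketch: run a program already installed on the worker nodes.
universe            = vanilla
executable          = /usr/local/bin/myprog   # hypothetical pre-installed program
transfer_executable = false                   # do not copy the executable over
queue
```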

Any additional input files should be defined in a comma separated list in the submission file. See for example this MATLAB tutorial, where the files hello and run_hello.sh are transferred to the Condor worker node by adding the following line to the submission file

transfer_input_files = hello,run_hello.sh

Adding the following line to the submission file

when_to_transfer_output = On_Exit

tells Condor to copy back all new files created in the working directory when the job terminates normally. If you also want data to be copied back when a job is evicted (which you must do if you have written your own checkpointing code), use

when_to_transfer_output = On_Exit_Or_Evict
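Putting these options together, a minimal vanilla-universe submit file might look like the following sketch; the file names are the examples used above, the output/error/log names are illustrative, and should_transfer_files = YES simply makes the file-transfer mechanism explicit:

```
# Minimal sketch of a vanilla-universe submit file using the options above.
universe                = vanilla
executable              = myscript.sh
transfer_input_files    = hello,run_hello.sh
should_transfer_files   = YES
when_to_transfer_output = On_Exit
output                  = job.out             # illustrative file names
error                   = job.err
log                     = job.log
queue
```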

By default Condor ignores files created in new sub-folders when copying data back to submitter, but we have documented how to handle sub-folders if your jobs require this. Because data in sub-folders is not copied back automatically, temporary files should be stored in sub-folders to ensure they are not accidentally copied back to submitter.
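For example, a job script might keep its scratch data in a sub-folder and write only the final results to the top level of the working directory; the file names below are illustrative:

```shell
#!/bin/sh
# Illustrative job script: temporary data goes in a sub-folder, which Condor
# does not copy back by default; the final result stays in the top level.
mkdir -p scratch                               # sub-folder in the working directory
echo "intermediate data" > scratch/step1.tmp   # temporary file, not transferred back
echo "final result" > results.dat              # new top-level file, transferred back
```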

Data transfer limits

All input files required by your job, and all output data required back on submitter, must be communicated over the University campus network. If this network is overloaded by large data transfers, your jobs may fail and problems may be caused for the Condor system (and even for other users of the University network), so great care must be taken when using the pool. In general, Condor is suitable for jobs which require 5GB or less of data transfer at the start and end of each job. Therefore please follow these rules:

  1. If your job generates temporary data which you do not require back on submitter, please ensure this data is deleted
    • Create temporary files in sub-folders in the Condor working directory; by default these files are not copied back to submitter and are deleted automatically when the job ends
    • If your job generates temporary data in /tmp, ensure this data is deleted even if your executable fails unexpectedly, e.g. by running a clean-up script after your main executable
  2. Please compress files before transfer
  3. Please use this method to reduce input data transfer
  4. If you are queuing significantly more than, say, 10 jobs that each download or generate hundreds or thousands of MB of data, please throttle the submit rate by using:
        condor_submit -a concurrency_limits=BIGDATA submit.txt
    

For the last case above, as well as BIGDATA, which is set to 10, we have THROTTLE20, THROTTLE50 and THROTTLE100 with the obvious values. Please choose and use them responsibly. Note also the lack of spaces in concurrency_limits=BIGDATA (or, if you prefer spaces around the equals sign, put single quotes around the entire argument to -a).
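For rule 2 above, one common approach is to bundle the inputs into a single compressed archive, transfer that one file, and unpack it at the start of the job. A sketch, reusing the example files from the MATLAB tutorial (the archive and directory names are illustrative):

```shell
#!/bin/sh
# Sketch: compress input files before transfer, then unpack inside the job.
echo demo > hello                 # stand-in input files so this sketch runs
echo demo > run_hello.sh
# On submitter: create one compressed archive, then list inputs.tar.gz in
# transfer_input_files instead of the individual files.
tar -czf inputs.tar.gz hello run_hello.sh
# In the job script: unpack before running.
mkdir -p jobdir && tar -xzf inputs.tar.gz -C jobdir
```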

Condor job file store limits

Most machines running Condor provide only up to 4GB of storage for the working directory, and this is shared amongst all the Condor jobs running on that machine. As there are typically 2-4 cores per processor, up to 4 jobs may be running concurrently. The Condor system is therefore generally suitable for jobs which require working file storage of only a few GB per job. Disk requirements should be defined, and it may be advantageous to run jobs requiring a large amount of disk on a whole machine rather than on a single core. If you have problems because you run out of disk storage, please contact the support team for advice or to use the backbone nodes.
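Disk requirements can be declared in the submit file with request_disk so that jobs are only matched to machines with enough free space. A sketch (the value is illustrative; note that when no unit is given, request_disk is interpreted in KiB):

```
# Ask for roughly 2GB of working-directory disk (illustrative value).
request_disk = 2 GB
```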

In addition, up to 16GB of storage is available in /tmp, which is wiped on reboot. This space is not managed by the Condor system, so all files created there should be deleted manually.
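A simple way to guarantee that /tmp data is removed even when the main program fails is a wrapper script with an exit trap. A sketch, where MAIN_EXE stands in for your real executable (the subshell is only there so the clean-up can be seen taking effect; in a real wrapper the trap alone is enough):

```shell
#!/bin/sh
# Sketch: remove job-private /tmp data on any exit, clean or not.
(
  SCRATCH=$(mktemp -d /tmp/condorjob.XXXXXX)   # job-private /tmp directory
  trap 'rm -rf "$SCRATCH"' EXIT                # runs on normal exit and on failure
  "${MAIN_EXE:-true}" "$SCRATCH"               # substitute your real executable here
)
```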

Last modified on May 31, 2017 at 3:14 pm by Pen Richardson