High Throughput Computing using Condor

Postprocessing your result files

The problem

Most users queue up tens, hundreds or even thousands of jobs per submission. This produces numerous output files: typically one file of computed results per job, one error file per job, and either a single, merged log file or, again, one per job. You then need to copy all of these back to your own PC for post-processing, which can impact our network efficiency, and you must also remember to tidy up or, better still, delete the copies on submitter.

The idea

The above computing paradigm is called, or at least is very similar to, map-reduce (popularised by Google and others). The map is the set of results computed in parallel, and the reduce is the post-processing that distils them into exactly what is required. Herein, we propose a mechanism whereby the reduce step is carried out as part of the Condor submission. There is one caveat: as it is performed on our submit node it should take seconds, rather than minutes or longer. (In the rare case where the post-processing is long and complicated, we suggest you do it on your own PC.)

The implementation uses Condor's DAGMan feature. The Directed Acyclic Graph Manager is a powerful and complex beast, but it can be used in many simple ways, of which this is one. DAGMan is a workflow, or job, manager that sits on top of regular Condor. We ignore that central feature and use only its ability to attach a post-processing script to any stage. That is, we run one job (which queues as many parallel tasks as required) and attach a Bash script for Condor to execute on the submit node once all the results have been returned.

There is one small restriction when using DAGMan: there can be only one Queue statement in the submit script. Of course, it can still say Queue 10000 (or whatever number you need).

The example Bash script to do the desired postprocessing: postjob.sh

#!/bin/bash
cat job*.out > AggregatedOutFiles.txt
rm -f job*.out
for i in *.err; do if ! test -s "$i"; then rm -f "$i"; fi; done
exit 0

Remember, the above is just an example. It concatenates all the result files into a single file, deleting the originals, and finally deletes any error files that are zero length. The exit 0 at the end is important: without it, DAGMan may think your job has failed.
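If you want to try the post-processing logic before going anywhere near Condor, the following sketch exercises the same steps in a scratch directory. The file names and contents are made up purely for illustration; the three middle commands are exactly those in postjob.sh:

```shell
#!/bin/bash
# Work in a throwaway directory so nothing real is touched
workdir=$(mktemp -d)
cd "$workdir"

# Fake the files three parallel tasks would leave behind
for n in 0 1 2; do echo "result $n" > "job$n.out"; done
: > job0.err                 # empty error file (the task succeeded)
echo "oops" > job1.err       # non-empty error file (worth keeping)

# The same steps postjob.sh performs
cat job*.out > AggregatedOutFiles.txt
rm -f job*.out
for i in *.err; do if ! test -s "$i"; then rm -f "$i"; fi; done

cat AggregatedOutFiles.txt   # prints: result 0, result 1, result 2
```

Afterwards, AggregatedOutFiles.txt holds the merged results, the individual job*.out files are gone, the empty job0.err has been removed, and the non-empty job1.err survives for inspection.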

Make postjob.sh executable before you submit:

chmod +x ./postjob.sh

The submission DAG script: submit.dag

This should be run in a terminal window via condor_submit_dag submit.dag (or you can use the DropAndCompute interface, provided the script is called submit.dag). DAGMan logs its progress to submit.dag.dagman.out, which is the first place to look if anything goes wrong.

JOB A condor_job
SCRIPT POST A ./postjob.sh
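DAGMan can also pass information to a POST script. As a sketch of a possible extension (a hypothetical variant, not part of the download), the $RETURN macro expands to the exit code of the node's job, so postjob.sh could be written to behave differently when tasks failed:

```
JOB A condor_job
SCRIPT POST A ./postjob.sh $RETURN
```

The script would then receive the exit code as its first argument ($1 in Bash).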

The Condor job script example: condor_job

universe = vanilla
notification = never
requirements = (Arch == "X86_64") && (OpSys == "LINUX")
request_memory = 1
executable = job.sh
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
output = job$(Process).out
error  = job$(Process).err
log    = job.log
queue 10
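The output and error lines use Condor's $(Process) macro, which runs from 0 up to one below the queue count. Nothing below is Condor-specific; it is just a sketch listing the file names that queue 10 produces, which is why postjob.sh can safely glob on job*.out:

```shell
#!/bin/bash
# Enumerate the per-task files that "queue 10" generates via $(Process);
# all ten tasks share the single job.log.
outs=""
for p in 0 1 2 3 4 5 6 7 8 9; do
    outs="$outs job$p.out"
done
echo "Per-task output files:$outs"
echo "Shared log file: job.log"
```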

The example job script: job.sh

#!/bin/bash
# Pause for 10 seconds
sleep 10
# compute here - as a placeholder we just look up this machine's name
hostname
exit 0
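In a real workload, your computation goes where the sleep is. One common pattern is to have each task work on its own slice of the problem; the hypothetical variant below assumes you also add arguments = $(Process) to condor_job so each task receives its own index (this line is not in the submit file above):

```shell
#!/bin/bash
# Hypothetical job.sh variant, assuming "arguments = $(Process)" in the
# submit file: the task index arrives as $1. Remember that the real
# script should still end with "exit 0".
idx=${1:-0}                   # task index; defaults to 0 when run by hand
result=$((idx * idx))         # stand-in for the real computation
echo "task $idx computed $result"
```

Each of the ten tasks then writes its own answer to its own job$(Process).out, ready for postjob.sh to merge.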

The Download

All of the above scripts have been put together as a download: example_postprocess

To use this, transfer it to submitter and use the following commands:

unzip example_postprocess.zip
cd example_postprocess
condor_submit_dag submit.dag

Last modified on May 31, 2017 at 3:35 pm by Pen Richardson