Postprocessing your result files
The problem
Most users queue up tens, hundreds or even thousands of jobs per submission. This produces numerous output files: typically one results file per job, one error file per job, and either a single merged log file or, again, one log per job. You then need to copy all of this back to your own PC for post-processing, which can impact our network efficiency, and you must also remember to tidy up or, better still, delete the copies left on submitter.
The idea
The computing paradigm above is called, or is at least very similar to, map-reduce (popularised by Google and others). The map step is the set of results computed in parallel, and the reduce step is the post-processing that distils them down to exactly what is required. Here we propose a mechanism whereby the reduce action is carried out as part of the Condor submission. There is one caveat: as it will be performed on our submit node, it should take seconds rather than minutes or longer. (In the rare case where the post-processing is long and complicated, we suggest you do it on your own PC.)
The implementation uses Condor's DAGMan feature. The Directed Acyclic Graph Manager is a powerful and complex beast, but it can be used in many simple ways, of which this is one. DAGMan is a workflow (job) manager that sits on top of regular Condor. We ignore that central feature and just use its ability to attach a post-processing script to any stage. That is, we run one job (which queues as many parallel tasks as required) and attach a Bash script for Condor to execute on the submit node once all the results have been returned.
There is one small restriction when using DAGMan: there can be only one Queue line in the submit script. Of course, it can still say Queue 10000 (or whatever).
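If your existing submit script uses several Queue statements (for example, one per argument set), you can usually collapse them into a single statement by parameterising the job with the $(Process) macro. A minimal sketch (how your executable interprets the argument is an assumption, not part of the example below):

# One Queue line replaces several: $(Process) takes the values 0, 1 and 2
executable = job.sh
arguments  = $(Process)
queue 3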
An example Bash script to do the desired post-processing: postjob.sh
#!/bin/bash
# Concatenate all the result files into one, then delete the originals
cat job*.out > AggregatedOutFiles.txt
rm -f job*.out
# Delete any error files that are empty (zero length)
for i in *.err; do if ! test -s "$i"; then rm -f "$i"; fi; done
# Exit 0 so DAGMan knows the post-processing succeeded
exit 0
Remember, the above is just an example. It concatenates all the result files into a single file, deleting the originals, and then deletes any error files that are zero length. The exit 0 at the end is important: without it DAGMan may think your job has failed.
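Because DAGMan treats a non-zero exit status from the POST script as a node failure, you can also use it to sanity-check the results before aggregating. A minimal sketch, assuming the 10 tasks queued in the example below (the expected count is an assumption):

#!/bin/bash
# Fail the DAG node if fewer output files came back than tasks queued
expected=10    # assumption: matches "queue 10" in the submit script
actual=$(ls job*.out 2>/dev/null | wc -l)
if [ "$actual" -ne "$expected" ]; then
    echo "Only $actual of $expected output files present" >&2
    exit 1
fi
cat job*.out > AggregatedOutFiles.txt
rm -f job*.out
exit 0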
Make postjob.sh executable before you submit:
chmod +x ./postjob.sh
The submission DAG script: submit.dag
This should be run in a terminal window via: condor_submit_dag submit.dag
(Alternatively, you can use the DropAndCompute interface, provided the script is called submit.dag.)
JOB A condor_job
SCRIPT POST A ./postjob.sh
The first line defines a single DAG node, A, whose job is described by the condor_job submit script; the second attaches postjob.sh as a POST script, which DAGMan runs on the submit node once every task queued by node A has finished.
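DAGMan supports many more directives than we use here. One that pairs naturally with this setup is automatic retry of a failed node; the retry count of 2 below is an arbitrary illustration:

JOB A condor_job
SCRIPT POST A ./postjob.sh
# Re-run node A up to twice if it (or its POST script) exits non-zero
RETRY A 2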
The Condor job script example: condor_job
universe = vanilla
notification = never
requirements = (Arch == "X86_64") && (OpSys == "LINUX")
request_memory = 1
executable = job.sh
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
output = job$(Process).out
error = job$(Process).err
log = job.log
queue 10
Note that $(Process) expands to the task number (0 to 9 here), so every task gets its own output and error file; these are exactly the job*.out files that postjob.sh aggregates.
The example job script: job.sh
#!/bin/bash
# Pause for 10 seconds
sleep 10
# compute here - we just look up the target's name
hostname
exit 0
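In a real workload each task would normally do different work. A hedged sketch of how job.sh might use a task number passed in via arguments = $(Process), as described earlier (the my_analysis program and the input file naming are hypothetical):

#!/bin/bash
# $1 is the task number (0, 1, 2, ...) supplied by arguments = $(Process)
task=$1
# hypothetical: each task processes its own slice of the input data
./my_analysis "input_${task}.dat"
exit 0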
The Download
All of the above scripts have been put together as a download: example_postprocess
To use this, transfer it to submitter and use the following commands:
unzip example_postprocess.zip
cd example_postprocess
condor_submit_dag submit.dag
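Once submitted, you can watch progress from the same terminal with standard Condor commands (the exact output varies by site):

condor_q                     # shows the DAGMan manager job and its tasks
tail job.log                 # per-task submit/execute/terminate events
cat AggregatedOutFiles.txt   # the merged results, once postjob.sh has run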