Getting data files selectively at runtime

The problem

Because we don’t have a shared file system across all the Condor nodes, input data files have to be copied, by Condor, to each execution node. Whilst files are of modest size this is not a problem; for larger files, however, network bandwidth can become the limiting factor, and this clearly impacts all users of our submitter.

To some extent this is something we have to live with: the input data files have to be local in the absence of a shared, network file system. Consider, though, what happens if you submit 8 jobs that each need to read a 1GB data file and, further, Condor decides to allocate your 8 jobs to 8 processor cores on the same client PC: your large data file is copied over to that PC 8 times. (In case you think this allocation strategy is unlikely, consider when our pool is small during the day or, for example, when you queue hundreds or thousands of jobs.) Other scenarios in which large files are copied to PCs that already have them are easy to envisage.

The idea

Our main submit node (submitter.itservices.manchester.ac.uk) runs a web server. You can use this to download a file selectively from submitter to the client PC. Typically, you would download it to /tmp if, and only if, it is not already there; you can then access it in place, create a link to it, or copy it to the current working directory Condor is using.
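As a minimal sketch of this check-then-fetch idea (the mirror script described below does this more robustly; the URL layout used here is an assumption for illustration, not the script's actual behaviour):

    #!/bin/bash
    FILE=BigDataFile
    # Hypothetical URL -- the path served by submitter's web server may differ
    URL="http://submitter.itservices.manchester.ac.uk/~zzab1234/CondorData/${FILE}"

    # Download to /tmp only if no copy is already there
    if [ ! -f "/tmp/${FILE}" ]; then
        wget -q -O "/tmp/${FILE}" "${URL}"
    fi

    # Use the cached copy via a link in Condor's working directory
    ln -s "/tmp/${FILE}" "${FILE}"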

The command commonly used under Linux to download a file from a web server is wget (curl can also be used). We have hidden the details of how to use wget in a shell script called mirror, which also handles simultaneous download attempts on the same PC, aborted downloads, changes to the source file that mean it needs to be downloaded again, and so on. Mirror takes two arguments: the first is your username (e.g. zzab1234) and the second is the name of the file. Many thanks to Chris Paul for writing and testing this script. From a machine on the campus network, you can download the shell script.

The Bash shell script code template

#!/bin/bash
# Fetch BigDataFile via the mirror script; it is downloaded only if a
# valid local copy is not already present
bash ./mirror username BigDataFile
# normal processing
# code here
exit 0

The submission script

The only change to your normal submit.txt is that you do not list for transfer any files fetched by this mirror method (you will, however, need to transfer the very small mirror shell script itself).
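For example, a minimal submit.txt might look like the following (the executable name and the output/error/log file names are placeholders):

    universe                = vanilla
    executable              = myjob.sh
    # transfer the small mirror script, but NOT BigDataFile
    transfer_input_files    = mirror
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    output                  = myjob.out
    error                   = myjob.err
    log                     = myjob.log
    queue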

Notes

The input data files for this method should be placed in a directory called CondorData (note the capital C and D) in your home directory on submitter.itservices.manchester.ac.uk. Your top-level directory (~) must have at least the following permissions: drwx-----x.

    chmod a+x ~

When you create CondorData, make sure it has permissions drwxr-xr-x.

    mkdir CondorData
    chmod a+rx CondorData

If you copy such data from /tmp to your current working directory, Condor will see it as an ‘output’ of your job and copy it back to submitter along with all your real output files (unless you place it in a sub-directory that did not exist when the job started). Where necessary, therefore, please include code to delete it at the end of your Bash script, e.g.:

    rm BigDataFile
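Alternatively, following the parenthetical above, a copy placed in a sub-directory created by the job is not transferred back, so no clean-up is needed (the directory name here is just an example):

    mkdir -p datadir                 # did not exist when the job started
    cp /tmp/BigDataFile datadir/
    # ... read datadir/BigDataFile during processing ...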

On the vast majority of PCs in the Condor Pool, /tmp is cleared out whenever a reboot occurs.

The approach of getting data from a web server can be used in general; that is, for any such data anywhere on the World Wide Web. However, if thousands of clients all request data at the same time, the supplying web site may well treat this as a denial-of-service attack.
