FAQ
If you have a question not covered in the sections below, please contact us via the HPC Help Form, providing as much information as possible about the query.
- Why does the same job sometimes work and sometimes fail?
- Why is my job idle when there is at least one Unclaimed slot?
- Who can use the UoM Condor Pool and is there any charge?
- What should I do differently when just running a test job?
- How can I be notified via email that my job has finished?
- What is the H or Held state?
- Are there any privacy issues related to using Drop and Compute?
- Can I use DAGman with DropAndCompute?
- What is checkpointing and why is it so important?
- How do I stop large temporary files from being transferred back to the submitter?
- Does the UoM Condor Pool have a shared file system?
- Is it OK to run condor_q from a script?
- What does the classad HAS_CHECKPOINTING mean?
- What is the maximum amount of data I can transfer?
- Why do my seeded random numbers appear the same across the pool?
- How do I find out what versions of Java are available?
- What is the advice on Memory constraint in submit.txt files?
Why does the same job sometimes work and sometimes fail?
It could be one of many reasons. The best thing to do is add &&(HAS_STANDARD_IMAGE =?= True) to your Requirements line. This stops your job matching machines contributed to the Pool that may lack some of the libraries and software packages that our standard image provides and that your job needs.
Why is my job idle when there is at least one Unclaimed slot?
The most common answer is that there is no Unclaimed slot. The fact that condor_status reports one is an artefact of our using Condor’s dynamic slots feature. With this, the top level of a machine is a dummy placeholder that is always Unclaimed. On demand, it creates sub-slots, up to the number of cores of the PC, and these are the ones that become Claimed.
Another possibility is that you got the Request_Memory or Request_Disk lines wrong. This is explained here.
Who can use the UoM Condor Pool and is there any charge?
Any academic, researcher or postgraduate may use Condor. There is no charge.
What should I do differently when just running a test job?
It is always good practice to run a single test job (i.e. Queue 1) whenever you try something new or the Condor environment changes. Also, make sure you match to one of our standard Condor clients with HAS_STANDARD_IMAGE=?=True. The locally written script condor_lint can also be used to advise you on the correctness of your submit file.
How can I be notified via email that my job has finished?
Add the following line to your submit file:
notify_user = firstname.lastname@manchester.ac.uk
What is the H or Held state?
Jobs become Held when they match a machine but something then fails to execute. They then stay in this state indefinitely. Although it is technically possible to fix them and have them rescheduled, it is often better to remove them (condor_rm), fix the issue, and resubmit.
If you are a command line user, you can try the following to get some help as to why your job is on hold:
condor_q -long yourjobnumber | grep HoldReason
Typical reasons are that you forget to start your Bash Shell script with
#!/bin/bash
or you remember but have prepared your script under Windows, using say Notepad, and it has inserted a return character at the end of each line. This causes Linux to try and run a program called
/bin/bash\r
which then fails. (The free Windows download Notepad++.exe can be set to use Linux end-of-line convention — just Google for it.) Also, under Linux
dos2unix filename
can be used to convert text files from Windows end-of-line format to Linux.
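If dos2unix is not installed, the same check and conversion can be sketched with standard tools. The helper names has_cr and strip_cr below are made up for illustration; they are not Condor commands.

```shell
#!/bin/bash
# has_cr and strip_cr are hypothetical helper names, not Condor commands.

# Succeeds (exit status 0) if the file contains at least one carriage return.
has_cr() {
    grep -q "$(printf '\r')" "$1"
}

# Remove every carriage return from the file, keeping its permissions.
strip_cr() {
    tmp=$(mktemp) || return 1
    # Filter into a temporary file, then copy the contents back over the original.
    tr -d '\r' < "$1" > "$tmp" && cat "$tmp" > "$1"
    status=$?
    rm -f "$tmp"
    return $status
}
```

For example, has_cr myjob.sh && strip_cr myjob.sh converts the script only if it actually contains Windows line endings (myjob.sh is a placeholder name).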
For command line users the Held state can be used for more fine-grained control of how many jobs are allowed to run per cluster. That is, the cluster can be held via
condor_hold clusternum
and processes within that cluster released to run via
condor_release clusternum.processnum
or perhaps
for i in {0..99}; do condor_release clusternum.$i; done
to release several to run at a time.
Are there any privacy issues related to using Drop and Compute?
When you use the Dropbox-based DropAndCompute, your data passes through servers in the USA. Although this data is encrypted, and also somewhat transient, it is technically possible for the US Federal Government, under the US Patriot Act, to demand that Dropbox (the US registered company) provide them with your data.
If you are worried about this, please do not use DropAndCompute. (If you are also a regular Dropbox user, sensitive data should be encrypted via, e.g., mounting a Truecrypt volume within your Dropbox folder. Or, simply don’t use it.)
We also provide a local version of DropAndCompute that does not use Dropbox. More details are here, but basically you drag your submit.zip to the folder AutoSubmit at the top level of your account on submitter.itservices.manchester.ac.uk.
Can I use DAGman with DropAndCompute?
Yes. Name the main submission file submit.dag within your normal zipped up submit folder. If you prepare your files under Windows, please make sure you use an editor that does not insert ‘\r’ (return) characters at the end of text file lines.
What is checkpointing and why is it so important?
Checkpointing refers to the ability of a job to save enough of its current state, periodically or when evicted, so that it can be automatically restarted from the execution point it had reached when it is reallocated to another processor. It is clearly very important to use checkpointing for long running jobs so that the power consumed by a running Condor job is not wasted. Even for jobs that don’t run for long (say a couple of hours) it can save wastage because jobs can be evicted by the machine being re-booted, etc.
There are two main categories of checkpointing: fully automatic, as supported by Condor’s Standard Universe, and user-/application-level under the Vanilla Universe. While the former may seem the obvious one to use, it can be impractical in that the object code of the job needs to be relinked with the Condor libraries; for commercial applications (e.g. Mathematica) we do not have the ability to do this. (If you do use it with your bespoke code, please get in touch as we would like to document your usage as a case study.)
There is an example here; and how to program a form of checkpointing under the Vanilla Universe is also discussed in our Forum. We have had most success with Gaussian and MATLAB (and the latter will be further documented shortly), but we will help anyone who wishes to modify their application code to support checkpointing: please just get in touch. Finally, note that Condor will report, in the log file, that the job was not checkpointed; this refers to the fact that the automatic mechanism was not used. There is no mechanism to turn this message off because you are doing the checkpointing: just ignore it.
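As a rough illustration of application-level checkpointing under the Vanilla Universe, the sketch below saves a step counter after each unit of work and resumes from it when restarted. The function run_steps and the file names checkpoint.txt and results.log are assumptions made for this example; a real application would save whatever state it needs to continue.

```shell
#!/bin/bash
# Hypothetical application-level checkpointing: run_steps and the file
# names are illustrative only and are not part of Condor.

run_steps() {
    total=$1
    ckpt="checkpoint.txt"

    # Resume from the saved step if a checkpoint exists, otherwise start at 0.
    if [ -f "$ckpt" ]; then
        i=$(cat "$ckpt")
    else
        i=0
    fi

    while [ "$i" -lt "$total" ]; do
        echo "result of step $i" >> results.log   # stand-in for real work
        i=$((i + 1))
        # Write the new state to a temporary file first, then rename it,
        # so an interruption mid-write cannot leave a half-written checkpoint.
        echo "$i" > "$ckpt.tmp" && mv "$ckpt.tmp" "$ckpt"
    done
}
```

Calling run_steps 100 after an interruption at step 40 would carry on from step 40 rather than from the beginning. Note that this simple version can repeat one step if the job is interrupted between doing the work and saving the counter.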
How do I stop large temporary files from being transferred back to the submitter?
Good question, as this also helps everyone by reducing our network traffic!
Any output files that you create in the current directory, potentially including large temporary files, are transferred back on normal exit or eviction. The same applies to files created in sub-directories (also known as sub-folders) that existed prior to the job starting. If you want to make sure such large and unwanted files are not transferred back, your job can create a new sub-directory and then create the temporary files within it. Alternatively, simply delete them (with the Linux rm command) before your Bash script exits.
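A job script following this advice might look roughly like the sketch below, which combines both suggestions. The function run_job, the directory name and the file names are placeholders invented for the example.

```shell
#!/bin/bash
# Hypothetical job script layout; run_job and the file names are illustrative.

run_job() {
    # A sub-directory created by the job itself, after it has started.
    scratch="scratch.$$"
    mkdir "$scratch" || return 1

    # Stand-in for the real application writing a large temporary file.
    echo "intermediate data" > "$scratch/big_temp.dat"

    # Keep only the small result in the top-level directory.
    wc -c < "$scratch/big_temp.dat" > result.txt

    # Delete the temporary files as well, before the script exits.
    rm -rf "$scratch"
}
```

Ending the script with a call to run_job leaves only result.txt in the top-level directory when the job exits normally.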
Does the UoM Condor Pool have a shared file system?
No, not currently. Condor transfers the files around as needed (unless you are using the Standard Universe which uses remote procedure calls to access files on the submitter).
Is it OK to run condor_q from a script?
condor_q puts quite a load on the system. It is OK to run it from time to time, but please don’t put it in a script that runs it every second.
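If a script does need to wait for jobs to finish, one gentler pattern is to poll at a long, fixed interval. The sketch below is only an illustration: wait_for_jobs is a hypothetical helper, and it assumes the command passed to it prints one line per job still queued and nothing once none remain.

```shell
#!/bin/bash
# wait_for_jobs is a hypothetical helper. It polls at a fixed interval
# (in seconds) until the given command prints nothing.

wait_for_jobs() {
    interval=$1
    shift
    # "$@" is the command that lists the jobs still in the queue.
    while [ -n "$("$@")" ]; do
        sleep "$interval"
    done
}
```

For example, wait_for_jobs 600 condor_q -format "%d\n" ProcId clusternum would check every ten minutes, on the assumption that condor_q with -format prints nothing when no jobs in that cluster remain.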
What does the classad HAS_CHECKPOINTING mean?
HAS_CHECKPOINTING is straightforward to explain. You need to use it in your Requirements whenever you are using application-level (also known as user-level) checkpointing. Many of our client PCs support this classad.
What is the maximum amount of data I can transfer?
The short answer is approximately 5GB. However, apply some common sense and don’t replicate a job’s 5GB of data 100 times across the pool. Contact us if you need help, or for further details on how to minimise large data transfers.
Why do my seeded random numbers appear the same across the pool?
We have noticed this commonly with MATLAB and its rand function, but it is a general issue. Please see this article.
How do I find out what versions of Java are available?
At a command prompt you can enter:
condor_status -java
Most machines have version 1.6 or later, but a few have the much older 1.4 installed. You can, for example, add (JavaVersion >= "1.6") to your Requirements line.
What is the advice on Memory constraint in submit.txt files?
Jobs should specify an accurate estimate (and not mention Memory in Requirements):
Request_Memory = 3500
(unit is megabytes) as a separate line in the submit file. If you forget the Request_Memory line, your job may not run or you may have to wait a long time for it to run. The same holds for Disk requirements (Request_Disk=k where oddly k is in kilobytes).
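Putting this together, a minimal submit file might look like the one written by the sketch below. The function write_submit, the executable name myjob.sh, the output file names and the resource figures are all placeholders; other lines your job may need, such as file transfer settings, are left out.

```shell
#!/bin/bash
# write_submit is a hypothetical helper that writes an example submit file
# to the path given as its first argument. All names and figures are
# placeholders chosen for illustration.

write_submit() {
    cat > "$1" <<'EOF'
Universe = vanilla
Executable = myjob.sh
Output = myjob.out
Error = myjob.err
Log = myjob.log
Requirements = (HAS_STANDARD_IMAGE =?= True)
Request_Memory = 3500
Request_Disk = 2000000
Queue 1
EOF
}
```

Here Request_Memory asks for 3500 megabytes and Request_Disk for 2000000 kilobytes (roughly 2GB), while the Requirements line makes no mention of Memory.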