Multiple Similar Jobs – Job Arrays

 

CSF Users
Please do not run jobarrays in the short environment, even if your tasks have a short runtime. There are not enough cores in short for jobarrays.

 

Why?

Suppose you wish to run a large number of almost identical jobs – for example you may wish to run the same program many times with different arguments or parameters; or perhaps process a thousand different input files with the same application. You may have used a Condor pool to do this (where idle PCs on campus are used to run your jobs overnight) but systems such as the CSF, Redqueen and the DPSF can also run these High Throughput Computing jobs.

The wrong way to do this would be to write a Perl or Python script to generate all the required qsub jobscripts and then use a BASH script to submit them all (run qsub 1000s of times). This is not a good use of your time and it will do horrible things to the submit node (which manages the job queues) on a cluster. The sysadmins may will kill such jobs to keep the system running smoothly!

A much better way is to use an SGE Job Array. Simply put, a job array runs multiple copies (100s, 1000s, …) of your job in a way that places much less strain on the queue manager. You only write one jobscript and use qsub once to submit that job. Your jobscript includes a flag to say how many copies of it should be run. Each copy of the job that the system runs is given a unique task id. You use the task id in your jobscript to have each task do some unique work (e.g., each task reads from a different file or uses a different set of input parameters).

Using the unique task id in your jobscript is the key to writing a good job array script. You can be creative here – the task id can be used in many ways. Below we describe how to submit an SGE job comprising numerous serial (single core) tasks and also SMP (multicore) tasks and give several examples of how to use the task id in your jobscript.

Job runtime

Each task in an array gets a maximum runtime of 7 days. Job arrays are not terminated at the 7 day limit, they will remain in the system until all tasks complete.

Please note: On the CSF: job-arrays are not permitted in the short area. Please do not use short as a means of jumping the queue.

Job Array Basics

An SGE array job might be described as a job with a for-loop built in. Here is a simple example:

#!/bin/bash
#$ -cwd
#$ -V
# The 'myprog' below is serial hence no '-pe' option needed

#$ -t 1-1000
    # ...tell SGE that this is an array job, with "tasks" numbered from 1 
    #    to 1000...

./myprog < data.$SGE_TASK_ID > results.$SGE_TASK_ID

Computationally, this is equivalent to 1000 individual queue submissions in which $SGE_TASK_ID takes the values 1, 2, 3. . . 1000, and where input and output files are indexed by the ID. The SGE_TASK_ID variable is automatically set for you by the batch system when a particular task runs. Please note that for ”serial” jobs you don’t use a PE setting.

To submit the job simply issue one qsub command:

qsub jobscript

where jobscript is the name of your script above.

Multi-core (SMP) tasks (e.g., OpenMP jobs) can also be run in jobarrays. Each task will run your program with the requested number of cores. Simply add a -pe option to the jobscript and then tell your program how many cores it can use in the usual manner (see parallel job submission). Please be aware that each task will be requesting the specified resources (number of cores). It may take longer for each task to get through the batch queue, depending on how busy the system is.

An example SMP array job is given below:

#!/bin/bash
#$ -cwd
#$ -V
#$ -pe smp.pe 4    # Each task will use 4 cores

#$ -t 1-1000
    # ...tell SGE that this is an array job, with "tasks" numbered from 1 
    #    to 1000...

# My OpenMP program will read this variable to get how many cores to use.
# $NSLOTS is automatically set to the number specified on the -pe line above.
export OMP_NUM_THREADS=$NSLOTS

./myOMPprog < data.$SGE_TASK_ID > results.$SGE_TASK_ID

Again, simply submit the job once using qsub jobscript.

Job arrays have several advantages over submitting 100s or 1000s of individual jobs. In both of the above cases:

  • Only one qsub command is issued (and only one qdel command would be required to delete all tasks).
  • The batch system will try to run many of your tasks at once (possibly hundreds simultaneously for serial job arrays, depending on what we have set the limit to be according to demand on the system). So you get a lot more than one task running in parallel from just one qsub command. The system will churn through your tasks running them all as cores become free on the system, at which point the job is finished.
  • Only one entry appears to be queued (qw) in the qstat output for the job array, but each individual task running (r) will be visible. This makes reading your qstat output a lot easier than if you’d submitted 1000s of individual jobs.
  • The load on the SGE submit node (i.e., the cluster node responsible for managing the queues and scheduling which jobs run) is vastly less than that of submitting 1000 separate jobs.

There are many ways to use the $SGE_TASK_ID variable to supply a different input to each task and several examples are shown below.

A More General For Loop

It is not necessary that SGE_TASK_ID starts at 1; nor must the increment be 1. For example:

#$ -t 100-995:5

so that SGE_TASK_ID takes the values 100, 105, 110, 115... 995. However, the SGE_TASK_ID is not allowed to start at 0.

Incidently, in the case in which the upper-bound is not equal to the lower-bound plus an integer-multiple of the increment, for example

#$ -t 1-42:6

SGE automatically changes the upper bound, viz

prompt> qsub array.qsub
Your job-array 2642.1-42:6 ("array.qsub") has been submitted

prompt> qstat
job-ID   prior  name        user      state  submit/start at      queue    slots ja-task-ID 
-------------------------------------------------------------------------------------------
2642    0.00000 array.qsub  simonh    qw     04/24/2014 12:29:29               1 1-37:6

Note the 1-37 in the ja-task-ID column: the final task id has been adjusted to be 37 rather than 42. Hence the tasks ids used will be 1,7,13,19,25,31,37. Remember that the task id cannot start at zero so don’t be tempted to try #$ -t 0-42:6.

Related Environment Variables

There are three more automatically created environment variables one can use, as illustrated by this simple qsub script:

#!/bin/bash
#$ -cwd 
#$ -V

#$ -t 1-37:6

echo "The ID increment is: $SGE_TASK_STEPSIZE"

if [[ $SGE_TASK_ID == $SGE_TASK_FIRST ]]; then
    echo "first"
elif [[ $SGE_TASK_ID == $SGE_TASK_LAST ]]; then
    echo "last"
 else
    echo "neither"
fi

Note that the batch system will try to start your jobs in numerical order but there is no guarantee that they will finish in the same order — some tasks may take longer to run than others. So you cannot rely on the task with id $SGE_TASK_LAST being the last task to finish. Hence do not try something like:

# DO NOT do this in your jobscript - we may not be the last task to finish
if [[ $SGE_TASK_ID == $SGE_TASK_LAST ]]; then
  # Archive output files from all tasks (output.1, output.2, ...).
  tar czf ~/scratch/all-my-results.tgz output.*
    #
    # BAD: we may not be the last task to finish just because we are the last
    # BAD: task id. Hence we may miss some output files from other tasks that
    # BAD: are still running.
fi

The correct way to do something like this (where the work carried out by a task is dependent on other tasks having finished) is to use a job dependency which uses two separate jobs and automatically runs the second job only when the first job has completely finished. You would generally only use the $SGE_TASK_FIRST and $SGE_TASK_LAST variables where you wanted those tasks to do something different but where they are still independent of the other tasks.

Examples

We now show example job scripts which use the job array environment variables in various ways. All of the examples below are serial jobs (each task uses only one core) but you could equally use multicore (smp) jobs if your code/executable supports multicore. You should adapt these examples to your own needs.

A List of Input Files

One can be sneaky — suppose we have a list of input files, rather than input files explicitly indexed by suffix:

#!/bin/bash
#$ -cwd
#$ -V

#$ -t 1-42

# Task id 1 will read line 1 from my_file_list.txt
# Task id 2 will read line 2 from my_file_list.txt
# and so on...
# Each line contains the name of an input file to used by 'myprog'

INFILE=`awk "NR==$SGE_TASK_ID" my_file_list.txt`
    #
    # ...or used sed to print the n-th line of a file:
    #    
    #        INFILE=`sed -n "${SGE_TASK_ID}p" my_file_list.txt`
    #

./myprog < $INFILE

Bash Scripting and Arrays

Another way of passing different parameters to your application (e.g., to run the same simulation but with different input parameters) is to list all the parameters in a bash array and index in to the array. For example:

#!/bin/bash
#$ -cwd
#$ -V

#$ -t 1-10

# A bash array of my 10 input parameters
X_PARAM=( 3400 4500 9700 10020 20000 30000 40000 44400 50000 60910 )

# Bash arrays use zero-based indexing but you CAN'T use #$ -t 0-9 (0 is an invalid task id)
INDEX=$((SGE_TASK_ID-1))

# Run the app with one of the parameters
./myprog -xflag ${X_PARAM[$INDEX]} > output.${INDEX}.log

Running from Different Directories (simple)

Here we run each task in a separate directory (folder) that we create when each task runs. We run two applications from the jobscript – the first outputs to a file, the second reads that file as input and outputs to another file (your own applications may do something completely different). We run 1000 tasks, numbered 1…1000.

#!/bin/bash
#$ -cwd
#$ -V

#$ -t 1-1000

# Create a new directory for each task and go in to that directory
mkdir myjob-$SGE_TASK_ID
cd myjob-$SGE_TASK_ID

# Each task runs the same executables stored in the parent directory
../myprog-a.exe > a.output
../myprog-b.exe < a.output > b.output

In the above example all tasks use the same input and output filenames (a.output and b.output). This is safe because each task runs in its own directory.

Running from Different Directories (intermediate)

Here we use one of the techniques from above – read the directories we want to run in from a file. Task 1 will read line 1, task 2 reads line 2 and so on.

We assume the file contains sub-directory names. For example, suppose we are currently working in a directory named ~/scratch/jobs/ (it is in our scratch directory). The subdirectories are named after some property such as:

s023nn/arun1206/
s023nn/arun1207/
s023nn/brun1208/
s023nx/brun1201/
s023nx/crun1731/

and so on - it doesn't really matter what the subdirectories are called

The jobscript reads a line from the above list and cd‘s in to that directory:

#!/bin/bash
#$ -cwd                # Run from where we ran qsub
#$ -V

#$ -t 1-500            # my_dir_list.txt has 500 lines

# Task id 1 will read line 1 from my_dir_list.txt
# Task id 2 will read line 2 from my_dir_list.txt
# and so on...

SUBDIR=`sed -n "${SGE_TASK_ID}p" my_dir_list.txt`

# Go in to the subdirectory
cd $SUBDIR

# Run our code
./myprog < input.dat

The above script assumes that each subdirectory contains a file named input.dat which we process.

Running from Different Directories (advanced)

This example runs the same code but from different directories. Here we expect each directory to contain an input file. You can name your directories (and subdirectories) appropriately to match your experiments. We use BASH scripting to index in to arrays giving the names of directories. This example requires some knowledge of BASH but it should be straight forward to modify for your own work.

In this example we have the following directory structure (use whatever names are suitable for your code)

  • 3 top-level directories named: Helium, Neon, Argon
  • 2 mid-level directories named: temperature, pressure
  • 4 bottom-level directories named: test1, test2, test3, test4

So the directory tree looks something like:

|
+---Helium---+---temperature---+---test1
|            |                 +---test2
|            |                 +---test3
|            |                 +---test4
|            |
|            +------pressure---+---test1
|                              +---test2
|                              +---test3
|                              +---test4
|
+-----Neon---+---temperature---+---test1
|            |                 +---test2
|            |                 +---test3
|            |                 +---test4
|            |
|            +------pressure---+---test1
|                              +---test2
|                              +---test3
|                              +---test4
|
+----Argon---+---temperature---+---test1
             |                 +---test2
             |                 +---test3
             |                 +---test4
             |
             +------pressure---+---test1
                               +---test2
                               +---test3
                               +---test4

Hence we have 3*2*4=24 input files all named myinput.dat in paths such as

$HOME/scratch/chemistry/Helium/temperature/test1/myinput.dat
...
$HOME/scratch/chemistry/Helium/temperature/test4/myinput.dat
$HOME/scratch/chemistry/Helium/pressure/test1/myinput.dat
...
$HOME/scratch/chemistry/Neon/temperature/test1/myinput.dat
...
$HOME/scratch/chemistry/Argon/pressure/test4/myinput.dat

The following jobscript will run the executable mycode.exe in each path (so that we process all 24 input files). In this example the code is a serial code (hence no PE is specified).

#!/bin/bash
#$ -cwd
#$ -V

# This creates a job array of 24 tasks numbered 1...24 (IDs can't start at zero)
#$ -t 1-24

# Subdirectories will all have this common root (saves me some typing)
BASE=$HOME/scratch/chemistry

# Path to my executable
EXE=$BASE/exe/mycode.exe

# Arrays giving subdirectory names (note no commas - use spaces to separate)
DIRS1=( Helium Neon Argon )
DIRS2=( temperature pressure )
DIRS3=( test1 test2 test3 test4 )

# BASH script to get length of arrays
NUMDIRS1=${#DIRS1[@]}
NUMDIRS2=${#DIRS2[@]}
NUMDIRS3=${#DIRS3[@]}
TOTAL=$[$NUMDIRS1 * $NUMDIRS2 * $NUMDIRS3 ]
echo "Total runs: $TOTAL"

# Remember that $SGE_TASK_ID will be 1, 2, 3, ... 24.
# BASH array indexing starts from zero so decrment.
TID=$[SGE_TASK_ID-1]

# Create indices in to the above arrays of directory names.
# The first id increments the slowest, then the middle index, and so on.
IDX1=$[TID/$[NUMDIRS2*NUMDIRS3]]
IDX2=$[(TID/$NUMDIRS3)%$NUMDIRS2]
IDX3=$[TID%$NUMDIRS3]

# Index in to the arrays of directory names to create a path
JOBDIR=${DIRS1[$IDX1]}/${DIRS2[$IDX2]}/${DIRS3[$IDX3]}

# Echo some info to the job output file
echo "Running SGE_TASK_ID $SGE_TASK_ID in directory $BASE/$JOBDIR"

# Finally run my executable from the correct directory
cd $BASE/$JOBDIR
$EXE < myinput.dat > myoutput.dat

You may not need three levels of subdirectories and you’ll want to edit the names (BASE, EXE, DIRS1, DIRS2, DIRS3) and change the number of tasks requested.

To submit your job simply use qsub myjobscript.sh, i.e., you only submit a single jobscript.

Running MATLAB (lock file error)

If you wish to run compiled MATLAB code in a job array please see MATLAB job arrays (CSF documentation) for details of an extra environment variable needed to prevent a lock-file error. This is a problem in MATLAB when running many instances at the same time, which can occur if running from a job array.

Limit the number of tasks to be run at the same time

By default the batch system will try attempt to run as many tasks as possible concurrently. If you do not want this to happen you can limit how many tasks can be running at the same time with the -tc option. For example to limit it to 5 tasks:

#$ -tc 5

Job Dependencies with Job Arrays

It is possible to make a job wait for an entire job array to complete or to make the tasks of a job array wait for the corresponding task of another job array. It is also possible to make the tasks within a job array wait for other tasks (although this is limited). Examples are now given.

In the following examples we name each job using the -N flag to make the text more readable. This is optional. If you don’t name a job you should use the Job ID number when referring to previous jobs.

Wait for entire job array to finish

Suppose you want JobB to wait until all tasks in JobA have finished. JobB can be an ordinary job or another job array. But it will not run until all tasks in the job array JobA have finished. This is useful where you need to do something with the results of all tasks from a job array. Using a job dependency is the correct way to ensure all tasks in a job array have completed (using the last task in a job array to do some extra processing is incorrect because not all tasks may have finished even if they have all started).

------------------- Time --------------------->
JobA.task1 -----> End
JobA.task2 --------> End                    # JobA tasks can run in parallel
 ...
   JobA.taskN ----> End
                         JobB -----> End    # JobB won't start until all
                                            # of JobA's tasks have finished

Here is the jobscript for JobB – we use the -hold_jid flag to give the name of the job we should wait for.

#!/bin/bash
#$ -cwd
#$ -N JobB
#$ -hold_jid JobA      # We will wait for all of JobA's tasks to finish
./myapp.exe

Submit the jobs in the expected order and JobB will wait for JobA to finish.

qsub jobscript_a
qsub jobscript_b

Wait for individual tasks to finish

Suppose you have two job arrays to run, both with the same number of tasks. You want task 1 from JobB to run after task 1 from JobA has finished. Similarly you want task 2 from JobB to run after task 2 from JobA has finished. And so on. This allows you to pipeline tasks but still have them run independently and in parallel with other tasks.

----------------------- Time ----------------------->
JobA.task1 -----> End JobB.task1 ------> END
JobA.task2 -------> End JobB.task2 -------> END      # Tasks can run in parallel. JobA tasks
 ...                                                 # and JobB tasks form pipelines.
   JobA.taskN -----> End JobB.taskN ----> END

Use the -hold_jid_ad to set up an array dependency. Here are the two jobscripts:

JobA’s jobscript:

#!/bin/bash
#$ -cwd
#$ -N JobA            # We are JobA
#$ -t 1-20            # 20 Tasks in this example

./myAapp.exe data.$SGE_TASK_ID.in > data.$SGE_TASK_ID.A_result

JobB’s jobscript

#!/bin/bash
#$ -cwd
#$ -N JobB            # We are JobB
#$ -t 1-20            # 20 Tasks in this example (must be same as JobA)
#$ -hold_jid_ad JobA  # JobB.task1 waits only for JobA.task1 to finish and so on...
                      # (the _ad means array dependency)

./myBapp.exe data.$SGE_TASK_ID.A_result > data.$SGE_TASK_ID.B_result

Submit both jobs:

qsub jobscript_a
qsub jobscript_b

Tasks wait within a job array

It is possible to make the tasks within a jobarray wait for earlier tasks within the same jobarray. This is generally not recommend because it removes one of the advantages of job arrays – the ability to run independent jobs in parallel so that you get your results sooner. However, it does provide a method of easily submitting a large number of jobs where job N must wait for job N-1 to finish. We use the -tc flag to control the number of concurrent tasks that the job array can execute.

------------------------------- Time --------------------------------->
JobA.task1 -----> End
                     JobA.task2 -----> End
                                          ...
                                                  JobA.taskN -----> End

The -tc N allows the job array to run N tasks in the job array at the same time. Without the flag the batch system will attempt to run as many tasks as possible concurrently. If you set this to 1 you effectively make each task wait for the previous task to finish:

#!/bin/bash
#$ -cwd
#$ -t 1-10
#$ -tc 1     # Run only one task at a time
./myApp.exe data.$SGE_TASK_ID.in > data.$SGE_TASK_ID.out

Deleting Job Arrays

Deleting All Tasks

An entire job array can be deleted with one command. This will delete all tasks that are running and are yet to run from the batch system:

qdel 18305
     #
     # replace 18305 with your own job id number

Deleting Specific Tasks

Alternatively, it is possible to delete specific tasks while leaving other tasks running or in the queue waiting to run. You may wish to change some input files for these tasks, for example, or it might be because specific tasks have crashed and you need to delete them and then resubmit them. Simply add the -t taskrange flag to qdel where taskrange gives the tasks to delete. You must always give the job id followed by the tasks. For example:

  • To delete a single task (id 30) from a job (id 18205)
    qdel 18205 -t 30
    
  • To delete tasks 200-300 inclusive from a job (id 18205)
    qdel 18205 -t 200-300
    
  • To delete tasks 25,26,50,51,52,100 from a job (id 18205)
    qdel 18205 -t 25-26
    qdel 18205 -t 50-52
    qdel 18205 -t 100
    

    Note that it is not possible to use -t 25,26,50,51,52,100 to delete ad-hoc tasks.

Resubmitting Deleted/Failed Tasks

If you need to resubmit task ids that have been deleted, it is easiest to remove the #$ -t line from the jobscript and then specify the tasks on the qsub command line similar to the qdel command. For example, to resubmit the above six ad-hoc tasks:

# First remove '#$ -t start-end' from your jobscript
qsub -t 25-26 jobscript
qsub -t 50-52 jobscript
qsub -t 100 jobscript

Note that it is not possible to use -t 25,26,50,51,52,100 to submit ad-hoc tasks in one go.

Email from Jobarrays

As with ordinary batch jobs it is possible to have the job email you when it begins, ends or it aborts due to error. Unfortunately with a job array each task will email you. Hence you may receive 1000s of emails from a large job array. If this is what you require, please add the following to your jobscript:

#$ -M your.name@manchester.ac.uk       # Can use any address
#$ -m bea
      #
      # b = email when each job array task begins running
      # e = email when each job array task ends
      # a = email when each job array task aborts
      #
      # You can specify any one or more of b, e, a

It isn’t possible to have the job array email you only when the last task has finished. However, it is possible to do something very similar to this, as follows:

Email during last task

You can manually email yourself from the job array task with the highest (last) task id. There is no guarantee this will be the last task to finish (other tasks that started earlier may run for longer and finish later). But it will be the last task to start. Emailing from the last task to start would be close enough to also being the last task to finish in most cases. Do this as follows:

#$ -t 1-100

# Run some application in our jobscript
./my_app.exe data.$SGE_TASK_ID

# Send email at end of last task. Will usually be
# close enough to being the last task to finish.
if [[ $SGE_TASK_ID == $SGE_TASK_LAST ]]; then
  echo "Last task $SGE_TASK_ID in job $JOB_ID finished" | mail -s "CSF jobarray" $USER
fi

The email will be sent to your University email address.

Email from a Job Dependency after the Job Array

To guarantee that you receive an email after all tasks have finished you can submit a second serial job (not a job array) to the batch system, perhaps to the short area if available (CSF users only) that has a job dependency on the main job-array job. The second job will only run after all tasks in the jobarray have finished. Note however, that the second job may have to wait in the queue depending how busy the system is. So your email may arrive some extended time after the job array actually finished. To submit the jobs use the following command-lines:

# First, submit your job array as normal, noting the jobid
qsub my-jobarray.sh
Your job-array 869219.1-10:1 ("my-jobarray.sh") has been submitted
                 #
                 # Make a note of this job id

# Now immediately submit a serial job with a dependency (-hold_jid jobid) 
# and request that it emails you when it ends (-m e)
qsub -b y -hold_jid 869219 -m e -M $USER true
                       #                   #
                       #                   # this app simply returns 'no error'
                       #
                       # Use the job id from the previous job array

The second serial job will execute (it won’t do any useful work and will finish immediately) when the job array ends. It will send you an email when it has finished and so you will then know that the job array on which it was dependent has also finished. Note that the information in the email (wallclock time etc) is about the serial job, not the job array.

Further Information

More on SGE Job Arrays can be found at:

Last modified on July 25, 2018 at 8:34 am by Pen Richardson