{"id":199,"date":"2018-09-04T17:56:05","date_gmt":"2018-09-04T16:56:05","guid":{"rendered":"http:\/\/ri.itservices.manchester.ac.uk\/csf3\/?page_id=199"},"modified":"2025-06-19T18:17:47","modified_gmt":"2025-06-19T17:17:47","slug":"job-arrays","status":"publish","type":"page","link":"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/batch\/job-arrays\/","title":{"rendered":"Job Arrays (Multiple similar jobs)"},"content":{"rendered":"<p><script type=\"text\/javascript\">\n    function toggle() {\n        var x = document.getElementById(\"hidetext\");\n        if (x.style.display === \"none\") {x.style.display = \"block\";}\n        else {x.style.display = \"none\";}\n    }\n<\/script><\/p>\n<div class=\"warning\">The SGE batch system has been shutdown and the CSF upgraded to use the Slurm batch system. Please read the <a href=\"\/csf3\/batch-slurm\">CSF3 Slurm documentation<\/a> instead.<\/p>\n<p>To display this old SGE page, <a href=\"javascript:toggle()\">click here<\/a>\n<\/div>\n<div id=\"hidetext\" style=\"display: none\">\n&nbsp;<\/p>\n<div class=\"warning\">\nPlease do not run jobarrays in the <code>short<\/code> environment, even if your tasks have a short runtime. There are not enough cores in <code>short<\/code> for jobarrays.<\/div>\n<h2>Why use a Job Array?<\/h2>\n<p>Suppose you wish to run a large number of almost identical jobs &#8211; for example you may wish to process a thousand different data files with the same application (e.g., processing 1000s of images with the same image-processing app). 
Or you may wish to run the same program many times with different arguments or parameters (e.g., to do a <em>parameter sweep<\/em> where you wish to find the <em>best<\/em> value of some variable.)<\/p>\n<p>You may have used the Condor pool to do this (where idle PCs on campus are used to run your jobs overnight) but the CSF can also run these <em>High Throughput Computing<\/em> jobs.<\/p>\n<h3>How NOT to do it<\/h3>\n<p>The <em>wrong<\/em> way to do this would be to write a script (using Perl, Python or BASH for example) to generate all the required <code>qsub<\/code> jobscripts and then use another BASH script to submit them all (running <code>qsub<\/code> 1000s of times). This is <em>not<\/em> a good use of your time and it will do horrible things to the submit node (which manages the job queues) on a cluster. The sysadmins may kill such jobs to keep the system running smoothly!<\/p>\n<div class=\"warning\">\nDo not run <code>qsub<\/code> 100s, 1000s, &#8230; of times to submit 100s, 1000s, &#8230; of individual jobs. This will strain the batch system. If you are about to do this, STOP. You should be using a job array instead. Please read through the examples below. <a href=\"\/csf3\/overview\/help\/\">Contact us<\/a> if you require further advice.<\/div>\n<h3>The right way to do it<\/h3>\n<p>A <em>much better<\/em> way is to use an SGE <em>Job Array<\/em>. Simply put, a job array runs multiple copies (100s, 1000s, &#8230;) of your job in a way that places much less strain on the queue manager. You only write <em>one<\/em> jobscript and use <code>qsub<\/code> <em>once<\/em> to submit that job.<\/p>\n<p>Your jobscript includes a flag to say how many copies of it should be run. Each copy of the job is given a unique <em>task<\/em> id. 
You use the task id in your jobscript to have each task do some unique work (e.g., each task processes a different data file or uses a different set of input parameters).<\/p>\n<p>Using the unique <em>task id<\/em> in your jobscript is the key to writing a good job array script. You can be creative here &#8211; the task id can be used in many ways.<\/p>\n<p>Below, we first describe how to submit an SGE job array that runs several <em>serial<\/em> (single core) <em>tasks<\/em>. We also show how to submit a job array that will run <em>SMP<\/em> (multicore) <em>tasks<\/em>. You can also submit job arrays that will run larger multi-node tasks. If you have access to the GPUs in the CSF, you can also submit job arrays to the GPU nodes. The majority of this page then gives examples of how to use the <em>task id<\/em> in your jobscript in different ways. <\/p>\n<p>We have users on the CSF who run job arrays with 10,000s of tasks &#8211; hence they are a very easy way of repeatedly running the same job on <em>lots<\/em> of different data files!<\/p>\n<h3>Job runtime<\/h3>\n<p>Each <em>task<\/em> in an array gets a maximum runtime of 7 days. Job arrays are not terminated at the 7 day limit; they will remain in the system until <em>all<\/em> tasks complete.<\/p>\n<p><strong>Please note:<\/strong> Job arrays are not permitted in the <code>short<\/code> area due to limited resources.<\/p>\n<h2>Job Array Basics<\/h2>\n<p>Here is a simple example of a job array &#8211; notice the <code>#$ -t 1-1000<\/code> line and the use of the special <code>$SGE_TASK_ID<\/code> variable. 
These will both be explained below.<\/p>\n<pre>#!\/bin\/bash --login\r\n#$ -cwd\r\n#$ -t 1-1000          # A job-array with 1000 \"tasks\", numbered 1...1000\r\n                      # NOTE: No #$ -pe line so each task will use 1-core by default.\r\n\r\n.\/myprog -in data.$SGE_TASK_ID.dat -out results.$SGE_TASK_ID.dat\r\n               #\r\n               # My input files are named: data.1.dat, data.2.dat, ..., data.1000.dat\r\n               # 1000 tasks (copies of this job) will run.\r\n               # Task 1 will read data.1.dat, task 2 will read data.2.dat, ... \r\n<\/pre>\n<p>Computationally, this is equivalent to 1000 individual queue submissions in which <code>$SGE_TASK_ID<\/code> takes the values <code>1, 2, 3. . .   1000<\/code>, and where input and output files have the task ID number in their name. Hence task 1 will read the file <code>data.1.dat<\/code> and write the results to <code>results.1.dat<\/code>. Task 2 will read <code>data.2.dat<\/code> and write the results to <code>results.2.dat<\/code> and so on, all the way up to task 1000.<\/p>\n<p>The <code>$SGE_TASK_ID<\/code> variable is automatically set for you by the batch system when a particular task runs. Please note that for <em>serial<\/em> jobs you don&#8217;t use a PE setting.<\/p>\n<p>To submit the job simply issue <em>one<\/em> <code>qsub<\/code> command:<\/p>\n<pre>qsub <em>jobscript<\/em>\r\n<\/pre>\n<p>where <em>jobscript<\/em> is the name of your script above.<\/p>\n<h3>Job Array Size Limit<\/h3>\n<p>The maximum number of <em>tasks<\/em> that can be requested in an array job on CSF3 is currently 75,000. 
For example:<\/p>\n<pre>\r\n#$ -t 1-75000         # Max job array size is currently 75000 on CSF3\r\n<\/pre>\n<p>If you request more than 75000 job array tasks you will receive the following error when trying to submit the job:<\/p>\n<pre>\r\nUnable to run job: job rejected: you tried to submit a job with more than 75000 tasks\r\nExiting.\r\n<\/pre>\n<h3>Multi-core Job Array Tasks<\/h3>\n<p>Multi-core (SMP) tasks (e.g., OpenMP jobs) can also be run in job arrays. <em>Each<\/em> task will run your program with the requested number of cores. Simply add a <code>-pe<\/code> option to the jobscript and then tell your program how many cores it can use in the usual manner (see <a href=\"\/csf3\/batch\/parallel-jobs\">parallel job submission<\/a>). Please be aware that each task will be requesting the specified resources (number of cores). It may take longer for each task to get through the batch queue, depending on how busy the system is.<\/p>\n<p>An example SMP array job is given below:<\/p>\n<pre>#!\/bin\/bash --login\r\n#$ -cwd\r\n#$ -pe smp.pe 4       # Each task will use 4 cores in this example\r\n#$ -t 1-1000          # A job-array with 1000 \"tasks\", numbered 1...1000\r\n\r\n# My OpenMP program will read this variable to get how many cores to use.\r\n# $NSLOTS is automatically set to the number specified on the -pe line above.\r\nexport OMP_NUM_THREADS=$NSLOTS\r\n\r\n.\/myOMPprog -in data.$SGE_TASK_ID.dat -out results.$SGE_TASK_ID.dat\r\n<\/pre>\n<p>Again, simply submit the job once using <code>qsub <em>jobscript<\/em><\/code>.<\/p>\n<h3>Advantages of Job Arrays<\/h3>\n<p>Job arrays have several advantages over submitting 100s or 1000s of individual jobs. 
In both of the above cases:<\/p>\n<ul class=\"gaplist\">\n<li>Only one <code>qsub<\/code> command is issued (and only one <code>qdel<\/code> command would be required to delete all tasks).<\/li>\n<li>The batch system will try to run many of your tasks at once (possibly hundreds simultaneously for serial job arrays, depending on the limit we have set according to demand on the system). So you get many tasks running in parallel from just one <code>qsub<\/code> command. The system will churn through your tasks, running them as cores become free; the job is finished once all tasks have completed.<\/li>\n<li>Only one entry appears to be queued (qw) in the <code>qstat<\/code> output for the job array, but each individual task running (r) will be visible. This makes reading your <code>qstat<\/code> output a lot easier than if you&#8217;d submitted 1000s of individual jobs.<\/li>\n<li>The load on the SGE submit node (i.e., the cluster node responsible for managing the queues and scheduling which jobs run) is vastly less than that of submitting 1000 separate jobs.<\/li>\n<\/ul>\n<p>There are many ways to use the <code>$SGE_TASK_ID<\/code> variable to supply a different input to each task and <a href=\"#examples\">several examples<\/a> are shown below.<\/p>\n<h2>More General Task ID Numbering<\/h2>\n<p>It is not necessary that <code>$SGE_TASK_ID<\/code> starts at <code>1<\/code>; nor must the increment be 1. The general format is:<\/p>\n<pre>#$ -t <em>start<\/em>-<em>end<\/em><em>:increment<\/em>\r\n         #             #      \r\n         #             # The default :increment is 1 if not supplied\r\n         #\r\n         # The start task id CANNOT be zero! It must be &gt;=1\r\n<\/pre>\n<p>For example:<\/p>\n<pre>#$ -t 100-995:5\r\n<\/pre>\n<p>so that <code>$SGE_TASK_ID<\/code> takes the values <code>100, 105, 110, 115... 995<\/code>.<\/p>\n<p><strong>Note:<\/strong> The <code>$SGE_TASK_ID<\/code> is <strong>not<\/strong> allowed to start at 0. 
The start value must be 1 or more.<\/p>\n<p>Incidentally, in the case in which the upper-bound is not equal to the lower-bound <strong>plus<\/strong> an integer-multiple of the increment, for example<\/p>\n<pre>#$ -t 1-42:6     # Tasks will be numbered 1, 7, 13, 19, 25, 31, <strong>37<\/strong> !!\r\n<\/pre>\n<p>SGE automatically changes the upper bound, but this is only visible when you run <code>qstat<\/code> to check on the status of your job.<\/p>\n<pre>[<em>username<\/em>@hlogin2 [csf3] ~]$ qsub array.qsub\r\nYour job-array 2642.1-<strong>42<\/strong>:6 (\"array.qsub\") has been submitted\r\n                      #\r\n                      # The qsub command simply reports what you requested in the jobscript\r\n\r\n[<em>username<\/em>@hlogin2 [csf3] ~]$ qstat\r\njob-ID   prior  name        user      state  submit\/start at      queue    slots ja-task-ID\r\n-------------------------------------------------------------------------------------------\r\n2642    0.00000 array.qsub  simonh    qw     04\/24\/2014 12:29:29               1 <strong>1-37<\/strong>:6\r\n                                                                                   #\r\n                                                 # The qstat command now shows the #\r\n                                                 # adjusted upper bound (37).\r\n<\/pre>\n<p>Note the <code>1-37<\/code> in the <code>ja-task-ID<\/code> column: the final task id has been adjusted to be 37 rather than 42. Hence the tasks ids used will be 1,7,13,19,25,31,<strong>37<\/strong>. 
Remember that the task id cannot start at zero so don&#8217;t be tempted to try <code>#$ -t 0-42:6<\/code>.<\/p>\n<h3>Ad-hoc Task ID Numbering<\/h3>\n<p>It is <strong>not possible<\/strong> to specify several ad-hoc task numbers &#8211; <strong>e.g., you CANNOT use: <\/strong> <code>-t 3,7,25,26,50,51,52,100<\/code> to run ad-hoc tasks.<\/p>\n<p>Instead, specify multiple single task-ids or small ranges on the <code>qsub<\/code> command-line:<\/p>\n<pre>\r\n# In the following, the -t on the qsub command-line will override the '#$ -t' in the jobscript:\r\nqsub -t 3 <em>myjobscript<\/em>          # Run task 3\r\nqsub -t 7 <em>myjobscript<\/em>          # Run task 7\r\nqsub -t 25-26 <em>myjobscript<\/em>      # Run tasks 25 and 26\r\nqsub -t 50-52 <em>myjobscript<\/em>      # Run tasks 50, 51 and 52\r\nqsub -t 100 <em>myjobscript<\/em>        # Run task 100\r\n<\/pre>\n<h3>Related Environment Variables<\/h3>\n<p>There are three more automatically created environment variables one can use, as illustrated by this simple qsub script:<\/p>\n<pre>#!\/bin\/bash --login\r\n#$ -cwd \r\n#$ -t 1-37:6          # Tasks will be numbered 1, 7, 13, 19, 25, 31, 37\r\n\r\n# This will report that we requested an increment of 6\r\necho \"The ID increment is: $SGE_TASK_STEPSIZE\"\r\n\r\n# These should be used with caution (see below for explanation)\r\nif [[ $SGE_TASK_ID == $SGE_TASK_FIRST ]]; then\r\n    echo \"first\"\r\nelif [[ $SGE_TASK_ID == $SGE_TASK_LAST ]]; then\r\n    echo \"last\"\r\n else\r\n    echo \"neither - I am task $SGE_TASK_ID\"\r\nfi\r\n<\/pre>\n<p>Note that the batch system will try to start your jobs in numerical order but there is no guarantee that they will finish in the same order \u2014 some tasks may take longer to run than others. So you cannot rely on the task with id <code>$SGE_TASK_LAST<\/code> being the last task to <em>finish<\/em>. 
Hence <strong>do not<\/strong> try something like:<\/p>\n<pre># <strong>DO NOT do this in your jobscript - we may not be the last task to finish!<\/strong>\r\nif [[ $SGE_TASK_ID == $SGE_TASK_LAST ]]; then\r\n  # Archive output files from all tasks (output.1, output.2, ...).\r\n  tar czf ~\/scratch\/all-my-results.tgz output.*\r\n    #\r\n    # BAD: we may not be the last task to finish just because we are the last\r\n    # BAD: task id. Hence we may miss some output files from other tasks that\r\n    # BAD: are still running.\r\nfi\r\n<\/pre>\n<p>The correct way to do something like this (where the work carried out by a task is dependent on other tasks having finished) is to use a <a href=\"\/csf3\/batch\/job-dependencies\/\">job dependency<\/a> which uses two separate jobs and automatically runs the second job only when the first job has completely finished. You would generally only use the <code>$SGE_TASK_FIRST<\/code> and <code>$SGE_TASK_LAST<\/code> variables where you wanted those tasks to do something different but where they are still independent of the other tasks.<\/p>\n<h2>Job output files<\/h2>\n<p><strong>Please read:<\/strong> a possible downside of job arrays is that <em>every<\/em> task generates its own <code><em>jobname<\/em>.o<em>NNNNNN<\/em>.<em>TTT<\/em><\/code> and <code><em>jobname<\/em>.e<em>NNNNNN<\/em>.<em>TTT<\/em><\/code> job output files. You get a <code>.o<\/code> and <code>.e<\/code> file for <em>every<\/em> task in your job array! If you have a large job array, you might end up with 10,000s of files in the directory (folder) from where you submitted the job. This can make managing your files very difficult.<\/p>\n<p>For example, if you submit the following job array named <code>myjobarray<\/code>:<\/p>\n<pre>#!\/bin\/bash\r\n#$ -cwd\r\n#$ -t 1-5000\r\nmodule load ...\r\n<em>theapp<\/em> ...\r\n<\/pre>\n<p>Then it will generate the following output files:<\/p>\n<pre># The job array created 10,000 output files! 
This is difficult to manage and can slow down your job!\r\nmyjobarray.o12345.1\r\nmyjobarray.o12345.2\r\n...\r\nmyjobarray.o12345.5000\r\nmyjobarray.e12345.1\r\nmyjobarray.e12345.2\r\n...\r\nmyjobarray.e12345.5000\r\n              #     #\r\n              #     # taskid number\r\n              #\r\n              # Jobid number\r\n<\/pre>\n<p>As you can see, there will be 10,000 files in total (5000 <code>.o<\/code> files and 5000 <code>.e<\/code> files). Any output from your job will be captured in these files. This at least ensures that no output will be lost and that the output from one task will never overwrite the output from another task.<\/p>\n<p>When a folder contains 1000s of files it can be slow to list all of the files and to read and write other files. Hence, you will be slowing down your own job if you allow a folder to contain 1000s of files!<\/p>\n<p>But it is often the case that the <code>.e<\/code> files are empty. For example:<\/p>\n<pre>ls -l myjobarray.e12345.1\r\n-rw-r--r-- 1 mabcxyz1 xy01 0 Aug 30 17:16 myjobarray.e12345.1\r\n                           #\r\n                           # This column shows the size, in this case 0 bytes means the file is empty\r\n<\/pre>\n<p>To reduce the number of files generated there are two things you can do:<\/p>\n<ol>\n<li>The first thing to do is <em>join<\/em> the <code>.o<\/code> and <code>.e<\/code> file for each task in to one file (the <code>.o<\/code> file) by adding the <code>#$ -j y<\/code> flag:\n<pre>#!\/bin\/bash --login\r\n#$ -cwd\r\n#$ -t 1-5000\r\n#$ -j y        # Yes, join the .o and .e file in to just the .o file.\r\n...\r\n<\/pre>\n<p>This will then cause the job to generate just the <code>.o<\/code> files:<\/p>\n<pre># The .e has been joined in to the .o file so now we \"only\" have 5000 output files (still a lot!)\r\nmyjobarray.o12345.1\r\nmyjobarray.o12345.2\r\n...\r\nmyjobarray.o12345.5000\r\n<\/pre>\n<p>Any error messages that would have gone to the <code>.e<\/code> files will 
instead be captured by the <code>.o<\/code> files, so you won&#8217;t lose any output by joining the files.<\/li>\n<li>A second thing to do is decide whether you really need the <code>.o<\/code> and <code>.e<\/code> files at all. If your app writes its data to a different file then both the <code>.o<\/code> and <code>.e<\/code> files will often be empty. In this case, you can completely disable the output of both files:\n<pre>#!\/bin\/bash --login\r\n#$ -cwd\r\n#$ -t 1-5000\r\n#$ -o \/dev\/null    # No .o files will be generated\r\n#$ -e \/dev\/null    # No .e files will be generated\r\n# You should check that your app will write its results to some other file!\r\n# For example:\r\n<em>theapp<\/em> -in <em>data<\/em>.$SGE_TASK_ID -out results.$SGE_TASK_ID\r\n<\/pre>\n<p>The use of the special <code>\/dev\/null<\/code> filename for the <code>.o<\/code> and <code>.e<\/code> filenames prevents the job from creating any such output files.<\/li>\n<\/ol>\n<p>Remember, managing your files when you have 10,000s of files in a single directory (folder) can be difficult and you will often slow down your own job if you allow a lot of files to be created. It can also impact the performance of the whole filesystem, which then affects other users of the service.<\/p>\n<p><a name=\"examples\"><\/a><\/p>\n<h2>Examples<\/h2>\n<p>We now show example job scripts which use the job array environment variables in various ways. All of the examples below are serial jobs (each task uses only one core) but you could equally use multicore (smp) jobs if your code\/executable supports multicore. You should adapt these examples to your own needs.<\/p>\n<h3>A List of Input Files<\/h3>\n<p>Suppose we have a <em>list<\/em> of input files, rather than input files explicitly indexed by a number. For example, your input files may have names such as:<\/p>\n<pre>C2H4O.dat\r\nNH3.dat\r\nC6H8O7.dat\r\nPbCl2.dat\r\nH2O.dat\r\n... 
and so on ...\r\n<\/pre>\n<p>The files do not have the convenient <code>1, 2, 3, ....<\/code> number sequence in their name. So how do we use a job array with these input files? We can put the names in to a simple text file, with one name per line (as above). We then ask the job array tasks to read a name from this <em>master list<\/em> of filenames. As you might expect, task number 1 will read the filename on line 1 of the master-list. Task number 2 will read the filename from line 2 of the master-list, and so on.<\/p>\n<pre>#!\/bin\/bash --login\r\n#$ -cwd\r\n#$ -t 1-42\r\n\r\n# Task id 1 will read line 1 from my_file_list.txt\r\n# Task id 2 will read line 2 from my_file_list.txt\r\n# and so on...\r\n# Each line contains the name of an input file to be used by 'my_chemistry_prog'\r\n\r\n# Use some Linux commands to save the filename read from 'my_file_list.txt' to\r\n# a script variable named INFILE that we can use in other commands.\r\nINFILE=`awk \"NR==$SGE_TASK_ID\" my_file_list.txt`\r\n    #\r\n    # Can also use another linux tool named 'sed' to get the n-th line of a file:\r\n    # INFILE=`sed -n \"${SGE_TASK_ID}p\" my_file_list.txt`\r\n   \r\n# We now use the <em>value<\/em> of our variable by using $INFILE.\r\n# In task 1, $INFILE will be replaced with C2H4O.dat\r\n# In task 2, $INFILE will be replaced with NH3.dat\r\n# ... and so on ...\r\n\r\n# Run the app with the .dat filename specific to this task\r\n.\/my_chemistry_prog -in $INFILE -out result.$INFILE\r\n<\/pre>\n<h3>Bash Scripting and Arrays<\/h3>\n<p>Another way of passing different parameters to your application (e.g., to run the same simulation but with different input parameters) is to list all the parameters in a bash array and index in to the array. 
For example:<\/p>\n<pre>#!\/bin\/bash --login\r\n#$ -cwd\r\n#$ -t 1-10           # Only 10 tasks in this example!\r\n\r\n# A bash array of my 10 input parameters\r\nX_PARAM=( 3400 4500 9700 10020 20000 30000 40000 44400 50000 60910 )\r\n\r\n# Bash arrays use zero-based indexing but you CAN'T use -t 0-9 above (0 is an invalid task id)\r\nINDEX=$((SGE_TASK_ID-1))\r\n\r\n# Run the app with one of the parameters\r\n.\/myprog -xflag ${X_PARAM[$INDEX]} &gt; output.${INDEX}.log\r\n<\/pre>\n<h3>Running from Different Directories (simple)<\/h3>\n<p>Here we run each task in a separate directory (folder) that we create when each task runs. We run two applications from the jobscript &#8211; the first outputs to a file, the second reads that file as input and outputs to another file (your own applications may do something completely different). We run 1000 tasks, numbered 1&#8230;1000.<\/p>\n<pre>#!\/bin\/bash --login\r\n#$ -cwd\r\n#$ -t 1-1000\r\n\r\n# Create a new directory for each task and go in to that directory\r\nmkdir myjob-$SGE_TASK_ID\r\ncd myjob-$SGE_TASK_ID\r\n\r\n# Each task runs the same executables stored in the parent directory\r\n..\/myprog-a.exe &gt; a.output\r\n..\/myprog-b.exe &lt; a.output &gt; b.output\r\n<\/pre>\n<p>In the above example all tasks use the same input and output filenames (<code>a.output<\/code> and <code>b.output<\/code>). This is safe because each task runs in its own directory.<br \/>\n<a name=\"dirlist\"><\/a><\/p>\n<h3>Running from Different Directories (intermediate)<\/h3>\n<p>Here we use one of the techniques from above &#8211; read the names of folders (directories) we want to run in from a file. Task 1 will read line 1, task 2 reads line 2 and so on.<\/p>\n<p>Create a simple text file (for example <code>my_dir_list.txt<\/code>) with each folder name you wish to run a job in listed on a new line.<\/p>\n<p>We assume the file contains <em>sub-directory<\/em> names. 
For example, suppose we are currently working in a directory named <code>~\/scratch\/jobs\/<\/code> (it is in our scratch directory). The subdirectories are named after some property such as:<\/p>\n<pre>s023nn\/arun1206\/\r\ns023nn\/arun1207\/\r\ns023nn\/brun1208\/\r\ns023nx\/brun1201\/\r\ns023nx\/crun1731\/\r\n\r\n<em>and so on - it doesn't really matter what the subdirectories are called.<\/em>\r\nNote that in this example we assume there are <strong>500 lines<\/strong> (i.e. <strong>500 directories<\/strong>)\r\nin this file. This tells us how many tasks to run in the job array.\r\n<\/pre>\n<p>The jobscript reads a line from the above list and <em>cd<\/em>&#8216;s in to that directory:<\/p>\n<pre>#!\/bin\/bash --login\r\n#$ -cwd                # Run from where we ran qsub\r\n#$ -t 1-<strong>500<\/strong>            # Assuming <strong>my_dir_list.txt has 500 lines<\/strong>\r\n\r\n# Task id 1 will read line 1 from my_dir_list.txt\r\n# Task id 2 will read line 2 from my_dir_list.txt\r\n# and so on...\r\n\r\n# This time we use the 'sed' command but we could use the 'awk' command (see earlier).\r\n# Task 1 will read line 1 of my_dir_list.txt, task 2 will read line 2, ...\r\n# Assign the name of this task's sub-directory to a variable named 'SUBDIR'\r\nSUBDIR=`sed -n \"${SGE_TASK_ID}p\" my_dir_list.txt`\r\n\r\n# Go in to the sub-directory for this task by reading the <em>value<\/em> of the variable\r\ncd $SUBDIR\r\n  #\r\n  # You could use the subdir name to form a longer path, for example:\r\n  # cd ~\/scratch\/myjobs\/medium_dataset\/$SUBDIR\r\n\r\n# Run our code. Each sub-directory contains a file named input.dat. \r\n.\/myprog -in input.dat\r\n<\/pre>\n<p>The above script assumes that each subdirectory contains a file named <code>input.dat<\/code> which we process.<\/p>\n<h3>Running from Different Directories (advanced)<\/h3>\n<p>This example runs the same code but from different directories. Here we expect each directory to contain an input file. 
You can name your directories (and subdirectories) appropriately to match your experiments. We use BASH scripting to index in to arrays giving the names of directories. This example requires some knowledge of BASH but it should be straightforward to modify for your own work.<\/p>\n<p>In this example we have the following directory structure (use whatever names are suitable for your code):<\/p>\n<ul>\n<li>3 top-level directories named: <code>Helium<\/code>, <code>Neon<\/code>, <code>Argon<\/code><\/li>\n<li>2 mid-level directories named: <code>temperature<\/code>, <code>pressure<\/code><\/li>\n<li>4 bottom-level directories named: <code>test1<\/code>, <code>test2<\/code>, <code>test3<\/code>, <code>test4<\/code><\/li>\n<\/ul>\n<p>So the directory tree looks something like:<\/p>\n<pre>|\r\n+---Helium---+---temperature---+---test1\r\n|            |                 +---test2\r\n|            |                 +---test3\r\n|            |                 +---test4\r\n|            |\r\n|            +------pressure---+---test1\r\n|                              +---test2\r\n|                              +---test3\r\n|                              +---test4\r\n|\r\n+-----Neon---+---temperature---+---test1\r\n|            |                 +---test2\r\n|            |                 +---test3\r\n|            |                 +---test4\r\n|            |\r\n|            +------pressure---+---test1\r\n|                              +---test2\r\n|                              +---test3\r\n|                              +---test4\r\n|\r\n+----Argon---+---temperature---+---test1\r\n             |                 +---test2\r\n             |                 +---test3\r\n             |                 +---test4\r\n             |\r\n             +------pressure---+---test1\r\n                               +---test2\r\n                               +---test3\r\n                               +---test4\r\n<\/pre>\n<p>Hence we have 3 x 2 x 4 = 24 input files all named 
<code>myinput.dat<\/code> in paths such as<\/p>\n<pre>$HOME\/scratch\/chemistry\/Helium\/temperature\/test1\/myinput.dat\r\n...\r\n$HOME\/scratch\/chemistry\/Helium\/temperature\/test4\/myinput.dat\r\n$HOME\/scratch\/chemistry\/Helium\/pressure\/test1\/myinput.dat\r\n...\r\n$HOME\/scratch\/chemistry\/Neon\/temperature\/test1\/myinput.dat\r\n...\r\n$HOME\/scratch\/chemistry\/Argon\/pressure\/test4\/myinput.dat\r\n<\/pre>\n<p>The following jobscript will run the executable <code>mycode.exe<\/code> in each path (so that we process all 24 input files). In this example the code is a serial code (hence no PE is specified).<\/p>\n<pre>#!\/bin\/bash --login\r\n#$ -cwd\r\n\r\n# This creates a job array of 24 tasks numbered 1...24 (IDs can't start at zero)\r\n#$ -t 1-24\r\n\r\n# Subdirectories will all have this common root (saves me some typing)\r\nBASE=$HOME\/scratch\/chemistry\r\n\r\n# Path to my executable\r\nEXE=$BASE\/exe\/mycode.exe\r\n\r\n# Arrays giving subdirectory names (note no commas - use spaces to separate)\r\nDIRS1=( Helium Neon Argon )\r\nDIRS2=( temperature pressure )\r\nDIRS3=( test1 test2 test3 test4 )\r\n\r\n# BASH script to get length of arrays\r\nNUMDIRS1=${#DIRS1[@]}\r\nNUMDIRS2=${#DIRS2[@]}\r\nNUMDIRS3=${#DIRS3[@]}\r\nTOTAL=$[$NUMDIRS1 * $NUMDIRS2 * $NUMDIRS3 ]\r\necho \"Total runs: $TOTAL\"\r\n\r\n# Remember that $SGE_TASK_ID will be 1, 2, 3, ... 
24.\r\n# BASH array indexing starts from zero so decrement.\r\nTID=$[SGE_TASK_ID-1]\r\n\r\n# Create indices in to the above arrays of directory names.\r\n# The first index increments the slowest, then the middle index, and so on.\r\nIDX1=$[TID\/$[NUMDIRS2*NUMDIRS3]]\r\nIDX2=$[(TID\/$NUMDIRS3)%$NUMDIRS2]\r\nIDX3=$[TID%$NUMDIRS3]\r\n\r\n# Index in to the arrays of directory names to create a path\r\nJOBDIR=${DIRS1[$IDX1]}\/${DIRS2[$IDX2]}\/${DIRS3[$IDX3]}\r\n\r\n# Echo some info to the job output file\r\necho \"Running SGE_TASK_ID $SGE_TASK_ID in directory $BASE\/$JOBDIR\"\r\n\r\n# Finally run my executable from the correct directory\r\ncd $BASE\/$JOBDIR\r\n$EXE &lt; myinput.dat &gt; myoutput.dat\r\n<\/pre>\n<p>You may not need three levels of subdirectories and you&#8217;ll want to edit the names (BASE, EXE, DIRS1, DIRS2, DIRS3) and change the number of tasks requested.<\/p>\n<p>To submit your job simply use <code>qsub myjobscript.sh<\/code>, i.e., you only submit a single jobscript.<\/p>\n<h2>Running MATLAB (lock file error)<\/h2>\n<p>If you wish to run compiled MATLAB code in a job array please see <a href=\"\/csf3\/software\/applications\/matlab\/compiling-matlab\/#MATLAB_Job_Array_Error\">MATLAB job arrays<\/a> (CSF documentation) for details of an extra environment variable needed to prevent a lock-file error. This is a problem in MATLAB when running many instances at the same time, which can occur if running from a job array.<\/p>\n<p>Our MATLAB documentation also contains an <a href=\"\/csf3\/software\/applications\/matlab\/compiling-matlab\/#Using_the_Job_Array_task_ID_as_a_Command-line_Arg\">example<\/a> of passing the <code>$SGE_TASK_ID<\/code> value to your MATLAB code.<br \/>\n<a name=\"jobdeps\"><\/a><\/p>\n<h2>Limit the number of tasks to be run at the same time<\/h2>\n<p>By default the batch system will attempt to run as many tasks as possible concurrently. 
If you do not want this to happen you can limit how many tasks can be running at the same time with the <code>-tc<\/code> option. For example to limit it to 5 tasks:<\/p>\n<pre>#$ -tc 5\r\n<\/pre>\n<h2>Job Dependencies with Job Arrays<\/h2>\n<p>It is possible to make a job wait for an entire job array to complete or to make the tasks of a job array wait for the corresponding task of another job array. It is also possible to make the tasks within a job array wait for other tasks in the same job array (although this is limited). Examples are now given.<\/p>\n<p>In the following examples we name each job using the <code>-N<\/code> flag to make the text more readable. This is optional. If you don&#8217;t name a job you should use the <em>Job ID<\/em> number when referring to previous jobs.<\/p>\n<h3>Wait for entire job array to finish<\/h3>\n<p>Suppose you want <code>JobB<\/code> to wait until <em>all<\/em> tasks in <code>JobA<\/code> have finished. JobB can be an ordinary job or another job array. But it will not run until <em>all<\/em> tasks in the job array <code>JobA<\/code> have finished. This is useful where you need to do something with the results of all tasks from a job array. Using a job dependency is the correct way to ensure that all tasks in a job array have completed (using the last task in a job array to do some extra processing is incorrect because not all tasks may have <em>finished<\/em> even if they have all <em>started<\/em>).<\/p>\n<pre>------------------- Time ---------------------&gt;\r\nJobA.task1 -----&gt; End   |\r\nJobA.task2 --------&gt; End|                     # JobA tasks can run in parallel\r\n ...                    
|\r\n   JobA.taskN ----&gt; End |\r\n                        |JobB -----&gt; End      # JobB won't start until <strong>all<\/strong> of\r\n                        |                     # JobA's tasks have finished\r\n<\/pre>\n<p>Here is the jobscript for <code>JobB<\/code> &#8211; we use the <code>-hold_jid<\/code> flag to give the name of the job we should wait for.<\/p>\n<pre>#!\/bin\/bash --login\r\n#$ -cwd\r\n#$ -N JobB\r\n#$ -hold_jid <strong>JobA<\/strong>      # We will wait for all of JobA's tasks to finish\r\n.\/myapp.exe\r\n<\/pre>\n<p>Submit the jobs in the expected order and JobB will wait for JobA to finish.<\/p>\n<pre>qsub jobscript_a                   # The job-array\r\nqsub jobscript_b                   # The job that will wait for the job-array to finish before starting\r\n<\/pre>\n<h3>Wait for individual tasks to finish<\/h3>\n<p>Suppose you have two job arrays to run, both with the same number of tasks. You want task 1 from JobB to run after task 1 from JobA has finished. Similarly you want task 2 from JobB to run after task 2 from JobA has finished. And so on. This allows you to pipeline tasks but still have them run independently and in parallel with other tasks.<\/p>\n<pre>----------------------- Time -----------------------&gt;\r\nJobA.task1 -----&gt; End JobB.task1 ------&gt; END\r\nJobA.task2 -------&gt; End JobB.task2 -------&gt; END      # Tasks can run in parallel. JobA tasks\r\n ...                                                 # and JobB tasks form pipelines.\r\n   JobA.taskN -----&gt; End JobB.taskN ----&gt; END\r\n<\/pre>\n<p>Use the <code>-hold_jid_ad<\/code> to set up an <em>array dependency<\/em>. 
Here are the two jobscripts:<\/p>\n<p>JobA&#8217;s jobscript:<\/p>\n<pre>#!\/bin\/bash --login\r\n#$ -cwd\r\n#$ -N JobA            # We are JobA\r\n#$ -t 1-20            # 20 Tasks in this example\r\n\r\n.\/myAapp.exe data.$SGE_TASK_ID.in &gt; data.$SGE_TASK_ID.A_result\r\n<\/pre>\n<p>JobB&#8217;s jobscript:<\/p>\n<pre>#!\/bin\/bash --login\r\n#$ -cwd\r\n#$ -N JobB            # We are JobB\r\n#$ -t 1-20            # 20 Tasks in this example (must be same as JobA)\r\n#$ -hold_jid<strong>_ad<\/strong> JobA  # JobB.task1 waits only for JobA.task1 to finish and so on...\r\n                      # (the _ad means <em>array dependency<\/em>)\r\n\r\n.\/myBapp.exe data.$SGE_TASK_ID.A_result &gt; data.$SGE_TASK_ID.B_result\r\n<\/pre>\n<p>Submit both jobs:<\/p>\n<pre>qsub jobscript_a\r\nqsub jobscript_b\r\n<\/pre>\n<h3>Tasks wait within a job array<\/h3>\n<p>It is possible to make the tasks within a job array wait for earlier tasks within the <em>same<\/em> job array. This is generally not recommended because it removes one of the advantages of job arrays &#8211; the ability to run independent jobs in parallel so that you get your results sooner. However, it does provide a method of easily submitting a large number of jobs where job <em>N<\/em> must wait for job <em>N-1<\/em> to finish. We use the <code>-tc<\/code> flag to control the number of concurrent tasks that the job array can execute.<\/p>\n<pre>------------------------------- Time ---------------------------------&gt;\r\nJobA.task1 -----&gt; End\r\n                     JobA.task2 -----&gt; End\r\n                                          ...\r\n                                                  JobA.taskN -----&gt; End\r\n<\/pre>\n<p>The <code>-tc <em>N<\/em><\/code> flag allows the job array to run at most <code><em>N<\/em><\/code> tasks at the same time. Without the flag the batch system will attempt to run as many tasks as possible concurrently. 
If you set this to <code>1<\/code> you effectively make each task wait for the previous task to finish:<\/p>\n<pre>#!\/bin\/bash --login\r\n#$ -cwd\r\n#$ -t 1-10\r\n#$ -tc 1     # Run only one task at a time\r\n.\/myApp.exe data.$SGE_TASK_ID.in &gt; data.$SGE_TASK_ID.out\r\n<\/pre>\n<h3>Capturing the Job ID<\/h3>\n<p>When submitting a job-array, the job ID returned by the <code>-terse<\/code> flag includes some extra information about the number of tasks and the increment of the task counter:<\/p>\n<pre>\r\n# When submitting a job-array, info about the number of tasks and task increment is returned\r\nqsub -terse -t 1-100 jobscript-array1.sh\r\n<em><strong>129674.1-100:1<\/strong><\/em>\r\n\r\n# To capture only the jobid, use the cut command to remove the extra info:\r\nqsub -terse -t 1-100 jobscript-array1.sh <strong>| cut -d. -f1<\/strong>\r\n<em><strong>129674<\/strong><\/em>\r\n<\/pre>\n<p>You can use the job ID of the job array to programmatically make a second job wait for the earlier job. For example:<\/p>\n<pre>\r\n# Submit a job array and capture its jobid\r\nJID=$(qsub -terse -t 1-100 jobscript-array1.sh | cut -d. -f1)\r\n\r\n# Now submit a normal job that waits for all tasks in the job array to finish\r\nqsub -hold_jid $JID second-job.sh\r\n<\/pre>\n<h2>Deleting Job Arrays<\/h2>\n<h3>Deleting All Tasks<\/h3>\n<p>An entire job array can be deleted with one command. This will remove from the batch system all tasks that are running or are yet to run:<\/p>\n<pre>qdel 18305\r\n     #\r\n     # replace 18305 with your own job id number\r\n<\/pre>\n<h3>Deleting Specific Tasks<\/h3>\n<p>Alternatively, it is possible to delete specific tasks while leaving other tasks running or in the queue waiting to run. You may wish to change some input files for those tasks, for example, or specific tasks may have crashed and need to be deleted and then resubmitted. 
Simply add the <code>-t <em>taskrange<\/em><\/code> flag to <code>qdel<\/code>, where <em>taskrange<\/em> gives the tasks to delete. You must always give the job id followed by the tasks. For example:<\/p>\n<ul>\n<li>To delete a single task (id 30) from a job (id 18205)\n<pre>qdel 18205 -t 30\r\n<\/pre>\n<\/li>\n<li>To delete tasks 200-300 inclusive from a job (id 18205)\n<pre>qdel 18205 -t 200-300\r\n<\/pre>\n<\/li>\n<li>To delete tasks 3,7,25,26,50,51,52,100 from a job (id 18205)\n<pre>\r\nqdel 18205 -t 3\r\nqdel 18205 -t 7\r\nqdel 18205 -t 25-26\r\nqdel 18205 -t 50-52\r\nqdel 18205 -t 100\r\n<\/pre>\n<p>Note that it is <strong>not possible<\/strong> to use <code>-t 3,7,25,26,50,51,52,100<\/code> to delete ad-hoc tasks.<\/p><\/li>\n<\/ul>\n<p><a name=\"resubmit\"><\/a><\/p>\n<h3>Resubmitting Deleted\/Failed Tasks<\/h3>\n<p>If you need to resubmit specific task ids that have failed (for whatever reason, e.g., the app you were running crashed or there was a hardware problem), you can specify those specific task id numbers on the <code>qsub<\/code> command-line to <em>override<\/em> the range of task ids in the jobscript. For example, to resubmit the eight ad-hoc tasks 3,7,25,26,50,51,52,100:<\/p>\n<pre># The -t flag on the command-line will override any #$ -t <em>start-finish<\/em> line in the jobscript.\r\n# You <em>do not<\/em> need to edit your jobscript to remove the #$ -t line.\r\nqsub -t 3 <em>jobscript<\/em>\r\nqsub -t 7 <em>jobscript<\/em>\r\nqsub -t 25-26 <em>jobscript<\/em>\r\nqsub -t 50-52 <em>jobscript<\/em>\r\nqsub -t 100 <em>jobscript<\/em>\r\n<\/pre>\n<p>Note that it is <strong>not possible<\/strong> to use <code>-t 3,7,25,26,50,51,52,100<\/code> to submit ad-hoc tasks in one go.<\/p>\n<h2>Email from Jobarrays<\/h2>\n<p>Aug 2021: Please note a change of policy<\/p>\n<div class=\"warning\">\nIt is no longer possible to submit job-arrays that will send emails from each task when there are more than 20 tasks in the job-array. 
This is to protect the University mail routers which have recently blocked the CSF from sending emails due to large job-arrays sending 1000s of emails.<\/p>\n<p>To receive an email when an entire job array has completed, please see the section below on <a href=\"#jobdepemail\">using a job-dependency to send an email after a job array<\/a>.\n<\/div>\n<p>As with ordinary batch jobs it is possible to have the job email you when it begins, ends or it aborts due to error. Unfortunately with a job array each task will email you. Hence you may receive 1000s of emails from a large job array. If this is what you require, please add the following to your jobscript:<\/p>\n<pre>#$ -M &#121;&#111;&#x75;&#x72;&#46;&#110;&#x61;&#x6d;&#101;&#64;&#x6d;&#x61;&#110;&#99;&#x68;&#x65;&#115;&#116;&#x65;&#x72;&#46;&#97;&#x63;&#x2e;&#117;&#107;       # Can use any address\r\n#$ -m bea\r\n      #\r\n      # b = email when <em>each<\/em> job array task begins running\r\n      # e = email when <em>each<\/em> job array task ends\r\n      # a = email when <em>each<\/em> job array task aborts\r\n      #\r\n      # You can specify any one or more of b, e, a\r\n<\/pre>\n<p>Please note <strong>we do not recommend the above<\/strong> as it can cause issues. It isn&#8217;t possible to have the job array email you only when the last task has finished. However, it is possible to do something very similar to this, as follows:<\/p>\n<h3>Email during last task<\/h3>\n<p>You can manually email yourself from the job array task with the highest (last) task id. There is no guarantee this will be the last task to <em>finish<\/em> (other tasks that started earlier may run for longer and finish later). But it will be the last task to <em>start<\/em>. Emailing from the last task to start would be <em>close enough<\/em> to also being the last task to finish in most cases. 
Do this as follows:<\/p>\n<pre>#$ -t 1-100\r\n\r\n# Run some application in our jobscript\r\n.\/my_app.exe data.$SGE_TASK_ID\r\n\r\n# Send email at end of last task. Will usually be\r\n# close enough to being the last task to finish.\r\nif [[ $SGE_TASK_ID == $SGE_TASK_LAST ]]; then\r\n  echo \"Last task $SGE_TASK_ID in job $JOB_ID finished\" | mail -s \"CSF jobarray\" $USER\r\nfi\r\n<\/pre>\n<p>The email will be sent to your University email address.<br \/>\n<a name=\"jobdepemail\"><\/a><\/p>\n<h3>Email from a Job Dependency after the Job Array<\/h3>\n<p>To guarantee that you receive an email <em>after<\/em> all tasks have finished you can submit a second serial job (not a job array) to the batch system (perhaps in the <em>short<\/em> area, if available) with a <em>job dependency<\/em> on the main job-array job. The second job will only run after all tasks in the job array have finished. Note, however, that the second job may have to wait in the queue, depending on how busy the system is, so your email may arrive some time <em>after<\/em> the job array actually finished. 
To submit the jobs use the following command-lines:<\/p>\n<pre># First, submit your job array as normal, noting the jobid\r\nqsub my-jobarray.sh\r\nYour job-array <strong><em>869219<\/em><\/strong>.1-10:1 (\"my-jobarray.sh\") has been submitted\r\n                 #\r\n                 # Make a note of this job id\r\n\r\n# Now immediately submit a serial job with a dependency (-hold_jid <em>jobid<\/em>) \r\n# and request that it emails you when it ends (-m e)\r\nqsub -b y -hold_jid <strong><em>869219<\/em><\/strong> -m e -M $USER true\r\n                       #                   #\r\n                       #                   # this app simply returns 'no error'\r\n                       #\r\n                       # Use the job id from the previous job array\r\n\r\n<\/pre>\n<p>The second serial job will execute (it won&#8217;t do any useful work and will finish immediately) when the job array ends. It will send you an email when it has finished and so you will then know that the job array on which it was dependent has also finished. 
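<p>The two submissions above can also be chained so that the job id never needs to be copied by hand, by reusing the <code>-terse<\/code> capture shown earlier. A minimal sketch (the jobscript name <code>my-jobarray.sh<\/code> and the task range are placeholders):<\/p>

```shell
# Sketch: submit the job array, capture just its numeric job id,
# then immediately submit the email-on-completion job that depends on it.
# "my-jobarray.sh" is a placeholder jobscript name.
JID=$(qsub -terse -t 1-10 my-jobarray.sh | cut -d. -f1)

# Serial job that does no work; it runs (and emails you, -m e) only
# after every task in the job array has finished.
qsub -b y -hold_jid "$JID" -m e -M $USER true
```

<p>These commands can only be run on the cluster itself, so treat this as a recipe rather than a script to run elsewhere.<\/p>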
Note that the information in the email (wallclock time etc) is about the serial job, not the job array.<\/p>\n<h2>Job Array limits<\/h2>\n<p>A huge number of files in a single location can cause connectivity issues with the scratch servers.<br \/>\nTo mitigate this, please keep the number of files in any one directory below around 5000.<br \/>\nThis can be achieved by reducing the size of the array, or by directing the file output to different locations so that each directory stays under the 5000-file limit.<br \/>\nAnother helpful step is to redirect the .o and .e (output and error) files to \/dev\/null if they are not required.<\/p>\n<h2>Further Information<\/h2>\n<p>More on SGE Job Arrays can be found at:<\/p>\n<ul>\n<li><a href=\"http:\/\/wiki.gridengine.info\/wiki\/index.php\/Simple-Job-Array-Howto\">wiki.gridengine.info Simple Job Array HowTo<\/a><\/li>\n<\/ul>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>The SGE batch system has been shutdown and the CSF upgraded to use the Slurm batch system. Please read the CSF3 Slurm documentation instead. To display this old SGE page, click here &nbsp; Please do not run jobarrays in the short environment, even if your tasks have a short runtime. There are not enough cores in short for jobarrays. Why use a Job Array? Suppose you wish to run a large number of almost identical.. 
<a href=\"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/batch\/job-arrays\/\">Read more &raquo;<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"parent":22,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-199","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/wp-json\/wp\/v2\/pages\/199","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/wp-json\/wp\/v2\/comments?post=199"}],"version-history":[{"count":21,"href":"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/wp-json\/wp\/v2\/pages\/199\/revisions"}],"predecessor-version":[{"id":10426,"href":"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/wp-json\/wp\/v2\/pages\/199\/revisions\/10426"}],"up":[{"embeddable":true,"href":"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/wp-json\/wp\/v2\/pages\/22"}],"wp:attachment":[{"href":"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/wp-json\/wp\/v2\/media?parent=199"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}