{"id":12364,"date":"2026-06-10T18:52:17","date_gmt":"2026-06-10T17:52:17","guid":{"rendered":"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/?page_id=12364"},"modified":"2026-06-16T15:19:55","modified_gmt":"2026-06-16T14:19:55","slug":"scaling-slurm","status":"publish","type":"page","link":"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/batch-slurm\/scaling-slurm\/","title":{"rendered":"How many cores should I use? Job Scaling"},"content":{"rendered":"<h2>Introduction<\/h2>\n<p>There is no single answer we can give here &#8211; there are a number of factors that determine whether your software will <em>scale<\/em>. By <em>scale<\/em> we mean whether the application will run faster (i.e., the job completes sooner) as you increase the number of cores used by your job.<\/p>\n<p>But we also want to know how <em>much<\/em> faster it goes with different numbers of cores. For example, if you double the number of cores, does the job complete in half the time? If you quadruple the number of cores, does the job complete in a quarter of the time, and so on?<\/p>\n<p>This kind of <em>scaling<\/em> is called <em>strong scaling<\/em> &#8211; the problem size is kept fixed while you increase the number of cores. If doubling the number of cores halves the wallclock time (and so on) then this is called <em>ideal scaling<\/em>.<\/p>\n<h2>Does my software scale?<\/h2>\n<p>Some factors which determine whether your software will <em>scale<\/em>:<\/p>\n<ol class=\"gaplist\">\n<li>The parallel efficiency of the software: Is it a well-written, efficient application with little <em>parallel communication<\/em> overhead? You may not know the answer to this, but the software&#8217;s documentation may say how many cores it can use and how well it scales. If the software developers recommend an upper limit on the number of cores to use then you should probably stick to that limit! 
If there are examples of the software being run on other large HPC systems then it is probably very scalable software.<\/li>\n<li>The algorithm your software is using. Even well-written software may be using an algorithm that doesn&#8217;t scale well. Again, you may not know this, but some applications, such as chemistry applications that consider forces between atoms, will scale better if they only consider short-range forces. Switching on all of the calculations that an application can perform may reduce scalability. Only calculate what you actually need for your research.<\/li>\n<li>The size of your data (or the parameters you supply that may tune the algorithm). If you are using unnecessarily short time-steps in a simulation, or unnecessarily large <em>systems<\/em> to be solved, then the application may not scale. Try to choose appropriate parameters for your software.<\/li>\n<\/ol>\n<h2>Running some scaling tests<\/h2>\n<p>The best method to determine the number of cores to use is to run several jobs using the same dataset but with an increasing number of cores. You can then inspect the time each job took to complete (the <em>wallclock<\/em> time) and determine whether your application keeps going faster as you increase the number of cores. This can be important if you have a lot of jobs to run (e.g., a lot of simulations to perform using different input datasets or different simulation parameters). Doing some early tests to find the best number of cores to use can save you time in the long run.<\/p>\n<div class=\"hint\">The aim of running these scaling jobs is to determine the best number of cores to use in your jobs. If an application does <em>not<\/em> go any faster beyond a certain number of cores, there is no point submitting jobs with more and more cores! 
You will only wait longer in the queue without any increase in job performance!<\/div>\n<p>For example, suppose you have a parallel application that can be run on multiple cores in a single compute node (i.e., up to 168 cores on CSF3). You can submit multiple jobs using the following method:<\/p>\n<ol class=\"gaplist\">\n<li>Create a jobscript for a <em>serial<\/em> job (it uses only one core). We want to use the faster AMD Genoa cores; the only AMD partition that allows 1-core batch jobs is the <em>interactive<\/em> partition. Here is an example jobscript:\n<pre class=slurm>\r\n#!\/bin\/bash --login\r\n#SBATCH -p interactive\r\n#SBATCH -n 1\r\n#SBATCH -t 0-1\r\n\r\nmodule load apps\/intel-17.0\/myapp\/1.2.3        # Load the modulefile in the job\r\n\r\n# Run the app using one core. The $SLURM_NTASKS variable is automatically set to the\r\n# number of cores assigned to your job (1 for a serial job). The --numthreads flag\r\n# will probably be called something else in your app - check the docs!\r\nmyapp --numthreads $SLURM_NTASKS -in mydata.dat -out myresults.${SLURM_JOB_ID}.dat\r\n<\/pre>\n<p>Submit the job using the usual <code>sbatch <em>myjobscript.sh<\/em><\/code> command. Make a note of the <em>jobid<\/em> assigned to your job (a number printed out by the <code>sbatch<\/code> command).<\/li>\n<li>When the job has completed, get its <em>Elapsed<\/em> time using:\n<pre>\r\nsacct -XPnj <em>jobid<\/em> -o Elapsed\r\n<\/pre>\n<p>This will tell you how long the job took, in <em>DAYS-HOURS:MINUTES:SECONDS<\/em> format (the <em>DAYS-<\/em> prefix only appears for jobs that ran for a day or longer).<br \/>\nTo convert it into <em>seconds<\/em>, which is needed for our calculations later on, pipe it to this one-line <em>awk<\/em> script:<\/p>\n<pre>\r\nsacct -XPnj <em>jobid<\/em> -o Elapsed | \\\r\nawk -F'[-:]' 'NF==3{$4=$3;$3=$2;$2=$1;$1=0}{print $1*24*60*60+$2*60*60+$3*60+$4}'\r\n<\/pre>\n<\/li>\n<li>Now repeat the job with an increasing number of cores. You <em>do not<\/em> need to edit the jobscript each time. 
Instead, supply the <em>partition<\/em> name and <em>number of cores<\/em> on the <code>sbatch<\/code> command-line. We also ensure we use the same AMD Genoa CPU architecture for all jobs to make the timing comparison as fair as possible:\n<pre>\r\nsbatch -p multicore -n  2 <em>myjobscript.sh<\/em>\r\nsbatch -p multicore -n  4 <em>myjobscript.sh<\/em>\r\nsbatch -p multicore -n  8 <em>myjobscript.sh<\/em>\r\nsbatch -p multicore -n 16 <em>myjobscript.sh<\/em>\r\nsbatch -p multicore -n 32 <em>myjobscript.sh<\/em>\r\n<\/pre>\n<p>Each job will be given a unique <em>job id<\/em>, which you should make a note of.<\/li>\n<li>You can now query the <em>job accounting<\/em> information for each job as before:\n<pre>\r\nsacct -XPnj <em>jobid<\/em> -o Elapsed | \\\r\nawk -F'[-:]' 'NF==3{$4=$3;$3=$2;$2=$1;$1=0}{print $1*24*60*60+$2*60*60+$3*60+$4}'\r\n<\/pre>\n<\/li>\n<\/ol>\n<p>Suppose we get the following elapsed times for our jobs. We also calculate the <em>speed-up<\/em> using the formula<\/p>\n<pre>\r\nSpeed-up<sub>Ncores<\/sub> = Elapsed<sub>1core<\/sub> \/ Elapsed<sub>Ncores<\/sub>\r\n<\/pre>\n<p>and can also write down the ideal (linear) speed-up:<\/p>\n<pre>\r\nCores   Elapsed(s)     Speed-up    Ideal (Linear) speed-up\r\n    1          725          1.0            1.0\r\n    2          367          2.0            2.0\r\n    4          181          4.0            4.0\r\n    8           93          7.8            8.0\r\n   16           60         12.1           16.0\r\n   32           47         15.4           32.0\r\n<\/pre>\n<p>If we plot the <em>Speed-up<\/em> against the number of cores and also plot the <em>linear speed-up<\/em> we can see that the performance of the software is tailing off as we go beyond 8 cores:<br \/>\n<a href=\"\/csf3\/wp-content\/uploads\/sample-timings.png\"><img loading=\"lazy\" decoding=\"async\" src=\"\/csf3\/wp-content\/uploads\/sample-timings.png\" alt=\"Example Job Speed-up vs Linear Speed-up\" width=\"400\" height=\"200\" class=\"aligncenter size-full 
wp-image-2967\" srcset=\"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/wp-content\/uploads\/sample-timings.png 400w, https:\/\/ri.itservices.manchester.ac.uk\/csf3\/wp-content\/uploads\/sample-timings-300x150.png 300w\" sizes=\"auto, (max-width: 400px) 100vw, 400px\" \/><\/a><\/p>\n<p>Hence for this application, jobs should be run with up to 8, or possibly 16, cores but requesting 32 cores would not provide a lot of benefit.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Introduction There is no single answer we can give here &#8211; there are a number of factors that determine whether your software will scale. By scale we mean whether the application will run faster (i.e., the job completes sooner) as you increase the number of cores used by your job. But we also want to know how much faster it goes with different numbers of cores. For example, if you double the number of cores,.. <a href=\"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/batch-slurm\/scaling-slurm\/\">Read more 
&raquo;<\/a><\/p>\n","protected":false},"author":24,"featured_media":0,"parent":9105,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-12364","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/wp-json\/wp\/v2\/pages\/12364","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/wp-json\/wp\/v2\/users\/24"}],"replies":[{"embeddable":true,"href":"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/wp-json\/wp\/v2\/comments?post=12364"}],"version-history":[{"count":10,"href":"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/wp-json\/wp\/v2\/pages\/12364\/revisions"}],"predecessor-version":[{"id":12377,"href":"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/wp-json\/wp\/v2\/pages\/12364\/revisions\/12377"}],"up":[{"embeddable":true,"href":"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/wp-json\/wp\/v2\/pages\/9105"}],"wp:attachment":[{"href":"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/wp-json\/wp\/v2\/media?parent=12364"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}