{"id":287,"date":"2013-04-25T15:12:58","date_gmt":"2013-04-25T15:12:58","guid":{"rendered":"http:\/\/ri.itservices.manchester.ac.uk\/csf-apps\/?page_id=287"},"modified":"2015-02-02T11:27:26","modified_gmt":"2015-02-02T11:27:26","slug":"openmpibd","status":"publish","type":"page","link":"https:\/\/ri.itservices.manchester.ac.uk\/csf-apps\/software\/applications\/compilersamd\/openmpibd\/","title":{"rendered":"OpenMPI on AMD Bulldozer"},"content":{"rendered":"<p>This page describes how best to compile and run parallel MPI jobs on the AMD Bulldozer architecture compute nodes on the CSF, i.e, how to get the best performance out of these nodes.<\/p>\n<h2>Overview<\/h2>\n<ul class=\"gaplist\">\n<li>The CSF AMD Bulldozer nodes each have 64 CPU cores, with 2 GB RAM per core; all are connected via Infiniband.<\/li>\n<li>Intel compilers do not fully support this architecture.<\/li>\n<li>AMD recommend the use of the AMD Open64 compiler with the <a href=\"\/csf-apps\/software\/applications\/acml\">AMD Core Mathematics Library (ACML)<\/a> for maximum performance. The ACML is an implementation of BLAS and LAPACK optimised especially for AMD processors. The library contains other routines too, for example FFT. See the above link for more information on using ACML on the CSF.<\/li>\n<li>Compilation and linking of binaries for these nodes should be performed on a dedicated Bulldozer node by using <code>qrsh<\/code> as descibed below.<\/li>\n<li>Jobs size must be a multiple of 64.<\/li>\n<li>The maximum runtime for a job is 4 days.<\/li>\n<li>Binaries compiled for the AMD Bulldozer compute nodes will not run on other nodes. Attempting to run such a binary on other nodes, for example the Intel nodes, will yield a warning <code>Illegal instruction<\/code> and the programme will not run.<\/li>\n<\/ul>\n<h2>Restrictions on use<\/h2>\n<p>Code should only be compiled and executed on AMD Bulldozer nodes. 
Normally code can be compiled on the login node and tested there using very short test runs (e.g., one minute on fewer than 4 cores). This will not work for AMD Bulldozer codes because the login nodes use the Intel architecture.<\/p>\n<p>To use MPI you will need to amend your program to include the relevant calls to the MPI library.<\/p>\n<h2>Compilation and linking<\/h2>\n<p>The required steps are:<\/p>\n<ul>\n<li>log in to a Bulldozer node dedicated for this procedure;<\/li>\n<li>load the appropriate environment module;<\/li>\n<li>compile and link your MPI code;<\/li>\n<li>log off from the dedicated Bulldozer node.<\/li>\n<\/ul>\n<h3>Example<\/h3>\n<pre>\r\nqrsh -l bulldozer -l short\r\nmodule load mpi\/open64-4.5.2\/openmpi\/1.6-ib-amd-bd\r\nmpif90 mynameis.f90 -o mynameis\r\nexit\r\n  #\r\n  # You are now back on the login node\r\n<\/pre>\n<h2>Running MPI jobs<\/h2>\n<p>The required steps are as for other MPI jobs:<\/p>\n<ul>\n<li>create a suitable <code>qsub<\/code> script &mdash; an example is given below &mdash; and save it as, for example, <code>my_open64_mpi_job.qsub<\/code>;<\/li>\n<li>load the appropriate environment module;<\/li>\n<li>ensure you are on the <strong>login node<\/strong> of the CSF (not the dedicated compile\/link node);<\/li>\n<li>submit your job to SGE, for example: <code>qsub my_open64_mpi_job.qsub<\/code>;<\/li>\n<li>note that the maximum runtime for a job is 4 days.<\/li>\n<\/ul>\n<h3>Small MPI Jobs (fewer than 64 cores)<\/h3>\n<p>Small MPI jobs that don&#8217;t use all cores on the node can be run in the <code>smp-64bd.pe<\/code> parallel environment. 
In this case you must load one of the following modulefiles:<\/p>\n<pre>\r\n<strong># PGI 14.10 compiler<\/strong>\r\nmodule load mpi\/pgi-14.10-acml-fma4\/openmpi\/1.8.3-amd-bd\r\n\r\n<strong># Open64 4.5.2.1 compiler<\/strong>\r\nmodule load mpi\/open64-4.5.2.1\/openmpi\/1.8.3-amd-bd\r\nmodule load mpi\/open64-4.5.2.1\/openmpi\/1.6-amd-bd\r\nmodule load mpi\/open64-4.5.2\/openmpi\/1.6-amd-bd\r\n\r\n<strong># Intel compiler (code not as optimized as PGI or Open64)<\/strong>\r\nmodule load mpi\/intel-14.0\/openmpi\/1.8.3\r\nmodule load mpi\/intel-12.0\/openmpi\/1.6\r\n\r\n<strong># GNU compiler (code not as optimized as PGI or Open64)<\/strong>\r\nmodule load mpi\/gcc\/openmpi\/1.6\r\n<\/pre>\n<p>Use the <code>smp-64bd.pe<\/code> parallel environment in your jobscript. Note that if you use the entire node (all 64 cores), your code may run faster with the <code>-ib<\/code> modulefiles in the next section: when all 64 cores are used, the MPI processes are <em>pinned<\/em> to cores, which improves performance. The non-ib modulefiles above disable core pinning (on by default as of OpenMPI 1.7.4) because, when fewer than 64 cores are used, a job cannot determine which cores already have MPI processes from other jobs on the same node pinned to them; every job would start pinning from core 1, 2, &#8230; and so on. Hence MPI&#8217;s task-pinning mechanism is disabled and process placement is left to the operating system.<\/p>\n<pre>\r\n#!\/bin\/bash\r\n#$ -S \/bin\/bash\r\n#$ -cwd                   # Run in current directory\r\n#$ -V                     # Inherit settings from modulefiles\r\n#$ -pe smp-64bd.pe 16     # Small MPI job (can use up to 64 cores)\r\n\r\n# $NSLOTS is automatically set to the number you specify on the -pe line\r\n\r\nmpirun -n $NSLOTS .\/my_app_amd.exe\r\n<\/pre>\n<h3>Large MPI Jobs (64 cores or more)<\/h3>\n<p>Large multi-node MPI jobs that use all cores on each node can be run in the <code>orte-64bd-ib.pe<\/code> parallel environment. 
In this case you must load one of the following modulefiles (these can also be used for a single-node job using all 64 cores of a node in <code>smp-64bd.pe<\/code>):<\/p>\n<pre>\r\n<strong># PGI 14.10 compiler<\/strong>\r\nmodule load mpi\/pgi-14.10-acml-fma4\/openmpi\/1.8.3-ib-amd-bd\r\n\r\n<strong># Open64 4.5.2.1 compiler<\/strong>\r\nmodule load mpi\/open64-4.5.2.1\/openmpi\/1.8.3-ib-amd-bd\r\nmodule load mpi\/open64-4.5.2.1\/openmpi\/1.6-ib-amd-bd\r\nmodule load mpi\/open64-4.5.2\/openmpi\/1.6-ib-amd-bd\r\n\r\n<strong># Intel compiler (code not as optimized as PGI or Open64)<\/strong>\r\nmodule load mpi\/intel-14.0\/openmpi\/1.8.3-ib\r\nmodule load mpi\/intel-12.0\/openmpi\/1.6-ib\r\n\r\n<strong># GNU compiler (code not as optimized as PGI or Open64)<\/strong>\r\nmodule load mpi\/gcc\/openmpi\/1.6-ib\r\n<\/pre>\n<pre>\r\n#!\/bin\/bash\r\n#$ -S \/bin\/bash\r\n#$ -cwd                      # Run in current directory\r\n#$ -V                        # Inherit settings from modulefiles\r\n#$ -pe orte-64bd-ib.pe 128   # Large MPI job (multiples of 64 cores only)\r\n\r\n# $NSLOTS is automatically set to the number you specify on the -pe line\r\n\r\nmpirun -n $NSLOTS .\/my_app_amd.exe\r\n<\/pre>\n<h2>Further information<\/h2>\n<ul>\n<li>Online help via the command line:<\/li>\n<\/ul>\n<pre class=\"in1\">\r\nman mpif90\r\n  # for Fortran MPI\r\nman mpicc\r\n  # for C\/C++ MPI\r\nman mpirun\r\n  # for information on running MPI executables\r\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>This page describes how best to compile and run parallel MPI jobs on the AMD Bulldozer architecture compute nodes on the CSF, i.e., how to get the best performance out of these nodes. Overview The CSF AMD Bulldozer nodes each have 64 CPU cores, with 2 GB RAM per core; all are connected via Infiniband. Intel compilers do not fully support this architecture. AMD recommend the use of the AMD Open64 compiler with the AMD.. 
<a href=\"https:\/\/ri.itservices.manchester.ac.uk\/csf-apps\/software\/applications\/compilersamd\/openmpibd\/\">Read more &raquo;<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"parent":91,"menu_order":0,"comment_status":"open","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-287","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/ri.itservices.manchester.ac.uk\/csf-apps\/wp-json\/wp\/v2\/pages\/287","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ri.itservices.manchester.ac.uk\/csf-apps\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/ri.itservices.manchester.ac.uk\/csf-apps\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/ri.itservices.manchester.ac.uk\/csf-apps\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/ri.itservices.manchester.ac.uk\/csf-apps\/wp-json\/wp\/v2\/comments?post=287"}],"version-history":[{"count":9,"href":"https:\/\/ri.itservices.manchester.ac.uk\/csf-apps\/wp-json\/wp\/v2\/pages\/287\/revisions"}],"predecessor-version":[{"id":2224,"href":"https:\/\/ri.itservices.manchester.ac.uk\/csf-apps\/wp-json\/wp\/v2\/pages\/287\/revisions\/2224"}],"up":[{"embeddable":true,"href":"https:\/\/ri.itservices.manchester.ac.uk\/csf-apps\/wp-json\/wp\/v2\/pages\/91"}],"wp:attachment":[{"href":"https:\/\/ri.itservices.manchester.ac.uk\/csf-apps\/wp-json\/wp\/v2\/media?parent=287"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}