Microsoft R Open (MRO)

Overview

Microsoft R Open is an optimized distribution of R that includes the Intel Math Kernel Library (MKL) for parallel (multi-core) matrix calculations. The MKL routines are often substantially faster than the standard R matrix libraries, even when only using a single core.

This version also helps with reproducibility – any additional packages you install (see below) are from a snapshot of the CRAN repository and so do not change daily as the open source CRAN repo does.

By default Microsoft R Open will use all cores on a node. For this reason, users of this version must ensure that the number of cores used by the MKL library matches the number of cores requested in their jobscript, as described in the example jobscripts below. This applies even to serial jobs using only one core.

When this version of R is started it will report how many cores it has been given to use for the MKL:

Multithreaded BLAS/LAPACK libraries detected. Using N cores for math algorithms.

where N should be the number of cores requested by your jobscript (set via the OMP_NUM_THREADS environment variable – see below).

Using the multi-threaded MKL library is distinct from using, for example, the parallel library. Although both approaches can be combined, you will need to take care not to “double-parallelise” your code. Jobs submitted incorrectly are liable to be killed without warning; see the section on parallelism below.

You may also use the standard open-source R on the CSF.

The following versions of Microsoft R Open are installed on the CSF:

  • R 3.5.1

Note that Bioconductor is NOT available in the Microsoft R Open version of R. Please see our standard open-source R installation on the CSF if you wish to use Bioconductor.

You can also install packages to your own home directory using the Adding Packages instructions below, in conjunction with information from the Bioconductor website. Alternatively, we may be able to add them to the central install – contact its-ri-team@manchester.ac.uk.

Restrictions on use

There are no restrictions on access to R as it is free software released under the GNU General Public License. All users should familiarise themselves with the licensing information available via the R website.

All R jobs, aside from very short test jobs (e.g. those lasting less than one minute) must be submitted to the batch system.

Set up procedure

We now recommend loading modulefiles within your jobscript so that you have a full record of how the job was run. See the example jobscript below for how to do this. Alternatively, you may load modulefiles on the login node and let the job inherit these settings.

Load one of the following modulefiles:

module load apps/binapps/MRO/3.5.1

Running the application

Note that using R CMD BATCH, as below, may save and restore your workspace, which may not be what you want. Using Rscript instead avoids that.

Serial Batch job

Write a submission script, for example:

#!/bin/bash --login
#$ -cwd               # Run job from current directory

## When using Microsoft R Open, you must include this line:
export OMP_NUM_THREADS=$NSLOTS

## We now recommend loading the modulefile in the jobscript. Change the version as needed.
module load apps/binapps/MRO/3.5.1

R CMD BATCH  my_test.R  my_test.R.o$JOB_ID
   #                          #
   #                          # The final argument, "my_test.R.o$JOB_ID", tells R to send
   #                          # output to a file with this name, unique to the current job.
   #
   # R must be called with both the "CMD" and "BATCH" options, which tell it
   # to run an R program, in this case my_test.R, instead of presenting
   # an interactive prompt.

Submit the job using

qsub runmyRjob.qsub

where runmyRjob.qsub is the name of your job script.

By default, graphical output from batch jobs is sent to a file called Rplots.pdf. See below for more information on plotting to an image file.

Parallel Batch Job (single node, multi-core)

The Microsoft R Open version will automatically make use of multiple cores if your code performs matrix operations handled by the Intel Math Kernel Library. In this case you do not need to modify your code to use multiple CPU cores. If you wish to parallelise other types of R code, you will have to modify it, usually with the parallel R library.
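As a minimal sketch (plain base R, no extra packages), ordinary matrix code such as the following is multithreaded by MKL automatically:

```r
# Under Microsoft R Open the crossprod() call below is executed by the
# multi-threaded MKL BLAS, using up to OMP_NUM_THREADS cores, with no
# changes to the code itself.
set.seed(1)
m <- matrix(rnorm(2000 * 2000), nrow = 2000)
system.time(x <- crossprod(m))   # equivalent to t(m) %*% m
```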

#!/bin/bash --login
#$ -cwd               # Run job from current directory
#$ -pe smp.pe 12      # Number of cores to use. Can be between 2 and 32.
   
## When using Microsoft R Open, you must include this line:
export OMP_NUM_THREADS=$NSLOTS

## We now recommend loading the modulefile in the jobscript. Change the version as needed.
module load apps/binapps/MRO/3.5.1

R CMD BATCH my_test.R my_test.R.o$JOB_ID
Submit the job using
qsub runmyRjob.qsub

where runmyRjob.qsub is the name of your job script.

The various libraries for performing parallel computation in R each have their own way of setting the number of cores to use within R. This will sometimes default to the total number of cores on the node. You need to make sure that your code is using no more than the number of cores you’ve requested in your job script, otherwise your job is liable to be killed without warning.

You can read the number of cores requested in your jobscript into an R variable using the code:

numCoresAllowed <- as.integer(Sys.getenv("NSLOTS", unset=1))

(Sys.getenv() returns a character string, hence the as.integer() conversion. If you're running the job interactively or on your local machine, the value given in "unset" is returned.)

You should use this value when you set the number of cores. For example, if you're using the "doMC" package, you'd use:

registerDoMC(cores = numCoresAllowed)

Some libraries, e.g. the "parallel" library, take the number of cores to use directly from an environment variable (e.g. MC_CORES). You can set the environment variable in your job script.

The section below contains some further considerations if you're using Microsoft R Open.

Parallelism

R's parallel library can be used to run sections of code in parallel. The mclapply and mcmapply functions provide parallelised versions of lapply and mapply. The number of cores these will use can be set using the mc.cores option, or (preferably) by setting the environment variable MC_CORES in your job script.
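As a minimal sketch (assuming MC_CORES has been exported in the jobscript, e.g. export MC_CORES=8):

```r
# The parallel library reads the MC_CORES environment variable when it is
# loaded and uses it as the default value of the mc.cores option, which
# mclapply consults for its number of worker processes.
library(parallel)

# Runs the function over 1:100 using MC_CORES worker processes (2 if unset):
results <- mclapply(1:100, function(i) i * i)
```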

If you are using parallel, or another library, to parallelise your code, and you are using Microsoft R Open, you will need to take care not to double-parallelise your code. Each worker spawned by parallel will use OMP_NUM_THREADS cores. The number of cores you request in your job script should therefore be:

OMP_NUM_THREADS x MC_CORES

For example, if we wish to run a 16 core job, consisting of 8 parallel threads using mclapply, each of which will use 2 cores for matrix computation:

#!/bin/bash --login
#$ -cwd               # Run job from current directory
#$ -pe smp.pe 16      # Number of cores to use. Can be between 2 and 32.
    
export OMP_NUM_THREADS=2    # For MKL library in MRO
export MC_CORES=8           # For parallel library (used in your code)

## We now recommend loading the modulefile in the jobscript. Change the version as needed.
module load apps/binapps/MRO/3.5.1

R CMD BATCH my_test.R my_test.R.o$JOB_ID
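As a hypothetical extra safeguard (not required; the variable names are the ones used above), the jobscript can check that the thread split matches the core request before starting R:

```shell
# Abort early if OMP_NUM_THREADS x MC_CORES does not equal the number of
# cores requested. NSLOTS is set by the batch system in a real job; it is
# hard-coded here purely for illustration.
NSLOTS=16
export OMP_NUM_THREADS=2    # MKL threads per parallel worker
export MC_CORES=8           # parallel workers
if [ $((OMP_NUM_THREADS * MC_CORES)) -ne "$NSLOTS" ]; then
    echo "Error: OMP_NUM_THREADS x MC_CORES must equal NSLOTS" >&2
    exit 1
fi
echo "OK: ${MC_CORES} workers x ${OMP_NUM_THREADS} MKL threads = ${NSLOTS} cores"
```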

The optimum division of cores between mclapply and the MKL library will depend on the nature of your workload. If the majority of your code is embarrassingly parallel and can be parallelised using mclapply, we would expect setting OMP_NUM_THREADS=1 and MC_CORES=$NSLOTS to give the best performance – this way all of the code in mclapply runs in parallel, not just the matrix operations. Even using a single core, the MKL library is often substantially faster than R's standard library when performing matrix operations.

Remember that MKL will, by default, use all the node's cores, so you must always set OMP_NUM_THREADS in your jobscript when using Microsoft R Open.

Running R interactively

It is expected that most use of R on the CSF will be in batch mode, i.e., computational jobs will be submitted to the batch system and there will be no subsequent user interaction. However, if required, R can be run on the CSF using either the R command line or the GUI.

Do not simply login to the CSF and start R — your jobs will be killed by the system administrator! The only exception to this is when installing a package from a mirror in R (see below).

To run R jobs interactively on the CSF, make use of the qrsh facility, which submits interactive jobs through the batch system. To start the R command line, type:

# Load the modulefile on the login node (use your required version)
module load apps/binapps/MRO/3.5.1

# Start R in an interactive session on a compute node
qrsh -cwd -V -l short R --vanilla --interactive
                          #          #
                          #          # Needed if you will be starting the R GUI (Rcmdr)
                          #
                          # Can be: "--save", "--no-save" or "--vanilla"...

Note:

  • Use one of --save, --no-save and --vanilla
  • If you want to use the GUI, ensure you type --interactive, otherwise the GUI will not start and you will see an error message like:
    The Commander GUI is launched only in interactive sessions
    

    To start the GUI, enter library(Rcmdr) at the R command line:

    library(Rcmdr)
    Loading required package: tcltk
    Loading Tcl/Tk interface ... done
    

Adding packages

For advice on installing new R packages in your home directory please see the standard open-source R CSF documentation.

Plotting

For advice on plotting figures in R please see the standard open-source R CSF documentation.

Further Info

Updates

None

Last modified on March 14, 2019 at 7:45 pm by George Leaver