The CSF2 has been replaced by the CSF3 - please use that system! This documentation may be out of date. Please read the CSF3 documentation instead.
R
Overview
R is a free software environment for statistical computing and graphics.
The following versions are installed on the CSF:
- Microsoft R Open 3.2.4 – does not include BioConductor. Does include MKL libraries for parallel matrix calcs
Standard open source R
- R 3.4.3 – includes BioConductor 3.6 & ukbtools 0.10.0, rnaseqGene-1.0.2 (Rcmdr is not installed).
- R 3.4.2 – includes BioConductor 3.5 & ukbtools 0.9.0 (Rcmdr is not installed).
- R 3.4.1 – includes BioConductor 3.5 & ukbtools 0.9.0 (Rcmdr is not installed).
- R 3.3.3 – includes BioConductor 3.3 (Rcmdr is not installed).
- R 3.3.0 – includes BioConductor 3.3 (Rcmdr is not installed).
- R 3.2.3 – does not include BioConductor (due to dependency issues) and there is no Rcmdr
- R 3.2.0 – includes BioConductor 3.1 (please note Rcmdr does not work in this version)
- R 3.1.0 – includes BioConductor 2.14
- R 3.0.2 – does not include BioConductor
- R 2.15.3 – includes BioConductor 2.11
The Microsoft R Open version provides an optimized version of R that includes the Intel Math Kernel Libraries (MKL) for parallel (multi-core) matrix calculations (these are often substantially faster than the standard R matrix libraries, even when only using a single core). This version also helps with reproducibility – any additional packages you install (see below) are from a snapshot of the CRAN repository and so do not change daily as the open source CRAN repo does. By default Microsoft R Open will use all cores on a node. For this reason, users of this version must ensure their jobscripts set the number of cores to be used by the MKL libraries to be consistent with their jobscript, as described in the example jobscripts below. This is the case even if only one core is used by a serial job.
When this version of R is started it will report how many cores it has been given to use for the MKL:
Multithreaded BLAS/LAPACK libraries detected. Using N cores for math algorithms.
where N should be the number of cores requested by your jobscript (by setting the OMP_NUM_THREADS environment variable – see below).
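As a minimal sketch of that setting: inside a batch job NSLOTS is set by the batch system, so the line below matches the jobscripts on this page. The fallback value of 1 for runs outside the batch system is an assumption added for illustration, not part of the original instructions.

```shell
# NSLOTS is set by the batch system inside a job. Outside a job it is unset,
# so fall back to 1 core (assumed default, for illustration only).
export OMP_NUM_THREADS="${NSLOTS:-1}"
echo "MKL will use $OMP_NUM_THREADS core(s)"
```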
Using the multi-threaded MKL library is distinct from using, e.g. the parallel library. Although both approaches can be used, you will need to take care not to “double parallelise” your code. Such jobs are liable to be killed without warning; see the section on parallelism below.
Bioconductor is included in some of the central CSF installations of R (see above). Not all associated packages may be available, but you can install them to your own home directory using the Adding Packages instructions below, in conjunction with information from the bioconductor website. Alternatively, we may be able to add them to the central install – contact its-ri-team@manchester.ac.uk .
To use BioConductor enter the following command at the R prompt or in your R script:
source("https://bioconductor.org/biocLite.R")
Restrictions on use
There are no restrictions on access to R as it is a free piece of software released under a GNU license. All users should familiarise themselves with the licensing information available via the R website.
All R jobs, aside from very short test jobs (e.g. those lasting less than one minute) must be submitted to the batch system, SGE.
Set up procedure
To access the executables please load one of the appropriate modulefiles:
Microsoft R Open with MKL:
module load apps/gcc/MRO/3.2.4
Standard open source R:
module load apps/gcc/R/3.4.3
module load apps/gcc/R/3.4.2
module load apps/gcc/R/3.4.1
module load apps/gcc/R/3.3.0
module load apps/gcc/R/3.2.3
module load apps/gcc/R/3.2.0
module load apps/gcc/R/3.1.0
module load apps/gcc/R/3.0.2
module load apps/gcc/R/2.15.3
module load apps/gcc/R/2.15.0
Running the application
Note that using R CMD BATCH, as below, may save and restore your workspace, which may not be what you want. Using Rscript instead avoids that.
Serial Batch job
- Make sure you have the module loaded.
- Write a submission script, for example:
#!/bin/bash
#$ -S /bin/bash
#$ -cwd             # Run job from current directory
#$ -V               # Inherit the environment (module) settings

## If using Microsoft R Open, you must include this line:
export OMP_NUM_THREADS=$NSLOTS

R CMD BATCH my_test.R my_test.R.o$JOB_ID

#
# -- R must be called with both the "CMD" and "BATCH" options which tell it
#    to run an R *program*, in this case "my_test.R", instead of presenting
#    an interactive prompt
#
# -- the final argument, "my_test.R.o$JOB_ID", tells R to send output to a
#    file with this name (without it, R sends output to "my_test.Rout", which
#    would be over-written by a second submission of "my_test.R" via
#    "runmyRjob.qsub")
- Then submit your job to the batch system
qsub runmyRjob.qsub
where runmyRjob.qsub is the name of your job script.
By default, graphical output from batch jobs is sent to a file called Rplots.pdf.
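The Rscript alternative mentioned at the start of this section could look roughly like the sketch below, which writes a variant of the serial jobscript to a file. The file name runmyRscriptjob.qsub is hypothetical, and redirecting both output streams to a per-job file is an assumption chosen to mirror the R CMD BATCH example, since Rscript prints to the terminal streams rather than to a named output file.

```shell
# Write a jobscript that uses Rscript instead of R CMD BATCH.
# "runmyRscriptjob.qsub" is a hypothetical file name.
# The quoted 'EOF' stops the shell expanding $NSLOTS and $JOB_ID now;
# they are expanded later, when the batch system runs the job.
cat > runmyRscriptjob.qsub <<'EOF'
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -V

## If using Microsoft R Open, you must include this line:
export OMP_NUM_THREADS=$NSLOTS

# Rscript prints to stdout/stderr, so redirect both to a per-job file
Rscript my_test.R > my_test.R.o$JOB_ID 2>&1
EOF
```

The resulting file would then be submitted with qsub in the same way as runmyRjob.qsub.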
Parallel Batch Job (single node, multi-core)
If using the Microsoft R Open version (see above) then the Intel Math Kernel Libraries (MKL) are used for parallel matrix calculations. You should set the number of cores that the MKL can use in the jobscript:
- Make sure you have the module loaded.
- Write a submission script, for example:
#!/bin/bash
#$ -S /bin/bash
#$ -cwd             # Run job from current directory
#$ -V               # Inherit the environment (module) settings
#$ -pe smp.pe 16    # Number of cores to use. Can be between 2 and 24.

## If using Microsoft R Open, you must include this line:
export OMP_NUM_THREADS=$NSLOTS

R CMD BATCH my_test.R my_test.R.o$JOB_ID
- Then submit your job to the batch system
qsub runmyRjob.qsub
where runmyRjob.qsub is the name of your job script.
The various libraries for performing parallel computation in R each have their own way of setting the number of cores to use within R. This will sometimes default to the total number of cores on the node. You need to make sure that your code is using no more than the number of cores you’ve requested in your job script, otherwise your job is liable to be killed without warning.
You can return the number of cores you requested in your jobscript as a variable, using the code:
numCoresAllowed <- as.integer(Sys.getenv("NSLOTS", unset="1"))
(If you're running the job interactively or on your local machine, the value specified in "unset" will be returned)
You should use this value when you set the number of cores. For example, if you're using the "doMC" package, you'd use:
registerDoMC(cores = numCoresAllowed)
Some libraries, e.g. the "parallel" library will take the number of cores to use from an environment variable (e.g. MC_CORES) directly. You can set the environment variable in your job script.
The section below contains some further considerations if you're using Microsoft R Open.
Parallelism
R's "parallel" library can be used to run sections of code in parallel. The mclapply and mcmapply functions provide parallelised versions of lapply and mapply. The number of cores these will use can be set using the mc.cores option, or (preferably) by setting the environment variable MC_CORES in your job script.
If you are using "parallel" or another library to parallelise your code, and you are using Microsoft R Open, you will need to take care not to double-parallelise your code. Each process spawned by parallel will use OMP_NUM_THREADS cores for matrix calculations. The number of cores you request in your job script should therefore be:
OMP_NUM_THREADS * MC_CORES
For example, if we wish to run a 16 core job, consisting of 8 parallel threads using mclapply, each of which will use 2 cores for matrix computation:
#!/bin/bash
#$ -S /bin/bash
#$ -cwd             # Run job from current directory
#$ -V               # Inherit the environment (module) settings
#$ -pe smp.pe 16    # Number of cores to use. Can be between 2 and 24.

export OMP_NUM_THREADS=2    # For MKL library
export MC_CORES=8           # For parallel library

R CMD BATCH my_test.R my_test.R.o$JOB_ID
The optimum division of threads between mclapply and the MKL library will depend on the nature of your workload. If the majority of your code is embarrassingly parallel and can be parallelised using mclapply, we would expect setting OMP_NUM_THREADS=1 and MC_CORES=$NSLOTS to provide optimum performance - this way all of the code in mclapply will be run in parallel, not just the matrix operations. Even using a single core the MKL library is often substantially faster than R's standard library when performing matrix operations.
Remember that MKL will, by default, use all the node's cores, so you must always set OMP_NUM_THREADS if you are using Microsoft R Open.
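Rather than hard-coding both values, the split could be derived from the number of cores granted to the job, so that OMP_NUM_THREADS * MC_CORES never exceeds NSLOTS. The following is a sketch only: the choice of 2 MKL threads per worker, the fallback of 16 for NSLOTS outside a job, and the single-worker fallback are all illustrative assumptions.

```shell
# Split the cores granted by the batch system between MKL threads and
# mclapply workers so that OMP_NUM_THREADS * MC_CORES <= NSLOTS.
NSLOTS="${NSLOTS:-16}"        # set by the batch system in a job; 16 is an illustrative fallback
export OMP_NUM_THREADS=2      # cores per worker for MKL (illustrative choice)
export MC_CORES=$(( NSLOTS / OMP_NUM_THREADS ))   # integer division rounds down

if [ "$MC_CORES" -lt 1 ]; then
    # Fewer slots than MKL threads requested: use a single worker instead
    export MC_CORES=1
    export OMP_NUM_THREADS=$NSLOTS
fi

echo "Using $MC_CORES worker(s) x $OMP_NUM_THREADS MKL thread(s) of $NSLOTS core(s)"
```

Because the division rounds down, an odd core count leaves one core unused rather than oversubscribing the node.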
Running R interactively
It is expected that most use of R on the CSF will be in batch mode, i.e., computational jobs will be submitted to SGE and there will be no subsequent user interaction. However, if required, R can be run on the CSF using either the R command line, or GUI.
Do not simply login to the CSF and start R — your jobs will be killed by the system administrator! The only exception to this is when installing a package from a mirror in R (see below).
To run R jobs interactively on the CSF, make use of the qrsh facility, which literally queues interactive jobs. To start the R command line type
qrsh -V -l inter -l short R --vanilla --interactive
#
# ..."--save", "--no-save" or "--vanilla"...
#
- Use one of --save, --no-save and --vanilla
- If you want to use the GUI, ensure you type --interactive otherwise the GUI will not start and you will see an error message like:
The Commander GUI is launched only in interactive sessions
To start the GUI, enter library(Rcmdr) at the R command line:
library(Rcmdr)
Loading required package: tcltk
Loading Tcl/Tk interface ... done
Adding packages
For the purposes of adding packages only you should run R on the login node because there is no access to the proxy from the compute nodes.
Automatically Download from CRAN
To determine which repo you will be downloading from run the following in R:
getOption("repos")
By default this will be @CRAN@ which causes a package installation to prompt for a CRAN mirror. If you are using the Microsoft R Open version of R the repo will be similar to
https://mran.microsoft.com/snapshot/2016-04-01
To add packages to your personal R package directory (~/R/platform/version) that will be downloaded by R from the CRAN (default cran.r-project.org):
- Add the University Web proxy to your ~/.Renviron file (create the file if you don't already have one):
http_proxy=http://webproxy.its.manchester.ac.uk:3128
  - NB: The easiest way to add to or create a new file is to do the following once logged in to the CSF:
echo http_proxy=http://webproxy.its.manchester.ac.uk:3128 >> ~/.Renviron
- Now start R in the usual way. Note this is done on the login node when adding packages so that R has access to the proxy. Do not run R on the login node for any other reason than to add a package:
# Please check above for newer versions - this example was written for 3.1.0
module load apps/gcc/R/3.1.0
R
- Ask R to install the required package and answer y when asked if you wish to create a personal library:
> install.packages("thing")
Warning in install.packages("thing") :
  'lib = "/opt/gridware/apps/gcc/R/3.1.0/lib64/R/library"' is not writable
Would you like to create a personal library
~/R/x86_64-unknown-linux-gnu-library/3.1
to install packages into?  (y/n) y
- Select a UK mirror when prompted.
  - Note, you may have to select a mirror name that does not have [https] next to it.
- Once the package is installed, exit R and rerun it either in batch or via qrsh (interactively) on a compute node (see above). Do not continue to run computational work on the login node!
In the above instructions replace the module load command with the one appropriate to the R version you wish to use.
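The echo command in the proxy step above appends a new line every time it is run. A minimal sketch of a repeat-safe alternative follows; add_proxy is a hypothetical helper name, and the proxy URL is the one given in the instructions.

```shell
# Append the proxy setting to an Renviron file only if it is not already there.
# add_proxy is a hypothetical helper name used for illustration.
add_proxy() {
    file="$1"
    line="http_proxy=http://webproxy.its.manchester.ac.uk:3128"
    touch "$file"    # create the file if it does not exist yet
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
    grep -qxF "$line" "$file" || echo "$line" >> "$file"
}
```

It would be called once on the login node as add_proxy ~/.Renviron; running it again leaves the file unchanged.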
If you wish to specify a repo in the install.packages command instead of selecting it from a menu, try:
install.packages('thing', repos='http://www.stats.bris.ac.uk/R')
Library from Source
If you've downloaded an R library source file you can add it to your local workspace using the following commands (which assume the source package is in your home directory on the iCSF):
- Start R with extra command-line args (choose the version of R you require):
module load apps/gcc/R/3.2.0
R CMD INSTALL thing.x.y.z.tar.gz
* installing to library ‘/mnt/iusers01/support/mabcxyz1/R/x86_64-unknown-linux-gnu-library/3.2’
* installing *source* package ‘thing’ ...
** package ‘thing’ successfully unpacked and MD5 sums checked
** R
** data
** demo
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (thing)
- Now run R and test that the library can be loaded:
R
> library(thing)
The compiled library files will be saved in a directory named R in your home directory. It contains subdirectories for each version of R so if you want to use the library in different versions of R you will have to repeat the above commands for each version.
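To see which R versions already have a personal library, and roughly how much each holds, something like the sketch below could be used. It assumes the ~/R/platform/version layout described earlier on this page, with one subdirectory per installed package. list_r_libs is a hypothetical helper name.

```shell
# List personal R library directories (~/R/<platform>/<version>) and the
# number of entries in each. list_r_libs is a hypothetical helper name.
list_r_libs() {
    base="${1:-$HOME}"
    for dir in "$base"/R/*/*/; do
        [ -d "$dir" ] || continue                    # glob matched nothing
        count=$(ls "$dir" | wc -l | tr -d ' ')       # one entry per installed package
        echo "$dir $count"
    done
}

list_r_libs
```

Each output line shows a library directory followed by its entry count. A version that does not appear has no personal library yet.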
Adding BioConductor Packages
BioConductor packages can be installed in to your local R library (in your home directory) as follows (this example assumes use of R 3.3.0):
module load apps/gcc/R/3.3.0
R
source("https://bioconductor.org/biocLite.R")
biocLite("packagename")     # Give a biocLite package name: EG: "S4Vectors"

# You will see some output then be asked to install locally:
  'lib = "/opt/gridware/apps/gcc/R/3.3.0/lib64/R/library"' is not writable
Would you like to use a personal library instead?  (y/n) y     # Answer 'y'
Would you like to create a personal library
~/R/x86_64-pc-linux-gnu-library/3.3
to install packages into?  (y/n) y                             # Answer 'y'
The package will be downloaded and installed in to your local R library.
Listing Packages
To list the installed packages run:
installed.packages();
Further info
For software documentation and FAQs please consult the R Project website.
See also the Bioconductor website.
There is a university R user group and an external Manchester R group.
Updates
July 2016 - a version of rjava has been compiled for R version 3.3.0 and has all the java environment variables automatically set. It can be accessed using:
module load apps/gcc/rjava/0.9.8-R3.3.0
Aug 2018 - Added notes on parallel package and interaction with MS R Open (updated 14 Aug)
Aug 2017 - R3.4.1 installed.
Jan 2016 - R3.2.3 installed.
June 2014 - R3.1.0 installed and Bioconductor installed in to this version.
March 2013 - R2.15.3 installed.
March 2013 - bioconductor installed into both versions. There may be differences between the two as some 'packages' require a more recent version of R. For example, minfi is only in the R 2.15.3 install.