R & bioconductor
June 2023: The proxy is no longer available. To download data from external sites, please do so from a batch job or use an interactive session on a backend node by running qrsh -l short. You DO NOT then need to load the proxy modulefiles. Please see the qrsh notes for more information on interactive use.
R is a free software environment for statistical computing and graphics.
The following versions are installed on the CSF:
Standard open source R:
- R 4.4.1
- R 4.4.0
- R 4.3.1
- R 4.2.2
- R 4.1.2
- R 4.1.0
- R 4.0.2
- R 3.6.2
- R 3.6.1
- R 3.6.0
- R 3.5.2
- R 3.4.2
Note that Bioconductor is available via a separate modulefile – see below.
You may also want to try the Microsoft R Open version installed on the CSF – this version provides automatic parallelism of various maths / matrix routines in R.
You can also install packages to your own home directory using the Adding Packages instructions below (in conjunction with information from the Bioconductor website). Alternatively, we may be able to add them to the central install: contact its-ri-team@manchester.ac.uk.
Restrictions on use
There are no restrictions on access to R as it is a free piece of software released under a GNU license. All users should familiarise themselves with the licensing information available via the R website.
All R jobs, aside from very short test jobs (e.g. those lasting less than one minute) must be submitted to the batch system, SGE.
Set up procedure
We now recommend loading modulefiles within your jobscript so that you have a full record of how the job was run. See the example jobscript below for how to do this. Alternatively, you may load modulefiles on the login node and let the job inherit these settings.
Load one of the following modulefiles:
Standard open source R:
    module load apps/gcc/R/4.4.1     # Includes gcc 13.3.0 (helps package installs)
                                     # For BioConductor see below notes
    module load apps/gcc/R/4.4.0     # Includes gcc 12.2.0 (helps package installs)
                                     # For BioConductor see below notes
    module load apps/gcc/R/4.3.1     # Includes BioConductor 3.16, gcc 9.3 (helps package installs)
    module load apps/gcc/R/4.2.2     # Includes BioConductor 3.16, gcc 9.3 (helps package installs)
    module load apps/gcc/R/4.1.2     # Includes BioConductor 3.14, gcc 9.3 (helps package installs)
    module load apps/gcc/R/4.1.0     # Includes BioConductor 3.13, gcc 8.2 (helps package installs)
    module load apps/gcc/R/4.0.2     # Includes BioConductor 3.11, gcc 8.2 (helps package installs)
    module load apps/gcc/R/3.6.2     # Includes BioConductor 3.10, gcc 8.2 (helps package installs)
    module load apps/gcc/R/3.6.1     # Includes BioConductor 3.9, gcc 8.2 (helps package installs)
    module load apps/gcc/R/3.6.0     # Includes BioConductor 3.9, gcc 8.2
    module load apps/R/3.5.2         # Does not include BioConductor - see modulefile below
    module load apps/R/3.4.2         # Does not include BioConductor - see modulefile below
BioConductor Installation
The central installations of R versions 4.4.0, 4.4.1 and later do not include BioConductor.
Users can install BioConductor themselves by running the following commands in an interactive session:
    # On the CSF login node, start an interactive session
    qrsh -l short

    # You'll now be on a compute node. You can run your commands directly, and R
    # will be able to download packages from the outside world:
    module load apps/gcc/R/4.4.1     # Choose your required version
    R
    install.packages("BiocManager")
    q()

    # To return to the login node, exit from the interactive (qrsh) session
    exit
Batch jobs can now use your local installation of BioConductor. For more information on using BioConductor, please see the BioConductor installation documentation.
Older versions of R require a separate BioConductor modulefile. To use BioConductor, load:
    # This is NOT needed for R 3.6 and newer!
    module load libs/bioconductor/3.4
This will load the R modulefile if not already loaded.
Running the application
Note that using R CMD BATCH, as below, may save and restore your workspace, which may not be what you want. Using Rscript instead avoids that.
Serial Batch job
Write a submission script, for example:
    #!/bin/bash --login
    #$ -cwd                   # Run job from current directory

    ## We now recommend loading the modulefile in the jobscript. Change the version as needed.
    module load apps/R/3.4.2

    R CMD BATCH --no-restore my_test.R my_test.R.o$JOB_ID

    # Notes on the command above:
    #
    # R must be called with both the "CMD" and "BATCH" options, which tell it
    # to run an R program, in this case my_test.R, instead of presenting
    # an interactive prompt.
    #
    # --no-restore: do not restore any previously saved objects. Ensures you don't
    # load in possibly large objects from previous runs of R. If jobs are failing
    # due to lack of memory please add this flag, or alternatively use --vanilla,
    # which applies the following:
    # --no-save, --no-restore, --no-site-file, --no-init-file and --no-environ
    #
    # The final argument, "my_test.R.o$JOB_ID", tells R to send output to a file
    # with this name, unique to the current job.
Submit the job using:
    qsub runmyRjob.qsub
where runmyRjob.qsub is the name of your job script.
By default, graphical output from batch jobs is sent to a file called Rplots.pdf. See below for more info on plotting to an image file.
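As noted earlier, Rscript can be used in place of R CMD BATCH. A minimal sketch of an equivalent serial jobscript is shown here, written out with a here-document so the snippet can be pasted into a shell as-is. The file name rscript_job.qsub is a hypothetical choice; the modulefile and R script names are simply reused from the example above.

```shell
# Write a jobscript that runs my_test.R with Rscript instead of R CMD BATCH.
# The quoted 'EOF' stops the shell expanding $JOB_ID now; the batch system
# sets it when the job runs.
cat > rscript_job.qsub <<'EOF'
#!/bin/bash --login
#$ -cwd               # Run job from current directory

module load apps/R/3.4.2

# Output and error messages are redirected to a file unique to this job.
Rscript my_test.R > my_test.R.o$JOB_ID 2>&1
EOF
```

The resulting file can then be submitted with qsub rscript_job.qsub in the usual way.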
Parallel Batch Job (single node, multi-core)
Please note that your R code must be parallelised (usually with the ‘parallel’ library) before you submit to more than 1 core. Asking for more than 1 core does not mean your code will automatically use them.
    #!/bin/bash --login
    #$ -cwd                   # Run job from current directory
    #$ -pe smp.pe 12          # Number of cores to use. Can be between 2 and 32.

    module load apps/R/3.4.2

    R CMD BATCH --no-restore my_test.R my_test.R.o$JOB_ID

    # See the serial jobscript example above for a description
    # of the command-line flags.
Then submit your job to the batch system:
    qsub runmyRjob.qsub
where runmyRjob.qsub is the name of your job script.
The various libraries for performing parallel computation in R each have their own way of setting the number of cores to use within R. This will sometimes default to the total number of cores on the node. You need to make sure that your code is using no more than the number of cores you’ve requested in your job script, otherwise your job is liable to be killed without warning.
You can return the number of cores you requested in your jobscript as a variable, using the code:
    numCoresAllowed <- as.integer(Sys.getenv("NSLOTS", unset = "1"))
(Sys.getenv returns a character string, hence the conversion with as.integer. If you’re running the job interactively or on your local machine, the value specified in “unset” will be returned.)
You should use this value when you set the number of cores. For example, if you’re using the “doMC” package, you’d use:
    registerDoMC(cores = numCoresAllowed)
Some libraries, e.g. the “parallel” library, will take the number of cores to use from an environment variable (e.g. MC_CORES) directly. You can set the environment variable in your job script, before the R CMD BATCH ... line, for example by exporting MC_CORES with the value of $NSLOTS.
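For instance, a line like the following could go in the jobscript. This is a minimal sketch: the ${NSLOTS:-1} fallback is an added assumption, so that the same line also works when the script is run somewhere NSLOTS is not set.

```shell
# NSLOTS is set by the batch system to the number of cores requested with -pe.
# Fall back to 1 core if NSLOTS is not set (e.g. when testing outside a job).
export MC_CORES="${NSLOTS:-1}"
echo "MC_CORES is set to ${MC_CORES}"
```

Because the value comes from NSLOTS, changing the core count on the -pe line is enough; nothing else in the jobscript needs editing.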
The section below contains some further considerations that apply only if you’re using Microsoft R Open (MRO), or an R that uses OpenBLAS or MKL.
Parallelism
R’s “parallel” library can be used to run sections of code in parallel. The mclapply and mcmapply functions provide parallelised versions of lapply and mapply. The number of cores these will use can be set using the mc.cores option, or (preferably) by setting the environment variable MC_CORES in your job script.
If you are using “parallel” or another library to parallelise your code, and you are using Microsoft R Open, you will need to take care not to double-parallelise your code. Each worker spawned by parallel will use OMP_NUM_THREADS threads. The number of cores you request in your job script should be:
    MC_CORES x OMP_NUM_THREADS
For example, if we wish to run a 16 core job, consisting of 8 parallel threads using mclapply, each of which will use 2 cores for matrix computation:
    #!/bin/bash
    #$ -S /bin/bash
    #$ -cwd                   # Run job from current directory
    #$ -V                     # Inherit the environment (module) settings
    #$ -pe smp.pe 16          # Number of cores to use. Can be between 2 and 24.

    export OMP_NUM_THREADS=2  # For MKL library
    export MC_CORES=8         # For parallel library

    R CMD BATCH --no-restore my_test.R my_test.R.o$JOB_ID
The optimum division of threads between mclapply and the MKL library will depend on the nature of your workload. If the majority of your code is embarrassingly parallel and can be parallelised using mclapply, we would expect setting OMP_NUM_THREADS=1 and MC_CORES=$NSLOTS to provide optimum performance. This way all of the code in mclapply will be run in parallel, not just the matrix operations. Even using a single core, the MKL library is often substantially faster than R’s standard library when performing matrix operations.
Remember that MKL will, by default, use all the node’s cores, so you must always set OMP_NUM_THREADS if you are using Microsoft R Open.
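The split between the two settings can also be computed in the jobscript rather than hard-coded. The sketch below derives MC_CORES from NSLOTS and a chosen OMP_NUM_THREADS, rounding down, so that their product stays within the cores requested (provided OMP_NUM_THREADS itself is no larger than NSLOTS). The helper name workers_for and the fallback of 16 slots outside a batch job are illustrative assumptions.

```shell
# Print the number of parallel workers that fit in a job:
# slots divided by threads-per-worker, rounded down, but never less than 1.
workers_for() {
    local slots=$1 threads=$2
    local workers=$(( slots / threads ))
    if [ "$workers" -lt 1 ]; then
        workers=1
    fi
    echo "$workers"
}

# Threads used by each worker for matrix operations (MKL)
export OMP_NUM_THREADS=2
# Number of workers used by the parallel library
export MC_CORES=$(workers_for "${NSLOTS:-16}" "$OMP_NUM_THREADS")
echo "Using ${MC_CORES} workers x ${OMP_NUM_THREADS} threads"
```

With a 16 core job and OMP_NUM_THREADS=2 this reproduces the 8 x 2 split used in the jobscript above.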
Running R interactively
It is expected that most use of R on the CSF will be in batch mode, i.e., computational jobs will be submitted to the batch system and there will be no subsequent user interaction. However, if required, R can be run on the CSF using either the R command line, or GUI.
Do not simply login to the CSF and start R — your jobs will be killed by the system administrator! The only exception to this is when installing a package from a mirror in R (see below).
To run R jobs interactively on the CSF, make use of the qrsh facility, which queues interactive jobs. To start the R command line type:
    # Load the modulefile on the login node (use your required version)
    # If you loaded other modulefiles to do a package installation (e.g., nlopt)
    # you should also load them here.
    module load apps/R/3.4.2

    # Start R in an interactive session on a compute node
    qrsh -cwd -V -l short R --vanilla --interactive
    #
    # --vanilla: can be "--save", "--no-save" or "--vanilla"
    # --interactive: needed if you will be starting the R GUI (Rcmdr)
- Use one of --save, --no-save or --vanilla.
- If you want to use the GUI, ensure you type --interactive, otherwise the GUI will not start and you will see an error message like:
    The Commander GUI is launched only in interactive sessions
To start the GUI, enter library(Rcmdr) at the R command line:
    library(Rcmdr)
    Loading required package: tcltk
    Loading Tcl/Tk interface ... done
Adding packages
You may wish to use a particular package (library) in your code. The central installations of R may already have that package installed. If not, you can install it yourself (it will go in to a folder in your home directory).
Check if a package is already installed
To determine if a package is already installed, simply try loading it in R. For example, on the login node:
    R
    > library(thing)
    Error in library(thing) : there is no package called ‘thing’
    # (if you get no output it usually means the library is already installed!)
This tells us we need to install a package/library named ‘thing’. See below for how to do that. Installing BioConductor packages is also possible and this is also covered below.
Note: For the purposes of adding packages you can run R on the login node. But this is the only time you should run R on the login node. All data processing, development and testing must be run in batch jobs or in an interactive session on a compute node (see above for how to run R).
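The same check can be made without starting an interactive R prompt. This is a sketch only: the function name r_has_package is hypothetical, and it assumes an R modulefile is loaded so that Rscript is on your PATH. It relies on R's requireNamespace function, which returns FALSE rather than stopping with an error when a package is missing.

```shell
# Return 0 if the named R package can be loaded, 1 if it cannot,
# and 2 if Rscript is not found (no R modulefile loaded).
r_has_package() {
    if ! command -v Rscript >/dev/null 2>&1; then
        echo "Rscript not found: load an R modulefile first" >&2
        return 2
    fi
    # The package name is passed as a trailing argument and read with commandArgs
    Rscript -e 'ok <- requireNamespace(commandArgs(trailingOnly = TRUE)[1], quietly = TRUE)' \
            -e 'quit(save = "no", status = if (ok) 0 else 1)' "$1"
}
```

For example, r_has_package thing && echo "thing is installed" prints the message only when the package is available.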
Install a package by Automatically Downloading from CRAN (the default repo)
To add packages to your personal R package directory (~/R/platform/version), downloading from CRAN:
To start an interactive session:
    # From the login node, start an interactive session:
    qrsh -l short

    # Once logged in to a compute node, load your required version of R (see above)
    module load apps/gcc/R/4.3.1
    #
    # Note: you may need to load other modulefiles to complete a package installation.
    # If your install fails, look at the errors. You can exit from R, load some more
    # modulefiles, then run R again and try the install. Common packages are nlopt and
    # cmake - see sections below for more details.

    # Note: you may have old proxy settings in an ~/.Renviron file. You'll need to remove these:
    cat ~/.Renviron
    #
    # If you see the following, you do not need to do anything!
    #   cat: .Renviron: No such file or directory
    #
    # If you see some lines containing
    #   http_proxy=http://proxy.man.ac.uk:3128
    #   https_proxy=https://proxy.man.ac.uk:3128
    #
    # Delete these lines or place a # at the start of each line.

    # If your ~/.Renviron file contains only the above proxy lines
    # you can delete the file
    rm ~/.Renviron
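If you would rather not edit the file by hand, the proxy lines can be commented out with sed. A minimal sketch, assuming the lines begin with http_proxy= or https_proxy= exactly as shown above; the function name disable_proxy_lines is hypothetical, and a backup copy of the file is kept with a .bak suffix.

```shell
# Comment out http_proxy= and https_proxy= lines in an Renviron-style file.
# A backup of the original is written alongside it with a .bak suffix.
disable_proxy_lines() {
    local file=$1
    if [ ! -f "$file" ]; then
        echo "$file not found: nothing to do"
        return 0
    fi
    sed -i.bak -E 's/^(https?_proxy=)/#\1/' "$file"
}
```

Running disable_proxy_lines ~/.Renviron leaves any other settings in the file untouched.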
Now start R in the usual way, by running R at the command prompt.
Then ask R to install the required package and answer y when asked if you wish to create a personal library:
    > install.packages("thing")
    Warning in install.packages("thing") :
      'lib = "/opt/apps/apps/gcc/R/3.6.1/lib64/R/library"' is not writeable
    Would you like to use a personal library instead? (y/n) y     # Answer 'y'
    Would you like to create a personal library                   # (if first ever package!)
    ~/R/x86_64-pc-linux-gnu-library/3.6
    to install packages into? (y/n) y                             # Answer 'y'
Select a UK mirror when prompted (e.g., UK Bristol which is near the bottom of the list.)
Once the package is installed, you can now check it has installed correctly by loading the library:
    library(thing)
    # No output (or some library-specific info) means it is installed correctly.
You can now exit R and then exit from your interactive session, or install more libraries by repeating the above steps.
    q()
    # Now exit your interactive session to return to the login node
    exit
Please remember that your usual R usage, to run scripts and process data must be done in batch or via qrsh (interactively) on a compute node (see above). Do not continue to run computational work on the login node!
In the above instructions replace the module load command with the one appropriate to the R version you wish to use.
If you wish to specify a mirror in the install.packages command instead of selecting it from a menu, try:
    install.packages('thing', repos='http://www.stats.bris.ac.uk/R')
Installing a Library from a source package
If you’ve downloaded an R library source file you can add it to your local workspace using the following commands (which assume the source package is in your home directory on the CSF):
Start R with extra command-line args (choose the version of R you require):
    module load apps/gcc/R/3.6.1
    R CMD INSTALL thing.x.y.z.tar.gz
    * installing to library ‘/mnt/iusers01/support/mabcxyz1/R/x86_64-unknown-linux-gnu-library/3.6’
    * installing *source* package ‘thing’ ...
    ** package ‘thing’ successfully unpacked and MD5 sums checked
    ** R
    ** data
    ** demo
    ** preparing package for lazy loading
    ** help
    *** installing help indices
    ** building package indices
    ** testing if installed package can be loaded
    * DONE (thing)
Using an installed package
Now run R and test that the library can be loaded:
    R
    > library(thing)
The compiled library files will be saved in a directory named R in your home directory. It contains subdirectories for each version of R, so if you want to use the library in different versions of R you will have to repeat the above commands for each version.
nloptr dependency
Some packages fail to install because they depend on the nloptr R package. Trying to install that specific package often fails due to a dependency on the nlopt library, which R fails to compile. So we have provided this as a separate modulefile. For example:
    # This will fail due to a failure to compile nloptr
    module load apps/gcc/R/4.3.2
    R
    install.packages("nloptr")       # Other R packages that depend on this one will also fail
    q()

    # The solution is to load an extra modulefile:
    module load apps/gcc/R/4.3.2
    module load libs/gcc/nlopt/2.6.2
    R
    install.packages("nloptr")
Note: you will also need to load the nlopt modulefile in your jobscript when submitting jobs to the batch system.
cmake dependency
If your package requires cmake to complete its installation, you can load the cmake modulefile before running R, then R will be able to find it:
    module load apps/gcc/R/4.3.2
    module load tools/gcc/cmake/3.25.1     # Other versions of cmake are available
    R
    install.packages("mice")
Note: you will likely NOT need to load the cmake modulefile in your jobscript when submitting jobs to the batch system. cmake is usually only used during the installation, not when you run R.
Please see the cmake page for available versions.
Adding BioConductor Packages – R 3.6.0 and newer
Note: This is NOT the method used for older versions of R (3.5 and older). See below for that.
The ‘manager’ for bioconductor has changed in version 3.6.0. Details are given here on how to install BioConductor packages in R 3.6.0 (and up).
    # Check the Bioconductor version in use
    BiocManager::version()

    # See which packages are available to install
    BiocManager::available()

    # Install a package to your home directory
    BiocManager::install(c("esATAC"))
    ## In this case esATAC (replace that with the package you are interested in)
    ## You will be prompted to install to a local (your home) directory as below

    Bioconductor version 3.9 (BiocManager 1.30.4), R 3.6.0 (2019-04-26)
    Installing package(s) 'esATAC'
    Warning in install.packages(pkgs = doing, lib = lib, repos = repos, ...) :
      'lib = "/opt/apps/apps/gcc/R/3.6.0/lib64/R/library"' is not writable
    Would you like to use a personal library instead? (yes/No/cancel) y     # Answer 'y'
    Would you like to create a personal library
    ‘~/R/x86_64-pc-linux-gnu-library/3.6’
    to install packages into? (yes/No/cancel) y                             # Answer 'y'
Adding BioConductor Packages – R 3.5.2 and R 3.4.2
Note: This is NOT the method used for newer versions of R (3.6 and newer). See above for that.
BioConductor packages can be installed in to your local R library (in your home directory) as follows:
    # This will automatically load the R modulefile as well
    module load libs/bioconductor/3.4
    R
    source("https://bioconductor.org/biocLite.R")
    biocLite("packagename")          # Give a biocLite package name: EG: "S4Vectors"

    # You will see some output then be asked to install locally:
    'lib = "/opt/gridware/apps/R/3.4.2/lib64/R/library"' is not writable
    Would you like to use a personal library instead? (y/n) y     # Answer 'y'
    Would you like to create a personal library
    ~/R/x86_64-pc-linux-gnu-library/3.4
    to install packages into? (y/n) y                             # Answer 'y'
The package will be downloaded and installed in to your local R library.
Using BioConductor Packages
BioConductor packages have to be loaded like any other package if you’ve previously installed them. For example, assuming you have installed a BioConductor package named bioThing, to use it in your code use:
    # Load/use a BioConductor package named 'bioThing' previously installed
    library(bioThing)
rjags is a popular package for working with Bayesian graphical models using MCMC. It is also used by other packages such as JMbayes. The rjags package relies on a library named JAGS. This is already installed on the CSF so you can make it available to R by loading its modulefile. This will allow you to then install rjags and related packages such as JMbayes. If you are using R 3.6.2 or later you must load the JAGS modulefile that is compatible with the GCC 8.2.0 compiler (which was used to install R 3.6.2). Here is a complete example of installing JMbayes, which will install rjags in your local R directory in your home directory:
    # On the login node, start an interactive session
    qrsh -l short

    # On the interactive compute node
    module load apps/gcc/R/3.6.2                   # Uses GCC 8.2.0
    module load apps/gcc/jags/4.3.0-gcc-8.2.0      # Use the gcc-8.2.0 compatible version
    R
    install.packages("JMbayes")
    library(JMbayes)
You will be asked to select a mirror site from which to download the JMbayes packages (we typically use the Bristol UK mirror).
Once the package has been installed, the library(JMbayes) command should be used each time you wish to use the package. You will also need to load the jags modulefile, as well as the R modulefile, in your jobscripts.
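For example, a jobscript for an R script that uses JMbayes might look like the sketch below. The modulefile names are those used in the installation example above; the file names jags_job.qsub and my_jags_model.R are hypothetical.

```shell
# Write a jobscript that loads both the R and JAGS modulefiles before running R.
# The quoted 'EOF' keeps $JOB_ID unexpanded until the job runs.
cat > jags_job.qsub <<'EOF'
#!/bin/bash --login
#$ -cwd               # Run job from current directory

module load apps/gcc/R/3.6.2                  # Uses GCC 8.2.0
module load apps/gcc/jags/4.3.0-gcc-8.2.0     # JAGS library used by rjags

R CMD BATCH --no-restore my_jags_model.R my_jags_model.R.o$JOB_ID
EOF
```

If the jags modulefile is left out, loading rjags inside the job is likely to fail because the JAGS library cannot be found.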
Listing Packages
To list the installed packages run:
To list loaded packages run:
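One way to do both from the shell is sketched below. installed.packages() and .packages() are standard R functions, but other approaches (such as sessionInfo()) work too. The snippet assumes an R modulefile is loaded, and only prints a message if Rscript is not found.

```shell
if command -v Rscript >/dev/null 2>&1; then
    # Names of all installed packages (central install plus your personal library)
    Rscript -e 'print(rownames(installed.packages()))'

    # Names of the packages currently attached in the R session
    Rscript -e 'print((.packages()))'
else
    echo "Rscript not found: load an R modulefile first"
fi
```

Inside an interactive R session the same two expressions can be typed directly at the prompt.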
Removing Packages
Should you need to delete an installed package:
    module load apps/gcc/R/version
    R
    remove.packages('thing')
This will remove it from your local library of packages, for the version of R you are currently using. If you’ve used several versions of R over time and have installed the package with each one, you would need to load the modulefile for each version and remove the package from each one in turn.
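To see which versions of R have their own copy of a package, you can search your personal library directory. A minimal sketch, assuming the ~/R/platform/version/package layout described earlier; the function name find_pkg_copies is hypothetical.

```shell
# List every copy of a package under a personal R library tree.
# Usage: find_pkg_copies PACKAGE [LIBRARY_ROOT]   (LIBRARY_ROOT defaults to ~/R)
find_pkg_copies() {
    local pkg=$1
    local root=${2:-$HOME/R}
    if [ ! -d "$root" ]; then
        return 0
    fi
    # Layout: ROOT/<platform>/<R version>/<package>
    find "$root" -mindepth 3 -maxdepth 3 -type d -name "$pkg" | sort
}
```

Each path printed by find_pkg_copies thing shows an R version for which the package would need to be removed separately.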
If you wish to plot graphs to image files, for example, you will need to use the cairo plotting device.
The following example generates a histogram and plots it to a .png file and a .jpg file. The jobscript is:
    #!/bin/bash --login
    #$ -cwd
    #$ -l short

    module load apps/gcc/R/4.4.0

    R CMD BATCH --no-restore plot.R
The R-code is:
    # R script to demonstrate plotting to image files on the CSF

    # Enable cairo device (needed to prevent 'X11 not available' errors)
    options(bitmapType='cairo')

    # Initialize some data to plot
    x <- rnorm(100)

    # Save a png plot
    png(file="hist.png")
    hist(x)
    rug(x,side=1)
    dev.off()

    # How about jpg
    jpeg(file="hist.jpg")
    hist(x)
    rug(x,side=1)
    dev.off()

    # R 3.6.1 (and later) can also do tiff
    tiff(file="hist.tif")
    hist(x)
    rug(x,side=1)
    dev.off()
Now to view your images while on the CSF, use the Eye of GNOME (eog) image viewer:
    # List the image files created by the above example
    ls hist.*
    hist.jpg  hist.png  hist.tif

    # Use the image viewer named 'eog' (Eye of GNOME) on the CSF login node
    eog hist.png
If you need other image file formats you can then convert your PNG file using the convert command-line tool, which is available on the login node and can also be run in your jobscript (note that convert is a Linux command-line program, not an R function):
    # Using the hist.png example file from the above R script, convert it to another format:
    convert hist.png hist.tif        # R 3.6.1 can write tif files directly (see above) but older versions can't

    # How about a .pdf
    convert hist.png hist.pdf

    # Now view a .pdf on the login node
    evince hist.pdf
Further Info
- R website
- Bioconductor website
- There is a University R user group and an external Manchester R group.
- R, Open Research, and Reproducibility by Andrew Stewart course materials.
3.6.1 was installed October 2019.
3.6.0 was installed June 2019.
3.5.2 was installed Feb 2019.