Hosted Data Sets

In cases where a data set is used by multiple University research groups, we may be able to create a local mirror which will enable fast access from local computational platforms.

Currently-Hosted Data Sets

Please note that since September 2021 the UK BioBank Research Analysis Platform (RAP) has provided cloud-hosted access to UK BioBank datasets and compute. This is now the recommended method of accessing and analysing large UK BioBank datasets, which are becoming too large to download and store locally.

We already hold some data from early releases of UK BioBank datasets. These remain available (e.g., for use on the Computational Shared Facility); see below. However, we are unlikely to store any further UK BioBank datasets centrally.

UK BioBank Full Release

The UK BioBank Genotyping and Imputation Data Release v3 (data for all 500,000 participants in UK BioBank) is now available for use on central compute platforms (the CSF and iCSF). It is also available as a storage share that can be mapped as a network drive on campus PCs / desktops.

Full details are provided here on: requesting access to datasets, available formats, and accessing the data on central compute and campus PCs.

UK BioBank Activity Data – NO LONGER AVAILABLE

24-Feb-2023 Update: The UK BioBank Activity Data is NO LONGER available. The MTA under which it was downloaded has come to an end and is not being renewed.

If you need to use the Activity Data, please consider using the UK BioBank Research Analysis Platform (RAP), where your compute runs in their cloud alongside the data.

GnomAD Dataset

The Genome Aggregation Database (gnomAD) is available for use on central compute platforms (the CSF, DPSF and iCSF). Please see the Broad Institute’s overview of gnomAD and this blog post for a detailed description of the data. The Broad Institute’s download page shows exactly what has been downloaded.

The data can be accessed immediately on the CSF, DPSF and iCSF at the path:

/mnt/data-sets/gnomeAD/
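
As a quick check that the gnomAD mount is visible from the machine you are logged in to, a minimal Python sketch along the following lines could be used. It assumes Python 3 is available on that machine. The helper name list_dataset_dir is hypothetical and is not part of any installed tool. The path is copied exactly as given above.

```python
import os

# Path copied from the text above; adjust if your platform mounts it elsewhere.
GNOMAD_DIR = "/mnt/data-sets/gnomeAD/"


def list_dataset_dir(path, limit=20):
    """Return up to `limit` sorted entry names in `path`.

    Returns None if `path` is not a directory or cannot be read,
    so callers can tell "missing or unreadable" apart from "empty".
    (Hypothetical helper, for illustration only.)
    """
    if not os.path.isdir(path) or not os.access(path, os.R_OK | os.X_OK):
        return None
    return sorted(os.listdir(path))[:limit]


if __name__ == "__main__":
    entries = list_dataset_dir(GNOMAD_DIR)
    if entries is None:
        print(f"{GNOMAD_DIR} is not readable from this machine")
    else:
        for name in entries:
            print(name)
```

If the script reports that the directory is not readable, you are probably not on one of the platforms listed above.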

All users of the data should read the Broad Institute’s Terms of Use and follow the citation request on that page.

Download Tools

Downloading to Research Data Storage shares should be done on the RDS-SSH servers as this can be done entirely over fast data-centre networks. Downloading to a storage share mapped on your desktop will usually be slower because the campus network to your desktop is slower.

A number of download tools from the UK BioBank and the European Genome-phenome Archive (EGA) are available on the RDS-SSH servers. These are automatically in your PATH upon login: simply log in and run the commands you would normally run. The tools installed are:

# UKBioBank tools
ukbmd5        # Calculate size and MD5 of a file
ukbconv       # Convert unpacked UKB data to other formats
ukbunpack     # Unpack (decrypt and decompress) UKB data
ukbfetch      # The bulk data download tool
ukblink       # Download Returned-datasets and link between Applications
ukbgene       # Download approved genetic data. This tool supersedes a tool named gfetch.

egaclient     # EGAdemoClient tools (will automatically load the EgaDemoClient.jar file)
egacryptor

ascp          # Aspera downloader tools (uses the default aspera private key)
ascp_noid     # You should add the '-i PRIVATE-KEY-FILE' flag to supply a key

basemount     # Illumina BaseSpace tools
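
To confirm that the tools listed above are on your PATH after logging in, a small Python sketch such as the following may help. It assumes Python 3 is available on the server. The function name find_tools is hypothetical. The tool names are copied from the list above, and the script only looks them up; it does not run any of them.

```python
import shutil

# Tool names copied from the list above.
TOOLS = [
    "ukbmd5", "ukbconv", "ukbunpack", "ukbfetch", "ukblink", "ukbgene",
    "egaclient", "egacryptor",
    "ascp", "ascp_noid",
    "basemount",
]


def find_tools(names):
    """Map each tool name to its full path on PATH, or None if not found.

    (Hypothetical helper, for illustration only.)
    """
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools(TOOLS).items():
        print(f"{name:12} {path or 'NOT FOUND'}")
```

A tool reported as NOT FOUND may mean that you are not logged in to an RDS-SSH server, or that your login environment has changed your PATH.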

For more information on what is installed on the rds-ssh.itservices.manchester.ac.uk server, please see our RDS-SSH server download tools documentation.

To obtain an account on the RDS-SSH service please email its-ri-team@manchester.ac.uk. This system has access to your Research Data Storage areas and CSF / iCSF home directory, and also to common UK BioBank data-download sites.

Last modified on February 24, 2023 at 10:39 am by George Leaver