UK BioBank Full Release

Introduction

The UK BioBank Genotyping and Imputation Data Release (data for all 500,000 participants in UK BioBank) is now available for use on the central compute platforms (the CSF and iCSF). It is also available as a storage share that can be mapped as a network drive on campus PCs / desktops.

The central copy provided by Research IT exists to save on storage space (at least 12TB) – your project can process the files directly from the central read-only storage areas, so you do not need to request storage for your own copy.

Your project will need its own storage for any output/result files you generate.

The UKBioBank files are already decrypted so you can begin working with the data immediately once access has been provided. However, we do require your project to have been approved by the UKBioBank for access to the datasets.

If you are submitting a project proposal to UKBioBank, you should indicate in the application that you intend to use an institution-held copy of the dataset (see the Requesting Access advice below).

For advice on processing the data on the CSF or iCSF (once your access to the data has been approved) please see the UK BioBank modulefile page in the CSF docs.

Future Downloads

We are very unlikely to download further UK BioBank datasets to be held centrally – they are simply too large. Instead you are encouraged to use their cloud-based Research Analysis Platform.

For general information about the UK BioBank genetic dataset releases, please see the UK BioBank Genetic Data Timeline.

Important Updates – PLEASE READ

14-March-2018: The updated imputed dataset (v3, study id EGAD00010001474) is now available on the CSF and iCSF. The instructions given below have been updated to provide more details about this version of the data. Further information on what is available has also been provided by the UKBioBank in the FAQ document (pdf).

27-July-2017: The UKBioBank has released a notice alerting users to a problem with the supplied imputed dataset. Please review the notice if using this data. This problem has now been fixed with the v3 release (see above).

Available Files – UPDATED March 2018

EGAD00010001474
This is the corrected imputed dataset (v3).

It replaces the previous imputed dataset (v2) EGAD00010001225 which should NO LONGER BE USED. The UKBioBank found errors in this data and released EGAD00010001474 to replace it.

EGAD00010001497
This is the genotyped dataset, previously named EGAD00010001226. It is identical to the previously released EGAD00010001226 dataset, so we have NOT downloaded EGAD00010001497. Instead, a shortcut / symlink is available in the storage so that you can also access the genotyped data using the new name EGAD00010001497.

All downloaded and decrypted files have had their md5 checksums verified – all files have been written to disk correctly. Note that gzip-compressed versions of the new EGAD00010001474 dataset are not available – the dataset was only provided in uncompressed form. The EGAD00010001497 (previously named EGAD00010001226) dataset DOES contain both gzip-compressed and uncompressed files.

Requesting Access

The following advice has been provided by the UKBioBank Chief Information Officer concerning access to the data:

  • UKBioBank request that an Applicant Principal Investigator (PI) informs UKBioBank of their intent to use an institute-held dataset during the application review process, so that they can track who is using what and where.
  • The responsibility for who gets access to the dataset lies with the PI and it is part of their obligations under the Material Transfer Agreement that they maintain the list of collaborators who have access to the dataset.
  • Research IT, if approached by a PI requesting access to the dataset, will send a quick email request to the UKBioBank Access Team to confirm that the PI has an approved data release application in place. Research IT will send the Application ID number and the PI or Lead Collaborator name to UKBioBank. UKBioBank can then simply confirm that the PI has an approved application, and UKBioBank will know that the PI is seeking to use our institute-held dataset.
  • This should be done at the ‘application level’ – once Research IT knows a PI has an approved request, we can respond to requests from the PI to provide specific individuals with access. The responsibility remains with the PI to request access only for people they have named on their application.

Some useful further advice:

  • Please remove any passwords from your documentation when sending to Research IT.
  • Visiting researchers from another institute (or country) can only be given access if named on an executed MTA.
  • The rule of thumb is that for a researcher to access our copy of the data, they must first be named on the MTA and have their access to the data (no matter where it is held) approved by the UKBioBank. Only then can we grant access to our copy of the data.

Please email your request to access the data, together with a copy of the MTA, to its-ri-team@manchester.ac.uk

For details on registering with the UK BioBank / European Genome Archive please see the UK BioBank Full Release FAQ (hosted by the UK BioBank).

Accessing the data

Once we have processed your request and confirmed that your access has been set up, please follow the instructions below to access the data:

From central compute platforms (CSF, iCSF)

To access the UK BioBank data on the central compute platforms you must be added to a group on those systems that has the correct access permissions. To check, run the following command on the login node:

groups

It should report dataset-ukbiobank-full in the list of your groups. Then you can use a modulefile on the CSF and iCSF to make accessing the datasets easier:

# New data as of March 2018. Uses the corrected v3 Imputation data from study id EGAD00010001474.
module load tools/env/ukbiobank-full-release-2018

# Original July 2017 data. Uses the incorrect v2 Imputation data from study id EGAD00010001225.
module load tools/env/ukbiobank-full-release

Please see the CSF’s UKBioBank modulefile documentation for further information, including where the data is kept, how to access it in job scripts, and an example job array for processing multiple datasets.
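The group check and module load described above can be combined into one short script. This is only a minimal sketch: the helper name has_group is hypothetical, and the module load line is left commented out because the module command exists only on the CSF and iCSF.

```shell
#!/bin/bash
# Sketch: confirm group membership before loading the UK BioBank modulefile.
REQUIRED_GROUP="dataset-ukbiobank-full"

# has_group GROUP "LIST": succeeds if GROUP appears as a whole word in the
# space-separated LIST (hypothetical helper, not part of any CSF tooling).
has_group() {
    local wanted="$1" list="$2" g
    for g in $list; do
        if [ "$g" = "$wanted" ]; then
            return 0
        fi
    done
    return 1
}

if has_group "$REQUIRED_GROUP" "$(groups)"; then
    echo "OK: you are a member of $REQUIRED_GROUP"
    # On the CSF or iCSF you could now load the March 2018 (v3) modulefile:
    # module load tools/env/ukbiobank-full-release-2018
else
    echo "Not a member of $REQUIRED_GROUP; wait for confirmation from Research IT" >&2
fi
```

Matching whole words matters here: a plain substring search would also accept a differently named group that merely begins with dataset-ukbiobank-full.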

From on-campus PCs

To access the UK BioBank data on a campus PC you must be added to the storage share’s access control group. Once added to the group, map a network drive using the following path to access the data:

\\nasr.man.ac.uk\nonfacrss$\unsnapped\replicated\data-sets\ukbiobank\full-release

Please ensure you have received confirmation from us that you have been given access to the above storage areas before attempting to access them (see above for how to request this). Once you have access, you can find further details about the dataset file formats and descriptions in a file called ukb_genetic_file_description.txt, which is also available online (plain text, hosted by UK BioBank).

Mapping Files

Processing the data requires mapping files which are specific to your research work. You should download these files from the UK BioBank / EGA using your own EGA download account. You won’t be able to get these unless your project has been approved by the UK BioBank.

Various download tools for Linux are installed on our Research Data Storage SSH (rds-ssh) gateway. Please email its-ri-team@manchester.ac.uk if you require access to this server.
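After downloading your own mapping files, you may wish to check that they arrived intact, in the same spirit as the md5 verification applied to the central copy. The following is a minimal sketch only. It assumes you have an expected md5 value for each file from the data provider (how that value is supplied is not covered here), and the function name verify_md5 is hypothetical.

```shell
#!/bin/bash
# Sketch: verify a downloaded file against an expected md5 checksum.

# verify_md5 FILE EXPECTED_MD5: succeeds only if FILE is readable and its
# md5 digest equals EXPECTED_MD5 (lowercase hex, as printed by md5sum).
verify_md5() {
    local file="$1" expected="$2" actual
    [ -r "$file" ] || return 1
    actual=$(md5sum "$file" | awk '{print $1}')
    [ "$actual" = "$expected" ]
}

# Optional command-line use: ./verify.sh FILE EXPECTED_MD5
if [ "$#" -eq 2 ]; then
    if verify_md5 "$1" "$2"; then
        echo "OK: $1"
    else
        echo "MISMATCH: $1" >&2
        exit 1
    fi
fi
```

A file that fails the check should be downloaded again rather than used.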

Further Information

Last modified on November 14, 2023 at 2:01 pm by George Leaver