UK BioBank Full Release
Introduction
The UK BioBank Genotyping and Imputation Data Release (data for all 500,000 participants in UK BioBank) is now available for use on the central compute platforms (the CSF and iCSF). It is also available as a storage share that can be mapped as a network drive on campus PCs / desktops.
The central copy provided by Research IT is there to save on storage space (at least 12TB) – your project can process the files from the central read-only storage areas so you do not need to request storage for your own copy.
Your project will need its own storage for any output/result files you generate.
The UKBioBank files are already decrypted so you can begin working with the data immediately once access has been provided. However, we do require your project to have been approved by the UKBioBank for access to the datasets.
If you are submitting a project proposal to UKBioBank, you should indicate in the application that you intend to use an institution-held copy of the dataset (see Requesting Access below for advice).
For advice on processing the data on the CSF or iCSF (once your access to the data has been approved) please see the UK BioBank modulefile page in the CSF docs.
Future Downloads
We are very unlikely to download further UK BioBank datasets to be held centrally – they are simply too large. Instead you are encouraged to use their cloud-based Research Analysis Platform.
For general information about the UK BioBank genetic dataset releases, please see the UK BioBank Genetic Data Timeline.
Important Updates – PLEASE READ
14-March-2018: The updated imputed dataset (v3, study id EGAD00010001474) is now available on the CSF and iCSF. The instructions given below have been updated to provide more details about this version of the data. Further information on what is available has also been provided by the UKBioBank in the FAQ document (pdf).
27-July-2017: The UKBioBank has released a notice alerting users to a problem with the supplied imputed dataset. Please review the notice if using this data. This problem has now been fixed with the v3 release (see above).
Available Files – UPDATED March 2018
- EGAD00010001474
- This is the corrected imputed dataset (v3).
It replaces the previous imputed dataset (v2) EGAD00010001225 which should NO LONGER BE USED. The UKBioBank found errors in this data and released EGAD00010001474 to replace it.
- EGAD00010001497
- This is the genotyped dataset, previously named EGAD00010001226. It is identical to the previously released EGAD00010001226 dataset, so we have NOT downloaded EGAD00010001497. Instead, a shortcut / symlink is available in the storage so that you can also access the genotyped data using the new name EGAD00010001497.
All downloaded and decrypted files have had their md5 checksums verified – all files have been written to disk correctly. Note that gzip compressed versions of the new EGAD00010001474 dataset are not available – this dataset was only provided in uncompressed form. The EGAD00010001497 (previously named EGAD00010001226) dataset DOES contain both gzip compressed and uncompressed files.
Requesting Access
The following advice has been provided by the UKBioBank Chief Information Officer concerning access to the data:
- UKBioBank request that an Applicant Principal Investigator (PI) informs UKBioBank of their intent to use an institute-held dataset during the application review process, so that they can track who is using what and where.
- The responsibility for who gets access to the dataset lies with the PI and it is part of their obligations under the Material Transfer Agreement that they maintain the list of collaborators who have access to the dataset.
- Research IT, if approached by a PI requesting access to the dataset, will send a quick email request to the UKBioBank Access Team to confirm that the PI has an approved data release application in place. Research IT will send the Application ID number and PI or Lead Collaborator name to UKBioBank. UKBioBank can then simply confirm the PI has an approved application and UKBioBank will know that the PI is seeking to use our institute-held dataset.
- This should be done at the ‘application level’ – once Research IT know a PI has an approved application, we can respond to requests from the PI to give specific individuals access. The responsibility remains with the PI to request access only for people they have named on their application.
Some useful further advice:
- Please remove any passwords from your documentation when sending to Research IT.
- Visiting researchers from another institute (or country) can only be given access if named on an executed MTA.
- The rule of thumb is that for a researcher to access our copy of the data, they must first be named on the MTA and have their access to the data (no matter where it is held) approved by the UKBioBank. Only then can we grant access to our copy of the data.
Please email your request to access the data, and a copy of the MTA, to its-ri-team@manchester.ac.uk
For details on registering with the UK BioBank / European Genome Archive please see the UK BioBank Full Release FAQ (hosted by the UK BioBank).
Accessing the data
Once we have processed your request for access and have confirmed that your access has been set up please follow the instructions below to access the data:
From central compute platforms (CSF, iCSF)
To access the UK BioBank data on the central compute platforms you must be added to a group on those systems that has the correct access permissions. To check, run the following command on the login node:
groups
It should report dataset-ukbiobank-full in the list of your groups. Then you can use a modulefile on the CSF and iCSF to make accessing the datasets easier:
# New data as of March 2018. Uses the corrected v3 Imputation data from study id EGAD00010001474.
module load tools/env/ukbiobank-full-release-2018
# Original July 2017 data. Uses the incorrect v2 Imputation data from study id EGAD00010001225.
module load tools/env/ukbiobank-full-release
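The group check and the module load can be combined in a small script, for example at the top of a job script. This is a minimal sketch: the group and modulefile names are those given above, while the helper function in_group_list and the messages are illustrative assumptions. The module command is only run if it is available in the current shell.

```shell
#!/bin/bash
# Group name as given in the instructions above
UKB_GROUP="dataset-ukbiobank-full"

# Hypothetical helper: succeed if group $1 appears as a whole word
# in the space-separated group list $2.
in_group_list() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# 'id -nG' prints the current user's group names on one line
if in_group_list "$UKB_GROUP" "$(id -nG)"; then
    echo "Found group $UKB_GROUP"
    # Only attempt the load where the 'module' command exists (e.g. on the CSF / iCSF)
    if command -v module >/dev/null 2>&1; then
        # Corrected v3 imputed data (March 2018)
        module load tools/env/ukbiobank-full-release-2018
    else
        echo "The 'module' command is not available in this shell"
    fi
else
    echo "You are not in $UKB_GROUP; access has not been set up for this account"
fi
```

Matching the group name as a whole word avoids a false positive from any other group whose name merely begins with the same text.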
Please see the CSF’s UKBioBank modulefile documentation for further information, including where the data is kept, how to access it in job scripts, and an example job array for processing multiple datasets.
From on-campus PCs
To access the UK BioBank data on a campus PC you must be added to the storage share’s access control group. Once you have been added to the group, map a network drive using the path:
\\nasr.man.ac.uk\nonfacrss$\unsnapped\replicated\data-sets\ukbiobank\full-release
Please ensure you have received confirmation from us that you have been given access to the above storage areas before attempting to access them (see above for how to request this). Once you have access, you can find further details about the dataset file formats and descriptions in a file called ukb_genetic_file_description.txt, which is also available online (plain text, hosted by UK BioBank).
Mapping Files
Processing the data requires mapping files which are specific to your research work. You should download these files from the UK BioBank / EGA using your own EGA download account. You won’t be able to get these unless your project has been approved by the UK BioBank.
Various download tools for Linux are installed on our Research Data Storage SSH (rds-ssh) gateway. Please email its-ri-team@manchester.ac.uk if you require access to this server.
Further Information
- UK BioBank Full Release FAQ (PDF, hosted by UK BioBank)
- Dataset file formats and descriptions (plain text, hosted by UK BioBank)
- CSF UK BioBank helper modulefile documentation
- How to map a network drive from the Research Data Storage service to your campus PC.
- You may wish to join the UKB-GENETICS mailing list – a discussion group for users of the UKBioBank data. Please register via https://jiscmail.ac.uk/cgi-bin/webadmin?A0=ukb-genetics
- You may also wish to join the local (UoM only) UOM-UKBIOBANK mailing list – a discussion group for users in Manchester to discuss hints / tips / issues with UKBioBank data. Please register via https://listserv.manchester.ac.uk/cgi-bin/wa?A0=UOM-UKBIOBANK