UKBioBank

Overview

These modulefiles set only environment variables, which make it easier to access the centrally hosted UK Biobank Full Release datasets on the CSF. You can use the variables in your jobscripts to refer to the folders that hold the central copy of the UK Biobank data.

The information on this webpage is also valid for the CSF

Restrictions on use

To access the UK Biobank data on the CSF, you must be a member of the dataset-ukbiobank-full Unix group, which has the correct access permissions. To be added to this group, please email its-ri-team@manchester.ac.uk with confirmation of your approved UK Biobank/EGA access to the data. The UK Biobank has advised us that a copy of your Material Transfer Agreement (MTA) is suitable proof of access (please remove any passwords from your documentation).
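
Once you have been told that you have been added, you may want to confirm the group membership yourself. The following is a minimal sketch, not an official CSF tool: the helper name in_group is hypothetical, and it assumes group names contain no spaces. Group changes on Unix systems usually only take effect in a new login session, so log out and back in before checking.

```shell
#!/bin/bash
# Return success if the group name in $1 appears in the
# space-separated list of group names given in $2.
# (in_group is a hypothetical helper name used for illustration.)
in_group() {
    local wanted="$1" groups="$2" g
    for g in $groups; do
        [ "$g" = "$wanted" ] && return 0
    done
    return 1
}

# 'id -nG' prints the names of all groups the current user belongs to.
if in_group dataset-ukbiobank-full "$(id -nG)"; then
    echo "You are a member of dataset-ukbiobank-full"
else
    echo "You are NOT a member of dataset-ukbiobank-full"
fi
```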

For details on registering with the UK Biobank / European Genome-phenome Archive (EGA), please see the UK BioBank Full Release FAQ (hosted by the UK Biobank).

Set up procedure

Please load the modulefile:

# Includes access to the correct v3 imputed dataset EGAD00010001474

module load tools/env/ukbiobank-full-release-2018

The original modulefile, which provides access to the incorrect v2 imputed dataset EGAD00010001225, is still available. This is because you may have been using particular files in that study that the UK Biobank did not need to correct, or you may be using the genotyped data in study EGAD00010001226, which contains entirely correct data:

# Includes access to the incorrect v2 imputed dataset EGAD00010001225

module load tools/env/ukbiobank-full-release

Environment variables

The new 2018 modulefile sets the following environment variables (all except UKBIOBANK_BASE and UKBB_BASE have changed since the previous modulefile):

UKBIOBANK_BASE            =   /mnt/data-sets/ukbiobank/full-release
UKBB_BASE                 =   /mnt/data-sets/ukbiobank/full-release
UKBB_IMPUTATION_STUDYID   =   EGAD00010001474                           # This is the new, correct, v3 imputation data (March 2018)
UKBB_GENOTYPED_STUDYID    =   EGAD00010001497                           # This is identical to the EGAD00010001226 data
UKBB_IMPUTATION_DIR       =   /mnt/data-sets/ukbiobank/full-release/EGAD00010001474          # No trailing 001 subdirectory
UKBB_GENOTYPED_DIR        =   /mnt/data-sets/ukbiobank/full-release/EGAD00010001497          # Symlink / shortcut to EGAD00010001226/001
UKBB_FILELIST             =   /mnt/data-sets/ukbiobank/full-release/filelist.2018.txt        # Includes names of v3 (correct) imputed data files

The original modulefile makes the following settings:

UKBIOBANK_BASE            =   /mnt/data-sets/ukbiobank/full-release
UKBB_BASE                 =   /mnt/data-sets/ukbiobank/full-release
UKBB_IMPUTATION_STUDYID   =   EGAD00010001225                           # This dataset should NOT be used (March 2018)
UKBB_GENOTYPED_STUDYID    =   EGAD00010001226                           # This dataset can be used (it is identical to EGAD00010001497)
UKBB_IMPUTATION_DIR       =   /mnt/data-sets/ukbiobank/full-release/EGAD00010001225/001
UKBB_GENOTYPED_DIR        =   /mnt/data-sets/ukbiobank/full-release/EGAD00010001226/001
UKBB_FILELIST             =   /mnt/data-sets/ukbiobank/full-release/filelist.txt             # Includes names of v2 (incorrect) imputed data files
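
To confirm that a modulefile has actually set these variables in your current shell, a quick check along the following lines may help. This is a sketch only: missing_vars is a hypothetical helper name, and it relies on bash indirect expansion, so it needs bash rather than plain sh.

```shell
#!/bin/bash
# Print the name of every listed variable that is unset or empty.
# Returns non-zero if at least one variable is missing.
# (missing_vars is a hypothetical helper name used for illustration.)
missing_vars() {
    local name status=0
    for name in "$@"; do
        # ${!name} is bash indirect expansion: the value of the variable
        # whose name is stored in $name. ':-' avoids an error under 'set -u'.
        if [ -z "${!name:-}" ]; then
            echo "$name"
            status=1
        fi
    done
    return $status
}

# Example use, after loading one of the modulefiles:
#   missing_vars UKBB_BASE UKBB_IMPUTATION_DIR UKBB_GENOTYPED_DIR UKBB_FILELIST \
#       || echo "Some variables are not set - is the modulefile loaded?"
```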

Basic Usage of Environment Variables

You can use any of the above environment variables in your jobscripts and other commands.

For example, to see all of the files in the GENOTYPED dataset, assuming you have loaded the tools/env/ukbiobank-full-release-2018 modulefile:

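ls -lh $UKBB_GENOTYPED_DIR/

# Note: The trailing / makes ls list the contents of the directory even
#       though $UKBB_GENOTYPED_DIR is a symlink.
#       The compressed .gz files and checksum .md5 files
#       will also be listed.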
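
If you would rather see only the data files, without the .gz and .md5 entries, a small filter can hide them. The sketch below is one possible approach; list_data_files is a hypothetical helper name, not something provided by the modulefile.

```shell
#!/bin/bash
# Print the names of the regular files in directory $1,
# skipping compressed (.gz) and checksum (.md5) files.
# (list_data_files is a hypothetical helper name used for illustration.)
list_data_files() {
    local dir="$1" f
    for f in "$dir"/*; do
        case "$f" in
            *.gz|*.md5) continue ;;
        esac
        [ -f "$f" ] && basename "$f"
    done
    return 0
}

# Example use, after loading the modulefile:
#   list_data_files "$UKBB_GENOTYPED_DIR"
```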

To use the variables in a jobscript, load the modulefile (either on the login node or in the jobscript) and then you can use the variables to access particular files. For example:

myGenomeApp -input $UKBB_GENOTYPED_DIR/ukb_cal_chr16_v2.bed -output myresults.dat
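
Because access depends on both the modulefile being loaded and your group membership, a jobscript may benefit from checking that its input file is readable before starting a long run. The following is a minimal sketch under that assumption; require_readable is a hypothetical helper name.

```shell
#!/bin/bash
# Return success if the file $1 exists and is readable by the current user;
# otherwise print a message to stderr and return non-zero.
# (require_readable is a hypothetical helper name used for illustration.)
require_readable() {
    if [ ! -r "$1" ]; then
        echo "Cannot read input file: $1 (modulefile loaded? group access granted?)" >&2
        return 1
    fi
    return 0
}

# Example use in a jobscript, before running your application:
#   INPUT="$UKBB_GENOTYPED_DIR/ukb_cal_chr16_v2.bed"
#   require_readable "$INPUT" || exit 1
```

If the variable is unset, the path collapses to /ukb_cal_chr16_v2.bed, which does not exist, so the check also catches a forgotten module load.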

Job Arrays (on CSF)

The text file given by the $UKBB_FILELIST variable contains a list of the datasets available:

cat $UKBB_FILELIST
EGAD00010001474/ukb_imp_chr1_v3.bgen              # Note the v3 imputed files have changed their names compared to the v2 files
EGAD00010001474/ukb_imp_chr2_v3.bgen
EGAD00010001474/ukb_imp_chr3_v3.bgen
...
EGAD00010001474/ukb_mfi_chrX_v3.txt
EGAD00010001474/ukb_mfi_chrXY_v3.txt
...
EGAD00010001497/ukb_l2r_chrXY_v2.txt
EGAD00010001497/ukb_l2r_chrY_v2.txt
   #
   # Notice that the subdirectories EGAD00010001474 and EGAD00010001497 are included in the name
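
Since each entry starts with its study ID, you can summarise the file list by study. The sketch below assumes every line has the form STUDYID/filename; count_per_study is a hypothetical helper name.

```shell
#!/bin/bash
# Count how many entries in the file list $1 belong to each study ID.
# Assumes each line looks like STUDYID/filename.
# (count_per_study is a hypothetical helper name used for illustration.)
count_per_study() {
    # -F/ splits each line at '/', so $1 inside awk is the study ID.
    # 'NF > 1' skips any line that has no '/' in it.
    awk -F/ 'NF > 1 { n[$1]++ } END { for (s in n) print s, n[s] }' "$1" | sort
}

# Example use, after loading the modulefile:
#   count_per_study "$UKBB_FILELIST"
```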

You may use this file when running job arrays on the CSF to process all files in a particular dataset. For example:

Suppose we wish to process all of the ukb_int_chrN_v2.bin files in the EGAD00010001497 study (this is the same dataset as the EGAD00010001226 study, hence the files still have v2 in their names).

  1. Load the modulefile on the login node:
    module load tools/env/ukbiobank-full-release-2018
    
  2. Check which files we will be processing:
    grep ukb_int_chr $UKBB_FILELIST
      #
      # 'grep' prints all lines that contain the string 'ukb_int_chr'
    
    EGAD00010001497/ukb_int_chr1_v2.bin
    EGAD00010001497/ukb_int_chr2_v2.bin
    ...
    EGAD00010001497/ukb_int_chrY_v2.bin
       #
       # Notice that the subdirectory EGAD00010001497/ is included in the name
    
  3. Count the number of files, and hence the number of tasks needed in the job array (you probably expect this to be 26, but let’s check!):
    grep ukb_int_chr $UKBB_FILELIST | wc -l
                                      #
                                      # 'wc' is a word count utility (-l counts lines)
    26
    
  4. Write a job array script that will automatically run 26 copies (‘tasks’) of the job where each ‘task’ is given a unique ID 1, 2, …, 26. We will use this to make each task process a different dataset (one of the 26 files listed above):
    #!/bin/bash --login
    #$ -cwd                # Run in current directory
    #$ -t 1-26             # Run a 26-task job array
    ### No -V line (we load the modulefile in the jobscript, hence added --login above)
    
    ### Load the modulefiles in the jobscript
    module load tools/env/ukbiobank-full-release-2018
    # module load name/of/my/app/1.2.3
    
    ### SGE_TASK_ID is set to 1, 2, ... 26 (one for each of the 26 tasks)
    ### Read the N-th filename by generating the list as we did earlier
    FILENAME=$(grep ukb_int_chr "$UKBB_FILELIST" | awk -v n="$SGE_TASK_ID" 'NR == n')
    
    ### Report what we are doing
    echo "Processing $UKBB_BASE/$FILENAME in job array $JOB_ID task $SGE_TASK_ID"
    
    ### Process the file for this task (change this to use your own real app here!)
    myGenomeApp -input $UKBB_BASE/$FILENAME -output myResult_$SGE_TASK_ID.data
    
    
  5. Submit the job once using the usual command:
    qsub myjobscript
    

    where myjobscript is the name of your jobscript file.
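
Before submitting, it can be useful to check on the login node which file each task ID would pick up. The sketch below wraps the same grep and awk selection used in the jobscript in a function; nth_match is a hypothetical helper name, and the dry-run loop is only an illustration.

```shell
#!/bin/bash
# Print the N-th line of file list $3 that contains the string $1,
# where N is $2. Prints nothing if there are fewer than N matches.
# (nth_match is a hypothetical helper name used for illustration.)
nth_match() {
    local pattern="$1" n="$2" list="$3"
    grep -- "$pattern" "$list" | awk -v n="$n" 'NR == n'
}

# Example dry run on the login node, after loading the modulefile:
#   for id in 1 2 26; do
#       echo "task $id -> $(nth_match ukb_int_chr "$id" "$UKBB_FILELIST")"
#   done
```

An empty result for a task ID means the array range in the -t line is larger than the number of matching files.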

Further Dataset Format Information

For a description of the datasets, you can view a text file provided by the UK Biobank using:

less $UKBB_BASE/ukb_genetic_file_description.txt

This file is also available online (hosted by the UK Biobank).

Further info

Updates

None.

Last modified on November 23, 2018 at 4:14 pm by George Leaver