{"id":801,"date":"2018-11-14T11:00:20","date_gmt":"2018-11-14T11:00:20","guid":{"rendered":"http:\/\/ri.itservices.manchester.ac.uk\/csf3\/?page_id=801"},"modified":"2025-06-20T17:03:52","modified_gmt":"2025-06-20T16:03:52","slug":"ukbiobank","status":"publish","type":"page","link":"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/software\/tools\/ukbiobank\/","title":{"rendered":"UKBioBank Helper Modulefile"},"content":{"rendered":"<h2>Overview<\/h2>\n<p>These modulefiles set only environment variables to make accessing the centrally hosted <a href=\"http:\/\/ri.itservices.manchester.ac.uk\/hosted-data-sets\/ukbiobank\/\">UK Bio Bank Full Release Datasets<\/a> easier on the CSF. They allow you to use the environment variables in your jobscripts to access the correct folders that provide the central copy of the UK Bio Bank data.<\/p>\n<p>The information on this webpage is also valid for the <a href=\"http:\/\/ri.itservices.manchester.ac.uk\/icsf\">iCSF<\/a><\/p>\n<h2>Restrictions on use<\/h2>\n<p>To access the UK Bio Bank data on the CSF you <em>must<\/em> be added to a group on the system that has the correct access permissions (the <code>dataset-ukbiobank-full<\/code> Unix group). To be added to this group please <a href=\"\/csf3\/overview\/help\/\">contact us<\/a> with confirmation of your approved UKBioBank\/EGA access to the data &#8211; the UKBioBank have advised us that supplying us with a copy of your Material Transfer Agreement (MTA) is suitable proof of access (please remove any passwords from your documentation). If you are sending a screenshot of the BioBank Portal which shows who is named on the MTA, please also include a screenshot (if possible) showing which datasets your project is permitted to use. 
<\/p>\n<p>For details on registering with the UK Bio Bank \/ European Genome-phenome Archive (EGA) please see the <a href=\"http:\/\/www.ukbiobank.ac.uk\/wp-content\/uploads\/2017\/07\/UKB-Genotyping-and-Imputation-Data-Release-FAQ.pdf\">UK BioBank Full Release FAQ<\/a> (hosted by the UK Bio Bank).<\/p>\n<h2>Set up procedure<\/h2>\n<h3>Access to the UKBioBank <em>corrected<\/em> v3 dataset<\/h3>\n<p>In 2018 the UKBioBank released v3 of the <em>imputed<\/em> dataset, named EGAD00010001474, to correct some mistakes in the v2 dataset, named EGAD00010001225. Users should use the corrected version. The incorrect version (EGAD00010001225) is no longer available.<\/p>\n<p>The <em>genotyped<\/em> data was always correct, so it is available under both the old name EGAD00010001226 and the new name EGAD00010001497.<\/p>\n<p>We use the new names for both datasets: EGAD00010001474 (imputed data) and EGAD00010001497 (genotyped data).<\/p>\n<p>Please load the modulefile:<\/p>\n<pre>\r\n# Access to the corrected v3 imputed dataset EGAD00010001474 and genotyped dataset EGAD00010001497\r\n\r\nmodule load tools\/env\/ukbiobank-full-release-2018\r\n<\/pre>\n<h2>Environment variables<\/h2>\n<p>The new 2018 modulefile sets the following environment variables (items in <strong>bold<\/strong> have changed since the previous modulefile):<\/p>\n<pre>\r\n# Settings provided by the tools\/env\/ukbiobank-full-release-2018 modulefile:\r\n# You can use these environment variables on the command-line or in your jobscripts:\r\n# EG: ls ${UKBB_IMPUTATION_DIR}\r\n\r\nUKBIOBANK_BASE            =   \/mnt\/data-sets\/ukbiobank\/full-release\r\nUKBB_BASE                 =   \/mnt\/data-sets\/ukbiobank\/full-release\r\n\r\n# This is the new, correct, v3 imputation data (March 2018)\r\nUKBB_IMPUTATION_STUDYID   =   EGAD00010001474\r\n\r\n# This is the genotyped data, identical to the previously released EGAD00010001226 data\r\nUKBB_GENOTYPED_STUDYID    =   EGAD00010001497\r\n\r\n# Path to imputed data\r\nUKBB_IMPUTATION_DIR       =   
\/mnt\/data-sets\/ukbiobank\/full-release\/EGAD00010001474\r\n\r\n# Path to genotyped data\r\nUKBB_GENOTYPED_DIR        =   \/mnt\/data-sets\/ukbiobank\/full-release\/EGAD00010001497\r\n\r\n# Includes names of v3 (correct) imputed data files\r\nUKBB_FILELIST             =   \/mnt\/data-sets\/ukbiobank\/full-release\/filelist.2018.txt\r\n<\/pre>\n<p>To use the variables in a jobscript, load the modulefile (either on the login node or in the jobscript) and then use the variables to access particular files. For example:<\/p>\n<pre>\r\nMyGenomeApp -input $UKBB_GENOTYPED_DIR\/ukb_cal_chr16_v2.bed -o myresults.dat\r\n<\/pre>\n<h3>Job Arrays<\/h3>\n<p>The text file given by the <code>$UKBB_FILELIST<\/code> variable lists the data files available:<\/p>\n<pre>\r\n<strong>cat $UKBB_FILELIST<\/strong>\r\nEGAD00010001474\/ukb_imp_chr1_v3.bgen              # Note the v3 imputed files have changed their names compared to the v2 files\r\nEGAD00010001474\/ukb_imp_chr2_v3.bgen\r\nEGAD00010001474\/ukb_imp_chr3_v3.bgen\r\n...\r\nEGAD00010001474\/ukb_mfi_chrX_v3.txt\r\nEGAD00010001474\/ukb_mfi_chrXY_v3.txt\r\n...\r\nEGAD00010001497\/ukb_l2r_chrXY_v2.txt\r\nEGAD00010001497\/ukb_l2r_chrY_v2.txt\r\n   #\r\n   # Notice that the subdirectories EGAD00010001474 and EGAD00010001497 are included in the name\r\n<\/pre>\n<p>You may use this file when running <a href=\"http:\/\/ri.itservices.manchester.ac.uk\/userdocs\/sge\/job-arrays\/\">job arrays<\/a> on the CSF to process all files in a particular dataset. For example:<\/p>\n<p>Suppose we wish to process all of the <code>ukb_int_chr<em>N<\/em>_v2.bin<\/code> files in the <code>EGAD00010001497<\/code> study (this is the same dataset as the EGAD00010001226 study, so the files still have v2 in their names). 
<\/p>\n<ol>\n<li>Load the modulefile on the login node:\n<pre>\r\nmodule load tools\/env\/ukbiobank-full-release-2018\r\n<\/pre>\n<\/li>\n<li>Check which files we will be processing:\n<pre>\r\ngrep ukb_int_chr $UKBB_FILELIST\r\n  #\r\n  # 'grep' prints all lines that contain the string 'ukb_int_chr'\r\n\r\nEGAD00010001497\/ukb_int_chr1_v2.bin\r\nEGAD00010001497\/ukb_int_chr2_v2.bin\r\n...\r\nEGAD00010001497\/ukb_int_chrY_v2.bin\r\n   #\r\n   # Notice that the subdirectory EGAD00010001497\/ is included in the name\r\n<\/pre>\n<\/li>\n<li>Count the number of files, and hence the number of tasks needed in the job array (you probably expect this to be 26 but let&#8217;s check!):\n<pre>\r\ngrep ukb_int_chr $UKBB_FILELIST | wc -l\r\n                                  #\r\n                                  # 'wc' is a word count utility (-l counts <em>lines<\/em>)\r\n\r\n26\r\n<\/pre>\n<\/li>\n<li>Write a job array script that will automatically run 26 copies (&#8216;tasks&#8217;) of the job where each &#8216;task&#8217; is given a unique ID 1, 2, &#8230;, 26. We will use this ID to make each task process a different file (one of the 26 files listed above):\n<pre>\r\n#!\/bin\/bash --login\r\n#SBATCH -p serial      # (or --partition) Run in the partition used for 1-core jobs\r\n#SBATCH -a 1-26        # Run a 26-task <em>job array<\/em>\r\n#SBATCH -t 2-0         # Wallclock time limit. Each job-array task will have this amount of time.\r\n                       # In this example 2-0 is two days. Max permitted is 7-0.\r\n\r\n### Load the modulefiles in the jobscript\r\nmodule load tools\/env\/ukbiobank-full-release-2018\r\n\r\n### You will probably want to load modulefiles for bio-inf applications\r\nmodule load <em>name\/of\/my\/app\/1.2.3<\/em>\r\n\r\n### ${SLURM_ARRAY_TASK_ID} is set to 1, 2, ... 
26 (one for each of the 26 tasks)\r\n### Read the N-th filename by generating the list as we did earlier\r\nFILENAME=$(grep ukb_int_chr $UKBB_FILELIST | awk \"NR==${SLURM_ARRAY_TASK_ID} {print}\")\r\n\r\n### Report what we are doing\r\necho \"Processing $UKBB_BASE\/$FILENAME in job array ${SLURM_ARRAY_JOB_ID} task ${SLURM_ARRAY_TASK_ID}\"\r\n\r\n### Process the file for this task (change this to use your own real app here!)\r\n<em>myGenomeApp<\/em> -input $UKBB_BASE\/$FILENAME -output myResult_${SLURM_ARRAY_TASK_ID}.data\r\n<\/pre>\n<\/li>\n<li>Submit the job <em>once<\/em> using the usual command:\n<pre>\r\nsbatch <em>myjobscript<\/em>\r\n<\/pre>\n<p>where <code><em>myjobscript<\/em><\/code> is the name of your jobscript file.<\/li>\n<\/ol>\n<h3>Further Dataset Format Information<\/h3>\n<p>For a description of the datasets, a text file provided by the UK Bio Bank is available using:<\/p>\n<pre>\r\nless $UKBB_BASE\/ukb_genetic_file_description.txt\r\n<\/pre>\n<p>This file is also available <a href=\"http:\/\/www.ukbiobank.ac.uk\/wp-content\/uploads\/2017\/07\/ukb_genetic_file_description.txt\">online<\/a> (hosted by the UK Bio Bank).<\/p>\n<h2>Further info<\/h2>\n<ul>\n<li><a href=\"http:\/\/www.ukbiobank.ac.uk\/wp-content\/uploads\/2018\/03\/UKB-Genotyping-and-Imputation-Data-Release-FAQ-v3-2.pdf\">UK BioBank Full Release v3 FAQ<\/a> (PDF, hosted by UK BioBank)<\/li>\n<li><a href=\"http:\/\/www.ukbiobank.ac.uk\/wp-content\/uploads\/2017\/07\/UKB-Genotyping-and-Imputation-Data-Release-FAQ.pdf\">UK BioBank Full Release v2 FAQ<\/a> (PDF, hosted by UK BioBank)<\/li>\n<li><a href=\"http:\/\/www.ukbiobank.ac.uk\/wp-content\/uploads\/2017\/07\/ukb_genetic_file_description.txt\">Dataset file formats and descriptions<\/a> (plain text, hosted by UK BioBank)<\/li>\n<li><a href=\"\/csf-apps\/software\/applications\/bgen\">BGEN<\/a> CSF installation which has some information about processing the datasets with the <code>bgenix<\/code> 
tool.<\/li>\n<\/ul>\n<h2>Updates<\/h2>\n<p>None.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Overview These modulefiles set only environment variables to make accessing the centrally hosted UK Bio Bank Full Release Datasets easier on the CSF. They allow you to use the environment variables in your jobscripts to access the correct folders that provide the central copy of the UK Bio Bank data. The information on this webpage is also valid for the iCSF Restrictions on use To access the UK Bio Bank data on the CSF you.. <a href=\"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/software\/tools\/ukbiobank\/\">Read more &raquo;<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"parent":144,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-801","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/wp-json\/wp\/v2\/pages\/801","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/wp-json\/wp\/v2\/comments?post=801"}],"version-history":[{"count":17,"href":"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/wp-json\/wp\/v2\/pages\/801\/revisions"}],"predecessor-version":[{"id":10440,"href":"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/wp-json\/wp\/v2\/pages\/801\/revisions\/10440"}],"up":[{"embeddable":true,"href":"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/wp-json\/wp\/v2\/pages\/144"}],"wp:attachment":[{"href":"https:\/\/ri.itservices.manchester.ac.uk\/csf3\/wp-json\/wp\/v2\/media?parent=801"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\
/{rel}","templated":true}]}}