The CSF2 has been replaced by the CSF3 - please use that system! This documentation may be out of date. Please read the CSF3 documentation instead.
BGEN
Overview
BGEN provides utility programs to process files in the BGEN format.
These tools will be of interest to researchers working with the UK BioBank full-release dataset.
The master branch from the repository (commit id d1f03a2c308a, downloaded 24-July-2017) is installed on the CSF.
Restrictions on use
This version is released under the Boost Software License v1.0 – there are no restrictions on access on the CSF.
Set up procedure
To access the software you must first load the modulefile:
module load apps/gcc/bgen/latest
# The latest modulefile currently gives you:
# apps/gcc/bgen/d1f03a2c308a
The following tools are available:
bgenix cat-bgen edit-bgen
You can add the -help flag to each tool to see the command-line flags (you may do this on the login node). For example:
bgenix -help
You may also wish to load the utility modulefile:
module load tools/env/ukbiobank-full-release
This will set some environment variables, used in the examples below, that make it a little easier to access the folder where the UK BioBank datasets are kept.
Running the application
Please do not run BGEN tools on the login node. Jobs should be submitted to the compute nodes via the batch system.
Serial batch job submission
Make sure you have the modulefile loaded then create a batch submission script, for example:
#!/bin/bash
#$ -cwd    # Job will run from the current directory
#$ -V      # Inherit settings from modulefile loaded on login node

bgenix -g path/to/filename.bgen arg2 arg3 ...
Submit the jobscript using:
qsub scriptname
where scriptname is the name of your jobscript.
Working with the UK BioBank Full Release Dataset
The BGEN website offers some advice on using bgenix with the UK BioBank data. In summary, the UK BioBank index filenames do not match the format expected by bgenix.
For example, the EGAD00010001225/001/ dataset contains files of the form:
BioBank BGEN filename   BioBank INDEX filename   INDEX filename expected by bgenix
---------------------   ----------------------   ---------------------------------
ukb_imp_chr1_v2.bgen    ukb_bgi_chr1_v2.bgi      ukb_imp_chr1_v2.bgen.bgi
ukb_imp_chr2_v2.bgen    ukb_bgi_chr2_v2.bgi      ukb_imp_chr2_v2.bgen.bgi
...                     ...                      ...
ukb_imp_chrN_v2.bgen    ukb_bgi_chrN_v2.bgi      ukb_imp_chrN_v2.bgen.bgi
If you simply run:
bgenix -g ukb_imp_chr1_v2.bgen -list
#
# Let bgenix generate the index filename based
# on the name of the input bgen filename.
# THIS WILL FAIL!
then bgenix will try to find an index file named ukb_imp_chr1_v2.bgen.bgi, which does not exist in the UK BioBank dataset. You will receive an error message:
!! Error opening index file "ukb_imp_chr1_v2.bgen.bgi": Could not open the index file "ukb_imp_chr1_v2.bgen.bgi"
It is possible to add the -i flag to the bgenix command-line to specify explicitly a different index filename to use. For example:
bgenix -g ukb_imp_chr1_v2.bgen -i ukb_bgi_chr1_v2.bgi -list
#
# Tell bgenix what index filename to use.
# THIS WILL SUCCEED!
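Because a missing index file only shows up as an error once bgenix starts, it may be worth checking that both files are readable before running the tool. The following is a minimal sketch, not part of BGEN itself; the function name require_files is hypothetical and was chosen purely for illustration.

```shell
#!/bin/bash
# Hypothetical helper (not part of BGEN): return non-zero if any of the
# given files cannot be read, reporting the first unreadable one on stderr.
require_files() {
    local f
    for f in "$@"; do
        if [ ! -r "$f" ]; then
            echo "Cannot read: $f" >&2
            return 1
        fi
    done
}
```

In a jobscript this could be used as a guard, for example: require_files "$BGENFILE" "$INDEXFILE" && bgenix -g "$BGENFILE" -i "$INDEXFILE" -list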
If you wish to script the generation of index filenames from bgen filenames inside a jobscript you can use commands such as:
BGENFILE=ukb_imp_chr1_v2.bgen
INDEXFILE=`echo $BGENFILE | sed 's/imp/bgi/g;s/bgen/bgi/g'`
bgenix -g $BGENFILE -i $INDEXFILE args...
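The sed command above replaces every occurrence of "imp" and "bgen" in the string, which is fine for a bare filename but could also alter a directory name if a full path is supplied. As an alternative, here is a minimal sketch using bash parameter expansion that only touches the filename component. The function name bgen_index_name is hypothetical, and the sketch assumes the index file sits in the same directory as the BGEN file.

```shell
#!/bin/bash
# Hypothetical helper: map a UK BioBank BGEN path to its index path,
# e.g. dir/ukb_imp_chr1_v2.bgen -> dir/ukb_bgi_chr1_v2.bgi
# Assumes the index file is in the same directory as the BGEN file.
bgen_index_name() {
    local path="$1"
    local dir="" file="$path"
    case "$path" in
        */*)
            dir="${path%/*}/"      # directory part, with trailing slash
            file="${path##*/}"     # filename part only
            ;;
    esac
    file="${file/#ukb_imp_/ukb_bgi_}"   # replace the leading ukb_imp_ prefix
    file="${file%.bgen}.bgi"            # swap the .bgen suffix for .bgi
    printf '%s\n' "${dir}${file}"
}
```

It could then be used in place of the sed pipeline, for example: INDEXFILE=$(bgen_index_name "$BGENFILE")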
Job Array Example
The following job array will process all of the files:
ukb_imp_chr1_v2.bgen
ukb_imp_chr2_v2.bgen
...
ukb_imp_chr22_v2.bgen
Create a batch submission script similar to:
#!/bin/bash --login
#$ -cwd
### Note we will load the modulefiles in the jobscript hence
### no '#$ -V' line and we've added --login above.

### Automatically run 22 copies of this job (each uses 1 core)
#$ -t 1-22

# We load the modulefiles in the jobscript (hence no #$ -V line)
module load apps/gcc/bgen/latest
module load tools/env/ukbiobank-full-release

### ${SGE_TASK_ID} is automatically replaced by the number 1, 2, 3, ..., 22
BGENFILE=${UKBB_IMPUTATION_DIR}/ukb_imp_chr${SGE_TASK_ID}_v2.bgen

### Generate the correct index filename for bgenix. The substitution is
### anchored to the filename so that directory names are left unchanged.
INDEXFILE=$(echo "$BGENFILE" | sed 's/ukb_imp_\([^/]*\)\.bgen$/ukb_bgi_\1.bgi/')

### Run bgenix
bgenix -g "$BGENFILE" -i "$INDEXFILE" -list
Submit the jobscript using:
qsub scriptname
where scriptname is the name of your jobscript.
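Before submitting a job array, it can be useful to preview which files each task would use. The following dry-run sketch prints the task number, BGEN path and index path for each task without running bgenix. The function name preview_tasks is hypothetical; its first argument stands in for the directory held in ${UKBB_IMPUTATION_DIR}, and the filename pattern is the one shown in the table earlier on this page.

```shell
#!/bin/bash
# Hypothetical dry-run helper: print "task bgenfile indexfile" for each
# array task. Arguments: directory, first task (default 1), last task
# (default 22). Nothing is executed; this only prints the filenames.
preview_tasks() {
    local dir="$1" first="${2:-1}" last="${3:-22}" task
    for (( task = first; task <= last; task++ )); do
        printf '%s %s %s\n' "$task" \
            "${dir}/ukb_imp_chr${task}_v2.bgen" \
            "${dir}/ukb_bgi_chr${task}_v2.bgi"
    done
}
```

For example, after loading the utility modulefile, preview_tasks "$UKBB_IMPUTATION_DIR" would list the 22 file pairs that the job array above is expected to process.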
Further info
Updates
None.