The CSF2 has been replaced by the CSF3 - please use that system! This documentation may be out of date. Please read the CSF3 documentation instead.

BGEN

Overview

BGEN provides utility programs to process files in the BGEN format.

These tools will be of interest to researchers working with the UK BioBank full-release dataset.

The master branch from the repository (commit id d1f03a2c308a, downloaded 24-July-2017) is installed on the CSF.

Restrictions on use

This version is released under the Boost Software License v1.0 – there are no restrictions on access on the CSF.

Set up procedure

To access the software you must first load the modulefile:

module load apps/gcc/bgen/latest

# The latest modulefile currently gives you:
# apps/gcc/bgen/d1f03a2c308a

The following tools are available:

bgenix
cat-bgen
edit-bgen

You can add the -help flag to each tool to see the command-line flags (you may do this on the login node). For example:

bgenix -help

You may also wish to load the utility modulefile:

module load tools/env/ukbiobank-full-release

This sets some environment variables, used in the examples below, that make it a little easier to reach the folder where the UK BioBank datasets are kept.
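A quick way to confirm the utility modulefile has taken effect is to check the variable before using it. The following is a minimal sketch: UKBB_IMPUTATION_DIR is the only variable name that appears in the examples on this page, and the function name check_ukbb_env is hypothetical.

```shell
# Hypothetical helper: succeed only if UKBB_IMPUTATION_DIR is set
# and points at an existing directory.
check_ukbb_env() {
    if [ -z "${UKBB_IMPUTATION_DIR:-}" ]; then
        echo "UKBB_IMPUTATION_DIR is not set; load tools/env/ukbiobank-full-release first" >&2
        return 1
    fi
    if [ ! -d "$UKBB_IMPUTATION_DIR" ]; then
        echo "UKBB_IMPUTATION_DIR=$UKBB_IMPUTATION_DIR is not a directory" >&2
        return 1
    fi
    return 0
}
```

Calling check_ukbb_env || exit 1 near the top of a jobscript stops the job early with a readable message rather than letting bgenix fail on an empty path.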

Running the application

Please do not run BGEN tools on the login node. Jobs should be submitted to the compute nodes via batch.

Serial batch job submission

Make sure you have the modulefile loaded, then create a batch submission script. For example:

#!/bin/bash
#$ -cwd             # Job will run from the current directory
#$ -V               # Inherit settings from modulefile loaded on login node

bgenix -g path/to/filename.bgen arg2 arg3 ...

Submit the jobscript using:

qsub scriptname

where scriptname is the name of your jobscript.

Working with the UK BioBank Full Release Dataset

The BGEN website offers some advice on using bgenix with the UK BioBank data. In summary, the index filenames supplied with the UK BioBank data do not match the names that bgenix expects.

For example, the EGAD00010001225/001/ dataset contains files of the form:

BioBank BGEN filename    BioBank INDEX filename     INDEX filename expected by bgenix
---------------------    ----------------------     ---------------------------------
ukb_imp_chr1_v2.bgen     ukb_bgi_chr1_v2.bgi        ukb_imp_chr1_v2.bgen.bgi
ukb_imp_chr2_v2.bgen     ukb_bgi_chr2_v2.bgi        ukb_imp_chr2_v2.bgen.bgi
...                      ...                        ...
ukb_imp_chrN_v2.bgen     ukb_bgi_chrN_v2.bgi        ukb_imp_chrN_v2.bgen.bgi

If you simply run:

bgenix -g ukb_imp_chr1_v2.bgen -list
  #
  # Let bgenix generate the index filename based
  # on the name of the input bgen filename.
  # THIS WILL FAIL!

bgenix will try to find an index file named ukb_imp_chr1_v2.bgen.bgi, but the UK BioBank dataset does not use this name. You will receive an error message:

!! Error opening index file "ukb_imp_chr1_v2.bgen.bgi":
Could not open the index file "ukb_imp_chr1_v2.bgen.bgi"

You can add the -i flag to the bgenix command line to specify explicitly which index filename to use. For example:

bgenix -g ukb_imp_chr1_v2.bgen -i ukb_bgi_chr1_v2.bgi -list
  #
  # Tell bgenix what index filename to use.
  # THIS WILL SUCCEED!

If you wish to script the generation of index filenames from bgen filenames inside a jobscript you can use commands such as:

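BGENFILE=ukb_imp_chr1_v2.bgen
# Rewrite only the filename, never a directory in the path:
#   ukb_imp_<rest>.bgen  ->  ukb_bgi_<rest>.bgi
INDEXFILE=$(echo "$BGENFILE" | sed 's#ukb_imp_\([^/]*\)\.bgen$#ukb_bgi_\1.bgi#')
bgenix -g "$BGENFILE" -i "$INDEXFILE" args...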
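As an alternative to sed, the same renaming can be done with shell parameter expansion alone. The following is a minimal sketch: the function name ukb_index_for is hypothetical, and it assumes the ukb_imp_*.bgen and ukb_bgi_*.bgi naming shown in the table above. Only the final path component is rewritten, so a directory name that happens to contain "imp" or "bgen" is left alone.

```shell
# Hypothetical helper: print the UK BioBank index filename for a BGEN filename.
# Assumes the input is named ukb_imp_<rest>.bgen, optionally with a directory.
ukb_index_for() {
    local path="$1"
    local base="${path##*/}"          # filename only, e.g. ukb_imp_chr1_v2.bgen
    local dir=""
    case "$path" in
        */*) dir="${path%/*}/" ;;     # keep the directory part, if there is one
    esac
    base="${base%.bgen}"              # strip the .bgen extension
    base="ukb_bgi_${base#ukb_imp_}"   # swap the ukb_imp_ prefix for ukb_bgi_
    printf '%s%s.bgi\n' "$dir" "$base"
}
```

In a jobscript this could be used as INDEXFILE=$(ukb_index_for "$BGENFILE") before calling bgenix with the -i flag.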

Job Array Example

The following job array will process all of the files:

ukb_imp_chr1_v2.bgen
ukb_imp_chr2_v2.bgen
...
ukb_imp_chr22_v2.bgen

Create a batch submission script similar to:

#!/bin/bash --login
#$ -cwd
### Note we will load the modulefiles in the jobscript hence
### no '#$ -V' line and we've added --login above.

### Automatically run 22 copies of this job (each uses 1 core)
#$ -t 1-22

# Load the modulefiles inside the jobscript
module load apps/gcc/bgen/latest
module load tools/env/ukbiobank-full-release

### ${SGE_TASK_ID} is automatically replaced by the task number 1, 2, 3, ..., 22
BGENFILE=${UKBB_IMPUTATION_DIR}/ukb_imp_chr${SGE_TASK_ID}_v2.bgen

### Generate the correct index filename for bgenix. Only the filename is
### rewritten, so "imp" or "bgen" in the directory path is left alone.
INDEXFILE=$(echo "$BGENFILE" | sed 's#ukb_imp_\([^/]*\)\.bgen$#ukb_bgi_\1.bgi#')

### Run bgenix
bgenix -g "$BGENFILE" -i "$INDEXFILE" -list
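Before submitting the job array it can be worth checking that every BGEN file and its index are readable, so that none of the 22 tasks fails on a missing file. The following is a minimal sketch: the function name check_ukb_files is hypothetical, and it assumes chromosomes 1 to 22 with the _v2 filenames listed above, all in one directory.

```shell
# Hypothetical pre-flight check: list any missing BGEN or index file for
# chromosomes 1..22 in the given directory. Returns non-zero if any is missing.
check_ukb_files() {
    local dir="$1"
    local missing=0
    local chr=1
    local f
    while [ "$chr" -le 22 ]; do
        for f in "ukb_imp_chr${chr}_v2.bgen" "ukb_bgi_chr${chr}_v2.bgi"; do
            if [ ! -r "$dir/$f" ]; then
                echo "missing: $dir/$f"
                missing=$((missing + 1))
            fi
        done
        chr=$((chr + 1))
    done
    [ "$missing" -eq 0 ]
}
```

With both modulefiles loaded, check_ukb_files "$UKBB_IMPUTATION_DIR" && qsub scriptname would submit the array only when all files are present.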