bioawk

Overview

Bioawk is an extension of Brian Kernighan’s awk, written by Heng Li, that adds support for several common biological data formats, including optionally gzip’ed BED, GFF, SAM, VCF, FASTA/Q and TAB-delimited formats with column names. It also adds a few built-in functions and a command-line option to use TAB as the input/output delimiter. When the new functionality is not used, bioawk is intended to behave exactly the same as the original BWK awk.

The latest version, dated 27-08-2013, is installed on the CSF. Bioawk does not have version numbers, so the date of the GitHub commit is used to identify versions.

Restrictions on use

No licence information available.

Set up procedure

To access the software you must first load the modulefile:

module load apps/gcc/bioawk/27-08-2013

Running the application

Please do not run bioawk on the login node. Jobs should be submitted to the compute nodes via batch.

Serial batch job submission

Make sure you have the modulefile loaded then create a batch submission script, for example:

#!/bin/bash --login
#$ -cwd             # Job will run from the current directory
module load apps/gcc/bioawk/27-08-2013
# Print the header and the mapped reads (aln.sam.gz is an example input file)
bioawk -Hc sam '!and($flag,4)' aln.sam.gz

Submit the jobscript using:

qsub scriptname

where scriptname is the name of your jobscript.

Parallel batch job submission

bioawk is not designed to run in parallel. If you need to run multiple similar jobs then please consider using job arrays.
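
As a rough illustration of the job array approach, the jobscript below is a minimal sketch rather than a tested recipe. It assumes the scheduler's standard array option (-t) and the SGE_TASK_ID variable, plus a hypothetical file samfiles.txt listing one input file per line; the task range and file names are placeholders to adapt.

```bash
#!/bin/bash --login
#$ -cwd             # Job will run from the current directory
#$ -t 1-10          # Assumption: ten tasks, one per line of samfiles.txt
module load apps/gcc/bioawk/27-08-2013

# samfiles.txt is a hypothetical list of inputs; pick the line matching this task
INFILE=$(sed -n "${SGE_TASK_ID}p" samfiles.txt)

# Extract unmapped reads from this task's file into its own output file
bioawk -c sam 'and($flag,4)' "$INFILE" > "${INFILE%.sam.gz}.unmapped.sam"
```

Each task then processes a single file independently, so no task depends on another.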

Examples of usage:

  1. List the supported formats:
    bioawk -c help
    
  2. Extract unmapped reads without header:
    bioawk -c sam 'and($flag,4)' aln.sam.gz
    
  3. Extract mapped reads with header:
    bioawk -Hc sam '!and($flag,4)' aln.sam.gz
    
  4. Reverse complement FASTA:
    bioawk -c fastx '{print ">"$name;print revcomp($seq)}' seq.fa.gz

Further examples can be found on the bioawk help page.
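
The and($flag,4) test in examples 2 and 3 is a bitwise AND: in the SAM format, bit 4 of the flag field marks a read as unmapped. The sketch below reproduces the same check in plain shell arithmetic so the logic can be tried without bioawk. The function name is_unmapped and the two flag values are made up for illustration.

```shell
# Succeeds when bit 4 (the SAM "unmapped" bit) is set in the given flag
is_unmapped() {
    [ $(( $1 & 4 )) -ne 0 ]
}

# 77 = 64+8+4+1 has the bit set; 99 = 64+32+2+1 does not
for flag in 77 99; do
    if is_unmapped "$flag"; then
        echo "$flag: unmapped"
    else
        echo "$flag: mapped"
    fi
done
```

This prints "77: unmapped" followed by "99: mapped".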

Recognized Formats

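These formats may be passed to the -c flag. The column names listed under each format can then be used as field variables (for example $flag with sam):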

bed
1:chrom 2:start 3:end 4:name 5:score 6:strand 7:thickstart 8:thickend 9:rgb 10:blockcount 11:blocksizes 12:blockstarts
sam
1:qname 2:flag 3:rname 4:pos 5:mapq 6:cigar 7:rnext 8:pnext 9:tlen 10:seq 11:qual
vcf
1:chrom 2:pos 3:id 4:ref 5:alt 6:qual 7:filter 8:info
gff
1:seqname 2:source 3:feature 4:start 5:end 6:score 7:filter 8:strand 9:group 10:attribute
fastx
1:name 2:seq 3:qual

The fastx format handles both FASTA and FASTQ files.
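
To illustrate what the column names stand for, the sketch below computes feature lengths from BED records using standard awk and column numbers. With bioawk the same calculation can be written with names, as bioawk -c bed '{print $end - $start}'. The helper name bed_lengths and the two sample records are invented for the example.

```shell
# Plain awk equivalent of: bioawk -c bed '{print $end - $start}'
# In BED, column 2 is start and column 3 is end
bed_lengths() {
    awk -F'\t' '{print $3 - $2}'
}

# Two made-up tab-separated records: chrom, start, end
printf 'chr1\t100\t250\nchr2\t0\t75\n' | bed_lengths
```

This prints 150 and 75, one per line.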

Further info

Updates

None.

Last modified on March 29, 2019 at 3:58 pm by Daniel Corbett