bioawk
Overview
Bioawk is an extension of Brian Kernighan’s awk, created by Heng Li, that adds support for several common biological data formats, including optionally gzip’ed BED, GFF, SAM, VCF, FASTA/Q and TAB-delimited formats with column names. It also adds a few built-in functions and a command-line option to use TAB as the input/output delimiter. When the new functionality is not used, bioawk is intended to behave exactly the same as the original BWK awk.
The latest version, dated 27-08-2013, is installed on the CSF. Bioawk does not have version numbers, so the date of the GitHub commit is used to identify versions.
Restrictions on use
No licence information available.
Set up procedure
To access the software you must first load the modulefile:
module load apps/gcc/bioawk/27-08-2013
Running the application
Please do not run bioawk on the login node. Jobs should be submitted to the compute nodes via batch.
Serial batch job submission
Make sure you have the modulefile loaded then create a batch submission script, for example:
#!/bin/bash --login
#$ -cwd # Job will run from the current directory
module load apps/gcc/bioawk/27-08-2013
bioawk -Hc sam '!and($flag,4)' aln.sam.gz
Submit the jobscript using:
qsub scriptname
where scriptname is the name of your jobscript.
Parallel batch job submission
bioawk is not designed to run in parallel. If you need to run multiple similar jobs then please consider using job arrays.
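A job array for this could look like the minimal sketch below. It assumes a hypothetical file filelist.txt in the current directory that lists one input SAM file per line, and it runs three tasks; adjust the -t range to match the number of lines. The output file names are also only illustrative.

```bash
#!/bin/bash --login
#$ -cwd            # Job will run from the current directory
#$ -t 1-3          # Run tasks 1..3; assumes filelist.txt has 3 lines

module load apps/gcc/bioawk/27-08-2013

# filelist.txt is a hypothetical list of input files, one per line.
# Each task reads the line whose number matches its task ID.
INFILE=$(sed -n "${SGE_TASK_ID}p" filelist.txt)

# Extract mapped reads with header, writing one output file per task.
bioawk -Hc sam '!and($flag,4)' "$INFILE" > "mapped.${SGE_TASK_ID}.sam"
```

Each task is an independent serial bioawk run, so the tasks can execute at the same time on different cores without bioawk itself needing to be parallel.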
Examples of usage:
- List the supported formats:
bioawk -c help
- Extract unmapped reads without header:
bioawk -c sam 'and($flag,4)' aln.sam.gz
- Extract mapped reads with header:
bioawk -Hc sam '!and($flag,4)' aln.sam.gz
- Reverse complement FASTA:
bioawk -c fastx '{print ">"$name;print revcomp($seq)}' seq.fa.gz
Further examples can be found on the bioawk help page.
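The SAM examples above rely on a bitwise AND: the value 4 in the SAM flag field marks a read as unmapped, so and($flag,4) is non-zero only for unmapped reads. As an illustration only, the sketch below reproduces that check with plain bash arithmetic rather than bioawk. The function name is_unmapped and the sample flag values are hypothetical.

```bash
#!/bin/bash
# Illustration only: mimic bioawk's and($flag,4) with bash arithmetic.
# is_unmapped is a hypothetical helper name, not part of bioawk.
is_unmapped() {
    local flag=$1
    # 4 (0x4) is the "read unmapped" bit of the SAM flag field.
    (( (flag & 4) != 0 ))
}

# Classify a few sample flag values.
for flag in 0 4 16 77 99; do
    if is_unmapped "$flag"; then
        echo "$flag unmapped"
    else
        echo "$flag mapped"
    fi
done
```

Other flag bits can be tested the same way by changing the mask, for example 16 for reads on the reverse strand.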
Recognized Formats
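These format names may be passed to the -c option. The columns listed under each format can be referred to by name in a bioawk program, for example $flag or $seq: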
- bed
  - 1:chrom 2:start 3:end 4:name 5:score 6:strand 7:thickstart 8:thickend 9:rgb 10:blockcount 11:blocksizes 12:blockstarts
- sam
  - 1:qname 2:flag 3:rname 4:pos 5:mapq 6:cigar 7:rnext 8:pnext 9:tlen 10:seq 11:qual
- vcf
  - 1:chrom 2:pos 3:id 4:ref 5:alt 6:qual 7:filter 8:info
- gff
  - 1:seqname 2:source 3:feature 4:start 5:end 6:score 7:filter 8:strand 9:group 10:attribute
- fastx
  - 1:name 2:seq 3:qual
The fastx format handles both FASTA and FASTQ input.
Further info
Updates
None.