GATK
Overview
GATK offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.
Various versions are installed on the CSF – please see modulefiles below.
Restrictions on use
There are no restrictions on accessing GATK4 on the CSF. It is released under the Apache 2.0 license and all use must adhere to that license.
GATK v3 is released under a more restrictive license that prohibits commercial/for-profit use, and all usage must adhere to that license. If this is too restrictive, you must switch to GATK4, which is fully open source.
Set up procedure
We now recommend loading modulefiles within your jobscript so that you have a full record of how the job was run. See the example jobscript below for how to do this. Alternatively, you may load modulefiles on the login node and let the job inherit these settings.
Load one of the following modulefiles:
module load apps/singularity/gatk/4.5.0.0
module load apps/binapps/gatk/4.4.0.0
module load apps/binapps/gatk/4.1.8.0

# For older versions, first load the bioinf modulefile
module load apps/bioinf
# Then the required gatk modulefile
module load apps/gatk/3.8.0     # See StatusLogger Log4j2 error fix below
module load apps/gatk/3.6.0
module load apps/gatk/3.5.0
Running the application
Please do not run gatk on the login node to process data. Jobs should be submitted to the compute nodes as batch jobs.
You may run gatk -h on the login node to see a list of flags that can be used to run the various GATK tools in your jobscripts.
Please note that complete instructions on how to run gatk are beyond the scope of this page. Please consult the GATK Online Documentation for how to use this application.
StatusLogger Log4j2 Error Fix
This section describes a fix for the StatusLogger error, which has been seen in v3.8.0 and may exist in other versions.
If you receive an error similar to the following, particularly in v3.8.0 when running on the AMD compute nodes (#SBATCH -p multicore):
ERROR StatusLogger Unable to create class org.apache.logging.log4j.core.impl.Log4jContextFactory
...
ERROR StatusLogger Log4j2 could not find a logging implementation. Please add log4j-core to the classpath.
then please append the following flags to the gatk command-line in your jobscript:
-jdk_inflater -jdk_deflater
Without these flags, gatk will use some optimized components that only run on Intel CPUs.
See the jobscript examples below.
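Because the optimized components are Intel-only, one option is to add the flags only when a job lands on a non-Intel node. The following is a minimal sketch of that idea, not part of the CSF setup: the helper names gatk_jdk_flags and cpu_vendor are hypothetical, and it assumes a Linux node where /proc/cpuinfo reports a vendor_id line.

```bash
#!/bin/bash
# Hypothetical helper: print the JDK fallback flags unless the CPU vendor is Intel.
# The flag spelling is the one quoted above for v3.8.0.
gatk_jdk_flags() {
    local vendor="$1"
    if [ "$vendor" = "GenuineIntel" ]; then
        echo ""
    else
        echo "-jdk_inflater -jdk_deflater"
    fi
}

# Hypothetical helper: read the CPU vendor string of the current node.
# Assumes Linux; prints nothing if /proc/cpuinfo is missing or has no vendor_id.
cpu_vendor() {
    awk -F': *' '/^vendor_id/ {print $2; exit}' /proc/cpuinfo 2>/dev/null
}

EXTRA_FLAGS=$(gatk_jdk_flags "$(cpu_vendor)")
echo "Extra GATK flags: ${EXTRA_FLAGS:-none}"
```

In a jobscript, $EXTRA_FLAGS would be appended unquoted to the end of the gatk command line so that it expands to two separate flags (or to nothing on Intel nodes). Always adding the flags, as the jobscript examples do, is the simpler choice if you do not mind skipping the Intel-optimized components.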
Serial batch job submission
Create a batch submission script (which will load the modulefile in the jobscript), for example:
#!/bin/bash --login
#SBATCH -p serial     # (or --partition=) Run on the nodes dedicated to 1-core jobs
#SBATCH -t 4-0        # Wallclock time limit. 4-0 is 4 days. Max permitted is 7-0.

# Start with a clean environment - modules are inherited from the login node by default.
module purge
module load apps/binapps/gatk/4.4.0.0

# Note: The command below uses GATK v3 syntax (-T <tool>). GATK4 tools are run as
# 'gatk <ToolName> [options]', so load the modulefile that matches the syntax you use.
# Note: The -jdk_inflater -jdk_deflater may be needed in v3.8.0 jobs on the AMD (-p multicore) nodes
gatk -T RealignerTargetCreator -R my.fasta -I my.bam -o my_realigner.intervals -jdk_inflater -jdk_deflater
Submit the jobscript using:
sbatch scriptname
where scriptname is the name of your jobscript.
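When submitting from a script, it can be handy to capture the job ID so the job can be monitored afterwards. The sketch below assumes sbatch prints its usual confirmation line, "Submitted batch job <id>". The helper name parse_job_id is hypothetical, and the job ID shown is made up.

```bash
#!/bin/bash
# Hypothetical helper: extract the numeric job ID from sbatch's confirmation line,
# which normally reads "Submitted batch job <id>".
parse_job_id() {
    local line="$1"
    # Keep only the trailing run of digits; print nothing if there is none.
    [[ "$line" =~ ([0-9]+)$ ]] && echo "${BASH_REMATCH[1]}"
}

# Example with a literal confirmation line (the ID is made up):
parse_job_id "Submitted batch job 123456"

# On the login node this could be combined with sbatch and squeue, for example:
#   jobid=$(parse_job_id "$(sbatch scriptname)")
#   squeue -j "$jobid"
```

The helper prints nothing when the line does not end in digits, for example when sbatch reports an error, so it is worth checking that the captured ID is non-empty before using it.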
Parallel batch job submission
Some GATK tools can use more than one core. An example parallel jobscript for the multicore partition is:
#!/bin/bash --login
#SBATCH -p multicore  # (or --partition=) Run on the AMD 168-core nodes
#SBATCH -n 16         # (or --ntasks=) Number of cores to use.
#SBATCH -t 4-0        # Wallclock time limit. 4-0 is 4 days. Max permitted is 7-0.

# Start with a clean environment - modules are inherited from the login node by default.
module purge
module load apps/binapps/gatk/4.4.0.0

# You must inform your app how many cores to use. $SLURM_NTASKS will be set to the -n number above.
# Note: The command below uses GATK v3 syntax (-T <tool>, -nt <threads>). GATK4 tools are run as
# 'gatk <ToolName> [options]', so load the modulefile that matches the syntax you use.
# Note: The -jdk_inflater -jdk_deflater may be needed in v3.8.0 jobs on the AMD (-p multicore) nodes
gatk -T RealignerTargetCreator -nt $SLURM_NTASKS -R my.fasta -I my.bam -o my_realigner.intervals -jdk_inflater -jdk_deflater
Submit the jobscript using:
sbatch scriptname
where scriptname is the name of your jobscript.
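$SLURM_NTASKS is only set inside a batch job, so a jobscript fragment tested elsewhere would pass an empty thread count. The sketch below shows one way to fall back to a single thread. The helper names gatk_threads and build_gatk_cmd are hypothetical, and the file names are the placeholders from the jobscript above.

```bash
#!/bin/bash
# Hypothetical helper: number of threads to pass to GATK.
# Uses $SLURM_NTASKS when set (inside a batch job), otherwise falls back to 1.
gatk_threads() {
    echo "${SLURM_NTASKS:-1}"
}

# Build the command as an array so each argument stays a separate word.
# The tool and options mirror the GATK v3 style example in the jobscript.
build_gatk_cmd() {
    GATK_CMD=(gatk -T RealignerTargetCreator -nt "$(gatk_threads)"
              -R my.fasta -I my.bam -o my_realigner.intervals)
}

build_gatk_cmd
echo "Would run: ${GATK_CMD[*]}"
# To actually run it inside a jobscript, replace the echo with: "${GATK_CMD[@]}"
```

Printing the command first is a cheap way to check the thread count and file names before committing to a multi-day run.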
Further info
Updates
None.