GTDB-tk

Overview

GTDB-tk is a software toolkit for assigning objective taxonomic classifications to bacterial and archaeal genomes based on the Genome Taxonomy Database (GTDB). It is designed to work with recent advances that allow hundreds or thousands of metagenome-assembled genomes (MAGs) to be obtained directly from environmental samples. It can also be applied to isolate and single-cell genomes.

Versions 1.5.1, 1.0.2 and 1.0.1 are installed on the CSF.

Restrictions on use

There are no restrictions on accessing this software on the CSF. It is released under the GNU GPL v3.0 license and all usage must adhere to that license.

Set up procedure

We now recommend loading modulefiles within your jobscript so that you have a full record of how the job was run. See the example jobscript below for how to do this. Alternatively, you may load modulefiles on the login node and let the job inherit these settings.

Load one of the following modulefiles:

module load apps/python/gtdbtk/1.5.1
module load apps/python/gtdbtk/1.0.2
module load apps/python/gtdbtk/1.0.1

The above versions use the following external applications; the required modulefiles will be loaded automatically:

# These modulefiles will be loaded by the gtdbtk modulefile. You DO NOT need to load them.
apps/binapps/anaconda3/2019.07
apps/gcc/prodigal/2.6.3
apps/gcc/hmmer/3.2.1
apps/gcc/fastani/1.3
apps/intel-18.0/fasttree/2.1.11
apps/binapps/pplacer/1.1.alpha19

Reference Data

The GTDB reference data Release 89 has been downloaded and will automatically be available to your job. The modulefile will set the environment variable $GTDBTK_DATA_PATH to the path of the folder containing this data. GTDB-tk will use this automatically.
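
For example, after loading one of the modulefiles you can confirm where the reference data lives (the path printed will be whatever your chosen modulefile sets):

# Load a version (1.5.1 used here as an example) and show the reference data location
module load apps/python/gtdbtk/1.5.1
echo $GTDBTK_DATA_PATH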

Running the application

Please note: The GTDB-tk website has the following recommendations for the hardware on which to run:

  • 100 GB of memory (RAM)
  • 27 GB of storage
  • Multiple CPUs

Hence we recommend running a multicore job on the CSF, in your scratch directory, possibly on the high-memory nodes.

Please do not run GTDB-tk on the login node. Jobs should be submitted to the compute nodes via batch.
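
For example, a minimal sketch of preparing a run directory before submitting a job, assuming your scratch area is available at ~/scratch (the directory names below are only examples):

# Work in your scratch area, not your home directory
cd ~/scratch
mkdir -p gtdbtk_run
cd gtdbtk_run
# Place (or symlink) your genome FASTA files in a sub-directory, e.g. my_genomes/,
# create your jobscript here, then submit it with qsub as described below.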

You may run the following command to obtain help about how to run the app:

gtdbtk -h

Similarly, help about a specific workflow or tool can be obtained using:

gtdbtk classify_wf -h
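
If you want to check that the installation and reference data are working before running on your own genomes, GTDB-tk provides a small built-in test that runs the classify workflow on a bundled test genome. A minimal sketch (run this inside a batch job rather than on the login node; the output directory name is just an example, and the exact options may differ slightly between versions, so check gtdbtk test -h):

gtdbtk test --out_dir test_output --cpus $NSLOTS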

Serial batch job submission

Create a batch submission script (which will load the modulefile in the jobscript), for example:

#!/bin/bash --login
#$ -cwd             # Job will run from the current directory
#$ -l mem512        # OPTIONAL LINE: provides 32GB per core (this is a 1 core job)
                    # Without this line you get 4-5GB per core.
                    # NO -V line - we load modulefiles in the jobscript

# Choose the version you require
module load apps/python/gtdbtk/1.0.1

# Example of running a workflow. $NSLOTS is automatically set to 1 in a serial job.
gtdbtk classify_wf --cpus $NSLOTS --genome_dir my_genomes --out_dir output_dir

Submit the jobscript using:

qsub scriptname

where scriptname is the name of your jobscript.
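
Note that GTDB-tk expects one FASTA file per genome in the directory passed to --genome_dir, with a .fna extension by default. If your files use a different suffix you can tell GTDB-tk with the --extension (-x) option, as in the sketch below (the directory and file names are only examples; check gtdbtk classify_wf -h for the options in your chosen version):

# Hypothetical input layout: one FASTA file per genome
#   my_genomes/bin_001.fa
#   my_genomes/bin_002.fa
# Tell GTDB-tk the files end in .fa rather than the default .fna
gtdbtk classify_wf --cpus $NSLOTS --genome_dir my_genomes --extension fa --out_dir output_dir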

Parallel batch job submission

GTDB-tk calls a number of external applications and some of these can benefit from running with multiple CPU cores. Hence you can submit a parallel job to make more cores available to the job. This will also increase the memory available to your job (usually 4-5GB per core).

Create a batch submission script (which will load the modulefile in the jobscript), for example:

#!/bin/bash --login
#$ -cwd             # Job will run from the current directory
#$ -l mem512        # OPTIONAL LINE: provides 32GB per core.
                    #                Without this line you get 4-5GB per core.
#$ -pe smp.pe 4     # Number of cores: can be 2--32 (without the mem512 line)
                    #                  can be 2--16 (with the mem512 line)
                    # NO -V line - we load modulefiles in the jobscript

# Choose the version you require
module load apps/python/gtdbtk/1.0.1

# Example of running a workflow. $NSLOTS is set to the number of cores given above.
gtdbtk classify_wf --cpus $NSLOTS --genome_dir my_genomes --out_dir output_dir

Submit the jobscript using:

qsub scriptname

where scriptname is the name of your jobscript.
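
When the classify workflow finishes, the per-genome classifications are written to tab-separated summary files in the output directory. A sketch of how you might inspect them (the exact filenames, for example gtdbtk.bac120.summary.tsv for bacterial genomes and gtdbtk.ar122.summary.tsv for archaeal genomes with Release 89 data, can vary between GTDB-tk versions and reference releases):

# List the summary files produced by classify_wf
ls output_dir/*summary.tsv
# Show the genome name and assigned classification (first two columns)
cut -f1,2 output_dir/gtdbtk.bac120.summary.tsv | less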

Further info

Should your jobs run out of memory, please see the CSF documentation on requesting additional memory for your jobs.

Updates

None.
