Installing GTDB-Tk

GTDB-Tk is available through multiple sources.

If you are unsure which source to install, Bioconda is generally the easiest.

Sources

Alternatively, GTDB-Tk can be run online through KBase (third party).

Hardware requirements

Domain    Memory                                   Storage  Time
Archaea   ~34 GB                                   ~65 GB   ~1 hour / 1,000 genomes @ 64 CPUs
Bacteria  ~55 GB (320 GB when using --full_tree)   ~65 GB   ~1 hour / 1,000 genomes @ 64 CPUs

Note

The amount of memory reported can vary depending on the number of pplacer threads. See "GTDB-Tk reaches the memory limit / pplacer crashes" for more information.
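As a rough pre-flight check, the figures in the table can be compared against the machine GTDB-Tk will run on. The following is a minimal sketch, not part of GTDB-Tk: the function names are hypothetical, the memory lookup assumes a POSIX system where os.sysconf is available, and it assumes the Storage column refers to free disk space in the directory that will hold the reference data.

```python
import os
import shutil

GB = 10 ** 9  # decimal gigabytes

# Approximate figures copied from the hardware requirements table
# (bacteria without --full_tree).
REQUIREMENTS_GB = {
    "archaea": {"memory": 34, "storage": 65},
    "bacteria": {"memory": 55, "storage": 65},
}


def total_memory_bytes():
    """Return installed physical memory in bytes, or None if unavailable."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        return None  # e.g. Windows, where os.sysconf does not exist


def is_enough(required_gb, available_bytes):
    """True if the available amount covers the requirement."""
    return available_bytes >= required_gb * GB


def check_resources(domain, data_dir="."):
    """Compare this machine against the table row for `domain`."""
    need = REQUIREMENTS_GB[domain]
    memory = total_memory_bytes()
    free_disk = shutil.disk_usage(data_dir).free
    return {
        # None means the amount of memory could not be determined.
        "memory_ok": None if memory is None else is_enough(need["memory"], memory),
        "storage_ok": is_enough(need["storage"], free_disk),
    }


if __name__ == "__main__":
    for domain in REQUIREMENTS_GB:
        print(domain, check_resources(domain))
```

Because the table values are approximate and memory use depends on the number of pplacer threads, a passing check here is only an indication, not a guarantee.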

Python libraries

GTDB-Tk is designed for Python >=3.6 and requires the following libraries, which will be automatically installed:

Library   Version    Reference
DendroPy  >= 4.1.0   Sukumaran, J. and Mark T. Holder. 2010. DendroPy: A Python library for phylogenetic computing. Bioinformatics 26: 1569-1571.
NumPy     >= 1.9.0   Harris, C.R., Millman, K.J., van der Walt, S.J. et al. Array programming with NumPy. Nature 585, 357–362 (2020). DOI: 10.1038/s41586-020-2649-2
tqdm      >= 4.35.0  DOI: 10.5281/zenodo.595120

Please cite these libraries if you use GTDB-Tk in your work.
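Although these libraries are installed automatically, it can be useful to confirm that an existing environment satisfies the minimum versions. The sketch below is one way to do that with the standard library only. It assumes Python 3.8 or later (for importlib.metadata) and assumes the distributions are registered under the names dendropy, numpy and tqdm; the helper names are hypothetical and the version parsing is deliberately simple.

```python
import re
from importlib import metadata  # standard library from Python 3.8

# Minimum versions copied from the library table.
MINIMUM_VERSIONS = {"dendropy": "4.1.0", "numpy": "1.9.0", "tqdm": "4.35.0"}


def version_tuple(version):
    """Turn '4.35.0' into (4, 35, 0), keeping only leading digits of each part."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two version strings numerically, padding the shorter with zeros."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_libraries(minimums=MINIMUM_VERSIONS):
    """Map each library to True/False, or None if it is not installed."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
            continue
        report[name] = meets_minimum(installed, minimum)
    return report


if __name__ == "__main__":
    for name, ok in check_libraries().items():
        print(name, "missing" if ok is None else ("ok" if ok else "too old"))
```

Pre-release suffixes such as "rc1" are ignored by this parser, so a dedicated version-parsing library would be a better choice where exact ordering matters.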

Third-party software

GTDB-Tk makes use of the following third-party dependencies and assumes they are on your system path:

Tip

The check_install command will verify that all of the programs are on the path.

Software  Version    Reference
Prodigal  >= 2.6.2   Hyatt D, et al. 2010. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics, 11:119. doi: 10.1186/1471-2105-11-119.
HMMER     >= 3.1b2   Eddy SR. 2011. Accelerated profile HMM searches. PLOS Comp. Biol., 7:e1002195.
pplacer   >= 1.1     Matsen FA, et al. 2010. pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC Bioinformatics, 11:538.
FastANI   >= 1.32    Jain C, et al. 2019. High-throughput ANI Analysis of 90K Prokaryotic Genomes Reveals Clear Species Boundaries. Nat. Communications, doi: 10.1038/s41467-018-07641-9.
FastTree  >= 2.1.9   Price MN, et al. 2010. FastTree 2 - Approximately Maximum-Likelihood Trees for Large Alignments. PLoS One, 5, e9490.
Mash      >= 2.2     Ondov BD, et al. 2016. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol 17, 132. doi: 10.1186/s13059-016-0997-x.

Please cite these tools if you use GTDB-Tk in your work.
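The requirement that these programs are on the system path can also be checked by hand. The following sketch uses shutil.which from the Python standard library and is not a replacement for the check_install command. The executable names are assumptions based on how each tool is commonly installed (for example, hmmsearch for HMMER), and the sketch checks presence only, not versions.

```python
import shutil

# Tool name -> executable expected on the PATH.
# These executable names are assumptions, not taken from GTDB-Tk itself.
EXECUTABLES = {
    "Prodigal": "prodigal",
    "HMMER": "hmmsearch",
    "pplacer": "pplacer",
    "FastANI": "fastANI",
    "FastTree": "FastTree",
    "Mash": "mash",
}


def find_missing(executables=EXECUTABLES, which=shutil.which):
    """Return the sorted names of tools whose executable is not on the PATH."""
    return sorted(tool for tool, exe in executables.items() if which(exe) is None)


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All third-party executables were found.")
```

The which argument exists only so the lookup can be swapped out, for example when testing the function without the tools installed.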

GTDB-Tk reference data

GTDB-Tk requires ~66 GB of external data, which must be downloaded and extracted:

wget https://data.gtdb.ecogenomic.org/releases/latest/auxillary_files/gtdbtk_v2_data.tar.gz
# Or, download from the mirror instead:
# wget https://data.ace.uq.edu.au/public/gtdb/data/releases/latest/auxillary_files/gtdbtk_v2_data.tar.gz
tar xvzf gtdbtk_v2_data.tar.gz

Note that a given GTDB release may not run on every version of GTDB-Tk. The supported combinations are listed below:

GTDB Release  Minimum GTDB-Tk version  Maximum GTDB-Tk version
R207_v2       2.1.0                    Current
R207          2.0.0                    2.0.0
R202          1.5.0                    1.7.0
R95           1.3.0                    1.4.2
R89           0.3.0                    0.1.2
R86.2         0.2.1                    0.2.2
R86           0.1.0                    0.1.6
R83           0.0.6                    0.0.7
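The compatibility table can be read as a range lookup: a GTDB-Tk version works with a release if it falls between that release's minimum and maximum. The sketch below illustrates the idea with a few rows copied from the table. It is an abridged, hypothetical helper, not something GTDB-Tk provides, and it treats "Current" as having no upper bound.

```python
# (minimum, maximum) GTDB-Tk versions per GTDB release.
# Abridged: only some rows of the table; None stands for "Current".
SUPPORTED = {
    "R207_v2": ("2.1.0", None),
    "R207": ("2.0.0", "2.0.0"),
    "R202": ("1.5.0", "1.7.0"),
    "R95": ("1.3.0", "1.4.2"),
}


def as_tuple(version):
    """Turn a dotted version such as '1.7.0' into (1, 7, 0)."""
    return tuple(int(part) for part in version.split("."))


def is_supported(release, gtdbtk_version):
    """True if `gtdbtk_version` lies within the supported range for `release`."""
    if release not in SUPPORTED:
        raise KeyError(f"release {release!r} is not in this abridged table")
    low, high = SUPPORTED[release]
    version = as_tuple(gtdbtk_version)
    if version < as_tuple(low):
        return False
    return high is None or version <= as_tuple(high)


if __name__ == "__main__":
    print(is_supported("R202", "1.6.0"))
    print(is_supported("R207", "2.1.0"))
```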

Reference data for prior releases of GTDB-Tk are available at: https://data.ace.uq.edu.au/public/gtdbtk