Installing GTDB-Tk

GTDB-Tk is available through multiple sources; you only need to choose one. If you are unsure which to choose, Bioconda is generally the easiest.

Sources

Alternatively, GTDB-Tk can be run online through KBase (third party). Note that the version may not be the most recent release.

Hardware requirements

Domain     Memory                                    Storage    Time
Archaea    ~60 GB                                    ~106 GB    ~90 minutes / 1,000 genomes @ 64 CPUs
Bacteria   ~90 GB (545 GB when using --full_tree)    ~106 GB    ~90 minutes / 1,000 genomes @ 64 CPUs

Note

The amount of memory reported can vary depending on the number of pplacer threads. See "GTDB-Tk reaches the memory limit / pplacer crashes" for more information.
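As a rough pre-flight check, the figures in the hardware table can be compared against the machine you plan to use. The following is only a sketch: the function names and the mode keys ("archaea", "bacteria", "bacteria_full_tree") are made up for this example, the memory probe relies on os.sysconf and so is assumed to work on Linux-like systems only, and sizes are treated as decimal gigabytes.

```python
import os
import shutil

# Approximate figures copied from the hardware requirements table (in GB).
REQUIRED_STORAGE_GB = 106
REQUIRED_MEMORY_GB = {"archaea": 60, "bacteria": 90, "bacteria_full_tree": 545}


def free_storage_gb(path="."):
    """Free space on the filesystem that holds `path`, in GB (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def total_memory_gb():
    """Total physical memory in GB, or None where os.sysconf cannot report it."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1e9
    except (AttributeError, ValueError, OSError):
        return None


def shortfalls(mode, memory_gb, storage_gb):
    """Return a list of problems; an empty list means the table's figures are met."""
    problems = []
    needed = REQUIRED_MEMORY_GB[mode]
    if memory_gb is not None and memory_gb < needed:
        problems.append(f"memory: {memory_gb:.0f} GB available, ~{needed} GB suggested")
    if storage_gb < REQUIRED_STORAGE_GB:
        problems.append(
            f"storage: {storage_gb:.0f} GB free, ~{REQUIRED_STORAGE_GB} GB suggested"
        )
    return problems


if __name__ == "__main__":
    for line in shortfalls("bacteria", total_memory_gb(), free_storage_gb(".")) or ["OK"]:
        print(line)
```

Because memory use varies with the number of pplacer threads, treat a passing result as a hint rather than a guarantee.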

Python libraries

GTDB-Tk is designed for Python >=3.6 and requires the following libraries, which will be automatically installed:

Library     Version      Reference
DendroPy    >= 4.1.0     Sukumaran, J. and Mark T. Holder. 2010. DendroPy: A Python library for phylogenetic computing. Bioinformatics 26: 1569-1571.
NumPy       >= 1.9.0     Harris, C.R., Millman, K.J., van der Walt, S.J. et al. Array programming with NumPy. Nature 585, 357–362 (2020). DOI: 10.1038/s41586-020-2649-2
tqdm        >= 4.35.0    DOI: 10.5281/zenodo.595120

Please cite these libraries if you use GTDB-Tk in your work.
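If you want to confirm which versions ended up in your environment, a short script can compare them with the minimums listed above. This is a minimal sketch, not part of GTDB-Tk: it assumes the libraries are installed under the distribution names "DendroPy", "numpy" and "tqdm", it uses importlib.metadata (Python 3.8 or later), and the helper names are illustrative. The version comparison only looks at the leading digits of each dot-separated component.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions from the table above; distribution names are assumptions.
MINIMUM_VERSIONS = {"DendroPy": "4.1.0", "numpy": "1.9.0", "tqdm": "4.35.0"}


def version_tuple(text):
    """Turn '4.35.0' into (4, 35, 0), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def at_least(installed, minimum):
    """True if `installed` is the same as or newer than `minimum`."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))  # pad so that '4.1' compares equal to '4.1.0'
    b += (0,) * (width - len(b))
    return a >= b


def check_libraries(minimums=MINIMUM_VERSIONS):
    """Map each library to (installed version or None, meets the minimum?)."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = (None, False)
        else:
            report[name] = (installed, at_least(installed, minimum))
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_libraries().items():
        print(f"{name}: {installed or 'not installed'} ({'ok' if ok else 'check'})")
```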

Third-party software

GTDB-Tk makes use of the following third-party dependencies and assumes they are on your system path:

Tip

The check_install command will verify that all of the programs are on the path.

Software    Version     Reference
Prodigal    >= 2.6.2    Hyatt D, et al. 2010. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics, 11:119. doi: 10.1186/1471-2105-11-119.
HMMER       >= 3.1b2    Eddy SR. 2011. Accelerated profile HMM searches. PLOS Comp. Biol., 7:e1002195.
pplacer     >= 1.1      Matsen FA, et al. 2010. pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC Bioinformatics, 11:538.
skani       >= 0.2.1    Shaw J. and Yu Y.W. 2023. Fast and robust metagenomic sequence comparison through sparse chaining with skani. Nature Methods, 20:1661–1665.
FastTree    >= 2.1.9    Price MN, et al. 2010. FastTree 2 - Approximately Maximum-Likelihood Trees for Large Alignments. PLoS One, 5, e9490.
Mash        >= 2.2      Ondov BD, et al. 2016. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol 17, 132. doi: 10.1186/s13059-016-0997-x.

Please cite these tools if you use GTDB-Tk in your work.
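The check_install command remains the authoritative way to verify the installation, but a quick look at the system path can also be done by hand. The sketch below uses shutil.which for that. The executable names it searches for (for example hmmsearch for HMMER) are assumptions for illustration and are not taken from this page, and the script only checks presence, not versions.

```python
import shutil

# Tool name -> executable assumed to be installed by that tool (assumption).
EXECUTABLES = {
    "Prodigal": "prodigal",
    "HMMER": "hmmsearch",
    "pplacer": "pplacer",
    "skani": "skani",
    "FastTree": "FastTree",
    "Mash": "mash",
}


def missing_tools(executables=EXECUTABLES):
    """Return the sorted names of tools whose executable is not on the path."""
    return sorted(
        tool for tool, exe in executables.items() if shutil.which(exe) is None
    )


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on the system path:", ", ".join(missing))
    else:
        print("All listed executables were found.")
```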

GTDB-Tk reference data

GTDB-Tk requires ~110 GB of external data that must be downloaded and extracted:

wget https://data.gtdb.ecogenomic.org/releases/latest/auxillary_files/gtdbtk_package/full_package/gtdbtk_data.tar.gz
# Or, from the mirror:
# wget https://data.ace.uq.edu.au/public/gtdb/data/releases/latest/auxillary_files/gtdbtk_package/full_package/gtdbtk_data.tar.gz
tar xvzf gtdbtk_data.tar.gz
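Given the size of the archive, it can be worth checking the download before extracting it. The release table later in this section lists an MD5 value per release; assuming that value is the checksum of the downloaded gtdbtk_data.tar.gz, a sketch like the following computes the digest in chunks so the whole file is never held in memory. The function names are illustrative.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected):
    """True if the file's MD5 equals `expected` (case and whitespace ignored)."""
    return md5_of_file(path) == expected.strip().lower()
```

For example, matches_md5("gtdbtk_data.tar.gz", "<value from the MD5 column>") returns True only when the digests agree. A mismatch usually suggests an incomplete download or a different release than expected.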

Note

Different versions of the GTDB release data may not run on all versions of GTDB-Tk. Check the supported versions below.

GTDB Release    Minimum version    Maximum version    MD5
R220            2.4.0              Current            5aafa1b9c27ceda003d75adf238ed9e0
R214            2.1.0              2.3.2              630745840850c532546996b22da14c27
R207_v2         2.1.0              2.3.2              df468d63265e8096d8ca01244cb95f30
R207            2.0.0              2.0.0              b04c55104b491f84e053a9011b36164a
R202            1.5.0              1.7.0              4986526c2b935fd4dcc2e604c0322517
R95             1.3.0              1.4.2              06924c63f4b555ac6fd1525b09901186
R89             0.3.0              0.1.2              82966ef36086237d7230955e2bfff759
R86.2           0.2.1              0.2.2              f71408d69fa2a289f2cdc734b7a58a02
R86             0.1.0              0.1.6              d019b3541746c3673181f24e666594ba
R83             0.0.6              0.0.7              9cf523761da843b5787f591f6c5a80de
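The compatibility rule in the table above amounts to a range check: a release is usable when the GTDB-Tk version lies between the minimum and maximum, inclusive. The sketch below encodes only the most recent rows for brevity and treats "Current" as having no upper bound. The names SUPPORTED and is_supported are illustrative and not part of GTDB-Tk, and the version parser assumes plain numeric versions such as 2.3.2.

```python
# Release -> (minimum version, maximum version); None stands for "Current".
SUPPORTED = {
    "R220": ("2.4.0", None),
    "R214": ("2.1.0", "2.3.2"),
    "R207_v2": ("2.1.0", "2.3.2"),
    "R207": ("2.0.0", "2.0.0"),
    "R202": ("1.5.0", "1.7.0"),
    "R95": ("1.3.0", "1.4.2"),
}


def parse_version(text):
    """Turn '2.3.2' into (2, 3, 2) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def is_supported(release, gtdbtk_version):
    """True if `gtdbtk_version` falls inside the table's range for `release`."""
    minimum, maximum = SUPPORTED[release]
    current = parse_version(gtdbtk_version)
    if current < parse_version(minimum):
        return False
    return maximum is None or current <= parse_version(maximum)


if __name__ == "__main__":
    for release in SUPPORTED:
        print(release, is_supported(release, "2.3.2"))
```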