GTDB-Tk now uses a divide-and-conquer approach where the bacterial reference tree is split into multiple class-level subtrees. This reduces the memory requirements of GTDB-Tk from 320 GB of RAM when using the full GTDB R07-RS207 reference tree to approximately 55 GB. A manuscript describing this approach is in preparation. If you wish to continue using the full GTDB reference tree, use the --full-tree flag. This is the main change from v2.0.0: the split tree approach has been modified from order-level trees to class-level trees to resolve specific classification issues (see #383).
Genomes that cannot be assigned to a domain (e.g. genomes with no bacterial or archaeal markers or genomes with no genes called by Prodigal) are now reported in the gtdbtk.bac120.summary.tsv as ‘Unclassified’
Genomes filtered out during the alignment step are now reported in the gtdbtk.bac120.summary.tsv or gtdbtk.ar53.summary.tsv as ‘Unclassified Bacteria/Archaea’
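As a minimal sketch of how these rows could be picked out downstream, the following reads a summary file and lists the genomes whose classification starts with "Unclassified". The column names user_genome and classification are assumptions about the summary layout, and the helper name unclassified_genomes is hypothetical.

```python
import csv
import io


def unclassified_genomes(handle):
    """Return IDs of genomes whose classification starts with 'Unclassified'.

    Assumes a tab-separated file with 'user_genome' and 'classification'
    columns (an assumption about the summary layout, not confirmed here).
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        row["user_genome"]
        for row in reader
        if row["classification"].startswith("Unclassified")
    ]


# Small in-memory stand-in for a summary file such as gtdbtk.bac120.summary.tsv.
example = io.StringIO(
    "user_genome\tclassification\n"
    "genome_a\td__Bacteria;p__Example\n"
    "genome_b\tUnclassified\n"
    "genome_c\tUnclassified Bacteria\n"
)
print(unclassified_genomes(example))
```

Matching on the prefix catches both the plain "Unclassified" label and the domain-level "Unclassified Bacteria/Archaea" labels.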
The --write_single_copy_genes flag is now available in the classify_wf and de_novo_wf workflows.
GTDB-Tk now uses a divide-and-conquer approach where the bacterial reference tree is split into multiple order-level subtrees. This reduces the memory requirements of GTDB-Tk from 320 GB of RAM when using the full GTDB R07-RS207 reference tree to approximately 35 GB. A manuscript describing this approach is in preparation. If you wish to continue using the full GTDB reference tree, use the --full-tree flag.
Archaeal classification now uses a refined set of 53 archaeal-specific marker genes based on the recent publication by Dombrowski et al., 2020. This set of archaeal marker genes is now used by GTDB for curating the archaeal taxonomy.
All directories containing intermediate results are now removed by default at the end of the classify_wf and de_novo_wf pipelines. If you wish to retain these intermediate files, use the --keep_intermediates flag.
All MSA files produced by the align step are now compressed with gzip.
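Scripts that read these MSA files directly need to handle the compression. A minimal sketch, assuming the compressed files carry a .gz suffix (the helper name open_text and the file name below are hypothetical):

```python
import gzip
import os
import tempfile


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


# Round trip on a temporary file so the example does not depend on real output.
with tempfile.TemporaryDirectory() as tmp:
    msa_path = os.path.join(tmp, "example_msa.fasta.gz")
    with gzip.open(msa_path, "wt") as out:
        out.write(">genome_a\nMK-LV\n")
    with open_text(msa_path) as handle:
        first_line = handle.readline().strip()

print(first_line)
```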
The classification summary and failed genomes files are now the only files linked in the root directory of classify_wf.
(#373) convert_to_itol to convert trees into iTOL format
(#369) Output FASTA files are compressed by default
(#369) Intermediate files will be removed by default when using classify/de-novo workflows unless specified by --keep_intermediates
(#362) Add --genes flag for Error
This version is not backwards compatible with GTDB release 202.
This version requires a new reference package
(#336) Warn the user if they have provided an incorrectly formatted taxonomy file.
(#348) Gracefully exit the program if no single copy hits could be identified.
(#351) Fixed an issue where GTDB-Tk would crash if spaces were present in the reference data path.
(#311) Updated GTDB-Tk to support R202. See https://ecogenomics.github.io/GTDBTk/installing/index.html#gtdb-tk-reference-data for instructions on downloading R202.
Check if stdout is being piped to a file before adding colour.
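The usual way to make this check in Python is to ask the stream whether it is attached to a terminal. The sketch below is illustrative only and is not taken from the GTDB-Tk source; the helper name colourise is hypothetical.

```python
import io
import sys


def colourise(text, code="32", stream=sys.stdout):
    """Wrap text in an ANSI colour code only when the stream is a terminal."""
    if hasattr(stream, "isatty") and stream.isatty():
        return f"\033[{code}m{text}\033[0m"
    return text


# A StringIO buffer is never a terminal, so the text comes back unchanged,
# which is what should happen when output is redirected to a file.
print(repr(colourise("done", stream=io.StringIO())))
```

Skipping the escape codes when output is redirected keeps log files free of control characters.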
(#283) Significantly improved classify performance (noticeable when running trees > 1,000 taxa).
Automatically cap pplacer CPUs to 64 unless specifying --pplacer_cpus to prevent pplacer from hanging.
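The capping rule can be sketched as follows. This is an illustration of the behaviour described, not the actual implementation; the function name resolve_pplacer_cpus and its parameters are hypothetical.

```python
def resolve_pplacer_cpus(cpus, pplacer_cpus=None, cap=64):
    """Pick the number of CPUs to hand to pplacer.

    An explicit pplacer_cpus value is always respected; otherwise the
    general CPU count is used, limited to the cap.
    """
    if pplacer_cpus is not None:
        return pplacer_cpus
    return min(cpus, cap)


print(resolve_pplacer_cpus(128))                    # capped
print(resolve_pplacer_cpus(128, pplacer_cpus=96))   # explicit value wins
```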
identify command. Writes unaligned single-copy AR53/BAC120 marker genes to disk.
-version warn if GTDB-Tk is not running the most up-to-date version (disable via GTDBTK_VER_CHECK = False in config.py). If GTDB-Tk encounters an error it will silently continue (3 second timeout).
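A version check that fails silently can be sketched as below. Where GTDB-Tk actually looks up the latest version is not stated here, so the PyPI URL is an assumption, and all function names are hypothetical.

```python
import json
import urllib.request

# Assumed source of the latest version number; not confirmed by the changelog.
PYPI_URL = "https://pypi.org/pypi/gtdbtk/json"


def fetch_latest_version(url=PYPI_URL, timeout=3):
    """Fetch the latest published version string, giving up after `timeout` seconds."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)["info"]["version"]


def version_tuple(version):
    """Turn '2.1.0' into (2, 1, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def update_warning(current, fetch=fetch_latest_version, enabled=True):
    """Return a warning string if a newer version exists, otherwise None.

    Any failure (no network, timeout, unexpected response) is swallowed so
    the check can never stop the main program.
    """
    if not enabled:
        return None
    try:
        latest = fetch()
        if version_tuple(latest) > version_tuple(current):
            return f"A newer version is available: {latest} (running {current})"
    except Exception:
        return None
    return None


# Offline demonstration with a stand-in fetcher instead of a network call.
print(update_warning("1.0.0", fetch=lambda: "1.1.0"))
```

Passing the fetcher as a parameter keeps the comparison logic testable without network access.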
(#276) Renamed the column
(#277) Fixed an issue where if the user overrides the translation table using the optional 3rd column in the batchfile, the other coding density would appear as -100. Both translation table densities are now reported.
The check_install command now also checks that all third party binaries can be found on the system path.
align step is now approximately 10x faster.
classify_wf which allows the user to specify the minimum alignment fraction for FastANI.
--mash_db command to re-use the GTDB-Tk Mash reference database in
This version of GTDB-Tk requires a new version of the GTDB-Tk reference package (gtdbtk_r95_data.tar.gz) available here.
Updated reference package to use the GTDB Release 95 taxonomy.
Report if the species-specific ANI circumscription criteria is satisfied in the ani_closest.tsv file output by
Estimated time until completion has been dampened.
(#241) Moved GTDB-Tk entry point to bin/gtdbtk to support execution in some HPC systems (gtdbtk will still be aliased on install).
(#251) Allow parsing of FastANI v1.0 output files. However, a warning will be displayed to update FastANI.
(#254) Fixed an issue where --scratch_dir would fail and not clean up the mmap file.
(#242) Added the decorate command allowing the de novo workflow to be run
(#244) Added the infer_rank method which establishes the taxonomic ranks of internal nodes of user trees based on RED
(#248) If the identify command is run on the same directory, genomes which were already processed will be skipped.
pplacer output with running the
In rare cases pplacer would assign an empty taxonomy string which would raise an error.
(#229) Genomes using Windows line endings (\r\n) would raise an error.
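A FASTA reader that tolerates both Unix and Windows line endings can strip the trailing carriage return explicitly. This is a minimal sketch of that idea rather than the fix applied in GTDB-Tk; the function name read_fasta is hypothetical.

```python
import io


def read_fasta(handle):
    """Parse FASTA records into a dict of {header: sequence}.

    Trailing '\\r' and '\\n' characters are stripped from every line, so
    files saved with Windows line endings parse the same as Unix files.
    """
    chunks = {}
    name = None
    for raw in handle:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)
    return {key: "".join(parts) for key, parts in chunks.items()}


# The same records with Windows line endings.
windows_text = io.StringIO(">contig_1\r\nACGT\r\nTTGA\r\n>contig_2\r\nGGCC\r\n")
print(read_fasta(windows_text))
```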
(#227) CentOS machines would fail when using
The bac120 symlink was pointing to the archaeal tree when using the
gtdb_to_ncbi_majority_vote.py script for translating taxonomy.
(#195) Added the --pplacer_cpus argument to specify the number of pplacer threads when running
align outputs aligned markers to disk before trimming.
(#225) An optional third column in the --batchfile will specify an override to which translation table should be used. Leave blank to automatically determine the translation table (default).
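Parsing a batchfile with the optional third column might look like the sketch below. It assumes a tab-separated layout of FASTA path, genome ID and an optional translation table, and assumes that 4 and 11 are the only accepted tables; the function name parse_batchfile is hypothetical.

```python
import io


def parse_batchfile(handle, allowed_tables=(4, 11)):
    """Parse batchfile lines into {genome_id: (fasta_path, translation_table)}.

    The translation table is None when the third column is missing or blank,
    meaning it should be determined automatically.
    """
    genomes = {}
    for line_no, raw in enumerate(handle, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue
        cols = line.split("\t")
        if len(cols) < 2:
            raise ValueError(f"line {line_no}: expected at least two columns")
        path, genome_id = cols[0].strip(), cols[1].strip()
        table = None
        if len(cols) >= 3 and cols[2].strip():
            table = int(cols[2])
            if table not in allowed_tables:
                raise ValueError(f"line {line_no}: unsupported table {table}")
        genomes[genome_id] = (path, table)
    return genomes


example = io.StringIO(
    "/data/genome_a.fna\tgenome_a\n"
    "/data/genome_b.fna\tgenome_b\t4\n"
    "/data/genome_c.fna\tgenome_c\t\n"
)
print(parse_batchfile(example))
```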
(#131) Users can now specify genomes which have NCBI accessions, as long as they are not GTDB-Tk representatives (a warning will be raised).
(#191) Added a new command ani_rep which calculates the ANI of input genomes to all GTDB representative genomes.
This command uses Mash in a pre-filtering step. If pre-filtering is enabled (default) then mash will need to be on the system path. To disable pre-filtering use the
(#230) Improved how markers are used in determining the correct domain, and gene selection for the alignment.
Fixed an issue where FastANI threads would time out with "FastANI returned a non-zero exit code."
Migrated to Python 3. Python 3.6 or later is required to use this version.
check_install now does an exhaustive check of the reference data.
Resolved an issue where gene calling would fail for low quality genomes (#192).
Improved FastANI multiprocessing performance.
Third party software versions are reported where possible.
A bug has been fixed which affected classify_wf when using the --batchfile argument with genome IDs that differed from the FASTA filename. This issue resulted in the assigned taxonomy being derived only from tree placement without any ANI calculations being considered. Consequently, in some cases genomes may have been classified as a new species within a genus when they should have been assigned to an existing species. If your genomes have species assignments, this bug did not impact you.
Progress is now displayed for hmmalign and pplacer.
Fixed an issue where the root command could not be run independently.
Improved MSA masking performance.
FastANI calculations are more robust.
Optimisation of RED calculations.
Improved output messages when errors are encountered.
Pplacer taxonomy is now available in the summary file.
FastANI species assignment will be selected over phylogenetic placement (Topology case).
Best translation table displayed in summary file.
GTDB-Tk now supports gzipped genomes as inputs (
By default, GTDB-Tk uses precalculated RED values.
New option to recalculate RED value during classify step (
New option to export the untrimmed reference MSA files.
New option to skip_trimming during align step.
New option to use a custom taxonomy file when rooting a tree.
New FAQ page available.
New output structure.
Species classification is now based strictly on the ANI to reference genomes
The “classify” function now reports the closest reference genome in the summary file even if the ANI is <95%
The summary.tsv file has 4 new columns: aa_percent, red_values, fastani_reference_radius, and warnings
By default, the “align” function now performs the same MSA trimming used by the GTDB
New pplacer support for writing to a scratch file (
Random seed option for MSA trimming has been added to allow for reproducible results
Configuration of the data directory is now set using the environment variable GTDBTK_DATA_PATH (see pip installation)
Perl dependencies have been removed
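A wrapper script can check that the variable is set before launching a run. The sketch below only reads the environment; the helper name reference_data_path and the example path are hypothetical.

```python
import os


def reference_data_path(environ=os.environ):
    """Return the reference data directory named by GTDBTK_DATA_PATH.

    Raises RuntimeError with a readable message when the variable is unset
    or empty, rather than failing later with a less obvious error.
    """
    path = environ.get("GTDBTK_DATA_PATH")
    if not path:
        raise RuntimeError("GTDBTK_DATA_PATH is not set")
    return path


# Demonstration with an explicit mapping instead of the real environment.
print(reference_data_path({"GTDBTK_DATA_PATH": "/path/to/reference_data"}))
```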
Python libraries biolib, mpld3 and jinja have been removed
This version requires a new version of the GTDB-Tk data package (gtdbtk.r86_v2_data.tar.gz) available here
GTDB-Tk v0.1.3 has been released and addresses an issue with species assignments based on the placement of genomes in the reference tree. This impacted species assignment when submitting multiple closely related genomes. Species assignments reported by ANI were not impacted.