Change log

2.1.0

Major changes:

  • GTDB-Tk now uses a divide-and-conquer approach where the bacterial reference tree is split into multiple class-level subtrees. This reduces the memory requirements of GTDB-Tk from 320 GB of RAM when using the full GTDB R07-RS207 reference tree to approximately 55 GB. A manuscript describing this approach is in preparation. If you wish to continue using the full GTDB reference tree, use the --full-tree flag. This is the main change from v2.0.0: the split tree approach has been modified from order-level trees to class-level trees to resolve specific classification issues (see #383).

  • Genomes that cannot be assigned to a domain (e.g. genomes with no bacterial or archaeal markers or genomes with no genes called by Prodigal) are now reported in the gtdbtk.bac120.summary.tsv as ‘Unclassified’

  • Genomes filtered out during the alignment step are now reported in the gtdbtk.bac120.summary.tsv or gtdbtk.ar53.summary.tsv as ‘Unclassified Bacteria/Archaea’

  • The --write_single_copy_genes flag is now available in the classify_wf and de_novo_wf workflows.
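
To illustrate how the 'Unclassified' entries described above might be picked out of a summary file, the following Python sketch filters rows by their classification string. The column names user_genome and classification, the tab-separated layout, and the sample rows are assumptions made for this example, not a specification of the file format.

```python
import csv
import io

# Label taken from the change notes above; the exact spelling in a real
# summary file is an assumption.
UNCLASSIFIED_PREFIX = "Unclassified"


def find_unclassified(handle):
    """Return {genome: classification} for rows reported as unclassified."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        row["user_genome"]: row["classification"]
        for row in reader
        if row["classification"].startswith(UNCLASSIFIED_PREFIX)
    }


# Made-up rows for illustration only.
example = io.StringIO(
    "user_genome\tclassification\n"
    "genome_a\td__Bacteria;p__Pseudomonadota\n"
    "genome_b\tUnclassified\n"
    "genome_c\tUnclassified Bacteria\n"
)
unclassified = find_unclassified(example)
print(unclassified)
```

In practice the in-memory example would be replaced by an open file handle on the summary file of interest.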

Features:

  • (#392) --write_single_copy_genes flag available in workflows.

  • (#387) specific memory requirements set in classify_wf depending on the classification approach.

2.0.0

Major changes:

  • GTDB-Tk now uses a divide-and-conquer approach where the bacterial reference tree is split into multiple order-level subtrees. This reduces the memory requirements of GTDB-Tk from 320 GB of RAM when using the full GTDB R07-RS207 reference tree to approximately 35 GB. A manuscript describing this approach is in preparation. If you wish to continue using the full GTDB reference tree, use the --full-tree flag.

  • Archaeal classification now uses a refined set of 53 archaeal-specific marker genes based on the recent publication by Dombrowski et al., 2020. This set of archaeal marker genes is now used by GTDB for curating the archaeal taxonomy.

  • All directories containing intermediate results are now removed by default at the end of the classify_wf and de_novo_wf pipelines. If you wish to retain these intermediate files, use the --keep_intermediates flag.

  • All MSA files produced by the align step are now compressed with gzip.

  • The classification summary and failed genomes files are now the only files linked in the root directory of classify_wf.
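
Because the MSA files are now gzip-compressed, downstream scripts that previously read them as plain text need to decompress them first. The sketch below shows one way to do this with the Python standard library. The file name and sequences are invented for the example, and the parser is a minimal FASTA reader rather than anything taken from GTDB-Tk itself.

```python
import gzip
import os
import tempfile


def read_gzipped_fasta(path):
    """Parse a gzip-compressed FASTA file into {sequence_id: sequence}."""
    sequences = {}
    current_id = None
    # Mode "rt" decompresses and decodes the file to text on the fly.
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                header = line[1:].split()
                current_id = header[0] if header else ""
                sequences[current_id] = []
            elif current_id is not None:
                sequences[current_id].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in sequences.items()}


# Write a tiny made-up alignment to a temporary file, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    msa_path = os.path.join(tmp_dir, "example.msa.fasta.gz")  # hypothetical name
    with gzip.open(msa_path, "wt") as out:
        out.write(">genome_a\nMK-LV\nAA--\n>genome_b\nMKTLV\nAAGG\n")
    msa = read_gzipped_fasta(msa_path)

print(msa)
```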

Features:

  • (#373) convert_to_itol to convert trees into iTOL format

  • (#369) Output FASTA files are compressed by default

  • (#369) Intermediate files will be removed by default when using classify/de-novo workflows unless specified by --keep_intermediates

  • (#362) Add --genes flag for Error

  • (#360 / #356) A warning will be displayed if pplacer fails to place a genome

Important

  • This version is not backwards compatible with GTDB release 202.

  • This version requires a new reference package

1.7.0

  • (#336) Warn the user if they have provided an incorrectly formatted taxonomy file.

  • (#348) Gracefully exit the program if no single copy hits could be identified.

  • (#351) Fixed an issue where GTDB-Tk would crash if spaces were present in the reference data path.

  • (#354) Added optional --tmpdir argument to set temporary directory (thanks tr11-sanger!).

1.6.0

  • (#337) Set minimum tqdm version to 4.35.0

  • (#335) Fixed typo in output log messages (@fplaza)

  • Removed the option to re-calculate RED values (--recalculate_red)

1.5.1

  • (#327) Disallow spaces in genome names/file paths due to downstream application issues.

  • (#326) Disallow genome names that are blank.
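
The two restrictions above can be checked before a run is started. The following Python sketch is a hypothetical pre-flight check, not the validation code used by GTDB-Tk. The function name and the messages are invented for the example.

```python
def check_genome_entry(name, path):
    """Return a list of problems found in a genome name and its file path."""
    problems = []
    if name.strip() == "":
        problems.append("genome name is blank")
    elif " " in name:
        problems.append("genome name contains a space")
    if " " in path:
        problems.append("file path contains a space")
    return problems


# Illustrative inputs only.
ok = check_genome_entry("genome_1", "/data/genome_1.fna")
bad = check_genome_entry("my genome", "/data/my genome.fna")
blank = check_genome_entry("", "/data/unnamed.fna")
print(ok, bad, blank)
```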

1.5.0

1.4.2

  • (#311) Fixed --scratch_dir not working in v1.4.1 for classify_wf

  • (#312) Fixed an error in downstream modules of classify_wf that occurred when a genome was automatically dropped

1.4.1

  • Updated GitHub CI/CD to trigger docker build / tag version on release.

  • (#255) (#297) Fixed 'Namespace' object has no attribute errors by adding default arguments to argparse.

1.4.0

  • Check if stdout is being piped to a file before adding colour.

  • (#283) Significantly improved classify performance (noticeable when running trees > 1,000 taxa).

  • Automatically cap pplacer CPUs to 64 unless specifying --pplacer_cpus to prevent pplacer from hanging.

  • (#262) Added --write_single_copy_genes to the identify command. Writes unaligned single-copy AR53/BAC120 marker genes to disk.

  • When running -version, GTDB-Tk now warns if it is not the most up-to-date version (disable via GTDBTK_VER_CHECK = False in config.py). If the version check encounters an error, GTDB-Tk will silently continue (3 second timeout).

  • (#276) Renamed the column aa_percent to msa_percent in summary.tsv (produced by classify).

  • (#286) Fixed a file not found error when the reference data is a symbolic link (thanks davidealbanese!).

  • (#277) Fixed an issue where if the user overrides the translation table using the optional 3rd column in the batchfile, the other coding density would appear as -100. Both translation table densities are now reported.

  • The check_install command now also checks that all third party binaries can be found on the system path.

  • The align step is now approximately 10x faster.

  • (#289) Added --min_af to classify and classify_wf which allows the user to specify the minimum alignment fraction for FastANI.

  • Added the --mash_db argument to re-use the GTDB-Tk Mash reference database in ani_rep.
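
The pplacer CPU cap noted in the list above amounts to a small rule: an explicit --pplacer_cpus value is used as given, otherwise the general CPU count is limited to 64. The Python sketch below restates that rule. The function and constant names are hypothetical and do not come from the GTDB-Tk source.

```python
MAX_DEFAULT_PPLACER_CPUS = 64  # cap stated in the change note above


def resolve_pplacer_cpus(cpus, pplacer_cpus=None):
    """Return the number of CPUs to hand to pplacer.

    cpus          -- general CPU count requested for the run
    pplacer_cpus  -- explicit override, or None if not specified
    """
    if pplacer_cpus is not None:
        # An explicit value is respected, even above the cap.
        return pplacer_cpus
    return min(cpus, MAX_DEFAULT_PPLACER_CPUS)


capped = resolve_pplacer_cpus(128)
unchanged = resolve_pplacer_cpus(16)
explicit = resolve_pplacer_cpus(128, pplacer_cpus=96)
print(capped, unchanged, explicit)
```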

1.3.0

  • This version of GTDB-Tk requires a new version of the GTDB-Tk reference package (gtdbtk_r95_data.tar.gz) available here.

  • Updated reference package to use the GTDB Release 95 taxonomy.

  • Report whether the species-specific ANI circumscription criteria are satisfied in the ani_closest.tsv file output by ani_rep.

  • Estimated time until completion has been dampened.

1.2.0

  • (#241) Moved GTDB-Tk entry point to __main__.py instead of bin/gtdbtk to support execution in some HPC systems (gtdbtk will still be aliased on install).

  • (#251) Allow parsing of FastANI v1.0 output files. However, a warning will be displayed to update FastANI.

  • (#254) Fixed an issue where --scratch_dir would fail, and not clean-up the mmap file.

  • (#242) Added the decorate command allowing the de novo workflow to be run

  • (#244) Added the infer_rank method, which establishes the taxonomic ranks of internal nodes of user trees based on RED

  • (#248) If the identify command is run on the same directory, genomes which were already processed will be skipped.

  • (#248) Improved pplacer output when running the classify command.

1.1.0

  • In rare cases pplacer would assign an empty taxonomy string which would raise an error.

  • (#229) Genomes using Windows line endings (\r\n) would raise an error.

  • (#227) CentOS machines would fail when using ~ in paths.

  • The bac120 symlink was pointing to the archaeal tree when using the root command.

  • Updated the gtdb_to_ncbi_majority_vote.py script for translating taxonomy.

  • (#195) Added the --pplacer_cpus argument to specify the number of pplacer threads when running classify and classify_wf.

  • (#198) The --debug flag of align outputs aligned markers to disk before trimming.

  • (#225) An optional third column in the --batchfile will specify an override to which translation table should be used. Leave blank to automatically determine the translation table (default).

  • (#131) Users can now specify genomes which have NCBI accessions, as long as they are not GTDB-Tk representatives (a warning will be raised).

  • (#191) Added a new command ani_rep which calculates the ANI of input genomes to all GTDB representative genomes.

  • This command uses Mash in a pre-filtering step. If pre-filtering is enabled (default) then mash will need to be on the system path. To disable pre-filtering use the --no_mash flag.

  • (#230) Improved how markers are used in determining the correct domain, and gene selection for the alignment.
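
The optional third batchfile column mentioned above can be illustrated with a small parser. This is a sketch under assumptions: the first two columns are taken to be the FASTA path and the genome ID, the file is taken to be tab-separated, and the function name is hypothetical. It also tolerates Windows (\r\n) line endings.

```python
import io


def parse_batchfile(handle):
    """Return a list of (fasta_path, genome_id, translation_table_or_None)."""
    entries = []
    for line_no, raw in enumerate(handle, start=1):
        line = raw.rstrip("\r\n")  # accept both \n and \r\n line endings
        if not line.strip():
            continue
        cols = line.split("\t")
        if len(cols) < 2:
            raise ValueError(
                f"line {line_no}: expected at least 2 tab-separated columns"
            )
        path, genome_id = cols[0], cols[1]
        table = None  # None means: determine the translation table automatically
        if len(cols) >= 3 and cols[2].strip():
            table = int(cols[2])
        entries.append((path, genome_id, table))
    return entries


# Made-up batchfile: an override, a two-column row, and a blank third column.
example = io.StringIO(
    "/data/a.fna\tgenome_a\t11\r\n"
    "/data/b.fna\tgenome_b\n"
    "/data/c.fna\tgenome_c\t\n"
)
entries = parse_batchfile(example)
print(entries)
```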

1.0.2

  • Fixed an issue where FastANI threads would time out when FastANI returned a non-zero exit code.

  • Versions affected: 1.0.0, and 1.0.1.

1.0.0

  • Migrated to Python 3; you must be running Python 3.6 or later to use this version.

  • check_install now does an exhaustive check of the reference data.

  • Resolved an issue where gene calling would fail for low quality genomes (#192).

  • Improved FastANI multiprocessing performance.

  • Third party software versions are reported where possible.

0.3.3

  • A bug has been fixed which affected classify and classify_wf when using the --batchfile argument with genome IDs that differed from the FASTA filename. This issue resulted in the assigned taxonomy being derived only from tree placement without any ANI calculations being considered. Consequently, in some cases genomes may have been classified as a new species within a genus when they should have been assigned to an existing species. If you have genomes with species assignments this bug did not impact you.

  • Progress is now displayed for: hmmalign, and pplacer.

  • Fixed an issue where the root command could not be run independently.

  • Improved MSA masking performance.

0.3.2

  • FastANI calculations are more robust.

  • Optimisation of RED calculations.

  • Improved output messages when errors are encountered.

0.3.1

  • Pplacer taxonomy is now available in the summary file.

  • FastANI species assignment will be selected over phylogenetic placement (Topology case).

0.3.0

  • Best translation table displayed in summary file.

  • GTDB-Tk now supports gzipped genomes as inputs (--extension gz).

  • By default, GTDB-Tk uses precalculated RED values.

  • New option to recalculate RED value during classify step (--recalculate_red).

  • New option to export the untrimmed reference MSA files.

  • New option to skip_trimming during align step.

  • New option to use a custom taxonomy file when rooting a tree.

  • New FAQ page available.

  • New output structure.

0.2.1

  • Species classification is now based strictly on the ANI to reference genomes

  • The “classify” function now reports the closest reference genome in the summary file even if the ANI is <95%

  • The summary.tsv file has 4 new columns: aa_percent, red_values, fastani_reference_radius, and warnings

  • By default, the “align” function now performs the same MSA trimming used by the GTDB

  • New pplacer support for writing to a scratch file (--mmap-file option)

  • Random seed option for MSA trimming has been added to allow for reproducible results

  • Configuration of the data directory is now set using the environment variable GTDBTK_DATA_PATH (see pip installation)

  • Perl dependencies have been removed

  • Python libraries biolib, mpld3 and jinja have been removed

  • This version requires a new version of the GTDB-Tk data package (gtdbtk.r86_v2_data.tar.gz) available here
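
As a small illustration of the GTDBTK_DATA_PATH setting noted in the list above, the Python sketch below reads the variable and fails with a clear message when it is missing. The function name and error text are hypothetical, and the path in the example is invented.

```python
import os


def get_data_path(environ=os.environ):
    """Return the reference data directory named by GTDBTK_DATA_PATH."""
    path = environ.get("GTDBTK_DATA_PATH")
    if not path:
        raise RuntimeError("GTDBTK_DATA_PATH is not set")
    return path


# A plain dict stands in for the real environment in this example.
configured = get_data_path({"GTDBTK_DATA_PATH": "/refs/gtdbtk_data"})

try:
    get_data_path({})
    missing_detected = False
except RuntimeError:
    missing_detected = True

print(configured, missing_detected)
```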

0.1.3

  • GTDB-Tk v0.1.3 has been released and addresses an issue with species assignments based on the placement of genomes in the reference tree. This impacted species assignment when submitting multiple closely related genomes. Species assignments reported by ANI were not impacted.

0.1.0