classify

Determine taxonomic classification of genomes.

Arguments

usage: gtdbtk classify (--genome_dir GENOME_DIR | --batchfile BATCHFILE)
                       --align_dir ALIGN_DIR --out_dir OUT_DIR
                       (--skip_ani_screen | --mash_db MASH_DB) [--no_mash]
                       [--mash_k MASH_K] [--mash_s MASH_S] [--mash_v MASH_V]
                       [--mash_max_distance MASH_MAX_DISTANCE] [-x EXTENSION]
                       [--prefix PREFIX] [--cpus CPUS]
                       [--pplacer_cpus PPLACER_CPUS]
                       [--scratch_dir SCRATCH_DIR] [--genes] [-f]
                       [--min_af MIN_AF] [--tmpdir TMPDIR] [--debug] [-h]

mutually exclusive required arguments

--genome_dir

directory containing genome files in FASTA format

--batchfile

path to file describing genomes - tab separated in 2 or 3 columns (FASTA file, genome ID, translation table [optional])
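As an illustration of the batchfile layout described above, here is a minimal Python sketch that writes one from a directory of FASTA files. The helper name `write_batchfile` and the choice to derive each genome ID from the file name are assumptions made for this example; they are not part of GTDB-Tk.

```python
import csv
from pathlib import Path


def write_batchfile(genome_dir, batchfile, extension="fna", translation_table=None):
    """Write a tab-separated batchfile: FASTA path, genome ID, optional translation table.

    Hypothetical helper: the genome ID is taken from the file name, which is
    an assumption of this sketch rather than a GTDB-Tk requirement.
    """
    rows = []
    for fasta in sorted(Path(genome_dir).glob(f"*.{extension}")):
        genome_id = fasta.name[: -len(extension) - 1]  # strip ".<extension>"
        row = [str(fasta), genome_id]
        if translation_table is not None:
            row.append(str(translation_table))  # optional third column
        rows.append(row)
    with open(batchfile, "w", newline="") as handle:
        csv.writer(handle, delimiter="\t", lineterminator="\n").writerows(rows)
    return rows
```

Leaving `translation_table` unset produces the two-column form; setting it adds the optional third column to every row.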

required named arguments

--align_dir

output directory of ‘align’ command

--out_dir

directory to output files

mutually exclusive required arguments

--skip_ani_screen

Skip the ani_screen step, which classifies genomes using Mash and skani before tree placement.

--mash_db

path to the Mash reference sketch database (.msh); the file is read if it already exists, otherwise it is created and saved at this path
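The two mutually exclusive groups (`--genome_dir` or `--batchfile`, and `--skip_ani_screen` or `--mash_db`) can be checked before the command is launched. The sketch below is one possible way to assemble the argument list in Python; the function `build_classify_command` is hypothetical and only uses flags listed in the usage above.

```python
def build_classify_command(align_dir, out_dir, genome_dir=None, batchfile=None,
                           mash_db=None, skip_ani_screen=False, cpus=1):
    """Return an argument list for `gtdbtk classify` (hypothetical helper).

    Enforces that exactly one option is chosen from each mutually
    exclusive required group before building the command.
    """
    if (genome_dir is None) == (batchfile is None):
        raise ValueError("provide exactly one of genome_dir or batchfile")
    if skip_ani_screen == (mash_db is not None):
        raise ValueError("provide exactly one of skip_ani_screen or mash_db")

    cmd = ["gtdbtk", "classify",
           "--align_dir", str(align_dir),
           "--out_dir", str(out_dir)]
    if genome_dir is not None:
        cmd += ["--genome_dir", str(genome_dir)]
    else:
        cmd += ["--batchfile", str(batchfile)]
    if skip_ani_screen:
        cmd += ["--skip_ani_screen"]
    else:
        cmd += ["--mash_db", str(mash_db)]
    cmd += ["--cpus", str(cpus)]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where GTDB-Tk is installed.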

optional Mash arguments

--no_mash

skip pre-filtering of genomes using Mash

--mash_k

k-mer size [1-32]

Default: 16

--mash_s

maximum number of non-redundant hashes

Default: 5000

--mash_v

maximum p-value to keep [0-1]

Default: 1.0

--mash_max_distance

Maximum Mash distance to select a potential GTDB genome as representative of a user genome.

Default: 0.15
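To illustrate how a maximum-distance cutoff of this kind behaves, the following sketch filters a list of Mash hits. It is not GTDB-Tk's implementation: the function name, the `(reference_id, distance)` tuple layout, and treating the threshold as inclusive are all assumptions for the example.

```python
def candidate_representatives(mash_hits, max_distance=0.15):
    """Keep (reference_id, distance) pairs within max_distance, closest first.

    Hypothetical sketch: the inclusive comparison (<=) is an assumption.
    """
    kept = [(ref, dist) for ref, dist in mash_hits if dist <= max_distance]
    return sorted(kept, key=lambda hit: hit[1])
```

Since Mash distance roughly approximates 1 minus ANI, the default of 0.15 corresponds to keeping references at about 85% ANI or higher as candidates for the subsequent skani comparison.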

Named Arguments

-x, --extension

extension of files to process, gz = gzipped

Default: “fna”

--prefix

prefix for all output files

Default: “gtdbtk”

--cpus

number of CPUs to use

Default: 1

--pplacer_cpus

number of CPUs to use during pplacer placement

--scratch_dir

reduce pplacer memory usage by writing to disk (slower).

--genes

indicates input files contain predicted proteins as amino acids (skip gene calling). Warning: This flag will skip the ANI comparison steps (ani_screen and classification).

-f, --full_tree

use the unsplit bacterial tree for the classify step; this is the original GTDB-Tk approach (version < 2) and requires more than 320 GB of RAM to load the reference tree

--min_af

minimum alignment fraction to assign genome to a species cluster

Default: 0.5

--tmpdir

specify alternative directory for temporary files

Default: “/tmp”

--debug

create intermediate files for debugging purposes

Example

Input

gtdbtk classify --align_dir 500_align --out_dir 500_classify --mash_db mash_db.msh --cpus 90 --batchfile genomes/500_batchfile.tsv

Output

[2024-03-27 15:46:26] INFO: GTDB-Tk v2.3.2
[2024-03-27 15:46:26] INFO: gtdbtk classify --align_dir 500_align --out_dir 500_classify --mash_db mash_db.msh --cpus 90 --batchfile genomes/500_batchfile.tsv
[2024-03-27 15:46:26] INFO: Using GTDB-Tk reference data version r214: /srv/db/gtdbtk/official/release214_skani/release214
[2024-03-27 15:46:27] WARNING: Setting pplacer CPUs to 64, as pplacer is known to hang if >64 are used. You can override this using: --pplacer_cpus
[2024-03-27 15:46:27] INFO: Loading reference genomes.
[2024-03-27 15:46:27] INFO: Using Mash version 2.2.2
[2024-03-27 15:46:27] INFO: Creating Mash sketch file: 500_classify/classify/ani_screen/intermediate_results/mash/gtdbtk.user_query_sketch.msh
[2024-03-27 15:46:29] INFO: Completed 500 genomes in 1.74 seconds (286.68 genomes/second).
[2024-03-27 15:46:29] INFO: Loading data from existing Mash sketch file: mash_db.msh
[2024-03-27 15:46:32] INFO: Calculating Mash distances.
[2024-03-27 15:47:17] INFO: Calculating ANI with skani v0.2.1.
[2024-03-27 15:47:28] INFO: Completed 4,383 comparisons in 10.51 seconds (417.14 comparisons/second).
[2024-03-27 15:47:30] INFO: 357 genome(s) have been classified using the ANI pre-screening step.
[2024-03-27 15:47:30] TASK: Placing 143 bacterial genomes into backbone reference tree with pplacer using 64 CPUs (be patient).
[2024-03-27 15:47:30] INFO: pplacer version: v1.1.alpha19-0-g807f6f3
[2024-03-27 15:50:19] INFO: Calculating RED values based on reference tree.
[2024-03-27 15:50:20] INFO: 143 out of 143 have an class assignments. Those genomes will be reclassified.
[2024-03-27 15:50:20] TASK: Placing 30 bacterial genomes into class-level reference tree 3 (1/8) with pplacer using 64 CPUs (be patient).
[2024-03-27 15:58:09] INFO: Calculating RED values based on reference tree.
[2024-03-27 15:58:12] TASK: Traversing tree to determine classification method.
[2024-03-27 15:58:12] INFO: Completed 30 genomes in 0.02 seconds (1,364.18 genomes/second).
[2024-03-27 15:58:15] TASK: Calculating average nucleotide identity using skani (v0.2.1).
[2024-03-27 15:58:16] INFO: Completed 505 comparisons in 1.32 seconds (381.38 comparisons/second).
[2024-03-27 15:58:19] INFO: 0 genome(s) have been classified using skani and pplacer.
[2024-03-27 15:58:19] TASK: Placing 27 bacterial genomes into class-level reference tree 2 (2/8) with pplacer using 64 CPUs (be patient).
[2024-03-27 16:05:53] INFO: Calculating RED values based on reference tree.
[2024-03-27 16:05:56] TASK: Traversing tree to determine classification method.
[2024-03-27 16:05:56] INFO: Completed 27 genomes in 0.04 seconds (606.99 genomes/second).
[2024-03-27 16:05:59] TASK: Calculating average nucleotide identity using skani (v0.2.1).
[2024-03-27 16:06:00] INFO: Completed 317 comparisons in 1.07 seconds (297.29 comparisons/second).
[2024-03-27 16:06:03] INFO: 0 genome(s) have been classified using skani and pplacer.
[2024-03-27 16:06:03] TASK: Placing 26 bacterial genomes into class-level reference tree 6 (3/8) with pplacer using 64 CPUs (be patient).
[2024-03-27 16:12:28] INFO: Calculating RED values based on reference tree.
[2024-03-27 16:12:30] TASK: Traversing tree to determine classification method.
[2024-03-27 16:12:30] INFO: Completed 26 genomes in 0.05 seconds (497.16 genomes/second).
[2024-03-27 16:12:31] TASK: Calculating average nucleotide identity using skani (v0.2.1).
[2024-03-27 16:12:32] INFO: Completed 41 comparisons in 0.83 seconds (49.26 comparisons/second).
[2024-03-27 16:12:33] INFO: 0 genome(s) have been classified using skani and pplacer.
[2024-03-27 16:12:34] TASK: Placing 22 bacterial genomes into class-level reference tree 7 (4/8) with pplacer using 64 CPUs (be patient).
[2024-03-27 16:18:24] INFO: Calculating RED values based on reference tree.
[2024-03-27 16:18:27] TASK: Traversing tree to determine classification method.
[2024-03-27 16:18:27] INFO: Completed 22 genomes in 0.03 seconds (715.55 genomes/second).
[2024-03-27 16:18:28] TASK: Calculating average nucleotide identity using skani (v0.2.1).
[2024-03-27 16:18:28] INFO: Completed 117 comparisons in 0.84 seconds (138.63 comparisons/second).
[2024-03-27 16:18:30] INFO: 0 genome(s) have been classified using skani and pplacer.
[2024-03-27 16:18:30] TASK: Placing 22 bacterial genomes into class-level reference tree 1 (5/8) with pplacer using 64 CPUs (be patient).
[2024-03-27 16:26:01] INFO: Calculating RED values based on reference tree.
[2024-03-27 16:26:04] TASK: Traversing tree to determine classification method.
[2024-03-27 16:26:04] INFO: Completed 22 genomes in 0.05 seconds (486.20 genomes/second).
[2024-03-27 16:26:05] TASK: Calculating average nucleotide identity using skani (v0.2.1).
[2024-03-27 16:26:06] INFO: Completed 373 comparisons in 1.05 seconds (354.70 comparisons/second).
[2024-03-27 16:26:08] INFO: 0 genome(s) have been classified using skani and pplacer.
[2024-03-27 16:26:08] TASK: Placing 8 bacterial genomes into class-level reference tree 4 (6/8) with pplacer using 64 CPUs (be patient).
[2024-03-27 16:32:45] INFO: Calculating RED values based on reference tree.
[2024-03-27 16:32:48] TASK: Traversing tree to determine classification method.
[2024-03-27 16:32:48] INFO: Completed 8 genomes in 0.15 seconds (52.49 genomes/second).
[2024-03-27 16:32:48] TASK: Calculating average nucleotide identity using skani (v0.2.1).
[2024-03-27 16:32:49] INFO: Completed 176 comparisons in 0.90 seconds (195.38 comparisons/second).
[2024-03-27 16:32:50] INFO: 0 genome(s) have been classified using skani and pplacer.
[2024-03-27 16:32:50] TASK: Placing 4 bacterial genomes into class-level reference tree 8 (7/8) with pplacer using 64 CPUs (be patient).
[2024-03-27 16:35:30] INFO: Calculating RED values based on reference tree.
[2024-03-27 16:35:31] TASK: Traversing tree to determine classification method.
[2024-03-27 16:35:31] INFO: Completed 4 genomes in 0.00 seconds (5,959.93 genomes/second).
[2024-03-27 16:35:32] TASK: Calculating average nucleotide identity using skani (v0.2.1).
[2024-03-27 16:35:33] INFO: Completed 31 comparisons in 0.93 seconds (33.24 comparisons/second).
[2024-03-27 16:35:33] INFO: 0 genome(s) have been classified using skani and pplacer.
[2024-03-27 16:35:33] TASK: Placing 4 bacterial genomes into class-level reference tree 5 (8/8) with pplacer using 64 CPUs (be patient).
[2024-03-27 16:40:57] INFO: Calculating RED values based on reference tree.
[2024-03-27 16:40:59] TASK: Traversing tree to determine classification method.
[2024-03-27 16:40:59] INFO: Completed 4 genomes in 0.00 seconds (4,607.86 genomes/second).
[2024-03-27 16:40:59] TASK: Calculating average nucleotide identity using skani (v0.2.1).
[2024-03-27 16:41:00] INFO: Completed 46 comparisons in 0.86 seconds (53.34 comparisons/second).
[2024-03-27 16:41:00] INFO: 0 genome(s) have been classified using skani and pplacer.
[2024-03-27 16:41:00] WARNING: 5 of 500 genomes have a warning (see summary file).
[2024-03-27 16:41:00] INFO: Note that Tk classification mode is insufficient for publication of new taxonomic designations. New designations should be based on one or more de novo trees, an example of which can be produced by Tk in de novo mode.
[2024-03-27 16:41:00] INFO: Done.
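
A log like the one above can be summarized programmatically. The following Python sketch counts the genomes classified by the ANI pre-screening step, the genomes sent on to the backbone tree, and the WARNING lines, and computes the elapsed wall-clock time. The function `summarize_log` is hypothetical, and its regular expressions are modeled only on the message wording in this sample log, which may differ between GTDB-Tk versions.

```python
import re
from datetime import datetime

# Patterns modeled on the sample log above (assumed, not a stable format).
LINE_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] (\w+): (.*)$")
SCREEN_RE = re.compile(
    r"^([\d,]+) genome\(s\) have been classified using the ANI pre-screening step")
BACKBONE_RE = re.compile(
    r"^Placing ([\d,]+) \w+ genomes into backbone reference tree")


def summarize_log(lines):
    """Summarize a GTDB-Tk classify log given as an iterable of lines."""
    summary = {"ani_screen": 0, "backbone": 0, "warnings": 0}
    first = last = None
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # ignore lines that are not timestamped log entries
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        first = first or stamp
        last = stamp
        level, message = match.group(2), match.group(3)
        if level == "WARNING":
            summary["warnings"] += 1
        screened = SCREEN_RE.match(message)
        if screened:
            summary["ani_screen"] += int(screened.group(1).replace(",", ""))
        placed = BACKBONE_RE.match(message)
        if placed:
            summary["backbone"] += int(placed.group(1).replace(",", ""))
    summary["elapsed_seconds"] = (last - first).total_seconds() if first else 0.0
    return summary
```

For the run shown, such a summary would report 357 genomes resolved by ANI pre-screening and 143 passed to pplacer, which together account for the 500 input genomes.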