classify

Determine taxonomic classification of genomes.

Arguments

usage: gtdbtk classify (--genome_dir GENOME_DIR | --batchfile BATCHFILE)
                       --align_dir ALIGN_DIR --out_dir OUT_DIR
                       (--skip_ani_screen | --mash_db MASH_DB) [--no_mash]
                       [--mash_k MASH_K] [--mash_s MASH_S] [--mash_v MASH_V]
                       [--mash_max_distance MASH_MAX_DISTANCE] [-x EXTENSION]
                       [--prefix PREFIX] [--cpus CPUS]
                       [--pplacer_cpus PPLACER_CPUS]
                       [--scratch_dir SCRATCH_DIR] [--genes] [-f]
                       [--min_af MIN_AF] [--tmpdir TMPDIR] [--debug] [-h]

mutually exclusive required arguments

--genome_dir

directory containing genome files in FASTA format

--batchfile

path to a file describing genomes: tab-separated, with 2 or 3 columns (FASTA file, genome ID, translation table [optional])
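As an illustration of the batchfile layout described above, here is a minimal Python sketch that writes one tab-separated row per genome. The function name `write_batchfile` is hypothetical, and deriving the genome ID from the file name is an assumption made for this example; GTDB-Tk does not require that convention.

```python
import csv
from pathlib import Path


def write_batchfile(genome_dir, batchfile, extension="fna", translation_table=None):
    """Write rows of: FASTA path, genome ID, and optionally a translation table."""
    rows = []
    for fasta in sorted(Path(genome_dir).glob(f"*.{extension}")):
        # Assumption for this sketch: genome ID = file name without the extension.
        genome_id = fasta.name[: -len(extension) - 1]
        row = [str(fasta), genome_id]
        if translation_table is not None:
            # Optional third column.
            row.append(str(translation_table))
        rows.append(row)
    with open(batchfile, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)
    return rows
```

Calling `write_batchfile("genomes/", "batchfile.tsv")` would produce a 2-column file; passing `translation_table=11` adds the third column to every row.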

required named arguments

--align_dir

output directory of ‘align’ command

--out_dir

directory to output files

mutually exclusive required arguments

--skip_ani_screen

Skip the ani_screen step, which classifies genomes using Mash and FastANI

--mash_db

path to the Mash reference sketch database (.msh); it is read if it exists and saved there otherwise
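The two mutually exclusive groups above can be checked before GTDB-Tk is launched. The following is a minimal sketch, not part of GTDB-Tk: `build_classify_cmd` is a hypothetical helper that assembles an argument list using only flags shown in the usage line and rejects invalid combinations.

```python
def build_classify_cmd(align_dir, out_dir, genome_dir=None, batchfile=None,
                       mash_db=None, skip_ani_screen=False, cpus=1):
    """Return an argument list for `gtdbtk classify` (hypothetical helper)."""
    # Exactly one of --genome_dir / --batchfile must be given.
    if (genome_dir is None) == (batchfile is None):
        raise ValueError("give exactly one of genome_dir or batchfile")
    # Exactly one of --skip_ani_screen / --mash_db must be given.
    if skip_ani_screen == (mash_db is not None):
        raise ValueError("give exactly one of skip_ani_screen or mash_db")

    cmd = ["gtdbtk", "classify",
           "--align_dir", str(align_dir),
           "--out_dir", str(out_dir)]
    if genome_dir is not None:
        cmd += ["--genome_dir", str(genome_dir)]
    else:
        cmd += ["--batchfile", str(batchfile)]
    if skip_ani_screen:
        cmd.append("--skip_ani_screen")
    else:
        cmd += ["--mash_db", str(mash_db)]
    cmd += ["--cpus", str(cpus)]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; building a list rather than a shell string avoids quoting problems with paths that contain spaces.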

optional Mash arguments

--no_mash

skip pre-filtering of genomes using Mash

--mash_k

k-mer size [1-32]

Default: 16

--mash_s

maximum number of non-redundant hashes

Default: 5000

--mash_v

maximum p-value to keep [0-1]

Default: 1.0

--mash_max_distance

Maximum Mash distance to select a potential GTDB genome as representative of a user genome.

Default: 0.15

Named Arguments

-x, --extension

extension of files to process, gz = gzipped

Default: “fna”

--prefix

prefix for all output files

Default: “gtdbtk”

--cpus

number of CPUs to use

Default: 1

--pplacer_cpus

number of CPUs to use during pplacer placement

--scratch_dir

reduce pplacer memory usage by writing to disk (slower).

--genes

indicates input files contain called genes (skip gene calling). Warning: This flag will also skip the ANI comparison steps (ani_screen and classification).

-f, --full_tree

use the unsplit bacterial tree for the classify step; this is the original GTDB-Tk approach (version < 2) and requires more than 320 GB of RAM to load the reference tree

--min_af

minimum alignment fraction to assign genome to a species cluster

Default: 0.5

--tmpdir

specify alternative directory for temporary files

Default: “/tmp”

--debug

create intermediate files for debugging purposes

Example

Input

gtdbtk classify --align_dir align_3lines/ --batchfile 3lines_batchfile.tsv --out_dir 3classify_ani --mash_db mash_db_dir/ --cpus 20

Output

[2023-02-15 08:37:11] INFO: GTDB-Tk v2.2.2
[2023-02-15 08:37:11] INFO: gtdbtk classify --align_dir align_3lines/ --batchfile 3lines_batchfile.tsv --out_dir 3classify_ani --mash_db mash_db_dir/ --cpus 20
[2023-02-15 08:37:11] INFO: Using GTDB-Tk reference data version r207: /srv/projects/gtdbtk/test_new_features/release207_v2/
[2023-02-15 08:37:12] INFO: Loading reference genomes.
[2023-02-15 08:37:13] INFO: Using Mash version 2.2.2
[2023-02-15 08:37:13] INFO: Loading data from existing Mash sketch file: 3classify_ani/classify/ani_screen/intermediate_results/mash/gtdbtk.user_query_sketch.msh
[2023-02-15 08:37:13] INFO: Loading data from existing Mash sketch file: mash_db_dir/gtdb_ref_sketch.msh
[2023-02-15 08:37:16] INFO: Calculating Mash distances.
[2023-02-15 08:37:20] INFO: Calculating ANI with FastANI v1.3.
[2023-02-15 08:37:21] INFO: Completed 12 comparisons in 0.62 seconds (19.21 comparisons/second).
[2023-02-15 08:37:21] INFO: 1 genome(s) have been classified using the ANI pre-screening step.
[2023-02-15 08:37:21] TASK: Placing 2 bacterial genomes into backbone reference tree with pplacer using 20 CPUs (be patient).
[2023-02-15 08:37:21] INFO: pplacer version: v1.1.alpha19-0-g807f6f3
[2023-02-15 08:39:24] INFO: Calculating RED values based on reference tree.
[2023-02-15 08:39:25] INFO: 2 out of 2 have an class assignments. Those genomes will be reclassified.
[2023-02-15 08:39:25] TASK: Placing 1 bacterial genomes into class-level reference tree 6 (1/2) with pplacer using 20 CPUs (be patient).
[2023-02-15 08:43:39] INFO: Calculating RED values based on reference tree.
[2023-02-15 08:43:42] TASK: Traversing tree to determine classification method.
[2023-02-15 08:43:42] INFO: Completed 1 genome in 0.00 seconds (2,451.38 genomes/second).
[2023-02-15 08:43:42] TASK: Calculating average nucleotide identity using FastANI (v1.3).
[2023-02-15 08:43:43] INFO: Completed 34 comparisons in 0.90 seconds (37.77 comparisons/second).
[2023-02-15 08:43:43] INFO: 0 genome(s) have been classified using FastANI and pplacer.
[2023-02-15 08:43:43] TASK: Placing 1 bacterial genomes into class-level reference tree 5 (2/2) with pplacer using 20 CPUs (be patient).
[2023-02-15 08:46:38] INFO: Calculating RED values based on reference tree.
[2023-02-15 08:46:40] TASK: Traversing tree to determine classification method.
[2023-02-15 08:46:40] INFO: Completed 1 genome in 0.05 seconds (20.80 genomes/second).
[2023-02-15 08:46:40] INFO: 0 genome(s) have been classified using FastANI and pplacer.
[2023-02-15 08:46:41] WARNING: 1 of 3 genome has a warning (see summary file).
[2023-02-15 08:46:41] INFO: Note that Tk classification mode is insufficient for publication of new taxonomic designations. New designations should be based on one or more de novo trees, an example of which can be produced by Tk in de novo mode.
[2023-02-15 08:46:41] INFO: Done.
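
To summarise a run such as the one above, the log lines can be parsed with a short Python sketch. The line format `[timestamp] LEVEL: message` is inferred from this example output only and is an assumption, not a documented interface; `parse_log` and `elapsed_seconds` are hypothetical helper names.

```python
import re
from datetime import datetime

# Pattern inferred from the example output: "[YYYY-MM-DD HH:MM:SS] LEVEL: message"
LOG_LINE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] (\w+): (.*)$")


def parse_log(text):
    """Return a list of (datetime, level, message) tuples for matching lines."""
    entries = []
    for line in text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match:
            stamp, level, message = match.groups()
            when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
            entries.append((when, level, message))
    return entries


def elapsed_seconds(entries):
    """Seconds between the first and last parsed log entries."""
    return (entries[-1][0] - entries[0][0]).total_seconds()
```

Filtering the parsed entries on `level == "WARNING"` gives a quick way to spot runs whose summary file should be inspected.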