classify
Determine taxonomic classification of genomes.
Arguments
usage: gtdbtk classify (--genome_dir GENOME_DIR | --batchfile BATCHFILE)
--align_dir ALIGN_DIR --out_dir OUT_DIR
(--skip_ani_screen | --mash_db MASH_DB) [--no_mash]
[--mash_k MASH_K] [--mash_s MASH_S] [--mash_v MASH_V]
[--mash_max_distance MASH_MAX_DISTANCE] [-x EXTENSION]
[--prefix PREFIX] [--cpus CPUS]
[--pplacer_cpus PPLACER_CPUS]
[--scratch_dir SCRATCH_DIR] [--genes] [-f]
[--min_af MIN_AF] [--tmpdir TMPDIR] [--debug] [-h]
mutually exclusive required arguments
- --genome_dir
directory containing genome files in FASTA format
- --batchfile
path to file describing genomes - tab separated in 2 or 3 columns (FASTA file, genome ID, translation table [optional])
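A batchfile in the layout described above can be generated from a directory of FASTA files. The following is a minimal sketch, not part of GTDB-Tk: the helper name `write_batchfile` is hypothetical, and using the file name without its extension as the genome ID is an assumption made for illustration.

```python
import csv
from pathlib import Path


def write_batchfile(genome_dir, out_path, extension="fna", translation_table=None):
    """Write a tab-separated batchfile: FASTA file, genome ID, [translation table]."""
    rows = []
    for fasta in sorted(Path(genome_dir).glob(f"*.{extension}")):
        # Assumption: the genome ID is the file name without its extension.
        genome_id = fasta.name[: -len(extension) - 1]
        row = [str(fasta), genome_id]
        if translation_table is not None:
            # Optional third column.
            row.append(str(translation_table))
        rows.append(row)
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)
    return rows
```

The resulting file can then be passed to `--batchfile` instead of pointing `--genome_dir` at the directory.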
required named arguments
- --align_dir
output directory of ‘align’ command
- --out_dir
directory to output files
mutually exclusive required arguments
- --skip_ani_screen
Skip the ani_screen step, which pre-classifies genomes using Mash and skani.
- --mash_db
path to save/read (if exists) the Mash reference sketch database (.msh)
optional Mash arguments
- --no_mash
skip pre-filtering of genomes using Mash
- --mash_k
k-mer size [1-32]
Default: 16
- --mash_s
maximum number of non-redundant hashes
Default: 5000
- --mash_v
maximum p-value to keep [0-1]
Default: 1.0
- --mash_max_distance
Maximum Mash distance to select a potential GTDB genome as representative of a user genome.
Default: 0.15
Named Arguments
- -x, --extension
extension of files to process, gz = gzipped
Default: “fna”
- --prefix
prefix for all output files
Default: “gtdbtk”
- --cpus
number of CPUs to use
Default: 1
- --pplacer_cpus
number of CPUs to use during pplacer placement
- --scratch_dir
reduce pplacer memory usage by writing to disk (slower).
- --genes
indicates input files contain predicted proteins as amino acids (skip gene calling).
Warning: This flag will skip the ANI comparison steps (ani_screen and classification).
- -f, --full_tree
use the unsplit bacterial tree for the classify step; this is the original GTDB-Tk approach (version < 2) and requires more than 320 GB of RAM to load the reference tree
- --min_af
minimum alignment fraction to assign genome to a species cluster
Default: 0.5
- --tmpdir
specify alternative directory for temporary files
Default: “/tmp”
- --debug
create intermediate files for debugging purposes
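The two mutually exclusive required groups and the documented value ranges can be checked before launching a long run. The sketch below assembles an argument list using only flags listed above; the function name `build_classify_command` and its keyword arguments are hypothetical, and it covers only a subset of the options.

```python
import shlex


def build_classify_command(align_dir, out_dir, genome_dir=None, batchfile=None,
                           mash_db=None, skip_ani_screen=False, cpus=1,
                           mash_k=None, mash_v=None):
    """Assemble a `gtdbtk classify` argument list after basic sanity checks."""
    # Exactly one of --genome_dir / --batchfile must be given.
    if (genome_dir is None) == (batchfile is None):
        raise ValueError("provide exactly one of genome_dir or batchfile")
    # Exactly one of --skip_ani_screen / --mash_db must be given.
    if skip_ani_screen == (mash_db is not None):
        raise ValueError("provide exactly one of skip_ani_screen or mash_db")
    # Ranges as documented: k-mer size [1-32], p-value [0-1].
    if mash_k is not None and not 1 <= mash_k <= 32:
        raise ValueError("mash_k must be in [1, 32]")
    if mash_v is not None and not 0 <= mash_v <= 1:
        raise ValueError("mash_v must be in [0, 1]")

    cmd = ["gtdbtk", "classify"]
    if genome_dir is not None:
        cmd += ["--genome_dir", genome_dir]
    else:
        cmd += ["--batchfile", batchfile]
    cmd += ["--align_dir", align_dir, "--out_dir", out_dir]
    if skip_ani_screen:
        cmd.append("--skip_ani_screen")
    else:
        cmd += ["--mash_db", mash_db]
    if mash_k is not None:
        cmd += ["--mash_k", str(mash_k)]
    if mash_v is not None:
        cmd += ["--mash_v", str(mash_v)]
    cmd += ["--cpus", str(cpus)]
    return cmd


if __name__ == "__main__":
    # Print the command as a shell-quoted string for inspection.
    print(shlex.join(build_classify_command(
        "align_dir", "out_dir", genome_dir="genomes", skip_ani_screen=True)))
```

The returned list can be handed to `subprocess.run(cmd, check=True)`; building a list rather than a single string avoids shell-quoting problems with paths that contain spaces.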
Files output
- classify
- intermediate_results
- ani_screen
- intermediate_results
Example
Input
gtdbtk classify --align_dir align_3lines/ --batchfile 3lines_batchfile.tsv --out_dir 3classify_ani --mash_db mash_db_dir/ --cpus 20
Output
[2024-03-27 15:46:26] INFO: GTDB-Tk v2.3.2
[2024-03-27 15:46:26] INFO: gtdbtk classify --align_dir 500_align --out_dir 500_classify --mash_db mash_db.msh --cpus 90 --batchfile genomes/500_batchfile.tsv
[2024-03-27 15:46:26] INFO: Using GTDB-Tk reference data version r214: /srv/db/gtdbtk/official/release214_skani/release214
[2024-03-27 15:46:27] WARNING: Setting pplacer CPUs to 64, as pplacer is known to hang if >64 are used. You can override this using: --pplacer_cpus
[2024-03-27 15:46:27] INFO: Loading reference genomes.
[2024-03-27 15:46:27] INFO: Using Mash version 2.2.2
[2024-03-27 15:46:27] INFO: Creating Mash sketch file: 500_classify/classify/ani_screen/intermediate_results/mash/gtdbtk.user_query_sketch.msh
[2024-03-27 15:46:29] INFO: Completed 500 genomes in 1.74 seconds (286.68 genomes/second).
[2024-03-27 15:46:29] INFO: Loading data from existing Mash sketch file: mash_db.msh
[2024-03-27 15:46:32] INFO: Calculating Mash distances.
[2024-03-27 15:47:17] INFO: Calculating ANI with skani v0.2.1.
[2024-03-27 15:47:28] INFO: Completed 4,383 comparisons in 10.51 seconds (417.14 comparisons/second).
[2024-03-27 15:47:30] INFO: 357 genome(s) have been classified using the ANI pre-screening step.
[2024-03-27 15:47:30] TASK: Placing 143 bacterial genomes into backbone reference tree with pplacer using 64 CPUs (be patient).
[2024-03-27 15:47:30] INFO: pplacer version: v1.1.alpha19-0-g807f6f3
[2024-03-27 15:50:19] INFO: Calculating RED values based on reference tree.
[2024-03-27 15:50:20] INFO: 143 out of 143 have an class assignments. Those genomes will be reclassified.
[2024-03-27 15:50:20] TASK: Placing 30 bacterial genomes into class-level reference tree 3 (1/8) with pplacer using 64 CPUs (be patient).
[2024-03-27 15:58:09] INFO: Calculating RED values based on reference tree.
[2024-03-27 15:58:12] TASK: Traversing tree to determine classification method.
[2024-03-27 15:58:12] INFO: Completed 30 genomes in 0.02 seconds (1,364.18 genomes/second).
[2024-03-27 15:58:15] TASK: Calculating average nucleotide identity using skani (v0.2.1).
[2024-03-27 15:58:16] INFO: Completed 505 comparisons in 1.32 seconds (381.38 comparisons/second).
[2024-03-27 15:58:19] INFO: 0 genome(s) have been classified using skani and pplacer.
[2024-03-27 15:58:19] TASK: Placing 27 bacterial genomes into class-level reference tree 2 (2/8) with pplacer using 64 CPUs (be patient).
[2024-03-27 16:05:53] INFO: Calculating RED values based on reference tree.
[2024-03-27 16:05:56] TASK: Traversing tree to determine classification method.
[2024-03-27 16:05:56] INFO: Completed 27 genomes in 0.04 seconds (606.99 genomes/second).
[2024-03-27 16:05:59] TASK: Calculating average nucleotide identity using skani (v0.2.1).
[2024-03-27 16:06:00] INFO: Completed 317 comparisons in 1.07 seconds (297.29 comparisons/second).
[2024-03-27 16:06:03] INFO: 0 genome(s) have been classified using skani and pplacer.
[2024-03-27 16:06:03] TASK: Placing 26 bacterial genomes into class-level reference tree 6 (3/8) with pplacer using 64 CPUs (be patient).
[2024-03-27 16:12:28] INFO: Calculating RED values based on reference tree.
[2024-03-27 16:12:30] TASK: Traversing tree to determine classification method.
[2024-03-27 16:12:30] INFO: Completed 26 genomes in 0.05 seconds (497.16 genomes/second).
[2024-03-27 16:12:31] TASK: Calculating average nucleotide identity using skani (v0.2.1).
[2024-03-27 16:12:32] INFO: Completed 41 comparisons in 0.83 seconds (49.26 comparisons/second).
[2024-03-27 16:12:33] INFO: 0 genome(s) have been classified using skani and pplacer.
[2024-03-27 16:12:34] TASK: Placing 22 bacterial genomes into class-level reference tree 7 (4/8) with pplacer using 64 CPUs (be patient).
[2024-03-27 16:18:24] INFO: Calculating RED values based on reference tree.
[2024-03-27 16:18:27] TASK: Traversing tree to determine classification method.
[2024-03-27 16:18:27] INFO: Completed 22 genomes in 0.03 seconds (715.55 genomes/second).
[2024-03-27 16:18:28] TASK: Calculating average nucleotide identity using skani (v0.2.1).
[2024-03-27 16:18:28] INFO: Completed 117 comparisons in 0.84 seconds (138.63 comparisons/second).
[2024-03-27 16:18:30] INFO: 0 genome(s) have been classified using skani and pplacer.
[2024-03-27 16:18:30] TASK: Placing 22 bacterial genomes into class-level reference tree 1 (5/8) with pplacer using 64 CPUs (be patient).
[2024-03-27 16:26:01] INFO: Calculating RED values based on reference tree.
[2024-03-27 16:26:04] TASK: Traversing tree to determine classification method.
[2024-03-27 16:26:04] INFO: Completed 22 genomes in 0.05 seconds (486.20 genomes/second).
[2024-03-27 16:26:05] TASK: Calculating average nucleotide identity using skani (v0.2.1).
[2024-03-27 16:26:06] INFO: Completed 373 comparisons in 1.05 seconds (354.70 comparisons/second).
[2024-03-27 16:26:08] INFO: 0 genome(s) have been classified using skani and pplacer.
[2024-03-27 16:26:08] TASK: Placing 8 bacterial genomes into class-level reference tree 4 (6/8) with pplacer using 64 CPUs (be patient).
[2024-03-27 16:32:45] INFO: Calculating RED values based on reference tree.
[2024-03-27 16:32:48] TASK: Traversing tree to determine classification method.
[2024-03-27 16:32:48] INFO: Completed 8 genomes in 0.15 seconds (52.49 genomes/second).
[2024-03-27 16:32:48] TASK: Calculating average nucleotide identity using skani (v0.2.1).
[2024-03-27 16:32:49] INFO: Completed 176 comparisons in 0.90 seconds (195.38 comparisons/second).
[2024-03-27 16:32:50] INFO: 0 genome(s) have been classified using skani and pplacer.
[2024-03-27 16:32:50] TASK: Placing 4 bacterial genomes into class-level reference tree 8 (7/8) with pplacer using 64 CPUs (be patient).
[2024-03-27 16:35:30] INFO: Calculating RED values based on reference tree.
[2024-03-27 16:35:31] TASK: Traversing tree to determine classification method.
[2024-03-27 16:35:31] INFO: Completed 4 genomes in 0.00 seconds (5,959.93 genomes/second).
[2024-03-27 16:35:32] TASK: Calculating average nucleotide identity using skani (v0.2.1).
[2024-03-27 16:35:33] INFO: Completed 31 comparisons in 0.93 seconds (33.24 comparisons/second).
[2024-03-27 16:35:33] INFO: 0 genome(s) have been classified using skani and pplacer.
[2024-03-27 16:35:33] TASK: Placing 4 bacterial genomes into class-level reference tree 5 (8/8) with pplacer using 64 CPUs (be patient).
[2024-03-27 16:40:57] INFO: Calculating RED values based on reference tree.
[2024-03-27 16:40:59] TASK: Traversing tree to determine classification method.
[2024-03-27 16:40:59] INFO: Completed 4 genomes in 0.00 seconds (4,607.86 genomes/second).
[2024-03-27 16:40:59] TASK: Calculating average nucleotide identity using skani (v0.2.1).
[2024-03-27 16:41:00] INFO: Completed 46 comparisons in 0.86 seconds (53.34 comparisons/second).
[2024-03-27 16:41:00] INFO: 0 genome(s) have been classified using skani and pplacer.
[2024-03-27 16:41:00] WARNING: 5 of 500 genomes have a warning (see summary file).
[2024-03-27 16:41:00] INFO: Note that Tk classification mode is insufficient for publication of new taxonomic designations. New designations should be based on one or more de novo trees, an example of which can be produced by Tk in de novo mode.
[2024-03-27 16:41:00] INFO: Done.
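A log like the one above can be summarised to see how many genomes each step classified. The following sketch assumes the line layout shown in this example (`[timestamp] LEVEL: message`) and the wording of the "have been classified using" messages; neither is a documented, stable interface, and the function name `summarize_log` is hypothetical.

```python
import re

# Assumed layout, inferred from the example output: "[timestamp] LEVEL: message".
LOG_LINE = re.compile(r"^\[(?P<time>[^\]]+)\] (?P<level>[A-Z]+): (?P<message>.*)$")
CLASSIFIED = re.compile(
    r"^(?P<count>[\d,]+) genome\(s\) have been classified using (?P<method>.+)\.$"
)


def summarize_log(lines):
    """Sum classified-genome counts per method and collect WARNING messages."""
    classified = {}
    warnings = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # Not a log line in the assumed layout.
        message = match["message"]
        if match["level"] == "WARNING":
            warnings.append(message)
        hit = CLASSIFIED.match(message)
        if hit:
            # Counts may contain thousands separators, e.g. "4,383".
            count = int(hit["count"].replace(",", ""))
            method = hit["method"]
            classified[method] = classified.get(method, 0) + count
    return classified, warnings
```

Called on the lines of a saved log file, it returns a dictionary keyed by the method text taken from each message, plus the list of warning messages.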