identify

Identify marker genes in genome(s). The translation table used by Prodigal is chosen with the following heuristic: table 11 is used unless the coding density under table 4 is 5% higher than under table 11 and the coding density under table 4 is >70%, in which case table 4 is used. Tables 4 and 25 are difficult to tell apart, so GTDB-Tk does not attempt to distinguish between them. If you know the correct translation table for your genomes, you can provide it to GTDB-Tk in the --batchfile.
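The heuristic can be sketched in a few lines of Python. This is an illustration only, not GTDB-Tk's own code: the function name `select_translation_table` is hypothetical, coding densities are assumed to be fractions between 0 and 1, and "5% higher" is assumed to mean a difference of more than 5 percentage points.

```python
def select_translation_table(density_table_11: float, density_table_4: float) -> int:
    """Return the translation table (11 or 4) suggested by the heuristic.

    Both arguments are coding densities expressed as fractions (0.0 to 1.0).
    """
    # Assumption: "5% higher" is read as a gain of more than 5 percentage points.
    gain = density_table_4 - density_table_11
    if gain > 0.05 and density_table_4 > 0.70:
        return 4
    # Table 11 is the default in every other case.
    return 11


# Small gain under table 4: keep table 11.
print(select_translation_table(0.88, 0.90))  # 11
# Large gain and high density under table 4: switch to table 4.
print(select_translation_table(0.60, 0.92))  # 4
# Large gain, but density under table 4 is not above 70%: keep table 11.
print(select_translation_table(0.50, 0.65))  # 11
```

Both conditions must hold before table 4 is preferred, so a genome with low coding density under either table stays on table 11.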

Arguments

usage: gtdbtk identify (--genome_dir GENOME_DIR | --batchfile BATCHFILE)
                       --out_dir OUT_DIR [-x EXTENSION] [--prefix PREFIX]
                       [--genes] [--cpus CPUS] [--force]
                       [--write_single_copy_genes] [--tmpdir TMPDIR] [--debug]
                       [-h]

mutually exclusive required arguments

--genome_dir

directory containing genome files in FASTA format

--batchfile

path to file describing genomes - tab separated in 2 or 3 columns (FASTA file, genome ID, translation table [optional])
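As a rough sketch of the batchfile layout described above, the following Python helper builds the tab-separated text with an optional third column. The helper name `format_batchfile` and the genome paths and IDs are hypothetical and used only for illustration.

```python
from typing import Iterable, Optional, Tuple

# (FASTA file, genome ID, translation table or None)
GenomeRow = Tuple[str, str, Optional[int]]


def format_batchfile(rows: Iterable[GenomeRow]) -> str:
    """Build batchfile text: one genome per line, 2 or 3 tab-separated columns."""
    lines = []
    for fasta_path, genome_id, table in rows:
        columns = [fasta_path, genome_id]
        # The translation table column is optional and omitted when unknown.
        if table is not None:
            columns.append(str(table))
        lines.append("\t".join(columns))
    return "\n".join(lines) + "\n"


# Hypothetical genomes: the second one has a known translation table.
text = format_batchfile([
    ("genomes/genome_a.fna", "genome_a", None),
    ("genomes/genome_b.fna", "genome_b", 4),
])
print(text, end="")
```

The resulting text can be saved to a file and passed with --batchfile in place of --genome_dir.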

required named arguments

--out_dir

directory to output files

Named Arguments

-x, --extension

extension of files to process, gz = gzipped

Default: “fna”

--prefix

prefix for all output files

Default: “gtdbtk”

--genes

indicates input files contain predicted proteins as amino acids (skip gene calling). Warning: This flag will skip the ANI comparison steps (ani_screen and classification).

--cpus

number of CPUs to use

Default: 1

--force

continue processing if an error occurs on a single genome

--write_single_copy_genes

output unaligned single-copy marker genes

--tmpdir

specify alternative directory for temporary files

Default: “/tmp”

--debug

create intermediate files for debugging purposes

## Files output

Example

Input

gtdbtk identify --genome_dir genomes/ --out_dir identify_output --cpus 3

Output

[2022-04-11 11:48:59] INFO: GTDB-Tk v2.0.0
[2022-04-11 11:48:59] INFO: gtdbtk identify --genome_dir /tmp/gtdbtk/genomes --out_dir /tmp/gtdbtk/identify --extension gz --cpus 2
[2022-04-11 11:48:59] INFO: Using GTDB-Tk reference data version r207: /srv/db/gtdbtk/official/release207
[2022-04-11 11:48:59] INFO: Identifying markers in 2 genomes with 2 threads.
[2022-04-11 11:48:59] TASK: Running Prodigal V2.6.3 to identify genes.
[2022-04-11 11:49:10] INFO: Completed 2 genomes in 10.94 seconds (5.47 seconds/genome).
[2022-04-11 11:49:10] TASK: Identifying TIGRFAM protein families.
[2022-04-11 11:49:16] INFO: Completed 2 genomes in 5.78 seconds (2.89 seconds/genome).
[2022-04-11 11:49:16] TASK: Identifying Pfam protein families.
[2022-04-11 11:49:16] INFO: Completed 2 genomes in 0.42 seconds (4.81 genomes/second).
[2022-04-11 11:49:16] INFO: Annotations done using HMMER 3.1b2 (February 2015).
[2022-04-11 11:49:16] TASK: Summarising identified marker genes.
[2022-04-11 11:49:16] INFO: Completed 2 genomes in 0.05 seconds (40.91 genomes/second).
[2022-04-11 11:49:16] INFO: Done.