identify¶
Identify marker genes in genome(s). GTDB-Tk uses the following heuristic to choose the translation table for Prodigal: use table 11 unless the coding density under table 4 is 5% higher than under table 11 and the coding density under table 4 is >70%. Distinguishing between tables 4 and 25 is challenging, so GTDB-Tk does not attempt it. If you know the correct translation table for your genomes, you can provide it to GTDB-Tk in the optional third column of the --batchfile.
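The heuristic above can be sketched in a few lines of Python. This is an illustration only, not GTDB-Tk's own code: the function name `choose_translation_table` is hypothetical, coding densities are taken as fractions between 0 and 1, and "5% higher" is assumed to mean an absolute difference of 0.05 (five percentage points).

```python
def choose_translation_table(density_t11: float, density_t4: float) -> int:
    """Return 4 or 11 following the heuristic described in the text.

    Both densities are fractions in [0, 1].
    Assumption: "5% higher" is read as an absolute difference of 0.05.
    """
    # Table 4 is chosen only when it clearly improves coding density
    # AND the resulting density is itself high (> 70%).
    if density_t4 - density_t11 > 0.05 and density_t4 > 0.70:
        return 4
    return 11


# Table 4 barely improves the density: stay with table 11.
print(choose_translation_table(0.88, 0.89))
# Table 4 gives a large improvement and a high density: use table 4.
print(choose_translation_table(0.55, 0.90))
# Table 4 improves the density, but it stays below 70%: stay with table 11.
print(choose_translation_table(0.50, 0.65))
```

Tables 4 and 25 are not separated by this rule, so a genome that needs table 25 would have to be flagged explicitly in the batchfile.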
Arguments¶
usage: gtdbtk identify (--genome_dir GENOME_DIR | --batchfile BATCHFILE)
--out_dir OUT_DIR [-x EXTENSION] [--prefix PREFIX]
[--genes] [--cpus CPUS] [--force]
[--write_single_copy_genes] [--tmpdir TMPDIR] [--debug]
[-h]
mutually exclusive required arguments¶
- --genome_dir
directory containing genome files in FASTA format
- --batchfile
path to file describing genomes - tab separated in 2 or 3 columns (FASTA file, genome ID, translation table [optional])
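As a rough sketch of the batchfile layout described above, the following Python snippet writes a tab-separated file with two or three columns per row. The helper name `write_batchfile`, the file paths, and the genome IDs are all hypothetical, and it is assumed here that rows with and without the translation table column may appear in the same file.

```python
import csv
import io


def write_batchfile(rows, handle):
    """Write rows of (fasta_path, genome_id[, translation_table]) as TSV."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for row in rows:
        # The documented format allows exactly 2 or 3 columns.
        if len(row) not in (2, 3):
            raise ValueError(f"expected 2 or 3 columns, got {len(row)}: {row!r}")
        writer.writerow(row)


# Hypothetical genomes; the second one sets translation table 4 explicitly.
rows = [
    ("genomes/genome_a.fna", "genome_a"),
    ("genomes/genome_b.fna", "genome_b", 4),
]

buf = io.StringIO()
write_batchfile(rows, buf)
batchfile_text = buf.getvalue()
print(batchfile_text, end="")
```

To produce a real file, pass an open file handle instead of the `io.StringIO` buffer and give the resulting path to --batchfile.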
required named arguments¶
- --out_dir
directory to output files
Named Arguments¶
- -x, --extension
extension of files to process, gz = gzipped
Default: “fna”
- --prefix
prefix for all output files
Default: “gtdbtk”
- --genes
indicates input files contain predicted proteins as amino acids (skip gene calling).
Warning: This flag will skip the ANI comparison steps (ani_screen and classification).
- --cpus
number of CPUs to use
Default: 1
- --force
continue processing if an error occurs on a single genome
- --write_single_copy_genes
output unaligned single-copy marker genes
- --tmpdir
specify alternative directory for temporary files
Default: “/tmp”
- --debug
create intermediate files for debugging purposes
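When `gtdbtk identify` is driven from a script, the arguments listed above can be assembled programmatically. The sketch below builds the argument list and enforces the rule that exactly one of --genome_dir and --batchfile is given. The function `build_identify_cmd` is a hypothetical helper, not part of GTDB-Tk, and it covers only a subset of the flags.

```python
import shlex


def build_identify_cmd(out_dir, genome_dir=None, batchfile=None,
                       extension=None, cpus=None):
    """Build an argument list for `gtdbtk identify` (subset of flags only)."""
    # --genome_dir and --batchfile are mutually exclusive, and one is required.
    if (genome_dir is None) == (batchfile is None):
        raise ValueError("provide exactly one of genome_dir or batchfile")

    cmd = ["gtdbtk", "identify"]
    if genome_dir is not None:
        cmd += ["--genome_dir", genome_dir]
    else:
        cmd += ["--batchfile", batchfile]
    cmd += ["--out_dir", out_dir]

    # Optional flags are added only when set, so the tool's defaults apply.
    if extension is not None:
        cmd += ["--extension", extension]
    if cpus is not None:
        cmd += ["--cpus", str(cpus)]
    return cmd


cmd = build_identify_cmd(
    "/tmp/gtdbtk/identify",
    genome_dir="/tmp/gtdbtk/genomes",
    extension="gz",
    cpus=2,
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where GTDB-Tk is installed.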
## Files output
- identify
intermediate_results/marker_genes/[genome_id]/
Example¶
Input¶
gtdbtk identify --genome_dir /tmp/gtdbtk/genomes --out_dir /tmp/gtdbtk/identify --extension gz --cpus 2
Output¶
[2022-04-11 11:48:59] INFO: GTDB-Tk v2.0.0
[2022-04-11 11:48:59] INFO: gtdbtk identify --genome_dir /tmp/gtdbtk/genomes --out_dir /tmp/gtdbtk/identify --extension gz --cpus 2
[2022-04-11 11:48:59] INFO: Using GTDB-Tk reference data version r207: /srv/db/gtdbtk/official/release207
[2022-04-11 11:48:59] INFO: Identifying markers in 2 genomes with 2 threads.
[2022-04-11 11:48:59] TASK: Running Prodigal V2.6.3 to identify genes.
[2022-04-11 11:49:10] INFO: Completed 2 genomes in 10.94 seconds (5.47 seconds/genome).
[2022-04-11 11:49:10] TASK: Identifying TIGRFAM protein families.
[2022-04-11 11:49:16] INFO: Completed 2 genomes in 5.78 seconds (2.89 seconds/genome).
[2022-04-11 11:49:16] TASK: Identifying Pfam protein families.
[2022-04-11 11:49:16] INFO: Completed 2 genomes in 0.42 seconds (4.81 genomes/second).
[2022-04-11 11:49:16] INFO: Annotations done using HMMER 3.1b2 (February 2015).
[2022-04-11 11:49:16] TASK: Summarising identified marker genes.
[2022-04-11 11:49:16] INFO: Completed 2 genomes in 0.05 seconds (40.91 genomes/second).
[2022-04-11 11:49:16] INFO: Done.
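The log above follows a regular `[timestamp] LEVEL: message` layout, which makes it straightforward to pull out per-stage timings. The following is a minimal sketch that assumes this layout stays the same across versions. The regular expressions and the `stage_timings` helper are illustrative and not part of GTDB-Tk.

```python
import re

# One log line: "[timestamp] LEVEL: message"
LOG_LINE = re.compile(r"^\[(?P<time>[^\]]+)\] (?P<level>[A-Z]+): (?P<message>.*)$")
# Completion message emitted after each stage.
COMPLETED = re.compile(r"Completed (\d+) genomes in ([\d.]+) seconds")


def stage_timings(log_text):
    """Pair each TASK message with the seconds reported by the next 'Completed' line."""
    timings = []
    current_task = None
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue
        if match["level"] == "TASK":
            current_task = match["message"]
        elif match["level"] == "INFO" and current_task is not None:
            done = COMPLETED.search(match["message"])
            if done:
                timings.append((current_task, float(done.group(2))))
                current_task = None
    return timings


# Four lines copied from the example output above.
sample = """\
[2022-04-11 11:48:59] TASK: Running Prodigal V2.6.3 to identify genes.
[2022-04-11 11:49:10] INFO: Completed 2 genomes in 10.94 seconds (5.47 seconds/genome).
[2022-04-11 11:49:10] TASK: Identifying TIGRFAM protein families.
[2022-04-11 11:49:16] INFO: Completed 2 genomes in 5.78 seconds (2.89 seconds/genome).
"""

timings = stage_timings(sample)
for task, seconds in timings:
    print(f"{seconds:6.2f} s  {task}")
```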