de_novo_wf¶
For arguments and output files, see each of the individual steps: identify, align, infer, root, and decorate.
The de novo workflow infers new bacterial and archaeal trees containing all user-supplied and GTDB-Tk reference genomes. The classify workflow is recommended for obtaining taxonomic classifications; this workflow is only recommended if de novo domain-specific trees are desired.
This workflow consists of five steps: identify, align, infer, root, and decorate.
The identify and align steps are the same as in the classify workflow.
The infer step uses FastTree with the WAG+GAMMA models to calculate independent, de novo bacterial and archaeal trees.
These trees can then be rooted using a user-specified outgroup and decorated with the GTDB taxonomy.
The de novo workflow can be run as follows:
gtdbtk de_novo_wf --genome_dir <my_genomes> --<marker_set> --outgroup_taxon <outgroup> --out_dir <output_dir>
This will process all genomes in <my_genomes> using the specified marker set and place the results in <output_dir>.
Only genomes previously identified as being bacterial (archaeal) should be included when using the bacterial (archaeal) marker set.
The tree will be rooted with the <outgroup> taxon (typically a phylum in the domain-specific tree) as required for
correct decoration of the tree. In general, we suggest the resulting tree be treated as unrooted when interpreting results.
As in the classify workflow, the location of genomes can also be specified using a batch file with the --batchfile flag.
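A batch file can be generated from a directory of FASTA files. The sketch below is a minimal example, assuming the genomes end in .fna and that the filename without the extension is an acceptable genome ID. The make_batchfile helper and the file names are hypothetical and not part of GTDB-Tk. It writes the two-column form (FASTA file, genome ID) of the tab-separated batch file format.

```shell
# Hypothetical helper (not part of GTDB-Tk): write one "FASTA path<TAB>genome ID"
# line for every .fna file found in a directory.
make_batchfile() {
    genome_dir="$1"
    batchfile="$2"
    : > "$batchfile"                       # create or truncate the output file
    for fasta in "$genome_dir"/*.fna; do
        [ -e "$fasta" ] || continue        # the glob matched nothing
        printf '%s\t%s\n' "$fasta" "$(basename "$fasta" .fna)" >> "$batchfile"
    done
}

# Usage (paths and outgroup are placeholders):
#   make_batchfile genomes batchfile.tsv
#   gtdbtk de_novo_wf --batchfile batchfile.tsv --bacteria \
#       --outgroup_taxon p__Chloroflexota --out_dir de_novo_output
```

A third column with the translation table can be appended per genome where needed; it is optional.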
The workflow supports several optional flags, including:
cpus: maximum number of CPUs to use
min_perc_aa: filter genomes with an insufficient percentage of AA in the MSA (default: 10)
taxa_filter: filter genomes to taxa within specific taxonomic groups
prot_model: protein substitution model for tree inference (JTT, WAG, or LG; default: WAG)
For other flags please consult the command line interface.
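To make the min_perc_aa filter concrete, the value can be read as the percentage of alignment columns in which a genome has an amino acid rather than a gap; genomes below the threshold are excluded, and the bound is inclusive. The following is a sketch of that calculation under stated assumptions, not GTDB-Tk's implementation: the function names are hypothetical, and treating `-` and `.` as the only gap characters is an assumption.

```shell
# Hypothetical helper: percentage of positions in an aligned sequence that
# hold an amino acid rather than a gap ("-" or "." are assumed gap symbols).
perc_aa() {
    printf '%s\n' "$1" | awk '{
        total = length($0)
        if (total == 0) { print "0.0"; exit }
        gaps = gsub(/[-.]/, "")            # gsub returns the number of gaps removed
        printf "%.1f\n", 100 * (total - gaps) / total
    }'
}

# Hypothetical helper: succeed when the sequence meets the threshold
# (inclusive bound, mirroring the description of min_perc_aa).
passes_min_perc_aa() {
    awk -v p="$(perc_aa "$1")" -v t="$2" 'BEGIN { exit !(p + 0 >= t + 0) }'
}

# Usage:
#   perc_aa 'MK--LV--'
#   passes_min_perc_aa 'MK--LV--' 50 && echo "genome retained"
```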
Arguments¶
usage: gtdbtk de_novo_wf (--genome_dir GENOME_DIR | --batchfile BATCHFILE)
(--bacteria | --archaea) --outgroup_taxon
OUTGROUP_TAXON --out_dir OUT_DIR [-x EXTENSION]
[--skip_gtdb_refs] [--taxa_filter TAXA_FILTER]
[--min_perc_aa MIN_PERC_AA] [--custom_msa_filters]
[--cols_per_gene COLS_PER_GENE]
[--min_consensus MIN_CONSENSUS]
[--max_consensus MAX_CONSENSUS]
[--min_perc_taxa MIN_PERC_TAXA] [--rnd_seed RND_SEED]
[--prot_model {JTT,WAG,LG}] [--no_support] [--gamma]
[--gtdbtk_classification_file GTDBTK_CLASSIFICATION_FILE]
[--custom_taxonomy_file CUSTOM_TAXONOMY_FILE]
[--write_single_copy_genes] [--prefix PREFIX]
[--genes] [--cpus CPUS] [--force] [--tmpdir TMPDIR]
[--keep_intermediates] [--debug] [-h]
mutually exclusive required arguments¶
- --genome_dir
directory containing genome files in FASTA format
- --batchfile
path to file describing genomes - tab separated in 2 or 3 columns (FASTA file, genome ID, translation table [optional])
mutually exclusive required arguments¶
- --bacteria
process bacterial genomes
- --archaea
process archaeal genomes
required named arguments¶
- --outgroup_taxon
taxon to use as outgroup (e.g., p__Patescibacteria or p__Altiarchaeota)
- --out_dir
directory to output files
Named Arguments¶
- -x, --extension
extension of files to process, gz = gzipped
Default: “fna”
- --skip_gtdb_refs
do not include GTDB reference genomes in multiple sequence alignment.
- --taxa_filter
filter GTDB genomes to taxa (comma separated) within specific taxonomic groups (e.g.: d__Bacteria or p__Proteobacteria,p__Actinobacteria)
- --min_perc_aa
exclude genomes that do not have at least this percentage of AA in the MSA (inclusive bound)
Default: 10
- --custom_msa_filters
perform custom filtering of MSA with cols_per_gene, min_consensus, max_consensus, and min_perc_taxa parameters instead of using canonical mask
- --cols_per_gene
maximum number of columns to retain per gene when generating the MSA
Default: 42
- --min_consensus
minimum percentage of the same amino acid required to retain column (inclusive bound)
Default: 25
- --max_consensus
maximum percentage of the same amino acid required to retain column (exclusive bound)
Default: 95
- --min_perc_taxa
minimum percentage of taxa required to retain column (inclusive bound)
Default: 50
- --rnd_seed
random seed to use for selecting columns, e.g. 42
- --prot_model
Possible choices: JTT, WAG, LG
protein substitution model for tree inference
Default: “WAG”
- --no_support
do not compute local support values using the Shimodaira-Hasegawa test
- --gamma
rescale branch lengths to optimize the Gamma20 likelihood
- --gtdbtk_classification_file
file with GTDB-Tk classifications produced by the classify command
- --custom_taxonomy_file
file indicating custom taxonomy strings for user genomes; it should contain any genomes belonging to the outgroup. Format: GENOME_ID<TAB>d__;p__;c__;o__;f__;g__;s__
- --write_single_copy_genes
output unaligned single-copy marker genes
- --prefix
prefix for all output files
Default: “gtdbtk”
- --genes
indicates input files contain predicted proteins as amino acids (skip gene calling). Warning: This flag will skip the ANI comparison steps (ani_screen and classification).
- --cpus
number of CPUs to use
Default: 1
- --force
continue processing if an error occurs on a single genome
- --tmpdir
specify alternative directory for temporary files
Default: “/tmp”
- --keep_intermediates
keep intermediate files in the final directory
- --debug
create intermediate files for debugging purposes
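The column thresholds used with --custom_msa_filters can be illustrated with a small sketch. It is one reading of the descriptions of min_consensus, max_consensus, and min_perc_taxa above, not GTDB-Tk's actual code. The keep_column helper is hypothetical, `-` and `.` are assumed to be gap symbols, and computing the consensus over non-gap taxa only is an assumption. The selection of up to cols_per_gene columns per gene using rnd_seed is not modelled.

```shell
# Hypothetical helper: decide whether one alignment column is retained.
# COLUMN is a string with one character per taxon; the thresholds default
# to the documented values (25, 95, 50).
keep_column() {  # keep_column COLUMN [MIN_CONSENSUS MAX_CONSENSUS MIN_PERC_TAXA]
    printf '%s\n' "$1" |
    awk -v min_c="${2:-25}" -v max_c="${3:-95}" -v min_t="${4:-50}" '{
        total = length($0); present = 0; top = 0
        for (i = 1; i <= total; i++) {
            aa = substr($0, i, 1)
            if (aa == "-" || aa == ".") continue      # assumed gap symbols
            present++
            if (++count[aa] > top) top = count[aa]    # most common amino acid
        }
        if (present == 0) exit 1                      # empty or all-gap column
        taxa = 100 * present / total                  # taxa with a residue (%)
        cons = 100 * top / present                    # assumed denominator: non-gap taxa
        # inclusive lower bounds, exclusive upper bound on consensus
        exit !(taxa >= min_t && cons >= min_c && cons < max_c)
    }'
}

# Usage:
#   keep_column 'AAAL' && echo "column retained"
#   keep_column 'AAAA' || echo "column dropped (fully conserved)"
```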
Example¶
Input¶
gtdbtk de_novo_wf --genome_dir genomes/ --outgroup_taxon p__Undinarchaeota --archaea --out_dir de_novo_wf --cpus 3
gtdbtk de_novo_wf --genome_dir genomes/ --outgroup_taxon p__Chloroflexota --bacteria --taxa_filter p__Firmicutes --out_dir de_novo_output
# Skip GTDB reference genomes (requires --custom_taxonomy_file for outgrouping)
gtdbtk de_novo_wf --genome_dir genomes/ --outgroup_taxon p__Customphylum --bacteria --skip_gtdb_refs --custom_taxonomy_file custom_taxonomy.tsv --out_dir de_novo_output
# Use a subset of GTDB reference genomes (p__Firmicutes) and outgroup on a custom phylum (p__Customphylum)
gtdbtk de_novo_wf --genome_dir genomes/ --taxa_filter p__Firmicutes --outgroup_taxon p__Customphylum --bacteria --custom_taxonomy_file custom_taxonomy.tsv --out_dir de_novo_output
Custom Taxonomy Format¶
The custom taxonomy file is a tab-delimited file in which the first column lists user genomes (i.e., the FASTA filename without the extension) and the second column lists the standardized 7-rank taxonomy.
#For genome_1.fna, genome_2.fna and genome_3.fna
genome_1 d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Salmonella;s__Salmonella enterica
genome_2 d__Bacteria;p__Actinobacteriota;c__Actinomycetia;o__Mycobacteriales;f__Mycobacteriaceae;g__Mycobacterium;s__Mycobacterium tuberculosis
genome_3 d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus;s__Streptococcus pyogenes
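Because a misplaced space or a missing rank is easy to overlook, it can help to check the file before starting a long run. The sketch below is a hypothetical pre-flight check, not part of GTDB-Tk. It assumes exactly two tab-separated columns and seven semicolon-separated ranks with the prefixes d__, p__, c__, o__, f__, g__, s__ in that order. Skipping lines that start with # is a convenience of the sketch; whether GTDB-Tk itself accepts such lines is not stated here.

```shell
# Hypothetical helper: report malformed lines in a custom taxonomy file and
# return a non-zero status if any are found.
check_taxonomy() {
    awk -F'\t' '
        BEGIN { split("d__ p__ c__ o__ f__ g__ s__", prefix, " ") }
        /^#/ || NF == 0 { next }           # skip comment and blank lines
        NF != 2 {
            printf "line %d: expected 2 tab-separated columns, found %d\n", NR, NF
            bad = 1; next
        }
        {
            n = split($2, rank, ";")
            if (n != 7) {
                printf "line %d: expected 7 ranks, found %d\n", NR, n
                bad = 1; next
            }
            for (i = 1; i <= 7; i++)
                if (index(rank[i], prefix[i]) != 1) {
                    printf "line %d: rank %d should start with %s\n", NR, i, prefix[i]
                    bad = 1
                }
        }
        END { exit bad + 0 }
    ' "$1"
}

# Usage:
#   check_taxonomy custom_taxonomy.tsv && echo "taxonomy file looks well formed"
```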