de_novo_wf

For arguments and output files, see the documentation for each of the individual steps (identify, align, infer, root, and decorate).

The de novo workflow infers new bacterial and archaeal trees containing all user-supplied and GTDB-Tk reference genomes. The classify workflow is recommended for obtaining taxonomic classifications; this workflow is recommended only if de novo domain-specific trees are desired. The taxonomic assignments should be taken as a guide, not as final classifications. In particular, no effort is made to resolve the taxonomic assignment of lineages composed exclusively of user-submitted genomes.

This workflow consists of five steps: identify, align, infer, root, and decorate.

The identify and align steps are the same as in the classify workflow.

The infer step uses FastTree with the WAG+GAMMA models to calculate independent, de novo bacterial and archaeal trees. These trees can then be rooted using a user-specified outgroup and decorated with the GTDB taxonomy.

The de novo workflow can be run as follows:

gtdbtk de_novo_wf --genome_dir <my_genomes> --<marker_set> --outgroup_taxon <outgroup> --out_dir <output_dir>

This will process all genomes in <my_genomes> using the specified marker set and place the results in <output_dir>. Only genomes previously identified as bacterial should be included when using the bacterial marker set, and only genomes previously identified as archaeal when using the archaeal marker set. The tree will be rooted with the <outgroup> taxon (typically a phylum in the domain-specific tree), as required for correct decoration of the tree. In general, we suggest the resulting tree be treated as unrooted when interpreting results. As in the classify workflow, the location of genomes can also be specified using a batch file with the --batchfile flag.
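Because each run handles a single domain, a common pattern is to assemble one command per domain from pre-sorted genome directories. The sketch below only prints the command so it can be inspected before running; `de_novo_cmd` is a hypothetical helper (not part of GTDB-Tk), and it uses only the required flags shown above.

```shell
# Hypothetical helper: print a de_novo_wf command line for one domain.
# It does not run gtdbtk; copy the printed line or pipe it to sh once checked.
de_novo_cmd() {
    # $1 = domain (bacteria|archaea), $2 = genome directory,
    # $3 = outgroup taxon, $4 = output directory
    case "$1" in
        bacteria|archaea) ;;
        *) echo "domain must be 'bacteria' or 'archaea'" >&2; return 1 ;;
    esac
    printf 'gtdbtk de_novo_wf --genome_dir %s --%s --outgroup_taxon %s --out_dir %s\n' \
        "$2" "$1" "$3" "$4"
}
```

For example, `de_novo_cmd bacteria genomes/bacteria p__Chloroflexota de_novo_bac` prints the bacterial command; the directory names here are placeholders.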

The workflow supports several optional flags, including:

  • --cpus: maximum number of CPUs to use

  • --min_perc_aa: filter genomes with an insufficient percentage of AA in the MSA (default: 10)

  • --taxa_filter: filter genomes to taxa within specific taxonomic groups

  • --prot_model: protein substitution model for tree inference (JTT, WAG, or LG; default: WAG)

For other flags please consult the command line interface.

Arguments

usage: gtdbtk de_novo_wf (--genome_dir GENOME_DIR | --batchfile BATCHFILE)
                         (--bacteria | --archaea) --outgroup_taxon
                         OUTGROUP_TAXON --out_dir OUT_DIR [-x EXTENSION]
                         [--skip_gtdb_refs] [--taxa_filter TAXA_FILTER]
                         [--min_perc_aa MIN_PERC_AA] [--custom_msa_filters]
                         [--cols_per_gene COLS_PER_GENE]
                         [--min_consensus MIN_CONSENSUS]
                         [--max_consensus MAX_CONSENSUS]
                         [--min_perc_taxa MIN_PERC_TAXA] [--rnd_seed RND_SEED]
                         [--prot_model {JTT,WAG,LG}] [--no_support] [--gamma]
                         [--gtdbtk_classification_file GTDBTK_CLASSIFICATION_FILE]
                         [--custom_taxonomy_file CUSTOM_TAXONOMY_FILE]
                         [--write_single_copy_genes] [--prefix PREFIX]
                         [--genes] [--cpus CPUS] [--force] [--tmpdir TMPDIR]
                         [--keep_intermediates] [--debug] [-h]

mutually exclusive required arguments

--genome_dir

directory containing genome files in FASTA format

--batchfile

path to file describing genomes - tab separated in 2 or 3 columns (FASTA file, genome ID, translation table [optional])
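A two-column batchfile can be generated from a directory of FASTA files with a short shell function. This is a minimal sketch: `make_batchfile` is a hypothetical helper, and using the file name without its extension as the genome ID is an assumption made for illustration. The optional third column (translation table) is omitted.

```shell
# Hypothetical helper: emit "FASTA path <TAB> genome ID" for each matching file.
make_batchfile() {
    # $1 = directory holding the FASTA files, $2 = file extension without the dot
    for fasta in "$1"/*."$2"; do
        [ -e "$fasta" ] || continue   # the glob matched nothing
        # Genome ID = file name without its extension (an assumption)
        printf '%s\t%s\n' "$fasta" "$(basename "$fasta" ".$2")"
    done
}
```

Redirect the output to a file, for example `make_batchfile genomes fna > batchfile.tsv`, and pass that file with --batchfile.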

mutually exclusive required arguments

--bacteria

process bacterial genomes

--archaea

process archaeal genomes

required named arguments

--outgroup_taxon

taxon to use as outgroup (e.g., p__Patescibacteria or p__Altiarchaeota)

--out_dir

directory to output files

Named Arguments

-x, --extension

extension of files to process, gz = gzipped

Default: “fna”

--skip_gtdb_refs

do not include GTDB reference genomes in multiple sequence alignment.

--taxa_filter

filter GTDB genomes to taxa (comma separated) within specific taxonomic groups (e.g.: d__Bacteria or p__Proteobacteria,p__Actinobacteria)

--min_perc_aa

exclude genomes that do not have at least this percentage of AA in the MSA (inclusive bound)

Default: 10

--custom_msa_filters

perform custom filtering of MSA with cols_per_gene, min_consensus, max_consensus, and min_perc_taxa parameters instead of using canonical mask

--cols_per_gene

maximum number of columns to retain per gene when generating the MSA

Default: 42

--min_consensus

minimum percentage of the same amino acid required to retain column (inclusive bound)

Default: 25

--max_consensus

maximum percentage of the same amino acid required to retain column (exclusive bound)

Default: 95

--min_perc_taxa

minimum percentage of taxa required to retain column (inclusive bound)

Default: 50

--rnd_seed

random seed to use for selecting columns, e.g. 42

--prot_model

Possible choices: JTT, WAG, LG

protein substitution model for tree inference

Default: “WAG”

--no_support

do not compute local support values using the Shimodaira-Hasegawa test

--gamma

rescale branch lengths to optimize the Gamma20 likelihood

--gtdbtk_classification_file

file with GTDB-Tk classifications produced by the classify command

--custom_taxonomy_file

file indicating custom taxonomy strings for user genomes; it should include any user genomes belonging to the outgroup. Format: GENOME_ID<TAB>d__;p__;c__;o__;f__;g__;s__

--write_single_copy_genes

output unaligned single-copy marker genes

--prefix

prefix for all output files

Default: “gtdbtk”

--genes

indicates input files contain predicted proteins as amino acids (skip gene calling). Warning: This flag will skip the ANI comparison steps (ani_screen and classification).

--cpus

number of CPUs to use

Default: 1

--force

continue processing if an error occurs on a single genome

--tmpdir

specify alternative directory for temporary files

Default: “/tmp”

--keep_intermediates

keep intermediate files in the final directory

--debug

create intermediate files for debugging purposes

Example

Input

gtdbtk de_novo_wf --genome_dir genomes/ --outgroup_taxon p__Undinarchaeota --archaea --out_dir de_novo_wf --cpus 3

gtdbtk de_novo_wf --genome_dir genomes/ --outgroup_taxon p__Chloroflexota --bacteria --taxa_filter p__Firmicutes --out_dir de_novo_output

#Skip GTDB reference genomes (requires --custom_taxonomy_file for outgrouping)
gtdbtk de_novo_wf --genome_dir genomes/ --skip_gtdb_refs --outgroup_taxon p__Customphylum --bacteria --custom_taxonomy_file custom_taxonomy.tsv --out_dir de_novo_output

#Use a subset of GTDB reference genomes (p__Firmicutes) and outgroup on a custom phylum (p__Customphylum)
gtdbtk de_novo_wf --genome_dir genomes/ --taxa_filter p__Firmicutes --outgroup_taxon p__Customphylum --bacteria --custom_taxonomy_file custom_taxonomy.tsv --out_dir de_novo_output

Custom Taxonomy Format

The custom taxonomy file is a tab-delimited file. The first column lists the user genomes (i.e., the FASTA filename without the extension) and the second column lists the standardized 7-rank taxonomy.

#For genome_1.fna, genome_2.fna and genome_3.fna
genome_1    d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Salmonella;s__Salmonella enterica
genome_2    d__Bacteria;p__Actinobacteriota;c__Actinomycetia;o__Mycobacteriales;f__Mycobacteriaceae;g__Mycobacterium;s__Mycobacterium tuberculosis
genome_3    d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus;s__Streptococcus pyogenes
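Since the file must be tab-delimited with exactly seven ranks, it can help to check it before starting a long run. The following is a minimal sketch of such a check, not a GTDB-Tk feature: `check_taxonomy` is a hypothetical helper, and skipping blank lines and lines starting with # is a convenience of this sketch (whether GTDB-Tk itself accepts such lines is not stated here).

```shell
# Hypothetical helper: report malformed lines in a custom taxonomy file.
# Exit status is 0 when every checked line is well formed, 1 otherwise.
check_taxonomy() {
    # $1 = path to the taxonomy file
    awk -F '\t' '
        BEGIN { split("d__ p__ c__ o__ f__ g__ s__", prefix, " "); bad = 0 }
        /^#/ || NF == 0 { next }   # skip comment and blank lines (sketch-only convention)
        NF != 2 {
            printf "line %d: expected 2 tab-separated columns\n", NR
            bad = 1; next
        }
        {
            n = split($2, rank, ";")
            if (n != 7) {
                printf "line %d: expected 7 ranks, found %d\n", NR, n
                bad = 1; next
            }
            # Each rank must carry its prefix in order: d__, p__, ..., s__
            for (i = 1; i <= 7; i++)
                if (substr(rank[i], 1, 3) != prefix[i]) {
                    printf "line %d: rank %d should start with %s\n", NR, i, prefix[i]
                    bad = 1
                }
        }
        END { exit bad }
    ' "$1"
}
```

Running `check_taxonomy custom_taxonomy.tsv` prints one message per problem line; a file whose columns are separated by spaces instead of a tab is reported as having the wrong number of columns.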