For arguments and output files, see each of the individual steps: identify, align, infer, root, and decorate.

The de novo workflow infers new bacterial and archaeal trees containing all user-supplied and GTDB-Tk reference genomes. The classify workflow is recommended for obtaining taxonomic classifications; this workflow is recommended only if de novo domain-specific trees are desired.

This workflow consists of five steps: identify, align, infer, root, and decorate.

The identify and align steps are the same as in the classify workflow.

The infer step uses FastTree with the WAG+GAMMA models to calculate independent, de novo bacterial and archaeal trees. These trees can then be rooted using a user-specified outgroup and decorated with the GTDB taxonomy.

The de novo workflow can be run as follows:

gtdbtk de_novo_wf --genome_dir <my_genomes> --<marker_set> --outgroup_taxon <outgroup> --out_dir <output_dir>

This will process all genomes in <my_genomes> using the specified marker set and place the results in <output_dir>. Only genomes previously identified as being bacterial (archaeal) should be included when using the bacterial (archaeal) marker set. The tree will be rooted with the <outgroup> taxon (typically a phylum in the domain-specific tree), as required for correct decoration of the tree. In general, we suggest treating the resulting tree as unrooted when interpreting results. As in the classify workflow, the location of genomes can also be specified using a batch file with the --batchfile flag.
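A batch file can be assembled from a genome directory with a few lines of shell. The following is a minimal sketch rather than part of GTDB-Tk: the function name make_batchfile is hypothetical, it assumes the genome ID should be the file name without its extension, and it omits the optional third column (translation table).

```shell
#!/usr/bin/env bash
# make_batchfile GENOME_DIR [EXTENSION]
# Hypothetical helper, not part of GTDB-Tk.
# Prints one line per genome: FASTA path <TAB> genome ID.
# Assumption: the genome ID is the file name without its extension.
make_batchfile() {
  local genome_dir="${1%/}" ext="${2:-fna}" fasta
  for fasta in "$genome_dir"/*."$ext"; do
    [ -e "$fasta" ] || continue   # the glob matched nothing
    printf '%s\t%s\n' "$fasta" "$(basename "$fasta" ".$ext")"
  done
}
```

For example, `make_batchfile genomes/ > batchfile.tsv` writes a two-column file that can then be passed with --batchfile in place of --genome_dir.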

The workflow supports several optional flags, including:

  • cpus: maximum number of CPUs to use

  • min_perc_aa: filter genomes with an insufficient percentage of AA in the MSA (default: 10)

  • taxa_filter: filter genomes to taxa within specific taxonomic groups

  • prot_model: protein substitution model for tree inference (JTT, WAG, or LG; default: WAG)

For other flags please consult the command line interface.
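To illustrate what the min_perc_aa filter measures, the sketch below reports which sequences in a FASTA-formatted MSA fall under a given threshold. It is not GTDB-Tk's implementation: the function name low_coverage_genomes is hypothetical, and it assumes that "percentage of AA" means the share of alignment positions that are not gap characters (- or .). The threshold defaults to 10, the default listed in the argument reference, and is treated as an inclusive bound.

```shell
#!/usr/bin/env bash
# low_coverage_genomes [MIN_PERC_AA] < msa.faa
# Hypothetical helper, not part of GTDB-Tk.
# Prints "genome ID <TAB> percentage" for every sequence whose share of
# non-gap positions is below MIN_PERC_AA (a sequence exactly at the
# threshold is kept).
low_coverage_genomes() {
  local min_perc_aa="${1:-10}"
  awk -v min="$min_perc_aa" '
    function report() {
      if (id != "" && total > 0 && 100 * aa / total < min)
        printf "%s\t%.1f\n", id, 100 * aa / total
    }
    /^>/ { report(); id = substr($1, 2); aa = 0; total = 0; next }
    {
      total += length($0)
      tmp = $0
      gaps = gsub(/[-.]/, "", tmp)   # number of gap characters on this line
      aa += length($0) - gaps
    }
    END { report() }'
}
```

Running `low_coverage_genomes 50 < msa.faa` lists the genomes that a threshold of 50 would exclude under this reading; msa.faa is a placeholder file name.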


usage: gtdbtk de_novo_wf (--genome_dir GENOME_DIR | --batchfile BATCHFILE)
                         (--bacteria | --archaea) --outgroup_taxon
                         OUTGROUP_TAXON --out_dir OUT_DIR [-x EXTENSION]
                         [--skip_gtdb_refs] [--taxa_filter TAXA_FILTER]
                         [--min_perc_aa MIN_PERC_AA] [--custom_msa_filters]
                         [--cols_per_gene COLS_PER_GENE]
                         [--min_consensus MIN_CONSENSUS]
                         [--max_consensus MAX_CONSENSUS]
                         [--min_perc_taxa MIN_PERC_TAXA] [--rnd_seed RND_SEED]
                         [--prot_model {JTT,WAG,LG}] [--no_support] [--gamma]
                         [--gtdbtk_classification_file GTDBTK_CLASSIFICATION_FILE]
                         [--custom_taxonomy_file CUSTOM_TAXONOMY_FILE]
                         [--write_single_copy_genes] [--prefix PREFIX]
                         [--genes] [--cpus CPUS] [--force] [--tmpdir TMPDIR]
                         [--keep_intermediates] [--debug] [-h]

mutually exclusive required arguments

--genome_dir

directory containing genome files in FASTA format

--batchfile

path to file describing genomes - tab separated in 2 or 3 columns (FASTA file, genome ID, translation table [optional])

mutually exclusive required arguments

--bacteria

process bacterial genomes

--archaea

process archaeal genomes

required named arguments

--outgroup_taxon

taxon to use as outgroup (e.g., p__Patescibacteria or p__Altiarchaeota)

--out_dir

directory to output files

Named Arguments

-x, --extension

extension of files to process, gz = gzipped

Default: “fna”


--skip_gtdb_refs

do not include GTDB reference genomes in multiple sequence alignment

--taxa_filter

filter GTDB genomes to taxa (comma separated) within specific taxonomic groups (e.g.: d__Bacteria or p__Proteobacteria,p__Actinobacteria)

--min_perc_aa

exclude genomes that do not have at least this percentage of AA in the MSA (inclusive bound)

Default: 10


--custom_msa_filters

perform custom filtering of MSA with cols_per_gene, min_consensus, max_consensus, and min_perc_taxa parameters instead of using canonical mask

--cols_per_gene

maximum number of columns to retain per gene when generating the MSA

Default: 42

--min_consensus

minimum percentage of the same amino acid required to retain column (inclusive bound)

Default: 25

--max_consensus

maximum percentage of the same amino acid required to retain column (exclusive bound)

Default: 95

--min_perc_taxa

minimum percentage of taxa required to retain column (inclusive bound)

Default: 50

--rnd_seed

random seed to use for selecting columns, e.g. 42


--prot_model

Possible choices: JTT, WAG, LG

protein substitution model for tree inference

Default: “WAG”

--no_support

do not compute local support values using the Shimodaira-Hasegawa test

--gamma

rescale branch lengths to optimize the Gamma20 likelihood


--gtdbtk_classification_file

file with GTDB-Tk classifications produced by the classify command

--custom_taxonomy_file

file indicating custom taxonomy strings for user genomes; it should contain any genomes belonging to the outgroup. Format: GENOME_ID<TAB>d__;p__;c__;o__;f__;g__;s__

--write_single_copy_genes

output unaligned single-copy marker genes

--prefix

prefix for all output files

Default: “gtdbtk”


--genes

indicates input files contain called genes (skip gene calling). Warning: This flag will also skip the ANI comparison steps (ani_screen and classification).

--cpus

number of CPUs to use

Default: 1

--force

continue processing if an error occurs on a single genome

--tmpdir

specify alternative directory for temporary files

Default: “/tmp”

--keep_intermediates

keep intermediate files in the final directory

--debug

create intermediate files for debugging purposes
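To make the interplay of the min_consensus, max_consensus, and min_perc_taxa thresholds listed above concrete, the following is a minimal sketch and not GTDB-Tk's own code. The function name passing_columns and the input layout (one aligned sequence per line, no FASTA headers) are hypothetical. It assumes the consensus percentage is computed over the sequences that have a residue in the column. It does not model cols_per_gene or rnd_seed.

```shell
#!/usr/bin/env bash
# passing_columns [MIN_CONSENSUS] [MAX_CONSENSUS] [MIN_PERC_TAXA] < aligned.txt
# Hypothetical helper, not part of GTDB-Tk.
# Input: one aligned sequence per line, all of equal length; "-" and "."
# are treated as gaps. Output: 1-based indices of the columns that pass.
passing_columns() {
  local min_consensus="${1:-25}" max_consensus="${2:-95}" min_perc_taxa="${3:-50}"
  awk -v minc="$min_consensus" -v maxc="$max_consensus" -v mint="$min_perc_taxa" '
    {
      nseq++
      len = length($0)
      for (i = 1; i <= len; i++) {
        aa = substr($0, i, 1)
        if (aa != "-" && aa != ".") {
          present[i]++                     # sequences with a residue here
          count[i, aa]++
          if (count[i, aa] > best[i]) best[i] = count[i, aa]
        }
      }
    }
    END {
      for (i = 1; i <= len; i++) {
        if (present[i] == 0) continue      # column contains only gaps
        perc_taxa = 100 * present[i] / nseq
        consensus = 100 * best[i] / present[i]
        # inclusive lower bounds, exclusive upper bound
        if (perc_taxa >= mint && consensus >= minc && consensus < maxc) print i
      }
    }'
}
```

With the defaults, a fully conserved column is dropped because its consensus of 100 is not below 95, and a column present in fewer than half of the sequences is dropped by the taxa threshold.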



gtdbtk de_novo_wf --genome_dir genomes/ --outgroup_taxon p__Undinarchaeota --archaea --out_dir de_novo_wf --cpus 3

gtdbtk de_novo_wf --genome_dir genomes/ --outgroup_taxon p__Chloroflexota --bacteria --taxa_filter p__Firmicutes --out_dir de_novo_output

# Skip GTDB reference genomes (requires --custom_taxonomy_file for outgrouping)
gtdbtk de_novo_wf --genome_dir genomes/ --skip_gtdb_refs --outgroup_taxon p__Customphylum --bacteria --custom_taxonomy_file custom_taxonomy.tsv --out_dir de_novo_output

# Use a subset of GTDB reference genomes (p__Firmicutes) and outgroup on a custom phylum (p__Customphylum)
gtdbtk de_novo_wf --genome_dir genomes/ --taxa_filter p__Firmicutes --outgroup_taxon p__Customphylum --bacteria --custom_taxonomy_file custom_taxonomy.tsv --out_dir de_novo_output

Custom Taxonomy Format

The custom taxonomy file is a tab-delimited file with the first column listing user genomes (i.e., the FASTA filename without the extension) and the second column listing the standardized 7-rank taxonomy.

#For genome_1.fna, genome_2.fna and genome_3.fna
genome_1    d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Salmonella;s__Salmonella enterica
genome_2    d__Bacteria;p__Actinobacteriota;c__Actinomycetia;o__Mycobacteriales;f__Mycobacteriaceae;g__Mycobacterium;s__Mycobacterium tuberculosis
genome_3    d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus;s__Streptococcus pyogenes
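A custom taxonomy file can be checked before a run with a short script. The sketch below is not part of GTDB-Tk: the function name check_custom_taxonomy is hypothetical, and it only checks the layout described above (two tab-separated columns, seven ranks prefixed d__ through s__ in order). Lines starting with # are skipped here, mirroring the listing above; whether GTDB-Tk itself accepts such lines is not stated in this document. If an outgroup taxon is given, the script also counts the genomes assigned to it.

```shell
#!/usr/bin/env bash
# check_custom_taxonomy FILE [OUTGROUP_TAXON]
# Hypothetical helper, not part of GTDB-Tk.
# Returns non-zero if any line breaks the two-column, 7-rank layout.
check_custom_taxonomy() {
  local file="$1" outgroup="${2:-}"
  awk -F '\t' -v outgroup="$outgroup" '
    BEGIN { split("d__ p__ c__ o__ f__ g__ s__", prefix, " ") }
    /^#/ || /^[ \t]*$/ { next }            # skip comment and blank lines
    NF != 2 {
      printf "line %d: expected 2 tab-separated columns\n", NR; bad++; next
    }
    {
      n = split($2, rank, ";")
      if (n != 7) {
        printf "line %d: expected 7 ranks, found %d\n", NR, n; bad++; next
      }
      ok = 1; hit = 0
      for (i = 1; i <= 7; i++) {
        if (substr(rank[i], 1, 3) != prefix[i]) {
          printf "line %d: rank %d should start with %s\n", NR, i, prefix[i]
          ok = 0
        }
        if (outgroup != "" && rank[i] == outgroup) hit = 1
      }
      if (!ok) bad++
      else if (hit) in_outgroup++
    }
    END {
      if (outgroup != "") printf "genomes in %s: %d\n", outgroup, in_outgroup
      exit (bad > 0)
    }' "$file"
}
```

For example, `check_custom_taxonomy custom_taxonomy.tsv p__Customphylum` reports how many user genomes carry the outgroup taxon. A count of zero suggests the outgroup genomes are missing from the file.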