Create a multiple sequence alignment based on the AR53/BAC120 marker set.


usage: gtdbtk align --identify_dir IDENTIFY_DIR --out_dir OUT_DIR
                    [--skip_gtdb_refs] [--taxa_filter TAXA_FILTER]
                    [--min_perc_aa MIN_PERC_AA]
                    [--cols_per_gene COLS_PER_GENE]
                    [--min_consensus MIN_CONSENSUS]
                    [--max_consensus MAX_CONSENSUS]
                    [--min_perc_taxa MIN_PERC_TAXA] [--rnd_seed RND_SEED]
                    [--prefix PREFIX] [--cpus CPUS] [--tmpdir TMPDIR]
                    [--debug] [-h] [--custom_msa_filters | --skip_trimming]

required named arguments


--identify_dir
output directory of the ‘identify’ command


--out_dir
directory to output files

Named Arguments


--skip_gtdb_refs
do not include GTDB reference genomes in the multiple sequence alignment


--taxa_filter
filter GTDB genomes to taxa (comma separated) within specific taxonomic groups (e.g. d__Bacteria or p__Proteobacteria,p__Actinobacteria)
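To illustrate the comma-separated filter format, here is a minimal Python sketch. It is not GTDB-Tk's implementation: the helper names `parse_taxa_filter` and `passes_filter` are hypothetical, and the sketch assumes each reference genome carries a semicolon-delimited taxonomy string with rank prefixes such as `d__` and `p__`.

```python
def parse_taxa_filter(taxa_filter):
    """Split a comma-separated filter string into a set of taxon names."""
    return {t.strip() for t in taxa_filter.split(",") if t.strip()}


def passes_filter(taxonomy, taxa):
    """Return True if any rank in the taxonomy string is in the filter set.

    Assumption: taxonomy is semicolon-delimited, e.g.
    'd__Bacteria;p__Proteobacteria;c__Gammaproteobacteria'.
    """
    ranks = {r.strip() for r in taxonomy.split(";")}
    return not taxa.isdisjoint(ranks)


wanted = parse_taxa_filter("p__Proteobacteria,p__Actinobacteria")
print(passes_filter("d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria", wanted))
print(passes_filter("d__Archaea;p__Halobacteriota", wanted))
```

A genome is kept here if it matches any one of the listed taxa; whether GTDB-Tk matches in exactly this way is an assumption of the sketch.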


--min_perc_aa
exclude genomes that do not have at least this percentage of AA in the MSA (inclusive bound)

Default: 10
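The amino-acid percentage threshold can be pictured with a short Python sketch. This is an illustration only, assuming gaps are written as `-` or `.` and that the percentage is taken over all alignment columns; `percent_aa` and `keep_genome` are hypothetical names, not GTDB-Tk functions.

```python
def percent_aa(aligned_seq, gap_chars="-."):
    """Percentage of alignment positions holding an amino acid (not a gap)."""
    if not aligned_seq:
        return 0.0
    aa_count = sum(1 for c in aligned_seq if c not in gap_chars)
    return 100.0 * aa_count / len(aligned_seq)


def keep_genome(aligned_seq, min_perc_aa=10.0):
    """Inclusive bound: a genome at exactly min_perc_aa is kept."""
    return percent_aa(aligned_seq) >= min_perc_aa


print(keep_genome("MK--------"))       # 2 of 10 positions filled
print(keep_genome("M" + "-" * 10))     # 1 of 11 positions filled
```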


--cols_per_gene
maximum number of columns to retain per gene when generating the MSA

Default: 42


--min_consensus
minimum percentage of the same amino acid required to retain column (inclusive bound)

Default: 25


--max_consensus
maximum percentage of the same amino acid required to retain column (exclusive bound)

Default: 95


--min_perc_taxa
minimum percentage of taxa required to retain column (inclusive bound)

Default: 50
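The three column thresholds (minimum consensus, maximum consensus, minimum percentage of taxa) can be combined into a single retention test. The Python sketch below is one plausible reading of the option descriptions, not GTDB-Tk's code. It assumes that gaps are `-` or `.`, that the consensus percentage is computed over non-gap residues only, and that "exclusive bound" means a column at exactly the maximum is dropped. `keep_column` is a hypothetical helper.

```python
from collections import Counter

GAPS = {"-", "."}


def keep_column(column, min_consensus=25.0, max_consensus=95.0, min_perc_taxa=50.0):
    """Decide whether one alignment column (a string, one char per taxon) is retained."""
    residues = [c for c in column if c not in GAPS]
    if not column or not residues:
        return False

    # Percentage of taxa with an amino acid in this column (inclusive bound).
    perc_taxa = 100.0 * len(residues) / len(column)
    if perc_taxa < min_perc_taxa:
        return False

    # Percentage of the most common amino acid among non-gap residues (assumption).
    top_count = Counter(residues).most_common(1)[0][1]
    consensus = 100.0 * top_count / len(residues)

    # Inclusive lower bound, exclusive upper bound.
    return min_consensus <= consensus < max_consensus


print(keep_column("AAKL"))  # moderately conserved, fully occupied
print(keep_column("AAAA"))  # fully conserved
print(keep_column("A---"))  # mostly gaps
```

Under these assumptions, fully conserved columns and mostly empty columns are both discarded, which leaves columns that are well populated and moderately conserved.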


--rnd_seed
random seed to use for selecting columns, e.g. 42
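A fixed seed makes column selection reproducible between runs. The Python sketch below shows the general idea using the standard library's `random.Random`. It assumes uniform sampling of up to `cols_per_gene` columns from the candidates for one gene; the exact selection procedure used by GTDB-Tk is not described here, and `select_columns` is a hypothetical helper.

```python
import random


def select_columns(candidate_cols, cols_per_gene=42, rnd_seed=None):
    """Pick at most cols_per_gene column indices, reproducibly when a seed is given."""
    candidates = list(candidate_cols)
    if len(candidates) <= cols_per_gene:
        return sorted(candidates)
    rng = random.Random(rnd_seed)  # private generator, does not touch global state
    return sorted(rng.sample(candidates, cols_per_gene))


first = select_columns(range(100), cols_per_gene=42, rnd_seed=42)
second = select_columns(range(100), cols_per_gene=42, rnd_seed=42)
print(first == second)
```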


--prefix
prefix for all output files

Default: “gtdbtk”


--cpus
number of CPUs to use

Default: 1


--tmpdir
specify alternative directory for temporary files

Default: “/tmp”


--debug
create intermediate files for debugging purposes

mutually exclusive optional arguments


--custom_msa_filters
perform custom filtering of the MSA with the cols_per_gene, min_consensus, max_consensus, and min_perc_taxa parameters instead of using the canonical mask


--skip_trimming
skip the trimming step and return the full MSAs
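Mutually exclusive means that at most one of the two options may be given in a single invocation. As a rough illustration of how such a constraint is commonly expressed, the Python sketch below uses the standard library's `argparse.add_mutually_exclusive_group`. The parser shown is a stand-in with a made-up program name, not GTDB-Tk's actual argument parser.

```python
import argparse

# Stand-in parser (hypothetical); only the two exclusive flags are modelled.
parser = argparse.ArgumentParser(prog="align_sketch")
group = parser.add_mutually_exclusive_group()
group.add_argument("--custom_msa_filters", action="store_true")
group.add_argument("--skip_trimming", action="store_true")


def parse(argv):
    """Parse a list of arguments; argparse exits with an error if both flags are set."""
    return parser.parse_args(argv)


args = parse(["--skip_trimming"])
print(args.skip_trimming, args.custom_msa_filters)
```

Passing both flags to this stand-in parser makes `argparse` print a usage error and raise `SystemExit`.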



Example

gtdbtk align --identify_dir identify_output/ --out_dir align_output --cpus 3


Example output

[2022-04-11 11:59:14] INFO: GTDB-Tk v2.0.0
[2022-04-11 11:59:14] INFO: gtdbtk align --identify_dir /tmp/gtdbtk/identify --out_dir /tmp/gtdbtk/align --cpus 2
[2022-04-11 11:59:14] INFO: Using GTDB-Tk reference data version r207: /srv/db/gtdbtk/official/release207
[2022-04-11 11:59:15] INFO: Aligning markers in 3 genomes with 2 CPUs.
[2022-04-11 11:59:16] INFO: Processing 3 genomes identified as archaeal.
[2022-04-11 11:59:16] INFO: Read concatenated alignment for 3,412 GTDB genomes.
[2022-04-11 11:59:16] TASK: Generating concatenated alignment for each marker.
[2022-04-11 11:59:16] INFO: Completed 3 genomes in 0.01 seconds (139.73 genomes/second).
[2022-04-11 11:59:16] TASK: Aligning 52 identified markers using hmmalign 3.1b2 (February 2015).
[2022-04-11 11:59:17] INFO: Completed 52 markers in 0.86 seconds (60.66 markers/second).
[2022-04-11 11:59:17] TASK: Masking columns of archaeal multiple sequence alignment using canonical mask.
[2022-04-11 11:59:21] INFO: Completed 3,414 sequences in 4.19 seconds (815.22 sequences/second).
[2022-04-11 11:59:21] INFO: Masked archaeal alignment from 13,540 to 10,153 AAs.
[2022-04-11 11:59:21] INFO: 0 archaeal user genomes have amino acids in <10.0% of columns in filtered MSA.
[2022-04-11 11:59:21] INFO: Creating concatenated alignment for 3,414 archaeal GTDB and user genomes.
[2022-04-11 11:59:23] INFO: Creating concatenated alignment for 3 archaeal user genomes.
[2022-04-11 11:59:23] INFO: Done.
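
Log lines of this shape are easy to post-process. The Python sketch below splits a line into timestamp, level, and message, and pulls the alignment widths out of the "Masked ... alignment" message. The regular expressions are written against the sample log above and are an assumption about the format, not a documented interface; `parse_line` and `masked_widths` are hypothetical helpers.

```python
import re

# Pattern inferred from the sample log: "[YYYY-MM-DD HH:MM:SS] LEVEL: message"
LOG_LINE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] (?P<level>[A-Z]+): (?P<msg>.*)$"
)
MASKED = re.compile(r"Masked (\w+) alignment from ([\d,]+) to ([\d,]+) AAs\.")


def parse_line(line):
    """Return a dict with 'ts', 'level', and 'msg', or None if the line does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def masked_widths(msg):
    """Return (domain, width_before, width_after) from a masking message, else None."""
    match = MASKED.search(msg)
    if not match:
        return None
    before = int(match.group(2).replace(",", ""))
    after = int(match.group(3).replace(",", ""))
    return match.group(1), before, after


record = parse_line(
    "[2022-04-11 11:59:21] INFO: Masked archaeal alignment from 13,540 to 10,153 AAs."
)
print(record["level"], masked_widths(record["msg"]))
```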