align

Create a multiple sequence alignment based on the AR53/BAC120 marker set.

Arguments

usage: gtdbtk align --identify_dir IDENTIFY_DIR --out_dir OUT_DIR
                    [--skip_gtdb_refs] [--taxa_filter TAXA_FILTER]
                    [--min_perc_aa MIN_PERC_AA]
                    [--cols_per_gene COLS_PER_GENE]
                    [--min_consensus MIN_CONSENSUS]
                    [--max_consensus MAX_CONSENSUS]
                    [--min_perc_taxa MIN_PERC_TAXA] [--rnd_seed RND_SEED]
                    [--prefix PREFIX] [--cpus CPUS] [--tmpdir TMPDIR]
                    [--debug] [-h] [--custom_msa_filters | --skip_trimming]

required named arguments

--identify_dir: output directory of ‘identify’ command
--out_dir: directory to output files

Named Arguments

--skip_gtdb_refs

do not include GTDB reference genomes in multiple sequence alignment.

--taxa_filter

filter GTDB genomes to taxa (comma separated) within specific taxonomic groups (e.g.: d__Bacteria or p__Proteobacteria,p__Actinobacteria)

--min_perc_aa

exclude genomes that do not have at least this percentage of AA in the MSA (inclusive bound)

Default: 10
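As a rough illustration of the inclusive bound, the check can be sketched in Python as below. This is not GTDB-Tk's own code: the function names and the choice of `-` and `.` as gap characters are assumptions made for the example.

```python
# Hypothetical helper names; gap characters are an assumption for illustration.
GAP_CHARS = {"-", "."}


def percent_aa(aligned_seq: str) -> float:
    """Percentage of alignment columns that hold an amino acid rather than a gap."""
    if not aligned_seq:
        return 0.0
    aa_count = sum(1 for ch in aligned_seq if ch not in GAP_CHARS)
    return 100.0 * aa_count / len(aligned_seq)


def passes_min_perc_aa(aligned_seq: str, min_perc_aa: float = 10.0) -> bool:
    """Inclusive bound: a genome at exactly min_perc_aa is retained."""
    return percent_aa(aligned_seq) >= min_perc_aa
```

Under these assumptions, a sequence with amino acids in exactly 10% of columns is kept at the default setting, while one just below that is excluded.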

--cols_per_gene

maximum number of columns to retain per gene when generating the MSA

Default: 42

--min_consensus

minimum percentage of the same amino acid required to retain column (inclusive bound)

Default: 25

--max_consensus

maximum percentage of the same amino acid allowed to retain column (exclusive bound)

Default: 95

--min_perc_taxa

minimum percentage of taxa required to retain column (inclusive bound)

Default: 50
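The way the three column thresholds combine can be sketched as follows. This is a minimal illustration of the bounds described above, not the GTDB-Tk implementation: the function names are hypothetical, and it assumes the consensus percentage is computed over the taxa that have an amino acid in the column (gaps excluded).

```python
from collections import Counter

# Hypothetical names; gap characters and the consensus denominator are assumptions.
GAP_CHARS = {"-", "."}


def keep_column(column, min_perc_taxa=50.0, min_consensus=25.0, max_consensus=95.0):
    """Return True if one alignment column passes all three thresholds."""
    residues = [aa for aa in column if aa not in GAP_CHARS]
    if not column or not residues:
        return False
    # Inclusive lower bound on the share of taxa with an amino acid here.
    perc_taxa = 100.0 * len(residues) / len(column)
    if perc_taxa < min_perc_taxa:
        return False
    # Share of the most common amino acid among the non-gap residues.
    top_count = Counter(residues).most_common(1)[0][1]
    consensus = 100.0 * top_count / len(residues)
    # min_consensus is inclusive, max_consensus is exclusive.
    return min_consensus <= consensus < max_consensus


def filter_columns(msa, **thresholds):
    """Return the indices of columns to retain from a list of equal-length sequences."""
    columns = list(zip(*msa))
    return [i for i, col in enumerate(columns) if keep_column(col, **thresholds)]
```

With the defaults, a fully conserved column is dropped (100% is not below 95), and a column present in fewer than half of the taxa is dropped regardless of its consensus.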

--rnd_seed

random seed to use for selecting columns, e.g. 42
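The effect of a fixed seed on column selection can be sketched like this. It assumes a simple seeded random sample capped at `cols_per_gene`; the function name is hypothetical and GTDB-Tk's actual selection procedure may differ.

```python
import random


def select_columns(candidate_columns, cols_per_gene=42, rnd_seed=None):
    """Pick at most cols_per_gene column indices; a fixed seed makes the pick repeatable.

    Hypothetical helper for illustration only.
    """
    candidates = list(candidate_columns)
    if len(candidates) <= cols_per_gene:
        # Nothing to subsample: keep every candidate column.
        return sorted(candidates)
    rng = random.Random(rnd_seed)  # independent generator, seeded per call
    return sorted(rng.sample(candidates, cols_per_gene))
```

Calling the function twice with the same seed and the same candidates returns the same columns, which is the reason to set `--rnd_seed` when reproducible alignments are needed.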

--prefix

prefix for all output files

Default: “gtdbtk”

--cpus

number of CPUs to use

Default: 1

--tmpdir

specify alternative directory for temporary files

Default: “/tmp”

--debug

create intermediate files for debugging purposes

mutually exclusive optional arguments

--custom_msa_filters: perform custom filtering of MSA with the cols_per_gene, min_consensus, max_consensus, and min_perc_taxa parameters instead of using the canonical mask
--skip_trimming: skip the trimming step and return the full MSAs
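When the command is driven from a script, the mutual exclusivity can be checked before anything is launched. The sketch below only assembles an argument list from flags documented on this page; the function name and its parameters are hypothetical, and it does not run GTDB-Tk itself.

```python
def build_align_command(identify_dir, out_dir, cpus=1, prefix=None,
                        custom_msa_filters=False, skip_trimming=False):
    """Assemble a `gtdbtk align` argument list (hypothetical helper)."""
    if custom_msa_filters and skip_trimming:
        # The two flags are documented as mutually exclusive.
        raise ValueError("--custom_msa_filters and --skip_trimming cannot be combined")
    cmd = ["gtdbtk", "align",
           "--identify_dir", str(identify_dir),
           "--out_dir", str(out_dir),
           "--cpus", str(cpus)]
    if prefix is not None:
        cmd += ["--prefix", str(prefix)]
    if custom_msa_filters:
        cmd.append("--custom_msa_filters")
    if skip_trimming:
        cmd.append("--skip_trimming")
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where GTDB-Tk is installed.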

Files output

[prefix].log
[prefix].json
[prefix].warnings.log
align
- [prefix].[domain].msa.fasta.gz
- [prefix].[domain].user_msa.fasta.gz
- [prefix].[domain].filtered.tsv
- intermediate_results
  - [prefix].[domain].marker_info.tsv
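The gzipped FASTA files listed above can be read with the Python standard library. The reader below is a minimal sketch with a hypothetical function name; the example path in the usage note assumes the default prefix and a bacterial run, and is not taken from a real output directory.

```python
import gzip


def read_fasta_gz(path):
    """Read a gzipped FASTA file into a dict of sequence id -> sequence string."""
    chunks = {}
    seq_id = None
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Use the first whitespace-delimited token of the header as the id.
                header = line[1:].split()
                seq_id = header[0] if header else ""
                chunks[seq_id] = []
            elif seq_id is not None:
                chunks[seq_id].append(line)
    return {name: "".join(parts) for name, parts in chunks.items()}
```

For example, `read_fasta_gz("align_output/align/gtdbtk.bac120.user_msa.fasta.gz")` would return the aligned user genomes, assuming that file exists under the chosen output directory.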

Example

Input

gtdbtk align --identify_dir identify_output/ --out_dir align_output --cpus 2

Output

[2022-04-11 11:59:14] INFO: GTDB-Tk v2.0.0
[2022-04-11 11:59:14] INFO: gtdbtk align --identify_dir /tmp/gtdbtk/identify --out_dir /tmp/gtdbtk/align --cpus 2
[2022-04-11 11:59:14] INFO: Using GTDB-Tk reference data version r207: /srv/db/gtdbtk/official/release207
[2022-04-11 11:59:15] INFO: Aligning markers in 3 genomes with 2 CPUs.
[2022-04-11 11:59:16] INFO: Processing 3 genomes identified as archaeal.
[2022-04-11 11:59:16] INFO: Read concatenated alignment for 3,412 GTDB genomes.
[2022-04-11 11:59:16] TASK: Generating concatenated alignment for each marker.
[2022-04-11 11:59:16] INFO: Completed 3 genomes in 0.01 seconds (139.73 genomes/second).
[2022-04-11 11:59:16] TASK: Aligning 52 identified markers using hmmalign 3.1b2 (February 2015).
[2022-04-11 11:59:17] INFO: Completed 52 markers in 0.86 seconds (60.66 markers/second).
[2022-04-11 11:59:17] TASK: Masking columns of archaeal multiple sequence alignment using canonical mask.
[2022-04-11 11:59:21] INFO: Completed 3,414 sequences in 4.19 seconds (815.22 sequences/second).
[2022-04-11 11:59:21] INFO: Masked archaeal alignment from 13,540 to 10,153 AAs.
[2022-04-11 11:59:21] INFO: 0 archaeal user genomes have amino acids in <10.0% of columns in filtered MSA.
[2022-04-11 11:59:21] INFO: Creating concatenated alignment for 3,414 archaeal GTDB and user genomes.
[2022-04-11 11:59:23] INFO: Creating concatenated alignment for 3 archaeal user genomes.
[2022-04-11 11:59:23] INFO: Done.
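For scripted runs it can be handy to pull the masking summary out of a log like the one above. The sketch below assumes the `[timestamp] LEVEL: message` layout and the "Masked ... alignment from ... to ... AAs." wording shown in this example; the function name is hypothetical and the message format may differ between GTDB-Tk versions.

```python
import re

# Patterns are assumptions based on the example log shown on this page.
LOG_LINE = re.compile(r"^\[(?P<time>[^\]]+)\] (?P<level>[A-Z]+): (?P<message>.*)$")
MASKED = re.compile(
    r"Masked (?P<domain>\w+) alignment from (?P<before>[\d,]+) to (?P<after>[\d,]+) AAs\."
)


def masked_summary(log_text):
    """Return (domain, columns_before, columns_after) from the first masking line, or None."""
    for line in log_text.splitlines():
        entry = LOG_LINE.match(line.strip())
        if not entry:
            continue
        masked = MASKED.search(entry.group("message"))
        if masked:
            before = int(masked.group("before").replace(",", ""))
            after = int(masked.group("after").replace(",", ""))
            return masked.group("domain"), before, after
    return None
```

Dividing the second number by the first gives the fraction of alignment columns that survived masking.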