align
Create a multiple sequence alignment based on the AR53/BAC120 marker set.
Arguments
usage: gtdbtk align --identify_dir IDENTIFY_DIR --out_dir OUT_DIR
                    [--skip_gtdb_refs] [--taxa_filter TAXA_FILTER]
                    [--min_perc_aa MIN_PERC_AA]
                    [--cols_per_gene COLS_PER_GENE]
                    [--min_consensus MIN_CONSENSUS]
                    [--max_consensus MAX_CONSENSUS]
                    [--min_perc_taxa MIN_PERC_TAXA] [--rnd_seed RND_SEED]
                    [--prefix PREFIX] [--cpus CPUS] [--tmpdir TMPDIR]
                    [--debug] [-h] [--custom_msa_filters | --skip_trimming]
required named arguments
- --identify_dir
output directory of ‘identify’ command
- --out_dir
directory to output files
Named Arguments
- --skip_gtdb_refs
do not include GTDB reference genomes in multiple sequence alignment.
- --taxa_filter
filter GTDB genomes to taxa (comma separated) within specific taxonomic groups (e.g.: d__Bacteria or p__Proteobacteria,p__Actinobacteria)
- --min_perc_aa
exclude genomes that do not have at least this percentage of AA in the MSA (inclusive bound)
Default: 10
- --cols_per_gene
maximum number of columns to retain per gene when generating the MSA
Default: 42
- --min_consensus
minimum percentage of the same amino acid required to retain column (inclusive bound)
Default: 25
- --max_consensus
maximum percentage of the same amino acid required to retain column (exclusive bound)
Default: 95
- --min_perc_taxa
minimum percentage of taxa required to retain column (inclusive bound)
Default: 50
- --rnd_seed
random seed to use for selecting columns, e.g. 42
- --prefix
prefix for all output files
Default: “gtdbtk”
- --cpus
number of CPUs to use
Default: 1
- --tmpdir
specify alternative directory for temporary files
Default: “/tmp”
- --debug
create intermediate files for debugging purposes
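The --min_perc_aa check can be pictured as a per-genome threshold on the share of alignment columns that hold an amino acid. The sketch below is illustrative only and is not GTDB-Tk's own code: the helper names percent_aa and keep_genome are hypothetical, and it assumes gaps are written as "-" or ".".

```python
# Hypothetical helpers illustrating the documented --min_perc_aa semantics.
# Assumption: gap characters are "-" and "."; everything else counts as an amino acid.
GAP_CHARS = {"-", "."}


def percent_aa(aligned_seq: str) -> float:
    """Return the percentage of alignment columns that contain an amino acid."""
    if not aligned_seq:
        return 0.0
    residues = sum(1 for c in aligned_seq if c not in GAP_CHARS)
    return 100.0 * residues / len(aligned_seq)


def keep_genome(aligned_seq: str, min_perc_aa: float = 10.0) -> bool:
    """Keep a genome when its percentage meets the threshold (inclusive bound)."""
    return percent_aa(aligned_seq) >= min_perc_aa
```

With the default of 10, a genome with amino acids in exactly 10% of columns is kept, because the bound is inclusive.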
mutually exclusive optional arguments
- --custom_msa_filters
perform custom filtering of MSA with cols_per_gene, min_consensus, max_consensus, and min_perc_taxa parameters instead of using canonical mask
- --skip_trimming
skip the trimming step and return the full MSAs
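The inclusive and exclusive bounds of the custom filter parameters are easy to mix up, so a small sketch may help. It is a minimal illustration of the documented bounds, not GTDB-Tk's implementation. The function names column_passes and select_columns are hypothetical. It also assumes that consensus is measured over non-gap residues only, and that columns beyond cols_per_gene are dropped by seeded random sampling.

```python
import random
from collections import Counter

GAPS = {"-", "."}  # assumption: these characters mark gaps


def column_passes(column, min_perc_taxa=50.0, min_consensus=25.0, max_consensus=95.0):
    """Apply the documented bounds to one alignment column (a sequence of characters)."""
    residues = [c for c in column if c not in GAPS]
    if not column or not residues:
        return False
    # min_perc_taxa: inclusive lower bound on the share of taxa with a residue.
    perc_taxa = 100.0 * len(residues) / len(column)
    if perc_taxa < min_perc_taxa:
        return False
    # Consensus: share of the most common amino acid (assumed: among non-gap residues).
    top_count = Counter(residues).most_common(1)[0][1]
    consensus = 100.0 * top_count / len(residues)
    # min_consensus is inclusive, max_consensus is exclusive.
    return min_consensus <= consensus < max_consensus


def select_columns(gene_msa, cols_per_gene=42, rnd_seed=None, **bounds):
    """Return sorted indices of at most cols_per_gene passing columns for one gene."""
    n_cols = len(gene_msa[0])
    valid = [i for i in range(n_cols)
             if column_passes([seq[i] for seq in gene_msa], **bounds)]
    if len(valid) > cols_per_gene:
        # Seeded sampling makes the selection reproducible for a given rnd_seed.
        valid = sorted(random.Random(rnd_seed).sample(valid, cols_per_gene))
    return valid
```

Under these assumptions a fully conserved column is rejected, since a consensus of 100% is not below the exclusive max_consensus bound of 95.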
Files output
Example
Input
gtdbtk align --identify_dir identify_output/ --out_dir align_output --cpus 3
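When the command is driven from a script, it can help to assemble the argument list programmatically. The following is a sketch only: build_align_command and run_align are hypothetical helpers, they use only flags listed in the usage above, and they assume the gtdbtk executable is on PATH.

```python
import subprocess

# Flags that the usage text associates with --custom_msa_filters.
CUSTOM_FILTER_FLAGS = ("cols_per_gene", "min_consensus", "max_consensus", "min_perc_taxa")


def build_align_command(identify_dir, out_dir, cpus=1, prefix=None,
                        custom_filters=None, skip_trimming=False):
    """Build the argv list for `gtdbtk align` (hypothetical helper)."""
    if custom_filters is not None and skip_trimming:
        # The usage lists these two options as mutually exclusive.
        raise ValueError("--custom_msa_filters and --skip_trimming are mutually exclusive")
    cmd = ["gtdbtk", "align",
           "--identify_dir", str(identify_dir),
           "--out_dir", str(out_dir),
           "--cpus", str(cpus)]
    if prefix is not None:
        cmd += ["--prefix", prefix]
    if skip_trimming:
        cmd.append("--skip_trimming")
    if custom_filters is not None:
        cmd.append("--custom_msa_filters")
        for name in CUSTOM_FILTER_FLAGS:
            if name in custom_filters:
                cmd += [f"--{name}", str(custom_filters[name])]
    return cmd


def run_align(**kwargs):
    """Run the command; raises CalledProcessError on a non-zero exit status."""
    return subprocess.run(build_align_command(**kwargs), check=True)
```

Calling build_align_command("identify_output/", "align_output", cpus=3) yields the same arguments as the command shown above.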
Output
[2022-04-11 11:59:14] INFO: GTDB-Tk v2.0.0
[2022-04-11 11:59:14] INFO: gtdbtk align --identify_dir /tmp/gtdbtk/identify --out_dir /tmp/gtdbtk/align --cpus 2
[2022-04-11 11:59:14] INFO: Using GTDB-Tk reference data version r207: /srv/db/gtdbtk/official/release207
[2022-04-11 11:59:15] INFO: Aligning markers in 3 genomes with 2 CPUs.
[2022-04-11 11:59:16] INFO: Processing 3 genomes identified as archaeal.
[2022-04-11 11:59:16] INFO: Read concatenated alignment for 3,412 GTDB genomes.
[2022-04-11 11:59:16] TASK: Generating concatenated alignment for each marker.
[2022-04-11 11:59:16] INFO: Completed 3 genomes in 0.01 seconds (139.73 genomes/second).
[2022-04-11 11:59:16] TASK: Aligning 52 identified markers using hmmalign 3.1b2 (February 2015).
[2022-04-11 11:59:17] INFO: Completed 52 markers in 0.86 seconds (60.66 markers/second).
[2022-04-11 11:59:17] TASK: Masking columns of archaeal multiple sequence alignment using canonical mask.
[2022-04-11 11:59:21] INFO: Completed 3,414 sequences in 4.19 seconds (815.22 sequences/second).
[2022-04-11 11:59:21] INFO: Masked archaeal alignment from 13,540 to 10,153 AAs.
[2022-04-11 11:59:21] INFO: 0 archaeal user genomes have amino acids in <10.0% of columns in filtered MSA.
[2022-04-11 11:59:21] INFO: Creating concatenated alignment for 3,414 archaeal GTDB and user genomes.
[2022-04-11 11:59:23] INFO: Creating concatenated alignment for 3 archaeal user genomes.
[2022-04-11 11:59:23] INFO: Done.
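The masking message in the log reports how many alignment columns survive trimming. As a rough sketch, the figures can be pulled out with a regular expression. The helper parse_mask_line is hypothetical, and the pattern assumes the message wording shown in the example log above, which may differ between GTDB-Tk versions.

```python
import re

# Pattern based on the example log line; the wording is an assumption, not a stable format.
MASK_RE = re.compile(r"Masked (\w+) alignment from ([\d,]+) to ([\d,]+) AAs\.")


def parse_mask_line(line):
    """Return (domain, columns_before, columns_after, percent_kept), or None if no match."""
    match = MASK_RE.search(line)
    if match is None:
        return None
    domain = match.group(1)
    before = int(match.group(2).replace(",", ""))  # strip thousands separators
    after = int(match.group(3).replace(",", ""))
    return domain, before, after, 100.0 * after / before


line = "[2022-04-11 11:59:21] INFO: Masked archaeal alignment from 13,540 to 10,153 AAs."
domain, before, after, pct = parse_mask_line(line)
print(f"{domain}: kept {after} of {before} columns ({pct:.1f}%)")
```

For the example run, roughly three quarters of the archaeal alignment columns remain after the canonical mask is applied.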