This jupyter notebook demonstrates how to run the classify_wf command on a set of test genomes.

For a full list of commands see: * gtdbtk -h, or * GTDB-Tk commands

Step 1: Obtaining data

This example will use the following two genomes that we will refer to as: * Genome A: GCF_003947435.1 [GTDB / NCBI] * Genome B: GCA_002011125.1 [GTDB / NCBI]

# Create the directory.
!mkdir -p /tmp/gtdbtk && cd /tmp/gtdbtk

# Obtain the genomes.
!mkdir -p /tmp/gtdbtk/genomes
!wget -q https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/947/435/GCF_003947435.1_ASM394743v1/GCF_003947435.1_ASM394743v1_genomic.fna.gz -O /tmp/gtdbtk/genomes/genome_a.fna.gz
!wget -q https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002/011/125/GCA_002011125.1_ASM201112v1/GCA_002011125.1_ASM201112v1_genomic.fna.gz -O /tmp/gtdbtk/genomes/genome_b.fna.gz

Step 2: Gene calling (identify)

Note that the workflow can be run as a single command classify_wf, however, each step will be run individually in this notebook. It can sometimes be useful to run the steps individually when processing large pipelines.

[ ]:
!ls -l /tmp/gtdbtk/genomes
!gtdbtk identify --genome_dir /tmp/gtdbtk/genomes --out_dir /tmp/gtdbtk/identify --extension gz --cpus 2
[2022-04-11 11:48:59] INFO: GTDB-Tk v2.0.0
[2022-04-11 11:48:59] INFO: gtdbtk identify --genome_dir /tmp/gtdbtk/genomes --out_dir /tmp/gtdbtk/identify --extension gz --cpus 2
[2022-04-11 11:48:59] INFO: Using GTDB-Tk reference data version r207: /srv/db/gtdbtk/official/release207
[2022-04-11 11:48:59] INFO: Identifying markers in 2 genomes with 2 threads.
[2022-04-11 11:48:59] TASK: Running Prodigal V2.6.3 to identify genes.
[2022-04-11 11:49:10] INFO: Completed 2 genomes in 10.94 seconds (5.47 seconds/genome).
[2022-04-11 11:49:10] TASK: Identifying TIGRFAM protein families.
[2022-04-11 11:49:16] INFO: Completed 2 genomes in 5.78 seconds (2.89 seconds/genome).
[2022-04-11 11:49:16] TASK: Identifying Pfam protein families.
[2022-04-11 11:49:16] INFO: Completed 2 genomes in 0.42 seconds (4.81 genomes/second).
[2022-04-11 11:49:16] INFO: Annotations done using HMMER 3.1b2 (February 2015).
[2022-04-11 11:49:16] TASK: Summarising identified marker genes.
[2022-04-11 11:49:16] INFO: Completed 2 genomes in 0.05 seconds (40.91 genomes/second).
[2022-04-11 11:49:16] INFO: Done.


The called genes and marker information can be found under each genomes respeective intermediate files directory, as shown below.

!ls /tmp/gtdbtk/identify/identify/intermediate_results/marker_genes/genome_a.fna/
genome_a.fna_pfam_tophit.tsv         genome_a.fna_protein.gff.sha256
genome_a.fna_pfam_tophit.tsv.sha256  genome_a.fna_tigrfam.out
genome_a.fna_pfam.tsv                genome_a.fna_tigrfam.out.sha256
genome_a.fna_pfam.tsv.sha256         genome_a.fna_tigrfam_tophit.tsv
genome_a.fna_protein.faa             genome_a.fna_tigrfam_tophit.tsv.sha256
genome_a.fna_protein.faa.sha256      genome_a.fna_tigrfam.tsv
genome_a.fna_protein.fna             genome_a.fna_tigrfam.tsv.sha256
genome_a.fna_protein.fna.sha256      prodigal_translation_table.tsv
genome_a.fna_protein.gff             prodigal_translation_table.tsv.sha256

However, it is sometimes more useful to just read the summary files which detail markers identified from either the archaeal 53, or bacterial 120 marker set.

!cat /tmp/gtdbtk/identify/identify/gtdbtk.ar53.markers_summary.tsv
name    number_unique_genes     number_multiple_genes   number_multiple_unique_genes    number_missing_genes    list_unique_genes       list_multiple_genes     list_multiple_unique_genes      list_missing_genes
genome_a.fna    44      0       0       9       PF00410.20,PF00466.21,PF00687.22,PF00827.18,PF00900.21,PF01000.27,PF01015.19,PF01090.20,PF01200.19,PF01280.21,PF04919.13,PF07541.13,TIGR00037,TIGR00111,TIGR00134,TIGR00279,TIGR00291,TIGR00323,TIGR00335,TIGR00405,TIGR00448,TIGR00483,TIGR00491,TIGR00967,TIGR00982,TIGR01008,TIGR01012,TIGR01018,TIGR01020,TIGR01028,TIGR01046,TIGR01052,TIGR01213,TIGR01952,TIGR02236,TIGR02390,TIGR03626,TIGR03627,TIGR03628,TIGR03671,TIGR03672,TIGR03674,TIGR03676,TIGR03680                     TIGR00064,TIGR00373,TIGR00522,TIGR01171,TIGR02338,TIGR02389,TIGR03629,TIGR03670,TIGR03673
genome_b.fna    50      0       0       3       PF00410.20,PF00466.21,PF00687.22,PF00827.18,PF00900.21,PF01000.27,PF01015.19,PF01090.20,PF01200.19,PF01280.21,PF04919.13,PF07541.13,TIGR00037,TIGR00064,TIGR00111,TIGR00134,TIGR00279,TIGR00291,TIGR00323,TIGR00335,TIGR00373,TIGR00405,TIGR00448,TIGR00483,TIGR00491,TIGR00522,TIGR00967,TIGR00982,TIGR01008,TIGR01012,TIGR01018,TIGR01020,TIGR01028,TIGR01046,TIGR01052,TIGR01952,TIGR02236,TIGR02338,TIGR02389,TIGR02390,TIGR03626,TIGR03628,TIGR03629,TIGR03670,TIGR03671,TIGR03672,TIGR03673,TIGR03674,TIGR03676,TIGR03680                 TIGR01171,TIGR01213,TIGR03627

Step 3: Aligning genomes (align)

The align step will align all identified markers, determine the most likely domain and output a concatenated MSA.

!gtdbtk align --identify_dir /tmp/gtdbtk/identify --out_dir /tmp/gtdbtk/align --cpus 2
[2022-04-11 11:59:14] INFO: GTDB-Tk v2.0.0
[2022-04-11 11:59:14] INFO: gtdbtk align --identify_dir /tmp/gtdbtk/identify --out_dir /tmp/gtdbtk/align --cpus 2
[2022-04-11 11:59:14] INFO: Using GTDB-Tk reference data version r207: /srv/db/gtdbtk/official/release207
[2022-04-11 11:59:15] INFO: Aligning markers in 2 genomes with 2 CPUs.
[2022-04-11 11:59:16] INFO: Processing 2 genomes identified as archaeal.
[2022-04-11 11:59:16] INFO: Read concatenated alignment for 3,412 GTDB genomes.
[2022-04-11 11:59:16] TASK: Generating concatenated alignment for each marker.
[2022-04-11 11:59:16] INFO: Completed 2 genomes in 0.01 seconds (139.73 genomes/second).
[2022-04-11 11:59:16] TASK: Aligning 52 identified markers using hmmalign 3.1b2 (February 2015).
[2022-04-11 11:59:17] INFO: Completed 52 markers in 0.86 seconds (60.66 markers/second).
[2022-04-11 11:59:17] TASK: Masking columns of archaeal multiple sequence alignment using canonical mask.
[2022-04-11 11:59:21] INFO: Completed 3,414 sequences in 4.19 seconds (815.22 sequences/second).
[2022-04-11 11:59:21] INFO: Masked archaeal alignment from 13,540 to 10,153 AAs.
[2022-04-11 11:59:21] INFO: 0 archaeal user genomes have amino acids in <10.0% of columns in filtered MSA.
[2022-04-11 11:59:21] INFO: Creating concatenated alignment for 3,414 archaeal GTDB and user genomes.
[2022-04-11 11:59:23] INFO: Creating concatenated alignment for 2 archaeal user genomes.
[2022-04-11 11:59:23] INFO: Done.


It is important to pay attention to the output, if a genome had a low number of markers identified it will be excluded from the analysis at this step. A warning will appear if that is the case.

Depending on the domain, a prefixed file of either ar53 or bac120 will appear containing the MSA of the user genomes and the GTDB genomes, or just the user genomes (gtdbtk.ar53.msa.fasta.gz and gtdbtk.ar53.user_msa.fasta.gz respectively.)

!ls /tmp/gtdbtk/align/align/
gtdbtk.ar53.filtered.tsv  gtdbtk.ar53.user_msa.fasta.gz
gtdbtk.ar53.msa.fasta.gz  intermediate_results

Step 4: Classifying genomes (classify)

The classify step will place the genomes into a reference tree, then determine their most likely classification.

!gtdbtk classify --genome_dir /tmp/gtdbtk/genomes --align_dir /tmp/gtdbtk/align --out_dir /tmp/gtdbtk/classify -x gz --cpus 2
[2022-04-11 12:02:06] INFO: GTDB-Tk v2.0.0
[2022-04-11 12:02:06] INFO: gtdbtk classify --genome_dir /tmp/gtdbtk/genomes --align_dir /tmp/gtdbtk/align --out_dir /tmp/gtdbtk/classify -x gz --cpus 2
[2022-04-11 12:02:06] INFO: Using GTDB-Tk reference data version r207: /srv/db/gtdbtk/official/release207
[2022-04-11 12:02:07] TASK: Placing 2 archaeal genomes into reference tree with pplacer using 2 CPUs (be patient).
[2022-04-11 12:02:07] INFO: pplacer version: v1.1.alpha19-0-g807f6f3
                                                               [2022-04-11 12:07:06] INFO: Calculating RED values based on reference tree.
[2022-04-11 12:07:06] TASK: Traversing tree to determine classification method.
[2022-04-11 12:07:06] INFO: Completed 2 genomes in 0.00 seconds (18,558.87 genomes/second).
[2022-04-11 12:07:06] TASK: Calculating average nucleotide identity using FastANI (v1.32).
[2022-04-11 12:07:08] INFO: Completed 4 comparisons in 1.61 seconds (2.49 comparisons/second).
[2022-04-11 12:07:08] INFO: 2 genome(s) have been classified using FastANI and pplacer.
[2022-04-11 12:07:08] INFO: Note that Tk classification mode is insufficient for publication of new taxonomic designations. New designations should be based on one or more de novo trees, an example of which can be produced by Tk in de novo mode.
[2022-04-11 12:07:08] INFO: Done.


The two main files output are the summary files (gtdbtk.ar53.summary.tsv, and gtdbtk.bac120.summary.tsv respectively). Classification of the genomes are present in the summary file.

!ls /tmp/gtdbtk/classify
classify  gtdbtk.ar53.summary.tsv  gtdbtk.log  gtdbtk.warnings.log
[ ]: