gtdblib.taxonomy package

Submodules

gtdblib.taxonomy.taxonomy module

class gtdblib.taxonomy.taxonomy.Taxonomy(d: Taxon, p: Taxon, c: Taxon, o: Taxon, f: Taxon, g: Taxon, s: Taxon)

Bases: object

CLASS_INDEX = 2
DOMAIN_INDEX = 0
FAMILY_INDEX = 4
GENUS_INDEX = 5
ORDER_INDEX = 3
PHYLUM_INDEX = 1
RANK_INDEX = {'c__': 2, 'd__': 0, 'f__': 4, 'g__': 5, 'o__': 3, 'p__': 1, 's__': 6}
RANK_LABELS = ('domain', 'phylum', 'class', 'order', 'family', 'genus', 'species')
RANK_PREFIXES = ('d__', 'p__', 'c__', 'o__', 'f__', 'g__', 's__')
SPECIES_INDEX = 6
c
d
f
g
o
p
s
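The class constants above are related: each entry in RANK_INDEX is the position of that prefix in RANK_PREFIXES, and RANK_LABELS follows the same order. The following is a minimal, standalone sketch of that relationship. It copies the tuples rather than importing gtdblib, and `rank_label` is a hypothetical helper, not part of the package.

```python
# Copies of the documented constants, so the sketch runs without gtdblib.
RANK_PREFIXES = ('d__', 'p__', 'c__', 'o__', 'f__', 'g__', 's__')
RANK_LABELS = ('domain', 'phylum', 'class', 'order', 'family', 'genus', 'species')

# Rebuild the prefix -> index mapping from the ordered prefix tuple.
RANK_INDEX = {prefix: idx for idx, prefix in enumerate(RANK_PREFIXES)}


def rank_label(taxon: str) -> str:
    """Return the rank label for a prefixed taxon (hypothetical helper)."""
    # The first three characters of a taxon are its rank prefix, e.g. 'g__'.
    return RANK_LABELS[RANK_INDEX[taxon[:3]]]
```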
gtdblib.taxonomy.taxonomy.fill_trailing_ranks(taxa: List[str]) -> List[str]

Fill in missing trailing ranks in a taxonomy string.

This function assumes it will be provided with a list of taxa from the 7 canonical ranks in rank order. Any missing trailing ranks are filled in with their bare rank prefix.

Example: [d__Bacteria, p__Firmicutes] => [d__Bacteria, p__Firmicutes, c__, o__, f__, g__, s__]

Parameters:

taxa (list) – List of taxa.

Returns:

List of taxa with filled trailing ranks.
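The behaviour described above can be sketched in a few lines. This is an illustration written from the description only, not the package's implementation; the name `fill_trailing_ranks_sketch` is hypothetical.

```python
from typing import List

RANK_PREFIXES = ('d__', 'p__', 'c__', 'o__', 'f__', 'g__', 's__')


def fill_trailing_ranks_sketch(taxa: List[str]) -> List[str]:
    """Append a bare rank prefix for every rank after the last taxon given."""
    # Assumes taxa are in rank order starting at domain, as the text states.
    return list(taxa) + list(RANK_PREFIXES[len(taxa):])


filled = fill_trailing_ranks_sketch(['d__Bacteria', 'p__Firmicutes'])
```

A list that already holds all seven ranks is returned unchanged, because the slice of remaining prefixes is empty.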

gtdblib.taxonomy.taxonomy.read_taxonomy(taxonomy_file: str, use_canonical_gid: bool = False) -> Dict[str, List[str]]

Read Greengenes-style taxonomy file.

This method is generic in that it can read any Greengenes-style taxonomy string and does not strictly require there to be the seven canonical ranks present.

Expected format is:

<accession> <taxonomy string>

where the taxonomy string will typically have the format:

d__; p__; c__; o__; f__; g__; s__

Parameters:
  • taxonomy_file – File indicating Greengenes-style taxonomic assignments.

  • use_canonical_gid – Flag indicating if accessions should be converted to their canonical form.

Returns:

Mapping from accessions to a list of taxa.
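To make the returned mapping concrete, here is a rough sketch of parsing such lines. It assumes the accession and taxonomy string are separated by a tab and that ranks are separated by semicolons. The source does not state the delimiter, so both are assumptions, and `parse_taxonomy_lines` is a hypothetical name rather than a gtdblib function.

```python
from typing import Dict, Iterable, List


def parse_taxonomy_lines(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each accession to its list of taxa (illustrative sketch only)."""
    taxonomy: Dict[str, List[str]] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumption: a tab separates the accession from the taxonomy string.
        accession, taxonomy_str = line.split('\t', 1)
        # Assumption: ranks are separated by ';' with optional whitespace.
        taxonomy[accession] = [taxon.strip() for taxon in taxonomy_str.split(';')]
    return taxonomy


example = parse_taxonomy_lines([
    'genome_1\td__Bacteria; p__Firmicutes; c__Bacilli; o__; f__; g__; s__',
])
```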

gtdblib.taxonomy.taxonomy.read_taxonomy_from_tree(tree: str, warnings: bool = True) -> Dict[str, List[str]]

Obtain the taxonomy for each extant taxon as specified by internal tree labels.

This method is generic in that it can read any Greengenes-style taxonomy string and does not strictly require extant taxa to be classified to the seven canonical ranks.

Parameters:
  • tree – Filename of Newick tree or Dendropy tree object.

  • warnings – Flag indicating if issues reading tree should be logged as warnings.

Returns:

Mapping from extant taxon labels to a list of taxa.

gtdblib.taxonomy.validation module

Methods for validating a Greengenes-style taxonomy.

This module validates canonical 7-rank Greengenes-style taxonomy strings:

d__; p__; c__; o__; f__; g__; s__

gtdblib.taxonomy.validation.duplicate_names(taxonomy: Dict[str, List[str]], check_species: bool = True) -> Dict[str, List[str]]

Identify duplicate names in taxonomy.

Parameters

taxonomy: d[unique_id] -> [d__<taxon>; …; s__<taxon>]

Taxonomy strings indexed by unique ids.

Returns

dict: d[taxon] -> lineages

List of lineages for duplicate taxa.
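One way to read "duplicate" here is a taxon name that appears under more than one distinct lineage. The sketch below follows that reading. It is an assumption about the intent, not the package's implementation. The name `duplicate_names_sketch` is hypothetical, and treating `check_species=False` as "skip species-rank names" is also assumed.

```python
from collections import defaultdict
from typing import Dict, List


def duplicate_names_sketch(taxonomy: Dict[str, List[str]],
                           check_species: bool = True) -> Dict[str, List[str]]:
    """Return taxa found under more than one lineage (illustrative sketch)."""
    lineages = defaultdict(set)
    for taxa in taxonomy.values():
        for idx, taxon in enumerate(taxa):
            if len(taxon) <= 3:
                continue  # bare prefix such as 'g__' carries no name
            if taxon.startswith('s__') and not check_species:
                continue  # assumed meaning of check_species=False
            # Record the lineage from the domain down to this taxon.
            lineages[taxon].add(';'.join(taxa[:idx + 1]))
    return {taxon: sorted(seen) for taxon, seen in lineages.items() if len(seen) > 1}


dups = duplicate_names_sketch({
    'a': ['d__Bacteria', 'p__P1', 'c__C1', 'o__O1', 'f__F1', 'g__X', 's__X a'],
    'b': ['d__Bacteria', 'p__P1', 'c__C1', 'o__O1', 'f__F2', 'g__X', 's__X b'],
})
```

In this example only g__X is reported, because it is the only name placed under two different families.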

gtdblib.taxonomy.validation.taxonomic_consistency(taxonomy: Dict[str, List[str]], report_errors: bool = True) -> Dict[str, str]

Determine taxonomically consistent classification for taxa at each rank.

Parameters

taxonomy: d[unique_id] -> [d__<taxon>; …; s__<taxon>]

Taxonomy strings indexed by unique ids.

report_errors: boolean

Flag indicating if errors should be written to screen.

Returns

dict: d[taxa] -> expected parent

Expected parent taxon for taxa at all taxonomic ranks, or None if the taxonomy is inconsistent.
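The idea of an "expected parent" can be sketched as follows: record the parent seen for each named taxon, and treat a second, different parent as an inconsistency. This is a minimal illustration under two assumptions: that None is returned for the whole result on the first conflict, and that bare prefixes are ignored. `expected_parents_sketch` is a hypothetical name, not the gtdblib function.

```python
from typing import Dict, List, Optional


def expected_parents_sketch(taxonomy: Dict[str, List[str]]) -> Optional[Dict[str, str]]:
    """Map each named taxon to its single parent, or None on a conflict."""
    expected: Dict[str, str] = {}
    for taxa in taxonomy.values():
        # Pair every taxon with the taxon one rank above it.
        for parent, child in zip(taxa, taxa[1:]):
            if len(child) <= 3:
                continue  # bare prefix such as 'o__' has no name to check
            if expected.setdefault(child, parent) != parent:
                return None  # same child seen under two different parents
    return expected


consistent = expected_parents_sketch({
    'a': ['d__Bacteria', 'p__P1', 'c__C1', 'o__O1', 'f__F1', 'g__X', 's__X a'],
})
```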

gtdblib.taxonomy.validation.validate_species_name(species_name: str, require_full: bool = True, require_prefix: bool = True) -> Tuple[bool, str]

Validate species name.

A full species name should be binomial and include a ‘generic name’ (genus) and a ‘specific epithet’ (species), e.g. Escherichia coli. This method assumes the two names are separated by a space.

Parameters

species_name: str

Species name to validate.

require_full: boolean

Flag indicating if species name must include ‘generic name’ and ‘specific epithet’.

require_prefix: boolean

Flag indicating if name must start with the species prefix (‘s__’).

Returns

boolean

True if species name is valid, otherwise False.

str

Reason for failing validation, otherwise None.
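The two flags can be illustrated with a simplified check. The sketch below covers only the rules stated above (species prefix, and a generic name plus specific epithet separated by a space). The real function may apply further rules, and `validate_species_name_sketch` and its error messages are hypothetical.

```python
from typing import Optional, Tuple


def validate_species_name_sketch(species_name: str,
                                 require_full: bool = True,
                                 require_prefix: bool = True) -> Tuple[bool, Optional[str]]:
    """Return (is_valid, reason) for a species name (illustrative sketch)."""
    if species_name.startswith('s__'):
        name = species_name[3:]  # drop the prefix before checking the name
    elif require_prefix:
        return False, "name does not start with the species prefix 's__'"
    else:
        name = species_name

    # A full name needs at least a generic name and a specific epithet.
    if require_full and len(name.split()) < 2:
        return False, 'name is missing the generic name or the specific epithet'

    return True, None


result = validate_species_name_sketch('s__Escherichia coli')
```

Setting `require_prefix=False` accepts a name such as `Escherichia coli`, and `require_full=False` accepts a lone `s__Escherichia`.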

gtdblib.taxonomy.validation.validate_taxonomy(taxonomy: Dict[str, List[str]], check_prefixes: bool, check_ranks: bool, check_hierarchy: bool, check_species: bool, check_group_names: bool, check_duplicate_names: bool, check_capitalization: bool, report_errors: bool = True) -> Tuple[Dict[str, str], Dict[str, str], Dict[str, List[str]], Dict[str, List[str]]]

Check if taxonomy forms a strict hierarchy with all expected ranks.

This method implements a full workflow for validating Greengenes-style taxonomy strings assigned to a set of accessions. It identifies a number of common issues; which checks are run is configured by the boolean inputs.

Parameters

taxonomy: d[unique_id] -> [d__<taxon>; …; s__<taxon>]

Taxonomy strings indexed by unique ids.

check_prefixes: boolean

Flag indicating if prefix of taxon should be validated.

check_ranks: boolean

Flag indicating if the presence of all ranks should be validated.

check_hierarchy: boolean

Flag indicating if the taxonomic hierarchy should be validated.

check_species: boolean

Flag indicating if the taxonomic consistency of named species should be validated.

check_group_names: boolean

Flag indicating if group names should be checked for invalid characters.

check_duplicate_names: boolean

Flag indicating if group names should be checked for duplicates.

report_errors: boolean

Flag indicating if errors should be written to screen.

Returns

dict: d[taxon_id] -> taxonomy

Taxa with invalid number of ranks.

dict: d[taxon_id] -> [taxon, taxonomy]

Taxa with invalid rank prefixes.

dict: d[taxon_id] -> [species name, error message]

Taxa with invalid species names.

dict: d[child_taxon_id] -> two or more parent taxon ids

Taxa with invalid hierarchies.
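As an illustration of the first two return values, the sketch below runs only the rank-count and rank-prefix checks and returns dictionaries shaped as described above. It is written from this description rather than from the package source. `check_ranks_and_prefixes` is a hypothetical name, and joining taxa with ';' to form the taxonomy string is an assumption.

```python
from typing import Dict, List, Tuple

RANK_PREFIXES = ('d__', 'p__', 'c__', 'o__', 'f__', 'g__', 's__')


def check_ranks_and_prefixes(
        taxonomy: Dict[str, List[str]]
) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Collect entries with a wrong rank count or a wrong rank prefix."""
    invalid_ranks: Dict[str, str] = {}
    invalid_prefixes: Dict[str, List[str]] = {}
    for taxon_id, taxa in taxonomy.items():
        taxonomy_str = ';'.join(taxa)
        if len(taxa) != len(RANK_PREFIXES):
            invalid_ranks[taxon_id] = taxonomy_str
            continue  # prefixes cannot be aligned without all seven ranks
        for taxon, prefix in zip(taxa, RANK_PREFIXES):
            if not taxon.startswith(prefix):
                # Report the first taxon whose prefix does not match its rank.
                invalid_prefixes[taxon_id] = [taxon, taxonomy_str]
                break
    return invalid_ranks, invalid_prefixes


bad_ranks, bad_prefixes = check_ranks_and_prefixes({
    'ok': ['d__Bacteria', 'p__P1', 'c__C1', 'o__O1', 'f__F1', 'g__X', 's__X a'],
    'short': ['d__Bacteria', 'p__P1'],
    'swapped': ['d__Bacteria', 'c__C1', 'p__P1', 'o__', 'f__', 'g__', 's__'],
})
```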

Module contents