gtdblib.taxonomy package
Submodules
gtdblib.taxonomy.taxonomy module
- class gtdblib.taxonomy.taxonomy.Taxonomy(d: Taxon, p: Taxon, c: Taxon, o: Taxon, f: Taxon, g: Taxon, s: Taxon)
Bases: object
- CLASS_INDEX = 2
- DOMAIN_INDEX = 0
- FAMILY_INDEX = 4
- GENUS_INDEX = 5
- ORDER_INDEX = 3
- PHYLUM_INDEX = 1
- RANK_INDEX = {'c__': 2, 'd__': 0, 'f__': 4, 'g__': 5, 'o__': 3, 'p__': 1, 's__': 6}
- RANK_LABELS = ('domain', 'phylum', 'class', 'order', 'family', 'genus', 'species')
- RANK_PREFIXES = ('d__', 'p__', 'c__', 'o__', 'f__', 'g__', 's__')
- SPECIES_INDEX = 6
- c
- d
- f
- g
- o
- p
- s
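The rank constants above are related: each value in RANK_INDEX is the position of that prefix in RANK_PREFIXES, which is also the position of the matching label in RANK_LABELS. The sketch below copies the documented values into local variables instead of importing the class, so it runs without gtdblib installed. `rank_label` is a hypothetical helper used only for illustration and is not part of the package.

```python
# Values copied from the class attributes documented above.
RANK_PREFIXES = ('d__', 'p__', 'c__', 'o__', 'f__', 'g__', 's__')
RANK_LABELS = ('domain', 'phylum', 'class', 'order', 'family', 'genus', 'species')

# Each prefix maps to its position in RANK_PREFIXES.
RANK_INDEX = {prefix: i for i, prefix in enumerate(RANK_PREFIXES)}


def rank_label(taxon: str) -> str:
    """Hypothetical helper: return the rank label of a prefixed taxon name."""
    return RANK_LABELS[RANK_INDEX[taxon[:3]]]


print(rank_label('g__Escherichia'))  # genus
```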
- gtdblib.taxonomy.taxonomy.fill_trailing_ranks(taxa: List[str]) List[str]
Fill in missing trailing ranks in a taxonomy string.
This function assumes it will be provided with a list of taxa from the 7 canonical ranks in rank order. Any missing trailing ranks are filled in with their bare rank prefix.
Example: [d__Bacteria, p__Firmicutes] => [d__Bacteria, p__Firmicutes, c__, o__, f__, g__, s__]
- Parameters:
taxa (list) – List of taxa.
- Returns:
List of taxa with filled trailing ranks.
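The filling behaviour described above can be sketched in a few lines. This is a stand-alone illustration that assumes the input is already in canonical rank order, as the function itself does. `fill_trailing_ranks_sketch` is a hypothetical name and may differ from the packaged implementation.

```python
from typing import List

RANK_PREFIXES = ('d__', 'p__', 'c__', 'o__', 'f__', 'g__', 's__')


def fill_trailing_ranks_sketch(taxa: List[str]) -> List[str]:
    """Append a bare rank prefix for every rank missing from the end of taxa."""
    # Taxa are assumed to be in canonical order, so the first missing
    # rank is the one at position len(taxa).
    return list(taxa) + list(RANK_PREFIXES[len(taxa):])


filled = fill_trailing_ranks_sketch(['d__Bacteria', 'p__Firmicutes'])
print(filled)
# ['d__Bacteria', 'p__Firmicutes', 'c__', 'o__', 'f__', 'g__', 's__']
```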
- gtdblib.taxonomy.taxonomy.read_taxonomy(taxonomy_file: str, use_canonical_gid: bool = False) Dict[str, List[str]]
Read Greengenes-style taxonomy file.
This method is generic in that it can read any Greengenes-style taxonomy string and does not strictly require there to be the seven canonical ranks present.
- Expected format is:
<accession> <taxonomy string>
- where the taxonomy string will typically have the formats:
- Parameters:
taxonomy_file – File indicating Greengenes-style taxonomic assignments.
use_canonical_gid – Flag indicating if accessions should be converted to their canonical form.
- Returns:
Mapping from accessions to a list of taxa.
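As a rough illustration of the mapping this function returns, the sketch below parses lines from an in-memory file. It assumes the accession and taxonomy string are separated by a tab and that taxa are separated by semicolons with optional spaces. Both are assumptions about the file format, and the accessions shown are made up. `parse_taxonomy_lines` is a hypothetical helper, and the canonical-accession conversion controlled by use_canonical_gid is not sketched.

```python
import io
from typing import Dict, List, TextIO


def parse_taxonomy_lines(handle: TextIO) -> Dict[str, List[str]]:
    """Hypothetical parser: map each accession to its list of taxa."""
    taxonomy: Dict[str, List[str]] = {}
    for line in handle:
        line = line.rstrip('\n')
        if not line.strip():
            continue  # skip blank lines
        # Assumption: accession and taxonomy string are tab-separated.
        accession, taxonomy_str = line.split('\t', 1)
        # Assumption: taxa are separated by ';' with optional spaces.
        taxonomy[accession] = [t.strip() for t in taxonomy_str.split(';')]
    return taxonomy


example = io.StringIO(
    'G000001\td__Bacteria;p__Proteobacteria;c__Gammaproteobacteria\n'
    'G000002\td__Archaea; p__Halobacteriota\n'
)
print(parse_taxonomy_lines(example))
```

Because no fixed number of ranks is enforced, the second record with only two ranks is accepted as-is.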
- gtdblib.taxonomy.taxonomy.read_taxonomy_from_tree(tree: str, warnings: bool = True) Dict[str, List[str]]
Obtain the taxonomy for each extant taxon as specified by internal tree labels.
This method is generic in that it can read any Greengenes-style taxonomy string and does not strictly require extant taxa to be classified to the seven canonical ranks.
- Parameters:
tree – Filename of Newick tree or DendroPy tree object.
warnings – Flag indicating if issues reading tree should be logged as warnings.
- Returns:
Mapping from extant taxon labels to a list of taxa.
gtdblib.taxonomy.validation module
Methods for validating a Greengenes-style taxonomy.
- gtdblib.taxonomy.validation.duplicate_names(taxonomy: Dict[str, List[str]], check_species: bool = True) Dict[str, List[str]]
Identify duplicate names in taxonomy.
Parameters
- taxonomy: d[unique_id] -> [d__<taxon>; …; s__<taxon>]
Taxonomy strings indexed by unique ids.
Returns
- dict: d[taxon] -> lineages
List of lineages for duplicate taxa.
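One way to picture a duplicate name is a taxon that is reached through more than one distinct lineage. The sketch below implements that reading. It is an assumption about what counts as a duplicate and about the effect of check_species (skipping species-level names when False), not a copy of the packaged function. `duplicate_names_sketch` and the example ids are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def duplicate_names_sketch(taxonomy: Dict[str, List[str]],
                           check_species: bool = True) -> Dict[str, List[str]]:
    """Return taxa that occur under more than one distinct lineage."""
    lineages = defaultdict(set)
    for taxa in taxonomy.values():
        for i, taxon in enumerate(taxa):
            if len(taxon) <= 3:
                continue  # bare prefix such as 'g__': rank is unassigned
            if taxon.startswith('s__') and not check_species:
                continue
            # Record the lineage leading down to and including this taxon.
            lineages[taxon].add(';'.join(taxa[:i + 1]))
    return {t: sorted(found) for t, found in lineages.items() if len(found) > 1}


taxonomy = {
    'G1': ['d__Bacteria', 'p__A', 'c__X'],
    'G2': ['d__Bacteria', 'p__B', 'c__X'],
}
print(duplicate_names_sketch(taxonomy))
# {'c__X': ['d__Bacteria;p__A;c__X', 'd__Bacteria;p__B;c__X']}
```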
- gtdblib.taxonomy.validation.taxonomic_consistency(taxonomy: Dict[str, List[str]], report_errors: bool = True) Dict[str, str]
Determine taxonomically consistent classification for taxa at each rank.
Parameters
- taxonomy: d[unique_id] -> [d__<taxon>; …; s__<taxon>]
Taxonomy strings indexed by unique ids.
- report_errors: boolean
Flag indicating if errors should be written to screen.
Returns
- dict: d[taxa] -> expected parent
Expected parent taxon for taxa at all taxonomic ranks, or None if the taxonomy is inconsistent.
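The idea of an expected parent can be sketched by recording the first parent seen for each taxon and failing as soon as a different one appears. This is a simplified illustration: it stops at the first conflict and does not report errors, and `expected_parents_sketch` is a hypothetical name rather than the library's implementation.

```python
from typing import Dict, List, Optional


def expected_parents_sketch(
        taxonomy: Dict[str, List[str]]) -> Optional[Dict[str, str]]:
    """Map each named taxon to its parent, or return None on a conflict."""
    expected: Dict[str, str] = {}
    for taxa in taxonomy.values():
        # Walk adjacent (parent, child) pairs down the lineage.
        for parent, child in zip(taxa, taxa[1:]):
            if len(child) <= 3:
                break  # unassigned rank: nothing below it to check
            # setdefault stores the first parent seen and returns it afterwards.
            if expected.setdefault(child, parent) != parent:
                return None
    return expected


consistent = {
    'G1': ['d__Bacteria', 'p__A', 'c__X'],
    'G2': ['d__Bacteria', 'p__A', 'c__Y'],
}
print(expected_parents_sketch(consistent))
# {'p__A': 'd__Bacteria', 'c__X': 'p__A', 'c__Y': 'p__A'}
```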
- gtdblib.taxonomy.validation.validate_species_name(species_name: str, require_full: bool = True, require_prefix: bool = True) Tuple[bool, str]
Validate species name.
A full species name should be binomial and include a ‘generic name’ (genus) and a ‘specific epithet’ (species), e.g. Escherichia coli. This method assumes the two names are separated by a single space.
Parameters
- species_name: str
Species name to validate.
- require_full: boolean
Flag indicating if species name must include ‘generic name’ and ‘specific epithet’.
- require_prefix: boolean
Flag indicating if name must start with the species prefix (‘s__’).
Returns
- boolean
True if species name is valid, otherwise False.
- str
Reason for failing validation, otherwise None.
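The two flags and the (valid, reason) return value can be illustrated with a small stand-alone sketch. It checks only the prefix and the binomial form described above. The error messages are invented for the example, and `validate_species_name_sketch` is a hypothetical name; the packaged function may apply further rules.

```python
from typing import Optional, Tuple


def validate_species_name_sketch(species_name: str,
                                 require_full: bool = True,
                                 require_prefix: bool = True
                                 ) -> Tuple[bool, Optional[str]]:
    """Return (True, None) if the name passes, else (False, reason)."""
    if species_name.startswith('s__'):
        name = species_name[3:]
    elif require_prefix:
        return False, "name is missing the species prefix ('s__')"
    else:
        name = species_name

    if require_full:
        # A binomial is exactly two non-empty words split by one space.
        parts = name.split(' ')
        if len(parts) != 2 or not all(parts):
            return False, 'name is not a binomial separated by a single space'

    return True, None


print(validate_species_name_sketch('s__Escherichia coli'))  # (True, None)
print(validate_species_name_sketch('Escherichia coli')[0])  # False
```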
- gtdblib.taxonomy.validation.validate_taxonomy(taxonomy: Dict[str, List[str]], check_prefixes: bool, check_ranks: bool, check_hierarchy: bool, check_species: bool, check_group_names: bool, check_duplicate_names: bool, check_capitalization: bool, report_errors: bool = True) Tuple[Dict[str, str], Dict[str, str], Dict[str, List[str]], Dict[str, List[str]]]
Check if taxonomy forms a strict hierarchy with all expected ranks.
This method implements a full workflow for validating Greengenes-style taxonomy strings assigned to a set of accessions. It identifies several common issues; the boolean flags select which checks are run.
Parameters
- taxonomy: d[unique_id] -> [d__<taxon>; …; s__<taxon>]
Taxonomy strings indexed by unique ids.
- check_prefixes: boolean
Flag indicating if prefix of taxon should be validated.
- check_ranks: boolean
Flag indicating if the presence of all ranks should be validated.
- check_hierarchy: boolean
Flag indicating if the taxonomic hierarchy should be validated.
- check_species: boolean
Flag indicating if the taxonomic consistency of named species should be validated.
- check_group_names: boolean
Flag indicating if group names should be checked for invalid characters.
- check_duplicate_names: boolean
Flag indicating if group names should be checked for duplicates.
- report_errors: boolean
Flag indicating if errors should be written to screen.
Returns
- dict: d[taxon_id] -> taxonomy
Taxa with invalid number of ranks.
- dict: d[taxon_id] -> [taxon, taxonomy]
Taxa with invalid rank prefixes.
- dict: d[taxon_id] -> [species name, error message]
Taxa with invalid species names.
- dict: d[child_taxon_id] -> two or more parent taxon ids
Taxa with invalid hierarchies.
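To make the shape of the first two return values concrete, the sketch below performs only the rank-count and rank-prefix checks on a small taxonomy. It is a simplified illustration under the assumption that taxonomy strings are joined with semicolons. `check_ranks_and_prefixes` is a hypothetical helper, and the remaining checks (hierarchy, species, group names, duplicates, capitalization) are not shown.

```python
from typing import Dict, List, Tuple

RANK_PREFIXES = ('d__', 'p__', 'c__', 'o__', 'f__', 'g__', 's__')


def check_ranks_and_prefixes(
        taxonomy: Dict[str, List[str]]
) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Return (invalid_ranks, invalid_prefixes) for a taxonomy mapping."""
    invalid_ranks: Dict[str, str] = {}
    invalid_prefixes: Dict[str, List[str]] = {}
    for taxon_id, taxa in taxonomy.items():
        taxonomy_str = ';'.join(taxa)
        if len(taxa) != len(RANK_PREFIXES):
            # Wrong number of ranks: prefixes cannot be aligned, so skip them.
            invalid_ranks[taxon_id] = taxonomy_str
            continue
        for prefix, taxon in zip(RANK_PREFIXES, taxa):
            if not taxon.startswith(prefix):
                # Record the first taxon whose prefix is out of place.
                invalid_prefixes[taxon_id] = [taxon, taxonomy_str]
                break
    return invalid_ranks, invalid_prefixes


taxonomy = {
    'G1': ['d__Bacteria', 'p__A', 'c__B', 'o__C', 'f__D', 'g__E', 's__E f'],
    'G2': ['d__Bacteria', 'p__A'],
    'G3': ['d__Bacteria', 'c__B', 'p__A', 'o__C', 'f__D', 'g__E', 's__E f'],
}
invalid_ranks, invalid_prefixes = check_ranks_and_prefixes(taxonomy)
print(invalid_ranks)     # {'G2': 'd__Bacteria;p__A'}
print(invalid_prefixes)
```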