gtdblib.util.bio package

Submodules

gtdblib.util.bio.accession module

gtdblib.util.bio.accession.canonical_gid(gid: str) str

Get canonical form of NCBI genome accession.

Example:

  G005435135 -> G005435135
  GCF_005435135.1 -> G005435135
  GCF_005435135.1_ASM543513v1_genomic -> G005435135
  RS_GCF_005435135.1 -> G005435135
  GB_GCA_005435135.1 ->

Parameters:

gid – Genome accession to convert to canonical form.

Returns:

Canonical form of accession.
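
The mapping in the example above can be sketched in a few lines. This is an illustrative reimplementation, not the library's code: the name canonical_gid_sketch is hypothetical, and the sketch assumes the only prefixes to handle are RS_, GB_, GCF_ and GCA_.

```python
def canonical_gid_sketch(gid: str) -> str:
    """Reduce a genome accession to 'G' followed by its numeric identifier."""
    # Assumption: 'RS_' and 'GB_' are the only database prefixes to strip.
    if gid.startswith(("RS_", "GB_")):
        gid = gid[3:]
    # Collapse 'GCF_'/'GCA_' to 'G' and drop the version and any trailing
    # assembly name, i.e. everything from the first period onward.
    if gid.startswith(("GCF_", "GCA_")):
        gid = "G" + gid[4:].split(".", 1)[0]
    return gid


for accession in (
    "G005435135",
    "GCF_005435135.1",
    "GCF_005435135.1_ASM543513v1_genomic",
    "RS_GCF_005435135.1",
):
    print(accession, "->", canonical_gid_sketch(accession))
```

An accession that is already canonical passes through unchanged, so the sketch is safe to apply more than once.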

gtdblib.util.bio.accession.is_same_accn_version(accn1: str, accn2: str) bool

Check if accessions have same version number.

This method assumes accession versions are provided as a suffix following a period. This is the format used for NCBI accessions, e.g. GCF_005435135.1.

Parameters:
  • accn1 – First accession.

  • accn2 – Second accession.

Returns:

True if version number is the same, else False.
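
A minimal sketch of the comparison described above, assuming the version is whatever follows the last period. The name same_version_sketch is hypothetical, and the real function may treat accessions without a period differently.

```python
def same_version_sketch(accn1: str, accn2: str) -> bool:
    """Compare the suffix after the last period of two accessions."""
    # rpartition splits on the last period; element [2] is the text after it.
    # Assumption: if there is no period, the whole string is compared.
    version1 = accn1.rpartition(".")[2]
    version2 = accn2.rpartition(".")[2]
    return version1 == version2


print(same_version_sketch("GCF_005435135.1", "GCA_005435135.1"))  # True
print(same_version_sketch("GCF_005435135.1", "GCF_005435135.2"))  # False
```

Only the version suffix is compared, so a GenBank and a RefSeq accession with the same version count as matching.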

gtdblib.util.bio.exceptions module

exception gtdblib.util.bio.exceptions.BioError(msg)

Bases: Exception

gtdblib.util.bio.newick module

gtdblib.util.bio.newick.create_label(support: Optional[float], taxon: Optional[str], auxiliary_info: Optional[str]) str

Create label for Newick tree.

Parameters:
  • support – The support value.

  • taxon – The taxon.

  • auxiliary_info – The auxiliary information.

gtdblib.util.bio.newick.parse_label(label: str) Tuple[Optional[float], Optional[str], Optional[str]]

Parse a Newick label which may contain a support value, taxon, and/or auxiliary information.

Parameters:

label – The label to parse.

Returns:

A tuple containing the support value, taxon, and auxiliary information.
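
The label syntax is not spelled out in this reference. The sketch below assumes the layout support:taxon|auxiliary_info, where any of the three parts may be absent. Both function names are hypothetical, and the assumed layout should be checked against the library source before relying on it.

```python
from typing import Optional, Tuple


def create_label_sketch(
    support: Optional[float],
    taxon: Optional[str],
    auxiliary_info: Optional[str],
) -> str:
    """Build a label in the ASSUMED 'support:taxon|auxiliary_info' layout."""
    if support is not None and taxon is not None:
        label = f"{support}:{taxon}"
    elif support is not None:
        label = str(support)
    elif taxon is not None:
        label = taxon
    else:
        label = ""
    if auxiliary_info is not None:
        label += f"|{auxiliary_info}"
    return label


def parse_label_sketch(
    label: str,
) -> Tuple[Optional[float], Optional[str], Optional[str]]:
    """Split a label written in the assumed layout back into its parts."""
    support = taxon = auxiliary_info = None
    label = label.strip()
    if "|" in label:
        label, auxiliary_info = label.split("|", 1)
    if ":" in label:
        support_text, taxon = label.split(":", 1)
        support = float(support_text)
    elif label:
        # A bare label is a support value if it parses as a number,
        # otherwise it is treated as a taxon name.
        try:
            support = float(label)
        except ValueError:
            taxon = label
    return support, taxon, auxiliary_info


print(parse_label_sketch(create_label_sketch(100.0, "g__Foo", "note")))
```

Under these assumptions the two functions are inverses for labels that carry all three parts.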

gtdblib.util.bio.seq_io module

exception gtdblib.util.bio.seq_io.InputFileError(msg)

Bases: BioError

gtdblib.util.bio.seq_io.extract_seqs(fasta_file, seqs_to_extract)

Extract specific sequences from fasta file.

Parameters:
  • fasta_file – str, Fasta file containing sequences.

  • seqs_to_extract – set, Ids of sequences to extract.

Returns:

dict[seq_id] -> seq, Dictionary of sequences indexed by sequence id.

gtdblib.util.bio.seq_io.is_nucleotide(seq_file, req_perc=0.95, max_seqs_to_read=10)

Check if a file contains sequences in nucleotide space.

The check is performed by looking for the characters in {a,c,g,t,n,.,-} and confirming that these comprise the majority of a sequence. A set number of sequences are read, and the file is assumed not to be in nucleotide space if none of these sequences is comprised primarily of the defined nucleotide set.

Parameters:
  • seq_file – str, Name of fasta/q file to read.

  • req_perc – float, Proportion of characters in a sequence that must be nucleotide bases before declaring the sequences as being in nucleotide space.

  • max_seqs_to_read – int, Maximum sequences to read before declaring sequence file to not be in nucleotide space.

Returns:

boolean, True if sequences are in nucleotide space.
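
The heuristic can be sketched on in-memory sequences instead of a file. The name looks_like_nucleotide is hypothetical. The sketch assumes the comparison is case-insensitive and that a sequence passes when its proportion of nucleotide characters is at least req_perc; neither detail is confirmed by this reference.

```python
NUCLEOTIDE_CHARS = set("acgtn.-")


def looks_like_nucleotide(seqs, req_perc=0.95, max_seqs_to_read=10):
    """Return True if any of the first few sequences is mostly nucleotide characters."""
    for index, seq in enumerate(seqs):
        if index >= max_seqs_to_read:
            break
        if not seq:
            continue  # an empty sequence carries no evidence either way
        hits = sum(1 for char in seq.lower() if char in NUCLEOTIDE_CHARS)
        if hits / len(seq) >= req_perc:
            return True
    return False


print(looks_like_nucleotide(["ACGTACGTNN"]))  # True
print(looks_like_nucleotide(["MKVLAAGIVG"]))  # False
```

Because a single passing sequence is enough, the loop returns early and reads at most max_seqs_to_read sequences.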

gtdblib.util.bio.seq_io.is_protein(seq_file, req_perc=0.95, max_seqs_to_read=10)

Check if a file contains sequences in protein space.

The check is performed by looking for the 20 amino acids, along with X, and the insertion characters ‘-’ and ‘.’, in order to confirm that these comprise the majority of a sequence. A set number of sequences are read, and the file is assumed not to be in protein space if none of these sequences is comprised primarily of the defined amino acid set.

Parameters:
  • seq_file – str, Name of fasta/q file to read.

  • req_perc – float, Proportion of characters in a sequence that must be amino acid residues before declaring the sequences as being in amino acid space.

  • max_seqs_to_read – int, Maximum sequences to read before declaring sequence file to not be in amino acid space.

Returns:

boolean, True if sequences are in amino acid space.
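
One consequence of the character set described above is that a, c, g, t and n are themselves amino acid codes, so a check of this kind will also accept nucleotide sequences. The sketch below illustrates this. The name looks_like_protein is hypothetical, and the case-insensitive comparison and the >= threshold are assumptions.

```python
# The 20 standard amino acids, plus X and the insertion characters '-' and '.'.
PROTEIN_CHARS = set("acdefghiklmnpqrstvwy" + "x" + "-.")


def looks_like_protein(seqs, req_perc=0.95, max_seqs_to_read=10):
    """Return True if any of the first few sequences is mostly amino acid characters."""
    for index, seq in enumerate(seqs):
        if index >= max_seqs_to_read:
            break
        if not seq:
            continue
        hits = sum(1 for char in seq.lower() if char in PROTEIN_CHARS)
        if hits / len(seq) >= req_perc:
            return True
    return False


print(looks_like_protein(["MKVLAAGIVGLLLAQ"]))  # True
print(looks_like_protein(["ACGTACGTNNACGT"]))   # True as well: DNA letters are valid residues
```

If the two cases need to be told apart, running the nucleotide check first and treating the protein check as a fallback is one option.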

gtdblib.util.bio.seq_io.read(seq_file)

Read sequences from fasta/q file.

Parameters:

seq_file – str, Name of fasta/q file to read.

Returns:

dict[seq_id] -> seq, Sequences indexed by sequence id.

gtdblib.util.bio.seq_io.read_fasta(fasta_file, keep_annotation=False)

Read sequences from fasta file.

Parameters:
  • fasta_file – str, Name of fasta file to read.

  • keep_annotation – boolean, Determine if sequence id should contain annotation.

Returns:

dict[seq_id] -> seq, Sequences indexed by sequence id.

gtdblib.util.bio.seq_io.read_fasta_seq(fasta_file, keep_annotation=False)

Generator function to read sequences from fasta file.

This function is intended to be used as a generator in order to avoid having to have large sequence files in memory. Input file may be gzipped.

Example:
>>> from gtdblib.util.bio import seq_io
>>> for seq_id, seq in seq_io.read_fasta_seq(fasta_file):
...     print(seq_id)
...     print(seq)

Parameters:
  • fasta_file – str, Name of fasta file to read.

  • keep_annotation – boolean, Determine if annotation string should be returned.

Returns:

Iterator[seq_id, seq, [annotation]], Unique id of the sequence followed by the sequence itself, and the annotation if keep_annotation is True.
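
To show why a generator keeps memory use low, here is a minimal streaming FASTA reader. It is a sketch, not the library's implementation: the name iter_fasta is hypothetical, gzip input is detected from a '.gz' file extension, and the sequence id is assumed to be the first whitespace-delimited token of the header, with the remainder as the annotation.

```python
import gzip


def iter_fasta(path, keep_annotation=False):
    """Yield one FASTA record at a time so only a single sequence is held in memory."""
    # Assumption: compressed input is recognised by its file extension.
    opener = gzip.open if path.endswith(".gz") else open
    seq_id, annotation, chunks = None, "", []

    def record():
        seq = "".join(chunks)
        return (seq_id, seq, annotation) if keep_annotation else (seq_id, seq)

    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if seq_id is not None:
                    yield record()
                header = line[1:].split(None, 1)
                seq_id = header[0] if header else ""
                annotation = header[1] if len(header) > 1 else ""
                chunks = []
            elif line and seq_id is not None:
                chunks.append(line)
        if seq_id is not None:
            yield record()
```

Wrapped sequence lines are joined only when a record is complete, so memory use is bounded by the longest single sequence.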

gtdblib.util.bio.seq_io.read_fastq(fastq_file)

Read sequences from fastq file.

Parameters:

fastq_file – str, Name of fastq file to read.

Returns:

dict[seq_id] -> seq, Sequences indexed by sequence id.

gtdblib.util.bio.seq_io.read_fastq_seq(fastq_file)

Generator function to read sequences from fastq file.

This function is intended to be used as a generator in order to avoid having to have large sequence files in memory. Input file may be gzipped.

Example:
>>> from gtdblib.util.bio import seq_io
>>> for seq_id, seq in seq_io.read_fastq_seq(fastq_file):
...     print(seq_id)
...     print(seq)

Parameters:

fastq_file – str, Name of fastq file to read.

Returns:

Iterator[seq_id, seq], Unique id of the sequence followed by the sequence itself.
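
A comparable sketch for fastq input, assuming each record occupies exactly four lines (header, sequence, '+' separator, quality string) and that gzip input is identified by a '.gz' extension. The name iter_fastq is hypothetical, and multi-line fastq records are not handled.

```python
import gzip


def iter_fastq(path):
    """Yield (seq_id, seq) pairs from a four-line-per-record fastq file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline()
            if not header:
                break  # end of file
            if not header.strip():
                continue  # tolerate stray blank lines between records
            seq = handle.readline().strip()
            handle.readline()  # '+' separator line, ignored
            handle.readline()  # quality string, ignored
            # Assumption: the id is the first token after the leading '@'.
            yield header[1:].split(None, 1)[0], seq
```

Reading with readline() rather than loading the file keeps one record in memory at a time, which is the point of exposing the reader as a generator.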

gtdblib.util.bio.seq_io.read_seq(seq_file, keep_annotation=False)

Generator function to read sequences from fasta/q file.

This function is intended to be used as a generator in order to avoid having to have large sequence files in memory. Input file may be gzipped and in either fasta or fastq format. It is slightly more efficient to directly call read_fasta_seq() or read_fastq_seq() if the type of input file is known.

Example:
>>> from gtdblib.util.bio import seq_io
>>> for seq_id, seq in seq_io.read_seq(fasta_file):
...     print(seq_id)
...     print(seq)

Parameters:
  • seq_file – str, Name of fasta/q file to read.

  • keep_annotation – boolean, Determine if annotation string should be returned.

Returns:

Iterator[seq_id, seq, [annotation]], Unique id of the sequence followed by the sequence itself, and the annotation if keep_annotation is True.

gtdblib.util.bio.seq_io.seq_lengths(fasta_file)

Calculate length of each sequence.

Parameters:

fasta_file – str, Fasta file containing sequences.

Returns:

dict[seq_id] -> length, Length of each sequence.

gtdblib.util.bio.seq_io.write_fasta(seqs, output_file, wrap=80)

Write sequences to fasta file.

If the output file has the extension ‘gz’, it will be compressed using gzip.

Parameters:
  • seqs – dict[seq_id] -> seq, Sequences indexed by sequence id.

  • output_file – str, Name of fasta file to produce.

  • wrap – int, Number of characters per line.
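
The wrapping and compression behaviour can be sketched as follows. The name write_fasta_sketch is hypothetical, and the sketch assumes that wrap is a positive integer and that compression is triggered by a file name ending in '.gz'.

```python
import gzip


def write_fasta_sketch(seqs, output_file, wrap=80):
    """Write sequences as FASTA, breaking each sequence into lines of `wrap` characters."""
    # Assumption: a '.gz' suffix selects gzip compression.
    opener = gzip.open if output_file.endswith(".gz") else open
    with opener(output_file, "wt") as out:
        for seq_id, seq in seqs.items():
            out.write(f">{seq_id}\n")
            # Slice the sequence into consecutive chunks of at most `wrap` characters.
            for start in range(0, len(seq), wrap):
                out.write(seq[start:start + wrap] + "\n")
```

For example, a 10-character sequence written with wrap=4 produces three sequence lines of 4, 4 and 2 characters.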

Module contents