Alphabets

In order to accommodate different type of data (DNA, protein, codons or microsatellites, for example) a special class has been designed, Alphabet, which centralizes the list of valid allelic values (separated in exploitable and missing alleles). Even though the user can define his own alphabets, it is recommended to use pre-defined instances present in the package, for two reasons: first, a number of tools expect data of a particular type and in this case the alphabet is used to validate the data type (nucleotide sequences must be coded using the alphabets.DNA alphabet), and, second, some of the pre-implemented alphabets are optimized (especially DNA). Pre-defined instances are available in the alphabets module.

As a result, in order to instanciate an instance of the class Align , Container or Site it is mandatory to indicate under which alphabet they should operate. For example to import a DNA alignment from a file, one can just type the following:

>>> aln = egglib.io.from_fasta('align1.fas', alphabet=egglib.alphabets.DNA)

Note

Some functions require pre-implemented alphabets, such as tools.to_codons(), tools.to_bases() or the functions of the class tools.Translator. Other functions are completely generic in this regard and accept custom alphabets so long as they are well-constructed.

Facilities available in Align/Container instances

Automatic conversion of string input

To visualize data one can use the function string() from the SequenceView class such as in the following example:

>>> aln = egglib.io.from_fasta('align4.fas', alphabet=egglib.alphabets.DNA)
>>> print(aln.get_sequence(0).string())
ACCGTGGAGAGCGCGTTGCA

The following is a similar way to obtain the same result:

>>> aln = egglib.io.from_fasta('align4.fas', alphabet=egglib.alphabets.DNA)
>>> print(aln.get_sequence(0)[:])

Be careful however that sequences will only spawn strings when the underlying alphabet allows it. It is not the case when using simulated data from the coalescent simulator since the simulator does not use a DNA alphabet in order to be able to accommodate more general models, or when processing numeric data (microsatellite data):

>>> coal = egglib.coalesce.Simulator(1, num_chrom=[10], theta=2.0)
>>> aln = coal.simul()
>>> print(aln.get_sequence(0).string())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/stephane/.local/lib/python3.9/site-packages/egglib/_interface.py", line 245, in string
    raise ValueError('cannot generate sequence string with alphabet {0}'.format(self._parent._alphabet.name))
ValueError: cannot generate sequence string with alphabet KAM:2
>>> seq =  aln.get_sequence(0)
>>> print(seq[:])
[0, 0, 0, 1, 0, 1, 0]

It is however possible to visualize alignments generated by this type of simulations. One can use the fasta() method and specifying the egglib.alphabets.DNA as an argument:

>>> print(aln.fasta(alphabet=egglib.alphabets.DNA))
>
AAACACA
>
CCAACAC
>
AAACACA
>
AACAAAA
>
AAAAAAA
>
AAACACA
>
AAACACA
>
AAACAAA
>
AAAAAAA
>
AAACAAA

This will map the simulated alleles to nucleotide bases, based on an arbitrary order of bases and generate an intuitive nucleotide sequence. Note that the latter will work only if the mutation model used for simulations allows at most four alleles.

Protein translation

EggLib allows you to translate DNA sequences to proteins. The translation operates in two steps: DNA to codons and codons to protein. The standard is therefore to use the function tools.to_codons() which converts the alignment from the DNA alphabet to the codon alphabet. Translation requires codon sequences. Other functions that strictly require coding sequences are tools.backalign(), tools.trailing_stops(), tools.iter_stops(), tools.has_stop() and the class CodingDiversity. To perform the DNA to codon extraction, and if the sequences are not an open reading frame (the default assumption), a tools.ReadingFrame instance indicating the bounds of the reading frame must be passed as an argument to the tools.to_codons() function.

EggLib has built-in support for all genetic codes described in USA’s National Center for Biotechnology Information (NCBI) database (http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi). The genetic codes and their identifiers are listed here.

Protein translation tools lie in the tools module, which gathers various functionalities that are described in the following of this section.

The tools.translate() function

The simplest way to translate nucleotide sequences to amino acids with EggLib is probably the tools.translate() function. This function is flexible: it takes as input alignments, sequence sets, or individual sequences, and returns a corresponding object, as listed in the table below:

Input type

Returned type

Align

Align

Container

Container

SequenceView

str

str

str

Once the DNA-to-codon conversion performed, the usage is straightforward, as exemplified below:

>> aln = egglib.Align.create([
...         ('sample1',  'TTGCTAGGTGTATAG'),
...         ('sample2',  'TTCCTAGATGAATAG'),
...         ('sample3',  'ATGCTAGATGAATAG')],
...         alphabet=egglib.alphabets.DNA)
>>> aln.to_codons()
>>> prot = egglib.tools.translate(aln)
>>> print(prot.fasta())
>sample1
LLGV*
>sample2
FLDE*
>sample3
MLDE*

The code option of this function lets you specify the genetic code to be used, if it is not the standard one. Among others, the option in_place tells the function to overwrite the object you provide, instead of returning a new one. This can be useful in memory- or especially time-critical applications:

>>> aln = egglib.Align.create([
...         ('sample1',  'TTGCTAGGTGTATAG'),
...         ('sample2',  'TTCCTAGATGAATAG'),
...         ('sample3',  'ATGCTAGATGAATAG')],
...         alphabet = egglib.alphabets.DNA)
>>> aln.to_codons()
>>> egglib.tools.translate(aln, in_place=True) # returns None
>>> print(aln.fasta())
>sample1
LLGV*
>sample2
FLDE*
>sample3
MLDE*

translate() can also translate individual sequences, either provided as a SequenceView or a str (in_place does not work in these two cases):

>>> print(egglib.tools.translate('CCATTGGTAATGGCC'))
PLVMA

Use the allow_alt=True option to support alternative start codons (a rare phenomenon accounted for by all genetic codes).

“Smart” translation

By default, codons with any missing data are translated as missing data (the X character). However, in certain cases it is possible to guess: for example (in the standard genetic code), CTN necessarily translates to a Leucine because all four possibilities (CTA, CTC, CTG, and CTT) do. Similarly, both AGC and TGC encode a Serine, so WGC also necessarily encodes a Serine. The option smart=True turns on automatic detection of those case, based on the table of ambiguity characters in the nomenclature for nucleotide codes reproduced in the table below:

Symbol

Meaning

G

G

A

A

T

T

C

C

R

G or A

Y

T or C

M

A or C

K

G or T

S

G or C

W

A or T

H

A or C or T

B

G or T or C

V

G or A or C

D

G or A or T

N

G or A or T or C

The tools.Translator class

There is a Translator class in the tools module that performs the same operations as the tools.translate() function. If you need to translate many data sets in one go, it will probably be faster to use this class as in the example below, where we assume that aligns is a list of Align instances:

>>> trans = egglib.tools.Translator(code=1)
>>> for aln in aligns:
...    trans.translate_align(aln, in_place=True)

The option code=1 is actually the default. It is shown just to show that options are specified in the class’s constructor.

Detecting open reading frames and processing stop codons

There are additionnal tools helping you to manage coding sequences, or sequences containing open reading frames:

  • tools.orf_iter() provides an iterator over all possible open reading frames of the sequence. It is configurable (see options) and is used as this (with default options):

    >>> for start, stop, length, frame in egglib.tools.orf_iter(seq):
    ...    print(seq[start:stop])
    

    The example displays the sequence of each open reading frame, although some may need to be reverse-complemented, as determined by the sign of the frame variable.

  • tools.longest_orf() is a shortcut to find the longest possible open reading frame. If there is a tie, the function does not take the decision for you and raises an exception.

  • tools.trailing_stops() detects and optionally fixes stop codons at the end of sequences of an alignment.

  • tools.iter_stops() provides you with an iterator over the positions of all stop codons of the alignment.

  • tools.has_stop() is a shortcut to test if an alignment contains any stop codons.