Top level components

The top-level EggLib namespace contains the most useful classes and functions, as well as items that are not specialised to a particular task. They can be grouped in several categories according to the task they are aimed to fulfill:

Holding multi-site data

The two classes Align and Container (and their helpers) allow to store data sequence data for multiple samples as well as any form of genetic data as long as there can be several data items for multiple samples.

egglib.Align(alphabet[, nsam, nsit, init])

Dataset of aligned sequences.

egglib.Container(alphabet)

Dataset of unaligned sequences.

egglib.SampleView(parent, index)

Proxy class representing a given sample.

egglib.SequenceView(parent, index)

Proxy class representing the sequence of a given sample.

egglib.LabelView(parent, index)

Proxy class representing the labels of a given sample.

egglib.encode(obj[, nbits])

Temporarily rename samples using unique keys. This is a context manager that can be used to perform some code using a Align or Container instance with temporarily renamed samples. The names are encoded (using the encode() method) before executing user code and then decoded (using rename()) as the end of the block (even if an exception occurs)::.

Holding single-site data

The class Site allows to store any kind of data at a single site. There are three independent functions to create instances of this class. Freq only holds frequencies.

egglib.Site([alphabet])

Store alleles from a single site.

egglib.site_from_align(align, index)

Create a site from a position of an alignment.

egglib.site_from_list(array, alphabet)

Create a site based on a list of allelic values.

egglib.site_from_vcf(vcf[, start, stop])

Extract a site from a VCF parser.

egglib.Freq()

Hold allelic and genotypic frequencies for a single site.

egglib.freq_from_site(site[, struct])

Compute allele frequencies from a site.

egglib.freq_from_list(ingroup, outgroup[, ...])

Import allele frequencies.

egglib.freq_from_vcf(vcf)

Import allelic frequencies from a VCF parser.

Representing data with alphabets

Alphabets are described in their specific module. The class for creating custom alphabets is available in the top EggLib namespace.

egglib.Alphabet(cat, expl, miss[, ...])

Define acceptable lists of alleles for a type of data.

Describing sample structure

Sample structure is described by a specific class. There are four methods allowing to generate instances.

egglib.Structure()

Class describing organisation of samples.

egglib.struct_from_labels(data[, lvl_clust, ...])

Extract structure from labels attached to an alignment.

egglib.struct_from_samplesizes(pops[, ...])

Build a structure based on sample size per population.

egglib.struct_from_iterable(iterable[, fmt, ...])

Build a structure based on labels stored in an iterable.

egglib.struct_from_dict(ingroup, outgroup)

Build a structure from dictionaries.

Trees

Trees are described by a specific class, with a companion class representing a particular node of a tree.

egglib.Tree([fname, string])

Editable genealogical or phylogenetic tree.

egglib.Node([label])

Manage a single node of a given tree.

Module content documentation

class egglib.Align(alphabet, nsam=0, nsit=0, init=None)[source]

Dataset of aligned sequences. The data consists of a given number of samples, each with the same number of sites. There can be any number of labels for any sample meaning that samples can be described by several labels in addition to their name. The type of genetic data stored in the instance is determined by the alphabet. The optional constructor arguments allow to setup a matrix of a pre-defined dimension (by default, ns and ls are equal to 0). If samples are created at construction, they have empty names and no labels.

Parameters:
  • alphabet – an Alphabet instance. There must be at least one valid allele (either exploitable or missing).

  • num_sam – number of samples to initialize.

  • num_sit – number of sites to initialize.

  • init – initial values for all data entries. Must be a valid allele of the alphabet. Ignored if num_sam or num_sit is 0. By default, use the first exploitable allele or the first missing allele if there are none). Only considered if num_sam and num_sit are set to non-null values.

Both Align and Container instances support the following operations:

Operation

Alternative

Action

Notes

len(aln)

aln.ns

number of samples

for sam in aln: ...

perform an iteration over samples

aln[i]

aln.get_sample(i)

access a given sample

aln[i] = sam

aln.set_sample(i, *sam)

copy data over an existing sample

del aln[i]

aln.del_sample(i)

delete a sample

name in aln

aln.find(name) != None

True if at least one sample has this name

name not in aln

aln.find(name) == None

True if no sample has this name

Notes:

  1. To iterate over sites, use: for site in aln.iter_sites().

  2. All values that can be interpreted as a (name, sequence) or (name, sequence, labels) tuple can be passed as right-hand operand (including SampleView instances).

Attributes

alphabet

Alphabet instance associated to this object.

is_matrix

True if the instance is an Align, and False if it is a Container.

ls

Alignment length.

ns

Number of samples.

Methods

add_label(sample, value)

Add a group label to a specific sample.

add_sample(name, data[, labels])

Add a sample to the instance.

add_samples(items)

Add several samples at the end of the instance.

clear()

Clear the instance and release all memory.

column(index)

Extract the allele values of a site at a given position.

consensus()

Generates the consensus of the object.

create(source[, alphabet])

Copy provided data and create a new instance.

del_columns(site[, num])

Delete one or more full columns of the alignment.

del_sample(index)

Delete a sample.

encode([nbits])

Rename all samples using unique keys.

extract(...)

Create a sub-alignment based on specified positions.

fasta([fname, first, last, alphabet, ...])

Export alignment in the fasta format.

filter(ratio[, relative])

Remove the sequences with too few exploitable data.

find(name[, regex, flags, multi, index])

Find a sample by its name.

find_motif(sample, motif[, start, stop])

Locate the first instance of a sequence motif.

fix_ends([replaced, replacing])

Replace characters at both ends of each sequence.

get(sample, site)

Get a data entry.

get_label(sample, index)

Get a group label.

get_name(index)

Get the name of a sample.

get_sample(index)

Get a sample as a SampleView instance.

get_sequence(index)

Get a sequence as a SequenceView.

group_mapping([level, as_position, liberal])

Map labels to samples.

insert_columns(position, values)

Insert sites at a given position of an alignment.

intersperse(length[, positions, alleles])

Insert non-varying sites within the alignment.

iter_sites([start, stop])

Iterate over sites of the alignment.

lower(index[, start, stop])

Converts allele values of a sample to lower case.

name_mapping()

Map sample names to lists of SampleView instances.

names()

Generate the list of sample names.

nexus([prot])

Generates a simple nexus-formatted string.

phylip([format])

Return a phylip-formatted string.

phyml()

Return a phyml-formatted string.

random_missing(rate[, missing])

Randomly introduces missing data in the current instance.

remove_duplicates()

Remove all duplicates, based on name exact matching.

rename(mapping[, liberal])

Rename sequences of the instance.

reserve([nsam, lnames, nlbl, nsit])

Pre-allocate memory.

reset([alphabet])

Reset the instance.

set(sample, site, value)

Set a data entry.

set_label(sample, index, value)

Set a group label.

set_name(index, name)

Set the name of a sample.

set_sample(index, name, sequence[, labels])

Overwrite all data for a given sample.

set_sequence(index, value)

Overwrite sequence data for a given sample.

slider(wwidth, wstep)

Run a sliding window over the alignment.

subset(samples)

Extract a subset of samples.

to_bases()

Codon to DNA conversion.

to_codons([frame])

DNA to codons conversion.

upper(index[, start, stop])

Converts allele values of a sample to upper case.

add_label(sample, value)

Add a group label to a specific sample.

Parameters:
  • sample – sample index.

  • value – new label value.

add_sample(name, data, labels=None)

Add a sample to the instance.

Parameters:
  • name – name of the new sample.

  • data – an iterable of allele values containing the data to set for the new sample. For an Align instance and if the sample is not the first, the number of data must fit any previous ones.

  • labels – if not None, must be an iterable of strings

add_samples(items)

Add several samples at the end of the instance.

Parameters:

items – must be a sequence, an Align instance, or a Container instance. If a sequence is passed, each item must be of length 2 or 3 and contain the sample name string, the sequence of data values and, if provided, the sequence of labels. See the method add_sample() for more details about the structure of each item. If the current instance is an Align, all sequence length must be identical. For a Container, sequences may have different lengths.

property alphabet

Alphabet instance associated to this object. This member cannot be modified.

clear()

Clear the instance and release all memory. In most cases, it is preferable to use the method reset().

column(index)[source]

Extract the allele values of a site at a given position.

Parameters:

index – the index of a site within the alignment.

Returns:

A list of allelic values.

consensus()[source]

Generates the consensus of the object. The alphabet must be alphabets.DNA. The consensus is generated based on standard ambiguity codes (see here). The consensus is returned as a string, of length matching the alignment length. In case of a disagrement involving an occurrence of “?” and “-”, the output base is always “?”. Otherwise, the consensus base defined by the IUPAC convention is followed. If a site is not variable, the fixed value is incorporated in the consensus in all cases.

Returns:

A string.

classmethod create(source, alphabet=None)

Copy provided data and create a new instance.

Parameters:

The object source can be:

To be used as in:

>>> aln = egglib.Align.create([('name1', 'AAACCGGCAC', [0]),
...     ('name2', 'AAACCTGCCC', [0]), ('name3', 'TCACCGGCAA', [1])
...     ('name4', 'TCAGAGGCAA', [1])], alphabet=egglib.alphabets.DNA)

or in:

>>> aln2 = egglib.Container.create(aln)

The type of the created object is determined by the class on which you call the method. The object passed an argument can be an instance of the same type or not, or a compatible list. It is therefore possible to convert between Align and Container (but note that in the case of a Container to Align conversion, all sequences must have the same length).

del_columns(site, num=1)[source]

Delete one or more full columns of the alignment.

By default (if num=1), remove a single site. If num is larger than 1, remove a range of sites.

Parameters:
  • site – index of the (first) site to remove. This site must be a valid index.

  • num – maximal number of sites to remove. The value cannot be negative.

del_sample(index)

Delete a sample.

Parameters:

index – index of the sample to delete.

encode(nbits=10)

Rename all samples using unique keys. Generate a random mapping of names to unique keys and use this keys as names for samples. Keys are of length nbits, using alphanumerical characters only but always starting with a capital letter.

Parameters:

nbits – length of the keys (encoded names). This value must be >= 4 and <= 63.

Returns:

A dict mapping all the generated keys to the actual sequence names. The keys are case-dependent and guaranteed not to start with a number.

The returned mapping can be used to restore the original names using rename(). This method is not affected by the presence of sequences with identical names in the original instance (and rename() will also work properly in that case).

extract(...)[source]

Create a sub-alignment based on specified positions.

The thre possible ways to call this method are:

extract(start, stop) to extract a continuous range of sites,

extract(rf) to extract exon positions from a ReadingFrame instance, and

extract(indices) to extract an arbitrary list of positions (in any order).

Parameters:
  • start – first position to extract. This position must be a valid index for this alignment.

  • stop – stop position for the range to extract. This position is not extracted. If this position is equal to or smaller than *start, empty sequences are generated. If this position is equal to or larger than the length of the alignment, or if it is equal to None, all positions until the end of the alignment are extracted.

  • rf – a ReadingFrame object. Note that the keep_truncated argument of ReadingFrame has an effect.

  • indices – a list (or other iterable type with a length) of alignment positions (or column indices). This list may contain repetitions and does not need to be sorted. The positions will be extracted in the specified order.

Returns:

A new Align instance.

Note

Keyword arguments are not supported.

fasta(fname=None, first=None, last=None, alphabet=None, labels=False, linelength=50)

Export alignment in the fasta format. The data are required to be encoded using a char or a string alphabet. If not, the user may use the alphabet argument to pass a valid alphabet that will be used for exporting.

Parameters:
  • fname – name of the file to export data to. By default, the file is created (or overwritten if it already exists). If the option append is True, data is appended at the end of the file (and it must exist). If fname is None (default), no file is created and the formatted data is returned as a string. In the alternative case, nothing is returned and the fasta string is written to a file.

  • first – if only part of the sequences should be exported: index of the first sequence to export. By default, use the first sequence if any, otherwise, generate an empty string.

  • last – if only part of the sequences should be exported: index of the last sequence to export. If the value is larger than the index of the last sequence, all sequences are exported until the last (this is the default). If last < first, no sequences are exported.

  • alphabet – exporting class:.Alphabet instance to use instead of the instance’s alphabet. The provided alphabet must contain character or string values, and all data in the object must be valid with respect to this alphabet. All alleles of the instance are mapped to an allele of the exporting alphabet based on their rank in their respective alphabets.

  • labels – a boolean indicating whether group labels should be exported or ignored.

  • linelength – the length of lines for internal breaking of sequences.

Returns:

If fname is None: a fasta-formatted string. Otherwise: None.

filter(ratio, relative=False)[source]

Remove the sequences with too few exploitable data. The list of exploitable data is defined by the alphabet attached to the instance. This method modifies the current instance and returns None.

Parameters:
  • ratio – limit threshold, expressed as a proportion of the alignment length (default) or as a proportion of the maximum number of exploitable data over all considered samples (if the relative argument is True).

  • relative – Take as a reference the number of exploitable data of the sequence which has most.

Note

If the length of the alignment is 0,nothing is done.

find(name, regex=False, flags=None, multi=False, index=False)

Find a sample by its name.

Parameters:
  • name – name of sample to identify.

  • regex – a boolean indicating whether the value passed as name is a regular expression. If so, the string is passed as is to the re module (using function re.search()). Otherwise, only exact matches will be considered.

  • flags – list of flags to be passed to re.search() (only considered if regex is True). For example, when looking for samples containing the term “sample” but being case-insensitive, use the following syntax: align.find("sample", regex=True, flags=[re.I]). By default (None) no further argument is passed.

  • multi – a boolean indicating whether all hits should be returned. If so, a list of SampleView instances is always returned (the list will be empty in case of no hits). Otherwise, a single SampleView instance (or its index) will be returned for the first hit, or None in case of no hits.

  • index – boolean indicating whether the index of the sample should be returned. In that case return values for hits are int (by default, SampleView instances).

Returns:

The type of the returned value depends on the argument multi and on whether any hits were found:

multi

index

return value

False

False

SampleView or None

False

True

integer or None

True

False

list of SampleView instances

True

True

list of integers

find_motif(sample, motif, start=0, stop=None)

Locate the first instance of a sequence motif.

Return the index of the first exact hit to a given substring. The returned value is the position of the first base of the hit. Only exact forward matches are implemented. To use regular expression (for example to find degenerated motifs), one should extract the string for the sequence and use a tool such as the regular expression module (re).

Parameters:
  • sample – sample index.

  • motif – list of allelic values constituting the motif to search.

  • start – position at which to start searching. The method will never return a value smaller than start. By default, search from the start of the sequence.

  • stop – position at which to stop search (the motif cannot overlap this position). No returned value will be larger than ` stop - len(motif)``. By default, or if stop is equal to or larger than the length of the sequence, search until the end of the sequence.

fix_ends(replaced='-', replacing='?')[source]

Replace characters at both ends of each sequence.

Replace all leading and trailing occurrence of the allele specified by replaced by the one specified by replacing. Internal occurrences of the replaced allele (those having at least one character other than either the replaced and replacing allele at both sides) are left unchanged. Both alleles are required to be valid alleles for this alphabet

get(sample, site)

Get a data entry.

Parameters:
  • sample – sample index.

  • site – site index.

Returns:

An allele value (type depends on the alphabet)

get_label(sample, index)

Get a group label.

Parameters:
  • sample – sample index.

  • index – label index.

Returns:

A string.

get_name(index)

Get the name of a sample.

Parameters:

index – sample index.

Retur:

A string.

get_sample(index)

Get a sample as a SampleView instance. The returned object allows to modify the underlying data.

Parameters:

index – index of the sample to access.

Returns:

A SampleView instance.

get_sequence(index)

Get a sequence as a SequenceView. The returned object allows modifying the underlying data.

Parameters:

index – sample index.

Returns:

A SequenceView instance.

group_mapping(level=0, as_position=False, liberal=False)

Map labels to samples. Each sample is either represented by a SampleView instance (by default) or its position index within the instance.

Parameters:
  • level – index of labels to consider.

  • as_position – if True, represent samples by their position index in the instance instead of a SampleView instance.

  • liberal – if True, ignore samples for which this label level is not defined (by default, an IndexError is raised).

Returns:

A dict of lists, containing either integers or SampleView.

insert_columns(position, values)[source]

Insert sites at a given position of an alignment.

Parameters:
  • position – the position at which to insert sites. Sites are inserted before the specified position, so the user can use 0 to insert sites at the beginning of the sequence. To insert sites at the end of the sequence, pass the current length of the alignment, or None. If position is larger than the length of the sequence or None, new sites are inserted at the end of the alignment. The position might be negative to count from the end. Warning: the position -1 means before the last position.

  • values – a sequence to insert for each sample. To insert a single site, use a single-item list, or a single-character string for character alphabets (such as alphabets.DNA).

intersperse(length, positions=None, alleles='A')[source]

Insert non-varying sites within the alignment. The current object is permanently modified.

Parameters:
  • length – Desired length of the final alignment. If the value is smaller than the original (current) alignment length, nothing is done and the alignment is unchanged.

  • positions – Position that each site of the current alignment must have in the final alignment. The number of positions must be equal to the number of sites of the alignment (before interspersing). The argument value must be either a sequence of positive integers or a sequence of real numbers comprised between 0 and 1. In either case, values must be in increasing order. In the former case, the last (maximal) value must be smaller than the desired length of the final alignment. In the latter case, values are expressed relatively, and they will be converted to integer indices by the method. In that case, if site positioning is non-trivial (typically, if conversion of positions to integer yield identical position for different conscutive sites), it will be resolved randomly. By default (if None), sites are placed regularly along the final alignment length. If int and float types are mixed, the first occurring type will condition what will happen.

  • alleles – List of allelic values providing the alleles to be used to fill non-varying positions of the resulting alignment. If there is more than one allele, the allele will be picked randomly for each site, independently for each inserted site. All alleles must be valid alleles for the current alphabet.

property is_matrix

True if the instance is an Align, and False if it is a Container.

iter_sites(start=0, stop=None)[source]

Iterate over sites of the alignment.

Parameters:
  • start – first site to consider; by default, start at beginning of alignment.

  • stop – site where to stop iteration (this site is not included); by default, continue until the end of the alignment.

This method allows to run the following type of iterations:

>>> for site in aln:
...     ...

where site is a unique Site object which is reset and updated at each iteration round.

lower(index, start=0, stop=None)

Converts allele values of a sample to lower case. Only applicable for alleles that have an lower case equivalent. All other allele values are ignored. The underlying object data is modified and this method returns None. Only accepted for case-sensitive character and string alphabets. Furthermore, all lower case alleles that are implied are required to be valid alleles in the alphabet.

Parameters:
  • index – sample index.

  • start – index of the first allele to convert. By default, from the start of the sequence.

  • stop – stop index (index of the allele immediately after the last allele to convert). By default, convert until the end of the sequence.

property ls

Alignment length. This value cannot be set or modified directly.

name_mapping()

Map sample names to lists of SampleView instances. This method is most useful when several sequences have the same name. It may be used to detect and process duplicates.

Returns:

A dict.

names()

Generate the list of sample names.

Returns:

A list.

nexus(prot='auto')[source]

Generates a simple nexus-formatted string. If prot is True, add datatype=protein in the file, allowing it to be imported as proteins (but doesn’t perform further checking). If prot is False, never add this command. By default, add the protein datatype command if the alphabet is alphabets.protein.

Returns:

A nexus-formatted string. Note: any spaces and tabs in sequence names are replaced by underscores.

Note

This nexus implementation is minimal but will normally suffice to import sequences to programs expecting nexus.

The data must be exportable as strings.

property ns

Number of samples.

phylip(format='I')[source]

Return a phylip-formatted string. Raise a ValueError if any name of the instance contains at least one character of the following list: ()[]{},; or a space, tab, newline or linefeed. Labels are never exported. Sequence names cannot be longer than 10 characters. A ValueError will be raised if a longer name is met. format must be ‘I’ or ‘S’ (case-independent), indicating whether the data should be formatted in the sequential (S) or interleaved (I) format (see PHYLIP’s documentation for definitions). The user is responsible of ensuring that all names are unique. If not, the exported file may cause subsequent programs to fail.

The alphabet must be represented as characters (such as alphabets.DNA).

phyml()[source]

Return a phyml-formatted string. The phyml format is suitable as input data for the PhyML and PAML programs. Raise a ValueError if any name of the instance contains at least one character in the following list: ()[]{},; or a space, tab, newline or linefeed. Labels are never exported.

The alphabet must be represented as characters (such as alphabets.DNA).

random_missing(rate, missing='N')[source]

Randomly introduces missing data in the current instance. Random positions of the alignment are changed to missing data. Only data that are currently non-missing data are considered.

Parameters:
  • rate – probability that a non-missing allele is turned into missing data.

  • missing – missing data allele value, to be used for all replacements. This must be a valid missing allele for the instance’s alphabet.

remove_duplicates()

Remove all duplicates, based on name exact matching. For all pairs of samples with identical name, only the one occurring first is conserved. The current instance is modified and this method returns None.

rename(mapping, liberal=False)

Rename sequences of the instance.

Parameters:
  • mapping – a dict providing the mapping of old names (as keys) to new names (which may, if needed, contain duplicates).

  • liberal – if this argument is False and a name does not appear in mapping, a ValueError is raised. If liberal is True, names that don’t appear in mapping are left unchanged.

Returns:

The number of samples that have been actually renamed, overall.

reserve(nsam=0, lnames=0, nlbl=0, nsit=0)

Pre-allocate memory. This method can be used when the size of arrays is known a priori, in order to speed up memory allocations. It is not necessary to set all values. Values less than 0 are ignored.

Parameters:
  • nsam – number of samples in the ingroup.

  • lnames – length of sample names.

  • nlbl – number of labels.

  • nsit – number of sites.

reset(alphabet=None)

Reset the instance. If the alphabet is specified, it replaces the previous one.

set(sample, site, value)

Set a data entry. The value must be a valid allelic value for this instance.

Parameters:
  • sample – sample index.

  • site – site index.

  • value – allele value (type depends on the alphabet).

set_label(sample, index, value)

Set a group label.

Parameters:
  • sample – sample index.

  • index – label index.

  • value – new label value.

set_name(index, name)

Set the name of a sample.

Parameters:
  • index – index of the sample

  • name – new name value.

set_sample(index, name, sequence, labels=None)

Overwrite all data for a given sample.

Parameters:
  • index – index of the sample to access (slices are not permitted).

  • name – new name of the sample.

  • sequence – list of allelic values giving the new values to set. str is supported for character alphabets such as alphabets.DNA. In case of an Align, it is required to pass a sequence with length matching the number of sites of the instance.

  • labels – a list of string labels. The default corresponds to an empty list.

set_sequence(index, value)

Overwrite sequence data for a given sample.

Parameters:
  • index – sample index.

  • value – can be a SequenceView instance, a list of allelic values, or a string (only for single-character alphabets such as alphabets.DNA). In the case of an Align instance, the length of the sequence must match the current alignment length.

slider(wwidth, wstep)[source]

Run a sliding window over the alignment.

This method returns an iterator that can be used as:

>>> for window in align.slider(wwidth, wstep):
...     ...

where, for each step, window will be the reference an Align instance of length wwidth (or less if not enough sequence is available near the end of the alignment). Each step moves forward following the value of wstep. Note that the returned Align object is always the same and is updated at each iteration round.

Parameters:
  • wwidth – size of the window (the last window might be smaller).

  • wstep – iteration step.

Returns:

An iterator.

subset(samples)

Extract a subset of samples. Generate and return a copy of the instance with only a specified list of samples. Sample indices are not required to be consecutive.

Parameters:

samples – a list of sample indices, or a Structure instance, giving the list of samples that must be exported to the returned object. In the latter case, all samples referred to in the structure are extracted.

Returns:

A new instance of the same type.

to_bases()[source]

Codon to DNA conversion. Convert the instance from the alphabets.codons alphabet to the alphabets.DNA alphabet. The number of sites will be multiplied by three.

to_codons(frame=None)[source]

DNA to codons conversion. Convert the instance from the alphabets.DNA alphabet to the alphabets.codons alphabet. This method overwrites the current instance. After the conversion, there will one site per codon.

Parameters:

frame – a ReadingFrame instance providing the exon positions in the correct frame. By default, a non-segmented frame covering all sequences is assumed (in case the provided alignment is the coding region; in such case the length must be a multiple of 3).

upper(index, start=0, stop=None)

Converts allele values of a sample to upper case. Only applicable for alleles that have an upper case equivalent. All other allele values are ignored. The underlying object data is modified and this method returns None. Only accepted for case-sensitive character and string alphabets. Furthermore, all upper case alleles that are implied are required to be valid alleles in the alphabet.

Parameters:
  • index – sample index.

  • start – index of the first allele to convert. By default, from the start of the sequence.

  • stop – stop index (index of the allele immediately after the last allele to convert). By default, convert until the end of the sequence.

class egglib.Container(alphabet)[source]

Dataset of unaligned sequences.

This class shares most of its functionalities with Align. The fundamental difference is that each sample may have a different number of sites.

All operations listed here are available for Container instances.

Parameters:

alphabet – an Alphabet instance.

Attributes

alphabet

Alphabet instance associated to this object.

is_matrix

True if the instance is an Align, and False if it is a Container.

ns

Number of samples.

Methods

add_label(sample, value)

Add a group label to a specific sample.

add_sample(name, data[, labels])

Add a sample to the instance.

add_samples(items)

Add several samples at the end of the instance.

clear()

Clear the instance and release all memory.

create(source[, alphabet])

Copy provided data and create a new instance.

del_sample(index)

Delete a sample.

del_sites(sample, site[, num])

Delete data entries from a sample.

encode([nbits])

Rename all samples using unique keys.

equalize([value])

Equalize the length of all sequences.

fasta([fname, first, last, alphabet, ...])

Export alignment in the fasta format.

find(name[, regex, flags, multi, index])

Find a sample by its name.

find_motif(sample, motif[, start, stop])

Locate the first instance of a sequence motif.

get(sample, site)

Get a data entry.

get_label(sample, index)

Get a group label.

get_name(index)

Get the name of a sample.

get_sample(index)

Get a sample as a SampleView instance.

get_sequence(index)

Get a sequence as a SequenceView.

group_mapping([level, as_position, liberal])

Map labels to samples.

insert_sites(sample, position, values)

Insert sites at a given position for a given sample

lower(index[, start, stop])

Converts allele values of a sample to lower case.

ls(index)

Get the number of sites of an ingroup sample.

name_mapping()

Map sample names to lists of SampleView instances.

names()

Generate the list of sample names.

remove_duplicates()

Remove all duplicates, based on name exact matching.

rename(mapping[, liberal])

Rename sequences of the instance.

reserve([nsam, lnames, nlbl, nsit])

Pre-allocate memory.

reset([alphabet])

Reset the instance.

set(sample, site, value)

Set a data entry.

set_label(sample, index, value)

Set a group label.

set_name(index, name)

Set the name of a sample.

set_sample(index, name, sequence[, labels])

Overwrite all data for a given sample.

set_sequence(index, value)

Overwrite sequence data for a given sample.

subset(samples)

Extract a subset of samples.

to_bases()

Codon to DNA conversion.

to_codons()

DNA to codons conversion.

upper(index[, start, stop])

Converts allele values of a sample to upper case.

add_label(sample, value)

Add a group label to a specific sample.

Parameters:
  • sample – sample index.

  • value – new label value.

add_sample(name, data, labels=None)

Add a sample to the instance.

Parameters:
  • name – name of the new sample.

  • data – an iterable of allele values containing the data to set for the new sample. For an Align instance and if the sample is not the first, the number of data must fit any previous ones.

  • labels – if not None, must be an iterable of strings

add_samples(items)

Add several samples at the end of the instance.

Parameters:

items – must be a sequence, an Align instance, or a Container instance. If a sequence is passed, each item must be of length 2 or 3 and contain the sample name string, the sequence of data values and, if provided, the sequence of labels. See the method add_sample() for more details about the structure of each item. If the current instance is an Align, all sequence length must be identical. For a Container, sequences may have different lengths.

property alphabet

Alphabet instance associated to this object. This member cannot be modified.

clear()

Clear the instance and release all memory. In most cases, it is preferable to use the method reset().

classmethod create(source, alphabet=None)

Copy provided data and create a new instance.

Parameters:

The object source can be:

To be used as in:

>>> aln = egglib.Align.create([('name1', 'AAACCGGCAC', [0]),
...     ('name2', 'AAACCTGCCC', [0]), ('name3', 'TCACCGGCAA', [1])
...     ('name4', 'TCAGAGGCAA', [1])], alphabet=egglib.alphabets.DNA)

or in:

>>> aln2 = egglib.Container.create(aln)

The type of the created object is determined by the class on which you call the method. The object passed an argument can be an instance of the same type or not, or a compatible list. It is therefore possible to convert between Align and Container (but note that in the case of a Container to Align conversion, all sequences must have the same length).

del_sample(index)

Delete a sample.

Parameters:

index – index of the sample to delete.

del_sites(sample, site, num=1)[source]

Delete data entries from a sample.

By default (if num=1), remove a single site. If num is larger than 1, remove a range of sites.

Parameters:
  • sample – sample index.

  • site – index of the (first) site to remove. This site must be a valid index.

  • num – maximal number of sites to remove. The value cannot be negative.

encode(nbits=10)

Rename all samples using unique keys. Generate a random mapping of names to unique keys and use this keys as names for samples. Keys are of length nbits, using alphanumerical characters only but always starting with a capital letter.

Parameters:

nbits – length of the keys (encoded names). This value must be >= 4 and <= 63.

Returns:

A dict mapping all the generated keys to the actual sequence names. The keys are case-dependent and guaranteed not to start with a number.

The returned mapping can be used to restore the original names using rename(). This method is not affected by the presence of sequences with identical names in the original instance (and rename() will also work properly in that case).

equalize(value='?')[source]

Equalize the length of all sequences. Extend sequences such as they all have the length of the longest sequence.

Parameters:

value – the value to use to extend sequences. It must be a valid allelic value for the instance’s alphabet.

fasta(fname=None, first=None, last=None, alphabet=None, labels=False, linelength=50)

Export alignment in the fasta format. The data are required to be encoded using a char or a string alphabet. If not, the user may use the alphabet argument to pass a valid alphabet that will be used for exporting.

Parameters:
  • fname – name of the file to export data to. By default, the file is created (or overwritten if it already exists). If the option append is True, data is appended at the end of the file (and it must exist). If fname is None (default), no file is created and the formatted data is returned as a string. In the alternative case, nothing is returned and the fasta string is written to a file.

  • first – if only part of the sequences should be exported: index of the first sequence to export. By default, use the first sequence if any, otherwise, generate an empty string.

  • last – if only part of the sequences should be exported: index of the last sequence to export. If the value is larger than the index of the last sequence, all sequences are exported until the last (this is the default). If last < first, no sequences are exported.

  • alphabet – exporting class:.Alphabet instance to use instead of the instance’s alphabet. The provided alphabet must contain character or string values, and all data in the object must be valid with respect to this alphabet. All alleles of the instance are mapped to an allele of the exporting alphabet based on their rank in their respective alphabets.

  • labels – a boolean indicating whether group labels should be exported or ignored.

  • linelength – the length of lines for internal breaking of sequences.

Returns:

If fname is None: a fasta-formatted string. Otherwise: None.

find(name, regex=False, flags=None, multi=False, index=False)

Find a sample by its name.

Parameters:
  • name – name of sample to identify.

  • regex – a boolean indicating whether the value passed as name is a regular expression. If so, the string is passed as is to the re module (using function re.search()). Otherwise, only exact matches will be considered.

  • flags – list of flags to be passed to re.search() (only considered if regex is True). For example, when looking for samples containing the term “sample” but being case-insensitive, use the following syntax: align.find("sample", regex=True, flags=[re.I]). By default (None) no further argument is passed.

  • multi – a boolean indicating whether all hits should be returned. If so, a list of SampleView instances is always returned (the list will be empty in case of no hits). Otherwise, a single SampleView instance (or its index) will be returned for the first hit, or None in case of no hits.

  • index – boolean indicating whether the index of the sample should be returned. In that case return values for hits are int (by default, SampleView instances).

Returns:

The type of the returned value depends on the argument multi and on whether any hits were found:

multi

index

return value

False

False

SampleView or None

False

True

integer or None

True

False

list of SampleView instances

True

True

list of integers

find_motif(sample, motif, start=0, stop=None)

Locate the first instance of a sequence motif.

Return the index of the first exact hit to a given substring. The returned value is the position of the first base of the hit. Only exact forward matches are implemented. To use regular expression (for example to find degenerated motifs), one should extract the string for the sequence and use a tool such as the regular expression module (re).

Parameters:
  • sample – sample index.

  • motif – list of allelic values constituting the motif to search.

  • start – position at which to start searching. The method will never return a value smaller than start. By default, search from the start of the sequence.

  • stop – position at which to stop search (the motif cannot overlap this position). No returned value will be larger than ` stop - len(motif)``. By default, or if stop is equal to or larger than the length of the sequence, search until the end of the sequence.

get(sample, site)

Get a data entry.

Parameters:
  • sample – sample index.

  • site – site index.

Returns:

An allele value (type depends on the alphabet)

get_label(sample, index)

Get a group label.

Parameters:
  • sample – sample index.

  • index – label index.

Returns:

A string.

get_name(index)

Get the name of a sample.

Parameters:

index – sample index.

Retur:

A string.

get_sample(index)

Get a sample as a SampleView instance. The returned object allows to modify the underlying data.

Parameters:

index – index of the sample to access.

Returns:

A SampleView instance.

get_sequence(index)

Get a sequence as a SequenceView. The returned object allows modifying the underlying data.

Parameters:

index – sample index.

Returns:

A SequenceView instance.

group_mapping(level=0, as_position=False, liberal=False)

Map labels to samples. Each sample is either represented by a SampleView instance (by default) or its position index within the instance.

Parameters:
  • level – index of labels to consider.

  • as_position – if True, represent samples by their position index in the instance instead of a SampleView instance.

  • liberal – if True, ignore samples for which this label level is not defined (by default, an IndexError is raised).

Returns:

A dict of lists, containing either integers or SampleView.

insert_sites(sample, position, values)[source]

Insert sites at a given position for a given sample

Parameters:
  • sample – index of the sample to which insert sites.

  • position – the position at which to insert sites. Sites are inserted before the specified position, so the user can use 0 to insert sites at the beginning of the sequence. To insert sites at the end of the sequence, pass the current length of the sequence. If position is larger than the length of the sequence or None, new sites are inserted at the end of the sequence. The position might be negative Warning: the position -1 means before the last position.

  • values – a list of allelic values, or a string for compatible alphabets such as alphabets.DNA to be inserted into the sequence.

property is_matrix

True if the instance is an Align, and False if it is a Container.

lower(index, start=0, stop=None)

Converts allele values of a sample to lower case. Only applicable for alleles that have an lower case equivalent. All other allele values are ignored. The underlying object data is modified and this method returns None. Only accepted for case-sensitive character and string alphabets. Furthermore, all lower case alleles that are implied are required to be valid alleles in the alphabet.

Parameters:
  • index – sample index.

  • start – index of the first allele to convert. By default, from the start of the sequence.

  • stop – stop index (index of the allele immediately after the last allele to convert). By default, convert until the end of the sequence.

ls(index)[source]

Get the number of sites of an ingroup sample.

Parameters:

index – sample index.

name_mapping()

Map sample names to lists of SampleView instances. This method is most useful when several sequences have the same name. It may be used to detect and process duplicates.

Returns:

A dict.

names()

Generate the list of sample names.

Returns:

A list.

property ns

Number of samples.

remove_duplicates()

Remove all duplicates, based on name exact matching. For all pairs of samples with identical name, only the one occurring first is conserved. The current instance is modified and this method returns None.

rename(mapping, liberal=False)

Rename sequences of the instance.

Parameters:
  • mapping – a dict providing the mapping of old names (as keys) to new names (which may, if needed, contain duplicates).

  • liberal – if this argument is False and a name does not appear in mapping, a ValueError is raised. If liberal is True, names that don’t appear in mapping are left unchanged.

Returns:

The number of samples that have been actually renamed, overall.

reserve(nsam=0, lnames=0, nlbl=0, nsit=0)

Pre-allocate memory. This method can be used when the size of arrays is known a priori, in order to speed up memory allocations. It is not necessary to set all values. Values less than 0 are ignored.

Parameters:
  • nsam – number of samples in the ingroup.

  • lnames – length of sample names.

  • nlbl – number of labels.

  • nsit – number of sites.

reset(alphabet=None)

Reset the instance. If the alphabet is specified, it replaces the previous one.

set(sample, site, value)

Set a data entry. The value must be a valid allelic value for this instance.

Parameters:
  • sample – sample index.

  • site – site index.

  • value – allele value (type depends on the alphabet).

set_label(sample, index, value)

Set a group label.

Parameters:
  • sample – sample index.

  • index – label index.

  • value – new label value.

set_name(index, name)

Set the name of a sample.

Parameters:
  • index – index of the sample

  • name – new name value.

set_sample(index, name, sequence, labels=None)

Overwrite all data for a given sample.

Parameters:
  • index – index of the sample to access (slices are not permitted).

  • name – new name of the sample.

  • sequence – list of allelic values giving the new values to set. str is supported for character alphabets such as alphabets.DNA. In case of an Align, it is required to pass a sequence with length matching the number of sites of the instance.

  • labels – a list of string labels. The default corresponds to an empty list.

set_sequence(index, value)

Overwrite sequence data for a given sample.

Parameters:
  • index – sample index.

  • value – can be a SequenceView instance, a list of allelic values, or a string (only for single-character alphabets such as alphabets.DNA). In the case of an Align instance, the length of the sequence must match the current alignment length.

subset(samples)

Extract a subset of samples. Generate and return a copy of the instance with only a specified list of samples. Sample indices are not required to be consecutive.

Parameters:

samples – a list of sample indices, or a Structure instance, giving the list of samples that must be exported to the returned object. In the latter case, all samples referred to in the structure are extracted.

Returns:

A new instance of the same type.

to_bases()[source]

Codon to DNA conversion. Convert the instance from the alphabets.codon alphabet to the alphabets.DNA alphabet. The number of sites will be multiplied by three.

to_codons()[source]

DNA to codons conversion. Convert the instance from the alphabets.DNA alphabet to the alphabets.codon alphabet. In the output alignment, there is one site per codon.

upper(index, start=0, stop=None)

Converts allele values of a sample to upper case. Only applicable for alleles that have an upper case equivalent. All other allele values are ignored. The underlying object data is modified and this method returns None. Only accepted for case-sensitive character and string alphabets. Furthermore, all upper case alleles that are implied are required to be valid alleles in the alphabet.

Parameters:
  • index – sample index.

  • start – index of the first allele to convert. By default, from the start of the sequence.

  • stop – stop index (index of the allele immediately after the last allele to convert). By default, convert until the end of the sequence.

class egglib.SampleView(parent, index)[source]

Proxy class representing a given sample. SampleView objects allow iteration and general manipulation of (large) data sets stored in an Align or a Container instance without unnecessary extraction of full sequences. Modifications of SampleView objects are immediately applied to the underlying data holder object. SampleView instances are iterable.

In principle, only Align and Container instances are supposed to build SampleView instances.

Parameters:
  • parent – a Align or Container instance.

  • index – an index within the parent instance.

Attributes

index

Index of the sample.

labels

Access to labels.

ls

Length of the sequence.

name

Sample name.

parent

Reference of the parent.

sequence

Access to allele values.

property index

Index of the sample. Index of the sample within the Align or Container instance containing this sample.

property labels

Access to labels. Return a LabelView instance which can be modified to edit labels. This attribute can also be modified altogether using a list of labels of any length, erasing completely the previous list of labels.

property ls

Length of the sequence. If the underlying object is an Align instance, this is the alignment length. Otherwise, this is the length of this particular sequence.

property name

Sample name. The name can be modified by a new string.

property parent

Reference of the parent. Reference of the Align or Container instance containing this sample.

property sequence

Access to allele values. This attribute is represented by a SequenceView instance which allow modifying the underlying sequence container object through its own attributes and methods. It is also possible to change this member altogether using either a string, a list of allele values (matching the alphabet of the parent instance). When modifying an Align instance, all sequences must have the same length as the current alignment length.

class egglib.SequenceView(parent, index)[source]

Proxy class representing the sequence of a given sample. This class manages the sequence of a sample of an Align or Container instance. It can be obtained directly from one of those classes or from the intermediate SampleView.

In principle, only SampleView, Align, and Container instances are supposed to build SequenceView instances.

Parameters:
  • parent – a Align or Container instance.

  • index – an index within the parent instance.

SequenceView instances behave somewhat like editable strings. They support the following operations:

Operation

Action

Notes

len(seq)

length of the sequence

for i in seq: ...

iterate over alleles

seq[i]

get allele at index i

seq[i:j]

get alleles from index i to index j-1

seq[i] = a

modify allele i by a

seq[i:j] = s

modify alleles i to j-1

del seq[i]

remove allele at index i

del seq[i:j]

remove alleles from index i to index j-1

Notes:

  1. This is the same as the alignment length for an Align and can also be accessed as SampleView.ls on the corresponding SampleView instance.

  2. The length of s is required to be equal to (j-i) (replacing a segment by a segment of the same size) if the parent instance is an Align, otherwise it can have a different length.

  3. Only available if the parent instance is a Container.

Note

When editing large sequences, it is much more efficient to use the operators described above than replace the whole sequence by the edited one.

Methods

find(motif[, start, stop])

Locate the first instance of a motif.

insert(position, values)

Insert data entries.

lower([start, stop])

Converts allele values of this sample to lower case.

lstrip(values)

Delete leading occurrences of any characters given in the values argument (starting from the beginning).

rstrip(values)

Delete trailing occurrences of any characters given in the values argument (starting from the end).

string()

Generate a string from all data entries by concatenating all allelic values.

strip(values)

Delete leading and trailing occurrences of any characters given in the values argument.

upper([start, stop])

Converts allele values of this sample to upper case.

find(motif, start=0, stop=None)[source]

Locate the first instance of a motif. Return the index of the first exact hit to a given substring. The returned value is the position of the first base of the hit. Only exact forward matches are implemented. To use regular expression (for example to find degenerated motifs), one should extract the string for the sequence and use a tool such as the regular expression module (re).

Parameters:
  • motif – list of allelic values constituting the motif to search.

  • start – position at which to start searching. The method will never return a value smaller than start. By default, search from the start of the sequence.

  • stop – position at which to stop search (the motif cannot overlap this position). No returned value will be larger than ` stop - len(motif)``. By default, or if stop is equal to or larger than the length of the sequence, search until the end of the sequence.

Returns:

An integer.

insert(position, values)[source]

Insert data entries. This method is only available for samples belonging to a Container instance. For Align instances, it is possible to insert data entries to all samples using the method insert_columns().

Parameters:
  • position – the position at which to insert sites. The new sites are inserted before the specified index. Use 0 to add sites at the beginning of the sequence, and the current number of sites for this sample to add sites at the end. If the value is larger than the current number of sites for this sample, or if None is provided, new sites are added at the end of the sequence.

  • values – list of allelic values to insert in the sequence.

lower(start=0, stop=None)[source]

Converts allele values of this sample to lower case. Only accepted for case-sensitive character and string alphabet. The alphabet must support all lower-case alleles that will be generated. Alleles that don’t have a lower-case equivalent are ignored. The underlying object data is modified and this method returns None

Parameters:
  • start – index of the first allele to convert. By default, from the start of the sequence.

  • stop – stop index (index of the allele immediately after the last allele to convert). By default, to the end of the sequence.

lstrip(values)[source]

Delete leading occurrences of any characters given in the values argument (starting from the beginning). The underlying object is modified and this method returns None.

Parameters:

values – an iterable containing the allelic values to strip out. They must be all be valid values with respect to the current alphabet. Repetitions are silently supported.

rstrip(values)[source]

Delete trailing occurrences of any characters given in the values argument (starting from the end). The underlying object is modified and this method returns None.

Parameters:

values – an iterable containing the allelic values to strip out. They must be all be valid values with respect to the current alphabet. Repetitions are silently supported.

string()[source]

Generate a string from all data entries by concatenating all allelic values. The alphabet type must be character or string.

Returns:

A string.

strip(values)[source]

Delete leading and trailing occurrences of any characters given in the values argument. The underlying object is modified and this method returns None. See also lstrip() and rstrip().

Parameters:

values – an iterable containing the allelic values to strip out. They must be all be valid values with respect to the current alphabet. Repetitions are silently supported.

upper(start=0, stop=None)[source]

Converts allele values of this sample to upper case. Only applicable for characters that have an upper case equivalent. All other allele values are ignored. The underlying object data is modified and this method returns None. Only accepted for case-sensitive character and string alphabets. Furthermore, all lower case alleles that are implied are required to be valid alleles in the alphabet.

Parameters:
  • start – index of the first allele to convert. By default, from the start of the sequence.

  • stop – stop index (index of the allele immediately after the last allele to convert). By default, to the end of the sequence.

class egglib.LabelView(parent, index)[source]

Proxy class representing the labels of a given sample. This class manages the list of labels of a sample of the Align and Container classes. This class can be considered as a list of unsigned integers, is iterable.

In principle, only SampleView, Align, and Container instances are supposed to build LabelView instances.

Parameters:
  • parent – a Align or Container instance.

  • index – an index within the parent instance.

LabelView instances support the following operations:

Operation

Action

len(labels)

number of labels for this sample

for lbl in labels: ...

iterate over labels

labels[i]

get an arbitrary label

labels[i] = lbl

modify label i

Methods

append(value)

Add a label at the end of the list.

append(value)[source]

Add a label at the end of the list.

class egglib.encode(obj, nbits=10)[source]

Temporarily rename samples using unique keys. This is a context manager that can be used to perform some code using a Align or Container instance with temporarily renamed samples. The names are encoded (using the encode() method) before executing user code and then decoded (using rename()) as the end of the block (even if an exception occurs):

>>> with egglib.encode(obj) as mapping:
...     ...

In the example above, the names of the original object obj have been renamed using random, unique keys composed of alphanumerical characters. If needed, the mapping of the temporary keys to original names is available as mapping. Operations requiring unique names or names compatible with external programs can be performed in the body of the with statement. And, finally, the original names will be restored when leaving the with statement, even if an exception occurs.

Note

There is a race condition with this context manager: while it is operating, the passed object is actually modified.

class egglib.Site(alphabet=None)[source]

Store alleles from a single site. Instances can be created from a position in an Align instance (using site_from_align()), from the current position of a VcfParser (using site_from_vcf()), or from a user-provided list of allelic values (using site_from_list()).

Parameters:

alphabet – an Alphabet instance (only useful if data are supposed to be loaded one by one append() or extend()).

The following operations are available on site if it is a Site instance:

Operation

Result

len(site)

number of samples

for i in site: ...

iterate over alleles

site[i]

access allele at given index

site[i] = a

overwrite allele at given index

del site[i]

delete allele at given index

Attributes

alphabet

Alphabet attached to the instance.

ns

Number of samples

num_missing

Number of missing data.

position

Position of the site.

Methods

append(val)

Add an allele at the end of the site.

as_list()

Generate a list containing data from the instance.

extend(vals)

Add several alleles at the end of the site.

from_align(align, index)

Import data from the provided Align.

from_list(array, alphabet)

Import alleles from the provided list.

from_vcf(vcf[, start, stop])

Import data from the provided io.VcfParser.

reset()

Clear all data from the instance (including the alphabet).

property alphabet

Alphabet attached to the instance. It is possible to set the alphabet, but not change it (the alphabet can be changed only immediately after creation if no alphabet has been specified, or after calling reset().

append(val)[source]

Add an allele at the end of the site.

as_list()[source]

Generate a list containing data from the instance.

Returns:

A list of allelic values.

extend(vals)[source]

Add several alleles at the end of the site. The alleles must be provided as an iterable.

from_align(align, index)[source]

Import data from the provided Align. All data currently stored in the instance are discarded.

Parameters:
  • align – an Align instance.

  • index – index of the site to extract.

from_list(array, alphabet)[source]

Import alleles from the provided list.

Parameters:
  • array – a list of allelic values.

  • alphabet – an alphabet (input allelic values must match this alphabet).

from_vcf(vcf, start=0, stop=None)[source]

Import data from the provided io.VcfParser. The VCF parser must have processed a variant and the variant is required to have genotypic data available as the GT format field. An exception is raised otherwise.

Parameters:
  • vcf – a VcfParser instance containing data. There must at least one sample and the GT format field must be available. It is not required to extract variant data manually.

  • start – index of the first sample to process. Index is required to be within available bounds (it must be at least 0 and smaller than the number of samples in the VCF data). Note that in a VCF file a sample corresponds to a genotype.

  • stop – sample index at which processing must be stopped (this sample is not processed). Index is required to be within available bounds (if must be at least equal to start and not larger than the number of samples in the VCF data). Note that in a VCF file, a sample corresponds to a genotype.

property ns

Number of samples

property num_missing

Number of missing data.

property position

Position of the site. The position is set automatically if the instance is created or reset from an Align or a VcfParser. In all cases, the value can be modified.

reset()[source]

Clear all data from the instance (including the alphabet).

egglib.site_from_align(align, index)[source]

Create a site from a position of an alignment. Import allelic and genotypic data from a position of the provided Align instance.

Parameters:
  • align – an Align instance.

  • index – the index of a valid (not out of range) position of the alignment.

Returns:

A new Site instance.

egglib.site_from_list(array, alphabet)[source]

Create a site based on a list of allelic values.

Parameters:
  • array – a list of allelic values.

  • alphabet – an alphabet (input allelic values must match this alphabet).

egglib.site_from_vcf(vcf, start=0, stop=None)[source]

Extract a site from a VCF parser. Import allelic and genotypic data from a VCF parser. The VCF parser must have processed a variant and the variant is required to have genotypic data available as the GT format field. An exception is raised otherwise.

Parameters:
  • vcf – an io.VcfParser instance containing data. There must at least one sample and the GT format field must be available. It is not required to extract variant data manually.

  • start – index of the first sample to process. Index is required to be within available bounds (it must be at least 0 and smaller than the number of samples in the VCF data). Note that in a VCF file a sample corresponds to a genotype.

  • stop – sample index at which processing must be stopped (this sample is not processed). Index is required to be within available bounds (if must be at least equal to start and not larger than the number of samples in the VCF data). Note that in a VCF file, a sample corresponds to a genotype.

class egglib.Freq[source]

Hold allelic and genotypic frequencies for a single site. Instances can be created using the three functions freq_from_site(), freq_from_list(), and freq_from_vcf(), or using the default constructor. After it is created by any way, instances can be re-used (which is faster), using their methods from_site(), from_list(), and from_vcf().

Variables:
  • ingroup – whole ingroup.

  • outgroup – outgroup.

  • cluster – a specific cluster (identified by its index).

  • population – a specific population (identified by its index).

The variables above specify what subset of data should be considered. You should never try to modify them. For clusters and populations, an index should be provided as well.

They can be used as:

>>> freq.freq_allele(allele_index, cpt=egglib.Freq.ingroup)
>>> freq.freq_allele(allele_index, cpt=egglib.Freq.population, idx=pop_index)

or:

>>> freq.freq_allele(allele_index, cpt=freq.ingroup)
>>> freq.freq_allele(allele_index, cpt=freq.population, idx=pop_index)

Attributes

num_alleles

Number of alleles in the whole site.

num_clusters

Number of clusters.

num_genotypes

Number of genotypes in the whole site.

num_populations

Number of populations.

ploidy

Ploidy.

Methods

allele(idx)

Get an allele.

freq_allele(allele[, cpt, idx])

Get the frequency of an allele.

freq_genotype(genotype[, cpt, idx])

Get the frequency of an genotype.

from_list(ingroup, outgroup[, geno_list, ...])

Import frequencies from lists.

from_site(site[, struct])

Import frequencies from a site.

from_vcf(vcf)

Import frequencies from a VCF file.

genotype(idx)

Get a genotype, as a tuple of alleles.

nieff([cpt, idx])

Get the number of individuals within a given compartment.

nseff([cpt, idx])

Get the number of samples within a given compartment.

allele(idx)[source]

Get an allele.

freq_allele(allele, cpt=0, idx=None)[source]

Get the frequency of an allele.

Parameters:
  • allele – allele index.

  • cptcompartment identifier.

  • idx – compartment index (required for clusters and populations, ignored otherwise).

freq_genotype(genotype, cpt=0, idx=None)[source]

Get the frequency of an genotype.

Parameters:
  • genotype – genotype index.

  • cptcompartment identifier.

  • idx – compartment index (required for clusters and populations, ignored otherwise).

from_list(ingroup, outgroup, geno_list=None, alphabet=None)[source]

Import frequencies from lists. Reset the instance as if it was created using freq_from_list(). Arguments are identical to this function.

from_site(site, struct=None)[source]

Import frequencies from a site. Reset the instance as if it was created using freq_from_site(). Arguments are identical to this function.

from_vcf(vcf)[source]

Import frequencies from a VCF file. Reset the instance as if it was created using freq_from_vcf(). Argument is identical to this function.

genotype(idx)[source]

Get a genotype, as a tuple of alleles.

nieff(cpt=0, idx=None)[source]

Get the number of individuals within a given compartment. In the haploid case, this method is identical to nseff().

Parameters:
  • cptcompartment identifier.

  • idx – compartment index (required for clusters and populations, ignored otherwise).

nseff(cpt=0, idx=None)[source]

Get the number of samples within a given compartment.

Parameters:
  • cptcompartment identifier.

  • idx – compartment index (required for clusters and populations, ignored otherwise).

property num_alleles

Number of alleles in the whole site.

property num_clusters

Number of clusters.

property num_genotypes

Number of genotypes in the whole site.

property num_populations

Number of populations.

property ploidy

Ploidy.

egglib.freq_from_site(site, struct=None)[source]

Compute allele frequencies from a site.

Parameters:
  • site – a Site instance.

  • struct

    this argument can be:

    • A Structure instance , allowing to select a subset of samples and/or define the structure.

    • A list (or compatible) of at least one integer providing the sample size (as numbers of samples) of all populations, assuming the individuals are organized in the corresponding order (all samples of a given population grouped together). Samples are grouped in haploid individuals.

    • None (no structure, all samples placed in haploid individuals and in a single population).

Returns:

A new Freq instance.

egglib.freq_from_list(ingroup, outgroup, geno_list=None, alphabet=None)[source]

Import allele frequencies.

Parameters:
  • ingroup – a list of genotype or allele frequencies (based on the value of geno_list). There must be one item for each allele. If there is a single list, a unique population is assumed. To specify multiple populations, the user must pass a list of lists (each inner list containing the same number of items representing allele frequencies). To specify multiple clusters, the user must pass a list of lists of lists of allele frequencies. There may also be single-item lists when necessary. The frequencies must be null or positive integers. The number of frequencies per population is required to be constant for all populations (corresponding to the number of alleles or genotypes). If geno_list is defined, data must be the frequencies of the provided genotypes, in the same order. Otherwise, data must be allelic frequencies, in the order of increasing allele index (in the latter case, data will be loaded as haploid).

  • outgroup – a list of allele/genotype frequencies for the outgroup. The number of alleles or genotypes is also required to match. If None, no outgroup samples (equivalent to a list of zeros).

  • geno_list – list of genotypes. Genotypes must be provided as tuples or lists. Their length is equal to the ploidy and is required to be at least one and constant for all genotypes. For haploid data, it is still required to passed allelic values as one-item lists (with a ploidy of one). Order of alleles within genotypes is significant. If None, data are loaded as haploid alleles and the index of alleles is taken as allelic value.

  • alphabet – alphabet to be used to describe alleles. Required if geno_list is used. By default, use a

Returns:

A new Freq instance.

Note

It is required that there is at least one cluster and one population.

egglib.freq_from_vcf(vcf)[source]

Import allelic frequencies from a VCF parser. The VCF parser must have processed a variant and the variant is required to have frequency data available as the AC format field along with the AN field. An exception is raised otherwise.

This function only imports haploid allele frequencies in the ingroup (without structure). The first allele is the reference, by construction, then all the alternate alleles in the order in which they are provided in the VCF file.

Parameters:

vcf – a VcfParser instance containing data. There must at least one sample and the AN/AC format fields must be available. It is not required to extract variant data manually.

class egglib.Alphabet(cat, expl, miss, case_insensitive=False, name=None)[source]

Define acceptable lists of alleles for a type of data. This class is designed to be associated to EggLib objects holding diversity data (Align, Container, and Site). It defines the lists of exploitable and missing allelic values (both being allowed in all objects using this alphabet). Several pre-defined instances instances are available in Pre-defined alphabets, but the user can easily define his own alphabets using this class. It is strongly advised to use the pre-defined alphabet designed for DNA sequences as it is optimized for access speed.

Parameters:
  • cat

    Type of the alphabet, as a string taken from the following list:

    int

    Positive or negative, discrete integers. range is more adapted to cases with many alleles.

    char

    Single-character strings (such as DNA).

    string

    Strings representing segments of biological sequences (appropriate for codons or indel variants).

    custom

    Alleles represented by a code or a name (e.g. for genomic rearrangements). In particular, this type of alphabets does not allow sequence concatenation, contrary to string.

    range

    Continuous ranges of integers (regardless of their sign). This is much more efficient than using an integer alphabet with many alleles.

  • expl – list of exploitable alleles (for range, see below).

  • miss – list of missing alleles (for range, see below). There must not be any overlap between exploitable and missing alleles in any case.

  • case_insensitive – if set to True, case will be ignored for allele comparison (alleles differing in case only will be considered to be identical). Note that this does not preserve case of provided alleles which are all converted to upper case. Only accepted for char and string alphabets.

  • name – name of the alphabet. Alphabet name is of relatively modest importance and is not required to be unique. The alphabet name is used in error messages when an invalid allele is processed. By default, use the alphabet type as name.

For range alphabets, both parameters expl and miss must be specified as a (start, stop) expression. All values from start up to stop-1 will be considered as valid allele values (stop is not included). To specify a unique value, use (value, value+1). To specify no value at all, use None (you can also abuse the syntax by using (x, x) with any x value. You can omit either term, or both, to denote a semi-infinite or fully infinite range. Of course, setting expl=(None, None) prevent you from specifying any missing allele value.

Attributes

case_insensitive

Boolean indicating if the alphabet is case-insensitive.

locked

True if the alphabet is locked.

name

Name of the alphabet.

num_alleles

Total number of alleles.

num_exploitable

Number of exploitable alleles.

num_missing

Number of missing alleles.

type

Type of the alphabet.

Methods

add_exploitable(value)

Add an exploitable allele to the instance.

add_missing(value)

Add a missing allele to the instance.

get_alleles()

Generate the list of alleles.

get_code(value)

Return the code of a given allele.

get_value(code)

Return the value of for given code.

lock()

Lock the alphabet.

add_exploitable(value)[source]

Add an exploitable allele to the instance. Not allowed for range alphabets and for the special alphabet alphabets.DNA. The allele value must be of the appropriate type, and unique.

add_missing(value)[source]

Add a missing allele to the instance. Not allowed for range alphabets and for the special alphabet alphabets.DNA. The allele value must be of the appropriate type, and unique.

property case_insensitive

Boolean indicating if the alphabet is case-insensitive.

get_alleles()[source]

Generate the list of alleles. Return a tuple with two items: the list of exploitable and the list of missing alleles, respectively. For a range alphabet, each list is replaced by a tuple with the bounds of the range (replaced by None in case of an empty range).

get_code(value)[source]

Return the code of a given allele. Missing alleles are indicated by a negative code.

get_value(code)[source]

Return the value of for given code. Missing alleles are indicated by a negative code.

lock()[source]

Lock the alphabet. When the alphabet is locked, a ValueError will be raised if an attempt to modify the alphabet is made.

property locked

True if the alphabet is locked.

property name

Name of the alphabet. The value might be changed.

property num_alleles

Total number of alleles.

property num_exploitable

Number of exploitable alleles.

property num_missing

Number of missing alleles.

property type

Type of the alphabet. Same as the cat argument of the constructor.

class egglib.Structure[source]

Class describing organisation of samples. This class allows to map the samples of a Site, Align, or (probably more rarely) a Container instance to four levels of structuration: ingroup versus outgroup, then (within the ingroup) clusters of populations, populations, and individuals. If data is lacking for a particular level of structure, it can be skipped. The number of individuals per population can vary, but the number of samples per individuals (that is, the ploidy) must be constant.

New objects are created using the fonctions struct_from_labels() and struct_from_dict(), and existing objects can be recycled with their corresponding methods (from_labels() and from_dict()).

Attributes

no

Number of outgroup samples included in the structure.

ns

Number of ingroup samples included in the structure.

num_clust

Number of clusters.

num_indiv_ingroup

Total number of ingroup individuals.

num_indiv_outgroup

Number of outgroup individuals.

num_pop

Total number of populations.

ploidy

Ploidy.

req_ns

Required number of samples.

Methods

as_dict()

Generate dictionaries representing the structure.

from_dict(ingroup, outgroup)

Import structure from user-specified dictionaries.

from_iterable(iterable[, fmt, data, ...])

Create structure from a list of labels.

from_labels(data[, lvl_clust, lvl_pop, ...])

Import structure from an Align or Container.

from_samplesizes(pops[, ploidy, outgroup])

Create structure from sample sizes per population.

get_clusters()

Return a list with the label of all clusters.

get_populations()

Return a list with the label of all populations from all clusters.

get_samples()

Return a set with the index of all samples from the ingroup.

make_auxiliary()

Generate the derived structure.

make_sorted_auxiliary()

Generate the sorted version of the derived structure.

shuffle([mode, nr, rnd])

Shuffle randomly the structure (context manager).

subset([pops, clusters, outgroup])

Make of copy with only selected populations.

as_dict()[source]

Generate dictionaries representing the structure. Return a tuple of two dict instances representing, respectively, the ingroup and outgroup structure.

The ingroup dictionary is a three-fold nested dictionary (meaning it is a dictionary of dictionaries of dictionaries) holding lists of sample indices. The keys for these three nested levels are, respectively, cluster, population, and individual labels. Based on how the instance was created, there may be just one item or even none at all in any dictionary. In practice, if d is the ingroup dictionary and clt, pop and idv are, respectively, cluster, population, and individual labels, the expression d[clt][pop][idv] will yield a list of sample indices.

The outgroup dictionary is a non-nested dictionary with individual labels as keys and lists of sample indices as values.

from_dict(ingroup, outgroup)[source]

Import structure from user-specified dictionaries. Reset the instance as if it was built using the standalone function struct_from_dict(). Arguments are identical, but this method returns None.

from_iterable(iterable, fmt=None, data=None, missing=None, start=0, stop=None, function=None, skip_missing_names=False)[source]

Create structure from a list of labels. Reset the instance as if it was built using the standalone function struct_from_iterable(). Arguments are identical, but this method returns None.

from_labels(data, lvl_clust=None, lvl_pop=None, lvl_indiv=None, ploidy=None, skip_outgroup=False, outgroup_label='#')[source]

Import structure from an Align or Container. Reset the instance as if it was built using the standalone function struct_from_labels(). The definitions of arguments are identical, but return None.

from_samplesizes(pops, ploidy=1, outgroup=0)[source]

Create structure from sample sizes per population. Reset the instance as if it was built using the standalone function struct_from_samplesizes(). Arguments are identical, but this method returns None.

get_clusters()[source]

Return a list with the label of all clusters.

get_populations()[source]

Return a list with the label of all populations from all clusters.

get_samples()[source]

Return a set with the index of all samples from the ingroup. All clusters, populations, and individuals stored in this object, excluding the outgroup.

make_auxiliary()[source]

Generate the derived structure. This method generates the Structure instance that should be used to analyse data that have been filtered using the original structure and wherein the individual level has been collapsed. Formally return a new Structure instance describing the organisation of individuals in clusters and populations (ignoring the intra-individual level) using the rank of individuals as indices, the individuals being ranked in the order of increasing cluster and population indices.

make_sorted_auxiliary()[source]

Generate the sorted version of the derived structure. Return a new Structure instance with sample indices ordered (to be used with objects which have been filtered using the original structure but wherein the individual level has not been collapsed).

property no

Number of outgroup samples included in the structure.

property ns

Number of ingroup samples included in the structure.

property num_clust

Number of clusters.

property num_indiv_ingroup

Total number of ingroup individuals.

property num_indiv_outgroup

Number of outgroup individuals.

property num_pop

Total number of populations.

property ploidy

Ploidy.

property req_ns

Required number of samples. Equal to the largest index overall, plus one.

Changed in version 3.2: Return the larged sample index overall, not only of ingroup samples (which made little sense).

shuffle(mode='it', nr=None, rnd=None)[source]

Shuffle randomly the structure (context manager). This method allows to test for independence of any level of structuration. It is not allowed to modify the structure object while the object returned by this method is in use. This method is designed to be used in a context manager, and the returned object is iterable (see examples below).

A simple example:

>>> struct = egglib.struct_from_labels(aln, lvl_clust=0, lvl_pop=1, lvl_idv=2)
>>> cs = egglib.ComputeStats(struct=struct)
>>> cs.add_stats('FistWC')
>>> with struct.shuffle(mode='st'):
...     print(cs.process_align(aln))

Example as an iterator:

>>> with struct.shuffle(mode='st', nr=100) as shuffler:
...     for _ in shuffler:
...         print(cs.process_align(aln))

Note

We can ignore the return value of the iterator since it constantly returns the same instance, which is internally modified at each iterator round. The modifications will be taken in account by ComputeStats when computing diversity statistics.

Parameters:
  • mode

    A string describing what structure level to shuffle and with what constraint. This is best understood with reference to the corresponding F-statistic: for example, the mode is assumes that \(F_{is}\) is zero.

    • it – samples shuffled in total.

    • ic – samples shuffled within their original cluster.

    • is – samples shuffled within their original population.

    • st – whole individuals shuffled in total.

    • sc – whole individuals shuffled within their original cluster.

    • ct – whole populations shuffled in total.

  • nr – number of replicates. If None, the instance is shuffled immediately, and once, when entering the context manager. Otherwise, return an iterable context manager which will perform the requested number of shuffling replicates and yield back the same (modified) structure at each iteration. In the latter case, shuffling won’t start until iteration has been entered.

Returns:

A context manager, which itself is iterable if nr is not None.

subset(pops=None, clusters=None, outgroup=True)[source]

Make of copy with only selected populations. Generate a new Structure instance containing only the specified populations. There must be at least one population and all populations must be valid.

pops and clusters are lists of valid population and cluster labels. The order and possible replicates are not significant. outgroup specifies if the outgroup must be included.

New in version 3.2.

egglib.struct_from_labels(data, lvl_clust=None, lvl_pop=None, lvl_indiv=None, ploidy=None, skip_outgroup=False, outgroup_label='#')[source]

Extract structure from labels attached to an alignment. Create a new instance based on the labels of an Align (or Container) instance.

Parameters:
  • data – an Align or Container instance containing the labels to be processed.

  • lvl_clust – index of cluster labels. If None, all populations are placed in a single cluster with None as label.

  • lvl_pop – index of population labels. If None, all individuals of a cluster are placed in a single population with the same label as their cluster.

  • lvl_indiv – index of individual labels. If None, the individual level is skipped and each sample is placed in a haploid individual. If specified, it is required that, if present and not ignored, outgroup samples have a label indicating individuals. If lvl_indiv is None, any additional label of outgroup individuals will be ignored, even if present.

  • ploidy – indicate the ploidy. Must be a positive number and, if specified, data must match the value. If not specified, ploidy will be detected automatically (it must still be consistent over all ingroup and outgroup individuals). Ploidy is ignored if lvl_indiv is None.

  • skip_outgroup – indicate that outgroup samples should be skipped. No effect if there are no outgroup samples.

  • outgroup_label – Label identifying outgroup samples. Only considered when it is the first label. Outgroup samples can only have either one or two labels (the first one equal to the one specified by this option, and the second one an optional individual label).

Returns:

A new Structure.

egglib.struct_from_samplesizes(pops, ploidy=1, outgroup=0)[source]

Build a structure based on sample size per population.

Parameters:
  • pops – list of sample sizes (one item per population). The values are interpreted as numbers of individuals.

  • ploidy – ploidy (number of samples per individual).

  • outgroup – number of outgroup individuals.

Returns:

A new Structure.

The returned instance contains a single cluster, referenced as None and the specified number of populations. Population keys are pop1, pop2, … If there is only one population, its label is pop1. Similarly, individuals are referenced as idv1, …, consecutively for all populations and for the outgroup (in that order).

egglib.struct_from_iterable(iterable, fmt=None, data=None, missing=None, start=0, stop=None, function=None, skip_missing_names=False)[source]

Build a structure based on labels stored in an iterable. Like a list or a file handle. By default, labels are mapped to samples based on their order of apparition (the first label is applied to the first sample, and so on). Alternatively, names can be included in the iterable (if so, their position needs to be specified by passing a N to fmt) and their index will be sought in an Align or Container passed as data.

Parameters:
  • iterable – any iterable except str (typically a list, possibly an open file). Iteration rounds (rows) must be non-empty string if fmt is None, or sequences (e.g. list) of non-empty strings otherwise. If rows are sequences, they must contain strings and have number of items specified by fmt. If an open file is passed, it is probably required to use the function option (such as function=str.split to split rows or function=str.strip to just strip trailing newlines, or anything more complex as needed).

  • fmt – specifies the meaning of items contained by each row. If None, rows must be strings and specify population labels. Otherwise, fmt must be a sequence of one-character strings (fmt may be a string) specifying what each item of a given row is supposed to represent. C, P, and I stand for cluster, population, and individual labels, respectively. They may appear in any order but cannot be repeated. There must be at least one of them and C, if present, requires that P is present also. N refers to sample names. It is optional and must also appear also once. If present, data must also be specified and the passed instance must not have duplicate names. If a name is missing in data, it will be silently skipped and excluded from the resulting structure. In addition, * can be used to specify irrelevant columns that should be ignored. It is optional and can be repeated.

  • data – an Align or Container instance in which sample names should be searched. Required if N is specified in fmt and not allowed otherwise.

  • missing – value used to specify that a sample should not be included in the structure. Allows to skip an index (in particular when using the function in the default mode, not relying on sample names). If one label has the missing value, all labels must have it. Note that the string "None" (e.g. as read from a file) is not interpreted as the builtin object None.

  • start – index of the first row to process (also: number of rows to skip before starting processing). To skip a header line, one would use start=1. Skipped rows don’t affect the counter used to identify samples by index (the default, if names are not specified).

  • stop – index where to stop processing (this row is not processed and iteration stops).

  • function – a function to be applied to each processed row (rows before start and from stop on are not concerned).

  • skip_missing_names – if names are searched in an Align or Container (fmt containing the N flag), names not appearing in the instance passed as data are skipped (the default is to raise an error).

egglib.struct_from_dict(ingroup, outgroup)[source]

Build a structure from dictionaries. Create a new instance based on the structure specification provided as dictionaries. The two arguments are in the same format as the return value of the as_dict() method (see the documentation). Either argument can be replaced by None which is equivalent to an empty dictionary (no samples). Note that all keys must be strings.

Parameters:
  • ingroup – a three-fold nested dict of ingroup samples indices, or None. The keys of the outter dictionary are cluster labels and its values are dictionaries representing each cluster. Each cluster dictionary has population labels as keys and dictionaries representing populations has values. Each population dictionary has ingroup individual labels as keys and lists of sample indices, pointing to samples in an Align, Container, or Site instance.

  • outgroup – a dict of outgroup samples indices, or None. The keys must be outgroup individual labels, and the values lists of sample indices, pointing to samples in an Align, Container, or Site instance.

Returns:

A new Structure.

class egglib.Tree(fname=None, string=None)[source]

Editable genealogical or phylogenetic tree. A tree is a linked collection of nodes which all have one parent (except the ultimate base node) and any number of children.

Parameters:
  • fname – name of a newick-formatted file containing the tree to import.

  • string – newick-formatted string representing the the tree to import. If it not allowed to specified both fname and string.

The instance can be initialized as an empty tree (with only a root node), or from a newick-formatted string. By default, the string is read from the file name passed as the fname argument to the constructor, but it can be passed directly through the constructor argument string. It is not allowed to set both fname and string at the same time. If neither is specified, an empty tree, containing only a bare base node, is created. The newick parser expects a well-formed newick string (including the trailing semicolon).

Nodes are implemented as Node instances. A node without children is a leaf and otherwise it is internal. A node with exactly one child is generally meaningless, but is allowed. All nodes (internal nodes as well as leaves) have a label which in the case of leaves can be used as a sample name. It is not possible to apply a name and a label to leaf node, in agreement with the newick format. All connections between nodes (branches) are oriented and can have a length (although the lengths can be omitted) but note that labels are applied to nodes, not branches. All Tree instances have one base node which is the only one allowed not to have a parent. Network-like structures are not allowed (because nodes must have exactly one parent).

Import and export to/from strings and files are in the bracket-based newick format (the parser treats terminal node labels as strings, and internal node labels as integers or, by default, floats). Tree instances can be exported as a built-in str function by the syntax str(tree) or using the method newick(). Nodes also have a newick() method.

Tree instances are iterable. Three iterators are provided: one is depth-first (depth_iter()), another is breath-first (breadth_iter()) and one iterates on terminal nodes (leaves) only (iter_leaves()).

Attributes

base

Basal node of the tree.

num_leaves

Number of terminal nodes in the tree.

num_nodes

Total number of nodes in the tree.

Methods

add_node(parent[, label, brlen])

Add a node to the tree.

breadth_iter([start])

Return a breadth-first iterator.

clean_branch_lengths()

Remove all branch lengths.

clean_internal_labels()

Remove all internal node labels.

collapse(node[, ignore_len, ignore_label])

Collapse a branch of the tree.

copy([node])

Make a deep copy of the tree.

depth_iter([start])

Return a depth-first iterator.

extract(node[, label])

Remove a subtree.

find_clade(names[, ancestral, both_sides])

Check whether a group is one of the clades defined by the tree.

frequency_nodes(trees[, relative])

Label nodes based on their number of occurrences in a list of trees.

get_leaf(label)

Get a terminal node.

iter_leaves()

Return an iterator over the leaves as Node instances.

lateralize([reverse])

Flush bigger clades to one side of the tree.

map_descendants()

Map all leaves of the trees to internal nodes.

midroot()

Automatic midpoint rooting of the tree.

newick([skip_labels, skip_brlens])

Return the newick-formatted string representing the instance.

remove_node(node[, keep_parent])

Remove a node from the tree, as well as all its descendants.

root(outgroup[, branch_split, reoriente])

Root or reoriente the tree.

total_length()

Compute the sum of branch lengths.

unroot([reverse])

Remove the root.

add_node(parent, label=None, brlen=None)[source]

Add a node to the tree.

Parameters:
  • parent – one of the nodes of this instance, as a Node instance.

  • label – node label which will be internal node label or leaf name according to the final structure of the tree. The new node has initially no children and is therefore a leaf until it is itself connected to a child (if ever).

  • brlen – length of the branche connecting parent to the new node.

Returns:

The new node as a Node reference.

property base

Basal node of the tree. If the tree is unrooted, this is a trifurcation whose location should be considered as arbitrary, unless one the three clades below this node is the outgroup. If the tree is rooted, this is the root). This attribute is a Node instance which can be modified, but it cannot be replaced.

breadth_iter(start=None)[source]

Return a breadth-first iterator. Iterate over the Node instances of the trees, starting from the base but, then, following a breadth-first order.

Parameters:

start – start point of the iteration, as a Node instance of this tree. By default, start from the base of the tree.

clean_branch_lengths()[source]

Remove all branch lengths. In practice, they are set to None.

clean_internal_labels()[source]

Remove all internal node labels. This included the base of the tree. In practice, they are set to None.

collapse(node, ignore_len=False, ignore_label=False)[source]

Collapse a branch of the tree.

Parameters:
  • nodeNode representing the branch to remove.

  • ignore_len – don’t try to transfer branch lengths to children.

  • ignore_label – don’t transfer label of the destroyed node to its parent.

node represents the branch that must be removed from the tree (this node is destroyed in the process). It must be one of the nodes contained in the tree (as a Node instance), but not the base of the tree. It cannot be an terminal node (leaf).

If ignore_label is not set to True, the label of the destroyed node is transferred to the parent based on the following procedure: (1) if the destroyed node’s label is None, nothing is done; (2) if the destroyed node’s parent’s label is None, the destroyed node’s label is copied to its parent as is; (3) otherwise both labels are converted to strings (if they are not yet) and concatenated as in the string a;b (where a is the parent’s label and b is the destroyed node’s label), even if the two labels are identical.

If ignore_len is not set to True, and if the length of the removed branch (branch from the specified node to its parent) is specified, it will be spread equally among the branches to its children (see example below). This requires that the branch length to all children are specified. If the removed branch has no specified length, nothing is done.

Collapsing node [4] on the following tree:

 /------------------------------------------->[1]
 |
 |             /----------------------------->[3]
 |             |
 |----------->[2]             /-------------->[5]
 |             |              |
 |             \------------>[4]
[0]                           |
 |                            \-------------->[6]
 |
 |              /---------------------------->[8]
 |              |
 \------------>[7]            /------------->[10]
                |             |
                \----------->[9]
                              |
                              \------------->[11]

will generate the following tree, with the correction of edge lengths as depicted:

 /------------------------------------------->[1]
 |
 |             /----------------------------->[3]
 |             |
 |----------->[2]
 |             |
 |             |-------------------->[5]        L5 = L5+L4/2
[0]            |
 |             \-------------------->[6]        L6 = L6+L4/2
 |
 |              /---------------------------->[8]
 |              |
 \------------>[7]            /------------->[10]
                |             |
                \----------->[9]
                              |
                              \------------->[11]

Although the total edge length of the tree is not modified, the relationships will be altered: the distance between the descendants of the collapsed node (nodes 5 and 6 in the example above) will be artificially increased.

copy(node=None)[source]

Make a deep copy of the tree. Create a new instance of Tree that is a deep copy of a subtree of the the current tree, or a deep copy of it all.

Parameters:

node – a Node instance (one of the nodes of the current tree) at the base of the subtree that should be copied. By default, or if None, or the base of the tree is passed, the whole tree is copied. It is not allowed to pass a leaf.

depth_iter(start=None)[source]

Return a depth-first iterator. Iterate over the Node instances of the trees, starting from the base but, then, following a depth-first order.

Parameters:

start – start point of the iteration, as a Node instance of this tree. By default, start from the base of the tree.

extract(node, label=None)[source]

Remove a subtree. The subtree is returned as a new tree. All nodes of the subtree descending from the requested node will now belong to the new tree. The label of the node at the base of the selected clade is deleted. In the original tree, the extracted clade is replaced by a terminal node which has, by default, the label of the node at the base of the extracted clade, or the label passed as argument.

Parameters:
  • node – a Node instance (one of the nodes of the current tree) at the base of the subtree that should be extracted. It is not allowed to pass a leaf or the base of the tree.

  • label – label to affect to the terminal node that is introduced to replace the extracted clade in the orginal tree. By default (or if None), use the label of the node passed as first argument (note that this label should be in principle a number).

Returns:

A new Tree instance.

find_clade(names, ancestral=False, both_sides=False)[source]

Check whether a group is one of the clades defined by the tree.

The leaf names must be provided as an iterable (most logically, a set). Leaf names are normally str instances. All leaves must be present in the tree.

If the*ancestral* is False, search for the clade that contains the provided list of names as descendant. There must not be any other name amongst its descendants. If the tree is unrooted and oriented in such a way that the a base lies within the requested clade, it will not be detected. It is possible (if both_sides is True) to allow searching for the complement of the clade, thereby detecting the right clade even if it is at the base of the tree. By default, if this situation occurs, the clade will not be detected.

If the option ancestral is True, search for the most recent common ancestor of all leaves specified in names. Use of this method necessarily supposes that the tree is rooted (however, there is no requirement regarding its shape such as bifurcation at the base) and it is not allowed to both_sides is True. With this option, it is not possible to have a None return value (since it is required that all leaves are present in the tree, in the worse case the base of the tree is returned).

Parameters:
  • name – a set (or compatible) specifying the requested leaves (as node labels, normally str instances).

  • ancestral – whether to look for the most recent common ancestral clade containing requested leaves (by default, look for the clade containing the exact same list of leaves).

  • both_sides – only allowed if ancestral is False. Look for both the requested list of leaves and its complement, allowing to detect a clade even if it is spanning the base of the tree.

Returns:

The Node instance, if it exists, which has the exact same list of descendants than taxa. If no such clade is found, None.

Warning

This method assumes that all leaf names of the tree are unique, as well as the list of names provided as argument. If this condition is not fulfilled, the right clade might not be found even if it exists.

Changed in version 3.0.0: Replaces previous methods findGroup(), findMonophyleticGroup(), smallest_group() and smallest_monophyleticGroup() with a modification of the underlying algorithm.

frequency_nodes(trees, relative=False)[source]

Label nodes based on their number of occurrences in a list of trees. Each node receives an integer as label counting the number of trees where the same node exists among the trees provided as argument. It is required that all leaf labels are unique.

Parameters:
  • trees – an iterable containing Tree instances with exactly the same set of leaf labels.

  • relative – node frequencies are expressed as fractions. The use of this option requires that at least one tree is provided.

Note

With the exception of the base of the tree (which is ignored by this function) and leaf labels, all previously set labels are erased.

get_leaf(label)[source]

Get a terminal node. Return the node (as a Node instance) that has the requested leaf label. If several nodes have this label, returns the first one. If no nodes have this label, returns None.

iter_leaves()[source]

Return an iterator over the leaves as Node instances.

lateralize(reverse=False)[source]

Flush bigger clades to one side of the tree. Modify the order of children of all nodes of the trees in such a way that they are sorted from the smallest to the largest number of descending leaves.

Parameters:

reverse – sort from in the more-descendants to less-descendants order instead.

map_descendants()[source]

Map all leaves of the trees to internal nodes. Generate and return a dict which gives, for all internal nodes of the trees (excluding the base), the list of terminal nodes that are ultimately connected when reading the tree away from its base.

midroot()[source]

Automatic midpoint rooting of the tree. The tree must be initially unrooted (trifurcation at the root). This method identifies the most distant pair of terminal nodes (in case of a draw, one pair is picked randomly) and the root of the tree (as a new node) placed at the middle point of this path.

newick(skip_labels=False, skip_brlens=False)[source]

Return the newick-formatted string representing the instance.

Parameters:
  • skip_labels – omit internal branch labels.

  • skip_brlens – omit the branch lengths.

property num_leaves

Number of terminal nodes in the tree. If the tree is empty (only the default base node), the number of leaves is 0.

property num_nodes

Total number of nodes in the tree. This number is never smaller than 1, even for empty trees.

remove_node(node, keep_parent=False)[source]

Remove a node from the tree, as well as all its descendants. Since this operation may create a node with a single child, this method may remove the parent or the brother of the removed node depending on the structure of the tree, unless specified otherwise (see keep_parent).

Parameters:
  • node – the node to remove, as a Node instance belowing to the current tree. Terminal nodes can be removed, but not the base of the tree.

  • keep_parent – don’t remove the parent of the removed node if it is left with only one child. If the parent is the base of the tree, keep the other descendant if it is not terminal (see below).

Assume we remove node [3] from the tree with this structure:

               /---------------------------->[2]
               |
 /----------->[1]           /--------------->[4]
 |             |            |
 |             \---------->[3]
 |                          |
[0]                         \--------------->[5]
 |
 |              /--------------------------->[7]
 |              |
 \------------>[6]            /------------->[9]
                |             |
                \----------->[8]
                              |
                              \------------>[10]

Then, we would end up with the following tree:

 /----------->[1]--------------------------->[2]
 |
 |
[0]
 |              /--------------------------->[7]
 |              |
 \------------>[6]            /------------->[9]
                |             |
                \----------->[8]
                              |
                              \------------>[10]

The default behaviour is then to remove node [1] (and delete its label if it exists) and to set the length of the branch from [0] to [2] to the sum of the [0] to [1] and [1] to [2]. But, with keep_parent is True, the tree is left as is.

There is a special case with the base of the tree. Assume that we remove node [1] from the original tree above. We then would have a non-standard structure with a single child to the base of the tree:

                /--------------------------->[7]
                |
[0]----------->[6]            /------------->[9]
                |             |
                \----------->[8]
                              |
                              \------------>[10]

In that case, the base is not removed, but node [6] is removed using the collapse() method (using option ignore_len set to False but ignore_label set to True since the base is not supposed to bear a label). We end up with the following structure:

 /--------------------------------->[7]
 |
[0]                  /------------->[9]
 |                   |
 \----------------->[8]
                     |
                     \------------>[10]

The length of the branch from [0] to [6] is spread equally between the branch from [0] to [7] and the branch from [0] to [8] (and so on if there are actually more than one descendants). If keep_parent is True or if [6] is a terminal node, it is not removed.

root(outgroup, branch_split=0.5, reoriente=False)[source]

Root or reoriente the tree. By default, a new node is created to represent the root and is placed on the branch leading to the provided outgroup node (the second argument determines where the new node is placed on this branch). Otherwise, the tree is reoriented such as its base is placed at the location of the provided outgroup. In the former case, its ends with a bifurcation at the root; in the latter case, a trifurcation. It is illegal to call this method on trees that are already rooted (have a difurcation at the root).

Parameters:
  • outgroupNode instance contained in this tree. It can be a leaf or any internal node, but not the current base of the tree (unless reoriente is True: in that case, it might be the base of the tree [it will not change anything] and it cannot be a leaf).

  • branch_split – where to cut the branch leading to the outgroup.

  • reoriente – don’t create any root node branch_site is therefore not considered) and only place the node provided as outgroup at the base of the tree, thereby merely changing the representation of the tree.

The information below describes the case where reoriente is False (proper rooting).

If the branch to the provided outgroup doesn’t have a branch length, the branch_split argument is ignored. Otherwise, branch_split must be a real number between 0 and 1 and give the proportion of the branch that must be allocated to the basal branch leading to the outgroup, the complement being allocated to the branch leading to the rest of the tree. If branch_split is either 0 or 1, one of the branch will have a length of 0, but it will exist anyway.

If the original tree has this structure:

 /------------------------------------------->[1]
 |
 |             /----------------------------->[3]
 |             |
 |----------->[2]             /-------------->[5]
 |             |              |
 |             \------------>[4]
[0]                           |
 |                            \-------------->[6]
 |
 |              /---------------------------->[8]
 |              |
 \------------>[7]            /------------->[10]
                |             |
                \---[ROOT]-->[9]
                              |
                              \------------->[11]

And rooting is requested at node [9], the root will be placed on the branch marked by [ROOT]. The outcome will be as depicted below, with the introduction of a new node (marked [ROOT]) and the reorientation of the tree to place it at the base:

                     /-------------------------[1]
                     |
              /-----[0]     /------------------[3]
              |      |      |
              |      \-----[2]      /----------[5]
              |             |       |
  /---[E2]---[7]            \------[4]
  |           |                     |
  |           |                     \----------[6]
  |           |
[ROOT]        \--------------------------------[8]
  |
  |                     /---------------------[10]
  |                     |
  \--------[E1]--------[9]
                        |
                        \---------------------[11]

In this example, the relationship between nodes [7] and [0] (the previous base of the tree) is reverted. The label of node [7] is automatically transferred to node [0]. This is consistent with the idea that internal node labels describe a property of the branch. The original label of the base, if it exists, is discarded. Since the branch between [7] and [9] is cut in two, the original label of node [9] is copied to node [7], leaving them both with the same label. However, if the outgroup is a terminal node, the label is not copied and the other basal branch is left without label.

Let \(L\) be the length of the branch from [7] and [9] in the original tree, and \(r\) the value of the parameter branch_split. The length of the branch [E1] will be set to \(rL\), and the branch [E2] to \((1-r)L\). Overall, the length of the tree will not be modified.

In the case that reoriente is True, the final tree is rather:

 /------------------------------------------->[10]
 |
 |------------------------------------------->[11]
 |
[9]       /----------------------------------->[8]
 |        |
 \------>[7]       /-------------------------->[1]
          |        |
          \------>[0]      /------------------>[3]
                   |       |
                   \----->[2]        /-------->[5]
                           |         |
                           \------->[4]
                                     |
                                     \-------->[6]

The topology of the tree is the same as the initial one, except that the base is now [9]. The lengths of all branches are conserved. However, node labels between the old and the new base are reverted: the node label of the new base ([9] in the example) is affected to the next node ([7] in the example) and so on until the old base ([0] in the example), whose label, if it exists, is discarded.

total_length()[source]

Compute the sum of branch lengths. All branch lengths must be defined (non-None), otherwise a ValueError will be raised.

unroot(reverse=False)[source]

Remove the root. The tree must be initially rooted (bifurcation at the root). This method removes the root node and places the base of the tree at one of the two basal nodes (the nodes that are ancestral to the two basal groups). This method does not change to total length of the tree. And error is raised if only one of the two basal branches has a length. If the initial basal node has a label, it is lost. If the node that becomes the base has a label, it is left there (it will appear a the base of the tree).

Parameters:

reverse – if True, place the base of the tree at the second basal node (by default, the first basal node is used).

class egglib.Node(label=None)[source]

Manage a single node of a given tree. This class provides an interface to Tree instance’s nodes and allows access and modification of data attached to a given node as well as the tree descending from that node. A node must be understood as the point below a branch (in the direction towards leaves). So the length describe in the instance concerns the branch above the corresponding node (towards the base of the tree). Branches (connections between nodes) have a direction: they go from a node to another node. Nodes have therefore children and parents (a given node have one parent except the base of the tree which has none). Connecting a node to itself, making a two-way branch (to branches connecting the same two nodes in opposite directions) or duplicate branches (between the same two nodes and in the same direction) are illegal.

Parameters:

label – node label (in case of a terminal node, its leaf label), if needed. Labels, if provided, are expected to be strings for terminal nodes, and numeric values for internal nodes, but technically, all user-supplied values are accepted (however, some Tree methods require proper types).

Attributes

label

Node's label (modifiable).

num_children

Number of children connected to this node.

parent

Parent of this node.

parent_branch

Length of the branch to the parent.

Methods

branch_to(child)

Length of the branch to a child node.

child(idx)

Return a given child, as a Node instance.

children()

Return an iterator over this node's children.

has_descendant(node)

Tell if a Node is a descendant of this node.

is_child(node)

Tell if a Node is a child of this node.

is_parent(node)

Tell if a Node is the parent of this node.

leaves_down()

Recursively gets all leaf labels descending from that node.

leaves_up()

Recursively gets all leaf labels contained on the other side of the tree.

newick([skip_labels, skip_brlens])

Formats the node and the subtree descending from is as a newick string.

set_branch_to(child, brlen)

Set the length of the branch to a child node.

siblings()

List of other children of this node's parent.

branch_to(child)[source]

Length of the branch to a child node. Non-specified branch lengths are represented by None.

Parameters:

child – node whose branch length is requested. It can be represented by a direct reference (as a Node), or by its index in the children list. In that case, ensure that the index is currently valid.

child(idx)[source]

Return a given child, as a Node instance.

children()[source]

Return an iterator over this node’s children.

has_descendant(node)[source]

Tell if a Node is a descendant of this node.

is_child(node)[source]

Tell if a Node is a child of this node.

is_parent(node)[source]

Tell if a Node is the parent of this node. Passing None to an instance that has no parent will return True.

property label

Node’s label (modifiable).

leaves_down()[source]

Recursively gets all leaf labels descending from that node. If this is a terminal node, returns its label in a one-item list.

leaves_up()[source]

Recursively gets all leaf labels contained on the other side of the tree. In other words, get all leaves of the tree except those descending from this node). If this is the root node, returns an empty list.

newick(skip_labels=False, skip_brlens=False)[source]

Formats the node and the subtree descending from is as a newick string.

Parameters:
  • skip_labels – omit internal branch labels.

  • skip_brlens – omit the branch lengths.

property num_children

Number of children connected to this node.

property parent

Parent of this node. None if the node has no parent.

property parent_branch

Length of the branch to the parent. Non-specified branch lengths are represented by None. An exception is thrown upon accessing this attribute if this node has no parent. This attribute can be modified.

set_branch_to(child, brlen)[source]

Set the length of the branch to a child node.

Parameters:
  • child – node whose branch should be resized. It can be represented by a direct reference (as a Node), or by its index in the children list. In that case, ensure that the index is currently valid.

  • brlen – new branch length (it is allowed to pass None).

siblings()[source]

List of other children of this node’s parent. It is required that this node has a parent.