Diversity statistics

In the stats module, a number of tools are provided to compute diversity statistics from Site or Align instances. Some statistics apply to individual sites, some to sets of sites, and some to phased sequence alignments. Note that the objects may equally contain nucleotide sequences, protein sequences, microsatellite alleles (encoded as allele lengths or not), or any arbitrary representation of allelic diversity.

The alphabets define lists of alleles and their representation, but they are not used to decide which statistics can be computed. It is important to note that EggLib will compute any statistic you request from your data, even if it is meaningless. Pay special attention to statistics requiring phased data, since you can easily load unphased data into objects that can be used to compute those statistics.

In many cases, statistics that cannot be computed are returned as None, but only when they are technically not computable (because of missing data or the unavailability of a specific feature such as outgroup sequences or subpopulations).
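Because a non-computable statistic is reported as None rather than raising an error, code that consumes the results should test for None explicitly. The sketch below uses plain Python only. It assumes the statistics arrive as a dictionary mapping names to values; the keys and values shown are made up for illustration, and `split_computed` is a hypothetical helper, not part of EggLib.

```python
def split_computed(stats):
    """Separate computed statistics from those reported as None.

    `stats` is assumed to be a dict mapping statistic names to values.
    """
    # Test identity with None: a statistic equal to 0 is a valid result
    # and must not be discarded by a truthiness check.
    computed = {name: value for name, value in stats.items() if value is not None}
    missing = sorted(name for name, value in stats.items() if value is None)
    return computed, missing


# Made-up results for illustration (placeholder statistic names).
results = {"stat_a": 12, "stat_b": 0.0, "stat_c": None}
computed, missing = split_computed(results)
print(computed)
print(missing)
```

Keeping the names of the missing statistics makes it easier to check afterwards whether they were dropped for an expected reason, such as no outgroup or no population structure being supplied.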

In the sections of this chapter, we will present statistics available in the stats module. Statistics will be grouped by families (a family of statistics being a group of statistics that require the same type of data and the same kind of information). Most of the statistics are computed by stats.ComputeStats (see this tutorial section for an introduction), or by other functions available in the same module.

Outgroup

Some statistics require an outgroup. The outgroup should be included in the analysed dataset (Site or Align instance) but identified by means of a Structure instance. There may be more than one outgroup sample. The outgroup information is used to identify the ancestral variant (that is, the one shared with the outgroup) if the outgroup has one of the alleles present in the main sample (the ingroup) and, if there are several outgroup samples, all of them have the same allele. If the outgroup has an allele not found in the ingroup, or if the outgroup contains several alleles, then the site is considered not orientable and is not used for statistics requiring an outgroup. Statistics not requiring an outgroup are computed normally, though.
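The orientation rule described above can be sketched in a few lines of plain Python. This is an illustration of the rule as stated in the text, not EggLib's implementation: `ancestral_allele` is a hypothetical function, alleles are represented as arbitrary hashable values, and missing data are assumed to have been removed beforehand.

```python
from typing import Hashable, Optional, Sequence


def ancestral_allele(
    ingroup: Sequence[Hashable], outgroup: Sequence[Hashable]
) -> Optional[Hashable]:
    """Return the ancestral allele of a site, or None if it is not orientable."""
    outgroup_alleles = set(outgroup)
    # Not orientable if there is no outgroup sample, or if the
    # outgroup samples carry more than one allele.
    if len(outgroup_alleles) != 1:
        return None
    (candidate,) = outgroup_alleles
    # Not orientable if the outgroup allele is absent from the ingroup.
    if candidate not in set(ingroup):
        return None
    return candidate


print(ancestral_allele(list("AAACC"), ["A", "A"]))  # outgroup agrees, allele in ingroup
print(ancestral_allele(list("AAACC"), ["A", "C"]))  # outgroup carries two alleles
print(ancestral_allele(list("AAACC"), ["G"]))       # outgroup allele absent from ingroup
```

Only the first call returns an allele; the two others return None, which corresponds to sites that would be skipped by statistics requiring an outgroup.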

Population structure

Many statistics require that several populations are present, some require that an individual structure is defined, and a single statistic in stats.ComputeStats requires clusters of populations. Like the outgroup, the structure of samples is described by Structure instances (see here for an introduction). If the appropriate level of structure is not defined in the Structure provided to the class or function computing statistics (or if no Structure is provided), the concerned statistics will be None.

Here is the list of families of statistics that are described in the following sections: