Single site statistics

Statistics that can be computed for a single site are mainly aimed at genetic markers exhibiting many alleles (such as microsatellites). Some of them can be relevant to DNA polymorphism, but in most cases they should be averaged over many sites.

All of these statistics are computed by the class ComputeStats. The methods process_freq() and process_site() return the values for a single site, while process_sites() and process_align() compute an average over all provided sites.

Code | Definition | Formula | Requirement | Notes
ns_site | Number of analyzed samples | | | 1
ns_site_o | Number of analyzed outgroup samples | | Outgroup | 1
Aing | Number of alleles in ingroup | | |
Aotg | Number of alleles in outgroup | | Outgroup |
Atot | Number of alleles in whole dataset | | |
As | Number of singleton alleles | | | 2
Asd | Number of singleton alleles (derived) | | Outgroup | 2
R | Allelic richness | (1) | |
He | Expected heterozygosity | (2) | |
thetaIAM | \(\theta\) estimator under the IAM model | (3) | |
thetaSMM | \(\theta\) estimator under the SMM model | (4) | |
Ho | Observed heterozygosity | | Individuals | 3
Fis | Inbreeding coefficient | (5) | Individuals |
maf | Minority allele relative frequency | | |
maf_pop | Minority allele per population | | Populations | 4, 5
Hst | Hudson’s Hst | (6) | Populations |
Gst | Nei’s Gst | (7) | Populations |
Gste | Hedrick’s Gst’ | (8) | Populations |
Dj | Jost’s D | (9) | Populations |
FstWC | Weir and Cockerham estimator (haploid data) | (10) | Populations | 6
FistWC | Weir and Cockerham estimators (diploid data) | (11) (12) (13) | Populations, individuals | 6, 7
FisctWC | Weir and Cockerham estimators (hierarchical) | (14) (15) (16) (17) | Populations, individuals, clusters | 6, 7
numSp | Number of population-specific alleles | | Populations | 8
numSpd | Number of population-specific derived alleles | | Populations, outgroup | 8
numShA | Number of shared alleles | | Populations | 8
numShP | Number of shared segregating alleles | | Populations | 8
numFxA | Number of fixed alleles | | Populations | 8
numFxD | Number of fixed differences | | Populations | 8
numSp* | Number of sites with at least one population-specific allele | | Populations | 8, 9
numSpd* | Number of sites with at least one population-specific derived allele | | Populations, outgroup | 8, 9
numShA* | Number of sites with at least one shared allele | | Populations | 8, 9
numShP* | Number of sites with at least one shared segregating allele | | Populations | 8, 9
numFxA* | Number of sites with at least one fixed allele | | Populations | 8, 9
numFxD* | Number of sites with at least one fixed difference | | Populations | 8, 9
triconfig | Number of sites falling into fixation pattern categories | | Three populations | 10

Notes:

  1. Total number of samples excluding all samples with missing data. A sample is defined as a sampled allele (a diploid individual corresponds to two samples).

  2. A singleton allele is an allele present in one copy in the whole sample (excluding outgroup).

  3. Computed as the proportion of heterozygote individuals.

  4. Relative frequency in each population of the allele which is minority in the whole sample, even if it is absent or not minority in some populations.

  5. Returned as a list, even if there is only one population.

  6. Multi-site average is computed as the ratio of the sum of numerator terms to the sum of denominator terms for all exploitable sites.

  7. Returned as a list with the different estimators (see formulas).

  8. A population-specific allele is an allele which is at non-null frequency in one population only. A fixed allele is an allele which is at frequency 0 in at least one population and at (relative) frequency 1 in at least one population. A shared allele is an allele which is at non-null frequencies in at least two populations. A shared polymorphism is a pair of populations which have at least two common segregating (0 < relative frequency < 1) alleles. A fixed difference is a pair of populations which have two different alleles at relative frequency 1.

  9. Only computed if several sites are analyzed.

  10. Only biallelic sites meeting the missing data criterion are considered. The criterion is given by the configuration option triconfig_min (minimum number of samples per population, default 2) and max_missing, if relevant, is ignored. The result is given as a 13-item list, filled with zeros by default, giving the counts for the patterns in the following order (where A and B stand for two arbitrary alleles fixed in a population, and P a polymorphism of the two alleles in the population): ABB, ABA, AAB, PAA, PAB, APA, APB, AAP, ABP, PPA, PAP, APP, PPP.
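The pattern order in note 10 can be illustrated with a small pure-Python sketch. It is not EggLib code: the function names, the input layout (one list of alleles per population) and the labelling rule (the first fixed allele encountered is called A, the other B) are assumptions made for illustration, and the `min_samples` argument only mimics the role of triconfig_min.

```python
# Hypothetical helper, not part of EggLib: classify biallelic sites
# into the 13 triconfig patterns listed in note 10.
PATTERNS = ["ABB", "ABA", "AAB", "PAA", "PAB", "APA", "APB",
            "AAP", "ABP", "PPA", "PAP", "APP", "PPP"]

def triconfig_pattern(pop1, pop2, pop3, min_samples=2):
    """Return the pattern string for one site, or None if the site is skipped."""
    pops = (pop1, pop2, pop3)
    if any(len(pop) < min_samples for pop in pops):
        return None                      # not enough samples in a population
    if len(set(pop1) | set(pop2) | set(pop3)) != 2:
        return None                      # only biallelic sites are considered
    labels = {}                          # allele -> "A" or "B"
    code = ""
    for pop in pops:
        present = set(pop)
        if len(present) > 1:
            code += "P"                  # both alleles segregate in this population
        else:
            (allele,) = present
            if allele not in labels:
                labels[allele] = "AB"[len(labels)]
            code += labels[allele]
    return code

def triconfig_counts(sites):
    """Count patterns over sites; each site is a (pop1, pop2, pop3) tuple."""
    counts = [0] * len(PATTERNS)
    for site in sites:
        code = triconfig_pattern(*site)
        if code in PATTERNS:
            counts[PATTERNS.index(code)] += 1
    return counts

sites = [(["A", "A"], ["C", "C"], ["C", "C"]),   # ABB
         (["A", "C"], ["C", "C"], ["A", "A"])]   # PAB
print(triconfig_counts(sites))
```

The result has one counter per pattern, in the order given in note 10.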

Basic statistics

(1)\[R = \frac{k-1}{n-1}\]
(2)\[H_e = (1 - \sum_i^k {p_i}^2) \frac{n} {(n-1)}\]

with:

  • \(n\), the number of samples (given by ns_site)

  • \(k\), the number of alleles

  • \(p_i\), the relative frequency of allele \(i\)
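As a quick check of formulas (1) and (2), the following sketch computes both statistics from a plain list of sampled alleles. It is an illustration only and does not use the ComputeStats class; the function name and the input format are assumptions.

```python
from collections import Counter

def richness_and_he(alleles):
    """Return (R, He) for one site, given the sampled alleles (no missing data)."""
    n = len(alleles)                     # number of samples (ns_site)
    if n < 2:
        raise ValueError("at least two samples are required")
    counts = Counter(alleles)
    k = len(counts)                      # number of alleles
    r = (k - 1) / (n - 1)
    he = (1 - sum((c / n) ** 2 for c in counts.values())) * n / (n - 1)
    return r, he

# Six samples, three alleles with counts 3, 2 and 1.
r, he = richness_and_he(["A", "A", "A", "C", "C", "G"])
print(round(r, 4), round(he, 4))  # 0.4 0.7333
```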

Theta estimators

thetaIAM

(3)\[\hat{\theta}_{IAM} = \frac{H_e}{1 - H_e}\]

thetaSMM

(4)\[\hat{\theta}_{SMM} = \frac{1}{2} \left[ \frac{1}{(1 - H_e)^2} - 1 \right]\]
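Both estimators are simple functions of \(H_e\), as this minimal sketch shows. The function names are illustrative and are not EggLib identifiers; both expressions are undefined when \(H_e = 1\), which the sketch reports as an error.

```python
def theta_iam(he):
    """Formula (3): theta estimator under the infinite allele model."""
    if he >= 1:
        raise ValueError("He must be lower than 1")
    return he / (1 - he)

def theta_smm(he):
    """Formula (4): theta estimator under the stepwise mutation model."""
    if he >= 1:
        raise ValueError("He must be lower than 1")
    return 0.5 * (1 / (1 - he) ** 2 - 1)

print(theta_iam(0.5), theta_smm(0.5))  # 1.0 1.5
```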

Fixation index (departure from Hardy-Weinberg equilibrium)

(5)\[F_{IS} = 1 - \frac{H_o}{H_e}\]
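The sketch below combines note 3 (\(H_o\) as the proportion of heterozygous individuals) with formulas (2) and (5). It assumes diploid genotypes given as pairs of alleles without missing data; the function name and input format are illustrative, not EggLib's.

```python
from collections import Counter

def ho_he_fis(genotypes):
    """Return (Ho, He, Fis) from a list of diploid genotypes (pairs of alleles)."""
    ho = sum(a != b for a, b in genotypes) / len(genotypes)
    alleles = [allele for genotype in genotypes for allele in genotype]
    n = len(alleles)                     # two samples per individual
    counts = Counter(alleles)
    he = (1 - sum((c / n) ** 2 for c in counts.values())) * n / (n - 1)
    return ho, he, 1 - ho / he           # undefined (division by zero) if He is 0

genotypes = [("A", "A"), ("A", "A"), ("B", "B"), ("A", "B")]
ho, he, fis = ho_he_fis(genotypes)
print(round(ho, 4), round(he, 4), round(fis, 4))  # 0.25 0.5357 0.5333
```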

Population differentiation

In this section we define:

\(r\)

number of populations

\(n_i\)

sample size of population \(i\)

\(n_t\)

total sample size

\(k\)

number of alleles

\(p_i\)

relative frequency of allele \(i\) in the whole sample

\(p_{ij}\)

relative frequency of allele \(i\) in population \(j\)

and we exclude any populations with fewer than two samples.

\(H_{ST}\) (Hudson et al. Mol. Biol. Evol. 1992 9:138-151) is defined as follows:

(6)\[H_{ST} = 1 - \frac{H_{S_1}}{H_{T_1}}\]

with

\[H_{S_1} = \frac{1}{\sum_i^r (n_i - 2)} \sum_i^r (n_i-2) H_i\]

and

\[H_{T_1} = \frac{n_t}{n_t - 1} \left[ 1-\sum_i^k \left( \frac{1}{n_t}\sum_j^r p_{ij} n_j \right)^2 \right]\]

with:

\[H_i = \frac{n_i}{n_i-1} \left[ 1 - \sum_j^k {p_{ji}}^2 \right]\]
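The following sketch evaluates formula (6) from per-population lists of alleles. It assumes that the weights of \(H_{S_1}\) are \((n_i - 2)\) normalized by their sum over populations, and it uses the fact that \(H_{T_1}\) equals the unbiased heterozygosity of the pooled sample. The function name and input format are illustrative.

```python
from collections import Counter

def hudson_hst(pops):
    """Formula (6) for one site; pops is a list of per-population allele lists."""
    pops = [p for p in pops if len(p) >= 2]   # populations with < 2 samples are excluded

    def unbiased_h(sample):
        n = len(sample)
        return (1 - sum((c / n) ** 2 for c in Counter(sample).values())) * n / (n - 1)

    weights = [len(p) - 2 for p in pops]
    # Division by zero occurs if every population has exactly two samples.
    hs = sum(w * unbiased_h(p) for w, p in zip(weights, pops)) / sum(weights)
    ht = unbiased_h([allele for p in pops for allele in p])   # pooled sample
    return 1 - hs / ht

print(hudson_hst([["A", "A", "A", "B"], ["A", "B", "B", "B"]]))
```

For this example, \(H_{S_1} = 0.5\) and \(H_{T_1} = 4/7\), so the result is 0.125 up to floating-point rounding.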

Nei’s \(G_{ST}\) (Hudson et al. Mol. Biol. Evol. 1992 9:138-151) is defined as follows:

(7)\[G_{ST} = 1 - \frac{H_{S_2}}{\tilde{H}_T}\]

with

\[H_{S_2} = \frac{1}{n_t} \sum_i^r n_i H_i\]

and

\[\tilde{H}_T = 1 - \sum_i^k \left( \frac{1}{n_t} \sum_j^r p_{ij} n_j \right) ^2 + \frac{1}{r \cdot \tilde{n}} H_{S_2}\]

with

\[\tilde{n} = \frac{r} {\sum_i^r \frac{1}{n_i}}\]
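A direct transcription of formula (7) is sketched below, with \(H_i\) computed as in the definition of \(H_{ST}\). The function name and the input format (one list of alleles per population) are assumptions made for illustration.

```python
from collections import Counter

def nei_gst(pops):
    """Formula (7) for one site; pops is a list of per-population allele lists."""
    pops = [p for p in pops if len(p) >= 2]
    r = len(pops)
    sizes = [len(p) for p in pops]
    n_t = sum(sizes)

    def unbiased_h(sample):
        n = len(sample)
        return (1 - sum((c / n) ** 2 for c in Counter(sample).values())) * n / (n - 1)

    hs = sum(n * unbiased_h(p) for n, p in zip(sizes, pops)) / n_t
    n_tilde = r / sum(1 / n for n in sizes)          # harmonic mean of sample sizes
    pooled = Counter(allele for p in pops for allele in p)
    ht = 1 - sum((c / n_t) ** 2 for c in pooled.values()) + hs / (r * n_tilde)
    return 1 - hs / ht

print(nei_gst([["A", "A", "A", "B"], ["A", "B", "B", "B"]]))
```

Here \(H_{S_2} = 0.5\) and \(\tilde{H}_T = 0.5625\), which gives a value of 1/9.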

\(G_{ST}'\) (Hedrick Evolution 2005 59:1633-1638) is defined as:

(8)\[G'_{ST} = \frac{1 + H_{S_3}}{1 - H_{S_3}} \left( 1 - \frac{H_{S_3}}{H_{T_2}} \right)\]

with

\[H_{S_3} = \frac{1}{r} \sum_i^r \left( 1 - \sum_j^k {p_{ji}}^2 \right)\]

and

\[H_{T_2} = 1 - \sum_i^k \left( \frac{1}{r} \sum_j^r p_{ij} \right) ^2\]

Jost’s \(D\) (Mol. Ecol. 2008 17:4015-4026) is computed as:

(9)\[D = \frac{r}{r-1} \frac{H_{T_3} - H_{S_4}} {1 - H_{S_4}}\]

with:

\[H_{S_4} = \frac{\tilde{n}}{\tilde{n}-1} H_{S_3}\]

and

\[H_{T_3} = H_{T_2} + \frac{1}{r \cdot \tilde{n}} H_{S_4}\]
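Since \(D\) reuses \(H_{S_3}\) and \(H_{T_2}\), formulas (8) and (9) can be evaluated together. The sketch below is a plain transcription under the same illustrative input format as above (one list of alleles per population); the function name is hypothetical.

```python
from collections import Counter

def hedrick_gst_and_jost_d(pops):
    """Return (G'st, D) for one site, following formulas (8) and (9)."""
    pops = [p for p in pops if len(p) >= 2]
    r = len(pops)
    # Relative allele frequencies within each population.
    freqs = [{a: c / len(p) for a, c in Counter(p).items()} for p in pops]
    alleles = set().union(*freqs)

    hs3 = sum(1 - sum(f ** 2 for f in fr.values()) for fr in freqs) / r
    ht2 = 1 - sum((sum(fr.get(a, 0.0) for fr in freqs) / r) ** 2 for a in alleles)
    gst_prime = (1 + hs3) / (1 - hs3) * (1 - hs3 / ht2)

    n_tilde = r / sum(1 / len(p) for p in pops)      # harmonic mean of sample sizes
    hs4 = n_tilde / (n_tilde - 1) * hs3
    ht3 = ht2 + hs4 / (r * n_tilde)
    d = r / (r - 1) * (ht3 - hs4) / (1 - hs4)
    return gst_prime, d

gst_prime, d = hedrick_gst_and_jost_d([["A", "A", "A", "B"], ["A", "B", "B", "B"]])
print(round(gst_prime, 4), round(d, 4))  # 0.55 0.25
```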

F-statistics estimators

Estimators of F-statistics are based on Weir and Cockerham (Evolution 1984 38:1358-1370) and Weir and Hill (Annu. Rev. Genet. 2002 36:721-750).

Different estimators are available depending on which levels of structure are provided through a Structure instance.

Population structure only

If only the population structure is available, only the equivalent of \(F_{ST}\) (\(\hat{\theta}\) in Weir and Cockerham’s notation) is available.

\[n_c = \frac{1}{k - 1} \left( n_t - \frac{1}{n_t} \sum_p^k {n_p}^2 \right)\]

where \(n_p\) is the number of samples of population \(p\), \(n_t\) is the total number of samples, and \(k\) is the number of considered populations. Only populations with at least two samples are considered.

For a given allele \(i\), we compute:

\[\alpha_i = \frac{1}{k-1} \sum_p^k n_p (p_{ip} - \bar{p}_i) ^2\]
\[\delta_i = \frac{1}{n_t-k} \sum_p^k n_p \cdot p_{ip} (1-p_{ip})\]

where \(\bar{p}_i\) is the overall relative frequency of allele \(i\) in the whole sample and \(p_{ip}\) is the relative frequency of allele \(i\) in population \(p\).

The equivalent of \(F_{ST}\) is then computed as:

(10)\[\hat{\theta} = \frac{\sum_i^A (\alpha_i - \delta_i)}{\sum_i^A \left[ \alpha_i + (n_c - 1) \delta_i \right]}\]

where \(A\) is the number of alleles.
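The haploid estimator can be sketched as follows, summing the numerator and denominator terms over all alleles of the site before taking the ratio. The function name and input format (one list of alleles per population) are assumptions; this is not the EggLib implementation.

```python
from collections import Counter

def fst_wc_haploid(pops):
    """Formula (10) for one site; pops is a list of per-population allele lists."""
    pops = [p for p in pops if len(p) >= 2]          # at least two samples required
    k = len(pops)
    sizes = [len(p) for p in pops]
    n_t = sum(sizes)
    n_c = (n_t - sum(n * n for n in sizes) / n_t) / (k - 1)
    counts = [Counter(p) for p in pops]

    numerator = denominator = 0.0
    for allele in set().union(*counts):
        freqs = [c[allele] / n for c, n in zip(counts, sizes)]
        p_bar = sum(c[allele] for c in counts) / n_t  # frequency in the whole sample
        alpha = sum(n * (f - p_bar) ** 2 for n, f in zip(sizes, freqs)) / (k - 1)
        delta = sum(n * f * (1 - f) for n, f in zip(sizes, freqs)) / (n_t - k)
        numerator += alpha - delta
        denominator += alpha + (n_c - 1) * delta
    return numerator / denominator

print(fst_wc_haploid([["A", "A", "A", "B"], ["A", "B", "B", "B"]]))
```

In this example each allele contributes \(\alpha_i = 0.5\) and \(\delta_i = 0.25\) with \(n_c = 4\), giving an estimate of 0.2. Keeping the numerator and denominator separate also matches note 6, where multi-site averages are ratios of sums.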

Population and individual structure

If both population and individual structures are available, inbreeding can be decomposed into three terms: \(F\) (equivalent to \(F_{IT}\)), \(\theta\) (equivalent to \(F_{ST}\)), and \(f\) (equivalent to \(F_{IS}\)). The estimators of these fixation indices are defined below, following Weir and Cockerham (Evolution 1984 38:1358-1370).

The estimators are based on three components of variance, noted \(a\) (between populations), \(b\) (between individuals within populations), and \(c\) (within individuals):

\[a = \sum_i^A \frac{\bar{n}}{n_c} \left\{ s^2_i - \frac{1}{\bar{n}-1} \left[ \bar{p}_i(1-\bar{p}_i) - s^2_i\frac{k-1}{k} - \frac{\bar{h}_i}{4} \right] \right\}\]
\[b = \sum_i^A \frac{\bar{n}}{\bar{n}-1} \left[ \bar{p}_i(1-\bar{p}_i) - s^2_i \frac{k-1}{k} - \bar{h}_i\frac{2\bar{n}-1}{4\bar{n}} \right]\]
\[c = \sum_i^A \frac{1}{2} \bar{h}_i\]

with:

  • \(A\), the number of alleles

  • \(k\), the number of populations with at least one individual

  • \(\bar{n}\), the average number of individuals per population

  • \(\bar{p}_i\), the relative frequency of allele \(i\) in the whole sample

  • \(\bar{h}_i\), the proportion of individuals carrying allele \(i\) in the heterozygous state, calculated in the whole sample

  • \(s^2_i\), as defined below:

\[s^2_i = \frac{1}{(k-1)\bar{n}} \sum_p^k n_p (p_{ip} - \bar{p}_i)^2\]
  • \(n_c\), as defined below:

\[n_c = \frac{1}{k-1} \left( k \cdot \bar{n} - \frac{1}{k \cdot \bar{n}} \sum_p^k {n_p}^2 \right)\]
  • \(n_p\), the number of individuals in population \(p\)

  • \(p_{ip}\), the relative frequency of allele \(i\) in population \(p\)

The return value for FistWC is a tuple with the three F-statistics estimators: \(\left(\hat{f}, \hat{\theta}, \hat{F}\right)\), which are equivalent to \(\left(F_{IS}, F_{ST}, F_{IT}\right)\) and are defined as follows:

(11)\[1 - \hat{f} = \frac{c}{b+c}\]
(12)\[\hat{\theta} = \frac{a}{a+b+c}\]
(13)\[1 - \hat{F} = \frac{c}{a+b+c}\]
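The three variance components and formulas (11) to (13) can be sketched as below. The sketch takes the sample variance of allele frequencies as \(s^2_i = \sum_p n_p (p_{ip} - \bar{p}_i)^2 / ((k-1)\bar{n})\), as in Weir and Cockerham (1984), and assumes diploid genotypes given as pairs of alleles without missing data. The function name and input format are illustrative, not EggLib's.

```python
def fist_wc(pops):
    """Return (f, theta, F); pops is a list of populations, each a list of genotype pairs."""
    pops = [p for p in pops if len(p) >= 1]
    k = len(pops)
    sizes = [len(p) for p in pops]                   # individuals per population
    n_tot = sum(sizes)                               # equals k * n_bar
    n_bar = n_tot / k
    n_c = (n_tot - sum(n * n for n in sizes) / n_tot) / (k - 1)
    alleles = {x for p in pops for g in p for x in g}

    a = b = c = 0.0
    for al in alleles:
        freqs = [sum(g.count(al) for g in p) / (2 * len(p)) for p in pops]
        p_bar = sum(n * f for n, f in zip(sizes, freqs)) / n_tot
        # Proportion of individuals heterozygous for this allele, whole sample.
        h_bar = sum(1 for p in pops for g in p if g[0] != g[1] and al in g) / n_tot
        s2 = sum(n * (f - p_bar) ** 2 for n, f in zip(sizes, freqs)) / ((k - 1) * n_bar)
        core = p_bar * (1 - p_bar) - s2 * (k - 1) / k
        a += n_bar / n_c * (s2 - (core - h_bar / 4) / (n_bar - 1))
        b += n_bar / (n_bar - 1) * (core - h_bar * (2 * n_bar - 1) / (4 * n_bar))
        c += h_bar / 2
    return 1 - c / (b + c), a / (a + b + c), 1 - c / (a + b + c)

pop1 = [("A", "A"), ("A", "A"), ("A", "B"), ("A", "B")]
pop2 = [("B", "B"), ("B", "B"), ("A", "B"), ("A", "B")]
f, theta, F = fist_wc([pop1, pop2])
print(round(f, 4), round(theta, 4), round(F, 4))  # -0.2 0.3333 0.2
```

The three values satisfy \((1 - \hat{F}) = (1 - \hat{f})(1 - \hat{\theta})\), which provides a simple sanity check on any implementation.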

Clusters, population and individual structure

If, in addition, populations are grouped in clusters, it is possible to compute an additional fixation index: the between-population fixation index \(\theta\) (or \(F_{ST}\)) is subdivided into a between-population, within-cluster component \(\theta_1\) (or \(F_{SC}\)) and a between-cluster component \(\theta_2\) (or \(F_{CT}\)). The estimators are based on four components of variance, noted \(a\) (between clusters), \(b_2\) (between populations within clusters), \(b_1\) (between individuals within populations), and \(c\) (within individuals). They are computed as described in Weir and Cockerham (Evolution 1984 38:1358-1370).

\[a = \sum_i^A \frac{n_3 \epsilon_i - n_1 \delta_i - (n_3-n_1) \beta_i} {2 \cdot n_2 \cdot n_3}\]
\[b_2 = \sum_i^A \frac{\delta_i - \beta_i} {2 \cdot n_3}\]
\[b_1 = \sum_i^A \frac{1}{2} (\beta_i - \alpha_i)\]
\[c = \sum_i^A \alpha_i\]

\(\alpha\) (MSG in Weir and Cockerham’s article) is computed as:

\[\alpha_i = \frac{1}{2 n} \sum_p^k h_{ip}\]

\(\beta\) (MSI in Weir and Cockerham’s article) is computed as:

\[\beta_i = \frac{2 \sum_p^k n_p p_{ip} (1-p_{ip}) - \frac{1}{2} \sum_p^k h_{ip}} {n - k}\]

\(\delta\) (MSD in Weir and Cockerham’s article) is computed as:

\[\delta_i = \frac{2}{k - r} \sum_p^k n_p (p_{ip} - p_{ic_p}) ^2\]

\(\epsilon\) (MSP in Weir and Cockerham’s article) is computed as:

\[\epsilon_i = \frac{2}{r-1}\sum_c^r n_c (p_{ic} - p_i) ^2\]

with:

  • \(k\) number of populations with at least one individual

  • \(r\) number of clusters with at least one population

  • \(n\) total number of individuals (in considered populations)

  • \(n_p\) number of individuals in population \(p\)

  • \(n_c\) number of individuals in cluster \(c\)

  • \(p_i\) relative frequency of allele \(i\) in the whole sample

  • \(p_{ip}\) relative frequency of allele \(i\) in population \(p\)

  • \(p_{ic_p}\) relative frequency of allele \(i\) in the cluster containing population \(p\)

  • \(p_{ic}\) relative frequency of allele \(i\) in the cluster \(c\)

  • \(h_{ip}\) number of heterozygote individuals carrying allele \(i\) in population \(p\)

The return value for FisctWC is a tuple with the four F-statistics estimators: \(\left(\hat{f}, \hat{\theta}_1, \hat{\theta}_2, \hat{F}\right)\), which are equivalent to \(\left(F_{IS}, F_{SC}, F_{CT}, F_{IT}\right)\) and are defined as follows:

(14)\[1 - \hat{f} = \frac{c}{b_1+c}\]
(15)\[\hat{\theta}_1 = \frac{a+b_2}{a+b_2+b_1+c}\]
(16)\[\hat{\theta}_2 = \frac{a}{a+b_2+b_1+c}\]
(17)\[1 - \hat{F} = \frac{c}{a+b_2+b_1+c}\]