Unphased sites statistics

The following statistics are designed to be computed over a set of sites but do not require that the sites are phased. Most of them are applicable to an alignment of DNA sequences (or for a set of single-nucleotide polymorphism markers).

They are computed by process_align() and process_sites() of ComputeStats, as well as process_freq() and process_site() in the multiple site mode.

Code

Definition

Equation

Requirement

Notes

nseff

Average number of exploitable samples

1

lseff

Number of analysed sites

2

nsmax

Maximal number of available samples per site

S

Number of segregating sites

Ss

Number of sites with one singleton allele

eta

Minimal number of mutations

3

sites

Polymorphic sites

5

singl

Sites with one singleton allele

5

nall

Number of alleles per polymorphic site

frq

Allelic frequencies per polymorphic site

frqp

Population allelic frequencies per polymorphic site

thetaW

Watterson’s estimator of \(\theta\)

(1)

4

Pi

Nucleotide diversity

(2)

4

lseffo

Number of analysed orientable sites

Outgroup

nseffo

Average number of exploitable samples at orientable sites

Outgroup

nsmaxo

Maximal number of available samples per orientable site

Outgroup

2

sites_o

Orientable polymorphic sites

Outgroup

5

singl_o

Orientable sites with one singleton allele

Outgroup

5

So

Number of segregating orientable sites

Outgroup

Sso

Number of orientable sites with one singleton allele

Outgroup

nsingld

Number of sites with one derived singleton allele

Outgroup

etao

Minimal number of mutations are orientable sites

Outgroup

3

D

Tajima’s D

(3)

Deta

Tajima’s D using eta instead of S

(4)

Dfl

Fu and Li’s D

(5)

Outgroup

F

Fu and Li’s F

(6)

Outgroup

D*

Fu and Li’s D*

(7)

F*

Fu and Li’s F*

(8)

nM

Number of sites used for the MFDM test

Outgroup

pM

P-value of MDFM test

(9)

Outgroup

thetaPi

Pi using orientable sites

(10)

Outgroup

4

thetaH

Fay and Wu’s estimator of \(\theta\)

(11)

Outgroup

4

thetaL

Zeng et al.’s estimator of \(\theta\)

(12)

Outgroup

4

Hns

Fay and Wu’s H (unstandardized)

(13)

Outgroup

Hsd

Fay and Wu’s H (standardized)

(14)

Outgroup

E

Zeng et al.’s E

(15)

Outgroup

Dxy

Pairwise distance

(16)

Two populations

6

Da

Net pairwise distance

(17)

Two populations

6

Notes:

  1. The number of exploitable samples may vary between sites due to missing data.

  2. Number of sites considered for polymorphism detection, after discarding sites with too many missing data in the case of the method process_align() (controlled by the parameter max_missing). Sites with less than two non-missing samples are always discarded.

  3. This value is properly computed even if sites with more than two alleles are excluded.

  4. Provided per gene (must be divided by lseff or lseffo to be expressed per site).

  5. Returned as a list containing the index of all concerned sites.

  6. Only computed if there are two populations exactly.

Level of diversity

The so-called Watterson’s estimator of \(\theta\) (theta_W) is mentioned in Watterson (Theor. Popul. Biol. 1975, 7:256-276).

(1)\[\hat{\theta}_{W} = \frac{S}{\sum_i^{n-1}\frac{1}{i}}\]

where n is equal to nseff rounded to the closest integer.

Nucleotide diversity (Pi) is given by:

(2)\[\pi = \sum_i \left[ (1 - \sum_j {p_{i,j}}^2) \frac{n_i} {(n_i-1)} \right]\]

where \(p_{i,j}\) is the relative frequency of allele j at site i and \(n_i\) is the number of exploitable samples at site i.

Tajima’s D

Tajima’s D (Genetics 1989 123:585-595) is computed as follows:

(3)\[D = \frac{\pi - \hat{\theta}_W} {\sqrt{V(\pi - \hat{\theta}_W)}}\]

where the variance is computed as follows:

\[a_1 = \sum_i^{n-1} \frac{1}{i}\]
\[a_2 = \sum_i^{n-1} \frac{1}{i^2}\]
\[b_1 = \frac{n+1} {3 * (n-1)}\]
\[b_2 = \frac{2(n^2 + n + 3)} {9 n (n-1)}\]
\[c_1 = b_1 - \frac{1}{a_1}\]
\[c_2 = b_2 - \frac{n+2} {a_1 n} + \frac{a_2}{{a_1}^2}\]
\[e_1 = \frac{c_1}{a_1}\]
\[e_2 = \frac{c_2}{{a_1}^2 + a_2}\]
\[V(\pi - \hat{\theta}_W) = e_1 S + e_2 S (S-1)\]

A variant is available where \(\eta\) (the minimal number of mutation) is used instead of \(S\):

(4)\[D = \frac{\pi - \eta/a_1} {\sqrt{V(\pi - \eta/a_1)}}\]

with:

\[V(\pi - \eta/a_1) = e_1 \eta + e_2 \eta (\eta-1)\]

Fu and Li’s tests with an outgroup

Fu and Li (Genetics 1993 133:693-709) proposed alternatives to Tajima’s D computed as follows:

(5)\[D = \frac{\eta - a_1 \eta_e} {\sqrt{u_D \eta + v_D \eta ^2}}\]

and

(6)\[F = \frac{\sum H_{e_i} - \eta_e} {\sqrt{u_F \eta + v_F \eta ^2}}\]

where \(\eta\) is the minimal number of mutations at orientable sites, \(\eta_e\) is the total number of singletons at orientable sites, and \(H_{e_i}\) is the heterozygosity at site \(i\).

\[c_n = \frac{2 [n a_1 - 2 (n - 1)]} {(n - 1) (n - 2)}\]

(If \(n\) is equal to 2, \(c_n\) is set to 1.)

\[v_D = 1 + \frac{{a_1}^2} {b_n + {a_1}^2} \left( c_n - \frac{n + 1} {n - 1} \right)\]
\[u_D = a_1 - 1 - v_D\]

The variance for F is computed as follows:

\[v_F = \frac{1}{{a_1}^2 + b_n} \left[c_n + \frac{2 (n^2 + n + 3)}{9 n (n-1)} - \frac{2}{n-1} \right]\]
\[u_F = \frac{1}{a_1} \left[ 1 + \frac{n+1} {3 (n-1)} - 4 \frac{n+1}{(n-1)^2} \left( a_1 + \frac{1}{n} - \frac{2n}{n+1} \right) \right] - v_F\]

Variables are computed as for Tajima’s D but considering only orientable sites.

Fu and Li’s tests without outgroup

The following tests don’t require an outgroup:

(7)\[D* = \frac{\frac{n} {n-1} \eta - a_1 \eta_e} {\sqrt{u_D \eta + v_D \eta ^ 2}}\]
(8)\[F* = \frac{\pi - (n - 1) \frac{\eta_e}{n}} {\sqrt{u_F \eta + v_F \eta^2}}\]
\[c_n = \frac{2 [n a_1 - 2 (n - 1)]} {(n - 1) (n - 2)}\]
\[d_n = c_n + \frac{n-2} {(n-1)^2} + \frac{2} {n-1} \left[ \frac{3}{2} - \frac{2 (a_1 + \frac{1}{n})-3} {n-2} - \frac{1}{n} \right]\]
\[v_D = \frac{\frac{n^2} {(n-1)^2} a_2 + {a_1}^2 d_n - a_1 (a_1+1) \frac{2n} {(n-1)^2}} {{a_1}^2 + a_2}\]
\[u_D = \frac{n} {n-1} \left( a_1 - \frac{n} {n-1} \right) - v_D\]
\[v_F = \frac{1}{{a_1}^2+a_2} \left[ \frac{2n^3 + 110n^2 - 255n + 153} {9n^2(n-1)} + \frac{2(n-1)a_1}{n^2} - \frac{8a_2}{n} \right]\]
\[u_F = \frac{1}{a_1} \frac{4n^2 + 19n + 3 - 12(n+1)(a_1+\frac{1}{n})}{3n(n-1)} - v_F\]

\(n\) is nseff rounded to unity as for Tajima’s D, \(\eta_e\) is the total number of singletons. Expressions for \(v_F\) and \(u_F\) are given by Simonsen et al. (Genetics 1995 141:413-429).

MFDM test

The P-value of the MFDM (maximum frequency of derived mutation) test (Li Mol. Biol. Evol. 2011 28:365-375) is computed as follows:

(9)\[P = \min_i \left( 2 \frac{n_i - \max_j{d_{i,j}}}{n_i - 1} \right)\]

where \(n_i\) is the number of available samples at site i and \(d_{i,j}\) is the absolute frequency of the derived allele j, assuming that its frequency is more than half of the sample. If no site has a derived allele most frequent than half of the sample, the P-value is set to 1. If no site has a derived allele at least as frequent as half of the sample, the P-value is undefined.

Neutrality tests with an outgroup

Statistics defined by Fay and Wu (Genetics 2000 155:1405-1413) and Zeng et al. (Genetics 2006 174:1431-1439).

Three additional \(\theta\) estimators are defined based on orientable sites:

(10)\[\hat{\theta}_\pi = \frac{2}{n_{max}(n_{max}-1)} \sum_i^{n_{max}} i (n_{max}-1) S_i\]
(11)\[\hat{\theta}_H = \frac{2}{n_{max}(n_{max}-1)} \sum_i^{n_{max}} i^2 S_i\]
(12)\[\hat{\theta}_L = \frac{1}{n_{max}-1} \sum_i^{n_{max}} i S_i\]

where \(n_{max}\) is the maximal number of exploitable samples over orientable sites, and \(S_i\) is the number of derived alleles (aggregating all alleles from all considered sites) which have been found in i copies.

The following test statistics are defined. First, the non-standardized H statistic of Fay and Wu:

(13)\[H_{ns} = \hat{\theta}_\pi - \hat{\theta}_H\]

Second, the standardized version of the above:

(14)\[H_{sd} = \frac{\hat{\theta}_\pi - \hat{\theta}_L}{\sqrt{V(\hat{\theta}_\pi - \hat{\theta}_L)}}\]

with the numerator variance estimated as follows:

\[a_1 = \sum_i^{n_o'} \frac{1}{i}\]
\[b_n = \sum_i^{n_o'} \frac{1}{i^2}\]
\[b_{np1} = b_n + \frac{1}{{n_o'}^2}\]
\[\theta = \frac{\eta_o}{a_1}\]
\[\theta_2 = \frac{\eta_o(\eta_o - 1)}{{a_1}^2 + b_n}\]
\[V(\hat{\theta}_\pi - \hat{\theta}_L) = \theta \frac{n_o-2}{6(n_o-1)} + \theta_2 \frac{18{n_o}^2(3n_o+2)b_{np1} - (88n_o^3+9{n_o}^2-13n_o+6)} {9n_o(n_o-1)^2}\]

where \(\eta_o\) is equal to etao, the total number of mutations at orientable sites, \(n_o\) is equal to nseffo, the average number of samples at orientable sites, and \(n_o'\) is nseffo rounded to unity.

Third, the E statistic:

(15)\[E = \frac{\hat{\theta}_E - \theta}{\sqrt{V(\hat{\theta}_E - \theta)}}\]

with:

\[V(\hat{\theta}_E - \theta) = \theta \left[ \frac{n_o}{2(n_o-1)}-\frac{1}{a_1} \right] + \theta_2 \left[ \frac{b_n}{{a_1}^2} + 2 b_n \left( \frac{n_o}{n_o-1} \right)^2 - 2 \frac{n_o \cdot b_n-n_o+1}{(n_o-1)a_1} - \frac{3n_o+1}{n_o-1} \right]\]

Pairwise population distance

Here is how pairwise distance is computed (Nei 1987 Molecular Evolutionary Genetics), with

(16)\[D_{xy} = \frac{1}{L} \sum_i^L 1 - \sum_j^{k_i} p_{ij1} p_{ij2}\]

Here is the formula for the net pairwise distance:

(17)\[D_a = D_{xy} - \frac {\pi_1 + \pi_2} {2L}\]

where \(L\) is the number of sites, \(k_i\) is the number of alleles at site \(i\), \(p_{ijk}\) is the relative frequency of allele \(j\) of site \(i\) in population \(k\), and \(\pi_k\) is \(\pi\) for population \(k\). These statistics are only computed for a pair of populations.