Unphased sites statistics¶

The following statistics are designed to be computed over a set of sites but do not require that the sites are phased. Most of them are applicable to an alignment of DNA sequences (or for a set of single-nucleotide polymorphism markers).

They are computed by process_align() and process_sites() of ComputeStats, as well as process_freq() and process_site() in the multiple site mode.

Code	Definition	Equation	Requirement	Notes
`nseff`	Average number of exploitable samples			1
`lseff`	Number of analysed sites			2
`nsmax`	Maximal number of available samples per site
`S`	Number of segregating sites
`Ss`	Number of sites with one singleton allele
`eta`	Minimal number of mutations			3
`sites`	Polymorphic sites			5
`singl`	Sites with one singleton allele			5
`nall`	Number of alleles per polymorphic site
`frq`	Allelic frequencies per polymorphic site
`frqp`	Population allelic frequencies per polymorphic site
`thetaW`	Watterson’s estimator of \(\theta\)	(1)		4
`Pi`	Nucleotide diversity	(2)		4
`lseffo`	Number of analysed orientable sites		Outgroup
`nseffo`	Average number of exploitable samples at orientable sites		Outgroup
`nsmaxo`	Maximal number of available samples per orientable site		Outgroup	2
`sites_o`	Orientable polymorphic sites		Outgroup	5
`singl_o`	Orientable sites with one singleton allele		Outgroup	5
`So`	Number of segregating orientable sites		Outgroup
`Sso`	Number of orientable sites with one singleton allele		Outgroup
`nsingld`	Number of sites with one derived singleton allele		Outgroup
`etao`	Minimal number of mutations are orientable sites		Outgroup	3
`D`	Tajima’s D	(3)
`Deta`	Tajima’s D using `eta` instead of `S`	(4)
`Dfl`	Fu and Li’s D	(5)	Outgroup
`F`	Fu and Li’s F	(6)	Outgroup
`D*`	Fu and Li’s D*	(7)
`F*`	Fu and Li’s F*	(8)
`nM`	Number of sites used for the MFDM test		Outgroup
`pM`	P-value of MDFM test	(9)	Outgroup
`thetaPi`	Pi using orientable sites	(10)	Outgroup	4
`thetaH`	Fay and Wu’s estimator of \(\theta\)	(11)	Outgroup	4
`thetaL`	Zeng et al.’s estimator of \(\theta\)	(12)	Outgroup	4
`Hns`	Fay and Wu’s H (unstandardized)	(13)	Outgroup
`Hsd`	Fay and Wu’s H (standardized)	(14)	Outgroup
`E`	Zeng et al.’s E	(15)	Outgroup
`Dxy`	Pairwise distance	(16)	Two populations	6
`Da`	Net pairwise distance	(17)	Two populations	6

Notes:

The number of exploitable samples may vary between sites due to missing data.
Number of sites considered for polymorphism detection, after discarding sites with too many missing data in the case of the method process_align() (controlled by the parameter max_missing). Sites with less than two non-missing samples are always discarded.
This value is properly computed even if sites with more than two alleles are excluded.
Provided per gene (must be divided by lseff or lseffo to be expressed per site).
Returned as a list containing the index of all concerned sites.
Only computed if there are two populations exactly.

Level of diversity¶

The so-called Watterson’s estimator of \(\theta\) (theta_W) is mentioned in Watterson (Theor. Popul. Biol. 1975, 7:256-276).

(1)¶\[\hat{\theta}_{W} = \frac{S}{\sum_i^{n-1}\frac{1}{i}}\]

where n is equal to nseff rounded to the closest integer.

Nucleotide diversity (Pi) is given by:

(2)¶\[\pi = \sum_i \left[ (1 - \sum_j {p_{i,j}}^2) \frac{n_i} {(n_i-1)} \right]\]

where \(p_{i,j}\) is the relative frequency of allele j at site i and \(n_i\) is the number of exploitable samples at site i.

Tajima’s D¶

Tajima’s D (Genetics 1989 123:585-595) is computed as follows:

(3)¶\[D = \frac{\pi - \hat{\theta}_W} {\sqrt{V(\pi - \hat{\theta}_W)}}\]

where the variance is computed as follows:

\[a_1 = \sum_i^{n-1} \frac{1}{i}\]

\[a_2 = \sum_i^{n-1} \frac{1}{i^2}\]

\[b_1 = \frac{n+1} {3 * (n-1)}\]

\[b_2 = \frac{2(n^2 + n + 3)} {9 n (n-1)}\]

\[c_1 = b_1 - \frac{1}{a_1}\]

\[c_2 = b_2 - \frac{n+2} {a_1 n} + \frac{a_2}{{a_1}^2}\]

\[e_1 = \frac{c_1}{a_1}\]

\[e_2 = \frac{c_2}{{a_1}^2 + a_2}\]

\[V(\pi - \hat{\theta}_W) = e_1 S + e_2 S (S-1)\]

A variant is available where \(\eta\) (the minimal number of mutation) is used instead of \(S\):

(4)¶\[D = \frac{\pi - \eta/a_1} {\sqrt{V(\pi - \eta/a_1)}}\]

with:

\[V(\pi - \eta/a_1) = e_1 \eta + e_2 \eta (\eta-1)\]

Fu and Li’s tests with an outgroup¶

Fu and Li (Genetics 1993 133:693-709) proposed alternatives to Tajima’s D computed as follows:

(5)¶\[D = \frac{\eta - a_1 \eta_e} {\sqrt{u_D \eta + v_D \eta ^2}}\]

and

(6)¶\[F = \frac{\sum H_{e_i} - \eta_e} {\sqrt{u_F \eta + v_F \eta ^2}}\]

where \(\eta\) is the minimal number of mutations at orientable sites, \(\eta_e\) is the total number of singletons at orientable sites, and \(H_{e_i}\) is the heterozygosity at site \(i\).

\[c_n = \frac{2 [n a_1 - 2 (n - 1)]} {(n - 1) (n - 2)}\]

(If \(n\) is equal to 2, \(c_n\) is set to 1.)

\[v_D = 1 + \frac{{a_1}^2} {b_n + {a_1}^2} \left( c_n - \frac{n + 1} {n - 1} \right)\]

\[u_D = a_1 - 1 - v_D\]

The variance for F is computed as follows:

\[v_F = \frac{1}{{a_1}^2 + b_n} \left[c_n + \frac{2 (n^2 + n + 3)}{9 n (n-1)} - \frac{2}{n-1} \right]\]

\[u_F = \frac{1}{a_1} \left[ 1 + \frac{n+1} {3 (n-1)} - 4 \frac{n+1}{(n-1)^2} \left( a_1 + \frac{1}{n} - \frac{2n}{n+1} \right) \right] - v_F\]

Variables are computed as for Tajima’s D but considering only orientable sites.

Fu and Li’s tests without outgroup¶

The following tests don’t require an outgroup:

(7)¶\[D* = \frac{\frac{n} {n-1} \eta - a_1 \eta_e} {\sqrt{u_D \eta + v_D \eta ^ 2}}\]

(8)¶\[F* = \frac{\pi - (n - 1) \frac{\eta_e}{n}} {\sqrt{u_F \eta + v_F \eta^2}}\]

\[c_n = \frac{2 [n a_1 - 2 (n - 1)]} {(n - 1) (n - 2)}\]

\[d_n = c_n + \frac{n-2} {(n-1)^2} + \frac{2} {n-1} \left[ \frac{3}{2} - \frac{2 (a_1 + \frac{1}{n})-3} {n-2} - \frac{1}{n} \right]\]

\[v_D = \frac{\frac{n^2} {(n-1)^2} a_2 + {a_1}^2 d_n - a_1 (a_1+1) \frac{2n} {(n-1)^2}} {{a_1}^2 + a_2}\]

\[u_D = \frac{n} {n-1} \left( a_1 - \frac{n} {n-1} \right) - v_D\]

\[v_F = \frac{1}{{a_1}^2+a_2} \left[ \frac{2n^3 + 110n^2 - 255n + 153} {9n^2(n-1)} + \frac{2(n-1)a_1}{n^2} - \frac{8a_2}{n} \right]\]

\[u_F = \frac{1}{a_1} \frac{4n^2 + 19n + 3 - 12(n+1)(a_1+\frac{1}{n})}{3n(n-1)} - v_F\]

\(n\) is nseff rounded to unity as for Tajima’s D, \(\eta_e\) is the total number of singletons. Expressions for \(v_F\) and \(u_F\) are given by Simonsen et al. (Genetics 1995 141:413-429).

MFDM test¶

The P-value of the MFDM (maximum frequency of derived mutation) test (Li Mol. Biol. Evol. 2011 28:365-375) is computed as follows:

(9)¶\[P = \min_i \left( 2 \frac{n_i - \max_j{d_{i,j}}}{n_i - 1} \right)\]

where \(n_i\) is the number of available samples at site i and \(d_{i,j}\) is the absolute frequency of the derived allele j, assuming that its frequency is more than half of the sample. If no site has a derived allele most frequent than half of the sample, the P-value is set to 1. If no site has a derived allele at least as frequent as half of the sample, the P-value is undefined.

Neutrality tests with an outgroup¶

Statistics defined by Fay and Wu (Genetics 2000 155:1405-1413) and Zeng et al. (Genetics 2006 174:1431-1439).

Three additional \(\theta\) estimators are defined based on orientable sites:

(10)¶\[\hat{\theta}_\pi = \frac{2}{n_{max}(n_{max}-1)} \sum_i^{n_{max}} i (n_{max}-1) S_i\]

(11)¶\[\hat{\theta}_H = \frac{2}{n_{max}(n_{max}-1)} \sum_i^{n_{max}} i^2 S_i\]

(12)¶\[\hat{\theta}_L = \frac{1}{n_{max}-1} \sum_i^{n_{max}} i S_i\]

where \(n_{max}\) is the maximal number of exploitable samples over orientable sites, and \(S_i\) is the number of derived alleles (aggregating all alleles from all considered sites) which have been found in i copies.

The following test statistics are defined. First, the non-standardized H statistic of Fay and Wu:

(13)¶\[H_{ns} = \hat{\theta}_\pi - \hat{\theta}_H\]

Second, the standardized version of the above:

(14)¶\[H_{sd} = \frac{\hat{\theta}_\pi - \hat{\theta}_L}{\sqrt{V(\hat{\theta}_\pi - \hat{\theta}_L)}}\]

with the numerator variance estimated as follows:

\[a_1 = \sum_i^{n_o'} \frac{1}{i}\]

\[b_n = \sum_i^{n_o'} \frac{1}{i^2}\]

\[b_{np1} = b_n + \frac{1}{{n_o'}^2}\]

\[\theta = \frac{\eta_o}{a_1}\]

\[\theta_2 = \frac{\eta_o(\eta_o - 1)}{{a_1}^2 + b_n}\]

\[V(\hat{\theta}_\pi - \hat{\theta}_L) = \theta \frac{n_o-2}{6(n_o-1)} + \theta_2 \frac{18{n_o}^2(3n_o+2)b_{np1} - (88n_o^3+9{n_o}^2-13n_o+6)} {9n_o(n_o-1)^2}\]

where \(\eta_o\) is equal to etao, the total number of mutations at orientable sites, \(n_o\) is equal to nseffo, the average number of samples at orientable sites, and \(n_o'\) is nseffo rounded to unity.

Third, the E statistic:

(15)¶\[E = \frac{\hat{\theta}_E - \theta}{\sqrt{V(\hat{\theta}_E - \theta)}}\]

with:

\[V(\hat{\theta}_E - \theta) = \theta \left[ \frac{n_o}{2(n_o-1)}-\frac{1}{a_1} \right] + \theta_2 \left[ \frac{b_n}{{a_1}^2} + 2 b_n \left( \frac{n_o}{n_o-1} \right)^2 - 2 \frac{n_o \cdot b_n-n_o+1}{(n_o-1)a_1} - \frac{3n_o+1}{n_o-1} \right]\]

Pairwise population distance¶

Here is how pairwise distance is computed (Nei 1987 Molecular Evolutionary Genetics), with

(16)¶\[D_{xy} = \frac{1}{L} \sum_i^L 1 - \sum_j^{k_i} p_{ij1} p_{ij2}\]

Here is the formula for the net pairwise distance:

(17)¶\[D_a = D_{xy} - \frac {\pi_1 + \pi_2} {2L}\]

where \(L\) is the number of sites, \(k_i\) is the number of alleles at site \(i\), \(p_{ijk}\) is the relative frequency of allele \(j\) of site \(i\) in population \(k\), and \(\pi_k\) is \(\pi\) for population \(k\). These statistics are only computed for a pair of populations.