Phased sites statistics¶

The following statistics are designed to be computed over a set of phased sites. If individuals are defined, alleles within individuals must be phased as well (with the exception of \(\bar{r}_d\)).

They are computed by process_align() and process_sites() of ComputeStats, as well as process_site() in the multiple site mode, but not process_freq().

Code	Definition	Equation	Requirement	Notes
`R2`	Ramos-Onsins and Rozas’s \(R_2\)	(1)
`R3`	Ramos-Onsins and Rozas’s \(R_3\)	(1)
`R4`	Ramos-Onsins and Rozas’s \(R_4\)	(1)
`Ch`	Ramos-Onsins and Rozas’s \(Ch\)	(2)
`R2E`	Ramos-Onsins and Rozas’s \(R_{2E}\)		Outgroup	1
`R3E`	Ramos-Onsins and Rozas’s \(R_{3E}\)		Outgroup	1
`R4E`	Ramos-Onsins and Rozas’s \(R_{4E}\)		Outgroup	1
`ChE`	Ramos-Onsins and Rozas’s \(Ch_E\)		Outgroup	1
`B`	Wall’s B statistic	(3)
`Q`	Wall’s Q statistic	(4)
`Ki`	Number of haplotypes (only ingroup)
`Kt`	Total number of haplotypes (including outgroup)
`FstH`	Hudson et al.’s \(F_{ST}\)	(5)	Populations
`Kst`	Hudson et al.’s \(K_{ST}\)	(6)	Populations
`Snn`	Hudson’s nearest neighbour statistic	(7)	Populations
`rD`	\(\bar{r}_d\) statistic	(8)		2
`Rmin`	Minimal number of recombination events			3
`RminL`	Number of sites used to compute Rmin			3
`Rintervals`	List of start/end positions of recombination intervals			3
`nPairs`	Number of allele pairs used for \(Z_{nS}\) and related statistics
`nPairsAdj`	Allele pairs at adjacent sites (used for \(ZZ\) and \(Z_A\))
`ZnS`	Kelly et al.’s \(Z_{nS}\)	(9)
`Z*nS`	Kelly et al.’s \(Z^*_{nS}\)	(10)
`Z*nS*`	Kelly et al.’s \(Z^*_{nS}{}^*\)	(11)
`Za`	Rozas et al.’s \(Z_A\)	(12)
`ZZ`	Rozas et al.’s \(ZZ\)	(12)
`Fs`	Fu’s \(F_S\)	(13)

Notes:

1. Based on mutations on external branches (that is, derived singletons) instead of all singletons.
2. Does not require that alleles within individuals are phased.
3. The minimal number of recombination events (Rmin) is computed after Hudson and Kaplan (Genetics 1985 111:147-164). Briefly, this number is equal to the minimal number of non-overlapping segments defined by incompatible sites (that is, pairs of sites that display all four possible two-site haplotypes, breaking the rule that at most three are expected without recombination). Sites with missing data or with more than two alleles are skipped. The number of sites used for this analysis and the positions of those intervals are provided as RminL and Rintervals, respectively.
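The interval logic behind Rmin can be illustrated with a minimal sketch. It is not the library's implementation: the function names `four_gametes` and `rmin` are hypothetical, the input is assumed to be already restricted to biallelic sites without missing data, intervals are reported as site indexes rather than positions, and the greedy selection by right endpoint is one standard way (assumed here) to obtain a maximal set of non-overlapping intervals. Intervals that only share an endpoint are treated as non-overlapping.

```python
from itertools import combinations


def four_gametes(site_a, site_b):
    """Return True if all four two-site haplotypes are observed."""
    return len(set(zip(site_a, site_b))) == 4


def rmin(sites):
    """Count non-overlapping intervals delimited by incompatible site pairs.

    sites: list of biallelic sites, each a sequence with one allele per
    sample (assumption: no missing data, sites in positional order).
    Returns (number of intervals, list of (start, end) site indexes).
    """
    # All pairs of sites that fail the four-gamete test.
    intervals = [
        (i, j)
        for i, j in combinations(range(len(sites)), 2)
        if four_gametes(sites[i], sites[j])
    ]
    # Greedy choice: always keep the interval that ends first.
    intervals.sort(key=lambda ij: ij[1])
    chosen = []
    last_end = -1
    for start, end in intervals:
        if start >= last_end:
            chosen.append((start, end))
            last_end = end
    return len(chosen), chosen


# Three sites over four samples: sites 0-1 and 1-2 are incompatible.
count, found = rmin(["0011", "0101", "0011"])
```

In this toy example two intervals are retained, one between the first and second sites and one between the second and third.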

Ramos-Onsins and Rozas’s test statistics¶

Ramos-Onsins and Rozas (Mol. Biol. Evol. 2002 19:2092-2100) developed several tests of neutrality based on singletons. \(R_2\), \(R_3\), and \(R_4\) are computed as:

(1)¶\[R_p = \left[ \frac{1}{n} \sum_i^n \left( S_i - \frac{k}{2} \right) ^ p \right] ^ \frac{1}{p}\]

with:

\[k = \frac{n}{n-1} \sum_i^S \left( 1 - \sum_j^{k_i} p_{ij} ^2 \right)\]

where \(n\) is the number of samples, \(S\) is the number of segregating sites, \(k_i\) is the number of alleles at site \(i\), \(S_i\) is the number of singletons borne by the \(i\)th sample, and \(p_{ij}\) is the relative frequency of allele \(j\) at site \(i\).

\(Ch\) is computed as:

(2)¶\[Ch = (U - k) ^2 \frac{S} {k (S - k)}\]

where \(U\) is the total number of singletons.
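As an illustration, the sketch below evaluates equations (1) and (2) exactly as they are written above, on a small set of aligned sequences without missing data. The function name `singleton_stats` and the returned dictionary keys are hypothetical and are not part of the library. The sketch does not guard against a zero denominator in \(Ch\), nor against a negative mean when \(p\) is odd (in which case Python would return a complex number).

```python
from collections import Counter


def singleton_stats(samples, p=2):
    """Compute k, U, S, R_p and Ch from equal-length sequences.

    samples: list of strings, one per sample (assumption: no missing data).
    p: exponent of equation (1); 2 gives R_2.
    """
    n = len(samples)
    singletons = [0] * n  # S_i: singletons borne by each sample
    diversity_sum = 0.0
    S = 0  # number of segregating sites
    for site in zip(*samples):
        counts = Counter(site)
        if len(counts) < 2:
            continue  # monomorphic site
        S += 1
        diversity_sum += 1.0 - sum((c / n) ** 2 for c in counts.values())
        for idx, allele in enumerate(site):
            if counts[allele] == 1:
                singletons[idx] += 1
    k = n / (n - 1) * diversity_sum
    U = sum(singletons)
    Rp = (sum((s - k / 2) ** p for s in singletons) / n) ** (1 / p)
    Ch = (U - k) ** 2 * S / (k * (S - k))
    return {"k": k, "U": U, "S": S, "Rp": Rp, "Ch": Ch}


stats = singleton_stats(["AAAA", "AAAT", "AATA", "CATA"])
```

Here three sites are polymorphic, two of them carry a singleton, and \(k = 5/3\).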

Wall’s statistics¶

Tests based on partitions of the sample defined by polymorphic sites were introduced by Wall (Genet. Res. 1999 74:65-79):

(3)¶\[B = \frac{B'}{S-1}\]

(4)¶\[Q = \frac{B' + n_P}{S}\]

where \(B'\) is the number of pairs of adjacent polymorphic sites (considering only sites with no missing data and exactly two alleles) that are congruent (that is, only two haplotypes are observed when the two sites are considered together), \(n_P\) is the number of distinct partitions of the sample set defined by sites, and \(S\) is the number of sites considered in the analysis.
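The definitions of \(B'\) and \(n_P\) can be made concrete with a short sketch. The function name `wall_stats` is hypothetical, the input is assumed to contain only the sites retained for the analysis (biallelic, no missing data, in positional order), and \(n_P\) is counted over all of these sites, following the wording above.

```python
def wall_stats(sites):
    """Return (B, Q) for a list of biallelic sites.

    sites: list of sequences, each holding one allele per sample
    (assumption: already filtered, at least two sites).
    """
    S = len(sites)
    # B': adjacent pairs of sites showing only two haplotypes.
    b_prime = sum(
        1 for a, b in zip(sites, sites[1:]) if len(set(zip(a, b))) == 2
    )
    # n_P: distinct partitions of the samples induced by the sites.
    partitions = set()
    for site in sites:
        groups = {}
        for idx, allele in enumerate(site):
            groups.setdefault(allele, set()).add(idx)
        partitions.add(frozenset(frozenset(g) for g in groups.values()))
    n_p = len(partitions)
    return b_prime / (S - 1), (b_prime + n_p) / S


# Four sites over four samples; only the first adjacent pair is congruent.
B, Q = wall_stats(["0011", "0011", "0101", "0011"])
```

In this example \(B' = 1\) and the four sites define two distinct partitions, so \(B = 1/3\) and \(Q = 3/4\).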

Hudson’s differentiation statistics¶

Hudson et al. (Mol. Biol. Evol. 1992 9:138-151) introduced haplotype statistics based on Wright’s fixation index.

(5)¶\[F_{ST} = 1 - \frac{H_W/n_W}{H_B/n_B}\]

(6)¶\[K_{ST} = 1 - \frac{K_S}{K_T}\]

with:

\[H_W = \sum_i^r \frac{2}{n_i(n_i-1)}K_i\]

\[H_B = \sum_i^{r-1} \sum_{j=i+1}^r \frac{K_{d_{ij}}}{n_i n_j}\]

\[K_S = \frac{1}{n} \sum_i^r n_i \frac{2}{n_i(n_i-1)}K_i\]

\[K_T = \frac{2}{n(n-1)} \left( \sum_i^r K_i + \sum_i^{r-1} \sum_{j=i+1}^r K_{d_{ij}} \right)\]

where \(r\) is the number of populations, \(n\) is the total number of samples, \(n_i\) is the number of samples in population \(i\), \(K_i\) is the sum of the number of pairwise differences between all pairs of samples of population \(i\), and \(K_{d_{ij}}\) is the sum of pairwise differences between all pairs of samples comprising one sample from population \(i\) and the other from population \(j\). \(n_W\) is the number of populations and \(n_B\) is the number of pairs of populations (populations with fewer than two samples are excluded).
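A minimal sketch of equation (5) is given below, computing \(H_W\) and \(H_B\) from raw sequences. The helper names `n_diff` and `hudson_fst` are hypothetical, sequences are assumed to be aligned and free of missing data, and populations with fewer than two samples are dropped before any computation (an assumption about how the exclusion applies).

```python
from itertools import combinations, product


def n_diff(a, b):
    """Number of differing positions between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))


def hudson_fst(pops):
    """F_ST of equation (5); pops is a list of lists of sequences."""
    pops = [p for p in pops if len(p) >= 2]
    # H_W: sum over populations of the mean within-population difference.
    h_w = sum(
        2 / (len(p) * (len(p) - 1))
        * sum(n_diff(a, b) for a, b in combinations(p, 2))
        for p in pops
    )
    # H_B: sum over population pairs of the mean between-population difference.
    h_b = sum(
        sum(n_diff(a, b) for a, b in product(p, q)) / (len(p) * len(q))
        for p, q in combinations(pops, 2)
    )
    n_w = len(pops)
    n_b = n_w * (n_w - 1) // 2
    return 1 - (h_w / n_w) / (h_b / n_b)


fst = hudson_fst([["AAAA", "AAAT"], ["TTTT", "TTTA"]])
```

With one difference within each population and an average of 3.5 differences between them, the result is \(1 - 1/3.5 = 5/7\).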

Hudson (Genetics 2000 155:2011-2014) introduced the nearest neighbour statistic. The nearest neighbour of a given sequence \(i\) is the sequence with the fewest pairwise differences from sequence \(i\) (excluding itself). Several sequences can be tied as nearest neighbours. \(X_i\) is then the proportion of those nearest neighbours that come from the same population as sequence \(i\), and \(S_{nn}\) is the average of \(X_i\):

(7)¶\[S_{nn} = \frac{1}{n}\sum_i X_i\]
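The handling of tied nearest neighbours is easiest to see in code. The following sketch assumes aligned sequences without missing data; the names `n_diff` and `snn` and the use of a parallel list of population labels are illustrative choices, not the library's interface.

```python
def n_diff(a, b):
    """Number of differing positions between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))


def snn(seqs, labels):
    """Average proportion of nearest neighbours from the same population.

    seqs: list of aligned sequences; labels: population label of each one.
    """
    total = 0.0
    for i, seq in enumerate(seqs):
        dists = [
            (n_diff(seq, other), labels[j])
            for j, other in enumerate(seqs)
            if j != i
        ]
        d_min = min(d for d, _ in dists)
        # All sequences tied at the minimal distance are nearest neighbours.
        nearest = [lab for d, lab in dists if d == d_min]
        total += sum(lab == labels[i] for lab in nearest) / len(nearest)
    return total / len(seqs)


clear_cut = snn(["AAAA", "AAAT", "TTTT", "TTTA"], [0, 0, 1, 1])
with_ties = snn(["AA", "AT", "TA"], [0, 0, 1])
```

In the second call the first sequence has two tied nearest neighbours, one from each population, so its \(X_i\) is 0.5.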

Standardized association index¶

The \(\bar{r}_d\) statistic was introduced by Agapow and Burt (Mol. Ecol. Notes 2001 1:101-102).

(8)¶\[\bar{r}_d = \left(V_O - V_E\right)/\left(2 \sum_i^{L-1} \sum_{j=i+1}^L \sqrt{V_i V_j}\right)\]

with:

\[V_O = \frac{1}{n_P} \left( \sum_s^L \sum_i^{n-1} \sum_{j=i+1}^n {d_{sij}}^2 \right)\]

and:

\[V_E = \sum_s^L V_s\]

where the site variance is given, for site \(s\), by:

\[V_s = \frac{2}{n_s(n_s-1)} \left[ \sum_i^{n_s-1} \sum_{j=i+1}^{n_s} {d_{sij}}^2 - \frac{2}{n_s(n_s-1)} \left( \sum_i^{n_s-1} \sum_{j=i+1}^{n_s} d_{sij} \right) ^2 \right]\]

where \(L\) is the total number of sites considered, \(k_{ij}\) is the number of sites with available data for samples \(i\) and \(j\), \(n_P\) is the number of pairs of samples with \(k_{ij}\) greater than 0, \(n_s\) is the number of samples available at site \(s\), and \(d_{sij}\) is the number of alleles of the genotype of individual \(i\) that are not present in the genotype of individual \(j\) at site \(s\).

Linkage disequilibrium summary statistics¶

Kelly (Genetics 1997 146:1197-1206) introduced a neutrality statistic based on pairwise linkage disequilibrium values:

(9)¶\[Z_{nS} = \frac{\sum r^2}{n}\]

Two variants are available:

(10)¶\[Z^*_{nS} = Z_{nS} + 1 - \frac{\sum {D'}^2}{n}\]

(11)¶\[Z^*_{nS}{}^* = Z_{nS} \frac{n}{\sum {D'}^2}\]

Rozas et al. (Genetics 2001 158:1147-1155) introduced the additional statistic \(ZZ\):

(12)¶\[ZZ = Z_A - Z_{nS}\]

where \(Z_A\) is computed as \(Z_{nS}\) but considering only adjacent polymorphic sites (that is, pairs of polymorphic sites that don’t have a polymorphic site in between).

\(n\) is the number of allele pairs considered for each statistic.

The sums of \(r^2\) and of \({D'}^2\) are computed over all pairs of sites. For sites with more than two alleles, the behaviour is controlled by the option LD_multiallelic:

ignore: skip all sites with more than two alleles.
use_main: use the most frequent allele.
use_all: use all possible pairs of alleles.
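For the simplest case, where every site is biallelic (so that the LD_multiallelic setting has no effect), \(Z_{nS}\), \(Z_A\) and \(ZZ\) can be sketched as follows. The function names `r_squared` and `ld_summary` are hypothetical. Sites are assumed to be polymorphic, coded 0/1, free of missing data and given in positional order, and \(r^2\) is taken to be the usual squared correlation \(D^2 / [p_A (1-p_A) p_B (1-p_B)]\), which is an assumption since the definition is given elsewhere in the documentation.

```python
from itertools import combinations


def r_squared(a, b):
    """Squared allelic correlation between two 0/1-coded sites."""
    n = len(a)
    p_a = sum(a) / n
    p_b = sum(b) / n
    p_ab = sum(x * y for x, y in zip(a, b)) / n
    d = p_ab - p_a * p_b
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


def ld_summary(sites):
    """Return (ZnS, Za, ZZ) for a list of biallelic polymorphic sites."""
    all_pairs = [r_squared(a, b) for a, b in combinations(sites, 2)]
    adjacent = [r_squared(a, b) for a, b in zip(sites, sites[1:])]
    zns = sum(all_pairs) / len(all_pairs)
    za = sum(adjacent) / len(adjacent)
    return zns, za, za - zns


zns, za, zz = ld_summary([[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 0, 1]])
```

The first two sites are in complete association and the third is independent of both, so \(Z_{nS} = 1/3\), \(Z_A = 1/2\) and \(ZZ = 1/6\).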

Linkage disequilibrium statistics are defined here

Fu’s statistic¶

Fu’s \(F_S\) (Genetics 1997 147:915-925) is computed as:

(13)¶\[F_S = \log{\left(S'\right)} - \log{\left(1-S'\right)}\]

with:

\[S' = \sum_{k=K}^n \exp {\left[ S_n^k + k \log{(\pi)} - \sum_{i=1}^n \log{(\pi+i-1)} \right]}\]

where \(K\) is the number of haplotypes, \(n\) is the number of samples used, and \(S_n^k\) is the logarithm of the absolute value of the Stirling number of the first kind \(s_n^k\), computed as:

\[S_n^k = \log{\left( \lvert s_n^k \rvert \right)}\]

\[s_n^k = s_{n-1}^{k-1} - (n-1) s_{n-1}^k\]
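The recursion and equation (13) can be checked numerically with a small sketch. The function names `stirling_first_kind` and `fu_fs` are hypothetical. The Stirling numbers are built with exact Python integers before taking logarithms, which is a convenience of this sketch and not necessarily how the library proceeds. The sketch does not handle \(S' = 1\) (for example \(K = 1\)), where the second logarithm is undefined.

```python
import math


def stirling_first_kind(n):
    """Return [s_n^0, ..., s_n^n], signed Stirling numbers of the first kind."""
    row = [1]  # s_0^0 = 1
    for m in range(1, n + 1):
        new = [0] * (m + 1)
        for k in range(1, m + 1):
            prev_k = row[k] if k < m else 0  # s_{m-1}^k is 0 when k = m
            new[k] = row[k - 1] - (m - 1) * prev_k
        row = new
    return row


def fu_fs(n, K, pi):
    """F_S of equation (13) for n samples, K haplotypes and diversity pi."""
    row = stirling_first_kind(n)
    log_denom = sum(math.log(pi + i - 1) for i in range(1, n + 1))
    s_prime = sum(
        math.exp(math.log(abs(row[k])) + k * math.log(pi) - log_denom)
        for k in range(K, n + 1)
    )
    return math.log(s_prime) - math.log(1 - s_prime)


fs = fu_fs(n=4, K=3, pi=1.0)
```

For \(n = 4\) the absolute Stirling numbers are 6, 11, 6 and 1, so with \(\pi = 1\) and \(K = 3\) the sum is \(S' = 7/24\) and \(F_S = \log(7/17)\).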