Phased sites statistics

The following statistics are designed to be computed over a set of phased sites. If individuals are defined, alleles within individuals must be phased as well (with the exception of \(\bar{r}_d\)).

They are computed by process_align() and process_sites() of ComputeStats, as well as process_site() in the multiple site mode, but not process_freq().

| Code | Definition | Equation | Requirement | Notes |
|------|------------|----------|-------------|-------|
| R2 | Ramos-Onsins and Rozas’s \(R_2\) | (1) | | |
| R3 | Ramos-Onsins and Rozas’s \(R_3\) | (1) | | |
| R4 | Ramos-Onsins and Rozas’s \(R_4\) | (1) | | |
| Ch | Ramos-Onsins and Rozas’s \(Ch\) | (2) | | |
| R2E | Ramos-Onsins and Rozas’s \(R_{2E}\) | | Outgroup | 1 |
| R3E | Ramos-Onsins and Rozas’s \(R_{3E}\) | | Outgroup | 1 |
| R4E | Ramos-Onsins and Rozas’s \(R_{4E}\) | | Outgroup | 1 |
| ChE | Ramos-Onsins and Rozas’s \(Ch_E\) | | Outgroup | 1 |
| B | Wall’s B statistic | (3) | | |
| Q | Wall’s Q statistic | (4) | | |
| Ki | Number of haplotypes (only ingroup) | | | |
| Kt | Total number of haplotypes (including outgroup) | | | |
| FstH | Hudson et al.’s \(F_{ST}\) | (5) | Populations | |
| Kst | Hudson et al.’s \(K_{ST}\) | (6) | Populations | |
| Snn | Hudson’s nearest neighbour statistic | (7) | Populations | |
| rD | \(\bar{r}_d\) statistic | (8) | | 2 |
| Rmin | Minimal number of recombination events | | | 3 |
| RminL | Number of sites used to compute Rmin | | | 3 |
| Rintervals | List of start/end positions of recombination intervals | | | 3 |
| nPairs | Number of allele pairs used for \(Z_{nS}\) and related statistics | | | |
| nPairsAdj | Allele pairs at adjacent sites (used for \(ZZ\) and \(Z_A\)) | | | |
| ZnS | Kelly et al.’s \(Z_{nS}\) | (9) | | |
| Z*nS | Kelly et al.’s \(Z^*_{nS}\) | (10) | | |
| Z*nS* | Kelly et al.’s \(Z^*_{nS}{}^*\) | (11) | | |
| Za | Rozas et al.’s \(Z_A\) | (12) | | |
| ZZ | Rozas et al.’s \(ZZ\) | (12) | | |
| Fs | Fu’s \(F_S\) | (13) | | |

Notes:

  1. Based on mutations on external branches (that is, derived singletons) instead of all singletons.

  2. Does not require that alleles within individuals are phased.

  3. The minimal number of recombination events (Rmin) is computed after Hudson and Kaplan (Genetics 1985 111:147-164). Briefly, this number is equal to the minimal number of non-overlapping segments defined by incompatible sites (i.e. pairs of sites breaking the three-allele rule). Sites with missing data or with more than two alleles are skipped. The number of sites used for this analysis and the positions of those intervals are provided as RminL and Rintervals, respectively.
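The procedure described in note 3 can be illustrated with a minimal Python sketch. It is not the library's implementation: the function name `rmin`, the `missing` parameter and the string-based input are assumptions made for illustration. A pair of sites is treated as incompatible when all four two-site haplotypes are observed, and the non-overlapping intervals are counted greedily by right end.

```python
from itertools import combinations

def rmin(haplotypes, missing="N?"):
    """Return (Rmin, number of sites used, list of intervals).

    `haplotypes` is a list of equal-length strings (hypothetical input format);
    characters listed in `missing` are assumed to denote missing data.
    """
    usable = []
    for s in range(len(haplotypes[0])):
        column = [h[s] for h in haplotypes]
        # Skip sites with missing data or with a number of alleles other than two.
        if any(a in missing for a in column) or len(set(column)) != 2:
            continue
        usable.append(s)

    # Two sites are incompatible when all four two-site haplotypes occur.
    intervals = [(a, b) for a, b in combinations(usable, 2)
                 if len({(h[a], h[b]) for h in haplotypes}) == 4]

    # Greedy selection of non-overlapping intervals, scanned by right end.
    # Intervals that only share an end point are considered non-overlapping.
    intervals.sort(key=lambda ab: ab[1])
    chosen, last_end = [], -1
    for a, b in intervals:
        if a >= last_end:
            chosen.append((a, b))
            last_end = b
    return len(chosen), len(usable), chosen

result = rmin(["AATT", "ACTG", "CAGT", "CCGG"])
```

In this toy data set every pair of adjacent sites is incompatible, so three intervals are retained from the four usable sites.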

Ramos-Onsins and Rozas’s test statistics

Ramos-Onsins and Rozas (Mol. Biol. Evol. 2002 19:2092-2100) developed several tests of neutrality based on singletons. \(R_2\), \(R_3\), and \(R_4\) are computed as:

(1)\[R_p = \left[ \frac{1}{n} \sum_i^n \left( S_i - \frac{k}{2} \right) ^ p \right] ^ \frac{1}{p}\]

with:

\[k = \frac{n}{n-1} \sum_i^S \left( 1 - \sum_j^{k_i} p_{ij} ^2 \right)\]

\(n\) the number of samples, \(S\) the number of segregating sites, \(k_i\) the number of alleles at site \(i\), \(S_i\) the number of singletons borne by the \(i\)th sample, and \(p_{ij}\) the relative frequency of allele \(j\) at site \(i\).

\(Ch\) is computed as:

(2)\[Ch = (U - k) ^2 \frac{S} {k (S - k)}\]

where \(U\) is the total number of singletons.
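As an illustration, the sketch below evaluates equation (1) with \(p = 2\) and equation (2) exactly as written above, for a small set of haplotypes given as strings. The function name `r2_and_ch` and the input format are assumptions, and degenerate cases (no polymorphism, \(S = k\)) are not handled.

```python
from collections import Counter
from math import sqrt

def r2_and_ch(haplotypes):
    """Compute R_2 (equation 1 with p = 2) and Ch (equation 2)."""
    n = len(haplotypes)
    singletons = [0] * n   # S_i: singletons carried by each sample
    k = 0.0                # sum of per-site unbiased diversity
    S = 0                  # number of segregating sites
    for column in zip(*haplotypes):
        counts = Counter(column)
        if len(counts) < 2:
            continue       # monomorphic site
        S += 1
        k += n / (n - 1) * (1 - sum((c / n) ** 2 for c in counts.values()))
        for i, allele in enumerate(column):
            if counts[allele] == 1:
                singletons[i] += 1
    U = sum(singletons)    # total number of singletons
    r2 = sqrt(sum((s_i - k / 2) ** 2 for s_i in singletons) / n)
    ch = (U - k) ** 2 * S / (k * (S - k))
    return r2, ch

r2, ch = r2_and_ch(["AAT", "AAT", "ACT", "GCA"])
```

Here the last sample carries both singletons, \(k = 5/3\), \(S = 3\) and \(U = 2\), which gives \(Ch = 0.15\).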

Wall’s statistics

Tests based on partitions of the sample defined by polymorphic sites were introduced by Wall (Genet. Res. 1999 74:65-79):

(3)\[B = \frac{B'}{S-1}\]
(4)\[Q = \frac{B' + n_P}{S}\]

where \(B'\) is the number of pairs of adjacent polymorphic sites (considering only sites with no missing data and two alleles) that are congruent (that is, only two haplotypes are observed when considering the pair of sites), \(n_P\) is the number of distinct partitions of the sample set defined by sites, and \(S\) is the number of sites considered in the analysis.
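The counting of congruent adjacent pairs can be sketched as follows. This is only an illustration of \(B'\) and equation (3); \(Q\) is left out. The function name `walls_b`, the `missing` parameter and the string-based input are assumptions.

```python
def walls_b(haplotypes, missing="N?"):
    """Return Wall's B for a list of equal-length strings, or None if S < 2."""
    # Keep sites with exactly two alleles and no missing data.
    columns = [col for col in zip(*haplotypes)
               if len(set(col)) == 2 and not any(a in missing for a in col)]
    S = len(columns)
    if S < 2:
        return None
    # A pair of adjacent sites is congruent if it defines only two haplotypes.
    b_prime = sum(1 for left, right in zip(columns, columns[1:])
                  if len(set(zip(left, right))) == 2)
    return b_prime / (S - 1)

b = walls_b(["AATT", "AATG", "CCGT", "CCGG"])
```

In this example the first two adjacent pairs are congruent and the third is not, so \(B' = 2\) and \(B = 2/3\).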

Hudson’s differentiation statistics

Hudson et al. (Mol. Biol. Evol. 1992 9:138-151) introduced haplotype statistics based on Wright’s fixation index.

(5)\[F_{ST} = 1 - \frac{H_W/n_W}{H_B/n_B}\]
(6)\[K_{ST} = 1 - \frac{K_S}{K_T}\]

with:

\[H_W = \sum_i^r \frac{2}{n_i(n_i-1)}K_i\]
\[H_B = \sum_i^{r-1} \sum_{j=i+1}^r \frac{K_{d_{ij}}}{n_i n_j}\]
\[K_S = \frac{1}{n} \sum_i^r n_i \frac{2}{n_i(n_i-1)}K_i\]
\[K_T = \frac{2}{n(n-1)} \left( \sum_i^r K_i + \sum_i^{r-1} \sum_{j=i+1}^r K_{d_{ij}} \right)\]

where \(r\) is the number of populations, \(n\) is the total number of samples, \(n_i\) is the number of samples in population \(i\), \(K_i\) is the sum of the number of pairwise differences between all pairs of samples of population \(i\), \(K_{d_{ij}}\) is the sum of pairwise differences between all pairs of samples comprising one sample from population \(i\) and the other from population \(j\), \(n_W\) is the number of populations, and \(n_B\) is the number of pairs of populations (populations with fewer than two samples are excluded).
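A minimal sketch of equation (5), assuming each population is given as a list of equal-length strings. The names `pairwise_diff` and `hudson_fst` are hypothetical, and cases where \(H_B\) is zero or fewer than two populations remain are not handled.

```python
from itertools import combinations

def pairwise_diff(a, b):
    """Number of positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def hudson_fst(populations):
    """Compute F_ST from equation (5) for a list of populations."""
    # Populations with fewer than two samples are excluded.
    pops = [p for p in populations if len(p) >= 2]
    n_w = len(pops)
    n_b = n_w * (n_w - 1) // 2
    h_w = 0.0
    for p in pops:
        k_i = sum(pairwise_diff(a, b) for a, b in combinations(p, 2))
        h_w += 2 * k_i / (len(p) * (len(p) - 1))
    h_b = 0.0
    for p, q in combinations(pops, 2):
        k_d = sum(pairwise_diff(a, b) for a in p for b in q)
        h_b += k_d / (len(p) * len(q))
    return 1 - (h_w / n_w) / (h_b / n_b)

fst = hudson_fst([["AAAA", "AAAT"], ["CCAA", "CCAT"]])
```

With one difference within each population and an average of 2.5 differences between them, the sketch gives \(F_{ST} = 1 - 1/2.5 = 0.6\).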

Hudson (Genetics 2000 155:2011-2014) introduced the nearest neighbour statistic. The nearest neighbour of a given sequence \(i\) is the sequence that has the fewest pairwise differences with sequence \(i\) (excluding itself). There can be several tied nearest neighbours. Then, \(X_i\) is the proportion of those nearest neighbours which come from the same population as sequence \(i\), and \(S_{nn}\) is the average of \(X_i\):

(7)\[S_{nn} = \frac{1}{n}\sum_i X_i\]
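Equation (7) can be sketched directly from this description. The function name `snn` and the parallel `sequences`/`labels` lists are assumptions made for the example; tied nearest neighbours all contribute to \(X_i\).

```python
def snn(sequences, labels):
    """Compute S_nn for sequences with one population label each."""
    n = len(sequences)
    total = 0.0
    for i in range(n):
        # Distances from sequence i to every other sequence.
        dists = [(sum(x != y for x, y in zip(sequences[i], sequences[j])), j)
                 for j in range(n) if j != i]
        d_min = min(d for d, _ in dists)
        nearest = [j for d, j in dists if d == d_min]
        # X_i: proportion of nearest neighbours from the same population.
        total += sum(labels[j] == labels[i] for j in nearest) / len(nearest)
    return total / n

well_separated = snn(["AAAA", "AAAT", "CCAA", "CCAT"], [0, 0, 1, 1])
with_ties = snn(["AA", "AC", "AA"], [0, 1, 1])
```

In the first example every sequence has its single nearest neighbour in its own population, so \(S_{nn} = 1\). In the second, only the middle sequence has a same-population neighbour, shared with a tied neighbour from the other population, giving \(S_{nn} = 0.5 / 3\).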

Standardized association index

The \(\bar{r}_d\) statistic was introduced by Agapow and Burt (Mol. Ecol. Notes 2001 1:101-102).

(8)\[\bar{r}_d = \left(V_O - V_E\right)/\left(2 \sum_i^{L-1} \sum_{j=i+1}^L \sqrt{V_i V_j}\right)\]

with:

\[V_O = \frac{1}{n_P} \left( \sum_s^L \sum_i^{n-1} \sum_{j=i+1}^n {d_{sij}}^2 \right)\]

and:

\[V_E = \sum_s^L V_s\]

where the site variance is given, for site \(s\), by:

\[V_s = \frac{2}{n_s(n_s-1)} \left[ \sum_i^{n_s-1} \sum_{j=i+1}^{n_s} {d_{sij}}^2 - \frac{2}{n_s(n_s-1)} \left( \sum_i^{n_s-1} \sum_{j=i+1}^{n_s} d_{sij} \right) ^2 \right]\]

where \(L\) is the total number of sites considered, \(k_{ij}\) is the number of sites with available data for samples \(i\) and \(j\), \(n_P\) is the number of pairs of samples with \(k_{ij}\) greater than 0, \(n_s\) is the number of samples available at site \(s\), and \(d_{sij}\) is the number of alleles of the genotype of individual \(i\) that are not present in the genotype of individual \(j\) at site \(s\).

Linkage disequilibrium summary statistics

Kelly (Genetics 1997 146:1197-1206) introduced a neutrality statistic based on pairwise linkage disequilibrium values:

(9)\[Z_{nS} = \frac{\sum r^2}{n}\]

Two variants are available:

(10)\[Z^*_{nS} = Z_{nS} + 1 - \frac{\sum {D'}^2}{n}\]
(11)\[Z^*_{nS}{}^* = Z_{nS} \frac{n}{\sum {D'}^2}\]

Rozas et al. (Genetics 2001 158:1147-1155) introduced the additional statistic \(ZZ\):

(12)\[ZZ = Z_A - Z_{nS}\]

where \(Z_A\) is computed as \(Z_{nS}\) but considering only adjacent polymorphic sites (that is, pairs of polymorphic sites that don’t have a polymorphic site in between).

\(n\) is the number of allele pairs considered for each statistic.

The sums of \(r^2\) and of \({D'}^2\) are computed over all pairs of sites. For sites with more than two alleles, the behaviour is controlled by the option LD_multiallelic:

  • ignore: skip all sites with more than two alleles.

  • use_main: use the most frequent allele.

  • use_all: use all possible pairs of alleles.
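The following sketch illustrates equations (9) and (12) on haplotypes given as strings. It simply skips sites that do not have exactly two alleles, in the spirit of the ignore setting, and it is not the library's implementation. The function names are assumptions, \(r^2\) is computed with the usual definition \(D^2 / [p_A (1-p_A) p_B (1-p_B)]\), and at least two biallelic sites are required.

```python
from itertools import combinations

def r_squared(col_a, col_b):
    """Squared correlation between two biallelic sites."""
    n = len(col_a)
    a0, b0 = col_a[0], col_b[0]   # reference allele at each site
    p_a = sum(x == a0 for x in col_a) / n
    p_b = sum(y == b0 for y in col_b) / n
    p_ab = sum(x == a0 and y == b0 for x, y in zip(col_a, col_b)) / n
    d = p_ab - p_a * p_b
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

def zns_za_zz(haplotypes):
    """Return (ZnS, Za, ZZ) using biallelic sites only."""
    cols = [c for c in zip(*haplotypes) if len(set(c)) == 2]
    pairs = list(combinations(range(len(cols)), 2))
    zns = sum(r_squared(cols[i], cols[j]) for i, j in pairs) / len(pairs)
    # Adjacent pairs: consecutive polymorphic sites.
    adjacent = [(i, i + 1) for i in range(len(cols) - 1)]
    za = sum(r_squared(cols[i], cols[j]) for i, j in adjacent) / len(adjacent)
    return zns, za, za - zns

zns, za, zz = zns_za_zz(["AAT", "AAG", "CCT", "CCG"])
```

In this example the first two sites are in complete association and the third is independent of both, so \(Z_{nS} = 1/3\), \(Z_A = 1/2\) and \(ZZ = 1/6\).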

Linkage disequilibrium statistics are defined here

Fu’s statistic

Fu’s \(F_S\) (Genetics 1997 147:915-925) is computed as:

(13)\[F_S = \log{\left(S'\right)} - \log{\left(1-S'\right)}\]

with:

\[S' = \sum_{k=K}^n \exp {\left[ S_n^k + k \log{(\pi)} - \sum_{i=1}^n \log{(\pi+i-1)} \right]}\]

where \(K\) is the number of haplotypes, \(n\) is the number of samples used, and \(S_n^k\) is the logarithm of the absolute value of the Stirling number of the first kind, computed as:

\[S_n^k = \log{\left( \lvert s_n^k \rvert \right)}\]
\[s_n^k = s_{n-1}^{k-1} - (n-1) s_{n-1}^k\]
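Equation (13) can be sketched as below. The function names are assumptions, and \(\pi\) is taken as a value supplied by the caller. Since only \(\lvert s_n^k \rvert\) is needed, the sketch uses the equivalent recurrence on absolute values, \(\lvert s_n^k \rvert = \lvert s_{n-1}^{k-1} \rvert + (n-1) \lvert s_{n-1}^k \rvert\), with exact integers. The case \(K = 1\), for which \(S' = 1\) and \(F_S\) is undefined, is not handled.

```python
from math import exp, log

def unsigned_stirling_row(n):
    """Return [|s_n^0|, ..., |s_n^n|] as exact integers."""
    row = [1]  # row for n = 0
    for m in range(1, n + 1):
        new = [0] * (m + 1)
        for k in range(1, m + 1):
            previous = row[k] if k < m else 0
            new[k] = row[k - 1] + (m - 1) * previous
        row = new
    return row

def fu_fs(n, K, pi):
    """Compute F_S from n samples, K haplotypes and diversity pi (> 0)."""
    stirling = unsigned_stirling_row(n)
    log_rising = sum(log(pi + i - 1) for i in range(1, n + 1))
    s_prime = sum(exp(log(stirling[k]) + k * log(pi) - log_rising)
                  for k in range(K, n + 1))
    return log(s_prime) - log(1 - s_prime)

fs = fu_fs(4, 3, 1.0)
```

With \(n = 4\), \(K = 3\) and \(\pi = 1\), \(S' = (6 + 1)/24\), so \(F_S = \log(7/17)\), which is negative.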