Extended haplotype homozygosity

Extended haplotype homozygosity (EHH) is a method designed to detect unusually long haplotypes resulting from recent selective sweeps. A dedicated class, stats.EHH, is provided to compute these statistics.

The statistics listed in the table below are available as methods or attributes of this class. See the class documentation for details on usage.

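==========  ===========================================  ========  =========
Accessor    Description                                  Equation  Reference
==========  ===========================================  ========  =========
get_EHH()   EHH                                          (1)       1
get_EHHc()  Complementary EHH                            (1)       1
get_rEHH()  Relative EHH                                 (2)       1
get_iHH()   Integrated EHH                               (3)       2
get_iHHc()  Integrated EHHc                              (3)       2
get_iHS()   Integrated haplotype score (unstandardized)  (4)       2
get_EHHS()  Site-level EHH                               (5)       3
get_iES()   Integrated EHHS                              (6)       3
get_EHHG()  EHHS for genotypic data                      (7)       3
get_iEG()   Integrated EHHG                              (8)       3
==========  ===========================================  ========  =========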

References

  1. Sabeti et al. (Nature 2002 419:832-837).

  2. Voight et al. (PLoS Biol. 2006 4:e72).

  3. Tang et al. (PLoS Biol. 2007 5:e171).

Raw EHH statistics

If haplotype \(i\) is present in \(n_{i,0}\) copies at the core site, and if this haplotype has been split into \(k\) haplotypes at distant site \(s\), each present in \(n_{j,s}\) copies, the EHH for haplotype \(i\) at distant site \(s\) is given by:

(1)\[EHH_{i,s} = \frac{\sum_j n_{j,s} (n_{j,s} - 1)} {n_{i,0} (n_{i,0} - 1)}\]

\(EHHc_{i,s}\) is computed like \(EHH_{i,s}\) but considering the complement of haplotype \(i\) instead of haplotype \(i\) itself.

\(rEHH_{i,s}\) is computed as:

(2)\[rEHH_{i,s} = \frac{EHH_{i,s}}{EHHc_{i,s}}\]
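
As an illustration of equations (1) and (2), the following is a minimal pure-Python sketch that does not use the stats.EHH class. It assumes phased haplotypes stored as equal-length strings, a biallelic core site, and that two samples carry the same extended haplotype when they are identical at every site from the core up to site \(s\). The function names and the toy data are hypothetical.

```python
from collections import Counter


def ehh(haplotypes, core, site, carriers):
    """EHH at `site` for the samples whose indexes are in `carriers` (equation 1)."""
    lo, hi = min(core, site), max(core, site)
    # n_{j,s}: number of copies of each distinct haplotype spanning core..site
    counts = Counter(haplotypes[k][lo:hi + 1] for k in carriers)
    n0 = len(carriers)  # n_{i,0}: copies of the core haplotype
    return sum(n * (n - 1) for n in counts.values()) / (n0 * (n0 - 1))


def raw_ehh(haplotypes, core, site, allele):
    """Return (EHH, EHHc, rEHH) for the core allele `allele` at `site`."""
    carriers = [k for k, h in enumerate(haplotypes) if h[core] == allele]
    # Assumption: the complement is every sample not carrying `allele` at the core.
    others = [k for k, h in enumerate(haplotypes) if h[core] != allele]
    e = ehh(haplotypes, core, site, carriers)
    ec = ehh(haplotypes, core, site, others)
    return e, ec, e / ec  # equation (2)


# Toy data: 8 phased haplotypes over 4 sites, core site at index 0.
haplotypes = ["1000", "1000", "1000", "1001", "0010", "0100", "0111", "0011"]

e, ec, r = raw_ehh(haplotypes, core=0, site=1, allele="1")
print(e, ec, r)
```

In this toy data set the four carriers of allele 1 are still identical at site 1, so EHH is 1, while the complement has split into two pairs, giving EHHc = 4/12 and rEHH close to 3. rEHH is undefined when EHHc is zero (as happens here at site 3), and the sketch does not guard against that case.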

Integrated EHH statistics

Denoting the core site as \(s=0\), the first site for which \(EHH_{i,s}\) is below the threshold \(EHH_t\) as \(s=s^*\), and the distance from site \(s\) to the core as \(d_s\), the integrated statistic \(iHH_{i,s^*}\) is computed as:

(3)\[iHH_{i,s^*} = \sum_{s=0}^{s^*-1} \left[ (d_s - d_{s-1}) \frac{(EHH_{i,s-1}-EHH_t) + (EHH_{i,s}-EHH_t)}{2} \right] + (d_{s^*} - d_{s^*-1}) \frac{(EHH_{i,s^*-1}-EHH_t)^2}{2(EHH_{i,s^*-1}-EHH_{i,s^*})}\]

If no site has an EHH value below the threshold, the statistic is computed without the last term.

The complementary \(iHHc\) is computed using \(EHHc\) instead of \(EHH\).

The integrated haplotype score \(iHS\) is not standardized and is computed as:

(4)\[iHS_{i,s} = \log \frac{iHHc}{iHH}\]
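
The integration in equation (3) and the score in equation (4) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's implementation: it covers one side of the core only, takes the value at the core to be at or above the threshold, and treats the sum as starting at the first site after the core (the core itself has no preceding interval). The function name and the example curves are hypothetical.

```python
import math


def integrate_above_threshold(distances, values, threshold):
    """Area between a decaying curve and `threshold` (sketch of equation 3).

    distances[s] is d_s (0 at the core) and values[s] is the statistic at
    site s; values[0] is assumed to be >= threshold.
    """
    total = 0.0
    for s in range(1, len(values)):
        width = distances[s] - distances[s - 1]
        prev, cur = values[s - 1], values[s]
        if cur < threshold:
            # s is s*: add the triangle up to the interpolated crossing point
            total += width * (prev - threshold) ** 2 / (2 * (prev - cur))
            break
        # trapezoid between sites s-1 and s, measured above the threshold
        total += width * ((prev - threshold) + (cur - threshold)) / 2
    return total


# Hypothetical curves on one side of the core site.
distances = [0, 100, 200, 300]
ehh_curve = [1.0, 0.8, 0.4, 0.0]    # drops below 0.5 at site 2
ehhc_curve = [1.0, 0.9, 0.7, 0.6]   # never drops below 0.5

ihh = integrate_above_threshold(distances, ehh_curve, 0.5)
ihhc = integrate_above_threshold(distances, ehhc_curve, 0.5)
ihs = math.log(ihhc / ihh)  # equation (4), unstandardized
print(ihh, ihhc, ihs)
```

With these curves, the first integral is one trapezoid (40) plus the final triangle (11.25), and the second is three trapezoids (90) with no final term because the curve never crosses the threshold.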

Site-level EHH statistics

If \(n\) is the total number of available samples at the core site and \(n_{i,s}\) is the number of copies of haplotype \(i\) at site \(s\), the site-level EHH (\(EHHS\)) is computed as:

(5)\[EHHS_s = 1 - \frac{n}{n-1} \left( 1 -\frac{\sum_i n_{i,s}^2}{n^2} \right)\]
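
Equation (5) can be illustrated with a short pure-Python sketch. It assumes phased haplotypes stored as equal-length strings with no missing data, and takes \(n_{i,s}\) to be the number of copies of each distinct haplotype spanning the core up to site \(s\) in the whole sample. The function name and the toy data are hypothetical, and the stats.EHH class is not used.

```python
from collections import Counter


def ehhs(haplotypes, core, site):
    """Site-level EHH at `site`, computed over all samples (equation 5)."""
    lo, hi = min(core, site), max(core, site)
    # n_{i,s}: copies of each distinct haplotype spanning core..site
    counts = Counter(h[lo:hi + 1] for h in haplotypes)
    n = len(haplotypes)
    sum_sq = sum(c * c for c in counts.values())
    return 1 - n / (n - 1) * (1 - sum_sq / n ** 2)


# Toy data: 8 phased haplotypes over 4 sites, core site at index 0.
haplotypes = ["1000", "1000", "1000", "1001", "0010", "0100", "0111", "0011"]

for site in range(4):
    print(site, ehhs(haplotypes, core=0, site=site))
```

For this toy sample the value is 3/7 at the core itself (two alleles with four copies each) and 2/7 at site 1, where the sample has split into groups of four, two and two.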

The integrated \(EHHS\) (\(iES\)) is computed in the same way as \(iHH\), based on a given threshold \(EHHS_t\), with \(s^*\) being the first site for which \(EHHS\) is below this threshold (see above):

(6)\[iES_{s^*} = \sum_{s=0}^{s^*-1} \left[ (d_s - d_{s-1}) \frac{EHHS_s + EHHS_{s-1} - 2 EHHS_t}{2} \right] + (d_{s^*} - d_{s^*-1}) \frac{(EHHS_{s^*-1}-EHHS_t)^2}{2(EHHS_{s^*-1}-EHHS_{s^*})}\]

EHHS for genotypic data

Defining \(H_s\) as the proportion of heterozygous individuals at site \(s\), and \(H_{0s}\) as the proportion of individuals heterozygous at the core site among those that are non-missing at site \(s\), \(EHHG\) is computed as:

(7)\[EHHG_s = \frac{H_s}{H_{0s}}\]

The integrated \(EHHG\), \(iEG\), is computed in the same way as \(iHH\) and \(iES\), based on a given threshold \(EHHG_t\), with \(s^*\) being the first site for which \(EHHG\) is below this threshold:

(8)\[iEG_{s^*} = \sum_{s=0}^{s^*-1} \left[ (d_s - d_{s-1}) \frac{EHHG_s + EHHG_{s-1} - 2 EHHG_t}{2} \right] + (d_{s^*} - d_{s^*-1}) \frac{(EHHG_{s^*-1}-EHHG_t)^2}{2(EHHG_{s^*-1}-EHHG_{s^*})}\]