Missing data

Supporting missing data

Consider the following alignment of 10 sequences, with 30 sites, of which 10 are polymorphic (those at indexes: 2, 5, 7, 8, 12, 13, 14, 18, 22, and 26):

GCACCATCATGACTTACGGTATTATCGTAT
.....T......A.A...........C...
.....T.GT...A.................
......................A...C...
......................A...C...
..T..T.GT.............A...C...
..T..T.......G................
..T..T........A...........C...
..T....GT.....A...C...A...C...
..T..T........A...C.......C...

Assumed it is saved a Fasta in align6A.fas. Without missing data, computing diversity is relatively straightforward:

>>> cs = egglib.stats.ComputeStats()
>>> cs.add_stats('lseff', 'nseff', 'S', 'sites')
>>> alnA = egglib.io.from_fasta('align6A.fas',egglib.alphabets.DNA)
>>> print(cs.process_align(alnA))
{'sites': [2, 5, 7, 8, 12, 13, 14, 18, 22, 26], 'lseff': 30, 'S': 10, 'nseff': 10.0}

Suppose now that the same data were obtained but with less than a perfect sequencing technology and that consequently some data are missing:

GCACCATCATGACTTACGGTATTATCGTAT
.....T......A.A...........C...
.....T.GT...A.................
......................A...C...
......................A...C...
..T..T.GW....K....S...A...C...
..T..T.......G............S...
..T..T........A...........S...
..T....GT.....A...C...A...C...
..T..T........A.??????????????

By default, all sites with at least one missing data will be excluded, causing a significant loss of data:

>>> alnB = egglib.io.from_fasta('align6B.fas',egglib.alphabets.DNA)
>>> print(cs.process_align(alnB))
{'sites': [2, 5, 7, 12, 14], 'lseff': 14, 'S': 5, 'nseff': 10.0}

Note that it is important to always refer to lseff to know how many sites have been actually used for the analysis (usually smaller than ls). In this case, only 14 of the 30 sites are considered. In particular, the sites 8 and 13, which are polymorphic, are excluded because the 6th sequence has ambiguity data. Furthermore, all sites after index 15 are excluded because the last sequence (and sometimes other ones) has missing data. Among those 14 sites included, only 5 remain polymorphic.

EggLib has a specific option to allow considering sites even if they contain missing data, up to a user-specified threshold. In stats.ComputeStats.process_align(), the threshold is expressed as a proportion. Assume we want to allow for only one missing data (10%). We can do it as shown below:

>>> print(cs.process_align(alnB, max_missing=0.1))
{'sites': [2, 5, 7, 8, 12, 13, 14, 22], 'lseff': 28, 'S': 8, 'nseff': 9.5}

This analysis has included 28 sites: all except two that have more than one missing piece of data (sites at indexes 18 and 26, which appear to be both polymorphic). The statistic nseff gives the average number of samples in sites included in the analysis. While max_missing was set to the default (0), all included sites necessarily had have all samples, so the value was always equal to 10. Now, included sites may have less than all samples available. For 14 of them all samples are available, while the 14 others have one missing data, thereby the average value of 9.5. Whenever possible, computations take this effect into account and in all cases frequencies are expressed relatively to the effective number of samples for a given site.

It is possible to be even more liberal and allow for more missing data. Here, we need to allow for at least 3 missing data to accept all sites (and detect all polymorphic sites):

>>> print(cs.process_align(alnB, max_missing=0.3))
{'sites': [2, 5, 7, 8, 12, 13, 14, 18, 22, 26], 'lseff': 30, 'S': 10, 'nseff': 9.366666666666667}

Eventually, all sites are processed and all 10 polymorphic sites are detected, while the average number of samples in used sites drops to 9.37.

The cost of increasing the value for max_missing is primarily that large values (if there are many missing data) lead to including a lot of sites with few exploitable samples for which all statistics will be poorly estimated, which will be in the end counter-productive.

Linkage disequilibrium statistics

Part of the statistics connected to linkage between distant sites can be computed using the stats.ComputeStats class. Other require separate tools, also available in the stats module.

From ComputeStats

A number of linkage-related statistics are available among the list of statistics proposed by ComputeStats. They include proper linkage disequilibrium statistics such as Kelly et al.’s and Rozas et al.’s statistics (such as Za, ZZ and ZnS) and closely related as Hudson’s Rmin. We can also consider that statistics based on haplotypes are somewhat connected (including K, Hudson’s Fst, Fu’s Fs, but also Ramos-Onsins and Rozas’s R2 or Wall’s B and Q). Please refer to Diversity statistics for the full list. It should be noted that all those statistics require that the data are phased, meaning that the order of samples must match over all provided sites (including the order of alleles within individuals if the individuals are not haploid). This may seem obvious, but keep in mind that you may compute haplotype or linkage disequilibrium statistics over completely unrelated sites (with non-matching lists of samples). Provided that the number of samples matches, EggLib will compute all statistics you ask and it is up to you to decide whether they are meaningful or not.

This being said, you can compute linkage disequilibrium and haplotype statistics using the methods process_align() and process_sites() of ComputeStats. In both cases, it will be assumed that the data are phased. When using an alignment, computing this category of statistics is rather straighforward, but note that it is also possible to process a list of Site instances as in the example below:

 >>> site1 = egglib.site_from_list('AAAAAAAACCCCCCCC', egglib.alphabets.DNA)
 >>> site2 = egglib.site_from_list('GGGGGGGGGGGTTTTT', egglib.alphabets.DNA)
 >>> site3 = egglib.site_from_list('CCCCCCAAAAAAAAAA', egglib.alphabets.DNA)
 >>> site4 = egglib.site_from_list('TTTTAAAAAAATTTTT', egglib.alphabets.DNA)
 >>> site5 = egglib.site_from_list('CCGGGGGGGGGGCCCG', egglib.alphabets.DNA)
 >>> site6 = egglib.site_from_list('AATTAAAAAAAAAAAT', egglib.alphabets.DNA)
 >>> cs = egglib.stats.ComputeStats()
 >>> cs.add_stats('Rmin', 'Rintervals', 'ZnS', 'Ki')
 >>> print(cs.process_sites([site1, site2, site3, site4, site5, site6]))
{'Rintervals': [(2, 3)], 'Rmin': 1, 'Ki': 8, 'ZnS': 0.17767944289156412}

Note that several options can be set using configure() to control the computation of several statistics in this category (refer to the documentation of this method for all details).

Note also that the presence of missing data can be a problem, since, on one hand, a sequence that contain any missing data cannot be easily used while identifying haplotypes and, on the other hand, the effect of missing data is magnified in the context of pairwise comparisons. To compute this family of statistics, it can be better to remove samples for which the amount of missing data is above a given threshold while keeping the argument max_missing to a (very) low value.

Pairwise linkage disequilibrium

There are two functions in the stats module to compute linkage disequilibrium between sites: one to process a pair of sites (stats.pairwise_LD()) and one to process all sites for an alignment (stats.matrix_LD()).

The stats.pairwise_LD() function

This function takes two sites as arguments. It is up to the user to provide sites with little enough missing data to make the computation relevant. The function has options to control what happens if there are more than two alleles at either site (this is not addressed here; see the function’s documentation for more details).

The fragment of code below shows what the function does:

>>> print(egglib.stats.pairwise_LD(site1, site2))
{'D': 0.15625, 'Dp': 1.0, 'r': 0.674199862463242, 'rsq': 0.4545454545454545, 'n': 1}
>>> print(egglib.stats.pairwise_LD(site1, site4))
{'D': -0.03125, 'Dp': -0.14285714285714285, 'r': -0.1259881576697424, 'rsq': 0.01587301587301587, 'n': 1}

The values are:

  • n, the number of pair of alleles considered (significant if there are more than two alleles at either site).

  • D, linkage disequilibrium.

  • Dp, Lewontin’s D’.

  • r, Pearson’s correlation coefficient.

  • rsq, squared Pearson’s correlation coefficient.

Note that if statistics cannot be computed (typically, because of the presence of missing data), values in the returned dictionary are replaced by None.

The stats.matrix_LD() function

This function generates the linkage disequilibrium matrix between all pairs of sites (for which computation is possible) from a user-provided alignment. This function has a bunch of options:

  • Options to control what happens if there are more than two alleles at either site.

  • Options to apply filters on sites based on allelic frequencies (to exclude sites with too unbalanced frequencies, which are less informative) and the number of available data.

  • List of positions of sites provided in the alignment (by default, the index of sites is used).

The function requires two mandatory arguments: an Align instance and the list of statistics: d, D, Dp, r, and rsq, as listed above.

In this example, we consider a very simple alignment of 10 sequences and 10 sites (of which only 3 are variable), saved in a Fasta-formatted file named align7.fas:

>sample01
CACATGTGGA
>sample02
CACATGTGGA
>sample03
CAGATGTGGT
>sample04
CAGATGTGGT
>sample05
CACATGTAGT
>sample06
CACATGTAGT
>sample07
CAGATGTGGT
>sample08
CAGATGTGGT
>sample09
CACATGTGGT
>sample10
CACATGTGGA

The code below demonstrates the usage of stats.matrix_LD() and will help describe its return value:

>>> aln = egglib.io.from_fasta('align7.fas', egglib.alphabets.DNA)
>>> print(egglib.stats.matrix_LD(aln, ('d', 'rsq')))
([2.0, 7.0, 9.0], [[None], [[5.0, 0.16666666666666652], None],
 [[7.0, 0.28571428571428575], [2.0, 0.10714285714285716], None]])

The usage is fairly straighforward. Just remember to specify the list of statistics you want to be computed (in this case, we requested d and rsq) and consider if some of the optional arguments are needed.

The return value is a tuple of two lists. The first list is the position of sites that have been included in the analysis, in this case the three polymorphic sites, which appear to be at positions 2, 7, and 9 (there is currently an automatic conversion to floating-point number but this may change in the future).

The second item of the return value is a nested list, containing the lower half-matrix with requested values. The additional example below shows how to collect results in such as way that we loop over all pairs of sites:

>>> pos, mat = egglib.stats.matrix_LD(aln, ('d', 'rsq'))
>>> n = len(pos)
>>> for i in range(n):
...     for j in range(i):
...         p1 = pos[i]
...         p2 = pos[j]
...         d = mat[i][j][0]
...         r2 = mat[i][j][1]
...         print('pos:', p1, p2, 'd:', d, 'r2:', r2)
...
pos: 7.0 2.0 d: 5.0 r2: 0.166666666667
pos: 9.0 2.0 d: 7.0 r2: 0.285714285714
pos: 9.0 7.0 d: 2.0 r2: 0.107142857143

Note that there is a “gotcha” with this function. If one requests a class:!list of statistics as the stats argument, the values in the half-matrix provided as the second return value are a list of values, even if only one statistic is requested, as in:

>>> print(egglib.stats.matrix_LD(aln, ['rsq']))
([2.0, 7.0, 9.0], [[None], [[0.16666666666666652], None], [[0.28571428571428575], [0.10714285714285716], None]])

However, if one passes a statistic code (not an iterable) as value for the stat option, the items in the returned half-matrix are statistic values, not embedded in a list:

>>> print(egglib.stats.matrix_LD(aln, 'rsq'))
([2.0, 7.0, 9.0], [[None], [0.16666666666666652, None], [0.28571428571428575, 0.10714285714285716, None]])

The diagonal values are always None.

EHH statistics

Extended haplotype homozygosity (EHH) statistics are one a set of statistics specifically developed with the aim of exploiting large-scale sequencing data. The documentation for the class stats.EHH provides all details and the list of literature references. In this section we provide a quick overview of the usage of this class.

EHH would be too complex to be included among statistics proposed by stats.ComputeStats. It is the reason why it has a class of its own. To use it, one must therefore first create a class instance:

>>> ehh = egglib.stats.EHH()

Loading the core site

The first step consists in loading a core site, which will be used as reference for all computations. In the original paper, it was a non-recombining region for which haplotypes have been determined. It can as well be a two-allele SNP marker. Assume that we have a file named sites1.txt, of which the first line is:

0000000000000000010002000000010001001000000000000001000000002000000100000000101000000101000010010010

This is one way to represent haplotypic data for 100 individuals in a simple manner. We will use simple code to import this string of haplotype identifiers as a Site:

>>> f = open('sites1.txt')
>>> core = list(map(int,f.readline().strip()))
>>> site = egglib.Site()
>>> site.from_list(core, egglib.alphabets.positive_infinite)

There is a method to set a Site instance as the core site for EHH, but first we need to set its position because it will be needed to compute EHH statistics. But, when created from a list, Site instances don’t have a position. Here we will record distances relatively to the core site so we set its position to 0.

>>> site.position = 0
>>> ehh.set_core(site)

This method takes an array of options, one of which (min_freq) controlling the frequency threshold (to exclude low-frequency haplotypes from the analysis). But we don’t use it in this example.

After the core site has been loaded, basic statistics can be accessed:

>>> print(ehh.num_haplotypes)
3
>>> print(ehh.nsam)
100
>>> print([ehh.nsam_core(i) for i in range(ehh.num_haplotypes)])
[85, 13, 2]

These three lines show, respectively:

  • num_haplotypes, the number of core haplotypes (if haplotypes have been excluded due to the min_freq option, this property gives the number of included haplotypes).

  • nsam, the total number of samples (only considering included haplotypes).

  • nsam_core(), the absolute frequency (at the core site) of a core haplotype.

Below we show the initial value of some of the EHH statistics (they actually have a value after loading the core site, but it is only after loading other sites that they can be interpreted):

>>> print(ehh.cur_haplotypes)
3
>>> print(ehh.get_EHH(0))
1.0
>>> print(ehh.get_EHH(1))
1.0
>>> print(ehh.get_EHH(2))
1.0
>>> print(ehh.get_rEHH(0))
1.32911392405
>>> print(ehh.get_iHH(0))
0.0
>>> print(ehh.get_iHS(0))
None
>>> print(ehh.get_EHHS())
1.0
>>> print(ehh.get_iES())
0.0

The table below lists the statistics shown (more are available):

Statistic

Meaning

Initial value

cur_haplotypes

current number of haplotypes

equal to num_haplotypes

get_EHH(i)

EHH value for haplotype i

equal to 1

get_rEHH(i)

Ratio of EHH and EHHc for haplotype i

not defined

get_iHH(i)

Integrated value of EHH

equal to 0

get_iHS(i)

Ratio of iHH and its complement iHHc

not computable because both terms are 0

get_EHHS()

Site-wise EHHS

equal to 1

get_iES()

Integrated value of iES

equal to 0

Loading the distant sites

After this, we can start to load other sites (referred to as distant sites). Distant sites are classically SNP markers located at increasing distance from the core site (the increasing distance is required). Here we continue reading the file sites1.txt which contains, after the core site, biallelic sites formatted as shown below, taking the first 5 sites as example:

0000000000000000010000000000010001001000000000000001000000000000000100000000100000000101000010000000
0000100001000100000000000000000010000100001001001000100000100001100000100000000010000010100000000100
0000000000000000010000000000010001001000000000000001000000000000000100000000101000000101000010010010
0000100001000100010000000000010001001000000000000001000000000000000100100000101000000101000010010010
0000000000100000000000000000000000000000000000000000000000000000000000000100000000000000000100000000

The code below processes the next site in file (using a similar syntax than for the core site) and pass it to the stats.EHH instance. The positon of the new site must also be set (when recycling the Site instance, its distance is reset to None so we need to set it to a proper value, say 0.1.:

>>> distant = list(map(int,f.readline().strip()))
>>> site.from_list(distant, egglib.alphabets.positive_infinite)
>>> site.position = 0.1
>>> ehh.load_distant(site)
>>> print(ehh.cur_haplotypes)
4
>>> print(ehh.get_EHH(0))
1.0
>>> print(ehh.get_EHH(1))
0.6153846153846154
>>> print(ehh.get_EHH(2))
1.0
>>> print(ehh.get_rEHH(0))
2.14285714286
>>> print(ehh.get_iHH(0))
0.1
>>> print(ehh.get_iHS(0))
-0.495077266798
>>> print(ehh.get_EHHS())
0.991778569471
>>> print(ehh.get_iES())
0.0995889284736

We can observe that there is already an additional haplotype, meaning that the first distant site already modifies the sample partition. It concerns the second haplotype, whose EHH value drops to ~0.6. The rEHH value of the first haplotype has increased because the complement of the first haplotype is affected by the new haplotype. iHH is 0.1 (integration of EHH over the distance arbitrarily set to 0.1). The value of iHS is negative, which is possible because it is the logarithm of a ratio. The whole-site EHHS value has decreased because it accounts for data at the whole site.

The code below reads all the other sites of the file and displays final values, always using an increment of 0.1 for the position of each site:

>>> for i, line in enumerate(f):
...     site.from_list(list(map(int,line.strip())), egglib.alphabets.positive_infinite)
...     site.position = 0.2 + i / 10.0
...     ehh.load_distant(site)
...
>>> print(ehh.cur_haplotypes)
42
>>> print(ehh.get_EHH(0))
0.0392156862745098
>>> print(ehh.get_EHH(1))
0.2058823529411765
>>> print(ehh.get_EHH(2))
1.566554621848739
>>> print(ehh.get_rEHH(0))
0.7030486479143488
>>> print(ehh.get_iHH(0))
0.04384762948753081
>>> print(ehh.get_iHS(0))
0.703048647914
>>> print(ehh.get_EHHS())
0.04384762948753081
>>> print(ehh.get_iES())
1.623691422307482

For more details about controlling at what point integration of EHH statistics should stop, or management of missing data, see the class’s manual.