SNP Haplotype reconstruction Statistics 246, Week 14 2002

Download Report

Transcript SNP Haplotype reconstruction Statistics 246, Week 14 2002

BMI 731- Winter 2004
Haplotype reconstruction
Catalin Barbacioru
Department of Biomedical Informatics
Ohio State University
The problem
We start with a collection of genotypes in the form of allelic
determinations at tightly linked single nucleotide
polymorphisms (SNPs) for each of a set of n individuals. For
example, we might describe 3 SNPs as follows:
Name SNP alleles (major, minor)
SNP1
T, A
SNP2
A, G
SNP3
C, G
An individual might have genotype AT at SNP1, AA at SNP2,
and CG at SNP3, which we will denote by AT//AA//CG.
Possible haplotype pairs for this person are AAC/TAG and
AAG/TAC, and without further information, we can’t distinguish
between these two pairs.
The problem, cont.
What can be done? With information on the individual’s
parents, we can usually infer the haplotypes, the only
problem being that the parents may not be fully informative.
For example, if the maternal and paternal genotypes were
TA//AA//CC and TT//AA//CG respectively, at SNPs 1, 2 and 3,
and the individual is AT//AA//CG, then it would be clear that
the haplotypes were AAC/TAG (why?). On the other hand, if
the parents both had genotypes AT//AA//CG, then we wouldn’t
be able to determine unique haplotypes for the individual.
Even in the first case, we might have to make an assumption
about the frequency of recombination: what is it?
Our problem here is to determine haplotypes, or make good
guesses at them without parental genotypes.
Origin of the problem
Why do we want to determine haplotypes for individuals at
tightly linked SNP loci?
a)Haplotypes are more powerful discriminators between
cases and controls in disease association studies. Why?
b) With haplotypes we can conduct evolutionary studies.
c) Use of haplotypes in disease association studies reduces
the number of tests to be carried out, and hence the penalty
for multiple testing. Is this the same point as a)?
Two aspects of the problem
With a random sample of multilocus genotypes at a set of
SNPs, we can attempt a) to estimate the frequencies of all
possible haplotypes, and b) to infer the haplotypes of all
individuals.
The first step on this problem was taken by A Clark in 1990.
He gave what we might call a parsimony solution to b) above.
It goes like this. With a reasonable sample size, we might
expect to have some individuals homozygous at every locus,
e.g. TT//AA//CC, or heterozygous at just one locus, e.g.
TT//AA//CG. With the individuals of former type, we have
unambiguously identified one (TAC), and of the latter type two
(TAC and TAG) haplotypes present in the population. The
algorithm begins by finding all homozygotes and single SNP
heterozygotes and tallying the resulting known haplotypes.
A Clark’s algorithm, cont.
Now proceed as follows. For each known haplotype, look at all
remaining unresolved cases, and ask whether the known
haplotype can be made from some combination of ambiguous
sites from an unresolved case.
For example, if we have identified TAC as a known haplotype
from a TT//AA//CC homozygote, and we have an individual
AT//AA//CG still unresolved, then we infer that s/he is TAC/AAG,
and we have have “resolved” this person’s haplotype and added
a putative haplotype to our list. Similarly, a TT//AA//CG
individual gives us both TAC and TAG as known haplotypes,
and both of these go into the initial list.
This chain of inferences is continued until either all haplotypes
have been recovered, or until no more new haplotypes can be
found in this way.
Clark’s algorithm with SNPs in practice
This method should work in principle, but there are three
problems that might arise in practice: a) there may be no
homozygotes or single SNP heterozygotes in the sample, and
so the chain might never get started; b) there may be many
unresolved haplotypes left at the end; and c) haplotypes
might be erroneously inferred if a crossover product of two
actual haplotypes is identical to another true haplotype.
The frequency of these problems will depend on average
heterozygosity of the SNPs, number of loci, their recombination
rates and the sample size. Clark (1990) did some calculations
and simulations which led him to believe the algorithm would
perform well, even with relatively small sample sizes. And it did.
The EM algorithm solution to problem a)
(SNPem)
We now describe an EM algorithm to infer haplotype frequencies in a
population on the basis of a random sample and the assumption of random
mating for haplotypes. Escoffier and Slatkin (1995) call the phase unknown
multilocus genotypes, e.g. TT//AA//CG, phenotypes, and keep the term
genotype for the corresponding haplotype pair TAC/TAG. Others use the term
diplotype for a pair of haplotypes, but neither of these has caught on.
The observed data in a random sample of n individuals will be multilocus
genotype frequencies, and the natural model is multinomial. The number c of
haplotype pairs leading to a given phenotype will depend on the number s of
heterozygous SNPs, and will be 2s-1. E.g if our genotype is TT//AA//CG, then
we can recover the haplotypes unambiguously (c=1), but for our original case,
AT//AA//CG, there were c = 2 possible haplotype pairs.
Under the assumption of random mating, the probability of a given genotype is
just the sum of 2s-1 squares or products of haplotype probabilities, e.g.
P=pr(AT//AA//CG)
= pr(AAC/TAG) + pr(AAG/TAC)
= 2pr(AAC)pr(TAG) + 2pr(AAG)pr(TAC).
The EM algorithm, cont.
The probability of a sample of n individuals conditioned by the phenotype
frequencies P1, …, Pm (i.e. the likelihood of the data given the parameters)
is given by the multinomial probability,
n!
n
n
P( sam ple| P1 ,...,Pm ) 
P1 1 ...Pm m
n1!...nm !
where m different phenotypes are observed with counts n1,…,nm.
If we h possible haplotypes with (unknown) frequencies p1,…,ph, then the
likelihood of the haplotype frequencies given the phenotypic counts is


L( p1 ,..., p h )  a   P(hik hil ) 
j 1  i 1

m
cj
nj
where a is a constant incorporating the multinomial coefficient, hp,r are the
possible haplotypes and cj is the number of genotypes leading to the jth
phenotype.
In principle, we want to estimate the haplotype frequencies that maximize
the likelihood function, by solving a set of h-1 equations involving first partial
derivatives of the logarithm of the likelihood, ∂logL/∂pi.
The EM algorithm, cont.
The EM algorithm is an iterative method to compute successive sets of
haplotype frequencies p1,…,ph, starting with initial values p1(0),…,ph(0).
These initial values are used as if they are the true frequencies to estimate
genotype frequencies P(hkhl)(0) (the expectation step). These expected
genotype frequencies are used in turn to estimate haplotype frequencies at
the next generation p1(1),…,ph(1) (the maximization step) and so on, until
convergence is reached.
Possible initial haplotype frequencies can be evaluated assuming that all
haplotypes are equally frequent in the sample, or that haplotype
frequencies are equal to the product of single-locus allele frequencies, or
that initial haplotypes frequencies are chosen random, or that for each
phenotype, the possible genotypes are equally likely, so
P(genotype hkhl in phenotype j)(0) = Pj(hkhl)(0)
= 1/cj
The EM algorithm, cont.
The expectation step at the gth iteration consists of using the haplotype
frequencies in the previous iteration to calculate the probability of resolving
each genotype given the phenotype:
Pj (h1h2 )
(g)

P ( h1h2 ) ( g )
 P ( hk hl )
(g)
where the sum is taken over all genotypes (hk,hl) leading to the jth phenotype,
and P(h1,h2)(g) = 2p1(g)p2(g) if h1 and h2 are different and (p1(g))2 otherwise.
Haplotype frequencies are then computed in the maximization step:
pt
( g 1)
c
1 m j
  it Pj (hk hl ) ( g )
2 j 1 i 1
where δit is equal to the number of times haplotype t is present in genotype
i = (hk,hl) (0,1, or 2).
The EM algorithm: performance with SNPs
The EM algorithm: performance with SNPs
Previous slide:
Influence of sample size on haplotype frequency estimates.
A and B, Haplotype frequencies at the three steps of the simulation procedure.
Generating frequencies (Gk [line]), sample frequencies (Sk [triangles]), and
resulting haplotype frequency estimates from the EM algorithm (Ek
[unblackened circles]) for a five-locus system with equally frequent
population haplotype frequencies, with sample size set to N = 50 (A) and N =
500 (B) are shown. C, Average MSE and 95% CI for batches of 500 data sets
of each sample size for five-locus haplotypes generated under the N(1/k,2)
model. Unbroken line denotes comparisons of EM estimates to sample values
(SE); dotted line, EM estimates to generating parameters (GE).
A Gibbs sampler approach to haplotype
reconstruction, Stephens et al (2001) (PHASE)
Our notation is as follows: G = (G1,…, Gn) denotes the observed multilocus
genotype frequencies, H=(H1,…,Hn) will be the corresponding unknown
haplotype pairs, while F=(F1, …, FM) will denote the M unknown
population haplotype frequencies. In these terms, the EM algorithm sought
that F which maximized pr(G|F).
The Gibbs sampler uses the following three steps, starting from some
initial haplotype reconstruction H(0):
i) Choose an individual i, uniformly and at random from all ambiguous
individuals;
ii) Sample Hi(t+1) from pr(Hi | g, H-i(t)), where H-i is the set of haplotypes
excluding individual i;
iii) Set Hj(t+1) = Hj(t) for j=1,…,i-1,i+1, …,n.
General theory tells us that this produces a Markov chain with the desired
stationary distribution. The question is: what is pr(Hi | g, H-i(t))?
The Stephens et al Gibbs sampler, some details
For any haplotype pair Hi = (hi1,hi2) consistent with genotype Gi,
write
pr(Hi | G, H-i)  pr(Hi | H-i)  pr(hi1|H-I)pr(hi2 | H-I, hi1)
(*)
where pr(.|H) is the conditional distribution of a future sampled
haplotype, given a set H of previously sampled haplotypes.
There are a couple of choices here, one assuming that the type
of a mutant offspring is h with probability h, independent of the
type of the parent. In that case, for a constant-size random
mating population, pr(h|H) = (rh + h)/(r + ), where rh is the
number of haplotypes of type h in H, r is the total number of
haplotypes in H, and  is the scaled mutation rate.
The Stephens et al Gibbs sampler, more details
In principle we can substitute this formula into step ii) of the generic Gibbs
algorithm, and off we go. In practice, the number of possible values of Hi is
too large: 2s-1, where s is the number of loci at which individual i
heterozygous. However, if we take h =1/M, where M is the total number of
different possible haplotypes that could be observed in the population, we
can exploit the fact
Comparison
Comparison
Previous Figure
Comparison of accuracy of PHASE method (solid line) versus EM (dotted
line) and Clark's method (dashed line) for short sequence data (530
segregating sites). Top row, mean error rate (defined in text) for haplotype
reconstruction. Bottom row, mean discrepancy (defined in text) for
estimation of haplotype frequencies. We simulated data sets of 2n
haplotypes, randomly paired to form n genotypes, under an infinite-sites
model, with = 4 and different assumptions about the local recombination
rate R (R and are defined in the note to table 1), using a coalescent-based
program kindly provided by R. R. Hudson.
For each combination of parameters considered, we generated 100
independent data sets and discarded those data sets for which the total
number of possible haplotypes was >105 (the limit of our implementation
of the EM algorithm), which typically left >90 data sets on which to compare
the methods. Each point thus represents an average over 90100 simulated
data sets. Horizontal lines above and below each point show approximate
95% confidence intervals for this average (±2 standard errors). The results
for Clark's algorithm for R = 40 are omitted, as we had difficulty getting the
algorithm to consistently provide a unique haplotype reconstruction for
these data.
The Stephens et al Gibbs sampler, yet more details
Stephens et al (2001) offer a second algorithm, based on slightly different
population genetic modelling. It takes the form
pr(h | H) = ∑{E} ∑ {s ≥0} [r/r][/(r+ )]s[r/(r+)](Ps)h ,
where r is the number of haplotypes of type  in the set H, r is the total
number of haplotypes in H, and  is the scaled mutation rate. Here E is the
set of types for a general mutation model, and P a reversible mutation matrix.
Informally, this corresponds to the next sampled haplotype h being obtained
by applying a random number of mutations s to a randomly chosen existing
haplotype , where s is sampled from a geometric distribution. This related
to sampling from what is known as the coalescent, see Stephens et al (2001)
for details and references.
The algorithm uses the above expression in (*) a few slides back. There are
several issues that need to be dealt with here: estimation of , dealing with
the fact that the dimension of P is M, the number of possible haplotypes, etc.
An alternative Gibbs sampler, Nui et al (2002)
These authors are from the Bayesian school, who assume
Dirichlet priors with parameters  = (1, …, M), for the haplotype
frequencies F = (f1, …, fM). As with the EM approach, our model
is multinomial, being a product over individuals in the sample, of
phase unknown multilocus genotype probabilities, which in turn
are sums of products of pairs of haplotype probabilities. The
problem, as always, is in the large number of haplotypes which
are compatible with a given genotype. More fully, thinking of the
haplotypes H as “missing”, we can write
pr(G,H | F)  ∏{i=1,..n} fhi1fhi2 ∏j fjj - 1.
The steps are familiar: conditional on F, sample hi1 and hi2 for
individual i according to (**) below; then sample F given Hupdated .
(**) pr(hi1=g, hi2 = h|F, Gi) = fgfh/∑{g’h’=Gi} fg’fh’ .
Here g’h’=Gi means that g’ and h’ combine to give Gi .
Alternative Gibbs sampler, cont.
The advantage of this approach is that it is independent of
complex and potentially dangerous population genetic modelling
assumptions. Furthermore, tricks we have met before and some
new ones can be used to improve the efficiency of the algorithm.
Predictive updating. Here, as elsewhere, we integrate out the
parameters F to improve the Gibbs sampler. In this case
pr(G,H)  [|+N(H)|]/ [+N(H)],
where we are using the abbreviated gamma function notation
once more, and where N(H) is the vector of haplotype counts.
Now we can use a different, easier sampler: pick an individual i
at random or in a certain order, and update his/her haplotype hi
by sampling from
pr(hi = (g,h)|H-i,G)  (ng + g)(nh + h),
where ng and nh are counts of g and h in H-i, respectively.
Alternative sampler, cont: Ligation.
Handling a large number of loci and hence haplotypes still presents a
challenge. It is natural to adopt a divide and conquer approach: solve the
problem for small blocks of loci, and then piece together the solutions. Niu et
al suggest partitioning L loci into blocks of about 8 loci. They offer two
strategies: progressive ligation and hierarchical ligation. In both cases, the first
step is to carry out block level haplotype reconstruction using the sampler just
described. They then record the B most probable (partial) haplotypes for each
block, between 40 and 50 in their examples, and proceed to join them.
Progressive ligation begins at one end and pieces consecutive pairs together,
using the Gibbs sampler restricted to longer haplotypes which are among the
B2 combinations from the two most probable sets of B partial haplotypes. This
process is then continued until all the blocks are joined.
Hierarchical ligation is analogous, but working across the whole length, see
Figure on next slide. It is worth pointing out that these strategies not only deal
with the large state space, they also help the Markov Chain to converge more
rapidly.
Schematic depicting ligation. Here there are L loci, initially in
segments of K, while  is the highest level of the pyramidal hierarchy.
Alternative sampler completed
Prior annealing. A nice trick used in Niu et al to enable
the Gibbs sampler to more move freely in haplotype space is to
use high pseudo-counts at the beginning of the iterations, and
progressively reduce them at a fixed rate as the sampler
continues. Their formula for T iterations is
(t) = (0) + t((T) - (0) )/T.
Missing marker data. The absence of both alleles of
an SNP marker is common owing to PCR dropouts. Another
concern is the “one-allele” problem, in which only one allele is
unscored. The Gibbs sampler adapts nicely to having multiple
categories of errors and dealing with these in the sampling.
References
A G Clark. Inference of haplotypes from PCR-amplified samples of diploid
populations. Mol. Biol. Evol. 7: 111-122. 1990
L Escoffier and M Slatkin. Maximum likelihood estimation of molecular
haplotype frequencies in a diploid population. Mol. Biol. Evol. 12: 921-927,
1995.
D Fallin and NJ Schork. Accuracy of haplotype frequency estimation for
biallelic loci, via the expectation maximization algorithm, for unphased
diploid genotype data. Am J Hum, Genet. 67:947-959, 2000.
M Stephens, NJ Smith and P Donnelly. A new statistical method for haplotype
reconstruction from population data. Am. J. Hum. Genet. 68: 978-989, 2001.
T Niu, ZS Qin, X. Xu and J Liu. Bayesian haplotype inference for multiple linked
single-nucleotide polymorphisms. Am. J. Hum. Genet. 70:157-169, 2002.
T Speed, Statistics 246 (notes), Berkeley 2002