Transcript ppt

HMM in crosses and small
pedigrees, cont.
Lecture 9, Statistics 246,
February 19, 2004
1
Another problem: reconstructing haplotypes
The problem here is to reconstruct the childrens’ haplotypes as in the
2
figure, from marker data on both the children and the parents.
The Lander-Green HMM
Recap. The states of the Markov chain are the inheritance vectors. At any
locus on a chromosome, the entry in the inheritance vector for a non-founder
are 0 if the parental variant passed on at that locus was grandmaternal, and
1 otherwise.
Consider a two-parent two-child nuclear family and suppose that the mother
and father are 1/2 and 3/4, respectively, while the first (girl) child is 1/3 and
the second (boy) is 2/4. Then the inheritance vector is of length 4, v =
(vgm , vgp , vbm, vbp), where gm represents the girl’s maternal meiosis, gp her
paternal meiosis, and so on.
What are the assigments for v at the marker? We don’t know which of the
mother’s alleles 1 and 2 came from her mother and which came from her
father, but we can arbitrarily declare that it was the one she passed on to her
daughter, and similarly for the alleles 3 and 4 of the father. With this
assignment, we find that v = (0, 0, 1, 1), because in each case, the boy
received alleles from his parents different from his sister’s. In fact the
specification of the paternal and maternal chromosomes of a founder is
completely arbitrary, and we’ll mention later how this can be turned into a
symmetry which speeds up the calculations.
3
Transition probabilities
Now suppose that the same family has genotypes 1’/2’, 3’/4’, 1’/3’ and 2’/4’
at a locus near the first one. If the recombination fraction between the two
loci is r, and r is small, then we might expect the inheritance vector v’ at the
second locus to coincide with v = (0,0,1,1). But what if the genotypes were
1’/2’, 3’/4’, 1’/3’ and 2’/3’, respectively? This suggests that v’ = (0,0,1,0), with
the 10 in the boy’s paternally inherited chromosome denoting a
recombination.
How do we weigh up these competing possibilities? As with the mouse
chromosomes, we need a transition matrix P(r) connecting adjacent
inheritance vectors. The form of P is as in the mouse case, namely, a tensor
power of the 22 matrices having 1-r on the diagonal, and r on the offdiagonal elements, here P = R(r)4 . In general, it is an 2nth tensor power,
where n is the number of non-founders. Thus we have our states and our
transition probability matrix, and hence our (product) Markov chain.
To complete the specification of our HMM, we need observations and the
4
associated emission probabilities, and an initial distribution.
Alternative representation
The purpose of inheritance vectors is to describe the possible patterns of
gene flow through a pedigree. Once ordered pairs of alleles are assigned to
founders, the 0s and 1s in the inheritance vectors specify the alleles that are
passed from parent to offspring down the pedigree. As mentioned in the last
lecture, an alternative representation of this gene flow is via what is known
as a descent graph, see Lange’s book, ch 9.
A recent paper gave an even more economical representation of what is
needed, via what the authors called a sparse gene flow tree. I leave those
interested to consult Abecasis et al, Nature Genetics 30 2002: 97-101.
There (in the program Allegro) the efficient matrix multiplication that we will
describe shortly is replaced by a sparse matrix-vector multiplication
algorithm more general that the one we will give.
5
Observations (phenotypes)
We now turn to our observations. The data we have on our (small)
pedigree will generally consist of genotypes at many marker loci, and
perhaps additional disease or other phenotype data, see the pedigree on
p.2, where black filling of a square or circle indicated that the person is
affected by some specified disease. In the case of interest to us here:
reconstructing haplotypes, we’ll assume that there are just marker data.
Suppose the unordered pairs of alleles (i.e. genotypes) at marker locus t
come from a set (t). Then for f founders and n non-founders, our
observations come from (t)f(t)n. Here I could have simply written
the (f+n)th power, but it is convenient to keep founders and non-founders
notationally distinct. As with the mouse crosses, we could add in ambiguity
and missing data “observations”, but for simplicity we won’t do so here.
Since we are mainly interested in marker data, let’s denote a typical
(vector) observation at locus t by mt.
6
Emission probabilities
Referring to our general discussion of HMM in the previous lecture, we
now need to specify the equivalent of the emission probabilities q(i,j,k; t) at
each locus t. These probabilities are generally functions of the current and
previous state, but here they just depend on the current state, vt , and take
the form q(vt ,mt ; t). We want this to be
q(vt ,mt ; t) = pr(observations at t = mt | inheritance vector at t = vt ),
but right now vt just describes gene flow: we haven’t got started.
To complete our description, let us write at = (at,1 ,at,2 ….,at, 2f-1 ,at, 2f ) for
the assignment of an ordered pair of alleles to each founder. Suppose that
the frequency of allele at,h is pt,h . Under the population equilibrium
assumptions we have previously mentioned, pr(at ) = ∏h pt,h . Finally, we
sum over all at (in practice, over all at compatible with the observed
phenotypes). This gets us started towards defining q(vt ,mt ; t).
7
Penetrances
Having got started, note that alleles at for the founders and an inheritance
vector vt gives us a set of ordered genotypes gt = (at ,vt ) at locus t by
following the flow. We are almost there.
The observations on each individual in the pedigree can now be assigned
probabilities given their (ordered) genotypes. This last step involves the
terms we have previously called penetrances - probabilities of observed
phenotypes given genotypes - and a non-trivial assumption: that the
probabilities of the different individuals’ phenotypes are conditionally
independent given their (ordered) genotypes, and depend on the pedigree’s
(ordered) genotypes only on their own (ordered) genotypes,
i.e.
pr(mt | gt ) = ∏i pr(mt,i | gt,i ),
where the product is over individuals i in the pedigree, and the terms in the
product are penetrances.
Putting all this together, our HMM emission probabilities have the form
q(vt ,mt ; t) = ∑ pr(at )pr(mt | at ,vt ) = ∑ ∏ pt,h ∏i pr(mt,i | gt,i ).
8
Pedigree calculations
The HMM is fully specified as soon as we give an initial distribution for the
states, and this is simply the uniform distribution: each inheritance vector
is assigned initial probability 2-2n.
We now turn to carrying out the calculations efficiently, so that we can
deal with as large small pedigrees as possible. Since the HMM
formulation of pedigree analysis was introduced by Lander and Green in
1987, there have been a number of improvements. Looking back over
nearly 20 years of these, the arms race comes to mind. Details of the first
human version Crimap were never published, but it would have used the
standard HMM simplifications we will review shortly. In 1995 Genehunter
came on the scene, Kruglyak et al (1995,1996), and this had a few new
tricks. Then Genehunter+ incorporated founder phase symmetry and
what Kruglyak and Lander (1998) called Fourier analysis (on {0,1}n). The
next improvement was Allegro (Gudbjartsson et al 2000), whose tricks
are yet to be revealed to the world, and the most recent and fastest is
Merlin (Abecasis et al, 2002), which uses sparse gene flow trees and
9
some ideas from coding theory. The race will go on!
The forward equations
It is time to present the standard HMM formulae, these going back to the
original work of Baum et al, c1970. We will revert to the general HMM
notation for this discussion, that is, our HMM has components (Xt, Yt),
with Xt “hidden” and Yt observed. With Baum et al, define the forward
quantities 1(i) = pr(Y1,X1 = i), where i is an arbitrary state of X1 , and
t+1(j) = pr(Y1,Y2,…..,Yt+1 , Xt+1 = j)
= ∑i pr(Y1,Y2,…,Yt ,Yt+1 , Xt = i, Xt+1 = i)
= ∑i pr(Y1,Y2,…,Yt , Xt = i)  pr(Xt+1 = j |Y1,Y2,…Yt , Xt = i)
 pr(Yt+1 |Y1,Y2,…,Yt , Xt = i, Xt+1 = j )
= ∑i t(i)p(i, j; t)q(i, j,Yt+1 ; t+1).
When these are all computed up to T, simply sum over states i of XT to get
pr(Y1,Y2,….., YT) = ∑iT(i).
The work is in evaluating the sum, and the q terms. We look at the sum first.
10
Evaluating the sum (a)
In our small pedigree problem, T is the number of marker loci along our
chromosome, which might be anything from 20 to 1,000, and the transition
probability matrices p are all 2n2n, where n is the number of non-founders.
For small small pedigrees, such as our nuclear family, this is not a problem,
but as we try to push the limits of this algorithm, the evaluation of products of
such matrices by conformal vectors (which is what our sum is) becomes
limiting.
How can we speed up the calculation of such products? It has been known
for decades, perhaps centuries, that the application of NN matrices which
are (equivalent to) tensor powers to N1 vectors can have the O(N2)
multiplications reduced to O(N logN). I say centuries because it is said that
Gauss used a form of the fast Fourier transform (FFT), and all these speedups are similar in spirit to the famous (and many times discovered) FFT. In
our 2n2n case, the speed-up goes back to Yates in the 1930s, who used it to
carry out the calculations for 2n factorial experiments. The details are in my
class notes Stat 260, Week 8, Appendix to the Tuesday lecture.
11
Evaluating the sum (b)
The other term needing attention in the sum is the emission probability
q(vt, mt; t), which we have defined as a potentially large sum over ordered
pairs of alleles for all founders. A lot of effort has been devoted to finding
ways of reducing the calculations
To be completed
12
The backward equations
In a manner quite analogous to that which gave us the forward equations,
we can define T (i) = 1, and
t-1 (i) = pr(Yt,Yt+1,…..,YT | Xt-1 = i), and find that it reduces to
= ∑j t(j)p(i, j; t-1)q(i, j,Yt ; t).
Exercise: Prove this.
Note that Y = (Y1,Y2,…..,YT ), then pr(Y, Xt = i) = t(i)t(i), and hence
pr(Y) = ∑i t(i)t(i) for any i.
Exercise: Prove that pr (Xt = i | Y) = t(i)t(i) / ∑i t(i)t(i) , and that
pr(Xt+1 = j | Xt = i, Y) = p(i, j ; t )q(i, j, Yt+1; t+1)t+1(j) / t(i) .
Exercise:
13
Reconstructing the haplotypes
At last we come to our initial problem.
To be completed….
14
Founders: individuals without parents in the pedigree.
Non-founders: individuals with one or more parent in the pedigree
15
Founders: individuals without parents in the pedigree.
Non-founders: individuals with one or more parent in the pedigree
16
Founders: individuals without parents in the pedigree.
Non-founders: individuals with one or more parent in the pedigree
17