Examples of the EM algorithm

Three examples of the EM algorithm
Week 12, Lecture 1
Statistics 246, Spring 2002
The estimation of linkage from the
offspring of selfed heterozygotes
R A Fisher and Bhai Balmukand,
Journal of Genetics 20 79-92 (1928)
See also:
Collected Papers of R A Fisher,
Volume II, Paper 71, pp 285-298.
The problem
In modern terminology, we have two linked bi-allelic loci, A and
B say, with alleles A and a, and B and b, respectively, where A is
dominant over a and B is dominant over b.
A double heterozygote AaBb will produce gametes of four types:
AB, Ab, aB and ab.
Since the loci are linked, the types AB and ab will appear with a
frequency different from that of Ab and aB, say 1-r and r,
respectively, in males, and 1-r’ and r’ respectively in females.
Here we suppose that the parental origin of these heterozygotes
is from the mating AABB × aabb, so that r and r’ are the male and
female recombination rates between the two loci.
The problem is to estimate r and r’, if possible, from the offspring
of selfed double heterozygotes.
Offspring genotypic and phenotypic
frequencies
Since gametes AB, Ab, aB and ab are produced in proportions
(1-r)/2, r/2, r/2 and (1-r)/2 respectively by the male parent, and
(1-r’)/2, r’/2, r’/2 and (1-r’)/2 respectively by the female parent,
zygotes with genotypes AABB, AaBB, …etc, are produced with
frequencies (1-r)(1-r’)/4, (1-r)r’/4, etc.
Exercise: Complete the Punnett square of offspring genotypes
and their associated frequencies.
The problem here is this: although there are 16 distinct
offspring genotypes, taking parental origin into account, the
dominance relations imply that we only observe 4 distinct
phenotypes, which we denote by A*B*, A*b*, a*B* and a*b*.
Here A* (resp. B*) denotes the dominant, while a* (resp. b*)
denotes the recessive phenotype determined by the alleles at
A (resp. B).
Offspring genotypic and phenotypic
probabilities, cont.
Thus individuals with genotypes AABB, AaBB, AABb or AaBb,
which account for 9/16 gametic combinations (check!), all exhibit
the phenotype A*B*, i.e. the dominant alternative in both
characters, while those with genotypes AAbb or Aabb (3/16)
exhibit the phenotype A*b*, those with genotypes aaBB and
aaBb (3/16) exhibit the phenotype a*B*, and finally the double
recessives aabb (1/16) exhibit the phenotype a*b*.
It is a slightly surprising fact that the probabilities of the four
phenotypic classes are definable in terms of the parameter
ψ = (1-r)(1-r’), as follows: a*b* has probability ψ/4 (easy to see),
a*B* and A*b* both have probabilities (1-ψ)/4, while A*B* has
probability 1 minus the sum of the preceding, which is (2+ψ)/4.
Exercise: Calculate these phenotypic probabilities.
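For readers who want to check the two exercises above numerically, here is a minimal Python sketch (not part of Fisher and Balmukand's treatment; it assumes the sympy package is available, and all variable names are ours) that enumerates the 16 gametic combinations and verifies that the phenotypic class probabilities reduce to the formulas above.

```python
# Sketch: enumerate the 16 gametic combinations from selfing AaBb and collapse
# them into the four phenotypic classes, checking the formulas (2+psi)/4,
# (1-psi)/4, (1-psi)/4 and psi/4 with psi = (1-r)(1-r').
import sympy as sp
from collections import defaultdict

r, rp = sp.symbols("r rp")          # male and female recombination rates

def gamete_probs(rec):
    return {"AB": (1 - rec)/2, "Ab": rec/2, "aB": rec/2, "ab": (1 - rec)/2}

pheno = defaultdict(lambda: sp.Integer(0))
for gm, pm in gamete_probs(r).items():        # male gamete
    for gf, pf in gamete_probs(rp).items():   # female gamete
        a_star = "A" in gm or "A" in gf       # at least one dominant A allele
        b_star = "B" in gm or "B" in gf       # at least one dominant B allele
        label = ("A*" if a_star else "a*") + ("B*" if b_star else "b*")
        pheno[label] += pm * pf

psi = (1 - r)*(1 - rp)
claimed = {"A*B*": (2 + psi)/4, "A*b*": (1 - psi)/4,
           "a*B*": (1 - psi)/4, "a*b*": psi/4}
for label, formula in claimed.items():
    print(label, sp.expand(pheno[label] - formula) == 0)   # all True
```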
Estimation of ψ
Now suppose we have a random sample of n offspring from the
selfing of our double heterozygote. Thus the 4 phenotypic classes
will be represented roughly in proportion to their theoretical
probabilities, their joint distribution being multinomial
Mult [n; (2+ )/4, (1-)/4, (1-)/4, /4].
Note that neither r nor r’ will be separately estimable from these
data, but only the product ψ = (1-r)(1-r’). Since we know that
r ≤ 1/2 and r’ ≤ 1/2, it follows that ψ ≥ 1/4.
How do we estimate ψ? Fisher and Balmukand discuss a variety
of methods that were in the literature at the time they wrote, and
compare them with maximum likelihood, which is the method of
choice in problems like this. We describe a variant of their
approach to illustrate the EM algorithm.
The incomplete data formulation
Let us denote (cf. p. 26 of Week 11b) the counts of the 4
phenotypic classes by y1, y2, y3 and y4, these having
probabilities (2+ψ)/4, (1-ψ)/4, (1-ψ)/4 and ψ/4, respectively.
Now the probability of observing the genotype AABB is ψ/4, just
as it is with aabb, and although this genotype is phenotypically
indistinguishable from the 8 others with phenotype A*B*, it is
convenient to imagine that we can distinguish them. So let us
denote the count of the rest of that class by x1, and the count of
AABB by x2, so that x1+x2 = y1. Note that x1 has marginal
probability 1/2 and x2 has marginal probability ψ/4. In the jargon
of the EM algorithm, x1 and x2 are missing data, as we only
observe their sum y1. Next, as on p. 26 of Week 11b, we let
y2=x3, y3=x4 and y4=x5. We now illustrate the approach of the
EM algorithm, referring to material in Week 9b and Week 11b
for generalities.
The EM algorithm for this problem
The complete data log likelihood at ψ is (Ex: Check):
(x2+x5) log ψ + (x3+x4) log(1-ψ).
The expected value of the complete data log likelihood given the
observed data, taken at ψ’ (E-step: think of ψ’ as ψ-initial), is:
(Eψ’(x2|y1)+y4) log ψ + (y2+y3) log(1-ψ).
Now Eψ’(x2|y1) is just ky1, where k = ψ’/(2+ψ’). (Ex: Check.)
The maximum over ψ of the expected value of the complete
data log likelihood given the observed data taken at ψ’ (M-step)
occurs at
ψ’’ = (ky1+y4)/(ky1+y2+y3+y4). (Ex: Check)
Here we think of ψ’’ as ψ-next.
It should now be clear how the E-step (calculation of k) and the
M-step (calculation of ψ’’) can be iterated.
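As a concrete illustration, a minimal Python sketch of that iteration follows; the function name, starting value, tolerance and the phenotype counts in the example call are all hypothetical choices, not taken from the lecture.

```python
# Sketch of the EM iteration for psi = (1-r)(1-r') from the observed phenotype
# counts y1, y2, y3, y4.  The starting value and stopping rule are arbitrary.

def em_psi(y1, y2, y3, y4, psi=0.5, tol=1e-10, max_iter=500):
    for _ in range(max_iter):
        k = psi / (2.0 + psi)                              # E-step: E(x2 | y1) = k*y1
        psi_new = (k*y1 + y4) / (k*y1 + y2 + y3 + y4)      # M-step
        if abs(psi_new - psi) < tol:
            return psi_new
        psi = psi_new
    return psi

print(em_psi(125, 18, 20, 34))   # hypothetical counts; converges in a handful of steps
```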
Comments on this example
This completes our discussion of this example. It appeared in
the famous EM paper (Dempster, Laird and Rubin, JRSSB
1977) without any explanation of its genetic origins. Of course
it is an illustration of the EM only, for the likelihood equation
generated by the observed data alone is a quadratic in ψ,
and so easy to solve (see Fisher & Balmukand). Thus it is not
necessary to use the EM in this case (some would say in any
case, but that is for another time). We have omitted any of the
fascinating detail provided in Fisher and Balmukand, and
similarly in Dempster et al. Read these papers: both are
classics, with much of interest to you. Rather than talk about
details concerning the EM (most importantly, starting and
stopping it, the issue of global max, and SEs for parameter
estimates), I turn to another important EM example: mixtures.
Fitting a mixture model by EM to discover
motifs in biopolymers
T L Bailey and C Elkan, UCSD Technical Report CS94-351; ISMB94.
Here we outline some more EM theory, this being relevant to
motif discovery. We follow the above report, as we will be
discussing the program MEME written by these authors in a
later lecture. This part is called MM.
A finite mixture model supposes that data X = (X1,…,Xn) arise
from two or more groups whose distributions have known form
but different, unknown parameters θ = (θ1, …, θg), where g is
the number of groups, and mixing parameters λ = (λ1,…, λg),
where the λj are non-negative and sum to 1.
It is convenient to introduce indicator vectors Z = (Z1,…,Zn),
where Zi = (Zi1,…,Zig), and Zij = 1 if Xi is from group j, and = 0
otherwise. Thus Zi gives group membership for the ith sample.
It follows that pr(Zij = 1 | θ, λ) = λj. For any given i, all the Zij are 0
apart from one.
Complete data log likelihood
Under the assumption that the pairs (Zi,Xi) are mutually
independent, their joint density may be written (Exercise:
Carry out this calculation in detail.)
pr(Z, X | , ) = ∏ij [j pr(Xi | j) ] Zij
The complete-data log likelihood is thus
log L(,  | Z, X) = ∑∑ Zij log [ j pr(Xi | j) ].
The EM algorithm iteratively computes the expectation of
this quantity given the observed data X, and initial estimates
’ and ’ of  and  (the E-step), and then maximizes the
result in the free variables  and  leading to new estimates
’’ and ’’ (the M-step). Our interest here is in the particular
11
calculations necessary to carry out these two steps.
Mixture models: the E-step
Since the log likelihood is a sum over i and j of terms
multiplying Zij, and these are independent across i, we
need only consider the expectation of one such term, given Xi.
Using initial parameter values θ’ and λ’, and the fact that
the Zij are binary, we get
E(Zij | X, θ’, λ’) = λ’j pr(Xi | θ’j) / ∑k λ’k pr(Xi | θ’k)
= Z’ij, say.
Exercise: Obtain this result.
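A minimal numpy sketch of this E-step (in our own notation, with a hypothetical two-component normal mixture as the worked example) might look as follows.

```python
import numpy as np

# E-step sketch: lam is the length-g vector of current mixing weights lambda'_j,
# and dens is an n-by-g array with dens[i, j] = pr(X_i | theta'_j).

def e_step(lam, dens):
    weighted = dens * lam                                    # lambda'_j * pr(X_i | theta'_j)
    return weighted / weighted.sum(axis=1, keepdims=True)    # Z'_ij; each row sums to 1

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5*((x - mu)/sigma)**2) / (sigma*np.sqrt(2*np.pi))

x = np.array([-1.2, 0.3, 2.5])                               # hypothetical data
dens = np.column_stack([normal_pdf(x, 0.0, 1.0), normal_pdf(x, 2.0, 1.0)])
print(e_step(np.array([0.5, 0.5]), dens))                    # the Z'_ij
```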
Mixture models: the M-step
Here our task is to maximize the result of an E-step:
∑∑ Z’ij j + ∑∑ Z’ij log pr(Xi | j).
The maximization over  is clearly independent of the rest and
is readily seen to be achieved (Ex: check this) with
j’’ = ∑iZ’ij / n.
Maximizing over  requires that we specify the model in more
detail. The case of interest to us is where g = 2, and the
distributions for class 1 (the motif) and class 2 (the
background) are given by position specific and a general
multinomial distribution, respectively.
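Continuing the E-step sketch above, the mixing-weight update is one line; the update of θ is deferred to the motif/background model described next.

```python
# Sketch: lambda''_j = sum_i Z'_ij / n, i.e. the column means of the
# responsibility matrix Z' returned by the E-step sketch above.

def m_step_mixing(Z):
    return Z.mean(axis=0)
```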
Mixture models: the M-step, cont.
Our initial observations are supposed to consist of N sequences
from an L-letter alphabet. Unlike what we did in the last lecture,
these sequences are now broken up into all n overlapping
subsequences of length w, and these are our Xi. We need w+1
sets of probabilities, namely fjk and f0k, where j=1,…,w (the
length of the motif) and k runs over the symbol alphabet.
With these parameters, we can write
pr(Xi | 1) = ∏j∏k fjk I(k,Xij) , and pr(Xi | 2) = ∏j∏k f0k I(k,Xij)
where Xij is the letter in the jth position of sample i, and I(k,a) = 1
if a=ak, and =0 otherwise. With this notation, for k=1,…L, write
c0k = ∑∑ Z’i2I(k,Xij),
and cjk = ∑∑ Z’i1I(k,Xij).
Here c0k is the expected number of times letter ak appears in the
background, and cjk the expected number of times ak appears in14
occurrences of the motif in the data.
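A sketch of how these expected counts might be tallied (our own notation: X is the list of length-w subsequences, Z the n-by-2 responsibility matrix from the E-step, with column 0 for the motif class and column 1 for the background):

```python
import numpy as np

# Sketch: accumulate c_{jk} (motif, per position j) and c_{0k} (background)
# by weighting each observed letter with the appropriate responsibility.

def expected_counts(X, Z, alphabet, w):
    idx = {a: k for k, a in enumerate(alphabet)}   # letter -> column index k
    c_motif = np.zeros((w, len(alphabet)))         # c_{jk}, j = 1..w
    c_bg = np.zeros(len(alphabet))                 # c_{0k}
    for i, x in enumerate(X):
        for j, letter in enumerate(x):
            k = idx[letter]
            c_motif[j, k] += Z[i, 0]               # weight Z'_{i1} (motif class)
            c_bg[k] += Z[i, 1]                     # weight Z'_{i2} (background class)
    return c_bg, c_motif
```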
Mixture models: the M-step completed.
With these preliminaries, it is straightforward to maximize the
expected complete-data log likelihood, given the observations
X, evaluated at the initial parameters θ’ and λ’. Exercise: Fill in
the details missing below. We obtain
f’’jk = cjk / ∑k′ cjk′ ,
where the sum is over the L letters of the alphabet;
j = 0,1,…,w; k = 1,…,L.
In practice, care must be taken to avoid zero frequencies, so
either one uses explicit Dirichlet prior distributions, or one
adds small constants βk, with ∑k βk = β, giving
f’’jk = (cjk + βk) / (∑k′ cjk′ + β) ,
j = 0,1,…,w; k = 1,…,L.
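Continuing the count sketch above, the frequency update with pseudocounts might be written as follows (beta is a hypothetical length-L vector of small constants playing the role of the βk):

```python
import numpy as np

# Sketch: convert expected counts into letter frequencies, adding pseudocounts
# beta_k so that no frequency is exactly zero.

def frequencies(c_bg, c_motif, beta):
    f0 = (c_bg + beta) / (c_bg.sum() + beta.sum())                              # f''_{0k}
    f = (c_motif + beta) / (c_motif.sum(axis=1, keepdims=True) + beta.sum())    # f''_{jk}
    return f0, f
```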
Comment on the EM algorithm
A common misconception concerning the EM algorithm is
that we are estimating or predicting the missing data,
plugging that estimate or prediction into the complete data
log likelihood, “completing the data” you might say, and
then maximizing this in the free parameters, as though we
had complete data.
This is emphatically NOT what is going on.
A more accurate description might be this. We are using the
inferred missing data to weight different parts of the
complete data log likelihood differently in such a way that
the pieces combine into the maximum likelihood estimates.
An Expectation Maximization (EM) Algorithm for the Identification and
Characterization of Common Sites in Unaligned Biopolymer Sequences
C E Lawrence and A A Reilly
PROTEINS: Structure, Function and Genetics 7:41-51 (1990)
This paper was the precursor of the Gibbs-Metropolis algorithm
we described in Week 11b. We now briefly describe the algorithm
along the lines of the discussion just given, but in the notation
used in Week 11b. On p. .. of that lecture, the full data log
likelihood was given, without the term for the marginal
distribution of A being visible. We suppose that the ak are
independent, and uniformly distributed over {1,…,nk-W+1}.
Because this term does not depend on either θ0 or θ, it is
absorbed into the likelihood proportionality constant. Thus the
complete data log likelihood is
log L(θ0, θ | R, A) = h(R{A}c) · log θ0 + ∑j h(RA+j-1) · log θj ,
where we use the dot-product notation h · log θ0 = ∑i hi log θ0,i, cf. p. 12, Week 11b.
The E-step in this case.
The expected value of the complete data log likelihood given
the observed data and initial parameters θ’0 and θ’ is just
∑A pr(A | R, θ’0, θ’) log pr(R, A | θ0, θ)      (*)
where the sum is over all A = {a1,…,aK}, and so our task is to
calculate the first term. Now we are treating all the rows
(sequences) as mutually independent, so pr(A | R, θ’0, θ’)
factorizes in k, and we need only deal with a typical row, the
kth, say. Letting ak denote the random variable
corresponding to the start of the motif in the kth row, then
pr(ak = i) = 1/(nk-W+1), i=1, …, nk-W+1.
We can readily calculate pr(ak = i | θ’0, θ’, R) by Bayes’
theorem, and these get multiplied together and inserted in (*) above.
Exercise: Carry out this calculation.
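A minimal sketch of that Bayes calculation for a single sequence (our own notation: theta0 is the length-L vector of background letter probabilities, theta the W-by-L matrix of motif position probabilities, and alphabet maps letters to indices; all names are illustrative):

```python
import numpy as np

# Sketch: posterior distribution of the motif start position a_k in one sequence.
# With a uniform prior, pr(a_k = i | R_k) is proportional to the likelihood of the
# sequence with the motif placed at i; dividing by the all-background likelihood
# (which does not depend on i) leaves a product of ratios over the motif window.

def start_posterior(seq, theta0, theta, alphabet):
    W, n = theta.shape[0], len(seq)
    idx = [alphabet[a] for a in seq]
    scores = np.array([
        np.prod([theta[j, idx[i + j]] / theta0[idx[i + j]] for j in range(W)])
        for i in range(n - W + 1)
    ])
    return scores / scores.sum()       # pr(a_k = i | theta'_0, theta', R_k)
```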
The M-step in this case.
Once we have calculated the expected value of the
complete data log likelihood given the observed data
and initial parameters θ’0 and θ’, we then maximize it in
the free variables θ0 and θ, leading to new parameters
θ’’0 and θ’’.
How is this done? Without giving the gory details, just
notice that (*) is a weighted combination of multinomial
log likelihoods just like the one we met in our previous
example, the mixture model. There the weights were the
Z’s, and here they are the pr(A | R, θ’0, θ’)s. It follows
(Exercise: Fill in the details) that the maximizing values
of θ0 and θ, which we denote by θ’’0 and θ’’, are ratios
of expected counts similar to the c0 and cj in the mixture
discussion. As there, we will want to deal with small or
zero counts by invoking Dirichlet priors.
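To round this off, here is a sketch of the M-step under the same illustrative notation: the posterior weights from the E-step sketch play the role of the Z’s, and the new θ’’0 and θ’’ are normalized expected counts, with pseudocounts standing in for a Dirichlet prior.

```python
import numpy as np

# Sketch: weights[k][i] = pr(a_k = i | R, theta'_0, theta') from the E-step; seqs is
# the list of sequences, W the motif width, alphabet maps letters to 0..L-1, and
# beta is a length-L vector of pseudocounts.

def m_step_sites(seqs, weights, W, alphabet, beta):
    L = len(alphabet)
    c_motif = np.zeros((W, L))                     # expected counts per motif position
    c_bg = np.zeros(L)                             # expected background counts
    for seq, w in zip(seqs, weights):
        idx = [alphabet[a] for a in seq]
        total = np.bincount(idx, minlength=L).astype(float)   # all letters in this row
        for i, wi in enumerate(w):                 # candidate start positions
            window = np.zeros(L)
            for j in range(W):
                c_motif[j, idx[i + j]] += wi
                window[idx[i + j]] += 1.0
            c_bg += wi * (total - window)          # letters outside the motif window
    theta = (c_motif + beta) / (c_motif.sum(axis=1, keepdims=True) + beta.sum())
    theta0 = (c_bg + beta) / (c_bg.sum() + beta.sum())
    return theta0, theta
```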