The EM algorithm
Lecture #11
Acknowledgement: Some slides of this lecture are due to Nir Friedman.
Expectation Maximization (EM) for
Bayesian networks
Intuition (as before):
When we have access to all counts, then we
can find the ML estimate of all parameters in
all local tables directly by counting.
However, missing values do not allow us to
perform such counts.
So instead, we compute the expected
counts using the current parameter
assignment, and then use them to compute
the maximum likelihood estimate.
[Figure: a Bayesian network with nodes A, B, C, D (edges A → B, A → C, and B, C → D) and local tables P(A=a|θ), P(B=b|A=a,θ), P(C=c|A=a,θ), P(D=d|b,c,θ).]
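To make the "find the ML estimate by counting" step concrete, here is a minimal Python sketch (the record format and function name are illustrative, not from the lecture) of estimating one local table from fully observed data. With missing values the counts N(parents, child) are unavailable, which is exactly what the expected counts below replace.

```python
from collections import Counter

def mle_cpt(data, child, parents):
    """ML estimate of P(child | parents) from fully observed records.

    data: list of dicts mapping variable name -> observed value.
    Returns {(parent_values, child_value): probability}.
    """
    joint = Counter()      # counts N(parents, child)
    marginal = Counter()   # counts N(parents)
    for record in data:
        pa = tuple(record[p] for p in parents)
        joint[(pa, record[child])] += 1
        marginal[pa] += 1
    return {(pa, c): cnt / marginal[pa] for (pa, c), cnt in joint.items()}

# Example: estimate P(D | B, C) from complete records of the network above.
data = [{'A': 'H', 'B': 'H', 'C': 'T', 'D': 'H'},
        {'A': 'T', 'B': 'H', 'C': 'T', 'D': 'T'},
        {'A': 'H', 'B': 'T', 'C': 'T', 'D': 'T'}]
print(mle_cpt(data, 'D', ['B', 'C']))
```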
Expectation Maximization (EM)
[Figure: a small network over variables X, Y, Z together with a partially observed data table (missing entries are marked "?"). Using the current parameters, e.g. P(Y=H|X=H,Z=T,θ) = 0.3 and P(Y=H|X=T,θ) = 0.4, each record with a missing value contributes a fractional count, giving the expected counts:]

Expected Counts

N(X,Y)
X Y #
H H 1.3
T H 0.4
H T 1.7
T T 1.6

N(X,Z)
X Z #
H H 0.1
T H 0.2
H T 2.9
T T 1.8
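The fractional counts above arise by splitting each incomplete record according to the posterior of its missing values under the current parameters. A minimal Python sketch of this E-step bookkeeping (the record format, parameter table, and numbers are assumptions, not the lecture's data):

```python
from collections import defaultdict

def expected_counts_xy(records, p_y_given_xz):
    """Expected counts N(X, Y) from partially observed records.

    records: list of dicts with keys 'X', 'Y', 'Z'; a missing Y is None.
    p_y_given_xz: {(x, z): P(Y='H' | X=x, Z=z)} under the current parameters.
    """
    counts = defaultdict(float)
    for r in records:
        if r['Y'] is not None:                    # observed: a whole count of 1
            counts[(r['X'], r['Y'])] += 1.0
        else:                                     # missing: split the count
            p_h = p_y_given_xz[(r['X'], r['Z'])]
            counts[(r['X'], 'H')] += p_h
            counts[(r['X'], 'T')] += 1.0 - p_h
    return counts

# Illustrative current parameters and data (not the lecture's):
params = {('H', 'T'): 0.3, ('T', 'T'): 0.4, ('H', 'H'): 0.5, ('T', 'H'): 0.5}
data = [{'X': 'H', 'Y': None, 'Z': 'T'}, {'X': 'T', 'Y': 'H', 'Z': 'H'}]
print(dict(expected_counts_xy(data, params)))
```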
EM (cont.)
[Figure: one EM iteration. An initial network (G, θ') over X1, X2, X3, Y, Z1, Z2, Z3, together with the training data, feeds the expected counts computation (E-step), which produces N(X1), N(X2), N(X3), N(Y, X1, X2, X3), N(Z1, Y), N(Z2, Y), N(Z3, Y); reparameterizing (M-step) yields the updated network (G, θ); then reiterate.]
Note: This EM iteration corresponds to the non-homogeneous HMM iteration. When parameters
are shared across local probability tables or are
functions of each other, changes are needed.
EM in Practice
Initial parameters:
Random parameter setting
“Best” guess from other source
Stopping criteria:
Small change in likelihood of data
Small change in parameter values
Avoiding bad local maxima:
Multiple restarts
Early “pruning” of unpromising ones
Relative Entropy – a measure of
difference among distributions
We define the relative entropy H(P||Q) for two
probability distributions P and Q of a variable X (with x
being a value of X) as follows:
H(P||Q) = Σi P(xi) log2( P(xi) / Q(xi) )
This is a measure of the difference between P(x) and Q(x).
It is not a symmetric function. The distribution P(x) is
assumed to be the "true" distribution, and the expectation
of the log ratio is taken with respect to it.
The following properties hold:
H(P||Q) ≥ 0
Equality holds if and only if P(x) = Q(x) for all x.
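A direct transcription of this definition into Python (a small sketch; representing the distributions as dicts is an assumption):

```python
import math

def relative_entropy(p, q):
    """H(P||Q) = sum_i P(x_i) * log2(P(x_i) / Q(x_i)).

    p, q: dicts mapping values of X to probabilities; assumes Q(x) > 0
    wherever P(x) > 0, and skips terms with P(x) = 0 (0 * log 0 = 0).
    """
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

P = {'a': 0.5, 'b': 0.5}
Q = {'a': 0.9, 'b': 0.1}
print(relative_entropy(P, Q))   # > 0 since P != Q
print(relative_entropy(P, P))   # 0.0: equality iff P(x) = Q(x) for all x
```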
Average Score for sequence
comparisons
Recall that we have defined the scoring function via
σ(a,b) = log [ P(a,b) / (Q(a) Q(b)) ]
Note that the average score is the
relative entropy H(P ||Q) where
Q(a,b) = Q(a) Q(b).
Relative entropy also arises when
choosing amongst competing models.
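To see the first identity numerically, here is a small sketch (the joint P(a,b) and background Q(a) below are made-up numbers): the average score Σa,b P(a,b) σ(a,b) equals H(P||Q) with Q(a,b) = Q(a) Q(b).

```python
import math

def average_score(P, Q):
    """Sum over (a, b) of P(a,b) * log2( P(a,b) / (Q(a) * Q(b)) )."""
    return sum(pab * math.log2(pab / (Q[a] * Q[b]))
               for (a, b), pab in P.items() if pab > 0)

# Toy joint and background distributions (illustrative only):
P = {('A', 'A'): 0.3, ('A', 'C'): 0.2, ('C', 'A'): 0.2, ('C', 'C'): 0.3}
Q = {'A': 0.5, 'C': 0.5}
print(average_score(P, Q))   # equals H(P || Q) with Q(a,b) = Q(a)Q(b)
```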
The setup of the EM algorithm
We start with a likelihood function parameterized by θ.
The observed quantity is denoted X=x. It is often a vector
x1,…,xL of observations (e.g., evidence for some nodes in a
Bayesian network).
The hidden quantity is a vector Y=y (e.g. states of
unobserved variables in a Bayes network). The quantity y is
defined such that if it were known, the likelihood of the
completed data point P(x,y|θ) is easy to maximize.
The log-likelihood of an observation x has the form:
log P(x|θ) = log P(x,y|θ) – log P(y|x,θ)
(Because P(x,y|θ) = P(x|θ) P(y|x,θ)).
The goal of EM algorithm
The log-likelihood of ONE observation x has the form:
log P(x|θ) = log P(x,y|θ) – log P(y|x,θ)
The goal: Starting with a current parameter vector θ', EM's
goal is to find a new vector θ such that P(x|θ) > P(x|θ') with
the highest possible difference.
The result: After enough iterations EM reaches a local
maximum of the likelihood P(x|θ).
The Expectation Operator
Recall that the expectation of a random variable Y with a pdf
P(y) is given by E[Y] = Σy y p(y).
The expectation of a function L(Y) is given by
E[L(Y)] = Σy p(y) L(y).
An example used by the EM algorithm:
Q(θ|θ') ≡ Eθ'[log p(x,y|θ)] = Σy p(y|x,θ') log p(x,y|θ)
The expectation operator E is linear. For two random
variables X,Y, and constants a,b, the following holds:
E[aX+bY] = a E[X] + b E[Y]
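In code, Q(θ|θ') is just a posterior-weighted sum over the hidden values. A generic Python sketch (the function names and call signatures are placeholders, not a fixed API):

```python
def Q(theta, theta_prime, x, hidden_values, log_p_xy, p_y_given_x):
    """Q(theta | theta') = sum over y of P(y | x, theta') * log P(x, y | theta).

    log_p_xy(x, y, theta) and p_y_given_x(y, x, theta) are model-specific
    functions supplied by the caller (hypothetical names, for illustration).
    """
    return sum(p_y_given_x(y, x, theta_prime) * log_p_xy(x, y, theta)
               for y in hidden_values)
```

The generic EM loop sketched after the algorithm slide below maximizes this quantity through its M-step argument.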
Improving the likelihood
Starting with log P(x|θ) = log P(x,y|θ) – log P(y|x,θ),
multiplying both sides by P(y|x,θ'), and summing over y,
yields
log P(x|θ) = Σy P(y|x,θ') log P(x,y|θ) – Σy P(y|x,θ') log P(y|x,θ)
where the first sum is Eθ'[log p(x,y|θ)] = Q(θ|θ').
We now observe that
Δ = log P(x|θ) – log P(x|θ') = Q(θ|θ') – Q(θ'|θ')
+ Σy P(y|x,θ') log [P(y|x,θ') / P(y|x,θ)]
The last sum is a relative entropy, hence ≥ 0.
So choosing θ* = argmaxθ Q(θ|θ') maximizes the guaranteed
part of the difference Δ (in particular Q(θ*|θ') – Q(θ'|θ') ≥ 0,
so Δ ≥ 0), and repeating this process leads to a local
maximum of log P(x|θ).
The EM algorithm
Input: A likelihood function p(x,y|θ) parameterized by θ.
Initialization: Fix an arbitrary starting value θ'.
Repeat
E-step:
Compute Q(θ|θ') = Eθ'[log P(x,y|θ)]
M-step: θ' ← argmaxθ Q(θ|θ')
Until Δ = log P(x|θ) – log P(x|θ') < ε
Comment: At the M-step one can actually choose any θ' as long as
Δ > 0. This change yields the so-called Generalized EM algorithm.
It is important when the argmax is hard to compute.
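The same loop written generically in Python; everything model-specific (the complete-data likelihood, the posterior over the hidden values, and the maximizer of Q) is passed in. The interface is an assumption of this sketch, not anything prescribed by the lecture:

```python
import math

def em(x, hidden_values, p_xy, p_y_given_x, m_step, theta0,
       eps=1e-6, max_iter=100):
    """Generic EM sketch.

    p_xy(x, y, theta): complete-data likelihood P(x, y | theta).
    p_y_given_x(y, x, theta): posterior of a hidden value under theta.
    m_step(weights): returns argmax over theta of the expected complete-data
        log-likelihood, given posterior weights {y: P(y | x, theta')}.
    """
    theta, loglik = theta0, -math.inf
    for _ in range(max_iter):
        # E-step: posterior over hidden values under the current parameters.
        weights = {y: p_y_given_x(y, x, theta) for y in hidden_values}
        # M-step: maximize Q(theta | theta'), delegated to the model.
        theta = m_step(weights)
        new_loglik = math.log(sum(p_xy(x, y, theta) for y in hidden_values))
        if new_loglik - loglik < eps:       # stop on a small likelihood change
            break
        loglik = new_loglik
    return theta
```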
The EM algorithm (with multiple
independent samples)
Recall the log-likelihood of an observation x has the form:
log P(x|θ) = log P(x,y|θ) – log P(y|x,θ)
For independent samples (xi, yi), i=1,…,m, we can write:
Σi log P(xi|θ) = Σi log P(xi,yi|θ) – Σi log P(yi|xi,θ)
E-step:
Compute Q(θ|θ') = Eθ'[Σi log P(xi,yi|θ)]
= Σi Eθ'[log P(xi,yi|θ)]
(each sample is completed separately)
M-step: θ' ← argmaxθ Q(θ|θ')
MLE from Incomplete Data
Finding MLE parameters: nonlinear optimization problem
[Figure: the log-likelihood log P(x|θ) and the surrogate function Eθ'[log P(x,y|θ)] constructed at the current point.]
Expectation Maximization (EM):
Use “current point” to construct alternative function (which is “nice”)
Guarantee: the maximum of the new function has a higher likelihood than the
current point
Gene Counting Revisited (as EM)
The observations: The variables X=(NA, NB, NAB, NO) with a
specific assignment x = (nA, nB, nAB, nO).
The hidden quantity: The variables Y=(Na/a, Na/o, Nb/b, Nb/o)
with a specific assignment y = (na/a, na/o, nb/b, nb/o).
The parameters: θ = {θa, θb, θo}.
The likelihood of the completed data of n points:
P(x,y|θ) = P(nAB, nO, na/a, na/o, nb/b, nb/o | θ) =
[ n! / (na/a! na/o! nb/b! nb/o! na/b! no/o!) ] ·
(θa²)^na/a · (2 θa θo)^na/o · (θb²)^nb/b · (2 θb θo)^nb/o · (2 θa θb)^na/b · (θo²)^no/o
(here na/b = nAB and no/o = nO).
The E-step of Gene Counting
The likelihood of the hidden data given the observed data
of n points:
P(y|x,θ') = P(na/a, na/o, nb/b, nb/o | nA, nB, nAB, nO, θ')
= P(na/a, na/o | nA, θ'a, θ'o) · P(nb/b, nb/o | nB, θ'b, θ'o)
where
p(na/a | nA, θ'a, θ'o) = [ nA! / (na/a! (nA – na/a)!) ] ·
[ θ'a² / (θ'a² + 2 θ'a θ'o) ]^na/a · [ 2 θ'a θ'o / (θ'a² + 2 θ'a θ'o) ]^(nA – na/a)
so the expected counts are
Eθ'[Na/a] = nA · θ'a² / (θ'a² + 2 θ'a θ'o)
Eθ'[Na/o] = nA · 2 θ'a θ'o / (θ'a² + 2 θ'a θ'o)
(and similarly for Nb/b and Nb/o from nB).
This is exactly the E-step we used earlier!
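These two expectations translate directly into code. A minimal Python sketch of the gene-counting E-step (the function name, argument names, and example numbers are illustrative):

```python
def e_step(n_A, n_B, theta_a, theta_b, theta_o):
    """Expected genotype counts given phenotype counts and the current
    allele frequencies theta' = (theta_a, theta_b, theta_o)."""
    # Type A individuals split between genotypes a/a and a/o.
    p_aa = theta_a**2 / (theta_a**2 + 2 * theta_a * theta_o)
    # Type B individuals split between genotypes b/b and b/o.
    p_bb = theta_b**2 / (theta_b**2 + 2 * theta_b * theta_o)
    return {'n_aa': n_A * p_aa, 'n_ao': n_A * (1 - p_aa),
            'n_bb': n_B * p_bb, 'n_bo': n_B * (1 - p_bb)}

print(e_step(n_A=100, n_B=50, theta_a=0.3, theta_b=0.2, theta_o=0.5))
```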
The M-step of Gene Counting
The log-likelihood of the completed data of n points:
log P(x,y|θ) = K + na/a log θa² + na/o log(2 θa θo)
+ nb/b log θb² + nb/o log(2 θb θo) + na/b log(2 θa θb) + no/o log θo²
Taking expectation wrt Y = (Na/a, Na/o, Nb/b, Nb/o) and
using linearity of E yields the function Q(θ|θ') which
we need to maximize:
Eθ'[log P(x,y|θ)] = Eθ'[K] + Eθ'[Na/a] log θa² + Eθ'[Na/o] log(2 θa θo)
+ Eθ'[Nb/b] log θb² + Eθ'[Nb/o] log(2 θb θo) + Eθ'[Na/b] log(2 θa θb) + Eθ'[No/o] log θo²
The M-step of Gene Counting (Cont.)
We need to maximize the function:
f(θa, θb, θo) = na/a log θa² + na/o log(2 θa θo) + nb/b log θb²
+ nb/o log(2 θb θo) + na/b log(2 θa θb) + no/o log θo²
Under the constraint θa + θb + θo = 1.
The solution (obtained using Lagrange multipliers) is given
by
θa = (2 na/a + na/o + nAB) / (2n)
θb = (2 nb/b + nb/o + nAB) / (2n)
θo = (2 nO + na/o + nb/o) / (2n)
Which matches the M-step we used earlier!
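Putting the E- and M-steps together gives the full gene-counting loop. A self-contained Python sketch (the starting frequencies and the phenotype counts in the example are illustrative, not from the lecture):

```python
def gene_counting(n_A, n_B, n_AB, n_O, iters=50):
    """Gene counting as EM: alternate expected genotype counts (E-step)
    and the allele-frequency update above (M-step)."""
    n = n_A + n_B + n_AB + n_O
    ta = tb = to = 1.0 / 3.0                     # uniform starting frequencies
    for _ in range(iters):
        # E-step: split the A and B phenotype counts into genotypes.
        n_aa = n_A * ta**2 / (ta**2 + 2 * ta * to)
        n_ao = n_A - n_aa
        n_bb = n_B * tb**2 / (tb**2 + 2 * tb * to)
        n_bo = n_B - n_bb
        # M-step: re-estimate allele frequencies from the expected counts.
        ta = (2 * n_aa + n_ao + n_AB) / (2 * n)
        tb = (2 * n_bb + n_bo + n_AB) / (2 * n)
        to = (2 * n_O + n_ao + n_bo) / (2 * n)
    return ta, tb, to

# Illustrative phenotype counts:
print(gene_counting(n_A=100, n_B=50, n_AB=20, n_O=30))
```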
Outline for a different derivation of
Gene Counting as an EM algorithm
Define a variable X with values xA,xB,xAB,xO.
Define a variable Y with values ya/a, ya/o, yb/b, yb/o, ya/b, yo/o.
Examine the Bayesian network Y → X:
The local probability table for Y is P(ya/a|θ) = θa · θa,
P(ya/o|θ) = 2 θa θo, etc.
The local probability table for X given Y is P(xA | ya/a, θ) = 1,
P(xA | ya/o, θ) = 1, P(xA | yb/o, θ) = 0, etc.; only 0's and 1's.
Homework: write down for yourself the likelihood function
for n independent points xi,yi, and check that the EM
equations match the gene counting equations.