The EM algorithm
Lecture #11
Acknowledgement: Some slides of this lecture are due to Nir Friedman.
Expectation Maximization (EM) for
Bayesian networks
Intuition (as before):
- When we have access to all counts, then we can find the ML estimate of all parameters in all local tables directly by counting (see the sketch at the end of this slide).
- However, missing values do not allow us to perform such counts.
- So instead, we compute the expected counts using the current parameter assignment, and then use them to compute the maximum likelihood estimate.
[Figure: Bayesian network with nodes A, B, C, D (A → B, A → C, B → D, C → D) and local tables P(A=a|θ), P(B=b|A=a,θ), P(C=c|A=a,θ), P(D=d|b,c,θ).]
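For the complete-data case, the counting step really is just counting and normalizing. Below is a minimal Python sketch (the data set is made up for illustration) that estimates one local table, P(B=b | A=a, θ), from fully observed samples.

```python
from collections import Counter

# Made-up, fully observed samples of (A, B).
data = [("H", "H"), ("H", "T"), ("H", "H"), ("T", "T"), ("T", "H"), ("T", "T")]

pair_counts = Counter(data)                  # N(a, b)
parent_counts = Counter(a for a, _ in data)  # N(a)

# ML estimate of the local table: P(B=b | A=a) = N(a, b) / N(a).
cpt = {(a, b): pair_counts[(a, b)] / parent_counts[a] for (a, b) in pair_counts}
print(cpt)  # e.g. P(B=H | A=H) = 2/3
```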
Expectation Maximization (EM)
[Figure: A toy model over variables X, Y, Z. The data table contains observed values (H/T) and missing entries (?). Using the current parameters, e.g. P(Y=H | X=H, Z=T, θ) = 0.3 and P(Y=H | X=T, θ) = 0.4, each missing entry is completed fractionally, giving the expected counts below.]

Expected Counts

N(X,Y)
X Y   #
H H  1.3
T H  0.4
H T  1.7
T T  1.6

N(X,Z)
X Z   #
H H  0.1
T H  0.2
H T  2.9
T T  1.8
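A minimal Python sketch of how such expected counts arise. It assumes, for this sketch, that Y's parents are X and Z and uses the parameter P(Y=H | X=H, Z=T, θ') = 0.3 quoted above; the two-record data set is made up (it is not the slide's data). Because Y's parents are observed and Y has no observed children here, the posterior of a missing Y is read directly off its local table; in general this step requires inference in the network.

```python
from collections import defaultdict

# One row of the current table P(Y | X, Z, theta'), taken from the slide.
p_y_h_given_xz = {("H", "T"): 0.3}   # P(Y=H | X=H, Z=T, theta') = 0.3

# Made-up records (X, Y, Z); None marks a missing value of Y.
records = [("H", None, "T"), ("H", "H", "T")]

expected_nxy = defaultdict(float)    # expected counts N(X, Y)
for x, y, z in records:
    if y is not None:
        expected_nxy[(x, y)] += 1.0               # observed: a full count
    else:
        p_h = p_y_h_given_xz[(x, z)]              # posterior of the missing Y
        expected_nxy[(x, "H")] += p_h             # fractional count for Y=H
        expected_nxy[(x, "T")] += 1.0 - p_h       # fractional count for Y=T

print(dict(expected_nxy))            # {('H', 'H'): 1.3, ('H', 'T'): 0.7}
```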
EM (cont.)

Reiterate:
[Figure: One EM iteration. From the initial network (G, θ') with structure X1, X2, X3 → Y → Z1, Z2, Z3 and the training data, the E-step computes the expected counts N(X1), N(X2), N(X3), N(Y, X1, X2, X3), N(Z1, Y), N(Z2, Y), N(Z3, Y); the M-step reparameterizes them to give the updated network (G, θ).]

Note: This EM iteration corresponds to the non-homogeneous HMM iteration. When parameters are shared across local probability tables or are functions of each other, changes are needed.
EM in Practice
Initial parameters:
- Random parameter setting
- "Best" guess from another source
Stopping criteria:
- Small change in the likelihood of the data
- Small change in parameter values
Avoiding bad local maxima:
- Multiple restarts (see the sketch after this list)
- Early "pruning" of unpromising ones
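A minimal sketch of the multiple-restarts idea in Python. The run_em and random_parameters callables are placeholders for whatever model-specific EM is being run; an "early pruning" variant would run only a few iterations from each start and continue the most promising ones.

```python
def best_of_restarts(run_em, random_parameters, n_restarts=10):
    """Run EM from several random starting points and keep the best local maximum.

    Assumes run_em(theta0) returns (theta, log_likelihood) and
    random_parameters() returns a random initial parameter setting.
    """
    best_theta, best_ll = None, float("-inf")
    for _ in range(n_restarts):
        theta, ll = run_em(random_parameters())
        if ll > best_ll:          # keep the restart with the highest likelihood
            best_theta, best_ll = theta, ll
    return best_theta, best_ll
```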
Relative Entropy – a measure of difference between distributions

We define the relative entropy H(P||Q) for two probability distributions P and Q of a variable X (with x_i being a value of X) as follows:

H(P||Q) = Σ_i P(x_i) log2( P(x_i) / Q(x_i) )

This is a measure of the difference between P(x) and Q(x). It is not a symmetric function. The distribution P(x) is assumed to be the "true" distribution and is used for taking the expectation of the log ratio.

The following properties hold:
H(P||Q) ≥ 0
Equality holds if and only if P(x) = Q(x) for all x.
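A short Python sketch of the definition above (base-2 logarithms; the two distributions are made up for illustration):

```python
from math import log2

def relative_entropy(p, q):
    """H(P||Q) = sum_x P(x) log2(P(x)/Q(x)); assumes Q(x) > 0 wherever P(x) > 0."""
    return sum(px * log2(px / q[x]) for x, px in p.items() if px > 0)

P = {"a": 0.5, "b": 0.5}
Q = {"a": 0.9, "b": 0.1}
print(relative_entropy(P, Q))  # positive: P and Q differ
print(relative_entropy(P, P))  # 0.0: equality iff P = Q
```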
Average Score for sequence comparisons

Recall that we have defined the scoring function via

σ(a, b) = log [ P(a, b) / ( Q(a) Q(b) ) ]

Note that the average score is the relative entropy H(P||Q), where Q(a,b) = Q(a) Q(b).

Relative entropy also arises when choosing amongst competing models.
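A small numeric illustration of this claim, with a made-up joint P(a,b) and background Q(a): the expected score under P is, term by term, the relative entropy H(P||Q) with Q(a,b) = Q(a)Q(b), and it comes out positive when P is the "true" model.

```python
from math import log2

# Made-up joint "match" distribution P(a, b) and background Q(a).
P = {("A", "A"): 0.3, ("A", "B"): 0.2, ("B", "A"): 0.2, ("B", "B"): 0.3}
Q = {"A": 0.5, "B": 0.5}

def sigma(a, b):
    """Score sigma(a, b) = log2( P(a, b) / (Q(a) Q(b)) )."""
    return log2(P[(a, b)] / (Q[a] * Q[b]))

# Average score under P = relative entropy H(P || Q x Q); positive here.
avg_score = sum(P[ab] * sigma(*ab) for ab in P)
print(avg_score)
```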
The setup of the EM algorithm

We start with a likelihood function parameterized by θ.

The observed quantity is denoted X=x. It is often a vector x1,…,xL of observations (e.g., evidence for some nodes in a Bayesian network).

The hidden quantity is a vector Y=y (e.g., states of unobserved variables in a Bayesian network). The quantity y is defined such that if it were known, the likelihood of the completed data point P(x,y|θ) is easy to maximize.

The log-likelihood of an observation x has the form:

log P(x|θ) = log P(x,y|θ) − log P(y|x,θ)

(because P(x,y|θ) = P(x|θ) P(y|x,θ)).
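A tiny numeric check of this decomposition in Python, using a made-up joint distribution over one observed X and one hidden Y (fixed numbers standing in for a particular θ):

```python
from math import log

# Made-up joint distribution P(x, y).
joint = {("x0", "y0"): 0.1, ("x0", "y1"): 0.3,
         ("x1", "y0"): 0.2, ("x1", "y1"): 0.4}

x, y = "x0", "y1"
p_x = sum(p for (xx, _), p in joint.items() if xx == x)  # P(x)
p_xy = joint[(x, y)]                                     # P(x, y)
p_y_given_x = p_xy / p_x                                 # P(y | x)

# log P(x) = log P(x, y) - log P(y | x), and this holds for any choice of y.
print(log(p_x), log(p_xy) - log(p_y_given_x))            # the two values agree
```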
The goal of the EM algorithm

The log-likelihood of ONE observation x has the form:

log P(x|θ) = log P(x,y|θ) − log P(y|x,θ)

The goal: Starting with a current parameter vector θ', EM's goal is to find a new vector θ such that P(x|θ) > P(x|θ'), with the highest possible difference.

The result: After enough iterations, EM reaches a local maximum of the likelihood P(x|θ).
The Expectation Operator

Recall that the expectation of a random variable Y with a probability distribution P(y) is given by E[Y] = Σ_y y P(y).

The expectation of a function L(Y) is given by E[L(Y)] = Σ_y P(y) L(y).

An example used by the EM algorithm:

Q(θ|θ') ≡ E_θ'[log P(x,y|θ)] = Σ_y P(y|x,θ') log P(x,y|θ)

The expectation operator E is linear. For two random variables X, Y and constants a, b, the following holds:

E[aX + bY] = a E[X] + b E[Y]
Improving the likelihood

Starting with log P(x|θ) = log P(x,y|θ) − log P(y|x,θ), multiplying both sides by P(y|x,θ') and summing over y yields

log P(x|θ) = Σ_y P(y|x,θ') log P(x,y|θ) − Σ_y P(y|x,θ') log P(y|x,θ)

where the first sum is E_θ'[log P(x,y|θ)] = Q(θ|θ').

We now observe that

Δ = log P(x|θ) − log P(x|θ')
  = Q(θ|θ') − Q(θ'|θ') + Σ_y P(y|x,θ') log [ P(y|x,θ') / P(y|x,θ) ]

The last sum is a relative entropy and hence ≥ 0, so Δ ≥ Q(θ|θ') − Q(θ'|θ').

So choosing θ* = argmax_θ Q(θ|θ') maximizes this guaranteed improvement Δ, and repeating this process leads to a local maximum of log P(x|θ).
The EM algorithm

Input: A likelihood function P(x,y|θ) parameterized by θ.
Initialization: Fix an arbitrary starting value θ'.
Repeat:
  E-step: Compute Q(θ|θ') = E_θ'[log P(x,y|θ)]
  M-step: θ' ← argmax_θ Q(θ|θ')
Until Δ = log P(x|θ) − log P(x|θ') < ε

Comment: At the M-step one can actually choose any θ' as long as Δ > 0. This change yields the so-called Generalized EM algorithm. It is important when the argmax is hard to compute.
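A generic sketch of this loop in Python. The e_step, m_step and log_likelihood callables are placeholders for the model at hand: in practice the E-step returns the expected sufficient statistics that define Q(·|θ'), and the M-step returns the maximizing θ.

```python
def em(e_step, m_step, log_likelihood, theta0, eps=1e-6, max_iter=1000):
    """Generic EM loop, following the slide.

    e_step(theta)          -> expected complete-data statistics under theta'
    m_step(stats)          -> argmax_theta Q(theta | theta')
    log_likelihood(theta)  -> log P(x | theta), used for the stopping test
    """
    theta = theta0
    ll = log_likelihood(theta)
    for _ in range(max_iter):
        stats = e_step(theta)        # E-step
        theta = m_step(stats)        # M-step
        new_ll = log_likelihood(theta)
        if new_ll - ll < eps:        # Delta = log P(x|theta) - log P(x|theta') < eps
            return theta
        ll = new_ll
    return theta
```

For the Generalized EM variant mentioned in the comment, m_step only needs to increase Q, not maximize it.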
The EM algorithm (with multiple independent samples)

Recall that the log-likelihood of an observation x has the form:

log P(x|θ) = log P(x,y|θ) − log P(y|x,θ)

For independent samples (xi, yi), i = 1,…,m, we can write:

Σ_i log P(xi|θ) = Σ_i log P(xi,yi|θ) − Σ_i log P(yi|xi,θ)

E-step: Compute Q(θ|θ') = E_θ'[ Σ_i log P(xi,yi|θ) ] = Σ_i E_θ'[log P(xi,yi|θ)]
(each sample is completed separately).

M-step: θ' ← argmax_θ Q(θ|θ')
MLE from Incomplete Data

- Finding the MLE parameters is a nonlinear optimization problem.

[Figure: the log-likelihood log P(x|θ) and the surrogate E_θ'[log P(x,y|θ)] as functions of θ.]

Expectation Maximization (EM):
- Use the "current point" to construct an alternative function (which is "nice").
- Guarantee: the maximum of the new function has a higher likelihood than the current point.
Gene Counting Revisited (as EM)

The observations: The variables X = (NA, NB, NAB, NO) with a specific assignment x = (nA, nB, nAB, nO).

The hidden quantity: The variables Y = (Na/a, Na/o, Nb/b, Nb/o) with a specific assignment y = (na/a, na/o, nb/b, nb/o).

The parameters: θ = {θa, θb, θo}.

The likelihood of the completed data of n points:

P(x,y|θ) = P(nAB, nO, na/a, na/o, nb/b, nb/o | θ)
         = [ n! / ( na/a! na/o! nb/b! nb/o! na/b! no/o! ) ]
           · (θa^2)^(na/a) · (2 θa θo)^(na/o) · (θb^2)^(nb/b) · (2 θb θo)^(nb/o) · (2 θa θb)^(na/b) · (θo^2)^(no/o)

(with na/b = nAB and no/o = nO, since these genotypes are determined by the observed phenotypes).
The E-step of Gene Counting

The likelihood of the hidden data given the observed data of n points:

P(y | x, θ') = P(na/a, na/o, nb/b, nb/o | nA, nB, nAB, nO, θ')
             = P(na/a, na/o | nA, θ'a, θ'o) · P(nb/b, nb/o | nB, θ'b, θ'o)

where, for example,

P(na/a | nA, θ'a, θ'o) = [ nA! / ( na/a! (nA − na/a)! ) ] · [ θ'a^2 / (θ'a^2 + 2 θ'a θ'o) ]^(na/a) · [ 2 θ'a θ'o / (θ'a^2 + 2 θ'a θ'o) ]^(nA − na/a)

so that

na/a = E_θ'(Na/a) = nA · θ'a^2 / (θ'a^2 + 2 θ'a θ'o)
na/o = E_θ'(Na/o) = nA · 2 θ'a θ'o / (θ'a^2 + 2 θ'a θ'o)

This is exactly the E-step we used earlier!
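A sketch of this E-step in Python (the function name and argument names are mine; n_A and n_B are the observed counts of phenotypes A and B):

```python
def e_step_abo(n_A, n_B, theta_a, theta_b, theta_o):
    """Expected genotype counts given the observed phenotype counts and theta'."""
    # Split the n_A type-A individuals between genotypes a/a and a/o.
    p_A = theta_a**2 + 2 * theta_a * theta_o       # P(phenotype A | theta')
    n_aa = n_A * theta_a**2 / p_A                  # E_theta'[N_a/a]
    n_ao = n_A * 2 * theta_a * theta_o / p_A       # E_theta'[N_a/o]
    # The same split for the n_B type-B individuals.
    p_B = theta_b**2 + 2 * theta_b * theta_o
    n_bb = n_B * theta_b**2 / p_B                  # E_theta'[N_b/b]
    n_bo = n_B * 2 * theta_b * theta_o / p_B       # E_theta'[N_b/o]
    return n_aa, n_ao, n_bb, n_bo
```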
The M-step of Gene Counting

The log-likelihood of the completed data of n points:

log P(x,y|θ) = K + na/a log(θa^2) + na/o log(2 θa θo) + nb/b log(θb^2) + nb/o log(2 θb θo) + na/b log(2 θa θb) + no/o log(θo^2)

(where K is the log of the multinomial coefficient, which does not depend on θ).

Taking the expectation with respect to Y = (Na/a, Na/o, Nb/b, Nb/o) and using the linearity of E yields the function Q(θ|θ') which we need to maximize:

E_θ'[log P(x,y|θ)] = E_θ'[K] + E_θ'[Na/a] log(θa^2) + E_θ'[Na/o] log(2 θa θo) + E_θ'[Nb/b] log(θb^2) + E_θ'[Nb/o] log(2 θb θo) + E_θ'[Na/b] log(2 θa θb) + E_θ'[No/o] log(θo^2)
The M-step of Gene Counting (Cont.)

We need to maximize the function:

f(θa, θb, θo) = na/a log(θa^2) + na/o log(2 θa θo) + nb/b log(θb^2) + nb/o log(2 θb θo) + na/b log(2 θa θb) + no/o log(θo^2)

under the constraint θa + θb + θo = 1.

The solution (obtained using Lagrange multipliers) is given by

θa = (2 na/a + na/o + nAB) / 2n
θb = (2 nb/b + nb/o + nAB) / 2n
θo = (2 nO + na/o + nb/o) / 2n

which matches the M-step we used earlier!
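Putting this closed-form M-step together with the E-step sketched earlier gives a complete gene-counting EM loop. The phenotype counts below are made up for illustration, and e_step_abo refers to the sketch after the E-step slide.

```python
def m_step_abo(n_aa, n_ao, n_bb, n_bo, n_AB, n_O, n):
    """Closed-form M-step from the Lagrange-multiplier solution above."""
    theta_a = (2 * n_aa + n_ao + n_AB) / (2 * n)
    theta_b = (2 * n_bb + n_bo + n_AB) / (2 * n)
    theta_o = (2 * n_O + n_ao + n_bo) / (2 * n)
    return theta_a, theta_b, theta_o

# Made-up phenotype counts (n_A, n_B, n_AB, n_O) and a uniform starting point.
n_A, n_B, n_AB, n_O = 186, 38, 13, 284
n = n_A + n_B + n_AB + n_O
theta_a, theta_b, theta_o = 1 / 3, 1 / 3, 1 / 3

for _ in range(100):   # or stop when the change in parameters is tiny
    n_aa, n_ao, n_bb, n_bo = e_step_abo(n_A, n_B, theta_a, theta_b, theta_o)
    theta_a, theta_b, theta_o = m_step_abo(n_aa, n_ao, n_bb, n_bo, n_AB, n_O, n)

print(theta_a, theta_b, theta_o)   # allele frequency estimates; they sum to 1
```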
Outline for a different derivation of
Gene Counting as an EM algorithm
Define a variable X with values xA,xB,xAB,xO.
Define a variable Y with values ya/a, ya/o, yb/b, yb/o, ya/b, yo/o.
Examine the Bayesian network:
[Figure: a two-node Bayesian network Y → X.]

The local probability table for Y is P(ya/a|θ) = θa·θa, P(ya/o|θ) = 2 θa θo, etc.

The local probability table for X given Y is P(xA | ya/a, θ) = 1, P(xA | ya/o, θ) = 1, P(xA | yb/o, θ) = 0, etc.; only 0's and 1's.
Homework: write down for yourself the likelihood function
for n independent points xi,yi, and check that the EM
equations match the gene counting equations.