Maximum Likelihood And Expectation Maximization

Lecture Notes for CMPUT 466/551
Nilanjan Ray
MLE and EM
• Maximum Likelihood Estimation (MLE) and Expectation Maximization (EM) are two very important tools in Machine Learning
• Essentially, they are used to estimate probability distributions within a learning algorithm; we have already seen one such example: MLE in logistic regression
• We will revisit MLE here and point out certain difficulties with it
• Then Expectation Maximization (EM) will come to the rescue
Probability Density Estimation: Quick Points
Two different routes:
Parametric
• Provide a parametrized class of density functions
• Tools:
– Maximum likelihood estimation
– Expectation Maximization
– Sampling techniques
– …
Non-Parametric
• Density is modeled by the samples
• Tools:
– Kernel methods
– Sampling techniques
– …
Revisiting Maximum Likelihood
The data come from a known family of probability distributions
The distribution has some parameters that are unknown to you
Example: the data are Gaussian distributed, $y_i \sim N(\mu, \sigma^2)$, so the unknown parameters here are $\theta = (\mu, \sigma^2)$
MLE is a tool that estimates the unknown parameters of the probability distribution from the data
MLE: Recapitulation
• Assume the observations $y_i$ are independent
• Form the likelihood:
$$L(\theta; Z) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y_i - \mu)^2}{2\sigma^2}\right); \qquad Z = \{y_1, y_2, \ldots, y_N\}$$
• Form the log-likelihood:
$$l(\theta; Z) = \log\left(\prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y_i - \mu)^2}{2\sigma^2}\right)\right) = -\sum_{i=1}^{N} \frac{(y_i - \mu)^2}{2\sigma^2} - N \log(\sqrt{2\pi}\,\sigma)$$
• To find the unknown parameter values, maximize the log-likelihood with respect to the unknown parameters:
$$\frac{\partial l}{\partial \mu} = 0 \;\Rightarrow\; \hat{\mu} = \frac{\sum_{i=1}^{N} y_i}{N}; \qquad \frac{\partial l}{\partial \sigma^2} = 0 \;\Rightarrow\; \hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{\mu})^2$$
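To make the closed-form result concrete, here is a minimal NumPy sketch (the synthetic data and variable names are illustrative, not part of the lecture) that computes $\hat{\mu}$ and $\hat{\sigma}^2$ exactly as derived above:

```python
import numpy as np

# Hypothetical synthetic data drawn from N(mu=2.0, sigma=1.5)
rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=1000)

# MLE for a single Gaussian: the closed-form solutions derived above
mu_hat = y.sum() / len(y)                        # sample mean
sigma2_hat = ((y - mu_hat) ** 2).sum() / len(y)  # MLE (biased) variance

print(f"mu_hat = {mu_hat:.3f}, sigma2_hat = {sigma2_hat:.3f}")
```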
MLE: A Challenging Example
Source: Department of Statistics, CMU
Observation data: histogram of the observed samples (figure)
Mixture model:
$$Y_1 \sim N(\mu_1, \sigma_1^2); \qquad Y_2 \sim N(\mu_2, \sigma_2^2)$$
$$Y = (1 - \Delta)\, Y_1 + \Delta\, Y_2; \qquad \Delta \in \{0, 1\} \text{ is an indicator variable}$$
$$\theta_1 = (\mu_1, \sigma_1); \qquad \theta_2 = (\mu_2, \sigma_2)$$
$\pi$ is the probability with which the observation is chosen from density 2
$(1 - \pi)$ is the probability with which the observation is chosen from density 1
MLE: A Challenging Example…
Maximum likelihood fitting of the parameters $\theta = (\pi, \mu_1, \mu_2, \sigma_1, \sigma_2)$ requires maximizing the observed-data log-likelihood
$$l(\theta; Z) = \sum_{i=1}^{N} \log\left[(1 - \pi)\,\Phi_1(y_i) + \pi\,\Phi_2(y_i)\right]$$
Because the sum over components sits inside the logarithm, this is challenging to solve, numerically (and of course analytically, too)!
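To see what direct MLE is up against, the following minimal NumPy sketch (function names and parameter ordering are hypothetical choices) writes down the observed-data negative log-likelihood that would have to be searched numerically:

```python
import numpy as np

def gaussian_pdf(y, mu, sigma):
    """Univariate normal density."""
    return np.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def mixture_neg_log_likelihood(params, y):
    """Observed-data negative log-likelihood of the two-component mixture.
    The sum over components sits inside the log, so there is no closed-form
    maximizer; a generic optimizer (e.g. scipy.optimize.minimize with this
    function and a starting guess) would have to search this surface."""
    pi, mu1, mu2, sigma1, sigma2 = params
    mix = (1 - pi) * gaussian_pdf(y, mu1, sigma1) + pi * gaussian_pdf(y, mu2, sigma2)
    return -np.sum(np.log(mix))
```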
Expectation Maximization: A Rescuer
EM augments the data space by assuming some latent data
Source: Department of Statistics, CMU
EM: A Rescuer…
The complete data are $T = \{(y_i, \Delta_i)\}_{i=1}^{N}$, with complete-data log-likelihood
$$l_0(\theta; T) = \sum_{i=1}^{N}\left[(1 - \Delta_i)\log \Phi_1(y_i) + \Delta_i \log \Phi_2(y_i)\right] + \sum_{i=1}^{N}\left[(1 - \Delta_i)\log(1 - \pi) + \Delta_i \log \pi\right]$$
Maximizing this form of the log-likelihood is tractable, whereas the original observed-data log-likelihood could not be maximized analytically
Source: Department of Statistics, CMU
EM: The Complete Data Likelihood
By simple differentiations we have:
$$\frac{\partial l_0}{\partial \mu_1} = 0 \;\Rightarrow\; \hat{\mu}_1 = \frac{\sum_{i=1}^{N} (1 - \Delta_i)\, y_i}{\sum_{i=1}^{N} (1 - \Delta_i)}; \qquad \frac{\partial l_0}{\partial \mu_2} = 0 \;\Rightarrow\; \hat{\mu}_2 = \frac{\sum_{i=1}^{N} \Delta_i\, y_i}{\sum_{i=1}^{N} \Delta_i};$$
$$\frac{\partial l_0}{\partial \sigma_1^2} = 0 \;\Rightarrow\; \hat{\sigma}_1^2 = \frac{\sum_{i=1}^{N} (1 - \Delta_i)(y_i - \hat{\mu}_1)^2}{\sum_{i=1}^{N} (1 - \Delta_i)}; \qquad \frac{\partial l_0}{\partial \sigma_2^2} = 0 \;\Rightarrow\; \hat{\sigma}_2^2 = \frac{\sum_{i=1}^{N} \Delta_i\, (y_i - \hat{\mu}_2)^2}{\sum_{i=1}^{N} \Delta_i};$$
$$\frac{\partial l_0}{\partial \pi} = 0 \;\Rightarrow\; \hat{\pi} = \frac{1}{N}\sum_{i=1}^{N} \Delta_i;$$
So, maximization of the complete data likelihood is much easier!
How do we get the latent variables?
Obtaining Latent Variables
The latent variables are computed as expected values
given the data and parameters:
$$\gamma_i(\theta) = E(\Delta_i \mid \theta, y_i) = \Pr(\Delta_i = 1 \mid \theta, y_i)$$
Apply Bayes' rule:
$$\gamma_i(\theta) = \Pr(\Delta_i = 1 \mid \theta, y_i) = \frac{\Pr(y_i \mid \Delta_i = 1, \theta)\,\Pr(\Delta_i = 1 \mid \theta)}{\Pr(y_i \mid \Delta_i = 1, \theta)\,\Pr(\Delta_i = 1 \mid \theta) + \Pr(y_i \mid \Delta_i = 0, \theta)\,\Pr(\Delta_i = 0 \mid \theta)} = \frac{\Phi_2(y_i)\,\pi}{\Phi_1(y_i)(1 - \pi) + \Phi_2(y_i)\,\pi}$$
EM for Two-component Gaussian Mixture
• Initialize 1, 1, 2, 2, 
• Iterate until convergence
– Expectation of latent variables
Φ 2 ( yi )π
γi (θ ) 
Φ1 ( yi )(1  π )  Φ 2 ( yi )π

1
( yi  μ1 ) 2 ( yi  μ2 ) 2
1   2
1
exp( 

)
 1
2σ12
2σ 22
– Maximization for finding parameters
N
1 
 (1  γ ) y
i 1
N
i
 (1  γ )
i 1
i
N
i
;
2 
γ y
i 1
N
i
γ
i 1
i
N
i
;
σ 
2
1
 (1  γ )( y
i
i 1
i
 μ1 )
N
 (1  γ )
i 1
i
N
2
; σ 
2
2
γ (y
i 1
i
i
 μ2 )
N
γ
i 1
i
N
2
; π
γ
i 1
N
i
;
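The whole two-component procedure fits in a few lines. Below is a minimal NumPy sketch under simple assumptions (random-point initialization, a fixed iteration count, and the function name em_two_gaussians chosen for illustration); it is a sketch of the iteration above, not the lecture's reference implementation:

```python
import numpy as np

def em_two_gaussians(y, n_iter=100):
    """EM for a two-component univariate Gaussian mixture,
    following the E- and M-steps on the slide above."""
    # Simple initialization: two random data points as means,
    # the overall standard deviation for both components, pi = 0.5
    rng = np.random.default_rng(0)
    mu1, mu2 = rng.choice(y, size=2, replace=False)
    sigma1 = sigma2 = y.std()
    pi = 0.5

    def pdf(x, mu, sigma):
        return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

    for _ in range(n_iter):
        # E-step: responsibilities gamma_i = P(Delta_i = 1 | theta, y_i)
        phi1, phi2 = pdf(y, mu1, sigma1), pdf(y, mu2, sigma2)
        gamma = pi * phi2 / ((1 - pi) * phi1 + pi * phi2)

        # M-step: weighted means, variances, and mixing probability
        w1, w2 = (1 - gamma).sum(), gamma.sum()
        mu1 = ((1 - gamma) * y).sum() / w1
        mu2 = (gamma * y).sum() / w2
        sigma1 = np.sqrt(((1 - gamma) * (y - mu1) ** 2).sum() / w1)
        sigma2 = np.sqrt((gamma * (y - mu2) ** 2).sum() / w2)
        pi = w2 / len(y)

    return pi, mu1, mu2, sigma1, sigma2
```

For example, calling em_two_gaussians on a 1-D data array returns the fitted pi, mu1, mu2, sigma1, sigma2; a convergence check on the observed-data log-likelihood could replace the fixed iteration count.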
EM for Mixture of K Gaussians
• Initialize the mean vectors, covariance matrices, and mixing probabilities: $\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k, \pi_k$, $k = 1, 2, \ldots, K$
• Expectation step: compute the responsibilities
$$\gamma_{ik} = \frac{\pi_k\, \Phi(\mathbf{y}_i; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{n=1}^{K} \pi_n\, \Phi(\mathbf{y}_i; \boldsymbol{\mu}_n, \boldsymbol{\Sigma}_n)}, \qquad i = 1, \ldots, N, \; k = 1, \ldots, K.$$
• Maximization step: update the parameters
$$\boldsymbol{\mu}_k = \frac{\sum_{i=1}^{N} \gamma_{ik}\, \mathbf{y}_i}{\sum_{i=1}^{N} \gamma_{ik}}; \qquad \boldsymbol{\Sigma}_k = \frac{\sum_{i=1}^{N} \gamma_{ik}\, (\mathbf{y}_i - \boldsymbol{\mu}_k)(\mathbf{y}_i - \boldsymbol{\mu}_k)^T}{\sum_{i=1}^{N} \gamma_{ik}}; \qquad \pi_k = \frac{\sum_{i=1}^{N} \gamma_{ik}}{N}$$
• Iterate the Expectation and Maximization steps until convergence
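A corresponding sketch for the K-component multivariate case, assuming NumPy and SciPy are available, with a deliberately simple initialization (the function name em_gaussian_mixture and the small ridge added to the covariances are illustrative choices):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gaussian_mixture(Y, K, n_iter=100):
    """EM for a mixture of K multivariate Gaussians (E- and M-steps as above).
    Y is an (N, d) data matrix."""
    N, d = Y.shape
    rng = np.random.default_rng(0)
    mus = Y[rng.choice(N, size=K, replace=False)]          # K random points as means
    covs = np.array([np.cov(Y.T) + 1e-6 * np.eye(d)] * K)  # shared initial covariance
    pis = np.full(K, 1.0 / K)

    for _ in range(n_iter):
        # E-step: responsibilities gamma[i, k]
        dens = np.column_stack([
            pis[k] * multivariate_normal.pdf(Y, mean=mus[k], cov=covs[k])
            for k in range(K)
        ])
        gamma = dens / dens.sum(axis=1, keepdims=True)

        # M-step: weighted means, covariances, and mixing probabilities
        Nk = gamma.sum(axis=0)
        mus = (gamma.T @ Y) / Nk[:, None]
        for k in range(K):
            diff = Y - mus[k]
            covs[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
        pis = Nk / N

    return pis, mus, covs
```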
EM Algorithm in General
$T = (Z, Z^m)$ is the complete data; we only observe $Z$, while $Z^m$ is missing
$$\Pr(Z \mid \theta') = \frac{\Pr(Z, Z^m \mid \theta')}{\Pr(Z^m \mid Z, \theta')} = \frac{\Pr(T \mid \theta')}{\Pr(Z^m \mid Z, \theta')}$$
Taking the logarithm:
$$l(\theta'; Z) = l_0(\theta'; T) - l_1(\theta'; Z^m \mid Z)$$
Because we have access to the previous parameter values $\theta$, we can do better; taking expectations conditioned on $Z$ and $\theta$:
$$l(\theta'; Z) = E_{T \mid Z, \theta}[l_0(\theta'; T)] - E_{Z^m \mid Z, \theta}[l_1(\theta'; Z^m \mid Z)] \equiv Q(\theta', \theta) - R(\theta', \theta)$$
Let us now consider the expression $R(\theta', \theta)$. It can be shown (by Jensen's inequality) that
$$R(\theta', \theta) \le R(\theta, \theta)$$
Thus, if $\theta'$ maximizes $Q(\theta', \theta)$, then
$$l(\theta'; Z) - l(\theta; Z) = [Q(\theta', \theta) - Q(\theta, \theta)] - [R(\theta', \theta) - R(\theta, \theta)] \ge 0$$
that is, $l(\theta'; Z) \ge l(\theta; Z)$
EM Algorithm in General
• Start with initial parameter values $\theta^{(0)}$; set $t = 1$
• Expectation step: compute
$$Q(\theta, \theta^{(t-1)}) = E_{T \mid Z, \theta^{(t-1)}}[l_0(\theta; T)]$$
• Maximization step:
$$\theta^{(t)} = \arg\max_{\theta}\, Q(\theta, \theta^{(t-1)})$$
• Set $t = t + 1$ and iterate
EM Algorithm: Summary
• Augment the original data space by
latent/hidden/missing data
• Frame a suitable probability model for the augmented
data space
• In EM iterations, first assume initial values for the
parameters
• Iterate the Expectation and the Maximization steps
• In the Expectation step, find the expected values of the
latent variables (here you need to use the current
parameter values)
• In the Maximization step, first plug the expected values of the latent variables into the log-likelihood of the augmented data, then maximize this log-likelihood to re-estimate the parameters
• Iterate last two steps until convergence
Applications of EM
– Mixture models
– HMMs
– PCA
– Latent variable models
– Missing data problems
– Many computer vision problems
– …
References
• The EM Algorithm and Extensions by Geoffrey J. McLachlan and Thriyambakam Krishnan
• For a non-parametric density estimate by EM, see:
http://bioinformatics.uchc.edu/LectureNotes_2006/Tools_EM_SA_2006_files/frame.htm
EM: Important Issues
• Is the convergence of the algorithm guaranteed?
• Does the outcome of EM depend on the initial
choice of the parameter values?
• What about the speed of convergence?
• How easy or difficult could it be to compute the
expected values of the latent variables?