Transcript Document
Ch 9. Mixture Models and EM
Pattern Recognition and Machine Learning,
C. M. Bishop, 2006.
Summarized by
Ho-Sik Seok
Biointelligence Laboratory, Seoul National University
http://bi.snu.ac.kr/
Contents
9.1 K-means Clustering
9.2 Mixtures of Gaussians
9.2.1 Maximum likelihood
9.2.2 EM for Gaussian mixtures
9.3 An Alternative View of EM
9.3.1 Gaussian mixtures revisited
9.3.2 Relation to K-means
9.3.3 Mixtures of Bernoulli distributions
9.3.4 EM for Bayesian linear regression
9.4 The EM Algorithm in General
9.1 K-means Clustering (1/3)
Problem of identifying groups, or clusters, of data
points in a multidimensional space
Partitioning the data set into some number K of clusters
Cluster: a group of data points whose inter-point
distances are small compared with the distances to points
outside of the cluster
Goal: an assignment of data points to clusters such that
the sum of the squares of the distances of each data point
to its closest prototype vector μk (the center of the cluster) is a
minimum
9.1 K-means Clustering (2/3)
Distortion measure (objective function)
J = Σn Σk rnk ‖xn − μk‖²
Two-stage optimization
In the 1st stage: minimizing J with respect to the rnk,
keeping the μk fixed
→ rnk = 1 if k = argminj ‖xn − μj‖², and 0 otherwise
In the 2nd stage: minimizing J with respect to the μk,
keeping the rnk fixed
→ μk = Σn rnk xn / Σn rnk,
the mean of all of the data points assigned to cluster k
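A minimal sketch of this two-stage optimization in Python/NumPy; the function name kmeans, the random initialization, and the stopping rule are my own illustrative choices, not the book's.

import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means sketch: alternate the two minimization stages of J."""
    rng = np.random.default_rng(seed)
    # Initialize the K prototype vectors mu_k with randomly chosen data points
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # 1st stage: minimize J w.r.t. r_nk, keeping mu_k fixed
        # (assign each point to its nearest prototype)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K) squared distances
        r = np.argmin(d2, axis=1)                                 # hard assignments r_nk
        # 2nd stage: minimize J w.r.t. mu_k, keeping r_nk fixed
        # (each mu_k becomes the mean of the points assigned to cluster k)
        new_mu = np.array([X[r == k].mean(axis=0) if np.any(r == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    J = ((X - mu[r]) ** 2).sum()  # distortion measure at the final assignment
    return r, mu, J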
9.1 K-means Clustering (3/3)
9.2 Mixtures of Gaussians (1/3)
A formulation of Gaussian mixtures in terms of discrete latent variables
The Gaussian mixture distribution can be written as a
linear superposition of Gaussians
p(x) = Σk πk N(x | μk, Σk)   …(*)
An equivalent formulation of the Gaussian mixture involves an
explicit latent variable
Graphical representation of the mixture model: a binary random
variable z having a 1-of-K representation
The marginal distribution of x is a Gaussian mixture of the form
(*): for every observed data point xn, there is a corresponding
latent variable zn
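Written out in the chapter's notation, the latent-variable formulation referred to above is (standard results; summarized here for completeness):

p(z_k = 1) = \pi_k, \qquad 0 \le \pi_k \le 1, \qquad \sum_{k=1}^{K} \pi_k = 1
p(z) = \prod_{k=1}^{K} \pi_k^{z_k}, \qquad p(x \mid z) = \prod_{k=1}^{K} N(x \mid \mu_k, \Sigma_k)^{z_k}
p(x) = \sum_{z} p(z)\, p(x \mid z) = \sum_{k=1}^{K} \pi_k\, N(x \mid \mu_k, \Sigma_k)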
9.2 Mixtures of Gaussians (2/3)
γ(zk) ≡ p(zk = 1 | x) can also be viewed as the responsibility that
component k takes for 'explaining' the observation x
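A small NumPy sketch of how these responsibilities can be computed from Bayes' theorem; gaussian_pdf, responsibilities, and the parameter names are illustrative, not from the book.

import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Density of a multivariate Gaussian N(x | mu, Sigma) at a single point x."""
    D = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / norm

def responsibilities(x, pi, mus, Sigmas):
    """gamma(z_k) = pi_k N(x | mu_k, Sigma_k) / sum_j pi_j N(x | mu_j, Sigma_j)."""
    weighted = np.array([pi_k * gaussian_pdf(x, mu_k, Sigma_k)
                         for pi_k, mu_k, Sigma_k in zip(pi, mus, Sigmas)])
    return weighted / weighted.sum()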
9.2 Mixtures of Gaussians (3/3)
Generating random samples distributed according to the
Gaussian mixture model
Generate a value for z, which we denote ẑ, from the
marginal distribution p(z), and then generate a value for x from
the conditional distribution p(x | ẑ)
a. The three states of z, corresponding to the three components of the mixture, are
depicted in red, green, and blue
b. The corresponding samples from the marginal distribution p(x)
c. The same samples in which the colors represent the value of the responsibilities γ(znk)
associated with data point xn
Illustrating the responsibilities by evaluating the posterior probability for each
component in the mixture distribution from which this data set was generated
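A sketch of this ancestral sampling procedure (first z from p(z), then x from p(x | ẑ)); the function name sample_gmm and the use of NumPy's random generator are my own choices.

import numpy as np

def sample_gmm(n_samples, pi, mus, Sigmas, seed=0):
    """Ancestral sampling from a Gaussian mixture: draw z ~ p(z), then x ~ p(x | z)."""
    rng = np.random.default_rng(seed)
    K = len(pi)
    # Draw a component label for every sample from the marginal p(z) = Cat(pi)
    z = rng.choice(K, size=n_samples, p=pi)
    # Draw x from the conditional Gaussian of the chosen component
    X = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
    return X, z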
9.2.1 Maximum likelihood (1/3)
Graphical representation of a
Gaussian mixture model for
a set of N i.i.d. data points
{xn}, with corresponding
latent points {zn}
The log of the likelihood function
ln p(X | π, μ, Σ) = Σn ln { Σk πk N(xn | μk, Σk) }   …(*1)
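A sketch evaluating (*1) with the log-sum-exp trick for numerical stability; SciPy is used here only for convenience, and the function name is mine.

import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pi, mus, Sigmas):
    """ln p(X | pi, mu, Sigma) = sum_n ln { sum_k pi_k N(x_n | mu_k, Sigma_k) }."""
    # log_probs[n, k] = ln pi_k + ln N(x_n | mu_k, Sigma_k)
    log_probs = np.stack([np.log(pi_k) + multivariate_normal.logpdf(X, mu_k, Sigma_k)
                          for pi_k, mu_k, Sigma_k in zip(pi, mus, Sigmas)], axis=1)
    # Sum over k inside the logarithm, then over n outside it
    return logsumexp(log_probs, axis=1).sum()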
9.2.1 Maximum likelihood (2/3)
For simplicity, consider a Gaussian mixture whose components
have covariance matrices given by Σk = σk² I, where I is the identity matrix
Suppose that one of the components of the mixture model has its mean
μj exactly equal to one of the data points, so that μj = xn
This data point will contribute a term to the likelihood function of the
form N(xn | xn, σj² I) = 1 / ((2π)^(D/2) σj^D), which goes to infinity as σj → 0
Once there are at least two
components in the mixture, one of
the components can have a finite
variance and therefore assign
finite probability to all of the data
points, while the other component
can shrink onto one specific data
point and thereby contribute an ever-increasing
additive value to the log
likelihood → a singularity of the likelihood (over-fitting problem)
9.2.1 Maximum likelihood (3/3)
Over-fitting problem
An example of the severe over-fitting that can occur in a maximum likelihood approach
This problem does not occur in the case of a Bayesian approach
In applying maximum likelihood to Gaussian mixture models,
there should be steps to avoid finding such pathological
solutions and instead seek local maxima of the likelihood
function that are well behaved
Identifiability problem
A K-component mixture will have a total of K! equivalent
solutions corresponding to the K! ways of assigning K sets of
parameters to K components
Difficulty of maximizing the log likelihood function (*1)
The difficulty arises from the presence of the summation over k
that appears inside the logarithm in (*1), so that the logarithm no
longer acts directly on the Gaussian
9.2.2 EM for Gaussian mixtures (1/4)
I. Assign some initial values for the means, covariances, and
mixing coefficients
II. Expectation (E) step
• Use the current values of the parameters to evaluate the posterior
probabilities, or responsibilities
III. Maximization (M) step
• Use the result of II to re-estimate the means, covariances, and
mixing coefficients
It is common to run the K-means algorithm in order to find
suitable initial values
• The covariance matrices ← the sample covariances of the clusters
found by the K-means algorithm
• The mixing coefficients ← the fractions of data points assigned to the
respective clusters
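A sketch of this K-means-based initialization, reusing the kmeans sketch given earlier under 9.1; the function name, the small jitter term, and the assumption that every cluster contains at least two points are mine.

import numpy as np

def init_from_kmeans(X, K):
    """Initialize GMM parameters from a K-means clustering of the data."""
    r, mu, _ = kmeans(X, K)              # hard assignments and cluster centers (sketch above)
    N, D = X.shape
    Sigmas, pis = [], []
    for k in range(K):
        Xk = X[r == k]                   # points assigned to cluster k (assumed at least two)
        # Covariance: sample covariance of cluster k, plus a small jitter for stability
        Sigmas.append(np.cov(Xk, rowvar=False) + 1e-6 * np.eye(D))
        # Mixing coefficient: fraction of data points assigned to cluster k
        pis.append(len(Xk) / N)
    return np.array(pis), mu, np.array(Sigmas)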
9.2.2 EM for Gaussian mixtures (2/4)
Given a Gaussian mixture model, the goal is to maximize the
likelihood function with respect to the parameters
1. Initialize the means μk, covariances Σk, and mixing coefficients πk
2. E step: evaluate the responsibilities using the current parameter values
γ(znk) = πk N(xn | μk, Σk) / Σj πj N(xn | μj, Σj)
3. M step: re-estimate the parameters using the current responsibilities
μk^new = (1/Nk) Σn γ(znk) xn
Σk^new = (1/Nk) Σn γ(znk)(xn − μk^new)(xn − μk^new)^T
πk^new = Nk / N,   where Nk = Σn γ(znk)
4. Evaluate the log likelihood
ln p(X | π, μ, Σ) = Σn ln { Σk πk N(xn | μk, Σk) }
and check for convergence of either the parameters or the log likelihood;
if the convergence criterion is not satisfied, return to step 2
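Putting steps 1 to 4 together, a compact EM sketch for a Gaussian mixture in NumPy/SciPy. This is a minimal sketch, not the book's code; em_gmm, the random initialization, the jitter term, and the convergence tolerance are my own choices.

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, tol=1e-6, seed=0):
    """EM for a Gaussian mixture: E step -> responsibilities, M step -> parameters."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # 1. Initialize means, covariances, and mixing coefficients
    mus = X[rng.choice(N, size=K, replace=False)]
    Sigmas = np.array([np.cov(X, rowvar=False) + 1e-6 * np.eye(D) for _ in range(K)])
    pis = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iters):
        # 2. E step: gamma[n, k] = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)
        dens = np.stack([pis[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
                         for k in range(K)], axis=1)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # 3. M step: re-estimate parameters using the current responsibilities
        Nk = gamma.sum(axis=0)
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pis = Nk / N
        # 4. Evaluate the log likelihood with the new parameters and check for convergence
        dens_new = np.stack([pis[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
                             for k in range(K)], axis=1)
        ll = np.log(dens_new.sum(axis=1)).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pis, mus, Sigmas, gamma, ll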
9.2.2 EM for Gaussian mixtures (3/4)
ln p(X | π, μ, Σ) = Σn ln { Σk πk N(xn | μk, Σk) }   …(*2)
Setting the derivatives of (*2) with respect to the means μk of the
Gaussian components to zero:
μk = (1/Nk) Σn γ(znk) xn,   with Nk = Σn γ(znk) and responsibility γ(znk)
- A weighted mean of all of the points in the data set
Setting the derivatives of (*2) with respect to the covariances Σk
of the Gaussian components to zero:
Σk = (1/Nk) Σn γ(znk)(xn − μk)(xn − μk)^T
- Each data point weighted by the corresponding posterior probability
- The denominator given by the effective number of points Nk associated with
the corresponding component
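For completeness, maximizing (*2) with respect to the mixing coefficients, subject to Σk πk = 1 via a Lagrange multiplier λ, gives the standard result from the chapter:

\ln p(X \mid \pi, \mu, \Sigma) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right)
\quad\Longrightarrow\quad \pi_k = \frac{N_k}{N}, \qquad N_k = \sum_{n=1}^{N} \gamma(z_{nk})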
9.2.2 EM for Gaussian mixtures (4/4)
9.3 An Alternative View of EM
General EM
Maximizing the log likelihood function
Given a joint distribution p(X, Z|Θ) over observed variables X
and latent variables Z, governed by parameters Θ
1. Choose an initial setting for the parameters Θold
2. E step: evaluate p(Z | X, Θold)
3. M step: evaluate Θnew given by
Θnew = argmaxΘ Q(Θ, Θold)
where Q(Θ, Θold) = ΣZ p(Z | X, Θold) ln p(X, Z | Θ)
4. If the convergence criterion is not satisfied,
then let Θold ← Θnew and return to step 2
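A minimal structural sketch of this general EM loop in Python; e_step, m_step, and log_likelihood are hypothetical caller-supplied functions (not part of the chapter), so this only illustrates the control flow.

def em(theta_init, e_step, m_step, log_likelihood, max_iters=100, tol=1e-6):
    """Generic EM loop: E step computes p(Z | X, theta_old), M step maximizes Q(theta, theta_old)."""
    theta_old = theta_init
    prev_ll = float("-inf")
    for _ in range(max_iters):
        posterior = e_step(theta_old)      # 2. E step: evaluate p(Z | X, theta_old)
        theta_new = m_step(posterior)      # 3. M step: theta_new = argmax_theta Q(theta, theta_old)
        ll = log_likelihood(theta_new)     # monitor ln p(X | theta)
        if abs(ll - prev_ll) < tol:        # 4. convergence criterion
            break
        theta_old, prev_ll = theta_new, ll # otherwise theta_old <- theta_new and repeat
    return theta_old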
9.3.1 Gaussian mixtures revisited (1/2)
For the complete data set {X, Z}, the likelihood function takes the form
p(X, Z | μ, Σ, π) = Πn Πk πk^znk N(xn | μk, Σk)^znk   (9.14)
ln p(X, Z | μ, Σ, π) = Σn Σk znk { ln πk + ln N(xn | μk, Σk) }
The logarithm acts directly on the Gaussian distribution
→ a much simpler solution to the maximum likelihood problem
Because zn uses the 1-of-K coding, the complete-data log likelihood
function is a sum of K independent contributions, one for each
mixture component → the maximization with respect to a mean
or a covariance is exactly as for a single Gaussian, involving only
the data points 'assigned' to that component
The mixing coefficients are equal to the fractions of data points assigned to
the corresponding components
9.3.1 Gaussian mixtures revisited (2/2)
Since the latent variables are unknown, we consider the expectation of the
complete-data log likelihood with respect to the posterior distribution
of the latent variables
Posterior distribution
p(Z | X, μ, Σ, π) ∝ Πn Πk [ πk N(xn | μk, Σk) ]^znk
The expected value of the indicator variable znk under this posterior distribution
is the responsibility: E[znk] = γ(znk)
The expected value of the complete-data log likelihood function
EZ[ln p(X, Z | μ, Σ, π)] = Σn Σk γ(znk) { ln πk + ln N(xn | μk, Σk) }   …(*3)
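A small sketch evaluating (*3), the quantity the M step maximizes, given responsibilities gamma of shape (N, K); the function name and the use of SciPy's multivariate_normal are my own choices.

import numpy as np
from scipy.stats import multivariate_normal

def expected_complete_data_ll(X, gamma, pi, mus, Sigmas):
    """E_Z[ln p(X, Z | mu, Sigma, pi)] = sum_n sum_k gamma_nk { ln pi_k + ln N(x_n | mu_k, Sigma_k) }."""
    log_terms = np.stack([np.log(pi_k) + multivariate_normal.logpdf(X, mu_k, Sigma_k)
                          for pi_k, mu_k, Sigma_k in zip(pi, mus, Sigmas)], axis=1)  # (N, K)
    return float((gamma * log_terms).sum())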
9.4 The EM Algorithm in General (1/3)
The EM algorithm is a general technique for finding
maximum likelihood solutions for probabilistic
models having latent variables
9.4 The EM Algorithm in General (2/3)
Illustration of the decomposition
Illustration of the E step
- The q distribution is set equal
to the posterior distribution
for the current parameter
values, causing the lower
bound to move up to the same
value as the log likelihood
function
Illustration of the M step
- The distribution q(Z) is
held fixed and the
lower bound L(q, Θ) is
maximized with respect
to the parameter Θ to
give a revised value
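The decomposition being illustrated, in the chapter's notation (q(Z) is an arbitrary distribution over the latent variables):

\ln p(X \mid \Theta) = \mathcal{L}(q, \Theta) + \mathrm{KL}(q \,\|\, p)
\mathcal{L}(q, \Theta) = \sum_{Z} q(Z) \ln \frac{p(X, Z \mid \Theta)}{q(Z)}
\mathrm{KL}(q \,\|\, p) = -\sum_{Z} q(Z) \ln \frac{p(Z \mid X, \Theta)}{q(Z)} \;\geq\; 0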
9.4 The EM Algorithm in General (3/3)
E step
The lower bound L(q, Θold) is maximized with respect to q(Z)
while holding Θold fixed
The largest value of L(q, Θold) will occur when the Kullback-Leibler divergence vanishes, i.e. when q(Z) = p(Z | X, Θold)
M step
The distribution q(Z) is held fixed and the lower bound L(q, Θ)
is maximized with respect to Θ
This will cause the lower bound L to increase,
which in turn will cause the corresponding
log likelihood function to increase
The quantity that is being maximized in the M step
is the expectation of the complete-data
log likelihood