Gaussian mixture model

Download Report

Transcript Gaussian mixture model

Latent Variables,
Mixture Models
and EM
Christopher M. Bishop
Microsoft Research, Cambridge
BCS Summer School
Exeter, 2003
Overview
•
•
•
•
•
•
•
•
K-means clustering
Gaussian mixtures
Maximum likelihood and EM
Probabilistic graphical models
Latent variables: EM revisited
Bayesian Mixtures of Gaussians
Variational Inference
VIBES
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Old Faithful
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Old Faithful Data Set
Time
between
eruptions
(minutes)
Duration of eruption (minutes)
BCS Summer School, Exeter, 2003
Christopher M. Bishop
K-means Algorithm
• Goal: represent a data set in terms of K clusters each of
which is summarized by a prototype
• Initialize prototypes, then iterate between two phases:
– E-step: assign each data point to nearest prototype
– M-step: update prototypes to be the cluster means
• Simplest version is based on Euclidean distance
– re-scale Old Faithful data
BCS Summer School, Exeter, 2003
Christopher M. Bishop
BCS Summer School, Exeter, 2003
Christopher M. Bishop
BCS Summer School, Exeter, 2003
Christopher M. Bishop
BCS Summer School, Exeter, 2003
Christopher M. Bishop
BCS Summer School, Exeter, 2003
Christopher M. Bishop
BCS Summer School, Exeter, 2003
Christopher M. Bishop
BCS Summer School, Exeter, 2003
Christopher M. Bishop
BCS Summer School, Exeter, 2003
Christopher M. Bishop
BCS Summer School, Exeter, 2003
Christopher M. Bishop
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Responsibilities
• Responsibilities assign data points to clusters
such that
• Example: 5 data points and 3 clusters
BCS Summer School, Exeter, 2003
Christopher M. Bishop
K-means Cost Function
data
responsibilities
BCS Summer School, Exeter, 2003
prototypes
Christopher M. Bishop
Minimizing the Cost Function
• E-step: minimize w.r.t.
– assigns each data point to nearest prototype
• M-step: minimize w.r.t
– gives
– each prototype set to the mean of points in that cluster
• Convergence guaranteed since there is a finite number
of possible settings for the responsibilities
BCS Summer School, Exeter, 2003
Christopher M. Bishop
BCS Summer School, Exeter, 2003
Christopher M. Bishop
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Limitations of K-means
• Hard assignments of data points to clusters – small shift
of a data point can flip it to a different cluster
• Not clear how to choose the value of K
• Solution: replace ‘hard’ clustering of K-means with ‘soft’
probabilistic assignments
• Represents the probability distribution of the data as a
Gaussian mixture model
BCS Summer School, Exeter, 2003
Christopher M. Bishop
The Gaussian Distribution
• Multivariate Gaussian
mean
covariance
• Define precision to be the inverse of the covariance
• In 1-dimension
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Likelihood Function
• Data set
• Assume observed data points generated independently
• Viewed as a function of the parameters, this is known as
the likelihood function
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Maximum Likelihood
• Set the parameters by maximizing the likelihood function
• Equivalently maximize the log likelihood
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Maximum Likelihood Solution
• Maximizing w.r.t. the mean gives the sample mean
• Maximizing w.r.t covariance gives the sample covariance
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Bias of Maximum Likelihood
• Consider the expectations of the maximum likelihood
estimates under the Gaussian distribution
• The maximum likelihood solution systematically underestimates the covariance
• This is an example of over-fitting
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Intuitive Explanation of Over-fitting
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Unbiased Variance Estimate
• Clearly we can remove the bias by using
since this gives
• Arises naturally in a Bayesian treatment (see later)
• For an infinite data set the two expressions are equal
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Gaussian Mixtures
• Linear super-position of Gaussians
• Normalization and positivity require
• Can interpret the mixing coefficients as prior probabilities
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Example: Mixture of 3 Gaussians
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Contours of Probability Distribution
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Surface Plot
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Sampling from the Gaussian
• To generate a data point:
– first pick one of the components with probability
– then draw a sample
from that component
• Repeat these two steps for each new data point
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Synthetic Data Set
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Fitting the Gaussian Mixture
• We wish to invert this process – given the data set, find
the corresponding parameters:
– mixing coefficients
– means
– covariances
• If we knew which component generated each data point,
the maximum likelihood solution would involve fitting
each component to the corresponding cluster
• Problem: the data set is unlabelled
• We shall refer to the labels as latent (= hidden) variables
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Synthetic Data Set Without Labels
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Posterior Probabilities
• We can think of the mixing coefficients as prior
probabilities for the components
• For a given value of we can evaluate the
corresponding posterior probabilities, called
responsibilities
• These are given from Bayes’ theorem by
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Posterior Probabilities (colour coded)
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Posterior Probability Map
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Maximum Likelihood for the GMM
• The log likelihood function takes the form
• Note: sum over components appears inside the log
• There is no closed form solution for maximum likelihood
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Over-fitting in Gaussian Mixture Models
• Singularities in likelihood function when a component
‘collapses’ onto a data point:
then consider
• Likelihood function gets larger as we add more
components (and hence parameters) to the model
– not clear how to choose the number K of components
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Problems and Solutions
• How to maximize the log likelihood
– solved by expectation-maximization (EM) algorithm
• How to avoid singularities in the likelihood function
– solved by a Bayesian treatment
• How to choose number K of components
– also solved by a Bayesian treatment
BCS Summer School, Exeter, 2003
Christopher M. Bishop
EM Algorithm – Informal Derivation
• Let us proceed by simply differentiating the log likelihood
• Setting derivative with respect to
equal to zero gives
giving
which is simply the weighted mean of the data
BCS Summer School, Exeter, 2003
Christopher M. Bishop
EM Algorithm – Informal Derivation
• Similarly for the covariances
• For mixing coefficients use a Lagrange multiplier to give
BCS Summer School, Exeter, 2003
Christopher M. Bishop
EM Algorithm – Informal Derivation
•
•
The solutions are not closed form since they are coupled
Suggests an iterative scheme for solving them:
– Make initial guesses for the parameters
– Alternate between the following two stages:
1. E-step: evaluate responsibilities
2. M-step: update parameters using ML results
BCS Summer School, Exeter, 2003
Christopher M. Bishop
BCS Summer School, Exeter, 2003
Christopher M. Bishop
BCS Summer School, Exeter, 2003
Christopher M. Bishop
BCS Summer School, Exeter, 2003
Christopher M. Bishop
BCS Summer School, Exeter, 2003
Christopher M. Bishop
BCS Summer School, Exeter, 2003
Christopher M. Bishop
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Digression: Probabilistic Graphical Models
•
•
•
•
Graphical representation of a probabilistic model
Each variable corresponds to a node in the graph
Links in the graph denote relations between variables
Motivation:
– visualization of models and motivation for new models
– graphical determination of conditional independence
– complex calculations (inference) performed using
graphical operations (e.g. forward-backward for HMM)
• Here we consider directed graphs
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Example: 3 Variables
• General distribution over 3 variables
• Apply product rule of probability twice
• Express as a directed graph
BCS Summer School, Exeter, 2003
Christopher M. Bishop
General Decomposition Formula
• Joint distribution is product of conditionals, conditioned
on parent nodes
• Example:
BCS Summer School, Exeter, 2003
Christopher M. Bishop
EM – Latent Variable Viewpoint
• Binary latent variables
describing which
component generated each data point
• Conditional distribution of observed variable
• Prior distribution of latent variables
• Marginalizing over the latent variables we obtain
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Expected Value of Latent Variable
• From Bayes’ theorem
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Graphical Representation of GMM
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Complete and Incomplete Data
complete
BCS Summer School, Exeter, 2003
incomplete
Christopher M. Bishop
Graph for Complete-Data Model
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Latent Variable View of EM
• If we knew the values for the latent variables, we would
maximize the complete-data log likelihood
which gives a trivial closed-form solution (fit each
component to the corresponding set of data points)
• We don’t know the values of the latent variables
• However, for given parameter values we can compute
the expected values of the latent variables
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Expected Complete-Data Log Likelihood
• Suppose we make a guess
for the parameter values
(means, covariances and mixing coefficients)
• Use these to evaluate the responsibilities
• Consider expected complete-data log likelihood
where responsibilities are computed using
• We are implicitly ‘filling in’ latent variables with best guess
• Keeping the responsibilities fixed and maximizing with
respect to the parameters give the previous results
BCS Summer School, Exeter, 2003
Christopher M. Bishop
K-means Revisited
• Consider GMM with common covariances
• Take limit
• Responsibilities become binary
• Expected complete-data log likelihood becomes
BCS Summer School, Exeter, 2003
Christopher M. Bishop
EM in General
• Consider arbitrary distribution
over the latent variables
• The following decomposition always holds
where
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Decomposition
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Optimizing the Bound
• E-step: maximize with respect to
– equivalent to minimizing KL divergence
– sets
equal to the posterior distribution
• M-step: maximize bound with respect to
– equivalent to maximizing expected complete-data log
likelihood
• Each EM cycle must increase incomplete-data likelihood
unless already at a (local) maximum
BCS Summer School, Exeter, 2003
Christopher M. Bishop
E-step
BCS Summer School, Exeter, 2003
Christopher M. Bishop
M-step
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Bayesian Inference
• Include prior distributions over parameters
• Advantages in using conjugate priors
• Example: consider a single Gaussian over one variable
– assume variance is known and mean is unknown
– likelihood function for the mean
• Choose Gaussian prior for mean
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Bayesian Inference for a Gaussian
• Posterior (proportional to product of prior and likelihood)
will then also be Gaussian
where
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Bayesian Inference for a Gaussian
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Bayesian Mixture of Gaussians
• Conjugate priors for the parameters:
– Dirichlet prior for mixing coefficients
– Normal-Wishart prior for means and precisions
where the Wishart distribution is given by
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Graphical Representation
• Parameters and latent variables appear on equal footing
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Variational Inference
• As with many Bayesian models, exact inference for the
mixture of Gaussians is intractable
• Approximate Bayesian inference traditionally based on
Laplace’s method (local Gaussian approximation to the
posterior) or Markov chain Monte Carlo
• Variational Inference is an alternative, broadly applicable
deterministic approximation scheme
BCS Summer School, Exeter, 2003
Christopher M. Bishop
General View of Variational Inference
• Consider again the previous decomposition, but where
the posterior is over all latent variables and parameters
where
• Maximizing over
would give the true posterior
distribution – but this is intractable by definition
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Factorized Approximation
• Goal: choose a family of distributions which are:
– sufficiently flexible to give good posterior approximation
– sufficiently simple to remain tractable
• Here we consider factorized distributions
• No further assumptions are required!
• Optimal solution for one factor, keeping the remained fixed
• Coupled solutions so initialize then cyclically update
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Lower Bound
• Can also be evaluated
• Useful for maths/code verification
• Also useful for model comparison:
BCS Summer School, Exeter, 2003
Christopher M. Bishop
BCS Summer School, Exeter, 2003
Christopher M. Bishop
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Illustration: Univariate Gaussian
• Likelihood function
• Conjugate priors
• Factorized variational distribution
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Variational Posterior Distribution
where
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Initial Configuration
BCS Summer School, Exeter, 2003
Christopher M. Bishop
After Updating
BCS Summer School, Exeter, 2003
Christopher M. Bishop
After Updating
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Converged Solution
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Exact Solution
• For this very simple example there is an exact solution
• Expected precision given by
• Compare with earlier maximum likelihood solution
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Variational Mixture of Gaussians
• Assume factorized posterior distribution
• Gives optimal solution in the form
where
Wishart
is a Dirichlet, and
BCS Summer School, Exeter, 2003
is a Normal-
Christopher M. Bishop
Sufficient Statistics
• Small computational overhead compared to maximum
likelihood EM
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Variational Equations for GMM
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Bound vs. K for Old Faithful Data
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Bayesian Model Complexity
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Sparse Bayes for Gaussian Mixture
• Instead of comparing different values of K, start with a
large value and prune out excess components
• Treat mixing coefficients as parameters, and maximize
marginal likelihood [Corduneanu + Bishop, AIStats ’01]
• Gives simple re-estimation equations for the mixing
coefficients – interleave with variational updates
BCS Summer School, Exeter, 2003
Christopher M. Bishop
BCS Summer School, Exeter, 2003
Christopher M. Bishop
BCS Summer School, Exeter, 2003
Christopher M. Bishop
General Variational Framework
• Currently for each new model:
– derive the variational update equations
– write application-specific code to find the solution
• Both stages are time consuming and error prone
• Can we build a general-purpose inference engine which
automates these procedures?
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Lower Bound for GMM
BCS Summer School, Exeter, 2003
Christopher M. Bishop
VIBES
•
•
•
•
Variational Inference for Bayesian Networks
Bishop and Winn (1999)
A general inference engine using variational methods
Models specified graphically
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Example: Mixtures of Bayesian PCA
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Solution
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Local Computation in VIBES
• A key observation is that in the general solution
the update for a particular node (or group of nodes)
depends only on other nodes in the Markov blanket
• Permits a local object-oriented implementation
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Shared Hyper-parameters
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Take-home Messages
• Bayesian mixture of Gaussians
– no singularities
– determines optimal number of components
• Variational inference
– effective solution for Bayesian GMM
– optimizes rigorous bound
– little computational overhead compared to EM
• VIBES
– rapid prototyping of probabilistic models
– graphical specification
BCS Summer School, Exeter, 2003
Christopher M. Bishop
Viewgraphs, tutorials and
publications available from:
http://research.microsoft.com/~cmbishop