Transcript CS59000-ML

CS 59000 Statistical Machine Learning
Lecture 25
Yuan (Alan) Qi
Purdue CS
Nov. 25, 2008
Outline
• Review of hidden Markov models: the forward-backward
algorithm, EM for learning HMM parameters, the Viterbi
algorithm, and Kalman filtering and smoothing
• Rejection sampling, importance sampling, the
Metropolis-Hastings algorithm, and Gibbs sampling
Hidden Markov Models
Many applications, e.g., speech recognition,
natural language processing, handwriting
recognition, bio-sequence analysis
From Mixture Models to HMMs
By turning a mixture model into a dynamical model,
we obtain the HMM.
We model the dependence between two consecutive
latent variables by a transition probability:
$p(\mathbf{z}_n \mid \mathbf{z}_{n-1}, \mathbf{A}) = \prod_{k=1}^{K} \prod_{j=1}^{K} A_{jk}^{z_{n-1,j}\, z_{nk}}$
HMMs
Prior on initial latent variable:
$p(\mathbf{z}_1 \mid \boldsymbol{\pi}) = \prod_{k=1}^{K} \pi_k^{z_{1k}}$
Emission probabilities:
$p(\mathbf{x}_n \mid \mathbf{z}_n, \boldsymbol{\phi}) = \prod_{k=1}^{K} p(\mathbf{x}_n \mid \boldsymbol{\phi}_k)^{z_{nk}}$
Joint distribution:
$p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) = p(\mathbf{z}_1 \mid \boldsymbol{\pi}) \left[\prod_{n=2}^{N} p(\mathbf{z}_n \mid \mathbf{z}_{n-1}, \mathbf{A})\right] \prod_{m=1}^{N} p(\mathbf{x}_m \mid \mathbf{z}_m, \boldsymbol{\phi})$
Samples from HMM
(a) Contours of constant probability density for the emission
distributions corresponding to each of the three states of the latent
variable. (b) A sample of 50 points drawn from the hidden Markov
model, with lines connecting the successive observations.
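To make the generative process concrete, here is a minimal sketch of ancestral sampling from a three-state HMM with Gaussian emissions; the parameter values (pi, A, means) are illustrative assumptions, not the ones behind the figure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative HMM parameters (assumed for this sketch, not from the lecture):
pi = np.array([0.5, 0.3, 0.2])             # initial state distribution p(z_1)
A = np.array([[0.90, 0.05, 0.05],          # A[j, k] = p(z_n = k | z_{n-1} = j)
              [0.05, 0.90, 0.05],
              [0.05, 0.05, 0.90]])
means = np.array([[0.0, 0.0],              # Gaussian emission means, one per state
                  [3.0, 3.0],
                  [0.0, 3.0]])

def sample_hmm(N):
    """Ancestral sampling: z_1 ~ pi, z_n ~ A[z_{n-1}, :], x_n ~ N(means[z_n], I)."""
    states, obs = [], []
    z = rng.choice(3, p=pi)
    for _ in range(N):
        states.append(z)
        obs.append(rng.multivariate_normal(means[z], np.eye(2)))
        z = rng.choice(3, p=A[z])           # transition to the next latent state
    return np.array(states), np.array(obs)

states, obs = sample_hmm(50)                # 50 points, as in the figure
```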
Inference: Forward-backward Algorithm
Goal: compute marginals for latent variables.
Forward-backward algorithm: exact inference as a
special case of the sum-product algorithm on the HMM.
Factor graph representation (grouping the emission
density and transition probability into one factor per
time step):
$h(\mathbf{z}_1) = p(\mathbf{z}_1)\, p(\mathbf{x}_1 \mid \mathbf{z}_1), \qquad f_n(\mathbf{z}_{n-1}, \mathbf{z}_n) = p(\mathbf{z}_n \mid \mathbf{z}_{n-1})\, p(\mathbf{x}_n \mid \mathbf{z}_n)$
Forward-backward Algorithm as Message Passing Method (1)
Forward messages:
$\alpha(\mathbf{z}_n) \equiv p(\mathbf{x}_1, \ldots, \mathbf{x}_n, \mathbf{z}_n) = p(\mathbf{x}_n \mid \mathbf{z}_n) \sum_{\mathbf{z}_{n-1}} \alpha(\mathbf{z}_{n-1})\, p(\mathbf{z}_n \mid \mathbf{z}_{n-1})$
Forward-backward Algorithm as Message Passing Method (2)
Backward messages (Q: how do we compute them?):
$\beta(\mathbf{z}_n) \equiv p(\mathbf{x}_{n+1}, \ldots, \mathbf{x}_N \mid \mathbf{z}_n) = \sum_{\mathbf{z}_{n+1}} \beta(\mathbf{z}_{n+1})\, p(\mathbf{x}_{n+1} \mid \mathbf{z}_{n+1})\, p(\mathbf{z}_{n+1} \mid \mathbf{z}_n)$
Note that the messages actually involve the observations $\mathbf{X}$:
combining them gives the posterior marginal
$\gamma(\mathbf{z}_n) = p(\mathbf{z}_n \mid \mathbf{X}) = \frac{\alpha(\mathbf{z}_n)\, \beta(\mathbf{z}_n)}{p(\mathbf{X})}$
Similarly, we can compute the pairwise posterior marginals (Q: why?)
$\xi(\mathbf{z}_{n-1}, \mathbf{z}_n) = p(\mathbf{z}_{n-1}, \mathbf{z}_n \mid \mathbf{X}) = \frac{\alpha(\mathbf{z}_{n-1})\, p(\mathbf{x}_n \mid \mathbf{z}_n)\, p(\mathbf{z}_n \mid \mathbf{z}_{n-1})\, \beta(\mathbf{z}_n)}{p(\mathbf{X})}$
Rescaling to Avoid Overflowing
When a sequence is long, the forward message will
become too small to be represented within the dynamic
range of the computer. We redefine the forward message
$\alpha(\mathbf{z}_n) = p(\mathbf{x}_1, \ldots, \mathbf{x}_n, \mathbf{z}_n)$
as
$\hat{\alpha}(\mathbf{z}_n) = p(\mathbf{z}_n \mid \mathbf{x}_1, \ldots, \mathbf{x}_n) = \frac{\alpha(\mathbf{z}_n)}{\prod_{m=1}^{n} c_m}$
where $c_n = p(\mathbf{x}_n \mid \mathbf{x}_1, \ldots, \mathbf{x}_{n-1})$ are scaling factors.
Similarly, we redefine the backward message $\beta(\mathbf{z}_n)$ as
$\hat{\beta}(\mathbf{z}_n) = \frac{\beta(\mathbf{z}_n)}{\prod_{m=n+1}^{N} c_m}$
Then we can compute
$\gamma(\mathbf{z}_n) = \hat{\alpha}(\mathbf{z}_n)\, \hat{\beta}(\mathbf{z}_n), \qquad p(\mathbf{X}) = \prod_{n=1}^{N} c_n$
See the detailed derivation in the textbook.
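As a sketch of these recursions, here is a scaled forward-backward implementation for a discrete HMM; the names pi, A, and B (with B[n, k] = p(x_n | z_n = k)) are my own parameterization, not the lecture's notation.

```python
import numpy as np

def forward_backward(pi, A, B):
    """Scaled forward-backward for a discrete HMM.
    pi: (K,) initial distribution; A: (K, K) with A[j, k] = p(z_n=k | z_{n-1}=j);
    B: (N, K) emission likelihoods B[n, k] = p(x_n | z_n=k).
    Returns gamma (posterior marginals) and c (scaling factors; log p(X) = sum(log c))."""
    N, K = B.shape
    alpha = np.zeros((N, K))
    c = np.zeros(N)
    # Forward pass: alpha_hat(z_n) = p(z_n | x_1..x_n), c_n = p(x_n | x_1..x_{n-1})
    alpha[0] = pi * B[0]
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for n in range(1, N):
        alpha[n] = B[n] * (alpha[n - 1] @ A)
        c[n] = alpha[n].sum()
        alpha[n] /= c[n]
    # Backward pass: beta_hat, rescaled by the same factors c_n
    beta = np.ones((N, K))
    for n in range(N - 2, -1, -1):
        beta[n] = (A @ (B[n + 1] * beta[n + 1])) / c[n + 1]
    gamma = alpha * beta   # gamma(z_n) = alpha_hat(z_n) * beta_hat(z_n)
    return gamma, c
```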
Viterbi Algorithm
Viterbi algorithm:
• Finds the most probable sequence of states.
• Special case of the max-sum algorithm on the HMM.
What if we want to find the most probable individual
states? (Then we use the marginals $\gamma(\mathbf{z}_n)$ from
forward-backward; note that the jointly most probable
sequence and the sequence of individually most
probable states generally differ.)
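A minimal Viterbi sketch in log space, using the same assumed parameterization (pi, A, B) as the forward-backward sketch above.

```python
import numpy as np

def viterbi(pi, A, B):
    """Most probable state sequence of a discrete HMM, computed in log space."""
    N, K = B.shape
    log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)
    omega = np.zeros((N, K))          # omega[n, k]: best log prob of paths ending in k
    back = np.zeros((N, K), dtype=int)
    omega[0] = log_pi + log_B[0]
    for n in range(1, N):
        scores = omega[n - 1][:, None] + log_A   # scores[j, k]: from state j to k
        back[n] = scores.argmax(axis=0)          # remember the best predecessor
        omega[n] = scores.max(axis=0) + log_B[n]
    path = np.zeros(N, dtype=int)                # backtrack from the best final state
    path[-1] = omega[-1].argmax()
    for n in range(N - 2, -1, -1):
        path[n] = back[n + 1, path[n + 1]]
    return path
```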
Maximum Likelihood Estimation for HMM
Goal: maximize the likelihood function
$p(\mathbf{X} \mid \boldsymbol{\theta}) = \sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})$
We go from EM for mixtures of Gaussians to EM for
HMMs.
EM for HMM
E step: compute the posterior statistics
$\gamma(z_{nk}) = \mathbb{E}[z_{nk}], \qquad \xi(z_{n-1,j}, z_{nk}) = \mathbb{E}[z_{n-1,j}\, z_{nk}]$
from the forward-backward/sum-product algorithm.
M step: re-estimate the parameters, e.g.,
$\pi_k = \frac{\gamma(z_{1k})}{\sum_{j=1}^{K} \gamma(z_{1j})}, \qquad A_{jk} = \frac{\sum_{n=2}^{N} \xi(z_{n-1,j}, z_{nk})}{\sum_{l=1}^{K} \sum_{n=2}^{N} \xi(z_{n-1,j}, z_{nl})}$
with the emission parameters $\boldsymbol{\phi}_k$ updated by
$\gamma$-weighted maximum likelihood, as for a mixture of Gaussians.
Linear Dynamical Systems
An HMM with continuous latent variables and linear-Gaussian dynamics:
$p(\mathbf{z}_n \mid \mathbf{z}_{n-1}) = \mathcal{N}(\mathbf{z}_n \mid \mathbf{A}\mathbf{z}_{n-1}, \boldsymbol{\Gamma}), \qquad p(\mathbf{x}_n \mid \mathbf{z}_n) = \mathcal{N}(\mathbf{x}_n \mid \mathbf{C}\mathbf{z}_n, \boldsymbol{\Sigma})$
Equivalently, we have
$\mathbf{z}_n = \mathbf{A}\mathbf{z}_{n-1} + \mathbf{w}_n, \qquad \mathbf{x}_n = \mathbf{C}\mathbf{z}_n + \mathbf{v}_n$
where $\mathbf{w}_n \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Gamma})$ and $\mathbf{v}_n \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma})$ are independent Gaussian noise terms.
Kalman Filtering and Smoothing
Inference in linear-Gaussian systems.
Kalman filtering: sequentially update the scaled
forward message
$\hat{\alpha}(\mathbf{z}_n) = p(\mathbf{z}_n \mid \mathbf{x}_1, \ldots, \mathbf{x}_n) = \mathcal{N}(\mathbf{z}_n \mid \boldsymbol{\mu}_n, \mathbf{V}_n)$
Kalman smoothing: sequentially update the state
beliefs based on the scaled forward and backward
messages:
$p(\mathbf{z}_n \mid \mathbf{X}) = \hat{\alpha}(\mathbf{z}_n)\, \hat{\beta}(\mathbf{z}_n)$
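A minimal sketch of a single Kalman filtering step in the notation above; A, C, Gamma, and Sigma name the dynamics, emission, process-noise, and observation-noise matrices, and this is the standard predict/update recursion rather than code from the lecture.

```python
import numpy as np

def kalman_step(mu, V, x, A, C, Gamma, Sigma):
    """One filtering step: N(mu, V) = p(z_{n-1} | x_1..x_{n-1}) -> p(z_n | x_1..x_n)."""
    # Predict: propagate the belief through the linear dynamics
    mu_pred = A @ mu
    P = A @ V @ A.T + Gamma
    # Update: correct the prediction with the new observation x
    S = C @ P @ C.T + Sigma             # innovation covariance
    K = P @ C.T @ np.linalg.inv(S)      # Kalman gain
    mu_new = mu_pred + K @ (x - C @ mu_pred)
    V_new = P - K @ C @ P
    return mu_new, V_new
```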
Sampling Methods
Goal: compute the expectation
$\mathbb{E}[f] = \int f(\mathbf{z})\, p(\mathbf{z})\, d\mathbf{z}$
Challenge: we cannot compute the above integral
analytically.
Sampling methods: draw random samples
$\mathbf{z}^{(l)} \sim p(\mathbf{z})$, $l = 1, \ldots, L$, such that
$\hat{f} = \frac{1}{L} \sum_{l=1}^{L} f(\mathbf{z}^{(l)}) \approx \mathbb{E}[f]$
Rejection Sampling (1)
Goal: draw samples from a complicated distribution
$p(\mathbf{z}) = \tilde{p}(\mathbf{z})/Z_p$, where $\tilde{p}(\mathbf{z})$ is easy to evaluate
but the normalizer $Z_p$ is unknown. Choose a proposal $q(\mathbf{z})$
and a constant $k$ such that $k\, q(\mathbf{z}) \ge \tilde{p}(\mathbf{z})$ everywhere;
draw $z_0 \sim q$ and $u_0 \sim \mathrm{Uniform}[0, k\, q(z_0)]$, and
accept $z_0$ if $u_0 \le \tilde{p}(z_0)$.
Example
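As a concrete illustration, a minimal rejection-sampling sketch; the unnormalized two-bump target, Gaussian proposal, and envelope constant k are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(z):
    """Unnormalized target: a two-bump density (illustrative choice)."""
    return np.exp(-0.5 * (z - 2) ** 2) + 0.6 * np.exp(-0.5 * (z + 2) ** 2)

def q_pdf(z, scale=3.0):
    """Gaussian proposal density N(0, scale^2)."""
    return np.exp(-0.5 * (z / scale) ** 2) / (scale * np.sqrt(2 * np.pi))

k = 14.0  # chosen (generously) so that k * q(z) >= p_tilde(z) for all z

def rejection_sample(n):
    samples = []
    while len(samples) < n:
        z0 = rng.normal(0, 3.0)               # sample from the proposal q
        u0 = rng.uniform(0, k * q_pdf(z0))    # uniform under the envelope k*q
        if u0 <= p_tilde(z0):                 # accept if it falls under the target
            samples.append(z0)
    return np.array(samples)

samples = rejection_sample(1000)
```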
Limitations of Rejection Sampling
When the proposal distribution is very different from
the unnormalized target distribution, many samples
will be wasted.
It is difficult to find a tight envelope in
high-dimensional problems.
Importance Sampling (1)
Draw samples from a proposal $q(\mathbf{z})$ and reweight them:
$\mathbb{E}[f] = \int f(\mathbf{z})\, \frac{p(\mathbf{z})}{q(\mathbf{z})}\, q(\mathbf{z})\, d\mathbf{z} \approx \frac{1}{L} \sum_{l=1}^{L} r_l\, f(\mathbf{z}^{(l)})$
Importance weights:
$r_l = \frac{p(\mathbf{z}^{(l)})}{q(\mathbf{z}^{(l)})}$
Discussion: what would be an ideal proposal
distribution $q(\mathbf{z})$?
Importance Sampling (2)
When the normalization constant $Z_p$ in
$p(\mathbf{z}) = \tilde{p}(\mathbf{z})/Z_p$ is unknown, we have
$\mathbb{E}[f] \approx \sum_{l=1}^{L} w_l\, f(\mathbf{z}^{(l)})$
where the self-normalized weights are
$w_l = \frac{\tilde{r}_l}{\sum_m \tilde{r}_m}, \qquad \tilde{r}_l = \frac{\tilde{p}(\mathbf{z}^{(l)})}{q(\mathbf{z}^{(l)})}$
Importance Sampling (3)
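A minimal sketch of self-normalized importance sampling with the weights just defined, reusing the same illustrative target and proposal as the rejection-sampling sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(z):
    """Unnormalized target (same illustrative two-bump density as before)."""
    return np.exp(-0.5 * (z - 2) ** 2) + 0.6 * np.exp(-0.5 * (z + 2) ** 2)

def q_pdf(z, scale=3.0):
    return np.exp(-0.5 * (z / scale) ** 2) / (scale * np.sqrt(2 * np.pi))

L = 10000
z = rng.normal(0, 3.0, size=L)     # z^(l) ~ q
r = p_tilde(z) / q_pdf(z)          # unnormalized weights r_l = p_tilde / q
w = r / r.sum()                    # self-normalized weights w_l
f_hat = np.sum(w * z)              # estimate of E[f] for f(z) = z (the target mean)
```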
Sampling and EM
M step in EM maximizes the expected complete-data
log likelihood
$Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\mathrm{old}}) = \int p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\mathrm{old}})\, \ln p(\mathbf{Z}, \mathbf{X} \mid \boldsymbol{\theta})\, d\mathbf{Z}$
What if we cannot even evaluate this integral? One
idea: approximate it with samples $\mathbf{Z}^{(l)} \sim p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\mathrm{old}})$:
$Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\mathrm{old}}) \approx \frac{1}{L} \sum_{l=1}^{L} \ln p(\mathbf{Z}^{(l)}, \mathbf{X} \mid \boldsymbol{\theta})$
This is known as the Monte Carlo EM algorithm.
Imputation-Posterior (IP) Algorithm: Adapting EM to Bayesian Estimation
Markov Chain Monte Carlo
Goal: use Markov chains to draw samples from
a given distribution.
Idea: set up a Markov chain that converges to
the target distribution and draw samples
from the chain.
Invariant Distribution and Detailed Balance
• First-order Markov chain:
$p(\mathbf{z}^{(1)}, \ldots, \mathbf{z}^{(M)}) = p(\mathbf{z}^{(1)}) \prod_{m=2}^{M} p(\mathbf{z}^{(m)} \mid \mathbf{z}^{(m-1)})$
• Transition probabilities:
$T(\mathbf{z}^{(m)}, \mathbf{z}^{(m+1)}) \equiv p(\mathbf{z}^{(m+1)} \mid \mathbf{z}^{(m)})$
• Invariant distribution: a distribution $p^\star(\mathbf{z})$ is said to be
invariant, or stationary, with respect to a Markov chain if
each step in the chain leaves that distribution invariant:
$p^\star(\mathbf{z}) = \sum_{\mathbf{z}'} T(\mathbf{z}', \mathbf{z})\, p^\star(\mathbf{z}')$
• Detailed balance: a sufficient condition for an invariant
distribution:
$p^\star(\mathbf{z})\, T(\mathbf{z}, \mathbf{z}') = p^\star(\mathbf{z}')\, T(\mathbf{z}', \mathbf{z})$
Ergodicity and Equilibrium Distribution
For m→∞, the distribution p(z(m)) converges to the
required invariant distribution, i.e., the target
distribution, irrespective of the choice of initial
distribution (which may be different from the
target distribution).
This property is called ergodicity and the invariant
distribution is called the equilibrium distribution.
Under weak restrictions on invariant and transition
distributions, a homogeneous Markov chain will
be ergodic.
Metropolis-Hastings Algorithm
An iterative algorithm: at step $\tau$, with current state $\mathbf{z}^{(\tau)}$,
we draw a candidate sample $\mathbf{z}^\star$ from a proposal
distribution $q_k(\mathbf{z} \mid \mathbf{z}^{(\tau)})$ and accept it with probability
$A_k(\mathbf{z}^\star, \mathbf{z}^{(\tau)}) = \min\left(1,\; \frac{\tilde{p}(\mathbf{z}^\star)\, q_k(\mathbf{z}^{(\tau)} \mid \mathbf{z}^\star)}{\tilde{p}(\mathbf{z}^{(\tau)})\, q_k(\mathbf{z}^\star \mid \mathbf{z}^{(\tau)})}\right)$
so that the detailed balance requirement is satisfied.
(If the candidate is rejected, we set $\mathbf{z}^{(\tau+1)} = \mathbf{z}^{(\tau)}$.)
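A minimal Metropolis-Hastings sketch with a symmetric Gaussian random-walk proposal, in which case the q terms cancel and the method reduces to the Metropolis algorithm (see the next slide); the target and step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(z):
    """Unnormalized target density (illustrative two-bump example)."""
    return np.exp(-0.5 * (z - 2) ** 2) + 0.6 * np.exp(-0.5 * (z + 2) ** 2)

def metropolis(n_steps, step=1.0, z0=0.0):
    """Random-walk Metropolis: symmetric proposal, so q cancels in the ratio."""
    z = z0
    chain = np.empty(n_steps)
    for t in range(n_steps):
        z_star = z + rng.normal(0, step)             # propose from N(z, step^2)
        a = min(1.0, p_tilde(z_star) / p_tilde(z))   # acceptance probability
        if rng.uniform() < a:
            z = z_star                               # accept the candidate
        chain[t] = z                                 # on rejection, keep current state
    return chain

chain = metropolis(10000)
```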
Choice of Proposal Distributions
• Local Gaussian proposals centered on the current
state (if the proposal is a symmetric Gaussian, the M-H
algorithm reduces to the Metropolis algorithm):
• If the variance is too small -> high acceptance rate but
slow random-walk exploration.
• If the variance is too large -> low acceptance rate.
• Global Gaussian distribution: find Laplace
approximation (to the posterior) & sample from it.
• Mixture of Gaussians: fit a mixture of Gaussians
and then sample from it.
Example
Gibbs Sampler (1)
Cycle through the variables, sampling each in turn from
its conditional distribution given the current values of
all the others:
$z_k^{(\tau+1)} \sim p(z_k \mid \mathbf{z}_{\setminus k}), \qquad k = 1, \ldots, M$
Gibbs Sampler (2)
A special case of the Metropolis-Hastings sampler.
Given the proposal distribution
$q_k(\mathbf{z}^\star \mid \mathbf{z}) = p(z_k^\star \mid \mathbf{z}_{\setminus k})$ with $\mathbf{z}^\star_{\setminus k} = \mathbf{z}_{\setminus k}$,
the acceptance probability is
$A(\mathbf{z}^\star, \mathbf{z}) = \frac{p(\mathbf{z}^\star)\, q_k(\mathbf{z} \mid \mathbf{z}^\star)}{p(\mathbf{z})\, q_k(\mathbf{z}^\star \mid \mathbf{z})} = \frac{p(z_k^\star \mid \mathbf{z}_{\setminus k})\, p(\mathbf{z}_{\setminus k})\, p(z_k \mid \mathbf{z}_{\setminus k})}{p(z_k \mid \mathbf{z}_{\setminus k})\, p(\mathbf{z}_{\setminus k})\, p(z_k^\star \mid \mathbf{z}_{\setminus k})} = 1$
So every sample from the proposal distribution
in a Gibbs sampler will be accepted.
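To close, a minimal Gibbs-sampling sketch for a bivariate Gaussian target, where each full conditional is itself a univariate Gaussian; the correlation rho is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_bivariate_gaussian(n_steps, rho=0.8):
    """Gibbs sampling from N(0, [[1, rho], [rho, 1]]).
    Full conditionals: z1 | z2 ~ N(rho * z2, 1 - rho^2), and symmetrically for z2."""
    z1, z2 = 0.0, 0.0
    sd = np.sqrt(1 - rho ** 2)
    chain = np.empty((n_steps, 2))
    for t in range(n_steps):
        z1 = rng.normal(rho * z2, sd)   # sample z1 from p(z1 | z2)
        z2 = rng.normal(rho * z1, sd)   # sample z2 from p(z2 | z1)
        chain[t] = (z1, z2)
    return chain

chain = gibbs_bivariate_gaussian(5000)
```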