Mixtures of Gaussians and
the EM Algorithm
CSE 6363 – Machine Learning
Vassilis Athitsos
Computer Science and Engineering Department
University of Texas at Arlington
Gaussians
• A popular way to estimate probability density functions is to model them as Gaussians.
• Review: a 1D normal distribution is defined as:

  $N(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

• To define a Gaussian, we need to specify just two parameters:
  – μ, which is the mean (average) of the distribution.
  – σ, which is the standard deviation of the distribution.
  – Note: σ² is called the variance of the distribution.
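As a small illustration, here is a minimal Python sketch that evaluates this density. The function name gaussian_pdf and the use of NumPy are my own choices, not part of the slides:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Evaluate the 1D Gaussian density N(x) with mean mu and standard deviation sigma."""
    coeff = 1.0 / (sigma * np.sqrt(2.0 * np.pi))
    return coeff * np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# Example: the standard normal density at x = 0 is about 0.3989.
print(gaussian_pdf(0.0, mu=0.0, sigma=1.0))
```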
Estimating a Gaussian
• In one dimension, a Gaussian is defined like this:

  $N(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

• Given a set of n real numbers x1, …, xn, we can easily find the best-fitting Gaussian for that data.
• The mean μ is simply the average of those numbers:

  $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$

• The standard deviation σ is computed as:

  $\sigma = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu)^2}$
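Here is a minimal Python sketch of these two estimation formulas. The function name fit_gaussian is illustrative, not from the slides:

```python
import numpy as np

def fit_gaussian(data):
    """Fit a 1D Gaussian to data: return (mu, sigma) using the slide's formulas."""
    x = np.asarray(data, dtype=float)
    n = len(x)
    mu = x.sum() / n                                   # mean: (1/n) * sum of x_i
    sigma = np.sqrt(((x - mu) ** 2).sum() / (n - 1))   # std with (n - 1) denominator
    return mu, sigma

mu, sigma = fit_gaussian([1.2, 2.5, 2.9, 7.1, 7.8])
print(mu, sigma)
```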
Estimating a Gaussian
• Fitting a Gaussian to data does not guarantee that the resulting Gaussian will be an accurate distribution for the data.
• The data may have a distribution that is very different from a Gaussian.
Example of Fitting a Gaussian
[Figure] The blue curve is a density function F such that F(x) = 0.25 for 1 ≤ x ≤ 3, and F(x) = 0.5 for 7 ≤ x ≤ 8. The red curve is the Gaussian fit G to data generated using F.
Naïve Bayes with 1D Gaussians
• Suppose the patterns come from a d-dimensional space:
  – Examples: pixels to be classified as skin or non-skin, or the statlog dataset.
• Notation: xi = (xi,1, xi,2, …, xi,d)
• For each dimension j, we can use a Gaussian to model the distribution pj(xi,j | Ck) of the data in that dimension, given their class.
• For example, for the statlog dataset, we would get 216 Gaussians:
  – 36 dimensions * 6 classes.
• Then, we can use the naïve Bayes approach (i.e., assume pairwise independence of all dimensions) to define P(x | Ck) as:

  $P(x_i \mid C_k) = \prod_{j=1}^{d} p_j(x_{i,j} \mid C_k)$
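Below is a rough Python sketch of this per-dimension Gaussian model under the naïve Bayes independence assumption. The class name GaussianNB1D and its methods are my own illustration, not from the slides:

```python
import numpy as np

class GaussianNB1D:
    """Naive Bayes with one 1D Gaussian per (class, dimension) pair."""

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        # One mean and one std per class and per dimension.
        self.mu_ = {c: X[y == c].mean(axis=0) for c in self.classes_}
        self.sigma_ = {c: X[y == c].std(axis=0, ddof=1) for c in self.classes_}
        return self

    def class_likelihood(self, x, c):
        """P(x | C_k) = product over dimensions j of p_j(x_j | C_k)."""
        mu, sigma = self.mu_[c], self.sigma_[c]
        densities = np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
        return densities.prod()
```

In practice, one would typically sum log-densities rather than multiplying densities, to avoid numerical underflow when d is large.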
Mixtures of Gaussians
• This figure shows our previous example, where we fitted a Gaussian to some data, and the fit was poor.
• Overall, Gaussians have attractive properties:
  – They require learning only two numbers (μ and σ), and thus require few training data to estimate those numbers.
• However, for some data, Gaussians are just not good fits.
Mixtures of Gaussians
• Mixtures of Gaussians are oftentimes a better solution.
  – They are defined on the next slide.
• They still require relatively few parameters to estimate, and thus can be learned from relatively small amounts of data.
• They can fit actual data distributions quite well.
Mixtures of Gaussians
• Suppose we have k Gaussian distributions Ni.
• Each Ni has its own mean μi and standard deviation σi.
• Using these k Gaussians, we can define a Gaussian mixture M as follows:

  $M(x) = \sum_{i=1}^{k} w_i\, N_i(x)$

• Each wi is a weight, specifying the relative importance of Gaussian Ni in the mixture.
  – Weights wi are real numbers between 0 and 1.
  – Weights wi must sum up to 1, so that the integral of M is 1.
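A small Python sketch of evaluating such a mixture at a point x. The function name mixture_pdf and the parameter layout are assumptions made for illustration:

```python
import numpy as np

def mixture_pdf(x, weights, mus, sigmas):
    """Evaluate M(x) = sum_i w_i * N_i(x) for a mixture of 1D Gaussians."""
    total = 0.0
    for w, mu, sigma in zip(weights, mus, sigmas):
        n_i = np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
        total += w * n_i
    return total

# Example: a two-component mixture with weights 0.9 and 0.1.
print(mixture_pdf(6.5, weights=[0.9, 0.1], mus=[3.0, 7.0], sigmas=[1.0, 0.5]))
```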
Mixtures of Gaussians – Example
[Figure] The blue and green curves show two Gaussians. The red curve shows a mixture of those Gaussians, with w1 = 0.9 and w2 = 0.1. The mixture looks a lot like N1, but is influenced a little by N2 as well.
Mixtures of Gaussians – Example
[Figure] The blue and green curves show two Gaussians. The red curve shows a mixture of those Gaussians, with w1 = 0.7 and w2 = 0.3. The mixture looks less like N1 compared to the previous example, and is influenced more by N2.
Mixtures of Gaussians – Example
[Figure] The blue and green curves show two Gaussians. The red curve shows a mixture of those Gaussians, with w1 = 0.5 and w2 = 0.5. At each point x, the value of the mixture is the average of N1(x) and N2(x).
Mixtures of Gaussians – Example
[Figure] The blue and green curves show two Gaussians. The red curve shows a mixture of those Gaussians, with w1 = 0.3 and w2 = 0.7. The mixture now resembles N2 more than N1.
Mixtures of Gaussians – Example
[Figure] The blue and green curves show two Gaussians. The red curve shows a mixture of those Gaussians, with w1 = 0.1 and w2 = 0.9. The mixture now is almost identical to N2(x).
Learning a Mixture of Gaussians
• Suppose we are given training data x1, x2, …, xn.
• Suppose all xj belong to the same class c.
• How can we fit a mixture of Gaussians to this data?
• This will be the topic of the next few slides.
• We will learn a very popular machine learning algorithm, called the EM algorithm.
  – EM stands for Expectation-Maximization.
• Step 0 of the EM algorithm: pick k manually.
  – Decide how many Gaussians the mixture should have.
  – Any approach for choosing k automatically is beyond the scope of this class.
Learning a Mixture of Gaussians
• Suppose we are given training data x1, x2, …, xn.
• Suppose all xj belong to the same class c.
• We want to model P(x | c) as a mixture of Gaussians.
• Given k, how many parameters do we need to estimate in order to fully define the mixture?
• Remember, a mixture M of k Gaussians is defined as:

  $M(x) = \sum_{i=1}^{k} w_i\, N_i(x) = \sum_{i=1}^{k} w_i \frac{1}{\sigma_i\sqrt{2\pi}}\, e^{-\frac{(x-\mu_i)^2}{2\sigma_i^2}}$

• For each Ni, we need to estimate three numbers:
  – wi, μi, σi.
• So, in total, we need to estimate 3*k numbers.
Learning a Mixture of Gaussians
• Suppose we are given training data x1, x2, …, xn.
• A mixture M of k Gaussians is defined as:

  $M(x) = \sum_{i=1}^{k} w_i\, N_i(x) = \sum_{i=1}^{k} w_i \frac{1}{\sigma_i\sqrt{2\pi}}\, e^{-\frac{(x-\mu_i)^2}{2\sigma_i^2}}$

• For each Ni, we need to estimate wi, μi, σi.
• Suppose that we knew, for each xj, that it belongs to one and only one of the k Gaussians.
• Then, learning the mixture would be a piece of cake.
• For each Gaussian Ni:
  – Estimate μi, σi based on the examples that belong to it.
  – Set wi equal to the fraction of examples that belong to Ni.
Learning a Mixture of Gaussians
• Suppose we are given training data x1, x2, …, xn.
• A mixture M of k Gaussians is defined as:

  $M(x) = \sum_{i=1}^{k} w_i\, N_i(x) = \sum_{i=1}^{k} w_i \frac{1}{\sigma_i\sqrt{2\pi}}\, e^{-\frac{(x-\mu_i)^2}{2\sigma_i^2}}$

• For each Ni, we need to estimate wi, μi, σi.
• However, we have no idea which Gaussian each xj belongs to.
• If we knew μi and σi for each Ni, we could probabilistically assign each xj to a component.
  – "Probabilistically" means that we would not make a hard assignment, but we would partially assign xj to different components, with each assignment weighted proportionally to the density value Ni(xj).
Example of Partial Assignments
• Using our previous example of a mixture:
• Suppose xj = 6.5.
• How do we assign 6.5 to the two Gaussians?
• N1(6.5) = 0.0913.
• N2(6.5) = 0.3521.
• So:
  – 6.5 belongs to N1 by 0.0913 / (0.0913 + 0.3521) = 20.6%.
  – 6.5 belongs to N2 by 0.3521 / (0.0913 + 0.3521) = 79.4%.
The Chicken-and-Egg Problem
• To recap, fitting a mixture of Gaussians to data involves estimating, for each Ni, the values wi, μi, σi.
• If we could assign each xj to one of the Gaussians, we could easily compute wi, μi, σi.
  – Even if we probabilistically assign xj to multiple Gaussians, we can still easily compute wi, μi, σi, by adapting our previous formulas. We will see the adapted formulas in a few slides.
• If we knew μi, σi, and wi, we could assign (at least probabilistically) the xj's to Gaussians.
• So, this is a chicken-and-egg problem.
  – If we knew one piece, we could compute the other.
  – But we know neither. So, what do we do?
On Chicken-and-Egg Problems
• Such chicken-and-egg problems occur frequently in AI.
• Surprisingly (at least to people new to AI), we can easily solve such chicken-and-egg problems.
• Overall, chicken-and-egg problems in AI look like this:
  – We need to know A to estimate B.
  – We need to know B to compute A.
• There is a fairly standard recipe for solving these problems.
• Any guesses?
On Chicken-and-Egg Problems
• Such chicken-and-egg problems occur frequently in AI.
• Surprisingly (at least to people new to AI), we can easily solve such chicken-and-egg problems.
• Overall, chicken-and-egg problems in AI look like this:
  – We need to know A to estimate B.
  – We need to know B to compute A.
• There is a fairly standard recipe for solving these problems.
• Start by giving A values chosen randomly (or perhaps non-randomly, but still in an uninformed way, since we do not know the correct values).
• Repeat this loop:
  – Given our current values for A, estimate B.
  – Given our current values of B, estimate A.
  – If the new values of A and B are very close to the old values, break.
The EM Algorithm - Overview
• We use this approach to fit mixtures of Gaussians to data.
• This algorithm, which fits mixtures of Gaussians to data, is called the EM algorithm (Expectation-Maximization algorithm).
• Remember, we choose k (the number of Gaussians in the mixture) manually, so we don't have to estimate that.
• To initialize the EM algorithm, we initialize each μi, σi, and wi. Values wi are set to 1/k. We can initialize μi, σi in different ways:
  – Giving random values to each μi.
  – Uniformly spacing the values given to each μi.
  – Giving random values to each σi.
  – Setting each σi to 1 initially.
• Then, we iteratively perform two steps:
  – The E-step.
  – The M-step.
The E-Step
• E-step. Given our current estimates for μi, σi, and wi:
  – We compute, for each i and j, the probability pij = P(Ni | xj): the probability that xj was generated by Gaussian Ni.
  – How? Using Bayes rule:

  $p_{ij} = P(N_i \mid x_j) = \frac{P(x_j \mid N_i)\, P(N_i)}{P(x_j)} = \frac{N_i(x_j)\, w_i}{M(x_j)}$

  $N_i(x_j) = \frac{1}{\sigma_i\sqrt{2\pi}}\, e^{-\frac{(x_j-\mu_i)^2}{2\sigma_i^2}}$
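A minimal NumPy sketch of this E-step for a 1D mixture, assuming the current parameters are stored in arrays weights, mus, and sigmas (these names are illustrative):

```python
import numpy as np

def e_step(x, weights, mus, sigmas):
    """Return the n-by-k matrix p, where p[j, i] = P(N_i | x_j)."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)              # shape (n, 1)
    mus, sigmas, weights = map(np.asarray, (mus, sigmas, weights))
    # N_i(x_j) for every (j, i) pair, shape (n, k).
    densities = np.exp(-((x - mus) ** 2) / (2 * sigmas ** 2)) / (sigmas * np.sqrt(2 * np.pi))
    numerators = densities * weights                           # N_i(x_j) * w_i
    return numerators / numerators.sum(axis=1, keepdims=True)  # divide by M(x_j)
```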
The M-Step: Updating μi and σi
• M-step. Given our current estimates of pij, for each i, j:
  – We compute μi and σi for each Ni, as follows:

  $\mu_i = \frac{\sum_{j=1}^{n} p_{ij}\, x_j}{\sum_{j=1}^{n} p_{ij}} \qquad \sigma_i = \sqrt{\frac{\sum_{j=1}^{n} p_{ij}\, (x_j - \mu_i)^2}{\sum_{j=1}^{n} p_{ij}}}$

  – To understand these formulas, it helps to compare them to the standard formulas for fitting a Gaussian to data:

  $\mu = \frac{1}{n} \sum_{j=1}^{n} x_j \qquad \sigma = \sqrt{\frac{1}{n-1} \sum_{j=1}^{n} (x_j - \mu)^2}$
The M-Step: Updating μi and σi

  $\mu_i = \frac{\sum_{j=1}^{n} p_{ij}\, x_j}{\sum_{j=1}^{n} p_{ij}} \qquad \sigma_i = \sqrt{\frac{\sum_{j=1}^{n} p_{ij}\, (x_j - \mu_i)^2}{\sum_{j=1}^{n} p_{ij}}}$

  – To understand these formulas, it helps to compare them to the standard formulas for fitting a Gaussian to data:

  $\mu = \frac{1}{n} \sum_{j=1}^{n} x_j \qquad \sigma = \sqrt{\frac{1}{n-1} \sum_{j=1}^{n} (x_j - \mu)^2}$

• Why do we take weighted averages at the M-step?
• Because each xj is probabilistically assigned to multiple Gaussians.
• We use $p_{ij} = P(N_i \mid x_j)$ as the weight of the assignment of xj to Ni.
The M-Step: Updating wi
  $w_i = \frac{\sum_{j=1}^{n} p_{ij}}{\sum_{i'=1}^{k} \sum_{j=1}^{n} p_{i'j}}$

• At the M-step, in addition to updating μi and σi, we also need to update wi, which is the weight of the i-th Gaussian in the mixture.
• The formula shown above is used for the update of wi.
  – We sum up the weights of all objects for the i-th Gaussian.
  – We divide that sum by the sum of weights of all objects for all Gaussians.
  – The division ensures that $\sum_{i=1}^{k} w_i = 1$.
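A NumPy sketch of the full M-step (means, standard deviations, and weights), taking the n-by-k responsibility matrix p produced by the e_step sketch above. The function and variable names are my own:

```python
import numpy as np

def m_step(x, p):
    """Update (weights, mus, sigmas) from data x and responsibilities p[j, i]."""
    x = np.asarray(x, dtype=float)
    totals = p.sum(axis=0)                                   # sum over j of p_ij, one value per Gaussian
    mus = (p * x[:, None]).sum(axis=0) / totals              # weighted means
    sigmas = np.sqrt((p * (x[:, None] - mus) ** 2).sum(axis=0) / totals)  # weighted stds
    weights = totals / totals.sum()                          # normalize so the weights sum to 1
    return weights, mus, sigmas
```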
The EM Steps: Summary
• E-step: Given current estimates for each μi, σi, and wi, update pij:

  $p_{ij} = \frac{N_i(x_j)\, w_i}{M(x_j)}$

• M-step: Given our current estimates for each pij, update μi, σi, and wi:

  $\mu_i = \frac{\sum_{j=1}^{n} p_{ij}\, x_j}{\sum_{j=1}^{n} p_{ij}} \qquad \sigma_i = \sqrt{\frac{\sum_{j=1}^{n} p_{ij}\, (x_j - \mu_i)^2}{\sum_{j=1}^{n} p_{ij}}} \qquad w_i = \frac{\sum_{j=1}^{n} p_{ij}}{\sum_{i'=1}^{k} \sum_{j=1}^{n} p_{i'j}}$
The EM Algorithm - Termination
• The log likelihood of the training data is defined as:

  $L(x_1, \ldots, x_n) = \sum_{j=1}^{n} \log_2 M(x_j)$

• As a reminder, M is the Gaussian mixture, defined as:

  $M(x) = \sum_{i=1}^{k} w_i\, N_i(x) = \sum_{i=1}^{k} w_i \frac{1}{\sigma_i\sqrt{2\pi}}\, e^{-\frac{(x-\mu_i)^2}{2\sigma_i^2}}$

• One can prove that, after each iteration of the E-step and the M-step, this log likelihood increases or stays the same.
• We check how much the log likelihood changes at each iteration.
• When the change is below some threshold, we stop.
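A small self-contained Python sketch of this log-likelihood computation; the base-2 logarithm follows the slide's formula, and the function name is my own:

```python
import numpy as np

def log_likelihood(data, weights, mus, sigmas):
    """Compute L(x_1, ..., x_n) = sum over j of log2(M(x_j))."""
    x = np.asarray(data, dtype=float).reshape(-1, 1)
    mus, sigmas, weights = map(np.asarray, (mus, sigmas, weights))
    densities = np.exp(-((x - mus) ** 2) / (2 * sigmas ** 2)) / (sigmas * np.sqrt(2 * np.pi))
    m = (densities * weights).sum(axis=1)   # M(x_j) for every training point
    return np.log2(m).sum()
```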
The EM Algorithm: Summary
• Initialization:
  – Initialize each μi, σi, wi, using your favorite approach (e.g., set each μi to a random value, set each σi to 1, and set each wi equal to 1/k).
  – last_log_likelihood = -infinity.
• Main loop:
  – E-step:
    • Given our current estimates for each μi, σi, and wi, update each pij.
  – M-step:
    • Given our current estimates for each pij, update each μi, σi, and wi.
  – log_likelihood = $L(x_1, \ldots, x_n)$.
  – if (log_likelihood − last_log_likelihood) < threshold, break.
  – last_log_likelihood = log_likelihood.
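Putting the earlier sketches together, here is a rough end-to-end EM loop for a 1D mixture. It assumes the e_step, m_step, and log_likelihood functions sketched above; the initialization follows one of the options from the slides (random means, unit standard deviations, uniform weights), and max_iters is an extra safeguard not mentioned in the slides:

```python
import numpy as np

def fit_mixture_em(data, k, threshold=1e-6, max_iters=500, rng=None):
    """Fit a k-component 1D Gaussian mixture with EM; return (weights, mus, sigmas)."""
    rng = np.random.default_rng(rng)
    x = np.asarray(data, dtype=float)
    # Initialization: random data points as means, unit stds, uniform weights.
    mus = rng.choice(x, size=k, replace=False)
    sigmas = np.ones(k)
    weights = np.full(k, 1.0 / k)
    last_ll = -np.inf
    for _ in range(max_iters):
        p = e_step(x, weights, mus, sigmas)          # E-step: responsibilities
        weights, mus, sigmas = m_step(x, p)          # M-step: parameter updates
        ll = log_likelihood(x, weights, mus, sigmas)
        if ll - last_ll < threshold:                 # stop when the improvement is tiny
            break
        last_ll = ll
    return weights, mus, sigmas
```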
The EM Algorithm: Limitations
• When we fit a Gaussian to data, we always get the same result.
• We can also prove that the result that we get is the best possible result.
  – There is no other Gaussian giving a higher log likelihood to the data than the one that we compute as described in these slides.
• When we fit a mixture of Gaussians to the same data, do we always end up with the same result?
The EM Algorithm: Limitations
• When we fit a Gaussian to data, we always get the same result.
• We can also prove that the result that we get is the best possible result.
  – There is no other Gaussian giving a higher log likelihood to the data than the one that we compute as described in these slides.
• When we fit a mixture of Gaussians to the same data, we (sadly) do not always get the same result.
• The EM algorithm is a greedy algorithm.
• The result depends on the initialization values.
• We may have bad luck with the initial values, and end up with a bad fit.
• There is no good way to know if our result is good or bad, or if better results are possible.
Mixtures of Gaussians - Recap
• Mixtures of Gaussians are widely used.
• Why? Because with the right parameters, they can fit various types of data very well.
  – Actually, they can fit almost anything, as long as k is large enough (so that the mixture contains sufficiently many Gaussians).
• The EM algorithm is widely used to fit mixtures of Gaussians to data.
Multidimensional Gaussians
• Instead of assuming that each dimension is independent, we can instead model the distribution using a multi-dimensional Gaussian:

  $N(x) = \frac{1}{\sqrt{(2\pi)^d\, |\Sigma|}} \exp\left(-\frac{1}{2} (x-\mu)^{T} \Sigma^{-1} (x-\mu)\right)$

• To specify this Gaussian, we need to estimate the mean μ and the covariance matrix Σ.
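A NumPy sketch of evaluating this density. The function name and the use of np.linalg are my own choices; SciPy's scipy.stats.multivariate_normal provides the same computation:

```python
import numpy as np

def multivariate_gaussian_pdf(x, mu, cov):
    """Evaluate the d-dimensional Gaussian density at point x."""
    x, mu, cov = (np.asarray(a, dtype=float) for a in (x, mu, cov))
    d = len(mu)
    diff = x - mu
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)
```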
Multidimensional Gaussians - Mean
• Let x1, x2, …, xn be d-dimensional vectors.
• xi = (xi,1, xi,2, …, xi,d), where each xi,j is a real number.
• Then, the mean μ = (μ1, ..., μd) is computed as:

  $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$

• Therefore, $\mu_j = \frac{1}{n} \sum_{i=1}^{n} x_{i,j}$
Multidimensional Gaussians – Covariance Matrix
• Let x1, x2, …, xn be d-dimensional vectors.
• xi = (xi,1, xi,2, …, xi,d), where each xi,j is a real number.
• Let Σ be the covariance matrix. Its size is d×d.
• Let σr,c be the value of Σ at row r, column c.

  $\sigma_{r,c} = \frac{1}{n-1} \sum_{i=1}^{n} (x_{i,r} - \mu_r)(x_{i,c} - \mu_c)$
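A short NumPy sketch of estimating these parameters from an n-by-d data matrix X. Note that np.cov(X, rowvar=False) computes the same (n−1)-denominator covariance:

```python
import numpy as np

def fit_multivariate_gaussian(X):
    """Estimate (mu, Sigma) for d-dimensional data X of shape (n, d)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    mu = X.sum(axis=0) / n            # mean of each dimension
    diff = X - mu
    sigma = diff.T @ diff / (n - 1)   # (d x d) covariance with (n - 1) denominator
    return mu, sigma
```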
Multidimensional Gaussians – Training
• Let N be a d-dimensional Gaussian with mean μ and covariance matrix Σ.
• How many parameters do we need to specify N?
  – The mean μ is defined by d numbers.
  – The covariance matrix Σ requires d² numbers σr,c.
  – Strictly speaking, Σ is symmetric: σr,c = σc,r.
  – So, we need roughly d²/2 parameters.
• The number of parameters is quadratic in d.
• The number of training data we need for reliable estimation is also quadratic in d.
The Curse of Dimensionality
• We will discuss this "curse" in several places in this course.
• Summary: dealing with high-dimensional data is a pain, and presents challenges that may be surprising to someone used to dealing with one, two, or three dimensions.
• A first example is in estimating Gaussian parameters.
• In one dimension, it is very simple:
  – We estimate two parameters, μ and σ.
  – Estimation can be pretty reliable with a few tens of examples.
• In d dimensions, we estimate O(d²) parameters.
• The number of training data we need is quadratic in the number of dimensions.
The Curse of Dimensionality
• For example: suppose we want to train a system to recognize the faces of Michael Jordan and Kobe Bryant.
  – Assume each image is 100x100 pixels.
  – Each pixel has three numbers: r, g, b.
  – Thus, each image has 30,000 numbers.
• Suppose we model each class as a multi-dimensional Gaussian.
• Then, we need to estimate the parameters of a 30,000-dimensional Gaussian.
  – We need roughly 450 million numbers for the covariance matrix.
• We would need more than ten billion training images to have a reliable estimate.
  – It is not realistic to expect to have such a large training set for learning how to recognize a single person.
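For concreteness, and using the rough d²/2 parameter count from the previous slide, the arithmetic behind these numbers works out as follows:

  $d = 100 \times 100 \times 3 = 30{,}000, \qquad \frac{d^2}{2} = \frac{(30{,}000)^2}{2} = 4.5 \times 10^{8} \approx 450 \text{ million covariance parameters}$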
The Curse of Dimensionality
• The curse of dimensionality makes it (usually) impossible to precisely estimate probability densities in high-dimensional spaces.
  – The number of training data that is needed is exponential in the number of dimensions.
• The curse of dimensionality also makes histogram-based probability estimation infeasible in high dimensions.
  – Estimating a histogram requires a number of training examples that is exponential in the number of dimensions.
• Estimating a Gaussian requires a number of parameters that is "only" quadratic in the number of dimensions.
• However, Gaussians may not be accurate fits for the actual distribution.
  – Mixtures of Gaussians can often provide significantly better fits.