
Mixtures of Gaussians and the EM Algorithm
CSE 6363 – Machine Learning
Vassilis Athitsos
Computer Science and Engineering Department
University of Texas at Arlington
Gaussians
• A popular way to estimate probability density functions is to model them as Gaussians.
• Review: a 1D normal distribution is defined as:

  N(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}

• To define a Gaussian, we need to specify just two parameters:
– μ, which is the mean (average) of the distribution.
– σ, which is the standard deviation of the distribution.
– Note: σ² is called the variance of the distribution.
Estimating a Gaussian
• In one dimension, a Gaussian is defined like this:

  N(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}

• Given a set of n real numbers x_1, …, x_n, we can easily find the best-fitting Gaussian for that data.
• The mean μ is simply the average of those numbers:

  \mu = \frac{1}{n} \sum_{i=1}^{n} x_i

• The standard deviation σ is computed as:

  \sigma = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu)^2}
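A minimal Python/NumPy sketch of these two estimates (the sample data is made up for illustration):

```python
import numpy as np

# Made-up 1D training data.
x = np.array([1.2, 1.8, 2.5, 2.9, 7.1, 7.6])

mu = x.mean()           # sample mean
sigma = x.std(ddof=1)   # sample standard deviation (divides by n - 1, as in the formula above)

print(f"mu = {mu:.3f}, sigma = {sigma:.3f}")
```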
Estimating a Gaussian
• Fitting a Gaussian to data does not guarantee that the resulting Gaussian will be an accurate distribution for the data.
• The data may have a distribution that is very different from a Gaussian.
Example of Fitting a Gaussian
(Figure) The blue curve is a density function F such that:
- F(x) = 0.25 for 1 ≤ x ≤ 3.
- F(x) = 0.5 for 7 ≤ x ≤ 8.
The red curve is the Gaussian fit G to data generated using F.
Naïve Bayes with 1D Gaussians
• Suppose the patterns come from a d-dimensional space:
– Examples: pixels to be classified as skin or non-skin, or the statlog dataset.
• Notation: x_i = (x_{i,1}, x_{i,2}, …, x_{i,d})
• For each dimension j, we can use a Gaussian to model the distribution p_j(x_{i,j} | C_k) of the data in that dimension, given their class.
• For example, for the statlog dataset, we would get 216 Gaussians:
– 36 dimensions * 6 classes.
• Then, we can use the naïve Bayes approach (i.e., assume that all dimensions are independent of each other given the class) to define P(x_i | C_k) as:

  P(x_i \mid C_k) = \prod_{j=1}^{d} p_j(x_{i,j} \mid C_k)
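A minimal NumPy sketch of this class-conditional density, assuming the per-dimension means and standard deviations for class C_k have already been estimated (all names and values here are illustrative):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """1D Gaussian density, evaluated elementwise."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def naive_bayes_class_conditional(x, mus, sigmas):
    """P(x | C_k) as the product of the d per-dimension Gaussian densities."""
    return np.prod(gaussian_pdf(x, mus, sigmas))

# Illustrative usage for one class, with d = 3:
x = np.array([0.2, 1.5, -0.3])        # pattern to evaluate
mus = np.array([0.0, 1.0, 0.0])       # per-dimension means for class C_k
sigmas = np.array([1.0, 0.5, 2.0])    # per-dimension standard deviations for class C_k
print(naive_bayes_class_conditional(x, mus, sigmas))
```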
Mixtures of Gaussians
• This figure shows our previous example, where we fitted a Gaussian to some data, and the fit was poor.
• Overall, Gaussians have attractive properties:
– They require learning only two numbers (μ and σ), and thus require few training examples to estimate those numbers.
• However, for some data, Gaussians are just not good fits.
Mixtures of Gaussians
• Mixtures of Gaussians are oftentimes a better solution.
– They are defined on the next slide.
• They still require relatively few parameters to estimate, and thus can be learned from relatively small amounts of data.
• They can fit actual distributions of data pretty well.
Mixtures of Gaussians
• Suppose we have k Gaussian distributions N_i.
• Each N_i has its own mean μ_i and standard deviation σ_i.
• Using these k Gaussians, we can define a Gaussian mixture M as follows:

  M(x) = \sum_{i=1}^{k} w_i N_i(x)

• Each w_i is a weight, specifying the relative importance of Gaussian N_i in the mixture.
– Weights w_i are real numbers between 0 and 1.
– Weights w_i must sum up to 1, so that the integral of M is 1.
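A minimal NumPy sketch of evaluating such a mixture (the weights, means, and standard deviations below are illustrative):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def mixture_pdf(x, weights, mus, sigmas):
    """M(x) = sum_i w_i * N_i(x)."""
    return sum(w * gaussian_pdf(x, mu, sigma)
               for w, mu, sigma in zip(weights, mus, sigmas))

# Illustrative two-component mixture:
weights = [0.9, 0.1]
mus = [3.0, 6.5]
sigmas = [1.0, 0.8]
print(mixture_pdf(5.0, weights, mus, sigmas))
```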
Mixtures of Gaussians – Example
(Figure) The blue and green curves show two Gaussians. The red curve shows a mixture of those Gaussians, with w_1 = 0.9 and w_2 = 0.1. The mixture looks a lot like N_1, but is influenced a little by N_2 as well.
Mixtures of Gaussians – Example
(Figure) The blue and green curves show two Gaussians. The red curve shows a mixture of those Gaussians, with w_1 = 0.7 and w_2 = 0.3. The mixture looks less like N_1 compared to the previous example, and is influenced more by N_2.
Mixtures of Gaussians – Example
(Figure) The blue and green curves show two Gaussians. The red curve shows a mixture of those Gaussians, with w_1 = 0.5 and w_2 = 0.5. At each point x, the value of the mixture is the average of N_1(x) and N_2(x).
Mixtures of Gaussians – Example
(Figure) The blue and green curves show two Gaussians. The red curve shows a mixture of those Gaussians, with w_1 = 0.3 and w_2 = 0.7. The mixture now resembles N_2 more than N_1.
Mixtures of Gaussians – Example
(Figure) The blue and green curves show two Gaussians. The red curve shows a mixture of those Gaussians, with w_1 = 0.1 and w_2 = 0.9. The mixture is now almost identical to N_2.
Learning a Mixture of Gaussians
• Suppose we are given training data x_1, x_2, …, x_n.
• Suppose all x_j belong to the same class c.
• How can we fit a mixture of Gaussians to this data?
• This will be the topic of the next few slides.
• We will learn a very popular machine learning algorithm, called the EM algorithm.
– EM stands for Expectation-Maximization.
• Step 0 of the EM algorithm: pick k manually.
– Decide how many Gaussians the mixture should have.
– Any approach for choosing k automatically is beyond the scope of this class.
Learning a Mixture of Gaussians
• Suppose we are given training data x_1, x_2, …, x_n.
• Suppose all x_j belong to the same class c.
• We want to model P(x | c) as a mixture of Gaussians.
• Given k, how many parameters do we need to estimate in order to fully define the mixture?
• Remember, a mixture M of k Gaussians is defined as:

  M(x) = \sum_{i=1}^{k} w_i N_i(x) = \sum_{i=1}^{k} w_i \frac{1}{\sigma_i \sqrt{2\pi}} e^{-\frac{(x-\mu_i)^2}{2\sigma_i^2}}

• For each N_i, we need to estimate three numbers:
– w_i, μ_i, σ_i.
• So, in total, we need to estimate 3*k numbers.
Learning a Mixture of Gaussians
• Suppose we are given training data x_1, x_2, …, x_n.
• A mixture M of k Gaussians is defined as:

  M(x) = \sum_{i=1}^{k} w_i N_i(x) = \sum_{i=1}^{k} w_i \frac{1}{\sigma_i \sqrt{2\pi}} e^{-\frac{(x-\mu_i)^2}{2\sigma_i^2}}

• For each N_i, we need to estimate w_i, μ_i, σ_i.
• Suppose that we knew, for each x_j, that it belongs to one and only one of the k Gaussians.
• Then, learning the mixture would be a piece of cake. For each Gaussian N_i:
– Estimate μ_i, σ_i based on the examples that belong to it.
– Set w_i equal to the fraction of examples that belong to N_i.
Learning a Mixture of Gaussians
• Suppose we are given training data x_1, x_2, …, x_n.
• A mixture M of k Gaussians is defined as:

  M(x) = \sum_{i=1}^{k} w_i N_i(x) = \sum_{i=1}^{k} w_i \frac{1}{\sigma_i \sqrt{2\pi}} e^{-\frac{(x-\mu_i)^2}{2\sigma_i^2}}

• For each N_i, we need to estimate w_i, μ_i, σ_i.
• However, we have no idea which component each x_j belongs to.
• If we knew μ_i and σ_i for each N_i, we could probabilistically assign each x_j to a component.
– "Probabilistically" means that we would not make a hard assignment, but would partially assign x_j to different components, with each assignment weighted proportionally to the density value N_i(x_j).
Example of Partial Assignments
• Using our previous example of a mixture:
• Suppose x_j = 6.5.
• How do we assign 6.5 to the two Gaussians?
• N_1(6.5) = 0.0913.
• N_2(6.5) = 0.3521.
• So:
– 6.5 belongs to N_1 by \frac{0.0913}{0.0913 + 0.3521} = 20.6%.
– 6.5 belongs to N_2 by \frac{0.3521}{0.0913 + 0.3521} = 79.4%.
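A quick check of this arithmetic in Python:

```python
n1, n2 = 0.0913, 0.3521        # density values from the example
total = n1 + n2
print(n1 / total, n2 / total)  # approximately 0.206 and 0.794
```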
The Chicken-and-Egg Problem
• To recap, fitting a mixture of Gaussians to data involves estimating, for each N_i, values w_i, μ_i, σ_i.
• If we could assign each x_j to one of the Gaussians, we could easily compute w_i, μ_i, σ_i.
– Even if we probabilistically assign x_j to multiple Gaussians, we can still easily compute w_i, μ_i, σ_i, by adapting our previous formulas. We will see the adapted formulas in a few slides.
• If we knew μ_i, σ_i, and w_i, we could assign (at least probabilistically) each x_j to a Gaussian.
• So, this is a chicken-and-egg problem.
– If we knew one piece, we could compute the other.
– But we know neither. So, what do we do?
On Chicken-and-Egg Problems
• Such chicken-and-egg problems occur frequently in AI.
• Surprisingly (at least to people new to AI), we can often solve such chicken-and-egg problems easily.
• Overall, chicken-and-egg problems in AI look like this:
– We need to know A to estimate B.
– We need to know B to compute A.
• There is a fairly standard recipe for solving these problems.
• Any guesses?
On Chicken-and-Egg Problems
• Such chicken-and-egg problems occur frequently in AI.
• Surprisingly (at least to people new to AI), we can often solve such chicken-and-egg problems easily.
• Overall, chicken-and-egg problems in AI look like this:
– We need to know A to estimate B.
– We need to know B to compute A.
• There is a fairly standard recipe for solving these problems.
• Start by giving A values chosen randomly (or perhaps non-randomly, but still in an uninformed way, since we do not know the correct values).
• Repeat this loop:
– Given our current values for A, estimate B.
– Given our current values of B, estimate A.
– If the new values of A and B are very close to the old values, break.
The EM Algorithm - Overview
• We use this approach to fit mixtures of Gaussians to data.
• This algorithm, which fits mixtures of Gaussians to data, is called the EM algorithm (Expectation-Maximization algorithm).
• Remember, we choose k (the number of Gaussians in the mixture) manually, so we don't have to estimate that.
• To initialize the EM algorithm, we initialize each μ_i, σ_i, and w_i. Values w_i are set to 1/k. We can initialize μ_i, σ_i in different ways:
– Giving random values to each μ_i.
– Uniformly spacing the values given to each μ_i.
– Giving random values to each σ_i.
– Setting each σ_i to 1 initially.
• Then, we iteratively perform two steps:
– The E-step.
– The M-step.
The E-Step
• E-step. Given our current estimates for μ_i, σ_i, and w_i:
– We compute, for each i and j, the probability p_ij = P(N_i | x_j): the probability that x_j was generated by Gaussian N_i.
– How? Using Bayes rule:

  p_{ij} = P(N_i \mid x_j) = \frac{P(x_j \mid N_i) \, P(N_i)}{P(x_j)} = \frac{N_i(x_j) \, w_i}{P(x_j)}

  N_i(x_j) = \frac{1}{\sigma_i \sqrt{2\pi}} e^{-\frac{(x_j - \mu_i)^2}{2\sigma_i^2}}

– Here P(x_j) = M(x_j) = \sum_{i=1}^{k} w_i N_i(x_j), so that, for a fixed j, the values p_{ij} sum to 1 over i.
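A minimal NumPy sketch of the E-step, assuming the current estimates are stored in arrays w, mu, sigma of length k (these names are illustrative):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def e_step(x, w, mu, sigma):
    """
    x: training data, shape (n,)
    w, mu, sigma: current mixture parameters, each of shape (k,)
    Returns p of shape (k, n), where p[i, j] = P(N_i | x_j).
    """
    # Numerator of Bayes rule: N_i(x_j) * w_i, for every i and j.
    num = w[:, None] * gaussian_pdf(x[None, :], mu[:, None], sigma[:, None])
    # Denominator: P(x_j) = sum_i w_i * N_i(x_j).
    return num / num.sum(axis=0, keepdims=True)
```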
The M-Step: Updating μ_i and σ_i
• M-step. Given our current estimates of p_ij, for each i, j:
– We compute μ_i and σ_i for each N_i, as follows:

  \mu_i = \frac{\sum_{j=1}^{n} p_{ij} x_j}{\sum_{j=1}^{n} p_{ij}}
  \qquad
  \sigma_i = \sqrt{\frac{\sum_{j=1}^{n} p_{ij} (x_j - \mu_i)^2}{\sum_{j=1}^{n} p_{ij}}}

– To understand these formulas, it helps to compare them to the standard formulas for fitting a Gaussian to data:

  \mu = \frac{1}{n} \sum_{j=1}^{n} x_j
  \qquad
  \sigma = \sqrt{\frac{1}{n-1} \sum_{j=1}^{n} (x_j - \mu)^2}
The M-Step: Updating μ_i and σ_i

  \mu_i = \frac{\sum_{j=1}^{n} p_{ij} x_j}{\sum_{j=1}^{n} p_{ij}}
  \qquad
  \sigma_i = \sqrt{\frac{\sum_{j=1}^{n} p_{ij} (x_j - \mu_i)^2}{\sum_{j=1}^{n} p_{ij}}}

– To understand these formulas, it helps to compare them to the standard formulas for fitting a Gaussian to data:

  \mu = \frac{1}{n} \sum_{j=1}^{n} x_j
  \qquad
  \sigma = \sqrt{\frac{1}{n-1} \sum_{j=1}^{n} (x_j - \mu)^2}

• Why do we take weighted averages at the M-step?
• Because each x_j is probabilistically assigned to multiple Gaussians.
• We use p_{ij} = P(N_i | x_j) as the weight of the assignment of x_j to N_i.
The M-Step: Updating w_i

  w_i = \frac{\sum_{j=1}^{n} p_{ij}}{\sum_{i=1}^{k} \sum_{j=1}^{n} p_{ij}}

• At the M-step, in addition to updating μ_i and σ_i, we also need to update w_i, which is the weight of the i-th Gaussian in the mixture.
• The formula shown above is used for the update of w_i.
– We sum up the weights of all objects for the i-th Gaussian.
– We divide that sum by the sum of weights of all objects for all Gaussians.
– The division ensures that \sum_{i=1}^{k} w_i = 1.
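A minimal NumPy sketch of the M-step updates, written to pair with the E-step sketch above (p[i, j] holds p_ij):

```python
import numpy as np

def m_step(x, p):
    """
    x: training data, shape (n,)
    p: responsibilities from the E-step, shape (k, n)
    Returns updated (w, mu, sigma), each of shape (k,).
    """
    totals = p.sum(axis=1)                       # sum over j of p_ij, one value per component
    w = totals / totals.sum()                    # updated mixture weights
    mu = (p * x[None, :]).sum(axis=1) / totals   # weighted means
    var = (p * (x[None, :] - mu[:, None]) ** 2).sum(axis=1) / totals
    sigma = np.sqrt(var)                         # weighted standard deviations
    return w, mu, sigma
```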
The EM Steps: Summary
• E-step: Given current estimates for each μ_i, σ_i, and w_i, update p_ij:

  p_{ij} = \frac{N_i(x_j) \, w_i}{P(x_j)}

• M-step: Given our current estimates for each p_ij, update μ_i, σ_i, and w_i:

  \mu_i = \frac{\sum_{j=1}^{n} p_{ij} x_j}{\sum_{j=1}^{n} p_{ij}}
  \qquad
  \sigma_i = \sqrt{\frac{\sum_{j=1}^{n} p_{ij} (x_j - \mu_i)^2}{\sum_{j=1}^{n} p_{ij}}}
  \qquad
  w_i = \frac{\sum_{j=1}^{n} p_{ij}}{\sum_{i=1}^{k} \sum_{j=1}^{n} p_{ij}}
The EM Algorithm - Termination
• The log likelihood of the training data is defined as:

  L(x_1, …, x_n) = \sum_{j=1}^{n} \log_2 M(x_j)

• As a reminder, M is the Gaussian mixture, defined as:

  M(x) = \sum_{i=1}^{k} w_i N_i(x) = \sum_{i=1}^{k} w_i \frac{1}{\sigma_i \sqrt{2\pi}} e^{-\frac{(x-\mu_i)^2}{2\sigma_i^2}}

• One can prove that, after each iteration of the E-step and the M-step, this log likelihood increases or stays the same.
• We check how much the log likelihood changes at each iteration.
• When the change is below some threshold, we stop.
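A minimal sketch of computing this log likelihood for the termination check (the base-2 logarithm follows the definition above):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def log_likelihood(x, w, mu, sigma):
    """L(x_1, ..., x_n) = sum_j log2(M(x_j))."""
    m = sum(wi * gaussian_pdf(x, mi, si) for wi, mi, si in zip(w, mu, sigma))
    return np.log2(m).sum()
```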
The EM Algorithm: Summary
• Initialization:
– Initialize each μ_i, σ_i, w_i, using your favorite approach (e.g., set each μ_i to a random value, set each σ_i to 1, and set each w_i equal to 1/k).
– last_log_likelihood = -infinity.
• Main loop:
– E-step:
  • Given our current estimates for each μ_i, σ_i, and w_i, update each p_ij.
– M-step:
  • Given our current estimates for each p_ij, update each μ_i, σ_i, and w_i.
– log_likelihood = L(x_1, …, x_n).
– if (log_likelihood – last_log_likelihood) < threshold, break.
– last_log_likelihood = log_likelihood.
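Putting the earlier sketches together, a minimal end-to-end version of this loop (it reuses the e_step, m_step, and log_likelihood functions sketched above; the initialization choices, threshold, and iteration cap are illustrative):

```python
import numpy as np

def fit_gaussian_mixture(x, k, threshold=1e-6, max_iters=1000, seed=0):
    """Fit a k-component 1D Gaussian mixture to data x (shape (n,)) with EM."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False).astype(float)  # random initial means drawn from the data
    sigma = np.ones(k)                                        # each sigma_i starts at 1
    w = np.full(k, 1.0 / k)                                   # each w_i starts at 1/k
    last_ll = -np.inf
    for _ in range(max_iters):
        p = e_step(x, w, mu, sigma)        # E-step: update the responsibilities p_ij
        w, mu, sigma = m_step(x, p)        # M-step: update w_i, mu_i, sigma_i
        ll = log_likelihood(x, w, mu, sigma)
        if ll - last_ll < threshold:       # stop when the log likelihood barely changes
            break
        last_ll = ll
    return w, mu, sigma
```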
The EM Algorithm: Limitations
• When we fit a Gaussian to data, we always get the same result.
• We can also prove that the result that we get is the best possible result.
– There is no other Gaussian giving a higher log likelihood to the data than the one that we compute as described in these slides.
• When we fit a mixture of Gaussians to the same data, do we always end up with the same result?
The EM Algorithm: Limitations
• When we fit a Gaussian to data, we always get the same result.
• We can also prove that the result that we get is the best possible result.
– There is no other Gaussian giving a higher log likelihood to the data than the one that we compute as described in these slides.
• When we fit a mixture of Gaussians to the same data, we (sadly) do not always get the same result.
• The EM algorithm is a greedy algorithm.
• The result depends on the initialization values.
• We may have bad luck with the initial values, and end up with a bad fit.
• There is no good way to know if our result is good or bad, or if better results are possible.
Mixtures of Gaussians - Recap
• Mixtures of Gaussians are widely used.
• Why? Because, with the right parameters, they can fit various types of data very well.
– Actually, they can fit almost anything, as long as k is large enough (so that the mixture contains sufficiently many Gaussians).
• The EM algorithm is widely used to fit mixtures of Gaussians to data.
Multidimensional Gaussians
• Instead of assuming that each dimension is independent, we can instead model the distribution using a multi-dimensional Gaussian:

  N(x) = \frac{1}{\sqrt{(2\pi)^d \, |\Sigma|}} \exp\!\left(-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\right)

• To specify this Gaussian, we need to estimate the mean μ and the covariance matrix Σ.
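A minimal NumPy sketch that evaluates this density directly from the formula (in practice, a library routine such as scipy.stats.multivariate_normal would typically be used instead):

```python
import numpy as np

def multivariate_gaussian_pdf(x, mu, cov):
    """Density of a d-dimensional Gaussian with mean mu and covariance matrix cov, at point x."""
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    exponent = -0.5 * diff @ np.linalg.solve(cov, diff)  # -(1/2) (x - mu)^T Sigma^{-1} (x - mu)
    return norm_const * np.exp(exponent)
```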
Multidimensional Gaussians - Mean
• Let x_1, x_2, …, x_n be d-dimensional vectors.
• x_i = (x_{i,1}, x_{i,2}, …, x_{i,d}), where each x_{i,j} is a real number.
• Then, the mean μ = (μ_1, ..., μ_d) is computed as:

  \mu = \frac{1}{n} \sum_{i=1}^{n} x_i

• Therefore, \mu_j = \frac{1}{n} \sum_{i=1}^{n} x_{i,j}.
Multidimensional Gaussians – Covariance Matrix
• Let x_1, x_2, …, x_n be d-dimensional vectors.
• x_i = (x_{i,1}, x_{i,2}, …, x_{i,d}), where each x_{i,j} is a real number.
• Let Σ be the covariance matrix. Its size is d×d.
• Let σ_{r,c} be the value of Σ at row r, column c.

  \sigma_{r,c} = \frac{1}{n-1} \sum_{j=1}^{n} (x_{j,r} - \mu_r)(x_{j,c} - \mu_c)
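A minimal NumPy sketch of these two estimates (the data below is made up; np.cov with rowvar=False uses the same 1/(n-1) normalization as the formula above):

```python
import numpy as np

# Made-up data: n = 5 samples in d = 3 dimensions, one row per sample.
X = np.array([[1.0, 2.0, 0.5],
              [1.2, 1.8, 0.7],
              [0.9, 2.2, 0.4],
              [1.1, 2.1, 0.6],
              [1.3, 1.9, 0.5]])

mu = X.mean(axis=0)               # mean vector, shape (d,)
Sigma = np.cov(X, rowvar=False)   # covariance matrix, shape (d, d)

print(mu)
print(Sigma)
```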
Multidimensional Gaussians – Training
• Let N be a d-dimensional Gaussian with mean μ and covariance matrix Σ.
• How many parameters do we need to specify N?
– The mean μ is defined by d numbers.
– The covariance matrix Σ requires d² numbers σ_{r,c}.
– Strictly speaking, Σ is symmetric (σ_{r,c} = σ_{c,r}).
– So, we need roughly d²/2 parameters.
• The number of parameters is quadratic in d.
• The number of training data we need for reliable estimation is also quadratic in d.
The Curse of Dimensionality
• We will discuss this "curse" in several places in this course.
• Summary: dealing with high-dimensional data is a pain, and presents challenges that may be surprising to someone used to dealing with one, two, or three dimensions.
• A first example is estimating Gaussian parameters.
• In one dimension, it is very simple:
– We estimate two parameters, μ and σ.
– Estimation can be pretty reliable with a few tens of examples.
• In d dimensions, we estimate O(d²) parameters.
• The number of training data needed is quadratic in the number of dimensions.
The Curse of Dimensionality
• For example: suppose we want to train a system to recognize the faces of Michael Jordan and Kobe Bryant.
– Assume each image is 100x100 pixels.
– Each pixel has three numbers: r, g, b.
– Thus, each image has 30,000 numbers.
• Suppose we model each class as a multi-dimensional Gaussian.
• Then, we need to estimate the parameters of a 30,000-dimensional Gaussian.
– We need roughly 450 million numbers for the covariance matrix.
• We would need more than ten billion training images to have a reliable estimate.
– It is not realistic to expect to have such a large training set for learning how to recognize a single person.
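A quick check of the numbers in this example, using the roughly d²/2 parameter count from the previous slide:

```python
d = 100 * 100 * 3     # 30,000 numbers per image
params = d * d // 2   # roughly d^2 / 2 entries of the (symmetric) covariance matrix
print(d, params)      # 30000, 450000000 (about 450 million)
```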
The Curse of Dimensionality
• The curse of dimensionality makes it (usually) impossible to estimate probability densities precisely in high-dimensional spaces.
– The number of training data that is needed is exponential in the number of dimensions.
• The curse of dimensionality also makes histogram-based probability estimation infeasible in high dimensions.
– Estimating a histogram still requires a number of training examples that is exponential in the number of dimensions.
• Estimating a Gaussian requires a number of parameters that is "only" quadratic in the number of dimensions.
• However, Gaussians may not be accurate fits for the actual distribution.
– Mixtures of Gaussians can often provide significantly better fits.