Transcript pptx
Unsupervised Learning
Gaussian Mixture Models
Expectation-Maximization (EM)
Gaussian Mixture Models
Like K-Means, GMM clusters have
centers.
In addition, they have probability
distributions that indicate the
probability that a point belongs to the
cluster.
X2
These ellipses show βlevel setsβ: lines
with equal probability of belonging to
the cluster.
Notice that green points still have
SOME probability of belonging to the
blue cluster, but itβs much lower than
the blue points.
X1
This is a more complex model than KMeans: distance from the center can
matter more in one direction than
another.
GMMs and EM
Gaussian Mixture Models (GMMs) are a model,
similar to a Naïve Bayes model but with important
differences.
Expectation-Maximization (EM) is a parameterestimation algorithm for training GMMs using
unlabeled data.
To explain these further, we first need to review
Gaussian (normal) distributions.
The Normal (aka Gaussian)
Distribution
f π₯ =
1
β(π₯βπ)2
exp(
)
2
2ππ
2π
π: mean, Ο2: variance
Ο
π
Quiz: MLE for Gaussians
Based on your statistics knowledge,
1) What is the MLE for ΞΌ from a bunch of
example X points?
2) What is the MLE for Ο from a bunch of
example X points?
Answer: MLE for Gaussians
Based on your statistics knowledge,
1) What is the MLE for ΞΌ from a bunch of example X points?
1
π=
π
ππ
π
(average of the X values)
2) What is the MLE for Ο from a bunch of example X points?
π2
1
=
π
(ππ β π)2
π
(average deviation from the mean)
Note: this is a so-called βbiasedβ estimator for π 2 ; there is also an
βunbiasedβ estimator which basically just uses (M-1) instead of M.
Weβll stick to the βbiasedβ one here, but either one is fine.
Quiz: Deriving the ML estimators
How would you derive the MLE equations for
Gaussian distributions?
Answer: Deriving the ML estimators
How would you derive the MLE equations for Gaussian
distributions?
Same plan of attack as for MLE estimates of Bayes Nets:
1. Write down the Likelihood function P(D | M)
2. Make the assumption that each data point Xi is
independently distributed, so P(D|M) = βπ ππ π
3. Take the log
4. Take the partial derivative with respect to ΞΌ, set this equal
to zero, and solve for ΞΌ.
5. Take the partial derivative with respect to Ο, set this equal
to zero, and solve for Ο.
Quiz: Estimating a Gaussian
1
π=
π
π2
1
=
π
0
On the left is a
dataset with the
following X values:
0, 3, 4, 5, 6, 7, 10
(ππ β π)2
Find the maximum
likelihood
Gaussian
distribution.
ππ
π
π
Answer: Estimating a Gaussian
0
π=
1
π
ππ =
π
1
0 + 3 + 4 + 5 + 6 + 7 + 10 = 5
7
On the left is a
dataset with the
following X values:
0, 3, 4, 5, 6, 7, 10
1
=
(ππ β π)2
π
π
1
=
0 β 5 2 + 3 β 5 2 + 4 β 5 2 + 5 β 5 2 + 6 β 5 2 + 7 β 5 2 + 10 β 5
7
1 2
1
60
2
2
2
2
2
2
= 5 + 2 + 1 + 0 + 1 + 2 + 5 = 25 + 4 + 1 + 1 + 4 + 25 =
7
7
7
π2
f π₯ =
1
60
2π 7
exp(
β(π₯β5)2
120
7
)
2
Clustering by fitting K Gaussians
0
Suppose our dataset looks like the one above.
It doesnβt really look Gaussian anymore; it looks like it has 3
clusters.
Fitting a single Gaussian to this data will still give you an
estimate.
But that Gaussian will have a low Likelihood value: it will give
very low probability to the leftmost and rightmost clusters.
Clustering by fitting K Gaussians
0
What weβd like to do instead is to fit K Gaussians.
A model for data that involves multiple Gaussian
distributions is called a Gaussian Mixture Model
(GMM).
Clustering by fitting K Gaussians
0
ΞΌred
ΞΌblue
ΞΌgreen
Another way of drawing these is with βLevel setsβ:
Curves that show points with equal probability for each
Guassian.
Wider curves having lower probability than narrower curves.
Notice that each point is contained within every Gaussian, but
is most tightly bound to the closest Gaussian.
Expectation-Maximization (EM)
EM is βK-Means for GMMsβ.
It is a parameter estimation algorithm for GMMs that will determine a
(locally-optimal) setting for all of the GMM parameters, using a bunch
of unlabeled X points.
Input:
1. Data points X1, β¦, XM
2. A number K
Output: π1 , π12 , β¦, ππΎ , ππΎ2 such that the GMM with those means and
standard deviations has a locally-maximum likelihood for the training
data set.
Visualization of EM
1. Initialize the mean and standard deviation of
each Gaussian randomly.
2. Repeat until convergence:
β Expectation: For each point X and each Gaussian
k, find P(X | Gaussian k)
Visualization of EM
1. Initialize the mean and standard deviation of
each Gaussian randomly.
2. Repeat until convergence:
β Expectation: For each point X and each Gaussian k,
find f(X | Gaussian k)
β Maximization: Estimate new ππ and ππ parameters
for each Gaussian. (Technically, you also need to
estimate a third parameter, called Οk. More later.)
Visualization of EM
1. Initialize the mean and standard deviation of
each Gaussian randomly.
2. Repeat until convergence:
β Expectation: For each point X and each Gaussian k,
find f(X | Gaussian k)
β Maximization: Estimate new ππ and ππ parameters
for each Gaussian. (Technically, you also need to
estimate a third parameter, called Οk. More later.)
Gaussian Mixture Model
K Gaussian distributions with parameters π1 , π12 through ππΎ , ππΎ2 .
It also involves K additional parameters, called prior probabilities, π1
through ππΎ . These describe the relative importance of each of the K
Gaussian distributions in the full model.
The likelihood equation for this model looks like this:
π π1 , β¦ , ππ πΊππ) =
π ππ πΊππ
π
πΎ
π ππ πΊππ =
ππ
π=1
Prior
β(π₯ β ππ )2
exp(
)
2
2π
2πππ
π
1
Gaussian
(i.i.d. assumption)
GMMs as Bayes Nets
Cluster
(1, 2, β¦, K)
GMMs are simple Bayes Nets.
Two differences from previous BNs weβve seen:
1.
Weβre used to binary variables in BNs.
Here, the βClusterβ variable has K possible
values (1, 2, β¦, K) instead of just two
(+cluster and βcluster). We used to store
P(+a) and P(-a) for the parent variable; now
we store π1 through ππΎ .
2.
The βXβ variable has infinitely many values
(any real number) instead of just (+x and β
x). We used to store P(+x | +a) and P(+x | a). Now we store π1 , π12 through ππΎ , ππΎ2 ,
and we say f(X |Cluster
is j) =
2
β(π₯βπ
)
1
π
exp(
)
2
X
(a real
number)
2πππ
2ππ
Formal Description of the Algorithm
1.
2.
Init: For each k in {1, β¦, K}, create a random Οk, ΞΌk, Ο2k
Repeat until all Οk, ΞΌk, Ο2k remain the same from one iteration to
the next:
Expectation (aka Assignment in K-Means):
For each Xi, for each k:
β(π₯βππ )2
1
exp(
)
2πππ
2π2
π
β(π₯βππβ² )2
1
π
exp(
)
πβ² πβ² 2ππ
2π2
πβ²
πβ²
ππ
let C[Xi,k] ο
Maximization (aka Update in K-Means):
For each k,
ππ =
ππ =
ππ =
1
π
π
π=1 πΆ[ππ , π]
π
π=1 πΆ[ππ ,π]βππ
π πΆ[π ,π]
π
π=1
π
2
π=1 πΆ ππ ,π β(ππ βππ )
π πΆ[π ,π]
π
π=1
3. Return (for all values of k) Οk, ΞΌk, Ο2k
Evaulation metric for GMMs and EM
LOSS Function (or Objective function) for EM:
EM (locally) maximizes βMarginalβ Likelihood:
EM(X1, β¦, XM)
= argmax[π1 , π12 , π1 , β¦, ππΎ , ππΎ2 , ππΎ ]
f(X1,β¦XM | π1 , π12 , π1 , β¦, ππΎ , ππΎ2 , ππΎ )
Notice that this is the Likelihood function for just the X
variable in our Bayes Net, rather than the Likelihood for
(X and Cluster), which is why it is called βmarginal
likelihoodβ rather than just βlikelihoodβ.
Analysis of EM Performance
EM is guaranteed to find a local optimum of the
Likelihood function.
Theorem: After one iteration of EM, the Likelihood
of the new GMM >= the Likelihood of the previous
GMM.
(Dempster, A.P.; Laird, N.M.; Rubin, D.B. 1977.
"Maximum Likelihood from Incomplete Data via the
EM Algorithm". Journal of the Royal Statistical
Society. Series B (Methodological) 39 (1): 1β
38.JSTOR 2984875.)
EM Generality
Even though EM was originally invented for
GMMs, the same basic algorithm can be used
for learning with arbitrary Bayes Nets when
some of the training data has missing values.
This has made EM one of the most popular
unsupervised learning techniques in machine
learning.
EM Quiz
b
a
g1
Which Gaussian(s) have a nonzero value for f(a)?
How about f(c)?
c
g2
g3
Answer: EM Quiz
b
a
g1
c
g2
Which Gaussian(s) have a nonzero f(a)?
All Gaussians (g1, g2, and g3) have a nonzero value for f(a).
How about f(c)?
Ditto. All Gaussians have a nonzero value for f(c).
g3
Quiz: EM vs. K-Means
a
Option 1
c
Option 2
g1
g2
Option 3
Option 4
At the end of K-Means, where will cluster center g1 end up β Option 1 or Option 2?
At the end of EM, where will cluster center g1 end up β Option 1 or Option 2?
Answer: EM vs. K-Means
a
Option 1
c
Option 2
g1
g2
Option 3
Option 4
At the end of K-Means, where will cluster center g1 end up β Option 1 or Option 2?
Option 1: K-Means puts the βmeanβ at the center of all points in the cluster, and point a
will be the only point in g1βs cluster.
At the end of EM, where will cluster center g1 end up β Option 1 or Option 2?
Option 2: EM puts the βmeanβ at the center of all points in the dataset, where each
point is weighted by how likely it is according to the Gaussian. Point a and Point b will
both have some likelihood, but Point aβs likelihood will be much higher. So the βmeanβ
for g1 will be very close to Point a, but not all the way at Point a.
How many clusters?
Weβve been assuming a fixed K.
Hereβs a technique to determine this automatically, from data.
New objective function:
Minimize: β
π
π=1 log π
ππ πΊπππΎ + πΆππ π‘ β πΎ
Algorithm:
1. Initialize K somehow.
Repeat until convergence:
2. Run EM.
3. Remove unnecessary clusters (low Ο value)
4. Create new random clusters (more or fewer than before,
depending on a heuristic estimate of whether there were too many or too few
before).
This is slow. But one nice property is that it can overcome some difficulties with local
maxima.
Quiz
Is EM for GMMs
Classification or Regression?
Generative or Discriminative?
Parametric or Nonparametric?
Answer
Is EM for GMMs
Classification or Regression?
Two possible answers:
- classification: output is a discrete value (cluster label) for each point
- Regression: output is a real value (probability) for each possible cluster label for
each point
Generative or Discriminative?
- normally, itβs used with a fixed set of input and output variables. However, GMMs
are Bayes Nets that store a full joint distribution. Once itβs trained, a GMM can
actually make predictions for any subset of the variables given any other subset.
Technically, this is generative.
Parametric or Nonparametric?
- parametric: the number of parameters is 3K, which does not change with the
number of training data points.
Quiz
Is EM for GMMs
Supervised or Unsupervised?
Online or batch?
Closed-form or iterative?
Answer
Is EM for GMMs
Supervised or Unsupervised?
- Unsupervised
Online or batch?
- batch: if you add a new data point, you need to revisit
all the training data to recompute the locally-optimal
model
Closed-form or iterative?
-iterative: training requires many passes through the data