Transcript: Clustering
MACHINE LEARNING
8. Clustering
Motivation
Classification problem:
Need P(C|x).
Bayes' rule: it can be computed from P(x|C).
Need to estimate P(x|C) from the data:
Assume a model (e.g., a normal distribution) up to its parameters.
Compute estimators (ML, MAP) for the parameters from the data.
Regression:
Need p(r|x); with the joint p(x,r) it follows by Bayes' rule.
Assume a model up to its parameters (e.g., linear).
Compute the parameters from the data (e.g., least squares).
Based on E. Alpaydın (2004), Introduction to Machine Learning, © The MIT Press (slide set V1.1).
Motivation
We cannot always assume that the data came from a single distribution/model.
Nonparametric methods: do not assume any model; compute the probability of new data directly from the old data.
Semi-parametric/mixture models: assume the data came from an unknown mixture of known models.
Motivation
Optical character recognition:
There are two ways to write the digit 7 (with or without the horizontal bar).
We cannot assume a single distribution; instead, a mixture of an unknown number of templates.
Compared to classification:
the number of classes is known,
each training sample has a class label,
and learning is supervised.
Mixture Densities
p(x) = Σ_{i=1}^{k} p(x | G_i) P(G_i)
where G_i are the components/groups/clusters,
P(G_i) are the mixture proportions (priors),
and p(x | G_i) are the component densities.
Gaussian mixture: p(x | G_i) ~ N(μ_i, Σ_i)
with parameters Φ = {P(G_i), μ_i, Σ_i}_{i=1}^{k},
learned from an unlabeled sample X = {x^t}_t (unsupervised learning).
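A minimal Python/NumPy sketch of evaluating such a mixture density; the two-component parameters below are made up for the illustration, and diagonal covariances are assumed for brevity:

import numpy as np

def gaussian_pdf(x, mu, var):
    """Density of N(mu, diag(var)) at x (diagonal covariance for simplicity)."""
    d = len(mu)
    return np.exp(-0.5 * ((x - mu) ** 2 / var).sum()) / np.sqrt((2 * np.pi) ** d * np.prod(var))

def mixture_density(x, priors, mus, vars_):
    """p(x) = sum_i P(G_i) * p(x | G_i), summed over the k components."""
    return sum(P * gaussian_pdf(x, mu, var) for P, mu, var in zip(priors, mus, vars_))

# Hypothetical two-component mixture in 2 dimensions.
priors = [0.6, 0.4]                                   # P(G_1), P(G_2)
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]    # component means
vars_ = [np.array([1.0, 1.0]), np.array([0.5, 0.5])]  # diagonal covariances
print(mixture_density(np.array([1.0, 1.0]), priors, mus, vars_))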
Example: Color quantization
Image: each pixel is represented by a 24-bit color.
The colors come from different distributions (e.g., sky, grass).
We have no label telling us, for each pixel, whether it is sky or grass.
We want to use only 256 palette colors to represent the image as closely as possible to the original.
Quantizing uniformly, i.e., assigning a single color to each interval of 2^24/256 values, wastes palette entries on rarely occurring intervals.
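As a toy illustration of the uniform scheme (treating the packed 24-bit value as a single number, so each palette slot covers 2^24/256 = 65536 consecutive values; the pixel values are invented for the example):

# Uniform quantization of packed 24-bit colors into 256 palette slots.
# Each slot covers 2**24 // 256 = 65536 consecutive integer values.
pixels = [0x87CEEB, 0x87CEFA, 0x228B22, 0x006400]   # hypothetical sky/grass colors
palette_index = [p // (2**24 // 256) for p in pixels]
print(palette_index)   # most slots stay unused, and similar colors may still split across slots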
Quantization
Sample (pixels): X = {x^t}_{t=1}^{N}
k reference vectors (the palette): m_1, ..., m_k
Select a reference vector for each pixel:
||x^t - m_i|| = min_j ||x^t - m_j||
The reference vectors are also called codebook vectors or code words.
Compressing the image this way has reconstruction error
E({m_i}_{i=1}^{k} | X) = Σ_t Σ_i b_i^t ||x^t - m_i||^2,
where b_i^t = 1 if ||x^t - m_i|| = min_j ||x^t - m_j||, and b_i^t = 0 otherwise.
Encoding/Decoding
[Figure: encoder/decoder. The encoder maps each pixel x^t to the index i of its nearest code word, with b_i^t = 1 if ||x^t - m_i|| = min_j ||x^t - m_j|| and 0 otherwise; the decoder reconstructs the pixel as m_i.]
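A minimal Python/NumPy sketch of this encoding/decoding step, assuming the codebook is already given; the function names and array contents are invented for the illustration:

import numpy as np

def encode(X, M):
    """Return, for each row x^t of X, the index i of the nearest codebook vector m_i."""
    # Pairwise squared distances ||x^t - m_j||^2, shape (N, k).
    d2 = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)          # b_i^t = 1 exactly for the returned index i

def decode(idx, M):
    """Reconstruct each sample by its codebook vector m_i."""
    return M[idx]

# Hypothetical data: 6 'pixels' in RGB, codebook with k = 2 code words.
X = np.array([[250, 250, 255], [240, 245, 250], [10, 120, 10],
              [20, 110, 15], [245, 248, 252], [15, 115, 12]], dtype=float)
M = np.array([[245, 247, 252], [15, 115, 12]], dtype=float)
idx = encode(X, M)
X_hat = decode(idx, M)
reconstruction_error = ((X - X_hat) ** 2).sum()   # E({m_i} | X) from the previous slide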
k-Means Clustering
Minimize the reconstruction error
E({m_i}_{i=1}^{k} | X) = Σ_t Σ_i b_i^t ||x^t - m_i||^2
Taking the derivative with respect to m_i and setting it to zero gives
m_i = Σ_t b_i^t x^t / Σ_t b_i^t
i.e., each reference vector is the mean of all the instances it represents.
k-Means Clustering
An iterative procedure for finding the reference vectors:
start with random reference vectors,
estimate the labels b_i^t,
re-compute the reference vectors as the means of their assigned instances,
and continue until convergence.
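A minimal NumPy sketch of this procedure (a plain k-means loop; initializing the reference vectors by sampling k data points is one common choice and is assumed here, and the function name is illustrative):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Start with k reference vectors chosen randomly among the instances.
    M = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assignment step: label each instance with its nearest reference vector (b_i^t).
        d2 = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each reference vector becomes the mean of the instances it represents.
        new_M = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else M[i]
                          for i in range(k)])
        if np.allclose(new_M, M):      # converged: reference vectors no longer move
            break
        M = new_M
    # Final assignment with the converged reference vectors.
    labels = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    return M, labels

# Hypothetical usage: quantize random 'pixel' data to k = 4 code words.
X = np.random.default_rng(1).random((500, 3))
M, labels = kmeans(X, k=4)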
k-Means Clustering
[Figure: k-means clustering illustrated on example data.]
Expectation Maximization (EM): Motivation
The data came from several distributions.
Assume each distribution is known up to its parameters.
If we knew, for each data instance, which distribution it came from, we could use parametric estimation.
Introduce unobservable (latent) variables that indicate the source distribution.
Run an iterative process:
estimate the latent variables from the data and the current estimates of the distribution parameters;
use the current values of the latent variables to refine the parameter estimates.
EM
Log likelihood:
L(Φ | X) = Σ_t log p(x^t | Φ) = Σ_t log Σ_{i=1}^{k} p(x^t | G_i) P(G_i)
Assume hidden variables Z which, when known, make the optimization much simpler.
Complete likelihood, L_c(Φ | X, Z), in terms of X and Z.
Incomplete likelihood, L(Φ | X), in terms of X only.
Latent Variables
The latent variables are unknown, so the complete likelihood L_c(Φ | X, Z) cannot be computed directly.
We can, however, compute its expected value.
E-step: Q(Φ | Φ^l) = E[ L_c(Φ | X, Z) | X, Φ^l ]
E- and M-steps
Iterate the two steps:
1. E-step: estimate Z given X and the current Φ.
2. M-step: find the new Φ given Z, X, and the old Φ.
E-step: Q(Φ | Φ^l) = E[ L_c(Φ | X, Z) | X, Φ^l ]
M-step: Φ^{l+1} = arg max_Φ Q(Φ | Φ^l)
Example: Mixture of Gaussians
The data came from a mixture of Gaussians.
Maximize the likelihood assuming we know the latent "indicator variables".
E-step: compute the expected values of the indicator variables.
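For concreteness, the standard updates this yields for a Gaussian mixture, written here in LaTeX (h_i^t denotes the expected value of the indicator variable for component i and sample t, and N is the sample size):

E-step:
\[
h_i^t = \frac{P(G_i)\, p(x^t \mid G_i, \Phi^l)}{\sum_{j=1}^{k} P(G_j)\, p(x^t \mid G_j, \Phi^l)}
\]
M-step:
\[
P(G_i) = \frac{\sum_t h_i^t}{N}, \qquad
m_i = \frac{\sum_t h_i^t\, x^t}{\sum_t h_i^t}, \qquad
S_i = \frac{\sum_t h_i^t\, (x^t - m_i)(x^t - m_i)^{\top}}{\sum_t h_i^t}
\]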
[Figure: EM on a mixture of Gaussians; the label P(G_1|x) = h_1 = 0.5 marks points for which component G_1 has posterior probability one half.]
EM for Gaussian mixtures
Assume all groups/clusters are Gaussians that are multivariate, uncorrelated, and share the same variance.
Harden the indicators:
in EM, the expected values h_i^t lie between 0 and 1;
in k-means, b_i^t is either 0 or 1.
Under these assumptions, EM becomes the same as k-means.
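A minimal NumPy sketch of EM for a Gaussian mixture along these lines; for brevity it assumes spherical (isotropic) per-component covariances, which is a simplification of the general case, and the function name is illustrative:

import numpy as np

def em_gmm(X, k, n_iter=50, seed=0):
    N, d = X.shape
    rng = np.random.default_rng(seed)
    m = X[rng.choice(N, size=k, replace=False)].copy()   # component means
    var = np.full(k, X.var())                             # spherical variances
    prior = np.full(k, 1.0 / k)                           # mixture proportions P(G_i)
    for _ in range(n_iter):
        # E-step: responsibilities h_i^t = P(G_i | x^t) under the current parameters.
        d2 = ((X[:, None, :] - m[None, :, :]) ** 2).sum(axis=2)        # (N, k)
        logp = -0.5 * d2 / var - 0.5 * d * np.log(2 * np.pi * var)     # log N(x; m_i, var_i I)
        logw = np.log(prior) + logp
        logw -= logw.max(axis=1, keepdims=True)                        # for numerical stability
        h = np.exp(logw)
        h /= h.sum(axis=1, keepdims=True)
        # M-step: re-estimate proportions, means, and variances from h.
        Nk = h.sum(axis=0)
        prior = Nk / N
        m = (h.T @ X) / Nk[:, None]
        d2 = ((X[:, None, :] - m[None, :, :]) ** 2).sum(axis=2)
        var = (h * d2).sum(axis=0) / (d * Nk)
    return prior, m, var, h

Taking the argmax of h (hardening the responsibilities to 0 or 1) turns the E-step into the k-means assignment step described above.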
Dimensionality Reduction vs. Clustering
Dimensionality reduction methods find correlations between features and group features:
e.g., Age and Income are correlated.
Clustering methods find similarities between instances and group instances:
e.g., customers A and B belong to the same cluster.
Clustering: Usage for supervised learning
Describe the data in terms of clusters:
represent all the data in a cluster by the cluster mean,
or by the range of its attributes.
Map the data into a new space (preprocessing):
d is the dimensionality of the original space, k the number of clusters;
use the indicator variables as the new data representation.
k may be larger than d.
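A small sketch of this preprocessing idea, assuming the cluster centers have already been fitted (e.g., by the k-means sketch above); the soft version would use h_i^t instead of the hard indicators used here, and the names are invented for the illustration:

import numpy as np

def cluster_features(X, M):
    """Map each d-dimensional instance to a k-dimensional hard indicator vector b^t."""
    d2 = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)   # distances to the k centers
    B = np.zeros((len(X), len(M)))
    B[np.arange(len(X)), d2.argmin(axis=1)] = 1.0             # one-hot: nearest cluster gets 1
    return B                                                   # use B as input to a supervised learner

# Hypothetical usage: 2-dimensional data mapped to k = 5 indicator features (k > d here).
X = np.random.default_rng(0).random((100, 2))
M = np.random.default_rng(1).random((5, 2))                    # pretend these are fitted centers
Z = cluster_features(X, M)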
Mixture of Mixtures
In classification, the input comes from a mixture of classes (supervised).
If each class is also a mixture, e.g., of Gaussians (unsupervised), we have a mixture of mixtures:
p(x | C_i) = Σ_{j=1}^{k_i} p(x | G_{ij}) P(G_{ij})
p(x) = Σ_{i=1}^{K} p(x | C_i) P(C_i)
Hierarchical Clustering
Probabilistic view:
fit a mixture model to the data,
or find code words minimizing the reconstruction error.
Hierarchical clustering:
group similar items together,
with no specific model or distribution assumed;
items within a group are more similar to each other than to instances in different groups.
Hierarchical Clustering
Minkowski (L_p) distance (Euclidean for p = 2):
d_m(x^r, x^s) = [ Σ_{j=1}^{d} |x_j^r - x_j^s|^p ]^{1/p}
City-block distance:
d_cb(x^r, x^s) = Σ_{j=1}^{d} |x_j^r - x_j^s|
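These two distances in a few lines of NumPy (the function names are just for the illustration):

import numpy as np

def minkowski(x_r, x_s, p=2):
    """Minkowski (L_p) distance; p = 2 gives the Euclidean distance."""
    return (np.abs(x_r - x_s) ** p).sum() ** (1.0 / p)

def city_block(x_r, x_s):
    """City-block (L_1) distance: sum of absolute coordinate differences."""
    return np.abs(x_r - x_s).sum()

x_r = np.array([1.0, 2.0, 3.0])
x_s = np.array([2.0, 0.0, 3.0])
print(minkowski(x_r, x_s, p=2), city_block(x_r, x_s))   # 2.236..., 3.0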
Agglomerative Clustering
Start with clusters each containing a single point.
At each step, merge the most similar clusters.
Measures of similarity between two groups:
Minimal distance (single link): distance between the closest points in the two groups.
Maximal distance (complete link): distance between the most distant points in the two groups.
Average distance: distance between the group centers.
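A naive NumPy sketch of agglomerative clustering with a selectable linkage; it merges until a target number of clusters remains and is meant only to illustrate the definitions, not to scale (names are invented for the illustration):

import numpy as np

def agglomerative(X, n_clusters=2, linkage="single"):
    # Start with one cluster per point.
    clusters = [[i] for i in range(len(X))]
    # Pairwise Euclidean distances between all points.
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))

    def group_distance(a, b):
        d = D[np.ix_(a, b)]                 # distances between points of the two groups
        if linkage == "single":             # closest pair of points
            return d.min()
        if linkage == "complete":           # most distant pair of points
            return d.max()
        return d.mean()                     # rough stand-in for the average/center option

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest group distance and merge them.
        i, j = min(((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
                   key=lambda ab: group_distance(clusters[ab[0]], clusters[ab[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical usage on a few 2-D points.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [9.0, 0.0]])
print(agglomerative(X, n_clusters=3, linkage="single"))

Single link tends to produce elongated, chained clusters, while complete link favors compact ones; only group_distance changes between the options.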
Example: Single-Link Clustering
[Figure: single-link clustering of example data and the corresponding dendrogram.]
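If SciPy is available, the same single-link clustering and its dendrogram can be obtained directly (a sketch; matplotlib is needed only for the plot):

import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [9.0, 0.0]])
Z = linkage(X, method="single")   # single-link agglomerative clustering
dendrogram(Z)                     # visualize the merge order as a dendrogram
plt.show()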
Choosing k
Defined by the application, e.g., image quantization with a fixed palette size.
Incremental (leader-cluster) algorithm: add clusters one at a time until an "elbow" appears in the reconstruction error, the log likelihood, or the intergroup distances.
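In practice the elbow is often found by sweeping k and plotting the reconstruction error; a sketch using scikit-learn's KMeans, whose inertia_ attribute is this total within-cluster squared distance, on synthetic data invented for the example:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data drawn around three centers, so the elbow should appear near k = 3.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2)) for c in ([0, 0], [3, 3], [0, 4])])

errors = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    errors.append(km.inertia_)          # reconstruction error E for this k

for k, e in zip(range(1, 9), errors):
    print(k, round(e, 1))               # the error drops sharply up to k = 3, then flattens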