Transcript: Clustering

MACHINE LEARNING
8. Clustering
Based on E. Alpaydın 2004, Introduction to Machine Learning © The MIT Press (V1.1)
Motivation

Classification problem:
- Need P(C|x)
- Bayes: it can be computed from P(x|C)
- Need to estimate P(x|C) from the data
- Assume a model (e.g., normal distribution) known up to its parameters
- Compute estimators (ML, MAP) for the parameters from the data

Regression:
- Need to estimate the joint p(x, r)
- Bayes: it can be computed from p(r|x)
- Assume a model known up to its parameters (e.g., linear)
- Compute the parameters from the data (e.g., least squares)
Motivation

- We cannot always assume that the data came from a single distribution/model.
- Nonparametric methods: assume no model; compute the probability of new data directly from the old data.
- Semi-parametric/mixture models: assume the data came from an unknown mixture of known models.
Motivation

Optical Character Recognition
- Two ways to write 7 (with or without the horizontal bar)
- Cannot assume a single distribution
- Mixture of an unknown number of templates

Compared to classification:
- The number of classes is known
- Each training sample has a class label
- Supervised learning
Mixture Densities

  p(x) = \sum_{i=1}^{k} p(x \mid G_i) \, P(G_i)

where G_i are the components/groups/clusters,
P(G_i) are the mixture proportions (priors), and
p(x | G_i) are the component densities.

Gaussian mixture: p(x | G_i) ~ N(\mu_i, \Sigma_i)
Parameters: \Phi = \{ P(G_i), \mu_i, \Sigma_i \}_{i=1}^{k}
Unlabeled sample X = \{ x^t \}_t (unsupervised learning)
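As a concrete illustration of the mixture-density formula above, here is a minimal sketch (not from the lecture) that evaluates p(x) for a two-component Gaussian mixture; the component parameters are made-up values.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-component Gaussian mixture in 2-D (made-up parameters).
priors = [0.6, 0.4]                                    # P(G_i), sum to 1
means  = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]  # mu_i
covs   = [np.eye(2), 2.0 * np.eye(2)]                  # Sigma_i

def mixture_density(x):
    """p(x) = sum_i p(x | G_i) P(G_i)."""
    return sum(P * multivariate_normal.pdf(x, mean=mu, cov=S)
               for P, mu, S in zip(priors, means, covs))

print(mixture_density(np.array([1.0, 1.0])))
```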
Example: Color quantization

- Image: each pixel is represented by a 24-bit color.
- Colors come from different distributions (e.g., sky, grass).
- We do not have a label for each pixel saying whether it is sky or grass.
- We want to use only 256 colors in the palette and represent the image as closely as possible to the original.
- Quantize uniformly: assign a single color to each of the 2^24/256 intervals (see the sketch below).
- This wastes palette values on rarely occurring intervals.
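A small sketch of the uniform-quantization baseline as the slide describes it, assuming the pixels are given as packed 24-bit RGB integers (the function and variable names are illustrative, not from the lecture): every value is mapped to one of 256 equal-width intervals and replaced by that interval's representative color, regardless of how often colors in the interval occur.

```python
import numpy as np

def uniform_quantize(pixels_24bit: np.ndarray) -> np.ndarray:
    """Map packed 24-bit color values onto 256 equal-width intervals.

    Each pixel is replaced by the midpoint of its interval, whether or not
    any colors in that interval actually occur in the image - which is
    exactly why uniform quantization wastes palette entries.
    """
    interval = 2**24 // 256                   # width of each interval
    index = pixels_24bit // interval          # palette index in 0..255
    return index * interval + interval // 2   # representative color

# Example with a handful of made-up pixel values.
pixels = np.array([0x000010, 0x00FF00, 0xFF0000, 0xFFFFFF])
print(uniform_quantize(pixels))
```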
Quantization

- Sample (pixels): X = \{ x^t \}_t
- k reference vectors (palette): m_1, ..., m_k
- Select the reference vector for each pixel:

  \| x^t - m_i \| = \min_j \| x^t - m_j \|

- Reference vectors are also called codebook vectors or code words.
- Compress the image by storing, for each pixel, only the index of its reference vector.
- Reconstruction error:

  E(\{ m_i \}_{i=1}^{k} \mid X) = \sum_t \sum_i b_i^t \, \| x^t - m_i \|^2

  where b_i^t = 1 if \| x^t - m_i \| = \min_j \| x^t - m_j \|, and 0 otherwise.
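A minimal sketch of the encoding step and the reconstruction error defined above; X is the N×d sample matrix and M the k×d codebook (these names are mine, not the lecture's).

```python
import numpy as np

def encode(X: np.ndarray, M: np.ndarray):
    """Assign each sample to its nearest reference vector.

    X: (N, d) samples, M: (k, d) codebook vectors.
    Returns the indicator matrix b of shape (N, k) and the reconstruction
    error E = sum_t sum_i b_i^t ||x^t - m_i||^2.
    """
    # Squared distances between every sample and every code word: (N, k)
    d2 = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)              # index of the closest m_j
    b = np.zeros_like(d2)
    b[np.arange(len(X)), nearest] = 1.0      # b_i^t in {0, 1}
    error = (b * d2).sum()
    return b, error

def decode(b: np.ndarray, M: np.ndarray) -> np.ndarray:
    """Reconstruct each sample as its assigned code word."""
    return b @ M
```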
Encoding/Decoding

  b_i^t = 1 if \| x^t - m_i \| = \min_j \| x^t - m_j \|, and 0 otherwise.

(Figure: encoder/decoder diagram not reproduced in the transcript.)
K-means clustering

Minimize the reconstruction error:

  E(\{ m_i \}_{i=1}^{k} \mid X) = \sum_t \sum_i b_i^t \, \| x^t - m_i \|^2

Take the derivatives with respect to m_i and set them to zero: each reference vector is the mean of all the instances it represents (see the derivation below).
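The derivative step spelled out (standard k-means algebra, not shown explicitly in the deck):

```latex
\frac{\partial E}{\partial m_i} = -2 \sum_t b_i^t \, (x^t - m_i) = 0
\quad \Longrightarrow \quad
m_i = \frac{\sum_t b_i^t \, x^t}{\sum_t b_i^t}
```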
K-Means clustering

Iterative procedure for finding the reference vectors:
1. Start with random reference vectors.
2. Estimate the labels b.
3. Re-compute the reference vectors as the cluster means.
4. Continue until convergence.
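A compact sketch of the iterative procedure just listed (random initialization, assignment, mean update, repeat until the assignments stop changing); a bare-bones illustration, not production code.

```python
import numpy as np

def k_means(X: np.ndarray, k: int, max_iter: int = 100, seed: int = 0):
    """Plain k-means: returns (reference vectors M, hard labels)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Start with k randomly chosen samples as the initial reference vectors.
    M = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    for _ in range(max_iter):
        # Assignment step: label each sample with its nearest reference vector.
        d2 = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                      # assignments did not change: converged
        labels = new_labels
        # Update step: each reference vector becomes the mean of its cluster.
        for i in range(k):
            if np.any(labels == i):
                M[i] = X[labels == i].mean(axis=0)
    return M, labels
```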
k-means Clustering

(Figure not reproduced in the transcript.)
Expectation Maximization (EM): Motivation

- The data came from several distributions.
- Assume each distribution is known up to its parameters.
- If we knew, for each data instance, which distribution it came from, we could use parametric estimation.
- Introduce unobservable (latent) variables that indicate the source distribution.
- Run an iterative process:
  - Estimate the latent variables from the data and the current estimates of the distribution parameters.
  - Use the current values of the latent variables to refine the parameter estimates.
EM

Log-likelihood:

  L(\Phi \mid X) = \sum_t \log p(x^t \mid \Phi)
                 = \sum_t \log \sum_{i=1}^{k} p(x^t \mid G_i) \, P(G_i)

- Assume hidden variables Z which, when known, make the optimization much simpler.
- Complete likelihood, L_C(\Phi \mid X, Z), in terms of X and Z.
- Incomplete likelihood, L(\Phi \mid X), in terms of X only.
Latent Variables

- The latent variables Z are unknown.
- We cannot compute the complete likelihood L_C(\Phi \mid X, Z) itself.
- We can compute its expected value:

  E-step:  Q(\Phi \mid \Phi^l) = E[ L_C(\Phi \mid X, Z) \mid X, \Phi^l ]
E- and M-steps

Iterate the two steps:
1. E-step: estimate Z given X and the current \Phi.
2. M-step: find the new \Phi' given Z, X, and the old \Phi.

  E-step:  Q(\Phi \mid \Phi^l) = E[ L_C(\Phi \mid X, Z) \mid X, \Phi^l ]
  M-step:  \Phi^{l+1} = \arg\max_{\Phi} Q(\Phi \mid \Phi^l)
Example: Mixture of Gaussians

- The data came from a mixture of Gaussians.
- Maximize the likelihood assuming we know the latent "indicator variables".
- E-step: compute the expected values of the indicator variables (see the formulas below).
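Filling in the update equations this slide refers to (the standard EM updates for a Gaussian mixture, shown only as a figure in the original deck; N denotes the number of samples):

```latex
\text{E-step: } h_i^t
  = \frac{p(x^t \mid G_i)\, P(G_i)}{\sum_{j=1}^{k} p(x^t \mid G_j)\, P(G_j)}
\qquad
\text{M-step: }
P(G_i) = \frac{\sum_t h_i^t}{N}, \quad
\mu_i = \frac{\sum_t h_i^t \, x^t}{\sum_t h_i^t}, \quad
\Sigma_i = \frac{\sum_t h_i^t \, (x^t - \mu_i)(x^t - \mu_i)^\top}{\sum_t h_i^t}
```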
(Figure: P(G_1 | x) = h_1 = 0.5; not reproduced in the transcript.)
EM for Gaussian mixtures

- Assume all groups/clusters are Gaussians: multivariate, uncorrelated, with the same variance.
- Harden the indicators:
  - EM: the expected indicator values are between 0 and 1.
  - k-means: the indicators are 0 or 1.
- Under these assumptions, EM becomes the same as k-means.
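A minimal EM sketch for a general Gaussian mixture, matching the updates given earlier and keeping the soft indicators h (no hardening); illustrative only, with only a small ridge term guarding against singular covariances.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X: np.ndarray, k: int, n_iter: int = 50, seed: int = 0):
    """EM for a Gaussian mixture: returns (priors, means, covariances, h)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    priors = np.full(k, 1.0 / k)
    means = X[rng.choice(N, size=k, replace=False)].astype(float)
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    for _ in range(n_iter):
        # E-step: h[t, i] = P(G_i | x^t), the expected indicator variables.
        h = np.column_stack([
            priors[i] * multivariate_normal.pdf(X, mean=means[i], cov=covs[i])
            for i in range(k)
        ])
        h /= h.sum(axis=1, keepdims=True)
        # M-step: re-estimate priors, means and covariances from h.
        Nk = h.sum(axis=0)
        priors = Nk / N
        means = (h.T @ X) / Nk[:, None]
        for i in range(k):
            diff = X - means[i]
            covs[i] = (h[:, i, None] * diff).T @ diff / Nk[i] + 1e-6 * np.eye(d)
    # h holds the soft indicators from the last E-step.
    return priors, means, covs, h
```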
Dimensionality Reduction vs. Clustering

- Dimensionality reduction methods find correlations between features and group features.
  - Example: Age and Income are correlated.
- Clustering methods find similarities between instances and group instances.
  - Example: customers A and B are in the same cluster.
Clustering: Usage for supervised learning

- Describe the data in terms of clusters:
  - Represent all the data in a cluster by the cluster mean.
  - Range of attributes in the cluster.
- Map the data into a new space (preprocessing), as in the sketch below:
  - d: dimension of the original space
  - k: number of clusters
  - Use the indicator variables as the data representation.
  - k might be larger than d.
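A small sketch of the preprocessing idea above: map each d-dimensional sample to a k-dimensional vector of cluster indicators (hard 0/1 here; the soft indicators h_i^t from EM would work the same way). The function name and the assumption that M comes from the earlier k-means sketch are mine.

```python
import numpy as np

def cluster_features(X: np.ndarray, M: np.ndarray) -> np.ndarray:
    """Represent each sample by k indicator features, one per cluster.

    X: (N, d) original data, M: (k, d) cluster centers (e.g. from k-means).
    Returns an (N, k) 0/1 matrix; note that k may well be larger than d.
    """
    d2 = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
    b = np.zeros_like(d2)
    b[np.arange(len(X)), d2.argmin(axis=1)] = 1.0
    return b
```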
Mixture of Mixtures

In classification, the input comes from a mixture of classes (supervised). If each class is in turn a mixture, e.g., of Gaussians (unsupervised), we have a mixture of mixtures:

  p(x \mid C_i) = \sum_{j=1}^{k_i} p(x \mid G_{ij}) \, P(G_{ij})

  p(x) = \sum_{i=1}^{K} p(x \mid C_i) \, P(C_i)
Hierarchical Clustering

Probabilistic view:
- Fit a mixture model to the data.
- Find code words minimizing the reconstruction error.

Hierarchical clustering:
- Group similar items together.
- No specific model/distribution is assumed.
- Items within a group are more similar to each other than to instances in different groups.
Hierarchical Clustering

Minkowski (L_p) distance (Euclidean for p = 2):

  d_m(x^r, x^s) = \left( \sum_{j=1}^{d} | x_j^r - x_j^s |^p \right)^{1/p}

City-block distance:

  d_cb(x^r, x^s) = \sum_{j=1}^{d} | x_j^r - x_j^s |
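A direct transcription of the two distance definitions above into code (p is the Minkowski order; p = 2 gives Euclidean, and city-block is the p = 1 case).

```python
import numpy as np

def minkowski(xr: np.ndarray, xs: np.ndarray, p: float = 2.0) -> float:
    """d_m(x^r, x^s) = (sum_j |x_j^r - x_j^s|^p)^(1/p); p = 2 is Euclidean."""
    return float((np.abs(xr - xs) ** p).sum() ** (1.0 / p))

def city_block(xr: np.ndarray, xs: np.ndarray) -> float:
    """d_cb(x^r, x^s) = sum_j |x_j^r - x_j^s| (Minkowski with p = 1)."""
    return float(np.abs(xr - xs).sum())
```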
Agglomerative Clustering

- Start with clusters each containing a single point.
- At each step, merge similar clusters.
- Measures of similarity:
  - Minimal distance (single link): distance between the closest points of the two groups.
  - Maximal distance (complete link): distance between the most distant points of the two groups.
  - Average distance: distance between the group centers.
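A short sketch of agglomerative clustering using SciPy's hierarchical-clustering routines; the linkage methods correspond to the similarity measures listed above, and the random two-blob data is only for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

# method: 'single' = minimal distance (closest points),
#         'complete' = maximal distance (most distant points),
#         'average' = mean pairwise distance,
#         'centroid' = distance between group centers.
Z = linkage(X, method="single")

# Cut the hierarchy into 2 flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)

# scipy.cluster.hierarchy.dendrogram(Z) would plot the merge tree
# (requires matplotlib).
```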
Example: Single-Link Clustering

(Figure: dendrogram; not reproduced in the transcript.)
Choosing k

- Defined by the application, e.g., image quantization (256 palette colors).
- Incremental (leader-cluster) algorithm: add clusters one at a time until an "elbow" appears in the reconstruction error, log likelihood, or intergroup distances (see the sketch below).
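A sketch of the "elbow" heuristic: run k-means for increasing k and watch where the reconstruction error stops dropping sharply. The helper names and the use of the k_means function from the earlier sketch are assumptions of mine, not part of the lecture.

```python
import numpy as np

def reconstruction_error(X, M, labels):
    """E = sum_t ||x^t - m_{label(t)}||^2 for a hard assignment."""
    return float(((X - M[labels]) ** 2).sum())

def elbow_curve(X, k_values, k_means_fn):
    """Return [(k, E_k)]; look for the k where the error curve bends."""
    curve = []
    for k in k_values:
        M, labels = k_means_fn(X, k)
        curve.append((k, reconstruction_error(X, M, labels)))
    return curve

# Usage (assuming the k_means sketch defined earlier in this transcript):
# print(elbow_curve(X, range(1, 11), k_means))
```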