Lecture 7 PPT - Kiri L. Wagstaff
CS 461: Machine Learning
Lecture 7
Dr. Kiri Wagstaff
[email protected]
2/21/09
Plan for Today
Unsupervised Learning
K-means Clustering
EM Clustering
Homework 4
Review from Lecture 6
Parametric methods
Data comes from a distribution
Bernoulli, Gaussian, and their parameters
How good is a parameter estimate? (bias, variance)
Bayes estimation
ML: use the data (assume equal priors)
MAP: use the prior and the data
Parametric classification
Maximize the posterior probability
Clustering
Chapter 7
Unsupervised Learning
The data has no labels!
What can we still learn?
Salient groups in the data
Density in feature space
Key approach: clustering
… but also:
Association rules
Density estimation
Principal components analysis (PCA)
Clustering
Group items by similarity
Density estimation, cluster models
Applications of Clustering
Image Segmentation
[Ma and Manjunath, 2004]
Data Mining: Targeted marketing
Remote Sensing: Land cover types
Text Analysis
[Selim Aksoy]
Applications of Clustering
Text Analysis: Noun Phrase Coreference
Input text:
"John Simon, Chief Financial Officer of Prime Corp. since 1986, saw his pay jump 20%, to $1.3 million, as the 37-year-old also became the financial-services company's president."
Cluster JS: John Simon, Chief Financial Officer, his, the 37-year-old
Cluster PC: Prime Corp., the financial-services company
Singletons: 1986, pay, president, 20%, $1.3 million
Sometimes easy
Sometimes impossible
and sometimes
in between
[Andrew Moore]
K-means
1. Ask user how many clusters they'd like. (e.g. k=5)
2. Randomly guess k cluster Center locations
3. Each datapoint finds out which Center it's closest to. (Thus each Center "owns" a set of datapoints)
4. Each Center finds the centroid of the points it owns…
5. …and jumps there
6. …Repeat until terminated!
[Andrew Moore]
K-means
Start: k=5
Example generated by Dan Pelleg's super-duper fast K-means system:
Dan Pelleg and Andrew Moore. Accelerating Exact k-means Algorithms with Geometric Reasoning. Proc. Conference on Knowledge Discovery in Databases 1999 (KDD99) (available on www.autonlab.org/pap.html)
[Andrew Moore]
K-means
continues… (further iterations shown as figures) …and terminates.
[Andrew Moore]
K-means Algorithm
1. Randomly select k cluster centers
2. While (points change membership)
   1. Assign each point to its closest cluster (use your favorite distance metric)
   2. Update each center to be the mean of its items
Objective function (variance): V = \sum_{c=1}^{k} \sum_{x_j \in C_c} \mathrm{dist}(x_j, m_c)^2
K-means applet
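The pseudocode above translates almost line for line into code. Below is a minimal NumPy sketch of this algorithm (not the applet linked on the slide); the random-point initialization and the stop-when-memberships-freeze test are illustrative choices, and Euclidean distance stands in for "your favorite distance metric".

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-means: assign each point to its nearest center, then re-center."""
    rng = np.random.default_rng(seed)
    # 1. Randomly select k data points as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # 2.1 Assign each point to its closest center (Euclidean distance here).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no point changed membership, so we have converged
        labels = new_labels
        # 2.2 Update each center to the mean of the points it owns.
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    # Objective: V = sum of squared distances of points to their assigned centers.
    V = float(((X - centers[labels]) ** 2).sum())
    return centers, labels, V
```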
K-means Algorithm: Example
1. Randomly select k cluster centers
2. While (points change membership)
   1. Assign each point to its closest cluster (use your favorite distance metric)
   2. Update each center to be the mean of its items
Objective function (variance): V = \sum_{c=1}^{k} \sum_{x_j \in C_c} \mathrm{dist}(x_j, m_c)^2
Data: [1, 15, 4, 2, 17, 10, 6, 18]
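As a worked pass over this data (the choice of k = 2 and initial centers 1 and 15 is illustrative, not from the slide): the first assignment step gives {1, 2, 4, 6} to the first center and {10, 15, 17, 18} to the second; the update step then moves the centers to the cluster means, 3.25 and 15. A second assignment pass changes no memberships, so the algorithm terminates with variance V = (2.25² + 1.25² + 0.75² + 2.75²) + (5² + 0² + 2² + 3²) = 14.75 + 38 = 52.75. On larger or messier datasets, a less fortunate initialization can converge to a different, higher-variance partition, which is the local-optimum issue discussed below.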
K-means for Compression
Original image: 159 KB
Clustered, k=4: 53 KB
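The compression on this slide is essentially color quantization: cluster the pixel colors with K-means and repaint each pixel with the mean color of its cluster. A rough sketch of that idea (not the tool used to produce the slide's figures; it assumes Pillow and SciPy are installed, and the file names are placeholders):

```python
import numpy as np
from PIL import Image
from scipy.cluster.vq import kmeans2

def quantize(path_in, path_out, k=4):
    """Cluster pixel colors into k groups and repaint each pixel with its cluster mean."""
    img = np.asarray(Image.open(path_in).convert("RGB"), dtype=float)
    pixels = img.reshape(-1, 3)                       # one row per pixel (R, G, B)
    centers, labels = kmeans2(pixels, k, minit="++")  # k representative colors
    quantized = centers[labels].reshape(img.shape)    # every pixel -> its cluster's color
    Image.fromarray(quantized.astype(np.uint8)).save(path_out)

# quantize("original.png", "clustered_k4.png", k=4)
```

With only k distinct colors the file compresses much better, at the cost of visible banding when k is small.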
Issue 1: Local Optima
K-means is greedy!
Converging to a non-global optimum:
[Example from Andrew Moore]
Issue 2: How long will it take?
We don’t know!
K-means is O(nkdI)
d = # features (dimensionality)
I = # iterations
# iterations depends on random initialization
“Good” init: few iterations
“Bad” init: lots of iterations
How can we tell the difference, before clustering?
We can’t
Use heuristics to guess “good” init
Issue 3: How many clusters?
The “Holy Grail” of clustering
Issue 3: How many clusters?
Select k that gives partition with least variance?
[Dhande and Fiore, 2002]
Best k depends on the user’s goal
Issue 4: How good is the result?
Rand Index
A = # pairs in same cluster in both partitions
B = # pairs in different clusters in both partitions
Rand = (A + B) / Total number of pairs
[Figure: two example partitions of ten items, used to count agreeing pairs]
Rand = (5 + 26) / 45
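A small sketch of the Rand Index computation (the two labelings below are hypothetical stand-ins for the figure's partitions, not the actual ones):

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Rand = (A + B) / total pairs, where A counts pairs placed together in both
    partitions and B counts pairs placed in different clusters in both partitions."""
    pairs = list(combinations(range(len(labels_a)), 2))
    A = sum(1 for i, j in pairs
            if labels_a[i] == labels_a[j] and labels_b[i] == labels_b[j])
    B = sum(1 for i, j in pairs
            if labels_a[i] != labels_a[j] and labels_b[i] != labels_b[j])
    return (A + B) / len(pairs)

# Example: ten items, two different partitions (hypothetical labelings).
p1 = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
p2 = [0, 0, 0, 1, 1, 1, 1, 1, 2, 2]
print(rand_index(p1, p2))  # fraction of the 45 pairs on which the partitions agree
```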
K-means: Parametric or Non-parametric?
Cluster models: means
Data models?
All clusters are spherical
Distance in any direction is the same
Cluster may be arbitrarily “big” to include outliers
EM Clustering
Parametric solution
Model the data distribution
Each cluster: Gaussian model \mathcal{N}(\mu_i, \Sigma_i)
Data: "mixture of models"
Hidden value z_t is the cluster of item t
E-step: estimate cluster memberships
E[z_t^i | X, \Phi] = \frac{p(x^t | C_i, \mu_i, \Sigma_i)\, P(C_i)}{\sum_j p(x^t | C_j, \mu_j, \Sigma_j)\, P(C_j)}
M-step: maximize likelihood (clusters, params)
\mathcal{L}(\mu, \Sigma | X) = P(X | \mu, \Sigma)
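For concreteness, here is a compact NumPy sketch of these E- and M-steps for a full-covariance Gaussian mixture. The uniform-prior initialization, the random choice of initial means, the small ridge added to the covariances, and the fixed iteration count are all my assumptions, not prescriptions from the slides.

```python
import numpy as np

def em_gmm(X, k, n_iter=50, seed=0):
    """EM for a Gaussian mixture: the E-step computes soft memberships E[z_t],
    the M-step re-estimates priors, means, and covariances from them."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    priors = np.full(k, 1.0 / k)                      # P(C_i), start uniform (assumption)
    means = X[rng.choice(n, size=k, replace=False)].astype(float)
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    for _ in range(n_iter):
        # E-step: r[t, i] = p(x_t | C_i) P(C_i) / sum_j p(x_t | C_j) P(C_j)
        r = np.empty((n, k))
        for i in range(k):
            diff = X - means[i]
            inv = np.linalg.inv(covs[i])
            norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(covs[i]))
            quad = np.einsum('nd,de,ne->n', diff, inv, diff)
            r[:, i] = priors[i] * np.exp(-0.5 * quad) / norm
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters that maximize the likelihood
        # given the soft memberships.
        Nk = r.sum(axis=0)
        priors = Nk / n
        means = (r.T @ X) / Nk[:, None]
        for i in range(k):
            diff = X - means[i]
            covs[i] = (r[:, i, None] * diff).T @ diff / Nk[i] + 1e-6 * np.eye(d)
    return priors, means, covs, r
```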
The GMM assumption
• There are k components. The i'th component is called ω_i
• Component ω_i has an associated mean vector μ_i
• Each component generates data from a Gaussian with mean μ_i and covariance matrix σ²I
Assume that each datapoint is generated according to the following recipe:
1. Pick a component at random. Choose component i with probability P(ω_i).
2. Datapoint x ~ N(μ_i, σ²I)
[Andrew Moore]
The General GMM assumption
• There are k components. The i'th component is called ω_i
• Component ω_i has an associated mean vector μ_i
• Each component generates data from a Gaussian with mean μ_i and covariance matrix Σ_i
Assume that each datapoint is generated according to the following recipe:
1. Pick a component at random. Choose component i with probability P(ω_i).
2. Datapoint x ~ N(μ_i, Σ_i)
[Andrew Moore]
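The recipe itself is easy to simulate. A small sketch that generates data from a made-up three-component 2-D mixture; all priors, means, and covariances here are illustrative numbers, not values from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-D mixture: component priors P(w_i), means mu_i, covariances Sigma_i.
priors = np.array([0.5, 0.3, 0.2])
means = np.array([[0.0, 0.0], [4.0, 4.0], [-3.0, 5.0]])
covs = np.array([np.eye(2), [[1.0, 0.8], [0.8, 1.0]], [[2.0, 0.0], [0.0, 0.5]]])

def sample_gmm(n):
    """Generate n datapoints by (1) picking a component i with probability P(w_i),
    then (2) drawing x ~ N(mu_i, Sigma_i)."""
    comps = rng.choice(len(priors), size=n, p=priors)                              # step 1
    return np.array([rng.multivariate_normal(means[i], covs[i]) for i in comps])   # step 2

X = sample_gmm(500)
```

Running EM (as sketched earlier) on X should approximately recover these parameters.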
EM in action
http://www.the-wabe.com/notebook/emalgorithm.html
Gaussian Mixture Example (figures): start, then after the 1st, 2nd, 3rd, 4th, 5th, 6th, and 20th iterations.
[Andrew Moore]
EM Benefits
Model actual data distribution, not just centers
Get probability of membership in each cluster,
not just distance
Clusters do not need to be “round”
EM Issues?
Local optima
How long will it take?
How many clusters?
Evaluation
Summary: Key Points for Today
Unsupervised Learning
Why? How?
K-means Clustering
Iterative
Sensitive to initialization
Non-parametric
Local optimum
Rand Index
EM Clustering
Iterative
Sensitive to initialization
Parametric
Local optimum
Next Time
Clustering Reading: Alpaydin Ch. 7.1-7.4, 7.8
Reading questions: Gavin, Ronald, Matthew
Next time: Reinforcement learning – Robots!