Lecture 7 PPT - Kiri L. Wagstaff


CS 461: Machine Learning
Lecture 7
Dr. Kiri Wagstaff
[email protected]
2/21/09
Plan for Today
• Unsupervised Learning
• K-means Clustering
• EM Clustering
• Homework 4
Review from Lecture 6
• Parametric methods
  - Data comes from a distribution
  - Bernoulli, Gaussian, and their parameters
  - How good is a parameter estimate? (bias, variance)
• Bayes estimation (see the small numeric sketch below)
  - ML: use the data (assume equal priors)
  - MAP: use the prior and the data
• Parametric classification
  - Maximize the posterior probability
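The small numeric sketch referenced above is an added refresher, not on the slide: it contrasts ML and MAP estimation of a Bernoulli parameter, assuming a Beta(a, b) prior with made-up pseudo-counts.

```python
data = [1, 1, 0, 1, 0, 1, 1, 1]                 # 6 heads, 2 tails (made-up data)
n, heads = len(data), sum(data)

theta_ml = heads / n                            # ML: use the data only -> 0.75
a, b = 2, 2                                     # assumed Beta prior pseudo-counts
theta_map = (heads + a - 1) / (n + a + b - 2)   # MAP: mode of the Beta posterior -> 0.7
print(theta_ml, theta_map)
```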
Clustering
Chapter 7
Unsupervised Learning
• The data has no labels!
• What can we still learn?
  - Salient groups in the data
  - Density in feature space
• Key approach: clustering
• … but also:
  - Association rules
  - Density estimation
  - Principal components analysis (PCA)
Clustering
• Group items by similarity
• Density estimation, cluster models
Applications of Clustering
• Image Segmentation [Ma and Manjunath, 2004]
• Data Mining: Targeted marketing
• Remote Sensing: Land cover types
• Text Analysis
[Selim Aksoy]
Applications of Clustering
• Text Analysis: Noun Phrase Coreference

Input text:
"John Simon, Chief Financial Officer of Prime Corp. since 1986, saw his pay jump 20%, to $1.3 million, as the 37-year-old also became the financial-services company's president."

Cluster JS: John Simon, Chief Financial Officer, his, the 37-year-old
Cluster PC: Prime Corp., the financial-services company
Singletons: 1986, pay, president, 20%, $1.3 million
Clustering is sometimes easy, sometimes impossible, and sometimes in between.
[Figures: example datasets of varying clustering difficulty; Andrew Moore]
K-means
[Andrew Moore]
1. Ask user how many clusters they'd like. (e.g. k=5)
2. Randomly guess k cluster Center locations
3. Each datapoint finds out which Center it's closest to. (Thus each Center "owns" a set of datapoints)
4. Each Center finds the centroid of the points it owns...
5. ...and jumps there
6. ...Repeat until terminated!
K-means
[Figures: a run starting with k=5 and continuing over many iterations until it terminates. Example generated by Dan Pelleg's super-duper fast K-means system: Dan Pelleg and Andrew Moore, "Accelerating Exact k-means Algorithms with Geometric Reasoning," Proc. Conference on Knowledge Discovery in Databases 1999 (KDD99), available at www.autonlab.org/pap.html. Slides by Andrew Moore.]
K-means Algorithm
1. Randomly select k cluster centers
2. While (points change membership):
   1. Assign each point to its closest cluster (use your favorite distance metric)
   2. Update each center to be the mean of its items

Objective function: variance

  V = \sum_{c=1}^{k} \sum_{x_j \in C_c} \mathrm{dist}(x_j, \mu_c)^2

K-means applet (interactive demo)
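A minimal NumPy sketch of the loop above; the function names, the use of squared Euclidean distance, and the random-point initialization are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def kmeans(X, k, max_iter=100, rng=None):
    """Cluster the rows of X into k groups; returns (centers, labels)."""
    rng = np.random.default_rng(rng)
    # 1. Randomly select k cluster centers (here: k distinct data points).
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # 2.1 Assign each point to its closest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # no membership changes: stop
            break
        labels = new_labels
        # 2.2 Update each center to be the mean of the points it owns.
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return centers, labels

def variance(X, centers, labels):
    """Objective V: sum of squared distances from each point to its center."""
    return sum(((X[labels == c] - centers[c]) ** 2).sum()
               for c in range(len(centers)))
```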
K-means Algorithm: Example
1. Randomly select k cluster centers
2. While (points change membership):
   1. Assign each point to its closest cluster (use your favorite distance metric)
   2. Update each center to be the mean of its items

Objective function: variance

  V = \sum_{c=1}^{k} \sum_{x_j \in C_c} \mathrm{dist}(x_j, \mu_c)^2

Data: [1, 15, 4, 2, 17, 10, 6, 18]
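Running the kmeans sketch from the previous slide on this 1-D data. The choice of k=2 and the fixed random seed are assumptions; the slide leaves both open, and different initializations can converge to different local optima.

```python
import numpy as np

# Slide data, one feature per point; k=2 and the seed are assumed choices.
X = np.array([[1.], [15.], [4.], [2.], [17.], [10.], [6.], [18.]])
centers, labels = kmeans(X, k=2, rng=0)       # kmeans/variance: sketches above
print(centers.ravel())                        # e.g. one low and one high center
print(labels)                                 # cluster membership per point
print(variance(X, centers, labels))           # objective value at convergence
```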
K-means for Compression
[Figures: original image (159 KB) vs. clustered with k=4 (53 KB)]
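A sketch of the compression idea on this slide: cluster the pixel colors with k=4 and replace every pixel by its cluster center. The file names are placeholders, and kmeans is the hypothetical sketch from the earlier slide; the slide itself does not say how the clustering was applied.

```python
import numpy as np
from PIL import Image

img = np.asarray(Image.open("photo.png").convert("RGB"), dtype=float)
pixels = img.reshape(-1, 3)                      # one row per pixel (R, G, B)
centers, labels = kmeans(pixels, k=4, rng=0)     # cluster the colors
quantized = centers[labels].reshape(img.shape)   # replace each pixel by its center
Image.fromarray(quantized.astype(np.uint8)).save("quantized.png")
```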
Issue 1: Local Optima
• K-means is greedy!
• Converging to a non-global optimum:
[Figures: two examples of convergence to a non-global optimum, from Andrew Moore]
Issue 2: How long will it take?
• We don't know!
• K-means is O(nkdI)
  - n = # points, k = # clusters
  - d = # features (dimensionality)
  - I = # iterations
• # iterations depends on random initialization
  - "Good" init: few iterations
  - "Bad" init: lots of iterations
• How can we tell the difference, before clustering?
  - We can't
  - Use heuristics to guess a "good" init (one such heuristic is sketched below)
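One common seeding heuristic (k-means++-style seeding, which the slide does not name): spread the initial centers out by choosing each new center with probability proportional to its squared distance from the centers picked so far.

```python
import numpy as np

def plus_plus_init(X, k, rng=None):
    """Pick k starting centers that tend to be spread out."""
    rng = np.random.default_rng(rng)
    centers = [X[rng.integers(len(X))]]          # first center: uniform at random
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen center.
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()                    # far-away points are more likely
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```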
Issue 3: How many clusters?
• The "Holy Grail" of clustering
• Select k that gives the partition with the least variance? [Dhande and Fiore, 2002] (a rough sketch of this idea follows below)
• Best k depends on the user's goal
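One rough way to act on the "least variance" idea: total variance always decreases as k grows, so a common heuristic (an addition here, not from the slide) is to compare the objective across several k and look for where the drop levels off. This reuses the hypothetical kmeans and variance sketches and the 1-D data X from the earlier example.

```python
# Inspect how the objective falls as k increases (the "elbow" heuristic).
for k in range(1, 8):
    centers, labels = kmeans(X, k, rng=0)
    print(k, variance(X, centers, labels))
```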
Issue 4: How good is the result?
• Rand Index
  - A = # pairs in the same cluster in both partitions
  - B = # pairs in different clusters in both partitions
  - Rand = (A + B) / total number of pairs (a small computation sketch follows below)

[Figure: two partitions of the same 10 items, compared pair by pair]

Rand = (5 + 26) / 45
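A small sketch of the Rand index described above; the two partitions are given as flat label lists (the label values themselves do not matter, only which items share a label).

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which the two partitions agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = 0
    for i, j in pairs:
        same_a = labels_a[i] == labels_a[j]      # together in partition A?
        same_b = labels_b[i] == labels_b[j]      # together in partition B?
        if same_a == same_b:                     # counts toward A or B on the slide
            agree += 1
    return agree / len(pairs)                    # 10 items -> 45 pairs
```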
K-means: Parametric or Non-parametric?
• Cluster models: means
• Data models?
  - All clusters are spherical
  - Distance in any direction is the same
  - Cluster may be arbitrarily "big" to include outliers
EM Clustering
• Parametric solution
  - Model the data distribution
  - Each cluster: Gaussian model N(\mu_i, \Sigma_i)
  - Data: "mixture of models"
  - Hidden value z^t is the cluster of item t
• E-step: estimate cluster memberships

  E[z_i^t \mid X, \mu, \Sigma] = \frac{p(x^t \mid C_i, \mu_i, \Sigma_i)\, P(C_i)}{\sum_j p(x^t \mid C_j, \mu_j, \Sigma_j)\, P(C_j)}

• M-step: maximize likelihood (clusters, params)

  \mathcal{L}(\mu, \Sigma \mid X) = p(X \mid \mu, \Sigma)
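A compact sketch of the E-step/M-step above for a Gaussian mixture with full covariances. Function and variable names, the initialization, and the fixed iteration count are illustrative choices, not from the lecture.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=50, rng=None):
    """Fit a k-component Gaussian mixture to the rows of X with EM."""
    n, d = X.shape
    rng = np.random.default_rng(rng)
    means = X[rng.choice(n, size=k, replace=False)].astype(float)
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    priors = np.full(k, 1.0 / k)                  # P(C_i)
    for _ in range(n_iter):
        # E-step: h[t, i] = E[z_i^t | X, params], the soft membership of x^t.
        dens = np.column_stack([
            priors[i] * multivariate_normal.pdf(X, means[i], covs[i])
            for i in range(k)])
        h = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate priors, means, covariances from the soft counts.
        Nk = h.sum(axis=0)
        priors = Nk / n
        means = (h.T @ X) / Nk[:, None]
        for i in range(k):
            diff = X - means[i]
            covs[i] = (h[:, i, None] * diff).T @ diff / Nk[i] + 1e-6 * np.eye(d)
    return priors, means, covs, h
```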
The GMM assumption
• There are k components. The i'th component is called ω_i
• Component ω_i has an associated mean vector μ_i
• Each component generates data from a Gaussian with mean μ_i and covariance matrix σ²I

Assume that each datapoint is generated according to the following recipe:
1. Pick a component at random. Choose component i with probability P(ω_i).
2. Datapoint ~ N(μ_i, σ²I)

[Figures: components with means μ_1, μ_2, μ_3 and a sampled datapoint x]
[Andrew Moore]
The General GMM assumption
• There are k components. The i'th component is called ω_i
• Component ω_i has an associated mean vector μ_i
• Each component generates data from a Gaussian with mean μ_i and covariance matrix Σ_i

Assume that each datapoint is generated according to the following recipe:
1. Pick a component at random. Choose component i with probability P(ω_i).
2. Datapoint ~ N(μ_i, Σ_i)

[Figure: components 1, 2, 3, each with its own covariance]
[Andrew Moore]
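The generative recipe on these two slides, written out as code. The specific mixture parameters below (priors, means, covariances) are made-up illustration values, not anything from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
priors = np.array([0.5, 0.3, 0.2])                      # P(w_1), P(w_2), P(w_3)
means = np.array([[0., 0.], [5., 5.], [0., 5.]])        # mu_i
covs = np.array([np.eye(2), 2 * np.eye(2), np.eye(2)])  # Sigma_i (general GMM)

def sample_point():
    i = rng.choice(len(priors), p=priors)               # 1. pick component w_i
    return rng.multivariate_normal(means[i], covs[i])   # 2. datapoint ~ N(mu_i, Sigma_i)

data = np.array([sample_point() for _ in range(500)])   # a synthetic mixture sample
```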
EM in action
• http://www.the-wabe.com/notebook/emalgorithm.html
Gaussian Mixture Example
[Figures: EM fitting a Gaussian mixture, shown at the start and after the 1st, 2nd, 3rd, 4th, 5th, 6th, and 20th iterations. Slides by Andrew Moore.]
EM Benefits
• Model actual data distribution, not just centers
• Get probability of membership in each cluster, not just distance
• Clusters do not need to be "round"
EM Issues?
• Local optima
• How long will it take?
• How many clusters?
• Evaluation
Summary: Key Points for Today
• Unsupervised Learning
  - Why? How?
• K-means Clustering
  - Iterative
  - Sensitive to initialization
  - Non-parametric
  - Local optimum
  - Rand Index
• EM Clustering
  - Iterative
  - Sensitive to initialization
  - Parametric
  - Local optimum
Next Time
• Clustering Reading: Alpaydin Ch. 7.1-7.4, 7.8
• Reading questions: Gavin, Ronald, Matthew
• Next time: Reinforcement learning – Robots!