Lecture 20 clustering (1): Kmeans algorithm


Intro. ANN & Fuzzy Systems
Lecture 20
Clustering (1)
Outline
• Unsupervised learning (Competitive Learning)
and Clustering
• K-Means Clustering Algorithm
(C) 2001 by Yu Hen Hu
Unsupervised Learning
• Data Mining
– Understand internal/hidden structure of data
distribution
• The cost of labeling (target value, teaching input) is high
– Large number of feature vectors
– Sampling may involve costly experiments
– Data labels may not be available at all
• Pre-processing for classification
– features within the same cluster are similar, and
– often belong to the same class
Competitive Learning
• A form of unsupervised learning.
• Neurons compete against each other with their activation values.
The winner(s) earn the privilege of updating their weights. The
losers may even be punished by having their weights updated in the
opposite direction.
• Competitive and Cooperative Learning:
Competitive: Only one neuron's activation can be reinforced.
Cooperative: Several neurons' activation can be reinforced.
Competitive Learning Rule
• A neuron WINS the competition if its output is largest
among all neurons for the same input x(n).
• The weights of the winning neuron (the k-th) are adjusted:
$\Delta w_k(n) \propto [x(n) - w_k(n)]$
The positions of losing neurons remain unchanged.
• If the weights of a neuron represent its POSITION, and
the output of a neuron is inversely proportional to the
distance between x(n) and wk(n), then
Competitive Learning = CLUSTERING!
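The update rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `competitive_step`, the proportionality constant `eta`, and the use of squared Euclidean distance to pick the winner are choices made here, not taken from the lecture.

```python
def competitive_step(weights, x, eta=0.1):
    """Move only the winning neuron toward x and return its index."""
    # Assumption: the output is inversely related to distance, so the
    # neuron with the largest output is the one closest to x.
    dists = [sum((xi - wi) ** 2 for xi, wi in zip(x, w)) for w in weights]
    k = dists.index(min(dists))
    # Winner update: delta w_k = eta * (x - w_k). Losers are left unchanged.
    weights[k] = [wi + eta * (xi - wi) for xi, wi in zip(x, weights[k])]
    return k

# Two neurons in the plane and one input sample (made-up values).
weights = [[0.0, 0.0], [1.0, 1.0]]
winner = competitive_step(weights, [0.9, 1.2], eta=0.5)
```

Here the second neuron is closer to the input, so it alone moves halfway toward the sample while the first neuron keeps its position.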
Competitive Learning Example
[Figure: four panels titled "initial", "after 25 iterations", "after 75 iterations", and "at end of 100 iterations"; both axes run from -2 to 2.]
learncl1.m
What is “Clustering”?
What can we learn from these “unlabeled” data samples?
[Figure: two plots of unlabeled data samples.]
– Structures: Some samples are closer to each other
than other samples
– The closeness between samples is determined
using a “similarity measure”
– The number of samples per unit volume is related to the
concept of “density” or “distribution”
Clustering Problem Statement
• Given a set of vectors $\{x_k;\ 1 \le k \le K\}$, find a set of
c clustering centers $\{w(i);\ 1 \le i \le c\}$ such that
each $x_k$ is assigned to a cluster, say $w(i^*)$,
according to a distance (distortion, similarity)
measure $d(x_k, w(i))$, such that the average
distortion
$$D = \frac{1}{K} \sum_{i=1}^{c} \sum_{k=1}^{K} I(x_k, i)\, d(x_k, w(i))$$
is minimized.
• $I(x_k, i) = 1$ if $x_k$ is assigned to cluster i with cluster
center $w(i)$, and $= 0$ otherwise (indicator function).
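To make the role of the indicator function concrete, the average distortion can be evaluated directly from its definition. The sketch below assumes squared Euclidean distance for d and stores the assignment as a list of cluster indices; the names `squared_distance`, `average_distortion`, and `assignment` are illustrative only.

```python
def squared_distance(x, w):
    # Assumed distortion measure d(x, w): squared Euclidean distance.
    return sum((xi - wi) ** 2 for xi, wi in zip(x, w))

def average_distortion(samples, centers, assignment, d=squared_distance):
    """D = (1/K) * sum over i and k of I(x_k, i) * d(x_k, w(i))."""
    K = len(samples)
    total = 0.0
    for i, w in enumerate(centers):
        for k, x in enumerate(samples):
            # I(x_k, i) is 1 only for the cluster x_k is assigned to.
            indicator = 1 if assignment[k] == i else 0
            total += indicator * d(x, w)
    return total / K

# Three 1-D samples, two centers, and a hand-picked assignment (made-up values).
samples = [[0.0], [1.0], [4.0]]
centers = [[0.5], [4.0]]
assignment = [0, 0, 1]
D = average_distortion(samples, centers, assignment)
```

Because the indicator is zero for every cluster except the assigned one, each sample contributes exactly one distance term to the double sum.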
k-means Clustering Algorithm
Initialization: Choose initial cluster centers $w(i)$, $1 \le i \le c$; set $D(-1) = 0$ and
$I(x_k, i) = 0$ for $1 \le i \le c$, $1 \le k \le K$.
Repeat
(A) Assign cluster membership (Expectation step)
Evaluate $d(x_k, w(i))$ for $1 \le i \le c$, $1 \le k \le K$, and set
$I(x_k, i) = 1$ if $d(x_k, w(i)) < d(x_k, w(j))$ for all $j \ne i$;
$= 0$ otherwise.
(B) Evaluate distortion D:
$$D(\mathrm{iter}) = \sum_{i=1}^{c} \sum_{k=1}^{K} I(x_k, i)\, d(x_k, w(i))$$
(This sum is K times the average distortion; the constant factor does not affect the test in (D).)
(C) Update code words according to new assignment
(Maximization)
$$w(i) = \frac{1}{N_i} \sum_{k=1}^{K} I(x_k, i)\, x_k, \qquad N_i = \sum_{k=1}^{K} I(x_k, i), \qquad 1 \le i \le c$$
(D) Check for convergence
if $|1 - D(\mathrm{iter}-1)/D(\mathrm{iter})| < \varepsilon$, then convergent = TRUE.
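Steps (A) through (D) can be sketched in plain Python as follows. Several details are assumptions made for this sketch rather than part of the lecture: squared Euclidean distance for d, ties broken in favor of the lowest cluster index, an empty cluster keeping its previous center, an absolute value in the relative-change test, and the hypothetical names `sq_dist`, `kmeans`, `eps`, and `max_iter`.

```python
def sq_dist(x, w):
    # Assumed distortion measure: squared Euclidean distance.
    return sum((xi - wi) ** 2 for xi, wi in zip(x, w))

def kmeans(samples, centers, eps=1e-6, max_iter=100):
    """Batch k-means; returns (centers, labels, last evaluated distortion)."""
    centers = [list(w) for w in centers]
    prev_D = None  # plays the role of D(iter - 1)
    for _ in range(max_iter):
        # (A) Assign each sample to its nearest center (first index wins ties).
        labels = []
        for x in samples:
            d = [sq_dist(x, w) for w in centers]
            labels.append(d.index(min(d)))
        # (B) Total distortion under the current assignment and centers.
        D = sum(sq_dist(x, centers[i]) for x, i in zip(samples, labels))
        # (C) Move each center to the mean of its members.
        for i in range(len(centers)):
            members = [x for x, lab in zip(samples, labels) if lab == i]
            if members:  # assumption: an empty cluster keeps its old center
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
        # (D) Stop when the relative change in distortion is below eps.
        if prev_D is not None and (D == 0 or abs(1 - prev_D / D) < eps):
            break
        prev_D = D
    return centers, labels, D

# Two well-separated groups of 2-D points (made-up values).
samples = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
centers, labels, D = kmeans(samples, [[0, 0], [5, 5]])
```

The returned distortion is the value computed in step (B) of the final pass, that is, before the last center update, which by then no longer changes the centers.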
A Numerical Example
x = {-1, -2, 0, 2, 3, 4},
W = {2.1, 2.3}
1. Assign membership
2.1: {-1, -2, 0, 2}
2.3: {3, 4}
2. Distortion
$D = (-1-2.1)^2 + (-2-2.1)^2 + (0-2.1)^2 + (2-2.1)^2 + (3-2.3)^2 + (4-2.3)^2 = 34.22$
3. Update W to minimize distortion
W1 = (-1-2+0+2)/4 = -0.25
W2 = (3+4)/2 = 3.5
4. Reassign membership
-0.25: {-1, -2, 0}
3.5: {2, 3, 4}
5. Update W:
W1 = (-1-2+0)/3 = -1
W2 = (2+3+4)/3 = 3
Converged.
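The arithmetic in this example can be checked with a short trace. The sketch assumes squared distance as the distortion measure (as in step 2) and records, for each pass, the membership, the distortion before the update, and the updated centers; the variable names are illustrative.

```python
x = [-1, -2, 0, 2, 3, 4]
w = [2.1, 2.3]
history = []
for step in range(3):
    # Assign each sample to the nearer center (no ties occur in this example).
    labels = [min(range(len(w)), key=lambda i: (xk - w[i]) ** 2) for xk in x]
    # Distortion under the current centers and membership.
    D = sum((xk - w[i]) ** 2 for xk, i in zip(x, labels))
    # Update each center to the mean of its members.
    w = [sum(xk for xk, i in zip(x, labels) if i == j) / labels.count(j)
         for j in range(len(w))]
    history.append((labels, round(D, 4), w))

for labels, D, centers in history:
    print(labels, D, centers)
```

The distortion falls from 34.22 to 6.4375 and then to 4, and the third pass leaves both the membership and the centers (-1 and 3) unchanged, which is why the example stops there.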
Kmeans Algorithm Demonstration
[Figure: two plots. Legend entries: data points, true cluster centers; data samples, converged centers, true centers, cluster boundary.]
Clusterdemo.m