A Genetic Algorithm Approach
to K-Means Clustering
Craig Stanek
CS401
November 17, 2004
What Is Clustering?
“partitioning the data being mined into several groups (or
clusters) of data instances, in such a way that:
a) Each cluster has instances that are very similar (or
“near”) to each other, and
b) The instances in each cluster are very different (or “far
away”) from the instances in the other clusters”
--Alex A. Freitas, “Data Mining and Knowledge
Discovery with Evolutionary Algorithms”
Why Cluster?
Segmentation and Differentiation
[Scatter plot: Age vs. Income, ages 0–70, incomes 0–120,000]
Why Cluster?
Outlier Detection
[Scatter plot: Age vs. Income, ages 0–70, incomes 0–2,500,000]
Why Cluster?
Classification
[Scatter plot: Petal Width vs. Petal Length, widths 0–3, lengths 0–8]
K-Means Clustering
1) Specify K, the number of clusters
2) Randomly initialize K “centroids”
3) Assign each data instance to the cluster whose centroid is nearest
4) Recalculate each cluster’s centroid
5) Repeat steps (3) and (4) until no data instance moves to a different cluster
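The five steps above can be sketched as a short Python routine (the list-of-tuples data format and the squared-Euclidean distance are illustrative assumptions, not part of the original slides):

```python
import random

def dist(a, b):
    """Squared Euclidean distance between two numeric tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(data, k, max_iter=100):
    """Minimal K-means on a list of numeric tuples."""
    # Step 2: randomly initialize K centroids from the data
    centroids = random.sample(data, k)
    assignment = [None] * len(data)
    for _ in range(max_iter):
        # Step 3: assign each instance to the nearest centroid
        new_assignment = [
            min(range(k), key=lambda c: dist(x, centroids[c]))
            for x in data
        ]
        # Step 5: stop when no instance changes cluster
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4: recompute each centroid as the mean of its cluster
        for c in range(k):
            members = [x for x, a in zip(data, assignment) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return assignment, centroids
```

Note that the stopping rule matches step (5) exactly: iteration ends only when the assignment is unchanged, not when centroids stop moving.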
Drawbacks of K-Means Algorithm
• Local rather than global optimum
• Sensitive to initial choice of centroids
• K must be chosen a priori
• Minimizes intra-cluster distance but does not
consider inter-cluster distance
Problem Statement
• Can a Genetic Algorithm approach do better than the
standard K-means algorithm?
• Is there an alternative fitness measure that can
take into account both intra-cluster similarity and
inter-cluster differentiation?
• Can a GA be used to find the optimum number
of clusters for a given data set?
Representation of Individuals
• Randomly generated number of clusters
• Medoid-based integer string (each gene is a
distinct data instance)
Example: [58, 244, 23, 162, 113]
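One way to realize this encoding (the function names, the index-based scheme, and the K range are illustrative assumptions): an individual is a randomly sized list of distinct instance indices, and decoding it assigns every instance to its nearest medoid.

```python
import random

def random_individual(n_instances, k_min=2, k_max=10):
    """An individual: a randomly sized list of distinct instance
    indices, each naming one data instance that serves as a medoid."""
    k = random.randint(k_min, k_max)  # randomly generated number of clusters
    return random.sample(range(n_instances), k)

def decode(individual, data, dist):
    """Return, for each instance, the index of its nearest medoid."""
    return [
        min(individual, key=lambda m: dist(x, data[m]))
        for x in data
    ]
```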
Genetic Algorithm Approach
Why Medoids?
[Scatter plot: Age vs. Income, ages 0–70, incomes 0–2,500,000]
Genetic Algorithm Approach
Why Medoids?
[Scatter plot: Age vs. Income, incomes 0–2,500,000, with the data instances, the medoid, and the centroid marked]
Genetic Algorithm Approach
Why Medoids?
[Scatter plot: Age vs. Income, incomes 0–300,000, with the data instances, the medoid, and the centroid marked]
Recombination
Parent #1: [36, 108, 82]
Parent #2: [5, 80, 147, 82, 108]
Child #1: [36, 108, 147, 82, 6]
Child #2: [5, 80, 82, 6]
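The exact crossover operator is not spelled out on the slide, but a cut-and-splice recombination over variable-length medoid strings might look like the following sketch (the cut-point choices and duplicate handling here are assumptions):

```python
import random

def dedupe(genes):
    """Keep only the first occurrence of each medoid index."""
    seen, out = set(), []
    for g in genes:
        if g not in seen:
            seen.add(g)
            out.append(g)
    return out

def recombine(parent1, parent2):
    """Cut-and-splice crossover for variable-length medoid strings:
    each child takes a prefix of one parent and a suffix of the other,
    then duplicates are dropped so all genes stay distinct."""
    cut1 = random.randint(1, len(parent1) - 1)
    cut2 = random.randint(1, len(parent2) - 1)
    child1 = dedupe(parent1[:cut1] + parent2[cut2:])
    child2 = dedupe(parent2[:cut2] + parent1[cut1:])
    return child1, child2
```

Because the cut points need not match, the children can be longer or shorter than either parent, which is how a variable number of clusters can emerge during evolution.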
Fitness Function
Let r_{ij} denote the jth data instance of the ith cluster, M_i the medoid of the ith cluster, n_i the number of instances in cluster i, and K the number of clusters.

Let X = \sum_{i=1}^{K} \sum_{j=1}^{n_i} Dist(r_{ij}, M_i)

Let Y = \frac{1}{K-1} \sum_{i=1}^{K} \sum_{j=1}^{n_i} \sum_{k=1, k \neq i}^{K} Dist(r_{ij}, M_k)

Fitness = Y / X

(X is the total intra-cluster distance; Y averages each instance’s distance to the other K − 1 medoids, so the fitness rewards both tight and well-separated clusters.)
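Given a decoded clustering, the fitness above can be computed directly; a minimal sketch, assuming the clustering is given as a list of per-cluster instance lists (that layout and the names are assumptions):

```python
def fitness(clusters, medoids, dist):
    """Fitness = Y / X, where X is the total intra-cluster distance
    to each cluster's own medoid and Y is each instance's distance
    to the other medoids, averaged over the K - 1 other clusters.
    clusters[i] is the list of instances in cluster i; medoids[i]
    is that cluster's medoid."""
    K = len(medoids)
    # X: sum of distances from each instance to its own medoid
    X = sum(dist(r, medoids[i])
            for i, cluster in enumerate(clusters)
            for r in cluster)
    # Y: sum of distances from each instance to every *other*
    # medoid, divided by (K - 1)
    Y = sum(dist(r, medoids[k])
            for i, cluster in enumerate(clusters)
            for r in cluster
            for k in range(K) if k != i) / (K - 1)
    return Y / X
```

Larger values are better: Y grows when clusters sit far from the other medoids, and X shrinks when each cluster is tight around its own medoid.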
Experimental Setup
Iris Plant Data (UCI Repository)
• 150 data instances
• 4 dimensions
• Known classifications: 3 classes, 50 instances of each
Experimental Setup
Iris Data Set
[Scatter plot: Sepal Width vs. Sepal Length, colored by class (Iris-setosa, Iris-versicolor, Iris-virginica)]
Experimental Setup
Iris Data Set
[Scatter plot: Petal Width vs. Petal Length, colored by class (Iris-setosa, Iris-versicolor, Iris-virginica)]
Standard K-Means vs. Medoid-Based EA
                   K-Means    EA
Total Trials          30      30
Avg. Correct        120.1   134.9
Avg. % Correct      80.1%   89.9%
Min. Correct          77     133
Max. Correct         134     135
Avg. Fitness        78.94   84.00
Standard K-Means Clustering
Iris Data Set
[Two scatter plots of Petal Width vs. Petal Length: the true classes (Iris-setosa, Iris-versicolor, Iris-virginica) alongside the three clusters found by standard K-means (Cluster 1, Cluster 2, Cluster 3)]
Medoid-Based EA
Iris Data Set
[Two scatter plots of Petal Width vs. Petal Length: the true classes (Iris-setosa, Iris-versicolor, Iris-virginica) alongside the three clusters found by the medoid-based EA (Cluster 1, Cluster 2, Cluster 3)]
Standard Fitness EA vs. Proposed Fitness EA
                   Standard   Proposed
Total Trials          30         30
Avg. Correct        134.9      134.0
Avg. % Correct      89.9%      89.3%
Min. Correct         133        134
Max. Correct         135        134
Avg. Generations     82.7       24.9
Fixed vs. Variable Number of Clusters EA
                     Fixed #   Variable #
Total Trials            30         30
Avg. Correct          134.0      134.0
Avg. % Correct        89.3%      89.3%
Min. Correct           134        134
Max. Correct           134        134
Avg. # of Clusters       3          7
Variable Number of Clusters EA
Iris Data Set
[Two scatter plots of Petal Width vs. Petal Length: the true classes (Iris-setosa, Iris-versicolor, Iris-virginica) alongside the seven clusters found by the variable-number-of-clusters EA (Cluster 1 through Cluster 7)]
Conclusions
• GA better at obtaining globally optimal solution
• Proposed fitness function shows promise
• Difficulty letting GA determine “correct”
number of clusters on its own
Future Work
• Other data sets
• Alternative fitness function
• Scalability
• GA comparison to simulated annealing