Kernel K-Means++ to

Download Report

Transcript Kernel K-Means++ to

DATA CLUSTERING WITH KERNAL K-MEANS++
Matt Strautmann, Dept. of Electrical and
Computer Engineering
PROJECT OBJECTIVES
PROJECT GOAL
Iris Plant Dataset
Experimentally demonstrate the application of Kernel K-Means
to non-linearly clusterable data sets
ACADEMIC IMPORTANCE

Expand the application of the Kernel K-Means clustering
algorithm to non-traditional uses
 K-Means clustering aims to divide the dataset into clusters
(“groups”) in which each data point belongs to the cluster with the
nearest mean vector.
o WHAT IS KERNAL K-MEANS?
QuickTime™ and a
decompressor
are needed to see this picture.
lans.ece.utexas.edu
 Two step process: data point assignment and update
Kernel data-mapping was seen to solve the overlapping
data sets by:
• Partitioning the points with linear separators in the
new space
•
Soft K-Means++ could not successfully cluster the Lung
Cancer Dataset; results were for one cluster out of three
successfully clustered
•
Soft K-Means++ clustered the two dimension, two
cluster Gaussian dataset with only one error out of the
one thousand data points
CONCLUDING REMARKS
WHAT IS THE PLUS PLUS INITIALIZATION SCHEME?
 The first mean vector is a randomly selected data point
3.) Cluster Centroid Becomes
New Cluster Mean
QuickTime™ and a
d eco mpres sor
are nee ded to s ee this picture.
http://en.wikipedia.org/wiki/K-means_clustering
http://en.wikipedia.org/wiki/K-means_clustering
QuickTime™ and a
d eco mpres sor
are nee ded to s ee this picture.
QuickTime™ and a
d eco mpres sor
are nee ded to s ee this picture.
4.) Step 2 and 3 Repeated
until Convergence
APPROACH
Clustering
Accuracy Average
(over ten runs)
Variance of
Accuracy
Calculation
(over ten runs)
28.00%
8.218%
2.867%
Lung Cancer Dataset
43.75%
-
-
2D2K Gaussian Dataset
99.00%
-
-
8D5K Gaussian Dataset
58.50%
2.082%
0.043%
Clustering
Accuracy Average
(over ten runs)
Standard Deviation of
Accuracy Calculation
(over ten runs)
Iris Plant Dataset
57.00%
5.009%
2.238%
Lung Cancer Dataset
62.00%
6.878%
0.473%
2D2K Gaussian Dataset
96.50%
1.677%
0.028%
8D5K Gaussian Dataset
76.31%
10.366%
1.075%
The initialization was seen to be the most important
factor in the algorithm converging
•
The “PLUS PLUS” cluster mean initialization was seen
to improve the results
•
Kernel assignment works better than the maximum
responsibility calculation of Soft K-Means
•
Kernel K-Means++ can handle small or large dimension
datasets well; the increase of dimensionally seemed to
be advantageous for the Lung Cancer Dataset (56
dimensions) over the lower clustering accuracy of the
Iris Plant Dataset (4 dimensions)
•
Kernel K-Means++ produced superior results to Soft KMeans++ when clustering the Lung Cancer Dataset and
demonstrated recognition of all three clusters
Variance of
Accuracy
Calculation
(over ten runs)
FUTURE WORK
RESULTS COMPARISON
•
Kernel K-Means++ clustering accuracy superior in all cases
except the two dimensional, two cluster dataset.
•
Further improvement of the mean vector initialization is
believed possible over the “PLUS PLUS” initialization
•
The clustering accuracy of the datasets increased by the
following amounts:
•
Other options for the mean-squared error calculation
for data point evaluation are possible
•
The time analysis of the algorithm must be calculate
• Iris Plant: 104%
Evaluate standard K-Means (Soft++) against 4 datasets to
form benchmark
• Lung Cancer: 38%
Hybridize Soft K-Means++ with Kernel K-Means to form
Kernel K-Means++
• 8D5K: 30%
Test Kernel K-Means++ on small size, small dimension
Gaussian, large dimension Gaussian, and large size
datasets
Standard Deviation
of Accuracy
Calculation
(over ten runs)
Iris Plant Dataset
Kernel K-Means++
2.) Voronoi Diagram
Generated by the Means
(data points associated with nearest cluster mean)
http://en.wikipedia.org/wiki/K-means_clustering
1.) Initial Mean Orientations
http://en.wikipedia.org/wiki/K-means_clustering
Soft K-Means++
Quic kTime™ and a
dec ompres sor
are needed to see t his pic ture.
•
SOFT K-MEANS++ VS. KERNEL K-MEANS++
 Each subsequent mean vector is created by evaluating randomly
selected data points against a vector weighting probability
•
•
QuickTime™ and a
decompressor
are needed to see this picture.
 Sum-of-squares algorithm
•
Kernel K-Means++ was found to cluster the test datasets
in a superior manner over Soft K-Means++
• Mapping the data before clustering to a higherdimensional feature space using a nonlinear function
2 Dimension,
2 Cluster Dataset
(Gaussian 2D2K)
2 Dimension,
2 Cluster Dataset
(Gaussian 2D2K)
o WHAT IS K-MEANS CLUSTERING?
•
•
QuickTime™ and a
decompressor
are needed to see this picture.
BACKGROUND
o
DISCUSSION
lans.ece.utexas.edu

o
PROJECT DATASETS
eleves.ens.fr
o
Dr. Donald C. Wunsch II, Dept. of Electrical and
Computer Engineering
• 2D2k: -2.5%
120
100
KERNEL++
SOFT++
80
60
Acknowledgements
40
The author would like to acknowledge the expertise of Dr.
Rui Xu in advising this project.
20
0
IRIS PLANT
DATASET
LUNG CANCER
DATASET
2D2K GAUSSIAN
DATASET
8D5K GAUSSIAN
DATASET