Kernel K-Means++ to
Download
Report
Transcript Kernel K-Means++ to
DATA CLUSTERING WITH KERNAL K-MEANS++
Matt Strautmann, Dept. of Electrical and
Computer Engineering
PROJECT OBJECTIVES
PROJECT GOAL
Iris Plant Dataset
Experimentally demonstrate the application of Kernel K-Means
to non-linearly clusterable data sets
ACADEMIC IMPORTANCE
Expand the application of the Kernel K-Means clustering
algorithm to non-traditional uses
K-Means clustering aims to divide the dataset into clusters
(“groups”) in which each data point belongs to the cluster with the
nearest mean vector.
o WHAT IS KERNAL K-MEANS?
QuickTime™ and a
decompressor
are needed to see this picture.
lans.ece.utexas.edu
Two step process: data point assignment and update
Kernel data-mapping was seen to solve the overlapping
data sets by:
• Partitioning the points with linear separators in the
new space
•
Soft K-Means++ could not successfully cluster the Lung
Cancer Dataset; results were for one cluster out of three
successfully clustered
•
Soft K-Means++ clustered the two dimension, two
cluster Gaussian dataset with only one error out of the
one thousand data points
CONCLUDING REMARKS
WHAT IS THE PLUS PLUS INITIALIZATION SCHEME?
The first mean vector is a randomly selected data point
3.) Cluster Centroid Becomes
New Cluster Mean
QuickTime™ and a
d eco mpres sor
are nee ded to s ee this picture.
http://en.wikipedia.org/wiki/K-means_clustering
http://en.wikipedia.org/wiki/K-means_clustering
QuickTime™ and a
d eco mpres sor
are nee ded to s ee this picture.
QuickTime™ and a
d eco mpres sor
are nee ded to s ee this picture.
4.) Step 2 and 3 Repeated
until Convergence
APPROACH
Clustering
Accuracy Average
(over ten runs)
Variance of
Accuracy
Calculation
(over ten runs)
28.00%
8.218%
2.867%
Lung Cancer Dataset
43.75%
-
-
2D2K Gaussian Dataset
99.00%
-
-
8D5K Gaussian Dataset
58.50%
2.082%
0.043%
Clustering
Accuracy Average
(over ten runs)
Standard Deviation of
Accuracy Calculation
(over ten runs)
Iris Plant Dataset
57.00%
5.009%
2.238%
Lung Cancer Dataset
62.00%
6.878%
0.473%
2D2K Gaussian Dataset
96.50%
1.677%
0.028%
8D5K Gaussian Dataset
76.31%
10.366%
1.075%
The initialization was seen to be the most important
factor in the algorithm converging
•
The “PLUS PLUS” cluster mean initialization was seen
to improve the results
•
Kernel assignment works better than the maximum
responsibility calculation of Soft K-Means
•
Kernel K-Means++ can handle small or large dimension
datasets well; the increase of dimensionally seemed to
be advantageous for the Lung Cancer Dataset (56
dimensions) over the lower clustering accuracy of the
Iris Plant Dataset (4 dimensions)
•
Kernel K-Means++ produced superior results to Soft KMeans++ when clustering the Lung Cancer Dataset and
demonstrated recognition of all three clusters
Variance of
Accuracy
Calculation
(over ten runs)
FUTURE WORK
RESULTS COMPARISON
•
Kernel K-Means++ clustering accuracy superior in all cases
except the two dimensional, two cluster dataset.
•
Further improvement of the mean vector initialization is
believed possible over the “PLUS PLUS” initialization
•
The clustering accuracy of the datasets increased by the
following amounts:
•
Other options for the mean-squared error calculation
for data point evaluation are possible
•
The time analysis of the algorithm must be calculate
• Iris Plant: 104%
Evaluate standard K-Means (Soft++) against 4 datasets to
form benchmark
• Lung Cancer: 38%
Hybridize Soft K-Means++ with Kernel K-Means to form
Kernel K-Means++
• 8D5K: 30%
Test Kernel K-Means++ on small size, small dimension
Gaussian, large dimension Gaussian, and large size
datasets
Standard Deviation
of Accuracy
Calculation
(over ten runs)
Iris Plant Dataset
Kernel K-Means++
2.) Voronoi Diagram
Generated by the Means
(data points associated with nearest cluster mean)
http://en.wikipedia.org/wiki/K-means_clustering
1.) Initial Mean Orientations
http://en.wikipedia.org/wiki/K-means_clustering
Soft K-Means++
Quic kTime™ and a
dec ompres sor
are needed to see t his pic ture.
•
SOFT K-MEANS++ VS. KERNEL K-MEANS++
Each subsequent mean vector is created by evaluating randomly
selected data points against a vector weighting probability
•
•
QuickTime™ and a
decompressor
are needed to see this picture.
Sum-of-squares algorithm
•
Kernel K-Means++ was found to cluster the test datasets
in a superior manner over Soft K-Means++
• Mapping the data before clustering to a higherdimensional feature space using a nonlinear function
2 Dimension,
2 Cluster Dataset
(Gaussian 2D2K)
2 Dimension,
2 Cluster Dataset
(Gaussian 2D2K)
o WHAT IS K-MEANS CLUSTERING?
•
•
QuickTime™ and a
decompressor
are needed to see this picture.
BACKGROUND
o
DISCUSSION
lans.ece.utexas.edu
o
PROJECT DATASETS
eleves.ens.fr
o
Dr. Donald C. Wunsch II, Dept. of Electrical and
Computer Engineering
• 2D2k: -2.5%
120
100
KERNEL++
SOFT++
80
60
Acknowledgements
40
The author would like to acknowledge the expertise of Dr.
Rui Xu in advising this project.
20
0
IRIS PLANT
DATASET
LUNG CANCER
DATASET
2D2K GAUSSIAN
DATASET
8D5K GAUSSIAN
DATASET