On K-Means Cluster Preservation using Quantization Schemes
On K-Means Cluster Preservation using Quantization Schemes
Deepak Turaga¹, Michalis Vlachos², Olivier Verscheure¹
¹ IBM T.J. Watson Research Center, NY, USA
² IBM Zürich Research Laboratory, Switzerland
overview – what we want to do…
• Examine under what conditions compression methodologies retain the
clustering outcome
• We focus on the K-Means algorithm
[Figure: k-Means on the original data and k-Means on the quantized data produce identical clustering results (clusters 1, 2, 3)]
why we want to do that…
• Reduced Storage
– The quantized data will take up less space
• Faster execution
– Since the data can be represented in a more compact form, the clustering algorithm will require less runtime
• Anonymization/Privacy Preservation
– The original values are not disclosed
• Authentication
– Encode some message with the quantization
We will achieve the above and still guarantee the same clustering results
other cluster preservation techniques
[Figure: original vs. quantized data]
• We do not transform into another space
• Space requirements same – no data simplification
• Shape preservation
[Oliveira04] S. R. M. Oliveira and O. R. Zaïane. Privacy Preservation When Sharing Data For Clustering, 2004.
[Parameswaran05] R. Parameswaran and D. Blough. A Robust Data Obfuscation Approach for Privacy Preservation of Clustered Data, 2005.
k-means overview
K-Means Algorithm:
1. Initialize k clusters randomly (k is specified by the user).
2. Repeat until convergence:
   2.1 Assign each object to the nearest cluster center.
   2.2 Re-estimate the cluster centers.
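For reference, a minimal NumPy sketch of the loop above (function and variable names are ours, not from the paper; it assumes no cluster becomes empty during the iterations):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-Means: random init, then assign / re-estimate until convergence."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # 1. initialize k clusters randomly
    for _ in range(n_iter):                                 # 2. repeat until convergence
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)                        # 2.1 assign each object to the nearest center
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers                                # 2.2 re-estimate the cluster centers
    return labels, centers
```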
k-means example
[Figure: k-Means clustering example on 2-D points]
k-means applications/usage
• Fast pre-clustering
• Real-time clustering (e.g., image and video effects)
– Color/Image segmentation
k-means objective function
• Objective: Minimize the sum of intra-class variance
$$E = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2, \qquad \mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i \;\; \text{(cluster centroid)}$$
After some algebraic manipulations:
$$E = \sum_{j=1}^{k} \sum_{d=1}^{D} \left( \underbrace{\sum_{x_i \in C_j} x_{i,d}^{2}}_{\text{2nd moment}} \;-\; \frac{1}{|C_j|} \Bigl( \underbrace{\sum_{x_i \in C_j} x_{i,d}}_{\text{1st moment}} \Bigr)^{2} \right)$$
where $j$ runs over the $k$ clusters and $d$ over the $D$ dimensions/time instances.
k-means objective function
So we can preserve the k-Means outcome if:
• We maintain the cluster assignment
• We preserve the 1st and 2nd moment of the cluster objects
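The decomposition above can be checked numerically. Below is a small sketch (NumPy assumed; function names and the random data are illustrative) that computes the objective both from the centroids and purely from the per-cluster, per-dimension 1st and 2nd moments:

```python
import numpy as np

def kmeans_objective(X, labels):
    """Sum of squared distances of each object to its cluster centroid."""
    return sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
               for c in np.unique(labels))

def objective_from_moments(X, labels):
    """Same objective, computed only from per-cluster sums and sums of squares."""
    total = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        s1 = members.sum(axis=0)            # 1st moment (per dimension)
        s2 = (members ** 2).sum(axis=0)     # 2nd moment (per dimension)
        total += (s2 - s1 ** 2 / len(members)).sum()
    return total

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
labels = rng.integers(0, 3, size=100)
print(np.isclose(kmeans_objective(X, labels), objective_from_moments(X, labels)))  # True
```

So any quantization that keeps the cluster assignment and these two moments intact leaves the objective unchanged.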
moment preserving quantization
• 1st moment: average
• 2nd (central) moment: variance
• 3rd moment: skewness
• 4th moment: kurtosis
In order to preserve the first and second moment we will use
the following quantizer:
$$q_{\text{low}} = \mu - \sigma\sqrt{\frac{N_g}{N - N_g}} \qquad\qquad q_{\text{high}} = \mu + \sigma\sqrt{\frac{N - N_g}{N_g}}$$
where $N$ is the number of points in the cluster (for the given dimension), $N_g$ is the number of points above the mean $\mu$, and $\sigma$ is their standard deviation.
Everything below the mean value is 'snapped' to $q_{\text{low}}$; everything above the mean value is 'snapped' to $q_{\text{high}}$.
original     quantized
-2.4240      -1.4795
-0.2238       0.2049
 0.0581       0.2049
-0.4246       0.2049
-0.2029       0.2049
-1.5131      -1.4795
-1.1264      -1.4795
-0.8150      -1.4795
 0.3666       0.2049
-0.5861      -1.4795
 1.5374       0.2049
 0.1401       0.2049
-1.8628      -1.4795
-0.4542       0.2049
-0.6521      -1.4795
 0.1033       0.2049
-0.2206       0.2049
-0.2790       0.2049
-0.7337      -1.4795
-0.0645       0.2049
average = -0.4689    average = -0.4689
For this example ($N = 20$ points, $N_g = 12$ of them above the mean $\mu = -0.4689$):
Everything above the mean value is 'snapped' to $\mu + \sigma\sqrt{(N - N_g)/N_g} = 0.2049$
Everything below the mean value is 'snapped' to $\mu - \sigma\sqrt{N_g/(N - N_g)} = -1.4795$
These are the points for one dimension (or time instance) d and for one cluster of objects.
The process is repeated for all dimensions and for all clusters: we have one quantizer per class.
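A minimal sketch of this two-level quantizer for one dimension of one cluster (NumPy assumed; the helper name mpq_1d is ours), applied to the 20 values of the example above:

```python
import numpy as np

def mpq_1d(x):
    """1-bit moment-preserving quantization of one dimension of one cluster."""
    n = len(x)
    mu, sigma = x.mean(), x.std()                 # 1st moment and sqrt of the 2nd central moment
    above = x > mu
    n_g = above.sum()                             # number of points above the mean
    hi = mu + sigma * np.sqrt((n - n_g) / n_g)    # level for points above the mean
    lo = mu - sigma * np.sqrt(n_g / (n - n_g))    # level for points below the mean
    return np.where(above, hi, lo)

x = np.array([-2.4240, -0.2238,  0.0581, -0.4246, -0.2029, -1.5131, -1.1264,
              -0.8150,  0.3666, -0.5861,  1.5374,  0.1401, -1.8628, -0.4542,
              -0.6521,  0.1033, -0.2206, -0.2790, -0.7337, -0.0645])
q = mpq_1d(x)
print(np.unique(np.round(q, 4)))                  # [-1.4795  0.2049]
print(round(x.mean(), 4), round(q.mean(), 4))     # -0.4689 -0.4689  (1st moment preserved)
print(round(x.var(), 4), round(q.var(), 4))       # variances also match (2nd moment preserved)
```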
our quantization
• One quantizer per class
• The quantized data are binary
our quantization
• The fact that we have one quantizer per class means that we need to run k-Means once before we quantize
• This is not a shortcoming of the technique, as we need to know the cluster boundaries in order to know how much we can simplify the data
why quantization works?
• Why does the clustering remain the same before and after quantization?
Centers do not change (the averages remain the same)
Cluster assignment does not change because the clusters 'shrink' due to quantization
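A small end-to-end check of this argument (NumPy and scikit-learn assumed; the blob dataset and helper name are illustrative, and each cluster is assumed to have points on both sides of its per-dimension mean): cluster once, quantize every cluster and dimension with the moment-preserving quantizer, re-run k-Means on the quantized data seeded with the original centers, and compare the assignments:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def mpq_1d(x):
    """1-bit moment-preserving quantizer (see the earlier sketch)."""
    mu, sigma = x.mean(), x.std()
    above = x > mu
    n_g, n = above.sum(), len(x)
    return np.where(above, mu + sigma * np.sqrt((n - n_g) / n_g),
                           mu - sigma * np.sqrt(n_g / (n - n_g)))

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

Xq = X.copy()
for c in range(3):                        # one quantizer per class...
    idx = km.labels_ == c
    for d in range(X.shape[1]):           # ...and per dimension
        Xq[idx, d] = mpq_1d(X[idx, d])

# Centers are preserved (averages unchanged), so we can seed with them directly.
km_q = KMeans(n_clusters=3, init=km.cluster_centers_, n_init=1).fit(Xq)
print((km_q.labels_ == km.labels_).mean())   # expect 1.0 for well-formed clusters
```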
will it always work?
• The results will be the same for datasets with well-formed clusters
• A discrepancy in the results means that the clusters were not that dense
recap
• Use moment preserving quantization to preserve objective
function
• Due to cluster shrinkage, cluster assignments will not change
• Identical results for optimal k-Means
• One quantizer per class
• 1-bit quantizer per dimension
example: shape preservation
[Bagnall06] A. J. Bagnall, C. A. Ratanamahatana, E. J. Keogh, S. Lonardi, and G. J. Janacek. A Bit Level Representation for Time Series Data Mining with Shape Based Similarity. Data Min. Knowl. Discov. 13(1), pages 11–40, 2006.
example: cluster preservation
• 3 years Nasdaq stock ticker data
• We cluster into k=8 clusters
Confusion matrix: 3% mislabeled data
With Binary Clipping: 80% mislabeled
[Figure: Cluster centers after the moment-preserving quantization; the quantization levels indicate the cluster spread]
example: label preservation
[Figure: leaf contours (Acer platanoides, Tilia, Salix fragilis, Quercus robur)]
• 2 datasets
– Contours of fish
– Contours of leaves
• Clustering and then k-NN voting
For rotation invariance we use rotation-invariant features
example: label preservation
• Very low mislabeling error for MPQ
• High error rate for Binary Clipping
other nice characteristics
• Low sensitivity to initial centers
– Mismatch when starting from different centers is around 7%
• Neighborhood preservation
– even though we are not directly optimizing for that…
– Good results because we are preserving the ‘shape’ of the object
Size reduction by a factor of 3 when using the quantized scheme
• Compression decreases for increasing K
summary
• 1-bit quantizer per dimension is sufficient to preserve the k-Means outcome 'as well as possible'
• Theoretically the results will be identical (under conditions)
• Good ‘shape’ preservation
Future work:
• Multi-bit quantization
• Multi-dimension quantization
end..