On K-Means Cluster Preservation using Quantization Schemes


On K-Means Cluster Preservation using Quantization Schemes
Deepak Turaga¹, Michalis Vlachos², Olivier Verscheure¹
¹ IBM T.J. Watson Research Center, NY, USA
² IBM Zürich Research Laboratory, Switzerland
overview – what we want to do…
• Examine under what conditions compression methodologies retain the
clustering outcome
• We focus on the K-Means algorithm
[Figure: k-Means on the original data and k-Means on the quantized data give identical clustering results (cluster 1, cluster 2, cluster 3)]
why we want to do that…
• Reduced Storage
– The quantized data will take up less space
• Faster execution
– Since the data can be represented in a more compact form, the clustering algorithm will require less runtime
• Anonymization/Privacy Preservation
– The original values are not disclosed
• Authentication
– Encode some message with the quantization
We will achieve the above and still guarantee the same clustering results
other cluster preservation techniques
[Figure: original vs. quantized data]
• We do not transform into another space
• Space requirements same – no data simplification
• Shape preservation
[Oliveira04] S. R. M. Oliveira and O. R. Zaane. Privacy Preservation When Sharing Data For Clustering, 2004.
[Parameswaran05] R. Parameswaran and D. Blough. A Robust Data Obfuscation Approach for Privacy Preservation of Clustered Data, 2005.
k-means overview
K-Means Algorithm:
1. Initialize k clusters randomly (k is specified by the user).
2. Repeat until convergence:
   1. Assign each object to the nearest cluster center.
   2. Re-estimate the cluster centers.
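A minimal sketch of the steps above in Python/NumPy; the function name, the initialization by picking k random objects, and the convergence check are illustrative choices, not prescribed by the slides:

```python
# Minimal sketch of the k-Means (Lloyd) loop described above (NumPy assumed).
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialize k cluster centers randomly (here: k random objects).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2.1 Assign each object to the nearest cluster center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # 2.2 Re-estimate each cluster center as the mean of its objects
        #     (empty clusters are not handled in this sketch).
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):   # converged
            break
        centers = new_centers
    return labels, centers
```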
k-means example
[Figure: 2-D k-means example]
k-means applications/usage
• Fast pre-clustering
• Real-time clustering (e.g. image, video effects)
– Color/Image segmentation (a small sketch follows below)
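As a toy illustration of the color/image segmentation use just mentioned, here is a hedged sketch; scikit-learn is assumed, and a random array stands in for a real image:

```python
# Sketch of k-means color segmentation (scikit-learn assumed).
import numpy as np
from sklearn.cluster import KMeans

img = np.random.default_rng(0).random((64, 64, 3))    # stand-in RGB image, values in [0, 1]
pixels = img.reshape(-1, 3)                            # one object per pixel

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(pixels)
segmented = km.cluster_centers_[km.labels_].reshape(img.shape)  # each pixel -> its cluster's color
```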
k-means objective function
• Objective: Minimize the sum of intra-class variance

  $\min \sum_{j=1}^{k} \sum_{x_i \in C_j} \sum_{d=1}^{D} (x_{i,d} - \mu_{j,d})^2$

  where $\mu_j$ is the cluster centroid of $C_j$, $j$ ranges over the clusters, and $d$ over the dimensions/time instances.

  After some algebraic manipulations:

  $\sum_{j=1}^{k} \sum_{d=1}^{D} \Big( \sum_{x_i \in C_j} x_{i,d}^2 \;-\; N_j\,\mu_{j,d}^2 \Big)$

  i.e. the objective depends only on the 2nd moment ($\sum_i x_{i,d}^2$) and the 1st moment ($\mu_{j,d}$) of the objects in each cluster, per dimension.
k-means objective function
So we can preserve the k-Means outcome if:
• We maintain the cluster assignment
• We preserve the 1st and 2nd moments of the cluster objects, per cluster and per dimension/time instance
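A quick numeric sanity check of this decomposition (a sketch with synthetic data, not from the paper): the intra-class variance of a cluster equals its 2nd moment minus N times the squared 1st moment, per dimension.

```python
# Check that sum_i (x_i - mu)^2 == sum_i x_i^2 - N * mu^2, per dimension,
# which is why only the 1st and 2nd moments of each cluster matter.
import numpy as np

rng = np.random.default_rng(1)
C = rng.normal(size=(50, 4))                 # the objects of one cluster, 4 dimensions
mu = C.mean(axis=0)                          # 1st moment (the centroid)

direct  = ((C - mu) ** 2).sum()              # intra-class variance term of the objective
moments = ((C ** 2).sum(axis=0) - len(C) * mu ** 2).sum()   # 2nd moment - N * (1st moment)^2

assert np.isclose(direct, moments)
```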
moment preserving quantization
• 1st moment: average
• 2nd (central) moment: variance
• 3rd moment: skewness
• 4th moment: kurtosis
In order to preserve the first and second moments we will use the following 1-bit quantizer. For the $N$ values of one dimension of one cluster, with mean $\mu$, standard deviation $\sigma$, and $N_g$ values above the mean:

  $Q_{high} = \mu + \sigma \sqrt{(N - N_g)/N_g}$   (everything above the mean value is ‘snapped’ here)
  $Q_{low} = \mu - \sigma \sqrt{N_g/(N - N_g)}$   (everything below the mean value is ‘snapped’ here)
original     quantized
-2.4240      -1.4795
-0.2238       0.2049
 0.0581       0.2049
-0.4246       0.2049
-0.2029       0.2049
-1.5131      -1.4795
-1.1264      -1.4795
-0.8150      -1.4795
 0.3666       0.2049
-0.5861      -1.4795
 1.5374       0.2049
 0.1401       0.2049
-1.8628      -1.4795
-0.4542       0.2049
-0.6521      -1.4795
 0.1033       0.2049
-0.2206       0.2049
-0.2790       0.2049
-0.7337      -1.4795
-0.0645       0.2049

average = -0.4689    average = -0.4689

With $\mu = -0.4689$, $\sigma$ the standard deviation of the original values, $N = 20$, and $N_g$ the number of values above the mean:

  $Q_{high} = \mu + \sigma \sqrt{(N - N_g)/N_g} = 0.2049$   (everything above the mean value is ‘snapped’ here)
  $Q_{low} = \mu - \sigma \sqrt{N_g/(N - N_g)} = -1.4795$   (everything below the mean value is ‘snapped’ here)
These are the values for one dimension d (or time instance d) and for one cluster of objects.
The process is repeated for all dimensions and for all clusters – we have one quantizer per class.
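A small sketch (assuming NumPy) of this 1-bit moment-preserving quantizer applied to the 20 example values above; it reproduces the two levels 0.2049 and −1.4795 and preserves the mean and variance:

```python
# 1-bit moment-preserving quantizer for one dimension of one cluster,
# applied to the example values listed above.
import numpy as np

x = np.array([-2.4240, -0.2238,  0.0581, -0.4246, -0.2029, -1.5131, -1.1264,
              -0.8150,  0.3666, -0.5861,  1.5374,  0.1401, -1.8628, -0.4542,
              -0.6521,  0.1033, -0.2206, -0.2790, -0.7337, -0.0645])

mu, sigma = x.mean(), x.std()           # 1st moment and standard deviation of the cluster
N  = len(x)
Ng = int((x > mu).sum())                # number of values above the mean

high = mu + sigma * np.sqrt((N - Ng) / Ng)    # level for values above the mean  (~ 0.2049)
low  = mu - sigma * np.sqrt(Ng / (N - Ng))    # level for values below the mean  (~ -1.4795)
q = np.where(x > mu, high, low)

# Mean and variance are preserved by construction.
assert np.isclose(q.mean(), x.mean()) and np.isclose(q.var(), x.var())
```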
our quantization
• One quantizer per class
• The quantized data are binary
our quantization
• The fact that we have one quantizer per class means that we need to run k-Means once before we quantize (see the sketch below)
• This is not a shortcoming of the technique, as we need to know the cluster boundaries in order to know how much we can simplify the data
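An end-to-end sketch of the pipeline just described: run k-Means once, build one moment-preserving quantizer per cluster and per dimension, re-cluster the quantized data, and compare the two partitions. scikit-learn and synthetic, well-separated clusters are assumptions of this sketch (preservation is only guaranteed for well-formed clusters, as discussed later):

```python
# End-to-end sketch: cluster, quantize per cluster/per dimension, re-cluster,
# and compare the two partitions (scikit-learn assumed; synthetic data).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
# Three well-separated 2-D clusters.
X = np.vstack([rng.normal(c, 0.3, size=(100, 2)) for c in ([0, 0], [5, 5], [0, 5])])

k = 3
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

# One 1-bit moment-preserving quantizer per cluster and per dimension.
Xq = X.copy()
for j in range(k):
    idx = labels == j
    C = X[idx]
    for d in range(X.shape[1]):
        mu, sigma = C[:, d].mean(), C[:, d].std()
        above = C[:, d] > mu
        Ng, N = int(above.sum()), len(C)
        high = mu + sigma * np.sqrt((N - Ng) / Ng)
        low  = mu - sigma * np.sqrt(Ng / (N - Ng))
        Xq[idx, d] = np.where(above, high, low)

labels_q = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Xq)
print(adjusted_rand_score(labels, labels_q))   # 1.0 when the clustering is preserved
```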
why quantization works?
• Why does the clustering remain the same before and after quantization?
– Centers do not change (the cluster averages remain the same)
– Cluster assignments do not change, because clusters ‘shrink’ due to quantization
will it always work?
• The results will be the same for datasets with well-formed clusters
• A discrepancy in the results means that the clusters were not that dense
recap
• Use moment preserving quantization to preserve objective
function
• Due to cluster shrinkage, cluster assignments will not change
• Identical results for optimal k-Means
• One quantizer per class
• 1-bit quantizer per dimension
example: shape preservation
[Figure: original vs. quantized data – the overall shape is preserved]
[Bagnall06] A. J. Bagnall, C. A. Ratanamahatana, E. J. Keogh, S. Lonardi, and G. J. Janacek. A Bit Level Representation for Time Series Data
Mining with Shape Based Similarity. In Data Min. Knowl. Discov. 13(1), pages 11–40, 2006.
example: cluster preservation
• 3 years of Nasdaq stock ticker data
• We cluster into k = 8 clusters
• Confusion matrix: 3% mislabeled data
• With Binary Clipping: 80% mislabeled
[Figure: confusion matrix over clusters 1–8]
Cluster centers after the moment-preserving quantization: the quantization levels indicate the cluster spread. In each dimension the two levels are $\mu$ plus/minus a multiple of $\sigma$, so the gap between them,
$\sigma \left( \sqrt{(N - N_g)/N_g} + \sqrt{N_g/(N - N_g)} \right)$,
grows with the cluster’s standard deviation $\sigma$.
example: label preservation
• 2 datasets
– Contours of fish
– Contours of leaves (Acer platanoides, Tilia, Salix fragilis, Quercus robur)
• Clustering and then k-NN voting
• For rotation invariance we use rotation-invariant features
[Figure: example fish and leaf contours]
example: label preservation
• Very low mislabeling error for MPQ (moment-preserving quantization)
• High error rate for Binary Clipping
other nice characteristics
• Low sensitivity to initial centers
– Mismatch when starting from different centers is around 7%
• Neighborhood preservation
– even though we are not directly optimizing for that…
– Good results because we are preserving the ‘shape’ of the objects
[Figure: neighborhood preservation for two example objects A and B]
• Size reduction by a factor of 3 when using the quantized scheme
– Compression reduces for increasing K
summary
• A 1-bit quantizer per dimension is sufficient to preserve k-Means ‘as well as possible’
• Theoretically the results will be identical (under conditions)
• Good ‘shape’ preservation
Future work:
• Multi-bit quantization
• Multi-dimensional quantization
end..