Privacy-Preserving Clustering


Outline

Introduction
Related Work
Preliminaries
  Secure Multi-Party Computation
  Data Sanitization
  Yao’s Millionaires’ Problem
  Homomorphic Encryption
Privacy-Preserving K-Means Clustering
Conclusion
Introduction

Why do we need privacy preservation?

Data sharing in today's globally networked systems poses a threat to individual privacy and organizational confidentiality.
The privacy problem is not data mining itself, but the way data mining is done.
So, privacy and data mining can coexist.
An important data mining problem: clustering.
Related Work

Privacy-preserving clustering approaches:

Secure multi-party computation: high computation and communication costs.
Data sanitization: loss of accuracy.
Dimensionality reduction.
Model-based solutions.
Yao’s Millionaires’ Problem

Millionaires’ problem: two millionaires wish to know who is richer; however, they do not want to find out any additional information about each other’s wealth.

Solutions

Suppose:
  Alice has i millions.
  Bob has j millions.
  1 < i, j < 10.
Solutions

Suppose Alice: i = 5, Bob: j = 3. ((A) marks a step taken by Alice, (B) a step taken by Bob.)

1. (B) Bob picks x = 7; Ea(x) = 4 = k.
2. (B) Bob sends k - j + 1 = 2 to Alice.

Encryption table for Alice's key Ea:

x:     1  2  3  4  5  6  7  8  9  10
Ea(x): 7  3  5  8  6  2  4  1  10  9
Solutions

3. (A) Alice decrypts the 10 consecutive values starting from the received number (Da is the inverse of the table above):
   y1 = Da(2) = Da(k - j + 1) = 6.
   y2 = Da(3) = 2.
   y3 = yj = Da(4) = Da(k - j + j) = Da(k) = x = 7.
   y4 = Da(5) = 3.
   y5 = Da(6) = 5.
   y6 = Da(7) = 1.
   y7 = Da(8) = 4.
   …
Solutions

4. (A) Alice keeps yu for u ≤ i = 5 and adds 1 for u > i, then sends z1, …, z10 to Bob:
   z1 = y1 = 6.
   z2 = y2 = 2.
   z3 = y3 = Da(k - j + j) = 7.
   z4 = y4 = 3.
   z5 = yi = y5 = 5.
   z6 = y6 + 1 = 2.
   z7 = y7 + 1 = 5.
   …
5. (B) Bob checks whether zj = z3 equals x.
   If yes, then i ≥ j.
   If no, then i < j.
   Here z3 = 7 = x, so Bob learns that i ≥ j (indeed 5 ≥ 3).
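The steps above can be made concrete with a small sketch. The toy RSA keypair, the value range, and the helper names (Ea, Da, millionaires) below are illustrative assumptions; the slides leave Ea/Da abstract and omit the random-prime reduction step of Yao’s full protocol.

```python
# A minimal sketch of the simplified protocol walked through above.
import random

# Alice's toy RSA keypair (assumption: tiny parameters for readability).
P, Q = 11, 13            # toy primes
N = P * Q                # public modulus, 143
PHI = (P - 1) * (Q - 1)  # 120
E = 7                    # public exponent, coprime to PHI
D = pow(E, -1, PHI)      # private exponent (Python 3.8+)

def Ea(x):               # encryption under Alice's public key
    return pow(x, E, N)

def Da(c):               # decryption with Alice's private key
    return pow(c, D, N)

RANGE = 10               # wealth values lie in 1..10

def millionaires(i, j):
    """True if Alice's i >= Bob's j; neither value is revealed to the other."""
    # 1. (B) Bob picks a random x and encrypts it under Alice's public key.
    x = random.randrange(2, N)
    k = Ea(x)
    # 2. (B) Bob sends Alice k - j + 1.
    msg = k - j + 1
    # 3. (A) Alice decrypts 10 consecutive values; the j-th one equals Bob's x.
    y = [Da(msg + u) for u in range(RANGE)]
    # 4. (A) Alice keeps y_u for u <= i and adds 1 for u > i, then sends z.
    z = [y[u] if u < i else y[u] + 1 for u in range(RANGE)]
    # 5. (B) Bob checks the j-th value: equal to x means i >= j.
    return z[j - 1] == x

print(millionaires(5, 3))   # True  -> i >= j, as in the example above
print(millionaires(2, 8))   # False -> i < j
```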
Homomorphic Encryption

Homomorphic encryption:
  There is an algorithm ⊙ that computes H(x ⊕ y) from H(x) and H(y) without revealing x or y:
  H(x ⊕ y) = H(x) ⊙ H(y)
  e.g., RSA (multiplicatively homomorphic), …

Additive homomorphic:
  H(x + y) = H(x) * H(y)
  e.g., Paillier, …
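To make the additive property concrete, here is a minimal Paillier-style sketch. The tiny hard-coded primes and variable names are assumptions purely for illustration; real deployments use large random primes (and math.lcm requires Python 3.9+).

```python
# Additively homomorphic encryption in the style of Paillier.
import math
import random

p, q = 11, 13                 # toy primes (assumption)
n = p * q                     # public modulus
n2 = n * n
g = n + 1                     # common simple choice of generator
lam = math.lcm(p - 1, q - 1)  # lambda = lcm(p-1, q-1)

def L(u):
    return (u - 1) // n       # L(u) = (u - 1) / n

mu = pow(L(pow(g, lam, n2)), -1, n)   # mu = (L(g^lambda mod n^2))^-1 mod n

def H(m):
    """Encrypt m (0 <= m < n): c = g^m * r^n mod n^2 for random r."""
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def dec(c):
    """Decrypt: m = L(c^lambda mod n^2) * mu mod n."""
    return (L(pow(c, lam, n2)) * mu) % n

x, y = 37, 58
# Additive homomorphism: multiplying ciphertexts adds the plaintexts.
assert dec((H(x) * H(y)) % n2) == (x + y) % n
print(dec((H(x) * H(y)) % n2))        # 95
```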
Privacy-Preserving K-Means Clustering over Vertically Partitioned Data
(Vaidya and Clifton, SIGKDD 2003)
Problem Definition

Goal: cluster the known set of common entities without revealing any value that the clustering is based on.

Input: each user provides one attribute of all items.

Output:
  Assignment of entities to clusters.
  The cluster centers themselves.
K-Means Clustering

(Figure: the k-means loop: distance matrix computation, cluster decision, new center computation.)
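For reference, a plain (non-private) k-means sketch matching the loop in the figure; the function name and the synthetic data are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Distance matrix: squared distance from every point to every center.
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        # Cluster decision: assign each point to its closest center.
        assign = dists.argmin(axis=1)
        # New center computation: mean of the points assigned to each cluster.
        new_centers = np.array([
            points[assign == c].mean(axis=0) if np.any(assign == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign

pts = np.vstack([np.random.randn(50, 2) + off for off in ([0, 0], [5, 5], [0, 5])])
centers, assign = kmeans(pts, k=3)
```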
Vertically Partitioned Data

(Figure: a single data table split by attributes between User 1 and User 2.)
Terminology

r: number of users, each having different attributes for the same set of items.
n: number of common items.
k: number of clusters required.
ui: the mean of cluster i, i = 1, …, k.
uij: the projection of the mean of cluster i on user j.

Final result for user j:
  The final value / position of uij, i = 1, …, k.
  Cluster assignments: clusti for all points i = 1, …, n.
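Vertical partitioning matters because the squared distance from an item to a cluster mean splits into per-user components, each computable locally from that user's attributes and its projection uij of the mean. A small sketch, with an assumed column split and variable names that are not from the paper:

```python
import numpy as np

point = np.array([2.0, 5.0, 1.0, 7.0])     # one item with 4 attributes
mean_i = np.array([1.0, 4.0, 3.0, 6.0])    # cluster mean u_i

# Assumption: user 1 holds attributes 0-1, user 2 holds attributes 2-3.
user_cols = {1: [0, 1], 2: [2, 3]}

# Each user j computes its local share of the squared distance using u_ij.
local_components = {
    j: float(((point[cols] - mean_i[cols]) ** 2).sum())
    for j, cols in user_cols.items()
}

# The full squared distance is just the sum of the local components.
assert np.isclose(sum(local_components.values()),
                  float(((point - mean_i) ** 2).sum()))
```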
Privacy-Preserving K-Means Clustering
Securely Finding the Closest Cluster

The security of the algorithm is based on three key ideas (the first is sketched after this list):

Disguise the site components of the distance with random values that cancel out when combined.
Permute the order of clusters so the real meaning of the comparison results is unknown.
Compare distances so only the comparison result is learned; no party knows the distances being compared.
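A minimal sketch of the first idea, assuming additive shares modulo a fixed power of two and a hypothetical helper name; it illustrates the cancellation trick, not the paper's actual protocol.

```python
import random

def additive_shares_of_zero(r, modulus=2**32):
    """Return r random values that sum to 0 modulo `modulus`."""
    shares = [random.randrange(modulus) for _ in range(r - 1)]
    shares.append((-sum(shares)) % modulus)
    return shares

# Local (secret) distance components held by r = 3 sites for one cluster.
components = [4, 9, 2]
noise = additive_shares_of_zero(len(components))

# Each site publishes only its disguised component.
disguised = [(c + z) % 2**32 for c, z in zip(components, noise)]

# Combining the disguised components recovers the true total distance,
# while no single site's true component is revealed.
assert sum(disguised) % 2**32 == sum(components)
```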
(Figures: step-by-step walkthrough of securely finding the closest cluster; Check Threshold.)
Conclusion

Horizontally partitioned data:

(Figure: a single data table split by rows between User 1 and User 2.)