Privacy-Preserving Clustering
Download
Report
Transcript Privacy-Preserving Clustering
Privacy-Preserving Clustering
Outline
Introduction
Related Work
Preliminaries
Secure Multi-Party Computation
Data Sanitization
Yao’s Millionaires’ Problem
Homomorphic Encryption
Privacy-Preserving K-Means Clustering
Conclusion
Introduction
Why needs privacy-preserving?
Data sharing in today's globally networked systems poses a
threat to individual privacy and organizational confidentiality.
The privacy problem is not data mining, but the way data
mining is done.
So, privacy and data mining can coexist.
An important data mining problem: clustering.
Related Work
Privacy-preserving clustering:
Secure multi-party computation.
High computation and communication costs.
Data sanitization.
Lost of accuracy.
Dimensionality reduction.
Model-based solutions.
Yao’s Millionaires’ Problem
Millionaires’ problem:
Two millionaires wish to know who is richer; however, they
do not want to find out any additional information about
each other’s wealth.
Solutions
Suppose
Alice has i millions.
Bob has j millions.
1 < i, j < 10.
Solutions
Suppose
1.
2.
Alice: i = 5, Bob: j = 3.
(B) x = 7, Ea(x) = 4 = k.
(B) k - j + 1 = 2.
x
Ea(x)
1
7
2
3
3
5
4
8
5
6
6
2
7
4
8
1
9
10
10
9
Solutions
3.
(A) y1 = Da(2) = Da(k - j + 1) = 6.
x
Ea(x)
y2 = Da(3) = 2.
1
7
yj = y3 = Da(4) = Da(k - j + j) = 7.
2
3
y4 = Da(5) = 3.
3
5
y5 = Da(6) = 5.
4
8
y6 = Da(7) = 1.
5
6
y7 = Da(8) = 4.
6
2
…
7
4
8
1
9
10
10
9
Solutions
4.
(A) z1 = y1 = 6.
z2 = y2 = 2.
z3 = y3 = Da(k - j + j) = 7.
z4 = y4 = 3.
z5 = yi = y5 = 5.
z6 = yi + 1 = y5 + 1 = 6.
z7 = y6 + 1 = 2.
…
5. (B) Check if z3 = x or not.
If yes, means that i ≧ j.
If no, means that i < j.
Homomorphic Encryption
Homomorphic encryption:
If there is an algorithm ⊕ to compute H(x⊕y) from H(x)
and H(y) that does not reveal x or y.
H(x⊕y) = H(x) ⊙ H(y)
RSA, …
Additive homomorphic:
H(x+y) = H(x) * H(y)
Paillier, …
Homomorphic Encryption
Privacy-Preserving K-Means Clustering
Over Vertically Partitioned Data
SIGKDD, 2003
Problem Definition
Goal:
Input:
Cluster the known set of common entities without revealing
any value that the clustering is based on.
Each user provides one attribute of all items.
Output:
Assignment of entities to clusters.
Cluster centers themselves.
K-Means Clustering
K-Means Clustering
cluster
decision
distance matrix
new center
computation
Vertically Partitioned Data
User 1
User 2
Terminology
r: # of users, each having different attributes for the
same set of items.
n: # of the common items.
k: # of clusters required.
ui: each cluster mean, i = 1, …, k.
uij: projection of the mean of cluster i on user j.
Final result for user j:
The final value / position of uij, i = 1, …, k.
Cluster assignments: clusti for all points i = 1, …, n.
Privacy-Preserving K-Means Clustering
Securely Finding the Closest Cluster
Securely Finding the Closest Cluster
The security of the algorithm is based on three key
ideas.
Disguise the site components of the distance with random
values that cancel out when combined.
Permute the order of clusters so the real meaning of the
comparison results is unknown.
Compare distances so only the comparison result is learned;
no party knows the distances being compared.
Securely Finding the Closest Cluster
Securely Finding the Closest Cluster
Securely Finding the Closest Cluster
Check Threshold
j
m
Conclusion
Horizontally partitioned data:
User 1
User 2