Privacy Preserving Data Mining on Vertically Partitioned Data
Download
Report
Transcript Privacy Preserving Data Mining on Vertically Partitioned Data
Privacy Preserving K-means
Clustering on Vertically
Partitioned Data
Presented by: Jaideep Vaidya
Joint work: Prof. Chris Clifton
Overview
• Global Problem
– Privacy Preserving Distributed Data Mining
• Specific Problem
– Clustering (K-Means)
• For
– Vertically Partitioned Data
• Using
– Cryptographic Tools
Vertical Partitioning of Data
Global Database View
TID
Brain Tumor?
Medical Records
RPJ
Yes
CAC No Tumor
PTR
No Tumor
Diabetes?
Model
Battery
Cell Phone Data
Diabetic
RPJ
5210
Li/Ion
No
CAC
none
none
Diabetic
PTR
3650
NiCd
Is the problem trivial?
Privacy Preserving Data
Mining
• Perturbation
– Agrawal & Srikant, Agrawal & Aggarwal,
– Rizvi & Haritsa, Evfimievski et al.
• Cryptographic
– Lindell & Pinkas, Du & Zhan
– Vaidya & Clifton, Kantarcioglu & Clifton
Secure Multiparty Computation
(SMC)
• Given a function f and n inputs, x , x ,...,x
1
2
distributed at n sites, compute the
result
y f x1 , x2 ,, xn
while revealing nothing
to any site except its own input(s) and
the result.
n
Results
• Cluster assignment for entities
– Not private
• Cluster centers
– Semi-private
2.3
34
19
15.5
5210 Li/Ion Piezo
Secure
K-means
clustering
K-means
clustering
'
'
'
,
,
,
Arbitrarily select k starting points 1 2
k
Repeat
– Assign 1' , 2' ,, k' to 1 , 2 ,, k
respectively
– (re)assign each object to closest cluster
based on distance from mean
– Re-compute the cluster means 1' , 2' ,, k'
Until no change
Assigning objects to closest
cluster
For everyobject/entity O,
1
2
k
P P
1
2
P
D O,
j
i
Compute arg min xij
i 1k j 1r
r
Key Idea
• Disguise site components with random
values
• Compare distances while revealing only
comparison result
• Permute order of clusters to conceal
meaning of comparison results
Closest Cluster Computation
• 3 special sites, P1, P2 and Pr
• P1 generates
– r random vectors Vi such that
– Permutation π (over 1 .. K)
r
V
i 1
i
0
Permutation Protocol
Du and Atallah ’01
E ( X ), E
V ,
B
A
X ( X V )
( E ( X V ))
Homomorphic encryption: Ek(x)*Ek(y) = Ek(x+y)
Closest Cluster Computation
P2
X2
P1
E2 ( X 2 ), E2
P3
( X 1 V1 )
( E2 ( X 2 V2 ))
( X 3 V3 )
P1
V ,
i
Pr
Er ( X r ), Er
i2
( X r 1 Vr 1 )
( Er ( X r Vr ))
Pr
Pr-1
Xr
Stage 1
X i Vi
Stage 2
Closest Cluster Computation
• Stage 3
– P2 and Pr determine i, the index of the cluster
with minimum distance
• Stage 4
– P1 computes and broadcasts
i
1
When to stop?
• Locally compute difference in means
• Globally known threshold
• Use simple random-adding technique to
disguise actual values
– First party adds random value to its distance and
sends to next party
– Each party adds its value to total and sends on
– Last party compares with first party’s random
+threshold
Communication Cost
• r parties, n data elements, m bit distances
Basic
Algorithm
Optimized
Algorithm
Generic
Method
Non-Secure
Method
Bits
O(knr)
Rounds
O(r+k)
O(kmr)
O(r)
O(kmnr3)
1
O(n)
1
Conclusion
• Presented a solution for Privacy
Preserving K-Means Clustering problem
• How to use clusters?
• Will parties share required information for
the possible benefits?
• Improve Efficiency
• Working on EM-Clustering,
implementations