Privacy-Preserving Outlier Detection
Privacy-Preserving
Data Mining
Jaideep Vaidya
([email protected])
Joint work with
Chris Clifton (Purdue University)
Outline
• Introduction
– Privacy-Preserving Data Mining
– Horizontal / Vertical Partitioning of Data
– Secure Multi-party Computation
• Privacy-Preserving Outlier Detection
• Privacy-Preserving Association Rule
Mining
• Conclusion
[Figure: timeline — “Back in the good ol’ days,” “Now,” “Future” — illustrated with the grocery chains Dominick’s, Safeway, and Jewel.]
A “real” example
• Ford / Firestone
– Individual databases
– Possible to join both databases (find corresponding
transactions)
– Commercial reasons to not share data
– Valuable corporate information - Cost structures /
business structures
• Ford Explorers with Firestone tires → Tread
Separation Problems (Accidents!)
• Might have been able to figure it out a bit earlier
(tires from the Decatur, Ill. plant, in certain situations)
Public (mis)Perception of Data
Mining: Attack on Privacy
• Fears of loss of privacy constrain data
mining
– Protests over a National Registry in Japan
– Data Mining Moratorium Act
• Would stop all data mining R&D by DoD
• Terrorism Information Awareness ended
– Data Mining could be key technology
Is Data Mining a Threat?
• Data Mining summarizes data
– (Possible?) exception: Anomaly / Outlier
detection
• Summaries aren’t private
– Or are they?
– Does generating them raise issues?
• Data mining can be a privacy solution
– Data mining enables safe use of private data
Privacy-Preserving Data
Mining
• How can we mine data if we cannot see it?
• Perturbation
– Agrawal & Srikant, Evfimievski et al.
– Extremely scalable, approximate results
– Debate about security properties
• Cryptographic
– Lindell & Pinkas, Vaidya & Clifton
– Completely accurate, completely secure (tight bound
on disclosure), appropriate for small number of
parties
• Condensation/Hybrid
Assumptions
• Data distributed
– Each data set held by source authorized to
see it
– Nobody is allowed to see aggregate data
• Knowing all data about an individual violates
privacy
• Data holders don’t want to disclose data
– Won’t collude to violate privacy
Gold Standard:
Trusted Third Party
Horizontal Partitioning of Data
Bank of America:
CC#  | Active? | Delinquent? | Amount
123  | Yes     | Yes         | <$300
324  | No      | No          | $300-500
919  | Yes     | No          | >$1000

Chase Manhattan:
CC#  | Active? | Delinquent? | Amount
3450 | Yes     | Yes         | <$300
4127 | No      | No          | $300-500
8772 | Yes     | No          | >$1000
Vertical Partitioning of Data
Global Database View

Medical Records:
TID | Brain Tumor? | Diabetes?
RPJ | Yes          | Diabetic
CAC | No Tumor     | No
PTR | No Tumor     | Diabetic

Cell Phone Data:
TID | Model | Battery
RPJ | 5210  | Li/Ion
CAC | none  | none
PTR | 3650  | NiCd
Secure Multi-Party Computation
(SMC)
• Given a function f and n inputs x_1, x_2, …, x_n
distributed at n sites, compute the result
y = f(x_1, x_2, …, x_n)
while revealing nothing to any site except its own
input(s) and the result.
Secure Multi-Party Computation
It can be done!
• Yao’s Millionaire’s problem (Yao ’86)
– Secure computation possible if function can be
represented as a circuit
– Idea: Securely compute gate
• Continue to evaluate circuit
• Extended to multiple parties (BGW/GMW ’87)
• Biggest Problem - Efficiency
– Will not work for lots of parties / large quantities of
data
SMC – Models of
Computation
• Semi-honest Model
– Parties follow the protocol faithfully
• Malicious Model
– Anything goes!
• Provably Secure
• In either case, input can always be
modified
What is an Outlier?
• An object O in a dataset T is a DB(p,dt)-outlier if at least
fraction p of the objects in T lie at distance greater than
dt from O
• Centralized solution from Knorr and Ng
– Nested loop comparison
– Maintain count of objects inside threshold
– If count exceeds threshold, declare non-outlier and move to next
• Clever processing order minimizes I/O cost
Privacy-Preserving
Solution
• Key idea: share splitting
– Computations leave results (randomly) split between
parties
– Only outcome is if the count of points within distance
threshold exceeds outlier threshold
• Requires pairwise comparison of all points
– But failure to compare all points reveals information
about non-outliers
• This alone makes it possible to cluster points
• This is a privacy violation
– Asymptotically equivalent to Knorr & Ng
Solution: Horizontal Partition
• Compare locally with your own points
• For remote points, get random share of distance
– Calculate random share of “exceeds threshold or
doesn’t”
• Sum shares and test if enough “close” points
[Figure: example points with pairwise distances, and the random additive shares of the distance values and threshold-indicator bits held by each party.]
Random share of distance
• Distance^2(X, Y) = Σ_{r=1}^{m} (x_r − y_r)^2
• = Σ_{r=1}^{m} x_r^2 + Σ_{r=1}^{m} y_r^2 − 2 Σ_{r=1}^{m} x_r y_r
• Σ x_r^2 and Σ y_r^2 are local; the sum of x_r y_r is a scalar product
– Several protocols for share-splitting scalar product
(Du&Atallah’01; Vaidya&Clifton’02; Ioannidis, Grama,
Atallah’02)
Shares of “Within Threshold”
• Goal: is x + y ≤ dt ?
• Essentially Yao’s Millionaires’ problem
(Yao’86)
– Represent function to be computed as circuit
– Cryptographic protocol gives random shares
of each wire
• Solves “sum of shares from within dt
exceeds minimum” as well
Vertically Partitioned Data
• Each party computes its part of distance
– Distance^2(X, Y) = Σ_{r=1}^{m} (x_r − y_r)^2
– = Σ_{r=1}^{a} (x_r − y_r)^2 + Σ_{r=a+1}^{m} (x_r − y_r)^2
• Secure comparison (circuit evaluation)
gives each party shares of 1/0 (close/not)
• Sum and compare as with horizontal
partitioning
Why is this Secure?
• Random shares indistinguishable from random
values
– Contain no knowledge in isolation
– Assuming no collusion – so shares viewed in isolation
• Number of values (= number of shares) known
– Nothing new revealed
• Too few close points is outlier definition
– This is the desired result
• No knowledge that can’t be discovered from
one’s own input and the result!
Conclusion (Outlier Detection)
• Outlier detection feasible without revealing
anything but the outliers
– Possibly expensive (quadratic)
– But more efficient solution for this definition of outlier
inherently reveals potential privacy-violating
information
• Key: Privacy of non-outliers preserved
– Reason why outliers are outliers also hidden
• Allows search for “unusual” entities without
disclosing private information about entities
Association Rules
• Association rules a common data mining task
– Find A, B, C such that AB ⇒ C holds frequently (e.g.
Diapers ⇒ Beer)
• Fast algorithms for centralized and distributed
computation
– Basic idea: For AB ⇒ C to be frequent, AB, AC, and
BC must all be frequent
– Require sharing data
• Secure Multiparty Computation too expensive
Association Rule Mining
• Find out if itemset {A1, B1} is frequent (i.e. If support of
{A1, B1} ≥ k)
Party A:
Key | A1
k1  | 1
k2  | 0
k3  | 0
k4  | 1
k5  | 1

Party B:
Key | B1
k1  | 0
k2  | 1
k3  | 0
k4  | 1
k5  | 1
• Support of itemset is defined as number of transactions
in which all attributes of the itemset are present
• For binary data, support = |{i : A_i ∧ B_i = 1}|.
Association Rule Mining
• Idea based on TID-list representation of
data
– Represent attribute A as TID-list Atid
– Support of ABC is | Atid ∩ Btid ∩ Ctid |
• Use a secure protocol to find size of set
intersection to find candidate sets
Cardinality of Set Intersection
• Use a secure commutative hash function: Pohlig-Hellman encryption
• Each party generates its own encryption key
• All parties encrypt all the input sets
• Commutativity: E_1(E_2(…E_k(X)…)) = E_l(E_i(…E_j(X)…)) for any ordering of the keys
• Result: # common objects in all sets
Cardinality of Set Intersection
• Hashing
– All parties hash all sets with their key
• Initial intersection
– Each party finds intersection of all sets
(except its own)
• Final intersection
– Parties exchange the final intersection set,
and compute the intersection of all sets
Computing Size of
Intersection
[Figure: three parties with private sets X = {α, λ, σ, β}, Y = {λ, σ, φ, υ, β}, Z = {α, β, κ, λ, γ}. Each set is encrypted in turn by every party's key (e.g. E3(E2(E1(X)))); each party intersects the fully encrypted sets it did not originate (X ∩ Z = {α, β, λ}, Y ∩ Z = {λ, β}, X ∩ Y = {λ, σ, β}), and the final exchange yields X ∩ Y ∩ Z = {λ, β}.]
Proof of Security
• Proof by Simulation
• What is known
– The size of the intersection set
– Site i learns
| ∩_{p∈C} S_p | for every C ⊆ {0, …, k−1} with i ∈ C and |C| ≥ 2
• How it can be simulated
– Protocol is symmetric, simulating view of one
party is sufficient
Proof of Security
• Hashing
– Party i receives encrypted set from party i-1
– Can use random numbers to simulate this
• Intersection
– Party i receives fully hashed sets of all parties
Simulating Fully Encrypted
Sets
|ABC| = 2, |AB| = 3, |AC| = 4, |BC| = 2, |A| = 6, |B| = 7, |C| = 8
Disjoint region sizes (inclusion-exclusion):
ABC: 2
AB only: 3 − 2 = 1
AC only: 4 − 2 = 2
BC only: 2 − 2 = 0
A only: 6 − 2 − 1 − 2 = 1
B only: 7 − 2 − 1 − 0 = 4
C only: 8 − 2 − 2 − 0 = 4

Simulated encrypted sets built from fresh random values R1–R14: R1, R2 in A, B, and C; R3 in A and B; R4, R5 in A and C; R6 in A only; R7–R10 in B only; R11–R14 in C only.
Optimized version
Association Rule Mining
(Revisited)
• Naïve algorithm: simply use Apriori; a single set
intersection determines the frequency of a single
candidate itemset
– Thousands of itemsets
• Key intuition
– Set Intersection algorithm developed also allows
computation of intermediate sets
– All parties get fully encrypted sets for all attributes
– Local computation allows efficient discovery of all
association rules
Communication Cost
• k parties, m set size, p frequent attributes
– k*(2k-2) = O(k2) messages
– p*(2p-2)*m*encrypted message size = O(p2m)
bits
– k rounds
• Independent of number of itemsets found
Other Results
• ID3 Decision Tree learning
– Horizontal Partitioning: Lindell&Pinkas ’00
– Also vertical partitioning (Du, Vaidya)
• Association Rules
– Horizontal Partitioning: Kantarcıoğlu
• K-Means / EM Clustering
• K-Nearest Neighbor
• Naïve Bayes, Bayes network structure
• And many more
Challenges
• What do the results reveal?
• A general approach (instead of per data
mining technique)
• Experimental results
• Incentive Compatibility
• Note: Upcoming book in the Advances in
Information Security series by Springer-Verlag
Questions