Privacy Preserving Data Mining

Haiqin Yang
Extracted from a ppt "Secure Multiparty Computation and Privacy", with "Privacy Preserving SVM" added.
Outline

- Motivation
- Privacy
- Secure computation and privacy
- Privacy preserving SVM
- Related to 3D-LBS
- Challenges
Motivation

- Huge databases exist in various applications:
  - Medical data
  - Consumer purchase data
  - Census data
  - Communication and media-related data
  - Data gathered by government agencies
- Can these data be utilized?
  - For medical research
  - For improving customer service
  - For homeland security
Motivation

- Data sharing is necessary for full utilization:
  - Pooling medical data can improve the quality of medical research
  - Pooling information from different government agencies can provide a wider picture:
    - What is the health status of citizens who are supported by social welfare?
    - Are there citizens who receive simultaneous support from different agencies?
- Data gathered by the government (e.g., census data) should be publicly available
Motivation

- The huge amount of data available means that it is possible to learn a lot of information about individuals from public data:
  - Purchasing patterns
  - Family history
  - Medical data
  - …
Privacy

- Human definition:
  - Privacy and autonomy: information that is personal, confidential, or private should not be unnecessarily distributed or publicly known
  - Privacy and control: personal or private information should not be misused (whatever that means)
- Difficulties in formulating this mathematically:
  - The same information is classified differently by different people
  - Legitimate use is interpreted differently by different people
Secure computation and privacy

- Secure computation
  - Assume that there is a function that all parties wish to compute
  - Secure computation shows how to compute that function in the safest way possible
  - In particular, it guarantees minimal information leakage (the output only)
- Privacy
  - Does the function output itself reveal "sensitive information", or
  - should the parties agree to compute this function at all?
This talk

- Secure multiparty computation
  - Traces back to the "two millionaires problem" (A. Yao, 1982), or earlier
- Privacy preserving data mining
  - Privacy preserving SVM (Vaidya et al., KIS 2007)
Secure multiparty computation

- A set of parties with private inputs
- Objective: jointly compute a function of the inputs so that certain security properties (like privacy and correctness) are preserved
- Applications: secure elections, auctions, …
- Properties must be ensured even if some of the parties maliciously attack the protocol
Secure computation tasks

- Examples:
  - Authentication protocols
  - Online payments
  - Auctions
  - Elections
  - Privacy preserving data mining
  - Essentially any task…
The real model

[Diagram: two parties with private inputs x and y run the protocol directly with each other; each ends up with the protocol output.]
The ideal model

[Diagram: the parties send their inputs x and y to a trusted party, which computes the function and returns f1(x,y) to the first party and f2(x,y) to the second.]
The security definition

[Diagram: for every real adversary A attacking the protocol interaction (REAL), there exists an adversary S interacting with the trusted party (IDEAL) that achieves the same effect.]
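In symbols, the standard simulation-based formulation this picture depicts (the slide itself does not spell it out) is:

```latex
% A protocol \pi securely computes f if every real-model attack by A can be
% simulated by some ideal-model adversary S, making the two executions
% computationally indistinguishable:
\forall A \;\exists S:\quad
\{\mathrm{REAL}_{\pi,A}(x,y)\}_{x,y}
\;\stackrel{c}{\equiv}\;
\{\mathrm{IDEAL}_{f,S}(x,y)\}_{x,y}
```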
Why this approach?

- General – it captures all applications
- The specifics of an application are defined by its functionality; security is defined as above
- The security guarantees achieved are easily understood (because the ideal model is easily understood)
- We can be confident that we did not "miss" any security requirements
The ideal model – More details

- The definition we gave suffices in the case of an honest majority
- When there is no honest majority:
  - Guaranteed output delivery cannot be achieved
  - Fairness cannot be achieved
- Changes to the ideal model:
  - Corrupted parties receive output first
  - The adversary decides if honest parties receive their outputs as well
  - This is called security with abort
Defects to the ideal model

- When there is no honest majority, fairness and guaranteed output delivery cannot be achieved
  - This "defect" is included in the ideal model
- The same approach can be used to model partial information leakage:
  - The parties wish to compute a function f, but the protocol leaks more information
  - This can be modeled by having the trusted party explicitly leak this information
  - This helps for efficiency considerations
- Advantage: the defect is explicit!
Privacy preserving data mining

- Setting
  - Data is distributed at different sites
  - These sites may be third parties (e.g., hospitals, government bodies) or individuals
- Aim
  - Compute the data mining algorithm on the data so that nothing but the output is learned
  - That is, carry out a secure computation
Privacy ≠ Security

- Secure computation only deals with the process of computing the function
- It does not ask whether or not the function should be computed
Privacy and secure computation

- Secure computation can be used to solve any distributed data-mining problem
- A two-stage process:
  - Decide that the function/algorithm should be computed – an issue of privacy
  - Apply secure computation techniques to compute it securely – security
- But not every privacy problem can be cast as a distributed computation
Privacy preserving SVM classification (Vaidya et al. KIS 2007)
SVM introduction

- Data
- Objective

[Figure: two-dimensional SVM illustration of a separating hyperplane, with axes x1 and x2.]

Problem: How to pass the kernel matrix among different parties?
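The formulas on this slide were lost in extraction; the standard setup presumably intended is the soft-margin SVM in its dual form, which makes clear why only the kernel matrix is needed:

```latex
% Data: N labeled examples with d features each.
\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}, \quad \mathbf{x}_i \in \mathbb{R}^d,\; y_i \in \{-1, +1\}
% Objective (dual soft-margin SVM): the data enter only through K(x_i, x_j).
\max_{\boldsymbol{\alpha}} \;\sum_{i=1}^{N} \alpha_i
- \frac{1}{2} \sum_{i=1}^{N}\sum_{j=1}^{N}
\alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)
\quad \text{s.t.}\;\; 0 \le \alpha_i \le C,\quad \sum_{i=1}^{N} \alpha_i y_i = 0
```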
Case 1: On vertically partitioned data

- A bank, a health insurance company, and an auto insurance company collect different information about the same people

[Figure: an N × d data matrix split column-wise into blocks D1 and D2, one block of features per party.]

- Linear kernel
- Polynomial kernel
- RBF kernel
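The kernel formulas themselves did not survive extraction, but with vertical partitioning each party k holds a slice x^(k) of every record, so the three kernels listed all reduce to locally computable pieces that are then merged (one natural reading, consistent with the secure-merge step that follows, assuming the usual kernel definitions):

```latex
% Linear kernel: a sum of per-party local dot products.
K(\mathbf{x}_i,\mathbf{x}_j) = \mathbf{x}_i \cdot \mathbf{x}_j
  = \sum_{k} \mathbf{x}_i^{(k)} \cdot \mathbf{x}_j^{(k)}
% Polynomial kernel: apply the power after merging the linear part.
K(\mathbf{x}_i,\mathbf{x}_j) = \bigl(\mathbf{x}_i \cdot \mathbf{x}_j + c\bigr)^{p}
% RBF kernel: squared distances also decompose across parties.
K(\mathbf{x}_i,\mathbf{x}_j) = \exp\Bigl(-\gamma \sum_{k}
  \bigl\lVert \mathbf{x}_i^{(k)} - \mathbf{x}_j^{(k)} \bigr\rVert^{2} \Bigr)
```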
Secure merge

- Assumption: K parties, each Pi holds vi
- Procedure:
  - P0 chooses a random number R from a uniform distribution over F
  - P0 sends (R + v0) mod |F| to P1
  - Pi receives (R + Σ_{j<i} vj) mod |F| and sends (R + Σ_{j≤i} vj) mod |F| to P(i+1)
  - When the masked total returns to P0, it subtracts R to recover Σ vj mod |F|
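A minimal runnable sketch of this secure-sum procedure (assuming honest-but-curious parties and a public modulus F chosen large enough that the true total cannot wrap around):

```python
import random

# Each party adds its private value to a masked running total and
# forwards it; only P0 can remove the mask, so intermediates reveal nothing.
F = 2**61 - 1  # public modulus (illustrative choice)

def secure_sum(values):
    """values[i] is party P_i's private input; only the total is revealed."""
    R = random.randrange(F)                 # P0's uniform random mask
    running = (R + values[0]) % F           # P0 sends this to P1
    for v in values[1:]:                    # each P_i adds v_i and forwards
        running = (running + v) % F
    return (running - R) % F                # P0 removes the mask R

if __name__ == "__main__":
    print(secure_sum([17, 42, 8, 23]))      # -> 90; intermediates stay masked
```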
Case 2: On horizontally partitioned data

- Different banks collect data for their own customers

[Figure: an N × d data matrix split row-wise into blocks D1 and D2, one block of records per party.]
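Here the roles flip: each party can fill in the Gram-matrix block for its own records locally, but the cross blocks pair records held by different parties and need a secure dot-product protocol. A small illustration of that block structure (D1, D2, and all names are illustrative stand-ins, not from the paper):

```python
import numpy as np

# Block structure of the linear-kernel Gram matrix under horizontal
# partitioning: diagonal blocks are local, the off-diagonal block is not.
rng = np.random.default_rng(0)
D1 = rng.standard_normal((3, 5))   # party 1: 3 records, all d=5 features
D2 = rng.standard_normal((4, 5))   # party 2: 4 records, same features

K11 = D1 @ D1.T                    # computable locally by party 1
K22 = D2 @ D2.T                    # computable locally by party 2
K12 = D1 @ D2.T                    # cross block: needs a secure protocol

K = np.block([[K11, K12],          # full Gram matrix the SVM consumes
              [K12.T, K22]])
print(K.shape)                     # (7, 7)
```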
Case 3: On arbitrarily partitioned data
How to do it?

- Homomorphic encryption: there is an operation ∘ on ciphertexts such that E(x) ∘ E(y) = E(x * y), where * is either addition or multiplication (in some abelian group)
- Existing cryptosystems:
  - Goldwasser–Micali (Blum M, Goldwasser S 1984)
  - Benaloh cryptosystem (Benaloh JC 1986)
  - Naccache–Stern cryptosystem (Naccache D, Stern J 1998)
  - Paillier cryptosystem (Paillier P 1999)
  - Okamoto–Uchiyama cryptosystem (Okamoto T, Uchiyama S 1998)
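For example, Paillier is additively homomorphic: multiplying ciphertexts adds the underlying plaintexts. A toy sketch (the tiny primes make it completely insecure; they are chosen only so it runs instantly, and real use needs large primes and a vetted library):

```python
from math import gcd
import random

# Toy Paillier demo of E(x) * E(y) mod n^2 = E(x + y mod n).
p, q = 293, 433                    # toy primes (insecure key size)
n, n2 = p * q, (p * q) ** 2
g = n + 1                          # standard generator choice
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)   # lcm(p-1, q-1)

def L(u):
    return (u - 1) // n

mu = pow(L(pow(g, lam, n2)), -1, n)   # needs Python 3.8+ for pow(x, -1, m)

def encrypt(m):
    r = random.randrange(1, n)
    while gcd(r, n) != 1:              # r must be a unit mod n
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return (L(pow(c, lam, n2)) * mu) % n

c1, c2 = encrypt(20), encrypt(22)
assert decrypt((c1 * c2) % n2) == 42   # ciphertext product = plaintext sum
```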
Algorithm
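The algorithm listing on this slide did not survive extraction. Whatever protocol the parties use to merge the Gram matrix, though, the training step that follows can be an ordinary precomputed-kernel SVM; a hypothetical stand-in (X, y, and K below are fabricated placeholders, not the paper's data):

```python
import numpy as np
from sklearn.svm import SVC

# Train on a Gram matrix as if it had arrived via the secure protocols above.
rng = np.random.default_rng(1)
X = rng.standard_normal((20, 4))
y = np.where(X[:, 0] > 0, 1, -1)       # synthetic labels

K = X @ X.T                            # pretend: securely merged Gram matrix
clf = SVC(kernel="precomputed").fit(K, y)

K_new = X @ X.T                        # kernel(test points, training points)
print(clf.score(K_new, y))
```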
Other security issues

- Collusion with the third party Q
Related to 3D-LBS

- Issues
  - Some data may need to be encrypted before being used in research
  - Data miners may need to work on encrypted data
- Problems
  - Lack of knowledge of security and cryptography
  - Limited understanding of privacy
Future challenges

- Understand the real problem and what the real requirement is
  - A very non-trivial task, and one that requires interdisciplinary cooperation
  - Computer scientists should help to formalize the notion; lawyers, policy-makers, and social scientists should be involved in understanding the concerns
- Some challenges:
  - Reconciling cultural and legal differences relating to privacy in different countries
  - Understanding when privacy is "allowed" to be breached
Future challenges

- Appropriate modeling for secure computation
- Efficient protocols…
Conclusion

- Privacy-preserving data mining is truly needed
  - Data mining is being used by security agencies, governmental bodies, and corporations
  - Privacy advocates and citizen outcry often prevent positive uses of data mining
  - Good solutions (which are still currently out of reach) may be able to resolve the conflict