Privacy Preserving Data Mining


Secure Multiparty Computation and Privacy
Yehuda Lindell
Bar-Ilan University
Motivation
 Huge databases exist in society today
 Medical data
 Consumer purchase data
 Census data
 Communication and media-related data
 Data gathered by government agencies
 Can this data be utilized?
 For medical research
 For improving customer service
 For homeland security
Motivation
 Data sharing is necessary for full utilization:
 Pooling medical data can improve the quality
of medical research
 Pooling of information from different government agencies can provide a wider picture
 What is the health status of citizens that are supported by social welfare?
 Are there citizens that receive simultaneous support from different agencies?
 Data gathered by the government (e.g.,
census data) should be publicly available
The Problem
 The huge amount of data available means that it is possible to learn a lot of information about individuals from public data
 Purchasing patterns
 Family history
 Medical data
 And much much more
Privacy Breaches
 Latanya Sweeney – testimony before the Privacy and Integrity
Advisory Committee of the Department of Homeland Security:

One problem is that people don’t understand what makes data
unique or identifiable. For example, in 1997 I was able to show
how medical information that had all explicit identifiers, such as
name, address and Social Security number removed could be
re-identified using publicly available population registers (e.g., a
voter list). In this particular example, I was able to show how
the medical record of William Weld, the governor of
Massachusetts at the time, could be re-identified using only his
date of birth, gender and ZIP. In fact, 87% of the population of
the United States is uniquely identified by date of birth (e.g.,
month, day and year), gender, and their 5-digit ZIP codes. The
point is that data that may look anonymous is not necessarily
anonymous.
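The re-identification Sweeney describes is essentially a linkage attack: joining a “de-identified” table with a public register on the quasi-identifiers date of birth, gender and ZIP code. The Python sketch below illustrates the idea on entirely made-up records (the names, dates and diagnoses are fictitious).

```python
# Toy linkage attack: re-identify "anonymized" medical records by joining them
# with a public register on quasi-identifiers (date of birth, gender, ZIP).
# All records below are fictitious and purely illustrative.

medical_records = [  # explicit identifiers (name, address, SSN) removed
    {"dob": "1950-03-14", "gender": "M", "zip": "02138", "diagnosis": "hypertension"},
    {"dob": "1981-11-02", "gender": "F", "zip": "02139", "diagnosis": "asthma"},
]

voter_list = [  # publicly available population register
    {"name": "A. Smith", "dob": "1950-03-14", "gender": "M", "zip": "02138"},
    {"name": "B. Jones", "dob": "1981-11-02", "gender": "F", "zip": "02139"},
]

def quasi_id(record):
    """The attribute combination used for linking (often unique per person)."""
    return (record["dob"], record["gender"], record["zip"])

# Index the public register by quasi-identifier, then link the medical table to it.
index = {}
for voter in voter_list:
    index.setdefault(quasi_id(voter), []).append(voter["name"])

for rec in medical_records:
    matches = index.get(quasi_id(rec), [])
    if len(matches) == 1:  # the quasi-identifier is unique, so the record is re-identified
        print(matches[0], "->", rec["diagnosis"])
```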
Motivation – Homeland Security
 Many different security agencies coexist
 These agencies are hesitant to share information
 Sometimes this is legitimate (if all agencies share all information, a single mole can compromise all agencies)
 Typically, they only share conclusions
 The problem: more patterns could be found if data, and not just conclusions, are shared
Motivation – Concrete Example
 Investigation at Stillwater State Correctional Facility, Minnesota
 Data mining software was applied to phone records from the prison
 A pattern linking calls between prisoners and a recent parolee was discovered
 The calling data was then mined again together with records of prisoners’ financial accounts
 The result: a large drug smuggling ring was discovered
Data Utilization or Privacy
 In many cases, due to public outcry, important data mining projects have been halted
 The Canadian “big brother” database
 The Total Information Awareness project
 It may be good that these projects were
scrapped in their current form, but it would be
better if we could have our cake and eat it too…
Privacy
 The human definition:
 Privacy and autonomy: information about us that we
feel is personal, confidential or private should not be
unnecessarily distributed or publicly known
 Privacy and control: Our personal or private information
should not be misused (whatever that means)
 How can we mathematically formulate this?
 The same information is classified differently by different
people
 Legitimate use is interpreted differently by different
people
Turing’s Concern
 No attempt has yet been made to show that the “computable” numbers
include all numbers which would naturally be regarded as computable.
All arguments which can be given are bound to be, fundamentally,
appeals to intuition, and for this reason rather unsatisfactory
mathematically. The real question at issue is “What are the possible
processes which can be carried out in computing a number?”
 The arguments which I shall use are of three kinds.
 A direct appeal to intuition.
 A proof of the equivalence of two definitions (in case the new
definition has a greater intuitive appeal).
 Giving examples of large classes of numbers which are
computable.
 Once it is granted that computable numbers are all “computable”
several other propositions of the same character follow. In particular, it
follows that, if there is a general process for determining whether a
formula of the Hilbert function calculus is provable, then the
determination can be carried out by a machine.
Secure Computation and Privacy
 Secure computation:
 Assume that there is a function that all parties
wish to compute
 Secure computation shows how to compute
that function in the safest way possible
 In particular, it guarantees minimal information
leakage (the output only)
 Privacy:
 Does the function output itself reveal “sensitive
information”, or
 Should the parties agree to compute this
function?
This Talk
 We focus on the problem of secure multiparty computation
 Definitional paradigms (the focus)
 Feasibility results
 Efficiency concerns
 We will also discuss the limitations of the
approach for solving problems of privacy
Secure Multiparty Computation
 A set of parties with private inputs
 Parties wish to jointly compute a function of their inputs so that certain security properties (like privacy and correctness) are preserved
 E.g., secure elections, auctions…
 Properties must be ensured even if some of
the parties maliciously attack the protocol
Secure Computation Tasks
 Examples:
 Authentication protocols
 Online payments
 Auctions
 Elections
 Privacy preserving data mining
 Essentially any task…
Security Requirements
 Consider a secure auction (with secret bids):
 An adversary may wish to learn the bids of all parties – to prevent this, require privacy
 An adversary may wish to win with a lower bid than the highest – to prevent this, require correctness
 But, the adversary may also wish to ensure that it always gives the highest bid – to prevent this, require independence of inputs
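One way to see how these three requirements fit together is to look at what an idealized trusted auctioneer would do. The sketch below is only an illustration of the ideal model for a first-price sealed-bid auction (the names and output format are assumptions), not a real protocol.

```python
# Sketch of an *ideal* first-price sealed-bid auction: a trusted party collects
# all bids before revealing anything, so
#   - privacy:                losing bids are never revealed,
#   - correctness:            the highest bid always wins,
#   - independence of inputs: no bidder can base its bid on another's bid.
# Illustrative only; a real protocol must achieve this without the trusted party.

def ideal_auction(bids):
    """bids: dict mapping bidder -> sealed bid, each sent privately to the trusted party."""
    winner = max(bids, key=bids.get)                   # correctness
    return {"winner": winner, "price": bids[winner]}   # only the outcome is announced

# Each bidder hands its bid directly to the trusted party and sees nothing else
# before fixing it, which is exactly independence of inputs.
print(ideal_auction({"alice": 120, "bob": 95, "carol": 110}))  # {'winner': 'alice', 'price': 120}
```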
Defining Security
 Option 1: analyze security concerns for each specific problem
 Auctions: as in the previous slide
 Elections: privacy and correctness only
 But then, it may be possible to condition a vote for a party on whether it will reach the minimum number of votes
 Problems:
 How do we know that all concerns are covered?
 Definitions are application dependent and need to be
redefined from scratch for each task
Defining Security – Option 2
 The real/ideal model paradigm for defining security [GMW,GL,Be,MR,Ca]:
 Ideal model: parties send inputs to a trusted party, who computes the function for them
 Real model: parties run a real protocol with no trusted help
 A protocol is secure if any attack on a real
protocol can be carried out in the ideal model
 Since no attacks can be carried out in the
ideal model, security is implied
The Real Model
[Diagram: the two parties, holding inputs x and y, interact directly via the protocol and each obtains the protocol output]
The Ideal Model
[Diagram: the parties send their inputs x and y to a trusted party, which returns f1(x,y) to the first party and f2(x,y) to the second]
The Security Definition:
[Diagram: for every real adversary A attacking the protocol interaction (REAL), there exists an adversary S interacting with the trusted party (IDEAL) that achieves the same effect]
Properties of the Definition
 Privacy:
 The ideal-model adversary cannot learn more about
the honest party’s input than what is revealed by the
function output
 Thus, the same is true of the real-model adversary
 Correctness:
 In the ideal model, the function is always computed
correctly
 Thus, the same is true in the real-model
 Others:
 For example, independence of inputs
Why This Approach?
 General – it captures all applications
 The specifics of an application are defined by
its functionality, security is defined as above
 The security guarantees achieved are easily
understood (because the ideal model is easily
understood)
 We can be confident that we did not “miss”
any security requirements
More Details on the Definition
 The Adversary
 Computational power:
 Probabilistic polynomial-time versus all-powerful
 Adversarial deviation
 Semi-honest: follows protocol instructions
 Malicious: arbitrary actions
 Corruption behaviour
 Static: set of corrupted parties fixed at onset
 Adaptive: can choose to corrupt parties at any time
during computation
 Number of corruptions
 Honest majority versus unlimited corruptions
More Details on the Definition
 The Network
 Scheduling
 Synchronous: messages sent in rounds
 Asynchronous: messages can be delayed arbitrarily
(adversary controls message delivery)
 Semi-synchronous: adversary controls scheduling, but only to a limit (can be achieved using local clocks)
 Communication channels
 Authenticated: adversary can see, but cannot modify, messages sent between the parties
The Ideal Model – More Details
 The Trusted Party:
 Defined by any probabilistic polynomial-time Turing
machine – this machine defines the functionality
 Trusted party linked to all participants via perfectly
private and authenticated channels
 Upon receiving an input from a party, the trusted party
runs the machine
 If there is an output, it sends it to the designated party
 Continue as above
 This is more general than secure function evaluation
Examples
 Elections:
 Trusted party receives votes. After all have been
received (or after a certain time), it computes the
election results and sends them out to all
 Poker:
 The trusted party deals the cards randomly (sending 5 cards to each party), and remembers which cards were dealt
 Parties decide which cards to throw and send these
cards to the trusted party
 The trusted party checks that the cards are correct and
re-deals the correct number of new cards
 The trusted party publicizes the cards of a party who
opens his hand
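For the first example, a minimal sketch of the election functionality as a trusted-party program might look as follows (the tally rule and names are illustrative assumptions):

```python
# Sketch of the ideal election functionality: the trusted party collects all
# votes privately, and only after everyone has voted does it tally and publish
# the aggregate result. Individual votes are never revealed.
from collections import Counter

def ideal_election(votes):
    """votes: dict mapping voter id -> chosen candidate, each sent privately."""
    return dict(Counter(votes.values()))   # only the tally is sent to everyone

print(ideal_election({"v1": "A", "v2": "B", "v3": "A"}))  # {'A': 2, 'B': 1}
```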
The Ideal Model – More Details
 The definition we gave suffices in the case of
an honest majority
 When there is no honest majority
 Guaranteed output delivery cannot be achieved
 Fairness cannot be achieved
 Changes to ideal model:
 Corrupted parties receive output first
 Adversary decides if honest parties receive their outputs as well
 This is called security with abort
Adding Defects to the Ideal Model
 In the case of no honest majority, fairness and
guaranteed output delivery cannot be achieved

This “defect” is included into the ideal model
 This approach can also be used to model partial information leakage:
 The parties wish to compute a function f, but more information is leaked by the protocol
 This can be modeled by having the trusted party explicitly leak this information
 Helps for efficiency considerations
 Advantage: explicit defect!
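To make the “explicit defect” idea concrete, here is a hypothetical sketch of a trusted party that leaks some extra function of the inputs to the adversary, hands the corrupted party its output first, and then lets the adversary decide whether the honest party receives output at all (security with abort). The interfaces and the example function are assumptions for illustration only.

```python
# Ideal model with explicit "defects": extra leakage and security with abort.
# Purely illustrative; the names and interfaces are not from any real library.

def ideal_with_defects(f, leak, x, y, adversary):
    """Trusted party for f(x, y) -> (honest output, corrupted output)."""
    out_honest, out_corrupted = f(x, y)
    adversary.receive(leak(x, y))        # defect 1: explicit leakage beyond the output
    adversary.receive(out_corrupted)     # defect 2: the corrupted party gets output first...
    if adversary.deliver_to_honest():    # ...and decides whether the honest party does too
        return out_honest
    return "abort"                       # no fairness, no guaranteed output delivery

class GreedyAdversary:
    """A toy adversary that takes its output and then aborts."""
    def receive(self, msg):
        print("adversary sees:", msg)
    def deliver_to_honest(self):
        return False

# Example: the function is the size of the set intersection; the "defect" also
# leaks the size of the honest party's input.
f = lambda x, y: (len(x & y), len(x & y))
leak = lambda x, y: {"honest_input_size": len(x)}
print(ideal_with_defects(f, leak, {1, 2, 3}, {2, 3, 4}, GreedyAdversary()))  # prints "abort"
```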
More on Definitions
 There are numerous ways to define the real
model, regarding both the adversary and
network
 The main thing is to realistically (and conservatively) model the real-world scenario and adversarial threats
 An overly conservative approach may be detrimental in that it precludes efficient protocols
Feasibility – A Fundamental Theorem
 Any multiparty functionality can be securely computed
 For any number of corrupted parties: security with abort is achieved, assuming enhanced trapdoor permutations [Yao,GMW]
 With an honest majority: full security is achieved, assuming only private channels [BGW,CCD]
 These results hold in the stand-alone model
 When composition is considered, things are much
more difficult
Application to Private Data Mining
 The setting:
 Data is distributed at different sites
 These sites may be third parties (e.g., hospitals,
government bodies) or may be the individual
him or herself
 The aim:
 Compute the data mining algorithm on the data
so that nothing but the output is learned
 That is, carry out a secure computation
 Conclusion: private data mining is solved in
principle
Privacy ≠ Security
 As we have mentioned, secure computation only deals with the process of computing the function
 It does not ask whether or not the function should be computed
Privacy and Secure Computation
 Secure computation can be used to solve any
distributed data-mining problem
 A two-stage process:
 Decide that the function/algorithm should be computed – an issue of privacy
 Apply secure computation techniques to compute it securely – security
 But, not every privacy problem can be cast as
a distributed computation
Census Bureau and Privacy
 Case study – the census bureau:
 The census bureau releases its results so that
they can be studied
 Question:
 How does it make sure that the results
released do not compromise privacy?
 Answer:
 The tables are manipulated to hopefully
protect the privacy of individuals
 Methods are highly involved (statistics), but
are not “cryptographic”
Census Bureau and Privacy
 Suggestion:
 If someone wants to study the results (ask a statistical query), let them run a secure computation with the census bureau
 Secure computation is necessary so that parties don’t have to reveal to the census bureau what they are studying
Census Bureau and Privacy
 Naïve objection:
 This would be far too expensive and is not realistic in practice
 Better objection:
 This contradicts government transparency
 The census bureau can fudge results based on political interests without any possibility of being detected (regular census tables can be compared against other sources to check for accuracy)
Census Bureau and Privacy
 An even better objection:
 Who says that privacy is preserved in this way?
 If the census bureau doesn’t know the query (because it’s a secure computation), then it is easy to ask queries that overlap and reveal confidential information about an individual
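The overlap problem is essentially a differencing attack: two aggregate queries whose underlying sets differ in a single person reveal that person’s value exactly, even though each query on its own looks harmless. A toy illustration with made-up data:

```python
# Toy differencing attack: two overlapping statistical queries that differ in
# exactly one individual reveal that individual's private value.
# The data and attribute are made up for illustration.

incomes = {"alice": 41000, "bob": 38000, "carol": 52000, "dave": 47000}

def sum_query(names):
    """A seemingly harmless aggregate query over a set of people."""
    return sum(incomes[n] for n in names)

q1 = sum_query({"alice", "bob", "carol", "dave"})  # everyone
q2 = sum_query({"alice", "bob", "carol"})          # everyone except dave

print("dave's income:", q1 - q2)  # 47000 -- recovered exactly
```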
Conclusion – Census Bureau
 Secure multiparty computation doesn’t solve this problem
 Because sometimes in reality, actual data must be revealed, and not just the results of a function evaluation
 In other words, this problem cannot be cast as a distributed computation
 Casting it as a distributed computation introduces other problems and can even harm privacy
Secure Computation and Privacy
 The census bureau case is different – it is not
a distributed computation problem
 What about secure computation for
distributed problems?
Privacy Difficulties
 Crucial point:
 In order to analyze the privacy of a solution, it is crucial to understand why we want privacy to start with
 We demonstrate this with two examples:
 Personalized newspapers
 Personalized shopping catalogs
Personalized Newspapers
 The aim:
 Present a newspaper with a layout that is personalized to a reader’s interest
 The problem:
 We don’t want the newspaper to necessarily know what we are interested in
 Our political opinions may be private
 Our interest in certain stocks can be confidential
 Or just because our interests are our own
Personalized Newspapers
 The non-private solution:
 Input:
 User input: answers to an “interest questionnaire” and possible ratings for articles read
 Automated input: the newspaper gathers information about which articles were read by the user, for how long, and so on (appropriate for online newspapers)
 The computation: data mining algorithms are run to determine what is of interest to the user and in which layout to present it
Personalized Newspapers
 The solution – secure computation:
 The reader inputs his or her personal data
 The newspaper inputs the articles and the “rules” that define the layout based on the reader’s data
 The reader receives the personalized newspaper; the newspaper learns nothing
 Caveat:
 Successful data mining here would consider all readers together, which is not practical
Personalized Newspapers
 Privacy is clearly preserved
 The newspaper learns nothing about the reader’s data (interests, preferences, reading history, etc.)
 There is no dilemma here even regarding computing the function
 The newspaper learns nothing, so the function can clearly be computed without compromising privacy
Personalized Newspaper Danger
 Why do we want privacy regarding our interests
and why is this an important question?
 Typical answer to why we want privacy:
 A personal feeling of discomfort
 A more concrete answer:
 A danger to our autonomy: if the newspaper knows our interests, political leanings, etc., it could feed us slanted information
 Our interests could be used against us
Personalized Newspaper Danger
 The solution based on secure computation does not solve the problem at all
 The newspaper can set rules that say: if the reader has political view X, then present this article…
 The secure-computation solution provides full privacy of information
 But the danger to autonomy comes from being able to act upon private information, and this was not prevented
Personalized Newspaper Danger
 The reason that we want privacy is crucial
 The secure computation solution is trivially private, but doesn’t solve the problem at all
 It can be fixed – but we need to have a good understanding of what the problem is before we can design a solution
Personalized Newspaper Solution
 The rules used for defining the layout etc. should be either:
 Defined by the newspaper, but made public
 Defined by the users themselves
 Conclusions:
 This is obvious – but only once you understand why you actually want privacy
 Good, rigorous definitions can be formed, but they take time and care
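A minimal sketch of the first option: if the layout rules are published, the reader’s own software can apply them locally to the private profile, so the profile never leaves the reader’s machine and the rules themselves can be audited. The rule format and profile fields below are illustrative assumptions.

```python
# Sketch of the "public rules" fix: the newspaper publishes its layout rules,
# and the reader's software applies them locally to the private profile.
# The newspaper never sees the profile; anyone can inspect the rules.
# Rule format and profile fields are illustrative assumptions.

published_rules = [  # public and auditable
    {"if_interest": "finance", "then_section": "markets"},
    {"if_interest": "sports", "then_section": "sports"},
]

def personalize(profile, rules, default="front page"):
    """Runs entirely on the reader's side; the profile stays local."""
    sections = [r["then_section"] for r in rules if r["if_interest"] in profile["interests"]]
    return sections or [default]

reader_profile = {"interests": {"finance", "chess"}}  # never sent anywhere
print(personalize(reader_profile, published_rules))   # ['markets']
```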
Personalized Shopping Catalogs
 The same aim:
 Present customers with catalogs containing products that they are interested in
 Give certain customers “deals”
 The same problem:
 We don’t want the store to know all of our shopping history
 The same solution: secure computation
Personalized Shopping Catalogs
 A new problem: price discrimination
 The store can charge much higher prices to customers who are known to not “shop around”
 The secure computation solution once again does not solve the problem
Privacy Difficulties
 Even if no information is learned, it may still
be possible to “breach privacy”
 Coming up with good definitions of privacy for different tasks is very non-trivial
 The advantage of the cryptographic ideal-model definition (regarding its clarity) is somewhat diminished here
Challenges
 Secure computation can be used in many cases to improve privacy
 If the function itself preserves sufficient privacy, then this provides a “full solution”
 If the function does not preserve privacy, but there is no choice but to compute it, using secure computation minimizes damage
 The problem:
 Current protocols are typically very inefficient
Efficiency and Secure Computation
 Two types of adversaries:
 Semi-honest: follow the protocol
 Have efficient protocols for specific tasks
 Malicious: arbitrary actions
 Most known protocols are highly inefficient
 Semi-honest adversaries make sense for:
 Preventing inadvertent information leakage
 Where the participating parties really trust each other (e.g., hospitals)
Alternative Approaches
 Weaken the security guarantee:
 Consider malicious adversaries as before
 Define security in an alternative way (not using simulation)
 Example:
 Provide a definition of security based on indistinguishability and not simulation
An Example: PIR
 PIR = Private Information Retrieval [CGKS]
 Aim: allow a client to efficiently query a database
without the server learning what the query is
 Definition:
 Correctness: when the server is semi-honest, the client always obtains the correct result
 Privacy: the view of any (even malicious) server when the client’s query is i is indistinguishable from its view when the client’s query is j
 Sometimes, data privacy is also required (i.e., the client should learn only the answer to its query and nothing else)
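To give a flavour of how such a definition can be met, here is a toy version of the classic two-server information-theoretic idea in the style of [CGKS]: the client sends each of two non-colluding servers a query set, the two sets differ only at the desired index i, and XORing the two answers recovers the bit. Each set on its own is uniformly random, so neither server alone learns anything about i. Communication here is linear in the database size, so this only sketches the privacy idea, not an efficient scheme.

```python
# Toy 2-server PIR (information-theoretic, CGKS-style idea).
# The client sends server 1 a uniformly random index set S and server 2 the set
# S xor {i}; each server returns the XOR of its database bits over the set it
# received. XORing the two answers yields database[i], while each query set on
# its own reveals nothing about i (it is uniformly distributed).
import secrets
from functools import reduce

def server_answer(database, query_set):
    """Each server XORs together the bits at the queried positions."""
    return reduce(lambda acc, j: acc ^ database[j], query_set, 0)

def pir_retrieve(database, i):
    n = len(database)
    s1 = {j for j in range(n) if secrets.randbelow(2)}  # uniformly random subset
    s2 = s1 ^ {i}                                       # differs from s1 only at index i
    a1 = server_answer(database, s1)                    # answer from server 1
    a2 = server_answer(database, s2)                    # answer from server 2
    return a1 ^ a2                                      # equals database[i]

db = [1, 0, 0, 1, 1, 0, 1, 0]
assert all(pir_retrieve(db, i) == db[i] for i in range(len(db)))
print("all retrievals correct")
```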
Alternative Approaches
 Consider weaker adversarial models:
 Are malicious adversaries always realistic?
 Maybe we can use game-theoretic notions of utility – adversaries only cheat in order to gain something…
 Of course, weaker adversaries can also be
combined with non-simulation definitions…
Where Are We Now?
 The problem of multidisciplinary fields:
 Data miners typically don’t have a good
understanding of security and cryptography
 Cryptographers typically don’t really understand what data mining is all about
 E.g., data mining is an iterative process…
 Our understanding of privacy is still very preliminary
Future Challenges
 Understand what privacy means and what we really want
 A very non-trivial task, and one that requires interdisciplinary cooperation
 Computer scientists should help to formalize the notion, but lawyers, policy-makers and social scientists should be involved in understanding the concerns
 Some challenges here:
 Reconciling cultural and legal differences relating to
privacy in different countries
 Understanding when privacy is “allowed” to be breached (should searching data require a warrant, probable cause, and so on)
Future Challenges
 Appropriate modeling for secure computation
 Efficient protocols…
A Final Word
 Privacy-preserving data mining is truly needed
 Data mining is being used: by security agencies, governmental bodies and corporations
 Privacy advocates and citizen outcry often prevent positive uses of data mining
 Good solutions (which are currently still out of reach) may be able to resolve the conflict