Privacy Preserving Data Mining
Secure Multiparty Computation and Privacy
Yehuda Lindell
Bar-Ilan University
Motivation
Huge databases exist in society today
Medical data
Consumer purchase data
Census data
Communication and media-related data
Data gathered by government agencies
Can this data be utilized?
For medical research
For improving customer service
For homeland security
Motivation
Data sharing is necessary for full utilization:
Pooling medical data can improve the quality
of medical research
Pooling of information from different
government agencies can provide a wider
picture
What is the health status of citizens that are
supported by social welfare?
Are there citizens that receive simultaneous
support from different agencies?
Data gathered by the government (e.g.,
census data) should be publicly available
The Problem
The huge amount of data available means
that it is possible to learn a lot of information
about individuals from public data
Purchasing patterns
Family history
Medical data
And much much more
Privacy Breaches
Latanya Sweeney – testimony before the Privacy and Integrity
Advisory Committee of the Department of Homeland Security:
One problem is that people don’t understand what makes data
unique or identifiable. For example, in 1997 I was able to show
how medical information that had all explicit identifiers, such as
name, address and Social Security number removed could be
re-identified using publicly available population registers (e.g., a
voter list). In this particular example, I was able to show how
the medical record of William Weld, the governor of
Massachusetts at the time, could be re-identified using only his
date of birth, gender and ZIP. In fact, 87% of the population of
the United States is uniquely identified by date of birth (e.g.,
month, day and year), gender, and their 5-digit ZIP codes. The
point is that data that may look anonymous is not necessarily
anonymous.
Motivation – Homeland Security
Many different security agencies coexist
These agencies are hesitant to share
information
Sometimes this is legitimate (if all agencies
share all information, a single mole can
compromise all agencies)
Typically, they only share conclusions
The problem: more patterns could be found if
data, and not just conclusions, are shared
Motivation – Concrete Example
Investigation at Stillwater State Correctional
Facility, Minnesota
Data mining software was applied to phone
records from the prison
A pattern linking calls between prisoners and a
recent parolee was discovered
The calling data was then mined again
together with records of prisoners’ financial
accounts
The result: a large drug smuggling ring was
discovered
Data Utilization or Privacy
In many cases, due to public outcry, important
data mining projects have been halted
The Canadian big-brother database
The Total Information Awareness project
It may be good that these projects were
scrapped in their current form, but it would be
better if we could have our cake and eat it too…
Privacy
The human definition:
Privacy and autonomy: information about us that we
feel is personal, confidential or private should not be
unnecessarily distributed or publicly known
Privacy and control: Our personal or private information
should not be misused (whatever that means)
How can we mathematically formulate this?
The same information is classified differently by different
people
Legitimate use is interpreted differently by different
people
Turing’s Concern
No attempt has yet been made to show that the “computable” numbers
include all numbers which would naturally be regarded as computable.
All arguments which can be given are bound to be, fundamentally,
appeals to intuition, and for this reason rather unsatisfactory
mathematically. The real question at issue is “What are the possible
processes which can be carried out in computing a number?”
The arguments which I shall use are of three kinds.
A direct appeal to intuition.
A proof of the equivalence of two definitions (in case the new
definition has a greater intuitive appeal).
Giving examples of large classes of numbers which are
computable.
Once it is granted that computable numbers are all “computable”
several other propositions of the same character follow. In particular, it
follows that, if there is a general process for determining whether a
formula of the Hilbert function calculus is provable, then the
determination can be carried out by a machine.
Secure Computation and Privacy
Secure computation:
Assume that there is a function that all parties
wish to compute
Secure computation shows how to compute
that function in the safest way possible
In particular, it guarantees minimal information
leakage (the output only)
Privacy:
Does the function output itself reveal “sensitive
information”, or
Should the parties agree to compute this
function?
This Talk
We focus on the problem of secure multiparty
computation
Definitional paradigms (the focus)
Feasibility results
Efficiency concerns
We will also discuss the limitations of the
approach for solving problems of privacy
Secure Multiparty Computation
A set of parties with private inputs
Parties wish to jointly compute a function of
their inputs so that certain security properties
(like privacy and correctness) are preserved
E.g., secure elections, auctions…
Properties must be ensured even if some of
the parties maliciously attack the protocol
Secure Computation Tasks
Examples:
Authentication protocols
Online payments
Auctions
Elections
Privacy preserving data mining
Essentially any task…
Security Requirements
Consider a secure auction (with secret bids):
An adversary may wish to learn the bids of all
parties – to prevent this, require privacy
An adversary may wish to win with a lower bid
than the highest – to prevent this, require
correctness
But, the adversary may also wish to ensure
that it always gives the highest bid – to
prevent this, require independence of inputs
Defining Security
Option 1: analyze security concerns for each specific
problem
Auctions: as in previous slide
Elections: privacy and correctness only
But then, a voter could perhaps condition its vote for a
party on whether that party will get the minimum number of votes
Problems:
How do we know that all concerns are covered?
Definitions are application dependent and need to be
redefined from scratch for each task
Defining Security – Option 2
The real/ideal model paradigm for defining
security [GMW,GL,Be,MR,Ca]:
Ideal model: parties send inputs to a trusted
party, who computes the function for them
Real model: parties run a real protocol with no
trusted help
A protocol is secure if any attack on a real
protocol can be carried out in the ideal model
Since no attacks can be carried out in the
ideal model, security is implied
The Real Model
[Diagram: the parties, holding inputs x and y, interact directly by running the protocol, and each obtains the protocol output.]
The Ideal Model
[Diagram: the parties send their inputs x and y to the trusted party, which returns f1(x,y) to party 1 and f2(x,y) to party 2.]
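To make this picture concrete, here is a minimal Python sketch of the two-party ideal model (the function names and the auction example are illustrative assumptions, not part of the talk): the only computation is the trusted party applying the function to the inputs it receives.

def ideal_model(x, y, f1, f2):
    # The trusted party receives both inputs, computes the function, and
    # returns f1(x, y) to party 1 and f2(x, y) to party 2.  Nothing other
    # than these outputs is revealed to anyone.
    return f1(x, y), f2(x, y)

# Hypothetical example: a sealed-bid auction in which each party learns
# only whether its own bid won.
out1, out2 = ideal_model(
    17, 23,
    f1=lambda x, y: x > y,    # party 1 learns only whether its bid is highest
    f2=lambda x, y: y >= x,   # party 2 learns only whether its bid is highest
)
print(out1, out2)  # False True

A real protocol is then secure if every attack on it can be mapped to an attack on this trivially safe procedure.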
The Security Definition
[Diagram: for every real adversary A attacking the protocol interaction (REAL), there exists an adversary S attacking the computation via the trusted party (IDEAL).]
Properties of the Definition
Privacy:
The ideal-model adversary cannot learn more about
the honest party’s input than what is revealed by the
function output
Thus, the same is true of the real-model adversary
Correctness:
In the ideal model, the function is always computed
correctly
Thus, the same is true in the real model
Others:
For example, independence of inputs
Why This Approach?
General – it captures all applications
The specifics of an application are defined by
its functionality, security is defined as above
The security guarantees achieved are easily
understood (because the ideal model is easily
understood)
We can be confident that we did not “miss”
any security requirements
More Details on the Definition
The Adversary
Computational power:
Probabilistic polynomial-time versus all-powerful
Adversarial deviation
Semi-honest: follows protocol instructions
Malicious: arbitrary actions
Corruption behaviour
Static: set of corrupted parties fixed at onset
Adaptive: can choose to corrupt parties at any time
during computation
Number of corruptions
Honest majority versus unlimited corruptions
More Details on the Definition
The Network
Scheduling
Synchronous: messages sent in rounds
Asynchronous: messages can be delayed arbitrarily
(adversary controls message delivery)
Semi-synchronous: adversary controls scheduling, but
only to a limit (can be achieved using local clocks)
Communication channels
Authenticated: adversary can see, but cannot modify
messages sent between the parties
The Ideal Model – More Details
The Trusted Party:
Defined by any probabilistic polynomial-time Turing
machine – this machine defines the functionality
Trusted party linked to all participants via perfectly
private and authenticated channels
Upon receiving an input from a party, the trusted party
runs the machine
If there is an output, it sends it to the designated party
Continue as above
This is more general than secure function evaluation
Examples
Elections:
Trusted party receives votes. After all have been
received (or after a certain time), it computes the
election results and sends them out to all
Poker:
The trusted party deals the cards randomly (sending 5
cards to each party), and remembers which cards were
dealt
Parties decide which cards to discard and send these
cards to the trusted party
The trusted party checks that the cards are correct and
re-deals the correct number of new cards
The trusted party publicizes the cards of a party who
opens his hand
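As a rough sketch of such a reactive trusted party, the poker functionality above could be written along the following lines (the class and method names are illustrative assumptions, and the shuffle uses an ordinary library call rather than cryptographic randomness):

import random

class PokerFunctionality:
    # Trusted-party sketch for the poker example: deal hands, remember them,
    # accept discards, re-deal, and publicize an opened hand.
    def __init__(self, parties):
        self.deck = [(rank, suit) for rank in range(2, 15) for suit in "CDHS"]
        random.shuffle(self.deck)
        # Deal 5 cards to each party and remember which cards were dealt.
        self.hands = {p: [self.deck.pop() for _ in range(5)] for p in parties}

    def discard(self, party, cards):
        hand = self.hands[party]
        # Check that the discarded cards really belong to the party's hand.
        if not all(c in hand for c in cards):
            raise ValueError("party tried to discard cards it does not hold")
        for c in cards:
            hand.remove(c)
        # Re-deal the correct number of new cards.
        hand.extend(self.deck.pop() for _ in cards)

    def open_hand(self, party):
        # Publicize the cards of the party that opens its hand.
        return list(self.hands[party])

# Usage: two parties play; one discards two cards and later opens its hand.
game = PokerFunctionality(["P1", "P2"])
game.discard("P1", game.hands["P1"][:2])
print(game.open_hand("P1"))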
The Ideal Model – More Details
The definition we gave suffices in the case of
an honest majority
When there is no honest majority
Guaranteed output delivery cannot be achieved
Fairness cannot be achieved
Changes to ideal model:
Corrupted parties receive output first
Adversary decides if honest parties receive their
outputs as well
This is called security with abort
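A minimal sketch of this modified ideal model (two parties, with party 2 corrupted; the names are illustrative assumptions):

def ideal_model_with_abort(x, y, f1, f2, adversary_decides):
    # The trusted party computes both outputs, hands the corrupted party its
    # output first, and then lets the adversary decide whether the honest
    # party receives its output too or only learns "abort".
    out1, out2 = f1(x, y), f2(x, y)
    if adversary_decides(out2):   # the adversary already holds its own output
        return out1, out2
    return "abort", out2

# Hypothetical run: the adversary allows delivery only when it likes its output.
print(ideal_model_with_abort(3, 5, min, max, adversary_decides=lambda o: o < 10))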
Adding Defects to the Ideal Model
In the case of no honest majority, fairness and
guaranteed output delivery cannot be achieved
This “defect” is included into the ideal model
This approach can also be used to model partial
information leakage:
The parties wish to compute a function f, but more
information is leaked by the protocol
This can be modeled by having the trusted party
explicitly leak this information
This helps with efficiency considerations
Advantage: explicit defect!
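A sketch of how such an explicit leak can be written into the ideal model (the set-intersection example is an illustrative assumption):

def ideal_model_with_leakage(x, y, f1, f2, leak):
    # As in the plain ideal model, except that the trusted party also hands
    # the adversary leak(x, y) -- the leakage is a visible, explicit defect.
    adversary_extra = leak(x, y)
    return f1(x, y), f2(x, y), adversary_extra

# Hypothetical example: compute the size of the intersection of two sets
# while explicitly leaking the sizes of both input sets to the adversary.
print(ideal_model_with_leakage(
    {1, 2, 3}, {2, 3, 4},
    f1=lambda a, b: len(a & b),
    f2=lambda a, b: len(a & b),
    leak=lambda a, b: (len(a), len(b)),
))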
More on Definitions
There are numerous ways to define the real
model, regarding both the adversary and
network
The main thing is to realistically (and
conservatively) model the real world scenario
and adversarial threats
An overly conservative approach may be
detrimental in that it precludes efficient
protocols
Feasibility – A Fundamental Theorem
Any multiparty functionality can be securely
computed
For any number of corrupted parties: security
with abort is achieved, assuming enhanced
trapdoor permutations [Yao,GMW]
With an honest majority: full security is
achieved, assuming private channels only
[BGW,CCD]
These results hold in the stand-alone model
When composition is considered, things are much
more difficult
Application to Private Data Mining
The setting:
Data is distributed at different sites
These sites may be third parties (e.g., hospitals,
government bodies) or may be the individual
him or herself
The aim:
Compute the data mining algorithm on the data
so that nothing but the output is learned
That is, carry out a secure computation
Conclusion: private data mining is solved in
principle
Privacy vs. Security
As we have mentioned, secure computation
only deals with the process of computing the
function
It does not ask whether or not the function
should be computed
Privacy and Secure Computation
Secure computation can be used to solve any
distributed data-mining problem
A two-stage process:
Decide that the function/algorithm should be
computed – an issue of privacy
Apply secure computation techniques to
compute it securely – security
But, not every privacy problem can be cast as
a distributed computation
Census Bureau and Privacy
Case study – the census bureau:
The census bureau releases its results so that
they can be studied
Question:
How does it make sure that the results
released do not compromise privacy?
Answer:
The tables are manipulated to hopefully
protect the privacy of individuals
Methods are highly involved (statistics), but
are not “cryptographic”
Census Bureau and Privacy
Suggestion:
If someone wants to study the results (ask a
statistical query), let them run a secure
computation with the census bureau
Secure computation is necessary so that
parties don’t have to reveal to the census
bureau what they are studying
Census Bureau and Privacy
Naïve objection:
This would be far too expensive and is not
realistic in practice
Better objection:
This contradicts government transparency
The census bureau can fudge results based
on political interests, without any possibility of being
detected (regular census tables can be
compared against other sources to check for
accuracy)
Census Bureau and Privacy
An even better objection:
Who says that privacy is preserved in this
way?
If the census bureau doesn’t know the query
(because it’s a secure computation), then it is
easy to ask queries that overlap and reveal
confidential information about an individual
Conclusion – Census Bureau
Secure multiparty computation doesn’t solve
this problem
Because sometimes in reality, actual data
must be revealed, and not just the results of a
function evaluation
In other words, this problem cannot be cast as
a distributed computation
Casting it as a distributed computation introduces
other problems and can even harm privacy
Secure Computation and Privacy
The census bureau case is different – it is not
a distributed computation problem
What about secure computation for
distributed problems?
Privacy Difficulties
Crucial point:
In order to analyze the privacy of a solution, it
is crucial to understand why we want privacy
to start with
We demonstrate this with two examples:
Personalized newspapers
Personalized shopping catalogs
Personalized Newspapers
The aim:
Present a newspaper with a layout that is
personalized to a reader’s interest
The problem:
We don’t want the newspaper to necessarily
know what we are interested in
Our political opinions may be private
Our interest in certain stocks can be confidential
Or just because our interests are our own
Personalized Newspapers
The non-private solution:
Input:
User input: answers to an “interest questionnaire”
and possible ratings for articles read
Automated input: the newspaper gathers
information about which articles were read by the
user, for how long and so on (appropriate for online
newspapers)
The computation: data mining algorithms are
run to determine what is of interest to the user
and in which layout to present it
Personalized Newspapers
The solution – secure computation
The reader inputs its personal data
The newspaper inputs the articles and the
“rules” to define the layout based on the
reader’s data
The reader receives the personalized
newspaper, the newspaper learns nothing
Caveat:
Successful data mining here would consider
all readers together, which is not practical
Personalized Newspapers
Privacy is clearly preserved
The newspaper learns nothing about the
reader’s data (interests, preferences, reading
history etc.)
There is no dilemma here even regarding
computing the function
The newspaper learns nothing so the function
can clearly be computed without
compromising privacy
Personalized Newspaper Danger
Why do we want privacy regarding our interests
and why is this an important question?
Typical answer to why we want privacy
A personal feeling of discomfort
A more concrete answer
A danger to our autonomy: if the newspaper
knows our interests, political leanings etc., it
could feed us slanted information
Our interests could be used against us
Personalized Newspaper Danger
The solution based on secure computation
does not solve the problem at all
The newspaper can set rules that say: if the
reader has political view X, then present this
article…
The secure-computation solution provides full
privacy of information
But, the danger to autonomy comes from
being able to act upon private information,
and this was not prevented
Personalized Newspaper Danger
The reason that we want privacy is crucial
The secure computation solution is trivially
private, but doesn’t solve the problem at all
It can be fixed – but we need to have a good
understanding of what the problem is before
we can design a solution
Personalized Newspaper Solution
The rules used for defining the layout etc.
should be either:
Defined by the newspaper but be public
Defined by the users themselves
Conclusions:
This is obvious – but only once you
understand why you actually want privacy
Good, rigorous definitions can be formed,
but they take time and care
Personalized Shopping Catalogs
The same aim:
Present customers with catalogs containing
products that they are interested in
Give certain customers “deals”
The same problem:
We don’t want the store to know all of our
shopping history
The same solution: secure computation
Personalized Shopping Catalogs
A new problem: price discrimination
The store can charge much higher prices to
customers who are known to not “shop
around”
The secure computation solution once again
does not solve the problem
Privacy Difficulties
Even if no information is learned, it may still
be possible to “breach privacy”
Coming up with good definitions of privacy for
different tasks is very non-trivial
The advantage of the cryptographic ideal-model definition
(regarding its clarity) is somewhat diminished here
Challenges
Secure computation can be used in many
cases to improve privacy
If the function itself preserves sufficient
privacy, then this provides a “full solution”
If the function does not preserve privacy, but
there is no choice but to compute it, using
secure computation minimizes damage
The problem:
Current protocols are typically very inefficient
Efficiency and Secure Computation
Two types of adversaries
Semi-honest: follow the protocol
Have efficient protocols for specific tasks
Malicious: arbitrary actions
Most known protocols are highly inefficient
Semi-honest adversaries make sense for:
Preventing inadvertent information leakage
Where the participating parties really trust
each other (e.g., hospitals)
Alternative Approaches
Weaken the security guarantee
Consider malicious adversaries as before
Define security in an alternative way (not using
simulation)
Example:
Provide a definition of security based on
indistinguishability and not simulation
An Example: PIR
PIR = Private Information Retrieval [CGKS]
Aim: allow a client to efficiently query a database
without the server learning what the query is
Definition:
Correctness: when the server is semi-honest, the
client always obtains the correct result
Privacy: the view of any (even malicious) server when
the client’s query is i is indistinguishable from its view
when the client’s query is j
Sometimes, data privacy is also required (i.e., client
should learn its query and nothing else)
Alternative Approaches
Consider weaker adversarial models
Are malicious adversaries always realistic?
Maybe we can use game-theoretic notions of
utility – adversaries only cheat in order to gain
something…
Of course, weaker adversaries can also be
combined with non-simulation definitions…
Where Are We Now?
The problem of multidisciplinary fields:
Data miners typically don’t have a good
understanding of security and cryptography
Cryptographers typically don’t really
understand what data mining is all about
E.g., data-mining is an iterative process…
Our understanding of privacy is still very
preliminary
Future Challenges
Understand what privacy means and what we really
want
A very non-trivial task and one that requires
interdisciplinary cooperation
Computer scientists should help to formalize the
notion, but lawyers, policy-makers, social scientists
should be involved in understanding the concerns
Some challenges here:
Reconciling cultural and legal differences relating to
privacy in different countries
Understanding when privacy is “allowed” to be
breached (should searching data require a warrant,
probable cause, and so on)
Future Challenges
Appropriate modeling for secure computation
Efficient protocols…
A Final Word
Privacy-preserving data mining is truly
needed
Data mining is being used: by security
agencies, governmental bodies and
corporations
Privacy advocates and citizen outcry often
prevent positive uses of data mining
Good solutions (which are still currently out of
reach) may be able to resolve the conflict