An Ad Omnia Approach to Defining and Achieving Private Data
Defining and Achieving Differential Privacy
Cynthia Dwork, Microsoft
Meaningful Privacy Guarantees
Statistical databases
Medical
Government Agency
Social Science
Searching / click stream
Learn non-trivial trends while protecting privacy of
individuals and fine-grained structure
Linkage Attacks
Using “innocuous” data in one dataset to identify a
record in a different dataset containing both
innocuous and sensitive data
At the heart of the voluminous research on hiding
small cell counts in tabular data
The Netflix Prize
Netflix Recommends Movies to its Subscribers
Offers $1,000,000 for 10% improvement in its
recommendation system
Not concerned here with how this is measured
Publishes training data
Nearly 500,000 subscribers, 18,000 movie titles
“The ratings are on a scale from 1 to 5 (integral) stars. To protect
customer privacy, all personal information identifying
individual customers has been removed and all customer ids
have been replaced by randomly-assigned ids. The date of each
rating and the title and year of release for each movie are
provided.”
Some ratings not sensitive, some may be sensitive
OK for Netflix to know, not OK for public to know
A Publicly Available Set of Movie Rankings
Internet Movie Database (IMDb)
Individuals may register for an account and rate movies
Need not be anonymous
Visible material includes ratings, dates, comments
By definition, these ratings not sensitive
The Fiction of Non-PII
Narayanan & Shmatikov 2006
Movie ratings and dates are PII
“With 8 movie ratings (of which we allow 2 to be
completely wrong) and dates that may have a 3-day error,
96% of Netflix subscribers whose records have been
released can be uniquely identified in the dataset.”
Linkage attack prosecuted using the IMDb.
Link ratings in IMDb to (non-sensitive) ratings in Netflix,
revealing sensitive ratings in Netflix
N&S draw conclusions about the user.
May be wrong, may be right; the user is harmed either way.
What Went Wrong?
What is “Personally Identifiable Information”?
Typically syntactic, not semantic
Suppressing “PII” doesn’t rule out linkage attacks
Eg, genome sequence not considered PII ??
Famously observed by Sweeney, circa 1998
AOL debacle
Need a more semantic approach to privacy
Semantic Security Against an Eavesdropper
Goldwasser & Micali 1982
Vocabulary
Plaintext: the message to be transmitted
Ciphertext: the encryption of the plaintext
Auxiliary information: anything else known to attacker
The ciphertext leaks no information about the plaintext.
Formalization
Compare the ability of someone seeing aux and ciphertext to guess
(anything about) the plaintext, to the ability of someone seeing
only aux to do the same thing. Difference should be “tiny”.
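One standard way to write this comparison down (notation mine, not the speaker's): for every efficient adversary A there is a simulator A′ such that, for every plaintext m, auxiliary input aux, and target function g,

\[
\bigl|\Pr[A(\mathrm{aux}, \mathrm{Enc}(m)) = g(m)] - \Pr[A'(\mathrm{aux}) = g(m)]\bigr| \le \nu,
\]

where ν is cryptographically negligible ("tiny").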
Semantic Security for Statistical Databases?
Dalenius, 1977
Anything that can be learned about a respondent from the
statistical database can be learned without access to the
database
An ad omnia guarantee
Happily, Formalizes to Semantic Security
Recall: Anything about the plaintext that can be learned from
the ciphertext can be learned without the ciphertext
Popular Intuition: prior and posterior views about an individual
shouldn’t change “too much”.
Clearly Silly
My (incorrect) prior is that everyone has 2 left feet.
Very popular in literature nevertheless
Definitional awkwardness even when used correctly
Semantic Security for Statistical Databases?
Unhappily, Unachievable
Can’t achieve cryptographically small levels of “tiny”
Intuition: (adversarial) user is supposed to learn
unpredictable things about the DB; translates to learning
more than a cryptographically tiny amount about a
respondent
Relax “tiny”?
Relaxed Semantic Security for Statistical Databases?
Relaxing Tininess Doesn’t Help
Dwork & Naor 2006
Database teaches average heights of population subgroups
“Terry Gross is two inches shorter than avg Lithuanian ♀”
Access to DB teaches Terry’s height
Terry’s height learnable from the DB, not learnable otherwise
Formal proof extends to essentially any notion of privacy
compromise; it uses randomness extracted from the SDB as a one-time pad.
Bad news for k-anonymity, ℓ-diversity, m-invariance, and the like
Attack Works Even if Terry Not in DB!
Suggests new notion of privacy: risk incurred by joining DB
“Differential Privacy”
Privacy, when existence of DB is stipulated
Before/after interacting vs. risk when in / not in the DB
Differential Privacy
K gives ε-differential privacy if for all values of DB and Me
and all transcripts t:
Pr[ K (DB − Me) = t] / Pr[ K (DB + Me) = t] ≤ e^ε ≈ 1 + ε
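Because the definition quantifies over all DB and Me, the bound holds in both directions, so the two output distributions agree pointwise up to a factor of e^ε:

\[
e^{-\varepsilon} \;\le\; \frac{\Pr[\mathcal{K}(DB - \mathrm{Me}) = t]}{\Pr[\mathcal{K}(DB + \mathrm{Me}) = t]} \;\le\; e^{\varepsilon} \approx 1 + \varepsilon \quad \text{for small } \varepsilon.
\]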
Differential Privacy is an Ad Omnia Guarantee
No perceptible risk is incurred by joining DB.
Anything adversary can do to me, it could do
without Me (my data).
[Figure: Pr[response] over possible outputs, with a few bad responses marked X.]
An Interactive Sanitizer: K
Dwork, McSherry, Nissim, Smith 2006
[Figure: analyst poses query f; curator K returns f(DB) + noise.]
f: DB → R
K (f, DB) = f(DB) + Noise
Eg, Count(P, DB) = # rows in DB with Property P
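A minimal sketch of this shape in Python (names like count_query and sanitize are mine, purely illustrative; the right noise distribution is specified on the next slides):

import random

def count_query(has_property, db):
    # Count(P, DB) = number of rows in DB with property P
    return sum(1 for row in db if has_property(row))

def sanitize(f, db, noise):
    # K(f, DB) = f(DB) + Noise, with `noise` a zero-argument sampler
    return f(db) + noise()

# Example: a noisy count of rows with some property P.
db = [{"id": 1, "smoker": True}, {"id": 2, "smoker": False}]
answer = sanitize(lambda d: count_query(lambda r: r["smoker"], d), db,
                  noise=lambda: random.uniform(-1.0, 1.0))  # placeholder noise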
Sensitivity of a Function f
How Much Can f(DB + Me) Exceed f(DB - Me)?
Recall: K (f, DB) = f(DB) + noise
Question Asks: What difference must noise obscure?
Δf = maxDB, Me |f(DB+Me) − f(DB−Me)|
E.g., ΔCount = 1
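To make the definition concrete, a small brute-force illustration in Python over a tiny universe (the helper and universe are mine, illustration only): a counting query has Δf = 1, while an unbounded sum has sensitivity equal to the largest possible value.

from itertools import combinations

def sensitivity(f, universe, max_size=3):
    # Brute-force Delta-f = max over DB, Me of |f(DB + Me) - f(DB - Me)|,
    # over all small databases drawn (without repetition) from `universe`.
    delta = 0.0
    for n in range(max_size + 1):
        for db in combinations(universe, n):
            for me in universe:
                delta = max(delta, abs(f(list(db) + [me]) - f(list(db))))
    return delta

universe = [0, 1, 2, 3, 100]
print(sensitivity(lambda db: sum(1 for x in db if x > 1), universe))  # Count: 1
print(sensitivity(lambda db: sum(db), universe))                      # Sum: 100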
Calibrate Noise to Sensitivity
Δf = maxDB, Me |f(DB+Me) − f(DB−Me)|
Theorem: To achieve ε-differential privacy, add scaled symmetric noise Lap(R)
with R = Δf/ε.
[Figure: Laplace density with ticks at 0, ±R, ±2R, ±3R, ±4R.]
Pr[x] proportional to exp(−|x|/R)
Increasing R flattens curve; more privacy
Noise depends on Δf and ε, not on the database
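A minimal sketch of the calibrated mechanism, assuming numpy (the function name laplace_mechanism is mine, not from the talk):

import numpy as np

def laplace_mechanism(f, db, delta_f, epsilon, rng=None):
    # K(f, DB) = f(DB) + Lap(R), with R = delta_f / epsilon.
    # The noise scale depends only on delta_f and epsilon, not on DB.
    rng = rng or np.random.default_rng()
    R = delta_f / epsilon
    return f(db) + rng.laplace(loc=0.0, scale=R)

# Counting query (delta_f = 1) with a fairly strong privacy parameter.
db = [1, 5, 7, 7, 9]
noisy = laplace_mechanism(lambda d: sum(1 for x in d if x > 6), db,
                          delta_f=1.0, epsilon=0.1)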
Calibrate Noise to Sensitivity
Δf = maxDB, Me |f(DB+Me) − f(DB−Me)|
Theorem: To achieve ε-differential privacy, add scaled symmetric noise Lap(R)
with R = Δf/ε.
[Figure: the two Laplace densities centered at f(DB − Me) and f(DB + Me).]
Pr[ K (f, DB − Me) = t] / Pr[ K (f, DB + Me) = t]
= exp(−(|t − f(DB−Me)| − |t − f(DB+Me)|)/R) ≤ exp(Δf/R) = exp(ε)
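The bound above can also be checked numerically: for a counting query on two databases differing in Me, the ratio of the two output densities never exceeds e^ε. A small sketch, assuming numpy; the concrete values are made up for illustration.

import numpy as np

epsilon, delta_f = 0.5, 1.0
R = delta_f / epsilon
f_minus, f_plus = 41.0, 42.0          # f(DB - Me) and f(DB + Me) for a count

def out_density(t, true_answer):
    # Density of K(f, DB) = true_answer + Lap(R) at transcript t
    return np.exp(-np.abs(t - true_answer) / R) / (2 * R)

t = np.linspace(30.0, 55.0, 1001)
ratio = out_density(t, f_minus) / out_density(t, f_plus)
assert np.all(ratio <= np.exp(epsilon) + 1e-9)   # never exceeds e^epsilon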
Multiple Queries
For a query sequence f1, …, fd, ε-differential privacy is achieved by calibrating
the noise generation parameter to the sensitivity of the whole sequence:
R = Δ(f1, …, fd)/ε for each response.
Can sometimes do better.
Noise must increase with the sensitivity of the query sequence.
Naively, more queries means noisier answers; a budget-splitting sketch follows at
the end of this slide.
Dinur and Nissim 2003 et sequelae
Speaks to the Non-Interactive Setting
Any non-interactive solution permitting “too accurate” answers
to “too many” questions is vulnerable to attack.
Privacy mechanism is at an even greater disadvantage than in the
interactive case; can be exploited
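A naive budget-splitting sketch, assuming numpy: spend ε/d on each of d queries, so each response gets noise scale Ri = d·Δfi/ε. This illustrates why more queries naively means noisier answers; it is one standard composition strategy, not necessarily the talk's exact scheme.

import numpy as np

def answer_sequence(queries, sensitivities, db, epsilon, rng=None):
    # Spend epsilon/d on each of the d queries: R_i = d * delta_f_i / epsilon.
    rng = rng or np.random.default_rng()
    d = len(queries)
    return [f(db) + rng.laplace(scale=d * s / epsilon)
            for f, s in zip(queries, sensitivities)]

db = [3, 8, 8, 15, 22]
answers = answer_sequence(
    queries=[len, lambda d: sum(1 for x in d if x > 10)],
    sensitivities=[1.0, 1.0], db=db, epsilon=0.2)
# Doubling the number of queries doubles each R_i: noisier answers.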
Future Work
Investigate Techniques from Robust Statistics
Area of statistics devoted to coping with:
Small amounts of wild data entry errors
Rounding errors
Limited dependence among samples
What are the utility questions of interest?
Definitional and Algorithmic Work for Other Settings
Problem: the statistical setting makes strong assumptions about
existence and nature of an underlying distribution
Differential Privacy for Social Networks, Graphs
“Differential” approach more broadly useful
Several results discussed in next few hours
Porous Boundary Between “Inside” and “Outside”?
Outsourcing, bug reporting, combating D-DoS attacks and terror
Privacy is a natural resource.
It’s non-renewable, and it’s not yours.
Conserve it.