
Pinning Down “Privacy”
Defining Privacy in Statistical Databases
Adam Smith
Weizmann Institute of Science
http://theory.csail.mit.edu/~asmith
1
Database Privacy
[Figure: respondents Alice, Bob, …, You → collection and "sanitization" → users (government, researchers, marketers, …)]
“Census problem”
Two conflicting goals
• Utility: Users can extract “global” statistics
• Privacy: Individual information stays hidden
• How can these be formalized?
2
Database Privacy
[Figure: the same collection-and-sanitization diagram]
“Census problem”
Why privacy?
• Ethical & legal obligation
• Honest answers require respondents’ trust
3
Trust is important
4
Database Privacy
[Figure: the same collection-and-sanitization diagram]
• Trusted collection agency
• Published statistics may be tables, graphs, microdata, etc
• May have noise or other distortions
• May be interactive
5
Database Privacy
[Figure: the same collection-and-sanitization diagram]
Variations on this model are studied in
• Statistics
• Data mining
• Theoretical CS
• Cryptography
Different traditions for what “privacy” means
6
How can we formalize “privacy”?
• Different people mean different things
• Pin it down mathematically?
7
Can we approach privacy scientifically?
• Pin down a social concept
• No perfect definition?
• But plenty of room for rigor
• Too late? (see Adi's talk)

"I ask them to take a poem
and hold it up to the light
like a color slide

or press an ear against its hive.
[…]
But all they want to do
is tie the poem to a chair with rope
and torture a confession out of it.

They begin beating it with a hose
to find out what it really means."
- Billy Collins, "Introduction to Poetry"
8
How can we formalize “privacy”?
• Different people mean different things
• Pin it down mathematically?
Goal #1: Rigor
- Prove clear theorems about privacy
• Few exist in the literature
- Make clear (and refutable) conjectures
- Sleep better at night
Goal #2: Interesting science
- (New) computational phenomenon
- Algorithmic problems
- Statistical problems
9
Overview
• Examples
• Intuitions for privacy
- Why crypto def's don't apply
• A Partial* Selection of Definitions
• Conclusions
* “partial” = “incomplete” and “biased”
10
Basic Setting
[Figure: DB = table of rows x1, x2, x3, …, xn → San (with random coins) ⟷ query 1 / answer 1, …, query T / answer T ⟷ users (government, researchers, marketers, …)]
• Database DB = table of n rows, each in domain D
- D can be numbers, categories, tax forms, etc.
- This talk: D = {0,1}^d
- E.g.: Married?, Employed?, Over 18?, …
11
Examples of sanitization methods
• Input perturbation
- Change data before processing
- E.g. randomized response (see the sketch after this list)
• flip each bit of the table with probability p
• Summary statistics
- Means, variances
- Marginal totals (# people with blue eyes and brown hair)
- Regression coefficients
• Output perturbation
- Summary statistics with noise
• Interactive versions of the above:
- An auditor decides which queries are OK and what type of noise to add
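A minimal Python sketch of input and output perturbation as described above. The flip probability p, the Gaussian noise on the sums, and the toy binary table are illustrative choices, not prescribed by the talk.

import numpy as np

rng = np.random.default_rng(0)

def randomized_response(table: np.ndarray, p: float) -> np.ndarray:
    """Input perturbation: flip each bit of the 0/1 table independently with probability p."""
    flips = rng.random(table.shape) < p
    return table ^ flips

def noisy_column_sums(table: np.ndarray, noise_scale: float) -> np.ndarray:
    """Output perturbation: release column sums with additive noise."""
    return table.sum(axis=0) + rng.normal(scale=noise_scale, size=table.shape[1])

# Toy database: n = 1000 respondents, d = 3 binary attributes (Married?, Employed?, Over 18?).
db = rng.integers(0, 2, size=(1000, 3))
print(randomized_response(db, p=0.25)[:5])    # perturbed microdata
print(noisy_column_sums(db, noise_scale=10))  # perturbed summary statistics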
12
Two Intuitions for Privacy
“If the release of statistics S makes it possible to determine the
value [of private information] more accurately than is possible
without access to S, a disclosure has taken place.” [Dalenius]
• Learning more about me should be hard
Privacy is “protection from being brought to the attention of
others.” [Gavison]
• Safety is blending into a crowd
13
Why not use crypto definitions?
• Attempt #1:
- Def'n: for every entry i, no information about xi is leaked (as if encrypted)
- Problem: then no information at all is revealed!
- Tradeoff: privacy vs. utility
• Attempt #2:
- Agree on summary statistics f(DB) that are safe
- Def'n: no information about DB except f(DB)
- Problem: how to decide that f is safe?
- Tautology trap
- (Also: how do you figure out what f is? --Yosi)
14
Overview
• Examples
• Intuitions for privacy
- Why crypto def's don't apply
• A Partial* Selection of Definitions
- Two straw men
- Blending into the Crowd
- An impossibility result
- Attribute Disclosure and Differential Privacy
• Conclusions
* "partial" = "incomplete" and "biased"
Criteria:
• Understandable
• Clear adversary's goals & prior knowledge / side information
• I am a co-author...
15
Straw man #1: Exact Disclosure
[Figure: DB = (x1, x2, x3, …, xn) → San (with random coins) ⟷ query 1 / answer 1, …, query T / answer T ⟷ Adversary A]
random coins
• Def'n: safe if the adversary cannot learn any entry exactly
- leads to nice (but hard) combinatorial problems
- does not preclude learning a value with 99% certainty, or narrowing it down to a small interval
• Historically:
- Focus: auditing interactive queries
- Difficulty: understanding relationships between queries
- E.g. two queries with small difference (a worked toy example follows below)
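A worked toy example of the "two queries with small difference" difficulty, assuming a simple sum-query interface (my illustration, not from the slides): each query alone looks like a harmless aggregate, but their difference discloses one entry exactly.

# Toy differencing attack: two "safe-looking" sum queries pin down one entry exactly.
salaries = {"Alice": 48_000, "Bob": 52_000, "Carol": 61_000, "Dave": 55_000}

q1 = sum(salaries.values())                                      # sum over everyone
q2 = sum(v for name, v in salaries.items() if name != "Alice")   # sum over everyone but Alice

print(q1 - q2)  # 48000 -- Alice's salary, recovered exactly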
16
Straw man #2: Learning the distribution
• Assume x1,…,xn are drawn i.i.d. from unknown
distribution
• Def’n: San is safe if it only reveals distribution
• Implied approach:
- learn the distribution
- release a description of the distribution,
- or re-sample points from it
• Problem: tautology trap
- the estimate of the distribution depends on the data… why is it safe?
17
Blending into a Crowd
• Intuition: I am safe in a group of k or more
- k varies (3… 6… 100… 10,000?)
• Many variations on the theme (a breach-check sketch follows below):
- Adversary wants a predicate g such that 0 < #{ i : g(xi) = true } < k
- Such a g is called a breach of privacy
• Why?
- Fundamental:
• R. Gavison: "protection from being brought to the attention of others"
- A rare property helps me re-identify someone
- Implicit: information about a large group is public
• e.g. liver problems are more prevalent among diabetics
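A minimal sketch of the breach condition above, measuring frequency in DB itself (the first of the two variants mentioned on the next slide). The example predicate and the toy database are my own illustration.

from typing import Callable, Sequence

def is_breach(db: Sequence[dict], g: Callable[[dict], bool], k: int) -> bool:
    """g breaches privacy if 0 < #{i : g(x_i) is true} < k (frequency measured in DB itself)."""
    count = sum(1 for row in db if g(row))
    return 0 < count < k

db = [
    {"nationality": "Canadian", "sings_in_choir": True},
    {"nationality": "Canadian", "sings_in_choir": True},
    {"nationality": "Israeli",  "sings_in_choir": False},
    {"nationality": "Israeli",  "sings_in_choir": True},
]

# Isolates the two Canadians -- a breach for any k > 2.
print(is_breach(db, lambda r: r["nationality"] == "Canadian", k=3))  # True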
18
Blending into a Crowd
• Intuition: I am safe in a group of k or more
- k varies (3… 6… 100… 10,000?)
• Many variations on the theme:
- Adversary wants a predicate g such that 0 < #{ i : g(xi) = true } < k
- Such a g is called a breach of privacy
• Two variants:
- frequency in DB
- frequency in the underlying population
• Why?
- Fundamental:
• R. Gavison: "protection from being brought to the attention of others"
- A rare property helps me re-identify someone
- Implicit: information about a large group is public
• e.g. liver problems are more prevalent among diabetics
How can we capture this?
• "Syntactic" definitions
• Bayesian adversary
• "Crypto-flavored" definitions
19
“Syntactic” Definitions
• Given sanitization S, look at set of all databases consistent with S
• Def’n: Safe if no predicate is a breach for all consistent databases
k-anonymity [L. Sweeney]
• Sanitization is a histogram of the data
- Partition D into bins B1 ∪ B2 ∪ … ∪ Bt
- Output cardinalities fj = #( DB ∩ Bj )

Example table (hair color × eye color); each cell shows the true count and, in brackets, the bounds implied by the released marginal totals:

           brown      blue     total
  blond   2 [0,12]  10 [0,12]    12
  brown  12 [0,14]   6 [0,16]    18
  total     14         16        30

• Safe if for all j, either fj ≥ k or fj = 0 (a code sketch follows this slide)
Cell bound methods [statistics, 1990's]
• Sanitization consists of marginal sums
- Let fz = #{i : xi = z}. Then San(DB) = various sums of the fz
• Safe if for all z, either ∃ a consistent DB with fz ≥ k, or fz = 0 in all consistent DBs
• Large literature using algebraic and combinatorial techniques
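A minimal sketch of the k-anonymity-style histogram release above. It assumes the simplest policy for bins that violate the threshold (suppression); how small bins are actually handled is a design choice not specified on the slide.

from collections import Counter
from typing import Iterable, Tuple

def k_anonymous_histogram(db: Iterable[Tuple], k: int) -> dict:
    """Release bin counts f_j = #(DB ∩ B_j); suppress any bin with 0 < f_j < k."""
    counts = Counter(db)  # here each bin B_j is one distinct attribute combination
    return {bin_: f for bin_, f in counts.items() if f >= k}

db = [("blond", "blue")] * 10 + [("blond", "brown")] * 2 + \
     [("brown", "brown")] * 12 + [("brown", "blue")] * 6
print(k_anonymous_histogram(db, k=5))
# {('blond', 'blue'): 10, ('brown', 'brown'): 12, ('brown', 'blue'): 6}
# -- the 2-person ('blond', 'brown') bin is suppressed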
20
“Syntactic” Definitions
• Given sanitization S, look at set of all databases consistent with S
• Def’n: Safe if no predicate is a breach for all consistent databases
k-anonymity [L. Sweeney]
• Sanitization is a histogram of the data
- Partition D into bins B1 ∪ B2 ∪ … ∪ Bt
- Output cardinalities fj = #( DB ∩ Bj )
• Safe if for all j, either fj ≥ k or fj = 0
(Same example table as on the previous slide.)
Cell bound methods [statistics, 1990's]
• Sanitization consists of marginal sums
- If fz = #{i : xi = z}, then the output = various sums of the fz
• Safe if for all z, either ∃ a consistent DB with fz ≥ k, or fz = 0 in all consistent DBs
• Large literature using algebraic and combinatorial techniques
Issues:
• If k is small: "all three Canadians at Weizmann sing in a choir."
• Semantics?
- Probability is not considered
- What if I have side information?
- The algorithm for making decisions is not considered
• What adversary does this apply to?
21
Security for “Bayesian” adversaries
Goal:
• Adversary outputs a point z ∈ D
• Score = 1/fz if fz > 0, and 0 otherwise
• Def'n: sanitization is safe if E(score) ≤ threshold
Procedure (sketched in code below):
• Assume you know the adversary's prior distribution over databases
• Given a candidate output (e.g. a set of marginal sums):
- Update the prior conditioned on the output (via Bayes' rule)
- If maxz E( score | output ) < threshold, then release
- Else consider a new set of marginal sums
• Extensive literature on computing the expected value (with efficient computations)
Issues:
• Restricts the type of predicates the adversary can choose
• Must know the prior distribution
- Can one scheme work for many distributions?
- Sanitizer works harder than the adversary
• Conditional probabilities don't consider previous iterations
- "Simulatability" [KMN'05]
- Can this be fixed? (see Yosi's talk)
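A brute-force sketch of the procedure above for a tiny domain. The threshold 1/k, the uniform prior, and the sum query are my illustrative assumptions; a real auditor needs far more efficient computations.

from itertools import product

D = [0, 1]          # tiny data domain
n = 3               # database size
k = 2               # crowd-size parameter; threshold on expected score is 1/k (assumption)

databases = list(product(D, repeat=n))
prior = {db: 1 / len(databases) for db in databases}   # uniform prior (assumption)

def output(db):
    """Candidate sanitization: the sum of the entries (a marginal total)."""
    return sum(db)

def score(db, z):
    """Adversary's score for guessing point z: 1/f_z if f_z > 0, else 0."""
    f_z = db.count(z)
    return 1 / f_z if f_z > 0 else 0.0

def safe_to_release(candidate):
    # Bayes' rule: posterior over the databases consistent with the candidate output.
    consistent = {db: p for db, p in prior.items() if output(db) == candidate}
    total = sum(consistent.values())
    posterior = {db: p / total for db, p in consistent.items()}
    # Release only if, for every guess z, the expected score stays below 1/k.
    worst = max(sum(p * score(db, z) for db, p in posterior.items()) for z in D)
    return worst < 1 / k

for candidate in range(n + 1):
    print(candidate, safe_to_release(candidate))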
22
Crypto-flavored Approach [CDMSW,CDMT,NS]
“If the release of statistics S makes it possible to determine the
value [of private information] more accurately than is possible
without access to S, a disclosure has taken place.” [Dalenius]
23
Crypto-flavored Approach [CDMSW,CDMT,NS]
• [CDMSW]: Compare to a "simulator":
∀ distributions on databases DB ∈ C,
∀ adversaries A, ∃ A' such that
∀ subsets J ⊆ DB:
Pr_{DB,S}[ A(S) = breach in J ] − Pr_{DB}[ A'() = breach in J ] ≤ ε
• The definition says nothing if the adversary knows x1
- Hence require that it hold for all subsets of DB
• No non-trivial examples satisfy this definition as stated
- Restrict the family of distributions to some class C
- Try to make C as large as possible
- Sufficient: i.i.d. from a "smooth" distribution
24
Crypto-flavored Approach [CDMSW,CDMT,NS]
• [CDMSW]: Compare to a "simulator":
∀ distributions on databases DB ∈ C,
∀ adversaries A, ∃ A' such that
∀ subsets J ⊆ DB:
Pr_{DB,S}[ A(S) = breach in J ] − Pr_{DB}[ A'() = breach in J ] ≤ ε
• [CDMSW,CDMT] Geometric data
- Assume xi ∈ R^d
- Relax the definition:
• Ball predicates g_{z,r} = {x : ||x−z|| ≤ r} and g'_{z,r} = {x : ||x−z|| ≤ C·r}
• Breach if #( DB ∩ g_{z,r} ) > 0 and #( DB ∩ g'_{z,r} ) < k
- Several types of histograms can be released
- Sufficient for "metric" problems: clustering, min. spanning tree, …
(A breach-check sketch follows below.)
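A minimal numpy sketch of the relaxed "ball predicate" breach above: a point z and radius r breach privacy if the small ball around z is nonempty while the C-times-larger ball still contains fewer than k points. The data and parameters are illustrative.

import numpy as np

def is_ball_breach(db: np.ndarray, z: np.ndarray, r: float, C: float, k: int) -> bool:
    """Breach if #(DB ∩ B(z, r)) > 0 and #(DB ∩ B(z, C·r)) < k."""
    dists = np.linalg.norm(db - z, axis=1)
    return np.count_nonzero(dists <= r) > 0 and np.count_nonzero(dists <= C * r) < k

rng = np.random.default_rng(1)
cluster = rng.normal(loc=0.0, scale=1.0, size=(100, 2))   # a crowd of 100 points
outlier = np.array([[10.0, 10.0]])                        # one isolated individual
db = np.vstack([cluster, outlier])

print(is_ball_breach(db, z=np.array([10.0, 10.0]), r=0.5, C=3.0, k=5))  # True: isolates the outlier
print(is_ball_breach(db, z=np.array([0.0, 0.0]),   r=0.5, C=3.0, k=5))  # False: blends into the crowd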
25
Crypto-flavored Approach [CDMSW,CDMT,NS]
• [CDMSW]: Compare to a "simulator":
∀ distributions on databases DB ∈ C,
∀ adversaries A, ∃ A' such that
∀ subsets J ⊆ DB:
Pr_{DB,S}[ A(S) = breach in J ] − Pr_{DB}[ A'() = breach in J ] ≤ ε
• [NS] No geometric restrictions
- A lot of noise
• Almost erases the data!
- Strong privacy statement
- Very weak utility
• [CDMSW,CDMT,NS]: proven statements!
Issues:
• Works for a large class of prior distributions and side information
- But not for all
• Not clear if it helps with "ordinary" statistical calculations
• Interesting utility requires geometric restrictions
• Too messy?
26
Blending into a Crowd
• Intuition: I am safe in a group of k or more
• Pros:
- appealing intuition for privacy
- seems fundamental
- mathematically interesting
- meaningful statements are possible!
• Cons:
- does it rule out learning facts about a particular individual?
- all results seem to make strong assumptions on the adversary's prior distribution
- is this necessary? (yes…)
27
Overview
• Examples
• Intuitions for privacy
- Why crypto def's don't apply
• A Partial* Selection of Definitions
- Two straw men
- Blending into the Crowd
- An impossibility result
- Attribute Disclosure and Differential Privacy
• Conclusions
28
An impossibility result
• An abstract schema:
- Define a privacy breach
- ∀ distributions on databases, ∀ adversaries A, ∃ A' such that
Pr( A(San) = breach ) − Pr( A'() = breach ) ≤ ε
• Theorem [Dwork-Naor]:
- For any reasonable notion of "breach", if San(DB) contains information about DB, then some adversary breaks this definition
• Example:
- Adv. knows Alice is 2 inches shorter than the average Lithuanian
• but how tall are Lithuanians?
- With the sanitized database, the probability of guessing Alice's height goes up
- Theorem: this is unavoidable
29
Proof sketch
• Suppose
- if DB is uniform, then the mutual information I( DB ; San(DB) ) > 0
- the "breach" is predicting a predicate g(DB)
• Pick a hash function h: {databases} → {0,1}^H(DB|San)
- Prior distribution is uniform conditioned on h(DB) = z
• Then
- h(DB) = z alone gives no info on g(DB)
- San(DB) and h(DB) = z together determine DB
• [DN] vastly generalize this
30
Preventing Attribute Disclosure
[Figure: DB = (x1, …, xn) → San (with random coins) ⟷ queries/answers ⟷ Adversary A, as before]
• Large class of definitions
- Safe if the adversary can't learn "too much" about any entry
- E.g.:
• Cannot narrow Xi down to a small interval
• For uniform Xi, mutual information I( Xi ; San(DB) ) ≤ ε
• How can we decide among these definitions?
31
Differential Privacy
[Figure: same DB → San ⟷ Adversary A diagram]
• Lithuanians example:
- The adversary learns Alice's height even if Alice is not in the DB
• Intuition [DM]:
- "Whatever is learned would be learned regardless of whether or not Alice participates"
- Dual: whatever is already known, the situation won't get worse
32
Differential Privacy
[Figure: same diagram, but with one row of the database (x3) replaced by 0]
• Define n+1 games
- "Game 0": Adv. interacts with San(DB)
- For each i, let DB-i = (x1, …, xi-1, 0, xi+1, …, xn)
- "Game i": Adv. interacts with San(DB-i)
• Bayesian adversary:
- Given S and a prior distribution p(·) on DB, define n+1 posterior distributions p0(·|S), …, pn(·|S) (see the sketch below)
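A small sketch of the databases used in the n+1 games: DB itself for Game 0, and DB with row i blanked out (set to an all-zeros row, following the slide) for Game i. Purely illustrative.

from typing import List, Tuple

Row = Tuple[int, ...]

def game_databases(db: List[Row]) -> List[List[Row]]:
    """Return [DB, DB_{-1}, ..., DB_{-n}], where DB_{-i} has row i replaced by a zero row."""
    zero_row = (0,) * len(db[0])
    games = [list(db)]                                   # Game 0: the real database
    for i in range(len(db)):
        games.append(db[:i] + [zero_row] + db[i + 1:])   # Game i: row i removed (zeroed)
    return games

db = [(1, 0, 1), (0, 1, 1), (1, 1, 0)]
for i, g in enumerate(game_databases(db)):
    print(f"Game {i}: {g}")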
33
Differential Privacy
[Figure: same diagram as on the previous slide]
• Definition: San is safe if
∀ prior distributions p(·) on DB,
∀ transcripts S, ∀ i = 1, …, n:
StatDiff( p0(·|S) , pi(·|S) ) ≤ ε
• Note that the prior distribution may be far from both posteriors
• How can we satisfy this?
34
Approach: Indistinguishability [DiNi,EGS,BDMN]
[Figure: two databases DB = (x1, x2, x3, …, xn) and DB' = (x1, x2', x3, …, xn), differing in 1 row, each fed to San (with random coins); the resulting transcript distributions S and S' (queries/answers 1…T) must be at "distance" ≤ ε. The choice of distance measure is important.]
35
Approach: Indistinguishability [DiNi,EGS,BDMN]
[Figure: the same two-database diagram as on the previous slide]
36
Approach: Indistinguishability [DiNi,EGS,BDMN]
Problem: ε must be large
• By a hybrid argument: any two databases induce transcripts at distance ≤ nε
• To get utility, we need ε > 1/n
[Figure: the two-database diagram again, with transcript distributions at distance ≤ ε]
Statistical difference 1/n is not meaningful
• Example: release a random point in the database (simulated below)
- San(x1, …, xn) = ( j, xj ) for a random j
- For every i, changing xi induces statistical difference 1/n
- But some xi is revealed with probability 1
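A quick simulation of the example above (illustrative): releasing one random row changes the output distribution by only 1/n when a single row changes, yet every run exposes some row in full.

import random

def san(db):
    """Release a uniformly random row together with its index: San(x_1..x_n) = (j, x_j)."""
    j = random.randrange(len(db))
    return j, db[j]

n = 100
db  = ["secret-%d" % i for i in range(n)]
db2 = list(db); db2[7] = "changed"   # neighboring database: row 7 differs

# The two output distributions differ only on the event j == 7, which has probability 1/n.
print("statistical difference =", 1 / n)
print("revealed row:", san(db))      # but some row is always revealed verbatim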
37
Formalizing Indistinguishability
[Figure: Adversary A sees a transcript of queries and answers and must not be able to tell whether it is S (from DB) or S' (from DB')]
Definition: San is ε-indistinguishable if
∀ A, ∀ DB, DB' which differ in 1 row, ∀ sets of transcripts E:
p( San(DB) ∈ E ) ∈ (1 ± ε) · p( San(DB') ∈ E )
• Equivalently, ∀ S:
p( San(DB) = S ) / p( San(DB') = S ) ∈ e^{±ε} ≈ 1 ± ε
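One standard way to achieve ε-indistinguishability for a single counting query (not spelled out on this slide) is to add Laplace noise of scale 1/ε: changing one row changes the true count by at most 1, and the Laplace density changes by at most a factor e^ε under a shift of 1. A minimal sketch:

import numpy as np

rng = np.random.default_rng()

def noisy_count(db, predicate, eps: float) -> float:
    """Answer #{i : predicate(x_i)} with Laplace(1/eps) noise.

    Changing one row changes the true count by at most 1, so the output
    densities on neighboring databases differ by a factor of at most e^eps.
    """
    true_count = sum(1 for x in db if predicate(x))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / eps)

db = [(1, 0, 1), (0, 1, 1), (1, 1, 0), (1, 1, 1)]
print(noisy_count(db, predicate=lambda x: x[0] == 1, eps=0.5))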
38
Indistinguishability ⇒ Differential Privacy
• Definition: San is safe if
∀ prior distributions p(·) on DB,
∀ transcripts S, ∀ i = 1, …, n:
StatDiff( p0(·|S) , pi(·|S) ) ≤ ε
• We can use indistinguishability:
- For every S and DB: (the argument is sketched below)
- This implies StatDiff( p0(·|S) , pi(·|S) ) ≤ ε
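A sketch of the standard argument for this implication (my reconstruction, with loose constants in ε). For any transcript S and database y, Bayes' rule in Games 0 and i gives

p_0(y \mid S) = \frac{p(y)\,\Pr[\mathrm{San}(y)=S]}{\sum_{y'} p(y')\,\Pr[\mathrm{San}(y')=S]},
\qquad
p_i(y \mid S) = \frac{p(y)\,\Pr[\mathrm{San}(y_{-i})=S]}{\sum_{y'} p(y')\,\Pr[\mathrm{San}(y'_{-i})=S]} .

Since y and y_{-i} differ in one row, ε-indistinguishability keeps each likelihood ratio \Pr[\mathrm{San}(y)=S] / \Pr[\mathrm{San}(y_{-i})=S] within e^{\pm\varepsilon}; the same holds for the normalizers, so every posterior probability changes by at most a factor e^{\pm 2\varepsilon}, giving

\mathrm{StatDiff}\bigl(p_0(\cdot \mid S),\, p_i(\cdot \mid S)\bigr) \le \tfrac{1}{2}\,(e^{2\varepsilon}-1) \approx \varepsilon .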
39
Why does this help?
With relatively little noise:
• Averages
• Histograms
• Matrix decompositions
• Certain types of clustering
• …
See Kobbi’s talk
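As an illustration of "relatively little noise" (my example, not drawn from Kobbi's talk): for an average of n values in [0,1], changing one row shifts the true answer by at most 1/n, so Laplace noise of scale 1/(n·ε) already gives ε-indistinguishability, and the noise vanishes as n grows.

import numpy as np

rng = np.random.default_rng()

def noisy_average(values, eps: float) -> float:
    """ε-indistinguishable average of values in [0,1]: sensitivity 1/n, Laplace(1/(n·eps)) noise."""
    values = np.clip(np.asarray(values, dtype=float), 0.0, 1.0)
    n = len(values)
    return float(values.mean() + rng.laplace(scale=1.0 / (n * eps)))

data = rng.random(10_000)                 # 10,000 respondents with values in [0,1]
print(noisy_average(data, eps=0.1))       # close to the true mean despite strong privacy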
40
Preventing Attribute Disclosure
• Various ways to capture
“no particular value should be revealed”
• Differential Criterion:
- "Whatever is learned would be learned regardless of whether or not person i participates"
• Satisfied by indistinguishability
- Also implies protection from re-identification?
• Two interpretations:
- A given release won't make privacy worse
- A rational respondent will answer if there is some gain
• Can we preserve enough utility?
41
Overview
• Examples
• Intuitions for privacy
- Why crypto def's don't apply
• A Partial* Selection of Definitions
- Two straw men
- Blending into the Crowd
- An impossibility result
- Attribute Disclosure and Differential Privacy
* “partial” = “incomplete” and “biased”
42
Things I Didn’t Talk About
• Economic Perspective [KPR]
- Utility of providing data = value − cost
- May depend on whether others participate
- When is it worth my while?
• Specific methods for re-identification
• Various other frameworks (e.g. "ℓ-diversity")
• Other pieces of the big "data privacy" picture
- Access Control
- Implementing a trusted collection center
43
Conclusions
• Pinning down a social notion in a particular context
• Biased survey of approaches to definitions
- A taste of techniques along the way
- Didn't talk about utility
• The question has a different flavor from
- the usual crypto problems
- statisticians' traditional conception
• Meaningful statements are possible!
- Practical?
- Do they cover everything? No
44
Conclusions
• How close are we to converging?
- e.g. s.f.e. (secure function evaluation), encryption, Turing machines, …
- But we're after a social concept?
- Silver bullet?
• What are the big challenges?
• Need “cryptanalysis” of these systems (Adi…?)
45