Transcript (PowerPoint slides)
Pinning Down “Privacy”
Defining Privacy in Statistical Databases
Adam Smith
Weizmann Institute of Science
http://theory.csail.mit.edu/~asmith
1
Database Privacy
[Diagram: Alice, Bob, …, You → collection and “sanitization” → users (government, researchers, marketers, …)]
“Census problem”
Two conflicting goals
• Utility: Users can extract “global” statistics
• Privacy: Individual information stays hidden
• How can these be formalized?
2
Database Privacy
[Diagram as before: Alice, Bob, …, You → collection and “sanitization” → users (government, researchers, marketers, …)]
“Census problem”
Why privacy?
• Ethical & legal obligation
• Honest answers require respondents’ trust
3
Trust is important
4
Database Privacy
[Diagram as before: Alice, Bob, …, You → collection and “sanitization” → users (government, researchers, marketers, …)]
• Trusted collection agency
• Published statistics may be tables, graphs, microdata, etc
• May have noise or other distortions
• May be interactive
5
Database Privacy
[Diagram as before: Alice, Bob, …, You → collection and “sanitization” → users (government, researchers, marketers, …)]
Variations on this model are studied in
• Statistics
• Data mining
• Theoretical CS
• Cryptography
Different traditions for what “privacy” means
6
How can we formalize “privacy”?
• Different people mean different things
• Pin it down mathematically?
7
Can we approach privacy scientifically?
• Pin down a social concept
• No perfect definition?
• But lots of room for rigor
• Too late? (see Adi’s talk)

I ask them to take a poem
and hold it up to the light
like a color slide

or press an ear against its hive.

[…]

But all they want to do
is tie the poem to a chair with rope
and torture a confession out of it.

They begin beating it with a hose
to find out what it really means.
- Billy Collins, “Introduction to Poetry”
8
How can we formalize “privacy”?
• Different people mean different things
• Pin it down mathematically?
Goal #1: Rigor
Prove clear theorems about privacy
• Few exist in literature
Make clear (and refutable) conjectures
Sleep better at night
Goal #2: Interesting science
(New) Computational phenomenon
Algorithmic problems
Statistical problems
9
Overview
• Examples
• Intuitions for privacy
Why crypto def’s don’t apply
• A Partial* Selection of Definitions
• Conclusions
* “partial” = “incomplete” and “biased”
10
Basic Setting
[Diagram: DB = (x1, x2, x3, …, xn-1, xn) → San (random coins) ↔ Users (government, researchers, marketers, …), exchanging query 1 / answer 1, …, query T / answer T]
• Database DB = table of n rows, each in domain D
D can be numbers, categories, tax forms, etc
This talk: D = {0,1}^d
E.g.: Married?, Employed?, Over 18?, …
11
Examples of sanitization methods
• Input perturbation
Change data before processing
E.g. Randomized response
• flip each bit of table with probability p
• Summary statistics
Means, variances
Marginal totals (# people with blue eyes and brown hair)
Regression coefficients
• Output perturbation
Summary statistics with noise
• Interactive versions of above:
Auditor decides which queries are OK, type of noise
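A minimal sketch of the randomized-response idea above (illustrative code, not from the talk): each bit of a 0/1 table is flipped independently with probability p, and an analyst can still recover column counts in expectation, since E[noisy count] = (1−p)·true + p·(n−true).

    import random

    def randomized_response(table, p):
        # Flip each bit of a 0/1 table independently with probability p.
        return [[bit ^ (random.random() < p) for bit in row] for row in table]

    def debiased_count(noisy_count, n, p):
        # Invert E[noisy count] = (1 - p) * true + p * (n - true); assumes p != 0.5.
        return (noisy_count - p * n) / (1 - 2 * p)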
12
Two Intuitions for Privacy
“If the release of statistics S makes it possible to determine the
value [of private information] more accurately than is possible
without access to S, a disclosure has taken place.” [Dalenius]
• Learning more about me should be hard
Privacy is “protection from being brought to the attention of
others.” [Gavison]
• Safety is blending into a crowd
13
Why not use crypto definitions?
• Attempt #1:
Def’n: For every entry i, no information about xi is leaked
(as if encrypted)
Problem: no information at all is revealed!
Tradeoff privacy vs utility
• Attempt #2:
Agree on summary statistics f(DB) that are safe
Def’n: No information about DB except f(DB)
Problem: how to decide that f is safe?
Tautology trap
(Also: how do you figure out what f is? --Yosi)
14
Overview
• Examples
• Intuitions for privacy
Why crypto def’s don’t apply
• A Partial* Selection of Definitions
Two straw men
Blending into the Crowd
An impossibility result
Attribute Disclosure and Differential Privacy
• Conclusions

Criteria:
• Understandable
• Clear adversary’s goals & prior knowledge / side information
• I am a co-author...
* “partial” = “incomplete” and “biased”
15
Straw man #1: Exact Disclosure
[Diagram: DB = (x1, x2, x3, …, xn-1, xn) → San (random coins) ↔ Adversary A, exchanging query 1 / answer 1, …, query T / answer T]
• Def’n: safe if adversary cannot learn any entry exactly
leads to nice (but hard) combinatorial problems
Does not preclude learning value with 99% certainty or narrowing
down to a small interval
• Historically:
Focus: auditing interactive queries
Difficulty: understanding relationships between queries
E.g. two queries with small difference
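As a concrete instance of “two queries with small difference” (a hypothetical sum-query interface, not part of the talk): subtracting the answers for a group and for the same group minus one person reveals that person’s value exactly, which is why auditing must reason about relationships between queries.

    def difference_attack(sum_query, group, target):
        # Exact disclosure from two overlapping sum queries:
        # the sum over `group` minus the sum over `group` without `target`
        # equals the target's value.
        rest = [i for i in group if i != target]
        return sum_query(group) - sum_query(rest)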
16
Straw man #2: Learning the distribution
• Assume x1,…,xn are drawn i.i.d. from unknown
distribution
• Def’n: San is safe if it only reveals distribution
• Implied approach:
learn the distribution
release description of distrib
or re-sample points from distrib
• Problem: tautology trap
estimate of distrib. depends on data… why is it safe?
17
Blending into a Crowd
• Intuition: I am safe in a group of k or more
k varies (3… 6… 100… 10,000 ?)
• Many variations on theme:
Adv. wants predicate g such that
0 < #{ i | g(xi)=true} < k
g is called a breach of privacy
• Why?
Fundamental:
• R. Gavison: “protection from being brought to the attention of
others”
Rare property helps me re-identify someone
Implicit: information about a large group is public
• e.g. liver problems more prevalent among diabetics
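A small sketch of the breach condition above (illustrative names only): the adversary’s predicate g is a breach when it singles out a nonempty group of fewer than k rows.

    def is_breach(db, g, k):
        # g is a breach of privacy if 0 < #{ i : g(x_i) = true } < k.
        count = sum(1 for x in db if g(x))
        return 0 < count < k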
18
Blending into a Crowd
• Intuition: I am safe in a group of k or more
k varies (3… 6… 100… 10,000 ?)
• Many variations on theme:
Adv. wants predicate g such that
0 < #{ i | g(xi)=true} < k
g is called a breach of privacy
Two variants:
• frequency in DB
• frequency in underlying population
• Why?
Fundamental:
• R. Gavison: “protection from being brought to the attention of others”
Rare property helps me re-identify someone
Implicit: information about a large group is public
• e.g. liver problems more prevalent among diabetics

How can we capture this?
• Syntactic definitions
• Bayesian adversary
• “Crypto-flavored” definitions
19
“Syntactic” Definitions
• Given sanitization S, look at set of all databases consistent with S
• Def’n: Safe if no predicate is a breach for all consistent databases
k-anonymity [L. Sweeney]
• Sanitization is histogram of data
Partition D into bins B1 ∪ B2 ∪ … ∪ Bt
Output cardinalities fj = #( DB ∩ Bj )
[Example table (hair color × eye color): true counts with the cell bounds implied by the released marginals:
            brown eyes    blue eyes    total
  blond      2  [0,12]    10 [0,12]     12
  brown     12  [0,14]     6 [0,16]     18
  total     14            16                ]
• Safe if for all j, either fj ≥ k or fj = 0
Cell bound methods [statistics, 1990’s]
• Sanitization consists of marginal sums
Let fz = #{i : xi = z}. Then San(DB) = various sums of fz
• Safe if for all z, either ∃ consistent DB with fz ≥ k or ∀ consistent DBs, fz = 0
• Large literature using algebraic and combinatorial techniques
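A minimal sketch of the k-anonymity histogram condition above (the partition function `bin_of` is an illustrative stand-in for the bins B1, …, Bt):

    from collections import Counter

    def is_safe_histogram(db, bin_of, k):
        # Safe if every released bin count fj is either 0 or at least k.
        counts = Counter(bin_of(row) for row in db)
        return all(f >= k for f in counts.values() if f > 0)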
20
“Syntactic” Definitions
• Given sanitization S, look at set of all databases consistent with S
• Def’n: Safe if no predicate is a breach for all consistent databases
k-anonymity [L. Sweeney]
• Sanitization is histogram of data
Partition D into bins B1 ∪ B2 ∪ … ∪ Bt
Output cardinalities fj = #( DB ∩ Bj )
• Safe if for all j, either fj ≥ k or fj = 0
Cell bound methods [statistics, 1990’s]
• Sanitization consists of marginal sums
If fz = #{i : xi = z} then output = various sums of fz
• Safe if for all z, either ∃ consistent DB with fz ≥ k or ∀ consistent DBs, fz = 0
• Large literature using algebraic and combinatorial techniques
[Example table as on the previous slide]

Issues:
• If k is small: “all three Canadians at Weizmann sing in a choir.”
• Semantics? Probability not considered
• What if I have side information?
• Algorithm for making decisions not considered
• What adversary does this apply to?
21
Security for “Bayesian” adversaries
Goal:
• Adversary outputs a point z ∈ D
• Score = 1/fz if fz > 0, 0 otherwise
• Def’n: sanitization is safe if E(score) ≤ ε

Procedure:
• Assume you know the adversary’s prior distribution over databases
• Given a candidate output (e.g. a set of marginal sums):
Update the prior conditioned on the output (via Bayes’ rule)
If maxz E( score | output ) < ε then release
Else consider a new set of marginal sums
• Extensive literature on computing the expected value (with efficient computations)

Issues:
• Restricts the type of predicates the adversary can choose
• Must know the prior distribution
Can one scheme work for many distributions?
• Sanitizer works harder than the adversary
• Conditional probabilities don’t consider previous iterations
“Simulatability” [KMN’05]
Can this be fixed? (see Yosi’s talk)
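A rough sketch of the sanitizer’s release test described above (all names are illustrative; `posterior` maps each database consistent with the candidate output to its Bayes-updated probability): release only if the adversary’s best expected score stays below ε.

    def safe_to_release(posterior, candidate_points, epsilon):
        # maxz E( score | output ) < epsilon, where score = 1/fz if fz > 0, else 0.
        def expected_score(z):
            total = 0.0
            for db, prob in posterior.items():
                f_z = sum(1 for row in db if row == z)
                if f_z > 0:
                    total += prob / f_z
            return total
        return max(expected_score(z) for z in candidate_points) < epsilon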
22
Crypto-flavored Approach [CDMSW,CDMT,NS]
“If the release of statistics S makes it possible to determine the
value [of private information] more accurately than is possible
without access to S, a disclosure has taken place.” [Dalenius]
23
Crypto-flavored Approach [CDMSW,CDMT,NS]
• [CDMSW]: Compare to “simulator”:
∀ distributions on databases DB ∈ C
∀ adversaries A, ∃ A’ such that
∀ subsets J ⊆ DB:
PrDB,S[ A(S) = breach in J ] – PrDB[ A’() = breach in J ] ≤ ε
• Definition says nothing if adversary knows x1
Require that it hold for all subsets of DB
• No non-trivial examples satisfying this definition
Restrict family of distributions to some class C of distribs
Try to make C as large as possible
Sufficient: i.i.d. from “smooth” distribution
24
Crypto-flavored Approach [CDMSW,CDMT,NS]
• [CDMSW]: Compare to “simulator”:
∀ distributions on databases DB ∈ C
∀ adversaries A, ∃ A’ such that
∀ subsets J ⊆ DB:
PrDB,S[ A(S) = breach in J ] – PrDB[ A’() = breach in J ] ≤ ε
• [CDMSW,CDMT] Geometric data
Assume xi ∈ R^d
Relax definition:
• Ball predicates gz,r = {x : ||x−z|| ≤ r}
g’z,r = {x : ||x−z|| ≤ C·r}
• Breach if #( DB ∩ gz,r ) > 0 and #( DB ∩ g’z,r ) < k
Several types of histograms can be released
Sufficient for “metric” problems: clustering, min. span tree,…
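A sketch of the relaxed breach test above (illustrative interface; rows are points in R^d given as tuples): a breach occurs when the small ball around z contains a data point but the blown-up ball still contains fewer than k points.

    import math

    def is_ball_breach(db, z, r, C, k):
        # Breach if #(DB ∩ ball(z, r)) > 0 and #(DB ∩ ball(z, C*r)) < k.
        inner = sum(1 for x in db if math.dist(x, z) <= r)
        outer = sum(1 for x in db if math.dist(x, z) <= C * r)
        return inner > 0 and outer < k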
25
Crypto-flavored Approach [CDMSW,CDMT,NS]
• [CDMSW]: Compare to “simulator”:
∀ distributions on databases DB ∈ C
∀ adversaries A, ∃ A’ such that
∀ subsets J ⊆ DB:
PrDB,S[ A(S) = breach in J ] – PrDB[ A’() = breach in J ] ≤ ε
• [NS] No geometric restrictions
A lot of noise
• Almost erase data!
Strong privacy statement
Very weak utility
• [CDMSW,CDMT,NS]: proven statements!

Issues:
• Works for a large class of prior distributions and side information
But not for all
• Not clear if it helps with “ordinary” statistical calculations
• Interesting utility requires geometric restrictions
• Too messy?
26
Blending into a Crowd
• Intuition: I am safe in a group of k or more
• pros:
appealing intuition for privacy
seems fundamental
mathematically interesting
meaningful statements are possible!
• cons
does it rule out learning facts about a particular individual?
all results seem to make strong assumptions on the adversary’s prior distribution
is this necessary? (yes…)
27
Overview
• Examples
• Intuitions for privacy
Why crypto def’s don’t apply
• A Partial* Selection of Definitions
Two Straw men
Blending into the Crowd
An impossibility result
Attribute Disclosure and Differential Privacy
• Conclusions
28
An impossibility result
• An abstract schema:
Define a privacy breach
∀ distributions on databases
∀ adversaries A, ∃ A’ such that
Pr( A(San) = breach ) – Pr( A’() = breach ) ≤ ε
• Theorem: [Dwork-Naor]
For reasonable “breach”, if San(DB) contains information about DB
then some adversary breaks this definition
• Example:
Adv. knows Alice is 2 inches shorter than average Lithuanian
• but how tall are Lithuanians?
With sanitized database, probability of guessing height goes up
Theorem: this is unavoidable
29
Proof sketch
• Suppose
If DB is uniform, then the mutual information I( DB ; San(DB) ) > 0
“breach” is predicting a predicate g(DB)
• Pick a hash function h: {databases} → {0,1}^H(DB|San)
Prior distrib. is uniform conditioned on h(DB)=z
• Then
h(DB)=z gives no info on g(DB)
San(DB) and h(DB)=z together determine DB
• [DN] vastly generalize this
30
Preventing Attribute Disclosure
[Diagram as before: DB = (x1, …, xn) → San (random coins) ↔ Adversary A, exchanging queries and answers 1…T]
• Large class of definitions
safe if adversary can’t learn “too much” about any entry
E.g.:
• Cannot narrow Xi down to small interval
• For uniform Xi, mutual information I( Xi ; San(DB) ) ≤ ε
• How can we decide among these definitions?
31
Differential Privacy
[Diagram as before: DB = (x1, …, xn) → San (random coins) ↔ Adversary A, exchanging queries and answers 1…T]
• Lithuanians example:
Adv. learns height even if Alice not in DB
• Intuition [DM]:
“Whatever is learned would be learned regardless of whether or not
Alice participates”
Dual: Whatever is already known, situation won’t get worse
32
Differential Privacy
[Diagram: DB with one row replaced by 0, i.e. (x1, x2, 0, …, xn-1, xn) → San (random coins) ↔ Adversary A, exchanging queries and answers 1…T]
• Define n+1 games
“Game 0”: Adv. interacts with San(DB)
For each i, let DB-i = (x1,…,xi-1,0,xi+1,…,xn)
“Game i”: Adv. interacts with San(DB-i)
• Bayesian adversary:
Given S and prior distrib p() on DB, define n+1 posterior distrib’s
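A one-line sketch of the n+1 games (rows are 1-indexed as on the slide; the `default_row` argument plays the role of the 0 entry):

    def game_database(db, i, default_row):
        # Game 0 uses DB itself; game i replaces row x_i with a fixed default value.
        if i == 0:
            return list(db)
        return db[:i - 1] + [default_row] + db[i:]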
33
Differential Privacy
[Diagram: DB with one row replaced by 0, i.e. (x1, x2, 0, …, xn-1, xn) → San (random coins) ↔ Adversary A, exchanging queries and answers 1…T]
• Definition: San is safe if
∀ prior distributions p(·) on DB,
∀ transcripts S, ∀ i = 1,…,n:
StatDiff( p0(·|S) , pi(·|S) ) ≤ ε
• Note that the prior distribution may be far from both
• How can we satisfy this?
34
Approach: Indistinguishability [DiNi,EGS,BDMN]
[Diagram: DB = (x1, x2, x3, …, xn-1, xn) and DB’ = (x1, x2’, x3, …, xn-1, xn) differ in 1 row; each is fed to San (with its random coins), producing transcripts S and S’ of query 1 / answer 1, …, query T / answer T.]
Distributions of the two transcripts are at “distance” ≤ ε
Choice of distance measure is important
35
Approach: Indistinguishability [DiNi,EGS,BDMN]
[Same diagram as the previous slide: DB and DB’ differing in one row, each sanitized into transcripts S and S’.]
Distributions of the two transcripts are at “distance” ≤ ε
Choice of distance measure is important
36
Approach: Indistinguishability [DiNi,EGS,BDMN]
Problem: ε must be large
[Diagram: DB and DB’, each sanitized, with transcript distributions S and S’ at distance ≤ ε.]
• By hybrid argument:
Any two databases induce transcripts at distance ≤ nε
• To get utility, ε > 1/n
Statistical difference 1/n is not meaningful
• Example: Release random point in database
San(x1,…,xn) = ( j, xj ) for random j
For every i , changing xi induces statistical difference 1/n
But some xi is revealed with probability 1
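The counterexample above, as a sketch: publishing one uniformly random row changes the transcript distribution by only 1/n when a single row changes, yet it fully reveals someone’s data.

    import random

    def release_random_point(db):
        # San(x1, …, xn) = (j, xj) for a uniformly random index j.
        j = random.randrange(len(db))
        return j, db[j]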
37
Formalizing Indistinguishability
[Diagram: Adversary A sees a transcript of query 1 / answer 1, … — is it S (from DB) or S’ (from DB’)?]
Definition: San is ε-indistinguishable if
∀ A, ∀ DB, DB’ which differ in 1 row, ∀ sets of transcripts E:
p( San(DB) ∈ E ) ∈ (1 ± ε) · p( San(DB’) ∈ E )
• Equivalently, ∀ transcripts S:
p( San(DB) = S ) / p( San(DB’) = S ) ∈ 1 ± ε
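A numeric way to check the definition above for a small discrete mechanism (the `mechanism_pmf` interface is an assumption made for illustration): compute how far the transcript probabilities for two neighboring databases deviate from a ratio of 1.

    def max_ratio_deviation(mechanism_pmf, db, db_neighbor):
        # Returns max over transcripts S of | p(San(DB)=S) / p(San(DB')=S) - 1 |;
        # epsilon-indistinguishability requires this to be at most epsilon.
        # mechanism_pmf(db) is assumed to return {transcript: probability}
        # with the same support for both databases.
        p, q = mechanism_pmf(db), mechanism_pmf(db_neighbor)
        return max(abs(p[s] / q[s] - 1) for s in p)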
38
Indistinguishability ⇒ Differential Privacy
• Definition: San is safe if
∀ prior distributions p(·) on DB,
∀ transcripts S, ∀ i = 1,…,n:
StatDiff( p0(·|S) , pi(·|S) ) ≤ ε
• We can use indistinguishability:
For every S and DB: p( San(DB) = S ) ∈ (1 ± ε) · p( San(DB-i) = S )
This implies StatDiff( p0(·|S) , pi(·|S) ) ≤ ε
39
Why does this help?
With relatively little noise:
• Averages
• Histograms
• Matrix decompositions
• Certain types of clustering
• …
See Kobbi’s talk
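The talk doesn’t spell out a mechanism here, but a standard output-perturbation sketch shows why averages survive with relatively little noise: changing one row of a bounded attribute moves the average by at most (hi − lo)/n, so Laplace noise of scale (hi − lo)/(n·ε) keeps the transcript densities within an e^(±ε) factor (≈ 1 ± ε for small ε). Parameter names are illustrative.

    import random

    def noisy_average(values, lo, hi, epsilon):
        # Release the average of values in [lo, hi] plus Laplace noise
        # scaled to the query's sensitivity (hi - lo) / n.
        n = len(values)
        scale = (hi - lo) / (n * epsilon)
        # Laplace(scale) noise as a difference of two Exp(1) variables.
        noise = scale * (random.expovariate(1.0) - random.expovariate(1.0))
        return sum(values) / n + noise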
40
Preventing Attribute Disclosure
• Various ways to capture
“no particular value should be revealed”
• Differential Criterion:
“Whatever is learned would be learned regardless of whether or not
person i participates”
• Satisfied by indistinguishability
Also implies protection from re-identification?
• Two interpretations:
A given release won’t make privacy worse
Rational respondent will answer if there is some gain
• Can we preserve enough utility?
41
Overview
• Examples
• Intuitions for privacy
Why crypto def’s don’t apply
• A Partial* Selection of Definitions
Two Straw men
Blending into the Crowd
An impossibility result
Attribute Disclosure and Differential Privacy
* “partial” = “incomplete” and “biased”
42
Things I Didn’t Talk About
• Economic Perspective [KPR]
Utility of providing data = value – cost
May depend on whether others participate
When is it worth my while?
• Specific methods for re-identification
• Various other frameworks (e.g. “L-diversity”)
• Other pieces of big “data privacy” picture
Access Control
Implementing trusted collection center
43
Conclusions
• Pinning down social notion in particular context
• Biased survey of approaches to definitions
A taste of techniques along the way
Didn’t talk about utility
• Question has different flavor from
usual crypto problems
statisticians’ traditional conception
• Meaningful statements are possible!
Practical?
Do they cover everything? No
44
Conclusions
• How close are we to converging?
e.g. s.f.e. (secure function evaluation), encryption, Turing machines,…
But we’re after a social concept?
Silver bullet?
• What are the big challenges?
• Need “cryptanalysis” of these systems (Adi…?)
45