An Introduction to Differential Privacy and its Applications


An Introduction to Differential Privacy and its Applications [1]
Ali Bagherzandi
Ph.D. Candidate
University of California at Irvine

[1] Most slides in this presentation are from Cynthia Dwork's talk.
Goal of Differential Privacy:
Privacy-Preserving Statistical Analysis of Confidential Data

- Finding Statistical Correlations
  - Analyzing medical data to learn genotype/phenotype associations
  - Correlating cough outbreak with chemical plant malfunction
  - Can't be done with HIPAA safe-harbor sanitized data
- Noticing Events
  - Detecting spike in ER admissions for asthma
- Official Statistics
  - Census
- Contingency Table Release (statistics for small sets of attributes)
- Model fitting, regression analyses, etc.
- Standard Datamining Tasks
  - Clustering; learning association rules, decision trees, separators; principal component analysis
Two Models:

[Diagram: the curator K sits between the database and the analyst]

Non-Interactive: the data are sanitized by the curator K and released as a sanitized database.

Interactive: the curator K answers multiple queries, adaptively chosen, against the database.
Achievements for Statistical Databases

- Defined Differential Privacy
  - Proved classical goal (Dalenius, 1977) unachievable
  - "Ad Omnia" definition; independent of linkage information
- General Approach; Rigorous Proof
  - Relates degree of distortion to the (mathematical) sensitivity of the computation needed for the analysis
  - "How much" can the data of one person affect the outcome?
  - Cottage industry: redesigning algorithms to be insensitive
- Assorted Extensions
  - When noise makes no sense; when actual sensitivity is much less than worst-case; when the database is distributed; …
- Lower bounds on distortion
  - Initiated by Dinur and Nissim
Semantic Security for Confidentiality
[Goldwasser-Micali '82]
Vocabulary
Plaintext: the message to be transmitted
Ciphertext: the encryption of the plaintext
Auxiliary information: anything else known to attacker
The ciphertext leaks no information about the plaintext.
Formalization
Compare the ability of someone seeing aux and ciphertext to guess
(anything about) the plaintext, to the ability of someone seeing
only aux to do the same thing. Difference should be “tiny”.
Semantic Security for Privacy?
Dalenius, 1977
Anything that can be learned about a respondent from the
statistical database can be learned without access to the
database.
Happily, Formalizes to Semantic Security
Unhappily, Unachievable [Dwork and Naor 2006]
For both non-serious and serious reasons.
Semantic Security for Privacy?
The Serious Reason (told as a parable)
The database teaches average heights of population subgroups.
Auxiliary information: "Terry Tao is four inches taller than the avg Swedish ♀."
Combined with access to the DB, this teaches Terry's height.
So Terry's height is learnable from the DB, but not learnable without it.
Proof extends to “any” notion of privacy breach.
Attack Works Even if Terry Not in DB!
Suggests new notion of privacy: risk incurred by joining DB
Before/after interacting vs. risk when in / not in the DB.
Differential Privacy: Formal Definition
K gives ε-differential privacy if for all values of DB, DB' differing in a single element, and all subsets S of Range(K):

    Pr[K(DB) ∈ S] / Pr[K(DB') ∈ S] ≤ e^ε ≈ 1 + ε   (for small ε)

i.e., the ratio of the two output distributions is bounded.
Achieving Differential Privacy:

[Diagram: the curator K computes f on the database and adds noise to the result]

    f: DB → R^d
    K(f, DB) = f(DB) + [Noise]^d

E.g., Count(P, DB) = # rows in DB with property P.
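As a concrete sketch of this recipe (assuming Python with NumPy; the helper names and the toy database are illustrative, not from the talk), a count query with ΔCount = 1 can be answered as follows:

    import numpy as np

    def count(db, has_property):
        # Count(P, DB): number of rows in DB with property P.
        return sum(1 for row in db if has_property(row))

    def laplace_mechanism(value, sensitivity, epsilon):
        # K(f, DB) = f(DB) + Lap(R) with R = Δf/ε; fresh noise on every call.
        rng = np.random.default_rng()
        return value + rng.laplace(scale=sensitivity / epsilon)

    # Toy database of claims; query: how many rows exceed 10?  ΔCount = 1.
    db = [9.91, 9.16, 10.59, 11.27, 10.50, 11.89]
    true_count = count(db, lambda x: x > 10)
    print(laplace_mechanism(true_count, sensitivity=1, epsilon=1.0))

Each call draws fresh noise, so repeated queries return different noisy answers around the true count.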
Sensitivity of Function
How Much Can f(DB + Me) Exceed f(DB − Me)?
Recall: K(f, DB) = f(DB) + noise
Question asks: what difference must the noise obscure?

    Δf = max_{H(DB, DB')=1} ||f(DB) − f(DB')||_1   (max over DB, DB' differing in one row)

E.g., ΔCount = 1.
Calibrate Noise to Sensitivity

    Δf = max_{H(DB, DB')=1} ||f(DB) − f(DB')||_1

Theorem: To achieve ε-differential privacy, use scaled symmetric noise [Lap(R)]^d with R = Δf/ε.

[Figure: Laplace density, x-axis marked −4R … 5R]

    Lap(R): f(x) = (1/2R) exp(−|x|/R)

Increasing R flattens the curve: more privacy.
The noise depends on f and ε, not on the database.
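To see how the scale R = Δf/ε trades off against ε, here is a tiny sketch (the ε values echo the ones mentioned later in the talk; everything else is illustrative):

    import math

    delta_f = 1.0                        # e.g. ΔCount = 1
    for epsilon in (0.1, math.log(2), 1.0):
        R = delta_f / epsilon            # Laplace scale R = Δf/ε
        print(f"ε = {epsilon:.3f}  →  R = {R:.2f}, noise std ≈ {math.sqrt(2) * R:.2f}")

Smaller ε means a larger R, a flatter curve, and hence more privacy but noisier answers.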
Why does it work? (d = 1)

    Δf = max_{DB, DB−Me} |f(DB) − f(DB−Me)|

Theorem: To achieve ε-differential privacy, use scaled symmetric noise Lap(R) with R = Δf/ε.

[Figure: two shifted Laplace densities, centered at f(DB) and f(DB'), x-axis marked −4R … 5R]

    Pr[K(f, DB) = t] / Pr[K(f, DB') = t]
      = exp(−(|t − f(DB)| − |t − f(DB')|)/R)
      ≤ exp(|f(DB) − f(DB')|/R)
      ≤ exp(Δf/R) = e^ε
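A quick numerical check of this ratio bound (a minimal sketch; the outputs 42 and 43 for the two neighboring databases are made-up values with |f(DB) − f(DB')| = Δf = 1):

    import math

    def lap_density(t, center, R):
        # Density of f(DB) + Lap(R) at output t.
        return math.exp(-abs(t - center) / R) / (2 * R)

    epsilon, delta_f = 1.0, 1.0
    R = delta_f / epsilon
    f_db, f_db_prime = 42.0, 43.0        # made-up outputs on neighboring databases

    worst_ratio = max(lap_density(t, f_db, R) / lap_density(t, f_db_prime, R)
                      for t in range(30, 56))
    print(worst_ratio, "≤", math.exp(epsilon))

Over the sampled outputs the density ratio never exceeds e^ε, as the derivation above predicts.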
Differential Privacy: Example

    ID    Claim ('000 $)        ID    Claim ('000 $)
     1         9.91              7        10.63
     2         9.16              8        12.54
     3        10.59              9         9.29
     4        11.27             10         8.92
     5        10.50             11        20.00
     6        11.89             12         8.55
     …          …                …          …
Query: average claim
Aux input: a fraction C of the entries
Without noise: one can learn my claim with probability C

What privacy level is enough for me?
I feel safe if I'm hidden among n people → Pr[identify] = 1/n.
If n = 100 and C = 0.01, in this example ε = 1.
In most applications the same ε gives a stronger feeling of privacy,
e.g., because the adversary's identification is itself probabilistic → ε = 0.1, 1.0, ln(2).
Differential Privacy: Example

[Same claims table as above]
Query: average claim
Aux input: 12 entries
ε = 1 → hide me among 1200 people

    Δf = max_{DB, DB−Me} |Avg(DB) − Avg(DB−Me)| = 11.10 − 10.29 = 0.81
    → Avg is not a sensitive function!

Add noise: Lap(Δf/ε) = Lap(0.81/1) = Lap(0.81)
Var = 1.31 → roughly, respond with the true value ± 1.31
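A short sketch of this calculation (assuming Python with NumPy; the sensitivity is computed over this particular table, as on the slide, and the helper names are mine):

    import numpy as np

    claims = [9.91, 9.16, 10.59, 11.27, 10.50, 11.89,
              10.63, 12.54, 9.29, 8.92, 20.00, 8.55]

    def avg(values):
        return sum(values) / len(values)

    def avg_sensitivity(values):
        # max over "Me" of |Avg(DB) − Avg(DB − Me)| for this database.
        full = avg(values)
        return max(abs(full - avg(values[:i] + values[i + 1:]))
                   for i in range(len(values)))

    epsilon = 1.0
    delta_f = avg_sensitivity(claims)                          # ≈ 0.81
    noisy_avg = avg(claims) + np.random.default_rng().laplace(scale=delta_f / epsilon)
    print(f"Δf = {delta_f:.2f}, noisy average = {noisy_avg:.2f}")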
Differential Privacy: Example

[Same claims table as above]
Query: max claim
Aux input: 12 entries
ε = 1 → hide me among 1200 people

    Δf = max_{DB, DB−Me} |Max(DB) − Max(DB−Me)| = 20 − 12.54 = 7.46
    → Max is a very sensitive function!

Add noise: Lap(Δf/ε) = Lap(7.46/1) = Lap(7.46)
Var = 111.30 → roughly, respond with the true value ± 111.30
→ Useless: the error exceeds the sampling error.
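The same sketch for the max query reproduces the slide's numbers (helper name is mine):

    claims = [9.91, 9.16, 10.59, 11.27, 10.50, 11.89,
              10.63, 12.54, 9.29, 8.92, 20.00, 8.55]

    def max_sensitivity(values):
        # max over "Me" of |Max(DB) − Max(DB − Me)| for this database.
        full = max(values)
        return max(abs(full - max(values[:i] + values[i + 1:]))
                   for i in range(len(values)))

    epsilon = 1.0
    delta_f = max_sensitivity(claims)            # 20 − 12.54 = 7.46
    R = delta_f / epsilon
    print(f"Δf = {delta_f:.2f}, Lap scale R = {R:.2f}, variance = {2 * R * R:.2f}")

The noise variance (≈ 111.30) dwarfs the spread of the claims themselves, which is why the noisy answer is useless here.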
A Natural Relaxation: (δ, T)-Differential Privacy
For all DB, DB' differing in at most one element, and all subsets S of Range(K):

    Pr[K(DB) ∈ S] ≤ e^δ · Pr[K(DB') ∈ S] + T

where T = T(n) is negligible.
Cf.: ε-differential privacy is unconditional, independent of n.
Feasibility Results: [Dwork Nissim '04]
(Here: Revisionist Version)

Database D = {d_1, …, d_n}, rows in {0,1}^k.
T "counting queries" q = (S, g), with S ⊆ [1..n] and g: rows → {0,1}; "How many rows in S satisfy g?"

    q(D) = ∑_{i ∈ S} g(d_i)

Privacy Mechanism: Add Binomial Noise

    r = q(D) + B(f(n), ½)

SuLQ Theorem: (δ, T)-differential privacy when
    T(n) = O(n^c), 0 < c < 1;  δ = 1/O(n^{c'}), 0 < c' < c/2;
    f(n) = (T(n)/δ²) · log⁶(n).

If √T(n)/δ < O(√n), then |noise| < |sampling error|!
E.g.: T = n^{3/4} and δ = 1/10, then √T/δ = 10·n^{3/8} << √n.
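A minimal sketch of this style of mechanism (assuming Python with NumPy; the toy database and the noise parameter f_n below are illustrative stand-ins, not the theorem's f(n)):

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy database: n rows, each a single 0/1 attribute.
    n = 10_000
    db = rng.integers(0, 2, size=n)

    def noisy_count(db, S, g, f_n):
        # Counting query q = (S, g) answered with centered binomial noise B(f_n, 1/2).
        true_answer = sum(g(db[i]) for i in S)
        noise = rng.binomial(f_n, 0.5) - f_n / 2      # mean 0, std ≈ √f_n / 2
        return true_answer + noise

    # Illustrative noise parameter; the theorem sets f(n) = (T(n)/δ²)·log⁶(n).
    f_n = 4 * n
    S = range(n)
    print(noisy_count(db, S, lambda row: row == 1, f_n), "true:", int(db.sum()))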
Impossibility Results:
Line of research initiated by Dinur and Nissim '03.

Blatant Non-Privacy: the adversary guesses correctly n − o(n) rows.

Blatant non-privacy holds if the answers are within o(√n) of the true answer:
    for all of the answers,           against an adversary restricted to n queries and poly(n) computation   [Y]
    for ρ·cn of the answers,          against an adversary restricted to cn queries and poly(n) computation  [DMT]
    for (1/2 + ε)·c'n of the answers, against an adversary restricted to c'n queries and exp(n) computation  [DMT]

The results are independent of how the noise is distributed.
A variant model permits poly(n) computation in the final case [DY].
With 2^n queries, noise of magnitude E permits correctly guessing n − 4E rows [DiNi].
Depressing implications for the non-interactive case.
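To make the reconstruction idea concrete, here is a toy sketch in the spirit of the Dinur-Nissim attack (assuming Python with NumPy and SciPy; the random query matrix, the noise bound E, and the LP-plus-rounding strategy are illustrative choices, not the paper's exact construction). It finds any fractional database consistent with all noisy subset-sum answers and rounds it:

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(0)
    n, m, E = 64, 8 * 64, 2                 # rows, queries, per-answer noise bound (toy sizes)
    d = rng.integers(0, 2, size=n)          # the secret bit of each row
    Q = rng.integers(0, 2, size=(m, n))     # random subset-sum queries
    answers = Q @ d + rng.integers(-E, E + 1, size=m)   # each answer off by at most E

    # Find ANY fractional database x in [0,1]^n with |Qx − answer| ≤ E for every query,
    # then round: if the noise is small, x must agree with d on almost all rows.
    A_ub = np.vstack([Q, -Q])
    b_ub = np.concatenate([answers + E, -(answers - E)])
    res = linprog(c=np.zeros(n), A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * n, method="highs")
    guess = (res.x > 0.5).astype(int)
    print("rows guessed correctly:", int((guess == d).sum()), "out of", n)

With noise this small the rounded solution recovers nearly every row, illustrating why answers accurate to within o(√n) cannot be private.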
Thank You!