Differential Privacy



2016.6.2
- A hospital has a database of patient records, each record containing a binary
value indicating whether or not the patient has cancer.
- Suppose an adversary is only allowed to use a particular form of query S(i) that
returns the sum of the first i rows of the second column:
  patient | has cancer
  --------|-----------
  Amy     | 0
  Tom     | 1
  Jack    | 1
Suppose the adversary also knows that Jack is in the last row of the database.
Does Jack have cancer? He can find out by computing S(3) - S(2).
Differential privacy addresses the question of whether, given the total number of
patients with cancer, an adversary can learn if a particular individual has cancer.
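To make the differencing attack concrete, here is a minimal Python sketch (the toy data mirrors the table above; nothing here comes from a real dataset):

    # Toy database from the example above: (patient, has cancer)
    db = [("Amy", 0), ("Tom", 1), ("Jack", 1)]

    def S(i):
        # Sum of the "has cancer" column over the first i rows.
        return sum(has_cancer for _, has_cancer in db[:i])

    # The adversary knows Jack is the last (third) row, so the difference of
    # two allowed queries reveals his value exactly.
    print(S(3) - S(2))  # 1 -> Jack has cancer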
Differential Privacy
The differential privacy model is derived from a very simple observation:
Suppose the dataset D contains an individual, for example Alice, and we run an
arbitrary query f (for example count, sum, average, median, or other queries) and
get the result f(D). If, after deleting Alice from D, the result of the query is
still f(D), then Alice's information is not leaked.
Differential privacy aims to maximize the accuracy of queries on a dataset while
minimizing the chance of identifying its individual records.
Let 𝒜 : D^n → 𝒴 be a randomized algorithm, and let D1, D2 ∈ D^n be two datasets
that differ in at most one record (we call these neighboring databases).
[Figure: neighboring databases D1 and D2, identical except for one record (xi vs. xi').]
Differential Privacy. Let ε > 0. 𝒜 is ε-differentially private if for all
neighboring databases D1, D2, and for all (measurable) subsets Y ⊆ 𝒴, we have

    exp(-ε) ≤ Pr[𝒜(D2) ∈ Y] / Pr[𝒜(D1) ∈ Y] ≤ exp(ε),

where the probability is taken over the randomness of 𝒜.
• ε : privacy level (a smaller ε means stronger privacy)
• (Sensitivity) For any function f : D → ℝ^n, the sensitivity of f is
    Δf = max ‖f(D1) - f(D2)‖₁
  over all neighboring databases D1, D2. (A worked example follows this list.)
• k-anonymity and its extended models (l-diversity, t-closeness, …) cannot provide
  sufficient security guarantees.
• Differential privacy does not need to make any assumption about the background
  knowledge an attacker may have.
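As a worked example of sensitivity (added here for illustration, not from the slides): take the counting query from the hospital example, f(D) = number of patients with cancer. Neighboring databases differ in at most one record, so the count can change by at most one:

    Δf = max over neighbors D1, D2 of |f(D1) - f(D2)| = 1,

so a mechanism such as the Laplace mechanism listed below only needs noise of scale Δf/ε = 1/ε to answer such counting queries.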
Theorem 1. (Sequential composability) Let each 𝒜i provide εi-differential privacy.
A sequence of 𝒜i(D) applied to the same dataset D provides (Σi εi)-differential privacy.
Theorem 2. (Parallel composability) Let each 𝒜i provide εi-differential privacy,
and let the Di be disjoint subsets of the dataset D. The sequence of 𝒜i(Di) applied
to these disjoint subsets provides max(εi)-differential privacy.
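A minimal sketch of how the two composition theorems are used for privacy-budget accounting (the per-query budgets below are made-up numbers):

    # Theorem 1: analyses run on the SAME dataset -> budgets add up.
    sequential_epsilons = [0.1, 0.3, 0.1]
    total_sequential = sum(sequential_epsilons)   # 0.5-differential privacy overall

    # Theorem 2: analyses run on DISJOINT subsets of the dataset
    # (e.g., one analysis per cell) -> the largest budget dominates.
    parallel_epsilons = [0.1, 0.3, 0.1]
    total_parallel = max(parallel_epsilons)       # 0.3-differential privacy overall

    print(total_sequential, total_parallel)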
• (δ, ε) - Probabilistic Differential Privacy
• (δ, ε) - Indistinguishability
  A privacy mechanism gives (δ, ε)-indistinguishability if, for outputs τ and τ'
  based on data sets differing in at most one row, the output probabilities satisfy
  the exp(±ε) bound above up to an additive slack δ.
• Laplace Mechanism (a sketch follows this list)
• Gaussian Mechanism (probabilistic)
• Exponential Mechanism
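As an illustration of the Laplace mechanism, here is a minimal Python sketch; the example query and parameter values are made up, not taken from the slides:

    import numpy as np

    def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
        # Differentially private answer: true value + Laplace(sensitivity/epsilon) noise.
        rng = rng or np.random.default_rng()
        return true_answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

    # Example: private count of cancer patients (the sensitivity of a count is 1).
    print(laplace_mechanism(true_answer=2, sensitivity=1.0, epsilon=0.5))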
A sports event is going to be held, and the event is selected from the
set {football, volleyball, basketball, tennis}.
Participants voted for their preferred item. We now want to choose an item while
ensuring that the entire decision-making process satisfies ε-differential privacy.
Taking the number of votes as the utility function, obviously Δu = 1. According
to the exponential mechanism, given a privacy budget ε, we can calculate the
output probability of each item, as shown in the table below.
  item       |  u | ε = 0 | ε = 0.1 | ε = 1
  -----------|----|-------|---------|--------
  Football   | 30 | 0.25  | 0.424   | 0.924
  Volleyball | 25 | 0.25  | 0.330   | 0.075
  Basketball |  8 | 0.25  | 0.141   | 1.5e-05
  Tennis     |  2 | 0.25  | 0.105   | 7.7e-07
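A minimal sketch of how these probabilities are obtained: the exponential mechanism outputs item r with probability proportional to exp(ε·u(r) / (2Δu)); normalizing over the four items reproduces the table values (the code below is illustrative):

    import numpy as np

    def exponential_mechanism_probs(utilities, epsilon, delta_u=1.0):
        # Pr[r] proportional to exp(epsilon * u(r) / (2 * delta_u)).
        scores = np.exp(epsilon * np.asarray(utilities, dtype=float) / (2 * delta_u))
        return scores / scores.sum()

    votes = {"Football": 30, "Volleyball": 25, "Basketball": 8, "Tennis": 2}
    for eps in (0.0, 0.1, 1.0):
        probs = exponential_mechanism_probs(list(votes.values()), eps)
        print(eps, dict(zip(votes, np.round(probs, 3))))
    # eps = 0   -> every item has probability 0.25
    # eps = 0.1 -> Football 0.424, Volleyball 0.330, Basketball 0.141, Tennis 0.105
    # eps = 1   -> Football ~0.924, the other items have almost no probability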
• Privacy Preserving Data Release (PPDR)
  ➢ Interactive data release: the DP mechanism sits between the analyst and the
    raw data and perturbs the answer to each query (Query 1, …, Query i) as it arrives.
  ➢ Non-interactive data release: the DP mechanism is applied to the raw data once
    to publish purified (sanitized) datasets, from which all query results are then computed.
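A minimal sketch of the interactive setting (the class and the budget split below are hypothetical, shown only to illustrate how a DP layer mediates queries and tracks the budget):

    import numpy as np

    class InteractiveDPServer:
        # Answers queries on the raw data with Laplace noise and refuses to answer
        # once the total privacy budget is spent (sequential composition).
        def __init__(self, data, total_epsilon):
            self.data = data
            self.remaining = total_epsilon
            self.rng = np.random.default_rng()

        def query(self, f, sensitivity, epsilon):
            if epsilon > self.remaining:
                raise RuntimeError("privacy budget exhausted")
            self.remaining -= epsilon
            return f(self.data) + self.rng.laplace(scale=sensitivity / epsilon)

    server = InteractiveDPServer(data=[0, 1, 1], total_epsilon=1.0)
    print(server.query(sum, sensitivity=1.0, epsilon=0.5))  # noisy count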
Case study: Gergely Acs (INRIA, [email protected]) and
Claude Castelluccia (INRIA, [email protected])
Attack models:
• Record linkage
• Attribute linkage
• Table linkage
• A probabilistic attack
• CDR dataset from the French telecom company Orange
• 1,992,846 users
• 1303 towers
• 989 IRIS cells
• 10/09/2007 - 17/09/2007 (one week)
• Aim: release the time series of the IRIS cells (the number of individuals at each
  location L in the (t+1)-th hour of the week) without leaking privacy
• Method: apply a differentially private sanitization mechanism to the time series
  of all IRIS cells and release the sanitized version.
For a given privacy level, the magnitude of noise can be substantially reduced by
using several optimizations and by customizing the anonymization mechanisms to
the public characteristics of the datasets and applications.
• Pre-sampling: select at most l visits per user
• Computing the largest covering cells
• Perturbation, exploiting two properties of the data:
  ➢ Geographically close cells have similar time series → cluster cells
  ➢ The time series are periodic → add Gaussian noise to the low-frequency DCT
    components (a sketch follows below)
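A minimal sketch of the DCT-based perturbation idea (the series length, the number of retained low-frequency coefficients, and the noise scale below are illustrative placeholders, not the parameters of the actual case study):

    import numpy as np
    from scipy.fft import dct, idct

    def perturb_time_series(series, n_low_freq=20, noise_sigma=1.0, rng=None):
        # Keep only the low-frequency DCT components of a (periodic) time series,
        # add Gaussian noise to them, and transform back.
        rng = rng or np.random.default_rng()
        coeffs = dct(np.asarray(series, dtype=float), norm="ortho")
        noisy = np.zeros_like(coeffs)
        noisy[:n_low_freq] = coeffs[:n_low_freq] + rng.normal(0.0, noise_sigma, size=n_low_freq)
        return idct(noisy, norm="ortho")

    # Example: one week of hourly counts for a single IRIS cell (7 * 24 = 168 values).
    hourly_counts = np.random.default_rng(0).poisson(lam=50, size=168)
    print(perturb_time_series(hourly_counts)[:5])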