Differential Privacy
Transcript
2016.6.2
- A hospital has a database of patient records; each record contains a binary value indicating whether or not the patient has cancer.
- Suppose an adversary is only allowed to use a particular form of query, S(i), which returns the sum of the first i rows of the second column:
patient    has cancer
Amy        0
Tom        1
Jack       1
Suppose the adversary also knows that Jack is in the last row of the database.
Does Jack have cancer? Compute S(3) − S(2).
Differential privacy addresses the question of whether, given the total number of patients with cancer, an adversary can learn if a particular individual has cancer.
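The differencing attack above can be sketched in a few lines (a toy illustration, with the table hard-coded from the example):

```python
# A toy sketch of the differencing attack; the table and the
# query S(i) are taken directly from the example above.
table = [("Amy", 0), ("Tom", 1), ("Jack", 1)]

def S(i):
    """Sum of the 'has cancer' column over the first i rows."""
    return sum(has_cancer for _, has_cancer in table[:i])

# Knowing Jack occupies the last row, the adversary recovers his value:
jacks_value = S(3) - S(2)
print(jacks_value)  # 1, so Jack has cancer
```

The sums S(2) and S(3) are each harmless-looking aggregates, yet their difference isolates a single individual's record exactly.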
Differential Privacy
The differential privacy model is derived from a very simple observation:
suppose the dataset D contains an individual, for example Alice, and we run an arbitrary query f (count, sum, average, median, etc.) to obtain the result f(D). If, after deleting Alice from D, the query still returns f(D), then Alice's presence in the dataset leaks no information.
Differential privacy aims to maximize the accuracy of queries on a dataset while minimizing the chance of identifying its individual records.
Let M : D^n → A be a randomized algorithm. Let D1, D2 ∈ D^n be two datasets that differ in at most one record (we call these database neighbors).
[Figure: neighboring databases D1 and D2, identical except for one record, x_i vs. x_i'.]
Differential Privacy. Let ε > 0. Define M to be ε-differentially private if for all neighboring databases D1, D2, and for all (measurable) subsets S ⊆ A, we have

  exp(−ε) ≤ Pr[M(D2) ∈ S] / Pr[M(D1) ∈ S] ≤ exp(ε),

where the probability is taken over the randomness of M.
- ε: the privacy level.
- (Sensitivity) For any function f : D → R^d, the sensitivity of f is
  Δf = max ‖f(D1) − f(D2)‖₁
  over all neighboring databases D1, D2.
- k-anonymity and its extension models (l-diversity, t-closeness, …) cannot provide sufficient security.
- Differential privacy makes no assumptions about what background knowledge an attacker may have.
Theorem 1 (Sequential composability). Let each M_i provide ε_i-differential privacy. A sequence of M_i(D) over the same dataset D provides (Σ ε_i)-differential privacy.
Theorem 2 (Parallel composability). Let each M_i provide ε_i-differential privacy, and let the D_i be disjoint subsets of the dataset D. The sequence of M_i(D_i) provides max(ε_i)-differential privacy.
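The two composition rules amount to simple budget accounting, sketched here with hypothetical values (not from the source):

```python
# Hypothetical privacy budgets for three eps_i-DP queries
# (illustrative values only).
budgets = [0.5, 0.25, 0.25]

# Theorem 1: queries run on the SAME dataset, the eps_i add up.
sequential_eps = sum(budgets)   # 1.0

# Theorem 2: queries run on DISJOINT subsets, only the largest eps_i counts.
parallel_eps = max(budgets)     # 0.5
```

This is why partitioning the data (parallel composition) is so valuable in practice: it lets many queries share one budget instead of consuming it additively.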
- (δ, ε)-Probabilistic Differential Privacy
- (δ, ε)-Indistinguishability
  A privacy mechanism gives (δ, ε)-indistinguishability if, for outputs τ and τ' based on data sets differing in at most one row,
  Pr[τ] ≤ exp(ε) · Pr[τ'] + δ.
- Laplace Mechanism
- Gaussian Mechanism (probabilistic)
- Exponential Mechanism
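A minimal sketch of the Laplace mechanism (an assumed implementation, not code from the source): add noise drawn from Lap(Δf/ε) to the true query answer.

```python
import math
import random

def laplace_noise(scale):
    """Sample Laplace(0, scale) noise via inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def laplace_mechanism(true_answer, sensitivity, eps):
    """Release true_answer with eps-differential privacy
    by adding Laplace(sensitivity / eps) noise."""
    return true_answer + laplace_noise(sensitivity / eps)

# A counting query (e.g. "how many patients have cancer?") has Δf = 1,
# since one record changes the count by at most 1:
noisy_count = laplace_mechanism(true_answer=2, sensitivity=1, eps=0.5)
```

Smaller ε means a larger noise scale Δf/ε, i.e. stronger privacy at the cost of accuracy.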
A sports event is going to be held, with the event chosen from the set {football, volleyball, basketball, tennis}. Participants have voted for their preference. We now choose an item while ensuring that the entire decision-making process satisfies ε-differential privacy. Take the number of votes as the utility function; clearly Δu = 1. According to the exponential mechanism, given a privacy budget ε, we can calculate the output probability of each item, as shown in the table.
item        u     ε=0     ε=0.1    ε=1
Football    30    0.25    0.424    0.924
Volleyball  25    0.25    0.330    0.075
Basketball   8    0.25    0.141    1.5e-05
Tennis       2    0.25    0.105    7.7e-07
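The table can be reproduced with a short calculation (a sketch; the item names and vote counts are taken from the example): the exponential mechanism outputs item r with probability proportional to exp(ε·u(r) / (2Δu)).

```python
import math

# Vote counts from the example; Δu = 1 as stated above.
votes = {"Football": 30, "Volleyball": 25, "Basketball": 8, "Tennis": 2}
delta_u = 1

def output_probabilities(eps):
    """Exponential-mechanism output distribution for privacy budget eps."""
    weights = {item: math.exp(eps * u / (2 * delta_u))
               for item, u in votes.items()}
    total = sum(weights.values())
    return {item: w / total for item, w in weights.items()}

print(output_probabilities(1))  # Football ≈ 0.924, matching the table
```

Note how ε = 0 ignores utility entirely (uniform 0.25 each), while ε = 1 already makes the top-voted item overwhelmingly likely.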
- Privacy Preserving Data Release (PPDR)
  - Interactive data release: the raw data stays with the curator; each query (Query 1, …, Query i) is answered through the DP mechanism.
  - Non-interactive data release: the DP mechanism is applied to the raw data once, publishing purified datasets from which all query results are then computed.
Case study: Gergely Acs (INRIA, [email protected]) and Claude Castelluccia (INRIA, [email protected])
- Record linkage
- Attribute linkage
- Table linkage
- A probabilistic attack
- CDR dataset from the French telecom company Orange
- 1,992,846 users
- 1303 towers
- 989 IRIS cells
- 10/09/2007 - 17/09/2007 (one week)
- Aim: release the time series of the IRIS cells (the number of individuals at location L in the (t+1)th hour of the week) without leaking privacy.
- Method: transform the time series of all IRIS cells into a sanitized version that satisfies differential privacy.
For a given privacy level, the magnitude of noise can be substantially reduced by using several optimizations and by customizing the anonymization mechanisms to the public characteristics of the datasets and applications.
- Pre-sampling: select at most l visits per user.
- Computing the largest covering cells.
- Perturbation:
  - exploit the similarity of geographically close time series → cluster cells
  - exploit the periodic nature of the series → add Gaussian noise to the DCT low-frequency components
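A toy sketch of the perturbation step (an assumed implementation with made-up data, not the authors' code): apply the DCT, add Gaussian noise only to the low-frequency coefficients that carry most of a periodic series' energy, and invert.

```python
import math
import random

def dct(x):
    """Orthonormal DCT-II, implemented naively."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * s)
    return out

def idct(c):
    """Inverse of the orthonormal DCT-II above."""
    n = len(c)
    out = []
    for t in range(n):
        s = c[0] / math.sqrt(n)
        s += sum(math.sqrt(2 / n) * c[k] * math.cos(math.pi * (t + 0.5) * k / n)
                 for k in range(1, n))
        out.append(s)
    return out

def perturb(series, keep, sigma):
    """Add Gaussian noise to the first `keep` DCT coefficients,
    drop the high-frequency rest, and reconstruct."""
    coeffs = dct(series)
    noisy = [c + random.gauss(0, sigma) if k < keep else 0.0
             for k, c in enumerate(coeffs)]
    return idct(noisy)

# Toy hourly counts with a daily period (not the Orange CDR dataset):
series = [50 + 30 * math.sin(2 * math.pi * t / 24) for t in range(168)]
sanitized = perturb(series, keep=20, sigma=1.0)
```

Because the series is periodic, a few low-frequency coefficients capture most of its shape, so noising only those coefficients distorts the reconstruction far less than noising all 168 hourly values directly.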