Slides - Jun Zhang
Jun Zhang, Graham Cormode, Cecilia M. Procopiuc,
Divesh Srivastava, Xiaokui Xiao
The Problem: Private Data Release
◦ Differential Privacy
◦ Challenges
The Algorithm: PrivBayes
◦ Bayesian Network
◦ Details of PrivBayes
Function 𝐹: Linear vs. Logarithmic
Experiments
The Problem: Private Data Release
A company or institute holds a sensitive database 𝐷. Publishing it directly exposes individuals to the adversary. Goal: release a synthetic database 𝐷∗ with similar properties, supporting accurate inference for the public, while still protecting against the adversary.
How can we design such a private data release algorithm?
Differential Privacy
Definition of 𝜀-Differential Privacy
◦ A randomized data release algorithm 𝑨 satisfies 𝜀-differential privacy if, for any two neighboring datasets 𝑫, 𝑫′ and for any possible synthetic data 𝑫∗,

Pr[𝑨(𝑫) = 𝑫∗] ≤ exp(𝜀) ⋅ Pr[𝑨(𝑫′) = 𝑫∗]
Two neighboring databases 𝑫 and 𝑫′, differing only in Frank's record:

𝑫:                        𝑫′:
Name   | Has cancer?      Name   | Has cancer?
Alice  | Yes              Alice  | Yes
Bob    | No               Bob    | No
Chris  | Yes              Chris  | Yes
Denise | Yes              Denise | Yes
Eric   | No               Eric   | No
Frank  | Yes              Frank  | No
A general approach to achieve differential privacy is
injecting Laplace noise into the output, in order to mask
the impact of any individual!
More details in Preliminaries part of the paper
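As a hedged sketch of the Laplace approach (my own illustration, not the paper's code; the count query, 𝜀 value, and function names are chosen for this example), a count with sensitivity 1 can be released as follows:

```python
import math
import random

def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale) by inverse transform:
    u ~ Uniform(-0.5, 0.5), noise = -scale * sign(u) * ln(1 - 2|u|)."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_count(count, epsilon, rng):
    """epsilon-DP release of a count query. Changing one individual's
    record shifts the count by at most 1 (sensitivity 1), so Laplace
    noise of scale 1/epsilon masks that individual's impact."""
    return count + laplace_noise(1.0 / epsilon, rng)

# Release the number of cancer patients in the toy table above (4 of 6).
rng = random.Random(0)
released = noisy_count(4, epsilon=0.1, rng=rng)
```

The noise is centered at the true answer, so averages over many releases stay accurate, while any single release hides whether Frank's record is "Yes" or "No".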
Design a data release algorithm with a
differential privacy guarantee.
Challenges
To build synthetic data, we need to understand the
tuple distribution Pr[∗] of the sensitive data. The straightforward pipeline:

sensitive database 𝐷 → (convert) → full-dim tuple distribution → (+ noise) → noisy distribution → (sample) → synthetic database 𝐷∗
Example: Database has 10M tuples, 10 attributes
(dimensions), and 20 values per attribute:
Scalability: full distribution Pr[∗] has 20^10 ≈ 10T cells
◦ most of them have non-zero counts after noise injection
◦ privacy is expensive (computation, storage)
Signal-to-noise: avg. information in each cell is 10M / 10T = 10^−6; avg. noise is 10 (for 𝜀 = 0.1)
Previous solutions suffer from
either the scalability or the signal-to-noise problem
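A quick back-of-the-envelope check of these numbers (illustrative arithmetic only, using the example's parameters):

```python
# 10M tuples, 10 attributes, 20 values per attribute.
n_tuples = 10_000_000
n_cells = 20 ** 10                  # one cell per combination of attribute values

avg_signal = n_tuples / n_cells     # average count mass per cell
noise_scale = 1 / 0.1               # Laplace noise scale on counts for eps = 0.1

print(f"cells: {n_cells:.2e}")          # ~1.02e13, i.e. about 10T
print(f"avg signal: {avg_signal:.1e}")  # ~1e-6 tuples per cell
print(f"noise scale: {noise_scale}")    # 10: noise exceeds signal by ~7 orders
```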
The Algorithm: PrivBayes
Instead of converting the sensitive database 𝐷 into the full-dim tuple distribution, adding noise, and sampling a synthetic database 𝐷∗ from the noisy distribution, PrivBayes follows:

sensitive database 𝐷 → (approximate) → a set of low-dim distributions → (+ noise) → noisy low-dim distributions → (sample) → synthetic database 𝐷∗
The advantages of using low-dimensional distributions
◦ easy to compute
◦ small domain -> high signal density -> robust against noise
But how can we find a set of low-dim distributions that
provides a good approximation to the full distribution?
Bayesian Network
A 5-dimensional database with attributes age, workclass, education, title, and income. A Bayesian network over them specifies the distributions:

Pr[𝑎𝑔𝑒], Pr[𝑤𝑜𝑟𝑘 | 𝑎𝑔𝑒], Pr[𝑒𝑑𝑢 | 𝑎𝑔𝑒], Pr[𝑡𝑖𝑡𝑙𝑒 | 𝑤𝑜𝑟𝑘], Pr[𝑖𝑛𝑐𝑜𝑚𝑒 | 𝑤𝑜𝑟𝑘]
The network approximates the full distribution by the product of its conditionals:

Pr[∗] ≈ Pr[𝑎𝑔𝑒] ⋅ Pr[𝑤𝑜𝑟𝑘 | 𝑎𝑔𝑒] ⋅ Pr[𝑒𝑑𝑢 | 𝑎𝑔𝑒] ⋅ Pr[𝑡𝑖𝑡𝑙𝑒 | 𝑤𝑜𝑟𝑘] ⋅ Pr[𝑖𝑛𝑐𝑜𝑚𝑒 | 𝑤𝑜𝑟𝑘]
A different network over the same five attributes, with up to two parents per node:

Pr[∗] ≈ Pr[𝑎𝑔𝑒] ⋅ Pr[𝑒𝑑𝑢 | 𝑎𝑔𝑒] ⋅ Pr[𝑤𝑜𝑟𝑘 | 𝑎𝑔𝑒, 𝑒𝑑𝑢] ⋅ Pr[𝑡𝑖𝑡𝑙𝑒 | 𝑒𝑑𝑢, 𝑤𝑜𝑟𝑘] ⋅ Pr[𝑖𝑛𝑐𝑜𝑚𝑒 | 𝑤𝑜𝑟𝑘, 𝑡𝑖𝑡𝑙𝑒]
The quality of the Bayesian network decides the quality of
the approximation
Details of PrivBayes
STEP 1: Choose a suitable Bayesian network 𝒩
◦ must be done in a differentially private way
STEP 2: Compute the conditional distributions implied by 𝒩
◦ straightforward to do under differential privacy
◦ inject noise via the Laplace mechanism
STEP 3: Generate synthetic data by sampling from 𝒩
◦ post-processing: no privacy issues
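STEP 3 can be sketched as ancestral sampling from the noisy low-dimensional distributions. The tiny two-attribute network below, its probabilities, and all names are invented purely for illustration:

```python
import random

# Hypothetical noisy distributions of a 1-degree network:
# Pr[age] and Pr[work | age]; the values are made up for this sketch.
pr_age = {"young": 0.4, "old": 0.6}
pr_work_given_age = {
    "young": {"private": 0.7, "gov": 0.3},
    "old":   {"private": 0.4, "gov": 0.6},
}

def sample_tuple(rng):
    """Sample one synthetic tuple, visiting attributes in an order where
    every parent precedes its children (here: age before workclass)."""
    age = rng.choices(list(pr_age), weights=list(pr_age.values()))[0]
    cond = pr_work_given_age[age]
    work = rng.choices(list(cond), weights=list(cond.values()))[0]
    return (age, work)

rng = random.Random(0)
synthetic_db = [sample_tuple(rng) for _ in range(1000)]
```

Because sampling touches only the already-noised distributions, never the sensitive data, it is pure post-processing and consumes no extra privacy budget.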
Finding the optimal 1-degree Bayesian network was solved
in [Chow-Liu'68]. It is the DAG of maximum in-degree 1
that maximizes the sum of the mutual information 𝐼 of its
edges,

Σ_{(𝑋,𝑌): edge} 𝐼(𝑋, 𝑌), where 𝐼(𝑋, 𝑌) = Σ_{𝑥∈𝑋} Σ_{𝑦∈𝑌} Pr[𝑥, 𝑦] log ( Pr[𝑥, 𝑦] / (Pr[𝑥] Pr[𝑦]) ).
⇔ finding the maximum spanning tree, where the
weight of edge (𝑋, 𝑌) is the mutual information 𝐼(𝑋, 𝑌).
Build a 1-degree BN for this database:

Name   | 𝐴 | 𝐵 | 𝐶 | 𝐷
Alan   | 0 | 0 | 0 | 0
Bob    | 0 | 0 | 0 | 0
Cykie  | 1 | 1 | 1 | 0
David  | 0 | 0 | 0 | 0
Eric   | 1 | 1 | 0 | 0
Frank  | 1 | 1 | 0 | 0
George | 0 | 0 | 0 | 0
Helen  | 1 | 1 | 1 | 0
Ivan   | 0 | 0 | 0 | 0
Jack   | 1 | 1 | 0 | 0
Start from a random attribute 𝐴. (Diagram: nodes 𝐴, 𝐵, 𝐶, 𝐷, no edges yet.)
Select the next tree edge by its mutual information. (Diagram: the pairwise marginal distributions of 𝐴 with 𝐵, 𝐶, and 𝐷 are computed from the table; candidates: 𝐴 → 𝐵, 𝐴 → 𝐶, 𝐴 → 𝐷.)
Select the next tree edge by its mutual information. Candidates: 𝐴 → 𝐵 (𝑰 = 𝟏), 𝐴 → 𝐶 (𝑰 = 𝟎.𝟒), 𝐴 → 𝐷 (𝑰 = 𝟎).
The edge 𝐴 → 𝐵 has the maximum mutual information and is added to the tree.
Select the next tree edge by its mutual information. Candidates: 𝐴 → 𝐶, 𝐴 → 𝐷, 𝐵 → 𝐶, 𝐵 → 𝐷 (the displayed 𝐼 values are 0.4, 0.2, 0, and 0); the edge with maximum 𝐼 is chosen.
Select the next tree edge by its mutual information; once all attributes are connected: DONE! (Diagram: the final tree over 𝐴, 𝐵, 𝐶, 𝐷.)
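The walkthrough above can be reproduced by a short non-private sketch (my own illustration, not the paper's code): estimate pairwise mutual information from the toy table, then grow the tree greedily, Prim-style, always adding the highest-𝐼 edge from the tree to a new attribute.

```python
import math

def mutual_info(xs, ys):
    """I(X; Y) in bits, estimated from two aligned columns of values."""
    n = len(xs)
    px, py, pxy = {}, {}, {}
    for x, y in zip(xs, ys):
        px[x] = px.get(x, 0) + 1
        py[y] = py.get(y, 0) + 1
        pxy[(x, y)] = pxy.get((x, y), 0) + 1
    # sum over joint cells: Pr[x,y] * log2(Pr[x,y] / (Pr[x] Pr[y]))
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )

def chow_liu_tree(columns):
    """Greedy maximum spanning tree over attributes with edge weight
    I(X; Y); returns directed edges (parent, child)."""
    attrs = list(columns)
    in_tree, edges = [attrs[0]], []
    while len(in_tree) < len(attrs):
        parent, child = max(
            ((p, c) for p in in_tree for c in attrs if c not in in_tree),
            key=lambda e: mutual_info(columns[e[0]], columns[e[1]]),
        )
        edges.append((parent, child))
        in_tree.append(child)
    return edges

# The toy table: A and B are identical, C overlaps them, D is constant.
table = {
    "A": [0, 0, 1, 0, 1, 1, 0, 1, 0, 1],
    "B": [0, 0, 1, 0, 1, 1, 0, 1, 0, 1],
    "C": [0, 0, 1, 0, 0, 0, 0, 1, 0, 0],
    "D": [0] * 10,
}
tree = chow_liu_tree(table)  # first edge picked: A -> B (I = 1 bit)
```

Since 𝐴 and 𝐵 agree on every row, I(𝐴; 𝐵) = H(𝐴) = 1 bit, and since 𝐷 is constant, I(·; 𝐷) = 0, matching the walkthrough's first and last choices.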
It is NP-hard to learn the optimal 𝑘-degree Bayesian
network when 𝑘 > 1 [JMLR'04].
Most approximation algorithms are too complicated to
be converted into private algorithms.
In our paper, we find a way to extend the Chow-Liu
solution (1-degree) to higher-degree cases.
In this talk, we focus on the 1-degree case for simplicity.
Do it under differential privacy!
(Non-private) select the edge with maximum 𝐼.
(Private) 𝐼 is data-sensitive
-> the best edge is also data-sensitive.
Solution: randomized edge selection!
Define a quality function 𝑞: (𝐷, 𝑒) → ℝ over databases 𝐷 and edges 𝑒, measuring how good edge 𝑒 is as the result of selection, given database 𝐷.

Return 𝑒 with probability Pr[𝑒] ∝ exp( 𝜀 ⋅ 𝑞(𝐷, 𝑒) / (2 Δ𝑞) ),

where the sensitivity Δ𝑞 = max_{(𝐷,𝐷′), 𝑒} |𝑞(𝐷, 𝑒) − 𝑞(𝐷′, 𝑒)| over neighboring databases. The numerator is the information; the denominator sets the noise scale.

Do it under differential privacy! Problem solved?
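The selection rule above can be sketched directly (a minimal illustration, not the paper's implementation; the edge names, scores, and Δ𝑞 = 1 are hypothetical):

```python
import math
import random

def exponential_mechanism(candidates, quality, epsilon, sensitivity, rng):
    """Sample a candidate e with Pr[e] proportional to
    exp(epsilon * q(e) / (2 * sensitivity)): high-quality candidates are
    exponentially more likely, but every candidate keeps some chance."""
    weights = [math.exp(epsilon * quality(e) / (2.0 * sensitivity))
               for e in candidates]
    return rng.choices(candidates, weights=weights)[0]

# Hypothetical candidate edges with mutual-information scores, as in the
# tree-building walkthrough; sensitivity = 1 purely for illustration.
edges = ["A->B", "A->C", "A->D"]
score = {"A->B": 1.0, "A->C": 0.4, "A->D": 0.0}

rng = random.Random(0)
picked = exponential_mechanism(edges, score.get, epsilon=1.0,
                               sensitivity=1.0, rng=rng)
```

Note how the ratio 𝜀 ⋅ 𝑞 / (2Δ𝑞) governs everything: a large sensitivity flattens the weights toward uniform, which is exactly the problem the next slide exposes.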
Select edges with the exponential mechanism:
◦ define 𝑞(edge) = 𝐼(edge)
◦ we prove Δ𝐼 = Θ(log 𝑛 / 𝑛), where 𝑛 = |𝐷|. (Lemma 1)

Pr[𝑒] ∝ exp( 𝜀 ⋅ 𝐼(𝑒) / (2 log 𝑛 / 𝑛) )

NO: the sensitivity (noise scale) log 𝑛 / 𝑛 is too large for the information 𝐼(𝑒).
Function 𝐹: Linear vs. Logarithmic
𝐼 and 𝐹 have a strong positive correlation.

Function | Range (scale of info) | Sensitivity (scale of noise)
𝐼        | Θ(1)                  | Θ(log 𝑛 / 𝑛)
𝐹        | Θ(1)                  | Θ(1/𝑛)

IDEA: define the score 𝐹 to agree with 𝐼 at its maximum values
and interpolate linearly in between.
𝐹 = −(1/2) ⋅ min_{Π: optimal} ‖ Pr[𝑥, 𝑦] − Π ‖₁,

where Π ranges over the "optimal" distributions over 𝑋, 𝑌 that maximize 𝐼(𝑋, 𝑌); −𝐹 measures how far Pr[𝑥, 𝑦] is from the nearest such maximizer.

Range of 𝐹: Θ(1). Sensitivity of 𝐹: Θ(1/𝑛).
𝐹 = −(1/2) ⋅ min_{Π: optimal} ‖ Pr[𝑥, 𝑦] − Π ‖₁

Example (binary 𝑋, 𝑌): the distribution with cells (0.5, 0.2; 0, 0.3) has 𝐼 = 0.4. Its 𝐿₁ distances to the two maximizers (0.5, 0; 0, 0.5) and (0, 0.5; 0.5, 0), each with 𝐼 = 1, are 0.4 and 1.6, so 𝐹 = −(1/2) ⋅ 0.4 = −0.2.
𝐹 and 𝐼 of random distributions
correlation coefficient 𝑟 = 0.9472
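For binary 𝑋 and 𝑌 the set of 𝐼-maximizing distributions is small enough to write 𝐹 down directly; the sketch below assumes that binary case (the general construction is in the paper, and the function name is mine):

```python
def f_score(p):
    """F = -(1/2) * min L1-distance from the 2x2 joint distribution p to
    an I-maximizing distribution. By the maximizer characterization
    (uniform marginal on X, at most one x per y with positive mass), for
    binary X, Y these are [[0.5, 0], [0, 0.5]] and [[0, 0.5], [0.5, 0]]."""
    maximizers = [
        [[0.5, 0.0], [0.0, 0.5]],
        [[0.0, 0.5], [0.5, 0.0]],
    ]
    def l1(a, b):
        return sum(abs(a[i][j] - b[i][j]) for i in range(2) for j in range(2))
    return -0.5 * min(l1(p, m) for m in maximizers)

# The slide's example distribution: distances 0.4 and 1.6, so F = -0.2.
example = [[0.5, 0.2], [0.0, 0.3]]
```

Each cell of the joint distribution changes by at most 1/𝑛 between neighboring databases, and the 𝐿₁ norm is 1-Lipschitz in each cell, which is the intuition behind the Θ(1/𝑛) sensitivity.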
Experiments
We use four datasets in our experiments
◦ Adult, NLTCS, TPC-E, BR2000
Adult dataset
◦ census data of 45,222 individuals
◦ 15 attributes: age, workclass, education, marital status, etc.
◦ tuple domain size (full-dimensional): about 2^52
Query: all 2-way marginals; all 3-way marginals. (Result plots omitted.)
Query: build 4 classifiers (Adult, 𝑌 = gender; Adult, 𝑌 = education). (Result plots omitted.)
Differential privacy can be applied effectively for data release
Key ideas of the solution:
◦ Bayesian networks for dimension reduction
◦ a carefully designed linear quality function for the exponential mechanism
Many open problems remain:
◦ extend to other forms of data: graph data, mobility data
◦ obtain alternate (workable) privacy definitions
Thanks!
Privacy, accuracy, and consistency too: a holistic solution to
contingency table release [PODS’07]
◦ incurs an exponential running time
◦ only optimized for low-dimensional marginals
Differentially private publication of sparse data [ICDT’12]
◦ achieves scalability, but no help for signal-to-noise problem
Differentially private spatial decompositions [ICDE’12]
◦ coarsens the histogram H to control the number of cells
◦ has some limits, e.g., targets range queries over ordinal domains
Assume that |𝑑𝑜𝑚(𝑋)| ≤ |𝑑𝑜𝑚(𝑌)|. A distribution
Pr[𝑥, 𝑦] maximizes the mutual information between
𝑋 and 𝑌 if and only if
◦ Pr[𝑥] = 1/|𝑑𝑜𝑚(𝑋)|, for any 𝑥 ∈ 𝑑𝑜𝑚(𝑋);
◦ for each 𝑦 ∈ 𝑑𝑜𝑚(𝑌), there is at most one 𝑥 ∈
𝑑𝑜𝑚(𝑋) with Pr[𝑥, 𝑦] > 0.
Intuition: two score functions for real 𝑥 ⟹ log(1 + 𝑥) and 𝑥
neighboring databases ⟹ 𝑥 and 𝑥 + Δ𝑥
sensitivity (noise) ⟹ maximum of the derivative: ∞ for the logarithmic score, 1 for the linear score
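A numeric illustration of that derivative argument (my own check, not from the slides): near the bottom of its domain the logarithmic score changes by a huge amount for a tiny shift Δ𝑥, while the linear score always changes by exactly Δ𝑥.

```python
import math

def log_score(x):
    """Logarithmic score log(1 + x); its derivative 1/(1 + x) is
    unbounded as x approaches -1."""
    return math.log(1.0 + x)

def lin_score(x):
    """Linear score x; its derivative is 1 everywhere."""
    return x

dx = 0.01                 # shift caused by moving to a neighboring database
x = -1.0 + 1e-9           # near the edge of the log's domain

log_change = abs(log_score(x + dx) - log_score(x))  # very large here
lin_change = abs(lin_score(x + dx) - lin_score(x))  # essentially dx
```

This is why a linearly-interpolating score like 𝐹 admits a small sensitivity bound while the log-based 𝐼 does not.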
Interactive setting: the user sends a query 𝑓, with privacy budget 𝜀, to a differentially private algorithm over the database and receives a noisy answer 𝑂.
1. the risk of a privacy breach accumulates after answering multiple queries
2. it requires a specific DP algorithm for every particular query

Non-interactive setting (private data release): the privacy budget 𝜀 is spent once to publish synthetic data; the user then runs any query 𝑓 on it directly for a noisy answer 𝑂.
Reusability: only access the sensitive data once.
Generality: support most queries.