Slides - Jun Zhang

Jun Zhang, Graham Cormode, Cecilia M. Procopiuc,
Divesh Srivastava, Xiaokui Xiao

The Problem: Private Data Release
◦ Differential Privacy
◦ Challenges

The Algorithm: PrivBayes
◦ Bayesian Network
◦ Details of PrivBayes

Function 𝐹: Linear vs. Logarithmic

Experiments

The Problem: Private Data Release
[Figure: a company or institute holds a sensitive database D. Publishing D directly exposes it to an adversary; instead, the company releases a synthetic database D* with similar properties, which still supports accurate inference.]

How can we design such a private data release algorithm?

Differential Privacy

Definition of ε-Differential Privacy
◦ A randomized data release algorithm A satisfies ε-differential privacy if, for any two neighboring datasets D, D′ and for any possible synthetic data D*,

    Pr[A(D) = D*] ≤ exp(ε) · Pr[A(D′) = D*]
Neighboring datasets differ in a single tuple (here, Frank's record):

D:                          D′:
Name   | Has cancer?        Name   | Has cancer?
Alice  | Yes                Alice  | Yes
Bob    | No                 Bob    | No
Chris  | Yes                Chris  | Yes
Denise | Yes                Denise | Yes
Eric   | No                 Eric   | No
Frank  | Yes                Frank  | No

A general approach to achieving differential privacy is to inject Laplace noise into the output, in order to mask the impact of any single individual!
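As a concrete illustration (a sketch, not code from the paper): a count query of sensitivity 1 can be answered under ε-differential privacy by adding Laplace(1/ε) noise.

```python
import random

def laplace_mechanism(true_answer, sensitivity, epsilon):
    """Release an answer under eps-DP by adding Laplace(sensitivity/eps) noise."""
    scale = sensitivity / epsilon
    # The difference of two i.i.d. Exp(1) draws is Laplace(0, 1).
    noise = scale * (random.expovariate(1.0) - random.expovariate(1.0))
    return true_answer + noise

# "How many patients have cancer?" changes by at most 1 when a single
# record changes, so its sensitivity is 1.
noisy_count = laplace_mechanism(4, sensitivity=1, epsilon=0.1)
```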

More details are in the Preliminaries part of the paper.

Goal: design a data release algorithm with a differential privacy guarantee.

Challenges

To build synthetic data, we need to understand the tuple distribution Pr[*] of the sensitive data.

[Pipeline: sensitive database D → (convert) full-dimensional tuple distribution → (+ noise) noisy distribution → (sample) synthetic database D*]


Example: the database has 10M tuples, 10 attributes (dimensions), and 20 values per attribute.

Scalability: the full distribution Pr[*] has 20^10 ≈ 10T cells
◦ most of them have non-zero counts after noise injection
◦ privacy is expensive (computation, storage)

Signal-to-noise: the avg. information in each cell is 10M / 10T = 10^-6, while the avg. noise is 10 (for ε = 0.1)

Previous solutions suffer from either the scalability or the signal-to-noise problem.
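The back-of-the-envelope numbers in this example can be checked directly:

```python
tuples = 10_000_000        # 10M rows
attrs, values = 10, 20     # 10 attributes with 20 values each

cells = values ** attrs    # cells in the full-dimensional distribution
avg_info = tuples / cells  # average count mass per cell

# cells is about 1e13 ("10T"); avg_info is about 1e-6, which is dwarfed
# by Laplace noise of average magnitude ~10 when eps = 0.1.
```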

The Algorithm: PrivBayes
Instead of converting D into the full-dimensional distribution and injecting noise there, PrivBayes works with low-dimensional distributions:

[Pipeline: sensitive database D → (approximate) a set of low-dim distributions → (+ noise) noisy low-dim distributions → (convert) an approximate noisy distribution → (sample) synthetic database D*]

The advantages of using low-dimensional distributions:
◦ easy to compute
◦ small domain → high signal density → robust against noise

But how do we find a set of low-dimensional distributions that provides a good approximation to the full distribution?

Bayesian Network

A 5-dimensional database with attributes age, workclass, education, title, and income:

[Bayesian network: age → workclass, age → education, workclass → title, workclass → income, with factors Pr[age], Pr[work | age], Pr[edu | age], Pr[title | work], Pr[income | work]]

For the 5-dimensional database, a 1-degree network gives

    Pr[*] ≈ Pr[age] · Pr[work | age] · Pr[edu | age] · Pr[title | work] · Pr[income | work]

while a 2-degree network gives

    Pr[*] ≈ Pr[age] · Pr[edu | age] · Pr[work | age, edu] · Pr[title | edu, work] · Pr[income | work, title]

The quality of the Bayesian network decides the quality of the approximation.
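To make the factorization concrete, here is a minimal sketch on a toy binary domain; all probability tables are illustrative placeholders, not values from the paper.

```python
from itertools import product

# Illustrative conditional tables over binary attributes (values 0/1).
pr_age = [0.3, 0.7]
pr_work_age = [[0.6, 0.4], [0.2, 0.8]]       # pr_work_age[age][work]
pr_edu_age = [[0.5, 0.5], [0.1, 0.9]]        # pr_edu_age[age][edu]
pr_title_work = [[0.7, 0.3], [0.4, 0.6]]     # pr_title_work[work][title]
pr_income_work = [[0.8, 0.2], [0.3, 0.7]]    # pr_income_work[work][income]

def pr(age, work, edu, title, income):
    # The 1-degree factorization of the joint distribution.
    return (pr_age[age] * pr_work_age[age][work] * pr_edu_age[age][edu]
            * pr_title_work[work][title] * pr_income_work[work][income])

total = sum(pr(*t) for t in product((0, 1), repeat=5))
# total is 1.0 up to floating point: the factors define a valid joint.
# Storage: 5 small tables instead of 2**5 full-dimensional cells.
```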

Details of PrivBayes

STEP 1: Choose a suitable Bayesian network 𝒩
◦ must be done in a differentially private way

STEP 2: Compute the conditional distributions implied by 𝒩
◦ straightforward to do under differential privacy
◦ inject noise via the Laplace mechanism

STEP 3: Generate synthetic data by sampling from 𝒩
◦ post-processing: no privacy issues
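STEP 3 is plain ancestral sampling: draw each attribute given its parent. A minimal sketch with an illustrative two-attribute network and made-up noisy tables (not the paper's values):

```python
import random

pr_age = {"young": 0.4, "old": 0.6}            # noisy Pr[age]
pr_work = {                                    # noisy Pr[work | age]
    "young": {"private": 0.7, "gov": 0.3},
    "old":   {"private": 0.5, "gov": 0.5},
}

def draw(dist):
    """Draw one key from a {value: probability} table."""
    return random.choices(list(dist), weights=list(dist.values()))[0]

def sample_tuple():
    age = draw(pr_age)          # roots first...
    work = draw(pr_work[age])   # ...then children, conditioned on parents
    return age, work

synthetic = [sample_tuple() for _ in range(1000)]
```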

Finding the optimal 1-degree Bayesian network was solved by [Chow-Liu'68]. It is a DAG of maximum in-degree 1 that maximizes the sum of the mutual information I over its edges,

    Σ_{(X,Y): edge} I(X, Y),

where

    I(X, Y) = Σ_{y∈Y} Σ_{x∈X} Pr[x, y] · log( Pr[x, y] / (Pr[x] · Pr[y]) ).

Finding the optimal 1-degree Bayesian network is equivalent to finding the maximum spanning tree in which the weight of edge (X, Y) is the mutual information I(X, Y).

Build a 1-degree BN for this database:

Name   | A | B | C | D
Alan   | 0 | 0 | 0 | 0
Bob    | 0 | 0 | 0 | 0
Cykie  | 1 | 1 | 1 | 0
David  | 0 | 0 | 0 | 0
Eric   | 1 | 1 | 0 | 0
Frank  | 1 | 1 | 0 | 0
George | 0 | 0 | 0 | 0
Helen  | 1 | 1 | 1 | 0
Ivan   | 0 | 0 | 0 | 0
Jack   | 1 | 1 | 0 | 0

Start from a random attribute, A.

[Figure: isolated nodes A, B, C, D]

Select the next tree edge by its mutual information.

[Figure: the 2×2 joint distributions of (A,B), (A,C), and (A,D), computed from the database: Pr[A,B] puts 0.5 on (0,0) and 0.5 on (1,1); Pr[A,C] puts 0.5 on (0,0), 0.3 on (1,0), and 0.2 on (1,1); Pr[A,D] puts 0.5 on (0,0) and 0.5 on (1,0). Candidates: A→B, A→C, A→D]

Select the next tree edge by its mutual information.

candidates: A→B (I = 1), A→C (I = 0.4), A→D (I = 0) → add A→B

Select the next tree edge by its mutual information.

[Figure: edge A→B added to the tree; C and D still unconnected]

Select the next tree edge by its mutual information.

candidates: A→C, A→D, B→C, B→D, with the maximum mutual information (I = 0.4) winning

Select the next tree edge by its mutual information.

[Figure: the completed spanning tree over A, B, C, D] DONE!
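The walkthrough above can be reproduced with a short (non-private) Prim-style sketch. The mutual-information values are computed directly from the toy table, so they may differ slightly from the rounded numbers on the slides.

```python
import math
from collections import Counter

# Toy database from the slides: columns A, B, C, D.
rows = [
    (0, 0, 0, 0), (0, 0, 0, 0), (1, 1, 1, 0), (0, 0, 0, 0), (1, 1, 0, 0),
    (1, 1, 0, 0), (0, 0, 0, 0), (1, 1, 1, 0), (0, 0, 0, 0), (1, 1, 0, 0),
]
n = len(rows)

def mutual_info(i, j):
    """I(X_i, X_j) in bits, estimated from the rows."""
    pxy = Counter((r[i], r[j]) for r in rows)
    px = Counter(r[i] for r in rows)
    py = Counter(r[j] for r in rows)
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Grow a maximum spanning tree starting from attribute A (index 0).
in_tree, edges = {0}, []
while len(in_tree) < 4:
    u, v = max(((u, v) for u in in_tree for v in range(4) if v not in in_tree),
               key=lambda e: mutual_info(*e))
    edges.append((u, v))
    in_tree.add(v)

# A and B agree on every row, so the first edge chosen is A -> B (I = 1 bit).
```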

It is NP-hard to learn the optimal k-degree Bayesian network when k > 1 [JMLR'04].

Most approximation algorithms are too complicated to be converted into private algorithms.

In our paper, we find a way to extend the Chow-Liu solution (1-degree) to higher-degree cases.

In this talk, we focus on the 1-degree case for simplicity.

Do it under Differential Privacy!

(Non-private) select the edge with maximum I
(Private) I is data-sensitive → the best edge is also data-sensitive

Solution: randomized edge selection!
◦ define a quality function q: (database D, edge e) → ℝ, measuring how good edge e is as the result of the selection, given database D
◦ return e with probability

    Pr[e] ∝ exp( (ε / (2 Δq)) · q(D, e) )     (info: q(D, e); noise scale: Δq)

  where the sensitivity Δq = max over neighboring D, D′ and edges e of |q(D, e) − q(D′, e)|.
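The randomized selection above can be sketched as follows; the candidate edges and scores are illustrative, and real code must also derive the sensitivity Δq from the chosen quality function.

```python
import math
import random

def exponential_mechanism(candidates, quality, epsilon, sensitivity):
    """Sample a candidate with Pr[c] proportional to exp(eps*q(c)/(2*dq))."""
    scores = {c: quality(c) for c in candidates}
    top = max(scores.values())
    # Shift by the max score for numerical stability; the sampling
    # distribution is unchanged (a common constant factor cancels).
    weights = {c: math.exp(epsilon * (s - top) / (2 * sensitivity))
               for c, s in scores.items()}
    r = random.uniform(0, sum(weights.values()))
    for c, w in weights.items():
        r -= w
        if r <= 0:
            return c
    return c  # guard against floating-point leftovers

scores = {"A->B": 1.0, "A->C": 0.4, "A->D": 0.0}  # I values from the example
edge = exponential_mechanism(scores, scores.get, epsilon=1.0, sensitivity=0.1)
```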

Do it under Differential Privacy! Problem solved? NO.

Select edges with the exponential mechanism:
◦ define q(edge) = I(edge)
◦ we prove Δ(I) = Θ(log n / n), where n = |D| (Lemma 1)

    Pr[e] ∝ exp( (ε / (2 log n / n)) · I(e) )

The sensitivity (noise scale) log n / n is too large for the information I(e).

Function 𝐹: Linear vs. Logarithmic
𝐼 and 𝐹 have a strong positive correlation.

Function | Range (scale of info) | Sensitivity (scale of noise)
𝐼        | Θ(1)                  | Θ(log n / n)
𝐹        | Θ(1)                  | Θ(1/n)
IDEA: define a score that agrees with 𝐼 at its maximum values and interpolates linearly in between:

    F(X, Y) = −(1/2) · min_{Π: optimal} ‖ Pr[X, Y] − Π ‖₁

where Π ranges over the "optimal" distributions over X, Y that maximize I(X, Y), and the L1 distance measures how far Pr[X, Y] is from the nearest such Π.

Range of 𝐹: Θ(1). Sensitivity of 𝐹: Θ(1/n).
Example: for the distribution that puts 0.5, 0.2, and 0.3 on three cells (I = 0.4), the L1 distances to the two optimal 2×2 distributions (each with I = 1) are 0.4 and 1.6, so F = −(1/2) · 0.4 = −0.2.

[Scatter plot: 𝐹 vs. 𝐼 over random distributions; correlation coefficient r = 0.9472]
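For binary X and Y, the "optimal" distributions characterized by the lemma later in the deck are the two permutation-diagonal 2×2 distributions, so the worked example can be checked in a few lines (a sketch; the cell layout of the example is assumed):

```python
# The two 2x2 distributions maximizing I(X, Y): uniform marginal on X,
# and each y tied to at most one x.
OPTIMAL = ([[0.5, 0.0], [0.0, 0.5]],
           [[0.0, 0.5], [0.5, 0.0]])

def F(p):
    """F = -1/2 * min L1 distance from p to an optimal distribution."""
    l1 = lambda q: sum(abs(p[i][j] - q[i][j]) for i in range(2) for j in range(2))
    return -0.5 * min(l1(q) for q in OPTIMAL)

p = [[0.5, 0.0], [0.2, 0.3]]   # the slide's example: I = 0.4
# distances to the two optimal distributions: 0.4 and 1.6 -> F = -0.2
```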

Experiments
We use four datasets in our experiments
◦ Adult, NLTCS, TPC-E, BR2000

Adult dataset
◦ census data of 45,222 individuals
◦ 15 attributes: age, workclass, education, marital status, etc.
◦ tuple domain size (full-dimensional): about 2^52

[Plots: accuracy on all 2-way marginals and all 3-way marginals; accuracy of 4 classifiers built on the synthetic data (Adult, Y = gender; Adult, Y = education)]

Differential privacy can be applied effectively to data release.

Key ideas of the solution:
◦ Bayesian networks for dimension reduction
◦ a carefully designed linear quality function for the exponential mechanism

Many open problems remain:
◦ extend to other forms of data: graph data, mobility data
◦ obtain alternative (workable) privacy definitions

Thanks!

Privacy, accuracy, and consistency too: a holistic solution to contingency table release [PODS'07]
◦ incurs an exponential running time
◦ only optimized for low-dimensional marginals

Differentially private publication of sparse data [ICDT'12]
◦ achieves scalability, but does not help with the signal-to-noise problem

Differentially private spatial decompositions [ICDE'12]
◦ coarsens the histogram H to control the number of cells
◦ has some limits, e.g., range queries, ordinal domains

Assume that |dom(X)| ≤ |dom(Y)|. A distribution Pr[x, y] maximizes the mutual information between X and Y if and only if
◦ Pr[x] = 1/|dom(X)| for every x ∈ dom(X);
◦ for each y ∈ dom(Y), there is at most one x ∈ dom(X) with Pr[x, y] > 0.

An analogy for the sensitivity gap:
◦ two score functions of a real x ⟹ log(1 + x) and x
◦ neighboring databases ⟹ x and x + Δx
◦ sensitivity (noise) ⟹ the maximum of the derivative: ∞ for the logarithmic score, 1 for the linear score
[Diagram: interactive setting — a user issues query f with privacy budget ε; a differentially private algorithm on the database returns a noisy answer O]
1. The risk of a privacy breach accumulates after answering multiple queries.
2. It requires a specific DP algorithm for every particular query.

[Diagram: private data release — the privacy budget ε is spent once to publish synthetic data; the user then evaluates query f on the synthetic data and gets answer O]
Reusability: the sensitive data is accessed only once.
Generality: most queries are supported.