
Publishing Set-Valued Data via
Differential Privacy
Rui Chen, Concordia University
Noman Mohammed, Concordia University
Benjamin C. M. Fung, Concordia University
Bipin C. Desai, Concordia University
Li Xiong, Emory University
VLDB 2011
Outline
 Introduction
 Preliminaries
 Sanitization algorithm
 Experimental results
 Conclusions
Introduction
• The problem: non-interactive set-valued data publication
under differential privacy
• Typical set-valued data: transaction data, web search queries
Introduction
• Set-valued data refers to data in which each record owner
is associated with a set of items drawn from an item universe.
TID   Items
t1    I1, I2, I3, I4
t2    I2, I4
t3    I2
t4    I1, I2
t5    I2
t6    I1
t7    I1, I2, I3, I4
t8    I2, I3, I4
Introduction
• Existing works [1, 2, 3, 4, 5, 6, 7] on publishing set-valued
data are based on partition-based privacy models [8].
• They provide insufficient privacy protection:
– Composition attack [8]
– deFinetti attack [9]
– Foreground knowledge attack [10]
• They are vulnerable to adversaries with background knowledge.
Introduction
• Differential privacy is independent of an adversary's
background knowledge and computational power (with
exceptions [11]).
• The outcome of any analysis should not overly depend
on a single data record.
• Existing differentially private data publishing approaches
are inadequate in both utility and scalability for our problem.
Introduction
• Problems of data-independent publishing approaches:
[Figure: for the universe I = {I1, I2, I3}, the output domain
spans every itemset: I1, I2, I3, {I1, I2}, {I1, I3}, {I2, I3},
{I1, I2, I3}]
• Scalability: O(2^n) for a universe of n items
• Utility: noise accumulates exponentially
Outline
 Introduction
 Preliminaries
 Sanitization algorithm
 Experimental results
 Conclusions
Preliminaries
• Context-free taxonomy tree
• Each internal node is the set of its leaves, not
necessarily a semantic generalization of them.
Preliminaries
• Differential privacy [12]
• D and D' are neighbours if they differ on at most one record.
A non-interactive privacy mechanism A gives ε-differential
privacy if for all neighbours D and D', and for any possible
sanitized database D* ∈ Range(A),
Pr[A(D) = D*] ≤ exp(ε) × Pr[A(D') = D*].
Preliminaries
• Laplace mechanism [12]
The global sensitivity of a function f is the largest change in
its output over any pair of neighbouring databases:
Δf = max_{neighbours D, D'} ||f(D) − f(D')||₁.
The Laplace mechanism returns f(D) + Laplace(Δf/ε).
For example, for a single counting query Q over a dataset
D (ΔQ = 1), returning Q(D) + Laplace(1/ε) gives ε-differential
privacy.
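As a concrete illustration, here is a minimal Python sketch of the Laplace mechanism for a single counting query (numpy assumed; the function and dataset names are illustrative, not the authors' code):

```python
import numpy as np

def laplace_count(data, predicate, epsilon):
    """Answer a counting query under epsilon-differential privacy.

    A counting query has global sensitivity 1 (adding or removing one
    record changes the count by at most 1), so Laplace noise with
    scale 1/epsilon suffices."""
    true_count = sum(1 for record in data if predicate(record))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# Example: noisy number of transactions containing both I1 and I2.
D = [{"I1", "I2", "I3", "I4"}, {"I2", "I4"}, {"I2"}, {"I1", "I2"}]
print(laplace_count(D, lambda t: {"I1", "I2"} <= t, epsilon=1.0))
```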
Preliminaries
• Exponential mechanism [13]
Given a utility function q : (D × R) → ℝ for a database
instance D, the mechanism A,
A(D, q) = { return r with probability ∝ exp(ε · q(D, r) / (2Δq)) },
gives ε-differential privacy, where Δq is the sensitivity of q.
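A hedged sketch of the exponential mechanism over a finite candidate set (Python; the quality function and candidate list are caller-supplied and illustrative):

```python
import numpy as np

def exponential_mechanism(data, candidates, quality, epsilon, sensitivity):
    """Sample a candidate r with probability proportional to
    exp(epsilon * q(data, r) / (2 * sensitivity))."""
    scores = np.array([quality(data, r) for r in candidates], dtype=float)
    logits = epsilon * scores / (2.0 * sensitivity)
    # Subtracting the max before exponentiating avoids overflow and
    # leaves the sampling distribution unchanged.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return candidates[np.random.choice(len(candidates), p=probs)]
```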
Preliminaries
• Composition properties [14]
Sequential composition: a sequence of computations, each
εi-differentially private, over the same data provides
(∑i εi)-differential privacy.
Parallel composition: the same computations over disjoint
subsets of the data provide max(εi)-differential privacy.
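The two properties amount to simple budget arithmetic; a tiny illustrative sketch:

```python
def sequential_budget(epsilons):
    # Computations over the SAME data: privacy losses add up.
    return sum(epsilons)

def parallel_budget(epsilons):
    # Computations over DISJOINT subsets: only the largest loss counts.
    return max(epsilons)

# Three 0.1-budget queries on one dataset cost 0.3 in total, while one
# 0.1-budget query on each of three disjoint partitions costs only 0.1.
print(sequential_budget([0.1, 0.1, 0.1]), parallel_budget([0.1, 0.1, 0.1]))
```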
Preliminaries
• Utility metrics
For a given itemset I' ⊆ I, a counting query Q over a
dataset D is defined as Q(D) = |{t ∈ D : I' ⊆ t}|.
A privacy mechanism A is (α, δ)-useful if, with probability
1 − δ, for every counting query Q and every dataset D,
|Q(D*) − Q(D)| ≤ α, where D* = A(D) [15].
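The counting query itself is straightforward set containment; a small sketch using the example table from the introduction (illustrative):

```python
def counting_query(D, itemset):
    """Q(D) = |{t in D : itemset is a subset of t}|."""
    itemset = set(itemset)
    return sum(1 for t in D if itemset <= set(t))

D = [{"I1", "I2", "I3", "I4"}, {"I2", "I4"}, {"I2"}, {"I1", "I2"},
     {"I2"}, {"I1"}, {"I1", "I2", "I3", "I4"}, {"I2", "I3", "I4"}]
assert counting_query(D, {"I2", "I4"}) == 4  # t1, t2, t7, t8
```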
Outline
 Introduction
 Preliminaries
 Sanitization algorithm
 Experimental results
 Conclusions
Sanitization Algorithm
• Top-down partitioning
TID   Items
t1    I1, I2, I3, I4
t2    I2, I4
t3    I2
t4    I1, I2
t5    I2
t6    I1
t7    I1, I2, I3, I4
t8    I2, I3, I4
• Generalize all records to a single partition
• Keep partitioning non-empty partitions
until leaf partitions are reached
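A hedged Python skeleton of this top-down loop, under simplifying assumptions: a toy two-level taxonomy, a fixed pruning threshold (the paper derives it from the noise standard deviation; see the backup slides), and, unlike the paper's algorithm, only non-empty sub-partitions are enumerated (see the sub-partition generation slide below):

```python
import numpy as np

# Toy taxonomy matching the slides: I{1,2,3,4} -> I{1,2}, I{3,4} -> items.
TAXONOMY = {"I{1,2,3,4}": ("I{1,2}", "I{3,4}"),
            "I{1,2}": ("I1", "I2"),
            "I{3,4}": ("I3", "I4")}

def items_under(node):
    kids = TAXONOMY.get(node)
    if kids is None:
        return {node}
    return set().union(*(items_under(k) for k in kids))

def noisy(n, eps):
    return n + np.random.laplace(scale=1.0 / eps)

def partition(records, cut, eps_step, eps_leaf, threshold=2.0):
    """records: item sets; cut: frozenset of taxonomy nodes generalizing
    every record in the partition.  Returns {leaf cut: noisy size}."""
    internal = [u for u in cut if u in TAXONOMY]
    if not internal:
        # Leaf partition: publish its size with the reserved budget.
        return {cut: noisy(len(records), eps_leaf)}
    u = internal[0]  # specialize one internal node of the cut
    groups = {}
    for t in records:
        # Replace u by the set of its children actually present in t.
        present = frozenset(c for c in TAXONOMY[u] if t & items_under(c))
        groups.setdefault((cut - {u}) | present, []).append(t)
    out = {}
    for sub_cut, sub in groups.items():
        # A noisy size decides whether a sub-partition is worth expanding.
        if noisy(len(sub), eps_step) >= threshold:
            out.update(partition(sub, sub_cut, eps_step, eps_leaf, threshold))
    return out

D = [{"I1", "I2", "I3", "I4"}, {"I2", "I4"}, {"I2"}, {"I1", "I2"},
     {"I2"}, {"I1"}, {"I1", "I2", "I3", "I4"}, {"I2", "I3", "I4"}]
print(partition(D, frozenset({"I{1,2,3,4}"}), eps_step=0.25, eps_leaf=0.5))
```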
Sanitization Algorithm
• Privacy budget allocation
• We reserve B/2 to obtain noisy sizes of leaf partitions
and the rest B/2 to guide the partitioning.
• Assign less budget to more general partitions and
more budget to more specific partitions.
Sanitization Algorithm
• Privacy budget allocation
A hierarchy cut needs at most
∑_{ui ∈ cut} |InternalNodes(ui, H)|
partition operations to reach leaf partitions, where H is the
taxonomy tree.
Example: {I{1,2}, I{3,4}} needs at most two partition
operations to reach leaf partitions.
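This bound is easy to compute by counting internal taxonomy nodes; a small sketch reusing the toy TAXONOMY from the partitioning skeleton above (illustrative):

```python
TAXONOMY = {"I{1,2,3,4}": ("I{1,2}", "I{3,4}"),
            "I{1,2}": ("I1", "I2"), "I{3,4}": ("I3", "I4")}

def internal_nodes(node):
    """Count non-leaf taxonomy nodes in the subtree rooted at node."""
    kids = TAXONOMY.get(node)
    return 0 if kids is None else 1 + sum(internal_nodes(k) for k in kids)

def max_partition_ops(cut):
    """Worst-case number of partition operations for a hierarchy cut."""
    return sum(internal_nodes(u) for u in cut)

assert max_partition_ops({"I{1,2,3,4}"}) == 3        # root cut
assert max_partition_ops({"I{1,2}", "I{3,4}"}) == 2  # the slide's example
```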
Sanitization Algorithm
• Privacy budget allocation (example)
For the taxonomy above, the root cut needs at most three
partition operations, so the first operation receives
B/2/3 = B/6. The resulting cut needs at most two more
operations, so the next one receives (B/2 − B/6)/2 = B/6.
A leaf partition reached at that point gets the unspent B/6
plus the reserved B/2, i.e., B/6 + B/2 = 2B/3, for its noisy size.
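In code form, this adaptive allocation reproduces the slide's numbers (a sketch; the equal per-operation split of the remaining budget is our reading of the example):

```python
def step_budget(remaining, max_ops_left):
    # Each partition operation receives an equal share of the remaining
    # partitioning budget, relative to the worst-case operations left.
    return remaining / max_ops_left

B = 1.0
part = B / 2                      # half of B guides the partitioning
e1 = step_budget(part, 3)         # first operation: (B/2)/3 = B/6
e2 = step_budget(part - e1, 2)    # next operation: (B/2 - B/6)/2 = B/6
leaf = B / 2 + (part - e1 - e2)   # leaf: reserved B/2 + unspent B/6 = 2B/3
print(e1, e2, leaf)               # ~0.1667, ~0.1667, ~0.6667
```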
Sanitization Algorithm
• Sub-partition generation
• For a non-leaf partition, we need to consider all
possible sub-partitions, including empty ones, to satisfy
differential privacy.
• Efficient implementation: handling empty and non-empty
partitions separately (inspired by [16]).
Outline
 Introduction
 Preliminaries
 Sanitization algorithm
 Experimental results
 Conclusions
Experiments
• Two real-life set-valued datasets are used.
• MSNBC is publicly available at the UCI machine learning
repository (http://archive.ics.uci.edu/ml/index.html).
• STM is provided by the Société de transport de Montréal
(STM) (http://www.stm.info).
Experiments
• Average relative error vs. privacy budget
[Figures: average relative error under privacy budgets B = 0.5, 0.75, 1.0]
Experiments
• Utility for frequent itemset mining
[Figures: frequent itemset mining utility under privacy budgets B = 0.5, 0.75, 1.0]
Experiments
• Scalability: O(|D|*|I|)
[Figures: runtime vs. |D| and runtime vs. |I|]
Outline
 Introduction
 Preliminaries
 Sanitization algorithm
 Experimental results
 Conclusions
Conclusions
• Differential privacy can be successfully applied to
non-interactive set-valued data publishing with guaranteed
utility.
• Differential privacy can be achieved by data-dependent
solutions with improved efficiency and accuracy.
• The general idea of data-dependent solutions applies to
other types of data, for example, relational data [17] and
trajectory data [18].
References
[1] J. Cao, P. Karras, C. Raissi, and K.-L. Tan. ρ–uncertainty inference proof transaction
anonymization. In VLDB, pp. 1033–1044, 2010.
[2] G. Ghinita, Y. Tao, and P. Kalnis. On the anonymization of sparse high-dimensional
data. In ICDE, pp. 715–724, 2008.
[3] Y. He and J. F. Naughton. Anonymization of set-valued data via top-down, local
generalization. In VLDB, pp. 934–945, 2009.
[4] M. Terrovitis, N. Mamoulis, and P. Kalnis. Privacy-preserving anonymization of
set-valued data. In VLDB, pp. 115–125, 2008.
[5] M. Terrovitis, N. Mamoulis, and P. Kalnis. Local and global recoding methods for
anonymizing set-valued data. VLDBJ, 20(1):83–106, 2011.
[6] Y. Xu, B. C. M. Fung, K. Wang, A. W. C. Fu, and J. Pei. Publishing sensitive
transactions for itemset utility. In ICDM, pp. 1109–1114, 2008.
[7] Y. Xu, K. Wang, A. W. C. Fu, and P. S. Yu. Anonymizing transaction databases for
publication. In SIGKDD, pp. 767–775, 2008.
References
[8] S. R. Ganta, S. P. Kasiviswanathan, and A. Smith. Composition attacks and auxiliary
information in data privacy. In SIGKDD, pp. 265-273, 2008.
[9] D. Kifer. Attacks on privacy and deFinetti’s theorem. In SIGMOD, pp. 127–138,
2009.
[10] R. C. W. Wong, A. Fu, K. Wang, P. S. Yu, and J. Pei. Can the utility of anonymized
data be used for privacy breaches? ACM Transactions on Knowledge Discovery
from Data, to appear.
[11] D. Kifer and A. Machanavajjhala. No free lunch in data privacy. In SIGMOD, 2011.
[12] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in
private data analysis. In Theory of Cryptography Conference, pp. 265–284, 2006.
[13] F. McSherry and K. Talwar. Mechanism design via differential privacy. In FOCS, pp.
94–103, 2007.
[14] F. McSherry. Privacy integrated queries: An extensible platform for
privacy-preserving data analysis. In SIGMOD, pp. 19–30, 2009.
[15] A. Blum, K. Ligett, and A. Roth. A learning theory approach to non-interactive
database privacy. In STOC, pp.609–618, 2008.
References
[16] G. Cormode, M. Procopiuc, D. Srivastava, and T. T. L. Tran. Differentially Private
Publication of Sparse Data. In CoRR, 2011.
[17] N. Mohammed, R. Chen, B. C. M. Fung, and P. S. Yu. Differentially private data
release for data mining. In SIGKDD, 2011.
[18] R. Chen, B. C. M. Fung, and B. C. Desai. Differentially private trajectory data
publication. ICDE, under review, 2012.
Thank you!
Q&A
Backup Slides
Lower Bound Results
• In the interactive setting, only a limited number of queries
could be answered; otherwise, an adversary would be
able to precisely reconstruct almost the entire original
database.
• In the non-interactive setting, one can only guarantee the
utility of restricted classes of queries.
Threshold Selection
• We design the threshold as a function of the standard
deviation of the noise and the height of a partition's
hierarchy cut:
threshold(p) = √2 · C₂ · height(p.cut) / p.ε̃,
where √2/p.ε̃ is the standard deviation of Laplace(1/p.ε̃)
noise and C₂ is a constant.
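A sketch of this threshold as we read the formula above (the √2/ε̃ factor is the standard deviation of Laplace(1/ε̃) noise; C₂ is assumed to be a tunable constant):

```python
import math

def threshold(height, eps_tilde, C2=1.0):
    # sqrt(2) / eps_tilde is the standard deviation of the Laplace noise
    # added with budget eps_tilde; the cut's height scales it.
    return math.sqrt(2.0) * C2 * height / eps_tilde
```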
Relative Error
• (α, δ)-usefulness is effective for giving an overall
estimate of utility, but fails to produce intuitive
experimental results.
• We experimentally measure the utility of sanitized data
for counting queries by the relative error
|Q(D*) − Q(D)| / max{Q(D), s},
where s is a sanity bound that mitigates the effect of
queries with very small true counts.
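The metric in code (illustrative):

```python
def relative_error(noisy_answer, true_answer, s):
    """|Q(D*) - Q(D)| / max{Q(D), s}; the sanity bound s keeps queries
    with near-zero true counts from dominating the average."""
    return abs(noisy_answer - true_answer) / max(true_answer, s)
```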
Experiments
• Average relative error vs. taxonomy tree fan-out
[Figures: average relative error vs. fan-out under privacy budgets B = 0.5, 0.75, 1.0]