
Differentially Private Data Release for Data Mining
Noman Mohammed*, Rui Chen*, Benjamin C. M. Fung*, Philip S. Yu+
*Concordia University, Montreal, Canada
+University of Illinois at Chicago, IL, USA
Introduction
• Privacy-preserving data publishing (PPDP) studies how to transform raw data into a version that is immunized against privacy attacks but that still supports effective data mining tasks.
Figure 1: Overview of Privacy-preserving data publishing
• Differential privacy is one of the strongest privacy definitions. A differentially private mechanism ensures that the probability of any output (released data) is nearly the same for all nearly identical input data sets, and thus guarantees that all outputs are insensitive to any individual’s data.
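In practice this guarantee is obtained by perturbing released statistics, most commonly with the Laplace mechanism [1], which adds noise of scale 1/ε to a count (a count query has sensitivity 1). Below is a minimal sketch in Python; the function name is illustrative rather than part of the poster's implementation.

    import numpy as np

    def laplace_count(true_count, epsilon, rng=None):
        """Release a count with Laplace noise of scale 1/epsilon (a count has sensitivity 1) [1]."""
        rng = rng or np.random.default_rng()
        return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

    # A count of 0 is released as 0 + Lap(1/eps), i.e. pure noise,
    # whereas a large count is barely distorted.
    print(laplace_count(0, epsilon=1.0))
    print(laplace_count(120, epsilon=1.0))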
• Current techniques: build a contingency table (Table 2) over the raw data (Table 1) and release each count with Laplace noise, so an empty cell is published as 0 + Lap(1/ε). For high-dimensional data, the noise is too big relative to the true counts.
Table 1: Raw data
Table 2: Contingency table
• Our approach: generalize the raw records according to taxonomy trees (Figure 2) and release the counts of the generalized groups (Table 3).
Table 3: Generalized contingency table
Figure 2: Taxonomy tree for Job

Anonymization Algorithm
• Consider the raw data set of Table 1.
• Step 1: The algorithm creates a generalized root partition.
• Step 2: At each iteration, the algorithm uses the exponential mechanism to select a candidate for specialization.
• Step 3: The algorithm outputs each leaf partition along with its noisy count using the Laplace mechanism.
Figure 3: Tree for partitioning records
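The sketch below illustrates this loop in Python under simplifying assumptions: the candidate set, the utility score, and the specialization step are abstracted as callables, and the privacy budget is split evenly between the selection steps and the final counts. It is an illustration of the mechanism composition, not the poster's exact DiffGen implementation.

    import numpy as np

    rng = np.random.default_rng()

    def exponential_mechanism(candidates, scores, epsilon, sensitivity):
        # Pick one candidate with probability proportional to exp(eps * score / (2 * sensitivity)).
        scores = np.asarray(scores, dtype=float)
        weights = np.exp(epsilon * (scores - scores.max()) / (2.0 * sensitivity))
        return candidates[rng.choice(len(candidates), p=weights / weights.sum())]

    def anonymize(candidates, score, specialize, leaf_partitions,
                  epsilon, num_specializations, score_sensitivity):
        # Step 1 (creating the generalized root partition) is assumed to be handled
        # by the partition helpers supplied by the caller.
        eps_select = epsilon / (2.0 * num_specializations)   # budget per selection (assumed split)
        for _ in range(num_specializations):
            cands = candidates()                              # candidate cut values still available
            chosen = exponential_mechanism(cands, [score(v) for v in cands],
                                           eps_select, score_sensitivity)
            specialize(chosen)                                # Step 2: split the affected partitions
        # Step 3: release every leaf partition with a Laplace-noisy count.
        eps_count = epsilon / 2.0
        return {leaf: len(group) + rng.laplace(scale=1.0 / eps_count)
                for leaf, group in leaf_partitions().items()}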
Experimental Evaluation
• We employ the publicly available Adult data set. Adult has 45,222 census records with 6 numerical attributes, 8 categorical attributes, and a binary class column representing two income levels, ≤50K or >50K.
• We observe two general trends from the experiments. First, the privacy budget has a direct impact on the classification accuracy. Second, the classification accuracy initially increases with the number of specializations; beyond a certain threshold, however, it decreases as the number of specializations grows further.
• We compare DiffGen with DiffP-C4.5 [2], a differentially private interactive algorithm for building a classifier, and with the top-down specialization (TDS) approach [4], which publishes k-anonymous data for classification analysis.
Figure 4: Classification accuracy for Max
Figure 5: Comparison
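Classification accuracy here means the accuracy of a classifier trained on the released data. A minimal sketch of such a measurement, using scikit-learn's CART decision tree as a stand-in for C4.5; the file name, the "income" column name, and the random train/test split are assumptions for illustration.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import OrdinalEncoder
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    # Hypothetical CSV holding the generalized records released by the algorithm.
    data = pd.read_csv("anonymized_adult.csv")
    X = OrdinalEncoder().fit_transform(data.drop(columns=["income"]).astype(str))
    y = data["income"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
    clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    print("classification accuracy:", accuracy_score(y_test, clf.predict(X_test)))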
Conclusions
The proposed method is the first generalization-based algorithm for differentially private data release that preserves information for classification analysis.
The proposed solution connects the classical generalization technique with output perturbation to effectively anonymize raw data. Experimental results suggest that it provides better utility than the interactive approach and the k-anonymous data, and that it is more effective than publishing a contingency table.
Acknowledgments: The research is supported in part by the Discovery Grants
(356065-2008), Strategic Project Grants (130570020), and Canada Graduate
Scholarships from the Natural Sciences and Engineering Research Council of
Canada, and the NSF grant IIS-0914934.
References
1. C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating
noise to sensitivity in private data analysis. In TCC, 2006.
2. A. Friedman and A. Schuster. Data mining with differential
privacy. In SIGKDD, 2010.
3. B. C. M. Fung, K. Wang, R. Chen, and P. S. Yu. Privacy-preserving data publishing: A survey of recent developments. ACM Computing Surveys, 42(4):1–53, June 2010.
4. B. C. M. Fung, K. Wang, and P. S. Yu. Anonymizing
classification data for privacy preservation. IEEE TKDE,
19(5):711–725, May 2007.