Transcript: Lecture 27

Privacy Preserving Data Mining
Introduction
August 2nd, 2013
Shaibal Chakrabarty
1
• Motivation: Inherent tension in mining sensitive databases:
We want to release aggregate information about the data, without
leaking individual information about participants.
• Aggregate info: Number of A students in a school district.
• Individual info: If a particular student is an A student.
• Problem: Exact aggregate info may leak individual info. E.g.:
The number of A students in the district, and
the number of A students in the district not named Dan Waymel,
together reveal whether Dan Waymel is an A student.
• Goal: Method to protect individual info, release aggregate info.
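The differencing attack above can be made concrete with a minimal sketch; the names and grades below are purely illustrative.

```python
# Hypothetical grade records for the district.
records = [
    ("Alice", "A"),
    ("Bob", "B"),
    ("Dan Waymel", "A"),
    ("Eve", "C"),
]

# Two seemingly harmless aggregate queries:
a_students = sum(1 for name, grade in records if grade == "A")
a_students_not_dan = sum(
    1 for name, grade in records if grade == "A" and name != "Dan Waymel"
)

# Their difference exposes one individual's record:
dan_is_a_student = (a_students - a_students_not_dan) == 1
print(dan_is_a_student)  # → True: Dan's grade leaks from two aggregates
```

Neither query alone names Dan, yet subtracting them answers the individual-level question exactly, which is why releasing exact aggregates is unsafe.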
2
• A growing number of data mining applications need to deal with
data sources that are distributed, possibly proprietary, and
privacy-sensitive. Financial transactions, health-care records,
and network communication traffic are a few examples. Privacy
is also becoming an increasingly important issue in data mining
applications for counter-terrorism and homeland defense, which
may require creating profiles, constructing social network
models, and detecting terrorist communications from distributed,
privacy-sensitive, multi-party data.
• Combining such diverse data sets belonging to different parties
may violate privacy laws. We therefore need algorithms that
can mine the data while guaranteeing that the privacy of the
data is not compromised.
• This has resulted in the development of several privacy-
preserving data mining techniques. Many of these techniques
use randomization to perturb the data, preserving privacy
while keeping the underlying patterns invariant.
3
• Goal: Distort the data while still preserving the properties
needed for data mining purposes.
− Additive-based
− Multiplicative-based
− Condensation-based
− Decomposition
− Data swapping
4
Randomization approach
• Hide the original data by randomly modifying the values with
some additive noise, while still preserving the patterns of the
original data (its underlying probabilistic properties).
• Reconstruct the distribution of the original data values from the
perturbed data.
• Individual original values cannot be reconstructed, only their
distribution.
• A decision tree classifier is then built from the perturbed data
using this reconstructed distribution.
• Privacy breaches are still possible.
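A minimal sketch of the additive-noise idea, assuming Gaussian originals and uniform noise (both choices are illustrative). It shows only that zero-mean noise leaves aggregate statistics approximately intact; the full distribution reconstruction in the randomization approach uses an iterative Bayesian procedure not reproduced here.

```python
import random

random.seed(0)

# Illustrative originals: 10,000 sensitive values (e.g., ages).
original = [random.gauss(40, 10) for _ in range(10_000)]

# Each value is released as x + r with r ~ Uniform(-25, 25), so any
# single released value says little about the individual behind it.
perturbed = [x + random.uniform(-25, 25) for x in original]

# Because the noise has zero mean, aggregates survive: the sample
# mean of the perturbed data tracks the original closely, and the
# original distribution can likewise be recovered from the perturbed
# values plus the known noise distribution.
mean_orig = sum(original) / len(original)
mean_pert = sum(perturbed) / len(perturbed)
print(abs(mean_orig - mean_pert) < 1.0)  # → True: means agree closely
```

The tension on this slide is visible in the parameters: wider noise gives stronger privacy for each record but makes the reconstruction of the distribution noisier.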
Cryptographic approach
• Party X owns database D1; party Y owns database D2.
• Build a decision tree on D1 and D2 without revealing information
about D1 to party Y or about D2 to party X, beyond what might be
revealed by the decision tree itself.
• Horizontally partitioned data - Records (entities) split across parties
• Vertically partitioned data - Attributes split across parties
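The two partitioning styles can be illustrated on a toy table; the records, attributes, and party roles below are hypothetical.

```python
# Full logical table: (record_id, age, diagnosis).
full_table = [
    (1, 34, "flu"),
    (2, 51, "diabetes"),
    (3, 29, "flu"),
    (4, 62, "asthma"),
]

# Horizontally partitioned: each party holds complete records for a
# disjoint subset of the entities (e.g., two different hospitals).
party_x_horizontal = full_table[:2]   # records 1-2
party_y_horizontal = full_table[2:]   # records 3-4

# Vertically partitioned: each party holds all entities but only a
# subset of the attributes, joined on a shared record id
# (e.g., a clinic holding demographics, an insurer holding diagnoses).
party_x_vertical = [(rid, age) for rid, age, _ in full_table]
party_y_vertical = [(rid, diag) for rid, _, diag in full_table]

assert len(party_x_horizontal) + len(party_y_horizontal) == len(full_table)
assert len(party_x_vertical) == len(party_y_vertical) == len(full_table)
```

The distinction matters because the secure protocols differ: horizontal partitioning calls for securely combining per-party statistics over the same attributes, while vertical partitioning requires securely joining attribute sets per entity.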
5
Randomization and Reconstruction (diagram)
6
• R. Agrawal and R. Srikant, "Privacy-Preserving Data Mining,"
ACM SIGMOD Conference, 2000.
• H. Kargupta, S. Datta, Q. Wang, and K. Sivakumar, "Random Data
Perturbation Techniques and Privacy Preserving Data Mining."
• C. Clifton, M. Kantarcioglu, J. Vaidya, X. Lin, and M. Zhu,
"Tools for Privacy Preserving Distributed Data Mining," ACM
SIGKDD Explorations 4(2), January 2003.
• W. Du and M. J. Atallah, "Privacy Preserving Cooperative
Statistical Analysis."
• C. Clifton, M. Kantarcioglu, and J. Vaidya, "Defining Privacy
for Data Mining."
• J. Han and M. Kamber, Data Mining: Concepts and Techniques.
7
• Privacy is a personal choice, so methods should allow individually
adaptable privacy levels (Liu, Kantarcioglu and Thuraisingham,
ICDM '06).
8