Top-Down Specialization for Information and Privacy Preservation

Download Report

Transcript Top-Down Specialization for Information and Privacy Preservation

Top-Down Specialization
for Information and
Privacy Preservation
Benjamin C. M. Fung
Ke Wang
Philip S. Yu
Simon Fraser University
BC, Canada
Simon Fraser University
BC, Canada
IBM T.J. Watson
Research Center
[email protected]
[email protected]
[email protected]
IEEE ICDE 2005
Conference Paper
• Top-Down Specialization for Information and
Privacy Preservation.
• Proceedings of the 21st IEEE International
Conference on Data Engineering (ICDE 2005).
• Conference paper and software:
– http://www.cs.sfu.ca/~ddm
– Software  Privacy  TDS 1.0
2
Outline
• Problem: Anonymity for Classification
• Our method: Top-Down Specialization (TDS)
• Related works
• Experimental results
• Conclusions
• Q&A
3
Motivation
• Government and business have strong
motivations for data mining.
• Citizens have growing concern about protecting
their privacy.
• Can we satisfy both the data mining goal and the
privacy goal?
4
Scenario
• A data owner wants to release a person-specific
data table to another party (or the public) for the
purpose of classification without scarifying the
privacy of the individuals in the released data.
Person-specific
data
Data owner
Data recipients
5
Classification Goal
Training
data
Classification
Algorithm
Decision tree
Education
Bachelors
Doctorate
Masters
Sex
Education
Sex
Age
Class
# of Recs.
9th
F
30
0G3B
3
10th
M
32
0G4B
4
11th
F
35
2G3B
5
12th
F
37
3G1B
4
Bachelors
F
42
4G2B
6
Bachelors
F
44
4G0B
4
Masters
M
44
4G0B
4
Masters
F
44
3G0B
3
Doctorate
F
44
1G0B
1
Total:
34
G
Female
Male
G
……
B
Credit?
Unseen data
(Masters, Female)
Good
Privacy Threat
• If a description on (Education, Sex) is so specific that not
many people match it, releasing the table will lead to
linking a unique or a small number of individuals with
sensitive information.
Education
Sex
Age
Class
# of Recs.
9th
F
30
0G3B
3
10th
M
32
0G4B
4
11th
F
35
2G3B
5
12th
F
37
3G1B
4
Education
Sex
Diagnosis
…
Bachelors
F
42
4G2B
6
Bachelors
F
Depression
…
Bachelors
F
44
4G0B
4
Bachelors
M
Heart disease
…
Masters
M
44
4G0B
4
Masters
F
Depression
…
Masters
F
44
3G0B
3
Masters
F
Heart disease
…
Doctorate
F
44
1G0B
1
Total:
34
Data
Adversary
Doctorate
Frecipients
Knee injury
…
Privacy Goal: Anonymity
• The privacy goal is specified by the anonymity on a
combination of attributes called Virtual Identifier (VID),
where each description on a VID is required to be shared
by at least k records in the table.
• Anonymity requirement
– Consider VID1,…, VIDp. e.g., VID = {Education, Sex}.
– a(vidi) denotes the number of data records in T that share the
value vidi on VIDi.
e.g., vid = {Doctorate, Female}.
– A(vidi) denotes the smallest a(vidi) for any value vidi on VIDi.
– A table T satisfies the anonymity requirement
{<VID1, k1>, …, <VIDp, kp>}
if A(vidi) ≥ ki for 1 ≤ i ≤ p, where ki is the anonymity threshold on
VIDi specified by the data owner.
8
Anonymity Requirement
Example: VID1 = {Education, Sex}, k1 = 4
Education
Sex
Age
Class
# of Recs.
a( vid1 )
9th
F
30
0G3B
3
3
10th
M
32
0G4B
4
4
11th
F
35
2G3B
5
5
12th
F
37
3G1B
4
4
Bachelors
F
42
4G2B
6
10
Bachelors
F
44
4G0B
4
Masters
M
44
4G0B
4
4
Masters
F
44
3G0B
3
3
Doctorate
F
44
1G0B
1
1
Total:
34
A(VID1) = 1
9
Problem Definition
• Anonymity for Classification:
– Given a table T, an anonymity requirement, and a
taxonomy tree of each categorical attribute in UVIDj,
generalize T to satisfy the anonymity requirement
while preserving as much information as possible for
classification.
• A VID may contain both categorical and
continuous attributes.
10
Solution: Generalization
Generalize values in UVIDj.
Education
Sex
Age
Class
# of Recs.
Education
Sex
Age
Class
# of Recs.
9th
F
30
0G3B
3
9th
F
30
0G3B
3
10th
M
32
0G4B
4
10th
M
32
0G4B
4
11th
F
35
2G3B
5
11th
F
35
2G3B
5
12th
F
37
3G1B
4
12th
F
37
3G1B
4
Bachelors
F
42
4G2B
6
Bachelors
F
42
4G2B
6
Bachelors
F
44
4G0B
4
Bachelors
F
44
4G0B
4
Masters
M
44
4G0B
4
Grad School
M
44
4G0B
4
Masters
F
44
3G0B
3
Grad School
F
44
4G0B
4
Doctorate
F
44
1G0B
1
11
Intuition
1. Classification goal and privacy goal have no conflicts:
–
–
Privacy goal: mask sensitive information, usually specific
descriptions that identify individuals.
Classification goal: extract general structures that capture
trends and patterns.
2. A table contains multiple classification structures.
Generalizations destroy some classification structures,
but other structures emerge to help.
If generalization is “carefully” performed, identifying
information can be masked while still preserving trends
and patterns for classification.
12
Algorithm Top-Down Specialization (TDS)
Initialize every value in T to the top most value.
Initialize Cuti to include the top most value.
while some x  UCuti is valid do
Find the Best specialization of the highest Score in UCuti.
Perform the Best specialization on T and update UCuti.
Update Score(x) and validity for x  UCuti.
end while
return Generalized T and UCuti.
Age
ANY
[1-99)
[1-37)
[37-99)
13
Search Criteria: Score
• Consider a specialization v  child(v). To heuristically
maximize the information of the generalized data for
achieving a given anonymity, we favor the specialization
on v that has the maximum information gain for each
unit of anonymity loss:
14
Search Criteria: Score
• Rv denotes the set of records having value v before the specialization.
Rc denotes the set of records having value c after the specialization
where c  child(v).
• I(Rx) is the entropy of Rx [8]:
• freq(Rx, cls) is the number records in Rx having the class cls.
• Intuitively, I(Rx) measures the impurity of classes for the data records
in Rx . A good specialization reduces the impurity of classes.
• Also use InfoGain(v) to determine the optimal binary split on a
continuous interval. [8]
[1-37)
Age
ANY
[1-99)
[37-99)
Education
Sex
Age
Class
# of Recs.
ANY_Edu
ANY_Sex
[1-99)
21G13B
34
Education
Sex
Age
Class
# of Recs.
Secondary
ANY_Sex
[1-99)
5G11B
16
University
ANY_Sex
[1-99)
16G2B
18
To compute Score(ANY_Edu):
Perform the Best Specialization
• To perform the Best specialization Best  child(Best), we
need to retrieve RBest, the set of data records containing
the value Best.
• Taxonomy Indexed PartitionS (TIPS) is a tree structure
with each node representing a generalized record over
UVIDj, and each child node representing a specialization
of the parent node on exactly one attribute.
• Stored with each leaf node is the set of data records
having the same generalized record.
17
Consider VID1 = {Education, Sex}, VID2 = {Sex, Age}
ANY_Edu
ANY_Sex
Secondary
ANY_
Sex
Education
Sex
Age
# of Recs.
ANY_Edu
ANY_Sex
[1-99)
34
[1-37)
[1-37)
12
LinkSecondary
12
ANY_Edu
Secondary
ANY_
Sex
[37-99)
ANY_Sex
4
[37-99)
University
22
ANY_
Sex
[37-99)
Link[37-99)
Age
ANY
[1-99)
[1-37)
[37-99)
18
Count Statistics
• While performing specialization, collect the following
count statistics.
– |Rc|, number of records containing c after specialization Best,
where c  child(Best).
– |Rd|, number of records containing d as if d is specialized,
where d  child(c).
– freq(Rc,cls), number of records in Rc having class cls.
– freq(Rd,cls), number of records in Rd having class cls.
– |Pd|, number of records in partition Pd where Pd is a child partition
under Pc as if c is specialized.
• Score can be updated using the above count statistics
without accessing data records.
19
Update the Score
• Update InfoGain(v): Specialization does not affect
InfoGain(v) except for each value c  child(Best).
• Update AnonyLoss(v): Specialization affects Av(VIDj)
which is the minimum a(vidj) after specializing Best. If
attribute(v) and attribute(Best) are contained in the same
VIDj, then AnonyLoss(v) has to be updated.
Age
ANY
[1-99)
[1-37)
[37-99)
Update the Score
• Need to efficiently extract a(vidj) from TIPS for
updating Av(VIDj).
• Virtual Identifier TreeS (VITS):
– VITj for VIDj = {D1,…, Dw} is a tree of w levels. The
level i > 0 represents the generalized values for Di.
Each root-to-leaf path represents an existing vidj on
VIDj in the generalized data, with a(vidj) stored at the
leaf node.
Consider VID1 = {Education, Sex}, VID2 = {Sex, Age}
ANY_Edu
ANY_Sex
Secondary
ANY_
Sex
Age
Education
Sex
Age
# of Recs.
ANY_Edu
ANY_Sex
[1-99)
34
[1-37)
[1-37)
12
12
ANY_Edu
Secondary
ANY_
Sex
[37-99)
ANY_Sex
4
[37-99)
University
22
ANY_
Sex
[37-99)
18
Benefits of TDS
• Handling multiple VIDs
– Treating all VIDs as a single VID leads to over generalization.
• Handling both categorical and continuous attributes.
– Dynamically generate taxonomy tree for continuous attributes.
• Anytime solution
– User may step through each specialization to determine a
desired trade-off between privacy and accuracy.
– User may stop any time and obtain a generalized table satisfying
the anonymity requirement. Bottom-up approach does not
support this feature.
• Scalable computation
23
Related Works
• The concept of anonymity was proposed by Dalenius [2].
• Sweeny [10] employed bottom-up generalization to
achieve k-anonymity.
– Single VID. Not considering specific use of data.
• Iyengar [6] proposed a genetic algorithm (GA) to address
the problem of anonymity for classification.
– Single VID.
– GA needs 18 hours while our method only needs 7 seconds to
generalize same set of records (with comparable accuracy).
• Wang et al. [12] recently proposed bottom-up
generalization to address the same problem.
– Only for categorical attributes.
– Does not support Anytime Solution.
24
Experimental Evaluation
• Data quality.
– A broad range of anonymity requirements.
– Used C4.5 and Naïve Bayesian classifiers.
• Compare with Iyengar’s genetic algorithm [6].
– Results were quoted from [6].
• Efficiency and Scalability.
25
Data set
• Adult data set
–
–
–
–
–
–
–
Used in Iyengar [6].
Census data.
6 continuous attributes.
8 categorical attributes.
Two classes.
30162 recs. for training.
15060 recs. for testing.
26
Data Quality
• Include the TopN most important attributes into a SingleVID, which
is more restrictive than breaking them into multiple VIDs.
Data Quality
• Include the TopN most important attributes into a SingleVID, which
is more restrictive than breaking them into multiple VIDs.
Data Quality
• Compare with the genetic algorithm using C4.5.
• Only categorical attributes. Same taxonomy trees.
Our method
Efficiency and Scalability
• Took at most 10 seconds for all previous experiments.
• Replicate the Adult data set and substitute some random data.
Conclusions
• Quality classification and privacy preservation can
coexist.
• An effective top-down method to iteratively specialize the
data, guided by maximizing the information utility and
minimizing privacy specificity.
• Simple but effective data structures.
• Great applicability to both public and private sectors that
share information for mutual benefits.
31
Thank you.
Questions?
32
References
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
R. Agrawal and S. Ramakrishnan. Privacy preserving data mining. In Proc. of the ACM
SIGMOD Conference on Management of Data, pages 439–450, Dallas, Texas, May 2000.
ACM Press.
T. Dalenius. Finding a needle in a haystack - or identifying anonymous census record. Journal
of Official Statistics, 2(3):329–336, 1986.
W. A. Fuller. Masking procedures for microdata disclosure limitation. Official Statistics,
9(2):383–406, 1993.
S. Hettich and S. D. Bay. The UCI KDD Archive, 1999. http://kdd.ics.uci.edu.
A. Hundepool and L. Willenborg. - and -argus: Software for statistical disclosure control. In
the 3rd International Seminar on Statistical Confidentiality, Bled, 1996.
V. S. Iyengar. Transforming data to satisfy privacy constraints. In Proc. of the 8th ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 279–288,
Edmonton, AB, Canada, July 2002.
J. Kim and W. Winkler. Masking microdata files. In ASA Proceedings of the Section on Survey
Research Methods, pages 114–119, 1995.
R. J. Quinlan. C4.5: Progams for Machine Learning. Morgan Kaufmann, 1993.
P. Samarati. Protecting respondents’ identities in microdata release. In IEEE Transactions on
Knowledge Engineering, volume 13, pages 1010–1027, 2001.
L. Sweeney. Datafly: A system for providing anonymity in medical data. In International
Conference on Database Security, pages 356–381, 1998.
33
References
11.
12.
The House of Commons in Canada. The personal information protection and electronic
documents act, April 2000. http://www.privcom.gc.ca/.
K. Wang, P. Yu, and S. Chakraborty. Bottom-up generalization: a data mining solution to
privacy protection. In Proc. of the 4th IEEE International Conference on Data Mining 2004
(ICDM 2004), November 2004.
34