Top-Down Specializat..
Download
Report
Transcript Top-Down Specializat..
Top-Down Specialization for
Information and Privacy
Preservation
Benjamin C. M. Fung
Ke Wang
Philip S. Yu
Simon Fraser University BC,
Canada
Simon Fraser University BC,
Canada
IBM T.J. Watson
Research Center
[email protected]
[email protected]
[email protected]
IEEE ICDE 2005
Outline
Motivation
Problem definition
Top-Down Specialization (TDS)
Conclusion
2
Motivation
Government and business have strong
motivations for data mining.
Citizens have growing concern about protecting
their privacy.
Can we satisfy both the data mining goal and the
privacy goal?
3
Scenario
A data owner wants to release a person-specific
data table to another party (or the public) for the
purpose of classification without scarifying the
privacy of the individuals in the released data.
Person-specific
data
Data owner
Data recipients
4
Privacy Threat
If a description on (Education, Sex) is so specific that not
many people match it, releasing the table will lead to
linking a unique or a small number of individuals with
sensitive information.
Education
Sex
Age
Class
# of Recs.
9th
F
30
0G3B
3
10th
M
32
0G4B
4
11th
F
35
2G3B
5
12th
F
37
3G1B
4
Education
Sex
Diagnosis
…
Bachelors
F
42
4G2B
6
Bachelors
F
Depression
…
Bachelors
F
44
4G0B
4
Bachelors
M
Heart disease
…
Masters
M
44
4G0B
4
Masters
F
Depression
…
Masters
F
44
3G0B
3
Masters
F
Heart disease
…
Doctorate
F
44
1G0B
1
Total:
34
Data
Adversary
Doctorate
Frecipients
Knee injury
…
Privacy Goal: Anonymity
The privacy goal is specified by the anonymity on a
combination of attributes called Virtual Identifier (VID),
where each description on a VID is required to be shared
by at least k records in the table.
Anonymity requirement
Consider VID1,…, VIDp. e.g., VID = {Education, Sex}.
a(vidi) denotes the number of data records in T that share the
value vidi on VIDi.
e.g., vid = {Doctorate, Female}.
A(vidi) denotes the smallest a(vidi) for any value vidi on VIDi.
A table T satisfies the anonymity requirement
{<VID1, k1>, …, <VIDp, kp>}
if A(vidi) ≥ ki for 1 ≤ i ≤ p, where ki is the anonymity threshold on
VIDi specified by the data owner.
6
Anonymity Requirement
Example: VID1 = {Education, Sex}, k1 = 4
Education
Sex
Age
Class
# of Recs.
a( vid1 )
9th
F
30
0G3B
3
3
10th
M
32
0G4B
4
4
11th
F
35
2G3B
5
5
12th
F
37
3G1B
4
4
Bachelors
F
42
4G2B
6
10
Bachelors
F
44
4G0B
4
Masters
M
44
4G0B
4
4
Masters
F
44
3G0B
3
3
Doctorate
F
44
1G0B
1
1
Total:
34
A(VID1) = 1
7
Problem Definition
Anonymity for Classification:
Given
a table T, an anonymity requirement, and a
taxonomy tree of each categorical attribute in UVIDj,
generalize T to satisfy the anonymity requirement
while preserving as much information as possible for
classification.
A VID may contain both categorical and
continuous attributes.
8
Solution: Generalization
Generalize values in UVIDj.
Education
Sex
Age
Class
# of Recs.
Education
Sex
Age
Class
# of Recs.
9th
F
30
0G3B
3
9th
F
30
0G3B
3
10th
M
32
0G4B
4
10th
M
32
0G4B
4
11th
F
35
2G3B
5
11th
F
35
2G3B
5
12th
F
37
3G1B
4
12th
F
37
3G1B
4
Bachelors
F
42
4G2B
6
Bachelors
F
42
4G2B
6
Bachelors
F
44
4G0B
4
Bachelors
F
44
4G0B
4
Masters
M
44
4G0B
4
Grad School
M
44
4G0B
4
Masters
F
44
3G0B
3
Grad School
F
44
4G0B
4
Doctorate
F
44
1G0B
1
9
Solution: Generalization
A generalized T can be viewed as a “cut” through the
taxonomy tree of each attribute in ∪VIDi. A cut of a tree
is a subset of values in the tree that contains exactly one
value on each root-to-leaf path.
A solution cut is ∪Cutj, where Cutj is a cut of the
taxonomy tree of an attribute in∪VIDi, such that the
generalized T represented by∪Cutj satisfies the
anonymity requirement. We are interested in a solution
cut that maximally preserves the information for
classification.
10
Solution: Generalization
Educatio
n
Sex
Hours
Class
# of Recs.
9th
F
30
0Y3N
3
3
10th
M
32
0Y4N
4
4
11th
F
35
2Y3N
5
5
12th
F
37
3Y1N
4
4
Bachelors
F
42
4Y2N
6
10
Bachelors
F
44
4Y0N
4
Masters
M
44
4Y0N
4
4
Masters
F
44
3Y0N
3
3
Doctorate
F
44
1Y0N
1
1
Total:
34
a( vid1 )
11
Intuition
1.
Classification goal and privacy goal have no conflicts:
2.
Privacy goal: mask sensitive information, usually specific
descriptions that identify individuals.
Classification goal: extract general structures that capture
trends and patterns.
A table contains multiple classification structures.
Generalizations destroy some classification structures,
but other structures emerge to help.
If generalization is “carefully” performed, identifying
information can be masked while still preserving trends
and patterns for classification.
12
Algorithm Top-Down Specialization (TDS)
Initialize every value in T to the top most value.
Initialize Cuti to include the top most value.
while some x UCuti is valid do
Find the Best specialization of the highest Score in UCuti.
Perform the Best specialization on T and update UCuti.
Update Score(x) and validity for x UCuti.
end while
return Generalized T and UCuti.
Age
ANY
[1-99)
[1-37)
[37-99)
13
Search Criteria: Score
Consider a specialization v child(v). To heuristically
maximize the information of the generalized data for
achieving a given anonymity, we favor the specialization
on v that has the maximum information gain for each
unit of anonymity loss:
14
Search Criteria: Score
Rv denotes the set of records having value v before the specialization.
Rc denotes the set of records having value c after the specialization
where c child(v).
I(Rx) is the entropy of Rx [8]:
freq(Rx, cls) is the number records in Rx having the class cls.
Intuitively, I(Rx) measures the impurity of classes for the data records
in Rx . A good specialization reduces the impurity of classes.
Also use InfoGain(v) to determine the optimal binary split on a
continuous interval. [8]
Age:30,32,35,37,42,44
[1-37)
Age
ANY
[1-99)
[37-99)
Education
Sex
Age
Class
# of Recs.
ANY_Edu
ANY_Sex
[1-99)
21G13B
34
Education
Sex
Age
Class
# of Recs.
Secondary
ANY_Sex
[1-99)
5G11B
16
University
ANY_Sex
[1-99)
16G2B
18
To compute Score(ANY_Edu):
Perform the Best Specialization
To perform the Best specialization Best child(Best), we
need to retrieve RBest, the set of data records containing
the value Best.
Taxonomy Indexed PartitionS (TIPS) is a tree structure
with each node representing a generalized record over
UVIDj, and each child node representing a specialization
of the parent node on exactly one attribute.
Stored with each leaf node is the set of data records
having the same generalized record.
17
Consider VID1 = {Education, Sex}, VID2 = {Sex, Age}
ANY_Edu
ANY_Sex
Education
Sex
Age
# of Recs.
ANY_Edu
ANY_Sex
[1-99)
34
[1-37)
12
ANY_Edu
ANY_Sex
[37-99)
22
Link university
Secondary
ANY_
Sex
[1-37)
12
LinkSecondary
Secondary
ANY_
Sex
[37-99)
4
University
ANY_
Sex
[37-99)
Link[37-99)
Age
ANY
[1-99)
[1-37)
[37-99)
18
Count Statistics
While performing specialization, collect the following
count statistics.
|Rc|, number of records containing c after specialization Best,
where c child(Best).
|Rd|, number of records containing d as if d is specialized,
where d child(c).
freq(Rc,cls), number of records in Rc having class cls.
freq(Rd,cls), number of records in Rd having class cls.
|Pd|, number of records in partition Pd where Pd is a child partition
under Pc as if c is specialized.
Score can be updated using the above count statistics
without accessing data records.
19
Update the Score
Update InfoGain(v): Specialization does not affect
InfoGain(v) except for each value c child(Best).
Update AnonyLoss(v): Specialization affects Av(VIDj)
which is the minimum a(vidj) after specializing Best. If
attribute(v) and attribute(Best) are contained in the same
VIDj, then AnonyLoss(v) has to be updated.
Age
ANY
[1-99)
[1-37)
[37-99)
Update the Score
Need to efficiently extract a(vidj) from TIPS for
updating Av(VIDj).
Virtual Identifier TreeS (VITS):
for VIDj = {D1,…, Dw} is a tree of w levels. The
level i > 0 represents the generalized values for Di.
Each root-to-leaf path represents an existing vidj on
VIDj in the generalized data, with a(vidj) stored at the
leaf node.
VITj
Consider VID1 = {Education, Sex}, VID2 = {Sex, Age}
ANY_Edu
ANY_Sex
Secondary
ANY_
Sex
Age
Education
Sex
Age
# of Recs.
ANY_Edu
ANY_Sex
[1-99)
34
[1-37)
[1-37)
12
12
ANY_Edu
Secondary
ANY_
Sex
[37-99)
ANY_Sex
4
[37-99)
University
22
ANY_
Sex
[37-99)
18
Benefits of TDS
Handling multiple VIDs
Handling both categorical and continuous attributes.
Dynamically generate taxonomy tree for continuous attributes.
Anytime solution
Treating all VIDs as a single VID leads to over generalization.
User may step through each specialization to determine a
desired trade-off between privacy and accuracy.
User may stop any time and obtain a generalized table satisfying
the anonymity requirement. Bottom-up approach does not
support this feature.
Scalable computation
23
Conclusions
Quality classification and privacy preservation can
coexist.
An effective top-down method to iteratively specialize the
data, guided by maximizing the information utility and
minimizing privacy specificity.
Simple but effective data structures.
Great applicability to both public and private sectors that
share information for mutual benefits.
24
25
26
27
28
Classification Goal
Training
data
Classification
Algorithm
Decision tree
Education
Bachelors
Doctorate
Masters
Sex
Education
Sex
Age
Class
# of Recs.
9th
F
30
0G3B
3
10th
M
32
0G4B
4
11th
F
35
2G3B
5
12th
F
37
3G1B
4
Bachelors
F
42
4G2B
6
Bachelors
F
44
4G0B
4
Masters
M
44
4G0B
4
Masters
F
44
3G0B
3
Doctorate
F
44
1G0B
1
Total:
34
G
Female
Male
G
……
B
Credit?
Unseen data
(Masters, Female)
Good
Experimental Evaluation
Data quality.
A broad
range of anonymity requirements.
Used C4.5 and Naïve Bayesian classifiers.
Compare with Iyengar’s genetic algorithm [6].
Results
were quoted from [6].
Efficiency and Scalability.
30
Data set
Adult data set
Used in Iyengar [6].
Census data.
6 continuous attributes.
8 categorical attributes.
Two classes.
30162 recs. for training.
15060 recs. for testing.
31
Data Quality
Include the TopN most important attributes into a SingleVID, which
is more restrictive than breaking them into multiple VIDs.
Data Quality
Include the TopN most important attributes into a SingleVID, which
is more restrictive than breaking them into multiple VIDs.
Data Quality
Compare with the genetic algorithm using C4.5.
Only categorical attributes. Same taxonomy trees.
Our method
Efficiency and Scalability
Took at most 10 seconds for all previous experiments.
Replicate the Adult data set and substitute some random data.