Transcript Slides
HIDING EMERGING PATTERNS
WITH LOCAL RECODING
GENERALIZATION
Presented by: Michael Cheng
Supervisor: Dr. William Cheung
Co-Supervisor: Dr. Byron Choi
Presentation Flow
Privacy-Preserving Data Publishing
Introduction to Emerging Patterns (EPs)
Introduction to Equivalence Class
Introduction to Generalization
Proposed Problem and Motivation
Heuristic for the Problem
Experimental Results
Future research plan
Privacy Preserving Data Publishing
- Introduction
Organizations often need to publish or share
their data for legitimate reasons
Sensitive information (e.g. personal identities,
restrictive patterns) maybe inferred from the
published data
Privacy Preserving Data Publishing
- Objective
1.
2.
Transform the dataset before publishing, such that:
Sensitive information
In our case: Emerging Patterns (EPs)
Subsequence analysis
In our case: Frequent Itemset (FIS) Mining
Introduction to Emerging Patterns (EPs)
Emerging Patterns (EPs) are itemsets exist in pair of
datasets whose supports are significant in one
dataset but insignificant in another
{MSE, Exec} is an Emerging Pattern
Edu
Occup
Marital
Edu
Occup
Marital
MSE
Exec
Married
BA
Exec
Married
MSE
Exec
Married
BA
Exec
Married
BA
Exec
Married
BA
Exec
Married
BA
Manager
Married
BA
Exec
Married
BA
Repair
Never
MSE
Worker
Never
Income >= 50k
Income < 50k
Introduction to Emerging Patterns (EPs)
Formally, growth rate and EPs are defined as follow:
Introduction to Equivalence Class
Tuples are said to be in the same Equivalence Class
w.r.t. a set of Attribute A if they take same values
of A
Tuples {1,2,3} are in the same Equivalence
Class w.r.t. {Occup, Marital}
ID
Edu
Occup
Marital
1
MSE
Exec
Married
2
MSE
Exec
Married
3
BA
Exec
Married
4
BA
Manager
Married
5
BA
Repair
Never
Introduction to Generalization
Extensively studied in achieving k-Anonymity
Not studied before for hiding itemsets
Modify the original values in dataset into more general
values according to a user-given hierarchy such that
more tuples will share the same set of attribute values
Example:
In Adult, “BA” and “MSE” maybe generalized to
“Degree Holder”
Types of Generalization
Single Dimensional Global Recoding
Multi Dimensional Global Recoding
Multi Dimensional Local Recoding
Occupation
White
Collar
Executive
Blue Collar
Manager
Repair
Worker
Single Dimensional Global Recoding
If we decide to generalize some values to a single
value, all tuples which contains these values will be
affected
Occup
Occup
Exec
Occupation
Exec
Exec
Single Dimensional
Global Recoding
Occupation
Occupation
Manager
Occupation
Repair
Occupation
Multi Dimensional Global Recoding
If we decide to generalize some values to a single
value, all tuples in the same equivalence class which
contains those values will be affected
Occup
Occup
Exec
Occupation
Exec
Exec
Multi Dimensional
Global Recoding
Occupation
Occupation
Manager
Manager
Repair
Repair
Multi Dimensional Local Recoding
Same as the Multi Dimensional Global Recoding
except no Equivalence Class constraint
Occup
Occup
Exec
Occupation
Exec
Exec
Multi Dimensional
Local Recoding
Occupation
Exec
Manager
Manager
Repair
Repair
Proposed Problem
- Why EP and FIS ?
Emerging Pattern may reveal sensitive information
E.g. In the Adult dataset from UCI Repository, we
found that:
{Never-Married,
Own-Child} is an EP from the class
“Income < 50k” to the class “Income >=50k”
Growth Rate: 35
Frequent Itemset is a popular data mining task and
supported by commercial data-mining software
Proposed Problem
-Why Generalization ?
Other methods studied in PPDP
For example:
Either
Adding unknowns, remove tuples, adding fake tuples randomly
Incomplete information
Fake information
In some applications, completeness and truthfulness of data
are important
By using generalization, we can preserve the completeness and
truthfulness of the data
Proposed problem
- Problem Illustration
Emerging Patterns
Frequent Itemsets
D
Emerging Patterns
Frequent Itemsets
Transformation
(Local Recoding)
D’
Intuition of Local Recoding
Support of FIS = 40%
Growth Rate of EP = 3
Frequent Itemset = {Exec, Married}
Emerging Pattern = {MSE ,Exec}
Edu
Occup
Marital
Edu
Occup
Marital
MSE
Exec
Married
BA
Exec
Married
MSE
Exec
Married
BA
Exec
Married
BA
Exec
Married
BA
Exec
Married
BA
Manager
Married
BA
Worker
Married
BA
Repair
Never
MSE
Manager
Never
Income >= 50k
Income < 50k
Intuition of Local Recoding
Edu
Occup
Marital
Edu
Occup
Marital
MSE
Exec
Married
BA
Exec
Married
MSE
Exec
Married
BA
Exec
Married
BA
Exec
Married
BA
Exec
Married
BA
Manager
Married
BA
Worker
Married
BA
Repair
Never
MSE
Manager
Never
Income >= 50k
Income < 50k
Edu
Occup
Marital
Edu
Occup
Marital
MSE
White col
Married
BA
Exec
Married
MSE
White col
Married
BA
Exec
Married
BA
Exec
Married
BA
Exec
Married
BA
Manager
Married
BA
Worker
Married
BA
Repair
Never
MSE
White Col
Never
Income >= 50k
Income < 50k
Heuristic for the Problem
- Greedy Approach
Repeat…
Applying the generalization
D
Emerging Patterns
Mining
Equivalence Classes
Utility Gain
EPs
Class 1
40
EP 1
Class 2
90
EP 2
Class 3
60
EP 3
Class 4
20
EP 4
Class 5
15
Until…
All Emerging Patterns are removed
Heuristic for the Problem
-Greedy Approach
Drawbacks:
Trapped
into some local minima
Solution:
Simulated
Annealing Style Approach for choosing
equivalence class
Heuristic for the Problem
- Simulated Annealing Style Approach
Choose Equivalence Class probabilistically
Two parameters:
Initial temperature ( T0 )
Cooling Rate ( α )
Utility Gain
T=1000
T=100
T=10
90
0.209
0.302
0.945
60
0.203
0.223
0.047
Acceptance Probability:
40
0.199
0.183
0.006
exp Utility Gain / Temperature
20
0.195
0.150
0.0009
15
0.194
0.142
0.0005
Temperature updating:
Tn = α Tn-1
Acceptance probability of different
utility gain and temperature
Heuristic for the Problem
- Simulated Annealing Style Approach
Repeat…
Applying the generalization
and
Decrease the temperature
D
Emerging Patterns
Mining
Equivalence Classes
Probability
EPs
Class 1
0.2
EP 1
Class 2
0.4
EP 2
Class 3
0.1
EP 3
Class 4
0.25
EP 4
Class 5
0.05
Until…
All Emerging Patterns are removed
Two questions
How to choose an EP for generalization?
How to calculate the utility gain?
How to choose an EP for generalization?
Choose the EP which overlaps with the remaining
EPs the most
More
likely to hide other EPs simultaneously
Emerging Patterns
MSE
Never Married
BA
Divorced
BA
Divorced
Worker
BA
Divorced
Repairman
BA
Divorced
Own-Child
How to calculate utility gain?
Utility gain is a function of:
Recoding
Distance (RD)
Reduction of Growth Rate (RG)
How to calculate utility gain ?
- Recoding Distance (RD)
The detail derivation is stated in the paper
Intuitively, it measures…
How
many and how much FIS have been generalized?
How many FIS disappeared?
High level definition of RD:
θq x (generalized FIS) + ( 1- θq ) x (disappeared FIS)
,where θq is user defined parameter
The larger the value of RD, the more the
distortion generated on the Frequent Itemset
How to calculate utility gain ?
- Reduction of Growth Rate(RG)
After taken a local recoding, RG is defined as:
The
reduction of growth rate of all EPs
Emerging Patterns
Growth Rate
Executive , Married
10
BA, Divorced
20
Executive
30
Sum of Growth
Rate
60
Emerging Patterns
Growth Rate
White col, Married
5
BA, Divorced
20
Sum of Growth
Rate
25
Local Recoding
RG = 60 – 25 = 35
How to calculate utility gain?
Putting all these together, utility gain is defined as:
θp x RG – (1- θp ) x RD
,where θp is user defined parameters
It favors:
Local
recoding which can reduce lots of growth rate
It penalizes:
Local
recoding which generate large distortion on FIS
Experimental Setup
Dataset: Adult dataset from UCI Repository
Total number of records: 30162
Income > 50k : 7508
Income <= 50k : 22654
Use only 8 categorical attributes for experiment
Popular benchmark dataset used for generalization
A well accepted hierarchy is defined
Parameters:
Support of FIS : 40%
Growth rate of EP : 5
Initial Temperature : 10
Cooling Rate : 0.4
Performance
Maximum
623.1
RD / No. of FIS disappeared of
the Greedy Approach
RD / No. of FIS disappeared of
Simulated Annealing Style Approach
(Best of 5)
RD:
Runtime (in minutes)
Greedy Approach
Simulated Annealing Style Approach
(Best of 5)
Future Research Plan
Hide EPs in temporal datasets
Consider multi-level FIS
Hiding a group of emerging patterns at a time
Q&A
Any
Questions?