Transcript Slides

HIDING EMERGING PATTERNS
WITH LOCAL RECODING
GENERALIZATION
Presented by: Michael Cheng
Supervisor: Dr. William Cheung
Co-Supervisor: Dr. Byron Choi
Presentation Flow
Privacy-Preserving Data Publishing
 Introduction to Emerging Patterns (EPs)
 Introduction to Equivalence Class
 Introduction to Generalization
 Proposed Problem and Motivation
 Heuristic for the Problem
 Experimental Results
 Future research plan

Privacy Preserving Data Publishing
- Introduction


Organizations often need to publish or share
their data for legitimate reasons
Sensitive information (e.g. personal identities,
restrictive patterns) maybe inferred from the
published data
Privacy Preserving Data Publishing
- Objective

1.

2.

Transform the dataset before publishing, such that:
Sensitive information
In our case: Emerging Patterns (EPs)
Subsequence analysis
In our case: Frequent Itemset (FIS) Mining
Introduction to Emerging Patterns (EPs)

Emerging Patterns (EPs) are itemsets exist in pair of
datasets whose supports are significant in one
dataset but insignificant in another
{MSE, Exec} is an Emerging Pattern
Edu
Occup
Marital
Edu
Occup
Marital
MSE
Exec
Married
BA
Exec
Married
MSE
Exec
Married
BA
Exec
Married
BA
Exec
Married
BA
Exec
Married
BA
Manager
Married
BA
Exec
Married
BA
Repair
Never
MSE
Worker
Never
Income >= 50k
Income < 50k
Introduction to Emerging Patterns (EPs)

Formally, growth rate and EPs are defined as follow:
Introduction to Equivalence Class

Tuples are said to be in the same Equivalence Class
w.r.t. a set of Attribute A if they take same values
of A
Tuples {1,2,3} are in the same Equivalence
Class w.r.t. {Occup, Marital}
ID
Edu
Occup
Marital
1
MSE
Exec
Married
2
MSE
Exec
Married
3
BA
Exec
Married
4
BA
Manager
Married
5
BA
Repair
Never
Introduction to Generalization

Extensively studied in achieving k-Anonymity

Not studied before for hiding itemsets


Modify the original values in dataset into more general
values according to a user-given hierarchy such that
more tuples will share the same set of attribute values
Example:
In Adult, “BA” and “MSE” maybe generalized to
“Degree Holder”
Types of Generalization



Single Dimensional Global Recoding
Multi Dimensional Global Recoding
Multi Dimensional Local Recoding
Occupation
White
Collar
Executive
Blue Collar
Manager
Repair
Worker
Single Dimensional Global Recoding

If we decide to generalize some values to a single
value, all tuples which contains these values will be
affected
Occup
Occup
Exec
Occupation
Exec
Exec
Single Dimensional
Global Recoding
Occupation
Occupation
Manager
Occupation
Repair
Occupation
Multi Dimensional Global Recoding

If we decide to generalize some values to a single
value, all tuples in the same equivalence class which
contains those values will be affected
Occup
Occup
Exec
Occupation
Exec
Exec
Multi Dimensional
Global Recoding
Occupation
Occupation
Manager
Manager
Repair
Repair
Multi Dimensional Local Recoding

Same as the Multi Dimensional Global Recoding
except no Equivalence Class constraint
Occup
Occup
Exec
Occupation
Exec
Exec
Multi Dimensional
Local Recoding
Occupation
Exec
Manager
Manager
Repair
Repair
Proposed Problem
- Why EP and FIS ?


Emerging Pattern may reveal sensitive information
E.g. In the Adult dataset from UCI Repository, we
found that:
 {Never-Married,
Own-Child} is an EP from the class
“Income < 50k” to the class “Income >=50k”
 Growth Rate: 35

Frequent Itemset is a popular data mining task and
supported by commercial data-mining software
Proposed Problem
-Why Generalization ?

Other methods studied in PPDP

For example:


Either




Adding unknowns, remove tuples, adding fake tuples randomly
Incomplete information
Fake information
In some applications, completeness and truthfulness of data
are important
By using generalization, we can preserve the completeness and
truthfulness of the data
Proposed problem
- Problem Illustration
Emerging Patterns
Frequent Itemsets
D
Emerging Patterns
Frequent Itemsets
Transformation
(Local Recoding)
D’
Intuition of Local Recoding



Support of FIS = 40%
Growth Rate of EP = 3
Frequent Itemset = {Exec, Married}
Emerging Pattern = {MSE ,Exec}
Edu
Occup
Marital
Edu
Occup
Marital
MSE
Exec
Married
BA
Exec
Married
MSE
Exec
Married
BA
Exec
Married
BA
Exec
Married
BA
Exec
Married
BA
Manager
Married
BA
Worker
Married
BA
Repair
Never
MSE
Manager
Never
Income >= 50k
Income < 50k
Intuition of Local Recoding
Edu
Occup
Marital
Edu
Occup
Marital
MSE
Exec
Married
BA
Exec
Married
MSE
Exec
Married
BA
Exec
Married
BA
Exec
Married
BA
Exec
Married
BA
Manager
Married
BA
Worker
Married
BA
Repair
Never
MSE
Manager
Never
Income >= 50k
Income < 50k
Edu
Occup
Marital
Edu
Occup
Marital
MSE
White col
Married
BA
Exec
Married
MSE
White col
Married
BA
Exec
Married
BA
Exec
Married
BA
Exec
Married
BA
Manager
Married
BA
Worker
Married
BA
Repair
Never
MSE
White Col
Never
Income >= 50k
Income < 50k
Heuristic for the Problem
- Greedy Approach
Repeat…
Applying the generalization
D
Emerging Patterns
Mining
Equivalence Classes
Utility Gain
EPs
Class 1
40
EP 1
Class 2
90
EP 2
Class 3
60
EP 3
Class 4
20
EP 4
Class 5
15
Until…
All Emerging Patterns are removed
Heuristic for the Problem
-Greedy Approach

Drawbacks:
 Trapped

into some local minima
Solution:
 Simulated
Annealing Style Approach for choosing
equivalence class
Heuristic for the Problem
- Simulated Annealing Style Approach

Choose Equivalence Class probabilistically

Two parameters:
Initial temperature ( T0 )
 Cooling Rate ( α )
Utility Gain
T=1000
T=100
T=10
90
0.209
0.302
0.945
60
0.203
0.223
0.047
Acceptance Probability:
40
0.199
0.183
0.006
exp Utility Gain / Temperature
20
0.195
0.150
0.0009
15
0.194
0.142
0.0005




Temperature updating:

Tn = α Tn-1
Acceptance probability of different
utility gain and temperature
Heuristic for the Problem
- Simulated Annealing Style Approach
Repeat…
Applying the generalization
and
Decrease the temperature
D
Emerging Patterns
Mining
Equivalence Classes
Probability
EPs
Class 1
0.2
EP 1
Class 2
0.4
EP 2
Class 3
0.1
EP 3
Class 4
0.25
EP 4
Class 5
0.05
Until…
All Emerging Patterns are removed
Two questions


How to choose an EP for generalization?
How to calculate the utility gain?
How to choose an EP for generalization?

Choose the EP which overlaps with the remaining
EPs the most
 More
likely to hide other EPs simultaneously
Emerging Patterns
MSE
Never Married
BA
Divorced
BA
Divorced
Worker
BA
Divorced
Repairman
BA
Divorced
Own-Child
How to calculate utility gain?

Utility gain is a function of:
 Recoding
Distance (RD)
 Reduction of Growth Rate (RG)
How to calculate utility gain ?
- Recoding Distance (RD)


The detail derivation is stated in the paper
Intuitively, it measures…
 How
many and how much FIS have been generalized?
 How many FIS disappeared?

High level definition of RD:
θq x (generalized FIS) + ( 1- θq ) x (disappeared FIS)
,where θq is user defined parameter
The larger the value of RD, the more the
distortion generated on the Frequent Itemset
How to calculate utility gain ?
- Reduction of Growth Rate(RG)

After taken a local recoding, RG is defined as:
 The
reduction of growth rate of all EPs
Emerging Patterns
Growth Rate
Executive , Married
10
BA, Divorced
20
Executive
30
Sum of Growth
Rate
60
Emerging Patterns
Growth Rate
White col, Married
5
BA, Divorced
20
Sum of Growth
Rate
25
Local Recoding
RG = 60 – 25 = 35
How to calculate utility gain?


Putting all these together, utility gain is defined as:
θp x RG – (1- θp ) x RD
,where θp is user defined parameters
It favors:
 Local

recoding which can reduce lots of growth rate
It penalizes:
 Local
recoding which generate large distortion on FIS
Experimental Setup

Dataset: Adult dataset from UCI Repository


Total number of records: 30162



Income > 50k : 7508
Income <= 50k : 22654
Use only 8 categorical attributes for experiment


Popular benchmark dataset used for generalization
A well accepted hierarchy is defined
Parameters:




Support of FIS : 40%
Growth rate of EP : 5
Initial Temperature : 10
Cooling Rate : 0.4
Performance
 Maximum
623.1
RD / No. of FIS disappeared of
the Greedy Approach
RD / No. of FIS disappeared of
Simulated Annealing Style Approach
(Best of 5)
RD:
Runtime (in minutes)
Greedy Approach
Simulated Annealing Style Approach
(Best of 5)
Future Research Plan



Hide EPs in temporal datasets
Consider multi-level FIS
Hiding a group of emerging patterns at a time
Q&A
Any
Questions?