Transcript PPT
Reconstruction-Based Association
Rule Hiding
Author: Yuhong Guo
(MS-Ph.D. Candidate, Peking Univ., China)
[email protected]
Advisor: Prof. Shiwei Tang
Co-Advisors: Prof. Dongqing Yang, Jian Pei
Sunday, June 10, 2007
Association Rule Hiding: what?
why?? and how???
Problem: hide sensitive association rules
in data without losing non-sensitives
Motivations: large repositories of data
contain confidential rules disclosed with
serious adverse effects
Traditional:
Solutions
fine-tuning, control the
Data modification
• distortion
• blocking
New promising:
Data reconstruction
2007-7-10
hiding effects indirectly
knowledge sanitization,
control effects directly
SIGMOD Ph.D. Workshop IDAR’07
2
Outline
Background
Motivation
Problem statement
Related work
Proposed Solution
Current Progress
Evaluation Plan
2007-7-10
SIGMOD Ph.D. Workshop IDAR’07
3
Background
Motivation
Data mining
Data sharing
Privacy Preserving
Data mining (PPDM)
Privacy
preserving
Two problems addressed in PPDM
the protection of private data
the protection of sensitive rules
(knowledge) contained in the data
2007-7-10
SIGMOD Ph.D. Workshop IDAR’07
4
Background
Problem statement
Given
a database D to be released
minimum threshold “MST”, “MCT”
a set of association rules R mined from D
a set of sensitive rules Rh R to be hided
Find a new database D’ such that
the rules in Rh cannot be mined from D’
the rules in R-Rh can still be mined as many
as possible
KHD (Knowledge Hiding in Database) problem
2007-7-10
SIGMOD Ph.D. Workshop IDAR’07
5
Background
Related work
Data modification approaches
Basic idea: data sanitization D->D’
Current status:distortion,blocking, prosperous
Drawbacks
• Cannot control hiding effects intuitively, lots of I/O
Data reconstruction approaches
Basic idea:knowledge sanitization D->K->D’
Current status:limited, 3 papers
Advantages
• Can easily control the availability of rules and control
the hiding effects directly, intuitively, handily
2007-7-10
SIGMOD Ph.D. Workshop IDAR’07
6
Background
Classification of current algorithms
Data
modification
DataDistortion
lots of reconstruction-based
work is expected
DataBlocking
Hide
rules
Hide large
itemsets
Algo1a
Algo1b
Algo2a
WSDA
PDA
Algo2b Algo2c
Naïve MinFIA
MaxFIA IGA RRA
RA SWA
Border-Based
Integer-Programing
Sanitization-Matrix
GIH
CR
CR2
Data reconstruction
2007-7-10
SIGMOD Ph.D. Workshop IDAR’07
CIILM
7
Outline
Background
Proposed Solution
Framework
Example
Discussion
Current Progress
Evaluation Plan
2007-7-10
SIGMOD Ph.D. Workshop IDAR’07
8
Proposed Solution
Framework of our approach
1.Frequent Set Mining
DD
FS
R
2.Perform sanitization
Algorithm
3.FP-tree - based Inverse Frequent Set Mining
FS ’
D’
R-Rh
FP-tree
2007-7-10
SIGMOD Ph.D. Workshop IDAR’07
9
Proposed Solution
The first two phases
1. Frequent set mining
Generate all frequent itemsets with their supports
and support counts FS from original database D
2. Perform sanitization algorithm
Input: FS output in phase 1, R, Rh
Output: sanitized frequent itemsets FS’
Process
• Select hiding strategy
• Identify sensitive frequent sets
• Perform sanitization
In best cases, sanitization algorithm can
ensure from FS’ ,we can exactly get the
non-sensitive rules set R-Rh
2007-7-10
SIGMOD Ph.D. Workshop IDAR’07
FS
R
FS’
R-Rh
10
Proposed Method
The third phase: FP-tree-based inverse mining
Basic idea: use FP-tree as a transition “bridge”, which
reduces the gap between a database and its frequent
itemsets and makes transformation more easily
Frequent
Itemsets
FS
Temporary
Database
FP-Tree
(i)
(ii)
A set of Compatible
databases
(iii)
TempD
D1
D2
...
(i) Generate a compatible FP-tree
(ii) Generate a TempD that only includes frequent items
(iii) Scatter infrequent items into TempD
2007-7-10
SIGMOD Ph.D. Workshop IDAR’07
11
Proposed Solution
Example: the first two phases
Oiginal Database: D
TID
T1
T2
T3
T4
T5
T6
Items
ABCE
ABC
ABCD
ABD
AD
ACD
2007-7-10
Frequent Itemsets: FS
A:6 100%
1. Frequent
B:4
66%
set mining
C:4
66%
σ= 4
D:4
66%
MST=66%
MCT=75%
AB:4 66%
AC:4 66%
AD:4 66%
Association Rules: R
confidence support
B A 100%
66%
C A 100%
66%
D A 100%
66%
rules
2. Perform
sanitization
algorithm
A:6 100%
C:4
66%
rules confidence support
D:4
66%
C A 100%
66%
AC:4 66%
D A 100%
66%
AD:4 66%
Association Rules: R-R h
Frequent Itemsets: FS'
SIGMOD Ph.D. Workshop IDAR’07
12
Proposed Solution
Example: the third phase
T ID
It e m s
T1
ACD
T2
ACD
T3
AC
T4
AC
T5
AD
T6
AD
FP
A :6
C :4
D :4
A :6
C :4
σ=4
R e le a s e d D a t a b a s e : D '
TID
Items
TID
Items
TID
Items
TID
Items
T1
ACDE
T1
ACD
T1
ACDE
T1
ACDE
T2
ACD
T2
T2
ACDE
T2
T3
AC
T3
ACD
ACE
T3
AC
T3
ACDE
ACE
T4
AC
T4
AC
T4
AC
T4
AC
T5
AD
T5
AD
T5
AD
T5
AD
T6
AD
T6
AD
T6
AD
T6
AD
D1'
2007-7-10
D2 '
...
Difficulties:
A C :4
66%
A D :4
66%
F r e q u e n t I te m s e ts : F S '
D :2
D :2
100%
66%
66%
Dp '
...
1. How to find
the target
FP-tree
2. How to
control |D’|
...
Dq'
SIGMOD Ph.D. Workshop IDAR’07
13
Proposed Solution
Discussion
Sanitization algorithm
Compared with early popular data
sanitization : performs sanitization directly
on knowledge level of data
Inverse frequent set mining algorithm
Deals with frequent items and infrequent
items separately: more efficiently, a large
number of outputs
Our solution provides user with a knowledge level
window to perform sanitization handily and
generates a number of secure databases
2007-7-10
SIGMOD Ph.D. Workshop IDAR’07
14
Outline
Background
Proposed Solution
Current Progress
Work to date
Future work
Expected contributions
Evaluation Plan
2007-7-10
SIGMOD Ph.D. Workshop IDAR’07
15
Current Progress
Work to date
FP-tree-based method for inverse frequent set
mining (used in the 3rd phase of our framework)
First effort
• Published in Proc. of BNCOD'06
• Provides a good heuristic search strategy to rapidly find a
FP-tree satisfying the given constraints, leading to rapidly
finding a set of compatible databases
Further work
• Accepted by Journal of Software (JOS)
• A more mature and well-designed FP-tree-based method
for inverse frequent set mining by iteratively solving a sub
linear constraint problem
2007-7-10
SIGMOD Ph.D. Workshop IDAR’07
16
Current Progress
Future work
Develop a sound sanitization algorithm with the following
considerations
The support and confidence of the rules in R- Rh should
remain unchanged as much as possible
Can select appropriate hiding strategies according to different
kinds of correlations among the rules in R and Rh
Can prevent rule-based reasoning
Investigate how to restrict the number of transactions in
the new released database
Develop an integrated secure association rule mining tool
Can protect privacy data
Can protect sensitive rules contained in the data
DHD
Integrated secure tool
KHD
2007-7-10
SIGMOD Ph.D. Workshop IDAR’07
17
Current Progress
Expected contributions
CHART: Credible Hiding Association Rule Tool
ARH Evaluation Metrics
Rule sanitization
Algorithm
Inverse Frequent Set
Mining Algorithm
Reconstruction-based ARH Framework
2007-7-10
SIGMOD Ph.D. Workshop IDAR’07
18
Outline
Background
Proposed Solution
Current Progress
Evaluation Plan
2007-7-10
SIGMOD Ph.D. Workshop IDAR’07
19
Evaluation Plan
R
Dataset
BMS-POS
BMS-WebView-1
BMS-WebView-2
…
Rh
~ Rh
Evaluation
②Lost Rules
③Ghost Rules
Hiding effects
①Hiding Failure
① Hiding Failure Ratio
R’
Rh(D’)/Rh(D)
(~Rh(D) − ~Rh(D’))/ ~Rh(D)
② Lost Rules Ratio
③ Ghost Rules Ratio (∣R’∣−∣R∩R’∣)/∣R’∣
Data utility
Time performance
2007-7-10
SIGMOD Ph.D. Workshop IDAR’07
20
Reconstruction-based
Association Rule Hiding
Summary
Ongoing!
1.Frequent Set Mining
3.FP-tree
-
FS
DD
R
2.Perform sanitization
Algorithm
3. FP-tree-basedInverse
Inverse Frequent
Mining
3.FP-tree-based
FrequentSet
Set
Mining
FS ’
D’
FP-tree
2007-7-10
SIGMOD Ph.D. Workshop IDAR’07
R-Rh
Basically
completed!
21
Thanks for your attention
Any suggestion or question?
[email protected]