GradSeminar - Department of Computing Science (transcript)

Data Transformation for Privacy-Preserving Data Mining
Stanley R. M. Oliveira
Database Systems Laboratory, Computing Science Department
University of Alberta, Canada
Graduate Seminar, November 26th, 2004
Outline: Introduction · Framework · PP-Assoc. Rules · PP-Clustering · Results · Conclusions
Motivation
• Changes in technology are making privacy protection harder.
• A new challenge for statistical offices.
• Data mining plays an outstanding role in business collaboration.
• The traditional "all or nothing" solution has been too rigid.
• Techniques are needed to enforce privacy concerns when data are shared for mining.
2
Graduate Seminar – Nov. 26, 2004
Data Transformation for Privacy-Preserving Data Mining
Stanley Oliveira
PPDM: Increasing Number of Papers
[Figure: bar chart of the number of PPDM papers published per year, 1991-2004. Single papers appear in the early years (1991, 1995, 1996, 1998); the count then grows steadily (roughly 3, 4, 7, 14, 17) to about 30 papers in 2004. The timeline is annotated with three phases: the Conceptive, Deployment, and Prospective landmarks.]
PPDM: Privacy Violation
• Privacy violation in data mining: the misuse of data.
• Defining privacy preservation in data mining:
  - Individual privacy preservation: protection of personally identifiable information.
  - Collective privacy preservation: protection of users' collective activity.
A Few Examples of Scenarios in PPDM
• Scenario 1: A hospital shares some data for research purposes.
• Scenario 2: Outsourcing the data mining process.
• Scenario 3: A collaboration between an Internet marketing company and an on-line retail company.
A Taxonomy of the Existing Solutions
Data Partitioning
Data Modification
Data Restriction
Data Ownership
Contributions
Fig.1: A Taxonomy of PPDM Techniques
Problem Definition
• To transform a database into a new one that conceals
sensitive information while preserving general patterns
and trends from the original database.
[Figure: Original Database → the transformation process → Transformed Database → data mining → non-sensitive patterns and trends.]
Problem Definition (cont.)
• Problem 1: Privacy-Preserving Association Rule Mining
 I do not address the privacy of individuals, but the problem of protecting sensitive knowledge.
• Assumptions:
 The data owners must know in advance the knowledge (rules) they want to protect.
 Individual data values (e.g., a specific item) are not restricted; the relationships between items are.
Problem Definition (cont.)
• Problem 2: Privacy-Preserving Clustering
 I protect the underlying attribute values of objects subjected to clustering analysis.
• Assumptions:
 Given a data matrix D_{m×n}, the goal is to transform D into D′ so that the following restrictions hold:
  A transformation T: D → D′ must preserve the privacy of individual records.
  The similarity between objects in D and D′ must be the same, or only slightly altered, by the transformation process.
A Framework for PPDM
[Figure: a schematic view of the framework for privacy preservation. A client submits its original data to the server, which applies a collective or an individual transformation and returns the transformed database. The server side comprises the PPDT methods, a library of sanitizing algorithms, sanitization retrieval facilities, and a set of metrics.]
Privacy-Preserving Association Rule Mining
The sanitization process: D (the transactional database) → Step 1: identify the discovered patterns → Step 2: classify the discovered patterns → Step 3: select the sensitive transactions → Step 4: modify some sensitive transactions → D′ (the sanitized database).

A sample transactional database:      The corresponding inverted file:

TID  List of Items                    Sensitive Rule  Sensitive Transaction IDs
T1   A B C D                          A,B → D         T1, T3
T2   A B C                            A,C → D         T1, T4
T3   A B D
T4   A C D
T5   A B C
T6   B D
Privacy-Preserving Association Rule Mining
[Figure: a taxonomy of sanitizing algorithms, organized under Heuristic 1, Heuristic 2, and Heuristic 3.]
Heuristic 1: Degree of Sensitive Transactions
• Definition: Let D be a transactional database and ST the set of all sensitive transactions in D. The degree of a sensitive transaction t ∈ ST is defined as the number of sensitive association rules that can be found in t.
A sample transactional database:      The corresponding inverted file:

TID  List of Items                    Sensitive Rule  Sensitive Transaction IDs
T1   A B C D                          A,B → D         T1, T3
T2   A B C                            A,C → D         T1, T4
T3   A B D
T4   A C D
T5   A B C
T6   B D

Degree(T1) = 2; Degree(T3) = 1; Degree(T4) = 1
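The degree computation falls straight out of the inverted file. A minimal Python sketch using the slide's sample database and sensitive rules (variable names are mine; a rule A,B → D is treated as the itemset {A, B, D}):

```python
# Degree of a sensitive transaction = number of sensitive rules it supports.
database = {
    "T1": {"A", "B", "C", "D"},
    "T2": {"A", "B", "C"},
    "T3": {"A", "B", "D"},
    "T4": {"A", "C", "D"},
    "T5": {"A", "B", "C"},
    "T6": {"B", "D"},
}
# A rule A,B -> D is sensitive in any transaction containing {A, B, D}.
sensitive_rules = [{"A", "B", "D"}, {"A", "C", "D"}]

# Inverted file: rule index -> IDs of the transactions supporting the rule.
inverted_file = {
    i: [tid for tid, items in database.items() if rule <= items]
    for i, rule in enumerate(sensitive_rules)
}

# Degree: how many sensitive rules each sensitive transaction supports.
degree = {}
for tids in inverted_file.values():
    for tid in tids:
        degree[tid] = degree.get(tid, 0) + 1

print(inverted_file)  # {0: ['T1', 'T3'], 1: ['T1', 'T4']}
print(degree)         # {'T1': 2, 'T3': 1, 'T4': 1}
```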
Data Sharing-Based Algorithms
1. Scan the database and identify the sensitive transactions for each restrictive pattern;
2. Based on the disclosure threshold ψ, compute the number of sensitive transactions to be sanitized;
3. For each restrictive pattern, identify a candidate item to be eliminated (the victim item);
4. Based on the number computed in step 2, remove the victim items from the sensitive transactions.
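The four steps above can be sketched as follows. This is a simplified illustration, not the authors' implementation: victims are picked by taking turns through each rule's items (RRA-style), and psi is the fraction of sensitive transactions allowed to keep supporting each rule:

```python
import math

def sanitize(database, sensitive_rules, psi):
    """Four-step data sharing-based sanitization sketch.
    psi is the disclosure threshold: the fraction of sensitive
    transactions that may remain untouched for each rule."""
    db = {tid: set(items) for tid, items in database.items()}
    victim_turn = 0  # round-robin victim selection, as in RRA
    for rule in sensitive_rules:
        # Step 1: sensitive transactions for this rule.
        sensitive = [tid for tid, items in db.items() if rule <= items]
        # Step 2: number of transactions that must be altered.
        n_sanitize = math.ceil(len(sensitive) * (1 - psi))
        for tid in sensitive[:n_sanitize]:
            # Step 3: pick a victim item, taking turns through the rule.
            items = sorted(rule)
            victim = items[victim_turn % len(items)]
            victim_turn += 1
            # Step 4: remove the victim item from the transaction.
            db[tid].discard(victim)
    return db

# psi = 0% -> full sanitization of the slide's sample database.
db = sanitize(
    {"T1": "ABCD", "T2": "ABC", "T3": "ABD", "T4": "ACD", "T5": "ABC", "T6": "BD"},
    [{"A", "B", "D"}, {"A", "C", "D"}],
    psi=0.0,
)
```

After full sanitization no transaction supports either sensitive rule, while non-sensitive transactions (e.g., T2) are untouched.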
Data Sharing-Based Algorithms
Sensitive Rules (SR): Rule 1: A,B → D; Rule 2: A,C → D

Transactional Database   Partial Sanitization   Full Sanitization
TID  Items               TID  Items             TID  Items
T1   A B C D             T1   B C D             T1   B C D
T2   A B C               T2   A B C             T2   A B C
T3   A B D               T3   A B D             T3   A D
T4   A C D               T4   A C D             T4   A D
T5   A B C               T5   A B C             T5   A B C
T6   B D                 T6   B D               T6   B D

The Round Robin Algorithm (RRA):
• Step 1: Sensitive transactions: A,B → D = {T1, T3}; A,C → D = {T1, T4}
• Step 2: Select the number of sensitive transactions: (a) ψ = 50%; (b) ψ = 0%
• Step 3: Identify the victim items (taking turns): ψ = 50%: Victim(T1) = A; Victim(T1) = A (partial sanitization); ψ = 0%: Victim(T3) = B; Victim(T4) = C (full sanitization)
• Step 4: Sanitize the marked sensitive transactions.
Data Sharing-Based Algorithms (cont.)
Sensitive Rules (SR): Rule 1: A,B → D; Rule 2: A,C → D

Transactional Database   Sanitized Database
TID  Items               TID  Items
T1   A B C D             T1   A B C
T2   A B C               T2   A B C
T3   A B D               T3   A B
T4   A C D               T4   A C
T5   A B C               T5   A B C
T6   B D                 T6   B D

The Item Grouping Algorithm (IGA):
• Step 1: Sensitive transactions: A,B → D = {T1, T3}; A,C → D = {T1, T4}
• Step 2: Select the number of sensitive transactions: ψ = 0%
• Step 3: Identify the victim items (grouping sensitive rules that share items; here D is shared by both rules): Victim(T1) = D; Victim(T3) = D; Victim(T4) = D (full sanitization)
• Step 4: Sanitize the marked sensitive transactions.
Heuristic 2: Size of Sensitive Transactions
• For every group of K transactions:
 Step 1: Distinguish the sensitive transactions from the non-sensitive ones;
 Step 2: Select the victim item for each sensitive rule;
 Step 3: Compute the number of sensitive transactions to be sanitized;
 Step 4: Sort the sensitive transactions by size;
 Step 5: Sanitize the sensitive transactions.
Novelties of this Approach
• The notion of a disclosure threshold for every single pattern → Mining Permissions (MP).
• Each mining permission mp = ⟨sr_i, ψ_i⟩, where
 - ∀i, sr_i ∈ SR (the set of sensitive rules); and
 - ψ_i ∈ [0, 1].
• Mining permissions allow a DBA to put different weights on the different rules to hide.
• All the thresholds ψ_i can also be set to the same value, if needed.
Data Sharing-Based Algorithms (cont.)
Sensitive Rules (SR): Rule 1: A,B → D; Rule 2: A,C → D

Transactional Database   Sanitized Database
TID  Items               TID  Items
T1   A B C D             T1   A B C
T2   A B C               T2   A B C
T3   A B D               T3   A D
T4   A C D               T4   C D
T5   A B C               T5   A B C
T6   B D                 T6   B D

The Sliding Window Algorithm (SWA):
• Step 1: Sensitive transactions: A,B → D = {T1, T3}; A,C → D = {T1, T4}
• Step 2: Identify the victim items (based on the frequencies of the items in SR): Victim(T3) = B; Victim(T4) = A; Victim(T1) = D
• Step 3: Select the number of sensitive transactions: ψ = 0%
• Step 4: Sort the sensitive transactions by size: A,B → D = {T3, T1}; A,C → D = {T4, T1}
• Step 5: Sanitize the marked sensitive transactions.
Data Sharing-Based Metrics
Given the original database D and the sanitized database D′:

1. Hiding Failure: HF = #S_R(D′) / #S_R(D)
2. Misses Cost: MC = (#~S_R(D) − #~S_R(D′)) / #~S_R(D)
3. Artifactual Patterns: AP = (|R′| − |R ∩ R′|) / |R′|
4. Difference between D and D′: Dif(D, D′) = (1 / Σ_{i=1}^n f_D(i)) × Σ_{i=1}^n |f_D(i) − f_D′(i)|, where f_X(i) is the frequency of item i in database X.
Pattern Sharing-Based Algorithm
Data Sharing-Based Approach: D → Sanitization → D′ → AR generation → Association Rules(D′) → Share

Pattern Sharing-Based Approach: D → AR generation → Association Rules(D) → Sanitization → Association Rules(D′) (the discovered patterns) → Share
Possible Inference Channels
• Inferences: based on non-restrictive rules, someone tries to deduce one or more restrictive rules that are not supposed to be discovered.

TID  List of Items
T1   A B C D
T2   A B C
T3   A C D
T4   A B C
T5   A B

A sample transactional database. [Figure: a) an example of forward-inference; b) an example of backward-inference.]
Pattern Sharing-Based Metrics
R: all rules mined from D; R′: the rules to share. SR: the sensitive rules (to hide); ~SR: the non-sensitive rules.
• Problem 1 (side effect, R_SE): non-sensitive rules hidden accidentally.
• Problem 2 (inference, recovery factor): hidden rules recovered from the shared ones.

1. Side Effect Factor: SEF = (|R| − (|R′| + |S_R|)) / (|R| − |S_R|)
2. Recovery Factor: RF ∈ {0, 1}
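A minimal sketch of the side effect factor, assuming the three set sizes are known (the function name and the example numbers are mine):

```python
def side_effect_factor(n_rules, n_shared, n_sensitive):
    """SEF = (|R| - (|R'| + |S_R|)) / (|R| - |S_R|): the fraction of
    non-sensitive rules accidentally lost by the sanitization."""
    return (n_rules - (n_shared + n_sensitive)) / (n_rules - n_sensitive)

# Hypothetical example: 100 mined rules, 5 sensitive, 90 shared
# -> 5 of the 95 non-sensitive rules were lost as a side effect.
sef = side_effect_factor(100, 90, 5)
```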
Heuristic 3: Rule Sanitization
• Step 1: Identify the sensitive itemsets.
• Step 2: Select the subsets to sanitize.
• Step 3: Sanitize the set of supersets of the pairs marked in level 1.

The Downright Sanitizing Algorithm (DSA)
[Figure: frequent-pattern lattice over the items A, B, C, D. Level 0: A, B, C, D; level 1: AB, AC*, AD, BC, CD; level 2: ABC*, ACD. The sensitive rule A → BC marks the itemset ABC (Step 1), its level-1 subset AC is selected (Step 2), and all supersets of the marked pairs are sanitized (Step 3).]
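The three DSA steps can be sketched over a set of frequent itemsets. This is an illustrative simplification, not the authors' implementation: a real implementation chooses the level-1 pair so as to minimize side effects, while here the lexicographically first pair is taken:

```python
def dsa(frequent_itemsets, sensitive_itemsets):
    """Downright Sanitizing Algorithm sketch.
    Step 1: the sensitive itemsets are given;
    Step 2: mark one level-1 subset (a pair) of each;
    Step 3: drop every superset of a marked pair."""
    marked = set()
    for s in sensitive_itemsets:
        # Step 2: pick a pair contained in the sensitive itemset
        # (simplified: first two items in sorted order).
        marked.add(frozenset(sorted(s)[:2]))
    # Step 3: remove the marked pairs and all their supersets.
    return [fs for fs in frequent_itemsets
            if not any(p <= fs for p in marked)]

# The lattice from the slide, with ABC sensitive.
lattice = [frozenset(s) for s in
           ["A", "B", "C", "D", "AB", "AC", "AD", "BC", "CD", "ABC", "ACD"]]
shared = dsa(lattice, [frozenset("ABC")])
```

Sharing only the surviving itemsets means the hidden rule can no longer be mined, and no superset of the removed pair leaks it back.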
Privacy-Preserving Clustering (PPC)
• PPC over Centralized Data:
 The attribute values subjected to clustering are available in a
central repository.
• PPC over Vertically Partitioned Data:
 There are k parties sharing data for clustering, where k ≥ 2;
 The attribute values of the objects are split across the k parties;
 Object IDs are revealed for join purposes only; the values of the associated attributes are private.
Object Similarity-Based Representation (OSBR)
Example 1: Sharing data for research purposes (OSBR).
Original Data: a sample of the cardiac arrhythmia database (UCI Machine Learning Repository)

ID   age  weight  heart rate  Int_def  QRS  PR_int
123  75   80      63          32       91   193
342  56   64      53          24       81   174
254  40   52      70          24       77   129
446  28   58      76          40       83   251
286  44   90      68          44       109  128

Transformed Data: the corresponding dissimilarity matrix

DM =
| 0                                |
| 2.243  0                         |
| 3.348  2.477  0                  |
| 3.690  3.884  3.176  0           |
| 3.020  4.082  4.130  3.995  0    |
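Computing such a dissimilarity matrix is straightforward. A Python sketch (the slide's matrix is presumably computed on normalized attributes, a step omitted here; the sample points are mine):

```python
import math

def dissimilarity_matrix(data):
    """Lower-triangular matrix of Euclidean distances between objects.
    Only the distances are shared; the attribute values are not."""
    m = len(data)
    dm = [[0.0] * (i + 1) for i in range(m)]
    for i in range(m):
        for j in range(i):
            dm[i][j] = math.dist(data[i], data[j])
    return dm

# Hypothetical 2-D points chosen to give easy-to-check distances.
dm = dissimilarity_matrix([(0.0, 0.0), (3.0, 4.0), (0.0, 8.0)])
print(dm)  # [[0.0], [5.0, 0.0], [8.0, 5.0, 0.0]]
```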
Object Similarity-Based Representation (OSBR)
• The Security of the OSBR:
Lemma 1: Let DM_{m×m} be a dissimilarity matrix, where m is the number of objects. It is impossible to determine the coordinates of the two objects by knowing only the distance between them.
• The Complexity of the OSBR:
 The communication cost is of the order O(m²), where m is the number of objects under analysis.
Object Similarity-Based Representation (OSBR)
• Limitations of the OSBR:
 Lemma 2: Knowing the coordinates of a particular object i and the distance r between i and any other object j, it is possible to estimate the attribute values of j.
 The OSBR is therefore vulnerable to attacks in the case of vertically partitioned data (Lemma 2).
 Conclusion → the OSBR is effective for PPC over centralized data only, but expensive.
Dimensionality Reduction Transformation (DRBT)
• General Assumptions:
 The attribute values subjected to clustering are numerical only.
 In PPC over centralized data, object IDs should be replaced by fictitious identifiers.
 In PPC over vertically partitioned data, object IDs are used for join purposes between the parties involved in the solution.
 The transformation (random projection) applied to the data might slightly modify the distances between data points.
Dimensionality Reduction Transformation (DRBT)
• Random projection from d to k dimensions:
 D′_{n×k} = D_{n×d} · R_{d×k} (a linear transformation), where D is the original data, D′ is the reduced data, and R is a random matrix.
• R is generated by first setting each entry r_ij as follows:
 (R1): r_ij is drawn i.i.d. from N(0, 1), and the columns are then normalized to unit length;
 (R2): r_ij = √3 × (+1 with probability 1/6; 0 with probability 2/3; −1 with probability 1/6).
Dimensionality Reduction Transformation (DRBT)
• PPC over Centralized Data (General Approach):
 Step 1: Suppressing identifiers.
 Step 2: Normalizing the attribute values subjected to clustering.
 Step 3: Reducing the dimension of the original dataset by using random projection.
 Step 4: Computing the error that the distances in the reduced space suffer from:
  Error² = Σ_{i,j} (d̂_ij − d_ij)² / Σ_{i,j} d_ij²
• PPC over Vertically Partitioned Data:
 It is a generalization of the solution for PPC over centralized data.
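Steps 3 and 4 can be sketched in plain Python (helper names are mine; the R2 matrix follows the simpler distribution from the previous slide):

```python
import math
import random

def random_matrix_r2(d, k, rng):
    """R2: each entry is sqrt(3) * (+1 w.p. 1/6, 0 w.p. 2/3, -1 w.p. 1/6)."""
    choices = [math.sqrt(3), 0.0, 0.0, 0.0, 0.0, -math.sqrt(3)]
    return [[rng.choice(choices) for _ in range(k)] for _ in range(d)]

def project(D, R):
    """D'_{n x k} = D_{n x d} . R_{d x k}."""
    return [[sum(row[i] * R[i][j] for i in range(len(R)))
             for j in range(len(R[0]))] for row in D]

def distance_error(D, Dp):
    """Error = sqrt( sum_{i<j} (d̂_ij - d_ij)^2 / sum_{i<j} d_ij^2 )."""
    num = den = 0.0
    for i in range(len(D)):
        for j in range(i):
            d = math.dist(D[i], D[j])       # original distance
            dh = math.dist(Dp[i], Dp[j])    # distance after projection
            num += (dh - d) ** 2
            den += d ** 2
    return math.sqrt(num / den)

# Hypothetical data: project 4 attributes down to 2.
D = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0], [0.0, 1.0, 0.0, 1.0]]
R = random_matrix_r2(4, 2, random.Random(42))
Dp = project(D, R)
err = distance_error(D, Dp)
```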
Dimensionality Reduction Transformation (DRBT)
Original Data: a sample of the cardiac arrhythmia database (UCI Machine Learning Repository)

ID   age  weight  heart rate  Int_def  QRS  PR_int
123  75   80      63          32       91   193
342  56   64      53          24       81   174
254  40   52      70          24       77   129
446  28   58      76          40       83   251
286  44   90      68          44       109  128

Transformed Data (RP1: random matrix based on the Normal distribution; RP2: random matrix based on the much simpler distribution):

         RP1                        RP2
ID   Att1    Att2    Att3      Att1    Att2     Att3
123  -50.40  17.33   12.31     -55.50  -95.26   -107.93
342  -37.08  6.27    12.22     -51.00  -84.29   -83.13
254  -55.86  20.69   -0.66     -65.50  -70.43   -66.97
446  -37.61  -31.66  -17.58    -85.50  -140.87  -72.74
286  -62.72  37.64   18.16     -88.50  -50.22   -102.76
Dimensionality Reduction Transformation (DRBT)
• The Security of the DRBT:
 Lemma 3: A random projection from d to k dimensions, where k < d, is a non-invertible linear transformation.
• The Complexity of the DRBT:
 The space requirement is of order O(m), where m is the number of objects.
 The communication cost is of order O(mlk), where l represents the size (in bits) required to transmit a dataset from one party to a central or third party.
Dimensionality Reduction Transformation (DRBT)
• The Accuracy of the DRBT:
Contingency table between the original clusters C_1 … C_k and the clusters C′_1 … C′_k found in the transformed data, where freq_{i,j} is the number of objects of C_i assigned to C′_j:

      C′_1      C′_2      …  C′_k
C_1   freq_1,1  freq_1,2  …  freq_1,k
C_2   freq_2,1  freq_2,2  …  freq_2,k
…     …         …         …  …
C_k   freq_k,1  freq_k,2  …  freq_k,k

Precision: P = freq_{i,j} / |c′_j|
Recall: R = freq_{i,j} / |c_i|
F-measure: F = 2PR / (P + R)
Overall F-measure: F = ( Σ_{i=1}^k |c_i| × F(c_i) ) / ( Σ_{i=1}^k |c_i| )
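The overall F-measure can be computed directly from the frequency table. A sketch, assuming (as is standard practice) that F(c_i) is taken as the best F-score of c_i against any cluster found in the transformed data:

```python
def overall_f_measure(freq):
    """freq[i][j]: number of objects of original cluster i that were
    assigned to cluster j in the transformed data. Clusters are
    weighted by their size |c_i|."""
    col_tot = [sum(row[j] for row in freq) for j in range(len(freq[0]))]
    score = total = 0.0
    for row in freq:
        ci = sum(row)                     # |c_i|
        best = 0.0
        for j, f in enumerate(row):
            if f == 0:
                continue
            p = f / col_tot[j]            # precision
            r = f / ci                    # recall
            best = max(best, 2 * p * r / (p + r))
        score += ci * best                # |c_i| * F(c_i)
        total += ci
    return score / total

print(overall_f_measure([[5, 0], [0, 5]]))  # 1.0 (clusters match perfectly)
```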
Results and Evaluation
Datasets used in our performance evaluation:

Dataset      # records  # items  Avg. length  Shortest record  Longest record
Association rules:
 BMS         59,602     497      2.51         1                145
 Retail      88,162     16,470   10.30        1                76
 Kosarak     990,573    41,270   8.10         1                1065
 Reuters     7,774      26,639   46.81        1                427
Clustering:
 Accidents   340,183    468      33.81        18               51
 Mushroom    8,124      119      23.00        23               23
 Chess       3,196      75       37.00        37               37
 Connect     67,557     129      43.00        43               43
 Pumsb       49,046     2,113    74.00        74               74
Data Sharing-Based Algorithms
• Item Grouping Algorithm (IGA) [Oliveira & Zaïane, PSDM 2002].
• Sliding Window Algorithm (SWA) [Oliveira & Zaïane, ICDM 2003].
• Round Robin Algorithm (RRA) [Oliveira & Zaïane, IDEAS 2003].
• Random Algorithm (RA) [Oliveira & Zaïane, IDEAS 2003].
• Algo2a [E. Dasseni et al., IHW 2001].
Methodology
• The sensitive rules were selected based on four scenarios:
 S1: Rules with mutually exclusive items.
 S2: Rules selected randomly.
 S3: Rules with very high support.
 S4: Rules with low support.
• The effectiveness of the algorithms was measured based on:
 C1: ψ = 0%, with the minimum support and minimum confidence fixed.
 C2: the same as C1, but varying the number of sensitive rules.
 C3: ψ = 0%, with the minimum confidence and the number of sensitive rules fixed, varying the minimum support.
Measuring Effectiveness
[Tables: the best-performing algorithm per dataset (Kosarak, Retail, Reuters, BMS-1) and per scenario (S1-S4): misses cost under condition C1 (ψ = 0%, 6 sensitive rules), under condition C2 (ψ = 0%, varying the number of rules), and under condition C3 (ψ = 0%, varying the minimum support); and Dif(D, D′) under conditions C1 and C3. IGA yields the best misses cost in most dataset/scenario combinations (SWA, RA, RRA, and Algo2a win a few cells), while SWA consistently yields the smallest Dif(D, D′).]
Special Cases of Data Sanitization
SWA with individual thresholds per sensitive rule:
SWA: { [rule1, 30%], [rule2, 25%], [rule3, 15%], [rule4, 45%], [rule5, 15%], [rule6, 20%] }

An example of different thresholds for the sensitive rules in scenario S3 (window size K = 100,000):

Metric      Kosarak  Retail  Reuters  BMS-1
MC          37.22    31.07   46.48    8.68
HF          5.57     7.45    0.01     21.84
Dif(D, D′)  1.68     1.24    0.63     0.70

Effect of ψ on misses cost (MC) and hiding failure (HF) in the dataset Retail:

           ψ = 0%        ψ = 5%        ψ = 10%       ψ = 15%       ψ = 25%
Algorithm  MC     HF     MC     HF     MC     HF     MC     HF     MC     HF
IGA        66.31  0.00   64.77  0.66   63.23  0.83   60.94  1.32   56.26  1.99
RRA        64.02  0.00   61.18  7.28   58.15  6.46   55.12  7.62   46.46  15.73
RA         63.86  0.00   60.12  7.12   56.72  7.62   54.29  7.95   46.48  16.39
SWA        65.29  0.00   55.58  1.16   48.31  1.82   42.67  3.31   27.74  15.89
CPU Time
[Figure: results of CPU time for the sanitization process.]
Pattern Sharing-based Algorithm
• Downright Sanitizing Algorithm (DSA) [Oliveira & Zaïane, PAKDD 2004].
• We used the data-sharing algorithm IGA for our comparison study.
• Methodology:
 IGA: 1) We used IGA to sanitize the datasets. 2) We used Apriori to extract the rules to share (all the datasets).
 DSA: 1) We used Apriori to extract the rules from the datasets. 2) We used DSA to sanitize the rules mined in the previous step.
Measuring Effectiveness
The best algorithm in terms of misses cost (ψ = 0%, 6 sensitive rules):

Dataset   S1         S2   S3         S4
Kosarak   IGA        DSA  DSA        DSA
Retail    DSA        DSA  DSA        IGA
Reuters   DSA        DSA  DSA        IGA
BMS-1     DSA        DSA  DSA        DSA

The best algorithm in terms of misses cost, varying the number of rules to sanitize (ψ = 0%):

Dataset   S1         S2   S3         S4
Kosarak   IGA        DSA  DSA        IGA / DSA
Retail    DSA        DSA  IGA / DSA  IGA
Reuters   IGA / DSA  DSA  DSA        IGA
BMS-1     DSA        DSA  DSA        DSA

The best algorithm in terms of side effect factor (ψ = 0%, 6 sensitive rules):

Dataset   S1         S2   S3         S4
Kosarak   IGA        DSA  DSA        DSA
Retail    DSA        DSA  DSA        IGA
Reuters   DSA        DSA  DSA        IGA
BMS-1     DSA        DSA  DSA        DSA
Lessons Learned
• Large datasets are our friends.
• The benefit of an index: at most two scans to sanitize a dataset.
• The data sanitization paradox.
• The outstanding performance of IGA and DSA.
• Rule sanitization reduces inference channels and does not change the support and confidence of the shared rules.
• DSA reduces the flexibility of information sharing.
Evaluation: DRBT
• Methodology:
 Step 1: Attribute normalization.
 Step 2: Dimensionality reduction (two approaches).
 Step 3: Computation of the error produced on reduced
datasets.
 Step 4: Run K-means to find the clusters in the original and
reduced datasets.
 Step 5: Computation of F-measure (experiments repeated 10
times).
 Step 6: Comparison of the clusters generated from the
original and the reduced datasets.
DRBT: PPC over Centralized Data

The error produced on the dataset Chess (do = 37):

Transformation  dr=37  dr=34  dr=31  dr=28  dr=25  dr=22  dr=16
RP1             0.00   0.015  0.024  0.033  0.045  0.072  0.141
RP2             0.00   0.014  0.019  0.032  0.041  0.067  0.131

Average of F-measure (10 trials) for the dataset Accidents (do = 18, dr = 12):

Data            K=2           K=3           K=4           K=5
Transformation  Avg    Std    Avg    Std    Avg    Std    Avg    Std
RP2             0.941  0.014  0.912  0.009  0.881  0.010  0.885  0.006

Average of F-measure (10 trials) for the dataset Iris (do = 5, dr = 3):

Data            K=2           K=3           K=4           K=5
Transformation  Avg    Std    Avg    Std    Avg    Std    Avg    Std
RP2             1.000  0.000  0.948  0.010  0.858  0.089  0.833  0.072
DRBT: PPC over Vertically Partitioned Data

The error produced on the dataset Pumsb (do = 74):

No. Parties  RP1     RP2
1            0.0762  0.0558
2            0.0798  0.0591
3            0.0870  0.0720
4            0.0923  0.0733

Average of F-measure (10 trials) for the dataset Pumsb (do = 74, dr = 38):

             K=2           K=3           K=4           K=5
No. Parties  Avg    Std    Avg    Std    Avg    Std    Avg    Std
1            0.909  0.140  0.965  0.081  0.891  0.028  0.838  0.041
2            0.904  0.117  0.931  0.101  0.894  0.059  0.840  0.047
3            0.874  0.168  0.887  0.095  0.873  0.081  0.801  0.073
4            0.802  0.155  0.812  0.117  0.866  0.088  0.831  0.078
Contributions of this Research
• Foundations for further research in PPDM.
• A taxonomy of PPDM techniques.
• A family of privacy-preserving methods.
• A library of sanitizing algorithms.
• Retrieval facilities.
• A set of metrics.
Future Research
• Privacy definition in data mining.
• Combining sanitization and randomization.
• New method for PPC (k-anonymity + isometries
+ data distortion)
• Sanitization of documents repositories.
Thank You!