A Novel Method for Protecting Sensitive Knowledge in Association

Download Report

Transcript A Novel Method for Protecting Sensitive Knowledge in Association

A Novel Method for Protecting
Sensitive Knowledge in
Association Rule Mining
Introduction
• Why privacy preserving data mining?
• Data privacy V.S. Information privacy
– Data privacy
• Randomization approach
• Secure multi-party computation approach
– Information privacy
• Sensitive Knowledge
2
Background Knowledge
• Support of {A, C}: the probability that a
transaction contains {A, C}
– eg: support of {A, C} = 3/4
• {A, C} is frequent: the support of {A, C} is no
smaller than minimum support defined by user
– if mimSup = 0.4 then {A, C} is a frequent pattern
Transaction ID Items Bought
1000
A,B,C
2000
A,C
3000
A,B,C,D
4000
B,E,F
3
Related Work
• Oliveira_Zaiane proposed 6 algorithms
– Step1:Find sensitive transactions
– Step2: Choose victim items
– Step3: Compute how many sensitive
transactions should be changed
– Step4: Select victim transactions
• SWA of Oliveira_Zaiane
– Almost the same with the others but each step
applied to K transactions, K is the window size
– With the best performance
4
Related Work (cont.)
• Published pattern sets
• Forward-Inference Attack
5
Preliminary
• Represent TDB as a binary matrix
• P : frequent patterns
– PH : frequent patterns with security policies
– ~PH : frequent patterns without security policies
– PH ∪~PH = P
• Pair-Subset :
– eg: {1, 2, 3} is frequent; {{1, 2}, {1, 3}, {2, 3}} is
the Pair-Subset of {1, 2, 3} and {1, 2} is a pairsubpattern of {1, 2, 3}
6
Problem Definition
• Transform D into D', such that PH are
hidden and ~PH are still mined in D' and
also avoid Forward-Inference Attack
• Kernel Ideal
– D is multiplied by a sanitization matrix S
– The problem is transformed to how to define S
7
Matrix Multiplication
• If Dij = 0, D'ij is set to 0 directly
• If
≥ 1, D'ij is set to 1
• If
≤ 0, D'ij is set to 0
8
Matrix Observation
• Setting of “– 1”
• If Sij id set to – 1, for the row that Dti and
Dtj are both equal to 1, D'ij will become 0
9
Matrix Observation (cont.)
• Setting of “1”
• Setting Sij to 1 can keep the relation
between item i and item j by enhancing the
strength of item j
10
Sanitization Process
11
Marked-Set Generation
• 1. Put the patterns with length = 2 in PH
into Marked-Set directly
• 2. for all remainder P in PH do
– if (P has no Pair-Subsets included in MarkedSet)
• Generate k groups, k = # of all Pair-Subsets of P
• Class label of group is named by each PairSubsets of P
• P is stored in each group
• 3.Merge the groups with same class label
12
Marked-Set Generation (cont.)
• 4.for all NP in ~PH do
– Generate their all Pair-Subsets
– Count the frequencies of all Pair-Subsets
• 5. for all groups do
– If the class label of the group ≠ any PairSubset generated in Step1
• The frequency of the group = 0
– If the class label of the group = one PairSubset generated in Step1
• The frequency of the group = the frequency of the
Pair-Subset
13
Marked-Set Generation (cont.)
• 6. Sort the groups by frequency in the increasing order
• 7.for (i =1 to number of groups -1)
for ( j = i +1 to number of groups)
Compare groups pair-wise, Gi, Gj
for all overlap in Gi∩Gj do
If the size of Gi ≠ the size of Gj
Remove overlap from the small one
else
if Check the frequency
Remove overlap from the large one
else
Remove overlap form the group chosen randomly
• 8.for all groups do
If number of patterns stored in group > 0
Put the class label into Marked-Set
14
An overall example
15
16
17
Sanitization Matrix Setting
• 1. Sii = 1
• 2. for all {i, j} in Marked-Set do
• if(# of i in ~PH < # of j in ~PH)
– Sji = –1
• If(# of i in ~PH > # of j in ~PH)
– Sij = –1
• else
– if(# of i in Marked-Set > # of j in Marked-Set)
» Sji = –1
– if(# of i in Marked-Set < # of j in Marked-Set)
» Sij = –1
– else
» Sji = –1 or Sij = –1 randomly
18
Sanitization Matrix Setting (cont.)
• 3.for all {i, j} in (large2- Marked-Set) do
• Set Sij = 1, Sji = 1
• 4.Sij = 0, otherwise
19
20
Probability Policies
• Distortion Probabilityρ
• Used when only one “1” in the column j
and works if
; D'ij
hasρj to be set to 1 and 1–ρj to be set to 0
21
Probability Policies (cont.)
• Lemma1: Give a minimum supportσand a level
of confidence c. Let {i, j} be a pattern in MarkedSet nij be the support count of {i, j} ρ is the
probability of column j. W.L.O.G we assume that
Sij = –1. If ρ satisfies
and
where |D| is the number of transaction in D, we
can say that we are c confident that {i, j} isn’t
frequent in D'
22
Probability Policies (cont.)
• Conformity Probabilityμ
• Used when the column j of S contains at
least two “1s”, works if
, and at
least one – 1 in j is multiplied by 1 in D, D'ij
is set to 1 withμand 0 with 1–μ
23
Probability Policies (cont.)
• Lemma 2: Given a minimum support σ, and a level
of confidence c. Let {i, j} be a pattern in Marked-Set,
and {k, j} be a pattern which belongs to {large2 –
Marked-Set}, nikj be the support count of {i, k, j}.
W.L.O.G, we assume that Sij = –1.μis the
Conformity probability of column j. If μ is set
according to the following rule,
we can say that we are c confident that {i, j} isn’t
frequent in D'.
24
Sanitization Algorithm
for (i =1 to n){
for (j =1 to m){
if (Dij = 0)
D'ij = 0
else{
temp =
if (col j in S with some –1, one 1, temp ≤ 0)
D'ij = 1 (ρj), 0 ( 1–ρj)
else if (col j in S with some –1, more than one 1,at least
one –1 multiplied by 1 in D, temp ≥ 1)
D'ij = 1 (μi), 0 (1–μi)
else if (temp ≤ 0)
D'ij = 0
else if (temp ≥ 1)
D'ij = 1
}
}
25
}
Performance Quantifying
• Hiding Failure:
• Miss Cost:
• Dissimilarity:
• Weakness:
26
Performance Evaluation
27
Performance Evaluation (cont.)
28
Performance Evaluation (cont.)
29
Performance Evaluation (cont.)
30
Performance Evaluation (cont.)
31
Performance Evaluation (cont.)
32
Performance Evaluation (cont.)
33
Performance Evaluation (cont.)
34
Conclusion
• A probability based approach to solve
sensitive knowledge problem is proposed
• In some conditions, the miss cost and the
dissimilarity is little higher than SWA, but
overall, better performance than SWA and
could not suffer from Forward-Inference
Attack
• Any Questions?
35
Reference
•
•
•
•
•
•
[LCC04]Guanling Lee, Chien-Yu Chang and Arbee L.P Chen. Hiding
sensitive patterns in association rules mining. The 28th Annual International
Computer Software and Applications Conference (COMPSAC 2004)
[OZ02]S. R. M. Oliveira and O. R. Zaïane. Privacy Preserving Frequent
Itemset Mining. In Proc. of the IEEE ICDM Workshop on Privacy, Security,
and Data Mining Japan, December 2002.
[OZ03a]S. R. M. Oliveira and O. R. Zaïane. Algorithms for Balancing
Privacy and Knowledge Discovery in Association Rule Mining. In Proc. of
the 7th International Database Engineering and Applications Symposium
(IDEAS’03), Hong Kong, China, July 2003.
[OZ03b]S. R. M. Oliveira and O. R. Zaïane. Protecting Sensitive Knowledge
By Data Sanitization. In Proc. of the 3rd IEEE International Conference on
Data Mining (ICDM’03).
[OZS04]S. R. M. Oliveira, O. R. Zaïane and Yücel Saygin. Secure
Association Rule Sharing The 8th Pacific-Asia Conference on Knowledge
Discovery and Data Mining 2004(PAKDD-04).
[VAE04]Verykios, V.S.; Elmagarmid, A.K.; Bertino, E.; Saygin, Y.; Dasseni, E.
Association rule hiding. IEEE Transactions On Knowledge And Data
Engineering, Vol. 16, No. 4, April 2004.
36