Add a subset of unique mapping set to t` with some probability
Download
Report
Transcript Add a subset of unique mapping set to t` with some probability
Security in Outsourced
Association Rule Mining
Agenda
Introduction
Approximate randomized technique
Encryption
Summary and future work
Introduction
Data mining in company
know about the past activities of their
customers
make strategic decisions
Types of data mining
Association rules mining
Clustering
Classification
Association rules
“X => Y”
If a transaction contains itemset X, the
transaction will probably contain
itemset Y
Support: number of supporting
transactions
Confidence: proportion of transactions
containing X which also contains Y
Performing data mining
Build application
Buy software
Development cost?
Time?
Fit requirements?
Maintenance?
Outsource
Concerns in outsourcing
Output
Execution Assurance
Correctness
Security
Company
Data Miner
DB
Privacy of records
Information of the company
Approximate randomized
technique
Approximate solution
Privacy Preserving Mining of
Association Rules
SIGKDD 2002
Authors: Alexandre Evfimievski,
Ramakrishnan Srikant, Rakesh
Agrawal, Johannes Gehrke
Problem formulation
Let the set of transactions be T =
{t1, t2, … tN}
Transform T to T’ = {t’1, t’2, … t’N}
Mine in T’
Privacy breaches
Itemset A cause a privacy breach of
level p if for some item a in A
P[a in ti|A in t’i] >= p
Select-a-size randomization
For each transaction ti in T
m = length of ti
Select (non-uniformly) randomly an
integer j from [0, m]
Copy uniformly at random j items in ti
to t’i
Consider every item a not in ti, add a to
t’i with a given probability pm
Run on real data
Privacy breach of level <= 50%
P[a in ti|A in t’i] <= 50%
Accuracy = # true positive / (# found
itemsets)
Set 1
Itemset True
True
Size
Itemset Positive
1
65
65
2
228
212
3
22
18
False False
Accuracy
Drops Positive
0
16
4
0
28
5
100%
88%
78%
Accuracy
Set 2:
Itemset True
True
Size
Itemset Positive
1
266
254
2
217
195
3
48
43
False False
Accuracy
Drops Positive
12
22
5
31
45
26
89%
81%
62%
Problems
Estimated counts of large itemsets varies
"beer and diaper" story
Lower accuracy of association rules
customers who buy diapers tend also to buy
beer
hard to believe some strange rules
Expensive to make wrong decision
Supermarket: layout design
Health center: identify new disease
Security concerns
Individual transaction is protected
Private association rules can be
estimated by other parties
Adversary actions may be based on
found association rules
Encryption
Problem formulation
Let the set of transactions be T =
{t1, t2, … tN}
I is the entire set of items
All ti is a subset of I
Transform T to T’ = {t’1, t’2, … t’N}
A third party mines in T’ and gets
AR’
Transform AR’ to AR
Architecture
Association
Rules
Association
Rules
Mappings
DB
Transformer
DB
Encryption
To protect a message, simple encryption
can be applied
Association rule encryption
“GOOD DOG” can be encrypted as “PLLX XLP”
752 => 891?
Milk => Bread
Transaction encryption
<8, 69, 153, 756>?
<Cheese, Fork, Ice-cream, Clock>
Simple scheme
Encryption
For every transaction ti
For every item x in ti
Add f(x) to t’i where f is a bi-jective
function
Decryption
For every association rule ri
For every item y in r
Replace y by f-1(y)
Problems with simple encryption
They are easy to crack
“PLLX XLP”
26P3
Association rules
combinations, with at least one vowel
# Bread > # Car
# association rules, # large
itemsets are disclosed
Solution
Use a more complex scheme
Fake items
Probability to make a correct guess
of a single mapping
= 1 / |I|
Randomly add some fake items to
each transaction
Decrease the above probability to 1 /
(|I| + |F|)
One-to-n Mapping
Originally, we are “one-to-one” mapping
One item One item
A1
B2
C3
We form “one-to-n” mapping
A 1, 4, 5
B2
C 3, 5
Greatly increase the number of possible
mapping of an item
|I|+|F|C1 + |I|+|F|C2 + … |I|+|F|C|F|
Example transformation
T=
{A}
{B}
{C}
{A, B}
{A, C}
{B, C}
{A, B, C}
T’ =
A 1, 4, 5
B2
C 3, 5
{1, 4, 5}
{2}
{3, 5}
{1, 2, 4, 5}
{1, 3, 5}
{2, 3, 5}
{1, 2, 3, 4, 5}
Limitation on the mapping f
For any item x, there does not exist items
y1, y2, …, yk (x ≠ y1 ≠ … ≠ yk )
Such that f(x) subset in f(y1) U f(y2) U…f(yk)
Consider an example
A 1, 2
B 2, 3
C 3, 4
AC 1, 2, 3, 4
ABC 1, 2, 3, 4
Limitation on the mapping f
For any item x
f(x) – Ui != x, i in I f(i) != empty
Every item must map to something
unique
Mapping generation – Item Extend
Initialize every item to map to
something unique I’
For every item x in IE
Randomly pick some mappings
Extend each mapping by x
Example run
A1
B2
C3
IE = {4, 5}
Considering item 4
A1
B2
C3
Pick A
A 1, 4
B2
C3
Considering item 5
A1
B2
C3
Pick A, C
A 1, 4, 5
B2
C 3, 5
Item Extend
Every item must map to something
unique
A 1, 4, 5
Say 1 is unique to f(A)
B2
C 3, 5
suppT(A) = suppT’(1)
For a transaction t without item A
Add a subset of unique mapping set to
t’ with some probability
{1, 4} is unique mapping set in f(A)
{}, {1}, {4}, {1, 4} may be added
Fake items again
Now, every item in t’i must be in
some mappings
Randomly add some fake items in
|F| to each transaction
Mapping f: I -> |I’| U |IE| U |F|
|I’|: core “unique” items
|IE|: expanding items
|F|: fake items
Basic transformation framework
For each transaction t
For each item x in t
For item i in I - t
Add f(x) to t’
Add randomly subset of unique mapping
set of f(i) to t’
For item f in F
Toss a biased coin for each item, add f to
t’ if head (probability should be
difference)
Recovering association rules
Given an encrypted rule in AR’
If there exists i1, i2, …, im in I
Uk=1m f(ik) = X
And there exists j1, j2, …, jn in I
r’: X => Y
Uk=1n f(jk) = XUY
r: {i1, i2, … im} => {j1, j2, …, jn} –
{i1, i2, … im} is a rule in AR
Otherwise, the rule is not correct
Example
Given
1 => 4 (rejected)
2 => 1, 5 (rejected)
2 => 1, 3, 5 (rejected)
2 => 1, 3, 4, 5 (B => AC)
2, 3, 5 => 1, 4 (BC => A)
2, 3, 5 => BC
1, 2, 3, 4, 5 => ABC
Mapping f
A 1, 4, 5
B2
C 3, 5
Correctness
Proposition
For any item x, y, f is transformation
mapping
suppT(x) = suppT’(f(x))
suppT(xUy) = suppT’(f(x) U f(y))
For any itemset X, Y, F is the
transformation mapping
suppT(X) = suppT’(F(X))
suppT(XUY) = suppT’(F(X) U F(Y))
No false drops and false positives
Summary
Generation of mappings
Transformation of transactions
One-to-n mappings
Item Extend
Mapping f(x)
Subsets of unique mapping set
Fake items
Recovering association rules
Reverse mappings and filtering
Test run
# Items = 1k, |T| = 1k
Without transformation
One rule
Time: 8s
Item Extend
147 rules
Total times: 26s
Mappings generation and
transformation: 219ms
Future Work
Define parameters to the problem
Size of |IE|
Size of |F|
Give a clear measure of security
Give a clear measure of overhead
Correctness of association rules
Query execution proof
Result verification
The End
Choosing probability
Uniform distribution or any fixed
distribution give patterns which may
be easily identified
Random probability distribution
{}: 70%, {1}: 5%, {4}: 15%, {1, 4}:
20%
Storage: need additional storage
Back
Algorithm for transformation
Transformation is the most costly
process
Execution time linear to database
size |T|
Should be as fast as possible
Optimization
Mapping Retrieval
For an item x, use a hash table to retrieve the
mapping, h(x)
Adding fake items
First randomly (according to the probability of
adding items) determine the number of items
to add
Randomly pick in the set (non-uniform
distribution)
Gives a much shorter runtime in average
Choice of mapped items
Acceptable as long as it is not easy to
identify I’, IE, F
One way is to use random permutation of
first |I| + |IE| + |F| natural numbers
1 2 … |I|+|IE|+|F| * (1+ δ)
First |I| numbers are mapped to |I’|
Next |IE| numbers are IE
Cut and paste randomization
One case of select-a-size randomization
The way to perform selection of j
Given an integer Km > 0
Randomly choose j in [0, Km]
If (j > m)
Set j = m
Overall input parameters
Km
pm
Effects on support
Support of A in T’
A in t, without replaced
A’ in t, randomly add A
Support of AB in T’
AB in t, without replaced A and B
AB’ in t, randomly add B
A’B in t, randomly add A
A’B’ in t, randomly add A and B
Estimating original support
Support of A in T, x
Support of A in T’, y
x * P(A remains in original transaction)
+ (|DB| - x) * pm = y
Support of AB in T
Support of AB in T’
Support of AB’, A’B in T’
Support of A’B’ in T’
Apriori property
Suppose m = 2 for all t in T
|T| = 10, |I| = {A, B}
pm= 0, j = 1,
Support of B in T’ suppT’ (B)= 0
E(suppT(B)) = 0
suppT’ (A)= 10
suppT’ (AB)= 0
E(suppT(AB)) = suppT’ (A) * 1 = 10
Apriori property
An expected large itemset may
have an expected small sub-set
But generally the support of subsets
are not too small
Instead of using the support
threshold to filter all small
candidates, use a smaller value
Apriori algorithm
Generate candidate sets
Scan database for counts
Recover the predicted support
Discard candidates with support
smaller than <= candidate limit
Save for output candidates with
support >= support threshold
Apriori_gen(remaining candidate)
Candidate limit
A high value
A small value
Increase numbers of false drops
Poor correctness
Increase number of candidate sets
High running time
Experiment
Support threshold: smin
estimated s.d.: δ
smin – δ is found to be a good value
Other applications
Outsourced transaction database
(secure) storage
Outsourced association rule mining
using data stream
Secure distributed association rule
mining with third party miner
Outsourced database with association
rule mining service
Association
Rules
Association
Rules
Mappings
Transformer
Transactions
Query
DB