Add a subset of unique mapping set to t` with some probability

Download Report

Transcript Add a subset of unique mapping set to t` with some probability

Security in Outsourced
Association Rule Mining
Agenda




Introduction
Approximate randomized technique
Encryption
Summary and future work
Introduction

Data mining in company



know about the past activities of their
customers
make strategic decisions
Types of data mining



Association rules mining
Clustering
Classification
Association rules

“X => Y”



If a transaction contains itemset X, the
transaction will probably contain
itemset Y
Support: number of supporting
transactions
Confidence: proportion of transactions
containing X which also contains Y
Performing data mining

Build application



Buy software



Development cost?
Time?
Fit requirements?
Maintenance?
Outsource
Concerns in outsourcing

Output



Execution Assurance
Correctness
Security


Company
Data Miner
DB
Privacy of records
Information of the company
Approximate randomized
technique
Approximate solution

Privacy Preserving Mining of
Association Rules


SIGKDD 2002
Authors: Alexandre Evfimievski,
Ramakrishnan Srikant, Rakesh
Agrawal, Johannes Gehrke
Problem formulation




Let the set of transactions be T =
{t1, t2, … tN}
Transform T to T’ = {t’1, t’2, … t’N}
Mine in T’
Privacy breaches

Itemset A cause a privacy breach of
level p if for some item a in A

P[a in ti|A in t’i] >= p
Select-a-size randomization

For each transaction ti in T




m = length of ti
Select (non-uniformly) randomly an
integer j from [0, m]
Copy uniformly at random j items in ti
to t’i
Consider every item a not in ti, add a to
t’i with a given probability pm
Run on real data

Privacy breach of level <= 50%



P[a in ti|A in t’i] <= 50%
Accuracy = # true positive / (# found
itemsets)
Set 1
Itemset True
True
Size
Itemset Positive
1
65
65
2
228
212
3
22
18
False False
Accuracy
Drops Positive
0
16
4
0
28
5
100%
88%
78%
Accuracy

Set 2:
Itemset True
True
Size
Itemset Positive
1
266
254
2
217
195
3
48
43
False False
Accuracy
Drops Positive
12
22
5
31
45
26
89%
81%
62%
Problems

Estimated counts of large itemsets varies


"beer and diaper" story



Lower accuracy of association rules
customers who buy diapers tend also to buy
beer
hard to believe some strange rules
Expensive to make wrong decision


Supermarket: layout design
Health center: identify new disease
Security concerns


Individual transaction is protected
Private association rules can be
estimated by other parties

Adversary actions may be based on
found association rules
Encryption
Problem formulation


Let the set of transactions be T =
{t1, t2, … tN}
I is the entire set of items




All ti is a subset of I
Transform T to T’ = {t’1, t’2, … t’N}
A third party mines in T’ and gets
AR’
Transform AR’ to AR
Architecture
Association
Rules
Association
Rules
Mappings
DB
Transformer
DB
Encryption

To protect a message, simple encryption
can be applied


Association rule encryption



“GOOD DOG” can be encrypted as “PLLX XLP”
752 => 891?
Milk => Bread
Transaction encryption


<8, 69, 153, 756>?
<Cheese, Fork, Ice-cream, Clock>
Simple scheme


Encryption
For every transaction ti

For every item x in ti


Add f(x) to t’i where f is a bi-jective
function
Decryption

For every association rule ri

For every item y in r

Replace y by f-1(y)
Problems with simple encryption

They are easy to crack

“PLLX XLP”
 26P3

Association rules



combinations, with at least one vowel
# Bread > # Car
# association rules, # large
itemsets are disclosed
Solution

Use a more complex scheme
Fake items

Probability to make a correct guess
of a single mapping


= 1 / |I|
Randomly add some fake items to
each transaction

Decrease the above probability to 1 /
(|I| + |F|)
One-to-n Mapping

Originally, we are “one-to-one” mapping





One item  One item
A1
B2
C3
We form “one-to-n” mapping




A  1, 4, 5
B2
C  3, 5
Greatly increase the number of possible
mapping of an item
 |I|+|F|C1 + |I|+|F|C2 + … |I|+|F|C|F|
Example transformation

T=







{A}
{B}
{C}
{A, B}
{A, C}
{B, C}
{A, B, C}

T’ =


A  1, 4, 5
B2
C  3, 5





{1, 4, 5}
{2}
{3, 5}
{1, 2, 4, 5}
{1, 3, 5}
{2, 3, 5}
{1, 2, 3, 4, 5}
Limitation on the mapping f

For any item x, there does not exist items
y1, y2, …, yk (x ≠ y1 ≠ … ≠ yk )


Such that f(x) subset in f(y1) U f(y2) U…f(yk)
Consider an example





A  1, 2
B  2, 3
C  3, 4
AC  1, 2, 3, 4
ABC  1, 2, 3, 4
Limitation on the mapping f

For any item x


f(x) – Ui != x, i in I f(i) != empty
Every item must map to something
unique
Mapping generation – Item Extend


Initialize every item to map to
something unique I’
For every item x in IE


Randomly pick some mappings
Extend each mapping by x
Example run




A1
B2
C3
IE = {4, 5}
Considering item 4



A1
B2
C3



Pick A
A  1, 4
B2
C3
Considering item 5



A1
B2
C3



Pick A, C
A  1, 4, 5
B2
C  3, 5
Item Extend

Every item must map to something
unique
A  1, 4, 5



Say 1 is unique to f(A)
B2
C  3, 5
suppT(A) = suppT’(1)
For a transaction t without item A


Add a subset of unique mapping set to
t’ with some probability
{1, 4} is unique mapping set in f(A)

{}, {1}, {4}, {1, 4} may be added
Fake items again



Now, every item in t’i must be in
some mappings
Randomly add some fake items in
|F| to each transaction
Mapping f: I -> |I’| U |IE| U |F|



|I’|: core “unique” items
|IE|: expanding items
|F|: fake items
Basic transformation framework

For each transaction t

For each item x in t


For item i in I - t


Add f(x) to t’
Add randomly subset of unique mapping
set of f(i) to t’
For item f in F

Toss a biased coin for each item, add f to
t’ if head (probability should be
difference)
Recovering association rules

Given an encrypted rule in AR’


If there exists i1, i2, …, im in I



Uk=1m f(ik) = X
And there exists j1, j2, …, jn in I


r’: X => Y
Uk=1n f(jk) = XUY
r: {i1, i2, … im} => {j1, j2, …, jn} –
{i1, i2, … im} is a rule in AR
Otherwise, the rule is not correct
Example

Given





1 => 4 (rejected)
2 => 1, 5 (rejected)
2 => 1, 3, 5 (rejected)
2 => 1, 3, 4, 5 (B => AC)
2, 3, 5 => 1, 4 (BC => A)
2, 3, 5 => BC
 1, 2, 3, 4, 5 => ABC

Mapping f
A  1, 4, 5
B2
C  3, 5
Correctness

Proposition

For any item x, y, f is transformation
mapping
suppT(x) = suppT’(f(x))
 suppT(xUy) = suppT’(f(x) U f(y))


For any itemset X, Y, F is the
transformation mapping
suppT(X) = suppT’(F(X))
 suppT(XUY) = suppT’(F(X) U F(Y))


No false drops and false positives
Summary

Generation of mappings



Transformation of transactions




One-to-n mappings
Item Extend
Mapping f(x)
Subsets of unique mapping set
Fake items
Recovering association rules

Reverse mappings and filtering
Test run


# Items = 1k, |T| = 1k
Without transformation



One rule
Time: 8s
Item Extend



147 rules
Total times: 26s
Mappings generation and
transformation: 219ms
Future Work

Define parameters to the problem





Size of |IE|
Size of |F|
Give a clear measure of security
Give a clear measure of overhead
Correctness of association rules


Query execution proof
Result verification
The End
Choosing probability


Uniform distribution or any fixed
distribution give patterns which may
be easily identified
Random probability distribution


{}: 70%, {1}: 5%, {4}: 15%, {1, 4}:
20%
Storage: need additional storage
Back
Algorithm for transformation



Transformation is the most costly
process
Execution time linear to database
size |T|
Should be as fast as possible
Optimization

Mapping Retrieval


For an item x, use a hash table to retrieve the
mapping, h(x)
Adding fake items



First randomly (according to the probability of
adding items) determine the number of items
to add
Randomly pick in the set (non-uniform
distribution)
Gives a much shorter runtime in average
Choice of mapped items


Acceptable as long as it is not easy to
identify I’, IE, F
One way is to use random permutation of
first |I| + |IE| + |F| natural numbers
1 2 … |I|+|IE|+|F| * (1+ δ)


First |I| numbers are mapped to |I’|
Next |IE| numbers are IE
Cut and paste randomization


One case of select-a-size randomization
The way to perform selection of j




Given an integer Km > 0
Randomly choose j in [0, Km]
If (j > m)
 Set j = m
Overall input parameters


Km
pm
Effects on support

Support of A in T’



A in t, without replaced
A’ in t, randomly add A
Support of AB in T’




AB in t, without replaced A and B
AB’ in t, randomly add B
A’B in t, randomly add A
A’B’ in t, randomly add A and B
Estimating original support

Support of A in T, x



Support of A in T’, y
x * P(A remains in original transaction)
+ (|DB| - x) * pm = y
Support of AB in T



Support of AB in T’
Support of AB’, A’B in T’
Support of A’B’ in T’
Apriori property




Suppose m = 2 for all t in T
|T| = 10, |I| = {A, B}
pm= 0, j = 1,
Support of B in T’ suppT’ (B)= 0




E(suppT(B)) = 0
suppT’ (A)= 10
suppT’ (AB)= 0
E(suppT(AB)) = suppT’ (A) * 1 = 10
Apriori property



An expected large itemset may
have an expected small sub-set
But generally the support of subsets
are not too small
Instead of using the support
threshold to filter all small
candidates, use a smaller value
Apriori algorithm






Generate candidate sets
Scan database for counts
Recover the predicted support
Discard candidates with support
smaller than <= candidate limit
Save for output candidates with
support >= support threshold
Apriori_gen(remaining candidate)
Candidate limit

A high value



A small value



Increase numbers of false drops
Poor correctness
Increase number of candidate sets
High running time
Experiment



Support threshold: smin
estimated s.d.: δ
smin – δ is found to be a good value
Other applications



Outsourced transaction database
(secure) storage
Outsourced association rule mining
using data stream
Secure distributed association rule
mining with third party miner
Outsourced database with association
rule mining service
Association
Rules
Association
Rules
Mappings
Transformer
Transactions
Query
DB