Privacy preserving rule mining
Download
Report
Transcript Privacy preserving rule mining
Privacy-preserving rule
mining
Outline
A brief introduction to association rule
mining
Privacy preserving rule mining
Single party
Perturbation - publishing
Encryption - outsourcing
Distributed multiparty
Cryptographic protocols – collaborative mining
Hiding sensitive rules
Association Rule Mining
Transactional datasets
A transaction t = {a,b,c,…}
a,b,c,… are called items
The length of transaction = # of items
Transaction length can vary
Equivalent representation
The set of all items : I
A transaction t can be transformed to a
boolean vector in length of |I|.
Association Rule Mining
Rule mining
Goal: find the frequent itemset
Some itemset, e.g.{a,b,c}, appears frequently,
higher than certain support.
Rules can be derived from the itemset: AB
{a,b,c} is frequent, then abc, abc, …
Metrics
Support = # of occurrences of itemset/ total #
of transactions
i.e., prob (A,B)
Confidence = # of occurrences of itemset/# of
occurrences of left (of the rule)
I.e. the conditional prob: Pr (B|A)
Example
E.g. a,b appears 100 times together,
while abc appears 50 times together in
total 5000 transactions
Support of abc = 50/5000 = 0.01
Confidence of abc = 50/100 = 0.5
Algorithms
Apriori
observation: if itemset A is a part of B,
then support(A) >= support (B)
Steps in finding frequent itemsets:
Starting from single-item set, pruning the
itemsets that have support < threshold
When we have a set of k-itemsets, we
expand it to k+1-itemsets, and check their
supports.
Using data structures like hash tree to
speed up the counting process
Algorithm
Generating rules with confidence threshold
Confidence (AB) = P(B|A)
= support(AB)/support(A)
Centralized approaches
Two types of methods
(Categorical data) Perturbation
Encryption
Perturbation
Papers
1."Privacy Preserving Mining of Association Rules,"
SIGKDD 2002.
2."Maintaining data privacy in association rule
mining”, VLDB 02
3.A framework for high-accuracy privacypreserving mining, ICDE05
Basic ideas
Consider a transaction as a boolean bit vector
Perturb each bit with certain method
Paper 1: randomly select j items from t, then
for rest of all items, with prob p to be selected
Paper 2: each bit has the prob p to be original,
1-p to be flipped
Paper 3: unify the methods with perturbation
matrix
The key is…
After you perturb the data, you
should still be able to find the
supported rules correctly.
The accuracy is traded off by the
intensity of perturbation (p)
Methods discovering the
original support
Paper 1 using the correlation between
“partial support” to find the original
support
Concept of partial support
# {t T |# ( A t ) l}
sup ( A)
N
T
l
Prob of the length change of matched
parts
m
pk [l l ' ] p[# (t ' A) l ' |# (t A) l ]
The size of t : m, the size of itemset A: k
Some results
Let si be supi(A) and si’ be supi’ (A)
1.
2.
The matrix P and D are defined with only p[ll’]
From 1, we can estimate the original support
From 2, we can estimate the reliability (variance) of the support
Estimation (which is related to perturbation rate p)
Privacy
Given an itemset A in perturbed
transaction t’
What is the probability of an item a,
really in the itemset A, i.e., p[a t | A t ' ]
p[a t | A t ' ]
Tradeoff between utility and
privacy
Lowest discoverable support: distinguishable from zero (consider the
variance of support estimation)
Encryption method
Paper: Security in Outsourcing of Association
Rule Mining, VLDB2007
Substitution encryption
1-1 substitution a1, b2, …
1-n substitution a{1,10},
b{2,11,12},…
Problem:
1-1 substitution is weak
Arbitrary 1-n substitution does not work
Cannot recover original rules from the rules
from the substituted items.
The basic idea
Fake items
Original n items, additional m fake items
Define “admissible 1-n mapping”
Arbitrary 1-n mapping may result in irreversible
results
E.g., a{1,2}, b{2}, c{3}
If we find frequent itemset {1,2,3} in the substituted
set, {ac} or {abc}, which one is the right original
itemset?
Admissible 1-n mapping
For each mapping, there should be at least one unique
substitute item in the mapped result, which does not
appear in other mapping
E.g., a{1,2}, b{2}, c{3} breaks the definition
while a{1,2}, b{2,4}, c{3} is admissible
Recovering rules
When we use admissible mappings
We are able to reverse the discovered
rules on substituted set.
E.g., if we find {1,2,4} is a frequent set
check all mappings:
{1,2} a, {2,4} b {1,2,4} {ab}
cost
Additional cost
Generating item mapping
Generating transaction transformation
significant
Cost of rule mining
Both the # of items and the average length
of transaction is increased, thus the total
cost will increase
Features of encryption method
Rules can be accurately recovered
A tradeoff between cost and privacy
Privacy is better preserved with more
fake items
More fake items will result in higher
additional cost.
The VLDB07 paper weaknesses
No strong proof on security
does not mention any attacks – there are
possibly attacks on this scheme
Distributed datasets
Perturbation works, but existing
encryption protocols do not work
might need some protocols to make the
encryption method applicable
Horizontally or vertically partitioned
Paper1: "Privacy-Preserving Distributed Mining of
Association Rules on Horizontally Partitioned Data,"
Paper2: "Privacy preserving association rule mining in
vertically partitioned data,"
Shared features: using the cryptographic methods to
construct protocols
Hiding sensitive rules
When publishing data for rule mining,
the rule itself can be sensitive too.
Basic methods
Decrease the support
Support = # of occurrences of itemset/
total # of transactions
Decrease the confident
Confidence = # of occurrences of itemset/#
of occurrences of left(rule)
discussion on rule hiding
Need sufficient amount of
computational cost at the data owner
side
You should know what rules are
sensitive in advance!
So only necessary for the case you have
to share the data
When hiding sensitive rules, other
rules might be damaged
Summary
Two methods for single party rule
mining
Perturbation and encryption
Distributed multiparty can use
protocols and perturbation method
Hiding sensitive rules is also important
in some cases