Privacy preserving rule mining

Download Report

Transcript Privacy preserving rule mining

Privacy-preserving rule
mining
Outline
 A brief introduction to association rule
mining
 Privacy preserving rule mining
 Single party
 Perturbation - publishing
 Encryption - outsourcing
 Distributed multiparty
 Cryptographic protocols – collaborative mining
 Hiding sensitive rules
Association Rule Mining
 Transactional datasets




A transaction t = {a,b,c,…}
a,b,c,… are called items
The length of transaction = # of items
Transaction length can vary
 Equivalent representation
 The set of all items : I
 A transaction t can be transformed to a
boolean vector in length of |I|.
Association Rule Mining
 Rule mining
 Goal: find the frequent itemset
 Some itemset, e.g.{a,b,c}, appears frequently,
higher than certain support.
 Rules can be derived from the itemset: AB
 {a,b,c} is frequent, then abc, abc, …
 Metrics
 Support = # of occurrences of itemset/ total #
of transactions
 i.e., prob (A,B)
 Confidence = # of occurrences of itemset/# of
occurrences of left (of the rule)
 I.e. the conditional prob: Pr (B|A)
 Example
 E.g. a,b appears 100 times together,
while abc appears 50 times together in
total 5000 transactions
 Support of abc = 50/5000 = 0.01
 Confidence of abc = 50/100 = 0.5
Algorithms
 Apriori
 observation: if itemset A is a part of B,
then support(A) >= support (B)
 Steps in finding frequent itemsets:
 Starting from single-item set, pruning the
itemsets that have support < threshold
 When we have a set of k-itemsets, we
expand it to k+1-itemsets, and check their
supports.
 Using data structures like hash tree to
speed up the counting process
 Algorithm
 Generating rules with confidence threshold
 Confidence (AB) = P(B|A)
= support(AB)/support(A)
Centralized approaches
 Two types of methods
 (Categorical data) Perturbation
 Encryption
Perturbation
 Papers
1."Privacy Preserving Mining of Association Rules,"
SIGKDD 2002.
2."Maintaining data privacy in association rule
mining”, VLDB 02
3.A framework for high-accuracy privacypreserving mining, ICDE05
 Basic ideas
 Consider a transaction as a boolean bit vector
 Perturb each bit with certain method
 Paper 1: randomly select j items from t, then
for rest of all items, with prob p to be selected
 Paper 2: each bit has the prob p to be original,
1-p to be flipped
 Paper 3: unify the methods with perturbation
matrix
The key is…
 After you perturb the data, you
should still be able to find the
supported rules correctly.
 The accuracy is traded off by the
intensity of perturbation (p)
Methods discovering the
original support
 Paper 1 using the correlation between
“partial support” to find the original
support
 Concept of partial support
# {t  T |# ( A  t )  l}
sup ( A) 
N
T
l
 Prob of the length change of matched
parts
m
pk [l  l ' ]  p[# (t ' A)  l ' |# (t  A)  l ]
The size of t : m, the size of itemset A: k
Some results
 Let si be supi(A) and si’ be supi’ (A)
1.
2.
The matrix P and D are defined with only p[ll’]
From 1, we can estimate the original support
From 2, we can estimate the reliability (variance) of the support
Estimation (which is related to perturbation rate p)
Privacy
 Given an itemset A in perturbed
transaction t’
 What is the probability of an item a,
really in the itemset A, i.e., p[a  t | A  t ' ]
p[a  t | A  t ' ]
Tradeoff between utility and
privacy
Lowest discoverable support: distinguishable from zero (consider the
variance of support estimation)
Encryption method
 Paper: Security in Outsourcing of Association
Rule Mining, VLDB2007
 Substitution encryption
 1-1 substitution a1, b2, …
 1-n substitution a{1,10},
b{2,11,12},…
 Problem:
 1-1 substitution is weak
 Arbitrary 1-n substitution does not work
 Cannot recover original rules from the rules
from the substituted items.
The basic idea
 Fake items
 Original n items, additional m fake items
 Define “admissible 1-n mapping”
 Arbitrary 1-n mapping may result in irreversible
results
 E.g., a{1,2}, b{2}, c{3}
 If we find frequent itemset {1,2,3} in the substituted
set, {ac} or {abc}, which one is the right original
itemset?
 Admissible 1-n mapping
 For each mapping, there should be at least one unique
substitute item in the mapped result, which does not
appear in other mapping
 E.g., a{1,2}, b{2}, c{3} breaks the definition
while a{1,2}, b{2,4}, c{3} is admissible
Recovering rules
 When we use admissible mappings
 We are able to reverse the discovered
rules on substituted set.
 E.g., if we find {1,2,4} is a frequent set
check all mappings:
{1,2}  a, {2,4} b  {1,2,4} {ab}
cost
 Additional cost
 Generating item mapping
 Generating transaction transformation
 significant
 Cost of rule mining
 Both the # of items and the average length
of transaction is increased, thus the total
cost will increase
Features of encryption method
 Rules can be accurately recovered
 A tradeoff between cost and privacy
 Privacy is better preserved with more
fake items
 More fake items will result in higher
additional cost.
 The VLDB07 paper weaknesses
 No strong proof on security
 does not mention any attacks – there are
possibly attacks on this scheme
Distributed datasets
 Perturbation works, but existing
encryption protocols do not work
 might need some protocols to make the
encryption method applicable
 Horizontally or vertically partitioned
 Paper1: "Privacy-Preserving Distributed Mining of
Association Rules on Horizontally Partitioned Data,"
 Paper2: "Privacy preserving association rule mining in
vertically partitioned data,"
 Shared features: using the cryptographic methods to
construct protocols
Hiding sensitive rules
 When publishing data for rule mining,
the rule itself can be sensitive too.
 Basic methods
 Decrease the support
 Support = # of occurrences of itemset/
total # of transactions
 Decrease the confident
 Confidence = # of occurrences of itemset/#
of occurrences of left(rule)
discussion on rule hiding
 Need sufficient amount of
computational cost at the data owner
side
 You should know what rules are
sensitive in advance!
 So only necessary for the case you have
to share the data
 When hiding sensitive rules, other
rules might be damaged
Summary
 Two methods for single party rule
mining
 Perturbation and encryption
 Distributed multiparty can use
protocols and perturbation method
 Hiding sensitive rules is also important
in some cases