Transcript ppt

To Do or Not To Do:
The Dilemma of Disclosing
Anonymized Data
Lakshmanan L, Ng R, Ramesh G
Univ. of British Columbia
Oren Fine
Nov. 2008
CS Seminar in Databases (236826)
Once Upon a Time…
• The police are after Edgar, a suspected drug lord.
– Intelligence has gathered call & meeting data records as a transactional database
– To build a case against Edgar, the police must find hard evidence, and wish to outsource data mining tasks to “We Mind Your Data Ltd.”
– But the police are subject to the law, and are obligated to protect the privacy of the people in the database – including Edgar, who is innocent until proven otherwise.
– Furthermore, Edgar is watching for the smallest hint to disappear…
It is my pleasure to introduce
Edgar vs. The Police
Motivation
• The Classic Dilemma:
– Keep your data close to your chest and never risk privacy or confidentiality, or…
– Disclose the data and gain potentially valuable knowledge and benefits
• In order to decide, we need to answer a major question:
– “Just how safe is the anonymized data?”
– Safe = protecting the identities of the objects.
Agenda
• Anonymization
• Model the Attacker’s Knowledge
• Determine the risk to our data
Anonymization or De-Identification
• Replace sensitive values with generated unique identifiers (strings, numbers)
• Example
TID  Names                            TID  Transaction
1    {Hussein, Hassan, Dimitri}       1    {1, 2, 3}
2    {Hussein, Edgar, Angela}         2    {1, 4, 5}
3    {Angela, Edgar}                  3    {5, 4}
4    {Raz, Adi, Yishai}               4    {6, 7, 8}
5    {Hassan, Yishai, Dimitri, Raz}   5    {2, 8, 3, 6}
6    {Raz, Angela, Nithai}            6    {6, 5, 9}
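The mapping above can be sketched in a few lines of Python (a minimal illustration, not the paper’s code): each distinct name is assigned the next unused integer ID, so the same name always maps to the same number.

```python
def anonymize(transactions):
    """Replace each distinct name with a generated unique integer ID."""
    ids = {}          # name -> generated ID
    anonymized = []
    for t in transactions:
        row = []
        for name in t:
            if name not in ids:
                ids[name] = len(ids) + 1   # next unused ID
            row.append(ids[name])
        anonymized.append(row)
    return anonymized, ids

# First three transactions of the table above:
db = [["Hussein", "Hassan", "Dimitri"],
      ["Hussein", "Edgar", "Angela"],
      ["Angela", "Edgar"]]
anon, mapping = anonymize(db)   # anon == [[1, 2, 3], [1, 4, 5], [5, 4]]
```

Note that the result reproduces the anonymized table: Hussein is 1 in both transactions 1 and 2, and transaction 3 becomes {5, 4}.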
Anonymization or De-Identification
• Advantages
– Very simple
– Does not affect final outcome or perturb data
characteristics
• We do not suggest that anonymization is
the “right” way, but it is probably the most
common
Frequent Set Mining Crash Course
• Transactional database
• Each transaction has TID and a set of
items
• An association rule of the form X → Y has
– Support s if s% of the transactions include X ∪ Y
– Confidence c if c% of the transactions that include X also include Y
• Support = frequent sets
• Confidence = association rules
• A k-itemset is a set of k items
Example
TID  Names
1    Angela, Ariel, Edgar, Steve, Benny
2    Edgar, Hassan, Steve, Tommy
3    Joe, Sara, Israel
4    Steve, Angela, Edgar
5    Benny, Mahhmud, Tommy
6    Angela, Sara, Edgar
7    Hassan, Angela, Joe, Edgar, Noa
8    Edgar, Benny, Steve, Tommy
Example (Cont.)
• First, we look for frequent sets, according to a support threshold
• 2-itemsets: {Angela, Edgar} and {Edgar, Steve} have 50% support (4 out of 8 transactions).
• 3-itemsets: {Angela, Edgar, Steve}, {Benny, Edgar, Steve} and {Tommy, Edgar, Steve} have only 25% support (2 out of 8 transactions)
• The rule {Edgar, Steve} → {Angela} has 50% confidence (2 out of 4 transactions) and the rule {Tommy} → {Edgar, Steve} has 66.7% confidence.
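These support and confidence figures can be checked with a short Python sketch over the example database (names are the slide’s own; this is an illustration, not a mining algorithm):

```python
# The eight transactions of the example, as sets of names.
db = [
    {"Angela", "Ariel", "Edgar", "Steve", "Benny"},
    {"Edgar", "Hassan", "Steve", "Tommy"},
    {"Joe", "Sara", "Israel"},
    {"Steve", "Angela", "Edgar"},
    {"Benny", "Mahhmud", "Tommy"},
    {"Angela", "Sara", "Edgar"},
    {"Hassan", "Angela", "Joe", "Edgar", "Noa"},
    {"Edgar", "Benny", "Steve", "Tommy"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(lhs, rhs):
    """Fraction of transactions containing `lhs` that also contain `rhs`."""
    return support(lhs | rhs) / support(lhs)

support({"Angela", "Edgar"})                  # 0.5 (4 of 8)
confidence({"Edgar", "Steve"}, {"Angela"})    # 0.5 (2 of 4)
confidence({"Tommy"}, {"Edgar", "Steve"})     # 2/3
```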
Frequent Set Mining Crash Course
(You’re Qualified!)
• Widely used in market basket analysis, intrusion detection, Web usage mining and bioinformatics
• Aimed at discovering non-trivial or not necessarily intuitive relations between items/variables in large databases
“Extracting wisdom out of data”
• Who knows what the most famous frequent set is?
Big Mart’s Database
Modeling the Attacker’s Knowledge
• We believe that the attacker has prior
knowledge about the items in the original
domain
• The prior information regards the
frequencies of items in the original domain
• We capture the attacker’s knowledge with
“Belief Functions”
Examples of Belief Functions
Consistent Mapping
• Mapping anonymized entities to original
entities only according to the belief
function
Ignorant Belief Function (Q)
• What does the graph look like?
• What is the expected number of cracks?
• Suppose n items. Further suppose that we
are only interested in a partial group, of
size n1
• What is the expected number of cracks
now?
• Don’t you underestimate Edgar…
Ignorant Belief Function (A)
Compliant Point-Valued
Belief Function (Q)
• What does the graph look like?
• What is the expected number of cracks?
• Suppose n items. Further suppose that we
are only interested in a partial group, of
size n1
• What is the expected number of cracks
now?
• Unless he has an inner source, we shouldn’t overestimate Edgar either…
Compliant Point-Valued
Belief Function (A)
Compliant Interval
Belief Functions
• Direct Computation Method
– Build a graph G and its adjacency matrix A_G
– The probability of cracking k out of n items is expressed via the permanent of A_G
• Computing the permanent is known to be a #P-complete problem; the state-of-the-art approximation runs in time O(n^22) !!
• What the !#$!% is a permanent or #P-complete?
Permanent
• The permanent of an n×n matrix A is

  perm(A) = Σ_{σ∈S_n} ∏_{i=1}^{n} a_{i,σ(i)}

• The sum is over all permutations σ of 1, 2, …, n
• Calculating the permanent is #P-complete
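For intuition, the permanent can be computed directly from the definition, summing over all n! permutations (feasible only for tiny matrices; this brute-force sketch is ours, not from the paper):

```python
from itertools import permutations

def permanent(A):
    """Brute-force permanent: sum over all permutations sigma of
    prod_i A[i][sigma(i)].  O(n * n!) time, so only for tiny n."""
    n = len(A)
    total = 0
    for sigma in permutations(range(n)):
        prod = 1
        for i in range(n):
            prod *= A[i][sigma[i]]
        total += prod
    return total

permanent([[1, 2], [3, 4]])   # 1*4 + 2*3 = 10
```

Unlike the determinant (which differs only in the signs of the terms and is computable in polynomial time), no efficient exact algorithm is known.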
• Which brings us to…
#P-Complete
• Unlike the well-known complexity classes, which are classes of decision problems, this is a class of function problems
• "compute f(x)," where f is the number of
accepting paths of an NP machine
• Example
– NP: Are there any subsets of a list of integers that add
up to zero?
– #P: How many subsets of a list of integers add up to
zero?
Chain Belief Functions
Chain Belief Functions
Unfortunately…
• A general belief function does not always produce a chain…
• We seek a way to estimate the number of cracks.
The O-estimate Heuristic
• Suppose a graph G and an interval belief function β.
• For each x, let O_x denote the outdegree of x in G.
• The probability of cracking x is simply 1/O_x
• The expected number of cracks is Σ_x 1/O_x
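The O-estimate is a one-liner in Python; the outdegrees below are the twelve values used in the worked example later in the talk (this sketch is ours):

```python
def o_estimate(outdegrees):
    """Expected number of cracks: sum of 1/O_x over all items x."""
    return sum(1.0 / o for o in outdegrees if o > 0)

# Outdegrees from the end-to-end example: the sum comes to about 2.02.
outdegrees = [4, 7, 3, 4, 7, 9, 7, 9, 9, 7, 7, 7]
o_estimate(outdegrees)   # ~2.02
```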
Properties of O-estimate
• Inexact (hence “estimate”)
• Monotonic
-Compliant Belief Function
• Suppose we “somehow” know which items
are guessed wrong
• We sum the O-estimates only over the
compliant frequency groups
Risk Assessment
• Worst case / best case – unrealistic
• Determine the interval width
– Twice the median gap of all successive frequency groups
– Why?
• Determine the degree of compliancy
– Perform binary search, subject to a “degree of tolerance”.
End to End Example
• These intelligence call & meeting data records are classified “Top Secret”
TID  Names
1    Angela, Ariel, Edgar, Steve, Benny
2    Edgar, Hassan, Steve, Tommy
3    Joe, Sara, Israel
4    Steve, Angela, Edgar
5    Benny, Mahhmud, Tommy
6    Angela, Sara, Edgar
7    Hassan, Angela, Joe, Edgar, Noa
8    Edgar, Benny, Steve, Tommy
We Anonymize the Database
Anonymization mapping and item frequencies:

I         J    freq
Angela    1    4/8
Ariel     2    1/8
Edgar     3    6/8
Steve     4    4/8
Benny     5    3/8
Hassan    6    2/8
Tommy     7    3/8
Joe       8    2/8
Sara      9    2/8
Israel    10   1/8
Noa       11   1/8
Mahhmud   12   1/8

Anonymized transactions:

TID  Items
1    1, 2, 3, 4, 5
2    3, 6, 4, 7
3    8, 9, 10
4    4, 1, 3
5    5, 7, 12
6    1, 9, 3
7    6, 1, 8, 3, 11
8    3, 5, 4, 7
Frequency Groups
Frequency  Items
1/8        2, 10, 11, 12
2/8        6, 8, 9
3/8        5, 7
4/8        1, 4
6/8        3

• The gaps between the frequency groups: 1/8, 1/8, 1/8, 1/8, 2/8
• The median gap = 1/8
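The gap computation can be sketched in Python (our illustration; including the gap up from frequency 0 reproduces the five gaps listed above):

```python
from statistics import median

# Distinct frequencies of the frequency groups, in increasing order.
freqs = sorted({1/8, 2/8, 3/8, 4/8, 6/8})

# Gaps between successive groups, starting from 0: 1/8, 1/8, 1/8, 1/8, 2/8.
gaps = [b - a for a, b in zip([0] + freqs, freqs)]

median(gaps)   # 1/8, so the interval width is 2 * 1/8 = 2/8
```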
The Attacker’s Prior Knowledge
I         Frequency Group
Angela    3/8 – 5/8
Ariel     0 – 2/8
Edgar     5/8 – 7/8
Steve     3/8 – 5/8
Benny     2/8 – 4/8
Hassan    1/8 – 3/8
Tommy     2/8 – 4/8
Joe       1/8 – 3/8
Sara      1/8 – 3/8
Israel    0 – 2/8
Noa       0 – 2/8
Mahhmud   0 – 2/8
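The table above is each name’s true frequency widened by the median gap (1/8) on each side, clamped to [0, 1]. A minimal sketch of building these interval belief functions:

```python
GAP = 1/8   # the median gap between frequency groups

# True frequencies from the anonymization mapping.
true_freq = {"Angela": 4/8, "Ariel": 1/8, "Edgar": 6/8, "Steve": 4/8,
             "Benny": 3/8, "Hassan": 2/8, "Tommy": 3/8, "Joe": 2/8,
             "Sara": 2/8, "Israel": 1/8, "Noa": 1/8, "Mahhmud": 1/8}

# Each interval is (f - GAP, f + GAP), clamped to valid frequencies.
belief = {name: (max(0.0, f - GAP), min(1.0, f + GAP))
          for name, f in true_freq.items()}

belief["Edgar"]   # (0.625, 0.875), i.e. 5/8 - 7/8 as in the table
```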
The Graph, By the Way…
[Figure: a bipartite graph linking each original name (Angela, Ariel, Edgar, Steve, Benny, Hassan, Tommy, Joe, Sara, Israel, Noa, Mahhmud) to the anonymized items 1–12 compatible with that name’s belief interval.]
Calculating the Risk
• Oest = 1/4 + 1/7 + 1/3 + 1/4 + 1/7 + 1/9 + 1/7 + 1/9 + 1/9 + 1/7 + 1/7 + 1/7 ≈ 2.02
• Now it’s a question of how much you would tolerate...
• Note that this is the expected number of cracks. However, if we are interested in Edgar, as we’ve seen in previous lemmas, the probability of cracking him is 1/3.
Experiments
Open Problems
• The attacker’s prior knowledge remains a largely unsolved issue
• This article does not really deal with frequent sets but rather frequent items
– Frequent sets could add more information and differentiate objects within a frequency group
Modeling the Attacker’s Knowledge
in the Real World
• A report for the Canadian Privacy Commissioner contains a broad mapping of adversary knowledge
– Mapping phone directories
– CVs
– Inferring gender, year of birth and postal code from other details
– Data remnants on second-hand hard disks
– Etc.
All’s well that ends well
Bibliography
• Lakshmanan L., Ng R., Ramesh G. To Do or Not To Do: The
Dilemma of Disclosing Anonymized Data. ACM SIGMOD
Conference, 2005.
• Agrawal, R. and Srikant, R. 1994. Fast algorithms for mining
association rules. In Proc. 1994 Int. Conf. Very Large Data Bases
(VLDB’94), Santiago, Chile, pp. 487–499.
• Pan-Canadian De-Identification Guidelines for Personal Health
Information, Khaled El-Emam et al., April 2007.
• Wikipedia
– Association rule
– #P
– Permanent
Questions ?