Slide - Data Mining and Security Lab @ McGill

Transcript Slide - Data Mining and Security Lab @ McGill

Anonymizing Location-based data
Jarmanjit Singh
Jar_sing(at)encs.concordia.ca
Qing Shi
q_shi(at)encs.concordia.ca
Harpreet Sandhu
h_san(at)encs.concordia.ca
Benjamin Fung
fung(at)ciise.concordia.ca
Concordia Institute for Information Systems Engineering
Concordia University
Montreal, Quebec
Canada H3G 1M8
C3S2E-2009
The research is supported in part by the Discovery Grants (356065-2008) from Natural
Sciences and Engineering Research Council of Canada (NSERC).
Overview






RFID basics
RFID data publishing
Problem statement
Proposed algorithms
Evaluation
Conclusion
1
RFID basics
Tag
Radio frequency identification
Uses radio frequency (RF) to identify
(ID) objects.
Wireless technology
That allows a sensor (reader) to read,
from a distance, and without line of
sight, a unique electronic product code
(EPC) associated with a tag.
Reader
2
Data flow in RFID system
This is where we use
anonymiziation
algorithms to preserve
the privacy of data to be
published.
3
Motivating example
 For example, Alice has used her
RFID-based credit card at:
 Grocery store, Dental clinic, Shopping mall, Beer bar, Casino, AIDs clinic
etc.
Assume that Eve has seen Alice using her card at grocery
store and shopping mall.
However,
RFIDcompany
Company
publish
data
And while
therekeeping
is only
How can theif RFID
safeguard
theitsdata
privacy
one record containing
groceryRFID
store
shopping mall.
the released
dataand
useful?
Then Eve can immediately infer that this record belongs to Alice and can
also learn other locations visited by her.
4
RFID database
Person-Specific Path Table
Path
<EPC#; loc; time>
<EPC1; a; t1>
<EPC2; b; t1>
<EPC3; c; t2>
<EPC2; d; t2>
<EPC1; e; t2>
<EPC3; e; t4>
<EPC1; c; t3>
<EPC2; f; t3>
<EPC1; g; t4>
EPC1
< a1 e2  c3  g4 >
EPC2
< b1  d2  f3 >
EPC3
< c2  e4 >
<(loc1t1) … (locntn)>
where,
(lociti) is a pair indicating the location and time (called
transaction),
<(loc1t1) … (locntn)> is a path (called record)
RFID database
5
Attacker knowledge
If there is only record containing e4 and c7 then attacker can easily
Attacker knowledge: Suppose the adversary knows that
infer
that
this Alice,
record
Alice
also learn other
the
target
victim,
hasbelongs
visited eto
and
c at and
timecan
4 and
locations visited by Alice
7, respectively.
6
Problem statement
 We model attacker knowledge by I.
 Attacker can learn maximum of I transactions within
any record. Knowledge is constrained by “effort”
required to learn.
 We transform person-specific path database D to
(k,I)-anonymized database D’.
 Such that, no attacker having prior knowledge of m
transactions of a record r Є D and m ≤ I, can use his
knowledge to identify less than k records from D’.
7
Problem statement cont.
Assume, Attacker knowledge I=2 and, K value = 3
<e4c7>, r = 31
s = <d2f6>,
•
A table T satisfies (K,I)-anonymity if and only if r ≥ K for any
subsequence s with |s| ≤ I of any path in T, where r is the number of
records containing s and K is an anonymity threshold.
8
This is easy said but
how to transform
database D to version D’
that is immunized
against re-identification
attacks ?
9
Proposed method: Three steps
Pre-suppression
Firstly, we scan D to find items support < K.
And, delete them from D to get Dpre.
Generate subsets of size-i
We generate subsets of size-I from Dpre.
And, make additional scan to count their
support.
Add dummy records
We make infrequent subsets to be frequent
by using IF-anonymity algorithm.
10
Generate subsets of size-i
Subset generation
Increasing lexicographical order,
means we do not consider the reverse combinations of
transactions within a record.
The size of subsets generated should not exceed I.
Suppose,
I=2
{a1, d2}, {a1, b3}, {a1, e4}, {a1, f6}, {a1, c7}, {d2, b3}, {d2,
{b3,{d2,
e4},f6},
{b3,{d2,
f6},c7},
{b3,{b3,
e8},e4},
{e4,{b3,
f6}, f6},
{e4,{b3,
e8},c7},
{f6, {e4,
e8} f6},
e4},
{e4, c7}, {f6, c7}
Do this for all records!!
11
Count support
Count support for each subset.
Identify frequent and infrequent subsets.
Frequent
subsets
Infrequent
subsets
These subsets
have support
value < K value.
We need to add
dummy records to
make them (K,I)
anonymous
These subsets have support
value ≥ K value.
12
Pre-suppression: Example
Suppose,
k=3
Infrequent
Infrequent
subsets
subsets
Subsets
containing
‘a1’
14
What is dummy record?
Dummy records are fake records inserted in a database In
order to make infrequent subsets meet support value.
Some properties of adding dummy record:
Property 1: Length of dummy record should not exceed the
maximum length.
Property 2: The transactions within dummy record should
have reasonable time difference.
15
Process to add dummy record
Construct tree out of
infrequent subsets.
Two properties:
Null
e4: 3
b6: 2
d2: 1
Reasonable time
difference.
Length of dummy record.
b6: 1
c7: 2
g9: 1
e4: 1
a5: 1
we can get the minimum reasonable time difference between any two
locations either by learning from D or by using geographical databases
16
Divide tree if time conflict
 Rule 1: Let β is the set of nodes at level 1 of tree
 And ‘n’ be the node at which tree need to be divided.
 Let γ be the set of children nodes of ‘n’.
 If there exists an intersection α between β and γ, β ∩
γ = α ≠ ф.
 Let δ be the set of children nodes of α.
 And intersection |δ ∩ γ| ≥ |δ| / 2.
 We separate ‘n’ and α along with their children nodes (γ
and δ respectively) from original tree to construct different
tree.
17
Divide tree if large
 Count the number of nodes in each tree
except null.
 If any tree has nodes more than threshold.
 Divide tree again by taking ratio:
 Let X be the number of nodes in tree and X > λ
 Ratio: X / λ .
18
Divide tree Cont..
 Rule2: suppose number of nodes at level-1 of
tree are |1x|.
 And ratio: X / λ ≥ |1x|
 We divide tree for each node at level-1 and we
compute ratio again for each tree.
 And if ratio: X / λ < |1x|
 We divide tree at level-1 by combining nodes (at
level-1) having more intersecting children’s in one
tree.
19
Add dummy
 After having each tree satisfying X ≈ λ.
 We can write dummy record by following rule 3.
 Rule 3:
 let Xj be the set of nodes at level-i (initially i =1)
 And Xj+1 be the set of nodes at level-(i+1), .....,
 Xm be the set of nodes at level-m.
 All sets have their values in ascending order by
time. We get dummy record by taking Union
of (X1, X2 , .., Xm).
20
Recount support
 Dummy will also generate some subsets for which we
do not know the support.
 For ex, {a, b} , {b, c} are infrequent subsets and we added
dummy a  b  c. To make the frequent but there is also
one new subset {a, c} for which we don’t know the support
value.
 So, we generate subsets of size-I from dummies and
count support for each.
 We repeat IF-anonymity algorithm for new infrequent
subsets.
 Process stops when there is no infrequent subset.
21
22
Experimental evaluation:
Distortion vs. Dimensions
23
Distortion vs. Attacker knowledge
24
Distortion vs. Number of record
26
Conclusion
 Privacy in publishing high dimensional data
has become an important issue.
 We illustrate the treat of re-identification
attack caused by publishing RFID data.
 In this paper, we have proposed an efficient
scheme to (K,I)-anonymize high dimensional
data.
27
References
•
•
•
•
•
•
•
•
A. R. Beresford and F. Stajano. Location privacy in pervasive computing. IEEE
Pervasive Computing, 2003.
L. Sweeney. Achieving k-Anonymity Privacy Protection Using Generalization and
Suppression. International Journal on Uncertainty, Fuzziness and Knowledge-based
Systems, 10(5), 2002.
R. J. Bayardo and R. Agrawal. Data Privacy through Optimal k-Anonymization. In
IEEE ICDE, pages 217–228, 2005.
K. LeFevre, D. J. DeWitt, and R. Ramakrishnan. Incognito: Efficient Full-domain kAnonymity. In ACM SIGMOD, pages 49–60, 2005.
K. LeFevre, D. J. DeWitt, and R. Ramakrishnan. Mondrian Multidimensional kAnonymity. In IEEE ICDE, 2006.
C. C. Aggarwal and P. S. Yu. A Condensation Based approach to Privacy Preserving
Data Mining. In EDBT, pages 183-199, 2004.
L. Sweeney. k-Anonymity: A Model for Protecting Privacy. International Journal of
Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 10, no. 5, pages 557–
570, 2002.
A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam. l-Diversity:
Privacy beyond k-Anonymity. In IEEE ICDE, 2006.
28
References cont.
•
•
•
•
•
•
•
C. C. Aggarwal. On k-Anonymity and the Curse of Dimensionality. In VLDB, pages
901–909, 2005.
J. Xu, W. Wang, J. Pei, X. Wang, B. Shi, and A. Fu, Utility-Based anonymization Using
Local Recoding. In ACM SIGKDD, 2006.
X. Xiao and Y. Tao. Anatomy: Simple and Effective Privacy Preservation. In VLDB,
2006.
Y. Xu, B. C. M. Fung, K. Wang, A. W. C. Fu, and J. Pei. Publishing sensitive
transactions for itemset utility. In IEEE ICDM, pages 1109-1114, December 2008.
G. Ghinita, Y. Tao, P. Kalnis. On the anonymization of sparse high-dimensional data.
In IEEE ICDE, 2008.
M. Terrovitis, N. Mamoulis and P. Kalnis. Anonymity in unstructured data. Technical
Report, Hong Kong University, 2008.
J. Han and M. Kamber. Data mining: Concepts and Techniques. The Morgan
Kaufmann series in Data Management Systems, Jim Gray, Series Editor Morgan
Kaufmann Publishers, March 2006. ISBN 1-55860-901-6
29
References cont.
•
•
•
•
•
•
•
•
B. C. M. Fung, K. Wang, R. Chen, and P. S. Yu. Privacy-preserving data publishing: a
survey on recent developments. ACM Computing Surveys, 2010.
B. C. M. Fung, K. Wang, L. Wang, and P. C. K. Hung. Privacy-preserving data
publishing for cluster analysis. Data & Knowledge Engineering, 2009.
N. Mohammed, B. C. M. Fung, P. C. K. Hung, and C. K. Lee. Anonymizing healthcare
data: a case study on the Red Cross. In ACM SIGKDD, June 2009.
B. C. M. Fung, K. Wang, and P. S. Yu. Anonymizing classification data for privacy
preservation. IEEE (TKDE), 19(5):711-725, May 2007.
K. Wang, B. C. M. Fung, and P. S. Yu. Handicapping attacker's confidence: an
alternative to k-anonymization. Knowledge and Information Systems (KAIS),
11(3):345-368, April 2007. Springer-Verlag.
B. C. M. Fung, K. Wang, A. W. C. Fu, and J. Pei. Anonymity for continuous data
publishing. In EDBT, pages 264-275, March 2008.
K. Wang and B. C. M. Fung. Anonymizing sequential releases. In ACM SIGKDD,
pages 414-423, August 2006. DOI=
http://www.cs.sfu.ca/~wangk/pub/WF06kdd.pdf
B. C. M. Fung, K. Wang, and P. S. Yu. Top-down specialization for information and
privacy preservation. In IEEE ICDE, pages 205-216, April 2005.
30
Thank you
?
32

Slide - Data Mining and Security Lab @ McGill

Transcript Slide - Data Mining and Security Lab @ McGill

Directory