Finding Personally Identifying Inforamtion

Download Report

Transcript Finding Personally Identifying Inforamtion

Finding Personally Identifying
Information
Mark Shaneck
CSCI 5980
May 10, 2004
Data, Data, Everywhere…
Lot’s of data
 Let’s mine it and find interesting things!
 But also maintain people’s privacy…

Privacy Preserving Data Mining
Randomization
 Cryptographic protocols
 K-anonymization

Randomization Techniques
Insert random ‘noise’ into the data to mask
actual values
 Compute a function to recover the original
data distribution from the randomized
values
 Not always secure – random noise can be
filtered in certain circumstances to
accurately estimate original data values

Cryptographic Techniques
Utilize cryptographic techniques, such as
oblivious transfer, to execute data mining
algorithms between two parties without
revealing data to each other
 E.g. Amazon.com and Barnes & Noble

K-anonymity
For a given set of attribute values in a
record, these same values must also occur in
at least k-1 other records for the data to be
considered k-anonymous
 Idea developed by Latanya Sweeney, who
has also developed a tool Datafly, which
will perform generalizations and
suppressions to achieve the k-anonymity

Generalization and Suppression
Generalization
 Usually done for an entire attribute
 Basically make the data less precise
 Taxonomy tree for categorical data
 Discretization of continuous, numerical
values
 Suppression – remove data (either record or
a single value)

What Should We Generalize?




Most approaches assume that you know what
columns to generalize from domain knowledge
What if you choose too little? Re-identification
What if you choose too much? Data loss
Is there a way to get some more insight into the
data and what makes it not k-anonymous?
What Should We Generalize?
Statistics-Netherlands uses a tool called margus, which checks all combinations of
columns up to size 3
 Obvious extension: check all combinations
of columns
 Too huge of a search space! How can we
reduce it?

A Priori Algorithm




General data mining definition: For sets X and Y,
if X  Y, then Support(X)  Support(Y)
For k-anonymity: For a set of attributes X and Y, if
X  Y, and X is not k-anonymous, then Y is not kanonymous
Use this approach to see which rows cause kanonymity to fail for which combinations of
columns
This info can be used to make more informed
decisions when using generalization and
suppression techniques
Experiment



Implemented the method and tested it on the
“Adult Database” from the UCI Machine Learning
Repository
Contains 32,561 records with 16 attributes:
 age, workclass, fnlwgt, education,
education-num, marital-status, occupation,
relationship, race, sex, capital-gain, capitalloss, hours-per-week, native-country
K value was 3
Results



Discovered some not-immediately obvious single
attributes: age (a few old people), hours-per-week (a few
work-aholics), native country (one person from the
Netherlands)
Many combinations of 2 attributes:
 workclass & gender identified 2 records with values
(“Never-worked”, “Female”)
 Occupation & race identified 1 record with value
(“Armed-Forces”, “Black”)
Due to A Priori, did not find any combinations greater than
2 that failed ({Sex, Education, Education-Num} was the
largest combination that remained k-anonymous)
Conclusion and Future Work
Conclusions
 Applying A Prioiri can provide some
insight into the data to be anonymized
 Future work
 Further testing on various data sets
 Deeper analysis of results of algorithm
and how to best apply them in
anonymization
