Finding Personally Identifying Information
Finding Personally Identifying
Information
Mark Shaneck
CSCI 5980
May 10, 2004
Data, Data, Everywhere…
Lots of data
Let’s mine it and find interesting things!
But also maintain people’s privacy…
Privacy Preserving Data Mining
Randomization
Cryptographic protocols
K-anonymization
Randomization Techniques
Insert random ‘noise’ into the data to mask
actual values
Compute a function to recover the original
data distribution from the randomized
values
Not always secure – random noise can be
filtered in certain circumstances to
accurately estimate original data values
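The randomization idea can be sketched in a few lines, using hypothetical survey data: each value is masked with zero-mean noise, so individual entries are distorted while aggregate statistics of the original distribution can still be estimated. This also hints at the weakness above: with enough values, the noise averages away.

```python
import random

random.seed(0)  # deterministic for the example

# Hypothetical sensitive values, e.g. ages collected in a survey.
ages = [23, 35, 41, 29, 52, 44, 38, 61, 27, 33]

# Mask each value with zero-mean random noise drawn from a known range.
NOISE_RANGE = 10
noisy = [a + random.uniform(-NOISE_RANGE, NOISE_RANGE) for a in ages]

# Individual values are distorted, but because the noise has zero mean,
# aggregates of the original distribution can still be estimated.
true_mean = sum(ages) / len(ages)
est_mean = sum(noisy) / len(noisy)
print(round(true_mean, 1), round(est_mean, 1))
```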
Cryptographic Techniques
Utilize cryptographic techniques, such as
oblivious transfer, to execute data mining
algorithms between two parties without
revealing data to each other
E.g. Amazon.com and Barnes & Noble
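To give the flavor of such protocols, here is a toy two-party set-intersection sketch built on a commutative cipher. This is NOT the oblivious-transfer machinery the slide refers to, and the modulus, keys, and item IDs below are illustrative, not cryptographically secure.

```python
# Toy commutative encryption: E_k(x) = x^k mod P.
P = 2**61 - 1  # a Mersenne prime used as a toy modulus

def enc(x, key):
    # Exponentiation commutes: enc(enc(x, a), b) == enc(enc(x, b), a)
    return pow(x, key, P)

alice_items = {101, 202, 303}  # hypothetical item IDs (e.g. hashed records)
bob_items = {202, 303, 404}
alice_key, bob_key = 17, 19    # toy exponents, coprime to P - 1

# Each party encrypts its own items and exchanges them; the other party
# encrypts again. Doubly encrypted values match exactly when the
# underlying items match, so the parties learn the overlap without
# revealing their non-matching items.
alice_double = {enc(enc(x, alice_key), bob_key) for x in alice_items}
bob_double = {enc(enc(x, bob_key), alice_key) for x in bob_items}
print(len(alice_double & bob_double))  # number of items in common
```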
K-anonymity
For a given set of attribute values in a
record, these same values must also occur in
at least k-1 other records for the data to be
considered k-anonymous
Idea developed by Latanya Sweeney, who
also developed a tool, Datafly, which
performs generalizations and
suppressions to achieve k-anonymity
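The definition translates directly into a check over a table, sketched here on a hypothetical, already-generalized toy table: the data is k-anonymous over a set of quasi-identifier attributes when every combination of their values occurs at least k times.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier value combination
    occurs in at least k records."""
    counts = Counter(tuple(r[a] for a in quasi_identifiers)
                     for r in records)
    return all(c >= k for c in counts.values())

# Hypothetical toy table.
rows = [
    {"age": "30-39", "zip": "551**", "disease": "flu"},
    {"age": "30-39", "zip": "551**", "disease": "cold"},
    {"age": "30-39", "zip": "551**", "disease": "flu"},
    {"age": "40-49", "zip": "552**", "disease": "flu"},
]
# The last row's (age, zip) pair is unique, so the table is not
# 3-anonymous over those attributes.
print(is_k_anonymous(rows, ["age", "zip"], 3))  # False
```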
Generalization and Suppression
Generalization
Usually done for an entire attribute
Basically make the data less precise
Taxonomy tree for categorical data
Discretization of continuous, numerical
values
Suppression – remove data (either record or
a single value)
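Both operations can be sketched in a few lines (the column names, bucket width, and suppression marker are illustrative choices, not from the talk): discretization makes a numeric value less precise, truncation generalizes a code, and suppression removes a value outright.

```python
def generalize_age(age, width=10):
    """Discretize a numeric age into a fixed-width range."""
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

def generalize_zip(zipcode, digits_kept=3):
    """Generalize a ZIP code by suppressing its trailing digits."""
    return zipcode[:digits_kept] + "*" * (len(zipcode) - digits_kept)

def suppress(record, attribute):
    """Suppress a single value by replacing it with a marker."""
    return {**record, attribute: "*"}

print(generalize_age(37))       # 30-39
print(generalize_zip("55455"))  # 554**
print(suppress({"age": 37, "zip": "55455"}, "age"))
```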
What Should We Generalize?
Most approaches assume that you know what
columns to generalize from domain knowledge
What if you choose too little? Re-identification
What if you choose too much? Data loss
Is there a way to get some more insight into the
data and what makes it not k-anonymous?
What Should We Generalize?
Statistics Netherlands uses a tool called μ-Argus, which checks all combinations of
columns up to size 3
Obvious extension: check all combinations
of columns
Far too large a search space! How can we
reduce it?
Apriori Algorithm
General data mining property: for item sets X and Y,
if X ⊆ Y, then Support(X) ≥ Support(Y)
For k-anonymity: for attribute sets X and Y, if
X ⊆ Y and X is not k-anonymous, then Y is not
k-anonymous
Use this approach to see which rows cause
k-anonymity to fail for which combinations of
columns
This info can be used to make more informed
decisions when using generalization and
suppression techniques
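The pruning idea above can be sketched as a level-wise search, shown here on a hypothetical toy table (the helper names and toy data are illustrative, not the talk's implementation): any superset of a failing attribute set must also fail, so only k-anonymous sets are extended to the next size.

```python
from collections import Counter

def is_k_anonymous(records, attrs, k):
    """True if every value combination over attrs occurs >= k times."""
    counts = Counter(tuple(r[a] for a in attrs) for r in records)
    return all(c >= k for c in counts.values())

def find_violations(records, attributes, k, max_size=3):
    """Apriori-style search for minimal attribute sets that break
    k-anonymity: supersets of a failing set are pruned, because
    adding attributes can only split groups further."""
    violating = []
    candidates = [frozenset([a]) for a in attributes]
    for _ in range(max_size):
        survivors = []
        for cand in candidates:
            if is_k_anonymous(records, sorted(cand), k):
                survivors.append(cand)
            else:
                violating.append(cand)
        # Extend only the surviving sets, and skip any extension that
        # contains a known violating set (it must also fail).
        extended = {c | {a} for c in survivors
                    for a in attributes if a not in c}
        candidates = [c for c in extended
                      if not any(v <= c for v in violating)]
    return violating

# Toy table: each attribute alone is 2-anonymous, but the pair is not.
rows = [
    {"age": "30", "sex": "F"},
    {"age": "30", "sex": "M"},
    {"age": "40", "sex": "F"},
    {"age": "40", "sex": "M"},
]
print(find_violations(rows, ["age", "sex"], k=2))
```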
Experiment
Implemented the method and tested it on the
“Adult Database” from the UCI Machine Learning
Repository
Contains 32,561 records with 14 attributes:
age, workclass, fnlwgt, education,
education-num, marital-status, occupation,
relationship, race, sex, capital-gain, capital-loss, hours-per-week, native-country
K value was 3
Results
Discovered some not immediately obvious single
attributes: age (a few very old people), hours-per-week (a few
workaholics), native-country (one person from the
Netherlands)
Many combinations of 2 attributes:
workclass & gender identified 2 records with values
(“Never-worked”, “Female”)
Occupation & race identified 1 record with value
(“Armed-Forces”, “Black”)
Due to Apriori pruning, did not find any combinations greater than
2 that failed ({Sex, Education, Education-Num} was the
largest combination that remained k-anonymous)
Conclusion and Future Work
Conclusions
Applying Apriori can provide some
insight into the data to be anonymized
Future work
Further testing on various data sets
Deeper analysis of results of algorithm
and how to best apply them in
anonymization