Anonymity and Privacy Issues: Re-identification
Yimeng Zhang
12/4/07
Index
• Views on Privacy of Social Media
• Overview of Re-identification
• You are What You Say: Privacy Risks of Public
Mentions, Frankowski et al. SIGIR06
Improper Use of Personal Information Online
Top Privacy Concerns
Remaining Anonymous
True Information Provided While Registering
Ability to Remain Anonymous
Importance of Controlling Personal Information
Specifying Who Can View Personal Information
Conclusion
• Around 40% of people would like to remain
anonymous on social media or social networking
sites
• Most people provide their true personal
information when registering
• Most people think it is important to have
control over their personal information online
Re-identification techniques can identify the users in a supposedly anonymous dataset
Privacy Loss through Re-identification
• Re-identification: linking datasets that have
explicit identifiers to datasets that lack them,
through attributes the two have in common
People wish to keep private
• Datasets without explicit identifiers
– Public data that users have made anonymous
– Public data released by research groups (after suitable
anonymization)
– Public data from government agencies (e.g., census data)
Example of Re-identification
Medical data made public by the Group Insurance
Commission of Massachusetts
Voter registration list of Massachusetts,
purchased for only $20
87% of the 1990 U.S. population are likely to be
uniquely identified based only on ZIP code, birth date, and
sex
Sweeney, 2002
The Rebus Form
Anonymized medical data + voter registration list = Governor’s medical records!
From Frankowski, SIGIR06
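The linkage in this example can be sketched as a join on the shared attributes. This is a minimal illustration only: the records, the field names (`zip`, `birth`, `sex`, `diagnosis`) and the `link` helper are all hypothetical, not taken from Sweeney's study.

```python
# Toy records; every name and value here is made up for illustration.
medical = [
    {"zip": "02138", "birth": "1945-07-31", "sex": "M", "diagnosis": "condition A"},
    {"zip": "02139", "birth": "1962-01-15", "sex": "F", "diagnosis": "condition B"},
    {"zip": "02139", "birth": "1962-01-15", "sex": "F", "diagnosis": "condition C"},
]
voters = [
    {"name": "Voter One", "zip": "02138", "birth": "1945-07-31", "sex": "M"},
    {"name": "Voter Two", "zip": "02139", "birth": "1962-01-15", "sex": "F"},
]

# Attributes present in both datasets (assumed field names).
QUASI_IDENTIFIERS = ("zip", "birth", "sex")


def key(record):
    """Build the linking key from the shared attributes."""
    return tuple(record[attr] for attr in QUASI_IDENTIFIERS)


def link(anonymous, identified):
    """Pair each named person with an anonymous record, but only when
    exactly one anonymous record carries the same key."""
    groups = {}
    for rec in anonymous:
        groups.setdefault(key(rec), []).append(rec)
    matches = []
    for person in identified:
        candidates = groups.get(key(person), [])
        if len(candidates) == 1:
            matches.append((person["name"], candidates[0]))
    return matches


linked = link(medical, voters)
for name, rec in linked:
    print(name, "->", rec["diagnosis"])
```

In this toy data only "Voter One" is linked; "Voter Two" shares a key with two medical records, so the match stays ambiguous.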
Example of face identification
Profiles with explicit identifiers
Profiles without explicit identifiers
Friendster
Facebook
Identity violation!
Face Recognizer
Gross and Acquisti, WPES 05
You Are What You Say: Privacy
Risks of Public Mentions
Dan Frankowski, Dan Cosley, Shilad
Sen, Loren Terveen, John Riedl
University of Minnesota
SIGIR 2006
Main Idea
• People can be identified by their preferences
and what they talk about
– Reviews of books, movies, songs
– Mentions on forums or blogs
– Friend list on Facebook
– Wish or purchase list on Amazon
• Method for Re-identification
– Datasets are represented in Sparse Relation Spaces
– Re-identification can be done by matching two Sparse
Relation Spaces
Sparse Relation Space
• Relates people to items
• Sparse: few relationships are recorded
per person
• A dataset that can be
represented in a sparse
relation space is
vulnerable
[Figure: matrix relating people (p1, p2, p3, …) to items (i1, i2, i3, …); an X marks each recorded relationship]
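One possible in-memory form of a sparse relation space is a mapping from each person to the set of items they are related to. The sketch below assumes this representation; the helper names and the toy (person, item) pairs are hypothetical and are not the positions of the X marks in the slide's figure.

```python
from collections import defaultdict


def build_relation_space(pairs):
    """Map each person to the set of items they are related to."""
    space = defaultdict(set)
    for person, item in pairs:
        space[person].add(item)
    return dict(space)


def density(space, items):
    """Fraction of all possible person-item cells that are recorded."""
    total = len(space) * len(items)
    recorded = sum(len(related) for related in space.values())
    return recorded / total if total else 0.0


# Toy data: one recorded relationship per person (made-up placement).
pairs = [("p1", "i1"), ("p2", "i2"), ("p3", "i3")]
items = {"i1", "i2", "i3"}
space = build_relation_space(pairs)
d = density(space, items)
print(space, d)
```

Storing only the recorded relationships keeps memory proportional to the number of X marks rather than to people times items, which matters when each person touches only a few items.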
Research Questions
• Risks of dataset release
– What are the risks to user privacy when
releasing a dataset?
• Altering the dataset
– How can dataset owners alter the dataset to
preserve user privacy?
• Self defense
– How can users protect their own privacy?
Experiment Dataset: MovieLens
Dataset 1: Movie Ratings
Users have not agreed to reveal these
Released for research use
“Anonymous Dataset”
Dataset 2: Movie Reviews
Public
Feature of the dataset
• Both ratings and
mentions follow a
power law
• Important feature for
real-world sparse
relation spaces
[Figure: number of ratings of an item by percentile; x-axis: item percentile (0% to 100%), y-axis: number of ratings (0 to 60000). Frankowski, SIGIR 06]
Evaluation Measure
Input: mentions by user t, plus the ratings dataset
Re-identification algorithm
Output: top k users from the ratings dataset, ranked by the
likelihood that they are user t
k-identified: t is among the k users returned by the algorithm
k-identification rate: the fraction of users who are k-identified
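The evaluation measure can be sketched as follows, assuming each target user t already has a ranked candidate list produced by some re-identification algorithm. The function names and the toy rankings are hypothetical.

```python
def is_k_identified(target, ranked_users, k):
    """True if the target appears among the first k ranked candidates."""
    return target in ranked_users[:k]


def k_identification_rate(rankings, k):
    """Fraction of target users who are k-identified.

    rankings maps each target user to the ranked candidate list that
    the re-identification algorithm returned for that user's mentions.
    """
    if not rankings:
        return 0.0
    hits = sum(
        1 for target, ranked in rankings.items()
        if is_k_identified(target, ranked, k)
    )
    return hits / len(rankings)


# Toy rankings (made up): u1 is ranked first, u2 second, u3 third.
rankings = {
    "u1": ["u1", "u2", "u3"],
    "u2": ["u3", "u2", "u1"],
    "u3": ["u1", "u2", "u3"],
}
rate_1 = k_identification_rate(rankings, 1)
rate_2 = k_identification_rate(rankings, 2)
print(rate_1, rate_2)
```

With k = 1 the measure counts only exact top-ranked hits, which is the 1-identification rate.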
Set Intersection Algorithm
for Re-identification
• Likely list: users in the rating database who
have rated every movie mentioned by user t
• Problem
– Users sometimes mention movies they have not rated, so the true user can be missing from the likely list
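A minimal sketch of the set intersection idea, assuming the rating database is a mapping from user to the set of movies that user rated. The data and the name `likely_list` as a function are hypothetical.

```python
def likely_list(mentions, ratings):
    """Users who have rated every mentioned movie (subset test)."""
    return sorted(
        user for user, rated in ratings.items() if mentions <= rated
    )


# Toy rating database (made up).
ratings = {
    "u1": {"A", "B", "C"},
    "u2": {"A", "B"},
    "u3": {"B", "C", "D"},
}

# Mentions that every candidate must have rated.
both = likely_list({"A", "B"}, ratings)

# The problem case: suppose u1 mentions "D" without having rated it.
# u1 then fails the subset test and drops out of the likely list.
missing = likely_list({"A", "D"}, ratings)
print(both, missing)
```

A single unrated mention is enough to exclude the true user, which is why the later algorithms score candidates instead of filtering them.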
TF-IDF Algorithm
• Mentions of a user: vector of the movies
the user mentioned
• Ratings of a user: vector of the movies the
user rated
• Likelihood: TF-IDF cosine similarity
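The slide does not give the exact weighting, so the sketch below makes assumptions: term frequency is binary (a movie is either present or not), and the inverse document frequency of a movie is log(N / number of users who rated it). The helper names and the data are hypothetical.

```python
import math


def idf_weights(ratings):
    """IDF per movie: log(number of users / number of users who rated it)."""
    n = len(ratings)
    doc_freq = {}
    for rated in ratings.values():
        for movie in rated:
            doc_freq[movie] = doc_freq.get(movie, 0) + 1
    return {movie: math.log(n / count) for movie, count in doc_freq.items()}


def tfidf_vector(movies, idf):
    """Binary TF times IDF; movies unseen in the ratings get weight 0."""
    return {movie: idf.get(movie, 0.0) for movie in movies}


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(movie, 0.0) for movie, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_users(mentions, ratings):
    """Rank rating users by similarity to the mention vector."""
    idf = idf_weights(ratings)
    mention_vec = tfidf_vector(mentions, idf)
    scores = {
        user: cosine(mention_vec, tfidf_vector(rated, idf))
        for user, rated in ratings.items()
    }
    return sorted(scores, key=lambda user: (-scores[user], user))


# Toy data (made up): everyone rated "Popular", only u1 rated "Rare".
ratings = {
    "u1": {"Popular", "Rare"},
    "u2": {"Popular"},
    "u3": {"Popular", "Other"},
}
ranked = rank_users({"Popular", "Rare"}, ratings)
print(ranked)
```

Because a movie rated by every user gets an IDF of zero, only the rarely rated mention contributes to the similarity in this toy case.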
Scoring Algorithm
Scoring:
• Emphasize mentions of rarely rated movies
• De-emphasize the number of ratings a user has
Score for one mention m of a user:
Fraction of users who
have not rated m
Score for a user:
Product of the scores for all mentions of this user
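The scoring rule above can be sketched as follows. The formula on the original slide did not survive, so two points here are assumptions: the per-mention score is taken literally as the fraction of users who have not rated the movie, and a candidate who has not rated a mentioned movie receives a small constant `miss_score` instead (a hypothetical parameter, set to 0.05 for illustration).

```python
def mention_score(user, movie, ratings, miss_score=0.05):
    """Score one mention for one candidate user.

    If the candidate rated the movie, the score is the fraction of users
    who have NOT rated it, so rarely rated movies score higher. Otherwise
    the score is miss_score (an assumed small constant).
    """
    if movie not in ratings[user]:
        return miss_score
    n = len(ratings)
    raters = sum(1 for rated in ratings.values() if movie in rated)
    return (n - raters) / n


def user_score(user, mentions, ratings, miss_score=0.05):
    """Product of the per-mention scores over all mentions."""
    score = 1.0
    for movie in mentions:
        score *= mention_score(user, movie, ratings, miss_score)
    return score


def rank_by_score(mentions, ratings):
    """Candidates ordered from most to least likely."""
    return sorted(
        ratings, key=lambda user: (-user_score(user, mentions, ratings), user)
    )


# Toy rating database (made up).
ratings = {
    "u1": {"Rare", "Common"},
    "u2": {"Common"},
    "u3": {"Common", "Other"},
    "u4": {"Other"},
}
mentions = {"Rare", "Common"}
ranked = rank_by_score(mentions, ratings)
print(ranked, user_score("u1", mentions, ratings))
```

Unlike set intersection, a missed mention only lowers a candidate's score rather than eliminating the candidate, and the score does not grow with how many other movies the candidate has rated.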
Scoring Algorithm with Ratings
• Suppose we have a magic analyzer that can guess the
rating of a movie from the mention
– E.g., using the context of that mention
• Algorithms
– ExactRating: the analyzer can perfectly determine the rating
– FuzzyRating: the analyzer can guess the rating value within +/-1
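One way to fold guessed ratings into the scoring idea is to count a mention as matched only when the candidate's stored rating agrees with the guess. The sketch below is an assumption about how that could work, not the paper's formula: `tolerance=0` stands in for the exact variant, `tolerance=1` for the +/-1 variant, and `miss_score` is a hypothetical constant for unmatched mentions.

```python
def rating_matches(actual, guessed, tolerance):
    """True if the stored rating is within tolerance of the guess."""
    return abs(actual - guessed) <= tolerance


def mention_score(user, movie, guessed, ratings, tolerance, miss_score=0.05):
    """Fraction of users WITHOUT a matching rating, if the candidate matches;
    otherwise the assumed constant miss_score."""
    actual = ratings[user].get(movie)
    if actual is None or not rating_matches(actual, guessed, tolerance):
        return miss_score
    n = len(ratings)
    matching = sum(
        1 for rated in ratings.values()
        if movie in rated and rating_matches(rated[movie], guessed, tolerance)
    )
    return (n - matching) / n


def user_score(user, guessed_ratings, ratings, tolerance):
    """Product of per-mention scores over all guessed (movie, rating) pairs."""
    score = 1.0
    for movie, guessed in guessed_ratings.items():
        score *= mention_score(user, movie, guessed, ratings, tolerance)
    return score


# Toy data (made up): user -> {movie: rating on a 1 to 5 scale}.
ratings = {
    "u1": {"A": 5, "B": 2},
    "u2": {"A": 4, "B": 2},
    "u3": {"A": 1},
    "u4": {"B": 5},
}
guessed = {"A": 5, "B": 2}  # what the analyzer inferred from the mentions

exact_u1 = user_score("u1", guessed, ratings, tolerance=0)
exact_u2 = user_score("u2", guessed, ratings, tolerance=0)
fuzzy_u1 = user_score("u1", guessed, ratings, tolerance=1)
fuzzy_u2 = user_score("u2", guessed, ratings, tolerance=1)
print(exact_u1, exact_u2, fuzzy_u1, fuzzy_u2)
```

In this toy data the exact variant separates u1 from u2, while the +/-1 variant scores them equally because u2's rating of 4 falls within the tolerance.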
Percent of users identified by
different algorithms
[Figure: 1-identification rate of each algorithm]
RQ2: Altering the dataset
• How can dataset owners alter the dataset
they release to preserve user privacy?
• Data Suppression
– Algorithm: drop rarely rated movies
– Not a big problem for industry, but harmful for
research
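Dropping rarely rated movies can be sketched as a filter on rater counts. The threshold parameter `min_raters` and the data are hypothetical; the slides do not state what cutoff was used.

```python
def suppress_rare_items(ratings, min_raters):
    """Return a copy of the dataset without movies that have fewer
    than min_raters raters."""
    counts = {}
    for rated in ratings.values():
        for movie in rated:
            counts[movie] = counts.get(movie, 0) + 1
    keep = {movie for movie, count in counts.items() if count >= min_raters}
    return {user: rated & keep for user, rated in ratings.items()}


# Toy rating database (made up).
ratings = {
    "u1": {"Rare", "Common"},
    "u2": {"Common"},
    "u3": {"Common", "Other"},
    "u4": {"Other"},
}
released = suppress_rare_items(ratings, min_raters=2)
print(released)
```

The original mapping is left untouched, so the owner can compare the released and full versions.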
Dataset-level Suppression
Does not work!
RQ3: Self Defense
• How can users protect their own privacy?
• Suppression
– Do not mention rarely rated movies
• Misdirection
– Mention items they have not rated
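A user applying misdirection needs to pick items they have not rated. The sketch below chooses the most frequently rated such items, on the assumption that popular items make the most useful decoys; the function name and the data are hypothetical.

```python
from collections import Counter


def misdirection_items(own_rated, ratings, how_many):
    """Pick the most-rated items that this user has NOT rated."""
    counts = Counter(
        movie for rated in ratings.values() for movie in rated
    )
    candidates = [movie for movie in counts if movie not in own_rated]
    # Most popular first; ties broken alphabetically for determinism.
    candidates.sort(key=lambda movie: (-counts[movie], movie))
    return candidates[:how_many]


# Toy rating database (made up).
ratings = {
    "u1": {"Rare"},
    "u2": {"Hit", "Mid"},
    "u3": {"Hit", "Mid"},
    "u4": {"Hit"},
}
decoys_1 = misdirection_items(ratings["u1"], ratings, 1)
decoys_2 = misdirection_items(ratings["u1"], ratings, 2)
print(decoys_1, decoys_2)
```

In practice a user cannot see the rating database, so popularity would have to be estimated from public signals; the direct count here is only for illustration.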
User-level Suppression
Does not work!
Misdirection
Works when users mention popular items
Conclusion
• Simple data mining algorithms can identify
users who mention items in a sparse relation
space and think they are anonymous
– Uses of the algorithms: e.g., finding paper reviewers
(future work of Frankowski et al.)
– Privacy risks for users on social media sites
• Privacy is hard to preserve
– Do not reveal private information, even if the setting
seems to be anonymous