Privacy Risks of Public Mentions


Transcript: Privacy Risks of Public Mentions

Do You Trust Your Recommender?
An Exploration of Privacy and Trust in
Recommender Systems
Dan Frankowski, Dan Cosley, Shilad Sen,
Tony Lam, Loren Terveen, John Riedl
University of Minnesota
Story: Finding “Subversives”
"…few things tell you as much about a person as the books he chooses to read."
– Tom Owad, applefritter.com
Session Outline
- Exposure: undesired access to a person's information
  - Privacy Risks
  - Preserving Privacy
- Bias and Sabotage: manipulating a trusted system to manipulate users of that system
Why Do I Care?
- As a businessperson
  - The nearest competitor is one click away
  - Lose your customers' trust and they will leave
  - Lose your credibility and they will ignore you
- As a person
  - Let's not build Big Brother
Risk of Exposure in One Slide
[Diagram: Public Dataset + Private Dataset + algorithms = YOU, with your private data linked!]
Seems bad. How can privacy be preserved?
movielens.org
- Started ~1995
- Users rate movies ½ to 5 stars
- Users get recommendations
- Private: no one outside GroupLens can see a user's ratings

Anonymized Dataset
- Released 2003
- Ratings, some demographic data, but no identifiers
- Intended for research
- Public: anyone can download

movielens.org Forums
- Started June 2005
- Users talk about movies
- Public: on the web, no login to read
- Can forum users be identified in our anonymized dataset?
Research Questions
- RQ1: RISKS OF DATASET RELEASE: What are risks to user privacy when releasing a dataset?
- RQ2: ALTERING THE DATASET: How can dataset owners alter the dataset they release to preserve user privacy?
- RQ3: SELF DEFENSE: How can users protect their own privacy?
Motivation: Privacy Loss
- MovieLens forum users did not agree to reveal ratings
- Anonymized ratings + public forum data = privacy violation?
- More generally: dataset 1 + dataset 2 = privacy risk?
  - What kind of datasets?
  - What kinds of risks?
Vulnerable Datasets
- We talk about datasets from a sparse relation space (see the sketch below), which:
  - Relates people to items
  - Is sparse (few relations per person out of all possible relations)
  - Has a large space of items
[Diagram: matrix of people (p1, p2, p3, …) by items (i1, i2, i3, …) with only a few cells marked X]
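In code, such a space is naturally a mapping from each person to the small set of items they relate to. A minimal illustrative sketch; the variable names and toy data are mine, not from the study:

```python
# A sparse relation space: each person relates to only a handful of items
# drawn from a very large item catalog. A dict of sets is a natural encoding.
relations = {
    "p1": {"i1", "i3"},   # person p1 relates to items i1 and i3
    "p2": {"i2"},
    "p3": {"i3"},
}
ITEM_SPACE_SIZE = 10_000  # large space of items (illustrative)

for person, items in relations.items():
    print(f"{person}: {len(items)} of {ITEM_SPACE_SIZE} possible relations "
          f"({len(items) / ITEM_SPACE_SIZE:.4%})")
```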
Example Sparse Relation Spaces
- Examples
  - Customer purchase data from Target
  - Songs played from iTunes
  - Articles edited in Wikipedia
  - Books/Albums/Beers… mentioned by bloggers or on forums
  - Research papers cited in a paper (or review)
  - Groceries bought at Safeway
  - …
- We look at movie ratings and forum mentions, but there are many sparse relation spaces
Risks of re-identification
- Re-identification is matching a user in two datasets by using some linking information (e.g., name and address, or movie mentions)
- Re-identifying to an identified dataset (e.g., one with name and address, or social security number) can result in severe privacy loss
Story: Finding Medical records
(Sweeney 2002)
[Diagram: 87% of people in the 1990 U.S. census are identifiable by ZIP code, birth date, and gender; the former Governor of Massachusetts was re-identified by these!]
The Rebus Form
[Diagram: identified public dataset + anonymized medical dataset = Governor's medical records! The linkage is sketched below.]
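The linkage behind this story amounts to a join on quasi-identifiers. A hypothetical sketch, with all records, field names, and values invented for illustration:

```python
# Hypothetical illustration: an "anonymized" medical dataset still carries
# quasi-identifiers that can be joined against an identified public dataset.
medical = [   # no names, but zip/birthdate/gender remain
    {"zip": "02138", "birthdate": "1945-07-31", "gender": "M", "diagnosis": "(redacted)"},
]
public = [    # identified public records (e.g., a voter list)
    {"name": "The Governor", "zip": "02138", "birthdate": "1945-07-31", "gender": "M"},
]

quasi_id = lambda r: (r["zip"], r["birthdate"], r["gender"])
index = {quasi_id(p): p["name"] for p in public}

for record in medical:
    name = index.get(quasi_id(record))
    if name:
        print(f"Re-identified {name}: {record['diagnosis']}")
```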
Related Work
- Anonymizing datasets: k-anonymity
  - Sweeney 2002
- Privacy-preserving data mining
  - Verykios et al. 2004, Agrawal et al. 2000, …
- Privacy-preserving recommender systems
  - Polat et al. 2003, Berkovsky et al. 2005, Ramakrishnan et al. 2001
- Text mining of user comments and opinions
  - Drenner et al. 2006, Dave et al. 2003, Pang et al. 2002
RQ1: Risks of Dataset Release
- RQ1: What are risks to user privacy when releasing a dataset?
- RESULT: 1-identification rate of 31%
  - Ignores rating values entirely!
  - Can do even better if text analysis produces a rating value
  - Rarely-rated items were more identifying
Glorious Linking Assumption
- People mostly talk about things they know => people tend to have rated what they mentioned
- Measured P(u rated m | u mentioned m), averaged over all forum users: 0.82 (measurement sketched below)
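This assumption can be measured directly by comparing each forum user's mentions with their private ratings. A minimal sketch of that measurement, with toy data standing in for the real datasets:

```python
# Estimate P(u rated m | u mentioned m), averaged over forum users (toy data).
mentions = {"u1": {"m1", "m2"}, "u2": {"m2", "m3", "m4"}}   # public forum mentions
ratings  = {"u1": {"m1"},       "u2": {"m2", "m3"}}          # private ratings

per_user = [len(mentions[u] & ratings.get(u, set())) / len(mentions[u])
            for u in mentions if mentions[u]]
print(f"P(rated | mentioned), averaged over users: {sum(per_user) / len(per_user):.2f}")
```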
Algorithm Idea
[Venn diagram: All Users; users who rated a popular item; users who rated a rarely-rated item; the small overlap of users who rated both. Sketched in code below.]
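The simplest re-identification algorithm evaluated below, Set Intersection, follows directly from this picture: each mentioned item narrows the candidate pool to the users who rated it. A sketch of that idea; the function and toy data are my assumptions, not the paper's code:

```python
def one_identify(mentioned_items, ratings):
    """Set Intersection sketch: intersect the sets of users who rated each
    mentioned item; report a match only if exactly one candidate remains."""
    candidates = set(ratings)                                 # start with all users
    for item in mentioned_items:
        candidates &= {u for u, rated in ratings.items() if item in rated}
    return next(iter(candidates)) if len(candidates) == 1 else None

# Toy example: the rarely-rated item does most of the narrowing.
ratings = {"u1": {"popular", "obscure"}, "u2": {"popular"}, "u3": {"popular"}}
print(one_identify({"popular", "obscure"}, ratings))   # -> u1 (1-identified)
print(one_identify({"popular"}, ratings))              # -> None (3 candidates left)
```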
Probability of 1-identification vs. re-identification algorithm
[Chart: probability of 1-identification (0 to 1) by # mentions bin (and # users in bin), from 1 (25) to >64 (11), one curve per algorithm: ExactRating, FuzzyRating, Scoring, TF-IDF, Set Intersection.]
- More mentions => better
- >=16 mentions and we often 1-identify
RQ2: ALTERING THE DATASET
- How can dataset owners alter the dataset they release to preserve user privacy?
- Perturbation: change rating values
  - Oops, Scoring doesn't need values
- Generalization: group items (e.g., by genre)
  - Dataset becomes less useful
- Suppression: hide data
  - IDEA: Release a ratings dataset suppressing all "rarely-rated" items (a sketch follows below)
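A minimal sketch of database-level suppression, assuming the owner drops every item rated by fewer than some threshold of users (the threshold and names are illustrative):

```python
from collections import Counter

def suppress_rare_items(ratings, min_raters=5):
    """Remove all ratings of items rated by fewer than min_raters users
    before releasing the dataset (suppression of rarely-rated items)."""
    counts = Counter(item for rated in ratings.values() for item in rated)
    return {user: {i for i in rated if counts[i] >= min_raters}
            for user, rated in ratings.items()}
```

Raising the threshold suppresses more of the catalog; the curves below show how much of the dataset had to go to protect current users.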
Database-level suppression curves
[Chart: fraction of users 1-identified vs. fraction of items suppressed.]
- Drop 88% of items to protect current users against 1-identification
- 88% of items => 28% of ratings
RQ3: SELF DEFENSE
- RQ3: How can users protect their own privacy?
- Similar to RQ2, but now per-user
- User can change ratings or mentions. We focus on mentions
- User can perturb, generalize, or suppress. As before, we study suppression (a sketch follows below)
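From the user's side, the analogous sketch is to withhold a fraction of one's own mentions, dropping the most rarely-rated (most identifying) ones first; the fraction and the popularity table are my illustrative assumptions:

```python
def suppress_my_mentions(my_mentions, item_popularity, fraction=0.2):
    """Withhold a fraction of this user's mentions, dropping the most
    rarely-rated (most identifying) items first."""
    ordered = sorted(my_mentions, key=lambda item: item_popularity.get(item, 0))
    n_drop = int(len(ordered) * fraction)
    return set(ordered[n_drop:])   # the mentions the user still posts
```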
User-level suppression curves
[Chart: fraction of users 1-identified vs. fraction of user mentions (per user) suppressed.]
- Suppressing 20% of mentions dropped 1-identification some, but not all
- Suppressing >20% is not reasonable for a user
Another Strategy: Misdirection
- What if users mention items they did NOT rate? This might misdirect a re-identification algorithm
- Create a misdirection list of items. Each user takes an unrated item from the list and mentions it. Repeat until not identified (a sketch follows below)
- What are good misdirection lists?
  - Remember: rarely-rated items are identifying
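A sketch of the misdirection loop: keep mentioning unrated items from the misdirection list until a re-identification routine (for example, the hypothetical `one_identify` sketch above, passed in as `identify`) no longer singles the user out. Names and structure are my assumptions:

```python
def misdirect(user_id, my_ratings, mentions, misdirection_list, all_ratings, identify):
    """Add unrated items from the misdirection list to the user's mentions
    until `identify` (e.g., the Set Intersection sketch above) no longer
    1-identifies this user, or the list runs out."""
    mentions = set(mentions)
    for item in misdirection_list:                 # e.g., most popular items first
        if identify(mentions, all_ratings) != user_id:
            break                                  # no longer 1-identified
        if item not in my_ratings:                 # only mention items NOT rated
            mentions.add(item)
    return mentions
```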
User 1-identification vs. number of misdirecting mentions
[Chart: fraction of users 1-identified vs. # misdirecting mentions (0 to 20), for misdirection lists of rare items (rated >=1, >=16, >=1024, >=8192) and popular items.]
- Rarely-rated items don't misdirect!
- Better to misdirect to a large crowd
- Rarely-rated items are identifying, popular items are misdirecting
- Popular items do better, though 1-identification isn't zero
Exposure: What Have We Learned?
- REAL RISK
  - Re-identification can lead to loss of privacy
  - We found substantial risk of re-identification in our sparse relation space
  - There are a lot of sparse relation spaces
  - We're probably in more and more of them, available electronically
- HARD TO PRESERVE PRIVACY
  - Dataset owner had to suppress a lot of their dataset to protect privacy
  - Users had to suppress a lot to protect privacy
  - Users could misdirect somewhat with popular items
Advice: Keep Customer’s Trust
- Share data rarely
  - Remember the governor: (zip + birthdate + gender) is not anonymous
- Reduce exposure
  - Example: Google will anonymize search data older than 24 months
AOL: 650K users, 20M queries
- Data wants to be free
  - Government subpoena, research, commerce
- People do not know the risks
  - NY Times: user 4417749 searched for "dog that urinates on everything."
- AOL was text, this is items
Discussion #1: Exposure
- Examples of sparse relation spaces?
- Examples of re-identification risks?
- How to preserve privacy?