Privacy Risks of Public Mentions
Do You Trust Your Recommender?
An Exploration of Privacy and Trust in
Recommender Systems
Dan Frankowski, Dan Cosley, Shilad Sen,
Tony Lam, Loren Terveen, John Riedl
University of Minnesota
Story: Finding “Subversives”
“… few things tell you as much about a person as the books he chooses to read.”
– Tom Owad, applefritter.com
Session Outline
Exposure: undesired access to a person’s information
Privacy Risks
Preserving Privacy
Bias and Sabotage: manipulating a trusted system to manipulate users of that system
Why Do I Care?
As a businessperson
The nearest competitor is one click away
Lose your customers’ trust, and they will leave
Lose your credibility, and they will ignore you
As a person
Let’s not build Big Brother
Risk of Exposure in One Slide
Public Dataset + Private Dataset about YOU + algorithms = YOU, your private data linked!
Seems bad. How can privacy be preserved?
movielens.org
-Started ~1995
-Users rate movies ½ to 5 stars
-Users get recommendations
-Private: no one outside
GroupLens can see user’s
ratings
Anonymized Dataset
-Released 2003
-Ratings, some
demographic data, but
no identifiers
-Intended for research
-Public: anyone can
download
movielens.org Forums
-Started June 2005
-Users talk about movies
-Public: on the web, no login to read
-Can forum users be identified in
our anonymized dataset?
Research Questions
RQ1: RISKS OF DATASET RELEASE:
What are risks to user privacy when
releasing a dataset?
RQ2: ALTERING THE DATASET: How
can dataset owners alter the dataset
they release to preserve user privacy?
RQ3: SELF DEFENSE: How can users
protect their own privacy?
Motivation: Privacy Loss
MovieLens forum users did not agree to
reveal ratings
Anonymized ratings + public forum data
= privacy violation?
More generally: dataset 1 + dataset 2 =
privacy risk?
What kind of datasets?
What kinds of risks?
Vulnerable Datasets
We talk about datasets from a sparse relation space, which:
Relates people to items
Is sparse (few relations per person out of the many possible relations)
Has a large space of items
[Diagram: a sparse person-by-item matrix (people p1, p2, p3, … vs. items i1, i2, i3, …), with an X marking each existing relation]
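To make this concrete, here is a minimal sketch (not from the talk) of one way to represent a sparse relation space; the identifiers are hypothetical.

```python
from collections import defaultdict

# Illustrative sketch: a sparse relation space as person -> set of related items.
# With a huge item space, each person relates to only a handful of items.
relations = defaultdict(set)          # person_id -> {item_id, ...}
relations["p1"].update({"i1", "i3"})  # e.g., movies p1 rated (or mentioned)
relations["p2"].add("i2")

def items_of(person_id):
    """Items related to a person (empty set if none)."""
    return relations.get(person_id, set())
```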
Example Sparse Relation Spaces
Examples
Customer purchase data from Target
Songs played from iTunes
Articles edited in Wikipedia
Books/Albums/Beers… mentioned by bloggers or
on forums
Research papers cited in a paper (or review)
Groceries bought at Safeway
…
We look at movie ratings and forum mentions, but
there are many sparse relation spaces
Risks of re-identification
Re-identification is matching a user in
two datasets by using some linking
information (e.g., name and address, or
movie mentions)
Re-identifying to an identified dataset
(e.g., with name and address, or social
security number) can result in severe
privacy loss
Story: Finding Medical records
(Sweeney 2002)
87% of people in the 1990 U.S. census were identifiable by ZIP code, birth date, and gender
The former Governor of Massachusetts was re-identified by these!
The Rebus Form
[Public voter registration list] + [anonymized medical records] = Governor’s medical records!
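A hedged sketch of the kind of quasi-identifier join behind this story; the records and field names below are entirely hypothetical.

```python
# Illustrative sketch (hypothetical data): link an "anonymized" dataset to an
# identified one on quasi-identifiers (ZIP code, birth date, gender).
voter_list = [
    {"name": "Jane Doe", "zip": "02138", "dob": "1945-07-31", "sex": "F"},
]
medical_records = [
    {"zip": "02138", "dob": "1945-07-31", "sex": "F", "diagnosis": "redacted"},
]

def link(identified, anonymized, keys=("zip", "dob", "sex")):
    """Yield (identified, anonymized) record pairs that agree on all quasi-identifiers."""
    index = {tuple(r[k] for k in keys): r for r in identified}
    for rec in anonymized:
        match = index.get(tuple(rec[k] for k in keys))
        if match is not None:
            yield match, rec

for person, record in link(voter_list, medical_records):
    print(person["name"], "->", record["diagnosis"])
```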
Related Work
Anonymizing datasets: k-anonymity
Sweeney 2002
Privacy-preserving data mining
Verykios et al 2004, Agrawal et al 2000, …
Privacy-preserving recommender systems
Polat et al 2003, Berkovsky et al 2005,
Ramakrishnan et al 2001
Text mining of user comments and opinions
Drenner et al 2006, Dave et al 2003, Pang et al
2002
RQ1: Risks of Dataset Release
RQ1: What are risks to user privacy
when releasing a dataset?
RESULT: a 1-identification rate of 31%
The algorithm ignores rating values entirely!
It can do even better if text analysis produces rating values
Rarely-rated items were more identifying
Glorious Linking Assumption
People mostly talk about things they
know => People tend to have rated
what they mentioned
Measured P(u rated m | u mentioned m)
averaged over all forum users: 0.82
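A minimal sketch of how such a conditional probability could be computed, assuming hypothetical `ratings` and `mentions` mappings from user to item set (an illustration, not the study's code):

```python
# Illustrative sketch (hypothetical data structures): estimate
# P(u rated m | u mentioned m), averaged over all forum users.
# `ratings` and `mentions` map user_id -> set of item_ids.
def mean_linking_probability(ratings, mentions):
    per_user = []
    for user, mentioned in mentions.items():
        if not mentioned:
            continue
        rated = ratings.get(user, set())
        per_user.append(len(mentioned & rated) / len(mentioned))
    return sum(per_user) / len(per_user) if per_user else 0.0
```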
Algorithm Idea
[Venn diagram: within All Users, the set of users who rated a popular item overlaps the set of users who rated a rarely-rated item; the small intersection is the users who rated both]
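A hedged sketch of this intersection idea: score each user in the ratings dataset by how many of the target's mentioned items they rated, weighting rarely-rated items more heavily (IDF-style). This is an illustration only, not the paper's ExactRating, Scoring, TF-IDF, or Set Intersection algorithm; all names are hypothetical.

```python
import math

# Illustrative sketch (hypothetical names): try to 1-identify the author of a set
# of forum mentions within a ratings dataset. Each candidate user is scored by
# how many of the mentioned items they rated, weighting rarely-rated items more,
# since rare items narrow the candidate set the most.
def one_identify(mentioned_items, ratings, item_rating_counts, n_users):
    if not ratings:
        return None
    scores = {}
    for user, rated in ratings.items():
        score = 0.0
        for item in mentioned_items:
            if item in rated:
                score += math.log(n_users / (1 + item_rating_counts.get(item, 0)))
        scores[user] = score
    best = max(scores, key=scores.get)
    others = [v for u, v in scores.items() if u != best]
    # "1-identified" only if a single user stands strictly above everyone else.
    return best if not others or scores[best] > max(others) else None
```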
[Chart: probability of 1-identification vs. # mentions bin (and # users per bin), comparing the ExactRating, FuzzyRating, Scoring, TF-IDF, and Set Intersection algorithms. More mentions => better re-identification; at >=16 mentions we often 1-identify.]
RQ2: ALTERING THE DATASET
How can dataset owners alter the dataset
they release to preserve user privacy?
Perturbation: change rating values
Oops, Scoring doesn’t need values
Generalization: group items (e.g., genre)
Dataset becomes less useful
Suppression: hide data
IDEA: Release a ratings dataset suppressing all
“rarely-rated” items
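A minimal sketch of that suppression idea: drop every item rated by fewer than some threshold of users. The data layout and threshold below are assumptions for illustration, not the paper's exact procedure.

```python
from collections import Counter

# Illustrative sketch: release only ratings of items that are not "rarely rated".
# `ratings` is a hypothetical list of (user_id, item_id, value) tuples; the
# threshold of 20 raters is an arbitrary example, not a value from the paper.
def suppress_rare_items(ratings, min_raters=20):
    raters_per_item = Counter(item for _, item, _ in ratings)
    return [(u, i, v) for (u, i, v) in ratings if raters_per_item[i] >= min_raters]
```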
Database-level suppression curves
[Chart: fraction of users 1-identified vs. fraction of items suppressed. We had to drop 88% of items to protect current users against 1-identification; 88% of items => 28% of ratings.]
RQ3: SELF DEFENSE
RQ3: How can users protect their own
privacy?
Similar to RQ2, but now per-user
User can change ratings or mentions. We
focus on mentions
User can perturb, generalize, or
suppress. As before, we study
suppression
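A hedged sketch of per-user suppression: the user withholds a fraction of their own mentions, dropping the most rarely-rated (most identifying) items first. The drop-rarest-first ordering and the parameter names are assumptions for illustration.

```python
# Illustrative sketch: a user suppresses a fraction of their own public mentions,
# withholding the most rarely-rated (most identifying) items first.
# `item_rating_counts` maps item_id -> number of users who rated it (hypothetical).
def suppress_mentions(mentioned_items, item_rating_counts, fraction=0.2):
    by_rarity = sorted(mentioned_items, key=lambda i: item_rating_counts.get(i, 0))
    n_drop = int(len(by_rarity) * fraction)
    return set(by_rarity[n_drop:])  # mentions the user still makes publicly
```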
User-level suppression curves
[Chart: fraction of users 1-identified vs. fraction of mentions (per user) suppressed. Suppressing 20% of mentions dropped 1-identification somewhat, but not entirely; suppressing more than 20% is not reasonable for a user.]
Another Strategy: Misdirection
What if users mention items they did NOT
rate? This might misdirect a re-identification
algorithm
Create a misdirection list of items. Each user
takes an unrated item from the list and
mentions it. Repeat until not identified.
What are good misdirection lists?
Remember: rarely-rated items are identifying
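A minimal sketch of the misdirection loop under stated assumptions; `is_one_identified` is a hypothetical stand-in for the attacker's re-identification check.

```python
# Illustrative sketch: mention items the user did NOT rate, drawn from a
# misdirection list, until a re-identification check no longer 1-identifies them.
# `is_one_identified` is a hypothetical stand-in for the attacker's algorithm.
def misdirect(public_mentions, rated_items, misdirection_list, is_one_identified):
    mentions = set(public_mentions)
    for item in misdirection_list:
        if not is_one_identified(mentions):
            break
        if item not in rated_items:  # only mention items the user never rated
            mentions.add(item)
    return mentions
```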
User 1-identification vs. number of misdirecting mentions
[Chart: fraction of users 1-identified vs. # misdirecting mentions, for misdirection lists drawn from rarely-rated items (rated >=1, >=16, >=1024, >=8192) and from popular items. Rarely-rated items don't misdirect; it is better to misdirect to a large crowd. Rarely-rated items are identifying, popular items are misdirecting; popular items do better, though 1-identification isn't zero.]
Exposure: What Have We Learned?
REAL RISK
Re-identification can lead to loss of privacy
We found substantial risk of re-identification in our
sparse relation space
There are a lot of sparse relation spaces
We are probably in more and more of them as they become available electronically
HARD TO PRESERVE PRIVACY
Dataset owner had to suppress a lot of their dataset to
protect privacy
Users had to suppress a lot to protect privacy
Users could misdirect somewhat with popular items
Advice: Keep Your Customers’ Trust
Share data rarely
Remember the governor: (zip + birthdate +
gender) is not anonymous
Reduce exposure
Example: Google will anonymize search
data older than 24 months
AOL: 650K users, 20M queries
Data wants to be free: government subpoena, research, commerce
People do not know the risks
AOL was text; this is items
NY Times: user 4417749 searched for “dog that urinates on everything.”
Discussion #1: Exposure
Examples of sparse relation
spaces?
Examples of re-identification
risks?
How to preserve privacy?