Preserving Privacy and Utility in Data Sharing


April 13, 2010
Towards Publishing Recommendation Data
With Predictive Anonymization

Chih-Cheng Chang†, Brian Thompson†, Hui Wang‡, Danfeng Yao†
ACM Symposium on Information, Computer, and
Communications Security (ASIACCS 2010)
April 13, 2010
Outline
• Introduction
• Privacy in recommender systems
• Predictive Anonymization
• Experimental results
• Conclusions and future work
Towards Publishing Recommendation Data
With Predictive Anonymization
April 13, 2010
Motivation
• Inevitable trend towards data sharing
– Medical records
– Social networks
– Web search data
– Online shopping, ads
• Databases contain sensitive information
• Growing need to protect privacy
Privacy in Relational Databases

  Name (identifier)   Age   Gender   Zip Code   Disease (sensitive)
  Joe Smith           52    Male     08901      Cancer
  John Doe            24    Male     08904      ---------
  Mary Johnson        45    Female   08854      Asthma
  Janie McJonno       59    Female   08904      Cancer
  Johnny Walker       76    Male     08854      Diabetes
Privacy in Relational Databases

Names are replaced with pseudo-identifiers:

  Name          Age   Gender   Zip Code   Disease
  Person 0001   52    Male     08901      Cancer
  Person 0002   24    Male     08904      ---------
  Person 0003   45    Female   08854      Asthma
  Person 0004   59    Female   08904      Cancer
  Person 0005   76    Male     08854      Diabetes
87% of the U.S. population can be uniquely
identified by DOB, gender, and zip code! [S00]
Approaches to Achieving Privacy
1. Statistical databases
• Answers only aggregate queries (e.g., what is the average salary?)
• Differential privacy [Dinur-Nissim ‘03, Dwork ‘06]:
adaptively add random noise to each answer so that a querier
cannot determine whether any given user is in the database
• Answer quality degrades as more queries are asked
2. Publishing of anonymized databases
• No restriction on how the data is used, which is good for
complex data mining applications
• How can privacy concerns be addressed?
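The differential-privacy idea can be sketched with the standard Laplace mechanism from the literature (a generic illustration, not part of this talk; the function names and the salary bound are hypothetical):

```python
import random

def laplace_sample(scale):
    """One Laplace(0, scale) draw: the difference of two exponentials."""
    return scale * (random.expovariate(1.0) - random.expovariate(1.0))

def private_average(salaries, epsilon, max_salary):
    """Differentially private average-salary query (Laplace mechanism).

    Assumes every salary is bounded by max_salary, so changing one
    record shifts the true average by at most max_salary / n.
    """
    n = len(salaries)
    true_avg = sum(salaries) / n
    scale = max_salary / (n * epsilon)  # sensitivity / epsilon
    return true_avg + laplace_sample(scale)
```

Repeated queries each consume privacy budget, which is why answer quality degrades as more queries are asked.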
Anonymization of Databases
Techniques:
• Perturbation

  Name           Age
  Joe Smith      52 → 53
  John Doe       24 → 26
  Mary Johnson   45 → 42
Anonymization of Databases
Techniques:
• Perturbation
• Swapping

  Name           Age
  Joe Smith      52
  John Doe       24
  Mary Johnson   45
  (age values are exchanged between records)
Anonymization of Databases
Techniques:
• Perturbation
• Swapping
• Generalization

  Name           Age
  Joe Smith      52 → 50s
  John Doe       24 → 20s
  Mary Johnson   45 → 40s

Def. A database entry is k-anonymous if ≥ k−1 other entries
match it identically on the insensitive attributes. [SS98]
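Generalization and the k-anonymity check can be sketched as follows (the bucketing rules and helper names are illustrative, not from the paper):

```python
from collections import Counter

def generalize(record):
    """Coarsen quasi-identifiers: age -> decade bucket, zip -> 3-digit prefix."""
    age, gender, zipcode = record["age"], record["gender"], record["zip"]
    return (f"{(age // 10) * 10}s", gender, zipcode[:3] + "**")

def is_k_anonymous(records, k):
    """Every generalized quasi-identifier tuple must occur at least k times."""
    counts = Counter(generalize(r) for r in records)
    return all(c >= k for c in counts.values())

table = [
    {"age": 52, "gender": "M", "zip": "08901"},
    {"age": 57, "gender": "M", "zip": "08904"},
    {"age": 24, "gender": "M", "zip": "08904"},
    {"age": 29, "gender": "M", "zip": "08901"},
]
print(is_k_anonymous(table, 2))  # prints True
```

After generalization, the two 50s records and the two 20s records each form an indistinguishable group of size 2.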
The Generalization Approach

  Name          Age        Gender   Zip Code        Disease
  Person 0001   32 → <50   Male     08901 → 089**   Cancer
  Person 0002   24 → <50   Male     08904 → 089**   ---------
  Person 0003   59 → >50   Female   08854 → 088**   Asthma
  Person 0004   45 → <50   Male     08904 → 089**   Diabetes
  Person 0005   76 → >50   Female   08854 → 088**   Cancer
  Person 0006   61 → >50   Female   08854 → 088**   AIDS
Outline
• Introduction
• Privacy in recommender systems
• Predictive Anonymization
• Experimental results
• Conclusions and future work
Recommender Systems
• Users register for the service
• After purchasing an item, they submit a rating for it
• They then receive recommendations based on their own
and others’ ratings
Recommender Systems

[Figure: a sparse Netflix-style ratings matrix. Users (Joe Smith,
John Doe, Mary Johnson, Janie McDonno) rate movies (Alien, Batman,
Closer, Dogma, Evita, X-rated, Gladiator) on a 1–5 scale; most
cells are empty or unknown (“?”).]
The Netflix Challenge:
“Anonymized” Netflix data is released to the public.
$1 million prize for best movie prediction algorithm.
Question: Is privacy really protected?
Privacy in Recommender Systems

[Figure: the same ratings matrix with user names replaced by
pseudonyms (User 0001–User 0004); the movie titles and ratings
are published unchanged.]
Narayanan and Shmatikov [NS08] exploited
external information to re-identify users in the
released Netflix Challenge dataset.
Privacy breach!
News Timeline
Oct. 2006: Netflix Challenge announced
May 2008: Narayanan and Shmatikov publish their attack
Aug. 2009: Plans announced for Netflix Challenge 2
Dec. 2009: Netflix users file lawsuits
Mar. 2010: Challenge 2 plans canceled due to privacy
concerns (and an FTC investigation)
How can we enable sharing of recommendation
data without compromising users’ privacy?
Challenges in Anonymization of
Recommender Systems
• All data may be considered “sensitive” by users.
• All data could be used as quasi-identifiers.
• Data sparsity aids re-identification attacks and makes
anonymization difficult. [NS08]
• Scalability – Netflix matrix has 8.5 billion cells!
Attack Models
We represent the recommendation
database as a labeled bipartite graph:
[Figure: a labeled bipartite graph with pseudonymous users
0001–0004 on one side and movies (Star Wars, Godfather, English
Patient, Pretty in Pink) on the other; edges carry ratings from
1 to 5. External identities (“Ben”, “Tim”) are matched to users
by a “structure-based attack,” which uses only which movies were
rated, or a “label-based attack,” which also uses the rating
values.]
Privacy Models
• Node re-identification privacy:
Should not be possible to re-identify individuals.
• Link existence privacy:
Should not be possible to infer whether a user
has seen a particular movie.
Our approach, Predictive Anonymization,
provides these notions of privacy against both
the structure-based and label-based attacks.
Outline
• Introduction
• Privacy in recommender systems
• Predictive Anonymization
• Experimental results
• Conclusions and future work
Predictive Anonymization
Our solution takes a 3-step approach:
1. Use predictive padding to reduce sparsity.
2. Cluster users into groups of size k.
3. Perform homogenization by assigning users
in each group to have the same ratings.
Achieves k-anonymity!
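The three steps can be sketched on a toy ratings matrix. This is an illustrative stand-in, not the paper's implementation: item-mean fill replaces the SVD-based predictive padding, and a greedy nearest-neighbor grouping stands in for the actual clustering; `None` marks an unrated cell.

```python
import math

def predictive_pad(matrix):
    """Step 1: fill missing cells (None) with the item's mean rating.

    Assumes every movie has at least one rating.
    """
    cols = range(len(matrix[0]))
    col_means = []
    for j in cols:
        vals = [row[j] for row in matrix if row[j] is not None]
        col_means.append(sum(vals) / len(vals))
    return [[row[j] if row[j] is not None else col_means[j] for j in cols]
            for row in matrix]

def cluster(padded, k):
    """Step 2: greedily group users into clusters of size k by distance."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    remaining = list(range(len(padded)))
    clusters = []
    while remaining:
        seed = remaining.pop(0)
        remaining.sort(key=lambda u: dist(padded[seed], padded[u]))
        clusters.append([seed] + remaining[:k - 1])
        remaining = remaining[k - 1:]
    return clusters

def homogenize(padded, clusters):
    """Step 3: replace each user's ratings with their cluster's average."""
    out = [None] * len(padded)
    for group in clusters:
        avg = [sum(padded[u][j] for u in group) / len(group)
               for j in range(len(padded[0]))]
        for u in group:
            out[u] = avg
    return out
```

After homogenization, every user is rating-identical to at least k−1 others, which is exactly the k-anonymity condition.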
Predictive Anonymization
            Alien   Batman   Closer   Dogma   Evita   X-rated   Gladiator
  User 0001   3       4        5        3       2       1         4
  User 0002   3       3        5        2       3       1         5
  User 0003   2       2        3        5       2       3         3
  User 0004   3       2        3        4       1       3         2
• Want to cluster users, but there is not enough
information due to data sparsity.
• Solution: Fill empty cells with predicted values.
• Cluster users based on similar tastes, not
necessarily similar lists of movies rated.
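The padding idea can be sketched in pure Python: initialize missing cells with item means, then replace them with a rank-1 SVD estimate computed by power iteration. This is a simplified stand-in for the paper's SVD-based padding; all helper names are illustrative.

```python
import math

def rank1_approx(M, iters=100):
    """Dominant singular triple (u, sigma, v) of a dense matrix,
    found by power iteration on M^T M."""
    n, m = len(M), len(M[0])
    v = [1.0] * m
    for _ in range(iters):
        u = [sum(M[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [sum(M[i][j] * u[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    u = [sum(M[i][j] * v[j] for j in range(m)) for i in range(n)]
    sigma = math.sqrt(sum(x * x for x in u))
    u = [x / sigma for x in u]
    return u, sigma, v

def svd_pad(ratings):
    """Fill None cells: seed them with column means, then use the
    rank-1 approximation sigma * u[i] * v[j] for the missing cells,
    keeping observed ratings untouched."""
    n, m = len(ratings), len(ratings[0])
    col_means = []
    for j in range(m):
        vals = [ratings[i][j] for i in range(n) if ratings[i][j] is not None]
        col_means.append(sum(vals) / len(vals))
    filled = [[ratings[i][j] if ratings[i][j] is not None else col_means[j]
               for j in range(m)] for i in range(n)]
    u, sigma, v = rank1_approx(filled)
    return [[ratings[i][j] if ratings[i][j] is not None
             else sigma * u[i] * v[j] for j in range(m)] for i in range(n)]
```

Because the low-rank estimate reflects each user's overall taste, two users can end up close even if their original sets of rated movies barely overlap.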
Predictive Anonymization
1. Use predictive padding to reduce sparsity.
2. Cluster users into groups of size k.
3. Perform homogenization by assigning users
in each group to have the same ratings.
The final step, homogenization, can be done in
one of several ways. We describe two methods,
“padded” and “pure” homogenization.
Predictive Anonymization
“Padded Homogenization”
[Figure: the bipartite graph after padded homogenization. Every
user–movie edge is present; users in the same cluster share
identical averaged ratings (e.g., 3.5, 4.5) on Star Wars,
Godfather, English Patient, and Pretty in Pink.]
• All edges are added to the recommendation graph.
• Each cluster is averaged using the padded data.
Predictive Anonymization
“Pure Homogenization”
[Figure: the bipartite graph after pure homogenization. Only the
edges needed to homogenize each cluster are added; users in the
same cluster share ratings averaged from the original (unpadded)
data on Star Wars, Godfather, English Patient, and Pretty in
Pink.]
• Only necessary edges are added to the graph.
• Each cluster is averaged using the original data.
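The two variants differ only in which matrix the cluster average is taken over. A minimal sketch, with illustrative function names and `None` marking an unrated cell:

```python
def homogenize_padded(padded, group):
    """Padded homogenization: average every cell of the padded
    matrix, so all user-movie edges appear in the output."""
    n_items = len(padded[0])
    avg = [sum(padded[u][j] for u in group) / len(group)
           for j in range(n_items)]
    return {u: avg for u in group}

def homogenize_pure(original, group):
    """Pure homogenization: average only over original ratings; a
    cell stays empty unless some group member actually rated it."""
    n_items = len(original[0])
    avg = []
    for j in range(n_items):
        vals = [original[u][j] for u in group if original[u][j] is not None]
        avg.append(sum(vals) / len(vals) if vals else None)
    return {u: avg for u in group}
```

Pure homogenization preserves the sparsity pattern at the cluster level, which is why it keeps the data useful for mining tasks even though its per-cell predictions are coarser.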
Outline
• Introduction
• Privacy in recommender systems
• Predictive Anonymization
• Experimental results
• Conclusions and future work
Experiments
• Performed on the Netflix Challenge dataset:
– 480,189 users and 17,770 movies
– more than 100 million ratings
• Singular value decomposition (SVD) is
used for padding and prediction.
• We compute the root mean squared error
(RMSE) for a test set of 1 million ratings on
the original and anonymized data.
RMSE = √( Σ (predicted − actual)² / n )
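The error metric reduces to a few lines over any test set of paired predicted and actual ratings:

```python
import math

def rmse(predicted, actual):
    """Root mean squared error over paired rating lists."""
    n = len(predicted)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)

print(rmse([3.0, 4.0, 2.0], [3.0, 5.0, 1.0]))  # sqrt(2/3) ≈ 0.8165
```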
Analysis: Prediction Accuracy
  Experiment Series              RMSE
  Original Data                  0.95185
  Padded Anonymization (k=5)     0.95970
  Padded Anonymization (k=50)    0.95871
  Pure Anonymization (k=5)       2.36947
  Pure Anonymization (k=50)      2.37710
• Padded Anonymization preserves prediction accuracy.
• However, sparsity is eliminated, which affects the utility
of the published dataset for data mining applications.
Summary
                                   Utility                    Privacy
                                   Prediction   Complex       Node        Link
                                   Accuracy     Data Mining   Re-Ident.   Existence
  Naive Anonymization              yes          yes           no          no
  Padded Predictive Anonymization  yes          no            yes         yes
  Pure Predictive Anonymization    no           yes           yes         yes
Outline
• Introduction
• Privacy in recommender systems
• Predictive Anonymization
• Experimental results
• Conclusions and future work
Conclusions
• We have formalized privacy and attack
models for recommender systems.
• Our solutions show that privacy-preserving
publishing of anonymized recommendation
data is feasible.
• More work is required to find a practical
solution that satisfies real-world privacy
and utility goals.
Future Work
• Investigate the use of differential privacy-like
guarantees for recommendation databases
• Analyze how to protect against more complex
attacks with greater background knowledge
• Evaluate the utility of anonymized
recommendation data for advanced data
mining applications
Thank you!