Google News Personalization Scalable Online Collaborative Filtering

Download Report

Transcript Google News Personalization Scalable Online Collaborative Filtering

Google News
Personalization
Scalable Online Collaborative Filtering
Abhinandan Das - [email protected]
Mayur Datar - [email protected]
Ashutosh Garg - [email protected]
Shyam Rajaram - [email protected]
Presented by: Aniket Zamwar - [email protected]
1
Already Studied in
Class
• Map Reduce
• Collaborative Filtering
• Content Based Recommendation
• Clustering Techniques - Pros/Cons
2
Problem Statement
•
•
•
4/18/13
Scale of operation is very huge - order of several
million news stories dynamically changing at high
rate.
Presented with the click history for N users ( U = {
u1,u2,...,uN} ) over M items ( S = {s1,s2,...,sM} ),
and given a specific user ‘u’ with click history set
Cu consisting of stories { si1 , . . . , si|Cu | },
recommend K stories to the user.
Strict timing constraints for recommendation
engine to generate recommendations.
Google News
Personalization
3
Approaches
• Collaborative Clustering
• Probabilistic Latent Semantic Indexing
• Covisitation Counts
4/18/13
Google News
Personalization
4
Problem Setting
• Record User Queries and Clicks
• Recommendations of News using user click
history and click history of the community
4/18/13
Google News
Personalization
5
Recommender
• Content based
Systems
System
• Collaborative Filtering Systems
‣ Memory-based Algorithms
‣
Prediction calculated as weighted average of the ratings given
by other users
‣
weight is proportional to to “similarity” between users.
‣ Model-based Algorithms
‣
Model the users based on their past ratings and use these models
to predict ratings of unseen items.
• Mix of memory based + model based
systems
4/18/13
Google News
Personalization
6
Algorithms
• Model based approach
• Clustering Techniques: Probabilistic
Latent Semantic Indexing(PLSI) and
min hash
• Memory based approach
• Item Covisitation
4/18/13
Google News
Personalization
7
Min Hashing
•
•
•
•
•
4/18/13
Probabilistic clustering technique - assigns pairs of
users to same cluster with probability proportional to
overlab between the set of items the users have voted
for.
Similarity calculated using Jaccard Coefficient
To Do: Given user u-i, compute similarity S(u-i, u-j) for
all users u-j, and recommend stories to u-i voted by u-j
with weight equal to S(u-i, u-j)
Issues: Real time not scalable, using hash table to find
vote for specific user is also not feasible, offline
computation is also not feasible
Locality Sensitive Hashing
(LSH)
comes
for
rescue
Google News
Personalization
8
Locality Sensitive
Hashing
• Key Idea: Hash data points using several hash
functions, such that for each hash function the
probability of collision is much higher for objects
which are close to each other.
•
•
4/18/13
Min-hashing technique is used to randomly permute
the set of items (S) and for each user u-i compute its
hash value h(u-i) as the index of first item under the
permutation that belongs to user’s item set Cu-i.
Min-hashing = probabilistic clustering where each
hash bucket corresponds to a cluster, that puts two
users together in the same cluster with probability
equal to item set similarity S(u-i, u-j)
Google News
Personalization
9
PLSI
•Probabilistic Latent Semantic models
•Models users and items as random variables relationship between users and items is
learned by modeling joint distribution of users
and items as mixed distribution
•A hidden variable Z is used to define the
relationship, it represents user communities
and item communities.
4/18/13
Google News
Personalization
10
Covisitation
• Covisitation is defined as event in which two
stories are clicked by same user within a
certain time interval.
• A graph whose nodes represent items and
weighted edges represent time discounted
number of covisitation instances.
• For each user click the adjacency list
representing graph is updated: for entry for
each item in user history, new entry
corresponding to clicked item is added if not
there; if it is already there then the age
discounted count is updated.
4/18/13
Google News
Personalization
11
Covisitation based
Recommendation
• Fetch user u-i’s recent click history - limited to
past few hours or days.
• For each item s-i in click history of user, lookup
the entry for pair (s-i, s) in adjacency list for s-i
stored in Big Table.
• The value stored in entry normalized by sum of
all entries for s-i is stored to recommendation
score.
• Recommendation score is normalized to a
value between 0 and 1 by linear scaling.
4/18/13
Google News
Personalization
12
System Setup
• Three Components:
•
Offline component to cluster users based on
click history
•
Online servers:
•
Updating story and user statistics each time user clicks on news story
• Generating news story when user requests
•
Two Data Tables
•
User Table (UT) indexed by user-id, stores user click history and clustering
information.
• Story Table (ST) indexed by story-id, stores real time click counts for every story-
story and story-cluster pair.
4/18/13
Google News
Personalization
13
System Components
NSS
Min Hashing
PLSI
b
u
f
f
e
r
UT
user click
NFE
view personalized
news page request
Update Statistics
NPS
c
a
c
h
e
Fetch
Statistics
NFE: News Front End
NSS: News Statistics Server
NPS: News Personalization
Server
4/18/13
Google News
Personalization
ST
Offline
Log
Analysis
UT: User Table
ST: Story Table
14
Pros
• Scalable collaboration of Content based
and Collaborative clustering
• Recommendation system using Min
Hashing and PLSI Algorithms
• Scaling the algorithms by using Map
Reduce and Big Table representation for
data
• Using click events as vote for news
• Dynamically providing the latest likely news
that suits the interest of the user
4/18/13
Google News
15
Personalization
Cons
• Depends a lot on User Clicks
• User Clicks considered as positive vote
• Does not say anything about negative
vote
4/18/13
Google News
Personalization
16