Transcript Document
Topic 12 – Recommender Systems and the Netflix
Prize
Data Mining - Volinsky - 2011 - Columbia University
Outline
• Intro to Recommender Systems
• The Netflix Prize – structure and history
• Models for Recommendation
• Extensions of the model to other domains
Recommender Systems
• Systems that take user preferences about items as input and output recommendations
• Early examples
• Bellcore Music Recommender (1995)
• MIT Media Lab: Firefly (1996)
Best example: Amazon.com
Worst example: Amazon.com
Also:
Netflix
eBay
Google Reader
iTunes Genius
digg.com
Hulu.com
Recommender Systems
• Basic idea
– recommend item i to user u for the purpose of
• Exposing them to something they would not have otherwise seen
• Leading customers to the Long Tail
• Increasing customers’ satisfaction
• Data for recommender systems (need to know who likes what)
– Purchase/rented
– Ratings
– Web page views
– Which do you think is best?
Recommender Systems
• Two types of data:
• Explicit data: user provides information about their preferences
– Pro: high quality ratings
– Con: Hard to get: people cannot be bothered
• Implicit data: infer whether or not user likes product based on behavior
– Pro: Much more data available, less invasive
– Con: Inference often wrong (does purchase imply preference?)
• In either case, data is just a big matrix
– Users x items
– Entries binary or real-valued
• Biggest Problem:
– Sparsity: most users have not rated most products.
[Figure: a sparsely filled users × items ratings matrix]
Recommender Systems: Models
Two camps on how to make recommendations:
– Collaborative Filtering (CF)
• Use collective intelligence from all available rating information to make predictions
for individuals
• Depends on the fact that user tastes are correlated and transitive:
• If Alice and Bob both like X and Alice likes Y then Bob is more likely to like Y
– Content based
• Extracts “features” from items for a big regression or rule-based model
• See www.nanocrowd.com
– 15 years of research in the field
– Conventional wisdom:
• CF performs better when there is sufficient data
• Content-based is useful when there is little data
Recommender Systems: Evaluation
• Narrow focus on accuracy sometimes misses the point
– Prediction Diversity, Prediction Context, Order of predictions
– For explicit ratings: use training – test setup and use
some form of MSE score
• OK, but treats all ratings the same (care most about top k)
• Doesn’t reflect interventional impact of recommender
– For implicit, we don’t have a score to calculate MSE.
• Can still do training test
• Create a ranked vector of preference – really only interested in top-k
for a recommendation algorithm
• use “discounted cumulative gain”
– Gives more weight to those predicted at the top
Evaluation: Discounted Cumulative Gain
• Recommending stories on a web site:
– Ten stories to recommend in test set: ABCDEFGHIJ
– Only room on page for 5 recommendations
– Algorithm 1 ranks: AECHI
– Algorithm 2 ranks: CDBFG
– User actually viewed AEBFGJ
– rel = {0,1}
– DCG1= 1+1/log2(2)+0+0+0 = 2
– DCG2=0+0+1/log2(3)+1/log2(4)+1/log2(5) = 1.56
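The arithmetic above can be checked with a short script. This is a sketch using the slide's convention: relevance is 1 if the user viewed the item, position 1 contributes its relevance directly, and later positions are discounted by 1/log2(position).

```python
import math

def dcg(ranked, viewed, k=5):
    """DCG with binary relevance: position 1 adds rel, position i>1 adds rel/log2(i)."""
    total = 0.0
    for pos, item in enumerate(ranked[:k], start=1):
        rel = 1.0 if item in viewed else 0.0
        total += rel if pos == 1 else rel / math.log2(pos)
    return total

viewed = set("AEBFGJ")
dcg1 = dcg("AECHI", viewed)   # 1 + 1/log2(2) = 2.0
dcg2 = dcg("CDBFG", viewed)   # 1/log2(3) + 1/log2(4) + 1/log2(5) ≈ 1.56
```

Note that standard DCG definitions sometimes discount every position by 1/log2(position+1); the version here matches the numbers on this slide.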
The Netflix Prize
Netflix
• A US-based DVD rental-by-mail company
• >10M customers, 100K titles, ships 1.9M DVDs per day

Good recommendations = happy customers
Netflix Prize
• October, 2006:
• Offers $1,000,000 for an improved recommender algorithm
•Training data
• 100 million ratings
• 480,000 users
• 17,770 movies
• 6 years of data: 2000-2005
[Table: sample ratings with columns user, movie, score, date — training rows carry observed scores (1–5); test rows have score “?”]
• Test data
• Last few ratings of each user (2.8 million)
• Evaluation via RMSE: root mean squared error
• Netflix Cinematch RMSE: 0.9514
• Competition
• $1 million grand prize for 10% improvement
• If 10% not met, $50,000 annual “Progress Prize” for best
improvement
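The evaluation metric is simple to state in code. A minimal sketch of RMSE, the Prize's yardstick (toy numbers, not Netflix data):

```python
import math

def rmse(predicted, actual):
    """Root mean squared error: sqrt(mean((p - a)^2)) over all rated cells."""
    se = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(se / len(actual))

# e.g. predictions that miss two of three ratings by one star each:
err = rmse([4.0, 3.0, 5.0], [3, 4, 5])   # sqrt(2/3) ≈ 0.816
```

A 10% improvement target on RMSE 0.9514 is why thousandths of a point mattered so much in this competition.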
Netflix Prize
• Competition design
• Hold-out set created by taking last 9 ratings for each user
– Non-random, biased set
• Hold-out set split randomly three ways: Probe, Quiz, and Test Sets
– Probe Set – appended to training data to allow unbiased estimation of RMSE
– Submit ratings for the (Quiz+Test) Sets
– Netflix returns RMSE on the Quiz Set only
Data Characteristics
[Figure: Mean Score vs. Date of Rating, 2000–2006 (mean score axis 3.2–3.8)]
[Figure: distribution of ratings 1–5 in percent — Training (m = 3.60), Probe (m = 3.67)]
Ratings per movie/user
Avg #ratings/movie: 5627
Avg #ratings/user: 208
User ID    # Ratings   Mean Rating
305344     17,651      1.90
387418     17,432      1.81
2439493    16,560      1.22
1664010    15,811      4.26
2118461    14,829      4.08
1461435     9,820      1.37
Data Characteristics
Most Loved Movies                            Avg rating   Count
The Shawshank Redemption                     4.593        137812
Lord of the Rings: The Return of the King    4.545        133597
The Green Mile                               4.306        180883
Lord of the Rings: The Two Towers            4.460        150676
Finding Nemo                                 4.415        139050
Raiders of the Lost Ark                      4.504        117456
Most Rated Movies: Miss Congeniality, Independence Day, The Patriot, The Day After Tomorrow, Pretty Woman, Pirates of the Caribbean
Highest Variance: The Royal Tenenbaums, Lost In Translation, Pearl Harbor, Miss Congeniality, Napoleon Dynamite, Fahrenheit 9/11
The competition progresses…
• Cinematch beaten in two weeks
• Halfway to 10% in 6 weeks
• Our team, BellKor (Bob Bell and Yehuda Koren), took over the lead in the summer…
• With 48 hours to go to the $50K Progress Prize, we had a comfortable lead…with another submission in pocket
Leaderboard
[Screenshot: leaderboard as of 05:00 pm Sept 30]
Leaderboard
[Screenshot: leaderboard as of 06:00 pm Sept 30 — rival message: “wanna split 50/50?”]
• ARRRRGH! We have one more chance….
[Screenshot: leaderboard overnight, 8pm 10/1 to 6am]
A Nervous Last Day
• Start in virtual tie on Quiz data
• Unclear who leads on Test data
• Can Gravity/Dinosaurs improve again?
• More offers for combining
• Can we squeeze out a few more points?
• Improved our mixing strategy
• Created a second team, KorBell, just to be safe
Our final submission(s)…
[Screenshot: leaderboard showing our final submissions]
So The Drama
Timeline:
– 2008:
• BellKor merges with BigChaos to win 2nd $50K Progress Prize (9.4%)
– 2009:
• June 26: BellKor, Big Chaos and Pragmatic Theory merge, passing the 10% threshold and beginning the 30-day ‘last call’ period.
• July 24: All is quiet. We crawl up to 10.08%...
• July 25: The Ensemble, a coalition of 23 independent teams, combine to pass
us on the public leaderboard.
• July 26: BellKor pulls back into a tie. Then The Ensemble pulls ahead by 0.01%
• July 26: Netflix Prize ends: winner (determined on Test Set) is unclear.
• September 21: Netflix announces BellKor’s Pragmatic Chaos as winner of
$1M prize.
So The Drama
Timeline:
– 2009:
• June 26: BellKor, Big Chaos and Pragmatic Theory merge, passing the 10% threshold and beginning the 30-day ‘last call’ period.
• Lead next best team by 0.37%
• Only two teams within 0.80%
– But soon, the cavalry charges
• Mega-mergers start to form.
– July 25 (25 hours left):
• We crawl up to 10.08%...
• The Ensemble, a coalition of 23 independent teams, combine
to pass us on the public leaderboard
Data Mining - Volinsky - 2011 - Columbia University
24
Test Set Results
• BellKor’s Pragmatic Chaos: 0.8567
• The Ensemble: 0.8567
• Tie breaker was submission date/time
• We won by 20 minutes!
But really:
• BellKor’s Pragmatic Chaos: 0.856704
• The Ensemble: 0.856714
• Also, a combination of BPC (10.06%) and
Ensemble (10.06%) scores results in a 10.19%
improvement!
Our Approach
• Our Prize winning solutions were an ensemble of many
separate solution sets
• Progress Prize 2007: 103 sets
• Progress Prize 2008 (w/Big Chaos): 205 sets
• Grand Prize 2009 (w/ BC and Pragmatic Theory): > 800 sets!!
– We used two main classes of models
• Nearest Neighbors
• Latent Factor Models (via Singular Value Decomposition)
• Also regularized regression, not a big factor
• Teammates used neural nets and other methods
• Approaches mainly algorithmic, not statistical in nature
Data representation (excluding dates)
[Figure: users × movies ratings matrix — cells hold ratings between 1 and 5, blanks are unknown ratings]
Nearest Neighbors
Nearest Neighbors
[Figure: the users × movies ratings matrix from the previous slide]
Nearest Neighbors
[Figure: the ratings matrix — estimate rating of movie 1 by user 5]
Nearest Neighbors
[Figure: the ratings matrix]
Neighbor selection: Identify movies similar to 1, rated by user 5
Nearest Neighbors
[Figure: the ratings matrix]
Compute similarity weights: s13=0.2, s16=0.3
Nearest Neighbors
[Figure: the ratings matrix with the predicted value 2.6 filled in]
Predict by taking weighted average: (0.2*2+0.3*3)/(0.2+0.3)=2.6
Nearest Neighbors
– To predict the rating for user u on item i:
• Use similar users’ ratings for similar movies:
\hat{r}_{ui} = \frac{\sum_{j \in N(i,u)} s_{ij} r_{uj}}{\sum_{j \in N(i,u)} s_{ij}}

rui = rating for user u and item i
bui = baseline rating for user u and item i
sij = similarity between items i and j
N(i,u) = neighborhood of item i for user u (might be fixed at k)
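The prediction rule is a few lines of code. A minimal sketch — the ratings, similarities, and neighborhood function are toy stand-ins, with the numbers taken from the worked example on the earlier slides (s13 = 0.2, s16 = 0.3):

```python
def predict_nn(i, u, ratings, sims, neighborhood):
    """Item-item NN prediction: weighted average of user u's ratings
    over N(i,u), the neighbors of item i that u has rated."""
    num = den = 0.0
    for j in neighborhood(i, u):
        num += sims[(i, j)] * ratings[(u, j)]
        den += sims[(i, j)]
    return num / den

# toy data: user 5 rated the two neighboring movies 2 and 3
ratings = {(5, 3): 2, (5, 6): 3}
sims = {(1, 3): 0.2, (1, 6): 0.3}
pred = predict_nn(1, 5, ratings, sims, lambda i, u: [3, 6])
# (0.2*2 + 0.3*3)/(0.2 + 0.3) = 2.6
```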
Nearest Neighbors
• Useful to “center” the data, and model residuals
• What is sij ???
– Cosine distance
– Correlation
• What is N(i,u)??
– Top-k
– Threshold
• What is bui?
• How to deal with missing values?
• Choose several different options and throw them in!
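Two common choices for sij can be sketched as follows. This is a toy implementation under one convention among several: each item's ratings are a user→rating dict, and similarities are computed over co-rated users only.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two items' rating vectors, over co-rating users."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[u] * b[u] for u in common)
    na = math.sqrt(sum(a[u] ** 2 for u in common))
    nb = math.sqrt(sum(b[u] ** 2 for u in common))
    return dot / (na * nb)

def pearson_sim(a, b):
    """Pearson correlation: center each item by its mean on the overlap first."""
    common = sorted(set(a) & set(b))
    if len(common) < 2:
        return 0.0
    ma = sum(a[u] for u in common) / len(common)
    mb = sum(b[u] for u in common) / len(common)
    num = sum((a[u] - ma) * (b[u] - mb) for u in common)
    da = math.sqrt(sum((a[u] - ma) ** 2 for u in common))
    db = math.sqrt(sum((b[u] - mb) ** 2 for u in common))
    return num / (da * db) if da and db else 0.0

x = {1: 5, 2: 3, 3: 4}   # item x's ratings: user -> stars
y = {1: 1, 2: 3, 3: 2}   # item y disagrees with x on every co-rating user
```

In practice these raw similarities are often shrunk toward zero when the overlap is small, one of the "several different options" the slide alludes to.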
Nearest Neighbors, cont
• This is called “item-item” NN
– Can also do user-user
– Which do you think is better?
• Advantages of NN
– Few modeling assumptions
– Easy to explain to users
– Most popular RS tool
Nearest Neighbors, Modified
• Problem with traditional k-NN:
• Similarity weights are calculated globally, and
• do not account for correlation among the neighbors
– We estimate the weights (wij) simultaneously via a least squares
optimization :
Basically, a regression using the ratings in the nbhd.
– Shrinkage helps address correlation
– (don’t try this at home)
Latent factor models – Singular Value Decomposition
SVD finds concepts
[Figure: movies placed on two latent dimensions — “geared towards females” vs. “geared towards males” and “serious” vs. “escapist”; e.g. The Color Purple, Sense and Sensibility, The Princess Diaries on the female side; Lethal Weapon, Ocean’s 11, Independence Day on the male side; Braveheart and Amadeus toward serious; The Lion King and Dumb and Dumber toward escapist]
Matrix Decomposition - SVD
Example with 3 factors (concepts):
[Figure: the users × items ratings matrix approximated (~) by the product of two thin factor matrices — one feature vector per item and one per user across the 3 concepts]
Each user and each item is described by a feature vector across concepts
Factorization-based modeling
[Figure: the sparse ratings matrix approximated (~) by the same low-rank factor matrices]
• This is a strange way to use SVD!
– Usually for reducing dimensionality, here for filling in missing data!
– Special techniques to do SVD w/ missing data
• Alternating Least Squares = variant of EM algorithms
• Probably most popular model among contestants
– 12/11/2006: Simon Funk describes an SVD-based method
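Funk's idea fits the factors by stochastic gradient descent over the observed cells only, with no imputation of the missing ones. A minimal sketch — the dimensions, learning rate, and regularization constant here are illustrative, not the contest values:

```python
import random

def funk_svd(ratings, n_users, n_items, k=2, lr=0.02, reg=0.02, epochs=1000, seed=0):
    """Fit r_ui ≈ p_u · q_i by SGD over observed (user, item, rating) triples."""
    rng = random.Random(seed)
    p = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(p[u][f] * q[i][f] for f in range(k))
            for f in range(k):
                pu, qi = p[u][f], q[i][f]
                p[u][f] += lr * (err * qi - reg * pu)  # gradient step with L2 shrinkage
                q[i][f] += lr * (err * pu - reg * qi)
    return p, q

# tiny demo: 2 users x 2 movies, 3 observed ratings; the 4th cell gets a prediction
obs = [(0, 0, 5), (0, 1, 3), (1, 0, 4)]
p, q = funk_svd(obs, n_users=2, n_items=2)
pred_missing = sum(p[1][f] * q[1][f] for f in range(2))
```

The regularization term is essential: without it, a model with this many free parameters would badly overfit the sparse matrix.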
Latent Factor Models, Modified
• Problem with traditional SVD:
User and item factors are determined globally
– Each user described as a fixed linear combination across factors
– What if there are different people in the household?
• Let the linear combination change as a function of the item rated.
• Substitute pu with pu(i), and add similarity weights
• Again, don’t try this at home!
First 2 Singular Vectors
[Figure: items plotted on the first two singular vectors]
Incorporating Implicit Data
• Implicit Data: what you choose to rate is an important, and separate, piece of information from how you rate it.
• Helps incorporate negative information, especially for those users
with low variance.
• Can be fit in NN or SVD
Temporal Effects
It took us 1.5 years to figure out how to use the
dates!
Account for temporal effects in two ways
• Directly in the SVD model – allow user factor vectors pu to
vary with time pu(t)
– Many ways to do this, adds many parameters, need regularization
– DTTAH (don’t try this at home)
• Observed: number of user ratings on a given date is a proxy
for how long ago the movie was seen:
– Some movies age better than others
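One simple way to let a user parameter vary with time, sketched here with hypothetical names and bin width: bucket rating dates coarsely and learn one value per (user, bin), falling back to the static value for unseen bins.

```python
from datetime import date

def time_bin(t, t0=date(2000, 1, 1), days_per_bin=70):
    """Map a rating date to a coarse time bin index."""
    return (t - t0).days // days_per_bin

def user_bias(u, t, static_bias, binned_bias):
    """b_u(t): time-binned bias if that bin was seen in training, else the static b_u."""
    return binned_bias.get((u, time_bin(t)), static_bias.get(u, 0.0))

static = {42: 0.3}
binned = {(42, time_bin(date(2004, 6, 1))): 0.8}
b_then = user_bias(42, date(2004, 6, 1), static, binned)   # 0.8: bin seen in training
b_other = user_bias(42, date(2001, 1, 1), static, binned)  # 0.3: fallback to static
```

The same binning trick applies to the factor vectors pu(t), at the cost of many more parameters, hence the need for regularization noted above.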
Memento vs Patch Adams
[Figure: mean rating vs. frequency (number of ratings the user made that day), binned 1, 2, 3-4, 5-8, …, 257+ — Memento (127318 samples) and Patch Adams (121769 samples) show opposite trends]
Netflix Prize: Other Topics
• Ensembles of models
• Application to TV data
The Power of Ensembles
• Ensembles of models
• Over 800 in Grand Prize solution!
• Some of our “models” were blends of other models!
[Image courtesy LingPipe Blog]
– or models on residuals of other models
– Why: because it worked
• Black Box magic at its worst
• However, mega blends are not needed in practice
– Best model: complex model with implicit and explicit data, and
time varying coefficients, millions of parameters and regularization.
– This did as well as our 2007 Progress Prize winning combo of 107
models!
– Even a small handful of ‘simple’ models gets you 80-85% of the
way
• In practice, .01% does not matter much
Blending Models
• Figuring out the right way to blend models was
an important piece of our team strategy
• Subsequent slides courtesy of Michael Jahrer
(BigChaos)…
• Keep in mind…
Probe Blending < Quiz Blending
• A few facts before we go on:
1) Every model i can be reduced to its prediction vector on the probe set (pi). That can be compared to the true probe answers (r)
2) The model can also be applied to the quiz set (qi), but can only be
compared to the true answers (s) by submission
3) We submitted all individual models to get the RMSE for every
model i.
4) The distribution of the quiz vector (how many 1s, 2s, 3s, etc) was
well known. This gives the variance of the vector
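Fact (1) suggests the simplest blend: find weights w minimizing ||Pw − r||² on the probe set. A hypothetical sketch via the normal equations (the actual teams used richer variants, including the quiz-variance trick in fact 4):

```python
def blend_weights(preds, truth):
    """Least-squares blending weights: solve (P^T P) w = P^T r by Gaussian elimination.

    preds: one prediction vector per model (all over the same probe set);
    truth: the true probe ratings r.
    """
    m, n = len(preds), len(truth)
    A = [[sum(preds[a][k] * preds[b][k] for k in range(n)) for b in range(m)]
         for a in range(m)]
    y = [sum(preds[a][k] * truth[k] for k in range(n)) for a in range(m)]
    for col in range(m):                      # forward elimination, partial pivoting
        piv = max(range(col, m), key=lambda row: abs(A[row][col]))
        A[col], A[piv] = A[piv], A[col]
        y[col], y[piv] = y[piv], y[col]
        for row in range(col + 1, m):
            f = A[row][col] / A[col][col]
            for c in range(col, m):
                A[row][c] -= f * A[col][c]
            y[row] -= f * y[col]
    w = [0.0] * m
    for row in range(m - 1, -1, -1):          # back substitution
        w[row] = (y[row] - sum(A[row][c] * w[c] for c in range(row + 1, m))) / A[row][row]
    return w

# two toy models whose simple average reproduces the truth exactly:
preds = [[1.0, 2.0, 3.0, 4.0], [3.0, 2.0, 1.0, 4.0]]
truth = [2.0, 2.0, 2.0, 4.0]
w = blend_weights(preds, truth)   # → weights [0.5, 0.5]
```

The blended prediction is then Σ wi·pi on any new data; correlated models get their weights shrunk automatically by the normal equations.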
Probe Blending
Quiz Blending