Recommendation-Engine-Sheth-2014-01
Making recommendations
David Sheth
Making Recommendations
• Why Recommend
• How to generate recommendations
• How to present recommendations
Why Recommend
• Help people find what they want
• Can lead to more sales
• Can lead to higher satisfaction / fewer returns
• Leads to repeat visits
How to generate recommendations
• Gather the data
• Choose some recommenders to start with
• Generate Recommendations
• Evaluate the Recommenders
Gather the data
• Explicit ratings
• Users rating things they have seen/read/purchased
• Users liking/tweeting/pinning/posting. (Binary data)
• Implicit feedback
• Users clicking to view things
• Purchases
• Examination duration time
• Save to favorites
• Print
Sample data
• MovieLens
• Many research papers are based on the MovieLens dataset: 10M ratings, 100K tags,
10K movies, 72K users.
• http://grouplens.org/datasets/movielens/
• Search for movielens dataset
• Book Crossing dataset
• 1M ratings, 271K books, 278K users
• http://www.informatik.uni-freiburg.de/~cziegler/BX/
• Search for book crossing dataset
Choose some recommenders (1/4):
Non Personalized
• People who bought this also bought that
• Doesn’t matter what else we know about the person
• Spaghetti -> Sauce
Choose some recommenders (2/4):
Content based
• Based on the attributes of an item
• Example for a book:
• Pride and Prejudice: Historical fiction, early 1800s, set in England, female protagonist,
comic tone, female author
• The Fellowship of the Ring: Fantasy fiction, male protagonist, set in the past, male author
• Requires some way to assign attributes to items.
• This can be a hard problem
• Most prominent example: Pandora.
• Built on the Music Genome Project: up to 450 attributes per song
• As a user rates or buys items, the recommendation engine learns what attributes
a user likes and dislikes
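One simple way to picture how an engine "learns" attribute preferences is to add up the attributes of each rated item, scaled by how far the rating sits above or below neutral. The sketch below is only an illustration under stated assumptions: the ratings, the attribute weights of 1.0, the 1 to 5 scale with 3 as neutral, and the function name `build_profile` are all hypothetical and not from the talk.

```python
from collections import defaultdict

def build_profile(ratings, item_attributes, neutral=3.0):
    """Sum each item's attribute weights, scaled by how far the rating
    sits above (liked) or below (disliked) the neutral rating."""
    profile = defaultdict(float)
    for item, rating in ratings.items():
        for attribute, weight in item_attributes[item].items():
            profile[attribute] += (rating - neutral) * weight
    return dict(profile)

# Hypothetical attribute assignments (every attribute weighted 1.0).
item_attributes = {
    "Pride and Prejudice": {
        "historical fiction": 1.0, "set in England": 1.0, "comic tone": 1.0,
    },
    "The Fellowship of the Ring": {
        "fantasy fiction": 1.0, "set in the past": 1.0,
    },
}

# Hypothetical ratings on a 1 to 5 scale.
ratings = {"Pride and Prejudice": 5, "The Fellowship of the Ring": 2}

profile = build_profile(ratings, item_attributes)
print(profile["historical fiction"])  # 2.0
print(profile["fantasy fiction"])     # -1.0
```

Positive entries are attributes the user appears to like, negative entries are ones the user appears to dislike.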
Choose some recommenders (3/4):
User-User
• Find people who have purchased or rated highly items similar to what our target
user has purchased or rated highly
• See if they have purchased or rated highly anything the target user doesn’t
already own
• Recommend the things the target user doesn’t own
• Example

        item 1  item 2  item 3  item 4
Bob       4       2       5       5
Sally     2       5       1       4
Jim       3       5       2       1
Lucy      5       2       4      ????
Choose some recommenders (4/4):
Item-Item
• Find items that are rated in a similar manner—these items are considered
“similar”
• Find items that are similar to the items a user has already bought or rated highly
• Recommend those items
• Example:

        item 1  item 2  item 3  item 4
Harry     4       2       5       5
John      2       5       1       4
Lily      3       5       2       1
Fred      5       2      ???      5
Choosing prior to generating recommendations
• Look at the data you have
• Look at your budget
• Look at the size of your data
• Look at how quickly you need to provide recommendations
Process the data (1/4):
Non Personalized
• Percent of people that bought A that also bought B
• $\frac{A+B}{A}$
• $\frac{\text{Bread}+\text{Jam}}{\text{Bread}} = \frac{20}{30}$
• $\frac{\text{Bread}+\text{Bananas}}{\text{Bread}} = \frac{27}{30}$
• You will find bananas in most people’s cart, so it will be the recommended item
for pretty much everything.
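The "percent of people that bought A that also bought B" figure can be computed directly from shopping baskets. This is a minimal sketch, assuming each basket is a list of item names; the baskets are made up so that the counts match the slide (30 bread baskets, 20 with jam, 27 with bananas), and `also_bought_share` is a hypothetical helper name.

```python
from collections import Counter
from itertools import permutations

def also_bought_share(baskets):
    """Return {(a, b): share of baskets containing a that also contain b}."""
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = set(basket)
        item_counts.update(items)
        pair_counts.update(permutations(items, 2))  # ordered pairs (a, b)
    return {(a, b): n / item_counts[a] for (a, b), n in pair_counts.items()}

# Made-up baskets: 30 contain bread, 20 of those jam, 27 of those bananas.
baskets = (
    [["bread", "jam", "bananas"]] * 18
    + [["bread", "jam"]] * 2
    + [["bread", "bananas"]] * 9
    + [["bread"]] * 1
)

share = also_bought_share(baskets)
print(round(share[("bread", "jam")], 2))      # 0.67
print(round(share[("bread", "bananas")], 2))  # 0.9
```

Bananas outrank jam for bread buyers even though jam is the more meaningful pairing, which is the weakness the slide points out.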
Process the data (1/4):
Non Personalized
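• Want to calculate the specific influence of A, i.e. how much more likely a person
who buys A is to buy B than a person who does not buy A.
• $\frac{A+B}{A} \Big/ \frac{!A+B}{!A}$
• Jam: $\frac{\text{Bread}+\text{Jam}}{\text{Bread}} \Big/ \frac{\text{No Bread, but Jam}}{\text{No Bread}} = \frac{20}{30} \Big/ \frac{3}{70} \approx \frac{.67}{.04} \approx 16.7$
(about 15.6 if the intermediate values are not rounded)
• Bananas: $\frac{\text{Bread}+\text{Bananas}}{\text{Bread}} \Big/ \frac{\text{No Bread, but Bananas}}{\text{No Bread}} = \frac{27}{30} \Big/ \frac{63}{70} = \frac{.9}{.9} = 1$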
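The ratio above can be checked with a few lines of code. This is a sketch using the slide's counts; the function name `lift` and its parameter names are hypothetical labels, and the function assumes none of the denominators is zero. Without intermediate rounding the jam ratio comes out near 15.6, while dividing the rounded values .67 and .04 gives the 16.7 shown on the slide.

```python
def lift(both, a_total, b_without_a, not_a_total):
    """Share of A buyers who bought B, divided by the share of
    non-A buyers who bought B. Assumes no denominator is zero."""
    with_a = both / a_total
    without_a = b_without_a / not_a_total
    return with_a / without_a

# Counts from the slide: 30 bread buyers, 70 shoppers without bread.
jam = lift(both=20, a_total=30, b_without_a=3, not_a_total=70)
bananas = lift(both=27, a_total=30, b_without_a=63, not_a_total=70)

print(round(jam, 1))      # 15.6
print(round(bananas, 1))  # 1.0
```

A ratio near 1 means buying A tells us nothing about B, so bananas drop out as a recommendation for bread while jam stands out.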
Process the Data (2/4):
Content based
• Want to look at all the attributes a user likes, and then compare them with the
attributes of all the potential items.
• Consider the simplest case: We have determined that the user likes computer
languages with weight 0.7 and Austin with weight 0.3
• Possible meetups:
• Austin Java (languages 0.8, Austin 0.2)
• Dallas Java (languages 1.0, Austin 0.0)
• Austin outdoors (languages 0.0, Austin 1.0)
• Can plot these as vectors
Process the Data (2/4):
Content based
[Figure: Preference Comparison. The user and meetup vectors plotted with Languages on the x-axis and Austin on the y-axis, both running from 0 to 1.2.]
Process the data (2/4):
Content based
• Cosine similarity
• In 2 dimensions, the smaller the angle between two vectors, the bigger the cosine.
Process the data (2/4):
Content based
• If there are more attributes, we just treat them as vectors that have more
dimensions, i.e. we have 3, 10, or 100 dimensional vectors.
• Cosine similarity between high dimensional vectors?
• (Dot product of the user vector and the item vector) / (length of the user vector *
length of the item vector)
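The formula works the same way in any number of dimensions. Here is a minimal sketch that applies it to the meetup example, with vectors written as (languages, Austin); the function name `cosine_similarity` is an illustrative choice, and the code assumes neither vector is all zeros.

```python
import math

def cosine_similarity(u, v):
    """Dot product divided by the product of the vector lengths.
    Assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    length_u = math.sqrt(sum(a * a for a in u))
    length_v = math.sqrt(sum(b * b for b in v))
    return dot / (length_u * length_v)

user = (0.7, 0.3)  # (languages, Austin)
meetups = {
    "Austin Java": (0.8, 0.2),
    "Dallas Java": (1.0, 0.0),
    "Austin outdoors": (0.0, 1.0),
}

# Rank meetups from most to least similar to the user's preferences.
ranked = sorted(
    meetups,
    key=lambda name: cosine_similarity(user, meetups[name]),
    reverse=True,
)
print(ranked)  # ['Austin Java', 'Dallas Java', 'Austin outdoors']
```

Because the function zips over the components, longer tuples (3, 10, or 100 attributes) need no code changes.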
Process the data (2/4):
Content based
• Lenskit: An Open Source Recommendation Tool
• From Academia (University of Minnesota)
• Lets you add your own version of any part of the recommendation engine
• Comes with data structures that let you do recommendation based calculations
Process the data (3/4):
User-User
• Find people who have purchased or rated highly items similar to what our target
user has purchased or rated highly
• See if they have purchased or rated highly anything the target user doesn’t
already own
• Recommend the things the target user doesn’t own
• Example

        item 1  item 2  item 3  item 4
Bob       4       2       5       5
Sally     2       5       1       4
Jim       3       5       2       1
Lucy      5       2       4      ????
Process the data (3/4):
User-User
• Determine how similar other users are to you:
• Can use cosine similarity again
• Other measures are possible, such as Pearson correlation. Test and see what
works best for your data and budget
• Select the closest n% or the closest n
• Select all the items from the similar users
• Calculate the average preference for that item among all the similar users
• Or weighted average—the more similar the person is to the user, the greater the
weight.
• Return the items with the highest average
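The steps above can be sketched in plain Python on the four-user example. This is an illustration under assumptions, not how Lenskit or Mahout implement it: similarity is cosine over the items both users rated, the neighbourhood is the closest n = 2 users, and the function names are hypothetical.

```python
import math

ratings = {
    "Bob":   {"item 1": 4, "item 2": 2, "item 3": 5, "item 4": 5},
    "Sally": {"item 1": 2, "item 2": 5, "item 3": 1, "item 4": 4},
    "Jim":   {"item 1": 3, "item 2": 5, "item 3": 2, "item 4": 1},
    "Lucy":  {"item 1": 5, "item 2": 2, "item 3": 4},  # item 4 unknown
}

def cosine(a, b):
    """Cosine similarity over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)

def nearest_users(target, n):
    """The n users most similar to the target user."""
    others = [u for u in ratings if u != target]
    others.sort(key=lambda u: cosine(ratings[target], ratings[u]), reverse=True)
    return others[:n]

def recommend(target, n=2):
    """Score items the target has not rated by the similarity-weighted
    average rating among the n nearest users, best first."""
    totals, weights = {}, {}
    for user in nearest_users(target, n):
        sim = cosine(ratings[target], ratings[user])
        for item, rating in ratings[user].items():
            if item in ratings[target]:
                continue
            totals[item] = totals.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    scores = {i: totals[i] / weights[i] for i in totals if weights[i] > 0}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(nearest_users("Lucy", 2))  # ['Bob', 'Jim']
best_item, score = recommend("Lucy")[0]
print(best_item, round(score, 1))  # item 4 3.2
```

Raw cosine on all-positive ratings makes every user look fairly similar, which is one reason to also try Pearson or mean-centered ratings on real data.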
Process the data (3/4):
User-User
• Ideal case: A class with a constructor which takes
• The rating data
• The algorithm for finding similarities
• The algorithm for selecting the closest (n% or nearest N)
• Method which takes a user, returns a list of recommendations
• Where can we find such a class?
Process the data (3/4):
User-User
• Mahout
• From Apache
• Provides Machine learning libraries, focused on Clustering, Classification, and
Recommendations.
• Provides Hadoop versions of some of the algorithms.
• Recommendation is to stay with non-Hadoop versions if your data is small enough
Process the data (4/4):
Item-Item
• Find items that are rated in a similar manner—these items are considered
“similar”
• Find items that are similar to the items a user has already bought or rated highly
• Recommend those items
• Example:

        item 1  item 2  item 3  item 4
Harry     4       2       5       5
John      2       5       1       4
Lily      3       5       2       1
Fred      5       2      ???      5
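For the example table, an item-item prediction of Fred's missing rating can be sketched as follows. This is an illustration under assumptions rather than Mahout's implementation: item similarity is cosine over the users who rated both items, and the prediction is a similarity-weighted average of Fred's own ratings. The function names are hypothetical.

```python
import math

ratings = {
    "Harry": {"item 1": 4, "item 2": 2, "item 3": 5, "item 4": 5},
    "John":  {"item 1": 2, "item 2": 5, "item 3": 1, "item 4": 4},
    "Lily":  {"item 1": 3, "item 2": 5, "item 3": 2, "item 4": 1},
    "Fred":  {"item 1": 5, "item 2": 2, "item 4": 5},  # item 3 unknown
}

def item_similarity(item_a, item_b):
    """Cosine similarity of two items over the users who rated both."""
    users = [u for u, r in ratings.items() if item_a in r and item_b in r]
    if not users:
        return 0.0
    dot = sum(ratings[u][item_a] * ratings[u][item_b] for u in users)
    norm_a = math.sqrt(sum(ratings[u][item_a] ** 2 for u in users))
    norm_b = math.sqrt(sum(ratings[u][item_b] ** 2 for u in users))
    return dot / (norm_a * norm_b)

def predict(user, item):
    """Average the user's own ratings, weighted by how similar each
    rated item is to the item being predicted."""
    numerator = denominator = 0.0
    for other_item, rating in ratings[user].items():
        sim = item_similarity(item, other_item)
        numerator += sim * rating
        denominator += sim
    return numerator / denominator if denominator else None

print(round(predict("Fred", "item 3"), 1))  # 4.2
```

The item similarities depend only on the ratings table, not on the user being served, so they can be computed offline and reused.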
Process the data (4/4):
Item-Item
• Mahout can be used for this as well
• What if data is too big?
Process the data (4/4):
Item-Item
Steps                                 Job
Convert items to index                itemIdIndex
Convert ratings to vector per user    toUserVectors
Build Item Vectors                    toItemVectors
Calculate Item Similarity             RowSimilarityJob
Calculate Item/User ratings           PartialMultiply
Extract the top N recommendations     AggregateAndRecommend
Analyze the data
• Both Lenskit and Mahout let you hold back part of the ratings, and then let you
see if the recommender recommends the missing data
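The hold-back idea can be sketched without either library. This is a minimal illustration under assumptions: ratings are (user, item, rating) triples, the score is root mean squared error, and a simple item-mean predictor stands in for a real recommender. None of the function names come from Lenskit or Mahout.

```python
import math
import random

def split_ratings(ratings, holdout_fraction=0.25, seed=42):
    """Shuffle the rating triples and hold back a fraction of them."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * holdout_fraction)
    return shuffled[cut:], shuffled[:cut]  # (training, held back)

def item_mean_predictor(train):
    """Stand-in recommender: predict each item's mean training rating,
    falling back to the overall mean for items not seen in training."""
    sums, counts = {}, {}
    for _, item, rating in train:
        sums[item] = sums.get(item, 0.0) + rating
        counts[item] = counts.get(item, 0) + 1
    overall = sum(sums.values()) / sum(counts.values())
    return lambda user, item: sums[item] / counts[item] if item in counts else overall

def rmse(predict, held_out):
    """Root mean squared error of the predictions on held-back ratings."""
    errors = [(predict(user, item) - rating) ** 2 for user, item, rating in held_out]
    return math.sqrt(sum(errors) / len(errors))

# The 15 known ratings from the user-user example table.
table = {
    "Bob": [4, 2, 5, 5], "Sally": [2, 5, 1, 4],
    "Jim": [3, 5, 2, 1], "Lucy": [5, 2, 4],
}
ratings = [
    (user, f"item {n}", value)
    for user, row in table.items()
    for n, value in enumerate(row, start=1)
]

train, held_out = split_ratings(ratings)
print(len(train), len(held_out))  # 12 3
print(rmse(item_mean_predictor(train), held_out))
```

A lower error means the recommender reconstructs the hidden ratings more closely; swapping in a different predictor function allows a like-for-like comparison on the same split.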
Choosing after generating recommendations
• Check the evaluation score
• Can also evaluate based on the variety of the recommendations: score the
variety as a number and blend it with the evaluation score
• Performance (to see if it meets your budget)
• Runtime characteristics (online/offline)
• Usual findings:
• Item-Item more stable than User-User
• Item-Item generally provides better recommendations
• Non Personalized is fast
• Content based depends on data
Present the data
• No statistical terms
• Visualizations work well
• Histograms—bar for 1 star, bar for 2 stars, bar for 3 stars, bar for 4 stars
• Tables
• Complicated charts do not work well, e.g. a graph based on similarity to the user.
• Helpful to explain “why” if you can do it in a non statistical way
Summary
• Decide to recommend
• Gather data
• Choose some recommenders
• Generate recommendations
• Evaluate results
• Present the data
Other items of interest
• Other recommenders
• Dimensionality reduction via Singular Value Decomposition (SVD)
• Slope One recommenders
• Making the data better
• Use rating above user rating
• Tossing out newer users to avoid fraud
• Other fraud detection mechanisms
Resources
• Mahout In Action
• Lenskit Documentation—designed for students to pick up and use
• Mahout Documentation—assumes some recommendation background.
• Explaining Collaborative Filtering Recommendations, Jonathan L. Herlocker,
Joseph A. Konstan, and John Riedl. CSCW '00: Proceedings of the 2000 ACM
Conference on Computer Supported Cooperative Work, 2000
• Introduction to Recommender Systems (Coursera); may be offered again
• ACM Conference Series on Recommender Systems