
The Wisdom of the Few
Xavier Amatriain, Neal Lathia, Josep M. Pujol
SIGIR’09
Advisor: Jia-Ling Koh
Speaker: Yu-Cheng Hsieh
Outline
• Introduction
• Mining the Web For Expert Ratings
• Expert Nearest-Neighbors
• Result
• User Study
• Discussion
• Conclusion
Introduction
• Nearest-neighbor collaborative filtering suffers from the following shortcomings:
- Data sparsity
- Noise
- Cold-start problem
- Scalability
These issues make it difficult to define reliable similarities between users.
Mining the Web For Expert Ratings
• Collect 8,000 movies from Netflix
• 1,750 experts from "Rotten Tomatoes"
- Remove experts who rated fewer than 250 movies
- Only 169 experts remain
• General users: Netflix members
Mining the Web For Expert Ratings
• Dataset analysis: Data sparsity
- Experts achieve lower data sparsity than general users
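The sparsity comparison above can be sketched numerically. This is a minimal illustration, assuming ratings are stored as (user, item, rating) triples; the data layout and function name are hypothetical, not from the paper.

```python
# Sketch: data sparsity of a rating set, assuming ratings are kept
# as (user, item, rating) triples (hypothetical layout).
def sparsity(triples, n_users, n_items):
    # Fraction of the user-item matrix that holds no rating.
    return 1.0 - len(triples) / (n_users * n_items)

# Toy data: 3 ratings in a 2-user x 3-item matrix.
triples = [(0, 0, 4.0), (0, 2, 3.0), (1, 1, 5.0)]
print(sparsity(triples, n_users=2, n_items=3))  # 0.5
```

A smaller, densely rating expert set yields a lower value here than the general user population would.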
Mining the Web For Expert Ratings
• Dataset analysis: Average rating distribution
Expert Nearest-Neighbors
• User-based Collaborative Filtering (CF) is used
• Two stages
1. Construct the user-item matrix
2. Construct the user-expert matrix
Expert Nearest-Neighbors
• Construct user-expert matrix
- Calculate the similarity between a user and an expert:

sim(a, b) = [ Σᵢ r_ai · r_bi / ( √(Σᵢ r_ai²) · √(Σᵢ r_bi²) ) ] · [ 2·N_{a∩b} / (N_a + N_b) ]

N_a : total number of items user a has rated
N_{a∩b} : total number of items users a and b have both rated
r_ai : rating for item i from user a
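The similarity above (cosine similarity scaled by an overlap factor) can be sketched as follows. This is an assumed reading of the slide's formula: the dot product is taken over co-rated items and the norms over each user's full rating set; the dict-based data layout is hypothetical.

```python
from math import sqrt

def expert_similarity(ratings_a, ratings_b):
    """Cosine similarity scaled by the overlap factor
    2 * N_{a and b} / (N_a + N_b), as in the slide's formula.
    ratings_a / ratings_b map item -> rating (hypothetical layout)."""
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0
    # Dot product over co-rated items.
    dot = sum(ratings_a[i] * ratings_b[i] for i in common)
    # Norms over each user's full set of ratings.
    norm_a = sqrt(sum(r * r for r in ratings_a.values()))
    norm_b = sqrt(sum(r * r for r in ratings_b.values()))
    cosine = dot / (norm_a * norm_b)
    # Down-weight pairs with little rating overlap.
    overlap = 2 * len(common) / (len(ratings_a) + len(ratings_b))
    return cosine * overlap
```

The overlap factor penalizes similarities computed from very few co-rated items, which would otherwise be unreliable.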
Expert Nearest-Neighbors
• Only select experts such that sim(e, a) ≥ δ
• A threshold τ as the minimum number of expert neighbors who must have rated the item
Expert Nearest-Neighbors
• Construct user-expert matrix
- Rating prediction:

r_ai = σ_u + [ Σ_{e∈E} (r_ei − σ_e) · sim(e, a) ] / [ Σ_{e∈E} sim(e, a) ]

E = {e₁, e₂, ..., eₙ}, where n ≥ τ
σ_u , σ_e : mean ratings of the user and the expert, respectively
r_ai : predicted rating for item i from user a
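The mean-centered, similarity-weighted prediction above can be sketched as follows. The list-of-triples input layout and the function name are assumptions for illustration; δ and τ are the thresholds from the slides.

```python
def predict_rating(user_mean, experts, item, delta=0.01, tau=10):
    """Predict r_ai = sigma_u + sum((r_ei - sigma_e) * sim) / sum(sim)
    over expert neighbors, per the slide's formula.
    experts: list of (mean_rating, {item: rating}, sim_to_user)
    triples (hypothetical layout)."""
    # Keep experts above the similarity threshold who rated the item.
    neighbors = [(mean, ratings[item], sim)
                 for mean, ratings, sim in experts
                 if sim >= delta and item in ratings]
    # Require at least tau qualifying expert neighbors.
    if len(neighbors) < tau:
        return None
    num = sum((r - mean) * sim for mean, r, sim in neighbors)
    den = sum(sim for _, _, sim in neighbors)
    return user_mean + num / den
```

Centering each expert's rating on their own mean compensates for experts who rate systematically higher or lower than the user.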
Result
• Error in Predicted Recommendations
- 5-fold cross-validation
- Baseline: predict r_ai from the expert mean σ_e, without any similarity calculation
Result
Expert CF and Neighbor CF: δ = 0.01, τ = 10
Result
• Top-N Recommendation Precision
- No recommendation list with a fixed number of N items
- Items are only classified as recommendable or not recommendable, given a threshold on the predicted rating
Result
  0.01,   10
User Study
• Select 500,000 movies from Netflix (random sample)
• Separate the movies into 10 equal-density bins according to popularity
• Select 10 movies from each bin for 57 participants to rate (100 movies in total)
User Study
• 8,000 movies and their ratings are collected
• Generate four top-10 recommendation lists
- Random List
- Critics Choice
- Neighbor-CF (δ = 0.01, τ = 10)
- Expert-CF (δ = 0.01, τ = 10)
User Study
• K = 50
Discussion
• Data sparsity
- Experts are more likely to have rated a large percentage of the items
• Noise and malicious ratings
- Noise: experts are more consistent in their ratings
- Malicious ratings: experts rate items in good faith
• Cold-start problem
• Scalability
- The small expert set reduces the time complexity of computing the similarity matrix
Conclusions
• Proposed a recommendation approach based on the opinions of an external source of experts
• Uses only a few experts to predict user ratings
• Tackles the problems of traditional kNN-CF