A shot at Netflix Challenge
Hybrid Recommendation System
Priyank Chodisetti
Problem and Approach
• A data set of 240,000 users and their ratings for 17,770 movies is provided.
• Given a user ‘p’ and a movie ‘m’, we should predict how ‘p’ will rate movie ‘m’.
• My idea: take two entirely different approaches and merge their results.
• Applied Latent Semantic Analysis and collaborative filtering techniques to the dataset independently.
• Through LSI, mapped the dataset to a lower-dimensional space and tried to extract relations between different movies.
• Through collaborative filtering, tried to find user tastes by comparing with other similar users.
• Major problems:
  – Computationally large; for example, one solution of mine ran for 14 hours with most disappointing results.
  – The matrix is ~99% sparse, i.e. ~99% of its values are missing.
Handling Major Problems
• Generally, missing values are handled by taking the average rating given by the user, or the overall average rating of all users. But I believe the $1,000,000 winner will be the one who handles the missing values well.
• Adopted the method described in [2], which aptly fits the current situation.
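For contrast with the method from [2], the common per-user-average fill mentioned above can be sketched in a few lines (a toy illustration; the ratings matrix here is made up, with 0 marking a missing rating):

```python
import numpy as np

# Toy user x movie ratings matrix; 0 marks a missing rating.
R = np.array([
    [5, 0, 3],
    [0, 4, 0],
    [2, 0, 0],
], dtype=float)

observed = R > 0
# Average of each user's observed ratings.
user_mean = R.sum(axis=1) / observed.sum(axis=1)
# Fill every missing entry with that user's average.
R_filled = np.where(observed, R, user_mean[:, None])
```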
• LSI:
  – Apply SVD to the matrix and retain the first ‘k’ largest singular values. This gives us the space in ‘k’ dimensions, i.e. the best rank-‘k’ approximation.
  – But to how many dimensions? Experiment.
  – To make a prediction of person p’s rating for movie m, take the mth row of U, matrix-multiply it with S, and matrix-multiply that with the pth column of V(t).
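The prediction rule in the last bullet can be sketched with NumPy's SVD (a toy dense matrix stands in for the real sparse one, which was factored with SVDLIBC; the values are made up):

```python
import numpy as np

# Toy movie x user ratings matrix (rows: movies, columns: users).
R = np.array([
    [5., 3., 4.],
    [4., 2., 4.],
    [1., 5., 2.],
    [2., 4., 1.],
])

k = 2                                    # number of retained singular values
U, s, Vt = np.linalg.svd(R, full_matrices=False)
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

def predict(p, m):
    """Rating of user p for movie m: (mth row of U) * S * (pth column of V^T)."""
    return float(U_k[m] @ S_k @ Vt_k[:, p])

R_hat = U_k @ S_k @ Vt_k                 # best rank-k approximation of R
```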
• Collaborative filtering:
  – Find the kNN and come out with a predicted rating.
  – If we consider Euclidean distance as the distance measure, we have 17,770 dimensions; so consider the Pearson coefficient instead.
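A minimal user-based kNN predictor with Pearson similarity might look like this (toy data; the names `pearson`/`predict` and the fallback rules are my own illustrative choices, not necessarily the original implementation):

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation over the items both users rated (0 = unrated)."""
    both = (a > 0) & (b > 0)
    if both.sum() < 2:
        return 0.0
    x = a[both] - a[both].mean()
    y = b[both] - b[both].mean()
    denom = np.sqrt((x * x).sum() * (y * y).sum())
    return float((x * y).sum() / denom) if denom > 0 else 0.0

def predict(R, p, m, k=2):
    """Predict user p's rating for movie m from the k most similar raters of m."""
    raters = [u for u in range(R.shape[0]) if u != p and R[u, m] > 0]
    sims = {u: pearson(R[p], R[u]) for u in raters}
    nn = [u for u in sorted(raters, key=sims.get, reverse=True)[:k] if sims[u] > 0]
    if not nn:                               # no positively similar rater found
        return float(R[p][R[p] > 0].mean())
    w = np.array([sims[u] for u in nn])
    return float((w * R[np.array(nn), m]).sum() / w.sum())

R = np.array([                               # toy user x movie matrix
    [5, 4, 0, 1],
    [5, 5, 4, 1],
    [1, 2, 5, 5],
    [2, 1, 4, 5],
], dtype=float)
```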
Implementation
• Mixing LSI and collaborative filtering:
  – Find the kNN in the reduced-dimensional space, with Euclidean distance as the distance measure.
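The hybrid step of finding nearest neighbours in the SVD-reduced space under plain Euclidean distance can be sketched as follows (toy data; in the real system the k-dimensional user vectors would come from the SVDLIBC factorization):

```python
import numpy as np

# Toy movie x user matrix (columns are users).
R = np.array([
    [5., 5., 1., 2.],
    [4., 5., 2., 1.],
    [1., 2., 5., 4.],
    [1., 1., 5., 5.],
])

k = 2
U, s, Vt = np.linalg.svd(R, full_matrices=False)
# Coordinates of each user in the reduced k-dimensional space.
user_vecs = (np.diag(s[:k]) @ Vt[:k, :]).T       # shape (n_users, k)

def nearest_users(p, n=1):
    """Indices of the n users closest to user p by Euclidean distance."""
    d = np.linalg.norm(user_vecs - user_vecs[p], axis=1)
    return [u for u in np.argsort(d) if u != p][:n]
```

In this reduced space, users 0 and 1 (who gave similar ratings) land close together, while users 2 and 3 form a second cluster.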
• Used SVDLIBC, which uses the Lanczos method for singular value decomposition.
Computational challenges:
  – All the files in the training set are combined into one single large file, so as to reduce disk access and improve response time.
  – Converted the whole data set into a sparse text format.
  – Also generated a large data set in "user: movie, rating" format, in contrast to the given "movie: user, rating" format.
  – Implemented in C++.
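The last data-layout step, re-keying the training data from movie-major to user-major order, can be sketched in memory (a toy Python version of what the C++ code does over one large file; the names are illustrative):

```python
from collections import defaultdict

# Given format: movie -> [(user, rating), ...]
by_movie = {
    1: [("alice", 5), ("bob", 3)],
    2: [("alice", 4)],
}

# Target format: user -> [(movie, rating), ...]
by_user = defaultdict(list)
for movie, pairs in by_movie.items():
    for user, rating in pairs:
        by_user[user].append((movie, rating))
```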
Future Extensions this Winter
– Plan to implement the Generalized Hebbian Algorithm, so as to reduce computation time and make missing values easier to handle.
– Interested and motivated friends can join me this winter.
References
1. M. Brand. Fast Online SVD Revisions for Lightweight Recommender Systems. In Proc. SIAM International Conference on Data Mining, 2003.
2. M. W. Berry. Incremental Singular Value Decomposition of Uncertain Data. In Proceedings of the European Conference on SIGIR, ACM, 1999.
3. B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Application of Dimensionality Reduction in Recommender System - A Case Study. In ACM WebKDD Workshop, 2000.