music recommendation system for last.fm dataset

MUSIC RECOMMENDATION SYSTEM FOR LAST.FM DATASET
Why is a music recommendation system required?
What is data mining?

Data mining, also called data or knowledge discovery, is the process
of analyzing data from different perspectives and summarizing it into
useful information.
http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm
http://www.headsafrica.com/headsafrica/application/views/services/client/zf_files/images/data_mining/data_mining.jpg
Data Mining Modelling
Clustering
Items are grouped according to their similar characteristics. The method
considers the similarities of the data items among themselves.
Classification
A very common technique for prediction. It categorizes data items:
unclassified cases are assigned a class label based on the cases that
are already labelled.
Association
This technique examines the relationships among the existing records in
the database and determines which events occur together.
What is a recommendation engine?

A recommendation system is a system that interprets the data users
enter into it and makes recommendations to those users.
Recommendation Techniques
• Content-based Filtering
The salient features of the content a user previously liked or watched
are stored, usually in a database, and a profile is created for the user.
When making a recommendation, the system consults this profile and
recommends the content whose features are nearest to the stored feature
set.
https://www.ntt-review.jp/archive_html/200804/images/le1_fig02.gif
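The content-based idea above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the slides: features are tags, the user profile is the sum of the tag counts of liked items, "nearest" is measured with cosine similarity, and the catalogue `ITEM_FEATURES` is a hypothetical toy example.

```python
import math
from collections import Counter

def build_profile(liked_items, item_features):
    """Sum the feature counts of every item the user liked."""
    profile = Counter()
    for item in liked_items:
        profile.update(item_features[item])
    return profile

def cosine(a, b):
    """Cosine similarity between two sparse feature-count mappings."""
    dot = sum(a[f] * b[f] for f in a if f in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recommend_content_based(liked_items, item_features, top_n=1):
    """Return the unseen items whose features are nearest to the profile."""
    profile = build_profile(liked_items, item_features)
    candidates = [i for i in item_features if i not in liked_items]
    scored = [(cosine(profile, Counter(item_features[i])), i) for i in candidates]
    scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, ties by name
    return [item for _, item in scored[:top_n]]

# Hypothetical toy catalogue: item -> list of tags
ITEM_FEATURES = {
    "track_a": ["rock", "indie"],
    "track_b": ["rock", "britpop"],
    "track_c": ["jazz", "piano"],
    "track_d": ["indie", "britpop"],
}

print(recommend_content_based(["track_a", "track_b"], ITEM_FEATURES))
```

A user who liked track_a and track_b has a profile dominated by rock, indie and britpop, so track_d ranks above the jazz track.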
Recommendation Techniques

Collaborative Filtering
This is the foundation of approaches built on the idea that people who
like the same things have similar tastes.
It does not depend on a single user's content-property profile. Instead,
recommendations take into account users who like similar content or who
have similar characteristics.
http://www.bridgewell.com/images_en/ec_03.jpg
Recommendation Techniques

Collaborative Filtering Types
• User-based recommendation: finds similar users and recommends the
items they liked.
• Item-based recommendation: calculates the similarity between items and
recommends similar items.
http://oytunyuksel.com/wp-content/uploads/post-02-01.jpg
How is a recommendation engine created?

When a recommendation engine is created, the following steps should be
carried out:
• Defining the data representation
• Creating the database or file model structure
• Pre-processing the data to get the best result
http://www.w3.org/WAI/TIDE/phases.gif
What is Apache Mahout?

It is a Java library of scalable machine-learning algorithms,
implemented on top of Apache Hadoop and using the MapReduce
paradigm.
To use Mahout in a project:

Download the latest Mahout release, 0.8. It is available at the link
below:
http://apache.fastbull.org/mahout/0.8/mahout-distribution-0.8.zip

Extract all the libraries and include them in a new Eclipse (or NetBeans)
project as external JAR files.

Java 1.6.x or greater is required for installation.

Hadoop is not mandatory for creating a recommendation engine.
http://hortonworks.com/hadoop/mahout/
http://hortonworks.com/wp-content/uploads/2013/09/mantle-mahout.png
How to use Mahout for recommendation?

Recommendation in Mahout follows these steps:
• The dataset is converted to a Mahout-compliant format
• A compatible recommender component is chosen
• Similarities are computed from ratings or preferences
• The recommendation is evaluated
Recommender job flow
The main step doing the heavy lifting in the workflow is the "calculate
co-occurrences" step. This step is responsible for doing pairwise
comparisons across the entire matrix, looking for commonalities.
http://www.ibm.com/developerworks/library/j-mahout-scaling/
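To make the co-occurrence step concrete, here is a minimal in-memory sketch in Python. It is not Mahout's distributed MapReduce job; it only shows the pairwise counting idea on a hypothetical toy input (`USER_ITEMS`), where each user maps to the items they have preferences for.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_counts(user_items):
    """Count, for every pair of items, how many users have both."""
    counts = defaultdict(int)
    for items in user_items.values():
        # Sort so each pair is always stored in the same order.
        for a, b in combinations(sorted(set(items)), 2):
            counts[(a, b)] += 1
    return dict(counts)

# Hypothetical toy input: user -> items
USER_ITEMS = {
    "u1": ["rock", "indie", "pop"],
    "u2": ["rock", "indie"],
    "u3": ["pop", "jazz"],
}

counts = cooccurrence_counts(USER_ITEMS)
print(counts[("indie", "rock")])
```

The cost grows with the square of the number of items per user, which is why this step dominates the workflow on large matrices.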
The background process of recommendation in the architecture
Graduation Project with Last.fm

Scheduling
• Gantt chart
Graduation Project with Last.fm
What are the important risks?
• Big data
• Time
• Computer performance
• Sparsity
http://www.pm-primer.com/wp-content/uploads/2012/04/risk1.jpg
Music recommendation project for Last.fm

The «Last.fm Dataset - 1K users» dataset is used in the project. It
contains information about user properties and about which songs each
user listened to.
• The dataset consists of 2 files: one is the users' profile file and the
other contains the users' listening history.

There are 1000 users, and the listening history belonging to these 1000
users has 19,150,868 lines.
Music recommendation project for Last.fm

The Last.fm API is used and a new CSV format is created.

Although there are 1000 users, files with the desired properties were
prepared for only 700 users during the project period, due to time
constraints.
After the files were prepared, they were all loaded into database tables
for easier data processing. The tables are:

Artists
Tracks
UserTagTrack
Users
TrackTags
Music recommendation project for Last.fm

The collaborative filtering method is used.

2 types of segmentation are considered.
• In the first, recommendations are made among users clustered by
gender, age and country.
• In the second, recommendations are made among all users.

A user-based recommendation engine is created.

The JDBC and File data models are used for data representation.
Music recommendation project for Last.fm

Weka is used for clustering because of its simplicity. All users'
characteristics were represented as values (see thesis, pages 33-34).
goes
…….
Music recommendation project for Last.fm

Many methods can be used for collaborative filtering:
• Mean Squared Differences Algorithm
• Vector Similarity
• Pearson Correlation Coefficient
• Strengths and Weaknesses of Collaborative Filtering Method

The Pearson Correlation Similarity algorithm is used for the thesis data
model, since it is convenient and gives correct results for huge
amounts of data.
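As an illustration of the Pearson correlation coefficient between two users, a minimal Python sketch follows. It assumes each user is a mapping from item to preference value and that the correlation is computed only over the items both users share; the function name and the sample data are hypothetical, and this is not Mahout's code.

```python
import math

def pearson_similarity(prefs_a, prefs_b):
    """Pearson correlation over the items both users have preferences for.

    Returns None when it is undefined (fewer than two shared items,
    or one user has identical values for all shared items).
    """
    common = sorted(set(prefs_a) & set(prefs_b))
    n = len(common)
    if n < 2:
        return None
    xs = [prefs_a[i] for i in common]
    ys = [prefs_b[i] for i in common]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)

# Hypothetical preferences: tag -> summed preference
alice = {"rock": 1, "pop": 2, "jazz": 3}
bob = {"rock": 2, "pop": 4, "jazz": 6}    # same taste, larger scale
carol = {"rock": 6, "pop": 4, "jazz": 2}  # opposite taste

print(pearson_similarity(alice, bob), pearson_similarity(alice, carol))
```

The result lies between -1 and 1 and ignores the scale of the values, so two users with the same ordering of preferences but different listening volumes still count as similar.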
The functionality of the project system
JDBC Model - Database Tables
Artists: artist id, artist name
Tracks: track id, track name, artist id, published year
TrackTags: tag id, tag name
Users: user id, user name, gender, age, country
UserTagTrack: usertagtrack id, user id, track id, tag id, preferences
• This is the general (default) database; all files and other databases
are created from it.
Recommendation Model
PrefUserTag: user id, tag id, sum (preferences)
PrefUserTrack: user id, track id, sum (preferences)
PrefTagTrack: track id, tag id, sum (preferences)
• In JDBCDataModel, primary keys must be defined for the sake of time
efficiency. The database format should be:
Number of elements in tables
• The tables whose names begin with «Pref» are formatted for the Mahout
recommendation functions.
• They contain far less data than the UserTagTrack table.
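How a «Pref» table can be derived from UserTagTrack is sketched below in Python. This is an assumption-laden illustration: the slides only say the «Pref» tables hold sum (preferences), so the grouping keys, the row layout and the sample rows (`ROWS`) are hypothetical. In a database the same result would come from a GROUP BY query.

```python
from collections import defaultdict

def sum_preferences(rows, key_fields):
    """Group rows by key_fields and sum their 'preferences' values."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        totals[key] += row["preferences"]
    return dict(totals)

# Hypothetical UserTagTrack rows
ROWS = [
    {"user_id": 1, "track_id": 10, "tag_id": 100, "preferences": 20},
    {"user_id": 1, "track_id": 11, "tag_id": 100, "preferences": 50},
    {"user_id": 1, "track_id": 11, "tag_id": 101, "preferences": 50},
    {"user_id": 2, "track_id": 10, "tag_id": 100, "preferences": 20},
]

pref_user_tag = sum_preferences(ROWS, ("user_id", "tag_id"))
pref_user_track = sum_preferences(ROWS, ("user_id", "track_id"))
print(pref_user_tag)
print(pref_user_track)
```

Each aggregated table has at most one row per key pair, which is why the «Pref» tables are much smaller than UserTagTrack.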
Number of elements in tables
Before the assignment of the primary key
With the primary key, the format is shown below:
user id    tag id    sum (preferences)
The introduction of the system

After the text file is created via the API, a standard line of text
looks as follows:
user name, artist name, track name, published year, tags
user_000103, Super Furry Animals, The Undefeated, 2003, indie, britpop, rock, trumpet, pop

This line is represented in the UserTagTrack table as:
usertagtrackid   user id       track id         tag id    preferences
1                user_000103   The Undefeated   indie     20
2                user_000103   The Undefeated   britpop   20
3                user_000103   The Undefeated   rock      20
4                user_000103   The Undefeated   trumpet   20
5                user_000103   The Undefeated   pop       20
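The expansion of one text line into UserTagTrack rows can be sketched as follows in Python. Two assumptions are made here that the slides do not confirm: the first four comma-separated fields never contain commas themselves, and the preference value is passed in as a parameter, because the slides do not state how the value 20 is derived. The function name `expand_line` is hypothetical.

```python
def expand_line(line, preference, start_id=1):
    """Turn one 'user, artist, track, year, tag, tag, ...' line into
    one row per tag, mirroring the UserTagTrack layout."""
    fields = [f.strip() for f in line.split(",")]
    # artist and year are parsed but not stored in UserTagTrack.
    user, artist, track, year = fields[:4]
    tags = fields[4:]
    return [
        {
            "usertagtrack_id": start_id + n,
            "user_id": user,
            "track_id": track,
            "tag_id": tag,
            "preferences": preference,
        }
        for n, tag in enumerate(tags)
    ]

LINE = ("user_000103, Super Furry Animals, The Undefeated, 2003, "
        "indie, britpop, rock, trumpet, pop")

rows = expand_line(LINE, preference=20)
for row in rows:
    print(row)
```

In the real tables the user, track and tag columns hold numeric ids. The names are kept here only so the rows can be compared with the table above.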
The functions used in the recommendation engine
• The working principle of the user-based recommendation engine:
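A user-based engine can be sketched in Python as three steps: compute the similarity between the target user and every other user, keep the nearest neighbours, and score the items the target user has not seen with a similarity-weighted average of the neighbours' preferences. This is a simplified sketch and not Mahout's implementation; in particular, dropping neighbours with non-positive similarity is an assumption, and `PREFS` is hypothetical toy data.

```python
import math

def pearson(a, b):
    """Pearson correlation over shared items; 0.0 when undefined."""
    common = sorted(set(a) & set(b))
    n = len(common)
    if n < 2:
        return 0.0
    xs = [a[i] for i in common]
    ys = [b[i] for i in common]
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den if den else 0.0

def recommend_user_based(prefs, user, neighbourhood_size=2, how_many=5):
    # 1. Similarity of the target user to every other user.
    sims = [(pearson(prefs[user], prefs[o]), o) for o in prefs if o != user]
    # 2. Keep the N most similar users (positive similarity only).
    neighbours = sorted((s for s in sims if s[0] > 0),
                        key=lambda s: (-s[0], s[1]))[:neighbourhood_size]
    # 3. Similarity-weighted average over items the user has not seen.
    weighted, weights = {}, {}
    for sim, other in neighbours:
        for item, value in prefs[other].items():
            if item in prefs[user]:
                continue
            weighted[item] = weighted.get(item, 0.0) + sim * value
            weights[item] = weights.get(item, 0.0) + sim
    scored = sorted(((weighted[i] / weights[i], i) for i in weighted),
                    key=lambda s: (-s[0], s[1]))
    return [(item, score) for score, item in scored[:how_many]]

# Hypothetical toy data: user -> {tag: summed preference}
PREFS = {
    "u1": {"rock": 5, "pop": 3, "jazz": 1},
    "u2": {"rock": 10, "pop": 6, "jazz": 2, "indie": 8},
    "u3": {"rock": 1, "pop": 3, "jazz": 5, "metal": 9},
}

print(recommend_user_based(PREFS, "u1"))
```

Here u2 has the same taste as u1 and u3 the opposite taste, so only u2's unseen item (indie) is recommended. The `neighbourhood_size` and `how_many` parameters play the role of the neighbourhood size and number of recommendations in the result tables.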
Recommendation Results
A practically unlimited number of results can be obtained via the
evaluator program. In the thesis, pages 41-51 contain many results under
different conditions.
Table Name: PrefUserTag
Neighbourhood Size: 2
For User Id: 5
# Recommendations: 5
Results                                          Tag Name
RecommendedItem[item:112040, value:213.03076]    missjudy76
RecommendedItem[item:3387, value:211.02057]      my 750 essential songs
RecommendedItem[item:8124, value:194.43637]      lionel richie
RecommendedItem[item:8147, value:175.26286]      leona lewis
RecommendedItem[item:1809, value:167.69398]      better than the original
Recommendation Results
Table Name: PrefUserTrack
For User Id: 5
# Recommendations: 5
Neighbourhood Size: 2
Results                                     Track Name
RecommendedItem[item:7064, value:73.0]      Out Of Control
Neighbourhood Size: 7
Results                                     Track Name
RecommendedItem[item:16570, value:304.5]    When You'Re Gone
RecommendedItem[item:7064, value:73.0]      Out Of Control
RecommendedItem[item:1466, value:9.0]       Aerodynamic
RecommendedItem[item:7170, value:5.0]       Bring Me To Life
RecommendedItem[item:2969, value:5.0]       Number Five With A Bullet
How to evaluate results?

The results of this recommendation engine are evaluated with the most
common metrics, precision and recall.

Precision is the ratio of relevant items that were recommended to the
total number of items recommended.

Recall is the ratio of relevant items that were recommended to the total
number of items that are relevant to the user.
                  Predicted as positive   Predicted as negative
Actual Positive   TP                      FN
Actual Negative   FP                      TN
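In terms of the table above, precision is TP / (TP + FP) and recall is TP / (TP + FN). A minimal Python sketch, with hypothetical counts chosen only for illustration:

```python
def precision(tp, fp):
    """Share of recommended items that are relevant."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Share of relevant items that were recommended."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical counts: 5 items recommended, 3 of them relevant,
# and 1 relevant item that was not recommended.
tp, fp, fn = 3, 2, 1
print(precision(tp, fp), recall(tp, fn))
```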
How to evaluate results?

Precision and recall are provided by the RecommenderIRStatsEvaluator
class in Mahout. Its evaluate function returns the F-measure, precision
and recall of the recommendation engine.
• Several parameters are passed to this function. The important one is
«at», the number of recommendations to consider when evaluating
precision
o precision at N (an integer value)
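The role of the «at» parameter can be illustrated with a small Python sketch. It does not reproduce how Mahout's evaluator selects the relevant items or splits the data; it only shows precision, recall and the F1 measure computed on the first N recommendations, with a hypothetical ranked list and relevant set.

```python
def precision_recall_at(recommended, relevant, at):
    """Precision and recall using only the first `at` recommendations."""
    top = recommended[:at]
    hits = sum(1 for item in top if item in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def f1_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

# Hypothetical ranked recommendations and relevant items
recommended = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "x"}

p5, r5 = precision_recall_at(recommended, relevant, at=5)
p2, r2 = precision_recall_at(recommended, relevant, at=2)
print(p5, r5, f1_measure(p5, r5))
print(p2, r2, f1_measure(p2, r2))
```

A smaller «at» judges only the top of the list, so precision and recall can move in opposite directions as it changes.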
Evaluation Results
Table Name: PrefUserTag
Data Model Structure: User-Tag-Preference
Row-Column Variable Number: # users: 700, # items: 14044
Neighbourhood Size: 2
5 recommendations
Precision: 0.9784243295019155
Recall: 0.9741058655221752

Table Name: PrefUserTrack
Data Model Structure: User-Track-Preference
Row-Column Variable Number: # users: 700, # items: 316018
Neighbourhood Size: 2
5 recommendations
Precision: 0.033268482490272366
Recall: 0.005531505531505532
Evaluation Results
Table Name: PrefUserTrack
Data Model Structure: User-Track-Preference
Row-Column Variable Number: # users: 700, # items: 316018
Neighbourhood Size: 3
5 recommendations
Precision: 0.036322463768115994
Recall: 0.012746512746512747
Comments on the evaluation results

As the neighbourhood size increases, the recommendation engine's results
improve, because of the working principle of the similarity function. In
the user-track model, going from 2 to 3 neighbours raised precision from
about 0.033 to 0.036 and recall from about 0.006 to 0.013.

The user-tag recommendation engine performs better than the user-track
recommendation engine because of data size and sparsity: the same 700
users are spread over 14044 tags in one case and 316018 tracks in the
other.

People with similar characteristics also have similar musical tastes.

When the neighbourhood size increases, the number of recommended items
increases.
Self-criticism I

Creating the dataset and the data representation took a long time.
Using a ready-made dataset would have saved the project holder extra
time.

There is a huge amount of data in the data model. Scanning all the data
and making recommendations took a long time because of limited computer
capacity. A more powerful computer would have helped.

Out-of-memory errors were the most frequently encountered problem while
calculating the evaluation results, because of insufficient Java heap
space under the operating system or Java version used.
Self-criticism II

Slowness and memory errors can be addressed with parallel programming.
Using a server is another alternative solution.

The User-Track profile results are not good; the recommendation engine's
performance for this model could be improved.

With more computer capacity, more data can be used for the
recommendation engine.
http://d1jb6zrebfcfrk.cloudfront.net/assets/content/cache/made/65b7808e1a1599d2/Think_Bigger,_Make_Better_3_860_484.png
http://thisiscolossal.com/wp-content/uploads/2011/01/better-3-600x337.jpg
Thank you for listening