Neural Networks
Homework #2
KDD Cup 2007
Who Rated What in 2006
Group 6
M9615904 傅家炫
M9615912 陳友書
Contents

System
Data Transform Method
Feature Extraction Method
Sampling Method
Experiment
Conclusion
System

Server: Sun Fire V210
  CPU: two UltraSPARC processors
  Memory: 8 GB
  OS: Sun Solaris 9
  DB: MySQL 5.0
PC
  CPU: Intel Core 2 Duo E6600
  Memory: 6 GB
  OS: Microsoft Windows Vista Ultimate x64
  MATLAB: MATLAB R2007b x64
Data Transform Method

(Pipeline diagram)
1. Load the training set from the Netflix Prize into the MySQL database.
2. Extract features from the database with SQL statements.
3. Dump the training data from the database and convert its format with a Perl script, producing the training data file.
4. Run backpropagation training on the training data in MATLAB.
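The MATLAB end of steps 3 and 4 is not shown on the slides; as a minimal sketch, assuming the Perl script writes the dump as a comma-separated text file named training_data.csv with the label in the last column (both assumptions), the file could be loaded like this:

    % Sketch only: the file name and column layout are assumptions, not from the slides.
    raw = dlmread('training_data.csv', ',');   % one row per (CustomerID, MovieID) pair

    X = raw(:, 1:end-1)';   % feature columns, transposed to (features x samples) for the NN toolbox
    T = raw(:, end)';       % last column assumed to hold the 0/1 "rated in 2006" label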
Feature Extraction Method

SQL statements:
  Numeric functions
  Data manipulation statements
  Data definition statements
Perl interpreter
UNIX shell script
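The historic features themselves are computed inside MySQL with SQL numeric functions; purely as an illustration of the kind of aggregates involved, and assuming a ratings matrix with columns (CustomerID, MovieID, Rating), comparable per-ID statistics could be built in MATLAB:

    % Illustration only: the real features come from SQL statements; the
    % 'ratings' layout (CustomerID, MovieID, Rating) is an assumption.
    movieCount = accumarray(ratings(:,2), 1);                        % number of ratings per movie
    movieMean  = accumarray(ratings(:,2), ratings(:,3), [], @mean);  % average rating per movie
    custCount  = accumarray(ratings(:,1), 1);                        % number of ratings per customer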
Sampling Method

(Sampling diagram)
Randomly sample MovieIDs with probability proportional to the number of ratings per MovieID.
Randomly sample CustomerIDs with probability proportional to the number of ratings per CustomerID.
Join both lists into (CustomerID, MovieID) pairs; each pair means that the user rated the movie in year YYYY.
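A minimal MATLAB sketch of this sampling step, assuming the Statistics Toolbox function randsample, a ratings matrix with columns (CustomerID, MovieID), and arbitrary example sample sizes:

    % Sketch: sample IDs with probability proportional to their number of ratings.
    movieCount = accumarray(ratings(:,2), 1);
    custCount  = accumarray(ratings(:,1), 1);
    movieIDs   = find(movieCount > 0);
    custIDs    = find(custCount  > 0);

    nMovies = 500;  nCustomers = 10000;   % example sizes, not taken from this slide
    sampMovies = randsample(movieIDs, nMovies,    true, movieCount(movieIDs));
    sampCust   = randsample(custIDs,  nCustomers, true, custCount(custIDs));

    % Join: keep the historic pairs whose customer and movie were both sampled.
    keep     = ismember(ratings(:,1), sampCust) & ismember(ratings(:,2), sampMovies);
    posPairs = ratings(keep, 1:2);        % (CustomerID, MovieID) pairs that were actually rated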
Sampling Method ( cont. )

(Sampling diagram, continued)
Positive pairs: the joined (CustomerID, MovieID) pairs, each meaning that the user rated the movie in year YYYY.
Negative pairs: randomly choose 10,000 CustomerIDs and 500 MovieIDs from the dataset and cross them with each other; throw away any pair that already exists in the historic data. Each remaining (CustomerID, MovieID) pair means that the user never rated the movie.
For every (CustomerID, MovieID) pair, get its historic features from the database to build the training data.
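A hedged sketch of the negative-pair construction, assuming allCustomers and allMovies are vectors of the distinct IDs in the dataset and ratings again holds the historic (CustomerID, MovieID) rows:

    % Sketch: cross 10,000 customers with 500 movies and drop pairs already seen in the history.
    ci = randperm(numel(allCustomers));  custSamp = allCustomers(ci(1:10000));
    mi = randperm(numel(allMovies));     movSamp  = allMovies(mi(1:500));

    [C, M]   = ndgrid(custSamp, movSamp);           % all 10,000 x 500 = 5,000,000 combinations
    pairs    = [C(:), M(:)];
    seen     = ismember(pairs, ratings(:, 1:2), 'rows');
    negPairs = pairs(~seen, :);                     % pairs where the user never rated the movie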
Experiment

Training approach: the training data is fed into MATLAB for training.
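The slides do not give the network architecture or training settings; a minimal backpropagation sketch with the R2007b-era Neural Network Toolbox, assuming X and T as loaded above and an arbitrary 10-unit hidden layer:

    % Sketch only: layer sizes and epoch count are assumptions, not from the slides.
    net = newff(minmax(X), [10 1], {'tansig', 'logsig'});  % hidden tansig, sigmoid output in [0, 1]
    net.trainParam.epochs = 200;
    net = train(net, X, T);      % backpropagation training on the prepared training data

    Y = sim(net, X);             % outputs read as the probability that the customer rated the movie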
Experiment ( cont. )
Conclusion

Probability of "who rated what", using each value as a division:

  Division    Quantity    Accuracy
  0.5         23,435       7.804 %
  0.6         62,897      24.491 %
  0.7         10,423      79.89 %
  0.8          3,064      89.179 %
  0.9            181      92.029 %
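The slide does not spell out how Quantity and accuracy were computed; one possible reading, in which each value marks a 0.1-wide band of network outputs, can be sketched as follows (Y and T as in the training sketch above):

    % One possible reading (an assumption): each row counts the pairs whose output falls in
    % the band [p, p+0.1) and measures how many of them were actually rated.
    edges = [0.5 0.6 0.7 0.8 0.9 1.0 + eps];
    for i = 1:5
        inBand   = Y >= edges(i) & Y < edges(i+1);
        quantity = sum(inBand);
        accuracy = 100 * sum(T(inBand) == 1) / max(quantity, 1);
        fprintf('%.1f as division => quantity = %d, accuracy = %.3f %%\n', ...
                edges(i), quantity, accuracy);
    end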
Conclusion ( cont. )

Why is no probability less than 0.5 predicted?
  the proportion of the samples
  the features were not normalized
Feature selection affects accuracy:
  no customer-movie related features were used
  other extraction methods could be tried, such as Hierarchical Clustering or K-Means Clustering
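Two of the suggested fixes can be sketched directly in MATLAB: mapminmax for feature normalization and the Statistics Toolbox kmeans for cluster-based features; the cluster count and variable names below are illustrative assumptions:

    % Sketch: rescale every feature row of X to [-1, 1] and reuse the same mapping later.
    [Xn, ps]  = mapminmax(X);
    Xnew_norm = mapminmax('apply', Xnew, ps);   % Xnew is a placeholder for later feature matrices

    % Sketch: derive an extra cluster-membership feature with K-Means (rows = samples, so transpose).
    clusterID = kmeans(Xn', 5);                 % 5 clusters is an arbitrary example value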