Transcript slides

Identifying and Incorporating
Latencies in Distributed Data Mining
Algorithms
Michael Sevilla
Identifying and Incorporating
Latencies in Distributed Data Mining
Algorithms
Michael Sevilla
Applicability of Mahout for Large
Data Sets
Michael Sevilla
What is Mahout?
• Distributed machine learning libraries
– “scalable to reasonably large data sets”
– Runs on Hadoop
http://heureka.blogetery.com/
The Data: Million Song Data Set
• Large Data Set
– 1,019,318 users
– 384,546 MSD songs
– 48,373,586 (user, song, count)
• Kaggle Competition: offline evaluation
– Predict songs a user will listen to using
• Training: 1M user listening history
• Validation: 110K users
• “Martin L” blogged his methodology + results
Motivations
•
•
•
•
Can Mahout easily be modified?
Can Mahout perform well for this workload?
Can Mahout produce accurate results?
Can Mahout work ‘out of box’?
• Hypothesis: 22 machines + Mahout > 1 guy
´
22
vs.
What kind of Recommender?
• Format: <userID, songID, count>
• Users interacting with items
• Users express preferences towards items
• We can us Collaborative Filtering
´
22
vs.
Collaborative Filtering
• Predicts preference of user towards an item
• Constructs a Top-N-Recommendation
1. Parse input training data
2. Create user-item-matrix
3. Predict missing entries
é x 00
ê
ê x 01
ê
ê
ê
êë x 0 m
x10
xij
xn 0 ù
ú
ú
ú
ú
ú
xnm úû
Mahout has item-based Collaborative Filtering jobs!
CAN MAHOUT EASILY BE MODIFIED?
Martin’s Code
• Methodology: similarity vector of history
– Sparse-matrix
• COLISTEN(i, j) – listeners who listened to i and j
– Sum similarities for each song user x listens to
• The code: all python
– Parse: 27 lines of code (l.o.c)
– Create Matrix: 46 l.o.c
– Predict: 45 l.o.c
åCOLISTEN[i,:]
é x 00
ê
ê x 01
COLISTEN(i, j) = ê
ê
ê
êë x 0 m
x10
xij
xn 0 ù
ú
ú
ú
ú
ú
xnm úû
Mahout’s Code
• Methodology:
– No Idea…
• The code: all java
– Poorly commented
– 14 *.java files
– Many Directories
• ~/mahout/core/src/main/java/org/apache/ma
hout/cf/taste/hadoop/item/RecommenderJob.
java
– RecommenderJob.java: 284 lines of code (l.o.c)
– SimilarityMatrixRowWrapperMapper.java: 47 l.o.c
– UserVectorSplitterMapper.java: 138 l.o.c
Mahout’s Code
CAN MAHOUT EASILY BE MODIFIED?
NO
CAN MAHOUT PERFORM WELL FOR
THIS WORKLOAD?
Martin’s Code
• Performance on 86MB:
– Parse data: 10 minutes
– Make Matrix: 22 minutes
– Predict songs for 11000 users: 1 hour, 18 minutes
• Did not test scalability
$/ python convertToNumbers.py
$/ python colisten.py
$/ python predict_colisten.py
Mahout’s Code
• Performance on 86MB:
– Parse Time: 10 minutes
– Total Time: 25 minutes
• Tested scalability
– 64MB, 128MB, 256MB, 1GB, 2GB, 3GB
Mahout’s Code
• Total Time
• ~ 12m, 43m, 1hr, 2hr, 4hr, >5hr ….
10 Nodes Failed
Mahout’s Code
• Prepare Jobs (parse): seconds - minutes
Mahout’s Code
• Recommend Jobs (predict): seconds - minutes
Mahout’s Code
• Create Matrix Jobs: minutes - hours
CAN MAHOUT PERFORM WELL FOR
THIS WORKLOAD?
NO
CAN MAHOUT PRODUCE ACCURATE
RESULTS?
Training Set
• Kaggle Million Song Subset: 110K users
– User 2: 16 entries – took out 8
– User 16: 32 entries – took out 8
– User 17: 25 entries – took out 8
å
Q
Martin’s Code
User 2:
User 16:
User 17:
MAP =
q=1
AveP(q)
Q
where Q is the number of queries
1 2 3
4
5
( + +
+
+
) / 9 = 0.074
3 7 127 283 401
1
2
3
( +
+
) / 9 = 0.005
67 120 314
1
( ) / 8 = 0.004
32
å
Q
Mahout’s Code
User 2:
User 16:
User 17:
MAP =
q=1
AveP(q)
Q
where Q is the number of queries
1 2
( + ) / 9 = 0.013
20 32
1 2
3
4
( + +
+
) / 9 = 0.010
28 94 162 268
1
( ) / 8 = 0.004
30
CAN MAHOUT PRODUCE ACCURATE
RESULTS?
YES
CAN MAHOUT WORK ‘OUT OF BOX’?
YES… but not well
Conclusion
• Mahout did not scale well
• Mahout was not easy to learn
• Mahout was not easily modifiable
• For performance and efficiency, it is better to
– Understand the data set
– Understand data mining
– Understand the methodology