MapReduce for Machine Learning on
Multicore
Cheng-Tao Chu, Sang Kyun Kim, et al.
Stanford University, Stanford CA
Presented by Inna Rytsareva
1
Outline
 What is MapReduce?
 Problem Description and Formalization
 Statistical Query Model and Summation Form
 Architecture (inspired by MapReduce)
 Adopted ML Algorithms
 Experiments
 Future of MapReduce for Machine Learning
 Discussion
2
Motivation
 Problem: lots of data
 Example: 20+ billion web pages x 20KB = 400+
terabytes
 One computer can read 30-35 MB/sec from disk
 ~four months to read the web
 ~1,000 hard drives just to store the web
 Even more to do something with the data
3
Motivation
 Solution: spread the work over many machines
 Same problem with 1,000 machines: < 3 hours
 But distributing the work brings new problems:
 programming work
 communication and coordination
 recovering from machine failure
 status reporting
 debugging
 optimization
 locality
 … and you repeat all of this for every problem you want to solve
4
Cluster Computing
 Many racks of computers,
thousands of machines
per cluster
 Limited bisection
bandwidth between racks
http://upload.wikimedia.org/wikipedia/commons/d/d3/IBM_Blue_Gene_P_supercomputer.jpg
5
MapReduce
 A simple programming model that applies to many
large-scale computing problems
 Hide the messy details in the MapReduce runtime library:
 automatic parallelization
 load balancing
 network and disk transfer optimization
 handling of machine failures
 robustness
6
Programming model
 Input & Output: each a set of key/value pairs
 Programmer specifies two functions: map() and reduce()
 map (in_key, in_value) -> list(out_key, intermediate_value)
 Processes input key/value pair
 Produces set of intermediate pairs
 reduce (out_key, list(intermediate_value)) -> list(out_value)
 Combines all intermediate values for a particular key
 Produces a set of merged output values (usually just one)
7
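To make the model concrete, here is a minimal single-process Python sketch of the word-count job walked through on the next slides. The run_mapreduce driver and the in-memory grouping are illustrative stand-ins of my own; a real MapReduce runtime distributes these same two functions across machines.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """map(in_key, in_value) -> list(out_key, intermediate_value)"""
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    """reduce(out_key, list(intermediate_value)) -> list(out_value)"""
    return [sum(counts)]

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: emit intermediate (key, value) pairs.
    intermediate = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)
    # The "shuffle" is the grouping above; reduce phase combines per key.
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}

pages = [(1, "the weather is good"),
         (2, "today is good"),
         (3, "good weather is good")]
print(run_mapreduce(pages, map_fn, reduce_fn))
# {'the': [1], 'weather': [2], 'is': [3], 'good': [4], 'today': [1]}
```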
Example
 Page 1: the weather is good
 Page 2: today is good
 Page 3: good weather is good.
8
Map Output
 Worker 1:
 (the 1), (weather 1), (is 1), (good 1).
 Worker 2:
 (today 1), (is 1), (good 1).
 Worker 3:
 (good 1), (weather 1), (is 1), (good 1).
9
Reduce Input
 Worker 1:
 (the 1)
 Worker 2:
 (is 1), (is 1), (is 1)
 Worker 3:
 (weather 1), (weather 1)
 Worker 4:
 (today 1)
 Worker 5:
 (good 1), (good 1), (good 1), (good 1)
10
Reduce Output
 Worker 1:
 (the 1)
 Worker 2:
 (is 3)
 Worker 3:
 (weather 2)
 Worker 4:
 (today 1)
 Worker 5:
 (good 4)
11
http://4.bp.blogspot.com/_j6mB7TMmJJY/STAYW9gC-NI/AAAAAAAAAGY/lLKo7sBp5i8/s1600-h/P1.png
12
Fault tolerance
 On worker failure:
 Detect failure via periodic heartbeats (ping)
 Re-execute completed and in-progress map tasks
 Re-execute in-progress reduce tasks
 Task completion committed through master
 Master failure:
 Could handle, but don't yet (master failure unlikely)
13
MapReduce Transparencies
 Parallelization
 Fault-tolerance
 Locality optimization
 Load balancing
14
Suitable for your task if
 You have a cluster
 You are working with a large dataset
 The data records are independent (or can be assumed to be)
 The computation can be cast as map and reduce steps
15
References
 Original paper: J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Communications of the ACM, vol. 51, no. 1, 2008, pp. 107-113 (http://labs.google.com/papers/mapreduce.html)
 Wikipedia (http://en.wikipedia.org/wiki/MapReduce)
 Hadoop – MapReduce in Java (http://lucene.apache.org/hadoop/)
 Starfish – MapReduce in Ruby (http://rufy.com/starfish/)
16
Outline
 What is MapReduce?
 Problem Description and Formalization
 Statistical Query Model and Summation Form
 Architecture (inspired by MapReduce)
 Adopted ML Algorithms
 Experiments
 Future of MapReduce for Machine Learning
 Discussion
17
Motivations
 Industry-wide shift to multicore
 No good framework for parallelizing ML algorithms
 Goal: develop a general and exact technique for parallel
programming of a large class of ML algorithms for
multicore processors
http://upload.wikimedia.org/wikipedia/commons/a/af/E6750bs8.jpg
18
Idea
• Statistical Query Model
• Summation Form
• MapReduce
19
Outline
 What is MapReduce?
 Problem Description and Formalization
 Statistical Query Model and Summation Form
 Architecture (inspired by MapReduce)
 Adopted ML Algorithms
 Experiments
 Future of MapReduce for Machine Learning
 Discussion
20
Valiant Model [Valiant’84]
 x is the input
 y is a function of x that we want to learn
 In the Valiant model, the learning algorithm uses randomly drawn examples <x, y> to learn the target function
21
Statistical Query Model [Kearns’98]
 A restriction of the Valiant model
 A learning algorithm uses some aggregates over the
examples, not the individual examples
 Given a function f(x,y) over instances (data points x and
labels y), a statistical oracle will return an estimate of the
expectation of f(x,y)
 Any model that computes gradients or sufficient statistics
over f(x,y) fits this model
 Typically this is achieved by summing over the data.
22
Summation Form
 Aggregate over the data:
 Divide the data set into pieces
 Compute partial aggregates on each core
 Combine all results at the end
23
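A minimal sketch of this pattern (my own illustration, not code from the paper): any statistic of the form sum_i f(x_i, y_i) splits into per-core partial sums that are combined at the end. The statistic f here is an arbitrary example.

```python
import numpy as np
from multiprocessing import Pool

def f(x, y):
    # Example statistic: squared residual of a fixed guess; any f(x, y) works.
    return (y - 2.0 * x) ** 2

def partial_sum(chunk):
    # "Map": each core computes the statistic over its slice of the data.
    xs, ys = chunk
    return sum(f(x, y) for x, y in zip(xs, ys))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    xs = rng.normal(size=10_000)
    ys = 2.0 * xs + rng.normal(scale=0.1, size=10_000)
    chunks = [(xs[i::4], ys[i::4]) for i in range(4)]   # one slice per core
    with Pool(4) as pool:
        total = sum(pool.map(partial_sum, chunks))      # "Reduce": combine
    print(total)  # same value as the single-core sum, up to float rounding
```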
Example: Linear Regression
Model: $y = \theta^T x$
Goal: $\min_\theta \sum_{i=1}^{m} (\theta^T x_i - y_i)^2$
Solution: Given m examples (x1, y1), (x2, y2), …, (xm, ym), write a matrix X with x1, …, xm as rows and a vector y = (y1, y2, …, ym). Then the solution is $\theta^* = (X^T X)^{-1} X^T y$.
Parallel computation (sketched below):
• $A = X^T X = \sum_{i=1}^{m} x_i x_i^T$, computed as a sum of per-core partial sums
• $b = X^T y = \sum_{i=1}^{m} x_i y_i$, computed the same way; then solve $A\theta = b$
24
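An illustrative numpy sketch of that decomposition (not the paper's code): each "mapper" computes the partial sums of A and b on its chunk; the "reducer" adds them and solves the normal equations.

```python
import numpy as np

def map_stats(X_chunk, y_chunk):
    """Mapper: partial sums of A = X^T X and b = X^T y over one data chunk."""
    return X_chunk.T @ X_chunk, X_chunk.T @ y_chunk

def reduce_solve(partials):
    """Reducer: sum the partial statistics, then solve A @ theta = b."""
    A = sum(p[0] for p in partials)
    b = sum(p[1] for p in partials)
    return np.linalg.solve(A, b)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + rng.normal(scale=0.01, size=1000)

chunks = np.array_split(np.arange(1000), 4)          # 4 "cores"
partials = [map_stats(X[idx], y[idx]) for idx in chunks]
print(reduce_solve(partials))  # ≈ theta_true
```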
Outline
 What is MapReduce?
 Problem Description and Formalization
 Statistical Query Model and Summation Form
 Architecture (inspired by MapReduce)
 Adopted ML Algorithms
 Experiments
 Future of MapReduce for Machine Learning
 Discussion
25
Lighter Weight MapReduce for Multicore
26
Outline
 What is MapReduce?
 Problem Description and Formalization
 Statistical Query Model and Summation Form
 Architecture (inspired by MapReduce)
 Adopted ML Algorithms
 Experiments
 Future of MapReduce for Machine Learning
 Conclusion and Discussion
27
Locally Weighted Linear Regression
(LWLR)
Solve: $A\theta = b$, where $A = \sum_{i=1}^{m} w_i x_i x_i^T$ and $b = \sum_{i=1}^{m} w_i x_i y_i$
When all wi = 1, this reduces to ordinary least squares.
 Mappers: one set computes partial sums of A, the other set computes partial sums of b
 Two reducers, one aggregating A and one aggregating b
 Finally solve $A\theta = b$ for the parameters (see the sketch below)
28
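A hedged numpy sketch of that structure (function names are my own): one mapper function per statistic, two reductions, then the final solve.

```python
import numpy as np

def map_A(w, X):      # first mapper set: partial sums of A = sum_i w_i x_i x_i^T
    return (w[:, None] * X).T @ X

def map_b(w, X, y):   # second mapper set: partial sums of b = sum_i w_i y_i x_i
    return X.T @ (w * y)

def lwlr_solve(chunks):
    """Two reducers (the two sums), then solve A @ theta = b."""
    A = sum(map_A(w, X) for w, X, y in chunks)
    b = sum(map_b(w, X, y) for w, X, y in chunks)
    return np.linalg.solve(A, b)

rng = np.random.default_rng(1)
X = rng.normal(size=(600, 2))
y = X @ np.array([3.0, -1.0]) + rng.normal(scale=0.01, size=600)
w = np.ones(600)  # w_i == 1 recovers ordinary least squares
chunks = [(w[i::3], X[i::3], y[i::3]) for i in range(3)]   # 3 "cores"
print(lwlr_solve(chunks))  # ≈ [3, -1]
```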
Naïve Bayes (NB)
 Goal: estimate P(xj=k|y=1), P(xj=k|y=0), and P(y)
 Computation: count the occurrences of (xj=k, y=1) and (xj=k, y=0), and count the occurrences of y=1 and y=0
 Mappers: each counts over a subgroup of training samples
 Reducer: aggregates the intermediate counts and calculates the final estimates (see the sketch below)
29
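An illustrative Python sketch (the encoding of counts as dictionary keys is my own choice, not the paper's):

```python
from collections import Counter

def map_counts(samples):
    """Mapper: count (feature index, value, label) triples and labels on one chunk."""
    c = Counter()
    for x, y in samples:                 # x: tuple of discrete feature values
        c[("y", y)] += 1
        for j, k in enumerate(x):
            c[("x", j, k, y)] += 1
    return c

def reduce_estimates(partial_counts):
    """Reducer: aggregate the counts, then form the probability estimates."""
    total = sum(partial_counts, Counter())
    m = total[("y", 0)] + total[("y", 1)]
    p_y1 = total[("y", 1)] / m

    def p_xjk_y1(j, k):                  # e.g. P(x_j = k | y = 1)
        return total[("x", j, k, 1)] / total[("y", 1)]

    return p_y1, p_xjk_y1

data = [((1, 0), 1), ((1, 1), 1), ((0, 0), 0), ((0, 1), 0), ((1, 0), 0)]
chunks = [data[:3], data[3:]]            # two "cores"
p_y1, p = reduce_estimates([map_counts(c) for c in chunks])
print(p_y1, p(0, 1))   # P(y=1) = 0.4, P(x_0=1|y=1) = 1.0
```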
Gaussian Discriminative Analysis
(GDA)
 Goal: classify x into the classes of y
 assuming each class-conditional density p(x|y) is Gaussian, with different means but the same covariance
 Computation: class priors, per-class means, and the shared covariance, all of which are sums over the data
 Mappers: compute the partial sums for a subgroup of training samples
 Reducer: aggregates the intermediate sums into the final estimates (see the sketch below)
30
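A sketch of those partial sums for two classes (my own illustration; the shared-covariance step uses the identity sum_i (x_i - mu_c)(x_i - mu_c)^T = sum_i x_i x_i^T - n_c * mu_c mu_c^T):

```python
import numpy as np

def map_stats(X, y):
    """Mapper: per-class counts, feature sums, and outer-product sums over one chunk."""
    out = {}
    for c in (0, 1):
        Xc = X[y == c]
        out[c] = (len(Xc), Xc.sum(axis=0), Xc.T @ Xc)
    return out

def reduce_params(partials):
    """Reducer: combine the partial sums, then form priors, class means,
    and the shared (pooled) covariance matrix."""
    n = {c: sum(p[c][0] for p in partials) for c in (0, 1)}
    mu = {c: sum(p[c][1] for p in partials) / n[c] for c in (0, 1)}
    m = n[0] + n[1]
    Sigma = sum(sum(p[c][2] for p in partials) - n[c] * np.outer(mu[c], mu[c])
                for c in (0, 1)) / m
    return {c: n[c] / m for c in (0, 1)}, mu, Sigma

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (500, 2)), rng.normal(3, 1, (500, 2))])
y = np.repeat([0, 1], 500)
chunks = np.array_split(rng.permutation(1000), 4)        # 4 "cores"
priors, means, Sigma = reduce_params([map_stats(X[i], y[i]) for i in chunks])
print(priors, means[1], Sigma)
```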
K-means
 Compute the Euclidean distance between each sample vector and the centroids, assigning each sample to its nearest centroid
 Recalculate the centroids as the means of the assigned samples
 Divide the computation into subgroups handled by map-reduce (one iteration is sketched below)
31
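One map-reduce iteration of k-means might look like this sketch (illustrative only; emitting per-centroid partial sums and counts rather than raw points keeps the reduce input small):

```python
import numpy as np

def map_assign(X_chunk, centroids):
    """Mapper: assign each point to its nearest centroid; emit per-centroid
    partial sums and counts (enough to recompute the means)."""
    d = np.linalg.norm(X_chunk[:, None, :] - centroids[None, :, :], axis=2)
    nearest = d.argmin(axis=1)
    k, dim = centroids.shape
    sums, counts = np.zeros((k, dim)), np.zeros(k)
    for j in range(k):
        sums[j] = X_chunk[nearest == j].sum(axis=0)
        counts[j] = (nearest == j).sum()
    return sums, counts

def reduce_update(partials, old_centroids):
    """Reducer: combine partial sums and recompute the centroids."""
    sums = sum(p[0] for p in partials)
    counts = sum(p[1] for p in partials)
    new = old_centroids.copy()
    nonempty = counts > 0
    new[nonempty] = sums[nonempty] / counts[nonempty, None]
    return new

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.5, (300, 2)), rng.normal(4, 0.5, (300, 2))])
centroids = X[rng.choice(len(X), 2, replace=False)]
for _ in range(10):                          # each iteration is one map-reduce pass
    chunks = np.array_split(X, 4)
    centroids = reduce_update([map_assign(c, centroids) for c in chunks], centroids)
print(centroids)  # ≈ [[0, 0], [4, 4]] in some order
```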
Neural Network (NN)
 Back-propagation, 3-layer network
 Input layer, hidden layer, 2 output nodes
 Goal: compute the weights of the NN by back-propagation
 Mapper: propagates its set of training data through the network and back-propagates the errors to calculate a partial gradient for the weights
 Reducer: sums the partial gradients and takes a batch gradient-descent step to update the weights (see the sketch below)
32
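A compact sketch of that split for a tiny network (layer sizes, activations, and learning rate are my own arbitrary choices, not the paper's):

```python
import numpy as np

def init_params(d_in, d_hid, d_out, rng):
    return {"W1": rng.normal(0, 0.1, (d_in, d_hid)), "b1": np.zeros(d_hid),
            "W2": rng.normal(0, 0.1, (d_hid, d_out)), "b2": np.zeros(d_out)}

def forward(p, X):
    A1 = np.tanh(X @ p["W1"] + p["b1"])                          # hidden layer
    return A1, 1.0 / (1.0 + np.exp(-(A1 @ p["W2"] + p["b2"])))   # sigmoid outputs

def map_grad(p, X, Y):
    """Mapper: forward pass + back-propagation on one chunk of training data,
    returning partial gradients of the squared-error loss."""
    A1, Yhat = forward(p, X)
    d2 = (Yhat - Y) * Yhat * (1.0 - Yhat)     # dLoss/dZ2
    d1 = (d2 @ p["W2"].T) * (1.0 - A1 ** 2)   # dLoss/dZ1, back-propagated
    return {"W1": X.T @ d1, "b1": d1.sum(0), "W2": A1.T @ d2, "b2": d2.sum(0)}

def reduce_step(p, partial_grads, lr, m):
    """Reducer: sum the partial gradients, take one batch gradient-descent step."""
    for key in p:
        p[key] -= lr * sum(g[key] for g in partial_grads) / m
    return p

rng = np.random.default_rng(4)
X = rng.normal(size=(400, 3))
Y = np.stack([X[:, 0] > 0, X[:, 1] > 0], axis=1).astype(float)   # 2 output nodes
p = init_params(3, 8, 2, rng)
for _ in range(500):                           # each epoch is one map-reduce pass
    grads = [map_grad(p, Xc, Yc)
             for Xc, Yc in zip(np.array_split(X, 4), np.array_split(Y, 4))]
    p = reduce_step(p, grads, lr=2.0, m=len(X))
_, Yhat = forward(p, X)
print(((Yhat > 0.5) == Y).mean())              # training accuracy, should be high
```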
Principal Components Analysis
(PCA)
 Compute the principal eigenvectors of the covariance matrix $\Sigma = \frac{1}{m}\left(\sum_{i=1}^{m} x_i x_i^T\right) - \mu\mu^T$
 Clearly, we can compute the summation form $\sum_{i=1}^{m} x_i x_i^T$ using map-reduce
 Express the mean vector as a sum: $\mu = \frac{1}{m}\sum_{i=1}^{m} x_i$ (sketched below)
33
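An illustrative numpy sketch (not the paper's code): mappers return the two partial sums, the reducer forms the covariance matrix and extracts the principal eigenvectors.

```python
import numpy as np

def map_sums(X_chunk):
    """Mapper: partial sums sum_i x_i x_i^T and sum_i x_i over one chunk."""
    return X_chunk.T @ X_chunk, X_chunk.sum(axis=0)

def reduce_pca(partials, m, n_components):
    """Reducer: combine the sums, form the covariance, take top eigenvectors."""
    S = sum(p[0] for p in partials)
    mu = sum(p[1] for p in partials) / m
    cov = S / m - np.outer(mu, mu)
    eigvals, eigvecs = np.linalg.eigh(cov)        # ascending eigenvalues
    return eigvecs[:, ::-1][:, :n_components]     # principal eigenvectors first

rng = np.random.default_rng(5)
X = rng.normal(size=(2000, 2)) @ np.array([[3.0, 0.0], [0.0, 0.3]]) + 1.0
partials = [map_sums(c) for c in np.array_split(X, 4)]   # 4 "cores"
print(reduce_pca(partials, m=len(X), n_components=1))    # ≈ ±[1, 0]
```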
Other Algorithms
 Logistic Regression
 Independent Component Analysis
 Support Vector Machine
 Expectation Maximization (EM)
34
Time Complexity
Basically: linear speedup with an increasing number of cores
35
Outline
 What is MapReduce?
 Problem Description and Formalization
 Statistical Query Model and Summation Form
 Architecture (inspired by MapReduce)
 Adopted ML Algorithms
 Experiments
 Future of MapReduce for Machine Learning
 Discussion
36
Setup
 Compare the map-reduce and sequential versions
 10 data sets
 Machines:
 Dual-processor Pentium-III 700MHz, 1GB RAM
 16-way Sun Enterprise 6000
37
Dual-Processor SpeedUps
38
SpeedUp for 2-16 processors
(Figure legend: bold – average; error bars – max/min; dashed – variance)
39
Multicore Simulator Results
 Multicore simulator over the sensor dataset
 Better results reported for NN & LR:
 NN: 16 cores: 15.5x, 32 cores: 29x, 64 cores: 54x
 LR: 16 cores: 15x, 32 cores: 29.5x, 64 cores: 53x
 Could be because of lower communication cost
40
Conclusion
 Parallelize summation forms
 NO change in the underlying algorithm
 NO approximation
 Use map-reduce on a single machine
41
Outline
 What is MapReduce?
 Problem Description and Formalization
 Statistical Query Model and Summation Form
 Architecture (inspired by MapReduce)
 Adopted ML Algorithms
 Experiments
 Future of MapReduce for Machine Learning
 Discussion
42
Apache Mahout
 An Apache Software Foundation project to create
scalable machine learning libraries under the Apache
Software License
 http://mahout.apache.org
 Why Mahout?
 Community
 Documentation and Examples
 Scalability
 The Apache License
 Not narrowly research-oriented
http://dictionary.reference.com/browse/mahout
43
Focus: Scalable
 Goal: Be as fast and efficient as possible, given the intrinsic design of the algorithm
 Some algorithms won’t scale to massive machine clusters
 Others fit logically on a MapReduce framework like Apache
Hadoop
 Still others will need other distributed programming models
 Most Mahout implementations are MapReduce-enabled
 Work in Progress
44
A Sampling of Who Uses Mahout
https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout
45
Focus: Machine Learning
(Diagram: the Mahout stack – Applications and Examples at the top; algorithm libraries for Classification, Clustering, Recommenders, Genetic algorithms, and Frequent Pattern Mining; Utilities such as Lucene/Vectorizer; Math support for Vectors/Matrices/SVD; Collections (primitives); all built on Apache Hadoop)
http://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
46
Resources
 “Mahout in Action” by Owen, Anil, Dunning, and Friedman (http://awe.sm/5FyNe)
 “Introducing Apache Mahout” (http://www.ibm.com/developerworks/java/library/j-mahout/)
 “Taming Text” by Ingersoll, Morton, and Farris
 “Programming Collective Intelligence” by Toby Segaran
 “Data Mining: Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank
 “Data-Intensive Text Processing with MapReduce” by Jimmy Lin and Chris Dyer
47
Outline
 What is MapReduce?
 Problem Description and Formalization
 Statistical Query Model and Summation Form
 Architecture (inspired by MapReduce)
 Adopted ML Algorithms
 Experiments
 Future of MapReduce for Machine Learning
 Discussion
48
Discussion
 What are other alternatives to MapReduce?
 What can we do if the “summation form” is not applicable?
 Does dataset quality affect the implementation and performance of parallel machine learning algorithms?
 Multicore processors: what does the future hold?
49