
Map-Reduce and Parallel Computing for Large-Scale Media Processing
Youjie Zhou
Outline
► Motivations
► Map-Reduce Framework
► Large-scale Multimedia Processing Parallelization
► Machine Learning Algorithm Transformation
► Map-Reduce Drawbacks and Variants
► Conclusions
Motivations
► Why do we need parallelization?
 “Time is Money”
►Work on pieces simultaneously
►Divide-and-conquer
 Data is too huge to handle
►1 trillion (10^12) unique URLs in 2008
►CPU speed limitation
Motivations
► Why do we need parallelization?
 Increasing data
►Social networks
►Scalability!
 “Brute force”
►No approximations
►Cheap clusters vs. expensive computers
Motivations
► Why do we choose Map-Reduce?
 Popular
►A parallelization framework proposed by Google, and Google uses it every day
►Yahoo and Amazon are also involved
 Popular → Good?
►“Hides” parallelization details from users
►Provides high-level operations that suit the majority of algorithms
 A good starting point for deeper parallelization research
Map-Reduce Framework
► Simple idea inspired by functional languages (like LISP)
 map
►a type of iteration in which a function is successively applied to each element of one sequence
 reduce
►a function that combines all the elements of a sequence using a binary operation (see the sketch below)
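As a rough illustration of these functional roots, here is a minimal Python sketch (the choice of Python and the toy data are assumptions of this transcript, not part of the slides) using the built-in map and functools.reduce:

from functools import reduce

numbers = [1, 2, 3, 4]

# map: successively apply a function to each element of a sequence
squares = list(map(lambda x: x * x, numbers))   # [1, 4, 9, 16]

# reduce: combine all elements of a sequence with a binary operation
total = reduce(lambda a, b: a + b, squares)     # 30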
Map-Reduce Framework
► Data representation
 <key,value>
 map generates <key,value> pairs
 reduce combines <key,value> pairs with the same key
► “Hello, world!” Example
Map-Reduce Framework
[Diagram: the input data is divided into split0, split1, split2; each split is processed by a map task; the map outputs are shuffled to several reduce tasks, which produce the final output.]
Map-Reduce Framework
► Count the appearances of each different word in a set of documents

void map(Document doc)
    for each word in doc
        generate <word,1>

void reduce(word, CountList)
    int count = 0
    for each number in CountList
        count += number
    generate <word,count>
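A minimal runnable Python version of the same word count (the in-memory shuffle and the helper names are assumptions for illustration; the slides only give pseudocode):

from collections import defaultdict

def map_phase(document):
    # map: generate a <word, 1> pair for every word in the document
    return [(word, 1) for word in document.split()]

def reduce_phase(word, counts):
    # reduce: sum all counts generated for the same word
    return (word, sum(counts))

documents = ["hello world", "hello map reduce"]

grouped = defaultdict(list)            # shuffle: group pairs by key
for doc in documents:
    for word, count in map_phase(doc):
        grouped[word].append(count)

result = dict(reduce_phase(w, c) for w, c in grouped.items())
print(result)   # {'hello': 2, 'world': 1, 'map': 1, 'reduce': 1}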
Map-Reduce Framework
► Different Implementations
 Distributed computing
►each computer acts as a computing node
►focusing on reliability over distributed computer networks
►Google’s clusters
 closed source
 GFS: distributed file system
►Hadoop
 open source
 HDFS: Hadoop distributed file system
Map-Reduce Framework
► Different Implementations
 Multi-core computing
►each core acts as a computing node
►focusing on high-speed computing using large shared memories
►Phoenix++
 a two-dimensional <key,value> table stored in memory, where map and reduce read and write pairs
 open source, created at Stanford
►GPU
 10x higher memory bandwidth than a CPU
 5x to 32x speedups on SVM training
Large-scale Multimedia Processing Parallelization
► Clustering
 k-means
 Spectral Clustering
► Classifier training
 SVM
► Feature extraction and indexing
 Bag-of-Features
 Text Inverted Indexing
Clustering
► k-means
 Basic and fundamental
 Original algorithm
1. Pick k initial center points
2. Iterate until convergence
   1. Assign each point to the nearest center
   2. Calculate new centers
 Easy to parallelize!
Clustering
► k-means
 a shared file contains the center points
 map
1. for each point, find the nearest center
2. generate a <key,value> pair
    key: center id
    value: current point’s coordinates
 reduce
1. collect all points belonging to the same cluster (they have the same key)
2. calculate the average → new center
 iterate (see the sketch below)
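A hedged Python sketch of one k-means iteration in this map/reduce style (the helper names, the in-memory grouping, and the handling of empty clusters are illustrative assumptions, not the slides' implementation):

import numpy as np
from collections import defaultdict

def kmeans_map(point, centers):
    # map: emit <nearest center id, point coordinates>
    nearest = int(np.argmin([np.linalg.norm(point - c) for c in centers]))
    return (nearest, point)

def kmeans_reduce(center_id, points):
    # reduce: average all points assigned to the same center -> new center
    return (center_id, np.mean(points, axis=0))

def kmeans_iteration(data, centers):
    grouped = defaultdict(list)                      # shuffle by center id
    for p in data:
        cid, point = kmeans_map(p, centers)
        grouped[cid].append(point)
    new_centers = dict(kmeans_reduce(cid, pts) for cid, pts in grouped.items())
    # keep the old center if a cluster received no points
    return [new_centers.get(i, centers[i]) for i in range(len(centers))]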
Clustering
► Spectral Clustering
 $L = I - D^{-1/2} S D^{-1/2}$
 S is huge: 10^6 points (double precision) need 8 TB
 Sparsify it!
►Retain only S_ij where j is among the t nearest neighbors of i
►Locality Sensitive Hashing?
 It’s an approximation
►We can calculate it directly → in parallel
Clustering
► Spectral Clustering
 Calculate the distance matrix
►map
 creates <key,value> pairs so that every n/p points share the same key
 p is the number of nodes in the computer cluster
►reduce
 collects points with the same key, so the data is split into p parts and each part is stored on one node
►for each point in the whole data set, on each node, find its t nearest neighbors (see the sketch below)
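A rough Python sketch of this partitioning step (the chunked key assignment and the brute-force local neighbor search are assumptions used only to illustrate the data flow):

import numpy as np

def partition_map(index, point, n, p):
    # map: every n/p consecutive points share the same key -> same partition
    return (index // (n // p), (index, point))

def partition_reduce(part_id, items):
    # reduce: one partition of the data, stored on one node
    return part_id, items

def local_t_nearest(data, partition, t):
    # on each node: for every point in the whole data set, find the t nearest
    # neighbors among the points stored locally (local candidates merged later)
    result = {}
    for j, q in enumerate(data):
        dists = sorted((np.linalg.norm(q - x), i) for i, x in partition)
        result[j] = dists[:t]
    return result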
Clustering
► Spectral Clustering
 Symmetry
►x_j being in the t-nearest-neighbor set of x_i does not imply that x_i is in the t-nearest-neighbor set of x_j
►map
 for each nonzero element, generates two <key,value> pairs
 first: key is the row ID; value is the column ID and the distance
 second: key is the column ID; value is the row ID and the distance
►reduce
 uses the key as the row ID and fills the columns specified by the column IDs in the values (see the sketch below)
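A minimal Python sketch of this symmetrization (the dictionary-of-rows storage is an assumption; the slides do not specify a sparse format):

from collections import defaultdict

def symmetry_map(i, j, dist):
    # for each nonzero S_ij, emit the entry under both its row and its column
    return [(i, (j, dist)), (j, (i, dist))]

def symmetry_reduce(row_id, entries):
    # key is the row ID; fill in every column reported for this row
    return row_id, {col: dist for col, dist in entries}

nonzeros = [(0, 1, 0.5), (1, 2, 0.3)]          # entries from the t-NN step

grouped = defaultdict(list)
for i, j, d in nonzeros:
    for key, value in symmetry_map(i, j, d):
        grouped[key].append(value)

sparse_rows = dict(symmetry_reduce(r, e) for r, e in grouped.items())
# row 1 now holds both its own neighbor 2 and the symmetric entry from row 0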
Classification
► SVM
 constraints: $y_i (w^T x_i + b) \ge \rho / 2$
 primal: $\arg\min_w \Phi(w) = \|w\|^2 = w^T w$
 dual: $\arg\max_\alpha W(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$
Classification
► SVM
 SMO (Sequential Minimal Optimization)
 instead of solving for all alphas together
 coordinate ascent
►pick one alpha, fix the others
►optimize alpha_i
Classification
► SVM
 SMO
 But we cannot optimize only one alpha at a time for the SVM dual
 We need to optimize two alphas in each iteration
 $\arg\max_\alpha W(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$
 $0 \le \alpha_i \le C, \quad \sum_i \alpha_i y_i = 0$
 the equality constraint ties the alphas together: $\alpha_1 y_1 = -\sum_{i=2}^{n} \alpha_i y_i$
Classification
► SVM
 repeat until convergence:
►map
 given the two chosen alphas, update the optimization information (see the sketch below)
►reduce
 find the two maximally violating alphas
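A very rough Python sketch of one such iteration, assuming a linear kernel and a simplified maximal-violating-pair rule; the error computation, the pair-selection criterion, and all helper names are illustrative assumptions rather than the exact scheme of the slides or the cited papers:

import numpy as np

def smo_map(chunk_idx, X, y, alpha, b):
    # map: for a chunk of training points, compute the error E_i = f(x_i) - y_i
    # under the current alphas (linear kernel assumed for brevity)
    w = (alpha * y) @ X                      # w = sum_i alpha_i y_i x_i
    return [(i, float(X[i] @ w + b - y[i])) for i in chunk_idx]

def smo_reduce(errors, y, alpha, C):
    # reduce: pick a simplified "maximally violating pair": the smallest error among
    # points that can move up and the largest among points that can move down
    up   = [(e, i) for i, e in errors if (y[i] > 0 and alpha[i] < C) or (y[i] < 0 and alpha[i] > 0)]
    down = [(e, i) for i, e in errors if (y[i] > 0 and alpha[i] > 0) or (y[i] < 0 and alpha[i] < C)]
    return min(up)[1], max(down)[1]          # the two alphas to optimize next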
Feature Extraction and Indexing
► Bag-of-Features
 features → feature clusters → histogram
 feature extraction
►map
 takes images in and outputs features directly
 feature clustering
►clustering algorithms, like k-means
Feature Extraction and Indexing
► Bag-of-Features
 feature quantization → histogram
►map
 for each feature of one image, find the nearest feature cluster
 generates <imageID,clusterID>
►reduce
 <imageID,cluster0,cluster1,…>
 for each feature cluster, update the histogram
 generates <imageID,histogram> (see the sketch below)
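A minimal Python sketch of this quantization step (the visual vocabulary, the feature arrays, and the fixed-length histogram are illustrative assumptions):

import numpy as np
from collections import Counter

def quantize_map(image_id, features, vocabulary):
    # map: assign each local feature of one image to its nearest visual word
    return [(image_id, int(np.argmin([np.linalg.norm(f - c) for c in vocabulary])))
            for f in features]

def histogram_reduce(image_id, cluster_ids, vocab_size):
    # reduce: count how often each visual word occurs -> bag-of-features histogram
    counts = Counter(cluster_ids)
    return image_id, [counts.get(k, 0) for k in range(vocab_size)]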
Feature Extraction and Indexing
► Text Inverted Indexing
 Inverted index of a term
►a list of the documents containing the term
►each item in the document list stores statistical information
 frequency, position, field information
 map
►for each term in one document, generates <term,docID>
 reduce
►<term,doc0,doc1,doc2,…>
►for each document, update the statistical information for that term
►generates <term,list> (see the sketch below)
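A small Python sketch of this indexing flow; the posting format (term frequency per document only) is a deliberately simple assumption:

from collections import defaultdict, Counter

def index_map(doc_id, text):
    # map: generate <term, docID> once per occurrence of the term
    return [(term, doc_id) for term in text.lower().split()]

def index_reduce(term, doc_ids):
    # reduce: build the posting list with per-document statistics (here: frequency)
    return term, sorted(Counter(doc_ids).items())

documents = {0: "map reduce map", 1: "reduce cluster"}

grouped = defaultdict(list)
for doc_id, text in documents.items():
    for term, d in index_map(doc_id, text):
        grouped[term].append(d)

index = dict(index_reduce(t, ids) for t, ids in grouped.items())
# index["map"] == [(0, 2)]  -> "map" appears twice in document 0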
Machine Learning Algorithm Transformation
► How can we know whether an algorithm can be transformed into a Map-Reduce fashion?
 if so, how do we do that?
► Statistical Query and Summation Form
 All we want is to estimate or infer
►cluster ids, labels, …
 from sufficient statistics
►distances between points
►point positions
 the statistic computation can be divided
Machine Learning Algorithm Transformation
► Linear Regression
 $\arg\min_b \mathrm{Error} = (Y - Xb)^T (Y - Xb)$
 $\theta = (X^T X)^{-1} X^T Y$
 Summation form: map computes the partial sums $\sum_i x_i x_i^T$ and $\sum_i x_i y_i$; reduce adds them up and solves for $\theta$ (see the sketch below)
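A hedged Python sketch of the summation form (the chunking scheme and the final solve with numpy are assumptions for illustration):

import numpy as np

def regression_map(X_chunk, y_chunk):
    # map: each node computes partial sums over its own chunk of the data
    return X_chunk.T @ X_chunk, X_chunk.T @ y_chunk   # sum x_i x_i^T, sum x_i y_i

def regression_reduce(partials):
    # reduce: add the partial sums and solve the normal equations
    A = sum(p[0] for p in partials)
    b = sum(p[1] for p in partials)
    return np.linalg.solve(A, b)                      # theta = (X^T X)^{-1} X^T Y

X, y = np.random.randn(100, 3), np.random.randn(100)
partials = [regression_map(Xc, yc)
            for Xc, yc in zip(np.array_split(X, 4), np.array_split(y, 4))]
theta = regression_reduce(partials)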
Machine Learning Algorithm Transformation
► Naïve Bayes
 $v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1 \ldots a_n)$
 $v_{MAP} = \arg\max_{v_j \in V} \frac{P(a_1 \ldots a_n \mid v_j) P(v_j)}{P(a_1 \ldots a_n)}$
 $v_{MAP} = \arg\max_{v_j \in V} P(a_1 \ldots a_n \mid v_j) P(v_j)$
 $v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j)$
 map counts the occurrences needed for $P(v_j)$ and $P(a_i \mid v_j)$; reduce sums the counts (see the sketch below)
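A compact Python sketch of the counting step for a discrete naïve Bayes model (the count keys and the toy data are assumptions for illustration):

from collections import Counter

def nb_map(examples):
    # map: on one chunk, count class labels and (attribute position, value, class) triples
    counts = Counter()
    for attributes, label in examples:
        counts[("class", label)] += 1
        for position, value in enumerate(attributes):
            counts[("attr", position, value, label)] += 1
    return counts

def nb_reduce(partial_counts):
    # reduce: sum the partial counts; P(v_j) and P(a_i | v_j) are ratios of these sums
    total = Counter()
    for c in partial_counts:
        total.update(c)
    return total

chunks = [[((1, 0), "spam"), ((0, 0), "ham")], [((1, 1), "spam")]]
totals = nb_reduce(nb_map(chunk) for chunk in chunks)
# totals[("class", "spam")] == 2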
Machine Learning Algorithm Transformation
► Solution
 Find the statistics-calculation part
 Distribute the calculations over the data using map
 Gather and refine all the statistics in reduce
Map-Reduce Systems Drawbacks
► Batch-based system
 “pull” model
►reduce must wait for unfinished map tasks
►reduce “pulls” data from map
 no direct support for iteration
► Focuses too much on distributed systems and failure tolerance
 a local computing cluster may not need them
Map-Reduce Variants
► Map-Reduce Online
 “push” model
►map “pushes” data to reduce
 reduce can also “push” results to the maps of the next job
 builds a pipeline
► Iterative Map-Reduce
 higher-level schedulers
 schedule the whole iteration process
Map-Reduce Variants
► Series of Map-Reduce jobs?
[Diagram: several Multi-Core Map-Reduce instances combined by an outer layer; candidates for the outer layer are Map-Reduce itself, MPI, or Condor.]
Conclusions
► Good parallelization framework
 Schedules jobs automatically
 Failure tolerance
 Distributed computing supported
 High-level abstraction
►easy to port algorithms onto it
► Too “industry”
 why do we need a large distributed system?
 why do we need so much data safety?
Thanks
Q&A