Map-Reduce and Parallel Computing for Large-Scale Media Processing
Youjie Zhou
Outline
► Motivations
► Map-Reduce Framework
► Large-scale Multimedia Processing Parallelization
► Machine Learning Algorithm Transformation
► Map-Reduce Drawbacks and Variants
► Conclusions
Motivations
► Why do we need parallelization?
  "Time is Money"
    ► Work on many pieces simultaneously
    ► Divide-and-conquer
  Data is too huge to handle
    ► 1 trillion (10^12) unique URLs in 2008
    ► CPU speed limitation
Motivations
► Why do we need parallelization?
  Increasing data
    ► Social networks
    ► Scalability!
  "Brute force"
    ► No approximations
    ► Cheap clusters vs. expensive computers
Motivations
► Why do we choose Map-Reduce?
  Popular
    ► a parallelization framework that Google proposed and uses every day
    ► Yahoo and Amazon are also involved
  Popular = good?
    ► "hides" parallelization details from users
    ► provides high-level operations that suit the majority of algorithms
    ► a good starting point for deeper parallelization research
Map-Reduce Framework
► A simple idea inspired by functional languages (like LISP); a small sketch follows below
  map
    ► a type of iteration in which a function is successively applied to each element of one sequence
  reduce
    ► a function that combines all the elements of a sequence using a binary operation
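To make the analogy concrete, here is a tiny Python illustration of the functional map/reduce idea (not from the original slides):

from functools import reduce

words = ["hello", "world", "hello"]
lengths = list(map(len, words))                  # map: apply len to each element
total = reduce(lambda a, b: a + b, lengths, 0)   # reduce: combine with a binary operation
print(lengths, total)                            # [5, 5, 5] 15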
Map-Reduce Framework
► Data representation
  <key,value>
    map generates <key,value> pairs
    reduce combines <key,value> pairs that share the same key
► "Hello, world!" example
Map-Reduce Framework
[Dataflow diagram: the input data is partitioned into splits (split0, split1, split2); each split goes to a map task, and the map outputs are combined by reduce tasks to produce the final output]
Map-Reduce Framework
► Count the appearances of each different word in a set of documents (a runnable version follows the pseudocode)

void map(Document)
    for each word in Document
        generate <word, 1>

void reduce(word, CountList)
    int count = 0;
    for each number in CountList
        count += number
    generate <word, count>
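A minimal runnable sketch of the same word count in plain Python (illustrative only; the grouping/shuffle step is done by hand here, whereas a framework such as Hadoop performs it between map and reduce):

from collections import defaultdict

def map_phase(document):
    # map: emit a <word, 1> pair for every word in the document
    return [(word, 1) for word in document.split()]

def reduce_phase(word, counts):
    # reduce: sum all the counts emitted for the same word
    return word, sum(counts)

documents = ["hello world", "hello map reduce"]

# shuffle step: group the emitted pairs by key (a framework does this
# automatically between the map and reduce phases)
grouped = defaultdict(list)
for doc in documents:
    for word, one in map_phase(doc):
        grouped[word].append(one)

counts = dict(reduce_phase(word, ones) for word, ones in grouped.items())
print(counts)   # {'hello': 2, 'world': 1, 'map': 1, 'reduce': 1}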
Map-Reduce Framework
► Different implementations
  Distributed computing
    ► each computer acts as a computing node
    ► focuses on reliability over distributed computer networks
    ► Google's clusters
        closed source
        GFS: the Google File System (distributed file system)
    ► Hadoop
        open source
        HDFS: the Hadoop Distributed File System
Map-Reduce Framework
► Different implementations
  Multi-core computing
    ► each core acts as a computing node
    ► focuses on high-speed computing using large shared memories
    ► Phoenix++
        a two-dimensional <key,value> table stored in memory, where map and reduce read and write pairs
        open source, created at Stanford
    ► GPU
        10x higher memory bandwidth than a CPU
        5x to 32x speedups on SVM training
Large-scale Multimedia Processing Parallelization
► Clustering
  k-means
  Spectral Clustering
► Classifier training
  SVM
► Feature extraction and indexing
  Bag-of-Features
  Text Inverted Indexing
Clustering
► k-means
  Basic and fundamental
  Original algorithm:
    1. Pick k initial center points
    2. Iterate until convergence:
       1. Assign each point to the nearest center
       2. Calculate new centers
  Easy to parallelize!
Clustering
► k-means
  a shared file contains the center points
  map
    1. for each point, find the nearest center
    2. generate a <key,value> pair
       key: center id
       value: current point's coordinate
  reduce
    1. collect all points belonging to the same cluster (they have the same key)
    2. calculate the new center as their average
  iterate (a sketch of one iteration follows below)
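A minimal single-process Python sketch of one such k-means iteration (illustrative; function names like kmeans_map/kmeans_reduce are ours, and a real job would distribute the points across nodes):

import numpy as np

def kmeans_map(point, centers):
    # map: emit <nearest center id, point coordinate>
    distances = np.linalg.norm(centers - point, axis=1)
    return int(np.argmin(distances)), point

def kmeans_reduce(center_id, points):
    # reduce: the new center is the average of the points assigned to it
    return center_id, np.mean(points, axis=0)

points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.8]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])   # the "shared file" of center points

groups = {}
for p in points:
    cid, pt = kmeans_map(p, centers)
    groups.setdefault(cid, []).append(pt)

new_centers = {cid: kmeans_reduce(cid, pts)[1] for cid, pts in groups.items()}
print(new_centers)   # {0: array([0.05, 0.1]), 1: array([5.1, 4.9])}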
Clustering
► Spectral Clustering
  Normalized graph Laplacian: L = I - D^{-1/2} S D^{-1/2}
  S is huge: for 10^6 points, the similarity matrix (doubles) needs 8 TB
  Sparsify it!
    ► retain only S_ij where j is among the t nearest neighbors of i
    ► Locality Sensitive Hashing?
        it's an approximation
    ► we can calculate it directly, in parallel
Clustering
► Spectral Clustering
  Calculate the distance matrix
    ► map
        creates <key,value> pairs so that every n/p points share the same key
        p is the number of nodes in the computing cluster
    ► reduce
        collects points with the same key, so the data is split into p parts, each stored on one node
    ► on each node, for each point in the whole data set, find the t nearest neighbors
Clustering
► Spectral Clustering
  Symmetry
    ► x_j being in the t-nearest-neighbor set of x_i does not imply that x_i is in the t-nearest-neighbor set of x_j
    ► map
        for each nonzero element, generates two <key,value> pairs (see the sketch below)
        first: key is the row ID; value is the column ID and distance
        second: key is the column ID; value is the row ID and distance
    ► reduce
        uses the key as the row ID and fills the columns specified by the column IDs in the values
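A small single-process Python sketch of this symmetrization step (illustrative; the entries and variable names are hypothetical):

from collections import defaultdict

def sym_map(row, col, dist):
    # map: emit each nonzero entry twice, once under its row and once under its column
    return [(row, (col, dist)), (col, (row, dist))]

entries = [(0, 1, 0.5), (1, 2, 0.8)]   # hypothetical nonzero entries (row, col, distance)

rows = defaultdict(dict)
for r, c, d in entries:
    for key, (col, dist) in sym_map(r, c, d):
        rows[key][col] = dist           # reduce: fill row `key` at column `col`

print(dict(rows))   # {0: {1: 0.5}, 1: {0: 0.5, 2: 0.8}, 2: {1: 0.8}}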
Classification
► SVM
  Margin constraints: y_i (w^T x_i + b) ≥ 1, margin = 2 / ||w||
  Primal: arg min_w Φ(w) = (1/2) w^T w
  Dual: arg max_α W(α) = Σ_i α_i - (1/2) Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j)
Classification
► SVM
  SMO (Sequential Minimal Optimization)
    instead of solving for all alphas together:
    coordinate ascent
      ► pick one alpha, fix the others
      ► optimize alpha_i
Classification
► SVM
  SMO
    But we cannot optimize only one alpha for the SVM:
    the constraint Σ_i α_i y_i = 0 ties them together, since α_1 y_1 = -Σ_{i=2}^{n} α_i y_i
    We need to optimize two alphas in each iteration
      arg max_α W(α) = Σ_i α_i - (1/2) Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j)
      subject to 0 ≤ α_i ≤ C, Σ_i α_i y_i = 0
Classification
► SVM
  repeat until convergence:
    ► map
        given the two chosen alphas, updates the optimization information
    ► reduce
        finds the two maximally violating alphas
Feature Extraction and Indexing
► Bag-of-Features
  pipeline: features → feature clusters → histogram
  feature extraction
    ► map takes images in and outputs features directly
  feature clustering
    ► clustering algorithms, like k-means
Feature Extraction and Indexing
► Bag-of-Features
  feature quantization → histogram (a sketch follows below)
    ► map
        for each feature in one image, find the nearest feature cluster
        generates <imageID, clusterID>
    ► reduce
        receives <imageID, cluster0, cluster1, …>
        for each feature cluster, updates the histogram
        generates <imageID, histogram>
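A minimal Python sketch of the quantization and histogram steps (illustrative; the codebook, image IDs, and function names are hypothetical):

import numpy as np

def bof_map(image_id, features, cluster_centers):
    # map: emit <imageID, nearest cluster id> for each local feature
    for f in features:
        cid = int(np.argmin(np.linalg.norm(cluster_centers - f, axis=1)))
        yield image_id, cid

def bof_reduce(image_id, cluster_ids, num_clusters):
    # reduce: build the histogram of cluster assignments for one image
    return image_id, np.bincount(cluster_ids, minlength=num_clusters)

centers = np.array([[0.0, 0.0], [1.0, 1.0]])        # hypothetical codebook
features = {"img0": np.array([[0.1, 0.0], [0.9, 1.1], [1.0, 0.9]])}

grouped = {img: [cid for _, cid in bof_map(img, feats, centers)]
           for img, feats in features.items()}
histograms = dict(bof_reduce(img, cids, len(centers)) for img, cids in grouped.items())
print(histograms)   # {'img0': array([1, 2])}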
Feature Extraction and Indexing
► Text Inverted Indexing
  Inverted index of a term
    ► a list of the documents containing the term
    ► each item in the document list stores statistical information
        frequency, positions, field information
  map
    ► for each term in one document, generates <term, docID>
  reduce
    ► receives <term, doc0, doc1, doc2, …>
    ► for each document, updates the statistical information for that term
    ► generates <term, list> (a sketch follows below)
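A minimal Python sketch of this indexing flow (illustrative; here the per-document "statistics" are just term frequencies):

from collections import defaultdict

def index_map(doc_id, text):
    # map: emit <term, docID> for every term occurrence in the document
    return [(term, doc_id) for term in text.split()]

def index_reduce(term, doc_ids):
    # reduce: build the posting list, storing the term frequency per document
    postings = defaultdict(int)
    for d in doc_ids:
        postings[d] += 1
    return term, dict(postings)

docs = {"d0": "map reduce map", "d1": "reduce cluster"}

grouped = defaultdict(list)
for doc_id, text in docs.items():
    for term, d in index_map(doc_id, text):
        grouped[term].append(d)

index = dict(index_reduce(t, ds) for t, ds in grouped.items())
print(index)   # {'map': {'d0': 2}, 'reduce': {'d0': 1, 'd1': 1}, 'cluster': {'d1': 1}}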
Machine Learning Algorithm Transformation
► How can we know whether an algorithm can be transformed into a Map-Reduce fashion?
  And if so, how do we do it?
► Statistical Query and Summation Form
  All we want is to estimate or infer
    ► cluster ids, labels, …
  from sufficient statistics
    ► distances between points
    ► point positions
  the statistics computation can be divided
Machine Learning Algorithm Transformation
► Linear Regression
  arg min_b Error(b) = (Y - Xb)^T (Y - Xb)
  b* = (X^T X)^{-1} X^T Y
  Summation form:
    X^T X = Σ_i x_i x_i^T,  X^T Y = Σ_i x_i y_i
    map computes the per-point terms x_i x_i^T and x_i y_i; reduce sums them (see the sketch below)
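A small Python sketch of the summation form (illustrative; single-process, with map producing the per-point statistics and reduce summing and solving):

import numpy as np

def lr_map(x_i, y_i):
    # map: compute the per-point statistics x_i x_i^T and x_i * y_i
    return np.outer(x_i, x_i), x_i * y_i

def lr_reduce(partials):
    # reduce: sum the per-point statistics and solve (X^T X) b = X^T Y
    XtX = sum(p[0] for p in partials)
    XtY = sum(p[1] for p in partials)
    return np.linalg.solve(XtX, XtY)

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])   # first column is the intercept term
Y = np.array([1.0, 2.0, 3.0])

b = lr_reduce([lr_map(x, y) for x, y in zip(X, Y)])
print(b)   # approximately [1. 1.]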
Machine Learning Algorithm Transformation
► Naïve Bayes
  v_MAP = arg max_{v_j ∈ V} P(v_j | a_1 … a_n)
  v_MAP = arg max_{v_j ∈ V} [ P(a_1 … a_n | v_j) P(v_j) / P(a_1 … a_n) ]
  v_MAP = arg max_{v_j ∈ V} [ P(a_1 … a_n | v_j) P(v_j) ]
  v_NB  = arg max_{v_j ∈ V} P(v_j) Π_i P(a_i | v_j)
  map counts the occurrences needed to estimate P(v_j) and P(a_i | v_j); reduce aggregates the counts (see the sketch below)
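A minimal Python sketch of the counting step (illustrative; assumes categorical attributes, and the key layout is our own choice):

from collections import Counter

def nb_map(label, attributes):
    # map: emit a count of 1 for the label and for each (attribute index, value, label)
    pairs = [(("label", label), 1)]
    pairs += [((i, value, label), 1) for i, value in enumerate(attributes)]
    return pairs

def nb_reduce(pairs):
    # reduce: sum the counts per key; probabilities are then ratios of these counts
    counts = Counter()
    for key, c in pairs:
        counts[key] += c
    return counts

data = [("spam", ["free", "money"]), ("ham", ["hello", "friend"])]
counts = nb_reduce(pair for label, attrs in data for pair in nb_map(label, attrs))
print(counts[("label", "spam")], counts[(0, "free", "spam")])   # 1 1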
Machine Learning Algorithm Transformation
► Solution
  Find the statistics-calculation part
  Distribute the calculations over the data using map
  Gather and refine all the statistics in reduce
Map-Reduce Systems Drawbacks
► Batch-based system
  "pull" model
    ► reduce must wait for unfinished maps
    ► reduce "pulls" data from map
  no direct support for iteration
► Focuses too much on distributed systems and failure tolerance
  a local computing cluster may not need them
Map-Reduce Variants
► MapReduce Online
  "push" model
    ► map "pushes" data to reduce
    ► reduce can also "push" results to the maps of the next job
    ► builds a pipeline
► Iterative Map-Reduce
  higher-level schedulers
  schedule the whole iteration process (a driver-loop sketch follows below)
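A minimal sketch of such a higher-level driver loop (illustrative; run_job, converged, and one_kmeans_pass are hypothetical placeholders):

def iterative_mapreduce(run_job, state, converged, max_iters=100):
    # higher-level scheduler: re-run the whole map-reduce job until convergence
    for _ in range(max_iters):
        new_state = run_job(state)        # one complete map-reduce pass
        if converged(state, new_state):
            return new_state
        state = new_state
    return state

# usage: iterative_mapreduce(one_kmeans_pass, initial_centers, centers_unchanged)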
Map-Reduce Variants
► A series of Map-Reduce jobs?
  [Diagram: multiple multi-core Map-Reduce clusters chained together; the connecting layer is left as an open question (Map-Reduce again? MPI? Condor?)]
Conclusions
► A good parallelization framework
  Schedules jobs automatically
  Failure tolerance
  Distributed computing supported
  High-level abstraction
    ► easy to port algorithms onto it
► Too "industry"
  Why do we need a large distributed system?
  Why do we need so much data safety?
References
[1] Map-Reduce for Machine Learning on Multicore
[2] A Map Reduce Framework for Programming Graphics Processors
[3] Mapreduce Distributed Computing for Machine Learning
[4] Evaluating mapreduce for multi-core and multiprocessor systems
[5] Phoenix Rebirth: Scalable MapReduce on a Large-Scale Shared-Memory System
[6] Phoenix++: Modular MapReduce for Shared-Memory Systems
[7] Web-scale computer vision using MapReduce for multimedia data mining
[8] MapReduce indexing strategies: Studying scalability and efficiency
[9] Batch Text Similarity Search with MapReduce
[10] Twister: A Runtime for Iterative MapReduce
[11] MapReduce Online
[12] Fast Training of Support Vector Machines Using Sequential Minimal Optimization
[13] Social Content Matching in MapReduce
[14] Large-scale multimedia semantic concept modeling using robust subspace bagging
and MapReduce
[15] Parallel Spectral Clustering in Distributed Systems
Thanks
Q&A