Lecture 2 – Theoretical Underpinnings of MapReduce


Lecture 2 – MapReduce: Theory and Implementation
CSE 490h – Introduction to Distributed Computing, Winter 2008
Except as otherwise noted, the content of this presentation is
licensed under the Creative Commons Attribution 2.5 License.
Last Class
- How do I process lots of data?
  - Distribute the work
- Can I distribute the work?
  - Maybe… if it's not dependent on other tasks
  - Example: Fibonacci.
Last Class
- What problems can occur?
  - Large tasks
  - Unpredictable bugs
  - Machine failure
- How do we solve / avoid these?
  - Break up into small chunks?
  - Restart tasks?
  - Use known working solutions
MapReduce
- Concept from functional programming
- Implemented by Google
- Applied to a large number of problems

Functional Programming Review
Java:
    int fooA(String[] list) {
        return bar1(list) + bar2(list);
    }
    int fooB(String[] list) {
        return bar2(list) + bar1(list);
    }
Do they give the same result?
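One way to see why the answer can be "no" in an imperative language: if the two helpers share mutable state, the order of the calls changes the result. The sketch below uses Python rather than Java, and the bodies of bar1 and bar2 are hypothetical, since the slide leaves them undefined.

```python
# Hypothetical bodies for bar1/bar2 (the slide does not define them).
# bar1 mutates the list it is given, so the order of the calls matters.

def bar1(items):
    items.append("x")        # side effect: grows the caller's list
    return len(items)

def bar2(items):
    return len(items) * 10   # result depends on whether bar1 already ran

def foo_a(items):
    return bar1(items) + bar2(items)

def foo_b(items):
    return bar2(items) + bar1(items)

result_a = foo_a(["a", "b"])  # bar1 returns 3, then bar2 sees 3 items
result_b = foo_b(["a", "b"])  # bar2 sees 2 items, then bar1 returns 3
```

With these assumed bodies the two results differ, even though the expressions only swap the operands of `+`.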
Functional Programming Review
Functional Programming:
    fun fooA(l: int list) =
      bar1(l) + bar2(l)
    fun fooB(l: int list) =
      bar2(l) + bar1(l)
Do they give the same result?
Functional Programming Review
Operations do not modify data structures:
- They always create new ones
- Original data still exists in unmodified form

Functional Updates Do Not Modify Structures
    fun foo(x, lst) =
      let val lst' = rev lst in
        rev (x :: lst')
      end
    foo: 'a * 'a list -> 'a list
The foo() function above reverses a list, adds a new element to the front,
and returns all of that, reversed, which appends an item.
But it never modifies lst!
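The same idea can be sketched in Python, assuming we restrict ourselves to operations that build new lists (slicing and `+`) instead of mutating in place. The name append_functional is made up for this illustration.

```python
def append_functional(x, lst):
    """Return a new list with x appended; lst itself is never modified."""
    reversed_copy = lst[::-1]            # slicing builds a new list
    return ([x] + reversed_copy)[::-1]   # put x in front, then reverse back

original = [1, 2, 3]
extended = append_functional(4, original)
```

After the call, `extended` holds the longer list while `original` still has its three elements.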
Functions Can Be Used As Arguments
    fun DoDouble(f, x) = f (f x)
It does not matter what f does to its argument; DoDouble() will do it twice.
What is the type of this function?
    x: 'a
    f: 'a -> 'a
    DoDouble: ('a -> 'a) * 'a -> 'a
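A rough Python counterpart, with type hints standing in for the SML type. The name do_double and the sample lambdas are illustrative only.

```python
from typing import Callable, TypeVar

T = TypeVar("T")

def do_double(f: Callable[[T], T], x: T) -> T:
    """Apply f to x, then apply f again to the result."""
    return f(f(x))

incremented = do_double(lambda n: n + 1, 5)   # (5 + 1) + 1
shouted = do_double(lambda s: s + "!", "hi")  # ("hi" + "!") + "!"
```

The hint `Callable[[T], T]` plays the role of `'a -> 'a`: f must return the same kind of value it accepts, otherwise it could not be applied to its own output.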
map (Functional Programming)
Creates a new list by applying f to each element of the input list; returns output in order.
[Figure: f applied independently to each element of the input list to produce the output list]
map f lst: ('a->'b) -> ('a list) -> ('b list)
map Implementation
    fun map f [] = []
      | map f (x::xs) = (f x) :: (map f xs)
- This implementation moves left-to-right across the list, mapping elements one at a time
- … But does it need to?
Implicit Parallelism In map
- In a purely functional setting, elements of a list being computed by map cannot see the effects of the computations on other elements
- If the order in which f is applied to the elements does not affect the result, we can reorder or parallelize execution
- This is the “secret” that MapReduce exploits
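A minimal sketch of this freedom in Python, assuming f is a pure function. The helper parallel_map and the thread pool are choices made for this illustration and say nothing about how Google's implementation schedules work.

```python
from concurrent.futures import ThreadPoolExecutor

def square(n):
    """A pure function: the result depends only on the argument."""
    return n * n

def parallel_map(f, items, workers=4):
    """Apply f to every item, possibly concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(f, items))

numbers = [1, 2, 3, 4, 5]
sequential = [square(n) for n in numbers]
parallel = parallel_map(square, numbers)
```

Because square has no side effects, the pool may run the calls in any order and the collected output still matches the sequential version.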
Fold
Moves across a list, applying f to each element plus an accumulator. f returns the next
accumulator value, which is combined with the next element of the list.
[Figure: f applied to each element in turn, threading the accumulator from the initial value to the returned value]
fold f x0 lst: ('a*'b->'b)->'b->('a list)->'b
fold left vs. fold right
- Order of list elements can be significant
- Fold left moves left-to-right across the list
- Fold right moves from right-to-left
SML Implementation:
    fun foldl f a [] = a
      | foldl f a (x::xs) = foldl f (f(x, a)) xs
    fun foldr f a [] = a
      | foldr f a (x::xs) = f(x, (foldr f a xs))
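To make the difference concrete, here is a sketch of both folds in Python built on functools.reduce, keeping the slide's argument order f(element, accumulator). The cons helper is an assumption chosen because it makes the traversal order visible.

```python
from functools import reduce

def foldl(f, acc, items):
    """Left fold: visit items left to right, calling f(item, accumulator)."""
    return reduce(lambda a, x: f(x, a), items, acc)

def foldr(f, acc, items):
    """Right fold: visit items right to left, calling f(item, accumulator)."""
    return reduce(lambda a, x: f(x, a), reversed(items), acc)

def cons(x, rest):
    """Prepend x to rest, returning a new list."""
    return [x] + rest

left = foldl(cons, [], [1, 2, 3])   # cons(3, cons(2, cons(1, [])))
right = foldr(cons, [], [1, 2, 3])  # cons(1, cons(2, cons(3, [])))
```

Folding cons from the left reverses the list, while folding from the right rebuilds it in the original order.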
Example
    fun foo(l: int list) =
      sum(l) + mul(l) + length(l)
How can we implement this?
Example (Solved)
    fun foo(l: int list) =
      sum(l) + mul(l) + length(l)
    fun sum(lst) = foldl (fn (x,a)=>x+a) 0 lst
    fun mul(lst) = foldl (fn (x,a)=>x*a) 1 lst
    fun length(lst) = foldl (fn (x,a)=>1+a) 0 lst
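For comparison, the same three folds can be sketched in Python with functools.reduce. The helpers are renamed total, product and count here (an arbitrary choice) to avoid shadowing Python built-ins.

```python
from functools import reduce

def total(lst):
    """Sum of the elements; the accumulator starts at 0."""
    return reduce(lambda a, x: x + a, lst, 0)

def product(lst):
    """Product of the elements; the accumulator starts at 1."""
    return reduce(lambda a, x: x * a, lst, 1)

def count(lst):
    """Number of elements; each element adds 1 and is otherwise ignored."""
    return reduce(lambda a, _x: 1 + a, lst, 0)

def foo(lst):
    return total(lst) + product(lst) + count(lst)
```

For example, `foo([1, 2, 3, 4])` adds a sum of 10, a product of 24 and a length of 4.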
Google MapReduce
- Input Handling
- Map Function
- Partition Function
- Compare Function
- Reduce Function
- Output Writer

Input Handling
- Divides up data into bite-size chunks
- Starts up tasks
- Assigns tasks to idle workers
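A toy sketch of these steps, assuming the input fits in a Python list and that workers are just names. The round-robin policy and both function names are simplifications invented for this illustration, not a description of the real scheduler.

```python
from collections import deque

def split_into_chunks(records, chunk_size):
    """Divide the input into consecutive chunks of at most chunk_size records."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]

def assign_tasks(chunks, idle_workers):
    """Hand each chunk to the next idle worker, round-robin (a simplification)."""
    workers = deque(idle_workers)
    assignments = []
    for chunk in chunks:
        worker = workers.popleft()
        assignments.append((worker, chunk))
        workers.append(worker)  # pretend the worker is idle again right away
    return assignments

chunks = split_into_chunks(list(range(7)), chunk_size=3)
plan = assign_tasks(chunks, ["w1", "w2"])
```

A real system would only reassign a worker once it reports its task as finished. The sketch skips that bookkeeping.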

Map
- Input: Key, Value pair
- Output: Key, Value pairs
- Example: Annual Rainfall Per City

Map (Example)
Example: Annual Rainfall Per City
    map(String key, String value):
        // key: date
        // value: weather info
        foreach (City c in value)
            EmitIntermediate(c, c.rainfall)
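A runnable approximation in Python. The slide does not specify how the weather info string is laid out, so the `City:amount` comma-separated format below is purely an assumption, as is the choice to emit the rainfall amount named in the example's title.

```python
def map_rainfall(key, value):
    """key: date string; value: 'City:amount,City:amount' (hypothetical format).

    Yields one (city, rainfall) intermediate pair per city in the record.
    """
    for entry in value.split(","):
        city, rainfall = entry.split(":")
        yield city.strip(), float(rainfall)

pairs = list(map_rainfall("2008-01-07", "Seattle:4, Portland:2.5"))
```

Note that the date key is not used in the output: the intermediate key is the city, which is what later groups the readings.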
Partition Function
- Allocates map output to particular reduce tasks
- Input: key, number of reduces
- Output: Index of desired reduce
- Typical: hash(key) % numberOfReduces
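A small sketch of the typical partitioner in Python. It uses zlib.crc32 as the hash, which is an assumption made here: Python's built-in hash() is randomized per process for strings, so it would not send the same key to the same reduce across workers.

```python
import zlib

def partition(key, number_of_reduces):
    """Return the index of the reduce task responsible for key."""
    # crc32 is deterministic across processes and machines, unlike hash().
    return zlib.crc32(key.encode("utf-8")) % number_of_reduces

seattle_index = partition("Seattle", 4)
```

Every pair with the same key lands on the same index, which is what lets a single reduce see all values for that key.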

Comparison
- Sorts input for each reduce
- Example: Annual rainfall per city
  - Sorts rainfall data for each city
  - Seattle: {0, 0, 0, 1, 4, 7, 10, …}
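The grouping and sorting step might look like this in Python, assuming the values compare with their natural ordering. The function name and the sample readings are invented for illustration.

```python
from collections import defaultdict

def group_and_sort(pairs):
    """Collect values by key, then sort each key's values in ascending order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sorted(values) for key, values in grouped.items()}

readings = [("Seattle", 7), ("Seattle", 0), ("Portland", 3), ("Seattle", 1)]
by_city = group_and_sort(readings)
```

A user-supplied compare function could be passed through to `sorted` via its `key` argument. The sketch leaves that out.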
Reduce
- Input: Key, Sorted list of values
- Output: Single value
- Example: Annual rainfall per city


Reduce (Example)
Example: Annual rainfall per city
    reduce(String key, Iterator values):
        // key: city
        // values: rainfall readings
        sum = 0, count = 0
        for each (v in values)
            sum += v
            count = count + 1
        Emit(sum / count)   // the average reading
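The same reduce can be written as runnable Python. Returning None for an empty group is an assumption added here to avoid a division by zero. The pseudocode does not say what should happen in that case.

```python
def reduce_rainfall(key, values):
    """key: city; values: iterable of numeric readings. Returns their mean."""
    total = 0.0
    count = 0
    for v in values:
        total += v
        count += 1
    if count == 0:
        return None  # assumption: nothing to emit for an empty group
    return total / count

seattle_average = reduce_rainfall("Seattle", [0, 0, 0, 1, 4, 7, 10])
```

Like the pseudocode, this makes a single pass over the values, so it also works when `values` is an iterator that can only be consumed once.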
Output
- Writes the output to storage (GFS, etc.)
[Figure: MapReduce data flow. Input key*value pairs from data stores 1 to n are fed to map tasks, which emit (key, values...) pairs for keys 1, 2 and 3. A barrier aggregates intermediate values by output key. Each reduce task then receives one key with its intermediate values and produces the final values for that key.]
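The whole flow can be imitated in a single process. The sketch below is a teaching toy under several assumptions: everything fits in memory, records are (date, list of (city, rainfall)) tuples, and the names run_mapreduce, rain_map and rain_reduce are made up for this example.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, group by key (the barrier), then reduce, all in one process."""
    intermediate = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)
    # Barrier: every map call has finished before any reduce call starts.
    return {key: reduce_fn(key, sorted(values))
            for key, values in intermediate.items()}

def rain_map(_date, observations):
    """Emit one (city, rainfall) pair per observation in the record."""
    for city, rainfall in observations:
        yield city, rainfall

def rain_reduce(_city, values):
    """Average the sorted readings for one city."""
    return sum(values) / len(values)

records = [
    ("2008-01-01", [("Seattle", 4), ("Portland", 2)]),
    ("2008-01-02", [("Seattle", 10)]),
]
averages = run_mapreduce(records, rain_map, rain_reduce)
```

In a distributed run the map calls, the grouping and the reduce calls would be spread over many machines, but the data dependencies are the same as in this loop.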
MapReduce for Google Local
- Intersections
- Rendering Tiles
- Finding nearest gas stations
