Lecture 2 – Theoretical Underpinnings of MapReduce
Lecture 2 – MapReduce: Theory and Implementation
CSE 490h – Introduction to Distributed Computing, Winter 2008
Except as otherwise noted, the content of this presentation is
licensed under the Creative Commons Attribution 2.5 License.
Last Class
How do I process lots of data?
Distribute the work
Can I distribute the work?
Maybe…
if it’s not dependent on other tasks
Example: Fibonacci.
Last Class
What problems can occur?
Large tasks
Unpredictable bugs
Machine failure
How do we solve / avoid these?
Break up into small chunks?
Restart tasks?
Use known working solutions
MapReduce
Concept from functional programming
Implemented by Google
Applied to a large number of problems
Functional Programming Review
Java:
int fooA(String[] list) {
    return bar1(list) + bar2(list);
}
int fooB(String[] list) {
    return bar2(list) + bar1(list);
}
Do they give the same result?
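A minimal Python sketch of why the answer can be "no" in an imperative language. The bodies of bar1 and bar2 are hypothetical (the slide does not define them); they are assumed here to mutate the list they are given, so the order of the two calls changes the sum.

```python
# Hypothetical stand-ins for bar1 and bar2: both mutate their argument.
def bar1(items):
    items.append("extra")   # side effect: grows the caller's list
    return len(items)

def bar2(items):
    items.clear()           # side effect: empties the caller's list
    return len(items)

def foo_a(items):
    return bar1(items) + bar2(items)

def foo_b(items):
    return bar2(items) + bar1(items)

result_a = foo_a(["x", "y"])  # bar1 sees 3 items, then bar2 sees 0
result_b = foo_b(["x", "y"])  # bar2 sees 0 items, then bar1 sees 1
```

With these assumed bodies the two functions disagree, even though they add the same two calls.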
Functional Programming Review
Functional Programming:
fun fooA(l: int list) =
  bar1(l) + bar2(l)
fun fooB(l: int list) =
  bar2(l) + bar1(l)
Do they give the same result?
Functional Programming Review
Operations do not modify data structures:
They always create new ones
Original data still exists in unmodified form
Functional Updates Do Not Modify Structures
fun foo(x, lst) =
  let val lst' = rev lst
  in rev (x :: lst') end
foo: 'a * 'a list -> 'a list
The foo() function above reverses a list, adds a new
element to the front, and reverses the result again,
which has the effect of appending the item to the end.
But it never modifies lst!
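The same idea as a Python sketch, assuming list slicing and concatenation (both of which build new lists) in place of the SML operations. It is an illustration of the non-mutating style, not a translation the lecture provides.

```python
def foo(x, lst):
    reversed_copy = lst[::-1]           # slicing builds a new list; lst is untouched
    return ([x] + reversed_copy)[::-1]  # prepend x, then reverse again

original = [1, 2, 3]
appended = foo(4, original)
```

After the call, original still holds its three elements and appended is a separate list with the new item at the end.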
Functions Can Be Used As Arguments
fun DoDouble(f, x) = f (f x)
It does not matter what f does to its
argument; DoDouble() will do it twice.
What is the type of this function?
x: 'a
f: 'a -> 'a
DoDouble: ('a -> 'a) * 'a -> 'a
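A Python sketch of the same higher-order function. The type hints are an assumption added for illustration; they mirror the rule that f must return the same type it accepts.

```python
from typing import Callable, TypeVar

T = TypeVar("T")

def do_double(f: Callable[[T], T], x: T) -> T:
    # Apply f, then apply f again to the result.
    return f(f(x))

add_three_twice = do_double(lambda n: n + 3, 10)
shout_twice = do_double(lambda s: s + "!", "hi")
```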
map (Functional Programming)
Creates a new list by applying f to each element of
the input list; returns output in order.
[Figure: f applied independently to each element of the input list, producing the output list]
map f lst: ('a -> 'b) -> ('a list) -> ('b list)
map Implementation
fun map f [] = []
  | map f (x::xs) = (f x) :: (map f xs)
This implementation moves left-to-right
across the list, mapping elements one at a
time
… But does it need to?
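For comparison, a Python sketch of the same left-to-right recursion. The name map_list is hypothetical, chosen to avoid shadowing Python's built-in map.

```python
def map_list(f, lst):
    # Base case: mapping over an empty list gives an empty list.
    if not lst:
        return []
    head, *tail = lst
    # Apply f to the head, then recurse on the rest of the list.
    return [f(head)] + map_list(f, tail)

squares = map_list(lambda n: n * n, [1, 2, 3, 4])
```

Nothing in the result depends on the head being processed before the tail.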
Implicit Parallelism In map
In a purely functional setting, elements of a list
being computed by map cannot see the effects
of the computations on other elements
If the order of application of f to the elements of the
list is commutative, we can reorder or parallelize
execution
This is the “secret” that MapReduce exploits
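A small Python sketch of this point, assuming a thread pool from the standard library stands in for a cluster of workers. Because square has no side effects, the elements can be processed in any order, and Executor.map still returns results in input order.

```python
from concurrent.futures import ThreadPoolExecutor

def square(n):
    # Pure function: the result depends only on the argument.
    return n * n

data = list(range(10))

# Sequential map, one element at a time.
sequential = [square(n) for n in data]

# Parallel map: the pool may run elements in any order,
# but results come back in the order of the inputs.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(square, data))
```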
Fold
Moves across a list, applying f to each element
plus an accumulator. f returns the next
accumulator value, which is combined with the
next element of the list
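[Figure: the initial accumulator is passed through successive applications of f, one per list element; the final accumulator is returned]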
fold f x0 lst: ('a*'b->'b)->'b->('a list)->'b
fold left vs. fold right
Order of list elements can be significant
Fold left moves left-to-right across the list
Fold right moves from right-to-left
SML Implementation:
fun foldl f a [] = a
  | foldl f a (x::xs) = foldl f (f(x, a)) xs
fun foldr f a [] = a
  | foldr f a (x::xs) = f(x, (foldr f a xs))
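A Python sketch of both folds, written as loops rather than recursion (an implementation choice made here, not the lecture's). The hypothetical cons helper builds a list, which makes the difference in traversal order visible.

```python
def foldl(f, acc, lst):
    # Left fold: visit elements left to right.
    for x in lst:
        acc = f(x, acc)
    return acc

def foldr(f, acc, lst):
    # Right fold: visit elements right to left.
    for x in reversed(lst):
        acc = f(x, acc)
    return acc

def cons(x, acc):
    # Put x in front of the accumulated list, without mutating it.
    return [x] + acc

left = foldl(cons, [], [1, 2, 3])
right = foldr(cons, [], [1, 2, 3])
```

With cons as f, the left fold reverses the list and the right fold rebuilds it in the original order.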
Example
fun foo(l: int list) =
sum(l) + mul(l) + length(l)
How can we implement this?
Example (Solved)
fun foo(l: int list) =
sum(l) + mul(l) + length(l)
fun sum(lst) = foldl (fn (x,a)=>x+a) 0 lst
fun mul(lst) = foldl (fn (x,a)=>x*a) 1 lst
fun length(lst) = foldl (fn (x,a)=>1+a) 0 lst
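The same solution sketched in Python with functools.reduce. Note that reduce passes the accumulator first and the element second, the opposite of the f(x, a) order used in the SML above. The names total, product, and count are hypothetical, chosen to avoid shadowing Python built-ins.

```python
from functools import reduce

def total(lst):
    return reduce(lambda a, x: x + a, lst, 0)

def product(lst):
    return reduce(lambda a, x: x * a, lst, 1)

def count(lst):
    # The element itself is ignored; each one adds 1 to the accumulator.
    return reduce(lambda a, _x: 1 + a, lst, 0)

def foo(lst):
    return total(lst) + product(lst) + count(lst)

result = foo([1, 2, 3, 4])
```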
Google MapReduce
Input Handling
Map function
Partition Function
Compare Function
Reduce Function
Output Writer
Input Handling
Divides up data into bite-size chunks
Starts up tasks
Assigns tasks to idle workers
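A rough Python sketch of these three steps, under simplifying assumptions: the input is an in-memory list, chunks are fixed-size slices, and workers are just names. The functions split_into_chunks and assign_tasks are hypothetical and do not reflect Google's actual scheduler.

```python
from collections import deque

def split_into_chunks(records, chunk_size):
    # Divide the input into bite-size pieces of at most chunk_size records.
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]

def assign_tasks(chunks, idle_workers):
    # One task per chunk; hand tasks out while idle workers remain.
    pending = deque(enumerate(chunks))
    idle = deque(idle_workers)
    assignments = {}
    while pending and idle:
        task_id, _chunk = pending.popleft()
        assignments[task_id] = idle.popleft()
    waiting = [task_id for task_id, _ in pending]
    return assignments, waiting

chunks = split_into_chunks(list(range(10)), 4)
assignments, waiting = assign_tasks(chunks, ["worker-a", "worker-b"])
```

Tasks left in waiting would be assigned as workers finish and become idle again.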
Map
Input: Key, Value pair
Output: Key, Value pairs
Example: Annual Rainfall Per City
Map (Example)
Example: Annual Rainfall Per City
map(String key, String value):
    // key: date
    // value: weather info
    foreach (City c in value)
        EmitIntermediate(c, c.rainfall)
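A Python sketch of a map function for this example. It assumes the weather info has already been parsed into (city, rainfall) tuples, and it uses a generator's yield in place of EmitIntermediate. The name map_rainfall and the sample data are made up for illustration.

```python
def map_rainfall(date, city_readings):
    # key: date (unused here); value: list of (city, rainfall) readings.
    for city, rainfall in city_readings:
        # Emit one intermediate pair per city, keyed by city.
        yield (city, rainfall)

pairs = list(map_rainfall("day-1", [("Seattle", 4), ("Portland", 7)]))
```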
Partition Function
Allocates map output to particular reduces
Input: key, number of reduces
Output: Index of desired reduce
Typical: hash(key) % numberOfReduces
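A Python sketch of the typical partition function. It uses zlib.crc32 as the hash, an assumption made here because Python's built-in hash() of a string varies between processes, and every worker must send a given key to the same reduce.

```python
import zlib

def partition(key, number_of_reduces):
    # Stable hash of the key, folded into the range of reduce indices.
    return zlib.crc32(key.encode("utf-8")) % number_of_reduces

seattle_index = partition("Seattle", 4)
```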
Comparison
Sorts input for each reduce
Example: Annual rainfall per city
Sorts rainfall data for each city
Seattle: {0, 0, 0, 1, 4, 7, 10, …}
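A Python sketch of the grouping and sorting step, assuming the intermediate pairs fit in memory and that plain numeric ordering is the comparison. The function group_and_sort and the sample readings are hypothetical.

```python
from collections import defaultdict

def group_and_sort(pairs):
    # Collect every value emitted under the same key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Sort the keys, and sort the values within each key.
    return {key: sorted(values) for key, values in sorted(grouped.items())}

grouped = group_and_sort(
    [("Seattle", 7), ("Portland", 2), ("Seattle", 0), ("Seattle", 4)]
)
```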
Reduce
Input: Key, Sorted list of values
Output: Single value
Example: Annual rainfall per city
Reduce (Example)
Example: Annual rainfall per city
reduce(String key, Iterator values):
    // key: city
    // values: rainfall readings
    sum = 0, count = 0
    for each (v in values)
        sum += v
        count += 1
    Emit(sum / count)
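A Python sketch of this reduce function. It returns a (key, value) pair instead of calling Emit, and it assumes values is non-empty, since reduce is only invoked for keys that received at least one intermediate value. The name reduce_average is hypothetical.

```python
def reduce_average(city, values):
    # Single pass over the iterator: keep a running sum and count.
    total = 0
    count = 0
    for v in values:
        total += v
        count += 1
    # Average of all values seen for this key.
    return (city, total / count)

result = reduce_average("Seattle", iter([0, 0, 0, 1, 4, 7, 10]))
```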
Output
Writes the output to storage (GFS, etc.)
[Figure: MapReduce data flow]
Input key*value pairs (data store 1 ... data store n)
-> map: each map emits (key 1, values...), (key 2, values...), (key 3, values...)
== Barrier == : Aggregates intermediate values by output key
-> key 1, key 2, key 3, each with its intermediate values
-> reduce: one reduce per key
-> final key 1 values, final key 2 values, final key 3 values
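The whole flow (map, barrier, reduce) can be sketched as a single-process Python function. This is a toy model under stated assumptions: everything runs in memory on one machine, run_mapreduce is a hypothetical driver, and the reduce here sums the readings to get a total per city, which is one possible reading of "annual rainfall".

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    intermediate = defaultdict(list)
    # Map phase: every input pair may emit any number of intermediate pairs.
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)
    # Barrier: all map output is grouped by key before any reduce runs.
    # Reduce phase: one call per key, over that key's sorted values.
    return {
        key: reduce_fn(key, sorted(values))
        for key, values in sorted(intermediate.items())
    }

def map_fn(date, readings):
    for city, rainfall in readings:
        yield city, rainfall

def reduce_fn(city, values):
    return sum(values)

inputs = [
    ("day-1", [("Seattle", 4), ("Portland", 1)]),
    ("day-2", [("Seattle", 7), ("Portland", 0)]),
]
totals = run_mapreduce(inputs, map_fn, reduce_fn)
```

In a real deployment the map calls and the reduce calls would each run in parallel on separate workers; only the barrier between the two phases is inherently sequential.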
MapReduce for Google Local
Intersections
Rendering Tiles
Finding nearest gas stations