Lecture 2 – Theoretical Underpinnings of MapReduce
MapReduce:
Acknowledgements: some slides from Google University (licensed under the
Creative Commons Attribution 2.5 License), others from Jure Leskovec
MapReduce
Concept from functional programming
Applied to large number of problems
Java:
int fooA(String[] list) {
return bar1(list) + bar2(list);
}
int fooB(String[] list) {
return bar2(list) + bar1(list);
}
Do they give the same result?
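The answer is "not necessarily": if bar1 and bar2 have side effects, the order of the two calls matters. A sketch with hypothetical bar1/bar2 sharing a counter:

```java
// Hypothetical bar1/bar2 with a side effect (a shared counter): now the
// order of the two calls changes the result, so fooA and fooB differ.
class OrderMatters {
    static int counter = 0;

    static int bar1(String[] list) { counter += list.length; return counter; }
    static int bar2(String[] list) { counter *= 2; return counter; }

    // Java evaluates the operands of + left to right.
    static int fooA(String[] list) { counter = 0; return bar1(list) + bar2(list); }
    static int fooB(String[] list) { counter = 0; return bar2(list) + bar1(list); }
}
```

With a two-element list, fooA returns 2 + 4 = 6 while fooB returns 0 + 2 = 2.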
Functional Programming:
fun fooA(l: int list) =
bar1(l) + bar2(l)
fun fooB(l: int list) =
bar2(l) + bar1(l)
They do give the same result!
Functional Programming
Operations do not modify data structures:
They always create new ones
Original data still exists in unmodified form
Functional Updates Do Not Modify Structures
fun foo(x, lst) =
  let val lst' = reverse lst
  in reverse (x :: lst') end

foo: 'a * 'a list -> 'a list

The foo() function above reverses lst, prepends x, and reverses the
result again, which effectively appends x to the end of the list.
But it never modifies lst!
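A Java analogue of foo (a sketch; Java lists are mutable by default, so every step works on a copy):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Java analogue of foo above: every step copies rather than mutating,
// so the input list is never modified.
class FunctionalUpdate {
    static <A> List<A> foo(A x, List<A> lst) {
        List<A> rev = new ArrayList<>(lst);   // copy, then
        Collections.reverse(rev);             // rev = reverse lst
        rev.add(0, x);                        // x :: rev
        Collections.reverse(rev);             // reverse (x :: rev)
        return rev;                           // lst itself is untouched
    }
}
```

foo(4, [1, 2, 3]) yields [1, 2, 3, 4] while the original list stays [1, 2, 3].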
Functions Can Be Used As Arguments
fun DoDouble(f, x) = f (f x)
It does not matter what f does to its
argument; DoDouble() will do it twice.
What is the type of this function?
x: 'a
f: 'a -> 'a
DoDouble: ('a -> 'a) * 'a -> 'a
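The same idea in Java, with UnaryOperator playing the role of the 'a -> 'a argument:

```java
import java.util.function.UnaryOperator;

// DoDouble in Java: the function-typed argument f is applied twice,
// whatever f does to its argument.
class DoDoubleDemo {
    static <A> A doDouble(UnaryOperator<A> f, A x) {
        return f.apply(f.apply(x));
    }
}
```

For example, doDouble(n -> n + 1, 5) gives 7, and doDouble(s -> s + "!", "hi") gives "hi!!".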
map (Functional Programming)
Creates a new list by applying f to each element of
the input list; returns output in order.
map Implementation
fun map f []      = []
  | map f (x::xs) = (f x) :: (map f xs)

This implementation moves left-to-right across the list, mapping
elements one at a time.
… But does it need to?
Implicit Parallelism In map
In a functional setting, the elements of a list being computed by map
cannot see the effects of the computations on other elements.
Because each application of f is independent of the others, we can
reorder or parallelize execution.
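The independence of the applications of f is exactly what lets a runtime parallelize map; a sketch using Java parallel streams:

```java
import java.util.List;
import java.util.stream.Collectors;

// Because f is applied independently to each element, the runtime may
// process elements in any order or in parallel; results still come back
// in input order.
class ParallelMap {
    static List<Integer> squares(List<Integer> xs) {
        return xs.parallelStream()
                 .map(x -> x * x)     // pure f: safe to parallelize
                 .collect(Collectors.toList());
    }
}
```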
Reduce
Moves across a list, applying f to each element plus an accumulator;
f returns the next accumulator value, which is combined with the
next element of the list.
(Diagram: f chains the accumulator from an initial value through each
list element to the returned result.)
Order of list elements can be significant:
Fold-left moves left-to-right across the list
Again, if the operation is associative and commutative, order is not important
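A sketch of fold-left in Java, threading an accumulator through the list:

```java
import java.util.List;
import java.util.function.BiFunction;

// Fold-left: thread an accumulator through the list from an initial
// value, applying f at each element.
class FoldLeft {
    static <A, B> B foldl(BiFunction<B, A, B> f, B init, List<A> xs) {
        B acc = init;
        for (A x : xs)
            acc = f.apply(acc, x);   // next accumulator value from f
        return acc;
    }
}
```

foldl((a, x) -> a + x, 0, [1, 2, 3]) gives 6; with the non-commutative (a, x) -> a - x it gives ((0 - 1) - 2) - 3 = -6, showing why order can matter.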
MapReduce
Motivation: Large Scale Data Processing
Google:
20+ billion web pages × 20 KB = 400+ TB
One computer reads 30–35 MB/sec from disk: ~4 months to read the web
~1,000 hard drives to store the web
Even more to do something with the data
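A quick sanity check on the slide's arithmetic (a sketch; 1 TB taken as 10^6 MB and a month as 30 days):

```java
// Time to read a dataset sequentially from one disk, in months.
class ReadTime {
    static double monthsToRead(double terabytes, double mbPerSec) {
        double seconds = terabytes * 1_000_000 / mbPerSec;  // TB -> MB
        return seconds / (3600.0 * 24 * 30);                // ~30-day months
    }
}
```

monthsToRead(400, 32.5) comes to roughly 4.7 months, consistent with the slide's "~4 months".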
Web data sets are massive: tens to hundreds of terabytes
Cannot mine on a single server
Standard architecture emerging:
Cluster of commodity Linux nodes
Gigabit Ethernet interconnect
How to organize computations on this architecture?
Mask issues such as hardware failure
Traditional ‘big-iron box’ (circa 2003):
8 × 2 GHz Xeons, 64 GB RAM, 8 TB disk: $758,000 USD
Prototypical Google rack (circa 2003):
176 × 2 GHz Xeons, 176 GB RAM, ~7 TB disk: $278,000 USD
In Aug 2006 Google had ~450,000 machines
Prototypical architecture
The Challenge: Large-scale data-intensive computing
Process huge datasets on commodity hardware, across many computers (e.g., data mining)
Challenges:
How do you distribute computation?
Distributed/parallel programming is hard
Single-machine performance should not matter: incremental scalability
Machines fail
Map-reduce addresses all of the above
An elegant way to work with big data
Idea: co-locate computation and data
(Store files multiple times for reliability)
Need:
Programming model: Map-Reduce
Infrastructure:
File system (Google: GFS; Hadoop: HDFS)
Runtime engine
MapReduce
Automatic parallelization & distribution
Fault-tolerant
Provides status and monitoring tools
Clean abstraction for programmers
Map (k, v) → <k’, v’>*
Reduce (k’, <v’>*) → <k’’, v’’>*
Notation: * denotes a list
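The two signatures can be written as Java interfaces (hypothetical names, a sketch): Map takes (k, v) to a list of (k’, v’) pairs; Reduce takes a key and all its values to a list of (k’’, v’’) pairs.

```java
import java.util.List;
import java.util.Map;

// Map:    (k, v)     -> list of (k', v')
interface Mapper<K, V, K2, V2> {
    List<Map.Entry<K2, V2>> map(K key, V value);
}

// Reduce: (k', [v']) -> list of (k'', v'')
interface Reducer<K2, V2, K3, V3> {
    List<Map.Entry<K3, V3>> reduce(K2 key, List<V2> values);
}
```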
map(String input_key, String input_value):
// input_key: document name
// input_value: document contents
for each word w in input_value:
EmitIntermediate(w, "1");
reduce(String output_key, List intermediate_values):
// output_key: a word
// intermediate_values: a list of ones
int result = 0;
for each v in intermediate_values:
result += ParseInt(v);
Emit(output_key, AsString(result));
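An in-memory simulation of the word-count pseudocode above (a sketch; the real framework distributes these phases across machines): the map phase emits (word, 1) pairs, the shuffle groups them by word, and the reduce phase sums each word's list of ones.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WordCount {
    static Map<String, Integer> count(List<String> documents) {
        // Map + shuffle: collect the emitted 1s, grouped by word
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (String doc : documents)
            for (String w : doc.split("\\s+"))
                grouped.computeIfAbsent(w, k -> new ArrayList<>()).add(1);
        // Reduce: sum the list of ones for each word
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int result = 0;
            for (int v : e.getValue()) result += v;
            counts.put(e.getKey(), result);
        }
        return counts;
    }
}
```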
Reverse Web-Link Graph: for a list of web pages, produce the set of
pages that have links pointing to each of these pages.
Email me your solution (pseudocode) by the end of Thursday 27/02.
Key ideas behind map-reduce
Key idea 1:
Separate the what from the how
MapReduce abstracts away the “distributed” part of the system;
details are handled by the framework
However, in-depth knowledge of the
framework is key for performance
Custom data reader/writer
Custom data partitioning
Memory utilization
Key idea 2:
Move processing to the data
Drastic departure from the high-performance computing model
HPC: distinction between processing nodes and storage nodes; designed for CPU-intensive tasks
Data-intensive workloads:
Generally not processor demanding
The network and I/O are the bottleneck
MapReduce assumes processing and storage nodes are co-located (data locality)
Distributed filesystems are necessary
Key idea 3:
Scale out, not up!
For data-intensive workloads, a large number of commodity servers is
preferred over a small number of high-end servers
(the cost of super-computers is not linear)
Some numbers
Processing data is quick, I/O is very slow:
1 HDD = 75 MB/sec; 1000 HDDs = 75 GB/sec
Data volume processed: 80 PB/day at Google; 60 TB/day at Facebook (~2012)
Key idea 4:
“Shared-nothing” infrastructure (both hardware and software)
Sharing vs. Shared nothing:
Sharing: manage a common/global state
Shared nothing: independent entities, no common state
Functional programming as key enabler
No side effects
Recovery from failures much easier
map/reduce as a subset of functional programming
More examples
Distributed Grep: The map function emits a line if it
matches a supplied pattern. The reduce function is an
identity function that just copies the supplied intermediate
data to the output.
Count of URL Access Frequency: The map function
processes logs of web page requests and outputs <URL;
1>. The reduce function adds together all values for the
same URL and emits a <URL; total count> pair.
Reverse Web-Link Graph: The map function outputs
<target; source> pairs for each link to a target URL found in
a page named source. The reduce function concatenates
the list of all source URLs associated with a given target
URL and emits the pair: <target; list(source)>
Term-Vector per Host: …
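The Reverse Web-Link Graph example above can be sketched in memory (hypothetical names; the real job would distribute the phases): the map phase emits (target, source) for every link, and the reduce phase concatenates the sources for each target.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ReverseLinks {
    // links: source page -> the target URLs it links to
    static Map<String, List<String>> invert(Map<String, List<String>> links) {
        Map<String, List<String>> inverted = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : links.entrySet())
            for (String target : e.getValue())                // map: emit (target, source)
                inverted.computeIfAbsent(target, k -> new ArrayList<>())
                        .add(e.getKey());                     // reduce: concatenate sources
        return inverted;
    }
}
```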
More info
MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
http://labs.google.com/papers/mapreduce.html
The Google File System
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
http://labs.google.com/papers/gfs.html