Transcript Slide 1
Introduction to Map/Reduce
Data Transformations
Tasso Argyros
CTO and Co-Founder
Aster Data Systems
[email protected]
A Brief History of MapReduce
2004:
Google
publishes
MapReduce
paper at
OSDI
2
2006: Hadoop
becomes the
first OpenSource
implementation
of MapReduce
Confidential and proprietary. Copyright © 2008 Aster Data Systems
2008: Aster and
Greenplum
become the first
database vendors
to incorporate
MapReduce
functionality into
their products
What is MapReduce?
It’s the simplest API you have ever seen
It has just two functions
1. Map()
and
2. Reduce()
Plus: it’s language independent (Java, Perl, Python,
…)
3
Confidential and proprietary. Copyright © 2008 Aster Data Systems
Why is MapReduce Useful?
It simplifies distributed applications…
…by abstracting the details of
data distribution (where is the data I need?) and
process distribution (where should I run this
process?)…
…behind two simple functions.
But let’s see an example
4
Confidential and proprietary. Copyright © 2008 Aster Data Systems
The quick
brown fox
jumps over the
lazy dog.
The world only
needs five
computers.
MapReduce is
a very
powerful
programming
paradigm.
In-Database
MapReduce is
the future.
Server A
Hello world.
Server B
Switch
5
To be or not to
be: that is the
question.
Confidential and proprietary. Copyright © 2008 Aster Data Systems
Server C
Server D
Goal
We Want to Count
the # of Times
Each Word Occurs
6
Confidential and proprietary. Copyright © 2008 Aster Data Systems
1st Approach1st Approach
No MapReduce
No MapReduce
7
Confidential and proprietary. Copyright © 2008 Aster Data Systems
The quick
brown fox
jumps over
the lazy dog
In-Database
MapReduce is
the future.
Server A
the
quick
brown
fox
jumps
over
the
lazy
dog
The world
only needs
five
computers.
the
world
only
needs
five
computers
Hello world.
in
database
mapreduce
is
the
future
concept
Server B
Confidential and proprietary. Copyright © 2008 Aster Data Systems
To be or not
to be: that is
the question.
MapReduce is
mapreduce
a very
is
powerful a
concept. very
powerful
Switch
8
hello
world
Server C
Server D
to
be
or
not
to
be
that
is
the
question
the
quick
brown
fox
jumps
over
the
lazy
dog
in database
mapreduce
is
the
future
the
world
only
needs
five
computers
hello
world
mapreduce
is
a
very
powerful
concept
to
be
or
not
to
be
that
is
the
question
Server 4 Final Result File
9
the
5
is
3
mapreduce
2
…
…
Confidential and proprietary. Copyright © 2008 Aster Data Systems
What Did We Do?
1. Write a script to parse the documents and
output word lists
2. FTP all the word lists to server 4
3. Write another script to count each word on
Server 4
Problem: (2) and (3) do not scale!
10
Confidential and proprietary. Copyright © 2008 Aster Data Systems
2nd Approach
No MapReduce
Fully Distributed
11
Confidential and proprietary. Copyright © 2008 Aster Data Systems
The quick
brown fox
jumps over
the lazy dog
In-Database
MapReduce is
the future.
Server A
the
quick
brown
fox
jumps
over
the
lazy
dog
The world
only needs
five
computers.
the
world
only
needs
five
computers
Hello world.
hello
world
To be or not
to be: that is
the question.
MapReduce is
mapreduce
a very
is
powerful a
concept. very
powerful
in
database
mapreduce
is
the
future
concept
Server B
the
the
world
the
world
the
powerful
the
lazy
database
brown
database
Confidential and proprietary. Copyright © 2008 Aster Data Systems
future
12
Switch
Server C
Server D
mapreduce
mapreduce
be
be
to
jumps
computers
hello
is
is
is
question
over
a
that
to
be
or
not
to
be
that
is
the
question
Server 1 Final Result File
the
5
…
….
Server 2 Final Result File
world
2
…
….
Server 3 Final Result File
mapreduce
2
…
….
Server 4 Final Result File
13
is
3
…
….
Confidential and proprietary. Copyright © 2008 Aster Data Systems
2nd Approach: No MapReduce, Distributed
1. Write a script to parse the documents and output word
lists
2. FTP all the word lists to a single node
3. Write another script to count each word
4. Write another script to break each word list to four
files: f1, f2, f3, f4
Note: for this to work the same word needs to be in the
same file across all servers. How do I do that?
Answer: use some form of hashing
5. FTP all f1s to Server 1, all f2s to Server 2, etc
n(n-1) transfers
6. Write another script to merge all fXs on each Server
7. Write another script to count each word
14
Confidential and proprietary. Copyright © 2008 Aster Data Systems
Does it work?
Yes
Does it take lots of time?
Yes!
Is it a pain?
Yes!!
Would you do it?
No!!!
15
Confidential and proprietary. Copyright © 2008 Aster Data Systems
Moreover…
Who will manage your files?
What if nodes fail?
What if you want to add more nodes?
What if…
What if…
What if…
16
Confidential and proprietary. Copyright © 2008 Aster Data Systems
17
Input
Any file
(e.g. documents)
Input
All <key, value> pairs with
the same key grouped
(e.g. all <word, count> pairs
where word = “the”)
Map()
Reduce()
Output
Stream of <key, value> pairs
(e.g. <word, count> pairs)
Output
Anything
(e.g. sum of counts for a
specific word)
Confidential and proprietary. Copyright © 2008 Aster Data Systems
The quick
brown fox
jumps over
the lazy dog
Map()
In-Database
MapReduce is
the future.
<the, 1>
<the, 1>
<the, 1>
<the, 1>
<the, 1>
<database,1>
<database,1>
<future,1>
18
<in, 1>
<database, 1>
<mapreduce,1>
<is,1>
<the,1>
<future,1>
Map()
Server A
Server B
<world,1>
<world,1>
<powerful,1>
<lazy,1>
<brown,1>
Map() and
Redistributi
on Phase
<the, 1>
<quick, 1>
<brown,1>
<fox,1>
<jumps,1>
<over,1>
<the,1>
<lazy,1>
<dog,1>
Switch
Confidential and proprietary. Copyright © 2008 Aster Data Systems
Server C
Server D
<mapreduce,1>
<mapreduce,1>
<be,1>
<be,1>
<to,1>
<jumps,1>
<computers,1>
<hello,1>
<is,1>
<is,1>
<is,1>
<question,1>
<over,1>
<a,1>
<that,1>
Grouping and
Reduce() Phase
(on Server 1)
<the,
<the,
<the,
<the,
<the,
<the, 1>
<the, 1>
<the, 1>
<the, 1>
<the, 1>
<database,1>
<database,1>
<future,1>
1>
1>
1>
1>
1>
Server 1 Final
Result File
<database,1>
<database,1>
<future,1>
19
Reduce()
Confidential and proprietary. Copyright © 2008 Aster Data Systems
Reduce()
Reduce()
the
5
database
2
future
1
What Just Happened?
By writing two small scripts with a few lines
of code…
… we achieved exactly the same result!
Plus, our code did not have to care about:
•the # of servers on the system (4 or 400?)
• which server to send each word
• any network communication aspects
• any fault tolerance aspects
•…
20
Confidential and proprietary. Copyright © 2008 Aster Data Systems
Word Count was Only an Example!
Google does all web indexing on MapReduce
“The indexing code is simpler, smaller, and easier to
understand, because the code that deals with fault
tolerance, distribution and parallelization is hidden
within the MapReduce library. For example, the
size of one phase of the computation dropped from
approximately 3,800 lines of C++ code to approximately
700 lines when expressed using MapReduce.”
Google 2004 MapReduce paper
21
Confidential and proprietary. Copyright © 2008 Aster Data Systems
Word Count was Only an Example!
Published work from Stanford University showed
that even extremely complex Data Mining
algorithms can fit in this very simple model
“We adapt Google’s MapReduce paradigm to
demonstrate this parallel speed up technique on a
variety of learning algorithms including locally
weighted linear regression (LWLR), k-means,
logistic regression (LR), naive Bayes (NB), SVM,
ICA, PCA, gaussian discriminant analysis (GDA),
EM, and backpropagation (NN).”
Stanford 2006 AI Lab paper
22
Confidential and proprietary. Copyright © 2008 Aster Data Systems
Result?
MapReduce makes writing
parallel programs
extremely easy…
…and can accommodate
from trivial to very
complex algorithms…
…thus enabling the
processing of petabytes of
data with a few lines of
code!
23
Confidential and proprietary. Copyright © 2008 Aster Data Systems
But…
Today MapReduce is used only by hardcore
coders/programmers/hackers
Changes in MapReduce queries require changes in
the MapReduce code itself
• Constantly keep coding
Using MapReduce with database data is hard and
cumbersome…
…when most of the structured data in the
enterprise are stored in databases!
24
Confidential and proprietary. Copyright © 2008 Aster Data Systems
Beyond SQL and MapReduce
25
Confidential and proprietary. Copyright © 2008 Aster Data Systems
SQL vs MapReduce: Two different worlds?
SQL
26
MapReduce
Declarative
Procedural
• Specifies what needs to
happen
• Specifies how it needs to
happen
Execution plans optimized
Code compiled once;
dynamically
MapReduce plans are static
Input/output is
Input/output is
structured
unstructured
Data redistribution inferred
Data redistribution based
from SQL statement (in
on <keys> in Reduce()
MPP Databases)
phase
Confidential and proprietary. Copyright © 2008 Aster Data Systems
Implementing MR in the Database
Uses Polymorphic SQL operators to embed MapReduce
functions to SQL
Introduces a “PARTITION BY” clause to specify data
redistribution
Introduces a “SEQUENCE BY” clause to specify ordering
of data flows to the MR functions
Best of both worlds
• Planning is still dynamic
• MapReduce functions can be used like custom SQL operators
• MapReduce functions can implement any algorithm or transformation
• Code Once – Use Many (through SQL) model
27
Confidential and proprietary. Copyright © 2008 Aster Data Systems
The SQL/MR Process
28
Write a SQL/MR
function using
Perl, Java,
Python…
Upload the script
or .jar file to the
database
(one-time)
(one-time)
Confidential and proprietary. Copyright © 2008 Aster Data Systems
Invoke SQL/MR
function through
SQL
(infinite times)
SQL/MR Function: Syntax
SELECT…
FROM
(5) Select output (eg. count)
MR_Function (
ON source_data
[ PARTITION BY column ]
[ ORDER BY column ]
[Function Arguments]
)
WHERE …
GROUP BY …
HAVING …
ORDER BY …
LIMIT …;
29
Optional conditions & filters
Confidential and proprietary. Copyright © 2008 Aster Data Systems
(4) Java/Python/… MR function
(1) Source table or sub-select
(2) <key> for data redistribution
(3) Sort before the MR function
Optional MR_Function Arguments
Example 1: Tokenization
Demo #1: Only Map (Tokenization) in SQL/MR
SELECT word, count(*) AS wordcount
FROM Tokenize( ON blogs )
GROUP BY word
ORDER BY wordcount DESC
LIMIT 20;
Demo #2: Map (Tokenization) and Reduce (WordCount) in
SQL/MR
SELECT key AS word, value AS wordcount
FROM WordCountReduce (
ON Tokenize ( ON blogs )
PARTITION BY key
)
ORDER BY wordcount DESC
LIMIT 20;
Demo #3: Why do Reduce when you have SQL?
SELECT word, count(*) AS wordcount
FROM Tokenize( ON blogs )
GROUP BY word
ORDER BY wordcount DESC
LIMIT 20;
30
Confidential and proprietary. Copyright © 2008 Aster Data Systems
Example 2: Sessionization
What Is Sessionize?
An example Aster SQL/MR function
Leverages Aster’s Java library API
What Does It Do?
User specified a column (eg. timestamp)
and a session timeout value (in seconds)
Spits out unique session identifiers
(sessionid column)
Usage
CREATE TABLE sessionized_clicks AS
SELECT ts, userid, sessionid, ...
FROM Sessionize(
ON clicks
PARTITION BY userid
ORDER BY ts
TIMEOUT 60
);
31
Confidential and proprietary. Copyright © 2008 Aster Data Systems
Example 2: Sessionization
Clickstream
00:58:24 PrezBus
h
10:00:24 Shawn1
02:30:33 PrezBus
h
10:01:23 Shawn1
SQL/MR
timesta userid
mp
10:00:00 Shawn1
timesta userid
mp
10:00:00 Shawn1
sessioni
d
0
10:00:24 Shawn1
0
10:01:23 Shawn1
0
10:02:40 Shawn1
1
Session Timeout = 60 seconds
timesta userid
sessioni
mp
d
00:58:24 PrezBush 0
10:02:40 Shawn1
02:30:33 PrezBush 1
INPUT
32
Confidential and proprietary. Copyright © 2008 Aster Data Systems
OUTPUT
Slide 32
MR Applications in the Database
ELT
Text and data transformations, in-parallel, in-database
Queries that become too complex for SQL
E.g. Sessionize(), customer segmentation, predictive analytics, …
Queries that SQL inherently cannot handle well
Time series analytics
Aster has a set of pre-defined SQL/MR functions for this
Data structures that do not fit well the relational model
Time series (again)
Graphs, spatial data
Any analytical or reporting application that requires more performance
and data proximity!
33
Confidential and proprietary. Copyright © 2008 Aster Data Systems
Summary
Growing challenges in scaling analytical
applications and reporting
MapReduce is driving a data revolution (see:
Google)
In-Database MapReduce will open up databases
to a host of new applications
[email protected]
(Questions, Comments)
asterdata.com/blog
(Lots of technical details)
1.888.Aster.Data
(Any other information)
34
Confidential and proprietary. Copyright © 2008 Aster Data Systems