
MapReduce
With a SQL-MapReduce focus
by
Curt A. Monash, Ph.D.
President, Monash Research
Editor, DBMS2
contact@monash.com
http://www.monash.com
http://www.DBMS2.com
Curt Monash

Analyst since 1981
  Covered DBMS since the pre-relational days
  Also analytics, search, etc.
Publicly available research
  Blogs, including DBMS2 (http://www.DBMS2.com)
  Feed at http://www.monash.com/blogs.html
User and vendor consulting
Agenda

Introduction and truisms
MapReduce overview
MapReduce specifics
SQL and MapReduce together
Monash’s First Law of Commercial Semantics

Bad jargon drives out good
  For example: “Relational”, “Parallel”, “MapReduce”
Where to measure database technology

Language interpretation and execution capabilities
Administrative capabilities
How well it all works
  Functionality
  Speed
  Fit and finish
  Reliability
How much it all – really – costs
  You can do anything in 0s and 1s … but how much effort will it actually take?
What’s hard about parallelization*

Getting the right data …
… to the right nodes …
… at the right time …
… while dealing with errors …
… and without overloading the network
Otherwise, programming a grid is a lot like programming a single node.

*in general -- not just for “database” technology
MPP DBMS are good at parallelization …
… under three assumptions, namely:

You can express the job nicely in SQL …
  … or whatever other automatically-parallel languages the DBMS offers
You don’t really need query fault-tolerance …
  … which is usually the case unless you have 1000s of nodes
There’s enough benefit to storing the data in tables to justify the overhead
SQL commonly gets frustrating …
… when you’re dealing with sequences of events or relationships, because:

Self-joins are expensive
Programming is hard when you’re not sure how long the sequence is

For example:
  Clickstreams
  Financial data time series
  Social network graph analysis
The pure MapReduce alternative

Lightweight approach to parallelization
  The only absolute requirement is a certain simple programming model …
  … so simple that parallelization is “automatic” …
  … and very friendly to procedural languages
It doesn’t require a DBMS on the back end
  No SQL required!
  Non-DBMS implementations commonly have query fault-tolerance
But you have to take care of optimizing data redistribution yourself
MapReduce evolution

Used under-the-covers for quite a while
Named and popularized by Google
Open-sourced in Hadoop
Widely adopted by big web companies
Integrated (at various levels) into MPP RDBMS
Adopted for social network analysis
Explored/investigated for data mining applications
???
M/R use cases -- large-scale ETL

Text indexing
  This is how Google introduced the MapReduce concept
Time series disaggregation
  Clickstream sessionization and analytics
  Stock trade pattern identification
Relationship graph traversal
M/R use cases – hardcore arithmetic

Statistical routines
Data “cooking”
The essence of MapReduce

“Map” steps
Data redistribution
“Reduce” steps
In strict alternation …
… or not-so-strict
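
To make that cycle concrete, here is a minimal single-process sketch in Python. It is illustrative only: the run_mapreduce name is an assumption of this sketch, and the in-memory sort stands in for the network shuffle that a real engine such as Hadoop performs across nodes.

    from itertools import groupby
    from operator import itemgetter

    def run_mapreduce(records, map_fn, reduce_fn):
        # "Map" step: each input record yields zero or more <key, value> pairs
        mapped = [pair for record in records for pair in map_fn(record)]
        # Data redistribution: bring all pairs that share a key together
        # (a real engine shuffles them across the network instead)
        mapped.sort(key=itemgetter(0))
        # "Reduce" step: each key's group is processed independently
        return [result
                for key, group in groupby(mapped, key=itemgetter(0))
                for result in reduce_fn(key, [value for _, value in group])]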
“Map” step basics (reality)

Input = anything
  Set of data
  Output of previous Reduce step
Output = anything
  There’s an obvious key
“Map” step basics (formality)

Input = {<key, value> pairs}
Output = {<key, value> pairs}
Input and output key types don’t have to be the same
“Embarrassingly parallel” based on key
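
Expressed as hypothetical Python type hints, that formal contract looks roughly like this (the names are assumptions of this sketch):

    from typing import Iterable, Tuple, TypeVar

    InKey, InValue = TypeVar("InKey"), TypeVar("InValue")
    OutKey, OutValue = TypeVar("OutKey"), TypeVar("OutValue")

    # A Map step turns <key, value> pairs into <key, value> pairs; the output
    # key type may differ from the input key type, and because each pair is
    # handled independently, the step parallelizes "embarrassingly" by key.
    def map_step(key: InKey, value: InValue) -> Iterable[Tuple[OutKey, OutValue]]:
        raise NotImplementedError  # contract only; concrete examples follow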
Map step examples

Word count
  Input format = document/text string
  Output format = <WordName, 1>
Text indexing
  Input format = document/text string
  Output format = <WordName, (DocumentID, Offset)>
Log parsing
  Input format = log file
  Output format = <Key, formatted event>
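
As a minimal sketch, the word-count Map step above might look like this in Python (whitespace tokenization and lowercasing are simplifying assumptions):

    def wordcount_map(document):
        # Input format = document/text string
        # Output format = <WordName, 1> for each word occurrence
        for word in document.lower().split():
            yield (word, 1)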
Reduce step basics

Input = {<key, value> pairs}, where all the keys are equal
Output = {<key, value> pairs}, where the set commonly has cardinality = 1
Input and output key types don’t have to be the same
Just like Map, “embarrassingly parallel” based on key
Reduce step examples

Word count
  Input format = <WordName, 1>
  Output format = <WordName, count>
Text indexing
  Input format = <WordName, (DocumentID, Offset)>
  Output format = <WordName, index file>
Log parsing
  E.g., input format = <UserID or EventID, event record>
  E.g., output format = <Same, reformatted event record>

More honoured in the breach than in the observance!
Sometimes the Reduce step is trivial
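
A matching sketch of the word-count Reduce step, plus a toy run through the run_mapreduce driver sketched earlier (the data is illustrative):

    def wordcount_reduce(word, counts):
        # Input format = the <WordName, 1> values for one word (all keys equal)
        # Output format = a single <WordName, count> pair (cardinality = 1)
        yield (word, sum(counts))

    docs = ["the quick brown fox", "the lazy dog"]
    print(run_mapreduce(docs, wordcount_map, wordcount_reduce))
    # [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]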
MapReduce for data mining

Partition on some key
Calculate a single vector* for each whole partition
Aggregate the vectors
Hooray!

*Algorithm-dependent
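
A minimal sketch of that recipe, taking a global mean as the statistic so the per-partition “vector” is just (sum, count); the key, the statistic, and all names here are illustrative assumptions:

    def stats_map(record):
        # Partition on some key: route each value to its partition
        key, value = record          # record assumed to be a (key, value) pair
        yield (key, (value, 1))

    def stats_reduce(key, vectors):
        # Calculate a single vector -- here (sum, count) -- per whole partition
        yield (key, (sum(s for s, _ in vectors), sum(c for _, c in vectors)))

    def aggregate(partition_vectors):
        # Aggregate the per-partition vectors into one global statistic
        total = sum(s for _, (s, _) in partition_vectors)
        count = sum(c for _, (_, c) in partition_vectors)
        return total / count

Fed through the run_mapreduce sketch above, the Reduce output is the set of per-partition vectors, and aggregate combines them in a final serial step.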
Sometimes Reduce doesn’t reduce

Tick stream data “cooking” can increase its size by one to two orders of magnitude
Sessionization might just add a column – SessionID – to records
  Or is that a Map step masquerading as a Reduce?
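
A hedged sketch of the sessionization case: a Reduce step that emits exactly as many records as it receives, only adding a SessionID. The 30-minute inactivity threshold and the names are illustrative assumptions:

    SESSION_GAP = 30 * 60  # assumed inactivity threshold, in seconds

    def sessionize_reduce(user_id, events):
        # Input: one user's events as (timestamp_in_seconds, event) pairs
        # Output: the same events with a SessionID added -- nothing shrinks
        session_id, last_ts = 0, None
        for ts, event in sorted(events, key=lambda pair: pair[0]):
            if last_ts is not None and ts - last_ts > SESSION_GAP:
                session_id += 1
            last_ts = ts
            yield (user_id, (ts, session_id, event))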
Some reasons to integrate SQL and MapReduce

JOINs were invented for a reason
So was SQL:2003
It’s kind of traditional to keep data in an RDBMS
Some ways to integrate SQL and MapReduce

A SQL layer built on a MapReduce engine
  E.g., Facebook’s Hive over Hadoop
  But building a DBMS-equivalent is hard
MapReduce invoking SQL
SQL invoking MapReduce
  E.g., Aster’s SQL-MapReduce
To materialize or not to materialize?

DBMS avoidance of intermediate materialization → much better performance
Classic MapReduce intermediate materialization → query fault-tolerance
How much does query fault-tolerance matter?

Compare:
  (Query duration) x (Node count)
  vs.
  Node MTTF

DBMS-style materialization strategies usually win
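
To make that comparison concrete with hypothetical numbers: at 100 nodes, 1-hour queries, and a node MTTF of 3 years (roughly 26,000 hours), a query consumes 100 node-hours against an MTTF of 26,000 hours, so only about one query in 260 sees a failure, and restarting the rare victim is cheaper than materializing every intermediate result. With 10,000 nodes and 10-hour queries the expected failures per query approach 4, and query fault-tolerance becomes essential.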
Other reasons to put your data in a real database

Query response time
General performance
Backup
Security
General administration
SQL syntax
General programmability and connectivity
Aspects of Aster’s approach to MapReduce

Data stored in a database
MapReduce execution managed by a DBMS
Flexible MapReduce syntax
MapReduce invoked via SQL
Further information

Curt A. Monash, Ph.D.
President, Monash Research
Editor, DBMS2
contact@monash.com
http://www.monash.com
http://www.DBMS2.com