Transcript Document

A Comparison of Approaches to Large-Scale Data Analysis
Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. Dewitt,
Samuel Madden, Michael Stonebraker
SIGMOD 2009
Intelligent Database Systems Lab
School of Computer Science & Engineering
Seoul National University, Seoul, Korea
2009-10-09
Summarized by Jaeseok Myung
MapReduce vs. Parallel DBMS
Center for E-Business Technology
Copyright  2009 by CEBT
MapReduce
한재선, SearchDay2008, http://nexr.tistory.com
Center for E-Business Technology
Copyright  2009 by CEBT
Architectural Differences
Parallel DBMS
MapReduce
Schema Support
O
X
Indexing
O
X
Programming Model
Stating what you want
(SQL)
Presenting an algorithm
(C/C++, Java, …)
Optimization
O
X
Flexibility
Good
Fault Tolerance
Good
Center for E-Business Technology
Copyright  2009 by CEBT
Benchmark Environment (1/2)
 Systems

Hadoop: The most popular open-source MR implementation

DBMS-X: a parallel DBMS that stores data in a row-based format

Vertica: a column-based parallel DBMS
 All Three systems were deployed on a 100-node cluster
 Analytical Tasks

Data Loading

Selection Task

Aggregation Task

Join Task

UDF Aggregation Task
Center for E-Business Technology
Copyright  2009 by CEBT
Benchmark Environment (2/2)
 Dataset

Documents : 600,000 unique documents for each node

155 million UserVisits records (20GB/node)

18 million Rankings records (1GB/node)
Center for E-Business Technology
Copyright  2009 by CEBT
1. Data Loading
Reorganization
loading time
Center for E-Business Technology
Copyright  2009 by CEBT
2. Selection Task
 The selection task is a lightweight filter to find the pageURLs in
the Rankings table(1GB/node) with a pageRank above a userdefined threshold
 Query

SELECT pageURL, pageRank
FROM Rankings WHERE pageRank > x;

x = 10, which yields approximately 36,000 records per data file on
each node
 For MR, implementing the same task with Java language
Center for E-Business Technology
Copyright  2009 by CEBT
2. Selection Task - Result
time for
combining the
output into a
single file
(Additional MR)
Processing time
Center for E-Business Technology
Copyright  2009 by CEBT
3. Aggregation Task
 The aggregation task is calculating the total adRevenue
generated for each sourceIP in the UserVisits(20GB/node),
grouped by the sourceIP column
 Query

SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY
sourceIP;

This task always produces 2.5 million records
Center for E-Business Technology
Copyright  2009 by CEBT
3. Aggregation Task - Result
Center for E-Business Technology
Copyright  2009 by CEBT
4. Join Task


The join task consists of two sub-tasks that perform a complex
calculation on two data sets

In the first part of the task, each system must find the sourceIP that
generated the most revenue within a particular date range

Once these intermediate records are generated, the system must then
calculate the average pageRank of all the pages visited during this interval
Query

SELECT INTO Temp sourceIP, AVG(pageRank) as avgPageRank,
SUM(adRevenue) as totalRevenue
FROM Rankings AS R, UserVisits AS UV
WHERE R.pageURL = UV.destURL AND UV.visitDate
BETWEEN Date(‘2000-01-15’) AND Date(‘2000-01-22’)
GROUP BY UV.sourceIP;

SELECT sourceIP, totalRevenue, avgPageRank FROM Temp ORDER BY
totalRevenue DESC LIMIT 1;
Center for E-Business Technology
Copyright  2009 by CEBT
4. Join Task - Result
Center for E-Business Technology
Copyright  2009 by CEBT
5. UDF Aggregation Task
 The final task is to compute the inlink count for each document
in the dataset
 Query


SELECT INTO Temp F(contents) FROM Document;
–
F : a user-defined function that parses the contents of each record in
the Documents table and emits URLs into the database
–
With this function F, we populate a temporary table with a list of URLs
and then can execute a simple query to calculate the inlink count
SELECT url, SUM(value) FROM Temp GROUP BY url;
Center for E-Business Technology
Copyright  2009 by CEBT
5. UDF Aggregation Task - Result
Center for E-Business Technology
Copyright  2009 by CEBT
Conclusion
MapReduce < Parallel DBMS
Center for E-Business Technology
Copyright  2009 by CEBT
HadoopDB: An Architectural Hybrid of MapReduce and DBMS
Technologies for Analytical Workloads
Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi,
Avi Silberschatz, Alexander Rasin
VLDB 2009
Intelligent Database Systems Lab
School of Computer Science & Engineering
Seoul National University, Seoul, Korea
2009-10-09
Summarized by Jaeseok Myung
HadoopDB
 The Basic Idea (An Architectural Hybrid of MR & DBMS)

To use MR as the communication layer above multiple nodes
running single-node DBMS instances
 Queries are expressed in SQL, translated into MR by extending
existing tools, and as much work as possible is pushed into the
higher performing single node databases
Center for E-Business Technology
Copyright  2009 by CEBT
The Architecture of HadoopDB
Center for E-Business Technology
Copyright  2009 by CEBT
HadoopDB – Join Task
Center for E-Business Technology
Copyright  2009 by CEBT