Transcript Document
A Comparison of Approaches to Large-Scale Data Analysis
Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. Dewitt,
Samuel Madden, Michael Stonebraker
SIGMOD 2009
Intelligent Database Systems Lab
School of Computer Science & Engineering
Seoul National University, Seoul, Korea
2009-10-09
Summarized by Jaeseok Myung
MapReduce vs. Parallel DBMS
Center for E-Business Technology
Copyright 2009 by CEBT
MapReduce
한재선, SearchDay2008, http://nexr.tistory.com
Center for E-Business Technology
Copyright 2009 by CEBT
Architectural Differences
Parallel DBMS
MapReduce
Schema Support
O
X
Indexing
O
X
Programming Model
Stating what you want
(SQL)
Presenting an algorithm
(C/C++, Java, …)
Optimization
O
X
Flexibility
Good
Fault Tolerance
Good
Center for E-Business Technology
Copyright 2009 by CEBT
Benchmark Environment (1/2)
Systems
Hadoop: The most popular open-source MR implementation
DBMS-X: a parallel DBMS that stores data in a row-based format
Vertica: a column-based parallel DBMS
All Three systems were deployed on a 100-node cluster
Analytical Tasks
Data Loading
Selection Task
Aggregation Task
Join Task
UDF Aggregation Task
Center for E-Business Technology
Copyright 2009 by CEBT
Benchmark Environment (2/2)
Dataset
Documents : 600,000 unique documents for each node
155 million UserVisits records (20GB/node)
18 million Rankings records (1GB/node)
Center for E-Business Technology
Copyright 2009 by CEBT
1. Data Loading
Reorganization
loading time
Center for E-Business Technology
Copyright 2009 by CEBT
2. Selection Task
The selection task is a lightweight filter to find the pageURLs in
the Rankings table(1GB/node) with a pageRank above a userdefined threshold
Query
SELECT pageURL, pageRank
FROM Rankings WHERE pageRank > x;
x = 10, which yields approximately 36,000 records per data file on
each node
For MR, implementing the same task with Java language
Center for E-Business Technology
Copyright 2009 by CEBT
2. Selection Task - Result
time for
combining the
output into a
single file
(Additional MR)
Processing time
Center for E-Business Technology
Copyright 2009 by CEBT
3. Aggregation Task
The aggregation task is calculating the total adRevenue
generated for each sourceIP in the UserVisits(20GB/node),
grouped by the sourceIP column
Query
SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY
sourceIP;
This task always produces 2.5 million records
Center for E-Business Technology
Copyright 2009 by CEBT
3. Aggregation Task - Result
Center for E-Business Technology
Copyright 2009 by CEBT
4. Join Task
The join task consists of two sub-tasks that perform a complex
calculation on two data sets
In the first part of the task, each system must find the sourceIP that
generated the most revenue within a particular date range
Once these intermediate records are generated, the system must then
calculate the average pageRank of all the pages visited during this interval
Query
SELECT INTO Temp sourceIP, AVG(pageRank) as avgPageRank,
SUM(adRevenue) as totalRevenue
FROM Rankings AS R, UserVisits AS UV
WHERE R.pageURL = UV.destURL AND UV.visitDate
BETWEEN Date(‘2000-01-15’) AND Date(‘2000-01-22’)
GROUP BY UV.sourceIP;
SELECT sourceIP, totalRevenue, avgPageRank FROM Temp ORDER BY
totalRevenue DESC LIMIT 1;
Center for E-Business Technology
Copyright 2009 by CEBT
4. Join Task - Result
Center for E-Business Technology
Copyright 2009 by CEBT
5. UDF Aggregation Task
The final task is to compute the inlink count for each document
in the dataset
Query
SELECT INTO Temp F(contents) FROM Document;
–
F : a user-defined function that parses the contents of each record in
the Documents table and emits URLs into the database
–
With this function F, we populate a temporary table with a list of URLs
and then can execute a simple query to calculate the inlink count
SELECT url, SUM(value) FROM Temp GROUP BY url;
Center for E-Business Technology
Copyright 2009 by CEBT
5. UDF Aggregation Task - Result
Center for E-Business Technology
Copyright 2009 by CEBT
Conclusion
MapReduce < Parallel DBMS
Center for E-Business Technology
Copyright 2009 by CEBT
HadoopDB: An Architectural Hybrid of MapReduce and DBMS
Technologies for Analytical Workloads
Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi,
Avi Silberschatz, Alexander Rasin
VLDB 2009
Intelligent Database Systems Lab
School of Computer Science & Engineering
Seoul National University, Seoul, Korea
2009-10-09
Summarized by Jaeseok Myung
HadoopDB
The Basic Idea (An Architectural Hybrid of MR & DBMS)
To use MR as the communication layer above multiple nodes
running single-node DBMS instances
Queries are expressed in SQL, translated into MR by extending
existing tools, and as much work as possible is pushed into the
higher performing single node databases
Center for E-Business Technology
Copyright 2009 by CEBT
The Architecture of HadoopDB
Center for E-Business Technology
Copyright 2009 by CEBT
HadoopDB – Join Task
Center for E-Business Technology
Copyright 2009 by CEBT