HadoopDB project - Aalto University Wiki

HadoopDB project
An Architectural Hybrid of
MapReduce and DBMS
Technologies for Analytical
Workloads
Anssi Salohalla
Background
• The amount of data that must be stored for
analysis is exploding
• At the same time, analysis performance
cannot be compromised as the
data volume grows
• Efficient high-end proprietary machines
are expensive
Parallel databases
• Shared-nothing MPP architecture (a collection of
independent machines, each with local hard disk and
main memory, connected together on high-speed
network)
• Machines are cheaper, lower-end, commodity hardware
• Scales well only up to a point (tens of nodes)
• Good performance
• Poor fault tolerance
• Problems in heterogeneous environments (machines
must be equal in performance)
• Good support for a flexible query interface
MapReduce systems
• Cheap
• Scales well to thousands of nodes
• Good support for heterogeneous environments
• Good fault tolerance
• Performance issues compared to parallel DBs
• Generally no support for SQL (excluding e.g. Hive)
What is HadoopDB
• Recent study from Yale University, Database Research
Dept.
• Hybrid architecture combining parallel databases and
a MapReduce system
• The idea is to combine the best qualities of both
technologies
• Multiple single-node databases are connected using
Hadoop as the task coordinator and network
communication layer
• Queries are distributed across the nodes by the MapReduce
framework, but as much work as possible is done inside the
database nodes
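The division of work described above can be sketched in a few lines. This is only a toy illustration, not HadoopDB's actual code: in-memory SQLite databases stand in for the single-node databases, plain Python functions stand in for the Hadoop map and reduce tasks, and the `visits` table, its columns, and all function names are assumptions made for the example. The point it shows is that the aggregation runs as SQL inside each node, so only small partial results travel to the reduce step.

```python
import sqlite3
from collections import defaultdict


def make_node(rows):
    """Create one single-node database (in-memory SQLite is a stand-in)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (source_ip TEXT, ad_revenue REAL)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)", rows)
    return conn


def map_task(conn):
    """Map phase: push the GROUP BY into the node's database, emit partial sums."""
    return conn.execute(
        "SELECT source_ip, SUM(ad_revenue) FROM visits GROUP BY source_ip"
    ).fetchall()


def reduce_task(partials):
    """Reduce phase: merge the per-node partial sums into global totals."""
    totals = defaultdict(float)
    for key, value in partials:
        totals[key] += value
    return dict(totals)


def run_query(nodes):
    """Coordinator: collect map output from every node, then reduce."""
    partials = []
    for conn in nodes:  # a real coordinator would run these tasks in parallel
        partials.extend(map_task(conn))
    return reduce_task(partials)


# Two hypothetical nodes, each holding its own partition of the data.
nodes = [
    make_node([("10.0.0.1", 1.0), ("10.0.0.2", 2.0)]),
    make_node([("10.0.0.1", 3.0), ("10.0.0.3", 4.0)]),
]
totals = run_query(nodes)
```

Here each node returns at most one row per `source_ip` instead of shipping every raw record, which is the sense in which "as much work as possible" stays in the database layer.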
HadoopDB architecture
Reference: Azza Abouzeid, Kamil Bajda-Pawlikowski,
Daniel Abadi, Avi Silberschatz, Alexander Rasin. HadoopDB: An Architectural Hybrid of MapReduce and
DBMS Technologies for Analytical Workloads
Desired properties of HadoopDB
• Performance
• Fault tolerance
• Support for heterogeneous environments
• Flexible query interface
Study benchmark systems
• Hadoop system
• HadoopDB
• Vertica
• DBMS-X
Benchmark tasks
• Data loading
• Grep task
• Selection task
• Aggregation task
• Join task
• UDF Aggregation task
• Fault tolerance and heterogeneous environment
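To make the task names above more concrete, here is a rough sketch of what a selection, an aggregation, and a join task can look like as SQL. The slides do not give the benchmark queries or schemas, so the `rankings` and `user_visits` tables, their columns, the sample rows, and the rank threshold are all illustrative assumptions; the queries run against an in-memory SQLite database purely for demonstration.

```python
import sqlite3

# Toy schema and data; names and values are assumptions, not benchmark data.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE rankings (page_url TEXT, page_rank INTEGER);
    CREATE TABLE user_visits (source_ip TEXT, dest_url TEXT, ad_revenue REAL);
    INSERT INTO rankings VALUES ('a.example', 5), ('b.example', 20);
    INSERT INTO user_visits VALUES
        ('10.0.0.1', 'a.example', 1.5),
        ('10.0.0.1', 'b.example', 2.5),
        ('10.0.0.2', 'b.example', 4.0);
    """
)


def selection_task(min_rank):
    """Selection: filter rows with a simple predicate."""
    return conn.execute(
        "SELECT page_url, page_rank FROM rankings "
        "WHERE page_rank > ? ORDER BY page_url",
        (min_rank,),
    ).fetchall()


def aggregation_task():
    """Aggregation: total revenue grouped by source address."""
    return conn.execute(
        "SELECT source_ip, SUM(ad_revenue) FROM user_visits "
        "GROUP BY source_ip ORDER BY source_ip"
    ).fetchall()


def join_task():
    """Join: combine both tables, then aggregate per source address."""
    return conn.execute(
        "SELECT uv.source_ip, SUM(uv.ad_revenue), AVG(r.page_rank) "
        "FROM user_visits uv JOIN rankings r ON r.page_url = uv.dest_url "
        "GROUP BY uv.source_ip ORDER BY uv.source_ip"
    ).fetchall()


selected = selection_task(10)
aggregated = aggregation_task()
joined = join_task()
```

The grep and UDF aggregation tasks are left out because they depend on unstructured text and user-defined code rather than plain SQL.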
Results 1/2
Reference: Azza Abouzeid, Kamil Bajda-Pawlikowski,
Daniel Abadi, Avi Silberschatz, Alexander Rasin. HadoopDB: An Architectural Hybrid of MapReduce and
DBMS Technologies for Analytical Workloads
Results 2/2
Reference: Azza Abouzeid, Kamil Bajda-Pawlikowski,
Daniel Abadi, Avi Silberschatz, Alexander Rasin. HadoopDB: An Architectural Hybrid of MapReduce and
DBMS Technologies for Analytical Workloads
Conclusions
• HadoopDB is close in performance to
parallel databases
• HadoopDB is able to operate in a truly
heterogeneous environment and has the
fault tolerance of Hadoop
• Licensing costs are equal to Hadoop's
• Better performance is expected in the future
Further reading
• HadoopDB Project. Web page:
http://db.cs.yale.edu/hadoopdb/hadoopdb.html
• Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz,
Alexander Rasin. HadoopDB: An Architectural Hybrid of MapReduce and
DBMS Technologies for Analytical Workloads
• Hadoop Project. Hadoop Cluster Setup. Web page:
http://hadoop.apache.org/core/docs/current/cluster_setup.html
Questions?