Transcript HadoopDB
HadoopDB
Inneke Ponet
Introduction
Technologies for data analysis
HadoopDB
Desired properties
Layers of HadoopDB
HadoopDB Components
Introduction
More and more data needs to be stored and
processed.
People want to do more and more complex
calculations on their collected data.
Analytical databases on high-end machines are
moving towards cheaper lower-end machines.
The analytical database market is 27% of the
database software market and is growing at a rate
of 10,3% annually.
Technologies for data analysis
Parallel databases:
good performance,
good efficiency.
MapReduce-based systems:
superior scalability,
good fault tolerance,
good flexibility to handle unstructered data.
Parallel databases
Support for standard relational tables and SQL.
Implements techniques for a better performance:
Indexing, compression, materialized views, result
caching, I/O sharing.
Data is partitioned (shared-nothing architecture)
transparent to the end-user.
Shared-nothing architecture
The DBMS of the most analytical databases are
deployed on a shared-noting architecture:
A collection of machines that
are independent,
are possible virtual,
have their own local disk and local main memory,
are connected by a high-speed network.
Scalability of machines.
Analysis tasks are easy to parallellize.
MapReduce
A technology from Google:
processes (un)structured data that is distributed on many
nodes in a shared-nothing cluster;
works at enormous scale.
Map and Reduce:
parallel without communicating;
Map-repartition-Reduce cycles.
MapReduce: advantages
No detailed query execution plan in advance
at runtime:
adjust to node failures and slow nodes
(re)assigning tasks to faster nodes.
Checkpoints the output to local disk
minimizing of the work in case of a failure.
HadoopDB
Hybrid database:
a combination of:
traditional DBMS,
MapReduce-technology.
Developed by Yale University students: Azza
Abouzeid and Kamil BajDa-Pawlikowski
It is free and open source.
Desired properties
A. Performance
B. Fault tolerance
C. Heterogeneous environment
D. Flexible query interface
E. Scalability
A. Performance
Primary characteristic to distinguish.
MapReduce: first modeling and loading data
before processing
slower performance than parallel databases.
Cost saving: faster software product cheaper
than a hardware upgrade or buying additional
hardware.
Parallel databases
MapReduce
B. Fault tolerance
Succesfully commit transactions.
Make progress on a workload.
Heterogeneity and scalibility more faults
BUT MapReduce good fault tolerance:
reassigning tasks;
sub-tasks minimize the effect of faults.
Parallel databases: assumption failures are rare
more testing => slower performance.
Parallel databases
MapReduce
C. Heterogeneous environment
Nodes don’t always run on
identical hardware,
an identical virtual machine.
Different performance.
Parallel databases: not tested on more than 100
nodes.
Parallel databases
MapReduce
D. Flexible query interface
Easy to make queries:
SQL and non-SQL interface languages,
Use of tools.
Robust mechanisme for writing UDFs.
Parallel databases: SQL, ODBC and UDFs.
MapReduce-based systems: it is possible (Hive),
but not always (Hadoop).
Parallel databases
/ MapReduce
E. Scalability
Traditional DBMS:
only scalable to 100 nodes.
Reasons
Assumption
Failures
Failures are rare.
Hetrogeneity
Homogeneous array of machines.
Not tested
There are no applications with more
than a few dozen nodes.
MapReduce-based systems:
designed to scale to thousands of nodes in a sharednothing architecture.
Parallel databases
MapReduce
Desired properties
Parallel databases
MapReduce
Performance
Fault tolerance
Heterogeneous
environment
Flexible query interface
/
Scalability
Layers of HadoopDB
Communication: Hadoop
Database: PostgreSQL
Translation: Hive
Hadoop
Communication layer of HadoopDB.
Hadoop framework two layers:
Hadoop Distributed File System (HDFS),
MapReduce framework.
Cost: free/open source MapReduce.
PostgreSQL
Relational DBMS.
(Possible) database layer of HadoopDB.
Cost: free/open source.
Hive
Translation layer.
Processing of a SQL query:
Query Abstract Syntax Tree.
MetaStore: schema of the table(s).
Logical query plan: DAG of relational operators.
Optimized plan.
Physical executable plan: MapReduce job(s).
XML plan: DAG serialized.
Hive Driver executes a Hadoop job.
HadoopDB components
Database Connector:
Interface between independent database systems;
Extends the InputFormat class (of Hadoop);
Connect to any JDBC-compliant database.
Catalog:
Meta-information about the databases:
connection parameters,
metadata.
XML file in HDFS accessed by:
Master node,
Worker/Slave nodes.
HadoopDB Components (2)
Data loader:
Global hasher:
Custom MapReduce job files in HDFS;
Repartioning data upon loading.
Local hasher:
Copies partition from HDFS to local file system;
Partitions the file in smaller sized chunks.
HadoopDB Components (3)
SQL to MapReduce:
Parallel database front-end to process SQL queries.
HiveQL
↓ Transform
MapReduce jobs:
Connect to tables stored in HDFS;
Consists of DAGs of relational operators that operate as iterators.
Assumption no collection of tables:
Operations on multiple tables Reduce function.
NOT in HadoopDB: a join operation can be pushed to the
databse layer.
HadoopDB Components (4)
SQL/SMS planner:
Modifies Hive:
Updates the MetaStore
Two passes over the physical plan:
1.
2.
Determine the partition keys for the Reduce Sink Operators.
Operators are:
converted in SQL querie(s);
pushed into the database layer.
Only filter, select and aggregation operators.
HadoopDB Components (5)
Questions?