Transcript HadoopDB

HadoopDB
Inneke Ponet
 Introduction
 Technologies for data analysis
 HadoopDB
 Desired properties
 Layers of HadoopDB
 HadoopDB Components
Introduction
 More and more data needs to be stored and
processed.
 People want to do more and more complex
calculations on their collected data.
 Analytical databases on high-end machines are
moving towards cheaper lower-end machines.
 The analytical database market is 27% of the
database software market and is growing at a rate
of 10,3% annually.
Technologies for data analysis
Parallel databases:
 good performance,
 good efficiency.
MapReduce-based systems:
 superior scalability,
 good fault tolerance,
 good flexibility to handle unstructered data.
Parallel databases
 Support for standard relational tables and SQL.
 Implements techniques for a better performance:
 Indexing, compression, materialized views, result
caching, I/O sharing.
 Data is partitioned (shared-nothing architecture)
 transparent to the end-user.
Shared-nothing architecture
The DBMS of the most analytical databases are
deployed on a shared-noting architecture:
 A collection of machines that




are independent,
are possible virtual,
have their own local disk and local main memory,
are connected by a high-speed network.
 Scalability of machines.
 Analysis tasks are easy to parallellize.
MapReduce
A technology from Google:
 processes (un)structured data that is distributed on many
nodes in a shared-nothing cluster;
 works at enormous scale.
Map and Reduce:
 parallel without communicating;
 Map-repartition-Reduce cycles.
MapReduce: advantages
No detailed query execution plan in advance
 at runtime:

adjust to node failures and slow nodes
 (re)assigning tasks to faster nodes.
Checkpoints the output to local disk

minimizing of the work in case of a failure.
HadoopDB
Hybrid database:

a combination of:
 traditional DBMS,
 MapReduce-technology.
Developed by Yale University students: Azza
Abouzeid and Kamil BajDa-Pawlikowski
 It is free and open source.
Desired properties
A. Performance
B. Fault tolerance
C. Heterogeneous environment
D. Flexible query interface
E. Scalability
A. Performance
 Primary characteristic to distinguish.
 MapReduce: first modeling and loading data
before processing
 slower performance than parallel databases.
  Cost saving: faster software product cheaper
than a hardware upgrade or buying additional
hardware.
 Parallel databases
 MapReduce
B. Fault tolerance
 Succesfully commit transactions.
 Make progress on a workload.
 Heterogeneity and scalibility  more faults
BUT MapReduce  good fault tolerance:
 reassigning tasks;
 sub-tasks minimize the effect of faults.
 Parallel databases: assumption failures are rare
more testing => slower performance.
 Parallel databases
 MapReduce
C. Heterogeneous environment
 Nodes don’t always run on
 identical hardware,
 an identical virtual machine.
 Different performance.
 Parallel databases: not tested on more than 100
nodes.
 Parallel databases
 MapReduce
D. Flexible query interface
 Easy to make queries:
 SQL and non-SQL interface languages,
 Use of tools.
 Robust mechanisme for writing UDFs.
 Parallel databases: SQL, ODBC and UDFs.
 MapReduce-based systems: it is possible (Hive),
but not always (Hadoop).
 Parallel databases
/ MapReduce
E. Scalability
Traditional DBMS:
 only scalable to 100 nodes.
Reasons
Assumption
Failures
Failures are rare.
Hetrogeneity
Homogeneous array of machines.
Not tested
There are no applications with more
than a few dozen nodes.
MapReduce-based systems:
 designed to scale to thousands of nodes in a sharednothing architecture.
 Parallel databases
 MapReduce
Desired properties
Parallel databases
MapReduce
Performance


Fault tolerance


Heterogeneous
environment


Flexible query interface

/
Scalability


Layers of HadoopDB
 Communication: Hadoop
 Database: PostgreSQL
 Translation: Hive
Hadoop
 Communication layer of HadoopDB.
 Hadoop framework  two layers:
 Hadoop Distributed File System (HDFS),
 MapReduce framework.
 Cost: free/open source  MapReduce.
PostgreSQL
 Relational DBMS.
 (Possible) database layer of HadoopDB.
 Cost: free/open source.
Hive
 Translation layer.
 Processing of a SQL query:
 Query  Abstract Syntax Tree.
 MetaStore: schema of the table(s).
 Logical query plan: DAG of relational operators.
 Optimized plan.
 Physical executable plan: MapReduce job(s).
 XML plan: DAG  serialized.
 Hive Driver executes a Hadoop job.
HadoopDB components
 Database Connector:
 Interface between independent database systems;
 Extends the InputFormat class (of Hadoop);
 Connect to any JDBC-compliant database.
 Catalog:
 Meta-information about the databases:
 connection parameters,
 metadata.
 XML file in HDFS  accessed by:
 Master node,
 Worker/Slave nodes.
HadoopDB Components (2)
 Data loader:
 Global hasher:
 Custom MapReduce job  files in HDFS;
 Repartioning data upon loading.
 Local hasher:
 Copies partition from HDFS to local file system;
 Partitions the file in smaller sized chunks.
HadoopDB Components (3)
 SQL to MapReduce:
 Parallel database front-end to process SQL queries.
 HiveQL
↓ Transform
MapReduce jobs:
 Connect to tables stored in HDFS;
 Consists of DAGs of relational operators that operate as iterators.
 Assumption  no collection of tables:
 Operations on multiple tables  Reduce function.
 NOT in HadoopDB: a join operation can be pushed to the
databse layer.
HadoopDB Components (4)
 SQL/SMS planner:
 Modifies Hive:
 Updates the MetaStore
 Two passes over the physical plan:
1.
2.
Determine the partition keys for the Reduce Sink Operators.
Operators are:
 converted in SQL querie(s);
 pushed into the database layer.
 Only filter, select and aggregation operators.
HadoopDB Components (5)
Questions?