Hive - DWH Community
Oracle Data Warehouse
Opening New Horizons for the Data Warehouse with Big Data
Alfred Schlaucher, Detlef Schroeder
Topics
• Big Data: buzzword, or a new dimension and new possibilities?
• Oracle's technology for storing unstructured and semi-structured mass data
• The Cloudera framework
• "Connectors" into the new world: Oracle Loader for Hadoop and HDFS
• Big Data Appliance
• Discovering new analysis horizons with Oracle R Enterprise
• Big Data analyses with Endeca
Hive
• Hive is an abstraction on top of MapReduce
• Allows users to query data in the Hadoop cluster without knowing Java or MapReduce
• Uses the HiveQL language
• Very similar to SQL
• The Hive Interpreter runs on a client machine
• Turns HiveQL queries into MapReduce jobs
• Submits those jobs to the cluster
• Note: this does not turn the cluster into a relational database server!
• It is still simply running MapReduce jobs
• Those jobs are created by the Hive Interpreter
Hive (cont’d)
• Sample Hive query:
SELECT stock.product, SUM(orders.purchases)
FROM stock INNER JOIN orders
ON (stock.id = orders.stock_id)
WHERE orders.quarter = 'Q1'
GROUP BY stock.product;
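• Before such a query can run, tables like stock must be mapped onto files in HDFS. A minimal sketch of such a definition (the delimiter, path, and column types here are assumptions for illustration, not from the original deck):
-- Hypothetical DDL: exposes tab-delimited files in HDFS as the 'stock' table
CREATE EXTERNAL TABLE stock (id INT, product STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/fred/stock';
• The Hive Interpreter then compiles queries against such tables into MapReduce jobs over the underlying files.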
Pig
• Pig is an alternative abstraction on top of MapReduce
• Uses a dataflow scripting language
• Called Pig Latin
• The Pig interpreter runs on the client machine
• Takes the Pig Latin script and turns it into a series of MapReduce jobs
• Submits those jobs to the cluster
• As with Hive, nothing ‘magical’ happens on the cluster
• It is still simply running MapReduce jobs
Pig (cont’d)
• Sample Pig script:
stock  = LOAD '/user/fred/stock' AS (id, item);
orders = LOAD '/user/fred/orders' AS (id, cost);
grpd   = GROUP orders BY id;
totals = FOREACH grpd GENERATE group, SUM(orders.cost) AS t;
result = JOIN stock BY id, totals BY group;
DUMP result;
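• Note how this mirrors the earlier Hive query: per-id order totals are computed and joined back to the stock table, but as an explicit step-by-step dataflow rather than a single declarative statement.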
Flume and Sqoop
• Flume provides a method to import data into HDFS as it is generated
• Rather than batch-processing the data later
• For example, log files from a Web server
• Sqoop provides a method to import data from tables in a relational database into HDFS or Hive
• Does this very efficiently via a Map-only MapReduce job
• Can also 'go the other way'
• Populate database tables from files in HDFS
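• A minimal Sqoop invocation might look like this (the JDBC URL, credentials, and paths are assumptions for illustration):
sqoop import \
  --connect jdbc:mysql://dbserver/sales \
  --username fred \
  --table orders \
  --target-dir /user/fred/orders
• Adding --hive-import loads the rows into a Hive table instead; sqoop export with --export-dir pushes files from HDFS back into a database table.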
Oozie
• Oozie allows developers to create a workflow of MapReduce jobs
• Including dependencies between jobs
• The Oozie server submits the jobs to the cluster in the correct sequence
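• A workflow is described in an XML file read by the Oozie server; a minimal sketch of its structure (the workflow and node names are assumptions for illustration):
<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1">
  <start to="first-job"/>
  <action name="first-job">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <!-- mapper/reducer classes and I/O paths are configured here -->
    </map-reduce>
    <ok to="end"/>      <!-- on success, continue -->
    <error to="fail"/>  <!-- on failure, branch to the kill node -->
  </action>
  <kill name="fail">
    <message>MapReduce job failed</message>
  </kill>
  <end name="end"/>
</workflow-app>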
HBase
• HBase is ‘the Hadoop database’
• A ‘NoSQL’ datastore
• Can store massive amounts of data
• Gigabytes, terabytes, and even petabytes of data in a table
• Scales to provide very high write throughput
• Hundreds of thousands of inserts per second
• Copes well with sparse data
• Tables can have many thousands of columns
• Even if most columns are empty for any given row
• Has a very constrained access model
• Insert a row, retrieve a row, do a full or partial table scan
• Only one column (the ‘row key’) is indexed
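• The constrained access model is easy to see in the HBase shell (table and column names here are assumptions for illustration):
create 'orders', 'cf'                      # create a table with one column family
put 'orders', 'row1', 'cf:cost', '29.99'   # insert a cell, keyed by row
get 'orders', 'row1'                       # retrieve a single row by key
scan 'orders'                              # full (or partial, with a key range) table scan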
HBase vs Traditional RDBMSs
                              RDBMS                          HBase
Data layout                   Row-oriented                   Column-oriented
Transactions                  Yes                            Single row only
Query language                SQL                            get/put/scan
Security                      Authentication/Authorization   TBD
Indexes                       On arbitrary columns           Row-key only
Max data size                 TBs                            PB+
Read/write throughput limits  1000s of queries/second        Millions of queries/second
Contact and more information
Become a member of the Oracle Data Warehouse Community
Many free seminars and events
Download server:
www.ORACLEdwh.de
Next German-language Oracle DWH conference:
19-20 March 2013, Kassel