Hive - DWH Community
Oracle Data Warehouse
Opening New Horizons for the Data Warehouse with Big Data
Alfred Schlaucher, Detlef Schroeder
Topics
• Big Data: buzzword, or a new dimension and new possibilities?
• Oracle's technology for storing unstructured and semi-structured mass data
• The Cloudera framework
• "Connectors" into the new world: Oracle Loader for Hadoop and HDFS
• Big Data Appliance
• Discovering new analysis horizons with Oracle R Enterprise
• Big Data analyses with Endeca
Hive
• Hive is an abstraction on top of MapReduce
• Allows users to query data in the Hadoop cluster without knowing Java or MapReduce
• Uses the HiveQL language
• Very similar to SQL
• The Hive Interpreter runs on a client machine
• Turns HiveQL queries into MapReduce jobs
• Submits those jobs to the cluster
• Note: this does not turn the cluster into a relational database server!
• It is still simply running MapReduce jobs
• Those jobs are created by the Hive Interpreter
Hive (cont’d)
• Sample Hive query:
SELECT stock.product, SUM(orders.purchases)
FROM stock INNER JOIN orders
ON (stock.id = orders.stock_id)
WHERE orders.quarter = 'Q1'
GROUP BY stock.product;
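• Before such a query can run, tables like stock must be mapped onto files in HDFS. A minimal sketch of such a definition (the delimiter, path, and column types here are assumptions for illustration, not from the original deck):
-- Hypothetical DDL: exposes tab-delimited files in HDFS as the 'stock' table
CREATE EXTERNAL TABLE stock (id INT, product STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/fred/stock';
• The Hive Interpreter then compiles queries against such tables into MapReduce jobs over the underlying files.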
Pig
• Pig is an alternative abstraction on top of MapReduce
• Uses a dataflow scripting language
• Called Pig Latin
• The Pig interpreter runs on the client machine
• Takes the Pig Latin script and turns it into a series of MapReduce jobs
• Submits those jobs to the cluster
• As with Hive, nothing ‘magical’ happens on the cluster
• It is still simply running MapReduce jobs
Pig (cont’d)
• Sample Pig script:
stock  = LOAD '/user/fred/stock' AS (id, item);
orders = LOAD '/user/fred/orders' AS (id, cost);
grpd   = GROUP orders BY id;
totals = FOREACH grpd GENERATE group, SUM(orders.cost) AS t;
result = JOIN stock BY id, totals BY group;
DUMP result;
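• Note how this mirrors the earlier Hive query: per-id order totals are computed and joined back to the stock table, but as an explicit step-by-step dataflow rather than a single declarative statement.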
Flume and Sqoop
• Flume provides a method to import data into HDFS as it is generated
• Rather than batch-processing the data later
• For example, log files from a Web server
• Sqoop provides a method to import data from tables in a relational database into HDFS or Hive
• Does this very efficiently via a Map-only MapReduce job
• Can also 'go the other way'
• Populate database tables from files in HDFS
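• A minimal Sqoop invocation might look like this (the JDBC URL, credentials, and paths are assumptions for illustration):
sqoop import \
  --connect jdbc:mysql://dbserver/sales \
  --username fred \
  --table orders \
  --target-dir /user/fred/orders
• Adding --hive-import loads the rows into a Hive table instead; sqoop export with --export-dir pushes files from HDFS back into a database table.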
Oozie
• Oozie allows developers to create a workflow of MapReduce jobs
• Including dependencies between jobs
• The Oozie server submits the jobs to the cluster in the correct sequence
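• A workflow is described in an XML file read by the Oozie server; a minimal sketch of its structure (the workflow and node names are assumptions for illustration):
<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1">
  <start to="first-job"/>
  <action name="first-job">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <!-- mapper/reducer classes and I/O paths are configured here -->
    </map-reduce>
    <ok to="end"/>      <!-- on success, continue -->
    <error to="fail"/>  <!-- on failure, branch to the kill node -->
  </action>
  <kill name="fail">
    <message>MapReduce job failed</message>
  </kill>
  <end name="end"/>
</workflow-app>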
HBase
• HBase is ‘the Hadoop database’
• A ‘NoSQL’ datastore
• Can store massive amounts of data
• Gigabytes, terabytes, and even petabytes of data in a table
• Scales to provide very high write throughput
• Hundreds of thousands of inserts per second
• Copes well with sparse data
• Tables can have many thousands of columns
• Even if most columns are empty for any given row
• Has a very constrained access model
• Insert a row, retrieve a row, do a full or partial table scan
• Only one column (the ‘row key’) is indexed
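• The constrained access model is easy to see in the HBase shell (table and column names here are assumptions for illustration):
create 'orders', 'cf'                      # create a table with one column family
put 'orders', 'row1', 'cf:cost', '29.99'   # insert a cell, keyed by row
get 'orders', 'row1'                       # retrieve a single row by key
scan 'orders'                              # full (or partial, with a key range) table scan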
HBase vs Traditional RDBMSs
                              RDBMS                          HBase
Data layout                   Row-oriented                   Column-oriented
Transactions                  Yes                            Single row only
Query language                SQL                            get/put/scan
Security                      Authentication/Authorization   TBD
Indexes                       On arbitrary columns           Row-key only
Max data size                 TBs                            PB+
Read/write throughput limits  1000s of queries/second        Millions of queries/second
Contact and more information
Become a member of the Oracle Data Warehouse Community
Many free seminars and events
Download server:
www.ORACLEdwh.de
Next German-language Oracle DWH conference:
19-20 March 2013, Kassel