Transcript HDFS/Hive

Hadoop Data Management
by Team – 5 ISQS6339
Shashank Mishra
Mrugank Dhone
Rohit Ramteke
Sonali Digwal
Vivek Shimbulal
@svivekbafna
Videos
• The Evolution of the Apache Hadoop Ecosystem | Cloudera. 8’11”
  • Published on Sep 6, 2013. Hadoop co-founder Doug Cutting explains how the Hadoop ecosystem has expanded and evolved into a much larger Big Data platform with Hadoop at its center.
  • http://www.youtube.com/watch?v=eo1PwSfCXTI
• A Hadoop Ecosystem Overview. 21’54”
  • Published on Jan 10, 2014. This is a technical overview explaining the Hadoop ecosystem. As a part of this presentation, we chose to focus on the HDFS, MapReduce, YARN, Hive, Pig and HBase software components.
  • http://www.youtube.com/watch?v=kRnh3WpcKXo
• Working in the Hadoop Ecosystem. 10’40”
  • Published on Sep 5, 2013. Mark Grover, a Software Engineer at Cloudera, talks about working in the Hadoop ecosystem.
  • http://www.youtube.com/watch?v=nbUsY9tj-pM
HDFS
https://www.youtube.com/watch?v=1_ly9dZnmWc, 8’27”
HDFS OVERVIEW
• Based on Google’s GFS (Google File System)
• Provides redundant storage of massive amounts of data
• Data is distributed across all nodes at load time
HDFS DESIGN
• Runs on commodity hardware
• Assumes high failure rates of components
• Works well with lots of large files
• It is built around the idea of “write once, read many times”
HDFS ARCHITECTURE
• Operates on top of an existing file system
• Files are stored as “Blocks”
  • Default block size is 64 MB (Hadoop 1.x; 128 MB from Hadoop 2.x onward)
• Provides reliability through replication
• NameNode stores metadata and manages access
• No data caching due to large datasets
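The block and replication ideas above can be sketched in a few lines of Python. This is only an illustration, not HDFS code: the replication factor of 3, the DataNode names and the round-robin placement are assumptions made for the example (real HDFS placement is rack-aware).

```python
BLOCK_SIZE_MB = 64   # default from the slide (Hadoop 1.x)
REPLICATION = 3      # assumed replication factor, not stated on the slide


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size in MB of each block for a file of the given size."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block only holds what is left
    return sizes


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct DataNodes, round-robin.

    Toy policy for illustration only; real HDFS placement is rack-aware.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            datanodes[(block_id + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


# A hypothetical 200 MB file on a hypothetical four-node cluster.
blocks = split_into_blocks(200)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print(blocks)
print(placement)
```

A 200 MB file becomes three full 64 MB blocks plus one 8 MB block, and each block lands on three different DataNodes, so losing any single node leaves two copies of every block.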
HDFS ARCHITECTURE DIAGRAM
HDFS FILE STORAGE
• NameNode
  • Keeps metadata in RAM for fast lookup
  • File-system metadata size is limited to the amount of available RAM on the NameNode
  • Stores all metadata
• DataNode
  • Different blocks of the same file are stored on different DataNodes
  • Stores file content as blocks
  • Same block is replicated across several DataNodes for redundancy
  • Periodically sends a report of all existing blocks to the NameNode
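The division of labor between the NameNode and the DataNodes can be sketched as follows. This is a toy in-memory model, not the Hadoop API: the class name, method names, file path and block ids are all made up for the example.

```python
from collections import defaultdict


class NameNode:
    """Toy NameNode: holds all metadata in memory (plain dicts)."""

    def __init__(self):
        self.file_blocks = {}                    # file name -> ordered block ids
        self.block_locations = defaultdict(set)  # block id -> DataNodes holding it

    def add_file(self, name, block_ids):
        self.file_blocks[name] = list(block_ids)

    def receive_block_report(self, datanode, block_ids):
        """Replace what is known about `datanode` with its latest report."""
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode)

    def locate(self, name):
        """Return, for each block of the file, the sorted list of DataNodes."""
        return [sorted(self.block_locations[b]) for b in self.file_blocks[name]]


nn = NameNode()
nn.add_file("/logs/day1", ["blk_1", "blk_2"])
nn.receive_block_report("dn1", ["blk_1", "blk_2"])
nn.receive_block_report("dn2", ["blk_1"])
nn.receive_block_report("dn3", ["blk_2"])
print(nn.locate("/logs/day1"))
```

The NameNode never stores file content here; it only answers "which DataNodes hold block N", which is why its memory bounds the amount of metadata the file system can track.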
FAILURE AND REPLACEMENT
• DataNode failure and recovery
• NameNode failure and options to avoid it
• Secondary NameNode
• Block placement strategies
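As a rough illustration of DataNode failure and recovery, the sketch below re-replicates blocks that lost a copy. It is a simplification under stated assumptions: the function name, the node names, the replication factor of 3 and the "pick the first live node in sorted order" policy are all hypothetical, and real HDFS uses rack-aware placement.

```python
def rereplicate(block_locations, failed_node, live_nodes, replication=3):
    """Return new block locations after `failed_node` is lost.

    block_locations: dict of block id -> set of DataNode names.
    Each under-replicated block is copied to live nodes that lack it,
    chosen in sorted order (a toy policy, not the HDFS one).
    """
    repaired = {}
    for block_id, holders in block_locations.items():
        survivors = set(holders) - {failed_node}
        candidates = sorted(set(live_nodes) - survivors - {failed_node})
        missing = max(0, replication - len(survivors))
        repaired[block_id] = survivors | set(candidates[:missing])
    return repaired


before = {"blk_1": {"dn1", "dn2", "dn3"}, "blk_2": {"dn2", "dn3", "dn4"}}
after = rereplicate(before, failed_node="dn2",
                    live_nodes=["dn1", "dn3", "dn4", "dn5"])
print(after)
```

Every block is back to three copies and none of them lives on the failed node.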
Hive vs. HBase
https://www.youtube.com/watch?v=U0r9s4iXwo0, 2’51”
https://www.youtube.com/watch?v=IumVWII3fRQ, 2’50”
Introduction to Hive
• Why Hive? Motivation
• Hive’s Architecture
• Hive’s Principles - Schema on Read
• Hive’s Principles
• DW Stack in Hadoop
• Getting started with HIVE
Hive Motivation
- MapReduce development is time consuming
- Requires intimate knowledge of the framework
- Few people have the required expertise
- No schema to understand data in HDFS
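The last point is what Hive's schema-on-read principle addresses: the files stay raw, and a table definition is applied only when the data is queried. A minimal Python sketch of that idea follows. It is not Hive code; the sample lines reuse the car data from the later slides, and the column names and types are assumptions for the example.

```python
import csv
import io

# Raw file contents as they might sit in HDFS: no schema is stored with the data.
RAW = "1,Nissan,Pathfinder,2003\n2,Nissan,Pathfinder,2005\n10,Toyota,,2005\n"

# Hypothetical table definition, applied only when the data is read.
SCHEMA = [("pk", int), ("make", str), ("model", str), ("year", int)]


def read_with_schema(raw_text, schema):
    """Parse delimited text into dicts, casting each field at read time."""
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        row = {}
        for (name, cast), value in zip(schema, fields):
            row[name] = cast(value) if value != "" else None  # empty -> NULL
        rows.append(row)
    return rows


# Roughly: SELECT make, year FROM cars WHERE year = 2005
result = [(r["make"], r["year"])
          for r in read_with_schema(RAW, SCHEMA) if r["year"] == 2005]
print(result)
```

Changing SCHEMA changes how the same bytes are interpreted without rewriting the file, which is the practical difference from a schema-on-write database.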
Architecture
DW Stack in Hadoop
Layers | Conventional RDBMS | Hadoop DW Stack
Storage | DB Tables | HDFS Files (Raw and Structured Data)
Metadata | System Tables | HCatalog (Metastore)
Query | SQL Query Engine | Multiple Engines (SQL & Non-SQL)
Coupling | All the 3 layers glued together | All the layers are separate and independent
HiveQL
• CREATE database
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
  [COMMENT database_comment]
  [LOCATION hdfs_path]
  [WITH DBPROPERTIES (property_name=property_value, ...)];
• DROP database
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
• ALTER database
ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...);
• Hive versions before 0.14 do NOT support deletion or update of a particular record (row) or set of records; later versions allow UPDATE and DELETE only on transactional (ACID) tables.
Background
• Row Oriented Storage Structure
PK | Make | Model | Year | Colour | Transmission
1 | Nissan | Pathfinder | 2003 | red | auto
2 | Nissan | Pathfinder | 2005 | blue | auto
. | . | . | . | . | .
10 | Toyota | Null | 2005 | red | auto
Background
• Column Oriented Storage Structure
PK: 1, 2, ..., 10
Make: Nissan, Nissan, ..., Toyota
Model: Pathfinder, Pathfinder, ..., Null
Year: 2003, 2005, ..., 2005
Colour: red, blue, ..., red
Transmission: auto, auto, ..., auto
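The difference between the two layouts can be shown with the same car data held both ways. This is a small in-memory sketch, not a storage engine; it uses only rows 1, 2 and 10 from the slides, and the helper name is made up.

```python
# Sample rows taken from the slides' car table (rows 1, 2 and 10).
rows = [
    (1, "Nissan", "Pathfinder", 2003, "red", "auto"),
    (2, "Nissan", "Pathfinder", 2005, "blue", "auto"),
    (10, "Toyota", None, 2005, "red", "auto"),
]
COLUMNS = ["pk", "make", "model", "year", "colour", "transmission"]


def to_columns(rows, column_names):
    """Pivot a row-oriented list of tuples into one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(column_names)}


columns = to_columns(rows, COLUMNS)

# Row store: every full row is touched to read a single attribute.
years_from_rows = [row[COLUMNS.index("year")] for row in rows]
# Column store: only the 'year' column is touched.
years_from_columns = columns["year"]

print(years_from_columns)
```

Both layouts return the same answer; the column layout simply reads less data when a query needs only a few attributes, which is why it suits analytical scans.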
What is HBase?
• HBase is an open-source, non-relational, distributed database.
• Built on top of Apache Hadoop and Apache ZooKeeper.
• HBase is a BigTable-like storage (for Hadoop).
• Key/Value Column Family Store.
• Column Oriented, Multi Dimensional Database.
Key/Value Column Family Structure
Row ID | Car Column Family
1 | Make:Nissan, Model:Pathfinder, Year:2003, Color:red, Transmission:auto
2 | Make:Nissan, Model:Pathfinder, Year:2005, Color:blue, Transmission:auto
. | .
10 | Make:Nissan, Year:2005, Color:red, Transmission:auto
Row ID | Column Family: Company | Column Family: Attributes
1 | Make:Nissan, Model:Pathfinder | Year:2003, Color:red, Transmission:auto
2 | Make:Nissan, Model:Pathfinder | Year:2005, Color:blue, Transmission:auto
. | . | .
10 | Make:Nissan | Year:2005, Color:red, Transmission:auto
HBase Tables
• Tables are sorted by Row in lexicographical order.
• Table schema only defines its column families.
  • Each family consists of any number of columns
  • Each column consists of any number of versions
  • Columns only exist when inserted, NULLs are free
  • Columns within a family are sorted and stored together
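These table properties can be modelled with nested dictionaries. The class below is a toy in-memory sketch of the data model only; it is not the HBase client API, and the class name, method names and family names are assumptions for the example.

```python
import time


class ToyHBaseTable:
    """In-memory sketch of the HBase data model (not the HBase client API)."""

    def __init__(self, families):
        self.families = set(families)  # the schema defines only column families
        self.rows = {}                 # row key -> {(family, qualifier): [(ts, value), ...]}

    def put(self, row_key, family, qualifier, value, ts=None):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        ts = time.time() if ts is None else ts
        cells = self.rows.setdefault(row_key, {})
        versions = cells.setdefault((family, qualifier), [])
        versions.append((ts, value))
        versions.sort(key=lambda v: v[0], reverse=True)  # newest version first

    def get(self, row_key, family, qualifier):
        """Return the newest value, or None if the column was never inserted."""
        versions = self.rows.get(row_key, {}).get((family, qualifier))
        return versions[0][1] if versions else None

    def scan(self):
        """Return row keys in lexicographical order, as HBase sorts them."""
        return sorted(self.rows)


table = ToyHBaseTable(["company", "attributes"])
table.put("1", "company", "make", "Nissan")
table.put("10", "company", "make", "Nissan")
table.put("2", "attributes", "year", "2003", ts=1)
table.put("2", "attributes", "year", "2005", ts=2)
print(table.scan())
```

Note that row key "10" sorts before "2" because keys are compared as strings, and that a column that was never written (for example the model of row 10) costs no storage at all.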
HBase Components
HBase Vs RDBMS
HBase | RDBMS
Column Oriented | Row Oriented
No Query Language | SQL
Flexible Schema | Fixed Schema
De-Normalized Data | Normalized Data
Good for semi-structured as well as structured data | Good for structured data
Stores sparse data efficiently | Not optimized for sparse tables
NoSQL
https://www.youtube.com/watch?v=XPqrY7YEs0A, 5’02”
Database Management System
Storage and retrieval of data.
• Relational Database
• OLAP (Data Warehouse)
• NoSQL
Relational Databases
• Problem: absence of a standard way to persist data.
• 1970: Edgar F. Codd proposed the relational model to persist data.
• Relational databases were created as a standard.
NoSQL
• Problem: relational databases could not handle Big Data.
• NoSQL databases were the answer.
Objective of NoSQL
• Scalability
• Performance
• High Availability
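Performance
[Diagram: RDBMS, OLAP and NoSQL compared on a trade-off between functionality and performance, ranging from “more functionality, less performance” to “less functionality, more performance”.]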
Storage
System | Storage structure | Data stored
RDBMS | Tables | Structured data
OLAP | Cubes | Structured data
NoSQL | Collections | Unstructured or structured data
Structured and Unstructured data
• Examples of unstructured data: media files, blogs written online, text files, etc.
• Structured data is data that conforms to a particular data model.
Types of NoSQL Databases
NoSQL Databases
• Key Value Store
• Tabular
• Document Oriented
Examples of NoSQL Databases
NoSQL Databases
• Key Value Store. Examples: Memcached, Coherence, Redis
• Tabular. Examples: BigTable, HBase, Accumulo
• Document Oriented. Examples: MongoDB, CouchDB, Cloudant
NoSQL missing features
• No join support (to overcome performance issues)
• No complex transactions (absence of rollback and commit)
• No constraint support (constraints must be implemented at the application level)
When to Use?
• Great quantities of data need to be stored.
• The structure of the data is not uniform.
• Relationships are not of high importance.
• Applications must run fast.
• The data keeps growing, for example server logs, Twitter posts, blogs.
• Constraints and validations are not part of the implementation.
When not to use?
• Complex transactions must be handled
• Joins must be handled
• Validations are a necessity at the database side
Let's get started
• Download MongoDB: http://www.mongodb.org/downloads
• Command to start MongoDB: mongod.exe --dbpath="E:\Spring2014\Project\data"
• Tutorial: http://docs.mongodb.org/manual/tutorial
Getting started..
• Connect to MongoDB
• Command: open a command prompt, navigate to the bin directory below, and start the mongo shell (mongo.exe):
"E:\Spring2014\Project\BI AND DTM\mongodb-win32-x86_64-2008plus-2.4.9\mongodb-win32-x86_64-2008plus-2.4.9\bin"
Some basic Operations.
• INSERT: for (var i = 0; i <= 25; i++) db.testData.insert({x: i})
• SELECT: db.testData.find()
• The “use” command switches to a database; MongoDB creates the database dynamically if it does not exist.
• Example: “use tutorial” switches to the tutorial database. The database is created when data is first stored in it, as physical files in the data folder given by dbpath.
• The following compares a MongoDB “insert” statement with an “INSERT” in a relational database.
MongoDB:
db.users.insert(
  { name: "Sue",
    age: 26,
    status: "complicated" }
)
Each entry, such as name: "Sue", is a field-value pair.
SQL:
INSERT INTO users (name, age, status)
VALUES ('Sue', 26, 'complicated')
• In the MongoDB statement above, users is a collection and the JSON data is called a document.
• SQL to MongoDB Mapping Chart
SQL terms | MongoDB/NoSQL terms
Database | Database
Table | Collection
Row | Document or BSON document
Column | Field
Index | Index
Table joins | Embedded documents and linking
Primary key | Primary key (_id field)
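The mapping can be tried out side by side in Python. In this sketch the relational side uses the standard sqlite3 module, while the document side is a plain list of dictionaries standing in for a collection. It is not the MongoDB driver: the insert and find helpers and the second user "Bob" are made up for illustration.

```python
import sqlite3

# Relational side: a table with a fixed set of columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER, status TEXT)")
conn.execute("INSERT INTO users (name, age, status) VALUES (?, ?, ?)",
             ("Sue", 26, "complicated"))
sql_rows = conn.execute("SELECT name, age FROM users WHERE age > 25").fetchall()

# Document side: a toy stand-in for a collection (not the MongoDB driver).
users = []


def insert(collection, document):
    """Append a document, assigning an _id the way a primary key would be."""
    document = dict(document, _id=len(collection) + 1)
    collection.append(document)
    return document["_id"]


def find(collection, predicate=lambda doc: True):
    """Return the documents matching the predicate (all of them by default)."""
    return [doc for doc in collection if predicate(doc)]


insert(users, {"name": "Sue", "age": 26, "status": "complicated"})
insert(users, {"name": "Bob", "tags": ["new"]})  # documents need not share fields
doc_rows = find(users, lambda doc: doc.get("age", 0) > 25)

print(sql_rows)
print(doc_rows)
```

The table rejects any row that does not fit its columns, while the collection accepts documents with different fields, which is the "structure of the data is not uniform" case from the earlier slide.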
Questions?