Diploma in Big Data and Analytics
Introduction to Hadoop Eco-System
Agenda
In this session, you will learn about:
• What is Hadoop?
• Why Hadoop?
• Advantages of Hadoop
• History of Hadoop
• Key Characteristics of Hadoop
• Hadoop 1.0 & 2.0 Eco-system
• Hadoop Use Cases
• Where Hadoop Fits?
• Traditional vs. Hadoop Architecture
• RDBMS vs. Hadoop
• When to Use or Not Use Hadoop?
• Hadoop Opportunities
What is Hadoop?
• A solution to the Big Data problem.
• A free, Java-based framework that allows for the distributed processing of large data sets.
• Processes data across clusters of commodity computers, using a simple programming model.
• An Apache project, inspired by Google's MapReduce and Google File System papers.
• A fault-tolerant, reliable open source system.
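To make the "simple programming model" concrete, here is the classic word-count job, lightly condensed from the Apache Hadoop MapReduce tutorial: the mapper emits (word, 1) pairs and the reducer sums them. Input and output directories are taken from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in its input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts collected for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Hadoop runs many mapper instances in parallel, one per input split, and handles the shuffling, sorting, and retrying of failed tasks itself; the programmer writes only the two small functions above.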
Why Hadoop?
Why is Big Data Technology Needed?
• 90% of the data in the world today has been created in the last two years alone.
• Structured formats have some limitations with respect to handling large quantities of data.
• 80% of the data is unstructured or exists in widely varying structures, which are difficult to analyze.
• It is difficult to integrate information distributed across multiple systems.
Why Hadoop?
Additional Advantages
• Most business users do not know what should be analyzed.
• Potentially valuable data is dormant or discarded.
• A lot of information has a short, useful lifespan.
• It is too expensive to justify the integration of large volumes of unstructured data.
• Context adds meaning to the existing information.
Advantages of Hadoop
Why is Big Data Technology Appealing?
• Runs applications on distributed systems with thousands of nodes, involving petabytes of data.
• Has a distributed file system, called the Hadoop Distributed File System (HDFS), which enables fast data transfer among the nodes (see the sketch after this list).
• Manages and processes huge amounts of data cost-efficiently.
• Analyzes data in its native form, which may be unstructured, structured, or streaming.
• Captures data from fast-happening events in real time.
• Handles the failure of isolated nodes and of the tasks assigned to those nodes.
• Turns data into actionable insights.
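As a minimal sketch of how client code talks to HDFS through the Java FileSystem API: the cluster URI and file path below are placeholders, and fs.defaultFS would normally come from core-site.xml rather than being set in code.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder URI for illustration; normally read from core-site.xml.
    conf.set("fs.defaultFS", "hdfs://namenode:8020");
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/demo/hello.txt"); // hypothetical path

    // Write: HDFS files are write-once, append-mostly.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello, hadoop\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read it back, streaming from whichever DataNode holds a replica.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }
  }
}
```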
History of Hadoop
Hadoop’s Key Characteristics
• Reliability: Provides a reliable, fault-tolerant shared data storage and analysis system.
• Scalability: Offers very high linear scalability.
• Flexibility: Can process structured, semi-structured, and unstructured data.
• Economical: Works on inexpensive commodity hardware.
• Robust: Well suited to meet the analytical needs of developers.
Hadoop is Reliable
Why is Hadoop Reliable?
• Data automatically gets replicated at two other locations.
• The level of replication is configurable (see the sketch after this list).
• Even if two systems collapse, the file is still available on the third system.
• If a node fails, the system automatically reallocates its work to another location.
• Together this gives a high level of fault tolerance.
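For illustration, the replication factor can be read and changed per file through the same FileSystem API; the cluster-wide default is normally set via dfs.replication in hdfs-site.xml, and the path below is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Cluster-wide default; normally set in hdfs-site.xml.
    conf.setInt("dfs.replication", 3);
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/demo/hello.txt"); // hypothetical path
    FileStatus status = fs.getFileStatus(file);
    System.out.println("current replication: " + status.getReplication());

    // Raise replication for this one file to 5; the NameNode schedules
    // the extra copies in the background.
    fs.setReplication(file, (short) 5);
  }
}
```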
Scalable Development Environment
Flexibility in Data Processing
Hadoop Brings Flexibility In Data Processing
• One of the biggest challenges organizations have had in the past is handling unstructured data.
• Let's face it: only 20% of the data in any organization is structured, while the rest is unstructured, and its value has been largely ignored due to a lack of technology to analyze it.
• Hadoop manages data whether it is structured or unstructured, encoded or formatted, or of any other type.
• Hadoop brings value to the table by letting unstructured data feed the decision-making process (a small sketch follows this list).
Typical sources include application data, machine data, social data, and enterprise data.
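A toy illustration of that flexibility, assuming a mixed input directory: because Hadoop imposes no schema on load, one mapper can accept any line and classify it at read time. The format heuristics below are, of course, assumptions made for the sketch.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Tallies how many input records look structured (CSV), semi-structured
// (JSON-ish), or unstructured free text; no schema was imposed on load.
public class FormatTallyMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text bucket = new Text();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString().trim();
    if (line.startsWith("{") && line.endsWith("}")) {
      bucket.set("json");        // e.g. social or machine events
    } else if (line.chars().filter(c -> c == ',').count() >= 2) {
      bucket.set("csv");         // e.g. exported application data
    } else {
      bucket.set("free-text");   // e.g. logs, documents
    }
    context.write(bucket, ONE);
  }
}
```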
Hadoop is Very Cost Effective
• Hadoop generates cost benefits by bringing massively parallel computing to commodity servers, resulting in a substantial reduction in the cost per terabyte of storage, which in turn makes it reasonable to model all your data.
• Apache Hadoop was developed to help Internet-based companies deal with prodigious volumes of data.
• According to some analysts, the cost of a Hadoop data management system, including hardware, software, and other expenses, comes to about $1,000 a terabyte: roughly one-fifth to one-twentieth the cost of other data management technologies.
Hadoop Ecosystem is Robust
Why is Hadoop Considered Robust?
• Meets the analytical needs of developers and of small to large organizations.
• Delivers to a variety of data processing needs.
• Is backed by ecosystem projects such as MapReduce, Hive, HBase, Apache Pig, Sqoop, and Flume.
The Hadoop 1.0 Eco-System
The Hadoop 2.0 Eco-System
Hadoop Use Cases
Where does Hadoop Fit?
Web and e-tailing
• Recommendation Engines
• Ad Targeting
• Search Quality
• Abuse and Click Fraud Detection

Telecommunications
• Customer Churn Prevention
• Network Performance Optimization
• Calling Data Record (CDR) Analysis
• Analyzing Network to Predict Failure

Government
• Fraud Detection & Cyber Security
• Welfare Schemes
• Justice
Where does Hadoop Fit?
Healthcare & Life Sciences
• Health Information Exchange
• Gene Sequencing
• Serialization
• Healthcare Service Quality Improvements
• Drug Safety

Banks and Financial Services
• Modeling True Risk
• Threat Analysis
• Fraud Detection
• Trade Surveillance
• Credit Scoring and Analysis

Retail
• Point-of-Sale Transaction Analysis
• Customer Churn Analysis
• Sentiment Analysis
Leading Brands using Hadoop
Source: https://wiki.apache.org/hadoop/PoweredBy
Traditional Data Analytics Architecture
[Architecture diagram] Data flows from Instrumentation through a mostly-append Collection layer into a storage-only grid that holds the original raw data; an ETL compute grid then aggregates it into an RDBMS, which serves BI reports and interactive apps. The pain points: the original raw, high-fidelity data cannot be explored; moving data to the compute tier does not scale; and data dies a premature death.
Hadoop Data Analytics Architecture
[Architecture diagram] Data still flows from Instrumentation through a mostly-append Collection layer, but it now lands in Hadoop, a combined storage-and-compute grid, before aggregates are pushed to the RDBMS for BI reports and interactive apps. The gains: data exploration and advanced analytics directly on the raw data; scalable throughput for ETL and aggregation; and data that stays alive forever.
RDBMS v/s Hadoop
RDBMS                                     | Hadoop
------------------------------------------|------------------------------------------------
Data processing efficiency in gigabytes   | Data processing efficiency in petabytes
Mostly proprietary                        | Open source framework
One project with multiple components      | Ecosystem suite of (mostly) Java-based projects
Designed for client-server architecture   | Designed to support distributed architecture
High usage would require high-end servers | Designed to run on commodity hardware
Costly                                    | Cost-efficient
Legacy procedures                         | High fault tolerance
RDBMS v/s Hadoop (Contd)
RDBMS                               | Hadoop
------------------------------------|--------------------------------------------
Relies on the OS file system        | Based on a distributed file system (HDFS)
Needs structured data               | Very good support for unstructured data
Needs to follow defined constraints | Flexible, evolvable, and fast
Stable products                     | Still evolving
Real-time read/write (OLTP)         | Suitable for batch processing (OLAP)
Arbitrary insert and update         | Sequential write
Supports ACID transactions          | Supports BASE
Schema required on write            | Schema required on read
Repeated read and write             | Write once, read repeatedly
When to use Hadoop?
Hadoop can be used in various scenarios, including the following:
• Analytics
• Search
• Data retention
• Log file processing (see the sketch after this list)
• Analysis of text, image, audio, and video content
• Recommendation systems, as on e-commerce websites
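As a hedged sketch of the log-file-processing case: a mapper that pulls the HTTP status code out of each access-log line, meant to be paired with a summing reducer like the one in the word-count example earlier. The Apache-style log format shown in the comment is an assumption.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Counts HTTP status codes in Apache-style access logs, e.g.
// 127.0.0.1 - - [10/Oct/2020:13:55:36] "GET / HTTP/1.1" 200 2326
public class StatusCodeMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text status = new Text();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // The status code is the first token after the quoted request string.
    String line = value.toString();
    int quote = line.lastIndexOf('"');
    if (quote >= 0 && quote + 1 < line.length()) {
      String[] tail = line.substring(quote + 1).trim().split("\\s+");
      if (tail.length > 0 && tail[0].matches("\\d{3}")) {
        status.set(tail[0]);
        context.write(status, ONE);
      }
    }
  }
}
```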
When not to use Hadoop?
Hadoop may not be the right fit in the following situations:
• Low-latency or near real-time data access.
• A large number of small files to be processed.
• Multiple-writer scenarios, or scenarios requiring arbitrary writes or writes between the files.
Opportunities on Hadoop
• Dice.com: nearly 2,500 jobs across the US
• Indeed.com: over 13,200 jobs across the US

Job Type         | Job Functions                                          | Skills
-----------------|--------------------------------------------------------|------------------------------------------
Hadoop Developer | Develops MapReduce jobs, designs data warehouses       | Java, scripting, Linux
Hadoop Admin     | Manages the Hadoop cluster, designs data pipelines     | Linux administration, network management, experience managing large clusters of machines
Data Scientist   | Data mining and figuring out hidden knowledge in data  | Math, data mining algorithms
Business Analyst | Analyzes data                                          | Pig, Hive, HSQL, familiarity with other BI tools
Quiz - Time
Identify what does not characterize Hadoop:
A) An open source Apache project
B) Distributed data processing framework
C) Highly secured
D) Highly reliable & redundant
Quiz - Time
A bank transaction database file from an external location needs to be analyzed by Hadoop. Which tool is used to load this data into Hadoop for processing?
A) HDFS
B) MapReduce
C) Flume
D) Sqoop
Summary
• Hadoop is an Apache project to handle Big Data.
• It is a framework that supports distributed processing of large data sets in a cluster.
• It offers a scalable development environment.
• It makes a high percentage of data available for BI & advanced analytics.
• Key characteristics: reliability, scalability, flexibility, cost efficiency, and fault tolerance.
Thank you