Transcript Slide 1

Introduction to Hadoop
Capabilities, Accelerators and Solutions
Big Data


Google processed over 400 PB of data on datacenters composed of thousands
of machines in September 2007 alone ***
Today, every organization has it’s own big data problem and most are using
Hadoop to solve it.
*** MapReduce: Simplified Data Processing on Large Clusters, Communications of the ACM, vol. 51, no. 1
(2008), pp. 107-113, Jeffrey Dean and Sanjay Ghemawat
2
Where is Big Data?
Big Data Has Reached Every Market Sector
Source – McKinsey & Company report. Big data: The next frontier for innovation, competition and
productivity. May 2011.
3
Big Data Value Creation Opportunities
Financial Services
Healthcare
• Optimal treatment pathways
• Detect fraud
• Model and manage risk
• Remote patient monitoring
• Improve debt recovery rates
• Predictive modeling for new drugs
• Personalize banking/insurance Products • Personalized Medicine
Retail
• In-store behavior analysis
Web/Social/Mobile
• Cross selling
• Optimize pricing, placement, design
• Optimize inventory and distribution
• Location-based marketing
• Social segmentation
• Sentiment analysis
• Price comparison Services
Manufacturing
Government
• Design to value
• Crowd-sourcing “Digital factory” for lean
manufacturing
• Improve service via product sensor Data
• Reduce fraud
• Segment populations, customize
action
• Support open data initiatives
• Automate decision Making
4
What is Hadoop?
 Hadoop is an open-source project overseen by the Apache Software




Foundation
Originally based on papers published by Google in 2003 and 2004
Hadoop is an ecosystem, not a single product
Hadoop committers work at several different organizations
– Including Facebook, Yahoo!, Twitter, Cloudera, Hortonworks
5
Hadoop - Inspiration
You Say, “tomato…”
Google calls it:
Hadoop equivalent
GFS
HDFS
MapReduce
Hadoop MapReduce
Sawzall
Hive, Pig
BigTable
HBase
Chubby
ZooKeeper
Pregel
Giraph
Google was awarded a patent for “map reduce – a system for large scale
data processing” in 2010, but blessed Apache Hadoop by granting a license.
6
Hadoop Timeline
• Started for Nutch at Yahoo! by Doug Cutting in early 2006
• Hadoop 2.x, released in 2012, is basis for all current, stable Hadoop
distributions
 Apache Hadoop 2.0.xx
 CDH4.*
 HDP2.*
7
Typical Data Strategy
ETL Tools
Informatica
DW / Marts
Teradata
BI
Microstrategy
Analytics
SAS
Commercial
Oracle Data Integrator
IBM Datastage
Microsoft SSIS
OBIEE
Cognos
Microsoft SSRS
TIBCO Spotfire
SPSS
Open source
Talend
Oracle
DB2, Netezza
SQL server
EMC Greenplum
mySQL
Pentaho , Jaspersoft
R, RapidMiner
8
How Hadoop fits in?
Hadoop can complement the existing DW environment as well replace some of
the components in a traditional data strategy.
9
How Hadoop fits in?
• Storage
•
•
HDFS – It’s a file system, not a DBMS
HBase - Columnar storage that serves low-latency read / write request
• Extract / Load
•
•
Source / Target is RDBMS - Sqoop, hiho
Stream processing - Flume, Scribe, Chukwa, S4, Storm
• Transformation
•
•
Map-reduce (Java or any other language), Pig, Hive, Oozie etc.
Talend and Informatica have built products to abstract complexity of map-reduce
• Analytics
•
RHadoop, Mahout
• BI – All existing players are coming up with Hadoop connectors
10
Hadoop Ecosystem
11
Hadoop Ecosystem Continued…
12
Map-reduce – Programming model
Single map task and a single reduce task -
Multiple map tasks with a single reduce task -
13
Map-reduce – Programming model
14
Hadoop Map Reduce
 What happens during a Map-reduce job’s lifetime?




Clients submit MapReduce jobs to the JobTracker, a daemon that resides on
“master node”
The JobTracker assigns Map and Reduce tasks to other nodes on the cluster
These nodes each run a software daemon known as the TaskTracker
The TaskTracker is responsible for actually instantiating the Map or Reduce
task, and reporting progress back to the JobTracker
 Terminology –



A job is a ‘full program’ – a complete execution of Mappers and Reducers over a
dataset
A task is the execution of a single Mapper or Reducer over a slice of data
A task attempt is a particular instance of an attempt to execute a task
 There will be at least as many task attempts as there are tasks
 If a task attempt fails, another will be started by the JobTracker
 Speculative execution can also result in more task attempts than completed tasks
15
Pig Latin
•
Data-flow oriented language
• High-level language for routing data, allows easy integration of Java
for complex tasks
• Data-types include sets, associative arrays, tuples
•
Client-side utility
• Pig interpreter converts the pigscript to Java map-reduce jobs
and submits it to JobTracker
• No additional installs needed on
Hadoop Cluster
• Pig performance ~ 1.4x Java
MapReduce jobs, but lines of
code needed ~ 1/10th
•
Developed at Yahoo!
16
Hive
•
SQL-based data warehousing app
• Feature set is similar to Pig
• Language is more strictly SQL-esque
•
•
•
Supports SELECT, JOIN, GROUP BY, etc.
Uses “Schema on Read” philosophy
Features for analyzing very large data sets
• Partition columns
• Sampling
• Buckets
•
Requires install of metastore on Hadoop cluster
•
Developed at Facebook
17
HBase
 Distributed, versioned, column-oriented store on top of HDFS
 Goal - To store tables with billion rows and million columns
 Provides an option of “low-latency” (OLTP) reads/writes along with




support for batch-processing model of map-reduce
HBase cluster consists of a single “HBase Master” and multiple
“RegionServers”
Facebook uses HBase to drive its messaging infrastructure
Stats - Chat service supports over 300 million users who send over
120 billion messages per month
Nulls are not stored by design and typical table storage looks like –
Row-key
Column-family
Column
Timestamp
Value
1
CF
Name
Ts1
Vijay
1
CF
Address
Ts1
Mumbai
1
CF
Address
Ts2
Goa
18
Sqoop

RDBMS to Hadoop
 Command-line tool to import any JDBC supported database into
Hadoop
 And also export data from Hadoop to any database


Generates map-only jobs to connect to database and
read/write records
DB specific connectors contributed by vendors –

 Oraoop for Oracle by Quest software
 Teradata connector from Teradata
 Netezza connector from IBM
Developed at Cloudera
 Oracle has come up with “Oracle Loader for Hadoop” and claim that it is
optimized for “Oracle Database 11g”
19
Informatica HParser
 Graphical interface to


design data transformation
jobs
Converts designed DT jobs
to Hadoop Map-reduce jobs
Out-of-the-box Hadoop
parsing support for
industry-standard formats,
including Bloomberg,
SWIFT, NACHA, HIPAA,
HL7, ACORD, EDI X12, and
EDIFACT etc.
20
Flume
 Flume is a distributed, reliable, available service for efficiently
moving large amounts of data as it is produced
 Developed at Cloudera
21
Machine Learning
• Apache Mahout
• Scalable machine learning library most of the algorithms implemented on top
Apache Hadoop using map/reduce paradigm
• Supported Algorithms –
• Recommendation mining - takes users’ behavior and find items said
specified user might like.
• Clustering - takes e.g. text documents and groups them based on
related document topics.
• Classification - learns from existing categorized documents what specific
category documents look like and is able to assign unlabeled documents
to the appropriate category.
• Frequent item set mining - takes a set of item groups (e.g. terms in a
query session, shopping cart content) and identifies, which individual
items typically appear together.
•
RHadoop (from Revolution Analytics) and RHIPE (from Purdue University)
allows executing R programs over Apache Hadoop
22
Graph Implementations
Graph implementations follow the bulk-synchronous parallel model,
popularized by Google’s Pregel –
1) Giraph (submitted to Apache Incubator)
2) GoldernOrb
3) Apache Hama
4) More – http://www.quora.com/What-are-some-good-MapReduceimplementations-for-graphs
23
Hadoop Distributions
24
Hadoop Variants / Flavors / Distributions


Apache Hadoop –
 Completely open and up-to-date version of Hadoop
Cloudera’s distribution including Hadoop (CDH)
 Open source Hadoop tools packaged with “closed” management suite (SCM)
 Profits by providing support (Cost-model is per node in Cluster) & Trainings

Hortonworks Data Platform





Spun-off in 2011 from Yahoo!’s core Hadoop team
Open source Hadoop tools packaged with “open” management suite (Apache Ambari)
Profits by providing support (Cost-model is per node in Cluster) &Trainings
Signed a deal with Microsoft to develop Hadoop for Windows
MapR
 Claims to have developed faster version of HDFS
 MapR’s distribution powers EMC’s Greenplum products

Oracle Big Data Appliance & IBM BigInsights
 Powered by CDH

More may exist……..
25
Hadoop - Key Contributors
26
Hadoop - Key Contributors
27
Hadoop - Key Contributors
28
References


Hadoop: The Definitive Guide
by Tom White (Cloudera Inc.)


Hadoop in Action
by Chuck Lam ()


HBase: The Definitive Guide
by Lars George (Cloudera Inc.)



Mahout in Action
by Sean Owen, Robin Anil,
Ted Dunning, and Ellen Friedman


Programming Pig
by Alan Gates (Hortonworks)
Thank You
.