Hadoop & MapReduce
Zhangxi Lin
CAABI, Texas Tech University
FIFE, Southwestern University of Finance & Economics
Cellphone: 18610660375, QQ/WeChat: 155970
http://zlin.ba.ttu.edu [email protected]
2015-06-16
CAABI, Texas Tech University
◦ The Center for Advanced Analytics and Business Intelligence, started in 2004 by Dr. Peter Westfall, ISQS, Rawls College of Business.
Sichuan Key Lab of Financial Intelligence and Financial Engineering (FIFE), SWUFE
◦ One of two key labs in finance, founded in 2008 and sponsored by the Sichuan Provincial Government.
◦ Underpinned by two areas at SWUFE: Information and Finance.
Know Big Data One More Step
When we talk about big data, we must know what Hadoop is.
When we plan data warehousing, we must know what HDFS and NoSQL are.
When we talk about data mining, we must know what Mahout and H2O are.
Do you know that Hadoop data warehousing does not need dimensional modeling?
Do you know how Hadoop stores heterogeneous data?
Do you know what Hadoop's "Achilles' heel" is?
Do you know you can install a Hadoop system on your laptop?
Do you know Alibaba retired its last minicomputer in 2014?
So, let’s talk about Hadoop
After this lecture you will
Understand the challenges in big data management
Understand how Hadoop and MapReduce work
Be familiar with the Hadoop ecosystem
Be able to install Hadoop on your laptop
Be able to install a handy big data tool on your laptop to visualize and mine data
Outline
Apache Hadoop
Hadoop Data Warehousing
Hadoop ETL
Hadoop Data Mining
Data Visualization with Hadoop
MapReduce Algorithm
Setting up Your Hadoop
Appendixes
◦ The Hadoop Ecological System
◦ Matrix calculation with MapReduce
A Traditional Business Intelligence System
(Diagram: the MS SQL Server stack – SSMS, SSIS, SSAS, BIDS, SSRS – alongside SAS EG and SAS EM.)
Hadoop ecosystem
What is Hadoop?
Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters of commodity hardware.
Hadoop is not a replacement for a traditional RDBMS but a supplement for handling and processing large datasets.
It achieves two tasks:
1. Massive data storage.
2. Faster processing.
Using Hadoop is cheaper, faster, and better suited to such workloads.
Hadoop 2: Big data's big leap forward
The new Hadoop is the Apache Foundation's attempt to create a whole new general framework for the way big data can be stored, mined, and processed.
The biggest constraint on scale has been Hadoop's job handling. All jobs in Hadoop 1 are run as batch processes through a single daemon called the JobTracker, which creates a scalability and processing-speed bottleneck.
Hadoop 2 uses an entirely new job-processing framework built from two daemons: the ResourceManager, which governs all jobs in the system, and the NodeManager, which runs on each Hadoop node and keeps the ResourceManager informed about what's happening on that node.
Hadoop 1.0 vs Hadoop 2.0
Features of Hadoop 2.0 over Hadoop 1.0:
• Horizontal scalability of the NameNode.
• The NameNode is no longer a single point of failure.
• Ability to process terabytes and petabytes of data in HDFS using non-MapReduce applications such as MPI and Giraph.
• The two major functions of the overburdened JobTracker (resource management and job scheduling/monitoring) are split into two separate daemons.
Apache Spark
Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley.
Spark's in-memory processing provides performance up to 100 times faster for certain applications.
Spark is well suited for machine learning algorithms.
Spark requires a cluster manager and a distributed storage system.
Spark supports Hadoop YARN; a small sketch follows.
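A minimal word-count sketch in PySpark, assuming a local Spark installation; the input path "input.txt" is a placeholder:

```python
# Minimal PySpark word count -- a sketch assuming a local Spark install.
from pyspark import SparkContext

sc = SparkContext("local[*]", "WordCount")
counts = (sc.textFile("input.txt")                # placeholder input path
            .flatMap(lambda line: line.split())   # one record per word
            .map(lambda word: (word, 1))          # pair each word with a 1
            .reduceByKey(lambda a, b: a + b))     # sum the counts per word
print(counts.take(10))                            # sample of (word, count) pairs
sc.stop()
```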
MapReduce
MapReduce is a framework for processing parallelizable
problems across huge datasets using a large number of
computers (nodes), collectively referred to as a cluster
or a grid.
MapReduce 2.0 – YARN
(Yet Another Resource Negotiator)
How Hadoop Operates
Hadoop Ecosystem
Hadoop Topics
No. Topic: Components
1 Data warehousing: HDFS, HBase, Hive, Kylin, NoSQL/NewSQL, Solr
2 Publicly available big data services: Hortonworks, Cloudera, HaaS, EC2
3 MapReduce & data mining: Mahout, H2O, R, Python
4 Big data ETL: Kettle, Flume, Sqoop, Impala, Chukwa, Dremel, Pig
5 Big data platform management: Oozie, ZooKeeper, Ambari, Loom, Ganglia
6 Application development platform: Tomcat, Neo4J, Pig, Hue
7 Tools & visualization: Pentaho, Tableau, Saiku, Mondrian, Gephi
8 Streaming data processing: Spark, Storm, Kafka, Avro
HADOOP DATA
WAREHOUSING
Comparing the RDBMS and Hadoop data warehousing stack
Layer: Storage
Conventional RDBMS: database tables
Hadoop: HDFS file system
Advantage of Hadoop over conventional RDBMS: HDFS is purpose-built for extreme IO speeds
Layer: Metadata
Conventional RDBMS: system tables
Hadoop: HCatalog
Advantage of Hadoop over conventional RDBMS: all clients can use HCatalog to read files
Layer: Query
Conventional RDBMS: SQL query engine
Hadoop: multiple engines (SQL and non-SQL)
Advantage of Hadoop over conventional RDBMS: multiple query engines like Hive or Impala are available
HDFS (Hadoop Distributed File System)
The Hadoop ecosystem consists of many components and libraries for varied tasks. The storage part of Hadoop is HDFS, and the processing part is MapReduce.
HDFS is a Java-based distributed file system that stores data on commodity machines without prior organization, providing very high aggregate bandwidth across the cluster.
HDFS Architecture & Design
HDFS has a master/slave architecture.
HDFS consists of a single NameNode and a number of DataNodes in a cluster.
In HDFS, files are split into one or more 'blocks', which are stored in a set of DataNodes.
HDFS exposes a file system namespace and allows user data to be stored in files.
DataNodes serve read and write requests and perform block creation, deletion, and replication upon instruction from the NameNode. From the client's side, all of this hides behind ordinary file operations, as sketched below.
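To make the client view concrete, here is a hedged sketch that drives the standard "hdfs dfs" shell commands from Python; it assumes a running cluster, and the paths are hypothetical:

```python
# Sketch: interacting with HDFS through the standard "hdfs dfs" CLI.
# Assumes Hadoop is installed and a NameNode is running; paths are placeholders.
import subprocess

def hdfs(*args):
    """Run an 'hdfs dfs' subcommand and return its output."""
    return subprocess.run(["hdfs", "dfs", *args],
                          capture_output=True, text=True, check=True).stdout

hdfs("-mkdir", "-p", "/user/demo")            # create a directory in HDFS
hdfs("-put", "local_data.csv", "/user/demo")  # copy a local file into HDFS
print(hdfs("-ls", "/user/demo"))              # list it; the blocks are replicated
                                              # across DataNodes behind the scenes
```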
What is NoSQL?
Stands for "Not Only SQL".
NoSQL is a non-relational database management system.
NoSQL differs from traditional relational database management systems in some significant ways.
NoSQL is designed for distributed data stores where very large-scale data storage is needed (for example, Google or Facebook, which collect terabytes of data every day for their users).
These types of data stores may not require a fixed schema, avoid join operations, and typically scale horizontally.
NoSQL
What is NewSQL?
• A modern RDBMS that seeks to provide the scalable performance of NoSQL systems for OLTP read-write workloads while still maintaining the ACID guarantees of a traditional database system
• SQL as the primary interface
• Non-locking concurrency control
• High per-node performance
• The H-Store parallel database system is the first known NewSQL system
Classification of NoSQL and NewSQL
Taxonomy of Big Data Stores
Features of OldSQL vs NoSQL vs NewSQL
HBase
• HBase is a non-relational, distributed database
• It is a column-oriented DBMS
• It is an implementation of Google's BigTable
• HBase is built on top of the Hadoop Distributed File System (HDFS)
Differences between HBase and a Relational Database
• HBase is a column-oriented database while a relational database is row-oriented
• HBase is highly scalable while an RDBMS is hard to scale
• HBase has a flexible schema while an RDBMS has a fixed schema
• HBase holds denormalized data while data in a relational database is normalized
• HBase performs well on large volumes of unstructured data, where a relational database performs poorly
• HBase does not use a query language, while a relational database uses SQL to retrieve data
HBase Data Model
(Diagram: the model is built from row keys, column families, timestamps, and values.)
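One way to picture this model is as a nested map from row key to column family to column qualifier to timestamped values. The sketch below is illustrative only; the row keys, families, and values are made up:

```python
# Illustrative only: HBase's logical data model as nested maps.
# row key -> column family -> column qualifier -> timestamp -> value
table = {
    "row-001": {
        "info": {                      # column family "info" (hypothetical)
            "name": {1434412800: "Alice"},
            "city": {1434412800: "Lubbock",
                     1431734400: "Dallas"},  # older version kept by timestamp
        },
        "stats": {                     # second column family (hypothetical)
            "visits": {1434412800: "42"},
        },
    },
}
# Reading a cell = a lookup by (row key, family:qualifier, timestamp)
latest = max(table["row-001"]["info"]["city"])
print(table["row-001"]["info"]["city"][latest])   # -> "Lubbock"
```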
HBase: Keys and Column Families
Each record is divided into Column
Families
What is Apache Hive?
• Apache Hive is data warehouse software that facilitates querying and managing large datasets residing in distributed storage
• Built on top of Apache Hadoop, it provides tools for easy data extract/transform/load (ETL)
• It supports analysis of large datasets stored in Hadoop's HDFS
• It supports an SQL-like language called HQL, as well as big data analytics with the help of MapReduce
What is HQL?
• HQL: Hive Query Language
• Doesn't conform to any ANSI standard
• Very close to the MySQL dialect, but with some differences
• SQL-to-HQL cheat sheet: http://hortonworks.com/wp-content/uploads/downloads/2013/08/Hortonworks.CheatSheet.SQLtoHive.pdf
• HQL doesn't support transactions, so don't compare it with an RDBMS; a small query sketch follows
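For a feel of the dialect, here is a hedged sketch that passes an HQL query to the standard "hive -e" command; the table and column names are hypothetical:

```python
# Sketch: running an HQL query with the "hive -e" CLI.
# Table and column names are made up; assumes Hive is installed.
import subprocess

hql = """
SELECT category, COUNT(*) AS n, AVG(price) AS avg_price
FROM   sales                  -- a hypothetical Hive table
WHERE  year = 2015
GROUP BY category
ORDER BY n DESC
LIMIT  10;
"""
result = subprocess.run(["hive", "-e", hql],
                        capture_output=True, text=True)
print(result.stdout)   # rows are computed by MapReduce jobs under the hood
```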
HADOOP ETL
List of Tools
Sqoop
Flume
Impala
Chukwa
Kettle
(Diagram: the ETL process.)
Sqoop
Short for "SQL to Hadoop".
Used to move data back and forth between an RDBMS and HDFS for performing analysis using BI tools.
A simple command-line tool (Sqoop 2 brings a web interface as well); a typical import is sketched below.
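As an illustration, a typical Sqoop 1 import driven from Python; the connection string, table, and paths are placeholders, not a tested recipe:

```python
# Sketch: a typical Sqoop 1 import from MySQL into HDFS.
# Connection string, table, and credentials are hypothetical.
import subprocess

subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/sales",  # hypothetical source database
    "--username", "etl_user",
    "--password-file", "/user/etl/.pw",        # avoids a plain-text password
    "--table", "orders",                       # table to import
    "--target-dir", "/user/demo/orders",       # destination in HDFS
    "-m", "4",                                 # run 4 parallel mappers
], check=True)
```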
How Sqoop Works
(Diagram: the dataset is split into slices, and each slice is transferred in parallel by its own mapper.)
Sqoop 1 & Sqoop 2
Feature: Connectors for all major RDBMSs
Sqoop 1: Supported.
Sqoop 2: Not supported. Workaround: use the generic JDBC connector, which has been tested on Microsoft SQL Server, PostgreSQL, MySQL, and Oracle. This connector should work on any other JDBC-compliant database, although performance might not be comparable to that of the specialized connectors in Sqoop.

Feature: Encryption of stored passwords
Sqoop 1: Not supported. No workaround.
Sqoop 2: Supported, using Derby's on-disk encryption. Disclaimer: although expected to work in the current version of Sqoop 2, this configuration has not been verified.

Feature: Data transfer from RDBMS to Hive or HBase
Sqoop 1: Supported.
Sqoop 2: Not supported. Workaround: follow a two-step approach. 1. Import data from the RDBMS into HDFS. 2. Load the data into Hive or HBase manually, using appropriate tools and commands such as the LOAD DATA statement in Hive.

Feature: Data transfer from Hive or HBase to RDBMS
Sqoop 1: Not supported. Workaround: follow a two-step approach. 1. Extract data from Hive or HBase into HDFS (as a text or Avro file). 2. Use Sqoop to export the output of the previous step to the RDBMS.
Sqoop 2: Not supported. Follow the same workaround as for Sqoop 1.
Sqoop 1 & Sqoop 2 Architecture
For more on the differences: https://www.youtube.com/watch?v=xzU3HL4ZYI0
What is Flume ?
Flume is a distributed, reliable service used for gathering, aggregating, and transporting large amounts of streaming event data for analysis.
Event data: streaming log data (website/application logs, to analyze user activity) or other streaming data (e.g., social media, to analyze an event; stock prices, to analyze a stock's performance). A minimal agent configuration is sketched below.
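To show how an agent is wired together, the sketch below writes out a minimal Flume configuration (source to channel to sink); the agent and component names (a1, r1, c1, k1) are arbitrary, and the setup is an assumption rather than a production recipe:

```python
# Sketch: a minimal Flume agent config (netcat source -> memory channel -> logger sink).
# Component names are arbitrary; start it with something like:
#   flume-ng agent --conf conf --conf-file demo.conf --name a1
conf = """
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

a1.channels.c1.type = memory

a1.sinks.k1.type = logger

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
"""
with open("demo.conf", "w") as f:
    f.write(conf)
```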
Architecture and Working
Impala – An open-source SQL query engine
Developed by Cloudera and fully open source, hosted on GitHub.
Released as a beta in October 2012; version 1.0 became available in May 2013.
About Impala
What is Chukwa?
Chukwa is an open-source data collection system for monitoring large distributed systems.
Used for log collection and analysis.
Built on top of the Hadoop Distributed File System (HDFS) and the MapReduce framework.
Not a streaming database.
Not a real-time system.
Why do we need Chukwa?
Data monitoring and analysis.
◦ To collect system metrics and log files.
◦ To store data in Hadoop clusters.
Uses MapReduce to analyze data.
◦ Robust
◦ Scalable
◦ Rapid data processing
How Does It Work?
Data Analysis
ETL Tools

Sqoop
Features: bulk import; direct input; data interaction; data export.
Advantages: parallel data transfer; efficient data analysis.
Disadvantages: not easy to manage installations and configurations.

Flume
Features: fan out; fan in; processors; auto-batching of events; multiplexing channels for data mining.
Advantages: reliable, scalable, manageable, customizable, high performance; feature-rich and fully extensible; contextual routing.
Disadvantages: has to weaken some delivery guarantees.

Kettle
Features: migrating data between applications or databases; exporting data from databases to flat files; loading data massively into databases; data cleansing; integrating applications.
Advantages: higher level than code; a well-tested full suite of components; data analysis tools; free.
Disadvantages: not running fast; takes some time to install.
Building a Data Warehouse in Hadoop Using ETL Tools
1. Copy data into HDFS with an ETL tool (e.g., Informatica), Sqoop, or Flume, into standard HDFS files (write once). This registers the metadata with HCatalog.
2. Declare the query schema in Hive or Impala, which doesn't require data copying or re-loading, thanks to the schema-on-read advantage of Hadoop compared with the schema-on-write constraint of an RDBMS (see the sketch after this list).
3. Explore with SQL queries and launch BI tools, e.g., Tableau or BusinessObjects, for exploratory analytics.
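A hedged sketch of step 2: declaring an external Hive table over files already sitting in HDFS. The table, columns, and HDFS path are hypothetical:

```python
# Sketch: schema-on-read in Hive -- declare a table over files already in HDFS.
# No data is copied or re-loaded; names and the HDFS path are placeholders.
import subprocess

hql = """
CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
    ts      STRING,
    user_id STRING,
    url     STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
LOCATION '/user/demo/web_logs';   -- files dropped here become queryable at once
"""
subprocess.run(["hive", "-e", hql], check=True)
```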
HADOOP DATA MINING
What is Mahout?
Meaning: a person who keeps and drives an elephant (an Indian term).
Mahout is a scalable open-source machine learning library hosted by Apache.
Mahout's core algorithms are implemented on top of Apache Hadoop using the MapReduce paradigm.
Mahout’s position
MapReduce flow in Mahout
What is H2O?
H2O scales statistics, machine learning, and math over big data.
H2O is extensible, and users can build blocks using simple math "Legos" in the core.
H2O keeps familiar interfaces like R, Excel, and JSON so that big data enthusiasts and experts can explore, merge, model, and score datasets using a range of simple to advanced algorithms.
H2O makes it fast and easy to derive insights from your data through faster and better predictive modeling.
H2O has a vision of online scoring and modeling in a single platform.
H2O
How is H2O different from Mahout?
H2O: Can use any of R, REST/JSON, GUI (browser), Java, or Scala. | Mahout: Can use Java.
H2O: A GUI product with fewer algorithms. | Mahout: More algorithms, which require knowledge of Java.
H2O: Algorithms are typically 100x faster than the current MapReduce-based Mahout. | Mahout: Algorithms are typically slower compared to H2O.
H2O: Knowledge of Java is NOT required to develop a prediction model. | Mahout: Knowledge of Java is required to develop a prediction model.
H2O: Real time. | Mahout: Not real time.
H2O
Users of H2O
• Predictive modeling factories – better marketing with H2O
• Advertising technology – better conversions with H2O
• Risk & fraud analysis – better detection with H2O
• Customer intelligence – better sales with H2O
MAP/REDUCE
ALGORITHM
How to write a MapReduce
program
Parallelization is the key
The algorithm differs from that of a single-server application
◦ Map function
◦ Reduce function
Considerations
◦ Load balance
◦ Efficiency
◦ Memory management
MapReduce Executes
Schematic of a map-reduce
computation
Example: counting the number of occurrences of each word in a collection of documents
The input file is a repository of documents, and each document is an element. The Map function for this example uses keys of type String (the words) and integer values. The Map task reads a document and breaks it into its sequence of words w1, w2, . . . , wn. It then emits a sequence of key-value pairs in which the value is always 1. That is, the output of the Map task for this document is the sequence of key-value pairs:
(w1, 1), (w2, 1), . . . , (wn, 1)
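In code, the Map function just described is only a few lines. This is a plain-Python sketch of the logic, not tied to any particular Hadoop API:

```python
# Plain-Python sketch of the Map function for word count:
# one document in, a (word, 1) pair out for every word.
def map_word_count(document):
    for word in document.split():
        yield (word, 1)

print(list(map_word_count("to be or not to be")))
# [('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]
```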
Map Task
A single Map task will typically process many documents, so its output will be more than the sequence for the one document suggested above. If a word w appears m times among all the documents assigned to that process, then there will be m key-value pairs (w, 1) among its output.
After all the Map tasks have completed successfully, the master controller merges the files from each Map task that are destined for a particular Reduce task and feeds the merged file to that process as a sequence of key/list-of-values pairs. That is, for each key k, the input to the Reduce task that handles key k is a pair of the form (k, [v1, v2, . . . , vn]), where (k, v1), (k, v2), . . . , (k, vn) are all the key-value pairs with key k coming from all the Map tasks.
Reduce Task
The output of the Reduce function is a sequence
of zero or more key-value pairs.
The Reduce function simply adds up all the values.
The output of a reducer consists of the word and
the sum. Thus, the output of all the Reduce tasks
is a sequence of (w,m) pairs, where w is a word
that appears at least once among all the input
documents and m is the total number of
occurrences of w among all those documents.
The application of the Reduce function to a single
key and its associated list of values is referred to
as a reducer.
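Continuing the plain-Python sketch: a simple grouping step stands in for the framework's merge/shuffle, and the Reduce function sums each key's list of 1s:

```python
# Plain-Python sketch: the shuffle groups values by key,
# then Reduce sums each key's list of 1s.
from collections import defaultdict

def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)      # builds (k, [v1, v2, ..., vn])
    return grouped.items()

def reduce_word_count(word, counts):
    return (word, sum(counts))          # the (w, m) output pair

pairs = [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]
print([reduce_word_count(w, vs) for w, vs in shuffle(pairs)])
# [('to', 2), ('be', 2), ('or', 1), ('not', 1)]
```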
Big Data Visualization
and Tools
Big Data Visualization and Tools
Tools :
Tableau
Pentaho
Mondrian
Saiku
Spotfire
Gephi
Tableau
What is Tableau?
Tableau is a visual analysis solution that allows people to explore and analyze data with simple drag-and-drop operations.
Tableau
Tableau Alliance Partners
Tableau
What is Pentaho?
Pentaho is commercial open-source software for Business Intelligence (BI).
Pentaho has been developed since 2004 in Orlando, Florida.
Pentaho provides comprehensive reporting, OLAP analysis, dashboards, data integration, data mining, and a BI platform.
It is built on the Java platform.
It runs well on various platforms (Windows, Linux, Macintosh, Solaris, Unix, etc.).
It offers a complete package covering reporting, ETL for data warehouse management, an OLAP server, data mining, and dashboards.
The BI platform supports Pentaho's end-to-end business intelligence capabilities and provides central access to your business information, with back-end security, integration, scheduling, auditing, and more.
Designed to meet the needs of any size of organization.
A few facts
Analyzer
Reports
Overall Features
HADOOP ON YOUR LAPTOP
Hortonworks Background
Hortonworks is a business computer software company based in Palo Alto, California.
Hortonworks supports and develops the Apache Hadoop framework, which allows distributed processing of large data sets across clusters of computers.
They are sponsors of the Apache Software Foundation.
Founded in June 2011 by Yahoo and Benchmark Capital as an independent company; it went public in December 2014.
Companies that have collaborated with Hortonworks include:
Microsoft (October 2011), to develop for Azure & Windows Server
Informatica (November 2011), to develop HParser
Teradata (February 2012), to develop the Aster Data system
SAP AG (September 2012), which announced it would resell the Hortonworks distribution
They do Hadoop using HDP
(Chart: network data usage, before HDP vs. with HDP.)
Hortonworks Data Platform
Hortonworks' product, the Hortonworks Data Platform (HDP), includes Apache Hadoop and is used for storing, processing, and analyzing large volumes of data.
It includes Apache projects such as HDFS, MapReduce, Pig, Hive, HBase, ZooKeeper, and other components.
Why was it developed?
It was developed with one aim: to make Apache Hadoop ready for the enterprise.
What does it do?
It takes the big data components of Apache Hadoop and makes them ready for prime-time use in an enterprise environment.
HDP Functional Areas
Certified Technology Program
One of the most important aspects of the Technology Partner Program is the certification of partner technologies with HDP.
The Hortonworks Certified Technology Program simplifies big data planning by providing pre-built and validated integrations between leading enterprise technologies and the Hortonworks Data Platform (HDP):
YARN Ready Certification,
Security Ready,
Operations Ready,
Governance Ready.
More details: http://hortonworks.com/partners/certified/
How to get HDP?
HDP is architected, developed, and built completely in the open. Anyone can download it for free from http://hortonworks.com/hdp/downloads/.
It comes in different versions that can be used as needed:
HDP 2.2 on Sandbox – runs on VirtualBox or VMware
Automated (Ambari) – RHEL/Ubuntu/CentOS/SLES
Manual – RHEL/Ubuntu/CentOS/SLES
Windows – Windows Server 2008 & 2012
Installing HDP
(Screenshot: the sandbox console displays the IP address to use for logging in from the browser.)
DEMO – HDP
Below are the steps we will perform in HDP:
Start HDP
Upload a source file
Load the file into HCatalog
Pig basic tutorial
About Cloudera
Cloudera is “The commercial Hadoop company”
Founded by leading experts on Hadoop from
Facebook, Google, Oracle and Yahoo
Provides consulting and training services for
Hadoop users
Staff includes several committers to Hadoop
projects
Who uses Cloudera?
Cloudera Software (All Open-Source)
Cloudera’s Distribution including Apache
Hadoop (CDH)
– A single, easy-to-install package from the Apache Hadoop
core repository
– Includes a stable version of Hadoop, plus critical bug fixes
and solid new features from the development version
Components
– Apache Hadoop
– Apache Hive
– Apache Pig
– Apache HBase
– Apache ZooKeeper
– Flume, Hue, Oozie, and Sqoop
CDH and Enterprise Ecosystem
Beyond Hadoop
Hadoop is incapable of handling OLTP tasks because of its latency.
Alibaba has developed its own distributed system instead of using Hadoop. Currently, it takes Alipay's system 20 ms to process a payment transaction, but 200 ms for fraud detection.
◦ "What was the transaction volume of Double Eleven 2014? While most people were still sound asleep, the Double Eleven frenzy was already under way. In the early hours of November 11, Tmall's staggering Double Eleven shopping festival opened; this year payments peaked at 790,000 successful transactions per minute, versus 200,000 per minute last year, a fourfold increase."
12306.cn has replaced its old system with the VMware vFabric GemFire in-memory database system. This makes its services stable and robust.
HaaS (Hadoop as a Service)
HaaS example: Amazon Web Services (AWS) – Amazon Elastic MapReduce (EMR) provides a Hadoop-based platform for data analysis, with S3 as the storage system and EC2 as the compute system.
Microsoft HDInsight, Cloudera CDH3, IBM InfoSphere BigInsights, EMC Greenplum HD, and the Windows Azure HDInsight Service are the primary HaaS offerings from the global IT giants.
APPENDIX 1: HADOOP
ECOLOGICAL SYSTEM
Choosing the Right Hadoop Architecture
Application dependent
Too many solution providers
Too many choices
Teradata Big Data Platform
Dell’s Hadoop ecosystem
Nokia’s Big Data Architechture
Cloudera’s Hadoop System
Intel
Comparison of Two Generations of
Hadoop
Different Components of Hadoop
APPENDIX 2: MATRIX
CALCULATION
Map/Reduce Matrix Multiplication
Map/Reduce – Scheme 1, Step 1
Map/Reduce – Scheme 1, Step 2
Map/Reduce – Scheme 2, Oneshot
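The scheme itself appears as a figure on this slide. As a concrete illustration, here is a minimal plain-Python sketch of the one-shot approach, assuming sparse matrices M (I x J) and N (J x K) given as (row, column, value) triples; the function name and the toy example are made up:

```python
# Sketch of one-shot MapReduce matrix multiplication P = M * N,
# where M is I x J and N is J x K, stored as sparse (row, col, value) triples.
# Map: m_ij -> ((i, k), ('M', j, m_ij)) for every result column k;
#      n_jk -> ((i, k), ('N', j, n_jk)) for every result row i.
# Reduce: for each result cell (i, k), sum m_ij * n_jk over matching j.
from collections import defaultdict

def one_shot_multiply(M, N, I, K):
    grouped = defaultdict(list)          # stands in for the shuffle
    for (i, j, v) in M:                  # Map over M's nonzero entries
        for k in range(K):
            grouped[(i, k)].append(("M", j, v))
    for (j, k, v) in N:                  # Map over N's nonzero entries
        for i in range(I):
            grouped[(i, k)].append(("N", j, v))
    P = {}
    for cell, values in grouped.items(): # Reduce, one call per result cell
        m = {j: v for tag, j, v in values if tag == "M"}
        n = {j: v for tag, j, v in values if tag == "N"}
        s = sum(m[j] * n[j] for j in m if j in n)
        if s:
            P[cell] = s
    return P

# Toy example: M = [[1, 2], [0, 3]], N = [[4, 0], [5, 6]]
M = [(0, 0, 1), (0, 1, 2), (1, 1, 3)]
N = [(0, 0, 4), (1, 0, 5), (1, 1, 6)]
print(one_shot_multiply(M, N, I=2, K=2))
# {(0, 0): 14, (0, 1): 12, (1, 0): 15, (1, 1): 18}
```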
Communication Cost
The communication cost of an algorithm is the sum of the communication costs of all the tasks implementing that algorithm.
In addition to the time needed to execute a task, it also includes the time for moving data into memory.
◦ The algorithm executed by each task tends to be very simple, often linear in the size of its input.
◦ The typical interconnect speed for a computing cluster is one gigabit per second; at that rate, for instance, moving a 64 MB chunk takes roughly half a second.
◦ The time taken to move the data from a chunk into main memory may exceed the time needed to operate on the data.
Reducer size
The reducer size is the upper bound on the number of values allowed to appear in the list associated with a single key. Reducer size can be selected with at least two goals.
◦ By making the reducer size small, we can force there to be many reducers, according to which the problem input is divided by the Map tasks.
◦ We can choose a reducer size sufficiently small that we are certain the computation associated with a single reducer can be executed entirely in the main memory of the compute node where its Reduce task is located.
The running time will be greatly reduced if we can avoid having to move data repeatedly between main memory and disk.
Replication rate
The replication rate is the number of key-value pairs produced by all the Map tasks on all the inputs, divided by the number of inputs. That is, it is the average communication from Map tasks to Reduce tasks (measured by counting key-value pairs) per input, as restated below.
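Formalized (a small restatement of the definition above):

```latex
% Replication rate r:
r = \frac{\text{number of key-value pairs produced by all Map tasks}}
         {\text{number of inputs}}
```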
Segmenting Matrix to Reduce the Cost
Map/Reduce – Scheme 3
Map/Reduce – Scheme 4, Step 1
Map/Reduce – Scheme 4, Step 2