Slides - Zhangxi Lin - Texas Tech University


ISQS 6339, Business Intelligence
Hadoop & MapReduce
Zhangxi Lin
Texas Tech University
Outline
• Big data ecology
• Review of Hadoop
• MapReduce Algorithm
• The Hadoop Ecological System
• Appendix
  ◦ Examples of MapReduce
REVIEW OF HADOOP
Questions before viewing the videos
• What is Hadoop?
• What is MapReduce?
• Why did they become a major solution for coping with the big data problem?
Videos of Hadoop
• Challenges Created by Big Data. 8’51”
  ◦ Published on Apr 10, 2013. This video explains the challenges created by big data that Hadoop addresses efficiently. You will learn why the traditional enterprise model fails to address the Variety, Volume, and Velocity challenges created by big data, and why the creation of Hadoop was required.
  ◦ http://www.youtube.com/watch?v=cA2btTHKPMY
• Hadoop Architecture. 14’27”
  ◦ Published on Mar 24, 2013
  ◦ http://www.youtube.com/watch?v=YewlBXJ3rv8
• History Behind Creation of Hadoop. 6’29”
  ◦ Published on Apr 5, 2013. This video talks about the brief history behind the creation of Hadoop: how Google invented the technology, how it went into Yahoo, how Doug Cutting and Michael Cafarella created Hadoop, and how it went to Apache.
  ◦ http://www.youtube.com/watch?v=jA7kYyHKeX8
Hadoop – for BI in the Cloud Era
• Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment.
• Hadoop makes it possible to run applications on systems with thousands of nodes involving thousands of terabytes.
• Hadoop was inspired by Google's MapReduce, a software framework in which an application is broken down into numerous small parts. Doug Cutting, Hadoop's creator, named the framework after his child's stuffed toy elephant.
Apache Hadoop
The Apache Hadoop framework is composed of the following modules:
◦ Hadoop Common – contains libraries and utilities needed by other Hadoop modules
◦ Hadoop Distributed File System (HDFS)
◦ Hadoop YARN – a resource-management platform responsible for managing compute resources in clusters and using them to schedule users' applications
◦ Hadoop MapReduce – a programming model for large-scale data processing
MapReduce
MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster or a grid.
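In outline, the user supplies only two functions, Map and Reduce; the framework handles distribution, grouping by key, and fault tolerance. The following single-machine Python sketch is illustrative only (Hadoop's real API is Java, and the names `run_mapreduce`, `map_fn`, and `reduce_fn` are invented for this example); it shows the essential data flow:

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, map_fn, reduce_fn):
    """Simulate the MapReduce data flow on a single machine.

    map_fn(record)          -> iterable of (key, value) pairs
    reduce_fn(key, values)  -> iterable of (key, result) pairs
    """
    # Map phase: apply map_fn to every input record.
    pairs = [kv for record in records for kv in map_fn(record)]
    # Shuffle/sort phase: the framework groups all values by key.
    pairs.sort(key=itemgetter(0))
    output = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        values = [v for _, v in group]
        # Reduce phase: combine the list of values for each key.
        output.extend(reduce_fn(key, values))
    return output
```

For instance, `run_mapreduce([1, 2, 3, 4], lambda x: [(x % 2, x)], lambda k, vs: [(k, sum(vs))])` groups the inputs by parity and sums each group.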
How Hadoop Operates
Hadoop 2: Big data's big leap forward
• The new Hadoop is the Apache Foundation's attempt to create a whole new general framework for the way big data can be stored, mined, and processed.
• The biggest constraint on scale has been Hadoop’s job handling. All jobs in Hadoop are run as batch processes through a single daemon called JobTracker, which creates a scalability and processing-speed bottleneck.
• Hadoop 2 uses an entirely new job-processing framework built using two daemons: ResourceManager, which governs all jobs in the system, and NodeManager, which runs on each Hadoop node and keeps the ResourceManager informed about what's happening on that node.
MapReduce 2.0 – YARN (Yet Another Resource Negotiator)
Apache Spark
• An open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. In contrast to Hadoop's two-stage disk-based MapReduce paradigm, Spark's in-memory primitives provide performance up to 100 times faster for certain applications.
• Spark requires a cluster manager and a distributed storage system. For the cluster manager, Spark supports standalone (native Spark cluster), Hadoop YARN, or Apache Mesos. For distributed storage, Spark can interface with a wide variety of systems, including the Hadoop Distributed File System (HDFS), Cassandra, OpenStack Swift, and Amazon S3.
• In February 2014, Spark became an Apache Top-Level Project. Spark had over 465 contributors in 2014.
– Source: http://en.wikipedia.org/wiki/Apache_Spark
MAP/REDUCE ALGORITHM
Videos - MapReduce
• Intro To MapReduce. 9’08”
  ◦ Published on Mar 1, 2013. Intro to MapReduce concepts. Explores the flow of a MapReduce program.
  ◦ http://www.youtube.com/watch?v=HFplUBeBhcM
• Hadoop Map Reduce Part 1. 4’21”
  ◦ Published on Mar 20, 2012
  ◦ http://www.youtube.com/watch?v=dVqaz2j2kII
Distributed File Systems (DFS) Implementations
• Files are divided into chunks, typically 64 megabytes in size. Chunks are replicated three times, at three different compute nodes located on different racks.
• To find the chunks of a file, the master node or name node is used. The master node is itself replicated.
• Three standards:
  ◦ The Google File System (GFS), the original of the class.
  ◦ Hadoop Distributed File System (HDFS), an open-source DFS used with Hadoop, an implementation of MapReduce, distributed by the Apache Software Foundation.
  ◦ CloudStore, an open-source DFS originally developed by Kosmix.
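To make the chunking and replication numbers concrete, here is a small Python sketch of the scheme described above. The round-robin rack placement is a simplification invented for illustration; real DFS placement policies are more involved:

```python
CHUNK_SIZE = 64 * 2**20   # 64 MB, the typical chunk size
REPLICATION = 3           # each chunk is stored three times

def plan_chunks(file_size, racks):
    """Split a file into chunks and assign each chunk's replicas to racks.

    `racks` is a list of rack names; at least REPLICATION racks are needed
    so that every replica of a chunk can land on a different rack.
    """
    num_chunks = -(-file_size // CHUNK_SIZE)   # ceiling division
    plan = []
    for i in range(num_chunks):
        # Rotate through racks so replicas of one chunk never share a rack.
        replicas = [racks[(i + r) % len(racks)] for r in range(REPLICATION)]
        plan.append((i, replicas))
    return plan
```

For example, a 200 MB file yields four chunks, each placed on three distinct racks.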
Map/Reduce Execution
Example 1: counting the number of occurrences of each word in a collection of documents
• The input file is a repository of documents, and each document is an element. The Map function for this example uses keys that are of type String (the words) and values that are integers. The Map task reads a document and breaks it into its sequence of words w1, w2, . . . , wn. It then emits a sequence of key-value pairs where the value is always 1. That is, the output of the Map task for this document is the sequence of key-value pairs:

  (w1, 1), (w2, 1), . . . , (wn, 1)
Map Task
• A single Map task will typically process many documents. Thus, its output will be more than the sequence for the one document suggested above. If a word w appears m times among all the documents assigned to that process, then there will be m key-value pairs (w, 1) among its output.

Grouping and Aggregation
• After all the Map tasks have completed successfully, the master controller merges the files from each Map task that are destined for a particular Reduce task and feeds the merged file to that process as a sequence of key-list-of-values pairs. That is, for each key k, the input to the Reduce task that handles key k is a pair of the form (k, [v1, v2, . . . , vn]), where (k, v1), (k, v2), . . . , (k, vn) are all the key-value pairs with key k coming from all the Map tasks.
Reduce Task
• The output of the Reduce function is a sequence of zero or more key-value pairs.
• The Reduce function simply adds up all the values. The output of a reducer consists of the word and the sum. Thus, the output of all the Reduce tasks is a sequence of (w, m) pairs, where w is a word that appears at least once among all the input documents and m is the total number of occurrences of w among all those documents.
• The application of the Reduce function to a single key and its associated list of values is referred to as a reducer.
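Putting the three stages of Example 1 together, here is a minimal single-machine Python simulation. The sample documents are made up for illustration; a real Hadoop job would distribute these stages across nodes:

```python
from collections import defaultdict

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Map: each document is broken into words w1..wn, emitting (w, 1) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Group: the master controller merges pairs into (k, [v1, ..., vn]).
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: sum the list of 1s for each word, giving the (w, m) pairs.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
# word_counts == {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

Note that "the" appears m = 3 times across the documents, so three (the, 1) pairs are emitted in the Map phase and the reducer for key "the" receives the list [1, 1, 1].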
Combiner
Reducers, Reduce Tasks, Compute Nodes, and Skew
HADOOP ECOLOGICAL SYSTEM
Choosing the right Hadoop architecture
• Application dependent
• Too many solution providers
• Too many choices
Videos
• The Evolution of the Apache Hadoop Ecosystem | Cloudera. 8’11”
  ◦ Published on Sep 6, 2013. Hadoop co-founder Doug Cutting explains how the Hadoop ecosystem has expanded and evolved into a much larger Big Data platform with Hadoop at its center.
  ◦ http://www.youtube.com/watch?v=eo1PwSfCXTI
• A Hadoop Ecosystem Overview. 21’54”
  ◦ Published on Jan 10, 2014. This is a technical overview explaining the Hadoop ecosystem. As part of this presentation, we chose to focus on the HDFS, MapReduce, Yarn, Hive, Pig and HBase software components.
  ◦ http://www.youtube.com/watch?v=kRnh3WpcKXo
• Working in the Hadoop Ecosystem. 10’40”
  ◦ Published on Sep 5, 2013. Mark Grover, a Software Engineer at Cloudera, talks about working in the Hadoop ecosystem.
  ◦ http://www.youtube.com/watch?v=nbUsY9tj-pM
Cloudera’s Hadoop System
Comparison of Two Generations of Hadoop
Different Components of Hadoop
Pivotal Big Data Product - OSS
• Greenplum was a big data analytics company headquartered in San Mateo, California. Its products include Unified Analytics Platform, Data Computing Appliance, Analytics Lab, Database, HD, and Chorus. It was acquired by EMC Corporation in July 2010, and then became part of Pivotal Software in 2012.
• Pivotal GemFire is a distributed data management platform designed for many diverse data management situations, but is especially useful for high-volume, latency-sensitive, mission-critical, transactional systems.
• Pivotal Software, Inc. (Pivotal) is a software company based in Palo Alto, California that provides software and services for the development of custom applications for data and analytics based on cloud computing technology. Pivotal Software is a spin-out and joint venture of EMC Corporation and its subsidiary VMware that combined software products, employees, and lines of business from the two parent companies, including Greenplum, Cloud Foundry, Spring, Pivotal Labs, GemFire, and other products from the VMware vFabric Suite.
2015 Team-Topic

1. Data warehousing (Hadoop Data warehouse design)
   Components: HDFS, HBase, HIVE, NoSQL/NewSQL, Solr | Team: DW1 | Schedule: 4/7
2. Publicly available big data services (Tools and free resources)
   Components: Hortonworks, CloudEra, HaaS, EC2 | Team: DW2 | Schedule: 4/9
3. MapReduce & Data mining (Efficiency of distributed data/text mining)
   Components: Mahout, H2O, R, Python | Team: DW3 | Schedule: 4/14
4. Big data ETL-1 (1) Heterogeneous data processing across platforms; 2) System management)
   Components: 1) Kettle, Flume, Sqoop, Impala; 2) Oozie, ZooKeeper, Ambari, Loom, Ganglia | Team: DW4 | Schedule: 4/16
5. Big data ETL-2 (1) Heterogeneous data processing across platforms; 2) System management)
   Components: 1) Kettle, Flume, Sqoop, Impala; 2) Oozie, ZooKeeper, Ambari, Loom, Ganglia | Team: DW5 | Schedule: 4/21
6. Application development platform (1) Algorithms and innovative development environments; 2) Load balancing)
   Components: Tomcat, Neo4J, Pig, Hue | Team: DW6 | Schedule: 4/23
7. Tools & Visualizations (Features for big data visualization and data utilization)
   Components: Pentaho, Tableau, Saiku, Mondrian, Gephi | Team: DW7 | Schedule: 4/28
8. Streaming data processing (Efficiency and effectiveness of real-time data processing)
   Components: Spark, Storm, Kafka, Avro | Schedule: 5/5
2014 Topics

Topic | Components | Team# | Presentation
Data warehousing | HDFS, HBase, HIVE, NoSQL | 5 | 4/8
Data mining | Mahout, R | 1 | 4/10
System management | Oozie, ZooKeeper | 3 | 4/15
ETL | Kettle, Flume, Sqoop | 8 | 4/17
Programming platform | Pig, Hue, Python, Tomcat, Jetty, Neo4J | 2 | 4/22
Streaming data processing | Storm, Kafka, Avro | 4 | 4/24
Information retrieval | Impala, Pentaho, Solr | 6 | 4/29
Visualization | Saiku, Mondrian, Gephi, Ganglia | 7 | 5/1
Project team’s tasks
1) collect the information
2) study the product
3) present the topic for 40-50 minutes
with 20-25 slides.
Explain the position of your topic in Hadoop
ecological system
Main challenges and solutions
Products and their features
Optional deliverables with
extra credit:
1) implementation demonstration
2) research papers
3) proposal for
research/implementation ideas that
demonstrates the creativeness.
Application demonstrations
Your comments
3 questions for your classmates
4) provide the references, including
tutorial materials, videos, web links
5) report outcomes with documents in 510 pages.
47
Appendix
MORE HADOOP CHARTS
Vision of Data Flow
Real-time Data Processing
Application Perspective of Hadoop
Appendix
MATRIX CALCULATION
Map/Reduce Matrix Multiplication
Map/Reduce – Scheme 1, Step 1
Map/Reduce – Scheme 1, Step 2
Map/Reduce – Scheme 2, One-shot
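The slides above are diagrams. As a textual companion, here is a single-machine Python sketch of a one-shot Map/Reduce matrix multiplication in the spirit of Scheme 2. Since the slide content itself is not in the transcript, this follows the standard formulation rather than the slides' exact notation: for P = M × N, element m_ij of M is sent to every key (i, k), element n_jk of N to every key (i, k), and the reducer for key (i, k) pairs up values on matching j and sums the products. The function name and data layout are invented for the example:

```python
from collections import defaultdict

def matmul_mapreduce(M, N):
    """One-shot Map/Reduce multiplication of P = M x N (M: IxJ, N: JxK)."""
    I, J, K = len(M), len(N), len(N[0])
    # Map phase: each matrix element is emitted under every output key
    # (i, k) it contributes to, tagged with its matrix of origin and j.
    grouped = defaultdict(list)
    for i in range(I):
        for j in range(J):
            for k in range(K):
                grouped[(i, k)].append(("M", j, M[i][j]))
    for j in range(J):
        for k in range(K):
            for i in range(I):
                grouped[(i, k)].append(("N", j, N[j][k]))
    # Reduce phase: for key (i, k), match m_ij with n_jk on j and sum.
    P = [[0] * K for _ in range(I)]
    for (i, k), values in grouped.items():
        m_vals = {j: v for tag, j, v in values if tag == "M"}
        n_vals = {j: v for tag, j, v in values if tag == "N"}
        P[i][k] = sum(m_vals[j] * n_vals[j] for j in m_vals)
    return P
```

The two-step scheme trades this all-at-once shuffle for a smaller first-pass join on j followed by an aggregation pass, which can reduce communication when the matrices are large and sparse.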