Slides - Zhangxi Lin - Texas Tech University


Big Data
Zhangxi Lin
Texas Tech University
HADOOP & SPARK
Hadoop Cases
• 10 Cases: http://hadoopilluminated.com/hadoop_illuminated/Hadoop_Use_Cases.html
• 7 Cases: http://www.mrc-productivity.com/blog/2015/06/7-real-life-usecases-of-hadoop/
• A Case Study of Hadoop in Healthcare: http://www.bigdataeverywhere.com/files/chicago/BDELeadingaHealthcareCaseStudy-QURAISHI.pdf
HADOOP/SPARK
Distributed business intelligence
• LAMP: Linux, Apache, MySQL, PHP/Perl/Python
• Hadoop
• MapReduce
• HDFS
• NoSQL
• ZooKeeper
• Storm
• Deal with big data – the open & distributed approach
ISQS 3358 BI
Videos of Hadoop
• Challenges Created by Big Data. 8’51”
  Published on Apr 10, 2013. This video explains the challenges created by big data that Hadoop addresses efficiently. You will learn why the traditional enterprise model fails to address the Variety, Volume, and Velocity challenges created by Big Data, and why the creation of Hadoop was required.
  http://www.youtube.com/watch?v=cA2btTHKPMY
• Hadoop Architecture. 14’27”
  Published on Mar 24, 2013.
  http://www.youtube.com/watch?v=YewlBXJ3rv8
• History Behind Creation of Hadoop. 6’29”
  Published on Apr 5, 2013. This video talks about the brief history behind the creation of Hadoop: how Google invented the technology, how it went into Yahoo, how Doug Cutting and Michael Cafarella created Hadoop, and how it went to Apache.
  http://www.youtube.com/watch?v=jA7kYyHKeX8
Apache Hadoop
• The Apache Hadoop framework is composed of the following modules:
• Hadoop Common – contains libraries and utilities needed by other Hadoop modules.
• Hadoop Distributed File System (HDFS).
• Hadoop YARN – a resource-management platform responsible for managing compute resources in clusters and using them to schedule users' applications.
• Hadoop MapReduce – a programming model for large-scale data processing.
ISQS 6339, Data Mgmt & BI
• Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment.
• Hadoop makes it possible to run applications on systems with thousands of nodes involving thousands of terabytes.
• Hadoop was inspired by Google's MapReduce, a software framework in which an application is broken down into numerous small parts. Doug Cutting, Hadoop's creator, named the framework after his child's stuffed toy elephant.
Hadoop – for BI in Cloudera
MapReduce
MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster or a grid.
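The model above can be sketched on a single machine in plain Python. This is an illustration of the programming model only, not the Hadoop API: real Hadoop distributes mappers and reducers across nodes, and the function names here are ours.

```python
from collections import defaultdict

def map_phase(documents):
    # Mapper: emit a (word, 1) pair for every word in every input record
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle/sort: the framework groups all emitted values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: combine the grouped values -- here, sum the counts per word
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data needs big tools", "hadoop processes big data"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
```

Because each mapper call and each reducer key is independent, the framework can run them on different nodes; that independence is what makes the problem "parallelizable."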
Hadoop 2: Big data's big leap forward
• The new Hadoop is the Apache Foundation's attempt to create a whole new general framework for the way big data can be stored, mined, and processed.
• The biggest constraint on scale has been Hadoop’s job handling. All jobs in Hadoop are run as batch processes through a single daemon called JobTracker, which creates a scalability and processing-speed bottleneck.
• Hadoop 2 uses an entirely new job-processing framework built using two daemons: ResourceManager, which governs all jobs in the system, and NodeManager, which runs on each Hadoop node and keeps the ResourceManager informed about what's happening on that node.
MapReduce 2.0 – YARN (Yet Another Resource Negotiator)
The fundamental idea of YARN is to split up the functionalities of resource management and job scheduling/monitoring into separate daemons. The idea is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM).
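The split between a global ResourceManager and per-application ApplicationMasters can be caricatured in a few lines of Python. This is a toy sketch of the idea only: real YARN daemons are separate processes negotiating over RPC, and all class and method names here are invented.

```python
class ResourceManager:
    """Global daemon: tracks free cluster capacity and grants containers."""
    def __init__(self, total_containers):
        self.free = total_containers

    def request_containers(self, n):
        # Grant as many containers as are available, up to the request
        granted = min(n, self.free)
        self.free -= granted
        return granted

class ApplicationMaster:
    """Per-application daemon: negotiates containers from the RM for its job."""
    def __init__(self, app_name, rm):
        self.app_name = app_name
        self.rm = rm

    def run(self, needed):
        # In real YARN this is an RPC negotiation; here it is a method call
        return self.rm.request_containers(needed)

rm = ResourceManager(total_containers=10)
granted1 = ApplicationMaster("job-1", rm).run(needed=7)
granted2 = ApplicationMaster("job-2", rm).run(needed=7)  # only 3 containers remain
```

The point of the design is visible even in the toy: job-specific logic lives in each ApplicationMaster, so the global ResourceManager only arbitrates capacity and no longer bottlenecks on per-job bookkeeping the way JobTracker did.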
Comparison of Two Generations of Hadoop
Apache Spark
• An open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. In contrast to Hadoop's two-stage disk-based MapReduce paradigm, Spark's in-memory primitives provide performance up to 100 times faster for certain applications.
• Spark requires a cluster manager and a distributed storage system. For the cluster manager, Spark supports standalone (native Spark cluster), Hadoop YARN, or Apache Mesos. For distributed storage, Spark can interface with a wide variety of systems, including Hadoop Distributed File System (HDFS), Cassandra, OpenStack Swift, and Amazon S3.
• In February 2014, Spark became an Apache Top-Level Project. Spark had over 465 contributors in 2014.
– Source: http://en.wikipedia.org/wiki/Apache_Spark
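Part of Spark's speedup comes from chaining lazy, in-memory transformations instead of writing intermediate results to disk between MapReduce stages. A toy Python stand-in conveys the RDD semantics (this is not the real Spark API; the class name and implementation are ours):

```python
class MiniRDD:
    """Toy stand-in for a Spark RDD: transformations are lazy, actions execute."""
    def __init__(self, data):
        self.data = data  # here just a Python iterable held in memory

    def map(self, f):
        # Transformation: returns a new MiniRDD; nothing is computed yet
        return MiniRDD(f(x) for x in self.data)

    def filter(self, pred):
        # Transformation: also lazy
        return MiniRDD(x for x in self.data if pred(x))

    def collect(self):
        # Action: forces the whole pipeline to evaluate in one pass
        return list(self.data)

result = MiniRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0).collect()
```

In real Spark the same chain (`sc.parallelize(range(10)).map(...).filter(...).collect()`) is planned across a cluster, with intermediate data kept in memory rather than spilled to HDFS between stages.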
Apache Spark
• Apache Spark is an open source cluster computing framework. Originally developed at the University of California, Berkeley, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.
• What is Spark? 25’27”
https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch01.html
Components of Spark Ecosystem
• Shark (SQL)
• Spark Streaming (Streaming)
• MLlib (Machine Learning)
• GraphX (Graph Computation)
• SparkR (R on Spark)
• BlinkDB (Approximate SQL)
• MLlib is an open source library built by the people who built Spark, mainly inspired by the scikit-learn library.
• H2O is a free library built by the company H2O. It is actually a standalone library, which can be integrated with Spark with the 'Sparkling Water' connector.
Scala
• Scala is a general-purpose programming language. Scala has full support for functional programming and a very strong static type system. Designed to be concise, many of Scala's design decisions were inspired by criticism of the shortcomings of Java.
• The name Scala is a portmanteau of "scalable" and "language".
• Spark is written in Scala.
• The design of Scala started in 2001 at the École Polytechnique Fédérale de Lausanne (EPFL) by Martin Odersky, following on from work on Funnel, a programming language combining ideas from functional programming and Petri nets. Odersky had previously worked on Generic Java and javac, Sun's Java compiler.
• After an internal release in late 2003, Scala was released publicly in early 2004 on the Java platform, and on the .NET platform in June 2004. A second version (v2.0) followed in March 2006. The .NET support was officially dropped in 2012.
• On 17 January 2011 the Scala team won a five-year research grant of over €2.3 million from the European Research Council. On 12 May 2011, Odersky and collaborators launched Typesafe Inc., a company to provide commercial support, training, and services for Scala. Typesafe received a $3 million investment in 2011 from Greylock Partners.
Cloudera’s Hadoop System
Comparison between big data platform and traditional BI platform

Layer           | Hadoop/Spark                                                          | Traditional DW
Applications    | Pentaho, Tableau, QlikView; R, Scala, Python, Pig; Mahout, H2O, MLlib | SAS, SPSS, SSRS; SSAS
Data management | HBase/Hive, GraphX, Neo4J                                             | SSMS
ETL             | Kettle, Flume, Sqoop, Impala                                          | SSIS
Data Source     | HDFS, NoSQL, NewSQL                                                   | Flat files, OLE DB, Excel, mails, FTP, etc.
Topics

No. | Topic                                | Focus                                                      | Components
1   | Data warehousing                     | Hadoop data warehouse design                               | HDFS, HBase, Hive, NoSQL/NewSQL, Solr
2   | Publicly available big data services | Tools and free resources                                   | Hortonworks, Cloudera, HaaS, EC2, Spark
3   | MapReduce & data mining              | Efficiency of distributed data/text mining                 | Mahout, H2O, MLlib, R, Python
4   | Big data ETL                         | Heterogeneous data processing across platforms             | Kettle, Flume, Sqoop, Impala
5   | System management                    | Load balancing and system efficiency                       | Oozie, ZooKeeper, Ambari, Loom, Ganglia, Mesos
6   | Application development platform     | Algorithms and innovative development environments         | Tomcat, Neo4J, Titan, GraphX, Pig, Hue
7   | Tools & visualizations               | Features for big data visualization and data utilization   | Pentaho, Tableau, Qlik, Saiku, Mondrian, Gephi
8   | Streaming data processing            | Efficiency and effectiveness of real-time data processing  | Spark, Storm, Kafka, Avro
HADOOP VS. SPARK
Will Spark replace Hadoop?
• Hadoop is not a single product; it is an ecosystem. The same is true for Spark.
• MapReduce can be replaced with Spark Core. Yes, it can be replaced over time, and this replacement seems reasonable. But Spark is not yet mature enough to fully replace this technology, and no one will completely give up on MapReduce unless all the tools that depend on it support an alternative execution engine.
• Hive can be replaced with Spark SQL. Yes, this is again true. But you should understand that Spark SQL is even younger than Spark itself; this technology is less than a year old. At the moment it can only toy with the mature Hive technology; I will look back at this in 1.5–2 years. As you may remember, 2–3 years ago Impala was the "Hive killer," but now both technologies live together and Impala still hasn't killed Hive.
• Storm can be replaced with Spark Streaming. Yes, it can, but to be fair Storm is not a piece of the Hadoop ecosystem, as it is a completely independent tool. They target somewhat different computational models, so I don't think Storm will disappear, but it will continue to live as a niche product.
• Mahout can be replaced with MLlib. To be fair, Mahout is already losing the market, and over the last year it became obvious that this tool will soon be dropped from the market. Here you can really say that Spark replaced something from the Hadoop ecosystem.
INSTALL HADOOP/SPARK
Platform Installation
• Install VirtualBox 5.0.x
• https://www.virtualbox.org/wiki/Downloads
• Install Hadoop (need 10GB+ disk space)
• CloudEra CDH 5.5, or
• HortonWorks Sandbox 2.4
• Install Spark
• Windows: https://spark.apache.org/downloads.html
• Mac OS X: http://genomegeek.blogspot.com/2014/11/how-to-install-apache-spark-on-mac-os-x.html
Hortonworks Data Platform
• Install & Setup Hortonworks
• Download, Mar 25, 2015, 9’26”
• Hortonworks Sandbox, May 2, 2013, 8’35”
• Install Hortonworks Sandbox 2.0, Sep 1, 2014, 22’23”
• Setup Hortonworks Sandbox with Virtualbox VM, Nov 20,
2013, 24’25”
• Download & Install http://hortonworks.com/hdp/downloads/
Debugging
• Error after upgrade: VT-x is not available. (VERR_VMX_NO_VMX)
• https://forums.virtualbox.org/viewtopic.php?f=6&t=58820&sid=a1f50f7a44da06187cf5468e43a656e5&start=30
Install CloudEra’s QuickStart
Install Spark - Windows
• Download Spark from https://spark.apache.org/downloads.html
• Download Spark: spark-1.6.0-bin-hadoop2.6.tgz
• Follow the installation steps in this video:
• https://www.youtube.com/watch?v=KvQto_b3sqw
• Change the command prompt directory to the downloads folder by typing
  cd C:\Users\......
Mac OS X Yosemite
• http://genomegeek.blogspot.com/2014/11/how-to-install-apache-spark-on-mac-os-x.html
Install Spark – Mac OS X
• Install Java
  - Download Oracle Java SE Development Kit 7 or 8 at the Oracle JDK downloads page.
  - Double-click on the .dmg file to start the installation.
  - Open up the terminal.
  - Type java -version; it should display the following:
    java version "1.7.0_71"
    Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
    Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)
• Set JAVA_HOME
  export JAVA_HOME=$(/usr/libexec/java_home)
• Install Homebrew
  ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
• Install Scala
  brew install scala
• Set SCALA_HOME
  export SCALA_HOME=/usr/local/bin/scala
  export PATH=$PATH:$SCALA_HOME/bin
• Download Spark from https://spark.apache.org/downloads.html
  tar -xvzf spark-1.1.1.tgz
  cd spark-1.1.1
• Fire up Spark
  For the Scala shell: ./bin/spark-shell
  For the Python shell: ./bin/pyspark
• Run Examples
  Calculate Pi:
  ./bin/run-example org.apache.spark.examples.SparkPi
  MLlib Correlations example:
  ./bin/run-example org.apache.spark.examples.mllib.Correlations
  MLlib Linear Regression example:
  ./bin/spark-submit --class org.apache.spark.examples.mllib.LinearRegression examples/target/scala-*/spark-*.jar data/mllib/sample_linear_regression_data.txt
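For intuition, the SparkPi example above estimates π by Monte Carlo sampling. The same calculation can be sketched in plain, single-machine Python (the function name and fixed seed are ours; SparkPi spreads the sampling loop across the cluster):

```python
import random

def estimate_pi(samples, seed=42):
    # Monte Carlo: the fraction of random points in the unit square that
    # land inside the quarter circle approximates pi/4
    rng = random.Random(seed)  # fixed seed for a reproducible run
    inside = sum(1 for _ in range(samples)
                 if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * inside / samples

pi_est = estimate_pi(100_000)  # approaches 3.14159 as samples grow
```

Each sample is independent of the others, which is why this workload parallelizes trivially and makes a good first cluster example.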
• References:
• How to install Spark on Mac OS X: http://ondrejkvasnovsky.blogspot.com/2014/06/how-to-install-spark-on-mac-os-x.html
• How To Set $JAVA_HOME Environment Variable On Mac OS X: http://www.mkyong.com/java/how-to-set-java_home-environment-variable-on-mac-os-x/
• Homebrew – The missing package manager for OS X: http://brew.sh