Spark - Shark Data Analytics Stack on a Hadoop Cluster

Transcript Spark - Shark Data Analytics Stack on a Hadoop Cluster

Spark - Shark Data Analytics Stack on a
Hadoop Cluster
April 22, 2013
April 23, 2013
Michael Malak
Data Analytics Senior Engineer at Time Warner Cable
Technicaltidbit.com
Chris Deptula
Senior Big Data Consultant
317.840.2935
[email protected]
@chrisdeptula
http://www.openbi.com
Michael Walker
Managing Partner
720.373.2200
[email protected]
http://www.rosebt.com
Agenda
• The Big Data Problem
• Spark Ecosystem
• NFL Data Science Use Case
• Visualizing Data
The Big Data Problem
Speed Kills in Data Science
Hype Cycle for Emerging Tech 2012
Hype Cycle for Big Data 2012
Evolution of DW Architecture
Emerging DW Architecture
Next-Generation Data Architecture
Big Data Ecosystem Parts
DW Database Systems MQ 2013
Total Enterprise Data Growth 2005-2015
Structured vs Unstructured Data
Modern DW/BI Analytical Ecosystems
Big Data Ecosystem Parts
The Internet of Things
Big Data 4 V's
New World of Databases
New World of Databases
Hadoop
Hadoop
Hadoop
Big Data Vendor Focused on Hadoop and
NoSQL Revenue 2012
Big Data Analytics Infrastructure
The Spark Ecosystem
Agenda
• What Hadoop gives us
• What everyone is complaining about in 2013
• Spark
o
o
o
o
o
o
Berkeley Team
BDAS (Berkeley Data Analytics Stack)
RDDs (Resilient Distributed Datasets)
Shark
Spark Streaming
Other Spark subsystems
Global Big Data Apr 23,
2013
technicaltidbit.com
31
What Hadoop Gives Us
•
•
HDFS
Map/Reduce
Global Big Data Apr 23,
2013
technicaltidbit.com
32
Hadoop: HDFS
Image from mark.chmarny.com
Global Big Data Apr 23,
2013
technicaltidbit.com
33
Hadoop: Map/Reduce
Image from blog.octo.com
Image from people.apache.org/~rdonkin
Global Big Data Apr 23,
2013
technicaltidbit.com
34
Map/Reduce Tools
Pig Script
HiveQL
Pig
Hive
Hbase App
Hadoop
Linux
Global Big Data Apr 23,
2013
technicaltidbit.com
35
Hadoop Distribution Dogs in the Race
Hadoop Distribution
Query Tool
Apache Drill
Stinger
Global Big Data Apr 23,
2013
technicaltidbit.com
36
Other Open Source Solutions
•
•
Druid
Spark
Global Big Data Apr 23,
2013
technicaltidbit.com
37
Not just caching, but streaming
•
•
•
1st generation: HDFS
2nd generation: Caching & “Push” Map/Reduce
3rd generation: Streaming
Global Big Data Apr 23,
2013
technicaltidbit.com
38
•
•
•
•
•
Berkeley Team
40 students
8 faculty
3 staff software
engineers
Silicon Valley style
skunkworks office
space
2 years into 6 year
program
Global Big Data Apr 23,
2013
Image from Ian Stoica’s slides from Strata 2013 presentation
technicaltidbit.com
39
BDAS
(Berkeley Data Analytics Stack)
Bagel App
Shark App
Spark Streaming
App
Bagel
Shark
Spark Streaming
Spark App
Spark
Hadoop/HDFS
Mesos
Linux
Global Big Data Apr 23,
2013
technicaltidbit.com
40
RDDs
(Resilient Distributed Dataset)
Image from Matei Zaharia’s paper
Global Big Data Apr 23,
2013
technicaltidbit.com
41
RDDs: Laziness
x => x.startsWith(“ERROR”)
lines = spark.textFile(“hdfs://...”)
errors = lines.filter(_.startsWith(“ERROR”))
.map(_.split(‘\t’)(2))
All Lazy
.filter(_.contains(“foo”))
cnt = errors.count
Action!
Global Big Data Apr 23,
2013
technicaltidbit.com
42
RDDs: Transformations vs. Actions
Transformations
Actions
map(func)
filter(func)
flatMap(func)
sample(withReplacement,
frac, seed)
union(otherDataset)
groupByKey[K,V](func)
reduceByKey[K,V](func)
join[K,V,W](otherDataset)
cogroup[K,V,W1,W2](other1,
other2)
cartesian[U](otherDataset)
sortByKey[K,V]
reduce(func)
collect()
count()
take(n)
first()
saveAsTextFile(path)
saveAsSequenceFile(path)
foreach(func)
[K,V] in Scala same as
<K,V> templates in C++,
Java
Global Big Data Apr 23,
2013
technicaltidbit.com
43
Hive vs. Shark
HDFS files
Global Big Data Apr 23,
2013
HiveQL
HiveQL
Shark
HDFS files
technicaltidbit.com
+
RDDs
44
Shark: Copy from HDFS to RDD
CREATE TABLE wiki_small_in_mem TBLPROPERTIES
("shark.cache" = "true") AS SELECT * FROM wiki;
CREATE TABLE wiki_cached AS SELECT * FROM wiki;
Creates a table that is stored in a cluster’s memory using
RDD.cache().
Global Big Data Apr 23,
2013
technicaltidbit.com
45
Shark: Just a Shim
Shark
Images from Reynold Xin’s presentation
Global Big Data Apr 23,
2013
technicaltidbit.com
46
What about “Big Data”?
PB
Shark Effectiveness
TB
GB
MB
KB
Global Big Data Apr 23,
2013
technicaltidbit.com
47
Median Hadoop job input size
Image from Reynold Xin’s presentation
Global Big Data Apr 23,
2013
technicaltidbit.com
48
Spark Streaming: Motivation
x1,000,000 clients
Global Big Data Apr 23,
2013
HDFS
technicaltidbit.com
49
Spark Streaming: DStream
•
“A series of small batches”
{{“id”: “hercman”},
“eventType”: “buyGoods”}}
{{“id”: “hercman”},
“eventType”: “buyGoods”}}
{{“id”: “shewolf”},
“eventType”: “error”}}
{{“id”: “shewolf”},
“eventType”: “error”}}
2 sec
RDD
2 sec
RDD
2 sec
...
RDD
{{“id”: “catlover”},
“eventType”: “buyGoods”}}
{{“id”: “hercman”},
“eventType”: “logOff”}}
DStream
Global Big Data Apr 23,
2013
technicaltidbit.com
50
Spark Streaming: DAG
DStream
.filter(
_.eventType=
=
“error”)
Kafka
DStream[String
] (JSON)
Dstream
.transform
The DAG
Global Big Data Apr 23,
2013
Dstream
.foreach(
println)
technicaltidbit.com
Dstream
.filter(
_.eventType=
=
“buyGoods”)
Dstream
.foreach(
println)
Dstream
.map((_.id,1))
Dstream
.groupByKey
51
Spark Streaming: Example Code
// Initialize
val ssc = new StreamingContext(“mesos://localhost”, “games”, Seconds(2), …)
val msgs = ssc.kafkaStream[String](prm, topic, StorageLevel.MEMORY_AND_DISK)
// DAG
val events:Dstream[evObj] = messages.transform(rdd => rdd.map(new evObj(_))
val errorCounts = events.filter(_.eventType == “error”)
errorCounts.foreach(rdd => println(rdd.count))
val usersBuying = events.filter(_.eventType == “buyGoods”).map((_.id,1))
.groupByKey
usersBuying.foreach(rdd => println(rdd.count))
// Go
ssc.start
Global Big Data Apr 23,
2013
technicaltidbit.com
52
Stateful Spark Streaming
Class ErrorsPerUser(var numErrors:Int=0) extends Serializable
val updateFunc = (values:Seq[evObj], state:Option[ErrorsPerUser]) => {
if (values.find(_.eventType == “logOff”) == None)
None
else {
values.foreach(e => {
e.eventType match { “error” => state.numErrors += 1 }
})
Option(state)
}
}
// DAG
val events:Dstream[evObj] = messages.transform(rdd => rdd.map(new evObj(_))
val errorCounts = events.filter(_.eventType == “error”)
val states = errorCounts.map((_.id,1))
.updateStateByKey[ErrorsPerUser](updateFunc)
// Off-DAG
states.foreach(rdd => println(“Num users experiencing errors:” + rdd.count))
Global Big Data Apr 23,
2013
technicaltidbit.com
53
Other Spark Subsystems
•
•
•
Bagel (similar to Google Pregel)
Sparkler (Matrix decomposition)
(Machine Learning)
Global Big Data Apr 23,
2013
technicaltidbit.com
54
Teaser
•
Global Big Data Apr 23,
2013
Future Meetup: Machine
learning from real-time data
streams
technicaltidbit.com
55
Data Science NFL Use-Case
Speed Kills - In Data Science and NFL
Speed Kills: Up-Tempo - No Huddle
Observation: best college and NFL offenses are using
an up-tempo - no huddle strategy.
Hypothesis: NFL teams using the up-tempo - no
huddle strategy have the best winning records.
Data Science Formula
Data Science Formula
Data science processes include:
1) Information gathering
2) Re-representation of the information in a schema that
aids analysis
3) Development of insight through the manipulation of this
representation
4) Creation of some knowledge product or direct action
based on the insight
Data Science Formula
Information > Schema > Insight > Product
The data analysis may be organized in two key loops:
1) Searching loop (seeking, extracting, filtering information)
2) Understanding loop (modeling and conceptualization
from a schema that best fits the evidence)
Data Collection
Collect data on all NFL offenses
Collect data on NFL offenses using an up-tempo - no
huddle strategy
Collect data on NFL team records (win-losses)
Data Comparison
Compare data on all NFL offenses with data on NFL
offenses using an up-tempo - no huddle strategy
Data Science Tools
Scientific methods
Analytical techniques
Machine learning techniques
Algorithm design and execution
Data visualization and story-telling
Statistics
Math
Computer engineering
Data mining
Data modeling
Predictive, Descriptive, Prescriptive
Analytics
There are three types of data analysis:
Descriptive (business intelligence and data mining)
Predictive (forecasting)
Prescriptive (optimization and simulation)
Predictive Analytics
Predictive Modeling Techniques
Predictive Analytics
Prescriptive Analytics
Regression Analysis
NFL Data Sources
2002-2009 NFL PLAY BY PLAY
http://www.infochimps.com/tags/nfl
Supplemented stats source:
http://www.databasefootball.com/
http://www.pro-football-reference.com/
Data Results
From 2001 to 2012 only one NFL team used the up-tempo
- no huddle offense consistently:
The New England Patriots
Note: In college football many teams started using the uptempo - no huddle offense consistently in 2009. One
team stands out:
The Oregon Ducks
Data Model - DVOA
DVOA (Defense-adjusted Value Over Average) for total
offense as well as rushing and passing offense
separated. All numbers are adjusted to an average
schedule of opponents and an average percentage of
fumbles recovered by the offense.
Exceptions are the three columns marked NONADJUSTED. Rushing includes all rushing, not just
running backs.
Designed by Aaron Schatz of Football Outsiders
2001 Offensive Efficiency Ratings
2002 Offensive Efficiency Ratings
2003 Offensive Efficiency Ratings
2004 Offensive Efficiency Ratings
2005 Offensive Efficiency Ratings
2006 Offensive Efficiency Ratings
2007 Offensive Efficiency Ratings
2008 Offensive Efficiency Ratings
2009 Offensive Efficiency Ratings
2011 Offensive Efficiency Ratings
2012 Offensive Efficiency Ratings
Yr to Yr Correlation of Offense 2002 - 2009
All-Time DVOA Lists
All-Time DVOA Lists
NFL Team Plays per Game 2003
NFL Team Plays per Game 2004
NFL Team Plays per Game 2005
NFL Team Plays per Game 2006
NFL Team Plays per Game 2007
NFL Team Plays per Game 2008
NFL Team Plays per Game 2009
NFL Team Plays per Game 2011
NFL Team Plays per Game 2012
Team Records
Total Points
Average Points per Game
Total Plays
Average Plays per Game
Data Results
From 2001 to 2009 the New England Patriots had 107 wins
and 37 losses - the second best record in NFL (to
Mannings Ind).
They appeared in five (5) - winning three (3) super bowls.
From 2001 to 2012 the New England Patriots had 146 wins
and 46 losses - the best record in NFL.
Data Results
In 128 games NE scored a total of 3,356 points (26.2 ppg)
Second to Mannings Colts.
NE runs up-tempo - no -huddle offense
IND runs no-huddle offense
Data Results
NE in 5 SBs - wins 3 SBs (up-tempo - no -huddle offense)
IND in 2 SBs - wins 1 SB (no -huddle offense)
NE Stats 2001 - 2012
Yr - W/L - Plpg/R - Off R - Tot pps
2001 11-5 62.5/18
6 371 - Super Bowl Champs
2002 9-7 64.4/8
16 381
2003 14-2 66.4/1
12 348 - Super Bowl Champs
2004 14-2 64.3/7
2005 10-6 63.7/15
2006 12-4 66.4/4
2007
2008
2009
2010
16-0
11-5
10-6
14-2
65.8/3
68.4/1
67/2
62.6/21
2011 13-3 67.4/2
2012 12-4 74.3/1
4
10
7
437
379
385
- Super Bowl Champs
1
8
6
1
589
410
427
518
- Lost SB
3
1
513 - Lost SB
557
Denver Broncos Stats 2011 - 2012
Yr - W/L - Plpg/R - Off R - Tot pps
2011 8-8
63.6/16 25
2012 13-3 69.2/4
2
309
481
Buffalo Bills - 1990-1993
Yr - W/L - Plpg/R - Off R - Tot pps
1990
1991
1992
1993
13-3
13-3
11-5
12-4
58.2/21
66/2
67.9/1
67.3/3
1
2
3
7
428
458
381
329
- Lost SB
- Lost SB
- Lost SB
- Lost SB
Oregon Ducks 2009-2012
Yr - W/L - Plpg/R - Off R - Tot pps
2009 10-3 68.1/54
2010 12-1 78.8/5
8
1
468 - Lost Rose B - Final Rank 11
611 - Lost BCS Ch - Final Rank 3
2011 12-2 72.5/32
2012 12-1 81.4/9
3
2
644 - Won Rose B - Final Rank 4
637 - Won Fiesta B - Final Rank 2
Forecasting Principles - Framework
Stages of Forecasting
Forecasting Methods Selection Chart
Forecasting Methodology Tree
Change in Offensive Play Calling
Ave Pass Attempts per Game 35 Yrs
Completion % 35 yrs
Interception % 35 yrs
Passing: Return vs. Risk
QB Ratings 35 yrs
Series 1st Down Likelyhood on 1st and 10
Offensive Points by Yards
Season Wins by Passing Efficiency
Plays per Game per Year
Speed Kills: Up-Tempo - No Huddle
Prediction: More NFL offenses will utilize the uptempo - no huddle strategy.
Prediction: More NFL offenses will pass.
Forecast: NFL teams using the up-tempo - no huddle
strategy will have the best winning records.
Visualizing Data
When does a Data Scientists job end?
•
•
•
Data Scientists must be able to tell stories with their
findings.
The audience may not understand regression analysis,
weighted ranks, etc.
Must be able to present findings in a clear, concise, and
easy to look at manner.
Which would your manager prefer?
VS
Big Data Visualization Tools
Pentaho
Tableau
Jaspersoft
Pervasive
Birst
Many many more
Pentaho
An open source full suite of data integration and analytics
tools
- Data Integration
- Pixel perfect reports
- OLAP cubes
- Dashboards
Works with many Big Data sources including Hadoop, Hive,
HBase, Cassandra, MongoDB, and Couchbase along with
traditional data sources.
OLAP engine automatically generates SQL using
Mondrian.
Pentaho
Has integration with WEKA, but no other statistical
languages.
Unfortunately like all other user driven visualization tools
today data must be extracted to memory or a database.
Pentaho makes doing this easy.
Commercial edition includes the capability to automate this
process called Instaview.
Work on creating support for Hive SQL.
Demo
Demo creating visualizations
Speed Kills
Thank You
Presentation by:
Michael Malak
Chris Deptula
Michael Walker

Spark - Shark Data Analytics Stack on a Hadoop Cluster

Transcript Spark - Shark Data Analytics Stack on a Hadoop Cluster

Directory