Spark – An Introduction
Berkeley Data Analytics Stack
Prof. Harold Liu
15 December 2014
Data Processing Goals
• Low latency (interactive) queries on historical data: enable faster decisions
  – E.g., identify why a site is slow and fix it
• Low latency queries on live data (streaming): enable decisions on real-time data
  – E.g., detect & block worms in real time (a worm may infect 1 million hosts in 1.3 s)
• Sophisticated data processing: enable “better” decisions
  – E.g., anomaly detection, trend analysis
Today’s Open Analytics Stack…
• …mostly focused on large on-disk datasets: great for batch but slow
[Stack diagram: Application / Data Processing / Storage / Infrastructure]
Goals
[Diagram: overlapping Batch, Interactive, and Streaming computations – “One stack to rule them all!”]
• Easy to combine batch, streaming, and interactive computations
• Easy to develop sophisticated algorithms
• Compatible with existing open source ecosystem (Hadoop/HDFS)
Support Interactive and Streaming Comp.
• Aggressive use of memory
• Why?
  1. Memory transfer rates >> disk or SSDs
  2. Many datasets already fit into memory
     – Inputs of over 90% of jobs in Facebook, Yahoo!, and Bing clusters fit into memory
     – e.g., 1TB = 1 billion records @ 1KB each
  3. Memory density (still) grows with Moore’s law
     – RAM/SSD hybrid memories on the horizon
[Diagram: high-end datacenter node – 16 cores; 128–512GB RAM at 40–60GB/s; 1–4TB SSD at 14GB/s (x4 disks); 10–30TB disk at 0.21GB/s (x10 disks); 10Gbps network]
Support Interactive and Streaming Comp.
• Increase parallelism
• Why?
  – Reduce work per node → improve latency
• Techniques:
  – Low-latency parallel schedulers that achieve high locality
  – Optimized parallel communication patterns (e.g., shuffle, broadcast)
  – Efficient recovery from failures and straggler mitigation
[Diagram: the same query returns its result in time Tnew < T once the work is spread over more nodes]
Support Interactive and Streaming Comp.
• Trade between result accuracy and response time
• Why?
  – In-memory processing does not guarantee interactive query processing
    • e.g., ~10s of seconds just to scan 512GB of RAM!
    • The gap between memory capacity and transfer rate keeps growing: on a 16-core node with 128–512GB of RAM at 40–60GB/s, capacity doubles every 18 months but bandwidth only every 36 months
• Challenges:
  – accurately estimate error and running time for…
  – … arbitrary computations
Berkeley Data Analytics Stack (BDAS)
• New apps: AMP-Genomics, Carat, …
[Stack diagram: Application / Data Processing / Data Management / Resource Management / Infrastructure]
• Data Processing: in-memory processing; trade between time, quality, and cost
• Data Management: efficient data sharing across frameworks
• Resource Management: share infrastructure across frameworks (multi-programming for datacenters)
Berkeley AMPLab
• “Launched” January 2011: 6-year plan
  – 8 CS faculty
  – ~40 students
  – 3 software engineers
• Organized for collaboration across Algorithms, Machines, and People
Berkeley AMPLab
• Funding:
  – XData, CISE Expedition Grant
  – Industrial, founding sponsors
  – 18 other sponsors, including …
• Goal: next generation of analytics data stack for industry & research:
  – Berkeley Data Analytics Stack (BDAS)
  – Released as open source
Berkeley Data Analytics Stack (BDAS)
• Existing stack components…
[Stack diagram: Data Processing – HIVE, Pig, HBase, Storm, MPI, Hadoop; Data Management – HDFS; Resource Management – empty so far]
Mesos
• Management platform that allows multiple frameworks to share a cluster
• Compatible with the existing open analytics stack
• Deployed in production at Twitter on 3,500+ servers
[Stack diagram: Mesos fills the Resource Management layer beneath Hadoop, HDFS, and the other frameworks]
Spark
• In-memory framework for interactive and iterative computations
  – Resilient Distributed Dataset (RDD): a fault-tolerant, in-memory storage abstraction
• Scala interface; Java and Python APIs
[Stack diagram: Spark joins Hadoop at the Data Processing layer]
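To make the RDD model concrete, here is a minimal word-count sketch against the Scala API described above; the app name, master URL, and input path are illustrative placeholders, not from the slides:

import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: word count with RDD transformations and one action.
// Master URL and input path are placeholders.
object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val counts = sc.textFile("data.txt")   // base RDD of lines
      .flatMap(_.split(" "))               // transformation: lines -> words
      .map(word => (word, 1))              // transformation: word -> (word, 1)
      .reduceByKey(_ + _)                  // transformation: sum per word
    counts.take(10).foreach(println)       // action: triggers execution
    sc.stop()
  }
}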
Spark Streaming [Alpha Release]
• Large-scale streaming computation
• Ensures exactly-once semantics
• Integrated with Spark: unifies batch, interactive, and streaming computations!
[Stack diagram: Spark Streaming sits on top of Spark in the Data Processing layer]
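As a minimal sketch of the streaming model, the classic DStream word count in Scala, assuming a line-oriented TCP source on a placeholder host and port:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Minimal sketch: word counts over 1-second micro-batches.
// "localhost"/9999 are placeholders for any line-oriented TCP source.
object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))   // 1-second batch interval
    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
    words.map((_, 1)).reduceByKey(_ + _).print()       // per-batch counts
    ssc.start()                                        // start receiving and processing
    ssc.awaitTermination()
  }
}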
Shark
• HIVE over Spark: SQL-like interface (supports Hive 0.9)
  – up to 100x faster for in-memory data, and 5–10x for on-disk data
  – In tests on a cluster of hundreds of nodes at …
[Stack diagram: Shark sits on top of Spark in the Data Processing layer]
Tachyon
• High-throughput, fault-tolerant in-memory storage
• Interface compatible with HDFS
• Support for Spark and Hadoop
[Stack diagram: Tachyon slots into the Data Management layer between HDFS and the processing frameworks]
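Because Tachyon exposes an HDFS-compatible interface, Spark jobs can read and write it through ordinary paths. A minimal sketch, where the Tachyon master address and file paths are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: Spark I/O through Tachyon's HDFS-compatible interface.
// Master address and paths are placeholders.
object TachyonIO {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TachyonIO"))
    val logs = sc.textFile("tachyon://master:19998/logs/input")     // read via HDFS-style path
    val errors = logs.filter(_.startsWith("ERROR"))
    errors.saveAsTextFile("tachyon://master:19998/logs/errors")     // write back to Tachyon
    sc.stop()
  }
}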
BlinkDB
• Large-scale approximate query engine
• Allows users to specify error or time bounds
• Preliminary prototype being tested at Facebook
[Stack diagram: BlinkDB joins the Data Processing layer on top of Shark and Spark]
SparkGraph
• GraphLab API and toolkits on top of Spark
• Fault tolerance by leveraging Spark
[Stack diagram: SparkGraph sits on top of Spark in the Data Processing layer]
MLlib
• Declarative approach to ML
• Develop scalable ML algorithms
• Make ML accessible to non-experts
[Stack diagram: MLbase joins the Data Processing layer on top of Spark]
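As a minimal sketch of scalable ML on Spark, k-means clustering with MLlib's Scala API; the input path, k, and iteration count are placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Minimal sketch: k-means clustering with MLlib.
// Input path and parameters are placeholders.
object KMeansSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("KMeansSketch"))
    val points = sc.textFile("data/points.txt")   // one space-separated vector per line
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .cache()                                    // iterative algorithm: cache the input
    val model = KMeans.train(points, 10, 20)      // k = 10 clusters, 20 iterations
    model.clusterCenters.foreach(println)
    sc.stop()
  }
}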
Compatible with Open Source Ecosystem
• Support existing interfaces whenever possible
[Stack diagram with compatibility call-outs: GraphLab API on SparkGraph; Hive interface and shell on Shark; HDFS API on Tachyon; Mesos as a compatibility layer so Hadoop, Storm, MPI, etc. can run over it]
Compatible with Open Source Ecosystem
• Use existing interfaces whenever possible
[Stack diagram with compatibility call-outs: Spark Streaming accepts inputs from Kafka, Flume, Twitter, TCP sockets, …; Shark supports the Hive API; Tachyon supports the HDFS API, S3 API, and Hive metadata]
Summary
• Supports interactive and streaming computations
  – In-memory, fault-tolerant storage abstraction, low-latency scheduling, …
• Easy to combine batch, streaming, and interactive computations
  – The Spark execution engine supports all three computation models
  [Diagram: Spark at the intersection of Batch, Interactive, and Streaming]
• Easy to develop sophisticated algorithms
  – Scala interface; APIs for Java, Python, Hive QL, …
  – New frameworks targeted at graph-based and ML algorithms
• Compatible with the existing open source ecosystem
• Open source (Apache/BSD) and fully committed to releasing high-quality software
  – Three-person software engineering team led by Matt Massie (creator of Ganglia, 5th Cloudera engineer)
Spark
In-Memory Cluster Computing for
Iterative and Interactive Applications
UC Berkeley
Background
• Commodity clusters have become an important computing
platform for a variety of applications
– In industry: search, machine translation, ad targeting, …
– In research: bioinformatics, NLP, climate simulation, …
• High-level cluster programming models like MapReduce
power many of these apps
• Theme of this work: provide similarly powerful abstractions
for a broader class of applications
Motivation
Current popular programming models for
clusters transform data flowing from stable
storage to stable storage
e.g., MapReduce:
[Diagram: Input → Map tasks → Reduce tasks → Output]
Motivation
• Acyclic data flow is a powerful abstraction, but is
not efficient for applications that repeatedly reuse a
working set of data:
– Iterative algorithms (many in machine learning)
– Interactive data mining tools (R, Excel, Python)
• Spark makes working sets a first-class concept to
efficiently support these apps
Spark Goal
• Provide distributed memory abstractions for clusters to
support apps with working sets
• Retain the attractive properties of MapReduce:
– Fault tolerance (for crashes & stragglers)
– Data locality
– Scalability
Solution: augment data flow model with
“resilient distributed datasets” (RDDs)
Programming Model
• Resilient distributed datasets (RDDs)
– Immutable collections partitioned across cluster that
can be rebuilt if a partition is lost
– Created by transforming data in stable storage using
data flow operators (map, filter, group-by, …)
– Can be cached across parallel operations
• Parallel operations on RDDs
– Reduce, collect, count, save, …
• Restricted shared variables
– Accumulators, broadcast variables
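The slides list accumulators and broadcast variables without showing them; here is a minimal sketch of both in Scala, using the classic Spark 1.x API contemporary with these slides (the stop-word filtering task is illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch of Spark's restricted shared variables.
object SharedVars {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SharedVars").setMaster("local[*]"))

    val stopWords = sc.broadcast(Set("the", "a", "of"))  // read-only copy shipped to each worker
    val dropped = sc.accumulator(0)                      // workers add, only the driver reads

    val words = sc.parallelize(Seq("the", "spark", "of", "rdd"))
    val kept = words.filter { w =>
      val keep = !stopWords.value.contains(w)
      if (!keep) dropped += 1
      keep
    }
    println(kept.collect().mkString(", "))   // spark, rdd
    println("dropped: " + dropped.value)     // 2 (read after the action has run)
    sc.stop()
  }
}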
Example: Log Mining
• Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")            // base RDD
errors = lines.filter(_.startsWith("ERROR"))    // transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()                   // cached RDD

cachedMsgs.filter(_.contains("foo")).count      // parallel operation
cachedMsgs.filter(_.contains("bar")).count
. . .

[Diagram: the driver ships tasks to workers; each worker scans its input block (Block 1–3), keeps the filtered results in its cache (Cache 1–3), and returns results to the driver]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
RDDs in More Detail
• An RDD is an immutable, partitioned, logical collection
of records
– Need not be materialized, but rather contains
information to rebuild a dataset from stable storage
• Partitioning can be based on a key in each record
(using hash or range partitioning)
• Built using bulk transformations on other RDDs
• Can be cached for future reuse
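As a sketch of key-based partitioning (the data and partition count are illustrative):

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Minimal sketch: hash-partition a pair RDD so equal keys co-locate.
object PartitionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PartitionSketch").setMaster("local[*]"))
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val byKey = pairs.partitionBy(new HashPartitioner(8)).cache()  // 8 hash partitions
    println(byKey.partitions.length)                               // 8
    sc.stop()
  }
}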
RDD Operations
Transformations (define a new RDD):
  map, filter, sample, union, groupByKey, reduceByKey, join, cache, …

Parallel operations / actions (return a result to the driver):
  reduce, collect, count, save, lookupKey, …
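Transformations are lazy: they only record how a new RDD is derived from its parents, and no work happens until an action requests a result. A minimal sketch (data and names illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: transformations build lineage lazily; actions execute it.
object LazyDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LazyDemo").setMaster("local[*]"))
    val nums = sc.parallelize(1 to 1000000)       // base RDD
    val evens = nums.filter(_ % 2 == 0)           // transformation: nothing runs yet
    val squares = evens.map(n => n.toLong * n)    // transformation: still nothing runs
    println(squares.reduce(_ + _))                // action: the whole pipeline executes
    println(squares.take(5).mkString(", "))       // action: executes again unless cached
    sc.stop()
  }
}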
RDD Fault Tolerance
• RDDs maintain lineage information that can be used to reconstruct lost partitions
• e.g.:
cachedMsgs = textFile(...).filter(_.contains("error"))
                          .map(_.split('\t')(2))
                          .cache()
[Lineage chain: HdfsRDD (path: hdfs://…) → FilteredRDD (func: contains(...)) → MappedRDD (func: split(…)) → CachedRDD]
Example 1: Logistic Regression
• Goal: find best line separating two sets of points
[Plot: two classes of points; a random initial line is iteratively adjusted toward the target separator]
Logistic Regression Code
val data = spark.textFile(...).map(readPoint).cache()
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final w: " + w)
Logistic Regression Performance
[Chart: Hadoop takes 127 s per iteration; Spark takes 174 s for the first iteration (loading and caching the data), then 6 s per further iteration]
Example 2: MapReduce
• MapReduce data flow can be expressed using RDD
transformations
res = data.flatMap(rec => myMapFunc(rec))
          .groupByKey()
          .map((key, vals) => myReduceFunc(key, vals))

Or with combiners:

res = data.flatMap(rec => myMapFunc(rec))
          .reduceByKey(myCombiner)
          .map((key, val) => myReduceFunc(key, val))
Other Spark Applications
• Twitter spam classification (Justin Ma)
• EM alg. for traffic prediction (Mobile Millennium)
• K-means clustering
• Alternating Least Squares matrix factorization
• In-memory OLAP aggregation on Hive data
• SQL on Spark (future work)
Conclusion
• By making distributed datasets a first-class primitive,
Spark provides a simple, efficient programming model for
stateful data analytics
• RDDs provide:
– Lineage info for fault recovery and debugging
– Adjustable in-memory caching
– Locality-aware parallel operations
• We plan to make Spark the basis of a suite of batch
and interactive data analysis tools