Getting Started with Spark - Xiamen University
INTRODUCTION TO SPARK
Tangzk
2015/7/16
OUTLINE
BDAS (the Berkeley Data Analytics Stack)
Spark
Other Components based on Spark
Introduction to Scala
Spark Programming Practices
Debugging and Testing Spark Programs
Learning Spark
Why are Previous MapReduce-Based Systems Slow?
Conclusion
BDAS (THE BERKELEY DATA ANALYTICS STACK)
SPARK
Master/Slave architecture
In-Memory Computing Platform
Resilient Distributed Datasets (RDDs) abstraction
DAG execution engine
Fault recovery using lineage
Supports interactive data mining
Interoperates with the Hadoop ecosystem
SPARK – IN-MEMORY COMPUTING
Hadoop: two-stage MapReduce topology
Spark: DAG executing topology
SPARK – RDD (RESILIENT DISTRIBUTED DATASETS)
Data Computing and Storage Abstraction
Records organized by Partition
Immutable (read-only)
Can only be created through transformations
Moves computation to the data instead of moving the data
Coarse-grained Programming Interfaces
Partition Reconstruction using lineage
SPARK – RDD: COARSE-GRAINED PROGRAMMING INTERFACES
Transformation: defines one or more new RDDs from existing ones.
Action: returns a value to the driver program.
Lazy evaluation: transformations are not computed until an action needs their result.
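The lazy contract can be mimicked with plain Scala, no cluster needed: a collection view records operations without running them, somewhat the way an RDD records its lineage, until a terminal operation forces evaluation. This is only an analogy to the RDD model, not Spark API.

```scala
// "Transformations" on a Scala view are recorded, not run:
var evaluated = 0
val lazyPipeline = (1 to 5).view
  .map { x => evaluated += 1; x * x } // not executed yet

assert(evaluated == 0)                // nothing has been computed so far

// An "action" forces the whole pipeline to run:
val result = lazyPipeline.sum         // 1 + 4 + 9 + 16 + 25
```

As with RDDs, the work happens only when a result is demanded, which lets the pipeline be optimized or recomputed as a whole.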
SPARK – RDD: LINEAGE GRAPH
Spark example:
Return the time fields of web GET accesses by "66.220.128.123" (the time field is number 3 in a tab-separated format)
logs = spark.textFile("hdfs://...")
accessesByIp = logs.filter(_.startsWith("66.220.128.123"))
accessesByIp.persist()
accessesByIp.filter(_.contains("GET"))
.map(_.split('\t')(3))
.collect()
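As a sketch of what this pipeline computes, the same chain runs on plain Scala collections, whose `filter`/`map` API the RDD interface deliberately mirrors. The sample log lines below are invented for illustration; only the field layout (time at index 3 of a tab-separated record) comes from the slide.

```scala
// Invented tab-separated log lines: ip \t method \t path \t time
val logs = Seq(
  "66.220.128.123\tGET\t/index.html\t2015-07-16T10:00:00",
  "66.220.128.123\tPOST\t/login\t2015-07-16T10:00:05",
  "10.0.0.1\tGET\t/index.html\t2015-07-16T10:00:07"
)

// Same chain as the RDD version, on local collections:
val accessesByIp = logs.filter(_.startsWith("66.220.128.123"))
val times = accessesByIp
  .filter(_.contains("GET"))
  .map(_.split('\t')(3))   // field number 3 is the time
```

In Spark, `persist()` would additionally cache `accessesByIp` in memory so both downstream queries reuse it instead of re-reading HDFS.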
SPARK - INSTALLATION
Running Spark
Local Mode
Standalone Mode
Cluster Mode: on YARN/Mesos
Supported programming languages
Scala
Java
Python
Spark Interactive Shell
OTHER COMPONENTS (1) – SHARK
SQL and Rich Analytics at Scale.
Partial DAG Execution (PDE) optimizes query planning at runtime.
Runs SQL queries up to 100× faster than Apache Hive, and machine learning programs up to 100× faster than Hadoop.
OTHER COMPONENTS (2) – SPARK STREAMING
Scalable, fault-tolerant stream computing framework
Discretized streams abstraction: separates the continuous dataflow into batches of input data
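A toy illustration of the discretized-stream idea in plain Scala (not the Spark Streaming API): cut an incoming sequence of events into small batches and process each batch with ordinary batch operations. The event values and batch size are made up.

```scala
// Pretend these values arrive over time as a continuous stream:
val stream = Seq(3, 1, 4, 1, 5, 9, 2, 6, 5)

// "Discretize" the stream into mini-batches (stands in for a batch interval):
val batchSize = 3
val batches = stream.grouped(batchSize).toList

// Run an ordinary batch computation on each mini-batch:
val perBatchSums = batches.map(_.sum)
```

Spark Streaming applies the same principle, so each mini-batch is just an RDD and inherits lineage-based fault recovery.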
OTHER COMPONENTS (3) – MLBASE
Distributed Machine Learning Library
OTHER COMPONENTS (4) – GRAPHX
Distributed Graph Computing framework
RDG (Resilient Distributed Graph) abstraction
Supports the Gather-Apply-Scatter model from GraphLab
GAS MODEL - GRAPHLAB
[Figure: the Gather-Apply-Scatter model in GraphLab. A master vertex gathers partial sums (Σ1…Σ4) from its mirrors on machines 1–4, applies the vertex update to produce Y′, and scatters Y′ back to the mirrors.]
From Jiang Wenrui’s thesis defense
INTRODUCTION TO SCALA(1)
Runs on the JVM (and .NET)
Full interoperability with Java
Statically typed
Object Oriented
Functional Programming
INTRODUCTION TO SCALA(2)
Declare a list of integers
val ints = List(1,2,4,5,7,3)
Declare a function, cube, that computes the cube of an Int
def cube(a: Int): Int = a * a * a
Apply the cube function to the list.
val cubes = ints.map(x => cube(x))
Sum the cubes of the integers.
cubes.reduce((x1,x2) => x1+x2)
Define a factorial function that computes n!
def fact(n: Int): Int = {
if(n == 0) 1
else n * fact(n-1)
}
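Assembled in order, the snippets above run as-is; `_ + _` is the usual Scala shorthand for `(x1, x2) => x1 + x2`.

```scala
val ints = List(1, 2, 4, 5, 7, 3)      // a list of integers
def cube(a: Int): Int = a * a * a      // cube of an Int
val cubes = ints.map(x => cube(x))     // apply cube to every element
val sumOfCubes = cubes.reduce(_ + _)   // same as (x1, x2) => x1 + x2

def fact(n: Int): Int =
  if (n == 0) 1
  else n * fact(n - 1)                 // n! by recursion
```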
SPARK IN PRACTICE (1)
Word count, the “Hello World” of Spark (interactive shell)
val textFile = sc.textFile("hdfs://…")
textFile.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey((a, b) => a+b)
The same word count as a standalone app
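The word-count logic can be checked without a cluster by running it on plain Scala collections, where `reduceByKey((a, b) => a + b)` corresponds to grouping by key and summing the counts; in a standalone app, the RDD version would run inside `main` after constructing a `SparkContext`. The input lines below are made up.

```scala
// Invented input text, one "line" per element:
val lines = Seq("hello world", "hello spark")

// Mirrors the RDD pipeline: flatMap / map / reduceByKey
val counts = lines
  .flatMap(line => line.split(" "))    // split lines into words
  .map(word => (word, 1))              // pair each word with a count of 1
  .groupBy(_._1)                       // group the pairs by word ...
  .map { case (word, pairs) =>         // ... and sum counts per word,
    (word, pairs.map(_._2).sum)        //     i.e. reduceByKey(_ + _)
  }
```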
SPARK IN PRACTICE (2)
PageRank in Spark
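The slide presumably showed the classic PageRank from the Spark examples. Below is a plain-Scala sketch of that algorithm with `Map`s standing in for RDDs, and `groupBy` plus summation standing in for `join`/`reduceByKey`; the tiny two-page graph and the iteration count are invented for illustration.

```scala
// Invented link structure: page -> pages it links to
val links: Map[String, Seq[String]] = Map(
  "A" -> Seq("B"),
  "B" -> Seq("A")
)

// Every page starts with rank 1.0
var ranks: Map[String, Double] = links.map { case (page, _) => page -> 1.0 }

for (_ <- 1 to 10) {
  // Each page sends rank / outDegree to every page it links to
  val contribs = links.toSeq.flatMap { case (page, outLinks) =>
    val rank = ranks(page)
    outLinks.map(dest => dest -> rank / outLinks.size)
  }
  // Sum contributions per page and apply the damping factor
  ranks = contribs
    .groupBy(_._1)
    .map { case (page, cs) => page -> (0.15 + 0.85 * cs.map(_._2).sum) }
}
```

On this symmetric two-page graph the ranks stay at 1.0; in Spark, `links` would be a partitioned, persisted RDD so the join reuses the same partitioning every iteration.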
DEBUGGING AND TESTING SPARK PROGRAMS (1)
Running in Local Mode
val sc = new SparkContext("local", name)
Debug in IDE
Running in Standalone/Cluster Mode
Job web UI on port 8080 (standalone master) / 4040 (application)
Log4j
jstack/jmap
dstat/iostat/lsof -p
Unit test
Test in local mode
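One common pattern that makes unit testing easier (a suggestion, not something the slide prescribes): keep the per-record logic in plain named functions, so tests exercise it on ordinary collections without constructing any SparkContext at all. The helper names below are hypothetical.

```scala
// Hypothetical per-record helpers, free of any Spark dependency:
def parseIp(logLine: String): String = logLine.split('\t')(0)
def isGet(logLine: String): Boolean = logLine.contains("GET")

// In the job, the same functions plug straight into the RDD API:
//   rdd.filter(isGet).map(parseIp)
// In a test, they run on an in-memory Seq of sample lines:
val sample = Seq("1.2.3.4\tGET\t/a", "5.6.7.8\tPOST\t/b")
val getIps = sample.filter(isGet).map(parseIp)
```

Only the wiring (reading input, writing output) then needs a local-mode SparkContext to test.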
DEBUGGING AND TESTING SPARK PROGRAMS (2)
RDDJoin2:
(2,4)
(1,2)
RDDJoin3:
(1,(1,3))
(1,(2,3))
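The printed RDDJoin3 output is consistent with a key-based join of two pair RDDs. A plain-Scala mimic of join semantics (an assumption about what the slide's code did; the input pairs are chosen to reproduce that output): every left pair is matched with every right pair sharing the same key.

```scala
// Mimic of pair-RDD join on local Seqs of (key, value) pairs:
def join[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, (A, B))] =
  for ((k, a) <- left; (k2, b) <- right if k == k2) yield (k, (a, b))

// Inputs chosen so the result matches the RDDJoin3 output above:
val joined = join(Seq((1, 1), (1, 2)), Seq((1, 3)))
// joined: Seq((1,(1,3)), (1,(2,3)))
```

Printing small joined RDDs like this (via `collect()` in local mode) is a quick way to sanity-check key matching before running at scale.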
DEBUGGING AND TESTING SPARK PROGRAMS (3)
Tuning Spark
Large objects captured in lambda operators should be replaced by broadcast variables.
Coalesce partitions to avoid a large number of empty tasks after filtering operations.
Make good use of partitions for data locality (mapPartitions).
Choose the partitioning key well to balance data.
Set spark.local.dir to a set of disks.
Take care with the number of reduce tasks.
Don't collect data to the driver; write to HDFS directly.
LEARNING SPARK
Spark Quick Start,
http://spark.apache.org/docs/latest/quick-start.html
Holden Karau, Fast Data Processing with Spark
Spark Docs, http://spark.apache.org/docs/latest/
Spark Source code,
https://github.com/apache/spark
Spark User Mailing list,
http://spark.apache.org/mailing-lists.html
WHY ARE PREVIOUS MAPREDUCE-BASED SYSTEMS SLOW?
Conventional wisdom:
expensive data materialization for fault tolerance,
inferior data layout (e.g., lack of indices),
costlier execution strategies.
But Shark (built on Hive) alleviates these by:
in-memory computing and storage,
partial DAG execution.
Experiment results in Shark examine:
intermediate outputs,
data format and layout (co-partitioning),
execution strategy optimization using PDE,
task scheduling cost.
CONCLUSION
Spark
In-Memory computing platform for iterative and
interactive tasks
RDD abstraction
Lineage reconstruction for fault recovery
Large number of components built on Spark
Programming
Just think of an RDD as a vector (a distributed collection)
Functional programming
The Scala IDE is not yet strong enough.
Good tools for debugging and testing are lacking.