Getting Started with Spark - Xiamen University


INTRODUCTION TO SPARK
Tangzk
2015/7/16
OUTLINE
 BDAS (the Berkeley Data Analytics Stack)
 Spark
 Other Components based on Spark
 Introduction to Scala
 Spark Programming Practices
 Debugging and Testing Spark Programs
 Learning Spark
 Why are Previous MapReduce-Based Systems Slow?
 Conclusion
BDAS (THE BERKELEY DATA ANALYTICS STACK)
SPARK
 Master/Slave architecture
 In-memory computing platform
 Resilient Distributed Datasets (RDDs) abstraction
 DAG execution engine
 Fault recovery using lineage
 Supports interactive data mining
 Coexists with the Hadoop ecosystem
SPARK – IN-MEMORY COMPUTING
 Hadoop: two-stage MapReduce topology
 Spark: DAG execution topology
SPARK – RDD (RESILIENT DISTRIBUTED DATASETS)
 Data computing and storage abstraction
 Records organized by partition
 Immutable (read-only): can only be created through transformations
 Moves computation to the data instead of moving the data
 Coarse-grained programming interfaces
 Partition reconstruction using lineage
SPARK-RDD: COARSE-GRAINED PROGRAMMING INTERFACES
 Transformation: defines one or more new RDDs.
 Action: returns a value to the driver.
 Lazy computation: transformations only build a plan; nothing runs until an action is called.
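A minimal sketch of this split, assuming the shell's predefined SparkContext sc:

val nums = sc.parallelize(1 to 1000000)    // distribute a local collection
val squares = nums.map(x => x.toLong * x)  // transformation: only records lineage
val evens = squares.filter(_ % 2 == 0)     // transformation: still nothing computed

val total = evens.count()   // action: triggers the whole pipeline
val first = evens.take(5)   // action: runs the pipeline again unless persisted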
SPARK-RDD: LINEAGE GRAPH

 Spark example: return the time fields of web GET accesses by "66.220.128.123" (the time field is number 3 in a tab-separated format).

val logs = spark.textFile("hdfs://...")
val accessesByIp = logs.filter(_.startsWith("66.220.128.123"))
accessesByIp.persist()   // keep in memory; lineage can rebuild lost partitions
accessesByIp.filter(_.contains("GET"))
  .map(_.split('\t')(3))
  .collect()
SPARK - INSTALLATION

 Running Spark
   Local mode
   Standalone mode
   Cluster mode: on YARN/Mesos
 Supported programming languages
   Scala
   Java
   Python
 Spark interactive shell
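As an illustration of the modes, a hedged sketch (the master URL and hostname are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// Local mode: everything in one JVM, here with 4 worker threads.
val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("demo"))

// Standalone mode would instead point at a standalone master, e.g.
//   .setMaster("spark://host:7077")
// Cluster mode on YARN/Mesos is usually selected via spark-submit --master.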
OTHER COMPONENTS(1) - SHARK
 SQL and rich analytics at scale
 Partial DAG Execution (PDE) optimizes query plans at runtime
 Runs SQL queries up to 100× faster than Apache Hive, and machine learning programs up to 100× faster than Hadoop
OTHER COMPONENTS(2) – SPARK STREAMING

 Scalable, fault-tolerant streaming computation framework
 Discretized streams abstraction: separates the continuous dataflow into batches of input data
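A minimal sketch of the discretized-stream idea (the socket source on localhost:9999 is a placeholder):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
val ssc = new StreamingContext(conf, Seconds(1))   // cut the stream into 1-second batches

val lines = ssc.socketTextStream("localhost", 9999)
lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .print()                 // per-batch word counts

ssc.start()
ssc.awaitTermination()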
OTHER COMPONENTS(3) - MLBASE

 Distributed machine learning library
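MLbase's library layer surfaces in Spark as MLlib; a minimal sketch, assuming MLlib's RDD-based API and a placeholder input path:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse whitespace-separated feature lines into vectors.
val data = sc.textFile("hdfs://...")
val parsed = data.map(line => Vectors.dense(line.split(' ').map(_.toDouble))).cache()

// Cluster into 2 groups with at most 20 iterations, then inspect a few points.
val model = KMeans.train(parsed, 2, 20)
parsed.take(3).foreach(v => println(s"$v -> cluster ${model.predict(v)}"))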
OTHER COMPONENTS(4) – GRAPHX
 Distributed graph computing framework
 RDG (Resilient Distributed Graph) abstraction
 Supports the Gather-Apply-Scatter (GAS) model from GraphLab
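A minimal sketch of a GAS-style gather step using GraphX's aggregateMessages (the tiny graph is made up for illustration):

import org.apache.spark.graphx.{Edge, Graph}

// A toy graph: three vertices, each carrying a value of 1.0.
val vertices = sc.parallelize(Seq((1L, 1.0), (2L, 1.0), (3L, 1.0)))
val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(1L, 3L, 1), Edge(2L, 3L, 1)))
val graph = Graph(vertices, edges)

// Gather: each vertex sums the values of its in-neighbors (the Σ in GAS).
val sums = graph.aggregateMessages[Double](
  ctx => ctx.sendToDst(ctx.srcAttr),  // send along each edge
  _ + _                               // merge partial sums
)

// Apply: join the sums back and update each vertex's value.
val updated = graph.outerJoinVertices(sums) { (vid, old, sumOpt) => sumOpt.getOrElse(0.0) }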
GAS MODEL - GRAPHLAB
[Figure: the GAS model splits a vertex program across machines. Mirror vertices on each machine gather partial sums (Σ1...Σ4) from their neighbors; the master combines them (Σ), applies the vertex update, and scatters the new value Y' back to the mirrors. From Jiang Wenrui's thesis defense.]
INTRODUCTION TO SCALA(1)
 Runs on the JVM (and .NET)
 Full interoperability with Java
 Statically typed
 Object-oriented
 Functional programming
INTRODUCTION TO SCALA(2)

 Declare a list of integers:

val ints = List(1, 2, 4, 5, 7, 3)

 Declare a function, cube, that computes the cube of an Int:

def cube(a: Int): Int = a * a * a

 Apply the cube function to the list:

val cubes = ints.map(x => cube(x))

 Sum the cubes of the integers:

cubes.reduce((x1, x2) => x1 + x2)

 Define a factorial function that computes n!:

def fact(n: Int): Int = {
  if (n == 0) 1
  else n * fact(n - 1)
}
SPARK IN PRACTICE(1)

 "Hello World" (interactive shell):

val textFile = sc.textFile("hdfs://...")
textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey((a, b) => a + b)

 "Hello World" (standalone app): see the sketch below.
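The standalone-app code is not preserved in this transcript; below is a minimal sketch of the same word count as a self-contained application (object name and paths are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)

    sc.textFile("hdfs://...")
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey((a, b) => a + b)
      .saveAsTextFile("hdfs://...")  // write results out instead of collecting

    sc.stop()
  }
}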
SPARK IN PRACTICE(2)

 PageRank in Spark (see the sketch below)
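The PageRank code on this slide is not captured in the transcript; the sketch below follows the classic Spark PageRank example (input path and iteration count are placeholders; 0.85 is the usual damping factor):

// Each input line is assumed to hold "sourceURL neighborURL".
val lines = sc.textFile("hdfs://...")
val links = lines.map { line =>
  val parts = line.split("\\s+")
  (parts(0), parts(1))
}.groupByKey().cache()                  // adjacency lists, reused every iteration

var ranks = links.mapValues(_ => 1.0)   // start every page at rank 1.0

for (_ <- 1 to 10) {                    // 10 fixed-point iterations
  val contribs = links.join(ranks).values.flatMap { case (urls, rank) =>
    urls.map(url => (url, rank / urls.size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(sum => 0.15 + 0.85 * sum)
}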
DEBUGGING AND TESTING SPARK PROGRAMS(1)

 Running in local mode
   val sc = new SparkContext("local", name)
   Debug in an IDE
 Running in standalone/cluster mode
   Job web GUI on ports 8080/4040
   Log4j
   jstack/jmap
   dstat/iostat/lsof -p
 Unit test
   Test in local mode
DEBUGGING AND TESTING SPARK PROGRAMS(2)

 Example join output inspected in local mode:

RDDJoin2:
(2,4)
(1,2)
RDDJoin3:
(1,(1,3))
(1,(2,3))
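The code behind these outputs is not preserved; a hedged local-mode sketch whose inputs (made up here) would reproduce the RDDJoin3 shape:

import org.apache.spark.{SparkConf, SparkContext}

// Local mode makes joins easy to inspect in a unit test or console.
val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("join-test"))

val left  = sc.parallelize(Seq((1, 1), (1, 2)))   // hypothetical inputs
val right = sc.parallelize(Seq((1, 3)))

// join keeps only keys present on both sides, pairing every matching value:
left.join(right).collect().foreach(println)      // prints (1,(1,3)) and (1,(2,3))

sc.stop()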
DEBUGGING AND TESTING SPARK PROGRAMS(3)

 Tuning Spark
   Replace large objects captured in lambda operators with broadcast variables (see the sketch below).
   Coalesce partitions to avoid a large number of empty tasks after filtering operations.
   Make good use of partitioning for data locality (mapPartitions).
   Choose the partitioning key well to balance data.
   Set spark.local.dir to a set of disks.
   Take care with the number of reduce tasks.
   Don't collect data to the driver; write to HDFS directly.
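A minimal sketch of the broadcast-variable point (the lookup table is a stand-in for any large object):

// Broadcast ships the object to each executor once instead of with every task.
val lookup = Map("GET" -> 0, "POST" -> 1, "PUT" -> 2)
val lookupBc = sc.broadcast(lookup)

val codes = sc.textFile("hdfs://...")
  .map(line => lookupBc.value.getOrElse(line.split(' ')(0), -1))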
LEARNING SPARK
 Spark Quick Start, http://spark.apache.org/docs/latest/quick-start.html
 Holden Karau, Fast Data Processing with Spark
 Spark Docs, http://spark.apache.org/docs/latest/
 Spark source code, https://github.com/apache/spark
 Spark user mailing list, http://spark.apache.org/mailing-lists.html
WHY ARE PREVIOUS MAPREDUCE-BASED SYSTEMS SLOW?

 Conventional explanations:
   expensive data materialization for fault tolerance,
   inferior data layout (e.g., lack of indices),
   costlier execution strategies.
 But Hive/Shark alleviate these with:
   in-memory computing and storage,
   partial DAG execution.
 Factors examined in Shark's experiments:
   intermediate outputs,
   data format and layout (co-partitioning),
   execution strategy optimization using PDE,
   task scheduling cost.
CONCLUSION

 Spark
   In-memory computing platform for iterative and interactive tasks
   RDD abstraction
   Lineage-based reconstruction for fault recovery
   A large number of components are built on top of it
 Spark programming
   Think of an RDD like a vector
   Functional programming
   Scala IDEs are not yet strong enough.
   There is a lack of good tools for debugging and testing.