Make Sense of Big Data

Researched by JIANG Wen-rui
Led by Prof. ZOU
1
Three levels of Big Data
Data Analysis — SaaS
Software Infrastructure — PaaS
Hardware Infrastructure — IaaS
2
Contradiction: First and Second Levels
Data Analysis: Machine Learning, Data Warehouse, Statistics
Software Infrastructure: MapReduce, Pregel, GraphLab, GraphBuilder, Spark
3
Evolution of Big Data Tech
[Figure: generations of the stack, grouped into two levels.]
Data Intelligence Level: BC-PDM, MLBase, Mahout, BDAS, Cloudera, Graph app, GraphLab, Shark
Software Architecture Level: MapReduce, HDFS, HBase, Hive, Pig, MapR, Pregel, GraphBuilder, Spark
4
4V in Big Data
Volume, Variety, Velocity, Value
Why?
Volume: Big Data is just that: data sets so massive that typical software systems are incapable of economically storing, let alone managing and computing, the information. A Big Data platform must capture such quantities and readily provide them in a comprehensive, uniform storage framework to enable straightforward management and development.

Variety: One of the tenets of Big Data is the exponential growth of unstructured data. The vast majority of data now originates from sources with either limited or variable structure, such as social media and telemetry. A Big Data platform must accommodate the full spectrum of data types and forms.

Velocity: As organizations continue to seek new questions, patterns, and metrics within their data sets, they demand rapid and agile modeling and query capabilities. A Big Data platform should maintain the original format and precision of all ingested data to ensure full latitude for future analysis and processing cycles.

Value: Driving relevant value from data, whether as revenue or cost savings, is the primary motivator for many organizations. The popularity of long-tail business models has forced companies to examine their data in detail to find the patterns, affiliations, and connections that drive these new opportunities.
5
Model, Framework, and Performance

MapReduce (Google): good at data-independent tasks, but not at machine learning or graph processing (data-dependent and iterative tasks). Based on acyclic data flow. "Think like a key."

Pregel (Google): good at iterative and data-dependent computations, including graph processing. Uses the BSP (Bulk Synchronous Parallel) model; a message-passing abstraction.

GraphLab (CMU): good at iterative and data-dependent computations, especially natural-graph problems. Uses an asynchronous distributed shared-memory model; a shared-state abstraction. "Think like a vertex."

Spark (UC Berkeley BDAS): good at iterative algorithms, interactive data mining, and OLAP reports. Uses the RDD (resilient distributed datasets) abstraction, built on in-memory cluster computing and a distributed-memory model.
6
MapReduce
7
Map@MapReduce
8
Reduce@MapReduce
9
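The preceding slides show the map and reduce phases as figures. As a minimal, generic illustration of those two phases (a word count in plain Python, with a toy local driver; this is not the slides' own example), a sketch might look like:

from collections import defaultdict

def map_phase(document):
    # Emit (word, 1) for every word in one input split
    for word in document.split():
        yield (word, 1)

def reduce_phase(word, counts):
    # Sum all partial counts that share the same key
    return (word, sum(counts))

def run_word_count(documents):
    # Minimal local driver: group map output by key, then reduce per key
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_phase(doc):
            groups[key].append(value)
    return [reduce_phase(k, v) for k, v in groups.items()]

print(run_word_count(["big data big", "data analysis"]))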
RPC@MapReduce
10
RPC@MapReduce
11
MapReduce+BSP
12
BSP Model
[Figure: in each superstep the processors perform local computation, then communication, then a barrier synchronization; a sketch of this loop follows below.]
13
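A minimal sketch of the BSP superstep loop in plain Python. The compute() method and the message format are assumptions for illustration, not part of the BSP definition itself:

def bsp_run(vertices, num_supersteps):
    # inbox[v] holds the messages delivered to vertex v at the start
    # of the current superstep
    inbox = {v: [] for v in vertices}
    for _ in range(num_supersteps):
        outbox = {v: [] for v in vertices}
        # 1) Local computation: each vertex works only on its own state
        #    and on the messages received in the previous superstep.
        for v, program in vertices.items():
            for dst, msg in program.compute(inbox[v]):
                # 2) Communication: outgoing messages are buffered,
                #    not delivered immediately.
                outbox[dst].append(msg)
        # 3) Barrier synchronization: only after every vertex has finished
        #    do the buffered messages become visible, in the next superstep.
        inbox = outbox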
MapReduce + BSP
14
GraphLab
15
GraphLab-Think like a vertex
16
GraphLab Working Patterns
MR (Map-Reduce) pattern functions: map_reduce_vertices, map_reduce_edges, transform_vertices, transform_edges
GAS (Gather-Apply-Scatter) pattern functions: gather_edges, gather, apply, scatter_edges, scatter
(A sketch of the GAS pattern follows below.)
17
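A hedged sketch of the GAS pattern for a PageRank-style update. The function names mirror the table above, but this is simplified Python for illustration, not the actual GraphLab C++ API; the signal() hook and the vertex/edge fields are assumptions:

# Gather: collect a partial value from one in-edge.
def gather(vertex, edge, neighbor):
    return neighbor.rank * edge.weight

# Apply: combine the summed gather results into the new vertex state.
def apply(vertex, gathered_sum):
    vertex.old_rank = vertex.rank
    vertex.rank = 0.15 + gathered_sum

# Scatter: run along out-edges, re-activating neighbors whose input changed.
def scatter(vertex, edge, neighbor, signal):
    if abs(vertex.rank - vertex.old_rank) > 1e-4:
        signal(neighbor)   # scheduler hook, assumed here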
Distributed Execution of a PowerGraph Vertex-Program
[Figure: a high-degree vertex is split across four machines. The master and its mirrors each run gather locally, the partial sums Σ1, Σ2, Σ3, Σ4 are combined at the master, apply computes the new value Y', the master sends Y' back to the mirrors, and scatter then runs on each machine.]
18
GraphLab vs. Pregel: Example
[Figure: what is the popularity of this user? Her popularity depends on the popularity of her followers, which in turn depends on the popularity of their followers.]
19
GraphLab vs. Pregel: PageRank
R[i] = 0.15 + Σ_j w_ji · R[j]  (the rank of user i is 0.15 plus the weighted sum of its neighbors' ranks)
- Update ranks in parallel
- Iterate until convergence
20
The Pregel Abstraction
Vertex-programs interact by sending messages.

Pregel_PageRank(i, messages):
  // Receive all the messages
  total = 0
  foreach (msg in messages):
    total = total + msg
  // Update the rank of this vertex
  R[i] = 0.15 + total
  // Send new messages to out-neighbors
  foreach (j in out_neighbors[i]):
    send msg(R[i] * w_ij) to vertex j

Malewicz et al. [PODC'09, SIGMOD'10]
21
The Pregel Abstraction
[Figure: each superstep alternates compute, communicate, and barrier phases.]
The GraphLab Abstraction
Vertex-programs directly read their neighbors' state.

GraphLab_PageRank(i):
  // Compute sum over in-neighbors
  total = 0
  foreach (j in in_neighbors(i)):
    total = total + R[j] * w_ji
  // Update the PageRank
  R[i] = 0.15 + total
  // Trigger neighbors to run again
  if R[i] not converged then
    foreach (j in out_neighbors(i)):
      signal vertex-program on j

Low et al. [UAI'10, VLDB'12]
23
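Both vertex-programs express the same update. A minimal single-machine Python version of that update, kept deliberately close to the pseudocode above (the toy graph and weights at the bottom are illustrative assumptions, not from the slides):

def pagerank(out_neighbors, w, num_iters=20):
    # Synchronous PageRank following the slides' update R[i] = 0.15 + total.
    # out_neighbors: {i: [j, ...]};  w: {(i, j): weight of edge i -> j}.
    # This is a plain reference loop, not the distributed Pregel/GraphLab runtime.
    R = {i: 1.0 for i in out_neighbors}
    for _ in range(num_iters):
        total = {i: 0.0 for i in out_neighbors}
        for i, neighbors in out_neighbors.items():
            for j in neighbors:
                total[j] += R[i] * w[(i, j)]
        R = {i: 0.15 + total[i] for i in out_neighbors}
    return R

# Toy 3-vertex example with weights 1 / out-degree
graph = {0: [1, 2], 1: [2], 2: [0]}
weights = {(0, 1): 0.5, (0, 2): 0.5, (1, 2): 1.0, (2, 0): 1.0}
print(pagerank(graph, weights))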
GraphLab Execution
[Figure: a shared scheduler holds pending vertices (a, b, c, ...); CPU 1 and CPU 2 repeatedly pull vertices from it and run their vertex-programs.]
The scheduler determines the order in which vertices are executed.
The process repeats until the scheduler is empty.
GraphLab vs. Pregel (BSP)
[Figure: multicore PageRank (25M vertices, 355M edges); histogram of the number of updates per vertex (0 to 70) against vertex count on a log scale (1 to 100,000,000); 51% of vertices are updated only once.]
Graph-parallel Abstractions
[Figure: the messaging abstraction is synchronous (Pregel); the shared-state abstraction is asynchronous (GraphLab) and is presented as better for ML.]
26
Challenges of High-Degree Vertices
- Edges must be processed sequentially.
- Many messages are sent (Pregel).
- Edge meta-data is too large for a single machine.
- A large fraction of the graph is touched (GraphLab).
- Asynchronous execution requires heavy locking (GraphLab).
- Synchronous execution is prone to stragglers (Pregel).
27
Berkeley Data Analytics Stack
28
Berkeley Data Analytics Stack
[Figure: the BDAS layers, bottom to top, annotated with the V each addresses.]
- HDFS (Volume, Variety)
- Mesos: cluster resource manager (also hosting MapReduce, MPI, GraphLab, etc.)
- Spark with shared RDDs (distributed memory) (Velocity)
- Shark (Spark + Hive): SQL
- BlinkDB: approximate queries
- MLBase (Value)
29
Spark: Motivation
Most current cluster programming models are based on acyclic data flow from stable storage to stable storage.
[Figure: input flows through map tasks to reduce tasks and on to output.]
Spark
Spark is designed for:
- Iterative algorithms, including many machine learning algorithms and graph algorithms such as PageRank.
- Interactive data mining, where a user would like to load data into RAM across a cluster and query it repeatedly.
- OLAP reports that run multiple aggregation queries on the same data.
31
Spark
Spark allows iterative computation on the same data, which would form a cycle if the jobs were visualized as a data-flow graph.
Spark offers an abstraction called resilient distributed datasets (RDDs) to support these applications efficiently.
32
RDDs
A Resilient Distributed Dataset (RDD) is an abstraction over the raw data; some of the data is kept in memory and cached for later use.
- Spark allows data to be kept in RAM, for an approximately 20x speedup over disk-based MapReduce.
- RDDs allow Spark to outperform existing models by up to 100x in multi-pass analytics.
- RDDs are immutable and are created through parallel transformations such as map, filter, groupBy, and reduce.
33
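A minimal PySpark sketch of these ideas using the standard RDD API. The input path and the filter predicates are placeholders, not taken from the slides:

from pyspark import SparkContext

sc = SparkContext(appName="rdd-demo")

# RDDs are created by parallel transformations over input data
lines = sc.textFile("hdfs://namenode/demo/events.log")   # placeholder path
errors = lines.filter(lambda line: "ERROR" in line)

# cache() keeps the RDD in cluster memory for repeated queries
errors.cache()

print(errors.count())                                    # first pass reads from disk
print(errors.filter(lambda l: "timeout" in l).count())   # later passes hit memory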
Functions: MapReduce vs. Spark
34
Logistic Regression Performance
[Figure: running time in seconds versus number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark. Hadoop takes about 127 s per iteration; Spark takes 174 s for the first iteration and about 6 s for each further iteration.]
MLBase Motivation: Two Gaps
In spite of the modern primacy of data, the complexity of existing ML algorithms is often overwhelming: many users do not understand the trade-offs and challenges of parameterizing and choosing between different learning techniques, and they need to tune and compare several suitable algorithms.
Furthermore, existing scalable systems that support machine learning are typically not accessible to ML researchers without a strong background in distributed systems and low-level primitives.
So we designed a system that is extensible to novel ML algorithms.
MLBase: Four Pieces
- MQL: a simple declarative way to specify ML tasks.
- ML-Library: a library of distributed algorithms, plus a set of high-level operators that let ML researchers scalably implement a wide range of ML methods without deep systems knowledge.
- ML-Optimizer: a novel optimizer to select and dynamically adapt the choice of learning algorithm.
- ML-Runtime: a new run-time optimized for the data-access patterns of these high-level operators.
37
MLBase Architecture
[Figure: MLBase architecture diagram, including an error guide.]
Just the Hadoop framework? In a sense, the distributed platforms are just a language: we cannot do without them, but we should not depend on them alone. The more important things are as follows:
- Machine Learning! Reading: Machine Learning: A Probabilistic Perspective.
- Deep Learning.
40
FUJITSU
Parallel time series regression
Led by Dr. Yang
Group
LI Zhong-hua
WANG Yun-zhi
JIANG Wen-rui
41
Parallel time series regression
- Platform: Hadoop from Apache (an open-source implementation of Google's MapReduce) and GraphLab from Carnegie Mellon University (open source). Both are good at distributed parallel processing: MapReduce is good at acyclic data flow, while GraphLab is good at iterative and data-dependent computations.
- Volume: support for big data. The algorithm has good scalability: when a large amount of data arrives, it can be handled without any modification, simply by increasing the number of cluster nodes.
- Velocity: rapid and agile modeling and handling capabilities for big data.
- Interface: an XML file is used to set the input parameters, allowing customers to set parameters intuitively (a parsing sketch follows below).
42
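As a hedged sketch of the XML parameter interface mentioned above, using only Python's standard library; the tag names in this file are hypothetical, since the slides do not show the real parameter schema:

import xml.etree.ElementTree as ET

# Hypothetical parameter file; the real tag names are not shown in the slides.
config_xml = """
<parameters>
  <days>90</days>
  <maxFragmentLength>96</maxFragmentLength>
  <numReducers>12</numReducers>
</parameters>
"""

root = ET.fromstring(config_xml)
params = {child.tag: child.text for child in root}
print(params)   # {'days': '90', 'maxFragmentLength': '96', 'numReducers': '12'}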
Parallel time series regression: stages
- Decompose (MapReduce)
- CycLenCalcu (MapReduce)
- Indicative Frag (MapReduce)
- TBSCPro (MapReduce)
- Clustering (GraphLab)
- Choose Cluster (MapReduce)
43
Design for Parallel Indicative Fragment
Indicative fragment: identify the best length of the indicative fragment.
Assume 90 days of data and a maximum indicative fragment length of 96.
Compare serial and parallel time complexity:
- Serial: for each of the 96 candidate lengths, compare all C(90, 2) pairs of days, for a time complexity of 96 * C(90, 2).
- Parallel: generate all 96 * (90*89/2) operation pairs before the parallel computation, then evaluate them all in parallel, for a time complexity of 1 (a sketch of generating the operation pairs follows below).
44
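A hedged sketch of generating the operation pairs up front so that each pair can be evaluated independently in parallel (for example, as one map task each). The similarity measure applied to each pair is not spelled out in the slides, so only the enumeration is shown:

from itertools import combinations

def generate_operation_pairs(num_days=90, max_fragment_length=96):
    # One (fragment_length, day_i, day_j) tuple per independent comparison
    for length in range(1, max_fragment_length + 1):
        for day_i, day_j in combinations(range(num_days), 2):
            yield (length, day_i, day_j)

pairs = list(generate_operation_pairs())
print(len(pairs))   # 96 * (90 * 89 / 2) = 384480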
TBSCPro
[Figure: for each of five example series over all days (a1 a2 a3 ..., b1 b2 b3 ..., c1 ..., d1 ..., e1 ...), a heap with capacity 3 keeps the current best candidates; a sketch of such a bounded heap follows below.]
45
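A hedged sketch of a capacity-limited heap like the capacity-3 heap in the TBSCPro figure: it keeps only the k best-scoring items seen so far. The scoring function is a placeholder, since the slides do not define it:

import heapq

def top_k(items, k=3, score=lambda x: x):
    heap = []                       # min-heap of (score, item)
    for item in items:
        s = score(item)
        if len(heap) < k:
            heapq.heappush(heap, (s, item))
        elif s > heap[0][0]:        # better than the current worst kept item
            heapq.heapreplace(heap, (s, item))
    return sorted(heap, reverse=True)

print(top_k([5, 1, 9, 3, 7, 2], k=3))   # [(9, 9), (7, 7), (5, 5)]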
Parallel time series regression model: run time
[Figure: running time in seconds on a 4-node cluster versus input size (16 to 1000) for four configurations: Map=2/Reduce=2, Map=4/Reduce=4, Map=4/Reduce=8, and Map=4/Reduce=12 (best time). Run time grows with input size for every configuration and is lowest for Map=4/Reduce=12.]
46
Data Selection Run Time
[Figure: run time in seconds versus number of days (90 to 3600) for three series: Data Selection Time on a 4-node cluster, Primary Seeds Time on a 4-node cluster, and Data Selection Time on a single node.]
47
Thank you !
48