http://salsahpc.indiana.edu/twister4azure
Twister4Azure
Iterative MapReduce for Windows Azure Cloud
Thilina Gunarathne ([email protected])
Indiana University
Twister4Azure – Iterative MapReduce
• Decentralized iterative MR architecture for clouds
– Utilize highly available and scalable Cloud services
• Extends the MR programming model
• Multi-level data caching
– Cache aware hybrid scheduling
• Multiple MR applications per job
• Collective communication primitives
• Outperforms Hadoop on a local cluster by 2 to 4 times
• Dynamic scheduling, load balancing, fault tolerance, monitoring, local testing/debugging
http://salsahpc.indiana.edu/twister4azure/
MRRoles4Azure
Azure Cloud Services
• Highly-available and scalable
• Utilize eventually-consistent, high-latency cloud services effectively
• Minimal maintenance and management overhead
Decentralized
• Avoids Single Point of Failure
• Global queue based dynamic scheduling
• Dynamically scale up/down
MapReduce
• First pure MapReduce for Azure
• Typical MapReduce fault tolerance
MRRoles4Azure architecture: Azure Queues for scheduling, Tables to store meta-data and monitoring data, Blobs for input/output/intermediate data storage.
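As a rough illustration of this decentralized design, the sketch below shows a worker loop driven purely by cloud services; the TaskQueue, BlobStore and StatusTable interfaces stand in for the Azure Queue, Blob and Table clients and are assumptions made for this sketch, not the Twister4Azure or Azure SDK API.

import java.util.Optional;

// Hypothetical sketch of a decentralized MRRoles4Azure worker loop: task
// descriptors come from an Azure Queue, input/output bytes live in Blob
// storage, and task status goes to a Table. The interfaces below stand in
// for the storage clients; they are illustrative, not the real API.
class MapWorkerLoop {
    interface TaskQueue { Optional<String> dequeueTaskDescriptor(); } // Azure Queue
    interface BlobStore { byte[] download(String name); void upload(String name, byte[] data); }
    interface StatusTable { void markCompleted(String taskId); }      // Azure Table

    void run(TaskQueue queue, BlobStore blobs, StatusTable table) {
        // No central master: every worker independently polls the global queue.
        Optional<String> task;
        while ((task = queue.dequeueTaskDescriptor()).isPresent()) {
            String taskId = task.get();
            byte[] input = blobs.download("input/" + taskId);        // input data
            byte[] output = map(input);                               // user map()
            blobs.upload("intermediate/" + taskId, output);           // intermediate data
            table.markCompleted(taskId);                              // monitoring metadata
        }
    }

    private byte[] map(byte[] input) { return input; } // placeholder map function
}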
Data Intensive Iterative Applications
• Growing class of applications
– Clustering, data mining, machine learning & dimension reduction applications
– Driven by data deluge & emerging computation fields
– Lots of scientific applications
k ← 0
MAX_ITER ← maximum iterations
δ[0] ← initial delta value
while ( k < MAX_ITER && f(δ[k], δ[k-1]) )
    foreach datum in data
        β[datum] ← process(datum, δ[k])
    end foreach
    δ[k+1] ← combine(β[])
    k ← k+1
end while
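As a concrete (and deliberately simplified, single-machine) instance of this template, the sketch below maps it onto KMeans clustering, one of the applications named above: δ[k] is the current centroid set, process() assigns each point to its nearest centroid, and combine() averages the assignments into the next centroids. It is an illustrative sketch, not Twister4Azure code.

import java.util.Arrays;

// Minimal, single-machine instance of the iterative template above,
// specialized to 1-D KMeans: delta[k] = current centroids,
// process() = nearest-centroid assignment, combine() = recompute centroids.
public class IterativeKMeansSketch {
    public static void main(String[] args) {
        double[] data = {1.0, 1.2, 0.8, 7.9, 8.1, 8.3};
        double[] centroids = {0.0, 10.0};        // delta[0]: initial value
        final int MAX_ITER = 100;                // maximum iterations
        final double EPSILON = 1e-6;             // convergence threshold

        for (int k = 0; k < MAX_ITER; k++) {
            double[] sums = new double[centroids.length];
            int[] counts = new int[centroids.length];

            // "foreach datum in data: beta[datum] = process(datum, delta[k])"
            for (double datum : data) {
                int nearest = 0;
                for (int c = 1; c < centroids.length; c++) {
                    if (Math.abs(datum - centroids[c]) < Math.abs(datum - centroids[nearest])) {
                        nearest = c;
                    }
                }
                sums[nearest] += datum;
                counts[nearest]++;
            }

            // "delta[k+1] = combine(beta[])": average each cluster's points.
            double[] next = centroids.clone();
            for (int c = 0; c < centroids.length; c++) {
                if (counts[c] > 0) next[c] = sums[c] / counts[c];
            }

            // Loop test f(delta[k], delta[k-1]): stop when centroids stop moving.
            double shift = 0;
            for (int c = 0; c < centroids.length; c++) {
                shift = Math.max(shift, Math.abs(next[c] - centroids[c]));
            }
            centroids = next;
            if (shift < EPSILON) break;
        }
        System.out.println("Centroids: " + Arrays.toString(centroids));
    }
}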
Data Intensive Iterative Applications
• Growing class of applications
– Clustering, data mining, machine learning & dimension reduction applications
– Driven by data deluge & emerging computation fields
[Figure: per-iteration compute/communication pattern – each new iteration broadcasts the smaller loop-variant data, computes over the larger loop-invariant data, and finishes with a reduce/barrier step]
Iterative MapReduce
• MapReduceMerge: Map -> Combine -> Shuffle -> Sort -> Reduce -> Merge, followed by Broadcast
• Extensions to support additional broadcast (and other) input data
Map(<key>, <value>, list_of <key,value>)
Reduce(<key>, list_of <value>, list_of <key,value>)
Merge(list_of <key,list_of<value>>,list_of <key,value>)
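A rough Java rendering of these extended signatures (illustrative only; the interface and type-parameter names are assumptions, and Twister4Azure itself is a .NET framework): the final argument in each case carries the broadcast, loop-variant data for the current iteration.

import java.util.List;
import java.util.Map.Entry;

// Illustrative rendering of the extended signatures above; the trailing
// list_of<key,value> parameter is the per-iteration broadcast data.
interface IterativeMapReduce<K, V, RK, RV> {
    // Map(<key>, <value>, list_of<key,value>)
    void map(K key, V value, List<Entry<K, V>> broadcastData);

    // Reduce(<key>, list_of<value>, list_of<key,value>)
    void reduce(RK key, List<RV> values, List<Entry<K, V>> broadcastData);

    // Merge(list_of<key, list_of<value>>, list_of<key,value>)
    void merge(List<Entry<RK, List<RV>>> reduceOutputs, List<Entry<K, V>> broadcastData);
}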
Merge Step
• Extension to the MapReduce programming model to support iterative applications
– Map -> Combine -> Shuffle -> Sort -> Reduce -> Merge
• Receives all the Reduce outputs and the broadcast data for the current iteration
• User can add a new iteration or schedule a new MR job from the Merge task
– Serves as the “loop-test” in the decentralized architecture
• Number of iterations
• Comparison of the result from the previous iteration with the current iteration
– Possible to make the output of Merge the broadcast data of the next iteration
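A minimal sketch of a Merge task acting as this decentralized loop-test, assuming a hypothetical JobDriver with addIteration() and finishJob() methods; those names, and the simple reduce-output layout, are assumptions for illustration, not the actual Twister4Azure API.

import java.util.List;
import java.util.Map;

// Hypothetical Merge task serving as the "loop-test": it combines all Reduce
// outputs, compares with the previous iteration, and either adds a new
// iteration (with its own output as the next broadcast data) or ends the job.
class LoopTestMerge {
    interface JobDriver {
        void addIteration(double[] broadcastForNextIteration);
        void finishJob(double[] finalResult);
    }

    static final int MAX_ITERATIONS = 30;
    static final double EPSILON = 1e-6;

    void merge(List<Map.Entry<Integer, List<Double>>> reduceOutputs,
               double[] previousResult, int iteration, JobDriver driver) {
        // Collapse all Reduce outputs into this iteration's result vector,
        // assuming reduce keys 0..n-1 and the same key space every iteration.
        double[] current = new double[reduceOutputs.size()];
        for (Map.Entry<Integer, List<Double>> entry : reduceOutputs) {
            double sum = 0;
            for (double v : entry.getValue()) sum += v;
            current[entry.getKey()] = sum;
        }

        // Loop test: bounded iteration count plus change since the last iteration.
        double change = 0;
        for (int i = 0; i < current.length; i++) {
            change = Math.max(change, Math.abs(current[i] - previousResult[i]));
        }

        if (change > EPSILON && iteration < MAX_ITERATIONS) {
            // The Merge output becomes the broadcast data of the next iteration.
            driver.addIteration(current);
        } else {
            driver.finishJob(current);
        }
    }
}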
Multi-Level Caching
In-memory/disk caching of static data
• Caching BLOB data on disk
• Caching loop-invariant data in-memory
– Direct in-memory
– Memory mapped files
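A minimal sketch of the memory-mapped-file flavor of this cache, using standard Java NIO; the class and method names are illustrative, and the real Twister4Azure cache is an Azure/.NET component.

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of caching loop-invariant input on local disk and exposing it to map
// tasks through memory-mapped files, so later iterations avoid re-downloading
// the same blob from cloud storage.
class LoopInvariantDataCache {
    private final Map<String, MappedByteBuffer> cache = new ConcurrentHashMap<>();

    // First access maps the locally downloaded copy of the blob; subsequent
    // iterations reuse the mapping instead of touching Azure storage again.
    MappedByteBuffer get(String blobName, Path localCopy) throws IOException {
        MappedByteBuffer cached = cache.get(blobName);
        if (cached != null) return cached;
        try (FileChannel channel = FileChannel.open(localCopy, StandardOpenOption.READ)) {
            MappedByteBuffer mapped =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            cache.put(blobName, mapped);
            return mapped;
        }
    }
}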
Cache Aware Hybrid Scheduling
• Decentralized
• Fault tolerant
• Multiple MapReduce applications within an iteration
• Load balancing
• Multiple waves
[Figure: a new job/iteration is announced on the Job Bulletin Board (e.g. “Job 1, iteration 2, bcast..”); Map and Reduce workers in each Worker Role pick cached tasks first, using the in-memory/disk data cache and the map task meta-data cache, while left-over tasks are taken from the global Scheduling Queue]
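The sketch below shows one plausible shape of this cache-aware, hybrid task pickup, assuming a job bulletin board that advertises the iteration's map tasks and a global queue holding left-over tasks; all interface and method names are assumptions for this sketch, not the Twister4Azure implementation.

import java.util.Optional;
import java.util.Queue;
import java.util.Set;

// Hypothetical cache-aware hybrid task pickup in a worker role: prefer map
// tasks whose loop-invariant data is already in the local cache (found via the
// job bulletin board), then fall back to the global queue for left-over tasks.
class CacheAwareScheduler {
    interface BulletinBoard {  // advertises this iteration's map tasks
        Optional<MapTask> claimTaskWithDataIn(Set<String> cachedDataIds);
    }
    interface MapTask { String dataId(); }

    private final BulletinBoard board;
    private final Queue<MapTask> globalQueue;   // e.g. backed by an Azure Queue
    private final Set<String> locallyCachedData;

    CacheAwareScheduler(BulletinBoard board, Queue<MapTask> globalQueue,
                        Set<String> locallyCachedData) {
        this.board = board;
        this.globalQueue = globalQueue;
        this.locallyCachedData = locallyCachedData;
    }

    Optional<MapTask> nextTask() {
        // 1. Cache-aware pass: claim a task whose input we already hold locally.
        Optional<MapTask> local = board.claimTaskWithDataIn(locallyCachedData);
        if (local.isPresent()) return local;
        // 2. Hybrid fallback: take any left-over task from the global queue.
        return Optional.ofNullable(globalQueue.poll());
    }
}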
Data Transfer
• Iterative vs. traditional MapReduce
– Iterative computation tasks are finer grained
– Intermediate data are relatively smaller
• Hybrid data transfer based on the use case
– Blob + Table storage based transport
– Direct TCP transport
• Push data from Map to Reduce
• Optimized data broadcasting
Fault Tolerance For Iterative MapReduce
• Iteration level
– Roll back iterations
• Task level
– Re-execute the failed tasks
• Hybrid data communication utilizing a combination of faster non-persistent and slower persistent mediums
– Direct TCP (non-persistent), with blob uploading in the background
• Decentralized control avoiding single point of failures
• Duplicate-execution of slow tasks
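A sketch of what this hybrid, fault-tolerant intermediate-data path could look like: the map output is pushed to the reducer over direct TCP, while a persistent copy is uploaded to blob storage in the background (or used as the transport when the push fails). TcpChannel and BlobStore are assumed interfaces for this sketch, not Twister4Azure or Azure SDK types.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical hybrid intermediate-data transfer: fast non-persistent direct
// TCP push from Map to Reduce, backed by a slower persistent blob upload so a
// failed task can still re-fetch the data.
class HybridTransfer {
    interface TcpChannel { boolean tryPush(String key, byte[] payload); }
    interface BlobStore  { void upload(String key, byte[] payload); }

    private final TcpChannel directToReducer;
    private final BlobStore blobStore;
    private final ExecutorService background = Executors.newSingleThreadExecutor();

    HybridTransfer(TcpChannel directToReducer, BlobStore blobStore) {
        this.directToReducer = directToReducer;
        this.blobStore = blobStore;
    }

    void sendMapOutput(String key, byte[] payload) {
        // Fast path: non-persistent direct TCP push from Map to Reduce.
        boolean pushed = directToReducer.tryPush(key, payload);
        if (pushed) {
            // Persistent copy uploaded in the background for fault tolerance.
            background.submit(() -> blobStore.upload(key, payload));
        } else {
            // Fallback: the blob itself becomes the transport.
            blobStore.upload(key, payload);
        }
    }
}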
Collective Communication Primitives for Iterative MapReduce
• Supports common higher-level communication patterns
• Performance
– Framework can optimize these operations transparently to the users
• Multi-algorithm
– Avoids unnecessary steps in traditional MR and iterative MR
• Ease of use
– Users do not have to manually implement this logic (e.g. as Reduce and Merge tasks)
– Preserves the Map & Reduce APIs
• AllGather
• We are working on several other primitives as well
AllGather Primitive
• AllGather
– MDS BCCalc, PageRank (with in-links matrix)
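To make the AllGather semantics concrete, the toy sketch below shows the intended outcome: every map task contributes a partial block, and every task of the next iteration receives the full, ordered collection, replacing the Reduce -> Merge -> Broadcast chain. It is a local simulation for illustration only, not the framework's implementation.

import java.util.*;

// Local simulation of AllGather semantics: concatenate the partial results of
// all map tasks and hand the full collection to every task of the next iteration.
public class AllGatherSketch {
    static double[][] allGather(List<double[]> partials) {
        return partials.toArray(new double[0][]);
    }

    public static void main(String[] args) {
        // Pretend three map tasks each produced one row of a result matrix.
        List<double[]> partials = Arrays.asList(
                new double[]{1.0, 2.0},
                new double[]{3.0, 4.0},
                new double[]{5.0, 6.0});
        double[][] gathered = allGather(partials);
        // Every task of the next iteration would receive this same 3x2 matrix.
        System.out.println(Arrays.deepToString(gathered));
    }
}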
[Performance charts: Task Execution Time Histogram and Number of Executing Map Task Histogram – the first iteration performs the initial data fetch, with some overhead between iterations; Strong Scaling with 128M Data Points and Weak Scaling – scales better than Hadoop on bare metal]
Multi Dimensional Scaling
• BC: Calculate BX (Map -> Reduce -> Merge)
• X: Calculate invV(BX) (Map -> Reduce -> Merge)
• Calculate Stress (Map -> Reduce -> Merge)
• New Iteration loops back to the BC step
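For reference, these three steps match the standard SMACOF formulation of MDS (standard textbook formulas, not taken from the slides): the BC step forms B(X)·X, the X step multiplies by the pseudo-inverse V⁺, and the stress step evaluates convergence.

% Standard SMACOF update assumed to underlie the three steps above:
X^{(k+1)} = V^{+}\, B\!\left(X^{(k)}\right) X^{(k)},
\qquad
\sigma(X) = \sum_{i<j} w_{ij}\,\left(d_{ij}(X) - \delta_{ij}\right)^{2}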
[Performance charts: Data Size Scaling and Weak Scaling – performance adjusted for the sequential performance difference]
Scalable Parallel Scientific Computing Using Twister4Azure. Thilina Gunarathne, Bingjing Zhang, Tak-Lon Wu and Judy Qiu. Submitted to Journal of Future Generation Computer Systems. (Invited as one of the best 6 papers of UCC 2011)
Multi Dimensional Scaling
[Charts: Task Execution Time (s) per Map Task ID and Number of Executing Map Tasks over Elapsed Time (s), for the MDSBCCalc and MDSStressCalc stages]
Performance Comparisons
[Charts: Smith-Waterman Sequence Alignment; BLAST Sequence Search – Parallel Efficiency vs. Number of Query Files (128–728) for Twister4Azure, Hadoop-Blast and DryadLINQ-Blast; Cap3 Sequence Assembly – Parallel Efficiency vs. Num. of Cores * Num. of Files for Twister4Azure, Amazon EMR and Apache Hadoop]
MapReduce in the Clouds for Science, Thilina Gunarathne, et al. CloudCom 2010, Indianapolis, IN.
KMEANS CLUSTERING DEMO
Acknowledgement
• Microsoft Extreme Computing Group for the Azure Compute Grant
• Persistent Systems
• IU Salsa Group