Transcript SC11_WJ

MATE-CG: A MapReduce-Like Framework for Accelerating Data-Intensive Computations on Heterogeneous Clusters
Wei Jiang and Gagan Agrawal
Outline
Background
System Design and Implementation
Auto-Tuning Framework
Applications
Experiments
Related Work
Conclusion

Background (I)
• Map-Reduce
  - Simple API: easy to write parallel programs with map and reduce
  - Fault-tolerant for large-scale data centers
  - Performance? Always a concern for the HPC community
• Generalized Reduction as a variant
  - Shares a similar processing structure
  - The key difference lies in a programmer-managed reduction object
  - Outperformed Map-Reduce for a sub-class of data-intensive applications

Background (II)
• Parallel Computing Environments
  - CPU clusters (multi-core)
    - Most widely used as traditional HPC platforms
    - Support MapReduce and many of its variants
  - GPU clusters (many-core)
    - Higher performance with better cost and energy efficiency
    - Low programming productivity
    - Limited MapReduce-like support: Mars for a single GPU, a recent IDAV work for GPU clusters, ...
  - CPU-GPU clusters (heterogeneous systems)
    - No MapReduce-like support to date!

Background (III)
• Previously developed MATE (a Map-Reduce system with an AlternaTE API) for multi-core environments
  - Phoenix implemented Map-Reduce on shared-memory systems
  - MATE adopted Generalized Reduction, first proposed in FREERIDE, developed at Ohio State in 2001-2003
• Compared MATE and Phoenix on data mining applications
  - Comparing performance and API
  - Understanding performance overheads
• MATE provided an alternative API better than MapReduce for some data-intensive applications
  - Assumption: the reduction object must fit in memory

Background (IV)
• To address this limitation of MATE, we developed Ex-MATE to support the management of large reduction objects, which are required by graph mining applications
  - Developed support for managing a disk-resident reduction object and updating it efficiently in distributed environments
  - Evaluated Ex-MATE against PEGASUS, a Hadoop-based graph mining system
  - Ex-MATE outperformed PEGASUS for three graph mining applications by factors ranging from 9 to 35

Map-Reduce Execution

Comparing Processing Structures
• The Reduction Object represents the intermediate state of the execution
• The reduce function is commutative and associative
• Sorting, grouping, and similar overheads are eliminated with the reduction function/object

Observations on Processing Structures
• Map-Reduce is based on a functional idea
  - Does not maintain state
  - This can lead to overheads of managing intermediate results between map and reduce
  - Map could generate intermediate results of very large size
• The reduction-based approach is built around a programmer-managed reduction object (a small sketch follows below)
  - Not as 'clean'
  - But avoids sorting of intermediate results
  - Can also help shared-memory parallelization
  - Helps better fault-recovery

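To make the contrast concrete, the following is a minimal, illustrative C++ sketch of the generalized-reduction style: each input element is accumulated directly into the programmer-managed reduction object, so no intermediate (key, value) pairs need to be sorted or grouped. The names (ReductionObject, local_reduce, merge) are hypothetical and are not the actual MATE-CG API.

    // Illustrative only: a histogram computed in the generalized-reduction style.
    // Each element updates the reduction object in place; because the update is
    // commutative and associative, partial objects from different threads or
    // nodes can simply be merged at the end.
    #include <cstddef>
    #include <vector>

    struct ReductionObject {                    // hypothetical programmer-managed state
        std::vector<long> counts;
        explicit ReductionObject(std::size_t bins) : counts(bins, 0) {}
    };

    void local_reduce(ReductionObject &ro, unsigned element) {      // per-element update
        ro.counts[element % ro.counts.size()] += 1;
    }

    void merge(ReductionObject &dst, const ReductionObject &src) {  // global combine
        for (std::size_t i = 0; i < dst.counts.size(); ++i)
            dst.counts[i] += src.counts[i];
    }
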
System Design and Implementation
• Execution overview of MATE-CG
• System API
  - Support for heterogeneous computing
  - Data types: Input_Space and Reduction_Object
  - Functions: CPU_Reduction and GPU_Reduction
• Runtime
  - Partitioning disk-resident datasets among nodes
  - Managing a large-sized reduction object on disk
  - Managing large-sized intermediate data
  - Using GPUs to accelerate computation

MATE-CG Overview
(Figure: execution work-flow)

System API
(Figure: data types and functions)

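The slide figure listing the data types and functions is not reproduced here. The following C++ declarations are a hypothetical sketch based only on the names mentioned above (Input_Space, Reduction_Object, CPU_Reduction, GPU_Reduction); the real MATE-CG signatures may differ.

    // Hypothetical API sketch; types and signatures are assumptions.
    #include <cstddef>

    struct Input_Space {            // one split/chunk of the disk-resident input
        const char *data;
        std::size_t length;
    };

    struct Reduction_Object {       // programmer-managed, possibly disk-resident state
        void *buffer;
        std::size_t size;
    };

    // User-supplied reduction functions, one per device type; the runtime would
    // invoke them on the CPU chunks and the GPU chunks respectively.
    typedef void (*CPU_Reduction)(const Input_Space *in, Reduction_Object *ro);
    typedef void (*GPU_Reduction)(const Input_Space *in, Reduction_Object *ro);
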
Implementation Considerations (I)
• A multi-level data partitioning scheme (a sketch follows below)
  - First, partitioning function: partition the input into blocks and distribute them to different nodes
    - Data locality should be considered
  - Second, heterogeneous data mapping: cut each block into two parts, one for the CPU and the other for the GPU
    - How to identify the best data mapping?
  - Third, splitting function: split each part of a data block into smaller chunks
    - Observation: smaller chunk size for the CPU and larger chunk size for the GPU

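As a concrete illustration of the three levels, here is a small C++ sketch, assuming byte-addressed blocks, a CPU fraction p, and separate CPU/GPU chunk sizes; none of these names or defaults come from the MATE-CG implementation.

    // Illustrative three-level partitioning; block/chunk sizes and the CPU
    // fraction p are placeholders, not MATE-CG defaults.
    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct Chunk { std::size_t offset, length; };

    // Level 3: split one device's share of a block into fixed-size chunks.
    std::vector<Chunk> split_into_chunks(std::size_t offset, std::size_t length,
                                         std::size_t chunk_size) {
        std::vector<Chunk> chunks;
        for (std::size_t o = 0; o < length; o += chunk_size)
            chunks.push_back({offset + o, std::min(chunk_size, length - o)});
        return chunks;
    }

    // Levels 1-2: blocks have already been assigned to nodes; each block is cut
    // into a CPU part (fraction p) and a GPU part (fraction 1 - p), and each part
    // is then split into chunks of a device-specific size.
    void partition_block(std::size_t block_offset, std::size_t block_size, double p,
                         std::size_t cpu_chunk_size, std::size_t gpu_chunk_size,
                         std::vector<Chunk> &cpu_chunks, std::vector<Chunk> &gpu_chunks) {
        std::size_t cpu_bytes = static_cast<std::size_t>(block_size * p);
        cpu_chunks = split_into_chunks(block_offset, cpu_bytes, cpu_chunk_size);
        gpu_chunks = split_into_chunks(block_offset + cpu_bytes,
                                       block_size - cpu_bytes, gpu_chunk_size);
    }

Consistent with the observation above, cpu_chunk_size would typically be chosen smaller than gpu_chunk_size.
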
Implementation Considerations (II)
• Management of the large-sized reduction object and intermediate data:
  - Reduce disk I/O for large reduction objects (a sketch follows below):
    - Data access patterns are used to reuse splits of the reduction object as much as possible
    - Transparent to user code
  - Reduce network costs for large intermediate data:
    - A generic solution that invokes an all-to-all broadcast among all nodes would cause severe performance losses
    - Application-driven optimizations can be used to improve performance

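A minimal sketch of the first idea, keeping reduction-object splits in memory so they can be reused across chunks instead of being re-read from disk; the split cache shown here is a hypothetical illustration of the access-pattern-driven reuse mentioned above, not the MATE-CG implementation.

    // Hypothetical cache of disk-resident reduction-object splits; a split is
    // loaded from disk at most once for as long as it stays cached, so repeated
    // accesses that the data access pattern predicts are served from memory.
    #include <unordered_map>
    #include <vector>

    using Split = std::vector<double>;

    class SplitCache {
      public:
        // Return the in-memory copy of a split, loading it from disk only on a miss.
        Split &get(int split_id) {
            auto it = cache_.find(split_id);
            if (it == cache_.end())
                it = cache_.emplace(split_id, load_from_disk(split_id)).first;
            return it->second;
        }

      private:
        Split load_from_disk(int split_id) {
            (void)split_id;              // a real implementation would read this split
            return Split(1024, 0.0);     // stand-in for the on-disk data
        }
        std::unordered_map<int, Split> cache_;
    };
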
Auto-Tuning Framework
• Auto-tuning problem: given an application, find the optimal parameter setting for distributing data between the CPU and the GPU, which have different processing capabilities
  - For example: 20/80? 50/50? 70/30?
• Our approach: exploit the iterative nature of many data-intensive applications, which perform similar computations over a number of iterations
  - Construct an analytical model to predict performance
  - The optimal value is learned over the first few iterations
  - No compile-time search or tuning is needed
  - Low runtime overheads when there is a large number of iterations

The Analytical Model (I)
• We focus on the two main components of the overall running time on each node: the data processing time on the CPU and/or the GPU, and the overheads on the CPU
• First, considering the CPU only, we have: (equation not shown)
• Second, on the GPU, we have: (equation not shown)
• Third, letting Tcg represent the heterogeneous execution time using both the CPU and the GPU, we have: (equation not shown)

The Analytical Model (II)
• Let p represent the fraction of data assigned to the CPU; then we have: (equations not shown)
• Overall, to relate Tcg with p, we have the following illustration:

The Analytical Model (III)
• Illustration of the relationship between Tcg and p (figure not reproduced)

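The figure and the exact equations are not reproduced in this transcript. As a hedged reconstruction consistent with the surrounding description, the relationship could take the following form, where p is the CPU fraction, N the data size per node, r_c and r_g the per-device processing rates, and o_c and o_g fixed overheads; this is an illustrative sketch, not the paper's exact model.

    % Illustrative forms only (assumed), not the original equations.
    \begin{align*}
      T_c(p)    &\approx o_c + \frac{pN}{r_c}, &
      T_g(p)    &\approx o_g + \frac{(1-p)N}{r_g}, \\
      T_{cg}(p) &= \max\bigl(T_c(p),\, T_g(p)\bigr)
    \end{align*}

Under these assumed forms, Tcg(p) is minimized where the CPU and GPU parts take the same time, which is exactly the "finish simultaneously" condition used by the heuristic on the next slide.
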
The Analytical Model (IV)
• To minimize Tcg by computing the optimal p, we have: (expression not shown)
• To identify the best p, a simple heuristic is used (sketched below):
  - First, set p to 1: use CPUs only
  - Second, set p to 0: use GPUs only
  - Obtain the necessary values for the other parameters in the above expression and predict an initial p
  - Adjust p accordingly in future iterations to account for variances in the measured values: make the CPU and the GPU finish simultaneously

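A minimal C++ sketch of this heuristic, assuming a linear per-device cost model; the function names are illustrative assumptions, not MATE-CG internals.

    // Hypothetical auto-tuning heuristic based on the steps described above.
    #include <algorithm>

    // Given the time of one full iteration on the CPU alone (t_c, from p = 1) and
    // on the GPU alone (t_g, from p = 0), predict the CPU fraction p that makes
    // both devices finish simultaneously under a linear cost model.
    double predict_initial_p(double t_c, double t_g) {
        return t_g / (t_c + t_g);
    }

    // After an iteration run with fraction p, where the CPU part took t_cpu and
    // the GPU part took t_gpu, refine p toward equal finishing times.
    double adjust_p(double p, double t_cpu, double t_gpu) {
        double per_unit_cpu = t_cpu / p;           // cost per unit of data on the CPU
        double per_unit_gpu = t_gpu / (1.0 - p);   // cost per unit of data on the GPU
        double next = per_unit_gpu / (per_unit_cpu + per_unit_gpu);
        return std::min(0.99, std::max(0.01, next));
    }

The first two iterations would supply t_c and t_g (p = 1 and p = 0, as on the slide); subsequent iterations call adjust_p with the measured per-device times.
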
Applications: three representatives
• Gridding kernel, from scientific computing
  - Single pass: convert visibilities into a grid model of the sky
• The Expectation-Maximization (EM) algorithm, from data mining
  - Iterative: estimate a vector of parameters
  - Two consecutive steps: the Expectation step (E-step) and the Maximization step (M-step)
• PageRank, from graph mining
  - Iterative: calculate the relative importance of web pages
  - Essentially a matrix-vector multiplication algorithm (sketched below)

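To make the matrix-vector view concrete, here is a small, self-contained C++ sketch of one PageRank iteration over a sparse adjacency structure; it is illustrative only and is not taken from the MATE-CG implementation. The damping factor 0.85 is the conventional choice, not a value from the slides.

    // One PageRank iteration expressed as a sparse matrix-vector multiplication:
    // rank_new = d * M * rank_old + (1 - d) / n, where M is the column-stochastic
    // link matrix. Assumes every page has at least one outgoing link.
    #include <cstddef>
    #include <vector>

    struct Edge { int src, dst; };   // link from page src to page dst

    std::vector<double> pagerank_step(const std::vector<Edge> &edges,
                                      const std::vector<int> &out_degree,
                                      const std::vector<double> &rank,
                                      double d = 0.85) {
        std::size_t n = rank.size();
        std::vector<double> next(n, (1.0 - d) / n);
        for (const Edge &e : edges)                   // M * rank, one nonzero at a time
            next[e.dst] += d * rank[e.src] / out_degree[e.src];
        return next;
    }
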
Applications: Optimizations (I)
• The Expectation-Maximization algorithm
  - Large intermediate matrix between the E-step and the M-step
  - Broadcasting such a large matrix among all nodes could incur high network communication costs
  - Optimization: on the same node, the M-step reads the same subset of the intermediate matrix as produced by the E-step (use of a common partitioner)
• PageRank
  - Data-copying overheads are significant on GPUs
  - Smaller input-vector splits are shared by larger matrix blocks that need further splitting
  - Optimization: copy shared input-vector splits only once to save copying time (fine-grained copying, sketched below)

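A minimal sketch of the fine-grained copying idea, assuming integer split identifiers and a caller-supplied copy function; it illustrates the optimization described above and is not MATE-CG or CUDA code.

    // Illustrative fine-grained copying: each shared input-vector split is copied
    // to the GPU at most once, even when several matrix blocks reference it.
    #include <functional>
    #include <unordered_set>
    #include <vector>

    // copy_fn stands in for the actual host-to-device copy of one vector split.
    void copy_splits_for_blocks(const std::vector<std::vector<int>> &splits_per_block,
                                const std::function<void(int)> &copy_fn) {
        std::unordered_set<int> already_on_gpu;
        for (const auto &block_splits : splits_per_block)
            for (int split_id : block_splits)
                if (already_on_gpu.insert(split_id).second)   // not copied yet
                    copy_fn(split_id);                        // copy exactly once
    }
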
Applications: Optimizations (II)
• Outline of data copying and computation on GPUs (figure not reproduced)

Example User Code
• Gridding Kernel (code figure not reproduced)

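Since the code figure is not reproduced in this transcript, the following is a hypothetical sketch of what reduction-style user code for the gridding kernel might look like: the sky grid acts as the reduction object and each visibility is accumulated into a grid cell. The structures and the per-visibility arithmetic are illustrative assumptions, not the actual MATE-CG example.

    // Hypothetical gridding-kernel user code in a reduction-object style.
    // Assumes u, v already map to non-negative grid coordinates.
    #include <cstddef>
    #include <vector>

    struct Visibility { double u, v, value; };      // one observed visibility

    struct SkyGrid {                                // reduction object: the grid model
        std::size_t width, height;
        std::vector<double> cells;
        SkyGrid(std::size_t w, std::size_t h) : width(w), height(h), cells(w * h, 0.0) {}
    };

    // CPU-side reduction over one chunk of visibilities; a GPU version would
    // mirror this logic on the device's share of the data.
    void gridding_cpu_reduction(const Visibility *vis, std::size_t count, SkyGrid &grid) {
        for (std::size_t i = 0; i < count; ++i) {
            std::size_t x = static_cast<std::size_t>(vis[i].u) % grid.width;
            std::size_t y = static_cast<std::size_t>(vis[i].v) % grid.height;
            grid.cells[y * grid.width + x] += vis[i].value;   // commutative update
        }
    }
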
Experiments Design (I)
• Experimental platform
  - A heterogeneous CPU-GPU cluster
  - Each node has one Intel 8-core CPU and an NVIDIA Tesla (Fermi) GPU (448 cores)
  - Used up to 128 CPU cores and 7,168 GPU cores on 16 nodes

Experiments Design (II)
• Three representative applications
  - Gridding kernel, EM, and PageRank
• For each application, we run it in four modes on the cluster:
  - CPU-1: 1 CPU core per node, as the baseline
  - CPU-8: 8 CPU cores per node
  - GPU-only: only the GPU per node
  - CPU-8-n-GPU: both 8 CPU cores and the GPU per node

Experiments Design (III)
• We focused on three aspects:
  - For each application: scalability with an increasing number of GPUs, and performance improvement over CPUs
  - Performance improvement from heterogeneous computing, and effectiveness of the auto-tuning framework
  - Performance impact of application-driven optimizations, and examples of system tuning

Results: Scalability with # of GPUs (I)
• PageRank: 64GB dataset; a graph of 1 billion nodes and 4 billion edges
• (Chart; data labels from the slide: 7.0, 6.8, 6.3, 5.0, 16%)

Results: Scalability with # of GPUs (II)
• Gridding Kernel: 32GB dataset; a collection of 800 million visibilities and a 6.4GB sky grid
• (Chart; data labels from the slide: 7.5, 7.2, 6.9, 6.5, 25%)

Results: Scalability with # of GPUs (III)
• EM: 32GB dataset; a cluster of 1 billion points
• (Chart; data labels from the slide: 7.6, 6.8, 5.0, 15.0, 3.0)

Results: Auto-tuning (I)
• PageRank: 64GB dataset on 16 nodes
• (Chart; data labels from the slide: 7%, p=0.30)

Results: Auto-tuning (II)
• EM: 32GB dataset on 16 nodes
• (Chart; data labels from the slide: E: 29%, M: 24%, E: p=0.31, M: p=0.27)

Results: Heterogeneous Execution
• Gridding Kernel: 32GB dataset on 16 nodes
• (Chart; data labels from the slide: >=56%, >=42%)

Results: App-Driven Optimizations (I)
• EM: 4GB dataset with a 20GB intermediate matrix
• (Chart; data labels from the slide: 1.7, 7.7)

Results: App-Driven Optimizations (II)
• PageRank: 32GB dataset with a block size of 512MB and a GPU chunk size of 128MB
• (Chart; data label from the slide: 24%)

Results: Examples for System Tuning
• Gridding Kernel: 32GB dataset; varying cpu_chunk_size and gpu_chunk_size
• (Chart; data labels from the slide: 16MB, 512MB)

Insights
• GPUs can significantly accelerate certain classes of computations, but they come with programming difficulties and introduce data-copying overheads
• A suitable data mapping between the CPU and the GPU is crucial for the overall performance of heterogeneous computing
• Application-specific opportunities should be exploited, through optional APIs, to make the best use of a parallel system
• Automatic optimization is desirable for choosing the correct set of system parameter settings

Related Work
• Data-intensive computing with map-reduce-like models
  - Multi-core CPUs: Phoenix, Phoenix-rebirth, ...
  - A single GPU: Mars, MapCG, ...
  - GPU clusters: MITHRA, IDAV, ...
  - CPU-GPU clusters: our MATE-CG
• Programming heterogeneous systems
  - Software end: Merge, EXOCHI, Harmony, Qilin, ...
  - Hardware end: CUBA, CUDA, OpenCL, ...
• Auto-tuning: a lot of efforts
  - Basic idea: search for the best solution among all possibilities
    - Map solutions to performance metrics
    - Map hardware/software characteristics to parameters
  - Very useful for library generators, compilers, runtime systems, ...

Conclusions
• MATE-CG supports a map-reduce-like API to ease the difficulty of programming a heterogeneous CPU-GPU cluster
• The system achieves good scalability with an increasing number of nodes, and heterogeneous execution further improves performance over CPU-only or GPU-only execution
• It introduces a novel and effective auto-tuning approach for choosing the best data mapping between the CPU and the GPU
• Application-specific optimizations should be considered in the user code, and a high-level API should be coupled with significant auto-tuning to identify the right system parameter settings automatically

Questions?