Data-Intensive Computing: From Clouds to GPUs
Gagan Agrawal
April 12, 2016
Motivation
• Growing need for analysis of large-scale data
  • Scientific
  • Commercial
• Data-Intensive Supercomputing (DISC)
• Map-Reduce has received a lot of attention
  • Database and data mining communities
  • High-performance computing community
Motivation (2)
• Processor architecture trends
  • Clock speeds are not increasing
  • Trend towards multi-core architectures
  • Accelerators like GPUs are very popular
  • Clusters of multi-cores / GPUs are becoming common
• Trend towards the cloud
  • Use storage/computing/services from a provider
    • How many of you prefer Gmail over a cse/osu account for e-mail?
  • Utility model of computation
  • Need for high-level APIs and adaptation
My Research Group
• Data-intensive theme at multiple levels:
  • Parallel programming models
  • Multi-cores and accelerators
  • Adaptive middleware
  • Scientific data management / workflows
  • Deep Web integration and analysis
This Talk
• Parallel Programming API for Data-Intensive Computing
  • An alternate API and system for Google's MapReduce
  • Show an actual comparison
  • Fault-tolerance for data-intensive computing
• Data-Intensive Computing on Accelerators
  • Compilation for GPUs
Map-Reduce
• Simple API for (data-intensive) parallel programming
• The computation:
  • Apply map to each input data element
  • Produce (key, value) pair(s)
  • Sort the pairs by key
  • Apply reduce to the set of values for each distinct key
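For concreteness, word count is the canonical example of this model. Below is a minimal single-process sketch in C++ (illustrative only, not Hadoop's or Google's actual API); in a real map-reduce system the map calls, the shuffle/sort, and the reduce calls are distributed across a cluster.

#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// map: one input record (a line of text) -> a list of (word, 1) pairs
std::vector<std::pair<std::string, int>> map_fn(const std::string& line) {
    std::vector<std::pair<std::string, int>> out;
    std::istringstream in(line);
    std::string word;
    while (in >> word) out.push_back({word, 1});
    return out;
}

// reduce: all values sharing one key -> a single aggregated value
int reduce_fn(const std::vector<int>& values) {
    int sum = 0;
    for (int v : values) sum += v;
    return sum;
}

int main() {
    std::vector<std::string> input = {"a rose is a rose", "is a rose"};
    // "shuffle": group intermediate values by key (std::map keeps keys sorted)
    std::map<std::string, std::vector<int>> groups;
    for (const auto& line : input)
        for (const auto& kv : map_fn(line))
            groups[kv.first].push_back(kv.second);
    for (const auto& g : groups)
        std::cout << g.first << " " << reduce_fn(g.second) << "\n";
}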
Map-Reduce Execution
Map-Reduce: Positives and Questions
• Positives:
  • Simple API
    • Based on functional languages
    • Very easy to learn
  • Support for fault-tolerance
    • Important for very large-scale clusters
• Questions:
  • Performance?
    • Comparison with other approaches
  • Suitability for different classes of applications?
Classes of Data-Intensive Applications
• Many different types of applications
  • Data-center kind of applications
    • Data scans, sorting, indexing
  • More "compute-intensive" data-intensive applications
    • Machine learning, data mining, NLP
    • Map-Reduce / Hadoop is being widely used for this class
  • Standard database operations
    • A SIGMOD 2009 paper compares Hadoop with databases and OLAP systems
• What is Map-Reduce suitable for? What are the alternatives?
  • MPI/OpenMP/Pthreads – too low level?
FREERIDE: Goals
• Developed 2001-2003 at Ohio State
• Framework for Rapid Implementation of Data Mining Engines
• The ability to rapidly prototype a high-performance mining implementation
  • Distributed-memory parallelization
  • Shared-memory parallelization
  • Ability to process disk-resident datasets
  • Only modest modifications to a sequential implementation for the above three
FREERIDE – Technical Basis
• Popular data mining algorithms have a common canonical loop: Generalized Reduction
• Can be used as the basis for supporting a common API
• Demonstrated for popular data mining and scientific data processing applications

While( ) {
  forall (data instances d) {
    I = process(d)
    R(I) = R(I) op f(d)
  }
  …….
}
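As an example of how an algorithm maps onto this loop, in k-means clustering process(d) finds the nearest centroid for a point, and the reduction updates that cluster's running coordinate sums and count. A minimal sequential C++ sketch of one pass (illustrative only, not the actual FREERIDE code):

#include <array>
#include <iostream>
#include <vector>

constexpr int K = 3, DIM = 3;

struct ClusterSlot {                       // R(i): one slot per cluster
    std::array<double, DIM> sum{};         // running coordinate sums
    long count = 0;                        // points assigned so far
};

// process(d): index of the nearest centroid (squared Euclidean distance)
int process(const std::array<double, DIM>& d,
            const std::vector<std::array<double, DIM>>& centroids) {
    int best = 0;
    double best_dist = 1e300;
    for (int k = 0; k < K; ++k) {
        double dist = 0;
        for (int j = 0; j < DIM; ++j)
            dist += (d[j] - centroids[k][j]) * (d[j] - centroids[k][j]);
        if (dist < best_dist) { best_dist = dist; best = k; }
    }
    return best;
}

// One pass of the canonical loop: R(i) = R(i) op f(d), with "op" being +=
void reduction_pass(const std::vector<std::array<double, DIM>>& data,
                    const std::vector<std::array<double, DIM>>& centroids,
                    std::vector<ClusterSlot>& R) {
    for (const auto& d : data) {
        int i = process(d, centroids);
        for (int j = 0; j < DIM; ++j) R[i].sum[j] += d[j];
        R[i].count += 1;
    }
}

int main() {
    std::vector<std::array<double, DIM>> centroids = {{{0, 0, 0}}, {{5, 5, 5}}, {{10, 10, 10}}};
    std::vector<std::array<double, DIM>> data = {{{0.5, 0, 0}}, {{5, 4, 6}}, {{9, 10, 11}}, {{0, 1, 0}}};
    std::vector<ClusterSlot> R(K);
    reduction_pass(data, centroids, R);
    for (int k = 0; k < K; ++k)
        std::cout << "cluster " << k << ": " << R[k].count << " points\n";
}

After the pass, the new centroids are simply sum/count for each cluster, and the outer while( ) repeats until convergence.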
Comparing Processing Structure
• Similar, but with subtle differences
  (figure: side-by-side comparison of the Map-Reduce and Generalized Reduction processing structures)
Processing Structure for FREERIDE
• Basic one-stage dataflow
  (figure)
Observations on Processing Structure
• Map-Reduce is based on a functional idea
  • Does not maintain state
  • This can lead to sorting overheads
• The FREERIDE API is based on a programmer-managed reduction object
  • Not as 'clean'
  • But avoids sorting (see the sketch below)
  • Can also help shared-memory parallelization
  • Helps better fault recovery
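To make the contrast concrete, here is the same word count written in the reduction-object style: updates go straight into programmer-managed state, so there are no intermediate (key, value) pairs to emit and sort. The accumulate() helper below is only illustrative, not FREERIDE's actual interface:

#include <iostream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// The reduction object: programmer-managed state that is updated in place.
using ReductionObject = std::unordered_map<std::string, long>;

// Illustrative accumulate(): applies the reduction op (+) at the given key.
void accumulate(ReductionObject& robj, const std::string& key, long value) {
    robj[key] += value;   // no intermediate pairs, no sort phase
}

int main() {
    std::vector<std::string> input = {"a rose is a rose", "is a rose"};
    ReductionObject counts;
    for (const auto& line : input) {
        std::istringstream in(line);
        std::string word;
        while (in >> word) accumulate(counts, word, 1);
    }
    for (const auto& kv : counts)
        std::cout << kv.first << " " << kv.second << "\n";
}

For shared-memory parallelization, each thread can update a private copy of the reduction object and the copies are combined at the end, which is the same idea used on GPUs later in the talk.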
Experiment Design
• Tuning parameters in Hadoop
  • Input split size
  • Max number of concurrent map tasks per node
  • Number of reduce tasks
• For comparison, we used four applications
  • Data mining: KMeans, KNN, Apriori
  • Simple data scan application: Wordcount
• Experiments on a multi-core cluster
  • 8 cores per node (8 map tasks)
Results – Data Mining
• KMeans, varying # of nodes (4, 8, 16); dataset: 6.4 GB, K = 1000, dim = 3
  (chart: avg. time per iteration in seconds, Hadoop vs. FREERIDE)
Results – Data Mining (II)
• Apriori, varying # of nodes (4, 8, 16); dataset: 900 MB, support level: 3%, confidence level: 9%
  (chart: avg. time per iteration in seconds, Hadoop vs. FREERIDE)
Results – Data Mining (III)
• KNN, varying # of nodes (4, 8, 16); dataset: 6.4 GB, K = 1000, dim = 3
  (chart: avg. time per iteration in seconds, Hadoop vs. FREERIDE)
Results – Datacenter-like Application
• Wordcount, varying # of nodes (4, 8, 16); dataset: 6.4 GB
  (chart: total time in seconds, Hadoop vs. FREERIDE)
Scalability Comparison
• KMeans, varying dataset size (800 MB, 1.6 GB, 3.2 GB, 6.4 GB, 12.8 GB); K = 100, dim = 3, on 8 nodes
  (chart: avg. time per iteration in seconds, Hadoop vs. FREERIDE)
Scalability – Word Count
• Wordcount, varying dataset size (800 MB, 1.6 GB, 3.2 GB, 6.4 GB, 12.8 GB), on 8 nodes
  (chart: total time in seconds, Hadoop vs. FREERIDE)
Observations
• Performance issues with Hadoop are now well known
• How much of a factor is the API?
  • Java, file system
  • API comparison on the same platform
• Design of MATE
  • Map-reduce with an AlternaTE API
Basis: Phoenix Implementation
• Shared-memory map-reduce implementation from Stanford
• C based
• An efficient runtime that handles parallelization, resource management, and fault recovery
• Supports a FREERIDE-like API
Functions
• APIs provided by the runtime (R = required, O = optional); usage sketched below:

  Function                                                              R/O
  int mate_init(scheduler_args_t *args)                                 R
  int mate_scheduler(void *args)                                        R
  int mate_finalize(void *args)                                         O
  void reduction_object_pre_init()                                      R
  int reduction_object_alloc(int size)   (returns the object id)        R
  void reduction_object_post_init()                                     R
  void accumulate/maximal/minimal(int id, int offset, void *value)      O
  void reuse_reduction_object()                                         O
  void *get_intermediate_result(int iter, int id, int offset)           O
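A sketch of how an application might drive these calls in the order the table suggests: pre-init, allocate a reduction object, post-init, then accumulate into it while processing elements. The stub bodies below only stand in for the real MATE runtime so the call sequence runs end to end; the real runtime provides these functions and also parallelizes the processing loop:

#include <iostream>
#include <vector>

// Stub stand-ins for the runtime calls listed above (for illustration only).
static std::vector<std::vector<char>> g_objects;      // one buffer per object id

void reduction_object_pre_init() {}
int reduction_object_alloc(int size) {                // returns the object id
    g_objects.push_back(std::vector<char>(size, 0));
    return static_cast<int>(g_objects.size()) - 1;
}
void reduction_object_post_init() {}
void accumulate(int id, int offset, void* value) {    // the "op" here is integer +
    int* slot = reinterpret_cast<int*>(g_objects[id].data() + offset);
    *slot += *static_cast<int*>(value);
}

int main() {
    reduction_object_pre_init();
    int robj = reduction_object_alloc(4 * sizeof(int));     // e.g., a 4-bin histogram
    reduction_object_post_init();

    int data[] = {0, 1, 1, 3, 2, 1};
    for (int d : data) {                 // stands in for the loop MATE runs in parallel
        int one = 1;                     // process(e) here simply picks bin d
        accumulate(robj, static_cast<int>(d * sizeof(int)), &one);
    }
    for (int b = 0; b < 4; ++b) {
        int* bin = reinterpret_cast<int*>(g_objects[robj].data() + b * sizeof(int));
        std::cout << "bin " << b << ": " << *bin << "\n";
    }
}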
Experiments: K-means
• K-means: 400 MB of 3-dimensional points, k = 100, on one AMD node with 16 cores
Fault-Tolerance in FREERIDE/MATE
• Map-reduce supports fault-tolerance by replicating files
  • Storage and processing-time overheads
• The FREERIDE/MATE API offers another option
  • The reduction object is a low-cost application-level checkpoint (sketched below)
  • Can support efficient recovery
  • Can also allow redistribution of work to other nodes
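The checkpoint is cheap because, between iterations, all the state that matters is the reduction object itself (for k-means, just k * dim sums and k counts), not the input data. A hypothetical C++ sketch of saving and restoring it (the file format and helper names are made up for illustration):

#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Example reduction object (k-means style): per-cluster sums and counts.
struct ReductionObject {
    std::vector<double> sums;
    std::vector<long>   counts;
};

// Checkpoint: writing out the small reduction object is far cheaper than
// replicating or re-reading the input data.
void checkpoint(const ReductionObject& r, const std::string& path) {
    std::ofstream out(path, std::ios::binary);
    std::size_t n = r.sums.size(), m = r.counts.size();
    out.write(reinterpret_cast<const char*>(&n), sizeof n);
    out.write(reinterpret_cast<const char*>(r.sums.data()), n * sizeof(double));
    out.write(reinterpret_cast<const char*>(&m), sizeof m);
    out.write(reinterpret_cast<const char*>(r.counts.data()), m * sizeof(long));
}

// Recovery: a surviving node reloads the checkpoint and then re-processes only
// the data the failed node had not yet consumed (which can be redistributed).
bool restore(ReductionObject& r, const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    std::size_t n = 0, m = 0;
    in.read(reinterpret_cast<char*>(&n), sizeof n);
    r.sums.resize(n);
    in.read(reinterpret_cast<char*>(r.sums.data()), n * sizeof(double));
    in.read(reinterpret_cast<char*>(&m), sizeof m);
    r.counts.resize(m);
    in.read(reinterpret_cast<char*>(r.counts.data()), m * sizeof(long));
    return static_cast<bool>(in);
}

int main() {
    ReductionObject r{{1.0, 2.0, 3.0}, {10, 20, 30}};
    checkpoint(r, "reduction_object.ckpt");
    ReductionObject recovered;
    return restore(recovered, "reduction_object.ckpt") ? 0 : 1;
}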
Fault Tolerance Results
This Talk
• Parallel Programming API for Data-Intensive Computing
  • An alternate API and system for Google's MapReduce
  • Show an actual comparison
  • Fault-tolerance for data-intensive computing
• Data-Intensive Computing on Accelerators
  • Compilation for GPUs
Background - GPU Computing
• Many-core architectures / accelerators are becoming more popular
• GPUs are inexpensive and fast
• CUDA is a high-level language for GPU programming
CUDA Programming
• Significant improvement over use of graphics libraries
• But:
  • Need detailed knowledge of the architecture of the GPU and a new language
  • Must specify the grid configuration
  • Deal with memory allocation and movement
  • Explicit management of the memory hierarchy
Parallel Data Mining
• Common structure of data mining applications (FREERIDE):

/* outer sequential loop */
while() {
  /* reduction loop */
  foreach (element e) {
    (i, val) = process(e);
    Reduc(i) = Reduc(i) op val;
  }
}
Porting on GPUs
• High-level parallelization is straightforward
• Details of data movement
• Impact of thread count on reduction time
• Use of shared memory
Architecture of the System
(system diagram) User input (variable information, reduction functions, optional functions) feeds a Variable Analyzer and a Code Analyzer (in LLVM); the analysis yields the variable access pattern and combination operations, from which the Code Generator produces the host program, the kernel functions, and the grid configuration / kernel invocation, which are built into the executable.
User Input
• Variables to be used in the reduction function
• Values of each variable, or the size of each array
• A sequential reduction function
• Optional functions (initialization function, combination function, …)
Analysis of Sequential Code
• Get the access pattern of each variable
• Determine the data to be replicated
• Get the operator for global combination
• Identify the variables to place in shared memory
Memory Allocation and Copy
(figure: memory layouts A-C of the reduction data across threads T0...T63)
• Need a copy for each thread
• Copy the updates back to host memory after the kernel reduction function returns
Generating CUDA Code and C/C++ Code Invoking the Kernel Function
• Memory allocation and copy
• Thread grid configuration (block number and thread number)
• Global function
• Kernel reduction function
• Global combination
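A rough, hand-written illustration of the pattern such generated code follows for a simple reduction (a histogram, with one copy of the reduction object per thread and a separate global-combination kernel); this is not the actual output of the code generator:

#include <cstdio>
#include <cuda_runtime.h>

#define BINS 4

// Kernel reduction function: each thread accumulates into its own copy of the
// reduction object (one row of `reduc` per thread), so no locking is needed.
__global__ void reduction_kernel(const int* data, int n, int* reduc) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int nthreads = gridDim.x * blockDim.x;
    int* my = reduc + tid * BINS;               // this thread's private copy
    for (int i = tid; i < n; i += nthreads)     // process(e): data[i] picks a bin
        my[data[i]] += 1;                       // Reduc(i) = Reduc(i) op val
}

// Global combination: merge all per-thread copies into copy 0 (one thread per bin).
__global__ void combine_kernel(int* reduc, int nthreads) {
    int b = threadIdx.x;
    for (int t = 1; t < nthreads; ++t)
        reduc[b] += reduc[t * BINS + b];
}

int main() {
    const int n = 1 << 20, blocks = 64, threads = 64, total = blocks * threads;
    int* h_data = new int[n];
    for (int i = 0; i < n; ++i) h_data[i] = i % BINS;

    // Memory allocation and copy (host -> device)
    int *d_data, *d_reduc;
    cudaMalloc((void**)&d_data, n * sizeof(int));
    cudaMalloc((void**)&d_reduc, total * BINS * sizeof(int));
    cudaMemcpy(d_data, h_data, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_reduc, 0, total * BINS * sizeof(int));

    // Thread grid configuration and kernel invocation
    reduction_kernel<<<blocks, threads>>>(d_data, n, d_reduc);
    combine_kernel<<<1, BINS>>>(d_reduc, total);

    // Copy the combined reduction object back to the host
    int result[BINS];
    cudaMemcpy(result, d_reduc, BINS * sizeof(int), cudaMemcpyDeviceToHost);
    for (int b = 0; b < BINS; ++b) printf("bin %d: %d\n", b, result[b]);

    cudaFree(d_data); cudaFree(d_reduc); delete[] h_data;
    return 0;
}

The shared-memory optimization mentioned on the next slide would keep each block's copies of the reduction object in __shared__ memory instead of global memory when they fit.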
Optimizations
• Using shared memory
• Providing user-specified initialization functions and combination functions
• Specifying variables that are allocated once
Applications
• K-means clustering
• EM clustering
• PCA
K-means Results
(chart)
Speedups
(charts: speedup of EM, speedup of PCA)
Summary
• Data-intensive computing is of growing importance
• One size doesn't fit all
  • Map-Reduce has many limitations
  • Accelerators can be promising for achieving performance