Data-Intensive Computing:
From Clouds to GPUs
Gagan Agrawal
Motivation
Growing need for analysis of large-scale data
Scientific
Commercial
Data-intensive Supercomputing (DISC)
Map-Reduce has received a lot of attention
Database and data mining communities
High-performance computing community
Motivation (2)
Processor Architecture Trends
Clock speeds are not increasing
Trend towards multi-core architectures
Accelerators like GPUs are very popular
Clusters of multi-cores / GPUs are becoming common
Trend Towards the Cloud
Use storage/computing/services from a provider
How many of you prefer Gmail over a cse/osu account for e-mail?
Utility model of computation
Need high-level APIs and adaptation
My Research Group
Data-intensive theme at multiple levels
Parallel Programming Models
Multi-cores and accelerators
Adaptive Middleware
Scientific Data Management / Workflows
Deep Web Integration and Analysis
This Talk
Parallel Programming API for Data-Intensive Computing
An alternate API and system for Google's MapReduce
Actual performance comparison
Fault-tolerance for data-intensive computing
Data-Intensive Computing on Accelerators
Compilation for GPUs
Map-Reduce
Simple API for (data-intensive) parallel programming
Computation is:
Apply map on each input data element to produce (key, value) pair(s)
Sort the pairs by key
Apply reduce to the set of values for each distinct key
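As a concrete illustration, a minimal sequential sketch of the three phases for word count (plain C++, not the Hadoop or Google API; the input string is made up):

    // Word count as map -> sort/group by key -> reduce.
    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <utility>
    #include <vector>

    int main() {
        std::string input = "the quick brown fox jumps over the lazy dog the end";
        // Map: emit one (word, 1) pair per input element.
        std::vector<std::pair<std::string, int>> pairs;
        std::istringstream in(input);
        for (std::string w; in >> w; ) pairs.emplace_back(w, 1);
        // Group by key (std::map keeps keys sorted), then reduce:
        // sum the values associated with each distinct key.
        std::map<std::string, int> counts;
        for (const auto& p : pairs) counts[p.first] += p.second;
        for (const auto& kv : counts)
            std::cout << kv.first << ": " << kv.second << "\n";
    }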
Map-Reduce Execution
[Figure: Map-Reduce execution flow]
Map-Reduce: Positives and Questions
Positives:
Simple API
Functional-language based
Very easy to learn
Support for fault-tolerance
Important for very large-scale clusters
Questions:
Performance?
Comparison with other approaches?
Suitability for different classes of applications?
Class of Data-Intensive Applications
Many different types of applications
Data-center kind of applications
Data scans, sorting, indexing
More "compute-intensive" data-intensive applications
Machine learning, data mining, NLP
Map-reduce / Hadoop being widely used for this class
Standard database operations
SIGMOD 2009 paper compares Hadoop with databases and OLAP systems
What is map-reduce suitable for?
What are the alternatives?
MPI/OpenMP/Pthreads – too low-level?
FREERIDE: GOALS
Developed 2001-2003 at Ohio State
Framework for Rapid Implementation of Data Mining Engines
The ability to rapidly prototype a high-performance mining implementation
Distributed memory parallelization
Shared memory parallelization
Ability to process disk-resident datasets
Only modest modifications to a sequential implementation for the above three
FREERIDE – Technical Basis
Popular data mining algorithms have a common canonical loop: Generalized Reduction
Can be used as the basis for supporting a common API
Demonstrated for popular data mining and scientific data processing applications

    while ( ) {
        forall (data instances d) {
            I = process(d)
            R(I) = R(I) op f(d)
        }
        ...
    }
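To make the canonical loop concrete, here is a hedged sketch of one k-means iteration written in that form (plain C++; the names and types are illustrative, not the FREERIDE API). process(d) finds the nearest center; the reduction object R holds per-cluster sums and counts.

    // One k-means iteration written as a generalized reduction.
    #include <cstddef>
    #include <vector>

    struct Point { double x, y, z; };

    void kmeans_iteration(const std::vector<Point>& data,
                          std::vector<Point>& centers) {
        std::size_t k = centers.size();
        std::vector<Point> sum(k, {0, 0, 0});       // reduction object R:
        std::vector<std::size_t> count(k, 0);       // per-cluster sums, counts
        for (const Point& d : data) {               // forall (data instances d)
            std::size_t best = 0;
            double bestDist = 1e300;
            for (std::size_t i = 0; i < k; ++i) {   // I = process(d)
                double dx = d.x - centers[i].x, dy = d.y - centers[i].y,
                       dz = d.z - centers[i].z;
                double dist = dx * dx + dy * dy + dz * dz;
                if (dist < bestDist) { bestDist = dist; best = i; }
            }
            sum[best].x += d.x; sum[best].y += d.y; sum[best].z += d.z;
            count[best] += 1;                       // R(I) = R(I) op f(d)
        }
        for (std::size_t i = 0; i < k; ++i)         // update centers from R
            if (count[i]) centers[i] = { sum[i].x / count[i],
                                         sum[i].y / count[i],
                                         sum[i].z / count[i] };
    }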
Comparing Processing Structure
Similar, but with subtle differences
[Figure: Map-Reduce vs. FREERIDE processing structures]
Processing Structure for FREERIDE
[Figure: basic one-stage dataflow]
Observations on Processing Structure
Map-Reduce is based on a functional idea
Does not maintain state
This can lead to sorting overheads
The FREERIDE API is based on a programmer-managed reduction object
Not as 'clean'
But avoids sorting
Can also help shared-memory parallelization
Enables better fault recovery
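The contrast is visible even in word count: instead of emitting (word, 1) pairs that are later grouped by key, a reduction-object style program updates keyed state in place, so no sort phase is needed. A minimal sketch (plain C++, not the actual FREERIDE calls):

    // Word count with a programmer-managed reduction object: each
    // element updates shared state directly; no (key, value) pairs
    // are emitted and no sort/group step is required.
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <unordered_map>

    int main() {
        std::string input = "the quick brown fox jumps over the lazy dog the end";
        std::unordered_map<std::string, int> robj;  // reduction object
        std::istringstream in(input);
        for (std::string w; in >> w; )
            robj[w] += 1;        // conceptually: accumulate(w, +, 1)
        for (const auto& kv : robj)
            std::cout << kv.first << ": " << kv.second << "\n";
    }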
Experiment Design
Tuning parameters in Hadoop:
Input split size
Max number of concurrent map tasks per node
Number of reduce tasks
For comparison, we used four applications
Data mining: KMeans, KNN, Apriori
Simple data scan application: Wordcount
Experiments on a multi-core cluster
8 cores per node (8 map tasks)
Results – Data Mining
KMeans: varying # of nodes
[Chart: avg. time per iteration (sec) on 4, 8, and 16 nodes, Hadoop vs. FREERIDE; dataset: 6.4 GB, k = 1000, dim = 3]
Results – Data Mining (II)
Apriori: varying # of nodes
[Chart: avg. time per iteration (sec) on 4, 8, and 16 nodes, Hadoop vs. FREERIDE; dataset: 900 MB, support level: 3%, confidence level: 9%]
Results – Data Mining (III)
KNN: varying # of nodes
[Chart: avg. time per iteration (sec) on 4, 8, and 16 nodes, Hadoop vs. FREERIDE; dataset: 6.4 GB, k = 1000, dim = 3]
Results – Datacenter-like Application
Wordcount: varying # of nodes
[Chart: total time (sec) on 4, 8, and 16 nodes, Hadoop vs. FREERIDE; dataset: 6.4 GB]
Scalability Comparison
KMeans: varying dataset size
[Chart: avg. time per iteration (sec) for 800 MB, 1.6 GB, 3.2 GB, 6.4 GB, and 12.8 GB, Hadoop vs. FREERIDE; k = 100, dim = 3, on 8 nodes]
Scalability – Word Count
Wordcount: varying dataset size
[Chart: total time (sec) for 800 MB, 1.6 GB, 3.2 GB, 6.4 GB, and 12.8 GB, Hadoop vs. FREERIDE; on 8 nodes]
Observations
Performance issues with Hadoop are now well known
How much of a factor is the API?
Java, file system
API comparison on the same platform
Design of MATE
Map-reduce with an AlternaTE API
Basis: Phoenix implementation
Shared-memory map-reduce implementation from Stanford
C based
An efficient runtime that handles parallelization, resource management, and fault recovery
Extended to support a FREERIDE-like API
Functions
APIs provided by the runtime (R = required, O = optional):

    int mate_init(scheduler_args_t *args)                               R
    int mate_scheduler(void *args)                                      R
    int mate_finalize(void *args)                                       O
    void reduction_object_pre_init()                                    R
    int reduction_object_alloc(int size)  (returns the object id)      R
    void reduction_object_post_init()                                   R
    void accumulate/maximal/minimal(int id, int offset, void *value)    O
    void reuse_reduction_object()                                       O
    void *get_intermediate_result(int iter, int id, int offset)         O
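To see the flavor of this API, here is a toy, self-contained imitation of the reduction-object calls (signatures simplified, only '+' accumulation implemented; a sketch of the idea, not MATE's actual runtime):

    // Toy imitation of the reduction-object API: an object is a flat
    // array of ints; accumulate() adds a value at a given offset.
    #include <cstdio>
    #include <vector>

    static std::vector<std::vector<int>> objects;  // all reduction objects

    int reduction_object_alloc(int size) {         // returns the object id
        objects.emplace_back(size, 0);
        return (int)objects.size() - 1;
    }

    void accumulate(int id, int offset, int value) {
        objects[id][offset] += value;              // op is '+' here
    }

    int main() {
        int hist = reduction_object_alloc(4);      // a 4-bin histogram
        int data[] = {0, 1, 1, 3, 2, 1, 0, 3};
        for (int d : data) accumulate(hist, d, 1);
        for (int i = 0; i < 4; ++i)
            std::printf("bin %d: %d\n", i, objects[hist][i]);
    }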
Experiments: K-means
K-means: 400 MB, 3-dim points, k = 100, on one AMD node with 16 cores
[Chart: performance results]
Fault-Tolerance in FREERIDE/MATE
Map-reduce supports fault-tolerance by replicating files
Storage and processing time overheads
The FREERIDE/MATE API offers another option
The reduction object is a low-cost application-level checkpoint
Can support efficient recovery
Can also allow redistribution of work to other nodes
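A minimal sketch of the checkpointing idea (plain C++; the file name, interval, and loop are illustrative assumptions): the reduction object plus a progress marker is periodically written to stable storage, and a restarted process reloads it and resumes where it left off.

    // Reduction object as an application-level checkpoint.
    #include <cstdio>
    #include <vector>

    void checkpoint(const std::vector<long>& robj, long next_elem) {
        FILE* f = std::fopen("robj.ckpt", "wb");
        if (!f) return;
        std::fwrite(&next_elem, sizeof next_elem, 1, f);  // progress marker
        std::fwrite(robj.data(), sizeof robj[0], robj.size(), f);
        std::fclose(f);
    }

    bool restore(std::vector<long>& robj, long& next_elem) {
        FILE* f = std::fopen("robj.ckpt", "rb");
        if (!f) return false;                             // no checkpoint yet
        bool ok = std::fread(&next_elem, sizeof next_elem, 1, f) == 1 &&
                  std::fread(robj.data(), sizeof robj[0], robj.size(), f)
                      == robj.size();
        std::fclose(f);
        return ok;
    }

    int main() {
        std::vector<long> robj(8, 0);
        long next = 0;
        restore(robj, next);            // resume if a checkpoint exists
        for (long i = next; i < 1000; ++i) {
            robj[i % 8] += i;           // the reduction loop
            if (i % 100 == 0) checkpoint(robj, i + 1);
        }
        std::printf("robj[0] = %ld\n", robj[0]);
    }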
Fault Tolerance Results
[Chart: fault-tolerance experiment results]
This Talk
Parallel Programming API for Data-Intensive Computing
An alternate API and system for Google's MapReduce
Actual performance comparison
Fault-tolerance for data-intensive computing
Data-Intensive Computing on Accelerators
Compilation for GPUs
Background - GPU Computing
• Many-core architectures/accelerators are becoming more popular
• GPUs are inexpensive and fast
• CUDA is a high-level language for GPU programming
CUDA Programming
• Significant improvement over the use of graphics libraries
• But:
• Need detailed knowledge of the GPU architecture and a new language
• Must specify the grid configuration
• Deal with memory allocation and movement
• Explicit management of the memory hierarchy
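A minimal CUDA sketch of the obligations listed above (array size, block size, and the kernel itself are chosen only for illustration): explicit device allocation, explicit host-device data movement, and an explicit grid configuration at launch.

    // Explicit allocation, data movement, and grid configuration in CUDA.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float* x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;                  // trivial per-element work
    }

    int main() {
        const int n = 1 << 20;
        float* h = new float[n];
        for (int i = 0; i < n; ++i) h[i] = 1.0f;
        float* d;
        cudaMalloc(&d, n * sizeof(float));                           // allocate
        cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice); // move in
        int threads = 256;                        // grid configuration
        int blocks = (n + threads - 1) / threads;
        scale<<<blocks, threads>>>(d, n);
        cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost); // move out
        std::printf("h[0] = %f\n", h[0]);
        cudaFree(d);
        delete[] h;
    }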
Parallel Data mining
• Common structure of data mining applications (FREERIDE):

    /* outer sequential loop */
    while () {
        /* reduction loop */
        foreach (element e) {
            (i, val) = process(e);
            Reduc(i) = Reduc(i) op val;
        }
    }
Porting on GPUs
High-level parallelization is straightforward; the details that matter:
Data movement
Impact of thread count on reduction time
Use of shared memory
Architecture of the System
[System diagram] User input (variable information, reduction functions, optional functions) goes to the Variable Analyzer and the Code Analyzer (in LLVM), which derive the variable access pattern and combination operations; the Code Generator then emits the host program, the kernel functions, and the grid configuration and kernel invocation, yielding the executable.
User Input
Variables to be used in the reduction function
Values of each variable or size of each array
A sequential reduction function
Optional functions (initialization function, combination function, …)
Analysis of Sequential Code
• Get the access features of each variable
• Determine the data to be replicated
• Get the operator for the global combination
• Identify variables for shared memory
Memory Allocation and Copy
[Diagram: panels A and B show the reduction object laid out per thread (T0…T63); a copy is needed for each thread. Panel C: copy the updates back to host memory after the kernel reduction function returns.]
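A hedged CUDA sketch of per-thread replication for a histogram-style reduction (sizes and names are illustrative; this is not the code the system actually emits): each thread accumulates into its own copy of the reduction object, and the host copies all copies back and performs the global combination.

    // Per-thread copies of the reduction object, combined on the host.
    #include <cstdio>
    #include <cuda_runtime.h>

    constexpr int BINS = 4;

    __global__ void reduce_kernel(const int* data, int n, int* robj) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int nthreads = gridDim.x * blockDim.x;
        int* mine = robj + tid * BINS;            // this thread's private copy
        for (int i = tid; i < n; i += nthreads)   // strided reduction loop
            mine[data[i] % BINS] += 1;            // R(I) = R(I) op f(d)
    }

    int main() {
        const int n = 1 << 20, threads = 128, blocks = 32;
        const int nthreads = threads * blocks;
        int* h = new int[n];
        for (int i = 0; i < n; ++i) h[i] = i;
        int *d_data, *d_robj;
        cudaMalloc(&d_data, n * sizeof(int));
        cudaMalloc(&d_robj, nthreads * BINS * sizeof(int));
        cudaMemcpy(d_data, h, n * sizeof(int), cudaMemcpyHostToDevice);
        cudaMemset(d_robj, 0, nthreads * BINS * sizeof(int));
        reduce_kernel<<<blocks, threads>>>(d_data, n, d_robj);
        // Copy every per-thread copy back; combine globally on the host.
        int* copies = new int[nthreads * BINS];
        cudaMemcpy(copies, d_robj, nthreads * BINS * sizeof(int),
                   cudaMemcpyDeviceToHost);
        long total[BINS] = {0};
        for (int t = 0; t < nthreads; ++t)
            for (int b = 0; b < BINS; ++b) total[b] += copies[t * BINS + b];
        for (int b = 0; b < BINS; ++b) std::printf("bin %d: %ld\n", b, total[b]);
        cudaFree(d_data); cudaFree(d_robj);
        delete[] h; delete[] copies;
    }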
Generating CUDA Code and C++/C Code Invoking the Kernel Function
• Memory allocation and copy
• Thread grid configuration (block number and thread number)
• Global function
• Kernel reduction function
• Global combination
Optimizations
• Using shared memory
• Providing user-specified initialization functions and combination functions
• Specifying variables that are allocated once
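For the shared-memory optimization, a hedged sketch (again illustrative, not the system's generated code): a small reduction object is replicated per thread in on-chip shared memory, and each block combines its copies once before writing a single per-block result to global memory, which the host then combines.

    // Per-thread copies in shared memory; one combination per block.
    #include <cuda_runtime.h>

    constexpr int BINS = 4;

    __global__ void reduce_shared(const int* data, int n, int* block_robj) {
        extern __shared__ int s[];               // blockDim.x * BINS ints
        int* mine = s + threadIdx.x * BINS;      // private copy, on chip
        for (int b = 0; b < BINS; ++b) mine[b] = 0;
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int nthreads = gridDim.x * blockDim.x;
        for (int i = tid; i < n; i += nthreads)
            mine[data[i] % BINS] += 1;
        __syncthreads();
        if (threadIdx.x == 0) {                  // in-block combination
            for (int b = 0; b < BINS; ++b) {
                int sum = 0;
                for (int t = 0; t < blockDim.x; ++t) sum += s[t * BINS + b];
                block_robj[blockIdx.x * BINS + b] = sum;
            }
        }
    }

    // Launch with the dynamic shared-memory size set accordingly, e.g.:
    //   reduce_shared<<<blocks, threads, threads * BINS * sizeof(int)>>>(
    //       d_data, n, d_block_robj);
    // The host then combines the per-block results as in the previous sketch.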
Applications
• K-means clustering
• EM clustering
• PCA
K-means Results
[Chart omitted]
Speedups
[Charts: speedup of EM; speedup of PCA]
Summary
Data-intensive computing is of growing importance
One size doesn't fit all
Map-reduce has many limitations
Accelerators can be promising for achieving performance