HIPS10 - The Ohio State University


AUTO-GC: Automatic Translation of
Data Mining Applications to GPU
Clusters
Wenjing Ma, Gagan Agrawal
The Ohio State University
Outline
• Background of GPU and GPU cluster computing
• System Design
• Implementation
– System API
– Code analysis
– Generation of FREERIDE code
– Generation of CUDA programs
• Applications on GPU clusters
– K-means, PCA
• Future work
Background
• Multi-core and many-core architectures are becoming more
popular
• GPUs are inexpensive and fast
• CUDA is a high-level language that supports programming on GPUs
• It is common for clusters to have powerful GPGPUs
– Tianhe-1, #5 in Top 500 list
• Other systems with Heterogeneous Cores
– RoadRunner, #2 in Top 500 list
Complications of Programming on GPUs
• The user must have thorough knowledge of the GPU architecture and the CUDA programming model
• Must handle memory allocation and copying explicitly
• Must know what data should be copied into shared memory
• …
Programming GPU Clusters
• Need to combine CUDA with a distributed memory
programming model
• Hybrid programming models are not well developed today
• Many decisions for application development
– Partitioning data and computation between nodes
– Managing the CUDA memory hierarchy
Needs and Approach
• Need high-level approaches for programming
heterogeneous clusters
• Very challenging compilation / programming model
problem
• Domain-specific approaches can be feasible
– Support a restricted class of applications
Target Domain: Data-Intensive
Applications
• A lot of interest in large-scale data analysis
• Large clusters widely used
• Many data-intensive applications can be scaled
using GPUs
• Common structure can be exploited
Parallel Data Mining
• Common structure of parallel data mining implementations
• Based on our earlier work on the FREERIDE middleware

{ * Outer Sequential Loop * }
While () {
    { * Reduction Loop * }
    Foreach (element e) {
        (i, val) = process(e);
        Reduc(i) = Reduc(i) op val;
    }
    Finalize();
}
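
As a concrete illustration (not from the slides), a simple histogram fits this structure: process(e) maps an element to a bin index with a contribution of 1, and op is addition. A minimal C++ sketch, with all names hypothetical and elements assumed to lie in [0, 1):

    // Histogram as a generalized reduction (illustrative names)
    const int NBINS = 64;
    int Reduc[NBINS] = {0};            // the reduction object

    // process(e): map element e in [0,1) to (i, val)
    inline void process(float e, int &i, int &val) {
        i = (int)(e * NBINS);          // bin index
        if (i >= NBINS) i = NBINS - 1;
        val = 1;                       // contribution of one element
    }

    void reduction_loop(const float *data, int n) {
        for (int k = 0; k < n; k++) {
            int i, val;
            process(data[k], i, val);
            Reduc[i] = Reduc[i] + val; // op is "+"
        }
    }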
A Data-Intensive Computing Middleware: FREERIDE
• The reduction object represents the intermediate state of the execution
• The reduction function is commutative and associative
• Sorting and grouping overheads are eliminated by the reduction function/object
Architecture of Our Code
Generation System
Architecture (Summary)
• User input
• Code analyzer
– Analysis of variables (variable type and size)
– Analysis of reduction functions (sequential code from
the user)
• Code generator (generates FREERIDE and CUDA code)
Generalized Reductions on GPUs
• SIMD shared memory programming
• 3 steps involved in the main loop
– Data read
– Computing update
– Writing update
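
A minimal CUDA sketch of these three steps, assuming each thread updates its own copy of the reduction object (all names and the compute step are illustrative, not the actual generated code):

    __global__ void reduction_kernel(const float *data, int n, float *update) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int nthreads = gridDim.x * blockDim.x;
        for (int i = tid; i < n; i += nthreads) {
            float e = data[i];     // 1. data read
            float val = e * e;     // 2. computing update (placeholder op)
            update[tid] += val;    // 3. writing update to this thread's copy
        }
    }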
Parallelization on a Cluster
• Use FREERIDE-based processing
• Local reductions are further accelerated with GPUs

{ * Outer Sequential Loop * }
While () {
    { * Reduction Loop * }
    Foreach (element e) {
        (i, val) = process(e);
        Reduc(i) = Reduc(i) op val;   ← Reduction()
    }
    Finalize();                       ← Finalize()
}
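
The local reduction runs on each node, accelerated by the GPU, and the per-node reduction objects are then combined globally. FREERIDE performs this step inside the middleware; a hedged MPI sketch of the idea, assuming the combination operator is +:

    #include <mpi.h>

    // Combine each node's local reduction object across all nodes with "+".
    // In FREERIDE this happens inside the middleware, not in user code.
    void global_combine(float *reduc, int len) {
        MPI_Allreduce(MPI_IN_PLACE, reduc, len, MPI_FLOAT, MPI_SUM,
                      MPI_COMM_WORLD);
    }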
User Input
• Variables to be used in the reduction function
• A sequential reduction function
• A finalize function
• Optional functions (initialization function, combination function, …)
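
For k-means, the user input might look like the following sequential reduction function (a hypothetical sketch; K, DIM, the data layout, and the closest_center helper are all illustrative, not the system's actual API):

    #include <cfloat>
    const int K = 10, DIM = 3;         // clusters and dimensions (assumed)

    // Hypothetical helper: index of the center nearest to point p
    static int closest_center(const float *p, const float *centers) {
        int best = 0; float bd = FLT_MAX;
        for (int c = 0; c < K; c++) {
            float d = 0;
            for (int j = 0; j < DIM; j++) {
                float t = p[j] - centers[c * DIM + j];
                d += t * t;
            }
            if (d < bd) { bd = d; best = c; }
        }
        return best;
    }

    // Sequential reduction function supplied by the user; update is the
    // reduction object (per-cluster coordinate sums plus a point count)
    void reduction(const float *data, int n, const float *centers,
                   float *update) {
        for (int i = 0; i < n; i++) {             // the parallel loop
            int best = closest_center(&data[i * DIM], centers);
            for (int j = 0; j < DIM; j++)
                update[best * (DIM + 1) + j] += data[i * DIM + j];
            update[best * (DIM + 1) + DIM] += 1;  // point count
        }
    }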
Analysis of Sequential Code
• Obtaining variable access features
• Extracting the reduction objects and their combination
operations
• Analysis for parallelization
– Identify the parallel loop
– Determine the data to be replicated
– Calculate the size of shared memory to use and which data should be copied into it
Get the Operator for Combining
Reduction Objects
• All the output variables are denoted as reduction objects
• Combination of reduction objects
– Local combination at the end of CUDA function
– Global combination by MPI done within ADR
• Combination operators are also extracted by the code analyzer, using LLVM
• Currently, only + and * are supported
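
As an illustration (hypothetical fragments, not analyzer output), these are the kinds of update statements from which the operator can be inferred:

    // Update statements and the combination operator inferred for each
    // reduction object; only + and * are currently supported.
    void examples(float *update, int i, float val, float &prod, float f) {
        update[i] = update[i] + val;   // inferred op: +
        update[i] += val;              // same, +
        prod = prod * f;               // inferred op: *
    }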
Generating FREERIDE Code
• Three functions
– Initialization(): At the beginning of the execution
– Reduction(): Invoked for every data block
– Finalize(): At the end of every iteration
Generating FREERIDE Code
• Initialization()
  – Allocate memory for variables
    • Need to consider the copies for GPUs
• Reduction()

    void Reduction(void* data) {
        float* point = (float*) data;
        reduc(data, update, ...);          // invoke the CUDA function
        for (int i = 0; i < k * 5; i++)    // update the reduction objects
            reduction(0, i, update[i]);
    }

• Finalize()
  – Update local variables with the combined value of the reduction objects
  – Check the finishing condition
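
A hedged sketch of what a generated Finalize() could do for k-means: recompute the centers from the combined reduction object and test the finishing condition (K, DIM, and the reduction-object layout are assumptions, matching the earlier sketch):

    #include <math.h>
    const int K = 10, DIM = 3;

    // Returns true when the centers have stopped moving
    bool finalize(float *centers, const float *update, float threshold) {
        float moved = 0;
        for (int c = 0; c < K; c++) {
            float count = update[c * (DIM + 1) + DIM];
            for (int j = 0; j < DIM; j++) {
                float nc = (count > 0) ? update[c * (DIM + 1) + j] / count
                                       : centers[c * DIM + j];
                moved += fabsf(nc - centers[c * DIM + j]);
                centers[c * DIM + j] = nc;
            }
        }
        return moved < threshold;    // finishing condition
    }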
Generating CUDA Code
• Memory allocation and copy
• Global function
• Kernel reduction function
• Local combination
Memory Allocation and Copy
[Figure: memory allocation and copy schemes, with data laid out across threads T0 … T63]
A. (layout of data across threads)
B. Need a copy for each thread
C. Copy the updates back to host memory after the kernel reduction function returns
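
A hedged sketch of the generated allocation and copy code, assuming one device copy of the read-only input, one copy of the reduction object per thread (scheme B), and a copy-back after the kernel (scheme C); all names are illustrative:

    #include <cuda_runtime.h>

    void alloc_and_copy(const float *h_data, size_t n, size_t obj_len,
                        int nthreads, float **d_data, float **d_update) {
        // one device copy of the input data (read-only)
        cudaMalloc((void **)d_data, n * sizeof(float));
        cudaMemcpy(*d_data, h_data, n * sizeof(float),
                   cudaMemcpyHostToDevice);
        // one copy of the reduction object per thread (scheme B)
        cudaMalloc((void **)d_update, nthreads * obj_len * sizeof(float));
        cudaMemset(*d_update, 0, nthreads * obj_len * sizeof(float));
    }

    // after the kernel and local combination: copy the result back (scheme C)
    void copy_back(float *h_update, const float *d_update, size_t obj_len) {
        cudaMemcpy(h_update, d_update, obj_len * sizeof(float),
                   cudaMemcpyDeviceToHost);
    }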
Generated CUDA Functions
• Global function
  – Invokes the kernel
• Kernel reduction function
  – Generated out of the original sequential code
  – The main loop is divided by block_number and thread_number
  – Access offsets are replaced with appropriate indices
  – …
• Local combination
  – Assumes all updates from each thread are summed or multiplied
  – An automatically generated global combination function, invoked by 1 thread
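
A hedged sketch of these generated functions, assuming per-thread copies of the reduction object and a + operator; the process step and data layout are placeholders, not the real generated code:

    // Kernel reduction function: the main loop is divided across
    // block_number * thread_number threads
    __global__ void kernel_reduction(const float *data, int n,
                                     float *update, int obj_len) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int nthreads = gridDim.x * blockDim.x;
        for (int i = tid; i < n; i += nthreads) {
            float val = data[i];               // placeholder for process(e)
            update[tid * obj_len] += val;      // this thread's copy
        }
    }

    // Combination of per-thread copies, invoked by 1 thread after the
    // reduction kernel returns (launched as a separate kernel here)
    __global__ void local_combine(float *update, int obj_len, int nthreads) {
        if (blockIdx.x == 0 && threadIdx.x == 0)
            for (int t = 1; t < nthreads; t++)
                for (int j = 0; j < obj_len; j++)
                    update[j] += update[t * obj_len + j];   // op is "+"
    }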
Optimizations
• Using shared memory (see the sketch below)
• Providing user-specified initialization and combination functions
• Specifying variables that are allocated only once
• …
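
For example, the shared-memory optimization might stage frequently read, read-only data (such as the k-means centers) in shared memory; a minimal sketch with assumed sizes:

    const int K = 10, DIM = 3;     // assumed sizes

    __global__ void kernel_with_shared(const float *centers_global) {
        __shared__ float centers[K * DIM];
        // threads of a block cooperatively stage the centers
        for (int i = threadIdx.x; i < K * DIM; i += blockDim.x)
            centers[i] = centers_global[i];
        __syncthreads();
        // ... main reduction loop now reads centers from shared memory ...
    }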
Applications
• K-means clustering
– 16 blocks and 256 threads per block
• PCA
– 16 blocks and 128 threads per block
Experiment Results
[Figure: execution time in seconds for K-means (1.5 GB data) on 1, 2, 4, and 8 nodes, comparing CPU, GPU-automatic, and GPU-manual versions; y-axis 0–150 s]
Experiment Results
[Figure: execution time in seconds for PCA (64M rows, 3 principal components) on 1, 2, 4, and 8 nodes, comparing CPU and GPU-automatic versions; y-axis 0–30 s]
Experiment Results
[Figure: execution time in seconds for PCA (2M rows, 64 principal components) on 1, 2, 4, and 8 nodes, comparing CPU and GPU-automatic versions; y-axis 0–1900 s]
Summary
• High-level abstractions are feasible if we target
specific processing structures
• Promising initial results
– Many challenges remain in obtaining performance from GPU clusters
Future work
• Analysis of sequential code to support more operations for combining reduction objects
• Improve the shared memory allocation strategy
• Support automatic grid configuration
• Support multi-core CPUs and GPUs on the same computing node