幻灯片 1 - The Ohio State University

Download Report

Transcript 幻灯片 1 - The Ohio State University

Automatic Transformation of
Applications onto GPUs and GPU
Clusters
PhD Candidacy presentation: Wenjing Ma
Advisor: Dr Gagan Agrawal
The Ohio State University
Sep 11, 2009
Outline of Contents
• Introduction
• Accelerators, GPGPU and GPU cluster
• Difficulty of GPU programming
• Current Work
• Translation system for enabling data mining applications on
GPUs
• Automatic translation of data mining applications from
MATLAB to GPUs
• Automatic code generation for data mining on clusters with
GPU support
• Future work
• Optimize shared memory usage
• Support more categories of MATLAB programs
• Extend FREERIDE framework
Sep 11, 2009
Introduction
• Accelerators, GPGPU and GPU cluster
– Multi-core architectures are more and more
popular in high performance computing
– GPU, Cell Processor, FPGA
– GPU has good performance/price ratio
• Difficulty of Programming
– How to program a cluster with accelerators on
each node ?
Sep 11, 2009
Our Approach
• Provide high-level support for programming
emerging high-end configurations
• High-level language as input
• Target code for particular devices as output
• Focus on specific application classes
• Simplifies code generation
Sep 11, 2009
Specific Implementations
C/C++
GPU
GPU cluster
MATLAB
……
……
Sep 11, 2009
Outline of Contents
• Introduction
• Current Work
• Translation system for enabling data mining
applications on GPUs
• Automatic translation of data mining applications from
MATLAB to GPUs
• Automatic code generation for data mining on clusters
with GPU support
• Future work
• Optimize shared memory usage
• Support more categories of MATLAB programs
• Extend FREERIDE framework
Sep 11, 2009
Translation System for Enabling
Data Mining Applications on
GPUs
• Focus on data mining applications
• Common structure
• Translation system
• Simple user input
• Optimization for GPU execution
Sep 11, 2009
Complications of GPU Programming
• User has to have thoroughly understand the
architecture of GPU and the programming
model of the language
• Has to deal with the memory allocation
• Need to develop and optimize strategies for
data movement between memory hierarchies
(example is the shared memory in GPU)
Sep 11, 2009
Parallel Data Mining
• Common structure of data mining applications
(adapted from FREERIDE)
•
while() {
/* Reduction loop */
Foreach (element e) {
(i, val) = process(e);
Reduc(i) = Reduc(i) op val;
}
Finalize();
}
Sep 11, 2009
Basic Idea of Data Mining on GPU
• Parallel shared memory programming
• 3 steps involved in the main loop
• Reading data
• Computing update
• Writing back update
Sep 11, 2009
GREENRIDE: Translation System for Enabling
Data Mining Applications on GPUs
• User input
• Code analyzer
• Analysis of variables (variable type and size)
• Analysis of reduction functions (sequential code
from the user)
• Code Generator ( generating CUDA code and
C++ code invoking the kernel function)
• Optimization
Sep 11, 2009
Architecture of the System
User Input
Variable
information
Reduction
functions
Optional
functions
Variable
Analyze
r
Host Program
Variable
Access
Pattern and
Combination
Operations
Kernel functions
Code
Generato
r
Code
Analyzer( In
LLVM)
Data copy
and thread
grid
configuration
Executable
Sep 11, 2009
User Input
Variables to be used in the reduction
function
Value of variable / size of array
A sequential reduction function
Optional functions (initialization function,
combination function…)
Sep 11, 2009
Analysis of Sequential Code
• Identify the parallel loop
• One major loop is parallelized
• Get the access features of each variable
• Figure out the data to be replicated
• Get the operator for global combination
• Calculate the size of shared memory to use
and which data to be kept in shared memory
Sep 11, 2009
Get the Access Features for
Each Variable
• Use LLVM to generate the intermediate
representation
• Analyze the intermediate representation and
get the information of each pointer
• Trace the pointers used in “store” operations, which
are output pointers
• The other pointers in argument list are input
variables
• The pointers that don’t appear in the argument list
are temporary storage
Sep 11, 2009
Data to be Replicated
• Variables that need to be replicated for each
thread
• Variables to be written
• Different threads may access the same object
• Need combination at the end
Sep 11, 2009
Generate CUDA and C++/C Code
• Revisit Common structure of data mining
applications (adapted from FREERIDE)
•
while() {
/* Reduction loop */
Foreach (element e){
(i, val) = process(e);
Reduc(i) = Reduc(i) op val;
}
Finalize();
}
Sep 11, 2009
Generate CUDA and C++/C Code
• Memory allocation and copy
• Reduction functions and global combination
• Optimization
• Using shared memory on GPU
Sep 11, 2009
Memory Allocation and Copy
Device Memory allocation, copying between memory
hierarchies, replication for threads
……
……
A.
T61 T62 T63 T0 T1
T0 T1 T2 T3 T4
B.
……
T61 T62 T63 T0 T1
T0 T1 T2 T3 T4
Need replication for each thread
Sep 11, 2009
Reduction Functions
• Global function
• Invoke the kernel reduction, global combination function,
• Invoke user specified initialization and combination
functions, if any
• Kernel reduction function
• Generated out of the original sequential code
• Divide the main loop by block number and thread number
• Rewrite the access indices
• ……
• Global combination
Sep 11, 2009
Program Structure of the
Generated CUDA Code
• Host program
Copy_data(HOST_TO_DEVICE);
Device_reduct();
Copy_data(DEVICE_TO_HOST);
Combination();
• Device function:
Device_reduct()
{
Reduction();
__syncthreads();
Combination();
__syncthreads();
}
Sep 11, 2009
Optimizations
• Use shared memory on GPU ( almost “must”! )
• If provided, use user-specified initialization
functions and combination functions
• Specify variables that are allocated once and
reused by multiple invocations of the kernel
• …
Sep 11, 2009
Deal with Shared Memory
• Use simple heuristics
• Size = length * sizeof(type) * thread_info
– length: size of the array
– type: char, int, and float
– thread_info: whether it’s copied to each thread
• Sort the arrays according to Size
• Mark each array as shared until the size exceeds
the limit of shared memory
Sep 11, 2009
Other Optimizations
• Reduce memory allocation and copy overhead
• Arrays shared by multiple iterations can be allocated
and copied only once
• User defined combination function
• Possibly reduce the amount of data to combine
Sep 11, 2009
Experiments
• Applications: popular data mining applications
• K-means clustering
• EM clustering
• PCA
• Compare performance of manual CUDA code
and automatically generated code
• The effect of optimization, especially the use of
shared memory
Sep 11, 2009
Experiment Results (K-means)
manual-computing
manual-computing with copy
automatic-computing
automatic-computing with copy
thread_number * block_number
256*256
256*32
256*8
256*2
256*1
0
Results on GeForce 8800GTX
Sep 11, 2009
thread_number * block_number
256*256
10
256*32
20
256*8
30
256*2
40
256*1
50
40
35
30
25
20
15
10
5
0
64*1
Speedup over CPU sequential
version
60
64*1
Speedup over CPU sequential
version
manual-computing
manual-computing with copy
automatic-computing
automatic-computing with copy
Results on GeForce 9800GX2
Experiment Results (EM)
automatic-computing
automatic-computing with copy
20
15
10
5
Sep 11, 2009
256*256
thread_number * block_number
256*64
256*32
256*16
256*8
256*4
256*2
256*1
128*1
0
64*1
Speedup over CPU sequential version
manual-computing
manual-computing with copy
Experiment Results (PCA)
25
20
15
10
5
thread_number * block_number
Sep 11, 2009
128*64
128*32
128*16
128*8
128*4
128*2
256*1
128*1s
0
64*1
Speedup over CPU
sequential version
computing
computing with copy
optimized computing
optimized computing with copy
Outline of Contents
• Introduction
• Current Work
• Translation system for enabling data mining
applications on GPUs
• Automatic translation of data mining applications from
MATLAB to GPUs
• Automatic code generation for data mining on clusters
with GPU support
• Future work
• Optimize shared memory usage
• Support more categories of MATLAB programs
• Extend FREERIDE framework
Sep 11, 2009
GMAT-DM: Automatic
Transformation from MATLAB for
GPUs
MATLAB
code
OCTAVE
parser
C code
GREENRIDE
GMAT-DM
CUDA
code
Sep 11, 2009
Main Issues in GMAT-DM
• Matrix Manipulation
• Inner representation as 1-d arrays
• Type and shape information based on user input
• Basic problem: matrix operation translation
• Matrix Multiplication Chain
• Function combination
• Other optimizations
Sep 11, 2009
Matrix Multiplication Chain
A*B*C
A:6*100; B: 100*2; C: 2*1
1. (A*B)*C
Instruction count: 1212
Intermediate storage: 12
2. A*(B*C)
Instruction count: 800
Intermediate storage: 100
Use weighted metrics to determine the
multiplication chain
Sep 11, 2009
Function Combination
for (i = 0; i < maxRow; i++)
for (j = 0; j < k; j++)
{
/* code to compute distances */
d[i][j] = sqrt(distances);
}
for(int i1 = 0; i1<maxRow; i1++)
{
z[i1] = 0;
g[i1] = MAX_VALUE;
for(int i2 = 0; i2 < k; i2++)
{
if(z[i1] > d[i1][i2])
{
z[i1] = d[i1][i2];
g[i1] = i2;
}
}
}
float temp 0, temp z;
int temp g;
for (i = 0; i < maxRow; i++)
{
temp_z = MAX_VALUE;
temp_g = 0;
for (j = 0; j < k; j++)
{
/* code to compute distances */
temp_0 = sqrt(distances);
if(temp_z > temp_0)
{
temp_z = temp_0;
temp_g = j;
}
}
z[i] = temp_z;
g[i] = temp_g;
}
Sep 11, 2009
Other Optimizations
• Avoid memory allocation overhead – use index
transformation
• Row extraction
• Matrix transposition
• Rename when a single element is accessed
through many iterations
Sep 11, 2009
Experiment Results (K-means)
C
unoptimized
optimized
C with copy
unoptimized with copy
optimized with copy
35
30
20
15
10
5
thread_number * block_number
Sep 11, 2009
256*256
256*64
256*32
256*16
256*8
256*4
256*2
512*1
256*1
128*1
0
64*1
Speedup
25
Experiment Results (EM)
C
optimized-1
optimized-2-with-copy
Sep 11, 2009
256*256
256*64
256*32
thre ad_numbe r * block_numbe r
M phase
E phase
256*16
256*8
256*4
256*2
256*1
128*1
speedup
256*256
256*64
256*32
256*16
256*8
256*4
256*2
128*1
256*1
thre ad_numbe r * block_numbe r
unoptimized
optimized-2
20
18
16
14
12
10
8
6
4
2
0
64*1
unoptimized
optimized-with-copy
20
18
16
14
12
10
8
6
4
2
0
64*1
speedup
C
optimized
Experiment Results (PCA)
MATLAB
MATLAB-with copy
C
C-with copy
25
Speedup
20
15
10
5
thread_number * block_number
Sep 11, 2009
128*64
128*32
128*16
128*8
128*4
128*2
128*1
0
Outline of Contents
• Introduction
• Current Work
• Translation system for enabling data mining
applications on GPUs
• Automatic translation of data mining applications from
MATLAB to GPUs
• Automatic code generation for data mining on clusters
with GPU support
• Future work
• Optimize shared memory usage
• Support more categories of MATLAB programs
• Extend FREERIDE framework
Sep 11, 2009
AUTO-GC: Automatic Code
Generation for FREERIDE with GPU
Support
 Revisit the common structure of data mining
applications (adapted from FREERIDE)
While () {
{ * Reduction Loop * }
Foreach (element e) {
(i,val) = process(e);
Reduc(i) = Reduc(i) op val;
}
Finalize();
}
Sep 11, 2009
GPU
Code Generation System
Sep 11, 2009
Experiment Results (K-means)
1.5GB data set
3GB data set
Sep 11, 2009
Experiment Results (PCA)
64M rows, 4 components
2M rows, 64 components
Sep 11, 2009
Outline of Contents
• Introduction
• Current Work
• Translation system for enabling data mining
applications on GPUs
• Automatic translation of data mining applications from
MATLAB to GPUs
• Automatic code generation for data mining on clusters
with GPU support
• Future work
• Optimize shared memory usage
• Support more categories of MATLAB programs
• Extend FREERIDE framework
Sep 11, 2009
Optimize Shared Memory Usage
• Use constant and texture memory
• cached and read-only
• Integer programming model
• Loop transformation
Sep 11, 2009
Integer Programming Model
Variables:
x[i][j], 0 ≤ i ≤ m, 0 ≤ j ≤ n, whether variable i is kept
in shared memory at loop j
Objective function:
maximize
z = Σi∈{1..m}, j∈{1..n}(Acc[i][j] − 1) ∗ x[i][j] ∗Wp[i][j];
Constraints:
For i in 1..m:
Σj∈1..nx[i][j] ∗W[i][j] ∗ Sizep[i][j] ≤ S;
For j in 1..n:
For i in 1..nLive[j]:
x[live[j][i].start][j] = x[live[j][i].start + 1][j] = ...
= x[live[j][i].end][j];
Sep 11, 2009
Loop Transformation
• Loop Fusion
– Reduce intermediate memory
– For loops with the same size, and variable[k] is
temporary storage
Loop 1 (size[1] = n): Active[1][k]=1 ......
Loop 2 (size[2] = n): Active[2][k]=1 ......
Sep 11, 2009
Loop Transformation
• Loop Fission
– Fit more data in shared memory
– Should not violate data dependency
1. for (int r = 0; r < n; r++) //Loop 1
2. {
3. · · ·
4. for(int c = 0; c < m; c + +) //Loop 2
5.
i[c]· · · ;
6. for(int c = 0; c < m; c + +) //Loop 3
7.
j[c]· · · ;
8. · · ·
9. }
Sep 11, 2009
Loop Transformation
• Loop Switching
– Sort all the loops in decreasing order of
iteration numbers
– Enable better usage of shared memory
– Should not violate dependency
Sep 11, 2009
Support More Categories of
MATLAB Programs
• Include library functions and determining thread
grid configuration
• Deal with sparse matrix computation
• Investigate basic functions in SPM library and
parallelization of ICA
Sep 11, 2009
Include Library Functions and
Determining Thread Grid
Configuration
• 2-D blocking or 1-D blocking
• Use library functions
Depending on the input data
Might need separating kernels to
use different configurations
Need a model to evaluate the
overhead of multiple kernels
Sep 11, 2009
Deal with Sparse Matrix
Computation
• Represent sparse matrix in GPU
• Coordinate representation
• Diagonal representation (easy to be combined with
computing context)
• Might use Sparse Accelerator (evaluate efficiency of
shared memory usage)
• Representation conversion
• Better to be done on CPU
Sep 11, 2009
Investigate Basic Functions in SPM
Library and Parallelization of ICA
• Enrich GMAT-DM
• Provide GPU support for more functions and
operations in MATLAB, especially sparse matrix
computation
• Study possible optimization to communication
and synchronization for SPM and ICA
• For clusters or GPU, depending on my progress
and time
Sep 11, 2009
Extend FREERIDE Framework
• Convert MATLAB programs into C++ and CUDA
code in the FREERIDE framework
– Based on GMAT-DM and AUTO-GC
• Cost model to determine device configuration
cost[i] = (mem[i] ∗LATENCY + comp[i] + shared[i]) ∗
Iter[i]+shared[i] ∗LATENCY
total_cost =∑mi=1 cost[i]/number_iterations
cost_combine =∑i∈output size[i]*total_threads
If total_cost + cost_combine > T Then use GPU
Else use CPU
Sep 11, 2009
Conclusion
• Contributions
– GREENRIDE: Translating Data Mining Applications onto
GPUs
– GMAT-DM: Automatic Translation of Data Mining
Applications from MATLAB to GPUs
– AUTO-GC: Automatic Code Generation for FREERIDE with
GPU Support
• Proposed work
– Optimize shared memory usage
– Support more categories of MATLAB programs
– Extend FREERIDE framework
Sep 11, 2009
Thank you !
• Questions?
Sep 11, 2009