#### Transcript 幻灯片 1 - The Ohio State University

Automatic Transformation of Applications onto GPUs and GPU Clusters PhD Candidacy presentation: Wenjing Ma Advisor: Dr Gagan Agrawal The Ohio State University Sep 11, 2009 Outline of Contents • Introduction • Accelerators, GPGPU and GPU cluster • Difficulty of GPU programming • Current Work • Translation system for enabling data mining applications on GPUs • Automatic translation of data mining applications from MATLAB to GPUs • Automatic code generation for data mining on clusters with GPU support • Future work • Optimize shared memory usage • Support more categories of MATLAB programs • Extend FREERIDE framework Sep 11, 2009 Introduction • Accelerators, GPGPU and GPU cluster – Multi-core architectures are more and more popular in high performance computing – GPU, Cell Processor, FPGA – GPU has good performance/price ratio • Difficulty of Programming – How to program a cluster with accelerators on each node ? Sep 11, 2009 Our Approach • Provide high-level support for programming emerging high-end configurations • High-level language as input • Target code for particular devices as output • Focus on specific application classes • Simplifies code generation Sep 11, 2009 Specific Implementations C/C++ GPU GPU cluster MATLAB …… …… Sep 11, 2009 Outline of Contents • Introduction • Current Work • Translation system for enabling data mining applications on GPUs • Automatic translation of data mining applications from MATLAB to GPUs • Automatic code generation for data mining on clusters with GPU support • Future work • Optimize shared memory usage • Support more categories of MATLAB programs • Extend FREERIDE framework Sep 11, 2009 Translation System for Enabling Data Mining Applications on GPUs • Focus on data mining applications • Common structure • Translation system • Simple user input • Optimization for GPU execution Sep 11, 2009 Complications of GPU Programming • User has to have thoroughly understand the architecture of GPU and the programming model of the language • Has to deal with the memory allocation • Need to develop and optimize strategies for data movement between memory hierarchies (example is the shared memory in GPU) Sep 11, 2009 Parallel Data Mining • Common structure of data mining applications (adapted from FREERIDE) • while() { /* Reduction loop */ Foreach (element e) { (i, val) = process(e); Reduc(i) = Reduc(i) op val; } Finalize(); } Sep 11, 2009 Basic Idea of Data Mining on GPU • Parallel shared memory programming • 3 steps involved in the main loop • Reading data • Computing update • Writing back update Sep 11, 2009 GREENRIDE: Translation System for Enabling Data Mining Applications on GPUs • User input • Code analyzer • Analysis of variables (variable type and size) • Analysis of reduction functions (sequential code from the user) • Code Generator ( generating CUDA code and C++ code invoking the kernel function) • Optimization Sep 11, 2009 Architecture of the System User Input Variable information Reduction functions Optional functions Variable Analyze r Host Program Variable Access Pattern and Combination Operations Kernel functions Code Generato r Code Analyzer( In LLVM) Data copy and thread grid configuration Executable Sep 11, 2009 User Input Variables to be used in the reduction function Value of variable / size of array A sequential reduction function Optional functions (initialization function, combination function…) Sep 11, 2009 Analysis of Sequential Code • Identify the parallel loop • One major loop is parallelized • Get the access features of each variable • Figure out the data to be replicated • Get the operator for global combination • Calculate the size of shared memory to use and which data to be kept in shared memory Sep 11, 2009 Get the Access Features for Each Variable • Use LLVM to generate the intermediate representation • Analyze the intermediate representation and get the information of each pointer • Trace the pointers used in “store” operations, which are output pointers • The other pointers in argument list are input variables • The pointers that don’t appear in the argument list are temporary storage Sep 11, 2009 Data to be Replicated • Variables that need to be replicated for each thread • Variables to be written • Different threads may access the same object • Need combination at the end Sep 11, 2009 Generate CUDA and C++/C Code • Revisit Common structure of data mining applications (adapted from FREERIDE) • while() { /* Reduction loop */ Foreach (element e){ (i, val) = process(e); Reduc(i) = Reduc(i) op val; } Finalize(); } Sep 11, 2009 Generate CUDA and C++/C Code • Memory allocation and copy • Reduction functions and global combination • Optimization • Using shared memory on GPU Sep 11, 2009 Memory Allocation and Copy Device Memory allocation, copying between memory hierarchies, replication for threads …… …… A. T61 T62 T63 T0 T1 T0 T1 T2 T3 T4 B. …… T61 T62 T63 T0 T1 T0 T1 T2 T3 T4 Need replication for each thread Sep 11, 2009 Reduction Functions • Global function • Invoke the kernel reduction, global combination function, • Invoke user specified initialization and combination functions, if any • Kernel reduction function • Generated out of the original sequential code • Divide the main loop by block number and thread number • Rewrite the access indices • …… • Global combination Sep 11, 2009 Program Structure of the Generated CUDA Code • Host program Copy_data(HOST_TO_DEVICE); Device_reduct(); Copy_data(DEVICE_TO_HOST); Combination(); • Device function: Device_reduct() { Reduction(); __syncthreads(); Combination(); __syncthreads(); } Sep 11, 2009 Optimizations • Use shared memory on GPU ( almost “must”! ) • If provided, use user-specified initialization functions and combination functions • Specify variables that are allocated once and reused by multiple invocations of the kernel • … Sep 11, 2009 Deal with Shared Memory • Use simple heuristics • Size = length * sizeof(type) * thread_info – length: size of the array – type: char, int, and float – thread_info: whether it’s copied to each thread • Sort the arrays according to Size • Mark each array as shared until the size exceeds the limit of shared memory Sep 11, 2009 Other Optimizations • Reduce memory allocation and copy overhead • Arrays shared by multiple iterations can be allocated and copied only once • User defined combination function • Possibly reduce the amount of data to combine Sep 11, 2009 Experiments • Applications: popular data mining applications • K-means clustering • EM clustering • PCA • Compare performance of manual CUDA code and automatically generated code • The effect of optimization, especially the use of shared memory Sep 11, 2009 Experiment Results (K-means) manual-computing manual-computing with copy automatic-computing automatic-computing with copy thread_number * block_number 256*256 256*32 256*8 256*2 256*1 0 Results on GeForce 8800GTX Sep 11, 2009 thread_number * block_number 256*256 10 256*32 20 256*8 30 256*2 40 256*1 50 40 35 30 25 20 15 10 5 0 64*1 Speedup over CPU sequential version 60 64*1 Speedup over CPU sequential version manual-computing manual-computing with copy automatic-computing automatic-computing with copy Results on GeForce 9800GX2 Experiment Results (EM) automatic-computing automatic-computing with copy 20 15 10 5 Sep 11, 2009 256*256 thread_number * block_number 256*64 256*32 256*16 256*8 256*4 256*2 256*1 128*1 0 64*1 Speedup over CPU sequential version manual-computing manual-computing with copy Experiment Results (PCA) 25 20 15 10 5 thread_number * block_number Sep 11, 2009 128*64 128*32 128*16 128*8 128*4 128*2 256*1 128*1s 0 64*1 Speedup over CPU sequential version computing computing with copy optimized computing optimized computing with copy Outline of Contents • Introduction • Current Work • Translation system for enabling data mining applications on GPUs • Automatic translation of data mining applications from MATLAB to GPUs • Automatic code generation for data mining on clusters with GPU support • Future work • Optimize shared memory usage • Support more categories of MATLAB programs • Extend FREERIDE framework Sep 11, 2009 GMAT-DM: Automatic Transformation from MATLAB for GPUs MATLAB code OCTAVE parser C code GREENRIDE GMAT-DM CUDA code Sep 11, 2009 Main Issues in GMAT-DM • Matrix Manipulation • Inner representation as 1-d arrays • Type and shape information based on user input • Basic problem: matrix operation translation • Matrix Multiplication Chain • Function combination • Other optimizations Sep 11, 2009 Matrix Multiplication Chain A*B*C A:6*100; B: 100*2; C: 2*1 1. (A*B)*C Instruction count: 1212 Intermediate storage: 12 2. A*(B*C) Instruction count: 800 Intermediate storage: 100 Use weighted metrics to determine the multiplication chain Sep 11, 2009 Function Combination for (i = 0; i < maxRow; i++) for (j = 0; j < k; j++) { /* code to compute distances */ d[i][j] = sqrt(distances); } for(int i1 = 0; i1<maxRow; i1++) { z[i1] = 0; g[i1] = MAX_VALUE; for(int i2 = 0; i2 < k; i2++) { if(z[i1] > d[i1][i2]) { z[i1] = d[i1][i2]; g[i1] = i2; } } } float temp 0, temp z; int temp g; for (i = 0; i < maxRow; i++) { temp_z = MAX_VALUE; temp_g = 0; for (j = 0; j < k; j++) { /* code to compute distances */ temp_0 = sqrt(distances); if(temp_z > temp_0) { temp_z = temp_0; temp_g = j; } } z[i] = temp_z; g[i] = temp_g; } Sep 11, 2009 Other Optimizations • Avoid memory allocation overhead – use index transformation • Row extraction • Matrix transposition • Rename when a single element is accessed through many iterations Sep 11, 2009 Experiment Results (K-means) C unoptimized optimized C with copy unoptimized with copy optimized with copy 35 30 20 15 10 5 thread_number * block_number Sep 11, 2009 256*256 256*64 256*32 256*16 256*8 256*4 256*2 512*1 256*1 128*1 0 64*1 Speedup 25 Experiment Results (EM) C optimized-1 optimized-2-with-copy Sep 11, 2009 256*256 256*64 256*32 thre ad_numbe r * block_numbe r M phase E phase 256*16 256*8 256*4 256*2 256*1 128*1 speedup 256*256 256*64 256*32 256*16 256*8 256*4 256*2 128*1 256*1 thre ad_numbe r * block_numbe r unoptimized optimized-2 20 18 16 14 12 10 8 6 4 2 0 64*1 unoptimized optimized-with-copy 20 18 16 14 12 10 8 6 4 2 0 64*1 speedup C optimized Experiment Results (PCA) MATLAB MATLAB-with copy C C-with copy 25 Speedup 20 15 10 5 thread_number * block_number Sep 11, 2009 128*64 128*32 128*16 128*8 128*4 128*2 128*1 0 Outline of Contents • Introduction • Current Work • Translation system for enabling data mining applications on GPUs • Automatic translation of data mining applications from MATLAB to GPUs • Automatic code generation for data mining on clusters with GPU support • Future work • Optimize shared memory usage • Support more categories of MATLAB programs • Extend FREERIDE framework Sep 11, 2009 AUTO-GC: Automatic Code Generation for FREERIDE with GPU Support Revisit the common structure of data mining applications (adapted from FREERIDE) While () { { * Reduction Loop * } Foreach (element e) { (i,val) = process(e); Reduc(i) = Reduc(i) op val; } Finalize(); } Sep 11, 2009 GPU Code Generation System Sep 11, 2009 Experiment Results (K-means) 1.5GB data set 3GB data set Sep 11, 2009 Experiment Results (PCA) 64M rows, 4 components 2M rows, 64 components Sep 11, 2009 Outline of Contents • Introduction • Current Work • Translation system for enabling data mining applications on GPUs • Automatic translation of data mining applications from MATLAB to GPUs • Automatic code generation for data mining on clusters with GPU support • Future work • Optimize shared memory usage • Support more categories of MATLAB programs • Extend FREERIDE framework Sep 11, 2009 Optimize Shared Memory Usage • Use constant and texture memory • cached and read-only • Integer programming model • Loop transformation Sep 11, 2009 Integer Programming Model Variables: x[i][j], 0 ≤ i ≤ m, 0 ≤ j ≤ n, whether variable i is kept in shared memory at loop j Objective function: maximize z = Σi∈{1..m}, j∈{1..n}(Acc[i][j] − 1) ∗ x[i][j] ∗Wp[i][j]; Constraints: For i in 1..m: Σj∈1..nx[i][j] ∗W[i][j] ∗ Sizep[i][j] ≤ S; For j in 1..n: For i in 1..nLive[j]: x[live[j][i].start][j] = x[live[j][i].start + 1][j] = ... = x[live[j][i].end][j]; Sep 11, 2009 Loop Transformation • Loop Fusion – Reduce intermediate memory – For loops with the same size, and variable[k] is temporary storage Loop 1 (size[1] = n): Active[1][k]=1 ...... Loop 2 (size[2] = n): Active[2][k]=1 ...... Sep 11, 2009 Loop Transformation • Loop Fission – Fit more data in shared memory – Should not violate data dependency 1. for (int r = 0; r < n; r++) //Loop 1 2. { 3. · · · 4. for(int c = 0; c < m; c + +) //Loop 2 5. i[c]· · · ; 6. for(int c = 0; c < m; c + +) //Loop 3 7. j[c]· · · ; 8. · · · 9. } Sep 11, 2009 Loop Transformation • Loop Switching – Sort all the loops in decreasing order of iteration numbers – Enable better usage of shared memory – Should not violate dependency Sep 11, 2009 Support More Categories of MATLAB Programs • Include library functions and determining thread grid configuration • Deal with sparse matrix computation • Investigate basic functions in SPM library and parallelization of ICA Sep 11, 2009 Include Library Functions and Determining Thread Grid Configuration • 2-D blocking or 1-D blocking • Use library functions Depending on the input data Might need separating kernels to use different configurations Need a model to evaluate the overhead of multiple kernels Sep 11, 2009 Deal with Sparse Matrix Computation • Represent sparse matrix in GPU • Coordinate representation • Diagonal representation (easy to be combined with computing context) • Might use Sparse Accelerator (evaluate efficiency of shared memory usage) • Representation conversion • Better to be done on CPU Sep 11, 2009 Investigate Basic Functions in SPM Library and Parallelization of ICA • Enrich GMAT-DM • Provide GPU support for more functions and operations in MATLAB, especially sparse matrix computation • Study possible optimization to communication and synchronization for SPM and ICA • For clusters or GPU, depending on my progress and time Sep 11, 2009 Extend FREERIDE Framework • Convert MATLAB programs into C++ and CUDA code in the FREERIDE framework – Based on GMAT-DM and AUTO-GC • Cost model to determine device configuration cost[i] = (mem[i] ∗LATENCY + comp[i] + shared[i]) ∗ Iter[i]+shared[i] ∗LATENCY total_cost =∑mi=1 cost[i]/number_iterations cost_combine =∑i∈output size[i]*total_threads If total_cost + cost_combine > T Then use GPU Else use CPU Sep 11, 2009 Conclusion • Contributions – GREENRIDE: Translating Data Mining Applications onto GPUs – GMAT-DM: Automatic Translation of Data Mining Applications from MATLAB to GPUs – AUTO-GC: Automatic Code Generation for FREERIDE with GPU Support • Proposed work – Optimize shared memory usage – Support more categories of MATLAB programs – Extend FREERIDE framework Sep 11, 2009 Thank you ! • Questions? Sep 11, 2009