GPU-dm-wenjing

Download Report

Transcript GPU-dm-wenjing

Exploiting Computing
Power of GPU for Data
Mining Application
Wenjing Ma, Leonid Glimcher,
Gagan Agrawal
Outline of contents




Background of GPU computing
Parallel data mining
Challenges of data mining on GPU
GPU implementation





Experiment results



k-means
EM
kNN
Apriori
Results of kmeans and EM
Features of applications that are suitable for
GPU computing
Related and future work
Background of GPU computing



Multi-core architectures are
becoming more popular in high
performance computing
GPU is inexpensive and fast
CUDA is a high level language
that supports programming on
GPU
CUDA functions

Host function


Global function


Called by host and executed on
host
Called by host and executed on
device
Device function

Called by device and executed on
device
Architecture of GeForce 8800
GPU (1 multiprocessor)
Parallel data mining
Common structure of data
mining applications (adopted
from Freeride)

{ * Outer Sequential Loop * }
While () {
{ * Reduction Loop * }
Foreach (element e) {
(i,val) = process(e);
Reduc(i) = Reduc(i) op val;
}
}
Challenges of data mining on
GPU


SIMD shared memory
programming
3 steps involved in the main
loop
Data read
 Computing update
 Writing update

Computing update
copy common variables from device memory to
shared memory
nBlocks = blockSize/ thread number
For i=1 to nBlocks
{
each thread process 1 data element
}
Global reduction
GPU Implementation

k-means
Data are points (say, 3 dimension)
 Start with k clusters
 Find the nearest cluster for each
point
 determine the k centroids from the
points assigned to the
corresponding center
 Repeat until the assignments of
points don’t change

GPU version of kmeans
Device function:
Shared_memory center
nBlocks = blockSize / thread_number
tid = thread_ID
For i = 1 to nBlocks
min = 0;
For j = 1 to k
dis = distance(data[tid], center[j])
if (dis < min)
min = dis
min index = i
update[tid][min index] (data[tid],dis)
Thread 0 combines all copies of update
Other applications

EM


E step and M step, different
amount of computation
Apriori
Tree-structured reduction objects
 Large amount of updates


kNN
Experiment results


k-means and EM has the best
performance when using 512
threads/block and 16 or 32
thread blocks
kNN and apriori hardly get good
speedup with GPU
1600B, 512T
160B, 512T
mem copy
32B, 512T
16B, 512T
8B, 512T
4B, 512T
file copy
2B, 512T
1B, 512T
1B, 256T
1B, 128T
1B, 64T
CPU-Seq
seconds
k-means
(10MB points)
14
12
computing
10
8
6
4
2
0
1600B, 512T
160B, 512T
mem copy
32B, 512T
16B, 512T
8B, 512T
4B, 512T
file copy
2B, 512T
1B, 512T
1B, 256T
1B, 128T
1B, 64T
CPU-Seq
seconds
k-means (continued)
(20MB points)
30
computing
25
20
15
10
5
0
EM (continued)
(512K points)
25
file copy
E mem copy
E computing
M computing
M mem copy
20
Seconds
15
10
5
4B, 256T
2B, 256T
1B, 256T
1B, 128T
1B, 64T
CPU-seq
0
EM (continued)
(1M points)
25
file copy
E mem copy
E computing
M computing
M mem copy
20
Seconds
15
10
5
4B, 256T
2B, 256T
1B, 256T
1B, 128T
1B, 64T
CPU-seq
0
Features of applications that are
suitable for GPU computing



the time spent on processing the
data must dominate the I/O cost
the size of the reduction object
needs to be small enough to
have a replica for each thread in
device memory
using the shared memory to
store frequently accessed data
the time spent on processing the
data must dominate the I/O cost
I/O
computing

the size of the reduction object
needs to be small enough to
have a replica for each thread in
device memory
No locking mechanism on GPU
 The access to the reductionobjects
are unpredictable


using the shared memory to
store frequently accessed data
Accessing device memory is very
time consuming
 Shared memory serves as a high
speed cache
 For non-read-only data elements
on shared memory, we also need
replica for each thread

Related work



Freeride
Other GPU computing
languages
The usage of GPU computation
in scientific computing
Future work



Middleware for data mining on
GPU
Provide some compilation
mechanism for data mining
applications on MATLAB
Enable tuning of parameters
that can optimize GPU
computing

Thank you! Questions?