hipc_presentationx - web.iiit.ac.in

Download Report

Transcript hipc_presentationx - web.iiit.ac.in

Scalable Clustering using
Multiple GPUs
IIIT Hyderabad
K Wasif Mohiuddin
P J Narayanan
Center for Visual Information Technology
International Institute of Information Technology (IIIT)
Hyderabad
Introduction
• Classification of data desired for
meaningful representation.
• Data in subset ideally shares common traits.
• Unsupervised learning for finding hidden
structure.
• Application in data mining, computer vision
IIIT Hyderabad
with
– Image Classification
– Document Retrieval
• Simple K-Means algorithm
December 20, 2011
HiPC - 2011
2
IIIT Hyderabad
Need for High Performance Clustering
• Time Complexity of O(ndk+1 log n) where n- input
vectors, d- dimension, k-centers
• A fast, efficient clustering implementation is
needed to deal with large data, high
dimensionality and large centers.
• In computer vision, 128-dim SIFT and 512-dim
(GIST) are common. Features can run into several
millions
• Bag of Words for Vocabulary generation using
SIFT [Lowe IJCV04] vectors
December 20, 2011
HiPC - 2011
3
Challenges and Contributions
• Data: Storage of data format for quick and
repeated access.
• Computational: O(ndk+1 log n)
IIIT Hyderabad
• Contributions: A complete GPU based
implementation with
–
–
–
–
Exploitation of intra-vector parallelism
Efficient Mean evaluation
Data Organization
Multi GPU framework
December 20, 2011
HiPC - 2011
4
Related Work
• General Improvements
– KD-trees [Moor et al, SIGKKD-1999
– Triangle Inequality [Elkan, ICML-2003]
IIIT Hyderabad
• Pre CUDA GPU Efforts Improvements
– Fragment Shader [Hart et al, SIGGRAPH2004]
December 20, 2011
HiPC - 2011
5
Related Work (cont)
• Recent GPU efforts
IIIT Hyderabad
– Mean on CPU [Che et al, JPDC-2008]
– Mean on CPU + GPU [Hong et al, WCCSIE2009]
– GPU Miner [Ren et al, HP Labs-2009]
– HPK-Means [Wu et al, UCHPC-2009]
– Divide & Rule [Li et al, ICCIT-2010]
• Parallelism not exploited within data object.
• Lacking efficiency in Mean evaluation on
GPU.
• Proposed techniques are parameter dependant.
December 20, 2011
HiPC - 2011
6
K-Means
• Objective Function
∑i∑j‖xi(j) -cj‖2
1≤ i ≤ n, 1≤ j ≤ k
• Euclidean distance : L2 norm
• Steps:
IIIT Hyderabad
– Membership Evaluation
– New Mean Evaluation
– Convergence
December 20, 2011
HiPC - 2011
7
Algorithm
IIIT Hyderabad
• K random centers are initially chosen from
input.
• Partitions data into k clusters
• Observation belongs to the cluster with the
nearest mean.
• Re-evaluate the new centers & continue the
process till convergence is attained.
December 20, 2011
HiPC - 2011
8
IIIT Hyderabad
K-Means on GPU
Membership Evaluation
• Involves Distance and Minima evaluation.
• Single thread per component of vector
• Parallel computation done on ‘d’
components of input and center vectors
stored in row major format.
• Log summation for distance evaluation.
• For each input vector we traverse across all
centers stored in L2 cache.
December 20, 2011
HiPC - 2011
9
IIIT Hyderabad
K-Means on GPU (Cont)
Membership Evaluation
• Data objects stored in
row major format
• Provides coalesced
access
• Distance evaluation
using shared memory.
• Root finding avoided
December 20, 2011
HiPC - 2011
10
K-Means on GPU (Cont)
• Mean Evaluation Issues
IIIT Hyderabad
– Data rearrangement on CPU as per membership
is time consuming.
– Concurrent writes
– Random reads and writes
– Non uniform distribution of labels for data
objects.
December 20, 2011
HiPC - 2011
11
IIIT Hyderabad
Mean Evaluation on GPU
• Store labels and index in 64 bit records
• Group data objects with same membership
using Splitsort operation.
• We split using labels as key
• Gather primitive used to rearrange input in
order of labels.
• Sorted global index of input vectors is
generated.
Splitsort : Suryakant & Narayanan IIITH, TR 2009
December 20, 2011
HiPC - 2011
12
IIIT Hyderabad
Splitsort & Transpose Operation
December 20, 2011
HiPC - 2011
13
IIIT Hyderabad
Mean Evaluation on GPU (cont)
• Row major storage of vectors enabled
coalesced access.
• CUDPP segmented scan followed by
compact operation for histogram count.
• Transpose operation before rearranging
input vectors.
• Using segmented scan again we evaluated
mean of rearranged vectors as per labels.
December 20, 2011
HiPC - 2011
14
Implementation Details
• Tesla
– 2 vectors per block , 2 centers at a time
– Centers accessed via shared memory
IIIT Hyderabad
• Fermi
– 2 vectors per block, 4 centers at a time
– Centers accessed via global memory using L2
cache
– More shared memory for distance evaluation
• Occupancy of 83% using 5136 KB of shared
memory in case of Fermi.
December 20, 2011
HiPC - 2011
15
IIIT Hyderabad
ISSUES
• Too many distance evaluations
• Convergence highly dependent on cluster
centers chosen.
• Prior seeding using K-Means++ can reduce
the number of iterations.
• Parameters like dimension, cluster centers
affect the performance apart from the input
size of the vectors.
December 20, 2011
HiPC - 2011
16
IIIT Hyderabad
Limitations of GPU device
• Highly computational & memory
consuming algorithms.
• Limited Global and Shared memory on a
GPU device.
• Division of computational load if more than
one device is available.
• Utilization of every resource available.
• Scalability of the algorithm
December 20, 2011
HiPC - 2011
17
IIIT Hyderabad
Multi GPU Approach
• Partition input data into
chunks proportional to
number of cores.
• Broadcast ‘k’ centers to all
the nodes.
• Perform Membership &
partial mean on each of the
GPUs sent to their
respective nodes.
• Nodes direct partial sums to
Master node.
• New means evaluated by
Master node for next
iteration.
December 20, 2011
HiPC - 2011
18
IIIT Hyderabad
Results
• Generated Gaussian SIFT vectors
• Variation in parameters n, d, k
• Performance on CPU(32 bit, 2.7 Ghz),
Tesla T10, GTX 480 tested up to nmax :4
Million, kmax : 8000 , dmax : 256
• MultiGPU (4xT10 + GTX 480)
nmax : 32 Million, kmax : 8000, dmax : 256
• Comparison with previous GPU
implementations.
December 20, 2011
HiPC - 2011
19
Overall Results
CPU
GPU
Tesla T10
GTX 480
4xT10
10K, 80
1.3
0.119
0.18
0.097
50K, 800
71.3
2.73
1.73
0.891
125K, 2K
463.6
14.18
7.71
2.47
250K, 4K
1320
38.5
27.7
7.45
1M, 8K
28936
268.6
170.6
68.5
N, K
IIIT Hyderabad
Times of K-Means on CPU, GPUs in seconds for d=128.
December 20, 2011
HiPC - 2011
20
IIIT Hyderabad
Overall Performance
• Mean evaluation reduced to 6% of the total
time for large input of high dimensional
data.
• Multi GPU provided linear speedup
• Speedup of up to 170 on GTX 480
• 6 Million vectors of 128 dimension
clustered in just 136 sec per iteration.
• Achieved up to twice increase in speedup
against the best GPU implementation
December 20, 2011
HiPC - 2011
21
IIIT Hyderabad
Performance vs ‘n’
Linear performance for variation in n, with d=128 and k=4,000.
December 20, 2011
HiPC - 2011
22
IIIT Hyderabad
Performance vs ‘d’
Performance for variation in d, with n=1M and k=8,000.
December 20, 2011
HiPC - 2011
23
IIIT Hyderabad
Performance vs ‘k’
Linear performance for variation in k, with n=50k and d=128.
December 20, 2011
HiPC - 2011
24
Comparison
K
D
Li et al
Wu et al
Our
K-Means
2 Million
400
8
1.23
4.53
1.27
4 Million
100
8
0.689
4.95
0.734
4 Million
400
8
2.26
9.03
2.4
51,200
32
64
1320
-
0.191
51,200
32
128
28936
-
0.282
IIIT Hyderabad
N
Running time of K-Means in seconds on GTX 280.
December 20, 2011
HiPC - 2011
25
IIIT Hyderabad
Performance on GPUs
Performance of 8600, Tesla, GTX 480 for d=128 and k=1,000.
December 20, 2011
HiPC - 2011
26
IIIT Hyderabad
Conclusions
• Achieved a speed of over 170 on single NVIDIA Fermi
GPU.
• Complete GPU based implementation.
• High Performance for large ‘d’ due to processing of
vector in parallel.
• Scalable in problem size n, d, k and number of cores.
• Use of operations like Splitsort, Transpose for coalesced
memory access.
• Overcame memory limitations using Multi GPU frame
work.
• Code will be available at http://cvit.iiit.ac.in soon
December 20, 2011
HiPC - 2011
27
Thank You
IIIT Hyderabad
Questions?