#### Transcript hipc_presentationx - web.iiit.ac.in

Scalable Clustering using Multiple GPUs
K Wasif Mohiuddin, P J Narayanan
Center for Visual Information Technology, International Institute of Information Technology (IIIT) Hyderabad
December 20, 2011, HiPC 2011

#### Introduction
• Classification of data is desired for a meaningful representation.
• Data within a subset ideally shares common traits.
• Unsupervised learning finds hidden structure in unlabeled data.
• Applications in data mining and computer vision:
  – image classification
  – document retrieval
• Focus: the simple K-Means algorithm.

#### Need for High-Performance Clustering
• Time complexity of O(n^(dk+1) log n), where n is the number of input vectors, d the dimension, and k the number of centers.
• A fast, efficient clustering implementation is needed to deal with large data, high dimensionality, and many centers.
• In computer vision, 128-dimensional SIFT and 512-dimensional GIST descriptors are common; feature counts can run into several millions.
• Bag-of-Words vocabulary generation uses SIFT vectors [Lowe, IJCV 2004].

#### Challenges and Contributions
• Data: a storage format that supports quick and repeated access.
• Computational: O(n^(dk+1) log n) complexity.
• Contributions: a complete GPU-based implementation with
  – exploitation of intra-vector parallelism,
  – efficient mean evaluation,
  – careful data organization,
  – a multi-GPU framework.

#### Related Work
• General improvements:
  – KD-trees [Moore et al., SIGKDD 1999]
  – Triangle inequality [Elkan, ICML 2003]
• Pre-CUDA GPU efforts:
  – Fragment shaders [Hart et al., SIGGRAPH 2004]

#### Related Work (cont.)
• Recent GPU efforts:
  – Mean on CPU [Che et al., JPDC 2008]
  – Mean on CPU + GPU [Hong et al., WCCSIE 2009]
  – GPU Miner [Ren et al., HP Labs 2009]
  – HPK-Means [Wu et al., UCHPC 2009]
  – Divide & Rule [Li et al., ICCIT 2010]
• Parallelism is not exploited within a data object.
• Mean evaluation on the GPU lacks efficiency.
• The proposed techniques are parameter dependent.
#### K-Means
• Objective function: minimize Σᵢ Σⱼ ‖xᵢ(j) − cⱼ‖², 1 ≤ i ≤ n, 1 ≤ j ≤ k.
• Euclidean distance: the L2 norm.
• Steps:
  – membership evaluation
  – new mean evaluation
  – convergence check

#### Algorithm
• k random centers are initially chosen from the input.
• The data is partitioned into k clusters.
• Each observation belongs to the cluster with the nearest mean.
• The new centers are re-evaluated and the process continues until convergence is attained.

#### K-Means on GPU: Membership Evaluation
• Involves distance and minimum evaluation.
• A single thread per component of a vector.
• Parallel computation over the d components of the input and center vectors, stored in row-major format.
• Logarithmic (tree-style) summation for distance evaluation.
• For each input vector, all centers, held in the L2 cache, are traversed.

#### K-Means on GPU (cont.): Membership Evaluation
• Data objects stored in row-major format, providing coalesced access.
• Distance evaluation uses shared memory.
• Square-root evaluation is avoided: squared distances suffice for finding the minimum.

#### K-Means on GPU (cont.): Mean Evaluation Issues
• Rearranging data on the CPU according to membership is time consuming.
• Concurrent writes.
• Random reads and writes.
• Non-uniform distribution of labels across data objects.

#### Mean Evaluation on GPU
• Labels and indices are stored together in 64-bit records.
• Data objects with the same membership are grouped using the split-sort operation [Suryakant & Narayanan, IIIT-H TR 2009], with labels as the key.
• A gather primitive rearranges the input in label order.
• A sorted global index of the input vectors is generated.

#### Splitsort & Transpose Operation
(figure)

#### Mean Evaluation on GPU (cont.)
• Row-major storage of vectors enables coalesced access.
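The grouping pipeline above (split-sort on the label key, gather, segmented reduction) can be mimicked with array primitives. This is a minimal NumPy sketch of the idea, not the GPU kernels: `argsort` stands in for the split-sort and a per-segment loop stands in for the segmented scan.

```python
import numpy as np

def mean_by_label(X, labels, k):
    """Group vectors by cluster label, then average each group.

    Mirrors the GPU pipeline: split-sort on the label key, gather
    the vectors into label order, then a segmented reduction.
    """
    order = np.argsort(labels, kind="stable")  # split-sort: labels as key
    X_sorted = X[order]                        # gather in label order
    counts = np.bincount(labels, minlength=k)  # histogram of labels
    means = np.zeros((k, X.shape[1]))
    start = 0
    for j in range(k):                         # segmented reduction
        if counts[j] > 0:
            means[j] = X_sorted[start:start + counts[j]].mean(axis=0)
        start += counts[j]
    return means, counts
```

On the GPU the per-segment loop is replaced by a segmented scan over the gathered vectors, so all clusters are reduced in parallel.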
• CUDPP segmented scan followed by a compact operation yields the histogram counts.
• A transpose operation is applied before rearranging the input vectors.
• Segmented scan is used again to evaluate the means of the rearranged vectors per label.

#### Implementation Details
• Tesla (T10):
  – 2 vectors per block, 2 centers at a time.
  – Centers accessed via shared memory.
• Fermi (GTX 480):
  – 2 vectors per block, 4 centers at a time.
  – Centers accessed via global memory through the L2 cache.
  – More shared memory available for distance evaluation.
• Occupancy of 83% using 5136 KB of shared memory on Fermi.

#### Issues
• Too many distance evaluations.
• Convergence depends strongly on the chosen initial cluster centers.
• Prior seeding with K-Means++ can reduce the number of iterations.
• Parameters such as the dimension and the number of cluster centers affect performance, in addition to the input size.

#### Limitations of the GPU Device
• The algorithm is highly compute- and memory-intensive.
• Global and shared memory on a GPU device are limited.
• The computational load must be divided when more than one device is available.
• Every available resource should be utilized.
• The algorithm must scale.

#### Multi-GPU Approach
• Partition the input data into chunks proportional to the number of cores.
• Broadcast the k centers to all nodes.
• Perform membership and partial mean evaluation on the GPUs at their respective nodes.
• Nodes send their partial sums to the master node.
• The master node evaluates the new means for the next iteration.

#### Results
• Synthetic SIFT-like vectors generated from Gaussian distributions.
• Variation of the parameters n, d, k.
• Performance on a CPU (32-bit, 2.7 GHz), Tesla T10, and GTX 480, tested up to n = 4 million, k = 8,000, d = 256.
• Multi-GPU (4×T10 + GTX 480): up to n = 32 million, k = 8,000, d = 256.
• Comparison with previous GPU implementations.
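The multi-GPU round described above can be sketched in plain NumPy. This is a sketch under the assumption that each GPU's data arrives as one chunk in a list; membership compares squared distances (no square root needed), and only per-cluster partial sums and counts travel to the master.

```python
import numpy as np

def multi_gpu_round(chunks, C):
    """One K-Means round across several 'GPUs' (here: data chunks).

    Each chunk computes memberships and partial sums locally; the
    master combines the partial sums into the new means.
    """
    k, d = C.shape
    total_sum = np.zeros((k, d))
    total_cnt = np.zeros(k, dtype=int)
    for X in chunks:  # one loop body per GPU/node
        # Membership: squared Euclidean distances, no square root needed.
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Partial mean evaluation: sums and counts sent to the master.
        for j in range(k):
            members = X[labels == j]
            total_sum[j] += members.sum(axis=0)
            total_cnt[j] += len(members)
    # Master node: new means for the next iteration (empty clusters
    # keep their old center).
    safe = np.maximum(total_cnt, 1)[:, None]
    return np.where(total_cnt[:, None] > 0, total_sum / safe, C)
```

In the real system each loop body runs concurrently on its own GPU and the reduction at the master is a message exchange between nodes.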
#### Overall Results

| n, k | CPU | Tesla T10 | GTX 480 | 4×T10 |
|---|---|---|---|---|
| 10K, 80 | 1.3 | 0.119 | 0.18 | 0.097 |
| 50K, 800 | 71.3 | 2.73 | 1.73 | 0.891 |
| 125K, 2K | 463.6 | 14.18 | 7.71 | 2.47 |
| 250K, 4K | 1320 | 38.5 | 27.7 | 7.45 |
| 1M, 8K | 28936 | 268.6 | 170.6 | 68.5 |

Times of K-Means on the CPU and GPUs in seconds, for d = 128.

#### Overall Performance
• Mean evaluation reduced to 6% of the total time for large inputs of high-dimensional data.
• Multi-GPU execution provided linear speedup.
• Speedup of up to 170 on the GTX 480.
• 6 million vectors of 128 dimensions clustered in just 136 seconds per iteration.
• Up to twice the speedup of the best previous GPU implementation.

#### Performance vs 'n'
(figure) Linear performance for variation in n, with d = 128 and k = 4,000.

#### Performance vs 'd'
(figure) Performance for variation in d, with n = 1M and k = 8,000.

#### Performance vs 'k'
(figure) Linear performance for variation in k, with n = 50K and d = 128.

#### Comparison

| N | K | D | Li et al | Wu et al | Our K-Means |
|---|---|---|---|---|---|
| 2 Million | 400 | 8 | 1.23 | 4.53 | 1.27 |
| 4 Million | 100 | 8 | 0.689 | 4.95 | 0.734 |
| 4 Million | 400 | 8 | 2.26 | 9.03 | 2.4 |
| 51,200 | 32 | 64 | 1320 | - | 0.191 |
| 51,200 | 32 | 128 | 28936 | - | 0.282 |

Running time of K-Means in seconds on a GTX 280.

#### Performance on GPUs
(figure) Performance of the 8600, Tesla T10, and GTX 480 for d = 128 and k = 1,000.

#### Conclusions
• Achieved a speedup of over 170 on a single NVIDIA Fermi GPU.
• Complete GPU-based implementation.
• High performance for large d due to processing each vector in parallel.
• Scalable in problem size (n, d, k) and in the number of cores.
• Operations such as split-sort and transpose give coalesced memory access.
• Memory limitations overcome using the multi-GPU framework.
• Code will be available soon at http://cvit.iiit.ac.in.

#### Thank You
Questions?
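As a closing arithmetic check on the headline claims, the quoted "speedup of up to 170" follows from the n = 1M, k = 8K row of the overall results table (times in seconds; this small Python check is not part of the original slides):

```python
# Times from the overall results table (d = 128, n = 1M, k = 8K).
cpu_s = 28936.0     # CPU
gtx480_s = 170.6    # single GTX 480 (Fermi)
four_t10_s = 68.5   # 4x Tesla T10 multi-GPU

speedup_fermi = cpu_s / gtx480_s      # ~170, the quoted figure
speedup_multi = cpu_s / four_t10_s    # multi-GPU speedup over the CPU
```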