High Performance Computing with the GPU


1) Leverage raw computational power of the GPU

 Order-of-magnitude performance gains possible
2) Leverage maturation of GPU HW and SW

   Era           HW                                    SW
   1995 - 2000   Dedicated fixed 3D accelerators       ---
   2000 - 2005   Programmable gfx pipeline (shaders)   Assembly code; shader programming languages (Cg/HLSL)
   2006 -        General computing (nVidia G80)        General programming languages (CUDA)

Nanoscale Molecular Dynamics (NAMD)
University of Illinois, Urbana-Champaign
 Tools for simulating and visualizing biomolecular processes
 GPU acceleration yields 3.5x – 8x performance gains


Develop a high performance library of core
computational methods using the GPU

Library level
 BLAS (Basic Linear Algebra Subprograms)
 numerical methods
 image processing kernels

Application level
 port LONI algorithms

G80 chipset: nVidia 8800 GTX
 128 micro-processors
   680 million transistors (Intel Core 2: 290 million)
   16 multi-processor units @ 1.3 GHz
   8 processors per multi-processor unit
 Device memory: 768 MB
 High performance parallel architecture
   On-chip shared memory (16 KB)
   Texture cache (8 KB)
   Constant memory (64 KB) and cache (8 KB)
Matrix Pivot Divide
[Figure: runtime in seconds (0 – 0.4 s), GPU vs. CPU, for matrix sizes in multiples of 16 (16 – 256).]

Compatible with all cards with the CUDA driver
 Linux / Windows
 Mobile (GeForce 8M), desktop (GeForce 8), server (Quadro)

Scalable to multi-GPUs
 nVidia SLI
 Workstation cluster (nVidia Tesla)
   1.5 GB dedicated memory
   2 or 4 G80 GPUs (256 or 512 micro-processors)

Attractive cost-to-performance ratio
 nVidia 8800 GTX: $550
 nVidia Tesla: $7,500 - $12,000





Limitations
 nVidia CUDA is 1st generation
 Not all algorithms scale well to the GPU
 Host memory to device memory bottleneck
 Single-precision floating point only
 Cross-GPU development currently not available

   Task                                             Time
   a) Identify computational methods to implement
   b) Evaluate if scalable to GPU                   2 - 4 weeks
   Experimentation / Implementation                 3 - 4 months
   Develop prototype                                Feb 2008

Basic definitions

BLOCK = conceptual computational node
 Max number = 65535
 Optimal if # of blocks multiple of # of multiprocessors (16)

Each BLOCK runs a number of threads
 Max threads per block = 512
 Optimal if # threads multiple of warp size (32)

Pivot-divide for 3D volume data
 Matrix pivot-divide applied to each slice independently
 Each slice mapped to a "block" (NUMBLOCKS = N)
 Each thread in a block handles one row in its slice (NUMTHREADS = N)
Pivot-divide for volume data: GPU vs CPU
[Figure: runtime in milliseconds vs. volume size 16^3 – 256^3. GPU: 0.4891 ms at N=16 up to 124.9 ms at N=256; CPU: 0.0764 ms at N=16 up to 2762.8 ms at N=256.]
Pivot-divide for volume: GPU memory overhead vs. computation (milliseconds; volume size N x N x N)

   N     Computation   Memory
   16    0.4496        0.0395
   32    0.5592        0.1516
   48    0.8942        0.4768
   64    1.7589        1.3516
   80    2.4644        2.0572
   96    4.1567        3.7421
   112   6.2411        5.8335
   128   8.3843        7.9771
   144   12.124        11.624
   160   15.246        14.839
   176   21.258        20.846
   192   31.878        31.486
   208   32.5          32.113
   224   40.379        39.985
   240   49.574        49.187
   256   62.655        62.241

As long as there is no synchronization among slices, the problem scales well to the GPU
 Concurrent read of other slices should be possible

Host-to-device latency
 1 GB/s measured (2 GB/s reported)
 PCIe settings?

Needs investigating:

NUMBLOCKS and multiprocessor count?
 Fine-tune the number of slices per block?
 The CUDA scheduler seems to handle it well when NUMBLOCKS = N

Scaling issues
 N > NUMTHREADS?
 Will we ever hit the BLOCK limit?

t(total) = t(mem) + t(compute)

GPU
 t(mem) = host-to-device transfer
 t(compute) = kernel time

CPU
 t(mem) = memcpy()
 t(compute) = loop time

Parameters
 for N = 16…256, BLOCKS = 256
 for N = 272…512, BLOCKS = 512

Host-to-device memory bottleneck
 Pageable vs. pinned (page-locked) memory allocation
 Transfers are ~2x faster with pinned memory
Elapsed run-time
[Figure: run-time in milliseconds (0 – 1800) vs. volume size 16^3 – 512^3, for CPU Total, CPU-C, GPU Total, and GPU-C.]
Performance Gain Factors
[Figure: speedup factor (0 – 4.5) vs. volume size 16^3 – 512^3, for CPU/GPU (Total), CPU-C/GPU-C, and CPU-C/GPU-T.]
Quadratic models (y in milliseconds, x = volume edge length N)
 CPU: y = 34.6073 - 0.68294x + 0.003184x^2
 GPU Total: y = 22.005 - 0.476x + 0.00237x^2
 GPU: y = 11.2798 - 0.24405x + 0.001181x^2
[Figure: fitted curves vs. volume size 16^3 – 512^3.]

Single Instruction Multiple Data (SIMD) model
 Less synchronization, higher performance
 v1.0 - no sync among blocks

High arithmetic intensity
 Arithmetic intensity = arithmetic ops / memory ops
 Computation can overlap with memory operations


Memory operations have the highest latency.

Shared memory
 As fast as a register if there are no bank conflicts
 Limited to 16 KB

Texture memory
 Cached from device memory
 Optimized for 2D spatial locality
 Built-in filtering/interpolation methods
 Reads packed data in one operation (e.g. RGBA)

Constant memory
 Cached from device memory
 As fast as a register if all threads read the same address

Device memory
 Uncached, very slow
 Faster if accesses are byte-aligned and coalesced into a single contiguous access

Arithmetic operations
 4 clock cycles: float add, multiply, multiply-add; int add
 16 clock cycles: 32-bit int multiply (4 cycles for 24-bit)
 36 clock cycles: float division
   int division and modulo are very costly
 v1.0 - floats only (doubles are converted to float)
 Atomic operations (v1.1 only) provide locking mechanisms


Minimize host-to-device memory transfers
Minimize device memory access
Minimize execution divergence
 Minimize branching in the kernel
 Unroll loops
Make heavy use of shared memory
 Optimize with byte alignment and coalescing
 Stripe data correctly to avoid bank conflicts
 For image processing tasks, texture memory may be more efficient
# threads per block = multiple of 32
# blocks = ?