High Performance Computing with the GPU
1) Leverage the raw computational power of the GPU
Order-of-magnitude performance gains are possible
2) Leverage the maturation of GPU hardware (HW) and software (SW)
1995 - 2000: HW: dedicated fixed-function 3D accelerators; SW: assembly code
2000 - 2005: HW: programmable graphics pipeline (shaders); SW: shader programming languages (Cg/HLSL)
2006 onward: HW: general computing (nVidia G80); SW: general programming languages (CUDA)
Nanoscale Molecular Dynamics (NAMD)
University of Illinois, Urbana-Champaign
Tools for simulating and visualizing biomolecular processes
Yields 3.5x - 8x performance gains
Develop a high-performance library of core computational methods using the GPU
Library level
BLAS (Basic Linear Algebra Subprograms)
Numerical methods
Image processing kernels
Application level
Port LONI algorithms
G80 chipset: nVidia 8800 GTX
128 micro-processors
680 million transistors (vs. 290 million for an Intel Core 2)
16 multiprocessor units @ 1.3 GHz
8 processors per multiprocessor unit
Device memory: 768 MB
High-performance parallel architecture
On-chip shared memory (16 KB)
Texture cache (8 KB)
Constant memory (64 KB) and cache (8 KB)
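These figures can be confirmed at run time through the CUDA runtime API; the listing below is a sketch (not from the talk) that queries device 0 and prints the properties discussed above.

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  /* query device 0 */
        printf("Name: %s\n", prop.name);
        printf("Multiprocessors: %d\n", prop.multiProcessorCount);  /* 16 on the 8800 GTX */
        printf("Clock rate: %d kHz\n", prop.clockRate);             /* ~1.3 GHz */
        printf("Device memory: %lu MB\n",
               (unsigned long)(prop.totalGlobalMem >> 20));         /* 768 MB */
        printf("Shared memory per block: %lu KB\n",
               (unsigned long)(prop.sharedMemPerBlock >> 10));      /* 16 KB */
        printf("Constant memory: %lu KB\n",
               (unsigned long)(prop.totalConstMem >> 10));          /* 64 KB */
        return 0;
    }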
Matrix Pivot Divide
[Chart: run time in seconds, GPU vs. CPU, by matrix size in multiples of 16 (1 - 16)]
Compatible with all cards with a CUDA driver
Linux / Windows
Mobile (GeForce 8M), desktop (GeForce 8), server (Quadro)
Scales to multiple GPUs
nVidia SLI
Workstation cluster (nVidia Tesla)
1.5 GB dedicated memory
2 or 4 G80 GPUs (256 or 512 micro-processors)
Attractive cost-to-performance ratio
nVidia 8800 GTX: $550
nVidia Tesla: $7,500 - $12,000
nVidia CUDA is a first-generation platform
Not all algorithms scale well to the GPU
Host-memory-to-device-memory bottleneck
Single-precision floating point only
Cross-GPU development is not currently available
Task: Time
a) Identify computational methods to implement, b) evaluate if scalable to GPU: 2 - 4 weeks
Experimentation / implementation: 3 - 4 months
Develop prototype: Feb 2008
Basic definitions
BLOCK = conceptual computational node
Maximum number of blocks = 65,535
Optimal if the number of blocks is a multiple of the number of multiprocessors (16)
Each BLOCK runs a number of threads
Maximum threads per block = 512
Optimal if the number of threads is a multiple of the warp size (32)
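As a minimal sketch of how these limits appear in a launch (the kernel name and sizes are illustrative, not from the talk):

    #include <cuda_runtime.h>

    __global__ void scale(float *data, float s) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] = s * data[i];
    }

    int main(void) {
        const int threads = 256;  /* <= 512 and a multiple of the warp size (32) */
        const int blocks  = 64;   /* <= 65535; ideally a multiple of the 16 multiprocessors */
        const int n = blocks * threads;
        float *d_data;
        cudaMalloc((void **)&d_data, n * sizeof(float));
        scale<<<blocks, threads>>>(d_data, 2.0f);
        cudaThreadSynchronize();  /* CUDA 1.x-era synchronization call */
        cudaFree(d_data);
        return 0;
    }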
Pivot-divide for 3D volume data
Matrix pivot-divide applied to each slice independently
Each slice is mapped to a "block" (NUMBLOCKS = N)
Each thread in a block handles one row of the slice (NUMTHREADS = N)
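The mapping might look like the kernel below. This is a sketch: the talk shows no code, and the kernel body (dividing each row of a slice by its diagonal pivot) is an assumption about what pivot-divide computes.

    __global__ void pivotDivide(float *vol, int N) {
        int slice = blockIdx.x;                   /* one block per slice: NUMBLOCKS = N */
        int row   = threadIdx.x;                  /* one thread per row:  NUMTHREADS = N */
        float *r  = vol + (slice * N + row) * N;  /* start of this row */
        float pivot = r[row];                     /* assumed: the diagonal element */
        for (int col = 0; col < N; ++col)
            r[col] /= pivot;
    }

    /* launched as: pivotDivide<<<N, N>>>(d_volume, N); no cross-slice sync needed */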
Pivot-divide for volume data: GPU vs. CPU
[Chart: run time in milliseconds by volume size (NxNxN)]

Volume size N   GPU (ms)   CPU (ms)
16              0.4891     0.0764
32              0.7109     0.4917
48              1.371      3.3496
64              3.1105     5.8916
80              4.5215     11.545
96              7.8988     22.181
112             12.075     40.249
128             16.361     185.6
144             23.748     92.97
160             30.085     223.63
176             42.103     193.83
192             63.365     371.26
208             64.613     319.41
224             80.364     521.19
240             98.76      508.99
256             124.9      2762.8
Pivot-divide for volume: GPU memory overhead vs. computation
[Chart: time in milliseconds by volume size (NxNxN)]

Volume size N   Computation (ms)   Memory (ms)
16              0.4496             0.0395
32              0.5592             0.1516
48              0.8942             0.4768
64              1.7589             1.3516
80              2.4644             2.0572
96              4.1567             3.7421
112             6.2411             5.8335
128             8.3843             7.9771
144             12.124             11.624
160             15.246             14.839
176             21.258             20.846
192             31.878             31.486
208             32.5               32.113
224             40.379             39.985
240             49.574             49.187
256             62.655             62.241
As long as there is no synchronization among slices, the problem scales well to the GPU
Host-to-device latency
Concurrent reads of other slices should be possible
1 GB/s measured (2 GB/s reported)
PCIe settings?
Needs investigating:
NUMBLOCKS vs. multiprocessor count?
Fine-tune the number of slices per block?
The CUDA scheduler seems to handle it well when NUMBLOCKS = N
Scaling issues
N > NUMTHREADS?
Will we ever hit the BLOCK limit?
t( total ) = t( mem ) + t( compute )
GPU
t(mem) = host-to-device transfer
t(compute) = kernel time
CPU
t(mem) = memcpy()
t(compute) = loop time
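One way to measure the two GPU terms separately is with CUDA events; the snippet below is a sketch that reuses the hypothetical pivotDivide kernel from earlier (d_vol, h_vol, bytes, and N are assumed to be set up beforehand).

    cudaEvent_t start, mid, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&mid);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(d_vol, h_vol, bytes, cudaMemcpyHostToDevice);  /* t(mem) */
    cudaEventRecord(mid, 0);
    pivotDivide<<<N, N>>>(d_vol, N);                          /* t(compute) */
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float tMem, tCompute;
    cudaEventElapsedTime(&tMem, start, mid);                  /* milliseconds */
    cudaEventElapsedTime(&tCompute, mid, stop);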
Parameters
For N = 16…256, BLOCKS = 256
For N = 272…512, BLOCKS = 512
Host-to-device memory bottleneck
Pageable vs. pinned memory allocation
2x faster with pinned (see the sketch below)
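The pageable-vs-pinned difference comes down to one allocation call; a sketch (buffer names and sizes are illustrative, d_vol and bytes assumed defined):

    float *h_pageable = (float *)malloc(bytes);  /* pageable: driver must stage the copy */
    float *h_pinned;
    cudaMallocHost((void **)&h_pinned, bytes);   /* pinned (page-locked): DMA-friendly */

    cudaMemcpy(d_vol, h_pageable, bytes, cudaMemcpyHostToDevice);  /* ~1 GB/s measured */
    cudaMemcpy(d_vol, h_pinned,   bytes, cudaMemcpyHostToDevice);  /* roughly 2x faster */

    cudaFreeHost(h_pinned);
    free(h_pageable);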
Elapsed run-time
[Chart: elapsed run time in milliseconds by volume size (NxNxN, 16 - 512) for the CPU Total, CPU-C, GPU Total, and GPU-C series]
Performance Gain Factors
[Chart: gain factor (0 - 4.5) by volume size (NxNxN, 16 - 512) for the CPU/GPU (Total), CPU-C/GPU-C, and CPU-C/GPU-T ratios]
Quadratic Models
[Chart: fitted run time in milliseconds vs. volume size (NxNxN, 16 - 512)]
CPU: y = 34.6073 - 0.68294x + 0.003184x^2
GPU Total: y = 22.005 - 0.476x + 0.00237x^2
GPU: y = 11.2798 - 0.24405x + 0.001181x^2
Single Instruction Multiple Data (SIMD) model
Less synchronization, higher performance
v1.0 - no synchronization among blocks
High arithmetic intensity
Arithmetic intensity = arithmetic ops / memory ops
Computations can overlap with memory operations
Memory operations have the highest latency
Shared memory
As fast as a register when there are no bank conflicts
Limited to 16 KB
Texture memory
Cached from device memory
Optimized for 2D spatial locality
Built-in filtering/interpolation methods
Reads packed data in one operation (e.g., RGBA)
Constant memory
Cached from device memory
As fast as a register if all threads read the same address
Device memory
Uncached, very slow
Faster if accesses are byte-aligned and coalesced into a single contiguous access
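A small illustration (not from the talk) of how these memories combine: threads read global memory contiguously (coalesced) into shared memory, then serve the irregular access pattern out of shared memory, which is as fast as registers when bank conflicts are avoided.

    __global__ void reverseBlock(float *out, const float *in) {
        __shared__ float tile[256];                   /* launch with 256 threads/block; well under 16 KB */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = in[i];                    /* coalesced, contiguous read */
        __syncthreads();                              /* within-block sync only (v1.0) */
        out[i] = tile[blockDim.x - 1 - threadIdx.x];  /* reversal served from shared memory */
    }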
Arithmetic operations
4 clock cycles for float (+, *, *+) and int (+)
16 clock cycles for 32-bit int multiply (4 cycles for 24-bit)
36 clock cycles for float division
(int division and modulo are very costly)
v1.0 - floats only (doubles are converted to floats)
Atomic operations (v1.1 only)
Provide locking mechanisms
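A sketch tying these costs together (the histogram kernel is illustrative, not from the talk): __mul24 takes the 4-cycle 24-bit multiply path, and atomicAdd is the v1.1 locking primitive mentioned above.

    __global__ void histogram(unsigned int *bins, const unsigned char *pixels, int n) {
        int i = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;  /* cheap 24-bit multiply */
        if (i < n)
            atomicAdd(&bins[pixels[i]], 1u);  /* concurrent-safe increment, compute capability 1.1+ */
    }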
Minimize host-to-device memory transfers
Minimize device memory access
Minimize execution divergence
Minimize branching in the kernel
Unroll loops
Make heavy use of shared memory
Optimize with byte alignment and coalescing
Data must be striped correctly to avoid bank conflicts
For image processing tasks, texture memory may be more efficient
# threads per block = multiple(32)
# blocks = ? (a heuristic sketch follows)
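One plausible heuristic for that open question, based only on the numbers in this talk (not an official recommendation): cover the data, then round the block count up to a multiple of the multiprocessor count (n is the assumed element count).

    int threadsPerBlock = 256;                                 /* multiple of 32, <= 512 */
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  /* enough blocks to cover n elements */
    if (blocks % 16 != 0)
        blocks += 16 - (blocks % 16);                          /* round up to a multiple of the 16 MPs */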