Numerical Simulation Using GPUs

Data Analysis and Visualization
Numerical Simulations Using Programmable GPUs
Stan Tomov
September 5, 2003
Brookhaven Science Associates
U.S. Department of Energy
Outline
• Motivation
• Literature review
• The graphics pipeline
• Programmable GPUs
• Block diagram of nVidia's GeForce FX
• Some probability based simulations
- Monte Carlo simulations
- Ising model
- Percolation model
• Implementation
• Performance results and analysis
• Extensions and future work
• Conclusions
Motivation
The GPUs have:
● High flops count (nVidia has listed 200 Gflops theoretical speed for NV30)
● Competitive price/performance (0.1 cents per Mflop)
● High rate of performance increase over time (doubling every 6 months)

Table 1. GPU vs CPU in rendering polygons (frames per second).
Problem size    OpenGL (GPU)    Mesa (CPU)
11,540          189             8.01
47,636          52              1.71
193,556         13              0.44
780,308         3               0.12
The GPU (Quadro2 Pro) is approximately 30 times faster than the CPU (Pentium III, 1 GHz) in rendering polygonal data of various sizes.

Explore the possibility of extending GPUs' use to non-graphics applications
Literature review
Using graphics hardware for non-graphics applications:
• Cellular automata
• Reaction-diffusion simulation (Mark Harris, University of North Carolina)
• Matrix multiply (E. Larsen and D. McAllister, University of North Carolina)
• Lattice Boltzmann computation (Wei Li, Xiaoming Wei, and Arie Kaufman, Stony Brook)
• CG and multigrid (J. Bolz et al, Caltech, and N. Goodnight et al, University of Virginia)
• Convolution (University of Stuttgart)
Performance results:
• Significant speedups of GPU vs CPU (30 to 60 times, depending on the configuration) are reported when the GPU performs low-precision computations
• The fact that the operations are low precision is often left unstated, which may be confusing:
- NCSA, University of Illinois assembled a $50,000 supercomputer out of 70 PlayStation 2 consoles, which could theoretically deliver 0.5 trillion operations/second
- also, currently $200 GPUs are capable of 1.2 trillion op/s
• GPU flops performance is comparable to that of the CPU
The graphics pipeline
Programmable GPUs
(in particular NV30)
• Support floating point operations
• Vertex program
- Replaces fixed-function pipeline for vertices
- Manipulates single vertex data
- Executes for every vertex
• Fragment program
- Similar to vertex program but for pixels
• Programming in Cg:
- High level language
- Looks like C
- Portable
- Cg programs are compiled to assembly code
Block diagram of GeForce FX
• AGP 8x graphics bus bandwidth: 2.1 GB/s
• Local memory bandwidth: 16 GB/s
• Chip officially clocked at 500 MHz
• Vertex processor:
- execute vertex shaders or emulate fixed transformations and lighting (T&L)
• Pixel processor:
- execute pixel shaders or emulate fixed shaders
- 2 int & 1 float ops or 2 texture accesses per clock cycle
• Texture & color interpolators
- interpolate texture coordinates and color values
Performance (on processing 4D vectors):
• Vertex ops/sec - 1.5 Gops
• Pixel ops/sec - 8 Gops (int), or 4 Gops (float)
Source: Hardware at Digit-Life.com, NVIDIA GeForce FX, or "Cinema show started", November 18, 2002.
Monte Carlo simulations
● Used in a variety of simulations in physics, finance, chemistry, etc.
● Based on probability statistics and use random numbers
● A classical example: compute area of a circle
● Computation of expected values:
$$E(F) = \sum_{i=1}^{N} F(S_i) P(S_i) \qquad (1)$$
N can be very large: on a 1024 x 1024 lattice of particles, with every particle modeled to have k states, $N = k^{1024^2}$
● Random number generation. We used a linear congruential type generator:
$$R(n) = (a \cdot R(n-1) + b) \bmod N$$
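The two ingredients above, a linear congruential generator and the circle-area example, can be combined in a minimal Python sketch. The constants a, b and the modulus are the widely used Numerical Recipes values, chosen here only for illustration; the slides do not say which constants were used, and the function names are hypothetical.

```python
def lcg(seed, a=1664525, b=1013904223, m=2**32):
    """Yield floats in [0, 1) from R(n) = (a * R(n-1) + b) mod m.

    The constants are illustrative assumptions, not taken from the slides.
    """
    r = seed
    while True:
        r = (a * r + b) % m
        yield r / m


def estimate_circle_area(samples, seed=12345):
    """Estimate the area of the unit circle by sampling the square [-1, 1]^2."""
    rng = lcg(seed)
    inside = 0
    for _ in range(samples):
        x = 2.0 * next(rng) - 1.0
        y = 2.0 * next(rng) - 1.0
        if x * x + y * y <= 1.0:
            inside += 1
    # The square has area 4; the fraction of hits scales it to the circle.
    return 4.0 * inside / samples


if __name__ == "__main__":
    print(estimate_circle_area(100_000))
```

The estimate approaches pi as the number of samples grows, with an error that shrinks roughly like one over the square root of the sample count.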
Ising model
● Simplified model for magnets (introduced by Wilhelm Lenz in 1920, further studied by his student Ernst Ising)
● Modeled on a 2D lattice with a "spin" (corresponding to the orientation of electrons) at every cell, pointing up or down
● Uses temperature to couple 2 opposing physical principles
- minimization of the system's energy
- entropy maximization
● Want to compute
- expected magnetization: $F(S_i) = N_{up}(S_i) - N_{down}(S_i)$
- expected energy: $F(S_i) = En(S_i) = -\sum_{j,k} S_i(j) S_i(k)$, summed over pairs of neighboring sites j, k
● Evolve the system into "higher probability" states and compute expected values as averages over those states
- evolving from state to state, based on a certain probability decision, is related to so-called Markov chains:
W. Gilks, S. Richardson, and D. Spiegelhalter (Editors), Markov Chain Monte Carlo in Practice, Chapman & Hall, 1996.
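As a rough illustration of the two observables, the sketch below evaluates magnetization and energy for one lattice state in plain Python. It assumes spins stored as +1 (up) and -1 (down), a ferromagnetic sign convention (energy is minus the sum of neighbor products), and periodic boundaries; none of these details are stated on the slide, and the function names are hypothetical.

```python
def magnetization(spins):
    """N_up - N_down for a lattice of +1 (up) / -1 (down) spins."""
    return sum(sum(row) for row in spins)


def energy(spins):
    """Minus the sum of s_j * s_k over nearest-neighbor pairs.

    Assumes periodic boundaries; each pair is counted once by looking
    only at the right and lower neighbor of every site.
    """
    n, m = len(spins), len(spins[0])
    total = 0
    for i in range(n):
        for j in range(m):
            right = spins[i][(j + 1) % m]
            down = spins[(i + 1) % n][j]
            total += spins[i][j] * (right + down)
    return -total


if __name__ == "__main__":
    all_up = [[1] * 4 for _ in range(4)]
    print(magnetization(all_up), energy(all_up))
```

Under these assumptions a fully aligned lattice has the lowest energy, while a checkerboard pattern has zero magnetization and the highest energy.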
Ising model computational procedure
● Choose an absolute temperature of interest T (in Kelvin)
● Color lattice in a checkerboard manner
● Start consecutive black and white "sweeps"
● Change the spin at a site based on the procedure
1. Denote current state as S, the state with flipped spin as S'
2. Compute $\Delta E = E(S') - E(S)$
3. If $\Delta E \le 0$ accept S', else generate $R \in [0,1]$ and accept S' if
$$R \le \frac{P(S')}{P(S)} = e^{-\Delta E/(kT)}$$
where P(S) is given by the Boltzmann probability distribution function
$$P(S) = \frac{e^{-E(S)/(kT)}}{\sum_{i=1}^{N} e^{-E(S_i)/(kT)}}$$
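The procedure above can be sketched on the CPU as follows. This is not the shader code from the talk: it is a plain Python illustration that assumes +1/-1 spins, periodic boundaries, an energy equal to minus the sum of neighbor products, and a single parameter kT standing for the product k*T.

```python
import math
import random


def delta_energy(spins, i, j):
    """Energy change E(S') - E(S) caused by flipping the spin at (i, j)."""
    n, m = len(spins), len(spins[0])
    neighbors = (spins[(i - 1) % n][j] + spins[(i + 1) % n][j]
                 + spins[i][(j - 1) % m] + spins[i][(j + 1) % m])
    return 2 * spins[i][j] * neighbors


def sweep(spins, kT, color, rng=random):
    """Visit every site of one checkerboard color (0 = black, 1 = white)."""
    n, m = len(spins), len(spins[0])
    for i in range(n):
        for j in range(m):
            if (i + j) % 2 != color:
                continue
            dE = delta_energy(spins, i, j)
            # Accept S' if dE <= 0, otherwise with probability exp(-dE / kT).
            if dE <= 0 or rng.random() <= math.exp(-dE / kT):
                spins[i][j] = -spins[i][j]


if __name__ == "__main__":
    lattice = [[random.choice((-1, 1)) for _ in range(16)] for _ in range(16)]
    for _ in range(100):
        sweep(lattice, kT=2.0, color=0)
        sweep(lattice, kT=2.0, color=1)
    print(sum(map(sum, lattice)))
```

On a lattice with even side lengths, sites of one color have only neighbors of the other color, so all sites in a sweep can be updated independently. That independence is what makes the checkerboard scheme a natural fit for per-pixel processing.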
Percolation model
● First studied by Broadbent and Hammersley in 1957
● Used in studies of disordered media (usually specified by a probability distribution)
● Applied in studies of various phenomena such as spread of diseases, flow in porous media, forest fire propagation, clustering, etc.
● Of particular interest are:
- the threshold of the media model after which there exists a "spanning cluster"
- relations between different media models
- time to reach a steady-state spanning cluster
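To make the "spanning cluster" notion concrete, here is a small Python sketch. It assumes site percolation on a square lattice with 4-neighbor connectivity and defines spanning as a connected path of occupied sites from the top row to the bottom row; the slides do not specify the exact model variant, so these choices and the function names are illustrative only.

```python
import random
from collections import deque


def random_lattice(n, p, rng=random):
    """Occupy each site of an n x n lattice independently with probability p."""
    return [[rng.random() < p for _ in range(n)] for _ in range(n)]


def has_spanning_cluster(occupied):
    """Return True if occupied sites connect the top row to the bottom row."""
    n = len(occupied)
    seen = [[False] * n for _ in range(n)]
    queue = deque()
    # Start a breadth-first search from every occupied site in the top row.
    for j in range(n):
        if occupied[0][j]:
            seen[0][j] = True
            queue.append((0, j))
    while queue:
        i, j = queue.popleft()
        if i == n - 1:
            return True
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            a, b = i + di, j + dj
            if 0 <= a < n and 0 <= b < n and occupied[a][b] and not seen[a][b]:
                seen[a][b] = True
                queue.append((a, b))
    return False


if __name__ == "__main__":
    for p in (0.4, 0.6, 0.8):
        hits = sum(has_spanning_cluster(random_lattice(32, p)) for _ in range(50))
        print(p, hits / 50)
```

Running the loop for several occupation probabilities gives a rough picture of the threshold: the fraction of lattices with a spanning cluster rises sharply over a narrow range of p.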
Implementation
Approaches:
• Pure OpenGL (simulations using the fixed-function pipeline)
• Shaders in assembly
• Shaders in Cg
Dynamic texturing:
• Create a texture T (think of a 2D lattice)
• Loop:
- Render an image using T (in an off-screen buffer)
- Update T from the resulting image
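The dynamic texturing loop can be mimicked on the CPU to show its structure. In this Python sketch a nested list stands in for the texture T, a plain function stands in for the fragment program, and building a new list plays the role of rendering into an off-screen buffer. All names are hypothetical and no OpenGL or Cg calls are involved.

```python
def render(texture, fragment_program):
    """Build a new image by running fragment_program once per pixel of texture."""
    height, width = len(texture), len(texture[0])
    return [[fragment_program(texture, x, y) for x in range(width)]
            for y in range(height)]


def simulate(texture, fragment_program, steps):
    """Dynamic texturing loop: render from T, then replace T by the result."""
    for _ in range(steps):
        image = render(texture, fragment_program)  # off-screen render pass
        texture = image                            # update T from the image
    return texture


def average_neighbors(texture, x, y):
    """Example per-pixel kernel: mean of the 4 neighbors, with periodic wrap."""
    h, w = len(texture), len(texture[0])
    return (texture[(y - 1) % h][x] + texture[(y + 1) % h][x]
            + texture[y][(x - 1) % w] + texture[y][(x + 1) % w]) / 4.0


if __name__ == "__main__":
    start = [[0.0] * 8 for _ in range(8)]
    start[4][4] = 1.0
    print(simulate(start, average_neighbors, 3)[4])
```

Each pass reads only the previous texture and writes only the new image, so every pixel can be computed independently of the others within a pass.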
Performance results and analysis
• Time in s. (approximate) for different vector flops on the GPU:
              256x256    512x512
traffic       0.00063    0.0024
+, -, *, /    0.00010    0.0003
cos, sin      0.00026    0.0010
log, exp      0.00045    0.0015
if, ? :       0.00016    0.0008
- ≈ 48 B per node; speed limited by GPU's memory speed (16 GB/s)
- ≈ 3.5 Gflops
- ≈ 20 x faster than CPU, but the operations are of low accuracy
• Time in s. (approximate) including traffic for different vector flops on the CPU:
              256x256    512x512    1024x1024
+, -, *, /    0.0011     0.0046     0.017
cos, sin      0.0540     0.0650     0.267
log, exp      0.0609     0.1100     0.426
- 32 B per node; speed limited by CPU's memory speed (4.2 GB/s)
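The quoted figure of about 3.5 Gflops appears consistent with counting one 4-component vector operation per lattice node in the 512x512 timing for +, -, *, /. That reading is an inference from the numbers, not something stated on the slide; the short Python check below spells out the arithmetic under that assumption.

```python
def vector_gflops(lattice_side, seconds, components=4):
    """Gflops for one vector operation per node, `components` flops each.

    Assumes each node holds a 4-component vector (an inference, see above).
    """
    flops = components * lattice_side * lattice_side
    return flops / seconds / 1e9


# GPU timings for +, -, *, / taken from the table above.
gpu_256 = vector_gflops(256, 0.00010)
gpu_512 = vector_gflops(512, 0.0003)

if __name__ == "__main__":
    print(f"256x256: {gpu_256:.2f} Gflops, 512x512: {gpu_512:.2f} Gflops")
```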
Performance results and analysis
• GPU and CPU (2.8 GHz) performance on the Ising model
Lattice size (not necessarily a power of 2)
                  64x64     128x128   256x256   512x512   1024x1024
GPU sec/frame     0.0006    0.0023    0.0081    0.033     0.14
CPU no opt.       0.0009    0.0024    0.0083    0.032     0.13
CPU with -O4      0.0008    0.0020    0.0069    0.026     0.10
GPU instr./sec    0.55 G    0.57 G    0.66 G    0.63 G    0.61 G
• ≈ 2.64 Gflops, i.e. 15% GPU theoretical power utilization (too many ifs):
- if (flag) { … } : exec. time = time to compute the block even if flag = 0
• Performance comparable with visualization-related sample shaders from nVidia
• Cg vs. assembly
- Performance is the same for using runtime Cg or the generated assembly code
- The assembly code generated is not optimal: we found cases where the code could be optimized and performance increased
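The 2.64 Gflops figure matches the best instruction rate in the table (0.66 G instr./sec) multiplied by 4, as if each instruction acted on a 4-component vector. The Python sketch below repeats that arithmetic and, as a further assumption, takes the theoretical peak to be 4 Gops (float) on 4D vectors, i.e. 16 Gflops, which gives a utilization near the quoted 15%. Both readings are inferences from the numbers rather than statements from the talk.

```python
def gflops(ginstr_per_sec, components=4):
    """Gflops if every instruction operates on a 4-component vector."""
    return ginstr_per_sec * components


def instructions_per_site(ginstr_per_sec, sec_per_frame, side):
    """Instructions spent on each lattice site in one frame."""
    return ginstr_per_sec * 1e9 * sec_per_frame / (side * side)


# Assumption: peak = 4 Gops (float) on 4D vectors = 16 Gflops.
ASSUMED_PEAK_GFLOPS = 4 * 4.0

best = gflops(0.66)                       # 256x256 column of the table
utilization = best / ASSUMED_PEAK_GFLOPS  # fraction of the assumed peak
per_site = instructions_per_site(0.66, 0.0081, 256)

if __name__ == "__main__":
    print(f"{best:.2f} Gflops, {utilization:.1%} of assumed peak, "
          f"{per_site:.0f} instructions per site per frame")
```

Applying `instructions_per_site` to the other columns gives a similar count, roughly 80 instructions per site, which suggests the instruction rate was derived from a fixed shader length.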
Extensions and future work
• Code optimization (through optimization of Cg generated assembly)
• More applications:
- QCD ?
- Fluid flow ?
• Parallel algorithms (or just as a coprocessor)
- domain decomposition type in cluster environment
- Motivation: communication rates CPU <-> GPU for lattices of different sizes, in seconds (bdr = lattice boundary)
                               64x64     128x128   256x256   512x512   ≈ speed
Read bdr (glReadPixels)        0.00016   0.0002    0.0006    0.0024    14 MB/s
Read all (glReadPixels)        0.00040   0.0015    0.0062    0.0250    167 MB/s
Write bdr (glDrawPixels)       0.00022   0.0003    0.0007    0.0024    14 MB/s
Write all (glTexSubImage2D)    0.00020   0.0008    0.0032    0.0120    350 MB/s
Write bdr (glTexSubImage2D)    0.00050   0.0020    0.0071    0.0250    1.3 MB/s
- Not a bottleneck in a cluster with a 1 Gbit network
• Other ideas?
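The speed column of the communication table can be reproduced from the 512x512 timings if each lattice node is assumed to occupy 16 bytes (4 floats of 4 bytes) and the boundary is taken as roughly 4 x 512 nodes. Both figures are assumptions inferred from the numbers, not stated in the slides; the Python sketch below shows the arithmetic.

```python
BYTES_PER_NODE = 16  # assumption: 4 floats x 4 bytes per lattice node


def transfer_rate_mb_s(nodes, seconds):
    """Transfer rate in MB/s (1 MB = 10**6 bytes) for `nodes` lattice nodes."""
    return nodes * BYTES_PER_NODE / seconds / 1e6


SIDE = 512
FULL = SIDE * SIDE   # every node of the lattice
BOUNDARY = 4 * SIDE  # approximate boundary size (corners counted twice)

rates = {
    "Read bdr (glReadPixels)": transfer_rate_mb_s(BOUNDARY, 0.0024),
    "Read all (glReadPixels)": transfer_rate_mb_s(FULL, 0.0250),
    "Write all (glTexSubImage2D)": transfer_rate_mb_s(FULL, 0.0120),
    "Write bdr (glTexSubImage2D)": transfer_rate_mb_s(BOUNDARY, 0.0250),
}

if __name__ == "__main__":
    for name, rate in rates.items():
        print(f"{name}: {rate:.1f} MB/s")
```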
Conclusions
• GPUs have a higher rate of performance increase over time than CPUs
- always appealing as "research for the future"
• In certain applications GPUs are 30 to 60 times faster than CPUs for low-precision computations (depending on configuration)
• For certain floating-point applications GPU and CPU performance is comparable
- can be used as coprocessor
• GPUs are often constrained in memory, but preliminary results show it is feasible to use GPUs in parallel
• Cg is a convenient tool (but cgc could be optimized)
• It is feasible to use GPUs for numerical simulations
- we demonstrated it by implementing 2 models (with many applications), and
- used the implementation in benchmarking NV30 and Cg