Mapping Computational Concepts to GPUs


General-Purpose Computation on
Graphics Hardware
David Luebke
University of Virginia
Course Introduction
• The GPU on commodity video cards has evolved
into an extremely flexible and powerful processor
– Programmability
– Precision
– Power
• We are interested in how to harness that power for
general-purpose computation
Motivation:
Computational Power
• GPUs are fast…
– 3 GHz Pentium 4 theoretical: 6 GFLOPS, 5.96 GB/sec peak
– GeForce FX 5900 observed: 20 GFLOPS, 25.3 GB/sec peak
– GeForce 6800 Ultra observed: 53 GFLOPS, 35.2 GB/sec peak
– GeForce 8800 GTX: estimated at 520 GFLOPS, 86.4 GB/sec peak
• That's almost 100 times faster than a 3 GHz Pentium 4!
• GPUs are getting faster, faster
– CPUs: annual growth ~1.5× → decade growth ~60×
– GPUs: annual growth > 2.0× → decade growth > 1000×
Courtesy Kurt Akeley,
Ian Buck & Tim Purcell, GPU Gems (see course notes)
Motivation:
Computational Power
[Chart: GPU vs. CPU performance growth over time (courtesy Naga Govindaraju)]
Motivation:
Computational Power
[Chart: programmable GFLOPS (multiplies per second), July 2001 to January 2004, for NVIDIA NV30/35/40 and ATI R300/360/420 vs. Pentium 4 (courtesy Ian Buck)]
An Aside:
Computational Power
• Why are GPUs getting faster so fast?
– Arithmetic intensity: the specialized nature of GPUs makes it easier to use additional transistors for computation rather than cache
– Economics: multi-billion dollar video game market is a
pressure cooker that drives innovation
Motivation:
Flexible and precise
• Modern GPUs are deeply programmable
– Programmable pixel, vertex, video engines
– Solidifying high-level language support
• Modern GPUs support high precision
– 32 bit floating point throughout the pipeline
– High enough for many (not all) applications
Motivation:
The Potential of GPGPU
• The power and flexibility of GPUs makes them an
attractive platform for general-purpose computation
• Example applications range from in-game physics
simulation to conventional computational science
• Goal: make the inexpensive power of the GPU
available to developers as a sort of computational
coprocessor
The Problem:
Difficult To Use
• GPUs designed for and driven by video games
– Programming model is unusual & tied to computer graphics
– Programming environment is tightly constrained
• Underlying architectures are:
– Inherently parallel
– Rapidly evolving (even in basic feature set!)
– Largely secret
• Can’t simply “port” code written for the CPU!
GPU Fundamentals:
The Graphics Pipeline
[Diagram: Application (CPU, issues graphics state) → Transform → Rasterizer → Shade → Video Memory (Textures). Data flow: vertices (3D) → transformed, lit vertices (2D) → fragments (pre-pixels) → final pixels (color, depth); render-to-texture feeds results back into texture memory. Everything from Transform onward runs on the GPU.]
• A simplified graphics pipeline
– Note that pipe widths vary
– Many caches, FIFOs, and so on not shown
GPU Fundamentals:
The Modern Graphics Pipeline
[Diagram: Application (CPU, issues graphics state) → Vertex Processor → Rasterizer → Fragment Processor → Video Memory (Textures). Data flow: vertices (3D) → transformed, lit vertices (2D) → fragments (pre-pixels) → final pixels (color, depth); render-to-texture feeds results back into texture memory.]
• Programmable vertex processor!
• Programmable fragment (pixel) processor!
GPU Pipeline:
Transform
• Vertex Processor (multiple operate in parallel)
– Transform from “world space” to “image space”
– Compute per-vertex lighting
GPU Pipeline:
Rasterizer
• Rasterizer
– Convert geometric rep. (vertex) to image rep. (fragment)
• Fragment = image fragment
– Pixel + associated data: color, depth, stencil, etc.
– Interpolate per-vertex quantities across pixels
GPU Pipeline: Shade
• Fragment Processors (multiple in parallel)
– Compute a color for each pixel
– Optionally read colors from textures (images)
Importance of Data Parallelism
• GPU: Each vertex / fragment is independent
– Temporary registers are zeroed
– No static data
– No read-modify-write buffers
• Data parallel processing
– Best for ALU-heavy architectures: GPUs
• Multiple vertex & pixel pipelines
– Hide memory latency (with more computation)
Courtesy of Ian Buck
Arithmetic Intensity
• Lots of ops per word transferred
• GPGPU demands high arithmetic intensity for peak
performance
– Ex: solving systems of linear equations
– Physically-based simulation on lattices
– All-pairs shortest paths
Courtesy of Pat Hanrahan
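As a rough CPU-side illustration (hypothetical functions, not course code): the first loop below does about two arithmetic operations per three words of memory traffic and is bandwidth-bound, while the second performs dozens of operations per word read, the compute-bound regime where GPUs excel.

// Low arithmetic intensity: ~2 ops per 3 words of memory traffic.
void saxpy(int n, float a, const float* x, float* y) {
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

// High arithmetic intensity: ~64 ops per word read.
void evalPoly(int n, const float* x, float* y) {
    for (int i = 0; i < n; ++i) {
        float r = 0.0f;
        for (int k = 0; k < 32; ++k)
            r = r * x[i] + 1.0f;      // 32 multiply-adds per element
        y[i] = r;
    }
}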
Data Streams & Kernels
• Streams
– Collection of records requiring similar computation
• Vertex positions, Voxels, FEM cells, etc.
– Provide data parallelism
• Kernels
– Functions applied to each element in stream
• Transforms, PDE, …
– No dependencies between stream elements
• Encourage high arithmetic intensity
Courtesy of Ian Buck
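A minimal CPU-side sketch of the stream/kernel idea, using hypothetical names (Particle, advect, runKernel) rather than anything from the course:

// "Stream": an array of records. "Kernel": a function applied to every
// element with no inter-element dependencies, so iterations are independent.
struct Particle { float x, y, z; };

void advect(Particle& p) {                 // the kernel body
    p.x += 0.1f; p.y += 0.1f; p.z += 0.1f;
}

void runKernel(Particle* stream, int n) {  // conceptually a parallel for_all
    for (int i = 0; i < n; ++i)
        advect(stream[i]);
}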
Example: Simulation Grid
• Common GPGPU computation style
– Textures represent computational grids = streams
• Many computations map to grids
– Matrix algebra
– Image & Volume processing
– Physical simulation
– Global Illumination
• ray tracing, photon mapping, radiosity
• Non-grid streams can be mapped to grids
Stream Computation
• Grid Simulation algorithm
– Made up of steps
– Each step updates entire grid
– Must complete before next step can begin
• Grid is a stream, steps are kernels
– Kernel applied to each stream element
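A CPU sketch of this pattern under illustrative assumptions (N, stepKernel, and the decay-style update rule are placeholders): each step reads the whole source grid and writes a destination grid, and swapping the buffers mirrors the GPU's read-from-one-texture, render-into-another arrangement.

#include <utility>   // std::swap

const int N = 256;   // grid resolution (illustrative)

// One simulation step: reads the whole source grid, writes the destination.
void stepKernel(const float* src, float* dst) {
    for (int i = 0; i < N * N; ++i)
        dst[i] = 0.99f * src[i];          // placeholder update rule
}

// Each step must finish before the next begins; swapping buffers is the
// CPU analogue of ping-ponging between two textures.
void simulate(float* gridA, float* gridB, int steps) {
    for (int s = 0; s < steps; ++s) {
        stepKernel(gridA, gridB);
        std::swap(gridA, gridB);
    }
}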
Scatter vs. Gather
• Grid communication
– Grid cells share information
– Gather reads values from other cells; scatter writes values to other cells (see the sketch below)
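A plain C++ sketch of the two communication patterns (function and variable names are illustrative):

// Gather: read from a computed address (fragment processors can do this
// via texture fetches).
void gather(const float* a, const int* idx, float* out, int n) {
    for (int k = 0; k < n; ++k)
        out[k] = a[idx[k]];
}

// Scatter: write to a computed address (fragment processors cannot;
// the vertex processor can, by moving vertices).
void scatter(float* a, const int* idx, const float* in, int n) {
    for (int k = 0; k < n; ++k)
        a[idx[k]] = in[k];
}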
Computational Resources Inventory
• Programmable parallel processors
– Vertex & Fragment pipelines
• Rasterizer
– Mostly useful for interpolating addresses (texture coordinates)
and per-vertex constants
• Texture unit
– Read-only memory interface
• Render to texture
– Write-only memory interface
Vertex Processor
• Fully programmable (SIMD / MIMD)
• Processes 4-vectors (RGBA / XYZW)
• Capable of scatter but not gather
– Can change the location of current vertex
– Cannot read info from other vertices
– Can only read a small constant memory
• Future hardware enables gather!
– Vertex textures
Fragment Processor
• Fully programmable (SIMD)
• Processes 4-vectors (RGBA / XYZW)
• Random access memory read (textures)
• Capable of gather but not scatter
– No random access memory writes
– Output address fixed to a specific pixel
• Typically more useful than vertex processor
– More fragment pipelines than vertex pipelines
– RAM read
– Direct output
CPU-GPU Analogies
• CPU programming is familiar
– GPU programming is graphics-centric
• Analogies can aid understanding
CPU-GPU Analogies
CPU                              GPU
Stream / Data Array      =      Texture
Memory Read              =      Texture Sample
CPU-GPU Analogies
CPU                                       GPU
Loop body / kernel / algorithm step  =  Fragment Program
Feedback
• Each algorithm step depends on the results of previous steps
• Each time step depends on the results of the previous time step
CPU-GPU Analogies
CPU                              GPU
…
Grid[i][j] = x;                  Array Write  =  Render to Texture
…
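One way to realize "array write = render to texture" in OpenGL is sketched below using the EXT_framebuffer_object extension (an illustrative assumption; the course era also used pbuffers). gridW, gridH, stateTex, and fbo are made-up names, and GL headers plus extension loading are assumed.

int gridW = 256, gridH = 256;              // grid resolution (illustrative)
GLuint stateTex, fbo;

// A floating-point texture acts as the writable "array".
glGenTextures(1, &stateTex);
glBindTexture(GL_TEXTURE_2D, stateTex);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F_ARB, gridW, gridH, 0,
             GL_RGBA, GL_FLOAT, NULL);

// Attach it to a framebuffer object; subsequent draws write into the texture.
glGenFramebuffersEXT(1, &fbo);
glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fbo);
glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT0_EXT,
                          GL_TEXTURE_2D, stateTex, 0);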
GPU Simulation Overview
• Analogies lead to implementation
– Algorithm steps are fragment programs
• Computational “kernels”
– Current state variables stored in textures
– Feedback via render to texture
• One question: how do we invoke
computation?
Invoking Computation
• Must invoke computation at each pixel
– Just draw geometry!
– Most common GPGPU invocation is a full-screen quad
Typical “Grid” Computation
• Initialize “view” (so that pixels:texels::1:1)
glMatrixMode(GL_MODELVIEW);
glLoadIdentity();
glMatrixMode(GL_PROJECTION);
glLoadIdentity();
glOrtho(0, 1, 0, 1, 0, 1);
glViewport(0, 0, outTexResX, outTexResY);
• For each algorithm step:
– Activate render-to-texture
– Set up input textures and the fragment program
– Draw a full-screen quad (1x1)
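As a minimal sketch of the draw call for one step, assuming the glOrtho(0, 1, 0, 1, 0, 1) view above and immediate-mode GL for brevity: the unit quad covers each output texel exactly once, so the bound fragment program runs once per grid cell.

glBegin(GL_QUADS);
glTexCoord2f(0.0f, 0.0f); glVertex2f(0.0f, 0.0f);
glTexCoord2f(1.0f, 0.0f); glVertex2f(1.0f, 0.0f);
glTexCoord2f(1.0f, 1.0f); glVertex2f(1.0f, 1.0f);
glTexCoord2f(0.0f, 1.0f); glVertex2f(0.0f, 1.0f);
glEnd();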
Branching Techniques
• Fragment program branches can be expensive
– No true fragment branching on NV3X & R3X0
• Better to move decisions up the pipeline
– Replace with math
– Occlusion Query
– Domain decomposition
– Z-cull
– Pre-computation
Branching with OQ
• Use it for iteration termination
Do { // outer loop on CPU
    BeginOcclusionQuery {
        // Render with fragment program that discards
        // fragments that satisfy termination criteria
    } EndQuery
} While query returns > 0
• Can be used for subdivision techniques
– Demo later
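In OpenGL 1.5 terms the loop above might look like the sketch below (the ARB_occlusion_query extension uses the same calls with an ARB suffix); drawFullScreenQuad() is a placeholder helper, and the bound fragment program is assumed to discard fragments that meet the termination criterion.

GLuint query, livePixels = 0;
glGenQueries(1, &query);
do {                                             // outer loop on the CPU
    glBeginQuery(GL_SAMPLES_PASSED, query);
    drawFullScreenQuad();                        // kernel pass; converged fragments discarded
    glEndQuery(GL_SAMPLES_PASSED);
    glGetQueryObjectuiv(query, GL_QUERY_RESULT, &livePixels);
} while (livePixels > 0);                        // stop once every fragment has terminated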
Domain Decomposition
• Avoid branches where outcome is fixed
– One region is always true, another false
– Separate FPs for each region, no branches
• Example: boundary vs. interior cells (see the sketch below)
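A GL-style sketch of the idea, with bindProgram, drawInteriorQuad, and drawBoundaryCells as placeholder helpers rather than course code: each region is drawn with a fragment program specialized for it, so no per-fragment branch is needed.

// Interior cells: the boundary test is always false, so use a program
// with the test compiled out.
bindProgram(interiorProgram);
drawInteriorQuad();            // quad covering only interior cells

// Boundary cells: the test is always true, so use the boundary program.
bindProgram(boundaryProgram);
drawBoundaryCells();           // thin quads/lines along the domain edges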
Z-Cull
• In early pass, modify depth buffer
– Write depth=0 for pixels that should not be modified by later
passes
– Write depth=1 for rest
• Subsequent passes
– Enable depth test (GL_LESS)
– Draw full-screen quad at z=0.5
– Only pixels with previous depth=1 will be processed
• Can also use early stencil test
• Not available on NV3X
– Depth replace disables ZCull
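A sketch of these passes with placeholder helpers (bindProgram, drawFullScreenQuad, drawFullScreenQuadAtDepth): the first pass lays down the 0/1 depth mask; later passes draw at z = 0.5 with GL_LESS so only pixels whose stored depth is 1 are shaded, and z-cull can reject the masked regions before the fragment program runs.

// Pass 1: write the mask into the depth buffer (0 = skip, 1 = process).
glEnable(GL_DEPTH_TEST);
glDepthFunc(GL_ALWAYS);
glDepthMask(GL_TRUE);
bindProgram(maskProgram);            // outputs depth 0 or 1 per pixel
drawFullScreenQuad();

// Later passes: depth writes off so the mask survives multiple passes.
glDepthFunc(GL_LESS);
glDepthMask(GL_FALSE);
bindProgram(simulationProgram);
drawFullScreenQuadAtDepth(0.5f);     // 0.5 < 1 passes; 0.5 < 0 fails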
Pre-computation
• Pre-compute anything that will not change every
iteration!
• Example: arbitrary boundaries
– When user draws boundaries, compute texture containing
boundary info for cells
– Reuse that texture until boundaries modified
– Combine with Z-cull for higher performance!
High-level Shading Languages
• Cg, HLSL, & GLSlang
– Cg:
• http://www.nvidia.com/cg
– HLSL:
• http://msdn.microsoft.com/library/default.asp?url=/library/enus/directx9_c/directx/graphics/reference/highlevellanguageshaders.asp
– GLSlang:
• http://www.3dlabs.com/support/developer/ogl2/whitepapers/index.html
• Example: a high-level expression such as
float3 cPlastic = Cd * (cAmbi + cDiff) + Cs * cSpec …
compiles to low-level shader assembly:
MULR R0.xyz, R0.xxx, R4.xyxx;
MOVR R5.xyz, -R0.xyzz;
DP3R R3.x, R0.xyzz, R3.xyzz;
…
GPGPU Languages
• Why do we want them?
– Make programming GPUs easier!
• Don’t need to know OpenGL, DirectX, or ATI/NV extensions
• Simplify common operations
• Focus on the algorithm, not on the implementation
• Sh
– University of Waterloo
– http://libsh.sourceforge.net
– http://www.cgl.uwaterloo.ca
• Brook
– Stanford University
– http://brook.sourceforge.net
– http://graphics.stanford.edu
Brook: general purpose streaming
language
• stream programming model
– enforce data parallel computing
• streams
– encourage arithmetic intensity
• kernels
• C with stream extensions
• GPU = streaming coprocessor
system outline
.br Brook source files → brcc (source-to-source compiler) → brt (Brook run-time library)
Brook language
streams
• streams
– collection of records requiring similar computation
• particle positions, voxels, FEM cell, …
float3 positions<200>;
float3 velocityfield<100,100,100>;
– similar to arrays, but…
• index operations disallowed: positions[i]
• read/write stream operators
streamRead (positions, p_ptr);
streamWrite (velocityfield, v_ptr);
– encourage data parallelism
Brook language
kernels
• kernels
– functions applied to streams
• similar to for_all construct
kernel void foo (float a<>, float b<>,
                 out float result<>) {
    result = a + b;
}

float a<100>;
float b<100>;
float c<100>;
foo(a, b, c);

// equivalent C loop:
for (i=0; i<100; i++)
    c[i] = a[i] + b[i];
Brook language
kernels
• kernels
– functions applied to streams
• similar to for_all construct
kernel void foo (float a<>, float b<>,
                 out float result<>) {
    result = a + b;
}
– no dependencies between stream elements
– encourage high arithmetic intensity
Brook language
kernels
• kernel arguments
– input/output streams
– constant parameters
– gather streams
– iterator streams

kernel void foo (float a<>, float b<>,
                 float t, float array[],
                 iter float n<>,
                 out float result<>) {
    result = array[a] + t*b + n;
}

float a<100>;
float b<100>;
float c<100>;
float array<25>;
iter float n<100> = iter(0, 10);
foo(a, b, 3.2f, array, n, c);
Brook language
kernels
• ray triangle intersection
kernel void krnIntersectTriangle(Ray ray<>, Triangle tris[],
                                 RayState oldraystate<>,
                                 GridTrilist trilist[],
                                 out Hit candidatehit<>) {
    float idx, det, inv_det;
    float3 edge1, edge2, pvec, tvec, qvec;
    if (oldraystate.state.y > 0) {
        idx = trilist[oldraystate.state.w].trinum;
        edge1 = tris[idx].v1 - tris[idx].v0;
        edge2 = tris[idx].v2 - tris[idx].v0;
        pvec = cross(ray.d, edge2);
        det = dot(edge1, pvec);
        inv_det = 1.0f / det;
        tvec = ray.o - tris[idx].v0;
        candidatehit.data.y = dot(tvec, pvec) * inv_det;
        qvec = cross(tvec, edge1);
        candidatehit.data.z = dot(ray.d, qvec) * inv_det;
        candidatehit.data.x = dot(edge2, qvec) * inv_det;
        candidatehit.data.w = idx;
    } else {
        candidatehit.data = float4(0, 0, 0, -1);
    }
}
Brook language
reductions
• reductions
– compute single value from a stream
reduce void sum (float a<>,
                 reduce float r<>) {
    r += a;
}

float a<100>;
float r;
sum(a, r);

// equivalent C loop:
r = a[0];
for (int i=1; i<100; i++)
    r += a[i];
Brook language
reductions
• reductions
– associative operations only
(a+b)+c = a+(b+c)
• sum, multiply, max, min, OR, AND, XOR
• matrix multiply
Brook language
reductions
• multi-dimension reductions
– stream “shape” differences resolved by reduce function
reduce void sum (float a<>,
                 reduce float r<>) {
    r += a;
}

float a<20>;
float r<5>;
sum(a, r);

// equivalent C loops:
for (int i=0; i<5; i++) {
    r[i] = a[i*4];
    for (int j=1; j<4; j++)
        r[i] += a[i*4 + j];
}
Brook language
stream repeat & stride
• kernel arguments of different shape
– resolved by repeat and stride
kernel void foo (float a<>, float b<>,
                 out float result<>);

float a<20>;
float b<5>;
float c<10>;
foo(a, b, c);

// equivalent to:
foo(a[0],  b[0], c[0])
foo(a[2],  b[0], c[1])
foo(a[4],  b[1], c[2])
foo(a[6],  b[1], c[3])
foo(a[8],  b[2], c[4])
foo(a[10], b[2], c[5])
foo(a[12], b[3], c[6])
foo(a[14], b[3], c[7])
foo(a[16], b[4], c[8])
foo(a[18], b[4], c[9])
Brook language
matrix vector multiply
kernel void mul (float a<>, float b<>,
                 out float result<>)
{ result = a * b; }

reduce void sum (float a<>,
                 reduce float result<>)
{ result += a; }

float matrix<20,10>;
float vector<1,10>;
float tempmv<20,10>;
float result<20,1>;

mul(matrix, vector, tempmv);
sum(tempmv, result);

[Diagram: matrix M times the row-replicated vector V gives the temporary T]
Brook language
matrix vector multiply
(same kernels and stream declarations as above)

mul(matrix, vector, tempmv);
sum(tempmv, result);

[Diagram: the temporary T is reduced by sum to the result R]
Running Brook
• Compiling .br files
Brook CG Compiler
Version: 0.2 Built: Apr 24 2004, 18:11:59
brcc [-hvndktyAN] [-o prefix] [-w workspace] [-p shader ] foo.br
-h            help (print this message)
-v            verbose (print intermediate generated code)
-n            no codegen (just parse and reemit the input)
-d            debug (print cTool internal state)
-k            keep generated fragment program (in foo.cg)
-t            disable kernel call type checking
-y            emit code for ATI 4-output hardware
-A            enable address virtualization (experimental)
-N            deny support for kernels calling other kernels
-o prefix     prefix prepended to all output files
-w workspace  workspace size (16 - 2048, default 1024)
-p shader     cpu / ps20 / fp30 / cpumt (can specify multiple)
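For example, a hypothetical invocation brcc -p fp30 -o saxpy saxpy.br would compile saxpy.br for the NV3x fragment-program target and prefix the generated files with saxpy (the file names here are illustrative; the flags are those listed above).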
Running Brook
• BRT_RUNTIME selects platform
CPU Backend:               BRT_RUNTIME = cpu
CPU Multithreaded Backend: BRT_RUNTIME = cpumt
NVIDIA NV30 Backend:       BRT_RUNTIME = nv30
OpenGL ARB Backend:        BRT_RUNTIME = arb
DirectX9 Backend:          BRT_RUNTIME = dx9
Runtime
• accessing stream data for graphics apps
– Brook runtime API available in C++ code
– auto-generated .hpp files for Brook code
brook::initialize( "dx9", (void*)device );
// Create streams
fluidStream0 = stream::create<float4>( kFluidSize, kFluidSize );
normalStream = stream::create<float3>( kFluidSize, kFluidSize );
// Get a handle to the texture being used by
// the normal stream as a backing store
normalTexture = (IDirect3DTexture9*)
normalStream->getIndexedFieldRenderData(0);
// Call the simulation kernel
simulationKernel( fluidStream0, fluidStream0, controlConstant,
fluidStream1 );
Applications
• Includes lots of sample applications
– Ray tracer
– FFT
– Image segmentation
– Linear algebra
Brook performance
• 2-3× faster than the CPU implementation
[Chart: ATI Radeon 9800 XT and NVIDIA GeForce 6800 results]
• Compared against a 3 GHz P4 using:
– Intel Math Library
– FFTW
– Custom cache-blocked segment C code
• GPUs still lose to SSE cache-friendly code and super-optimized libraries (ATLAS, FFTW)
Brook for GPUs
• Release v0.3 available on Sourceforge
• Project Page
– http://graphics.stanford.edu/projects/brook
• Source
– http://www.sourceforge.net/projects/brook
• Over 4K downloads!
• Brook for GPUs: Stream Computing on Graphics Hardware
– Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon
Fatahalian, Mike Houston, Pat Hanrahan
GPGPU Examples
• Fluids
• Reaction/Diffusion
• Multigrid
• Tone mapping