A Survey of General-Purpose Computation on Graphics Hardware

Download Report

Transcript A Survey of General-Purpose Computation on Graphics Hardware

GPGPU
CS 446: Real-Time Rendering
& Game Technology
David Luebke
University of Virginia
Demo
• Today: Matthew Rodgers
• That’s it!
2
David Luebke
News
• New job: NVIDIA Research
• New e-mail: [email protected]
3
David Luebke
Final Topic: GPGPU
• GPGPU: “General Purpose GPU Computing”
• Active, exciting area of research and development
• A personal interest of mine
• Following slides taken from a recent talk I
co-presented in Dublin
– Accompanying this paper
4
David Luebke
A Survey of General-Purpose
Computation on Graphics
Hardware
John Owens
David Luebke
University of California, Davis
University of Virginia
with Naga Govindaraju, Mark Harris, Jens Krüger, Aaron Lefohn, Tim Purcell
Introduction
• The graphics processing unit (GPU) on
commodity video cards has evolved into an
extremely flexible and powerful processor
• Programmability
• Precision
• Power
• GPGPU: an emerging field seeking to harness
GPUs for general-purpose computation
6
Motivation: Computational Power
• GPUs are fast…
• 3.0 GHz dual-core Pentium4: 24.6 GFLOPS
• NVIDIA GeForceFX 7800: 165 GFLOPs
• 1066 MHz FSB Pentium Extreme Edition : 8.5 GB/s
• ATI Radeon X850 XT Platinum Edition: 37.8 GB/s
• GPUs are getting faster, faster
• CPUs: 1.4× annual growth
• GPUs: 1.7×(pixels) to 2.3× (vertices) annual growth
7
Motivation: Computational Power
Thanks to Ian Buck
8
Motivation: Computational Power
Thanks to John Owens
9
Motivation: Flexible and Precise
• Modern GPUs are deeply programmable
• Programmable pixel, vertex, video engines
• Solidifying high-level language support
• Modern GPUs support high precision
• 32 bit floating point throughout the pipeline
• High enough for many (not all) applications
10
Motivation: The Potential of GPGPU
• In short:
• The power and flexibility of GPUs makes them an
attractive platform for general-purpose computation
• Example applications range from in-game physics
simulation to conventional computational science
• Goal: make the inexpensive power of the GPU
available to developers as a sort of computational
coprocessor
11
Problems: Difficult To Use
• GPUs designed for & driven by video games
• Programming model unusual
• Programming idioms tied to computer graphics
• Programming environment tightly constrained
• Underlying architectures are:
• Inherently parallel
• Rapidly evolving (even in basic feature set!)
• Largely secret
• Can’t simply “port” CPU code!
12
Problems: Not A Panacea
• GPUs are fast because they are specialized
• Poorly suited to sequential, “pointer-chasing” code
• Missing support for some basic functionality
• E.g. integers, bitwise operations, indexed write
• More on limitations & difficulties later
13
STAR Goals
• Detailed & useful survey of general-purpose
computing on graphics hardware
• Hardware and software developments behind GPGPU
• Building blocks: techniques for mapping generalpurpose computation to the GPU
• Applications: important applications of GPGPU
• A comprehensive GPGPU bibliography (current
through Summer 2005…)
14
Architectural Considerations
The past: 1987
20 MIPS CPU
1987
[courtesy Anant Agarwal]
16
The future: 2007
1 Billion Transistors
2007
[courtesy Anant Agarwal]
17
Today’s VLSI Capability: Keys to High Performance
0.5mm
64-bit FPU
(to scale)
50pJ/FLOP
90nm Chip
$200
1GHz
1 clock
12mm
[courtesy Pat Hanrahan]
18
For High Performance, We Must …
0.5mm
64-bit FPU
(to scale)
1. Exploit
50pJ/FLOP
Ample
Computation!
90nm Chip
$200
1GHz
1 clock
12mm
[courtesy Pat Hanrahan]
19
For High Performance, We Must …
0.5mm
64-bit FPU
(to scale)
1. Exploit
50pJ/FLOP
Ample
Computation!
90nm Chip
$200
1GHz
1 clock
2. Require
Efficient
Communication!
12mm
[courtesy Pat Hanrahan]
20
Stream Programming Abstraction
• Let’s think about our problem in a new way
• Goal: SW programming model that matches today’s VLSI
• Streams
• Collection of data records
• All data is expressed in streams
• Kernels
stream
kernel
stream
• Inputs/outputs are streams
• Perform computation on streams
• Can be chained together
21
Why Streams?
• Ample computation by exposing parallelism
• Streams expose data parallelism
• Multiple stream elements can be processed in parallel
• Pipeline (task) parallelism
• Multiple tasks can be processed in parallel
• Kernels yield high arithmetic intensity
• Efficient communication
• Producer-consumer locality
• Predictable memory access pattern
• Optimize for throughput of all elements, not latency of one
• Processing many elements at once allows latency hiding
22
GPU: Special-Purpose Graphics Hardware
• Task-parallel
organization
• Assign each task to
processing unit
• Hardwire each unit to
specific task - huge
performance advantage!
• Provides ample
computation resources
• Efficient communication
patterns
• Dominant graphics
architecture
[ATI Flipper – 51M T]
23
The Rendering Pipeline
Application
Compute 3D geometry
Make calls to graphics API
Transform geometry from 3D to
2D (in parallel)
Geometry
Rasterization
Generate fragments from 2D
geometry (in parallel)
Combine fragments into image
Composite
GPU
24
The Programmable Rendering Pipeline
Application
Geometry
Compute 3D geometry
Make calls to graphics API
Transform geometry from 3D to
2D; vertex programs
(Vertex)
Rasterization
(Fragment)
Generate fragments from 2D
geometry; fragment programs
Composite
Combine fragments into image
GPU
25
NVIDIA GeForce 6800 3D Pipeline
Vertex
Triangle Setup
Z-Cull
Shader Instruction Dispatch
Fragment
L2 Tex
Fragment Crossbar
Memory
Partition
Memory
Partition
Composite
Memory
Partition
Memory
Partition
Courtesy Nick Triantos, NVIDIA
26
Characteristics of Graphics Apps
•
•
•
•
•
•
Lots of arithmetic
Lots of parallelism
Simple control
Multiple stages
Feed forward pipelines
Latency-tolerant / deep
pipelines
• What other applications
have these characteristics?
Application
Command
Geometry
Rasterization
Fragment
Composite
Display
27
Today’s Graphics Pipeline
Application
Command
Geometry
Rasterization
Fragment
Composite
Display
• Graphics is well suited to:
• The stream programming model
• Stream hardware organizations
• GPUs are a commodity stream
processor!
• What if we could apply these
techniques to more generalpurpose problems?
• GPUs should excel at tasks that
require:
• Ample computation
• Regular computation
• Efficient communication
28
Programming a GPU for Graphics
• Application specifies
geometry  rasterized
• Each fragment is
shaded
w/
SIMD program
• Shading can use values
from texture memory
• Image can be used as
texture on future
passes
29
Programming a GPU for GP Programs
• Draw a screen-sized
quad  stream
• Run a SIMD kernel over
each fragment
• “Gather” is permitted
from texture memory
• Resulting buffer can be
treated as texture on
next pass
30
Feedback
• Each algorithm step depend on
the results of previous steps
• Each time step depends on the
results of the previous time
step
31
CPU-GPU Analogies
.
CPU
.
.
Grid[i][j]= x;
.
.
.
Array Write
GPU
=
Render to Texture
32
CPU-GPU Analogies
CPU
Stream / Data Array =
Memory Read
=
GPU
Texture
Texture Sample
33
Kernels
CPU
GPU
Kernel / loop body / algorithm step = Fragment Program
34
Scatter vs. Gather
• Grid communication
• Grid cells share information
35
Computational Resources Inventory
• Programmable parallel processors
• Vertex & Fragment pipelines
• Rasterizer
• Mostly useful for interpolating addresses (texture
coordinates) and per-vertex constants
• Texture unit
• Read-only memory interface
• Render to texture
• Write-only memory interface
36
Vertex Processor
• Fully programmable (SIMD / MIMD)
• Processes 4-vectors (RGBA / XYZW)
• Capable of scatter but not gather
• Can change the location of current vertex
• Cannot read info from other vertices
• Can only read a small constant memory
• Latest GPUs: Vertex Texture Fetch
• Random access memory for vertices
• Gather (But not from the vertex stream itself)
37
Fragment Processor
•
•
•
•
Fully programmable (SIMD)
Processes 4-component vectors (RGBA / XYZW)
Random access memory read (textures)
Capable of gather but not scatter
• RAM read (texture fetch), but no RAM write
• Output address fixed to a specific pixel
• Paper details ways to synthesize scatter
• Typically more useful than vertex processor
• More fragment pipelines than vertex pipelines
• Direct output (fragment processor is at end of pipeline)
38
Building Blocks & Applications
GPGPU Building Blocks
• The STAR covers the following fundamental
techniques & computational building blocks:
• Flow control (a very fundamental building block)
• Stream operations
• Data structures
• Differential equations & linear algebra
• Data queries
• I’ll discuss each in a bit more detail
40
Flow control
• Surprising number of issues on GPUs
• Main themes:
• Avoid branching when possible
• Move branching earlier in the pipeline when possible
• Largely SIMD  coherent branching most efficient
• Mechanisms:
• Rasterized geometry
• Z-cull
• Occlusion query
41
Domain Decomposition
• Avoid branches where outcome is fixed
• One region is always true, another false
• Separate FPs for each region, no branches
• Example:
boundaries
42
Z-Cull
• In early pass, modify depth buffer
• Write depth=0 for pixels that should not be modified
by later passes
• Write depth=1 for rest
• Subsequent passes
• Enable depth test (GL_LESS)
• Draw full-screen quad at z=0.5
• Only pixels with previous depth=1 will be processed
• Can also use early stencil test
• Note: Depth replace disables ZCull
43
Pre-computation
• Pre-compute anything that will not change
every iteration!
• Example: arbitrary boundaries
• When user draws boundaries, compute texture
containing boundary info for cells
• Reuse that texture until boundaries modified
• Combine with Z-cull for higher performance!
44
Stream Operations
• Several stream operations in GPGPU toolkit:
• Map: apply a function to every element in a stream
• Reduce: use a function to reduce a stream to a
smaller stream (often 1 element)
• Scatter/gather: indirect read and write
• Filter: select a subset of elements in a stream
• Sort: order elements in a stream
• Search: find a given element, nearest neighbors, etc
45
Data Structures
• GPU memory model, iteration, virtualization
• Dense arrays (== textures)
• Sparse arrays
• Static sparsity: pack rows or bands into textures,
vertex arrays
• Dynamic sparsity: multidimensional page tables
46
Data Structures
• Adaptive data structures
• Static: packed uniform grid, stackless k-d tree
• Dynamic: mipmap of page tables, tree-based address
translators
• Non-indexable structures
• Open & important problem: stacks, priority queues,
sets, linked lists, hash tables, k-d tree construction…
47
Differential Equations
• Early & common application of GPGPU
• Ordinary differential equations (ODEs): commonly
used in particle systems
• Partial differential equations (PDEs): well suited to
GPU, especially when solved on dense grids
• Closely related to…
48
Linear Algebra
• Major application of GPGPU
• Ubiquitous in science, engineering, visual simulation
• Several approaches explored in early GPGPU work
• Memory layout a key consideration
• Pack vectors into 2D textures
• Split matrices, e.g. by columns (dense),
bands (banded), into vertex array (sparse)
49
Data Queries
• Relational database operations on the GPU
• Predicates, Boolean combinations, aggregations
• Join operations accelerated by sorting on join key
• Uses special-purpose depth & stencil
hardware extensively
• Attributes of data record stored in (multiple)
color channels of (multiple) textures
50
Applications
• STAR discusses the following broad GPGPU
application areas:
• Physically-based simulation
• Signal & image processing
• Global illumination
• Geometric computing
• Databases & data mining
51
Physically-Based Simulation
• Early work (pre-fully programmable GPU):
• Cellular automata, texture & blending modes
• Now
• Finite difference & finite element methods
• Particle system simulations
• Most popular topic
• Fluids! Incompressible flow, clouds, boiling, etc.
• Also-ran: mass-spring dynamics for cloth
52
Signal & Image Processing
• Segmentation: identify features in 2D or 3D
• Common example: identify surface of 3D feature
(tumor, blood vessel, etc) in medical scan
• Level-set deformation methods evolve an isosurface
• Need sparse solution methods for efficient solution
• GPGPU bonus: easy to integrate in volume renderer
• Computer vision:
• Image projection, compositing, rectification
• Fast stereo depth extraction
53
Signal & Image Processing cont.
• Image processing
• Image registration, motion reconstruction, computed
tomography (CT), tone mapping
• Core Image, Core Video
• Signal processing
• FFT: an interesting case! Memory bandwidth limited,
lack of writable cache harms performance
54
Global Illumination
• Ray tracing
• Seminal GPGPU papers
• Close to the heart of graphics
• Earliest complex data structures, control flow
• Analysis to inform future hardware design
• Comparison to current efficient CPU implementation
• Key insights
• Ray-triangle intersection maps well to pixel hardware
• Rest of the ray tracing pipeline can also be expressed
as a stream computation
55
Global Illumination cont.
• Photon mapping
• Even more involved data structures
• Introduced stencil routing (scatter), k-nearest
neighbor search
• Radiosity
• Iterative LA solvers
• Subsurface scattering
• Lots of work on using GPUs to accelerate/
approximate scattering algorithms
56
Geometric Computing
• Lots of image-space geometric computations
• CSG operations
• Distance fields
• Collision detection
• Sorting for transparency
• Shadow generation
• Heavy use of stencil & depth hardware
57
Databases & Data Mining
• GPU strengths are useful
• Memory bandwidth
• Parallel processing
• Accelerating SQL queries – 10x improvement
• Also well suited for stream mining
• Continuous queries on streaming data instead of
one-time queries on static database
58
Close-Up: Linear Algebra
Linear Algebra Data Structures
Vector representation
– Textures
2D Textures
RGBAbest
textures
are
weeven
can
really
do
better
rock
• Per-pixel vs. per-vertex operations
– 6 gigapixels/second vs. 0.7 gigavertices/second
– Efficient texture memory cache
– Texture read-write access
1
1
N
N
60
For details go: http://wwwcg.in.tum.de
Representation (cont.)
Dense Matrix representation
– Treat a dense matrix as a set of column vectors
– Again, store these vectors as 2D textures
i
N Vectors
Matrix
N 2D-Textures
N
... ...
N
N
1
1
i
i
...
...
N
N
61
For details go: http://wwwcg.in.tum.de
Representation (cont.)
Banded Sparse Matrix representation
– Treat a banded matrix as a set of diagonal vectors
– Combine opposing vectors to save space
i
Matrix
2 Vectors
2 2D-Textures
1
N
2
N
N-i
N
1
2
62
For details go: http://wwwcg.in.tum.de
Operations 1
• Vector-Vector Operations
– Reduced to 2D texture operations
– Coded in pixel shaders
• Example: Vector1 + Vector2  Vector3
Vector 1
Vector 2
Vector 3
Render Texture
Static quad
TexUnit 0
Pass through
Vertex Shader
TexUnit 1
return tex0 + tex1
Pixel Shader
63
For details go: http://wwwcg.in.tum.de
Operations 2 (reduce)
Reduce operation for scalar products
original Texture
st
1 pass
nd
2 pass
...
...
...
...
Reduce m x n region
in fragment shader
64
For details go: http://wwwcg.in.tum.de
Operations
In depth example: Vector / Banded-Matrix Multiplication
A
b
x
=
65
For details go: http://wwwcg.in.tum.de
Example (cont.)
Vector / Banded-Matrix Multiplication
A
A
b
b
x
=
66
For details go: http://wwwcg.in.tum.de
Example (cont.)
Compute the result in 2 Passes:
A
Pass 1
2
b
b‘
x
=
67
For details go: http://wwwcg.in.tum.de
Conclusions
Moving Forward …
• What works well now?
• What doesn’t work well now?
• What will improve in the future?
• What will continue to be difficult?
69
What Runs Well on GPUs?
• GPUs win when …
• Limited data reuse
Memory BW
Cache BW
P4 3GHz
6 GB/s
44 GB/s
NV GF 6800
36 GB/s
--
• High arithmetic intensity: Defined as math
operations per memory op
• Attacks the memory wall - are all mem ops necessary?
• Common error: Not comparing against optimized
CPU implementation
70
Arithmetic Intensity
Historical growth rates (per year):
• Compute: 71%
• DRAM bandwidth: 25%
GFLOPS
• DRAM latency: 5%
7x Gap
GFloats/sec
R300
R360
R420
[courtesy Ian Buck]
71
Arithmetic Intensity
GeForce 7800 GTX
Pentium 4 3.0 GHz
GPU wins when…
• Arithmetic intensity
 Segment
3.7 ops per word
11 GFLOPS
[courtesy Ian Buck]
72
Memory Bandwidth
GPU wins when…
• Streaming memory
bandwidth
 SAXPY

FFT
GeForce 7800 GTX
Pentium 4 3.0 GHz
[courtesy Ian Buck]
73
Memory Bandwidth
• Streaming Memory System
• Optimized for sequential
performance
• GPU cache is limited
• Optimized for texture
filtering
• Read-only
• Small
GeForce 7800 GTX
Pentium 4
• Local storage
• CPU >> GPU
[courtesy Ian Buck]
74
What Will (Hopefully) Improve?
• Orthogonality
• Instruction sets
• Features
• Tools
• Stability
• Interfaces, APIs, libraries, abstractions
• Necessary as graphics and GPGPU converge!
75
What Won’t Change?
• Rate of progress
• Precision (64b floating point?)
• Parallelism
• Won’t sacrifice performance
• Difficulty of programming parallel hardware
• … but APIs and libraries may help
• Concentration on entertainment apps
76
Top Ten
GPGPU Top Ten
• The Killer App
• Programming models
and tools
• GPU in tomorrow’s
computer?
• Data conditionals
• Relationship to other
parallel hw/sw
• Managing rapid change
•
•
•
•
in hw/sw (roadmaps)
Performance
evaluation and cliffs
Philosophy of faults
and lack of precision
Broader toolbox for
computation / data
structures
Wedding graphics and
GPGPU techniques
78