Data Visualization And Mining
Using The GPU
Sudipto Guha (Univ. of Pennsylvania)
Shankar Krishnan (AT&T Labs - Research)
Suresh Venkatasubramanian (AT&T Labs - Research)
What you will see in this tutorial…

The GPU is fast
The GPU is programmable
The GPU can be used for interactive visualization
How do we abstract the GPU? What is an efficient GPU program?
How various data mining primitives are implemented
Using the GPU for non-standard data visualization
What you will NOT see…

Detailed programming tricks
Hacks to improve performance
Mechanics of GPU programming

BUT… we will show you where to find all of this.

A plethora of resources is now available on the web, as well as code, toolkits, examples, and more…
Schedule

1st Hour
1:30pm – 1:50pm: Introduction to GPUs (Shankar)
1:50pm – 2:30pm: Examples of GPUs in Data Analysis (Shankar)

2nd Hour
2:30pm – 2:50pm: Stream Algorithms on the GPU (Sudipto)
2:50pm – 3:20pm: Data mining primitives (Sudipto)
3:20pm – 3:30pm: Questions and Short break

3rd Hour
3:30pm – 4:00pm: GPU Case Studies (Shankar)
4:00pm – 4:20pm: High-level software support (Shankar)
4:20pm – 4:30pm: Wrap-up and Questions
Animusic Demo (courtesy ATI)
But I don’t do graphics! Why should I care about the GPU?
Two Converging Trends in Computing…

The accelerated development of graphics cards
- GPUs are developing faster than CPUs
- GPUs are cheap and ubiquitous

Increasing need for streaming computations
- original motivation from dealing with large data sets
- also interesting for multimedia applications, image processing, visualization etc.
What is a Stream?

An ordered list of data items
Each data item has the same type
- like a tuple or record
Length of stream is potentially very large
Examples
- data records in database applications
- vertex information in computer graphics
- points, lines etc. in computational geometry
Streaming Model

Input presented as a sequence
Algorithm works in passes
- allowed one sequential scan over the input
- not permitted to move backwards mid-scan
Workspace
- typically o(n)
- arbitrary computation allowed
Algorithm efficiency
- size of workspace and computation time
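
To make the model concrete, here is a minimal sketch (ours, not from the slides) of a one-pass streaming computation in C++: a running mean maintained in O(1) workspace over a single sequential scan.

#include <cstdio>
#include <vector>

// Minimal one-pass streaming algorithm: a running mean.
// Workspace is O(1) regardless of stream length, and the input
// is scanned exactly once, never revisited.
int main() {
    std::vector<double> stream = {3.0, 1.5, 4.0, 1.0, 5.5};  // stands in for a huge input

    long long count = 0;   // the entire workspace: one counter, one accumulator
    double mean = 0.0;
    for (double x : stream) {        // one sequential scan
        ++count;
        mean += (x - mean) / count;  // incremental mean update
    }
    std::printf("n = %lld, mean = %f\n", count, mean);
    return 0;
}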
Streaming: Data Driven to Performance Driven

Primary motivation is computing over transient data (data driven)
- data over a network, sensor data, router data etc.
Computing over large, disk-resident data which are expensive to access (data and performance driven)
To improve algorithm performance
- how does streaming help performance?
Von Neumann Bottleneck

[Diagram: CPU (control unit, ALU, cache) connected to memory over a bus, with instructions and data sharing the bus]
Von Neumann Bottleneck

Memory bottleneck
- CPU processing is faster than memory bandwidth
- the discrepancy is getting worse
- large caches and sophisticated prefetching strategies alleviate the bottleneck to some extent
- caches occupy large portions of real estate in modern chip design
Trends In Hardware

[Die photograph of the Intel/HP IA-64 processor (Itanium 2 chip): the L3 cache arrays dominate the cache real estate]
Stream Architectures

All input in the form of streams
A stream is processed by a specialized computational unit called a kernel
The kernel performs the same operations on each stream element

[Diagram: a data stream flowing through a chain of kernels]
Stream Architectures

Items processed in a FIFO fashion
Reduced memory latency and cache requirements
Simplified control flow
Data-level parallelism
Greater computational efficiency
Examples
- CHEOPS [Rixner et al. ’98] and Imagine [Kapasi et al. ’02]
- high performance media applications
GPU: A Streaming Pipelined Architecture

Inputs presented in streaming fashion
- processed data items pass to the next phase and do not return
Data-level parallelism
Limited local storage
- data items essentially carry their own state
Pipelining: each item is processed identically
Not quite general purpose yet, but getting there
GPU Performance
From ‘Stream Programming Environments’ – Hanrahan, 2004.
The Graphics Pipeline

Stages: Modeling Transformations → Illumination (Shading) → Viewing Transformation → Clipping → Projection (to Screen Space) → Scan Conversion (Rasterization) → Visibility / Display

Input:
- Geometric model: description of all object, surface and light source geometry and transformations
- Lighting model: computational description of object and light properties, interaction (reflection, scattering etc.)
- Synthetic viewpoint (or camera location): eye position and viewing frustum
- Raster viewport: pixel grid onto which the image is mapped

Output:
- Colors/intensities suitable for display (for example, a 24-bit RGB value at each pixel)
The Graphics Pipeline

Primitives are processed in a series of stages
Each stage forwards its result on to the next stage
The pipeline can be implemented in different ways
Optimizations & additional programmability are available at some stages

Slide Courtesy: Durand and Cutler
Modeling Transformations

3D models are defined in their own coordinate system (object space)
Modeling transforms orient the models within a common coordinate frame (world space)

[Diagram: object space → world space]

Slide Courtesy: Durand and Cutler
Illumination (Shading) (Lighting)

Vertices are lit (shaded) according to material properties, surface properties (normal) and light sources
Local lighting model (Diffuse, Ambient, Phong, etc.)

Slide Courtesy: Durand and Cutler
Viewing Transformation

Maps world space to eye space
The viewing position is transformed to the origin & the viewing direction is oriented along some axis (usually z)

[Diagram: world space → eye space]
Clipping

Transform to Normalized Device Coordinates (NDC)
Portions of the object outside the view volume (view frustum) are removed

[Diagram: eye space → NDC]

Slide Courtesy: Durand and Cutler
Projection

The objects are projected to the 2D image plane (screen space)

[Diagram: NDC → screen space]

Slide Courtesy: Durand and Cutler
Scan Conversion (Rasterization)

Rasterizes objects into pixels
Interpolates values as we go (color, depth, etc.)

Slide Courtesy: Durand and Cutler
Visibility / Display

Each pixel remembers the closest object (depth buffer)

Almost every step in the graphics pipeline involves a change of coordinate system. Transformations are central to understanding 3D computer graphics.

Slide Courtesy: Durand and Cutler
Coordinate Systems in the Pipeline

Modeling Transformations: object space → world space
Viewing Transformation: world space → eye space / camera space
Clipping, Projection: eye space → clip space (NDC)
Scan Conversion (Rasterization): clip space → screen space

Slide Courtesy: Durand and Cutler
Programmable Graphics Pipeline

[Diagram, CPU side: a 3D application or game issues 3D API commands through the 3D API (OpenGL or Direct3D); the GPU command & data stream crosses the CPU–GPU boundary to the GPU front end]

[Diagram, GPU side: vertex index stream → Programmable Vertex Processor (pre-transformed vertices in, transformed vertices out) → Primitive Assembly → assembled primitives → Rasterization and Interpolation (pixel location stream, pre-transformed fragments) → Programmable Fragment Processor (transformed fragments out) → Raster Operations → pixel updates to the Frame Buffer]

Courtesy: Cg Book [Fernando and Kilgard]
Increase in Expressive Power

1996: simple if-then tests via depth and stencil testing
1998: more complex arithmetic and lookup operations
2001: limited programmability in the pipeline via specialized assembly constructs
2002: full programmability, but only straight-line programs
2004: true conditionals and loops
2006(?): general-purpose streaming processor?
Recall
[GPU vs CPU performance chart from ‘Stream Programming Environments’ – Hanrahan, 2004]
GPU = Fast co-processor?

GPU speed is increasing at cubed-Moore’s Law
This is a consequence of the data-parallel streaming aspects of the GPU
GPUs are cheap! Put enough together, and you can get a supercomputer.
So can we use the GPU for general-purpose computing?

NYT, May 26, 2003: TECHNOLOGY; From PlayStation to Supercomputer for $50,000:
National Center for Supercomputing Applications at University of Illinois at Urbana-Champaign builds supercomputer using 70 individual Sony Playstation 2 machines; project required no hardware engineering other than mounting Playstations in a rack and connecting them with high-speed network switch
In Summary

The GPU is fast
The trend, compared to the CPU, is already in the GPU’s favour

But is it being used in non-graphics applications?

Wealth of applications
Data Analysis
Motion Planning
Particle Systems
Voronoi Diagrams
Geometric Optimization
Physical Simulation
Linear Solvers
Force-field simulation
Molecular Dynamics
Graph Drawing
Database queries
Sorting and Searching
Range queries
Audio, Video and Image processing
… and graphics too!!
Computation & Visualization

The success stories involving the GPU revolve around the merger of computation and visualization
- Linear system solvers used for real-time physical simulation
- Voronoi diagrams allow us to perform shape analysis
- n-body computations form the basis for graph drawing and molecular modelling
For large data, visualization = analysis

Interactive Data Analysis

[Diagram: traditionally, analysis tools and viz tools are separate; the GPU combines both in one]
Example 1: Voronoi Diagrams
Hoff et al., SIGGRAPH 99
Example 2: Change Detection using Histograms
Other examples

Physical system simulation
- Fluid flow visualization for graphics and scientific computing
- One of the most well-studied and successful uses of the GPU

Image processing and analysis
- A very hot area of research in the GPU world
- Numerous packages (openvidia, Apple dev kit) for processing images on the GPU
- Cameras mounted on robots with real-time scene processing
Other examples

SCOUT: GPU-based system for expressing general data visualization queries
- provides a high-level data processing language
- data processed through ‘programs’; the user can interact with the output data

SCOUT demonstrates
- the effectiveness of the GPU for data visualization
- the need for general GPU primitives for processing data
- but the computation and visualization are decoupled

There are many more examples!!
Central Theme of This Tutorial

The GPU is fast, and viz is built in
The GPU can be programmed at a high level
Some of the most challenging aspects of computation and visualization of large data are
- interactivity
- dynamic data
The GPU enables both
An Extended Example
Reverse Nearest Neighbours

An instance consists of clients and servers
Each server “serves” those clients that it is closest to
What is the load on a server? It is the number of clients that consider this server to be their nearest neighbour (among all servers)
Hence the term “reverse nearest neighbour”: for each server, find its reverse nearest neighbours
Note: both servers and clients can move or be inserted or deleted
What do we know?

First proposed by Korn and Muthukrishnan (2000)
- 1-D, static
Easy if servers are fixed (compute the Voronoi diagram of the servers, and locate clients in the diagram)
In general, hard to do with moving clients and servers in two dimensions
Algorithmic strategy

For each of N clients
- iterate over each of M servers, find the closest one
For each of M servers
- iterate over each of the N clients that consider it closest, and count them

Apparent “complexity” is M * N
…… Or is it?
Looking more closely…

For each of N clients
- iterate over each of M servers, find the closest one
For each of M servers
- iterate over each of the N clients that consider it closest, and count them

Each of these loops can be performed in parallel
Looking more closely…

For each of N clients
- iterate over each of M servers, find the closest one
For each of M servers
- iterate over each of the N clients that consider it closest, and count them
Complexity is N + M
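
A serial C++ sketch of the two loops (our illustration, not the tutorial's GPU code). On the GPU, each client and each server maps to its own pixel, so both loops run as data-parallel passes and the parallel depth is N + M rather than N * M.

#include <cstddef>
#include <vector>

struct Point { float x, y; };

static float dist2(Point a, Point b) {
    float dx = a.x - b.x, dy = a.y - b.y;
    return dx * dx + dy * dy;
}

// For each client, find its nearest server; then count, per server, how
// many clients chose it: that count is the server's reverse-nearest-
// neighbour load. Both loops are embarrassingly parallel.
std::vector<int> rnnLoads(const std::vector<Point>& clients,
                          const std::vector<Point>& servers) {
    std::vector<int> load(servers.size(), 0);
    for (const Point& c : clients) {      // parallel over the N clients
        std::size_t best = 0;
        for (std::size_t s = 1; s < servers.size(); ++s)
            if (dist2(c, servers[s]) < dist2(c, servers[best]))
                best = s;
        ++load[best];                     // the "counting" step, parallel over the M servers
    }
    return load;
}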

Demo
Demo: Change distance function
RNN as a special case of EM

In expectation-maximization, in each iteration
- first determine which points are to be associated with which cluster centers (the “nearest neighbour” step)
- then compute a new cluster center based on the points in a cluster (the “counting” step)

We can implement expectation-maximization on the GPU!

More later
What does this example show us?

The GPU is fast
The GPU is programmable
The GPU can deal with dynamic data
The GPU can be used for interactive visualization

However, care is needed to get the best performance

Coming up next…

How do we abstract the GPU? What is an efficient GPU program?
How various data mining primitives are implemented
Using the GPU for non-standard data visualization
PART II. Stream algorithms using the GPU
Programmable Graphics Pipeline

(Recall the programmable pipeline from Part I: front end → vertex processor → primitive assembly → rasterization and interpolation → fragment processor → raster operations → frame buffer.)

Courtesy: Cg Book [Fernando and Kilgard]
Think parallel

First cut: each pixel is a processor
- Not really – there are multiple pipes
- Each pipe performs bits of the whole
- The mapping is not necessarily known, and can even be load dependent
- But the drivers think that way

Order of completion is not guaranteed
- i.e. we cannot say that the computation proceeds along “scan” lines

Restricted interaction between pixels
- We cannot write and read the same location
Think simple

SIMD machine
- Single instruction: the same program is applied at every pixel
- Recall data parallelism
- Program size is bounded

The ability to keep state is limited
- Cost is in changing state: a pass-based model
- Cannot read and write the same memory location in a single pass
Think streams

Work as transducers or kernels: algorithms that take an input, consume it, and output a transform before looking at the next input
Can set up networks of kernels
Example: Dense Matrix Multiplication

Larsen/McAllester (2002), Hall, Carr & Hart (2003)

Given matrices X, Y, compute Z = XY

For s = 1 to n
  For t = 1 to n
    Zst = 0
    For m = 1 to n
      Zst ← Zst + Xsm * Ymt
    EndFor
  EndFor
EndFor
Data Structures

For s = 1 to n
  For t = 1 to n
    Zst = 0
    For m = 1 to n
      Zst ← Zst + Xsm * Ymt
    EndFor
  EndFor
EndFor

Data structures: arrays X, Y, Z
How are arrays implemented on GPUs?
- As texture memory (used for shading and bump maps)
- Fast texture lookups are built into the pipeline
- Textures are the primary storage mechanism for general-purpose programming
Control Flow

Original loop nest:

For s = 1 to n
  For t = 1 to n
    Zst = 0
    For m = 1 to n
      Zst ← Zst + Xsm * Ymt
    EndFor
  EndFor
EndFor

Hoist the m loop outward, separating initialization from accumulation, so that the (s,t) iterations become independent:

For s = 1 to n
  For t = 1 to n
    Zst ← 0

For m = 1 to n
  For s = 1 to n
    For t = 1 to n
      Zst ← Zst + Xsm * Ymt
    EndFor
  EndFor
EndFor
From Loops to Pixels

For m = 1 to n
  For s = 1 to n
    For t = 1 to n
      Zst ← Zst + Xsm * Ymt
    EndFor
  EndFor
EndFor

The two inner loops become one parallel step over pixels:

For m = 1 to n
  For pixel (s,t)
    Zst ← Zst + Xsm * Ymt
  EndFor
EndFor

Operations for all pixels are performed in “parallel”
Operations for all pixels are performed in “parallel”
Is that it ? SIMD …
For m=1 to n
For s=1 to n
For t=1 to n
R1.r ← lookup(s,m,X)
R2.r ← lookup(m,t,Y)
R3.r ← lookup(s,t,Z)
buffer ← R3.r + R1.r * R2.r
EndFor
EndFor
Save buffer into Z
Each pixel (x,y) gets the parameter m:
Endfor
R1.r ← lookup(myxcoord,m,X)
R2.r ← lookup(m,myycoord,Y)
R3.r ← lookup(myxcoord,mycoord,Z)
mybuffer ← R3.r + R1.r * R2.r
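
To make the pass structure concrete, here is a CPU-side sketch (ours, not the original shader code) that mimics the execution model: the outer m loop is a sequence of rendering passes, the body runs "simultaneously" at every pixel (s,t), and each pass writes to a separate buffer because a pass may not read and write the same texture.

#include <vector>

// Emulates the pass-based GPU multiply Z = X * Y for n x n row-major
// matrices; Z is assumed zero-initialized. One pass per value of m.
void matmulPasses(int n, const std::vector<float>& X,
                  const std::vector<float>& Y, std::vector<float>& Z) {
    std::vector<float> buffer(n * n);
    for (int m = 0; m < n; ++m) {          // one rendering pass per m
        for (int s = 0; s < n; ++s)        // conceptually parallel over
            for (int t = 0; t < n; ++t)    // all pixels (s,t)
                buffer[s * n + t] = Z[s * n + t] + X[s * n + m] * Y[m * n + t];
        Z.swap(buffer);                    // "save buffer into Z" between passes
    }
}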
Analysing theoretically

O(n) passes to multiply two n × n matrices, using O(n²)-size textures
Your favourite parallel algorithm for matrix multiplication can be mapped onto the GPU
How about Strassen?
- Number of passes is the same
- Total work decreases
Can reduce the number of passes to O(n^0.9) using special packing ideas [GKV04]
In practice…

Simpler implementations win out
- The fastest GPU implementation uses n matrix-vector operations, and packs data into 4-vectors
- It is better than naïve code using three For loops, but not better than optimized CPU code, e.g., in linear algebra packages
- Trendlines show that the running time of the GPU code grows linearly
- Packing and additions, e.g. in Strassen-type algorithms, are problematic due to data movement and cache effects
Summing up

SIMD machine
Pass-based computation
- Pipelined operation
- Cost is in changing state
- Total work
Revisiting RNN

Recall the two implementations; M*N distances need to be evaluated

First idea:
- M+N passes using M+N processors in parallel

Second idea:
- an M × N grid to compute the distances: 1 pass
- Find the min in 1 pass?
- How to aggregate?

[Diagram: an M × N grid of server/client distance pairs]
GPU hardware modelling

The GPU is a parallel processor, where each pixel can be considered to be a single processor (almost)
The GPU is a streaming processor, where each pixel processes a stream of data
We can build high-level algorithms (EM) from low-level primitives (distance calculations, counting)
Operation costs

When evaluating the cost of an operation on the GPU, standard operation counts are not applicable
- The basic unit of cost is the “rendering pass”, consisting of a stream of data being processed by a single program
- Virtual parallelism of the GPU means that we “pretend” that all streams can be processed simultaneously
- Assign a cost of 1 to a pass: this is analogous to external memory models where each disk access is 1 unit
- Many caveats
An extreme example

One-pass median finding
- A well-studied problem in streaming, with a log n pass lower bound
- Consider repeating the data O(log n) times to get a sequence of size O(n log n)
- At any point the algorithm has an upper bound UB and a lower bound LB

1. The algorithm reads the next n numbers and picks a random element u between LB and UB.
2. The algorithm uses the next n numbers and finds the rank of u.
3. Now it sets LB=u or UB=u depending on the counts.
4. Repeat!

Quickfind!
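
A C++ sketch of the round structure (ours; it assumes distinct values for simplicity). Each round consumes two fresh copies of the repeated data: one to sample a pivot u strictly between LB and UB, one to compute the rank of u. Expected O(log n) rounds, exactly as in randomized quickselect.

#include <cstddef>
#include <cstdlib>
#include <vector>

// Median over the O(log n)-times-repeated stream; `data` stands in for
// one copy of the underlying input.
double medianOverRepeatedStream(const std::vector<double>& data) {
    const std::size_t n = data.size(), target = n / 2;
    double LB = -1e30, UB = 1e30;          // current bounds on the median
    for (;;) {
        // Copy 1: reservoir-sample a pivot u from the open interval (LB, UB).
        double u = 0;
        std::size_t seen = 0;
        for (double x : data)
            if (x > LB && x < UB && std::rand() % ++seen == 0) u = x;
        if (seen == 0) return LB;          // nothing left between the bounds
        // Copy 2: rank of u in the full stream.
        std::size_t rank = 0;
        for (double x : data) if (x < u) ++rank;
        if (rank == target) return u;      // u is the median
        (rank < target ? LB : UB) = u;     // shrink the interval and repeat
    }
}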
Why does the example not generalize?
Read is not the same as Write.
Total work vs number of passes.
PART II.5. Data Mining Primitives

I: Clustering

Problem definition: Given N points, find K points c_i to optimize

    min Σ_j min_i dist(x_j, c_i)²

Hall & Hart ’03: a (K+N)-pass k-means algorithm

What we will see:
- Log N passes per iteration
- EM and “soft assignments”
K-Means, 2D version

Basic algorithm: recall the Voronoi scenario
Why GPU?
- Motion/updates! When does the cross-over happen?
K-Means on the GPU

In each pass, given a set of K colors, we can assign a “color” to each point and keep track of the “mass” and “mean” of each color

Issues:
- Higher D
- Extension to more general algorithms
The RNN-type approach

K*N distances: 1 pass
Compute min: logarithmic passes
Aggregate: logarithmic passes
Normalize and compute new centers
High D? Replace 1 by O(D)
EM? Instead of Min (which is 0/1), compute “shares” – same complexity! (See the sketch below.)
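
A serial sketch (ours) of one such iteration, staged the way the GPU version is: a distance/min stage (one pass over the K*N pairs plus logarithmic min passes), an aggregation stage (logarithmic passes), and a normalization stage.

#include <cstddef>
#include <vector>

struct Pt { float x, y; };

// One k-means iteration. Each labeled stage corresponds to a GPU pass
// (or a logarithmic sequence of passes); here they run serially.
void kmeansIteration(const std::vector<Pt>& pts, std::vector<Pt>& centers) {
    const std::size_t K = centers.size();
    std::vector<float> mass(K, 0.f);
    std::vector<Pt> sum(K, Pt{0.f, 0.f});
    for (const Pt& p : pts) {                      // distance + min stage
        std::size_t best = 0;
        float bestD = 1e30f;
        for (std::size_t i = 0; i < K; ++i) {
            float dx = p.x - centers[i].x, dy = p.y - centers[i].y;
            float d = dx * dx + dy * dy;
            if (d < bestD) { bestD = d; best = i; }
        }
        mass[best] += 1.f;                         // aggregation stage
        sum[best].x += p.x;
        sum[best].y += p.y;
    }
    for (std::size_t i = 0; i < K; ++i)            // normalization stage
        if (mass[i] > 0.f)
            centers[i] = Pt{sum[i].x / mass[i], sum[i].y / mass[i]};
}

For EM-style soft assignments, the hard min would be replaced by fractional "shares" per center (e.g., weights decaying with distance), with the same pass structure.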
Demo: Expectation Maximization
II: Wavelets

Transforms
Image analysis

[2D Haar step on a 2×2 block [A B; C D]: the four outputs are
(A + B + C + D)/2, (A − B + C − D)/2, (A + B − C − D)/2, (A − B − C + D)/2]
Wavelets: Recurse!

[The same 2×2 Haar step is applied across every block of the image in parallel, and then recursively on the quadrant of averages]
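
A C++ sketch of one level of this transform (ours, using the orthonormal (±A±B±C±D)/2 stencil reconstructed above). Recursing on the average quadrant with haarLevel(img, n/2, stride) yields the full DWT; on the GPU, each output pixel is one fragment and each level is one pass.

#include <vector>

// One level of the 2D Haar transform on the top-left n x n region of a
// row-major image with the given row stride (n even). Each 2x2 block
// [A B; C D] yields one average and three details, written into the
// four quadrants of the region.
void haarLevel(std::vector<float>& img, int n, int stride) {
    std::vector<float> out(img);
    int h = n / 2;
    for (int r = 0; r < h; ++r)
        for (int c = 0; c < h; ++c) {
            float A = img[(2*r) * stride + 2*c],   B = img[(2*r) * stride + 2*c + 1];
            float C = img[(2*r+1) * stride + 2*c], D = img[(2*r+1) * stride + 2*c + 1];
            out[r * stride + c]           = (A + B + C + D) / 2;  // average
            out[r * stride + h + c]       = (A - B + C - D) / 2;  // horizontal detail
            out[(h + r) * stride + c]     = (A + B - C - D) / 2;  // vertical detail
            out[(h + r) * stride + h + c] = (A - B - C + D) / 2;  // diagonal detail
        }
    img = out;
}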
DWT Demo
Wang, Wong, Heng & Leung
http://www.cse.cuhk.edu.hk/~ttwong/demo/dwtgpu/dwtgpu.html
III: FFTs

e^(2πin/N)

What does the above remind you of?
FFT Demo
Moreland and Angel
http://www.cs.unm.edu/~kmorel/documents/fftgpu/
IV: Sorting Networks!

Can be used for quantiles, computing histograms
GPUSort: [Govindaraju et al. 2005]
Basically “network computations”
Each pixel performs some local piece only (see the sketch below)
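
A minimal sketch (ours, not GPUSort's code) of the bitonic sorting network. Each innermost loop is one pass: every pixel i looks up its partner i ^ j and keeps the min or the max depending on its direction bit, so sorting n = 2^k elements takes O(log² n) passes.

#include <algorithm>
#include <cstddef>
#include <vector>

// Bitonic sort of an array whose length is a power of two.
void bitonicSort(std::vector<float>& a) {
    const std::size_t n = a.size();
    for (std::size_t k = 2; k <= n; k <<= 1)            // stages
        for (std::size_t j = k >> 1; j > 0; j >>= 1)    // one pass per j
            for (std::size_t i = 0; i < n; ++i) {       // parallel over pixels
                std::size_t partner = i ^ j;
                bool ascending = (i & k) == 0;          // direction bit
                if (partner > i && (a[i] > a[partner]) == ascending)
                    std::swap(a[i], a[partner]);        // compare-exchange
            }
}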
In summary

The big idea
- Think parallel.
- Think simple.
- Think streams.

SIMD
Pass-based computing
Tradeoff of passes versus total cost
“Ordered” access
Next up

We saw “number crunching” algorithms.
What are the applications in mining which are “combinatorial” in nature and yet visual?
And of course, resources and how-tos.
Part III: Other GPU Case Studies and High-level Software Support for the GPU

What have we seen so far

Computation on the GPU
How to program
Cost model
- What is cheap and expensive
Number-based problems, e.g.,
- nearest neighbor (Voronoi), clustering, wavelets, sorting
What you will see…

Examples with different kinds of input data
1. Computation of Depth Contours
2. Graph Drawing
Languages and tools to program the GPU
Final wrap-up
Depth Contours

[Figure: a point set with each point labeled by its location depth; the highlighted point has location depth = 1]

Depth Contours

k-contour: the set of all points having location depth ≥ k
Every n-point set has an n/3-contour
The Tukey median is the deepest contour
We wish to compute the set of all depth contours
Motivation

Special case of a general non-parametric data analysis tool
“visualize the location, spread, correlation, skewness, and tails of the data” [Tukey et al. 1999]
Hypothesis testing, robust statistics
Point-Hyperplane Duality

Mapping from R² → R²
Points get mapped to lines and vice versa
- (a,b) → y = −ax + b
- y = ax + b → (a,b)
Incidence properties are preserved
Contours ↔ k-levels
- k-contour: convex hulls of the k- and (n−k)-levels in the dual
Main Algorithm

Draw the primal line for each dual pixel that lies on a dual line
Record only (at each primal pixel) the line whose dual pixel has least depth
Recall nearest neighbor (Voronoi) and clustering – we are using the geometry heavily
Demo: Depth Contours
Dynamic Depth Contours

Smooth, but random, movement of points
Compute incremental depth changes using duals
Worst-case quadratic updates might be performed
- In practice, a small number of updates
Can achieve real-time speeds for the first time

Dynamic Depth Contours
Graph Drawing

Graph Drawing

Graph
- Set of vertices
- A boolean relation between pairs of vertices defines the edges
Given a graph, lay out the vertices and edges in a plane based on some aesthetics
Abstract Graph

Example: 9 vertices, 12 edges

Edges
- v1: v2, v4
- v2: v1, v3, v5
- v3: v2, v6
- v4: v1, v5, v7
- v5: v2, v4, v6, v8
- v6: v3, v5, v9
- v7: v4, v8
- v8: v5, v7, v9
- v9: v6, v8
After Graph Layout

[Figure: grid3.graph laid out as a 3×3 grid: v7 v8 v9 / v4 v5 v6 / v1 v2 v3]

Can visualize the relationships much better
Force-Directed Layouts

Start with a random initial placement of vertices
Define an energy field on the plane using forces between vertices (both attractive and repulsive):

    F_attractive(v_i) = Σ_{(v_i,v_j) ∈ E} c · ||v_j − v_i|| · (v_j − v_i)

    F_repulsive(v_i) = Σ_{j ≠ i} (1/c) · (v_i − v_j) / ||v_i − v_j||²

Evolve the system based on the energy field
Local energy minimization defines the layout
Essentially solving an N-body problem
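
A direct serial evaluation of this force field for one vertex (our sketch, instantiating the formulas above; c and the epsilon guard are tuning choices). Summing the repulsive term over all other vertices is the N-body bottleneck addressed next.

#include <cmath>
#include <cstddef>
#include <vector>

struct Vec2 { float x, y; };

// Net force on vertex i: attractive along incident edges (grows with
// distance), repulsive against all other vertices (falls off with
// squared distance).
Vec2 force(std::size_t i, const std::vector<Vec2>& pos,
           const std::vector<std::vector<std::size_t>>& adj, float c) {
    Vec2 f = {0.f, 0.f};
    for (std::size_t j : adj[i]) {                  // attractive, over edges
        float dx = pos[j].x - pos[i].x, dy = pos[j].y - pos[i].y;
        float len = std::sqrt(dx * dx + dy * dy);
        f.x += c * len * dx;
        f.y += c * len * dy;
    }
    for (std::size_t j = 0; j < pos.size(); ++j) {  // repulsive, over all pairs
        if (j == i) continue;
        float dx = pos[i].x - pos[j].x, dy = pos[i].y - pos[j].y;
        float d2 = dx * dx + dy * dy + 1e-6f;       // guard against divide-by-zero
        f.x += dx / (c * d2);
        f.y += dy / (c * d2);
    }
    return f;
}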

GPU Algorithm

Naïve approach
- At each iteration
  - Compute pairwise repulsive forces
  - Compute attractive forces along edges
  - Force on each vertex is the sum of the individual forces
  - The resultant force defines the vertex displacement
  - Displace vertices and iterate
- Complexity – O(n) passes per iteration
- Bottleneck – summing the individual forces

Optimized approach
- Use a logarithmic parallel summation algorithm
- Complexity – O(log n) passes per iteration
Parallel Summation Algorithm

[Figure: an array is summed in O(log n) steps; at offset 4, then 2, then 1, every element adds its neighbour that many positions away, halving the active prefix each step until the total remains]
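
A sketch of the logarithmic summation (ours; a serial emulation in which each offset step corresponds to one GPU pass over a half-sized render target).

#include <cstddef>
#include <vector>

// Sums an array whose length is a power of two in O(log n) steps: at each
// offset, every active element adds its neighbour `offset` positions away,
// halving the active prefix until the total sits in element 0.
float parallelSum(std::vector<float> a) {
    for (std::size_t offset = a.size() / 2; offset > 0; offset /= 2)
        for (std::size_t i = 0; i < offset; ++i)    // conceptually parallel
            a[i] += a[i + offset];
    return a[0];
}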
Demo: Graph Drawing
Software tools
GPGPU Languages

Why do we want them?
- Make programming GPUs easier!
- Don’t need to know OpenGL, DirectX, or ATI/NV extensions
- Simplify common operations
- Focus on the algorithm, not on the implementation

Sh
- University of Waterloo
- http://libsh.sourceforge.net

Brook
- Stanford University
- http://brook.sourceforge.net
Brook
Brook: Streams

streams
- collections of records requiring similar computation
- particle positions, voxels, FEM cells, …

float3 positions<200>;
float3 velocityfield<100,100,100>;

- similar to arrays, but…
- index operations disallowed: position[i]
- read/write stream operators:

streamRead (positions, p_ptr);
streamWrite (velocityfield, v_ptr);

- encourage data parallelism

Slide Courtesy: Ian Buck
Brook: Kernels

kernels
- functions applied to streams
- similar to a for_all construct

kernel void foo (float a<>, float b<>,
                 out float result<>) {
  result = a + b;
}

float a<100>;
float b<100>;
float c<100>;
foo(a,b,c);

Equivalent loop semantics:

for (i=0; i<100; i++)
  c[i] = a[i]+b[i];

Slide Courtesy: Ian Buck
Brook: Kernels

kernels
- functions applied to streams
- similar to a for_all construct

kernel void foo (float a<>, float b<>,
                 out float result<>) {
  result = a + b;
}

- no dependencies between stream elements
- encourage high arithmetic intensity

Slide Courtesy: Ian Buck
Brook: Reductions

reductions
- compute a single value from a stream

reduce void sum (float a<>,
                 reduce float r<>) {
  r += a;
}

float a<100>;
float r;
sum(a,r);

Equivalent loop semantics:

r = a[0];
for (int i=1; i<100; i++)
  r += a[i];

Slide Courtesy: Ian Buck
Brook: Reductions

reductions
- associative operations only: (a+b)+c = a+(b+c)
- order independence
- sum, multiply, max, min, OR, AND, XOR
- matrix multiply

Slide Courtesy: Ian Buck
Brook: Reductions

multi-dimensional reductions
- stream “shape” differences resolved by the reduce function

reduce void sum (float a<>,
                 reduce float r<>) {
  r += a;
}

float a<20>;
float r<5>;
sum(a,r);

Equivalent loop semantics:

for (int i=0; i<5; i++) {
  r[i] = a[i*4];
  for (int j=1; j<4; j++)
    r[i] += a[i*4 + j];
}

Slide Courtesy: Ian Buck
Brook: Matrix Vector Multiply

kernel void mul (float a<>, float b<>,
                 out float result<>)
{ result = a*b; }

reduce void sum (float a<>,
                 reduce float result<>)
{ result += a; }

float matrix<20,10>;
float vector<10, 1>;
float tempmv<20,10>;
float result<20, 1>;

mul(matrix,vector,tempmv);
sum(tempmv,result);

[Figure: the matrix M times the replicated vector V gives the temporary T; reducing T row-wise gives the result R]

Slide Courtesy: Ian Buck
Demo: Matrix Vector Multiplication
Demo: Bitonic Sort and Binary Search
Debugging Tools

Debugger Features
- Per-pixel register watch
  - including interpolants, outputs, etc.
- Breakpoints
- Fragment program interpreter
  - single step forwards or backwards
  - execute modified code on the fly

Slide Courtesy: Tim Purcell
Shadesmith
http://graphics.stanford.edu/projects/shadesmith

Debugger in the spirit of imdebug
- Simply add a debug statement when binding shaders
- Display window with scale, bias, component masking
Advanced features
- Can watch any shader register contents without recompile
- Shader single stepping (forward and backward), breakpointing
- Shader source edit and reload without recompile

Slide Courtesy: Tim Purcell
Demo: Shadesmith
GPGPU.org

Your first stop for GPGPU information!
News: www.GPGPU.org
Discussion: www.GPGPU.org/forums
Developer resources, sample code, tutorials: www.GPGPU.org/developer
And for open source GPGPU software: http://gpgpu.sourceforge.net/
Summing up

What we saw in this tutorial…

The GPU is fast, outperforming the CPU
- high memory bandwidth
- parallel processing capabilities
Shown to be effective in a variety of applications
- interactive visualization applications
- data mining, simulations, streaming operations
- spatial database operations
High-level programming and debugging support
Limitations of the GPU

Not designed as a general-purpose processor
- meant to extract maximum performance for the highly parallel tasks of computer graphics
Will not be suited for all applications
- bad for “pointer chasing” tasks
  - word processing
- no bit shifts or bit-wise logical operations
  - cryptography
- no double precision arithmetic yet
  - large scale scientific applications
- unusual programming model
Conclusions: Looking forward

Increased performance
More features
- double precision arithmetic
- bit-wise operations
- random bits
Increased programmability and generality
- strike the right balance between generality and performance
Other data-parallel processors will emerge
- Cell processor by IBM, Sony, Toshiba
- better suited to general-purpose stream computing
Questions?
Slides can be downloaded soon from
http://www.research.att.com/areas/visualization/gpgpu.html