
3D Graphic
Hardware Pipeline
Victor Moya
Index
3D Graphic Pipeline Overview.
  Geometry.
  Rasterization.
  Fragment.
3D Graphic Hardware pipeline.
Current GPUs.
  ATI R300.
  NVidia NV30.
  3DLabs P10.
  Matrox Parhelia.
3D Graphics Pipeline
Application: simulation, input event handlers, modify data structures, database traversal, primitive generation, utility functions.
Command: command buffering, command interpretation, unpack and perform format conversion, maintain graphics state.
Geometry: evaluation of polynomials for curved surfaces, transform and projection, clipping, culling and primitive assembly.

3D Graphics Pipeline

Fixed vs Programmable.
Geometry

Vertex operations:
(1) Transform coordinates and normal:
  Model => World.
  World => Eye.
(2) Normalize the length of the normal.
(3) Compute vertex lighting.
(4) Transform texture coordinates.
(5) Transform coordinates to clip coordinates (projection).
(8) Divide coordinates by w.
(9) Apply affine viewport transform (x, y, z).
Geometry

Primitive operations:
(6) Primitive assembly.
(7) Clipping.
(10) Backface cull: eliminate back-facing triangles.
Primitive generation: new pipeline stage (ATI TruForm).
Lighting
Diffuse Lighting.
Light Sources.
Specular Lighting.
Emission.
Gouraud Shading.
Phong Shading.
Bump Mapping.
OpenGL Lighting.
Light Sources




Ambient Light.
Directional Light Sources.
  Infinite light source (parallel rays).
  No attenuation.
Point Light Sources.
  All directions.
  Attenuation: $C = \frac{1}{K_c + K_l d + K_q d^2}$
Spot Light Sources.
  Cone of light.
  Attenuation: $C = \frac{\max\{U \cdot L, 0\}}{K_c + K_l d + K_q d^2}$
Kc, Kl and Kq are the constant, linear and quadratic attenuation values.
d: Distance from the light source to the surface point.
U: Direction of the spot light.
L: Unit direction vector from the surface point to the spot light.
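The attenuation terms above can be tried out with a small sketch. The function names and the default coefficient values are illustrative assumptions, not part of the slides; the spot term follows the slide's max{U.L, 0} form as written.

```python
def dot(a, b):
    """Dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def attenuation(d, kc=1.0, kl=0.0, kq=0.0):
    """Distance attenuation 1 / (Kc + Kl*d + Kq*d^2)."""
    return 1.0 / (kc + kl * d + kq * d * d)


def spot_intensity(u, l, d, kc=1.0, kl=0.0, kq=0.0):
    """Spot light term max{U.L, 0} scaled by the distance attenuation."""
    return max(dot(u, l), 0.0) * attenuation(d, kc, kl, kq)
```

With Kc = 1 and Kl = Kq = 0 the light does not fade with distance, which is the behaviour listed for directional lights.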
Diffuse Lighting
$K_{diffuse} = D\,T\,A + D\,T \sum_{i=1}^{n} C_i \max\{N \cdot L_i, 0\}$
A: Ambient light.
T: Texture sample.
D: Surface diffuse reflection color.
Ci: Intensity of light i at the surface point.
N: Normal vector of the surface.
Li: Unit direction vector to light source i.
Specular Lighting
$K_{specular} = S\,G \sum_{i=1}^{n} C_i \max\{N \cdot H_i, 0\}^m (N \cdot L_i > 0)$
S: Surface specular color.
Ci: Intensity of the incident light from light i.
m: Specular exponent (larger values give a sharper highlight).
G: Gloss map sample.
N: Normal vector at the surface.
Li: Unit direction vector to light i.
Hi: Halfway vector (V + Li, normalized).
V: Unit direction vector to the viewer.
(N · Li > 0): evaluates to 1 when true and to 0 otherwise.
Emission
Kemission = EM
E: Surface emission color.
M: Emission map sample.
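As a rough illustration, the emission, diffuse and specular terms can be summed for a single color channel. This is a sketch under stated assumptions: scalar (one-channel) colors, unit-length input vectors, and a hypothetical function name `shade`.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    length = math.sqrt(dot(v, v))
    return tuple(x / length for x in v)


def shade(n, v, lights, D, S, E, m, A=0.0, T=1.0, G=1.0, M=1.0):
    """Return Kemission + Kdiffuse + Kspecular for one color channel.

    n, v: unit normal and unit direction to the viewer.
    lights: list of (Ci, Li) pairs, Li being the unit direction to light i.
    """
    diffuse = D * T * A
    specular = 0.0
    for c, l in lights:
        n_dot_l = dot(n, l)
        diffuse += D * T * c * max(n_dot_l, 0.0)
        if n_dot_l > 0.0:  # the (N.Li > 0) factor of the specular term
            h = normalize(tuple(a + b for a, b in zip(v, l)))
            specular += S * G * c * max(dot(n, h), 0.0) ** m
    return E * M + diffuse + specular
```

Evaluating this per vertex and interpolating the result corresponds to Gouraud shading; evaluating it per pixel with an interpolated normal corresponds to Phong shading.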
OpenGL Lighting
Calculated at each vertex, interpolated inside the triangle (Gouraud).
Bump mapping supported by proprietary extensions.
Pixel Shaders for programmable per-pixel lighting.

Clipping
Clip geometry primitives against the view frustum (6 planes).
Clip geometry primitives against the user clip planes.
Techniques used:
  Guard-Band Clipping.
  Homogeneous rasterization avoids clipping in the geometry stage.
Guard-Band Clipping
Homogeneous coordinates
"Triangle Scan Conversion using 2D Homogeneous Coordinates", Olano and Greer.
Programmable Pipeline
Vertex Program
Vertex Shader





VS 1.0, 1.1 and 1.2 (current technology) for
Direct3D 8 and 8.1. OpenGL extensions:
ARB_vertex_program (finally in OpenGL
v1.4), NV_vertex_program1_1 (NVidia),
EXT_vertex_shader (ATI).
No branching.
Single cycle execution latency (?).
Single issue instruction each cycle.
Simple in order pipeline (?).
Vertex Shader
16 input registers (read only).
15 output registers (write only).
12 temporary registers (read/write).
96 constant registers (read only or read/write?).
128 instructions max.

Vertex Shader
Opcode  Inputs              Output                Operation
        (scalar or vector)  (vector or
                            replicated scalar)
------  ------------------  --------------------  -----------------------------
ARL     s                   address register      address register load
MOV     v                   v                     move
MUL     v,v                 v                     multiply
ADD     v,v                 v                     add
MAD     v,v,v               v                     multiply and add
RCP     s                   ssss                  reciprocal
RSQ     s                   ssss                  reciprocal square root
DP3     v,v                 ssss                  3-component dot product
DP4     v,v                 ssss                  4-component dot product
DST     v,v                 v                     distance vector
MIN     v,v                 v                     minimum
MAX     v,v                 v                     maximum
SLT     v,v                 v                     set on less than
SGE     v,v                 v                     set on greater than or equal
EXP     s                   v                     exponential base 2
LOG     s                   v                     logarithm base 2
LIT     v                   v                     light coefficients
DPH     v,v                 ssss                  homogeneous dot product
RCC     s                   ssss                  reciprocal clamped
SUB     v,v                 v                     subtract
ABS     v                   v                     absolute value
NV_vertex_program2
ARL (new support for four-component A0 and A1 instead of just A0.x)
ARR (similar to ARL, but rounds instead of truncating before storing the integer result in an address register)
BRA, CAL, RET (branching instructions)
COS, SIN (high-precision trigonometric functions)
FLR, FRC (floor and fraction of floating-point values)
EX2, LG2 (high-precision exponentiation and logarithm functions)
ARA (adds pairs of components of an address register; useful for looping and other operations)
SEQ, SFL, SGT, SLE, SNE, STR ("set on" instructions similar to SLT, SGE)
SSG ("set sign" operation; generates a vector holding -1.0 for negative operand components, 0 for zero-value components, and +1.0 for positive components)

NV_vertex_program2 Overview
1. Condition codes
2. Branching & subroutines
3. Even faster performance
4. Nineteen new instructions
5. New source modifiers
6. Clip plane support
7. More registers & instructions

NV_vertex_program2 Resource Limits









256 vertex program parameters (up from 96).
16 temporary registers (up from 12).
Two 4-component address registers (up from one single-component address register).
256 static instructions per program (up from 128).
Given branching, 65536 dynamic instructions can execute before termination, to avoid infinite loops.
NV_vertex_program2 Source Modifiers
Source operand absolute value.
  Example: MOV R0, |R1|;
In addition to source negation & swizzling.
  Example: MAD R0, -|R1|.yzwy, |R2|, R3.w;
Swizzle, negate, & absolute value operations are "free" source modifiers.

NV_vertex_program2 Condition Codes (1)










Condition code state:
  4-component register stores condition code values.
Four possible values:
  LT - less than zero
  EQ - equal to zero
  GT - greater than zero
  UN - unordered, for comparisons involving NaN
Most instructions optionally update condition code state:
  Indicated with "C" suffix: DP4C, MOVC, etc.
  "CC" pseudo-register used to just update condition codes.
NV_vertex_program2 Condition Codes (2)





Optional condition code based destination masking.
Example: MOV R1.xy(NE.z), R0;
  Copy R0 components to R1's X & Y components except when the condition code's Z component is EQ.
Condition code rules: EQ, equal; GE, greater or equal; GT, greater than; LE, less or equal; LT, less than; NE, not equal; FL, false; and TR, true.
Note that the condition code masking rule can swizzle condition code components.
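The masked move in the example can be modelled in a few lines. This is only a sketch of the behaviour described above: the helper names are hypothetical, and the assumption that NE also holds for an unordered (UN) code is not stated in the slides.

```python
COMPONENTS = "xyzw"

# Which stored condition codes satisfy each rule.
# Assumption: NE is treated as true for UN as well as for LT and GT.
RULES = {
    "EQ": {"EQ"},
    "NE": {"LT", "GT", "UN"},
    "LT": {"LT"},
    "LE": {"LT", "EQ"},
    "GT": {"GT"},
    "GE": {"GT", "EQ"},
    "TR": {"LT", "EQ", "GT", "UN"},
    "FL": set(),
}


def condition_code(x):
    """Condition code generated by a result component."""
    if x != x:  # NaN compares unequal to itself
        return "UN"
    if x < 0:
        return "LT"
    if x > 0:
        return "GT"
    return "EQ"


def masked_mov(dst, src, write_mask, rule, cc, cc_swizzle="xyzw"):
    """MOV dst.write_mask(rule.cc_swizzle), src on 4-component lists."""
    out = list(dst)
    for i, comp in enumerate(COMPONENTS):
        if comp not in write_mask:
            continue
        tested = cc[COMPONENTS.index(cc_swizzle[i])]
        if tested in RULES[rule]:
            out[i] = src[i]
    return out
```

For `MOV R1.xy(NE.z), R0;` the write mask is "xy" and the swizzle ".z" replicates the Z condition code, that is `cc_swizzle="zzzz"`.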
Rasterization
Setup (per-triangle).
Sampling (triangle = {fragments}).
Interpolation (interpolate colors and coordinates).

Rasterization

Converts primitives to fragments.


Primitive: point, line, polygon, …
Fragment: transient data structure
short x, y;
long depth;
short r, g, b, a;
Fragment selection.
 Parameter Assignment (color, depth ...).

Rasterization
Setup triangles.
 Fill triangle: Interpolate parameters.
 Parameters: R, G, B, z, r, s, t, q.

Pixel Planes

Calculate 3 edge functions: if all the
edge functions are positive in a point
(x, y) the point is inside the triangle.
E(x, y) = (x – X)dY – (y – Y)dX
E(x, y) > 0 if (x, y) is to the “right” side.
E(x, y) = 0 if (x, y) is exactly on the line.
E(x, y) < 0 if (x, y) is to the “left” side.
Edge Functions
Classification (1)
A polygon defined by N vertices:
(xi, yi), 0 < i <= N
(x0, y0) = (xN, yN)
The incremental classification of the points around a polygon can be calculated as:
Initial values:
dXi = Xi - X(i-1)
dYi = Yi - Y(i-1)
Ei(Xs, Ys) = (Xs - Xi) dYi - (Ys - Yi) dXi
for 0 < i <= N
Classification (2)
Incremental computation for a unit step along the X and Y axes:
Ei(x + 1, y) = Ei(x, y) + dYi
Ei(x - 1, y) = Ei(x, y) - dYi
Ei(x, y + 1) = Ei(x, y) - dXi
Ei(x, y - 1) = Ei(x, y) + dXi
Fragment is inside the triangle if:
Ei >= 0 for all i : 0 < i <= N
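A minimal sketch of the classification, assuming integer sample points and a vertex order for which the interior gives non-negative edge values. The function names and the row-by-row traversal are illustrative choices, not the traversal used by any particular hardware.

```python
def edge(x, y, xi, yi, dx, dy):
    """E(x, y) = (x - Xi) dY - (y - Yi) dX."""
    return (x - xi) * dy - (y - yi) * dx


def rasterize(tri, width, height):
    """Return the integer points (x, y) covered by the triangle tri."""
    edges = []
    for i in range(3):
        (xa, ya), (xb, yb) = tri[i - 1], tri[i]
        edges.append((xb, yb, xb - xa, yb - ya))  # (Xi, Yi, dXi, dYi)
    covered = []
    for y in range(height):
        # Full evaluation once per row, incremental steps along the row.
        e = [edge(0, y, *p) for p in edges]
        for x in range(width):
            if all(v >= 0 for v in e):
                covered.append((x, y))
            # Unit step in X: E(x + 1, y) = E(x, y) + dY
            e = [v + p[3] for v, p in zip(e, edges)]
    return covered
```

Each step to the next pixel costs one addition per edge, which is the point of the incremental formulation.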
Classification
Traversing the Polygon
Clipping
Parallel Rasterization
E(x + L, y) = E(x, y) + L dY
Allows a group of interpolators, each responsible for a pixel within a block of contiguous pixels, to simultaneously compute the edge function of an adjacent block in a single cycle.
Olano and Greer





Triangle Scan Conversion using 2D Homogeneous Coordinates.
Based on the Pixel Planes and Pineda approaches (edge functions), but using homogeneous coordinates.
Avoids the need for clipping.
Adds a hither edge function for user clipping.
Perspective correct interpolation.
Interpolation function
A parameter varies linearly across a triangle in 3D:
u = aX + bY + cZ
The 3D position (X, Y, Z) projects to 2D, using 2DH coords (x = X, y = Y ,
w = Z). The equation in 2DH space:
u = ax + by + cw
2D perspective correct function (division by w):
u/w = a x/w + b y/w + c = a X + b Y + c
u/w is a linear function in screen space (X, Y)
Interpolation function

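If each vertex has a value for u, we can solve for [a b c] using this equation:
$[u_0\ u_1\ u_2] = [a\ b\ c]\,M$, so $[a\ b\ c] = [u_0\ u_1\ u_2]\,M^{-1}$, where column i of M holds the 2DH coordinates (x_i, y_i, w_i) of vertex i.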
Scan conversion
Edge function parameters: [1 0 0], [0 1 0], [0 0 1].
1/w interpolation parameter: [1 1 1].
Zero-area and back-facing triangles: the 3x3 matrix inverse of M only exists if the determinant of M isn't 0. The determinant calculates a function of the area of the triangle.

Arbitrary clip planes

To add arbitrary clip planes (user clip
planes) we need to add new clip edge
functions:
Algorithm
To summarize the algorithm:
setup:
    M^-1 = inverse of the 2D homogeneous vertex matrix
    three edge functions = rows of M^-1
    for each clip edge
        clip edge function = dot product test * M^-1
    interpolation function for 1/w = sum of rows of M^-1
    for each parameter
        interpolation function = parameter vector * M^-1
pixel processing:
    interpolate linear edge and parameter functions
    where all edge functions are positive
        w = 1/(1/w)
        for each parameter
            perspective-correct parameter = parameter * w
Cost

Setup:
Calculate the interpolation coefficients and slopes.
1 matrix inversion (1 division, multiple multiplications/additions).
1 matrix vector multiplication for each parameter. This includes the edge and clip edge functions, the 1/w value and the other parameters (r, g, b, z, s, t, r) (3x3 matrix/vector multiplication: 9 Mul + 6 Add).
Calculate the X and Y slopes (derivatives) for each parameter and the initial value at the first pixels (2 Mul + 2 Add per parameter).
Cost (2)

Per pixel:
Interpolate parameters: 1 addition per parameter.
Determine if the 3 edge functions are positive (3 sign tests).
Determine if the clip edge functions are positive (n sign tests).
Per pixel inside the triangle:
w = 1/(1/w) (1 division?)
For each parameter, perspective correct parameter value:
u = (u/w) * w (1 multiplication for each parameter).
OpenGL Rasterization
Rasterization/Fragments

Calculate the final color value of the
fragment:
Texture Read.
 Color sum.
 Fog.

Texture
Texture transformation and projection.
 Texture address calculation.
 Texture filtering.

Gouraud Shading

Lighting is calculated at each vertex and
interpolated across the triangle.
$K_{primary} = E + D\,A + D \sum_{i=1}^{n} C_i \max\{N \cdot L_i, 0\}$
$K_{secondary} = S \sum_{i=1}^{n} C_i \max\{N \cdot H_i, 0\}^m (N \cdot L_i > 0)$
K = Kprimary * T1 * T2 * ... * Tk + Ksecondary
Ti : Color samples for one of k texture maps.
* : One of several available texture combination operations
Phong Shading

Interpolates vertex normals and evaluates the lighting formula at each pixel.
K = Kemission + Kdiffuse + Kspecular

Problem: interpolating normals produces non-unit vectors. Use normalization cube maps.
Flat, Gouraud and Phong Shading
Bump Mapping
A hardware implementation of Phong Shading.
Uses a texture map to perturb the normal vector at each pixel (not interpolated).
Bump Map: 2D array of 3D vectors. Direction of the normal vector relative to the interpolated normal vector at the pixel.
Uses tangent space for storing the perturbations. Object to tangent space transformation (3x3 matrix multiplication).

Bump Mapping
Fragment
Texture combiners and fog.
Ownership, scissor, depth, alpha and stencil tests.
Blending or compositing.
Dithering and logical operations.

Per fragment (tests)

Determine the visibility of the fragment:
  Ownership test.
  Scissor test.
  Alpha test.
  Stencil test.
  Depth Buffer test.
Final pixel color:
  Blending.
  Dithering.
  Logic Operation.
OpenGL per fragment
Textures
Map from screen space coordinates to object
space to texture space.
 Texture formats: 1D, 2D, 3D and cubemap.
 Texture read: take a number of texture
samples (texels), filter them and combine the
result with other texture results or original
pixel color.




Size pixel > Size texel => minification
Size pixel = Size texel => copy
Size pixel < Size texel => magnification
Level of Detail

LOD is calculated to determine which mipmap level to use and whether minification or magnification applies.
Level of Detail

Select the sampling mode using parameter c (can be 0 or 0.5):
If λ > c => minification
If λ <= c => magnification

Scale factor:
Minification

Minification:
 Nearest: the texel in the center of the texture coordinates is
read.

Linear: interpolation (bilinear).
Minification(2)
Mipmapping
A texture is formed by a pyramidal data structure of max(n, m) + 1 images, from 2^n x 2^m down to 1x1 pixels.
The proper image is accessed using the LOD parameter.
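A small sketch of how the LOD could drive level selection. It assumes the scale factor rho (texels per pixel) is already known and that λ = log2(rho); the rounding used for the nearest level is a simplification, not the exact OpenGL rule, and all function names are hypothetical.

```python
import math


def mip_level_count(n, m):
    """Number of images in the pyramid of a 2^n x 2^m texture."""
    return max(n, m) + 1


def level_of_detail(rho):
    """LOD lambda for a scale factor rho, assuming lambda = log2(rho)."""
    return math.log2(rho)


def filter_mode(lam, c=0.0):
    """Minification if lambda > c, magnification otherwise."""
    return "minification" if lam > c else "magnification"


def nearest_mip_level(lam, levels):
    """Closest mipmap level to lambda, clamped to the available levels."""
    return min(max(int(math.floor(lam + 0.5)), 0), levels - 1)
```

For a 256x64 texture (n = 8, m = 6) this gives 9 levels, and a scale factor of 4 texels per pixel selects level 2.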

Mipmapping


Use calculated LOD for deciding which level to read from.
Filtering:
 NEAREST_MIPMAP_NEAREST and
LINEAR_MIPMAP_NEAREST

NEAREST_MIPMAP_LINEAR and LINEAR_MIPMAP_LINEAR
(trilinear filtering)
Magnification

LINEAR or NEAREST: similar to minification.
OpenGL Multitexture
Cubemap
A cubemap texture is composed of six 2D textures/images, one for each of the 6 faces of a cube.
The texture coordinates (s, t, r) are used as a direction vector from the center of the cube to one of the sides.
The coordinate with the greatest absolute value is used to determine which face to access.
The other two coordinates are recalculated to access the texture in that face as a normal 2D texture.
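The face selection can be sketched as below. The per-face sign conventions follow the OpenGL cube map definition as best recalled here and should be checked against the specification; the face labels and the function name are illustrative, and a zero direction vector is not handled.

```python
def cubemap_lookup(s, t, r):
    """Map a direction (s, t, r) to (face, u, v) with u, v in [0, 1]."""
    a_s, a_t, a_r = abs(s), abs(t), abs(r)
    if a_s >= a_t and a_s >= a_r:      # major axis: s
        face, ma = ("+X" if s > 0 else "-X"), a_s
        sc, tc = (-r if s > 0 else r), -t
    elif a_t >= a_r:                   # major axis: t
        face, ma = ("+Y" if t > 0 else "-Y"), a_t
        sc, tc = s, (r if t > 0 else -r)
    else:                              # major axis: r
        face, ma = ("+Z" if r > 0 else "-Z"), a_r
        sc, tc = (s if r > 0 else -s), -t
    # Divide by the major axis magnitude and remap from [-1, 1] to [0, 1].
    return face, (sc / ma + 1.0) / 2.0, (tc / ma + 1.0) / 2.0
```

A direction pointing straight along an axis lands in the center of the corresponding face.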

Cubemap
Texture environment and
texture functions
OpenGL 1.4, basic support for register combiners (NV_texture_shader for GF3 and beyond, ATI_fragment_shader for R200).
Defines source arguments and functions to combine textures and original color.
Functions: REPLACE, MODULATE, ADD, ADD_SIGNED, INTERPOLATE, SUBTRACT, DOT3_RGB, DOT3_RGBA.
Color channels (RGB) and alpha channel (A) are calculated (and configured) separately in parallel.

Shadow map
First pass: write depth buffer to a texture from the point of view of a light.
Second pass: compare z value in texture with current z value (eye). Use stencil buffer.
In OpenGL 1.4 use texture internal format DEPTH_COMPONENT and texture comparison mode: TEXTURE_COMPARE_MODE = COMPARE_R_TO_TEXTURE. TEXTURE_COMPARE_FUNC = {LEQUAL, GEQUAL}.

Projected textures

Divide by fourth component (s, t, r, q)
and access the texture (s/q, t/q, r/q).
Textures






Original: additional color (material) information per pixel, used to compensate for the lack of geometry information.
Current: color, normals or any other kind of information. Different formats (access modes) supported by hardware (1D, 2D, 3D, cubemap).
Dependent reads are supported (use information from one texture as the address to access another texture).
Minification, magnification.
MIP mapping (Multum in Parvo): multiple levels of detail for a single texture.
Filtering: bilinear (4 accesses to the same mipmap), trilinear (8 accesses to two mipmaps), anisotropic (up to 128 accesses, 16x trilinear).
Register combiners
Multitexture: multiple textures can be read per cycle (multiple texture units per pipe, up to 4 in Matrox Parhelia). Also multiple textures per pass (loop mode, up to 16 in DX9 hardware).
The output of those textures is combined (*, +, ...) with the pixel interpolated color.
First implementation of pixel shaders (not really instructions for a processor, but a configuration for the hardware).

GeForce256 Register Combiners
[Diagram: GeForce256 register combiners. Texture fetching fills a register set (fragment color, specular color, fog color/factor, texture 0, texture 1, spare 0). General combiners 0 and 1 each take 4 RGB inputs and 4 alpha inputs and produce 3 RGB outputs and 3 alpha outputs. The final combiner takes 6 RGB inputs and 1 alpha input.]
GeForce 3/4 Register Combiners
Texture Effects

There is a large number of new graphics effects that can be achieved with those extended texture functions:
Cubemap (lighting, shadows).
Bump Mapping (per-pixel lighting/shading).
Others?

Color Sum
C = Cpri + Csec.
 Combines diffuse and specular color.

Fog

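Calculate blending factor f (3 modes):
  LINEAR: f = (e - c) / (e - s)
  EXP: f = exp(-(d * c))
  EXP2: f = exp(-(d * c)^2)
c: FRAGMENT_DEPTH (eye to fragment distance) or FOG_COORDINATE (interpolated).
d: FOG_DENSITY.
s: FOG_START.
e: FOG_END.
Final color: C = f * Cr + (1 - f) * Cf, with f clamped to [0, 1], Cr the fragment color and Cf the fog color.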
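The three fog modes and the final blend can be sketched as follows, using the standard OpenGL fog equations (LINEAR, EXP, EXP2). The function names and default parameter values are illustrative assumptions.

```python
import math


def fog_factor(mode, c, density=1.0, start=0.0, end=1.0):
    """Blending factor f for fog coordinate c, clamped to [0, 1]."""
    if mode == "LINEAR":
        f = (end - c) / (end - start)
    elif mode == "EXP":
        f = math.exp(-density * c)
    elif mode == "EXP2":
        f = math.exp(-((density * c) ** 2))
    else:
        raise ValueError("unknown fog mode: " + mode)
    return min(max(f, 0.0), 1.0)


def apply_fog(f, color, fog_color):
    """Final color C = f * Cr + (1 - f) * Cf, per component."""
    return tuple(f * cr + (1.0 - f) * cf for cr, cf in zip(color, fog_color))
```

A fragment at the fog start keeps its own color (f = 1), while one at or beyond the fog end takes the fog color (f = 0).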
Ownership Test

Is the current pixel (x, y) owned by the current OpenGL context?
Scissor Test
void Scissor(int left, int bottom, sizei width, sizei height).
If left <= x < left + width and bottom <= y < bottom + height the test passes.
Otherwise it fails and the fragment is discarded.

Alpha Test
void AlphaFunc(enum func, clampf ref)
Compares the current fragment alpha (A) component with the reference value using a function (NEVER, ALWAYS, LESS, LEQUAL, EQUAL, GEQUAL, GREATER, NOTEQUAL).
If the test fails the fragment is discarded.

Stencil Test
void StencilFunc(enum func, int ref, uint mask).
void StencilOp(enum sfail, enum dpfail, enum dppass).
Stencil Buffer: an n-bit (usually 8-bit) buffer per pixel in the framebuffer.
Tests the current stencil buffer value for the fragment against the reference value, applying a binary mask and using a test function.
If the function fails the fragment is discarded and the sfail function is executed over the stencil entry.
The stencil buffer is also updated after the depth test: the dpfail function is executed when the depth test fails and dppass when the depth test passes.

Stencil Test

Test functions: NEVER, ALWAYS, LESS, LEQUAL, EQUAL, GEQUAL, GREATER, NOTEQUAL.

Update functions: KEEP, ZERO, REPLACE, INCR, DECR, INVERT, INCR_WRAP, DECR_WRAP.

Applications:
  Shadow volumes.
  Shadow maps.
  Others?
Depth Buffer Test


void DepthFunc(enum func)
Test functions (fragment z value with framebuffer z value):
 NEVER
 ALWAYS
 LESS
 LEQUAL
 EQUAL
 GREATER
 GEQUAL
 NOTEQUAL

If test fails fragment is discarded.

If enabled stencil update functions are called.
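The interaction between the stencil and depth tests can be sketched for a single pixel. The dictionary layout for the pixel and the render state is a hypothetical choice for illustration; the comparison order (reference value against stored stencil, fragment z against stored z) follows OpenGL.

```python
import operator

FUNCS = {
    "NEVER": lambda a, b: False,
    "ALWAYS": lambda a, b: True,
    "LESS": operator.lt,
    "LEQUAL": operator.le,
    "EQUAL": operator.eq,
    "GEQUAL": operator.ge,
    "GREATER": operator.gt,
    "NOTEQUAL": operator.ne,
}


def stencil_op(op, value, ref, bits=8):
    """Apply a stencil update function to the stored stencil value."""
    top = (1 << bits) - 1
    if op == "KEEP":
        return value
    if op == "ZERO":
        return 0
    if op == "REPLACE":
        return ref & top
    if op == "INCR":
        return min(value + 1, top)
    if op == "DECR":
        return max(value - 1, 0)
    if op == "INVERT":
        return ~value & top
    if op == "INCR_WRAP":
        return (value + 1) & top
    if op == "DECR_WRAP":
        return (value - 1) & top
    raise ValueError("unknown stencil op: " + op)


def stencil_depth_test(frag_z, pixel, state):
    """Run stencil then depth test; update pixel in place.

    Returns True if the fragment survives both tests.
    """
    ref, mask = state["ref"], state["mask"]
    if not FUNCS[state["stencil_func"]](ref & mask, pixel["stencil"] & mask):
        pixel["stencil"] = stencil_op(state["sfail"], pixel["stencil"], ref)
        return False
    if not FUNCS[state["depth_func"]](frag_z, pixel["z"]):
        pixel["stencil"] = stencil_op(state["dpfail"], pixel["stencil"], ref)
        return False
    pixel["stencil"] = stencil_op(state["dppass"], pixel["stencil"], ref)
    pixel["z"] = frag_z
    return True
```

Counting depth failures with INCR in dpfail, as in the state used for a quick check of this sketch, is the kind of setup shadow volume techniques rely on.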
Z-Buffer









Visibility test.
1 read from the Z-buffer (24 bits).
If the test fails the fragment is discarded.
If not, 1 write to the Z-buffer (24 bits).
Early Z test (avoids useless work).
Hierarchical Z-Buffer: reduces bandwidth.
Z-Buffer compression: reduces bandwidth and memory usage.
Fast Z clear.
Pixel shaders that change pixel depth (Z) disable the early Z test.
Hierarchical Z, Z Compression and Fast Z-Clear
Blending
Combine fragment color with framebuffer color.
Blend equations:
  FUNC_ADD: C = Cs*S + Cd*D
  FUNC_SUBTRACT: C = Cs*S - Cd*D
  FUNC_REVERSE_SUBTRACT: C = Cd*D - Cs*S
  MIN: C = min(Cs, Cd)
  MAX: C = max(Cs, Cd)
Blend functions: weight factors for the blend equation.
Blend color: Cc constant color.
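The blend equations can be written down directly. In this sketch the weight factors S and D are single scalars for brevity (in OpenGL they are per-component factors chosen by the blend functions), and the result is clamped to [0, 1]; the function name is illustrative.

```python
def blend(equation, cs, cd, s=1.0, d=1.0):
    """Blend source color cs with destination color cd, per component."""
    def one(src, dst):
        if equation == "FUNC_ADD":
            c = src * s + dst * d
        elif equation == "FUNC_SUBTRACT":
            c = src * s - dst * d
        elif equation == "FUNC_REVERSE_SUBTRACT":
            c = dst * d - src * s
        elif equation == "MIN":
            c = min(src, dst)
        elif equation == "MAX":
            c = max(src, dst)
        else:
            raise ValueError("unknown blend equation: " + equation)
        return min(max(c, 0.0), 1.0)  # clamp to the displayable range
    return tuple(one(a, b) for a, b in zip(cs, cd))
```

Classic alpha blending corresponds to FUNC_ADD with S set to the source alpha and D set to one minus the source alpha.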

Dithering
Approximates a higher precision fragment color with a lower precision framebuffer color.
Used?

Logical Operation
From an early OGL extension.
 Operations:

Fragment Program
Pixel Shaders

Pixel Shader 1.0, 1.1, 1.2, 1.3: program the register combiners stage in NVidia GeForce3 (NV20) and GeForce4 (NV25). Supported in DX8 and NV_texture_shader/NV_texture_shader2.

Pixel Shader 1.4: ATI R200 (Radeon 8500), extra features but also based on register combiner hardware. Supported in DX8.1 and ATI_fragment_shader.
Pixel Shaders
Pixel Shader 2.0: programmable shaders (like vertex shaders) but without branching. To be supported in DX9 and ARB_fragment_shader.
Pixel Shader 3.0: extended pixel shaders, unknown features (branching?, NV30 pixel shaders?). To be supported in DX9 or DX9.1.

Pixel Shader

Pixel Shader 1.4:
8 constants.
 Two phases divided in 4 parts:

Optional Sampling (Texture read): up to 6
textures.
 Address Shader: up to 8 instructions.
 Optional Sampling: up to 6 textures, can be
dependent reads.
 Color Shader: up to 8 instructions.

Pixel Shaders




PS2 pixel shaders are true processors (?). Based on Vertex Shaders but without branching.
Replaces (or complements) the register combiner stage (NV30).
Most instructions of the vertex shader are present in the pixel shader (except branches).
Condition codes, swizzle, negate, absolute value, mask, conditional mask (NV30).
Pixel Shaders
Additional instructions (NV30):
  Texture read: TEX, TEXP, TXD.
  Partial derivatives: DDX, DDY.
  Pack/Unpack: PK2H, PK2US, PK4B, PK4UB, PK4UBG, UP2H, UP2US, UP4B, UP4UB, UP4UBG.
  Fragment conditional kill: KIL.
  Extra math: LRP (linear interpolation), X2D (2D coordinate transform), RFL (reflection), POW (exponentiation).
R300 Pixel Shader
Pixel Shader

Inputs:
  1 position (x, y, z, 1/w).
  2 colors (4-component vector RGBA).
  8 texture coordinates.
  1 fog coordinate.
Outputs:
  Fragment color (RGBA), optionally new fragment depth.
  In NV30/R300 also to 4 RGBA textures.
Pixel Shader

Temporaries:
  NV30: 32 32-bit registers (64 16-bit registers).
  R300: 12 temporary registers.
Constants:
  NV30: unlimited? (maybe memory?). Accessed by 'name' (label). Also literal constants (embedded).
  R300: 32 constants.
DX9 (PS 2.0): 16 samplers and 8 texture coordinates.
Pixel Shader
R300: 64 ALU instructions, 32 texture instructions, 4 levels of dependent read. Up to 96 instructions (?).
R300:
  ALU instructions: ADD, MOV, MUL, MAD, DP3, DP4, FRAC, RCP, RSQ, EXP, LOG, CMP.
  Texture: TEXLD, TEXLDP, TEXLDBIAS, TEXKILL.
NV30: up to 1024 instructions.
Others

Antialiasing
  Anisotropic Filtering (textures).
  Line Antialiasing.
  Edge Antialiasing.
  Full Screen Antialiasing (FSAA).
  Supersampling.
  MultiSampling.

Display
  Gamma correction.
  Digital to analog conversion.

3D Graphic Hardware Pipeline
Command Processor.
 Vertex Shader.
 Rasterization.
 Pixel Shader.
 Fragment Operations and Tests.

Vertex data (16x4D): 1 position, 1 weight, 1 normal, 2 colors, 1 fog coord, 8 texture coords.
Vertex output (15x4D): 1 homogeneous position, 4 colors, 1 fog coord, 1 point size, 8 texture coords.
Fragment (10x4D): 2 colors, 8 texture coords.
Fragment output (10x4D): 1 color, 1 depth coordinate.
[Diagram: CPU and memory feed the Command Processor, followed by the Vertex Shader, Rasterization, the Pixel Shader, and Fragment Operations and Tests, ending at the framebuffer (color buffer, Z buffer, stencil buffer). Every stage uses the OpenGL state; the Vertex Shader also uses the vertex program and constants, and the Pixel Shader uses the fragment program, constants and texture memory. Fragment coordinates travel alongside the fragment data.]
Command Processor
Receives commands from the CPU (driver, OpenGL/Direct3D).
Fetches data from memory: vertex data (DMA).
Updates and stores OpenGL/Direct3D render state.

Vertex Shader
Transforms and lights vertex streams.
Vertex shader program (from GPU memory?).
Vertex shader constants (from GPU memory?).
Inputs: vertex data 16x4D.
Outputs: vertex data 15x4D.

Rasterization





Includes:
  Clipping.
  Divide by w.
  Affine transform.
  Primitive assembly.
  Culling.
  Setup.
  Fragment generation.
Receives vertices and produces fragments.
Uses OpenGL/Direct3D render state.
Input: vertex (15x4D).
Output: fragments (10x4D).
Pixel Shader
Shades fragments: calculate texture address, read texture, color operations.
Pixel Shader program and constants (from GPU memory?).
Texture read: TMU (texture sample, filter unit, texture cache, GPU memory).
Optional:
  Modify depth coordinate (1 Z output).
  Render to texture (up to 4 color outputs).
Input: fragment (12x4D).
Output: color (2x4D).

Fragment Operations and Tests






Includes (OpenGL):
 Fog.
 Color Sum.
 Ownership Test.
 Scissor Test.
 Alpha Test.
 Stencil Test.
 Depth Test.
 Blend.
 Logic Operation.
Accesses framebuffer (GPU memory). Updates framebuffer.
Framebuffer: color, Z and stencil.
OpenGL/Direct3D render state defines operations.
Input: color.
Output: FB updated.
Others

Antialiasing
  Anisotropic Filtering (textures).
  Line Antialiasing.
  Edge Antialiasing.
  Full Screen Antialiasing (FSAA):
    Supersampling.
    MultiSampling.
TBDR: Tile Based Deferred Rendering (STMicro PowerVR).
HOS (High Order Surfaces): N-Patches, Bezier, Displacement Mapping, TruForm, Tessellation.
[Diagram: detailed hardware pipeline. Commands arrive over AGP at the command processor. Vertex path: vertex buffer (vertex array read from memory), vertex setup, vertex shader (vertex program, vertex constants), vertex cache, primitive assembly (primitive list), triangles. Fragment path: triangle setup, fragment generator producing fragments (color, position, Z, textures), early Z test (Z buffer), pixel shader setup, pixel shader (textures, pixel shader program, pixel shader constants), fog & color sum (GL_FOG, GL_COLOR_SUM), ownership & scissor tests (GL_SCISSOR_TEST), alpha test (GL_ALPHA_TEST), stencil test (GL_STENCIL_TEST, stencil buffer), depth test (GL_DEPTH_TEST, Z buffer), blend (GL_BLEND, color buffer), logic op (GL_COLOR_LOGIC_OP), ending with the pixel written to the framebuffer.]
Vertex Shader
The command processor sends a vertex stream to the vertex shaders.
A vertex buffer stores data read from DMA.
A vertex cache (~ 10 vertices) can be used to avoid executing the vertex shader twice for the same vertex.
The vertex stream is grouped into primitives and sent to the rasterizer.
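The effect of the post-transform vertex cache can be illustrated with a short simulation. FIFO replacement and the function name `shade_indexed` are assumptions made for this sketch; the slides only give the approximate cache size.

```python
from collections import deque


def shade_indexed(indices, shade, cache_size=10):
    """Run `shade` on an index stream, reusing cached results.

    Returns the shaded vertices in stream order and the number of
    times the vertex shader actually ran.
    """
    cache = {}       # vertex index -> shaded vertex
    order = deque()  # insertion order, oldest first (FIFO replacement)
    runs = 0
    out = []
    for idx in indices:
        if idx not in cache:
            if len(order) == cache_size:
                del cache[order.popleft()]  # evict the oldest entry
            cache[idx] = shade(idx)
            order.append(idx)
            runs += 1
        out.append(cache[idx])
    return out, runs
```

For two triangles sharing an edge, given as the index list [0, 1, 2, 2, 1, 3], the shader runs 4 times instead of 6.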

Hardware Pipeline
[Diagram: vertex fetch path. Commands arrive over AGP at the command processor, which sends vertex array addresses to a memory fetch unit and index lists to an index FIFO. Fetched vertex arrays fill the vertex buffer, which feeds vertex data to the vertex shader. Indices are looked up in the vertex cache (hit/miss); transformed and lit (T&L) vertex data goes to primitive assembly, which groups n vertices into a primitive and passes it through the primitive FIFO (offsets) and the primitive buffer (primitive data).]
Vertex Shader Architecture





SIMD architecture. Registers are 128 bits wide, with four 32-bit fields.
Instruction set: typical arithmetic instructions (vector mul, add), some special instructions (ARL, DST), some complex mathematical instructions (EXP, COS), and support for branching, loops and procedures.
3 different sources of data:
  Input stream (~ 16 registers).
  Constants (~ 256 registers).
  Temporaries (~ 16 registers).
2 different destinations:
  Output stream (~ 15 registers).
  Temporaries (~ 16 registers).
Condition code registers (NV30) and boolean constants (R300, DX9) for conditional 'execution'.
Vertex Shader Inputs and Outputs
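[Diagram: vertex shader register files and operand paths. Sources: vertex input (16 x 128 bits), temporaries (16 x 128 bits), constants (256 x 128 bits) and address registers (2 x 128 bits). Source registers (SREG) pass through mux/abs/negate/swizzle into the ALU selected by the opcode (OP); results pass through the write mask to the destination registers (DREG): the temporaries or the vertex output (15 x 128 bits). A stack is also shown.]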
Vertex Shader Architecture
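[Diagram: vertex shader datapath. The PC (with +1 increment and branch unit) addresses the instruction memory and loads the IR. Constants, vertex input, address and temporary registers are selected through muxes, then pass through swizzle and neg/abs into the ALU. Results update the condition codes (CCs) and go through the write mask to the temporaries or the vertex output.]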
Vertex Shader: NV20
Exposes programmability of a small part of the geometry pipeline.
Vertex load & store, format conversion, primitive assembly, clipping and triangle setup occur completely in parallel, in pipeline fashion.
4-wide fine grained SIMD FP to provide the necessary performance, running multiple execution threads to maintain efficiency and provide a very simple programming model.

NV20: Introduction





Independent vertices.
IEEE single precision FP.
4 component vectors (x, y, z, w).
Input registers can have their components arbitrarily rearranged/replicated (swizzled).
Any operation generating a scalar must generate that scalar replicated across all components, and output writes have a component write mask.
NV20: Program Model
NV20: Input Attributes


Input Attributes:
  16 quad-float vertex source attribute registers.
  Position, normal, two colors, up to 8 texture coordinate sets, skin weights, fog and point size.
  Default 0.0 for second and third components, 1.0 for the fourth.
  Attributes are persistent.
  Only one vertex attribute may be read per program instruction.
Constant memory:
  96 quad floats.
  Can only be loaded before vertices are processed.
  Only one constant may be read by one program instruction.
  The program may not write to constants.
NV20: Input Attributes

Integer address register:
  Loaded using ARL.
  Indexed constant reads, with out-of-range reads returning (0,0,0,0).
Read/Write register file:
  12 quad floats.
  Three reads and one write per instruction.
  Initialized to (0,0,0,0) per vertex.
  Any vector read may be sourced as multiple operands and individually swizzled/negated each time.
NV20: Output attributes







Standard mapping for the fixed function pipeline at the homogeneous clip space point.
Position for clipping.
Vertex color output clamped to the range 0.0 to 1.0.
Fog distance, point size.
8 texture coordinates.
All instruction writes have an optional 4-component write mask.
Initialized to (0.0, 0.0, 0.0, 1.0).
NV20: Instruction Set.


No branching.
Constant latency: issue any instruction per clock and execute all instructions with the same latency. All operands are immediately available, limiting the size of registers and memory banks.
NV20: Hardware Implementation

Two blocks: vertex attribute buffer
(VAB) and the floating point core.
NV20: VAB







The VAB is responsible for vertex attribute persistence.
16 input attributes.
When a write to an address is received, the attribute is set to the defaults (0.0, 0.0, 0.0, 1.0) and the valid data overwrites the corresponding components.
The VAB drains into a number of input buffers (IB) that are used to feed the FP core in a round robin fashion.
Dirty bits are maintained in the VAB so only changed attributes are updated when the same buffer is again the drain target.
The transfer of a vertex is triggered by a write to address 0 (vertex position).
To prevent bubbles during simultaneous loading and draining of the VAB, incoming writes may push out the contents of the target address, superseding a default drain sequence.
NV20: VAB
NV20: Floating Point Core










Processes the instruction set.
Multithreaded vector processor operating on quad-float data.
Vertex data read from input buffers and transformed into
output buffers (OB).
Same latency for vector and special function units.
Multiple vertex threads are used to hide this latency.
SIMD VU: MOV, MUL, ADD, MAD, DP3, DP4, DST, MIN, MAX,
SLT, SGE.
Special FU: RCP, RSQ, LOG, EXP, LIT.
VU is approximately IEEE (no denormalized numbers or
exceptions, rounding always toward negative infinity).
1 instruction per clock and all input/output options have no
performance penalty.
All input vectors are available with no latency.
NV20: Floating Point Core
Vertex Shader: R300



4 vertex shader units.
1 scalar unit, 1 vector unit.
Registers:
  ALU Registers:
    Constants: 256 read only vectors.
    Temporary: 12 read/write vectors.
    Input: 16 read only vectors.
    Output: 15 write only vectors.
  Flow Control Registers:
    Integer Constant: 16 read only vectors.
    Address: 1 read/write vector.
    Loop Counter: 1 scalar.
    Boolean Constant: 16 read only bits.
R300: Instructions
Shaders up to 256 instructions long.
Up to 64K executed instructions per vertex.
ALU instructions: ADD, DP3, DP4, EXP, EXPP, EXPE, FRAC, LOG, LOGP, MAD, MADDX2, MAX, MIN, MOV, MUL, POW, RCP, RSQ, SGE, SLT.
Control Flow instructions: CALL, LOOP, ENDLOOP, JUMP, JNZ, LABEL, REPEAT, ENDREPEAT, RETURN.
Address Instructions: ARL, ARR.
Graphic Instructions: DST, LIT.
Instructions based on DX9 VS2.0.
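The two graphics instructions, DST and LIT, pack common lighting arithmetic into one operation. The sketch below follows their usual definition in DirectX vertex shaders and ARB_vertex_program. It assumes a non-negative specular power and omits the clamp that real implementations apply to the exponent, so it is an illustration and not a description of the R300 datapath.

```python
def lit(src):
    """LIT: lighting coefficients from (N.L, N.H, unused, specular power)."""
    n_dot_l, n_dot_h, _, power = src
    diffuse = max(n_dot_l, 0.0)
    # The specular term is only produced when the surface faces the light.
    specular = max(n_dot_h, 0.0) ** power if n_dot_l > 0.0 else 0.0
    return (1.0, diffuse, specular, 1.0)


def dst(src0, src1):
    """DST: builds (1, d, d*d, 1/d) from (_, d*d, d*d, _) and (_, 1/d, _, 1/d)."""
    return (1.0, src0[1] * src1[1], src0[2], src1[3])


lit_front = lit((0.5, 0.5, 0.0, 2.0))   # surface facing the light
lit_back = lit((-0.3, 0.9, 0.0, 8.0))   # surface facing away
distance = dst((0.0, 4.0, 4.0, 0.0), (0.0, 0.5, 0.0, 0.5))  # d = 2
```

The DST result is laid out so that a single DP3 with the (Kc, Kl, Kq) attenuation constants yields the attenuation denominator.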

NV30: Overview








Supports all VS1 instructions and features.
Beyond VS2?
Condition codes.
Branches and subroutines.
Modifiers: absolute.
User clip support (new output registers CLP0 to CLP5).
New instructions.
More registers.
NV30: Overview
Up to 256 instructions per program.
Up to 64K executed instructions per vertex.
16 temporary registers.
2 vector address registers.
256 program parameters (constants).

NV30: Condition Codes
4 component register:
  LT: less than zero.
  EQ: equal to zero.
  GT: greater than zero.
  UN: unordered, for comparisons involving NaN.
Instructions optionally update condition code state:
  “C” suffix: DP4C, MOVC.
  “CC” pseudo register for updating condition codes.
Condition code used in:
  Branches and procedure call/return.
  Result masking.

NV30: Modifiers

Source:
  Swizzle
  Negate
  Absolute
Target:
  Masking
  Conditional masking
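A minimal sketch of the source and target modifiers, assuming the order swizzle, then absolute value, then negate (so that an operand such as -|R0.wzyx| can be expressed). The helper names are hypothetical.

```python
INDEX = {"x": 0, "y": 1, "z": 2, "w": 3}


def read_source(reg, swizzle="xyzw", negate=False, absolute=False):
    """Apply the source modifiers to a 4-component register."""
    value = [reg[INDEX[s]] for s in swizzle]   # swizzle: reorder or replicate
    if absolute:
        value = [abs(v) for v in value]
    if negate:
        value = [-v for v in value]
    return tuple(value)


def write_target(dest, result, mask="xyzw"):
    """Write mask: only the components named in the mask are updated."""
    return tuple(r if name in mask else d
                 for name, d, r in zip("xyzw", dest, result))


r0 = (1.0, -2.0, 3.0, -4.0)
operand = read_source(r0, "wzyx", negate=True, absolute=True)  # -|R0.wzyx|
r1 = write_target((0.0, 0.0, 0.0, 0.0), (5.0, 6.0, 7.0, 8.0), mask="xz")
```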

NV30: Branching and subroutines

BRA:
  Unconditional.
  Conditional: BRA label (LE.xyww)
  Computed (indirect): BRA [A1.z] (GT.x)
Call & return for subroutines:
  CAL & RET.
  Same options as with branches.
  Four levels of subroutine execution.
  No parameter stack.
NV30: Clipping
New output registers: o[CLP0]..o[CLP5].
Active when GL_CLIP_PLANEn is enabled.
Clip coordinate n interpolated across the primitive.
Only the portion of the primitive where the clip coordinate is greater than zero is rasterized.
Hardware performs fast trivial reject if all clip coordinates of a primitive are negative.
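The clip coordinate rules above can be illustrated with a short sketch. It assumes the vertex program writes a plane/position dot product to o[CLPn], which is a common choice but not the only one a program could make. The function names are hypothetical.

```python
def clip_distance(plane, position):
    """Value a vertex program might write to o[CLPn] (assumed: plane . position)."""
    return sum(p * v for p, v in zip(plane, position))


def trivially_rejected(distances):
    """True when every vertex of the primitive has a negative clip coordinate."""
    return all(d < 0.0 for d in distances)


def crossing(d0, d1):
    """Edge parameter t in [0, 1] where the interpolated clip coordinate is zero."""
    return d0 / (d0 - d1)


plane = (1.0, 0.0, 0.0, 0.0)  # keeps the half-space x > 0
triangle = [(-1.0, 0.0, 0.0, 1.0), (3.0, 0.0, 0.0, 1.0), (1.0, 2.0, 0.0, 1.0)]
distances = [clip_distance(plane, v) for v in triangle]
t = crossing(distances[0], distances[1])  # edge from vertex 0 to vertex 1

behind = [clip_distance(plane, v)
          for v in [(-1.0, 0.0, 0.0, 1.0), (-2.0, 1.0, 0.0, 1.0), (-3.0, 0.0, 0.0, 1.0)]]
```

Because the clip coordinate is interpolated linearly, the zero crossing on an edge follows directly from the two endpoint values.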

NV30: New Instructions
ARL: now supports loading the 4-component integer address registers A0 and A1.
ARR: like ARL except that it rounds rather than truncates before storing the integer result in an address register.
BRA, CAL, RET: branching instructions.
COS, SIN: high precision trigonometric functions.
FLR, FRC: floor and fraction of floating point values.
EX2, LG2: high-precision exponentiation and logarithm functions.
ARA: adds pairs of components of an address register, useful for looping and other operations.
SEQ, SFL, SGT, SLE, SNE, STR: six new “set on” instructions similar to SLT and SGE.
SSG: “set sign” operation generates a vector holding -1.0 for negative operand components, 0 for zero components, and +1.0 for positive components.
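Several of the new instructions are simple component-wise operations and can be sketched directly. The helper `per_component` is hypothetical, and NaN handling is not modelled.

```python
import math


def per_component(fn):
    """Lift a scalar function to a component-wise vector instruction."""
    return lambda *srcs: tuple(float(fn(*c)) for c in zip(*srcs))


SSG = per_component(lambda a: (a > 0) - (a < 0))   # set sign: -1.0, 0.0 or +1.0
FLR = per_component(math.floor)                    # floor
FRC = per_component(lambda a: a - math.floor(a))   # fraction, always in [0, 1)

# "Set on" instructions write 1.0 where the comparison holds, else 0.0.
SEQ = per_component(lambda a, b: a == b)
SGT = per_component(lambda a, b: a > b)
SLE = per_component(lambda a, b: a <= b)
SNE = per_component(lambda a, b: a != b)
SFL = per_component(lambda a, b: False)            # set on false: always 0.0
STR = per_component(lambda a, b: True)             # set on true: always 1.0

a = (1.0, 2.0, 3.0, 4.0)
b = (2.0, 2.0, 2.0, 2.0)
signs = SSG((-2.0, 0.0, 3.5, -0.5))
floors = FLR((1.75, -1.25, 0.0, 2.0))
fractions = FRC((1.75, -1.25, 0.0, 2.0))
```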

NV30: Instruction List
Add & multiply instructions: ADD, DP3, DP4, DPH, MAD, MOV, SUB.
Math functions: ABS, COS, EX2, FLR, FRC, LG2, LOG, RCP, RSQ, SIN.
Set on instructions: SEQ, SFL, SGE, SGT, SLE, SLT, SNE, STR.
Branching instructions: BRA, CAL, RET.
Address register instructions: ARL, ARR, ARA.
Graphics-oriented instructions: DST, LIT, RCC, SSG.
Minimum/maximum instructions: MAX, MIN.

Current GPUs
ATI R300.
3DLabs P10.
Matrox Parhelia.

ATI R300. Specs
0.15 micron technology.
110+ million transistors.
8 pixel rendering pipelines, 1 texture unit per pipeline, 16 textures per pass.
4 programmable vec4 vertex shader pipelines.
256-bit DDR memory bus.
Up to 256 MB of memory on board, clocked at over 300 MHz (19.2 GB/s).
AGP8X.
Full DirectX 9 Pixel and Vertex Shader support.
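The quoted peak memory bandwidth follows from the bus width, the memory clock and the two transfers per clock of DDR memory. The function name below is hypothetical.

```python
def peak_bandwidth_gb_s(bus_bits, clock_mhz, transfers_per_clock=2):
    """Peak bandwidth in GB/s (10^9 bytes per second) of a DDR memory bus."""
    bytes_per_transfer = bus_bits / 8
    return bytes_per_transfer * transfers_per_clock * clock_mhz * 1e6 / 1e9


r300 = peak_bandwidth_gb_s(256, 300.0)      # 32 bytes x 2 x 300 MHz
faster = peak_bandwidth_gb_s(256, 312.5)    # same bus at 312.5 MHz
```

A 256-bit bus at 300 MHz gives 19.2 GB/s, and the same bus at 312.5 MHz gives 20 GB/s. These are peak figures that ignore refresh and page-miss overheads.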

ATI R300. Specs.
ATI R300. GPU.
ATI R300. Memory Crossbar.
ATI R300. Vertex Shader.
ATI R300. Pixel Shader.
ATI R300. Pixel Shader.
3D Labs P10. Specs.









0.15-micron manufacturing process (same process as the GeForce4)
76M transistors
Fabbed at TSMC (NVIDIA's chips are made here as well)
860 ball HSBGA package (TSMC's latest packaging technology)
4 pixel rendering pipelines, can process two textures per pipeline
256-bit DDR memory interface (up to 20GB/s of memory bandwidth w/ 312.5MHz DDR)
up to 256MB of memory on-board
AGP 4X support
Full DX8 pixel and vertex shader support
3DLabs P10. Evolution.
3DLabs P10. Pipeline.
3DLabs P10. Pipeline.
3DLabs. Command.
3DLabs. Vertex Units.
3DLabs P10. Raster Pipe.
3DLabs P10. Texture Pipe.
3DLabs P10. Pixel Pipe.
3DLabs P10. Virtual Memory.
Matrox Parhelia. Specs.
0.15-micron GPU manufactured at UMC
80 Million transistors
4 pixel rendering pipelines, can process four textures per pipeline per clock
4 programmable vec4 vertex shaders
256-bit DDR memory bus (up to 20GB/s of memory bandwidth w/ 312.5MHz DDR)
up to 256MB of memory on board
AGP 4/8X support
Full DX8 pixel and vertex shader support

Matrox Parhelia. Pipeline.
Bibliography
http://developer.nvidia.com
http://mirror.ati.com/developer/index.html
http://graphics.stanford.edu/
http://www.opengl.org

Bibliography

“Real-Time Graphics Architecture”, Kurt Akeley and Pat Hanrahan. http://www.graphics.stanford.edu/courses/cs448a-01-fall
“The OpenGL Graphics System: A Specification (Version 1.4)”, Mark Segal and Kurt Akeley.
Bibliography

“Computer Graphics: Principles and Practice in C”, James D. Foley, Andries van Dam, Steven K. Feiner and John F. Hughes.
