Status-Week.242

Status – Week 242
Victor Moya
Summary
Current status.
Tests.
XBox documentation.
Post Vertex Shader geometry.
Rasterization.

Current Status

Basic Command Processor.

Read/Write GPU registers.
Read/Write GPU memory.
GPU commands.
No DMA/AGP data access.

Basic Memory Controller.

1 transaction served per cycle.
Memory module access latency accounted for.
Transmission latency accounted for.
3 buses (req/write + data): CP, StreamerFetch, StreamerLoader.
Current Status

Shader (Vertex Shader).

Multithreaded.
F/D/E/W pipeline.
Variable execution latency.
Dependency checking is currently done per full register; it should be component based.
Problems with the ‘end’ instruction (it requires something to be fetched after it and takes many cycles).
No branches (the support code exists but the instructions are not implemented).
No texture access (memory).
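
The difference between full-register and component-based dependency checking can be sketched with a small scoreboard. This is only an illustration in Python: the class and method names are hypothetical and are not taken from the simulator, and only read-after-write hazards are modelled.

```python
class Scoreboard:
    """Tracks pending writes per register component (hypothetical sketch)."""

    def __init__(self):
        # Register name -> set of components ('x', 'y', 'z', 'w') with a write in flight.
        self.pending = {}

    def issue_write(self, reg, mask="xyzw"):
        # Record that an instruction will write the masked components of reg.
        self.pending.setdefault(reg, set()).update(mask)

    def complete_write(self, reg, mask="xyzw"):
        # The write reached the register file: clear the masked components.
        comps = self.pending.get(reg, set())
        comps.difference_update(mask)
        if not comps:
            self.pending.pop(reg, None)

    def blocked_full_register(self, reg):
        # Full-register check: any pending component blocks the read.
        return reg in self.pending

    def blocked_by_component(self, reg, swizzle="xyzw"):
        # Component check: only the components actually read can block.
        return bool(self.pending.get(reg, set()) & set(swizzle))
```

With a write to r0.x in flight, a read of r0.y is blocked by the full-register check but not by the component check, so a sequence such as the two dp3 instructions of the light shader followed by a read of a different component would stall less often.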
Current Status

Streamer.

Pipelined:
Hit: Fetch/OCache/Insert/Commit.
Miss: Fetch/OCache/IRQInsert/IRQRead/AttrLoad/Sh/Store/Commit.
Stream and index based modes implemented.
No pre T&L cache (should it be added to the Streamer Loader?).
Supports out of order vertexes (shader or memory).
Doesn’t support data from the AGP.
Current Status

Streamer:

Streamer Loader pipeline should be (in hardware):
Insert in the IRQ.
Load from IRQ.
Setup Input: start address + address increment for each active attribute.
Attribute Load: request attribute from the MC, increment address generators.
Issue to Shader.
IRQ should be implemented with a pre T&L cache.
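
The Setup Input and Attribute Load steps above could look roughly like this. It is a Python sketch with hypothetical names; the attribute table format (base address and stride per attribute id) is an assumption, not the simulator's actual stream description.

```python
class AttributeStream:
    """Address generator for one active vertex attribute (hypothetical sketch)."""

    def __init__(self, start_address, stride):
        self.address = start_address  # next address to request
        self.stride = stride          # address increment per vertex

    def next_request(self):
        # Return the current address and advance the generator.
        addr = self.address
        self.address += self.stride
        return addr


def setup_input(attributes, first_vertex=0):
    """Setup Input: one generator per active attribute, positioned at first_vertex.

    attributes maps attribute id -> (base address, stride in bytes).
    """
    return {
        attr: AttributeStream(base + first_vertex * stride, stride)
        for attr, (base, stride) in attributes.items()
    }


def attribute_load(generators):
    """Attribute Load: addresses to request from the MC for one vertex."""
    return {attr: gen.next_request() for attr, gen in generators.items()}
```

In stream mode the generators simply advance vertex by vertex; in index mode `first_vertex` would be set from the index read for each vertex.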
Current Status

Comments:

Currently the signal latency/bandwidth is specified with raw numbers. Alternatives:
Use constants. Store them in a single ‘signal definition’ file for all units or separately in each unit (in which case they must be shared between the two boxes connected by the signal).
Use some kind of Architecture Description for signal delays, bandwidth, data bus width (to be used in memory transmission calculations and similar).
Currently most units only support single issue/fetch/process. They should be ‘generalized’ to multi-issue/fetch/process and parametrized.
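
As an illustration of the single ‘signal definition’ file alternative, a shared table could be looked up by both boxes connected to a signal. The Python sketch below is hypothetical: the signal names and the bandwidth/latency values are placeholders, not the numbers used in the simulator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SignalDef:
    name: str
    bandwidth: int  # objects that can be written per cycle
    latency: int    # cycles between write and read


# Single definition table shared by all units (placeholder names and values).
SIGNALS = {
    s.name: s
    for s in (
        SignalDef("CommandProcessorMemoryRequest", bandwidth=1, latency=1),
        SignalDef("StreamerLoaderMemoryRequest", bandwidth=1, latency=1),
        SignalDef("StreamerShaderInput", bandwidth=1, latency=2),
    )
}


def signal(name):
    """Both boxes connected by a signal look up the same definition."""
    return SIGNALS[name]
```

The writer and the reader of a signal then both call `signal("StreamerShaderInput")`, so the latency and bandwidth cannot drift apart between the two ends.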
Current Status

Signal Trace Analyzer -> Carlos.
Tests

OpenGL test trace:

Used glutSolidSphere with (1, 100, 100) as parameters:
100 batches.
– 2 triangle strips (200 triangles).
– 98 quad strips (9800 quads).
20000 vertexes.
Added a lighting shader replacing the normal model-view + projection matrix transformation: one green light at infinity with diffuse and specular components.
10 shader instructions.
Tests

Light shader:
// i0 Vertex Position
// i2 Vertex Normal
// c0 - c3 Model View-Project Matrix.
// c4 Light Direction
// c5 Light Half Vector
// c6.x Material shininess
// c7 Light ambient color
// c8 Light diffuse color
// c9 Light specular color
// o0 Vertex position (transformed)
// o1 Vertex color.
Tests
// Vertex Model View-Project transformation
dp4 o0.x, c0, i0
dp4 o0.y, c1, i0
dp4 o0.z, c2, i0
dp4 o0.w, c3, i0
// Compute diffuse and specular dot products and
// use LIT to compute lighting coefficients
dp3 r0.x, i2, c4
dp3 r0.y, i2, c5
mov r0.w, c6.x
lit r0, r0
Tests
// Accumulate color contributions
mad r1, r0.y, c8, c7
mad o1, r0.z, c9, r1
// Finish shader.
end
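
For checking simulator output, the light shader can be mirrored in a few lines of Python. This is a reference sketch, not simulator code: the function names are hypothetical, LIT is modelled as in the usual vertex program definition (x = diffuse dot product, y = specular dot product, w = exponent), and the exponent clamping done by real hardware is left out.

```python
def dp3(a, b):
    # Three-component dot product.
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def dp4(a, b):
    # Four-component dot product.
    return dp3(a, b) + a[3] * b[3]


def lit(src):
    """LIT: returns (1, diffuse, specular, 1). Assumes a non-negative exponent."""
    x, y, _, w = src
    diffuse = max(x, 0.0)
    specular = max(y, 0.0) ** w if x > 0.0 else 0.0
    return (1.0, diffuse, specular, 1.0)


def light_shader(i0, i2, c):
    """i0: position, i2: normal, c: constants c0..c9 as 4-tuples."""
    # dp4 o0.x .. o0.w: Model View-Project transformation.
    o0 = tuple(dp4(c[k], i0) for k in range(4))
    # dp3 r0.x / dp3 r0.y / mov r0.w / lit r0, r0 (r0.z is unused by LIT).
    r0 = lit((dp3(i2, c[4]), dp3(i2, c[5]), 0.0, c[6][0]))
    # mad r1, r0.y, c8, c7: diffuse contribution plus ambient.
    r1 = tuple(r0[1] * c[8][k] + c[7][k] for k in range(4))
    # mad o1, r0.z, c9, r1: add the specular contribution.
    o1 = tuple(r0[2] * c[9][k] + r1[k] for k in range(4))
    return o0, o1
```

A vertex whose normal faces away from the light gets only the ambient color, which is an easy case to compare against the rendered sphere.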
Tests

Results:
Simulated cycles: ~350K.
Simulation time: ~30s.
Signal trace size: ~150MB.

Tests

Bugs:

TraceReader::parseFP() failed to correctly read a negative number with a 0 before the decimal point.
GPU_CLAMP was using ‘<’ and ‘>’ when it should be using ‘<=’ and ‘>=’.
ShaderDecodeExecute was allowing the execution of later instructions in the same thread after a blocked instruction (data dependency).
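
The slides do not say what caused the parseFP() failure. One common way to get exactly this symptom is to take the sign from the parsed integer part, which is lost when that part is 0. The Python sketch below illustrates that assumed cause and a fix; the function names are hypothetical and this is not the TraceReader code.

```python
def parse_fp_buggy(text):
    """Sign is taken from the integer part, so "-0.5" loses its sign."""
    int_part, _, frac_part = text.partition(".")
    whole = int(int_part)  # int("-0") == 0: the sign is gone
    frac = int(frac_part or "0") / 10 ** len(frac_part)
    return whole - frac if whole < 0 else whole + frac


def parse_fp_fixed(text):
    """Sign is read from the text itself before parsing the digits."""
    negative = text.startswith("-")
    int_part, _, frac_part = text.lstrip("+-").partition(".")
    value = int(int_part or "0") + int(frac_part or "0") / 10 ** len(frac_part)
    return -value if negative else value
```

Here `parse_fp_buggy("-1.5")` is still correct, which is why a bug of this kind only shows up for values between -1 and 0.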
Tests

Changes:

Now ShaderDecodeExecute ignores any instruction received after an end instruction.
Added QUAD and QUADSTRIP support to the simulator (GPU.h, Rasterizer, Drawer).
Vertex color is clamped to 0.0 – 1.0 before being sent to OpenGL (Drawer). The correct behaviour would be to clamp color attributes when they exit the shader.
Added the glNormal3f and glFrustum OpenGL functions to the TraceReader and OGLtoAGPTransaction.
Tests

Changes:

OGLtoAGPTransaction now supports a third vertex attribute: normal.
OGLtoAGPTransaction now supports a ‘special’ shader mode (the one used for the light test). No support for OpenGL lighting is implemented.

Tests

Further tests:

Try to implement a sphere using icosahedron subdivision to create a triangle strip mesh to test the index stream mode.
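
A possible starting point for this test is sketched below in Python. It builds an indexed triangle list (not yet a triangle strip) by subdividing an icosahedron and projecting the new vertexes onto the unit sphere; shared vertexes are referenced by index, which is what the index stream mode needs. All names are hypothetical.

```python
import math


def normalize(v):
    # Project a point onto the unit sphere.
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)


def icosahedron():
    """12 vertexes and 20 triangular faces (indices into the vertex list)."""
    t = (1 + math.sqrt(5)) / 2
    verts = [(-1, t, 0), (1, t, 0), (-1, -t, 0), (1, -t, 0),
             (0, -1, t), (0, 1, t), (0, -1, -t), (0, 1, -t),
             (t, 0, -1), (t, 0, 1), (-t, 0, -1), (-t, 0, 1)]
    faces = [(0, 11, 5), (0, 5, 1), (0, 1, 7), (0, 7, 10), (0, 10, 11),
             (1, 5, 9), (5, 11, 4), (11, 10, 2), (10, 7, 6), (7, 1, 8),
             (3, 9, 4), (3, 4, 2), (3, 2, 6), (3, 6, 8), (3, 8, 9),
             (4, 9, 5), (2, 4, 11), (6, 2, 10), (8, 6, 7), (9, 8, 1)]
    return [normalize(v) for v in verts], faces


def subdivide(verts, faces):
    """Split every triangle into four, sharing the new edge midpoints."""
    verts = list(verts)
    midpoint_cache = {}

    def midpoint(a, b):
        key = (min(a, b), max(a, b))
        if key not in midpoint_cache:
            m = tuple((p + q) / 2 for p, q in zip(verts[a], verts[b]))
            verts.append(normalize(m))
            midpoint_cache[key] = len(verts) - 1
        return midpoint_cache[key]

    new_faces = []
    for a, b, c in faces:
        ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
        new_faces += [(a, ab, ca), (b, bc, ab), (c, ca, bc), (ab, bc, ca)]
    return verts, new_faces
```

Each subdivision level multiplies the face count by four, so a few levels are enough to get a batch with heavy index reuse.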
XBox Documentation
Interesting information about the Vertex Shader architecture and the T&L pipeline down to the Primitive Assembly Cache and the Triangle Setup.
Includes estimated sizes and clock latencies for most of the operations.

T&L pipeline (diagram):
Memory (200 MHz)
 -> cache line (raw vertex data)
Pre T&L Cache (4 KB, 4-way set associative, 128 32-B cache lines)
 -> raw vertex
Vertex Shader
 -> transformed and lit vertex
Post T&L Cache (16 – 24 entry FIFO)
 -> transformed and lit vertex
Primitive Assembly (3 vertices)
 -> 3 transformed and lit vertices
Triangle Setup
Rasterization
XBOX

Differences:

We have no Pre T&L cache.
The Post T&L cache seems to be accessed by the Primitive Assembly Cache. However, we push the vertex to the Rasterizer (or whatever lies after the shader).
Sending the shaded vertex to the primitive assembly takes multiple cycles (2+) depending on the number of attributes used by the vertex.
XBOX Vertex Shader

Registers:
16 input registers.
12 temporary registers.
192 constant registers.
1 address register.
11 output registers.

XBOX Vertex Shader

Instructions:

Shader operations:
– 13 MAC opcodes.
– 7 ILU (inverse logic unit) opcodes.
136 microcode instructions. Each instruction can:
– Read three registers with swizzle and negation.
– Compute one MAC op and one ILU op.
– Write up to one output register and two temporary registers with masking.
Shader types:
– Normal vertex shaders.
– Read/write vertex shaders.
– Vertex state shaders.
XBOX Vertex Shaders

Timing:
The clock speed is 250 MHz.
For normal shaders, instructions take between one-half cycle and one cycle to complete.
For read/write and vertex state shaders, instructions take between one cycle and six cycles to complete.

XBOX Vertex Shaders

Multithreaded:

Two copies of the vertex shader pipeline (2 VS).
Each copy can run up to three threads (3 active threads per shader).
Read/write vertex shaders and vertex state shaders run single threaded, on a single pipeline.

Stalling:

Instructions take six cycles to compute their outputs.
Bypasses: ALU, ILU and MLU bypasses.
Three cycles latency with bypasses.
Bypass allows swizzling and negation of the result.
Post Vertex Shader

(Based on the 3DLabs OpenGL2 overview.)
Primitive assembly.
User clipping.
Frustum clipping.
Perspective projection.
Viewport Mapping.
Polygon offset.
Polygon mode.
Shade mode.
Culling.
Post Vertex Shader

Primitive Assembly:

Get the three vertexes of a triangle.
Triangles: keep the last three vertexes, generate a primitive for every three new vertexes.
Triangle strip: keep the last three vertexes, generate a primitive with each new vertex (after the second).
Triangle fan: keep the first vertex and the last two vertexes, generate a primitive with each new vertex (after the second).
Similar for other primitives.
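
The three assembly rules above can be written down compactly. The Python sketch below works on a whole vertex list for clarity, whereas the hardware would only keep the two or three vertexes mentioned above; the function name and the mode strings are hypothetical. The swap on odd strip triangles follows the OpenGL convention so that all triangles keep the same winding.

```python
def assemble_triangles(vertices, mode):
    """Return the triangles (3-tuples) for mode 'triangles', 'strip' or 'fan'."""
    triangles = []
    if mode == "triangles":
        # One primitive for every three new vertexes.
        for k in range(0, len(vertices) - 2, 3):
            triangles.append(tuple(vertices[k:k + 3]))
    elif mode == "strip":
        # One primitive per new vertex after the second.
        for k in range(2, len(vertices)):
            a, b, c = vertices[k - 2], vertices[k - 1], vertices[k]
            # Swap the first two vertexes on odd triangles to keep the winding.
            triangles.append((a, b, c) if k % 2 == 0 else (b, a, c))
    elif mode == "fan":
        # Keep the first vertex and the last two.
        for k in range(2, len(vertices)):
            triangles.append((vertices[0], vertices[k - 1], vertices[k]))
    else:
        raise ValueError(f"unknown mode: {mode}")
    return triangles
```

For example, a strip of 102 vertexes yields 100 triangles, one per vertex after the second.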
Post Vertex Shader

User clipping:

At least 6 user clip planes.
Define a clip volume.
glClipPlane(enum p, double eqn[4]).
(p1 p2 p3 p4) · (x y z w) >= 0

Frustum clipping:

View volume.
-w <= x <= w
-w <= y <= w
-w <= z <= w
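
The two inside tests above translate directly into code. The Python sketch below assumes the plane and the vertex are expressed in the same coordinate space, and adds a conventional outcode so that a triangle with all three vertexes outside the same frustum plane can be rejected without clipping. Function names are hypothetical.

```python
def inside_user_plane(plane, v):
    """True when (p1 p2 p3 p4) . (x y z w) >= 0."""
    return sum(p * c for p, c in zip(plane, v)) >= 0


def inside_frustum(v):
    """True when -w <= x, y, z <= w."""
    x, y, z, w = v
    return -w <= x <= w and -w <= y <= w and -w <= z <= w


def frustum_outcode(v):
    """One bit per frustum plane the vertex lies outside of."""
    x, y, z, w = v
    code = 0
    tests = (x < -w, x > w, y < -w, y > w, z < -w, z > w)
    for bit, outside in enumerate(tests):
        if outside:
            code |= 1 << bit
    return code


def trivially_rejected(triangle):
    """True when all three vertexes are outside the same frustum plane."""
    a, b, c = (frustum_outcode(v) for v in triangle)
    return (a & b & c) != 0
```

Triangles that are neither fully inside (all outcodes zero) nor trivially rejected are the ones that actually need clipping.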
Post Vertex Shader

Clipping:
Clip polygon => add new vertexes => tessellate.
Clip triangle => add new vertexes => retessellate.
Use rasterization in homogeneous coordinates: just add more clipping edges.
Guard Band Clipping (scissor).

Post Vertex Shader
Divide by w.
Viewport transformation.

Scale to screen/window coordinate system.
glViewport(x, y, w, h)
glDepthRange(clampd n, clampd f)
xw = (px/2)*xd + ox
yw = (py/2)*yd + oy
zw = [(f-n)/2]*zd + (n + f)/2
ox = x + w/2
oy = y + h/2
px = w
py = h
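
Put together, the divide by w and the viewport equations above give the following. This is a Python sketch; the function name is hypothetical, and the defaults n = 0 and f = 1 are the OpenGL defaults for glDepthRange.

```python
def viewport_transform(clip, x, y, w, h, n=0.0, f=1.0):
    """Map a clip-space vertex to window coordinates.

    clip is (xc, yc, zc, wc); x, y, w, h are the glViewport parameters
    and n, f the glDepthRange parameters.
    """
    xc, yc, zc, wc = clip
    # Divide by w: normalized device coordinates.
    xd, yd, zd = xc / wc, yc / wc, zc / wc
    # Viewport parameters as defined above.
    px, py = w, h
    ox, oy = x + w / 2, y + h / 2
    xw = (px / 2) * xd + ox
    yw = (py / 2) * yd + oy
    zw = ((f - n) / 2) * zd + (n + f) / 2
    return xw, yw, zw
```

With a 640x480 viewport at the origin, the clip-space point (0, 0, 0, 1) lands at the window centre (320, 240) with depth 0.5.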
Post Vertex Shader

Back face culling:
Can be calculated using the area of the triangle (determinant of the three vertexes in homogeneous coordinates).
Negative or positive area.
Can also be used to cull zero area triangles.
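
The determinant test can be sketched as follows in Python. Each vertex is given as (x, y, w) taken from its clip coordinates; for positive w the sign of the determinant equals the sign of the projected area. Treating a positive determinant as front facing assumes the OpenGL default (counter-clockwise front faces); the function names are hypothetical.

```python
def det3(v0, v1, v2):
    """Determinant of the 3x3 matrix whose rows are (x, y, w) of each vertex."""
    (x0, y0, w0), (x1, y1, w1), (x2, y2, w2) = v0, v1, v2
    return (x0 * (y1 * w2 - y2 * w1)
            - y0 * (x1 * w2 - x2 * w1)
            + w0 * (x1 * y2 - x2 * y1))


def classify(v0, v1, v2):
    """Return 'front', 'back' or 'zero area' from the sign of the determinant."""
    d = det3(v0, v1, v2)
    if d == 0:
        return "zero area"
    return "front" if d > 0 else "back"
```

The same value is needed later by a homogeneous rasterizer setup, so the culling test comes almost for free in that case.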

Post Vertex Shader

Discard degenerate triangles:

If two or more vertexes are the same (checked by index or by full vertex comparison), the triangle can be discarded.
Rasterization

Alternatives:
Scanline incremental interpolation (DDA).
Rasterization in homogeneous coordinates.

Two phases:

Triangle setup.
– Set interpolation registers.
Fragment generation.
– Incrementally update the interpolants.
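
The two phases can be illustrated with screen-space edge functions, which is neither of the two alternatives exactly but shows the same setup/incremental-update split. In this Python sketch, setup computes the coefficients of E(x, y) = a*x + b*y + c for each edge, and fragment generation only adds a or b when stepping one pixel. The names are hypothetical; counter-clockwise triangles with integer coordinates are assumed and no fill rule for shared edges is applied.

```python
def triangle_setup(v0, v1, v2):
    """Return (a, b, c) per edge; E(x, y) >= 0 inside a CCW triangle."""
    edges = []
    for (xa, ya), (xb, yb) in ((v0, v1), (v1, v2), (v2, v0)):
        a = ya - yb
        b = xb - xa
        c = xa * yb - xb * ya
        edges.append((a, b, c))
    return edges


def generate_fragments(edges, x_min, y_min, x_max, y_max):
    """Walk the bounding box, updating the edge functions incrementally."""
    fragments = []
    # Evaluate each edge function once, at the first sample of the box.
    row = [a * x_min + b * y_min + c for a, b, c in edges]
    for y in range(y_min, y_max + 1):
        e = list(row)
        for x in range(x_min, x_max + 1):
            if all(v >= 0 for v in e):
                fragments.append((x, y))
            # Step one pixel in x: add the x increment of each edge function.
            e = [v + a for v, (a, _, _) in zip(e, edges)]
        # Step one row in y: add the y increment of each edge function.
        row = [v + b for v, (_, b, _) in zip(row, edges)]
    return fragments
```

Attribute interpolants (color, depth) can be stepped in the same way, with one (a, b, c) triple per interpolated value.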
