Transcript pact03

Compilation, Architectural Support, and Evaluation
of SIMD Graphics Pipeline Programs on a
General-Purpose CPU
Mauricio Breternitz Jr, Herbert Hum, Sanjeev Kumar
Microprocessor Research Labs,
Intel Corporation
Graphics Applications

Computationally intensive graphics applications are becoming increasingly popular

Computer-Aided Design
─ From Airplanes to Cars
Visualization of massive quantities of Data
Visual Simulators e.g. Training Pilots
Fancier Graphical User Interfaces
And, of course, Games
And this trend is continuing

As high-end applications become more mainstream
Parallel Architecture and Compilation Techniques, 2003
3D Application
OpenGL or DirectX
Graphics Pipeline
Stages (in slide order): Scene, Transform, Lighting, Clipping, Rasterization, Texture Mapping, Compositing, Display
Vertex Shaders
• Operate on every vertex in the scene
• Effects like
  • Blur
  • Diffuse and specular reflection
Pixel Shaders
• Operate on every pixel
• Effects like
  • Texturing
  • Fog blending
Vertex and Pixel Shaders

Need to operate millions of times a second
Small programs
Typically run on the graphics cards
However, most desktops do not have graphics cards that support programmable shaders
This work focuses on running Vertex Shaders on the main CPU
Pixel shaders have very high computational and bandwidth requirements
Graphics applications are designed to adapt to the available features and performance
Goals

Improving the performance of Vertex Shaders on the main CPU
Analyze the performance on today’s CPU
Better Compiler Optimizations
Additional Architectural Support
Identify three architectural and compiler enhancements
Significant impact on the performance
─ Roughly by a factor of 2
Outline





Motivation
Baseline Compiler
Three Enhancements
Performance Evaluation
Conclusions
Vertex Shader Programs
Virtual Machine
Vertex Input: 16 x 4 Registers
Temporary Registers: 12 x 4
Integer Registers: 84 x 1
Constant Memory: 256 x 4
SIMD ALU
Vertex Output: 15 x 4 Registers
dp4 oPos.x, v0, c[0]
dp4 oPos.y, v0, c[1]
dp4 oPos.z, v0, c[2]
dp4 oPos.w, v0, c[3]




Small Programs (at most 256 instructions)
SIMD instructions with xyzw components
Mask and Swizzle on each instruction
No state saved between vertices

mov oD0, c[4].wzyx

Read-only memory & Temporary Registers
Program cannot change control flow
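The register model on this slide can be illustrated with a small Python sketch. It is only a toy model of the xyzw semantics (swizzle on the sources, write mask on the destination, dp4 as a four-component dot product); the function names are hypothetical and it is not the DirectX API or the compiler described in the talk.

```python
# Toy model of xyzw SIMD registers; all names here are illustrative assumptions.
COMPONENTS = "xyzw"

def swizzle(reg, pattern="xyzw"):
    """Pick lanes of reg according to pattern, e.g. 'wzyx' reverses the register."""
    return tuple(reg[COMPONENTS.index(c)] for c in pattern)

def write_masked(dest, value, mask="xyzw"):
    """Write only the lanes named in mask; the other lanes of dest are kept."""
    return tuple(v if c in mask else d
                 for c, d, v in zip(COMPONENTS, dest, value))

def dp4(a, b):
    """Four-component dot product, returned as a scalar."""
    return sum(x * y for x, y in zip(a, b))

def mul(a, b):
    """Component-wise multiply."""
    return tuple(x * y for x, y in zip(a, b))

def transform(v0, c):
    """The four dp4 instructions from the slide: oPos.<lane> = dp4(v0, c[i])."""
    o_pos = (0.0, 0.0, 0.0, 0.0)
    for i, lane in enumerate(COMPONENTS):
        d = dp4(v0, c[i])
        o_pos = write_masked(o_pos, (d, d, d, d), lane)
    return o_pos

# mov oD0, c[4].wzyx  -> a swizzled copy
o_d0 = swizzle((1, 2, 3, 4), "wzyx")
```

In this model a write mask such as `x_z_` is passed as the string "xz", so a masked multiply writes only the x and z lanes and leaves y and w unchanged.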
Baseline Optimizing Compiler

Implemented a Compiler for Vertex Shaders
Input: Vertex Shader Assembly
Output: Optimized x86 (with SSE2)
Started with DirectX reference rasterizer: Interpreter
─ Used it as the front end
Use Olive pattern-matching code-generator generator
Graph-coloring based register allocator
Loop unrolling
List-scheduler
About 70% faster than a naïve translator
Translate into C and feed it to a C compiler
Characteristics of Generated Code

Mostly SIMD instructions (x86 with SSE2): 83-99% of instructions
Large basic blocks
Use of control flow is limited
Makes it easier to compile efficiently
Vertex Shader Assembly to x86 Assembly
10-20 times increase in number of instructions
mul r0.x_z_, v0.xyzz, v1.wwww
Outline





Motivation
Baseline Compiler
Three Enhancements
Performance Evaluation
Conclusions
1. New Instructions


Dot products are very common in Shaders
A dot product is expensive on x86
A sequence of 7 instructions
1 multiply, 2 add, 4 shuffle instructions
─ In the simple case
New dot product instructions
Compute dot product of two source operands and store it in each word of the destination operand
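To make the contrast concrete, here is a rough Python emulation of a multiply/shuffle/add reduction next to a single broadcast dot product. It is a sketch under simplifying assumptions: `shuffle` is an idealized one-source lane permutation, so this version needs fewer shuffles than the seven-instruction SSE2 sequence counted on the slide, and the helper names are hypothetical rather than real instruction mnemonics.

```python
def mul4(a, b):
    """Lane-wise multiply of two 4-wide vectors."""
    return tuple(x * y for x, y in zip(a, b))

def add4(a, b):
    """Lane-wise add of two 4-wide vectors."""
    return tuple(x + y for x, y in zip(a, b))

def shuffle(a, order):
    """Idealized lane permutation: result[i] = a[order[i]]."""
    return tuple(a[i] for i in order)

def dot4_sequence(a, b):
    """Dot product built from multiply, shuffle, and add steps."""
    p = mul4(a, b)                                # lane-wise products
    s = add4(p, shuffle(p, (1, 0, 3, 2)))         # (x+y, x+y, z+w, z+w)
    return add4(s, shuffle(s, (2, 3, 0, 1)))      # full sum in every lane

def dp4_broadcast(a, b):
    """Model of the proposed instruction: the dot product in each word."""
    d = sum(x * y for x, y in zip(a, b))
    return (d, d, d, d)
```

Both functions leave the same value in all four lanes, which is what lets a following masked write pick whichever destination component it needs.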
2. Mask Analysis Optimization

Traditional optimizers keep track of the liveness information on a per-register basis
Shaders: often only part of the SIMD register is live
Modify to do this for each word of the SIMD register
Analysis Phase
Annotate the IR with additional information
During live variable analysis, propagate the liveness mask depending on the instructions
Optimization Phase
Identify dead code
Replace some shuffle/mask instructions with move
─ Might get eliminated entirely during register allocation
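The analysis and dead-code steps can be sketched as a backward pass over straight-line code that tracks liveness per lane rather than per register. This is a minimal illustration, not the talk's implementation: the tuple-based instruction format and the `mask_liveness` name are assumptions, and only component-wise ops and dp4 are modeled.

```python
LANES = "xyzw"

def mask_liveness(program, live_out):
    """Per-lane liveness for straight-line code.

    program:  list of (op, dest, write_mask, [(src, swizzle), ...])
    live_out: dict mapping register name -> lanes live at the end, e.g. "xz"
    Returns the surviving instructions, each with its write mask narrowed
    to the lanes that are actually needed.
    """
    live = {reg: set(lanes) for reg, lanes in live_out.items()}
    kept = []
    for op, dest, wmask, srcs in reversed(program):
        needed = live.get(dest, set()) & set(wmask)
        if not needed:
            continue                      # no live lane is written: dead code
        live[dest] = live[dest] - needed  # these lanes are defined here
        for src, swz in srcs:
            if op == "dp4":
                used = set(swz)           # a dot product reads every source lane
            else:
                # component-wise op: dest lane i reads source lane swz[i]
                used = {swz[LANES.index(lane)] for lane in needed}
            live.setdefault(src, set()).update(used)
        narrowed = "".join(lane for lane in LANES if lane in needed)
        kept.append((op, dest, narrowed, srcs))
    kept.reverse()
    return kept

program = [
    ("mul", "r0", "xyzw", [("v0", "xyzw"), ("c0", "xyzw")]),
    ("mov", "r1", "xyzw", [("v1", "xyzw")]),   # r1 is never read
    ("mov", "oD0", "xz", [("r0", "wzyx")]),    # reads r0.w and r0.y only
]
result = mask_liveness(program, {"oD0": "xz"})
```

In the example the `mov` into `r1` is dropped as dead, and the `mul` only has to produce the y and w lanes of `r0`, which is the kind of information a per-register analysis would miss.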
3. Number of Registers



Spilling registers to memory can degrade performance
Investigate the impact of increasing the number of registers from 8 to 16
Why not more?
Trickier to encode it in the ISA
Outline





Motivation
Baseline Compiler
Three Enhancements
Performance Evaluation
Conclusions
Experimental Setup

10 Vertex Shaders
8-84 instructions
Only 3 of them have loops (Control)
2.2 GHz Pentium 4 processor
Instruction counts otherwise
Break down the instructions into categories
Measure performance by using the generated code to process an array of vertices
Compute average
Evaluation
[Figure: Normalized Execution Time (0 to 1) per Vertex Shader (B, CTC, L, PS, PL, PE, R, T, TS, W); series: Base, New Instructions Only, Mask Optimization Only, Both]

New dot-product instructions: 27.4% on average (estimate)
Reduces the number of instructions by 24%
Mask optimization: 19.5% on average
Both: 42% on average
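As a quick consistency check on these averages, the short Python sketch below combines the two individual figures. It assumes the percentages are reductions in normalized execution time and that the two enhancements compose multiplicatively; neither assumption is stated on the slide, and `combined_reduction` is a hypothetical helper.

```python
def combined_reduction(*reductions):
    """Combine fractional reductions, assuming each applies to what remains."""
    remaining = 1.0
    for r in reductions:
        remaining *= 1.0 - r
    return 1.0 - remaining

# New instructions (27.4%) and mask optimization (19.5%) together
both = combined_reduction(0.274, 0.195)

# Speedup implied by the reported 42% combined reduction
speedup = 1.0 / (1.0 - 0.42)
```

Under those assumptions the combined reduction comes out near 41.6%, close to the reported 42%, and a 42% reduction corresponds to a speedup of about 1.7x before the extra registers are taken into account.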
Evaluation Cont’d
[Figure: Normalized Instruction Count (0 to 1) per Vertex Shader (B, CTC, L, PS, PL, PE, R, T, TS, W); series: Base, 16 Registers]

Reduce the number of instructions by 8% on average
35-100% of the spill instructions
This understates the potential benefit
More registers allow more aggressive optimizations like instruction scheduling
Outline





Motivation
Baseline Compiler
Three Enhancements
Performance Evaluation
Conclusions
Conclusions & Future Work


Implemented an Optimizing Compiler for Vertex Shaders
Propose and Evaluate Three Enhancements
Compiler: Mask Optimization
Architectural: New Instructions & More registers
Improve the performance by a factor of 2 (Roughly)
Shaders are evolving rapidly
More like general purpose processors
More complex model