Transcript pact03
Compilation, Architectural Support, and Evaluation
of SIMD Graphics Pipeline Programs on a
General-Purpose CPU
Mauricio Breternitz Jr, Herbert Hum, Sanjeev Kumar
Microprocessor Research Labs,
Intel Corporation
Graphics Applications
Computationally intensive graphics applications
are becoming increasingly popular
Computer-Aided Design: From Airplanes to Cars
Visualization of massive quantities of Data
Visual Simulators e.g. Training Pilots
Fancier Graphical User Interfaces
And, of course, Games
And this trend is continuing
As high-end applications become more mainstream
Parallel Architecture and Compilation Techniques, 2003
Graphics Pipeline
3D Application, on top of OpenGL or DirectX, drives the graphics pipeline
Pipeline stages, in order: Scene, Transform, Lighting, Clipping, Rasterization, Texture Mapping, Compositing, Display
Vertex Shaders
• Operate on every vertex in the scene
• Effects like
  • Blur
  • Diffuse and specular reflection
Pixel Shaders
• Operate on every pixel
• Effects like
  • Texturing
  • Fog blending
Vertex and Pixel Shaders
Need to operate millions of times a second
Small programs
Typically run on the graphics cards
However most desktops do not have graphics
cards that support programmable shaders
This work focuses on running Vertex Shaders
on the main CPU
Pixel shaders have very high computational and
bandwidth requirements
Graphics applications are designed to adapt to the
available features and performance
Goals
Improving the performance of Vertex Shaders
on the main CPU
Analyze the performance on today’s CPUs
Better Compiler Optimizations
Additional Architectural Support
Identify three architectural and compiler
enhancements
Significant impact on performance: roughly a factor of 2
Outline
Motivation
Baseline Compiler
Three Enhancements
Performance Evaluation
Conclusions
Vertex Shader Programs
Virtual Machine
Vertex Input: 16 x 4 Registers
Temporary Registers: 12 x 4
Integer Registers: 84 x 1
Constant Memory: 256 x 4
SIMD ALU
Vertex Output: 15 x 4 Registers
dp4 oPos.x, v0, c[0]
dp4 oPos.y, v0, c[1]
dp4 oPos.z, v0, c[2]
dp4 oPos.w, v0, c[3]
Small Programs (at most 256 instructions)
SIMD instructions with xyzw components
Mask and Swizzle on each instruction
mov oD0, c[4].wzyx
No state saved between vertices
Read-only memory & Temporary Registers
Program cannot change control flow
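The dp4 and mov examples on this slide can be modeled in a few lines. The sketch below is only an illustration of swizzles, write masks, and dp4 on four-component registers; the helper names (`swizzle`, `write_masked`, `dp4`, `transform_position`) are hypothetical and do not come from the talk or from any graphics API.

```python
# Illustrative model of three vertex-shader features named on the slide:
# dp4, swizzles on source operands, and write masks on destinations.
# All helper names here are hypothetical.

LANES = "xyzw"

def swizzle(reg, pattern):
    """Reorder or replicate components, e.g. pattern 'wzyx' or 'wwww'."""
    return [reg[LANES.index(ch)] for ch in pattern]

def write_masked(dest, value, mask):
    """Return dest with only the components named in mask overwritten."""
    return [value[i] if LANES[i] in mask else dest[i] for i in range(4)]

def dp4(a, b):
    """Four-component dot product, replicated across all lanes."""
    s = sum(x * y for x, y in zip(a, b))
    return [s, s, s, s]

def transform_position(v0, c):
    """dp4 oPos.x, v0, c[0] ... dp4 oPos.w, v0, c[3]: a 4x4 matrix-vector product."""
    o_pos = [0.0, 0.0, 0.0, 0.0]
    for row, lane in enumerate(LANES):
        o_pos = write_masked(o_pos, dp4(v0, c[row]), lane)
    return o_pos

identity = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
position = transform_position([1.0, 2.0, 3.0, 1.0], identity)
reversed_color = swizzle([0.1, 0.2, 0.3, 0.4], "wzyx")   # mov oD0, c[4].wzyx
```

With the identity matrix in c[0] to c[3], the four dp4 instructions copy the input position unchanged, which makes the row-by-row structure of the transform easy to check by hand.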
Baseline Optimizing Compiler
Implemented a Compiler for Vertex Shaders
Input: Vertex Shader Assembly
Output: Optimized x86 (with SSE2)
Started with the DirectX reference rasterizer (an interpreter) and used it as the front end
Uses the Olive pattern-matching code-generator generator
Graph-coloring based register allocator
Loop unrolling
List scheduler
About 70% faster than a naïve translator (translate into C and feed it to a C compiler)
Characteristics of Generated Code
Mostly SIMD instructions (x86 with SSE2)
Large basic blocks
83-99% of instructions
Use of control-flow is limited
Makes it easier to compile efficiently
Vertex Shader Assembly to x86 Assembly
10-20 times increase in number of instructions
mul r0.x_z_, v0.xyzz, v1.wwww
Outline
Motivation
Baseline Compiler
Three Enhancements
Performance Evaluation
Conclusions
1. New Instructions
Dot products are very common in Shaders
A dot product is expensive on x86
A sequence of 7 instructions in the simple case: 1 multiply, 2 add, and 4 shuffle instructions
New dot product instructions
Compute the dot product of two source operands and
store it in each word of the destination operand
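To see why a dot product needs shuffles on a SIMD unit without a horizontal-sum instruction, the sketch below emulates packed operations on four-element lists. It shows one common shuffle-and-add reduction, not the exact 7-instruction sequence counted on the slide: the emulated `shuffle` reads a single source and writes a fresh result, so it ignores the extra register copies and shuffles that real two-operand x86 code may need. The function names are hypothetical, and `dot4_proposed` only models the behaviour the slide describes for the new instruction.

```python
# Lane-level emulation of packed SSE-style operations on 4-element lists.
# A simplified sketch: real shuffle instructions have operand constraints
# that this model does not capture.

def mulps(a, b):
    """Packed multiply: lane-by-lane product."""
    return [x * y for x, y in zip(a, b)]

def addps(a, b):
    """Packed add: lane-by-lane sum."""
    return [x + y for x, y in zip(a, b)]

def shuffle(a, order):
    """Result lane i takes its value from a[order[i]]."""
    return [a[i] for i in order]

def dot4_with_shuffles(a, b):
    prod = mulps(a, b)                       # one multiply: per-lane products
    swapped = shuffle(prod, (1, 0, 3, 2))    # swap neighbours within each pair
    pairs = addps(prod, swapped)             # lanes hold [p0+p1, p0+p1, p2+p3, p2+p3]
    crossed = shuffle(pairs, (2, 3, 0, 1))   # swap the two halves
    return addps(pairs, crossed)             # every lane holds the full sum

def dot4_proposed(a, b):
    """Hypothetical single-instruction form: result stored in every lane."""
    s = sum(x * y for x, y in zip(a, b))
    return [s] * 4

a = [1.0, 2.0, 3.0, 4.0]
b = [5.0, 6.0, 7.0, 8.0]
same = dot4_with_shuffles(a, b) == dot4_proposed(a, b)
```

Both functions leave the dot product in all four lanes, which is the form a later masked write to a single component can consume directly.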
2. Mask Analysis Optimization
Traditional optimizers keep track of the
liveness information on a per-register basis
Analysis Phase
Shaders: often only part of the SIMD register is live
Modify to do this for each word of the SIMD register
Annotate the IR with additional information
During live variable analysis, propagate the
liveness mask depending on the instructions
Optimization Phase
Identify dead code
Replace some shuffle/mask instructions with moves, which
might get eliminated entirely during register allocation
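The analysis and dead-code steps above can be sketched as a backward pass that tracks a set of live components per register instead of a single live bit. This is a minimal illustration on straight-line code under assumed conventions: the tuple-based IR, the `REDUCTIONS` table, and the function name `mask_liveness` are hypothetical, and the shuffle-to-move replacement mentioned on the slide is not modeled.

```python
# Per-component liveness on straight-line shader code (illustrative only).
# IR format (an assumption): (op, dest, write_mask, [(src, swizzle), ...])

LANES = "xyzw"
REDUCTIONS = {"dp3": 3, "dp4": 4}   # ops whose result depends on several source lanes

def mask_liveness(program, live_out):
    """Drop dead instructions and narrow write masks.

    live_out maps each register to the set of components needed after
    the program ends, e.g. {"oPos": {"x", "y", "z", "w"}}.
    """
    live = {reg: set(comps) for reg, comps in live_out.items()}
    kept = []
    for op, dest, mask, sources in reversed(program):
        needed = set(mask) & live.get(dest, set())
        if not needed:
            continue                          # writes no live component: dead code
        live[dest] = live[dest] - needed      # the written components are now defined
        for src, swz in sources:
            if op in REDUCTIONS:
                used = set(swz[:REDUCTIONS[op]])
            else:
                # component-wise op: lane i of the result reads swizzle[i]
                used = {swz[LANES.index(c)] for c in needed}
            live.setdefault(src, set()).update(used)
        narrowed = "".join(c for c in LANES if c in needed)
        kept.append((op, dest, narrowed, sources))
    kept.reverse()
    return kept

program = [
    ("mul", "r0", "xyzw", [("v0", "xyzw"), ("c0", "xyzw")]),  # only r0.xy is read later
    ("add", "r1", "xyzw", [("v1", "xyzw"), ("c1", "xyzw")]),  # r1 is never read
    ("mov", "oD0", "xy", [("r0", "yxzw")]),
]
optimized = mask_liveness(program, {"oD0": {"x", "y"}})
```

In this example the `add` is removed and the `mul` is narrowed to write only `r0.xy`, which a per-register liveness analysis would not discover because `r0` as a whole is live.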
3. Number of Registers
Spilling registers to memory can degrade
performance
Investigate the impact of increasing the
number of registers from 8 to 16
Why not more? It is trickier to encode in the ISA
Outline
Motivation
Baseline Compiler
Three Enhancements
Performance Evaluation
Conclusions
Experimental Setup
10 Vertex Shaders
8-84 instructions
Only 3 of them have loops (Control)
2.2 GHz Pentium IV processor
Instruction counts otherwise
Break down the instructions into categories
Measure performance by using the generated
code to process an array of vertices
Compute average
Evaluation
[Figure: normalized execution time (0 to 1) for each of the 10 vertex shaders (B, CTC, L, PS, PL, PE, R, T, TS, W) under four configurations: Base, New Instructions Only, Mask Optimization Only, Both]
New dot-product instructions: 27.4% on average (estimate)
Reduces the number of instructions by 24%
Mask optimization: 19.5% on average
Both: 42% on average
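As a rough consistency check on these averages, the short calculation below combines the two individual figures. It assumes that the percentages are reductions in execution time and that the two enhancements compose multiplicatively; neither assumption is stated on the slide, and per-shader averages need not combine this way.

```python
# Assumption: both figures are average reductions in execution time
# and the two effects are independent (multiplicative).
new_instructions = 0.274    # new dot-product instructions only (estimate)
mask_optimization = 0.195   # mask optimization only

remaining = (1 - new_instructions) * (1 - mask_optimization)
combined_reduction = 1 - remaining
speedup = 1 / remaining

print(f"combined reduction: {combined_reduction:.1%}, speedup: {speedup:.2f}x")
```

Under these assumptions the combined reduction comes out close to the reported 42%, which suggests the two enhancements overlap very little.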
Evaluation Cont’d
[Figure: normalized instruction count (0 to 1) for each of the 10 vertex shaders (B, CTC, L, PS, PL, PE, R, T, TS, W): Base versus 16 Registers]
Reduces the number of instructions by 8% on average
Removes 35-100% of the spill instructions
This understates the potential benefit
More registers allow more aggressive optimizations like
instruction scheduling
Outline
Motivation
Baseline Compiler
Three Enhancements
Performance Evaluation
Conclusions
Conclusions & Future Work
Implemented an Optimizing Compiler for
Vertex Shaders
Propose and Evaluate Three Enhancements
Compiler: Mask Optimization
Architectural: New Instructions & More registers
Improve performance by roughly a factor of 2
Shaders are evolving rapidly
More like general purpose processors
More complex model