Shader Performance Analysis on a Modern GPU Architecture
Victor Moya, Carlos González, Jordi Roca, Agustín Fernández
Department of Computer Architecture, UPC
Roger Espasa
Intel DEG
Barcelona
Introduction
Shaders in GPUs evolving towards general programming
- Branches, generic loads, scatter
- New types of shaders: geometry in DX10
Current specialized shaders
- Area hungry
- Unbalancing leads to inefficiencies
This paper: unify all shaders
- ~8% higher performance with less area & resources
Outline
Attila – our GPU architecture
Attila-Classic: Non-unified shaders
Attila-Unified: Unified Shaders
Simulation Framework
Results
ATTILA
Our implementation of current GPUs
- Inspired by both NVIDIA and ATI
- Not exact to either pipeline
  - Lack of detailed microarchitecture information
  - Educated guessing on our side
Implemented Features
- 2D Homogeneous Recursive Rasterization
- Tiled Rasterization
- Hierarchical Z
- Texture compression
- Anisotropic filtering
- Depth compression, fast z/stencil and color clear
Attila Classic
[Figure: Attila Classic pipeline. Blocks shown, in order: Vertex Fetch, 4 Vertex Shaders, Primitive Assembly, Clipping, Triangle Setup, Rasterization, HierarchicalZ, 4 Fragment Shaders, 4 ROPs, 4 Memory Controllers. The vertex and fragment shaders are labeled "Specialized Shaders".]
Specialized Shader Issues
Unbalancing
- In fragment shading limited scenarios (typical), up to 30% of the processing power remains idle (for a GPU with 8 vertex and 4 fragment shaders)
- In vertex shading limited scenarios, up to 70% of the processing power remains idle
Dedicated Area
- 4 unused vertex shaders have the same processing power as 1 fragment shader
- 4 vertex shaders require 66% of the area of a fragment shader
Different Designs
- Increases the complexity of the microarchitecture
- Increases development and verification time
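The unbalancing figures can be reproduced with a back-of-the-envelope calculation. The sketch below takes from the slide that 4 vertex shaders match the processing power of 1 fragment shader, and additionally assumes that in a limited scenario the non-limiting shader type sits completely idle (a worst case, hence "up to"). The function and parameter names are illustrative only.

```python
def idle_fraction(n_vertex, n_fragment, limited_by, vs_per_fs=4):
    """Worst-case share of total shading power left idle.

    Power is measured in fragment-shader equivalents; vs_per_fs vertex
    shaders are taken to equal one fragment shader (from the slide).
    Assumption: the non-limiting shader type does no work at all.
    """
    vertex_power = n_vertex / vs_per_fs
    fragment_power = float(n_fragment)
    total = vertex_power + fragment_power
    if limited_by == "fragment":
        return vertex_power / total      # vertex shaders are idle
    if limited_by == "vertex":
        return fragment_power / total    # fragment shaders are idle
    raise ValueError("limited_by must be 'vertex' or 'fragment'")


# GPU with 8 vertex and 4 fragment shaders, as on the slide.
print(round(idle_fraction(8, 4, "fragment"), 2))  # 0.33
print(round(idle_fraction(8, 4, "vertex"), 2))    # 0.67
```

Under these assumptions the idle shares come out at roughly one third and two thirds, in line with the "up to 30%" and "up to 70%" quoted above.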
Attila Unified
[Figure: Attila Unified pipeline. Blocks shown: Vertex Fetch, Primitive Assembly, Clipping, Triangle Setup, Rasterization, HierarchicalZ, Distributor, Scheduler, a Unified Shader Pool of 4 Shaders, 4 ROPs, 4 Memory Controllers.]
Unified Shader Architecture
Benefits
Unified programming model
- DX10/SM4 and OpenGL/GLSlang are already pushing for it
The same features for all the program targets
- Texturing, branching, outputs
Not just vertex and fragment programs
- DX10 => geometry shader
- General Purpose GPU or Stream Processor
Workload balance
- Shading resources allocated as required at any point of the rendering
Unified Shader Architecture
Costs

Scheduler
- Select which kind of workload must be processed next
- Partly implemented with multithreading in the fragment shader to hide texture access latency
Larger instruction memory and constant bank
Rerouting required
- All the paths cross the shader pool
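To make the scheduler cost concrete, here is a minimal sketch of a unit that decides which kind of workload a unified shader pool processes next. The policy (pending vertices always go first) and all names are assumptions chosen for illustration; the slides do not state which policy Attila uses.

```python
from collections import deque


class UnifiedScheduler:
    """Toy scheduler for a unified shader pool (hypothetical policy)."""

    def __init__(self):
        self.queues = {"vertex": deque(), "fragment": deque()}

    def submit(self, kind, item):
        self.queues[kind].append(item)

    def next_work(self):
        """Return (kind, item) for the next workload, or None if idle.

        Assumed policy: vertices are issued first whenever any are
        pending; otherwise the oldest pending fragment is issued.
        """
        for kind in ("vertex", "fragment"):
            if self.queues[kind]:
                return kind, self.queues[kind].popleft()
        return None


sched = UnifiedScheduler()
sched.submit("fragment", "f0")
sched.submit("vertex", "v0")
sched.submit("fragment", "f1")

order = []
while (work := sched.next_work()) is not None:
    order.append(work)
print(order)  # [('vertex', 'v0'), ('fragment', 'f0'), ('fragment', 'f1')]
```

A real scheduler would also have to account for free thread slots and register space in each shader unit, which this sketch leaves out.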
ATTILA Framework
OpenGL Interceptor tool
OpenGL library for Attila GPU
Driver for our Attila GPU
Attila GPU simulator
Signal Visualizer Tool
[Figure: ATTILA framework workflow in four stages: Collect, Verify, Simulate, Analyze. Elements shown: OpenGL Application, GLInterceptor, Trace, GLPlayer, Vendor OpenGL Driver on ATI R520/NVidia G70, ATTILA OpenGL Driver, ATTILA Simulator, Statistics, Signal Traffic, Signal Visualizer. The Framebuffers produced by the vendor hardware and by the ATTILA Simulator are marked "CHECK!".]
GLInterceptor
- Capture a trace of OpenGL API calls from a real game
GLPlayer
- Reproduce the captured trace
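The capture-and-replay idea behind GLInterceptor and GLPlayer can be illustrated with a small sketch. Everything here is hypothetical: FakeGL stands in for a graphics API, and Interceptor and replay are illustrative names, not the interfaces of the actual tools, which work on real OpenGL calls.

```python
class Interceptor:
    """Forwards every call to `backend` and records it in `trace`."""

    def __init__(self, backend):
        self._backend = backend
        self.trace = []

    def __getattr__(self, name):
        # Only reached for names not found on the Interceptor itself,
        # i.e. the API calls we want to forward and record.
        target = getattr(self._backend, name)

        def wrapper(*args):
            self.trace.append((name, args))
            return target(*args)

        return wrapper


def replay(trace, backend):
    """Re-issue a recorded trace against any backend with the same API."""
    for name, args in trace:
        getattr(backend, name)(*args)


class FakeGL:
    """Stand-in for a graphics API; it only logs what it is asked to do."""

    def __init__(self):
        self.log = []

    def clear(self, r, g, b):
        self.log.append(f"clear({r}, {g}, {b})")

    def draw_triangles(self, count):
        self.log.append(f"draw_triangles({count})")


# Collect: the application talks to the interceptor instead of the driver.
vendor = FakeGL()
app_gl = Interceptor(vendor)
app_gl.clear(0, 0, 0)
app_gl.draw_triangles(12)

# Replay: the same trace is fed to a second backend and the results compared.
simulated = FakeGL()
replay(app_gl.trace, simulated)
print(simulated.log == vendor.log)  # True
```

Comparing the two logs plays the role of the "CHECK!" step in the figure, where the framebuffers produced by different back ends are compared.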
OpenGL Library
- Transforms Fixed Function into Shader code
- 200 API Calls supported
- ARB Vertex and Fragment extensions
- Alpha and Fog emulated via Shader code
Driver
- Low level access
- Attila memory management
ATTILA Simulator
- Detailed cycle-by-cycle simulation of all pipeline stages
- 20 boxes, modeling a 100-deep pipeline
- Execute@Execute: functionality embedded at each pipeline stage
Find the differences
[Figure: two images, labeled NVIDIA GeForce FX 5900XT and Attila]
Benchmark
Unreal Tournament 2004
- Fixed function OpenGL API
- Vertex and fragment shaders generated by our library
1024x768 resolution
8x Anisotropic Filtering
160 of 450 frames simulated
40 frames ~ 1 day simulation
- On a Xeon P4 @ 2.0 GHz
Baseline Configuration
Four Vertex Shaders (only for Attila-Classic)
Fragment and Unified shader configuration:
- 32 threads
  - 4 fragments/vertices per thread
  - 16 128-bit FP registers available for temporary storage per thread
- n SIMD ALUs
- 1 scalar ALU (optional)
- 1 Texture Unit per Shader Unit
  - 16 KB texture cache
  - Single cycle bilinear and two cycle trilinear
  - AF up to 16x
Geometry and Rasterization pipelines limited to 1 vertex and 1 triangle per cycle
Two ROPs: 8 z and 8 color values written per cycle
Four 64-bit DDR buses: peak bandwidth 64 bytes/cycle
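Two of the baseline figures follow from simple arithmetic, sketched below. The helper names are illustrative, and the bandwidth calculation assumes that "cycle" means a memory clock cycle in which a DDR bus transfers data twice.

```python
def peak_bandwidth_bytes_per_cycle(n_buses, bus_width_bits, transfers_per_cycle=2):
    """Peak memory bandwidth; DDR is assumed to move data twice per cycle."""
    return n_buses * (bus_width_bits // 8) * transfers_per_cycle


def inputs_in_flight(threads, inputs_per_thread):
    """Fragments or vertices resident in one shader unit at a time."""
    return threads * inputs_per_thread


# Four 64-bit DDR buses: 4 buses * 8 bytes * 2 transfers per cycle.
print(peak_bandwidth_bytes_per_cycle(4, 64))  # 64

# 32 threads with 4 fragments/vertices each.
print(inputs_in_flight(32, 4))  # 128
```

The first result matches the 64 bytes/cycle peak bandwidth quoted in the configuration; the second shows how many shader inputs each unit can keep resident to hide latency.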
“Classic” Performance
[Figure: relative performance (y-axis, 1 to 3) of Attila Classic with 2, 4, 6 and 8 fragment shaders (2sh to 8sh) for the 1-way, 1-way + scalar, 2-way and 4-way ALU configurations; chart annotations: 7%, 8%, ~40%, ~45%, ~75%]
8% improvement for 2-way
Near linear improvement for 4 shaders
Sublinear improvement for 6 and 8 shaders
- Limited by memory bandwidth and latency
Frame 330 – Detailed Zoom
[Figure: Vertex Shader and Fragment Shader utilization (y-axis, 0 to 1) over time (x-axis, 10K-cycle steps, 1 to 901); chart annotation: "Vertex shading limited"]
Vertex shader and fragment shader workload for 4 vertex shader units and 2 fragment shader units
Unified Shader Performance
[Figure: relative performance (y-axis, 1 to 3) of Attila Classic (2sh, 4sh, 6sh, 8sh) and Attila Unified (uni2sh, uni4sh, uni6sh, uni8sh) for the 1-way, 1-way + scalar, 2-way and 4-way ALU configurations]
Unified improvement ranges from 1% (2 shaders) to 8% (eight 1-way shaders)
- Fragment shading limited
- Vertex fetch limited
- Geometry pipeline limited
Area Estimation
                          ATI R400    ATI RV400
Transistors (millions)    160         120
Vertex Shaders            6           4
Fragment Shaders          4           2

160 - 120 = 40 = 2 vertex shaders * 2.5 + 2 fragment shaders * 15 + 5 (other)

Hardware Element          Estimated Area (millions of transistors)
Vertex Shader             2.5
Fragment Shader           15
Additional SIMD ALU       +15%
Additional scalar ALU     +5%
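The area estimate is a small linear model, and its arithmetic can be checked directly. The sketch below uses only the per-unit figures from the table; the constant and function names are illustrative.

```python
VERTEX_SHADER_MT = 2.5     # millions of transistors per vertex shader (table)
FRAGMENT_SHADER_MT = 15.0  # millions of transistors per fragment shader (table)
OTHER_MT = 5.0             # the "other" term in the equation above


def area_difference(extra_vertex, extra_fragment, other=OTHER_MT):
    """Transistor difference (millions) between two chip configurations."""
    return (extra_vertex * VERTEX_SHADER_MT
            + extra_fragment * FRAGMENT_SHADER_MT
            + other)


# Two extra vertex shaders and two extra fragment shaders: 5 + 30 + 5.
print(area_difference(2, 2))  # 40.0

# Area of 4 vertex shaders relative to 1 fragment shader.
print(round(4 * VERTEX_SHADER_MT / FRAGMENT_SHADER_MT, 2))  # 0.67
```

The first value reproduces the 40 million transistor gap between the two chips. The second is about two thirds, consistent with the earlier statement that 4 vertex shaders require 66% of the area of a fragment shader.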
Shader Scaling vs Transistors
[Figure: fps (y-axis, 30 to 170) versus MTransistors (x-axis, 80 to 180) for 2, 4, 6 and 8 shader units (2sh to 8sh); series: 1-way, 2-way, uni 1-way, uni 2-way, and a linear reference]
Linear for 4 shader units, sublinear for more than 4 shader units
Up to 30% more efficient per area for the unified architecture (two 1-way shaders)
Conclusion
Attila Unified architecture has better performance than Attila Classic with less hardware
- Up to 8% better performance
- 8% to 25% less area required
- 10% to 30% better performance per area
Up to 8% better performance for 2-way shader units
160% better performance from 2 to 8 fragment or unified shader units
- Memory bandwidth limited beyond 4 shaders
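The scaling figure can be read as a scaling efficiency. The sketch below assumes that "160% better" means a speedup of 1 + 1.60 = 2.6x and compares it with the ideal 4x expected from quadrupling the shader units; the function name is illustrative.

```python
def scaling_efficiency(improvement, units_before, units_after):
    """Return (speedup, efficiency) for a given relative improvement.

    improvement: fractional gain, e.g. 1.60 for "160% better".
    Efficiency is the speedup divided by the ideal linear speedup.
    """
    speedup = 1.0 + improvement
    ideal = units_after / units_before
    return speedup, speedup / ideal


speedup, efficiency = scaling_efficiency(1.60, 2, 8)
print(round(speedup, 2), round(efficiency, 2))  # 2.6 0.65
```

An efficiency of about 65% relative to linear scaling is consistent with the observation that the configuration becomes memory bandwidth limited beyond 4 shaders.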
Questions
Performance of Attila Unified vs Classic Attila
[Figure: relative performance (y-axis, 1 to 1.09) of Attila Unified versus Attila Classic for uni2sh, uni4sh, uni6sh and uni8sh; series: 1-way, 1-way + scalar, 2-way, 4-way]