NVIDIA GeForce
Ryan Hendrixson
Ryan Schubert
Allison Walthall
What Does a GPU Actually Do?
Historically, from:
– Acting simply as a frame buffer
– Doing vertex transformations and pixel color
calculations
– Now even programmable
In the simplest sense, a modern GPU
implements a 3D rendering pipeline
3D Rendering Pipeline (direct illumination)
This is a pipelined sequence of operations to draw a 3D primitive into a 2D image:
3D Geometric Primitives
– Modeling Transformation
– Lighting
– Viewing Transformation
– Projection Transformation
– Clipping
– Scan Conversion
Image
3D Rendering Pipeline (direct illumination)
3D Geometric Primitives
– Modeling Transformation: transform into 3D world coordinate system
– Lighting: illuminate according to lighting and reflectance
– Viewing Transformation: transform into 3D camera coordinate system
– Projection Transformation: transform into 2D screen coordinate system
– Clipping: clip primitives outside camera’s view
– Scan Conversion: draw pixels
Image
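The stages of the rendering pipeline can be sketched for a single vertex. This is only an illustration, not how any GeForce part implements them. The function names, the translation-only modeling step, the axis-aligned camera looking down -z, and the simple perspective divide are all assumptions made for the example, and the lighting stage is left out.

```python
# Hypothetical sketch: one vertex through modeling, viewing, projection,
# a clipping test, and a viewport mapping. All names and values are illustrative.

def model_transform(v, offset):
    # Modeling transformation: object space -> world space (translation only).
    return tuple(a + b for a, b in zip(v, offset))

def view_transform(v, camera_pos):
    # Viewing transformation: world space -> camera space
    # (camera assumed axis-aligned, looking down -z).
    return tuple(a - b for a, b in zip(v, camera_pos))

def project(v, focal=1.0):
    # Projection transformation: perspective divide onto the plane z = -focal.
    x, y, z = v
    return (focal * x / -z, focal * y / -z)

def inside_view(p):
    # Clipping test: keep points inside the [-1, 1] x [-1, 1] view window.
    return all(-1.0 <= c <= 1.0 for c in p)

def to_pixel(p, width, height):
    # Scan conversion (a single point only): map the view window to pixels.
    x, y = p
    return (int((x + 1) / 2 * (width - 1)), int((1 - y) / 2 * (height - 1)))

vertex = (0.0, 0.0, 0.0)                            # object-space vertex
world = model_transform(vertex, (1.0, 0.0, -4.0))   # placed in the world
camera = view_transform(world, (0.0, 0.0, 0.0))     # camera at the origin
ndc = project(camera)
pixel = to_pixel(ndc, 640, 480) if inside_view(ndc) else None
```

A real pipeline uses 4x4 matrices and homogeneous coordinates for these steps; the tuple arithmetic here only keeps the order of operations visible.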
Modern OpenGL Pipeline
[Diagram: the Application (CPU) sends 3D vertices to the Vertex Processor (GPU); the transformed, lit 2D vertices go through Assembly & Rasterization and become fragments (pre-pixels); the Pixel Processor produces final pixels (color, depth) that are written to Video Memory (Textures), with a render-to-texture path back into the pipeline. Graphics State is shown alongside the stages.]
Programmable Vertex Processor
Programmable Fragment (Pixel) Processor
OpenGL vs. DirectX
OpenGL
– Just graphics
– Standard C interfaces
– State machine
– Multiple platforms
– Academic use
DirectX
– Graphics, multimedia, etc.
– C++ interfaces
– Object oriented
– Windows
– PC games
Possible GPU Performance
Bottlenecks
CPU/Bus Bound
– Simply not able to send enough vertices to the card
to keep it busy
Vertex Bound
– Vertex processing engine is fully loaded, while the
fragment engine is just waiting and grabbing data as
soon as it’s ready
Pixel Bound
– The fragment engine is fully loaded, causing the
vertex engine to have to wait before sending more
data
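One way to read the three bottleneck cases is that the slowest stage sets the overall rate while the others wait. The sketch below assumes you already have a per-stage throughput estimate; the stage names and numbers are made up for illustration.

```python
def find_bottleneck(stage_rates):
    # stage_rates: frames per second each stage could sustain on its own
    # (hypothetical measurements). The slowest stage limits the pipeline.
    name = min(stage_rates, key=stage_rates.get)
    return name, stage_rates[name]

rates = {"cpu/bus": 240.0, "vertex": 180.0, "pixel": 75.0}
stage, fps = find_bottleneck(rates)   # pixel bound in this made-up example
```

In practice the usual test is indirect: if lowering the resolution raises the frame rate, the application was probably pixel bound.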
Early History
NVIDIA founded in 1993
1997: RIVA
1998: RIVA TNT
1999: GeForce 256 (NV10)
GeForce 256 (NV10)
Lighting and transformation
DDR and SDR
HDTV compliant
Hardware alpha-blending
4 pixel pipelines at 120 MHz
Fill Rate: 480 Megapixels/second
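The fill rate figure follows from the two numbers above it, assuming each pipeline writes one pixel per clock. The helper name below is hypothetical.

```python
def fill_rate_mega(pipelines, core_clock_mhz):
    # Assumes one pixel (or texel) per pipeline per clock cycle.
    return pipelines * core_clock_mhz

geforce256 = fill_rate_mega(4, 120)    # 480 Megapixels/second
geforce7800 = fill_rate_mega(24, 430)  # 10320 Megatexels/second, about 10.3 Giga
```

Under the same assumption, 24 pipelines at 430 MHz reproduce the 10.3 Gigatexel figure quoted later in this deck for the GeForce 7800.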
GeForce2
2000: GeForce 2 GTS:
– Doubled the pixel fill rate
– Quadrupled the texel fill rate
– Increased clock speed
– Multi-texturing
– S3TC, MPEG-2, FSAA
Anti-Aliasing
Without Anti-Aliasing
With Anti-Aliasing
GeForce2
2000:
GeForce 2 MX
– Cut the number of pixel pipelines in half (from 4 to 2), making it cost
effective
– TwinView
– Compatible with Macs
GeForce2
August 2000: GeForce2 Ultra
November 2000: GeForce2 Go
December 2000: NVIDIA buys 3dfx
January 2001: Apple selected GeForce2
MX as default high-end graphics
solution for Power Mac G4
GeForce3
2001: GeForce3 (NV20)
– 240 MHz Core/500 MHz Memory
– 57 million transistors
– 46-76 Gigaflops
– Vertex shader technology
– Pixel shader technology
– LightSpeed Memory Architecture
LightSpeed Memory Architecture
GeForce4
2002: GeForce4 Ti (NV25) and MX (NV17)
– Ti:
  – 4200, 4400, 4600, and 4800 versions
  – 63 million transistors
  – Chip clock 225-300 MHz
  – Memory clock 500-650 MHz
  – 75-100 million vertices/second
GeForce FX
November 2002: GeForce FX (NV30)
– 16 variations for different price ranges
– 125 million transistors
– 8 pixels/clock
– 1 TMU/pipe (16 textures/unit)
– 128-bit memory interface
– 128 MB/256 MB memory size support
GeForce 6 series
GeForce 6 series (NV40)
– Models: 6200; 6600 GT and Ultra; 6800 GT,
Ultra, and Ultra Extreme
– Core clock speed 450 MHz
– Memory clock speed 600 MHz
– Vertex shader units: 6 4-wide FP32 vector MADDs
per clock cycle
– Pixel shader units: 16 4-wide FP32 vector MADDs
per clock cycle
GeForce 6 series
Superscalar 16-pipe architecture
CineFX 3.0 engine
All operations done in FP32
precision per component
200 Gigaflops (Compare this to
the Itanium’s 6.4 Gigaflops)
General Diagram (6800/NV40)
TurboCache
Uses PCI-Express bandwidth to render
directly to system memory
Card needs less memory
Performance boost while lowering cost
TurboCache Manager dynamically allocates
from main memory
Local memory used to cache data and to
deliver peak performance when needed
TurboCache
NV40 Vertex Processor
An NV40 vertex processor is able to execute one vector operation (up to four
FP32 components), one scalar FP32 operation, and make one access to the
texture per clock cycle
NV40 Fragment Processors
Early termination from mini z buffer and z
buffer checks; resulting sets of 4 pixels
(quads) passed on to fragment units
Programmable 2D and Video
Processor
Can be used for video decoding and
coding (IDCT, deinterlacing, color model
transformations, etc.)
Why NV40 series was better
Massive parallelism
Scalability
– Lower end products have fewer pixel pipes
and fewer vertex shader units
Computation Power
– 222 million transistors
– First to support Shader Model 3.0
(Microsoft’s DirectX 9.0c spec)
Dynamic Branching in pixel shaders
Dynamic Branching
Helps detect if pixel needs shading
Instruction flow handled in groups of
pixels
Specify branch granularity (the number of
consecutive pixels that take the same
branch)
Better distribution of blocks of pixels
between the different quad engines
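The effect of branch granularity can be shown with a toy cost model. The model is an assumption, not NVIDIA's documented behavior: pixels are processed in fixed-size groups, and if any pixel in a group takes the expensive branch, every pixel in that group pays for it. The costs and the pixel pattern are made up.

```python
def branch_cost(takes_branch, granularity, cost_taken=10, cost_skipped=1):
    # takes_branch: per-pixel booleans in processing order.
    # A group skips the expensive path only if none of its pixels need it.
    total = 0
    for start in range(0, len(takes_branch), granularity):
        group = takes_branch[start:start + granularity]
        per_pixel = cost_taken if any(group) else cost_skipped
        total += per_pixel * len(group)
    return total

pixels = [True] * 4 + [False] * 12             # 4 pixels need shading, 12 do not
fine = branch_cost(pixels, granularity=4)      # 4*10 + 12*1 = 52
coarse = branch_cost(pixels, granularity=16)   # 16*10 = 160
```

Under this model, a branch only saves work when whole groups of consecutive pixels agree, which is why the granularity matters.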
Dynamic Branching
GeForce 7 series
7800 GT
– $449
– 7 vertex units
– 20 pixel pipelines
– Clock speed 400 MHz
– Memory clock speed 500 MHz
7800 GTX
– $600
– 8 vertex units
– 24 pixel pipelines
– Clock speed 430 MHz
– Memory clock speed 600 MHz
GeForce 7800
302 million transistors
200 Gigaflops of multiply/add calculations
per second
128-bit floating point precision through
the entire rendering pipeline
Fill Rate: 10.3 Gigatexels/second
860 million vertices/sec
GeForce 7800
ALU Units in Pixel Processor
Sub-unit 1:
– NV40: handles texture data and can issue a MUL
vector instruction or use its mini-ALU to issue
a non-vector instruction
– G70: same but also can issue a multiply/add
Sub-unit 2:
– NV40: can issue a multiply/add vector
instruction or use its own mini-ALU to issue a
non-vector instruction
– G70: same
GeForce 6 vs. GeForce 7
ALU Units
– G70: 24 ALU Units
– NV40: 16 ALU Units
Register file: same size
Texture samplers the same but when
fetching large textures in preparation for
filtering, G70's samplers have less latency
pulling those textures out of memory
GeForce 6 vs. GeForce 7
(speculative)
Increased L2 texture cache (to around
12KB)
Better cache re-use with larger textures,
decompressing those larger textures into
L1 faster
Possibly offering more granularity in cache
access by the GPU, to reduce texture
bandwidth, speeding up rendering.
GeForce 6 vs. GeForce 7
33% more vertex units, each with more
performance
Improved vertex fetch unit (unconfirmed
by NVIDIA)
Triangle setup and rasteriser optimized via
the use of a new raster pattern (again
unconfirmed by NVIDIA)
General Diagram (7800/G70)
32-bit IEEE floating-point
throughout pipeline (NV40)
– Framebuffer
– Textures
– Fragment processor
– Vertex processor
– Interpolants
GeForce 7800 (G70) supports 128 bit
through entire pipeline!
Hardware supports several other
data types
Fragment processor also supports:
– 16-bit “half” floating point
– 12-bit fixed point
– These may be faster than 32-bit on some HW
Framebuffer/textures also support:
– Large variety of fixed-point formats
– E.g., classical 8-bit per component
– These formats use less memory bandwidth than FP32
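The bandwidth point can be made concrete by counting bytes per pixel. The sketch assumes four components per pixel (for example RGBA) and reads the "128 bit" figure as four 32-bit components, which is an interpretation rather than something the slides state. The helper names and the 1024x768 resolution are illustrative.

```python
def bytes_per_pixel(components, bits_per_component):
    return components * bits_per_component // 8

def framebuffer_mib(width, height, bpp):
    # Size of one full-screen buffer in MiB.
    return width * height * bpp / (1024 * 1024)

fp32_rgba = bytes_per_pixel(4, 32)    # 16 bytes, i.e. 128 bits per pixel
fp16_rgba = bytes_per_pixel(4, 16)    # 8 bytes
fixed8_rgba = bytes_per_pixel(4, 8)   # 4 bytes

fp32_buffer = framebuffer_mib(1024, 768, fp32_rgba)      # 12.0 MiB
fixed8_buffer = framebuffer_mib(1024, 768, fixed8_rgba)  # 3.0 MiB
```

Every read or write of an FP32 buffer moves four times the data of the classical 8-bit format, so the narrower formats are worth using when their precision is enough.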
How are current GPUs different
from CPUs?
GPU is a stream processor
Multiple programmable processing units
Connected by data flows
[Diagram: Application, Vertex Processor, Assembly & Rasterization, Fragment Processor, Framebuffer Operations, Framebuffer; Textures]
How are current GPUs different
from CPUs?
Optimized for 4-vector arithmetic
– Useful for graphics – colors, vectors,
texcoords
– Easy way to get high performance/cost
– SIMD/MIMD
GPU Memory Model vs CPU’s
Much more restricted memory access
– Allocate/free memory only before computation
– Limited memory access during computation (kernel)
Registers
– Read/write
Local memory
– Does not exist
Global memory
– Read-only during computation
– Write-only at end of computation (pre-computed
address)
Disk access
– Does not exist
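The restrictions above amount to "gather but no scatter": a kernel may read any input address, but its single result goes to an output address chosen for it in advance. The sketch below imitates that on the CPU. `run_kernel` and `blur3` are hypothetical names, and the loop stands in for what the GPU would run in parallel.

```python
def run_kernel(kernel, inputs, width):
    # Output memory is allocated before the computation starts.
    output = [None] * width
    read_only = tuple(inputs)     # global memory: read-only inside the kernel
    for i in range(width):        # one kernel invocation per output element
        # The kernel never picks its own write address; it only returns
        # the value for the pre-computed position i.
        output[i] = kernel(read_only, i)
    return output

def blur3(data, i):
    # Gather: read neighbouring inputs (like texture reads), clamped at edges.
    left = data[max(i - 1, 0)]
    right = data[min(i + 1, len(data) - 1)]
    return (left + data[i] + right) / 3

result = run_kernel(blur3, [3.0, 6.0, 9.0, 12.0], 4)
```

An algorithm that needs to write to a computed address, such as a histogram, has to be restructured to fit this shape, which is one reason CPU code cannot simply be ported.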
GPU Memory Model
Where is GPU Data Stored?
– Vertex buffer
– Frame buffer
– Texture
[Diagram (VS 3.0 GPUs): Texture, Vertex Buffer, Vertex Processor, Rasterizer, Fragment Processor, Frame Buffer(s)]
GPGPU and Motivation
GPUs are fast…
– Itanium: 6.4 GFLOPS
– GeForce 7800: 200 GFLOPS
– GPUs are getting faster, faster
– CPUs: annual growth 1.5×, decade growth
60×
– GPUs: annual growth > 2.0×, decade
growth > 1000×
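The decade figures are the annual factors compounded over ten years, which is quick to check. The function name is hypothetical.

```python
def decade_growth(annual_factor, years=10):
    # Compound an annual growth factor over the given number of years.
    return annual_factor ** years

cpu = decade_growth(1.5)   # about 57.7, which the slide rounds to 60x
gpu = decade_growth(2.0)   # 1024.0, i.e. "> 1000x"
```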
Motivation: Computational Power
[Chart comparing GPU and CPU computational power; courtesy Naga Govindaraju]
GPGPU
Good for inherently parallel applications
Rapidly evolving ISA and HW architecture
– Largely secret
Can’t simply “port” code written for the
CPU!
Programs are Shaders
Bound by the specific hardware profile:
– E.g. different cards have different supported
hardware, OpenGL has different restrictions than
DirectX, etc
Hardware profiles change relatively drastically as
new GPUs are developed
– But typically new profiles only add features, so there
is generally still backwards compatibility (but not
always)
Vertex processor
256 instructions per program originally
(effectively higher with branching)
– Now up to 65535 instructions
Executes on all vertices
Outputs new vertices or texture
coordinates, etc
Fragment Processor Flow Chart
Fragment processor has
flexible texture mapping
Memory is accessible through texture
reads
Texture reads are just another instruction
Allows computed texture coordinates,
nested to arbitrary depth
Allows multiple uses of a single
texture unit
Additional fragment processor
capabilities
Read access to window-space position
Read/write access to fragment Z
Built-in derivative instructions
– Partial derivatives w.r.t. screen-space x or y
– Useful for anti-aliasing
Conditional fragment-kill instruction
Multiple FP formats supported
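The derivative instructions can be pictured as differences between neighbouring pixels in a 2x2 block. This is a common way to think about them, but it is an assumption here, not something the slides specify. The layout and the sample values are made up.

```python
def ddx(quad):
    # quad: values at the four pixels of a 2x2 block, laid out as
    # [[top_left, top_right], [bottom_left, bottom_right]].
    # Horizontal difference for each row.
    return [row[1] - row[0] for row in quad]

def ddy(quad):
    # Vertical difference for each column.
    return [quad[1][col] - quad[0][col] for col in range(2)]

u = [[0.125, 0.375], [0.125, 0.375]]   # hypothetical texture coordinate u
du_dx = ddx(u)   # [0.25, 0.25]: u changes quickly across the screen
du_dy = ddy(u)   # [0.0, 0.0]: u is constant down the screen
```

A large derivative means many texels fall under one pixel, which is the signal a shader can use to filter more aggressively and reduce aliasing.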
Fragment processor limitations
Originally no branching
– Now supports dynamic branching (but it’s still
costly)
No indexed reads from registers
– Use texture reads instead
No memory writes
Branching Instruction Costs
(GeForce 6800)
Fragment shaders
Originally very limited in size (only 96
instructions), now expanded to 65535+
instructions
New cards support dynamic branching (but it
still incurs some performance penalty)
Now have the ability to output to multiple render
targets
CineFX 4.0 Engine
A redesigned vertex shader unit reduces the
time to set up and perform geometry
processing.
A new pixel shader unit design can carry out
twice as many floating-point operations and
greatly accelerates other mathematical
operations to increase throughput.
An advanced texture unit incorporates new
hardware algorithms and better caching to
speed filtering and blending operations.
Vertex Shaders
The 7800 has 8 vertex
shaders
The Triangle Setup stage
turns the vertex points
into a triangle
It also determines
mathematically the
rasterization for each
triangle
Accelerating triangle
setup increases the total
throughput of the 3D
pipeline
Theoretical Rasterization Pattern of a
Triangle
New Pixel Shader – MADD
Multiply and accumulate are commonly
used math operations in 3D graphics
MADD stands for Multiply-ADD operations
The 7800 can do twice as many
MADD operations as previous GPUs
could
This allows developers to create much
more complex visual effects
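A MADD on 4-vectors is one multiply and one add per component. The sketch below shows the operation itself and a dot product built from the same multiplies and adds. The function names and the color values are illustrative, and nothing here models how the hardware schedules the work.

```python
def madd4(a, b, c):
    # Component-wise a * b + c on 4-vectors (e.g. RGBA or xyzw).
    return tuple(x * y + z for x, y, z in zip(a, b, c))

def dot4(a, b):
    # A 4-component dot product: four multiplies accumulated by adds.
    return sum(x * y for x, y in zip(a, b))

light = (0.5, 0.5, 0.5, 1.0)
surface = (1.0, 0.5, 0.25, 1.0)
ambient = (0.25, 0.25, 0.25, 0.0)
color = madd4(light, surface, ambient)   # light * surface + ambient
```

Lighting, blending, and matrix transforms reduce largely to chains of these operations, which is why doubling MADD throughput matters for shader-heavy effects.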
Transparency Adaptive
Supersampling
Takes extra passes over thin-lined objects
such as chain-link fences or trees to
enhance quality
Pixels inside of a polygon are usually not
touched by anti-aliasing methods
With this, a key set is devised, and those
pixels are anti-aliased, creating a
smoother image.
Transparency Adaptive
Supersampling
Transparency Adaptive
Multisampling
Higher levels of performance, because it
uses one texel to determine other subpixel
values
Not as high quality
Supporting the Future
The 7800 is already set up to support the
new Microsoft Longhorn OS with some of
the following advancements
– Video post-processing
– Real-time desktop compositing
– Seamless multiple 3D applications
– Accelerated antialiased text rendering
– Special effects and animation
Accelerated Graphics Port (AGP)
The AGP is superior to PCI because it
provides a dedicated pathway between the slot
and the processor
Uses sideband addressing
PCI must load a texture from the hard drive into
the system’s RAM, then from the RAM into the
GPU framebuffer
AGP can read textures directly from system RAM
by “tricking” the CPU into believing the textures
are in the framebuffer, when they are really in
system memory
PCI Express
Based on the PCI system,
allowing for backwards
compatibility
Uses 1-bit, bidirectional
lanes (PCI used a bus)
Each lane can support
250 MB/s in each direction
(4 GB/s total for x16)
– AGP is only 2 GB/s
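The 4 GB/s total is the per-lane rate multiplied by the lane count, assuming the 16-lane slot normally used for graphics cards. The function name is hypothetical.

```python
def pcie_bandwidth_mb_s(lanes, per_lane_mb_s=250):
    # Bandwidth in one direction, using the 250 MB/s per-lane figure above.
    return lanes * per_lane_mb_s

x1 = pcie_bandwidth_mb_s(1)     # 250 MB/s
x16 = pcie_bandwidth_mb_s(16)   # 4000 MB/s, i.e. 4 GB/s
```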
Scalable Link Interface (SLI)
Takes advantage of the PCI Express bus,
which will allow more than one discrete
graphics device on the same PCI host
Allows two of the same GeForce GPUs to
run on one machine, thus “sharing” load.
There are two modes for this
– Split-frame Rendering (SFR)
– Alternate-frame Rendering (AFR)
Split-frame Rendering
Has each GPU render a
portion of the screen,
split horizontally
No extra latency
Not necessarily evenly
split
– SFR is load shared, so it
splits up the frame by the
amount of work, not the
size
A large amount of
overhead is involved,
causing a max speedup
of around 1.8 times
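Splitting by work and not by size can be sketched as picking the scanline where the two halves cost about the same. How the driver actually estimates work is not described in the slides, so the per-row costs and the function name below are assumptions.

```python
def split_row(row_costs):
    # row_costs: estimated rendering work per scanline (hypothetical numbers).
    # Returns the index of the first row given to the second GPU, chosen so
    # that the work above and below the split is as equal as possible.
    total = sum(row_costs)
    best_row, best_diff, top = 0, total, 0
    for row, cost in enumerate(row_costs):
        top += cost
        diff = abs(top - (total - top))
        if diff < best_diff:
            best_row, best_diff = row + 1, diff
    return best_row

costs = [1, 1, 1, 1, 4, 4, 4, 4]   # bottom half of the frame is 4x as expensive
boundary = split_row(costs)        # first GPU gets rows 0-4, second gets rows 5-7
```

In this example the first GPU renders five of the eight rows, so the split line sits below the middle of the screen even though the work is roughly balanced.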
Alternate-frame Rendering
Avoids all the
overhead problems of
SFR
Many buffer swaps
Reliant on the speed
of the processor
Can cause latency
issues
Recommended mode
by NVIDIA
GeForce Go 7800 GTX
The mobile version of the 7800
GTX
Everything from the desktop
release has been carried over
to this
Can switch between x1 and
x16 lanes of PCI Express
Uses PowerMizer 6.0, which
allows this chip to operate in
the same envelope as its
predecessor, the 6800
GeForce Go 7800 – Power Issues
Power consumption and package are the same as the 6800 Ultra
chip, meaning notebook designers do not have to change very much
about their thermal designs
Dynamic clock scaling can run as slow as 16 MHz
– This is true for the engine, memory, and pixel clocks
Heavier use of clock gating than the desktop version
Runs at voltages lower than any other mobile performance part
Regardless, you won’t get much battery-based runtime for a 3D
game
Questions?