TotalColor - University of North Carolina at Chapel Hill


Panel Discussion
GPUs and CPUs: The Uneasy Alliance
Panelists
• Neil Trevett, 3Dlabs
• Michael Doggett, ATI
• Adam Lake, Intel
• David Kirk, NVIDIA
• Bill Mark, University of Texas at Austin
Moderator
• Peter N. Glaskowsky, MemoryLogix
Neil Trevett
3Dlabs
Neil Trevett is Senior Vice President for Market Development
at 3Dlabs, Inc. Trevett also serves as President of the
Web3D Consortium and secretary of the Khronos Group
developing the OpenML and OpenGL ES standards for
dynamic media processing and graphics APIs for embedded
appliances and applications.
GP2 Musings
Neil Trevett, Senior VP Market Development, 3Dlabs
President, Khronos Group
Los Angeles 2004
CPUs and GPUs – Dynamic Tension
CPUs and GPUs exist because of their different design goals
CPUs – maximize performance and minimize cost of executing SCALAR code
GPUs – exploit parallelism to beat CPUs at executing VECTOR code
BUT - GPUs are rapidly integrating many CPU techniques
Learned and refined by the CPU community over decades
Demand-Paged Virtual Memory – 256GB
Virtual Shader Program Memory - 256K instructions
Efficient multi-tasking and isochronous channel
512MB memory -> 1GB memory
High-level Language Programmability
Advanced GPUs designed exclusively for PROFESSIONAL PRODUCTIVITY
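To make the SCALAR/VECTOR distinction concrete, here is a small illustrative sketch (not part of the original slides): the same SAXPY kernel written first as ordinary scalar C, then as a data-parallel loop of the kind a GPU, or a SIMD/multi-core CPU, can spread across many elements at once. The OpenMP pragma is only a stand-in for whatever parallel substrate actually runs the kernel.

/* Illustrative sketch only: scalar vs. data-parallel expression of SAXPY.
 * Build with e.g.: gcc -O2 -fopenmp saxpy.c */
#include <stddef.h>

/* Scalar view: a latency-optimized core walks the array one element at a time. */
void saxpy_scalar(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Data-parallel view: every element is independent, so the work can be
 * spread across many ALUs, which is the pattern GPUs are built to exploit. */
void saxpy_parallel(size_t n, float a, const float *x, float *y)
{
    #pragma omp parallel for
    for (long i = 0; i < (long)n; ++i)
        y[i] = a * x[i] + y[i];
}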
A message from your sponsor
If you would like to try a Wildcat Realizm board
email [email protected]
CPUs and GPUs – Dynamic Tension
Fundamentally different designs finding increasingly common ground
[Diagram: CPU and GPU – fundamental differences in design approach, but increasing areas of commonality]
Increasing commonality creates possibilities for tighter integration
E.g. merge virtual address spaces with cache coherency
Would enable new CPU/GPU cooperative paradigms
Possibility of increased coprocessor linkage
Break the AGP/PCIe bottleneck
[Diagram: CPU subsystem and GPU subsystem sharing a cache-coherent unified virtual memory space]
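No unified CPU/GPU memory interface existed when this slide was written, so the following is a purely hypothetical sketch of the cooperative paradigm it describes: a pthread stands in for the GPU, and because both "processors" share one coherent address space, the CPU hands work to the kernel and reads the result in place, with no copy across AGP/PCIe.

/* Hypothetical sketch: CPU/GPU cooperation in a shared, coherent address
 * space, simulated here with an ordinary pthread standing in for the GPU. */
#include <pthread.h>
#include <stdio.h>

#define N 1024
static float data[N];                 /* one buffer, visible to both sides */

static void *gpu_worker(void *arg)    /* stand-in for a GPU kernel */
{
    (void)arg;
    for (int i = 0; i < N; ++i)
        data[i] *= 2.0f;              /* operates on the data in place */
    return NULL;
}

int main(void)
{
    for (int i = 0; i < N; ++i)       /* CPU produces the input */
        data[i] = (float)i;

    pthread_t gpu;
    pthread_create(&gpu, NULL, gpu_worker, NULL);   /* "launch the kernel" */
    pthread_join(gpu, NULL);                        /* synchronize */

    printf("data[10] = %f\n", data[10]);            /* CPU consumes in place */
    return 0;
}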
GPUs – More Than Graphics Processors?
The volume of graphics shipments has created the GPU phenomenon
Ingenious work ongoing to find alternative uses for these graphics machines
Can GPUs be modified to address non-graphics needs?
E.g. double precision, less SIMD more MIMD, more general data storage
Primarily an economic question
Not just technology
Does reaching for new markets decrease your graphics market share?
Increased costs bring no benefit for core market
[Chart: market/design spectrum from Graphics through Imaging to HPC, plotted against $. A design shift will only occur if the "Integral of Achieved Profit" is increased – Imaging is probably a small stretch for increased volume, but shifting as far as HPC may decrease effectiveness in the graphics market.]
Programming GPUs – Industry Challenge
GPU microarchitectures will not be exposed externally any time soon
Too much intellectual property would be exposed
Would create too much architectural inertia at a time of rapid innovation
Agree that Domain Specific Libraries are an effective, pragmatic approach
Good to start solving specific real problems now
But should we aim higher than just a library approach?
Feels like we need to expose the full flexibility of programmability
Creating effective industry programming infrastructure is a challenge
[Diagram: many evolving domain languages targeting many evolving GPU architectures – a combinatorial problem, with a firewall in front of the GPU ISAs]
Industry Standard Virtual Machine?
Could a Virtual Machine standard avoid combinatorial explosion?
Uncouples multiple languages from multiple GPUs
Target for domain language architects AND enables innovation by GPU vendors
Create an open and cross-platform industry standard virtual machine?
The right virtual machine could help persuade GPUs to evolve into stream processors
What should that virtual machine be?
Can we work together to figure out this key question?
Candidate abstractions between domain languages and GPUs:
• Brook or sh? – the level of abstraction we need to break out of the graphics mind-set? Too big a leap from the graphics base? Too high-level to be a useful virtual machine?
• OpenGL Shading Language? – too graphics-oriented? Effectively a graphics Domain Specific Library, but with the flexibility of programmability? Can it be extended for more generality? What direction should the OpenGL ARB take?
• ARB Vertex and Fragment extensions? – too graphics-oriented, too low-level to track the capabilities of evolving GPU architecture?
• Virtual machines – combine desirable features from the different approaches
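To illustrate what such a decoupling layer amounts to, here is a toy example (every opcode is invented for this sketch): domain-language front ends would emit a small, portable instruction set along these lines, and each GPU vendor would remain free to translate it to whatever its hardware actually executes. Here it is simply interpreted on the CPU.

/* Toy sketch (all opcodes invented): a minimal portable "virtual machine"
 * instruction set that domain languages could target and each vendor
 * could map to its own ISA.  Interpreted on the CPU for illustration. */
#include <stdio.h>

typedef enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT } Op;
typedef struct { Op op; float imm; } Instr;

static float run(const Instr *prog)
{
    float stack[64];
    int sp = 0;
    for (int pc = 0; ; ++pc) {
        switch (prog[pc].op) {
        case OP_PUSH: stack[sp++] = prog[pc].imm; break;
        case OP_ADD:  sp--; stack[sp - 1] += stack[sp]; break;
        case OP_MUL:  sp--; stack[sp - 1] *= stack[sp]; break;
        case OP_HALT: return stack[sp - 1];
        }
    }
}

int main(void)
{
    /* a*x + b with a=2, x=3, b=1, as a front-end compiler might emit it */
    Instr prog[] = {
        { OP_PUSH, 2.0f }, { OP_PUSH, 3.0f }, { OP_MUL, 0 },
        { OP_PUSH, 1.0f }, { OP_ADD, 0 }, { OP_HALT, 0 }
    };
    printf("%f\n", run(prog));   /* prints 7.000000 */
    return 0;
}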
Battery Powered GPUs!
The Khronos Group is now defining OpenGL ES 2.0
The OpenGL Shading Language comes to cell phones!
Driven hard by the cell-phone industry for compelling hand-held gaming
Aggressive development to match the availability of GPUs in handsets
OpenGL ES 2.0 will not just be in phones – e.g. games consoles
Sony Playstation is a Khronos Member
OpenGL ES roadmap:
• OpenGL ES 1.0 (based on OpenGL 1.3), mid-03 – enabled software AND hardware 3D engines, including small-footprint, low-end fixed-point platforms
• OpenGL ES 1.1 (based on OpenGL 1.5), mid-04 – increased emphasis on hardware acceleration and an enhanced 3D pipeline
• OpenGL ES 2.0 (based on OpenGL 2.0), mid-05 – GLSL-based shader programmability for embedded devices, tackling issues such as remote compilation
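For concreteness, a minimal sketch of the on-device compile path that OpenGL ES 2.0 defines, using the standard GLES2 entry points; the fragment shader is a trivial placeholder and an EGL context is assumed to already be current. The remote-compilation issue mentioned above is about moving exactly this compile step off the handset.

/* Minimal sketch: runtime (on-device) GLSL ES compilation with OpenGL ES 2.0.
 * Assumes an EGL/GLES2 rendering context is already current. */
#include <GLES2/gl2.h>
#include <stdio.h>

static const char *frag_src =
    "precision mediump float;\n"
    "void main() { gl_FragColor = vec4(1.0, 0.5, 0.0, 1.0); }\n";

GLuint compile_fragment_shader(void)
{
    GLuint shader = glCreateShader(GL_FRAGMENT_SHADER);
    glShaderSource(shader, 1, &frag_src, NULL);   /* GLSL ES source string */
    glCompileShader(shader);                      /* compiled on the device */

    GLint ok = 0;
    glGetShaderiv(shader, GL_COMPILE_STATUS, &ok);
    if (!ok) {
        char log[512];
        glGetShaderInfoLog(shader, sizeof log, NULL, log);
        fprintf(stderr, "shader compile failed: %s\n", log);
    }
    return shader;
}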
Embedded Industry - GP2 Genetic Diversity
Cell phones – hundreds of millions of units a year that will have GPUs
3D gaming now PLUS phones mutating into general-purpose personal compute devices
Size, power and cost – low-power design is now getting a lot of attention
Interesting for building handhelds AND large arrays for HPC, etc.
Embedded industry has fast innovation, flexible infrastructure
Tight CPU/GPU integration might happen here first – systems on a chip
Programmable acceleration avoids multiple media acceleration blocks
A programmable GPU can accelerate 3D, images, video, audio, speech and ….
OpenMAX – a new Khronos standard – domain specific primitive libraries
Uneasy alliance with DSPs too!!
Will GPUs even assume some baseband processing?
[Diagram: a single chip with an ARM CPU core and a low-power GPU core sharing a cache-coherent unified virtual memory space; domain-specific primitive libraries can be accelerated on the GPU]
Michael Doggett
ATI
Michael Doggett is an architect at ATI. He is working on
upcoming graphics hardware for Microsoft and desktop PC
graphics chips. Before joining ATI, Doggett was a postdoc at
the University of Tuebingen in Germany and completed his
Ph.D. at the University of New South Wales in Sydney,
Australia.
GPUs and CPUs: The Uneasy Alliance
Mike Doggett
ATI
GPUs
• Not stream processors
• Graphics black box
• Deep pipeline
– Arithmetic intensity
GPUs
• How to get new features into GPUs?
– Get game developers to use them
• Architectural specs
– API definition
– GPUBench
• Double precision
– Performance tradeoff
– Simulated double
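One common way to "simulate" double precision on single-precision hardware is float-float (double-single) arithmetic, in which a value is carried as the unevaluated sum of two floats. The sketch below shows the classic two-sum building block in C; it illustrates the general technique only, not any particular vendor's implementation.

/* Hedged sketch of "simulated double": float-float (double-single) arithmetic.
 * Each value is an unevaluated sum hi + lo of two single-precision numbers.
 * Requires strict IEEE evaluation; do not build with -ffast-math. */
typedef struct { float hi, lo; } ffloat;

/* exact sum of two floats, split into a rounded result and its error */
static ffloat two_sum(float a, float b)
{
    ffloat r;
    r.hi = a + b;
    float v = r.hi - a;
    r.lo = (a - (r.hi - v)) + (b - v);
    return r;
}

/* add two float-float values, renormalizing the result */
static ffloat ff_add(ffloat x, ffloat y)
{
    ffloat s = two_sum(x.hi, y.hi);
    s.lo += x.lo + y.lo;
    return two_sum(s.hi, s.lo);
}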
GPU future
• Competitive market
• More of the same
Adam Lake
Intel
Adam Lake is a Sr. Software Engineer at Intel specializing in
3D graphics. Previous areas of work include stream
processing, compilers for high level shading languages, and
non-photorealistic rendering. He holds an M.S. degree from
the University of North Carolina at Chapel Hill.
A few alternatives…
Intel IXP Network Processor Family
IXP Perf. Characteristics
• IXP2800 [Intel02]
– 51 GB/s peak to RDRAM (3 RDRAM channels, input and output, total aggregate @ 533 MHz)
– 32 GB/s peak to SDRAM (4 QDR II SDRAM ports, 2 read / 2 write, @ 250 MHz)
– 1.4 GHz clock rate
– Example application: 10 Gb/s Ethernet
• IXP2400: 4,800 MIPS
• IXP1200: 1,200 MIPS
Notes:
• NO FPU!!
• Packet arrival rate determines # instructions executed per packet
Key takeaways for IXP
• Designed for network processing workloads
• Switch-on-event model for hardware resources
• No FPU, nor plans for FPU
• Improving software stack
– Shangri-La project
MXP5800
Specs of MXP5800
• Internal B/W: 532 MBytes/s per connection
• Theoretical external B/W: 1 GByte/s
• 130 nm
• 256 MHz
• 35 mm x 35 mm die
Key takeaways from MXP
• Not a general-purpose microprocessor
• Shipping today with software tools
• One common ISA for all execution units
So what’s the point?
• Some alternatives for general-purpose computing on special-purpose hardware
• Larger context of stream processing architectures
Programming Models
• Getting the programming model right is hard
– Made harder if you try to be completely general
– Reason: increase generality and you lose performance
• Graphics architects got it right for graphics
• You can quickly lose any benefit of your stream programming model
– Fully general streaming, in the limit, is multithreading
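A small illustrative contrast (not from the talk): a stream "map" applies a side-effect-free kernel independently to each element, which is what makes the model easy to schedule; the moment elements need to communicate with each other, you are back to ordinary multithreading with locks and races.

/* Toy illustration of the tradeoff above.  A stream "map" kernel is a pure
 * per-element function with no communication, so the runtime may schedule
 * it however it likes; once elements must talk to each other, the problem
 * becomes general multithreading. */
#include <stddef.h>

typedef float (*kernel_fn)(float);

/* stream model: apply a side-effect-free kernel to every element */
void stream_map(kernel_fn k, const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = k(in[i]);           /* independent: trivially parallel */
}

static float brighten(float x) { return x * 1.2f; }

int main(void)
{
    float in[4] = { 0.1f, 0.2f, 0.3f, 0.4f }, out[4];
    stream_map(brighten, in, out, 4);
    return 0;
}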
Call to Action
• For some applications in computational science and other domains, performance is the dominant factor, not cost
• However, in other domains, cost is dominant:
– Purchase price per MIP
– Not just raw performance
• Call to action: consider chipset implementations
– Analysis of GPGPU taking raw $ cost into account
– There are 3 options, not 2: CPU vs. CPU + chipset vs. GPU
The BIG Problems
• How do we program it?
– Programming model
• How do we feed it?
– Memory hierarchy and bandwidth
• How do we keep it cool?
– Power and thermal requirements provide significant challenges for ALL architectures
David Kirk
NVIDIA
David Kirk has been NVIDIA's Chief Scientist since January
1997. Prior to joining NVIDIA, Kirk held positions at Crystal
Dynamics and the Apollo Systems Division of Hewlett-Packard Company. Kirk holds M.S. and Ph.D. degrees in
Computer Science from the California Institute of
Technology.
(Year 2000)
The GeForce256 Graphics Pipeline
[Pipeline diagram – fixed-function stages, with vertices, polygons, pixels and images flowing toward memory: vertex transform & lighting → polygon setup & rasterization → per-pixel interpolation → per-pixel texture filter & x8 blending → Z-buffer, x8 blending & anti-alias → memory]
(Year 2004)
The GeForce6 Graphics Pipeline
[Pipeline diagram – the same stages, now partly programmable: programmable vertex processing (fp32) → polygon setup, culling, rasterization → programmable per-pixel math (fp32) → per-pixel texture, fp16 blending → Z-buf, fp16 blending, anti-alias (MRT) → memory]
(Year 2004)
The GeForce6 NON-Graphics Pipeline
[Pipeline diagram – the same hardware viewed as a general data pipeline, with data and lists in place of vertices and polygons: programmable MIMD processing (fp32) → SIMD "rasterization" → programmable SIMD processing (fp32) → data fetch, fp16 blending → predicated write, fp16 blend, multiple output → memory]
“GP” Processors
[Diagram: shared peak input bandwidth → dedicated peak processing power → shared peak output bandwidth → memory]
Bill Mark
University of Texas at Austin
Bill Mark is an assistant professor in the Department of
Computer Sciences at the University of Texas at Austin.
Mark was the lead architect of NVIDIA's Cg language and
development system. He holds a Ph.D. from the University
of North Carolina at Chapel Hill.
GP2 Panel Presentation
William Mark, University of Texas at Austin
We’re entering an era of
disruptive change
• Driven by VLSI technology
– Too many transistors: CPU performance plateau
– Heat/Power is now a first-class constraint
– Possible to fit many processors on a single chip
• Two kinds of change coming:
– Technical – single-chip parallel computation
– Industry structure – pressure for vertical re-integration
What do we mean by
“CPU vs. GPU”?
• General HW vs. specialized HW
– GPUs moving towards generality, but not fully there yet
• Sequential vs. Parallel
– Latency optimized vs. Throughput optimized
• Two separate chips
• Different sets of companies (exception: Intel)
• Raw HW access vs. Managed code
Need at least two parallel
programming models
• Stream model
– Naturally exposes parallelism and communication
– Easy to use, when problem maps well
• Communicating sequential processes (e.g. pthreads)
– Explicitly exposes spatial dimension of HW parallelism
– Efficiently supports data-dependent communication patterns
– Useful for creating/modifying large irregular data structures
– Harder to use – e.g. race conditions (see the sketch below)
– Hard to get performance portability
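As a minimal illustration of the race-condition hazard noted in the list above, the following pthreads fragment has two threads increment a shared counter; remove the mutex and the final value becomes unpredictable, which is exactly the class of bug the stream model avoids by construction.

/* Minimal example of the race-condition hazard of the pthreads model:
 * two threads increment a shared counter.  Without the mutex the final
 * value is unpredictable; with it the result is always 2 * ITERS. */
#include <pthread.h>
#include <stdio.h>

#define ITERS 1000000
static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERS; ++i) {
        pthread_mutex_lock(&lock);    /* remove these two lines to see the race */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld (expected %d)\n", counter, 2 * ITERS);
    return 0;
}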
HW must satisfy
mass-market needs
• Games will continue to dominate
– Rendering
– Simulation? – an opportunity
• Maximize impact of research by meeting game needs
– Chicken/egg problem: co-evolve algorithms and architectures
– Different visibility algorithms – ray casting?
– Global illumination – shadows, ambient occlusion, reflection, …
– Parallelize model management, simulation, game behavior, …
• Solving these problems will help other applications
2-year predictions
• CPUs: multi-core trend accelerates
– Multicore used by games and HPC
• GPUs: More powerful streaming model
– Scatter, gather, conditional streams, reductions, etc.
– Start to see more success stories for GPGPU
– But limits of stream model become apparent
• “Dark Horses” attract increasing attention
– CELL and others
6-year predictions
• One processing chip for PCs
– Who makes it?
• Heterogeneous architecture for this chip:
– Classical CPU
– Parallel fine-grained shared memory (pthreads)
– Parallel stream processor (Brook)
• Supports ray-casting visibility
• This architecture emerges in console space first
• This architecture meets many HPC needs
Peter N. Glaskowsky
MemoryLogix
Peter Glaskowsky is Chief System Architect at
MemoryLogix, a Silicon Valley microprocessor design
startup. Formerly, Glaskowsky was editor in chief of
Microprocessor Report and a principal analyst with In-Stat/MDR, a chief engineer at Integrated Device Technology,
and a lead engineer at SuperMac and Telebit.
Some Panel Topics
• Which problems are the natural province
of the CPU?
• …of the GPU?
• Which CPU design elements will be
borrowed by GPUs, and vice-versa?
• Which problems support cooperation
between the CPU and GPU?
– How do we stimulate this cooperation?
– Or will it be more like competition?