Heterogeneous Multi-Core
Processors
Jeremy Sugerman
GCafe May 3, 2007
Context
Exploring the future relationship between CPUs and GPUs
– Joint work and thinking with Kayvon
– Much kibitzing from Pat, Mike, Tim, Daniel
Vision and opinion, not experiments and results
– More of a talk than a paper
– The value is more conceptual than algorithmic
– Wider GCafe audience appeal than our near-term,
elbows-deep plans to dive into GPU guts
Outline
Introduction
CPU “Special Feature” Background
Compute-Maximizing Processors
Synthesis, with Extensions
Questions for the Audience…
Introduction
Multi-core is the status quo for forthcoming CPUs
A variety of emerging (for “general purpose” use)
architectures try to offer a discontinuous
performance boost over traditional CPUs
– GPU, Cell SPEs, Niagara, Larrabee, …
CPU vendors have a history of co-opting special-purpose
units for targeted performance wins:
– FPU, SSE/Altivec, VT/SVM
CPUs should co-opt entire “compute” cores!
Introduction
Industry is already exploring hybrid models
– Cell: 1 PowerPC and 8 SPEs
– AMD Fusion: Slideware CPU + GPU
– Intel Larrabee: Weirder, NDA encumbered
The programming model for communication between
cores deserves to be architecturally defined.
Tighter integration than the current “host +
accelerator” model eases porting and improves efficiency.
Work queues / buffers allow integrated
coordination with decoupled execution (sketched below).
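A minimal C++ sketch of that coordination style, assuming invented WorkQueue / WorkItem types rather than any real vendor API: a CPU thread inserts work and keeps running, while a (here simulated) compute core drains the queue in batches on its own schedule.

    #include <condition_variable>
    #include <cstddef>
    #include <cstdio>
    #include <deque>
    #include <mutex>
    #include <thread>
    #include <vector>

    // Hypothetical work item: a kernel id plus a data element index.
    struct WorkItem { int kernel; int element; };

    // A thread-safe queue standing in for the architected
    // CPU <-> compute-core work buffer.
    class WorkQueue {
    public:
        void enqueue(WorkItem w) {
            { std::lock_guard<std::mutex> g(m_); q_.push_back(w); }
            cv_.notify_one();
        }
        // Dequeue a batch, so the consumer can schedule coherently
        // rather than strictly FIFO.
        std::vector<WorkItem> dequeue_batch(std::size_t max) {
            std::unique_lock<std::mutex> g(m_);
            cv_.wait(g, [&] { return !q_.empty() || closed_; });
            std::vector<WorkItem> batch;
            while (!q_.empty() && batch.size() < max) {
                batch.push_back(q_.front());
                q_.pop_front();
            }
            return batch;  // empty only once closed and drained
        }
        void close() {
            { std::lock_guard<std::mutex> g(m_); closed_ = true; }
            cv_.notify_all();
        }
    private:
        std::mutex m_;
        std::condition_variable cv_;
        std::deque<WorkItem> q_;
        bool closed_ = false;
    };

    int main() {
        WorkQueue q;
        // "Compute core": drains batches, decoupled from the producer.
        std::thread compute([&] {
            for (;;) {
                auto batch = q.dequeue_batch(4);
                if (batch.empty()) break;
                for (auto& w : batch)
                    std::printf("kernel %d on element %d\n", w.kernel, w.element);
            }
        });
        // "CPU core": inserts work, then continues independently.
        for (int i = 0; i < 8; ++i) q.enqueue(WorkItem{0, i});
        q.close();
        compute.join();
    }

The batching dequeue is the point: insertion is cheap and immediate for the producer, while execution order is left to the consumer's scheduler.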
Outline
Introduction
CPU “Special Feature” Background
Compute-Maximizing Processors
Synthesis, with Extensions
Questions for the Audience…
CPU “Special Features”
CPUs are built for general-purpose flexibility…
… but have always stolen fixed-function units in
the name of performance.
– Old CPUs had schedulers and malloc burned in!
– CISC instructions really were faster
– Hardware-managed TLBs and caches
– Arguably, all virtual memory support
CPU “Special Features”
More relevantly, dedicated hardware has been
adopted for domain-specific workloads.
… when the domain was sufficiently large /
lucrative / influential
… and the increase in performance over
software implementation / emulation was BIG
… and the cost in “design budget” (transistors,
power, area, etc.) was acceptable.
Examples: FPUs, SIMD and non-temporal memory
accesses, CPU virtualization
Outline
Introduction
CPU “Special Feature” Background
Compute-Maximizing Processors
Synthesis, with Extensions
Questions for the Audience…
Compute-Maximizing Processors
“Important” common apps are FLOP hungry
– Video processing, Rendering
– Physics / Game “Physics”
– Even OS compositing managers!
HPC apps are FLOP hungry too
– Computational Bio, Finance, Simulations, …
All can soak vastly more compute than current CPUs
can deliver.
All can utilize thread or data parallelism.
Increased interest in custom / non-”general” processors
Compute-Maximizing Processors
Or “throughput oriented”
Packed with ALUs / FPUs
Application-specified parallelism replaces the
focus on single-thread ILP
Available in many flavours:
– SIMD
– Highly threaded cores
– High numbers of tiny cores
– Stream processors
Real-life examples generally mix and match
Compute-Maximizing Processors
Offer an order-of-magnitude potential performance
boost… if the workload sustains high processor utilization
Mapping / porting algorithms is a labour-intensive
and complex effort.
This is intrinsic. Within any design budget, a
BIG performance win comes at a cost…
If it didn’t, the CPU designers would steal it.
Compute-Maximizing Programming
Generally offered as off-board “accelerators”
– Data “tossed over the wall” and back
– Only portions of computations achieve a
speedup if offloaded
– Accelerators mono-task one kernel at a time
Applications are sliced into successive statically
defined phases separated by resorting,
repacking, or converting entire datasets.
Limited to a single dataset-wide feed-forward
pipeline. Effectively back to batch processing (sketched below)
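For concreteness, the host side of that batch model looks roughly like this C++ sketch; upload / launch_kernel / sync / download / repack are invented stand-ins for a 2007-era accelerator runtime, not any particular library:

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    using Dataset = std::vector<float>;

    // Invented accelerator runtime stubs; real APIs differ in detail.
    void upload(const Dataset& d)    { std::printf("upload %zu elements\n", d.size()); }
    void download(const Dataset& d)  { std::printf("download %zu elements\n", d.size()); }
    void launch_kernel(const char* k, std::size_t n) { std::printf("run %s over %zu\n", k, n); }
    void sync()                      { /* block until the accelerator goes idle */ }
    void repack(Dataset&)            { std::printf("CPU resorts / repacks dataset\n"); }

    // Each statically defined phase covers the WHOLE dataset, with a
    // full round trip over the wall and a repack between phases: a
    // feed-forward, batch-processing pipeline.
    int main() {
        Dataset data(1 << 20);
        upload(data);                          // toss data over the wall
        launch_kernel("phase1", data.size());  // accelerator mono-tasks phase1
        sync();
        download(data);                        // toss results back

        repack(data);                          // dataset-wide conversion

        upload(data);
        launch_kernel("phase2", data.size());
        sync();
        download(data);
    }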
Outline
Introduction
CPU “Special Feature” Background
Compute-Maximizing Processors
Synthesis, with Extensions
Questions for the Audience…
Synthesis
Add at least one compute-max core to CPUs
– Workloads that use it get a BIG performance win
– Programmers are struggling to get any
performance from having more normal cores
– Being on-chip, architected, and ubiquitous is
huge for application use of compute-max
Compute core exposed as a programmable,
independent, multithreaded execution engine (sketched below)
– A lot like adding (only!) fragment shaders
– Largely agnostic on hardware “flavour”
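A rough sketch of that base model, with invented names: a data-parallel, shader-style kernel launch on the on-chip compute core, here simulated with plain std::thread.

    #include <cstdio>
    #include <thread>
    #include <vector>

    // Invented compute_launch(): run one kernel over n elements as
    // independent logical threads, much like one fragment-shader
    // invocation per element. std::thread stands in for hardware threads.
    template <typename Kernel>
    void compute_launch(int n, Kernel k) {
        std::vector<std::thread> threads;
        for (int i = 0; i < n; ++i)
            threads.emplace_back([=] { k(i); });
        for (auto& t : threads) t.join();
    }

    int main() {
        std::vector<float> data(8, 1.0f);
        // Kernels are independent per element: no cross-thread state,
        // so the hardware "flavour" (SIMD, threads, tiny cores) stays hidden.
        compute_launch((int)data.size(), [&](int i) {
            data[i] *= 2.0f;
            std::printf("element %d -> %g\n", i, data[i]);
        });
    }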
Extensions
Unified address space
– Coherency is nice, but a unified space is still valuable without it
Multiple kernels “bound” (loaded) at a time
– All part of the same application, for now
“Work” delivered to compute cores through work
queues
– Dequeuing batches / schedules for
coherence, not necessarily FIFO
– Compute and CPU cores can insert on
remote queues (sketched below)
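A sketch of what that interface could look like; bind_kernel and enqueue are invented names, since the talk deliberately leaves the API open. Two kernels are bound at once, the scheduler drains whole queues for coherence rather than FIFO order, and kernel 0 inserts follow-on work on kernel 1's ("remote") queue:

    #include <cstddef>
    #include <cstdio>
    #include <deque>
    #include <functional>
    #include <vector>

    struct Item { int element; };
    using Kernel = std::function<void(Item)>;

    // Invented interface: several kernels "bound" at once, each with
    // its own work queue; any core may insert on any queue.
    struct ComputeCore {
        std::vector<Kernel> kernels;           // bound kernels
        std::vector<std::deque<Item>> queues;  // one queue per kernel

        int bind_kernel(Kernel k) {
            kernels.push_back(std::move(k));
            queues.emplace_back();
            return (int)kernels.size() - 1;    // kernel handle
        }
        void enqueue(int kernel, Item it) { queues[kernel].push_back(it); }

        // Scheduler: drain each queue as a coherent batch, not FIFO
        // across kernels; exit only when every queue is empty.
        void run() {
            for (bool idle = false; !idle; ) {
                idle = true;
                for (std::size_t k = 0; k < queues.size(); ++k)
                    while (!queues[k].empty()) {
                        Item it = queues[k].front();
                        queues[k].pop_front();
                        kernels[k](it);
                        idle = false;
                    }
            }
        }
    };

    int main() {
        ComputeCore core;
        int shade = -1;  // bound below; captured by reference
        int intersect = core.bind_kernel([&](Item it) {
            std::printf("intersect element %d\n", it.element);
            core.enqueue(shade, it);           // remote insert from a kernel
        });
        shade = core.bind_kernel([](Item it) {
            std::printf("shade element %d\n", it.element);
        });
        for (int i = 0; i < 4; ++i)
            core.enqueue(intersect, Item{i});  // CPU-side inserts
        core.run();
    }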
Extensions
CLAIM: Queues break the “batch processing”
straitjacket and still expose enough coherent
parallelism to sustain compute-max utilization.
First part is easy:
Obvious per-data-element state machine
Dynamic insertion of new “work”
Instead of going idle as the live thread count
in a “pass” drops, a core can pull in “work”
from other “passes” (queues), as in the scheduler loop above.
Extensions
CLAIM: Queues break the “batch processing”
straitjacket and still expose enough coherent
parallelism to sustain compute-max utilization.
Second part is more controversial:
“Lots” of data quantized into a “few” states
should have plentiful, easy coherence.
… if the workload as a whole has coherence
Pigeonhole argument, basically (worked below)
Also mitigates SIMD performance constraints
Coherence can be built / specified dynamically
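To make the pigeonhole argument concrete (illustrative numbers of mine, not the talk's): with N in-flight elements quantized into K states, the fullest queue holds at least ceil(N/K) elements:

    \[
      \max_{1 \le i \le K} |Q_i| \;\ge\; \left\lceil \frac{N}{K} \right\rceil
    \]

e.g., N = 10^6 live elements across K = 16 kernels guarantees some queue at least 62,500 deep, far wider than any SIMD batch, so long as the workload as a whole stays that parallel.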
Outline
Introduction
CPU “Special Feature” Background
Compute-Maximizing Processors
Synthesis, with Extensions
Questions for the Audience…
Audience Participation
Do you believe my argument conceptually?
– For the heterogeneous / hybrid CPU in general?
– For queues and multiple kernels?
What persuades you that 3 x86 cores + a compute core is
preferable to quad x86?
– What app / class of apps, and how much of a win? 10x? 5x?
How skeptical are you that queues can match the performance of
multi-pass / batching?
What would you find a compelling flexibility / expressiveness
justification for adding queues?
– Performance wins from regaining coherence in existing
branching/looping shaders?
– New algorithms if shaders and CPU threads can dynamically
insert additional “work”?