
Heterogeneous Multi-Core
Processors
Jeremy Sugerman
GCafe May 3, 2007
Context
 Exploring the future relationship between CPUs and GPUs
– Joint work and thinking with Kayvon
– Much kibitzing from Pat, Mike, Tim, Daniel
 Vision and opinion, not experiments and results
– More of a talk than a paper
– The value is more conceptual than algorithmic
– Wider GCafe audience appeal than our near-term, elbows-deep plans to dive into GPU guts
Outline
 Introduction
 CPU “Special Feature” Background
 Compute-Maximizing Processors
 Synthesis, with Extensions
 Questions for the Audience…
Introduction
 Multi-core is status quo for forthcoming CPUs
 A variety of emerging (for “general purpose” use) architectures try to offer a discontinuous performance boost over traditional CPUs
– GPU, Cell SPEs, Niagara, Larrabee, …
 CPU vendors have a history of co-opting special
purpose units for targeted performance wins:
– FPU, SSE/Altivec, VT/SVM
 CPUs should co-opt entire “compute” cores!
Introduction
 Industry is already exploring hybrid models
– Cell: 1 PowerPC and 8 SPEs
– AMD Fusion: Slideware CPU + GPU
– Intel Larrabee: Weirder, NDA-encumbered
 The programming model for communication between cores deserves to be architecturally defined.
 Tighter integration than the current “host + accelerator” model eases porting and improves efficiency.
 Work queues / buffers allow integrated coordination with decoupled execution (see the sketch below).
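To make the work-queue idea concrete, here is a minimal C++ sketch of a single shared queue through which CPU and compute-max cores hand work to each other while each continues to execute independently. The WorkItem fields and the queue interface are illustrative assumptions, not an existing API.

```cpp
#include <cstdint>
#include <deque>
#include <mutex>
#include <optional>

struct WorkItem {
    uint32_t kernel_id;   // which bound kernel should process this item
    uint32_t element_id;  // which data element to operate on
};

class WorkQueue {
public:
    // Any core, CPU or compute, may insert work at any time.
    void push(WorkItem item) {
        std::lock_guard<std::mutex> lock(mutex_);
        items_.push_back(item);
    }
    // A core drains items when it has spare execution slots; the two
    // sides coordinate through the queue but never block on each other.
    std::optional<WorkItem> pop() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.empty()) return std::nullopt;
        WorkItem item = items_.front();
        items_.pop_front();
        return item;
    }
private:
    std::mutex mutex_;
    std::deque<WorkItem> items_;
};
```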
Outline
 Introduction
 CPU “Special Feature” Background
 Compute-Maximizing Processors
 Synthesis, with Extensions
 Questions for the Audience…
CPU “Special Features”
 CPUs are built for general purpose flexibility…
 … but have always stolen fixed function units in
the name of performance.
– Old CPUs had schedulers and malloc burned in!
– CISC instructions really were faster
– Hardware managed TLBs and caches
– Arguably, all virtual memory support
CPU “Special Features”
 More relevantly, dedicated hardware has been
adopted for domain-specific workloads.
 … when the domain was sufficiently large /
lucrative / influential
 … and the increase in performance over
software implementation / emulation was BIG
 … and the cost in “design budget” (transistors,
power, area, etc.) was acceptable.
 Examples: FPUs, SIMD and Non-Temporal
accesses, CPU virtualization
Outline
 Introduction
 CPU “Special Feature” Background
 Compute-Maximizing Processors
 Synthesis, with Extensions
 Questions for the Audience…
Compute-Maximizing Processors
 “Important” common apps are FLOP hungry
– Video processing, Rendering
– Physics / Game “Physics”
– Even OS compositing managers!
 HPC apps are FLOP hungry too
– Computational Bio, Finance, Simulations, …
 All can soak up vastly more compute than current CPUs can deliver.
 All can exploit thread or data parallelism.
 Increased interest in custom / non-“general” processors
Compute-Maximizing Processors
 Also called “throughput-oriented” processors
 Packed with ALUs / FPUs
 Application-specified parallelism replaces the focus on single-thread ILP
 Available in many flavours:
– SIMD
– Highly threaded cores
– High numbers of tiny cores
– Stream processors
 Real life examples generally mix and match
Compute-Maximizing Processors
 Offer an order-of-magnitude potential performance boost… if the workload sustains high processor utilization
 Mapping / porting algorithms is a labour-intensive and complex effort.
 This is intrinsic. Within any design budget, a
BIG performance win comes at a cost…
 If it didn’t, the CPU designers would steal it.
Compute-Maximizing Programming
 Generally offered as off-board “accelerators”
– Data “tossed over the wall” and back
– Only portions of computations achieve a
speedup if offloaded
– Accelerators mono-task one kernel at a time
 Applications are sliced into successive statically
defined phases separated by resorting,
repacking, or converting entire datasets.
 Limited to a single dataset-wide feed-forward pipeline; effectively, this is batch processing again (sketched below).
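A hypothetical sketch of that batch style: each statically defined phase covers the entire dataset, with transfers and host-side repacking between phases. All names here (copy_to_device, run_kernel, repack) are stand-ins for whatever API a particular accelerator exposes, stubbed so only the shape of the model is visible.

```cpp
#include <vector>

// Hypothetical stand-ins for an accelerator's transfer / launch API.
using Dataset = std::vector<float>;
enum Kernel { kPhase1, kPhase2 };

Dataset copy_to_device(const Dataset& d) { return d; }            // pretend DMA
void copy_from_device(const Dataset& src, Dataset& dst) { dst = src; }
void run_kernel(Kernel, Dataset&) {}                              // pretend launch
void repack(Dataset&) {}                                          // host-side resort

// The "host + accelerator" batch model: each statically defined phase
// covers the ENTIRE dataset, the accelerator mono-tasks one kernel, and
// the host resorts / repacks between passes.
void process_dataset(Dataset& data) {
    Dataset buf = copy_to_device(data);   // data "tossed over the wall"...
    run_kernel(kPhase1, buf);
    copy_from_device(buf, data);          // ...and back
    repack(data);                         // dataset-wide barrier between phases
    Dataset buf2 = copy_to_device(data);
    run_kernel(kPhase2, buf2);
    copy_from_device(buf2, data);
}
```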
Outline
 Introduction
 CPU “Special Feature” Background
 Compute-Maximizing Processors
 Synthesis, with Extensions
 Questions for the Audience…
Synthesis
 Add at least one compute-max core to CPUs
– Workloads that use it get a BIG performance win
– Programmers are struggling to get any performance win from additional normal cores
– Being “on-chip”, architected, and ubiquitous is huge for application adoption of compute-max
 Compute core exposed as a programmable, independent, multithreaded execution engine (see the sketch below)
– A lot like adding (only!) fragment shaders
– Largely agnostic on hardware “flavour”
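A minimal sketch of what “a lot like adding (only!) fragment shaders” could mean in practice: a kernel is a per-element function with no cross-element dependencies, so the hardware can run instances across however many threads its particular flavour provides. The names and the serial launch loop are illustrative, not a proposed interface.

```cpp
#include <cstddef>

// Hypothetical kernel shape: like a fragment shader, it touches exactly
// one element and carries no cross-element dependencies.
using Kernel = void (*)(float* element);

void brighten(float* element) { *element *= 1.1f; }  // a trivial example kernel

// What the compute core does conceptually; a real core would run this
// loop across many hardware threads rather than serially.
void launch(Kernel k, float* data, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        k(&data[i]);
}
```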
Extensions
 Unified address space
– Coherency is nice, but the shared space is still valuable without it
 Multiple kernels “bound” (loaded) at a time
– All part of the same application, for now
 “Work” is delivered to compute cores through work queues (see the sketch after this list)
– Dequeuing batches / schedules for coherence, not necessarily in FIFO order
– Compute and CPU cores can insert work on remote queues
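One possible shape for “multiple kernels bound, each fed by a queue, with remote inserts”, sketched in C++. Everything here (Item, System, the scheduling loop) is a hypothetical illustration; a real scheduler would dequeue in coherent batches rather than draining queues serially.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>
#include <vector>

struct Item { uint32_t element_id; };

// Hypothetical sketch: several kernels "bound" at once (all from one
// application), each fed by its own queue; a running kernel may insert
// items on any queue, including another kernel's (a "remote" insert).
struct System {
    std::vector<std::function<void(System&, Item)>> kernels;  // bound kernels
    std::vector<std::deque<Item>> queues;                     // one per kernel

    void enqueue(std::size_t kernel_id, Item item) {          // local or remote
        queues[kernel_id].push_back(item);
    }

    // A real scheduler would batch dequeues for coherence, not FIFO and
    // not serially; this loop only shows every inserted item getting run.
    void run_until_idle() {
        for (bool any = true; any; ) {
            any = false;
            for (std::size_t k = 0; k < kernels.size(); ++k) {
                while (!queues[k].empty()) {
                    Item it = queues[k].front();
                    queues[k].pop_front();
                    kernels[k](*this, it);
                    any = true;
                }
            }
        }
    }
};
```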
Extensions
CLAIM: Queues break the “batch processing”
straitjacket and still expose enough coherent
parallelism to sustain compute-max utilization.
 The first part is easy (see the usage example below):
 An obvious per-data-element state machine
 Dynamic insertion of new “work”
 Instead of sitting idle as the live thread count in a “pass” drops, a core can pull in “work” from other “passes” (queues).
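Continuing the hypothetical System sketch from the previous slide, a two-pass, per-element state machine: pass 0 classifies elements and dynamically inserts the survivors into pass 1's queue, so next-pass work becomes available while the current pass is still draining. The classification rule is arbitrary, purely for illustration.

```cpp
// Usage of the hypothetical System sketch above: a two-pass, per-element
// state machine with dynamic insertion, in place of dataset-wide barriers.
int main() {
    System s;
    s.queues.resize(2);                              // one queue per "pass"
    s.kernels.push_back([](System& sys, Item it) {   // pass 0: classify
        if (it.element_id % 2 == 0)                  // arbitrary test, for show
            sys.enqueue(1, it);                      // survivors advance NOW...
    });                                              // ...not after a full pass
    s.kernels.push_back([](System&, Item) {
        /* pass 1: shade / integrate / etc. */
    });
    for (uint32_t i = 0; i < 8; ++i) s.enqueue(0, Item{i});
    s.run_until_idle();  // cores pull pass-1 work as pass-0 work drains
}
```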
Extensions
CLAIM: Queues break the “batch processing”
straitjacket and still expose enough coherent
parallelism to sustain compute-max utilization.
 The second part is more controversial:
 “Lots” of data quantized into a “few” states should have plentiful, easy coherence…
 … if the workload as a whole has coherence
 A pigeonhole argument, basically (sketched below)
 Also mitigates SIMD performance constraints
 Coherence can be built / specified dynamically
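The pigeonhole point can be made concrete: with N items quantized into K states and N much larger than K, the fullest state queue holds at least N/K items, so a scheduler that batches by state can always issue a long run of same-state (same-kernel, SIMD-friendly) work. A hypothetical sketch of such a batch picker:

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

struct Batch {
    uint32_t state;                  // which quantized state this batch runs
    std::vector<uint32_t> elements;  // a coherent run of element ids
};

// Pick the fullest state queue; by pigeonhole it holds at least
// total_items / num_states entries, giving a long coherent batch.
Batch next_coherent_batch(std::vector<std::deque<uint32_t>>& per_state,
                          std::size_t width) {
    std::size_t best = 0;
    for (std::size_t s = 1; s < per_state.size(); ++s)
        if (per_state[s].size() > per_state[best].size()) best = s;

    Batch b{static_cast<uint32_t>(best), {}};
    while (b.elements.size() < width && !per_state[best].empty()) {
        b.elements.push_back(per_state[best].front());
        per_state[best].pop_front();
    }
    return b;
}
```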
Outline
 Introduction
 CPU “Special Feature” Background
 Compute-Maximizing Processors
 Synthesis, with Extensions
 Questions for the Audience…
Audience Participation
 Do you believe my argument conceptually?
– For the heterogeneous / hybrid CPU in general?
– For queues and multiple kernels?
 What would persuade you that 3 x86 cores + 1 compute core is preferable to quad x86?
– What app / class of apps, and how much of a win? 10x? 5x?
 How skeptical are you that queues can match the performance of
multi-pass / batching?
 What would you find a compelling flexibility / expressiveness
justification for adding queues?
– Performance wins regaining coherence in existing
branching/looping shaders?
– New algorithms if shaders and CPU threads can dynamically
insert additional “work”?