Transcript 7810-24
CS 7810
Lecture 24
The Cell Processor
H. Peter Hofstee
Proceedings of HPCA-11
February 2005
Heterogeneous CMPs
Multi-core design at 0.1m technology: half the area occupied
by shared L2 cache; half the area occupied by 4 diverse cores
Relative Core Parameters
Dynamic Core Selection
An oracle selects the core that minimizes
energy, while performing within 10% of optimal
Results
Heterogeneous CMPs
• Inevitable!
• A 100-core CMP in 2015: special-purpose cores to
execute network operations, SIMD code, GP code,
OS code, graphics operations, …
• Special-purpose cores Vs. general-purpose cores: is the
investment in silicon area, design effort worthwhile?
Importance of target applications (games!)
Difference in behavior (performance, power)
Arguments for Cell: Improve Efficiency
• Metric of choice: performance per transistor or
performance per watt or …
• Gelsinger’s Law: transistor doubling leads to 40%
improvement in architectural performance
larger caches/bpreds, diminishing returns
transistor overhead a issue widthx ; low
utilization of issue slots
overheads for deep pipelines
Cell Innovations
• For regular apps, the compiler can better manage
the cache, loop unrolling helps, software branch
prediction is not terrible
Local storage: can be directly addressed by the
program (almost like a large register file)
DMA controllers to keep local storage full:
fight memory latency with high bandwidth!
Large register file to enable unrolling
Branch target hints
Cell Basics
PowerPC Core
• PPC core serves as the master – the SPEs have
no provision for protection
• This is the “inefficient” general-purpose processor
that provides high performance for most GP apps
(including the OS)
• 2-way in-order issue; 2-thread SMT; 32KB L1s;
512KB L2
SPE Features
• Simple 2-wide in-order pipeline for number crunching
• 11 FO4 pipeline stages (3.2 GHz); 14M SRAM transistors
and 7M logic transistors (total 234M)
• 256KB local storage that is explicitly managed
by the program; 128-entry register file (128b registers);
2-cycle register file access with full bypass
• Most ALUs are SIMD: they operate on 128b vectors of
four 32b elements
Cell Voltage/Frequency/Power
Power5 Overview
• Two cores, each being 2-way SMT (more threads
increase complexity and yield little benefit)
• Cores share a 1.875MB L2 and an off-chip 36MB L3
• 389mm2 in 130nm technology
• 8-wide fetch is fine-grain multi-threaded; rest is SMT
Power5 Pipeline
Pipeline Details
• Long loop within fetch
• Combining branch predictor (bimodal and gshare)
• Five instructions (from the same thread) are
dispatched in a cycle
• 120 Int and 120 FP registers; only 20 entries in
the ROB (each entry tracks a group of 5 instrs)
• Eight execution units (many register ports!)
SMT Features
• A thread is given lower priority during fetch if it occupies
too many ROB entries
• A thread is not decoded until outstanding L2 misses clear
• If a thread executes a long-latency operation, decode is
stalled and part of the pipeline is flushed
• Software can set priority for a thread (useful for real-time
apps or when the app knows that a thread is in an idle loop)
Thread Priorities
Title
• Bullet