Scaling blah blah - Georgia Institute of Technology

Pushing Performance, Efficiency and Scalability of Microprocessors
CERCS IAB Meeting, Fall 2006
Gabriel Loh
Research Overview
• Funding from state of GA, Intel, MARCO
• Currently 2 PhD students, 2 MS
– Active undergrad research as well
• Collaborations
– Universities: PSU, UO, Rutgers
– Industry: Intel, IBM
Research Focus
• “Near-term” microprocessor design issues
– ~ 5-year time scale
– Power/performance/complexity
– Traditional uniprocessor performance
– Multi-core performance
• “Longer-term”
– Keeping Moore’s Law alive for the longer term
– Primarily, 3D integration for now
Scaling Performance and Efficiency
• Multi-cores are here, but single-thread perf still matters
– Intel Core 2 Duo is multi-core, but…
– Single core is more OOO than ever
  • Larger instruction window, improved branch prediction, speculative load-store ordering, wider pipe and decoders
– But power also really matters
  • Lower clock speeds, different channel length transistors, more uop fusion, …
Research Focus
• Maximum performance within bounds
– Bounds = power, area, TDP, …
• Single-core performance helps multi-core performance, too
– Future multi-core systems need to strike a good balance between single-thread (1T) and multi-thread (MT) performance
• Most of our research is at the uarch level
– Caches, branch predictors, instruction schedulers,
memory queue design, memory dependence prediction,
etc.
Highlight: Traditional Caching [MICRO’06]
• It is well known that different apps respond differently to different replacement policies
• Previous work in the OS domain has described adaptive replacement with provable bounds on performance
• We adapted these techniques for on-chip caches
Idea…
Adaptive Cache Implementation
• Theoretical Guarantees
– Miss rate is provably bounded to within a factor of two of the better algorithm's miss rate
– In practice, it’s much better
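The adaptive idea can be illustrated with a small sketch. This is not the MICRO’06 implementation: it is a simplified, fully associative model that assumes two component policies (LRU and FIFO, chosen here only for brevity), tracks each one with a tag-only shadow structure, and on a miss evicts a block that the currently better policy would not have kept. The class names `ShadowPolicy` and `AdaptiveCache` are hypothetical.

```python
from collections import OrderedDict


class ShadowPolicy:
    """Tag-only model of one component policy (tags and a miss count, no data)."""

    def __init__(self, capacity, refresh_on_hit):
        self.capacity = capacity
        self.refresh_on_hit = refresh_on_hit  # True -> LRU, False -> FIFO
        self.tags = OrderedDict()             # oldest entry first
        self.misses = 0

    def access(self, tag):
        if tag in self.tags:
            if self.refresh_on_hit:
                self.tags.move_to_end(tag)    # LRU: a hit makes the block youngest
            return
        self.misses += 1
        if len(self.tags) >= self.capacity:
            self.tags.popitem(last=False)     # evict the oldest entry
        self.tags[tag] = True


class AdaptiveCache:
    """Illustrative cache that imitates whichever shadow policy has fewer misses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = []                      # tags currently cached, in fill order
        self.lru = ShadowPolicy(capacity, refresh_on_hit=True)
        self.fifo = ShadowPolicy(capacity, refresh_on_hit=False)
        self.misses = 0

    def access(self, tag):
        hit = tag in self.blocks
        # Both shadow policies observe every access.
        self.lru.access(tag)
        self.fifo.access(tag)
        if hit:
            return True
        self.misses += 1
        if len(self.blocks) >= self.capacity:
            better = min((self.lru, self.fifo), key=lambda p: p.misses)
            # better.tags now holds `tag` plus at most capacity-1 other tags,
            # so at least one of our `capacity` blocks is absent from it.
            victim = next(b for b in self.blocks if b not in better.tags)
            self.blocks.remove(victim)
        self.blocks.append(tag)
        return False
```

Calling `AdaptiveCache(2).access(t)` for each tag `t` in a trace returns `True` on a hit and `False` on a miss; comparing `misses` on the cache against `lru.misses` and `fifo.misses` shows how closely the sketch tracks the better component policy on that trace.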
Current Research
• Working on multi-core generalizations of
adaptive caching and other ways to manage
shared resources
• Uniprocessor microarchitecture
– Scalable memory scheduling [MICRO’06]
– Memory dependence prediction [HPCA’06]
– Branch prediction […]
– And more…
Longer-Term Processor Scaling
• Limitations/Obstacles
– Wire scaling
  • Latency/performance
  • Power
– Feature size
  • Lithography, parametric variations
– Off-chip communication
3D Integration
[Figure: cross-section of a two-die stack, labeled in order: Active Layer 1, Metal Layers 1, Die-to-Die Vias, Metal Layers 2, Active Layer 2]
Die/Wafer Stacking
• Wire
– Power/perf.
• Off-chip
• Feature size
– Limitations, variations
Less RC → faster, lower-power
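A rough worked estimate shows why shorter wires help. It assumes an unrepeated wire modeled as a distributed RC line with resistance r and capacitance c per unit length, and a supply voltage V_dd; these symbols are illustrative and do not come from the slides.

```latex
\begin{align*}
t_{\text{wire}}(L) &\approx 0.38\, r\, c\, L^{2}, \\
t_{\text{wire}}(L/2) &\approx 0.38\, r\, c\, (L/2)^{2} = \tfrac{1}{4}\, t_{\text{wire}}(L), \\
E_{\text{switch}}(L) &\propto c\, L\, V_{dd}^{2}
  \quad\Longrightarrow\quad
  E_{\text{switch}}(L/2) = \tfrac{1}{2}\, E_{\text{switch}}(L).
\end{align*}
```

Under this model, halving a wire's length cuts its delay to about a quarter and its switching energy to about half. Repeated (buffered) wires scale closer to linearly in delay, so real gains depend on the design.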
Example: Caches
[Figure panels: Simplified 2D SRAM Array; 3D Wordline Stacking (wordline length halved); 3D Bitline Stacking (bitline length halved)]
• In our studies, WL was critical for latency
• BL reduction has greater impact on power savings
• Split decoder → no activity stacking
We’ve studied a wide variety of other CPU building blocks
Uarch-level 3D design
Example: 4-die significance-partitioned datapath
• Smaller footprint → faster and lower-power
• Width-based gating → even lower power, close to original power density
• Use uarch prediction mechanism for early determination of width
• Overall: 47% performance gain at only 2 degree temperature increase
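Width-based gating on a significance-partitioned datapath can be sketched as follows. The sketch assumes a 64-bit datapath split into four 16-bit slices, one per die, with upper slices gated off when they hold only sign extension. The slice width, the last-value predictor, and all names (`slices_needed`, `WidthPredictor`) are assumptions for illustration, not details taken from the slides.

```python
SLICE_BITS = 16   # assumed: 64-bit datapath, one 16-bit slice per die
NUM_SLICES = 4


def slices_needed(value: int) -> int:
    """Number of low-order slices needed to hold a signed value.

    Slices above this count contain only sign-extension bits and could be gated.
    """
    for n in range(1, NUM_SLICES + 1):
        bits = n * SLICE_BITS
        lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
        if lo <= value <= hi:
            return n
    raise ValueError("value does not fit in the 64-bit datapath")


class WidthPredictor:
    """Hypothetical last-value width predictor indexed by instruction address."""

    def __init__(self):
        self.table = {}

    def predict(self, pc: int) -> int:
        # With no history, assume full width (always safe, never saves power).
        return self.table.get(pc, NUM_SLICES)

    def update(self, pc: int, result: int) -> None:
        # Remember how many slices the last result at this address needed.
        self.table[pc] = slices_needed(result)
```

A predictor like this would let the pipeline decide early which dies to activate; a real design would also need a recovery path for cases where the predicted width turns out to be too narrow.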
3D Research Summary
• Circuit-level [ICCD’05, ISVLSI’06, ISCAS’06, GLSVLSI’06]
• Uarch-level [MICRO’06 (w/ …), HPCA’07]
• Tutorial papers [JETC’06]
• Tutorial [MICRO’06]
• Tools [DATE’06, TCAD’07] w/ GTCAD & …
• Parametric Variations w/ Jim Meindl
• Funding, equip from …
Summary
• loh@cc
• http://www.cc.gatech.edu/~loh
• Lots of exciting work going on here