semitherm05_luncheon.. - University of Virginia
Designing Cool Chips in an Era of
Gigascale Integration:
History, Challenges, and
Opportunities
© 2005, Kevin Skadron
Kevin Skadron
LAVA/HotSpot Lab
Dept. of Computer Science
University of Virginia
Charlottesville, VA
“Cooking-Aware” Computing?
ITRS Projections
(ITRS 2004)

Year                                  2003   2006   2010   2013   2016
Tech node (nm)                         100     70     45     32     22
Vdd (high perf) (V)                    1.2    1.1    1.0    0.9    0.8
Vdd (low power) (V)                    1.0    0.9    0.7    0.6    0.5
Frequency (high perf) (GHz)            3.0    6.8   15.1   23.0   39.7
Max power (W)
  High-perf w/ heatsink                149    180    198    198    198
  Cost-performance                      80     98    120    138    158
  Hand-held                            2.1    2.4    2.8    3.0    3.0

Slide callouts: “2001 – was 0.4”; “2001 – was 288”
• These are targets, doubtful that they are feasible
• Growth in power density means cooling costs
continue to grow
• High-performance designs seem to be shifting
away from clock frequency toward # cores
Power Evolution
[Figure: Max Power (Watts), log scale from 1 to 100, versus process generation from 1.5µ down to 0.13µ, for i386, i486, Pentium®, Pentium® w/MMX tech., Pentium® Pro, Pentium® II, Pentium® III, and Pentium® 4; the trend beyond the last point is marked “?”.]
Source: Intel
Zero-Sum Architecture!
Leakage – A Growing Problem
• The fraction of leakage power is increasing exponentially with
each generation
• Also exponentially dependent on temperature
• Curiously, ITRS 2004 projections are lower than what industry is
currently reporting
• Changes tradeoffs! Idle logic hurts, e.g. CMPs
[Figure: static power / dynamic power (percentage, 0 to 70) versus temperature (298 K to 373 K) for the 180nm, 130nm, 100nm, 90nm, 80nm, and 70nm nodes; the ratio increases across generations. Data derived from ITRS 2001.]
Thermal Packaging is Expensive
• Nvidia GeForce 5900 card – “dustbuster”
Source: Tech-Report.com
Architecture Trends
• High-performance market
  • “Fat” (wide, superscalar) CPUs and high frequencies giving way to multiple cores, plateau in frequencies
    – Huge number of multi-core product announcements
    – # cores might be the next marketing buzz
  • Multiple threads per core
    – This probably won’t scale – limit of 2-4 thread contexts
  • Interesting example: Sun Niagara
    – 8 4-threaded cores
• Across all market segments
  • Growing integration (SoC)
  • Specialized co-processors and offload engines
  • Growing heterogeneity
    • Part of the programming model in SoCs
    • Not part of the programming model in CMPs!
Basketball Analogy
• Recent trends in high-performance processors are like building a team around Shaq when you have a limited budget
  • Huge salary (power) to one player
  • Huge ego, team friction (heat)
  • Shaq can’t get much better (except possibly his free throws) (diminishing returns)
• New trend: multiple CPUs on a chip (CMP/SoC)
  • Don’t need superstars (less power per core, better energy efficiency)
  • Choose team players (better heat distribution)
  • Performance scales linearly with cores
  • Heterogeneous cores possible (SoCs)
  • Detroit Pistons
Talk Outline
• Different philosophies of Power-Aware design
  • Energy efficient vs. low power vs. temperature-aware
• Power Management Techniques
  • Dynamic
  • Static
  • Temperature
• Summary of Important Challenges
• My perspective tends to be architecture-centric, and slanted toward high-performance desktop/server/etc. CPUs
Metrics
• Power (Design for power delivery)
  • Average power, instantaneous power, peak power
• Energy
  • Energy (MIPS/W): Low-Power Design
  • Energy-Delay product (MIPS^2/W): Power-Aware/Energy-Efficient Design
  • Energy-Delay^2 product (MIPS^3/W) – voltage independent! (Zyuban, GVLSI’02)
• Temperature (Temperature-Aware Design)
  • Correlated with power density over sufficiently large time periods
  • No good figures of merit for trading off thermal efficiency against performance, area, or energy efficiency
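A rough sketch of why the energy-delay^2 product is called voltage independent. It assumes the first-order scaling rules E ∝ V^2 and delay ∝ 1/V (frequency tracking supply voltage linearly); those rules, the normalization, and the function name `metrics` are illustrative assumptions, not taken from the slide.

```python
def metrics(v, e0=1.0, d0=1.0):
    """First-order model (assumed): energy per op ~ V^2, delay per op ~ 1/V."""
    energy = e0 * v ** 2
    delay = d0 / v
    return {
        "E": energy,                 # drops quadratically with voltage
        "ED": energy * delay,        # still depends on voltage
        "ED2": energy * delay ** 2,  # voltage cancels under this model
    }


for v in (1.0, 0.8, 0.6):
    m = metrics(v)
    print(f"V={v:.1f}  E={m['E']:.3f}  ED={m['ED']:.3f}  ED2={m['ED2']:.3f}")
```

Under these assumptions only the last column stays constant as the voltage changes, which is what makes it useful for comparing designs that run at different voltages.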
Circuit Techniques
• Transistor sizing
• Signal and clock gating
• Dynamic vs. static logic
• Circuit restructuring
• Low power caches, register files, queues
• These typically reduce the capacitance being switched
Clock Gating, Signal Gating
“Disabling a functional block when it is not required for an extended
period”
• Implementation
• Simple gate that replaces
one buffer in the clock tree
• Signal gating is similar, helps
avoid glitches
• Delay is generally not a concern
except at fine granularities
[Figure: a ctrl signal gating the signal into a functional unit]
• Choice of circuit design and
clock gating style can have
a dramatic effect on temperature
distribution
Circuit Restructuring
• Pipeline (tolerate smaller, longer-latency circuitry)
• Parallelize (can reduce frequency)
• Reorder inputs so that most active input is closest
to output (reduces switched capacitance)
• Restructure gates (equivalent functions are not
equivalent in switched capacitance)
Example: Parallelizing (maintain throughput)

                One Logic Block at Vdd    Two Logic Blocks at Vdd/2
Freq            1                         0.5
Vdd             1                         0.5
Throughput      1                         1
Power           1                         0.25
Area            1                         2
Pwr Den         1                         0.125

Source: Shekhar Borkar, keynote presentation, MICRO-37, 2004
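The numbers in the parallelizing example follow from dynamic power scaling as C·Vdd^2·f per block. A minimal sketch that reproduces them, with all quantities normalized to the single-block design; the function name `scaled_design` and the unit capacitance are assumptions for illustration.

```python
def scaled_design(n_blocks, vdd, freq, c=1.0):
    """Normalized dynamic power ~ C * Vdd^2 * f per block (leakage ignored)."""
    power = n_blocks * c * vdd ** 2 * freq
    area = float(n_blocks)
    return {
        "throughput": n_blocks * freq,
        "power": power,
        "area": area,
        "power_density": power / area,
    }


baseline = scaled_design(1, vdd=1.0, freq=1.0)
parallel = scaled_design(2, vdd=0.5, freq=0.5)
print("baseline:", baseline)
print("parallel:", parallel)
```

The sketch ignores leakage, which grows with the doubled area, so it shows only the dynamic-power side of the tradeoff.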
Architectural-Level Techniques
Side labels on slide: “Prevalent” (upper items), “Growing or Imminent” (lower items)
• Sleep modes
• Pipeline depth
• Energy-efficient front end
  • Branch prediction accuracy is a major determinant of pipeline activity -> spending more power in the branch predictor can be worthwhile if it improves accuracy
• Integration (e.g. multiple cores)
• Multi-threading
• Dynamic voltage/frequency scaling
• Multi clock domain architectures (similar to GALS)
• Power islands
• Encoding/compression
  • Can reduce both switched capacitance and cross talk
• Application specific hardware
  • Co-processors, functional units, etc.
• Compiler techniques
Optimal Pipeline Depth
• Increased power and diminishing returns vs. increased throughput
• 5-10 stages, 15-30 FO4
• Srinivasan et al, MICRO-35; Hartstein and Puzak, ACM TACO, Dec. 2004
[Figure: curves for 4-wide issue and single issue versus pipeline stages. Source: Hartstein and Puzak, ACM TACO, Dec. 2004]
Multi-threading
• Do more useful work per unit time
  • Amortize overhead and leakage
• Switch-on-event MT
  • Switch on cache misses, etc. (Ex: Sun Niagara “throughput computing”)
  • Can even rotate among threads every instruction (Tera/Cray)
• Simultaneous Multithreading/HyperThreading
  • For superscalar – eliminate waste
  • Intel Pentium 4, IBM POWER5, Alpha 21464
Architectural-Level Techniques
Side labels on slide: “Prevalent” (upper items), “Growing or Imminent” (lower items)
• Sleep modes
• Pipeline depth
• Energy-efficient front end
  • Branch prediction accuracy is a major determinant of pipeline activity -> spending more power in the branch predictor can be worthwhile if it improves accuracy
• Integration (e.g. multiple cores)
• Multi-threading
• Dynamic voltage/frequency scaling
  • Limits
• Multi clock domain architectures (similar to GALS)
• Power islands
• Encoding/compression
  • Can reduce both switched capacitance and cross talk
• Application specific hardware
  • Co-processors, functional units, etc.
• Compiler techniques
Compiler Techniques for Low Power
• Basic idea is for the compiler to identify
opportunities for using low-power modes
• Compiler-guided DVS
• Reduce voltage in memory-bound program regions
– Hsu and Kremer, ISLPED’01, PLDI’03; Xie et al, PLDI’03
• Dynamic resource configuration/hibernation
• Deactivate modules when they won’t be used for a long time
(>> sleep/wakeup time)…avoids waiting for timeout
– Heath et al, PACT’02
• Profile/compiler-guided adaptation
• Subroutine-guided (“positional”) adaptation (Huang et al,
ISCA’03)
– Uses profiling and a hierarchy of low-power modes
• Much work in this area – this only touches the surface
Static Power Dissipation
• Static power: dissipation due to leakage
current
• Exponentially dependent on T, Vdd, Vth
• Most important sources of static power:
subthreshold leakage and gate leakage
• We will focus on subthreshold
• Gate leakage has essentially been ignored
– New gate insulation materials may solve problem
Thermal Runaway
• The leakage-temperature feedback can lead to a positive feedback loop
  • Temperature increases → leakage increases → temperature increases → leakage increases → …
[Image. Source: www.usswisconsin.org]
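A toy illustration of the feedback loop, not a calibrated model. It assumes a single lumped thermal resistance to ambient and leakage power that grows exponentially with temperature; every parameter value and the function name `settle` are hypothetical. The loop iterates T → T_amb + R_th · (P_dyn + P_leak(T)) and reports whether the temperature settles or keeps climbing.

```python
import math


def settle(r_th, p_dyn=50.0, p_leak_ref=10.0, k=0.02,
           t_amb=318.0, t_ref=318.0, t_limit=500.0, max_iter=1000):
    """Return (temperature_K, stable) for a lumped model with assumed values.

    r_th        thermal resistance to ambient (K/W)
    p_dyn       dynamic power (W), taken as temperature independent
    p_leak_ref  leakage power (W) at t_ref, growing as exp(k * (T - t_ref))
    """
    t = t_amb
    for _ in range(max_iter):
        p_leak = p_leak_ref * math.exp(k * (t - t_ref))
        t_next = t_amb + r_th * (p_dyn + p_leak)
        if t_next > t_limit:          # temperature keeps climbing: runaway
            return t_next, False
        if abs(t_next - t) < 1e-6:    # reached a stable operating point
            return t_next, True
        t = t_next
    return t, False


for r_th in (0.3, 1.5):
    temp, stable = settle(r_th)
    print(f"R_th={r_th} K/W -> stable={stable}, T={temp:.1f} K")
```

With the better package (lower thermal resistance) the loop converges; with the worse one no stable operating point exists under these assumed numbers, which is the runaway case.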
A Smorgasbord
• Transistor sizing
• Multi Vth
• Dynamic threshold voltage – reverse body bias –
Transmeta Efficeon
• Transmeta uses runtime compilation and load monitoring to
select thresholds
• Stack effect
• Sleep transistors
• DVS
• Coarse or fine grained
• Low leakage caches, register files, queues
• Techniques for reducing gate leakage
• Hurry up and wait
• Low leakage: maintain min possible V, f
• High leakage: use high V/f to finish work quickly, then go to
sleep
Sleep Transistors
• Recent work suggests that a properly sized, low-Vth
footer transistor can preserve enough leakage to
keep the cell active (Li et al, PACT’02; Agarwal et al,
DAC’02)
• Great care must be taken when
switching back to full voltage:
noise can flip bits
• Extra latency may be necessary
when re-activating
[Figure: logic block with a sleep transistor]
• Similar to principles in sub-threshold computing
  • Ex – sensor motes for wireless sensor networks
• Concerns about susceptibility to SEU
Worst-Case leads to Over-design
• Average case temperature lower than worst-case
• Aggressive clock gating
• Application variations
• Underutilized resources, e.g. FP units during integer code
• Currently 20-40% difference
[Figure labels: Reduced target power density; TDP; Reduced cooling cost]
Source: Gunther et al, ITJ 2001
Temporal, Spatial Variations
[Figure: temperature variation of SPEC applu over time]
Localized hot spots
dictate cooling solution
Temperature-Aware Design
• Worst-case design is wasteful
• Power management is not sufficient for chip-level thermal management
  • Must target blocks with high power density
  • When they are hot
  • Spreading heat helps
    – Even if energy not affected
    – Even if average temperature goes up
  • This also helps reduce leakage
Role of Architecture?
Dynamic thermal management (DTM)
• Automatic hardware response when temp. exceeds cooling
• Cut power density at runtime, on demand
• Trade reduced costs for occasional performance loss
• Architecture is a natural granularity for thermal
management
• Activity, temperature correlated within arch. units
• DTM response can target hottest unit: permits fine-tuned
response compared to OS or package
• Modern architectures offer rich opportunities for remapping
computation
– e.g., CMPs/SoCs, graphics processors, tiled architectures
– e.g., register file
• Thermal engineering must consider role of
architecture
• Thermal engineers and architects need to
collaborate
Existing DTM Implementations
• Intel Pentium 4: Global clock gating with shut-down fail-safe
• Intel Pentium M: Dynamic voltage scaling
• Transmeta Crusoe: Dynamic voltage scaling
• IBM POWER5: Probably fetch gating
• ACPI: OS configurable combination of passive & active cooling
• These solutions sacrifice time (slower or stalled execution) to reduce power density
• Better: a solution in “space”
  • Tradeoff between exacerbating leakage (more idle logic) or reducing leakage (lower temperatures)
Alternative: Migrating Computation
This is only a
simplistic
illustrative example
Space vs. Time
• Moving the hotspot, rather than throttling it, reduces performance overhead by almost 60%
[Figure: bar chart of slowdown factor (axis 1.00 to 1.40). Bar labels: DVS, FG, Hyb, MC. Bar values as listed: 1.359, 1.270, 1.231, 1.112. Legend: Space, Time.]
The greater the replication and spread,
the greater the opportunities
Sources of Variations
[Figure panels:
 Random Dopant Fluctuations: mean number of dopant atoms versus technology node (nm).
 Sub-wavelength Lithography: lithography wavelength (365nm, 248nm, 193nm, 13nm EUV) versus generation (180nm to 32nm), 1980 to 2020, with a gap between them.
 Hot spots: heat flux (W/cm2) and temperature variation (°C); results in Vcc variation.]
Source: Mark Bohr, Intel
Source: Shekhar Borkar, keynote presentation, MICRO-37, 2004
Impact of Static Variations
[Figure: normalized frequency (0.9 to 1.4) versus normalized leakage (Isb, 1 to 5) at 130nm. Annotations: frequency ~30%; leakage power ~5-10X.]
Source: Shekhar Borkar, keynote presentation, MICRO-37, 2004
Parameter Variations
• Parameter variations mess everything up!
  • T → variation in Vcc, leakage → T
  • Vcc → speed variation, leakage → T
  • Manufacturing (L, W, Vth, etc) → speed, Vcc, T
  • Packaging variations (TIM, roughness) → T
• Some transistors/functional units won’t work, some will be lousy, some will fail over time, and some will be intermittent
• Guard banding won’t work
  • Design devolves to worst component, can’t easily bound intermittent behavior
  • T/P problems may no longer be limited to specific units
  • Makes dynamic logic even more difficult
Future Architectures
• Asymmetry unavoidable
  • Specialized units (part of programming model)
  • Power management (can try to hide this)
  • Thermal throttling (hard to hide this)
  • Parameter variations (hard to hide this without extreme performance loss)
Raw Architecture (MIT)
Compute
Processor
Routers
On-chip networks
Only one of many examples of tiled
architectures
Source: MIT RAW project
Future Architectures
• Increasing integration, e.g. increasing # cores, e.g. Niagara
• Clustered architectures
• Tiled architectures
• Multiple voltage islands
• Asymmetry unavoidable
  • Specialized units (part of programming model)
  • Power management (can try to hide this)
  • Thermal throttling (hard to hide this)
  • Parameter variations (hard to hide this without extreme performance loss)
• Increasing problems with yield, failures in time (Redundancy: costly; graceful degradation: introduces asymmetry)
Power and Thermal Security
• A consequence of designing for expected rather than worst-case conditions
  • Energy-drain attacks
  • Voltage stability attacks (dI/dt)
  • Thermal attacks
    • Thermal throttling
    • Denial of service
    • Direct physical damage
Summary
• Reviewed current techniques for managing dynamic power, leakage power, temperature
  • A major obstacle with architectural techniques is the difficulty of predicting performance impact
• Continuing integration makes power an ever-present concern
• Thermal limits and parameter variations are becoming serious obstacles
  • Spread heat in space, not time
• Security challenges
Soap-Box
• Architecture solutions are essential
• Thermal engineers, circuit designers, CAD designers, and architects all need to work together
• Joint infrastructure
  • Simulators – esp. pre-RTL tools
  • Test chips
    – Ex: Combine architecture and circuit research on a single test chip
More Info
http://www.cs.virginia.edu/~skadron
LAVA Lab
Backup Slides
Hot Chips are No Longer Cool!
[Figure: power density (Watts/cm2, log scale from 1 to 1000) versus process generation from 1.5µ down to 0.07µ, for i386, i486, Pentium®, Pentium® Pro, Pentium® II, Pentium® III, and Pentium® 4, with reference levels for a hot plate, a nuclear reactor, and a rocket nozzle; also labeled “SIA” and “Today’s laptops”.]
* “New Microarchitecture Challenges in the Coming Generations of CMOS Process
Technologies” – Fred Pollack, Intel Corp. Micro32 conference key note - 1999.
ITRS quotes – thermal challenges
• For small dies with high pad count, high power density, or high frequency, “operating temperature, etc for these devices exceed the capabilities of current assembly and packaging technology.”
• “Thermal envelopes imposed by affordable packaging discourage very deep pipelining.”
  • Intel recently canceled its NetBurst microarchitecture
    – Press reports suggest thermal envelopes were a factor
Thermal Packaging is Expensive
• P4 packaging
Source: Intel web site
Thermal Packaging is Expensive
• Laptops and other constrained form factors
Trends in Battery Technology
• Battery lifetime is increasing perhaps 8-10%/yr.
(Powers, Proc. of IEEE 1995)
• Not keeping up with rate of growth in energy
consumption
Source: Rabaey 1995, cited in Irwin et al, “Low Power Design Methodologies, Hardware and Software Issues”,
tutorial at PACT 2000
Dynamic Power Consumption
• Power dissipated due to switching activity
• A capacitance is charged and discharged
  $E_c = \frac{1}{2} C_L V^2$
  $E_d = \frac{1}{2} C_L V^2$
[Figure: load capacitance $C_L$ charged from Vdd and discharged]
Charge/discharge at the frequency f
  $P = \alpha C_L V^2 f$
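A small numeric sketch of the switching-power relation, reading the leading factor as the activity factor (the fraction of the load capacitance switched per cycle). The function names and the example values are assumptions chosen only to make the arithmetic concrete.

```python
def switching_energy(c_load, vdd):
    """Energy drawn to charge (or dissipated discharging) C_L: 1/2 * C_L * V^2."""
    return 0.5 * c_load * vdd ** 2


def dynamic_power(activity, c_load, vdd, freq):
    """Dynamic power: activity * C_L * V^2 * f."""
    return activity * c_load * vdd ** 2 * freq


# Hypothetical operating point: 1 nF effective capacitance, 1.2 V, 3 GHz, 10% activity.
p_nominal = dynamic_power(activity=0.1, c_load=1e-9, vdd=1.2, freq=3e9)
p_low_v = dynamic_power(activity=0.1, c_load=1e-9, vdd=0.6, freq=3e9)
print(f"nominal: {p_nominal:.3f} W, at half voltage: {p_low_v:.3f} W")
```

Halving the voltage at a fixed frequency cuts dynamic power to a quarter, which is why voltage is the most effective of the four terms to reduce.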
Transistor Sizing
• Transistor sizing plays an important role in reducing power
[Figure: chain of stages with capacitances $C_0$, $C_1$, …, $C_{N-1}$, $C_N$ and stage ratio $K = C_i / C_{i-1}$]
• Delay ~ a (k / ln K)
• Power ~ K / (K-1)
• Optimum K for both power and delay must be pursued
Signal Gating
“techniques to mask unwanted switching activities from propagating
forward, causing unnecessary power dissipation”
• Implementation
• Simple gate
• Tristate buffer
• ...
[Figure: a ctrl signal gating a signal to produce Output]
• Control signal needed
  • Generation requires additional logic
• Especially helps to prevent power dissipation due to glitches
Cache Design
Banked organization
Dividing word line
Same effect for wordlines
Reducing voltage swings
Targets switched capacitance
Caccess = R C Ccell / B
Sense amplifiers used to detect Vdiff across bitlines
Read operation can complete as soon as Vdiff is detected
Limiting voltage swing saves a fraction of power
Pulse word lines
Enabling the word line for the time needed to discharge bitcell
voltage
Designer needs to estimate access time and implement a pulse
generator
Low Power Register File Design
RF’s usually single-ended bitlines
Modified storage cell
Lot of zeros fetched from the RF
Bitline connections are modified to eliminate bitline discharge
when reading a zero
Tseng and Asanovic, ICSD, 2000
Zyuban and Kogge, ISLPED 1998
Efficient Issue Queue
Useful comparison
Empty entries and ready entries consume energy
• Wakeup of empty entries can be disabled
Gating off precharge logic using valid bit
• Wakeup of ready sources can be disabled
Gating off precharge logic using ready bit
Folegnani and Gonzalez, ISCA 2001
Energy-efficient Comparators
Traditional comparators dissipate energy on a mismatch in any
bit position.
10%-20% of source operands match each cycle
Solution: comparators that dissipate energy on a match
Kucuk et al, ISLPED 2001
Multi Clock Domain Architecture
Domains must be carefully chosen
Small cost on communications
Re-using existing structures for cross-domain
synchronization
Example
5 domains
• Front-end
• Integer unit
• FP unit
• On-chip cache unit
• Main memory
Multi Clock Domain Architecture
[Figure: block diagram of a multiple clock domain CPU. Front-end: fetch, L1 i-cache, IFQ, branch predict, dispatch, rename. Integer: IIQ, int. register file, int. FUs. Floating Point: FIQ, fp. register file, fp. FUs. Memory: LSQ, L1 d-cache, L2 unified cache. Main Memory.]
Magklis et al, ISCA 2003
Multi Clock Domain Architecture
Advantages
Local clock design is not aware of global skew
Each domain limited by its local critical path, allowing
higher frequencies
Different voltage regulators allow for a finer-grain
energy control
Frequency/voltage of each domain can be tailored to
its dynamic requirements
Clock Power is reduced
Drawbacks
Complexity and penalty of synchronizers
Feasibility of multiple voltage regulators
Sleep Modes
ACPI: Advanced Configuration and Power Interface
Developed by Microsoft, HP, Toshiba, Phoenix and Intel
Replaces APM and PnP BIOS
Establishes interfaces for OS-directed power management
Defines various power states, e.g. Cx, Sx… with various
power-performance tradeoffs; OS can choose
Dynamic Voltage/Frequency
Scaling
• Allow the device to dynamically adapt the voltage (and the frequency)
  • $P \sim V_{dd}^2$
  • $F \sim (V_{dd} - V_{th})^k / V_{dd}$
• Tradeoff between power reductions and delay increase
• But this is a very powerful paradigm
  – Approx. quadratic or cubic reduction in power (power density) relative to frequency reduction
  – Most other techniques are linear with respect to perf. loss
  – DVS switching overhead must be taken into account (PLL, etc.)
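A sketch of the voltage/frequency tradeoff, assuming the alpha-power-law form in which attainable frequency scales as (Vdd − Vth)^k / Vdd and dynamic power as Vdd^2 · f. The threshold voltage, exponent, nominal voltage, and function names are hypothetical values chosen for illustration.

```python
def rel_frequency(vdd, vth=0.3, k=1.3, vdd_nom=1.2):
    """Attainable frequency relative to the nominal voltage (assumed model)."""
    def f(v):
        return (v - vth) ** k / v
    return f(vdd) / f(vdd_nom)


def rel_power(vdd, vth=0.3, k=1.3, vdd_nom=1.2):
    """Dynamic power relative to nominal: (Vdd / Vdd_nom)^2 * relative frequency."""
    return (vdd / vdd_nom) ** 2 * rel_frequency(vdd, vth, k, vdd_nom)


for vdd in (1.2, 1.05, 0.9):
    print(f"Vdd={vdd:.2f} V  freq x{rel_frequency(vdd):.2f}  power x{rel_power(vdd):.2f}")
```

Under this model the power falls considerably faster than the frequency as the voltage drops, which is the better-than-linear advantage the slide refers to. Switching overhead is not modeled.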
DVS “Critical Power Slope”
• It may be more efficient not to use DVS, and
to run at the highest possible frequency, then
go into a sleep mode!
• Depends on power dissipation in sleep mode vs.
power dissipation at lowest voltage
• This has been formalized as the critical power slope (Miyoshi et al, ICS’02):
  • $m_{critical} = (P_{f_{min}} - P_{idle}) / f_{min}$
  • If the actual slope $m = (P_f - P_{f_{min}}) / (f - f_{min}) < m_{critical}$, then it is more energy efficient to run at the highest frequency, then go to sleep
  • Switching overheads must be taken into account
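The critical power slope test can be sketched directly from the two formulas, alongside a brute-force energy comparison for a fixed amount of work in a fixed time window. Function names and the example operating points are assumptions; switching overheads are ignored here.

```python
def critical_slope(p_fmin, p_idle, f_min):
    """m_critical = (P_fmin - P_idle) / f_min."""
    return (p_fmin - p_idle) / f_min


def actual_slope(p_f, p_fmin, f, f_min):
    """m = (P_f - P_fmin) / (f - f_min)."""
    return (p_f - p_fmin) / (f - f_min)


def prefer_race_to_sleep(p_f, f, p_fmin, f_min, p_idle):
    """True when running fast and then sleeping uses less energy than running at f_min."""
    return actual_slope(p_f, p_fmin, f, f_min) < critical_slope(p_fmin, p_idle, f_min)


def energy(work_cycles, window_s, p_active, f, p_idle):
    """Energy over the window: active while working, idle for the remainder."""
    busy = work_cycles / f
    return p_active * busy + p_idle * (window_s - busy)


# Hypothetical chip: 1.0 W at 100 MHz, 1.5 W at 200 MHz, 1e8 cycles of work per second.
for p_idle in (0.1, 0.8):
    race = prefer_race_to_sleep(p_f=1.5, f=2e8, p_fmin=1.0, f_min=1e8, p_idle=p_idle)
    e_fast = energy(1e8, 1.0, 1.5, 2e8, p_idle)
    e_slow = energy(1e8, 1.0, 1.0, 1e8, p_idle)
    print(f"P_idle={p_idle} W: race-to-sleep={race}, E_fast={e_fast:.2f} J, E_slow={e_slow:.2f} J")
```

The slope test and the direct energy comparison agree in both cases: a deep sleep mode favors finishing quickly, while a power-hungry idle state favors running slowly.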
Multi Clock Domain Architecture
• Multiple clock domains inside the processor
• Globally-asynchronous locally synchronous
(GALS) clock style
• Independent voltage/frequency scaling
among domains
• Synchronizers to ensure inter-domain
communication
Application-Specific Hardware
• Specialized logic is usually much lower power
• Co-processors
  • Ex: TCP/IP offload, codecs, etc.
• Functional units
  • Ex: Intel SSE, specialized arithmetic (e.g., graphics), etc.
  • Ex: Custom instructions in configurable cores (e.g., Tensilica)
• Specific example: Zoran ER4525 – cell phone
  • ARM microcontroller, no DSP!
  • Video capture & pre/post processing
  • Video codec
  • 2D/3D rendering
  • Video display
  • Security
Power Savings for Real Time
Systems
• Soft vs. hard real time
• Most work has focused on DVS scheduling
• Example: Multimedia apps must process
every frame within a time limit
• Slow down the processor to just meet deadlines
– Based on frame type (Hughes et al MICRO 2001)
– Based on queue occupancy (Lu et al, ICCD 2003)
Leakage Control
Technique          Figure labels             Leakage reduction
Body Bias          Vbp (+Ve), Vbn (-Ve)      2-10X
Stack Effect       Equal Loading             5-10X
Sleep Transistor   Vdd, Logic Block          2-1000X

Source: Shekhar Borkar, keynote presentation, MICRO-37, 2004
Low-Leakage Caches
• Gated-Vdd/Vss (Powell et al, ISLPED’00; Kaxiras et al, ISCA-28)
• Uses sleep transistor on Vdd/ground for each cache line
• Typically considered non-state-preserving, but recent work (Agarwal
et al, DAC’02) suggests that gated-Vss may preserve state
• Many algorithms for determining when to gate
• Simplest (Kaxiras et al, ISCA-28): Two-bit access counter and decay
interval
• Adaptive decay intervals - hard
• Drowsy cache (Flautner et al, ISCA-29)
• Uses dual supply voltages: normal Vdd and a low Vdd close to the
threshold voltage
• State preserving, but requires an extra cycle to wake up – two extra
cycles if tags are decayed
• State preservation using leakage currents (Li et al, PACT’02;
Agarwal et al, DAC’02)
• Similar to gated-Vss but designed to keep supply voltage high
enough to preserve state (100-120 mV)
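A minimal sketch of the simplest gating policy mentioned above: a two-bit counter per cache line that advances on each global decay-interval tick and is reset by any access. The class name, the exact point at which the line is gated, and the miss handling are illustrative assumptions, not a description of the cited design.

```python
class DecayLine:
    """One cache line with a two-bit decay counter (illustrative only)."""

    COUNTER_MAX = 3  # saturation value of a two-bit counter

    def __init__(self):
        self.counter = 0
        self.powered = True

    def access(self):
        """Touch the line. Returns True if it had been gated off (state lost)."""
        was_off = not self.powered
        self.counter = 0
        self.powered = True
        return was_off

    def tick(self):
        """Called once per global decay interval."""
        if not self.powered:
            return
        if self.counter == self.COUNTER_MAX:
            self.powered = False  # idle long enough: cut supply to stop leakage
        else:
            self.counter += 1


line = DecayLine()
for _ in range(4):
    line.tick()
print("powered after 4 idle intervals:", line.powered)
print("next access finds line gated off:", line.access())
```

A longer decay interval loses fewer live lines but leaves dead lines leaking for longer, which is why choosing or adapting the interval is the hard part.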
DVS
•
Chip or “island” granularity
• Leakage depends exponentially on Vdd
•
Fine granularities
• Requires routing multiple Vdd’s, voltage
steppers
•
Dynamic switching
• Instead of Vth or sleep transistor, can use low
voltage to put a logic block to sleep
• Reliability questions
– Low voltage reduces Qcrit
Gate Leakage
• Not clear if new oxide materials will arrive in time
• Any technique that reduces Vdd helps
• Otherwise it seems difficult to develop architecture
techniques that directly attack gate leakage
• In fact, very little work has been done in this area
• One example: domino gates (Hamzaoglu & Stan,
ISLPED’02)
• Replace traditional NMOS pull-down network with a PMOS
pull-up network
• Gate leakage is greater in NMOS than PMOS
• But PMOS domino gate is slower
• Note: Gate oxide so thin - especially prone to
manufacturing variations
Application Variations
• Wide variation across applications
• Architectural and technology trends are making it worse, e.g. simultaneous multithreading (SMT)
[Figure: temperature (Kelvin, 370 to 420) for gzip, mcf, swim, mgrid, applu, eon, and mesa, in two panels: ST and SMT]
Dynamic Thermal Management (DTM)
(Brooks and Martonosi, HPCA 2001)
[Figure: temperature versus time, marking “Designed for Cooling Capacity w/out DTM”, “Designed for Cooling Capacity w/ DTM”, “System Cost Savings”, “DTM Trigger Level”, and the intervals “DTM Disabled” and “DTM/Response Engaged”]
Source: David Brooks 2002
Thermal Modeling
• Want a fine-grained, dynamic model of
temperature
  • At a granularity architects can reason about
  • That accounts for adjacency and package
  • That does not require detailed designs
  • That is fast enough for practical use
• HotSpot - a compact model based on thermal R, C
  • Parameterized to automatically derive a model based on various…
    – Architectures
    – Power models
    – Floorplans
    – Thermal Packages
Our Model (Lateral)
Our Model (Lateral and Vertical)
Derived from material and geometric properties
Interface material
(not shown)
Validation
• Validated and calibrated using MICRED test chips
  • 9x9 array of power dissipators and sensors
  • Compared to HotSpot configured with same grid, package
• Within 7% for both steady-state and transient step-response
  • Interface material (chip/spreader) matters a lot
HotSpot
• Time evolution of temperature is driven by
unit activities and power dissipations
averaged over 10K cycles
• Power dissipations can come from any power
simulator, act as “current sources” in RC
circuit
• Simulation overhead in Wattch/SimpleScalar:
< 1%
• Requires models of
• Floorplan: important for adjacency
• Package: important for spreading and time
constants
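To make the RC analogy concrete, here is a toy two-block thermal circuit: block power acts as a current source, temperature as voltage, each block has a thermal capacitance and a vertical resistance to ambient, and a lateral resistance couples the neighbors. This is not HotSpot's code or interface; the function name and all R, C, and power values are hypothetical.

```python
def simulate(power_w, steps, dt=1e-4, r_vert=2.0, r_lat=4.0, c=0.01, t_amb=318.0):
    """Forward-Euler integration of a two-node thermal RC network.

    power_w  per-block power in W (the "current sources")
    r_vert   block-to-ambient resistance (K/W), standing in for the package
    r_lat    block-to-block lateral resistance (K/W), standing in for adjacency
    c        thermal capacitance per block (J/K)
    """
    temps = [t_amb, t_amb]
    for _ in range(steps):
        lateral = (temps[0] - temps[1]) / r_lat  # heat flowing from block 0 to block 1
        d0 = (power_w[0] - (temps[0] - t_amb) / r_vert - lateral) / c
        d1 = (power_w[1] - (temps[1] - t_amb) / r_vert + lateral) / c
        temps = [temps[0] + dt * d0, temps[1] + dt * d1]
    return temps


# Block 0 dissipates 10 W, its neighbor is idle; 0.5 s is many RC time constants.
hot, neighbor = simulate([10.0, 0.0], steps=5000)
print(f"hot block: {hot:.2f} K, idle neighbor: {neighbor:.2f} K")
```

Even the idle neighbor ends up above ambient because of lateral coupling, which is why the floorplan matters; the package resistance and capacitance set how far and how quickly temperatures rise.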
Hybrid DTM
• DVS is attractive because of its cubic advantage
  • $P \propto V^2 f$
  • This factor dominates when DTM must be aggressive
  • But changing DVS setting can be costly
    – Resynchronize PLL
    – Sensitive to sensor noise → spurious changes
• Fetch gating is attractive because it can use instruction level parallelism to reduce impact of DTM
  • Only effective when DTM is mild
• So use both!
Migrating Computation
• When one unit overheats, migrate its functionality to a distant, spare unit (MC)
  • Spare register file (Skadron et al. 2003)
  • Separate core (CMP) (Heo et al. 2003)
  • Microarchitectural clusters
  • etc.
• Raises many interesting issues
  • Cost-benefit tradeoff for that area
  • Use both resources (scheduling)
  • Extra power for long-distance communication
  • Floorplanning
Hybrid DTM, cont.
Combine fetch gating with DVS
• When DVS is better, use it
• Otherwise use fetch gating
• Determined by magnitude of temperature overshoot
• Crossover at FG duty cycle of 3
• FG has low overhead: helps reduce cost of sensor noise
[Figures: slowdown (1.0 to 1.4) versus duty cycle for FG, DVS, and Hyb]
Hybrid DTM, cont.
• DVS doesn’t need more than two settings for thermal control
  • Lower voltage cools chip faster
• FG by itself does need multiple duty cycles and hence requires PI control
• But in a hybrid configuration, FG does not require PI control
  • FG is only used at mild DTM settings
  • Can pick one fixed duty cycle
• This is beneficial because feedback control is vulnerable to noise
Sensors
• Almost half of DTM overhead is due to
  • Guard banding due to offset errors and lack of co-located sensors
  • Spurious sensor readings due to noise
• Need localized, fine-grained sensing
• Need new sensor designs that are cheap and can be used liberally – co-locate with hotspots
  • But these may be imprecise
  • Many sensor designs look promising
• Need new data fusion techniques to reduce imprecision, possibly combine heterogeneous sensors
Multi-clustered Microarchitecture