Thermal Management Issues (MICRO-35 Tutorial)


Overview
1. Motivation (Kevin)
2. Thermal issues (Kevin)
3. Power modeling (David)
4. Thermal management (David)
5. Optimal DTM (Lev)
6. Clustering (Antonio)
7. Power distribution (David)
8. What current chips do (Lev)
9. HotSpot (Kevin)
1
Power modeling
• Research Power Simulators
  – Wattch – Brooks and Martonosi, ISCA 2000
  – SimplePower – Vijaykrishnan et al. (Penn State), ISCA 2000
  – TEMPEST – Dhodapkar et al. (Intel/Wisconsin)
  – PowerAnalyzer – UMich/Colorado
  – AccuPower – SUNY Binghamton
• Industry Power Simulators
  – IBM PowerTimer – Brooks and Bose, PACS 2000
  – Intel ALPS – Gunther et al.
2
Power: The Basics
• Dynamic power vs. Static power
– Dynamic: “switching” power
– Static: “leakage” power
– Dynamic power dominates, but static power
increasing in importance
– Trends in each
• Static power: steady, per-cycle energy cost
• Dynamic power: capacitive and short-circuit
• Capacitive power: charging/discharging at transitions from 0→1 and 1→0
• Short-circuit power: power due to brief
short-circuit current during transitions.
• Mostly focus on capacitive, but recent
work on others
3
Capacitive Power Dissipation
[Figure: CMOS inverter with input Vin, output Vout, supply Vdd, and load capacitance CL]
Power ~ ½ CV2Af
• Capacitance (C): function of wire length, transistor size
• Supply voltage (V): has been dropping with successive fab generations
• Activity factor (A): how often, on average, do wires switch?
• Clock frequency (f): increasing…
4
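To make the relation concrete, here is a minimal Python sketch of the P ~ ½CV²Af formula above; the parameter values in the example are made-up assumptions, not numbers from the tutorial.

```python
# Dynamic (capacitive) switching power: P ~ 1/2 * C * V^2 * A * f
def dynamic_power(c_switched_farads, vdd_volts, activity_factor, freq_hz):
    """Average switching power (watts) for a lump of switched capacitance."""
    return 0.5 * c_switched_farads * vdd_volts ** 2 * activity_factor * freq_hz

# Illustrative numbers only: 100 nF of total switched capacitance,
# 1.3 V supply, 10% average activity, 3 GHz clock -> roughly 25 W.
print(dynamic_power(100e-9, 1.3, 0.10, 3e9))
```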
Short-Circuit Power Dissipation
[Figure: CMOS inverter with input VIN, output VOUT, load CL, and short-circuit current ISC]
• Short-Circuit Current caused by finite-slope input signals
• Direct Current Path between VDD and GND when both NMOS and PMOS transistors are conducting
5
Leakage Power
[Figure: CMOS inverter with input VIN, output VOUT, load CL, subthreshold current ISub, and gate current IGate]
I_Dsub ≈ k · e^(−q·V_T / (a·k_a·T))
• Subthreshold currents grow exponentially with increases in temperature, decreases in threshold voltage
6
Modeling Hierarchy and Tool Flow
[Figure: tool flow across abstraction levels. Early analytical performance models and trace/exec-driven, cycle-accurate simulation models take microarch parms/specs, performance test cases, and a set of workloads. The design then moves through an (architectural) RTL model in VHDL with sim test cases and RTL simulation, a gate-level model (if synthesized), a circuit-level (hierarchical) netlist model exercised with bit-vector test cases via circuit simulation and extraction, and a layout-level physical design model with capacitance extraction, design-rule checking, and validation. Energy models are refined and updated at each level (microarch, RTL, gate, circuit, layout), with edit/debug and edit/tune/debug loops along the way.]
7
Analysis Abstraction Levels

Abstraction Level      Analysis Capacity   Analysis Accuracy   Analysis Speed   Analysis Resources   Energy Savings
Application            Most                Worst               Fastest          Least                Most
Behavioral             |                   |                   |                |                    |
Architectural (RTL)    |                   |                   |                |                    |
Logic (Gate)           v                   v                   v                v                    v
Transistor (Circuit)   Least               Best                Slowest          Most                 Least
8
Power/Performance abstractions
• Low-level:
– Hspice
– PowerMill
• Medium-Level:
– RTL Models
• Architecture-level:
  – PennState SimplePower
  – Intel Tempest
  – Princeton Wattch
  – IBM PowerTimer
  – UMich/Colorado PowerAnalyzer
9
Low-level models: Hspice
• Extracted netlists from circuit/layout
descriptions
– Diffusion, gate, and wiring capacitance is
modeled
• Analog simulation performed
– Detailed device models used
– Large systems of equations are solved
– Can estimate dynamic and leakage power
dissipation within a few percent
– Slow, only practical for 10-100K transistors
• PowerMill (Synopsys) is similar but about
10x faster
10
Medium-level models: RTL
• Logic simulation obtains switching
events for every signal
• Structural VHDL or Verilog with zero- or unit-delay timing models
• Capacitance estimates performed
– Device Capacitance
• Gate sizing estimates performed, similar to
synthesis
– Wiring Capacitance
• Wire load estimates performed, similar to
placement and routing
• Switching event and capacitance
estimates provide dynamic power
estimates
11
Architecture level models
Power ~ ½ CV2Af
• Bottom-up Approach:
– Estimate “CV2f” via analytical models
– Tools: Wattch, PowerAnalyzer, Tempest (mixed-mode)
• Top-Down Approach
– Estimate “CV2f” via empirical measurements
– Tools: PowerTimer, AccuPower, Most Industrial
Tools
• Estimate “A” via statistics from
architectural-performance simulators
12
Analytical Models: Capacitance
• Requires modeling wire length and
estimating transistor sizes
• Related to RC Delay analysis for
speed along critical path
– But capacitance estimates require
summing up all wire lengths, rather than
only an accurate estimate of the longest
one.
13
Register File: Capacitance Analysis
[Figure: register file array. Bit decoders drive the wordlines (one per entry); pre-charge circuits and sense amps sit on the bitlines (one pair per bit of data width); cell access transistors (N1) connect each cell to the bitlines; wordlines and bitlines are replicated per port.]

Cwordline = Cdiffcap(WordlineDriver) + NumberBitlines * Cgatecap(N1) + WordlineLength * Cmetal
Cbitline = Cdiffcap(Pchg) + NumberWordlines * Cdiffcap(N1) + BitlineLength * Cmetal
14
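As a rough illustration of how such an analytical capacitance model is coded up, here is a Python sketch mirroring the Cwordline/Cbitline expressions above; all per-device capacitances, wire constants, and array dimensions are invented placeholders rather than values from the slide.

```python
# Analytical wordline/bitline capacitance for a register file array (sketch).
C_DIFF_WL_DRIVER = 2.0e-15   # wordline driver diffusion cap (F), assumed
C_GATE_N1        = 1.0e-15   # access transistor gate cap (F), assumed
C_DIFF_PCHG      = 1.5e-15   # precharge device diffusion cap (F), assumed
C_DIFF_N1        = 0.8e-15   # access transistor diffusion cap (F), assumed
C_METAL_PER_UM   = 0.2e-15   # metal wire cap per micron (F/um), assumed

def wordline_cap(num_bitlines, wordline_length_um):
    # Cwordline = Cdiffcap(driver) + #bitlines * Cgatecap(N1) + length * Cmetal
    return C_DIFF_WL_DRIVER + num_bitlines * C_GATE_N1 + wordline_length_um * C_METAL_PER_UM

def bitline_cap(num_wordlines, bitline_length_um):
    # Cbitline = Cdiffcap(Pchg) + #wordlines * Cdiffcap(N1) + length * Cmetal
    return C_DIFF_PCHG + num_wordlines * C_DIFF_N1 + bitline_length_um * C_METAL_PER_UM

# Toy geometry: 80-entry x 64-bit register file.
print(wordline_cap(num_bitlines=64, wordline_length_um=120.0))
print(bitline_cap(num_wordlines=80, bitline_length_um=200.0))
```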
Register File Model: Accuracy
Error Rates (numbers in percent)

              Gate     Diff     InterConn.   Total
Wordline(r)    1.11     0.79     15.06        8.02
Wordline(w)   -6.37     0.79    -10.68       -7.99
Bitline(r)     2.82   -10.58    -19.59      -10.91
Bitline(w)   -10.96   -10.60      7.98       -5.96
• Validated against a register file schematic
used in internal Intel design
• Compared capacitance values with
estimates from a layout-level Intel tool
• Interconnect capacitance had largest
errors
– Model neglects poly connections
– Differences in wire lengths -- difficult to tell wire
distances of schematic nodes
15
Different Circuit Design Styles
• RTL and Architectural level power
estimation requires the tool/user to perform
circuit design style assumptions
– Static vs. Dynamic logic
– Single vs. Double-ended bitlines in register
files/caches
– Sense Amp designs
– Transistor and buffer sizings
• Generic solutions are difficult because
many styles are popular
• Within individual companies, circuit design
styles may be fixed
16
Clock Gating: What, why, when?
[Figure: a gate on the clock line combines the Clock with an enable to produce a Gated Clock]
• Dynamic Power is dissipated on clock
transitions
• Gating off clock lines when they are
unneeded reduces activity factor
• But putting extra gate delays into clock
lines increases clock skew
• End results:
– Clock gating complicates design analysis but
saves power.
17
Wattch: An Overview
Wattch’s Design Goals
• Flexibility
• Planning-stage info
• Speed
• Modularity
• Reasonable accuracy
Overview of Features
• Parameterized models for different CPU units
– Can vary size or design style as needed
• Abstract signal transition models for speed
– Can select different conditional clocking and input
transition models as needed
• Based on SimpleScalar (has been ported to many
simulators)
• Modular: Can add new models for new units
studied
18
Unit Modeling
[Figure: parameterized register file power model. Inputs: bitline activity, number of active ports, number of entries, data width of entries, # read ports, # write ports. Output: power estimate.]

Modeling Capacitance
• Models depend on structure, bitwidth, design style, etc.
• E.g., may model capacitance of a register file with bitwidth & number of ports as input parameters

Modeling Activity Factor
• Use cycle-level simulator to determine number and type of accesses
  – reads, writes, how many ports
• Abstract model of bitline activity
19
One Cycle in Wattch
[Figure: pipeline stages with the units accessed (power) and the performance events modeled at each stage]
  Fetch:            I-cache, Bpred                                    (Cache hit? Bpred lookup?)
  Dispatch:         Rename Table, Inst. Window, Reg. File             (Inst. Window full?)
  Issue/Execute:    Inst. Window, Reg File, ALU, D-Cache, Load/St Q   (Dependencies satisfied? Resources?)
  Writeback/Commit: Result Bus, Reg File, Bpred                       (Commit bandwidth?)

• On each cycle:
  – determine which units are accessed
  – model execution time issues
  – model per-unit energy/power based on which units used and how many ports.
20
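The per-cycle bookkeeping described above can be sketched as follows; the unit names, per-access energies, and the two-cycle "trace" are hypothetical stand-ins for what a Wattch-style simulator tracks, not Wattch's actual code.

```python
# Wattch-style per-cycle accounting (sketch): each cycle, charge energy
# for every unit the performance simulator reports as accessed.
ENERGY_PER_ACCESS_NJ = {              # assumed per-access energies (nJ)
    "icache": 0.6, "bpred": 0.2, "rename": 0.3, "inst_window": 0.5,
    "regfile": 0.4, "alu": 0.3, "dcache": 0.7, "lsq": 0.3, "result_bus": 0.2,
}

def cycle_energy(accesses):
    """accesses: dict of unit -> number of ports used this cycle."""
    return sum(ENERGY_PER_ACCESS_NJ[unit] * n for unit, n in accesses.items())

fake_trace = [{"icache": 1, "bpred": 1, "regfile": 2, "alu": 1},
              {"dcache": 1, "lsq": 1, "result_bus": 1}]

total_nj = sum(cycle_energy(c) for c in fake_trace)
freq_hz = 3e9
print("avg power (W):", total_nj * 1e-9 * freq_hz / len(fake_trace))
```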
Units Modeled by Wattch
• Array Structures: Caches, Reg Files, Map/Bpred tables
• Content-Addressable Memories (CAMs): TLBs, Issue Queue, Reorder Buffer
• Complex combinational blocks: ALUs, Dependency Check
• Clocking network: Global Clock Drivers, Local Buffers
21
PowerTimer
• IBM Tool First Developed During Summer of 2000
  – Continued Development: 2001 => Today
  – Methodology Applied to Research and Product Power-Performance Simulators within IBM
  – Currently in Beta-Release
  – Working towards Full Academic Release
22
PowerTimer: Empirical Power
[Pie charts: power breakdown by unit as labeled on the slide — L2 23%, LSU 19%, Clock Tree 10%, ISU 10%, Other 10%, IFU 6%, FPU 5%, RAS 5%, FXU 4%, CIU 4%, ZIO 4%, IDU 3%, FBC 3%, L3 Tags 2%, GX 1%, Core Buffer 1% — with the ISU broken down further into Map Tables 43%, Issue Queues 32%, Completion Table 9%, Dispatch 6%]
Pre-silicon, POWER4-like superscalar design
23
Processor Power Density
[Bar chart: power density (0 to 0.7, relative scale) by unit: IFU, IDU, ISU, FXU, LSU, FPU, L2, L3 Tag, BHT, Icache, FXIssueQ]
Pre-silicon, POWER4-like superscalar design
Originally presented at PACS2002
24
PowerTimer
[Figure: PowerTimer tool flow. Circuit power data for macros, tech parms, and uArch parms feed a "compute sub-unit power" step (SubUnit Power = f(SF, uArch, Tech)). A program executable or trace drives an architectural performance simulator, which produces AF/SF data and CPI; combining the sub-unit powers with the activity data yields the power estimate.]
25
PowerTimer: Energy Models
• Energy models for uArch structures formed by summation of circuit-level macro data
[Figure: each sub-unit (uArch-level structure) is built from macros Macro1 … MacroN, each modeled as Power = Ci*SF + HoldPower; SF data drives the macros, and their powers sum into the sub-unit's power estimate]
26
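A minimal sketch of this macro summation, with made-up coefficients; it mirrors the Power = Ci*SF + HoldPower form on the slide but is not PowerTimer itself. Note that at SF = 0 the estimate collapses to the summed hold power, matching the clock-power intercept shown a few slides later.

```python
# PowerTimer-style sub-unit energy model (sketch): each macro's power is
# linear in switching factor SF plus a hold (clock) term; the sub-unit
# estimate is the sum over its macros.
MACROS = [   # (C_i in mW per unit SF, HoldPower in mW) -- illustrative only
    (4.0, 120.0),
    (2.5,  80.0),
    (6.0, 200.0),
]

def subunit_power(sf):
    return sum(c * sf + hold for c, hold in MACROS)

for sf in (0, 10, 25, 50):
    print(f"SF={sf:2d}  ->  {subunit_power(sf):7.1f} mW")
```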
Empirical Estimates with CPAM
• Estimate power under “Input Hold” and
“Input Switching” Modes
• Input Hold: All Macro Inputs (Except
Clocks) Held
– Can also collect data for Clock Gate Signals
• Input Switching: Apply Random Switching
Patterns with 50% Switching on Input Pins
[Figure: CPAM drives the macro inputs with either 0% switching (hold power) or 50% random switching and measures the resulting power]
27
Example Unit
• Made up of 5 macros
[Plot: power (mW, 0–800) vs. switching factor SF (0–50) for macro1 through macro5 and their total]
28
PowerTimer: Models f(SF)
Assumption: Power linearly dependent on Switching Factor
This separates Clock Power and Switching Power
[Plot: power (mW, 0–1400) vs. SF (0–50) for Unit1–Unit5; the slope of each line is its switching power and the intercept at SF = 0 is its clock power]
At 0% SF, Power = Clock Power (significant without clock gating)
29
Key Activity Data
[Plot: power (mW, 0–1400) vs. SF (0–50) for Unit1–Unit5; changes in SF move along a unit's switching-power curve, while changes in AF scale the clock-power component]
• SF => Moves along the Switching Power
Curve
– Estimated on a per-unit basis from RTL Analysis
• AF => Moves along the Clock Power Curve
– Extracted from Microarchitectural Statistics
(Turandot)
30
Microarchitectural Statistics
• Stats are very similar to tracking used in Wattch, etc
• Differences:
– Clock Gating Modes (3 modes)
– Customized Scaling Based on Circuit Style (4 styles)
• Clock Gating Modes:
  – P_constrained = P_unconstrained (not clock-gateable)
  – P_constrained_1 = AF * (Pclock + Plogic) (common)
  – P_constrained_2 = AF * Pclock + Plogic (rare)
  – P_constrained_3 = Pclock + AF * Plogic (very rare)
• Scaling Based on Circuit Styles
  – AF_1 = #valid            (Latch-and-Mux, No Stall Gating)
  – AF_2 = #valid - #stalls  (Latch-and-Mux, With Stall Gating)
  – AF_3 = #writes           (Arrays that only gate updates)
  – AF_4 = #writes + #reads  (Arrays, RAM Macros)
31
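The constrained-power formulas and AF scaling styles above can be combined as in the following sketch; the choice of per-cycle fractions for the statistics and the way a circuit style is selected are assumptions about how a simulator might apply them.

```python
# Applying the clock-gating modes and AF scaling styles above (sketch).
def activity_factor(stats, style):
    """stats holds per-cycle fractions; style picks which AF definition applies."""
    if style == "latch_no_stall_gating":    return stats["valid"]
    if style == "latch_with_stall_gating":  return stats["valid"] - stats["stalls"]
    if style == "array_gate_updates_only":  return stats["writes"]
    if style == "array_ram_macro":          return stats["writes"] + stats["reads"]
    raise ValueError(style)

def constrained_power(af, p_clock, p_logic, mode):
    if mode == 0: return p_clock + p_logic          # not clock-gateable
    if mode == 1: return af * (p_clock + p_logic)   # common
    if mode == 2: return af * p_clock + p_logic     # rare
    if mode == 3: return p_clock + af * p_logic     # very rare
    raise ValueError(mode)

stats = {"valid": 0.6, "stalls": 0.1, "writes": 0.2, "reads": 0.3}  # made up
af = activity_factor(stats, "latch_with_stall_gating")
print(constrained_power(af, p_clock=300.0, p_logic=500.0, mode=1), "mW")
```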
Clock Gating: Valid-Bit Gating
• Latch-Based Structures: Execute Pipelines, Issue
Queues
[Figure: the clock is gated to each group of latches by its valid (V) bit]
32
Clock Gating Modes
• P_constrained_1 = AF * (Pclock + Plogic)
[Figure: valid bit gates the clock; both Pclock and Plogic scale with AF]
• P_constrained_2 = AF * Pclock + Plogic
[Figure: valid bit gates the clock, but the selection logic still switches, so only Pclock scales with AF]
33
Valid-bit Gating, Stalls?
• Option 1: Stalls cannot be gated
[Figure: clk is gated by the valid bit only; data and the stall signal from the previous pipestage feed the latch, whose output is data for the next pipestage]
• Option 2: Stalls can be gated
[Figure: clk is gated by the valid bit combined with the stall signal from the previous pipestage]
34
Scaling: Array Structures
• Option 1: Reads and Writes Eligible to Gate
for Power
[Figure: array cell with separate read and write bitlines; the read wordline is enabled by read_wordline_active AND read_gate, and the write wordline by write_wordline_active AND write_gate, so both reads and writes can be gated]
35
Scaling: Array Structures
• Option 2: Only Writes Eligible to Gate for
Power
[Figure: array cell where the write wordline is enabled by write_wordline_active AND write_gate, but reads are selected directly by read_entry_0 … read_entry_n and cannot be gated]
36
12 Clock Gating Modes
Gating Mode | Valid | Valid+Stalls | Writes | Writes+Reads | Gate Both | Gate Clock | Gate Logic | Examples
     0      |  No   |     No       |   No   |     No       |    No     |    No      |    No      | Control Logic, Buffers, Small Macros
     1      |  Yes  |     No       |   No   |     No       |    Yes    |    No      |    No      |
     2      |  No   |     Yes      |   No   |     No       |    Yes    |    No      |    No      | Issue Queues, Execute Pipelines
     3      |  No   |     No       |   Yes  |     No       |    Yes    |    No      |    No      | Caches
     4      |  No   |     No       |   No   |     Yes      |    Yes    |    No      |    No      | Some Queues
     5      |  Yes  |     No       |   No   |     No       |    No     |    Yes     |    No      | CAMs, Selection Logic
     6      |  No   |     Yes      |   No   |     No       |    No     |    Yes     |    No      |
     7      |  No   |     No       |   Yes  |     No       |    No     |    Yes     |    No      | No known macros
     8      |  No   |     No       |   No   |     Yes      |    No     |    Yes     |    No      | No known macros
     9      |  Yes  |     No       |   No   |     No       |    No     |    No      |    Yes     | No known macros
    10      |  No   |     Yes      |   No   |     No       |    No     |    No      |    Yes     | No known macros
    11      |  No   |     No       |   Yes  |     No       |    No     |    No      |    Yes     | No known macros
    12      |  No   |     No       |   No   |     Yes      |    No     |    No      |    Yes     | No known macros
PowerTimer Observations
• PowerTimer works well for POWER4-like estimates and derivatives
  – Scales the base microarchitecture quite well
  – E.g. optimal power-performance pipelining study
  – Lack of run-time, bit-level SF not seen as a problem within IBM (seen as noise)
• Chip bit-level SFs are quite low (5-15%)
• Most (60-70%) power is dissipated while
maintaining state (arrays, latches, clocks)
• Much state is not available in early-stage
timers
38
Comparing models: Flexibility
• Flexibility necessary for certain studies
– Resource tradeoff analysis
– Modeling different architectures
• Purely analytical tools provide fully-parameterizable power models
– Within this methodology, circuit design styles
could also be studied
• PowerTimer scales power models in a
user-defined manner for individual subunits
– Constrained to structures and circuit-styles
currently in the library
• Perhaps Mixed Mode tools could be very
useful
39
Comparing models: Accuracy
• PowerTimer -- Based on validation of
individual pieces
– Extensive validation of the performance model
(AFs)
– Power estimates from circuits are accurate
– Circuit designers must vouch for clock gating
scenarios
– Certain assumptions will limit accuracy or
require more in-depth analysis
• Analytical Tools
– Inherent Issues
• Analytical estimates cannot be as accurate as
SPICE analysis (“C” estimates, CV2 approximation)
– Practical Issues
• Without industrial data, must estimate transistor
sizing, bits per structure, circuit choices
40
Comparing models: Speed
• Performance simulation is slow enough!
• Post-Processing vs. Run-Time Estimates
• Wattch’s per-cycle power estimates:
roughly 30% overhead
– Post-processing (per-program power estimates)
would be much faster (minimal overhead)
• PowerTimer allows both no-overhead post-processing and run-time analysis for
certain studies (di/dt, thermal)
– Some clock gating modes may require run-time
analysis
• Third Option: Bit Vector Dumps
– Flexible Post-Processing → Huge Output Files
41
Static Power Dissipation
• Static power: dissipation due to
leakage current
• Growing worse because as Vdd
scales, Vth must also scale to
maintain switching speeds
– “Off” transistors are no longer fully cut
off
42
Leakage Power
• The fraction of leakage power is
increasing exponentially with each
generation
• Also exponentially dependent on
temperature
[Plot: static power as a percentage of dynamic power (0–70%) vs. temperature (298–373 K) for 180nm, 130nm, 100nm, 90nm, 80nm, and 70nm technologies; the ratio rises with temperature and increases across generations. Source: Sankaranarayanan et al, University of Virginia, based on ITRS 2001 data]
43
Static Power - HotLeakage
• Main mechanisms for leakage current
  – Subthreshold (Berkeley predictive model):
    I_leakage = μ0 · Cox · (W/L) · e^(a + b·(Vdd − Vdd0)) · vt² · (1 − e^(−Vdd/vt)) · exp((−|Vth0| − Voff) / (n·vt))
– Gate
• Igate = Igate0 * exp(a*(tox-tox0)) * exp(b*(vdd-vdd0))
• There is also a weak temp dependence,
ignored here
• Gate leakage has essentially been ignored
– New gate insulation materials may reduce urgency of the problem, e.g. recent Intel announcement
• R. Chau, Technology@intel Magazine. www.intel.com
• Gate-induced drain leakage (GIDL) occurs at
negative gate voltages and high Vdd or high values
of reverse body bias
– New leakage path opens!
44
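Here is a hedged Python transcription of the subthreshold expression above, mainly to show the exponential dependence on temperature and threshold voltage; every parameter value is an arbitrary placeholder, and this is not the HotLeakage code.

```python
import math

Q_OVER_K = 11604.5  # q/k in kelvin per volt, so vt = kT/q = T / Q_OVER_K

def subthreshold_leakage(w_over_l, temp_k, vdd, vth0,
                         mu0_cox=1e-4, a=0.0, b=1.0, vdd0=1.0,
                         voff=-0.08, n=1.5):
    """BSIM3-style subthreshold current, sketching the equation above."""
    vt = temp_k / Q_OVER_K                       # thermal voltage
    return (mu0_cox * w_over_l
            * math.exp(a + b * (vdd - vdd0))
            * vt ** 2
            * (1.0 - math.exp(-vdd / vt))
            * math.exp((-abs(vth0) - voff) / (n * vt)))

# Leakage grows quickly with temperature (and with lower Vth); arbitrary device:
for t in (300, 330, 360):
    print(t, subthreshold_leakage(w_over_l=10, temp_k=t, vdd=1.3, vth0=0.25))
```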
Effects of Parameter Variations
• Ioff depends exponentially on Vth
• There is a large fluctuation of Ioff from die to die and
from gate to gate
• Controlling Vth is difficult in nanometer scale
– Drain-induced barrier lowering
• Channel length is not constant
• Exacerbated in sub-100nm devices
– Discrete dopant effects
• In a very small channel, small number of dopants
• Presence of these dopants and random fluctuation of their
number, lead to changes in Vth from device to device
• Process variation affects
– Gate length (Ldrawn)
– Gate oxide thickness (Tox)
– Channel dose (Nsub)
Srivastava et al, ISLPED 2002
45
Static Power
• Modeling Leakage
– Butts and Sohi (MICRO-33)
• Pstatic = Vcc · N · kdesign · Îleak
• Îleak determined by circuit simulation, kdesign
empirically
• Key contribution: separate technology from design
– HotLeakage (UVA TR CS-2003-05, DATE’04)
• Extension of Butts & Sohi approach: scalable with Vdd,
Vth, Temp, and technology node; adds gate leakage
• Îleak determined by BSIM3 subthreshold equation and
BSIM4 gate-leakage equations, giving an analytical
expression that accounts for dependence on factors
that may change at runtime, namely Vdd, Vth, and Temp
• kdesign replaced by separate factors for N- and P-type
transistors
• kdesign also exponentially dependent on Vdd and Tox,
linearly dependent on Temp
• Currently integrated with SimpleScalar/Wattch for
caches
• http://lava.cs.virginia.edu/HotLeakage
46
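The Butts & Sohi abstraction above fits in a couple of lines; the example numbers below are invented for illustration, not taken from the paper.

```python
# Butts & Sohi (MICRO-33) static power abstraction:
#   P_static = Vcc * N * k_design * I_leak_hat
def static_power(vcc, n_transistors, k_design, i_leak_hat):
    return vcc * n_transistors * k_design * i_leak_hat

# A hypothetical 2M-transistor structure with made-up k_design and I_leak:
print(static_power(vcc=1.3, n_transistors=2e6, k_design=5.0, i_leak_hat=20e-9), "W")
```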
Static Power
• Modeling Leakage (cont.)
– Su et al, IBM (ISLPED’03)
• Similar approach to HotLeakage – but they
observe that modeling the change in leakage
allows linearization of the equations
– Many, many other papers on various
aspects of modeling different aspects of
leakage
• Most focus on subthreshold
• Few suggest how to model leakage in
microarchitecture simulations
47
Power modeling summary
• Wattch provides excellent relative
accuracy
– Underestimates full chip power (some
units not modeled, etc)
• PowerTimer models based on
circuit-level power analysis
– Inaccuracy is introduced in SF/AF and
scaling assumptions
• Modeling leakage is essential for
thermal research
– “Thermal runaway” is even possible
48
Overview
1. Motivation (Kevin)
2. Thermal issues (Kevin)
3. Power modeling (David)
4. Thermal management (David)
5. Optimal DTM (Lev)
6. Clustering (Antonio)
7. Power distribution (David)
8. What current chips do (Lev)
9. HotSpot (Kevin)
49
Existing Work
• Research Ideas
– DEETM – Huang and Torrellas MICRO2000
– DTM – Brooks and Martonosi HPCA2001
– Control-Theoretic DTM – Skadron,
Abdelzaher, Stan HPCA2002
– Thermal Scheduling – Cai, Lim, Daasch
WCED2002
• Commercial Products
– PowerPC G3 Microprocessor
– Pentium III
– Pentium 4
50
Overview
• Hard to optimize power-performance at
design time for all cases
• Forces conservative choices for issues like
cooling, current delivery, resource sizes
• Want to explore dynamic power optimizations for run-time power management
  – Dynamic Voltage/Frequency Scaling [Burd, 2000]
  – Dynamic Hardware Resizing [Albonesi, 1999]
  – Fetch Throttling [Sanchez, 1997]
  – Global Clock Gating [Gunther, 2001]
  – Speculation Control [Manne, 1998]
  – Dynamic Thermal Management [Brooks, 2001][Huang, 2000]
51
Important to optimize P & T early
[Plot: relative power (0–5) vs. relative 1/performance (0.7–1.6) for pipeline depths of 12FO4, 14FO4, 18FO4, and 23FO4, comparing tradeoffs made via changing Vdd and HI, via changing frequency, and via changing pipeline depth against a maximum power budget]
52
Dynamic Thermal Management
• Goal:
– Provide dynamic techniques to cool chip
when needed
– Exploit natural variations due to different
applications, phase behavior, …
– Allow designers to target average, rather
than worst-case behavior
• Design Decisions:
– Mechanism & policy for triggering
response?
– What should response be?
– How to select DTM trigger levels?
53
Power consumption impacts cost
[Figure: CPU packaging and power-delivery illustration. From: Gunther, et al., "Managing the Impact of Increasing Microprocessor Power Consumption," Intel Technology Journal, Q1, 2001]
• System costs associated with power dissipation:
– Thermal control cost
• Heatsinks, fans
– Power delivery
• Power supply
• Decoupling caps…
54
Average and Worst Case Power
[Bar chart: maximum vs. average power (scale 0–100) for the Alpha 21264 and Intel PPro]
• System costs are
constrained by worst case
power dissipation
• Average case power
dissipation can often be
much lower
– Aggressive Clock Gating
– Applications variations
– Underutilized resources
• Not enough ILP
• Floating point units during
integer code execution
• Currently about a 30%
difference
• Likely to further diverge…
55
Dynamic Thermal Management
[Figure: temperature vs. time. The cooling capacity designed without DTM sits above the cooling capacity designed with DTM; the gap is the system cost saving. When temperature crosses the DTM trigger level, the DTM response is engaged; with DTM disabled, temperature would rise toward the larger cooling capacity]
56
DTM: Definitions
[Timeline: Trigger Reached → (Initiation Delay) → Turn Response On → (Response Delay) → Check Temp → (Policy Delay) → Check Temp → Turn Response Off → (Shutoff Delay)]
 Initiation Delay – OS interrupt/handler
 Response Delay – Invocation time (e.g. adjust clock)
 Policy Delay – Number of cycles engaged
 Shutoff Delay – Disabling time (e.g. re-adjust clock)
57
DTM: When, How, and What
• Trigger Mechanism: When do we enable DTM techniques?
• Initiation Mechanism: How do we enable the technique?
• Response Mechanism: What technique do we enable?
58
DTM: Trigger Mechanisms
• Mechanism: How do we deduce temperature?
  – Direct approach: temperature sensors providing feedback
    • Implemented in some PowerPC chips (G3, G4) [Sanchez, 1997]
    • Sensor quantity, placement, and precision will be discussed later
  – Other indirect approaches possible
• Policy: When to begin responding?
  – Trigger level set too high: packaging cost will be high, little advantage
  – Trigger level set too low: frequent triggering causes performance to suffer
  – Choose trigger level to exploit difference between average and worst-case power.
59
DTM: Initiation Mechanisms
• Operating system or
microarchitectural control?
– Hardware support can significantly reduce
performance penalty
• Policy Delay Settings
– For Volt/Freq scaling, much of the
performance penalty can be attributed to
enabling/disabling
– Increasing policy delay reduces overhead;
smarter initiation techniques would help
as well
60
DTM: Response Mechanisms
• Scaling Techniques
– Clock Frequency Scaling [Intel Pentium 4]
– Voltage and Frequency Scaling
– Temperature-tracking frequency scaling[Skadron03]
• Adjusts frequency to account for T-dep. of switching speed
• Microarchitectural Techniques
– Speculation Control [Manne98]
– Low-Power Cache Techniques [Huang00]
  – Decode Throttling [Sanchez97]
  – Fetch Toggling [Brooks01]
  – Feedback controlled Fetch Gating [Skadron02]
  – Migrating Computation [Skadron03]
  – Dual Pipelines [Lim02]
  – Hybrid DTM [Skadron04]
• Hierarchical Responses
61
Dynamic Voltage/Frequency Scale
• Voltage Scheduler predicts workload
requirements
• Set frequency/voltage to near-optimal levels, saving energy
• Burd, et al., ISSCC2000
– 5MHz @ 1.2V: 6 MIPS, 2.8mW
– 80MHz @ 3.8V: 85 MIPS, 460mW
– 70us 1.2V <-> 3.8V
• Transmeta Crusoe
– Commercial implementation (500–700 MHz, 1.2–1.6 V)
62
Temperature-Tracking Frequency
Temperature affects:
• Transistor threshold and mobility
• Ion, Ioff, Igate, delay
• ITRS: 85°C for high-performance, 110°C for embedded!
• So adjust frequency as f(T) -- TTDFS
[Plot: NMOS Ion and Ioff vs. temperature]
63
Speculation Control
• Manne et al. (ISCA ’98)
– Branch confidence estimator used to determine
whether to speculate
– Pipeline gating based on confidence estimation
– 38% reduction in wrong-path instructions with
~1% performance loss
• But Parikh et al. (HPCA ’02) found much
smaller savings; ED product is zero or
negative
– Significant energy savings only come with
significant loss of performance
– This is because many instructions are squashed
early in the pipeline, so reduction in wrong-path
instructions is not a useful metric
– Benefit is actually a function of prediction
accuracy
• Only for very badly predicted programs do you get
benefit
• Well-predicted programs suffer
64
Dynamic Hardware Resizing
• Complexity Adaptive Processors
• Based on application characteristics
– Underutilized structures may be reduced
with minimal performance impact
– Resize Caches, Issue Queues, etc.
– Resize => Reduce Capacitance =>
Reduce Energy
– Of course, this only helps manage heat if
it reduces heat dissipation within hot
spots
• And does so for a sufficiently long duration
65
DEETM
• Dynamic Energy Efficiency and
Temperature Management
• Slack algorithm detects if slowdown
can be tolerated
– If so, invoke techniques to reduce energy
• Temperature algorithm
– If temperature limit is reached, invokes
techniques
• Techniques considered
– Filter Cache, Voltage Scaling, etc.
66
Control-theoretic DTM
• Fetch toggling
– disable fetch every N cycles
– 4/5, 2/3, 1/2, 1/3, 1/5, …
[Figure: five-stage pipelines (IF ID EX MEM WB), illustrating fetch being disabled for some fraction of cycles]
– How to set the fetch rate?
• (Assume idealized temperature sensing)
67
Feedback-Control of Fetch Toggling
• Formal feedback control
[Block diagram: setpoint minus measured T gives error e; the Controller produces u[k], which drives the actuator (I-fetch toggling); the resulting power P enters the thermal dynamics to produce temperature T, which is read by the temp. sensor and fed back]
Integral ctrl: u[k] = u[k-1] + Kc · e[k-1]
Original work used PID
• easy to compute
• toggling = f(m)
• optimality: will discuss shortly
68
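A minimal simulation of the integral control law above driving a toggling actuator: only the update u[k] = u[k-1] + Kc · e[k-1] comes from the slide, while the gain, the crude first-order thermal model, and all constants are invented placeholders.

```python
# Integral feedback control of fetch toggling (sketch).
SETPOINT = 81.8   # target/trigger temperature (deg C)
KC = 0.05         # controller gain, assumed

def step_thermal_model(temp, fetch_rate):
    """Toy first-order model: more fetch -> more power -> higher temperature."""
    power = 20.0 + 40.0 * fetch_rate                    # watts, invented
    return temp + 0.01 * (power - (temp - 45.0) / 0.8)  # crude RC-style step

temp, u = 83.0, 1.0            # start hot, at full fetch rate
for k in range(500):
    e = SETPOINT - temp                         # error (negative when too hot)
    u = min(1.0, max(0.0, u + KC * e))          # integral update, clamped to [0,1]
    temp = step_thermal_model(temp, u)          # u acts as the fetch duty cycle
print(f"final temp {temp:.2f} C, duty cycle {u:.2f}")
```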
Formal Feedback Control
• Regulatory control problem: hold value to a
specified setpoint
– Example: temperature
– Proved that PID controller will not allow
temperature to exceed setpoint by more than
0.02°
• Max power dissipation, thermal dynamics, sampling rate → max overshoot
• This precision is excessive but illustrates the value
of formal feedback control theory
69
Performance Loss
[Bar chart: percent loss in performance (0–30%) for gcc, mesa, art, equake, crafty, facerec, fma3d, parser, eon, perlbmk, vortex, bzip, and the MEAN, comparing simple toggling (toggle1) with PID control]
• Performance loss reduced by 65%
70
Migrating Computation
• When one unit overheats, migrate its functionality to a distant, spare unit (MC)
  – Spare register file (Skadron et al. 2003)
  – Separate core (CMP) (Heo et al. 2003)
  – Microarchitectural clusters
  – etc.
• Raises many interesting issues
  – Cost-benefit tradeoff for that area
  – Use both resources (scheduling)
  – Extra power for long-distance communication
  – Floorplanning
• This is probably one of the richest opportunities for DTM
71
Migrating Computation – Reg File
72
Thermal Scheduling (Cai 2002)
[Figure: dual pipelines sharing a front end (FE) and register file (RF): the Primary path decodes (DE) into an out-of-order pipeline (OOP) with execution units (EX) and serves the majority of mobile apps with performance requirements; the Secondary path decodes into a simple in-order pipeline (IOP) and serves text email, caller-id, reminders, and other non-performance-critical anywhere-anytime apps]
• Primary pipeline: maximal performance, complex pipeline structure
• Secondary pipeline: minimum power and energy consumption, very simple in-order structure, targeting mobile anywhere-anytime applications
• Transparent to OS and applications
• Maximal use of on-die clock/power gating for energy saving
73
Scheduling Algorithm (Cai 2002)
[State diagram and temperature trace: the scheduler moves among states S1–S4 based on the primary-pipeline temperature T1 and secondary-pipeline temperature T2 relative to a low threshold TL and a high threshold TH (e.g., leaving S1 when T1 reaches TH and returning to S1 once both T1 and T2 have dropped to TL). The trace shows Tmax, TH, TS1, TS2, TL, ambient Ta, and the heating/cooling times theat, tcool over a cycle tcycle]
S1: Normal Operation (Primary Pipeline)
S2: Stall Fetch & Clear Pipeline
S3: Alternate Operation (Secondary Pipeline)
S4: Disable Clock or Scale F-V
74
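To give the flavor of such a scheduler (not Cai's exact transition conditions, which are only partially legible on the slide), a threshold-driven state machine might look like the sketch below; the threshold values and the specific transitions are assumptions.

```python
# Threshold-driven thermal scheduler in the spirit of the slide (sketch).
T_LOW, T_HIGH = 70.0, 85.0   # stand-ins for TL and TH (deg C)

def next_state(state, t_primary, t_secondary):
    """S1: primary pipeline; S2: stall fetch & clear pipeline;
    S3: secondary pipeline; S4: disable clock or scale F-V."""
    if state == "S1":
        return "S2" if t_primary >= T_HIGH else "S1"
    if state == "S2":                 # pipeline drained: decide where to run
        return "S3" if t_secondary < T_HIGH else "S4"
    if state == "S3":
        if t_primary <= T_LOW:        # primary has cooled, switch back
            return "S1"
        return "S4" if t_secondary >= T_HIGH else "S3"
    if state == "S4":                 # both hot: wait for cooling
        return "S1" if t_primary <= T_LOW and t_secondary <= T_LOW else "S4"
    raise ValueError(state)

state = "S1"
for t1, t2 in [(80, 60), (86, 61), (84, 62), (83, 64), (69, 60)]:  # fake trace
    state = next_state(state, t1, t2)
    print(t1, t2, "->", state)
```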
Hybrid DTM
•
DVS is attractive because of its cubic
advantage
– P ∝ V²f
– This factor dominates when DTM must be
aggressive
– But changing DVS setting can be costly
• Resynchronize PLL
• Sensitive to sensor noise → spurious changes
•
“ILP techniques” are attractive because
they can use instruction level parallelism to
hide/reduce impact of DTM
– Only effective when DTM is mild
•
So use both!
– Need to find “crossover point”
75
Hybrid DTM, cont.
•
Combine fetch gating with DVS
– When DVS is better, use it
– Otherwise use fetch gating
– Determined by magnitude of temperature
overshoot
– Crossover at FG duty cycle of 3
– FG has low overhead: helps reduce cost of
sensor noise
[Plots: slowdown (1.0–1.4) vs. duty cycle for fetch gating (FG), DVS, and the hybrid (Hyb) scheme]
76
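A sketch of the hybrid selection policy described above: only the idea of using fetch gating for mild overshoot and DVS when DTM must be aggressive comes from the slides, while the crossover threshold and the way severity is measured are assumptions.

```python
# Hybrid DTM response selection (sketch).
TRIGGER = 81.8          # deg C trigger threshold (from the simulation setup)
CROSSOVER_DEG = 1.0     # overshoot beyond which DVS is preferred, assumed
FG_DUTY_CYCLE = 3       # single fixed fetch-gating duty cycle

def choose_response(temp):
    """Pick the DTM response for the current sensed temperature."""
    if temp <= TRIGGER:
        return "none"
    overshoot = temp - TRIGGER
    if overshoot <= CROSSOVER_DEG:
        return f"fetch gating (duty cycle {FG_DUTY_CYCLE})"   # mild: ILP hides cost
    return "DVS (lower voltage/frequency setting)"            # aggressive: cubic win

for t in (81.0, 82.3, 84.5):
    print(t, "->", choose_response(t))
```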
Hybrid DTM, cont.
•
DVS doesn’t need more than two settings
for thermal control
– Lower voltage cools chip faster
•
FG by itself does need multiple duty
cycles and hence requires PI control
•
But in a hybrid configuration, FG does not
require PI control
– FG is only used at mild DTM settings
– Can pick one fixed duty cycle
•
This is beneficial because feedback
control is vulnerable to noise
77
Simulation Details
• 85°C maximum temperature
– Guard band requires a trigger threshold of 81.8°
• Ambient temperature (inside computer case): 45°C
• Rpackage = 0.8 K/W (old package model)
– 0.7 K/W necessary if DTM not available
• Die thickness: 0.5mm
• Currently neglecting interface material
• 9 SPEC2000 benchmarks, both integer and FP
– 4 hover near 81.8°C, rest are above
• SimpleScalar/Wattch, modified to model pipeline and
power of an Alpha 21364 as closely as possible
• Scaled to 130nm, 1.3V, 3.0 GHz
78
Performance Comparison
• TT-DFS is best but can't prevent excess temperature
  – Suitable for use with aggressive clock rates at low temp.
• Hybrid technique reduces DTM cost by 25% vs. DVS (DVS overhead important)
• A substantial portion of MC's benefit comes from the altered floorplan, which separates hot units
[Bar chart: slowdown factor for TTDFS, DVS, FG, Hyb, and MC; the plotted values are 1.045, 1.112, 1.231, 1.270, and 1.359]
79
Floorplan Impact
[Floorplan figure; annotations: "2 cycle penalty", "LSQ moves: no penalty"]
80
Conclusions so far
• DTM can be used to reduce cooling costs
• Proper modeling is required
  – HotSpot is publicly available at http://lava.cs.virginia.edu/HotSpot
• ILP matters
• Hybrid techniques beneficial
  – Merge advantages of different schemes
  – Simplify control
• Architectural techniques important in thermal design
• Growing use of clusters and redundant units opens an incredibly rich design space
81
DTM: Summary and Key Issues
• Dynamic optimizations translate the max-power problem into an average-power problem
• Heightens importance of average-power techniques like clock gating
• Key Issues:
– Initiation interval
– Collection of possible response
mechanisms
– Sensor accuracy and data fusion
– Hot spots
82