Thermal Management Issues (MICRO

© Mircea Stan, Kevin Skadron, David Brooks, 2002
Overview
1. Motivation (Kevin)
2. Thermal issues (Kevin)
3. Power modeling (David)
4. Thermal management (David)
5. Optimal DTM (Lev)
6. Clustering (Antonio)
7. Power distribution (David)
8. What current chips do (Lev)
9. HotSpot (Kevin)
Power modeling
• Research Power Simulators
– Wattch – Brooks and Martonosi ISCA2000
– SimplePower – Vijaykrishnan et al (Penn State) ISCA2000
– TEMPEST – Dhodapkar et al (Intel/Wisconsin)
– PowerAnalyzer – Umich/Colorado
– AccuPower – SUNY Binghamton
• Industry Power Simulators
– IBM PowerTimer – Brooks and Bose PACS2000
– Intel ALPS – Gunther, et al.
Power: The Basics
• Dynamic power vs. Static power
– Dynamic: “switching” power
– Static: “leakage” power
– Dynamic power dominates, but static power is increasing in importance
– Trends in each
• Static power: steady, per-cycle energy cost
• Dynamic power: capacitive and short-circuit
• Capacitive power: charging/discharging at transitions from 0→1 and 1→0
• Short-circuit power: power due to brief short-circuit current during transitions
• Mostly focus on capacitive, but recent work on others
Capacitive Power dissipation
[Figure: CMOS inverter with supply Vdd, input Vin, output Vout, and load capacitance CL]
$P \sim \tfrac{1}{2} C V^{2} A f$
• Capacitance (C): function of wire length, transistor size
• Supply voltage (V): has been dropping with successive fab generations
• Activity factor (A): how often, on average, do wires switch?
• Clock frequency (f): increasing…
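A minimal sketch of the capacitive power relation above, evaluated for illustrative numbers. The function name and all parameter values are assumptions chosen for the example, not figures from the slides.

```python
def dynamic_power(capacitance_f, vdd_v, activity, freq_hz):
    """Capacitive switching power in watts: P = 1/2 * C * V^2 * A * f."""
    return 0.5 * capacitance_f * vdd_v ** 2 * activity * freq_hz

# Hypothetical values: 1 nF switched capacitance, activity factor 0.2, 1 GHz clock.
base = dynamic_power(1e-9, 1.5, 0.2, 1e9)
scaled = dynamic_power(1e-9, 1.2, 0.2, 1e9)

# Dropping Vdd from 1.5 V to 1.2 V scales power by (1.2 / 1.5)^2.
print(f"base = {base:.3f} W, scaled/base = {scaled / base:.2f}")
```

The quadratic dependence on V is why supply voltage is the most effective of the four knobs, as long as frequency can be held.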
Short-Circuit Power Dissipation
[Figure: CMOS inverter showing short-circuit current ISC, input VIN, output VOUT, and load CL]
• Short-circuit current is caused by finite-slope input signals
• Direct current path between VDD and GND when both NMOS and PMOS transistors are conducting
Leakage Power
[Figure: CMOS inverter with input VIN, output VOUT, load CL, and the leakage currents ISub and IGate]
$I_{DSub} \approx k \cdot e^{-q V_T / (a\, k_a T)}$
• Subthreshold currents grow exponentially with increases in temperature, decreases in threshold voltage
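The exponential sensitivity can be illustrated with a small sketch. It assumes that $k_a$ is Boltzmann's constant, that $a$ is a technology-dependent fitting factor (1.5 here is a made-up value), and that the prefactor $k$ is constant over temperature, which is a simplification.

```python
import math

Q = 1.602176634e-19   # electron charge in coulombs
K_B = 1.380649e-23    # Boltzmann constant in J/K (assumed meaning of k_a)

def subthreshold_current(k, v_t, temp_k, a=1.5):
    """I_DSub = k * exp(-q * V_T / (a * k_a * T)); k and a are hypothetical here."""
    return k * math.exp(-Q * v_t / (a * K_B * temp_k))

# Relative growth when heating from 45 C to 85 C at V_T = 0.3 V.
ratio_temp = subthreshold_current(1.0, 0.3, 358.15) / subthreshold_current(1.0, 0.3, 318.15)
# Relative growth when lowering V_T from 0.30 V to 0.25 V at 85 C.
ratio_vt = subthreshold_current(1.0, 0.25, 358.15) / subthreshold_current(1.0, 0.3, 358.15)

print(f"hotter: x{ratio_temp:.2f}, lower V_T: x{ratio_vt:.2f}")
```

Both ratios exceed 1, matching the two trends named in the bullet; the exact magnitudes depend entirely on the assumed constants.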
Modeling Hierarchy and Tool Flow
[Figure: tool flow from a set of workloads and microarchitectural parms/specs down through successive model levels, each feeding the energy models, which are refined and updated at every level]
– Microarch level: early analytical performance models; trace/exec-driven, cycle-accurate simulation models; performance test cases
– RTL level: RTL model (VHDL); (architectural) sim test cases; RTL sim; edit/debug
– Gate level and ckt level: gate-level model (if synthesized); circuit-level (hierarchical) netlist model; bitvector test cases; ckt sim, extract; edit/tune/debug
– Layout level: layout-level physical design model; design rules; cap extract, sim; design rule check, validate
Analysis Abstraction Levels
Abstraction Level | Analysis Capacity | Analysis Accuracy | Analysis Speed | Analysis Resources | Energy Savings
Application | Most | Worst | Fastest | Least | Most
Behavioral | | | | |
Architectural (RTL) | | | | |
Logic (Gate) | | | | |
Transistor (Circuit) | Least | Best | Slowest | Most | Least
(Intermediate levels fall between the two extremes.)
Power/Performance abstractions
• Low-level:
– Hspice
– PowerMill
• Medium-Level:
– RTL Models
• Architecture-level:
– PennState SimplePower
– Intel Tempest
– Princeton Wattch
– IBM PowerTimer
– Umich/Colorado PowerAnalyzer
Low-level models: Hspice
• Extracted netlists from circuit/layout descriptions
– Diffusion, gate, and wiring capacitances are modeled
• Analog simulation performed
– Detailed device models used
– Large systems of equations are solved
– Can estimate dynamic and leakage power dissipation within a few percent
– Slow, only practical for 10-100K transistors
• PowerMill (Synopsys) is similar but about 10x faster
Medium-level models: RTL
• Logic simulation obtains switching
events for every signal
• Structural VHDL or Verilog with zero
or unit-delay timing models
• Capacitance estimates performed
– Device Capacitance
• Gate sizing estimates performed, similar to
synthesis
– Wiring Capacitance
• Wire load estimates performed, similar to
placement and routing
• Switching event and capacitance
estimates provide dynamic power
estimates
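As a rough sketch of how switching events and capacitance estimates combine into a dynamic power estimate: each transition on a signal is charged ½CV², and the total is divided by the simulated time. The signal names, toggle counts, and capacitances below are hypothetical.

```python
def rtl_dynamic_power(toggles, cap_f, vdd_v, sim_time_s):
    """Average dynamic power (W) from per-signal toggle counts and capacitances."""
    energy_j = sum(0.5 * cap_f[name] * vdd_v ** 2 * count
                   for name, count in toggles.items())
    return energy_j / sim_time_s

# Hypothetical logic-simulation results over a 10 us window.
toggles = {"alu_out": 4000, "reg_we": 500}    # transitions per signal
cap_f = {"alu_out": 20e-15, "reg_we": 5e-15}  # device + wire capacitance, in farads
power_w = rtl_dynamic_power(toggles, cap_f, vdd_v=1.2, sim_time_s=1e-5)

print(f"{power_w * 1e6:.2f} uW")
```

A real flow would take the capacitances from the gate-sizing and wire-load estimates described above rather than from a hand-written table.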
Architecture level models
$P \sim \tfrac{1}{2} C V^{2} A f$
• Bottom-up Approach:
– Estimate “$CV^{2}f$” via analytical models
– Tools: Wattch, PowerAnalyzer, Tempest (mixed-mode)
• Top-Down Approach
– Estimate “$CV^{2}f$” via empirical measurements
– Tools: PowerTimer, AccuPower, Most Industrial Tools
• Estimate “A” via statistics from architectural-performance simulators
Analytical Models: Capacitance
• Requires modeling wire length and
estimating transistor sizes
• Related to RC Delay analysis for
speed along critical path
– But capacitance estimates require
summing up all wire lengths, rather than
only an accurate estimate of the longest
one.
Register File: Capacitance Analysis
[Figure: register file array with decoders, wordlines (number of entries), bitlines (data width of entries), pre-charge, cell access transistors (N1), and sense amps; the number of ports multiplies both the wordlines and the bitlines]
$C_{wordline} = C_{diffcap,WordlineDriver} + NumberBitlines \cdot C_{gatecap,N1} + WordlineLength \cdot C_{metal}$
$C_{bitline} = C_{diffcap,Pchg} + NumberWordlines \cdot C_{diffcap,N1} + BitlineLength \cdot C_{metal}$
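The two capacitance equations can be sketched directly in code. The `TechParams` container and every numeric value below are hypothetical placeholders, not technology data from the slides.

```python
from dataclasses import dataclass

@dataclass
class TechParams:
    """Per-device capacitances in farads; all values used below are made up."""
    c_diff_driver: float   # diffusion cap of the wordline driver
    c_gate_n1: float       # gate cap of one cell access transistor (N1)
    c_diff_pchg: float     # diffusion cap of the pre-charge transistor
    c_diff_n1: float       # diffusion cap of one cell access transistor (N1)
    c_metal_per_um: float  # wire capacitance per micron

def wordline_cap(tech, num_bitlines, wordline_len_um):
    return (tech.c_diff_driver
            + num_bitlines * tech.c_gate_n1
            + wordline_len_um * tech.c_metal_per_um)

def bitline_cap(tech, num_wordlines, bitline_len_um):
    return (tech.c_diff_pchg
            + num_wordlines * tech.c_diff_n1
            + bitline_len_um * tech.c_metal_per_um)

tech = TechParams(2e-15, 1e-15, 3e-15, 0.5e-15, 0.2e-15)
c_wl = wordline_cap(tech, num_bitlines=64, wordline_len_um=100.0)
c_bl = bitline_cap(tech, num_wordlines=32, bitline_len_um=80.0)
print(f"wordline {c_wl * 1e15:.1f} fF, bitline {c_bl * 1e15:.1f} fF")
```

Adding ports lengthens both wires and adds access transistors, so in a fuller model the line lengths and device counts would themselves be functions of the port count.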
Register File Model: Accuracy
Error Rates (Numbers in Percent)
 | Wordline(r) | Wordline(w) | Bitline(r) | Bitline(w)
Gate | 1.11 | -6.37 | 2.82 | -10.96
Diff | 0.79 | 0.79 | -10.58 | -10.60
InterConn. | 15.06 | -10.68 | -19.59 | 7.98
Total | 8.02 | -7.99 | -10.91 | -5.96
• Validated against a register file schematic used in internal Intel design
• Compared capacitance values with estimates from a layout-level Intel tool
• Interconnect capacitance had largest errors
– Model neglects poly connections
– Differences in wire lengths -- difficult to tell wire distances of schematic nodes
Different Circuit Design Styles
• RTL and Architectural level power
estimation requires the tool/user to make
circuit design style assumptions
– Static vs. Dynamic logic
– Single vs. Double-ended bitlines in register
files/caches
– Sense Amp designs
– Transistor and buffer sizings
• Generic solutions are difficult because
many styles are popular
• Within individual companies, circuit design
styles may be fixed
Clock Gating: What, why, when?
[Figure: a gate signal combined with the clock to produce a gated clock]
• Dynamic Power is dissipated on clock
transitions
• Gating off clock lines when they are
unneeded reduces activity factor
• But putting extra gate delays into clock
lines increases clock skew
• End results:
– Clock gating complicates design analysis but
saves power.
Wattch: An Overview
Wattch’s Design Goals
• Flexibility
• Planning-stage info
• Speed
• Modularity
• Reasonable accuracy
Overview of Features
• Parameterized models for different CPU units
– Can vary size or design style as needed
• Abstract signal transition models for speed
– Can select different conditional clocking and input
transition models as needed
• Based on SimpleScalar (has been ported to many
simulators)
• Modular: Can add new models for new units
studied
Unit Modeling
[Figure: a parameterized register file power model takes the number of entries, data width of entries, # read ports, # write ports, bitline activity, and number of active ports, and produces a power estimate]
Modeling Capacitance
• Models depend on structure, bitwidth, design style, etc.
• E.g., may model capacitance of a register file with bitwidth & number of ports as input parameters
Modeling Activity Factor
• Use cycle-level simulator to determine number and type of accesses
– reads, writes, how many ports
• Abstract model of bitline activity
One Cycle in Wattch
Stage | Power (Units Accessed) | Performance
Fetch | I-cache, Bpred | Cache Hit? Bpred Lookup?
Dispatch | Rename Table, Inst. Window, Reg. File | Inst. Window Full?
Issue/Execute | Inst. Window, Reg File, ALU, D-Cache, Load/St Q | Dependencies Satisfied? Resources?
Writeback/Commit | Result Bus, Reg File, Bpred | Commit Bandwidth?
• On each cycle:
– determine which units are accessed
– model execution time issues
– model per-unit energy/power based on which units used and how many ports.
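The per-cycle accounting can be sketched as below. This is an illustration under stated assumptions, not Wattch's code: the unit names, the per-unit maximum energies, the port counts, and the rule that an idle unit still costs 10% of its maximum are all placeholders for one possible conditional clocking model.

```python
# Hypothetical maximum per-cycle energy (nJ) and port count for each unit.
UNIT_MAX_ENERGY_NJ = {"icache": 2.0, "regfile": 4.0, "alu": 1.0}
PORTS = {"icache": 1, "regfile": 4, "alu": 2}

def cycle_energy(accesses, idle_fraction=0.1):
    """Energy for one cycle given {unit: number of ports used this cycle}."""
    total = 0.0
    for unit, max_energy in UNIT_MAX_ENERGY_NJ.items():
        used = min(accesses.get(unit, 0), PORTS[unit])
        if used == 0:
            # Assumed: an unused unit still dissipates a fixed fraction.
            total += idle_fraction * max_energy
        else:
            # Assumed: energy scales linearly with the number of ports used.
            total += max_energy * used / PORTS[unit]
    return total

# One cycle in which the I-cache is read and two register file ports are used.
energy = cycle_energy({"icache": 1, "regfile": 2})
print(f"{energy:.2f} nJ")
```

In a cycle-level simulator the `accesses` dictionary would be filled in by the performance model as each stage touches its units.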
Units Modeled by Wattch
• Array Structures
– Caches, Reg Files, Map/Bpred tables
• Content-Addressable Memories (CAMs)
– TLBs, Issue Queue, Reorder Buffer
• Complex combinational blocks
– ALUs, Dependency Check
• Clocking network
– Global Clock Drivers, Local Buffers
PowerTimer
• IBM Tool First Developed During Summer of 2000
– Continued Development: 2001 => Today
– Methodology Applied to Research and Product Power-Performance Simulators within IBM
– Currently in Beta-Release
– Working towards Full Academic Release
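PowerTimer: Empirical Power
[Figure: two pie charts of the power breakdown for a pre-silicon, POWER4-like superscalar design]
• Chip-level chart: L2 23%, LSU 19%, Clock Tree 10%, ISU 10%, IFU 6%, FPU 5%, RAS 5%, FXU 4%, CIU 4%, ZIO 4%, IDU 3%, FBC 3%, L3 Tags 2%, GX 1%, Core Buffer 1%
• Sub-unit chart: Map Tables 43%, Issue Queues 32%, Other 10%, Completion Table 9%, Dispatch 6%
Pre-silicon, POWER4-like superscalar design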
Processor Power Density
[Figure: bar chart of power density (axis from 0 to 0.7) for IFU, IDU, ISU, FXU, LSU, FPU, L2, L3 Tag, BHT, Icache, and FXIssueQ]
Pre-silicon, POWER4-like superscalar design
Originally presented at PACS2002
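PowerTimer
[Figure: tool flow. A program executable or trace drives the architectural performance simulator, which produces CPI and AF/SF data. The AF/SF data, circuit power data (macros), tech parms, and uArch parms feed the “compute sub-unit power” step, SubUnit Power = f(SF, uArch, Tech), which produces the power estimate.]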
PowerTimer: Energy Models
• Energy models for uArch structures formed by summation of circuit-level macro data
[Figure: SF data feeds the energy models of the macros that make up each sub-unit (uArch-level structure); their sum is the power estimate]
– Macro1: Power = C1*SF + HoldPower
– Macro2: Power = C2*SF + HoldPower
– MacroN: Power = Cn*SF + HoldPower
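A small sketch of the summation: each macro contributes a linear function of the switching factor, and a sub-unit's power is the sum over its macros. The coefficients and hold powers below are invented for illustration.

```python
def macro_power(c_mw_per_sf, hold_power_mw, sf_percent):
    """Power = C * SF + HoldPower for one macro (mW, with SF in percent)."""
    return c_mw_per_sf * sf_percent + hold_power_mw

def subunit_power(macros, sf_percent):
    """Sum the linear models of all macros that make up one sub-unit."""
    return sum(macro_power(c, hold, sf_percent) for c, hold in macros)

# Hypothetical sub-unit built from two macros: (C in mW per % SF, HoldPower in mW).
macros = [(2.0, 40.0), (1.0, 10.0)]
at_hold = subunit_power(macros, 0.0)
at_sf10 = subunit_power(macros, 10.0)
print(at_hold, at_sf10)
```

Because every macro model is linear in SF, the sub-unit model is linear too, with slope equal to the sum of the C values and intercept equal to the sum of the hold powers.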
Empirical Estimates with CPAM
• Estimate power under “Input Hold” and “Input Switching” Modes
• Input Hold: All Macro Inputs (Except Clocks) Held
– Can also collect data for Clock Gate Signals
• Input Switching: Apply Random Switching Patterns with 50% Switching on Input Pins
[Figure: macro inputs drive the macro; power is measured at 0% switching (hold power) and at 50% switching]
Example Unit
• Made up of 5 macros
[Figure: power (mW, 0 to 800) versus SF (0 to 50) for macro1 through macro5 and their total]
PowerTimer: Models f(SF)
Assumption: Power linearly dependent on Switching Factor
This separates Clock Power and Switching Power
[Figure: power (mW, 0 to 1400) versus SF (0 to 50) for Unit1 through Unit5, with the clock power and switching power components marked]
At 0% SF, Power = Clock Power (significant without clock gating)
Key Activity Data
[Figure: power (mW, 0 to 1400) versus SF (0 to 50) for Unit1 through Unit5, annotated with the effect of changes in SF and changes in AF]
• SF => Moves along the Switching Power Curve
– Estimated on a per-unit basis from RTL Analysis
• AF => Moves along the Clock Power Curve
– Extracted from Microarchitectural Statistics (Turandot)
Microarchitectural Statistics
• Stats are very similar to tracking used in Wattch, etc
• Differences:
– Clock Gating Modes (3 modes)
– Customized Scaling Based on Circuit Style (4 styles)
• Clock Gating Modes:
– P_constrained = P_unconstrained (not clock-gateable)
– P_constrained_1 = AF * (Pclock + Plogic) (common)
– P_constrained_2 = AF * Pclock + Plogic (rare)
– P_constrained_3 = Pclock + AF * Plogic (very rare)
• Scaling Based on Circuit Styles
– AF_1 = #valid (Latch-and-Mux, No Stall Gating)
– AF_2 = #valid - #stalls (Latch-and-Mux, With Stall Gating)
– AF_3 = #writes (Arrays that only gate updates)
– AF_4 = #writes + #reads (Arrays, RAM Macros)
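The gating modes and circuit-style scalings above can be sketched as two small functions. Dividing the event counts by a number of `slots` (cycles, or cycles times ports) to obtain a fraction is an assumption of this sketch; the slides give only the counts. All numbers are hypothetical.

```python
def activity_factor(style, slots, valid=0, stalls=0, writes=0, reads=0):
    """AF for circuit styles 1-4, normalized by `slots` (an assumed normalization)."""
    counts = {
        1: valid,           # latch-and-mux, no stall gating
        2: valid - stalls,  # latch-and-mux, with stall gating
        3: writes,          # arrays that only gate updates
        4: writes + reads,  # arrays, RAM macros
    }
    return counts[style] / slots

def constrained_power(mode, af, p_clock, p_logic):
    """Power under clock gating mode 0 (not gateable) or modes 1-3."""
    if mode == 0:
        return p_clock + p_logic
    if mode == 1:
        return af * (p_clock + p_logic)
    if mode == 2:
        return af * p_clock + p_logic
    if mode == 3:
        return p_clock + af * p_logic
    raise ValueError(f"unknown gating mode: {mode}")

# Hypothetical queue: valid in 80 of 100 cycles, stalled in 20 of those.
af = activity_factor(2, slots=100, valid=80, stalls=20)
powers = [constrained_power(m, af, p_clock=30.0, p_logic=70.0) for m in range(4)]
print(af, powers)
```

Four styles times three gating modes give the twelve gated combinations, with mode 0 covering macros that cannot be gated at all.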
Clock Gating: Valid-Bit Gating
• Latch-Based Structures: Execute Pipelines, Issue Queues
[Figure: clock distributed to a row of latches, each with its own valid bit (V)]
Clock Gating Modes
• P_constrained_1 = AF * (Pclock + Plogic)
[Figure: clock and valid signals, with Plogic and Pclock marked]
• P_constrained_2 = AF * Pclock + Plogic
[Figure: clock and valid signals with selection logic, with Pclock and Plogic marked]
Valid-bit Gating, Stalls?
• Option 1: Stalls cannot be gated
[Figure: pipeline latch with signals clk, valid, data from previous pipestage, stall from previous pipestage, and data for next pipestage]
• Option 2: Stalls can be gated
[Figure: pipeline latch with signals clk, valid, data from previous pipestage, stall from previous pipestage, and data for next pipestage]
Scaling: Array Structures
• Option 1: Reads and Writes Eligible to Gate for Power
[Figure: array cell with a write bitline and a read bitline; signals read_wordline_active, read_gate, write_wordline_active, write_gate, read_data, and write_data]
Scaling: Array Structures
• Option 2: Only Writes Eligible to Gate for Power
[Figure: array cell with a write bitline; signals write_wordline_active, write_gate, write_data, read_data, and read_entry_0 through read_entry_n]
12 Clock Gating Modes
Gating Mode | Valid | Valid+Stalls | Writes | Writes+Reads | Gate Both | Gate Clock | Gate Logic | Examples
0 | No | No | No | No | No | No | No | Control Logic, Buffers, Small Macros
1 | Yes | No | No | No | Yes | No | No |
2 | No | Yes | No | No | Yes | No | No | Issue Queues, Execute Pipelines
3 | No | No | Yes | No | Yes | No | No | Caches
4 | No | No | No | Yes | Yes | No | No | Some Queues
5 | Yes | No | No | No | No | Yes | No | CAMs, Selection Logic
6 | No | Yes | No | No | No | Yes | No |
7 | No | No | Yes | No | No | Yes | No | No Known macros
8 | No | No | No | Yes | No | Yes | No | No Known macros
9 | Yes | No | No | No | No | No | Yes | No Known macros
10 | No | Yes | No | No | No | No | Yes | No Known macros
11 | No | No | Yes | No | No | No | Yes | No Known macros
12 | No | No | No | Yes | No | No | Yes | No Known macros
PowerTimer Observations
• PowerTimer works well for POWER4-like estimates and derivatives
– Scales base microarchitecture quite well
– E.g. optimal power-performance pipelining study
– Lack of run-time, bit-level SF not seen as a problem within IBM (seen as noise)
• Chip bit-level SFs are quite low (5-15%)
• Most (60-70%) power is dissipated while maintaining state (arrays, latches, clocks)
• Much state is not available in early-stage timers
Comparing models: Flexibility
• Flexibility necessary for certain studies
– Resource tradeoff analysis
– Modeling different architectures
• Purely analytical tools provide fully parameterizable power models
– Within this methodology, circuit design styles could also be studied
• PowerTimer scales power models in a user-defined manner for individual sub-units
– Constrained to structures and circuit-styles currently in the library
• Perhaps Mixed Mode tools could be very useful
Comparing models: Accuracy
• PowerTimer -- Based on validation of individual pieces
– Extensive validation of the performance model (AFs)
– Power estimates from circuits are accurate
– Circuit designers must vouch for clock gating scenarios
– Certain assumptions will limit accuracy or require more in-depth analysis
• Analytical Tools
– Inherent Issues
• Analytical estimates cannot be as accurate as SPICE analysis (“C” estimates, $CV^{2}$ approximation)
– Practical Issues
• Without industrial data, must estimate transistor sizing, bits per structure, circuit choices
Comparing models: Speed
• Performance simulation is slow enough!
• Post-Processing vs. Run-Time Estimates
• Wattch’s per-cycle power estimates: roughly 30% overhead
– Post-processing (per-program power estimates) would be much faster (minimal overhead)
• PowerTimer allows both no-overhead post-processing and run-time analysis for certain studies (di/dt, thermal)
– Some clock gating modes may require run-time analysis
• Third Option: Bit Vector Dumps
– Flexible Post-Processing → Huge Output Files
Power modeling summary
• Wattch provides excellent relative
accuracy
– Underestimates full chip power (some
units not modeled, etc)
• PowerTimer models based on
circuit-level power analysis
– Inaccuracy is introduced in SF/AF and
scaling assumptions
Overview
1. Motivation (Kevin)
2. Thermal issues (Kevin)
3. Power modeling (David)
4. Thermal management (David)
5. Optimal DTM (Lev)
6. Clustering (Antonio)
7. Power distribution (David)
8. What current chips do (Lev)
9. HotSpot (Kevin)
Existing Work
• Research Ideas
– DEETM – Huang and Torrellas MICRO2000
– DTM – Brooks and Martonosi HPCA2001
– Control-Theoretic DTM – Skadron,
Abdelzaher, Stan HPCA2002
– Thermal Scheduling – Cai, Lim, Daasch
WCED2002
• Commercial Products
– PowerPC G3 Microprocessor
– Pentium III
– Pentium 4
Overview
• Hard to optimize power-performance at design time for all cases
• Forces conservative choices for issues like cooling, current delivery, resource sizes
• Want to explore dynamic power optimizations for run-time power management
– Dynamic Voltage/Frequency Scaling [Burd, 2000]
– Dynamic Hardware Resizing [Albonesi, 1999]
– Fetch Throttling [Sanchez, 1997]
– Global Clock Gating [Gunther, 2001]
– Speculation Control [Manne, 1998]
– Dynamic Thermal Management [Brooks, 2001][Huang, 2000]
Important to optimize P & T early
[Figure: relative power (0 to 5) versus relative 1/performance (0.7 to 1.6) for design points at 12FO4, 14FO4, 18FO4, and 23FO4, with a maximum power budget line and three tradeoff curves: via changing Vdd and HI, via changing frequency, and via changing pipeline depth]
Dynamic Thermal Management
• Goal:
– Provide dynamic techniques to cool chip
when needed
– Exploit natural variations due to different
applications, phase behavior, …
– Allow designers to target average, rather
than worst-case behavior
• Design Decisions:
– Mechanism & policy for triggering
response?
– What should response be?
– How to select DTM trigger levels?
Power consumption impacts cost
[Figure: system diagram with the CPU labeled. From: Gunther, et al., “Managing the Impact of Increasing Microprocessor Power Consumption,” Intel Technology Journal, Q1, 2001]
• System costs associated with power dissipation:
– Thermal control cost
• Heatsinks, fans
– Power delivery
• Power supply
• Decoupling caps…
Average and Worst Case Power
[Figure: bar chart (0 to 100) of Max and Avg power for the Alpha 21264 and the Intel PPro]
• System costs are constrained by worst case power dissipation
• Average case power dissipation can often be much lower
– Aggressive Clock Gating
– Application variations
– Underutilized resources
• Not enough ILP
• Floating point units during integer code execution
• Currently about a 30% difference
• Likely to further diverge…
Dynamic Thermal Management
[Figure: temperature versus time, marking the cooling capacity the system is designed for without DTM, the cooling capacity it is designed for with DTM, the system cost saving between the two, and the DTM trigger level that separates “DTM Disabled” from “DTM/Response Engaged”]
DTM: Definitions
[Figure: timeline of events (Trigger Reached, Turn Response On, Check Temp, Check Temp, Turn Response Off) separated by the Initiation, Response, Policy, and Shutoff delays]
• Initiation Delay – OS interrupt/handler
• Response Delay – Invocation time (e.g. adjust clock)
• Policy Delay – Number of cycles engaged
• Shutoff Delay – Disabling time (e.g. re-adjust clock)
DTM: When, How, and What
• Trigger Mechanism: When do we enable DTM techniques?
• Initiation Mechanism: How do we enable technique?
• Response Mechanism: What technique do we enable?
DTM: Trigger Mechanisms
• Mechanism: How to deduce temperature?
• Direct approach: Temperature sensors providing feedback
– Implemented in some PowerPC chips (G3, G4) [Sanchez, 1997]
– Sensor quantity, placement, and precision will be discussed later
• Other indirect approaches possible
• Policy: When to begin responding?
– Trigger level set too high: Packaging cost will be high, little advantage
– Trigger level set too low: Frequent triggering causes performance to suffer
– Choose trigger level to exploit difference between average and worst-case power.
DTM: Initiation Mechanisms
• Operating system or
microarchitectural control?
– Hardware support can significantly reduce
performance penalty
• Policy Delay Settings
– For Volt/Freq scaling, much of the
performance penalty can be attributed to
enabling/disabling
– Increasing policy delay reduces overhead;
smarter initiation techniques would help
as well
DTM: Response Mechanisms
• Scaling Techniques
– Clock Frequency Scaling [Intel Pentium 4]
– Voltage and Frequency Scaling
– Temperature-tracking frequency scaling [Skadron03]
• Adjusts frequency to account for the temperature dependence of switching speed
• Microarchitectural Techniques
– Speculation Control [Manne98]
– Low-Power Cache Techniques [Huang00]
• Hierarchical Responses
– Decode Throttling [Sanchez97]
– Fetch Toggling [Brooks01]
– Feedback controlled Fetch Gating [Skadron02]
– Migrating Computation [Skadron03]
– Dual Pipelines [Lim02]
Dynamic Voltage/Frequency Scale
• Voltage Scheduler predicts workload requirements
• Set frequency/voltage to near-optimal for energy savings
• Burd, et al., ISSCC2000
– 5MHz @ 1.2V: 6 MIPS, 2.8mW
– 80MHz @ 3.8V: 85 MIPS, 460mW
– 70us to switch 1.2V <-> 3.8V
• Transmeta Crusoe
– Commercial implementation (500-700MHz, 1.2-1.6V)
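The two operating points quoted from Burd et al. can be turned into energy per instruction with one line of arithmetic, since mW divided by MIPS is nJ per instruction. The sketch below uses only the numbers on the slide; the function name is an illustrative choice.

```python
def energy_per_instruction_nj(power_mw, mips):
    """mW / MIPS = (1e-3 J/s) / (1e6 instr/s) = 1e-9 J = 1 nJ per instruction."""
    return power_mw / mips

low = energy_per_instruction_nj(2.8, 6)     # 5 MHz @ 1.2 V
high = energy_per_instruction_nj(460, 85)   # 80 MHz @ 3.8 V

energy_ratio = high / low
voltage_sq_ratio = (3.8 / 1.2) ** 2

print(f"low: {low:.2f} nJ, high: {high:.2f} nJ")
print(f"energy ratio {energy_ratio:.1f} vs V^2 ratio {voltage_sq_ratio:.1f}")
```

The energy ratio comes out close to the ratio of the squared voltages, which is what the $V^{2}$ term in the capacitive power relation would suggest.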
Temperature-Tracking Frequency
Temperature affects:
• Transistor threshold and mobility
• Ion, Ioff, Igate, delay
• ITRS: 85°C for high-performance, 110°C for embedded!
• So adjust frequency as f(T) -- TTDFS
[Figure: Ion and Ioff curves for an NMOS device]
Speculation Control
• Manne et al. (ISCA ’98)
– Branch confidence estimator used to determine
whether to speculate
– Pipeline gating based on confidence estimation
– 38% reduction in wrong-path instructions with
~1% performance loss
• But Parikh et al. (HPCA ’02) found much
smaller savings; ED product is zero or
negative
– Significant energy savings only come with
significant loss of performance
– This is because many instructions are squashed
early in the pipeline, so reduction in wrong-path
instructions is not a useful metric
– Benefit is actually a function of prediction
accuracy
• Only for very badly predicted programs do you get
benefit
• Well-predicted programs suffer
Dynamic Hardware Resizing
• Complexity Adaptive Processors
• Based on application characteristics
– Underutilized structures may be reduced
with minimal performance impact
– Resize Caches, Issue Queues, etc.
– Resize => Reduce Capacitance =>
Reduce Energy
– Of course, this only helps manage heat if
it reduces heat dissipation within hot
spots
• And does so for a sufficiently long duration
DEETM
• Dynamic Energy Efficiency and
Temperature Management
• Slack algorithm detects if slowdown
can be tolerated
– If so, invoke techniques to reduce energy
• Temperature algorithm
– If temperature limit is reached, invokes
techniques
• Techniques considered
– Filter Cache, Voltage Scaling, etc.
Control-theoretic DTM
• Fetch toggling
– disable fetch every N cycles
– 4/5, 2/3, 1/2, 1/3, 1/5, …
[Figure: two five-stage pipelines (IF, ID, EX, MEM, WB) illustrating toggled fetch]
– How to set the fetch rate?
• (Assume idealized temperature sensing)
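One way to realize the listed toggling rates is to spread the enabled cycles evenly over time. The sketch below assumes each fraction is the share of cycles in which fetch is allowed; the slide does not say whether the fractions denote enabled or disabled cycles, and the function name is hypothetical.

```python
from fractions import Fraction

def fetch_enabled(cycle, duty):
    """True if fetch is allowed in `cycle` for a fetch duty cycle `duty` (0..1).

    Fetch is enabled whenever the running count floor(cycle * duty) advances,
    which spreads the enabled cycles evenly.
    """
    duty = Fraction(duty)
    before = cycle * duty.numerator // duty.denominator
    after = (cycle + 1) * duty.numerator // duty.denominator
    return after > before

pattern = [fetch_enabled(c, Fraction(4, 5)) for c in range(5)]
enabled_4_5 = sum(fetch_enabled(c, Fraction(4, 5)) for c in range(50))
enabled_1_3 = sum(fetch_enabled(c, Fraction(1, 3)) for c in range(30))
print(pattern, enabled_4_5, enabled_1_3)
```

Exact rational arithmetic keeps the long-run rate equal to the requested fraction, with no drift from floating-point rounding.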
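Feedback-Control of Fetch Toggling
• Formal feedback control
[Figure: feedback loop. The error e between the setpoint and the measured T drives the controller; its output m drives the actuator (I-fetch toggling), which sets the power P; the thermal dynamics turn P into temperature T, which the temp. sensor feeds back.]
PID: $m = K_C \left( e + K_I \int e\,dt + K_d \frac{de}{dt} \right)$
• easy to compute
• toggling = f(m)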
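A discrete-time sketch of the PID law follows, with the integral approximated by a running sum and the derivative by a backward difference. The gains, the sample period, the sign convention (error positive when the chip is too hot), and the mapping from m to a fetch duty cycle are all illustrative assumptions, not the tuned controller from the talk.

```python
class PID:
    """m = Kc * (e + Ki * integral(e) + Kd * de/dt), discretized with step dt."""

    def __init__(self, kc, ki, kd, dt):
        self.kc, self.ki, self.kd, self.dt = kc, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, error):
        self.integral += error * self.dt                  # rectangle-rule integral
        derivative = (error - self.prev_error) / self.dt  # backward difference
        self.prev_error = error
        return self.kc * (error + self.ki * self.integral + self.kd * derivative)

def fetch_duty(m):
    """Hypothetical toggling = f(m): a larger m means fetch is enabled less often."""
    return min(1.0, max(0.0, 1.0 - m))

pid = PID(kc=0.2, ki=0.5, kd=0.1, dt=1.0)
for measured, setpoint in [(82.0, 81.8), (82.3, 81.8), (81.9, 81.8)]:
    m = pid.update(measured - setpoint)
    print(f"T = {measured:.1f} C -> fetch duty {fetch_duty(m):.2f}")
```

Each update costs a handful of multiplies and adds, consistent with the “easy to compute” bullet.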
Formal Feedback Control
• Regulatory control problem: hold value to a specified setpoint
– Example: temperature
– Proved that PID controller will not allow temperature to exceed setpoint by more than 0.02°
• Max power dissipation, thermal dynamics, sampling rate → max overshoot
• This precision is excessive but illustrates the value of formal feedback control theory
Performance Loss
• Performance loss reduced by 65%
[Figure: bar chart of percent loss in performance (0% to 30%) under toggle1 and PID for gcc, mesa, art, equake, crafty, facerec, fma3d, parser, eon, perlbmk, vortex, bzip, and the MEAN]
Migrating Computation
• When one unit overheats, migrate its functionality to a distant, spare unit (MC)
– Spare register file (Skadron et al. 2003)
– Separate core (CMP) (Heo et al. 2003)
– Microarchitectural clusters
– etc.
• Raises many interesting issues
– Cost-benefit tradeoff for that area
– Use both resources (scheduling)
– Extra power for long-distance communication
– Floorplanning
Migrating Computation – Reg File
Thermal Scheduling (Cai 2002)
[Figure: dual-pipeline design with stages FE, DE, OOP, EX, RF, and IOP. The primary pipeline serves the majority of mobile apps with performance requirements; the secondary pipeline serves text email, caller-id, reminder, and other non-high-performance apps with anywhere-anytime requirements.]
• Primary pipeline: maximal performance, complex pipeline structure
• Secondary pipeline: minimum power and energy consumption, very simple in-order structure, targets mobile anywhere-anytime applications
• Transparent to OS and applications
• Maximal use of on-die clock/power gating for energy saving
Scheduling Algorithm (Cai 2002)
[Figure: temperature (C) versus time (s), marking the levels Tmax, TH, TL, and Ta, the points TS1 and TS2, and the intervals tcycle, tcool, and theat; alongside it, a state diagram over S1 to S4 whose transitions compare the temperatures T1 and T2 against the thresholds TL and TH]
S1: Normal Operation (Primary Pipeline)
S2: Stall Fetch & Clear Pipeline
S3: Alternate Operation (Secondary Pipeline)
S4: Disable Clock or Scale F-V
Hybrid DTM
• DVS is attractive because of its cubic advantage
– $P \propto V^{2} f$
– This factor dominates when DTM must be aggressive
– But changing DVS setting can be costly
• Resynchronize PLL
• Sensitive to sensor noise → spurious changes
• “ILP techniques” are attractive because they can use instruction level parallelism to hide/reduce impact of DTM
– Only effective when DTM is mild
• So use both!
– Need to find “crossover point”
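The cubic advantage follows from $P \propto V^{2} f$ once frequency is assumed to scale roughly linearly with voltage, which is the usual simplification and is taken as an assumption here. The 15% scaling step below is an arbitrary example.

```python
def relative_power(v_scale, f_scale):
    """Power relative to nominal under P ~ V^2 * f."""
    return v_scale ** 2 * f_scale

# Slow down by 15% using frequency alone versus voltage and frequency together.
freq_only = relative_power(1.0, 0.85)
dvs = relative_power(0.85, 0.85)

print(f"frequency only: {freq_only:.3f}, DVS: {dvs:.3f}")
```

For the same performance loss, scaling voltage along with frequency removes much more power, which is why DVS dominates when the thermal response has to be aggressive.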
Hybrid DTM, cont.
• Combine fetch gating with DVS
– When DVS is better, use it
– Otherwise use fetch gating
– Determined by magnitude of temperature overshoot
– Crossover at FG duty cycle of 3
– FG has low overhead: helps reduce cost of sensor noise
[Figure: two plots of slowdown (1.0 to 1.4) versus duty cycle for FG, Hyb, and DVS]
Hybrid DTM, cont.
• DVS doesn’t need more than two settings for thermal control
– Lower voltage cools chip faster
• FG by itself does need multiple duty cycles and hence requires PI control
• But in a hybrid configuration, FG does not require PI control
– FG is only used at mild DTM settings
– Can pick one fixed duty cycle
• This is beneficial because feedback control is vulnerable to noise
Simulation Details
• 85°C maximum temperature
– Guard band requires a trigger threshold of 81.8°C
• Ambient temperature (inside computer case): 45°C
• Rpackage = 0.8 K/W (old package model)
– 0.7 K/W necessary if DTM not available
• Die thickness: 0.5mm
• Currently neglecting interface material
• 9 SPEC2000 benchmarks, both integer and FP
– 4 hover near 81.8°C, rest are above
• SimpleScalar/Wattch, modified to model pipeline and power of an Alpha 21364 as closely as possible
• Scaled to 130nm, 1.3V, 3.0 GHz
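To put the package numbers in perspective, a steady-state lumped model $T = T_{ambient} + R_{package} \cdot P$ gives the sustained power each configuration tolerates. Treating the whole chip as a single thermal resistance is a simplifying assumption made here for the arithmetic; it ignores the localized hot spots the rest of the talk is concerned with.

```python
def max_sustained_power_w(t_limit_c, t_ambient_c, r_package_k_per_w):
    """Steady-state power at which the temperature reaches t_limit_c."""
    return (t_limit_c - t_ambient_c) / r_package_k_per_w

with_dtm_package = max_sustained_power_w(85.0, 45.0, 0.8)     # 0.8 K/W package
without_dtm_package = max_sustained_power_w(85.0, 45.0, 0.7)  # 0.7 K/W package
at_trigger = max_sustained_power_w(81.8, 45.0, 0.8)           # DTM trigger level

print(f"{with_dtm_package:.1f} W, {without_dtm_package:.1f} W, {at_trigger:.1f} W")
```

Under this model the better package buys roughly 7 W of extra sustained power, which is the headroom DTM is being asked to cover at run time instead.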
Performance Comparison
• TT-DFS is best but can’t prevent excess temperature
– Suitable for use with aggressive clock rates at low temp.
• Hybrid technique reduces DTM cost by 25% vs. DVS (DVS overhead important)
• A substantial portion of MC’s benefit comes from the altered floorplan, which separates hot units
[Figure: bar chart of slowdown factor (1.00 to 1.40) for TTDFS, DVS, FG, Hyb, and MC; bar labels 1.359, 1.270, 1.231, 1.112, and 1.045]
Conclusions so far
• DTM can be used to reduce cooling costs
• Proper modeling is required
– HotSpot is publicly available at http://lava.cs.virginia.edu/HotSpot
• ILP matters
• Hybrid techniques beneficial
– Merge advantages of different schemes
– Simplify control
• Architectural techniques important in thermal design
• Growing use of clusters and redundant units opens an incredibly rich design space
DTM: Summary and Key Issues
• Dynamic optimizations translate max-power problem to average-power problem
• Heightens importance of average-power techniques like clock gating
• Key Issues:
– Initiation interval
– Collection of possible response mechanisms