© 2004, Kevin Skadron and Jose Gonzalez
Power-Aware Design for
High-Performance Processors
A Tutorial at HPCA-2004
Kevin Skadron
Jose Gonzalez
University of Virginia
Intel Labs Barcelona
Roadmap
Introduction & Trends
Dynamic Power Dissipation
• Sources, modeling, reduction techniques
Static Power Dissipation
• Sources, modeling, reduction techniques
Summary
Introduction
Power: Work done per unit time (watts)
Energy: Total Work (joules)
Why is power a concern in current processors?
Increased market demand for consumer electronics powered by
batteries; battery life is a selling point
Electricity, cooling costs for large data centers are becoming
substantial
• 5-25% of data center income (cf. Rajamony & Bianchini tutorial, ICS’02)
Government energy-efficiency requirements
• (eg Energy Star in the US)
Electricity costs for large ISPs are becoming substantial
Packaging and cooling costs (due to the increase in the power
density) are becoming prohibitive
Power dissipation may reach technology limits
Current delivery is becoming expensive
Metrics
Some different power metrics & fallacies:
Reducing power does not always save energy
Energy = $\int P \, dt$
• If you reduce power but increase execution time, energy
may go up
Also note that reducing power does not always
reduce temperature
Sustained power density limits thermal
design/packaging
– approx. same as thermal design power
– note that on-chip temperatures and total heat production are
somewhat different concerns
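The first fallacy can be checked with simple arithmetic. The sketch below assumes constant power over each run, so the integral reduces to P × t; the wattages and run times are hypothetical values chosen only for illustration.

```python
def energy_joules(power_watts: float, time_seconds: float) -> float:
    """Energy of a constant-power run: E = P * t (the integral of P dt)."""
    return power_watts * time_seconds


# Hypothetical baseline: 20 W for 10 s.
baseline = energy_joules(20.0, 10.0)
# Hypothetical low-power mode: 25% less power, but the run takes 50% longer.
low_power = energy_joules(15.0, 15.0)

print(f"baseline:  {baseline:.0f} J")
print(f"low power: {low_power:.0f} J")
```

In this made-up case the low-power mode draws less power yet consumes more energy, because execution time grew faster than power fell.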
Metrics
Power
• Average power
• Power density map
Energy
• Energy (MIPS/W)
• Energy-Delay product (MIPS$^2$/W)
• Energy-Delay$^2$ product (MIPS$^3$/W) – voltage independent!
(Zyuban, GVLSI’02)
Temperature
• Average temperature
• Peak temperature
• Temperature map
– Does not necessarily match power density map
No good figures of merit for trading off thermal efficiency against
performance, area, or energy efficiency
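Why the Energy-Delay$^2$ product is described as voltage independent can be seen with a first-order sketch. It assumes energy per operation scales as C·V² and that frequency scales linearly with V (threshold voltage ignored); both are simplifying assumptions, and `c_eff` and `k_freq` are hypothetical constants.

```python
import math


def energy_metrics(vdd: float, c_eff: float = 1.0, k_freq: float = 1.0) -> dict:
    """First-order metrics for one operation at supply voltage vdd."""
    energy = c_eff * vdd ** 2          # E ~ C * V^2
    delay = 1.0 / (k_freq * vdd)       # assumes f ~ V, so delay ~ 1 / V
    return {"E": energy, "ED": energy * delay, "ED2": energy * delay ** 2}


high = energy_metrics(1.0)
low = energy_metrics(0.8)
print(high)
print(low)
```

Under these assumptions E and E·D both change with voltage, while E·D² stays constant, so it compares designs without rewarding a design merely for running at a lower voltage.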
Power Dissipation
Dynamic power dissipation
Due to switching activity
Static power dissipation
Due to leakage current – major paths are:
• Subthreshold leakage
Exponentially dependent on Vdd, Vth, Temp
• Gate leakage
Exponentially dependent on Vdd, Tox
Power Dissipation
Total power actually consists of
Switching power
Short-circuit power
Leakage power
Big Picture - Trends
Data on current power dissipation for various
chips
Distribution of power within a typical processor
Scaling trends in power dissipation
Trends in leakage power
Trends in battery life
Power Dissipation
Processor | Clock rate
Alpha 21364 | 1.15 GHz
AMD Opteron | 2.2 GHz
HP PA8700 | 870 MHz
IBM Power 4 | 1.7 GHz
Intel Itanium 2 | 1.5 GHz
Intel Xeon | 3.2 GHz
MIPS R14000 | 600 MHz
Power (Max): 110W (Alpha 21364); 86 W, 75W, 130W, 86W, 16W, 100W
Source: Microprocessor Report
Power Dissipation Breakdown
Alpha 21264
Global clock network
Instruction issue units
Caches
FP execution units
Int. execution units
Mem. management unit
I/O
Miscellaneous
Source: Gowan et al. “Power Considerations in the design of the alpha 21264 microprocessor”, DAC 1998
Effects of Technology Scaling on
Power Dissipation
Feature size is scaling down: 30%
Frequency is increasing: ~2x
Vdd is not scaled down at the same rate as feature size: 0-10% (Ideal scaling: 30%)
Active capacitance increases: at least 30% (Ideal scaling: decreases by 30%)
Area increases due to microarchitecture improvements: 25% (Ideal scaling: decreases by 50%)
Ideal scaling: $P \propto C V^2 f$ → $0.7^2$ reduction ≈ 0.5
Observed scaling → 2 – 2.5x increase
Power density becomes a problem!
Especially since the power density is non-uniform
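The ideal and observed scaling figures can be reproduced from $P \propto C V^2 f$. The sketch below reads the slide's numbers as assumptions: under ideal scaling, capacitance and voltage each shrink by 30% and frequency rises by 1/0.7; in the observed case, capacitance grows by 30%, Vdd drops by only 0-10%, and frequency doubles.

```python
import math


def power_ratio(cap: float, vdd: float, freq: float) -> float:
    """Relative power per generation from P ~ C * V^2 * f (all inputs are ratios)."""
    return cap * vdd ** 2 * freq


ideal = power_ratio(cap=0.7, vdd=0.7, freq=1.0 / 0.7)
observed_low = power_ratio(cap=1.3, vdd=0.9, freq=2.0)   # Vdd reduced by 10%
observed_high = power_ratio(cap=1.3, vdd=1.0, freq=2.0)  # Vdd not reduced

print(f"ideal:    {ideal:.2f}x")
print(f"observed: {observed_low:.2f}x to {observed_high:.2f}x")
```

With these inputs ideal scaling roughly halves power per generation, while the observed trend more than doubles it.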
Power Evolution
[Figure: max power (Watts, log scale from 1 to 100) vs. process generation (1.5μm, 1μm, 0.8μm, 0.6μm, 0.35μm, 0.25μm, 0.18μm, 0.13μm) for i386, i486, Pentium®, Pentium® w/MMX tech., Pentium® Pro, Pentium® II, Pentium® III, and Pentium® 4]
Source: Intel
Trends in Power Density
[Figure: power density (Watts/cm$^2$, log scale from 1 to 1000) vs. process generation (1.5μm down to 0.07μm) for i386, i486, Pentium®, Pentium® Pro, Pentium® II, Pentium® III, and Pentium® 4, with hot plate, nuclear reactor, and rocket nozzle shown for reference]
* “New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies” –
Fred Pollack, Intel Corp. Micro32 conference key note - 1999.
ITRS Projections
Year | 2003 | 2006 | 2010 | 2013 | 2016
Tech node (nm) | 100 | 70 | 45 | 32 | 22
Vdd (high perf) (V) | 1.0 | 0.9 | 0.6 | 0.5 | 0.4
Vdd (low power) (V) | 1.1 | 1.0 | 0.8 | 0.7 | 0.6
Frequency (high perf) (GHz) | 3.1 | 5.6 | 11.5 | 19.3 | 28.8
Max power (W)
High-perf w/ heatsink | 160 | 180 | 218 | 251 | 288
Cost-performance | 85 | 98 | 120 | 138 | 158
Hand-held | 3.2 | 3.5 | 3.0 | 3.0 | 3.0
ITRS 2001
These are targets
Based on historical trends, the high-performance power targets
seem optimistic
Intel papers suggest that in the 45-75W range, cooling costs $1/W;
but then rate of increase goes up: $2, $3/W, maybe more!
(Borkar, IEEE Micro ’99, Gunther et al, ITJ ’01)
Leakage Power
The fraction of leakage power is increasing
exponentially with each generation
Also exponentially dependent on temperature
[Figure: static power / dynamic power ratio (percentage, 0 to 70) vs. temperature (298 K to 373 K), with curves for the 180nm, 130nm, 100nm, 90nm, 80nm, and 70nm nodes; the ratio increases across generations]
Source: Skadron et al, University of Virginia
Trends in Battery Technology
Battery lifetime is increasing perhaps 8-10%/yr.
(Powers, Proc. of IEEE 1995)
Not keeping up with rate of growth in energy
consumption
Source: Rabaey 1995, cited in Irwin et al, “Low Power Design Methodologies, Hardware and Software Issues”,
tutorial at PACT 2000
Roadmap
Introduction & Trends
Dynamic Power Dissipation
• Sources, modeling, reduction techniques
Static Power Dissipation
• Sources, modeling, reduction techniques
Summary
Dynamic Power Dissipation
Roadmap
Sources of dynamic power dissipation
Modeling dynamic power
Circuit- and architecture-domain techniques to reduce
power
Dynamic Power Consumption
Power dissipated due to switching activity
A capacitance is charged and discharged
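[Figure: load capacitance $C_L$ charged from Vdd and discharged to ground]
0→1 (charge): $E_c = \frac{1}{2} C_L V^2$
1→0 (discharge): $E_d = \frac{1}{2} C_L V^2$
Charge/discharge at the frequency f
$P = C_L V^2 f$
Note that energy consumed from battery is $C_L V^2$ and is
drawn upon charging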
Dynamic Power Dissipation
Equation
$P = a \, C_L V_{dd}^2 f$
a: Activity factor
Depends on the processor architecture
CL: Capacitance of the circuit
Depends on the design style, number of transistors,
transistor sizing, etc
Vdd: Operating voltage
f: Frequency
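A small numeric sketch of the dynamic power equation follows. The activity factor, switched capacitance, voltage, and frequency are hypothetical values chosen only to show the units and the quadratic effect of voltage.

```python
import math


def dynamic_power(activity: float, c_load_farads: float,
                  vdd_volts: float, freq_hz: float) -> float:
    """Dynamic power in watts: P = a * C_L * Vdd^2 * f."""
    return activity * c_load_farads * vdd_volts ** 2 * freq_hz


# Hypothetical chip: a = 0.5, 20 nF of switched capacitance, 1.2 V, 2 GHz.
p_nominal = dynamic_power(0.5, 20e-9, 1.2, 2e9)
# Same chip with the supply voltage halved and everything else unchanged.
p_half_v = dynamic_power(0.5, 20e-9, 0.6, 2e9)

print(f"nominal:      {p_nominal:.1f} W")
print(f"half voltage: {p_half_v:.1f} W")
```

Halving the voltage alone cuts dynamic power to one quarter, which is why voltage is the most effective lever in the equation.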
Dynamic Power Modelling
$P = a \, C_L V^2 f$
Information needed
• Activity counters in each unit
• Energy dissipated per access
[Figure: Configuration feeds the Performance Model, which outputs Performance metrics and Activity; Activity feeds the Power Model, which outputs Power metrics]
For precision, “a” (# of signal transitions) should be measured or at
least estimated with a probabilistic model
More commonly, a = 0.5 is assumed
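The counter-based modeling flow can be sketched as follows: a performance model supplies per-unit access counts, and the power model multiplies each count by an energy-per-access figure. The unit names, counts, and energies below are hypothetical, and signal-transition detail is ignored, as in the simpler models.

```python
import math


def average_power(access_counts: dict, energy_per_access_nj: dict,
                  cycles: int, freq_hz: float) -> float:
    """Average power in watts from activity counters and per-access energies."""
    total_energy_j = sum(
        access_counts[unit] * energy_per_access_nj[unit] for unit in access_counts
    ) * 1e-9                          # nJ -> J
    elapsed_s = cycles / freq_hz      # simulated time
    return total_energy_j / elapsed_s


# Hypothetical counters from a 1M-cycle simulation at 1 GHz.
counts = {"icache": 1_000_000, "dcache": 400_000, "regfile": 2_000_000, "alu": 800_000}
energy_nj = {"icache": 0.5, "dcache": 0.6, "regfile": 0.2, "alu": 0.3}

power_w = average_power(counts, energy_nj, cycles=1_000_000, freq_hz=1e9)
print(f"average power: {power_w:.2f} W")
```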
Dynamic Power Modelling
Activity counters
• Performance model is used
• Counters for: cache access, FU usage, Register File, ...
Energy per access
• Analytically: calculating capacitances as function of size, ports, etc
Example: Cache access: decoder, precharge transistors, bitline, cell
access, wordline, sense amplifiers ...
– Wattch (Brooks et al, ISCA 2000)
– Cacti
• Empirically: using low level designs and applying “virus” tests
– Virus test: microbenchmark that stresses a particular unit
– ALPS (Gunther et al, ITJ, 2001)
• Circuit-extracted model
– PowerTimer – IBM Power4 (Brooks et al, PACS’00)
– AccuPower – Parameterized, based on SPICE measurements of actual
layouts (SUNY Binghamton, Ponomarev et al, DATE’02)
– PowerAnalyzer – StrongARM (Michigan, assoc. w/ SimpleScalar)
Many of these ignore the actual number of signal transitions
Circuit-Level Techniques
Transistor sizing
Signal and clock gating
Circuit restructuring
Low power caches
Low power register files
Issue queue
These typically reduce the capacitance being
switched
Transistor Sizing
Transistor sizing plays an important role in reducing power
[Figure: chain of stages driving capacitances $C_0, C_1, \ldots, C_{N-1}, C_N$]
$K = C_i / C_{i-1}$
Delay ~ $a \, (K / \ln K)$
Power ~ $K / (K - 1)$
Optimum K for both power and delay must be pursued
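The tension between the two expressions can be explored numerically. The sketch below evaluates the relative delay K / ln K and the relative power K / (K − 1) over a range of tapering factors; the constant factor `a` is dropped, and the candidate range is an arbitrary choice for illustration.

```python
import math


def relative_delay(k: float) -> float:
    """Delay of a tapered chain, up to a constant factor: K / ln K."""
    return k / math.log(k)


def relative_power(k: float) -> float:
    """Power of a tapered chain, up to a constant factor: K / (K - 1)."""
    return k / (k - 1.0)


candidates = [1.5 + 0.1 * i for i in range(66)]   # K from 1.5 to 8.0
best_k = min(candidates, key=relative_delay)

print(f"delay-optimal K ~ {best_k:.1f} (analytically K = e)")
for k in (best_k, 4.0, 6.0):
    print(f"K={k:.1f}: delay {relative_delay(k):.3f}, power {relative_power(k):.3f}")
```

Delay is minimized near K = e, but the delay curve is shallow there while power keeps falling as K grows, so a somewhat larger K gives up little speed for a noticeable power saving.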
Signal Gating
“techniques to mask unwanted switching activities from propagating
forward, causing unnecessary power dissipation”
Implementation
• Simple gate
• Tristate buffer
• ...
Control signal needed
• Generation requires additional logic
Identification of signals to be gated
• Clock
• Address bus
[Figure: gate with inputs signal and ctrl driving Output]
Also helps to prevent power dissipation due to glitches
Clock Gating
“Disabling a functional block when it is not required for an extended
period”
Implementation
• Simple gate that replaces one buffer in the clock tree
• Delay is generally not a concern
Decision
• Architectural level
[Figure: clock signal gated by ctrl before reaching the functional unit]
Circuit Restructuring
Pipeline (can reduce frequency)
Parallelize (can reduce frequency)
Reorder inputs so that most active input is
closest to output (reduces switched capacitance)
Restructure gates (equivalent functions are not
equivalent in switched capacitance)
Energy-efficient flip-flops and latches
Cache Design
[Figure: cache array with R rows and C cols, showing row dec, wordline, bitlines, column dec, and sense amp]
[Figure: read and write bars (scale 0 to 80) by component: decoder, wordlines, TBLSA, DBLSA, I/O buses, other]
TBLSA: Tag bitlines & sense amp.
DBLSA: Data bitlines and sense amp.
Cache parameters: 16 KB cache, 0.25 μm
$C_{access} = R \, C \, C_{cell}$
Reducing power
• Switched capacitance
• Voltage swing
• Activity factor
• Frequency
Villa et al, MICRO 2000
Cache Design
Banked organization
• Targets switched capacitance
• $C_{access} = R \, C \, C_{cell} / B$
Dividing word line
• Same effect for wordlines
Reducing voltage swings
• Sense amplifiers used to detect Vdiff across bitlines
• Read operation can be curtailed as soon as Vdiff is detected
• Limiting voltage swing saves a fraction of power
Pulse word lines
• Enabling the word line for the time needed to discharge bitcell voltage
• Designer needs to estimate access time and implement a pulse generator
Low Power Register File Design
RFs usually use single-ended bitlines
Modified storage cell
• Many zeros are fetched from the RF
• Bitline connections are modified to eliminate bitline discharge
when reading a zero
Tseng and Asanovic, ICSD, 2000
Zyuban and Kogge, ISLPED 1998
Efficient Issue Queue
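Issue queues constitute a high fraction of the overall power
>25% according to some authors
[Figure: issue queue wakeup logic: result tags Tag 1 ... Tag w are compared (comp) against each source operand tag (Oprnd), and the comparator outputs are ORed to set the operand's ready (RDY) bit]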
Efficient Issue Queue
Useful comparison
Empty entries and ready entries consume energy
• Wakeup of empty entries can be disabled
Gating off precharge logic using valid bit
• Wakeup of ready sources can be disabled
Gating off precharge logic using ready bit
Folegnani and Gonzalez, ISCA 2001
Energy-efficient Comparators
Traditional comparators dissipate energy on a mismatch in any
bit position.
10%-20% of source operands match each cycle
Solution: comparators that dissipate energy on a match
Kucuk et al, ISLPED 2001
Architectural-Level Techniques
Encoding/compression
Energy-efficient front end
Energy-efficient caches
Asymmetric processors
Dynamic Voltage/Frequency scaling
Multi clock domain architectures (similar to GALS)
Pipeline gating
Compiler techniques
Sleep modes
These typically take advantage of locality or slack
Bus Invert Encoding
Reduce power of parallel synchronous signals
Idea: Minimize the number of transitions
• (Stan & Burleson, IEEE Trans. on VLSI, 1995)
Sender examines the current and the next values
Decides whether to send the true or the complement signal
Additional polarity signal is sent along with data
Example
Current data: 110011101
Next data: 000100110
Number of transitions: 7
Current data: 110011101
NOT (Next data): 111011001
Number of transitions: 2
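The encoding decision can be sketched in a few lines. This is a minimal illustration of the idea rather than the published circuit: the function names are hypothetical, the inversion threshold of more than half the bus width is an assumption, and transitions on the extra polarity line are not counted.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def bus_invert(current: int, nxt: int, width: int) -> tuple:
    """Return (value driven on the bus, polarity bit, data-line transitions)."""
    mask = (1 << width) - 1
    direct = hamming(current, nxt & mask)
    if direct > width // 2:
        # Sending the complement toggles fewer data lines.
        return (~nxt) & mask, 1, width - direct
    return nxt & mask, 0, direct


value, polarity, transitions = bus_invert(0b110011101, 0b000100110, width=9)
print(f"{value:09b} polarity={polarity} transitions={transitions}")
```

For the 9-bit example, sending the next value directly would toggle 7 lines, so the sender drives the complement and toggles only 2.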
Dynamic Zero Compression
Zero Indicator Bit (ZIB) added to each byte
• Enabled if a zero is stored in cache
• On a read access, bitline discharge is prevented by disabling
local wordline
• On a write, if the byte is zero, just ZIB is written.
Circuit Modifications
• Zero-detection and store bus drivers
• Wordline gating: 8-bit data is driven by the associated ZIB
• Sense Amps: modified to drive a zero if ZIB active
Drawbacks
• 9% area increase, 2-gate delay increase
Results
• 26% energy reduction data cache, 10% instruction cache
Villa et al, MICRO 2000
Exploiting Narrow Width Operands
High percentage of integer operations require <16 bits
Difficult for the compiler to know the actual operand size
Variability for the same instruction in successive instances
Clock Gating is used to partially disable the FU
[Figure: 64-bit integer FU with operands A and B each split into a low latch (bits 0-15) and a high latch (bits 16-63); zero-detect logic produces zero48, which is ANDed with clk to gate the high latches and selects zeros for the upper bits of the 64-bit result]
Brooks and Martonosi, HPCA 1999
Energy-Efficient Front End:
Branch Prediction
Parikh et al, HPCA’02, IEEE Trans. Computers ‘04
Branch prediction accuracy is a major determinant of
pipeline activity -> spending more power in the branch
predictor can be worthwhile if it improves accuracy
Branch predictors can be designed to reduce power, eg
• Banking
• Gate off unnecessary accesses (“prediction probe detector”)
Energy Efficient Front End:
Register Renaming
RAT often implemented as a multiported register file
indexed by logical register, returns physical register
Liu and Lu , MICRO’00
Kucuk et al, PATMOS’03
Hierarchical RAT- top level is a cache of the full table
Prevent lookup of sources that will be supplied by a freshly
renamed instruction in the same rename group
Filter cache
Could instead organize as an associative lookup in a
table organized by physical register with dissipate-on-match comparator (Ergin et al, ICCD’02)
Energy-Efficient Caches
Filter cache
• Small L0 cache filters many accesses to L1, allows an L1 with
fewer ports (Kin et al, MICRO-30)
Banks
Selective cache ways (Albonesi, MICRO-32)
• Ways in a set associative cache can be disabled if not needed
• Many variations of this approach
Staggering number of papers on this topic
• Exploit victim cache, load-store queue
• Clever cache organizations (eg combining banks w/ high assoc,
specialized caches, etc.)
• See recent proceedings of VLSI, architecture conferences,
esp. ISLPED
Asymmetric Processors
Processors have different “versions” of the same
resource, with different power/latency
Fast, power-hungry resources are allocated to critical
instructions
Slow, low-power resources are allocated to non-critical
instructions
Criticality predictor is needed!!!
Asymmetric Processors
Reducing power of functional units
2 sets of functional units
2 sets of instruction queues
Criticality predictor
Critical instructions
• In-order queue: critical path is usually a serial chain of
dependent instructions
• Fast functional units
Non-critical instructions
• OoO queue
• Slow functional units
Seng et al, MICRO 2001
Dual Speed Pipelines
[Figure: pipeline diagram with Fetch, Decode, criticality predictor, fast pipeline, slow pipeline, Reg File, and Commit]
Slow pipeline works at half the frequency
Criticality predictor key component to keep energy-efficiency
No communications penalties
Pyreddy and Tyson, WCED 2001
Dynamic Voltage/Frequency Scaling
Allow the device to dynamically adapt the voltage (and the
frequency)
Already implemented in many processors
Implementation
$P \sim V_{dd}^2$
$f \sim (V_{dd} - V_{th})^k / V_{dd}$
Tradeoff between power reductions and delay increase
MUST BE energy-efficient
Voltage regulator
Predict future processor utilization and adjust frequency/voltage to
maximize power reduction while keeping performance
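The power/delay tradeoff can be sketched with the alpha-power form $f \sim (V_{dd} - V_{th})^k / V_{dd}$, in which achievable frequency rises with supply voltage, together with an energy per operation that scales as $V_{dd}^2$. The threshold voltage, exponent `k`, and the two operating points are hypothetical values for illustration.

```python
import math


def max_frequency(vdd: float, vth: float = 0.3, k: float = 1.5) -> float:
    """Relative achievable frequency: f ~ (Vdd - Vth)^k / Vdd."""
    return (vdd - vth) ** k / vdd


def energy_per_op(vdd: float, c_eff: float = 1.0) -> float:
    """Relative dynamic energy per operation: E ~ C * Vdd^2."""
    return c_eff * vdd ** 2


v_high, v_low = 1.6, 1.2
freq_ratio = max_frequency(v_low) / max_frequency(v_high)
energy_ratio = energy_per_op(v_low) / energy_per_op(v_high)

print(f"frequency at {v_low} V relative to {v_high} V: {freq_ratio:.2f}")
print(f"energy/op at {v_low} V relative to {v_high} V: {energy_ratio:.2f}")
```

With these assumed parameters the lower operating point runs roughly a quarter slower but spends a little over half the energy per operation, which is the slack a DVS policy tries to exploit.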
Transmeta™ LongRun™
Crusoe processor can configure itself*
Management
Voltage changes in steps of 25 mV (depending on the voltage
regulator)
Frequency changes in steps of 33 MHz
From 1.6 V, 600 MHz to 1.2 V, 300 MHz (2001)
Implemented in the Code Morphing™ software layer
Idle time of the system is sampled to determine performance
demands
Thermal extension
May be a form of thermal throttling
Expands the thermal budget of the processor
* Source: http://www.transmeta.com
Transmeta™ LongRun™
Idle time
• Voltage drops to minimum
On-line activity
• Voltage rises to maximum
Real-Time activity
• Voltage adjusted to meet requirements
• DVD player
– 24 frames/second
Source: Transmeta
Intel SpeedStep®
Configuration*
From 0.844 V (600 MHz) to 1.48 V (1.7 GHz)
100 μs delay
Voltage-Frequency switching separation
[Figure: state diagram separating voltage transitions (Volt. Transition) from frequency transitions (Freq. Transition), starting from No Change]
* Source: http://www.intel.com
Intel SpeedStep®
Configuration
Clock partitioning
• Core clock
• Bus clock (sequencer and interrupt interface)
Event blocking
• Interrupts, pin events and snoop requests are not lost
Voltage Scheduling
Real-time problem will be discussed later
For non-real time workload, goal is to improve
energy efficiency
This is hard, because it is difficult to predict an
arbitrary workload’s future needs without
deadline information
Instead, try to schedule processes and voltages
to reduce idle time
eg, Weiser et al, OSDI-1
Sleep Modes
ACPI: Advanced Configuration and Power Interface
Developed by Microsoft, HP, Toshiba, Phoenix and Intel
Establishes interfaces for OS-directed power management
Replaces APM, MPS APIs and PnP BIOS
Defines
Hardware registers
BIOS interfaces
System and device power states
Source: ACPI overview, http://www.acpi.info
DVS “Critical Power Slope”
It may be more efficient not to use DVS, and to
run at the highest possible frequency, then go
into a sleep mode!
Depends on power dissipation in sleep mode
And power dissipation at lowest voltage
This has been formalized as the critical power
slope (Miyoshi et al, ICS’02):
$m_{critical} = (P_{f_{min}} - P_{idle}) / f_{min}$
If the actual slope $m = (P_f - P_{f_{min}}) / (f - f_{min}) < m_{critical}$
then it is more energy efficient to run at the highest
frequency, then go to sleep
Switching overheads must be taken into account
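The slope test can be checked against a direct energy calculation. The sketch below assumes a fixed amount of work with a deadline, a processor that sleeps at `p_idle` once the work is done, and no switching overheads; the power and frequency values are hypothetical.

```python
import math


def critical_slope(p_fmin: float, p_idle: float, f_min: float) -> float:
    """m_critical = (P_fmin - P_idle) / f_min."""
    return (p_fmin - p_idle) / f_min


def actual_slope(p_f: float, p_fmin: float, f: float, f_min: float) -> float:
    """m = (P_f - P_fmin) / (f - f_min)."""
    return (p_f - p_fmin) / (f - f_min)


def energy(cycles: float, f: float, p_active: float,
           p_idle: float, deadline_s: float) -> float:
    """Energy to run `cycles` at frequency f, then sleep until the deadline."""
    busy_s = cycles / f
    return p_active * busy_s + p_idle * (deadline_s - busy_s)


# Hypothetical platform: 5 W at 300 MHz, 8 W at 600 MHz, 1 W asleep.
f_min, f_max = 300e6, 600e6
p_fmin, p_fmax, p_idle = 5.0, 8.0, 1.0
cycles, deadline_s = 300e6, 1.0

race_to_sleep = actual_slope(p_fmax, p_fmin, f_max, f_min) < critical_slope(p_fmin, p_idle, f_min)
e_slow = energy(cycles, f_min, p_fmin, p_idle, deadline_s)
e_fast = energy(cycles, f_max, p_fmax, p_idle, deadline_s)
print(race_to_sleep, e_slow, e_fast)
```

Here the actual slope is below the critical slope, and the direct calculation agrees: running at the highest frequency and then sleeping uses 4.5 J against 5 J for the slow setting.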
Multi Clock Domain Architecture
Multiple clock domains inside the processor
Globally-asynchronous locally synchronous
(GALS) clock style
Independent voltage/frequency scaling
Synchronizers to ensure inter-domain
communication
Multi Clock Domain Architecture
Advantages
Local clock design is not aware of global skew
Each domain limited by its local critical path, allowing higher
frequencies
Different voltage regulators allow for a finer-grain energy control
Frequency/voltage of each domain can be tailored to its dynamic
requirements
Clock Power is reduced
Drawbacks
Complexity and penalty of synchronizers
Feasibility of multiple voltage regulators
Multi Clock Domain Architecture
Synchronization
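[Figure: timing diagram of CLK1 and CLK2 with numbered clock edges 1 to 4 and the interval T between the source write and the next CLK2 edge]
Src runs with CLK1, dst with CLK2
Src writes at T1
If T > Ts then dst can use the data at T2
If T < Ts then dst can use the data at T3
Semeraro et al, ISCA 2003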
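The synchronization rule can be written as a small function. This sketch assumes T is measured from the source write to the next destination clock edge and treats T equal to Ts as unsafe, which the slide does not specify; the function name and the edge times are hypothetical.

```python
def dst_capture_edge(t_write: float, dst_edges: list, t_s: float) -> float:
    """Return the first destination clock edge at which the data may be used."""
    later = [edge for edge in sorted(dst_edges) if edge > t_write]
    if not later:
        raise ValueError("no destination edge after the write")
    t = later[0] - t_write            # interval T to the next destination edge
    if t > t_s:
        return later[0]               # enough margin: use the data at T2
    if len(later) < 2:
        raise ValueError("no second destination edge available")
    return later[1]                   # too close: wait one more cycle, T3


# Source writes at time 0.0; destination edges arrive at 0.4 and 1.4 (arbitrary units).
print(dst_capture_edge(0.0, [0.4, 1.4], t_s=0.3))
print(dst_capture_edge(0.0, [0.4, 1.4], t_s=0.5))
```

The second call shows the synchronization penalty: when the edges fall too close together, the destination loses a full cycle of its own clock.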
Multi Clock Domain Architecture
Domains must be carefully chosen
Small cost on communications
Re-using existing structures
Example
5 domains
• Front-end
• Integer unit
• FP unit
• On-chip cache unit
• Main memory
Multi Clock Domain Architecture
[Figure: CPU clock domains and their components. Front-end: fetch, L1 i-cache, IFQ, branch predict, rename, dispatch. Integer: IIQ, int. register file, int. FUs. Floating Point: FIQ, fp. register file, fp. FUs. Memory: LSQ, L1 d-cache, L2 unified cache. Main Memory]
Magklis et al, ISCA 2003
Multi Clock Domain Architecture
Dynamic voltage/frequency scaling in each domain
Reconfiguration points must be chosen
Off-line “shaker” algorithm
• Aggressive oracle algorithm with good results
• Uses detailed dynamic execution trace to find frequencies
• It is not practical, requires future knowledge of this precise dynamic
run
On-line Attack-decay
• Interval-based hardware algorithm
• Transparent to the application, minimal overhead
• More conservative, achieves 75% efficiency of off-line
Profile-based
• Use profiling to associate frequencies with parts of the code
• When these points in the code are reached during a dynamic run
then change frequencies
Gating/Throttling
Gating: Disable some of the stages of the processor
To reduce useless activity: after a branch misprediction
Manne et al, ISCA 1998
Effectiveness is heavily dependent on accuracy of branch
confidence predictor
Parikh et al, HPCA’02
Throttling: Slow down some processor stage when it is
predicted that the performance will not be reduced
Branch misprediction
Long latency load miss
IPC reduction in general
Baniasadi and Moshovos, ISLPED 2001
Selective Throttling for Control Speculation
Control Speculation increases power dissipation (28%)
Energy wasted by mispredicted instructions
Selective throttling of fetch/decode
Gating of selection stage
Based on branch confidence
Instructions that likely belong to a mispredicted path
[Figure: speedup, power savings, energy savings, and E-D improvement (%, scale 0 to 30) for oracle fetch, oracle decode, and oracle select]
9% Energy-Delay improvement
Aragon et al, HPCA 2003
Co-Adaptive Instruction Fetch and Issue
Fetch gating based on issue queue utilization
Fetch is stopped if close parallelism is present
Rather than using instruction window usage
Just instructions from the head of the IQ are issued
To match the size of the window residing in the IQ to
application’s ILP
Fetch gating combined with dynamic issue queue
adaptation
20% energy-delay improvement
Buyuktosunoglu et al, ISCA 2003
Compiler Techniques for Low Power
Good reference: tutorial by Kremer, PLDI’03
Traditional compiler optimizations often improve
energy efficiency
eg, register allocation, CSE, tiling for cache hit rate
But some compiler optimizations waste energy
eg, aggressive speculation
Energy efficiency of code sequences is highly
dependent on microarchitecture
eg, free slot in a VLIW word
Compiler Techniques for Low Power, cont.
Compiler-guided DVS
v1: reduce voltage while meeting real-time deadlines
v2: reduce voltage in memory-bound program regions
• Hsu and Kremer, ISLPED’01, PLDI’03
• Xie et al, PLDI’03
Dynamic resource configuration/hibernation
Deactivate modules when they won’t be used for a long time (>>
sleep/wakeup time)
• Heath et al, PACT’02
Profile/compiler-guided adaptation
eg, profile-guided MCD adaptation mentioned earlier (Magklis et
al, ISCA’03)
eg, subroutine-guided (“positional”) adaptation (Huang et al,
ISCA’03)
• Uses a hierarchy of low-power modes
Much work in this area – this only touches the surface
Power Savings for Real Time Systems
Soft vs. hard real time
Periodic vs. aperiodic
Periodic tasks are especially important in control systems
Most work has focused on DVS scheduling
Examples
MPEG playback
Web server
DVS for Multimedia Apps
(soft real-time approach)
MM apps must process every frame within a time limit
If idle time, then there is some slack
IPC is constant across frames of the same type
Slow down the processor to meet deadlines
2 Phases
Profiling
• Determines the max. number of insts. that can be executed for each configuration
• Sorts that list
Adaptation
• Predicts the number of instructions to be executed in the next interval
• Uses the lowest energy hardware configuration that fulfills
requirements
Hughes et al MICRO 2001
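The adaptation phase can be sketched as a table lookup over the profiled configurations. This is an illustrative sketch, not the published algorithm: the `Config` fields, the example configurations, and the fallback to the fastest configuration when nothing meets the deadline are all assumptions, and the predicted instruction count is taken as an input.

```python
from typing import NamedTuple, Sequence


class Config(NamedTuple):
    name: str
    max_instructions: int          # most instructions executable before the deadline
    energy_per_instruction: float  # hypothetical profiled value, in nJ


def choose_config(configs: Sequence[Config], predicted_instructions: int) -> Config:
    """Pick the lowest-energy configuration that can meet the frame deadline."""
    feasible = [c for c in configs if c.max_instructions >= predicted_instructions]
    if not feasible:
        # Nothing meets the deadline: fall back to the fastest configuration.
        return max(configs, key=lambda c: c.max_instructions)
    return min(feasible, key=lambda c: c.energy_per_instruction)


# Hypothetical profiling results for three hardware configurations.
profiled = [
    Config("low", 2_000_000, 0.6),
    Config("mid", 4_000_000, 0.9),
    Config("high", 6_000_000, 1.4),
]
print(choose_config(profiled, 3_500_000).name)
```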
DVS for Multimedia Apps
(hard real-time approach)
Buffering decoded frames provides a
control point to enforce deadlines using
feedback control
Dead-zone proportional-integral controller sets
DVS to maintain queue occupancy
No profiling or other prior knowledge about
stream is needed
If queue becomes empty, “panic” mode forces
highest speed
[Figure: queue occupancy regions: decrease frequency, dead zone, increase frequency]
Lu et al ICCD 2003
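A dead-zone PI controller of this kind might look like the sketch below. It is not the published design: the class name, gains, frequency range, target occupancy, and the choice to reset the integral term on panic are all assumptions, and anti-windup is omitted.

```python
class DeadZonePIController:
    """Sets a DVS frequency from the occupancy of the decoded-frame queue."""

    def __init__(self, target, dead_zone, kp, ki, f_min, f_max, f_nominal):
        self.target = target        # desired queue occupancy (frames)
        self.dead_zone = dead_zone  # tolerated deviation that triggers no change
        self.kp, self.ki = kp, ki   # proportional and integral gains
        self.f_min, self.f_max = f_min, f_max
        self.f_nominal = f_nominal
        self.integral = 0.0
        self.freq = f_nominal

    def update(self, occupancy):
        if occupancy == 0:
            # Panic: the queue ran dry, so force the highest speed.
            self.integral = 0.0
            self.freq = self.f_max
            return self.freq
        error = self.target - occupancy   # positive when the queue is too empty
        if abs(error) <= self.dead_zone:
            return self.freq              # inside the dead zone: hold the setting
        self.integral += error
        raw = self.f_nominal + self.kp * error + self.ki * self.integral
        self.freq = min(self.f_max, max(self.f_min, raw))
        return self.freq


ctrl = DeadZonePIController(target=8, dead_zone=2, kp=20.0, ki=5.0,
                            f_min=300.0, f_max=600.0, f_nominal=450.0)
trace = [ctrl.update(occ) for occ in (8, 12, 3, 0)]
print(trace)
```

A full queue means the decoder is ahead of the display, so the controller lowers the frequency; a draining queue raises it, and the dead zone avoids switching on small fluctuations.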
DVS for Web Servers
Basic idea: load balance, then do DVS to
reclaim slack (Elnozahy et al, PACS’02)
But it may be more profitable to cluster requests onto
fewer nodes and put some to sleep
Even on single nodes, it may be profitable to
briefly defer requests, then batch them at the
highest frequency before going to sleep
(Elnozahy et al, USITS’03)
To provide delay guarantees requires feedback
control (Sharma et al RTSS 2001)
A natural and effective control point is synthetic
utilization
• Combines true utilization with real-time schedulability
Other Approaches
Almost all RT algorithms attempt to reclaim slack
Episode detection (Flautner et al, MOBICOM’01)
Identify interactive and periodic events, schedule accordingly
Program checkpoints – check performance relative to
deadline and adjust DVS accordingly
Exploit direct knowledge of task execution times or
utilization
VISA (Anantaraman et al, ISCA’03)
Model a superscalar (unpredictable processor) as a predictable
scalar processor to perform RT analysis and scheduling, then
reduce DVS setting when superscalar processor runs faster than
predicted
Use program checkpoints to check progress/slack
Short-Circuit Power
Main solutions are
Reduce rise/fall times
• Tradeoff: reducing rise/fall times requires stronger drivers,
more dynamic power
Reduce capacitance being switched
Roadmap
Introduction & Trends
Dynamic Power Dissipation
• Sources, modeling, reduction techniques
Static Power Dissipation
• Sources, modeling, reduction techniques
Summary
Static Power Dissipation
Static power: dissipation due to leakage current
Growing worse because Vth is not scaling as fast
as Vdd
Roadmap
Most important sources of static power: subthreshold
leakage and gate leakage
Inter-process variation
Trends
Modeling leakage power
Circuit/architectural-level techniques
Static Power
Main mechanisms for leakage current
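Subthreshold (Berkeley predictive model):
$I_{leakage} = \mu_0 \, C_{ox} \, \frac{W}{L} \, e^{b (V_{dd} - V_{dd0})} \, v_t^2 \left(1 - e^{-V_{dd}/v_t}\right) \exp\left(\frac{-|V_{th0}| - V_{off}}{n \, v_t}\right)$
Gate
• $I_{gate} = I_{gate0} \cdot \exp(a (t_{ox} - t_{ox0})) \cdot \exp(b (V_{dd} - V_{dd0}))$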
We will focus on subthreshold
Gate leakage has essentially been ignored
New gate insulation materials may solve problem, eg recent Intel
announcement
• R. Chau, Technology@intel Magazine. www.intel.com
Gate-induced drain leakage (GIDL) occurs at negative gate voltages
and high Vdd or high values of reverse body bias
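The exponential temperature and voltage dependence of subthreshold leakage can be seen by evaluating the model numerically. The sketch below uses the thermal voltage $v_t = kT/q$ and otherwise hypothetical parameter values (`b`, `vth0`, `voff`, `n`, and a normalized $\mu_0 C_{ox} W/L$), and it holds the threshold voltage fixed over temperature, which is a simplification.

```python
import math

BOLTZMANN = 1.380649e-23          # J/K
ELECTRON_CHARGE = 1.602176634e-19  # C


def thermal_voltage(temp_k: float) -> float:
    """v_t = k * T / q, in volts."""
    return BOLTZMANN * temp_k / ELECTRON_CHARGE


def subthreshold_leakage(vdd: float, temp_k: float, *, mu0_cox: float = 1.0,
                         w_over_l: float = 1.0, b: float = 2.0, vdd0: float = 1.0,
                         vth0: float = 0.3, voff: float = -0.08, n: float = 1.5) -> float:
    """Relative subthreshold leakage current (arbitrary units)."""
    vt = thermal_voltage(temp_k)
    return (mu0_cox * w_over_l
            * math.exp(b * (vdd - vdd0))
            * vt ** 2
            * (1.0 - math.exp(-vdd / vt))
            * math.exp((-abs(vth0) - voff) / (n * vt)))


cool = subthreshold_leakage(1.0, 298.0)
hot = subthreshold_leakage(1.0, 373.0)
print(f"leakage at 373 K relative to 298 K: {hot / cool:.1f}x")
```

Even with the threshold voltage held constant, raising the temperature from 298 K to 373 K multiplies the modeled leakage several times over.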
Effects of Parameter Variations
Ioff depends exponentially on Vth
There is a large fluctuation of Ioff from die to die and from gate to
gate
Controlling Vth is difficult in nanometer scale
Drain-induced barrier lowering
• Channel length is not constant
• Exacerbated in sub-100nm devices
Discrete dopant effects
• In a very small channel, small number of dopants
• Presence of these dopants and random fluctuation of their number, lead to
changes in Vth from device to device
Process variation affects
Gate length (Ldrawn)
Gate oxide thickness (Tox)
Channel dose (Nsub)
Srivastava et al, ISLPED 2002
Static Power
Motivation
Growing relative to dynamic power dissipation: soon 50% of total
power
Exponentially dependent on Temp, Vth, Vdd
Natural target for optimization: idle transistors
[Figure: static power / dynamic power ratio (percentage, 0 to 70) vs. temperature (298 K to 373 K), with curves for the 180nm, 130nm, 100nm, 90nm, 80nm, and 70nm nodes; the ratio increases across generations]
Source: Skadron et al, University of Virginia
Static Power
Modeling Leakage
Butts and Sohi (MICRO-33)
• $P_{static} = V_{cc} \cdot N \cdot k_{design} \cdot \hat{I}_{leak}$
• $\hat{I}_{leak}$ determined by circuit simulation, $k_{design}$ empirically
• Key contribution: separate technology from design
HotLeakage (UVA TR CS-2003-05, DATE’04)
• Extension of Butts & Sohi approach: scalable with Vdd, Vth,
Temp, and technology node; adds gate leakage
• Îleak determined by BSIM3 subthreshold equation and BSIM4
gate-leakage equations, giving an analytical expression that
accounts for dependence on factors that may change at
runtime, namely Vdd, Vth, and Temp
• kdesign replaced by separate factors for N- and P-type
transistors
• kdesign also exponentially dependent on Vdd and Tox, linearly
dependent on Temp
• Currently integrated with SimpleScalar/Wattch for caches
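The Butts and Sohi expression is simple enough to evaluate directly. The sketch below plugs in hypothetical values for the supply voltage, transistor count, design factor, and per-transistor leakage current; none of them are taken from the cited papers.

```python
import math


def static_power(vcc: float, n_transistors: float,
                 k_design: float, i_leak_amps: float) -> float:
    """P_static = Vcc * N * k_design * I_leak, in watts."""
    return vcc * n_transistors * k_design * i_leak_amps


# Hypothetical structure: 50M transistors, 1.2 V, k_design = 1.2, 20 nA per device.
p_static = static_power(vcc=1.2, n_transistors=50e6, k_design=1.2, i_leak_amps=20e-9)
print(f"static power: {p_static:.2f} W")
```

Keeping the technology term (the leakage current) separate from the design term (`k_design`) lets an architect rescale one without re-deriving the other.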
Static Power
Modeling Leakage (cont.)
Su et al, IBM (ISLPED’03)
• Similar approach to HotLeakage – but they observe that
modeling the change in leakage allows linearization of the
equations
Many, many other papers on modeling different aspects of
leakage
• Most focus on subthreshold
• Few suggest how to model leakage in microarchitecture
simulations
Circuit/architectural level techniques
Transistor sizing
Dual Vth
DVS
Dynamic threshold voltage – reverse body bias
Sleep transistors
Low leakage caches/branch predictors
Low leakage register file
Low leakage issue queue
Low leakage ALUs
Techniques for reducing gate leakage
What else?
Transistor sizing, Dual-Vth
Transistor sizing
• Reducing W/L reduces leakage: use smallest possible
transistors
• Leakage-performance tradeoff
Dual-Vth
• High-threshold transistors dramatically reduce
leakage: use low-Vth on critical paths, high-Vth
elsewhere
• Often suggested in caches: many possible
permutations
DVS
• Leakage is exponentially dependent on Vdd, so
DVS reduces leakage
Dynamic Threshold Voltage
Adjust threshold voltage dynamically
Also called reverse body bias (RBB), auto-backgate-controlled multi-threshold CMOS (ABB-MTCMOS)
(Nii et al, ISLPED’98)
Apply negative voltage to body: requires larger VGS to
establish channel, so it raises Vth
Engage RBB for idle transistors
Preserves state
Requires twin-well process; more expensive to
manufacture
Limited by GIDL
Can also be used at testing to adjust circuit properties
and reduce parameter variations
Sleep Transistors
Add a high-Vth transistor between the
circuit and either/both power rails – the
sleep transistor
Also referred to as a “header” (to Vdd) or
“footer” (to ground)
The high-Vth transistor cuts off most
leakage
In fact, a properly sized, lower-Vth
footer transistor can preserve enough
leakage to keep the cell active (Li et
al, PACT’02; Agarwal et al, DAC’02)
Great care must be taken when switching
back to full voltage: noise can flip bits
Extra latency may be necessary when reactivating
Low-Leakage Caches
Gated-Vdd/Vss (Powell et al, ISLPED’00; Kaxiras et al, ISCA-28)
• Uses sleep transistor on Vdd/ground for each cache line
• Typically considered non-state-preserving, but recent work (Agarwal et al, DAC’02) suggests that gated-Vss may preserve state
• Many algorithms for determining when to gate
• Simplest (Kaxiras et al, ISCA-28): two-bit access counter and decay interval
• Adaptive decay intervals are hard
Drowsy cache (Flautner et al, ISCA-29)
• Uses dual supply voltages: normal Vdd and a low Vdd close to the threshold voltage
• State-preserving, but requires an extra cycle to wake up – two extra cycles if tags are decayed
State preservation using leakage currents (Li et al, PACT’02; Agarwal et al, DAC’02)
• Similar to gated-Vss but designed to keep supply voltage high enough to preserve state (100-120 mV)
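The counter-based gating policy mentioned above (a two-bit access counter per line plus a decay interval) can be sketched as follows. This is a simplified illustration under stated assumptions: a coarse global tick advances every line's counter, an access resets it, and a line is gated off as soon as its counter saturates. The class name, function name, and the exact saturation rule are hypothetical choices for this sketch, not details taken from the cited paper.

```python
class DecayLine:
    """One cache line with a 2-bit saturating decay counter (sketch)."""

    SATURATED = 3  # largest value a 2-bit counter can hold

    def __init__(self):
        self.counter = 0
        self.powered = True

    def on_access(self):
        """Reset the counter; return True if this access is an induced miss."""
        induced_miss = not self.powered
        self.counter = 0
        self.powered = True
        return induced_miss

    def on_global_tick(self):
        """Advance the counter; gate the line off once it saturates."""
        if not self.powered:
            return
        self.counter += 1
        if self.counter == self.SATURATED:
            self.powered = False


def simulate(access_cycles, total_cycles, tick_cycles):
    """Return (induced_misses, gated_off_cycles) for a single line."""
    line = DecayLine()
    accesses = set(access_cycles)
    induced = 0
    off_cycles = 0
    for cycle in range(1, total_cycles + 1):
        if cycle % tick_cycles == 0:
            line.on_global_tick()
        if cycle in accesses:
            induced += line.on_access()
        if not line.powered:
            off_cycles += 1
    return induced, off_cycles


# A line touched at cycle 10 and again at cycle 5000, with a tick every 1000 cycles.
print(simulate([10, 5000], total_cycles=6000, tick_cycles=1000))
```

With these assumed parameters the line is gated after three ticks without an access, so the second access is an induced miss. A shorter tick period saves more leakage but induces more misses.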
Low Leakage Caches, cont.
Comparison (Parikh, Li, et al, WDDD’03, DATE’04)
Compared non-state-preserving gated-Vss with state-preserving
drowsy cache
If gating is state-preserving, it wins because it essentially
eliminates subthreshold and gate leakage
• Unless wakeup time is significantly longer than with drowsy
Otherwise, drowsy cache typically has an advantage because it
is state preserving; no L2 accesses needed on “induced misses”
But induced misses are rare, so for a reasonable range of on-chip L2 penalties (< 8 cycles in our studies), gating can still be
superior
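The tradeoff described above can be sketched as a break-even calculation for a single line that sits idle and is then re-accessed. This is an illustrative model only, not the methodology of the cited comparison: it counts energy alone, ignores the performance penalty, and every parameter name and value is a hypothetical placeholder in arbitrary energy units.

```python
def standby_energy(idle_cycles, leak_per_cycle, residual_fraction,
                   reactivation_cost):
    """Energy spent by one line in low-leakage mode over an idle period
    that ends with a re-access.

    residual_fraction: share of full leakage that remains in standby.
    reactivation_cost: wake-up energy (drowsy) or induced-miss energy (gated).
    """
    return idle_cycles * leak_per_cycle * residual_fraction + reactivation_cost


def break_even_idle_cycles(leak_per_cycle, drowsy_residual, gated_residual,
                           drowsy_wake_cost, induced_miss_cost):
    """Idle length beyond which gating uses less energy than drowsy mode."""
    extra_cost = induced_miss_cost - drowsy_wake_cost
    extra_saving_per_cycle = leak_per_cycle * (drowsy_residual - gated_residual)
    if extra_saving_per_cycle <= 0:
        return float("inf")  # gating never catches up
    return max(0.0, extra_cost / extra_saving_per_cycle)


# Hypothetical numbers: drowsy keeps 25% of leakage, gating keeps none,
# and an induced miss costs far more than a drowsy wake-up.
print(break_even_idle_cycles(1.0, 0.25, 0.0, 2.0, 502.0))
```

Under these assumptions, a cheaper induced miss (for example, a fast on-chip L2) shortens the break-even idle period, which is consistent with gating remaining competitive when L2 penalties are small.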
Low-Leakage Caches, cont: 4T Cells
4-transistor (4T) cells
[Figure: 6T (left) and 4T (right) circuit diagrams]
• Eliminates two transistors connected to Vdd
• Naturally decays over time
• Refreshes upon access
• When decayed, force default output
• Up to 33% smaller than equivalent 6T
• Decays quickly [8K cycles at 1 GHz]
• Leak only as much energy as is deposited
4T-based branch predictors, caches
Hu, Juang, et al, ISLPED’02, CA-Letters’02
• Non-state-preserving
• Decay rate: temperature-dependent
• Can be adjusted with passives
• Eliminates decay state bits
Low-Leakage Caches, cont: Other Techniques
RBB (Nii et al, ISLPED’98)
• Back bias cache lines that are idle – can use the same decay counters as gated-Vdd/Vss
Leakage-biased bitlines (Heo et al, ISCA-29)
• Disable precharge and let the bitlines float: they will settle to a value that minimizes leakage
• Can only be applied to idle subbanks and requires accurate prediction of which subbank will be accessed
Huge variety of other techniques – this is only an overview of some of the major ones
Register Files
In general, state-preserving techniques for
caches may work for register files too
Leakage-biased bitlines work here too
Register file divided into subbanks
Alvandpour et al, Intel, ISLPED’01
Uses dual Vth and a conditional keeper
• “Keeper” used on dynamic circuits to counteract voltage
droop due to leakage – they constitute a static pull-up path
• Dynamic circuits arise in the muxes due to multiporting
• “Conditional” keeper technique uses two cascaded keepers;
one is fixed and the other only engaged when needed to
drive an output – requires careful timing analysis
Access transistors and keepers are high-Vt/
ALUs
Usually Dual-VT domino logic
Area & Speed
Sleep transistors can be used, but at a cost
• Dynamic nodes are discharged
• Can be used if the leakage savings justify the cost
Dropsho et al, MICRO 2002
Other Techniques
Queues (eg, issue queues)
Various occupancy-based or rate-matching
techniques have been proposed for issue queue
resizing.
Deactivating queue entries reduces leakage
eg, Ponomarev et al, MICRO-34
Compiler techniques
When compiler knows that regions are idle, they can
be deactivated
eg, Zhang et al, MICRO-35
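The occupancy-based issue-queue resizing mentioned above can be sketched as a per-interval decision. This is a generic illustration, not the algorithm of any cited paper: the function name, the partition size `step`, and the thresholds are hypothetical. Entries outside the active region are assumed to be deactivated so that they stop leaking.

```python
def resize_issue_queue(active_entries, avg_occupancy, full_cycles,
                       step=8, min_entries=8, max_entries=64,
                       full_cycle_limit=16):
    """Return the number of active queue entries for the next interval.

    avg_occupancy: mean number of valid entries sampled over the last interval.
    full_cycles:   cycles in the last interval during which the active
                   portion of the queue was full.
    """
    # Grow when the queue was full too often and is limiting dispatch.
    if full_cycles > full_cycle_limit and active_entries < max_entries:
        return min(max_entries, active_entries + step)
    # Shrink when a whole partition's worth of entries sat unused on average.
    if avg_occupancy <= active_entries - step and active_entries > min_entries:
        return max(min_entries, active_entries - step)
    return active_entries


print(resize_issue_queue(32, avg_occupancy=10.0, full_cycles=0))    # shrink
print(resize_issue_queue(32, avg_occupancy=30.0, full_cycles=100))  # grow
```

Checking the grow condition first is a design choice in this sketch: it favors performance over leakage savings when both conditions could apply.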
Gate Leakage
Any technique that reduces Vdd also reduces gate leakage
Otherwise it seems difficult to develop architecture
techniques that directly attack gate leakage
In fact, very little work has been done in this area
One example: domino gates (Hamzaoglu & Stan,
ISLPED’02)
Replace traditional NMOS pull-down network with a PMOS pull-up network
Gate leakage is greater in NMOS than PMOS
But PMOS domino gate is slower
Roadmap
Introduction & Trends
Dynamic Power Dissipation
Sources, modeling, reduction techniques
Static Power Dissipation
Sources, modeling, reduction techniques
Summary
Other Power-Related Issues
Thermal
Managing on-chip temperatures (as opposed to
average heat dissipation) is not just a matter of
reducing average power density
Spatial and temporal variation
• Spatial: hot spots—must reduce power density in the right
places
• Temporal: must reduce power when chip is hot
This is often when there is less slack
Must model temperature directly
• Average power metrics do not accurately predict temperature
(Skadron et al, ISCA’03)
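The point that average power does not predict temperature can be illustrated with a single lumped thermal RC node. This is a deliberately crude sketch, far simpler than a floorplan-level thermal model: the function name, the thermal resistance and capacitance, and both power traces are illustrative assumptions.

```python
def simulate_temperature(power_trace, dt, r_th, c_th, t_ambient):
    """Forward-Euler integration of  C * dT/dt = P - (T - T_amb) / R.

    power_trace: power in watts for each time step of length dt seconds.
    Returns the temperature after each step.
    """
    temp = t_ambient
    temps = []
    for power in power_trace:
        temp += dt * (power - (temp - t_ambient) / r_th) / c_th
        temps.append(temp)
    return temps


DT = 0.01  # seconds per step
steady = [10.0] * 1000                      # constant 10 W
bursty = ([20.0] * 100 + [0.0] * 100) * 5   # 20 W bursts, same 10 W average

steady_peak = max(simulate_temperature(steady, DT, 1.0, 1.0, 45.0))
bursty_peak = max(simulate_temperature(bursty, DT, 1.0, 1.0, 45.0))
print(steady_peak, bursty_peak)
```

Both traces dissipate the same average power, yet the bursty trace reaches a higher peak temperature in this model, which is the temporal half of the argument above. Spatial hot spots would need more than one node to capture.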
Other Power-Related Issues
Voltage stability (dI/dt)
Inductance means that abrupt changes in current can
cause voltage droop
This can be addressed with decoupling capacitance,
but required capacitance is becoming expensive
Grochowski et al, HPCA’02; Joseph et al, HPCA’03
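The relationship behind the dI/dt problem is V = L · dI/dt, and a first-order estimate of the decoupling capacitance needed to ride through a current step follows from C = ΔQ/ΔV. The sketch below works through both formulas; the function names and the example numbers are illustrative assumptions, not data from the tutorial.

```python
def inductive_droop(inductance_h, delta_current_a, delta_time_s):
    """Voltage droop across a supply inductance: V = L * dI/dt."""
    return inductance_h * delta_current_a / delta_time_s


def decap_for_droop(delta_current_a, duration_s, max_droop_v):
    """First-order decoupling capacitance needed to supply the extra charge
    dI * duration while sagging by at most max_droop_v: C = dQ / dV."""
    return delta_current_a * duration_s / max_droop_v


# Hypothetical example: a 10 A current step in 1 ns through 10 pH.
print(inductive_droop(10e-12, 10.0, 1e-9))   # droop in volts
print(decap_for_droop(10.0, 1e-9, 0.05))     # capacitance in farads
```

A faster or larger current step raises the droop proportionally, which is why abrupt activity changes, and not just high average current, drive the decoupling requirement.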
Roadmap
Introduction & Trends
Dynamic Power Dissipation
Sources, modeling, reduction techniques
Static Power Dissipation
Sources, modeling, reduction techniques
Summary
Summary
Power dissipation is becoming a huge concern
• Total power budget
• Power density (thermal)
• Energy consumption & battery life
Power dissipation
• Switching
• Short-circuit
• Leakage
Power modeling crucial
Academia: accurate research
Industry: detect hot spots in time to meet POR
Summary
Reducing dynamic power
Circuits perspective
• Energy-effective access (reducing capacitance or driving
voltage)
• Gating
Architectural perspective
• Decreasing activity factor
• Pipeline gating
• Adjusting voltage/frequency to meet application requirements
Reducing static power
• Dual Vth
• Non-state-preserving vs. state-preserving techniques