ISCA 2002
Dynamic Fine-Grain
Leakage Reduction Using
Leakage-Biased Bitlines
Seongmoo Heo, Kenneth Barr,
Mark Hampton, and Krste Asanović
Computer Architecture Group, MIT LCS
Leakage Power
• Growing impact of leakage power
– Increase of leakage power due to scaling of
transistor lengths and threshold voltages
– Power budget limits use of fast leaky
transistors
• Challenge:
– How to maintain performance scaling in face
of increasing leakage power?
Leakage Reduction Techniques
Static: Design-time Selection of Slow
Transistors (SSST) for non-critical paths
– Replace fast transistors with slow ones on
non-critical paths
– Tradeoff between delay and leakage power
Dynamic: Run-time Deactivation of Fast
Transistors (DDFT) for critical paths
– DDFT switches critical path transistors
between inactive and active modes
Observation:
Critical paths dominate leakage after
applying SSST techniques
Example: PowerPC 750
– 5% of transistor width is low Vt, but these
account for >50% of total leakage.
DDFT could give large leakage savings
Existing DDFT Circuit Techniques
• Body Biasing
– Vt increase by reverse-biased body effect
– Large transition time and wakeup latency due to well cap and resistance
[Figure: transistor with Gate, Drain, Source, and Body terminals; Vbody > Vdd]
• Power Gating
– Sleep transistor between supply and virtual supply lines
– Increased delay due to sleep transistor
[Figure: Sleep signal controlling a transistor between Vdd and Virtual Vdd, which feeds the logic cells]
• Sleep Vector
– Input vector which minimizes leakage
– Increased delay due to mux and active energy due to spurious toggles after applying sleep vector
[Figure: logic with inputs forced to 0, 0]
Fine-Grain DDFT Techniques
• Have to turn off small pieces of an active
processor for short periods of time
– Difficult to turn off large pieces for long periods
→ Fine-grain DDFT techniques
• Requirements of Fine-grain DDFT techniques
– Circuits with low active delay penalty, low energy
moving in and out of sleep, and fast wakeup time
– Micro-architectural scheduling to keep the sleep
time as long and often as possible
• Compare to coarse-grain DDFT techniques
– O.S. puts whole processor to sleep for a long time
→ doesn’t save power when running code
– Low steady-state leakage is the only concern.
Highlights of This Work
• We introduce metrics for comparing fine-grain dynamic deactivation techniques
– Steady-state leakage, Transition time, Fixed transition energy, Breakeven time
• We present a new circuit-level leakage reduction technique, Leakage-Biased Bitlines (LBB)
– Low deactivation energy and fast wakeup
• We save leakage power of I-Cache and Multiported regfile by LBB
– I-cache: idle subbank deactivation
– Multiported regfile: idle read ports and dead register deactivation
Outline
1. Methodology and DDFT Metrics
2. Cache Leakage Saving
• Idle subbank deactivation
3. Multiported Regfile Leakage Saving
• Dead reg deactivation (Horizontal)
• Idle read port deactivation (Vertical)
4. Conclusion
Methodology
• Process Technology
– 180nm DVT process modeled after 0.18um TSMC
LVT and MVT processes
– Scaled to 130, 100, and 70nm processes based on
SIA roadmap
– Optimistic/pessimistic leakage prediction:
2x/4x increase of leakage current density (nA/um)
• Evaluation with SimpleScalar
– Modified to model unified physical register file
– 4 issue, 100 integer physical regs, 16KB/4-Way/32B block I-Cache and D-Cache, Unified L-2 Cache
– SPECint95 refs
• Energy measurements
– Hspice simulation for 180nm process and scaled
to other processes accordingly
Metrics for Fine-Grain DDFT Techniques
[Figure: leakage current vs. time and leakage energy vs. length of sleep, comparing original leakage with DDFT applied. Annotations: Transition Time, Steady-state Sleep Leakage, Fixed Active Transition Energy, Break-Even Time.]
• Wakeup Latency
• Active delay and power
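The relationship between these metrics can be illustrated with a small calculation. This is a minimal sketch under an assumed linear model that the slide does not state explicitly: without DDFT the circuit leaks at a constant power, and with DDFT it pays a fixed transition energy once and then leaks at its steady-state sleep power. The function names, parameter names, and example numbers are hypothetical and are not measurements from this work.

```python
def break_even_time(p_original_uw, p_sleep_uw, e_transition_pj):
    """Return the sleep length (microseconds) at which sleeping starts to pay off.

    Assumed model (illustrative, not taken from the slides):
      energy without DDFT = p_original * t
      energy with DDFT    = e_transition + p_sleep * t
    Units are consistent because 1 uW * 1 us = 1 pJ.
    """
    if p_sleep_uw >= p_original_uw:
        raise ValueError("sleep leakage must be lower than original leakage")
    return e_transition_pj / (p_original_uw - p_sleep_uw)


def energy_saved(p_original_uw, p_sleep_uw, e_transition_pj, t_sleep_us):
    """Net energy saved (pJ) by sleeping for t_sleep_us; negative means a loss."""
    return (p_original_uw - p_sleep_uw) * t_sleep_us - e_transition_pj


# Hypothetical numbers: 100 uW original leakage, 20 uW asleep, 40 pJ per transition.
t_be = break_even_time(100.0, 20.0, 40.0)
print(t_be)                                   # break-even sleep length in us
print(energy_saved(100.0, 20.0, 40.0, 1.0))   # net saving for a 1 us sleep
```

Under this model, any sleep period shorter than the break-even time costs more energy than it saves, which is why a low fixed transition energy matters for fine-grain techniques.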
L1 Cache and Multiported Regfile
• Good targets for Fine-grain DDFT techniques
– Timing-critical
• Contrast: L2 cache is a better target for SSST
(long channel or HVT transistors)
– Large leakage current
• Cache: Large number of fast transistors
• Multiported Regfile: Ever increasing number of
registers and ports
– Alpha 21464 register file is 5x larger than
64KB data cache
LBB for Caches
• Modern cache structure: Hierarchical Bitlines
– To save active power
– To reduce delay
– To reduce bitline noise
[Figure: Subbank with Local Bitline, Local-Global Switch, Global Bitline, and SenseAmp]
• Local bitlines (32-bit cells) disconnected from
senseamp by local-global switch.
• LBB for Caches: If a subbank is not in use,
turn off precharge transistors and delay
precharging.
Cache: Dual Vt SRAM cell
[Figure: dual-Vt SRAM cell with word line WL, local bitlines BIT and BIT_BAR, and global bitlines GLOBAL BIT and GLOBAL BIT_BAR; node values annotated as 1 and 0. HVT transistors: green-colored. One frame is annotated "Our Target".]
Bitline leakage depends on the stored value
[Figure: three bitline cases labeled Forcing 1, Forcing 0, and Forcing ?]
Leakage-Biased Bitlines (LBB)
[Figure: three floating-bitline cases. Labels: Stay at 1; Discharge to 0; Discharge to an intermediate value between 0 and 1.]
• LBB lets bitlines float by turning off the local HVT NMOS
precharge transistors
– No static current draw because local bitline isolated
– LBB uses leakage itself to bias bitlines to the voltage which
minimizes leakage!
• A good fine-grain dynamic technique
– Minimal transition energy:
• Same number of precharges (delayed precharge)
– Minimal transition time:
• Wakeup latency is only that of precharge phase
LBB versus Sleep Vector
• LBB finds the minimal leakage state.
– Always better than sleep vectors
[Figure: Leakage Power of 32x16B SRAM subbank. Leakage Power (uW) vs. Zero Percentage (%) for Original, Sleep Vector 1, Sleep Vector 0, and LBB.]
Cumulative Leakage Energy
32-row x 32B SRAM subbank
(optimistic leakage current used. 75% zero assumed)
[Figure: two panels, 180nm and 70nm. Energy (pJ) vs. Length of Sleep (cycles, 0 to 500) for Original and LBB.]
• Dynamic energy cost: Need to replace the lost charge
– LBB curve increases fast in the beginning
• Decrease of Breakeven time
– 180nm: 200 cycles, 70nm: less than a cycle
– Active energy scales down faster than leakage energy
Performance Issues for LBB
Caches
• Subbank must be precharged before use
– Case 1 (best): subbank decode and precharge
happen before more complex word-line decode,
therefore no penalty.
– Case 2 (worst): add additional pipeline stage for
precharge
• One cycle increase in branch misprediction penalty
– Focus on I-Cache because any latency increase
can be partly hidden by branch prediction
I-Cache Subbank Deactivation
[Figure: four bar charts, Percentage (%). Leakage energy saving at 70nm process and Total energy saving at 70nm process, per benchmark (comp, gcc, go, jpeg, li, m88k, perl, vort, avg), with Pessimistic Prediction and Optimistic Prediction. Leakage energy saving across processes and Total energy saving across processes (180nm, 130nm, 100nm, 70nm).]
Case 2 (worst) assumption (adding additional pipeline stage)
→ 2.5% IPC decrease on average
Multiported Regfile Cell
8R, 4W unbalanced DVT reg cell
[Figure: cell schematic with signals READ[0:7], WRITE[0:3], WRITEB[0:3], WWL[0:3], and RWL[0:7]. HVT transistors: green-colored.]
• Simplified but active/leakage power-aware baseline
LBB for Multiported Regfiles
• LBB for Multiported
Regfiles: Turn off the
precharge transistor on
idle subbank read ports
– Leakage current discharges
bitlines to 0 if any bits are
holding 1.
Dead Register Deactivation
• Horizontal technique
• Dead registers = Registers in free list
• If all registers in a subbank are dead, all read ports in the subbank are turned off by LBB
• No performance penalty since there is ample time to re-precharge between allocation and write.
[Figure: Subbank 1 with Readport 0, Readport 1, Readport 2]
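The subbank-level condition described on this slide can be sketched in a few lines. This is an illustrative sketch only: it assumes physical registers are numbered consecutively and that register r belongs to subbank r // regs_per_subbank, which the slides do not specify. The function and parameter names are hypothetical.

```python
def dead_subbanks(free_list, num_regs, regs_per_subbank):
    """Return the indices of subbanks whose registers are all in the free list.

    Assumption for illustration: physical register r belongs to
    subbank r // regs_per_subbank.
    """
    free = set(free_list)
    dead = []
    for bank in range(num_regs // regs_per_subbank):
        first = bank * regs_per_subbank
        # A subbank qualifies only if every one of its registers is dead.
        if all(r in free for r in range(first, first + regs_per_subbank)):
            dead.append(bank)
    return dead


# Hypothetical example: 8 registers in subbanks of 2; registers 4 and 5 are live.
print(dead_subbanks([0, 1, 2, 3, 6, 7], num_regs=8, regs_per_subbank=2))
```

A single live register keeps its whole subbank active, which is why the allocation policy that decides where live registers land affects the achievable savings.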
NMOS Sleep Transistor (NST)
• Alternative horizontal DDFT
• To turn off dead registers using NMOS sleep transistors (NST)
• Advantage: registers can be turned off individually
• Disadvantage: increased read access time
– Set delay penalty to 5% (tradeoff between delay and leakage)
[Figure: Register 1 with Readport 0, Readport 1, Readport 2; a control value is shown as 1 in one frame and 0 in the next]
Idle Readport Deactivation
• Vertical technique
• Read ports are idle when fewer than the maximum number of instructions are issued in a superscalar machine
• Idle read ports deactivated by LBB
• No performance penalty since it is known whether a read port is needed before it is known which register will be accessed in the pipeline.
[Figure: register file with Readport 0, Readport 1, Readport 2]
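As a rough illustration of the vertical technique, the number of idle read ports in a cycle follows from how many register source operands the issued instructions need. The sketch below assumes ports are assigned in order, lowest index first, which is a simplification chosen for illustration and not a detail given in the slides. The function and parameter names are hypothetical.

```python
def idle_read_ports(source_operand_counts, total_read_ports):
    """Return the read port indices left idle this cycle.

    source_operand_counts: number of register source operands for each
    instruction issued this cycle.
    Assumption for illustration: ports are assigned in order, lowest
    index first, so the idle ports are the highest-numbered ones.
    """
    needed = sum(source_operand_counts)
    if needed > total_read_ports:
        raise ValueError("issued instructions need more read ports than exist")
    return list(range(needed, total_read_ports))


# Hypothetical cycle on an 8-read-port register file: two instructions issue,
# one with two source registers and one with a single source register.
print(idle_read_ports([2, 1], total_read_ports=8))
```

Because the operand count is known at issue, the idle ports can be identified before the register specifiers reach the register file, consistent with the no-penalty argument on the slide.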
Comparison of DDFTs
32 x 32-b Regfile subbank
(75% zero assumed. Optimistic leakage current used.)

Process Tech. (nm)       180     130     100     70
Original (uW)            177.9   214.1   263.6   276.7
SV steady-state (uW)     2.0     2.4     3.0     3.1
LBB steady-state (uW)    2.0     2.4     3.0     3.1
NST steady-state (uW)    1.8     2.2     2.7     2.9

[Figure: two panels, 180nm and 70nm. Energy (pJ) vs. Length of Sleep (cycles, 0 to 1000) for Original, Sleep Vector, Leakage-Biased Bitlines, and NMOS Sleep Transistor.]
Comparison of DDFTs
Blowup: 70nm
[Figure: Energy (pJ) vs. Length of Sleep (cycles, 0 to 50) for Original, Sleep Vector, NMOS Sleep Transistor, and Leakage-Biased Bitlines.]
Dead Register/Subbank
Deactivation Policies
• Free list policies for NST (NMOS Sleep Transistor): queue and stack
– queue: conventional
– stack: keeps some regs dead for longer
• 2.4-10% greater savings than queue at 70nm
• Benefit increases as feature sizes shrink
• Subbank allocation policy for LBB: stack
– Allocate a new subbank only when the previous bank is empty of dead registers
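The difference between the two free list policies can be seen in a toy simulation. This is a minimal sketch under simplifying assumptions that are not from the slides: one register is allocated per step, every register is freed a fixed number of steps after allocation, and no renaming stalls occur. The function name and parameters are hypothetical.

```python
from collections import deque


def registers_touched(policy, num_regs, lifetime, steps):
    """Return the set of registers ever allocated under a free list policy.

    policy: "queue" reuses the register that has been free the longest;
            "stack" reuses the most recently freed register.
    Toy model (illustrative): one allocation per step, and each register
    returns to the free list `lifetime` steps after it was allocated.
    """
    if lifetime > num_regs:
        raise ValueError("lifetime must not exceed the number of registers")
    free = deque(range(num_regs))
    live = deque()  # (release_step, register), in allocation order
    touched = set()
    for step in range(steps):
        # Return registers whose lifetime has expired to the free list.
        while live and live[0][0] <= step:
            _, reg = live.popleft()
            free.append(reg)
        # Queue takes from the front of the free list, stack from the back.
        reg = free.popleft() if policy == "queue" else free.pop()
        touched.add(reg)
        live.append((step + lifetime, reg))
    return touched


print(len(registers_touched("queue", num_regs=8, lifetime=2, steps=20)))
print(len(registers_touched("stack", num_regs=8, lifetime=2, steps=20)))
```

In this toy model the queue cycles through every register, while the stack keeps reusing a small set and leaves the rest dead for the whole run, which matches the slide's point that the stack policy keeps some registers dead for longer.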
Dead Reg Deactivation (Horizontal)
[Figure: four bar charts, percent (%). Leakage energy savings (70nm process) and Total energy savings (70nm process), per benchmark (comp, gcc, go, jpeg, li, m88k, perl, vort, avg). Leakage Energy Savings and Total Energy Savings vs. Process (nm): 180, 130, 100, 70. Series: NST Queue, NST Stack, LBB 16 regs/bank, LBB 8 regs/bank. Colored: optimistic. White: pessimistic.]
NST stack better than NST queue, LBB stack better than either NST
Read Port Deactivation (Vertical)
[Figure: four bar charts, Percentage (%). Leakage energy saving at 70nm process and Total energy saving at 70nm process, per benchmark (comp, gcc, go, jpeg, li, m88k, perl, vort, avg), with Pessimistic Prediction and Optimistic Prediction. Leakage energy saving across processes and Total energy saving across processes (180nm, 130nm, 100nm, 70nm).]
• More energy saving for wider issue processors
• Readport deactivation can be combined with dead subbank deactivation.
Conclusion
• Most leakage power is in critical paths
– Dynamic leakage reduction (DDFT) desired
• LBB allows Fine-grain dynamic leakage
reduction with zero or minimal performance
penalty.
– 0% performance penalty for multiported regfiles
• Sleep time can be improved by changing
micro-architectural scheduling policies.
– Stack better than queue for free list policy
• Follow on work:
– Leakage-biased domino logic to save leakage
power in critical ALUs [ VLSI Symposium 2002 ]
Acknowledgments
• Thanks to Christopher Batten, Ronny
Krashinsky, Rajesh Kumar, and
anonymous reviewers
• Funded by DARPA PAC/C award F30602-00-2-0562, NSF CAREER award CCR-0093354, and a donation from Infineon Technologies.
DDFT Examples
• Body Biasing
– Steady-state leakage power: Less than 5% (depends on Vbody)
– Transition time, Wakeup latency: 0.1~100us
– Transition energy, Breakeven time: Well cap switching energy
– Delay Impact: No
• Power Gating
– Steady-state leakage power: Less than 5% (depends on sleep transistor)
– Transition time, Wakeup latency: Less than a cycle
– Transition energy, Breakeven time: Sleep transistor gate cap switching energy
– Delay Impact: Yes. Due to sleep transistor
– Etc: Area for sleep transistor and virtual supplies
• Sleep Vector
– Steady-state leakage power: Less than 50% (depends on the circuit)
– Transition time, Wakeup latency: Less than a cycle
– Transition energy, Breakeven time: Active energy consumed due to spurious toggling after sleep vector
– Delay Impact: Yes. Due to mux
– Etc: Finding sleep vector is hard