Reducing Leakage Power in Peripheral Circuits of L2 Caches

Reducing Leakage Power in Peripheral
Circuits of L2 Caches
Houman Homayoun and Alex Veidenbaum
Dept. of Computer Science, UC Irvine
{hhomayou, alexv}@ics.uci.edu
ICCD 2007
L2 Caches and Power

• L2 caches in high-performance processors are large
  – 2 to 4 MB is common
• They are typically accessed relatively infrequently
• Thus an L2 cache dissipates most of its power via leakage
  – Much of it is in the SRAM cells
  – Many architectural techniques have been proposed to remedy this
• Today, there is also significant leakage in the peripheral circuits of an SRAM (cache)
  – In part because the cell design has already been optimized
The problem

• How to reduce power dissipation in the peripheral circuits of the L2 cache?
  – Seek an architectural solution with a circuit assist
• Approach:
  – Reduce peripheral leakage when circuits are unused
    – By applying "sleep transistor" techniques
    – During an L2 miss service, for instance
  – Use architectural techniques to minimize "wakeup" time
• Assume that the SRAM cell design is already optimized; no attempt is made to save power in the cells themselves
Miss rates and load frequencies

Benchmark   DL1 miss rate   L2 miss rate   % loads
ammp        0.046           0.1872         0.22
applu       0.056           0.6572         0.26
apsi        0.027           0.2778         0.22
art         0.414           0.0001         0.17
bzip2       0.017           0.0417         0.24
crafty      0.002           0.0087         0.28
eon         0.000           1              0.26
equake      0.017           0.6727         0.25
facerec     0.034           0.3121         0.21
galgel      0.037           0.0057         0.22
gap         0.007           0.5506         0.21
gcc         0.046           0.0367         0.21
gzip        0.007           0.0468         0.20
lucas       0.097           0.6657         0.15
mcf         0.239           0.4284         0.34
mesa        0.003           0.2674         0.26
mgrid       0.036           0.4587         0.30
parser      0.020           0.0688         0.22
perlbmk     0.005           0.4576         0.31
sixtrack    0.012           0.0012         0.22
swim        0.089           0.6308         0.21
twolf       0.054           0.0003         0.23
vortex      0.003           0.2314         0.24
vpr         0.023           0.1476         0.30
wupwise     0.012           0.674          0.17
Average     0.052           0.3132         0.24

• SPEC2K benchmarks
• 128KB L1 cache
• 5% average L1 miss rate; loads are 25% of instructions
• In many benchmarks the L2 is mostly idle
• In some, the L1 miss rate is high
  – Much waiting for data → L2 and CPU idle?
SRAM Leakage Sources

[Figure: SRAM array organization — address input drivers (addr0..addr3), predecoder and global wordline drivers, decoder, global and local wordlines, bitlines, SRAM cells, sense amps, and global output drivers]

• Sense amps
• Multiplexers
• Local and global drivers (including the wordline driver)
• Address decoder
• Bitlines
Leakage Energy Breakdown in L2 Cache

• Large, more leaky transistors are used in the peripheral circuits
• High-Vth, less leaky transistors are used in the memory cells

Breakdown of leakage energy:
• global data input drivers: 25%
• global data output drivers: 24%
• local data output drivers: 20%
• global address input drivers: 14%
• global row predecoder: 7%
• local row decoders: 1%
• others: 9%
Circuit Techniques for Leakage Reduction

• Gated-Vdd, Gated-Vss
• Voltage Scaling (DVFS)
• ABB-MTCMOS
• Forward Body Biasing (FBB), RBB
• These typically target the cache SRAM cell design
  – But they are also applicable to peripheral circuits
Architectural Techniques

• Way Prediction, Way Caching, Phased Access
  – Predict or cache recently accessed ways; read tags first
• Drowsy Cache
  – Keeps cache lines in a low-power state, with data retention
  – Applies DVS to the memory cell
• Cache Decay
  – Evicts lines not used for a while, then powers them down
  – Applies Gated-Vdd / Gated-Vss to the memory cell
• Many architectural mechanisms have been proposed to support these
• All target the cache SRAM memory cell
• What else can be done?

Architectural Motivation

• A load miss in the L2 cache takes a long time to service
  – It prevents dependent instructions from being dispatched and issued
• When dependent instructions cannot issue
  – After a number of cycles the instruction window is full
    – ROB, Instruction Queue, Store Queue
  – Processor issue stalls and performance is lost
• At the same time, energy is lost as well!
  – This is an opportunity to save energy
IPC during an L2 miss

• Cumulative over the L2 miss service time for a program
• Decreases significantly compared to the program average

[Figure: per-benchmark issue rate (0 to 3.25) for the SPEC2K benchmarks — average issue rate during cache miss periods vs. program average issue rate]
A New Technique

• Idle time Management (IM)
  – Assert an L2 sleep signal (SLP) after an L2 cache miss
    – Puts the L2 peripheral circuits into a low-power state
    – The L2 cannot be accessed while in this state
  – De-assert SLP when the cache miss completes
• Can this also apply to the CPU?
  – Use SLP to trigger DVFS, for instance
  – But the L2 idle time is only 200 to 300 clocks
    – DVFS currently takes longer than that
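The SLP protocol above can be sketched as a small cycle-level model. This is an illustrative sketch, not the paper's hardware: the class name, the `tick` interface, and the 250-cycle service latency (picked from the 200–300 clock range quoted above) are our assumptions.

```python
# Sketch of Idle-time Management (IM): assert SLP on an L2 miss,
# de-assert it when the miss service completes. All names and the
# latency constant are illustrative assumptions.

MISS_LATENCY = 250           # assumed L2 miss service time in cycles


class L2Periphery:
    def __init__(self):
        self.slp = False     # sleep signal for the peripheral circuits
        self.wake_at = None  # cycle at which the miss service completes

    def on_l2_miss(self, cycle):
        # Assert SLP: peripheral circuits enter the low-power state.
        self.slp = True
        self.wake_at = cycle + MISS_LATENCY

    def tick(self, cycle):
        # De-assert SLP once the miss service has completed.
        if self.slp and cycle >= self.wake_at:
            self.slp = False
            self.wake_at = None

    def accessible(self):
        # The L2 cannot be accessed while SLP is asserted.
        return not self.slp
```

A miss at cycle 100 keeps the L2 inaccessible until cycle 350 under the assumed latency.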
A Problem

• Disabling the L2 as soon as the miss is detected
  – Prevents the issue of independent instructions
  – In particular, of loads that may hit or miss in the L2
• This may impact performance significantly
  – Up to a 50% performance loss

[Figure: per-benchmark performance loss (%), 0 to 60 — largest for lucas, mcf, applu, swim, and mgrid]
What are independent instructions?

• Independent instructions do not depend on a load miss
  – Or on any other miss occurring during the L2 miss service
• Independent instructions can execute during miss service

[Figure: percentage of independent instructions per benchmark, log scale (0.001% to 100%)]
Two Idle Mode Algorithms

• Static algorithm (SA)
  – Put the L2 in stand-by mode N cycles after a cache miss occurs
  – Enable it again M cycles before the miss is expected to complete
  – Independent instructions execute during the L2 miss service
  – The L2 can be accessed during the N+M cycles
    – L1 misses are buffered in an L2 buffer during stand-by
• Adaptive algorithm (AA)
  – Monitor the issue logic and functional units of the processor after an L2 miss
  – Put the L2 into stand-by mode if no instructions have issued AND the functional units have not executed any instructions in K cycles
  – The algorithm attempts to detect that there are no more instructions that may access the L2
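The two algorithms can be sketched as follows. This is a minimal sketch under our own assumptions: the function signatures, the monitoring interface (`issued`, `fu_busy` flags per cycle), and the parameter values in the example are illustrative, not from the paper.

```python
# Sketches of the Static (SA) and Adaptive (AA) idle-mode algorithms.
# Interfaces and parameter values are illustrative assumptions.

def static_algorithm(miss_cycle, service_time, N, M):
    """SA: stand-by begins N cycles after the miss and ends M cycles
    before the miss is expected to complete. Returns the
    (standby_start, standby_end) window, or None if it is empty."""
    start = miss_cycle + N
    end = miss_cycle + service_time - M
    return (start, end) if end > start else None


class AdaptiveAlgorithm:
    """AA: after an L2 miss, enter stand-by only once the issue logic
    and functional units have been idle for K consecutive cycles."""

    def __init__(self, K):
        self.K = K
        self.idle_streak = 0
        self.miss_pending = False

    def on_l2_miss(self):
        self.miss_pending = True
        self.idle_streak = 0

    def tick(self, issued, fu_busy):
        # Called once per cycle with the monitored processor state.
        # Returns True when the L2 should be put into stand-by mode.
        if not self.miss_pending:
            return False
        self.idle_streak = 0 if (issued or fu_busy) else self.idle_streak + 1
        return self.idle_streak >= self.K
```

For instance, with a 300-cycle miss service, N=50 and M=40 give SA a 210-cycle stand-by window; AA with K=3 triggers after three fully idle cycles.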
A Second Leakage Reduction Technique

• Sometimes the L2 is not accessed much and is mostly idle
  – In this case it is best to use Stand-by Mode (SM)
• Start the L2 cache in a stand-by, low-power mode
  – "Wake it up" on an L1 cache miss and service the miss
  – Return the L2 to stand-by mode right after the L2 access
• However, this is likely to lead to performance loss
  – L1 misses are often clustered, and there is a wake-up delay…
• A better solution:
  – Keep the L2 awake for J cycles after it was turned on
  – This increases energy consumption, but improves performance
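The refined SM policy can be sketched as a small timer model. The wake-up latency value and all names here are illustrative assumptions; only the J-cycle keep-awake rule comes from the slide.

```python
# Sketch of Stand-by Mode (SM) with the J-cycle keep-awake refinement.
# WAKE_DELAY and the interface are illustrative assumptions.

WAKE_DELAY = 5  # assumed wake-up latency of the peripheral circuits


class StandbyL2:
    def __init__(self, J):
        self.J = J             # cycles to stay awake after a wake-up
        self.awake_until = -1  # the L2 starts in stand-by mode

    def is_awake(self, cycle):
        return cycle <= self.awake_until

    def on_l1_miss(self, cycle):
        # Wake the L2 (paying WAKE_DELAY if asleep) and keep it awake
        # for J more cycles, so clustered L1 misses find it already on.
        ready = cycle if self.is_awake(cycle) else cycle + WAKE_DELAY
        self.awake_until = ready + self.J
        return ready           # cycle at which the access can proceed
```

With this policy, only the first miss of a cluster pays the wake-up delay; later misses within J cycles proceed immediately, which is the performance/energy trade-off described above.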
Hardware Support

• Add appropriately sized sleep transistors in the global drivers
• Add a delayed-access buffer to the L2
  – Allows L1 misses to be issued and stored in this buffer at the L2
  – 10 entries (10*8 bits)
  – While SLP is asserted, forthcoming loads and stores are inserted into the delayed-access buffer
  – The buffered accesses go to the L2 when it gets enabled

[Figure: L2 cache organization — pre-decoder, cell array, write and read buffers, SLP signal, and the delayed-access buffer]
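The delayed-access buffer behaves like a small bounded queue. This is a functional sketch under our assumptions; the class and method names are ours, and only the 10-entry capacity comes from the slide.

```python
# Sketch of the delayed-access buffer: while SLP is asserted, L1 misses
# are queued instead of rejected, then replayed once the L2 wakes up.
# Class/method names are illustrative; capacity follows the slide.

from collections import deque


class DelayedAccessBuffer:
    CAPACITY = 10  # 10 entries per the slide

    def __init__(self):
        self.queue = deque()

    def enqueue(self, access):
        # Returns False (stall the requester) when the buffer is full.
        if len(self.queue) >= self.CAPACITY:
            return False
        self.queue.append(access)
        return True

    def drain(self, l2_access_fn):
        # Replay buffered loads/stores in order once SLP is de-asserted.
        while self.queue:
            l2_access_fn(self.queue.popleft())
```

Bounding the buffer at 10 entries means the eleventh in-flight L1 miss during sleep must stall, which caps how much state the sleeping L2 has to absorb.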
System Description

L1 I-cache          128KB, 64 byte/line, 2 cycles
L1 D-cache          128KB, 64 byte/line, 2 cycles, 2 R/W ports
L2 cache            4MB, 8 way, 64 byte/line, 20 cycles
Issue               4-way out of order
Branch predictor    64K-entry g-share, 4K-entry BTB
Reorder buffer      96 entries
Instruction queue   64 entries (32 INT and 32 FP)
Register file       128 integer and 128 floating point
Load/store queue    32-entry load and 32-entry store
Arithmetic units    4 integer, 4 floating point
Complex units       2 INT, 2 FP multiply/divide
Pipeline            15 cycles (some stages are multi-cycle)
Performance Evaluation

[Figure (left): "% Time L2 Turned ON" — fraction of total execution time the L2 cache was active under IM and SM, 0–100%, for INT and FP benchmarks, under SM_200, SM_500, SM_750, SM_1000, SM_1500, IM/SA, and IM/AA]

[Figure (right): IPC degradation, 0–8%, due to the L2 not being accessible under IM and SM, for INT and FP benchmarks, under the same configurations]
Power-Performance Trade Off

[Figure: per-benchmark results for IM/SA, IM/AA, and SM — (a) leakage power savings (up to 100%), (b) total energy-delay reduction (up to 80%), (c) performance degradation (up to 20%)]

• IM: 18 to 22% leakage power reduction with a 1% performance loss
• SM: 25% leakage power reduction with a 2% performance loss
Conclusions

• Studied the breakdown of leakage among L2 cache components, showing that the peripheral circuits leak considerably
• Prior architectural techniques address leakage in the memory cells
• Presented an architectural study of what happens after an L2 cache miss occurs
• Presented two architectural techniques to reduce leakage in the L2 peripheral circuits: IM and SM
  – IM achieves an 18 to 22% average leakage power reduction with a 1% average IPC reduction
  – SM achieves a 25% average savings with a 2% average IPC reduction
• The two techniques benefit different benchmarks, which indicates the possibility of adaptively selecting the best technique; this is the subject of our ongoing research