Adaptive Techniques for Leakage Power Management in L2 Cache



Adaptive Techniques for Leakage Power Management in L2 Cache Peripheral Circuits
Houman Homayoun, Alex Veidenbaum and Jean-Luc Gaudiot
Dept. of Computer Science, UC Irvine
[email protected]
Outline

- L2 cache power dissipation
- Why cache peripherals?
- Study of recently proposed static approaches to reduce leakage
- Two proposed adaptive techniques to reduce leakage
- Power, performance and energy-delay results
L2 Cache and Power

- The L2 cache in high-performance processors is large
  - 2 to 4 MB is common
- It is typically accessed relatively infrequently
- Thus it dissipates most of its power via leakage
- Much of that leakage used to be in the SRAM cells
  - Many architectural techniques have been proposed to remedy this
- Today, there is also significant leakage in the peripheral circuits of an SRAM (cache)
  - In part because the cell design has been optimized

[Figure: Pentium M processor die photo, courtesy of intel.com]
Peripherals ?!

[Figure: SRAM array organization — address input global drivers, predecoder and global wordline drivers, row decoder, global and local wordlines, bitlines, sense amps, and global output drivers]

Peripheral circuits include:
- Data input/output drivers
- Address input/output drivers
- Row pre-decoder
- Wordline drivers
- Row decoder
- Others: sense-amps, bitline pre-chargers, memory cells, decoder logic
Why Peripherals ?

[Bar chart, log scale: leakage power (pW) of a memory cell vs. inverters of increasing drive strength (INVX, INV2X, ... INV32X); annotated ratios of 200X and 6300X relative to the memory cell]

- Cells use minimum-sized transistors for area reasons, while peripherals use larger, faster and accordingly more leaky transistors to satisfy timing requirements
- Cells use high-Vt transistors, compared with typical-threshold-voltage transistors in peripherals
Leakage Power Components of L2 Cache

[Pie chart breakdown:]
- local row decoders: 33%
- global data output drivers: 25%
- global data input drivers: 14%
- global address input drivers: 11%
- local data output drivers: 8%
- others: 8%
- global row predecoder: 1%

SRAM peripheral circuits dissipate more than 90% of the total leakage power
Leakage Power as a Fraction of Total L2 Power Dissipation

[Stacked bar chart: leakage vs. dynamic share of total L2 power, 0–100%, for each SPEC2K benchmark (ammp through wupwise) and the average]

L2 cache leakage power dominates its dynamic power: above 87% of the total
Circuit Techniques to Address Leakage in the SRAM Cell

- Gated-Vdd, Gated-Vss
- Voltage Scaling (DVFS)
- ABB-MTCMOS
- Forward Body Biasing (FBB), RBB

All target the SRAM memory cell
Architectural Techniques

- Way Prediction, Way Caching, Phased Access
  - Predict or cache recently accessed ways, read tags first
- Drowsy Cache
  - Keeps cache lines in a low-power state, with data retention
- Cache Decay
  - Evict lines not used for a while, then power them down
- Implemented by applying DVS, Gated Vdd or Gated Vss to the memory cell
  - Considerable architectural support is needed to do that
- All target the cache SRAM memory cells
Static Architectural Techniques: SM

- SM technique (ICCD’07)
  - Asserts the sleep signal by default
  - Wakes up the L2 peripherals on an access to the cache
  - Keeps the cache in the normal state for J cycles (the turn-on period) before returning it to stand-by mode (SM_J)
    - No wakeup penalty during this period
  - A larger J leads to lower performance degradation but also lower energy savings
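The SM policy above can be sketched as a small cycle-level controller. This is an illustrative model, not the paper's implementation; the class name and signal interface are invented, and only the role of J comes from the slide.

```python
class SMController:
    """Sketch of the SM (stand-by mode) peripheral sleep controller."""

    def __init__(self, J=750):
        self.J = J              # turn-on period, in cycles
        self.sleep = True       # sleep signal asserted by default
        self.awake_left = 0     # cycles remaining in the normal state

    def tick(self, l2_access: bool) -> bool:
        """Advance one cycle; return the current sleep signal."""
        if l2_access:
            # An access wakes the peripherals (a wakeup penalty is paid
            # only if we were asleep) and restarts the turn-on period.
            self.sleep = False
            self.awake_left = self.J
        elif self.awake_left > 0:
            self.awake_left -= 1
            if self.awake_left == 0:
                self.sleep = True   # back to stand-by mode
        return self.sleep
```

For example, with J=3 a single access keeps the peripherals awake for exactly three further idle cycles before sleep is re-asserted, which is why a larger J trades energy savings for fewer wakeup penalties.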
Static Architectural Techniques: IM

- IM technique (ICCD’07)
  - Monitors the issue logic and functional units of the processor after an L2 cache miss; asserts the sleep signal if the issue logic has not issued any instructions and the functional units have not executed any instructions for K consecutive cycles (K=10)
  - De-asserts the sleep signal M cycles before the miss is serviced
    - No performance loss
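The IM policy can be sketched the same way. Again a hypothetical model: the interface (including the `cycles_to_service` hint used to de-assert sleep M cycles early) and the default M are invented for illustration; only K=10 comes from the slide.

```python
class IMController:
    """Sketch of the IM (idle mode) sleep controller for L2 peripherals."""

    def __init__(self, K=10, M=5):   # M is left symbolic on the slide
        self.K = K                   # consecutive idle cycles before sleep
        self.M = M                   # wakeup lead time before miss service
        self.sleep = False
        self.idle_run = 0            # consecutive idle cycles seen so far

    def tick(self, l2_miss_pending: bool, issued: bool,
             executed: bool, cycles_to_service: int) -> bool:
        """Advance one cycle; return the current sleep signal."""
        if not l2_miss_pending:
            self.sleep = False
            self.idle_run = 0
            return self.sleep
        # De-assert sleep M cycles before the outstanding miss is
        # serviced, hiding the wakeup latency (hence no performance loss).
        if cycles_to_service <= self.M:
            self.sleep = False
            return self.sleep
        if not issued and not executed:
            self.idle_run += 1
            if self.idle_run >= self.K:
                self.sleep = True    # pipeline is stalled on memory
        else:
            self.idle_run = 0
        return self.sleep
```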
Simulated Processor Architecture

Parameter           Value
-----------------   ----------------------------------
L1 I-cache          128KB, 2 cycles
L1 D-cache          128KB, 2 cycles
L2 cache            2MB, 8-way, 20 cycles
Fetch, dispatch     4 wide
Issue               4-way out of order
Memory              300 cycles
Reorder buffer      96 entries
Instruction queue   32 entries
Register file       128 integer and 125 floating point
Load/store queue    32 entries
Branch predictor    64KB-entry g-share
Arithmetic unit     4 integer, 4 floating point units
Complex unit        2 INT, 2 FP multiply/divide units
Pipeline            15 cycles
SimpleScalar 4.0, SPEC2K benchmarks

- Compiled with the -O4 flag using the Compaq compiler, targeting the Alpha 21264 processor
- Fast-forwarded for 3 billion instructions, then fully simulated for 4 billion instructions using the reference data sets
SM Performance Degradation

[Bar chart: relative performance, 92%–100%, per benchmark for SM-100, SM-200, SM-500, SM-750 and SM-1500]
More Insight on SM and IM

[Bar chart: fraction of program execution time during which the L2 cache is in low-power mode (FLP), 0–100%, per SPEC2K benchmark and on average, for IM and SM-750]

- The two techniques benefit different benchmarks
More Insight on SM and IM (Cont.)

[Bar chart: FLP, 0–100%, per SPEC2K benchmark and on average, for IM and SM-750]

In almost half of the benchmarks the FLP is negligible, and there is no leakage reduction opportunity using IM:

- The majority of load instructions are satisfied within the cache hierarchy
- Memory accesses are therefore extremely infrequent

The average FLP is 26.9%
Some Observations

[Bar chart: FLP, 0–100%, per benchmark for IM and SM-750]

- In some benchmarks the SM and IM techniques are both effective: facerec, gap, perlbmk and vpr
- IM works well in almost half of the benchmarks but is ineffective in the other half
- SM works well in about one half of the benchmarks, but not the same benchmarks as IM
- An adaptive technique combining IM and SM has the potential to deliver an even greater power reduction
Which Technique Is the Best and When ?

Benchmark   DL1 miss rate   L2 miss rate   L1×L2 miss rates ×10K
---------   -------------   ------------   ---------------------
ammp        0.05            0.19           96.11
applu       0.06            0.66           368.03
apsi        0.03            0.28           75.01
art         0.41            0.00           0.41
bzip2       0.02            0.04           7.09
crafty      0.00            0.01           0.17
eon         0.00            1.00           0.00
equake      0.02            0.67           124.36
facerec     0.03            0.31           86.11
galgel      0.04            0.01           2.11
gap         0.01            0.55           38.54
gcc         0.05            0.04           16.88
gzip        0.01            0.05           3.28
lucas       0.10            0.67           645.73
mcf         0.24            0.43           1023.88
mesa        0.00            0.27           8.02
mgrid       0.04            0.46           165.13
parser      0.02            0.07           13.76
perlbmk     0.01            0.46           22.88
sixtrack    0.01            0.00           0.14
swim        0.09            0.63           561.41
twolf       0.05            0.00           0.16
vortex      0.00            0.23           6.94
vpr         0.02            0.15           33.95
wupwise     0.02            0.68           122.40
Average     0.05            0.31           136.50

The miss rate product (MRP) may be a good indicator of the cache behavior. For the L2 to be idle:

- There are few L1 misses
- Many L2 misses are waiting for memory
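The MRP column above is simply the product of the two miss rates, scaled by 10K. A one-line sketch of that calculation (the example rates are the rounded table entries, so the result differs slightly from the table's unrounded value):

```python
def mrp(dl1_miss_rate: float, l2_miss_rate: float) -> float:
    """DL1 miss rate x L2 miss rate x 10K, as in the table above."""
    return dl1_miss_rate * l2_miss_rate * 10_000

# lucas (rounded rates 0.10 and 0.67) gives about 670; the table's
# 645.73 comes from unrounded rates. A high MRP means loads frequently
# miss all the way to memory (long stalls, which favors IM); a low MRP
# means the L2 is rarely accessed or rarely misses (which favors SM).
```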
The Adaptive Techniques

- Adaptive Static Mode (ASM)
  - MRP measured only once, during an initial learning period (the first 100M committed instructions)
    - MRP > A → IM (A=90)
    - MRP ≤ A → SM_J
  - Initial technique: SM_J
- Adaptive Dynamic Mode (ADM)
  - MRP measured continuously over a K-cycle period (K is 10M); choose IM or SM for the next 10M cycles
    - MRP > A → IM (A=100)
    - A ≥ MRP > B → SM_N (B=200)
    - otherwise → SM_P
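As a concrete illustration, the ASM selection step might look like the following sketch. The counter-based interface is invented for this example; only the threshold A=90 and the ×10K scaling follow the slide.

```python
A = 90  # ASM threshold on the miss rate product, from the slide

def asm_select(l1_misses: int, l1_accesses: int,
               l2_misses: int, l2_accesses: int) -> str:
    """Pick the static technique after the learning period."""
    dl1_rate = l1_misses / l1_accesses
    l2_rate = l2_misses / l2_accesses
    mrp = dl1_rate * l2_rate * 10_000
    # High MRP: many accesses reach memory, long stalls -> IM.
    # Low MRP: the L2 is rarely accessed or rarely misses -> SM_J.
    return "IM" if mrp > A else "SM_J"
```

With mcf-like counters (DL1 rate 0.24, L2 rate 0.43) this picks IM; with twolf-like counters (L2 miss rate near zero) it picks SM_J, matching the table's intuition.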
More Insight on ASM and ADM

- ASM attempts to find the more effective static technique per benchmark by profiling a small subset of a program
- ADM is more complex and attempts to find the more effective static technique at a finer granularity of 10M-cycle intervals, based on profiling the previous interval
ASM Results

[Charts: FLP period (20%–80%) and performance loss (92%–100%) per benchmark, for J=100, 200, 500, 750 and 1500]

ASM_750 makes a good power-performance trade-off, with a 44% FLP and an approximately 2% performance loss
Compare ASM with IM and SM

[Stacked bar chart: fraction of the IM and SM contribution for ASM_750 (ASM-IM vs. ASM-SM), per benchmark]

A small subset of a program can be used to identify the L2 cache behavior: whether it is accessed very infrequently, or is idle because the processor is idle

- In most benchmarks ASM correctly selects the more effective static technique
- Exception: equake
ASM and SM Performance

[Bar chart: relative performance, 82%–100%, per benchmark for ASM_750 and SM_750]

- 2X more leakage power reduction and less performance loss compared to the static approaches
- No performance loss: ammp, applu, lucas, mcf, mgrid, swim and wupwise
ADM Results

[Charts: relative performance (84%–100%) and fraction of the IM vs. SM contribution (ADM_IM, ADM_SM), per benchmark]

- In many benchmarks both IM and SM make a noticeable contribution
  - ADM is effective in combining IM and SM
- In some benchmarks either the IM or the SM contribution is negligible
  - ADM selects the best static technique
Power Measurement Approach

- CACTI-5
  - Peripheral circuits account for 90% of all the leakage power
  - The power reduction is 88%
- Total dynamic power: N*Eaccess/Texec
  - N is the total number of accesses (obtained from simulation)
  - Eaccess is the single-access energy from CACTI-5
  - Texec is the program execution time
- Leakage energy is dissipated on every cycle
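A toy version of this power model, under stated assumptions: dynamic power is N*Eaccess/Texec, and leakage is paid every cycle but reduced by the 88% sleep-mode savings during the FLP fraction of execution. The function name, units and example numbers are illustrative, not the paper's exact accounting.

```python
def l2_power(n_accesses: int, e_access_nj: float, t_exec_s: float,
             p_leak_w: float, flp: float, sleep_savings: float = 0.88):
    """Return (dynamic_w, leakage_w) for one program run."""
    # Total dynamic power: N * Eaccess / Texec (Eaccess given in nJ).
    dynamic_w = n_accesses * e_access_nj * 1e-9 / t_exec_s
    # Leakage is dissipated on every cycle; while in low-power mode
    # (fraction flp of execution), leakage is cut by sleep_savings.
    leakage_w = p_leak_w * (1.0 - flp * sleep_savings)
    return dynamic_w, leakage_w
```

For instance, 10M accesses of 2 nJ over a 1 s run contribute only 0.02 W of dynamic power, while a 1 W leakage budget at 50% FLP drops to 0.56 W, which is why reducing leakage dominates the savings.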
Power Results

[Charts: (a) leakage power savings and (b) total energy-delay reduction, per benchmark, for ASM and ADM]

- 2–3X more leakage power reduction and less performance loss compared to the static approaches
- Leakage reduction using ASM and ADM is 34% and 52%, respectively
- The overall energy-delay reduction is 29.4% and 45.5%, respectively, using ASM and ADM
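The energy-delay numbers above combine both effects: a large power saving can outweigh a small slowdown. A hypothetical sketch of that arithmetic (the energy-delay product is total energy times execution time; the numbers in the example are made up):

```python
def edp(total_power_w: float, t_exec_s: float) -> float:
    """Energy-delay product: energy (P * T) times delay (T)."""
    return total_power_w * t_exec_s * t_exec_s

def edp_reduction(base_p: float, base_t: float,
                  new_p: float, new_t: float) -> float:
    """Fractional EDP reduction of a technique vs. the baseline."""
    return 1.0 - edp(new_p, new_t) / edp(base_p, base_t)

# Halving total power at a 2% slowdown still cuts the EDP by ~48%,
# which is how a ~2% performance loss can coexist with the large
# energy-delay reductions reported on this slide.
```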
Conclusion

- Studied the breakdown of leakage in the L2 cache components, showing that the peripheral circuits leak considerably
- Studied the recently proposed IM and SM approaches
- Proposed a metric (the cache miss rate product) to differentiate which benchmarks work well with each static approach
- Proposed two adaptive techniques to select the best static approach dynamically
- Presented power, performance and energy-delay results
  - 2 to 3X improvement over recently proposed static techniques