RELOCATE Register File Local Access Pattern Redistribution


Architectural and Circuit-Level Design Techniques for Power and Temperature Optimizations in On-Chip SRAM Memories
Houman Homayoun
PhD Candidate
Dept. of Computer Science, UC Irvine
Outline

Past Research

Low Power Design
- Power Management in Cache Peripheral Circuits (CASES-2008, ICCD-2008, ICCD-2007, TVLSI, CF-2010)
- Clock Tree Leakage Power Management (ISQED-2010)
- Process Variation Aware Cache Architecture for Aggressive Voltage-Frequency Scaling (DATE-2009, CASES-2009)

Thermal-Aware Design
- Thermal Management in Register File (HiPEAC-2010)

Reliability-Aware Design

Performance Evaluation and Improvement
- Adaptive Resource Resizing for Improving Performance in Embedded Processors (DAC-2008, LCTES-2008)
RELOCATE
Register File Local Access Pattern Redistribution Mechanism for Power and Thermal Management in Out-of-Order Embedded Processor
Houman Homayoun, Aseem Gupta, Alexander V. Veidenbaum, Avesta Sasan, Fadi J. Kurdahi, Nikil Dutt
Outline

- Motivation
- Background study
- Study of register file underutilization
- Study of register file default access patterns
- Access concentration and activity redistribution to relocate register file access patterns
- Results
Why Temperature?

Higher power densities (W per mm2) lead to higher operating temperatures, which:
(i) increase the probability of timing violations
(ii) reduce IC lifetime
(iii) lower the operating frequency
(iv) increase leakage power
(v) require expensive cooling mechanisms
(vi) increase overall design effort and cost
Why Register File?

- The RF is one of the hottest units in a processor
  - A small, heavily multi-ported SRAM
  - Accessed very frequently
- Examples: IBM PowerPC 750FX, AMD Athlon 64

[Figure: AMD Athlon 64 core floorplan blocks, and a thermal image of the AMD Athlon 64 core floorplan captured with infrared cameras. Courtesy of Renau et al., ISCA 2007.]
Prior Work: Activity Migration

- Reduces temperature by migrating the activity to a replicated unit
- Requires a replicated unit
  - large area overhead
  - leads to a large performance degradation

[Figure: temperature vs. time under activity migration (AM) and activity migration with power gating (AM+PG). Temperature alternates between active periods (heating toward T_crisis) and idle periods (cooling toward T_ambient); under AM, idle-period cooling is due to inactivity alone, while under AM+PG it is due to inactivity and power gating.]
Conventional Register Renaming

[Figure: register renamer with a free list and an active list; head and tail pointers track allocation and release of physical registers.]

Instruction #   Original code    Renamed code
1               RA <- ...        PR1 <- ...
2               .... <- RA       .... <- PR1
3               branch to _L     branch to _L
4               RA <- ...        PR4 <- ...
5               ...              ...
6               _L:              _L:
7               .... <- RA       .... <- PR1

Register allocation/release: registers are allocated and released in a somewhat random order.
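To make the allocation/release behavior concrete, here is a minimal renamer sketch (hypothetical Python, not from the talk): destination operands pull physical registers off a free list, and a previous mapping returns to the free list only when the overwriting instruction commits, which is why the release order looks somewhat random.

from collections import deque

class Renamer:
    """Minimal register renamer: free list + map table (illustrative only)."""
    def __init__(self, num_phys_regs):
        self.free_list = deque(range(1, num_phys_regs + 1))  # free physical regs
        self.map_table = {}                                  # arch reg -> phys reg

    def rename_dest(self, arch_reg):
        """Allocate a fresh physical register for a destination operand;
        return it plus the previous mapping (freed later, at commit)."""
        phys = self.free_list.popleft()    # in hardware, rename stalls if empty
        prev = self.map_table.get(arch_reg)
        self.map_table[arch_reg] = phys
        return phys, prev

    def rename_src(self, arch_reg):
        """Source operands simply read the current mapping."""
        return self.map_table[arch_reg]

    def release(self, prev_phys):
        """Called when the overwriting instruction commits."""
        if prev_phys is not None:
            self.free_list.append(prev_phys)

# Mirroring the slide's example: instruction 1 maps RA to a physical register,
# instruction 4 remaps RA, and the old register is released only at commit.
r = Renamer(64)
pr_a, _ = r.rename_dest("RA")     # 1: RA <- ...
r.rename_src("RA")                # 2: ... <- RA  reads the same physical reg
pr_b, prev = r.rename_dest("RA")  # 4: RA <- ...  gets a new physical register
r.release(prev)                   # at commit of 4, the old reg rejoins the free list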
Analysis of Register File Operation: Register File Occupancy

[Figure: distribution of register file occupancy (fraction of time with fewer than 16, 16-32, 32-48, and 48-64 occupied entries) for (a) MiBench and (b) SPECint2K benchmarks.]

Performance Degradation with a Smaller Register File

[Figure: % performance degradation for (a) MiBench and (b) SPECint2K benchmarks when the register file is reduced to 48, 32, and 16 entries.]
Analysis of Register File Operation: Register File Access Distribution

The coefficient of variation (CV) shows the "deviation" from the average number of accesses across individual physical registers:

CV_{access} = \frac{\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(na_i - \overline{na}\right)^2}}{\overline{na}}

where na_i is the number of accesses to physical register i during a specific period (10K cycles), \overline{na} is the average number of accesses per register, and N is the total number of physical registers.
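As a concrete illustration, CV_access for one sampling period could be computed from per-register access counters as below (hypothetical Python sketch; the hardware counters themselves are assumed, not described in the talk).

import math

def cv_access(access_counts):
    """CV of accesses across physical registers for one 10K-cycle period.
    access_counts[i] = number of accesses na_i to physical register i."""
    N = len(access_counts)              # N: total physical registers
    mean = sum(access_counts) / N       # average accesses per register
    if mean == 0:
        return 0.0
    var = sum((na - mean) ** 2 for na in access_counts) / N
    return math.sqrt(var) / mean

# Uniformly spread accesses give a CV of zero ...
print(cv_access([10] * 64))                        # 0.0
# ... while the same total concentrated in 16 registers gives a high CV.
print(round(cv_access([40] * 16 + [0] * 48), 2))   # 1.73

The low measured CVs in the next figure are what motivate RELOCATE: by default, accesses are already spread almost uniformly across the physical register file.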
Coefficient of Variation

[Figure: % coefficient of variation of per-register access counts for (a) MiBench and (b) SPEC2K benchmarks.]
Register File Operation

Underutilization is distributed uniformly: while only a small number of registers are occupied at any given time, the total accesses are uniformly distributed over the entire physical register file during the course of execution.
RELOCATE: Access Redistribution within a Register File

- The goal is to "concentrate" accesses within one partition of the RF (a region)
- Some regions will then be idle (for 10K cycles)
  - They can be power-gated and allowed to cool down

[Figure: register activity under (a) the baseline, (b) in-order, and (c) distant redistribution patterns; the RF is divided into partitions P1-P4, with accesses concentrated in the active region while the remaining regions idle.]
An Architectural Mechanism to Support Access Redistribution

- Active partition: a register renamer partition currently used in register renaming
- Idle partition: a register renamer partition which does not participate in renaming
- Active region: a region of the register file corresponding to a register renamer partition (whether active or idle) which has live registers
- Idle region: a region of the register file corresponding to a register renamer partition (whether active or idle) which has no live registers
Activity Migration without Replication

- An access concentration mechanism allocates registers from only one partition
- This default active partition (DAP) may run out of free registers before the 10K-cycle "convergence period" is over
  - Another partition (according to some algorithm) is then activated (referred to as an additional active partition, or AAP)
- To facilitate physical register concentration in the DAP, if two or more partitions are active and have free registers, allocation is performed in the same order in which the partitions were activated, as sketched below
The Access Concentration Mechanism

- Partition activation order is 1-3-2-4

[Figure: four renamer partitions (P1-P4), each with its own active list (initially empty) and free list (initially full); allocation proceeds through the partitions in activation order 1-3-2-4.]
The Redistribution Mechanism

- The default active partition is changed once every N cycles (according to some algorithm) to redistribute the activity within the register file
  - Once a new default partition (NDP) is selected, all active partitions (DAP+AAP) become idle
- The idle partitions do not participate in register renaming, but their corresponding RF regions may have to be kept active (powered up)
  - A physical register in an idle partition may still be live
- An idle RF region is power-gated when its active list becomes empty (see the sketch below)
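Continuing the allocator sketch above, the redistribution step might look like this (hypothetical Python; live_count and the NDP-selection policy are stand-ins for the renamer's active lists and the talk's unspecified "algorithm"):

def redistribute(alloc, live_count, gated, select_ndp):
    """Invoked once every N cycles (10K in the talk).
    live_count[p]: live registers in partition p's RF region
    gated[p]:      True if RF region p is currently power-gated
    select_ndp:    policy picking the new default partition (NDP)"""
    ndp = select_ndp(alloc)
    alloc.activation_order = [ndp]   # all active partitions (DAP+AAP) go idle
    gated[ndp] = False               # wake the NDP's region (2-cycle delay,
                                     # hidden between rename and RF access)
    for p in range(len(alloc.free)):
        # an idle region is power-gated only once it holds no live registers;
        # in hardware this also triggers whenever an active list drains
        if p != ndp and live_count[p] == 0:
            gated[p] = True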
Performance Impact?

- There is a two-cycle delay to wake up a power-gated physical register region
- Register renaming occurs in the front end of the microprocessor pipeline, whereas the register file access occurs in the back end
  - There is a delay of at least two pipeline stages between the time a physical register is renamed and the time it is accessed
- The required register file region can therefore be woken up in time, without incurring a performance penalty
Experimental Setup

Table 1. Processor Architecture
  L1 I-cache: 8KB, 4-way, 2 cycles
  L1 D-cache: 8KB, 4-way, 2 cycles
  L2 cache: 128KB, 15 cycles
  Fetch, dispatch: 2 wide
  Register file: 64 entries
  Memory: 50 cycles
  Instruction fetch queue: 2
  Load/store queue: 16 entries
  Arithmetic units: 2 integer
  Complex unit: 2 INT
  Pipeline: 12 stages
  Processor speed: 800 MHz
  Issue: out-of-order
  Process: 45nm CMOS, 9 metal layers

Table 2. RF Design Specification
  Register file layout area: 0.009 mm2
  Operating modes: Active (R/W); Sleep (no data retention)
  Operating voltage: 0.6V~1.1V
  Access frequency: 200MHz to 1.1GHz
  Access time, typical corner (0.9V, 45C): 0.32ns
  Active power (total), typical corner (0.9V, 45C) @ 800MHz: 66mW
  Active leakage power, typical corner (0.9V, 45C): 15mW
  Sleep leakage power, typical corner (0.9V, 45C): 2mW
  Wakeup delay: 0.42ns
  Wakeup energy per register file row (64 bits): 0.42nJ

- MASE (SimpleScalar 4.0), modeling a MIPS-74K processor at 800 MHz
- MiBench and SPECint2K benchmarks compiled with the Compaq compiler, -O4 flag
- Industrial memory compiler used
  - 64-entry, 64-bit single-ended SRAM memory in TSMC 45nm technology
- HotSpot used to estimate thermal profiles
Results: Power Reduction

[Figure: RF power reduction (%) for (a) MiBench and (b) SPEC2K benchmarks with num_partition = 2, 4, and 8.]
Analysis of Power Reduction

- Increasing the number of RF partitions provides more opportunity to capture and cluster unmapped registers into a partition
  - Indicates that the wakeup overhead is amortized for a larger number of partitions
- Some exceptions:
  - The overall power overhead associated with waking up an idle region becomes larger as the number of partitions increases
  - Frequent but ineffective power gating, and its overhead, as the number of partitions increases
Peak Temperature Reduction

Table 1. Peak temperature reduction for MiBench benchmarks
(columns 2P/4P/8P give the peak temperature reduction in C for 2, 4, and 8 partitions)

benchmark        base (C)   2P     4P     8P
basicMath        94.3       3.6    4.8    5.0
bc               95.4       3.8    4.4    5.2
crc              92.8       5.3    6.0    6.0
dijkstra         98.4       6.3    6.8    6.4
djpeg            96.3       2.8    3.5    2.4
fft              94.5       6.8    7.4    7.6
gs               89.8       6.5    7.4    9.7
gsm              92.3       5.8    6.7    6.9
lame             90.6       6.2    8.5    11.3
mad              93.3       3.8    4.3    2.2
patricia         79.2       11.0   12.4   13.2
qsort            88.3       10.1   11.6   11.9
search           93.8       8.7    9.3    9.1
sha              90.1       5.1    5.4    4.5
susan_corners    92.7       4.7    5.3    5.1
susan_edges      91.9       3.7    5.8    6.3
tiff2bw          98.5       4.5    5.9    4.1
average          92.5       5.6    6.8    6.9

Table 2. Peak temperature reduction for SPEC2K integer benchmarks

benchmark        base (C)   2P     4P     8P
bzip2            92.7       4.8    3.9    3.1
crafty           83.6       9.5    11.0   10.4
eon              77.3       10.6   12.4   12.5
galgel           89.4       6.9    7.2    5.8
gap              86.7       4.8    5.9    7.1
gcc              79.8       7.9    9.4    10.1
gzip             95.4       3.2    3.8    3.9
mcf              85.8       6.9    8.7    9.4
parser           97.8       4.3    5.8    4.8
perlbmk          85.8       10.6   12.3   12.6
twolf            86.2       8.8    10.2   10.5
vortex           81.7       11.3   12.5   12.9
vpr              94.6       4.9    5.2    4.4
average          87.4       7.2    8.3    8.2
Analysis of Temperature Reduction

- Increasing the number of partitions results in a larger power density in each partition, because RF access activity is concentrated in a smaller partition
- While capturing more idle partitions and power gating them may potentially yield a higher power reduction, the larger power density due to the smaller partition size results in an overall higher temperature
Conclusions

- Showed register file underutilization
- Studied the register file's default access patterns
- Proposed access concentration and activity redistribution to relocate register file accesses
- Results show a noticeable power and temperature reduction in the RF
- The RELOCATE technique can be applied whenever units are underutilized
  - as opposed to activity migration, which requires replication
Current and Future Work

- Formulate the best partition selection out of the available partitions for activity redistribution
- Apply the activity concentration and redistribution mechanism to other hot units, for example the L1 cache
- Apply proactive NBTI recovery to the idle partitions to improve lifetime reliability
- Trade off NBTI recovery and power gating to simultaneously reduce power and improve lifetime reliability
- Tackle the temperature barrier in 3D-stacked processor design using similar activity concentration and redistribution
Multiple Sleep Modes Leakage Control for Cache Peripherals
Houman Homayoun, Avesta Sasan, Alexander V. Veidenbaum
On-chip Caches and Power

- On-chip caches in high-performance processors are large
  - more than 60% of the chip area budget
- They dissipate a significant portion of power via leakage
  - Much of it was in the SRAM cells
    - Many architectural techniques have been proposed to remedy this
  - Today, there is also significant leakage in the peripheral circuits of an SRAM (cache)
    - in part because cell design has been optimized

[Figure: Pentium M processor die photo. Courtesy of intel.com.]
Peripherals?

[Figure: SRAM array organization showing the peripheral circuits: address input global drivers, row pre-decoder and global wordline drivers, row decoders, local wordline drivers, sense amps, and the data/address input/output drivers.]

- Cells use minimal-sized transistors for area considerations; peripherals use larger, faster, and accordingly leakier transistors to satisfy timing requirements
- Cells use high-Vt transistors, compared with typical threshold voltage transistors in the peripherals

[Figure: leakage (log scale, pW) of peripheral inverters INVX through INV32X versus a memory cell: roughly 200X to 6300X more leakage than a cell.]
Power Components of L2 Cache

- SRAM peripheral circuits dissipate more than 90% of the total leakage power
- L2 cache leakage power dominates its dynamic power: above 87% of the total

[Figure: breakdown of L2 leakage by peripheral component: local row decoders 33%, global data output drivers 25%, global data input drivers 14%, global address input drivers 11%, local data output drivers 8%, global row predecoder 1%, others 8%.]

[Figure: leakage vs. dynamic power split of the L2 cache across SPEC2K benchmarks (ammp through wupwise, plus average).]
Techniques to Address Leakage in the SRAM Cell

Circuit:
- Gated-Vdd, Gated-Vss
- Voltage Scaling (DVFS)
- ABB-MTCMOS
- Forward Body Biasing (FBB), RBB
- Sleepy Stack
- Sleepy Keeper

Architecture:
- Way Prediction, Way Caching, Phased Access
  - Predict or cache recently accessed ways, read tag first
- Drowsy Cache
  - Keeps cache lines in a low-power state, with data retention
- Cache Decay
  - Evict lines not used for a while, then power them down
- Applying DVS, Gated-Vdd, Gated-Vss to the memory cell
  - Much architectural support to do that

All of these target the SRAM memory cell.
Sleep Transistor Stacking Effect

- Subthreshold current is an inverse exponential function of the threshold voltage, which rises with the source-to-body voltage:

V_T = V_{T0} + \gamma\left(\sqrt{2\varphi_F + V_{SB}} - \sqrt{2\varphi_F}\right)

- Stacking transistor N with sleep transistor slpN: the source-to-body voltage (V_M) of transistor N increases, which reduces its subthreshold leakage current when both transistors are off
- Drawbacks: rise time, fall time, wakeup delay, area, dynamic power, instability

[Figure: an inverter driving load C_L through an NMOS stack; the footer sleep transistor slpN raises the internal node voltage V_M above vss.]
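Putting the pieces together (a standard subthreshold-leakage sketch with assumed model parameters, not figures from the talk): when both transistors are off, the internal node charges to V_M > 0, so transistor N sees V_{GS} = -V_M, V_{SB} = V_M, and a smaller V_{DS}, all of which suppress its subthreshold current:

I_{\mathrm{sub}} \;\propto\; e^{\,(V_{GS}-V_T)/(n\,v_\theta)}\left(1 - e^{-V_{DS}/v_\theta}\right),
\qquad
V_T\big|_{V_{SB}=V_M} \;=\; V_{T0} + \gamma\left(\sqrt{2\varphi_F + V_M} - \sqrt{2\varphi_F}\right)

Since v_\theta = kT/q is about 26 mV at room temperature, even a V_M of a few hundred millivolts reduces leakage by orders of magnitude.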
Impact on Rise Time and Fall Time

- The rise and fall times of an inverter's output are proportional to R_peq * C_L and R_neq * C_L, respectively
- Inserting the sleep transistors increases both R_neq and R_peq
  - Increased rise time: impact on performance
  - Increased fall time: impact on memory functionality

[Figure: inverter chain with header (slpP1, slpP2) and footer (slpN1, slpN2) sleep transistors on every stage; leakage paths marked.]
A Zig-Zag Circuit

- In the zig-zag scheme, R_peq for the first and third inverters and R_neq for the second and fourth inverters do not change
  - The fall time of the circuit does not change

[Figure: four-stage inverter chain with zig-zag placement of sleep transistors (headers slpP2, slpP4; footers slpN1, slpN3, slpN5) and the relative transistor sizings.]

- To improve the leakage reduction and area efficiency of the zig-zag scheme, one set of sleep transistors can be shared between multiple stages of inverters:
  - Zig-Zag Horizontal Sharing
  - Zig-Zag Horizontal and Vertical Sharing
Zig-Zag Horizontal and Vertical Sharing

[Figure: two wordline driver lines (K and K+1), each a four-stage inverter chain, sharing a single header (slpP) and footer (slpN) sleep-transistor pair across stages and across rows.]

- To improve the leakage reduction and area efficiency of the zig-zag scheme, one set of sleep transistors is shared between multiple stages of inverters
  - Zig-Zag Horizontal Sharing
    - Minimizes the impact on rise time
    - Minimizes the area overhead
  - Zig-Zag Horizontal and Vertical Sharing
    - Maximizes the leakage power saving
    - Minimizes the area overhead
ZZ-HVS Evaluation: Power Results

[Figure: leakage power (log scale, nW) vs. number of wordline rows sharing sleep transistors (1-10), comparing the baseline, redundant, zig-zag, zz-hs, and zz-hvs schemes.]

- Increasing the number of wordline rows that share sleep transistors increases the leakage reduction and reduces the area overhead
- Leakage power reduction varies from 10X to 100X as 1 to 10 wordlines share the same sleep transistors
  - 2~10X more leakage reduction compared to the zig-zag scheme
Wakeup Latency

- To benefit the most from the leakage savings of stacking sleep transistors, keep the bias voltage of the NMOS sleep transistor as low as possible (and for the PMOS, as high as possible)
  - Drawback: impact on the wakeup latency of the wordline drivers
- Controlling the gate voltage of the sleep transistors trades a reduction in the leakage power savings against a reduction in the circuit wakeup delay overhead
  - Increasing the gate voltage of the footer sleep transistor reduces the virtual ground voltage (V_M)
Wakeup Delay vs. Leakage Power Reduction

[Figure: normalized leakage power and normalized wakeup delay as a function of the (footer, header) gate bias voltage pair, from (0, 1.08) through (0.30, 0.78); the opposing curves show the trade-off between wakeup overhead and leakage power saving.]

- Increasing the bias voltage increases the leakage power while decreasing the wakeup delay overhead
Multiple Sleep Modes

power mode   wakeup delay (cycles)   leakage reduction (%)
basic-lp     1                       42%
lp           2                       75%
aggr-lp      3                       81%
ultra-lp     4                       90%

- Power overhead of waking up the peripheral circuits:
  - Almost equivalent to the switching power of the sleep transistors
  - Sharing a set of sleep transistors horizontally and vertically for multiple stages of a (wordline) driver makes the power overhead even smaller
Reducing Leakage in L1 Data Cache

- To maximize the leakage reduction in the DL1 cache:
  - put the DL1 peripherals into the ultra low power mode
  - adds 4 cycles to the DL1 latency
    - significantly reduces performance
- To minimize performance degradation:
  - put the DL1 peripherals into the basic low power mode
  - requires only one cycle to wake up, and this latency can be hidden during the address computation stage, thus not degrading performance
  - but yields no noticeable leakage power reduction
Motivation for Dynamically Controlling Sleep Mode

- Dynamically adjust the peripheral circuits' sleep power mode:
  - Periods of frequent access: basic-lp mode
    - low performance impact benefit
  - Periods of infrequent access: ultra and aggressive low power modes
    - large leakage reduction benefit
Reducing DL1 Wakeup Delay

- Whether an instruction is a load or a store can be determined at least one cycle prior to cache access
  - so the DL1 peripherals can be woken up one cycle prior to the access
- Accessing the DL1 while its peripherals are in basic-lp mode therefore doesn't require an extra cycle
  - Put the DL1 in basic-lp mode by default
- For all other low-power modes, one cycle of the wakeup delay can be hidden
  - reducing the effective wakeup delay by one cycle
Architectural Motivation

- A load miss in the L1/L2 caches takes a long time to service
  - it prevents dependent instructions from being issued
- When dependent instructions cannot issue, performance is lost
  - At the same time, energy is lost as well!
- This is an opportunity to save energy
Low-end Architecture

[Figure: DL1 sleep-mode state machine. basic-lp moves to lp on a pending DL1 miss, to aggr-lp on additional DL1 misses (DL1 miss++), and to ultra-lp on a processor stall; the controller returns to basic-lp once all DL1 misses are serviced and the processor continues.]

- Given a miss service time of 30 cycles, it is likely that the processor stalls during the miss service period
- Occurrence of additional cache misses while one DL1 cache miss is already pending further increases the chance of a pipeline stall
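A sketch of this controller (hypothetical Python; the transition triggers follow the state diagram above, with mode data taken from the multiple-sleep-modes table):

# mode -> (wakeup delay in cycles, peripheral leakage reduction)
MODES = {"basic-lp": (1, 0.42), "lp": (2, 0.75),
         "aggr-lp": (3, 0.81), "ultra-lp": (4, 0.90)}

class LowEndDL1SleepController:
    def __init__(self):
        self.mode = "basic-lp"         # default mode; 1-cycle wakeup is hidden
        self.pending_misses = 0        # outstanding DL1 misses

    def on_dl1_miss(self):
        self.pending_misses += 1
        # one pending miss -> lp; additional misses (miss++) -> aggr-lp
        self.mode = "lp" if self.pending_misses == 1 else "aggr-lp"

    def on_processor_stall(self):
        if self.pending_misses > 0:    # stalled during the ~30-cycle service
            self.mode = "ultra-lp"

    def on_dl1_miss_serviced(self):
        self.pending_misses -= 1
        if self.pending_misses == 0:   # all misses serviced, processor continues
            self.mode = "basic-lp"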
Low Power Modes in a 2KB DL1 Cache

[Figure: fraction of total execution time the DL1 cache spends in each power mode (hp, trivial-lp, lp, aggr-lp, ultra-lp).]

- 85% of the time, the DL1 peripherals are put into low power modes
- Most of the time is spent in the basic-lp mode (58% of total execution time)
Low Power Modes in Low-End Architecture

[Figure: (a) performance degradation and (b) frequency of the different low power modes (hp, basic-lp, lp, aggr-lp, ultra-lp) for 2KB, 4KB, 8KB, and 16KB DL1 caches.]

- Increasing the cache size reduces the DL1 cache miss rate
  - Reduces opportunities to put the cache into the more aggressive low power modes
  - Reduces the performance degradation for larger DL1 caches
High-end Architecture

[Figure: DL1 sleep-mode state machine for the high-end core: basic-lp moves to lp on pending DL1 miss(es) and to ultra-lp on an L2 miss; the controller returns to basic-lp once the L2 miss (or the DL1 miss) is serviced.]

- The DL1 transitions to ultra-lp mode right after an L2 miss occurs
  - Given the long L2 cache miss service time (80 cycles), the processor will stall waiting for memory
- The DL1 returns to the basic-lp mode once the L2 miss is serviced
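The high-end controller differs mainly in its trigger for the deepest mode; a hedged sketch along the same lines as the low-end one:

class HighEndDL1SleepController:
    def __init__(self):
        self.mode = "basic-lp"

    def on_dl1_miss(self):
        if self.mode == "basic-lp":
            self.mode = "lp"           # pending DL1 miss/es

    def on_l2_miss(self):
        self.mode = "ultra-lp"         # ~80-cycle service: the core will stall

    def on_l2_miss_serviced(self):
        self.mode = "basic-lp"         # return once the L2 miss is serviced

    def on_dl1_misses_serviced(self):
        if self.mode == "lp":          # DL1 misses that hit in L2 completed
            self.mode = "basic-lp"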
Leakage Power Reduction

[Figure: DL1 leakage power reduction broken down by low power mode (trivial-lp, lp, aggr-lp, ultra-lp), per benchmark and on average.]

- DL1 leakage is reduced by almost 50% on average
- While ultra-lp mode occurs much less frequently than basic-lp mode, its leakage reduction is comparable to that of basic-lp mode
  - In ultra-lp mode the peripheral leakage is reduced by 90%, almost twice that of basic-lp mode
Conclusion

- Highlighted the large leakage power dissipation in SRAM peripheral circuits
- Proposed zig-zag share to reduce leakage in SRAM peripheral circuits
- Extended zig-zag share with multiple sleep modes, which trade off leakage power reduction against wakeup delay overhead
- Applied the multiple sleep modes technique to the L1 cache of an embedded processor
- Presented the resulting leakage power reduction