Power, Temperature, Reliability and Performance


Power, Temperature, Reliability and Performance-Aware Optimizations in On-Chip SRAMs
Houman Homayoun
PhD Candidate
Dept. of Computer Science, UC Irvine
Outline

Past Research

- Low Power Design
  - Power Management in Cache Peripheral Circuits (CASES-2008, ICCD-2008, ICCD-2007, TVLSI, CF-2010)
  - Clock Tree Leakage Power Management (ISQED-2010)
- Thermal-Aware Design
  - Thermal Management in Register File (HiPEAC-2010)
- Reliability-Aware Design
  - Process Variation Aware Cache Architecture for Aggressive Voltage-Frequency Scaling (DATE-2009, CASES-2009)
- Performance Evaluation and Improvement
  - Adaptive Resource Resizing for Improving Performance in Embedded Processors (DAC-2008, LCTES-2008)
2
Outline

Current Research

- Inter-core Selective Resource Pooling in 3D Chip Multiprocessors
- Extend Previous Work (for Journal Publication!!)
3
Leakage Power Management in Cache Peripheral Circuits

Outline: Leakage Power in Cache Peripherals

- L2 cache power dissipation
- Why cache peripherals?
- Circuit techniques to reduce leakage in peripherals (ICCD-08, TVLSI)
- Study of a static approach to reduce leakage in the L2 cache (ICCD-07)
- Study of adaptive techniques to reduce leakage in the L2 cache (ICCD-08)
- Reducing leakage in the L1 cache (CASES-2008)
5
On-chip Caches and Power

- On-chip caches in high-performance processors are large
  - more than 60% of the chip budget
- They dissipate a significant portion of their power via leakage
  - Much of it used to be in the SRAM cells
  - Many architectural techniques have been proposed to remedy this
- Today there is also significant leakage in the peripheral circuits of an SRAM (cache)
  - in part because cell design has already been heavily optimized

[Figure: Pentium M processor die photo, courtesy of intel.com]
6
Peripherals?

[Figure: SRAM organization showing the address input global drivers, predecoder and global wordline drivers, row decoder, local wordlines, bitlines, sense amps, and global output drivers]

- Data Input/Output Drivers
- Address Input/Output Drivers
- Row Pre-decoder
- Wordline Drivers
- Row Decoder
- Others: sense-amps, bitline pre-chargers, memory cells, decoder logic
7
Why Peripherals?

[Figure: leakage per device (pW, log scale) for a memory cell vs. inverters of increasing drive strength (INV1X up to INV32X); the peripheral inverters leak roughly 200X to 6300X more than a memory cell]

- Minimal-sized transistors are used in the cells for area reasons, while larger, faster, and accordingly leakier transistors are used in the peripherals to satisfy timing requirements.
- The cells use high-Vt transistors, whereas the peripherals use typical threshold voltage transistors.
8
Leakage Power Components of L2 Cache

[Pie chart: breakdown of L2 cache leakage]
- local row decoders: 33%
- global data output drivers: 25%
- global data input drivers: 14%
- global address input drivers: 11%
- local data output drivers: 8%
- global row predecoder: 1%
- others: 8%

SRAM peripheral circuits dissipate more than 90% of the total leakage power
9
Leakage Power as a Fraction of L2 Power Dissipation

[Bar chart: leakage vs. dynamic power breakdown of the L2 cache across SPEC CPU2000 benchmarks]

L2 cache leakage power dominates its dynamic power: above 87% of the total
10
Circuit Techniques Addressing Leakage in the SRAM Cell

- Gated-Vdd, Gated-Vss
- Voltage Scaling (DVFS)
- ABB-MTCMOS
- Forward Body Biasing (FBB), RBB
- Sleepy Stack
- Sleepy Keeper

All target the SRAM memory cell
11
Architectural Techniques

- Way Prediction, Way Caching, Phased Access
  - Predict or cache recently accessed ways, read the tag first
- Drowsy Cache
  - Keeps cache lines in a low-power state, with data retention
- Cache Decay
  - Evict lines not used for a while, then power them down
- Applying DVS, Gated-Vdd, Gated-Vss to the memory cell
  - Much architectural support is needed to do that

All target the cache SRAM memory cells
12
Multiple Sleep Mode Zig-Zag Horizontal and Vertical Sleep Transistor Sharing

Sleep Transistor Stacking Effect

- Subthreshold current is an inverse exponential function of the threshold voltage:

  $V_T = V_{T0} + \gamma\left(\sqrt{2\phi_F + V_{SB}} - \sqrt{2\phi_F}\right)$

- Stacking transistor N with slpN:
  - The source-to-body voltage (VM) of transistor N increases, which reduces its subthreshold leakage current when both transistors are off
- Drawbacks: rise time, fall time, wakeup delay, area, dynamic power, instability

[Figure: inverter driving CL with a footer sleep transistor slpN; the virtual ground node VM sits between transistor N and slpN]
14
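As a rough numeric illustration of the stacking effect, the minimal sketch below estimates how raising the virtual ground voltage VM increases the effective threshold and shrinks the subthreshold leakage; the device parameters and the simple subthreshold model are illustrative assumptions, not values from the slides.

import math

# Illustrative device parameters (assumptions, not from the slides)
V_T0 = 0.30            # zero-bias threshold voltage (V)
gamma = 0.20           # body-effect coefficient (V^0.5)
phi_F = 0.40           # Fermi potential (V)
n, kT_q = 1.5, 0.026   # subthreshold slope factor and thermal voltage (V)

def threshold(v_sb):
    """Body effect: V_T = V_T0 + gamma*(sqrt(2*phi_F + V_SB) - sqrt(2*phi_F))."""
    return V_T0 + gamma * (math.sqrt(2 * phi_F + v_sb) - math.sqrt(2 * phi_F))

def leakage_ratio(v_m):
    """Subthreshold leakage relative to VM = 0 (DIBL ignored): raising the
    virtual ground voltage VM both raises V_T and lowers V_GS by VM."""
    d_vt = threshold(v_m) - threshold(0.0)
    return math.exp(-(d_vt + v_m) / (n * kT_q))

for vm in (0.0, 0.05, 0.10, 0.15):
    print(f"VM = {vm:.2f} V -> leakage x{leakage_ratio(vm):.3f} of the unstacked value")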
A Redundant Circuit Approach

[Figure: four-stage wordline driver (P1/N1 through P4/N4) with a PMOS sleep transistor (slpP) in series with every pull-up and an NMOS sleep transistor (slpN) in series with every pull-down, all controlled by the sleep signal; VM is the virtual ground node]

Drawback: impact on the wordline driver output rise time, fall time, and propagation delay
15
Impact on Rise Time and Fall Time

- The rise time and fall time of an inverter's output are proportional to Rpeq * CL and Rneq * CL, respectively
- Inserting the sleep transistors increases both Rneq and Rpeq
  - The increase in rise time impacts performance
  - The increase in fall time impacts memory functionality

[Figure: two driver stages with series sleep transistors slpP1/slpP2 and slpN1/slpN2 and their leakage paths]
16
A Zig-Zag Circuit

[Figure: four-stage driver with sleep transistors inserted in a zig-zag pattern: NMOS sleep transistors (slpN) under the odd stages and PMOS sleep transistors (slpP) above the even stages, all controlled by the sleep signal]

- Rpeq for the first and third inverters and Rneq for the second and fourth inverters do not change
- The fall time of the circuit does not change
17
A Zig-Zag Share Circuit

- To improve the leakage reduction and area-efficiency of the zig-zag scheme, one set of sleep transistors is shared between multiple stages of inverters
  - Zig-Zag Horizontal Sharing
  - Zig-Zag Horizontal and Vertical Sharing
18
Zig-Zag Horizontal Sharing

- Comparing zz-hs with the zig-zag scheme at the same area overhead:
  - zz-hs has less impact on rise time
  - both reduce leakage by almost the same amount

[Figure: two horizontally adjacent driver stages sharing a single, double-width footer sleep transistor (2x slpN); the shared device has half the effective resistance, $R_{nslp\text{-}zz\text{-}hs} = R_{nslp\text{-}zz} / 2$, and VM is the shared virtual ground node]
19
Zig-Zag Horizontal and Vertical Sharing

[Figure: wordline driver lines K and K+1 (stages P11/N11 ... P24/N24) sharing one slpP header and one slpN footer across all stages; VM is the shared virtual ground node]
20
Leakage Reduction of ZZ Horizontal and Vertical Sharing

[Figure: (a) a single driver stack (N11) discharging through the shared footer sleep transistor slpN vs. (b) two stacks (N11 and N21) discharging through the same slpN; the virtual ground voltage rises from VM1 to VM2]

- An increase in the virtual ground voltage increases the leakage reduction
- VM2 follows the same expression as VM1 with W_N11 replaced by 2*W_N11 (two stacks leak through the single sleep transistor of width W_slpN), so VM2 > VM1 and the shared circuit reduces leakage further
21
ZZ-HVS Evaluation: Power Results

[Figure: leakage power (nW, log scale) vs. the number of wordline rows sharing sleep transistors, for the baseline, redundant, zigzag, zz-hs, and zz-hvs circuits]

- Increasing the number of wordline rows that share sleep transistors increases the leakage reduction and reduces the area overhead
- The leakage power reduction varies from 10X to 100X when 1 to 10 wordlines share the same sleep transistors
  - 2~10X more leakage reduction compared to the zig-zag scheme
22
Wakeup Latency

- To benefit the most from the leakage savings of stacking sleep transistors:
  - keep the gate bias voltage of the NMOS sleep transistor as low as possible (and as high as possible for the PMOS)
  - Drawback: impact on the wakeup latency of the wordline drivers
- Controlling the gate voltage of the sleep transistors trades off a reduction in the leakage power savings against a reduction in the circuit wakeup delay overhead
  - Increasing the gate voltage of the footer sleep transistor reduces the virtual ground voltage (VM)
23
Wakeup Delay vs. Leakage Power Reduction

[Figure: normalized leakage power and normalized wake-up delay for (footer, header) gate bias voltage pairs ranging from (0, 1.08) to (0.30, 0.78); there is a trade-off between the wakeup overhead and the leakage power saving]

Increasing the bias voltage increases the leakage power while decreasing the wakeup delay overhead
24
Multiple Sleep Modes

power mode   wakeup delay (cycles)   leakage reduction (%)
basic-lp     1                       42%
lp           2                       75%
aggr-lp      3                       81%
ultra-lp     4                       90%

Power overhead of waking up the peripheral circuits:
- Almost equivalent to the switching power of the sleep transistors
- Sharing a set of sleep transistors horizontally and vertically for multiple stages of a (wordline) driver makes the power overhead even smaller
25
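To make the mode trade-off concrete, the small sketch below estimates the idle-period length at which each sleep mode starts paying off; the per-mode leakage reductions come from the table above, while the per-cycle leakage energy and the wakeup energy (which the slides equate with the sleep transistors' switching power) are placeholder values.

# Leakage reduction per mode, from the table above
MODES = {"basic-lp": 0.42, "lp": 0.75, "aggr-lp": 0.81, "ultra-lp": 0.90}

def net_saving(mode, idle_cycles, leak_energy_per_cycle=1.0, wakeup_energy=50.0):
    """Energy saved by sleeping for idle_cycles minus the wakeup cost
    (both energy constants are placeholders, not measured numbers)."""
    return MODES[mode] * leak_energy_per_cycle * idle_cycles - wakeup_energy

for mode, reduction in MODES.items():
    break_even = 50.0 / reduction      # idle length at which sleeping pays off
    print(f"{mode}: worthwhile for idle periods longer than ~{break_even:.0f} cycles")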
Reducing Leakage in L2 Cache
Peripheral Circuits Using Zig-Zag
Share Circuit Technique
Static Architectural Techniques: SM

SM Technique (ICCD'07)
- Asserts the sleep signal by default
- Wakes up the L2 peripherals on an access to the cache
- Keeps the cache in the normal state for J cycles (turn-on period) before returning it to stand-by mode (SM_J)
  - No wakeup penalty during this period
  - A larger J leads to lower performance degradation but also lower energy savings
27
Static Architectural Techniques: IM

IM Technique (ICCD'07)
- Monitors the issue logic and functional units of the processor after an L2 cache miss. Asserts the sleep signal if the issue logic has not issued any instructions and the functional units have not executed any instructions for K consecutive cycles (K=10)
- De-asserts the sleep signal M cycles before the miss is serviced
- No performance loss
28
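A cycle-level sketch of the two static controllers as described above; the parameter names J, K, and M follow the slides, while the per-cycle inputs (access, issue, and execute flags and the miss-service countdown) are a simplified interface assumed here for illustration.

class SMController:
    """SM: sleep by default; an access wakes the peripherals and keeps them
    in the normal state for J cycles before returning to stand-by."""
    def __init__(self, J=750):
        self.J = J
        self.on_counter = 0               # remaining cycles in the turn-on period

    def tick(self, l2_accessed: bool) -> bool:
        if l2_accessed:
            self.on_counter = self.J      # wake up and restart the turn-on period
        elif self.on_counter > 0:
            self.on_counter -= 1
        return self.on_counter == 0       # True -> sleep signal asserted


class IMController:
    """IM: after an L2 miss, assert sleep once the issue logic and functional
    units have been idle for K consecutive cycles; de-assert the sleep signal
    M cycles before the miss is serviced."""
    def __init__(self, K=10, M=2):
        self.K, self.M = K, M
        self.idle_run = 0

    def tick(self, l2_miss_pending: bool, issued: bool, executed: bool,
             cycles_until_serviced: int) -> bool:
        if not l2_miss_pending:
            self.idle_run = 0
            return False
        self.idle_run = 0 if (issued or executed) else self.idle_run + 1
        if cycles_until_serviced <= self.M:   # wake up ahead of the data return
            return False
        return self.idle_run >= self.K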
More Insight on SM and IM

[Bar chart: L2 peripheral leakage reduction of IM vs. SM-750 across SPEC CPU2000 benchmarks]

- For some benchmarks the SM and IM techniques are both effective: facerec, gap, perlbmk, and vpr
- IM works well in almost half of the benchmarks but is ineffective in the other half
- SM works well in about one half of the benchmarks, but not the same benchmarks as IM
- An adaptive technique combining IM and SM has the potential to deliver an even greater power reduction
29
Which Technique Is the Best and When?

benchmark   DL1 miss rate   L2 miss rate   L1xL2 miss rates x 10K
ammp        0.05            0.19           96.11
applu       0.06            0.66           368.03
apsi        0.03            0.28           75.01
art         0.41            0.00           0.41
bzip2       0.02            0.04           7.09
crafty      0.00            0.01           0.17
eon         0.00            1.00           0.00
equake      0.02            0.67           124.36
facerec     0.03            0.31           86.11
galgel      0.04            0.01           2.11
gap         0.01            0.55           38.54
gcc         0.05            0.04           16.88
gzip        0.01            0.05           3.28
lucas       0.10            0.67           645.73
mcf         0.24            0.43           1023.88
mesa        0.00            0.27           8.02
mgrid       0.04            0.46           165.13
parser      0.02            0.07           13.76
perlbmk     0.01            0.46           22.88
sixtrack    0.01            0.00           0.14
swim        0.09            0.63           561.41
twolf       0.05            0.00           0.16
vortex      0.00            0.23           6.94
vpr         0.02            0.15           33.95
wupwise     0.02            0.68           122.40
Average     0.05            0.31           136.50

- The miss rate product (MRP) may be a good indicator of the cache behavior
- For the L2 to be idle:
  - there are few L1 misses, or
  - many L2 misses are waiting for memory
30
The Adaptive Techniques

- Adaptive Static Mode (ASM)
  - MRP measured only once, during an initial learning period (the first 100M committed instructions)
    - MRP > A  -> IM (A=90)
    - MRP <= A -> SM_J
    - Initial technique: SM_J
- Adaptive Dynamic Mode (ADM)
  - MRP measured continuously over a K-cycle period (K is 10M); choose IM or SM for the next 10M cycles
    - MRP > A       -> IM (A=100)
    - A >= MRP > B  -> SM_N (B=200)
    - otherwise     -> SM_P
31
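A compact sketch of the ASM decision rule (the MRP definition and the A = 90 threshold follow the slides; the function names and the worked example are illustrative):

def miss_rate_product(dl1_miss_rate: float, l2_miss_rate: float) -> float:
    """MRP as used on the previous slide: DL1 miss rate x L2 miss rate x 10K."""
    return dl1_miss_rate * l2_miss_rate * 10_000

def asm_select(mrp: float, A: float = 90) -> str:
    """ASM: MRP measured once over the first 100M committed instructions,
    then one static technique is used for the rest of the run."""
    return "IM" if mrp > A else "SM_J"

# Example with mcf's rates from the table on the previous slide:
print(asm_select(miss_rate_product(0.24, 0.43)))   # high MRP -> "IM"
print(asm_select(miss_rate_product(0.41, 0.00)))   # art: near-zero MRP -> "SM_J"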
More Insight on ASM and ADM

- ASM attempts to find the more effective static technique per benchmark by profiling a small subset of the program
- ADM is more complex and attempts to find the more effective static technique at a finer granularity of 10M-cycle intervals, based on profiling the previous interval
32
Compare ASM with IM and SM

[Bar chart: fraction of IM (ASM-IM) and SM (ASM-SM) contribution under ASM_750 across SPEC CPU2000 benchmarks]

- A small subset of a program can be used to identify the L2 cache behavior: whether it is accessed very infrequently or is idle because the processor is idle
- For most benchmarks ASM correctly selects the more effective static technique
  - Exception: equake
33
ADM Results

[Bar chart: fraction of IM (ADM_IM) and SM (ADM_SM) contribution under ADM across SPEC CPU2000 benchmarks]

- For many benchmarks both IM and SM make a noticeable contribution
  - ADM is effective in combining IM and SM
- For some benchmarks either the IM or the SM contribution is negligible
  - ADM selects the best static technique
34
Power Results

[Bar charts: (a) leakage power savings and (b) total energy-delay reduction of ASM and ADM across SPEC CPU2000 benchmarks]

- 2~3X more leakage power reduction and less performance loss compared to the static approaches
- The leakage reduction using ASM and ADM is 34% and 52%, respectively
- The overall energy-delay reduction is 29.4% and 45.5%, respectively, using ASM and ADM
35
RELOCATE: Register File Local Access Pattern Redistribution Mechanism for Power and Thermal Management in Out-of-Order Embedded Processors
Outline

- Motivation
- Background study
- Study of register file underutilization
- Study of register file default access patterns
- Access concentration and activity redistribution to relocate register file access patterns
- Results
37
Why Register File?

- The RF is one of the hottest units in a processor
  - a small, heavily multi-ported SRAM
  - accessed very frequently
- Example: IBM PowerPC 750FX
38
Prior Work: Activity Migration

- Reduces temperature by migrating the activity to a replicated unit
- Requires a replicated unit
  - large area overhead
  - leads to a large performance degradation

[Figure: temperature over time for activity migration (AM) and activity migration with power gating (AM+PG); the unit cools during idle periods, from T_crisis toward T_ambient, faster when power gated]
39
Conventional Register Renaming

[Figure: register renamer with a free list, an active list, and head/tail pointers mapping architectural to physical registers]

Instruction #   Original code    Renamed code
1               RA <- ...        PR1 <- ...
2               .... <- RA       .... <- PR1
3               branch to _L     branch to _L
4               RA <- ...        PR4 <- ...
5               ... ...          ... ...
6               _L:              _L:
7               .... <- RA       .... <- PR1

Physical register allocation-release: registers are allocated/released in a somewhat random order
40
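A toy free-list renamer that reproduces the example above; it is only meant to show why, with conventional renaming, physical registers end up allocated and released in a fairly arbitrary order (class and parameter names are illustrative):

from collections import deque

class Renamer:
    """Minimal rename stage: a free list of physical registers plus the
    current architectural-to-physical mapping."""
    def __init__(self, num_phys=8):
        self.free = deque(f"PR{i}" for i in range(1, num_phys + 1))
        self.map = {}      # architectural register -> physical register
        self.old = {}      # physical register -> previous mapping, freed at commit

    def rename_dest(self, arch):
        phys = self.free.popleft()           # allocate the next free physical register
        self.old[phys] = self.map.get(arch)  # the old version dies when this commits
        self.map[arch] = phys
        return phys

    def rename_src(self, arch):
        return self.map[arch]                # sources read the current mapping

    def commit(self, phys):
        prev = self.old.pop(phys, None)
        if prev:
            self.free.append(prev)           # release order depends on commit order

r = Renamer()
print(r.rename_dest("RA"))   # instruction 1: RA <- ...  becomes  PR1 <- ...
print(r.rename_src("RA"))    # instruction 2: ... <- RA  reads PR1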
Analysis of Register File Operation: Register File Occupancy

[Figure: breakdown of execution time by RF occupancy (<16, 16-32, 32-48, and 48-64 entries) for (a) MiBench and (b) SPECint2K benchmarks]
41
Performance Degradation with a Smaller RF

[Figure: performance degradation with 48-, 32-, and 16-entry register files for (a) MiBench and (b) SPECint2K benchmarks]
42
Analysis of Register File Operation: Register File Access Distribution

- The coefficient of variation (CV) shows the "deviation" from the average number of accesses across individual physical registers:

  $CV_{access} = \dfrac{\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(na_i - \overline{na}\right)^2}}{\overline{na}}$

- na_i is the number of accesses to physical register i during a specific period (10K cycles); na-bar is the average
- N is the total number of physical registers
43
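The same statistic written out as a short sketch (the register count and the access vectors below are made-up inputs, chosen only to show that uniform access gives CV = 0 while concentrated access drives CV up):

import math

def cv_access(accesses_per_register):
    """Coefficient of variation of per-physical-register access counts
    over one sampling period (10K cycles on the slide)."""
    n = len(accesses_per_register)
    mean = sum(accesses_per_register) / n
    std = math.sqrt(sum((a - mean) ** 2 for a in accesses_per_register) / n)
    return std / mean

print(cv_access([10] * 64))                       # perfectly uniform -> 0.0
print(round(cv_access([40] * 16 + [0] * 48), 2))  # same traffic on 16 registers -> ~1.73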
Coefficient of Variation

[Figure: percentage coefficient of variation of RF accesses for (a) MiBench and (b) SPEC2K benchmarks]
44
Register File Operation

- Underutilization is distributed uniformly: while only a small number of registers are occupied at any given time, the total accesses are uniformly distributed over the entire physical register file during the course of execution
45
RELOCATE: Access Redistribution within a Register File

- The goal is to "concentrate" accesses within one partition (region) of the RF
- Some regions will then be idle (for 10K cycles)
  - They can be power gated and allowed to cool down

[Figure: register activity in (a) the baseline, (b) in-order, and (c) distant redistribution patterns; activity rotates among partitions P1-P4, leaving the remaining regions idle]
46
An Architectural Mechanism for Access Redistribution

- Active partition: a register renamer partition currently used in register renaming
- Idle partition: a register renamer partition which does not participate in renaming
- Active region: a region of the register file corresponding to a register renamer partition (whether active or idle) which has live registers
- Idle region: a region of the register file corresponding to a register renamer partition (whether active or idle) which has no live registers
47
Activity Migration without Replication

- An access concentration mechanism allocates registers from only one partition
- This default active partition (DAP) may run out of free registers before the 10K-cycle "convergence period" is over
  - another partition (chosen according to some algorithm) is then activated (referred to as an additional active partition, or AAP)
  - To facilitate physical register concentration in the DAP, if two or more partitions are active and have free registers, allocation is performed in the same order in which the partitions were activated
48
The Access Concentration Mechanism

[Figure: four register renamer partitions P1-P4, each with its own free list and active list; partition activation order is 1-3-2-4, with free lists full and active lists empty initially]
49
The Redistribution Mechanism

- The default active partition is changed once every N cycles to redistribute the activity within the register file (according to some algorithm)
- Once a new default partition (NDP) is selected, all active partitions (DAP + AAPs) become idle
- The idle partitions do not participate in register renaming, but their corresponding RF regions may have to be kept active (powered up)
  - A physical register in an idle partition may still be live
- An idle RF region is power gated when its active list becomes empty
50
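A behavioral sketch of the concentration and redistribution policy from the last three slides; the round-robin choice of the next default partition, the partition sizes, and the interval are illustrative stand-ins for the "some algorithm" mentioned on the slides.

from collections import deque

class PooledRenamer:
    """RELOCATE-style renaming: allocate from the default active partition (DAP)
    first, activate additional partitions (AAP) only when it runs dry, and rotate
    the DAP every `interval` cycles so regions with no live registers can be gated."""

    def __init__(self, num_partitions=4, regs_per_partition=16, interval=10_000):
        self.rpp = regs_per_partition
        self.free = [deque(range(p * regs_per_partition, (p + 1) * regs_per_partition))
                     for p in range(num_partitions)]
        self.live = [set() for _ in range(num_partitions)]
        self.order = [0]                    # DAP followed by AAPs, in activation order
        self.interval, self.cycle = interval, 0

    def allocate(self):
        for p in self.order:                # concentrate allocation in the DAP first
            if self.free[p]:
                reg = self.free[p].popleft()
                self.live[p].add(reg)
                return reg
        for p in range(len(self.free)):     # DAP and AAPs exhausted: activate another
            if p not in self.order:
                self.order.append(p)
                return self.allocate()
        raise RuntimeError("no free physical registers")

    def release(self, reg):
        p = reg // self.rpp
        self.live[p].discard(reg)
        self.free[p].append(reg)

    def tick(self):
        self.cycle += 1
        if self.cycle % self.interval == 0:       # redistribution: pick a new DAP
            ndp = (self.order[0] + 1) % len(self.free)
            self.order = [ndp]                    # the old DAP/AAPs become idle

    def power_gated_regions(self):
        """Idle regions whose active list is empty (no live registers)."""
        return [p for p in range(len(self.free))
                if p not in self.order and not self.live[p]]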
Performance Impact?

- There is a two-cycle delay to wake up a power-gated physical register region
- Register renaming occurs in the front end of the microprocessor pipeline, whereas the register access occurs in the back end
  - There is a delay of at least two pipeline stages between renaming and accessing a physical register
  - So a required register file region can be woken up in time, without incurring a performance penalty at the time of access
51
Results: MiBench RF Power Reduction

[Bar chart: RF power reduction across MiBench benchmarks for num_partition = 2, 4, and 8]
52
Results: SPEC2K RF Power Reduction

[Bar chart: RF power reduction across SPEC2K benchmarks for num_partition = 2, 4, and 8]
53
Analysis of Power Reduction

- Increasing the number of RF partitions provides more opportunity to capture and cluster unmapped registers into a partition
  - Indicates that the wakeup overhead is amortized over a larger number of partitions
- Some exceptions:
  - the overall power overhead associated with waking up an idle region becomes larger as the number of partitions increases
  - frequent but ineffective power gating, and its overhead, as the number of partitions increases
54
Peak Temperature Reduction

Table 1. Peak temperature reduction for MiBench benchmarks

benchmark       base temperature (C)   reduction, 2P (C)   4P (C)   8P (C)
basicMath       94.3                   3.6                 4.8      5.0
bc              95.4                   3.8                 4.4      5.2
crc             92.8                   5.3                 6.0      6.0
dijkstra        98.4                   6.3                 6.8      6.4
djpeg           96.3                   2.8                 3.5      2.4
fft             94.5                   6.8                 7.4      7.6
gs              89.8                   6.5                 7.4      9.7
gsm             92.3                   5.8                 6.7      6.9
lame            90.6                   6.2                 8.5      11.3
mad             93.3                   3.8                 4.3      2.2
patricia        79.2                   11.0                12.4     13.2
qsort           88.3                   10.1                11.6     11.9
search          93.8                   8.7                 9.3      9.1
sha             90.1                   5.1                 5.4      4.5
susan_corners   92.7                   4.7                 5.3      5.1
susan_edges     91.9                   3.7                 5.8      6.3
tiff2bw         98.5                   4.5                 5.9      4.1
average         92.5                   5.6                 6.8      6.9

Table 2. Peak temperature reduction for SPEC2K integer benchmarks

benchmark       base temperature (C)   reduction, 2P (C)   4P (C)   8P (C)
bzip2           92.7                   4.8                 3.9      3.1
crafty          83.6                   9.5                 11       10.4
eon             77.3                   10.6                12.4     12.5
galgel          89.4                   6.9                 7.2      5.8
gap             86.7                   4.8                 5.9      7.1
gcc             79.8                   7.9                 9.4      10.1
gzip            95.4                   3.2                 3.8      3.9
mcf             85.8                   6.9                 8.7      9.4
parser          97.8                   4.3                 5.8      4.8
perlbmk         85.8                   10.6                12.3     12.6
twolf           86.2                   8.8                 10.2     10.5
vortex          81.7                   11.3                12.5     12.9
vpr             94.6                   4.9                 5.2      4.4
average         87.4                   7.2                 8.3      8.2
55
Analysis of Temperature Reduction

- Increasing the number of partitions results in a larger power density in each partition, because the RF access activity is concentrated in a smaller partition
- While capturing more idle partitions and power gating them may potentially result in a higher power reduction, the larger power density due to the smaller partition size results in an overall higher temperature
56
Adaptive Resource Resizing for
Improving Performance in
Embedded Processor
Introduction

- Technology scaling into the ultra deep submicron has allowed hundreds of millions of gates to be integrated onto a single chip
- Restrictions on the power budget and on practically achievable operating clock frequencies are limiting factors
- Designers have ample silicon budget to add more processor resources to exploit application parallelism and improve performance
- Increasing the register file (RF) size increases its access time, which reduces processor frequency
- Dynamically resizing the RF in tandem with dynamic frequency scaling (DFS) significantly improves performance
58
Motivation for Increasing RF Size

- After a long-latency L2 cache miss the processor executes some independent instructions but eventually ends up stalled
  - After an L2 cache miss, one of the ROB, IQ, RF, or LQ/SQ fills up and the processor stalls until the miss is serviced

[Bar chart: frequency of stalls due to L2 cache misses in the PowerPC 750FX architecture, across SPEC CPU2000 benchmarks]

- With larger resources it is less likely that these resources will fill up completely during the L2 cache miss service time, which can potentially improve performance
- The sizes of the resources have to be scaled up together; otherwise the non-scaled ones become a performance bottleneck
59
Impact of Increasing RF Size

- Increasing the size of the RF (as well as the ROB, LQ, and IQ)
  - can potentially increase processor performance by reducing the occurrence of idle periods
  - has a critical impact on the achievable processor operating frequency
    - the RF decides the maximum achievable operating frequency

[Figure: breakdown of RF component delay (input driver, decoder, wordline, bitline, sense amp, output driver) for RF-24, RF-32, and RF-48]

- There is a significant increase in bitline delay when the size of the RF increases
60
Analysis of RF Component Access Delay

- The equivalent capacitance on the bitline is Ceq = N * (diffusion capacitance of the pass transistors) + wire capacitance (usually 10% of the total diffusion capacitance), where N is the total number of rows
- As the number of rows increases, the equivalent bitline capacitance also increases and therefore the propagation delay increases (see the sketch after the table)

Reduction in clock frequency with increasing resource size:

Processor Configuration   Baseline   Conf_1   Conf_2
RF size                   24         32       48
ROB size                  16         24       32
IQ size                   8          12       24
RF access time (ns)       1.67       1.76     1.92
Operating Freq (MHz)      595        568      520
61
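A back-of-the-envelope sketch of the bitline-loading argument above; the per-cell diffusion capacitance and the drive resistance are placeholder values, so only the relative delays are meaningful.

def bitline_delay(num_rows, c_diff_per_cell=1.0, r_drive=1.0):
    """Ceq = N * diffusion cap of the pass transistors + wire cap
    (taken here as 10% of the total diffusion cap); delay ~ R * Ceq."""
    c_eq = num_rows * c_diff_per_cell * 1.10
    return r_drive * c_eq

base = bitline_delay(24)
for rows in (24, 32, 48):
    print(f"{rows}-entry RF: bitline delay x{bitline_delay(rows) / base:.2f} vs. 24 entries")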
Impact on Execution Time

- The execution time increases with the larger resource sizes

[Bar chart: execution time of Conf-1 and Conf-2, normalized to the baseline architecture, for the configurations with reduced operating frequency]

- There is a trade-off between
  - larger resources (and hence fewer idle periods), and
  - a lower clock frequency
- The latter becomes more important and plays the major role in deciding the performance in terms of execution time
62
Dynamic Register File Resizing

- Dynamic RF scaling based on L2 cache misses
  - allows the processor to use a smaller RF (with a lower access time) during periods with no pending L2 cache miss (normal periods) and a larger RF (at the cost of a higher access time) during L2 cache miss periods
- To keep the RF accessible in one cycle, the operating clock frequency is reduced when its size is scaled up
  - DFS needs to be done fast, otherwise it erodes the performance benefit
  - this requires a PLL architecture capable of applying DFS with the least transition delay
  - The studied processor (IBM PowerPC 750) uses a dual-PLL architecture which allows fast DFS with effectively zero latency
63
Circuit Modification

- The challenge is to design the RF in such a way that its access time can be dynamically controlled
- Among all RF components, the bitline delay increase is responsible for the majority of the RF access time increase
  - so the bitline load is adjusted dynamically

[Figure: proposed circuit modification for the RF: the bitline is split into a lower and an upper segment joined by a segment-select transmission gate; each upper-segment register entry carries a free/taken bit, and the sense amp and bitline pre-charge circuit sit below the lower segment]
64
L2 Miss Driven RF Scaling (L2MRFS)

- Normal period: the upper segment is power gated and the transmission gate is turned off to isolate the lower bitline segment from the upper bitline segment
  - Only the lower-segment bitline is pre-charged during this period
- L2 cache miss period: the transmission gate is turned on and both segments' bitlines are pre-charged
- Downsize at the end of the cache miss period, when the upper segment is empty
  - Augment the upper segment with one extra bit per entry: set the bit when a register is taken and reset it when the register is released
  - ORing these bits detects when the segment is empty

[Figure: proposed circuit modification for the RF (lower/upper bitline segments, segment-select gate, per-entry free/taken bits)]
65
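A minimal behavioral model of the L2MRFS control flow described above, assuming the dual-PLL lets the frequency change take effect immediately; the configuration numbers are taken from the Conf_2 table a few slides back, and the event-callback interface is purely illustrative.

class L2MRFS:
    """Upsize the RF (and lower the clock) on an L2 miss; downsize back only
    once the miss is serviced AND the upper segment holds no taken registers."""
    SMALL = {"rf_entries": 24, "freq_mhz": 595}   # baseline configuration
    LARGE = {"rf_entries": 48, "freq_mhz": 520}   # Conf_2-style upsized RF

    def __init__(self):
        self.config = self.SMALL
        self.miss_pending = False
        self.upper_taken = 0          # population of the per-entry taken bits

    def on_l2_miss(self):
        self.miss_pending = True
        self.config = self.LARGE      # turn on the segment-select gate, scale freq down

    def on_l2_miss_serviced(self):
        self.miss_pending = False
        self.maybe_downsize()

    def on_register_alloc(self, entry):
        if entry >= self.SMALL["rf_entries"]:
            self.upper_taken += 1     # register lives in the upper segment

    def on_register_release(self, entry):
        if entry >= self.SMALL["rf_entries"]:
            self.upper_taken -= 1
        self.maybe_downsize()

    def maybe_downsize(self):
        # Downsize only when no miss is pending and the OR of the taken bits is 0
        if not self.miss_pending and self.upper_taken == 0:
            self.config = self.SMALL  # isolate and power gate the upper segment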
Performance and Energy-Delay

[Figure: (a) normalized performance improvement for L2MRFS and (b) normalized energy-delay product, compared to conf_1 and conf_2 (DYN_Conf_1, DYN_Conf_2), across SPEC CPU2000 benchmarks]

- Performance improvement: 6% and 11%
- Energy-delay reduction: 3.5% and 7%
66
Inter-core Selective Resource
Pooling in 3D Chip Multiprocessor
An Example!

[Figure: register file utilization (number of occupied RF entries, 0-128, over time) for different cores in a dual-core CMP: greedy mcf with helper gcc, and greedy gcc with helper mcf; instants I1-I3 are marked]
68
Preliminary Results for Register File Pooling

[Figure: register files on two stacked layers (layer 0 and layer 1) participating in resource pooling via die-to-die vias and multiplexers]

[Bar chart: speedup (normalized IPC) of resource pooling for single-core, 2-core, 3-core, and 4-core configurations across SPEC CPU2000 benchmarks]
69
Challenges

- The level of resource sharing
  - "loose pooling": the HELPER core gets a higher priority in accessing the pooled resource
  - "tight pooling": the priority is given to the GREEDY core
- The granularity of resource sharing
  - number of entries
  - number of ports
- The level of confidence in predicting the resource utilization
  - avoid starving the HELPER core
  - avoid over-provisioning for the GREEDY core
- A new floorplan
  - put identical resources as close to each other as possible
  - this can incur an additional thermal and power burden on already power-hungry and thermally critical resources
70
Conclusion
Power-Thermal-Reliability aware High
Performance Design Through
Inter-Disciplinary Approach
71
Reducing Leakage in L2 Cache
Peripheral Circuits Using
Multiple Sleep Mode Technique
Multiple Sleep Modes

power mode   wakeup delay (cycles)   leakage reduction (%)
basic-lp     1                       42%
lp           2                       75%
aggr-lp      3                       81%
ultra-lp     4                       90%

Power overhead of waking up the peripheral circuits:
- Almost equivalent to the switching power of the sleep transistors
- Sharing a set of sleep transistors horizontally and vertically for multiple stages of a (wordline) driver makes the power overhead even smaller
73
Reducing Leakage in L1 Data Cache

- Maximize the leakage reduction in the DL1 cache
  - put the DL1 peripherals into the ultra low power mode
  - adds 4 cycles to the DL1 latency
    - significantly reduces performance
- Minimize the performance degradation
  - put the DL1 peripherals into the basic low power mode
  - requires only one cycle to wake up, and this latency can be hidden during the address computation stage, thus not degrading performance
  - but no noticeable leakage power reduction
74
Motivation for Dynamically Controlling Sleep Mode

- Periods of frequent access
  - Basic-lp mode: low performance impact benefit
- Periods of infrequent access
  - Ultra and aggressive low power modes: large leakage reduction benefit
- Therefore, dynamically adjust the sleep power mode of the peripheral circuits
75
Reducing DL1 Wakeup Delay

- Whether an instruction is a load or a store can be determined at least one cycle prior to the cache access
  - so the DL1 peripherals can be woken up one cycle prior to the access
  - Accessing the DL1 while its peripherals are in basic-lp mode therefore doesn't require an extra cycle
  - One cycle of the wakeup delay can be hidden for all other low-power modes
- Put the DL1 in basic-lp mode by default
  - reduces the effective wakeup delay by one cycle
76
Architectural Motivations

- A load miss in the L1/L2 caches takes a long time to service
  - it prevents dependent instructions from being issued
- When dependent instructions cannot issue
  - performance is lost
  - at the same time, energy is lost as well!
- This is an opportunity to save energy
77
Low-end Architecture

[State diagram: DL1 peripheral power modes for the low-end architecture; basic-lp while the processor runs normally, lp on a pending DL1 miss, aggr-lp on additional pending misses or a processor stall, ultra-lp once the processor stalls, and back to basic-lp when the DL1 misses are serviced and the processor continues]

- Given the miss service time of 30 cycles, it is likely that the processor stalls during the miss service period
- The occurrence of additional cache misses while one DL1 cache miss is already pending further increases the chance of a pipeline stall
78
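A sketch of the low-end DL1 mode selection implied by the state diagram above; the transition conditions are paraphrased from the figure, and wakeup-delay handling is omitted.

def dl1_low_end_mode(pending_dl1_misses: int, processor_stalled: bool) -> str:
    """Pick the DL1 peripheral power mode for the low-end architecture:
    deeper modes as more misses pile up and once the processor stalls."""
    if processor_stalled:
        return "ultra-lp"          # 4-cycle wakeup, 90% leakage reduction
    if pending_dl1_misses >= 2:
        return "aggr-lp"           # several misses outstanding
    if pending_dl1_misses == 1:
        return "lp"                # one pending DL1 miss
    return "basic-lp"              # default mode, 1-cycle (hidden) wakeup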
Low Power Modes in a 2KB DL1 Cache

[Bar chart: fraction of total execution time the DL1 cache spends in each power mode (hp, trivial-lp, lp, aggr-lp, ultra-lp)]

- 85% of the time the DL1 peripherals are put into low power modes
  - Most of the time is spent in the basic-lp mode (58% of the total execution time)
79
Low Power Modes in Low-End Architecture

[Figure: (a) performance degradation and (b) frequency of the different low power modes (hp, basic-lp, lp, aggr-lp, ultra-lp) for 2KB, 4KB, 8KB, and 16KB DL1 caches]

- Increasing the cache size reduces the DL1 cache miss rate
  - Reduces the opportunities to put the cache into the more aggressive low power modes
  - Reduces the performance degradation for larger DL1 caches
80
High-end Architecture

[State diagram: DL1 peripheral power modes for the high-end architecture; basic-lp by default, lp on pending DL1 miss(es), ultra-lp on an L2 miss, and back to basic-lp when the misses are serviced]

- The DL1 transitions to ultra-lp mode right after an L2 miss occurs
  - Given the long L2 cache miss service time (80 cycles), the processor will stall waiting for memory
- The DL1 returns to the basic-lp mode once the L2 miss is serviced
81
Low Power Modes in a 4KB Cache

[Bar chart: fraction of execution time in each power mode (hp, trivial-lp, lp, ultra-lp) for the high-end architecture with a 4KB DL1]

- For many benchmarks the ultra-lp mode has a considerable contribution
  - These benchmarks have high L2 miss rates, which trigger transitions to the ultra low power mode
82
Low Power Modes in High-End Architecture

[Figure: (a) performance degradation and (b) frequency of the different low power modes (hp, basic-lp, lp, ultra-lp) for 4KB-64KB, 8KB-128KB, 16KB-256KB, and 32KB-512KB DL1-L2 configurations]

- Increasing the cache size reduces the DL1 cache miss rate
  - Reduces the opportunities to put the cache into the more aggressive low power modes
  - Reduces the performance degradation
83
Leakage Power Reduction: Low-End Architecture

[Bar chart: DL1 leakage power reduction broken down by mode (trivial-lp, lp, aggr-lp, ultra-lp)]

- DL1 leakage is reduced by 50%
- While the ultra-lp mode occurs much less frequently than the basic-lp mode, its leakage reduction contribution is comparable to that of the basic-lp mode
  - in ultra-lp mode the peripheral leakage is reduced by 90%, almost twice that of basic-lp mode
84
Leakage Power Reduction: High-End Architecture

[Bar chart: DL1 leakage power reduction broken down by mode (trivial-lp, lp, ultra-lp)]

- The average leakage reduction is almost 50%
85