Dynamic Register File Resizing and Frequency
Scaling to Improve Embedded Processor
Performance and Energy-Delay Efficiency
Houman Homayoun, Sudeep Pasricha, Mohammad
Makhzan, Alex Veidenbaum
Center for Embedded Computer Systems, University of California, Irvine,
[email protected]
INTRODUCTION
• Technology scaling into the ultra-deep-submicron regime has allowed
  hundreds of millions of gates to be integrated onto a single chip.
  - Designers have an ample silicon budget to add processor resources
    that exploit application parallelism and improve performance.
• Power budgets and practically achievable operating clock frequencies
  are the limiting factors.
  - Increasing the register file (RF) size increases its access time,
    which reduces the processor frequency.
• Dynamically resizing the RF in tandem with dynamic frequency scaling
  (DFS) significantly improves performance.
MOTIVATION FOR INCREASING RF SIZE
• After a long-latency L2 cache miss, the processor executes some
  independent instructions but eventually stalls.
• After an L2 cache miss, one of the ROB, IQ, RF, or LQ/SQ fills up, and
  the processor stalls until the miss is serviced.
[Figure: Frequency of stalls due to L2 cache misses in the PowerPC 750FX architecture. Y-axis: 0% to 40%; x-axis: individual benchmarks and their average.]
• With larger resources, it is less likely that they fill up completely
  during the L2 cache miss service time, which can potentially improve
  performance.
• The resource sizes have to be scaled up together; otherwise the
  non-scaled ones become a performance bottleneck.
IMPACT OF INCREASING RF SIZE
• Increasing the size of the RF (as well as the ROB, LQ, and IQ)
  - can potentially increase processor performance by reducing the
    occurrence of idle periods, but
  - has a critical impact on the achievable processor operating
    frequency.
• The RF determines the maximum achievable operating frequency.
[Figure: Breakdown of RF component delay with increasing size. Y-axis: delay (ns), 0.10 to 0.55; configurations: RF-24, RF-32, RF-48; components: input driver, decoder, wordline, bitline, sense_amp, output driver.]
• The bitline delay increases significantly as the size of the RF
  increases.
ANALYSIS OF RF COMPONENT ACCESS DELAY
• The equivalent capacitance on the bitline is
    Ceq = N × (diffusion capacitance of the pass transistors) + wire capacitance,
  where N is the total number of rows and the wire capacitance is
  usually 10% of the total diffusion capacitance.
• As the number of rows increases, the equivalent bitline capacitance
  increases, and therefore so does the propagation delay.
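
The capacitance relation above can be sketched numerically. This is a minimal illustration, not the authors' circuit model: it assumes one bitline row per RF entry, a hypothetical per-cell diffusion capacitance of 1 fF, and the 10% wire-capacitance rule of thumb quoted above. The function name and constants are made up for the example.

```python
def bitline_capacitance(rows, c_diff_per_cell, wire_fraction=0.10):
    """Ceq = N * Cdiff + Cwire, with Cwire taken as a fraction of N * Cdiff."""
    total_diffusion = rows * c_diff_per_cell
    wire = wire_fraction * total_diffusion
    return total_diffusion + wire


# Hypothetical per-cell diffusion capacitance in fF (an assumption, not
# a value from the slides).
C_DIFF_FF = 1.0

base = bitline_capacitance(24, C_DIFF_FF)
for rows in (24, 32, 48):
    ceq = bitline_capacitance(rows, C_DIFF_FF)
    print(f"RF-{rows}: Ceq = {ceq:.1f} fF, {ceq / base:.2f}x the RF-24 bitline")
```

Under these assumptions the bitline load grows linearly with the number of rows, so doubling the RF from 24 to 48 entries doubles the capacitance the bitline has to charge and discharge.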
Reduction in clock frequency with increasing resource size

Processor configuration   Baseline   Conf_1   Conf_2
RF size                   24         32       48
ROB size                  16         24       32
IQ size                   8          12       24
RF access time (ns)       1.67       1.76     1.92
Operating freq (MHz)      595        568      520
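
As a rough check on these figures, the sketch below derives the frequency implied by each RF access time, assuming the RF access has to fit in a single clock cycle so that the clock period equals the access time. The dictionary layout and function name are illustrative only. The Conf_1 and Conf_2 results match the reported frequencies closely; the baseline comes out slightly above the reported 595 MHz.

```python
# RF access times and operating frequencies as reported for the three
# configurations.
CONFIGS = {
    "Baseline": {"rf_access_ns": 1.67, "freq_mhz": 595},
    "Conf_1": {"rf_access_ns": 1.76, "freq_mhz": 568},
    "Conf_2": {"rf_access_ns": 1.92, "freq_mhz": 520},
}


def max_freq_mhz(access_time_ns):
    """Highest clock (MHz) whose period still covers one RF access."""
    return 1000.0 / access_time_ns


for name, cfg in CONFIGS.items():
    implied = max_freq_mhz(cfg["rf_access_ns"])
    relative = cfg["freq_mhz"] / CONFIGS["Baseline"]["freq_mhz"]
    print(f"{name}: implied {implied:.1f} MHz, reported {cfg['freq_mhz']} MHz, "
          f"{relative:.3f} of the baseline clock")
```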
STATIC REGISTER FILE SIZING
[Figure: Performance in terms of IPC for different configurations (Baseline, Conf_1, Conf_2). Y-axis: 0.90 to 1.20; x-axis: individual benchmarks and their average.]
[Figure: Relative idle period during which the processor stalls due to L2 cache misses, for different configurations (Baseline, Conf_1, Conf_2). Y-axis: 0% to 45%; x-axis: individual benchmarks and their average.]
• Increasing the size of the RF
  - increases the IPC,
  - reduces the relative idle period during which the processor stalls
    due to L2 cache misses, and
  - reduces the maximum achievable operating clock frequency.
IMPACT ON EXECUTION TIME
[Figure: Normalized execution time for different configurations (Baseline, Conf-1, Conf-2) with reduced operating frequency, compared to the baseline architecture. Y-axis: normalized execution time, 0.90 to 1.12; x-axis: individual benchmarks and their average.]
• The execution time increases with larger resource sizes.
• There is a trade-off between
  - larger resources (and hence fewer idle periods) and
  - a lower clock frequency.
• The latter matters more and plays the major role in determining
  performance in terms of execution time.
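
The trade-off can be made concrete with the usual relation T = instructions / (IPC × f). The sketch below assumes a fixed instruction count and uses the reported operating frequencies (595, 568, and 520 MHz); the function names are illustrative, and no per-benchmark IPC values from the study are assumed.

```python
BASE_FREQ_MHZ = 595.0  # reported baseline operating frequency


def normalized_exec_time(ipc_gain, freq_mhz, base_freq_mhz=BASE_FREQ_MHZ):
    """Execution time relative to the baseline for a fixed instruction count.

    T = instructions / (IPC * f), so T / T_base = 1 / (ipc_gain * f / f_base).
    """
    return 1.0 / (ipc_gain * (freq_mhz / base_freq_mhz))


def break_even_ipc_gain(freq_mhz, base_freq_mhz=BASE_FREQ_MHZ):
    """IPC ratio a larger configuration needs just to match the baseline."""
    return base_freq_mhz / freq_mhz


for name, freq in (("Conf_1", 568.0), ("Conf_2", 520.0)):
    print(f"{name}: needs IPC gain > {break_even_ipc_gain(freq):.3f}x to win")
```

With these frequencies, Conf_1 needs roughly a 4.8% IPC gain and Conf_2 roughly a 14.4% gain just to break even, which is why a lower clock can outweigh the benefit of larger resources.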
DYNAMIC REGISTER FILE RESIZING
• Dynamic RF scaling based on L2 cache misses
  - allows the processor to use a smaller RF (with a lower access time)
    during the period when there is no pending L2 cache miss (normal
    period) and a larger RF (at the cost of a higher access time)
    during the L2 cache miss period.
• To keep the RF access within one cycle, the operating clock frequency
  is reduced when the RF is scaled up.
  - DFS needs to be fast; otherwise it erodes the performance benefit.
  - This requires a PLL architecture capable of applying DFS with the
    least transition delay.
• The studied processor (IBM PowerPC 750) uses a dual-PLL architecture,
  which allows fast DFS with effectively zero latency.
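
A minimal sketch of the one-cycle constraint and the mode-dependent clock follows. It uses the reported access times and frequencies, and it models the dual-PLL arrangement as two preset frequencies that are switched between instantly. The class and function names are hypothetical and do not describe the actual PowerPC 750 clocking interface.

```python
def fits_in_one_cycle(rf_access_ns, freq_mhz):
    """True if one RF access completes within a single clock period."""
    period_ns = 1000.0 / freq_mhz
    return rf_access_ns <= period_ns


class DualPllClock:
    """Two preset clocks; switching is modelled as having zero latency."""

    def __init__(self, normal_mhz, miss_mhz):
        self.presets = {"normal": normal_mhz, "l2_miss": miss_mhz}
        self.mode = "normal"

    def select(self, mode):
        self.mode = mode
        return self.presets[mode]


clock = DualPllClock(normal_mhz=595.0, miss_mhz=568.0)

# Small RF (1.67 ns) at the fast clock, large RF (1.76 ns) at the slow clock.
print(fits_in_one_cycle(1.67, clock.select("normal")))
print(fits_in_one_cycle(1.76, clock.select("l2_miss")))
# The large RF would not fit in one cycle at the fast clock.
print(fits_in_one_cycle(1.76, 595.0))
```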
CIRCUIT MODIFICATION
• The challenge is to design the RF so that its access time can be
  controlled dynamically.
• Among all RF components, the bitline delay increase is responsible
  for the majority of the RF access time increase.
  - Dynamically adjust the bitline load.
[Figure: Proposed circuit modification for RF. Labels: single bit per register entry (free/taken), upper segment (full/empty), segment select, wordlines, sense amp and bitline pre-charge circuit.]
L2 MISS DRIVEN RF SCALING (L2MRFS)
• Normal period: the upper segment is power gated, and the transmission
  gate is turned off to isolate the lower bitline segment from the
  upper bitline segment.
  - Only the lower-segment bitline is pre-charged during this period.
• L2 cache miss period: the transmission gate is turned on, and the
  bitlines of both segments are pre-charged.
  - The RF is downsized at the end of the cache miss period, once the
    upper segment is empty.
[Figure: Proposed circuit modification for RF]
Augment the upper segment with one extra bit per entry. Set the bit
when a register is taken and reset it when the register is released.
ORing these bits detects when the segment is empty: the result is 0
only if no upper-segment register is taken.
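
The resizing policy described above can be sketched as a small behavioural model. This is an illustration under stated assumptions, not the authors' hardware: the segment sizes (24 lower entries plus 8 upper entries), the class name, and the lower-segment free list are all hypothetical. Only the per-entry taken bit in the upper segment and the OR-based empty check come from the slides.

```python
class L2mrfsRegisterFile:
    """Behavioural model of L2-miss-driven RF scaling (illustrative only)."""

    def __init__(self, lower_entries=24, upper_entries=8):
        self.lower_taken = [False] * lower_entries  # modelling aid, assumed
        self.upper_taken = [False] * upper_entries  # one extra bit per entry
        self.upper_enabled = False                  # transmission gate state
        self.pending_l2_misses = 0

    def upper_segment_empty(self):
        # OR of all taken bits is 0 exactly when the segment is empty.
        return not any(self.upper_taken)

    def on_l2_miss(self):
        self.pending_l2_misses += 1
        self.upper_enabled = True  # upsize: connect the upper bitline segment

    def on_l2_miss_serviced(self):
        self.pending_l2_misses -= 1
        self._try_downsize()

    def allocate(self):
        """Return a free register index, or None if the usable RF is full."""
        for i, taken in enumerate(self.lower_taken):
            if not taken:
                self.lower_taken[i] = True
                return i
        if self.upper_enabled:
            for i, taken in enumerate(self.upper_taken):
                if not taken:
                    self.upper_taken[i] = True
                    return len(self.lower_taken) + i
        return None

    def release(self, reg):
        lower = len(self.lower_taken)
        if reg < lower:
            self.lower_taken[reg] = False
        else:
            self.upper_taken[reg - lower] = False
        self._try_downsize()

    def _try_downsize(self):
        # Downsize only when no miss is pending and the upper segment is empty.
        if (self.upper_enabled and self.pending_l2_misses == 0
                and self.upper_segment_empty()):
            self.upper_enabled = False
```

In this model the upper segment stays enabled after the miss is serviced until its last register is released, which mirrors the condition that downsizing happens only once the upper segment is empty.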
PERFORMANCE AND ENERGY-DELAY
[Figure: Experimental results for DYN_Conf_1 and DYN_Conf_2, per benchmark and on average. (a) Normalized performance improvement for L2MRFS (y-axis: 0% to 16%). (b) Normalized energy-delay product compared to Conf_1 and Conf_2 (y-axis: 0.84 to 1.00).]
Performance improvement: 6% and 11% for the two dynamic configurations
Energy-delay reduction: 3.5% and 7% for the two dynamic configurations
CONCLUSION
• Technology scaling into the ultra-deep-submicron regime has allowed
  hundreds of millions of gates to be integrated onto a single chip.
• Power budgets and practically achievable operating clock frequencies
  are the limiting factors.
• Statically increasing the register file size can increase IPC, but it
  increases the execution time because it lowers the maximum achievable
  operating frequency.
• Dynamic register file resizing allows the processor to use a smaller
  RF (with a lower access time) during the period when there is no
  pending L2 cache miss (normal period) and a larger RF (at the cost of
  a higher access time) during the L2 cache miss period.
• Only a minimal modification to the register file is needed for it to
  adapt its size along with its access time.
• Combining dynamic register file resizing with dynamic frequency
  scaling achieves an 11% performance improvement and a 7% energy-delay
  reduction.
• The methodology applied to the RF can also be applied to other
  timing-constrained resources such as the ROB, IQ, LQ/SQ, and caches.