A Centralized Cache Miss Driven Technique to
Improve Processor Power Dissipation
Houman Homayoun, Avesta Makhzan, Jean-Luc Gaudiot, Alex Veidenbaum
University of California Irvine
[email protected]
IC-SAMOS 2008
Outline

Introduction: why are the IQ, ROB, and RF major power dissipators?
Study of processor resource utilization during the service time of L2 and multiple L1 misses
An architectural approach that dynamically adjusts resource sizes during cache miss periods to conserve power
Hardware modifications + circuit assists to implement the approach
Experimental results
Conclusions
Superscalar Architecture

[Block diagram: Fetch → Decode → Rename → Dispatch → Issue → Execute → Write-Back pipeline, with the ROB, Reservation Station, Instruction Queue, Load/Store Queue, Logical and Physical Register Files, and functional units (F.U.)]
Instruction Queue

The Instruction Queue is a CAM-like structure which holds instructions until they can be issued. It must:

Set entries for newly dispatched instructions
Read entries to issue instructions to the functional units
Wake up instructions waiting in the IQ once a result is ready
Select instructions for issue when the number of ready instructions exceeds the processor issue limit (Issue Width)
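The four operations above can be sketched as a toy model. This is a hypothetical illustration, not the authors' simulator code: the `Instr` class, field names, and `ISSUE_WIDTH` value are assumptions chosen to match the 4-wide machine described later.

```python
# Toy model of CAM-style wakeup and select in an instruction queue.
# Illustrative sketch only; names and structure are assumptions.

ISSUE_WIDTH = 4  # up to 4 result tags broadcast per cycle

class Instr:
    def __init__(self, dest, srcs):
        self.dest = dest                       # destination register tag
        self.ready = {s: False for s in srcs}  # per-source ready bits

def wakeup(queue, broadcast_tags):
    """Compare every waiting source against every broadcast tag (CAM match)."""
    for instr in queue:
        for src in instr.ready:
            if src in broadcast_tags:          # tagline/matchline comparison
                instr.ready[src] = True        # OR of matches sets the ready bit

def select(queue):
    """Pick up to ISSUE_WIDTH instructions whose sources are all ready."""
    issued = [i for i in queue if all(i.ready.values())][:ISSUE_WIDTH]
    for i in issued:
        queue.remove(i)
    return issued

# Usage: i2 depends on i1's result (tag 7); broadcasting tag 7 wakes it up.
i1 = Instr(dest=7, srcs=[1, 2])
i2 = Instr(dest=9, srcs=[7])
q = [i1, i2]
wakeup(q, {1, 2})
print([x.dest for x in select(q)])   # prints [7]: i1 issues
wakeup(q, {7})
print([x.dest for x in select(q)])   # prints [9]: i2 issues
```

Note how every entry is compared against every broadcast tag each cycle; that all-to-all comparison is exactly the wakeup-logic complexity the next slide discusses.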
Main Complexity: Wakeup Logic

[Circuit diagram of the Instruction Queue wakeup logic: taglines (tag00–tag03, tagIW0–tagIW3) driving one-bit comparators on pre-charged matchlines 1–4, whose outputs are ORed together to set the Ready Bit]
At each cycle, the matchlines are pre-charged high to allow the individual bits of an instruction's tag to be compared with the results broadcast on the taglines. Upon a mismatch, the corresponding matchline is discharged; otherwise, the matchline stays at Vdd, which indicates a tag match.

At each cycle, up to 4 instructions are broadcast on the taglines, so four sets of one-bit comparators are needed for each one-bit cell. All four matchlines must be ORed together to detect a match on any of the broadcast tags. The result of the OR sets the ready bit of the instruction's source operand.

There is no need to always have such an aggressive wakeup/issue width!
Instruction Queue Matchline Power Dissipation

Matchline discharge is the major energy-consuming activity, responsible for more than 58% of the energy consumption in the instruction queue.

Since the matchlines must run across the entire width of the instruction queue, they have a large wire capacitance; adding the diffusion capacitance of the one-bit comparators makes the equivalent matchline capacitance large. Pre-charging and discharging this large capacitance is responsible for the majority of the power in the instruction queue.

A broadcast tag has, on average, only one dependent instruction in the instruction queue, so discharging all the other matchlines causes significant wasted power dissipation.
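A back-of-the-envelope calculation shows why this matters. The capacitance values below are illustrative assumptions (they are not from the slides); only Vdd = 1.08 V comes from the simulation setup described later.

```python
# Rough estimate of energy wasted on mismatching matchlines per cycle.
# C_WIRE, C_DIFF, TAG_BITS, and N_ENTRIES are assumed, illustrative values.

V_DD = 1.08          # volts (from the simulation setup)
N_ENTRIES = 32       # matchlines in one 32-entry IQ bank
C_WIRE = 40e-15      # farads: wire spanning the queue width (assumed)
C_DIFF = 2e-15       # farads: diffusion cap of one comparator (assumed)
TAG_BITS = 7         # one-bit comparators per matchline (assumed tag width)

# Equivalent matchline capacitance: wire + comparator diffusion caps
c_matchline = C_WIRE + TAG_BITS * C_DIFF

# Every matchline is pre-charged each cycle, but on average only one
# dependent instruction matches, so nearly all the others discharge.
energy_per_cycle = (N_ENTRIES - 1) * 0.5 * c_matchline * V_DD ** 2
print(f"{energy_per_cycle * 1e15:.1f} fJ wasted per cycle on mismatching lines")
```

Even with these modest assumed capacitances, the wasted charge scales with the number of queue entries, which is why halving the active matchlines during miss periods pays off.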
ROB and Register File

The ROB and the register file are multi-ported SRAM structures with several functions:

Setting entries for up to IW instructions in each cycle,
Releasing up to IW entries per cycle during the commit stage, and
Flushing entries during branch recovery.

Circuit-Level Implementation of an SRAM ROB and Register File
[Circuit diagram: address decoder and wordline drivers, local wordlines, pre-charged bitlines, memory cells, sense amps, and data output drivers. Pie charts of the dynamic and leakage power breakdown: the bitlines and memory cells dominate, with the decode logic, wordlines, sense amps, and input/output drivers accounting for the remainder]

The majority of power (both leakage and dynamic) is dissipated in the bitlines and memory cells.

Bitline leakage accumulates with the memory cell leakage, which flows through two off pass transistors.

Bitline dynamic power is determined by its equivalent capacitance, which is N * the diffusion capacitance of the pass transistors + the wire capacitance (usually 10% of the total diffusion capacitance), where N is the total number of rows.

The bitline is the major power dissipator: 58% of dynamic power and 63% of leakage power.
System Description

L1 I-cache:        128KB, 64 byte/line, 2 cycles
L1 D-cache:        128KB, 64 byte/line, 2 cycles, 2 R/W ports
L2 cache:          4MB, 8-way, 64 byte/line, 20 cycles
Issue:             4-way out of order
Branch predictor:  64K-entry g-share, 4K-entry BTB
Reorder buffer:    96 entries
Instruction queue: 64 entries (32 INT and 32 FP)
Register file:     128 integer and 128 floating point
Load/store queue:  32-entry load and 32-entry store
Arithmetic units:  4 integer, 4 floating-point units
Complex units:     2 INT, 2 FP multiply/divide units
Pipeline:          15 cycles (some stages are multi-cycle)
Simulation Environment

The clock frequency of the processor is 2GHz.

The SPEC2K benchmarks were compiled using the Compaq compiler for the Alpha 21264 processor with the -O4 flag, and executed with the reference data sets.

The architecture was simulated using an extensively modified version of SimpleScalar 4.0 (sim-mase). The benchmarks were fast-forwarded for 2 billion instructions, then fully simulated for 2 billion instructions.

A modified version of Cacti4 was used to estimate power in the ROB and the register files in 65nm technology. The power in the instruction queue was evaluated using Spice and the TSMC 65nm technology, with Vdd at 1.08 volts.
Architectural Motivations

A load miss in the L1/L2 caches takes a long time to service, during which dependent instructions cannot issue. After a number of cycles the instruction window (ROB, Instruction Queue, Store Queue, Register Files) is full, which prevents further instructions from being dispatched. The processor issue stalls and performance is lost. At the same time, energy is lost as well!

This is an opportunity to save energy:

Scenario I: L2 cache miss period
Scenario II: three or more pending DL1 cache misses
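The two trigger conditions can be sketched as a simple per-cycle check. This is a hypothetical illustration of the detection logic; the function and parameter names are assumptions, not from the paper.

```python
# Hypothetical sketch of the two resizing triggers described above:
# Scenario I  = an L2 miss is being serviced
# Scenario II = three or more DL1 misses are pending

def resize_trigger(l2_misses_pending, dl1_misses_pending):
    """Return which downsizing scenario (if any) is active this cycle."""
    if l2_misses_pending >= 1:
        return "scenario_I"        # L2 cache miss period
    if dl1_misses_pending >= 3:
        return "scenario_II"       # three or more pending DL1 misses
    return None                    # full-size, full-width operation

print(resize_trigger(1, 0))  # prints scenario_I
print(resize_trigger(0, 3))  # prints scenario_II
print(resize_trigger(0, 2))  # prints None
```

Checking outstanding miss counters is cheap in hardware (the MSHRs already track them), which is why a miss-driven trigger is attractive compared to occupancy-based heuristics.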
How the Architecture Can Help Reduce Power in the ROB, Register File and Instruction Queue

[Chart: issue rate decrease per benchmark under Scenario I and Scenario II, for the integer benchmarks (bzip2, crafty, gap, gcc, gzip, mcf, parser, twolf, vortex, vpr) and the floating-point benchmarks (applu, apsi, art, equake, facerec, galgel, lucas, mgrid, swim, wupwise)]

Scenario I: the issue rate drops by more than 80%.
Scenario II: the issue rate drops by 22% for the integer benchmarks and 32.6% for the floating-point benchmarks.

A significant issue width decrease!
How the Architecture Can Help Reduce Power in the ROB, Register File and Instruction Queue

ROB occupancy grows significantly during Scenarios I and II for the integer benchmarks: 98% and 61% on average. The increase in ROB occupancy for the floating-point benchmarks is smaller: 30% and 25% on average for Scenarios I and II.

[Table: per-benchmark ROB occupancy increase (%) under Scenario I and Scenario II; INT averages 98.2 and 61.4, FP averages 30.5 and 25.2]
How Architecture can help reducing power in ROB,
Register File and Instruction Queue
nonnonnonnonRegister File Scenario I
Scenario I
Scenario II
Scenario II
Scenario I
Scenario I
Scenario II
Scenario
occupancy
IRF
FRF
IRF
FRF
II FRF
IRF
FRF
IRF
bzip2
crafty
gap
gcc
gzip
mcf
parser
twolf
vortex
vpr
INT average
applu
apsi
art
equake
facerec
galgel
lucas
mgrid
swim
wupwise
FP average
74.4
83.4
46.2
46.3
45.1
40.8
37.4
58.7
70.9
63.9
55.3
6.0
16.1
35.4
34.2
52.6
50.4
21.7
5.9
23.3
26.3
26.6
28.8
31.9
41.1
21.2
27.2
29.3
29.8
32.3
31.1
29.0
29.2
5.6
18.3
25.0
27.4
22.5
27.4
23.8
6.2
27.8
28.8
20.9
0.0
0.1
0.1
0.2
0.0
1.0
0.0
2.6
0.3
7.8
1.1
76.6
65.7
36.2
16.1
50.0
41.8
47.7
90.0
77.1
53.5
56.5
0.0
0.0
0.7
0.1
0.0
1.1
0.0
2.1
0.2
8.6
1.2
64.8
37.6
30.7
7.1
28.9
48.7
44.0
80.7
78.1
28.7
44.7
56.6
51.4
65.8
28.7
39.8
46.8
57.0
46.0
52.4
66.4
50.3
1.7
15.8
23.0
32.7
30.3
32.1
41.7
1.9
29.7
40.5
24.0
30.7
32.2
42.9
24.0
27.2
36.4
29.8
29.8
35.0
41.0
32.0
6.2
17.9
29.0
29.4
38.4
26.0
22.1
6.4
23.1
26.9
22.1
0.0
0.0
0.6
0.0
0.0
3.2
0.1
2.5
0.2
8.7
1.4
77.3
58.8
42.9
21.0
48.1
61.0
29.7
96.7
87.1
38.0
56.2
0.0
0.0
0.5
0.1
0.0
0.1
0.0
2.0
0.2
8.3
1.0
73.7
43.6
6.3
9.6
35.0
44.2
47.0
87.2
76.2
42.2
46.0
IRF occupancy always grows for both scenarios when experimenting
with integer benchmarks. a similar case is for FRF when running
floating-point benchmarks and only during scenario II
Proposed Architectural Approach

Adaptive resource resizing during cache miss periods:

Reduce the issue and wakeup width of the processor during L2 miss service time.
Increase the size of the ROB during L2 miss service time or when at least three DL1 misses are pending.
Reduce the IRF size when running floating-point benchmarks; similarly, reduce the FRF size when running integer benchmarks. The same algorithm applied to the ROB is applied to the IRF when running integer benchmarks and to the FRF when running floating-point benchmarks.

A simple resizing scheme: reduce to half size. This is not necessarily optimal for individual units, but it is simple to implement.
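The halving policy above can be sketched as follows. This is a hypothetical interpretation, not the authors' implementation: it reads "increase the ROB during the miss period" as the ROB (and the workload's active register file) normally running at half size and growing back to full size while a miss is pending, since their occupancy rises then. Structure sizes come from the System Description slide; all names are illustrative.

```python
# Hypothetical sketch of the half-size resizing policy. Sizes from the
# System Description slide; the policy direction for the ROB/active RF
# is an interpretation of "increase the size during miss service time".

FULL = {"issue_width": 4, "rob": 96, "irf": 128, "frf": 128}

def resource_config(miss_period, workload):
    """workload is 'int' or 'fp'; returns active sizes for this phase."""
    cfg = dict(FULL)
    unused_rf = "frf" if workload == "int" else "irf"
    cfg[unused_rf] //= 2                 # other-type RF is always halved
    if miss_period:
        cfg["issue_width"] //= 2         # narrow the issue/wakeup width
    else:
        cfg["rob"] //= 2                 # ROB half-sized outside miss periods
        active_rf = "irf" if workload == "int" else "frf"
        cfg[active_rf] //= 2             # same policy as the ROB
    return cfg

print(resource_config(miss_period=True, workload="int"))
print(resource_config(miss_period=False, workload="fp"))
```

Halving everything, rather than tuning each unit, keeps the control logic to a single downsizing signal per structure, matching the "simple to implement" goal.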
Reducing Issue/Wakeup Width

[Circuit diagram: (a) baseline taglines with drivers and wordline pre-charge; (b) taglines gated by a Select signal, with SLP signals controlling the wordline pre-charge]

Avoid pre-charging half of the matchlines during L2 cache miss service time.

Worst-case scenario: more than half of the taglines broadcast tags during the L2 miss period while only half of the matchlines are active. A small 8-entry auxiliary broadcast buffer handles this case.
Reducing ROB and Register File Size

Use the divided bitline technique, which has been proposed for SRAM memory design, to reduce the bitline capacitance and hence its dynamic power:

Bitline capacitance = N * diffusion capacitance of the pass transistors + wire capacitance
Divided bitline capacitance = M * diffusion capacitance + wire capacitance, where M < N is the number of rows per segment

[Circuit diagram: segmented bitline with pre-charge, wordlines, segment select (SS), downsizing signal (DS), sense amp, and output]

Turn off an entire partition by applying the gated-Vdd technique to the partition's memory cells and wordline drivers.
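The two capacitance formulas above give a quick estimate of the dynamic power saved by halving the ROB. The per-transistor diffusion capacitance below is an assumed value; the wire capacitance is modeled as 10% of the full array's diffusion capacitance, per the earlier slide.

```python
# Worked example of the bitline vs. divided-bitline capacitance formulas.
# C_DIFF is an assumed, illustrative value; N and M match a 96-entry ROB
# halved into 48-row segments.

C_DIFF = 1.5e-15     # farads per pass transistor (assumed)
N = 96               # rows on the full bitline
M = 48               # rows in one active segment after halving

c_wire = 0.10 * N * C_DIFF        # wire ~10% of total diffusion capacitance
full    = N * C_DIFF + c_wire     # bitline capacitance
divided = M * C_DIFF + c_wire     # divided bitline capacitance

# Dynamic power scales with switched capacitance, so the ratio below
# approximates the dynamic-power saving on the bitlines.
print(f"full: {full*1e15:.1f} fF, divided: {divided*1e15:.1f} fF, "
      f"ratio: {divided/full:.2f}")
```

Because the wire capacitance is small relative to the diffusion term, halving the active rows cuts the switched bitline capacitance (and hence its dynamic power) nearly in half.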
Simulation Results

[Chart: dynamic and leakage power reduction per benchmark for the ROB and the instruction queue, up to ~50%]

[Chart: dynamic and leakage power reduction per benchmark for the INT and FP register files, up to ~40%]

[Chart: IPC degradation per benchmark, below 6%]

Performance loss is 0.9% for the integer benchmarks and 2.2% for the floating-point benchmarks.

The average dynamic and leakage power savings are 26% and 30% respectively for the IRF, and 20% and 24% for the FRF.

Dynamic power in the instruction queue is reduced by 24% for the FP benchmarks and by 11% for the integer benchmarks.

The ROB sees a 19% dynamic power reduction and 23% leakage power savings.
Conclusions

Reducing L2 cache leakage power:

Architectural study during L2 cache miss service time
A study of the breakdown of leakage in the L2 cache shows that the peripheral circuits leak considerably
An architectural approach for deciding when to turn the L2 cache on/off, reducing leakage power while conserving performance: 20+% power savings with less than 2% performance degradation
Circuit assists, with minimal modifications and transition overhead

Reducing Reorder Buffer, Instruction Queue and Register File power:

Study of processor resource utilization during the service time of L2/multiple L1 misses
An architectural approach that dynamically adjusts the size of resources during cache miss periods to conserve power
Hardware modifications + circuit assists to implement the approach
Applying similar adaptive techniques to other energy-hungry resources in the processor