Energy-efficient Instruction Dispatch
Buffer Design for Superscalar Processors*
Gurhan Kucuk, Kanad Ghose, Dmitry V. Ponomarev
Department of Computer Science
State University of New York
Binghamton, NY 13902-6000
{gurhan, ghose, dima}@cs.binghamton.edu
Peter M. Kogge
Dept. of Computer Science and Engineering
University of Notre Dame
Notre Dame, IN 46556
[email protected]
International Symposium on Low Power Electronics and Design (ISLPED’01)
* supported in part by DARPA through the PAC-C program and NSF
Motivation/Goals
Current Trends in Microarchitecture:
• Aggressive out-of-order execution, use of register renaming, multiple FUs,
sizable on-chip caches, large register files, ROB etc.
Impact on Energy/Power Dissipation:
• The absolute power dissipation of the processor is high
• The areal energy/power density of high-end superscalar processors is becoming an
immediate, serious concern; it will soon become comparable to that of a nuclear
reactor
Consequences:
• Intermittent and permanent failures on the die and serious challenges for the
cooling facilities/packaging
Goal:
• Limit energy dissipation through technology-independent techniques with no
impact on performance
Typical Superscalar Datapath
The Dispatch Buffer
• The Instruction Dispatch Buffer (a.k.a. Issue Queue) is one of the major sources of
power dissipation in modern superscalar processors: up to 22% of total chip
power
• The major components of power dissipation in the Dispatch Buffer are:
1. Dispatch (entry setup = locating free entries + writing to them)
2. Issue (FU arbitration + reading the selected instructions from the DB)
3. Forwarding (tag comparison + latching)
[Figure: DB block diagram (entries written from the Decode/Dispatch stage, issued to the
function units, and updated by forwarding from the function units) with pie charts of the DB
power breakdown. SPECint 95: dispatch 24.2%, forwarding 25.7%, issue 50.1%. SPECfp 95:
dispatch 28.9%, forwarding 17.3%, issue 53.8%.]
Main Results
• Over 60% energy savings within the DB achieved using three relatively
independent techniques:
- Replacing traditional comparators with dissipate-on-match comparators
- Not reading or writing leading zero bytes
- Using bitline segmentation to reduce bitline capacitance and dissipation
during reads and writes
• No impact on cycle time
• Only a 12% increase in the layout area of the DB for a 4-metal-layer, 0.5 micron
layout; smaller increase with additional metal layers
Low Power Comparator
• Traditional comparators dissipate power on mismatches
• Only about 5% of all comparisons result in a match
• This makes the comparators a major source of power dissipation in the Dispatch Buffer

Dispatch Buffer Comparator Statistics (% of total cases, by number of matching bits)

                    2 LSBs   4 LSBs   6 LSBs   All 8 bits
Avg. SPECint 95      27.2      9.8      5.7       5.6
Avg. SPECfp 95       33.1     10.7      5.4       4.4
Avg. all SPEC 95     30.5     10.3      5.6       4.9

LSB = least significant bits
Low Power Comparator
[Figure: DB slots snooping the forwarding bus; each entry's comparators match the broadcast
8-bit physical register number against its source tags, and slots are shown as inactive, waiting,
or matching.]
Low Power Comparator
Idea: design a new comparator that dissipates power only on matches
New dissipate-on-match comparator:
Domino logic with a pass-transistor front end
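The payoff of dissipating only on matches can be seen with a back-of-the-envelope model. The sketch below is not the paper's estimator: it simply combines the full-tag match rate from the statistics table with placeholder per-comparison energies to show why shifting dissipation from the roughly 95% of mismatches to the roughly 5% of matches removes most of the comparator energy.

```python
# Rough energy model contrasting a conventional comparator (dissipates on
# mismatch) with a dissipate-on-match comparator. The per-event energies
# are placeholder values, not SPICE-measured numbers from the paper.

MATCH_RATE = 0.049          # "All 8 bits" match rate, avg. all SPEC 95 (table above)
E_DISSIPATE = 1.0           # energy per dissipating comparison (arbitrary units)
E_QUIET = 0.05              # assumed residual energy when the comparator stays quiet

def avg_energy(dissipate_on_match: bool, match_rate: float = MATCH_RATE) -> float:
    """Average energy per tag comparison under a two-state model."""
    p_dissipate = match_rate if dissipate_on_match else (1.0 - match_rate)
    return p_dissipate * E_DISSIPATE + (1.0 - p_dissipate) * E_QUIET

if __name__ == "__main__":
    traditional = avg_energy(dissipate_on_match=False)
    new_comp = avg_energy(dissipate_on_match=True)
    print(f"traditional comparator : {traditional:.3f} units/comparison")
    print(f"dissipate-on-match     : {new_comp:.3f} units/comparison")
    print(f"reduction              : {100 * (1 - new_comp / traditional):.1f}%")
```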
Zero-Byte Encoding
Observation:
• The simulated execution of the SPEC 95
benchmarks shows that about half of the byte
fields within operands are all zeros
Reasons:
• Use of small integer literals (address offsets,
literal operands, flags, byte ops, etc.)
• Consequence of byte packing and unpacking
operations and usage of the bit or byte
masks to isolate parts of the operands
• Some floating point operands may not use
all of the bits allowed in the mantissa field
• Use of lower-precision data may not make
use of full datapath width
[Figure: percentage of operand bytes that are all zeros, leading zeros, or leading ones for
32- and 64-bit integer operands (90% of all operands), shown for RF-to-DB, DB-to-FU, and
FU-to-DB transfers.]
Zero-Byte Encoding
Idea:
• Instead of driving a byte that is all zeroes, encode it using the ZI (Zero
Indicator) bit and only drive this bit, thus achieving power savings during
writes.
Associated Circuit Techniques: Readout Logic
Zero-Byte Encoding
• The stored ZI bit disables reading of the associated byte, avoiding bitline
discharge and sense-amp dissipation
Associated Circuit Techniques: Readout Logic
Zero-Byte Encoding
Associated Circuit Techniques: Encoding Logic for bytes of all zeroes
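As a software analogy of the encoding and readout logic, the following sketch computes a per-byte ZI vector for an operand and counts how many bitlines still need to be driven on a write. The 32-bit word width and the one-ZI-line-per-byte cost are illustrative assumptions, not the exact circuit on the slides.

```python
# Zero-byte encoding sketch: mark each all-zero byte with a ZI (Zero Indicator)
# bit so that only the ZI line, rather than the byte's bitlines, is driven.
# Assumes a 32-bit operand and one ZI bit per byte (illustrative only).

WORD_BYTES = 4  # 32-bit operand

def zi_encode(value: int) -> tuple[list[int], list[int]]:
    """Return (zi_bits, payload): one ZI bit per byte plus the non-zero bytes."""
    zi_bits, payload = [], []
    for i in range(WORD_BYTES):
        byte = (value >> (8 * i)) & 0xFF
        zi_bits.append(1 if byte == 0 else 0)
        if byte != 0:
            payload.append(byte)
    return zi_bits, payload

def bitlines_driven(value: int) -> int:
    """Bitlines driven for one write: 8 per non-zero byte + 1 ZI line per byte."""
    zi_bits, payload = zi_encode(value)
    return 8 * len(payload) + len(zi_bits)

if __name__ == "__main__":
    for v in (0x00000007, 0x0000FF00, 0x12345678):
        zi, _ = zi_encode(v)
        print(f"{v:#010x}: ZI={zi}, bitlines driven={bitlines_driven(v)} (vs. 32 unencoded)")
```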
Bitline Segmentation
• The DB is essentially a Register File with additional associative
logic for data forwarding.
• For each instruction dispatched in a cycle, a write port is needed for the
entry setup process
• For each instruction issued in a cycle, a read port is needed to move the
instruction from the DB to a FU
• The bitlines associated with each read and write port present a high
capacitive load, which includes a component that varies linearly with the
number of rows in the DB
• This component is due to the wire capacitance of the bitlines and the
diffusion capacitance of the pass transistors that connect the bitcells to
the bitlines
[Figure: DB cell array with write ports fed from the Decode/Dispatch stage and read ports
driving the function units on issue.]
Bitline Segmentation
Idea: The DB is reconstructed into segments.
• Capacitive loading on each segment is lowered: Each segment is connected
to only 16 pass devices
• Wire capacitance is lowered: Wire length of the bitline segment is one
fourth of the original bitline
[Figure: bitline-segmented DB.]
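A first-order capacitance model illustrates why segmentation helps. The per-row wire and diffusion capacitances below are placeholder values rather than extracted layout data, and a 64-entry DB split into four 16-entry segments is assumed from the 16-pass-device figure on the slide.

```python
# First-order bitline capacitance model for an unsegmented vs. a segmented DB.
# C_bitline ~ rows * (wire cap per row + pass-transistor diffusion cap per row).
# All capacitance values are illustrative placeholders, not measured data.

C_WIRE_PER_ROW = 1.0   # fF of bitline wire per row (assumed)
C_DIFF_PER_ROW = 1.5   # fF of pass-transistor diffusion per row (assumed)
ROWS = 64              # DB entries (assumed)
SEGMENTS = 4           # 4 segments of 16 rows each, as on the slide

def bitline_cap(rows: int) -> float:
    """Capacitance switched on a read/write of one bitline spanning `rows` rows."""
    return rows * (C_WIRE_PER_ROW + C_DIFF_PER_ROW)

if __name__ == "__main__":
    unsegmented = bitline_cap(ROWS)
    # With segmentation, only the accessed segment's local bitline switches;
    # the global line is approximated as seeing one connection point per segment.
    segmented = bitline_cap(ROWS // SEGMENTS) + bitline_cap(SEGMENTS)
    print(f"unsegmented bitline cap : {unsegmented:.1f} fF")
    print(f"segmented bitline cap   : {segmented:.1f} fF")
    print(f"reduction               : {100 * (1 - segmented / unsegmented):.1f}%")
```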
Evaluation Methodology
• Used a true cycle-by-cycle register-level simulator for a
typical superscalar pipeline. SimpleScalar has been
substantially modified for this purpose to mimic real
superscalar datapaths
• Simulated the execution of the SPEC 95 benchmarks
• Collected transition counts for each major datapath
component
• Used SPICE measurements of the VLSI layouts of the
dispatch buffer and reorder buffer in a 0.5 micron,
4-metal-layer process to estimate the power dissipated for
each type of transition within each major component
(migrating to 0.18 micron soon!)
Evaluation Methodology
Datapath Power Estimator
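The estimator multiplies the per-event transition counts gathered by the simulator with the per-transition energies measured in SPICE. A minimal sketch of that bookkeeping is shown below; the event names, energies, counts, and clock frequency are made-up placeholders, not values from the study.

```python
# Minimal sketch of a transition-count-based power estimator:
# P = sum over event types (count * energy per transition) / simulated time.
# Event names, energies, counts, and clock period are illustrative placeholders.

ENERGY_PJ = {            # per-transition energy from SPICE measurements (assumed values)
    "db_write": 12.0,
    "db_read": 10.0,
    "tag_match": 3.0,
}

def average_power_mw(transition_counts: dict[str, int],
                     cycles: int, clock_ghz: float) -> float:
    """Average power in mW given per-event transition counts over `cycles` cycles."""
    total_energy_pj = sum(ENERGY_PJ[e] * n for e, n in transition_counts.items())
    sim_time_ns = cycles / clock_ghz          # 1 cycle = 1/f ns
    return total_energy_pj / sim_time_ns      # pJ / ns = mW

if __name__ == "__main__":
    counts = {"db_write": 4_000_000, "db_read": 3_500_000, "tag_match": 9_000_000}
    print(f"{average_power_mw(counts, cycles=10_000_000, clock_ghz=1.0):.1f} mW")
```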
Results
Traditional vs. New Comparator
[Figure: per-benchmark bar chart (mW) of power dissipation within the DB during forwarding
for the traditional vs. the new comparator, with SPECint 95, SPECfp 95, and overall SPEC 95
averages; annotated average savings of 44%, 49%, and 45%.]
Results
Traditional vs. New Comparator
[Figure: per-benchmark bar chart (mW) of total power dissipation within the DB for the
traditional vs. the new comparator; annotated average savings of 11%, 14%, and 12%.]
Results
Zero-Byte Encoding and Bitline Segmentation
[Figure: per-benchmark bar chart (mW) of power dissipation within the DB during instruction
dispatch for the Base, Seg., ZE, and Seg + ZE configurations; average savings of roughly
53-54% with segmentation alone, 21-26% with zero-byte encoding alone, and 59-61% with
both combined.]
Results
Zero-Byte Encoding and Bitline Segmentation
[Figure: per-benchmark bar chart (mW) of power dissipation within the DB during instruction
issue for the Base, Seg., ZE, and Seg + ZE configurations; average savings of roughly 40-41%
with segmentation alone, 32-41% with zero-byte encoding alone, and 58-62% with both
combined.]
Results
Zero-Byte Encoding, Bitline Segmentation and New Comparators
[Figure: per-benchmark bar chart (mW) of total power dissipation within the DB for the Base,
Seg., Seg + ZE, and Seg + ZE + New Comp. configurations; average savings of roughly 31-33%
with segmentation alone, 44-50% with segmentation plus zero-byte encoding, and 59-61% with
all three techniques combined.]
Related Work
• Zero byte encoding of function unit results (Brooks and Martonosi,
1999)
• Zero-byte compression on buses, register files, DB and ROB in
superscalar datapath (Ponomarev, Ghose, Kucuk, Kogge and
Toomarian, 2000)
• Zero-byte compression for I-caches (Villa, Zhang, Asanovic, 2000)
• Zero-byte compression in simple scalar datapath (Canal, Gonzalez and
Smith, 2000)
• Dynamic resizing of issue queue (Buyuktosunoglu, Albonesi, Schuster,
Brooks, Bose and Cook, 2001 + Folegnani and Gonzalez, 2001)
• Dynamic resizing of dispatch buffer and reorder buffer (Ponomarev,
Kucuk and Ghose, 2001)
Conclusion
• We studied three relatively independent techniques to reduce the energy
dissipation in the instruction dispatch buffers of modern superscalar
processors:
1. New comparators that dissipate energy mainly on tag matches
2. Zero-byte encoding to reduce the number of bitlines that have to be driven
during instruction dispatch and issue, as well as during forwarding of results
to the waiting instructions in the DB
3. Bitline segmentation to reduce the length of the bitlines (and thus the wire
and diffusion capacitances)
• Total power reduction is about 60%
• The DB power reductions are achieved without compromising the cycle
time and with only a modest growth in the area of the DB (about 12%)
• Our studies also show that the techniques that reduce DB power can be
applied to achieve reductions of a similar scale in other datapath structures
that use associative addressing (such as the Reorder Buffer and the
LOAD/STORE Queue)