
Safe RTL Annotations for Low
Power Microprocessor Design
Vinod Viswanath
Department of Electrical and Computer Engineering
University of Texas at Austin
Outline
• Power Dissipation in Hardware Circuits
• Instruction-driven Slicing to reduce power
dissipation
– Automatically annotates microprocessor
description at the Register Transfer Level and
Architectural level
• Correctness of the introduced annotations
• Case studies
Power Dissipation
• Switching activity power dissipation
– To charge and discharge nodes
• Short Circuit power dissipation
– High only for output drivers, clock buffers
• Static power dissipation
– Due to leakage current
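For reference, the power spent charging and discharging nodes is usually modeled by the standard first-order formula (not shown on the slide):

  $P_{switch} \approx \alpha \, C_L \, V_{DD}^2 \, f_{clk}$

where $\alpha$ is the switching activity factor, $C_L$ the switched capacitance, $V_{DD}$ the supply voltage, and $f_{clk}$ the clock frequency.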
Switching Activity Power Dissipation
• Reduce the squared term VDD
– Scaling VDD down forces the threshold
voltage down as well, which leads to an
exponential increase in Ileak
• A host of techniques reduce switching
power at the gate level
– Clock gating (sketch below)
• Relatively few such techniques exist at the RTL
– Use program structure and dataflow
information available at that level of
abstraction
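As an illustration of the clock-gating bullet above, here is a minimal Verilog sketch (not from the presentation; module and signal names are hypothetical):

// Minimal clock-gating sketch (illustrative only).
// A latch-based integrated clock gate keeps glitches on the
// enable from reaching the gated clock.
module clock_gate (
    input  wire clk,     // free-running clock
    input  wire en,      // enable: 1 = clock the downstream flops
    output wire gclk     // gated clock
);
    reg en_latched;

    // Transparent-low latch: enable is held stable while clk is high
    always @(clk or en)
        if (!clk)
            en_latched <= en;

    assign gclk = clk & en_latched;
endmodule

// Usage: clock a register bank only when its block is active
module gated_reg (
    input  wire       clk,
    input  wire       en,
    input  wire [7:0] d,
    output reg  [7:0] q
);
    wire gclk;
    clock_gate cg (.clk(clk), .en(en), .gclk(gclk));

    always @(posedge gclk)
        q <= d;   // no switching in q's clock tree when en = 0
endmodule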
Transistor-level Methods
• Designing Complex Gates
– Reordering transistors for optimizing
power/delay
• Transistor Sizing
– Transistor size inversely proportional to
gate delay
– Transistor size proportional to power
dissipated at the gate
– Given a delay constraint, size the transistor
to minimize power dissipation
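As a first-order sketch of this trade-off (standard approximations, not from the slides):

  $t_{delay} \propto \dfrac{C_{load}}{W}, \qquad P_{switch} \propto C_{gate} \propto W$

so, given a delay budget $t_{delay} \le D$, the smallest transistor width $W$ that still meets $D$ minimizes switching power.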
Gate-level Methods
• Combinational logic optimizations
– Don’t-care optimizations
– Path balancing
• Sequential logic optimizations
– Encoding
– Pre-computation based optimization
– Guarded evaluation
Combinational Logic Optimizations
• Don’t-care optimization
– Optimize away logic for input/output patterns
that cannot or should not ever occur
• Path Balancing
– Path balancing is typically done to eliminate
spurious transitions (glitches)
– The unit-delay buffers it adds themselves
increase power dissipation
– Useful skew in clock trees
Sequential Logic Optimizations
• Encoding
– Encode state transition graphs
– Encode values in datapath logic
– Example: passing data value 0010 followed by
1101 on a bus toggles all four wires, since
0010 XOR 1101 = 1111 (see the encoder sketch below)
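One well-known encoding for exactly this situation is bus-invert coding, sketched here as an illustration (the slide does not name a specific code; module and signal names are hypothetical):

// Bus-invert encoder sketch for a 4-bit bus (illustrative only).
// If more than half of the wires would toggle, the complement is
// sent instead, together with an 'invert' flag for the receiver.
module bus_invert_enc (
    input  wire       clk,
    input  wire       rst,       // synchronous reset
    input  wire [3:0] data_in,
    output reg  [3:0] bus_out,   // value actually driven on the bus
    output reg        invert     // 1 = receiver must re-invert bus_out
);
    // Bits that would toggle if data_in were sent as-is
    wire [3:0] diff    = data_in ^ bus_out;
    wire [2:0] toggles = diff[0] + diff[1] + diff[2] + diff[3];
    wire       flip    = (toggles > 2);

    always @(posedge clk)
        if (rst) begin
            bus_out <= 4'b0;
            invert  <= 1'b0;
        end else begin
            bus_out <= flip ? ~data_in : data_in;
            invert  <= flip;
        end
endmodule

With this encoding, the 0010 to 1101 example becomes a single toggle on the invert line and no toggles on the data wires.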
Sequential Logic Optimizations
• Pre-computation based optimization
– Selectively precompute outputs of the
circuit one cycle before they are required
– If the output value is computed, the circuit
can be turned off for the next cycle (sketched below)
– Size of pre-computation logic determines
power dissipation reduction, area increase,
and delay increase
• Use predictor functions
• Pre-compute outputs based on subset of inputs
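A minimal precomputation sketch in Verilog (illustrative only, not the authors' circuit): an unsigned comparator whose MSBs act as the predictor function; when the MSBs already decide the result, the low-order input registers are not loaded in the next cycle:

// Precomputation sketch (illustrative only): n-bit comparator a > b.
// Reset omitted for brevity.
module precomp_gt #(parameter N = 8) (
    input  wire         clk,
    input  wire [N-1:0] a, b,
    output wire         gt
);
    reg         a_msb_q, b_msb_q;
    reg [N-2:0] a_lsb_q, b_lsb_q;

    // Predictor function: if the MSBs differ, they alone decide gt
    wire decided = a[N-1] ^ b[N-1];

    // MSB registers are always clocked
    always @(posedge clk) begin
        a_msb_q <= a[N-1];
        b_msb_q <= b[N-1];
    end

    // Low-order registers are loaded only when the MSBs cannot decide,
    // so they do not switch on "decided" cycles
    always @(posedge clk)
        if (!decided) begin
            a_lsb_q <= a[N-2:0];
            b_lsb_q <= b[N-2:0];
        end

    // Output one cycle later: decided by the MSBs, or by the full
    // comparison of the registered low-order bits
    assign gt = (a_msb_q != b_msb_q) ? a_msb_q
                                     : (a_lsb_q > b_lsb_q);
endmodule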
Sequential Logic Optimizations
• Guarded Evaluation
– Place transparent latches (guards) on the inputs
of logic whose result will not be used, so those
inputs stop toggling (sketched below)
– Guard conditions are derived from signals already
present in the circuit
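A minimal guarded-evaluation sketch in Verilog (illustrative only; the guard signal use_mul and the module are hypothetical):

// Guarded evaluation sketch (illustrative only).
// When use_mul = 0 the multiplier result is not consumed, so its
// operands are held in transparent latches and stop toggling.
module guarded_mul (
    input  wire        use_mul,   // guard: 1 = result will be used
    input  wire [7:0]  a, b,
    input  wire [15:0] other,     // result of the always-on path
    output wire [15:0] y
);
    reg [7:0] a_g, b_g;

    // Transparent latches: pass operands through only when needed
    always @(*)
        if (use_mul) begin
            a_g = a;
            b_g = b;
        end                       // else: hold last value, no switching

    wire [15:0] mul_out = a_g * b_g;

    assign y = use_mul ? mul_out : other;
endmodule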
Instruction-driven Slice
• An instruction-driven slice of a
microprocessor design is
– all the relevant circuitry of the design
required to completely execute a specific
instruction
– Parts of the decode, execute, writeback
etc. blocks
• Cone of influence of the semantics of
the instruction
Instruction-driven Slicing
• Given a microprocessor design and an
instruction
– Identify the instruction-driven slice
– Shut off the rest of the circuitry
• This might include
– Gating out parts of different blocks
– Gating out floating point units during
integer ALU execution (sketched below)
– Turning off certain FSMs in different
control blocks since exact constraints on
their inputs are available due to
instruction-driven slicing
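The flavor of the resulting gating can be illustrated with a small hand-written Verilog sketch (the real annotations are generated by the tool; the module and signal names here are hypothetical):

// Illustrative sketch of the gating introduced by instruction-driven
// slicing (hand-written; the real annotations are tool-generated and
// the signal names below are hypothetical).
module exec_stage (
    input  wire        clk,
    input  wire        is_int_alu,   // decode: current instruction is an integer ALU op
    input  wire [31:0] op_a, op_b,
    output wire [31:0] int_result,
    output reg  [31:0] fp_result
);
    // Integer ALU: inside the slice of an integer instruction
    assign int_result = op_a + op_b;

    // Floating-point pipeline registers: outside the slice, so they
    // are disabled while an integer instruction executes
    wire fp_en = ~is_int_alu;

    always @(posedge clk)
        if (fp_en)
            fp_result <= op_a * op_b;   // stand-in for the FP datapath
endmodule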
Algorithm (High Level)
• Algorithm instruction-driven-slicing. Begin
• Inputs: vRTL (Verilog RTL), insts (instructions)
• Output: aRTL (Annotated RTL)
– Parse vRTL to obtain the Abstract Syntax Program
Graph (ASPG)
– For each instruction I in insts repeat
• Slice the ASPG for instruction I
• Traverse the ASPG looking for blocks outside the slice
• Add annotation variables if such a block is found
• If a particular flop is already gated, then merge the
current annotation with the existing gating condition
in an optimal fashion
– Return the annotated ASPG
– Generate Verilog code (aRTL) for the annotated ASPG
• End.
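A before/after sketch of the flop-level annotation the algorithm produces (hand-written illustration; the real aRTL is generated, and the names below are hypothetical):

// Before/after sketch of a single annotated flop.
module annotated_flop (
    input  wire       clk,
    input  wire       en,        // pre-existing gating condition, if any
    input  wire       ann_i,     // annotation variable added by the slicer
    input  wire [4:0] d,
    output reg  [4:0] q_plain,   // vRTL flavor: always clocked
    output reg  [4:0] q_sliced   // aRTL flavor: updated only inside the slice
);
    // vRTL: the flop updates every cycle
    always @(posedge clk)
        q_plain <= d;

    // aRTL: the annotation is merged with the existing enable, so the
    // flop only switches when instruction I actually needs it
    always @(posedge clk)
        if (en & ann_i)
            q_sliced <= d;
endmodule

In the generated aRTL the annotation variable would be driven by the decode logic for instruction I; here it is simply an input.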
[Figure: example annotated OR1200 control signals, e.g.
or1200_ctrl.lsu_op and or1200_ctrl.pre_branch_op]
Methodology
• In order to demonstrate our technique
– We have incorporated instruction-driven slicing as
part of the traditional design flow
– The vRTL model is annotated to obtain the aRTL
model
– The Synopsys Design Environment has been
modified to accept the aRTL, SPEC2000
benchmarks, and power process parameters, and
to estimate the power dissipation due to
switching activity
– The annotated Architectural model is fed to the
SimpleScalar simulator with the Wattch power
estimator to estimate the power dissipation
Methodology
Experiment: OR1200
• We have used our tool-chain to test our
methodology on OR1200
– OR1200 is a pipelined microprocessor
implementing the OpenRISC ISA.
– 4-stage integer pipeline with single
instruction issue per cycle
– We have annotated both the RTL and the
architectural models of OR1200
OR1200: single instruction issue
pipelined microprocessor
OR1200 Power Gain Results
• Results are shown after annotating the
– RTL (left) and Architectural (right) models
– For un-sliced and sliced on 1, 4, 10 instructions
– For SPECINT2000 benchmarks
• Power dissipation decreases consistently
OR1200 Results (contd.)
[Fig. 1, Fig. 2a, Fig. 2b, Fig. 2c]
• Power gains are consistently good (Fig. 1)
• Power gains far outperform area losses (Fig. 1)
• Flop distribution shown before slicing (Fig. 2a),
after slicing on add (Fig. 2b), and after slicing
on load (Fig. 2c)
Experiment: PUMA
• We have used our tool-chain to test our
methodology on PUMA
– PUMA is a dual-issue, out-of-order superscalar, fixed-point PowerPC core
– We have annotated both the RTL and the
architectural models of PUMA
PUMA: a fixed point PowerPC core
PUMA Power Gain Results
• Results are shown after annotating the
– RTL (left) and Architectural (right) models
– For un-sliced and sliced on 1, 4, 10 instructions
– For SPECINT2000 benchmarks
• Power dissipation decreases consistently
PUMA Results (contd.)
[Fig. 1: PUMA-RTL Power vs. Delay. %-age power gain and
delay vs. instruction-driven slicing (Un-sliced, 1-Sliced,
4-Sliced, 10-Sliced)]
[Fig. 2: PUMA-RTL Power vs. Area. %-age power gain and
area loss vs. instruction-driven slicing (Un-sliced,
1-Sliced, 4-Sliced, 10-Sliced)]
[Fig. 3a, 3b, 3c: flop distributions]
• Power gains are good upon slicing for a few
instructions (~7) before delay losses start
dominating (Fig. 1)
• Power gains far outperform area losses (Fig. 2)
• Flop distribution shown before slicing (Fig. 3a),
after slicing on add (Fig. 3b), and after slicing
on load (Fig. 3c)
Comparing OR1200 and PUMA
Correct Annotations
• Notion of correctness
– Original RTL and the annotated RTL should
be functionally equivalent under all
conditions
• Correctness theorem
(defthm or1200_slicing_correct
(equal (or1200_cpu n)
(or1200_cpu_sliced n)))
ACL2 Theorem Prover
• A general-purpose first-order logic
theorem prover
• Breaks down the theorem into sub-goals
• Many engines work on the sub-goals and
will either prove them or break them
down further and add to the central
pool of goals to be proved
• Success story in Hardware
– Verified FDIV in the AMD processors
Proof Methodology
• The RTL is a shallow embedding in ACL2
• Convert Verilog RTL into ACL2RTL
• We have created a large RTL library to
recognize as well as analyze ACL2RTL
• Slicing is done on the Verilog code
• Both original and annotated Verilog are
converted into ACL2 and we construct
the functional equivalence proof in ACL2
Verilog to ACL2
Proof Structure
• Create a library of functions to
interpret the ACL2 model of the RTL
• Functional equivalence theorem is built
up block by block
– Per instruction basis
Conclusions
• Proposed Instruction-driven Slicing as a
new technique to automatically reduce
power dissipation
• Implemented the methodology of
incorporating instruction-driven slicing
into the design flow tool-chain
• Inserting these annotations preserves
the functionality of the circuit
Conclusions (continued)
• This technique seems most applicable to
single-issue multi-staged pipelined machines.
• When there are multiple instructions in-flight
in the same pipeline stage, the gains of a
single-instruction-abstraction are lost.
• Graphics processors and various embedded
applications are often better suited to
this technique than general-purpose out-of-order superscalars.