Reducing Power Consumption with Relaxed
Quasi Delay-Insensitive Circuits
Christopher LaFrieda and Rajit Manohar
Computer Systems Laboratory
Cornell University
Outline
Motivation / Background
Contributions
Relaxed Quasi Delay-Insensitive (RQDI)
RQDI Voltage Scaling
RQDI Two Phase Circuits
Results
Summary
Motivation:
How Does Dynamic Power Scale?
$P_D = \alpha N C_L V_{dd}^2 f$
α – activity factor (1x)
N – total number of transistors (2x)
CL – average load capacitance per transistor (.7x)
Vdd – doesn’t scale well anymore
Scaled by 17-20% from 130nm to 65nm.
Scaled by 10% at 45nm and 5.5% at 32nm.
$\frac{P_{D1}}{P_{D0}} = 1.4 \cdot \frac{V_{dd1}^2 f_1}{V_{dd0}^2 f_0}$
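The scaling relation above can be evaluated numerically. The following is a minimal sketch, assuming the per-generation factors listed on the slide (1x activity, 2x transistors, 0.7x load capacitance) and a fixed frequency; the function name and parameter names are illustrative and not part of the original talk.

```python
def dynamic_power_ratio(vdd_scale, f_scale=1.0, alpha_scale=1.0,
                        n_scale=2.0, cl_scale=0.7):
    """Return P_D1 / P_D0 for one technology generation.

    Each argument is the ratio new/old for that quantity. The defaults
    follow the slide: alpha 1x, N 2x, C_L 0.7x, which multiply to 1.4.
    """
    return alpha_scale * n_scale * cl_scale * vdd_scale ** 2 * f_scale


# Fixed frequency (f_scale = 1.0), with Vdd reduced by 10% and by 5.5%.
ratio_10pct = dynamic_power_ratio(vdd_scale=0.90)
ratio_5_5pct = dynamic_power_ratio(vdd_scale=0.945)

print(f"Vdd scaled by 10%:  power ratio {ratio_10pct:.3f}")
print(f"Vdd scaled by 5.5%: power ratio {ratio_5_5pct:.3f}")
```

Under these assumptions, any ratio above 1.0 means dynamic power grows from one generation to the next even though frequency is held constant.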
Motivation:
Power Scaling With Fixed Frequency
[Figure: Power (%) versus Technology (nm), from 130 nm to 22 nm, at fixed frequency.]
Motivation:
Process Variations Getting Worse
Process variation in 65nm, FO4 delays across corners:
  Corner   FO4 delay
  FF       13.6 ps
  TT       18.2 ps
  SS       22.6 ps
FF is 70% faster than SS.
Circuits need to be robust w.r.t. process variations.
QDI is a logical place to start.
Background:
QDI – WCHB Buffer
• Simple buffer.
• Neutrality is checked in the pull-up stack of the c-element.
• Timing assumption?
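For readers unfamiliar with the c-element mentioned above, the following is a behavioral sketch of a generic Muller C-element: the output rises only when all inputs are high, falls only when all inputs are low, and otherwise holds its value. It is not a model of the WCHB transistor netlist, and the class name is hypothetical.

```python
class CElement:
    """Behavioral model of a Muller C-element (state-holding gate)."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, *inputs):
        # All inputs high: output goes high.
        if all(inputs):
            self.out = 1
        # All inputs low: output goes low.
        elif not any(inputs):
            self.out = 0
        # Inputs disagree: hold the previous value.
        return self.out


c = CElement()
trace = [c.update(1, 0), c.update(1, 1), c.update(0, 1), c.update(0, 0)]
print(trace)
```

The hold state is what a staticizer maintains in a physical implementation, which is where the timing assumption raised on this slide comes in.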
RQDI:
Staticizer Timing Assumption I-III
• Data is neutral and enable is high.
• Data becomes valid, which sets _R0 low. If the R0 inverter is slow, R0 will remain low.
• Nothing is fighting the weak feedback, so _R0 can go high.
RQDI:
Half Cycle Timing Assumption
The half cycle timing assumption (HCTA):
A small amount of combinational logic (1-2 transitions) will always switch within one half cycle of a process.
At 18 transitions per cycle (t.p.c.), there is a 4.5x timing margin.
With worst case corners, 2.7x margin in 65nm.
Wire delays make the assumption even more conservative.
QDI has an HCTA in staticizers.
RQDI allows them everywhere.
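The quoted margins can be reproduced with simple arithmetic. The sketch below is one reading of where the numbers come from, not a derivation given in the talk: it assumes the margin is the half cycle (in transitions) divided by the number of logic transitions, and that the worst case pairs the assumed logic at the slowest FO4 delay with the rest of the cycle at the fastest. The 13.6 ps and 22.6 ps values are the fastest and slowest FO4 delays quoted for 65nm; the function name is illustrative.

```python
def hcta_margin(transitions_per_cycle, logic_transitions,
                fast_fo4_ps=None, slow_fo4_ps=None):
    """Estimate the timing margin of a half cycle timing assumption.

    Nominal margin: half-cycle transitions / assumed logic transitions.
    If corner delays are given, derate by fast/slow (logic at the slow
    corner while the half cycle runs at the fast corner).
    """
    half_cycle = transitions_per_cycle / 2
    margin = half_cycle / logic_transitions
    if fast_fo4_ps is not None and slow_fo4_ps is not None:
        margin *= fast_fo4_ps / slow_fo4_ps
    return margin


nominal = hcta_margin(18, 2)
worst = hcta_margin(18, 2, fast_fo4_ps=13.6, slow_fo4_ps=22.6)

print(f"nominal margin:    {nominal:.1f}x")
print(f"worst-case margin: {worst:.1f}x")
```

Under these assumptions the nominal case gives 4.5x and the corner-derated case rounds to 2.7x, matching the figures on the slide.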
RQDI:
HCHB Template
• N tracks neutrality.
• Check N+, but assume N- happens in the first half cycle.
• Two transition latency.
• 14 transition cycle time.
• Validity must be checked by pulldown.
RQDI Voltage Scaling:
Scaling Scenarios
• Two possible scenarios for voltage scaling.
• Top: mismatched slack. Lower pipeline can run slower.
• Bottom: token limited loop. Latency through loop should be minimal, but cycle time can scale.
• In some applications these can’t be avoided.
RQDI Voltage Scaling:
Slack Mismatch In An FPGA
• Logic blocks (LB) for logic.
• Switch boxes (SB) for routing.
• Limited routing resources.
• Imperfect slack matching.
• Can scale voltage on blue path.
RQDI Voltage Scaling:
DVHB: Dual Voltage Template
• Data rails are full swing.
• Acknowledges are low swing.
• Latency remains constant through voltage scaling.
• Cycle time can be adjusted through voltage scaling.
RQDI Two Phase Circuits:
Two Phase Buffer (HCFB2P)
• An HCTA exists on the right pair of XORs.
• Two transition latency.
• Seven transition cycle time.
• Twice the area of a WCHB. However, it can replace two stages.
RQDI Two Phase Circuits:
Two Phase In An FPGA
• Replace routing (SB) with two phase logic.
• Logic (LB) remains four phase.
• Phase converters are placed around logic blocks.
• Routing makes up over half the area in an asynchronous FPGA, so power savings can be large.
[Figure: Width N Switch]
RQDI Two Phase Circuits:
Converters
Need to convert between two phase (for routing) and four phase (for logic).
The 4:2 converter is 3x larger than a WCHB.
The 2:4 converter is 3.25x larger than a WCHB.
Experimental Setup
• Simulated in HSpice with a 65nm bulk technology.
• Circuits are sized to the drive strength of a 20/10 lambda inverter.

  Name   Description     Inputs   Outputs   Implies Validity?
  and2   And             2        1         No
  or2    Or              2        1         No
  xor2   Exclusive Or    2        1         Yes
  fa     Full Adder      3        2         Yes
  benc   Booth Encoder   3        2         No
Results:
HCHB – Energy Per Cycle
• HCHB consumes 32% less energy than PCHB.
• HCHB consumes 36% less energy than PCEHB.
• Slight frequency improvement.
• Negligible latency penalty.
[Figure: Energy Per Operation (pJ) by Benchmark (and2, or2, xor2, fa, benc, avg) for PCHB, PCEHB, and HCHB.]
Results:
HCHB – Total Transistor Area
• Despite the additional transistors to check validity, HCHB is smaller.
• HCHB is about 20% smaller than PCHB.
• HCHB is about 15% smaller than PCEHB.
[Figure: Total Transistor Area (µm2) by Benchmark (and2, or2, xor2, fa, benc, avg) for PCHB, PCEHB, and HCHB.]
Results:
DVHB – Low voltage vs. Dual Voltage
[Figure: two panels plotted against Voltage (V), from 1 V down to 0.5 V, comparing DVHB and Low Vdd. Axis labels: Power (%) and Dynamic Slack.]
Results:
HCFB2P Switch – Energy Reduction vs. WCHB
• Wider switches mean larger MUXes and larger PCs.
• The associated caps switch half as much.
• Over 50% reduction in power, due to replacing two stages.
[Figure: Energy Reduction (%) versus Switch Width, for widths 2 to 16.]
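The "switch half as much" observation follows from the handshake protocols themselves. The sketch below counts wire transitions per token under a simplified model, assuming a single data wire and a single acknowledge wire; real channels use multi-rail encodings, and the function name is hypothetical.

```python
def count_transitions(tokens, protocol):
    """Count wire transitions needed to transfer `tokens` tokens."""
    wires = {"data": 0, "ack": 0}
    transitions = 0
    for _ in range(tokens):
        if protocol == "four-phase":
            # Set data, acknowledge, then return both wires to zero.
            steps = [("data", 1), ("ack", 1), ("data", 0), ("ack", 0)]
        elif protocol == "two-phase":
            # Each toggle is an event; there is no return-to-zero phase.
            steps = [("data", 1 - wires["data"]), ("ack", 1 - wires["ack"])]
        else:
            raise ValueError(f"unknown protocol: {protocol}")
        for wire, value in steps:
            if wires[wire] != value:
                wires[wire] = value
                transitions += 1
    return transitions


four = count_transitions(10, "four-phase")
two = count_transitions(10, "two-phase")
print(four, two)
```

In this model the two phase protocol needs half the transitions per token, so a capacitance driven by those wires is charged or discharged half as often.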
RQDI Two Phase Circuits:
Results – Area Overhead
• Typically, there are about 8 stages of 4-wide switches between logic blocks.
• Area overhead is 15%.
• With direct connections, there are about 10 stages with an overhead of 10%.
[Figure: Area Overhead (%) versus Number of Stages, from 4 to 20 stages, for width 4.]
Summary
RQDI allows half cycle timing assumptions outside of
staticizers.
With RQDI, we can simplify the PCHB logic template. The
resulting template, HCHB, consumes 32% less energy.
The dual voltage logic template can be used to adjust the
dynamic slack of a stage. This allows us to save energy with a
minimal throughput penalty in token limited loops.
Replacing the routing in an FPGA with two phase logic can
reduce energy consumption by 50%. Using the RQDI two
phase buffer and converters will achieve this with a 10-15%
area overhead.
Questions?