Reducing Power Consumption with Relaxed Quasi Delay-Insensitive Circuits
Christopher LaFrieda and Rajit Manohar
Computer Systems Laboratory
Cornell University
Outline
• Motivation / Background
• Contributions
• Relaxed Quasi Delay-Insensitive (RQDI)
• RQDI Voltage Scaling
• RQDI Two Phase Circuits
• Results
• Summary
Motivation:
How Does Dynamic Power Scale?
$P_D = \alpha \cdot N \cdot C_L \cdot V_{dd}^2 \cdot f$
• α: activity factor (1x)
• N: total number of transistors (2x)
• C_L: average load capacitance per transistor (0.7x)
• V_dd: doesn't scale well anymore
  • Scaled by 17-20% from 130nm to 65nm.
  • Scaled by 10% at 45nm and 5.5% at 32nm.
$\frac{P_{D1}}{P_{D0}} = 1.4 \cdot \frac{V_{dd1}^2 f_1}{V_{dd0}^2 f_0}$
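As a rough illustration of the ratio above, the following Python sketch evaluates the per-generation power factor at fixed frequency. The function name `power_ratio` and its parameter names are illustrative assumptions. The 1x, 2x and 0.7x factors and the 10% and 5.5% supply reductions are the figures quoted on the slide.

```python
def power_ratio(vdd_scale, freq_scale=1.0, activity=1.0, transistors=2.0, load_cap=0.7):
    """Return P_D1 / P_D0 for one technology generation.

    Defaults follow the slide: activity 1x, transistor count 2x,
    load capacitance 0.7x, which multiply to the 1.4 factor.
    """
    return activity * transistors * load_cap * vdd_scale ** 2 * freq_scale


# Supply reductions quoted on the slide, applied at fixed frequency.
ratio_45nm = power_ratio(vdd_scale=1 - 0.10)
ratio_32nm = power_ratio(vdd_scale=1 - 0.055)

print(f"45nm: {ratio_45nm:.3f}x")  # 1.4 * 0.9^2
print(f"32nm: {ratio_32nm:.3f}x")  # 1.4 * 0.945^2
```

Both ratios are above 1, so under these assumptions dynamic power grows from one generation to the next when frequency is held constant.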
Motivation:
Power Scaling With Fixed Frequency
[Figure: power (%), axis from 80 to 150, versus technology node (130, 90, 65, 45, 32, 22 nm) at fixed frequency.]
Motivation:
Process Variations Getting Worse
• Process variation in 65nm:
• FO4 delays across corners:
    Corner      FO4 delay
    SS Corner   22.6 ps
    TT Corner   18.2 ps
    FF Corner   13.6 ps
• FF is 70% faster than SS.
• Circuits need to be robust w.r.t. process variations.
• QDI is a logical place to start.
Background:
QDI – WCHB Buffer
• Simple buffer.
• Neutrality is checked in the pull-up stack of the C-element.
• Timing assumption?
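For readers unfamiliar with the WCHB buffer, a minimal behavioral sketch in Python follows. It assumes the usual textbook structure (one C-element per output rail, combining the input rail with the right-hand enable, and a NOR of the output rails as the left-hand enable). It models no delays, no staticizer, and no inverted internal nodes, and the class and signal names are illustrative rather than taken from the slides.

```python
class CElement:
    """Muller C-element: the output follows the inputs when they agree, else holds."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, a, b):
        if a == b:
            self.out = a
        return self.out


class WCHB:
    """Behavioral dual-rail WCHB buffer (zero-delay, illustrative only)."""

    def __init__(self):
        self.c0 = CElement()
        self.c1 = CElement()

    def step(self, l0, l1, r_enable):
        # Each output rail is a C-element of its input rail and the right enable.
        # A rail can only reset when its input is neutral AND the enable is low.
        r0 = self.c0.update(l0, r_enable)
        r1 = self.c1.update(l1, r_enable)
        # The left enable falls once the output is valid and rises once it is neutral.
        l_enable = int(not (r0 or r1))
        return r0, r1, l_enable


buf = WCHB()
trace = [
    buf.step(0, 0, 1),  # neutral input, receiver ready
    buf.step(1, 0, 1),  # input rail 0 goes high: output becomes valid
    buf.step(0, 0, 1),  # input returns to neutral, receiver has not acknowledged yet
    buf.step(0, 0, 0),  # receiver acknowledges: output resets to neutral
    buf.step(0, 0, 1),  # receiver ready again
]
print(trace)
```

The third step shows the weak-condition behavior: the output stays valid after the input goes neutral, because the C-element waits for both the neutral input and the falling enable before resetting.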
RQDI:
Staticizer Timing Assumption I
• Data is neutral and enable is high.
RQDI:
Staticizer Timing Assumption II
• Data is neutral and enable is high.
• Data becomes valid, which sets _R0 low. If the R0 inverter is slow, R0 will remain low.
RQDI:
Staticizer Timing Assumption III
• Data is neutral and enable is high.
• Data becomes valid, which sets _R0 low. If the R0 inverter is slow, R0 will remain low.
• Nothing is fighting the weak feedback, so _R0 can go high.
RQDI:
Half Cycle Timing Assumption
• The half cycle timing assumption (HCTA): a small amount of combinational logic (1-2 transitions) will always switch within one half cycle of a process.
• There is a 4.5x timing margin (at 18 transitions per cycle).
• With worst case corners, the margin is 2.7x in 65nm.
• Wire delays make the assumption even more conservative.
• QDI has an HCTA in staticizers.
• RQDI allows them everywhere.
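One way to arrive at the quoted margins is sketched below in Python. It assumes that the margin is the number of transitions in a half cycle divided by the number of transitions in the logic covered by the assumption, and that the worst case pairs that logic at the slowest corner with the rest of the cycle at the fastest corner. Both are interpretations, not statements from the slides. The two delays are the smallest and largest FO4 values from the process variation slide, and the function name is illustrative.

```python
def hcta_margin(transitions_per_cycle, logic_transitions, slow_delay=1.0, fast_delay=1.0):
    """Ratio of the half-cycle time to the time taken by the assumed-fast logic.

    slow_delay applies to the logic under the assumption, fast_delay to the
    half cycle it races against (both in the same unit, e.g. ps per FO4).
    """
    half_cycle_time = (transitions_per_cycle / 2) * fast_delay
    logic_time = logic_transitions * slow_delay
    return half_cycle_time / logic_time


nominal = hcta_margin(18, 2)
worst = hcta_margin(18, 2, slow_delay=22.6, fast_delay=13.6)
print(f"nominal: {nominal:.1f}x, worst case: {worst:.1f}x")
```

Under these assumptions the sketch gives 4.5x nominally and about 2.7x across corners, which matches the figures on the slide.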
RQDI:
HCHB Template
• N tracks neutrality.
• Check N+, but assume N- happens in the first half cycle.
• Two transition latency.
• 14 transition cycle time.
• Validity must be checked by the pull-down.
RQDI Voltage Scaling:
Scaling Scenarios
• Two possible scenarios for voltage scaling.
• Top: mismatched slack. Lower pipeline can run slower.
• Bottom: token limited loop. Latency through the loop should be minimal, but cycle time can scale.
• In some applications these can't be avoided.
[Figure: two pipeline diagrams, labeled "Mismatched slack" (top) and "Token limited loop" (bottom).]
RQDI Voltage Scaling:
Slack Mismatch In An FPGA
• Logic blocks (LB) for logic.
• Switch boxes (SB) for routing.
• Limited routing resources.
• Imperfect slack matching.
• Can scale voltage on blue path.
RQDI Voltage Scaling:
DVHB: Dual Voltage Template
• Data rails are full swing.
• Acknowledges are low swing.
• Latency remains constant through voltage scaling.
• Cycle time can be adjusted through voltage scaling.
RQDI Two Phase Circuits:
Two Phase Buffer (HCFB2P)
• An HCTA exists on the right pair of XORs.
• Two transition latency.
• Seven transition cycle time.
• Twice the area of a WCHB. However, it can replace two stages.
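The appeal of two-phase signaling can be illustrated by counting wire transitions per token. The Python sketch below is a generic model of handshake signaling on one active data rail and one acknowledge wire. It is not a model of the HCFB2P circuit itself, and the function names are illustrative.

```python
def four_phase_transitions(tokens):
    """Return-to-zero signaling: each token raises and then lowers both wires."""
    data = ack = 0
    count = 0
    for _ in range(tokens):
        for level in (1, 0):            # set phase, then reset phase
            count += int(data != level)  # data rail switches
            data = level
            count += int(ack != level)   # acknowledge follows
            ack = level
    return count


def two_phase_transitions(tokens):
    """Transition signaling: each token toggles each wire exactly once."""
    data = ack = 0
    count = 0
    for _ in range(tokens):
        data ^= 1   # a toggle on the data rail carries the token
        count += 1
        ack ^= 1    # a toggle on the acknowledge completes the handshake
        count += 1
    return count


print(four_phase_transitions(10), two_phase_transitions(10))
```

In this simple model the two-phase protocol switches each wire half as often per token, which is the source of the energy savings on long routing wires.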
RQDI Two Phase Circuits:
Two Phase In An FPGA
• Replace routing (SB) with two phase logic.
• Logic (LB) remains four phase.
• Phase converters are placed around logic blocks.
• Routing makes up over half the area in an asynchronous FPGA, so power savings can be large.
[Figure: width N switch.]
RQDI Two Phase Circuits:
Converters
• Need to convert between two phase (for routing) and four phase (for logic).
• The 4:2 converter is 3x larger than a WCHB.
• The 2:4 converter is 3.25x larger than a WCHB.
Experimental Setup
• Simulated in HSpice with a 65nm bulk technology.
• Circuits are sized to the drive strength of a 20/10 lambda inverter.

Name   Description     Inputs   Outputs   Implies Validity?
and2   And             2        1         No
or2    Or              2        1         No
xor2   Exclusive Or    2        1         Yes
fa     Full Adder      3        2         Yes
benc   Booth Encoder   3        2         No
Results:
HCHB – Energy Per Cycle
• HCHB consumes 32% less energy than PCHB.
• HCHB consumes 36% less energy than PCEHB.
• Slight frequency improvement.
• Negligible latency penalty.
[Figure: energy per operation (pJ), axis from 0 to 0.25, for PCHB, PCEHB, and HCHB across the benchmarks and2, or2, xor2, fa, benc, and their average.]
Results:
HCHB – Total Transistor Area
• Despite the additional transistors to check validity, HCHB is smaller.
• HCHB is about 20% smaller than PCHB.
• HCHB is about 15% smaller than PCEHB.
[Figure: total transistor area (µm2), axis from 0 to 8, for PCHB, PCEHB, and HCHB across the benchmarks and2, or2, xor2, fa, benc, and their average.]
Results:
DVHB – Low voltage vs. Dual Voltage
[Figure: two plots versus voltage (1 V down to 0.5 V): power (%), axis from 0 to 100, and dynamic slack, axis from 0 to 0.25. Legend entries: DVHB, Low Vdd.]
Results:
HCFB2P Switch – Energy Reduction vs. WCHB
• Wider switches mean larger MUXes and larger PCs.
• The associated caps switch half as much.
• Over 50% reduction in power, due to replacing two stages.
[Figure: energy reduction (%), axis from 49 to 52.5, versus switch width (2 to 16).]
RQDI Two Phase Circuits:
Results – Area Overhead
• Typically, there are about 8 stages of 4-wide switches between logic blocks.
• Area overhead is 15%.
• With direct connections, there are about 10 stages with an overhead of 10%.
[Figure: area overhead (%), axis from 0 to 45, versus number of stages (4 to 20), for width 4 switches.]
Summary
• RQDI allows half cycle timing assumptions outside of staticizers.
• With RQDI, we can simplify the PCHB logic template. The resulting template, HCHB, consumes 32% less energy.
• The dual voltage logic template can be used to adjust the dynamic slack of a stage. This allows us to save energy with a minimal throughput penalty in token limited loops.
• Replacing the routing in an FPGA with two phase logic can reduce energy consumption by 50%. Using the RQDI two phase buffer and converters will achieve this with a 10-15% area overhead.
Questions?