A 64b Adder Using Self-Calibrating Differential Output Prediction Logic

Advanced VLSI FALL 2006
CLASS PRESENTATION
ISSCC 2006
A 64b Adder Using Self-Calibrating Differential Output
Prediction Logic
K. H. Chong and Larry McMurchie
Dept. of Electrical Engineering
University of Washington
Carl Sechen
Electrical Engineering Dept.
University of Texas at Dallas
Supervisor: M. Fakhraei
By: A. Jahanshahi
Dept. of Electrical and Computer Engineering
University of Tehran
Outline
• History of Output Prediction Logic
• Introduction to Output Prediction Logic (OPL)
– Fastest digital logic technique
• Self-calibrating Differential OPL (DOPL)
– Twice as fast and uses half the energy compared to domino logic
• High-speed, power-efficient 64b adder architecture
– Valency-3 Kogge-Stone sparse tree
– DOPL-specific 3b carry-select units
– Only 5 logic levels
• Measurement results
History
• First introduced in 2000 [1]
– Speedup of 2X to 3X over optimized static CMOS, and up to 5X when applied to wide-input NOR gates [1]
• Differential OPL (DOPL) is 5 times faster than optimized static CMOS and nearly 2 times faster than OPL and domino [2]
• Several successful chips have been reported to date [3-7]
Output Prediction Logic
[Figure: chain of four inverting gates (Gate 1 to Gate 4) with outputs 1, 0, 1, 0]
• Both static and OPL gates are inherently inverting
• In the worst case for static CMOS, the output of every gate on a critical path must fully transition from 1 to 0 or from 0 to 1
[Figure: chain of four OPL gates (Gate 1 to Gate 4) clocked by clk1 to clk4, with every output predicted to be 1]
• OPL reduces the worst case by predicting that all outputs are 1
• On any critical path, only every other gate has to transition (pull down)
• If consecutive gates pull down, the path is not critical, since the first pull-down event does not cause the second
• Critical path delay is reduced by at least 50%
• Speedup exceeds 2X when the gates are skewed for pull-down transitions
How to achieve high inputs and high outputs on inverting gates?
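The every-other-gate argument on this slide can be checked with a toy model. The sketch below is an illustration only, not taken from the presentation: it assumes a chain of ideal inverters and simply counts how many outputs must change. The helper names `settle` and `count_transitions` are hypothetical.

```python
def settle(chain_input, n_gates):
    """Final output of each inverter in a chain driven by chain_input."""
    outputs, value = [], chain_input
    for _ in range(n_gates):
        value = 1 - value              # every gate is inverting
        outputs.append(value)
    return outputs


def count_transitions(initial, final):
    """Number of gate outputs that must change to reach the final state."""
    return sum(1 for i, f in zip(initial, final) if i != f)


n = 10
final = settle(1, n)                   # chain input is 1

# Static CMOS worst case: every output starts at the opposite value.
static_initial = [1 - v for v in final]
static_transitions = count_transitions(static_initial, final)

# OPL: every output is predicted to be 1 before evaluation.
opl_initial = [1] * n
opl_transitions = count_transitions(opl_initial, final)

print(static_transitions, opl_transitions)   # 10 5
```

In this model only the gates whose final output is 0 have to pull down, which is every other gate in the chain.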
Three Types of Single-Rail OPL Gates
[Figure: schematics of the OPL-static, OPL-pseudo, and OPL-dynamic gates, each with inputs a, b, c, clock clk, and output out]
• In all three cases, when a gate is not enabled (clk = 0), its output is high even if its inputs are high
• Enable the gates (clk = 1) when inputs have arrived
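The enable behaviour described in these two bullets can be written as a small behavioural model. This is a sketch under assumptions: the slide does not state which logic function the three-input gates implement, so a NOR3 is assumed here, and the function names `nor3` and `opl_gate` are hypothetical.

```python
def nor3(a, b, c):
    """Plain 3-input NOR (assumed gate function for illustration)."""
    return int(not (a or b or c))


def opl_gate(clk, a, b, c, logic=nor3):
    """Behavioural OPL gate: output is held at 1 until the gate is enabled."""
    if clk == 0:
        return 1                 # predicted value, regardless of the inputs
    return logic(a, b, c)        # enabled: behaves like the underlying gate


print(opl_gate(0, 1, 1, 1))      # 1: disabled, output high even with high inputs
print(opl_gate(1, 1, 1, 1))      # 0: enabled, the gate pulls down
```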
OPL Clock Separations
• Predicted 1’s are maintained by delaying the clocks
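[Figure: chain of four OPL gates (Gate 1 to Gate 4) with clocks clk1 to clk4, all outputs held at 1; timing diagram showing the clock separation t_i between the successive clocks clk_(i-1), clk_i, and clk_(i+1)]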
• One fast pull-down event every TWO adjacent clock separations!
– for ANY critical path
• Separations are small, less than an inverter delay; they can be produced, for example, by reduced-swing logic [3]
Delay vs. Clock Separation
[Figure: delay (ns) vs. clock separation (ns)]
• Red: OPL-static NOR3 chain
• Blue point: optimally sized static CMOS NOR3 chain
• Robust: +/- 30% around the nominal separation of 0.14 ns still gives > 2X speedup [2]
[Figure: three waveform panels showing IN, CLK, and OUT between VDD and GND: clock too early, delay optimal, clock blocking]
OPL-Differential Gates
• Drawback of true differential gates is that one side or the other will have a
tall stack of devices
• In differential domino, in the worst case, every stack on a particular signal
path will have to discharge
• In OPL-differential, at most every other stack on a critical signal path will
have to discharge: 2X speedup
Diff. Domino vs. OPL-Diff.
• Delays (ns) for chains of 10 gates (FO of 4) in 0.18um TSMC
– static CMOS chains are optimally sized
– domino and OPL-differential use same size transistors
(Delays in ns; values in parentheses are normalized to static CMOS.)

Chain type    Static CMOS    Diff. Domino    OPL-Diff.
INV           0.84 (1.0)     0.62 (0.74)     0.16 (0.19)
NOR2          1.26 (1.0)     0.66 (0.52)     0.25 (0.20)
NOR3          1.59 (1.0)     0.74 (0.47)     0.30 (0.19)
NOR4          2.34 (1.0)     0.89 (0.38)     0.34 (0.15)
NAND2         1.02 (1.0)     0.66 (0.65)     0.30 (0.29)
NAND3         1.38 (1.0)     0.80 (0.58)     0.45 (0.33)
NAND4         1.48 (1.0)     0.89 (0.60)     0.52 (0.35)
AOI21         1.30 (1.0)     0.72 (0.55)     0.35 (0.27)
AOI22         1.74 (1.0)     0.82 (0.47)     0.33 (0.19)
AOI222        2.95 (1.0)     1.01 (0.34)     0.54 (0.18)
AOI31         1.76 (1.0)     0.83 (0.47)     0.52 (0.30)
AOI33         2.60 (1.0)     1.00 (0.38)     0.50 (0.19)
AOI333        4.00 (1.0)     1.19 (0.30)     0.59 (0.14)
AOI321        2.43 (1.0)     0.91 (0.37)     0.54 (0.22)
average       1.91 (1.0)     0.84 (0.44)     0.41 (0.21)
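The average row of the delay table can be recomputed from the per-gate delays. The short script below is only a cross-check of the numbers on this slide; the variable names are illustrative.

```python
# Delays in ns copied from the table: (static CMOS, diff. domino, OPL-diff.)
delays = {
    "INV":    (0.84, 0.62, 0.16),
    "NOR2":   (1.26, 0.66, 0.25),
    "NOR3":   (1.59, 0.74, 0.30),
    "NOR4":   (2.34, 0.89, 0.34),
    "NAND2":  (1.02, 0.66, 0.30),
    "NAND3":  (1.38, 0.80, 0.45),
    "NAND4":  (1.48, 0.89, 0.52),
    "AOI21":  (1.30, 0.72, 0.35),
    "AOI22":  (1.74, 0.82, 0.33),
    "AOI222": (2.95, 1.01, 0.54),
    "AOI31":  (1.76, 0.83, 0.52),
    "AOI33":  (2.60, 1.00, 0.50),
    "AOI333": (4.00, 1.19, 0.59),
    "AOI321": (2.43, 0.91, 0.54),
}

n = len(delays)
avg_static = sum(d[0] for d in delays.values()) / n
avg_domino = sum(d[1] for d in delays.values()) / n
avg_opl = sum(d[2] for d in delays.values()) / n

# Average delays, matching the "average" row of the table.
print(round(avg_static, 2), round(avg_domino, 2), round(avg_opl, 2))    # 1.91 0.84 0.41

# Average speedup of OPL-differential over static CMOS and over diff. domino.
print(round(avg_static / avg_opl, 1), round(avg_domino / avg_opl, 1))   # 4.7 2.1
```

The two ratios agree with the earlier statement that DOPL is about 5 times faster than optimized static CMOS and about 2 times faster than domino.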
Improved OPL-Differential
• S. Kio, L. McMurchie, and C. Sechen, “Application of Output Prediction Logic to Differential CMOS,” Proc. of IEEE Computer Society Annual Workshop on VLSI, Orlando, FL, April 19-20, 2001
[Figure: two OPL-differential gate schematics with devices P1 to P4, clock clk, and outputs OUT and OUTB]
• tfall(OUT)  tfall(OUTB) needed for contention
free evaluation
Self-Calibrating Differential OPL
• Dual-output gates: use a completion detector to produce a downstream clock
– Ideally, it would feed the next level
– But DOPL gates are too fast!
• If a DOPL gate evaluates slower (faster) than expected, the downstream clock is delayed (sped up) to compensate
[Figure: five levels of DOPL gates (1st to 5th), each with its own clock (1st clk to 5th clk); clk_ref drives a buffer tree and a T-gate]
Clock Skew Reduction
• DOPL circuits are levelized
• Completion detector outputs for each level are tied together
• Static CMOS NAND2s cannot be used because of contention
[Figure: two rows of 1st to 5th level DOPL gates with the completion detector outputs of each level tied together; buffer tree, clk_ref, and T-gate]
pMOS Dynamic NAND2 Completion Detector
• Minimizes crowbar current
• Fast, monotonic rising clock edge
• Power consumption is comparable to that of an inverter
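A behavioural view of such a detector may help. The sketch below is an assumption-laden illustration, not the circuit from the slide: it treats the detector as a dynamic node that Reset holds low and that is pulled high as soon as either of the two DOPL outputs falls, which gives a rising edge that cannot fall again until the next reset. The class name `CompletionDetector` and its `step` method are hypothetical.

```python
class CompletionDetector:
    """Toy model of a dynamic NAND2 completion detector (assumed behaviour)."""

    def __init__(self):
        self.clk = 0                       # downstream clock starts low

    def step(self, reset, out, outb):
        if reset:                          # reset phase: node held low
            self.clk = 0
        elif out == 0 or outb == 0:        # one rail has evaluated (fallen)
            self.clk = 1                   # node pulled high and stays high
        return self.clk


det = CompletionDetector()
print(det.step(1, 1, 1))   # 0: reset, both DOPL outputs precharged high
print(det.step(0, 1, 1))   # 0: evaluating, no output has fallen yet
print(det.step(0, 0, 1))   # 1: OUT fell, downstream clock rises
print(det.step(1, 1, 1))   # 0: reset for the next cycle
```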
[Figure: completion detectors on DOPL1 (outputs out1, out2) and DOPL2 (outputs out3, out4), each with evaluate devices and a Reset input, producing clk]
Low Skew Inverter Generates Reset Signals
[Figure: 1st to 5th level DOPL blocks with reset signals Reset(1), Reset(2), and Reset(3), buffer tree, clk_ref, and T-gate]
• Crowbar current is minimized:
– Reset goes low slightly before DOPL evaluates
– Reset goes high slightly after DOPL pre-charges
[Figure: completion detector with inputs in1 to in4, Reset(n), and output clk]
Self-Calibrating DOPL Floorplan
[Figure: floorplan with clk_reference feeding mesh buffer trees; rows of 1st to 5th level DOPL gates with completion detectors and low-skew inverters]
• A level may span multiple rows
64b Adder Architecture
• Valency-3 Kogge-Stone sparse carry tree
• Only log_3 N levels are needed to produce every 3rd carry; the challenge is to efficiently produce the “missing” pairs of carries
[Figure: sparse carry tree producing carries c1, c4, c7, c10, c13, and c16]
3b Carry-Select Units
• Valency-3 Kogge-Stone sparse tree quickly generates every 3rd carry
– log_3 N levels
• Use carry-select CLA Adder to output sums when this “quick” carry arrives
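The combination of a sparse valency-3 prefix tree with carry-select units can be sketched in software. This is a bit-level functional model under assumptions, not the circuit from the paper: the grouping (bits 1-0, then 4-2, 7-5, and so on, with a final 2-bit group) is inferred from the carry names Cin, C2, C5, C8 on this slide, and all function names are hypothetical.

```python
def group_gp(a, b, lo, hi):
    """(generate, propagate) of bits lo..hi-1 of the operands a and b."""
    g, p = 0, 1
    for i in range(lo, hi):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        g, p = (ai & bi) | ((ai ^ bi) & g), (ai ^ bi) & p
    return g, p


def combine(high, low):
    """Prefix operator: merge a more significant span with a less significant one."""
    return high[0] | (high[1] & low[0]), high[1] & low[1]


def valency3_prefix(gp):
    """Kogge-Stone style prefix over groups, merging three spans per level."""
    stride, levels = 1, 0
    while stride < len(gp):
        nxt = []
        for j in range(len(gp)):
            acc = gp[j]
            for k in (1, 2):
                if j - k * stride >= 0:
                    acc = combine(acc, gp[j - k * stride])
            nxt.append(acc)
        gp, stride, levels = nxt, stride * 3, levels + 1
    return gp, levels


def sparse_tree_add(a, b, cin=0, n=64):
    """Add two n-bit numbers; returns (sum, carry-out, prefix levels)."""
    starts = [0] + list(range(2, n, 3))        # assumed grouping: 2, 3, 3, ...
    ends = starts[1:] + [n]
    gp = [group_gp(a, b, lo, hi) for lo, hi in zip(starts, ends)]
    prefix, levels = valency3_prefix(gp)
    carry_in = [cin] + [g | (p & cin) for g, p in prefix]
    total = 0
    for k, (lo, hi) in enumerate(zip(starts, ends)):
        mask = (1 << (hi - lo)) - 1
        x, y = (a >> lo) & mask, (b >> lo) & mask
        sum0, sum1 = (x + y) & mask, (x + y + 1) & mask   # both sums precomputed
        total |= (sum1 if carry_in[k] else sum0) << lo    # MUX on the tree carry
    return total, carry_in[-1], levels


s, cout, levels = sparse_tree_add(0xFFFFFFFFFFFFFFFF, 1)
print(s, cout, levels)   # 0 1 3
```

With 22 groups the prefix loop runs three times, in line with the log_3 N level count quoted for the sparse tree.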
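[Figure: 3b carry-select units. Sums precomputed for carry-in 0 (S1^0-S0^0, S4^0-S2^0, S7^0-S5^0) and for carry-in 1 (S1^1-S0^1, S4^1-S2^1, S7^1-S5^1) are selected by MUXes using the carries Cin, C2, and C5 to give S1-S0, S4-S2, and S7-S5; C8 is also shown]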
64b Adder Layout and Photomicrograph
• IBM 130nm 1.2V process (8RF): Area = 264um X 180um
• Auto placed and routed
Energy Consumption
• Energy per operation: 29.5 pJ for the IBM 130nm 1.2V process
• Since energy is proportional to CV^2, we can conservatively estimate the energy consumption for a 90nm 1.1V process:

$$\text{Scaled Energy} = \frac{29.5\,\text{pJ}}{\dfrac{C_{130nm}\cdot 1.2^{2}}{C_{90nm}\cdot 1.1^{2}}} = \frac{24.8\,\text{pJ}}{C_{130}/C_{90}} = \frac{24.8\,\text{pJ}}{130/90} = 17.2\,\text{pJ}$$
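The scaling estimate on this slide can be reproduced numerically. The script below is a sketch of that arithmetic; it assumes, as the slide's numbers imply, that capacitance scales linearly with feature size (C_130 / C_90 = 130 / 90). The variable names are illustrative.

```python
e_130 = 29.5                    # measured energy per operation in pJ (130nm, 1.2V)
vdd_130, vdd_90 = 1.2, 1.1      # supply voltages in volts
cap_ratio = 90 / 130            # assumption: C proportional to feature size

# Energy is proportional to C * V^2, so scale the voltage first, then the capacitance.
e_voltage_scaled = e_130 * (vdd_90 / vdd_130) ** 2
e_90 = e_voltage_scaled * cap_ratio

print(round(e_voltage_scaled, 1), round(e_90, 1))   # 24.8 17.2
```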
Measured Results
(Values in parentheses are normalized to self-calibrating DOPL.)

Logic style               Adder delay (FO4)   Actual adder energy (pJ)   Process                 Estimated adder energy in 90nm (pJ)   Ref.
Self-calibrating DOPL     3.9 (1)             29.5                       IBM 130nm 1.2V CMOS     17.2 (1)                              -
Domino logic              6.8 (1.8)           33.5                       130nm 1.2V CMOS         33.5 (1.95)                           [1]
Single-rail OPL-dynamic   4.7 (1.2)           325                        TSMC 180nm 1.8V CMOS    60.6 (3.53)                           [2]
Static CMOS logic         11.7 (3)            54                         ST 180nm 1.8V CMOS      10.1 (0.58)                           [3]
1. R. Zlatanovici and B. Nikolic, “Power-performance optimal 64-bit carry-lookahead adders,” Proc. ESSCIRC, Sep 2003, pp. 321-324.
2. S. Sun, Y. Han, X. Guo, K.H. Chong, L. McMurchie, and C. Sechen, “409ps 4.7 FO4 64b Adder Based on Output Prediction Logic in 0.18um CMOS,” Proc. IEEE Comp. Soc. Annual Symp. on VLSI (ISVLSI), 11-12 May 2005, pp. 52-58.
3. S. Perri, P. Corsonello, and G. Staino, “A Low Power Sub-Nanosecond Standard-Cells Based Adder,” Proc. IEEE ICECS 2003, pp. 296-299.
VDD vs. Delay Curve for 64b DOPL Adder
[Figure: VDD (V), axis from 0 to 1.4, plotted against delay (ps), axis from 0 to 1000]
Summary
• Developed self-calibrating differential output prediction logic (DOPL)
– Twice as fast as domino logic, and half the energy
• Developed hybrid 64b adder architecture, consisting of a valency-3
Kogge-Stone sparse tree and DOPL-specific 3b carry-select units
• 64b adder implemented using 130nm 1.2V IBM process (8RF)
• Nominal measured delay 238ps (3.9 FO4)
– Best measured delay 215ps (3.5 FO4)
• Fastest 64b adder reported by nearly 2X
• DOPL is a great candidate for scaling.
• Energy: 29.5 pJ (conservatively scales to 17.2 pJ for a 90nm process)
– Competitive with fast static CMOS adders
References
[1] L. McMurchie, S. Kio, G. Yee, T. Thorp, and C. Sechen, “Output Prediction Logic: A High Performance CMOS Design Technique,” Proc. Int. Conf. on Computer Design (ICCD), September 17-20, 2000, Austin, TX.
[2] S. Kio, L. McMurchie, and C. Sechen, “Application of Output Prediction Logic to Differential CMOS,” Proc. IEEE Workshop on VLSI, pp. 57-65, April 2001.
[3] S. Sun, Y. Han, X. Guo, K.H. Chong, L. McMurchie, and C. Sechen, “409ps 4.7 FO4 64b Adder Based on Output Prediction Logic in 0.18um CMOS,” Proc. IEEE Comp. Soc. Annual Symp. on VLSI (ISVLSI), 11-12 May 2005, pp. 52-58.
[4] X. Guo and C. Sechen, “A High Throughput Divider Implementation,” Proc. IEEE CICC, Paper 15.2, Sept. 2005.
[5] R. Zlatanovici and B. Nikolic, “Power-Performance Optimal 64-bit Carry-Lookahead Adders,” Proc. ESSCIRC, pp. 321-324, Sept. 2003.
[6] S. Sun, L. McMurchie, and C. Sechen, “A High-Performance 64-bit Adder Implemented in Output Prediction Logic,” Proc. Conference on Advanced Research in VLSI (ARVLSI 2001), 14-16 March 2001, pp. 213-222.
[7] “High Speed Redundant Adder and Divider in Output Prediction Logic,” Proc. IEEE Computer Society Annual Symposium on VLSI: New Frontiers in VLSI Design, 2005.