Chapter 8 Data Path Designs - IC Design & Application Research Lab.

Download Report

Transcript Chapter 8 Data Path Designs - IC Design & Application Research Lab.

Chapter 12
Arithmetic Building Blocks
Boonchuay Supmonchai
Integrated Design Application Research (IDAR) Laboratory
August 20, 2004; Revised - July 5, 2005
B.Supmonchai
Goals of This Chapter

Designing for Performance, area, or power
 Adders
 Multipliers
 Shifters

Logic and System Optimizations for datapath
modules

Power-Delay trade-offs in datapaths
2102-545 Digital ICs
Arithmetic Building Blocks
2
B.Supmonchai
Review: A Generic Processor
RAM, ROM, Shift Register
Memory
Input/Output
Switches,
Arbiters,
Bus
Drivers
Control
FSM,
PLA,
Counter,
Random
Logic
Datapath
Adder, Multiplier,
Shifter, Comparator, etc.
2102-545 Digital ICs
Arithmetic Building Blocks
3
B.Supmonchai
Bit-Sliced Architecture
n-bit
Data In
Identical
Processing
Elements
Register
Bit 0
Shifter
Bit 1
…
Bit n-2
Bit n-1
Control
Adder
Datapath Unit
Multiplexer

Modular
n-bit
Data Out
 Easy to design and verify

Potential to be fast
 Easy to expand
2102-545 Digital ICs
Arithmetic Building Blocks
4
B.Supmonchai
Example: Itanium Bit-Sliced Design
From register files / Cache / Bypass
Multiplexers
Shifter
Adder stage 1
Adder stage 2
Wiring
Bit slice 0
Bit slice 1
Bit slice 2
Bit slice 63
Adder stage 3
Loopback Bus
Loopback Bus
Loopback Bus
Wiring
Sum Select
To register files / Cache
2102-545 Digital ICs
Arithmetic Building Blocks
5
B.Supmonchai
Example: Itanium Integer Datapath
Itanium has 6 integer execution units (ALU)
2102-545 Digital ICs
Arithmetic Building Blocks
6
B.Supmonchai
One-Bit Binary Full Adder (FA)
Cin
A
B
1-bit
Full Adder
(FA)
S
Cout
S = A  B  Cin
Cout = AB + ACin + BCin

A
B
Cin
Cout
S
Carry
Status
0
0
0
0
0
kill
0
0
1
0
1
kill
0
1
0
0
1
propagate
0
1
1
1
0
propagate
1
0
0
0
1
propagate
1
0
1
1
0
propagate
1
1
0
1
0
generate
1
1
1
1
1
generate
A VERY common operation - so worth spending some
time trying to optimize
 Often in the critical path, so need to look at both logic level and
circuit level optimizations
2102-545 Digital ICs
Arithmetic Building Blocks
7
B.Supmonchai
Propagate, Generate, and Delete (Kill)

Define 3 new variable which ONLY depend on A, B
Generate (G) = AB
Propagate (P) = A  B
Delete(D) = A B

(FA itself generates a carry)
(FA passes along carry)
(FA stops propagation of carry)
Then we can write S and Cout in terms of G, P, and Cin
S(G,P,C) = P  Cin
Cout(G,P,C) = G + PCin

We can also write S and Cout in terms of D, P, and Cin

Sometimes an alternative definition for P can be used
Propagate (P) = A + B
2102-545 Digital ICs
Arithmetic Building Blocks
8
B.Supmonchai
FA CMOS Implementation: First Try
A
A
A
A
A
B
B
B
B
Cin
B
S
Cin
B
B
A
2102-545 Digital ICs
B
A
Cin
A
Cin
A
Cout
B
Cin
Cin
Cin
Cin
A
B
32 Transistors
Majority Function Maj(A,B,C)
outputs 0 or 1 whichever has
greater numbers at the inputs
Arithmetic Building Blocks
9
B.Supmonchai
Improved CMOS Implementation

A more compact design is based on the observation that
S can be factored to reuse the Cout term
S = ABCin+ (A + B + Cin)Cout
A
B
Cin
A
B
Cin
S
S
Cout
Minority Function
Cout
2102-545 Digital ICs
Arithmetic Building Blocks
10
B.Supmonchai
Improved CMOS Implementation II
VDD
VDD
A
Ci
A
B
B
A
B
B
Ci
A
X
Ci
VDD
Ci
S
A
Ci
A
B
B
VDD
A
B
Co
Ci
A
B
28 Transistors
2102-545 Digital ICs
Arithmetic Building Blocks
11
B.Supmonchai
Notes on Improved CMOS FA

Note that the PMOS network is identical to the NMOS
network rather than being the complement.
 This is possible because of the inversion property which says
that the function of complemented inputs is equal to the
complement of the function.
 This simplification reduces the number of series transistors
and makes the layout more uniform

This design has a greater delay to compute S than Cout
 Most of the time the extra delay computing S has little effect
on the critical path because carry is the signal that propagates
 With proper sizing this delay on S can be minimized
2102-545 Digital ICs
Arithmetic Building Blocks
12
B.Supmonchai
Inversion Property

The function must be symmetric
A
Ci
A
B
FA
Co
Ci
S
B
FA
Co
S
S  A B C i  = S  A B  Ci 
C  A B C  = C  A B  C 
o
i
o
i
2102-545 Digital ICs
Arithmetic Building Blocks
13
B.Supmonchai
TG-Based FA
Cin
A
P
S
B
Cout
XOR
16 Transistors
2102-545 Digital ICs
2-to-1 MUX
XOR
Extra delay - slower
Arithmetic Building Blocks
14
B.Supmonchai
Complementary PT Logic (CPL) FA
B
B
Cin
Cin
S
A
S
A
B
Cin
B
28 transistors
dual rail
Voltage drop
Problems
Cin
A
Cout
B
Cin
Cout
A
Cin
B
Faster, Lower Power, and small area than full static CMOS
2102-545 Digital ICs
Arithmetic Building Blocks
15
B.Supmonchai
Mirror Adder
PUN and PDN are symmetrical not complemented
24+4 transistors
A
8 B
8 B
8
A
4 B
4
Cin
4
kill
0-propagate
8
A
8
4
!Cout
B
6
A
6
Cin
6
!S
Cin
1-propagate
A
4
4 B
A
4
4 B
4
2
generate
Cout = AB + ACin + BCin
2102-545 Digital ICs
A
2 B
2
Cin
2
Cin
3
A
3
B
3
S = ABCin+ (A + B + Cin)Cout
Arithmetic Building Blocks
17
B.Supmonchai
Mirror Adder Features

The NMOS and PMOS chains are completely
symmetrical with a maximum of two series transistors
in the carry circuitry, guaranteeing identical rise and
fall transitions if the NMOS and PMOS devices are
properly sized.

When laying out the cell, the most critical issue is the
minimization of the capacitances at node !Cout (four
diffusion capacitances, two internal gate capacitances,
and two inverter gate capacitances).


Shared diffusions can reduce the stack node capacitances.
The transistors connected to Cin are placed closest to the
output.
2102-545 Digital ICs
Arithmetic Building Blocks
18
B.Supmonchai
Mirror Adder Sizing Issues

Only the transistors in the carry stage have to be
optimized for optimal speed. All transistors in the sum
stage can be minimal size.

Assume PMOS/NMOS ratio of 2. Each input in the
carry circuit has a logical effort of 2 so the optimal fanout for each is also 2.

Since !Cout drives 2 internal and 2 inverter transistor
gates (to form Cout for the bit adder) the carry circuit
should be oversized
2102-545 Digital ICs
Arithmetic Building Blocks
19
B.Supmonchai
Mirror Adder Stick Diagram
VDD
A
B
Ci
B
A Ci
Co
Ci
A
B
Co
S
GND
2102-545 Digital ICs
Arithmetic Building Blocks
20
B.Supmonchai
Ripple Carry Adder (RCA)
A3
Cout = C4
B3
FA
S3
A2
C3
B2
FA
S2
A1
C2
B1
FA
S1
A0
C1
B0
FA
C0 = Cin
S0
tripple  tFA(A,BCout) + (N - 2)tFA(CinCout) + tFA(CinS)
Worst Case Delay : tripple = O(N)
Slow!
2102-545 Digital ICs
Make the fastest possible carry path
Arithmetic Building Blocks
21
B.Supmonchai
Exploiting the Inversion Property
A3
Cout = C4
B3
FA
A2
C3
FA
S3


B2
S2
A1
C2
B1
FA
A0
C1
B0
FA
C0 = Cin
S1
S0
inverted cell regular cell
Now need two “flavors” of FAs
Minimizes the critical path (the carry chain) by eliminating inverters between the FAs
 Need increasing the transistor sizes on the carry chain portion
of the mirror adder.
2102-545 Digital ICs
Arithmetic Building Blocks
22
B.Supmonchai
Fast Carry Chain Design

The key to fast addition is a low latency carry network

What matters is whether in a given position a carry is

 Generated
Gi = AiBi
 Propagated
Pi = Ai  Bi (sometimes use Ai | Bi)
 Annihilated (killed)
Ki = !Ai !Bi
Giving a carry recurrence of
C i+1 = Gi + PiCi
C1 = G0 + P0C0
C2 = G1 + P1G0 + P1P0 C0
C3 = G2 + P2G1 + P2P1G0 + P2P1P0 C0
C4 = G3 + P3G2 + P3P2G1 + P3P2P1G0 + P3P2P1P0 C0
2102-545 Digital ICs
Arithmetic Building Blocks
23
B.Supmonchai
Manchester Carry Chain

Switches controlled by Gi and Pi
VDD
VDD
Pi
Pi

Gi
Co
Ci
Gi
Di
Static

Co
Ci
Pi

Components of total delay
Domino
 time to form the switch control signals Gi and Pi
 setup time for the switches
 signal propagation delay through N switches in the worst case
2102-545 Digital ICs
Arithmetic Building Blocks
24
B.Supmonchai
4-bit Sliced MCC Adder
A3 B3
A2 B2
A1 B1
A0 B0
clk
&

G P
&

G P
&

G P
&

G P
!C4
!C0
!C3
2102-545 Digital ICs
!C1
!C2




S3
S2
S1
S0
Arithmetic Building Blocks
25
B.Supmonchai
Domino MCC Circuit
3
P3
3
1
Ci,4
P2
3
2
P1
3
3
P0
3
clk
4
1
G3 2
G2 3
G1 4
G0 5
2
3
4
5
6
Ci,0
clk
G0 + P0C0
G2 + P2G1 + P2P1G0 + P2P1P0 C0
G1 + P1G0 + P1P0 C0
G3 + P3G2 + P3P2G1 + P3P2P1G0 + P3P2P1P0 C0
2102-545 Digital ICs
Arithmetic Building Blocks
26
B.Supmonchai
MCC Stick Diagram
Propagate/Generate Row
VDD
Pi
Ci - 1
Gi

Pi + 1
Gi + 1
Ci

Ci + 1
GND
Inverter/Sum Row
2102-545 Digital ICs
Arithmetic Building Blocks
27
B.Supmonchai
Notes on MCC Adder

When clock is low, the carry nodes precharge; when
clock goes high if Gi is high, Ci+1 is asserted (goes low)

To prevent Gi from affecting Ci, the signal Pi must be
computed as the xor (rather than the or) which is not a
problem since we need the xor of Ai and Bi for
computing the sum anyway

Delay is roughly proportional to n**2 (as n pass
transistors are connected in series)
 we usually limit each group to 4 stages, then buffer the carry
chain with an inverter between each group
2102-545 Digital ICs
Arithmetic Building Blocks
28
B.Supmonchai
Binary Adder Landscape
Bit-Serial
Adders
Synchronous Word
Parallel Adders
Ripple Carry Adders (RCA)
Asynchronous
Adders
Carry Prop Min Adders
t = O(N), A = O(N)
Signed-Digit
Adders
Fast Carry Prop Adders
Residue Adder
t = O(1), A = O(N)
Manchester
Carry Chain
t = O(N)
A = O(N)
2102-545 Digital ICs
Carry
Select
Parallel
Prefix
Conditional
Sum
t = O(log N)
A = O(N log N)
Arithmetic Building Blocks
Carry
Skip
t = O(N)
A = O(N)
29
B.Supmonchai
Carry-Skip (Carry-Bypass) Adder
A3
B3
C0,3
1
FA
A2
C3
B2
FA
A1
C2
B1
FA
A0
C1
B0
FA
Ci,0
Co,3
0
S3
BP = P0 P1 P2 P3
S2
S1
S0
“Block Propagate”
If (P0 & P1 & P2 & P3 = 1) then Co,3 = Ci,0 otherwise the
block itself kills or generates the carry internally
2102-545 Digital ICs
Arithmetic Building Blocks
30
B.Supmonchai
Carry-Skip Chain Implementation
block carry-out
carry-out
BP (By-Pass)
block carry-in
P3
P2
P1
P0
Cout
Cin
G3
BP
2102-545 Digital ICs
G2
G1
G0
Only 10% to 20% area overhead
Only two “gate delays” to
produce Cout if skip occurs
Arithmetic Building Blocks
31
B.Supmonchai
4-bit Block Carry-Skip Adder
bits 12 to 15
Setup
tcarry
Carry
Propagation
Sum
tsum
bits 8 to 11
bits 4 to 7
bits 0 to 3
Setup
Setup
Setup
Carry
Propagation
Carry
Propagation
Carry
Propagation
Sum
Sum
Sum
tskip
tsetup
Ci,0
Worst-case delay  carry from bit 0 to bit 15 = carry generated in bit 0,
ripples through bits 1, 2, and 3, skips the middle two groups (B is the
group size in bits), ripples in the last group from bit 12 to bit 15
tadd = tsetup + B tcarry + ((N/B) -1) tskip + B tcarry + tsum
2102-545 Digital ICs
Arithmetic Building Blocks
32
B.Supmonchai
Optimal Block Size and Time

Assuming one stage of ripple (tcarry) has the same delay
as one skip logic stage (tskip) and both are 1
tCSkA = 1 + B + (N/B-1) + B + 1
tsetup
ripple in
block 0
skips
ripple in
last block
tsum
= 2B + N/B + 1

So the optimal block size, B, is
dtCSkA/dB = 0  (N/2) = Bopt

And the optimal time is
Optimal tCSkA = 2((2N)) + 1
2102-545 Digital ICs
Arithmetic Building Blocks
33
B.Supmonchai
Variations of Carry-Skip Adders I

Variable block sized Carry-Skip Adders
 A carry that is generated in, or absorbed by, one of the inner
blocks travels a shorter distance through the skip blocks
 Hence a CSA adder can have bigger blocks for the inner
carries without increasing the overall delay
Cout
Cin
NB Blocks
tCSkA = 2B + O(NB)
2102-545 Digital ICs
Arithmetic Building Blocks
34
B.Supmonchai
Variations of Carry-Skip Adders II

Multiple Levels of Skip Logic
 CSAs with large number of bits suffer from linear carry
propagation delay time.
 Added higher levels of skip logic, a CSA can skip more blocks
at a time.
Cout
Cin
skip level 1
skip level 2
tCSkA = 2B + O(logBN)
2102-545 Digital ICs
Arithmetic Building Blocks
AND of the
first level skip
signals (BP’s)
35
B.Supmonchai
Carry-Skip Adder Comparisons
70
60
50
40
30
B=4
20
B=2
B=5
B=6
RCA
CSkA
VSkA
B=3
10
0
8 bits
2102-545 Digital ICs
16 bits
32 bits
48 bits
Arithmetic Building Blocks
64 bits
36
B.Supmonchai
Carry Select Adders
A’s


Idea: Precompute the
carry out of each block for
both carry_in = 0 and
carry_in = 1 (can be
done for all blocks in
parallel) and then select
the correct one
More cost effective than
the ripple carry adder
B’s
4-bit Setup
P’s
G’s
“0” Carry Propagation
0
“1” Carry Propagation
1
Cout
Multiplexer
Cin
C’s
Sum Generation
S’s
2102-545 Digital ICs
Arithmetic Building Blocks
37
B.Supmonchai
Carry Select Adder: Critical Path
Cout
bits 12 to 15
bits 8 to 11
A’s
A’s
B’s
B’s
bits 4 to 7
A’s
B’s
bits 0 to 3
A’s
B’s
Setup
Setup
Setup
Setup
P’s G’s
P’s G’s
P’s G’s
P’s G’s
“0” carry
“0” carry
“0” carry
“0” carry
“1” carry
“1” carry
“1” carry
“1” carry
Mux
Mux
Mux
Mux
C’s
C’s
C’s
C’s
Sum Gen
Sum Gen
Sum Gen
Sum Gen
S’s
S’s
S’s
S’s
Cin
tadd = tsetup + B tcarry + (N/B) tmux + tsum
2102-545 Digital ICs
Arithmetic Building Blocks
38
B.Supmonchai
Square Root Carry Select Adders
Bit 0-1
Bit 2-4
Setup
Setup
Bit 5-8
Bit 9-13
Setup
Setup
Bit 14-19
(1)
"0" Carry
"0"
"0"
"0" Carry
"0"
"0" Carry
"0"
"0" Carry
(1)
"1" Carry
"1" Carry
"1"
(3)
"1" Carry
"1"
(3)
"1"
(4)
(5)
(4)
Multiplexer
"1" Carry
"1"
(5)
Multiplexer
(6)
Multiplexer
(7)
(6)
(7)
Multiplexer
Mux
Ci,0
(8)
Sum Generation
S0-1
Sum Generation
Sum Generation
S2-4
S5-8
Sum Generation
S9-13
Sum
S14-19 (9)
Balance Delay - Making later block bigger
tadd = tsetup + 2 tcarry + √N tmux + tsum
2102-545 Digital ICs
Arithmetic Building Blocks
39
B.Supmonchai
Adder Delays - Comparison
50
Ripple adder
tp (in unit delays)
40
30
Linear select
20
10
0
Square root select
0
20
40
60
N
2102-545 Digital ICs
Arithmetic Building Blocks
40
B.Supmonchai
LookAhead - Basic Idea
A0, B0
A1, B1
•••
AN-1, BN-1
Carry Network
Ci,0
P0 Ci,1
S0
P1
S1
Ci, N-1
•••
PN-1
SN-1
Co,k = f(Ak, Bk,Co,k-1) = Gk + PkCo,k-1
2102-545 Digital ICs
Arithmetic Building Blocks
41
B.Supmonchai
Look-Ahead: Topology
VDD
By expanding carry generation
all the way:
G3
G2
C1 = G0 + P0C0
G1
C2 = G1 + P1G0 + P1P0 C0
G0
C3 = G2 + P2G1 + P2P1G0 + P2P1P0 C0
C4 = G3 + P3G2 + P3P2G1 + P3P2P1G0
+ P3P2P1P0 C0
…
Ci,0
Co,3
P0
P1
P2
P3
2102-545 Digital ICs
Arithmetic Building Blocks
42
B.Supmonchai
Logarithmic Look-Ahead Adder
F
A0
A1
A2
A3
A4
A5
A6
A7
tp N
A0
A1
A2
A3
F
A4
A5
tp log2(N)
A6
A7
2102-545 Digital ICs
Arithmetic Building Blocks
43
B.Supmonchai
Parallel Prefix Adders (PPAs)

Define carry operator € on (G,P) signal pairs
(G’’,P’’) (G’,P’)
G’’
where
G = G’’ + P’’G’
P = P’’P’
€
(G,P)
G’
!G
P’’
 € is associative, i.e.,
[(g’’’,p’’’) € (g’’,p’’)] € (g’,p’) = (g’’’,p’’’) € [(g’’,p’’) € (g’,p’)]
€
€
€
2102-545 Digital ICs
€
Arithmetic Building Blocks
44
B.Supmonchai
PPA General Structure

Given P and G terms for each bit position, computing all the carries
is equal to finding all the prefixes in parallel
(G0,P0) € (G1,P1) € (G2,P2) € … € (GN-2,PN-2) € (GN-1,PN-1)

Since € is associative, we can group them in any order
 but note that it is not commutative
Pi, Gi logic (1 unit delay)
Ci parallel prefix logic tree
(1 unit delay per level)
Si logic (1 unit delay)
2102-545 Digital ICs

Measures to consider

number of € cells

tree cell depth (time)

tree cell area

cell fan-in and fan-out

max wiring length

wiring congestion

delay path variation (glitching)
Arithmetic Building Blocks
45
B.Supmonchai
Brent-Kung PPA
G15 G14 G13 G12
p15 p14 p13 P12
€
€
G11 G10 G9
p11 P10 p9
€
€
G8 G7
P8 P7
€
G6 G5
P6 P5
€
€
G4 G3
P4 P3
€
€
€
€
G2
p2
G1
P1
G0
P0 C
in
€
€
C2
C1
€
€
€
€
€
€
€
€
€
€
€
C16 C15 C14 C13 C12 C11 C10 C9
2102-545 Digital ICs
€
C8 C7
Arithmetic Building Blocks
€
C6 C5
€
C4 C3
46
B.Supmonchai
Kogge-Stone PPF Adder
G15 G14 G13 G12
P15 P14 P13 P12
G11 G10 G9 G8 G7 G6
P11 P10 P9 P8 P7 P6
G5 G4 G3
P5 P4 P3
G2 G1 G0
P2 P1 P0 C
in
€
€
€
€
€
€
€
€
€
€
€
€
€
€
€
€
€
€
€
€
€
€
€
€
€
€
€
€
€
€
€
€
€
€
€
€
€
€
€
€
€
€
€
€
€
€
€
€
C8
C7
C6
C5 C4
C16 C15 C14 C13 C12 C11 C10 C9
2102-545 Digital ICs
Arithmetic Building Blocks
C3
€
€
C2
C1
47
B.Supmonchai
More Adder Comparisons
70
60
50
RCA
CSkA
VSkA
KS PPA
40
30
20
10
0
8 bits
2102-545 Digital ICs
16 bits
32 bits
48 bits
Arithmetic Building Blocks
64 bits
48
B.Supmonchai
Adder Speed Comparisons
70
60
RCA
MCC
CCSkA
VCSkA
CCSlA
B&K
50
40
30
20
10
16 bits
2102-545 Digital ICs
32 bits
Arithmetic Building Blocks
64 bits
49
B.Supmonchai
Adder Average Power Comparisons
35
30
25
RCA
MCC
CCSkA
VCSkA
CCSlA
B&K
20
15
10
5
0
16 bits
2102-545 Digital ICs
32 bits
Arithmetic Building Blocks
64 bits
50
B.Supmonchai
Binary Multiplication - Basics

Given two unsigned binary numbers X (M bits)
and Y (N bits)
N 1
M 1
X   Xi2
i
Y  Y j 2 j
j 0
i 0
where Xi, Yj  {0, 1}


The

multiplication operation
Z = X  Y is
M 1
N 1
 M 1N 1



k
i
j
i j
 Z k 2   X i 2 Y j 2   X i Y j 2 
i 0
j 0
k 0
 i 0 j 0

M N 1
2102-545 Digital ICs
Arithmetic Building Blocks
52
B.Supmonchai
Binary Multiplication Operation

Binary Multiplication as repeated additions
N
M
1 0 1 0 1 0 multiplicand
1 0 1 1 multiplier
N
1
1 0
0 0 0
1 0 1 0
0
1
0
1
1 0 1 0
partial
0 1 0
product
0 0
array
0
can be formed in parallel
1 1 1 0 0 1 1 1 0 double precision product
2N
2102-545 Digital ICs
Arithmetic Building Blocks
53
B.Supmonchai
Shift-and-Add Multiplication

Right Shift and Add (N bits  N bits)
N
Multiplicand
“0”
N
N
1
0
N
N
N
N-bit Adder
Multiplier
*Left shift requires
2n-bit adder
N+1
Bit out
tshift&add_mult = O(N · tadder) = O(N2) for an RCA
2102-545 Digital ICs
Arithmetic Building Blocks
54
B.Supmonchai
Improving Multipliers

Making them faster (therefore, bigger area)
 Use faster adders
 Use higher radix (e.g., base 4) multiplication
 Use multiplier recoding to simplify multiple formation
 Form partial product array in parallel and add it in parallel

Making them smaller (i.e., slower)
 Use array multipliers
 Very regular structure with only short wires to nearest neighbor
cells. Thus, very simple and efficient layout in VLSI
 Can be easily and efficiently pipelined
2102-545 Digital ICs
Arithmetic Building Blocks
55
B.Supmonchai
multiple
forming
circuits
PP Generation
partial
product
array
reduction
tree
PP Accumulation
fast carry
propagate
adder
(CPA)
Final
Addition
Array (or Tree) Multiplier Structure
2102-545 Digital ICs
0 D
Q (‘ier)
0 D
0 D
0 D (‘icand)
mux
+
reduction
tree (log N)
+
CPA (log N)
P (product)
Arithmetic Building Blocks
56
B.Supmonchai
Partial Product (PP) Generation

Each row in the partial-product array is either a copy of
the multiplicand or a row of zeros
X7
X6
X5
X4
X3
X2
X1
X0
Yi
PP7

PP6
PP5
PP4
PP3
PP2
PP1
PP0
Careful optimization of the PP generation can lead to
some substantial delay and area reduction.
 Booth’s and modified Booth’s recording
2102-545 Digital ICs
Arithmetic Building Blocks
57
B.Supmonchai
Array Multiplier Implementation
HA: Half Adder
FA: Full Adder
CP: Critical Path
X3
X3
X2
X1
X0
Y0
Y1 Z0
X2
X1
X0
HA
FA
FA
HA
X3
X2
X1
X0
FA
FA
FA
HA
X3
X2
X1
X0
FA
FA
FA
CP1
Z7
Z6
Z5
Z4
CP2
Y3
Y2
Z2
Z1
HW for One
Partial Product
HA
Z3
* Assume tadd = tcarry
tarray_mult = [(M -1)+(N - 2)] tcarry + (N - 1) tsum + tand = O(N)
2102-545 Digital ICs
Arithmetic Building Blocks
58
B.Supmonchai
Carry-Save Multiplier


The idea is to “save” the (PP) carry and add it in the
next adder stage
In the final addition a fast carry-propagate (e.g., carrylookahead) adder is used.
HA
HA
HA
HA
HA
FA
FA
FA
HA
FA
FA
FA
FA
FA
HA
Vector Merging Adder
HA
6 HAs
6 FAs
Unique and
Shorter CP
tCSM = (N - 1) tcarry + tmerge + tand = O(N)
2102-545 Digital ICs
Arithmetic Building Blocks
59
B.Supmonchai
CSM Floorplan
X3
X2
X1
X0
Y0
Y1
C S
C S
C S
C S
HA Multiplier Cell
Z0
FA Multiplier Cell
Y2
Y3
C S
C S
C S
C S
C S
C S
C S
C S
C
C
C
C
Z7
2102-545 Digital ICs
S
Z6
S
Z5
S
Z4
Z1
Vector Merging Cell
Z2
X and Y signals are broadcasted
through the complete array.
(
)
S
Z3
Arithmetic Building Blocks
Regularity makes the
generation of structure
amenable to automation
60
B.Supmonchai
Wallace-Tree Multiplier
Partial Products
First Stage
Bit
6 5 4 3 2 1 0
Position
Rearranging PPs
6 5 4 3 2 1 0
HA
Second Stage
Final Adder
6 5 4 3 2 1 0
6 5 4 3 2 1 0
FA
HA
Any Types of adder
can be used
GOAL: Minimize depth (# of stages) with min. no. of adder elements
2102-545 Digital ICs
Arithmetic Building Blocks
61
B.Supmonchai
Wallace-Tree Multiplier Implementation
HA
3 HAs and 3 FAs for the reduction process (stage 1 + stage 2)
Any type of adder can be used for the final adder
2102-545 Digital ICs
Arithmetic Building Blocks
62
B.Supmonchai
Notes on Wallace-Tree Multiplier

Wallace tree substantially saves hardware for large
multipliers
 Number of partial products is reduced by two-thirds per stage

The propagation delay is found to be bound,
tWTM = O(log 3/2 (N))

Although substantially faster than CSM, WTM structure
is very irregular
 Difficulty in finding efficient VLSI layout

Many of today’s high performance multipliers use higher
order (e.g. 4-2) compressors in stead of 3-2 compressors
(FAs)
2102-545 Digital ICs
Arithmetic Building Blocks
63
B.Supmonchai
Parallel Programmable Shifters

Shifting a data word left or right over a constant amount
is a trivial hardware operation and is implemented by
the appropriate signal wiring

Shifters are used in multipliers, floating point units
Control
=
Shift amount
Shift direction
Shift type (logical,
arith, circular)
Shifter
Consume lots of area if done in random logic gates
2102-545 Digital ICs
Arithmetic Building Blocks
64
B.Supmonchai
A Programmable Binary Shifter
Right
nop
Left
Exactly one
signal is active
Ai
Bi
Ai-1
Bi-1
Bit-Slice i
...
Ai
Ai-1
right
nop
left
Bi
Bi-1
A1
A0
0
1
0
A1
A0
A1
A0
1
0
0
0
A1
A1
A0
0
0
1
A0
0
2102-545 Digital ICs
Arithmetic Building Blocks
65
B.Supmonchai
4-bit Barrel Shifter
A3
B3
Sh1
A2
B2
Example: Sh0 = 1
B3B2B1B0 = A3A2A1A0
Sh1 = 1
B3B2B1B0 = A3A3A2A1
Sh2 = 1
B3B2B1B0 = A3A3A3A2
Sh2
A1
B1
Sh3
Sh3 = 1
B3B2B1B0 = A3A3A3A3
Arithmetic shift
A0
B0
Sh0
2102-545 Digital ICs
Sh1
Sh2
Sh3
Area dominated by wiring
Arithmetic Building Blocks
67
B.Supmonchai
Notes on Barrel Shifter

Note that signal goes through at most one FET (so
constant propagation delay (in theory))

Also note, that the FET diffusion capacitance on an
output wire increases linearly with the shift width but
the FET diffusion capacitance on the input data lines
increases quadratically (i.e., N2 for circular shifter)

Size of cell is bounded by the pitch of the metal wires.

A decoder is usually needed for shift control signals
since the amount of shift are normally given in
(encoded) binary number.
2102-545 Digital ICs
Arithmetic Building Blocks
68
B.Supmonchai
4-bit Barrel Shifter Layout
Widthbarrel
A3
A2
A1
A0
Sh0
Sh1
Sh2
Sh3
Buffer
Widthbarrel ~ 2 pm N
N = max shift distance, pm = metal pitch
2102-545 Digital ICs
Arithmetic Building Blocks
69
B.Supmonchai
8-bit Logarithmic Shifter
0
1
Sh1 !Sh1
1
0
Sh2 !Sh2
0
1
Sh3 !Sh3
A3
B3
A2
B2
A1
B1
A0
B0
log N stages
2102-545 Digital ICs
Arithmetic Building Blocks
71
B.Supmonchai
8-bit Logarithmic Shifter Layout Slice
1
2
4
A3
B3
A2
B2
A1
B1
A0
B0
Widthlog ~ pm(2K+(1+2+…+2K-1)) = pm(2K+2K-1)
K = log2 N
2102-545 Digital ICs
Arithmetic Building Blocks
72
B.Supmonchai
Shifter Implementation Comparisons
Barrel
Logarithmic
Width
Speed
Width
Speed
N
K
2 N pm
1 + N diffs
pm(2K+2K-1)
K + 2 diffs
8
3
16 pm
1+8
13 pm
3+2
16
4
32 pm
1 + 16
23 pm
4+2
32
5
64 pm
1 + 32
41 pm
5+2
64
6
128 pm
1 + 64
75 pm
6+2

Barrel Shifter is better for small shifters (faster, not much bigger)
while Log Shifter is preferred for larger shifters.
 Log Shifters are always smaller

For large shifter we may have to start worrying about the number
of pass transistors in series.
2102-545 Digital ICs
Arithmetic Building Blocks
73
B.Supmonchai
Decoders

Decodes inputs to activate one of many outputs
Enable
In0
In1
2-to-4
Decoder
Out0 = In0 In1
Out1 = In0 In1
Out2 = In0 In1
Out3 = In0 In1

Cost of 2-to-4 Decoder
 two inverters, four 2-input NAND gates, four
inverters plus enable logic
 how about cost for a 3-to-8, 4-to-16, etc. decoder?
2102-545 Digital ICs
Arithmetic Building Blocks
74
B.Supmonchai
Dynamic NOR Decoder
Vdd
GND
GND
on
on
B3 1  0
on
B2 1  0
on
B1 1  0
B0 1  1
precharge
0 1
A0
A0
A1
A1
0
1
0
1
Active HIGH Outputs
Capacitance of the output wires increases linearly with the decoder size
2102-545 Digital ICs
Arithmetic Building Blocks
76
B.Supmonchai
Dynamic NAND Decoder
GND
Active LOW Outputs
B3 1  1
B2 1  1
B1 1  1
on
2102-545 Digital ICs
on
A0
A0
A1
A1
0
1
0
1
Arithmetic Building Blocks
B0 1  0
precharge
0 1
78
B.Supmonchai
Notes on Dynamic Decoders

In Dynamic NOR decoder signal goes through at most
one FET
 So constant propagation delay (in theory)
 However, some output wires may have two or more parallel
paths to GND - effectively shortening the transition time

On the contrary, signal in dynamic NAND decoder pass
through a series of FET
 The number of FETs rises linearly with the decoder size
 Thus it will be slower than the NOR implementation if the
gate capacitance dominates diffusion capacitance

For the NAND decoder all the input signals must be low
during precharge else Vdd and GND will be connected!
2102-545 Digital ICs
Arithmetic Building Blocks
79
B.Supmonchai
Building Bigger Decoders
Active low enable, Active low output
101
enable
2x4
.
.
.
2x4
1x2
2x4
2x4
A4
0
A3 A2
0 0
A1 A0
0 1
Need to catch the output that goes to zero before it precharges again
2102-545 Digital ICs
Arithmetic Building Blocks
80
B.Supmonchai
Layout of Bit-Sliced Datapaths
Must have enough
drive capacity to
handle large fan-out
Sized for peak
current
Horizontal gap for
feeding signals to the
cells downstream
2102-545 Digital ICs
Arithmetic Building Blocks
81
B.Supmonchai
Optimizing Bit-sliced Datapaths
Without feedthroughs or
pitch matching (4.2m2)
2102-545 Digital ICs
With feedthroughs
(3.2m2)
Arithmetic Building Blocks
With feedthroughs and
pitch matching (2.2m2)
82