Transcript Document
Presentation 5
MAD MAC 525
W2
Farhan Mohamed Ali (W2-1)
Jigar Vora (W2-2)
Sonali Kapoor (W2-3)
Avni Jhunjhunwala (W2-4)
Design Manager: Zack Menegakis
22nd February, 2006
Top Level Integration
Project Objective:
Design a crucial part of a GPU called the Multiply
Accumulate Unit (MAC) which will revolutionize graphics.
MAD MAC 525 Status:
Project chosen
Specifications defined
Architecture
Design
Behavioral Verilog
Testbenches
Verilog : Gate Level Design
Floor plan
Schematic
To be done
Layout (started)
Extraction, LVS, post-layout simulation
Recap - MAD MAC 525
• Multiply Add (MAD) / Multiply Accumulate Unit (MAC)
• Executes the function AB+C on 16-bit floating-point inputs
• Multiplies and adds in parallel to greatly speed up operation
• Rounding is performed only once, giving greater accuracy than separate multiply and add functions
• One circuit to rule them all!
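The accuracy benefit of rounding once can be illustrated with a small software model. This is only a sketch: it assumes the 16-bit format is IEEE 754 half precision, uses Python's `struct` format `"e"` to round to that format, and the function names are hypothetical. The fused path refuses inputs for which the double-precision intermediate would not be exact, so that its single conversion really is the only rounding.

```python
import struct
from fractions import Fraction

def to_half(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def separate_mul_add(a: float, b: float, c: float) -> float:
    """Multiply, round to half, then add and round again (two roundings)."""
    return to_half(to_half(a * b) + c)

def fused_mul_add(a: float, b: float, c: float) -> float:
    """Compute a*b + c and round to half only once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    wide = a * b + c  # double-precision intermediate
    if Fraction(wide) != exact:
        # Guard: outside this case the model would round twice.
        raise ValueError("double-precision intermediate is not exact")
    return to_half(wide)

# Inputs chosen so the product needs more than 10 fraction bits.
a = b = to_half(1 + 2**-10)
c = to_half(-(1 + 2**-9))

print(separate_mul_add(a, b, c))  # 0.0
print(fused_mul_add(a, b, c))     # 9.5367431640625e-07
```

With separate operations the low-order product bit is rounded away before the add, so the result collapses to zero; the single-rounding path keeps it.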
Block Diagram
[Figure: block diagram. Blocks shown: Input (x3), RegArray A, RegArray B, RegArray C, Exp Calc, Multiplier, Align, Adder/Subtractor, Control Logic & Sign Determination, Leading 0 Anticipator, Normalize, Round, Reg Y, Output. Bus-width labels: 16, 5, 10, 22, 35, 14, 36, 4, 1.]
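The 16-bit operands in the block diagram split into 5-bit and 10-bit buses, which matches the IEEE 754 half-precision layout (1 sign bit, 5 exponent bits, 10 fraction bits). Assuming that format, a minimal sketch of the field split looks like this; `half_fields` is a hypothetical helper name, not part of the design.

```python
import struct

def half_fields(x: float) -> tuple[int, int, int]:
    """Return (sign, biased exponent, fraction) of x encoded as binary16."""
    (bits,) = struct.unpack("<H", struct.pack("<e", x))
    sign = bits >> 15
    exponent = (bits >> 10) & 0x1F  # 5-bit field, bias 15
    fraction = bits & 0x3FF         # 10-bit field, leading 1 is implicit
    return sign, exponent, fraction

print(half_fields(1.5))   # (0, 15, 512)
print(half_fields(-2.0))  # (1, 16, 0)
```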
Floorplan
[Figure: floorplan. Regions shown: Reg A, Reg B, Reg C, Exp Calc, Multiplier, Align C, Pipeline Reg, Ld Zero, Pipeline Reg, Adder, Pipeline Reg, Round, Normalize, Reg Y.]
Design Decisions
• Adder – Variable length carry select adder
• Registers – Pulsed Latches
• Pass logic in shifters
Adder Schematic – Carry Select
• Variable-length carry-select adder
• Very regular – good compromise between speed and ease of layout
• 2.5 ns delay through 37 bits
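As a behavioral illustration of the carry-select idea, the sketch below adds two 37-bit values block by block: each block computes its sum for carry-in 0 and carry-in 1, and the incoming carry selects one of them. The block sizes are a hypothetical choice that happens to total 37 bits; the actual partitioning used in the design is not stated here.

```python
def carry_select_add(a: int, b: int, width: int = 37,
                     block_sizes=(2, 3, 4, 5, 6, 7, 10)):
    """Return (sum, carry_out) of two width-bit unsigned integers."""
    assert sum(block_sizes) == width, "block sizes must cover the full width"
    result, carry, shift = 0, 0, 0
    for size in block_sizes:
        mask = (1 << size) - 1
        a_blk = (a >> shift) & mask
        b_blk = (b >> shift) & mask
        sum0 = a_blk + b_blk       # speculative result for carry-in = 0
        sum1 = a_blk + b_blk + 1   # speculative result for carry-in = 1
        chosen = sum1 if carry else sum0  # the select mux
        result |= (chosen & mask) << shift
        carry = chosen >> size     # carry-out feeds the next block's select
        shift += size
    return result, carry

print(carry_select_add(5, 9))          # (14, 0)
print(carry_select_add(2**37 - 1, 1))  # (0, 1)
```

In hardware both speculative sums are ready before the carry arrives, so later blocks can be made longer without lengthening the critical path, which is the motivation for variable block lengths.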
Adder Schematic – 1 bit Carry Select
Pulsed Latch
• Advantage – Practically eliminates setup time
• 120 ps clock-to-~Q delay (146 ps loaded)
• 16 transistors
• Simplified version of those used in the Pentium 4
• Sizing does not seem to affect speed under load
[Figure: clock pulse generator]
More Pass Logic
• Compared different kinds of pass logic for
shifters
• Transmission gates with buffers are the fastest
Mux Type                        Propagation Delay (worst case)
N-pass (Align)                  78.32 ps
Transmission gate (Normalize)   50.5 ps
NAND                            81.22 ps
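To show where these muxes sit in a shifter, here is a behavioral sketch of a logarithmic right shifter built from 2:1 mux stages. It assumes the Align and Normalize shifters are organized as log shifters, which the slides do not confirm; `mux2` and `log_shift_right` are hypothetical names and the 16-bit width is only for illustration.

```python
def mux2(select: int, in0: int, in1: int) -> int:
    """2:1 multiplexer: pass in1 when select is 1, otherwise in0."""
    return in1 if select else in0

def log_shift_right(value: int, amount: int, width: int = 16) -> int:
    """Shift right through mux stages of 1, 2, 4, ... positions.

    Assumes 0 <= amount < width; higher bits of amount are ignored.
    """
    value &= (1 << width) - 1
    stage = 0
    while (1 << stage) < width:
        select = (amount >> stage) & 1  # one bit of the shift amount per stage
        value = mux2(select, value, value >> (1 << stage))
        stage += 1
    return value

print(bin(log_shift_right(0b1011_0000_0000_0000, 5)))  # 0b10110000000
```

Every bit passes through one mux per stage, so the per-mux propagation delay in the table is multiplied by the number of stages on the shifter's critical path.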
Block        Transistor Count   Area (um2)   Prop. Delay   Power in mW (350 MHz)
Multiplier   3500               25000        4.43 ns       8.86
Exponents    700                5000         942 ps        1.608
Align        530                3800         480 ps        1.031
Adder        3700               26500        3.24 ns       4.58
Leading 0    350                2500         2.05 ns       0.232
Normalize    900                6500         430 ps        -
Round        300                2000         1.81 ns       0.198
Registers    1800               9000         120 ps        -
Total        11780              80300        -             2.291
Design Goals – On target
• At least 300 MHz – 600 MFLOPS
• Will be achievable through optimization and pipelining
• Pipeline stages not fully determined – 6 stages expected
• Multiplier will be pipelined to cut its delay in half
• All other individual blocks can clock at ~500 MHz
• A faster adder is being developed. Unlike the multiplier, it is not easily pipelined, so the speed of this block will be the limiting factor for the entire circuit
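The arithmetic behind these goals can be checked with the per-block propagation delays reported earlier. The sketch below assumes that pipelining halves the multiplier delay exactly, that each MAC result counts as two floating-point operations (one multiply, one add), and that register overhead is ignored.

```python
# Propagation delays in ns, taken from the per-block figures reported earlier.
delay_ns = {
    "Multiplier": 4.43, "Exponents": 0.942, "Align": 0.480, "Adder": 3.24,
    "Leading 0": 2.05, "Normalize": 0.430, "Round": 1.81,
}

# Assumption: splitting the multiplier into two pipeline stages halves its delay.
delay_ns["Multiplier"] /= 2

slowest = max(delay_ns, key=delay_ns.get)
f_max_mhz = 1000 / delay_ns[slowest]  # 1 / ns = GHz, so scale to MHz

FLOPS_PER_CYCLE = 2  # one multiply plus one add per MAC result
target_mhz = 300

print(slowest, round(f_max_mhz, 1))   # Adder 308.6
print(target_mhz * FLOPS_PER_CYCLE)   # 600
```

Under these assumptions the adder sets the clock limit at just above the 300 MHz target, which is why a faster adder matters more than further multiplier work.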
Top Level Schematic
Simulations: Normalize
Simulations: Align
Simulations: Multiplier
Problems
• Verilog simulation of the circuit produced don’t-cares (X values) after switching to the new, improved pass logic. Analog simulations work just fine.
• Pass logic can be evil if done wrong. The multiplier initially ran at only 50 MHz due to transmission-gate XORs. Buffers solved the problem.
Questions?