Pipeline Optimization for Asynchronous Circuit
A Channel-Based Asynchronous Low-Power High-Performance Standard
Cell-Based Sequential Decoder
Implemented with QDI Templates
Recep Ö. Özdağ & Peter A. Beerel
University of Southern California
Motivation and Approach
Background
Fine-grain asynchronous pipelines have demonstrated high performance in largely full-custom back-end flows
• Caltech’s MIPS R3000 Microprocessor [Martin97]
• Fulcrum’s PivotPoint High Performance Switch [HotChips03]
Problem
However, full-custom flows are tedious, error-prone, and time-consuming, and often
require significant in-house tool automation
Our Solution
Create asynchronous cell library
Integrate cell library into commercial P&R flow using Verilog modelling
Evaluate on a real design
• Target a digital communication chip implementing the Fano algorithm
Our Goal: Close to Full-Custom Performance with ASIC Design Times
USC Asynchronous Group
Channel Based Asynchronous Design
Dual-Rail Channel
• Two wires per data bit
• One acknowledgment wire
• Generalizes to 1-of-N coding
• Advantage: delay-insensitive communication
[Figure: a Synchronous System, whose blocks share a clock, next to an Asynchronous System, in which a Sender and a Receiver are connected by an asynchronous channel carrying Data and Ack]
Synchronization and communication between blocks
implemented with handshaking using asynchronous channels by
sending/receiving “data tokens”
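The dual-rail and 1-of-N codes listed on this slide can be sketched in a few lines of Python. This is only an illustration: the rail ordering, the function names, and the four-phase (return-to-neutral) handshake sequence are assumptions made for the example, not details taken from the slides.

```python
from typing import List, Optional, Tuple

Rails = Tuple[int, int]      # (rail0, rail1): rail0 high means 0, rail1 high means 1 (assumed order)
NEUTRAL: Rails = (0, 0)      # no token present on the wires


def encode_bit(bit: int) -> Rails:
    """Dual-rail encode one data bit onto two wires."""
    return (0, 1) if bit else (1, 0)


def decode_bit(rails: Rails) -> Optional[int]:
    """Return the bit value, or None while the wires are neutral."""
    if rails == NEUTRAL:
        return None
    if rails == (1, 1):
        raise ValueError("both rails high is not a legal dual-rail code")
    return rails[1]


def encode_1_of_n(value: int, n: int) -> Tuple[int, ...]:
    """1-of-N code: exactly one of n wires is high."""
    if not 0 <= value < n:
        raise ValueError("value out of range")
    return tuple(1 if i == value else 0 for i in range(n))


def four_phase_transfer(bit: int) -> List[Tuple[str, Rails, int]]:
    """Trace (step, data rails, ack) for one token, assuming a four-phase protocol."""
    data = encode_bit(bit)
    return [
        ("sender drives data", data, 0),
        ("receiver raises ack", data, 1),
        ("sender returns to neutral", NEUTRAL, 1),
        ("receiver lowers ack", NEUTRAL, 0),
    ]
```

Because validity is carried by the data wires themselves, the receiver needs no clock to know when a token has arrived; it only has to watch for a rail going high.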
Channel-Based Design
Characteristics
Architecture is typically a multi-level hierarchy of communicating blocks
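[Figure: an ASIC drawn as a hierarchy of communicating blocks connected by channels. Block labels: Main FSM, Memory, Register Bank, Reg A, Reg B, Reg C, Adder, Multiplier, Subtract/Divider, Adder/Mult. Leaf cell labels: BN-1, BN-2, BN-3 and FAN-1, FAN-2, FAN-3, FA0]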
Netlist consists of leaf cells communicating along channels
Asynchronous Leaf Cells
Definition
Smallest block that communicates via asynchronous channels
Functionality
Reads a subset of input channels
Computes F and writes to a subset of output channels
Linear Pipelines
Only one input and one output channel
Non-Linear Pipelines
Joins and Forks
Conditional Joins: Read only some of the input channels
Conditional Splits: Write only to some of the output channels
[Figure: a leaf cell L with its input channels and output channels, shown as a linear pipeline stage, a join, and a fork]
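The conditional join and conditional split behavior described on this slide can be modeled at the token level. The sketch below is a behavioral illustration only: the `Channel` class, the use of a separate control channel to select the input or output, and the function names are all assumptions made for the example.

```python
from collections import deque


class Channel:
    """Token FIFO standing in for one asynchronous channel (hypothetical model)."""

    def __init__(self, tokens=()):
        self._tokens = deque(tokens)

    def send(self, token):
        self._tokens.append(token)

    def recv(self):
        return self._tokens.popleft()

    def __len__(self):
        return len(self._tokens)


def conditional_split(ctrl, data, outputs):
    """Read one control and one data token; write the data to only one output."""
    select = ctrl.recv()
    outputs[select].send(data.recv())


def conditional_join(ctrl, inputs, out):
    """Read one control token, then read only the selected input channel."""
    select = ctrl.recv()
    out.send(inputs[select].recv())
```

In this model the unselected channels are left untouched: a token waiting on an input that the join did not read stays there for a later operation.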
Template-Based Leaf-Cell Design
• Each pipeline style (QDI, timed…) has a different blueprint
• Create a library using a blueprint to implement the lowest level
communicating blocks
[Figure: blueprint for a QDI N-input M-output pipeline stage, built from left completion detectors (LCD), right completion detectors (RCD), a function block (F), and C-elements (C). Two instances are shown: a 2-input 1-output pipeline stage and a 1-input 2-output pipeline stage]
Generation of instances from templates is straightforward
Background: Caltech’s QDI Templates
Precharged Half Buffer (PCHB) [Lines96]
1-of-N Rail Channels
• Delay-insensitive communication
Quasi-delay-insensitive design
• Negligible timing assumptions
Dynamic Logic Function Block
Left and right completion detection
[Figure: Completion Detector, with one OR gate per bit (bit0, bit1, ..., bitn) feeding a C-element (C) that produces Done]
[Figure: Function Block, an nmos network with precharge control and evaluation control, between the left channel L and the right channel R]
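The completion detector on this slide (one OR gate per bit feeding a C-element) can be expressed as a small behavioral model. This is a sketch under assumptions: the class and function names are invented for the example, and the C-element is modeled in the usual way (output rises when all inputs are high, falls when all are low, and otherwise holds).

```python
from typing import Iterable, Tuple


class CElement:
    """Muller C-element: output follows the inputs only when they all agree."""

    def __init__(self, initial: int = 0):
        self.out = initial

    def update(self, inputs: Iterable[int]) -> int:
        values = list(inputs)
        if all(values):
            self.out = 1
        elif not any(values):
            self.out = 0
        # Otherwise the inputs disagree and the previous output is held.
        return self.out


def completion_detect(c: CElement, rails: Iterable[Tuple[int, int]]) -> int:
    """OR the two rails of every bit, then combine the per-bit results in a C-element."""
    return c.update(r0 | r1 for r0, r1 in rails)
```

The hold behavior is what makes `Done` rise only after every bit is valid and fall only after every bit has returned to neutral.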
PCHB Performance Analysis
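[Figure: three PCHB pipeline stages with function blocks F1, F2, F3, each with a left completion detector (LCD), a right completion detector (RCD), and a C-element (C)]
Cycle time: $T = 3\,t_{Eval} + 2\,t_{CD} + 2\,t_{C} + t_{Prech}$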
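Reading the cycle-time expression on this slide as 3 t_Eval + 2 t_CD + 2 t_C + t_Prech, a short Python sketch shows how the individual delays add up. The function names and the delay values used in the usage note are hypothetical, chosen only to make the arithmetic visible; they are not measurements from the design.

```python
def pchb_cycle_time(t_eval: float, t_cd: float, t_c: float, t_prech: float) -> float:
    """Cycle time T = 3*t_eval + 2*t_cd + 2*t_c + t_prech (all in the same unit)."""
    return 3 * t_eval + 2 * t_cd + 2 * t_c + t_prech


def throughput_mhz(cycle_time_ns: float) -> float:
    """Convert a cycle time in nanoseconds to a throughput in MHz."""
    return 1000.0 / cycle_time_ns
```

For example, with assumed delays of 2, 2, 1, and 2 gate delays for evaluation, completion detection, the C-element, and precharge, `pchb_cycle_time(2, 2, 1, 2)` gives 14 gate delays. The evaluation term is weighted most heavily, which is one reason a small forward latency per stage matters for throughput.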
2-D Pipelining: The key to high-throughput [MiniMips97]
Small forward latency per stage (as little as 2 gate delays)
Smaller completion detection units, reduces control overhead
Only local communication between blocks
[Figure: 2-D array of leaf cells L11, L21, L31, L12, L22, L32]
Outline
• Background
Illustration of the Fano Algorithm
The base-line synchronous Fano design
• The Asynchronous Fano Design
• The Back-End Asynchronous Design Flow
• Summary of Contributions
Background on Fano Algorithm
• Fano algorithm is a depth first tree-search algorithm [Fano64]
• Achieves good performance with a low average complexity
[Figure: Fano tree-search example. Starting at the root, each branch is labeled with its branch bits (00, 01, 10, 11) and a branch metric of +3, -5, or -10, with the annotations "0 errors" and "1 error". A running Total Path Metric is shown as the search advances (values such as +1, -2, and 0). Rows under the tree: Received Branch Bits, Decoded Bit Index, Decoded bit. Annotations: "Estimate that transmitted a 1" and "Estimate that transmitted a 0"]
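The branch and path metrics in the tree example can be reproduced with a short Python sketch. The mapping from the number of bit errors to a metric (+3 for 0 errors, -5 for 1 error, -10 for 2 errors) is an assumption read off the figure labels, and the function names are invented for the example; this is not the full Fano algorithm, only the metric bookkeeping.

```python
from typing import Sequence

# Assumed mapping, inferred from the figure: bit errors in a 2-bit branch -> metric.
BRANCH_METRIC = {0: +3, 1: -5, 2: -10}


def bit_errors(received: str, expected: str) -> int:
    """Count positions where the received branch bits differ from the expected ones."""
    return sum(r != e for r, e in zip(received, expected))


def branch_metric(received: str, expected: str) -> int:
    """Metric of a single branch under the assumed mapping."""
    return BRANCH_METRIC[bit_errors(received, expected)]


def path_metric(received: Sequence[str], expected: Sequence[str]) -> int:
    """Total path metric: the sum of the branch metrics along the path."""
    return sum(branch_metric(r, e) for r, e in zip(received, expected))
```

A path with one clean branch followed by a single-bit error, for instance `path_metric(["01", "10"], ["01", "11"])`, accumulates +3 and then -5. The decoder compares this running total against a threshold to decide whether to keep moving forward or to back up.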
The Synchronous Architecture [Asilomar99]
Critical path consists of 2 ALUs and 2 MUXes
Outline
• Introduction and Background
• The Asynchronous Fano Design
• The Back-End Asynchronous Design Flow
• Summary of Contributions
The Asynchronous Fano
At typical SNR most of the branches will be error free
Key idea: optimize architecture for forward moves
Circuit can be partitioned into two units
Skip Ahead Unit: operates at high speed for error free sequences
Error Logic: operates when errors are encountered
Circuit Operation Switches Back and Forth
Between Skip Ahead and Error Logic until it reaches end of tree
Asynchronous Design Advantage
Allows seamless switching between blocks
The Asynchronous Architecture
[Figure: The Skip-Ahead Unit. Blocks: XOR_SPLIT, ERROR-DETECT, FILTER, MERGE, FAST SHIFT REGISTER, FAST DECISION REGISTER, and XOR gates. Signals: To BMU, From BMU, noError, Comparison Result, Decision_bit, SkipAhead Decision, BMU Decision. Received data is compared with estimated branch bits]
The critical path of the Skip Ahead Unit runs at 450MHz (post layout)
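The switching between the Skip-Ahead Unit and the error logic can be sketched behaviorally. The code below is an illustration only: branch bits are represented as small integers, the XOR comparison stands in for the hardware comparison of received data with estimated branch bits, and `handle_error` is a hypothetical stand-in for the error logic, which in the real decoder may move backward in the tree and revise its estimates.

```python
from typing import Callable, List, Sequence, Tuple


def skip_ahead(received: Sequence[int], estimated: Sequence[int], start: int = 0) -> int:
    """Advance while received and estimated branch bits match; return the stop index."""
    i = start
    while i < len(received):
        if received[i] ^ estimated[i]:   # nonzero XOR means a mismatch (an error)
            return i
        i += 1
    return i


def run(received: Sequence[int], estimated: Sequence[int],
        handle_error: Callable[[int], int]) -> List[Tuple[str, int, int]]:
    """Alternate between skip-ahead and error handling until the end is reached."""
    trace = []
    i = 0
    while i < len(received):
        j = skip_ahead(received, estimated, i)
        if j > i:
            trace.append(("skip_ahead", i, j))
        if j < len(received):
            i = handle_error(j)          # must return an index greater than j here
            trace.append(("error_logic", j, i))
        else:
            i = j
    return trace
```

With `received = [0b01, 0b10, 0b11, 0b00]`, `estimated = [0b01, 0b10, 0b10, 0b00]`, and a trivial handler `lambda j: j + 1`, the trace records a fast run over the first two branches, one hand-off to the error logic at index 2, and a final fast run.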
The Memory Design
Supports a packet length of only 128 words. Each word is a pair of branch bits.
Used standard place and route tools for the physical design of the memories
Faster design time at the expense of more area and power consumption
Unacknowledged tri-state buffers on the data bus
Efficiently allows multiple drivers of the bus.
Introduces minor timing assumptions
This is typical in synchronous design, but not typical of PCHB-based designs.
8 sets of branch bits
Fano: Error-Free Operation
17971ns
18449ns
Total of 8x16 = 128 bits decoded
Fano: Error Operation
17537ns
Error Encountered
Move back
25361ns
The Layout
Asynchronous Fano Properties
TSMC 0.25 µm
Skip Ahead Unit runs at 450MHz
2600 µm x 2600 µm = 6.76 mm²
Power dissipation: 32mW (@450MHz, 2.5V)
360,000 transistors
10 man months to design + 6 man months library and flow development
Compared to the Synchronous Fano
2.15 x speed
1/3 the power
5x the area
[Figure: chip layout with labeled blocks: Received Memory, Decision Memory, Threshold Adjust Unit, Branch Metric Calculator, Skip Ahead Unit, Counter, Lookup Table]
Outline
• Introduction and Background
• The Asynchronous Fano Design
• The Back-End Asynchronous Design Flow
• Summary of Contributions
Physical Design Flow
[Flow diagram] Stages and tools shown:
Specification
Schematic (Virtuoso, Synopsys)
Simulation and Analysis (Hspice/Nanosim/Verilog)
Place & Route (Silicon Ensemble)
Chip Assembly (Virtuoso)
LVS & DRC (Virtuoso, Dracula)
Chip Fabrication
Files passed between stages: Netlist (.v), Abstract (.lpe), Netlist (.sp), Layout (.gds), Netlist (.cir)
Asynchronous Leaf Cell/Gate Library
Cell views:
•Symbol
•Schematic
•Functional
•Layout
•Abstract
Standard Flow Works
Cell Library Flow: Alternatives
• Used for the Fano Algorithm
• More suitable for designs with relaxed timing assumptions at the leaf cell level

• Used for the STFB based adder
• More suitable for designs with strict timing assumptions at the leaf cell level

[Flow diagrams for the two alternatives. Labels: Leaf Level Design, Leaf Cell Design, Template, Technology Mapping, Gate Level Netlist, Gate Library, Leaf Cell Library, Layout, Physical P&R]
Leaf cell level or gate level place and route
Cell Library Flow
[Flow diagram] Stages and tools shown:
Cell Design (Virtuoso), with views Symbol, Schematic, Functional, Layout
Simulation and Analysis (Hspice/Nanosim/Verilog)
DRC & LVS (Virtuoso, Dracula)
Cell Abstract (Abstract generator)
Files passed between stages: Layout (.gds), Netlist (.sp), Abstract (.lpe)
Asynchronous Gate Library
Developed asynchronous gate library
Initial cell sizes
Transistor Sizing
2X for pull down network
8X for inverter drivers
Staticizer inverter is ~10x weaker than pull down network
Additional sub-types added as necessary
Create a number of subtypes for different strengths
Charge-Sharing Considerations
• Output inverters and staticizers are internal to all dynamic cells and form part of the known minimum load on the dynamic node (allowing a 10% dip in voltage)
• For each dynamic gate, extensive simulation confirms that the guaranteed minimum load is sufficient to prevent charge-sharing problems
Output inverters and staticizers are encapsulated with the
dynamic logic into a single gate
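The 10% voltage-dip budget can be related to capacitances with a first-order charge-conservation estimate. The sketch below assumes the worst case in which a precharged dynamic node shares its charge with a fully discharged internal node; the function names and capacitance values are hypothetical, and the slides themselves rely on extensive simulation rather than this formula.

```python
def charge_sharing_dip(c_dynamic: float, c_internal: float) -> float:
    """Fractional voltage dip on the dynamic node after sharing charge.

    Charge conservation: V_final = Vdd * c_dynamic / (c_dynamic + c_internal),
    so the dip as a fraction of Vdd is c_internal / (c_dynamic + c_internal).
    """
    return c_internal / (c_dynamic + c_internal)


def min_dynamic_load(c_internal: float, max_dip: float = 0.10) -> float:
    """Smallest dynamic-node capacitance that keeps the dip within max_dip."""
    return c_internal * (1.0 - max_dip) / max_dip
```

Under this model a 10% budget requires the dynamic node to carry about nine times the capacitance of the internal node it may share charge with, which is why building the output inverter and staticizer into every cell helps by guaranteeing a known minimum load.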
Netlist extraction
Verilog netlist (.v) for placement and routing
// LAST TIME SAVED: Jun 4 17:49:17 2003
// NETLIST TIME: Jun 4 17:51:34 2003
`timescale 1ns / 1ns
module Counter2 ( Backward_e, BmuErr_e[5], Forward_e, From_FSM_T,
Go_Fast, Go_Slow_FSM, Go_Start_Pointer_F, Go_Start_Pointer_T,
Go_e, LB, LFB, LFBTE, LFB_LFBTE, LFNB, NewStat_e0, NewStat_e1,
Slow_ShiftB_e, Start, ZeroCheck, Zero_e, infi_e1, infi_e2, nReset);
output Backward_e, Forward_e, From_FSM_T, Go_Fast, Go_Slow_FSM,
Go_Start_Pointer_F, Go_Start_Pointer_T, Go_e, LB, LFB, LFBTE,
Send_T_Re, ShiftB_e, ZeroCheck_e, Zero_False, Zero_True, infi_e;
input
BmuErr_e5a, BmuErr_e5b, BmuErr_e5c, BmuErr_e5d, ConnectGnd, Dec,
Go, Go_Fast_Re, Go_Slow_FSM_e, Go_Start_Pointer_e, Inc, LFB_e1,
LFNB_e1, NoZeroCheck, Re_LB, Re_LFB, Re_LFBTE, Re_LFNB, Re_S19,
Zero_e, infi_e1, infi_e2, nReset;
output [5:5]
BmuErr_e;
// Buses in the design
wire [0:7] Forw_e;
PCHB_SingleRail_SlowDataPath I54 ( .Ae(net01493), .A1(net0507),
    .BUFe(Send_Delta_to_Encode_e), .BUF1(Send_Delta_to_Encode),
    .nReset(nReset));
PCHB_BUFFER1_for_Counter_1 I204 ( .Ae(net0489), .A1(net0486),
    .A0(ConnectGnd), .BUFe(LFB_e1), .BUF1(LFB_LFBTE), .BUF0(nc[30]),
    .Start(Start), .nReset(nReset));
…
Verilog netlist of library gates is auto-generated
Placement, Routing and Extraction
*
* CADENCE/LPE SPICE FILE : SPICE
* DATE : 5-JUN-2003
*
****** MOS XTOR PARAMETERS FROM : 7MOSXREF
****** RESISTORS PARAMETERS FROM : 7RESXREF
****** DIODE PARAMETERS FROM : 7DIOXREF
****** CAPACITORS PARAMETERS FROM : 7CAPXREF
****** CAPACITORS PARAMETERS FROM : 7CAPXMER
****** CORNER ADJUSTMENT FACTOR = 0.0000000
*
*.GLOBAL VDD! GND!
*
*----- TOTAL # OF MOS TRANSISTORS FOUND : 2018
*----- COMMENTED : 0
.SUBCKT INC2 DATA REQ ACK NRST4 L0 L1
...
MM1-XI59-3 NET72 XI59-NET35 VDD! VDD! PCH L=0.24U W=2.50U AD=0.93P
+ PD=3.24U AS=1.65P PS=6.32U NRS=0.088 NRD=0.088
MM2-XI60-XI36 XI36-A NET0432 VDD! VDD! PCH L=0.24U W=2.80U AD=1.04P
+ PD=3.54U AS=1.88P PS=6.94U NRS=0.079 NRD=0.079
MM3-XI60-XI36 XI36-A NR<6> VDD! VDD! PCH L=0.24U W=2.80U AD=1.04P
+ PD=3.54U AS=1.88P PS=6.94U NRS=0.079 NRD=0.079
MM7-XI60-XI36 XI36-XI60-NET029 NET0432 XI36-A GND! NCH L=0.24U W=1.20U
+ AD=0.24P PD=1.60U AS=0.44P PS=1.94U NRS=0.183 NRD=0.167
MM7-XI60-XI36-1 685 NET0432 GND! GND! NCH L=0.24U W=1.20U AD=0.24P
+ PD=1.60U AS=0.80P PS=3.74U NRS=0.183 NRD=0.167
...
C1 NET77 GND! 8.00421E-15
C2 NET209 GND! 1.06917E-14
C3 NET188 GND! 1.16892E-14
C4 NET121 GND! 1.34065E-14
C5 NET215 GND! 1.02445E-14
...
Chip Assembly
• Stream-in blocks layout (from SE to Virtuoso)
• Block placement and routing
• DRC, LVS and netlist extraction (.sp)
• Post-layout simulation
Future Work:
• Static timing
• Automatic block placement and routing
• Synthesis
Summary
Design Flow: Standard ASIC flow for channel based asynchronous circuits
Async high-performance designs with ASIC design times are possible
Verilog modelling and structural simulation is feasible
Commercial P&R tool (Silicon Ensemble) works quite well
Design flow is applicable to many templates (QDI or STFB)
Architectural: Design and implementation of the Fano Algorithm
A complex design implemented in both synchronous and asynchronous styles
Over 2x performance with 1/3 the power at the expense of 3-5x area
First freely available asynchronous library
Working on characterization and Lib file generation
Thank You
Skip-Ahead Unit with RSPCHB
A 14% throughput improvement in the Skip-Ahead Unit using RSPCHB instead of PCHB
[Figure: The Skip-Ahead Unit. Blocks: XOR_SPLIT, ERROR-DETECT, FILTER, MERGE, FAST SHIFT REGISTER, FAST DECISION REGISTER, and XOR gates. Signals: To BMU, From BMU, noError, Comparison Result, Decision_bit, SkipAhead Decision, BMU Decision. Received data is compared with estimated branch bits]
Overview of New Pipeline Templates
Style     Timing Assumptions   2-D Throughput
PCHB      DI/QDI               772 MHz
RSPCHB    QDI                  920 MHz
LP2/2+    Moderate             1.0 GHz
HC        Aggressive           1.2 GHz
Foundation of design space exploration trading robustness for performance