Uncle – An RTL Approach to
Asynchronous Design
Presenter : Chi-Chuan Chuang
Date : 2012.12.20
Outline
Introduction
◦ C-element
◦ Null convention logic (NCL)
◦ NCL asynchronous systems
UNCLE synthesis flow
◦ From RTL to gates
◦ Ack generation
◦ Net buffering
◦ Latch balancing
◦ Relaxation, cell merging
Comparisons
Conclusion
C-element
Commonly used asynchronous
logic component
Hysteresis
Implementations
◦ Semi-static : with two cross-coupled inverters
◦ Static : doesn’t rely on feedback inverters
◦ Gate-level : depends on which gates are used
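The hysteresis of a C-element can be illustrated with a small behavioral model. This is only a sketch of the two-input case; the function name `c_element` and the input sequence are illustrative and do not come from the slides.

```python
def c_element(a: int, b: int, prev: int) -> int:
    """Two-input C-element: follow the inputs when they agree, else hold."""
    if a == b:
        return a
    return prev  # inputs disagree: keep the previous output (hysteresis)


# Walk through a short input sequence, feeding the output back as state.
z = 0
trace = []
for a, b in [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]:
    z = c_element(a, b, z)
    trace.append(z)

print(trace)  # [0, 0, 1, 1, 0]
```

The output rises only after both inputs are 1 and falls only after both are 0; in between it holds its previous value.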
C-element (cont.)
[Figures: semi-static, static, and gate-level C-element implementations]
Null convention logic
Dual-rail
Delay-insensitive logic style
Based on threshold logic
Uses 27 fundamental threshold gates with
2 to 4 inputs
Hysteresis state-holding capability
Null convention logic (cont.)
Definitions of threshold gate
◦ set : equation that determines the gate function
◦ hold1 : all inputs ORed together
◦ reset : complement of hold1
◦ hold0 : complement of set
$Z = \mathrm{set} + Z^{-} \cdot \mathrm{hold1}$
$Z' = \mathrm{reset} + Z^{-\prime} \cdot \mathrm{hold0}$
($Z^{-}$ is the previous output value)
An example: implementing TH23
$\mathrm{set} = AB + BC + AC$
$\mathrm{hold1} = A + B + C$
$\mathrm{reset} = A'B'C'$
$\mathrm{hold0} = A'B' + B'C' + A'C'$
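The TH23 definitions can be checked with a short behavioral model. The sketch below assumes the usual reading of the four functions (reset is the complement of hold1, hold0 is the complement of set) and verifies exhaustively that the Z' equation is the complement of the Z equation. The function names are illustrative only.

```python
from itertools import product


def th23(a: int, b: int, c: int, z_prev: int) -> int:
    """Z = set + Z_prev * hold1 for a TH23 gate (threshold 2, 3 inputs)."""
    set_ = (a & b) | (b & c) | (a & c)
    hold1 = a | b | c
    return set_ | (z_prev & hold1)


def th23_complement(a: int, b: int, c: int, z_prev: int) -> int:
    """Z' = reset + Z_prev' * hold0, written with complemented inputs."""
    na, nb, nc = 1 - a, 1 - b, 1 - c
    reset = na & nb & nc                       # complement of hold1
    hold0 = (na & nb) | (nb & nc) | (na & nc)  # complement of set
    return reset | ((1 - z_prev) & hold0)


# The two equations must describe the same gate for every input and state.
consistent = all(
    th23_complement(a, b, c, z) == 1 - th23(a, b, c, z)
    for a, b, c, z in product((0, 1), repeat=4)
)
print(consistent)  # True
```

With the previous output at 1, a single asserted input keeps the gate at 1 (hold1), and only the all-zero input resets it.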
Null convention logic (cont.)
Comparison of two types of dual-rail (DR) AND2
The 27 basic NCL macros
NCL asynchronous systems
Data-driven approach
◦ Use NCL gates for both registers and
control
Control-driven approach
◦ Uses Balsa-style registers and control
Data-driven approach
Using dual-rail latch with acknowledge
signals ki, ko to control the datapath
Dual-rail latches
◦ C_0 = C-element with async reset to 0
◦ C_1 = C-element with async reset to 1
◦ t_d/f_d = dual-rail in
◦ ko = ackout
◦ t_q/f_q = dual-rail out
◦ ki = ackin
Types of latch
◦ drlatn
◦ drlatr
◦ drlats
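A behavioral sketch of one dual-rail half-latch bit, using the signal names listed above. The internal structure is an assumption based on the common NCL register arrangement (one C-element per rail gated by ki, and ko as the inverted OR of the output rails); the slides' figures may differ in detail. The class name and the reset-to-NULL choice are illustrative.

```python
class DualRailHalfLatch:
    """One dual-rail bit: two C-elements gated by ki, plus ack generation."""

    def __init__(self) -> None:
        # Both C-elements reset to 0, so the latch starts out holding NULL.
        self.t_q = 0
        self.f_q = 0

    @staticmethod
    def _c(a: int, b: int, prev: int) -> int:
        # C-element: follow the inputs when they agree, otherwise hold.
        return a if a == b else prev

    @property
    def ko(self) -> int:
        # 1 = latch holds NULL (requests data), 0 = latch holds DATA.
        return 1 - (self.t_q | self.f_q)

    def step(self, t_d: int, f_d: int, ki: int) -> int:
        self.t_q = self._c(t_d, ki, self.t_q)
        self.f_q = self._c(f_d, ki, self.f_q)
        return self.ko


latch = DualRailHalfLatch()
ack_blocked = latch.step(1, 0, ki=0)  # ki low: DATA is not accepted
ack_data = latch.step(1, 0, ki=1)     # ki high: DATA "1" is latched
ack_null = latch.step(0, 0, ki=0)     # ki low and NULL in: latch clears
print(ack_blocked, ack_data, ack_null)  # 1 0 1
```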
Dual-rail latches (cont.)
[Figures: drlatn, drlatr, drlats]
Data-driven approach (cont.)
Finite state machine
◦ The middle half-latch contains initial data
◦ All ports and registers are read and written
every cycle
Control-driven Approach
Registers with selective read/write
Control network is separate from the
datapath
Read ports can easily be added
to a register
UNCLE synthesis flow
Both data-driven and control-driven styles are
supported
A lower-level synthesis tool
Uses Verilog as its input language
From RTL to Gates
RTL is transformed to a gate level netlist
using commercial synthesis tools
The target library read by the tool
contains:
◦ AND2, XOR2, OR2, inverter
◦ D-flip-flop (DFF), D-latch (DLAT)
◦ Special-purpose gates (T-elements, S-elements…)
◦ Complex gates that have been mapped into NCL
Gates have unit delays for timing
Area is proportional to transistor counts
Ack Generation
Data-driven
◦ Each latch receives an ack signal from each
destination latch of its output
Control-driven
◦ Each control element receives an ack signal from
each destination latch
A simple Ack merging algorithm:
◦ any latches having at least one common
destination have their ack networks merged
An ack checker step is included at the end of
the flow to check ack network validity
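The ack merging rule can be sketched as a grouping problem: latches that share at least one destination end up in the same ack group. The sketch below assumes merging is transitive and uses a small union-find; the function name, data layout, and latch names are hypothetical and not taken from the Uncle tool.

```python
def merge_ack_groups(dests: dict[str, set[str]]) -> list[list[str]]:
    """Group source latches whose destination sets overlap (transitively)."""
    parent = {latch: latch for latch in dests}

    def find(x: str) -> str:
        # Follow parent links to the group representative, halving paths.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_source: dict[str, str] = {}  # destination -> first latch seen
    for latch, targets in dests.items():
        for d in targets:
            if d in first_source:
                parent[find(latch)] = find(first_source[d])  # common dest
            else:
                first_source[d] = latch

    groups: dict[str, set[str]] = {}
    for latch in dests:
        groups.setdefault(find(latch), set()).add(latch)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical netlist: L1 and L2 both drive L4, L5 drives only L6.
groups = merge_ack_groups({"L1": {"L3", "L4"}, "L2": {"L4"}, "L5": {"L6"}})
print(groups)  # [['L1', 'L2'], ['L5']]
```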
Net Buffering
Timing data uses a non-linear delay model
(NLDM)
The signal net target transition time used
for all examples in this paper is
approximately equivalent to a 1X inverter
driving four separate 4X inverter
loads
Gate sizing
Build a buffer tree with inverters
Latch Balancing
For the data-driven style: moves half-latches in the netlist to balance data
delays against ack delays
Ack delay
◦ Depends on the number of destinations, which
sets the completion network depth
Data delay
◦ Depends on the data logic complexity.
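To see how the destination count drives the ack delay, the completion network depth can be estimated as follows. This is a rough sketch that assumes the completion network is a tree of gates with at most four inputs (NCL gates have 2 to 4 inputs) and one unit delay per level; the function name and the tree assumption are illustrative, not taken from the tool.

```python
def completion_depth(n_dests: int, max_fanin: int = 4) -> int:
    """Levels of gates needed to combine n_dests ack signals into one."""
    if n_dests < 1:
        raise ValueError("need at least one destination")
    depth = 0
    signals = n_dests
    while signals > 1:
        signals = -(-signals // max_fanin)  # ceiling division per level
        depth += 1
    return depth


depths = [completion_depth(n) for n in (1, 4, 5, 16, 17)]
print(depths)  # [0, 1, 2, 2, 3]
```

Under these assumptions the ack delay grows by one gate level each time the destination count crosses a power of four, which is the quantity latch balancing trades against data logic depth.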
Latch Balancing (cont.)
Generally results in more transistors, as
the datapath width increases moving
towards the source registers
This requires more latches, with an increase in
the ack network size
Implemented by an iterative heuristic algorithm
Latch Balancing (cont.)
Several sorting/pruning stages based on
data/ack/cycle delays are used to find the
latches that are most likely to improve
performance if pushed
Chosen latches are pushed one gate level,
and affected ack networks are rebuilt
Latches that feed only primary outputs are
ineligible
Latch Balancing (cont.)
Works appropriately for FSMs
Has problems with linear pipelines if
latches are pushed in one direction only
Relaxation and Cell Merging
Relaxation is a technique that
◦ Looks for redundant paths from a PI to a PO
◦ Finds gates that don’t have to be fully
expanded to dual-rail versions, but can be
implemented by eager versions that require
fewer transistors
Cell Merging
◦ A cell merging step is performed in which
adjacent gates with no fanout are merged into
more complex gates
◦ Area-driven
Example RTL Statements
Comparison
GCD16 with different Uncle versions
Uncle ver.     | DD    | DD/NB | DD/LB/NB | CD   | CD/NB
transistors    | 16192 | 16226 | 20128    | 8658 | 8662
*              | 1.87  | 1.87  | 2.32     | 1.00 | 1.00
cyc. time (ns) | 105.7 | 86.0  | 64.9     | 75.7 | 62.4
*              | 1.69  | 1.38  | 1.04     | 1.21 | 1.00
energy (pJ)    | 32.4  | 35.3  | 49.7     | 10.2 | 10.8
*              | 3.17  | 3.44  | 4.85     | 1.00 | 1.05
Conditional port activity caused the data-driven designs to be large and slow.
Latch balancing helped DD performance. The control-driven style produced the
best results.
DD: data-driven, CD: control-driven, LB: latch-balanced, NB: net-buffered, *: ratio to best
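The "ratio to best" rows are each value divided by the smallest value in its row. The sketch below recomputes them for the GCD16 transistor and cycle-time rows, using the numbers from the table. The energy row is left out because ratios recomputed from the rounded table values can differ from the published ones in the last digit. The function name is illustrative.

```python
transistors = {"DD": 16192, "DD/NB": 16226, "DD/LB/NB": 20128,
               "CD": 8658, "CD/NB": 8662}
cycle_ns = {"DD": 105.7, "DD/NB": 86.0, "DD/LB/NB": 64.9,
            "CD": 75.7, "CD/NB": 62.4}


def ratio_to_best(row: dict[str, float]) -> dict[str, float]:
    """Divide every entry by the row minimum and round to two decimals."""
    best = min(row.values())
    return {name: round(value / best, 2) for name, value in row.items()}


transistor_ratio = ratio_to_best(transistors)
cycle_ratio = ratio_to_best(cycle_ns)
print(transistor_ratio["DD/LB/NB"], cycle_ratio["DD"])  # 2.32 1.69
```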
Comparison (cont.)
GCD16 between Uncle and Balsa
Tool           | Balsa | Uncle (CD/NB)
transistors    | 11455 | 8662
*              | 1.32  | 1.00
cyc. time (ns) | 85.2  | 62.4
*              | 1.37  | 1.00
energy (pJ)    | 13.7  | 10.8
*              | 1.27  | 1.00
Balsa used more read ports on registers reducing loading but
increasing transistor count
Net buffering helped offset increased loading in Uncle design,
improved performance
Comparison (cont.)
Viterbi decoder design
◦ Branch Metric Unit (BMU)
Just combinational logic
With a half latch at the output for UNCLE ack
◦ Path Metric Unit (PMU)
It’s a set of parallel accumulator-like registers resulting
in many parallel three half-latch loops
◦ History Unit (HU)
It has three 16-entry register files(4-bit, 2-bit, and 1-bit)
An outer loop writes the registers, and can conditionally
trigger an inner while loop that contains register
read/write operations and executes a variable number
of iterations
Comparison (cont.)
Viterbi’s Branch Metric Unit comparison
◦ Combinational logic only
Tool           | Balsa | Uncle (CD/NB)
transistors    | 9040  | 5338
*              | 1.69  | 1.00
cyc. time (ns) | 9.30  | 8.87
*              | 1.05  | 1.00
energy (pJ)    | 2.33  | 1.35
*              | 1.73  | 1.00
Uncle version just combinational logic with half-latch on
output
Balsa version used loop splitting to split combinational
logic into concurrent blocks that increased parallelism of
internal computations at the cost of more transistors.
Comparison (cont.)
Uncle’s Viterbi Path Metric Unit (PMU)
Uncle ver.     | DD/NB | DD/NB/LB | DD/NB/LB+ | CD/NB
transistors    | 20184 | 21778    | 24561     | 18838
*              | 1.07  | 1.16     | 1.30      | 1.00
cyc. time (ns) | 13.4  | 13.4     | 6.9       | 13.3
*              | 1.93  | 1.93     | 1.00      | 1.91
energy (pJ)    | 5.1   | 5.7      | 6.8       | 4.6
*              | 1.12  | 1.24     | 1.48      | 1.00
LB+ = latch-balanced, with two sets of half-latches added to the RTL (one in the
FSM loop, and one on the output port)
Comparison (cont.)
Viterbi’s Path Metric Unit comparison
Tool           | Balsa | Uncle (DD/NB/LB+)
transistors    | 38328 | 24561
*              | 1.56  | 1.00
cyc. time (ns) | 9.39  | 6.94
*              | 1.35  | 1.00
energy (pJ)    | 9.73  | 6.81
*              | 1.43  | 1.00
Comparison (cont.)
Viterbi’s History Unit comparison
Tool              | Balsa | Uncle CD/NB | Uncle CD
transistors       | 21819 | 16471       | 16425
*                 | 1.33  | 1.00        | 1.00
V1 cyc. time (ns) | 10.8  | 6.8         | 8.4
*                 | 1.60  | 1.00        | 1.25
V1 energy (pJ)    | 1.34  | 1.17        | 1.07
*                 | 1.26  | 1.09        | 1.00
V2 cyc. time (ns) | 230.7 | 161.3       | 192.0
*                 | 1.43  | 1.00        | 1.19
V2 energy (pJ)    | 25.4  | 19.6        | 18.7
*                 | 1.36  | 1.05        | 1.00
Comparison (cont.)
Viterbi comparison between Balsa and
Uncle
Tool           | Balsa | Uncle (DD/NB/LB+)
transistors    | 71370 | 46752
*              | 1.53  | 1.00
cyc. time (ns) | 22.0  | 17.3
*              | 1.27  | 1.00
energy (pJ)    | 15.0  | 10.5
*              | 1.43  | 1.00
The Uncle decoder uses the DD/NB/LB+ PMU RTL
Comparison (cont.)
Feature                 | Balsa                                    | Uncle
Combinational synthesis | Yes                                      | Yes
Control synthesis       | Yes                                      | Data-driven only
Logic style             | Different dual-rail styles, bundled data | NCL only
Behavioral simulation   | Yes                                      | Limited
Area optimizations      | No                                       | Relaxation, limited cell merging, ack sharing; RTL style allows area/perf. tradeoffs, latch balancing, net buffering
Timing model            | Fixed delay                              | NLDM
Conclusion
Uncle requires more designer effort than
Balsa, but can yield a higher-quality design
For always-active modules where performance
is the goal, the data-driven style
is better
The control-driven style is better for modules
with conditional port activity.
Appendix : Teak
Teak is a successor toolset to Balsa that
uses a data-driven style
One of Teak’s goals is to automatically
insert latch stages and balance delays for
optimum throughput.
Teak is a fairly new tool with only one
public release
Reference
Uncle – An RTL Approach to Asynchronous Design
ASYNC12 powerpoint about Uncle – An RTL Approach
To Asynchronous Design
Design of Asynchronous Circuits Using Synchronous
CAD Tools
Optimization of NULL convention self-timed circuits