Introduction to basic concepts on asynchronous circuit design
Industrial Experiences
Pioneering Asynchronous Commercial Design
Peter A. Beerel
Fulcrum Microsystems
Calabasas Hills, CA, USA
1
Agenda
Introduction to Fulcrum
Description of Integrated Pipelining
Fulcrum’s clockless circuit architecture
Description of Fulcrum’s Design Flow
Overview of Nexus
Fulcrum’s Terabit crossbar
Overview of PivotPoint
Fulcrum’s first commercial product
2
Company Snapshot
“Clockless” Semiconductor Company
Located in Calabasas, CA (30 people)
Formed out of Caltech (1/00)
Technology proven in large-scale designs
Backed by top-tier investors (raised $14M in June)
3
Fulcrum’s Integrated Pipelining
Robust, power efficient, and high performance
[Figure: pipeline of three Dual-Rail Domino Logic stages with Acknowledge signals between them]
Fast delay-insensitive style using domino logic without latches
(Developed at Caltech by Fulcrum’s founders)
5
Integrated Pipelining
[Figure: Leaf Cells A, B, and C, each containing Dual-Rail Domino Logic and Control, with Input Completion Detection and Output Completion Detection]
Harnessing the power of Domino Logic
Addresses delay variability with Completion Sensing
Addresses power inefficiency with Async Handshakes
Leverages more efficient “N” transistors
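Completion sensing on dual-rail signals can be illustrated with a small behavioral model. The sketch below is an assumption-laden illustration, not Fulcrum's circuit: each bit is a pair of rails, a bit is valid when exactly one rail is high and neutral when both are low, and the detector holds its previous output while the bits disagree (the behavior of a C-element). The names `bit_state` and `completion` are hypothetical.

```python
def bit_state(rail0, rail1):
    """Classify one dual-rail bit as 'neutral', 'valid', or 'illegal'."""
    if rail0 and rail1:
        return "illegal"          # both rails high never occurs in a legal code
    if rail0 or rail1:
        return "valid"            # exactly one rail high: the bit carries data
    return "neutral"              # both rails low: the bit is precharged/reset


def completion(bits, previous="neutral"):
    """Report 'valid' once every bit is valid and 'neutral' once every bit
    is neutral; otherwise hold the previous output (C-element behavior)."""
    states = [bit_state(r0, r1) for r0, r1 in bits]
    if "illegal" in states:
        raise ValueError("both rails of a bit are high")
    if all(s == "valid" for s in states):
        return "valid"
    if all(s == "neutral" for s in states):
        return "neutral"
    return previous               # some bits have arrived, others have not
```

Because the detector reports completion only when the data has actually arrived, a stage in this model never depends on a worst-case delay estimate.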
6
Hierarchical Design
Multi-level hierarchy of communicating blocks
At each level blocks communicate along channels
[Figure: design hierarchy from an ASIC (Main FSM, Memory, Register Bank, Adder, Multiplier, Subtract/Divider, Adder/Mult., Reg A, Reg B, Reg C) down to leaf cells (BN-1, BN-2, BN-3; FAN-1, FAN-2, FAN-3, FA0), with blocks connected by channels]
7
Leaf Cells
[Figure: leaf cell diagram with blocks labeled C, F, LCD, RCD, and D]
Definition
Smallest block that performs logic and communicates via channels
Based on small number of pipeline templates guiding design
Forms basic building block for physical design
Features
Facilitates high throughput and low latency
Provides easy timing validation and analog verification
~1,000 digital leaf cell types compose our leaf cell library
~200 additional subtypes for different physical environments (e.g., loads)
8
Template-Based Cell Design
• Each pipeline style (QDI, timed…) has a different blueprint
• Library uses a blueprint to implement the lowest level blocks
[Figure: three pipeline-stage diagrams built from blocks labeled F, C, LCD, and RCD: a 2-input 1-output pipeline stage, a 1-input 2-output pipeline stage, and the blueprint for a QDI N-input M-output pipeline stage]
9
Summary of Characteristics
Delay-Insensitive timing model
Gates and wires can have arbitrary delays
4-phase 1-of-4 handshake
Uses 4 wires to send 2 bits
Plus an acknowledge wire for flow control
Returned to neutral between each data transfer
Self shielding
Precharge domino logic plus async handshake
Low latency; high frequency; robust
Auto power conservation; zero standby power
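The 1-of-4 code and the return-to-neutral handshake summarized above can be sketched in a few lines. This is a minimal behavioral model under stated assumptions: the wire ordering, the active-high acknowledge polarity, and the function names are illustrative choices that do not come from the slides.

```python
NEUTRAL = (0, 0, 0, 0)  # all four data wires low between transfers


def encode_1of4(value):
    """Encode a 2-bit value (0..3) by raising exactly one of four wires."""
    if not 0 <= value <= 3:
        raise ValueError("a 1-of-4 code carries exactly 2 bits")
    return tuple(1 if i == value else 0 for i in range(4))


def decode_1of4(wires):
    """Return the 2-bit value, or None while the wires are neutral."""
    wires = tuple(wires)
    if wires == NEUTRAL:
        return None
    if sum(wires) != 1:
        raise ValueError("invalid 1-of-4 codeword")
    return wires.index(1)


def four_phase_transfer(value):
    """List the (data wires, acknowledge) states of one 4-phase transfer."""
    data = encode_1of4(value)
    return [
        (data, 0),     # 1. sender raises one data wire
        (data, 1),     # 2. receiver acknowledges the valid data
        (NEUTRAL, 1),  # 3. sender returns the data wires to neutral
        (NEUTRAL, 0),  # 4. receiver withdraws the acknowledge
    ]
```

Only one of the four wires switches per transfer, so two bits cost one rising and one falling data transition plus the acknowledge transitions.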
10
Fulcrum Design Flow
Design Specification
Executable specifications
Formal decomposition
Creates design hierarchy
Semi-custom synthesis & layout
Hierarchical floor planning
Automated transistor sizing
Semi-automated physical design
Supports synchronous & asynchronous designs
Hard macro from place & route
[Figure: hierarchical design flow with stages Architecture Design & Verification, Micro-architecture Design & Verification, Synthesis & Floor Planning, Physical Design, and Database Release to Manufacturing, alongside Mitered Simulation & Verification]
12
Managing Design Hierarchy
Proprietary Object-Oriented Hardware Language
Integrated hierarchical design/verification language
Defines cell specification & implementation
Specification
Java or communicating-sequential-processes (CSP)
Implementation: multiple forms
Sub-cells
Sub-cells defined in terms of specification or implementation
Defines integrated test environment for each cell
Enables verification at all pairs of levels
Efficiency features
Supports refinement of cells and channels
13
Physical Design
Layout hierarchy based on design hierarchy
Hierarchical floor-planning semi-automated
Large scale hand placement before sizing
Long distance channels planned carefully
Timing closure by construction
Placement drives sizing
Can insert extra pipelining on long wires late in design
Tradeoffs between performance and design time
Hand layout where necessary
Automated layout where possible
Goals
Full-custom density and speed within ASIC design time
14
Design Verification: System-Level
[Figure: test bench diagram with blocks labeled Test Bench, Device Under Test, Configuration Manager, Test Cases, Bus Functional Model, Executable Spec, Traffic Generator & Checker, Gate-level Verilog Model, and Monitor]
Mission
Verify that executable spec = written spec + gate-level model
Use industry-standard tools & methods
Cadence NCSIM and efficient Java-Verilog interface
Directed random testing
Line & functional coverage
15
Design Verification: Unit-Level
[Figure: miter diagram with blocks labeled Test Engine, Copy, High level (Java/CSP), Low level (CSP/PRS/CDL), comparison (==), and Log]
Mitered co-simulation for unit-level verification
Check correctness of digital model by comparing it to golden CSP/Java model
Features
Framework automated and regressed
Checks correctness
Checks delay insensitivity and/or throughput and latency
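The miter idea can be sketched as follows: the same stimulus is copied to a golden high-level model and to a lower-level model, and any disagreement is logged. This is an illustrative sketch only; the `miter` helper and the two 4-bit adder models are hypothetical stand-ins, not part of Fulcrum's framework, and the sketch checks functional correctness only (not delay insensitivity, throughput, or latency).

```python
def miter(golden, candidate, stimuli):
    """Drive both models with identical inputs and log every mismatch."""
    log = []
    for step, stimulus in enumerate(stimuli):
        expected = golden(stimulus)
        actual = candidate(stimulus)
        if expected != actual:
            log.append((step, stimulus, expected, actual))
    return log


def golden_adder(operands):
    """High-level reference: 4-bit addition with wraparound."""
    a, b = operands
    return (a + b) & 0xF


def ripple_adder(operands):
    """Lower-level model: bit-by-bit ripple-carry addition."""
    a, b = operands
    carry, total = 0, 0
    for i in range(4):
        x, y = (a >> i) & 1, (b >> i) & 1
        total |= (x ^ y ^ carry) << i
        carry = (x & y) | (carry & (x ^ y))
    return total


# Exhaustive stimulus for two 4-bit operands.
stimuli = [(a, b) for a in range(16) for b in range(16)]
mismatches = miter(golden_adder, ripple_adder, stimuli)
```

An empty log means the two models agreed on every stimulus; each entry in a non-empty log records the step, the input, and both outputs, which is the information needed to regress a fix.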
16
Analog Verification: Charge Sharing
[Figure: flow with blocks labeled Synthesis, Charge Sharing Test Generator, and SPICE]
SPICE-based charge sharing analysis
Test case generation and analysis automated
Charge-sharing problems solved in numerous ways
Symmetrization
Less transistor sharing
Delay perturbations
17
Synthesis: Gate Generation / Sizing
Automated generation of transistor netlists
Dynamic logic generation
Transistor sharing
Symmetrization
Gate-library matching
Transistor sizing
Path-based sizing to meet amortized unit-delay model
Micro-architecture feedback
Identifies where fanout limits performance
[Figure: blocks labeled CSP, Logic Synthesis, Transistor Sizing, CDL Netlist, Gate Library, and Floor planning Information]
18
Fulcrum QDI v. Synchronous Flows
Save clock tree design, analysis, optimization, and verification
No timing closure problems
Unexpected long-wire bottlenecks easily solved with additional pipeline buffers late in design cycle
QDI/DI timing model reduces timing analysis challenges
Fulcrum QDI hierarchical design facilitates:
Composability, re-use, and early bug detection
Hierarchical floorplanning improves predictability of wires
Template-based leaf cell design simplifies logic design
Design reuse reduces criticality of high-level synthesis
Decomposition methodology amenable to formal verification
19
Globally Asynchronous, Locally Synchronous
SoC designs: many cores with different clock domains
Async circuits can interconnect multiple sync cores in an SoC design, eliminating global clock distribution and simplifying clock domain crossing
Fulcrum’s “Nexus” is a high speed on-chip interconnect:
16 port, 36 bit asynchronous crossbar
Asynchronous cross-chip channels
Async-sync clock domain converters
Runs at 1.35GHz in 130nm process
21
Nexus System-on-Chip Interconnect
Non-blocking crossbar
16 full-duplex ports
Flow control extends through the crossbar
Full speed arbitration
Arbitrary-length “bursts”
Bridges clock domains
Scales in bit width and ports
Process portable
[Figure: Generic Nexus Example; legend: Synchronous IP block, Asynchronous IP block, Pipelined repeater, Clock domain converter]
22
Nexus Burst Format
Fields: Data (36 bit), Tail (1 bit), Control (4 bit)
Incoming From Source Module: data words D1, D2, D3, ••• DN with tail bits 0, 0, 0, ••• 1 and control field “To”
Outgoing To Target Module: data words D1, D2, D3, ••• DN with tail bits 0, 0, 0, ••• 1 and control field “From”
Arbitrary-length source-routed bursts provide flexibility
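The burst format can be modeled with a small data structure. The sketch below assumes, for illustration only, that the 4-bit control field is carried once per burst, that it names the target port on the way in ("To") and the source port on the way out ("From"), and that the tail bit is 1 only on the last data word. The `Burst`, `make_burst`, and `route` names are hypothetical.

```python
from dataclasses import dataclass

DATA_MASK = (1 << 36) - 1  # each data word is 36 bits wide
NUM_PORTS = 16             # a 4-bit control field addresses 16 ports


@dataclass
class Burst:
    control: int  # "To" port on ingress, "From" port on egress
    words: list   # list of (data, tail) pairs; tail is 1 on the last word


def make_burst(to_port, payload):
    """Build an incoming burst addressed to `to_port`."""
    if not 0 <= to_port < NUM_PORTS:
        raise ValueError("control field is 4 bits")
    if not payload:
        raise ValueError("a burst needs at least one data word")
    last = len(payload) - 1
    words = [(word & DATA_MASK, 1 if i == last else 0)
             for i, word in enumerate(payload)]
    return Burst(control=to_port, words=words)


def route(source_port, burst):
    """Deliver a burst: return (target_port, outgoing burst) where the
    control field now carries the source port instead of the target."""
    outgoing = Burst(control=source_port, words=list(burst.words))
    return burst.control, outgoing
```

For example, `route(3, make_burst(7, [10, 20, 30]))` yields target port 7 and an outgoing burst whose control field is 3, so the receiver learns who sent the data.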
23
Sync-to-Async Conversion
Synchronous Request / Grant FIFO protocol
Data transferred if request and grant both high on rising edge of clock
Compensates for any skew on asynchronous side
Low latency: 1/2 to 3/2 clock cycles at A2S
[Figure: S2A and A2S converters, each connecting a Synchronous Datapath (with Request, Grant, and clock signals) to an Asynchronous Datapath]
Seamlessly Bridges Different Clock Domains
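The request/grant rule on the synchronous side can be sketched as a cycle-by-cycle model: a word moves only on a clock edge where request and grant are both high. This is an assumption-based illustration; the `run_interface` name and the policy that the sender asserts request whenever it holds a word are hypothetical choices, and the model ignores the 1/2 to 3/2 cycle conversion latency.

```python
from collections import deque


def run_interface(words, grant_per_cycle):
    """Simulate one rising clock edge per entry of `grant_per_cycle`.

    Returns (received, still_pending).
    """
    pending = deque(words)
    received = []
    for grant in grant_per_cycle:
        request = bool(pending)      # sender requests while it holds data
        if request and grant:        # transfer only when both are high
            received.append(pending.popleft())
    return received, list(pending)
```

With `run_interface(["a", "b", "c"], [1, 0, 1, 1, 0])` the words transfer on cycles 0, 2, and 3; on cycle 1 the sender holds its word because grant is low.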
24
Arbitration and Ordering
Unrelated sender/receiver links are independent
Bursts sent from multiple input ports to the same output port are serviced fairly by built-in arbitration circuitry
Bursts from A to B remain ordered
Producer-consumer and global-store-ordering satisfied
A sends X to B, A notifies C, C can read X from B
A writes X to B, A writes Y to C, if D reads Y from C, it can read X from B
Split transactions implement loads
Load request and load completion bursts
Load completions returned out-of-order
Can tunnel common bus and cache coherence protocols
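Two of the properties above, fair service among competing inputs and in-order delivery per sender, can be illustrated with a toy arbiter. The slides do not say how the arbitration circuitry achieves fairness, so the round-robin policy below is purely an assumption for illustration, as is the `arbitrate` name; whole bursts are treated as indivisible units.

```python
from collections import deque


def arbitrate(queues, start=0):
    """Serve whole bursts from per-input queues in round-robin order.

    `queues` is a list of deques, one per input port, all targeting the
    same output port. Returns the delivery order as (port, burst) pairs.
    """
    order = []
    n = len(queues)
    pointer = start
    while any(queues):                      # an empty deque is falsy
        for offset in range(n):
            port = (pointer + offset) % n
            if queues[port]:
                order.append((port, queues[port].popleft()))
                pointer = (port + 1) % n    # next search starts after winner
                break
    return order
```

Each queue is drained front to back, so bursts from any one input stay in order, while no input with pending bursts waits for more than one burst from each competitor.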
25
Example: Load/Store Systems
Option 1: Pure Master/Target Ports
Masters send Requests to Targets, which may return Completions
Each port must either be a Master or a Target so that Completions are never blocked by Requests
Devices which need to be both Masters and Targets are given two separate full-duplex ports
Could use two separate Nexus crossbars
Option 2: Peers
Modules which are both Masters and Targets implement an internal buffer to hold Requests so that Completions can bypass them
All Masters or Peers restrict number of outstanding Requests to avoid overflowing Request buffers
26
Example: Switch Fabric
Each module maintains input/output queues for traffic to/from each other module
Data is sent from an input queue to an output queue over Nexus as a series of short bursts
Flow control credits for each output queue are sent backward
Eliminates head-of-line blocking
Segmentation, buffering, and overspeed optimize performance during congestion
Used in PivotPoint, Fulcrum’s first chip product.
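Credit-based flow control of this kind can be sketched with one counter per output queue. The model below is a generic illustration under assumptions: one credit corresponds to one burst slot, and the `CreditedLink` class and its method names are hypothetical rather than taken from PivotPoint.

```python
class CreditedLink:
    """One input-queue to output-queue path guarded by credits."""

    def __init__(self, queue_depth):
        self.credits = queue_depth   # free slots in the remote output queue
        self.output_queue = []

    def send(self, burst):
        """Send one burst if a credit is available; return True on success."""
        if self.credits == 0:
            return False             # sender must wait, nothing is dropped
        self.credits -= 1
        self.output_queue.append(burst)
        return True

    def drain(self):
        """Output side consumes one burst and sends a credit backward."""
        burst = self.output_queue.pop(0)
        self.credits += 1
        return burst
```

Because a sender never transmits without a credit, the output queue cannot overflow, and a congested destination stalls only the queue that feeds it rather than traffic bound for other modules.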
27
Nexus Silicon Validation
TSMC 130nm LV Results
Process   V     GHz    ns    pJ/bit
Low-K     1.2   1.35   2.0   10.4
Low-K     1.0   1.11   2.4   7.0
FSG       1.2   1.10   2.5   11.2
FSG       1.0   0.87   3.1   7.6
[Figure: block diagram of Nexus Validation Chip, with blocks labeled S1 through S7, Serial IO, Proc, and ALU]
Crossbar area: 1.75mm^2
Total interconnect area: 4.15mm^2
Peak cross-section bandwidth: 778Gb/s
[Figure: plot of Nexus crossbar]
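The quoted peak cross-section bandwidth is consistent with a simple product of figures given elsewhere in the deck: 16 ports, 36 data bits per port, and 1.35 GHz. The slides do not state how the number was derived, so the one-word-per-port-per-cycle model below is an assumption, and `peak_bandwidth_gbps` is a hypothetical helper.

```python
PORTS = 16       # crossbar ports, from the slides
DATA_BITS = 36   # data bits per port, from the slides


def peak_bandwidth_gbps(frequency_ghz):
    """Assumed model: every port moves one 36-bit word per cycle."""
    return PORTS * DATA_BITS * frequency_ghz


# 16 * 36 * 1.35 = 777.6 Gb/s, which rounds to the quoted 778 Gb/s.
low_k_1v2 = peak_bandwidth_gbps(1.35)
```

The same function applied to the other measured frequencies in the results table gives the corresponding peak figures under the same assumption.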
28
Nexus Summary
Nexus is an asynchronous crossbar interconnect designed to connect up to 16 synchronous modules in a SoC
Nexus can be used to implement load/store systems as well as switch fabrics
Systems using Nexus can be tested with standard equipment
Nexus runs up to 1.35GHz in TSMC 130nm
Asynchronous interconnect is now viable for very high performance SoC designs
29
PivotPoint Blade Interconnect
World’s first high-performance clockless chip
Large-scale SoC design
>32.5M transistors (83% async)
14 separate clock domains
Includes key Fulcrum IP
Nexus Terabit Crossbar
Quad-port 600MHz async SRAM
Operates at over 1GHz
Delivers 192Gbps of nonblocking switching capacity
Testable via standard tools
JTAG; scan chain
Activity-based power scaling
[Figure: Generic System “Blade” with CPU, NPU, ASIC, and FPGA devices, SPI-4 links, I/O (Phy/MAC), X8, and a Backplane Interface]
31
9-month project
PivotPoint Leverages Nexus
Flexible architecture
6 duplex SPI-4.2 interfaces
All paths are independent
Optimized for performance
Easily configurable
Up to 14.4Gbps per interface
Up to 32Gbps per Nexus port
Full-rate buffer memories
Lossless flow control
3ns latency
A true SoC GALS design
16-bit CPU interface
JTAG support
Modest size and power
~2 Watt per active interface
1036 ball package
[Figure: PivotPoint block diagram with six SPI-4 interfaces, each with a Route Table, twelve 16KB Buffers, a CPU Interface, a JTAG Interface, Boundary Scan, and a Control Bus (Serial Tree)]
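The per-port figures line up with the 192Gbps switching capacity quoted for PivotPoint if each of the six SPI-4.2 interfaces is assumed to own one Nexus port. That mapping is an inference, not something the slides state, and the names below are hypothetical. The same figures give the internal overspeed of a Nexus port relative to its external interface.

```python
INTERFACES = 6              # duplex SPI-4.2 interfaces, from the slides
NEXUS_PORT_GBPS = 32.0      # "up to" rate per Nexus port, from the slides
SPI4_INTERFACE_GBPS = 14.4  # "up to" rate per interface, from the slides


def switching_capacity_gbps():
    """Assumed model: one Nexus port per SPI-4.2 interface."""
    return INTERFACES * NEXUS_PORT_GBPS


def overspeed():
    """Ratio of internal port rate to external interface rate."""
    return NEXUS_PORT_GBPS / SPI4_INTERFACE_GBPS
```

Under these assumptions the capacity works out to 6 × 32 = 192 Gb/s, and each port runs a little over twice as fast as the interface it serves.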
32
Testing – A Multi-Dimensional Approach
DFT
Synchronous scan chains for synchronous logic
Asynchronous scan-chain-like structures for asynchronous logic and sync-async interfaces
Standardized JTAG interface for testing
Fault-Grading
Verilog fault-model for domino logic
Industry-standard fault grading tools
BIST
Use Nexus for observability in Nexus-Based SOCs
RAM self test and repair
33
Differentiating Through Technology
Leveraging our clockless technology foundation
Differentiated Product Offering
High performance (latency, capacity)
Power efficient (linear scaling)
Robust in operation
Unique IP Blocks
Unmatched performance
Extremely robust (power and temperature)
Easy to integrate (benign behavior)
Clockless Technology Foundation
Silicon proven and customer validated
Mature CAD flow (integrated with commercial tools)
Robust cell library (thousands of unique cells)
34
Thank You!
Peter A. Beerel, PhD
VP Strategic CAD
[email protected]
818.871.8100
www.fulcrummicro.com
26775 Malibu Hills Road
Suite 200
Calabasas Hills, CA 91301
“A group of engineers wants to turn the microprocessor world on its head by doing the unthinkable: tossing out the clock and letting the signals move about unencumbered. For those designers, inspired by research conducted at Caltech, clocks are for wimps.”
Anthony Cataldo, EE Times
35