Seminar on High-Speed Asynchronous Pipelines
Clockless Logic
or
How do I make hardware fast, power-efficient, less noisy, and easy to design?
Montek Singh
Tue, Jan 14, 2003
1
Course Information (1)
Course Number: COMP290-084
Time and Place
Tue/Thu 3:30-4:45pm, Sitterson Hall 325
Instructor
Montek Singh
[email protected] (not singh@cs!)
SN 245, 962-1832
Office hours: most afternoons/by appointment
Teaching Assistant
None
Course Web Page
http://www.cs.unc.edu/~montek
2
Course Information (2)
Prerequisites:
undergraduate knowledge of: digital logic, algorithms,
discrete math (sets and graphs)
no knowledge of advanced circuit design or of VLSI is
assumed
relevant topics will be covered in class as needed
you are assumed to know the following topics:
digital logic: Boolean algebra, logic gates, and latches and registers
algorithms: search techniques, enumeration, divide and conquer,
and time complexity
discrete math: elementary set theory and graph theory
3
Course Information (3)
Reading Material:
Papers and technical reports supplied by instructor
Course Content:
The following topics will be covered:
Introduction to clockless logic
Graphical representation of asynchronous systems
Algorithms for logic synthesis
– Combinational
– Sequential
Design techniques
– High-performance
– Low-power
Formal methods (performance analysis and verification)
Case studies of real-world asynchronous processors
4
Course Information (4)
Grading
30% homework assignments
35% class project
your choice of topic: from pure algorithms to VLSI design
30% exams
5% class participation
Honor Code is in effect
encouraged to discuss ideas/concepts
work handed in must be your own
5
Lecture 1: Introduction
What is asynchronous design?
Why do we want to study it?
How is data represented in an asynchronous system?
How is information exchanged?
6
Introduction: Clocked Digital Design
Most current digital systems are synchronous:
Clock: a global signal that paces operation of all components
Benefit of clocking: enables discrete-time representation
all components operate exactly once per clock tick
component outputs need to be ready by next clock tick
allows “glitchy” or incorrect outputs between clock ticks
7
Microelectronics Trends
Current and Future Trends: Significant Challenges
Large-Scale “Systems-on-a-Chip” (SoC)
100 Million ~ 1 Billion transistors/chip
Very High Speeds
multiple GigaHertz clock rates
Explosive Growth in Consumer Electronics
demand for ever-increasing functionality …
… with very low power consumption (limited battery life)
Higher Portability/Modularity/Reusability
“plug ’n play” components, robust interfaces
8
Challenges to Clocked Design
Breakdown of Single-Clock Paradigm:
Chip will be partitioned into multiple timing domains
challenge: gluing together multiple timing domains
– glue logic is susceptible to “metastability” (=incorrect values
transferred) and latency overheads
Increasing Difficulties with Clocked Design:
Clock distribution: requires significant designer effort
Performance bottleneck: a single slow component
Clock burns large fraction of chip power (~40-70%)
Fixed clock rate: poor match for
designing reusable components
interfacing with mixed-timing environments
9
What is Asynchronous Design?
Digital design with no centralized clock
Synchronization using local “handshaking”
[Figure: Synchronous System (centralized control, global clock) vs. Asynchronous System (distributed control, handshaking interfaces)]
10
Why Asynchronous Design? (1)
Higher Performance
May obtain “average-case” operation (not “worst-case”)
not limited by slowest component
Avoids overheads of multi-GHz clock distribution
Lower Power
No clock power expended
Inactive components consume negligible power
Better Electromagnetic Compatibility
Smooth radiation spectra: no clock spikes
Much less interference with sensitive receivers [e.g., Philips
pagers, smartcards]
Greater Flexibility/Modularity
Naturally adapt to variable-speed environments
Supports reusable components
11
Why Asynchronous Design? (2)
The world already is mostly asynchronous!
Events at the level of (or in between) large-scale systems are
asynchronous
several seconds to several milliseconds
e.g., PC-printer communication, keyboard inputs, network comm.
Events at the board level (or between chips) are often
asynchronous
milliseconds to 100 nanoseconds
e.g., CPU-memory interface, interface with I/O subsystem (interrupts)
Events within a chip, at the level of functional units (e.g., adders,
control logic) are currently synchronous
several nanoseconds to 100 picoseconds
Events at the level of a single logic gate are asynchronous
10 picoseconds
Events at the quantum level are asynchronous
picoseconds to femtoseconds
So, why bother with clocks at all?!
make everything asynchronous ⇒ greater elegance and robustness
12
Challenges of Asynchronous Design
Hazards: potential “glitches” on wire
[Figure: clean vs. hazardous signals between clock ticks; hazards are no problem for clocked systems]
communication must be hazard-free!
special design challenge = “hazard-free synthesis”
Testability Issues:
absence of clock means no “single-stepping”
Lack of Commercial CAD Tools:
chicken-and-egg problem
13
Asynchronous Design: Past & Present
Async Design: In existence for 50 years, but …
… many recent technical advances:
Hazard-Free Circuit Design:
several practical techniques for controllers [Stanford/Columbia]
Design for Testability:
several test solutions, e.g. Philips Research
Maturing Computer-Aided-Design (“CAD”) Tools:
software tools for automated design [Philips,Columbia,Manchester]
Successful Fabricated Chips:
embedded processors, high-speed pipelines, consumer electronics…
14
Recent Commercial Interest
Several commercial asynchronous chips:
Philips: asynchronous 80c51 microcontrollers
used in commercial pagers [1998] and smartcards [2001]
Univ. of Manchester: async ARM processor [2000]
Motorola: async divider in PowerPC chip [2000]
HAL: async floating-point divider
in HAL-I and II processors [early 1990’s]
Recent experimental chips:
IBM, Sun and Intel:
fast pipelines, arbiters, instruction-length decoder…
IBM/Columbia/UNC: asynchronous digital FIR filter
Several recent startups:
Theseus Logic, Fulcrum, Self-Timed Solutions…
15
A 5-minute Homework Problem
Alice and Bob live on opposite sides of a wide river:
Alice is supposed to send a message (say, a “Yes”/“No”) across
to Bob around midnight. Both have flashlights, but neither
owns a watch. What should they do?
Suggest several strategies, and discuss pros and cons of each.
16
Solution 1
Alice uses 2 lamps:
1 to indicate that she is ready with the message, and
1 for the message itself
Bob uses 1 lamp:
to indicate that he has received the message
17
Solution 2
Alice uses 2 lamps:
Green lamp to indicate “yes”
Red lamp to indicate “no”
Bob uses 1 lamp:
to indicate that he has received the message
18
Solution 3
What if Alice and Bob could keep time?
Alice uses 1 lamp for the message:
At 12 midnight: turns on lamp if message = “yes”
At 12:01: turns lamp off
Bob needs no lamps!
Takes down the message between 12 and 12:01
Pros: Fewer signals, less processing needed
Cons: Alice and Bob must keep their clocks closely synchronized
If Bob’s watch is off by a minute, incorrect communication possible
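The clock-skew risk in Solution 3 can be illustrated with a small sketch. It assumes (this is not stated on the slide) that Bob samples the lamp halfway through the one-minute window according to his own watch; the function names are hypothetical.

```python
def lamp_is_on(message_yes: bool, t_minutes: float) -> bool:
    """Alice's lamp: on during [0, 1) minutes past midnight iff the message is "yes"."""
    return message_yes and 0.0 <= t_minutes < 1.0


def bob_reads(message_yes: bool, skew_minutes: float) -> bool:
    """Bob samples mid-window by his own watch, i.e. at true time 0.5 + skew."""
    return lamp_is_on(message_yes, 0.5 + skew_minutes)


if __name__ == "__main__":
    print(bob_reads(True, 0.0))  # watches agree: Bob sees the lamp
    print(bob_reads(True, 1.0))  # Bob is a minute late: he misses a "yes"
```

With no skew Bob reads the message correctly; with a one-minute skew a "yes" is read as a "no", which is the failure mode listed under Cons.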
19
Data Representation Styles: “Bundled Data”
Single-rail “Bundled Datapath”: simplest approach
widely used
Features:
datapath: 1 wire per bit (e.g. standard sync blocks)
matched delay: produces delayed “done” signal
worst-case delay: longer than slowest path
[Figure: function block with data inputs bit 1 … bit n and outputs bit 1 … bit m; the request passes through a matched delay to produce “done”, which indicates valid data]
+ Practical style: can reuse sync components; small area
– Fixed (worst-case) completion time
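A minimal sketch of the matched-delay idea, assuming made-up path delays in picoseconds and an additive safety margin (both are illustrative values, not figures from the lecture):

```python
def matched_delay_ps(path_delays_ps, margin_ps=20):
    """Delay of the matched-delay line: slowest datapath delay plus a margin."""
    return max(path_delays_ps) + margin_ps


# Hypothetical datapath delays for three different input patterns.
paths = {"short carry": 120, "typical": 200, "longest carry": 350}

done_at = matched_delay_ps(paths.values())
for name, delay in paths.items():
    # "done" always fires at the same time, however early the data settled.
    print(f"{name}: data ready at {delay} ps, done at {done_at} ps")
```

Every input pattern completes at the same fixed time, which is why the bundled-data style cannot exploit fast cases.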
20
Data Representation Styles: Dual-Rail
Dual-rail: uses 2 wires per data bit
[Figure: function block with dual-rail inputs bit 1 … bit n and dual-rail outputs bit 1 … bit m]
Dual-rail code   Meaning
00               “reset” value
01               0 value
10               1 value
11               unused
Each Dual-Rail Pair: provides both data value and validity
+ provides robust data-dependent completion
– needs completion detectors
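The code table above can be sketched as a pair of helper functions. The two-character string representation and the function names are assumptions made for illustration.

```python
RESET = "00"  # "reset" value: no valid data on this bit


def encode_bit(value: int) -> str:
    """Map a data bit to its dual-rail code: 0 -> "01", 1 -> "10"."""
    return "10" if value else "01"


def decode_pair(code: str):
    """Return 0 or 1 for a valid code, None for the reset value."""
    if code == RESET:
        return None
    if code == "01":
        return 0
    if code == "10":
        return 1
    raise ValueError(f"unused dual-rail code: {code!r}")


def word_is_valid(codes) -> bool:
    """A word is valid only when every bit carries a 0 or a 1."""
    return all(decode_pair(c) is not None for c in codes)


if __name__ == "__main__":
    word = [encode_bit(b) for b in (1, 0, 1)]
    print(word, word_is_valid(word))
```

Because each pair carries its own validity, the receiver can tell from the data wires alone when a word has arrived.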
21
Dual-Rail (contd.)
Dual-Rail Completion Detector:
combines dual-rail signals
indicates when all bits are valid (or reset)
OR together 2 rails per bit
Merge results using a Muller “C-element”
C-element:
if all inputs=1, output 1
if all inputs=0, output 0
else, maintain output value
[Figure: the two rails of bit0, bit1, … bitn each feed an OR gate; the OR outputs feed a C-element (C) that produces Done]
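A behavioral sketch of the completion detector, assuming each bit is given as a pair of rail values; the class and function names are hypothetical, and this models logic behavior only, not circuit timing.

```python
class CElement:
    """State-holding element: output follows the inputs only when they all agree."""

    def __init__(self, initial: int = 0):
        self.out = initial

    def update(self, inputs) -> int:
        inputs = list(inputs)
        if all(inputs):
            self.out = 1
        elif not any(inputs):
            self.out = 0
        # Otherwise the inputs disagree: hold the previous output.
        return self.out


def completion_detector(c: CElement, word) -> int:
    """word is a list of (rail_a, rail_b) pairs, one pair per bit."""
    per_bit_valid = [a | b for a, b in word]  # OR together the 2 rails per bit
    return c.update(per_bit_valid)


if __name__ == "__main__":
    c = CElement()
    for word in ([(0, 0), (0, 0)], [(1, 0), (0, 0)], [(1, 0), (0, 1)],
                 [(0, 0), (0, 1)], [(0, 0), (0, 0)]):
        print(word, "->", completion_detector(c, word))
```

Done rises only once every bit is valid and falls only once every bit has been reset; in between it holds its value.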
22
Handshaking Styles: 4-phase
4-Phase: requires 4 events per handshake
[Figure: Request and Acknowledge waveforms. Request: start event, get ready for next event. Acknowledge: event done, ready for next event]
+ “Level-sensitive”: simpler logic implementation
– Overhead of “return-to-zero” (RTZ or resetting)
extra events which do no useful computation
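The four events per handshake can be listed with a short sketch. It serializes sender and receiver into one loop, which is an illustrative simplification; the function name is hypothetical.

```python
def four_phase_transfer(n_items: int):
    """Return the sequence of (wire, level) events for n_items handshakes."""
    req = ack = 0
    trace = []
    for _ in range(n_items):
        req = 1
        trace.append(("req", req))  # sender: start event
        ack = 1
        trace.append(("ack", ack))  # receiver: event done
        req = 0
        trace.append(("req", req))  # sender: return to zero
        ack = 0
        trace.append(("ack", ack))  # receiver: ready for next event
    return trace


if __name__ == "__main__":
    for event in four_phase_transfer(1):
        print(event)
```

Half of the events in each handshake only restore the wires to zero, which is the return-to-zero overhead noted above.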
23
Handshaking Styles: 2-phase
2-Phase: requires 2 events per handshake
[Figure: Request and Acknowledge waveforms. Request: start event, start next event. Acknowledge: event done, next event done]
+ Elegant: no return-to-zero
– Slower logic implementation:
logic primitives are inherently level-sensitive, not event-based
(at least in CMOS)
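A sketch of 2-phase (transition) signaling, where every toggle of a wire is an event regardless of its direction. Sender and receiver are serialized into one loop for illustration; the function name is hypothetical.

```python
def two_phase_transfer(n_items: int):
    """Return the sequence of (wire, level) events for n_items handshakes."""
    req = ack = 0
    trace = []
    for _ in range(n_items):
        req ^= 1
        trace.append(("req", req))  # sender: any transition on req starts an event
        ack ^= 1
        trace.append(("ack", ack))  # receiver: any transition on ack means done
    return trace


if __name__ == "__main__":
    for event in two_phase_transfer(2):
        print(event)
```

Two items take four events in total: the first handshake uses rising edges and the second uses falling edges, so no wire is reset in between.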
24
Handshaking + Data Representation
Several combinations possible:
dual-rail 4-phase, single-rail 4-phase, dual-rail 2-phase, and single-rail 2-phase
Example: dual-rail 4-phase
[Figure: sender A drives dual-rail data bit 1 … bit m to receiver B; B returns ack to A]
dual-rail data: functions as an implicit “request”
4-phase cycle: between acknowledge and implicit request
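One dual-rail 4-phase transfer can be sketched as follows. Each bit is modeled as a pair of rails, with (1, 0) for a 1 value, (0, 1) for a 0 value and (0, 0) for reset; the function name and the serialized sender/receiver steps are assumptions made for illustration.

```python
def dual_rail_4phase_send(bits):
    """Return a trace of (word, ack) snapshots for transferring one word."""
    reset_word = [(0, 0)] * len(bits)
    valid_word = [(1, 0) if b else (0, 1) for b in bits]

    def all_valid(word):
        return all(a != b for a, b in word)

    def all_reset(word):
        return all(a == 0 and b == 0 for a, b in word)

    ack = 0
    trace = []
    word = valid_word              # 1. sender: valid data is the implicit request
    trace.append((word, ack))
    if all_valid(word):
        ack = 1                    # 2. receiver: acknowledge
    trace.append((word, ack))
    word = reset_word              # 3. sender: return data to the reset value
    trace.append((word, ack))
    if all_reset(word):
        ack = 0                    # 4. receiver: withdraw acknowledge
    trace.append((word, ack))
    return trace


if __name__ == "__main__":
    for word, ack in dual_rail_4phase_send([1, 0]):
        print(word, "ack =", ack)
```

No separate request wire appears in the trace: the data wires going valid and then returning to reset play that role.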
25
Other Data Representation Styles
Level-Encoded Dual-Rail (LEDR)
2 wires per bit: “data” and “phase”
exactly one wire per bit changes value
[Figure: the “data” and “phase” wires for one bit]
if new value is different, “data” wire changes value
else “phase” wire changes value
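The rule above can be sketched directly, following the slide's description of the two wires; the function name and the (data, phase) tuple representation are hypothetical.

```python
def ledr_send(state, new_bit: int):
    """Advance one bit's (data, phase) wires for the next value sent."""
    data, phase = state
    if new_bit != data:
        data = new_bit  # value differs: toggle the "data" wire
    else:
        phase ^= 1      # value repeats: toggle the "phase" wire
    return (data, phase)


if __name__ == "__main__":
    state = (0, 0)
    for bit in (1, 1, 0, 0):
        state = ledr_send(state, bit)
        print(bit, "->", state)
```

Each new value flips exactly one of the two wires, so the receiver sees one transition per item and there is no return-to-zero.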
M-of-N Codes
N wires used for a data word
M wires (M <= N) change value
Values of N and M: have impact on…
information transmitted, power consumed and logic complexity
Knuth codes, Huffman codes, …
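The effect of N and M on the information carried can be estimated with a small sketch: an M-of-N code has C(N, M) distinct codewords, while M wires switch per data word. The function name is hypothetical, and the sketch requires Python 3.8 or later for `math.comb`.

```python
from math import comb


def m_of_n_capacity(m: int, n: int):
    """Return (number of codewords, whole bits representable) for an M-of-N code."""
    codewords = comb(n, m)
    whole_bits = codewords.bit_length() - 1  # floor(log2(codewords))
    return codewords, whole_bits


if __name__ == "__main__":
    for m, n in ((1, 2), (1, 4), (2, 7), (3, 6)):
        words, bits = m_of_n_capacity(m, n)
        print(f"{m}-of-{n}: {words} codewords, {bits} bits, {m} wire(s) switch")
```

Dual-rail is the 1-of-2 case: one bit per pair of wires. A 1-of-4 code carries two bits with a single wire switching, at the cost of wider decoding logic.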
26
Which to use?
Depends on several performance parameters:
speed
single-rail vs. dual-rail
– single-rail may be faster (if designed aggressively)
– dual-rail may be faster (if completion times vary widely)
2-phase vs. 4-phase
– 2-phase may be faster (if logic overhead is small)
– 4-phase may be faster (if overhead of return-to-zero is small)
power consumption
2-phase typically has fewer gate transitions (⇒ lower power)
amount of logic used (#gates/wires/pins ⇒ chip area)
single-rail needs fewer gates/wires/pins
design and verification effort
dual-rail, 1-of-N, M-of-N, Knuth codes…:
– delay-insensitive: robust in the presence of arbitrary delays
single-rail: requires greater timing verification effort
27
Sutherland’s Micropipelines
Seminal Paper
28
Focus of Sutherland’s Turing Award Lecture: Pipelining
Motivation: Pipelining is at the heart of nearly all
high-performance digital systems
Additional Benefits:
Low power
Interfacing with mixed systems
Modular and scalable design
29
Background: Pipelining
What is Pipelining?: Breaking up a complex operation on a
stream of data into simpler sequential operations
[Figure: a “coarse-grain” pipeline (e.g. simple processor) with stages fetch, decode, execute separated by storage elements (latches/registers); a “fine-grain” pipeline (e.g. pipelined adder)]
Throughput = #data items processed/second
+ Throughput: significantly increased
– Latency: somewhat degraded
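The throughput/latency trade-off can be worked through with a simple model. It assumes perfectly balanced stages and a fixed per-stage storage overhead; the delay values are made-up examples, and the function name is hypothetical.

```python
def pipeline_metrics(logic_delay_ps: float, stages: int, latch_overhead_ps: float):
    """Return (cycle time in ps, latency in ps, throughput in items per ns)."""
    cycle = logic_delay_ps / stages + latch_overhead_ps  # time between successive items
    latency = stages * cycle                             # time for one item to pass through
    throughput_per_ns = 1000.0 / cycle
    return cycle, latency, throughput_per_ns


if __name__ == "__main__":
    for k in (1, 3, 6):
        cycle, latency, tput = pipeline_metrics(1200, k, 50)
        print(f"{k} stage(s): cycle {cycle} ps, latency {latency} ps, {tput:.2f} items/ns")
```

Splitting 1200 ps of logic into 3 stages cuts the cycle time from 1250 ps to 450 ps, while latency grows from 1250 ps to 1350 ps because each extra stage adds its own storage overhead.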
30
Focus of Async Community
Our Focus: Extremely fine-grain pipelines
“gate-level” pipelining = use narrowest possible stages
each stage consists of only a single level of logic gates
some of the fastest existing digital pipelines to date
Application areas:
multimedia hardware (graphics accelerators, video DSP’s, …)
naturally pipelined systems, throughput is critical
input is often “bursty”
optical networking
serializing/deserializing FIFO’s
genomic string matching?
KMP style string matching: variable skip lengths
31