Seminar on High-Speed Asynchronous Pipelines


Clockless Logic
or
How do I make hardware fast, power-efficient, less noisy, and easy to design?
Montek Singh
Tue, Jan 14, 2003
1
Course Information (1)
Course Number: COMP290-084
Time and Place
 Tue/Thu 3:30-4:45pm, Sitterson Hall 325
Instructor
 Montek Singh
 [email protected] (not singh@cs!)
 SN 245, 962-1832
 Office hours: most afternoons/by appointment
Teaching Assistant
 None
Course Web Page
 http://www.cs.unc.edu/~montek
2
Course Information (2)
Prerequisites:
 undergraduate knowledge of: digital logic, algorithms, discrete math (sets and graphs)
 no knowledge of advanced circuit design or of VLSI is assumed
 relevant topics will be covered in class as needed
 you are assumed to know the following topics:
 digital logic: Boolean algebra, logic gates, and latches and registers
 algorithms: search techniques, enumeration, divide and conquer, and time complexity
 discrete math: elementary set theory and graph theory
3
Course Information (3)
Reading Material:
 Papers and technical reports supplied by instructor
Course Content:
 The following topics will be covered:
 Introduction to clockless logic
 Graphical representation of asynchronous systems
 Algorithms for logic synthesis
– Combinational
– Sequential
 Design techniques
– High-performance
– Low-power
 Formal methods (performance analysis and verification)
 Case studies of real-world asynchronous processors
4
Course Information (4)
Grading
 30% homework assignments
 35% class project
 your choice of topic: from pure algorithms to VLSI design
 30% exams
 5% class participation
Honor Code is in effect
 encouraged to discuss ideas/concepts
 work handed in must be your own
5
Lecture 1: Introduction
 What is asynchronous design?
 Why do we want to study it?
 How is data represented in an asynchronous system?
 How is information exchanged?
6
Introduction: Clocked Digital Design
Most current digital systems are synchronous:
 Clock: a global signal that paces operation of all components
Benefit of clocking: enables discrete-time representation
 all components operate exactly once per clock tick
 component outputs need to be ready by next clock tick
 allows “glitchy” or incorrect outputs between clock ticks
7
Microelectronics Trends
Current and Future Trends: Significant Challenges
 Large-Scale “Systems-on-a-Chip” (SoC)
 100 Million ~ 1 Billion transistors/chip
 Very High Speeds
 multiple GigaHertz clock rates
 Explosive Growth in Consumer Electronics
 demand for ever-increasing functionality …
 … with very low power consumption (limited battery life)
 Higher Portability/Modularity/Reusability
 “plug ’n play” components, robust interfaces
8
Challenges to Clocked Design
Breakdown of Single-Clock Paradigm:
 Chip will be partitioned into multiple timing domains
 challenge: gluing together multiple timing domains
– glue logic is susceptible to “metastability” (=incorrect values
transferred) and latency overheads
Increasing Difficulties with Clocked Design:
 Clock distribution: requires significant designer effort
 Performance bottleneck: a single slow component limits the clock rate of the whole chip
 Clock burns large fraction of chip power (~40-70%)
 Fixed clock rate: poor match for
 designing reusable components
 interfacing with mixed-timing environments
9
What is Asynchronous Design?
 Digital design with no centralized clock
 Synchronization using local “handshaking”
[Figure: a synchronous system (centralized control, driven by a global clock) next to an asynchronous system (distributed control, components linked by handshaking interfaces)]
10
Why Asynchronous Design? (1)
 Higher Performance
 May obtain “average-case” operation (not “worst-case”)
 not limited by slowest component
 Avoids overheads of multi-GHz clock distribution
 Lower Power
 No clock power expended
 Inactive components consume negligible power
 Better Electromagnetic Compatibility
 Smooth radiation spectra: no clock spikes
 Much less interference with sensitive receivers [e.g., Philips
pagers, smartcards]
 Greater Flexibility/Modularity
 Naturally adapt to variable-speed environments
 Supports reusable components
11
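The “average-case” claim above can be illustrated with a minimal Python sketch. The per-operation delays are hypothetical numbers chosen only for illustration, and handshaking overhead is ignored: a clocked design must set its period to the slowest operation, while an idealized asynchronous design finishes each operation as soon as it completes.

```python
# Hypothetical completion times (ns) for a stream of operations; illustrative only.
OP_DELAYS_NS = [2, 3, 2, 9, 2, 3, 2, 2]


def clocked_total_ns(delays):
    """Every operation takes one clock period, set by the slowest operation."""
    return len(delays) * max(delays)


def async_total_ns(delays):
    """Idealized asynchronous case: each operation takes only its own delay
    (handshaking overhead is ignored in this sketch)."""
    return sum(delays)


if __name__ == "__main__":
    print("clocked total (ns):", clocked_total_ns(OP_DELAYS_NS))
    print("async total (ns):  ", async_total_ns(OP_DELAYS_NS))
```

With these assumed numbers the single 9 ns operation dominates the clocked total, which is the “limited by slowest component” effect listed above.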
Why Asynchronous Design? (2)
 The world already is mostly asynchronous!
 Events at the level of (or in between) large-scale systems are
asynchronous
 several seconds to several milliseconds
 e.g., PC-printer communication, keyboard inputs, network comm.
 Events at the board level (or between chips) are often
asynchronous
 milliseconds to 100 nanoseconds
 e.g., CPU-memory interface, interface with I/O subsystem (interrupts)
 Events within a chip, at the level of functional units (e.g., adders,
control logic) are currently synchronous
 several nanoseconds to 100 picoseconds
 Events at the level of a single logic gate are asynchronous
 10 picoseconds
 Events at the quantum level are asynchronous
 picoseconds to femtoseconds
 So, why bother with clocks at all?!
 make everything asynchronous → greater elegance and robustness
12
Challenges of Asynchronous Design
 Hazards: potential “glitches” on wire
[Figure: clean vs. hazardous (glitchy) signals between clock ticks; glitches are no problem for clocked systems]
 communication must be hazard-free!
 special design challenge = “hazard-free synthesis”
 Testability Issues:
 absence of clock means no “single-stepping”
 Lack of Commercial CAD Tools:
 chicken-and-egg problem
13
Asynchronous Design: Past & Present
Async Design: In existence for 50 years, but …
… many recent technical advances:
 Hazard-Free Circuit Design:
 several practical techniques for controllers [Stanford/Columbia]
 Design for Testability:
 several test solutions, e.g. Philips Research
 Maturing Computer-Aided-Design (“CAD”) Tools:
 software tools for automated design [Philips,Columbia,Manchester]
 Successful Fabricated Chips:
 embedded processors, high-speed pipelines, consumer electronics…
14
Recent Commercial Interest
Several commercial asynchronous chips:
 Philips: asynchronous 80c51 microcontrollers
 used in commercial pagers [1998] and smartcards [2001]
 Univ. of Manchester: async ARM processor [2000]
 Motorola: async divider in PowerPC chip [2000]
 HAL: async floating-point divider
 in HAL-I and II processors [early 1990’s]
Recent experimental chips:
 IBM, Sun and Intel:
 fast pipelines, arbiters, instruction-length decoder…
 IBM/Columbia/UNC: asynchronous digital FIR filter
Several recent startups:
 Theseus Logic, Fulcrum, Self-Timed Solutions…
15
A 5-minute Homework Problem
Alice and Bob live on opposite sides of a wide river:
[Figure: Alice and Bob on opposite banks of the river]
Alice is supposed to send a message (say, a “Yes”/“No”) across
to Bob around midnight. Both have flashlights, but neither
owns a watch. What should they do?
Suggest several strategies, and discuss pros and cons of each.
16
Solution 1
Alice uses 2 lamps:
 1 to indicate that she is ready with the message, and
 1 for the message itself
Bob uses 1 lamp:
 to indicate that he has received the message
17
Solution 2
Alice uses 2 lamps:
 Green lamp to indicate “yes”
 Red lamp to indicate “no”
Bob uses 1 lamp:
 to indicate that he has received the message
18
Solution 3
What if Alice and Bob could keep time?
Alice uses 1 lamp for the message:
 At 12 midnight: turns on lamp if message = “yes”
 At 12:01: turns lamp off
Bob needs no lamps!
 Takes down the message between 12 and 12:01
Pros: fewer signals, less processing needed
Cons: Alice and Bob must keep their clocks closely synchronized
 If Bob’s watch is off by a minute, incorrect communication possible
19
Data Representation Styles: “Bundled Data”
Single-rail “Bundled Datapath”: simplest approach
 widely used
Features:
 datapath: 1 wire per bit (e.g. standard sync blocks)
 matched delay: produces delayed “done” signal
 worst-case delay: longer than slowest path
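[Figure: a function block with data inputs bit 1 … bit n and data outputs bit 1 … bit m; in parallel, the request passes through a matched delay to produce “done”, which indicates valid data]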
+ Practical style: can reuse sync components; small area
– Fixed (worst-case) completion time
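A minimal Python sketch of the bundled-data timing idea. The path names, the delays in picoseconds, and the safety margin are all hypothetical values for illustration: the matched delay is fixed to exceed the slowest path, so “done” arrives after the data is valid no matter which path a given input exercises.

```python
# Hypothetical path delays (ps) through a function block; illustrative only.
PATH_DELAYS_PS = {"fast_path": 120, "typical_path": 200, "slow_path": 350}
MARGIN_PS = 50  # assumed safety margin added on top of the slowest path


def matched_delay_ps(path_delays, margin):
    """The matched delay must be longer than the slowest path through the block."""
    return max(path_delays.values()) + margin


def done_time_ps(request_time, path_delays, margin):
    """'done' is the request signal, delayed by the matched delay."""
    return request_time + matched_delay_ps(path_delays, margin)


def data_valid_time_ps(request_time, path_taken, path_delays):
    """Time at which the data outputs settle for the path actually exercised."""
    return request_time + path_delays[path_taken]


if __name__ == "__main__":
    for path in PATH_DELAYS_PS:
        valid = data_valid_time_ps(0, path, PATH_DELAYS_PS)
        done = done_time_ps(0, PATH_DELAYS_PS, MARGIN_PS)
        print(f"{path}: data valid at {valid} ps, done at {done} ps")
```

The “done” time is the same for every path, which is the fixed (worst-case) completion time noted above.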
20
Data Representation Styles: Dual-Rail
Dual-rail: uses 2 wires per data bit
[Figure: function block with dual-rail inputs bit 1 … bit n and dual-rail outputs bit 1 … bit m]

Dual-rail code   Meaning
00               “reset” value
01               0 value
10               1 value
11               unused
Each Dual-Rail Pair: provides both data value and validity
+ provides robust data-dependent completion
– needs completion detectors
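The dual-rail code table can be sketched in Python as follows. The function names are hypothetical, and each pair is written as a tuple in the same left-to-right order as the two-digit code (so code 01 is `(0, 1)`).

```python
RESET = (0, 0)  # "reset" value: no data present on this bit


def encode_bit(value):
    """Encode one data bit as a dual-rail pair: 0 -> code 01, 1 -> code 10."""
    return (1, 0) if value else (0, 1)


def decode_pair(pair):
    """Return 0 or 1 for a valid pair, or None for the reset value."""
    if pair == (0, 1):
        return 0
    if pair == (1, 0):
        return 1
    if pair == RESET:
        return None
    raise ValueError("code 11 is unused in dual-rail encoding")


def encode_word(bits):
    """Encode a list of data bits as a list of dual-rail pairs."""
    return [encode_bit(b) for b in bits]


def is_valid_word(pairs):
    """A word is valid once every pair carries a data value (none is reset)."""
    return all(decode_pair(p) is not None for p in pairs)
```

Because each pair carries its own validity, a receiver can tell from the wires alone when a word has fully arrived.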
21
Dual-Rail (contd.)
Dual-Rail Completion Detector:
 combines dual-rail signals
 indicates when all bits are valid (or reset)
 OR together 2 rails per bit
 Merge results using a Muller “C-element”
C-element:
 if all inputs = 1, output → 1
 if all inputs = 0, output → 0
 else, maintain output value
[Figure: one OR gate per bit (bit0, bit1, …, bitn) feeding a C-element whose output is “Done”]
22
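A behavioral Python sketch of the dual-rail completion detector described above. The class and function names are hypothetical, and gate delays are not modeled: each bit's two rails are ORed, and a C-element merges the per-bit results.

```python
class CElement:
    """Muller C-element: the output changes only when all inputs agree."""

    def __init__(self, initial=0):
        self.output = initial

    def update(self, inputs):
        inputs = list(inputs)
        if all(v == 1 for v in inputs):
            self.output = 1
        elif all(v == 0 for v in inputs):
            self.output = 0
        # otherwise: maintain the previous output value
        return self.output


def completion_detector(c_element, pairs):
    """OR together the two rails of each bit, then merge with a C-element."""
    per_bit_valid = [rail_a | rail_b for rail_a, rail_b in pairs]
    return c_element.update(per_bit_valid)


if __name__ == "__main__":
    c = CElement()
    # A 2-bit word arriving one bit at a time, then resetting one bit at a time.
    for word in [[(0, 0), (0, 0)], [(0, 1), (0, 0)], [(0, 1), (1, 0)],
                 [(0, 0), (1, 0)], [(0, 0), (0, 0)]]:
        print(word, "-> done =", completion_detector(c, word))
```

The holding behavior matters: “done” rises only after the last bit becomes valid and falls only after the last bit has reset.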
Handshaking Styles: 4-phase
4-Phase: requires 4 events per handshake
[Figure: 4-phase timing diagram of Request and Acknowledge, with the four events labeled: start event, event done, get ready for next event, ready for next event]
+ “Level-sensitive” → simpler logic implementation
– Overhead of “return-to-zero” (RTZ or resetting)
 extra events which do no useful computation
23
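The four events of a 4-phase handshake can be sketched as a small Python generator. Which signal edge carries which label is an assumption based on the usual return-to-zero convention (request up, acknowledge up, request down, acknowledge down); the function name is hypothetical.

```python
def four_phase_handshake(num_transfers):
    """Yield (event, req, ack) for each of the 4 events per handshake."""
    req, ack = 0, 0
    for _ in range(num_transfers):
        req = 1
        yield ("request raised: start event", req, ack)
        ack = 1
        yield ("acknowledge raised: event done", req, ack)
        req = 0  # return-to-zero half of the cycle begins here
        yield ("request lowered: get ready for next event", req, ack)
        ack = 0
        yield ("acknowledge lowered: ready for next event", req, ack)


if __name__ == "__main__":
    for event, req, ack in four_phase_handshake(1):
        print(f"req={req} ack={ack}  {event}")
```

Both wires are back at 0 after every handshake, so the last two events per cycle are the return-to-zero overhead.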
Handshaking Styles: 2-phase
2-Phase: requires 2 events per handshake
[Figure: 2-phase timing diagram of Request and Acknowledge, with events labeled: start event, event done, start next event, next event done]
+ Elegant: no return-to-zero
– Slower logic implementation:
 logic primitives are inherently level-sensitive, not event-based
(at least in CMOS)
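For comparison, a minimal Python sketch of 2-phase signaling, where any transition (rising or falling) counts as an event. The function name is hypothetical.

```python
def two_phase_handshake(num_transfers):
    """Yield (event, req, ack) for each of the 2 events per handshake."""
    req, ack = 0, 0
    for _ in range(num_transfers):
        req ^= 1  # any transition on request starts an event
        yield ("request toggled: start event", req, ack)
        ack ^= 1  # any transition on acknowledge signals completion
        yield ("acknowledge toggled: event done", req, ack)


if __name__ == "__main__":
    for event, req, ack in two_phase_handshake(2):
        print(f"req={req} ack={ack}  {event}")
```

The wires are not returned to 0 between transfers, so a receiver has to detect changes rather than levels.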
24
Handshaking + Data Representation
Several combinations possible:
 dual-rail 4-phase, single-rail 4-phase, dual-rail 2-phase, and single-rail 2-phase
Example: dual-rail 4-phase
[Figure: sender A drives dual-rail data (bit 1 … bit m) to receiver B; B returns ack to A]
 dual-rail data: functions as an implicit “request”
 4-phase cycle: between acknowledge and implicit request
25
Other Data Representation Styles
 Level-Encoded Dual-Rail (LEDR)
 2 wires per bit: “data” and “phase”
 exactly one wire per bit changes value
 if new value is different, “data” wire changes value
 else “phase” wire changes value
 M-of-N Codes
 N wires used for a data word
 M wires (M <= N) change value
 Values of N and M: have impact on…
 information transmitted, power consumed and logic complexity
 Knuth codes, Huffman codes, …
26
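The LEDR rule stated above (exactly one of the two wires changes per transfer) can be sketched in Python. The class and function names are hypothetical, and both wires are assumed to start at 0.

```python
class LedrSender:
    """One LEDR bit: a 'data' wire and a 'phase' wire."""

    def __init__(self):
        self.data = 0
        self.phase = 0

    def send(self, value):
        """Send a new bit so that exactly one wire changes value."""
        if value != self.data:
            self.data = value   # new value differs: the data wire changes
        else:
            self.phase ^= 1     # same value again: the phase wire changes
        return (self.data, self.phase)


def ledr_new_token(previous, current):
    """A new token has arrived when exactly one of the two wires changed."""
    return sum(a != b for a, b in zip(previous, current)) == 1


if __name__ == "__main__":
    sender = LedrSender()
    wires = (0, 0)
    for bit in [1, 1, 0, 0, 1]:
        new_wires = sender.send(bit)
        print(bit, wires, "->", new_wires, ledr_new_token(wires, new_wires))
        wires = new_wires
```

The data wire always holds the current bit value, so no return-to-zero phase is needed between transfers.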
Which to use?
Depends on several performance parameters:
 speed
 single-rail vs. dual-rail
– single-rail may be faster (if designed aggressively)
– dual-rail may be faster (if completion times vary widely)
 2-phase vs. 4-phase
– 2-phase may be faster (if logic overhead is small)
– 4-phase may be faster (if overhead of return-to-zero is small)
 power consumption
 2-phase typically has fewer gate transitions (→ lower power)
 amount of logic used (#gates/wires/pins → chip area)
 single-rail needs fewer gates/wires/pins
 design and verification effort
 dual-rail, 1-of-N, M-of-N, Knuth codes…:
– delay-insensitive: robust in the presence of arbitrary delays
 single-rail: requires greater timing verification effort
27
Sutherland’s Micropipelines
Seminal Paper
28
Focus of Sutherland’s Turing Award Lecture: Pipelining
Motivation: Pipelining is at the heart of nearly all
high-performance digital systems
Additional Benefits:
 Low power
 Interfacing with mixed systems
 Modular and scalable design
29
Background: Pipelining
What is Pipelining?: Breaking up a complex operation on a
stream of data into simpler sequential operations
[Figure: a “coarse-grain” pipeline (e.g. simple processor) with stages fetch, decode, execute, and a “fine-grain” pipeline (e.g. pipelined adder); stages are separated by storage elements (latches/registers)]
Throughput = #data items processed/second
+ Throughput: significantly increased
– Latency: somewhat degraded
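A small worked example of the throughput/latency trade-off, as a Python sketch. It assumes the logic divides evenly across stages and that each stage adds a fixed storage-element overhead; the function names and all numbers are hypothetical.

```python
def unpipelined(total_logic_delay_ns):
    """One item at a time: latency is the full logic delay."""
    latency = total_logic_delay_ns
    throughput = 1.0 / latency  # items per ns
    return latency, throughput


def pipelined(total_logic_delay_ns, num_stages, latch_overhead_ns):
    """Evenly split logic, plus an assumed per-stage latch/register overhead."""
    stage_delay = total_logic_delay_ns / num_stages + latch_overhead_ns
    latency = num_stages * stage_delay   # one item passes through every stage
    throughput = 1.0 / stage_delay       # one new item per stage delay
    return latency, throughput


if __name__ == "__main__":
    print("unpipelined (latency ns, items/ns):", unpipelined(12))
    print("3 stages    (latency ns, items/ns):", pipelined(12, 3, 1))
```

With these assumed numbers, three stages raise throughput from one item per 12 ns to one per 5 ns, while latency grows from 12 ns to 15 ns because of the added storage elements.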
30
Focus of Async Community
Our Focus: Extremely fine-grain pipelines
 “gate-level” pipelining = use narrowest possible stages
 each stage consists of only a single level of logic gates
 some of the fastest existing digital pipelines to date
Application areas:
 multimedia hardware (graphics accelerators, video DSP’s, …)
 naturally pipelined systems, throughput is critical
 input is often “bursty”
 optical networking
 serializing/deserializing FIFO’s
 genomic string matching?
 KMP style string matching: variable skip lengths
31