Asynchronous Circuit Design
GALS Systems
Synchronous and GALS NoCs
DAAD Workshop, Nis, Serbia, July 2009
Dr. Miloš Krstić
IHP
Im Technologiepark 25
15236 Frankfurt (Oder)
Germany
IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany
www.ihp-microelectronics.com
© 2009 - All rights reserved
Overview
• Motivation
• Problems of the synchronous design
• Asynchronous circuit design
• GALS - State of the Art
• Synchronous and GALS NoCs
Challenges with Synchronous Design
• Most digital systems today operate synchronously.
• However, the complexity of electronic systems grows enormously.

Property                                    1999    2001    2005    2011
CMOS process [µm]                           0.18    0.15    0.1     0.05
Transistors on chip [Mtrans/cm²]            7       14      41      247
On-chip clock [GHz]                         1.25    1.77    3.5     10
Off-chip clock [GHz]                        0.48    0.722   1.035   1.54
Power dissipation (handheld systems) [W]    1.4     1.7     2.4     2.2
Vdd [V]                                     1.5     1.2     0.9     0.5
Classical Synchronous Paradigm
• Usually digital circuits are designed to work synchronously
[Figure: pipeline R1, R2, CL3, R3, CL4, R4 driven by a common CLK; second variant of the same pipeline with a CLK GATING SIGNAL]
Synchronous communication
[Figure: clock and data waveforms, data sequence 1 1 0 0 1 0]
• Clock edges determine the time instants where data must be sampled
• Data wires may glitch between clock edges (setup/hold times must be satisfied)
• Data are transmitted at a fixed rate - clock frequency
Problems with Synchronous Design
• As clock speeds increase, clock distribution becomes difficult:
  - We need to minimize clock skew.
  - There is some upper limit to clock speed that depends on the material properties of the device.
  - It is not possible to propagate a signal from one side of the chip to the other side within a single clock cycle.
• Worst-case performance.
• Sensitive to variations in voltage, temperature, process.
• Not modular (fixed clock rate: poor match for reusability of components).
• Clock burns a large fraction of chip power (~40-70%).
• Synchronization failure.
What is Asynchronous Design? (I)
• Synchronization is achieved without a global clock.
• Asynchronous communication: handshake mechanisms
[Figure: Sender and Receiver connected by request, acknowledge and data signals]
What is Asynchronous Design? (II)
[Figure: pipeline R1, CL3, R2, R3, CL4, R4 in which each register has a local controller (CTL); the controllers are connected by REQ and ACK. Example: token flow through the same pipeline, where each link / channel carries REQ, ACK and DATA]
Asynchronous design styles (I)
• Bundled data (single rail), 4-phase protocol
  This style is very widely used because of very small and fast asynchronous controllers
[Figure: channel with REQ, ACK and n-bit DATA; 4-phase protocol waveforms: REQ and ACK are "always like this", DATA validity has "some variations"]
Bundled data
[Figure: validity signal and data waveforms, data sequence 1 1 0 0 1 0]
• Validity signal: similar to an aperiodic local clock
• n-bit data communication requires n+1 wires
• Data wires may glitch when not valid
Asynchronous design styles (II)
• Bundled data (single rail), 2-phase protocol
  This style looks simpler and faster than 4-phase, but the controllers are more complex
[Figure: channel with REQ, ACK and n-bit DATA; 2-phase protocol waveforms of REQ, ACK and DATA]
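The difference between the 4-phase and 2-phase bundled-data protocols can be sketched as an event trace. This is only an illustrative software model, not a circuit description: the function names and the event labels are assumptions made for this example, and delays are ignored.

```python
def four_phase(words):
    """Transfer words over a 4-phase (return-to-zero) bundled-data channel.

    Returns the words seen by the receiver and the list of REQ/ACK events.
    """
    received, events = [], []
    for word in words:
        data = word              # sender drives DATA before raising REQ
        events.append("REQ+")    # REQ high: DATA is valid
        received.append(data)    # receiver latches DATA
        events.append("ACK+")    # receiver acknowledges
        events.append("REQ-")    # sender returns REQ to zero
        events.append("ACK-")    # receiver returns ACK to zero
    return received, events


def two_phase(words):
    """Transfer words over a 2-phase (transition-signalling) channel."""
    received, events = [], []
    req = ack = 0
    for word in words:
        data = word
        req ^= 1                 # any REQ transition marks new valid DATA
        events.append(f"REQ->{req}")
        received.append(data)
        ack ^= 1                 # any ACK transition acknowledges it
        events.append(f"ACK->{ack}")
    return received, events
```

In this model a transfer costs four signal events in the 4-phase protocol and two in the 2-phase protocol, which is why the 2-phase style looks faster; the price is that a 2-phase controller has to react to both rising and falling edges.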
Asynchronous design styles (III)
• 4-phase dual-rail protocol
  - Each data bit is encoded onto 2 wires
  - Allows the generation of delay-insensitive circuits
  - Introduces a very large area overhead

VALUE        d.t   d.f
EMPTY        0     0
VALID "0"    0     1
VALID "1"    1     0
Not used     1     1

[Figure: channel with ACK and 2n DATA wires; waveform in which DATA alternates between EMPTY and VALID while ACK toggles between 0 and 1]
Dual rail
[Figure: dual-rail waveforms]
• Two wires per bit
  “00” = spacer, “01” = 0, “10” = 1
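The code table on this slide can be written down as a small encoder and decoder. This is a minimal sketch: the function names and the tuple representation of the wire pair are assumptions made for this example.

```python
SPACER = (0, 0)  # "00" on the slide


def encode(bit):
    """Return the two-wire codeword for one bit: "01" = 0, "10" = 1."""
    return (1, 0) if bit else (0, 1)


def decode(pair):
    """Return 0 or 1 for a valid codeword, None for the spacer."""
    if pair == SPACER:
        return None
    if pair == (1, 1):
        raise ValueError("codeword 11 is not used")
    return 1 if pair == (1, 0) else 0


def wire_sequence(bits):
    """Codewords on the wire pair: every valid codeword is followed by a spacer."""
    seq = []
    for bit in bits:
        seq.append(encode(bit))
        seq.append(SPACER)
    return seq
```

For example, `wire_sequence([1, 0])` yields the codewords 10, 00, 01, 00, so the receiver can tell two consecutive equal bits apart without any separate request wire.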
Asynchronous modules
[Figure: module with a DATA PATH (Data IN, Data OUT, start, done) and a CONTROL block (req in, ack in, req out, ack out)]
• Signaling protocol:
  reqin+ start+ [computation] done+ reqout+ ackout+ ackin+
  reqin- start- [reset] done- reqout- ackout- ackin-
Asynchronous components
• Asynchronous design requires additional components and special logic
• Such components are not available in a standard synchronous design kit
• The critical components are the C-element and the Mutex
Muller C-element
a   b   z
0   0   0
0   1   no change
1   0   no change
1   1   1
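The truth table of the Muller C-element translates directly into a behavioural model. This is a minimal sketch; the class name and the `update` method are assumptions made for this example, and gate delay is not modelled.

```python
class CElement:
    """Two-input Muller C-element with an explicit state bit z."""

    def __init__(self, z=0):
        self.z = z

    def update(self, a, b):
        # The output follows the inputs when they agree, otherwise it holds.
        if a == b:
            self.z = a
        return self.z
```

Feeding the input sequence (0,0), (1,0), (1,1), (0,1), (0,0) produces the outputs 0, 0, 1, 1, 0: the output rises only after both inputs have risen and falls only after both have fallen.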
Mutual Exclusion element
• ME prevents multiple event propagation
• ME is used for arbitration
[Figure: MUTEX element with request inputs R1, R2 and grant outputs G1, G2; implementation with internal nodes x1, x2]
Dual-rail logic
• Dual-rail logic requires additional logic for each logical operation
[Figure: dual-rail AND gate with inputs A.t, A.f, B.t, B.f and outputs C.t, C.f]
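One simple way to express a dual-rail AND is sketched below. This is an assumed formulation for illustration, not necessarily the gate drawn on the slide: C.t is the AND of the true rails and C.f is the OR of the false rails. Implementations based on C-elements, which wait for both inputs to become valid, differ from this form.

```python
def dual_rail_and(a, b):
    """a and b are (t, f) rail pairs; returns the (t, f) pair of C = A AND B."""
    a_t, a_f = a
    b_t, b_f = b
    c_t = a_t & b_t   # C is a valid 1 only when both inputs are a valid 1
    c_f = a_f | b_f   # C is a valid 0 as soon as one input is a valid 0
    return (c_t, c_f)
```

With both inputs at the spacer (0, 0) the output is also the spacer, and for valid inputs the result is the dual-rail code of the ordinary AND. Two gates per operation instead of one is the kind of overhead the slide refers to.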
Completion detection (dual-rail)
[Figure: completion detection tree; the per-bit signals are combined through a C-element (C) into a single done signal]
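Completion detection for a dual-rail word can be sketched as follows, assuming the common structure of one OR gate per bit followed by a C-element that merges the per-bit validity signals. The function names and the n-input C-element are assumptions made for this example.

```python
def c_element(inputs, z):
    """n-input C-element: the output follows the inputs when they all agree."""
    if all(v == 1 for v in inputs):
        return 1
    if all(v == 0 for v in inputs):
        return 0
    return z  # inputs disagree: hold the previous value


def done(word, previous_done=0):
    """word is a list of (t, f) rail pairs; returns the new value of done."""
    bit_valid = [t | f for t, f in word]  # a bit is valid when either rail is 1
    return c_element(bit_valid, previous_done)
```

In this model `done` rises only when every bit is valid and falls only when every bit has returned to the spacer, so partially arrived data never changes it.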
Completion detection (bundled-data)
• Conventional logic + matched delay
[Figure: logic block accompanied by a delay element; start enters the delay and done leaves it]
Muller pipeline
[Figure: chain of C-elements C[i-1], C[i], C[i+1], C[i+2] between LEFT and RIGHT; REQ signals connect each stage to the next and ACK signals run back to the previous stage]
• “The” delay-insensitive handshake machine
• C[i] accepts 1/0 from C[i-1] only if C[i+1]=0/1
• Think of 1010101.. as waves: 10 10 10 1..
• The C-elements propagate waves precisely
• Timing depends on local delays, may vary along the pipe
• If RIGHT is quiet, pipe will fill and stall
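The fill-and-stall behaviour can be reproduced with a small unit-delay simulation. This is a sketch under stated assumptions: every stage computes C(left neighbour, NOT right neighbour), the left environment is maximally eager (it always drives the complement of C[0]), the right environment never acknowledges, and all gates switch simultaneously in each step. The function names are made up for this example.

```python
def c_element(a, b, z):
    """Two-input C-element: follow the inputs when they agree, else hold z."""
    return a if a == b else z


def step(x, c, r):
    """One simultaneous update of the left request x and all stages c."""
    n = len(c)
    new_c = []
    for i in range(n):
        left = x if i == 0 else c[i - 1]
        right = r if i == n - 1 else c[i + 1]
        new_c.append(c_element(left, 1 - right, c[i]))
    new_x = 1 - c[0]  # eager left environment: next handshake phase
    return new_x, new_c


def run_until_stall(n_stages, max_steps=1000):
    """Simulate with a quiet RIGHT side (ack stuck at 0) until nothing changes."""
    x, c, r = 0, [0] * n_stages, 0
    for _ in range(max_steps):
        new_x, new_c = step(x, c, r)
        if (new_x, new_c) == (x, c):
            return c
        x, c = new_x, new_c
    raise RuntimeError("pipeline did not stall")
```

For four stages the simulation stops with the stage outputs at 0, 1, 0, 1: the waves have piled up against the quiet right-hand side and the left environment can no longer inject a new transition.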
Micropipelines (Sutherland 89)
[Figure: micropipeline stages built from latches (L), logic blocks, C-elements (C) and delay elements; external signals Rin, Ain, Rout, Aout]
Abstract Pipeline
[Figure: pipeline whose stages hold E, V, V, E, E]
• Bubbles
• Tokens
  Valid (0 or 1, who cares) and Empty tokens
Abstract Rings
[Figure: three-stage ring shown in successive states; stages hold V and E tokens, with the token and the bubble marked]
• 3 stages, 1 bubble:
  3 steps for token round
  6 steps to cycle
Building Blocks
[Figure: symbols of the handshake building blocks]
• Latch
• Source
• Sink
• Fork
• Join (wait for all)
• Merge (wait for one)
• MUX
• DEMUX
• Function Block (Join; CL; Fork)
Describing Asynchronous Circuits - STGs
[Figure: signal transition graph with transitions A+, B+, A-, B- and the corresponding waveforms of A and B; A is an input, B is an output]
Control specification – C element
[Figure: C-element with inputs A, B and output C; STG with transitions A+, B+, C+, A-, B-, C-]
Control specification – FIFO Controller
[Figure: FIFO controller (cntrl) with signals Ri, Ao, Ro, Ai; implementation with two C-elements; STG with transitions Ri+, Ro+, Ao+, Ai+, Ri-, Ro-, Ao-, Ai-]
A simple filter: specification
[Figure: filter block with input IN (handshake Rin, Ain) and output OUT (handshake Rout, Aout)]
y := 0;
loop
  x := READ (IN);
  WRITE (OUT, (x+y)/2);
  y := x;
end loop
Source: J. Cortadella - Introduction to asynchronous circuit design: specification and synthesis
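The filter specification can be run directly as a short program. This sketch models READ and WRITE as iteration over an input sequence and a Python generator; that mapping, and the function name, are assumptions made for this example.

```python
def filter_stream(samples):
    """Yield (x + y) / 2 for each input x, where y is the previous input."""
    y = 0
    for x in samples:       # x := READ(IN)
        yield (x + y) / 2   # WRITE(OUT, (x + y) / 2)
        y = x               # y := x
```

For the inputs 2, 4, 6 the outputs are 1.0, 3.0, 5.0: each output is the average of the current and the previous sample.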
A simple filter: block diagram
[Figure: datapath with latches x and y, adder + and a control block; signals IN, OUT, Rin, Ain, Rx, Ax, Ry, Ay, Ra, Aa, Rout, Aout]
• x and y are level-sensitive latches (transparent when R=1)
• + is a bundled-data adder (matched delay between Ra and Aa)
• Rin indicates the validity of IN
• After Ain+ the environment is allowed to change IN
• (Rout,Aout) control a level-sensitive latch at the output
A simple filter: control spec.
[Figure: block diagram of the filter together with the STG of the control block over the rising and falling transitions of Rin, Ain, Rx, Ax, Ry, Ay, Ra, Aa, Rout, Aout]
A simple filter: control impl.
[Figure: control implementation with a C-element (C) connecting Rin, Ain, Rx, Ax, Ry, Ay, Ra, Aa, Rout, Aout; the STG of the control block is repeated next to it]
Taking delays into account
[Figure: STG over x, y, z and gate-level circuit with internal signals x’ and z’]
Delay assumptions:
• Environment: 3 time units
• Gates: 1 time unit

events:  x+   x’-  y+   z+   z’-  x-   x’+  z-   z’+  y-
time:    3    4    5    6    7    9    10   12   13   14
Taking delays into account
[Figure: same STG and circuit, with one gate marked "very slow"]
Delay assumptions: unbounded delays

events:  x+   x’-  y+   z+   x-   x’+  y-
time:    3    4    5    6    9    10   11   failure !
Gate vs wire delay models
• Gate delay model: delays in gates, no delays in wires
• Wire delay model: delays in gates and wires
Delay models for async. circuits
• Bounded delays (BD): realistic for gates and wires.
  Technology mapping is easy, verification is difficult
• Speed independent (SI): Unbounded (pessimistic) delays for gates and “negligible” (optimistic) delays for wires.
  Technology mapping is more difficult, verification is easy
• Delay insensitive (DI): Unbounded (pessimistic) delays for gates and wires.
  DI class (built out of basic gates) is almost empty
• Quasi-delay insensitive (QDI): Delay insensitive except for critical wire forks (isochronic forks).
  Formally, it is the same as speed independent
  In practice, different synthesis strategies are used
[Figure: diagram of the circuit classes, labelled BD, SI ≡ QDI, DI]
Desynchronization - concept
• Start with synchronous design
• Replace clock with local handshake
• Use standard CAD tools
• Does not change datapath
• Guaranteed correctness
* Eyal Friedman, Desynchronization - From Synchronous to Asynchronous design, Seminar in VLSI Architecture, Technion, Israel, Spring 2008
Desynchronization - flow steps
• Main assumptions:
  - Normal combinatorial logic, DFF
  - single clock
  - single clock edge
[Figure: pipeline D-FF, CL, D-FF, CL, D-FF driven by a common CLK]
Desynchronization flow step #1
• Replace DFF by M+S latches
[Figure: each D-FF of the pipeline is split into a master latch (M) and a slave latch (S), still driven from CLK]
Desynchronization flow step #2
• Add matched delays
• Respect bundling assumption
  - Delay > Tpd of CL
  - Delay serves as completion signal
[Figure: M/S latch pipeline with a matched delay placed alongside each CL block; latches still driven from CLK]
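The bundling assumption (Delay > Tpd of CL) can be checked numerically. This is a minimal sketch: the path delays, the 20 % safety margin and both function names are assumptions made for this example; the slide itself only requires the matched delay to exceed the propagation delay of the combinational logic.

```python
def matched_delay(path_delays, margin=0.2):
    """Pick a delay above the slowest combinational path, with a safety margin."""
    t_pd = max(path_delays)       # worst-case propagation delay of the CL block
    return t_pd * (1.0 + margin)


def bundling_ok(delay, path_delays):
    """True when the delay element is slower than every combinational path."""
    return delay > max(path_delays)
```

With path delays of 0.8, 1.2 and 1.0 ns, any matched delay above 1.2 ns satisfies the check, whereas a 1.0 ns delay would signal completion before the slowest path has settled.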
Desynchronization flow step #3
• Replace clock by local handshake controllers
[Figure: each M and S latch gets its own handshake controller (ctrl) in place of CLK; the controllers are connected through the matched delays that accompany each CL block]
Why Asynchronous Design?
• We are used to sync design
  - Logic and timing assumptions are simpler, but not true in reality
  - Currently it is very hard to solve the big problems of synchronous design, like clock skew, high power consumption, process variability ...
• Common arguments for asynchronous design:
  - Low power?
  - High speed?
  - Low emission?
  - Low sensitivity to PVT (Process, Voltage, Temperature) variations?
  - High modularity (SoC)?
  - No clock distribution and timing problems (works)?
  - Secure chips?
Why not Asynchronous Design?
• Overhead (area, speed, power)
• Hard to design
  - Non-decomposable to small combinatorial logic blocks
  - Converting a synchronous design to an asynchronous one typically fails
• Few CAD tools
  - There is no real complete design flow available
  - There is only one commercial async EDA vendor (Handshake Solutions), with a very specific design flow (HASTE)
• Hard to test
  - Asynchronous test methods are not present yet (or not mature enough), and it is difficult to go into any production without proper testing
Available tools
• There are several tools available for the automation of asynchronous design
• Most tools are developed at universities
• Two groups of tools: for synthesis of asynchronous controllers and for design of the systems
  - Group I: Minimalist, Petrify, 3D
  - Group II: BALSA, TAST, HASTE
Minimalist
• Developed at Columbia University
• “Burst-mode” synthesis package
• Based on synthesis of asynchronous FSMs
• Integrates synthesis, testability and verification tools
• Good side
  - Produces hazard-free control circuits
  - Contains several different algorithms for synthesis
  - Can provide generalized C-element based mapping and also behavioral Verilog
• Bad side
  - Doesn’t support arbitration and EBM (extended burst-mode)
  - No optimal algorithm selection
Petrify
• Designed by J. Cortadella, M. Kishinevsky, A. Kondratyev, L. Lavagno, A. Yakovlev
• Synthesis of asynchronous controllers defined as Petri Nets or Signal Transition Graphs (STG)
• Good side
  - Produces optimal hazard-free control circuits
  - Can provide generalized C-element based mapping, complex-gate mapping and mapping to technology libraries
• Bad side
  - Supports only asynchronous design, not mixed sync-async
  - With an increased number of signals, synthesis time grows exponentially
  - Suitable for relatively small controllers
3D
• Produced by Kenneth Yun
• “Extended Burst-Mode” synthesis package
• Good side
  - Produces hazard-free control circuits
  - Supports restricted multiple-input change (input burst) with don't-care inputs
  - Supports input choices based on sampling possibly glitchy signals
  - Suitable for mixed sync-async systems (like GALS)
• Bad side
  - No technology mapping
  - No optimal algorithm selection
  - No support and no further development
TAST
• Produced by TIMA Laboratory, France
• TAST is a compiler/synthesizer of asynchronous digital circuits from a high-level communication description language
  - Input is the CHP language, which can describe Petri Nets
  - It uses VHDL as the format for behavioral and post-synthesis simulation
  - Produces QDI (dual-rail, 1-M code rail) circuits
• Good side
  - Produces a complete asynchronous system and provides a full design flow
• Bad side
  - Uses the QDI style, which gives a very large area overhead
  - Produces output circuits that are not optimized
  - Not available at the moment
TAST Design flow
BALSA
• Produced by the University of Manchester
• BALSA is a compiler/synthesizer of asynchronous digital circuits from a high-level communication description language
  - Input is the BALSA language, developed specially for this package
  - Produces bundled-data, dual-rail and 1-M code rail circuits
• Good side
  - Produces a complete asynchronous system and provides a full design flow
• Bad side
  - Gives a large overhead compared with manual design (up to 300 %)
  - Not all tools are freely available
BALSA Design Flow
Asynchronous Success Stories - Philips
• Philips developed its own full design flow based on the TANGRAM language
• The design flow also contains design for testability
• Asynchronous demonstrators
  - DCC error corrector (1993-1994): low power
  - 80C51 (1995): low power, low EMI
  - Smartcards (1998): low power, security

DCC error corrector    date     area [mm²]   power [mW]
synchronous            93       3.4          2.60
async (dual-rail)      93/05    7.0          0.41
synchronous            94       3.3          0.60
async (single rail)    94/09    3.9          0.08
Asynchronous Success Stories - Philips 80c51 (I)
• Application - pager baseband controller
  - First asynchronous µC ever on the market
• Motivations for an asynchronous solution of the 80c51
  - Low power
  - Low EMI for easy integration
Asynchronous Success Stories - Philips 80c51 (II)
• Low power issue
  - Circuit is only active when and where needed
Asynchronous Success Stories - Philips 80c51 (III)
• Low current peaks
Asynchronous Success Stories - Philips 80c51 (IV)
• Low EMI
Asynchronous Success Stories - RAPPID
• RAPPID - Revolving Asynchronous Pentium Processor Instruction-length Decoder
• The instruction-length decoder was a performance bottleneck in ca. 1995-vintage CISC processors
• Potential for optimization for common cases (RISC-like)
• Results
  - Developed a novel aggressive asynchronous method
  - About 3x throughput (T=3x)
  - About one half latency (L=2x)
  - About one half power (P=2x)
  - About the same area (A=0.8x)
  - Namely, this is T x L x P x A ≈ 10 improvement
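The combined figure follows from multiplying the four improvement factors quoted on the slide, reading each factor as "how many times better than the synchronous design":

```latex
\[
T \times L \times P \times A \;=\; 3 \times 2 \times 2 \times 0.8 \;=\; 9.6 \;\approx\; 10
\]
```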
Asynchronous Success Stories - Amulet
• The Amulet group was formed at the University of Manchester
• Amulet1 (1994)
  - 60000 transistors in 1.0 µm, ARM6 instruction set
  - Half the instruction throughput with the same energy efficiency as ARM6
• Amulet2e (1996)
  - 450000 transistors in 0.5 µm, ARM7 compatible
  - Still half the performance of a synchronous chip
• Amulet3i (2000)
  - 800000 transistors in 0.35 µm, ARM9 compatible
  - Same performance as the synchronous solution with an equal or marginally better energy efficiency
Globally Asynchronous Locally Synchronous (GALS) Systems
GALS Technique
• GALS is the abbreviation for Globally-Asynchronous Locally-Synchronous systems.
• GALS techniques have the potential to solve some of the most challenging design issues of SoC integration of communication systems.
GALS method
• GALS can be used on its own or within the NoC concept
[Figure: synchronous blocks 1, 2 and 3, each inside an asynchronous wrapper; the wrappers exchange Req, Ack and Data; a second variant connects them through network nodes]
GALS as a Powerful Design Technique
• In wireless communication systems, GALS can address the main design challenges.
• GALS makes data transfer between the blocks very easy.
• Design problems such as timing closure or clock-tree generation are limited to the level of much smaller local blocks.
• Decoupling the local blocks from a central clock source reduces spectral noise considerably.
• Power saving is automatically integrated in the asynchronous wrapper.
Power reduction with GALS
• The clock signal is the dominant source of power consumption.
• First estimations showed that about 30% of power savings could be expected in the clock net due to the application of GALS.
• Recently, some more pessimistic power estimation figures were presented.
• GALS techniques offer independent setting of frequency and voltage levels for each locally synchronous module.
• When using dynamic voltage scaling (DVS), an average energy reduction of up to 30% can be reached.
[Figure: power distribution in a high-performance CPU; segments DATAPATH; MEMORY; CONTROL, I/O; CLOCK]
Potential for reducing EMI with GALS
• We have simulated the noise generated on the power supply line in the synchronous and in the request-driven GALS system.
• GALS introduces a reduction of about 20 dB.
[Figure: two noise spectra, amplitude in dB versus frequency in GHz (0.5 to 4.5)]
GALS Opportunities – 3D Integration
• 3D integration can be very interesting as an application field
[Figure: 3D-integrated system with Sensor, A/D, Memory, DSP and Comm blocks]
GALS Opportunities - NoCs
• Another interesting application can be Networks-on-Chip (NoCs) and MPSoCs (Multi-Processor Systems-on-Chip)
GALS Opportunities – Process Scaling and Variability
• Asynchronous design gives average-case performance, in comparison to the worst-case performance of a synchronous system
• Variability of Vth makes individual transistors faster or slower, more or less energy consuming.
[Figure: Vth distribution around VtNom for a 65nm min-size transistor; %Vth variability = +/-30% (+/-3σ)]
GALS Methods
• GALS based on synchronizers
• GALS based on asynchronous FIFOs
• GALS based on pausible clocking
GALS with the Synchronizers
[Figure: Handshake Converter between a clocked domain and a clockless domain; labels: 2-phase handshake, 4-phase handshake, clock, req, ack, data]
GALS with FIFOs
[Figure: FIFO between Locally Synchronous Module 1 and Locally Synchronous Module 2, which run on Clock 1 and Clock 2; FIFO signals: Data, Wr_en, Wr_clk, full on the write side and Data, Rd_en, Rd_valid, Rd_clk, empty on the read side]
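The role of the full, empty and Rd_valid flags can be illustrated with a purely behavioural FIFO model. This is a sketch under strong simplifications: it ignores the two clock domains and the synchronization of the flags between them, which is the hard part of a real asynchronous FIFO. The class name, the `depth` parameter and the method signatures are assumptions made for this example.

```python
from collections import deque


class BoundedFifo:
    """Behavioural FIFO with the flags named on the slide."""

    def __init__(self, depth):
        self.depth = depth
        self._items = deque()

    @property
    def full(self):
        return len(self._items) == self.depth

    @property
    def empty(self):
        return not self._items

    def write(self, data, wr_en=True):
        """Store data when Wr_en is set and the FIFO is not full."""
        if wr_en and not self.full:
            self._items.append(data)
            return True
        return False

    def read(self, rd_en=True):
        """Return (Rd_valid, data); Rd_valid is False when nothing was read."""
        if rd_en and not self.empty:
            return True, self._items.popleft()
        return False, None
```

The writer only has to look at `full` and the reader only at `empty` (or `Rd_valid`), so neither module needs to know the clock of the other.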
Asynchronous wrappers
• GALS usually contains synchronous islands communicating with each other through asynchronous wrappers
• An asynchronous wrapper surrounds each locally-synchronous island
• The wrapper consists of a pausable clock and input & output ports
Classical Pausible Clocking GALS approach
• Published in Jens Muttersbach et al., Globally-Asynchronous Locally-Synchronous Architectures to Simplify the Design of On-Chip Systems, In Proc. of ASIC/SOC Conference, pp. 317-321, Sept. 1999.
[Figure: Locally Synchronous Modules 1 and 2, each inside an Asynchronous Wrapper (1, 2) with a Local Clock Generator (1, 2); an Output port and an Input port exchange Data and handshake signals; stretch1 and stretch2 connect the ports to the clock generators]
Pausable Clock Generator
[Figure: pausable clock generator with an ARBITER (REQI1/2, ACKI1/2, STOPI, clk_grant), a C-element and a DELAY LINE; clock signals RCLK, RCLKD, LCLK; the delay line is a chain of DELAY SLICEs with signals fin, fout, bin, bout and controls cc1, cc2, ..., ccn]
Main challenges of the typical GALS methods
• In many solutions, the problems of data transfer and throughput are critical. Most of them can perform a data transfer only every second cycle of the local clock.
• Some described circuits can theoretically transfer data every clock cycle. However, the intensive stretching of the pausable clock generator will significantly diminish the practical performance.
• The latency of the transferred data is not known in advance and may vary significantly from one data transfer to the next.
• It is not very practical to use ring oscillators for local clock generation.
• All solutions are oriented towards a very general application. They are not optimised for specific systems and environmental demands.
Basic concept of the request-driven operation
• This approach covers point-to-point communication with very intensive but bursty data transfer.
• When receiving an input burst, a GALS block can operate in a request-driven mode.
• When there is no input activity, the data stored inside the locally synchronous pipeline has to be flushed out. Then a local clock generator drives the GALS block.
• A time-out function controls the transition from request-driven operation to local clock generation mode.
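The switch between the two clock sources can be sketched in discrete time. Everything in this model is an assumption made for illustration: time advances in unit steps, a request produces one request-driven clock tick, the time-out fires after `timeout` idle steps, and the local clock then produces `pipeline_depth` ticks to flush the pipeline. The real wrapper is an asynchronous circuit and does not work in unit steps.

```python
def clock_ticks(request_times, timeout, pipeline_depth, t_end):
    """Return the list of (time, source) clock ticks seen by the synchronous block."""
    requests = set(request_times)
    ticks = []
    idle = 0      # steps since the last request
    pending = 0   # local ticks still needed to flush the pipeline
    for t in range(t_end):
        if t in requests:
            ticks.append((t, "request"))   # request-driven mode
            idle = 0
            pending = pipeline_depth
        else:
            idle += 1
            if idle >= timeout and pending > 0:
                ticks.append((t, "local"))  # local clock generation mode
                pending -= 1
    return ticks
```

With requests at times 0, 1, 2, a time-out of 3 and a pipeline depth of 2, the block is clocked by the requests during the burst, stays idle for two steps, and then receives two local clock ticks at times 5 and 6. After that no clock is generated at all, which is where the power saving comes from.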
Request-driven asynchronous wrapper
[Figure: asynchronous wrapper around a Locally Synchronous Module; Input port and Output port, each with Data and Handshake signals; the module is clocked either by the request driven clock or by the local clock from the Local clock generation block; a Time-out detection block is part of the wrapper]
• Local clock can be generated either internally or externally.
What can we gain from this GALS technique?
• Reliable and fast transfer of large bursts of data is achieved. Data transfer is possible at every clock cycle of the synchronous block.
• In request-driven mode there is no arbitration in the input port. The circuit responds immediately to input requests.
• The clock speed is determined by the master and not by the slower participant in the communication.
• The local clock can be generated internally or externally.
• The proposed architecture offers an efficient power-saving mechanism, similar to clock gating.
• EMI should be reduced due to varying delays and frequencies in different asynchronous wrappers.
Building the wrapper components - input port
[Figure: input port built around an INPUT CONTROLLER; signals REQ_A1, ACK_A, REQ_INT, ACK_INT, ACKEN, ACKC, ST, STOP, RST, REQI1, ACKI1]
• The input port has to control the dataflow according to a ‘broad’ 4-phase handshake protocol.
• The input port consists of a speed-independent (SI) input controller along with a few additional gates that have to provide glitch-free transitions of the input signals.
Input controller specification
• Input controller is modeled as an AFSM (asynchronous finite state machine).
• The controller is specified according to burst-mode requirements.
• Burst-mode AFSM is implemented as ‘Huffman Machine’ without explicit latches.
[Figure: State graph of the input controller; states 0 to 9 grouped into Idle mode, Request-driven mode, Local clock generation mode and Transitional mode. Transition labels (input burst / output burst) shown in the graph:
  STOP-, ST- / RST-
  REQ_A1+ / REQ_INT+, RST+, ACK_A+
  STOP+ / RST+
  ACKC+, REQ_A1- / REQ_INT-, RST-, ACK_A-
  ACKC-, REQ_A1+ / REQ_INT+, RST+, ACK_A+
  REQ_A1-, ST- / ACK_A-, RST-, ACKEN-
  ACKC- / ACK_A+, RST+
  ACKC-, ST+ /
  ST+ /
  REQ_A1+ / REQI1+
  ACKI1-, ACKC+ /
  ACKI1+ / ACKEN+, REQI1-]
[Figure: Huffman machine; a Hazard-Free Combinational Network with inputs A, B, C, outputs X, Y, Z and a fed-back State (several bits)]
Input controller implementation
• The burst-mode input controller is synthesized using the 3D tool, which supports 2-level hazard-free logic minimization and achieves optimal state assignment:
  REQ_INT = REQ_A1 REQ_INT + ACKC' REQ_INT + REQ_A1 ACKC' ST' ACKEN'
  ACK_A = ACKC' REQ_INT + REQ_A1 RST + ACKC' ST ACKI1' ACKEN Z0' + REQ_A1 ACKC' ST' ACKEN'
  ACKEN = ACKI1 + REQ_A1 ACKEN + ST ACKEN
  RST = STOP + ACKC' REQ_INT + REQ_A1 RST + ST RST + ACKC' ST ACKI1' ACKEN Z0' + REQ_A1 ACKC' ST' ACKEN'
  REQ_I1 = REQ_A1 ST ACKI1' ACKEN'
  Z0 = ACKI1 + REQ_A1' ACKC + REQ_A1' ST' Z0 + ACKC' ACKEN Z0 + ACKC ACKEN' Z0
• Logic equations are automatically converted into synthesizable structural VHDL code with our 3DC tool.
• Formal analysis of the asynchronous wrapper is performed.
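The equations can be exercised with a small evaluator that treats them as a feedback network and iterates until the outputs stop changing. This is a sketch under stated assumptions: all signals start at 0, every equation is re-evaluated simultaneously in each step (a unit-delay model), and the helper names `next_state` and `settle` are made up for this example. It is not a substitute for the formal analysis mentioned on the slide.

```python
def next_state(inp, s):
    """Evaluate the six equations once; inp and s are dicts of 0/1 values."""
    REQ_A1, ACKC, ST = inp["REQ_A1"], inp["ACKC"], inp["ST"]
    ACKI1, STOP = inp["ACKI1"], inp["STOP"]
    REQ_INT, RST, ACKEN, Z0 = s["REQ_INT"], s["RST"], s["ACKEN"], s["Z0"]
    n = lambda v: 1 - v  # complement, written X' on the slide
    shared = REQ_A1 & n(ACKC) & n(ST) & n(ACKEN)       # REQ_A1 ACKC' ST' ACKEN'
    burst = n(ACKC) & ST & n(ACKI1) & ACKEN & n(Z0)    # ACKC' ST ACKI1' ACKEN Z0'
    return {
        "REQ_INT": (REQ_A1 & REQ_INT) | (n(ACKC) & REQ_INT) | shared,
        "ACK_A": (n(ACKC) & REQ_INT) | (REQ_A1 & RST) | burst | shared,
        "ACKEN": ACKI1 | (REQ_A1 & ACKEN) | (ST & ACKEN),
        "RST": STOP | (n(ACKC) & REQ_INT) | (REQ_A1 & RST) | (ST & RST)
               | burst | shared,
        "REQ_I1": REQ_A1 & ST & n(ACKI1) & n(ACKEN),
        "Z0": ACKI1 | (n(REQ_A1) & ACKC) | (n(REQ_A1) & n(ST) & Z0)
              | (n(ACKC) & ACKEN & Z0) | (ACKC & n(ACKEN) & Z0),
    }


def settle(inp, s, max_iter=20):
    """Re-evaluate the equations until the signal values no longer change."""
    for _ in range(max_iter):
        new = next_state(inp, s)
        if new == s:
            return s
        s = new
    raise RuntimeError("state did not settle")
```

Starting from the all-zero state, raising REQ_A1 makes REQ_INT, RST and ACK_A go high, which matches the transition label REQ_A1+ / REQ_INT+, RST+, ACK_A+. Applying ACKC+ together with REQ_A1- then returns those three outputs to 0 while the state bit Z0 is set.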
VHDL description of a port
UN1: inv1x port map (ackc,t3);
UN2: inv1x port map (st,t4);
UN3: inv1x port map (clk1,t5);
UN4: inv1x port map (req,t6);
UN5: inv1x port map (ackeni,t7);
UN6: inv1x port map (endi,t8);
UN7: inv1x port map (z0,t9);
UN8: inv1x port map (z1,t10);
UN8i: inv1x port map (dvsi,t11);
U6: and2ix port map (reqci,ackc,t1);
U7: and2x port map (req,reqci,t28);
U8: and4x port map (req,t3,t4,t9,t12);
U9: or3x port map (t1,t28,t12,reqcix);
U7i: and2x port map (req,reseti,t2);
U7ii: and2x port map (st,acki,t31);
U13: and3x port map (req,t3,z0,t13);
U14: or5x port map (t1,t13,t12,t2,t31,ackix);
U10: and2x port map (ackc,ackeni,t14);
U12: and2x port map (t9,ackeni,t15);
U15: or3x port map (t15,t14,clk1,ackenix);
U11: and3x port map (st,t3,z0,t16);
U19: or5x port map (endi,t1,t2,t12,t16,resetix);
U17: and2x port map (t7,t9,t17);
U18: and3x port map (req,st,t5,t18);
U20: and2x port map (t18,t17,reqiix);
U25: and2x port map (req,z0,t22);
U26: and2x port map (st,z0,t23);
U23: and3x port map (ackc,t5,ackeni,t21);
U27: or4x port map (t21,t22,t23,endi,z0x);
U28: and2x port map (t6,ackc,t24);
U29: and2x port map (ackc,z1,t25);
U30: and3x port map (t6,t4,z1,t26);
U32: or3x port map (t25,t26,t24,z1x);
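library ieee;
use ieee.std_logic_1164.all;
-- Two-input AND gate used as a building block of the structural netlist.
entity and2x is
port (a,b: in std_logic; c: out std_logic);
end and2x;
architecture struc of and2x is
-- Tool-specific synthesis attribute that keeps the hazard-free network from being
-- re-optimized; its declaration is assumed to come from the synthesis tool's attribute package.
attribute DONT_TOUCH_NETWORK of a,b,c: signal is true;
begin
-- The 100 ps delay is only used in simulation; synthesis ignores the "after" clause.
c<=(a and b) after 100 ps;
end struc;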
Externally-driven GALS Wrapper
[Figure: Externally-driven GALS wrapper. The asynchronous wrapper surrounds a locally synchronous module with an input port (Data_in, handshake signals) and an output port (Data_out, handshake signals). It also contains time-out detection and a clock management unit (CMU), with an externally generated clock (external clock) and a request-driven clock. Legend: reused blocks, adapted block.]
Clock Management Unit
[Figure: Clock Management Unit schematic. Signals: REQI1, ACKI1, STOPI, Stretch, ste, sti, clk_grant, cg, external_clock (ECLK). Elements: INV1, OR2, AND2, three MUTEX elements (M1, M2, M3) and C-elements (C1, C2, C3).]
Baseband processor for WLAN
•
The goal of one of our projects was to develop a wireless
broadband communication system in the 5 GHz band.
•
The modem is compliant with the IEEE 802.11a WLAN standard.
•
The system uses Orthogonal Frequency Division Multiplexing
(OFDM) with data rates ranging from 6 to 54 Mbit/s.
•
The synchronous baseband processor was implemented as
an ASIC (700k gates).
Structure of the synchronous baseband processor
•
The baseband processor includes the receiver and transmitter datapath structures.
•
Very complex blocks are implemented, such as the Viterbi decoder, FFT, IFFT, CORDIC processors, ...
[Figure: Block diagram of the synchronous baseband processor; each block is marked as an 80 Msps or a 20 Msps block.
Transmitter: Input buffer, Scrambler, Signal field generator, Encoder, Interleaver, Mapper, Pilot scrambler, Pilot insertion, IFFT, Guard interval insertion, Preamble insertion.
Receiver: Synchronizer datapath, Synchronizer tracking, FFT, Channel estimator, Buffer 20 - 80, Buffer 80 - 20, Demapper, Deinterleaver, Viterbi decoder, Descrambler, Parallel converter, Encoder, Interleaver, Mapper.]
Design challenges in the baseband processor
•
Design of the baseband processor involves challenges such as:
several clock domains,
global clock tree generation,
large number of clock leaves (36 k flip-flops),
clock skew handling,
timing closure between the different modules,
clock gating,
power consumption,
EMI.
•
A request-driven GALS architecture was developed as a
possible solution for those problems.
GALS partitioning
•
The partitioning process has to take into account possible power saving.
[Figure: GALS partitioning of the baseband processor; blocks are marked as 80 Msps or 20 Msps.
GALS blocks: Tx_1, Tx_2, Tx_3, Rx_1, Rx_2, Rx_3, Rx_TRA. Async-sync interfaces: Tx_int, Rx_int.
Transmitter blocks: Input buffer, Scrambler, Signal field generator, Encoder, Interleaver, Mapper, Pilot scrambler, Pilot insertion, IFFT, Guard interval insertion, Preamble insertion.
Receiver blocks: Synchronizer datapath, Synchronizer tracking, FFT, Channel estimator, Buffer 20 - 80, Buffer 80 - 20, Demapper, Deinterleaver, Viterbi decoder, Descrambler, Parallel converter, Encoder, Interleaver, Mapper.
Other labels: Token rate adaptation, FIFO_TA, Activation interface, Rate adaption block, Interface block.]
Test strategy
•
We are using a hardware tester which is strictly cycle-based and
cannot react to asynchronous output signals of the circuit.
•
The GALS arbitration processes preclude cycle-level
determinism.
•
We want to be able to run very complex functional tests
internally.
•
The applied test technique should support system diagnosis.
•
A test strategy based on Built-In Self-Test (BIST) is proposed.
•
BIST reduces the effort for generating a test program and enables
us to use a synchronous tester.
Design for Testability in GALS
•
TPG and TDE are based on the linear feedback shift register
structure with embedded additional logic.
•
A central BIST controller controls the test procedure.
•
We can run hierarchical tests.
•
This BIST technique can be used as a method for prototype
verification.
•
In combination with the scan approach, BIST can even be used
as a basis for the manufacturing test.
[Figure: BIST structure of the GALS baseband processor. Pattern generators TPG0 to TPG4 and evaluators TDE0 to TDE10 are attached to the blocks Tx_1, Tx_2, Tx_3, Tx_int, Rx_1, Rx_2, Rx_3, Rx_TRA, Rx_int, FIFO_TA and the activation interface; a BIST internal loop is also shown.]
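To make the LFSR-based BIST idea concrete, here is a minimal sketch of a pattern generator and a signature register built on the same shift-register structure. The 4-bit width, the tap positions and the function names are illustrative assumptions; the slides do not give the polynomials or the "embedded additional logic" of the real TPG and TDE blocks.

```python
def lfsr_patterns(seed, taps, width, count):
    """Fibonacci LFSR used as a pattern generator (TPG role)."""
    mask = (1 << width) - 1
    state = seed & mask
    for _ in range(count):
        yield state
        feedback = 0
        for t in taps:                      # XOR of the tapped bits
            feedback ^= (state >> t) & 1
        state = ((state << 1) | feedback) & mask


def misr_signature(responses, taps, width, seed=0):
    """Multiple-input signature register used as an evaluator (TDE role)."""
    mask = (1 << width) - 1
    state = seed & mask
    for word in responses:
        feedback = 0
        for t in taps:
            feedback ^= (state >> t) & 1
        # Shift as in the LFSR, then fold the response word into the state.
        state = (((state << 1) | feedback) ^ word) & mask
    return state


if __name__ == "__main__":
    patterns = list(lfsr_patterns(seed=1, taps=(3, 2), width=4, count=15))
    golden = misr_signature(patterns, taps=(3, 2), width=4)
    print(patterns, golden)
```

A central controller would compare the final signature with a precomputed golden value, so a strictly cycle-based tester only has to read one word at the end of the test.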
Design flow
•
We have used the IHP 0.25 µm CMOS process.
•
The asynchronous wrapper is equivalent to about 1.3 k inverter
gates. The tunable clock generation alone accounts for 0.9 k gates.
•
The asynchronous wrapper has a throughput of up to 150 Msps in
request-driven mode and 100 Msps in local mode.
This application needs 80 Msps.
[Figure: Design flow, listed as step (tool).
Inputs: functional specification; synchronous blocks as VHDL description; asynchronous wrappers as AFSM specification.
Asynchronous wrappers: logic synthesis (3D), translation from 3D to structural VHDL (3DC tool), formal analysis (LoLA).
Common flow: abstract behavioural simulation (ModelSim), gate mapping (Synopsys DC), realistic behavioural simulation (ModelSim), timing-driven synthesis (Synopsys DC), post-synthesis simulation (ModelSim), power estimation (Prime Power), layout (Cadence Silicon Encounter), back annotation (ModelSim), power estimation (Prime Power), tape-out.]
Area and power distribution
•
Area and power statistics are based on the synthesized netlist data.
Locally synchronous blocks occupy around 90% of the total area,
the BIST circuitry requires around 3.5%,
interface blocks 2.9%, and
asynchronous wrappers 2%.
•
Power was estimated with the Prime Power tool, based on the switching
activities in a realistic transceiver scenario.
Synchronous datapath logic uses most of the power (around 52.4%),
local synchronous clock trees use 34.5%,
async-to-sync interfaces 7%, and
asynchronous wrappers 2.9%.
•
After layout, the estimated power consumption is 324.6 mW.
Implementation results
•
Our GALS baseband processor
is fabricated and tested.
•
The total number of pins is 120 and the
silicon area including pads is 45.1 mm².
•
The measured dynamic power dissipation of
the purely synchronous baseband processor
was 332 mW, and that of the GALS baseband
processor was slightly lower, at 328 mW.
[Figure labels: Receiver, Transmitter]
Improving System Integration with GALS
•
Synchronous baseband processor challenges and how GALS addresses them:
- several clock domains: solved by the GALS architecture
- global clock tree generation: no global clock in GALS
- large number of clock leaves: clock leaves are distributed over the GALS blocks
- clock skew handling: clock skew is reduced from 660 ps to 486 ps
- timing closure between blocks: the blocks communicate through handshaking
- clock gating: clock gating is embedded in the asynchronous wrapper
EMI measurement (I)
•
The supply voltage variation spectrum of the inner processor
core is measured.
[Figure: Supply voltage variation spectrum of the synchronous and the GALS baseband processor; x-axis 0 to 500 MHz, y-axis 0 to -70 dB; the marked difference between the two curves is about 5 dB.]
EMI measurement (II)
•
Additionally, instantaneous cycle-to-cycle supply voltage peaks are reduced
from 140 mV (synchronous design) to less than 100 mV (GALS).
•
This reduction can be very important for mixed-signal designs
and for secure systems.
•
An application with fine-grained GALS partitioning can lead to
results closer to the theoretical maximum reduction.
Conclusions
•
There are several asynchronous designs currently on the
market
Asynchronous design is used most successfully in
medium-complexity, medium-performance circuits
•
Future applications
GALS, large networks on chip (NoCs)
3D integration
Some local blocks in a GALS system could then be asynchronous
Asynchronous circuitry can provide lower EMI for SoCs
•
The design and test flow remains a problem
Synchronous and GALS Networks on Chips
Synchronous and GALS NoCs
•
Today, on-chip design is increasingly communication-centric
•
Classical topologies are no longer sufficient (point-to-point, mesh, bus, etc.)
•
Shared bus = low performance
Bandwidth is shared
Bus width (bits) is relatively small
Global clock frequency is limited
•
Disadvantage of multiple buses
Not scalable, not generic
•
A promising alternative could be Networks on Chip (NoCs)
•
NoCs can be implemented completely synchronously, mesochronously, or in
GALS fashion
NoC Paradigm
•
Apply network protocols to SoCs
•
Network:
Provides communication
Satisfies quality-of-service requirements:
Reliability
Performance: throughput, latency, ...
Power?
•
Additional requirements unique to NoCs
Energy bounds
Area
Fit into the standard design flow
Switching Network Basics
•
Transport layer: message end-to-end
Implemented using network adapters
Assembly and disassembly of the packets at
source/destination
•
Network layer: packet end-to-end
Implemented using routers
Routers decide the routing path to the destination, based on
the header of the packet
topology knowledge
Scalable distributed system: load shared between
routers
•
Data-link layer: packet over link
Packets: header, payload, trailer
Error correction (on packet): redundancy, error
correction codes
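The assembly and disassembly done by the network adapters can be sketched in a few lines. This is only an illustration under stated assumptions: the header fields (`dest`, `seq`, `last`), the 4-byte payload size and the one-byte checksum used as trailer are hypothetical choices, not a format taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    header: dict     # hypothetical fields: destination, sequence number, last flag
    payload: bytes
    trailer: int     # simple checksum standing in for the redundancy field


def checksum(data: bytes) -> int:
    return sum(data) % 256


def assemble(message: bytes, dest: int, max_payload: int = 4):
    """Source-side network adapter: split a message into packets."""
    chunks = [message[i:i + max_payload]
              for i in range(0, len(message), max_payload)]
    return [Packet({"dest": dest, "seq": n, "last": n == len(chunks) - 1},
                   chunk, checksum(chunk))
            for n, chunk in enumerate(chunks)]


def disassemble(packets):
    """Destination-side network adapter: check, reorder and join packets."""
    ordered = sorted(packets, key=lambda p: p.header["seq"])
    for p in ordered:
        if checksum(p.payload) != p.trailer:
            raise ValueError(f"corrupted packet {p.header['seq']}")
    return b"".join(p.payload for p in ordered)


if __name__ == "__main__":
    pkts = assemble(b"hello NoC!", dest=3)
    print(len(pkts), disassemble(reversed(pkts)))
```

Routers would only inspect `header`, which is why the payload can stay opaque to the network layer.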
* Technion - Asynchronous NoC - Nikolai Samolazov
Bus vs. Network Arguments
•
Scalability
Bus: every IP adds parasitic capacitance; timing is difficult; bus arbiter performance
NoC: only P2P connections; can be pipelined; load shared by routers
•
Bandwidth
Bus: limited and shared by all IPs
NoC: scales with network size
•
Latency
Bus: zero when granted control
NoC: network latency always exists
•
Cost
Bus: low area
NoC: significant area
•
Design complexity
Bus: simple, well known and understood
NoC: requires changes at the HW and sometimes the SW level
Hybrid Network
•
Shared buses as first-level communication medium
•
NoC routers as main communication devices
Homogeneous NoC
[Figure: homogeneous NoC in which all nodes are functional units (FU).]
* NoC General Concepts - Andreas Ehliar - Per Karlström
Heterogeneous NoC
[Figure: heterogeneous NoC combining generic functional units (FU) with DSP, MUL and ALU blocks.]
Quality of Service
•
Guaranteed latency
•
Guaranteed bandwidth
•
Correctness
Design Issues - Architecture
[Figure: NoC architecture with nine functional units (FU).]
NoC Design
•
Architecture
Network Adapter and Router Architecture
- Asynchronous or synchronous
Network Topology
Routing Strategy
- Static Routing
- Adaptive Routing
Interconnect
- Repeaters
- Pipelining
•
Design Technology
Tools and Methodologies
Simulation and validation (correctness, performance, power)
- SystemC
Design Issues - Flow Control
Design Issues - Long Wires
•
Solving the global interconnect mess
Delay
Bit errors
Repeaters
Clock domains
•
Create one optimized solution that can be reused
Design Issues - Long Wires
•
Add flip-flops to increase clock frequency
•
What about ACKs?
[Figure: two NoC routers connected by a long link.]
Design Issues - Long Wires
•
Bit errors on long wires will not be avoidable in the future
•
Use error correcting codes
Disadvantage: More wires, more throughput needed
•
Use parity bits to discover errors
Resend damaged packets
No longer possible to guarantee real-time performance
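The parity-and-resend option can be sketched as follows. The even-parity encoding, the retry limit and the `link` callback that models the wire are assumptions made for this illustration. The number of tries depends on the errors that occur, which is the reason latency can no longer be bounded.

```python
def add_parity(word: int) -> int:
    """Append an even-parity bit as the least significant bit."""
    parity = bin(word).count("1") & 1
    return (word << 1) | parity


def parity_ok(coded: int) -> bool:
    """A coded word is valid when its number of 1 bits is even."""
    return bin(coded).count("1") % 2 == 0


def send_with_retry(word, link, max_tries=4):
    """Resend until the parity check passes; returns (word, tries)."""
    for tries in range(1, max_tries + 1):
        received = link(add_parity(word))
        if parity_ok(received):
            return received >> 1, tries
    raise RuntimeError("link failed")


if __name__ == "__main__":
    calls = []

    def flaky_link(coded):
        # Hypothetical wire model: flips one bit on the first transfer only.
        calls.append(coded)
        return coded ^ 0b100 if len(calls) == 1 else coded

    print(send_with_retry(0b1011, flaky_link))
```

A single parity bit only detects an odd number of flipped bits; an error correcting code avoids the resend at the price of more wires.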
Design Issues - Long Wires
•
Possibility to create a heavily optimized solution
Low voltage signaling
Advanced symbol encoding/decoding
Wave pipelining
Design Issues - Long Wires
•
High performance interconnect through wave pipelining
Need very careful analysis
[Figure: four NoC routers connected by wave-pipelined links.]
Design Issues - Long Wires
•
Wave pipelining performance
3.45 GHz signaling on one bit line in 0.25 µm
More energy efficient than regular pipeline
Faster than regular pipeline
•
Disadvantage
Much harder to test/verify
Network Topologies
•
Mesh
•
Tree
•
Fat-Tree
•
Routing algorithm depends on topology
Routing
•
Routing: selecting the path from source to destination.
Must be deadlock-free and livelock-free
Livelock: a message keeps moving indefinitely but never arrives
Possible only in adaptive non-minimal routing
Deadlock: packets wait for each other in a cycle
•
Three main categories:
Static (non-adaptive): predetermined path
Minimal fully adaptive: routes through any shortest path
Partially adaptive:
multiple routing paths
Some paths not shortest
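As an example of the static category, the sketch below implements dimension-ordered (XY) routing on a 2D mesh. The slides do not name a specific algorithm; XY routing is chosen here because it is a common predetermined-path scheme that is minimal (so no livelock) and deadlock-free on a mesh. The coordinate convention is an assumption.

```python
def xy_route(src, dst):
    """Return the list of (x, y) routers visited from src to dst.

    The packet first travels along X until the column matches,
    then along Y; the path depends only on source and destination.
    """
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
```

The hop count always equals the Manhattan distance between the two routers, which is what makes the route a shortest path.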
Wormhole Routing
•
Header forwarded as soon as possible, without waiting for the trailer
•
Used in high-performance parallel computing networks (lumped)
Not in the internet (distributed)
•
Packet may span several routers
Packet divided into flits (atomic flow control units)
•
Main disadvantage: cascaded contention
Packet requests busy link
VLSI routers: small buffers → packet cannot be buffered in one
router
Routers spanned by packet are stalled
Practical limitation, prevents achieving theoretical bandwidth
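The flit idea and the reason a blocked packet stalls several routers can be sketched as follows. The flit size, the head/body/tail naming and the per-router buffer depth are illustrative assumptions, and the model ignores virtual channels.

```python
import math


def to_flits(packet: bytes, flit_bytes: int = 2):
    """Split a packet into (kind, data) flits: head, body..., tail."""
    chunks = [packet[i:i + flit_bytes]
              for i in range(0, len(packet), flit_bytes)]
    flits = []
    for i, chunk in enumerate(chunks):
        if i == 0:
            kind = "head"          # carries the routing information
        elif i == len(chunks) - 1:
            kind = "tail"          # releases the path behind it
        else:
            kind = "body"
        flits.append((kind, chunk))
    return flits


def routers_spanned(n_flits: int, buffer_depth: int) -> int:
    """Routers occupied by a blocked packet, assuming each router
    can hold `buffer_depth` flits of it."""
    return math.ceil(n_flits / buffer_depth)


if __name__ == "__main__":
    flits = to_flits(b"ABCDEFGH")
    print(flits, routers_spanned(len(flits), buffer_depth=2))
```

With small buffers the quotient grows, so one busy link can stall a whole chain of routers, which is the cascaded contention mentioned above.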
NoC Design Characteristics: Cost
•
Area
Network components area
Wires, repeaters area
•
Power
Energy per transmitted packet
Idle power
NoC Design Characteristics: Performance
•
Latency [sec]
From header leaving source to trailer reaching destination
Composed of waiting latency + network latency
Waiting latency: time the message waits before entering the network
Network latency: time the message travels inside the network
•
Throughput [bits/sec]
Measured at a network port
Average amount of user data that is accepted by the network on
that port in a certain amount of time
•
Aggregate throughput [bits/sec]
Sum of the throughputs at all network ports
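These definitions translate directly into arithmetic. The helper names and the example numbers below are illustrative assumptions; only the relationships (latency as a sum, aggregate throughput as a sum over ports) come from the slide.

```python
def total_latency(waiting_s: float, network_s: float) -> float:
    """Latency = waiting latency + network latency, in seconds."""
    return waiting_s + network_s


def throughput(bits_accepted: int, interval_s: float) -> float:
    """Average user data accepted at one port, in bits/s."""
    return bits_accepted / interval_s


def aggregate_throughput(bits_per_port, interval_s: float) -> float:
    """Sum of the throughputs at all network ports, in bits/s."""
    return sum(throughput(bits, interval_s) for bits in bits_per_port)


if __name__ == "__main__":
    # Hypothetical measurement: two ports observed for 2 seconds.
    print(total_latency(2, 3), aggregate_throughput([1000, 3000], 2))
```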
NoC Saturation
•
Offered Load
Traffic produced by network clients as a percentage of the maximal network
bandwidth
L: number of cycles needed to accept the message, D: average number of
cycles between messages
$OL = \frac{L}{L + D}$
•
Saturation Threshold:
Offered Load at which the average latency rises steeply towards infinity
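Reading the definition as OL = L / (L + D), with L and D as defined on the slide, the offered load is easy to compute for a traffic generator. The function name and the example values are assumptions for illustration.

```python
def offered_load(l_cycles: float, d_cycles: float) -> float:
    """Offered load as a fraction of the maximal bandwidth.

    l_cycles: cycles needed to accept one message (L)
    d_cycles: average idle cycles between messages (D)
    """
    return l_cycles / (l_cycles + d_cycles)


if __name__ == "__main__":
    # Shrinking the gap between messages pushes the load towards 100 %.
    for gap in (30, 10, 0):
        print(gap, offered_load(10, gap))
```

Sweeping D in a simulation and plotting the measured latency against the resulting offered load is one way to locate the saturation threshold.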
Cost - Performance Tradeoff
Santiago Gonzalez Pestana et al., “Cost-Performance Trade-offs in Networks on Chip: A Simulation-Based Approach”, DATE 2004
Architecture of On-Chip Router
•Technion, Asynchronous vs. Synchronous Design Techniques for NoCs
•Robert Mullins, Asynchronous vs. Synchronous Design Techniques for NoCs
Router Pipeline
•
A router pipeline with numerous stages:
Raises communication latency
Can make packet buffers less effective
Incurs pipelining overheads
Synchronous NoCs - Summary
•
Can design high-performance single cycle routers
•
Design is simplified by presence of global synchrony
•
Distribution of global clock can be eased by
New clock generation / distribution techniques
Source synchronous communication
Limitations of Fully-Synchronous Networks
1. Difficult to distribute clock
Network spread over die & may have irregular layout
Minimising skew costs complexity and power
• Alternatives/extensions to PLL and H-tree:
Clock deskewing techniques
Distributed Clock Generator (DCG).
Distributed PLLs
Standing-wave oscillators and rotary clock schemes
Resonant global clocks, optical clock distribution etc.
Limitations of Fully-Synchronous Networks
2. Single Network Clock Frequency
Communicating synchronous IP blocks may operate at
different and potentially adaptive clock frequencies
What is most appropriate network clock frequency?
Why Asynchronous NoCs
•
No clock distribution, simple solution
•
Networked IP blocks run at different clock frequencies
No synchronization issues at interfaces
•
Ability to exploit data / path-dependent delays
Low-latency common or high-priority paths through router
•
Freedom to optimize network links
Not constrained by need to distribute/generate multiple clock
frequencies. Can exploit high-frequency narrow links
Dynamic latency/throughput trade-offs (adaptive pipeline
depth)
Exploit dynamic optimizations on links (e.g. DVS)
•
Easy-to-use interfaces, modularity, robust and simple
implementation, reduced design time
•
Some arguments for reduced power
Different NoC Architectures
•
Router clocks derived from a single source
•
Locally Generated Clocks (periodic & free-running)
•
Synchronous Routers with Asynchronous Links
•
Locally Clocked Routers / Asynchronous Interconnect
(GALS style network)
Can support asynchronous interconnects
No longer exploiting periodic nature of router clocks
Correct operation is independent of the delay of the link
•
GALS interfaces with pausable clocks
If necessary the clock is stretched, data is always transferred reliably
Need to construct local delay line
•
Local aperiodic clock generation
•
Data-Driven Local Clock
Similarities to stoppable GALS interface and asynchronous priority arbiters
Mesochronous Clocking
•
Clock skew may force the system to be partitioned into multiple clock
domains
•
Can exploit the fact that only the phase of each router’s clock differs (single clock source), so simple
error-free clock-domain crossing is possible
Router clocks derived from a single source
• Each router’s clock may be generated from the global network clock,
either by:
Clock division or
Clock multiplication
• Clock domain crossing techniques can exploit known clock frequency
relationships
Chakraborty and M. Greenstreet, “Efficient Self-Timed Interfaces for Crossing Clock Domains”,
In Proceedings ASYNC’03
L. F. G. Sarmenta, G. A. Pratt and S. A. Ward, “Rational Clocking”, ICCD’95
Using Synchronisers for GALS NoCs
•
Asynchronous channel uses 4-phase bundled data protocol
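For readers new to the protocol, the event order of a 4-phase (return-to-zero) bundled-data handshake can be sketched as a small generator. This is a behavioural illustration only; the function name and the tuple layout are assumptions, and the bundling constraint (data stable before the request rises) is modelled simply by placing the data with the req+ event.

```python
def four_phase_transfer(words):
    """Yield (event, req, ack, data) tuples for each transferred word."""
    req = ack = 0
    for data in words:
        req = 1                      # sender: data is valid, raise request
        yield ("req+", req, ack, data)
        ack = 1                      # receiver: data latched, raise acknowledge
        yield ("ack+", req, ack, data)
        req = 0                      # sender: return request to zero
        yield ("req-", req, ack, data)
        ack = 0                      # receiver: return acknowledge, channel idle
        yield ("ack-", req, ack, data)


if __name__ == "__main__":
    for step in four_phase_transfer([0xA, 0xB]):
        print(step)
```

Each word costs four signal transitions, and both wires are back at zero between transfers, which is what a synchronizer on the clocked side has to sample safely.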
A. Sheibanyrad, A. Greiner, Two efficient synchronous asynchronous converters well-suited for
networks-on-chip in GALS architectures, 2005
Locally Generated Clocks (periodic & free-running)
•
Can exploit knowledge about clocks (when crossing clock domains) even if all
we know is that they are periodic, examples:
predictive synchronizers [Dally][Frank/Ginosar]
asynchronous FIFOs [Chakraborty/Greenstreet]
Using Asynchronous FIFOs in GALS NoCs
•
The synchronous network wrapper assembles and disassembles data
packets
•
Can connect many independent clock domains
NoC architecture for low power
•
The NoC concept together with the GALS methodology gives good
opportunities for power saving
•
Each hardware block in a NoC system can be set to its optimal
frequency/voltage
•
It is best to combine DVFS with the GALS concept in order to reduce
power
NoC architecture for DVFS – LETI Solution (NoCs 2008)
•
A fully asynchronous Network-on-Chip
•
IP units are synchronous islands using a programmable Local Clock
Generator
•
Within the IP unit
Synchronization is done with a Pausable Clock
A Power Unit manages the internal Vcore, generated from the external
Vhigh and Vlow
A Network Interface is in charge of NoC communications and Local Power Management
•
The main CPU is in charge of global power management
DVFS with GALS NoCs
•
Each synchronous IP is an independent power and frequency domain
•
Local fine-grain Dynamic Voltage Scaling:
A local hardware controller controls the transitions
between Vhigh and Vlow
Ensures smooth DVS transitions for safe IP computation
•
Local fine-grain Dynamic Frequency Scaling:
Automatic frequency scaling
Clock generator re-programming is used to find the optimal V/F
point of operation
•
Thanks to the pausable clock technique, the IP unit continues its operation
during DVFS phases
•
GALS architecture and local clock generation are a natural enabler for
easy local DVFS
NoC Unit architecture
•
Each IP core encapsulated with
Network Interface
Test Wrapper
Pausable Clock
Power Supply Unit
•
IP units have 5 supply modes
Init: reset at Vhigh (1.2V)
High: Vhigh supply
Low: Vlow supply (0.8V)
Hopping: switch Vhigh / Vlow for
DVFS
Idle: retention state at Vlow (no
clock)
Off: stand-by mode
Local Power Manager
•
The Local Power Manager handles the unit power modes
•
A set of programmable registers, accessed through the NoC
•
Configuration of
Programmable delay line
Power Supply Unit
•
A pulse-width modulator is used to control the Hopping mode
Power Supply Unit
•
The Power Supply Unit manages Vcore
•
Two power switches, Thigh and Tlow (LVT transistors)
•
A Hopping Unit
•
An Ultra Cut-Off Generator
Hopping Unit
•
Energy per operation scales with V²
Decrease Voltage (and Frequency) to be energy efficient
•
«Triple state» power supply
Use of two PMOS power switches
Vhigh (1.2 V), Vlow (0.7 V), or OFF (0 V)
•
Switch between Vhigh and Vlow
Transitions take less than 100 ns
Mean speed / mean power of the IP is programmed by a PWM
•
Compatible with synchronous and asynchronous IPs
For GALS system: coordination done with local clock
generator
•
Can easily be integrated in any CMOS circuit
No inductor contrary to traditional DC/DC converters
No capacitor contrary to charge pump implementation
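The V² scaling and the effect of PWM-controlled hopping can be illustrated with a deliberately simplified model. It assumes energy per operation is proportional to V² and that `duty_high` is the fraction of operations executed at Vhigh; leakage, transition losses and the frequency difference between the two modes are ignored. The function names are illustrative, and the default voltages are the ones quoted on this slide.

```python
def energy_ratio(v: float, v_ref: float) -> float:
    """Energy per operation at v relative to v_ref, assuming E ~ V^2."""
    return (v / v_ref) ** 2


def mean_relative_energy(duty_high: float,
                         v_high: float = 1.2,
                         v_low: float = 0.7) -> float:
    """Average energy per operation, relative to always running at v_high,
    when a fraction `duty_high` of the operations runs at v_high."""
    return duty_high * 1.0 + (1.0 - duty_high) * energy_ratio(v_low, v_high)


if __name__ == "__main__":
    for duty in (1.0, 0.5, 0.0):
        print(duty, round(mean_relative_energy(duty), 3))
```

In this model the PWM setting interpolates linearly between the two supply points, which is the sense in which the mean speed and mean power of the IP are "programmed".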
Ultra Cut-Off Generator
•
When reverse polarizing the gate, the
leakage current goes through a minimum
•
The optimal polarization point varies with the
temperature, the supply voltage and the
process corners
•
The proposed UCO generator automatically
polarizes the gate of the Power switch to its
point of minimum leakage
•
Compensates for temperature variation,
alleviates corner variations.
•
The gate oxide reliability is considered by
introducing a passive stress reduction
mechanism
Pausable Clock Interface
•
Temporarily pauses the clock when a transfer (NoC) or a supply switch is
required
•
Based on
Two GALS ports: Synchronous-to-Asynchronous and Asynchronous-to-Synchronous
A programmable delay line
A pausable clock generator
•
The Pausable Clock Generator arbitrates pause requests
Pausable Clock Interface
•
Programmable delay line
Precise, small and low power
Using Standard cells
On the same unit power domain
Power Gain
•
The programmable delay line matches the unit logic on the same power domain
Any mismatch is compensated by reprogramming
•
Power reduction
Vhigh = 1.2 V and Vlow = 0.8 V
35 % dynamic power reduction between High and Low modes
Hopping mode is used to save power without any latency cost
Leakage power is reduced by two decades thanks to the UCO
•
Power Supply Unit efficiency
Hopping Unit
Only resistive losses in the power transistors
About 1 mW dynamic power
=> more than 95 % power efficiency
90 % total efficiency (external DC-DC taken into account)
An adaptive and reliable Power Supply Unit giving a high power reduction factor and
high power efficiency
Physical Implementation
•
Power Switch
One single power switch for the complete power domain
Sized to get a speed loss < 5 %
Area: less than about 5 % of the power domain
•
Hopping Unit
Area: 140 μm × 35 μm
Hopping transition: < 100 ns
Synchronous or Asynchronous?
•
A clockless on-chip network appears to be an elegant solution,
although some questions remain:
Test
Performance concerns
Shouldn’t asynchronous designs offer latency advantages?
Fast local control, path/data-dependent delays, DI interconnects
Perhaps asynchronous routers mimic synchronous architectures
too closely?
Exploit flexibility, novel architectures, different topologies
Overheads for data-driven clocking or GALS currently look small
in comparison to the classical approach
•
Synchronous design has advantages too
Predictability and determinism can be exploited
Fast single-cycle routers possible
Global snapshot of state is good for scheduling
•
Still lots of interesting research to be done
GALAXY project
•
The GALAXY project (GALS InterfAce for CompleX Digital
SYstem Integration) is funded in the FP7 programme of the EU
•
www.galaxy-project.org
Project goals
• This project builds on a technology approach in which the EU
currently has world leadership
• We are on the way to provide an integrated GALS NoC design flow
• We will provide an interoperability framework between the existing
open and commercial CAD tools
• The project is evaluating the ability of the GALS approach to
solve system integration issues,
implement a complex GALS system on 40 nm CMOS process,
explore the low EMI and low-power properties,
and robustness to process variability problems.