ppt - UCSD VLSI CAD Laboratory
ECE260B – CSE241A
Winter 2005
Clocking
Website: http://vlsicad.ucsd.edu/courses/ece260b-w05
ECE 260B – CSE 241A Clocking 1
Slides courtesy of Prof. Andrew B. Kahng
http://vlsicad.ucsd.edu
Outline
Problem Statement
Clock Distribution Structures
Robustness / Signal Integrity Control
Clock Design:
Skew Scheduling
Topology Construction
Embedding
Why Clocks?
Clocks provide the means to synchronize
By allowing events to happen at known timing boundaries, we
can sequence these events
Greatly simplifies building of state machines
No need to worry about variable delay through
combinational logic (CL)
All signals delayed until clock edge (clock imposes the worst
case delay)
[Figure: FSM and dataflow structures, each built from registers and combinational logic]
Courtesy K. Yang, UCLA
Clock Distribution Network
General goal of clock distribution
Deliver clock to all memory elements with acceptable skew
Deliver clock edges with acceptable sharpness
Clocking network design is one of the greatest challenges
in the design of a large chip
Consume up to 1/3 of chip power
Accurate signal delay
Signal integrity
Subject to uncertainty / variation of different processes / operating
conditions
Clock Design Components
Oscillator
Dividers
Buffers
Strong drivers
Reduce delay
Signal integrity / slew rate
Interconnects
Balanced trees, meshes, etc.
Shielding (e.g., for crosstalk reduction)
Non-tree links / feedback loops
Clock Distribution Objective
Minimum / bounded skew
performance / hold time requirements
Guaranteed slew rate / signal integrity
Small insertion delay
Robustness under process / operating condition variation
Minimum cell / routing area
Minimum power consumption
Clock Distribution Robustness Subject to
Radically different loading (flip-flop density): across the die, ECO (Engineering Change Order)
Interconnect coupling: signal integrity, delay variation
Process variation: from lot-to-lot, across the die, buffers, metal width
Supply voltage variation across the die: both static IR drop and dynamic voltage drop
Temperature
Issues in Clock Distribution Network Design
Skew
Process, voltage, and temperature
Data dependence
Noise coupling
Load balancing
Power, CV²f (can consume up to 1/3 of total chip power)
Clock gating
Flexibility/Tunability
Compactness – fit into existing layout/design
Facilitate ECO
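The dynamic power term C·V²·f listed above can be made concrete with a back-of-the-envelope calculation. This is a minimal sketch; the capacitance, supply voltage and frequency are illustrative assumptions, not figures from the slides.

```python
def dynamic_power(cap_farads, vdd_volts, freq_hz, activity=1.0):
    """Dynamic switching power P = activity * C * V^2 * f, in watts."""
    return activity * cap_farads * vdd_volts ** 2 * freq_hz


# Hypothetical clock network: 2 nF of switched capacitance, 1.2 V supply, 1 GHz.
# A clock net charges and discharges once every cycle, hence activity = 1.
clock_power = dynamic_power(2e-9, 1.2, 1e9)
print(f"clock power = {clock_power:.2f} W")
```

Because the clock toggles every cycle, its activity factor is the highest on the chip, which is one reason clock gating appears in the same list.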
Skew: Clock Delay Varies With Position
Clock Skew Causes
Designed (unavoidable) variations – mismatch in buffer load
sizes, interconnect lengths
Process variation – process spread across die yielding
different Leff, Tox, etc. values
Temperature gradients – changes MOSFET performance
across die
IR voltage drop in power supply – changes MOSFET
performance across die
Note: Delay from clock generator to fan-out points (clock
latency) is not important by itself
BUT: increased latency leads to larger skew for same amount of
relative variation
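The point about latency can be illustrated numerically. The sketch below assumes two nominally balanced clock paths, one running slow and one running fast by the same fraction; the ±5% figure and the latencies are hypothetical.

```python
def worst_case_skew(latency_ps, relative_variation):
    """Skew between a path that is slow and one that is fast by the same fraction."""
    slow = latency_ps * (1 + relative_variation)
    fast = latency_ps * (1 - relative_variation)
    return slow - fast


# Same +/-5% variation, increasing insertion delay (latency) in picoseconds.
for latency in (500, 1000, 2000):
    print(latency, "ps latency ->", round(worst_case_skew(latency, 0.05), 1), "ps skew")
```

Under this simple model the skew grows linearly with latency, so a small insertion delay is itself a robustness measure.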
Sylvester / Shepard, 2001
Outline
Problem Statement
Clock Distribution Structures
Robustness / Signal Integrity Control
Clock Design:
Skew Scheduling
Topology Construction
Embedding
Clock Distribution Structures
RC-Tree
Less capacitance
More accuracy
Flexible wiring
Grids
Reliable
Less data dependency
Tunable (late in design)
Shown here for final stage drivers driving F/F loads
Grids
Gridded clock distribution common on
earlier DEC Alpha microprocessors
Advantages:
Skew determined by grid density, not
too sensitive to load position
Clock signals available everywhere
Tolerant to process variations
Usually yields extremely low skew
values
Disadvantages:
Huge amount of wiring and power
To minimize such penalties, need to make grid pitch coarser, which loses the grid advantage
[Figure: predrivers driving a global grid]
Sylvester / Shepard, 2001
H-Tree
H-tree (Bakoglu)
One large central driver, recursive structure to
match wirelengths
Halve wire width at branching points to reduce
reflections
Disadvantages
Slew degradation along long RC paths
Unrealistically large central driver
- Clock drivers can create large temperature gradients (e.g., Alpha 21064: ~30 °C)
Non-uniform load distribution
Inherently non-scalable (wire R growth)
Partial solution: intermediate buffers at branching points
courtesy of P. Zarkesh-Ha
Sylvester / Shepard, 2001
Buffered H-tree
Advantages
Ideally zero-skew
Can be low power (depending on skew requirements)
Low area (silicon and wiring)
CAD tool friendly (regular)
Disadvantages
Sensitive to process variations
- Devices: want same size buffers at each level of tree
- Wires: want similar segment lengths on each layer in each source-sink path
Local clocking loads inherently non-uniform
Sylvester / Shepard, 2001
Tree Balancing
Some techniques:
a) Introduce dummy loads
b) Snaking of wirelength to match delays
Con: Routing area often more valuable than silicon
Sylvester / Shepard, 2001
Examples From Processor Chips
H-Tree, Asymmetric RC-Tree (IBM)
Grids: DEC [Alphas]
Serpentines: Intel x86 [Young ISSCC97]
Example Skews From Processor Chips
DEC-Alpha 21064 clock spines
DEC-Alpha 21064 RC delays
DEC-Alpha 21164 RC local delays
DEC-Alpha 21164 RC delays for Global
Distribution
(Spine + Grid)
ReShape Clocks Example (High-End ASIC)
Balanced, shielded H-tree for pre-clock distribution
Mesh for block level distribution
All routes 5-6u M6/5,
shielded with 1u
grounds
~10 buffers per node
E.g., ganged BUFx20’s
Output mesh must hit
every sub-block
output mesh
Block Level Mesh (.18u)
Clumps of 1-6 clock buffers, surrounded by
capacitor pads
Shielded input and output m6 shorting straps
Pre-clock connects to input shorting straps
1u m5 ribs every 20 - 30 u
(4 to 6 rows)
Max 600u stride
Problems with Meshes
Burn more power at low frequencies
Difficult for ‘spare’ clock domains that will not tolerate regioning
Blocks more routing resources (solution: integrated power
distribution with ribs can provide shielding for ‘free’)
Post placement (and routing) tuning required
No ‘beneficial skew’ possible
Clock gating only easy at root
Fighting tools to do analysis:
Clumped buffers a problem in Static Timing Analysis tools
Large shorted meshes a problem for STA tools
What does Elmore delay calculation look like for a non-tree?
Need full extraction and SPICE-like simulation to determine skew
Benefits of Meshes
Deterministic since shielded all the way down to rib
distribution
No ECO placement required: all buffers preplaced
before block placement
Low latency since uses shorted (= ganged, parallel)
drivers, therefore lower skew
ECO placements of FFs later do not require rebalancing
of tree
“Idealized” clocking environment for “concurrent dance”
of RTL design and timing convergence
Hybrid Structure
Balanced tree on the top
Mesh in the middle
Minimize skew
Steiner minimum tree at the bottom
Minimize cost
Facilitate ECO
Outline
Problem Statement
Clock Distribution Structures
Robustness / Signal Integrity Control
Clock Design:
Skew Scheduling
Topology Construction
Embedding
Process Variation
Intra-die and inter-die variations
Intra-die variation is increasingly significant since 0.13um technology
Systematic and random variations
Systematic variation is due to equipment, process, etc.
- Global lens aberration in lithography causes systematic variation
- Pattern-dependent optical proximity, chemical mechanical polish (CMP)
Random variation is due to inherent variation
Spatial correlation across a chip
Fast vs. slow corners
Process Variation
Metal wires
Width variation can be estimated by LUT(width, spacing)
Thickness variation depends on CMP and local density
Thickness variation also depends on wire width and spacing
Could be up to 30-40% in 90nm process
Transistors
Channel length variation (delay ~ L^1.5)
Thin gate oxide: tox variation → Vth variation
Up to 30% variation in terms of driving capability
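As a rough illustration of the delay ~ L^1.5 relation, the sketch below scales a nominal delay by a relative change in channel length. The ±10% length variation is a hypothetical input, and the pure power law ignores every other parameter.

```python
def delay_scale(length_ratio, exponent=1.5):
    """Delay multiplier when channel length changes by length_ratio (delay ~ L^1.5)."""
    return length_ratio ** exponent


# Hypothetical channel length excursions of -10%, 0% and +10%.
for dl in (-0.10, 0.0, 0.10):
    print(f"{dl:+.0%} channel length -> delay x{delay_scale(1 + dl):.3f}")
```

The exponent above 1 means a given percentage change in channel length produces a larger percentage change in delay.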
Process Variations – SPICE model
Process variations are reflected into a statistical SPICE
model
Usually only a few parameters have a statistical distribution (e.g. :
{DL, DW, TOX,VTn, VTp}) and the others are set to a nominal value
The nominal SPICE model is obtained by setting the statistical
parameters to their nominal value
Slide courtesy of A. Nardi, J. Rabaey, K. Keutzer of UCB
Global Variations (Inter-die)
Process variations → Performance variations
Critical path delay of a 16-bit adder
All devices have the same set of model parameter values
Slide courtesy of A. Nardi, J. Rabaey, K. Keutzer of UCB
Local Variations (Intra-die)
Each device instance has a slightly different set of model
parameter values (aka device mismatch)
The performance of some analog circuits strongly
depends on the degree of matching of device properties
Digital circuits are in general more immune to mismatch,
but clock distribution network is sensitive (clock skew)
Slide courtesy of A. Nardi, J. Rabaey, K. Keutzer of UCB
Statistical Design
Need to account for process variations during design
phase
•Statistical design
–Nominal design
–Yield optimization
–Design centering
Slide courtesy of A. Nardi, J. Rabaey, K. Keutzer of UCB
Statistical Design
Slide courtesy of A. Nardi, J. Rabaey, K. Keutzer of UCB
Process Variation Tolerance Enhancement
Rule of thumb: balanced tree
Identical buffers at identical heights
Drive identical subtree loads
Can we do better than this?
Process variation tolerant clock design
Bounded-skew DME
Topology construction
- With process variation tolerance in objective
Useful skew scheduling
- To the center of permissible ranges
Signal Integrity
Crosstalk
Capacitive, inductive
Supply voltage drop
IR, L dI/dt, LC resonance
Temperature
Increased resistance with higher temperature
Substrate coupling
Parasitic resistance, capacitance in the substrate layer
Crosstalk
Due to the coupling capacitance between
interconnections, a signal switching on a net (aggressor)
may affect the voltage waveform on a neighboring net
(victim)
Noise Propagation
Increased Delay
Circuit Model for Crosstalk
Crosstalk Simulation
Design for Crosstalk
It can be both capacitive and inductive
Capacitive is dominant at current switching speeds
To reduce it:
Use of shielding layer (inter-layer)
Use of shielding wire (intra-layer)
[Figure: shielding example with GND, VDD and GND wires above the substrate]
Clock Gating
Reduce power consumption by temporarily shutting down part of the circuit
Additional cost of enabling circuits
[Figure: FF, combinational logic, FF; clocks CLK1 and CLK2; CLK ENABLING block]
Outline
Problem Statement
Clock Distribution Structures
Robustness / Signal Integrity Control
Clock Design:
Skew Scheduling
Topology Construction
Embedding
Skew = Local Constraint
Timing is correct as long as the clock signals of
sequentially adjacent FFs arrive within a permissible
skew range
D: longest path
d: shortest path
Permissible range: -d + thold < Skew < Tperiod - D - tsetup
Skew < -d + thold: race condition
Skew > Tperiod - D - tsetup: cycle time violation
Skew inside the permissible range: safe
W. Dai, UC Santa Cruz
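The permissible range on this slide can be evaluated directly. The sketch below is illustrative only: the function names and the numeric values are assumptions, and the sign convention (skew = clock arrival at the launching FF minus clock arrival at the capturing FF) is inferred from the form of the two bounds.

```python
def permissible_skew_range(t_period, d_long, d_short, t_setup, t_hold):
    """Bounds from the slide: -d + t_hold < skew < T_period - D - t_setup."""
    return (-d_short + t_hold, t_period - d_long - t_setup)


def classify(skew, lower, upper):
    """Label a skew value relative to the (open) permissible range."""
    if skew <= lower:
        return "race condition"
    if skew >= upper:
        return "cycle time violation"
    return "safe"


# Hypothetical numbers in ns: T = 6, D = 5, d = 1, setup = 0.2, hold = 0.1.
lower, upper = permissible_skew_range(6.0, 5.0, 1.0, 0.2, 0.1)
for skew in (-1.0, 0.0, 1.0):
    print(skew, classify(skew, lower, upper))
```

Note that the range is a property of each pair of sequentially adjacent FFs, which is why the slide calls skew a local constraint.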
“Useful Skew” Design Robustness
Design will be more robust if clock signal arrival time is in
the middle of permissible skew range, rather than on edge
[Figure: three FFs connected by paths of 2 ns and 6 ns; clock period T = 6 ns]
“0 0 0”: at verge of violation
“2 0 2”: more safety margin
W. Dai, UC Santa Cruz
Constraints on Skews
FFi receives clock signal delayed by xi ≥ MIN_DEL
0 < α ≤ 1 ≤ β: if nominal clock delay is xi, then actual clock delay x must fall within interval α·xi ≤ x ≤ β·xi
For FF to operate correctly when clock edge arrives at time x, the correct input data must be present and stable during the time interval (x – SETUP, x + HOLD)
For 1 ≤ i,j ≤ L (#FFs), we compute lower and upper bounds MIN(i,j) and MAX(i,j) for the time that is required for a signal edge to propagate from FFi to FFj
Avoid double-clocking (race condition)
xi + MIN(i,j) ≥ xj + HOLD
Avoid zero-clocking
xi + SETUP + MAX(i,j) ≤ xj + P
P = clock period
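The two constraints can be checked mechanically for a candidate clock schedule. The sketch below computes the hold slack (double-clocking) and setup slack (zero-clocking) of every path. The example data is an assumed reading of the three-FF figure on the useful-skew slide (paths of 2 ns and 6 ns, P = 6 ns), with SETUP = HOLD = 0 and MIN = MAX on each path; the function name is hypothetical.

```python
def slacks(x, paths, period, setup=0.0, hold=0.0):
    """x: clock delay per FF; paths: list of (i, j, MIN(i,j), MAX(i,j)).

    Returns {(i, j): (hold_slack, setup_slack)}; both must be >= 0.
    """
    result = {}
    for i, j, dmin, dmax in paths:
        hold_slack = x[i] + dmin - (x[j] + hold)               # double-clocking check
        setup_slack = x[j] + period - (x[i] + setup + dmax)    # zero-clocking check
        result[(i, j)] = (hold_slack, setup_slack)
    return result


def worst_slack(x, paths, period):
    return min(min(pair) for pair in slacks(x, paths, period).values())


# Assumed chain FF0 -> FF1 (2 ns) -> FF2 (6 ns), clock period 6 ns.
paths = [(0, 1, 2, 2), (1, 2, 6, 6)]
zero_skew = [0, 0, 0]
useful_skew = [2, 0, 2]
print("0 0 0:", worst_slack(zero_skew, paths, 6))
print("2 0 2:", worst_slack(useful_skew, paths, 6))
```

Under these assumptions the zero-skew schedule leaves no margin on the 6 ns path, while the skewed schedule leaves the same positive margin on both paths.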
Optimal Useful Skews by Linear Programming
LP_SPEED (clock period reduction):
minimize P s.t.
xi – xj ≥ HOLD – MIN(i,j)
xj – xi + P ≥ SETUP + MAX(i,j)
xi ≥ MIN_DEL
LP_SAFETY (robustness):
Maximize M s.t.
xi – xj – M ≥ HOLD – MIN(i,j)
xj – xi – M ≥ SETUP + MAX(i,j) – P
xi ≥ MIN_DEL
Notes
- J. P. Fishburn, “Clock Skew Optimization”, IEEE Trans. Computers 39(7) (1990), pp. 945-951.
- T. G. Szymanski, “Computing Optimal Clock Schedules”, Proc. DAC, June 1992, pp. 399-404.
- Useful Skew optimization is similar to Retiming optimization
- Peak current reductions are a side benefit
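For a fixed P, the LP_SPEED constraints are difference constraints on the xi, so feasibility can be tested with a Bellman-Ford negative-cycle check, and the smallest P found by binary search. The sketch below uses that swapped-in technique instead of a general LP solver; it is not claimed to be the method of the cited papers. Function names and the example circuit are hypothetical, and the search assumes MIN(i,j) ≥ HOLD on every path so that zero skew is feasible at P = SETUP + max MAX(i,j).

```python
def find_schedule(n, paths, period, setup=0.0, hold=0.0, min_del=0.0):
    """Return clock delays x_0..x_{n-1} meeting all constraints at this period, or None."""
    edges = []  # (u, v, w) encodes the difference constraint x_v - x_u <= w
    for i, j, dmin, dmax in paths:
        edges.append((i, j, dmin - hold))             # x_i - x_j >= HOLD - MIN(i,j)
        edges.append((j, i, period - setup - dmax))   # x_j - x_i + P >= SETUP + MAX(i,j)
    dist = [0.0] * n  # as if a virtual source reached every FF with weight 0
    for _ in range(n):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v] - 1e-12:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            shift = min_del - min(dist)  # a uniform shift keeps all differences
            return [d + shift for d in dist]
    return None  # still relaxing after n passes: negative cycle, so infeasible


def min_period(n, paths, setup=0.0, hold=0.0, tol=1e-6):
    """Binary search for the smallest feasible P (feasibility is monotone in P)."""
    lo, hi = 0.0, setup + max(dmax for _, _, _, dmax in paths)
    if find_schedule(n, paths, hi, setup, hold) is None:
        raise ValueError("infeasible even at the zero-skew period")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if find_schedule(n, paths, mid, setup, hold) is None:
            lo = mid
        else:
            hi = mid
    return hi, find_schedule(n, paths, hi, setup, hold)


# Hypothetical ring of three FFs: (i, j, MIN(i,j), MAX(i,j)) in ns.
ring = [(0, 1, 2, 2), (1, 2, 6, 6), (2, 0, 2, 2)]
period, schedule = min_period(3, ring)
print(round(period, 3), [round(x, 3) for x in schedule])
```

In this example zero skew would need P = 6 ns, whereas the total loop delay of 10 ns spread over three stages allows a period close to 10/3 ns, which is the sense in which useful skew resembles retiming.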
Outline
Problem Statement
Clock Distribution Structures
Robustness / Signal Integrity Control
Clock Design:
Skew Scheduling
Topology Design
Embedding
For zero skew (ZST-DME)
For bounded skew (BST-DME)
Zero-Skew Tree (ZST) Problem
Zero Skew Clock Routing Problem (S,G): Given a set S of sink
locations and a connection topology G, construct a ZST T(S) with
topology G and having minimum cost.
Skew = maximum value of |td(s0,si) – td(s0,sj)| over all sink pairs si, sj in S
td = signal delay (from source s0)
Connection topology G = rooted binary tree with nodes of S as leaves
Edge ea in G is the edge from a to its parent
|ea| is the (assigned) length of edge ea
Cost = total edge length
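The definitions of skew and cost can be evaluated on a small tree. The sketch below assumes the linear delay model (delay to a sink equals total edge length on the path from s0), and the tree and edge lengths are hypothetical; `children` and `edge_len` are illustrative data structures, with `edge_len[a]` holding |ea|.

```python
def sink_delays(children, edge_len, root):
    """Linear delay model: td(s0, s) = sum of edge lengths on the root-to-sink path."""
    delays = {}
    stack = [(root, 0)]
    while stack:
        node, d = stack.pop()
        kids = children.get(node, [])
        if not kids:               # a leaf of the topology is a sink
            delays[node] = d
        for k in kids:
            stack.append((k, d + edge_len[k]))
    return delays


def skew(delays):
    """max |td(s0,si) - td(s0,sj)| over all sink pairs = largest minus smallest delay."""
    return max(delays.values()) - min(delays.values())


def cost(edge_len):
    return sum(edge_len.values())


# Hypothetical topology: s0 -> v -> {a, b}, a -> {s1, s2}, b -> {s3, s4}.
children = {"s0": ["v"], "v": ["a", "b"], "a": ["s1", "s2"], "b": ["s3", "s4"]}
edge_len = {"v": 1, "a": 2, "b": 3, "s1": 2, "s2": 2, "s3": 1, "s4": 1}
delays = sink_delays(children, edge_len, "s0")
print(delays, "skew =", skew(delays), "cost =", cost(edge_len))
```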
Zero-Skew Example (555 sinks, 40 obstacles)
A Zero-Skew Routing Algorithm
Finds a ZST under linear delay
model with minimum cost over all
ZSTs with topology G and sink set
S
Terms
Manhattan Arc: line segment with
slope +1 or –1
Tilted Rectangular Region (TRR):
collection of points within a fixed
distance of a Manhattan arc
- Core = Manhattan arc
- Radius = distance
Merging segment = locus of feasible locations for a node v in the topology, consistent with minimum wirelength
- If v is a sink, then ms(v) = {v}
- If v is an internal node with children a and b, then ms(v) is the set of all points within distance |ea| of ms(a), and within distance |eb| of ms(b)
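One convenient way to work with these terms is to rotate coordinates by 45 degrees: with u = x + y and v = x - y, Manhattan distance becomes max(|du|, |dv|), so a TRR becomes an axis-aligned rectangle and a Manhattan arc a rectangle of zero width or height. The sketch below uses that representation; the class and function names are hypothetical, and this is one possible encoding rather than the one used in the original work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TRR:
    """Tilted rectangular region stored in rotated coordinates u = x + y, v = x - y."""
    u_lo: float
    u_hi: float
    v_lo: float
    v_hi: float


def from_point(x, y):
    """A single point (for example a sink) as a degenerate TRR."""
    return TRR(x + y, x + y, x - y, x - y)


def expand(core, radius):
    """All points within Manhattan distance `radius` of `core`."""
    return TRR(core.u_lo - radius, core.u_hi + radius,
               core.v_lo - radius, core.v_hi + radius)


def intersect(a, b):
    """Rectangle intersection in (u, v); returns None when the regions are disjoint."""
    u_lo, u_hi = max(a.u_lo, b.u_lo), min(a.u_hi, b.u_hi)
    v_lo, v_hi = max(a.v_lo, b.v_lo), min(a.v_hi, b.v_hi)
    if u_lo > u_hi or v_lo > v_hi:
        return None
    return TRR(u_lo, u_hi, v_lo, v_hi)


def to_xy(u, v):
    """Map a rotated-coordinate point back to (x, y)."""
    return ((u + v) / 2, (u - v) / 2)


# Two sinks at (0, 0) and (2, 2), Manhattan distance 4, edge length 2 each.
ms = intersect(expand(from_point(0, 0), 2), expand(from_point(2, 2), 2))
print(ms, to_xy(ms.u_lo, ms.v_lo), to_xy(ms.u_lo, ms.v_hi))
```

For these two sinks the intersection collapses to a rectangle of zero width in u, which is the Manhattan arc of slope -1 between (0, 2) and (2, 0): the merging segment of their parent.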
Phase 1: Tree of Merging Segments
Goal: Construct a tree of merging segments corresponding to topology G
Merging segment of a node depends on merging segments of its children → bottom-up construction
Let a, b be children of v. We want placements of v that allow TSa and TSb to be merged with minimum added wire while preserving zero skew
Merging cost = |ea| + |eb|
Fact: The intersection of two TRRs is also a TRR and can be found in constant time
Constant time per each new merging segment → linear time (in size of S) to construct entire tree
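A single bottom-up merge step can be sketched as follows, assuming the linear delay model and regions stored as (u_lo, u_hi, v_lo, v_hi) tuples in rotated coordinates u = x + y, v = x - y, where Manhattan distance becomes max(|du|, |dv|). The function names are hypothetical, t_a and t_b denote the delay from each child's merging segment to its sinks, and exact arithmetic is assumed (a production version would need a tolerance in the intersection test).

```python
def expand(r, radius):
    """TRR of the given radius around region r."""
    return (r[0] - radius, r[1] + radius, r[2] - radius, r[3] + radius)


def intersect(a, b):
    r = (max(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), min(a[3], b[3]))
    return r if r[0] <= r[1] and r[2] <= r[3] else None


def distance(a, b):
    """Manhattan distance between two regions (largest per-axis gap in u, v)."""
    gap_u = max(0, a[0] - b[1], b[0] - a[1])
    gap_v = max(0, a[2] - b[3], b[2] - a[3])
    return max(gap_u, gap_v)


def merge(ms_a, t_a, ms_b, t_b):
    """Return (ms_v, t_v, e_a, e_b) such that t_a + e_a == t_b + e_b (zero skew)."""
    d = distance(ms_a, ms_b)
    if abs(t_a - t_b) <= d:
        e_a = (d + t_b - t_a) / 2          # balance delays with e_a + e_b = d
        e_b = d - e_a
        ms_v = intersect(expand(ms_a, e_a), expand(ms_b, e_b))
    elif t_a < t_b:                        # a is too fast: extra (snaked) wire on e_a
        e_a, e_b = t_b - t_a, 0
        ms_v = intersect(expand(ms_a, e_a), ms_b)
    else:                                  # b is too fast: extra wire on e_b
        e_a, e_b = 0, t_a - t_b
        ms_v = intersect(ms_a, expand(ms_b, e_b))
    return ms_v, t_a + e_a, e_a, e_b


# Sinks at (0, 0) and (2, 2) become points (u, v) = (0, 0) and (4, 0).
ms_v, t_v, e_a, e_b = merge((0, 0, 0, 0), 0, (4, 4, 0, 0), 0)
print(ms_v, t_v, e_a + e_b)
```

Each merge uses a constant number of arithmetic operations, which is what makes the whole bottom-up phase linear in the number of sinks.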
Phase 2: Find Node Placements
Goal: Find exact locations (“embeddings”) pl(v) of internal nodes v in the ZST topology
If v is the root node, then any point on ms(v) can be chosen as pl(v)
If v is an internal node other than the root, and p is the parent of v, then v can be embedded at any point in ms(v) that is at distance |ev| or less from pl(p)
Detail: create square TRR trrp with radius |ev| and core equal to pl(p); placement of v can be any point in ms(v) ∩ trrp
Each instruction executed at most once for each node in G, and TRR intersection is O(1) time → Find_Exact_Placements is O(n) → DME is O(n)
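The top-down step for one non-root node can be sketched in the same spirit. The code assumes regions stored as (u_lo, u_hi, v_lo, v_hi) tuples in rotated coordinates u = x + y, v = x - y; `place_child` and the tie-breaking rule (take one corner of the feasible region) are hypothetical choices, since any point of ms(v) ∩ trrp is valid.

```python
def expand(r, radius):
    return (r[0] - radius, r[1] + radius, r[2] - radius, r[3] + radius)


def intersect(a, b):
    r = (max(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), min(a[3], b[3]))
    return r if r[0] <= r[1] and r[2] <= r[3] else None


def to_xy(u, v):
    return ((u + v) / 2, (u - v) / 2)


def place_child(pl_parent_xy, ms_v, e_v):
    """Pick pl(v): a point of ms(v) within Manhattan distance e_v of pl(p)."""
    x, y = pl_parent_xy
    point = (x + y, x + y, x - y, x - y)   # pl(p) as a degenerate region
    trr_p = expand(point, e_v)             # square TRR with radius e_v
    region = intersect(ms_v, trr_p)
    if region is None:
        raise ValueError("ms(v) is not reachable from pl(p) with edge length e_v")
    # Any point of the region is valid; this sketch takes its (u_lo, v_lo) corner.
    return to_xy(region[0], region[2])


# Parent placed at (1, 1); children are sinks at (0, 0) and (2, 2), edge length 2.
print(place_child((1, 1), (0, 0, 0, 0), 2))
print(place_child((1, 1), (4, 4, 0, 0), 2))
```

Applying this once per node, from the root downward, gives the linear-time placement pass described above.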
Outline
Problem Statement
Clock Distribution Structures
Robustness / Signal Integrity Control
Clock Design:
Skew Scheduling
Topology Design
Embedding
For zero skew (ZST-DME)
For bounded skew (BST-DME)
Non-Zero Skew Bounds
Given a skew bound, where can internal nodes of the given topology
(e.g., a, b, v) be placed?
[Figure: skew (values 0, 2, 4, 6) as a function of where internal nodes a, b and v are placed; topology with source s0, internal nodes v, a, b and sinks s1, s2, s3, s4]
BST-DME Bottom-Up Phase
Bottom-Up: build tree of merging
regions corresponding to given
topology
[Figure: merging regions mr(a), mr(b) and mr(v) for skew bound B = 4, with source s0 and sinks s1, s2, s3, s4; topology with internal nodes a, b above the sinks]
BST-DME Top-Down Phase
[Figure: top-down embedding of v, a and b for skew bound B = 4, with source s0 and sinks s1, s2, s3, s4]
Good Luck for the Mid-Term!