ppt - UCSD VLSI CAD Laboratory


ECE260B – CSE241A
Winter 2005
Clocking
Website: http://vlsicad.ucsd.edu/courses/ece260b-w05
ECE 260B – CSE 241A Clocking 1
Slides courtesy of Prof. Andrew B. Kahng
http://vlsicad.ucsd.edu
Outline
• Problem Statement
• Clock Distribution Structures
• Robustness / Signal Integrity Control
• Clock Design:
  – Skew Scheduling
  – Topology Construction
  – Embedding
Why Clocks?
• Clocks provide the means to synchronize
  – By allowing events to happen at known timing boundaries, we can sequence these events
• Greatly simplifies building of state machines
• No need to worry about variable delay through combinational logic (CL)
  – All signals are delayed until the clock edge (the clock imposes the worst-case delay)
[Figure: FSM and dataflow examples built from registers and combinational logic (Comb Logic). Courtesy K. Yang, UCLA]
Clock Distribution Network
• General goal of clock distribution
  – Deliver clock to all memory elements with acceptable skew
  – Deliver clock edges with acceptable sharpness
• Clocking network design is one of the greatest challenges in the design of a large chip
  – Consumes up to 1/3 of chip power
  – Accurate signal delay
  – Signal integrity
  – Subject to uncertainty / variation of different processes / operating conditions
Clock Design Components
• Oscillator
• Dividers
• Buffers
  – Strong drivers
  – Reduce delay
  – Signal integrity / slew rate
• Interconnects
  – Balanced trees, meshes, etc.
  – Shielding (e.g., for crosstalk reduction)
  – Non-tree links / feedback loops
Clock Distribution Objective
• Minimum / bounded skew
  – Performance / hold time requirements
• Guaranteed slew rate / signal integrity
• Small insertion delay
• Robustness under process / operating condition variation
• Minimum cell / routing area
• Minimum power consumption
Clock Distribution Robustness Subject to
• Radically different loading (flip-flop density)
• Interconnect coupling
  – Signal integrity
  – Delay variation
• Process variation
  – From lot-to-lot
  – Across the die
  – Buffers
  – Metal width
• Supply voltage variation across the die
  – Both static IR drop and dynamic voltage drop
• Temperature
  – Across the die
• ECO (Engineering Change Order)
Issues in Clock Distribution Network Design
• Skew
  – Process, voltage, and temperature
  – Data dependence
  – Noise coupling
  – Load balancing
• Power, CV²f (consumes up to 1/3 of total chip power)
  – Clock gating
• Flexibility / Tunability
  – Compactness: fit into existing layout / design
  – Facilitate ECO
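The CV²f term can be made concrete. A clock net charges and discharges its whole capacitance every cycle (activity factor 1), so P = C·V²·f. The sketch below assumes a single lumped clock capacitance and models clock gating as a hypothetical fraction of that capacitance whose clock is stopped; it ignores the power of the enabling circuits themselves. The numbers are illustrative, not from the slides.

```python
def clock_power(c_total_f, vdd_v, f_hz, gated_fraction=0.0):
    """Dynamic clock power P = C * V^2 * f (activity factor 1 for a clock).

    gated_fraction is a hypothetical knob: the share of the clock
    capacitance whose clock is stopped by gating in the cycles considered.
    """
    if not 0.0 <= gated_fraction <= 1.0:
        raise ValueError("gated_fraction must be in [0, 1]")
    switched_c = c_total_f * (1.0 - gated_fraction)
    return switched_c * vdd_v ** 2 * f_hz


# Illustrative values only: 1 nF of clock capacitance, 1.2 V, 1 GHz.
print(f"ungated:   {clock_power(1e-9, 1.2, 1e9):.3f} W")
print(f"40% gated: {clock_power(1e-9, 1.2, 1e9, gated_fraction=0.4):.3f} W")
```

The quadratic dependence on V and the linear dependence on f are why the clock network can reach a large share of total chip power.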
Skew: Clock Delay Varies With Position
Clock Skew Causes
• Designed (unavoidable) variations – mismatch in buffer load sizes, interconnect lengths
• Process variation – process spread across the die yielding different Leff, Tox, etc. values
• Temperature gradients – change MOSFET performance across the die
• IR voltage drop in power supply – changes MOSFET performance across the die
• Note: delay from the clock generator to the fan-out points (clock latency) is not important by itself
  – BUT: increased latency leads to larger skew for the same amount of relative variation
(Sylvester / Shepard, 2001)
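The latency note can be checked with a first-order model (an assumption, not from the slides): two nominally identical clock paths, one slower by a fixed relative amount. The skew then scales with the latency itself, so the same 5% variation costs twice as much skew on a path with twice the insertion delay. Latency values are illustrative.

```python
def skew_from_mismatch(latency_ps, rel_mismatch):
    """Skew between two nominally identical clock paths of the given latency
    when one of them is slower by the relative amount rel_mismatch
    (first-order model, an assumption: skew = latency * rel_mismatch)."""
    nominal_path = latency_ps
    slow_path = latency_ps * (1.0 + rel_mismatch)
    return slow_path - nominal_path


# Same 5% relative variation, increasing insertion delay (illustrative values).
for latency in (500, 1000, 2000):
    print(f"latency {latency} ps -> skew {skew_from_mismatch(latency, 0.05):.1f} ps")
```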
Outline
• Problem Statement
• Clock Distribution Structures
• Robustness / Signal Integrity Control
• Clock Design:
  – Skew Scheduling
  – Topology Construction
  – Embedding
Clock Distribution Structures
• RC-Tree
  – Less capacitance
  – More accuracy
  – Flexible wiring
• Grids
  – Reliable
  – Less data dependency
  – Tunable (late in design)
[Figure: shown here for final-stage drivers driving F/F loads]
Grids
• Gridded clock distribution common on earlier DEC Alpha microprocessors
• Advantages:
  – Skew determined by grid density, not too sensitive to load position
  – Clock signals available everywhere
  – Tolerant to process variations
  – Usually yields extremely low skew values
• Disadvantages:
  – Huge amount of wiring and power
  – To minimize such penalties, need to make grid pitch coarser → lose the grid advantage
[Figure: predrivers feeding a global grid]
(Sylvester / Shepard, 2001)
H-Tree
• H-tree (Bakoglu)
  – One large central driver, recursive structure to match wirelengths
  – Halve wire width at branching points to reduce reflections
• Disadvantages
  – Slew degradation along long RC paths
  – Unrealistically large central driver
    - Clock drivers can create large temperature gradients (ex. Alpha 21064 ~30 °C)
  – Non-uniform load distribution
  – Inherently non-scalable (wire R growth)
  – Partial solution: intermediate buffers at branching points
[Figure courtesy of P. Zarkesh-Ha]
(Sylvester / Shepard, 2001)
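The H-tree's defining property, equal source-to-sink wirelength by construction, can be checked with a short recursive sketch. This is a simplified geometric model (an assumption, not Bakoglu's exact construction): each H reaches its four corners with a horizontal run of `arm` followed by a vertical run of `arm`, and each sub-H at a corner is half the size of its parent. Wire widths, buffers, and RC delay are ignored.

```python
def h_tree_sinks(x, y, arm, levels, path_len=0.0):
    """Return (x, y, wirelength_from_root) for every leaf of a recursive H-tree.

    Simplified model (an assumption): from its center, each H goes `arm`
    horizontally and then `arm` vertically to each of its four corners,
    where a sub-H of half the size is placed.
    """
    if levels == 0:
        return [(x, y, path_len)]
    leaves = []
    for dx in (-arm, arm):
        for dy in (-arm, arm):
            # Horizontal run |dx| plus vertical run |dy| to reach this corner.
            leaves += h_tree_sinks(x + dx, y + dy, arm / 2, levels - 1,
                                   path_len + abs(dx) + abs(dy))
    return leaves


sinks = h_tree_sinks(0.0, 0.0, arm=4.0, levels=2)
lengths = {round(length, 9) for _, _, length in sinks}
print(len(sinks), "sinks, distinct source-to-sink wirelengths:", lengths)
```

Every sink ends up with the same path length, which is what makes the tree ideally zero-skew; it says nothing about how loads are distributed, which is one of the listed disadvantages.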
Buffered H-tree
• Advantages
  – Ideally zero-skew
  – Can be low power (depending on skew requirements)
  – Low area (silicon and wiring)
  – CAD tool friendly (regular)
• Disadvantages
  – Sensitive to process variations
    - Devices → want same-size buffers at each level of the tree
    - Wires → want similar segment lengths on each layer in each source-sink path!!!
  – Local clocking loads inherently non-uniform
(Sylvester / Shepard, 2001)
Tree Balancing
Some techniques:
a) Introduce dummy loads
b) Snaking of wirelength to match delays
Con: routing area often more valuable than silicon
(Sylvester / Shepard, 2001)
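Both balancing techniques can be sized with the Elmore delay model. The sketch below assumes a branch is a single wire segment driving a lumped sink load, with consistent (here normalized) units; `dummy_load` and `snaked_length` are illustrative names, and the target delay stands for the delay of the slower sibling branch.

```python
import math


def dummy_load(target_delay, r_wire, c_wire, c_load):
    """Dummy capacitance that raises a branch's Elmore delay to target_delay.

    Assumed branch model: one wire (total R = r_wire, total C = c_wire)
    driving a lumped sink load, so delay = r_wire * (c_wire/2 + c_load + c_dummy).
    """
    c_dummy = target_delay / r_wire - c_wire / 2 - c_load
    if c_dummy < 0:
        raise ValueError("branch is already slower than the target")
    return c_dummy


def snaked_length(target_delay, r_per_len, c_per_len, c_load):
    """Wire length l whose Elmore delay r*l*(c*l/2 + c_load) equals target_delay.

    Positive root of (r*c/2) * l**2 + (r*c_load) * l - target_delay = 0.
    """
    a = r_per_len * c_per_len / 2
    b = r_per_len * c_load
    return (-b + math.sqrt(b * b + 4 * a * target_delay)) / (2 * a)


# Normalized units (illustrative): match a sibling branch whose delay is 12.
print(dummy_load(12.0, r_wire=2.0, c_wire=4.0, c_load=1.0))          # extra C needed
print(snaked_length(12.0, r_per_len=1.0, c_per_len=2.0, c_load=1.0))  # wire length needed
```

The first option spends capacitance (and therefore clock power); the second spends routing area, which the slide notes is often the scarcer resource.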
Examples From Processor Chips
[Figure panels: H-Tree, Asymmetric RC-Tree (IBM); Grids, DEC [Alphas]; Serpentines, Intel x86 [Young ISSCC97]]
Example Skews From Processor Chips
[Figure panels: DEC-Alpha 21064 clock spines; DEC-Alpha 21064 RC delays; DEC-Alpha 21164 RC local delays; DEC-Alpha 21164 RC delays for global distribution (spine + grid)]
ReShape Clocks Example (High-End ASIC)
• Balanced, shielded H-tree for pre-clock distribution
• Mesh for block-level distribution
  – All routes 5-6 µm M6/M5, shielded with 1 µm grounds
  – ~10 buffers per node
    - E.g., ganged BUFx20’s
  – Output mesh must hit every sub-block
[Figure: output mesh]
Block-Level Mesh (0.18 µm)
• Clumps of 1-6 clock buffers, surrounded by capacitor pads
• Shielded input and output M6 shorting straps
• Pre-clock connects to input shorting straps
• 1 µm M5 ribs every 20-30 µm (4 to 6 rows)
• Max 600 µm stride
Problems with Meshes
• Burn more power at low frequencies
• Difficult for ‘spare’ clock domains that will not tolerate regioning
• Block more routing resources (solution: integrated power distribution with ribs can provide shielding for ‘free’)
• Post-placement (and routing) tuning required
• No ‘beneficial skew’ possible
• Clock gating only easy at root
• Fighting tools to do analysis:
  – Clumped buffers a problem in Static Timing Analysis (STA) tools
  – Large shorted meshes a problem for STA tools
  – What does Elmore delay calculation look like for a non-tree?
  – → Need full extraction and SPICE-like simulation to determine skew
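For contrast with the mesh question above, here is what Elmore delay looks like on an RC tree: one post-order pass collects the capacitance downstream of each node, and one pre-order pass accumulates delay(v) = delay(parent) + r(v) · (subtree capacitance of v). The recursion relies on each node having a unique path to the root, which a mesh does not have. The `RCNode` class and the element values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RCNode:
    name: str                # must be unique within the tree
    r: float                 # resistance of the wire segment from the parent
    c: float                 # capacitance lumped at this node
    children: list = field(default_factory=list)


def elmore_delays(root):
    """Elmore delay from the root driver to every node of an RC *tree*."""
    subtree_cap = {}

    def collect_cap(node):              # post-order: downstream capacitance
        total = node.c + sum(collect_cap(ch) for ch in node.children)
        subtree_cap[node.name] = total
        return total

    collect_cap(root)
    delays = {}

    def walk(node, upstream):           # pre-order: accumulate delay
        d = upstream + node.r * subtree_cap[node.name]
        delays[node.name] = d
        for ch in node.children:
            walk(ch, d)

    walk(root, 0.0)
    return delays


# Hypothetical tree: source -> A, then A branches to sinks B and C.
tree = RCNode("src", 0.0, 0.0, [
    RCNode("A", 1.0, 1.0, [RCNode("B", 2.0, 1.0), RCNode("C", 1.0, 2.0)]),
])
print(elmore_delays(tree))
```

Sinks B and C come out with equal delay even though their branch resistances and loads differ, which is the kind of balancing a tree router exploits. In a mesh, current can reach a sink along several paths, so there is no single upstream path to sum over.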
Benefits of Meshes
• Deterministic since shielded all the way down to rib distribution
• No ECO placement required: all buffers preplaced before block placement
• Low latency since uses shorted (= ganged, parallel) drivers, therefore lower skew
• ECO placements of FFs later do not require rebalancing of tree
• “Idealized” clocking environment for “concurrent dance” of RTL design and timing convergence
Hybrid Structure
• Balanced tree on the top
• Mesh in the middle
  – Minimize skew
• Steiner minimum tree at the bottom
  – Minimize cost
  – Facilitate ECO
Outline
• Problem Statement
• Clock Distribution Structures
• Robustness / Signal Integrity Control
• Clock Design:
  – Skew Scheduling
  – Topology Construction
  – Embedding
Process Variation
• Intra-die and inter-die variations
  – Intra-die variation is increasingly significant since 0.13 µm technology
• Systematic and random variations
  – Systematic variation is due to equipment, process, etc.
    - Global lens aberration in lithography causes systematic variation
    - Pattern-dependent optical proximity, chemical-mechanical polishing (CMP)
  – Random variation is due to inherent variation
• Spatial correlation across a chip
• Fast vs. slow corners
Process Variation
• Metal wires
  – Width variation can be estimated by a lookup table LUT(width, spacing)
  – Thickness variation is caused by CMP, which depends on local density
  – Thickness variation also depends on wire width and spacing
  – Could be up to 30-40% in a 90 nm process
• Transistors
  – Channel length variation ($\text{delay} \propto L^{1.5}$)
  – Thin gate oxide: $t_{ox}$ variation → $V_{th}$ variation
  – Up to 30% variation in terms of driving capability
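To see what $\text{delay} \propto L^{1.5}$ implies, the sketch converts a relative channel-length change into a relative delay change. For small changes the delay moves about 1.5 times as much as L (the exponent is the log-log slope). Only the exponent comes from the slide; the ±5% and ±10% cases are illustrative.

```python
def relative_delay(l_ratio, exponent=1.5):
    """Delay relative to nominal when L is scaled by l_ratio, assuming delay ~ L**exponent."""
    return l_ratio ** exponent


for dl in (-0.10, -0.05, 0.05, 0.10):
    change = relative_delay(1.0 + dl) - 1.0
    print(f"L {dl:+.0%} -> delay {change:+.1%}")
```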
Process Variations – SPICE model
• Process variations are reflected into a statistical SPICE model
  – Usually only a few parameters have a statistical distribution (e.g., {DL, DW, TOX, VTn, VTp}) and the others are set to a nominal value
  – The nominal SPICE model is obtained by setting the statistical parameters to their nominal values
Slide courtesy of A. Nardi, J. Rabaey, K. Keutzer of UCB
Global Variations (Inter-die)
• Process variations → performance variations
• All devices have the same set of model parameter values
[Figure: critical path delay of a 16-bit adder]
Slide courtesy of A. Nardi, J. Rabaey, K. Keutzer of UCB
Local Variations (Intra-die)
• Each device instance has a slightly different set of model parameter values (aka device mismatch)
• The performance of some analog circuits strongly depends on the degree of matching of device properties
• Digital circuits are in general more immune to mismatch, but the clock distribution network is sensitive (clock skew)
Slide courtesy of A. Nardi, J. Rabaey, K. Keutzer of UCB
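A small Monte Carlo sketch illustrates why clock skew is driven by intra-die mismatch rather than inter-die shifts. It assumes two identical 8-buffer clock paths (hypothetical), a per-die factor g shared by every device on the die (global variation), and an independent per-device term (local mismatch), both Gaussian with illustrative sigmas. Global variation moves both paths together and produces no skew; local mismatch does.

```python
import random
import statistics


def mean_skew(stages=8, d0=50.0, sigma_global=0.0, sigma_local=0.0,
              trials=2000, seed=1):
    """Monte Carlo mean |skew| between two identical buffered clock paths.

    Hypothetical model: each buffer delay is d0 * g * (1 + e), where g is a
    per-die (inter-die) factor shared by every device on the die and e is
    an independent per-device (intra-die) mismatch term.
    """
    rng = random.Random(seed)

    def path_delay(g):
        return sum(d0 * g * (1.0 + rng.gauss(0.0, sigma_local))
                   for _ in range(stages))

    skews = []
    for _ in range(trials):
        g = 1.0 + rng.gauss(0.0, sigma_global)   # one draw per die
        skews.append(abs(path_delay(g) - path_delay(g)))
    return statistics.mean(skews)


print("global variation only:", mean_skew(sigma_global=0.10))
print("local mismatch only:  ", mean_skew(sigma_local=0.05))
```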
Statistical Design
• Need to account for process variations during the design phase
• Statistical design
  – Nominal design
  – Yield optimization
  – Design centering
Slide courtesy of A. Nardi, J. Rabaey, K. Keutzer of UCB
Statistical Design
Slide courtesy of A. Nardi, J. Rabaey, K. Keutzer of UCB
Process Variation Tolerance Enhancement
• Rule of thumb: balanced tree
  – Identical buffers at identical heights
  – Drive identical subtree loads
• Can we do better than this?
• Process-variation-tolerant clock design
  – Bounded-skew DME
  – Topology construction
    - With process variation tolerance in the objective
  – Useful skew scheduling
    - To the center of permissible ranges
Signal Integrity
• Crosstalk
  – Capacitive, inductive
• Supply voltage drop
  – IR, L dI/dt, LC resonance
• Temperature
  – Increased resistance with higher temperature
• Substrate coupling
  – Parasitic resistance, capacitance in the substrate layer
Crosstalk
• Due to the coupling capacitance between interconnections, a signal switching on a net (aggressor) may affect the voltage waveform on a neighboring net (victim)
[Figure panels: Noise Propagation; Increased Delay]
Circuit Model for Crosstalk
Crosstalk Simulation
Design for Crosstalk
• It can be both capacitive and inductive
  – Capacitive is dominant at current switching speeds
• To reduce it:
  – Use of shielding layer (inter-layer)
  – Use of shielding wire (intra-layer)
[Figure: cross-section with GND, VDD, GND shields above the substrate]
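Why shielding helps a clock net can be seen with a common first-order approximation that is not on the slides: the victim is one RC stage whose coupling capacitance is multiplied by a Miller coupling factor (0, 1, or 2 depending on what the neighbor does). Unshielded, the victim's delay depends on neighbor activity; with a grounded shield the factor is always 1, so the delay is fixed. The sketch assumes the shield keeps the same coupling capacitance, now to ground; the element values are illustrative.

```python
import math


def victim_delay(r_drv, c_gnd, c_cpl, mcf):
    """50% delay of a victim net modeled as a single RC stage (an assumption).

    The coupling capacitance is scaled by a Miller coupling factor (mcf):
    0 if the neighbor switches in the same direction, 1 if it is quiet or a
    grounded shield, 2 if it switches in the opposite direction.
    """
    return math.log(2) * r_drv * (c_gnd + mcf * c_cpl)


def delay_window(r_drv, c_gnd, c_cpl, shielded):
    """(min, max) victim delay over all neighbor switching activity."""
    factors = (1.0,) if shielded else (0.0, 1.0, 2.0)
    delays = [victim_delay(r_drv, c_gnd, c_cpl, k) for k in factors]
    return min(delays), max(delays)


# Illustrative values: 100 ohm driver, 100 fF to ground, 50 fF to the neighbor.
for shielded in (False, True):
    lo, hi = delay_window(100.0, 100e-15, 50e-15, shielded)
    print(f"shielded={shielded}: {lo * 1e12:.2f} ps .. {hi * 1e12:.2f} ps")
```

The unshielded window is a data-dependent delay, i.e. skew; the shielded case trades routing tracks for a deterministic delay.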
Clock Gating
• Reduce power consumption by temporarily shutting down part of the circuit
• Additional cost of enabling circuits
[Figure: FFs (D, Q) and combinational logic clocked by CLK1 and CLK2, derived from CLK by clock-enabling circuits]
Outline
• Problem Statement
• Clock Distribution Structures
• Robustness / Signal Integrity Control
• Clock Design:
  – Skew Scheduling
  – Topology Construction
  – Embedding
Skew = Local Constraint
• Timing is correct as long as the clock signals of sequentially adjacent FFs arrive within a permissible skew range
  – D: longest path, d: shortest path

$$-d + t_{\mathrm{hold}} < \mathrm{Skew} < T_{\mathrm{period}} - D - t_{\mathrm{setup}}$$

  – Skew below the permissible range: race condition; above it: cycle time violation; within it: safe
(W. Dai, UC Santa Cruz)
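Reading the inequality with skew = (clock arrival at the launching FF) minus (clock arrival at the capturing FF), the lower bound is the hold check (launch arrival + d must exceed capture arrival + t_hold) and the upper bound is the setup check (launch arrival + D + t_setup must stay below capture arrival + T_period). A minimal sketch, with hypothetical function names and illustrative numbers in ns:

```python
def permissible_skew_range(T, D, d, t_setup, t_hold):
    """Open interval of allowed skew for one launch FF -> capture FF pair.

    skew = (clock arrival at launching FF) - (clock arrival at capturing FF);
    D and d are the longest and shortest combinational delays between them.
    """
    return t_hold - d, T - D - t_setup


def classify_skew(skew, T, D, d, t_setup, t_hold):
    lo, hi = permissible_skew_range(T, D, d, t_setup, t_hold)
    if skew <= lo:
        return "race condition"
    if skew >= hi:
        return "cycle time violation"
    return "safe"


# Illustrative numbers (ns): T = 10, D = 7, d = 2, t_setup = 1, t_hold = 0.5
print(permissible_skew_range(10, 7, 2, 1, 0.5))       # (-1.5, 2)
for s in (-2.0, 0.0, 2.5):
    print(s, classify_skew(s, 10, 7, 2, 1, 0.5))
```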
“Useful Skew” → Design Robustness
• Design will be more robust if the clock signal arrival time is in the middle of the permissible skew range, rather than on the edge
[Figure: three FFs in a chain with 2 ns and 6 ns of logic, T = 6 ns; clock arrival schedules “0 0 0” (at verge of violation) and “2 0 2” (more safety margin)]
(W. Dai, UC Santa Cruz)
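The two schedules in the figure can be compared by computing, for every FF pair, the distance from its skew to the nearest edge of its permissible range. The sketch reads the figure as FF0 → FF1 with 2 ns of logic and FF1 → FF2 with 6 ns, T = 6 ns, and assumes zero setup and hold times and a single delay per path; `worst_margin` is a hypothetical helper.

```python
def worst_margin(schedule, paths, T, t_setup=0.0, t_hold=0.0):
    """Smallest distance from any pair's skew to the edge of its permissible range.

    schedule: clock arrival time per FF; paths: (i, j, d_min, d_max) for
    each combinational path from FF i to FF j.
    """
    margins = []
    for i, j, d_min, d_max in paths:
        skew = schedule[i] - schedule[j]
        lo, hi = t_hold - d_min, T - d_max - t_setup
        margins.append(min(skew - lo, hi - skew))
    return min(margins)


# Three-FF chain as read from the figure: 2 ns then 6 ns of logic, T = 6 ns.
paths = [(0, 1, 2.0, 2.0), (1, 2, 6.0, 6.0)]
print(worst_margin([0.0, 0.0, 0.0], paths, T=6.0))   # "0 0 0" -> 0.0
print(worst_margin([2.0, 0.0, 2.0], paths, T=6.0))   # "2 0 2" -> 2.0
```

With zero skew the 6 ns path sits exactly on the edge of its range; delaying the clocks of the first and last FF by 2 ns centers both pairs with 2 ns of margin each.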
Constraints on Skews
• FF_i receives the clock signal delayed by $x_i \ge \mathrm{MIN\_DEL}$
  – $0 < \alpha \le 1$: if the nominal clock delay is $x_i$, then the actual clock delay $x$ must fall within the interval $\alpha x_i \le x \le x_i$
  – For an FF to operate correctly when the clock edge arrives at time $x$, the correct input data must be present and stable during the time interval $(x - \mathrm{SETUP},\ x + \mathrm{HOLD})$
  – For $1 \le i, j \le L$ (#FFs), we compute lower and upper bounds MIN(i,j) and MAX(i,j) for the time that is required for a signal edge to propagate from FF_i to FF_j
• Avoid double-clocking (race condition):
  $x_i + \mathrm{MIN}(i,j) \ge x_j + \mathrm{HOLD}$
• Avoid zero-clocking:
  $x_i + \mathrm{SETUP} + \mathrm{MAX}(i,j) \le x_j + P$, where $P$ = clock period
Optimal Useful Skews by Linear Programming
• LP_SPEED (clock period reduction):

$$\begin{aligned} \text{minimize } & P \quad \text{s.t.} \\ & x_i - x_j \ge \mathrm{HOLD} - \mathrm{MIN}(i,j) \\ & x_j - x_i + P \ge \mathrm{SETUP} + \mathrm{MAX}(i,j) \\ & x_i \ge \mathrm{MIN\_DEL} \end{aligned}$$

• LP_SAFETY (robustness):

$$\begin{aligned} \text{maximize } & M \quad \text{s.t.} \\ & x_i - x_j - M \ge \mathrm{HOLD} - \mathrm{MIN}(i,j) \\ & x_j - x_i - M \ge \mathrm{SETUP} + \mathrm{MAX}(i,j) - P \\ & x_i \ge \mathrm{MIN\_DEL} \end{aligned}$$

Notes
- J. P. Fishburn, “Clock Skew Optimization”, IEEE Trans. Computers 39(7) (1990), pp. 945-951.
- T. G. Szymanski, “Computing Optimal Clock Schedules”, Proc. DAC, June 1992, pp. 399-404.
- Useful skew optimization is similar to retiming optimization
- Peak current reductions are a side benefit
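The cited papers solve these LPs; the sketch below does not reproduce either paper's method. It uses a swapped-in technique: every constraint in LP_SPEED and LP_SAFETY is a difference constraint of the form x_v - x_u ≤ w, so for a fixed P (or M) feasibility is a Bellman-Ford negative-cycle check, and the optimum is found by bisection because feasibility is monotone in P and in M. HOLD and SETUP are global constants as on the slide; all function names and the example circuit are hypothetical.

```python
def _schedule(n, paths, P, setup, hold, min_del, margin=0.0):
    """Clock delays x[0..n-1] meeting every constraint with slack >= margin, or None.

    Every constraint is a difference constraint x_v - x_u <= w, stored as an
    edge (u, v, w); Bellman-Ford finds a solution or a negative cycle.
    """
    ref = n                                        # reference node, x_ref = 0
    edges = []
    for i, j, d_min, d_max in paths:
        # no race:          x_i - x_j >= HOLD - MIN(i,j) + margin
        edges.append((i, j, d_min - hold - margin))
        # no zero-clocking: x_j - x_i + P >= SETUP + MAX(i,j) + margin
        edges.append((j, i, P - setup - d_max - margin))
    for i in range(n):
        edges.append((i, ref, -min_del))           # x_i - x_ref >= MIN_DEL
    dist = [0.0] * (n + 1)                         # implicit 0-weight super-source
    for _ in range(n + 1):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            return [dist[i] - dist[ref] for i in range(n)]
    return None                                    # negative cycle: infeasible


def min_period(n, paths, setup=0.0, hold=0.0, min_del=0.0, tol=1e-6):
    """LP_SPEED: smallest feasible P, by bisection (feasibility is monotone in P)."""
    lo, hi = 0.0, 1.0
    while _schedule(n, paths, hi, setup, hold, min_del) is None:
        lo, hi = hi, 2 * hi
        if hi > 1e9:
            return None                            # e.g. conflicting hold constraints
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if _schedule(n, paths, mid, setup, hold, min_del) is None:
            lo = mid
        else:
            hi = mid
    return hi, _schedule(n, paths, hi, setup, hold, min_del)


def max_margin(n, paths, P, setup=0.0, hold=0.0, min_del=0.0, tol=1e-6):
    """LP_SAFETY: largest feasible M for a fixed period P, by bisection."""
    # Each path's two constraints form a 2-cycle that caps M; +1 makes hi infeasible.
    hi = min((d_min - hold + P - setup - d_max) / 2
             for _, _, d_min, d_max in paths) + 1.0
    lo = hi - 2.0
    while _schedule(n, paths, P, setup, hold, min_del, lo) is None:
        lo -= 2 * (hi - lo)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if _schedule(n, paths, P, setup, hold, min_del, mid) is None:
            hi = mid
        else:
            lo = mid
    return lo, _schedule(n, paths, P, setup, hold, min_del, lo)


# Hypothetical 2-FF loop: FF0 -> FF1 has 2 ns of logic, FF1 -> FF0 has 6 ns.
paths = [(0, 1, 2.0, 2.0), (1, 0, 6.0, 6.0)]
P, x = min_period(2, paths)
print(f"min period {P:.4f} ns with clock delays {x}")   # zero skew would need 6 ns
M, y = max_margin(2, paths, P=6.0)
print(f"max margin at P = 6 ns: {M:.4f} ns with clock delays {y}")
```

In this loop the best period is the average of the two path delays (4 ns) rather than the longer one (6 ns), which is the "useful skew" effect; at P = 6 ns the same machinery instead spends the slack on a 2 ns margin on every constraint. A production flow would more likely call an LP solver or a parametric shortest-path algorithm than bisect.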
Outline
• Problem Statement
• Clock Distribution Structures
• Robustness / Signal Integrity Control
• Clock Design:
  – Skew Scheduling
  – Topology Design
  – Embedding
    - For zero skew (ZST-DME)
    - For bounded skew (BST-DME)
Zero-Skew Tree (ZST) Problem
• Zero Skew Clock Routing Problem (S, G): Given a set S of sink locations and a connection topology G, construct a ZST T(S) with topology G and having minimum cost.
• Skew = maximum value of $|t_d(s_0, s_i) - t_d(s_0, s_j)|$ over all sink pairs $s_i, s_j \in S$
  – $t_d$ = signal delay (from source $s_0$)
• Connection topology G = rooted binary tree with the nodes of S as leaves
  – Edge $e_a$ in G is the edge from a to its parent
  – $|e_a|$ is the (assigned) length of edge $e_a$
  – Cost = total edge length
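Under a linear delay model (delay equals wirelength from the source, the model the DME algorithm uses), both the skew and the cost of an embedded topology follow directly from the assigned edge lengths. Note that the maximum of $|t_d(s_0,s_i) - t_d(s_0,s_j)|$ over all sink pairs is simply the largest sink delay minus the smallest, so no pairwise loop is needed. The topology below (s0 → v → {a, b} → four sinks) and the edge lengths are hypothetical.

```python
def delays_skew_cost(children, edge_len, root):
    """Linear-delay check of an embedded topology.

    children: node -> list of child nodes (leaves are sinks);
    edge_len: node -> |e_node|, the length of the edge to its parent.
    Under the linear delay model, t_d(s0, s) is the path length from the root.
    """
    delays = {}
    stack = [(root, 0.0)]
    while stack:
        node, t = stack.pop()
        kids = children.get(node, [])
        if not kids:
            delays[node] = t
        for k in kids:
            stack.append((k, t + edge_len[k]))
    # max |t_i - t_j| over all sink pairs equals max(t) - min(t)
    skew = max(delays.values()) - min(delays.values())
    cost = sum(edge_len.values())
    return delays, skew, cost


# Hypothetical topology: s0 -> v -> {a, b}, a -> {s1, s2}, b -> {s3, s4}
children = {"s0": ["v"], "v": ["a", "b"], "a": ["s1", "s2"], "b": ["s3", "s4"]}
lengths = {"v": 1, "a": 2, "b": 2, "s1": 1, "s2": 1, "s3": 1, "s4": 1}
print(delays_skew_cost(children, lengths, "s0")[1:])   # balanced: (skew, cost)
lengths["s4"] = 3
print(delays_skew_cost(children, lengths, "s0")[1:])   # one longer edge
```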
Zero-Skew Example (555 sinks, 40 obstacles)
A Zero-Skew Routing Algorithm
• Finds a ZST under the linear delay model with minimum cost over all ZSTs with topology G and sink set S
• Terms
  – Manhattan arc: line segment with slope +1 or –1
  – Tilted Rectangular Region (TRR): collection of points within a fixed distance of a Manhattan arc
    - Core = Manhattan arc
    - Radius = distance
  – Merging segment = locus of feasible locations for a node v in the topology, consistent with minimum wirelength
    - If v is a sink, then ms(v) = {v}
    - If v is an internal node with children a and b, then ms(v) is the set of all points within distance |e_a| of ms(a) and within distance |e_b| of ms(b)
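A convenient way to make TRR operations constant-time is to rotate coordinates by 45 degrees: with u = x + y and v = x - y, a Manhattan arc becomes an axis-parallel segment, Manhattan (L1) distance becomes Chebyshev (L-infinity) distance, and a TRR becomes an axis-aligned box. Intersecting two boxes is O(1) and yields another box, which is again a TRR (its core is the centered segment along the longer side). The `TRR` class below is a hypothetical sketch of that representation, not code from the DME papers.

```python
from dataclasses import dataclass


def to_uv(x, y):
    """45-degree rotation: L1 distance in (x, y) = L-infinity distance in (u, v)."""
    return x + y, x - y


def to_xy(u, v):
    return (u + v) / 2, (u - v) / 2


@dataclass(frozen=True)
class TRR:
    """A tilted rectangular region, stored as an axis-aligned box in (u, v)."""
    ulo: float
    uhi: float
    vlo: float
    vhi: float

    @staticmethod
    def around_arc(p, q, radius):
        """Points within L1 distance `radius` of the Manhattan arc p-q."""
        (u1, v1), (u2, v2) = to_uv(*p), to_uv(*q)
        if u1 != u2 and v1 != v2:
            raise ValueError("core must have slope +1 or -1 (or be a point)")
        return TRR(min(u1, u2) - radius, max(u1, u2) + radius,
                   min(v1, v2) - radius, max(v1, v2) + radius)

    def intersect(self, other):
        """O(1): the intersection of two boxes is a box (or None if empty)."""
        box = TRR(max(self.ulo, other.ulo), min(self.uhi, other.uhi),
                  max(self.vlo, other.vlo), min(self.vhi, other.vhi))
        return box if box.ulo <= box.uhi and box.vlo <= box.vhi else None

    def contains(self, x, y):
        u, v = to_uv(x, y)
        return self.ulo <= u <= self.uhi and self.vlo <= v <= self.vhi


# Sinks (0, 0) and (2, 2) are at L1 distance 4; points within 2 of both:
both = TRR.around_arc((0, 0), (0, 0), 2).intersect(TRR.around_arc((2, 2), (2, 2), 2))
print(both, to_xy(both.ulo, both.vlo), to_xy(both.uhi, both.vhi))
```

The result is the Manhattan arc from (0, 2) to (2, 0): exactly the merging segment of two equal-delay sinks under the linear delay model.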
Phase 1: Tree of Merging Segments
• Goal: Construct a tree of merging segments corresponding to topology G
  – Merging segment of a node depends on the merging segments of its children ⇒ bottom-up construction
  – Let a, b be children of v. We want placements of v that allow $TS_a$ and $TS_b$ (the subtrees rooted at a and b) to be merged with minimum added wire while preserving zero skew
  – Merging cost = |e_a| + |e_b|
• Fact: The intersection of two TRRs is also a TRR and can be found in constant time
  – Constant time per new merging segment ⇒ linear time (in the size of S) to construct the entire tree
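A sketch of one bottom-up merge under the linear delay model, storing merging segments as boxes in rotated coordinates (u, v) = (x + y, x - y). Zero skew requires t_a + |e_a| = t_b + |e_b|; with |e_a| + |e_b| = d (the distance between ms(a) and ms(b)) this gives |e_a| = (d + t_b - t_a)/2. When that value falls outside [0, d], one edge gets length 0 and the other must be longer than d, i.e. snaked. The names `Box`, `sink`, `dist`, and `zero_skew_merge` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Merging segment / TRR as an axis-aligned box in (u, v) = (x+y, x-y)."""
    ulo: float
    uhi: float
    vlo: float
    vhi: float

    def grow(self, r):
        return Box(self.ulo - r, self.uhi + r, self.vlo - r, self.vhi + r)

    def meet(self, o):
        return Box(max(self.ulo, o.ulo), min(self.uhi, o.uhi),
                   max(self.vlo, o.vlo), min(self.vhi, o.vhi))


def sink(x, y):
    return Box(x + y, x + y, x - y, x - y)


def dist(a, b):
    """L1 distance between two regions = L-infinity gap between their boxes."""
    du = max(0.0, a.ulo - b.uhi, b.ulo - a.uhi)
    dv = max(0.0, a.vlo - b.vhi, b.vlo - a.vhi)
    return max(du, dv)


def zero_skew_merge(ms_a, t_a, ms_b, t_b):
    """Merge children a, b into parent v: returns (ms_v, t_v, |e_a|, |e_b|)."""
    d = dist(ms_a, ms_b)
    e_a = (d + t_b - t_a) / 2
    if e_a < 0:                 # a is already too slow: snake on b's side
        e_a, e_b = 0.0, t_a - t_b
    elif e_a > d:               # b is too slow: snake on a's side
        e_a, e_b = t_b - t_a, 0.0
    else:
        e_b = d - e_a
    ms_v = ms_a.grow(e_a).meet(ms_b.grow(e_b))
    return ms_v, t_a + e_a, e_a, e_b


# Hypothetical sinks for a topology v -> {a, b}, a -> {s1, s2}, b -> {s3, s4}.
s1, s2, s3, s4 = sink(0, 0), sink(2, 2), sink(6, 0), sink(6, 4)
ms_a, t_a, *_ = zero_skew_merge(s1, 0.0, s2, 0.0)
ms_b, t_b, *_ = zero_skew_merge(s3, 0.0, s4, 0.0)
ms_v, t_v, e_a, e_b = zero_skew_merge(ms_a, t_a, ms_b, t_b)
print(ms_a, t_a)
print(ms_b, t_b)
print(ms_v, t_v, e_a, e_b)
```

Each merge does a constant amount of work, so building the whole tree of merging segments is linear in the number of sinks, as the slide states.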
Phase 2: Find Node Placements
• Goal: Find exact locations (“embeddings”) pl(v) of internal nodes v in the ZST topology
  – If v is the root node, then any point on ms(v) can be chosen as pl(v)
  – If v is an internal node other than the root, and p is the parent of v, then v can be embedded at any point in ms(v) that is at distance |e_v| or less from pl(p)
  – Detail: create a square TRR trr_p with radius |e_v| and core equal to pl(p); the placement of v can be any point in ms(v) ∩ trr_p
• Each instruction is executed at most once for each node in G, and TRR intersection is O(1) time ⇒ Find_Exact_Placements is O(n) ⇒ DME is O(n)
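One way to implement the top-down rule, assuming ms(v) is stored as a box (ulo, uhi, vlo, vhi) in rotated coordinates (u, v) = (x + y, x - y): clamp the parent's rotated coordinates into the box. Clamping each coordinate independently gives the point of ms(v) closest to pl(p) in L1 distance, so it lies in ms(v) ∩ trr_p whenever that set is non-empty. Choosing the closest point is our assumption; the slide allows any point of the intersection. Function names are hypothetical.

```python
def to_uv(x, y):
    return x + y, x - y


def to_xy(u, v):
    return (u + v) / 2, (u - v) / 2


def place_root(ms):
    """Any point of ms(root) is valid; this picks one end of it."""
    ulo, _, vlo, _ = ms
    return to_xy(ulo, vlo)


def place_node(ms, parent_xy, e_len):
    """Embed non-root node v: a point of ms(v) within L1 distance e_len of pl(p).

    ms = (ulo, uhi, vlo, vhi) is ms(v) as a box in rotated coordinates.
    """
    ulo, uhi, vlo, vhi = ms
    pu, pv = to_uv(*parent_xy)
    # Per-coordinate clamping minimizes the L-infinity (u, v) distance,
    # which equals the L1 (x, y) distance to pl(p).
    u = min(max(pu, ulo), uhi)
    v = min(max(pv, vlo), vhi)
    if max(abs(u - pu), abs(v - pv)) > e_len + 1e-9:
        raise ValueError("ms(v) is farther than |e_v| from pl(p)")
    return to_xy(u, v)


# Hypothetical merging segments (rotated boxes) and edge lengths for v -> {a, b}.
root = place_root((5, 5, 1, 5))
pl_a = place_node((2, 2, -2, 2), root, 3.0)
pl_b = place_node((8, 8, 4, 4), root, 3.0)
print(root, pl_a, pl_b)
```

If the chosen point is closer to pl(p) than |e_v|, the edge still has to be routed with its full assigned length (a detour) to keep the delays equal.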
Outline
• Problem Statement
• Clock Distribution Structures
• Robustness / Signal Integrity Control
• Clock Design:
  – Skew Scheduling
  – Topology Design
  – Embedding
    - For zero skew (ZST-DME)
    - For bounded skew (BST-DME)
Non-Zero Skew Bounds
• Given a skew bound, where can internal nodes of the given topology (e.g., a, b, v) be placed?
[Figure: topology s0 → v → {a, b}, a → {s1, s2}, b → {s3, s4}; maps of the skew at candidate locations of a, b, and v (labels 0, 2, 4, 6)]
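As a minimal illustration of merging under a skew bound B (not the full BST-DME merging-region construction), the sketch assumes linear delay, each child subtree summarized by its (min, max) sink delay, and a merge point on a shortest path between the children, so |e_a| + |e_b| = d. The resulting skew is max(a_max + e_a, b_max + d - e_a) - min(a_min + e_a, b_min + d - e_a), which is convex in e_a, so the feasible values of e_a form an interval. With B = 0 and single-delay children it reduces to the zero-skew formula e_a = (d + t_b - t_a)/2. The function name is hypothetical.

```python
def bst_merge_interval(a_delays, b_delays, d, B):
    """Feasible |e_a| for merging subtrees a and b under skew bound B, or None.

    a_delays, b_delays = (min, max) sink delays inside each subtree (linear
    delay model); the merge point lies on a shortest path, so |e_b| = d - |e_a|.
    """
    a_min, a_max = a_delays
    b_min, b_max = b_delays
    if a_max - a_min > B or b_max - b_min > B:
        return None                      # a child already violates the bound
    # (a_max + e_a) - (b_min + d - e_a) <= B  and  (b_max + d - e_a) - (a_min + e_a) <= B
    lo = max(0.0, (b_max + d - a_min - B) / 2)
    hi = min(float(d), (B + b_min - a_max + d) / 2)
    return (lo, hi) if lo <= hi else None


# Two sinks at distance 4 with zero internal delay:
for B in (0, 2, 4):
    print(B, bst_merge_interval((0, 0), (0, 0), 4, B))
```

A larger bound widens the set of legal merge points, which is what gives BST-DME its extra freedom to reduce wirelength compared with zero-skew routing.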
BST-DME Bottom-Up Phase
• Bottom-up: build a tree of merging regions corresponding to the given topology
[Figure: skew bound B = 4; topology s0 → v → {a, b}, a → {s1, s2}, b → {s3, s4}; merging regions mr(a), mr(b), mr(v) built over sinks s1 to s4]
BST-DME Top-Down Phase
[Figure: skew bound B = 4; embedding of v, a, b for the topology s0 → v → {a, b}, a → {s1, s2}, b → {s3, s4}]
Good Luck for the Mid-Term!