3.2 - ACM/IEEE International Workshop on Timing Issues 2006


Timing Analysis Challenges for High-Speed CPUs at 90nm and Below
Agenda:
- ITRS Predictions & Design Challenges
- Timing Analysis at Intel
- Current issues and solutions
- Mid-term challenges
- Summary

Avi Efrati, Moshe Kleyner
The VLSI Chip in 2010...
Process technology        25 nm gate length
Transistors               1,546 M
Logic transistors         300 M
Size                      280 mm²
Clock frequency           11.5 GHz
Chip I/Os                 3,840
Wiring levels (metals)    9 - 10
Voltage                   0.8 - 1.0 V
Power                     120 - 218 Watts
Supply current            ~160 Amps
Source: ITRS ‘01 roadmap
Timing verification for Intel CPUs
- Synchronous design style, mostly
  - Multiple synchronized clocks, GHz range
  - NO trend toward asynchronous design in the near future
  - Deep pipelining
- Internal static timer: Tango
  - Cell-based, using abstract models for custom blocks
  - Handles transparent latches and sequential transparent loops, with both BFS and DFS timing propagation options
  - Generates and uses a proprietary abstract timing model for hierarchical timing
    - At each level an abstract timing model can be created for the next level
    - Typically 2-3 timing hierarchy levels
  - PathMill is used at the device level and produces the same abstract model
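The slides do not describe Tango's internals. As a hedged illustration of the two propagation orders named above, the sketch below computes latest arrival times on an acyclic timing graph in two ways: breadth-first (levelized, Kahn-style) and depth-first (memoized recursion). The graph format is an assumption made for this example: a `fanin` map plus `delays` keyed by `(driver, sink)`. Loops through transparent latches are not handled here.

```python
from collections import deque


def arrivals_bfs(fanin, delays):
    """Levelized (BFS) propagation: time a node once all of its fanins are timed."""
    fanout, pending = {}, {}
    for node, ins in fanin.items():
        pending[node] = len(ins)
        for u in ins:
            fanout.setdefault(u, []).append(node)
            pending.setdefault(u, 0)  # primary inputs have no fanin
    arrival = {}
    ready = deque(n for n, count in pending.items() if count == 0)
    while ready:
        node = ready.popleft()
        # Latest arrival = worst fanin arrival plus arc delay (0 for primary inputs)
        arrival[node] = max(
            (arrival[u] + delays[(u, node)] for u in fanin.get(node, [])),
            default=0.0,
        )
        for v in fanout.get(node, []):
            pending[v] -= 1
            if pending[v] == 0:
                ready.append(v)
    return arrival


def arrival_dfs(fanin, delays, node, memo=None):
    """Depth-first propagation: recurse backward from `node`, memoizing results."""
    if memo is None:
        memo = {}
    if node not in memo:
        memo[node] = max(
            (arrival_dfs(fanin, delays, u, memo) + delays[(u, node)]
             for u in fanin.get(node, [])),
            default=0.0,
        )
    return memo[node]


# Hypothetical four-input example
fanin = {"n1": ["a", "b"], "z": ["n1", "c"]}
delays = {("a", "n1"): 10, ("b", "n1"): 12, ("n1", "z"): 5, ("c", "z"): 20}
print(arrivals_bfs(fanin, delays)["z"], arrival_dfs(fanin, delays, "z"))
```

Both orders give the same result on a DAG. BFS times the whole graph level by level. DFS computes only what is needed for the requested endpoint.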
What's under the hood?
- Handling transparent loops
- False paths
- Hierarchical analysis
- Shell models
Loops…
[Figure: loop examples clocked by clk, clk# and clk2]
- Combinational loops are disallowed
  - Local self-resetting circuitry may exist
- Sequential loops exist
  - Formed by combinational paths and transparent latches
  - They actually form SCCs (strongly connected components), which are handled automatically
  - Typical for FSMs implemented with latches
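The slide states that sequential loops form SCCs and are handled automatically. It does not say how. The sketch below is a hedged stand-in that uses Tarjan's algorithm to find the SCCs of a timing graph. It then classifies each cycle: a cycle containing at least one transparent latch is a legal sequential loop, and one with no latch is a disallowed combinational loop. The node names and the `latches` set are hypothetical.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm. graph: node -> list of successor nodes."""
    index, low, stack, on_stack, sccs = {}, {}, [], set(), []
    counter = [0]

    def visit(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, []):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:  # v is the root of an SCC: pop it off the stack
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == v:
                    break
            sccs.append(comp)

    for v in list(graph):
        if v not in index:
            visit(v)
    return sccs


def classify_loops(graph, latches):
    """Split cyclic SCCs into sequential (contain a latch) and combinational loops."""
    sequential, combinational = [], []
    for comp in strongly_connected_components(graph):
        is_cycle = len(comp) > 1 or comp[0] in graph.get(comp[0], [])
        if is_cycle:
            target = sequential if latches & set(comp) else combinational
            target.append(sorted(comp))
    return sequential, combinational


# Hypothetical graph: a latch-based FSM loop plus an illegal combinational loop x <-> y
graph = {"L1": ["g1"], "g1": ["L2"], "L2": ["g2"], "g2": ["L1", "out"],
         "x": ["y"], "y": ["x"]}
print(classify_loops(graph, {"L1", "L2"}))
```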

False Paths
- Manual marking of false paths, considered in timing analysis
- Automatic SAT-based false paths
  - Work done with K. Sakallah, U. Mich.
  - Applied in combinational logic
[Figure: example circuit with inputs a-g and output z, annotated with side-input values b=0, c=1, d=0, e=1, c=0]
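The SAT-based method itself is not shown on the slide. The sketch below illustrates the underlying question under a simplified criterion, static sensitization. A path counts as true if some input vector sets every side input along it to its non-controlling value. Exhaustive enumeration stands in for a SAT solver, which only works for tiny circuits. The AND/OR netlist is hypothetical: it makes side input `c` need to be both 1 and 0, so the path is false. This mirrors the conflicting c=1/c=0 annotations in the figure.

```python
from itertools import product

# Value an AND/OR side input must hold so it does not mask the on-path signal
NON_CONTROLLING = {"AND": 1, "OR": 0}


def evaluate(netlist, order, assignment):
    """Simulate a small AND/OR netlist for one primary-input assignment."""
    values = dict(assignment)
    for gate in order:
        kind, ins = netlist[gate]
        bits = [values[i] for i in ins]
        values[gate] = int(all(bits)) if kind == "AND" else int(any(bits))
    return values


def is_statically_sensitizable(netlist, order, primary_inputs, path):
    """True if some input vector sets every side input along `path` to its
    non-controlling value. Brute-force enumeration stands in for a SAT solver."""
    for vector in product((0, 1), repeat=len(primary_inputs)):
        values = evaluate(netlist, order, zip(primary_inputs, vector))
        if all(
            values[side] == NON_CONTROLLING[netlist[gate][0]]
            for prev, gate in zip(path, path[1:])
            for side in netlist[gate][1]
            if side != prev
        ):
            return True
    return False  # no vector works: the path is (statically) false


# Hypothetical netlist: c must be 1 at g1 (AND) but 0 at z (OR)
netlist = {"g1": ("AND", ["a", "c"]), "z": ("OR", ["g1", "c"])}
order = ["g1", "z"]
print(is_statically_sensitizable(netlist, order, ["a", "c"], ["a", "g1", "z"]))
```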
Hierarchical Analysis
- Cannot analyze the full chip at transistor or gate level
  - Huge data, impractical run-time
- Abstract blocks as compact models
  - Hide internal details not relevant at chip level; assume predefined clocks
  - Electrical interface and timing model as accurate as possible
  - The abstract model also supports timing transparency (BLUE BOX)
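Intel's abstract model format is proprietary and is not shown here. As a hedged sketch of the basic idea of a compact block model, the code below collapses a block's internal timing graph into worst-case pin-to-pin arcs and discards the internal nodes. It uses `graphlib.TopologicalSorter` from the Python 3.9+ standard library. The edge format and the pin names are assumptions.

```python
from graphlib import TopologicalSorter


def abstract_arcs(edges, inputs, outputs):
    """edges: {(u, v): delay_ps}. Returns {(in_pin, out_pin): worst delay}
    for every connected input/output pair. Internal nodes are dropped."""
    preds = {}
    for u, v in edges:
        preds.setdefault(v, set()).add(u)
        preds.setdefault(u, set())
    order = list(TopologicalSorter(preds).static_order())  # fanins first
    arcs = {}
    for src in inputs:
        arrival = {src: 0.0}  # longest path from this input only
        for node in order:
            for p in preds[node]:
                if p in arrival:
                    cand = arrival[p] + edges[(p, node)]
                    if cand > arrival.get(node, float("-inf")):
                        arrival[node] = cand
        for dst in outputs:
            if dst in arrival:
                arcs[(src, dst)] = arrival[dst]
    return arcs


# Hypothetical block: inputs A, B; internal node n1; output Y
edges = {("A", "n1"): 10, ("B", "n1"): 15, ("n1", "Y"): 20, ("A", "Y"): 25}
print(abstract_arcs(edges, ["A", "B"], ["Y"]))
```

A real abstract model would also carry slew and load dependence and timing checks. This sketch keeps only max delays.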
Shell Model
[Figure: block with abstracted core(s) surrounded by electrical shell elements (FF1, FF2, latches L1-L3, combinational cells, MB1, MB2) between the IN/OUT pins and the flat full-chip interconnect]
- Interface cells and interconnect are preserved
  - The user may select a shell deeper than 1
  - The user may expose some transparent latches
  - Balance core complexity against the number of cells exposed at full chip (Deep Shell Model)
- Cores are abstract timing models
- Full-chip analysis uses shell models of blocks
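The slide does not define how shell depth is measured. The sketch below is a hedged guess that treats depth as the number of register stages the shell may cross from the block ports. It walks forward from the inputs and backward from the outputs. It keeps every cell it reaches and stops at the `depth`-th register, which becomes the shell boundary. Everything else stays in the abstracted core. The graph representation and names are hypothetical.

```python
from collections import deque


def reverse(graph):
    """Fanout map -> fanin map."""
    rev = {}
    for u, vs in graph.items():
        for v in vs:
            rev.setdefault(v, []).append(u)
    return rev


def _walk(graph, starts, registers, depth):
    best = {s: 0 for s in starts}  # fewest register stages crossed to reach each node
    queue = deque(starts)
    while queue:
        node = queue.popleft()
        stages = best[node]
        if node in registers and stages == depth:
            continue  # boundary register: keep it but do not go past it
        for nxt in graph.get(node, []):
            n = stages + (1 if nxt in registers else 0)
            if n <= depth and n < best.get(nxt, depth + 1):
                best[nxt] = n
                queue.append(nxt)
    return set(best)


def extract_shell(fanout, inputs, outputs, registers, depth=1):
    """Cells kept in the shell; ports themselves are excluded."""
    kept = _walk(fanout, inputs, registers, depth) | _walk(reverse(fanout), outputs, registers, depth)
    return kept - set(inputs) - set(outputs)


# Hypothetical block: IN -> c1 -> FF1 -> c2 -> FF2 -> c3 -> OUT
fanout = {"IN": ["c1"], "c1": ["FF1"], "FF1": ["c2"], "c2": ["FF2"],
          "FF2": ["c3"], "c3": ["OUT"]}
regs = {"FF1", "FF2"}
print(sorted(extract_shell(fanout, ["IN"], ["OUT"], regs, depth=1)))
```

With `depth=1`, `c2` stays in the core. With `depth=2`, it is exposed as well, which trades a smaller core for more cells at full chip.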

Current and near-term challenges
- CrossTalk impact on timing
- Active interconnect
- Mixed abstraction, device to full chip
- Use of domino as characterized cells
- SoC challenges

CrossTalk impact on Timing
- CrossTalk has noise and timing impact
  - Victim transitions: for timing
  - Victim stable: for functional noise
- Search for highest peak noise while…
  - Timing filtering followed by local logic filtering
  - SMCF (smart MCF) or AWE-based (asymptotic waveform evaluation) peak
  - Timing iterations to converge CrossTalk impact
- CrossTalk timing effect may be approximated as a multiplier on the coupling capacitance, Xcap (Miller coupling factor, MCF), but…
  - Default MCF may over- or under-estimate the effect
  - MCF is slope dependent, difficult to set up front
  - AWE + superposition gives good results but may be too costly to apply everywhere
  - Accuracy vs. run-time tradeoff is key
- Very active research in the last few years!!
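To make the Miller approximation concrete, the sketch below folds the coupling capacitance into a grounded load scaled by an MCF. It then applies a first-order RC delay, ln(2)·R·C. The 0/1/2 factors are the common textbook defaults for an aggressor switching in the same direction, staying quiet, or switching in the opposite direction. The slide notes that real MCF is slope dependent, so these defaults may over- or under-estimate the effect. The driver resistance and capacitances are hypothetical.

```python
import math

# Textbook default Miller coupling factors, keyed by aggressor behavior
MCF = {"same": 0.0, "quiet": 1.0, "opposite": 2.0}


def effective_cap(c_ground, c_couple, mcf):
    """Miller approximation: the coupling cap counts as mcf * Cx to ground."""
    return c_ground + mcf * c_couple


def rc_delay(r_drive, c_load):
    """50% delay of a single-pole RC step response: ln(2) * R * C."""
    return math.log(2) * r_drive * c_load


r, cg, cx = 1e3, 10e-15, 10e-15  # hypothetical: 1 kOhm driver, 10 fF ground, 10 fF coupling
for case, k in MCF.items():
    print(case, round(rc_delay(r, effective_cap(cg, cx, k)) * 1e12, 2), "ps")
```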
Fitting SMCF to experimental data
- Physically, MCF depends on the slope ratio L = Tvic/Tagg (victim transition time over aggressor transition time)
- Experimentally fitted with the equation SMCF = a - b*exp(-L)
[Figure: "Best fitting of MCF": SMCF (1.4 to 2.6) versus slope ratio Tvic/Tagg (0 to 3); curves: SMCF interpolated to err=0, SMCF best fitted to SMCF = a - b*exp(-L), SMCF initially used in experiments]
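The fitting procedure is not given on the slide. The sketch below shows one straightforward way to fit SMCF = a - b*exp(-L). Substituting x = exp(-L) makes the model linear, y = a + (-b)·x, so ordinary least squares gives a and b in closed form. The example data are synthetic, generated from chosen a and b values. They are not the experimental points in the figure.

```python
import math


def fit_smcf(slope_ratios, mcf_values):
    """Least-squares fit of SMCF(L) = a - b*exp(-L) via the substitution x = exp(-L)."""
    xs = [math.exp(-L) for L in slope_ratios]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(mcf_values) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, mcf_values))
    slope = sxy / sxx  # equals -b
    a = my - slope * mx
    return a, -slope


def smcf(L, a, b):
    return a - b * math.exp(-L)


# Synthetic check: sample the model at L = 0 .. 3 with a=2.5, b=1.0, then refit
Ls = [0.25 * k for k in range(13)]
synthetic = [smcf(L, 2.5, 1.0) for L in Ls]
a_fit, b_fit = fit_smcf(Ls, synthetic)
print(round(a_fit, 6), round(b_fit, 6))
```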
"Active" Interconnect
- Interconnect has not been negligible for quite some time; now it is becoming active!
- Interconnect may be:
  - Simple wire
  - Buffered (inverting or not)
  - Pipelined (and buffered)
- Repeaters may be buffers, inverters, latches or flops
  - Virtual (early design) or real repeaters
- Pipelining the interconnect is considered simultaneously in RTL, floor plan and early timing
- Mutual inductance impact is being assessed
- Asynchronous long-distance on-chip communication?
[Figure: driver (Drv) connected to receivers (Rcv) through repeated interconnect]
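As a rough early-planning illustration of pipelining the interconnect, the sketch below assumes an optimally repeated wire whose delay grows linearly with length. It counts how many clock cycles the wire needs once a fixed per-stage overhead (flop/latch delay plus skew) is subtracted from the cycle. The delay per mm and the overhead value are hypothetical. The cycle time comes from the 11.5 GHz ITRS projection in the table.

```python
import math


def pipeline_segments(length_mm, delay_ps_per_mm, cycle_ps, overhead_ps):
    """Clock cycles needed to cross a repeated wire (delay assumed linear in length).
    overhead_ps models per-stage flop/latch overhead and skew."""
    usable_ps = cycle_ps - overhead_ps
    if usable_ps <= 0:
        raise ValueError("cycle too short for the stage overhead")
    return math.ceil(length_mm * delay_ps_per_mm / usable_ps)


cycle = 1000 / 11.5  # ps per cycle at 11.5 GHz
segments = pipeline_segments(10, 50, cycle, 20)  # hypothetical 10 mm wire, 50 ps/mm, 20 ps overhead
print(segments, "cycles,", segments - 1, "pipeline flops")
```

Because the cycle shrinks faster than wire delay per mm, even moderately long wires become multi-cycle. This is why the slide couples interconnect pipelining with RTL and floor planning.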
Mixed Abstraction
- Layout becomes more cell-based… but circuit families in cells are more complex
  - Some circuits may be characterized as cells; some may require device-level analysis
  - Fluid cells & device-level optimization
- Comprehend devices, cells and abstract models in the same run
  - Single timing graph
- May need on-the-fly dynamic analysis on parts of the circuit
  - Use circuit recognition capabilities
  - Requires stimuli generation
  - More detailed waveforms, not only slope
  - Sophisticated timing checks for domino
  - Propagate pulses too, not only arrival times
Mixed-level Timing
- Cells, abstracts and devices co-exist at the analysis level
- Choose a flexible abstraction/accuracy trade-off
[Figure: core modeled as mixed devices/cells/abstracts]
Domino characterization
- Regular or footless domino as characterized cells
  - Will be supported in cell-based timing
  - Additional domino latches, etc.
- Delay similar to static cells and latches
- Checks are more complex!! (next slide)
[Figure: Domino AND2 and footless AND2, each showing clk, keeper, inputs, domino node and output]
See the 1999 paper by Van Campenhout, Sakallah and Mudge.
Pulse Width Checks
- Need a sufficiently wide pulse at the domino node
  - Ensure pulse width to the next stage
  - Ensure the feedback can hold data
- Modeling issues
  - Slopes of inputs
  - Pulse width per discharge path
  - Translating the intersection of the inputs into a pulse at the domino node
- Disallowing min-transparency converts the pulse width check into a setup check
- Non-transparency hold check
[Figure: waveforms of eval, inputs a and b, and the domino node]
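The sketch below is a minimal illustration of "translating the intersection of the inputs into a pulse at the domino node". It assumes a series (AND) discharge path, so the node can discharge only while eval and every input are high. The resulting pulse is then the intersection of their high windows, and the check compares its width with a minimum. Treating each signal as an ideal (start, end) interval ignores input slopes, which the slide lists as a separate modeling issue. The numbers are hypothetical.

```python
def intersect(*windows):
    """Overlap of (start, end) intervals, or None if they do not overlap."""
    start = max(w[0] for w in windows)
    end = min(w[1] for w in windows)
    return (start, end) if end > start else None


def domino_pulse_ok(eval_window, input_pulses, min_width_ps):
    """Series discharge path: the node pulse is the intersection of eval and all inputs."""
    pulse = intersect(eval_window, *input_pulses)
    return pulse is not None and pulse[1] - pulse[0] >= min_width_ps


# Hypothetical waveforms (ps): eval high 0-100, a high 10-60, b high 30-80
print(intersect((0, 100), (10, 60), (30, 80)))
print(domino_pulse_ok((0, 100), [(10, 60), (30, 80)], min_width_ps=25))
```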
SoC challenges
- Multi-core CPUs or high-integration SoCs
  - New integration level in all areas: RTL, timing, layout, testing, etc.
- Timing challenges
  - New level of hierarchical timing; more need for functionality-aware timing and better abstract models
    - Optimize interfaces without core redesign
    - Integrative approach: zoom in from abstract to detailed in the same environment
  - Multiple clocks, possibly asynchronous to each other
  - Inter-module communication, protocols, early spec and accurate verification
  - More in-die variation; instances of the same module may operate at different Vcc/temperature, etc.

Mid-term challenges
- MIS: Multiple Input Switching
- Process and environment variability
  - Voltage and temperature
  - Process variability
- Timing challenges due to leakage reduction techniques
  - Sleep transistors: usage methodology and support in timing
MIS: Multiple Input Switching
- More MIS situations as frequency increases
  - Fewer stages in a clock cycle
  - Slope steepness increases more slowly than frequency
- Broad range of effects
  - Single stage is well known
  - Impact across stages is more subtle
    - The load stage may present a different effective load due to Miller coupling
  - Either slow-down or speed-up
  - Holding the side input with a real driver versus an "ideal voltage" affects accuracy
- Characterization/modeling issues
One gate slow-down/speed-up
- Effectively adds device strengths (inputs a and b switching together)
- Incremental Vds across the top device in a series stack (inputs a and b)
- Mitigate with legging
[Figure: two panels of voltage (Volts) vs. time (0-200 ps) comparing MIS with single input switching, annotated 39.7% speedup and 12.6% pushout]
Two gates, Fanout pull-in
- c switching with a, b or both (MIS)
- Miller coupling between c and o
- Position dependent
- No generic model
- Miller coupling droop causes a speedup on o
- Mitigate with legging, or by pushing down the stack if only one signal is critical
[Figure: gates with inputs a, b, c and outputs o, o2; waveforms of o, o2 and c (Volts vs. time, 0-200 ps) against single input switching of o, annotated 15.6% speedup]
Fanout Signal Location
- c switching with a, b or both (MIS)
- Either speedup or pushout, depending on the connection
  - Connected to pin a: -15.6% to 12.6% variation
  - Connected to pin b: -0.8% to 0.3% variation
[Figure: gates with inputs a, b, c and outputs o, o2; waveforms (Volts vs. time, 0-200 ps) for o/c and c/o]
MIS: Modeling issues
- Not so easy to model in CBD (Cell-Based Design)
- Min/max timing windows provide a range of switching times
  - Window overlap of two inputs allows MIS but doesn't guarantee it
- Assuming full MIS leads to over-design
- Most important: check the MIS effect on min delay, which may lead to chip failure
  - Max-delay MIS may only reduce the operating frequency
  - Possibly treat max-delay MIS as a random variable over the overlap window
- Easier to consider MIS in BFS timing propagation
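The sketch below illustrates the two treatments above under stated assumptions. For min delay, it applies the full MIS speedup whenever the two inputs' timing windows overlap at all, which is the conservative choice. For max delay, it estimates the probability that the inputs actually switch within a given skew of each other. This assumes arrivals are uniformly distributed over their windows, which the slides do not claim. The speedup value, windows and skew are hypothetical.

```python
import random


def windows_overlap(w1, w2):
    """True if the (earliest, latest) switching windows overlap."""
    return max(w1[0], w2[0]) <= min(w1[1], w2[1])


def min_delay_with_mis(nominal_min_ps, w1, w2, mis_speedup):
    """Conservative min delay: any window overlap allows MIS, so apply the full speedup."""
    if windows_overlap(w1, w2):
        return nominal_min_ps * (1 - mis_speedup)
    return nominal_min_ps


def mis_probability(w1, w2, skew_ps, trials=20_000, seed=0):
    """Monte Carlo estimate of P(|t1 - t2| <= skew_ps), arrivals uniform in windows."""
    rng = random.Random(seed)
    hits = sum(
        abs(rng.uniform(*w1) - rng.uniform(*w2)) <= skew_ps for _ in range(trials)
    )
    return hits / trials


# Hypothetical windows (ps) and a 20% MIS speedup
print(min_delay_with_mis(100.0, (0, 10), (5, 15), 0.2))
print(mis_probability((0, 10), (5, 15), skew_ps=2.0))
```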
Process and Environment Variability
- Both deterministic and random variation
- The absolute σ of CD (critical dimension) does not decrease at the same pace as channel length
  - Thus the relative value of L and Vt variation increases
- Lower voltages, higher currents
  - Non-uniform Vdd on chip; consider Vdd in timing
  - Big drivers may "starve" neighbors
- "Nominal" timing is not good enough to accurately predict silicon
  - Are variations causing significant critical path reordering?
  - Worst-casing all effects reduces the design space or makes design impossible
  - Consider a chip map for deterministic variations
  - Need a statistical approach in STA for random effects
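As a small illustration of why nominal timing can mislead, the sketch below samples two paths with random per-stage delay variation. It counts how often the nominally faster path turns out slower, which is the critical-path reordering the slide asks about. It assumes independent Gaussian variation per stage with sigma proportional to the nominal delay. Real variation has spatially correlated components that a chip map would capture. The path delays are hypothetical.

```python
import random


def path_delay(stage_delays_ps, sigma_frac, rng):
    """One random sample of a path delay: independent Gaussian variation per stage."""
    return sum(rng.gauss(d, sigma_frac * d) for d in stage_delays_ps)


def reorder_probability(path_a, path_b, sigma_frac, trials=20_000, seed=1):
    """Fraction of samples in which path_b is slower than path_a."""
    rng = random.Random(seed)
    flips = sum(
        path_delay(path_b, sigma_frac, rng) > path_delay(path_a, sigma_frac, rng)
        for _ in range(trials)
    )
    return flips / trials


# Hypothetical: path A (100 ps) is nominally 2 ps slower than path B (98 ps)
a, b = [50.0, 50.0], [49.0, 49.0]
print(reorder_probability(a, b, 0.0), reorder_probability(a, b, 0.10))
```

With no variation, the order never flips. With 10% per-stage sigma, the nominally non-critical path is slower in a large fraction of samples.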
Reducing leakage power
- Most important for mobile devices and Internet servers; as important as speed!
- Standby leakage
  - Power consumed when the whole chip is idle; Tj is NOT high (spec temperature for mobile is 50°C)
  - Impact on battery life for portable devices
- Active leakage
  - Power consumed due to device leakage while the chip is working and Tj is high (110°C)
- Subthreshold and gate leakage have a significantly higher impact on overall chip thermal design power and frequency
- Ptot = Pswitch + Pleak
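The power split above can be written out with the standard first-order models: Pswitch = α·C·Vdd²·f and Pleak = Vdd·Ileak. The operating points below are hypothetical and are not taken from the slides. The standby case sets activity to zero and assumes a lower leakage current, reflecting the lower Tj when the chip is idle.

```python
def switching_power(alpha, c_switched_f, vdd, freq_hz):
    """Dynamic (switching) power: activity * C * Vdd^2 * f, in watts."""
    return alpha * c_switched_f * vdd ** 2 * freq_hz


def leakage_power(vdd, i_leak_a):
    """Static leakage power: Vdd * total leakage current, in watts."""
    return vdd * i_leak_a


def total_power(alpha, c_switched_f, vdd, freq_hz, i_leak_a):
    """Ptot = Pswitch + Pleak."""
    return switching_power(alpha, c_switched_f, vdd, freq_hz) + leakage_power(vdd, i_leak_a)


# Hypothetical operating points: active (hot) vs. standby (idle, cooler)
active = total_power(alpha=0.1, c_switched_f=100e-9, vdd=1.0, freq_hz=3e9, i_leak_a=20.0)
standby = total_power(alpha=0.0, c_switched_f=100e-9, vdd=1.0, freq_hz=3e9, i_leak_a=2.0)
print(active, standby)
```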
Leakage Gating with Sleep Transistor
- Leakage is a main concern below 90nm
- Partition the chip to allow individual control of the sleep transistors
  - The sleep transistor is on while the block is working
  - The sleep transistor is off while the block is idle
[Figure: chip partitioned into Blocks A-D, each with its own sleep control]
Sleep transistors in timing
- Difficult to comprehend in STA
  - Many cells share the same virtual ground through one sleep transistor (legged/distributed in reality)
  - The voltage of the virtual ground depends on the current drawn by all active gates on the same sleep transistor
- Need to guarantee max/min voltage on the virtual ground
  - How can min/max GND voltage be verified statically?
  - Need cell models and interaction models for cells on different virtual grounds
  - Logic grouping by time of common switching
  - Estimate the current needed in the worst case
- Lack of support in timing tools is the main limiting factor for using this technique
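The sketch below is a rough illustration of "logic grouping by time of common switching" and worst-case current estimation. It bins gate switching events that share one sleep transistor by time and sums their peak currents per bin. It then converts the worst bin into a virtual-ground rise, modeling the sleep device as a linear resistor. The binning rule, the linear-resistor model and all values are assumptions made for illustration.

```python
from collections import defaultdict


def worst_virtual_ground(switch_events, r_sleep_ohm, bin_ps):
    """switch_events: (switch_time_ps, peak_current_A) for gates sharing one sleep
    transistor. Gates switching in the same time bin are assumed to draw current
    together. Returns the worst-case virtual-ground rise V = I * R in volts."""
    bins = defaultdict(float)
    for t_ps, i_a in switch_events:
        bins[int(t_ps // bin_ps)] += i_a
    worst_i = max(bins.values(), default=0.0)
    return worst_i * r_sleep_ohm


# Hypothetical: two gates switch within the same 10 ps bin, a third much later
events = [(5.0, 1e-3), (8.0, 2e-3), (40.0, 1e-3)]
print(worst_virtual_ground(events, r_sleep_ohm=10.0, bin_ps=10.0))
```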
Summary
- STA is a key component of chip design
  - New VDSM (very deep submicron) and high-frequency challenges
- Hierarchical models cope with full-chip complexity
  - Electrical interaction across logical hierarchy boundaries
- CrossTalk, MIS, variability and other phenomena need efficient solutions
  - Will require more dynamic device-level analysis within static timing tools
- Closer interaction with Logic/Satisfiability
Contributors
Noel Menezes
Florentin Dartu
Ken Stevens
Vladi Tsipenyuk
Uri First
Igor Keller
Abhijit Dharchoudhury