t_line - Computer Science and Engineering


Prediction of High-Performance
On-Chip Global Interconnection
Yulei Zhang1, Xiang Hu1, Alina Deutsch2, A. Ege Engin3
James F. Buckwalter1, and Chung-Kuan Cheng1
1Dept. of ECE, UC San Diego, La Jolla, CA
2IBM T. J. Watson Research Center, Yorktown Heights, NY
3Dept. of ECE, San Diego State Univ., San Diego, CA
Outline
Introduction
  Technology trend
  Current approaches
On-Chip Global Interconnection
  Overview: structures, tradeoffs
  Interconnect schemes
  Global wire modeling
  Performance analysis
Design Methodologies for T-line schemes
Prediction of Performance Metrics
  Experimental settings
  Performance metrics comparison and scaling trend
    Latency
    Energy per bit
    Throughput
Signal Integrity
Conclusion

Introduction – Performance Impact

Interconnect delay determines system performance [ITRS08]
  542ps for a 1mm minimum-pitch Cu global wire w/o repeaters @ 45nm
  ~150ps for a 10-level FO4 delay @ 45nm
[Ho2001] “The Future of Wires”

Introduction – Power Dissipation

Interconnects consume a significant portion of power
  1-2 orders of magnitude larger than gates
  Half of the dynamic power is dissipated in repeaters inserted to minimize latency [Zhang07]
Wires consume 50% of total dynamic power in a 0.13um microprocessor [Magen04]
  About 1/3 is burned on the global wires.

Introduction – Different Approaches and Our Contributions
Different Approaches
  Repeater Insertion Approach
    Pros: High throughput density.
    Cons: Overhead in terms of power consumption and wiring complexity.
  T-line Approach [Zhang09]
    Pros: Low latency.
    Cons: Low throughput density due to low bandwidth and large wire dimension.
  Equalized T-line Approach [Zhang08]
    Pros: Low power, low noise, higher throughput than single-ended.
    Cons: Area overhead brought by passive components.
We explore different global interconnection structures and compare
their performance metrics across multiple technology nodes.
Contributions:
  A simple linear model
  A general design framework
  A complete prediction and comparison

Organization of On-Chip Global Interconnections
Multi-Dimensional Design Consideration


Preliminary analysis results assuming a 65nm CMOS process.
Application-oriented choice
  Low Latency: T-TL or UT-TL -> single-ended T-lines
  High Throughput: R-RC
  Low Power: PE-TL or UE-TL
  Low Noise: differential T-lines, PE-TL or UE-TL
  Low Area/Cost: R-RC
For each architecture, the more area the pentagon covers, the better
the overall performance.

On-Chip Global Interconnect Schemes (1)

Repeated RC wires (R-RC)
  R-RC structure
    Repeater size / length of segments
    Adopt previous design methodology [Zhang07]
Un-Terminated and Terminated T-Line (UT-TL and T-TL)
  UT-TL structure
    Full swing at wire-end
    Tapered inverter chain as TX
  T-TL structure
    Optimize eye-height at wire-end
    Non-tapered inverter chain as TX

On-Chip Global Interconnect Schemes (2)
Un-Equalized and Passive-Equalized T-Line (UE-TL and PE-TL)
  Driver side: tapered differential driver
  Receiver side: termination resistance, sense amplifier (SA) + inverter chain
  Passive equalizer: parallel RC network
  Design constraint: enough eye-opening (50mV) needed at the wire-end

Global Wire Modeling – Single-Ended & Differential On-Chip T-lines
Orthogonal layers are replaced by ground planes -> 2D capacitance extraction, accurate when
the loading density is high.
Top-layer thick wires are used -> dimensions are maintained as technology scales.
  LC-mode behavior is dominant
Determine the bit rate
  Smallest wire dimensions that satisfy the eye constraint
  Note that PE-TL needs a narrower wire -> equalization helps to increase density.

Global Wire Modeling – RC wires and T-lines

RC wire modeling
  Distributed Π model composed of wire resistance and capacitance
  Closed-form equations [Sim03] to calculate 2D wire capacitance
T-line 2D-R(f)L(f)C parameter extraction
  2D-C Extraction Template
  2D-R(f)L(f) Extraction Template
T-line Modeling
  R(f)L(f)C tabular model -> transient simulation to estimate eye-height.
  Synthesized compact circuit model [Kopcsay02] -> study signal integrity issues.

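The distributed Π model for RC wires can be sketched numerically. The snippet below is a minimal illustration, not the authors' extraction flow: it builds an N-segment Π ladder from per-unit-length resistance and capacitance and evaluates the Elmore delay at the far end, assuming an ideal driver and no receiver load. The function name is hypothetical, and the example values are the 45nm global-wire figures from the scaling table in the back-up slides.

```python
def pi_ladder_elmore_delay(r_per_mm, c_per_mm, length_mm, n_segments):
    """Elmore delay (seconds) at the far end of an N-segment Pi ladder.

    Assumes an ideal voltage source at the near end and no load at the far end.
    """
    r_seg = r_per_mm * length_mm / n_segments  # series resistance per segment (ohm)
    c_seg = c_per_mm * length_mm / n_segments  # total capacitance per segment (F)
    # Each Pi segment puts half its capacitance on each end, so interior
    # nodes collect a full segment and the two end nodes collect half.
    node_caps = [c_seg / 2] + [c_seg] * (n_segments - 1) + [c_seg / 2]
    delay = 0.0
    for k in range(1, n_segments + 1):
        downstream = sum(node_caps[k:])  # capacitance charged through resistor k
        delay += r_seg * downstream
    return delay


R_PER_MM = 2.970e3    # ohm/mm, 45nm global wire (scaling table)
C_PER_MM = 0.179e-12  # F/mm, 45nm global wire (scaling table)

delay_1seg = pi_ladder_elmore_delay(R_PER_MM, C_PER_MM, 1.0, 1)
delay_20seg = pi_ladder_elmore_delay(R_PER_MM, C_PER_MM, 1.0, 20)
print(f"1 segment: {delay_1seg * 1e12:.1f} ps, 20 segments: {delay_20seg * 1e12:.1f} ps")
```

For a Π ladder the Elmore delay works out to R·C/2 for any segment count, so the two calls agree; more segments matter for waveform accuracy in transient simulation rather than for this first-order metric.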
Performance Analysis – Definitions

Normalized delay (unit: ps/mm)
  Propagation delay includes wire delay and gate delay.
Normalized energy per bit (unit: pJ/m)
  Bit rate is assumed to be the inverse of propagation delay for RC wires.
Normalized throughput (unit: Gbps/um)

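One reading of the three metrics that is consistent with the stated units is sketched below. The symbols are assumptions introduced here, not taken from the slide: $t_p$ is the propagation delay, $l$ the wire length, $P$ the power, $f_{bit}$ the bit rate and $p$ the wire pitch.

```latex
\begin{align}
\text{normalized delay} &= \frac{t_p}{l} && [\mathrm{ps/mm}] \\
\text{normalized energy per bit} &= \frac{P}{f_{bit}\, l} && [\mathrm{pJ/m}] \\
\text{normalized throughput} &= \frac{f_{bit}}{p} && [\mathrm{Gbps/\mu m}]
\end{align}
```

For RC wires the slide's assumption amounts to $f_{bit} = 1/t_p$.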
Performance Analysis – Latency

Variables: technology-defined parameters
  Supply voltage: $V_{dd}$ (unit: V)
  Dielectric constant: $\varepsilon_r$
  Min-sized inverter FO4 delay: $\tau$ (unit: ps)
R-RC structure (min-d)
  r0 is roughly constant
  $c_{nmos} \propto 1/S$, $r_w \propto S^2$, $c_w \propto \varepsilon_r$
  FO4 delay scales w/ scaling factor S: $\tau \propto 1/S$
  Increasing w/ technology scaling!
T-line structures
  Sum of wire delay and TX delay
  Wire delay $\propto \sqrt{\varepsilon_r}$
  TX delay improves w/ FO4 delay $\tau$
  Decreasing w/ technology scaling!

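The rising R-RC trend can be recovered from the proportionalities on this slide. The sketch below assumes the textbook result that the delay per unit length of an optimally repeated RC wire grows as the square root of the product of the repeater and wire RC constants. Reading r0 as the repeater output resistance, writing the delay per length as $t_p/l$, and holding $\varepsilon_r$ fixed in the last step are assumptions made here.

```latex
\begin{align}
\left.\frac{t_p}{l}\right|_{\text{R-RC}} &\propto \sqrt{r_0\, c_{nmos}\, r_w\, c_w} \\
&\propto \sqrt{1 \cdot \frac{1}{S} \cdot S^2 \cdot \varepsilon_r} = \sqrt{S\,\varepsilon_r} \\
&\approx \sqrt{S} \quad (\varepsilon_r \text{ held fixed})
\end{align}
```

With $S > 1$ per generation, the normalized delay of repeated RC wires therefore rises, while a T-line delay dominated by time of flight plus a transceiver term tied to the FO4 delay falls.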
Performance Analysis – Energy per Bit

Same variables as defined before
R-RC structure (min-d)
  Constant!
  $V_{dd}$ reduces as technology scales
  $\varepsilon_r$ reduces as technology scales
  Energy decreases w/ technology scaling!
T-line structures
  Sum of power consumed on wire and TX
  Power of T-line $\propto V_{DD}^2$
  Power of TX circuit $\propto f C V_{DD}^2$
  FO4 delay reduces exponentially
  Energy decreases w/ larger slope!!

Performance Analysis – Throughput

Same variables as defined before
R-RC structure (min-d)
  Assuming wire pitch $\propto 1/S$
  FO4 delay reduces exponentially
  Throughput increases by 20% per generation!
T-line structures
  TX bandwidth $\propto 1/\tau$
  Neglect the minor change of wire pitch
  K1 = 0, for UT-TL
  FO4 delay reduces exponentially
  Throughput increases by 43% per generation!!

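The two per-generation figures are consistent with simple scaling arithmetic. The sketch below is an illustration under stated assumptions, not the authors' model: it takes a scaling factor S = 1/0.7 per generation (a conventional value, not stated on the slide), holds the dielectric constant fixed, lets the R-RC delay per unit length grow as sqrt(S) while the pitch shrinks as 1/S, and lets the T-line bit rate follow 1/(FO4 delay) at constant pitch. The function names are hypothetical.

```python
import math

S = 1 / 0.7  # assumed linear scaling factor per technology generation


def rrc_throughput_gain(s):
    """Per-generation throughput-density gain for repeated RC wires.

    bit rate ~ 1/delay ~ 1/sqrt(s) and wire pitch ~ 1/s,
    so throughput per unit pitch ~ s / sqrt(s) = sqrt(s).
    """
    return math.sqrt(s)


def tline_throughput_gain(s):
    """Per-generation gain for T-lines: bit rate ~ 1/(FO4 delay) ~ s, pitch constant."""
    return s


rrc_pct = (rrc_throughput_gain(S) - 1) * 100
tline_pct = (tline_throughput_gain(S) - 1) * 100
print(f"R-RC: +{rrc_pct:.0f}% per generation, T-line: +{tline_pct:.0f}% per generation")
```

Under these assumptions the R-RC gain is sqrt(S) and the T-line gain is S, which round to the 20% and 43% quoted on the slide.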
Design Framework for On-Chip T-line Schemes


Proposed framework can be applied to design UT-TL/T-TL/UE-TL/PE-TL by
changing wire configuration and circuit structure.
Different optimization routines (LP/ILP/SQP, etc) can be adopted according to
the problem formulation.
Experimental Settings





Design objective: min-d
Technology nodes: 90nm-22nm
Five different global interconnection structures
Wire length: 5mm
Parameter extraction
  2D field solver CZ2D from EIP tool suite of IBM
  Tabular model or synthesized model
Transistor models
  Predictive transistor model from [Uemura06]
  Synopsys level 3 MOSFET model tuned according to ITRS roadmap
Simulation
  HSPICE 2005
Modeling and Optimization
  Linear or non-linear regression/SQP routine
  MATLAB 2007

Performance Metric: Normalized Delay – Results and Comparison
Technology trends
  R-RC ↑
  T-line schemes ↓
T-line structures
  Outperform R-RC beyond 90nm
  Single-ended: lowest delay
At 22nm node
  R-RC: 55ps/mm
  T-lines: 8ps/mm (85% reduction)
  Speed of light: 5ps/mm
Linear model
  < 6% average percent error

Performance Metric: Normalized Energy per Bit – Results and Comparison
Technology trends
  R-RC and T-lines ↓
  T-lines reduce more quickly
T-line structures
  Outperform R-RC beyond 45nm
  Differential: lowest energy.
  Single-ended similar to R-RC.
    T-TL > UT-TL
At 22nm node
  R-RC: 100pJ/m
  Single-ended: 60% reduction
  Differential: 96% reduction
Linear model
  < 12% average percent error
  Error for T-TL and PE-TL
    RL and passive equalizers.

Performance Metric: Normalized Throughput – Results and Comparison
Technology trends
  R-RC and T-lines ↑
  T-lines increase more quickly
T-line structures
  Outperform R-RC beyond 32nm
  Differential better than single-ended
At 22nm node
  R-RC: 12Gbps/um
  T-TL: 30% improvement
  UE-TL: 75% improvement
  PE-TL: ~2X of R-RC
Linear model
  < 7% average percent error

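The 22nm percentages on the three results slides can be cross-checked against the values in the summary table of the back-up slides. The sketch below is only arithmetic on those quoted numbers. The helper names are hypothetical, and using the UE-TL entry to stand for "differential" in the energy comparison is an assumption.

```python
def reduction_pct(baseline, value):
    """Percentage by which value falls below baseline."""
    return (baseline - value) / baseline * 100


def improvement_pct(baseline, value):
    """Percentage by which value exceeds baseline."""
    return (value - baseline) / baseline * 100


# 22nm values as quoted in the deck.
delay_cut = reduction_pct(55, 8)            # ps/mm: R-RC vs T-lines
energy_cut_single = reduction_pct(100, 40)  # pJ/m: R-RC vs single-ended T-lines
energy_cut_diff = reduction_pct(100, 4)     # pJ/m: R-RC vs UE-TL
gain_ttl = improvement_pct(12, 16)          # Gbps/um: R-RC vs T-TL
gain_uetl = improvement_pct(12, 21)         # Gbps/um: R-RC vs UE-TL
gain_petl = improvement_pct(12, 24)         # Gbps/um: R-RC vs PE-TL

print(f"delay: -{delay_cut:.0f}%")
print(f"energy: -{energy_cut_single:.0f}% (single-ended), -{energy_cut_diff:.0f}% (UE-TL)")
print(f"throughput: +{gain_ttl:.0f}% (T-TL), +{gain_uetl:.0f}% (UE-TL), +{gain_petl:.0f}% (PE-TL)")
```

The delay, energy, UE-TL and PE-TL figures reproduce the slide text. The T-TL gain comes out near 33% from the tabulated 16 Gbps/um, slightly above the rounded 30% on the slide.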
Signal Integrity – single-ended T-lines
Worst-case switching pattern for peak noise simulation
  Using w.c. pattern
  Using single or multiple PRBS patterns
UT-TL structure
  380mV peak noise at 1V supply voltage w/ 7ps rise time
  SI could be a big issue as supply voltage drops
T-TL less sensitive to noise
  At the same rise time, ~50% reduction of peak noise
  Peak noise ↓ as technology scales

Signal Integrity – differential T-lines
Worst-case switching pattern for peak noise simulation
More reliable
  Termination resistance
  Common-mode noise reduction
Peak noise
  Within ~10mV range
Eye-Heights
  UE-TL
    Eye reduces as bit rate ↑
    Harder to meet constraint.
  PE-TL
    > 70mV eye even at 22nm node
    Equalization does help!

Conclusion


Compared five different global interconnections in terms of latency,
energy per bit, throughput and signal integrity from 90nm to 22nm.
A simple linear model is provided to link
  Architecture-level performance metrics
  Technology-defined parameters
Some observations from experimental results
  T-line structures have the potential to replace R-RC at future nodes
  Differential T-lines are better than single-ended
    Low-power/High-throughput/Low-noise
  Equalization could be utilized for on-chip global interconnection
    Higher throughput density, improved signal integrity
    Even w/ lower energy dissipation (passive equalization)

Thank you!
Q&A
Back Up Slides
Introduction – Technology Trend
On-Chip Interconnect Scaling
  Dimension shrinks
    Wire resistance increases -> RC delay
    Increasing capacitive coupling -> delay, power, noise, etc.
  Performance of global wires decreases w/ technology scaling.

Scaling trend of PUL wire resistance and capacitance
Wire Category   Parameter      90nm    45nm    22nm
M1 Wire         Rw (kohm/mm)   1.914   8.860   34.827
M1 Wire         Cw (pF/mm)     0.183   0.157   0.129
Global Wire     Rw (kohm/mm)   0.532   2.970   11.000
Global Wire     Cw (pF/mm)     0.205   0.179   0.151

[Figure: Copper resistivity versus wire width]

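The scaling table can be turned into a quick trend check. The sketch below multiplies the tabulated per-unit-length resistance and capacitance to get the distributed RC product for each node. It is a first-order indicator only, with no driver, repeaters or coupling, and the dictionary layout and function name are chosen here for illustration.

```python
# Per-unit-length values from the scaling table: (Rw in kohm/mm, Cw in pF/mm).
WIRES = {
    "M1": {"90nm": (1.914, 0.183), "45nm": (8.860, 0.157), "22nm": (34.827, 0.129)},
    "Global": {"90nm": (0.532, 0.205), "45nm": (2.970, 0.179), "22nm": (11.000, 0.151)},
}


def rc_product_ps_per_mm2(r_kohm_per_mm, c_pf_per_mm):
    """Distributed RC product in ps per mm^2 of wire length.

    kohm * pF = 1e3 ohm * 1e-12 F = 1 ns, so multiply by 1e3 to get ps.
    """
    return r_kohm_per_mm * c_pf_per_mm * 1e3


global_rc = {node: rc_product_ps_per_mm2(*rc) for node, rc in WIRES["Global"].items()}
m1_rc = {node: rc_product_ps_per_mm2(*rc) for node, rc in WIRES["M1"].items()}

for node in ("90nm", "45nm", "22nm"):
    print(f"{node}: global {global_rc[node]:.0f} ps/mm^2, M1 {m1_rc[node]:.0f} ps/mm^2")
```

Capacitance per unit length falls slowly from node to node, but the resistance grows much faster, so the RC product of an unrepeated wire rises steeply. This is the trend behind the statement that global wire performance degrades with scaling.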
Design methodology: single-ended T-lines
2D frequency-dependent tabular model
Single-ended; inverter chains
Inverter size, number of stages, Rload (if any)
SPICE simulation
SPICE simulation to check in-plane crosstalk, etc.
SPICE simulation to evaluate.
Optimization Routine:
1. Optimal cycle time
2. Sweep for optimal inverter chain

Design methodology: differential T-lines
2D frequency-dependent tabular model
Differential lines; SA-based TX
Wire width; driver impedance; RC equalizer (if any); termination resistance.
Closed-form equation-based model
SPICE simulation to check in-plane crosstalk, etc.
Evaluation based on models.
Optimization Routine:
1. Binary search for wire width
2. SQP for other var. optimization

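The outer loop of this routine, a binary search for the narrowest wire that still meets the eye constraint, can be sketched as follows. This is a minimal illustration under assumptions: the eye-height evaluation is replaced by a hypothetical toy function, eye height is assumed non-decreasing in wire width, and the inner SQP step over the remaining variables is omitted. The 50mV target is the eye-opening constraint quoted for the differential schemes.

```python
def min_wire_width(eye_height_mv, eye_target_mv, lo_um, hi_um, tol_um=0.01):
    """Smallest width in [lo_um, hi_um] whose eye height meets the target.

    Assumes eye_height_mv(width) is non-decreasing in width.
    Returns None if even hi_um fails the constraint.
    """
    if eye_height_mv(hi_um) < eye_target_mv:
        return None
    if eye_height_mv(lo_um) >= eye_target_mv:
        return lo_um
    # Invariant: lo_um is infeasible, hi_um is feasible.
    while hi_um - lo_um > tol_um:
        mid = (lo_um + hi_um) / 2
        if eye_height_mv(mid) >= eye_target_mv:
            hi_um = mid  # mid is feasible: tighten the upper bound
        else:
            lo_um = mid  # mid is infeasible: raise the lower bound
    return hi_um


def toy_eye_height_mv(width_um):
    """Hypothetical stand-in for the model-based eye evaluation (not from the slides)."""
    return 30.0 * width_um


width = min_wire_width(toy_eye_height_mv, eye_target_mv=50.0, lo_um=0.5, hi_um=4.0)
print(f"narrowest feasible width: about {width:.2f} um")
```

In a fuller flow, each call to the eye-height function would itself run the inner optimization of driver impedance, equalizer and termination for that width.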
Effects of driver impedance and termination resistance
  Lowering driver impedance improves eye
  Eye reduces as frequency goes up
  Optimal termination resistance.

Effects of driver impedance and termination resistance on step response
  Optimal Rload
  Larger driver impedance leads to slower rising edge and lower saturation voltage
  Larger termination resistance causes sharper rising edge but with larger reflection

Crosstalk effects



Three different PRBS input patterns, min-ddp solutions
T-line Scheme A: Delay increased by 9.6%, Power increased by 37%
T-line Scheme B: Delay increased by 2%, Power increased by 25.7%
[Figure: eye diagrams. @Wire output: 820mV and 750mV; @Inverter chain output: 3.6ps and 6.9ps]

Transceiver Design

Sense amplifier (SA)
  Double-tail latch-type [Schinkel 07]
  Optimize sizing to minimize SA delay
Inverter chain
  Number of stages
    Fixed to 6
  Sizing of each inverter
    RS: output resistance of inverter chain
    Sweep the 1st inverter size to minimize the total transceiver delay for given [Veye, RS]
[Figure: Double-tail latch-type voltage sense amp.]
@45nm tech node:
  M1/M3: 45nm/45nm
  M2/M4: 250nm/45nm
  M5/M6: 180nm/45nm
  M7/M8: 280nm/45nm
  M9: 495nm/45nm
  M10/M11: 200nm/45nm
  M12: 1.58um/45nm

Transceiver Modeling

Driver side




Voltage source Vs with output resistance Rs
Vs: full-swing pulse signal with rise time Tr=0.1Tc
Rs: output resistance of the last inverter in the chain.
Receiver side



Extract look-up table for TX delay and power
Fit the table using non-linear closed form formula
The relative error is within 2% for fitting models
Transceiver delay map at 45nm node
Histogram of fitting errors at 45nm node
Transceiver power map at 45nm node
Bit-rate: 50Gbps
Rs=11.06ohm, Rd=350ohm, Cd=0.38pF, RL=107.69ohm

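The equalizer values listed here can be given a first-order reading. The sketch below looks only at the parallel RC network itself (Rd in parallel with Cd) and computes its corner frequency and its impedance magnitude at DC and at the 25GHz Nyquist frequency of a 50Gbps stream. How the network is connected to the driver and line is not modeled, and the function name is hypothetical.

```python
import math

R_D = 350.0      # ohm, equalizer resistance quoted on the slide
C_D = 0.38e-12   # F, equalizer capacitance quoted on the slide
BIT_RATE = 50e9  # bit/s


def parallel_rc_impedance(freq_hz, r_ohm=R_D, c_farad=C_D):
    """Complex impedance of a resistor in parallel with a capacitor at freq_hz."""
    return r_ohm / (1 + 1j * 2 * math.pi * freq_hz * r_ohm * c_farad)


corner_hz = 1 / (2 * math.pi * R_D * C_D)
z_dc = abs(parallel_rc_impedance(0.0))
z_nyquist = abs(parallel_rc_impedance(BIT_RATE / 2))

print(f"corner frequency: {corner_hz / 1e9:.2f} GHz")
print(f"|Z| at DC: {z_dc:.0f} ohm, |Z| at Nyquist: {z_nyquist:.1f} ohm")
```

If the network sits in series with the signal path (an assumption here), it attenuates low-frequency content much more than content near the Nyquist frequency, which is the high-pass shaping a passive equalizer relies on to offset the line's frequency-dependent loss.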
Conclusion (cont’)

Low-Latency Application (ps/mm)
Schemes   90nm    65nm    45nm    32nm    22nm
R-RC      3/35    1/42    1/46    1/55    1/55
UT-TL     5/15    5/13    5/10    5/9     5/8
T-TL      5/15    5/13    5/10    5/9     5/8
UE-TL     1/37    3/25    3/16    3/12    5/8
PE-TL     1/37    3/25    3/16    3/12    5/8

Low-Energy Application (pJ/m)
Schemes   90nm    65nm    45nm    32nm    22nm
R-RC      2/150   2/140   1/130   1/100   1/100
UT-TL     3/140   3/110   3/70    3/50    2/40
T-TL      1/260   1/200   2/100   2/60    3/40
UE-TL     4/60    4/36    4/20    4/10    5/4
PE-TL     5/26    5/16    5/8     5/5     5/2

High-Throughput Application (Gbps/um)
Schemes   90nm    65nm    45nm    32nm    22nm
R-RC      5/5     5/6     3/8     3/10    2/12
UT-TL     2/3.3   1/3.3   1/3.3   1/3.3   1/3.3
T-TL      1/3     2/3.4   2/6     2/9     3/16
UE-TL     3/3     3/5     4/9     4/13    4/21
PE-TL     4/4     4/5.3   5/9     5/15    5/24

Low-Noise Application (score only)
Schemes   90nm    65nm    45nm    32nm    22nm
R-RC      1       1       1       1       1
UT-TL     1       1       1       1       1
T-TL      3       3       3       3       3
UE-TL     5       5       4       4       4
PE-TL     4       4       5       5       5

Item in the table: score/value. Score: the higher, the better in terms of the given metric; max. score is
5. The best structure in each column is marked in red on the original slide.

Future Works

Explore novel global signaling schemes for high throughput and low
energy dissipation.
  Design, optimize > 50Gbps on-chip interconnection schemes
  Architecture-level study to identify trade-offs
    Wire configuration
      Dimension optimization, ground plane, etc.
    Un-interrupted architectures
      Equalization implementation, TX/RX choice
    Distributed architectures
      Active or passive compensation (RC equalizers, other networks, etc.)
  Novel high-speed transceiver circuitry design
  Develop analysis and optimization capability to aid co-design and co-optimization of wire and transceiver circuit
  Fabrication to verify analysis and demonstrate feasibility

Related Publications
[Repeated RC Wire]
1. L. Zhang, H. Chen, B. Yao, K. Hamilton, and C. K. Cheng, “Repeated on-chip interconnect analysis and evaluation of delay, power and bandwidth metrics under different design goals,” IEEE International Symposium on Quality Electronic Design, 2007, pp. 251-256.
[Un-Terminated/Terminated T-Line]
2. Y. Zhang, L. Zhang, A. Deutsch, G. A. Katopis, D. M. Dreps, J. F. Buckwalter, E. S. Kuh, and C. K. Cheng, “Design Methodology of High Performance On-Chip Global Interconnect Using Terminated Transmission-Line,” IEEE International Symposium on Quality Electronic Design, 2009, pp. 451-458.
[Passive-Equalized T-Line]
3. Y. Zhang, L. Zhang, A. Tsuchiya, M. Hashimoto, and C. K. Cheng, “On-chip high performance signaling using passive compensation,” IEEE International Conference on Computer Design, 2008, pp. 182-187.
4. Y. Zhang, L. Zhang, A. Deutsch, G. A. Katopis, D. M. Dreps, J. F. Buckwalter, E. S. Kuh, and C. K. Cheng, “On-chip bus signaling using passive compensation,” IEEE Electrical Performance of Electronic Packaging, 2008, pp. 33-36.
5. L. Zhang, Y. Zhang, A. Tsuchiya, M. Hashimoto, E. Kuh, and C. K. Cheng, “High performance on-chip differential signaling using passive compensation for global communication,” Asia and South Pacific Design Automation Conference, 2009, pp. 385-390.
[Overview and Comparison]
6. Y. Zhang, X. Hu, A. Deutsch, A. E. Engin, J. F. Buckwalter, and C. K. Cheng, “Prediction of High-Performance On-Chip Global Interconnection,” ACM Workshop on System Level Interconnect Prediction, 2009.