
ITRS-2001 Design ITWG Summary
(work in progress)
July 18, 2001
New SYSTEM DRIVERS Chapter in ITRS
• Defines segments of silicon market that drive process and
design technology
• Replaces the 1999 SOC Chapter
• Along with ORTCs, serves as “glue” for ITRS
• 4 Drivers: SOC (Japan), MPU (USA), DRAM (Korea), M/S
(Europe)
– SOC: driven by cost, power, integration
– SOC: drives device requirements, packaging, I/O counts, …
– SOC: same as “ASIC-LP”
• Each section covers:
– Nature, evolution, formal definition of this driver
– What market forces apply to this driver?
– What technology elements (process, device, design) does this driver drive?
– Key figures of merit, and futures
• Driven by Design TWG (ABK = chair)
Mixed-Signal System Driver Roadmap
• Mixed-signal circuits increasingly critical to system performance, cost
– Define “AMS boundary conditions” for Design technology roadmap by choosing
critical and important analog/RF circuits, then identifying circuit performance
needs and related device parameter needs
• Based on figures of merit for four basic analog building blocks, can
estimate future device parameter needs
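As a concrete (hypothetical) illustration of the FoM approach, the sketch below uses Walden's well-known A/D-converter figure of merit, energy per effective conversion step; this particular FoM and the example converter numbers are assumptions for illustration, not necessarily the FoM chosen in the ITRS chapter.

```python
# Hypothetical illustration of the FoM-based approach for one of the four
# building blocks. Walden's ADC figure of merit -- energy per effective
# conversion step, FoM = P / (2^ENOB * f_sample) -- is assumed here for
# concreteness; the ITRS chapter defines its own FoM per circuit class.
def adc_fom_j_per_step(power_w: float, enob_bits: float, f_sample_hz: float) -> float:
    """Energy per effective conversion step (joules); lower is better."""
    return power_w / (2 ** enob_bits * f_sample_hz)

# Example: a hypothetical 12-bit, 100 MS/s converter dissipating 250 mW.
print(f"{adc_fom_j_per_step(0.25, 12, 100e6) * 1e12:.2f} pJ/step")  # ~0.61
```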
[Figure: roadmap for basic analog/RF circuits (A/D converter, low-noise amplifier, voltage-controlled oscillator, power amplifier) mapped to a roadmap of device parameter needs (e.g., Lmin, analog transistor gm/gds, resolution (bit), …) across 2001-2015.]
Mixed-Signal System Drivers
[Figure: converter resolution (bit, 4-22) versus signal bandwidth (1kHz-1GHz) for applications including super audio, audio, GSM basestation, GSM, cable, DTV, UMTS, storage, telephony, Bluetooth, interconnectivity, and video; contours of constant power (1mW, 1W, 1kW).]
System drivers for mass markets can be identified from
the FoM approach
“System Driver” Models Are Changing
Example – MPU
• Old MPU model – 3 flavors
• Cost-performance at introduction (CP-Intro)
– 340+ mm2 die, small L1 cache (32KB in 180nm)
• Cost-performance at production (CP-Prod)
– 170+ mm2 die, shrink of previous generation’s CP-Intro chip
• High-performance (HP)
– 310+ mm2 die, same as CP-Prod but with large L2 cache (512KB in 180nm)
• SRAM and logic transistor counts double every generation
• New MPU model - 2 flavors
• Cost-performance at production (CP)
– 140 mm2 die, “Pentium 4” / “desktop”
• High-performance at production (HP)
– 310 mm2 die, “Itanium” / “server”
• Both CP, HP have multiple cores (“helper engines”), on-board L3 cache, …
– Multi-cores == more dedicated, less general-purpose logic; driven by power
and reuse considerations; reflect convergence of MPU and SOC
• “Moore’s Law” still applies to tx counts, but NOT to frequency
– doubling is each generation, NOT each 18 months
Example Supporting Analyses
(MPU Diminishing Returns)
• Pollack’s Rule
– In a given process technology, new uArch takes 2-3x area of old (last
generation) uArch, and provides only 40% more performance
– Backup Slide: process generations (x-axis) versus (1) ratio of Area of New/Old
uArch, (2) ratio of Performance of New/Old (approaching 1)
– Other Backup: SPECint, SPECfp per MHz, SPECint per Watt all decreasing
• Power knob running out
– Speed == Power
– 10W/cm2 limit for convection cooling, 50W/cm2 limit for forced-air cooling
– Large currents, large power surges on wakeup
– Cf. 140A supply current, 150W total power at 1.2V Vdd for EV8 (Compaq)
• Speed knob running out
– Historically, 2x clock frequency every process generation (see Backup Slides)
• 1.4x from device scaling (but running into t_ox, other limits – see Device discussion)
• 1.4x from fewer logic stages (from 40-100 down to around 14 FO4 INV delays)
– Clocks cannot be generated with period < 6-8 FO4 INV delays
– Pipelining overhead (1-1.5 FO4 INV delay for pulse-mode latch, 2-3 for FF)
– Around 14-16 FO4 INV delays is the limit for clock period (L1 cache access, 64-bit add)
• Cannot continue 2x freq per generation trend in ITRS
New On-Chip Max Clock Frequency Model
• Flat at 16 FO4 INV delays
– FO4 INV delay = delay of inverter driving load equal to 4x its input cap
= roughly 14x CV/I device delay metric in ITRS PIDS Chapter
• No local interconnect in the model
– negligible, and scales with device performance
• No (buffered) global interconnect in the model, either
– was unrealistically fast in ITRS99 model
– global interconnects are pipelined (clock frequency is set by time
needed to complete local computation loops, not time for global
communication - cf. Pentium-4 and Alpha-21264)
– Note: interconnect delay per se is not a problem (!)
• Clock period decreases from 26 FO4 delays in 2001
(Pentium-4) at historical rates, then flattens at 16 FO4 delays
– See discussion of MPU “System Driver” …
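A minimal sketch of this fixed-FO4 clock model, assuming the 16-FO4 period and the "FO4 ≈ 14x CV/I" approximation quoted above; the example CV/I value is an illustrative placeholder, not a PIDS table entry.

```python
# Minimal sketch of the fixed-FO4 max-clock model described above.
# The two constants are from the slide; the example CV/I is an assumption.
FO4_PER_CYCLE = 16   # clock period flattens at 16 FO4 inverter delays
FO4_PER_CVI   = 14   # FO4 delay ~ 14x the CV/I device delay metric

def max_clock_ghz(cv_over_i_ps: float) -> float:
    period_ps = FO4_PER_CYCLE * FO4_PER_CVI * cv_over_i_ps
    return 1e3 / period_ps  # period in ps -> frequency in GHz

# e.g., a hypothetical CV/I of 1.6 ps gives a ~2.8 GHz ceiling
print(f"{max_clock_ghz(1.6):.2f} GHz")
```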
New Layout Density Models
• Semi-custom Logic: Avg size of 4t gate = 32MP2 = 320F2
– MP = lower-level contacted metal pitch
– F = min feature size (technology node)
– 32 = 8 tracks standard-cell height times 4 tracks width (average NAND2)
– Additional whitespace factor = 2x (i.e., 100% overhead)
• Custom Logic: 1.25x ASIC density
• SRAM (used in MPU; see the sketch below):
– bitcell area (in units of F^2) is nearly flat: 223.19*F (um) + 97.748
– peripheral overhead = 60%
– memory content is increasing (driver: power) and increasingly fragmented
– will see paradigm shifts in architecture/stacking: eDRAM, 1-T SRAM, …
• Significant SRAM density increase, slight Logic density
decrease, compared to 1999 ITRS
– 130nm node: old ASIC logic density = 13M tx/cm2, new = 11.6M tx/cm2
– 130nm node: old SRAM density = 70M tx/cm2, new = 140M tx/cm2
– Chief impact: power densities, logic-memory balance on chip
• Very well-calibrated to design data, multiple sectors
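A sketch of the SRAM bitcell-density arithmetic implied above. The A-factor fit and the 60% peripheral overhead are from the slide; applying that overhead as a flat 1.6x area multiplier is this sketch's assumption, so it lands in the same range as, but not exactly on, the calibrated figures quoted above.

```python
# Sketch of the SRAM density model stated above. The bitcell A-factor fit
# (223.19*F + 97.748, F in microns, result in units of F^2) and the 60%
# peripheral overhead come from the slide; treating that overhead as a
# flat 1.6x area multiplier is an assumption of this sketch.
def sram_tx_per_cm2(f_um: float) -> float:
    a_factor = 223.19 * f_um + 97.748        # bitcell area in F^2 units
    cell_um2 = a_factor * f_um ** 2 * 1.6    # 6T cell plus periphery
    return 6 / (cell_um2 * 1e-8)             # 6 transistors per bitcell

print(f"{sram_tx_per_cm2(0.13) / 1e6:.0f}M tx/cm2 at 130nm")  # same order as the ~140M quoted
```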
A-Factor for SRAM Cell Size
(square feature size)
SRAM “A-Factors” for Simple 6T SRAM Cell using
Microprocessor Logic CMOS Process Technology
[Figure: bar chart of SRAM A-factors versus process generation; the underlying data are tabulated below.]

DRAM half-pitch (F)    Source                         A-Factor (A*F2)
0.35 micron            Intel, IEDM 1994               167.3
0.3 micron             Motorola, IEDM 1994            175.6
0.25 micron            Intel, IEDM 1996, p. 847       164.16
0.18 micron            Intel, IEDM 1998, p. 197       172.53
0.13 micron            Intel, IEDM 2000, p. 567       143.7
0.13 micron            Motorola, IEDM 2000, p. 571    146.74
Constrained Power: Logic-Memory
Balance Analyses (see Backup Slides)
[Figure: percent of chip area devoted to logic (0-50%) versus year (1998-2014) under two scenarios: constant power (90W) and constant power density (90W/1.57cm2). Area is constant over 1999-2004, so the two curves coincide there; an anomaly appears going from 45nm to 32nm due to constant Vdd.]
• Constant power or power density → decreasing logic content
• #Tr of logic and SRAM cannot scale together as in the current ITRS?
• Can “derive” the same result from the new/reuse logic design productivity gap (!)
• How will Design Technology ensure a continued flow of high-value designs?
Outline of ITRS DESIGN Chapter
• Context
– Scope of Design Technology
– High-level summary of complexities (at level of “issues”)
– Cost, productivity, quality, and other metrics of Design Technology
• Overview of Needs
– Driver classes and associated emphases
– SOC, MPU, DRAM, MS
– Resulting needs (e.g., power, …, cost-driven design)
• Summary of Difficult Challenges
• Detailed Statements of Needs, Potential Solutions
– System-Level, Circuit, Logic/Physical, Verification, Test
Design Cost and Quality Requirement
• Design cost of “largest ASIC” rises despite major DT innovations
• Other Dataquest numbers confirm memory content rising
• Currently seeking metric, data, requirements for design quality
(Dataquest, 2001) Cost Metrics Forecast
[Figure: design cost for the largest possible ASIC, 1990-2005, log scale from $10M to $100B: forecast cost versus the same cost with an RTL methodology only.]
Design Cost Requirement
• “Largest possible ASIC” design cost model (see the sketch after this slide)
– engineer cost per year increases 5% per year ($181,568 in 1990)
– EDA tool cost per year increases 3.9% per year ($99,301 in 1990)
– #Gates in largest ASIC design per ORTCs (0.25M in 1990, 250M in 2005)
– %Logic gates constant at 70% (see next slide)
– #Engineers / million logic gates decreasing from 250 in 1990 to 5 in 2005
– Productivity due to 8 Design Technology innovations (3.5 of which are still unavailable): RTL methodology; in-house P&R; tall-thin engineer; small-block reuse; large-block reuse; IC implementation suite; intelligent testbench; ES-level methodology
• Small refinements: (1) whether 30% memory content is fixed; (2) modeling increased amount of large-block reuse (not just the ability to do large-block reuse). No discussion of other design NRE (mask cost, etc.).
• #Engineers per ASIC design still rising (44 in 1990 to 875 in
2005), despite assumed 50x improvement in designer
productivity
• New Design Technology -- beyond anything currently
contemplated -- is required to keep costs manageable
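The cost model above is simple enough to sketch directly. The 1990 anchor values, growth rates, and 2005 endpoints are from the slide; the geometric interpolation of gate count and productivity between 1990 and 2005, and charging EDA tool cost per engineer-year, are this sketch's assumptions, not the official Dataquest model.

```python
# Sketch of the "largest possible ASIC" design-cost model described above.
# Anchors and growth rates are from the slide; geometric interpolation
# between 1990 and 2005, and per-engineer-year tool cost, are assumptions.
def design_cost(year: int):
    n = year - 1990
    eng_cost  = 181_568 * 1.05  ** n                 # +5%/year
    tool_cost =  99_301 * 1.039 ** n                 # +3.9%/year
    gates_m   = 0.25 * (250 / 0.25) ** (n / 15)      # 0.25M -> 250M gates
    logic_m   = 0.70 * gates_m                       # 70% logic gates
    eng_per_mgate = 250 * (5 / 250) ** (n / 15)      # 250 -> 5 engineers/Mgate
    engineers = logic_m * eng_per_mgate
    return engineers, engineers * (eng_cost + tool_cost)

for y in (1990, 2005):
    e, c = design_cost(y)
    print(f"{y}: {e:4.0f} engineers, ${c / 1e6:6.1f}M/year")
# Reproduces the slide's headcounts: ~44 engineers in 1990, ~875 in 2005.
```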
Design Quality Requirement
• “Normalized transistor” quality model
– speed, power, density in a given technology
– analog vs. digital
– custom vs. semi-custom vs. generated
– first-silicon success
– other: simple / complex clocking, …
– developing quality normalization model within MARCO GSRC; VSIA, Numetrics, others pursuing similar goals
• Design quality: gathering data
Backup:
MPU Analyses
Performance Efficiency of Microarchitectures
(Pollack’s Rule)
[Figure: growth (X) of area and of performance (lead vs. compaction microarchitecture) versus technology generation (1.5, 1, 0.7, 0.5, 0.35, 0.18 micron); performance measured using SpecINT and SpecFP; the performance ratio approaches 1.]
In the same technology:
• New microarchitecture has 2-3X the die area of the last microarchitecture
• Provides only 1.4-1.7X the performance of the last microarchitecture (i.e., performance grows roughly as the square root of area)
“We are on the Wrong Side of a Square Law”
Intel: P. Gelsinger talk ISSCC-2001
Decreasing SPECint/MHz
[Figure: SPECint95 ratio per MHz versus clock speed (0-1000 MHz), decreasing; linear fit y = -5E-05x + 0.0989.]
Decreasing SPECfp/MHz
[Figure: SPECfp95 ratio per MHz versus clock speed (0-1000 MHz), decreasing; linear fit y = -0.0005x + 0.5392.]
Decreasing SPECfp/Watt
[Figure: SPECfp per Watt (0-3.5) versus date of data, May 1996 through April 2001, decreasing.]
MPU Clock Freq Doubles Each Generation… (Intel: Borkar/Parkhurst)
…But 1.4x of This is From Shorter Pipelines, which can't get shorter forever (Intel: Borkar/Parkhurst)
MPU Futures (1)
• Drivers: power, I/O bandwidth, yield, ...
• Multiple small cores per die, more memory hierarchy on board
– core can be reused across multiple apps/configs
– replication → redundancy, power savings (lower freq, Vdd while maintaining throughput); better use of area than memory; avoids overhead of time-multiplexing
– IBM Power4 (2 CPU + L2); IBM S390 (14 MPU, 16MB L2 (8 chips) on 1 MCM
(31 chips, 1000W, 1.4B xtors, 4224 pins))
– Processor-in-Memory (PIM): O(10M) xtors logic per core
– 0.5Gb eDRAM L3 by 2005
– high memory content gives better control of leakage, total chip power
• I/O bandwidth becomes a major differentiator
– double-clocking, phase-pipelining in parallel-serial data conversion is hitting
the 6-8 FO4 limit (i.e., cannot even make a clock pulse in < 6-8 FO4 delays)
– I/O count may stay same or decrease due to integration
– roughly constant die size (200-350 mm2) also limits I/O count
• Evolutionary microarchitecture changes
– superpipelining (for freq), superscalar (beyond 4-way) both run out of steam
– more multithreading support for parallel processing
– more complex hardwired functions (networking, graphics, comm, security, ...)
(megatrend: shift of flexibility-efficiency tradeoff point away from GPP)
MPU Futures (2)
• Circuit design
– ECC for single-event upset (soft errors)
– pass gates are on the way out due to low Vt
– more redundancy to compensate for yield loss
– density models are impacted
• Clocking and power
– Reasonable “need”: 1V supplies, 10-50W total power both flat
– Power savings from SOI (5% or 25%), multi-Vth (10%), multi-Vdd (30-50%),
min-energy sizing under throughput constraints (25%), parallelism …
• Note: combined use of multi-Vth, multi-Vdd potentially saves 2-4x power!
– multiple clock domains, grids; more gating/scheduling
– adaptive voltage and frequency scaling
– frequency: +1 GHz/year ... but product focus shifts to system throughput
• Bifurcation of MPU requirements
– smart interface remedial processing (SIRP): basic computing and power
efficiency, SOC integration of RF, M/S, digital (wireless mobile multimedia)
– centralized computing server: high-performance computing (traditional MPU)
Backup:
Power-Constrained Logic-Memory
Balance Analyses
Memory/Logic Power Study Setup
• Motivation: Is current ITRS MPU model consistent with
power realities? Does it drive the right set of needs?
• Ptotal = Plogic + Pmemory = constant (say, 50W or 100W)
• Plogic composed of dynamic and static power, calculated as
densities
• Pmemory = 0.1*Pdensity_dynamic
– power density in memories is around 1/10th that of logic
• Logic power density (dynamic) determined using active
capacitance density (Borkar, Micro99)
– dynamic power density Pdensity_dynamic = Cactive * Vdd^2 * fclock
– fclock uses new fixed-FO4 INV delay model (linear, not superlinear,
with scale factor)
– Cactive = 0.25nF/mm2 at 180nm
– increases with scale factor (~1.43X in theory; 1.30X historically)
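A minimal sketch of the dynamic power-density formula above. Cactive = 0.25nF/mm2 at 180nm is from the slide; the Vdd and clock frequency in the example are illustrative assumptions, not ITRS table values.

```python
# Sketch of the dynamic logic power-density model above. Cactive at 180nm
# is from the slide; Vdd and fclock in the example are assumptions.
def dyn_power_density(c_active_nf_mm2: float, vdd_v: float, f_ghz: float) -> float:
    """Dynamic power density in W/mm^2: Cactive * Vdd^2 * fclock."""
    return (c_active_nf_mm2 * 1e-9) * vdd_v ** 2 * (f_ghz * 1e9)

# e.g., 0.25 nF/mm^2, 1.8 V, 1.2 GHz (assumed operating point)
print(f"{dyn_power_density(0.25, 1.8, 1.2):.2f} W/mm^2")  # ~0.97
```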
Memory/Logic Power Study Setup
• Static power model considers dual Vth values
– 90% of logic gates use high-Vth with Ioff from PIDS Table 28a/b
– 10% of logic gates use low-Vth with Ioff = 10X Ioff from PIDS Table
28a/b (90/10 split is from IBM and other existing dual-Vth MPUs)
– Operating temp (80-100C)  Ioff is 10X of Table 28a/b (room temp)
• Width of each gate determined from IBM SA-27E library
– 150nm technology; 2-input NAND = basic cell
– performance level E: smallest footprint, next-to-fastest implementation → W of each device ~ 4um
– Weff (effective leakage width) for each gate = 4um
– 0.8*Weff*Ioff (per um) = Ileak / gate (0.8 comes from avg leakage over
input patterns)
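A sketch of the dual-Vth static power model above. The 90/10 Vth split, the 10x Ioff of low-Vth gates, the 10x derating at operating temperature, Weff = 4um, and the 0.8 input-pattern factor are from the slides; the nominal room-temperature Ioff, Vdd, and gate density below are illustrative assumptions (PIDS Table 28a/b values are not reproduced here).

```python
# Sketch of the dual-Vth static power model above. Split ratios, derating
# factors, and Weff are from the slides; Ioff, Vdd, and gate density are
# illustrative assumptions.
def static_power_density(gates_per_mm2: float, ioff_ua_per_um: float, vdd_v: float) -> float:
    """Static power density in W/mm^2."""
    ioff_hot = 10 * ioff_ua_per_um        # 80-100C: ~10x room-temp Ioff
    i_hvt = 0.8 * 4.0 * ioff_hot          # uA leaked per high-Vth gate
    i_lvt = 10 * i_hvt                    # low-Vth gates leak 10x more
    i_avg = 0.9 * i_hvt + 0.1 * i_lvt     # 90/10 dual-Vth mix
    return gates_per_mm2 * (i_avg * 1e-6) * vdd_v

# e.g., 100K gates/mm^2, Ioff = 0.01 uA/um, Vdd = 1.2 V (assumed)
print(f"{static_power_density(1e5, 0.01, 1.2):.3f} W/mm^2")  # ~0.073
```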
Memory/Logic Study Setup
• Calculate densities, then find allowable logic component
(percent of total area) to achieve constant power (or
power density)
– Amemory + Alogic = Achip
– recall that Achip is now flat
• Constant power and constant power density scenarios
same until 65nm node (because chip area flat until then)
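A sketch of the constant-power area split just described: with memory power density at 1/10th that of logic (per the slide), solve for the logic area fraction that exactly meets a fixed total power budget. The numeric inputs in the example are illustrative assumptions.

```python
# Sketch of the constant-power logic/memory area split described above.
# The 0.1x memory power-density ratio is from the slide; the budget, die
# area, and logic power density in the example are assumptions.
def logic_area_fraction(p_total_w: float, a_chip_mm2: float, pd_logic_w_mm2: float) -> float:
    pd_mem = 0.1 * pd_logic_w_mm2                     # memory density model
    a_logic = (p_total_w - pd_mem * a_chip_mm2) / (pd_logic_w_mm2 - pd_mem)
    return min(max(a_logic / a_chip_mm2, 0.0), 1.0)   # clamp to [0, 1]

# e.g., 100 W budget, 140 mm^2 die, 1.0 W/mm^2 logic density (assumed)
print(f"{logic_area_fraction(100, 140, 1.0):.0%} of chip area can be logic")  # ~68%
```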
Power as a Constraint: Implications
[Figure: number of MPU cores allowable (0-12, left axis) and on-chip memory allowable (0-120 Mbytes, right axis) versus year (1998-2014), each under both constant power and constant power density.]
Using same constraints, calculate #MPU cores (12Mt/core)
and Mbytes SRAM allowable (again, anomaly at 32nm due to
constant Vdd)
Backup:
Design DOF’s for Power Reduction
Example Analysis: Contributing
Factors to “ITRS Power Contradiction”
• Factors defined by manufacturing technology capabilities
• Technology Node, MPU Gate Length, Cap_Gate, Tox, Gate Dielectric
Constant, …
• Factors that can be tuned by designers, design tools
• Chip area, % of chip area dedicated to logic, Vdd, Ion, Vth, Weff/xtor,
% of gates with high Vth, …
• Sensitivities (effect that each factor has on total power)
• Greatest effects are from tuning % of chip area dedicated to logic, %
of gates with high Vth
• Some factors have more effect today, less in the future (e.g., Total Capacitance Density, Ion)
• Some factors have less impact today, more in the future (e.g., high-Vth gate ratio, Weff/xtor)
Backup:
Interactions with Key Other TWGs
Device Roadmap Changes
• Cf. Process Integration, Devices and Structures (PIDS) Chapter
• CV/I device delay metric: historically decreases by 17%/year
– Since frequency improvement from shorter pipelines no longer available,
perhaps we do need to keep scaling CV/I …
– Bottom line: PIDS is running up against limits of planar CMOS, and is
shifting at least some of the pain to “design/architecture improvements”
• Continuing CV/I trend necessitates huge growth in Ioff
• Subthreshold Ioff at room temperature increases from 0.01 uA/um in 2001 to
10 uA/um at end of ITRS (22nm node)
• Ioff increases by at least order of magnitude at ~100 deg C operating temps
• Static power becomes a huge problem: multi-Vt, multi-Vdd, substrate biasing,
constant-throughput power minimization, etc. must be coherently and
simultaneously applied/optimized by automatic tools
• Also necessitates aggressive reduction in tox
• Physical tox thickness hovers at < 1.4nm (down to 1.0nm) starting in 2001,
even assuming arrival of high-k gate dielectrics starting in 2004
• Implies huge variability mitigation challenges for Design Technology
Assembly/Packaging Roadmap Changes
• MPU pad counts (Tables 3a/3b of 2000 ITRS ORTC Chap.)
flat from 2001-2005, while chip current draw increases 64%
• Effective bump pitch roughly constant at 350um throughout roadmap
– Bump/pad counts scale with chip area only, do not increase with technology demands (IR drop, L*di/dt)
– → metal resources needed to control <10% IR drop skyrocket, since Ichip and wiring resistance increase → challenge for Design Technology
– Later technologies (30-40nm) also have too few bumps to carry maximum current draw (e.g., 1250 Vdd pads at 30nm with bump pitch of 250um can each carry 150mA → 187.5A max capability, but Ichip/Vdd > 300A)
• A&P Rationale: cost control (puts pain onto Design)
• Design Rationalization: must introduce power constraints
– ITRS2001 will have strong power-constrained focus
• Cost of liquid cooling, refrigeration, etc. impractical anyway
• 30-50 W/cm2 limit for forced-air cooling with fins
• MPU power dissipation capped at 150W for entire ITRS; MPU chip area
held constant (more area can’t be used well within 150W power budget)
• Design DOFs for Power Reduction: see Backup Slides
Backup:
Roadmap Acceleration Context
About This Presentation
• ITRS = International Technology Roadmap for
Semiconductors (“The Roadmap”)
– website: http://www.itrs.net (all ITRS-2001 work in progress)
– login: 2001 itrs
password: Timing
• Goal = highlight issues that can affect EDA
– Where possible, infer necessary areas for R&D attention
– Not all news in the ITRS will be covered
• Provide reusable “technical collateral” and “backstops”
– Correct roadmap dates, technology nodes, etc.
– Correct interpretation of models underlying roadmap evolution
– Identification of drivers for EDA
Roadmap Acceleration Since 2000
• Cf. Overall Roadmap Technology Characteristics (ORTCs)
• Major accelerations continue
– Next slide shows “pull-in” of red and blue lines
– E.g., 90nm node is in 2004, with physical gate length at 45nm
• MPU/ASIC half-pitch were separate, now unified
– ASIC is at the same process node as MPU
• 2-year cycles b/w MPU/ASIC generations through 2004
– New Generation = 0.7x multiplier of half-pitch or minimum feature size, which generally allows 2x the transistors on the same size die (0.7^2 ≈ 0.5, so area per transistor halves)
– “Normal” pace = 3-year cycle
• MPU/ASIC half-pitch converges w/DRAM half-pitch in 2004
– Previous ITRS: convergence predicted for 2015
– Extremely aggressive scaling for density, cost improvement and
competitive positioning
[Figure: “ITRS Roadmap Acceleration Continues...” — log-scale minimum feature size (nm) versus year of production (1995-2016). Curves: 1998/1999 DRAM half-pitch; 2000 Update Scenario 2.0 DRAM half-pitch (3-yr cycle after 2001, following the 2-year node cycle of 1995-2001); Scenario 3.7 MPU/ASIC half-pitch (1-year lag behind DRAM through 2002, then equal to DRAM after 2004); MPU/ASIC gate length “in resist” per the 1999 ITRS; and “Most Aggressive” Scenario 3.7 MPU printed (PrGL) and physical (PhGL) gate length (2-yr cycle before 2005, 3-yr after; ASIC/low-power Pr/PhGL lags MPU Pr/PhGL by 2 years). Feature sizes step ~0.7x per technology node (0.5x per 2 nodes), e.g., 500, 350, 250, 180, 130, 90, 65, 45, 32, 22, 16 nm. Marked “ITRS 2001 Renewal - Work in Progress - Do Not Publish”; slide courtesy of A. Allan (Intel Corp.).]
Backup:
(Private) Design ITWG Themes
Big Picture
• ITRS takes Moore’s Law as a constraint
• Problem: We signed up for the “wrong” Moore’s Law
– 2x frequency, 2x xtors/bits every node → power, utility contradictions
– Each increment of performance is more and more costly
• Compounding problems
– no architecture awareness (2x memory, 2x logic xtors in lock-step)
– no application awareness (e.g., low-power networked-embedded SOC)
– planar CMOS-centric (no DGFET, FinFET in requirements)
– uneven acknowledgment of cost (mask NRE, design cost, cost of technology
development, manufacturing cost, …)
• New in 2001: Can Design solve it? Can Designers help?
– PIDS: 17%/year improvement in CV/I metric → punts Ioff, Rds, …
– A&P: bump pitch improves more slowly than chip area → punts IR drop, power
– Interconnect: what total variability can Designers tolerate?
2001: Design Technology Better Integrated
With Other Supporting Technologies
• Problem: Design has always been “metric-free”
– Metric → “red brick wall” → requirement for R&D investment
• Goal 1: show red bricks in Design Technology
• Goal 2: shift red bricks from other supporting technologies
– e.g., lithography CD variability requirement → solved by new Design techniques that can better handle variability
– e.g., mask data volume requirement → solved by Design/Mfg interfaces and flows that pass functional requirements, verification knowledge to mask writing and inspection
– e.g., Simplex “X initiative” → as much impact as copper?
• But..
– Need metrics of design cost, design quality
– Need serious validation/participation from EDA, system, ASIC
companies
Need to Beef Up…
• Test roadmap discussion is missing
• Cost of test will soon exceed cost of manufacturing
• At-speed test stresses tester technology
• How solid is treatment of BIST, analog test, SOC test, etc.?
• Cost (big hole in ITRS)
• Manufacturing cost, NRE cost (design, mask, …), technology development
cost (who should solve a given red brick wall?)
• Key challenges for EDA (with respect to ITRS)
• Circuit/layout optimizations in the face of manufacturing variability
• System cost-driven design technology
• Holistic analysis, management of power (both dynamic and static)
• Circuit- and methodology-level IP: global signaling and synchronization, off-chip IO; power delivery and management
• Metrics, needs roadmap for quality/cost of design and design process
• Verification and test
• Software