ITRS Design Grenoble Meeting - Computer Science and Engineering


ITRS-2001
Grenoble Meeting
April 25, 2001
U.S. Design TWG
Outline
• MPU diminishing returns
• New MPU clock frequency model
• MPU futures and ASIC/MPU/SOC convergence
• New logic (ASIC, MPU) and SRAM density models
• Required logic decrease due to power constraint
• Design cost / design quality requirement, gap analysis
• Summary of changes and errata (ORTCs, other TWGs)
MPU Diminishing Returns
• Pollack’s Rule
  – In a given process technology, a new uArch takes 2-3x the area of the old (last-generation) uArch, but provides only ~40% more performance (see slide)
  – Slide: process generations (x-axis) versus (1) ratio of area of new/old uArch, (2) ratio of performance of new/old (approaching 1)
  – Slides: SPECint per MHz, SPECfp per MHz, SPECint per Watt all decreasing rapidly
• Power knob running out
  – Speed == Power
  – 10 W/cm2 limit for convection cooling, 50 W/cm2 limit for forced-air cooling
  – Large currents, large power surges on wakeup
  – Cf. 140 A supply current, 150 W total power at 1.2 V Vdd for EV8 (Compaq)
• Speed knob running out
  – Historically, 2x clock frequency every process generation
    • 1.4x from device scaling (running into t_ox, other limits?)
    • 1.4x from fewer logic stages (from 40-100 down to around 14 FO4 INV delays)
  – Clocks cannot be generated with period < 6-8 FO4 INV delays
  – Pipelining overhead (1-1.5 FO4 INV delays for pulse-mode latch, 2-3 for FF)
  – Around 14 FO4 INV delays is the limit for clock period (L1 $ access, 64b add)
• Unrealistic to continue the 2x frequency trend in ITRS
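Pollack’s Rule amounts to saying that performance grows roughly as the square root of the area (transistor budget) spent on a new microarchitecture. A minimal Python sketch, for illustration only:

```python
import math

def pollack_speedup(area_ratio: float) -> float:
    """Pollack's Rule: performance grows roughly as the
    square root of the area (transistor-count) increase."""
    return math.sqrt(area_ratio)

# New uArch at 2-3x the old area in the same process:
for ratio in (2.0, 3.0):
    print(f"{ratio:.0f}x area -> {pollack_speedup(ratio):.2f}x performance")
# 2x area gives ~1.41x, 3x area gives ~1.73x performance
```

Doubling to tripling the area buys only the 1.4-1.7x performance quoted on the next slide, which is why the slide calls this the wrong side of a square law.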
Performance Efficiency of Microarchitectures – Pollack’s Rule
[Chart: growth (X, 0-4) of area (lead vs. compaction uArch) and performance (lead vs. compaction uArch) across technology generations 1.5, 1, 0.7, 0.5, 0.35, 0.18 µm; area growth rises toward 3-4X while performance growth stays much lower. Note: performance measured using SPECint and SPECfp.]
Implications (in the same technology)
• New microarchitecture ~2-3X die area of the last microarchitecture
• Provides 1.4-1.7X performance of the last microarchitecture
We are on the Wrong Side of a Square Law
Intel: Gelsinger talk ISSCC-2001
Decreasing SPECint/MHz
[Chart: SPECint95 SPEC ratio/MHz (0 to 0.09) versus clock speed (0-1000 MHz), decreasing; linear fit y = -5E-05x + 0.0989.]
Decreasing SPECfp/MHz
[Chart: SPECfp95 SPEC ratio/MHz (0 to 0.4) versus clock speed (0-1000 MHz), decreasing; linear fit y = -0.0005x + 0.5392.]
Decreasing SPECfp/Watt
[Chart: SPEC FP per Watt (0-3.5) versus date of data, May 1996 through April 2001, decreasing.]
MPU Clock Frequency Trend
Intel: Borkar/Parkhurst
MPU Clock Cycle Trend (FO4 Delays)
Intel: Borkar/Parkhurst
New MPU Clock Model
• Global clock: flat at 14 FO4 INV delays
  – FO4 INV delay = delay of an inverter driving a load equal to 4 times its input capacitance
  – no local interconnect component: negligible, scales with device performance
  – no (buffered) global interconnect component: (1) it was unrealistically fast in the Fisher98 (ITRS99) model, and (2) global interconnects are pipelined (clock frequency is set by the time needed to complete local computation loops, not the time for global communication - cf. Pentium-4 and Alpha-21264)
• Local clock: flat at 6 FO4 INV delays
  – somewhat meaningless: only for ser-par conversion, small iterative structures, “marketing interpretation” of phase-pipelining
  – reasonable alternative: delete from Roadmap
• ASIC/SOC: flat at 40-50 FO4 INV delays
  – absence of an interconnect component justified by the same pipelining argument, and by convergence of ASIC / structured-custom design methodologies and tool sets
  – higher ASIC/SOC frequencies possible, but represent tradeoffs with design cost, power, other figures of merit
  – information content is nil; reasonable to delete from Roadmap
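The fixed-FO4 model turns clock frequency into a simple function of gate length. A sketch, assuming the common rule of thumb that one FO4 inverter delay is about 360 ps per micron of gate length (an approximation used here for illustration, not an ITRS-specified constant):

```python
def fo4_delay_ps(l_gate_um: float, ps_per_um: float = 360.0) -> float:
    """Rule-of-thumb FO4 inverter delay: ~360 ps per micron of
    gate length (an assumed approximation, not an ITRS value)."""
    return ps_per_um * l_gate_um

def clock_ghz(l_gate_um: float, fo4_per_cycle: int) -> float:
    """Clock frequency when the cycle is a fixed number of FO4 delays."""
    period_ps = fo4_per_cycle * fo4_delay_ps(l_gate_um)
    return 1000.0 / period_ps  # ps period -> GHz

# Global MPU clock at 14 FO4 versus ASIC/SOC at ~45 FO4, for a 0.10 um gate:
print(f"MPU:  {clock_ghz(0.10, 14):.2f} GHz")
print(f"ASIC: {clock_ghz(0.10, 45):.2f} GHz")
```

Because the FO4 delay scales linearly with gate length, frequency under this model grows linearly with scaling, not superlinearly as in the older 2x-per-generation trend.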
MPU Futures (1)
• Drivers: power, I/O bandwidth, yield, ...
• Multiple small cores per die
  – core can be reused across multiple applications, configurations
  – IBM Power4 (2 CPU + L2); IBM S390 (14 MPU, 16MB L2 (8 chips) on 1 MCM (31 chips, 1000W, 1.4B xtors, 4224 pins))
  – Processor-in-Memory (PIM): O(10M) xtors logic per core, lots of memory
• More of memory hierarchy on board
  – 0.5Gb eDRAM L3 by 2005
  – high memory content gives better control of leakage, total chip power
• I/O bandwidth major differentiator
  – double-clocking, phase-pipelining in par/ser data conversion hits 6 FO4 limit
  – I/O count may stay same or decrease due to integration
  – roughly constant die size (200-350 mm2) also limits I/O count
• Evolutionary uArch changes
  – superpipelining (for freq), superscalar (beyond 4-way) running out of steam
  – more multithreading support for parallel processing
  – more complex hardwired functions (networking, graphics, communications, ...)
MPU Futures (2)
• Circuit design
  – ECC for SEU
  – pass gates on the way out due to low Vt
  – more redundancy to compensate for yield loss
  – density models are impacted
• Clocking and power (let’s be reasonable about “needs”!)
  – 1V supplies, 10-50W total power both flat
  – SOI (5% or 25%), multi-Vth (10%), multi-Vdd (30-50%), min-energy sizing under throughput constraints (25%), parallelism … (synergy not guaranteed)
  – multiple clock domains, grids; more gating/scheduling
  – adaptive voltage and frequency scaling
  – frequency: +1 GHz/year ... BUT: marketing focus shifts to system throughput
• Bifurcation of MPU requirements via “centralized processing”?
  – smart interface remedial processing (SIRP): basic computing and power efficiency, SOC integration of RF, M/S, digital (wireless mobile multimedia)
  – centralized computing server: high-performance computing (traditional MPU)
• The preceding gives example content for definition of MPU (high-volume custom) in System Drivers Chapter
ASIC-SOC-MPU Convergence
• Custom vs. ASIC headroom diminishing
  – density of custom == 1.25x ASIC (logic, memory)
  – “custom quality on ASIC schedule” achieved by on-the-fly, tuning, liquid, etc. cell-based methodologies (cf. IBM, Motorola)
  – convergence of ASIC, structured-custom methodologies (accelerated by COT model, tool limitations) to “hierarchical ASIC/SOC”
• ASIC-SOC convergence
  – ASIC = business model
  – SOC = product class (like MPU, DRAM), driven by cost and integration
  – ASICs are rapidly becoming indistinguishable from SOCs in terms of content, design methodology
• MPU-SOC convergence
  – MPUs evolving into SOCs in two ways
    • MPUs designed as cores to be included in SOCs
    • MPUs themselves designed as SOCs to improve reuse
  – (recall also SIRP = SOC integration)
• Thus, four System Driver Classes: MPU (high-volume custom), SOC, DRAM, AMS/RF
ASIC Logic Density Model
• Average size of gate (4t) = 32MP2 = 320F2
• MP is contacted lower-level metal pitch
  – sets size of a standard cell (e.g., 7-track, 9-track, etc.)
  – ITRS Interconnect chapter: MP ~ 3.1-3.2 * F ⇒ 1 MP2 = 10F2 (consistent across technologies)
• 32 comes from:
  – 8 tracks (expected height for dense std-cell library) by 4 tracks (avg width of 2-input NAND gate)
  – close match with claimed gate densities (published and unpublished data) – e.g., 100K gates/mm2 at 0.18µm
• Overhead/white space factor = 0.5
  – effective gate size = 64MP2
  – logic density = 19.3Mt/cm2 at 180nm (compare to 20Mt/cm2 in ITRS2000, total density)
• Scales quadratically
  – e.g., density 1.39Bt/cm2 at 30nm will be 36X that at 180nm (compare with current ITRS)
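The density model above can be checked numerically; a short Python sketch using only the constants defined on this slide:

```python
def asic_logic_density_t_per_cm2(f_nm: float,
                                 gate_area_f2: float = 320.0,
                                 utilization: float = 0.5,
                                 t_per_gate: int = 4) -> float:
    """Transistor density for the ASIC logic model above:
    a 4-transistor gate occupies 32 MP^2 = 320 F^2, and the 0.5
    overhead/white-space factor doubles the effective gate area
    to 640 F^2 (= 64 MP^2)."""
    f_cm = f_nm * 1e-7  # nm -> cm
    eff_gate_area_cm2 = (gate_area_f2 / utilization) * f_cm ** 2
    return t_per_gate / eff_gate_area_cm2

print(f"{asic_logic_density_t_per_cm2(180) / 1e6:.1f} Mt/cm^2 at 180 nm")
# ~19.3 Mt/cm^2, matching the figure quoted above
```

The same function scaled to 30 nm gives exactly 36X the 180 nm value, since density depends only on 1/F².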
MPU Logic Density Model
• Custom logic density == 1.25X ASIC logic density
• Example: MPU logic density 24.13Mt/cm2 at 180nm
(equal to 60K gates/mm2)
• Suggest breaking out logic and SRAM density
separately for MPU, rather than lumping together
SRAM Density
• SRAM cell size expressed as A*F2
• SRAM A factor essentially constant, barring paradigm shifts in architecture/stacking
  – Slight reduction with scaling, as seen in following slide
  – N.B.: 1-T SRAM (www.mosys.com): 2-3x area reduction, 4x power reduction, in production (Broadcom, Nintendo)
• Overhead (periphery)
  – Best current estimate = 100% ⇒ effective bitcell size = 2*actual
  – Periphery area can be a more exact function of memory size
    • smaller caches experience more overhead (could pertain to cost-perf vs. high-perf MPUs)
    • A word * B bit SRAM: core area = A*B*C (Artisan TSMC25: C = 240 F2); periphery area = K*log(A)*B (Artisan TSMC25: K = 4000-5000 F2)
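The word × bit area model can be sketched as follows; the log base is an assumption here (log2, on the grounds that periphery tracks address-decode depth), and K is taken mid-range at 4500 F2:

```python
import math

def sram_area_f2(words: int, bits: int,
                 c_f2: float = 240.0, k_f2: float = 4500.0) -> float:
    """Area (in F^2) of an A-word x B-bit SRAM: core = A*B*C plus
    periphery = K*log(A)*B.  C and K are the Artisan TSMC25 numbers
    quoted above (K taken mid-range); log2 is an assumption, not
    stated on the slide."""
    core = words * bits * c_f2
    periphery = k_f2 * math.log2(words) * bits
    return core + periphery

# Smaller memories pay proportionally more periphery overhead:
for words in (256, 4096, 65536):
    total = sram_area_f2(words, 32)
    core = words * 32 * 240.0
    print(f"{words:6d} words: periphery = {100 * (1 - core / total):.1f}% of area")
```

Because periphery grows only as log(A) while the core grows as A, the overhead fraction shrinks rapidly with memory size, matching the point above that smaller caches experience more overhead.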
Collection of 6T SRAM Cell Sizes from TSMC, Toshiba, Motorola, IBM, UMC, Samsung, Fujitsu, Intel
[Chart: A-Factor (SRAM cell area normalized to F2, 0-200) versus F (DRAM half-pitch, 0.1-0.4 µm); linear fit A-Factor = 50.546F + 133.19.]
Without overhead:
Technology Node, F:   180nm  130nm  100nm  70nm  50nm  35nm
SRAM Cell Size/F2:     142    139    138   136   135   134
SRAM Density
• At 180nm ⇒ 65.2 Mt/cm2 (compare to 35 Mt/cm2 in ITRS00 for cost-performance MPU)
• Easier to understand: 10.87 Mbits/cm2, since the Mt/cm2 definition ignores peripheral transistor count
• At 30nm ⇒ 414.6 Mbits/cm2 or 2.49 Bt/cm2 (compare to 3.5Bt/cm2 in ITRS00)
• Difference is due to non-quadratic scaling in ITRS00
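These density figures follow directly from the A-factor table and the 100% periphery-overhead estimate; a quick numerical check:

```python
def sram_density(f_nm: float, a_factor: float,
                 overhead: float = 1.0, t_per_cell: int = 6):
    """Bit and transistor density for a 6T SRAM whose bare cell is
    a_factor * F^2; the 100% periphery overhead estimate doubles
    the effective cell area.  Returns (bits/cm^2, transistors/cm^2)."""
    f_cm = f_nm * 1e-7  # nm -> cm
    eff_cell_cm2 = a_factor * (1.0 + overhead) * f_cm ** 2
    bits = 1.0 / eff_cell_cm2
    return bits, bits * t_per_cell

bits, xtors = sram_density(180, 142)  # A-factor 142 at 180 nm, per the table
print(f"{bits / 1e6:.2f} Mbits/cm^2, {xtors / 1e6:.1f} Mt/cm^2")
# ~10.87 Mbits/cm^2 and ~65.2 Mt/cm^2, as quoted above
```

Note the transistor count covers only the 6T cells; peripheral transistors are folded into the area overhead, which is why Mbits/cm² is the more honest metric.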
Memory/Logic Power Study Setup
• Motivation: Is current ITRS MPU model consistent with power realities?
• Ptotal = Plogic + Pmemory = constant
• Plogic composed of dynamic and static power, calculated as densities
• Pmemory = 0.1*Pdensity_dynamic
  – power density in memories is around 1/10th that of logic
• Logic power density (dynamic) determined using active capacitance density (Borkar, Micro99)
  – dynamic power density Pdensity_dynamic = Cactive * Vdd2 * fclock
  – fclock uses new fixed-FO4 inverter delay model (linear, not superlinear, with scale factor)
  – Cactive = 0.25nF/mm2 at 180nm
  – increases with scale factor (~1.43X)
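The dynamic term can be sketched directly from the formula above; the operating point (Vdd, fclock) used below is illustrative, not an ITRS table value:

```python
def dynamic_power_density_w_per_cm2(c_active_nf_per_mm2: float,
                                    vdd: float, f_ghz: float) -> float:
    """Pdensity_dynamic = Cactive * Vdd^2 * fclock, per unit area."""
    c_per_cm2 = c_active_nf_per_mm2 * 1e-9 * 100.0  # nF/mm^2 -> F/cm^2
    return c_per_cm2 * vdd ** 2 * f_ghz * 1e9

# 180 nm: Cactive = 0.25 nF/mm^2; assumed operating point 1.8 V, 1.2 GHz:
print(f"{dynamic_power_density_w_per_cm2(0.25, 1.8, 1.2):.0f} W/cm^2")
# ~97 W/cm^2
```

Even at this early node the logic power density lands far above the 10-50 W/cm² cooling limits cited earlier, which is exactly the tension this study quantifies.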
Memory/Logic Power Study Setup
• Static power model considers dual Vth values
  – 90% of logic gates use high-Vth with Ioff from PIDS Table 28a/b
  – 10% of logic gates use low-Vth with Ioff = 10X Ioff from PIDS Table 28a/b (90/10 split is from IBM and other existing dual-Vth MPUs)
  – Operating temp (80-100C) ⇒ Ioff is 10X of Table 28a/b (room temp)
• Width of each gate determined from IBM SA-27E library
  – 150nm technology; 2-input NAND = basic cell
  – performance level E: smallest footprint, next-to-fastest implementation ⇒ W of each device ~ 4um
  – Weff (effective leakage width) for each gate = 4um
  – 0.8*Weff*Ioff (per um) = Ileak / gate (0.8 comes from avg leakage over input patterns)
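The static term can be sketched the same way; the room-temperature Ioff and gate density passed in below are illustrative placeholders, not PIDS Table 28a/b entries:

```python
def static_power_density_w_per_cm2(gates_per_cm2: float,
                                   ioff_na_per_um: float,
                                   vdd: float,
                                   weff_um: float = 4.0,
                                   low_vt_frac: float = 0.10,
                                   low_vt_penalty: float = 10.0,
                                   temp_factor: float = 10.0,
                                   activity: float = 0.8) -> float:
    """Static power density for the dual-Vth model above:
    Ileak/gate = 0.8 * Weff * Ioff, with 90% of gates at high Vth
    and 10% at low Vth (10x Ioff), all scaled 10x for 80-100C."""
    ioff_a_per_um = ioff_na_per_um * 1e-9 * temp_factor
    vth_mix = (1.0 - low_vt_frac) + low_vt_frac * low_vt_penalty
    ileak_per_gate = activity * weff_um * ioff_a_per_um * vth_mix
    return gates_per_cm2 * ileak_per_gate * vdd

# e.g. 5M gates/cm^2, assumed Ioff = 1 nA/um at room temp, Vdd = 1.5 V:
print(f"{static_power_density_w_per_cm2(5e6, 1.0, 1.5):.2f} W/cm^2")
```

The 10% low-Vth fraction roughly doubles total leakage (0.9 + 0.1×10 = 1.9), which is why the dual-Vth split matters to the model.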
Memory/Logic Study Setup
• Calculate densities, then find allowable logic component (percent of total area) to achieve constant power (or power density)
  – Amemory + Alogic = Achip
  – recall that Achip is flat at 157 mm2 from 1999-2004, then increases by 20% every 4 years
• Constant power and constant power density scenarios same until 65nm node (because chip area flat until then)
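With both power densities in hand, the allowable logic fraction follows from the constant-power constraint; a sketch, with illustrative logic power-density values (the real study derives them per node from the models above):

```python
def logic_area_fraction(p_total_w: float, a_chip_cm2: float,
                        p_logic_w_per_cm2: float) -> float:
    """Solve Plogic*Alogic + 0.1*Plogic*(Achip - Alogic) = Ptotal
    for the logic fraction Alogic/Achip, with memory power density
    fixed at 1/10th of logic power density (as above)."""
    p_mem = 0.1 * p_logic_w_per_cm2
    frac = (p_total_w / a_chip_cm2 - p_mem) / (p_logic_w_per_cm2 - p_mem)
    return max(0.0, min(1.0, frac))  # clamp to a physical fraction

# 90 W budget on a 1.57 cm^2 die, as logic power density grows with scaling:
for p_logic in (100.0, 200.0, 400.0):
    print(f"logic at {p_logic:.0f} W/cm^2 -> "
          f"{100 * logic_area_fraction(90.0, 1.57, p_logic):.1f}% logic area")
```

As the logic power density climbs node over node, the fraction of the die that can be logic under a fixed budget shrinks, which is the trend the next chart shows.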
Power as a Constraint: Implications
[Chart: % of area devoted to logic (0-50%) versus year (1998-2014), under constant power (90W) and constant power density (90W/1.57cm2) scenarios; constant-area region 1999-2004.]
Constant power or power density ⇒ decreasing logic content ⇒ stop lock-step logic-SRAM doubling in current ITRS
Anomaly going from 45nm to 32nm due to constant Vdd
Power as a Constraint: Implications
[Chart: # of MPU cores allowable (0-12, left axis) and on-chip memory allowable (0-120 Mbytes, right axis) versus year (1998-2014), for constant power and constant power density.]
Using the same constraints, calculate #MPU cores (12Mt/core) and Mbytes SRAM allowable (again, anomaly at 32nm due to constant Vdd)
Design Cost Requirement
• “Largest possible ASIC” design cost model
  – engineer cost per year increases 5% per year ($181,568 in 1990)
  – EDA tool cost per year increases 3.9% per year ($99,301 in 1990)
  – #Gates in largest ASIC design per ORTCs (0.25M in 1990, 250M in 2005)
  – %Logic Gates constant at 70% (see next slide)
  – #Engineers / Million Logic Gates decreasing from 250 in 1990 to 5 in 2005
  – Productivity due to 7 Design Technology innovations (3.5 of which are still unavailable): RTL methodology; In-house P&R; Tall-thin engineer; Small-block reuse; Large-block reuse; IC implementation suite; Intelligent testbench; ES-level methodology
• Small refinements: (1) whether 30% memory content is fixed; (2) modeling increased amount of large-block reuse (not just the ability to do large-block reuse). No discussion of other design NRE (mask cost, etc.).
• #Engineers per ASIC design still rising (44 in 1990 to 875 in 2005), despite assumed 50x improvement in designer productivity
• New Design Technology -- beyond anything currently contemplated -- is required to keep costs manageable
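The headcount figures above can be reproduced from the stated endpoints; the geometric interpolation between 1990 and 2005 is an assumption here (the slide gives only the endpoints), and the cost figure simply multiplies headcount by compounded per-engineer costs:

```python
def engineers_per_design(year: int) -> float:
    """Engineers on the largest ASIC: (#Mgates * 70% logic) *
    (engineers per million logic gates).  Endpoints (0.25M -> 250M
    gates, 250 -> 5 eng/MLG) are the 1990/2005 figures above;
    geometric interpolation between them is an assumption."""
    t = (year - 1990) / 15.0
    mgates = 0.25 * (250.0 / 0.25) ** t
    eng_per_mlg = 250.0 * (5.0 / 250.0) ** t
    return mgates * 0.70 * eng_per_mlg

def design_cost(year: int) -> float:
    """Cost = #engineers * (engineer cost + EDA tool cost),
    compounding at 5%/yr and 3.9%/yr from the 1990 base values."""
    n = year - 1990
    per_engineer = 181_568 * 1.05 ** n + 99_301 * 1.039 ** n
    return engineers_per_design(year) * per_engineer

print(f"1990: {engineers_per_design(1990):.0f} engineers, ${design_cost(1990)/1e6:.0f}M")
print(f"2005: {engineers_per_design(2005):.0f} engineers, ${design_cost(2005)/1e6:.0f}M")
# headcount: 44 in 1990, 875 in 2005, as on the slide
```

The point of the model is visible immediately: even with a 50x productivity gain baked in, headcount (and hence cost) grows roughly 20x over the period.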
Design Cost Requirement
• Source: Dataquest (2001)
[Chart: Cost Metrics Forecast, 1990-2005, log scale from $10M to $100B: design cost for the largest possible ASIC, versus the same cost with RTL methodology only.]
ASIC Memory Content Trends
• Source: Dataquest (2001)
[Chart: ASIC Core Composition Breakout — percentage of die area (I/Os excluded), 0-60%, for 1999-2001, broken out into random logic, memory, analog, and cores.]
Design Quality Requirement
• “Normalized transistor” quality model
  – speed in a given technology
  – analog vs. digital
  – custom vs. semi-custom vs. generated
  – first-silicon success
  – other: simple / complex clocking, …
• developing quality normalization model within MARCO GSRC; VSIA, Numetrics, others pursuing similar goals
• Designs are getting worse (gathering evidence)
Design TWG Changes (ORTC/TWGs) (1)
• New clock frequency requirements
  – FO4 based, no global interconnect
  – global clock tracks 14 FO4 INV delays
  – local clock tracks 6 FO4 INV delays (or can be deleted)
• New layout density requirements
  – “A” factors for SRAM, logic (custom), logic (semi-custom)
  – adjustments for overheads (memories)
  – adjustments for redundancy, error correction
  – adjustments for change in “MPU” architecture (multi-core, L3 on board, ...)
• New MPU power requirements
  – bring total chip power down (e.g., flat at 90W, or perhaps 50W)
  – socially responsible, reasonable “need”, if nothing else
• New MPU figures of merit and requirements
  – statement of need: increase utility (SPECint, throughput, etc.), not frequency
  – server: SPEC/W, I/O or request handling bandwidth
  – smart interface: power, form factor, reusability, reprogrammability
Design TWG Changes (ORTC/TWGs) (2)
• #Metal layers
  – formal model: #Layers grows as log(#Transistors) (DeHon 2000)
  – add: dedicated metal layers for inductive shielding (1 per generation; these are not “interconnect” layers)
• Package pins/balls
• Variability
  – performance uncertainty due to variation of Leff, Vt, Tox, W_int, t_ILD, etc. is managed by design of synchronization, logic, circuits
  – these tolerances can be increased (removing some red bricks), and in any case should be developed via critical-path, other design models
• ASIC-SOC convergence
  – SOC (= System-LSI) is the “product class” that is analogous to MPU, DRAM
  – System Drivers Chapter: SOC, MPU, AMS/RF, (DRAM)
  – references to “ASIC” in ITRS should be adjusted/removed accordingly
Errata (1)
• PIDS: 10% static power constraint in Table 28a/b
  – Justification?
  – 100X increase in Ioff from room temp to 100C is high (better estimate = 10-20X; see Borkar, IEEE Micro99)
  – W/L of 3 for all devices – including memory? Max Ioff used? Pessimistic; should use simpler gate-level (not xtor-level) approach
• ORTCs: inconsistent density metrics
  – ASIC, high-perf MPU give total density; cost-perf MPU breaks down logic vs. memory
• ORTCs: density scales super-quadratically
  – 180nm to 90nm gives 5X rise in density (instead of 4X)
  – 180nm to 30nm gives 100X rise in density (instead of 36X)
Errata (2)
• ORTCs: ASIC total density == high-perf MPU total density
  – MPU logic density should be 1.25X ASIC
  – even if SRAM densities are the same, overall MPU density should be >5% larger (more if ASIC memory component is smaller than MPU’s)
• ORTC00: MPU pad counts, Tables 3a/3b
  – flat from 2001-2005
  – but in this time period, chip current draw increases 64%
• A/P: Effective bump pitch roughly constant at 350um throughout ITRS
  – Why does bump/pad count scale with chip area only, not with technology demands (IR drop, L*di/dt)?
  – Implication – metal resource needed to ensure <10% IR drop skyrockets, since Ichip and wiring resistance increase
Errata (3)
• A/P: Later technologies (30-40nm) have too few bumps to carry required maximum current draw
  – 1250 Vdd pads at 30nm: with bump pitch of 250um can carry 150mA each (bumps at 350um pitch can carry more, not shown in ITRS)
  – 187.5A max capability, but Ichip/Vdd > 300A
  – 100,000-hour reliability #’s build cushion into this calculation
• A/P: Why is hand-held power 2.6W in 2005 (monotonically increasing 1999-2005) but then 2.1W in 2008 (resumes increasing)?
• ORTCs: Differentiate between high-perf MPU, ASIC power?
  – currently no estimates for high-end ASIC power consumption
• PIDS/Litho: Explain CD variability “requirement” (since Design can work around it)
Errata (4)
• PIDS: Suggest including electrical gate oxide thickness
as well as physical.
– Can incorporate expected gate material enhancements to
reduce gate depletion effects (GDE)
– Can give better depiction of how Ion scales and how significant
Ioff will be as a result
Addendum: SPEC Company List (www.specbench.org)
• Advanced Micro Devices
• Alpha Processor
• BULL S.A.
• Compaq Computer
• Data General Corp.
• Dell Computer
• Digital Equipment
• Fujitsu
• Gateway 2000
• HAL Computer Systems
• Hewlett-Packard
• Hitachi Ltd.
• IBM
• Intel
• Intergraph Corp.
• KryoTech
• Motorola
• Pyramid Technology
• ROSS Technology
• SGI
• Siemens
• Sun Microsystems
• Tandem Computers
• UNISYS Corp.