Evolution of Chip Design
ECE 111
Spring 2011
A Brief History
• 1958: First integrated
circuit
– Flip-flop using two
transistors
– Built by Jack Kilby at Texas
Instruments
Courtesy Texas Instruments
• 2010
– Intel Core i7 µprocessor
• 2.3 billion transistors
– 64 Gb Flash memory
• > 16 billion transistors
[Trinh09] © 2009 IEEE.
Source: David Harris, CMOS VLSI Design Lecture Slides
Annual Sales
• >10^19 transistors manufactured in 2008
– 1 billion for every human on the planet
Source: David Harris, CMOS VLSI Design Lecture Slides
Feature Size
• Minimum feature size shrinking 30% every 2-3
years
Source: David Harris, CMOS VLSI Design Lecture Slides
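The 30% shrink compounds. A linear shrink to 0.7x cuts the area of a feature to about 0.49x, so roughly twice as many transistors fit in the same area per generation. A minimal Python sketch of that arithmetic, assuming ideal scaling (the function names are illustrative and not from the slides):

```python
def linear_scale(generations, shrink=0.30):
    """Relative feature size after some generations of an ideal linear shrink."""
    return (1.0 - shrink) ** generations


def density_gain(generations, shrink=0.30):
    """Relative transistors per unit area, assuming area scales as size squared."""
    return 1.0 / linear_scale(generations, shrink) ** 2


if __name__ == "__main__":
    for gen in range(5):
        print(f"generation {gen}: size x{linear_scale(gen):.3f}, "
              f"density x{density_gain(gen):.2f}")
```

Real processes do not scale every dimension equally, so this is an upper-level estimate only.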
NRE Mask Costs
Source: MIT Lincoln Labs, M. Fritze, October 2002
Subwavelength Lithography
Challenges
Source: Raul Camposano, 2003
The Designer’s Escalating Problem
Source: Raul Camposano, 2003
Wire Delays and Noise Problems
Dramatically Complicate Design
[Figure: 180 nm vs. 45 nm, span labeled “1 cycle”]
• Unstructured “Place and Route” Standard Cell
Methodologies will Break Down
ASIC NRE Costs Not Justified for
Many Applications
• Forecast: By 2010, a complex ASIC will have an
NRE Cost of over $40M = $28M (NRE Design
Cost) + $12M (NRE Mask Cost)
• Many “ASIC” applications will not have the
volume to justify a $40M NRE cost
• e.g. a $30 IC with a 33% margin would require
sales of 4M units (x $10 profit/IC) just to
recoup $40M NRE Cost
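The break-even arithmetic above can be sketched in a few lines of Python. The function name is hypothetical; the dollar figures are the ones on the slide, which rounds a 33% margin on a $30 part to $10 profit per IC.

```python
import math


def breakeven_units(nre_cost, profit_per_unit):
    """Units that must be sold before per-unit profit covers the NRE cost."""
    return math.ceil(nre_cost / profit_per_unit)


if __name__ == "__main__":
    design_nre = 28e6   # NRE design cost from the slide
    mask_nre = 12e6     # NRE mask cost from the slide
    total_nre = design_nre + mask_nre
    print(breakeven_units(total_nre, profit_per_unit=10.0))  # 4000000
```

The same function makes it easy to see how quickly the required volume grows when the margin or the selling price drops.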
Case For Programmable Solutions
• Can “amortize” high NRE costs across many
applications
– e.g. microprocessors, DSPs, FPGAs
• Complex ASICs today require 18+ months vs.
~4 months for same function on DSP
– e.g. Voice-over-IP chip vs. Voice-over-IP on a DSP
– “Design time” gap will widen dramatically
• Many applications simply require
“programmability”, e.g. cell phones
– multiple modes
– evolving standards
– evolving features, differentiation …
But …
• Advanced applications and algorithms (e.g.
latest video games, broadband wireless …)
require enormous computation power
– 100s to 1000s of GOPS
• And very high efficiency
– 100s of MOPS/mW (GOPS/W)
– 10s of GOPS/$
• Existing microprocessors, DSPs, and FPGAs
don’t come close
Why are Conventional Processor
Architectures Inefficient?
• e.g. Intel Itanium II
– 6-Way Integer Unit < 2% die area
– Cache logic > 50% die area
• Most of chip there to keep these 6
Integer Units at “peak” rate
• Main issue: ratio of external DRAM
latency (50ns) to internal clock
period (0.25ns) is 200:1
• Can “in theory” fit >300 ALUs
(tens of thousands in future) in
same die area, but how to keep
them “busy”?
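To see why the 200:1 ratio matters, here is a rough Python sketch. It assumes the worst case, in which one DRAM access stalls every execution unit for the full latency with no overlap; the function names are illustrative and not from the slides.

```python
def stall_cycles(dram_latency_ns, clock_period_ns):
    """Clock cycles that elapse during one external DRAM access."""
    return round(dram_latency_ns / clock_period_ns)


def lost_issue_slots(dram_latency_ns, clock_period_ns, units):
    """Issue slots idle if all units wait out one DRAM access (worst case)."""
    return stall_cycles(dram_latency_ns, clock_period_ns) * units


if __name__ == "__main__":
    print(stall_cycles(50, 0.25))            # 200 cycles per access
    print(lost_issue_slots(50, 0.25, 6))     # 6 integer units
    print(lost_issue_slots(50, 0.25, 300))   # 300 ALUs in the same die area
```

Under this assumption the cost of a miss grows linearly with the number of units, which is the "how to keep them busy" problem in numeric form.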
[Figure labels: INT6, Cache logic]
Why are ASICs so Efficient?
• Parallelism (Millions of gates operating in parallel)
• Locality (Fed by dedicated “local” wires & memories)
Source: Bill Dally, 2003
20 MIPS CPU in 1987
Few thousand gates
Source: Anant Agarwal, MIT, NOCS 2009 Keynote
The billion transistor chip of 2007
Source: Anant Agarwal, MIT, NOCS 2009 Keynote
Tilera’s TILEPro64™ Processor
Multicore Performance (90nm)
– Number of tiles: 64
– Cache-coherent distributed cache: 5 MB
– Operations @ 750MHz (32, 16, 8 bit): 144-192-384 BOPS
– Bisection bandwidth: 2 Terabits per second
Power Efficiency
– Power per tile (depending on app): 170 – 300 mW
– Core power for h.264 encode (64 tiles): 12W
– Clock speed: Up to 866 MHz
I/O and Memory Bandwidth
– I/O bandwidth: 40 Gbps
– Main Memory bandwidth: 200 Gbps
Product reality
Programming
ANSI standard C
SMP Linux programming
Stream programming
Source: Anant Agarwal, MIT, NOCS 2009 Keynote
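The TILEPro64 figures can be cross-checked with a short Python sketch. It backs out how many operations each tile must complete per cycle to reach the quoted BOPS totals, and what 12 W of core power means per tile. The constant and function names are illustrative; only the numbers come from the slide.

```python
TILES = 64
CLOCK_HZ = 750e6
# Quoted totals in operations per second, keyed by operand width in bits.
QUOTED_OPS = {32: 144e9, 16: 192e9, 8: 384e9}


def ops_per_tile_per_cycle(total_ops_per_s, tiles=TILES, clock_hz=CLOCK_HZ):
    """Operations each tile completes per clock to reach the quoted total."""
    return total_ops_per_s / (tiles * clock_hz)


def power_per_tile_mw(core_power_w, tiles=TILES):
    """Average power per tile in milliwatts for a given core power."""
    return core_power_w * 1000.0 / tiles


if __name__ == "__main__":
    for width, total in QUOTED_OPS.items():
        print(f"{width}-bit: {ops_per_tile_per_cycle(total):.0f} ops/tile/cycle")
    print(f"{power_per_tile_mw(12.0):.1f} mW per tile")
```

The per-tile power that falls out of the 12 W H.264 figure lands inside the quoted 170 – 300 mW range, so the slide's numbers are mutually consistent.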
Tile Processor Block Diagram
A Complete System on a Chip
[Block diagram]
• Tile: PROCESSOR (Reg File; P0, P1, P2), CACHE (L2 CACHE, L1I, L1D, ITLB, DTLB, 2D DMA), SWITCH (MDN, TDN, UDN, IDN, STN)
• Memory: DDR2 Memory Controllers 0-3
• I/O: XAUI MAC/PHY 0-1, PCIe 0-1 MAC/PHY, Serdes (x4), GbE 0-1, Flexible IO (x2), UART, HPI, JTAG, I2C, SPI
Source: Anant Agarwal, MIT, NOCS 2009 Keynote
What Does the Future Look Like?
Corollary of Moore’s law: Number of cores will
double every 18 months
            ‘02    ‘05    ‘08    ‘11    ‘14
Research     16     64    256   1024   4096
Industry      4     16     64    256   1024
1K cores by 2014! Are we ready?
(Cores minimally big enough to run a self-respecting OS!)
Source: Anant Agarwal, MIT, NOCS 2009 Keynote
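The projection is plain compounding: doubling every 18 months is a factor of 4 every 3 years. A minimal Python sketch that regenerates both rows from their 2002 starting points (the function name and parameters are illustrative):

```python
def projected_cores(start_cores, start_year, year, doubling_months=18):
    """Core count in `year`, assuming a doubling every `doubling_months`."""
    months = (year - start_year) * 12
    return round(start_cores * 2 ** (months / doubling_months))


if __name__ == "__main__":
    years = (2002, 2005, 2008, 2011, 2014)
    for label, start in (("Research", 16), ("Industry", 4)):
        row = [projected_cores(start, 2002, y) for y in years]
        print(label, row)
```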
Massively Parallel Processing
On-a-Chip
[Figure: DDR DRAM (x4), DDR Interface, SRAM, Registers; bandwidth labels 2 GB/s, 32 GB/s, 544 GB/s]
64 Tiles x 8 ALUs = 512 ALUs
@ 2 GHz, 1000 GOPS = 1 TOPS
Parallelism + Locality
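The headline number is ALU count times clock rate. A short Python sketch, assuming each ALU retires one operation per cycle (an assumption; the slide does not say how operations are counted), with an illustrative function name:

```python
def peak_gops(tiles, alus_per_tile, clock_ghz, ops_per_alu_per_cycle=1):
    """Peak throughput in GOPS if every ALU is busy on every cycle."""
    return tiles * alus_per_tile * clock_ghz * ops_per_alu_per_cycle


if __name__ == "__main__":
    total_alus = 64 * 8
    print(total_alus)                # 512
    print(peak_gops(64, 8, 2.0))     # 1024.0 GOPS, which the slide rounds to 1 TOPS
```

This is a peak figure only; sustained throughput depends on keeping all 512 ALUs fed, which is where the bandwidth hierarchy comes in.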
Bandwidth Hierarchy is Key
                    Memory BW    Global RF BW   Local RF BW
Depth Extractor     0.80 GB/s    18.45 GB/s     210.85 GB/s
MPEG Encoder        0.47 GB/s     2.46 GB/s     121.05 GB/s
Polygon Rendering   0.78 GB/s     4.06 GB/s     102.46 GB/s
QR Decomposition    0.46 GB/s     3.67 GB/s     234.57 GB/s
Source: Bill Dally, 2003
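The point of these figures is how little of the traffic reaches external memory. A minimal Python sketch that computes, for each application, the local-to-memory bandwidth ratio and the share of total bandwidth served by local register files. The data is from the slide; the names are illustrative.

```python
# (memory BW, global RF BW, local RF BW), all in GB/s, from the slide.
BANDWIDTH_GBPS = {
    "Depth Extractor": (0.80, 18.45, 210.85),
    "MPEG Encoder": (0.47, 2.46, 121.05),
    "Polygon Rendering": (0.78, 4.06, 102.46),
    "QR Decomposition": (0.46, 3.67, 234.57),
}


def local_to_memory_ratio(mem, global_rf, local_rf):
    """How many times more bandwidth the local RFs supply than memory does."""
    return local_rf / mem


def local_share(mem, global_rf, local_rf):
    """Fraction of all bandwidth that is served by local register files."""
    return local_rf / (mem + global_rf + local_rf)


if __name__ == "__main__":
    for app, bw in BANDWIDTH_GBPS.items():
        print(f"{app}: local/memory = {local_to_memory_ratio(*bw):.0f}x, "
              f"local share = {local_share(*bw):.1%}")
```

In every row the local register files carry more than 90% of the bandwidth, which is the locality argument made with the slide's own data.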
IBM/Sony/Toshiba Cell Processor
[Figure: 64-bit Dual-Thread PowerPC and SIMD Engines (7 ALUs each) on a Tb/s Ring Network, with 0.5 Tb/s Memory I/O and 0.5 Tb/s Chip I/O]
• Used in PlayStation 3
• 4.6 GHz 64-bit Dual-Threaded
PowerPC
• 8 SIMD Engines
x 7 ALUs = 56 ALUs
@ 4.6 GHz = 256 GFLOPS
• Terabit on-chip ring network
• Terabit external memory and
chip-to-chip IO
• 90nm process
• 234 million transistors
• 221 mm² die
NVIDIA GeForce 8800
[Figure: 32-bit CPU; 8 clusters x 16 ALUs = 128 ALUs; 0.7 Tb/s Memory I/O]
• 8 Clusters x 16 ALUs = 128 ALUs
• 32-bit on-chip CPU
• Terabit external memory IO
• 1.35 GHz clock
• 90nm process
• 681 million transistors
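For a rough side-by-side of the Cell and GeForce 8800 figures, the Python sketch below derives a peak ALU rate and a transistor budget per ALU for each chip. It assumes one operation per ALU per cycle (vendor FLOPS figures may count a multiply-add as two) and divides the whole transistor count, caches and I/O included, by the ALU count. The `Chip` class is hypothetical and exists only for this comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chip:
    name: str
    alus: int
    clock_ghz: float
    transistors: float

    def peak_alu_gops(self):
        """Peak rate in G ops/s, assuming one op per ALU per cycle."""
        return self.alus * self.clock_ghz

    def transistors_per_alu(self):
        """Whole-chip transistor count divided by the number of ALUs."""
        return self.transistors / self.alus


# Figures taken from the two slides above.
CELL = Chip("Cell", alus=8 * 7, clock_ghz=4.6, transistors=234e6)
GEFORCE_8800 = Chip("GeForce 8800", alus=8 * 16, clock_ghz=1.35, transistors=681e6)

if __name__ == "__main__":
    for chip in (CELL, GEFORCE_8800):
        print(f"{chip.name}: {chip.alus} ALUs, "
              f"{chip.peak_alu_gops():.1f} G ops/s peak, "
              f"{chip.transistors_per_alu() / 1e6:.2f}M transistors per ALU")
```

For Cell, the product of 56 ALUs and 4.6 GHz comes out slightly above the 256 GFLOPS quoted on the slide, so the slide figure appears to be rounded.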