
The Future of Computer Architecture
Shekhar Borkar
Intel Corp.
June 9, 2012
This research was, in part, funded by the U.S. Government. The views and
conclusions contained in this document are those of the authors and should not be
interpreted as representing the official policies, either expressed or implied, of the
U.S. Government.
Outline
Compute performance goals
Challenges & solutions for:
• Compute,
• Memory,
• Interconnect
Importance of resiliency
Paradigm shift
Summary
Performance Roadmap
[Figure: performance roadmap, 1960-2020, GFLOPS on a log scale from Mega through Giga, Tera, and Peta toward Exa, with client and hand-held systems tracking the high end; each 1000X milestone has taken roughly 10-12 years.]
From Giga to Exa, via Tera & Peta
[Figure: four panels spanning 1986-2016 on the path from Giga through Tera and Peta to Exa: relative transistor performance, concurrency, relative energy per operation (with Vcc scaling down from 5V), and power. The annotated gains in transistor performance are modest (tens of X) compared with the explosion in concurrency (thousands to millions of X), so most of the performance must come from parallelism and from reducing energy per operation.]
Energy per Operation
[Figure: projected energy per operation (pJ) versus technology node, 45nm through 7nm, for off-chip communication (~100 pJ/bit), DRAM access (~75 pJ/bit), double-precision register-file operands (~10 pJ/bit), and a double-precision floating-point operation; moving the data costs far more than the arithmetic itself.]
Where is the Energy Consumed?
[Figure: where the ~3KW goes today, and the goal. Today: compute ~2.5KW, bloated with inefficient architectural features, decode and control, address translations, and power supply losses; memory ~150W at ~0.1 Byte/FLOP and ~1.5nJ per Byte; communication ~100W; disk ~100W (10TB at 1TB/disk, 10W each). Goal: compute at 100pJ per FLOP (~5W), communication at 100pJ per FLOP (~5W), with memory and disk cut to a few watts each, for roughly 20W in total.]
Voltage Scaling
When designed to voltage scale
[Figure: normalized frequency, total power, leakage, and energy efficiency versus Vdd for a design built to voltage scale. As Vdd drops from nominal toward roughly 0.3 of nominal, frequency and total power fall away while energy efficiency rises by close to an order of magnitude, peaking near the threshold voltage.]
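To make the shape of these curves concrete, here is a minimal Python sketch of a voltage-scaled design; the threshold voltage, alpha-power exponent, and leakage fraction below are assumptions for illustration, not values from the slide.

```python
# Minimal sketch (illustrative constants, not the slide's data) of why energy
# efficiency peaks near the threshold voltage: dynamic energy per op falls as
# C*V^2, but leakage energy per op grows as frequency collapses near Vt.

VT = 0.3      # assumed threshold voltage (normalized Vdd units)
ALPHA = 1.3   # assumed alpha-power-law exponent

def freq(v):
    """Relative frequency from the alpha-power law: f ~ (V - Vt)^alpha / V."""
    return max(v - VT, 1e-6) ** ALPHA / v

def energy_per_op(v, leak_fraction_at_nominal=0.1):
    """Relative energy per operation = dynamic + leakage components."""
    dyn = v ** 2                                   # CV^2, with C normalized to 1
    # Leakage energy per op = leakage power / frequency; leakage power ~ V.
    leak = leak_fraction_at_nominal * v / (freq(v) / freq(1.0))
    return dyn + leak

nominal = energy_per_op(1.0)
for v in [1.0, 0.8, 0.6, 0.5, 0.45, 0.4, 0.35]:
    print(f"Vdd={v:.2f}  relative frequency={freq(v)/freq(1.0):.2f}  "
          f"energy efficiency vs nominal={nominal/energy_per_op(v):.2f}x")
```

With these toy constants, efficiency peaks a little above the threshold voltage and then falls as leakage energy per operation takes over, which is the qualitative behavior the slide is pointing at.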
Near Threshold-Voltage (NTV)
[Figure: measured NTV behavior of a 65nm CMOS design at 50°C (H. Kaul et al., 16.6, ISSCC 2008). As supply voltage scales from 1.2V down to 320mV into the subthreshold region, maximum frequency drops by about four orders of magnitude and total power by nearly three orders of magnitude, while energy efficiency (GOPS/Watt) improves by roughly 9.6X, peaking near threshold.]
NTV Across Technology Generations
[Figure: measured energy efficiency and active leakage power versus supply voltage for NTV designs across three generations, at 50°C: 45nm (H. Kaul et al., ISSCC 2009), a 32nm reconfigurable fabric (A. Agarwal et al., ISSCC 2010), and 22nm (S. K. Hsu et al., ISSCC 2012), covering circuits such as a 32b multiply, 16b SIMD multiply, 72b add, register file, and permute crossbar. Each operates down to roughly 300-340mV in or near the subthreshold region with energy-efficiency gains of about 6-9X near threshold.]
NTV operation improves energy efficiency across 45nm-22nm CMOS.
Impact of Variation on NTV
[Figure: relative frequency versus relative Vdd with ±5% variation in Vdd or Vt. The frequency spread grows from a few percent at nominal voltage toward 50% or more as Vdd scales toward threshold, and circuit vulnerability to 5% noise grows several-fold.]
Because frequency scales roughly as (Vdd − Vt) / Vdd, the (Vdd − Vt) term shrinks as Vdd approaches the threshold voltage, so a 5% variation in Vt or Vdd results in 20 to 50% variation in circuit performance. A small numerical sketch of this sensitivity follows.
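The sketch below uses only the simple frequency ∝ (Vdd − Vt)/Vdd relation, not the slide's simulation; the threshold voltage is an assumed value.

```python
# A quick numerical sketch (illustrative Vt, not the slide's data) of why NTV
# amplifies variation: with f ~ (Vdd - Vt) / Vdd, a fixed +/-5% wiggle in Vt
# moves frequency only slightly at nominal voltage but dramatically near
# threshold, where (Vdd - Vt) is small.

VT = 0.30  # assumed nominal threshold voltage (V)

def rel_freq(vdd, vt=VT):
    return (vdd - vt) / vdd

for vdd in [1.0, 0.7, 0.5, 0.4, 0.35]:
    nominal = rel_freq(vdd)
    lo = rel_freq(vdd, vt=VT * 1.05)   # +5% threshold voltage
    hi = rel_freq(vdd, vt=VT * 0.95)   # -5% threshold voltage
    spread = (hi - lo) / nominal * 100
    print(f"Vdd={vdd:.2f}V  frequency spread from +/-5% Vt: {spread:.0f}%")
```

At nominal voltage the ±5% wiggle moves frequency only a few percent; near threshold the same wiggle produces the 20-50%+ spread shown on the slide.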
Mitigating Impact of Variation
1. Variation control with body biasing
Body effect is substantially reduced in advanced technologies
Energy cost of body biasing could become substantial
Fully-depleted transistors have no body left
2. Variation tolerance at the system level
Example: many-core system
[Diagram: a many-core array where variation leaves most cores capable of full frequency f, but some only f/2 or f/4.]
Running all cores at full frequency exceeds the energy budget. Instead, run each core at its native frequency and rely on the law of large numbers—averaging across many cores (see the sketch below).
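A toy Monte-Carlo sketch of the averaging argument; the per-core frequency distribution below is synthetic and purely illustrative.

```python
# Minimal sketch of the "law of large numbers" argument (synthetic data, not
# measurements): individual core frequencies vary widely near NTV, but the
# aggregate throughput of a large core array varies very little.

import random

random.seed(1)

def chip_throughput(n_cores, mean_f=1.0, sigma=0.25):
    """Sum of per-core native frequencies, each drawn with large variation."""
    return sum(max(random.gauss(mean_f, sigma), 0.1) for _ in range(n_cores))

for n in [1, 16, 256, 4096]:
    samples = [chip_throughput(n) / n for _ in range(200)]
    mean = sum(samples) / len(samples)
    spread = (max(samples) - min(samples)) / mean * 100
    print(f"{n:5d} cores: chip-to-chip throughput spread ~{spread:.0f}%")
```

A single core's speed is a lottery ticket; the aggregate throughput of thousands of cores running at their native frequencies is very predictable.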
Subthreshold Leakage at NTV
[Figure: source-drain leakage power as a fraction of total power versus technology node (45nm to 5nm) at 100%, 75%, 50%, and 40% of nominal Vdd; the leakage fraction grows toward 40-60% as voltage is reduced and variations increase.]
NTV operation reduces total power and improves energy efficiency, but subthreshold leakage power becomes a substantial portion of the total.
Experimental NTV Processor
[Die photo: IA-32 core, 1.8mm × 1.1mm, with logic, ROM, scan, L1 instruction and data caches, and level shifters plus clock spine; packaged in a 951-pin FCBGA with a custom interposer on a legacy Socket-7 motherboard.]
Technology: 32nm High-K Metal Gate
Interconnect: 1 Poly, 9 Metal (Cu)
Transistors: 6 Million (Core)
Core Area: 2mm²
S. Jain, et al., "A 280mV-to-1.2V Wide-Operating-Range IA-32 Processor in 32nm CMOS", ISSCC 2012
Power and Performance
[Figure: measured frequency and total power versus logic/memory supply voltage for the 32nm IA-32 core at 25°C: roughly 3MHz at 2mW in subthreshold, around 100MHz at 17mW near threshold, about 500MHz at 174mW, and up to 915MHz at 737mW at 1.2V. Accompanying pie charts break total power into logic dynamic, logic leakage, and memory leakage for subthreshold, NTV, and super-threshold operation; leakage dominates at the lowest voltages.]
Observations
[Figure: stacked power breakdown (memory leakage, memory dynamic, logic leakage, logic dynamic) for subthreshold, NTV, and full-Vdd operation.]
Leakage power dominates at low voltage, so fine-grain leakage power management is required.
Memory & Storage Technologies
[Figure: cost per bit (pico-dollars), energy per bit (pJ), and capacity (Gbit) on log scales for SRAM, DRAM, NAND/PCM (which have an endurance issue), and disk; cost and energy per bit fall, and capacity rises, as you move down the hierarchy.]
Revise DRAM Architecture
Signaling energy cost today: ~150 pJ/bit.
Traditional DRAM (RAS/CAS page addressing):
• Activates many pages
• Lots of reads and writes (refresh)
• Small amount of the read data is used
• Requires a small number of pins
New DRAM architecture:
• Activates few pages
• Reads and writes (refreshes) only what is needed
• All read data is used
• Requires a large number of IOs (3D)
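A rough way to see why activating fewer bits matters; the activation and IO energies below are assumed round numbers for illustration, not measured DRAM parameters.

```python
# Hedged sketch (illustrative parameters, not the slide's numbers) contrasting
# the two DRAM styles: a traditional part activates a whole multi-kilobyte row
# to serve a small request, while a wide-IO part can activate only what it
# needs, so far fewer bits pay the activation energy.

PJ_PER_ACTIVATED_BIT = 2.0      # assumed row-activation energy per bit
PJ_PER_TRANSFERRED_BIT = 10.0   # assumed IO/signaling energy per bit moved

def access_energy_pj(request_bytes, activated_row_bytes):
    activated_bits = activated_row_bytes * 8
    transferred_bits = request_bytes * 8
    return (activated_bits * PJ_PER_ACTIVATED_BIT
            + transferred_bits * PJ_PER_TRANSFERRED_BIT)

request = 64  # a 64-byte cache line
traditional = access_energy_pj(request, activated_row_bytes=8192)  # 8KB row
revised = access_energy_pj(request, activated_row_bytes=256)       # narrow activation

print(f"traditional DRAM: {traditional / (request * 8):.0f} pJ per useful bit")
print(f"revised DRAM:     {revised / (request * 8):.0f} pJ per useful bit")
```

The point is not the absolute numbers but the ratio: when only the needed bits pay the activation cost, energy per useful bit drops by an order of magnitude, which is what a wide-IO (3D) interface enables.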
3D Integration of DRAM
Thin Logic and DRAM die
Through silicon vias
Power delivery through logic die
[Diagram: thinned DRAM die stacked on a logic buffer die on the package.]
Energy efficient, high speed IO runs to the logic buffer; the detailed interface signals to the DRAMs are created on the logic die.
The most promising solution for energy-efficient bandwidth.
Communication Energy
[Figure: communication energy per bit (pJ) versus interconnect distance (0.1cm to 1000cm, log scale), stepping up from on-die links, to chip-to-chip, to board-to-board, to between cabinets, spanning roughly 1 pJ/bit to more than 10 pJ/bit.]
On-Die Communication Power
[Figure: die photos and power breakdowns of the 80-core TFLOP chip (2006) and the 48-core Single-chip Cloud Computer (2009). The TFLOP chip uses an 8 × 10 mesh with 32-bit links and 320 GB/sec bisection bandwidth at 5 GHz; per-tile power splits roughly into dual FPMACs 36%, router + links 28%, IMEM + DMEM 21%, clock distribution 11%, and 10-port register file 4%. The SCC arranges two-core clusters in a 6 × 4 mesh (why not 6 × 8?) with 128-bit links and 256 GB/sec bisection bandwidth at 2 GHz; chip power splits roughly into cores 70%, memory controllers & DDR3 19%, routers & 2D mesh 10%, and global clocking 1%.]
On-die Communication Energy
[Figure: switch energy per bit (pJ, log scale) for a traditional homogeneous network versus process generation, from measured points at 0.5u through 65nm and extrapolated to 22nm and 7nm, reaching about 0.08 pJ/bit per switch and 0.04 pJ/mm of wire.]
Assuming a Byte per Flop of on-die traffic, this works out to roughly 27 pJ/Flop of on-die communication energy (see the sketch below).
1. Network power is too high (27MW for an EFLOP)
2. Worse if link width scales up each generation
3. Cache coherency mechanism is complex
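The 27 pJ/Flop and 27MW figures can be reproduced with simple arithmetic; the per-switch and per-mm energies are from the slide, while the hop count and average distance are assumptions chosen here to make the math visible.

```python
# Back-of-the-envelope check of the slide's numbers. The per-hop and per-mm
# energies are from the slide; the hop count and average distance traveled are
# illustrative assumptions.

PJ_PER_BIT_PER_SWITCH = 0.08
PJ_PER_BIT_PER_MM = 0.04
BITS_PER_FLOP = 8            # the slide assumes one Byte of traffic per Flop

hops = 20                    # assumed average router hops per transfer
distance_mm = 45             # assumed average on-die distance traveled

pj_per_bit = hops * PJ_PER_BIT_PER_SWITCH + distance_mm * PJ_PER_BIT_PER_MM
pj_per_flop = pj_per_bit * BITS_PER_FLOP
print(f"~{pj_per_flop:.0f} pJ per Flop of on-die communication")

# Scaling to an ExaFLOP machine: 1e18 Flop/s * pJ/Flop * 1e-12 W/pJ = watts.
exaflop_watts = 1e18 * pj_per_flop * 1e-12
print(f"~{exaflop_watts / 1e6:.0f} MW just for the on-die network at 1 EFLOP/s")
```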
Packet Switched Interconnect
[Diagram: a packet traversing a mesh, with each router acting like a STOP sign: 3-5 clocks and roughly 0.5-1mW at every hop over 5-10mm links.]
1. Routers act like STOP signs—they add latency
2. Each hop consumes power (unnecessarily)
Mesh—Retrospective
Bus: Good at board level, does not extend well
• Transmission line issues: loss and signal integrity, limited frequency
• Width is limited by pins and board area
• Broadcast, simple to implement
Point to point busses: fast signaling over longer distance
• Board level, between boards, and racks
• High frequency, narrow links
• 1D Ring, 2D Mesh and Torus to reduce latency
• Higher complexity and latency in each node
Hence, emergence of packet switched network
But, pt-to-pt packet switched network on a chip?
Interconnect Delay & Energy
[Figure: delay (ps, log scale) and energy (pJ/bit) versus interconnect length (0-25mm) at 0.3u pitch and 0.5V, plotting raw wire delay, repeated-wire delay, and router delay alongside wire energy and router energy.]
Bus—The Other Extreme…
Issues:
Slow, < 300MHz
Shared, limited scalability?
Solutions:
Repeaters to increase freq
Wide busses for bandwidth
Multiple busses for scalability
Benefits:
Power?
Simpler cache coherency
Move away from frequency, embrace parallelism
Repeated Bus (Circuit Switched)
Arbitration:
Each cycle for the next cycle
Decision visible to all nodes
Repeaters:
Align repeater direction
No driving contention
Assume: 10mm die, 1.5u bus pitch, 50ps repeater delay
[Diagram: a row of cores connected by a repeated bus.]

Node   Core (mm)   Bus Seg Delay (ps)   Max Bus Freq (GHz)
65nm   5           195                  2.2
45nm   3.5         99                   2
32nm   2.5         51                   1.8
22nm   1.8         26                   1.5
16nm   1.3         13                   1.2
A Circuit Switched Network
[Diagram: an 8×8 circuit-switched NoC with 2mm links between routers; a packet-switched request travels from source to destination, a circuit-switched acknowledge returns, and the data is then streamed over the established circuit.]
● Circuit-switched NoC eliminates intra-route data storage
● Packet-switching used only for channel requests
⇒ High bandwidth and energy efficiency (1.6 to 0.6 pJ/bit)
Anders et al., "A 2.9Tb/s 8W 64-Core Circuit-Switched Network-on-Chip in 45nm CMOS", ESSCIRC 2008
Anders et al., "A 4.1Tb/s Bisection-Bandwidth 560Gb/s/W Streaming Circuit-Switched 8×8 Mesh Network-on-Chip in 45nm CMOS", ISSCC 2008
Hierarchical & Heterogeneous
[Diagram: clusters of cores connected by local busses, with routers linking the clusters over a second-level bus.]
Use a bus to connect over short distances, then a hierarchy of busses, or hierarchical circuit- and packet-switched networks.
Tapered Interconnect Fabric
[Figure: bandwidth and Bytes/Flop (arbitrary units) at each level of the hierarchy, from cores to unit to chip to system; bandwidth tapers as you move outward.]
Tapered, but over-provisioned bandwidth: pay (energy) as you go, as you can afford.
But wait, what about Optical?
[Figure: energy breakdown of a 65nm optical link: pre-driver, driver, VCSEL, TIA, LA, DeMUX, CDR, and clock buffer, separating what is possible today from what is merely hopeful.]
Source: Hirotaka Tamura (Fujitsu), ISSCC 08 Workshop on HS Transceivers
Impact of Exploding Parallelism
[Figure: millions of cores required per EFLOP versus technology node (65nm to 5nm) at 1x, 0.7x, and 0.5x nominal Vdd. Operating at 0.5x Vdd demands roughly a 4X increase in the number of cores (parallelism); the curves are almost flat at low voltage because Vdd is close to Vt; and the added cores bring increased communication energy, more hardware, and more unreliability.]
1. Strike a balance between communication & computation
2. Resiliency (gradual, intermittent, and permanent faults)
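A hedged sketch of the parallelism explosion; the per-core throughput at nominal Vdd, the threshold voltage, and the alpha exponent are all assumed values, not the slide's data.

```python
# Rough sketch (illustrative model) of why NTV explodes parallelism: per-core
# throughput falls roughly with the alpha-power frequency law, so hitting a
# fixed 1 EFLOP/s target takes several times more cores at low Vdd.

TARGET_FLOPS = 1e18
PER_CORE_FLOPS_AT_NOMINAL = 1e10  # assumed 10 GFLOP/s per core at nominal Vdd
VT = 0.3                          # assumed threshold voltage (nominal Vdd = 1.0)
ALPHA = 1.6                       # assumed alpha-power-law exponent

def rel_freq(vdd_scale):
    """Core frequency relative to nominal Vdd."""
    return ((vdd_scale - VT) ** ALPHA / vdd_scale) / ((1.0 - VT) ** ALPHA / 1.0)

for scale in [1.0, 0.7, 0.5]:
    per_core = PER_CORE_FLOPS_AT_NOMINAL * rel_freq(scale)
    cores = TARGET_FLOPS / per_core
    print(f"Vdd x{scale}: ~{cores / 1e6:.0f} million cores per EFLOP")
```

Lower Vdd means slower cores, so the same EFLOP target takes several times more of them, with all the communication, hardware, and reliability cost that implies.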
Road to Unreliability?
From Peta to Exa: reliability issues
1,000X parallelism:
• More hardware for something to go wrong
• >1,000X intermittent faults due to soft errors
Aggressive Vcc scaling to reduce power/energy:
• Gradual faults due to increased variations
• More susceptible to Vcc droops (noise)
• More susceptible to dynamic temperature variations
• Exacerbates intermittent faults—soft errors
Deeply scaled technologies:
• Aging-related faults
• Lack of burn-in?
• Variability increases dramatically
Resiliency will be the cornerstone.
Soft Errors and Reliability
[Figure: neutron soft-error rate per cell (sea level) versus voltage for 250nm through 65nm, and latch/memory soft-error rates relative to 130nm across 180nm to 32nm, assuming a 2X increase in bit/latch count per generation.]
Soft error rate per bit reduces each generation, but soft errors at the system level will continue to increase.
Nominal impact of NTV on soft error rate.
Positive impact of NTV on reliability: lower voltage → lower E fields, lower power → lower temperature; device aging effects will be less of a concern; lower electromigration-related defects.
N. Seifert et al., "Radiation-Induced Soft Error Rates of Advanced CMOS Bulk Devices", 44th Reliability Physics Symposium, 2006
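The system-level trend is easy to see with a two-line model; the per-bit improvement rate below is made up for illustration, while the 2X bit/latch growth per generation is the slide's stated assumption.

```python
# Simple sketch (illustrative per-bit rate) of the system-level trend: even if
# the soft-error rate per bit drops ~30% each generation, doubling the number
# of bits and latches per generation makes the chip-level rate climb.

PER_BIT_SER_DECLINE = 0.7    # assumed per-bit SER multiplier per generation
BIT_COUNT_GROWTH = 2.0       # slide's assumption: 2X bits/latches per generation

nodes = ["130nm", "90nm", "65nm", "45nm", "32nm"]
per_bit, bits = 1.0, 1.0     # normalized to 130nm
for node in nodes:
    print(f"{node}: per-bit SER {per_bit:.2f}, system SER {per_bit * bits:.2f} (relative)")
    per_bit *= PER_BIT_SER_DECLINE
    bits *= BIT_COUNT_GROWTH
```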
Resiliency
Faults and examples:
• Permanent faults: stuck-at 0 & 1
• Gradual faults: variability, temperature, degradation
• Intermittent faults: soft errors, voltage droops
• Aging faults
Faults cause errors (data & control):
• Datapath errors: detected by parity/ECC
• Silent data corruption: needs HW hooks
• Control errors: control lost (blue screen)
Resiliency spans the stack—applications, system software, programming system, microcode and platform, microarchitecture, circuit & design—providing error detection, fault isolation, fault confinement, reconfiguration, and recovery & adaptation, with minimal overhead.
Compute vs Data Movement
[Figure: energy (pJ, log scale) versus distance (0.001 to 1000 meters) for computation and data movement: reading a bit from internal SRAM or reading/writing register-file operands costs a few pJ, reading a bit from DRAM and executing an instruction cost tens to hundreds of pJ, and transferring a bit over a link, Bluetooth, Ethernet, WiFi, or 3G costs progressively more, spanning roughly seven orders of magnitude overall.]
Data movement energy will dominate.
System Level Optimization
[Figure: relative compute energy and global interconnect energy versus technology node (45nm to 7nm) and supply voltage; compute energy drops about 6X while global interconnect energy drops only about 1.6X.]
Compute energy reduces faster than global interconnect energy.
For constant throughput, NTV demands more parallelism, which increases data movement at the system level.
System level optimization is required to determine the NTV operating point.
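A minimal sketch of the optimization the slide is calling for, with all constants assumed: compute energy per operation falls with Vdd², but the extra parallelism needed to hold throughput constant adds interconnect energy, so total energy per operation bottoms out above the most aggressive NTV point.

```python
# Illustrative sketch (all constants assumed) of the system-level trade-off:
# lowering Vdd cuts compute energy per op, but the extra parallelism needed for
# constant throughput adds data movement, so total energy per op has a minimum.

VT = 0.3

def rel_freq(v, alpha=1.5):
    return (v - VT) ** alpha / v

def energy_per_op(v):
    compute = v ** 2                                     # dynamic compute energy ~ CV^2
    parallelism = rel_freq(1.0) / rel_freq(v)            # extra cores for same throughput
    interconnect = 0.25 * (1 + 0.3 * (parallelism - 1))  # assumed data-movement growth
    return compute + interconnect

best = min((energy_per_op(v), v) for v in [x / 100 for x in range(40, 101, 5)])
for v in [1.0, 0.8, 0.6, 0.5, 0.45, 0.4]:
    print(f"Vdd={v:.2f}: relative energy/op = {energy_per_op(v):.2f}")
print(f"minimum near Vdd ~ {best[1]:.2f}")
```

Where that minimum lands depends entirely on how interconnect energy scales with parallelism, which is why the NTV operating point has to be chosen at the system level.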
Architecture needs a Paradigm Shift
Architect’s past and present priorities—
Single thread performance
Frequency
Programming productivity
Legacy, compatibility
Architecture features for productivity
Constraints
(1) Cost
(2) Reasonable Power/Energy
Architect’s future priorities should be—
Throughput performance
Parallelism, application specific HW
Power/Energy
Architecture features for energy
Simplicity
Constraints
(1) Programming productivity
(2) Cost
Must revisit and evaluate each (even legacy)
architecture feature
A non-architect’s biased view…
Soft-errors
Power, energy
Interconnects
Bandwidth
Variations
Resiliency
Appealing to the Architects
Exploit transistor integration capacity
Dark Silicon is a self-fulfilling prophecy
Compute is cheap, reduce data movement
Use futuristic workloads for evaluation
Fearlessly break the legacy shackles
Summary
Power & energy challenge continues
Opportunistically employ NTV operation
3D integration for DRAM
Hierarchical, heterogeneous, tapered interconnect
Resiliency spanning the entire stack
Does computer architecture have a future?
• Yes, if you acknowledge the issues & challenges and embrace the paradigm shift
• No, if you keep your "head buried in the sand"!