ERD/ERA Status Report - International Technology Roadmap
Ralph K. Cavin, III
March 18, 2009
Brussels
Is there a Carnot-like theorem for computation?
◦ e.g., a limit on the rate of information throughput per unit power consumed?
The MIND architecture benchmarking activity for novel devices
◦ Memory Architectures
◦ Inference Architectures
Chose a simple one-bit, four-instruction processor:
◦ All transistors operate at ~kT switching energy
◦ Interconnects dissipate energy at ~kT per gate length
◦ Transistor average fan-out is three
[Figure: block diagram of the minimal one-bit CPU: 2-bit program counter, decoder (DEC), instruction registers I1/I2, switches S1–S6, registers X, Y, Z, carry bits C0/C1, ALU, and memory; red numbers give the transistor count of each block. Total: 314 devices.]
Von Neumann threshold; Joyner tiling with n = 314:
Area_min = n × 8a² = 314 × 8a² ≈ 2500a² = 50a × 50a
With a_min = 1.5 nm: Area_min ≈ 75 nm × 75 nm
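The Joyner-tiling estimate above can be reproduced with a short script; a sketch that only re-derives the slide's numbers (8a² per device, a_min = 1.5 nm):

```python
import math

# Joyner tiling: each device occupies ~8a^2, where a is the minimum feature size.
n = 314          # devices in the minimal one-bit CPU
a_min = 1.5e-9   # minimum feature size in meters (slide value)

area_units = 8 * n                  # total area in units of a^2: 2512 ~ 2500
side_units = math.sqrt(area_units)  # ~50a on a side

side_min = side_units * a_min       # ~75 nm
print(f"Area ~ {area_units} a^2 = {side_units:.0f}a x {side_units:.0f}a")
print(f"Minimum footprint ~ {side_min * 1e9:.0f} nm x {side_min * 1e9:.0f} nm")
```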
Operational energy of the Minimal Turing Machine:
E_op = n k_B T ln 2 ≈ 980 k_B T/cycle ≈ 4 × 10⁻¹⁸ J/cycle
Per full CPU operation (~3 cycles): E_op ≈ 3 × 4 × 10⁻¹⁸ J ≈ 10⁻¹⁷ J/operation
Devices: 314
Area: 75 nm × 75 nm
Device density: 5.6 × 10¹² cm⁻²
Energy per cycle: 4 × 10⁻¹⁸ J/cycle
Time per cycle: ~2 ps
Power: ~2 µW
Power density: ~30 kW/cm²
Binary throughput: BITS = density × freq. ≈ 10¹⁴ bit/s
MIPS: 2 × 10⁵
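The summary numbers above hang together, as a quick numeric check shows (room temperature, T = 300 K, is assumed):

```python
# Back-of-the-envelope check of the slide's numbers (assumes T = 300 K).
kB = 1.380649e-23   # Boltzmann constant, J/K
T = 300.0           # temperature, K

E_cycle = 980 * kB * T   # ~980 kB*T per cycle (slide) -> ~4e-18 J/cycle
t_cycle = 2e-12          # ~2 ps per cycle (slide)
power = E_cycle / t_cycle            # -> ~2e-6 W, i.e. ~2 uW

side_cm = 75e-7                      # 75 nm expressed in cm
area_cm2 = side_cm ** 2              # CPU footprint, cm^2
power_density = power / area_cm2     # -> ~3.6e4 W/cm^2, i.e. ~30 kW/cm^2
device_density = 314 / area_cm2      # -> ~5.6e12 devices/cm^2
```

Note that the energy per cycle divided by the cycle time gives microwatts, not milliwatts, which is consistent with the quoted ~30 kW/cm² over a 75 nm × 75 nm footprint.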
Sources: The Intel Microprocessor Quick Reference Guide and TSCP Benchmark Scores
[Figure: instructions per second (MIPS) versus maximum binary throughput (BIT, bit/s) for microprocessors. The human brain sits at ~10¹⁹ bit/s and ~10⁸ MIPS at ~30 W, far off the conventional trajectory, which runs toward ~10⁶ W/cm².]
The Minimal Turing Machine lies on a different performance trajectory from conventional computers
◦ Its slope extrapolates to brain-level performance
More detailed physics-based analysis is needed
◦ System thermodynamics of computation
◦ Carnot's equivalent for a computational engine?
◦ Lessons from biological computation?
Candidates for beyond-CMOS nano-electronics should be evaluated in the context of system scaling
◦ e.g., a spintronic minimal Turing Machine?
NRI Focus Centers
Kerry Bernstein, IBM
February 2009 Update
1. Short Term – Switches that supplement CMOS and are CMOS-compatible, supporting performance via hardware acceleration
2. Long Term – Switches that replace CMOS for general-purpose high-performance compute applications
1) CMOS is not going away anytime soon.
Charge (the state variable) and the MOSFET (the fundamental switch) will remain the preferred HPC solution until new switches appear as the long-term replacement in 10–20 years.
2) Hardware accelerators execute selected functions faster than software performing them on the CPU.
Accelerators are responsible for substantial improvements in throughput.
3) Alternative switches often exhibit emergent, idiosyncratic behavior. We should exploit it.
Certain physical behaviors may emulate selected HPC instruction sequences. Some operations may be superior to digital solutions.
4) New switches may improve high-utility accelerators.
The shorter-term supplemental solution (5–15 years) improves or replaces accelerators "built in CMOS and designed for CMOS", either on-chip, on-planar, or on-3D-stack.
Hierarchical Benchmarking (each metric annotated with "the good" and "the bad"):

Device level:
◦ CV/I – charge-based
◦ Ft – specific to clocked logic
◦ Ion/Ioff – current-based
Circuit level:
◦ NAND2 delay, power, energy, area – incomplete
◦ E-D-A product – optimization
◦ P-D-A product – optimization
◦ Logical effort – constrained
Architecture level:
◦ MIPS, IPC – the good: equivalent throughput; the bad: synchronous
◦ Ops/Joule – the good: energy efficiency; the bad: discrete ops
◦ SpecInt – the good: industry standards; the bad: new capability
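The circuit-level E-D-A and P-D-A products combine energy (or power), delay, and area into one number so dissimilar switches can be ranked. A minimal sketch, with entirely hypothetical device numbers used only to show how the ranking works (lower is better for both):

```python
# Illustrative-only NAND2 figures for two hypothetical switches.
candidates = {
    "baseline-CMOS": {"energy": 5e-16, "delay": 10e-12, "area": 0.5e-12},
    "new-switch":    {"energy": 1e-17, "delay": 100e-12, "area": 0.1e-12},
}

def eda(d):
    # Energy * Delay * Area: penalizes slow gates even if they are frugal.
    return d["energy"] * d["delay"] * d["area"]

def pda(d):
    # Power * Delay * Area, with power = energy / delay; note this
    # collapses to energy * area, so delay drops out of the ranking.
    return (d["energy"] / d["delay"]) * d["delay"] * d["area"]

for name, d in candidates.items():
    print(f"{name}: EDA = {eda(d):.2e}, PDA = {pda(d):.2e}")
```

The two products can rank the same pair of devices differently, which is why both appear in the table.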
1) Derive values for the conventional quantitative ITRS benchmarks shown in Benchmark 1.
2) Derive quantitative values and qualitative entries for the architecture benchmarks shown in Benchmarks 2a and 2b.
3) Identify specific logic operations performed elegantly by your switch: where physical device behavior complements the desired logic operation. Determine the equivalent IPC and power of that function performed in the new switch, as shown in the Benchmark 3 example. Determine the actual IPC and Operations/Watt had the function been performed via software on the CPU.
Benchmark 1: Device Metrics
◦ Defined by the ITRS ERD Working Group
◦ Captures fundamental device properties
Benchmark 2: Architectural Metrics

2a. Quantitative Communication Metrics:
◦ AREA of die/host accessible within 1 switch delay
◦ NO. OF SWITCHES accessible within 1 switch delay
◦ Square bandwidth per unit area: (Channels × Freq)_X × (Channels × Freq)_Y
◦ Square communication channels (N_X × N_Y) per unit area
◦ (Accessible area within one switch delay) / (Area of 1 switch)
◦ Memory delay / Logic delay

2b. Qualitative CMOS Compatibility Values:
◦ Clocking infrastructure and locality
◦ Memory requirements and compatibility
◦ Scalability
◦ Reconfigurability or library dimensions
◦ Logic execution / architecture
◦ Specific logic function performed well
◦ Useful specific physical behaviors

Logic Metrics (32-bit adder; inverter with FO4; NAND2 FO1):
◦ Delay, power, energy, area, number of switches
◦ Generic noise immunity (dB)
◦ Generic logical effort
◦ Computational density (MIPS / number of devices)
◦ PETE1 (EDDA)
◦ PETE2 (PDA)
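The locality metrics in Benchmark 2a can be sketched numerically. Assuming signals travel at velocity v over the interconnect, the die area reachable within one switch delay is roughly a disk of radius v·t_switch; dividing by the area of one switch gives the accessible-switch count. Every input below is a hypothetical placeholder, to be replaced with measured values for a real candidate switch:

```python
import math

# Sketch of Benchmark 2a's locality metrics (all inputs hypothetical).
t_switch = 1e-12            # one switch delay, s
v_signal = 1e8              # interconnect signal velocity, m/s (~c/3)
switch_area = (50e-9) ** 2  # area of one switch, m^2

reach = v_signal * t_switch                   # distance covered in one delay
accessible_area = math.pi * reach ** 2        # AREA accessible within 1 switch delay
n_accessible = accessible_area / switch_area  # NO. OF SWITCHES accessible
print(f"{n_accessible:.1e} switches reachable within one switch delay")
```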
Azad Naeemi, Georgia Tech
Delay versus Length for Various Transport Mechanisms
New state variables will impact communication and fan-out.

Transport mechanism → information tokens:
◦ Diffusion: direct excitons, indirect excitons, spin, pseudospin
◦ Drift: indirect excitons, spin, pseudospin
◦ Ballistic transport (Fermi velocity): spin, pseudospin
◦ Spin wave: spin
◦ EM waves: photons, plasmons
Compare apples-to-apples, independent of a particular strength: map accelerator functions (H.264, compression, crypto, ……) onto candidate switches (BTBTFET, MQCA, quantum, …) and report the equivalent IPC, MIPS/Watt, and Ops/Joule of the switch in the application.
Matching Logic Functions & New Switch Behaviors

New Switch Ideas:
◦ Single spin
◦ Spin domain
◦ Tunnel-FETs
◦ NEMS
◦ MQCA
◦ Molecular
◦ Bio-inspired
◦ CMOL
◦ Excitonics

Popular Accelerators:
◦ Encrypt / decrypt
◦ Compress / decompress
◦ Regular-expression scan
◦ Discrete cosine transform
◦ Bit-serial operations
◦ H.264 standard filtering
◦ DSP, A/D, D/A
◦ Viterbi algorithms
◦ Image, graphics
◦ ?
Example: Cryptography Hardware Acceleration
Operations required: rotate, byte alignment, XORs, multiply, table lookup
Circuits used in accelerators: transmission gates ("T-gates")
New switch opportunity: a number of new switches (e.g., T-FETs) have no thermionic barriers, so they won't suffer from the CMOS pass-gate VT drop, body effect, or source-follower delay.
Potential opportunity: replace 4 T-gate MOSFETs with 1 low-power switch.
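The operations listed above are single-cycle in an accelerator but take several instructions when emulated in software on the CPU. A purely illustrative sketch of those primitives on 32-bit words; the S-box and the byte_align helper are hypothetical, not from any real cipher or from the slide:

```python
# Illustrative software versions of the accelerator primitives named above
# (rotate, byte alignment, XOR, table lookup). SBOX is a placeholder table.
MASK32 = 0xFFFFFFFF

def rotl32(x, r):
    """32-bit left rotate: one cycle in hardware, several ops in software."""
    r %= 32
    return ((x << r) | (x >> (32 - r))) & MASK32

def byte_align(hi, lo, n):
    """Extract the 32-bit word starting n bytes (0..4) into the pair hi:lo."""
    combined = (hi << 32) | lo
    return (combined >> (8 * (4 - n))) & MASK32

SBOX = [(i * 167 + 13) & 0xFF for i in range(256)]  # placeholder, not a real S-box

def round_step(x, key):
    """Rotate, XOR with a round key, then a bytewise table lookup."""
    x = rotl32(x, 7) ^ key
    return int.from_bytes(bytes(SBOX[b] for b in x.to_bytes(4, "big")), "big")
```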
Bernstein, 1/25/09
• Example of an HPC hardware accelerator's contribution to power, area, instruction retirement rate, and energy-efficiency improvement.
• The Purdue Emerging Technology Evaluator (PETE) metric is a convolution of power/energy, delay, and area.
• IPC and Ops/nJ provide an apples-to-apples comparison of new switches.
Paul Franzon
Department of Electrical and Computer Engineering
[email protected]
919.515.7351
Goal:
◦ Determine research needs for a ~2015 1000-Petaflop computer, and smaller equivalents
Major Conclusions:
◦ Major challenge #1: Power efficiency
Communication
Overhead in computation
◦ Major challenge #2: Resiliency
Completing computation in the presence of permanent and transient faults
◦ Major challenge #3: Performance scaling
Performance scaling limited by software, communications bisection bandwidth, and memory speed
Critical Needs:
◦ Reduced-power SRAM replacements
45 nm L1 cache: 3.6 pJ/bit
Note: re-architecting in 3D can save ~50%
What is the potential for an ERD to reduce this to 0.3 pJ/bit?
Note: would require low swing on the bit lines, while retaining speed and a low SET rate
◦ Reduced-power switched interconnect
Esp. packet-routed interconnect (NoC)
What is the potential for a memory-style ERD to be used for fast switchable interconnect?
Flash devices can do this for static reconfiguration, BUT faster-switching devices will be needed for dynamic reconfiguration
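The power stakes behind the 3.6 pJ/bit → 0.3 pJ/bit question can be made concrete. Cache access power scales linearly with energy per bit; the aggregate L1 bandwidth below is a hypothetical placeholder, not a number from the slide:

```python
# What the 3.6 pJ/bit -> 0.3 pJ/bit question means in watts.
bandwidth = 1e12          # aggregate L1 traffic, bits/s (assumed placeholder)

e_cmos_45nm = 3.6e-12     # J/bit, 45 nm L1 cache (slide)
e_erd_target = 0.3e-12    # J/bit, ERD target (slide)

p_cmos = e_cmos_45nm * bandwidth    # cache power at this bandwidth
p_erd = e_erd_target * bandwidth
reduction = p_cmos / p_erd          # 12x lower cache power
print(f"{p_cmos:.1f} W -> {p_erd:.1f} W ({reduction:.0f}x reduction)")
```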
Blue Gene system reliability:
◦ Most of the DRAM failures are due to DIMM socket
failures, not device failures
◦ Critical need: Sub-system level checkpointing and
roll-back
ERD requirement:
◦ Tightly embedded Flash-like state “capture”
memory for checkpointing
◦ Requirements:
Tightly embedded, e.g. Shadow registers, with
minimum process change
Slow read/write OK
~10 M writes minimum extrinsic reliability requirement
1. Metrics for cache replacement:
◦ Read & write speed for a 64 kbit array
◦ Energy/bit for a 64 kbit array
◦ Area for a 64 kbit array
◦ SEU rate
◦ Added process complexity
2. Metrics for programmed routability:
◦ Stage delay for a 2×2 switchbox
◦ Energy/bit for routing through a 2×2 switchbox
◦ Area for a 2×2 switchbox
◦ Configuration change speed for a 2×2 switchbox
◦ Added process and design complexity
3. Metrics for local checkpointing memory:
◦ Read/write delay per bit
◦ Energy/bit for write
◦ Area per bit
◦ Write lifetime
◦ Added process and design complexity
In future computing, both General Purpose
and Application Specific, the bottleneck is not
in logic operations but in memory,
communications, and reliability
Opportunities arise for memory-style devices to solve these bottlenecks:
◦ Low power SRAM replacement
◦ Ultra-low swing, routable interconnect replacement
◦ Local non-volatile memory as an aid to resiliency
The Memory Wall for multi-core
In general-purpose multi-core processors, the tradeoff for L1–L3 between memory bandwidth and memory size is dramatic:
◦ At constant BW, two cores may require as much as 8× the memory of one core
◦ At 2× BW, two cores require only about 2× the memory of a single-core system
◦ Kerry Bernstein, "New Dimensions in Performance," Feb. 2009
Workshop: "Technology Maturity for Adaptive, Massively Parallel Computing", March 2009, Portland, Oregon
http://www.technologydashboard.com/adaptivecomputing/
◦ General theme: Inference Architectures and Technology
Karlheinz Meier, U. Heidelberg, "VLSI Implementation of Very Large Scale Neuromorphic Circuits – Achievements, Challenges, Hopes"
Progress in architectures is being made but many
technology challenges remain. (Complexity)
Can Emerging Research Devices accelerate
realization of Inference Architectures?
Continue work on ERD Architectural
Benchmarking
◦ Work with NRI MIND benchmarking effort
Develop section on memory architectures for
Emerging Research Memories
Look at the role of ERD/ERM in novel architectures where unique properties can provide substantial leverage, e.g., inference architectures