ERD Architecture Benchmarking: The NRI MIND Activity

Ralph K. Cavin, III, Kerry Bernstein & Jeff Welser
July 12, 2009
San Francisco, CA
Goals of the NRI/MIND Benchmarking Project
• Develop circuit/subsystem level examples of the applications of novel devices
• Evaluate the circuits/subsystems in the energy-time-space context versus CMOS implementations
• Determine most promising applications for emerging devices with an emphasis on integration with CMOS
Architectural Innovations haven’t been the major driver for system performance
Architecture Performance*
[Figure: SPEC 2000 Int (Base). Y-axis: SPECInt / FPG, 0 to 20. X-axis: Year, 2000 to 2009. Series: Alpha, AMD, HP, IBM, INTEL, SGI, SUN, AVG.]
Analysis of high-performance architectures and the technologies they were built in, examining device vs. architecture contributions to throughput
* Highest reported SPEC2000 INT per (adj)FPG Generation
FPG, SPECmark approximated when necessary; Broken line = discontinued series
K. Bernstein 11/06
Architecture Performance*
[Figure: SPEC 2000 fp (Base). Y-axis: SPECfp / FPG, 0 to 30. X-axis: Year, 2000 to 2009. Series: Alpha, AMD, HP, IBM, INTEL, SGI, SUN, AVG.]
* Highest reported SPEC2000 INT per (adj)FPG Generation
FPG, SPECmark approximated when necessary; Broken line = discontinued series
K. Bernstein 11/06
- Predominant influence on SPEC2000 is from device technology
- Modest contributions from architecture
Four Architectural Projections
1) CMOS is not going away anytime soon.
Charge (the state variable) and the MOSFET (the fundamental switch) will remain the preferred HPC solution until new switches appear as the long-term replacement solution in 10-20 years.
2) Hardware accelerators execute selected functions faster than software performing them on the CPU.
Accelerators are responsible for substantial improvements in throughput.
3) Alternative switches often exhibit emergent, idiosyncratic behavior. We should exploit them.
Certain physical behaviors may emulate selected HPC instruction sequences. Some operations may be superior to digital solutions.
4) New switches may improve high-utilization accelerators.
The shorter-term supplemental solution (5-15 years) improves or replaces accelerators “built in CMOS and designed for CMOS”, either on-chip, on-3D-stack, or on-planar.
Matching Logic Functions & New Switch Behaviors
New Switch Ideas:
• Single Spin
• Spin Domain
• Tunnel-FETs
• NEMS
• MQCA
• Molecular
• Bio-inspired
• CMOL
• Excitonics
Popular Accelerators:
• Encrypt / Decrypt
• Compression / Decompression
• Regular Expression Scan
• Discrete Cosine Transform
• Bit Serial Operations
• H.264 Std Filtering
• DSP, A/D, D/A
• Viterbi Algorithms
• Image, Graphics
Mapping between the two lists: ?
Example: Cryptography Hardware Acceleration
Operations required: Rotate, Byte Alignment, EXORs, Multiply, Table Lookup
Circuits used in accelerator: Transmission Gates (“T-Gates”)
New Switch Opportunity (example): A number of new switches (e.g. T-FETs) don’t have thermionic barriers, so they won’t suffer from CMOS pass-gate VT drop, body effect, or source-follower delay.
Potential Opportunity: Replace 4 T-Gate MOSFETs with 1 low-power switch.
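To make the listed operations concrete, here is a minimal software sketch of rotate, byte extraction, EXOR, multiply, and table lookup on 32-bit words. Everything in it is hypothetical: the names rotl32, byte_at, TOY_SBOX, and toy_round are invented for illustration, and the "round" is not a real cipher and is not taken from the slides. It only shows the kind of data movement that a T-gate based accelerator implements in hardware.

```python
MASK32 = 0xFFFFFFFF  # keep every result within one 32-bit word


def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32


def byte_at(word, i):
    """Extract byte i (0 = least significant) from a 32-bit word."""
    return (word >> (8 * i)) & 0xFF


# Hypothetical substitution table built with a multiply; it is a
# permutation of 0..255 because 7 and 256 share no common factor.
TOY_SBOX = [(7 * b + 3) % 256 for b in range(256)]


def toy_round(word, key):
    """One illustrative mixing step: EXOR, rotate, then per-byte table lookup."""
    mixed = rotl32(word ^ key, 8)
    out = 0
    for i in range(4):
        out |= TOY_SBOX[byte_at(mixed, i)] << (8 * i)
    return out
```

In hardware, the rotate and byte-select steps are pure routing, which is why they are commonly built from pass-gate style multiplexers.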
Examples of Benchmarking Work in Progress
• Magnetic Tunnel Junction one-bit adder
• Magnetic Logic for one-bit adder
• Magnetic Ring Logic Devices
• Many other devices are being evaluated in a variety of circuit configurations.
Background - MTJ
• Researchers have been investigating post-CMOS devices for many years. In the short term, people are looking for switches that supplement CMOS, are CMOS-compatible, and support ultra-low power operation.
• MTJ (Magnetic Tunnel Junction) is one of the strongest candidates, and it is available in practice rather than only in theory.
  – Excellent for memory and storage.
    • STT-RAM using MTJ is a strong candidate for universal memory.
  – For logic design, good or not?
    • Any memory device can also be used to build logic circuits, in theory at least, and MTJ is no exception.
    • The discovery of spin torque transfer (STT) makes MTJ scalable and completely CMOS-compatible.
MTJ-based DyCML 1 Bit Full Adder
• The MTJ is used as both a memory cell and a functional input.
• Switching of the MTJ is conducted by STT using control signals WL, BL.
• It is actually a CMOS-MTJ-combined version of DyCML. Thus, it is more reasonable to compare it with CMOS-based DyCML to see the MTJ’s impact.
Results
• ED Curve of 65nm process
[Figure: ED curves for SCMOS, DyCML-MTJ, and DyCML-CMOS]
Nanomagnet Logic (NML)
PIs: Gary Bernstein¹, X. Sharon Hu², Michael Niemier², Wolfgang Porod¹
Student Researchers: M. Tanvir Alam¹, Michael Crocker², Aaron Dingler², Steve Kurtz², Shawn Liu², M. Jafar Siddiq¹, Edit Varga¹
Affiliations: ¹Department of Electrical Engineering, ²Department of Computer Science and Engineering
Comparison to CMOS
• Hard to compare magnet to transistor
  – Need to make technology comparison at functional unit level; consider initial projections here
• Natural comparison = low power CMOS systems, sub-threshold, etc.
Base performance projections on adder design.
[Figure: adder layout with inputs A, B, C; elements M1, M2, M3; outputs Sum and Cout]
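The slide does not say what M1, M2, and M3 are. Assuming they denote three-input majority gates (a common primitive in nanomagnet logic) and that inverted signals are available, a one-bit full adder can be sketched as follows. The wiring is one well-known three-majority-gate construction, not necessarily the layout in the figure, and the function names maj, inv, and full_adder are illustrative.

```python
from itertools import product


def maj(x, y, z):
    """Three-input majority vote on bits (each 0 or 1)."""
    return 1 if x + y + z >= 2 else 0


def inv(x):
    """Logical inversion of a single bit."""
    return 1 - x


def full_adder(a, b, c):
    """Return (sum, carry_out) built from three majority gates."""
    m1 = maj(a, b, c)          # carry out: at least two inputs are 1
    m2 = maj(a, b, inv(c))     # intermediate term
    m3 = maj(inv(m1), m2, c)   # sum bit
    return m3, m1


# Exhaustive check against ordinary binary addition.
for a, b, c in product((0, 1), repeat=3):
    s, cout = full_adder(a, b, c)
    assert a + b + c == s + 2 * cout
```

Comparing at the adder level, rather than magnet against transistor, is what lets a three-gate structure like this stand against a CMOS adder of many transistors.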
Trends
V & mr

Design    Energy (pJ)    Delay (ns)
CMOS      0.020          261
NMLNP     0.029          198
NMLP      0.029          18

CMOS      0.19           20
NMLNP     0.0012         198
NMLP      0.0012         18

[Figure axis label: EDP (pJ ns)]

Because of sensitivity to sub-threshold slope, threshold voltage … energy, delay can vary significantly from technology to technology. These are best data points for CMOS (0.3V - 1V).
With mr = 1, can still see ~15X performance gain due to higher throughput
If higher supply voltage to match delay, ~7X energy savings
With mr = 5, ~17X (NP) and ~158X (P) energy savings with better performance
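The quoted gains can be reproduced from the energy and delay figures listed above. In the sketch below, assigning the 0.029 pJ entries to mr = 1 and the 0.0012 pJ entries to mr = 5 is an inference from the quoted ratios, as is reading the two CMOS entries as a low-supply and a raised-supply operating point. The variable and function names are illustrative.

```python
# (energy in pJ, delay in ns), copied from the listed figures
cmos_low_supply = (0.020, 261)   # assumed: low-voltage CMOS point
cmos_high_supply = (0.19, 20)    # assumed: supply raised to match NML delay
nml_mr1_np = (0.029, 198)        # assumed: mr = 1, NP variant
nml_mr1_p = (0.029, 18)          # assumed: mr = 1, P variant
nml_mr5_np = (0.0012, 198)       # assumed: mr = 5, NP variant
nml_mr5_p = (0.0012, 18)         # assumed: mr = 5, P variant


def edp(point):
    """Energy-delay product in pJ*ns for an (energy, delay) pair."""
    energy_pj, delay_ns = point
    return energy_pj * delay_ns


# mr = 1: delay advantage of NMLP over low-supply CMOS (quoted as ~15X)
speedup_mr1 = cmos_low_supply[1] / nml_mr1_p[1]

# mr = 1: energy saving once CMOS supply is raised to match delay (~7X)
saving_mr1 = cmos_high_supply[0] / nml_mr1_p[0]

# mr = 5: energy savings for the NP and P variants (~17X and ~158X)
saving_mr5_np = cmos_low_supply[0] / nml_mr5_np[0]
saving_mr5_p = cmos_high_supply[0] / nml_mr5_p[0]

print(f"mr=1 speedup:        {speedup_mr1:.1f}x")
print(f"mr=1 energy saving:  {saving_mr1:.1f}x")
print(f"mr=5 NP saving:      {saving_mr5_np:.1f}x")
print(f"mr=5 P saving:       {saving_mr5_p:.1f}x")
print(f"EDP, low-supply CMOS: {edp(cmos_low_supply):.2f} pJ*ns")
print(f"EDP, NMLP at mr=5:    {edp(nml_mr5_p):.4f} pJ*ns")
```

Each NML point is compared against the CMOS point closest to it in delay, which is why the NP and P savings at mr = 5 use different CMOS entries.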
Magnetic Ring Logic Devices – Benchmarks/Metrics
• Caroline Ross - MIT
• These devices work by the movement of domain walls around thin film rings with general structure Hard layer/Spacer/Soft layer, e.g. Co/Cu/NiFe or Co/MgO/NiFe.
• Rings can have several remanent states with different resistances. This is useful for multibit memory. However, digital logic uses two levels, so in these examples some of the complexity available in ring devices is wasted.
• NAND/NOR configurations are being analyzed.
Prototype Magnetic Ring Device Performance
Device area
  Existing prototype: 1 µm²
  Projection: Improve x 100?
Switching speed
  Existing prototype: 5 ns
  Projection: Proportional to 1/device length (improve x 10?) and domain wall velocity (improve x 10?)
Switching energy
  Existing prototype: 5 × 10⁻¹⁴ J (10⁷ kT)
  Projection: Proportional to switching speed (improve x 100??) and to device x-section area (improve x 10-20?) and to critical current for wall motion (improve x 10-100?)
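As a rough check on these figures, the sketch below converts the prototype switching energy to units of kT and combines the question-marked improvement factors. It assumes room temperature (300 K), assumes the factors multiply independently, and assumes every projected improvement is actually achieved. None of these assumptions is stated on the slide, and all names are illustrative.

```python
BOLTZMANN_J_PER_K = 1.380649e-23  # exact SI value
TEMPERATURE_K = 300.0             # assumed room temperature
KT_JOULES = BOLTZMANN_J_PER_K * TEMPERATURE_K

# Existing prototype values from the table
area_um2 = 1.0
switching_time_ns = 5.0
switching_energy_j = 5e-14

prototype_energy_kt = switching_energy_j / KT_JOULES  # about 1.2e7 kT

# Projected area: improve x 100
projected_area_um2 = area_um2 / 100

# Projected switching time: x 10 (length) times x 10 (wall velocity)
projected_time_ns = switching_time_ns / (10 * 10)

# Projected energy range: x 100 (speed), x 10-20 (cross-section),
# x 10-100 (critical current); conservative and optimistic ends
energy_conservative_j = switching_energy_j / (100 * 10 * 10)
energy_optimistic_j = switching_energy_j / (100 * 20 * 100)

print(f"prototype energy: {prototype_energy_kt:.2e} kT")
print(f"projected area:   {projected_area_um2} um^2")
print(f"projected time:   {projected_time_ns} ns")
print(f"projected energy: {energy_optimistic_j / KT_JOULES:.0f} "
      f"to {energy_conservative_j / KT_JOULES:.0f} kT")
```

The x 100 switching-speed factor in the energy row is consistent with the two x 10 factors in the speed row.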
Summary
• The Nanoelectronics Research Initiative benchmarking project should be nearing completion by mid-August, 2009
• The ERA section plans to provide a summary of findings for 2009