Principles of Computer Architecture Dr. Mike Frank


Power/Performance/Cost Efficiency of
Adiabatic Circuits, as a function of
Device On/Off Power Ratios
Michael P. Frank
CISE Department / ECE Dept.
Brown Bag Seminar
Tue., Mar. 26
ITRS Feature Size Projections
[Figure: log-scale plot of feature size (nanometers, 0.1 to 1,000,000) vs. year of first product shipment, 1955–2050. Trend lines: µP channel length, DRAM half-pitch, min and max Tox. Physical reference scales: human hair thickness, eukaryotic cell, bacterium, virus, protein molecule, DNA molecule thickness, atom. A marker notes "We are here."]
ITRS Feature Size Projections
[Figure: zoomed view of the same projections, 1995–2050, feature size 0.1 to 1000 nm, with the µP channel length, DRAM half-pitch, and min/max Tox curves against the bacterium, virus, protein-molecule, DNA-thickness, and atom scales; a marker notes "We are here." Source: ITRS '99]
Across Multiple Technologies
[Figure: computing performance trends spanning five technology generations: mechanical, electromechanical, relays, vacuum tubes, discrete transistors, and integrated circuits.]
Source: Kurzweil, The Age of Spiritual Machines, pp. 22-25
Trend of Minimum Transistor Switching Energy

[Figure: log-scale plot of minimum transistor switching energy, in units of kT, vs. year of first product shipment, 1995–2035, with high and low trend lines falling from ~10⁶ kT toward ~1 kT. Energy is ½CV², based on ITRS '99 figures for Vdd and minimum transistor gate capacitance; T = 300 K.]
Information & Entropy
Example: a system with 3 two-state subsystems, such as quantum spins.

Physical information: known information plus entropy = maximum entropy = maximum known information.

[Figure: the 3 spins have 2³ = 8 joint states. Each spin (labeled 1–3) has an informational status of either "entropy" or "known information"; knowledge rules out some of the 8 states.]
Illustrating Landauer’s principle
[Figure: Before bit erasure, the state of the bit to be "erased" (0 or 1), together with N states s₀…s_{N−1} of the rest of the system (thermal modes, etc.), gives 2N joint states. Unitary (one-to-one) evolution cannot merge distinct states, so after erasure, when the bit is always 0, the rest of the system must span 2N states s₀…s_{2N−1}: the bit's information becomes entropy in the environment.]
Conventional Gates are Irreversible
• Logic gate behavior (on receiving new input):
– Many-to-one transformation of local state!
– Required to dissipate ≥ k_BT ln 2 by Landauer's principle
– Incurs ½CV² dissipation in 2 out of 4 cases.
Example: static CMOS inverter (input "in", output "out").

Transformation of local state:
Just before transition: (in, out) ∈ {(0,0), (0,1), (1,0), (1,1)}
After transition: (in, out) ∈ {(0,1), (1,0)}
Adiabatic Charging in CMOS
Exact formula (if R is constant):
E_diss = f·[1 + f·(e^(−1/f) − 1)]·CV²
for frequency factor f ≡ RC/t.
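As a quick numerical sanity check, the formula can be evaluated in both limits; this Python sketch uses illustrative values of C and V (not taken from the ITRS figures):

```python
import math

def e_diss(f, C=1e-15, V=1.0):
    """Dissipation for charging capacitance C to voltage V through a
    constant resistance R over time t, with frequency factor f = RC/t."""
    return f * (1.0 + f * (math.exp(-1.0 / f) - 1.0)) * C * V**2

CV2 = 1e-15 * 1.0**2
print(e_diss(0.01) / CV2)    # slow ramp: ~f, far below the 1/2 CV^2 loss
print(e_diss(1000.0) / CV2)  # abrupt switching: recovers ~1/2 CV^2
```

In the adiabatic limit f → 0 this reduces to E_diss ≈ (RC/t)·CV², while f → ∞ recovers the conventional ½CV².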
Adiabaticity is Fundamental
• Adiabatic (dissipation ∝ quickness) processes
can occur in any type of system.
– Cf. the adiabatic theorem of quantum mechanics.
• Specific adiabatic logics have been described
for many proposed future device technologies:
– Superconducting (Likharev '82, Averin et al. '01)
– Nanomechanical (Drexler '92, & Merkle mid-'90s)
– Quantum-dot (Lent & Tougaw, mid-'90s–present)
– Quantum computing implementations (inherently)
• Claim: Work on architectures & analysis for
adiabatic CMOS will still apply post-CMOS!
Adiabatic Rules for Transistors
• Rule 1: Never turn on a transistor if it has a
nonzero voltage across it!
– I.e., between its source & drain terminals.
– Why: This erases info. & causes ½CV² dissipation.
• Rule 2: Never apply a nonzero voltage across a
transistor, even during an on↔off transition!
– Why: When partially turned on, the transistor has
relatively low R, gets rel. high P=V2/R dissipation.
– Corollary: Never turn off a transistor when it has a
nonzero current going through it!
• Why: As R gradually increases, the V=IR voltage drop
will build, and then rule 2 will be violated.
Adiabatic Rules continued
• Transistor Rule 3: Never apply a large voltage
across any on transistor.
– Why: So the transition will be more reversible;
dissipation will approach CV²·(RC/t), not ½CV².
Adiabatic rules for other components:
• Diodes: Don’t use them at all!
– There is always a built-in voltage drop across them!
• Resistors: Avoid moderate network resistances.
– e.g., stay away from the range between ~10 kΩ and ~1 MΩ
• Capacitors: Minimize, reliability permitting.
– Note: Adiabatic dissipation scales with C²!
Transistor Rules Summarized
Legal transitions are shown in green; dissipative states and transitions in red. (For n- or p-FETs.)

[Figure: state diagram over the transistor's on/off state and high/low source & drain voltages, marking which states are legal and which transitions between them are adiabatic vs. dissipative.]
Transformation of local state:
Just before transition: (in, out) ∈ {(0, ½), (1, ½)}
After transition: (in, out) ∈ {(0, 1), (1, 0)}
Simple Reversible CMOS Latch
• Uses a standard CMOS transmission gate
• Sequence of operation:
(1) input initially matches latch contents (output),
(2) input changes → output changes, (3) latch closes,
(4) input removed.
[Figure: transmission gate (complementary clocks P and P̄) between "in" and "out". State tables: before input, (in, out) = (a, a); input arrived, (in, out) = (a, a) or (b, b); input removed, (in, out) = (a, a) or (a, b) — the latch retains b after the input returns to a.]
Generic Frictional Coefficients
Normal defs. of friction (coeff. of sliding friction,
viscosity, etc.) may not apply to all processes.
For a given mechanism executing a specified
process (i.e., following a specified desired
trajectory or -ies) adiabatically over a time t:
• Energy coefficient: c_E = E_lost·t = E_lost/q
– Energy dissipated from the trajectory per unit of "quickness"
• Note quickness q = 1/t has units like Hz
• Entropy coefficient: c_S = S_made·t = S_made/q
– New entropy generated per unit of quickness
• Note that c_E = c_S·T at temperature T.
Energy Coefficient in Electronics
• For charging capacitive load C by voltage V
through effective resistance R:
cE = Elostt = (CV2RC/t)t = C2V2R
• If the resistances are voltage-controlled
switches with gain factor k controlled by the
same voltage V, then effective R  1/kV
cE = C2V/k
• In constant-field-scaled CMOS, k  1/hox  ,
C  , and V  , so
cE  3/ = 4; Elost = cE/t  4/ = 3
(like CV2 energy)
Entropy coefficients of some
reversible logic gate operations
From Frank '98, "Ultimate theoretical models of nanocomputers" (Nanotechnology journal):
• SCRL, circa 1997: ~1 b/Hz
• Optimistic reversible CMOS: ~10 b/kHz
• Merkle's "quantum FET": ~1.2 b/GHz
• Nanomechanical rod logic: ~0.07 b/GHz
• Superconducting PQ gate: ~25 b/THz
• Helical logic: ~0.01 b/THz
How low can you go? We don’t really know!
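Since c_E = c_S·T, these entropy coefficients translate directly into energy coefficients. A minimal Python sketch, assuming room temperature:

```python
import math

K_B = 1.380649e-23  # Boltzmann constant, J/K

def energy_coeff(bits_per_hz, T=300.0):
    """c_E = c_S * T, with c_S given in bits per Hz of quickness.
    One bit of entropy is k_B * ln 2."""
    return bits_per_hz * K_B * math.log(2) * T

print(energy_coeff(1.0))    # SCRL, ~1 b/Hz -> ~2.9e-21 J*s
print(energy_coeff(1e-14))  # helical logic, ~0.01 b/THz -> ~2.9e-35 J*s
```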
Quantifying Leakage
• For a given structured system:
• Leakage power: P_leak = dE_leak/dt
• Spontaneous entropy generation rate: Ṡ_leak = dS_leak/dt
• Again, note P_leak = Ṡ_leak·T at temperature T.
Minimum Losses w. Leakage
E_tot = E_adia + E_leak,  where E_adia = c_E/t_r and E_leak = P_leak·t_r
Minimizing over the transition time t_r:
t_opt = (c_E/P_leak)^½ = (c_S/Ṡ_leak)^½
E_tot,min = 2·(P_leak·c_E)^½ = 2T·(Ṡ_leak·c_S)^½
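The optimization is easy to verify numerically; a sketch with illustrative values for c_E and P_leak (not drawn from any real process):

```python
import math

def total_loss(t, c_E, P_leak):
    """Adiabatic loss c_E/t plus leakage loss P_leak*t for one transition."""
    return c_E / t + P_leak * t

c_E, P_leak = 1e-27, 1e-9  # hypothetical values, J*s and W
t_opt = math.sqrt(c_E / P_leak)        # optimal transition time
E_min = 2.0 * math.sqrt(P_leak * c_E)  # minimum total energy per op

# The closed form should beat nearby choices of transition time:
assert total_loss(t_opt, c_E, P_leak) <= total_loss(2 * t_opt, c_E, P_leak)
assert total_loss(t_opt, c_E, P_leak) <= total_loss(t_opt / 2, c_E, P_leak)
print(t_opt, E_min)  # 1e-09 s, 2e-18 J
```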
Min. energy & Roff/Ron ratio
• Note that:
c_E = C²V²·R_on
and if the dominant leakage is source/drain:
P_leak = V²/R_off
• So:
c_E·P_leak = C²V⁴/(R_off/R_on)
E_min = 2·(c_E·P_leak)^½ = 2CV²·(R_off/R_on)^(−½)
• So:
Q_max = ½CV² / [2CV²·(R_off/R_on)^(−½)]
= ¼·(R_off/R_on)^½
= ¼·(I_on/I_off)^½
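The resulting bound is a one-liner; a small sketch evaluating it for a couple of on/off ratios:

```python
def q_max(r_on_off):
    """Max adiabatic energy-savings factor vs. the conventional 1/2 CV^2 loss:
    Q_max = (1/4) * sqrt(R_off/R_on) = (1/4) * sqrt(I_on/I_off)."""
    return 0.25 * r_on_off ** 0.5

print(q_max(1e6))  # 250.0: a 10^6 on/off ratio caps the savings at 250x
print(q_max(1e9))  # ~7906
```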
Clock/Power Supply Desiderata
• Requirements for an adiabatic timing signal /
power supply:
– Generate trapezoidal waveform with very flat
high/low regions
• Flatness limits Q of logic.
• Waveform during transitions is ideally linear,
– But this does not affect maximum Q, only energy coefficient.
– Operate resonantly with the logic, with high Q.
• The power supply's Q will limit the overall system Q.
– Reasonable cost, compared to the logic it powers.
– If possible, scale Q ∝ t (cycle time). (Ideally, independent of t.)
• Required to be considered an adiabatic mechanism.
• May conflict w. inductor scaling laws!
• At the least, Q should be high at the leakage-limited speed.
Supply concepts in my research
• Superpose several sinusoidal signals from
phase-synchronized oscillators at harmonics of
fundamental frequency
– Weight these frequency components as per Fourier
transform of desired waveform
• Create relatively high-L integrated inductors via
vertical, helical metal coils
– Only thin oxide layers between turns
• Use mechanically oscillating, capacitive MEMS
structures in vacuo as high-Q (~10k) oscillator
– Use geometry to get desired wave shape directly
A MEMS Supply Concept
• Energy stored
mechanically.
• Variable coupling strength → custom wave shape.
• Can reduce losses
through balancing,
filtering.
• Issue: How to
adjust frequency?
Summary of Limiting Factors
When considering adiabaticizing a system:
• What fraction of system power is in logic? (f_L)
– Vs. displays, transmitters, propulsion.
• What fraction of the logic is done adiabatically? (f_a)
– Can be all, but w. cost-efficiency overheads.
• How large is the I_on/I_off ratio of the switches?
– Affects leakage & minimum adiabatic energy.
• What is the Q_sup of the resonant power supply?
• What is the relative cost of power & logic? (r_$, the "power premium")
– E.g., decreasing power cost by r_$ while increasing HW cost by ≥ r_$ will not help.
Minimizing cost/performance
• $_P = cost of power in the original system
• $_H = cost of logic HW in the original system
• $_P = r_$·$_H;  $_H = $_P/r_$
• For cost-efficiency inverse to energy savings:
$_tot,min = $_P·r_$^(−½) + $_H·r_$^(½) = 2·$_P·r_$^(−½)
• $_tot,orig = $_P + $_H = (1+r_$)·$_H = ((1+r_$)/r_$)·$_P
• $_tot,orig/$_tot,min = ½·(1+r_$)·r_$^(−½) ≈ ½·r_$^(½) for large r_$
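The break-even arithmetic above is easy to tabulate; a sketch with illustrative power premiums r_$:

```python
def best_cost_ratio(r):
    """$_tot,orig / $_tot,min = (1 + r) / (2 * sqrt(r)) for power premium r."""
    return 0.5 * (1.0 + r) / r ** 0.5

print(best_cost_ratio(1.0))    # 1.0: no net advantage when $P = $H
print(best_cost_ratio(100.0))  # 5.05: approaches (1/2)*sqrt(r) for large r
```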
Summary of adiabatic limits
• Cost-effective adiabatic energy savings factor:
Sa = Econv / Eadia in cost-effective adiabatic system
• Some rough upper bounds on S_a:
S_a ≲ 1/(1 − f_L)
S_a ≲ 1/(1 − f_a)
S_a ≲ ¼·(I_on/I_off)^½
S_a ≤ Q_sup
S_a ≲ r_$^½
(worse than these for non-ideal computations)
• Discussion ignores benefits from adiabatics of
denser packing & smaller communications
delays in parallel algorithms.
Motivation for this study
• We want to know how to carry out any
arbitrary computation in a way that is
reversible to an arbitrarily high degree.
– Up to limits set by leakage, power supply, etc.
• We want to do this as efficiently as possible:
– Using as few “device ticks” as possible (spacetime)
• Minimizes HW cost, & leakage losses
– Using as few adiabatic transitions as possible (ops)
• Minimizes frictional losses
• But, a desired computation may be originally
spec’d in terms of irreversible primitives.
General-Case vs. Special-Case
• We’d like to know two kinds of things:
– For arbitrary general-purpose computations,
• How to automatically emulate them in a fairly efficient
reversible way,
– w/o needing new intelligent/creative design work in each case?
– For various specific computations of interest,
• What are the most efficient reversible algorithms?
– Or at least, the most efficient that we can find?
• Note: These may not look anything like the most efficient
irreversible algorithms!
The Landauer embedding
• The obvious embedding of irreversible ops into
“expanding” reversible ones leads to a linear
increase in space through time. (Landauer ‘61)
– Or, increase in width of an input-consuming circuit
[Figure: a circuit of "expanding" operations (e.g., AND) whose width grows with circuit depth (time →); the input enters at left, and the desired output emerges together with an ever-growing collection of "garbage" bits.]
Lecerf Reversal
• Lecerf (‘63) was interested in the group-theory
question of whether an iterated permutation of
items would eventually return to initial item.
– Proved undecidable by reducing Turing’s halting
problem to this question, w. a reversible TM.
• Reversible TM reverses direction instead of halting.
• Returns to initial state iff irreversible TM would halt.
• Only problem: no useful output data!

[Figure: the computation f maps the input to the desired output plus garbage; applying f⁻¹ then uncomputes everything, leaving only a copy of the input.]
The Bennett Trick
• Bennett (‘73) pointed out that you could simply
fan-out (reversibly copy) the desired output
before reversing.
• Note O(T) storage is still temporarily needed!
[Figure: compute f from the input (producing the desired output and garbage), fan out a copy of the output, then apply f⁻¹ to uncompute the garbage, leaving the input plus the copy of the desired output.]
Triangle Representation
• Represents any use of Bennett ‘73 embedding
[Figure: a triangle in a diagram of time in the reversible system (horizontal) vs. time in the emulated irreversible system (vertical). The adiabatic process's forward phase carries the irreversible computation's state at time tᵢ to its state at time tᵢ+Δtᵢ; the reverse phase retraces it. The mass on any vertical line = space usage at that time.]
Improving Spacetime Efficiency
• Bennett '73 transforms a computation taking
spacetime S·T to one taking Θ(S·T²) in the worst case.
– Can we do better?
• Bennett '89: Described a technique that takes
spacetime Θ(S·T^(log₂ 3)·log T) ≈ Θ(S·T^1.59·log T)
– Actually, one can generalize slightly and arrange for the
exponent on T to be 1+ε, where ε→0 (very slowly)
• Lange, McKenzie, Tapp '97: Space Θ(S) is
possible, if you use time Θ(exp(Θ(S)))
– Not any more spacetime-efficient than Bennett.
“Pebble Game” Representation
[Figure: pebble-game renderings of the triangle representation for (k=2, n=3) and (k=3, n=2).]
Analysis of Bennett Algorithm
• n = # of recursive levels of the algorithm (n+1 spikes)
• k = # of lower-level iterations to go forward 1
higher-level step
• T_r = # of reversible lowest-level steps executed
= c·(2k−1)ⁿ  (c a small constant, e.g. 2)
• T_i = # of irreversible steps emulated = kⁿ
• So n = log_k T_i, and thus
T_r = c·(2k−1)^(log T_i/log k) = c·e^(log(2k−1)·log(T_i)/log k)
= c·T_i^(log(2k−1)/log k)
E.g., k=2: T_r = 2·T_i^(log 3/log 2)
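The step counts above can be cross-checked against the closed form; a small sketch:

```python
import math

def t_rev(k, n, c=2):
    """Reversible lowest-level steps: c * (2k - 1)^n."""
    return c * (2 * k - 1) ** n

def t_irr(k, n):
    """Irreversible steps emulated: k^n."""
    return k ** n

k, n, c = 2, 10, 2
exponent = math.log(2 * k - 1) / math.log(k)  # log 3 / log 2 ~ 1.585
# Closed form T_r = c * T_i^(log(2k-1)/log k) matches the direct count:
assert abs(t_rev(k, n, c) - c * t_irr(k, n) ** exponent) < 1e-3
print(t_rev(k, n, c), t_irr(k, n))  # 118098 1024
```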
Cost-Efficiency Analysis
• Total cost of doing a computation includes:
– Spacetime costs (storage used, integrated over time)
• Includes time-amortized manufacturing cost
• Includes cost of total energy leakage
– leakage from any in-use storage element
– Irreversibility costs (energy loss from irrev. ops)
• Total number of irreversible bit-erasures, at ½CV² ≫ kT each.
– Adiabatic costs (energy loss from reversible ops.)
• Proportional to the number n_a of adiabatic ops performed,
times c_E, divided by the time t_op of a single op.
Bennett 89 alg. is not optimal
[Figure: Bennett '89 pebble-game spacetime diagrams for (k=2, n=3) and (k=3, n=2), with large unused spacetime regions between the recursive triangles.]
Just look at all the spacetime it wastes!!!
Parallel “Frank02” algorithm
• We can simply squish the triangles closer
together to eliminate the wasted spacetime!
• Resulting algorithm is linear time for all n and k
and dominates Ben89 for time, spacetime, &
energy!
[Figure: emulated time vs. real time for the squished-triangle algorithm, shown for (k=2, n=3), (k=3, n=2), and (k=4, n=1).]
Setup for Analysis
• For the energy-dominated limit,
– let cost "$" equal energy.
• c_$ = energy coefficient; r_$ = r_$(min) = leakage power
• $_i = energy dissipation per irreversible state-change
• Let the on/off ratio
R_on/off = r_$(max)/r_$(min) = P_max/P_min.
• Note that
c_$ ≤ $_i·t_min = $_i·($_i/r_$(max)), so
r_$(max) ≤ $_i²/c_$
• So
R_on/off ≤ $_i²/(c_$·r_$(min)) = $_i²/(c_$·r_$)
Time Taken
• There are n levels of recursion.
• Each multiplies the width of the base of the
triangle by k.
• Lowest-level triangles take time c·t_op.
• Total time is thus c·t_op·kⁿ.
[Figure: a (k=4, n=1) triangle whose base spans 4 sub-units.]
Number of Adiabatic Ops
• Each triangle contains k + (k − 1) = 2k − 1
immediate sub-triangles.
• There are n levels of recursion.
• Thus number of adiabatic ops is c·(2k  1)n
[Figure: for (k=3, n=2), (2k−1)ⁿ = 5² = 25 little triangles (adiabatic operations).]
Spacetime Usage
• Each triangle includes the spacetime usage of all 2k − 1 of its subtriangles,
• plus Σᵢ₌₁^(k−2) i = (k−2)(k−1)/2 = (k²−3k+2)/2 additional spacetime units, each consisting of 1 storage unit (1 state of the irreversible machine being stored) held for time t_op·k^(n−1).

[Figure: a (k=5, n=1) triangle with 1+2+3 extra storage units beneath its sub-triangles.]

Resulting recurrence relation:
ST(k,0) = 1 (or c)
ST(k,n) = (2k−1)·ST(k,n−1) + (k²−3k+2)·k^(n−1)/2
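The recurrence is straightforward to implement; a minimal sketch:

```python
def spacetime(k, n, c=1):
    """ST(k, n): spacetime usage of the reversible emulation, per the
    recurrence ST(k,n) = (2k-1)*ST(k,n-1) + (k^2-3k+2)*k^(n-1)/2.
    (k-1)(k-2) is always even, so integer division is exact."""
    st = c
    for i in range(1, n + 1):
        st = (2 * k - 1) * st + (k * k - 3 * k + 2) * k ** (i - 1) // 2
    return st

# k=2 adds no extra storage terms (k^2 - 3k + 2 = 0), so ST(2,n) = 3^n:
assert spacetime(2, 3) == 27
print(spacetime(3, 2), spacetime(4, 1))  # 33 10
```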
Reversible Cost
• Adiabatic cost plus spacetime cost:
$_r = $_a + $_s = (2k−1)ⁿ·c_$/t + ST(k,n)·r_$·t
• Minimizing over t gives:
$_r = 2·[(2k−1)ⁿ·ST(k,n)·c_$·r_$]^½
• But, in the energy-dominated limit,
c_$·r_$ ≤ $_i²/R_on/off,
• So, for devices at this limit:
$_r = 2·$_i·[(2k−1)ⁿ·ST(k,n)/R_on/off]^½
Tot. Cost, Orig. Cost, Advantage
• Total cost $_i for the irreversible operation
performed at the end of the algorithm, plus the reversible
cost, gives:
$_tot = $_i·{1 + 2·[(2k−1)ⁿ·ST(k,n)/R_on/off]^½}
• The original irreversible machine performing kⁿ ops
would use cost $_orig = $_i·kⁿ, so
• the advantage ratio between reversible &
irreversible cost is
R_$(i/r)(k,n) = $_orig/$_tot = kⁿ / {1 + 2·[(2k−1)ⁿ·ST(k,n)/R_on/off]^½}
Optimization Algorithm
• For any given value of R_on/off:
• Scan the possible values of n (up to some limit),
• for each of those, scan the possible values of k,
• until the maximum R_$(i/r) for that n is found
– (the function only has a single local maximum)
• and return the max R_$(i/r) over all n tried.
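The scan takes only a few lines. This sketch reuses the ST(k,n) recurrence and the advantage-ratio formula from the preceding slides; the limits n_max and k_max are arbitrary choices, and the exact optimum depends on constants the slides leave open:

```python
def spacetime(k, n):
    """ST(k, n) recurrence for the reversible emulation's spacetime."""
    st = 1
    for i in range(1, n + 1):
        st = (2 * k - 1) * st + (k * k - 3 * k + 2) * k ** (i - 1) // 2
    return st

def advantage(k, n, r_onoff):
    """R_$(i/r) = k^n / (1 + 2*sqrt((2k-1)^n * ST(k,n) / R_on/off))."""
    return k ** n / (1.0 + 2.0 * ((2 * k - 1) ** n * spacetime(k, n) / r_onoff) ** 0.5)

def best_advantage(r_onoff, n_max=12, k_max=300):
    """Scan n, then k; stop each k-scan at its single local maximum."""
    best = (1.0, 0, 0)  # (ratio, n, k)
    for n in range(1, n_max + 1):
        prev = 0.0
        for k in range(2, k_max + 1):
            r = advantage(k, n, r_onoff)
            if r < prev:  # past the peak in k for this n
                break
            prev = r
            if r > best[0]:
                best = (r, n, k)
    return best

print(best_advantage(1e9))  # a several-hundred-fold advantage at R_on/off = 1e9
```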
Cost-Efficiency Gains, Modified Ben89
[Figure: log-log plot of the advantage in arbitrary computation vs. the on/off ratio of individual devices, 1 to 10¹². The cost-efficiency advantage ("out") fits y = 0.3905·x^0.3896; the hardware overhead ("hw") fits y = 1.741·x^0.6198; secondary axes track the optimal n and k (roughly 0–70).]
Asymptotic Scaling
• The potential energy-savings factor scales as
R_$(i/r) ∝ R_on/off^~0.4,
• while the spacetime overhead goes as
R_$(i/r)^~1.6, or R_on/off^~0.62.
• E.g., with an R_on/off of 10⁹, you can do worst-case computation in an adiabatic circuit with:
– an energy savings of up to a factor of 1,200×!
– But this point is 700,000× less hardware-efficient!
Conclusions
• A new, more spacetime-efficient & energy-efficient algorithm for doing arbitrary
computations adiabatically has been described.
• The energy savings in worst-case computations
goes as the ~0.4th power of device on/off ratio.
– Best case computations: 0.5th power.
• However, the reduction in spacetime efficiency
scales with the energy savings to the ~1.6th power.
– It still grows much faster than we would like!
• Adiabatics can be generally cost-effective, but
still only for heavily energy-dominated apps.