prelims - Personal - University of Michigan

Towards An Efficient Low Frequency
Energy Recovery Dynamic Logic
Sujay Phadke
Advanced Computer Architecture Lab
Department of Electrical Engineering and Computer Science
University of Michigan, Ann Arbor
Advisor: Prof. Marios Papaefthymiou
September 28th, 2005

Outline
- Power dissipation in conventional CMOS
- Standard approaches to reduce power dissipation
- Introduction to energy recovery circuits
- Background - Boost Logic
  - operation, reported simulation results
  - pros and cons from an energy standpoint
- Description of 3 new circuits designed
- Comparison of different circuits
  - energy dissipation
  - power supply variation
- Conclusion and future work

Power dissipation in conventional CMOS designs
- Streaming applications
  - small amount of logic
  - large number of buffers
  - long wires: large capacitance C; driving this C wastes energy
- Throughput-limited datapaths
  - strict requirement on throughput
  - longer latencies can be tolerated (DSP applications)
[ATMEL76C120 78MHz]
$P = C_{eff} V_{dd}^2 f$
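
As a rough illustration of the power relation on this slide, the following Python sketch evaluates $P = C_{eff} V_{dd}^2 f$ at a few supply voltages to show the quadratic dependence on $V_{dd}$. The effective capacitance value and the function name `dynamic_power` are assumptions made for this example; only the 78MHz clock rate comes from the slide.

```python
def dynamic_power(c_eff: float, vdd: float, freq: float) -> float:
    """Switching power P = Ceff * Vdd^2 * f (activity assumed folded into Ceff)."""
    return c_eff * vdd ** 2 * freq


if __name__ == "__main__":
    C_EFF = 1e-9   # assumed effective switched capacitance (1 nF), illustrative only
    FREQ = 78e6    # 78 MHz, the clock rate quoted on the slide
    for vdd in (1.2, 0.9, 0.6):
        p_mw = dynamic_power(C_EFF, vdd, FREQ) * 1e3
        print(f"Vdd={vdd:.1f} V -> P={p_mw:.2f} mW")
```

Halving the supply voltage in this model cuts the switching power to a quarter at a fixed frequency, which is the motivation for the voltage-scaling approaches discussed next.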

Conventional approaches to reducing power: voltage scaling and pipelining
- Voltage scaling can result in significant energy gains
  - Lower dissipation
  - Lower leakage
- Limitations:
  - Limited by threshold voltages
  - Vth scaling limited by manufacturing processes
  - Overhead of flip-flops
  - Increasing the delay, limited scalability
[Plot: Delay (ns) vs. Voltage (V), 0.5 V to 1.3 V, for unpipelined, 2-stage pipeline, and 6-stage pipeline designs]

Reduced voltage drivers and voltage converters
[Figure: reduced-swing signalling. Labels: high swing (vdd), low swing (vddL), voltage converter (vdd), out]
- Limited by VTH
- Delay in level conversion
- Requirements for efficient operation
  - Energy efficient level conversion
  - No throughput impact due to level conversion delay
- Point of diminishing returns!
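
Energy dissipation in CMOS
[Figure: CMOS gate charging and discharging a load from a DC source]
- Charging: $E = C \cdot V \cdot V = CV^2$ is drawn from the DC source, $\frac{1}{2}CV^2$ is stored on the load, so $E_{diss} = \frac{1}{2}CV^2$
- Discharging: $E = C \cdot V \cdot 0 = 0$; no energy recovered back into the supply
- $T_d = \dfrac{k \cdot CV}{(V - V_{th})}$
- Reducing V decreases $E_{diss}$, but eventually will make the devices go into sub-threshold region
- Delay increases exponentially as V is decreased
- Point of diminishing returns in scaling Vdd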
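
The trade-off on this slide can be sketched numerically. The snippet below is a minimal illustration, assuming the delay model as it appears on the slide, $T_d = kCV/(V - V_{th})$, which only applies above threshold. The values of C, k and Vth, and the function names, are placeholders chosen for the example and are not from the talk.

```python
def dissipated_energy(c_farads: float, v_volts: float) -> float:
    """Energy lost per charging event from a DC supply: (1/2) * C * V^2."""
    return 0.5 * c_farads * v_volts ** 2


def gate_delay(c_farads: float, v_volts: float, v_th: float, k: float) -> float:
    """Slide's delay model Td = k*C*V / (V - Vth); only meaningful for V > Vth."""
    if v_volts <= v_th:
        raise ValueError("model only applies above threshold (V > Vth)")
    return k * c_farads * v_volts / (v_volts - v_th)


if __name__ == "__main__":
    C = 100e-15   # assumed load capacitance (100 fF), illustrative only
    VTH = 0.35    # assumed threshold voltage in volts, illustrative only
    K = 1e4       # assumed fitting constant (ohms), illustrative only
    for v in (1.2, 1.0, 0.8, 0.6, 0.45):
        e = dissipated_energy(C, v)
        td = gate_delay(C, v, VTH, K)
        print(f"V={v:.2f} V  Ediss={e:.3e} J  Td={td:.3e} s")
```

In this model the energy falls quadratically with V while the delay grows without bound as V approaches Vth, which is one way to read the "point of diminishing returns" in the slide.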

Energy Recovery Circuits
- Switching energetics different from vanilla CMOS
- DC supply replaced by an AC supply
- Energy required to swing the voltage on a node is much less than the energy stored
- Use of inductors to supply and recover charge
- Resonate current through inductors from power clock to load capacitance
- Energy recovery gates can be used as timing elements
- Latency overhead does not translate to a throughput penalty
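
Energy recovery charging/discharging
[Figure: source waveforms V(t) for step, linear ramp over time T, N-step, and sinusoidal charging of C through R]
- Step source: $E_{diss} = \int_0^{\infty} \left(\frac{V}{R} e^{-t/RC}\right)^2 R\,dt = \frac{1}{2}CV^2$
- Linear ramp over time T: $E_{diss} = \int_0^{T} I_s^2(t)\,R\,dt = \frac{RC}{T}\,CV^2$
- N-step charging: $E_{diss} = \frac{1}{2N}\,CV^2$
- Sinusoidal source (easy to generate): $E_{diss} = \frac{\pi^2}{8}\,\frac{RC}{T}\,CV^2$
- $E_{diss} \to 0$ as $T \to \infty$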
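
To make the comparison between charging schemes concrete, here is a small Python sketch that evaluates the commonly quoted closed-form dissipation for each case: step, linear ramp, N-step, and sinusoidal charging. It assumes the slow-transition limit (T much larger than RC) for the ramp and sinusoid results. The component values and function names are illustrative assumptions, not figures from the talk.

```python
import math


def e_step(c: float, v: float) -> float:
    """Abrupt charging from a DC source: (1/2) C V^2, independent of R."""
    return 0.5 * c * v ** 2


def e_ramp(r: float, c: float, v: float, t: float) -> float:
    """Linear ramp of duration t (assumes t >> RC): (RC/t) C V^2."""
    return (r * c / t) * c * v ** 2


def e_n_step(c: float, v: float, n: int) -> float:
    """Charging in n equal voltage steps: C V^2 / (2n)."""
    return c * v ** 2 / (2 * n)


def e_sine(r: float, c: float, v: float, t: float) -> float:
    """Sinusoidal transition of duration t (assumes t >> RC): (pi^2/8)(RC/t) C V^2."""
    return (math.pi ** 2 / 8) * (r * c / t) * c * v ** 2


if __name__ == "__main__":
    R, C, V = 1e3, 100e-15, 1.2   # assumed 1 kOhm, 100 fF, 1.2 V (illustrative)
    print(f"step:    {e_step(C, V):.3e} J")
    print(f"4-step:  {e_n_step(C, V, 4):.3e} J")
    for t in (1e-9, 1e-8, 1e-7):
        print(f"T={t:.0e} s  ramp={e_ramp(R, C, V, t):.3e} J  sine={e_sine(R, C, V, t):.3e} J")
```

Under these assumptions both the ramp and the sinusoid dissipate less as T grows, with the sinusoid costing a factor of about 1.23 more than an ideal ramp of the same duration.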

Energy Recovery: A Brief History
- Reversible computing proposed as a method of achieving asymptotically zero energy computation
- Early circuit design (inverter chains)
  - Maksimovic, Oklobdzija (1-clock / 2-phase, 1.2 µm process, 40MHz)
  - Dickinson and Denker (4-phase, 0.9 µm process, 250MHz)
  - Athas et al. (graphics processor, 0.5 µm process, 15MHz)
  - Kim et al. (True Single Phase Logic)
    - 8-bit multiplier @ 140MHz (0.5 µm process)
- Fundamental requirement of gradual power clock transitions
- Use of diodes to recover energy (delay and energy inefficient)
- Tracking power clock at its fastest transition
- Only PFET evaluation trees

Background
- Type 1: Boost logic [Sathe: ISLPED ’05]
  - hybrid energy recovery family with high gate overdrive and voltage scaling
  - no diodes, data-independent capacitance
  - acts as a timing element; no throughput penalty
  - less sensitive to power supply variation compared to vanilla CMOS
  - differential outputs for data-independent capacitance seen by power clock
  - 65% energy saving compared to conventional voltage scaled pipelined CMOS design
  - high energy dissipation at low frequencies (50MHz-200MHz)
- 0.13 µm process
  - Sim post layout: up to 1.6GHz
  - Chip: 750MHz-1.3GHz

Structure and operation of Boost Logic
[Figure: Boost Logic gate schematic. Logic stage: N-tree evaluation and complementary N-tree evaluation with devices M5-M8 between Vdd’ and Vss’, clocked by PC and its complement, giving a reduced-potential evaluation on out and its complement. Boost stage: devices M1-M4 driven by PC and its complement, providing sense-amplification and energy recovery.]
[Figure: cascade of three gates, each with an Eval stage (N-tree between Vdd’ and Vss’) and a Senseamp stage (Boost), clocked on alternating phases φ and its complement.]
$V_{dd}' = \frac{1}{2}(V_{dd} + V_c)$
$V_{ss}' = \frac{1}{2}(V_{dd} - V_c)$
$V_c \approx V_{th}$
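
As a worked example of the rail definitions on this slide, the sketch below computes the two logic-stage rails from a supply voltage and a control voltage. It assumes the rails sit symmetrically about Vdd/2, that is, Vdd’ = (Vdd + Vc)/2 and Vss’ = (Vdd - Vc)/2. The numeric values and the function name `boost_rails` are hypothetical.

```python
def boost_rails(vdd: float, vc: float) -> tuple[float, float]:
    """Return (Vdd', Vss') assuming rails placed symmetrically about Vdd/2."""
    vdd_prime = 0.5 * (vdd + vc)
    vss_prime = 0.5 * (vdd - vc)
    return vdd_prime, vss_prime


if __name__ == "__main__":
    VDD = 1.2   # assumed full supply in volts, illustrative only
    VC = 0.4    # assumed control voltage near Vth, illustrative only
    hi, lo = boost_rails(VDD, VC)
    print(f"Vdd'={hi:.2f} V  Vss'={lo:.2f} V  logic swing={hi - lo:.2f} V")
```

With these assumptions the logic stage sees a swing of Vc centred on Vdd/2, which is the reduced-potential evaluation that the boost stage then amplifies.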

Energy Dissipation in Type 1 (Boost)
[Figure: Type 1 evaluation stage: M5, N-tree evaluation, and M6 between Vdd’ and Vss’, clocked by φ and its complement]
- Always a fight between weak pull-up and pull-down!
- $E = I_{crow} \cdot V \cdot T$
- Increasing crowbar at lower frequencies
- Energy dissipation keeps on increasing
- How do we decrease this?
[Plot: Sim. with 32-bit RC adder, 0.13 µm]
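
The frequency dependence of the crowbar loss follows directly from the relation on this slide: with a roughly constant crowbar current, the energy per cycle grows linearly with the clock period. The sketch below illustrates this under that simplifying assumption. The current and voltage values and the function name `crowbar_energy` are placeholders, not simulation results from the talk.

```python
def crowbar_energy(i_crow: float, v: float, period: float) -> float:
    """Per-cycle crowbar energy E = I_crow * V * T (constant current assumed)."""
    return i_crow * v * period


if __name__ == "__main__":
    I_CROW = 10e-6   # assumed average crowbar current (10 uA), illustrative only
    V = 0.4          # assumed voltage across the evaluation stage, illustrative only
    for f_mhz in (200, 100, 50, 10):
        period = 1.0 / (f_mhz * 1e6)
        e = crowbar_energy(I_CROW, V, period)
        print(f"f={f_mhz:>3} MHz  T={period:.1e} s  E_crowbar={e:.2e} J/cycle")
```

In this simplified model, going from 200MHz to 10MHz raises the per-cycle crowbar energy by a factor of 20, consistent with the slide's point that dissipation keeps increasing at lower frequencies.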

Circuit Configurations Investigated
- Type 2: static CMOS in the evaluation stacks
- Type 3: use of static CMOS stack and an inverter to create differential outputs with less area overhead
- Type 4: a new domino CMOS logic in the evaluation stage and a modified energy recovery sense amplifier

Type 2 circuit: CMOS stacks in evaluation tree
- Complementary CMOS stacks
- Differential outputs driven to full rails (Vdd’ and Vss’)
- Reduces crowbar significantly
[Figure: Type 2 evaluation stage. P-tree pullup and N-tree pulldown driving out, and complementary P-tree pullup and complementary N-tree pulldown driving the complement of out, each with M1 to Vdd’ and M2 to Vss’, clocked by φ and its complement]
[Plot: Sim. with 32-bit RC adder with clock generator, 0.13 µm]

Type 2: Energy Dissipation
[Plots: "Percentage contribution to total energy for different time periods" for a 32-bit adder in Type 1 and in Type 2. Stacked bars of E(Crowbar) and E(Power clock), 0% to 100%, at time period (T) = 5.00E-09, 1.00E-08, 2.00E-08, 5.00E-08]
- Significant area overhead (6N+10) compared to Type 1 (2N+10)
- Limited fan-in
- Slow operation of PMOS

Type 3: CMOS stack with complementary inverter
- Use inverter to create output differential
- Less energy dissipation at low frequencies
- 3N+10 area overhead
[Figure: Type 3 evaluation stage with devices M1-M6: P-tree pullup and N-tree pulldown driving out, and an inverter generating the complement of out, between Vdd’ and Vss’, clocked by φ and its complement]
[Plot: "Total energy/cycle vs. time period T for type 3 circuit", type 1 vs. type 3; energy/cycle from 0 to 2.5E-11, time period (T) from 0 to 6.00E-08. Sim. with 32-bit RC adder with clock generator, 0.13 µm]
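
The area expressions quoted on the slides (2N+10 for Type 1, 6N+10 for Type 2, 3N+10 for Type 3) can be compared side by side for a given N. The short sketch below does this; N is taken to be the size parameter used on the slides, whose exact definition is not given in this transcript, and the function name `area_overhead` is hypothetical.

```python
# Area expressions as quoted on the slides, as functions of N.
AREA_MODELS = {
    "Type 1": lambda n: 2 * n + 10,
    "Type 2": lambda n: 6 * n + 10,
    "Type 3": lambda n: 3 * n + 10,
}


def area_overhead(circuit: str, n: int) -> int:
    """Evaluate the quoted area expression for the named circuit at size n."""
    return AREA_MODELS[circuit](n)


if __name__ == "__main__":
    for n in (2, 4, 8):
        row = ", ".join(f"{name}={area_overhead(name, n)}" for name in AREA_MODELS)
        print(f"N={n}: {row}")
```

For any positive N the ordering is Type 1 < Type 3 < Type 2, which matches the slides' description of Type 3 as the lower-overhead way to obtain differential outputs from a static CMOS stack.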

Type 3: Limitations due to sub-threshold operation of inverter
[Figure: simulated waveforms of φ, out, and the complement of out at 10MHz and at 100MHz. Sim. with 32-bit RC adder with clock generator, 0.13 µm]
- Due to limited drive, the inverter operates in sub-threshold region
- $V(\text{out}) - V(\overline{\text{out}})$ shrinks with increasing frequency, fanout
- Reliable operation (w.r.t. ∆V) only till ~50MHz
- How can we increase the inverter drive?

Type 3: with low-threshold devices in the inverter stack
- Improvement obtained for lower frequencies
- Sensitive to
  - coupling noise
  - process variation
- Operation not robust for f > 100MHz

A New Structure
- Need to create a good differential voltage with minimum area overhead and energy dissipation
- Need to modify the “Boost” sense amplifier stage to make the output voltage differential independent of fan-out loading
- Need to have good tolerance for power supply variations

Type 4: Domino CMOS with transmission gates
[Figure: Type 4 gate schematic. Evaluation stage between Vdd’ and Vss’: complementary pull-down N-tree with precharged internal nodes n1 and n2 and a device that enables low-swing pulldown, driving proxy output lines outint and its complement (low C) that mask the high-C lines. Transmission gates, clocked by φ and its complement, connect the proxy lines to out and its complement. Sense amplification stage with equalization device M7.]

Operation: Evaluation/hold Phase
- φ → 0
- Dual N-tree evaluates and pulls down one proxy output line
- Transmission gates transfer charge to low C lines (outint and its complement)
- No crowbar because headers are switched off
- Transistor M7 in the sense amplifier stage keeps out and its complement equalized at approx. Vdd/2
[Figure: Type 4 schematic during evaluation/hold, with the two sides marked "weak 0" and "weak 1"]

Operation: Precharge/amplify phase
- φ → 1
- Outputs pulled to rails in a recovery fashion by the cross coupled inverters
- Transmission gates isolate evaluate circuit from sense amp
- Transistor M7 in the sense amplifier stage is cut-off
- n1 and n2 pre-charge high to Vdd’
[Figure: Type 4 schematic during precharge/amplify]

Type 4: Simulation Results
[Figure: simulated waveforms across alternating evaluate/hold and precharge/amplify phases. Sim. with 32-bit RC adder with clock generator, 0.13 µm]

Type 4: Energy Dissipation
- 32-bit adder simulations with clock generator
- Shows substantial energy savings w.r.t. Type 1 (Boost)
- Voltage differential independent of fan-out loading
- Works between 10MHz-200MHz
[Plot: Sim. with 32-bit RC adder with clock generator, 0.13 µm]

Energy Comparison of Different Topologies
- Energy savings in Type 4 coming from:
  - low-cap. proxy output lines
  - small charge-up of internal nodes
  - isolation of eval. stage from sense amplifier
  - elimination of crowbar
- 25%-65% reduction in energy over operating range of frequencies with small area overhead
[Plot: energy comparison of Type 1, Type 2, Type 3, and Type 4. Sim. with 32-bit RC adder with clock generator, 0.13 µm]

Robustness to Variations in Power Supply
[Plot: "Effect of power supply variation on delay (at 100MHz)": percentage change in delay (-20 to 20) vs. percentage change in power supply (-15 to 15) for domino, pseudo NMOS, and vanilla CMOS (1.2V)]
- Delay variation is less than 5% for a 10% variation in power supply
- Type 4 circuit seen to be relatively insensitive to power supply variation compared to CMOS

Conclusions and Future Work
Conclusions:
- Design of 3 structures to improve energy recovery efficiency at low frequencies without use of diodes, multiple clock domains
- A new domino style topology resulting in substantial energy savings with minimal area overhead
- Relatively insensitive to power supply variations
Future work:
- Improve resonance of the Type 4 circuit
- Redesign of the clock generator to investigate potential power savings
- Performance of the circuit post-layout and comparisons
- Continuing investigations into other kinds of logic structures