Grant Proposal for Project Name


Circuits and Architectures to Deliver Low Power and High Speed Systems
By: Jabulani Nyathi
Washington State University
School of EECS
April 30, 2009
Outline
- CMOS Scaling
  - Its benefits and
  - The challenges it brings about
- Various Techniques for Limiting Leakage Currents
  - Their shortfalls
- Bridging the Speed-Power Gap
  - The Tunable Body Biasing Scheme
- Emerging Devices and Technologies
- Concluding Remarks
CMOS Scaling and its Benefits
- Aggressive CMOS scaling has been a very positive development, allowing:
  - Fast switching devices, and thus high-speed computing.
  - Massive integration due to miniaturization.
    - We no longer need multiple chips to implement a microprocessor and its peripherals.
    - In fact, we can now have multiple computing elements on a single die, resulting in a system on a chip.
CMOS Scaling and its Challenges
- CMOS scaling results in:
  - Increased leakage currents (5X/node) and
  - Increased dynamic power dissipation.
- The interconnect does not scale as fast as the transistor, thus:
  - Highly integrated designs require elaborate clock distribution schemes.
  - IPs within a System on a Chip would be difficult to synchronize with a single clock source.
Scaling Implications
[Figure: Module1 and Module2, showing scaled local interconnects and global interconnects]
Dynamic vs. Leakage Power
Research Motivation
- Desire to bridge the speed-power gap by exploring the feasibility of optimizing devices to operate effectively at both sub-threshold and above-threshold voltages.
- Emerging ultra-low-power technologies can benefit from increased speed.
  - Wearable computers, sensor networks, implantable medical technology
- Emphasis on design for energy efficiency
Existing Low Power Design Approaches
- Solve the energy dissipation problem from a region-of-operation standpoint
- Sub-threshold design
  - DTMOS: shows a 5.5 times increase in current
  - SBB: 4.4 times frequency increase
- Above threshold (super-threshold) design
  - Dynamic threshold provides energy efficiency
  - MTCMOS: high and low threshold devices
  - VT Scheme: reduce power by 50% using ABB and "sleep"/"active" modes
- Architectural
  - Gating Techniques: 45% of total power
DTMOS/SBB Output Voltage Clamping
[Figure: output voltage comparison of Traditional, SBB, DTMOS, and TBB; labeled levels 1.8 V and 600 mV]
Proposed Approach
- Change approach to include all possible operating regions: Tunable Body Biasing (TBB)
  - Sub-threshold and super-threshold operation bridged
  - Ultra-low energy and low speed, or high energy and high speed
- Utilize body biasing to improve performance of sub-threshold operation
  - Target increased performance at sub-threshold and slightly above threshold.
  - Save energy by eliminating idle time and processing continuously with variable power supplies (just-in-time task completion)
- Target applications
  - Mobile, battery-operated (power-constrained), variable-processing devices
  - Cell phones, PDAs, notebooks, wireless sensors, embedded systems, ASICs, medical technology, etc.
TBB Implementation
- Goals
  - Attain ON-state current gain while minimizing the OFF-state leakage current increase
  - Highlight advantages of sub-threshold operation while allowing super-threshold operation if needed
  - Control the bulk terminal to tunable potentials depending on VDD and the desired region of operation
- MOS Bulk Control Circuits
  - Multiplexer-based approach
  - Two transistors per bulk control circuit
  - Utilizes Vthn0
TBB Bulk Control Circuits
TBB MOS Bulk Control Signal

VDD                  pMOS Bulk      nMOS Bulk
VSS < VDD <= Vthn0   VSS            VDD
VDD > Vthn0          VDD - Vthn0    Vthn0

- Relies on the good/poor logic "1" and logic "0" passing properties of pass-transistors
- Requires external control signals: SubVt and SubVt_b
TBB Bulk Control Circuit Simulation
Super-threshold: pBulk = VDD – Vthn0
Sub-threshold: pBulk = 0 V
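
The bulk control rule on the preceding slide can be written as a small selection function. This is only a sketch of the table's logic, not the circuit itself: the function name `tbb_bulk_voltages` is hypothetical, VSS is assumed to be 0 V, and the Vthn0 value of 376.2 mV is an assumption used for illustration.

```python
def tbb_bulk_voltages(vdd, vthn0, vss=0.0):
    """Return (pmos_bulk, nmos_bulk) in volts, following the TBB bulk control table."""
    if not vss < vdd:
        raise ValueError("VDD must be above VSS")
    if vdd <= vthn0:
        # Sub-threshold region: pMOS bulk at VSS, nMOS bulk at VDD.
        return vss, vdd
    # Super-threshold region: pMOS bulk at VDD - Vthn0, nMOS bulk at Vthn0.
    return vdd - vthn0, vthn0


VTHN0 = 0.3762  # assumed zero-bias nMOS threshold in volts (illustrative)

print(tbb_bulk_voltages(0.25, VTHN0))  # sub-threshold supply
print(tbb_bulk_voltages(1.8, VTHN0))   # super-threshold supply
```

With a 0.25 V supply the function returns pBulk = 0 V, and with a 1.8 V supply it returns pBulk = VDD - Vthn0, matching the two simulation cases listed on the slide.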
Device Optimization
- TBB encourages varying supply voltages
  - How will devices be sized for optimal operation at any supply voltage?
  - Maintain symmetric switching
  - Examine the inverter at varying supply voltages
Device Optimization (Switching Point)

VDD        Ideal Inverter Threshold   Simulated Inverter Threshold   Percent Variation
1.8 V      900 mV                     900 mV                         0.0%
1.0 V      500 mV                     498 mV                         0.4%
376.2 mV   188.1 mV                   198.7 mV                       5.6%
188.1 mV   94.05 mV                   108.6 mV                       13.4%
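
The ideal switching point in the table is VDD/2, which is what symmetric switching asks for. The sketch below recomputes the deviation of the simulated threshold from that ideal. Measuring the deviation relative to the ideal value is an assumption; on that basis the first three rows are reproduced, while the last row comes out near 15.5% rather than the 13.4% listed, so the slide may have used a different reference for that entry.

```python
# (VDD, simulated inverter threshold) in mV, taken from the table
rows = [
    (1800.0, 900.0),
    (1000.0, 498.0),
    (376.2, 198.7),
    (188.1, 108.6),
]


def switching_point_deviation(vdd_mv, simulated_mv):
    """Return (ideal threshold in mV, percent deviation relative to the ideal)."""
    ideal = vdd_mv / 2.0  # symmetric switching: threshold at half the supply
    return ideal, 100.0 * abs(simulated_mv - ideal) / ideal


for vdd, sim in rows:
    ideal, dev = switching_point_deviation(vdd, sim)
    print(f"VDD = {vdd} mV, ideal = {ideal} mV, deviation = {dev:.1f}%")
```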
Sub-threshold Noise Margins
- Noise margins are significant for proper logic levels
- TBB and traditional static CMOS inverters have comparable noise margins
  - TBB VIH is 12.5% worse
  - TBB VIL is 14.3% better
Propagation Delay
[Figure: average switching delay (ns) for transmission gate, inverter, two-input NAND, two-input NOR, and two-input XOR; static CMOS at Vdd = Vthn0 with varying body biasing (Traditional, SBB, TBB, DTMOS)]

Gate   Traditional Delay   TBB Delay   % Decrease
TG     98 ns               14 ns       86
Inv    125 ns              20 ns       84
NAND   133 ns              18 ns       86
NOR    163 ns              25 ns       85
XOR    289 ns              40 ns       89
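
The delay table can also be read as a speedup factor per gate, which is the form the deck uses when it quotes a speed increase of up to about 7x. A minimal sketch using only the delays from the table (the dictionary and function names are illustrative):

```python
# gate: (traditional delay, TBB delay) in ns, from the propagation delay table
delays_ns = {
    "TG": (98, 14),
    "Inv": (125, 20),
    "NAND": (133, 18),
    "NOR": (163, 25),
    "XOR": (289, 40),
}


def speedup(traditional_ns, tbb_ns):
    """Ratio of traditional delay to TBB delay (larger means TBB is faster)."""
    return traditional_ns / tbb_ns


speedups = {gate: speedup(t, b) for gate, (t, b) in delays_ns.items()}
for gate, s in speedups.items():
    print(f"{gate}: {s:.2f}x")
```

Every gate lands between 6x and 7.5x, with the transmission gate at exactly 7x.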
Review of SubVth Circuits Benefits
- So far, the presentation has shown:
  - TBB requires control of MOS bulks to span the operating regions of interest. Implementation is successful.
  - Study of simple logic gates showed:
    - TBB gives a dramatic speed increase (up to 7x)
    - Static CMOS design style is suitable for sub-threshold and super-threshold operation
  - Sizing of efficient devices for the TBB approach is possible
- However, how will a complex system perform?
  - Design with previous knowledge (logic style, sizing)
  - Analyze post-layout simulations
Complex System-on-Chip Design Using TBB
- Work addresses the challenges of
  - Global interconnect delays
  - Clock distribution
  - Synchronization of unrelated clocks and
  - Power dissipation
Conclusion
- The TBB scheme has been devised to span all regions of operation, from ultra-low power to high speed. It is a new kind of body biasing.
  - Forward-biasing causes exponential sub-threshold current gain
  - Focus on sub-threshold and slightly above threshold to utilize leakage
- Bulk control circuits are effective
  - Leads to 7 times frequency increase in simple logic gates
  - 4% area and 8.9% power dissipation increase
- Static CMOS is the ideal overall design style
  - Device sizing at either sub-threshold or super-threshold allows efficient operation with variable supply voltages
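
The remark that forward-biasing gives an exponential sub-threshold current gain follows from the usual first-order sub-threshold model, in which the drain current scales as exp((VGS - Vth) / (n * vT)). Lowering the threshold by some amount through forward body bias therefore multiplies the current by exp(dVth / (n * vT)). The sketch below uses that textbook approximation, which does not come from the slides; the slope factor n = 1.5 and the temperature of 300 K are assumed values, and the function names are illustrative.

```python
import math


def thermal_voltage(temp_k=300.0):
    """Thermal voltage kT/q in volts."""
    k = 1.380649e-23     # Boltzmann constant, J/K
    q = 1.602176634e-19  # elementary charge, C
    return k * temp_k / q


def subthreshold_current_gain(delta_vth, n=1.5, temp_k=300.0):
    """Current multiplier when the threshold voltage drops by delta_vth volts.

    First-order model only: I is proportional to exp((VGS - Vth) / (n * vT)).
    n is an assumed sub-threshold slope factor.
    """
    return math.exp(delta_vth / (n * thermal_voltage(temp_k)))


for dv_mv in (0, 25, 50, 100):
    gain = subthreshold_current_gain(dv_mv / 1000.0)
    print(f"threshold reduction {dv_mv} mV -> current gain {gain:.1f}x")
```

Under these assumptions, each additional few tens of millivolts of threshold reduction multiplies the current again, which is why modest bulk voltages can produce large sub-threshold speedups.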
Concluding Remarks
- Tunable operation lets the designer choose the operating point (kHz, MHz, GHz); energy dissipation is affected accordingly.
  - Other schemes do not offer this flexibility
  - TBB can lead to significant energy savings
- LFSR results show TBB gives:
  - Maximal 5.7 times speed increase (sub-threshold)
  - Comparable energy at super-threshold and favorable energy at sub-threshold
  - Favorable EDP at all operating regions
  - Operation at the same speed with less energy dissipation
- Idle-state leakage current can be minimized by collapsing the supply voltage
Integrating Research Into Instruction
- Data Path Circuits
- Memory Design
- Sub-System
- ROUTER CHIP
Incorporating Research into Instruction
- A long-term objective is to place some of the integrated chips on development boards such as those Digilent Inc. produces.
- The integrated chips become part of a system and can be used in some of our low-level courses.
- Most important is the use of these programmable boards to showcase the research outcomes, particularly to visiting prospective students.
- A sample development board: [Figure]
Questions and Comments Welcome!
Multiple Clock Domain Synchronization
[Figure: computational modules (synchronous islands) connected through a MicroNetwork using isochronous communication]

$$
f_{fast} = n \cdot f_{slow}, \qquad
\begin{cases}
n = 1 & \text{Equal clocks} \\
n \in \mathbb{Z} & \text{Rational clocks} \\
n \in \mathbb{Q} & \text{Arbitrary clocks}
\end{cases}
$$
Reducing Interconnect Delays
- Improved latency and bandwidth
- Global interconnects are pipelined at or near the rate of computation
Sources of Power Consumption

$P_{total} = P_{static} + P_{dynamic} + P_{short\ circuit}$
$P_{static} = P_{leakage} + P_{DC}$
$P_{dynamic} = V_{dd} \cdot V_{swing} \cdot f_{clk} \cdot C_{load}$
$P_{short\ circuit} = V_{swing} \cdot I_{avg\ short\ circuit}$

- The most straightforward method to reduce power consumption from any source is to reduce VDD
- Controlling frequency directly manipulates dynamic power
- Controlling device threshold manipulates leakage current, affecting leakage and short-circuit power.
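
The power relations on this slide translate directly into a few functions. The sketch below follows the slide's formulas term by term; the load capacitance of 10 fF and the 1 GHz clock are made-up values for illustration, and a full-swing output (Vswing = Vdd) is assumed.

```python
def dynamic_power(vdd, vswing, f_clk, c_load):
    """P_dynamic = Vdd * Vswing * f_clk * C_load (watts, SI units)."""
    return vdd * vswing * f_clk * c_load


def short_circuit_power(vswing, i_avg_sc):
    """P_short_circuit = Vswing * I_avg_short_circuit."""
    return vswing * i_avg_sc


def total_power(p_leakage, p_dc, p_dynamic, p_short_circuit):
    """P_total = (P_leakage + P_DC) + P_dynamic + P_short_circuit."""
    p_static = p_leakage + p_dc
    return p_static + p_dynamic + p_short_circuit


C_LOAD = 10e-15  # assumed 10 fF load
F_CLK = 1e9      # assumed 1 GHz clock

# Full swing assumed, so Vswing equals Vdd in both cases.
p_full = dynamic_power(1.0, 1.0, F_CLK, C_LOAD)
p_half = dynamic_power(0.5, 0.5, F_CLK, C_LOAD)
print(f"dynamic power at 1.0 V: {p_full * 1e6:.2f} uW")
print(f"dynamic power at 0.5 V: {p_half * 1e6:.2f} uW")
```

With a full-swing output, halving VDD at a fixed frequency cuts dynamic power to a quarter, which is why supply reduction is listed as the most direct lever.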

Distributed FIFO Control Circuitry
Traditional vs. Tunable Body Biasing
(LocalClock2)

          Traditional Body Biasing               Tunable Body Biasing                   Tunable BB % diff
Vdd (V)   delay (ps)   freq (GHz)   current (uA)   delay (ps)   freq (GHz)   current (uA)   freq    current
1         111.2        9            3100           103.1        9.7          2988           7.8     -3.6
0.7       172.55       5.8          1240           177.7        5.6          1042           -3.4    -16
0.35      1354.5       0.7383       71             1438         0.6954       72.9           -5.8    -2.7
0.2       96700        0.0103       2.81           16640        0.0601       5.051          483     79.8

The synchronizer/buffer shows an increase in performance at sub-threshold voltages when using tunable body biasing.
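
The frequency columns in this comparison appear to be the reciprocal of the listed delays. That relationship is an inference from the numbers rather than something the slide states, and the sketch below checks it using only the delay values from the table.

```python
# Vdd (V): (traditional delay, TBB delay) in ps, from the comparison table
delays_ps = {
    1.0: (111.2, 103.1),
    0.7: (172.55, 177.7),
    0.35: (1354.5, 1438.0),
    0.2: (96700.0, 16640.0),
}


def freq_ghz(delay_ps):
    """Frequency in GHz for a period given in picoseconds (assumes f = 1 / delay)."""
    return 1000.0 / delay_ps


for vdd, (trad, tbb) in delays_ps.items():
    print(
        f"Vdd = {vdd} V: traditional {freq_ghz(trad):.4f} GHz, "
        f"TBB {freq_ghz(tbb):.4f} GHz, delay ratio {trad / tbb:.2f}x"
    )
```

The computed frequencies agree with the table to the precision shown there, and the delay ratio at 0.2 V is a little under 6x, in line with the large sub-threshold gain the slide reports.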
Tunable Body Biasing

                      Current (uA)                 Power (uW)
Vdd (V)   Max Freq (GHz)   Peak     Avg      Idle     Peak     Avg      Idle

Traditional Body Biasing
1         4                5597     2382     8.696    5597     2382     8.696
0.7       2                2222     803.4    4.873    1555.4   562.38   3.411
0.35      0.125            131.1    35.58    1.468    45.885   12.453   0.514
0.2       0.01             7.452    2.895    1.349    1.49     0.579    0.27

Tunable Body Biasing
1         4                5140     2460     9.54     5140     2460     9.54
0.7       2                2050     833      4.423    1435     583.1    3.096
0.35      0.167            132      39.8     1.589    46.2     13.93    0.556
0.2       0.015            9.468    4.03     1.239    1.894    0.806    0.248
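
The power columns in this table are consistent with the current columns multiplied by the supply voltage. A short sketch of that check, using the first block's 0.7 V and 0.35 V rows (the function name is illustrative):

```python
def power_uw(current_ua, vdd_v):
    """Power in microwatts from current in microamps and supply in volts (P = I * V)."""
    return current_ua * vdd_v


# (Vdd, peak, average, idle) currents in uA, from the first block of the table
rows = [
    (0.7, 2222.0, 803.4, 4.873),
    (0.35, 131.1, 35.58, 1.468),
]

for vdd, peak, avg, idle in rows:
    print(
        f"Vdd = {vdd} V: peak {power_uw(peak, vdd):.3f} uW, "
        f"avg {power_uw(avg, vdd):.3f} uW, idle {power_uw(idle, vdd):.3f} uW"
    )
```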
Pursuit of Low Power Operation
- It is likely that not all IP blocks in a SoC need to operate at high speed
- Power dissipation for those IP blocks could be reduced by operating at a lower voltage
- TBB offers the possibility to dynamically operate at either sub-threshold or super-threshold voltages
Variable Voltage SoC
[Figure: computational modules (synchronous islands) with separate supplies labeled Vdd1 through Vdd5, connected through a MicroNetwork using isochronous communication]
- Consider a SoC with 50 IP blocks, each requiring communication at a rate of 10 MHz
  - Each IP could operate at sub-threshold levels
  - The channel could operate at super-threshold voltages while the IP blocks are in sub-threshold
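
One way to picture per-block voltage selection is a lookup that returns the lowest supply able to sustain a required rate. The sketch below is illustrative only: it borrows the tunable body biasing frequencies from the Traditional vs. Tunable Body Biasing comparison as a stand-in for what an IP block could achieve, and both that mapping and the function name `lowest_vdd_for` are assumptions rather than anything the deck specifies.

```python
# Vdd (V) -> maximum frequency (GHz) with tunable body biasing,
# borrowed from the synchronizer comparison table as illustrative data.
TBB_MAX_FREQ_GHZ = {1.0: 9.7, 0.7: 5.6, 0.35: 0.6954, 0.2: 0.0601}


def lowest_vdd_for(rate_ghz, table=TBB_MAX_FREQ_GHZ):
    """Return the lowest supply voltage whose maximum frequency meets rate_ghz."""
    candidates = [vdd for vdd, f_max in table.items() if f_max >= rate_ghz]
    if not candidates:
        raise ValueError("no listed supply voltage meets the requested rate")
    return min(candidates)


print(lowest_vdd_for(0.010))  # a 10 MHz requirement
print(lowest_vdd_for(1.0))    # a 1 GHz requirement
```

With these numbers, a 10 MHz requirement is met at the lowest listed supply, while a 1 GHz requirement needs one of the higher supplies.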
Idle vs Operating Power

          Idle                          Operating
Vdd (V)   Current (uA)   Power (uW)     Current (uA)   Power (uW)
1         16.9           16.9           2988           2988
0.7       5.3            3.71           1042           729.4
0.35      1.5            0.525          72.9           25.52
0.2       0.925          0.185          5.051          1.01

- During idle periods, it is advantageous to reduce leakage current by
  - Reducing the power supply voltage or
  - Increasing the threshold voltage (e.g. bulk voltage manipulation)
Speed at Varying VDD
[Figure: Delay Comparison of a TBB and Traditional LFSR; minimum clock period (ns) vs. supply voltage (V); series: TBB Delay, Traditional Delay. Annotations: TBB 5.7x faster at 376.2 mV; TBB 20% faster at 1.8 V]
Energy-delay Product
[Figure: Energy Delay Product for TBB with Control 8-Bit LFSR; energy-delay product (ns*fJ) vs. supply voltage (V); series: TBB Energy-delay Product, Traditional Energy-delay Product. Annotation: EDP of TBB outperforms Traditional at ALL operating regions, significantly in super-threshold]
Regions of Operation
[Figure: Delay vs. Energy Dissipation Tradeoff for TBB LFSR; clock period (ns) and energy dissipation (fJ) vs. supply voltage (V), from 0.3262 V to 1.8 V; series: TBB Delay, TBB Energy Dissipation. Annotated operating points: 3.9 MHz with 0.6 fJ/cycle; 222.2 MHz with 103 fJ/cycle; 1.1 GHz with 3.85 nJ/cycle]
Contributions of this work
- Proposed scheme alleviates the communication bottleneck and offers a way to synchronize multiple SoC clocks
  - Perform data transfers up to 10 GHz
- Proposed scheme maintains high performance under the influence of any clock skew
  - 6.5 GHz for any process corner and any skew
- Low-power FIFO scheme with a small impact on area when used in SoCs with many modules
Contributions of this work
- Process corners have a minor impact on performance, resulting in a 10% reduction of speed
- The optimal voltage for minimum energy consumption per transaction is at 2Vth
- Introduction of TBB to address leakage and dynamic power dissipation
  - 500% increase in performance at sub-threshold voltages with a modest 80% increase in power
  - 5-10% less power dissipation than traditional body biasing
Summary of Proposed FIFO Scheme
- Linear FIFO scheme that addresses
  - Signal propagation across the communication channel
- Successful synchronization
  - Synchronizes equal, rational & arbitrary clocks
  - 6.5 GHz sustained performance after process corner analysis using 3 stages.
- Compared to CN scheme
  - Sustained throughput over long distances
  - Fewer devices per stage, fewer stages needed
  - 25% higher performance, 12% lower power
- Operates at both super- and sub-threshold voltages
  - Lower instantaneous power demands from local clocks (less di/dt)
  - Optimal energy per transaction at 0.7 V in a 65 nm process
  - Sub-threshold reduces power by 3 orders of magnitude
  - Tunable Body Biasing provides 50% increased performance in sub-threshold while maintaining super-threshold operation
TBB Scalability

                                 180 nm                                       90 nm
Body Biasing and                 Total Average        Static Power            Total Average        Static Power
Operating Region                 Power Dissipation    Contribution [%]        Power Dissipation    Contribution [%]
Traditional in Sub-threshold     193 pW               0.1%                    13.1 nW              1.8%
Traditional in Super-threshold   39.6 μW              Negligible              22.1 μW              Negligible
TBB in Sub-threshold             1430 pW              25.2%                   20.4 nW              6.1%
TBB in Super-threshold           39.4 μW              0.000034%               22.1 μW              0.0025%

- At 180 nm, the TBB sub-threshold static power percentage is large, and total TBB sub-threshold power is large
- At 90 nm, the percentage difference is much less, and total TBB sub-threshold power isn't so large
LFSR Energy vs. Frequency
[Figure: TBB and Traditional LFSR Energy Dissipation vs Frequency; energy dissipation [fJ] vs. frequency [MHz]; series: Traditional Energy, TBB Energy]
TBB Implementation Cont.
Logic Gate Analysis (Power)
[Figure: Power Dissipation vs Supply Voltage; power dissipation [nW] vs. supply voltage (0.25, 0.3762, 0.75, 1.8); series: Traditional CMOS Power, TBB CMOS Power]
Inverter Power Dissipation

VDD      Power Dissipation [fW]   Average Power [nW]   Maximum Frequency [MHz]   Period [ns]
0.3262   8.27                     3.5                  0.416                     2400.0
0.4262   11.41                    30.0                 2.6                       380.0
0.5643   15.64                    651.6                41.7                      24.0
1.8      82.30                    68.60                833.3                     1.2

VDD      Power Dissipation [fW]   Average Power [nW]   Maximum Frequency [MHz]   Period [ns]
0.3262   8.52                     22.4                 2.6                       380.0
0.4262   13.00                    259.8                20.                       50.0
0.5643   15.13                    2102.0               138.9                     7.2
1.8      81.47                    81.5                 1000.                     1.0
Logic Gate Analysis (Energy)
[Figure: Energy Dissipation vs Supply Voltage; energy dissipation [fJ] vs. supply voltage [V] (0.25, 0.3762, 0.75, 1.8); series: Traditional CMOS Energy, TBB CMOS Energy]
Logic Gate Analysis (EDP)
[Figure: EDP vs Power Supply; EDP [fJ*ns] vs. supply voltage [V] (0.25, 0.3762, 0.75, 1.8); series: Traditional CMOS EDP, TBB CMOS EDP]
Logic Gate Analysis (Fan-in)
[Figure: propagation delay [ns] vs. number of inputs (one to four); series: Traditional NAND, TBB NAND, Traditional NOR, TBB NOR]
Logic Gate Analysis (Logic Styles)
[Figure: Energy Dissipation vs Supply Voltage; energy dissipated [fJ] vs. supply voltage [V] at 0.5*Vthn, 0.75*Vthn, Vthn - 50 mV, Vthn, Vthn + 50 mV, and 1.5*Vthn; series: Traditional Pseudo-nMOS Energy, TBB Pseudo-nMOS Energy]
Power Comparison of a TBB and Traditional LFSR
[Figure: LFSR Power Dissipation; average power dissipation (uW) vs. supply voltage (V); series: TBB Power, Traditional Power]
Device Optimization (Optimal Region)
[Figure: Delay vs. Energy Dissipation Tradeoff for TBB LFSR; clock period (ns) and energy dissipation (fJ) vs. supply voltage (V), from 0.3262 V to 1.8 V; series: TBB Delay, TBB Energy Dissipation]
Regions of Operation

                   Super-threshold (1.8 V)     Sub-threshold (250 mV)      Optimal (750 mV)
Design             Delay (ns)   Energy (fJ)    Delay (ns)   Energy (fJ)    Delay (ns)   Energy (fJ)
Traditional LFSR   0.7          437.6          20000        105            7            74.1
TBB LFSR           0.6          437            4500         22.8           4.5          73.6
Operating range    GHz                         kHz                         MHz
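
The energy-delay product for each region can be recomputed from this table as delay times energy. The sketch below uses only the table's values; the dictionary layout and function name are illustrative.

```python
# region -> design -> (delay in ns, energy in fJ), from the regions-of-operation table
lfsr = {
    "super-threshold (1.8 V)": {"Traditional": (0.7, 437.6), "TBB": (0.6, 437.0)},
    "sub-threshold (250 mV)": {"Traditional": (20000.0, 105.0), "TBB": (4500.0, 22.8)},
    "optimal (750 mV)": {"Traditional": (7.0, 74.1), "TBB": (4.5, 73.6)},
}


def edp(delay_ns, energy_fj):
    """Energy-delay product in ns*fJ."""
    return delay_ns * energy_fj


for region, designs in lfsr.items():
    trad = edp(*designs["Traditional"])
    tbb = edp(*designs["TBB"])
    print(f"{region}: traditional {trad:.1f}, TBB {tbb:.1f}, ratio {trad / tbb:.2f}x")
```

On these numbers the TBB LFSR has the lower energy-delay product in all three regions, with by far the largest ratio at the 250 mV sub-threshold point.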
Logic Gate Results
- Results Highlights
  - TBB, SBB, and DTMOS increase speed up to 7 times in sub-threshold
  - Static CMOS has the best overall logic style performance
    - Pseudo-nMOS, Domino, and pass-transistor are still valuable in niche situations
  - TBB and Traditional noise margins are comparable