Grant Proposal for Project Name
Circuits and Architectures to Deliver Low Power and High Speed Systems
By: Jabulani Nyathi
Washington State University
School of EECS
April 30, 2009
Outline
CMOS Scaling
Its benefits
The challenges it brings about
Various Techniques for Limiting Leakage Currents
Their shortfalls
Bridging the Speed-Power Gap
The Tunable Body Biasing Scheme
Emerging Devices and Technologies
Concluding Remarks
CMOS Scaling and its Benefits
Aggressive CMOS scaling has been a very positive development, allowing:
Fast switching devices, and thus high-speed computing.
Massive integration due to miniaturization.
We no longer need multiple chips to implement a microprocessor and its peripherals.
In fact, we can now have multiple computing elements on a single die, resulting in a system on a chip.
CMOS Scaling and its Challenges
CMOS scaling results in:
Increased leakage currents (5X/node) and
Increased dynamic power dissipation.
The interconnect does not scale as fast as the transistor, thus:
Highly integrated designs require elaborate clock distribution schemes.
IPs within a System on a Chip would be difficult to synchronize with a single clock source.
Scaling Implications
[Figure: Module1 and Module2 with scaled local interconnects, connected by global interconnects.]
Dynamic Vs Leakage Power
Research Motivation
Desire to bridge the speed-power gap by exploring the feasibility of optimizing devices to operate effectively at both sub-threshold and above-threshold voltages.
Emerging technologies that are ultra-low power can benefit from increased speed:
Wearable computers, sensor networks, implantable medical technology
Emphasis on design for energy efficiency
Existing Low Power Design Approaches
Solve energy dissipation problem from a
region of operation standpoint
Sub-threshold design
DTMOS: shows a 5.5 times increase in current
SBB: 4.4 times frequency increase
Above threshold (Super-threshold) design
Dynamic threshold provides energy efficiency
MTCMOS: high and low threshold devices
VT Scheme: reduce power by 50% using ABB and
“sleep”/“active” modes
Architectural
Gating Techniques: 45% of total power
DTMOS/SBB Output Voltage Clamping
[Figure: output voltage comparison with labels SBB, DTMOS, TBB, and Traditional; marked levels 1.8 V and 600 mV.]
Proposed Approach
Change approach to include all possible operating regions: Tunable Body Biasing (TBB)
Sub-threshold and super-threshold operation bridged
Ultra-low energy and low speed, or high energy and high speed
Utilize body biasing to improve performance of sub-threshold operation
Target increased performance at sub-threshold and slightly above threshold
Save energy by eliminating idle time and processing continuously with variable power supplies (perform just-in-time task completion)
Target applications
Mobile, battery-operated (power-constrained), variable-processing devices
Cell phones, PDAs, notebooks, wireless sensors, embedded systems, ASICs, medical technology, etc.
TBB Implementation
Goals
Attain ON-state current gain while minimizing OFF-state leakage current increase
Highlight advantages of sub-threshold operation while allowing super-threshold operation if needed
Control bulk terminal to tunable potentials depending on VDD and desired region of operation
MOS Bulk Control Circuits
Multiplexer-based approach
Two transistors per bulk control circuit
Utilizes Vthn0
TBB Bulk Control Circuits
TBB MOS Bulk Control Signal
VDD | pMOS Bulk | nMOS Bulk
VSS < VDD ≤ Vthn0 | VSS | VDD
VDD > Vthn0 | VDD – Vthn0 | Vthn0
Relies on the good/poor logic “1” and logic “0” passing properties of pass-transistors
Requires external control signals SubVt and SubVt_b
TBB Bulk Control Circuit Simulation
Super-threshold: pBulk = VDD – Vthn0
Sub-threshold: pBulk = 0 V
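The bulk control signal table can be read as a simple selection rule on VDD. The Python sketch below is illustrative only and is not the bulk control circuit itself: the function name `tbb_bulk_bias` is hypothetical, VSS is assumed to be 0 V, and the Vthn0 value of 376.2 mV is an assumption based on the supply points that appear elsewhere in this deck.

```python
def tbb_bulk_bias(vdd, vthn0, vss=0.0):
    """Return (pmos_bulk, nmos_bulk) in volts for a given supply voltage.

    Mirrors the TBB bulk control table:
      VSS < VDD <= Vthn0 : pMOS bulk = VSS,         nMOS bulk = VDD
      VDD > Vthn0        : pMOS bulk = VDD - Vthn0, nMOS bulk = Vthn0
    """
    if not vss < vdd:
        raise ValueError("vdd must be greater than vss")
    if vdd <= vthn0:
        # Sub-threshold row of the table.
        return vss, vdd
    # Super-threshold row of the table.
    return vdd - vthn0, vthn0


VTHN0 = 0.3762  # assumed value in volts, for illustration only

for vdd in (0.25, 0.3762, 1.0, 1.8):
    p_bulk, n_bulk = tbb_bulk_bias(vdd, VTHN0)
    print(f"VDD={vdd:.4f} V  pBulk={p_bulk:.4f} V  nBulk={n_bulk:.4f} V")
```

With these assumptions, the sub-threshold case returns pBulk = 0 V and the super-threshold case returns pBulk = VDD – Vthn0, matching the two simulation results listed on this slide.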
Device Optimization
TBB encourages varying supply voltages
How will devices be sized for optimal operation at any supply voltage?
Maintain symmetric switching
Examine inverter at varying supply voltages
Device Optimization (Switching Point)
VDD | Ideal Inverter Threshold | Simulated Inverter Threshold | Percent Variation
1.8 V | 900 mV | 900 mV | 0.0%
1.0 V | 500 mV | 498 mV | 0.4%
376.2 mV | 188.1 mV | 198.7 mV | 5.6%
188.1 mV | 94.05 mV | 108.6 mV | 13.4%
Sub-threshold Noise Margins
Noise margins are significant for proper logic levels
TBB and traditional static CMOS inverters have comparable noise margins
TBB VIH is 12.5% worse
TBB VIL is 14.3% better
Propagation Delay
[Figure: Static CMOS at Vdd = Vthn0 with varying Body Biasing. Average switching delay (ns) of the transmission gate, inverter, two-input NAND, two-input NOR, and two-input XOR for Traditional, SBB, TBB, and DTMOS.]
Gate | Traditional Delay | TBB Delay | % Decrease
TG | 98 ns | 14 ns | 86
Inv | 125 ns | 20 ns | 84
NAND | 133 ns | 18 ns | 86
NOR | 163 ns | 25 ns | 85
XOR | 289 ns | 40 ns | 89
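The delay table can also be expressed as a speedup ratio (traditional delay divided by TBB delay), which is how a figure such as "up to 7x" is usually stated. The short Python sketch below uses only the rounded delays from the table; the names `delays_ns` and `speedup` are illustrative, not from the original work.

```python
# (traditional delay, TBB delay) in ns, copied from the propagation delay table.
delays_ns = {
    "TG": (98, 14),
    "Inv": (125, 20),
    "NAND": (133, 18),
    "NOR": (163, 25),
    "XOR": (289, 40),
}


def speedup(traditional_ns, tbb_ns):
    """Speedup factor of TBB over the traditional gate."""
    return traditional_ns / tbb_ns


speedups = {gate: speedup(*pair) for gate, pair in delays_ns.items()}

for gate, factor in speedups.items():
    print(f"{gate}: {factor:.2f}x faster with TBB")
```

Because the inputs are the rounded table entries, the ratios are approximate; they fall between roughly 6x and 7.5x for these five gates.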
Review of SubVth Circuits Benefits
So far, the presentation has shown:
TBB requires control of MOS bulks to span the operating regions of interest. Implementation is successful.
Study of simple logic gates showed:
TBB gives a dramatic speed increase (up to 7x)
Static CMOS design style is suitable for sub-threshold and super-threshold operation
Sizing of efficient devices for the TBB approach is possible
However, how will a complex system perform?
Design with previous knowledge (logic style, sizing)
Analyze post-layout simulations
Complex System-on-Chip Design Using TBB
Work addresses the challenges of:
Global interconnect delays
Clock distribution
Synchronization of unrelated clocks and
Power dissipation
Conclusion
The TBB scheme has been devised to span all regions of operation, from ultra-low power to high speed. It is a new kind of body biasing.
Forward-biasing causes exponential sub-threshold current gain
Focus on sub-threshold and slightly above threshold to utilize leakage
Bulk control circuits are effective
Leads to 7 times frequency increase in simple logic gates
4% area and 8.9% power dissipation increase
Static CMOS is the ideal overall design style
Device sizing at either sub-threshold or super-threshold allows efficient operation with variable supply voltages
Concluding Remarks
Tunable operation lets the designer choose the operating point (kHz, MHz, GHz); energy dissipation is affected accordingly.
Other schemes do not offer this flexibility
TBB can lead to significant energy savings
LFSR results show TBB gives:
Maximal 5.7 times speed increase (sub-threshold)
Comparable energy at super-threshold and favorable energy at sub-threshold
Favorable EDP at all operating regions
Operate at the same speed with less energy dissipation
Idle state leakage current can be minimized by collapsing the supply voltage
Integrating Research Into Instruction
[Figure: labels Data Path Circuits, Memory Design, Sub-System, Router, Chip.]
Incorporating Research into Instruction
A long term objective is to place some of the integrated chips
on development boards such as those Digilent Inc produces.
The integrated chips become part of a system and can be used
in some of our low level courses.
Most important is the use of these programmable boards to showcase the research outcomes, particularly to visiting prospective students.
A sample development board:
Questions and Comments Welcome!
Multiple Clock Domain Synchronization
$\frac{f_{fast}}{f_{slow}} = n$, with $n = 1$ for equal clocks, $n \in \mathbb{Z}$ for rational clocks, and $n \in \mathbb{Q}$ for arbitrary clocks
[Figure: computational modules arranged as synchronous islands, linked through a MicroNetwork by isochronous communication.]
Reducing Interconnect Delays
Improved latency and bandwidth
Global interconnects are pipelined at or near the rate of computation
Sources of Power Consumption
$P_{total} = P_{static} + P_{dynamic} + P_{short\,circuit}$
$P_{static} = P_{leakage} + P_{DC}$
$P_{dynamic} = V_{dd} \cdot V_{swing} \cdot f_{clk} \cdot C_{load}$
$P_{short\,circuit} = V_{swing} \cdot I_{avg,\,short\,circuit}$
The most straightforward method to reduce power consumption from any source is to reduce VDD
Controlling frequency directly manipulates dynamic power
Controlling device threshold manipulates leakage current, affecting leakage and short-circuit power
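To make the power relations on this slide concrete, the Python sketch below evaluates each term for one operating point. The function names and all numeric values (1 fF load, 1 GHz clock, and the leakage, DC, and short-circuit figures) are hypothetical, chosen only for illustration. The operators between terms are read as sums and products, which is the conventional form of these expressions.

```python
def dynamic_power(vdd, vswing, f_clk, c_load):
    """P_dynamic = Vdd * Vswing * f_clk * C_load (watts)."""
    return vdd * vswing * f_clk * c_load


def short_circuit_power(vswing, i_avg_sc):
    """P_short_circuit = Vswing * I_avg_short_circuit (watts)."""
    return vswing * i_avg_sc


def static_power(p_leakage, p_dc):
    """P_static = P_leakage + P_DC (watts)."""
    return p_leakage + p_dc


def total_power(p_static, p_dynamic, p_short_circuit):
    """P_total = P_static + P_dynamic + P_short_circuit (watts)."""
    return p_static + p_dynamic + p_short_circuit


# Hypothetical operating point: full-swing logic, 1 fF load, 1 GHz clock.
p_dyn_full = dynamic_power(vdd=1.0, vswing=1.0, f_clk=1e9, c_load=1e-15)
# Halving VDD (and therefore the full-rail swing) at the same frequency.
p_dyn_half = dynamic_power(vdd=0.5, vswing=0.5, f_clk=1e9, c_load=1e-15)

p_total = total_power(
    static_power(p_leakage=1e-8, p_dc=0.0),
    p_dyn_full,
    short_circuit_power(vswing=1.0, i_avg_sc=1e-7),
)
print(f"dynamic @1.0 V: {p_dyn_full:.3e} W")
print(f"dynamic @0.5 V: {p_dyn_half:.3e} W")
print(f"total   @1.0 V: {p_total:.3e} W")
```

For rail-to-rail logic, where Vswing tracks Vdd, halving the supply cuts the dynamic term to one quarter, which is why reducing VDD is singled out as the most direct lever.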
Distributed FIFO Control Circuitry
Traditional vs. Tunable Body Biasing (LocalClock2)
Vdd (V) | Traditional delay (ps) | Traditional freq (GHz) | Traditional current (uA) | Tunable delay (ps) | Tunable freq (GHz) | Tunable current (uA) | Tunable BB % diff, freq | Tunable BB % diff, current
1 | 111.2 | 9 | 3100 | 103.1 | 9.7 | 2988 | 7.8 | -3.6
0.7 | 172.55 | 5.8 | 1240 | 177.7 | 5.6 | 1042 | -3.4 | -16
0.35 | 1354.5 | 0.7383 | 71 | 1438 | 0.6954 | 72.9 | -5.8 | -2.7
0.2 | 96700 | 0.0103 | 2.81 | 16640 | 0.0601 | 5.051 | 483 | 79.8
The synchronizer/buffer shows an increase in performance at sub-threshold
voltages when using tunable body biasing
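The frequency and percent-difference columns in the table follow from the delay and current columns. The Python sketch below shows that arithmetic for the 1 V and 0.2 V rows; the helper names `freq_ghz` and `percent_diff` are illustrative, and the frequency is assumed to be the reciprocal of the listed delay.

```python
def freq_ghz(delay_ps):
    """Frequency in GHz for a period given in picoseconds (f = 1 / T)."""
    return 1e3 / delay_ps


def percent_diff(tunable, traditional):
    """Percent change of the tunable value relative to the traditional value."""
    return 100.0 * (tunable - traditional) / traditional


# 1 V row: delays of 111.2 ps (traditional) and 103.1 ps (tunable).
print(f"{freq_ghz(111.2):.2f} GHz vs {freq_ghz(103.1):.2f} GHz")

# 0.2 V row: frequency and current differences, using the table entries.
print(f"freq diff:    {percent_diff(0.0601, 0.0103):.0f} %")
print(f"current diff: {percent_diff(5.051, 2.81):.1f} %")
```

At 0.2 V this reproduces the roughly fivefold frequency gain together with the roughly 80% current increase reported for the sub-threshold case.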
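Tunable Body Biasing
Body Biasing | Vdd (V) | Max Freq (GHz) | Peak Current (uA) | Avg Current (uA) | Idle Current (uA) | Peak Power (uW) | Avg Power (uW) | Idle Power (uW)
Traditional | 1 | 4 | 5597 | 2382 | 8.696 | 5597 | 2382 | 8.696
Traditional | 0.7 | 2 | 2222 | 803.4 | 4.873 | 1555.4 | 562.38 | 3.411
Traditional | 0.35 | 0.125 | 131.1 | 35.58 | 1.468 | 45.885 | 12.453 | 0.514
Traditional | 0.2 | 0.01 | 7.452 | 2.895 | 1.349 | 1.49 | 0.579 | 0.27
Tunable | 1 | 4 | 5140 | 2460 | 9.54 | 5140 | 2460 | 9.54
Tunable | 0.7 | 2 | 2050 | 833 | 4.423 | 1435 | 583.1 | 3.096
Tunable | 0.35 | 0.167 | 132 | 39.8 | 1.589 | 46.2 | 13.93 | 0.556
Tunable | 0.2 | 0.015 | 9.468 | 4.03 | 1.239 | 1.894 | 0.806 | 0.248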
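The power columns in this table are the current columns multiplied by the supply voltage. The Python sketch below checks that relation on the peak-current entries of the tunable body biasing rows; the names `power_uw` and `tunable_peak_ua` are illustrative only.

```python
def power_uw(vdd_v, current_ua):
    """Power in microwatts from supply voltage (V) and current (uA)."""
    return vdd_v * current_ua


# Vdd (V) -> peak current (uA), tunable body biasing rows of the table.
tunable_peak_ua = {1.0: 5140, 0.7: 2050, 0.35: 132, 0.2: 9.468}

tunable_peak_uw = {
    vdd: power_uw(vdd, current) for vdd, current in tunable_peak_ua.items()
}

for vdd, power in tunable_peak_uw.items():
    print(f"Vdd={vdd} V  peak power={power:.3f} uW")
```

The same multiplication applies to the average and idle columns, so only the currents and the supply voltage need to be measured.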
Pursuit of Low Power Operation
It is likely that not all IP blocks in a SoC need to operate at high speed
Power dissipation for those IP blocks could be reduced by operating at a lower voltage
TBB offers the possibility to dynamically operate at either sub-threshold or super-threshold voltages
Variable Voltage SoC
[Figure: computational modules arranged as synchronous islands on separate supplies (Vdd1 to Vdd5), linked through a MicroNetwork by isochronous communication.]
Consider a SoC with 50 IP blocks, each requiring communication at a rate of 10 MHz
Each IP could operate at sub-threshold levels
The channel could operate at super-threshold voltages while the IP blocks are in sub-threshold
Idle vs Operating Power
Vdd (V) | Idle Current (uA) | Idle Power (uW) | Operating Current (uA) | Operating Power (uW)
1 | 16.9 | 16.9 | 2988 | 2988
0.7 | 5.3 | 3.71 | 1042 | 729.4
0.35 | 1.5 | 0.525 | 72.9 | 25.52
0.2 | 0.925 | 0.185 | 5.051 | 1.01
During idle periods, it is advantageous to reduce leakage current by:
Reducing the power supply voltage or
Increasing the threshold voltage (e.g. bulk voltage manipulation)
Speed at Varying VDD
[Figure: Delay Comparison of a TBB and Traditional LFSR. Minimum clock period (ns, 1 to 100000) vs. supply voltage (V, 0 to 2) for TBB delay and traditional delay. Annotations: TBB 5.7x faster at 376.2 mV; TBB 20% faster at 1.8 V.]
Energy-delay Product
[Figure: Energy Delay Product for TBB with Control 8-Bit LFSR. Energy-delay product (ns*fJ, 100 to 10000000) vs. supply voltage (V, 0 to 2) for TBB and traditional designs.]
EDP of TBB outperforms Traditional at ALL operating regions, significantly in super-threshold
Regions of Operation
[Figure: Delay vs. Energy Dissipation Tradeoff for TBB LFSR. Clock period (ns) and energy dissipation (fJ) vs. supply voltage (V) at 0.3262, 0.3762, 0.5643, 0.7524, 1.1286, 1.5048, and 1.8 V. Annotations: 1.1 GHz with 3.85 nJ/cycle; 222.2 MHz with 103 fJ/cycle; 3.9 MHz with 0.6 fJ/cycle.]
Contributions of this work
Proposed scheme alleviates the communication bottleneck and offers a way to synchronize multiple SoC clocks
Perform data transfers up to 10 GHz
Proposed scheme maintains high performance under the influence of any clock skew
6.5 GHz for any process corner and any skew
Low-power FIFO scheme with a small impact on area when used in SoCs with many modules
Contributions of this work
Process corners have a minor impact on performance, resulting in a 10% reduction of speed
The optimal voltage for minimum energy consumption per transaction is at 2Vth
Introduction of TBB to address leakage and dynamic power dissipation
500% increase in performance at sub-threshold voltages with a modest 80% increase in power
5-10% less power dissipation than traditional body biasing
Summary of Proposed FIFO Scheme
Linear FIFO scheme that addresses
Signal propagation across communication channel
Successful Synchronization
Synchronizes equal, rational & arbitrary clocks
6.5 GHz sustained performance after process corner analysis using 3 stages.
Compared to CN scheme
Sustained throughput over long distances
Fewer devices per stage, fewer stages needed
25% higher performance, 12% lower power
Operates at both super- and sub-threshold voltages
Lower instantaneous power demands from local clocks (less di/dt)
Optimal energy per transaction at 0.7V in a 65nm process
Sub-threshold reduces power by 3 orders of magnitude
Tunable Body Biasing provides 50% increased performance in sub-threshold while
maintaining super-threshold operation
TBB Scalability
At 90 nm, the % difference is much less
At 180 nm, TBB sub-threshold static power % is large
Body Biasing and Operating Region | 180 nm Total Average Power Dissipation | 180 nm Static Power Contribution [%] | 90 nm Total Average Power Dissipation | 90 nm Static Power Contribution [%]
Traditional in Sub-threshold | 193 pW | 0.1% | 13.1 nW | 1.8%
Traditional in Super-threshold | 39.6 μW | Negligible | 22.1 μW | Negligible
TBB in Sub-threshold | 1430 pW | 25.2% | 20.4 nW | 6.1%
TBB in Super-threshold | 39.4 μW | 0.000034% | 22.1 μW | 0.0025%
180 nm: total TBB sub-threshold power is large
90 nm: total TBB sub-threshold power isn’t so large
LFSR Energy vs. Frequency
[Figure: TBB and Traditional LFSR Energy Dissipation vs Frequency. Energy dissipation (fJ, 0 to 225) vs. frequency (MHz, 0 to 1100) for traditional energy and TBB energy.]
TBB Implementation Cont.
Logic Gate Analysis (Power)
[Figure: Power Dissipation vs Supply Voltage. Power dissipation (nW, 0.0001 to 1000) at supply voltages of 0.25, 0.3762, 0.75, and 1.8 V for traditional CMOS power and TBB CMOS power.]
Inverter Power Dissipation
VDD | Power Dissipation [fW] | Average Power [nW] | Maximum Frequency [MHz] | Period [ns]
0.3262 | 8.27 | 3.5 | 0.416 | 2400.0
0.4262 | 11.41 | 30.0 | 2.6 | 380.0
0.5643 | 15.64 | 651.6 | 41.7 | 24.0
1.8 | 82.30 | 68.60 | 833.3 | 1.2

VDD | Power Dissipation [fW] | Average Power [nW] | Maximum Frequency [MHz] | Period [ns]
0.3262 | 8.52 | 22.4 | 2.6 | 380.0
0.4262 | 13.00 | 259.8 | 20. | 50.0
0.5643 | 15.13 | 2102.0 | 138.9 | 7.2
1.8 | 81.47 | 81.5 | 1000. | 1.0
Logic Gate Analysis (Energy)
[Figure: Energy Dissipation vs Supply Voltage. Energy dissipation (fJ, 0 to 180) at supply voltages of 0.25, 0.3762, 0.75, and 1.8 V for traditional CMOS energy and TBB CMOS energy.]
Logic Gate Analysis (EDP)
[Figure: EDP vs Power Supply. EDP (fJ*ns) at supply voltages of 0.25, 0.3762, 0.75, and 1.8 V for traditional CMOS EDP and TBB CMOS EDP.]
Logic Gate Analysis (Fan-in)
[Figure: Propagation delay (ns, 0 to 1400) vs. number of inputs (one to four) for traditional NAND, TBB NAND, traditional NOR, and TBB NOR.]
Logic Gate Analysis (Logic Styles)
[Figure: Energy Dissipation vs Supply Voltage. Energy dissipated (fJ, 0 to 70) for traditional pseudo-nMOS and TBB pseudo-nMOS at supply voltages of 0.5*Vthn, 0.75*Vthn, Vthn - 50 mV, Vthn, Vthn + 50 mV, and 1.5*Vthn.]
Power Comparison of a TBB and Traditional LFSR
[Figure: LFSR Power Dissipation. Average power dissipation (uW) vs. supply voltage (V, 0 to 2) for TBB power and traditional power.]
Device Optimization (Optimal Region)
[Figure: Delay vs. Energy Dissipation Tradeoff for TBB LFSR. Clock period (ns, 0 to 4000) and energy dissipation (fJ, 0 to 4500000) vs. supply voltage (V) at 0.3262, 0.3762, 0.5643, 0.7524, 1.1286, 1.5048, and 1.8 V.]
Regions of Operation
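Design | Super-threshold (1.8 V) Delay (ns) | Super-threshold (1.8 V) Energy (fJ) | Sub-threshold (250 mV) Delay (ns) | Sub-threshold (250 mV) Energy (fJ) | Optimal (750 mV) Delay (ns) | Optimal (750 mV) Energy (fJ)
Traditional LFSR | 0.7 | 437.6 | 20000 | 105 | 7 | 74.1
TBB LFSR | 0.6 | 437 | 4500 | 22.8 | 4.5 | 73.6
Frequency range | GHz | | kHz | | MHz |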
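The energy-delay product (EDP) comparison can be recomputed from the delay and energy figures in this table. The Python sketch below does so for the three operating regions, taking EDP as delay multiplied by energy (ns*fJ). The dictionary layout and the names `lfsr`, `edp`, and `edp_table` are illustrative choices, not part of the original analysis.

```python
# (delay in ns, energy in fJ) per design, copied from the Regions of Operation table.
lfsr = {
    "super-threshold (1.8 V)": {"Traditional": (0.7, 437.6), "TBB": (0.6, 437.0)},
    "sub-threshold (250 mV)": {"Traditional": (20000, 105.0), "TBB": (4500, 22.8)},
    "optimal (750 mV)": {"Traditional": (7.0, 74.1), "TBB": (4.5, 73.6)},
}


def edp(delay_ns, energy_fj):
    """Energy-delay product in ns*fJ."""
    return delay_ns * energy_fj


edp_table = {
    region: {design: edp(*values) for design, values in designs.items()}
    for region, designs in lfsr.items()
}

for region, row in edp_table.items():
    ratio = row["Traditional"] / row["TBB"]
    print(
        f"{region}: traditional={row['Traditional']:.1f}, "
        f"TBB={row['TBB']:.1f}, ratio={ratio:.2f}"
    )
```

With these table values, the TBB LFSR has the lower EDP in all three regions, and the gap is widest in the sub-threshold row.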
Logic Gate Results
Results Highlights
TBB, SBB, and DTMOS increase speed up to 7 times in sub-threshold
Static CMOS has the best overall logic style performance
Pseudo-nMOS, Domino, and pass-transistor are still valuable in niche situations
TBB and Traditional noise margins are comparable