Transcript Slide 1

Design in the Nanometer Regime: From Devices to System Architecture
Kaushik Roy
Purdue University
Challenges ahead …
in Si nanometer regime
Exponential Increase in Leakage
[Figure: technology timeline, 1970 to 2020, feature size scaling from 5 µm through 1 µm and 100 nm to 10 nm, covering silicon microelectronics, silicon nanoelectronics, and non-silicon technology. ION/IOFF ratios shown: 10^6, 10^3, and ~10^2 to 10^6.]
[Figure: MOSFET cross-section (gate, source, drain, n+ regions, bulk) showing subthreshold leakage, gate leakage, and junction leakage.]
[Figure: leakage power (% of total, 0% to 50%) vs. technology (1.5, 0.7, 0.35, 0.18, 0.09, 0.05 µm). Annotation: "Must stop at 50%". Source: A. Grove, IEDM 2002.]
Technology Trend
[Figure: device roadmap, 2003 / 2009 / 2020, with device cross-sections showing gate, source, drain, buried oxide (BOX), substrate, and terminals VG, VD, VS, Vback.]
Single gate devices: Bulk-CMOS, PD/SOI (floating body), FD/SOI (fully-depleted body)
Multi-gate devices: DGMOS, FinFET, Trigate
Nano devices: carbon nanotube, III-V devices, nano-wires, spintronics
Design methods to exploit the advantages of
technology innovations
Variation in Process Parameters
[Figure: delay and leakage spread at 130nm: normalized frequency (0.9 to 1.4, 30% spread) vs. normalized leakage Isb (1 to 5, 5X spread). Source: Intel.]
[Figure: random dopant fluctuation: number of dopant atoms (10 to 10000) vs. technology node (1000, 500, 250, 130, 65, 32 nm). Source: Intel.]
[Figure: channel length differing between Device 1 and Device 2.]
Inter- and intra-die variations
Device parameters are no longer deterministic
Reliability
[Figure: failure probability vs. time (defects, lifetime degradation) across technology generations.]
Temporal degradation of performance -- NBTI
Device-Aware Circuit/Architecture is Essential
Right type of device with right circuit and architecture
Low-Power and High-Performance VLSI Research
Wireless
Communications
- Low Power
- Coding / Modulation
High Speed
Arithmetic
- Sharing Multiplier
for Vector Scaling
Carbon
Nano-tubes
-circuits
-architecture
Nano
Circuits
& Arch.
NBTI
- Differential / Redundant Coeff.
- Distributed Multiplication
- Filter / Image Compression
Power
Delivery
- Analysis
- Design for Rel.
Self-Healing/
Self-Calibration
Process-Tolerant
Design
Low Power
VLSI Signal
Processing
Reliability, Noise
& Power Del.
- Logic (Sizing, Body Bias)
- Memory
Failure Analysis
& Yield
Performance/Power Aware Computing Process
Subthreshold, Gate,
Jn. BTBT, GIDL,..
-Transistor Stacking
- Multiple Vt
- Dynamic Vt
Low Complexity
Leakage
Control
Variation
Wavelet based Idd
Analysis
Device/Circuit
Design
- Idd Testing
-Mixed Signal
Active Leakage
Reduction
Low Leakage
Memory
-- Dynamic Vt
- DRG Cache
Digital
Sub-threshold Logic
- Ultra Low Power
- Self Adjusting Vt
Device Modeling
& Circuits
- Bulk, SOI
Caches
- Reconfigurable Cache
-Gated Gnd, Clocking
-- Dynamic Vdd
Optimal SOI Electro-thermal
Design
Devices
- DG-SOI
- 3D-SOI
-Device/Circuit/Arch
co-design
Professor Kaushik Roy
ECE, Purdue University
Memories: Leakage Reduction
& Process Compensation
Device-aware Circuit/Microarch: Cache
Bulk Ultra-high Vt
Nominal Vt
Ground-plane SOI
FinFET
Circuit Design Issues
Leakage – Sub-threshold, Gate, Junction, BTBT
Stability – Read noise margin, Writability, Soft errors
Delay – Decoder, Wordline, Bitline, MUX, Sense-amp, Driver
Transition between active and standby modes
Variations – Process, Vdd, Temperature
Microarch Design Issues
Array aspect ratio – # cells WL/BL
Sub-array structure and selection strategy
Active-Standby transition frequency, delay, energy
How do you co-design?
Bulk Nominal Vt Source-biased (Supply Gated) Cache
Bulk Ultra-high Vt
Nominal Vt
Ground-plane SOI
FinFET
SB-SRAM Circuit Design Issues
• Data retention (VGND should be strapped)
• Noise issue
• Process variation tracking sleep control
[Figure: SRAM array with hot cache line, column I/O, periodic SLEEP generation, and VGND holding circuit biased at VSB.]
[Figure: self-decay sleep control circuit (CLK, VGEN, VDECAY, VREF, delay, SLEEP).]
[Figure: VGND holding circuit (BIAS, SLEEP, VSB).]
SB-SRAM Microarch Design Issues
Use locality of reference in cache to reduce transition energy
Optimum memory sub-array size selection
Sleep time Tsleep selection
Co-design approach leads to higher payoffs and more opportunities
Basic Idea: Supply Gating
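[Figure: single off transistor (gate at '0') next to a two-transistor stack, M1 over M2, both gates at '0', with intermediate node VM > 0.]
Single off transistor: Vgs = 0, Vbs = 0, Vds = Vdd
In the stack, M1 sees negative Vgs, negative Vbs (more body effect), and reduced Vds (less DIBL)
2-T stack has lower subthreshold leakage
For M1:
Vgs = -VM < 0, Vbs = -VM < 0,
Vds = Vdd - VM < Vdd
For M2:
Vgs = 0, Vbs = 0,
Vds = VM < Vdd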
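The bias conditions listed for M1 and M2 can be checked numerically. The following is a minimal sketch, assuming a simple exponential subthreshold model with a linearized body effect and DIBL term; the model form and every parameter value (`i0`, `vth0`, `n`, `eta`, `gamma`) are illustrative assumptions, not values from the slides. It finds the intermediate node voltage VM at which the two stacked currents match and compares the stack leakage with that of a single off transistor.

```python
import math

V_THERMAL = 0.0259  # thermal voltage kT/q near room temperature, in volts


def subthreshold_current(vgs, vbs, vds, i0=1e-6, vth0=0.3, n=1.5,
                         eta=0.1, gamma=0.2):
    """Illustrative subthreshold current model (all parameters assumed).

    A negative vbs raises the threshold (body effect); a larger vds
    lowers it (DIBL).
    """
    vth = vth0 - gamma * vbs - eta * vds
    return (i0 * math.exp((vgs - vth) / (n * V_THERMAL))
            * (1.0 - math.exp(-vds / V_THERMAL)))


def solve_stack(vdd, tol=1e-9):
    """Bisect on VM until the current through M1 equals that through M2."""
    lo, hi = 0.0, vdd
    while hi - lo > tol:
        vm = 0.5 * (lo + hi)
        i_m1 = subthreshold_current(-vm, -vm, vdd - vm)  # upper transistor
        i_m2 = subthreshold_current(0.0, 0.0, vm)        # lower transistor
        if i_m1 > i_m2:
            lo = vm  # node still charging up: VM must be higher
        else:
            hi = vm
    vm = 0.5 * (lo + hi)
    return vm, subthreshold_current(0.0, 0.0, vm)


vdd = 1.0
i_single = subthreshold_current(0.0, 0.0, vdd)
vm, i_stack = solve_stack(vdd)
print(f"VM = {vm:.3f} V, stack/single leakage ratio = {i_stack / i_single:.3f}")
```

Under this model the stack always leaks less than the single device, because M2 carries the stack current with Vds = VM < Vdd; the size of the reduction depends entirely on the assumed parameters.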
Source-Biasing: Retaining Data During Inactive Mode
[Figure: SRAM cell with virtual ground VGND, SLEEP transistor, and VGND holding circuit biased at VSB.]
• Sleep transistor cuts off VGND from ground during sleep mode
• VGND is strapped using different circuit schemes

16K-Byte SRAM Organization
[Figure: 16K-byte SRAM organization: address bits A<0:1>, A<2:4>, A<5:7> feeding the predecoder and decoder/driver (X4), wordlines WL<3:0>, SLEEP and BLOCK_SEL signals, self-decay circuit, VSB, MP1, VGND, distributed sleep transistor cells (SL), bitlines (512 cells, 4 cells), column I/O, ΦPRE.]
Active leakage reduction SRAM
• Distributed sleep transistors
• SRAM block turned on ahead of time
• Self-decay circuit for low dynamic power overhead

2x16K-Byte SRAM Testchip
Kim, Roy, ISSCC’05
Technology: 180nm 6-metal CMOS
Chip Size: 3.3 x 2.9 mm2
Supply Voltage: 1.8V
Threshold Voltage: NMOS: 0.53V, PMOS: -0.53V
Read Access Cycle: 984MHz @ 1.8V, RT
Active Current: 0.14mW/MHz @ 1.8V
Standby Current: 7.27μA (16KB array)
Measured Leakage Reduction
[Figure: measured leakage (A, 0 to 8E-06) at 1.8V, 45°C, conventional vs. this work, broken down into junction leakage, bitline leakage, and cell leakage; 94.2% reduction.]
• 94.2% total leakage reduction at VGND = 0.9V
• Raising VGND also reduces gate tunneling leakage

Bulk Ultra-High Vt Forward-biased Cache
Bulk Ultra-high Vt
Nominal Vt
Ground-plane SOI
FinFET
Strong halo, low ISUB; FBB to ↑ ION
FB-SRAM Circuit Design Issues
• Zero body bias in standby to reduce leakage
• FBB in active mode to improve speed
• Early sub-array selection to hide body-bias transition latency
[Figure: SRAM cell (WL, BL, BLB, VDD, GND) with PWELL body terminal.]
FB-SRAM Microarch Design Issues
Use MSB of memory address for early selection of memory sub-array
Use locality of reference in cache to reduce transition energy
Co-design approach gives large leakage savings
32x32 Forward Body-Biased Sub-array
[Figure: 32x32 sub-array with wordlines WL0 to WL31, sub-array select SUBSL, 0.4V power supply, transistors M1, M2, M3, MA, MP, MN, and body node VPWELL.]
Comparison
[Figure: cell comparison. Conventional: VDD, VT = 270mV. SBSRAM: VDD, source line VSL at 0V (active) and 0.2V (standby), VT = 270mV. FBSRAM: VPWELL at 0.5V (active) and 0V (standby), VT = 350mV.]
• SBSRAM (DRG) has been proven with Si measurements
• Dynamic VDD, RBB SRAM have fundamental design issues
• MEDICI: gate/BTBT leakage is also modeled
32KB Cache Total Leakage Reduction
[Figure: power consumption (W, 0.00 to 0.25) for Conventional, SBSRAM, and FBSRAM, split into leakage power (unselected subarrays), leakage power (selected subarray), and dynamic power overhead. Conventional: 230mW; SBSRAM and FBSRAM: 83mW and 84mW; 64% total leakage reduction.]
• SBSRAM and FBSRAM are designed to give isoleakage savings
• 64% total leakage reduction including overhead
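The quoted 64% follows directly from the bar totals on this slide. A short arithmetic check, using only the 230mW, 83mW, and 84mW figures shown (the function name is a hypothetical helper introduced here):

```python
def leakage_reduction_percent(baseline_mw, reduced_mw):
    """Percentage reduction of reduced_mw relative to baseline_mw."""
    return 100.0 * (baseline_mw - reduced_mw) / baseline_mw


CONVENTIONAL_MW = 230.0  # conventional cache total from the slide

for reduced_mw in (83.0, 84.0):  # the two reduced totals from the slide
    pct = leakage_reduction_percent(CONVENTIONAL_MW, reduced_mw)
    print(f"{reduced_mw:.0f} mW -> {pct:.1f}% reduction")
```

Both totals give a reduction between 63% and 64%, consistent with the rounded figure on the slide.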
Process Variations & Process Tolerance
Robust Design: Process Variations in On-chip SRAMs
[Figure: 6T SRAM cell (WL, BL, BR, PL, PR, NL, NR, AXL, AXR) storing '1' and '0', with low-Vt and high-Vt transistors marked.]
Parametric failures
– Read, Write, Access, Hold
[Figure: fault statistics: chip count (0 to 350) vs. number of faulty cells (NFaulty-Cells, 0 to 1049). Simulation example of a 64KB cache, σVt ≈ 30mV, using BPTM 45nm technology. Yield ≈ 33%.]
Parametric failures can degrade SRAM yield
Inter-die Variation & Memory Failure
[Figure: cell and memory failure probability vs. inter-die Vt shift [V], BPTM 70nm devices. Region A (LVT): high RF/HF. Region B (nominal Vt): low failures. Region C (HVT): high AF/WF.]
Memory failure probabilities are high when
inter-die shift in process is high
Self-Repairing SRAM Array
[Figure: failure regions vs. inter-die Vt shift: Region A (LVT), Region B (nominal Vt), Region C (HVT).]
Region A, LVT corner: read & hold failures dominate; reduce RF & HF
Region C, HVT corner: access & write failures dominate; reduce AF & WF
Reduce the dominant failures at different inter-die
corners to increase width of low failure region
Self-Repair using Leakage Monitoring
[Figure: on-chip leakage monitor on the SRAM array supply (VDD, bypass switch, calibrate signal) produces VOUT; a comparator against VREF1 and VREF2 drives the body-bias generator, which applies the body bias to the SRAM array.]
Entire array leakage is monitored to detect the inter-die corner, and the proper body bias is applied
[Figure: current monitor circuit: VOUT for LVT, nominal-Vt, and HVT dies relative to VREF1 and VREF2 (BPTM 70nm).]
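The monitor-and-compare decision can be sketched in a few lines. This is a minimal illustration, assuming that VOUT rises with array leakage and that VREF1 > VREF2; both assumptions, the function name, and the example voltages are hypothetical and not taken from the slides. The mapping of corners to bias type follows the RBB / ZBB / FBB labels used in this deck: a leaky low-Vt die gets reverse body bias, a high-Vt die gets forward body bias, and a nominal die gets zero body bias.

```python
def select_body_bias(v_out, v_ref1, v_ref2):
    """Pick a body-bias mode from the leakage monitor output.

    Assumes v_out increases with array leakage and v_ref1 > v_ref2
    (illustrative assumptions, not the measured circuit behavior).
    """
    if v_ref1 <= v_ref2:
        raise ValueError("expected v_ref1 > v_ref2")
    if v_out > v_ref1:
        return "RBB"  # high leakage: LVT corner, raise Vt
    if v_out < v_ref2:
        return "FBB"  # low leakage: HVT corner, lower Vt
    return "ZBB"      # nominal corner: leave the body unbiased


# Hypothetical monitor readings for three dies
for die, v_out in (("LVT die", 0.9), ("nominal die", 0.6), ("HVT die", 0.3)):
    print(die, "->", select_body_bias(v_out, v_ref1=0.8, v_ref2=0.4))
```

Using two references rather than one leaves a middle band in which no bias is applied, which matches the three-region picture (A, B, C) of the preceding slides.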
Test-Chip of Self-Repairing SRAM
[Figure: test-chip die photo; labels: VCO, isolated cell, 16 KB block, 64 KB, LVT array, sensor + ref. gen., BB gen.]
Technology: IBM 0.13 µm
128KB SRAM, leakage sensor, reference & body-bias generator
Dual-Vt triple-well technology
Number of transistors: ~7 million
Die size: 16 mm2
Yield Enhancement using Self-Repair
[Figure: memory failure probability vs. inter-die Vt shift [mV] (BPTM 70nm) for a 256KB SRAM with no body bias and a 256KB self-repairing SRAM; self-repair applies RBB, ZBB, or FBB depending on the region.]
Self-Repairing SRAM using body-bias can
significantly improve design yield
Self-repair: Architecture Level
Fault-Tolerant Cache Architecture
[Figure: cache block diagram: address split into tag, index, and offset fields (16, 11, 5); row decoder (512 rows), column decoder and column address, data blocks and tag blocks (256b wide, 16b words), column muxes for data and tag, sense amps, tag comparator (hit/miss), controller, BIST, configurator, and config storage holding fault memory locations and configuration; shown for test mode and operating condition.]
• BIST detects the faulty blocks
• Config Storage stores the fault information
• Idea is to resize the cache to avoid faulty blocks during regular operation
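The resizing idea can be illustrated with a deliberately simplified model. The sketch below is an assumption-laden stand-in, not the architecture in the figure: it uses a direct-mapped cache, takes the BIST result as a given set of faulty block indices, and simply never stores data in those blocks, so the usable cache shrinks by the number of faulty blocks. The class and method names are hypothetical.

```python
class ResizableCache:
    """Direct-mapped cache that skips blocks flagged as faulty.

    Simplified illustration only: the real design remaps accesses
    through a configurator rather than forcing misses.
    """

    def __init__(self, num_blocks, faulty_blocks):
        self.num_blocks = num_blocks
        # Stand-in for the config storage written after BIST
        self.config_storage = set(faulty_blocks)
        self.tags = [None] * num_blocks

    def usable_blocks(self):
        return self.num_blocks - len(self.config_storage)

    def access(self, block_address):
        """Return True on a hit, False on a miss."""
        index = block_address % self.num_blocks
        tag = block_address // self.num_blocks
        if index in self.config_storage:
            return False  # faulty block: never used, always a miss
        hit = self.tags[index] == tag
        self.tags[index] = tag  # fill on miss (no-op on hit)
        return hit


cache = ResizableCache(num_blocks=8, faulty_blocks={3})
print(cache.usable_blocks())               # 7 of 8 blocks remain in use
print(cache.access(2), cache.access(2))    # miss, then hit
print(cache.access(3), cache.access(3))    # faulty block: miss both times
```

The cost of this kind of downsizing is a higher miss rate, which is the performance loss the deck quantifies a few slides later.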

Effective Yield of 64K Cache
[Figure: % yield vs. redundant rows in config storage (r = 0 to 4) for the proposed architecture; optimum r = 3.]
[Figure: % yield vs. redundant rows in cache (R = 0, 8, 16, 24, 32) for ECC + redundancy and for the proposed architecture with r = 3, compared with conventional yield and yield without any redundancy.]
• ECC + Redundancy yield ~ 77%
• Proposed architecture + Redundancy yield ~ 94%
Fault Tolerant Capability
[Figure: fault statistics: chip count (Nchip, 0 to 350) vs. NFaulty-Cells (0 to 1049), showing chips saved by the proposed architecture + redundancy (R=8, r=3) and chips saved by ECC + redundancy (R=16). Annotations: more chips saved compared to ECC; ECC fails to save any chips.]
• Proposed architecture can handle more faulty cells than ECC, as many as 890 faulty cells
• Saves more chips than ECC for a given NFaulty-Cells
CPU Performance Loss
[Figure: % CPU performance loss (0.0 to 2.5) vs. NFaulty-Cells (0 to 839), for a 64K cache averaged over SPEC 2000 benchmarks.]
• Increase in miss rate due to downsizing of cache
• Average CPU performance loss over all SPEC 2000 benchmarks for a cache with 890 faulty cells is ~ 2%

Logic: Active Leakage Reduction
- Dual-Vt
- Transistor Stacking
Leakage Reduction: Supply Gating for Logic
[Figure: logic block (input, output) between VDD and GND, with a VDD-gating control transistor above and a GND-gating control transistor below.]
Pros: 5-20X leakage reduction; scalable; design ease
Cons: delay/area overhead; floated output; can be applied to idle sections only
How to use supply gating dynamically in
active mode?
Dynamic Supply Gating (DSG): An Example
[Figure: power saving % (0 to 100) vs. row address bits (8, 12, 16) for 70nm and 50nm technology; decoder built from a predecoder, 3-to-8 row decoder, and postdecoder.]
How to do it for random
logic?
Dynamic Supply Gating for General Circuits
 Shannon’s expansion:
$$f(x_1,\dots,x_i,\dots,x_n) = x_i\, f(x_1,\dots,x_i=1,\dots,x_n) + x_i'\, f(x_1,\dots,x_i=0,\dots,x_n) = x_i\, CF_1 + x_i'\, CF_2$$
$$CF_1 = f(x_1,\dots,x_i=1,\dots,x_n); \quad CF_2 = f(x_1,\dots,x_i=0,\dots,x_n)$$
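The expansion can be verified exhaustively for a small function. The sketch below is illustrative: the example function and the helper names are hypothetical, and inputs are modeled as 0/1 integers. It builds the two cofactors for a chosen control variable, recombines them as xi·CF1 + xi'·CF2, and checks the result against the original over the full truth table.

```python
from itertools import product


def cofactors(f, i):
    """Return (CF1, CF2): f with input i forced to 1 and to 0."""
    def cf1(*xs):
        args = list(xs)
        args[i] = 1
        return f(*args)

    def cf2(*xs):
        args = list(xs)
        args[i] = 0
        return f(*args)

    return cf1, cf2


def shannon_expand(f, i):
    """Rebuild f as x_i * CF1 + x_i' * CF2."""
    cf1, cf2 = cofactors(f, i)

    def expanded(*xs):
        xi = xs[i]
        return (xi & cf1(*xs)) | ((1 - xi) & cf2(*xs))

    return expanded


def example_function(x1, x2, x3):
    # Hypothetical 3-input function used only for the check
    return (x1 & x2) | ((1 - x2) & x3)


for control in range(3):
    g = shannon_expand(example_function, control)
    assert all(g(*xs) == example_function(*xs)
               for xs in product((0, 1), repeat=3))
print("Shannon expansion matches the original for every control variable")
```

Because only one cofactor contributes to the output for a given value of the control variable, the other cofactor block is idle in that cycle, which is what makes it a candidate for supply gating.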
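xi is referred to as the control variable
Control variable selection is important
[Figure: f implemented as cofactor blocks CF1 and CF2 (outputs f1, f2) enabled by xi and xi' and merged by a MUX controlled by xi; a second level splits CF1 into CF11 and CF12 using xj (enables xi xj and xi xj').]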
Simulation Results
[Figure: active leakage saving: leakage power (µW, 0 to 160), original vs. DSG, for benchmarks x2, sct, pcle, pcler8, cht, mux, alu2, decod, cm150a, count.]
MCNC Benchmarks, 70nm Process, Vdd=1V, Temp=100°C
Logic: Process Variation & Tolerance
- Transistor Sizing for Yield (Statistical Design)
- Transistor Sizing for Efficient Speed-Binning
- Shadow Latches (Razor)
- Pipeline Balancing/Imbalancing
- Vdd Scaling & Critical Path Isolation
(ICCAD’06)
Design Considerations for Low Power and Robust Circuit
[Figure: number of paths vs. path delay for Design A (VDD = 1V) and Design B (VDD = 1V and VDD < 1V), with path sets S1, S2, S3 relative to the clock period Tc. Annotation: critical paths are predictable and restricted to a logic section having low activation probability.]
• Few predictable critical paths
• Low activation probability of critical paths
• Slack between critical and non-critical paths under variations
Proposed Approach:
[Figure: original circuit (inputs to PO) transformed into cofactor blocks f1 to f4, enabled by a decoder on the inputs and combined through an OR network to PO.]
Critical Path Isolation By Control Variable Selection
[Figure: logic cones of f1 and f2 over inputs X1 to X9.]
$M_i = \dfrac{|a_i - b_i|}{\max(a_i, b_i)}$
Mi  max(ai  bi )
ai: literal count of xi
bi: literal count of xi'
[Figure: f1 and f2 partitioned into cofactors f1(CF1), f1(CF2), f2(CF1), f2(CF2) over the remaining inputs, shown for control variables x4 and x1.]
Further Isolation by Hierarchical Partitioning and Sizing
[Figure: original circuit partitioned hierarchically through a MUX network. Level 1 (50%): Xi = 1, Xi = 0. Level 2 (25%): (Xi, Xj) = (1,1), (1,0). Level 3 (12.5%): (Xi, Xj, Xk) = (1,0,0), (1,0,1). Cofactor blocks: CF10, CF20, CF32, CF42, CF53, CF63.]
Stopping conditions: area, delay constraints
Advantages of Shannon decomposition
• Critical paths can be isolated
• Activation of errors can be predicted ahead of time
• Activation probability of critical paths can be reduced
Simulation Results for Pipeline-based Design
[Figure: pipeline stages cht, mux, cm150a (delays 80ps, 70ps, 85ps shown) with a freeze signal gating CLK. D1, D2, D3 are decoding logic outputs.]
[Figure: % improvement in power at input switching probability 0.2 and 0.5 (0 to 100), and VDD [V] (0 to 1), for benchmarks cht, sct, pcle, mux, decod, cm150a, x2, alu2, count.]
• Avg performance penalty = 5.9% for switching activity = 0.5
• Avg power saving = 60%, avg area penalty = 18%

Ultra Low Power
Subthreshold Leakage for
computation??
-- Soeleman, Kim, Roy ISLPED 2000/2001, TVLSI 2001, TVLSI 2003
-- Raychowdhury, Kim, Roy, ISLPED 04, TED 2004/2005, TVLSI 2005…
Subthreshold Operation
[Figure: IDS (A/µm, 1E-9 to 1E-3, log scale) vs. VGS (0 to 0.8 V), with Vth marked; region of operation: Vdd < Vth.]
IDS ∝ exp(VGS − VTH), and not (VGS − VTH)
[Figure: CGATE (fF/µm, 0.5 to 1) vs. VGS (0 to 0.8 V), with Vth marked; in the region of operation CGATE < COX.]
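The exponential dependence is what separates subthreshold operation from the usual super-threshold behavior. A minimal sketch, assuming the standard form IDS = I0·exp((VGS − VTH)/(n·vT)) with illustrative values for `i0`, `vth`, and the slope factor `n` (none of them from the slides):

```python
import math

V_THERMAL = 0.0259  # thermal voltage kT/q near room temperature, in volts


def subthreshold_ids(vgs, vth=0.4, i0=1e-7, n=1.5):
    """Subthreshold drain current; vth, i0, and n are assumed values."""
    return i0 * math.exp((vgs - vth) / (n * V_THERMAL))


def subthreshold_swing(n=1.5):
    """Gate voltage change (V) needed for a 10x change in current."""
    return n * V_THERMAL * math.log(10.0)


swing = subthreshold_swing()
ratio = subthreshold_ids(0.2 + swing) / subthreshold_ids(0.2)
print(f"swing = {swing * 1e3:.1f} mV/decade, current ratio = {ratio:.2f}")
```

Raising VGS by one swing multiplies the current by ten in this model, so small supply or threshold shifts translate into large delay changes, which is why the later slides pair subthreshold design with process-tolerant circuits.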
Design Goal
[Figure: power vs. throughput with a power ceiling; moving from super-threshold to sub-threshold through device optimization and circuit/architecture optimization; target applications: wireless, medical.]
• Dev/Cir/Arch optimization is necessary
• Is scaling necessary?
• Device for sub-threshold operation?
Scaling & Subthreshold Operation
• Reduced L => reduced capacitance
[Figure: average power (×10^-7 J, 0 to 4) at iso-performance (3.4ns) vs. technology node: 250 nm (500 mV), 180 nm (420 mV), 130 nm (280 mV), 90 nm (200 mV).]
Scaling is essential even for subthreshold operation
Are standard transistors good for subthreshold operation too?
Doping Profile: Std. vs. Proposed
Standard Device
Proposed Device
Proposed device vs. Std. Device
@ iso-performance (3.4ns)
[Figure: average power (×10^-7 J, 0 to 4) vs. technology node (250, 180, 130, 90 nm) for the standard and proposed devices; supply-voltage labels 500mV, 420mV, 280mV, 200mV, and 180mV; 48% annotation.]
Raychowdhury, Paul, Roy; IEEE TED, Feb’05, ISLPED’04
Circuit Considerations
[Figure: CMOS NAND and pseudo-NMOS NAND (inputs A, B), each drawn as a pull-up network (PUP) over a pull-down network (PDN).]
Pseudo-NMOS over CMOS
- Less power
- Faster operation
VTC of an Inverter (350nm Tech)
[Figure: Vout vs. Vin with the Vin = Vout line, for standard operation (Vdd = 3.3V) and sub-threshold operation (0.5V), at P/N ratios 4, 0.25, and 0.1.]
Pseudo NMOS logic is good for sub-threshold operation
Improvement Through Circuit Innovation
Pseudo-NMOS over CMOS (sub-threshold)
- Faster operation
- Reasonable power
Pseudo-NMOS logic is suitable for
Sub-threshold operation
Architecture Optimization
[Figure: parallelism (IN distributed to parallel logic blocks with control, merged to OUT) and pipelining (logic stages separated by latches, clocked by CLK, from IN to OUT).]
Architecture Optimization
5-Tap FIR filter
@ iso-performance
Parallelism
90nm Predictive Tech.
Pipelining
Optimum no. of pipeline stages and
parallel blocks need to be chosen
Dev/Cir/Arc Co-design: Summary
90nm Predictive Tech., 5-Tap FIR Filter
[Figure: successive savings from standard CMOS, CMOS to pseudo-NMOS conversion, optimal parallelization and pipelining, and device optimization; supply-voltage labels from 0.8V down to 0.13V.]
Under review, TVLSI
Process-Tolerant Sub-threshold Chips
[Figure: chip photos in IBM 130nm 8RF and MITLL 3D FDSOI; labels: 16-bit adder, process-tolerant adder, process-tolerant pipeline, SRAM (for read/write/hold test), 1KB SRAM, filter core, adaptive β-ratio circuit (tR/tF/tD).]
Summary
• Design paradigm shift is essential to meet the
growing demands for power dissipation, yield,
and reliability
• Device/Circuit/Architecture Co-design can
address some of the design problems
associated with scaling
Performance/Power Aware Computing & Communications
• NSF
• DARPA
• Northrop Grumman
• MARCO GSRC
• SRC
• Intel
• ATT/Lucent
• HP
• IBM
• Convergys
Faculty: Kaushik Roy
• Bipul Paul, Post-Doc. Res.
• Keejong Kim, Post-Doc. Res.
• Swarup Bhunia
• Tamer Cakici
• James Gallaghar
• Mark Budnik
• Yiran Chen
• Arijit Raychowdhury
• Aditya Bansal
• Amit Agarwal
• Ashish Goel
• Hunsoo Choo
• Nilanjan Banerjee
• Swaroop Ghosh
• Jung Hwan Choi
• Konhyuk Kang
• Arjun Guha
• Hamid Mahmoodi-Meimand
• Saibal Mukhopadhyay
• Animesh Datta
• Jongsun Park
• Hari Ananthan
• Yongtao Wang
• Myeong Hwang