ZZ-HVS: Zig-Zag Horizontal and Vertical Sleep
Transistor Sharing to Reduce Leakage Power in
On-Chip SRAM Peripheral Circuits
Houman Homayoun, Avesta Makhzan, and Alex Veidenbaum
Dept. of Computer Science, UC Irvine
[email protected]
Outline
Cache Power Dissipation
Why Cache Peripherals?
Proposed Circuit Technique to Reduce Leakage in
Cache Peripherals
Circuit Evaluation
Proposed Architecture to Control the Circuit
Results
Conclusion
On-chip Caches and Power
On-chip caches in high-performance
processors are large
more than 60% of the chip budget
They dissipate a significant portion of power via
leakage
Much of it was in the SRAM cells
Many architectural techniques have been proposed to
remedy this
Today, there is also significant leakage in the
peripheral circuits of an SRAM (cache)
In part because cell design has been optimized
Pentium M processor die photo
Courtesy of intel.com
Peripherals?
[Figure: SRAM array organization showing address input global drivers, predecoder and global wordline drivers, row decoder, local wordline drivers, bitlines, sense amps, and global output drivers]
Data Input/Output Driver
Address Input/Output Driver
Row Pre-decoder
Wordline Driver
Row Decoder
Others: sense-amp, bitline pre-charger, memory cells, decoder logic
Why Peripherals?
[Figure: leakage per device (pW, log scale) for a memory cell vs. inverters sized INV1X through INV32X; the peripheral inverters leak roughly 200X to 6300X more than a memory cell]
Minimal-sized transistors are used in cells for area reasons, while larger, faster,
and accordingly leakier transistors are used in peripherals to satisfy timing requirements.
High-Vt transistors are used in cells, compared with typical-threshold-voltage
transistors in peripherals.
Leakage Power Components of the L2 Cache
SRAM peripheral circuits dissipate more than 90%
of the total leakage power
Circuit Techniques Addressing Leakage in the SRAM Cell
Gated-Vdd, Gated-Vss
Voltage Scaling (DVFS)
ABB-MTCMOS
Forward Body Biasing (FBB), RBB
Sleepy Stack
Sleepy Keeper
All target the SRAM memory cell
Architectural Techniques
Way Prediction, Way Caching, Phased Access
Predict or cache recently accessed ways, read the tag first
Drowsy Cache
Keeps cache lines in a low-power state, with data retention
Cache Decay
Evicts lines not used for a while, then powers them down
These apply DVS, Gated Vdd, or Gated Vss to the memory cell
and need considerable architectural support
All target the cache SRAM memory cell
Sleep Transistor Stacking Effect
Subthreshold current: inverse exponential function of
threshold voltage
$V_T = V_{T0} + \gamma\left(\sqrt{2\phi_F + V_{SB}} - \sqrt{2\phi_F}\right)$
Stacking transistor N with slpN:
When both transistors are off, the source-to-body voltage (VM) of
transistor N increases, which reduces its subthreshold leakage current
Drawbacks: rise time, fall time, wakeup delay, area, dynamic power, instability
[Figure: transistor N (gate Vgn, output node VC with load CL) stacked on sleep transistor slpN (gate Vgslpn) to vss; the intermediate node between them is the virtual ground VM]
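For intuition, a standard subthreshold-current model (our addition, not taken from the slides) makes the stacking effect explicit:

$$I_{sub} = I_0\,\frac{W}{L}\,e^{\frac{V_{GS}-V_T+\eta V_{DS}}{n\,v_\theta}}\left(1-e^{-\frac{V_{DS}}{v_\theta}}\right)$$

When both devices are off and the intermediate node settles at $V_M > 0$, transistor N sees $V_{GS} = -V_M$ (negative), $V_{SB} = V_M$ (raising $V_T$ through the body-effect equation above), and a smaller $V_{DS}$ (less DIBL, captured by $\eta$). All three effects reduce $I_{sub}$ exponentially, which is why the stacked pair leaks far less than a single off transistor.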
Source of Subthreshold Leakage
in the Peripheral Circuitry
[Figure: four-stage wordline driver inverter chain (P1/N1 through P4/N4) with progressively larger W/L ratios ranging from 1.5 to 24, driven by the decoded address and driving the row's pass transistors; the subthreshold leakage paths through the off devices are marked]
The inverter chain has to drive a logic value 0 to the
pass transistors when a memory row is not selected
N1,N3 and P2,P4 are in the off state and are leaking
A Redundant Circuit Approach
[Figure: redundant sleep-transistor scheme; each inverter in the chain gets its own PMOS sleep transistor (slpP1..slpP4) to vdd and its own NMOS sleep transistor (slpN1..slpN4) to vss, all gated by the sleep signal; the virtual-ground node is VM]
Drawback: impact on the wordline driver's output
rise time, fall time, and propagation delay
Impact on Rise Time and Fall Time
The rise time and fall time at the output of an inverter are
proportional to Rpeq * CL and Rneq * CL, respectively.
[Figure: two inverter stages (P1/N1 and P2/N2) with sleep transistors slpP1, slpN1, slpP2, slpN2 inserted in their supply paths; leakage paths marked]
Inserting the sleep transistors increases
both Rneq and Rpeq
An increase in rise time impacts performance
An increase in fall time impacts memory functionality
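As a first-order illustration (a simple RC model we assume here, not a result from the slides), the 10%-90% rise time of the driver output is roughly

$$t_{rise} \approx 2.2\,R_{peq}\,C_L \;\longrightarrow\; 2.2\,(R_{peq}+R_{slpP})\,C_L$$

so a sleep transistor with on-resistance $R_{slpP}$ in the pull-up path stretches the rise time by the factor $1 + R_{slpP}/R_{peq}$; the same argument with $R_{neq}$ and $R_{slpN}$ applies to the fall time.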
Fall Time Increase Impact
[Figure: read path; the wordline pulse generator and global wordline driver feed the local wordline driver, which turns on the cell access transistors M1 and M2 between the bitlines and the cell; a sense amp sits at the bottom of the bitlines]
A fall-time increase lengthens the period during which the pass transistors stay active during a read operation.
The bitlines over-discharge and the memory cell content over-charges during the
read operation.
Such over-discharge
increases the dynamic power dissipation of the bitlines
can cause the cell content to flip if the over-discharge period is large
The sense amplifier timing circuit and the wordline pulse generator
circuit need to be redesigned!
A Zig-Zag Circuit
[Figure: zig-zag scheme; sleep transistors are inserted only on the non-switching side of each stage: NMOS sleep transistors (slpN1, slpN3) under the first and third inverters and PMOS sleep transistors (slpP2, slpP4) above the second and fourth inverters, all gated by the sleep signal]
Rpeq for the first and third inverters and Rneq for the
second and fourth inverters do not change.
The fall time of the circuit therefore does not change
A Zig-Zag Share Circuit
To improve the leakage reduction and area efficiency of
the zig-zag scheme, one set of sleep transistors is
shared between multiple inverter stages
Zig-Zag Horizontal Sharing
Zig-Zag Horizontal and Vertical Sharing
Zig-Zag Horizontal Sharing
Comparing zz-hs with the zig-zag scheme at the same area overhead:
zz-hs has less impact on rise time
Both reduce leakage by almost the same amount
[Figure: zig-zag horizontal sharing; a single double-width NMOS sleep transistor (2x slpN) is shared by the pull-down paths of the first and third inverters, and a single PMOS sleep transistor (slpP) by the pull-up paths of the second and fourth inverters; the shared virtual-ground node is VM]
Because the shared NMOS sleep transistor is twice as wide, the equivalent sleep resistance seen by each stage is halved: $R_{nslp}^{zz\text{-}hs} = R_{nslp}^{zz}/2$
Zig-Zag Horizontal and Vertical Sharing
[Figure: zig-zag horizontal and vertical sharing; a single pair of sleep transistors (slpP, slpN) gated by the sleep signal is shared horizontally across the four stages of a wordline driver and vertically across adjacent wordline driver rows K and K+1 (devices P11..P24, N11..N24); the shared virtual-ground node is VM]
Leakage Reduction of Zig-Zag Horizontal
and Vertical Sharing
[Figure: (a) a single off transistor N11 (gate at Vg0, drain at vdd) stacked on the shared sleep transistor slpN, with virtual ground VM1; (b) two off transistors N11 and N21 from vertically adjacent drivers stacked on the same slpN, with virtual ground VM2]
An increase in the virtual ground voltage
increases the leakage reduction
$$V_{M1} = n\,\log_{10}\!\left(\frac{W_{N11}}{W_{slpN}}\cdot 10^{\frac{V_{dd}-V_{g0}}{2}}\right), \qquad V_{M2} = n\,\log_{10}\!\left(\frac{2\,W_{N11}}{W_{slpN}}\cdot 10^{\frac{V_{dd}-V_{g0}}{2}}\right)$$
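Taking these expressions at face value, the two configurations differ only in the factor of two on $W_{N11}$, so vertical sharing raises the virtual ground by a fixed step,

$$V_{M2} - V_{M1} = n\,\log_{10} 2 \approx 0.3\,n,$$

and since subthreshold leakage falls roughly exponentially with the virtual-ground voltage (see the stacking-effect model earlier), the vertically shared configuration leaks less per driver. This is our reading of the reconstructed formulas, stated only for intuition.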
Circuit Evaluation
Test Experiment
Wordline inverter chain drives 256 one-bit memory cells.
Using Mentor Graphics IC Station in TSMC 65nm
technology
Using Synopsys HSPICE and a supply voltage of 1.08V at
the typical corner (25 C)
The empirical results presented are for the
leakage current
rise time and fall time
propagation delay
dynamic power
area
Zig-zag Horizontal Sharing: Power Results
[Figure: leakage power (nW) and dynamic power (uW) for the baseline, redundant, zigzag, zz-hs-1W, and zz-hs-2W circuits]
Dynamic power increase of 1.5% to 3.5%
Max leakage reduction of 94%
Zig-zag Horizontal Sharing: Latency Results
[Figure: propagation delay, fall time, and rise time (ps) for the baseline, redundant, zigzag, zz-hs-1W, and zz-hs-2W circuits]
The wordline driver fall time is not affected by either the zig-zag or
the zig-zag share circuit
zz-hs-2W has the least impact on rise time and propagation
delay
Zig-zag Horizontal Sharing: Area Results
[Figure: area (um^2) for the baseline, redundant, zigzag, zz-hs-1W, and zz-hs-2W circuits]
The area increase varies significantly, from 25% for the zz-hs-1W circuit to
115% for the redundant scheme
ZZ-HVS Evaluation: Power Results
[Figure: leakage power (nW, log scale) vs. number of wordline rows sharing sleep transistors (1 to 10) for the baseline, redundant, zigzag, zz-hs, and zz-hvs circuits]
Increasing the number of wordline rows that share sleep transistors increases the
leakage reduction and reduces the area overhead
Leakage power reduction varies from 10X to 100X as 1 to 10 wordlines
share the same sleep transistors
2~10X more leakage reduction compared to the zig-zag scheme
ZZ-HVS Evaluation: Area Results
[Figure: area (um^2) vs. number of wordline rows sharing sleep transistors (1 to 10) for the baseline, redundant, zigzag, zz-hs, and zz-hvs circuits]
zz-hvs has the least impact on area, 4~25%,
depending on the number of wordline rows shared
ZZ-HVS Circuit Evaluation: Sleep Transistor Sizing
                          baseline   W (1X)   2W (2X)   3W (3X)   4W (4X)
Leakage power (nW)        460        5.11     9.13      12.63     15.7
Propagation delay (ps)    164        198      180       174       169
Trade-off between the leakage savings and impact
on the wordline driver propagation delay
zz-hvs-3W (3X) shows an optimal trade-off:
a 40X reduction in leakage at a 5%
increase in propagation delay
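A minimal sketch (using only the numbers in the table above) that recomputes the trade-off at each sizing point; it is a sanity check on the reported figures, not part of the authors' evaluation flow.

```python
# Recompute the leakage/delay trade-off of the zz-hvs sleep-transistor
# sizings, relative to the baseline wordline driver (values from the table).

BASELINE_LEAKAGE_NW = 460.0   # baseline driver leakage (nW)
BASELINE_DELAY_PS = 164.0     # baseline propagation delay (ps)

# sizing -> (leakage in nW, propagation delay in ps)
SIZING_RESULTS = {
    "1X": (5.11, 198.0),
    "2X": (9.13, 180.0),
    "3X": (12.63, 174.0),
    "4X": (15.70, 169.0),
}

for size, (leak_nw, delay_ps) in SIZING_RESULTS.items():
    reduction = BASELINE_LEAKAGE_NW / leak_nw            # e.g. ~36X at 3X
    delay_penalty = delay_ps / BASELINE_DELAY_PS - 1.0   # e.g. ~6% at 3X
    print(f"{size}: {reduction:5.1f}X less leakage, "
          f"{delay_penalty:5.1%} longer propagation delay")
```

Wider sleep transistors give back delay at the cost of leakage, which is the trade-off the 3X point balances.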
Wakeup Latency
To benefit the most from the leakage savings of
stacking sleep transistors
keep the bias voltage of NMOS sleep transistor as low as
possible (and for PMOS as high as possible)
Drawback: impact on the wakeup latency of wordline
drivers
The wakeup latency associated with the zz-hvs-3W circuit
is 1.3 ns
about 4 processor cycles (at 3.3 GHz)
For a large memory, such as a 2MB L2 cache, the overall
wakeup latency can be as high as 6 to 10 cycles
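A back-of-the-envelope conversion of that latency into cycles (the 1.3 ns wakeup and 3.3 GHz clock are the slide's numbers; the arithmetic is ours):

```python
# Convert the zz-hvs-3W wakeup latency into processor cycles at 3.3 GHz.
WAKEUP_LATENCY_NS = 1.3   # reported wakeup latency of the zz-hvs-3W driver
CLOCK_GHZ = 3.3           # processor clock assumed on the slide

cycle_time_ns = 1.0 / CLOCK_GHZ               # ~0.303 ns per cycle
cycles = WAKEUP_LATENCY_NS / cycle_time_ns    # ~4.3 cycles, quoted as ~4 on the slide
print(f"wakeup = {cycles:.1f} cycles")
```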
Impact on Propagation Delay
The zz-hvs increases the propagation delay of the
peripheral circuit by 5%, when applied to wordline
drivers, input/output drivers, etc
This translates to a 5% reduction in the maximum operating clock
frequency when the memory is a single pipeline stage
Deeply pipelined memories, such as L1 and L2 caches, hide this,
so the increase in peripheral circuit latency is negligible
Sleep-Share: ZZ-HVS + Architectural Control
When an L2 cache miss occurs the processor executes
a number of miss-independent instructions and then
ends up stalling
The processor stays idle until the L2 cache miss is
serviced. This may take hundreds of cycles (300 cycles for
our processor architecture)
During such a stall period there is no access to L1 and
L2 caches and they can be put into low-power mode
Detecting Processor Idle Period
The instruction queue and functional units of the
processor are monitored after an L2 miss
If the instruction queue has not issued any instructions
and the functional units have not executed any instructions
for K consecutive cycles (K = 10)
the sleep signal is asserted
The sleep signal is de-asserted 10 cycles before the miss
service is completed
Assumption: memory access latency is deterministic.
No performance loss
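A minimal cycle-level sketch of this control policy, assuming hypothetical simulator hooks (issued_this_cycle, executed_this_cycle, and a known miss latency when the miss is detected); the K = 10 threshold and the 10-cycle early wakeup are the values given above.

```python
# Illustrative Sleep-Share controller: asserts the peripheral sleep signal
# once the core has been idle for K consecutive cycles after an L2 miss,
# and de-asserts it WAKEUP_LEAD cycles before the miss service completes.
# The hook names used here are hypothetical, not from the paper.

K_IDLE_CYCLES = 10   # consecutive idle cycles before asserting sleep
WAKEUP_LEAD = 10     # de-assert sleep this many cycles before the miss returns

class SleepShareController:
    def __init__(self) -> None:
        self.idle_cycles = 0
        self.sleep = False
        self.miss_return_cycle = None   # cycle at which the L2 miss data returns

    def on_l2_miss(self, current_cycle: int, miss_latency: int = 300) -> None:
        # Memory access latency is assumed deterministic (300 cycles here).
        self.miss_return_cycle = current_cycle + miss_latency
        self.idle_cycles = 0

    def tick(self, current_cycle: int,
             issued_this_cycle: bool, executed_this_cycle: bool) -> bool:
        """Advance one cycle and return the current sleep-signal value."""
        if self.miss_return_cycle is None:
            return self.sleep
        # Wake up early enough that the caches are ready when the data arrives.
        if current_cycle >= self.miss_return_cycle - WAKEUP_LEAD:
            self.sleep = False
            self.miss_return_cycle = None
            self.idle_cycles = 0
            return self.sleep
        # Track how long the instruction queue and functional units have been idle.
        if issued_this_cycle or executed_this_cycle:
            self.idle_cycles = 0
            self.sleep = False
        else:
            self.idle_cycles += 1
        if self.idle_cycles >= K_IDLE_CYCLES:
            self.sleep = True   # put the L1/L2 peripheral circuits into low-power mode
        return self.sleep
```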
Simulated Processor Architecture
Parameter            Value
L1 I-cache           128KB, 2 cycles
L1 D-cache           128KB, 2 cycles
L2 cache             2MB, 8 way, 20 cycles
Fetch, dispatch      4 wide
Issue                4 way out of order
Memory               300 cycles
Reorder buffer       96 entry
Instruction queue    32 entry
Register file        128 integer and 125 floating point
Load/store queue     32 entry
Branch predictor     64KB entry g-share
Arithmetic unit      4 integer, 4 floating point units
Complex unit         2 INT, 2 FP multiply/divide units
Pipeline             15 cycles
SimpleScalar 4.0
SPEC2K benchmarks
Compiled with the -O4 flag using the Compaq compiler targeting
the Alpha 21264 processor
Fast-forwarded for 3 billion instructions, then fully simulated for 4
billion instructions using the reference data sets
L1 and L2 Leakage Power Reduction
[Figure: L1 and L2 leakage power reduction (%) across SPEC2K benchmarks (bzip2, gap, gcc, parser, perlbmk, mcf, vpr, ammp, applu, equake, facerec, lucas, mgrid, swim, wupwise) and their average]
Leakage reduction of 30% for the L2
cache and 28% for the L1 cache
Conclusion
Studied the breakdown of leakage in L2 cache components,
showing that the peripheral circuits leak considerably
Proposed zig-zag share to reduce leakage in SRAM
memory peripheral circuits
zig-zag share reduces peripheral leakage by up to 40X
with only a small increase in memory area and delay
Proposed Sleep-Share to control zig-zag share circuits
in L1 and L2 cache peripherals
Leakage reduction of 30% for the L2 cache and 28% for
the L1 cache