C411L24MemoryPeripheral

CMPEN 411
VLSI Digital Circuits
Spring 2009
Lecture 24: Peripheral Memory Circuits
[Adapted from Rabaey’s Digital Integrated Circuits, Second Edition, ©2003
J. Rabaey, A. Chandrakasan, B. Nikolic]
Sp09 CMPEN 411 L24 S.1
Review: Read-Write Memories (RAMs)

Static – SRAM
- data is stored as long as supply is applied
- large cells (6 FETs/cell) – so fewer bits/chip
- fast – so used where speed is important (e.g., caches)
- differential outputs (output BL and !BL)
- use sense amps for performance
- compatible with CMOS technology

Dynamic – DRAM
- periodic refresh required (every 1 to 4 ms) to compensate for the charge loss caused by leakage
- small cells (1 to 3 FETs/cell) – so more bits/chip
- slower – so used for main memories
- single ended output (output BL only)
- need sense amps for correct operation
- not typically compatible with CMOS technology
Peripheral Memory Circuitry

- Row and column decoders
- Read bit line precharge logic
- Sense amplifiers
- Timing and control

Design concerns: speed, power consumption, area (pitch matching)
Row Decoders

Collection of 2^M complex logic gates organized in a regular, dense fashion

(N)AND decoder for 8 address bits
WL(0) = !A7 & !A6 & !A5 & !A4 & !A3 & !A2 & !A1 & !A0
…
WL(255) = A7 & A6 & A5 & A4 & A3 & A2 & A1 & A0

NOR decoder for 8 address bits
WL(0) = !(A7 | A6 | A5 | A4 | A3 | A2 | A1 | A0)
…
WL(255) = !(!A7 | !A6 | !A5 | !A4 | !A3 | !A2 | !A1 | !A0)

Goals: Pitch matched, fast, low power
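
The two decoder forms above select the same single word line for any address. A minimal Python sketch of that equivalence, assuming active-high word lines and an address passed as an integer (the function names are illustrative and not from the lecture):

```python
def and_decoder(addr: int, n_bits: int = 8) -> list[int]:
    """(N)AND form: WL(i) is the AND of A_b or !A_b over every address bit b."""
    word_lines = []
    for row in range(2 ** n_bits):
        wl = 1
        for b in range(n_bits):
            a = (addr >> b) & 1
            # Row i uses A_b where bit b of i is 1, and !A_b where it is 0.
            wl &= a if (row >> b) & 1 else 1 - a
        word_lines.append(wl)
    return word_lines


def nor_decoder(addr: int, n_bits: int = 8) -> list[int]:
    """NOR form: WL(i) is the NOR of the literals that must all be 0 for row i."""
    word_lines = []
    for row in range(2 ** n_bits):
        any_high = 0
        for b in range(n_bits):
            a = (addr >> b) & 1
            # Row i uses !A_b where bit b of i is 1, and A_b where it is 0.
            any_high |= (1 - a) if (row >> b) & 1 else a
        word_lines.append(1 - any_high)
    return word_lines
```

Both functions return a one-hot list of 256 word lines; the circuit-level difference between the two forms (speed, area, power) is not modeled here.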
Implementing a Wide NOR Function

Single stage 8x256 bit decoder (as in Lecture 22)
- One 8 input NOR gate per row x 256 rows = 256 x (8+8) = 4,096 transistors
- Pitch match and speed/power issues

Decompose logic into multiple levels
    !WL(0) = !(!(A7 | A6) & !(A5 | A4) & !(A3 | A2) & !(A1 | A0))
- First level is the predecoder (for each pair of address bits, form Ai|Ai-1, Ai|!Ai-1, !Ai|Ai-1, and !Ai|!Ai-1)
- Second level is the word line driver

Predecoders reduce the number of transistors required
- Four sets of four 2-bit NOR predecoders = 4 x 4 x (2+2) = 64
- 256 word line drivers, each a four input NAND = 256 x (4+4) = 2,048
- 4,096 vs 2,112 = almost a 50% savings

Number of inputs to the gates driving the WLs is halved, so the propagation delay is reduced by a factor of ~4 (gate delay grows roughly quadratically with fan-in)
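
The transistor counts above follow a simple pattern that can be sketched in Python. This is a rough counting model only, assuming static CMOS gates with one NMOS and one PMOS per input; the handling of an odd address bit (used directly, without a predecoder) is an assumption of this sketch, and the function names are hypothetical:

```python
import math


def single_stage_transistors(n_bits: int) -> int:
    """One n-input gate (n NMOS + n PMOS) per word line."""
    return (2 ** n_bits) * (n_bits + n_bits)


def two_stage_transistors(n_bits: int) -> int:
    """Predecode address bits in pairs, then one smaller gate per word line."""
    pairs = n_bits // 2               # bits predecoded two at a time
    groups = math.ceil(n_bits / 2)    # inputs per word line driver
    predecoders = pairs * 4 * (2 + 2)  # four 2-input gates per pair
    drivers = (2 ** n_bits) * (groups + groups)
    return predecoders + drivers


for n in (8, 10):
    print(n, single_stage_transistors(n), two_stage_transistors(n))
```

For 8 address bits this reproduces the 4,096 and 2,112 figures worked out above.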
Hierarchical Decoders
Multi-stage implementation improves performance
[Figure: NAND decoder using 2-input pre-decoders. Predecoded combinations of A0/A1 and A2/A3 (true and complemented) drive the word lines WL0, WL1, ...]
Dynamic Decoders
[Figure: dynamic 2-input NOR decoder and dynamic 2-input NAND decoder, each with precharge devices clocked by φ, inputs A0, !A0, A1, !A1, and word line outputs WL0 to WL3]
Which one is faster? Smaller? Low power?
Pass Transistor Based Column Decoder
[Figure: a 2 input NOR decoder driven by A0 and A1 generates select signals S0 to S3, which gate pass transistors connecting BL0/!BL0 through BL3/!BL3 to data_out/!data_out]

Read: connect BLs to the Sense Amps (SA)
Writes: drive one of the BLs low to write a 0 into the cell

Fast since there is only one transistor in the signal path. However, there is a large transistor count ((K+1) x 2^K + 2 x 2^K)
- For K = 2: 3 x 2^2 (decoder) + 2 x 2^2 (PTs) = 12 + 8 = 20
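
The selection behavior of the pass transistor column decoder can be sketched functionally. This is a logic-level illustration only, assuming K = 2 and treating each pass transistor as an ideal switch; `nor_select` and `column_mux` are names made up for this sketch:

```python
def nor_select(a1: int, a0: int) -> list[int]:
    """2-input NOR decoder: S[i] is 1 only when the column address equals i."""
    selects = []
    for i in range(4):
        # NOR the literals that must both be 0 for column i to be selected.
        lit1 = (1 - a1) if (i >> 1) & 1 else a1
        lit0 = (1 - a0) if i & 1 else a0
        selects.append(1 - (lit1 | lit0))
    return selects


def column_mux(bit_lines: list[tuple[int, int]], a1: int, a0: int) -> tuple[int, int]:
    """Pass the selected (BL, !BL) pair through to (data_out, !data_out)."""
    for sel, pair in zip(nor_select(a1, a0), bit_lines):
        if sel:  # only one pass transistor pair conducts
            return pair
    raise ValueError("no column selected")
```

Because exactly one select line is high, a single pass transistor sits between each bit line and the output, which is the source of the speed advantage noted above.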
Tree Based Column Decoder
[Figure: binary tree of pass transistors controlled by A0, !A0, A1, !A1 that selects one of BL0/!BL0 through BL3/!BL3 onto data_out/!data_out]

Number of transistors reduced to 2 x 2 x (2^K - 1)
- for K = 2: 2 x 2 x (2^2 - 1) = 4 x 3 = 12

Delay increases quadratically with the number of sections (K) (so prohibitive for large decoders)
- can fix with buffers, progressive sizing, combination of tree and pass transistor approaches
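
The two column decoder transistor counts, and the quadratic delay growth of the tree, can be compared with a short Python sketch. The `data_bits` parameter (one shared decoder, replicated pass gates per data bit) and the use of the Elmore delay of K identical series RC sections as the delay model are assumptions of this sketch, not statements from the slides:

```python
def pass_transistor_count(k: int, data_bits: int = 1) -> int:
    """(K+1) x 2^K decoder transistors, shared, plus 2 x 2^K pass gates per data bit."""
    decoder = (k + 1) * 2 ** k
    pass_gates = 2 * 2 ** k * data_bits
    return decoder + pass_gates


def tree_count(k: int, data_bits: int = 1) -> int:
    """2 x 2 x (2^K - 1) pass transistors per data bit, with no separate decoder."""
    return 2 * 2 * (2 ** k - 1) * data_bits


def chain_delay_units(k: int) -> int:
    """Elmore delay of k identical series RC sections, in units of R*C."""
    return k * (k + 1) // 2


for k in (2, 3, 4):
    print(k, pass_transistor_count(k), tree_count(k), chain_delay_units(k))
```

For K = 2 and a single data bit this gives 20 versus 12 transistors, matching the worked examples, while the delay term grows as roughly K^2 / 2.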
Decoder Complexity Comparisons

Consider a memory with 10b address and 8b data
(T = transistors)

Conf. | Data/Row             | Row Decoder                                                          | Column Decoder
1D    | 8b                   | 10b = 10x2^10 decoder: single stage = 20,480 T, two stage = 10,320 T | -
2D    | 32b (32x256 core)    | 8b = 8x2^8 decoder: single stage = 4,096 T, two stage = 2,112 T      | 2b = 2x2^2 decoder: PT = 76 T, Tree = 96 T
2D    | 64b (64x128 core)    | 7b = 7x2^7 decoder: single stage = 1,792 T, two stage = 1,072 T      | 3b = 3x2^3 decoder: PT = 160 T, Tree = 224 T
2D    | 128b (128x64 core)   | 6b = 6x2^6 decoder: single stage = 768 T, two stage = 432 T          | 4b = 4x2^4 decoder: PT = 336 T, Tree = 480 T
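
The core organizations in the comparison follow from how the 10 address bits are split between the row and column decoders. A small Python sketch of that bookkeeping, with `core_shape` as a hypothetical helper name:

```python
def core_shape(addr_bits: int, data_bits: int, col_bits: int) -> tuple[int, int]:
    """Return (bits per row, number of rows) when col_bits go to the column decoder."""
    row_bits = addr_bits - col_bits
    bits_per_row = data_bits * 2 ** col_bits  # each data bit gets a 2^col_bits-way mux
    num_rows = 2 ** row_bits
    return bits_per_row, num_rows


for col_bits in (0, 2, 3, 4):
    width, rows = core_shape(10, 8, col_bits)
    print(f"{col_bits} column bits: {width}x{rows} core")
```

Every split stores the same 8,192 bits; only the aspect ratio of the core, and hence the decoder sizes, changes.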
Bit Line Precharge Logic

First step of a Read cycle is to precharge (PC) the bit lines to VDD
- every differential signal in the memory must be equalized to the same voltage level before Read

Turn off PC and enable the WL
- the grounded PMOS load limits the bit line swing (speeding up the next precharge cycle)

[Figure: precharge circuit with !PC controlling the precharge devices on BL and !BL]
- equalization transistor: speeds up equalization of the two bit lines by allowing the capacitance and pull-up device of the nondischarged bit line to assist in precharging the discharged line
Sense Amplifiers


Amplification – resolves data with small bit line swings (in some DRAMs required for proper functionality)

[Figure: SA block with input and output waveforms]

Delay reduction – compensates for the limited drive capability of the memory cell to accelerate BL transition
    tp = (C x ΔV) / Iav
- C is large and Iav is small, so make ΔV as small as possible

Power reduction – eliminates a large part of the power dissipation due to charging and discharging bit lines

Signal restoration – for DRAMs, need to drive the bit lines full swing after sensing (read) to do data refresh
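
The delay relation tp = (C x ΔV) / Iav can be evaluated numerically to see why a small bit line swing matters. The component values below are hypothetical, chosen only to make the arithmetic easy to follow:

```python
def bit_line_delay(c_farads: float, delta_v: float, i_avg_amps: float) -> float:
    """tp = (C * delta_V) / I_av, in seconds."""
    return c_farads * delta_v / i_avg_amps


C_BL = 1e-12     # assumed bit line capacitance: 1 pF
I_CELL = 50e-6   # assumed average cell current: 50 uA

full_swing = bit_line_delay(C_BL, 2.5, I_CELL)   # wait for a full 2.5 V swing
small_swing = bit_line_delay(C_BL, 0.1, I_CELL)  # sense amp resolves 100 mV
print(f"full swing: {full_swing * 1e9:.1f} ns, small swing: {small_swing * 1e9:.1f} ns")
```

With these assumed values the delay scales directly with ΔV, so sensing a 100 mV swing is 25 times faster than waiting for a 2.5 V swing.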
Classes of Sense Amplifiers

Differential SA – takes small signal differential inputs (BL and !BL) and amplifies them to a large signal single-ended output
- common-mode rejection – rejects noise that is equally injected to both inputs
- Only suitable for SRAMs (with BL and !BL)

Types
- Current mirroring
- Two-stage
- Latch based

Single-ended SA – needed for DRAMs
Differential Sense Amplifier
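[Figure: differential sense amplifier with input transistors M1 and M2 (bit and !bit), load transistors M3 and M4 tied to VDD, enable transistor M5 gated by SE, internal node y, and output Out]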
Directly applicable to
SRAMs
Differential Sensing ― SRAM
[Figure: (a) SRAM sensing scheme, showing bit lines BL and !BL with precharge (PC) and equalization (EQ) devices, SRAM cell i selected by WL i, and a differential sense amp enabled by SE; (b) two stage differential amplifier]
Approaches to Memory Timing
SRAM Timing: Self-Timed
- Address transition on the address bus initiates memory operation

DRAM Timing: Multiplexed Addressing
- The address bus carries the row address (msb's) and then the column address (lsb's)
- RAS-CAS timing
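
Multiplexed addressing splits one address into a row part and a column part that share the same pins. A minimal sketch of the split, assuming the msb's form the row address and the lsb's the column address as on the slide (the function name and bit widths are illustrative):

```python
def split_address(addr: int, row_bits: int, col_bits: int) -> tuple[int, int]:
    """Return (row, column): msb's are the row address, lsb's the column address."""
    row = (addr >> col_bits) & ((1 << row_bits) - 1)
    col = addr & ((1 << col_bits) - 1)
    return row, col


row, col = split_address(0b1011001110, row_bits=8, col_bits=2)
print(f"row = {row:08b} (latched with RAS), column = {col:02b} (latched with CAS)")
```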
Reliability and Yield

Memories operate under low signal-to-noise conditions
- word line to bit line coupling can vary substantially over the memory array
  - folded bit line architecture (routing BL and !BL next to each other ensures a closer match between parasitics and bit line capacitances)
- interwire bit line to bit line coupling
  - transposed (or twisted) bit line architecture (turn the noise into a common-mode signal for the SA)
- leakage (in DRAMs) requiring refresh operation

suffer from low yield due to high density and structural defects
- increase yield by using error correction (e.g., parity bits) and redundancy

and are susceptible to soft errors due to alpha particles and cosmic rays
Redundancy in the Memory Structure
[Figure: memory array with a redundant row, redundant columns, and a fuse bank alongside the row address and column address inputs]
Row Redundancy
[Figure: the functional address is compared (== ?) against the fused repair addresses; each comparator drives a redundant wordline, and an enable signal gates the normal wordline decoders that drive the normal wordlines]
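
The comparison step in row redundancy can be sketched in a few lines of Python. This is a behavioral guess at the figure, assuming that a match against a fused repair address selects the corresponding redundant wordline and disables the normal decoder; the names are hypothetical:

```python
def select_wordline(address: int, fused_repair_addresses: list[int]) -> tuple[str, int]:
    """Return which wordline fires for a functional address."""
    for spare, repair_address in enumerate(fused_repair_addresses):
        if address == repair_address:  # the "== ?" comparator matches
            # Assumption: a match disables the normal wordline decoder.
            return ("redundant", spare)
    return ("normal", address)


fuses = [37, 200]  # example: rows 37 and 200 were found defective at test
print(select_wordline(37, fuses))   # routed to spare row 0
print(select_wordline(38, fuses))   # normal row 38
```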
Column Redundancy
[Figure: eight normal data columns and one redundant data column, with eight fuses feeding the outputs Data 0 to Data 7]
Error-Correcting Codes
Example: Hamming Codes
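[Figure: Hamming code example; e.g., if B3 flips, the check bits evaluate to 011 = 3, identifying the flipped bit]
2^k >= m + k + 1, where m is the number of data bits and k is the number of check bits
For 64 data bits, needs 7 check bits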
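
Both the check-bit bound and the B3 example can be reproduced with a short Python sketch. It uses a standard Hamming(7,4) layout (check bits at positions 1, 2, and 4) as an assumption, since the slide's exact bit layout is not recoverable from the transcript:

```python
def min_check_bits(m: int) -> int:
    """Smallest k with 2^k >= m + k + 1."""
    k = 0
    while 2 ** k < m + k + 1:
        k += 1
    return k


def syndrome(code: list[int]) -> int:
    """XOR of the positions holding a 1; zero for a valid codeword."""
    s = 0
    for pos in range(1, len(code)):
        if code[pos]:
            s ^= pos
    return s


def encode(data: list[int]) -> list[int]:
    """Hamming(7,4): data at positions 3, 5, 6, 7; check bits at 1, 2, 4."""
    code = [0] * 8  # index 0 is unused so that indices match bit positions
    for pos, bit in zip((3, 5, 6, 7), data):
        code[pos] = bit
    s = syndrome(code)
    for p in (1, 2, 4):
        code[p] = 1 if s & p else 0  # choose check bits so the syndrome is 0
    return code


word = encode([1, 0, 1, 1])
word[3] ^= 1  # B3 flips
print(min_check_bits(64), syndrome(word))
```

The syndrome of the corrupted word equals the position of the flipped bit, so a single-bit error can be located and corrected by flipping that position back.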
Performance and area overhead for ECC
Redundancy and Error Correction
Soft Errors
Nonrecurrent and nonpermanent errors from
- alpha particles (from the packaging materials)
- neutrons from cosmic rays

As feature size decreases, the charge stored at each node decreases (due to a lower node capacitance and lower VDD) and thus Qcritical (the charge necessary to cause a bit flip) decreases, leading to an increase in the soft error rate (SER)

[Figure, from Semico Research Corp.: System FITS (log scale, 1 to 10000) versus process technology (0.25, 0.18, 0.13, 0.09, 0.05 µm)]

From Actel:
MTBF (hours)             | .13 µm | .09 µm
Ground-based             | 895    | 448
Civilian Avionics System | 324    | 162
Military Avionics System | 18     | 9
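
The scaling argument for soft errors rests on the stored charge Q = C x VDD shrinking with each generation. A tiny Python illustration with hypothetical node values (they are not taken from the lecture, and Qcritical itself depends on more than this product):

```python
def stored_charge(c_node_farads: float, vdd_volts: float) -> float:
    """Charge stored on a node: Q = C * VDD, in coulombs."""
    return c_node_farads * vdd_volts


older = stored_charge(2e-15, 2.5)  # assumed: 2 fF node at 2.5 V
newer = stored_charge(1e-15, 1.2)  # assumed: 1 fF node at 1.2 V
print(f"older: {older * 1e15:.1f} fC, newer: {newer * 1e15:.1f} fC")
```

Halving the capacitance and roughly halving the supply cuts the stored charge by about a factor of four, which is the trend the slide links to a rising SER.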
CELL Processor!
See class website for web links
Embedded SRAM (4.6 GHz)
- Each SRAM cell is 0.99 µm^2
- Each block has 32 sub-arrays
- Each sub-array has 128 WLs plus 4 redundant lines
- Each block has 2 redundant BLs
Multiplier in CELL
Next Lecture and Reminders

Next lecture

Power consumption in datapaths and memories
- Reading assignment – Rabaey, et al, 11.7; 12.5