Transcript: Main Memory

Main Memory
by J. Nelson Amaral
Types of Memories
Read/Write Memory (RWM): we can store and retrieve data.
Random Access Memory (RAM): the time required to read or
write a bit of memory is independent of the bit’s location.
Static Random Access Memory (SRAM): once a word is written
to a location, it remains stored as long as power is applied
to the chip, unless the location is written again.
Dynamic Random Access Memory (DRAM): the data stored at
each location must be refreshed periodically by reading it and
then writing it back again, or else it disappears.
Static × Dynamic Memory Cell
[Figure: a static memory cell (6 transistors) and a dynamic memory cell (1 transistor), each connected to a bit line and a word line.]
Writing 1 in a Dynamic Memory
To store a 1 in this cell, a HIGH voltage is placed on the bit line, causing the capacitor to charge through the on transistor.
Writing 0 in a Dynamic Memory
To store a 0 in this cell, a LOW voltage is placed on the bit line, causing the capacitor to discharge through the on transistor.
Destructive Reads
To read the DRAM cell, the bit line is precharged to a voltage halfway between HIGH and LOW, and then the word line is set HIGH.
Depending on the charge in the capacitor, the precharged bit line is pulled slightly higher or lower.
A sense amplifier detects this small change and recovers a 1 or a 0.
Recovering from Destructive Reads
The read operation discharges the capacitor.
Therefore a read operation in a dynamic memory must be immediately followed by a write of the same value just read, to restore the capacitor's charge.
Forgetful Memories
The problem with this cell is that it is not bi-stable: only the state 0 can be kept indefinitely. When the cell is in state 1, the charge stored in the capacitor slowly dissipates and the data is lost.
Refreshing the Memory: Why DRAMs are Dynamic
[Plot: capacitor voltage Vcap versus time. After a 1 is written, Vcap decays from VCC toward the LOW threshold and is pulled back up to VCC by periodic refreshes; a stored 0 remains at 0 V.]
The solution is to periodically refresh the memory cells by reading and writing back each one of them.
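To make the refresh requirement concrete, here is a minimal sketch in C; the 64 ms retention window and the 8192-row bank are assumed, illustrative numbers (not values from the slides), used only to show how the per-row refresh interval falls out:

```c
#include <stdio.h>

int main(void) {
    /* Assumed, illustrative parameters: an 8192-row DRAM bank whose
       cells must each be rewritten at least once every 64 ms. */
    const double retention_ms = 64.0;   /* retention window for every cell  */
    const int    rows         = 8192;   /* rows that must each be refreshed */

    /* If refreshes are spread evenly, one row is refreshed every
       retention / rows milliseconds. */
    double interval_us = retention_ms * 1000.0 / rows;

    printf("refresh one row every %.2f microseconds\n", interval_us);
    /* ~7.81 us: the controller steals a short refresh cycle this often. */
    return 0;
}
```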
[Figure sequence: internal structure of an 8×4 static RAM. A 3-to-8 decoder driven by address bits A2 A1 A0 (shown as 011, selecting row 3) enables one of eight rows of four memory cells; each cell has IN, OUT, SEL, and WR connections. Data inputs DIN3–DIN0 are written into the selected row when CS_L and WE_L are asserted (internal WR signal), and the selected row drives data outputs DOUT3–DOUT0 when CS_L and OE_L are asserted (internal IOE signal).]
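The figure above can be summarized by a small behavioural sketch in C; the 8×4 organization and the active-low control signals are taken from the figure's labels, but the function names and data values are only illustrative, not the actual circuit:

```c
#include <stdint.h>
#include <stdio.h>

/* Behavioural sketch of the 8x4 static RAM in the figure: a 3-to-8
   decoder selects one of eight rows of four cells; CS_L/WE_L gate
   writes and CS_L/OE_L gate whether the outputs are driven. */
static uint8_t cells[8];   /* one 4-bit word per row, kept in the low nibble */

static void sram_write(uint8_t a2a1a0, uint8_t din, int cs_l, int we_l) {
    if (cs_l == 0 && we_l == 0)           /* chip selected, write enabled  */
        cells[a2a1a0 & 7] = din & 0xF;    /* decoder picks row a2a1a0      */
}

/* Returns -1 when the outputs are not driven (chip or output disabled). */
static int sram_read(uint8_t a2a1a0, int cs_l, int oe_l) {
    if (cs_l == 0 && oe_l == 0)
        return cells[a2a1a0 & 7];
    return -1;                            /* DOUT pins are tri-stated      */
}

int main(void) {
    sram_write(3, 0xA, 0, 0);             /* write 1010 to row A2A1A0 = 011 */
    printf("row 3 reads 0x%x\n", sram_read(3, 0, 0));
    return 0;
}
```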
Bi-directional Data Bus
[Figure: a microprocessor connected to the memory over a bi-directional data bus. The separate DIN/DOUT pins are replaced by shared data pins DIO3–DIO0; WE_L, CS_L, and OE_L determine whether the memory drives or samples the data lines.]
DRAM High Level View
[Figure: a DRAM chip organized as a 4×4 array of supercells (rows 0–3, cols 0–3) with an internal row buffer. A memory controller (connected to the CPU) sends addresses over a 2-bit addr bus and transfers data over an 8-bit data bus.]
Bryant/O'Hallaron, p. 459
DRAM RAS Request
[Figure: the memory controller places RAS = 2 on the addr bus; the DRAM copies the entire contents of row 2 into the internal row buffer.]
RAS = Row Address Strobe
Bryant/O'Hallaron, p. 460
DRAM CAS Request
[Figure: the memory controller places CAS = 1 on the addr bus; the DRAM sends supercell (2,1) from the internal row buffer over the 8-bit data bus.]
CAS = Column Address Strobe
Bryant/O'Hallaron, p. 460
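The RAS/CAS sequence on these two slides can be modelled with a short sketch in C; the 4×4 supercell array mirrors the figure, while the ras/cas function names and the stored value are just illustrative:

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative model of the DRAM above: 4x4 supercells of one byte each. */
static uint8_t cells[4][4];
static uint8_t row_buffer[4];

/* RAS phase: copy the whole selected row into the internal row buffer. */
static void ras(unsigned row) {
    for (unsigned c = 0; c < 4; c++)
        row_buffer[c] = cells[row][c];
}

/* CAS phase: return one supercell from the row buffer. */
static uint8_t cas(unsigned col) {
    return row_buffer[col];
}

int main(void) {
    cells[2][1] = 0x42;                             /* supercell (2,1)      */
    ras(2);                                         /* RAS = 2: load row 2  */
    printf("supercell (2,1) = 0x%02x\n", cas(1));   /* CAS = 1: pick column */
    return 0;
}
```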
Memory Modules
[Figure: a 64 MB memory module consisting of 8 8M×8 DRAMs (DRAM 0 through DRAM 7). The address (row = i, col = j) is broadcast to all eight DRAMs; each chip returns its supercell (i,j), contributing 8 bits of the 64-bit doubleword at main memory address A (DRAM 0 supplies bits 0–7, DRAM 1 bits 8–15, ..., DRAM 7 bits 56–63). The memory controller assembles the bytes into the 64-bit doubleword sent to the CPU chip.]
Bryant/O'Hallaron, p. 461
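A minimal sketch in C of the byte-lane assembly described above; the fetch_supercell helper is hypothetical and simply stands in for the eight DRAM chips:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: returns the 8-bit supercell (i, j) stored in
   DRAM chip 'chip' of the module.  Here it just fabricates a value. */
static uint8_t fetch_supercell(int chip, int i, int j) {
    return (uint8_t)(chip * 16 + i + j);   /* placeholder data */
}

/* Assemble the 64-bit doubleword for (row = i, col = j): each of the
   eight 8M x 8 DRAMs contributes one byte lane, DRAM k -> bits 8k..8k+7. */
static uint64_t read_doubleword(int i, int j) {
    uint64_t dword = 0;
    for (int chip = 0; chip < 8; chip++)
        dword |= (uint64_t)fetch_supercell(chip, i, j) << (8 * chip);
    return dword;
}

int main(void) {
    printf("doubleword at (row 2, col 1) = 0x%016llx\n",
           (unsigned long long)read_doubleword(2, 1));
    return 0;
}
```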
Read Cycle on an Asynchronous DRAM
Step 1: Apply the row address.
Step 2: RAS goes from high to low and remains low.
Step 3: Apply the column address.
Step 4: WE must be high.
Step 5: CAS goes from high to low and remains low.
Step 6: OE goes low.
Step 7: Data appears.
Step 8: RAS and CAS return to high.
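The same eight steps written as a sketch in C; the pin-driver helpers (drive_addr, set_ras, and so on) are hypothetical stand-ins for memory-mapped I/O or testbench signals, and all timing delays are omitted:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical pin drivers: in a real system these would be memory-mapped
   I/O or testbench signals.  Here they just log, so the sketch runs. */
static void drive_addr(uint16_t a) { printf("addr <= 0x%03x\n", (unsigned)a); }
static void set_ras(int v)         { printf("RAS_L <= %d\n", v); }
static void set_cas(int v)         { printf("CAS_L <= %d\n", v); }
static void set_we(int v)          { printf("WE_L  <= %d\n", v); }
static void set_oe(int v)          { printf("OE_L  <= %d\n", v); }
static uint8_t sample_data(void)   { return 0xA5; /* pretend data */ }

/* One read cycle, following steps 1-8 above (timing delays omitted). */
static uint8_t async_dram_read(uint16_t row, uint16_t col) {
    drive_addr(row);            /* step 1: apply row address          */
    set_ras(0);                 /* step 2: RAS goes low and stays low */
    drive_addr(col);            /* step 3: apply column address       */
    set_we(1);                  /* step 4: WE must stay high (read)   */
    set_cas(0);                 /* step 5: CAS goes low and stays low */
    set_oe(0);                  /* step 6: OE goes low                */
    uint8_t d = sample_data();  /* step 7: data appears on the pins   */
    set_ras(1);                 /* step 8: RAS and CAS return high    */
    set_cas(1);
    set_oe(1);
    return d;
}

int main(void) {
    printf("read: 0x%02x\n", async_dram_read(2, 1));
    return 0;
}
```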
Improved DRAMs
Central idea: each read of a DRAM actually reads a complete row of bits (one word line) from the DRAM core into an array of sense amplifiers.
A traditional asynchronous DRAM interface then selects a small number of these bits to be delivered to the cache/microprocessor.
All the other bits already extracted from the DRAM cells into the sense amps are wasted.
Fast Page Mode DRAMs
In a DRAM with Fast Page Mode, a page is defined as all memory addresses that have the same row address.
To read in fast page mode, steps 1 to 7 of a standard read cycle are performed.
Then OE and CAS are switched high, but RAS remains low.
Steps 3 to 7 (providing a new column address, asserting CAS and OE) are then repeated for each new memory location to be read.
A Fast Page Mode Read Cycle on an Asynchronous DRAM
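A small back-of-the-envelope sketch in C comparing the two modes; the timing numbers are assumptions chosen only to illustrate the effect, not datasheet values:

```c
#include <stdio.h>

int main(void) {
    /* Illustrative, assumed timings in nanoseconds (not from a datasheet). */
    const double t_full_cycle = 110.0;  /* complete RAS+CAS read cycle       */
    const double t_page_cycle =  40.0;  /* extra CAS-only access, same row   */
    const int    n            =   8;    /* locations read from the same page */

    double t_standard = n * t_full_cycle;
    double t_fpm      = t_full_cycle + (n - 1) * t_page_cycle;

    printf("standard reads : %.0f ns\n", t_standard);   /* 880 ns */
    printf("fast page mode : %.0f ns\n", t_fpm);        /* 390 ns */
    return 0;
}
```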
Enhanced Data Output RAMs (EDO-RAM)
The process to read multiple locations in an EDO-RAM is very similar to Fast Page Mode.
The difference is that the output drivers are not disabled when CAS goes high.
This distinction allows the data from the current read cycle to remain present at the outputs while the next cycle begins.
As a result, faster read cycle times are possible.
An Enhanced Data Output Read Cycle on an Asynchronous DRAM
Synchronous DRAMs (SDRAM)
A Synchronous DRAM (SDRAM) has a clock input. It operates in a similar fashion to fast page mode and EDO DRAMs; however, the consecutive data is output synchronously on the rising/falling edge of the clock instead of on command by CAS.
How many data elements will be output (the length of the burst) is programmable, up to the maximum size of the row.
The clock of an SDRAM typically runs about an order of magnitude faster than individual random accesses can be completed.
DDR SDRAM
A Double Data Rate (DDR) SDRAM is an SDRAM
that allows data transfers both on the rising and
falling edge of the clock.
Thus the effective data transfer rate of a DDR
SDRAM is two times the data transfer rate of
a standard SDRAM with the same clock frequency.
A Quad Data Rate (QDR) SDRAM doubles the data
transfer rate again by separating the input and
output of a DDR SDRAM.
P-H 473
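A back-of-the-envelope sketch in C of how the peak rates compare; the 200 MHz clock and 64-bit bus are assumed example parameters, not a specific part:

```c
#include <stdio.h>

int main(void) {
    /* Assumed example parameters: a 200 MHz memory clock and a 64-bit bus. */
    const double clock_hz  = 200e6;
    const double bus_bytes = 8.0;            /* 64-bit data bus */

    double sdr = clock_hz * bus_bytes;       /* one transfer per cycle   */
    double ddr = 2.0 * sdr;                  /* rising and falling edges */
    double qdr = 2.0 * ddr;                  /* separate input/output    */

    printf("SDR peak: %.1f GB/s\n", sdr / 1e9);   /* 1.6 GB/s */
    printf("DDR peak: %.1f GB/s\n", ddr / 1e9);   /* 3.2 GB/s */
    printf("QDR peak: %.1f GB/s\n", qdr / 1e9);   /* 6.4 GB/s */
    return 0;
}
```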
Main Memory Supporting Caches
• Use DRAMs for main memory
– Fixed width (e.g., 1 word)
– Connected by fixed-width clocked bus
• Bus clock is typically slower than CPU clock
P-H 471
Improving Memory Bandwidth
Baer p. 248
SIMM × DIMM
SIMM ≡ Single Inline Memory Module
DIMM ≡ Dual Inline Memory Module
A DIMM uses both sides of the edge connector → twice as many connections to the module.
Memory System Example
[Figure: cache connected through the memory bus to a 1-word-wide DRAM memory.]
Assumptions: 1 bus cycle for address transfer, 15 bus cycles per DRAM access, 1 bus cycle per data transfer, 4-word cache block.
Miss penalty = 1 + 4×15 + 4×1 = 65 bus cycles
Bandwidth = 16 bytes / 65 cycles ≈ 0.25 byte/cycle
P-H 471
Example: Wider Memory
[Figure: cache connected through the memory bus to a 4-word-wide DRAM memory.]
Same assumptions: 1 bus cycle for address transfer, 15 bus cycles per DRAM access, 1 bus cycle per data transfer, 4-word cache block.
Miss penalty = 1 + 15 + 1 = 17 bus cycles
Bandwidth = 16 bytes / 17 cycles ≈ 0.94 byte/cycle
Wider buses/memories are costly!
P-H 471
Example: Interleaved Memory
[Figure: cache connected through the memory bus to four interleaved memory banks.]
Same assumptions: 1 bus cycle for address transfer, 15 bus cycles per DRAM access, 1 bus cycle per data transfer, 4-word cache block.
Miss penalty = 1 + 15 + 4×1 = 20 bus cycles
Bandwidth = 16 bytes / 20 cycles = 0.8 byte/cycle
P-H 471
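The three organizations above can be compared with a short calculation; this sketch in C just reproduces the slide arithmetic (4-word blocks, 4-byte words, and the listed bus-cycle costs):

```c
#include <stdio.h>

/* Miss penalty in bus cycles for a 4-word block:
   1 address cycle + (DRAM accesses) * 15 + (data transfers) * 1. */
static void report(const char *name, int dram_accesses, int data_transfers) {
    int penalty = 1 + dram_accesses * 15 + data_transfers * 1;
    double bandwidth = 16.0 / penalty;          /* 4 words * 4 bytes */
    printf("%-20s penalty = %2d cycles, bandwidth = %.2f bytes/cycle\n",
           name, penalty, bandwidth);
}

int main(void) {
    report("1-word-wide DRAM", 4, 4);    /* 65 cycles, 0.25 B/cycle */
    report("4-word-wide DRAM", 1, 1);    /* 17 cycles, 0.94 B/cycle */
    report("4-bank interleaved", 1, 4);  /* 20 cycles, 0.80 B/cycle */
    return 0;
}
```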
Split-Transaction Bus
Issue: memory should not hold the processor-memory bus while it fetches the data into its buffers.
Solution: a split-transaction bus.
Example (load):
Phase 1: the processor sends the address and operation type to the bus, then releases the bus.
Phase 2: the memory fetches the data into its buffers.
Phase 3: the memory controller requests the bus, the memory sends the data onto the bus, and the bus is released.
Phase 1 for access A can be in parallel with Phase 2 for access B.
Baer p. 250
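A toy sketch in C of why splitting the transaction helps; the cycle costs are assumptions for illustration, and the comparison counts only how long the bus itself is occupied:

```c
#include <stdio.h>

int main(void) {
    /* Assumed illustrative costs, in bus cycles. */
    const int req  = 1;    /* phase 1: send address + operation   */
    const int mem  = 15;   /* phase 2: memory fills its buffers   */
    const int data = 1;    /* phase 3: memory sends the data back */
    const int n    = 4;    /* four independent loads              */

    /* Non-split bus: each load occupies the bus for its entire access,
       including the memory's internal fetch. */
    int busy_held  = n * (req + mem + data);

    /* Split-transaction bus: the bus is only occupied for the request
       and reply phases; the internal fetches proceed off the bus and
       can overlap with other accesses' phases. */
    int busy_split = n * (req + data);

    printf("bus occupied, non-split: %d cycles\n", busy_held);   /* 68 */
    printf("bus occupied, split    : %d cycles\n", busy_split);  /*  8 */
    return 0;
}
```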
Bank Interleaving and Cache Indexing
[Figure: address fields under line interleaving (cache tag, cache index, bank index, cache displacement) and under page interleaving (cache tag, cache index, page index, bank index, page offset).]
Issue: in both cases, the cache index overlaps the bank index ⇒ on a miss, the missing line is in the same bank as the replaced line ⇒ full penalty for precharge, row and column access.
Baer p. 249
Bank Interleaving and Cache Indexing
Solution: bank rehash by XORing the k bits of the bank index with k bits of the tag.
Baer p. 250
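A minimal sketch in C of the XOR rehash; the address layout (3-bit bank index, 6-bit offset) is made up purely for illustration:

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative, assumed address layout (not from the text):
   [ tag | bank index (k bits) | offset ], with k = 3 -> 8 banks. */
enum { K = 3, OFFSET_BITS = 6 };

static uint32_t make_addr(unsigned tag, unsigned index, unsigned offset) {
    return ((uint32_t)tag << (OFFSET_BITS + K)) |
           ((uint32_t)index << OFFSET_BITS) | offset;
}

/* Plain bank selection: just the k bank-index bits. */
static unsigned bank_plain(uint32_t addr) {
    return (addr >> OFFSET_BITS) & ((1u << K) - 1);
}

/* Rehashed bank selection: XOR the k bank-index bits with k tag bits. */
static unsigned bank_rehash(uint32_t addr) {
    unsigned index = (addr >> OFFSET_BITS) & ((1u << K) - 1);
    unsigned tag_k = (addr >> (OFFSET_BITS + K)) & ((1u << K) - 1);
    return index ^ tag_k;
}

int main(void) {
    /* Two addresses with the same bank-index bits but different tags:
       they conflict under plain indexing, but not under the rehash. */
    uint32_t a = make_addr(0x10, 5, 0);
    uint32_t b = make_addr(0x13, 5, 0);
    printf("plain : bank %u vs bank %u\n", bank_plain(a), bank_plain(b));   /* 5 vs 5 */
    printf("rehash: bank %u vs bank %u\n", bank_rehash(a), bank_rehash(b)); /* 5 vs 6 */
    return 0;
}
```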
Memory Controller
Transactions do not need to be processed in order.
Intelligent controllers optimize accesses by reordering
transactions.
Baer p. 250
Memory Controller
Why is the controller's job difficult?
1. It must obey more than 50 timing constraints.
2. It must prioritize requests to optimize performance.
Scheduling decisions have long-term consequences: future requests depend on which request is served first (which instruction is unblocked), and the benefit of a scheduling decision depends on future processor behavior.
IpekISCA2008 p. 40
Reinforcement-Learning Controller
IpekISCA2008 p. 41
Reinforcement-Learning Controller
IpekISCA2008 p. 42
Reinforcement Learning Controller Performance
4-core system, peak BW: 6.4 GB/s
Bus utilization: In-Order 26%, FR-FCFS 46%, RL 56%, Optimistic 80%
IpekISCA2008 p. 42
Online RL is better than offline RL
IpekISCA2008 p. 48
Rambus
Narrow and fast buses, split transactions, separate row and column control lines.
Introduced in 1997: 400 MHz, 1.6 GB/s peak, 16 internal banks.
SDRAMs at the time were at 100 MHz and had a peak of 0.4 GB/s.
2010: 64-bit DDR DRAMs at 133 MHz ⇒ same peak.
DRAM Generations

Year   Capacity   $/GB
1980   64Kbit     $1,500,000
1983   256Kbit    $500,000
1985   1Mbit      $200,000
1989   4Mbit      $50,000
1992   16Mbit     $15,000
1996   64Mbit     $10,000
1998   128Mbit    $4,000
2000   256Mbit    $1,000
2004   512Mbit    $250
2007   1Gbit      $50

[Plot: access time to a new row/column (ns) versus year.]
P-H 474