Caches (Cont.) - Thayer School of Engineering at Dartmouth


ENGS 116 Lecture 14
1
Main Memory and Virtual Memory
Vincent H. Berk
October 26, 2005
Reading for today: Sections 5.1 – 5.4, (Jouppi article)
Reading for Friday: Sections 5.5 – 5.8
Reading for Monday: Sections 5.8 – 5.12 and 5.16
ENGS 116 Lecture 14
2
Main Memory Background
• Performance of Main Memory:
– Latency: Cache miss penalty
• Access Time: time from when the request is issued until the word arrives
• Cycle Time: time between requests
– Bandwidth: I/O & large block miss penalty (L2)
• Main Memory is DRAM: dynamic random access memory
– Dynamic since it must be refreshed periodically (≈ 1% of the time)
– Addresses divided into 2 halves (memory as a 2-D matrix):
• RAS or Row Access Strobe
• CAS or Column Access Strobe
• Cache uses SRAM: static random access memory
– No refresh; 6 transistors/bit vs. 1 transistor;
Size: DRAM/SRAM ≈ 4-8;
Cost/Cycle time: SRAM/DRAM ≈ 8-16
ENGS 116 Lecture 14
3
4 Key DRAM Timing Parameters
• tRAC: minimum time from RAS line falling to valid data output.
– Quoted as the speed of a DRAM when buying
– A typical 512-Mbit DRAM has tRAC = 40–60 ns
• tRC: minimum time from the start of one row access to the start of the
next.
– tRC = 80 ns for a 512-Mbit DRAM with a tRAC of 40–60 ns
• tCAC: minimum time from CAS line falling to valid data output.
– 5 ns for a 512-Mbit DRAM with a tRAC of 40–60 ns
• tPC: minimum time from the start of one column access to the start of
the next.
– 15 ns for a 512-Mbit DRAM with a tRAC of 40–60 ns
ENGS 116 Lecture 14
4
DRAM Performance
• A 40 ns (tRAC) DRAM can
– perform a row access only every 80 ns (tRC)
– perform column access (tCAC) in 5 ns, but time between column
accesses is at least 15 ns (tPC).
• In practice, external address delays and turning around buses
make it 20 to 25 ns
• These times do not include the time to drive the addresses off the
microprocessor or the memory controller overhead!
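
To see how these parameters combine, a minimal sketch (in C) of the arithmetic above, assuming a block that sits entirely in one DRAM row and using the 512-Mbit figures quoted earlier; bus turnaround and controller overhead are deliberately ignored:

#include <stdio.h>

/* Page-mode timing sketch.  Values are the 512-Mbit DRAM figures
 * quoted above; the model ignores bus turnaround and controller
 * overhead, and assumes the whole block lies in one row. */
#define T_RAC 40  /* ns: RAS fall to valid data (first access)  */
#define T_PC  15  /* ns: column-to-column time in page mode     */
#define T_RC  80  /* ns: minimum row-access-to-row-access time  */

/* Time to read `words` words from a single open row:
 * one full row access, then page-mode column accesses. */
static int block_read_ns(int words)
{
    return T_RAC + (words - 1) * T_PC;
}

int main(void)
{
    printf("page-mode 4-word block:  %d ns\n", block_read_ns(4)); /* 85  */
    printf("four separate row reads: %d ns\n", 4 * T_RC);         /* 320 */
    return 0;
}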
ENGS 116 Lecture 14
5
DRAM History
• DRAMs: capacity + 60%/yr, cost – 30%/yr
– 2.5X cells/area, 1.5X die size in ≈ 3 years
• ‘98 DRAM fab line costs $2B
• Rely on increasing numbers of computers & memory per computer
(60% market)
– SIMM or DIMM is replaceable unit
⇒ computers can use any generation of DRAM
• Commodity, second-source industry
⇒ high volume, low profit, conservative
– Little organizational innovation in 20 years
• Order of importance: 1) Cost/bit, 2) Capacity
– First RAMBUS: 10X BW, +30% cost ⇒ little impact
• Current SDRAM yield very high: > 80%
ENGS 116 Lecture 14
6
Main Memory Performance
• Simple:
– CPU, Cache, Bus,
Memory same width
(32 or 64 bits)
• Wide:
– CPU/Mux 1 word;
Mux/Cache, Bus,
Memory N words
(Alpha: 64 bits & 256
bits; UltraSPARC 512)
• Interleaved:
– CPU, Cache, Bus 1
word; Memory N
modules (4 modules);
example is word
interleaved
ENGS 116 Lecture 14
7
Main Memory Performance
• Timing model (word size is 32 bits)
– 1 to send address,
– 6 for access time, 1 to send data
– Cache Block is 4 words
• Simple memory ⇒ 4 × (1 + 6 + 1) = 32
• Wide memory ⇒ 1 + 6 + 1 = 8
• Interleaved memory ⇒ 1 + 6 + 4 × 1 = 11
[Figure: four word-interleaved memory banks. Bank 0 holds addresses 0, 4, 8, 12; Bank 1 holds 1, 5, 9, 13; Bank 2 holds 2, 6, 10, 14; Bank 3 holds 3, 7, 11, 15.]
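
The three timing results above can be reproduced with a small model. A sketch, assuming the slide's parameters (1 cycle to send the address, 6 cycles of access time, 1 cycle per bus transfer, 4-word blocks); the function names are mine, not from the slide:

#include <stdio.h>

/* Toy model from the slide: 1 cycle to send the address, 6 cycles
 * of access time, 1 cycle to transfer one bus width of data. */
enum { ADDR = 1, ACCESS = 6, XFER = 1 };

/* Simple: bus and memory are one word wide; each word pays full cost. */
static int simple(int words)      { return words * (ADDR + ACCESS + XFER); }

/* Wide: bus and memory are a full block wide; one access moves it all. */
static int wide(void)             { return ADDR + ACCESS + XFER; }

/* Interleaved: one address starts all banks; the accesses overlap and
 * the words return one bus transfer apart. */
static int interleaved(int words) { return ADDR + ACCESS + words * XFER; }

int main(void)
{
    printf("simple:      %d cycles\n", simple(4));      /* 32 */
    printf("wide:        %d cycles\n", wide());         /*  8 */
    printf("interleaved: %d cycles\n", interleaved(4)); /* 11 */
    return 0;
}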
ENGS 116 Lecture 14
8
Independent Memory Banks
• Memory banks for independent accesses
vs. faster sequential accesses
– Multiprocessor
– I/O (DMA)
– CPU with Hit under n Misses, Non-blocking Cache
• Superbank: all memory active on one block transfer
(or Bank)
• Bank: portion within a superbank that is word
interleaved (or subbank)
[Figure: the address is divided into a superbank number and a superbank offset; the superbank offset in turn splits into a bank number and a bank offset.]
ENGS 116 Lecture 14
9
Independent Memory Banks
• How many banks?
number banks ≥ number clocks to access word in bank
– For sequential accesses; otherwise accesses return to the original
bank before it has the next word ready
– (like in vector case)
• Increasing DRAM capacity ⇒ fewer chips ⇒ harder to have many banks
ENGS 116 Lecture 14
10
Avoiding Bank Conflicts
• Lots of banks
int x[256][512];
for (j = 0; j < 512; j = j + 1)
    for (i = 0; i < 256; i = i + 1)
        x[i][j] = 2 * x[i][j];
• Even with 128 banks, since 512 is a multiple of 128, successive word
accesses conflict in the same bank
• SW: loop interchange or declaring the array not a power of 2 (“array
padding”); see the sketch after this list
• HW: prime number of banks
– bank number = address mod number of banks
– address within bank = address / number of words in bank
– modulo & divide per memory access with prime number of banks?
– address within bank = address mod number of words in bank
– bank number? easy if 2^N words per bank
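
The two software fixes, sketched in C against the array above (the function names are mine). Interchange makes the inner loop walk consecutive words, which spread across all banks; padding changes the row stride from 512 to 513 words, and 513 is not a multiple of 128, so even the original loop order touches a different bank each iteration:

#define ROWS 256
#define COLS 512

/* Fix 1: loop interchange.  The inner loop now has stride 1, so
 * consecutive accesses fall in consecutive (different) banks. */
int x[ROWS][COLS];

void scale_interchanged(void)
{
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            x[i][j] = 2 * x[i][j];
}

/* Fix 2: array padding.  One unused column makes the row stride
 * 513 words; 513 mod 128 != 0, so the original j-outer loop no
 * longer revisits the same bank on every access. */
int y[ROWS][COLS + 1];

void scale_padded(void)
{
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            y[i][j] = 2 * y[i][j];
}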
ENGS 116 Lecture 14
11
Fast Memory Systems: DRAM specific
• Multiple CAS accesses: several names (page mode)
– Extended Data Out (EDO): 30% faster in page mode
• New DRAMs to address gap;
what will they cost, will they survive?
– RAMBUS: startup company; reinvent DRAM interface
» Each chip a module vs. slice of memory
» Short bus between CPU and chips
» Does own refresh
» Variable amount of data returned
» 1 byte / 2 ns (500 MB/s per chip)
– Synchronous DRAM: 2 banks on chip, a clock signal to DRAM, transfer
synchronous to system clock (66 - 150 MHz)
– Intel claims RAMBUS Direct is future of PC memory
• Niche memory or main memory?
– e.g., Video RAM for frame buffers, DRAM + fast serial output
ENGS 116 Lecture 14
12
Virtual Memory
• Virtual Address (2^32, 2^64) to Physical Address (2^28) mapping
• Virtual memory in terms of cache:
– Cache block?
– Cache miss?
• How is virtual memory different from caches?
– What controls replacement
– Size (transfer unit, mapping mechanisms)
– Lower-level use
ENGS 116 Lecture 14
13
[Figure 5.36: The logical program in its contiguous virtual address space is shown on the left; it consists of four pages A, B, C, and D. Pages A, B, and C map to scattered page frames in physical main memory, while page D resides on disk.]
ENGS 116 Lecture 14
14
Parameter          First-level cache     Virtual memory
Block (page) size  16–128 bytes          4096–65,536 bytes
Hit time           1–2 clock cycles      40–100 clock cycles
Miss penalty       8–100 clock cycles    700,000–6,000,000 clock cycles
  (Access time)    (6–60 clock cycles)   (500,000–4,000,000 clock cycles)
  (Transfer time)  (2–40 clock cycles)   (200,000–2,000,000 clock cycles)
Miss rate          0.5–10%               0.00001–0.001%
Data memory size   0.016–1 MB            16–8192 MB

Figure 5.37 Typical ranges of parameters for caches and virtual memory.
ENGS 116 Lecture 14
15
Virtual Memory
• 4 Questions for Virtual Memory (VM)?
– Q1: Where can a block be placed in the upper level?
fully associative, set associative, or direct mapped?
– Q2: How is a block found if it is in the upper level?
– Q3: Which block should be replaced on a miss?
random or LRU?
– Q4: What happens on a write?
write back or write through?
• Other issues: size; pages or segments or hybrid
ENGS 116 Lecture 14
16
[Figure 5.40: The mapping of a virtual address to a physical address via a page table. The virtual page number indexes the page table in main memory; the physical page frame it returns is concatenated with the page offset to form the physical address.]
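
A minimal sketch of this translation in C, assuming 32-bit virtual addresses, 4 KB pages, and a flat single-level table (real tables are multi-level, and the PTE layout here is hypothetical):

#include <stdint.h>

#define PAGE_BITS 12                   /* 4 KB pages */
#define PAGE_SIZE (1u << PAGE_BITS)

typedef struct {
    uint32_t frame : 19;               /* physical page-frame number */
    uint32_t valid : 1;                /* page present in memory?    */
} pte_t;                               /* hypothetical PTE layout    */

extern pte_t page_table[1u << (32 - PAGE_BITS)];   /* 2^20 entries */

/* Translate a virtual address; returns -1 on a page fault. */
int64_t translate(uint32_t va)
{
    uint32_t vpn    = va >> PAGE_BITS;       /* virtual page number   */
    uint32_t offset = va & (PAGE_SIZE - 1);  /* offset passes through */

    if (!page_table[vpn].valid)
        return -1;                           /* fault: OS brings page in */
    return ((int64_t)page_table[vpn].frame << PAGE_BITS) | offset;
}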
ENGS 116 Lecture 14
17
Fast Translation: Translation Buffer (TLB)
• Cache of translated addresses
• Data portion usually includes physical page frame number, protection
field, valid bit, use bit, and dirty bit
• Alpha 21064 data TLB: 32-entry fully associative
[Figure: Alpha 21064 data TLB lookup in four steps. The virtual address supplies a 30-bit page-frame address and a 13-bit page offset. Each of the 32 entries holds V/R/W fields (1 + 2 + 2 bits), a 30-bit tag, and a 21-bit physical page number; a 32:1 multiplexer selects the matching entry. The 21-bit physical page number (high-order bits) is concatenated with the 13-bit page offset (low-order bits) to form the 34-bit physical address.]
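
In software form, the lookup in the figure looks roughly like this (a sketch using the field widths shown; the sequential loop stands in for the 32 comparators that operate in parallel in hardware):

#include <stdint.h>

/* Fully associative data TLB modeled on the figure: 32 entries,
 * 30-bit virtual page-frame tag, 21-bit physical page number,
 * 13-bit page offset.  The loop replaces a parallel 32-way match. */
typedef struct {
    uint32_t tag;          /* virtual page-frame address (30 bits) */
    uint32_t ppn;          /* physical page number (21 bits)       */
    uint8_t  valid;        /* V bit                                */
    uint8_t  read, write;  /* R/W protection fields                */
} tlb_entry_t;

static tlb_entry_t tlb[32];

/* Returns the 34-bit physical address, or -1 on a TLB miss
 * (which the Alpha handles in software). */
int64_t tlb_lookup(uint64_t va)
{
    uint32_t vpn    = (uint32_t)((va >> 13) & 0x3FFFFFFF); /* 30 bits */
    uint32_t offset = (uint32_t)(va & 0x1FFF);             /* 13 bits */

    for (int i = 0; i < 32; i++)
        if (tlb[i].valid && tlb[i].tag == vpn)
            return ((int64_t)tlb[i].ppn << 13) | offset;
    return -1;                               /* miss: trap to software */
}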
ENGS 116 Lecture 14
18
Selecting a Page Size
• Reasons for larger page size
– Page table size is inversely proportional to the page size,
so memory is saved (see the worked example after this list)
– Fast cache hit time easy when cache ≤ page size (VA caches);
bigger page makes it feasible as cache grows in size
– Transferring larger pages to or from secondary storage,
possibly over a network, is more efficient
– Number of TLB entries is restricted by clock cycle time, so a larger
page size maps more memory, thereby reducing TLB misses
• Reasons for a smaller page size
– Fragmentation: don’t waste storage; data must be contiguous within page
– Quicker process start for small processes
• Hybrid solution: multiple page sizes
– Alpha: 8 KB, 16 KB, 32 KB, 64 KB pages (43, 47, 51, 55 virtual addr bits)
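
Worked example for the page-table-size argument (illustrative numbers, not from the slide): with 32-bit virtual addresses, 4 KB pages, and 4-byte PTEs, a flat table needs 2^20 entries × 4 bytes = 4 MB per process; doubling the page size to 8 KB halves that to 2 MB.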
ENGS 116 Lecture 14
19
Alpha VM Mapping
• “64-bit” address divided into 3
segments
– seg0 (bit 63 = 0) user code/heap
– seg1 (bit 63 = 1, 62 = 1) user
stack
– kseg (bit 63 = 1, 62 = 0)
kernel segment for OS
• Three level page table, each one
page
– Alpha uses only 43 bits of VA
– (future min page size up to 64 KB
⇒ 55 bits of VA)
• PTE bits: valid, kernel & user,
read & write enable (no reference,
use, or dirty bit)
[Figure: Alpha three-level address translation. The upper 21 bits of the virtual address are the seg0/seg1 selector (all 0s or all 1s); the remaining 43 bits split into three 10-bit fields (level1, level2, level3) and a 13-bit page offset. The Page Table Base Register points at the L1 page table; each level's field indexes one table to find the next, ending in a page table entry whose physical page-frame number is combined with the page offset to form the physical address into main memory. Each PTE is 8 bytes: a 32-bit address plus 32 bits of fields.]
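
A sketch of this walk in C, assuming the field split in the figure (three 10-bit indices plus a 13-bit offset); the PTE layout and the frame_to_table helper are hypothetical, and the real Alpha details (PALcode, kseg shortcut) are omitted:

#include <stdint.h>

typedef struct {
    uint32_t pfn;       /* physical page-frame number (32-bit field) */
    uint32_t flags;     /* valid, kernel/user, read/write enables    */
} pte_t;                /* 8 bytes, as on the slide (layout assumed) */

#define VALID 0x1
/* Extract the 10-bit index for level n (0 = level1) from a VA. */
#define LEVEL(va, n) (((va) >> (13 + 10 * (2 - (n)))) & 0x3FF)

extern pte_t *page_table_base;                /* Page Table Base Register */
extern pte_t *frame_to_table(uint32_t pfn);   /* hypothetical helper      */

/* Walk the three levels; returns the physical address or -1 on fault. */
int64_t walk(uint64_t va)
{
    pte_t *table = page_table_base;
    pte_t  pte;

    for (int n = 0; n < 3; n++) {             /* L1, L2, L3 tables */
        pte = table[LEVEL(va, n)];
        if (!(pte.flags & VALID))
            return -1;                        /* page fault */
        table = frame_to_table(pte.pfn);      /* next table (or data page) */
    }
    return ((int64_t)pte.pfn << 13) | (va & 0x1FFF);
}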
ENGS 116 Lecture 14
20
Protection
• Prevent separate processes from accessing each other's memory
– A violation causes a segmentation fault: SIGSEGV
– Useful for Multitasking systems
– Operating system issue
• At least two levels of protection:
– Supervisor (Kernel) mode (privileged)
• Creates page tables, sets process bounds, handles exceptions
– User mode (non-privileged)
• Can only make requests to Kernel: called SYSCALLs
• Shared memory
• SYSCALL parameter passing
ENGS 116 Lecture 14
21
Protection 2
• Each page needs:
– PID bit
– Read/Write/Execute bit
• Each process needs:
– Stack frame page(s)
– Text or code pages
– Data or heap pages
– A state table keeping:
• PC and other CPU status registers
• State of all registers
ENGS 116 Lecture 14
22
Alpha 21064
• Separate Instruction & Data
TLB & Caches
• TLBs fully associative
• TLB updates in SW
(“Privileged Arch Library”, PALcode)
• Caches 8KB direct
mapped, write through
• Critical 8 bytes first
• Prefetch instr. stream buffer
• 2 MB L2 cache, direct
mapped, WB (off-chip)
• 256-bit path to main
memory, 4 × 64-bit modules
• Victim buffer: to give read
priority over write
• 4-entry write buffer
between D$ & L2$
[Figure: Alpha 21064 memory hierarchy, with separate instruction and data caches, a write buffer between the data cache and L2, an instruction stream buffer, and a victim buffer.]
ENGS 116 Lecture 14
23
Alpha CPI Components
• Instruction stall: branch mispredict (green)
• Data cache (blue); instruction cache (yellow); L2$ (pink)
• Other: compute + register conflicts, structural conflicts
[Figure: stacked CPI components per benchmark for the Alpha.]
ENGS 116 Lecture 14
24
Pitfall: Predicting Cache Performance of One
Program from Another (ISA, compiler, ...)
• 4 KB data cache: miss rate 8%, 12%, or 28%?
• 1 KB instruction cache: miss rate 0%, 3%, or 10%?
• Alpha vs. MIPS for 8 KB Data $: 17% vs. 10%
• Why 2X Alpha vs. MIPS?
[Figure: miss rate (0–35%) vs. cache size (1–128 KB) for instruction and data caches running gcc, espresso, and tomcatv.]
ENGS 116 Lecture 14
25
Pitfall: Simulating Too Small an Address Trace
[Figure: cumulative average memory access time (1.0–4.5) vs. instructions executed (1–12 billion). Configuration: I$ = 4 KB, B = 16 B; D$ = 4 KB, B = 16 B; L2 = 512 KB, B = 128 B; miss penalties = 12, 200 clock cycles.]
ENGS 116 Lecture 14
26
Additional Pitfalls
• Having too small an address space
• Ignoring the impact of the operating system on the performance of
the memory hierarchy
ENGS 116 Lecture 14
27
Parameter                 TLB                   First-level cache     Second-level cache    Virtual memory
Block size                4–8 bytes (1 PTE)     4–32 bytes            32–256 bytes          4096–16,384 bytes
Hit time                  1 clock cycle         1–2 clock cycles      6–15 clock cycles     10–100 clock cycles
Miss penalty              10–30 clock cycles    8–66 clock cycles     30–200 clock cycles   700,000–6,000,000 clock cycles
Miss rate (local)         0.1–2%                0.5–20%               15–30%                0.00001–0.001%
Size                      32–8192 bytes         1–128 KB              256 KB – 16 MB        16–8192 MB
                          (8–1024 PTEs)
Backing store             First-level cache     Second-level cache    Page-mode DRAM        Disks
Q1: block placement       Fully associative     Direct mapped         Direct mapped or      Fully associative
                          or set associative                          set associative
Q2: block identification  Tag/block             Tag/block             Tag/block             Table
Q3: block replacement     Random                N.A. (direct mapped)  Random                ≈ LRU
Q4: write strategy        Flush on a write      Write through or      Write back            Write back
                          to page table         write back

Figure 5.53 Summary of the memory-hierarchy examples in Chapter 5.