Lecture 1: Course Introduction and Overview

CMPUT429/CMPE382 Winter 2001
Topic6: Main Memory and Virtual Memory
(Adapted from David A. Patterson’s CS252,
Spring 2001 Lecture Slides)
1/17/01
CS252/Patterson
Lec 1.1
Main Memory Background
• Performance of Main Memory:
– Latency: Cache Miss Penalty
» Access Time: time between request and word arrives
» Cycle Time: time between requests
– Bandwidth: I/O & Large Block Miss Penalty (L2)
• Main Memory is DRAM: Dynamic Random Access Memory
– Dynamic since needs to be refreshed periodically (8 ms, 1% time)
– Addresses divided into 2 halves (Memory as a 2D matrix):
» RAS or Row Access Strobe
» CAS or Column Access Strobe
• Cache uses SRAM: Static Random Access Memory
– No refresh (6 transistors/bit vs. 1 transistor/bit)
– Size: DRAM/SRAM 4-8 (DRAM holds 4 to 8 times more bits)
– Cost/Cycle time: SRAM/DRAM 8-16 (SRAM is 8 to 16 times faster and more expensive)
Fast Memory Systems: DRAM specific
• Multiple CAS accesses: several names (page mode)
– Extended Data Out (EDO): 30% faster in page mode
• New DRAMs to address gap;
what will they cost, will they survive?
– RAMBUS: startup company; reinvent DRAM interface
» Each Chip a module vs. slice of memory
» Short bus between CPU and chips
» Does own refresh
» Variable amount of data returned
» 1 byte / 2 ns (500 MB/s per chip)
» 20% increase in DRAM area
– Synchronous DRAM: 2 banks on chip, a clock signal to DRAM,
transfer synchronous to system clock (66 - 150 MHz)
– Intel claims RAMBUS Direct (16 b wide) is future PC memory?
» Possibly not true! Intel to drop RAMBUS?
• Niche memory or main memory?
– e.g., Video RAM for frame buffers, DRAM + fast serial output
Main Memory Organizations
• Simple:
– CPU, Cache, Bus, Memory same width (32 or 64 bits)
• Wide:
– CPU/Mux 1 word; Mux/Cache, Bus, Memory N words (Alpha: 64 bits & 256 bits; UltraSPARC 512)
• Interleaved:
– CPU, Cache, Bus 1 word: Memory N Modules (4 Modules); example is word interleaved
Main Memory Performance
• Timing model (word size is 32 bits), in clock cycles
– 1 to send address,
– 6 access time, 1 to send data
– Cache Block is 4 words
• Simple miss penalty (M.P.) = 4 x (1 + 6 + 1) = 32
• Wide M.P. = 1 + 6 + 1 = 8
• Interleaved M.P. = 1 + 6 + 4 x 1 = 11
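The three miss penalties above follow from the same timing model. Here is a minimal sketch in C, assuming the slide's parameters (1 cycle to send the address, 6 cycles of access time, 1 cycle to send a word, 4-word blocks). The struct and function names are illustrative and not part of the lecture.

```c
#include <assert.h>

/* Timing model from the slide; all times are in clock cycles. */
typedef struct {
    int send_addr;   /* cycles to send the address   */
    int access;      /* cycles of memory access time */
    int send_data;   /* cycles to transfer one word  */
    int block_words; /* words per cache block        */
} mem_timing;

/* Simple organization: every word pays the full address/access/transfer cost. */
int miss_penalty_simple(mem_timing t) {
    assert(t.block_words > 0);
    return t.block_words * (t.send_addr + t.access + t.send_data);
}

/* Wide organization: the whole block moves in one access
   (assumes memory and bus are as wide as the block). */
int miss_penalty_wide(mem_timing t) {
    return t.send_addr + t.access + t.send_data;
}

/* Interleaved organization: one address, all banks access in parallel,
   then the words are sent one per cycle (assumes banks >= block_words). */
int miss_penalty_interleaved(mem_timing t) {
    assert(t.block_words > 0);
    return t.send_addr + t.access + t.block_words * t.send_data;
}
```

Calling the three functions with `{1, 6, 1, 4}` reproduces the 32, 8, and 11 cycles computed on the slide.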
Independent Memory Banks
• Memory banks for independent accesses
vs. faster sequential accesses
– Multiprocessor
– I/O
– CPU with Hit under n Misses, Non-blocking Cache
• Superbank: all memory active on one block transfer (or Bank)
• Bank: portion within a superbank that is word interleaved (or Subbank)
[Figure: address fields. Superbank Number | Superbank Offset, with the Superbank Offset further divided into Bank Number | Bank Offset.]
Independent Memory Banks
• How many banks?
number of banks ≥ number of clocks to access word in bank
– For sequential accesses; otherwise will return to original bank before it has next word ready
• Increasing DRAM capacity per chip => fewer chips => harder to have enough banks
Avoiding Bank Conflicts
• Lots of banks
int x[256][512];
int i, j;
/* Column-order traversal: consecutive accesses are 512 words apart */
for (j = 0; j < 512; j = j+1)
    for (i = 0; i < 256; i = i+1)
        x[i][j] = 2 * x[i][j];
• Even with 128 banks, since 512 is multiple of 128,
conflict on word accesses
• Software: loop interchange or declaring array not
power of 2 (“array padding”)
• Hardware: Use a prime number of banks
– bank number = address mod number of banks
– address within bank = address / number of banks
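To see why the loop above conflicts, one can count how many distinct banks its inner loop touches. The sketch below assumes word-interleaved banks (bank = word address mod number of banks) and a row-major layout of x. The helper name and the `cols_padded` parameter are hypothetical, introduced only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

enum { ROWS = 256, MAX_BANKS = 1024 };

/* Count the distinct banks touched while walking down column j of a
   row-major array with ROWS rows and cols_padded words per row, i.e. the
   inner loop of the column-order traversal on the slide.
   Assumes bank = word address mod nbanks. */
int banks_touched_by_column(int nbanks, int cols_padded, int j) {
    bool seen[MAX_BANKS] = { false };
    int distinct = 0;
    assert(nbanks > 0 && nbanks <= MAX_BANKS);
    for (int i = 0; i < ROWS; i++) {
        int word_addr = i * cols_padded + j; /* row-major word address */
        int bank = word_addr % nbanks;
        if (!seen[bank]) {
            seen[bank] = true;
            distinct++;
        }
    }
    return distinct;
}
```

With 128 banks and 512-word rows the inner loop touches a single bank. Padding each row to 513 words, or using 127 banks, spreads the same accesses over every bank.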
Finding Bank Number and Address
within a bank
Problem: We want to determine the number of banks, Nb, to use
and the number of words to store in each bank, Wb, such that:
• given a word address x, it is easy to find the bank where x will
be found, B(x), and the address of x within the bank, A(x);
• for any address x, the pair (B(x), A(x)) is unique;
• the number of bank conflicts is minimized.
Finding Bank Number and Address
within a bank
Solution: We will use the following relations to determine the bank
number for x, B(x), and the address of x within the bank, A(x):
B(x) = x MOD Nb
A(x) = x MOD Wb
and we will choose Nb and Wb to be co-prime, i.e., there is no prime
number that is a factor of both Nb and Wb (this condition is satisfied,
for example, if Wb is a power of two and Nb is a prime number equal to
an integer power of two minus 1).
We can then use the Chinese Remainder Theorem (see page 436
and exercise 5.10) to show that the pair (B(x), A(x)) is unique for
every address x between 0 and Nb x Wb - 1.
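For small configurations the uniqueness claim can be checked by brute force. The sketch below assumes addresses run from 0 to Nb x Wb - 1. The function names and the 4096-location limit are illustrative choices, not part of the lecture.

```c
#include <assert.h>
#include <stdbool.h>

/* Modulo interleaving as defined above. */
unsigned bank_number(unsigned x, unsigned nb) { return x % nb; }
unsigned addr_in_bank(unsigned x, unsigned wb) { return x % wb; }

/* Returns true if every address 0 .. nb*wb-1 maps to a distinct
   (bank, address-within-bank) pair. Limited to nb*wb <= 4096 here. */
bool mapping_is_unique(unsigned nb, unsigned wb) {
    bool used[4096] = { false };
    assert(nb > 0 && wb > 0 && nb * wb <= 4096);
    for (unsigned x = 0; x < nb * wb; x++) {
        unsigned slot = bank_number(x, nb) * wb + addr_in_bank(x, wb);
        if (used[slot])
            return false; /* two addresses share one location */
        used[slot] = true;
    }
    return true;
}
```

For Nb = 3 and Wb = 8 the check succeeds. For Nb = 4 and Wb = 8, which share the factor 2, it fails because addresses 0 and 8 collide. When Wb is a power of two, `x % wb` reduces to selecting the low-order bits of the address, so only the bank number needs a true modulo.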
Fast Bank Number
Example: Values for B(x) and A(x) for a system with 3 banks,
Nb = 3, and 8 words per bank, Wb = 8.
Comparison between Sequential Interleaving and
Modulo Interleaving.
Each entry is the word address stored at that location.

Address          Seq. Interleaved       Modulo Interleaved
within Bank      Bank: 0    1    2      Bank: 0    1    2
0                      0    1    2            0   16    8
1                      3    4    5            9    1   17
2                      6    7    8           18   10    2
3                      9   10   11            3   19   11
4                     12   13   14           12    4   20
5                     15   16   17           21   13    5
6                     18   19   20            6   22   14
7                     21   22   23           15    7   23
DRAMs per PC over Time
(rows: Minimum Memory Size; columns: DRAM Generation; entries: number of DRAM chips)

              '86     '89     '92     '96     '99     '02
              1 Mb    4 Mb    16 Mb   64 Mb   256 Mb  1 Gb
4 MB          32      8
8 MB                  16      4
16 MB                         8       2
32 MB                                 4       1
64 MB                                 8       2
128 MB                                        4       1
256 MB                                        8       2
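Each chip count in the DRAMs-per-PC table is the memory size divided by the capacity of one chip. Here is a small sketch of that arithmetic, assuming sizes in megabytes, chip capacities in megabits, and no extra chips for parity or error correction. The function name is illustrative.

```c
#include <assert.h>

/* Number of DRAM chips needed to build a memory of mem_megabytes (MB)
   from chips of chip_megabits (Mb) each. Ignores parity/ECC chips. */
unsigned dram_chips_needed(unsigned mem_megabytes, unsigned chip_megabits) {
    assert(chip_megabits > 0);
    unsigned total_megabits = mem_megabytes * 8u;                /* 8 bits per byte */
    return (total_megabits + chip_megabits - 1) / chip_megabits; /* round up */
}
```

For example, a 4 MB memory built from 1 Mb chips needs 32 chips, while a 256 MB memory built from 1 Gb (1024 Mb) chips needs only 2, which is why it becomes harder to provide many independent banks.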
Main Memory Summary
• Wider Memory
• Interleaved Memory: for sequential or
independent accesses
• Avoiding bank conflicts: SW & HW
• DRAM specific optimizations: page mode &
Specialty DRAM
• Need Error correction
Virtual Memory
The memory is divided and portions are assigned to different processes.
Each process has the “illusion” of accessing the entire memory space.
When a portion is not available to a process, a “magic hand”
goes to the disk, finds the missing data, and places it in the memory
(like the cache line replacement mechanism).
Problem: Multiple processes access the same memory address.
A process accesses a memory address that has no RAM mapped to it.
Solution: “Virtualize” the address space used by the processes
(and by the processor).
Complication: Must translate the address used in every memory
access.
Must protect the data that belongs to one process
from access by another process.
Virtual to Physical Mapping
[Figure 5.36: virtual pages A, B, C, and D at virtual addresses 0, 4K, 8K, and 12K. Pages C, A, and B are mapped into physical main memory (physical addresses 0 to 28K, in 4K steps); page D is on disk.]
Virtual Memory (VM) Terminology
VM Page or Segment <=> Cache block or line
Page Fault or Address Fault <=> Cache miss
Memory mapping or address translation is the process
of converting the virtual address produced by the processor
into the physical address used to access the main memory.
In VM, the replacement policy is controlled by software.
The unit of exchange for VM can have a fixed size (pages)
or a variable size (segments). Some new machines use
paged segments.
Address Translation
[Fig. 5.40: the virtual address is split into a virtual page number and a page offset. The page table maps the virtual page number to a physical page in main memory; the page offset is carried over unchanged.]
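The translation step can be sketched in C as follows. This is a minimal illustration, assuming 4 KB pages and a single-level page table held in an array. The type `pte` and the function `translate` are hypothetical names, and a real page table entry carries more state (protection, dirty, and reference bits).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_BITS 12u                 /* assumed 4 KB pages */
#define PAGE_SIZE (1u << PAGE_BITS)

/* Hypothetical single-level page table entry. */
typedef struct {
    bool     valid;  /* page is resident in main memory */
    uint32_t frame;  /* physical page frame number      */
} pte;

/* Translate a virtual address. Returns true and stores the physical
   address in *pa on success; returns false on a page fault (page not
   resident, or virtual page number outside the table). */
bool translate(const pte *table, uint32_t num_pages, uint32_t va, uint32_t *pa) {
    assert(table && pa);
    uint32_t vpn    = va >> PAGE_BITS;      /* virtual page number */
    uint32_t offset = va & (PAGE_SIZE - 1); /* unchanged by translation */
    if (vpn >= num_pages || !table[vpn].valid)
        return false;
    *pa = (table[vpn].frame << PAGE_BITS) | offset;
    return true;
}
```

On a `false` return the operating system would take over, fetch the page from disk, update the table, and retry the access.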
The four memory-hierarchy questions
for Virtual Memory
Q1: Where can a block be placed in main memory?
Pages or segments can be placed anywhere in main memory.
Q2: Where is a block found if it is in main memory?
A page table contains the mapping from virtual to
physical addresses.
Inverted page tables reduce the size of the page table.
A translation lookaside buffer (TLB) is used to cache
the most recently used translations.
Q3: Which block should be replaced on a virtual memory miss?
The least recently used (LRU) page is replaced.
Q4: What happens on a write?
The write policy is always write back.
Fast Address Translation
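[Figure: TLB organization. The virtual address is split into a page-frame address <30> and a page offset <13>. Each TLB entry holds V <1>, R <2>, W <2>, Tag <30>, and Phys. Addr. <21>. A 32:1 mux selects the physical address of the matching entry (high-order 21 bits of address), which is joined with the low-order 13 bits of address from the page offset.]
1: Send virtual address to all tags.
2: Check access type for violation of protection.
3: Use matching tag as a mux selector.
4: Combine page offset with physical page frame to form physical address.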
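The four steps can be mimicked in software. The sketch below assumes a fully associative TLB with 32 entries and a 13-bit page offset, taken from the 32:1 mux and the <13> field in the figure. The protection fields are simplified to one read bit and one write bit, and all type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 32u /* matches the 32:1 mux in the figure */
#define OFFSET_BITS 13u /* page offset width from the figure  */

typedef struct {
    bool     valid;
    bool     can_read;  /* simplified: the figure has 2-bit R and W fields */
    bool     can_write;
    uint64_t tag;       /* virtual page-frame address */
    uint64_t frame;     /* physical page frame        */
} tlb_entry;

typedef enum { TLB_HIT, TLB_MISS, TLB_PROTECTION_FAULT } tlb_result;

tlb_result tlb_lookup(const tlb_entry tlb[TLB_ENTRIES], uint64_t va,
                      bool is_write, uint64_t *pa) {
    assert(tlb && pa);
    uint64_t vpn    = va >> OFFSET_BITS;
    uint64_t offset = va & ((UINT64_C(1) << OFFSET_BITS) - 1);
    /* Step 1: hardware compares all tags at once; software has to loop. */
    for (unsigned i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].tag == vpn) {
            /* Step 2: check the access type against the protection bits. */
            if (is_write ? !tlb[i].can_write : !tlb[i].can_read)
                return TLB_PROTECTION_FAULT;
            /* Steps 3-4: select the matching frame and append the offset. */
            *pa = (tlb[i].frame << OFFSET_BITS) | offset;
            return TLB_HIT;
        }
    }
    return TLB_MISS;
}
```

On `TLB_MISS` the translation would be fetched from the page table and installed in the TLB before the access is retried.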
Protecting Processes
(minimum protection system)
Use a pair of registers, base and bound, to check if the address
is within an allowed interval:
base ≤ address ≤ bound
The hardware must implement at least two modes of operation.
The operating system runs in the supervisor mode and regular
application programs run in the user mode.
The base, bound, the user/supervisor mode bit, and the exception
enable/disable bit can only be changed when the processor
is running in the supervisor mode.
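The base-and-bound check is small enough to write out. This is a sketch only, assuming 32-bit addresses and that both ends of the interval are inclusive. The struct and function names are illustrative, and in hardware the comparison happens on every memory access rather than in a function call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t base;  /* lowest address the process may access  */
    uint32_t bound; /* highest address the process may access */
} protection_regs;

/* True if the access is allowed: base <= address <= bound. */
bool address_allowed(protection_regs r, uint32_t address) {
    assert(r.base <= r.bound);
    return r.base <= address && address <= r.bound;
}
```

Only code running in supervisor mode would be permitted to change the two registers; a failed check would raise an exception handled by the operating system.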
2. Fast hits by Avoiding Address
Translation
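[Figure: three cache organizations.
(1) Conventional Organization: CPU sends VA to the TB; the $ (PA tags) and MEM are accessed with the PA.
(2) Virtually Addressed Cache: CPU sends VA directly to the $ (VA tags); the TB translates only on a miss, before MEM. Synonym Problem.
(3) Overlap $ access with VA translation: the $ (PA tags) and the TB are accessed in parallel, followed by the L2 $ and MEM with the PA; requires $ index to remain invariant across translation.]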
2. Fast hits by Avoiding Address Translation
• Send virtual address to cache? Called Virtually
Addressed Cache or just Virtual Cache vs. Physical
Cache
– Every time process is switched logically must flush the cache; otherwise
get false hits
» Cost is time to flush + “compulsory” misses from empty cache
– Dealing with aliases (sometimes called synonyms);
Two different virtual addresses map to same physical address
– I/O must interact with cache, so need virtual address
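• Solution to aliases
– Guarantee that aliases agree in the address bits that cover the index field; with a direct-mapped cache, each block then has a unique location. When software enforces this, it is called page coloring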
• Solution to cache flush
– Add process identifier tag that identifies process as well as address
within process: can’t get a hit if wrong process
2. Fast Cache Hits by Avoiding Translation:
Index with Physical Portion of Address
• If index is physical part of address, can
start tag access in parallel with translation
so that can compare to physical tag
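[Figure: the same address viewed two ways: Page Address | Page Offset, and Address Tag | Index | Block Offset. The Index and Block Offset fall within the Page Offset.]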
• Limits cache to page size: what if we want
bigger caches and still use the same trick?
– Higher associativity moves barrier to right
– Page coloring
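The size limit can be written as a condition: the index and block-offset bits select a byte within one way of the cache, so they stay inside the page offset when cache size divided by associativity is at most the page size. Here is a sketch of that check. The function names are illustrative, and sizes are assumed to be powers of two in bytes.

```c
#include <assert.h>
#include <stdbool.h>

/* True if the cache can be indexed using only untranslated
   (page-offset) address bits. */
bool index_is_physical(unsigned long cache_bytes, unsigned associativity,
                       unsigned long page_bytes) {
    assert(associativity > 0 && page_bytes > 0);
    return cache_bytes / associativity <= page_bytes;
}

/* Smallest power-of-two associativity that lets a cache of cache_bytes
   be indexed with untranslated bits only, for a given page size. */
unsigned min_associativity(unsigned long cache_bytes, unsigned long page_bytes) {
    unsigned ways = 1;
    assert(page_bytes > 0);
    while (cache_bytes / ways > page_bytes)
        ways *= 2;
    return ways;
}
```

For example, with 8 KB pages an 8 KB direct-mapped cache satisfies the condition, a 16 KB direct-mapped cache does not, and a 16 KB 2-way cache does. This is the sense in which higher associativity "moves the barrier".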
3: Fast Hits by pipelining Cache
Case Study: MIPS R4000
• 8 Stage Pipeline:
– IF–first half of fetching of instruction; PC selection happens
here as well as initiation of instruction cache access.
– IS–second half of access to instruction cache.
– RF–instruction decode and register fetch, hazard checking and
also instruction cache hit detection.
– EX–execution, which includes effective address calculation, ALU
operation, and branch target computation and condition
evaluation.
– DF–data fetch, first half of access to data cache.
– DS–second half of access to data cache.
– TC–tag check, determine whether the data cache access hit.
– WB–write back for loads and register-register operations.
• What is impact on Load delay?
– Need 2 instructions between a load and its use!
Case Study: MIPS R4000
[Figure: two pipeline diagrams. In each, successive instructions enter the stages IF, IS, RF, EX, DF, DS, TC, WB one cycle after the previous instruction.]
• TWO Cycle Load Latency
• THREE Cycle Branch Latency (conditions evaluated during EX phase)
– Delay slot plus two stalls
– Branch likely cancels delay slot if not taken
R4000 Performance
[Figure: stacked bar chart of CPI (0 to 4.5) for tomcatv, su2cor, spice2g6, ora, nasa7, doduc, li, gcc, espresso, and eqntott; components: Base, Load stalls, Branch stalls, FP result stalls, FP structural stalls.]
• Not ideal CPI of 1:
– Load stalls (1 or 2 clock cycles)
– Branch stalls (2 cycles + unfilled slots)
– FP result stalls: RAW data hazard (latency)
– FP structural stalls: Not enough FP hardware (parallelism)
What is the Impact of What You’ve
Learned About Caches?
[Figure: CPU vs. DRAM performance on a log scale (1 to 1000), 1980 to 2000.]
• 1960-1985: Speed = ƒ(no. operations)
• 1990
– Pipelined Execution & Fast Clock Rate
– Out-of-Order execution
– Superscalar Instruction Issue
• 1998: Speed = ƒ(non-cached memory accesses)
• What does this mean for
– Compilers? Operating Systems? Algorithms? Data Structures?
Alpha 21064
• Separate Instr & Data TLB & Caches
• TLBs fully associative
• TLB updates in SW (“Priv Arch Libr”)
• Caches 8KB direct mapped, write thru
• Critical 8 bytes first
• Prefetch instr. stream buffer
• 2 MB L2 cache, direct mapped, WB (off-chip)
• 256 bit path to main memory, 4 x 64-bit modules
• Victim Buffer: to give read priority over write
• 4 entry write buffer between D$ & L2$
[Figure: block diagram labeling the Instr and Data caches, Stream Buffer, Victim Buffer, and Write Buffer.]
Alpha Memory Performance: Miss Rates of SPEC92
[Figure: miss rate (log scale, 0.01% to 100%) of the 8K I$, 8K D$, and 2M L2 for AlphaSort, TPC-B (db1), Li, Compress, Sc, Ora, Ear, Doduc, Tomcatv, Mdljp2, Spice, and Su2cor.]
Callouts on the chart:
• I$ miss = 6%, D$ miss = 32%, L2 miss = 10%
• I$ miss = 2%, D$ miss = 13%, L2 miss = 0.6%
• I$ miss = 1%, D$ miss = 21%, L2 miss = 0.3%
Alpha CPI Components
[Figure: stacked bar chart of CPI (0.00 to 5.00) for AlphaSort, TPC-B (db1), Li, Sc, Compress, Ora, Ear, Doduc, Tomcatv, and Mdljp2; components: L2, I$, D$, I Stall, Other.]
• Instruction stall: branch mispredict (green)
• Data cache (blue); Instruction cache (yellow); L2$ (pink)
• Other: compute + reg conflicts, structural conflicts
Pitfall: Predicting Cache Performance from
Different Prog. (ISA, compiler, ...)
• 4KB Data cache miss rate 8%, 12%, or 28%?
• 1KB Instr cache miss rate 0%, 3%, or 10%?
• Alpha vs. MIPS for 8KB Data $: 17% vs. 10%
• Why 2X Alpha v. MIPS?
[Figure: miss rate (0% to 35%) vs. cache size (1 to 128 KB) for D: tomcatv, D: gcc, D: espresso, I: gcc, I: espresso, and I: tomcatv.]
Cache Optimization Summary
(MR = miss rate, MP = miss penalty, HT = hit time; + improves, – hurts)

Technique                            MR   MP   HT   Complexity
miss rate
Larger Block Size                    +    –         0
Higher Associativity                 +         –    1
Victim Caches                        +              2
Pseudo-Associative Caches            +              2
HW Prefetching of Instr/Data         +              2
Compiler Controlled Prefetching      +              3
Compiler Reduce Misses               +              0
miss penalty
Priority to Read Misses                   +         1
Early Restart & Critical Word 1st         +         2
Non-Blocking Caches                       +         3
Second Level Caches                       +         2
Better memory system                      +         3
hit time
Small & Simple Caches                –         +    0
Avoiding Address Translation                   +    2
Pipelining Caches                              +    2