Transcript: Caches
ENGS 116 Lecture 14: Caches and Main Memory
Vincent H. Berk
November 5th, 2008
Reading for Today: Sections C.4 – C.7
Reading for Wednesday: Sections 5.1 – 5.3
Reading for Monday: Sections 5.4 – 5.8
Reducing Hit Time
• Small and Simple Caches
• Way Prediction
• Trace caches
1. Fast Hit Times via Small and Simple Caches
• Alpha 21164 has 8KB instruction and 8KB data caches + 96KB second-level cache
  – Small cache and fast clock rate
• Intel Pentium 4 moved from 2-way 16KB data + 16KB instruction to 4-way 8KB + 8KB
  – Now cache runs at core speed
• Direct mapped, on chip
• Intel Core Duo architecture, however:
  – L1 data & instruction caches both 32KB, 8-way, per core
  – 3-cycle latency (hit time)
  – L2 shared (between 2 cores): 4 MB, 16-way, 14-cycle latency (incl. L1)
2. Reducing Hit Time via “Pseudo-Associativity” or Way Prediction
• How to combine the fast hit time of a direct-mapped cache with the lower conflict
  misses of a 2-way set-associative cache?
• Divide the cache: on a miss, check the other half of the cache to see if the data is
  there; if so, we have a pseudo-hit (slow hit)
• Way prediction: keep prediction bits to decide which comparison is made first
  (see the sketch below)
  [Timeline: hit time < pseudo-hit time < miss penalty]
• Drawback: CPU pipelining is hard if a hit can take 1 or 2 cycles
  – Better for caches not tied directly to the processor (L2)
  – Used in the MIPS R10000 L2 cache; similar in UltraSPARC
  – Not common today
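A minimal C sketch of the way-predicted lookup described above, for a 2-way
set-associative cache with one prediction bit per set (sizes, names, and structure
are illustrative assumptions, not any particular machine):

/* Way prediction: check the predicted way first (fast hit), then the other
 * way (slow / pseudo hit), otherwise miss. */
#include <stdint.h>
#include <stdbool.h>

#define SETS       256
#define BLOCK_BITS 5                         /* assumed 32-byte blocks */

struct way { uint32_t tag; bool valid; };
struct set { struct way ways[2]; uint8_t predicted; /* prediction bit */ };

static struct set cache[SETS];

/* 0 = fast hit, 1 = slow (pseudo) hit, -1 = miss */
int lookup(uint32_t addr)
{
    uint32_t block = addr >> BLOCK_BITS;
    uint32_t index = block % SETS;
    uint32_t tag   = block;                  /* whole block address as tag, for simplicity */
    struct set *s  = &cache[index];

    int first = s->predicted;                /* compare the predicted way first */
    if (s->ways[first].valid && s->ways[first].tag == tag)
        return 0;

    int other = 1 - first;                   /* only now compare the other way */
    if (s->ways[other].valid && s->ways[other].tag == tag) {
        s->predicted = (uint8_t)other;       /* update the prediction bit */
        return 1;
    }
    return -1;
}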
3. Trace Caches
• Combine branch prediction and instruction prefetching
• Can load instruction blocks that include taken branches into a cache block
• Basis for the Pentium 4 NetBurst architecture
• Contains decoded instructions – especially useful for the x86 CISC architecture,
  where instructions are decoded into RISC-like internal operations
Increasing Cache BW
• Pipelined writes
• Multi-bank memory (later in Main Memory section)
• Non-blocking caches
1. Pipelined Writes
• Pipeline the tag check and the cache update as separate stages;
  the current write’s tag check overlaps the previous write’s cache update
• Only STORES occupy the pipeline; it is empty during a miss

  Store r2, (r1)    check tag for r1
  Add
  Sub
  Store r4, (r3)    M[r1] ← r2 & check tag for r3

  [Figure: CPU, tag array with comparator, delayed write buffer, MUX, data array,
   write buffer, lower-level memory]

• The shaded “delayed write buffer” must be checked on reads;
  either complete the write first or read the data from the buffer
2. Increase Bandwidth: Non-blocking Caches to Reduce Stalls on Misses
• A non-blocking (lockup-free) cache allows the data cache to continue
  supplying cache hits during a miss
  – requires an out-of-order execution CPU
• “Hit under miss” reduces the effective miss penalty by doing useful work during
  a miss instead of ignoring CPU requests
• “Hit under multiple miss” or “miss under miss” may further lower the
  effective miss penalty by overlapping multiple misses
  – Significantly increases the complexity of the cache controller, since
    there can be multiple outstanding memory accesses (see the sketch below)
  – Requires multiple memory banks (otherwise overlapping misses cannot be serviced)
  – The Pentium Pro allows 4 outstanding memory misses
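A minimal C sketch of how a non-blocking cache can track outstanding misses. The
slide does not name the mechanism; miss status holding registers (MSHRs) are the
standard structure, and the sizes and field names below are illustrative assumptions:

/* Non-blocking cache sketch: on a miss, either merge with an outstanding
 * miss, allocate a new MSHR, or stall if all MSHRs are busy. */
#include <stdint.h>
#include <stdbool.h>

#define MSHRS 4    /* e.g., a Pentium Pro-style limit of 4 outstanding misses */

struct mshr { bool busy; uint32_t block_addr; };

static struct mshr mshr[MSHRS];

/* Returns 0 if the miss merges with an outstanding one ("miss under miss"),
 * 1 if a new outstanding miss was allocated (later hits still serviced),
 * -1 if all MSHRs are busy and the request must stall. */
int handle_miss(uint32_t block_addr)
{
    for (int i = 0; i < MSHRS; i++)
        if (mshr[i].busy && mshr[i].block_addr == block_addr)
            return 0;                        /* already being fetched: just wait */

    for (int i = 0; i < MSHRS; i++) {
        if (!mshr[i].busy) {
            mshr[i].busy = true;             /* track the new outstanding miss */
            mshr[i].block_addr = block_addr;
            return 1;
        }
    }
    return -1;                               /* structural stall: too many misses */
}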
Value of Hit Under Miss for SPEC

  [Figure: average memory access time for 18 SPEC benchmarks (integer and floating
   point), comparing hit under 0, 1, 2, and 64 outstanding misses against the
   blocking base case]

• FP programs on average: AMAT = 0.68 → 0.52 → 0.34 → 0.26
• Int programs on average: AMAT = 0.24 → 0.20 → 0.19 → 0.19
• 8 KB data cache, direct mapped, 32-byte blocks, 16-cycle miss penalty
Reducing Miss Penalty
• Critical Word First
• Merging Write Buffers
• Victim Caches
• Subblock Placement
1. Reduce Miss Penalty: Early Restart and Critical Word First
• Don’t wait for the full block to be loaded before restarting the CPU
  – Early restart — As soon as the requested word of the block arrives, send
    it to the CPU and let the CPU continue execution
  – Critical word first — Request the missed word first from memory and
    send it to the CPU as soon as it arrives; let the CPU continue execution
    while filling the rest of the words in the block. Also called wrapped fetch
    and requested word first.
• Generally useful only with large blocks
• Spatial locality is a problem: the CPU tends to want the next sequential word,
  so it is not clear how much early restart helps (see the sketch below)
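A small illustration of the wrapped-fetch order (a sketch; the 8-word block size and
the chosen miss word are assumptions for illustration):

/* Critical word first / wrapped fetch: return the missed word first,
 * then wrap around the rest of the block. */
#include <stdio.h>

#define WORDS_PER_BLOCK 8     /* assumed block size */

int main(void)
{
    unsigned miss_word = 5;   /* word within the block the CPU actually asked for */

    for (unsigned i = 0; i < WORDS_PER_BLOCK; i++)
        printf("fetch word %u\n", (miss_word + i) % WORDS_PER_BLOCK);
    /* prints 5 6 7 0 1 2 3 4: the CPU can restart as soon as word 5 arrives */
    return 0;
}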
2. Reduce Miss Penalty by Merging Write Buffer
• Write merging in the write buffer (see the sketch below)

  Before merging (each write occupies its own entry):
    Write address   V  V  V  V
    100             1  0  0  0
    104             1  0  0  0
    108             1  0  0  0
    112             1  0  0  0

  After merging (four sequential writes share one entry):
    Write address   V  V  V  V
    100             1  1  1  1
    —               0  0  0  0
    —               0  0  0  0
    —               0  0  0  0

  4-entry, 4-word write buffer; with merging it can hold 16 sequential writes in a row
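A minimal C sketch of the merging check a write buffer can perform on each store
(entry count, block size, and field names are illustrative assumptions):

/* Write merging: if a buffered entry already covers this block, just set the
 * word's valid bit; otherwise take a free entry; otherwise the CPU stalls. */
#include <stdint.h>
#include <stdbool.h>

#define ENTRIES 4
#define WORDS   4               /* 4 words (16 bytes) per entry */

struct wb_entry {
    bool     used;
    uint32_t block_addr;        /* address of the aligned 16-byte block */
    bool     valid[WORDS];      /* one valid bit per word */
    uint32_t data[WORDS];
};

static struct wb_entry wb[ENTRIES];

/* Returns true if the write was buffered (merged or placed in a free entry). */
bool buffer_write(uint32_t addr, uint32_t value)
{
    uint32_t block = addr & ~0xFu;       /* 16-byte aligned block address */
    unsigned word  = (addr >> 2) & 0x3;  /* which word within the block */

    for (int i = 0; i < ENTRIES; i++) {  /* try to merge with an existing entry */
        if (wb[i].used && wb[i].block_addr == block) {
            wb[i].valid[word] = true;
            wb[i].data[word]  = value;
            return true;
        }
    }
    for (int i = 0; i < ENTRIES; i++) {  /* otherwise allocate a free entry */
        if (!wb[i].used) {
            wb[i].used        = true;
            wb[i].block_addr  = block;
            wb[i].valid[word] = true;
            wb[i].data[word]  = value;
            return true;
        }
    }
    return false;                        /* buffer full: stall until an entry drains */
}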
3. Reduce Miss Penalty via a “Victim Cache”
• How to combine the fast hit time of a direct-mapped cache yet still avoid conflict
  misses?
• Add a small buffer to hold data discarded from the cache (see the sketch below)
• Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of the conflict misses
  for a 4-KB direct-mapped data cache
• Used in Alpha and HP machines

  [Figure: direct-mapped cache with a small fully associative victim cache sitting
   between it and the write buffer / lower-level memory]
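A minimal C sketch of the victim-cache lookup on a direct-mapped miss (sizes and
field names are illustrative assumptions, not the Jouppi design):

/* Victim cache: on an L1 miss, search the small fully associative victim
 * cache; on a victim hit, swap the blocks instead of going to memory. */
#include <stdint.h>
#include <stdbool.h>

#define SETS    128        /* direct-mapped L1 */
#define VICTIMS 4          /* fully associative victim cache */

struct line { uint32_t tag; bool valid; };

static struct line l1[SETS];
static struct line victim[VICTIMS];

/* 0 = L1 hit, 1 = victim-cache hit (blocks swapped), -1 = miss in both */
int lookup(uint32_t block_addr)
{
    uint32_t index = block_addr % SETS;
    uint32_t tag   = block_addr;            /* whole block address as tag, for simplicity */

    if (l1[index].valid && l1[index].tag == tag)
        return 0;

    for (int i = 0; i < VICTIMS; i++) {
        if (victim[i].valid && victim[i].tag == tag) {
            struct line evicted = l1[index]; /* swap: victim block moves into L1 */
            l1[index] = victim[i];
            victim[i] = evicted;
            return 1;
        }
    }
    return -1;   /* go to the next level; the evicted L1 block goes to the victim cache */
}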
4. Reduce Miss Penalty: Subblock Placement
• Don’t have to load the full block on a miss
• Keep a valid bit per subblock to indicate which subblocks are valid
  (see the sketch below)
• (Originally invented to reduce tag storage)

  Tag   Valid bits (one per subblock)
  100   1 1 1 1
  300   1 1 0 0
  200   0 1 0 1
  204   0 0 0 0
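A minimal C sketch of the hit test with per-subblock valid bits (the 4-subblock
layout and names are illustrative assumptions matching the table above):

/* Subblock placement: a hit needs a tag match AND the valid bit of the
 * requested subblock; on a miss only that subblock has to be fetched. */
#include <stdint.h>
#include <stdbool.h>

#define SUBBLOCKS 4

struct line {
    uint32_t tag;
    bool     valid[SUBBLOCKS];   /* one valid bit per subblock */
};

bool hit(const struct line *l, uint32_t tag, unsigned subblock)
{
    return l->tag == tag && l->valid[subblock];
}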
Reducing Misses by Compiler Optimizations
• McFarling [1989] reduced cache misses by 75% on an 8KB direct-mapped cache
  with 4-byte blocks, in software
• Instructions
  – Reorder procedures in memory so as to reduce conflict misses
  – Profiling to look at conflicts (using tools they developed)
• Data
  – Merging arrays: improve spatial locality by using a single array of compound
    elements instead of 2 arrays
  – Loop interchange: change the nesting of loops to access data in the order it is
    stored in memory
  – Loop fusion: combine 2 independent loops that have the same looping and
    some overlapping variables
  – Blocking: improve temporal locality by accessing “blocks” of data
    repeatedly instead of going down whole columns or rows
Merging Arrays Example
/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];

Reduces conflicts between val & key; improves spatial locality
Loop Interchange Example
/* Before */
for (k = 0; k < 100; k = k+1)
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words;
improved spatial locality
Loop Fusion Example
/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }

2 misses per access to a & c vs. one miss per access; improves temporal locality
Blocking Example
/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        r = 0;
        for (k = 0; k < N; k = k+1)
            r = r + y[i][k]*z[k][j];
        x[i][j] = r;
    };

• Two inner loops:
  – Read all N × N elements of z[ ]
  – Read N elements of 1 row of y[ ] repeatedly
  – Write N elements of 1 row of x[ ]
• Capacity misses are a function of N & cache size:
  – If 3 × N × N × 4 bytes fit in the cache, there are no capacity misses; otherwise ...
• Idea: compute on a B × B submatrix that fits in the cache
Blocking Example
/* After */
for (jj = 0; jj < N; jj = jj+B)
    for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i = i+1)
            for (j = jj; j < min(jj+B,N); j = j+1) {
                r = 0;
                for (k = kk; k < min(kk+B,N); k = k+1)
                    r = r + y[i][k]*z[k][j];
                x[i][j] = x[i][j] + r;
            };

• B is called the blocking factor
• Capacity misses reduced from 2N³ + N² to 2N³/B + N²
• Conflict misses, too?
• Blocks don’t have to be square.
Reducing Conflict Misses by Blocking
• Conflict misses in caches that are not fully associative vs. blocking size
  – Lam et al. [1991]: a blocking factor of 24 had one fifth the misses of a
    blocking factor of 48, despite the fact that both fit in the cache
Reducing Misses by Hardware Prefetching of Instructions & Data
• E.g., instruction prefetching
  – The Alpha 21064 fetches 2 blocks on a miss
  – The extra block is placed in a “stream buffer”
  – On a miss, check the stream buffer (see the sketch below)
  – The Intel Core Duo has prefetchers for both code and data
• Works with data blocks, too:
  – Jouppi [1990]: 1 data stream buffer caught 25% of the misses from a 4KB
    cache; 4 streams caught 43%
  – Palacharla & Kessler [1994]: for scientific programs, 8 streams caught
    50% to 70% of the misses from two 64KB, 4-way set-associative caches
• Prefetching relies on having extra memory bandwidth that can be
  used without penalty (use it when the bus is idle)
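A minimal C sketch of the stream-buffer check done on an instruction-cache miss
(the buffer depth, names, and restart policy are illustrative assumptions, not the
Alpha 21064 design):

/* Stream buffer: checked on an I-cache miss; on a stream miss, restart the
 * buffer with the next sequential blocks (prefetched when the bus is idle). */
#include <stdint.h>
#include <stdbool.h>

#define STREAM_DEPTH 4

static uint32_t stream[STREAM_DEPTH];        /* block addresses held by the buffer */
static bool     stream_valid[STREAM_DEPTH];

/* Returns true if the missed block is in the stream buffer (move it into the
 * cache); otherwise fetch it from memory and restart the stream. */
bool check_stream_buffer(uint32_t block_addr)
{
    for (int i = 0; i < STREAM_DEPTH; i++)
        if (stream_valid[i] && stream[i] == block_addr)
            return true;

    for (int i = 0; i < STREAM_DEPTH; i++) { /* restart the stream */
        stream[i]       = block_addr + 1 + i; /* next sequential blocks */
        stream_valid[i] = true;
    }
    return false;
}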
Reducing Misses by Software Prefetching of Data
• Data prefetch
  – Load data into a register (HP PA-RISC loads)
  – Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC v. 9)
  – Special prefetching instructions cannot cause faults;
    a form of speculative execution
• Issuing prefetch instructions takes time (see the sketch below)
  – Is the cost of issuing prefetches < the savings in reduced misses?
  – Wider superscalar issue reduces the difficulty of finding issue slots for them
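A minimal C sketch of software data prefetching using the GCC/Clang builtin
__builtin_prefetch, which compiles to a non-faulting prefetch instruction where the
ISA provides one (the loop, array, and prefetch distance are illustrative assumptions):

/* Prefetch a[i + DIST] while summing a[i], so the data arrives before it
 * is needed.  DIST must be tuned to the memory latency and loop body. */
#include <stddef.h>

#define DIST 16   /* prefetch distance in elements; an assumed tuning value */

long sum(const long *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&a[i + DIST], 0 /* read */, 1 /* low temporal locality */);
        s += a[i];
    }
    return s;
}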
Cache Example
Suppose we have a processor with a base CPI of 1.0, assuming that all references
hit in the primary cache, and a clock rate of 500 MHz. Assume a main memory
access time of 200 ns, including all the miss handling. Suppose the miss rate per
instruction at the primary cache is 5%. How much faster will the machine be if
we add a secondary cache that has a 20-ns access time for either a hit or a miss
and is large enough to reduce the global miss rate to main memory to 2%?
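A sketch of the computation, using only the numbers given above:

  Clock cycle = 1 / 500 MHz = 2 ns
  Miss penalty to main memory = 200 ns / 2 ns = 100 clock cycles
  CPI without L2 = 1.0 + 5% × 100 = 6.0
  Miss penalty to the L2 = 20 ns / 2 ns = 10 clock cycles
  CPI with L2 = 1.0 + 5% × 10 + 2% × 100 = 1.0 + 0.5 + 2.0 = 3.5
  Speedup = 6.0 / 3.5 ≈ 1.7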
What is the Impact of What You’ve Learned
About Caches?
  [Figure: processor vs. DRAM performance, 1980–2000 (log scale); the CPU–memory
   performance gap keeps widening]

• 1960–1985: Speed = ƒ(no. operations)
• 1990:
  – Pipelined execution & fast clock rate
  – Out-of-order execution
  – Superscalar instruction issue
• 1998: Speed = ƒ(non-cached memory accesses)
• What does this mean for
  – Compilers, Operating Systems, Algorithms, Data Structures?
Main Memory Background
• Performance of Main Memory:
– Latency: Cache miss penalty
» Access Time: time between request and word arrives
» Cycle Time: time between requests
– Bandwidth: I/O & large block miss penalty (L2)
• Main Memory is DRAM: dynamic random access memory
– Dynamic since needs to be refreshed periodically (1% time)
– Addresses divided into 2 halves (memory as a 2-D matrix):
» RAS or Row Access Strobe
» CAS or Column Access Strobe
• Cache uses SRAM: static random access memory
– No refresh; 6 transistors/bit vs. 1 transistor;
Size: DRAM/SRAM ≈ 4-8;
Cost/Cycle time: SRAM/DRAM ≈ 8-16
4 Key DRAM Timing Parameters
• tRAC: minimum time from RAS line falling to the valid data output.
– Quoted as the speed of a DRAM when buying
– A typical 512Mbit DRAM tRAC = 8-10 ns
• tRC: minimum time from the start of one row access to the start of the
next.
– tRC = 15 ns for a 512Mbit DRAM with a tRAC of 8-10 ns
• tCAC: minimum time from CAS line falling to valid data output.
– 1 ns for a 512Mbit DRAM with a tRAC of 8-10 ns
• tPC: minimum time from the start of one column access to the start of
the next.
– 15 ns for a 512Mbit DRAM with a tRAC of 8-10 ns
DRAM Performance
• An 8 ns (tRAC) DRAM can
– perform a row access only every 16 ns (tRC)
– perform column access (tCAC) in 1 ns, but time between column
accesses is at least 3 ns (tPC).
» In practice, external address delays and turning around buses
make it 8 to 10 ns
• These times do not include the time to drive the addresses off the
microprocessor or the memory controller overhead!
DRAM History
• DRAMs: capacity + 60%/yr, cost – 30%/yr
– 2.5X cells/area, 1.5X die size in ≈ 3 years
• ‘98 DRAM fab line costs $2B, '08 costs $2.5B, 3-4 years to build
• Rely on increasing numbers of computers & memory per computer
(60% market)
– SIMM, DIMM, or RIMM is replaceable unit
=> computers use any generation (S)DRAM
• Commodity, second source industry
=> high volume, low profit, conservative
– Little organizational innovation in 20 years
• Order of importance: 1) Cost/bit, 2) Capacity
– First RAMBUS: 10X BW, + 30% cost => little impact
• Current SDRAM yield very high: > 80%
Main Memory Performance
• Simple:
– CPU, Cache, Bus,
Memory same width
(32 or 64 bits)
• Wide:
– CPU/Mux 1 word;
Mux/Cache, Bus,
Memory N words
(Alpha: 64 bits & 256
bits; UltraSPARC 512)
• Interleaved:
– CPU, Cache, Bus 1
word; Memory N
modules (4 modules);
example is word
interleaved
Main Memory Performance
• Timing model (word size is 32 bits)
  – 1 cycle to send the address
  – 6 cycles for the access time, 1 cycle to send the data
  – Cache block is 4 words
• Simple memory:      4 × (1 + 6 + 1) = 32 cycles
• Wide memory:        1 + 6 + 1 = 8 cycles
• Interleaved memory: 1 + 6 + 4 × 1 = 11 cycles

  Word-interleaved addresses across 4 banks:
    Bank 0: 0, 4,  8, 12
    Bank 1: 1, 5,  9, 13
    Bank 2: 2, 6, 10, 14
    Bank 3: 3, 7, 11, 15
Independent Memory Banks
• Memory banks for independent accesses vs. faster sequential accesses
  – Multiprocessor
  – I/O (DMA)
  – CPU with hit under n misses, non-blocking cache
• Superbank: all memory active on one block transfer (or bank)
• Bank: portion within a superbank that is word interleaved (or subbank)

  Address fields:  Superbank #  |  Superbank offset (Bank)
                                    Bank #  |  Bank offset
Independent Memory Banks
• How many banks?
  number of banks ≥ number of clocks to access a word in a bank
  – For sequential accesses; otherwise we return to the original bank
    before it has the next word ready
  – (as in the vector case)
• Increasing DRAM capacity => fewer chips => harder to have many banks
Avoiding Bank Conflicts
• Lots of banks

  int x[256][512];
  for (j = 0; j < 512; j = j+1)
      for (i = 0; i < 256; i = i+1)
          x[i][j] = 2 * x[i][j];

• Even with 128 banks, since 512 is a multiple of 128, the word accesses conflict
• SW: loop interchange, or declare the array not a power of 2 (“array padding”);
  see the sketch below
• HW: prime number of banks
  – bank number = address mod number of banks
  – address within bank = address / number of words in bank
  – modulo & divide on every memory access with a prime number of banks?
  – address within bank = address mod number of words in bank
  – bank number? easy if 2^N words per bank
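A minimal sketch of the array-padding fix mentioned above (padding by one word is
an assumption; any pad that breaks the power-of-2 stride works):

/* Array padding: make the row length 513 instead of 512 so that successive
 * elements of a column no longer map to the same bank (or cache set). */
int x[256][513];               /* 513 = 512 + 1 word of padding per row */

void scale(void)
{
    for (int j = 0; j < 512; j = j+1)    /* still only use columns 0..511 */
        for (int i = 0; i < 256; i = i+1)
            x[i][j] = 2 * x[i][j];
}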
Fast Memory Systems: DRAM Specific
• Multiple CAS accesses: several names (page mode)
  – Extended Data Out (EDO): 30% faster in page mode
• New DRAMs to address the gap;
  what will they cost, will they survive?
  – RAMBUS: startup company; reinvented the DRAM interface
    » Each chip a module vs. a slice of memory
    » Short bus between CPU and chips
    » Does its own refresh
    » Variable amount of data returned
    » 1 byte / 2 ns (500 MB/s per chip)
  – Synchronous DRAM: 2 banks on chip, a clock signal to the DRAM, transfers
    synchronous to the system clock (66 – 150 MHz)
  – Intel claimed Direct RAMBUS was the future of PC memory; the Pentium 4
    originally only supported RAMBUS memory...
• Niche memory or main memory?
  – e.g., video RAM for frame buffers: DRAM + fast serial output
Pitfall: Predicting Cache Performance of One Program from Another (ISA, compiler, ...)

  [Figure: instruction- and data-cache miss rates (0–35%) vs. cache size (1–128 KB)
   for gcc, espresso, and tomcatv]

• 4KB data cache: miss rate 8%, 12%, or 28%?
• 1KB instruction cache: miss rate 0%, 3%, or 10%?
• Alpha vs. MIPS for an 8 KB data cache: 17% vs. 10%
• Why 2X Alpha vs. MIPS?
Pitfall: Simulating Too Small an Address Trace

  [Figure: cumulative average memory access time (1–4.5) vs. instructions executed
   (1–12 billion)]

  I$ = 4 KB, B = 16 B
  D$ = 4 KB, B = 16 B
  L2 = 512 KB, B = 128 B
  MP = 12, 200 (miss penalties)
Additional Pitfalls
• Having too small an address space
• Ignoring the impact of the operating system on the performance of
the memory hierarchy
                          TLB                   First-level cache      Second-level cache     Virtual memory
Block size                4–8 bytes (1 PTE)     4–32 bytes             32–256 bytes           4096–16,384 bytes
Hit time                  1 clock cycle         1–2 clock cycles       6–15 clock cycles      10–100 clock cycles
Miss penalty              10–30 clock cycles    8–66 clock cycles      30–200 clock cycles    700,000–6,000,000 clock cycles
Miss rate (local)         0.1–2%                0.5–20%                15–30%                 0.00001–0.001%
Size                      32–8192 bytes         1–128 KB               256 KB – 16 MB         16–8192 MB
                          (8–1024 PTEs)
Backing store             First-level cache     Second-level cache     Page-mode DRAM         Disks
Q1: block placement       Fully associative     Direct mapped          Direct mapped or       Fully associative
                          or set associative                           set associative
Q2: block identification  Tag/block             Tag/block              Tag/block              Table
Q3: block replacement     Random                N.A. (direct mapped)   Random                 ≈ LRU
Q4: write strategy        Flush on a write      Write through or       Write back             Write back
                          to page table         write back

Figure 5.53 Summary of the memory-hierarchy examples in Chapter 5.