CS 161
Ch 7: Memory Hierarchy
LECTURE 15
Instructor: L.N. Bhuyan
www.cs.ucr.edu/~bhuyan
1999 ©UCB
Direct-mapped Cache Contd.
°The direct mapped cache is simple to
design and its access time is fast (Why?)
°Good for L1 (on-chip cache)
°Problem: conflict misses, which lower the hit ratio
Conflict misses are misses caused by
accessing different memory locations that
map to the same cache index
In a direct-mapped cache there is no
flexibility in where a memory block can be
placed, which contributes to conflict misses
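
To make the collision concrete, here is a minimal sketch (ours, not from the slides; the 32-byte blocks and 64 sets are assumed parameters chosen for illustration) of two addresses that collide in a direct-mapped cache:

# Minimal sketch: two addresses that collide in a direct-mapped cache.
# Assumed geometry (illustration only): 32-byte blocks, 64 sets.
BLOCK_BYTES = 32   # -> 5 byte-offset bits
NUM_SETS    = 64   # -> 6 index bits

def decompose(addr):
    offset = addr % BLOCK_BYTES
    index  = (addr // BLOCK_BYTES) % NUM_SETS
    tag    = addr // (BLOCK_BYTES * NUM_SETS)
    return tag, index, offset

a, b = 0x1040, 0x1840          # differ only above the index bits
print(decompose(a))            # (2, 2, 0)
print(decompose(b))            # (3, 2, 0): same index, different tag,
                               # so each access evicts the other's block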
Another Extreme: Fully Associative
°Fully Associative Cache (8 word block)
• Omit cache index; place item in any block!
• Compare all Cache Tags in parallel
[Figure: fully associative cache with an 8-word (32-byte) block. The 32-bit address splits into a 27-bit Cache Tag (bits 31-5) and a Byte Offset (bits 4-0). Each cache entry holds a Valid bit, a Cache Tag, and Cache Data bytes B0, B1, ..., B31; the address tag is compared (=) against every stored tag in parallel.]
°By definition: Conflict Misses = 0 for a
fully associative cache
Fully Associative Cache
°Must search all tags in the cache, as the item
can be in any cache block
°The tag search must be done by hardware
in parallel (any other search is too slow)
°But the necessary parallel comparator
hardware is very expensive
°Therefore, fully associative placement is
practical only for a very small cache
Compromise: N-way Set Associative Cache
°N-way set associative:
N cache blocks for each Cache Index
• Like having N direct mapped caches
operating in parallel
• Select the one that gets a hit
°Example: 2-way set associative cache
• Cache Index selects a “set” of 2 blocks
from the cache
• The 2 tags in the set are compared in parallel
• Data is selected based on the tag result
(which matched the address)
Example: 2-way Set Associative Cache
[Figure: 2-way set associative cache. The address splits into tag, index, and offset. The index selects one set, i.e., one block from each of the two ways, each with its own Valid bit, Cache Tag, and Cache Data. The two stored tags are compared (=) against the address tag in parallel, and a mux driven by the tag-compare result selects the hitting way's data, producing the Hit signal and the cache block.]
Set Associative Cache Contd.
°Direct mapped and fully associative caches
can be seen as variations of the
set-associative block placement strategy
°Direct mapped =
1-way set associative cache
°Fully associative =
n-way set associative for a cache
with exactly n blocks
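
As an illustrative sketch (ours, not from the slides), the toy model below implements N-way placement with LRU eviction; ways=1 behaves as direct mapped, and ways equal to the total number of blocks behaves as fully associative (one set, so the index disappears):

# Toy N-way set-associative cache (illustration; parameters assumed).
class SetAssociativeCache:
    def __init__(self, num_blocks, ways, block_bytes=32):
        self.ways = ways
        self.num_sets = num_blocks // ways   # 1 set => fully associative
        self.block_bytes = block_bytes
        # Each set holds up to `ways` tags, least recently used first.
        self.sets = [[] for _ in range(self.num_sets)]

    def access(self, addr):
        block = addr // self.block_bytes
        index = block % self.num_sets        # selects the set
        tag   = block // self.num_sets
        s = self.sets[index]
        if tag in s:                         # hardware compares all tags
            s.remove(tag)                    # in the set in parallel
            s.append(tag)                    # move to MRU position
            return "hit"
        if len(s) == self.ways:              # set full: evict the LRU tag
            s.pop(0)
        s.append(tag)
        return "miss"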
Block Replacement Policy
°N-way set associative and fully
associative caches have a choice of where
to place a block (and which block to replace)
• Of course, if there is an invalid block, use it
°Whenever we get a cache hit, record which
cache block was touched
°When we need to evict a cache block,
choose one which hasn't been touched
recently: "Least Recently Used" (LRU)
• Past is prologue: history suggests it is the least
likely of the choices to be used soon
• The flip side of temporal locality
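
As a sketch of how cheap LRU tracking can be (our example, not from the slides): for 2-way associativity a single bit per set suffices, recording which way was used least recently:

# Illustrative 2-way set with one LRU bit per set (toy encoding is ours).
class TwoWaySet:
    def __init__(self):
        self.tags = [None, None]
        self.lru  = 0                  # which way is least recently used

    def _touch(self, way):
        self.lru = 1 - way             # the other way becomes LRU

    def access(self, tag):
        for way in (0, 1):
            if self.tags[way] == tag:  # hit: update the LRU bit
                self._touch(way)
                return "hit"
        victim = self.lru              # miss: replace the LRU way
        self.tags[victim] = tag
        self._touch(victim)
        return "miss"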
Block Replacement Policy: Random
°Sometimes it is hard to keep track of the
LRU block if there are lots of choices
°How hard for 2-way associativity?
°Second Choice Policy: pick one at
random and replace that block
°Advantages
• Very simple to implement
• Predictable behavior
• No worst case behavior
What about Writes to Memory?
°Suppose we write data only to the cache?
• Main memory and cache would then be
inconsistent - we cannot allow that
°Simplest policy: the information is
written to both the block in the cache
and the block in the lower-level
memory (write-through)
°Problem: writes operate at the speed of
the lower-level memory!
Improving Cache Performance: Write Buffer
[Figure: Processor → Cache → DRAM, with a Write Buffer between the Cache and DRAM.]
° A Write Buffer is added between the Cache and
Memory
• Processor: writes data into the cache and write buffer
• Memory controller: writes buffer contents to memory
° The write buffer is just a first-in, first-out (FIFO) queue:
• Typical number of entries: 4 to 10
• Works fine if: store frequency (w.r.t. time)
<< 1 / DRAM write cycle time
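
A minimal sketch of this arrangement (ours, for illustration): the processor's store completes as soon as the entry lands in the FIFO, and the processor stalls only when stores arrive faster than the memory controller can drain them:

# Toy write-through write buffer (illustration; names are ours).
from collections import deque

class WriteBuffer:
    def __init__(self, entries=4):       # typical: 4 to 10 entries
        self.fifo = deque()
        self.entries = entries

    def store(self, addr, data):         # processor side
        if len(self.fifo) == self.entries:
            return "stall"               # buffer full: processor waits
        self.fifo.append((addr, data))   # store completes immediately
        return "ok"

    def drain_one(self):                 # memory-controller side: retire
        if self.fifo:                    # the oldest write to DRAM
            return self.fifo.popleft()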
Improving Cache Performance: Write Back
° Option 2: data is written only to the cache block
° The modified cache block is written to main memory
only when it is replaced
• Block is unmodified (clean) or modified (dirty)
° This scheme is called “Write Back”
• Advantage? Repeated writes to same block stay in
cache
• Disadvantage? More complex to implement
° Write Back is standard for Pentium Pro, optional
for PowerPC 604
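
A minimal sketch of the dirty-bit mechanism (ours, not a vendor implementation): repeated writes only set a flag, and memory is touched once, at eviction:

# Toy write-back block with a dirty bit (illustration only).
class WriteBackBlock:
    def __init__(self, tag, data):
        self.tag, self.data = tag, data
        self.dirty = False               # clean until first write

    def write(self, data):
        self.data = data
        self.dirty = True                # repeated writes stay in cache

    def evict(self, memory):
        if self.dirty:                   # write back only if modified
            memory[self.tag] = self.data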
Improving Caches
°In general, we want to minimize the
Average Access Time:
= Hit Time x (1 - Miss Rate)
+ Miss Penalty x Miss Rate
(recall Hit Time << Miss Penalty)
°So far, we have looked at three ways to
reduce the Miss Rate:
• Larger Block Size
• Larger Cache
• Higher Associativity
°What else can reduce the miss penalty?
Add a second-level (L2) cache.
Current Memory Hierarchy
[Figure: Processor (regs, L1 cache, control, datapath) → L2 Cache → Main Memory → Secondary Memory]

Level:        Regs    L1 cache  L2 Cache  Main Memory  Secondary Memory
Speed (ns):   0.5     2         6         100          10,000,000
Size (MB):    0.0005  0.05      1-4       100-1000     100,000
Cost ($/MB):  -       $100      $30       $1           $0.05
Technology:   Regs    SRAM      SRAM      DRAM         Disk
How do we calculate the miss penalty?
°Access time = L1 hit time * L1 hit rate +
L1 miss penalty * L1 miss rate
°We simply take the L1 miss penalty to be
the access time of the L2 cache
°Access time = L1 hit time * L1 hit rate +
(L2 hit time * L2 hit rate +
L2 miss penalty * L2 miss rate)
* L1 miss rate
Do the numbers for L2 Cache
°Assumptions:
• L1 hit time = 1 cycle, L1 hit rate = 90%
• L2 hit time (also L1 miss penalty) = 4 cycles,
L2 miss penalty = 100 cycles,
L2 hit rate = 90%
°Access time = L1 hit time * L1 hit rate +
(L2 hit time * L2 hit rate +
L2 miss penalty * (1 - L2 hit rate) )
* L1 miss rate
= 1*0.9 + (4*0.9 + 100*0.1) *(1-0.9)
= 0.9 + (13.6) * 0.1 = 2.26 clock cycles
What would it be without the L2 cache?
°Assume that the L1 miss penalty
would be 100 clock cycles
°Access time = 1*0.9 + 100*0.1
= 10.9 clock cycles vs. 2.26 with L2
°So we gain a clear benefit from having the
second, larger cache before main
memory
°Today’s L1 cache sizes: 16 KB-64 KB;
L2 cache may be 512 KB to 4096 KB
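
The two-level formula from the previous slides is easy to check numerically; this short calculation (ours) reproduces both results:

# Reproducing the numbers from slides 16-18.
def amat_two_level(l1_hit, l1_rate, l2_hit, l2_rate, l2_penalty):
    l1_miss_penalty = l2_hit * l2_rate + l2_penalty * (1 - l2_rate)
    return l1_hit * l1_rate + l1_miss_penalty * (1 - l1_rate)

with_l2 = amat_two_level(1, 0.9, 4, 0.9, 100)
print(with_l2)                    # 0.9 + 13.6 * 0.1 = 2.26 cycles

without_l2 = 1 * 0.9 + 100 * 0.1  # L1 misses go straight to memory
print(without_l2)                 # 10.9 cycles, vs. 2.26 with an L2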
An Example
Q: Suppose we have a processor with a base CPI of 1.0
assuming all references hit in the primary cache and
a clock rate of 500 MHz. The main memory access
time is 200 ns. Suppose the miss rate per instruction is 5%.
What is the revised CPI? How much faster will the
machine run if we put a secondary cache (with 20-ns
access time) that reduces the miss rate to memory to
2%? Assume same access time for hit or miss.
A: Miss penalty to main memory = 200 ns = 100 cycles.
Total CPI = Base CPI + Memory-stall cycles per instruction.
Hence, revised CPI = 1.0 + 5% x 100 = 6.0
When an L2 with 20-ns (10 cycles) access time is put,
the miss rate to memory is reduced to 2%. So, out of
5% L1 miss, L2 hit is 3% and miss is 2%.
The CPI is reduced to 1.0 + 5% x (10 + 40% x 100) = 3.5.
Thus, the machine with the secondary cache is faster by
6.0/3.5 = 1.7
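
The same arithmetic in executable form (ours; it just re-derives the slide's numbers from the 500 MHz clock):

# Reproducing slide 19: 500 MHz clock => one cycle = 2 ns.
cycle_ns    = 1000 / 500               # 2 ns per cycle
mem_penalty = 200 / cycle_ns           # 200 ns -> 100 cycles
l2_penalty  = 20 / cycle_ns            # 20 ns  -> 10 cycles

cpi_no_l2 = 1.0 + 0.05 * mem_penalty   # 1.0 + 5.0 = 6.0
# With L2: all 5% of L1 misses pay the 10-cycle L2 access; the 2% that
# also miss in L2 (40% of the 5%) pay the 100-cycle memory penalty too.
cpi_l2 = 1.0 + 0.05 * (l2_penalty + 0.40 * mem_penalty)   # 3.5
print(cpi_no_l2 / cpi_l2)              # speedup ~ 1.7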
The Three Cs in Memory Hierarchy
°Cache misses fall into three
classes:
- Compulsory misses: caused by the
first access to a block from memory -
a small component, fixed independent of cache size
- Capacity misses: occur because the cache
cannot contain all the blocks, due to its limited
size - reduced by increasing cache size
- Conflict misses: occur because multiple
blocks compete for the same block or set in
the cache (also called collision misses) -
reduced by increasing associativity
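
The standard way to separate the three Cs in a simulation, sketched here in our own code: a reference that misses even in an infinite cache is compulsory; one that misses in a fully associative LRU cache of equal size is capacity; one that hits there but misses in the real cache is conflict:

# Classify one reference into the three Cs (illustrative sketch).
def classify(block, real_cache_hit, seen, full_assoc, capacity):
    # seen: blocks ever touched (models an infinite cache)
    # full_assoc: LRU list for an equal-size fully associative cache
    if block not in seen:
        kind = "compulsory"            # even an infinite cache misses
    elif block not in full_assoc:
        kind = "capacity"              # equal-size fully assoc. misses
    elif not real_cache_hit:
        kind = "conflict"              # only the real placement missed
    else:
        kind = "hit"
    seen.add(block)
    if block in full_assoc:            # LRU update: MRU at the end
        full_assoc.remove(block)
    full_assoc.append(block)
    if len(full_assoc) > capacity:
        full_assoc.pop(0)
    return kind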
3Cs Absolute Miss Rate (SPEC92)
[Figure: absolute miss rate per type vs. cache size (1 KB to 128 KB) for SPEC92, at associativities of 1-, 2-, 4-, and 8-way. Compulsory misses form a thin, nearly constant band at the bottom; capacity misses shrink as the cache grows; the conflict component on top shrinks as associativity increases.]
Unified vs Split Caches
[Figure: two organizations. Unified: Proc → Unified Cache-1 → Unified Cache-2. Split (Harvard): Proc → I-Cache-1 and D-Cache-1 in parallel → Unified Cache-2.]
Example:
• 16KB I&D: Inst miss rate=0.64%, Data miss rate=6.47%
• 32KB unified: Aggregate miss rate=1.99%
° Which is better (ignore L2 cache)?
• Assume 33% data ops ⇒ 75% of accesses are instruction fetches
(1.0/1.33)
• hit time = 1, miss time = 50
• Note that a data hit incurs 1 stall in the unified cache (only one port)
AMAT(Harvard) = 75%x(1 + 0.64%x50) + 25%x(1 + 6.47%x50) = 2.05
AMAT(Unified) = 75%x(1 + 1.99%x50) + 25%x(1 + 1 + 1.99%x50) = 2.24
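
Checking the arithmetic (our calculation of the slide's formula):

# Split (Harvard) vs. unified cache AMAT, per slide 23.
miss_time = 50
i_frac, d_frac = 0.75, 0.25            # 75% instruction, 25% data refs

amat_split = (i_frac * (1 + 0.0064 * miss_time)
              + d_frac * (1 + 0.0647 * miss_time))    # ~2.05
# The unified cache has one port, so a data access waits one cycle
# behind the instruction fetch: effective hit time 2 for data.
amat_unified = (i_frac * (1 + 0.0199 * miss_time)
                + d_frac * (2 + 0.0199 * miss_time))  # ~2.24
print(amat_split, amat_unified)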
Static RAM (SRAM)
°Six transistors in a cross-coupled
fashion
• Provides both regular AND inverted outputs
• Implemented in a CMOS process
[Figure: Single-Port 6-T SRAM Cell]
Dynamic Random Access Memory - DRAM
° DRAM organization is similar to SRAM,
except that each bit of DRAM is constructed
from a pass transistor and a capacitor,
shown on the next slide
° Fewer transistors per bit gives high
density, but the discharge through the
capacitor is slow.
° The capacitor needs to be recharged, or
refreshed, giving rise to a high cycle time.
° Uses a two-level decoder as shown later.
Note that 2048 bits are accessed per row,
but only one bit is used.
Dynamic RAM
° SRAM cells exhibit high speed/poor density
° DRAM: simple transistor/capacitor pairs in
high density form
[Figure: a DRAM cell: a pass transistor gated by the Word Line connects capacitor C to the Bit Line; a Sense Amp at the foot of the bit line reads the stored charge.]
DRAM logical organization (4 Mbit)
•DRAM access time = row access time + column
access time + refresh overhead
[Figure: 11 address bits A0…A10 drive a Row Decoder that selects one word line of the 2,048 x 2,048 Memory Array of storage cells; Sense Amps & I/O latch the row, and the Column Decoder selects one bit for the data pins (D, Q).]
° Row and column addresses are each the square root of the total bits, supplied in turn per RAS/CAS
Main Memory Organizations Fig. 7.13
[Figure 7.13: three main memory organizations, each CPU → Cache → Bus → Memory. (a) one-word-wide memory organization; (b) wide memory organization, with a multiplexor between the cache and the CPU; (c) interleaved memory organization with four banks (Memory bank 0-3) on a one-word-wide bus.]
DRAM access time >> bus transfer time
Memory Access Time Example
° Assume that it takes 1 cycle to send the address,
15 cycles for each DRAM access and 1 cycle to
send a word of data.
° Assuming a cache block of 4 words and one-word
wide DRAM (fig. 7.13a), miss penalty = 1 + 4x15 +
4x1 = 65 cycles
° With main memory and bus width of 2 words (fig.
7.13b), miss penalty = 1 + 2x15 + 2x1 = 33 cycles.
For 4-word wide memory, miss penalty is 17 cycles.
Expensive due to wide bus and control circuits.
° With interleaved memory of 4 memory banks and
same bus width (fig. 7.13c), the miss penalty = 1 +
1x15 + 4x1 = 20 cycles. The memory controller
must supply consecutive addresses to different
memory banks. Interleaving is universally adopted
in high-performance computers.
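
The four miss penalties are easy to reproduce (our calculation of this slide's arithmetic):

# Miss penalty for a 4-word cache block under each organization.
addr, dram, xfer = 1, 15, 1      # cycles: send address / DRAM access /
block_words = 4                  # bus transfer of one bus-width unit

one_word   = addr + block_words * dram + block_words * xfer  # 65 cycles
two_wide   = addr + 2 * dram + 2 * xfer                      # 33 cycles
four_wide  = addr + 1 * dram + 1 * xfer                      # 17 cycles
interleave = addr + 1 * dram + block_words * xfer            # 20 cycles
# Interleaving overlaps the four DRAM accesses across the four banks;
# only the four one-word bus transfers remain serialized.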