Memory Hierarchy and Cache Design


Lecture 10: Memory Hierarchy and Cache Design
Computer Architecture, COE 501
Why is memory important?
• Processor performance has increased at a much
faster rate than memory performance, making main
memory the bottleneck.
[Figure: CPU vs. DRAM performance, 1980-2000, on a log scale from 1 to 1000; the widening distance between the two curves is labeled the CPU-DRAM Gap.]
• 1980: no cache in the microprocessor;
2001: two-level cache using 90% of the transistors on the microprocessor
General Principles of Memory
• Locality
– Temporal locality: referenced memory is likely to be referenced
again soon (e.g., code within a loop)
– Spatial locality: memory close to referenced memory is likely to be
referenced soon (e.g., data in a sequentially accessed array; see the loop sketch after this slide)
• Locality + smaller HW is faster = memory hierarchy
– Levels: each smaller, faster, more expensive/byte than level below
– Inclusive: data found in an upper level is also found in the levels below it
• Definitions
– Upper level: the level closer to the processor
– Block: minimum unit of data that is either present or not present in the upper level
– Block address: the location of the block in memory
– Hit time: time to access the upper level, including the time to determine hit or miss
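To make the two kinds of locality concrete, here is a minimal C sketch (the array size and the row-major traversal are arbitrary choices for illustration):

```c
#include <stdio.h>

#define N 1024

/* Summing a 2-D array row by row.
 * Temporal locality: sum, i, j and the loop code itself are reused every iteration.
 * Spatial locality: a[i][0], a[i][1], ... are adjacent in memory, so a cache
 * block fetched for one element also holds the next few elements. */
long sum_row_major(int a[N][N]) {
    long sum = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];        /* sequential (stride-1) accesses */
    return sum;
}
```

Traversing the same array column by column (swapping the two loops) touches memory with a stride of N ints, losing most of the spatial locality and typically missing in the cache far more often.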
Memory Hierarchy
[Figure: the memory hierarchy, from processor registers through the L1 cache, L2 cache, and DRAM main memory, to secondary storage (disks).]
Four Questions for Memory
Hierarchy Designers
• Q1: Where can a block be placed in the upper level?
(Block placement)
• Q2: How is a block found if it is in the upper level?
(Block identification)
• Q3: Which block should be replaced on a miss?
(Block replacement)
• Q4: What happens on a write?
(Write strategy)
Q1: Where can a block be
placed in the upper level?
• Direct Mapped: Each block has only one
place that it can appear in the cache.
• Fully associative: Each block can be placed
anywhere in the cache.
• Set associative: Each block can be placed in
a restricted set of places in the cache.
– If there are n blocks in a set, the cache placement is
called n-way set associative
• What is the associativity of a direct mapped
cache?
Associativity Examples
(Figure 5.2, pg. 376)
• Fully associative: block 12 can go anywhere
• Direct mapped: block no. = (block address) mod (no. of blocks in cache)
Block 12 can go only into block 4 (12 mod 8)
• Set associative: set no. = (block address) mod (no. of sets in cache)
Block 12 can go anywhere in set 0 (12 mod 4)
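The placement arithmetic in the figure is just a modulo operation; a small sketch, assuming the same 8-block cache organized with 8, 4, or 1 sets depending on associativity:

```c
#include <stdio.h>

/* Which set can a block go into?  set = block_address mod number_of_sets. */
int placement_set(int block_address, int num_sets) {
    return block_address % num_sets;
}

int main(void) {
    printf("direct mapped    : block 12 -> block %d\n", placement_set(12, 8)); /* 4 */
    printf("2-way set assoc. : block 12 -> set   %d\n", placement_set(12, 4)); /* 0 */
    printf("fully associative: block 12 -> set   %d (the only set)\n",
           placement_set(12, 1));
    return 0;
}
```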
Q2: How Is a Block Found If It Is
in the Upper Level?
• The address can be divided into two main parts
– Block offset: selects the data within the block
offset size = log2(block size)
– Block address: tag + index
» Index: selects the set in the cache
index size = log2(#blocks/associativity) = log2(#sets)
» Tag: compared with the tag stored in the cache to determine a hit
tag size = address size - index size - offset size
[Figure: address format, Tag | Index | Block offset]
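These field definitions translate directly into shift-and-mask code. A sketch, assuming the block size and number of sets are powers of two (the example address and field widths below are arbitrary):

```c
#include <stdint.h>
#include <stdio.h>

/* Split an address into block offset, index, and tag.
 * offset_bits = log2(block size), index_bits = log2(#sets). */
void split_address(uint64_t addr, unsigned offset_bits, unsigned index_bits,
                   uint64_t *tag, uint64_t *index, uint64_t *offset) {
    *offset = addr & ((1ULL << offset_bits) - 1);
    *index  = (addr >> offset_bits) & ((1ULL << index_bits) - 1);
    *tag    = addr >> (offset_bits + index_bits);   /* everything above the index */
}

int main(void) {
    uint64_t tag, index, offset;
    /* 32-byte blocks (5 offset bits), 256 sets (8 index bits) */
    split_address(0x12345678, 5, 8, &tag, &index, &offset);
    printf("tag=0x%llx index=%llu offset=%llu\n",
           (unsigned long long)tag, (unsigned long long)index,
           (unsigned long long)offset);
    return 0;
}
```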
Q3: Which Block Should be
Replaced on a Miss?
• Easy for Direct Mapped
• Set associative or fully associative:
– Random: easier to implement
– Least recently used (LRU): harder to implement; may be approximated
• Miss rates for caches with different sizes,
associativities, and replacement algorithms:
Associativity:      2-way               4-way               8-way
Size        LRU      Random      LRU      Random      LRU      Random
16 KB       5.18%    5.69%       4.67%    5.29%       4.39%    4.96%
64 KB       1.88%    2.01%       1.54%    1.66%       1.39%    1.53%
256 KB      1.15%    1.17%       1.13%    1.13%       1.12%    1.12%
For caches with low miss rates, random is almost as good as LRU.
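A minimal sketch of how each policy chooses a victim within one set; the counter-based LRU shown here is one straightforward implementation, and real designs often use cheaper approximations:

```c
#include <stdlib.h>

#define WAYS 4

/* Per-set replacement state: one "last used" timestamp per way. */
typedef struct {
    unsigned long last_used[WAYS];
} set_state;

/* LRU: evict the way whose most recent use is oldest. */
int victim_lru(const set_state *s) {
    int victim = 0;
    for (int w = 1; w < WAYS; w++)
        if (s->last_used[w] < s->last_used[victim])
            victim = w;
    return victim;
}

/* Random: evict any way; needs no per-way state at all. */
int victim_random(void) {
    return rand() % WAYS;
}
```

On every hit, LRU must also update last_used[way] with the current access count, which is why it is harder (and costlier) to implement than random as associativity grows.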
Q4: What Happens on a Write?
• Write through: The information is written to both the
block in the cache and to the block in the lower-level
memory.
• Write back: The information is written only to the block
in the cache. The modified cache block is written to
main memory only when it is replaced.
– is block clean or dirty? (add a dirty bit to each block)
• Pros and Cons of each:
– Write through
» read misses cannot result in writes to memory,
» easier to implement
» Always combine with write buffers to avoid memory latency
– Write back
» Less memory traffic
» Perform writes at the speed of the cache
Q4: What Happens on a Write?
• Since data does not have to be brought into
the cache on a write miss, there are two
options:
– Write allocate
» The block is brought into the cache on a write miss
» Used with write-back caches
» Hope subsequent writes to the block hit in cache
– No-write allocate
» The block is modified in memory, but not brought
into the cache
» Used with write-through caches
» Writes have to go to memory anyway, so why bring
the block into the cache?
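The policy combinations can be summarized in a short sketch; the helper functions (write_to_cache, write_to_memory, allocate_block, set_dirty_bit) are hypothetical stand-ins for the real datapath:

```c
#include <stdio.h>

/* Hypothetical datapath helpers, stubbed out for illustration. */
static void write_to_cache(unsigned long addr, int data)  { printf("cache  <- %d @%#lx\n", data, addr); }
static void write_to_memory(unsigned long addr, int data) { printf("memory <- %d @%#lx\n", data, addr); }
static void allocate_block(unsigned long addr)            { printf("allocate block @%#lx\n", addr); }
static void set_dirty_bit(unsigned long addr)             { printf("mark dirty     @%#lx\n", addr); }

/* Write handling for the policy combinations discussed above. */
void handle_write(int hit, int write_back, int write_allocate,
                  unsigned long addr, int data) {
    if (hit) {
        write_to_cache(addr, data);
        if (write_back)
            set_dirty_bit(addr);          /* memory is updated later, on eviction */
        else
            write_to_memory(addr, data);  /* write through: update both levels now */
    } else if (write_allocate) {
        allocate_block(addr);             /* bring the block in ...               */
        handle_write(1, write_back, write_allocate, addr, data);  /* ...then write as a hit */
    } else {
        write_to_memory(addr, data);      /* no-write allocate: cache is bypassed */
    }
}
```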
Cache Measures
• Hit rate: fraction found in the cache
– So high that we usually talk about Miss rate = 1 - Hit Rate
• Hit time: time to access the cache
• Miss penalty: time to replace a block from the lower level,
including the time to deliver the block to the CPU
– access time: time to access the lower level
– transfer time: time to transfer the block
• Average memory-access time
= Hit time + Miss rate x Miss penalty (ns or clocks)
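The formula maps directly to code (a trivial sketch; the result is in whatever units hit time and miss penalty are given in):

```c
/* Average memory access time = hit time + miss rate x miss penalty.
 * e.g. amat(1.0, 0.02, 50.0) == 2.0 clock cycles. */
double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}
```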
Block Size vs. Cache Measures
• Increasing Block Size generally increases
Miss Penalty and decreases Miss Rate
[Figure: Miss Penalty rises with Block Size while Miss Rate falls; their product determines the Average Memory Access Time as a function of Block Size.]
Example:
Alpha 21064 Data Cache
• The data cache of the Alpha 21064 has the
following features
– 8 KB of data
– 32-byte blocks
– Direct-mapped placement
– Write through (no-write allocate, 4-block write buffer)
– 34-bit physical address composed of
» 5-bit block offset
» 8-bit index
» 21-bit tag
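The field widths follow from these cache parameters. A quick sketch of the arithmetic, assuming power-of-two sizes:

```c
#include <stdio.h>

int main(void) {
    unsigned cache_bytes = 8 * 1024;   /* 8 KB of data          */
    unsigned block_bytes = 32;         /* 32-byte blocks        */
    unsigned ways        = 1;          /* direct mapped         */
    unsigned addr_bits   = 34;         /* physical address size */

    unsigned sets = cache_bytes / (block_bytes * ways);      /* 256 sets       */
    unsigned offset = 0, index = 0;
    while ((1u << offset) < block_bytes) offset++;            /* log2(32)  = 5  */
    while ((1u << index)  < sets)        index++;             /* log2(256) = 8  */
    unsigned tag = addr_bits - index - offset;                /* 34-8-5    = 21 */

    printf("offset=%u index=%u tag=%u\n", offset, index, tag);
    return 0;
}
```

Changing ways to 2 (the 2-way organization on the next slide) halves the number of sets to 128, giving the 7-bit index and 22-bit tag quoted there.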
Example:
Alpha 21064 Data Cache
A cache read has 4 steps
(1) The address coming into the cache
is divided into the tag, index,
and block offset
(2) The index selects the block
(3) The address tag is compared
with the tag in the cache, the
valid bit is checked, and data
to be loaded is selected
(4) If the valid bit is set and the
tags match, the data is loaded
into the processor
If there is a write, the data is also
sent to the write buffer
2-Way Set Associative Cache
Features of an 8 KB 2-way set associative cache:
– 5-bit block offset
– 7-bit index
– 22-bit tag
The set associative cache has extra hardware for:
– 2 tag comparisons
– a mux to select the data
Compared to the direct-mapped cache, the set associative cache
will tend to have a smaller miss rate but a larger hit time. Why?
Split vs. Unified Cache
• Unified cache (mixed cache): data and instructions are
stored together (von Neumann architecture)
• Split cache: data and instructions are stored
separately (Harvard architecture)
• Why do instruction caches have a lower miss ratio?
Size       Instruction Cache    Data Cache    Unified Cache
1 KB       3.06%                24.61%        13.34%
2 KB       2.26%                20.57%         9.78%
4 KB       1.78%                15.94%         7.24%
8 KB       1.10%                10.19%         4.57%
16 KB      0.64%                 6.47%         2.87%
32 KB      0.39%                 4.82%         1.99%
64 KB      0.15%                 3.77%         1.35%
128 KB     0.02%                 2.88%         0.95%
Example:
Split vs. Unified Cache
• Which has the lower average memory
access time?
– Split cache : 16 KB instructions + 16 KB data
– Unified cache: 32 KB instructions + data
• Assumptions
– Use the miss rates from the previous chart
– Miss penalty is 50 cycles
– Hit time is 1 cycle
– A load or store hit takes an extra cycle, since there is only
one port for instructions and data
Example:
Split vs. Unified Cache
Average memory-access time = Hit time + Miss rate x Miss penalty
AMAT = %instr x (instr hit time + instr miss rate x instr miss penalty) +
%data x (data hit time + data miss rate x data miss penalty)
For the split cache:
AMAT = 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50) = 2.05 cycles
For the unified cache:
AMAT = 75% x (1 + 1.99% x 50) + 25% x (2 + 1.99% x 50) = 2.24 cycles
The unified cache has a longer AMAT, even though its miss rate is
lower, because instruction and data accesses conflict for its single
port, so every load or store hit costs an extra cycle.
What are advantages of a split cache? Of unified cache?
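The arithmetic can be checked with a few lines of C, reusing the AMAT formula from earlier (miss rates taken from the table above: 0.64% instruction and 6.47% data for the 16 KB split caches, 1.99% for the 32 KB unified cache; the 75%/25% instruction/data access mix is the one assumed in the example):

```c
#include <stdio.h>

double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

int main(void) {
    double penalty = 50.0;
    /* 75% of accesses are instruction fetches, 25% are data accesses. */
    double split   = 0.75 * amat(1.0, 0.0064, penalty)
                   + 0.25 * amat(1.0, 0.0647, penalty);   /* ~2.05 cycles */
    double unified = 0.75 * amat(1.0, 0.0199, penalty)
                   + 0.25 * amat(2.0, 0.0199, penalty);   /* ~2.24 cycles */
    printf("split: %.2f  unified: %.2f\n", split, unified);
    return 0;
}
```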
Improving Cache Performance
• Average memory-access time
= Hit time + Miss rate x Miss penalty
• Improve performance by:
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
Summary
• The CPU-memory gap is a major obstacle to achieving
high performance
• Memory hierarchies
– Take advantage of program locality
– Closer to processor => smaller, faster, more expensive
– Further from processor => bigger, slower, less expensive
• 4 questions for memory hierarchy
– Block placement, block identification, block replacement, and
write strategy
• Cache parameters
– Cache size, block size, associativity