Transcript Lecture 20

EENG 449bG/CPSC 439bG
Computer Systems
Lecture 17
Memory Hierarchy Design Part I
April 7, 2005
Prof. Andreas Savvides
Spring 2005
http://www.eng.yale.edu/courses/2005s/eeng449b
4/7/05
EENG449b/Savvides
Lec 18.1
Who Cares About the Memory Hierarchy?
[Figure: CPU-DRAM gap, performance (log scale, 1 to 1000) vs. year, 1980 to 2000. µProc: 60%/yr. (“Moore’s Law”); DRAM: 7%/yr. (“Less’ Law?”); the processor-memory performance gap grows 50% / year.]
• 1980: no cache in µproc; 1995 2-level cache on chip
(1989 first Intel µproc with a cache on chip)
Review of Caches
• Cache is the name given to the first level of
the memory hierarchy encountered once the
address leaves the CPU
– Cache hit / cache miss when data is found / not found
– Block – a fixed size collection of data containing the
requested word
– Spatial / temporal localities
– Latency – time to retrieve the first word in the block
– Bandwidth – time to retrieve the rest of the block
– Address space is broken into fixed-size blocks called
pages
– Page fault – when the CPU references something that is
in neither the cache nor main memory
Generations of Microprocessors
• Time of a full cache miss in instructions executed:
– 1st Alpha: 340 ns/5.0 ns =  68 clks x 2 or 136
– 2nd Alpha: 266 ns/3.3 ns =  80 clks x 4 or 320
– 3rd Alpha: 180 ns/1.7 ns = 108 clks x 6 or 648
• 1/2X latency x 3X clock rate x 3X Instr/clock => about 5X
Processor-Memory Performance Gap “Tax”
Processor            % Area (cost)    % Transistors (power)
• Alpha 21164        37%              77%
• StrongArm SA110    61%              94%
• Pentium Pro        64%              88%
  – 2 dies per package: Proc/I$/D$ + L2$
• Caches have no “inherent value”,
only try to close performance gap
What is a cache?
• Small, fast storage used to improve average access
time to slow memory.
• Exploits spatial and temporal locality
• In computer architecture, almost everything is a cache!
– Registers: “a cache” on variables (software managed)
– First-level cache: a cache on second-level cache
– Second-level cache: a cache on memory
– Memory: a cache on disk (virtual memory)
– TLB: a cache on page table
– Branch-prediction: a cache on prediction information?
[Figure: memory hierarchy, from Proc/Regs through L1-Cache, L2-Cache, and Memory to Disk, Tape, etc.; levels get bigger going down and faster going up.]
Review: Cache performance
• Miss-oriented Approach to Memory Access:
$$CPUtime = IC \times \left( CPI_{Execution} + \frac{MemAccess}{Inst} \times MissRate \times MissPenalty \right) \times CycleTime$$
$$CPUtime = IC \times \left( CPI_{Execution} + \frac{MemMisses}{Inst} \times MissPenalty \right) \times CycleTime$$
– CPI_Execution includes ALU and Memory instructions
• Separating out Memory component entirely
– AMAT = Average Memory Access Time
– CPI_AluOps does not include memory instructions
$$CPUtime = IC \times \left( \frac{AluOps}{Inst} \times CPI_{AluOps} + \frac{MemAccess}{Inst} \times AMAT \right) \times CycleTime$$
$$AMAT = HitTime + MissRate \times MissPenalty$$
$$= \left( HitTime_{Inst} + MissRate_{Inst} \times MissPenalty_{Inst} \right) + \left( HitTime_{Data} + MissRate_{Data} \times MissPenalty_{Data} \right)$$
Impact on Performance
• Suppose a processor executes at
– Clock Rate = 200 MHz (5 ns per cycle), Ideal (no misses) CPI = 1.1
– 50% arith/logic, 30% ld/st, 20% control
• Suppose that 10% of memory operations get 50 cycle
miss penalty
• Suppose that 1% of instructions get same miss penalty
• CPI = ideal CPI + average stalls per instruction
=1.1(cycles/ins) +
[ 0.30 (DataMops/ins)
x 0.10 (miss/DataMop) x 50 (cycle/miss)] +
[ 1 (InstMop/ins)
x 0.01 (miss/InstMop) x 50 (cycle/miss)]
= (1.1 + 1.5 + .5) cycle/ins = 3.1
• About 65% of the time (2.0 of 3.1 cycles per instruction) the proc is stalled waiting for memory!
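The CPI arithmetic on this slide can be checked with a short sketch. The function name `cpi_with_stalls` and its parameter names are illustrative assumptions, not part of the lecture; the numbers are the ones given above. The stall fraction is the stall cycles divided by the total CPI.

```python
def cpi_with_stalls(ideal_cpi, data_ops_per_inst, data_miss_rate,
                    inst_miss_rate, miss_penalty):
    """Return (total CPI, data stall cycles/inst, instruction stall cycles/inst)."""
    # Data accesses: only loads/stores touch the data side of the memory system.
    data_stalls = data_ops_per_inst * data_miss_rate * miss_penalty
    # Instruction fetch: every instruction makes exactly one fetch.
    inst_stalls = 1.0 * inst_miss_rate * miss_penalty
    return ideal_cpi + data_stalls + inst_stalls, data_stalls, inst_stalls


# Values from the slide: ideal CPI 1.1, 30% ld/st, 10% data misses,
# 1% instruction misses, 50-cycle miss penalty.
cpi, data_stalls, inst_stalls = cpi_with_stalls(1.1, 0.30, 0.10, 0.01, 50)
stall_fraction = (data_stalls + inst_stalls) / cpi
print(f"CPI = {cpi:.1f}, stalled {stall_fraction:.1%} of the time")
```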
Traditional Four Questions for
Memory Hierarchy Designers
• Q1: Where can a block be placed in the upper level?
(Block placement)
– Fully Associative, Set Associative, Direct Mapped
• Q2: How is a block found if it is in the upper level?
(Block identification)
– Tag/Block
• Q3: Which block should be replaced on a miss?
(Block replacement)
– Random, LRU
• Q4: What happens on a write?
(Write strategy)
– Write Back or Write Through (with Write Buffer)
Q1: Where can a Block Be Placed in
a Cache?
Set Associativity
• Direct mapped = one-way set associative
– (block address) MOD (Number of blocks in cache)
• Fully associative = set associative with 1 set
– Block can be placed anywhere in the cache
• Set associative – a block can be placed in a
restricted set of places
– Block first mapped to a set and then placed anywhere in the
set
» (Block address) MOD (Number of sets in cache)
– If there are n blocks in a set, then the cache is n-way set
associative
• Most popular cache configurations in today’s
processors
– Direct mapped, 2-way set associative, 4-way set associative
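The placement rules above can be sketched as one function, treating direct mapped and fully associative as the two extremes of set associativity. The name `candidate_frames` and the frame numbering (set index times ways) are assumptions made for illustration.

```python
def candidate_frames(block_address, num_blocks, ways):
    """Return the cache frame numbers where a memory block may be placed."""
    num_sets = num_blocks // ways
    # (Block address) MOD (Number of sets in cache) picks the set.
    set_index = block_address % num_sets
    first_frame = set_index * ways
    # The block may go in any of the `ways` frames of that set.
    return list(range(first_frame, first_frame + ways))


# An 8-block cache and memory block 12, under three organizations.
print(candidate_frames(12, 8, 1))  # direct mapped: one frame
print(candidate_frames(12, 8, 2))  # 2-way set associative: one set of two frames
print(candidate_frames(12, 8, 8))  # fully associative: any frame
```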
Q2: How is a block found if it is in
the cache?
[Figure: address divided into Tag | Index | Block offset]
– Index: selects the set
– Block offset: selects the desired data from the block
– Tag: compared against for a hit
• If the cache size remains the same, increasing associativity increases
the number of blocks per set => the index shrinks and the tag grows
Q3: Which Block Should be Replaced
on a Cache Miss?
• Direct-mapped cache
– No choice – a single block is checked for a hit. If
there is a miss, data is fetched into that block
• Fully associative and set associative
– Random
– Least Recently Used (LRU) – relies on locality principles
– First In, First Out (FIFO) – approximates LRU
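To see how LRU and FIFO can differ, here is a minimal sketch that counts misses for a single cache set under each policy. The function `count_misses` and the block trace are made up for illustration; real hardware usually approximates LRU rather than tracking exact order.

```python
from collections import OrderedDict


def count_misses(trace, ways, policy):
    """Count misses for one cache set under 'lru' or 'fifo' replacement."""
    resident = OrderedDict()  # oldest entry first
    misses = 0
    for block in trace:
        if block in resident:
            if policy == "lru":
                # A hit refreshes recency under LRU; FIFO ignores hits.
                resident.move_to_end(block)
            continue
        misses += 1
        if len(resident) == ways:
            resident.popitem(last=False)  # evict the oldest entry
        resident[block] = True
    return misses


trace = [1, 2, 1, 3, 1, 4]  # hypothetical block references to one 2-way set
print("LRU misses: ", count_misses(trace, 2, "lru"))
print("FIFO misses:", count_misses(trace, 2, "fifo"))
```

On this trace FIFO evicts block 1 even though it was just reused, which is the case where it fails to approximate LRU.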
Q4: What Happens on a Write?
• Cache accesses are dominated by reads
• E.g. on MIPS, stores are 10% and loads 37% of instructions, so
10%/(37% + 10%) = 21% of data cache traffic is writes
• Writes are much slower than reads
– On a read, the block is read from the cache at the same time the tag is read and
compared
» If the read is a hit, the block is passed to the CPU
– Writing cannot begin until the tag check shows the address is a hit
• Write through – information is written to both the
cache and the lower-level memory
• Write back – information only written to cache.
Written to memory only on block replacement
– Dirty bit – used to indicate whether a block has been changed
while in the cache
Write Through vs. Write Back
• Write back – writes occur at cache speed; multiple writes
within a block need only one write to the lower level
• Write through – slower, BUT the cache is
always clean
– Cache read misses never result in writes to the lower
level
– The next lower level of the hierarchy has the most current
copy of the data
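The difference in lower-level write traffic can be illustrated with a small sketch. It assumes a direct-mapped cache with write allocate and counts only writes that reach the next level; `memory_writes` and the trace are hypothetical, not taken from the lecture.

```python
def memory_writes(trace, num_frames, policy):
    """Count writes to the lower level for 'write_through' or 'write_back'."""
    frames = {}  # frame index -> (block, dirty)
    writes = 0
    for op, block in trace:
        idx = block % num_frames
        resident = frames.get(idx)
        hit = resident is not None and resident[0] == block
        if not hit:
            if resident is not None and resident[1]:
                writes += 1  # write back: a dirty victim must be written out
            resident = (block, False)  # allocate on miss (write allocate assumed)
        dirty = resident[1]
        if op == "w":
            if policy == "write_through":
                writes += 1  # every store also goes to the lower level
            else:
                dirty = True  # write back: only mark the block dirty
        frames[idx] = (block, dirty)
    return writes


# Three stores to block 0, then a read and a store to block 4 (same frame).
trace = [("w", 0), ("w", 0), ("w", 0), ("r", 4), ("w", 4)]
print("write through:", memory_writes(trace, 4, "write_through"))
print("write back:   ", memory_writes(trace, 4, "write_back"))
```

Under write back, the single lower-level write happens on the read miss to block 4, which is the case write through avoids; block 4 itself is still dirty in the cache when the trace ends.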
Example: Alpha 21264 Data Cache
• 2-way set associative
• Write back
• Each block has 64 bytes of data
– Offset points to the data we want
– Total cache size 65,536 bytes
• Index (2^9 = 512 entries) selects the set of candidate blocks
• Tag comparison determines if we have a hit
• Victim buffer helps with write back
Address Breakdown
• Physical address is 44 bits wide, 38-bit
block address and 6-bit offset (2^6=64)
• Calculating cache index size field
$$2^{Index} = \frac{Cache\ size}{Block\ size \times Set\ associativity} = \frac{65{,}536}{64 \times 2} = 512 = 2^{9}$$
• Blocks are 64 bytes so offset needs 6 bits
• Tag size = 38 – 9 = 29 bits
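The field widths above can be reproduced with a short sketch that assumes power-of-two sizes. The helper names `address_fields` and `split_address`, and the sample address, are illustrative assumptions.

```python
def address_fields(cache_bytes, block_bytes, ways, address_bits):
    """Return (tag_bits, index_bits, offset_bits) for a set-associative cache."""
    num_sets = cache_bytes // (block_bytes * ways)
    offset_bits = block_bytes.bit_length() - 1  # log2 for powers of two
    index_bits = num_sets.bit_length() - 1
    tag_bits = address_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits


def split_address(addr, index_bits, offset_bits):
    """Split a physical address into (tag, index, offset)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


# Alpha 21264 data cache parameters from the slides.
tag_bits, index_bits, offset_bits = address_fields(65536, 64, 2, 44)
print(tag_bits, index_bits, offset_bits)

# Hypothetical address, chosen only to exercise the split.
print(split_address(0x12345, index_bits, offset_bits))
```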
Writing to cache
• If the word to be written is in the cache, the first
3 steps are the same as for a read
• The 21264 processor uses write back, so the
block cannot simply be discarded on a miss
• If the “victim” was modified, its data and
address are sent to the victim buffer.
• Data cache cannot supply all processor needs
– Separate Instruction and Data Caches may be needed
Unified vs Split Caches
• Unified vs Separate Instruction and Data caches
[Figure: two organizations. Unified: Proc, Unified Cache-1, Unified Cache-2. Split: Proc, I-Cache-1 and D-Cache-1, Unified Cache-2.]
• Example:
– 16KB I&D: Inst miss rate=0.64%, Data miss rate=6.47%
– 32KB unified: Aggregate miss rate=1.99%
– Using miss rate in the evaluation may be misleading!
• Which is better (ignore L2 cache)?
– Assume 33% of instructions are data ops => 75% of accesses are instruction fetches (1.0/1.33) and 25% are data accesses
– hit time=1, miss time=50
– Note that data hit has 1 stall for unified cache (only one port)
$$AMAT = HitTime + MissRate \times MissPenalty$$
$$AMAT_{Harvard} = 75\% \times (1 + 0.64\% \times 50) + 25\% \times (1 + 6.47\% \times 50) = 2.05$$
$$AMAT_{Unified} = 75\% \times (1 + 1.99\% \times 50) + 25\% \times (1 + 1 + 1.99\% \times 50) = 2.24$$
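The comparison can be reproduced with a short sketch using the slide's miss rates and the 75%/25% instruction/data access mix. The helper `amat` and the variable names are assumptions for illustration.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time in cycles."""
    return hit_time + miss_rate * miss_penalty


inst_frac, data_frac = 0.75, 0.25  # share of accesses that are fetches / data
miss_penalty = 50

# Split (Harvard) caches: separate 16KB I and D caches.
amat_split = (inst_frac * amat(1, 0.0064, miss_penalty)
              + data_frac * amat(1, 0.0647, miss_penalty))

# Unified 32KB cache: a data access pays one extra stall cycle (single port).
amat_unified = (inst_frac * amat(1, 0.0199, miss_penalty)
                + data_frac * amat(1 + 1, 0.0199, miss_penalty))

print(f"split: {amat_split:.3f} cycles, unified: {amat_unified:.3f} cycles")
```

The split organization comes out ahead (about 2.05 vs. about 2.24 cycles) despite the comparison of raw miss rates pointing the other way, which is why miss rate alone can mislead.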
Impact of Caches on Performance
• Consider an in-order execution computer
– Cache miss penalty 100 clock cycles, CPI=1
– Average miss rate 2% and an average of 1.5 memory
references per instruction
– Average # of cache misses: 30 per 1000 instructions
$$CPU\ time = IC \times \left( CPI_{execution} + \frac{Memory\ stall\ cycles}{Instruction} \right) \times Clock\ cycle\ time$$
Performance with cache misses
$$CPU\ time_{with\ cache} = IC \times (1.0 + (30/1000 \times 100)) \times Clock\ cycle\ time$$
$$= IC \times 4.00 \times Clock\ cycle\ time$$
Impact of Caches on Performance
• Calculate the performance using miss rate
$$CPU\ time = IC \times \left( CPI_{execution} + Miss\ rate \times \frac{Memory\ accesses}{Instruction} \times Miss\ penalty \right) \times Clock\ cycle\ time$$
$$CPU\ time_{with\ cache} = IC \times (1.0 + (1.5 \times 2\% \times 100)) \times Clock\ cycle\ time$$
$$= IC \times 4.00 \times Clock\ cycle\ time$$
• 4x increase in CPU time from “perfect cache”
• No cache: 1.0 + 100 x 1.5 = 151, a factor of about 40 (151/4.0 = 37.75) slower than the system
with cache
• Minimizing average memory access time does not always imply a reduction in CPU time
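Both ways of computing the effective CPI, and the no-cache case, can be checked with a short sketch. `cpi_effective` and the variable names are illustrative assumptions; the inputs are the slide's.

```python
def cpi_effective(cpi_execution, mem_refs_per_inst, miss_rate, miss_penalty):
    """CPI including memory stall cycles per instruction."""
    return cpi_execution + mem_refs_per_inst * miss_rate * miss_penalty


# Via miss rate: 1.5 references/instruction, 2% miss rate, 100-cycle penalty.
with_cache = cpi_effective(1.0, 1.5, 0.02, 100)

# Via misses per instruction: 30 misses per 1000 instructions.
via_misses = 1.0 + (30 / 1000) * 100

# No cache: every memory reference pays the full penalty (miss rate = 1).
no_cache = cpi_effective(1.0, 1.5, 1.0, 100)

print(f"with cache: {with_cache:.2f}, no cache: {no_cache:.0f}, "
      f"slowdown without cache: {no_cache / with_cache:.2f}x")
```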
How to Improve Cache Performance?
$$AMAT = HitTime + MissRate \times MissPenalty$$
Four main categories of optimizations
1. Reducing miss penalty
- multilevel caches, critical word first, read miss before write
miss, merging write buffers and victim caches
2. Reducing miss rate
- larger block size, larger cache size, higher associativity, way
prediction and pseudoassociativity, and compiler optimizations
3. Reducing the miss penalty or miss rate via parallelism
- non-blocking caches, hardware prefetching and compiler
prefetching
4. Reducing the time to hit in the cache
- small and simple caches, avoiding address translation,
pipelined cache access
Where do misses come from?
• Classifying Misses: 3 Cs
– Compulsory: The first access to a block is not in the cache,
so the block must be brought into the cache. Also called cold
start misses or first reference misses.
(Misses in even an Infinite Cache)
– Capacity: If the cache cannot contain all the blocks needed
during execution of a program, capacity misses will occur due to
blocks being discarded and later retrieved.
(Misses in Fully Associative Size X Cache)
– Conflict: If the block-placement strategy is set associative or
direct mapped, conflict misses (in addition to compulsory &
capacity misses) will occur because a block can be discarded and
later retrieved if too many blocks map to its set. Also called
collision misses or interference misses.
(Misses in N-way Associative, Size X Cache)
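One common way to apply these definitions is to run the cache under study alongside a fully associative cache of the same size. A miss is compulsory if the block was never referenced before, capacity if the fully associative cache also misses, and conflict otherwise. The sketch below assumes LRU replacement in both caches; `classify_misses` and the traces are illustrative, not taken from the lecture.

```python
from collections import OrderedDict


def classify_misses(trace, num_blocks, ways):
    """Count compulsory, capacity and conflict misses for an LRU cache."""
    num_sets = num_blocks // ways
    sets = [OrderedDict() for _ in range(num_sets)]  # the cache under study
    full = OrderedDict()  # fully associative LRU reference cache, same size
    seen = set()          # every block referenced so far
    counts = {"compulsory": 0, "capacity": 0, "conflict": 0}
    for block in trace:
        # Update the fully associative reference cache.
        full_hit = block in full
        if full_hit:
            full.move_to_end(block)
        else:
            if len(full) == num_blocks:
                full.popitem(last=False)
            full[block] = True
        # Update the cache under study and classify any miss.
        s = sets[block % num_sets]
        if block in s:
            s.move_to_end(block)
        else:
            if block not in seen:
                counts["compulsory"] += 1
            elif not full_hit:
                counts["capacity"] += 1
            else:
                counts["conflict"] += 1
            if len(s) == ways:
                s.popitem(last=False)
            s[block] = True
        seen.add(block)
    return counts


# Blocks 0 and 4 collide in a 4-block direct-mapped cache.
print(classify_misses([0, 4, 0, 4], num_blocks=4, ways=1))
# The same trace in a fully associative cache of the same size.
print(classify_misses([0, 4, 0, 4], num_blocks=4, ways=4))
```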
3Cs Absolute Miss Rate (SPEC92)
[Figure: miss rate per type (0 to 0.14) vs. cache size (1 KB to 128 KB) for 1-way, 2-way, 4-way, and 8-way associativity, split into conflict, capacity, and compulsory misses.]
Cache Organization?
• Assume total cache size not changed:
• What happens if:
1) Change Block Size:
2) Change Associativity:
3) Change Compiler:
Which of 3Cs is obviously affected?
Increasing Block Size
• Larger block sizes reduce compulsory misses
– Takes advantage of spatial locality
• Larger blocks increase miss penalty
– Our goal: reduce miss rate and miss penalty!
• Block size selection depends on latency and
bandwidth of lower level memory
– High latency & high BW => large block sizes
– Low latency & low BW => smaller block sizes
Larger Block Size (fixed size & assoc)
[Figure: miss rate (0% to 25%) vs. block size (16 to 256 bytes), one curve per cache size (1K, 4K, 16K, 64K, 256K). Annotations: reduced compulsory misses as blocks grow; increased conflict misses at the largest block sizes.]
What else drives up block size?
Cache Size
[Figure: miss rate per type (0 to 0.14) vs. cache size (1 KB to 128 KB) for 1-way, 2-way, 4-way, and 8-way associativity, with capacity and compulsory components marked.]
• Old rule of thumb: 2x size => 25% cut in miss rate
• What does it reduce?
Increase Associativity
• Higher set-associativity improves miss rates
• Rule of thumb for caches
– A direct mapped cache of size N has about the same
miss rate as a two-way, set-associative cache of N/2
– Holds for cache sizes < 128 KB
• Disadvantages
– Reduces miss rate but increases miss penalty
– Greater associativity can come at the cost of increased
hit time.
Associativity
[Figure: miss rate per type (0 to 0.14) vs. cache size (1 KB to 128 KB) for 1-way, 2-way, 4-way, and 8-way associativity, split into conflict, capacity, and compulsory misses.]
3Cs Relative Miss Rate
[Figure: relative miss rate per type (0% to 100%) vs. cache size (1 KB to 128 KB) for 1-way, 2-way, 4-way, and 8-way associativity, split into conflict, capacity, and compulsory misses.]
• Flaws: for fixed block size
• Good: insight => invention
Associativity vs Cycle Time
• Beware: Execution time is the only final measure!
• Why is cycle time tied to hit time?
• Will Clock Cycle time increase?
Next Time
• Cache Tradeoffs for Performance
• Reducing Hit Times