Transcript lec6-1

CS 152
Computer Architecture and Engineering
Lecture 11 -- Cache II
2014-2-25
John Lazzaro
(not a prof - “John” is always OK)
TA: Eric Love
www-inst.eecs.berkeley.edu/~cs152/
CS 152 L11: Cache II
UC Regents Spring 2014 © UCB
Today: Caches, Part Two ...
Locality: Why caches work.
Cache misses and performance:
How do we size the cache?
Short Break
Practical cache design:
A state machine and a controller.
Write buffers and caches.
Victim caches and pre-fetch buffers.
Also: IBM mainframe cache
Why We Use Caches
Programs with locality cache well ...
[Figure: Memory Address (one dot per access) vs. Time, with regions labeled Temporal Locality, Spatial Locality, and Bad.]
Q. Point out bad locality behavior ...
Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)
The caching algorithm in one slide
Temporal locality: Keep most recently
accessed data closer to processor.
Spatial locality: Move contiguous blocks
in the address space to upper levels.
Caching terminology
Hit: Data appears in upper level block (ex: Blk X)
Hit Rate: The fraction of memory accesses found in upper level.
Hit Time: Time to access upper level. Includes hit/miss check.
Miss: Data retrieval from lower level needed (ex: Blk Y)
Miss Rate: 1 - Hit Rate
Miss Penalty: Time to replace block in upper level + deliver to CPU
Hit Time << Miss Penalty
Cache Design Example
CPU address space: An array of “blocks”
32-bit Memory Address: bits 31..5 = Which block? (27 bits); bits 4..0 = Byte # (5 bits)
32-byte blocks
Block #: 0, 1, 2, 3, 4, 5, 6, 7, ..., 2^27 - 1
The job of a cache is to hold a “popular” subset of blocks.
One Approach: Fully Associative Cache
Ideal, but expensive ...
Address: Cache Tag (27 bits), bits 31..5; Byte Select, bits 4..0 (Ex: 0x04)
Holds 4 blocks
Each line: Valid Bit, Block # (“Tags”, bits 26..0), Cache Data (Byte 31 ... Byte 1, Byte 0)
One comparator (=) per line; any match is a Hit
Return byte(s) of “hit” cache line
Building a cache with one comparator
Blocks of a certain color may only appear in one line of the cache.
32-bit Memory Address: bits 31..7 = Which block? (25 bits); bits 6..5 = Color (2 bits), the cache index; bits 4..0 = Byte # (5 bits)
32-byte blocks
Block #: 0, 1, 2, 3, 4, 5, 6, 7, ..., 2^27 - 1
Example: A Direct Mapped Cache
Address: Cache Tag (25 bits), bits 31..7; Index, bits 6..5 (Ex: 0x01); Byte Select, bits 4..0 (Ex: 0x00)
Each line: Valid Bit, Cache Tag (bits 24..0), Cache Data (Byte 31 ... Byte 1, Byte 0)
The index selects one line; a single comparator (=) on its tag produces Hit
Return byte(s) of a “hit” cache line
PowerPC 970: 64K direct-mapped Level-1 I-cache
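The direct-mapped lookup above can be sketched in a few lines of Python. This is a minimal model, assuming the slide's geometry (4 lines, 32-byte blocks, so 5 byte-select bits and 2 index bits); the names `split_address` and `DirectMappedCache` are illustrative and not from the lecture, and only tags are tracked, not data.

```python
BLOCK_BYTES = 32   # 5 byte-select bits (slide geometry)
NUM_LINES = 4      # 2 index ("color") bits (slide geometry)

def split_address(addr):
    """Split a 32-bit address into (tag, index, byte_select)."""
    byte_select = addr & (BLOCK_BYTES - 1)   # bits 4..0
    index = (addr >> 5) & (NUM_LINES - 1)    # bits 6..5
    tag = addr >> 7                          # bits 31..7
    return tag, index, byte_select

class DirectMappedCache:
    """Hypothetical tag-only model: one valid bit and one tag per line."""
    def __init__(self):
        self.valid = [False] * NUM_LINES
        self.tags = [0] * NUM_LINES

    def access(self, addr):
        """Return True on a hit; on a miss, install the block and return False."""
        tag, index, _ = split_address(addr)
        hit = self.valid[index] and self.tags[index] == tag
        if not hit:
            self.valid[index] = True
            self.tags[index] = tag
        return hit

cache = DirectMappedCache()
print(split_address(0x20))   # index 0x01, byte select 0x00, as in the slide example
print([cache.access(a) for a in (0x20, 0x3F, 0xA0, 0x20)])
```

Addresses 0x20 and 0xA0 share index 1 but differ in tag, so in this sketch they evict each other: the last access to 0x20 misses even though the cache is mostly empty.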
The limits of direct-mapped caches ...
[Figure: Memory Address (one dot per access) vs. Time]
What if both regions have same block color?
Donald J. Hatfield, Jeanette Gerald: Program
Restructuring for Virtual Memory. IBM Systems
Journal 10(3): 168-192 (1971)
Hybrid Design: Set Associative Cache
“N-way” set associative -- N is number of blocks for each color
Address: Cache Tag (26 bits); Index (2 bits), Ex: 0x01; Byte Select (4 bits)
Two ways (Left, Right), each with Valid bits, Cache Tags, and Cache Data in 16-byte Cache Blocks
One comparator (=) per way: Hit Left, Hit Right
Return bytes of “hit” set member
Cache block halved to keep # of cached bits constant
PowerPC 970: 32K 2-way
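A small Python sketch of the 2-way design, assuming the slide's geometry (4 sets, 16-byte blocks) and LRU replacement within each set. The replacement policy and the class name `SetAssociativeCache` are assumptions made for illustration; the slide does not specify them.

```python
BLOCK_BYTES = 16   # 4 byte-select bits (slide geometry)
NUM_SETS = 4       # 2 index bits (slide geometry)
WAYS = 2           # blocks per color

class SetAssociativeCache:
    """Hypothetical tag-only model; each set lists its tags, LRU first."""
    def __init__(self):
        self.sets = [[] for _ in range(NUM_SETS)]

    def access(self, addr):
        """Return True on a hit; on a miss, install the block (evicting LRU)."""
        block = addr // BLOCK_BYTES
        index = block % NUM_SETS
        tag = block // NUM_SETS
        ways = self.sets[index]
        if tag in ways:
            ways.remove(tag)       # move to most-recently-used position
            ways.append(tag)
            return True
        if len(ways) == WAYS:
            ways.pop(0)            # evict the least recently used tag
        ways.append(tag)
        return False

cache = SetAssociativeCache()
trace = [0x10, 0x50] * 4           # two blocks with the same color (index 1)
print(sum(cache.access(a) for a in trace), "hits out of", len(trace))
```

Two blocks of the same color now coexist, so only the first access to each misses. A third same-colored block (for example 0x90) would still force an eviction.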
The benefits of set-associativity ...
[Figure: Memory Address (one dot per access) vs. Time]
What if both regions have same block color?
Q. What costs (over direct mapped) for this benefit?
Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)
Recall: Branch predictor (direct-mapped)
Address of BNEZ instruction: 0b011[..]010[..]100 (18 bits, 12 bits)
BNEZ R1 Loop
4096 BTB/BHT entries
Branch Target Buffer (BTB): 18-bit address tag (0b011[...]01), compared (=) to produce Hit; target address (PC + 4 + Loop) is the “Taken” Address
Branch History Table (BHT): “Taken” or “Not Taken”
With 4096 colors, odds are low 2 active branches have the same color.
If branches “clash”, they take turns kicking each other out.
Key ideas about caches ...
Program locality is why building
a memory hierarchy makes sense
Latency toolkit: hierarchy design,
bit-wise parallelism, pipelining.
In practice: how many rows, how
many columns, how many arrays.
Cache operation: compare tags,
detect hits, select bytes.
Cache Misses & Performance
Recall: Caching terminology
Hit: Data appears in upper level block (ex: Blk X)
Hit Rate: The fraction of memory accesses found in upper level.
Hit Time: Time to access upper level. Includes hit/miss check.
Miss: Data retrieval from lower level needed (ex: Blk Y)
Miss Rate: 1 - Hit Rate
Miss Penalty: Time to replace block in upper level + deliver to CPU
Hit Time << Miss Penalty
Recall: The Performance Equation
Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)
We need all three terms, and only these terms, to compute CPU Time!
“CPI” -- The Average Number of Clock Cycles Per Instruction For the Program
What factors make different programs have different CPIs?
Instruction mix varies. Cache behavior varies. Branch prediction varies.
Recall: CPI as a tool to guide design
Program Instruction Mix: where program spends its time
Machine CPI (throughput, not latency) =
(5 x 30 + 1 x 20 + 2 x 20 + 2 x 10 + 2 x 20) / 100 = 2.7 cycles/instruction
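The CPI arithmetic on this slide is a weighted average, sketched below in Python. The (cycles, percent) pairs are taken from the slide's expression; the function name `average_cpi` is illustrative, and the instruction class behind each pair is not given in the transcript.

```python
def average_cpi(mix):
    """mix: list of (cycles_per_instruction, percent_of_instructions) pairs."""
    total_percent = sum(percent for _, percent in mix)
    weighted_cycles = sum(cycles * percent for cycles, percent in mix)
    return weighted_cycles / total_percent

# (cycles, percent) pairs copied from the slide's expression
MIX = [(5, 30), (1, 20), (2, 20), (2, 10), (2, 20)]
print(average_cpi(MIX))   # 270 / 100 = 2.7 cycles/instruction
```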
AMAT: Average Memory Access Time
Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)
Last slide computed it ... Machine CPI
Last slide assumes constant memory access time.
True CPI depends on the Average Memory Access Time (AMAT) for Inst & Data
AMAT = Hit Time + (Miss Rate x Miss Penalty)
True CPI = Ideal CPI + Memory Stall Cycles. See Appendix B.2 of CA-AQA for details.
Goal: Reduce AMAT
Beware! Improving one term may hurt other terms, and increase AMAT!
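A short Python sketch of the AMAT formula and the “Beware!” point. All numbers below are made up for illustration (they are not from the lecture): a design change that lowers the miss rate but doubles the hit time ends up with a worse AMAT.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """AMAT = Hit Time + (Miss Rate x Miss Penalty), all times in cycles."""
    return hit_time + miss_rate * miss_penalty

# Hypothetical numbers, chosen only to illustrate the trade-off.
baseline = amat(hit_time=1, miss_rate=0.05, miss_penalty=100)
bigger_cache = amat(hit_time=2, miss_rate=0.045, miss_penalty=100)
print(baseline, bigger_cache)   # the lower miss rate does not pay for the slower hit
```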
One type of cache miss: Conflict Miss
N blocks of same color in use at once, but cache can only hold M < N of them
Solution: Increase M (Associativity)
[Figure: Miss Rate vs. Cache Size (KB), with curves up to fully-associative. Miss rate improvement equivalent to doubling cache size.]
If hit time increases, AMAT may go up!
AMAT = Hit Time + (Miss Rate x Miss Penalty)
Other Solutions
Increase number of cache lines (# blocks in cache)
Q. Why does this help?
A. Reduce odds of a conflict.
Add a small “victim cache” that holds blocks recently removed from the cache.
More on victim caches soon ...
Other causes of cache misses ...
Capacity Misses: Cache cannot contain all blocks accessed by the program
Solution: Increase size of the cache
Compulsory Misses: First access of a block by a program
Mostly unavoidable
Solution: Prefetch blocks (via hardware, software)
[Figures: Miss rates (relative) and Miss rates (absolute) vs. Cache Size (KB)]
Also “Coherency Misses”: other processes update memory
Thinking about cache miss types ...
What kind of misses happen in a
fully associative cache of infinite size?
A. Compulsory misses. Must bring
each block into cache.
In addition, what kind of misses happen in
a finite-sized fully associative cache?
A. Capacity misses. Program may
use more blocks than can fit in cache.
In addition, what kind of misses happen
in a set-associative or direct-mapped cache?
A. Conflict misses.
(all questions assume the replacement policy used is considered “optimal”)
Separate instruction and data caches?
Compare 2^k separate I & D to 2^(k+1) unified ... arrows mark crossover.
[Figure: Misses per 1000 instructions]
Figure B.6 from CA-AQA. Data for a 2-way set associative cache with 64-byte blocks for DEC Alpha.
Note: The extraordinary effectiveness of large instruction caches ...
Break
Practical Cache Design
time machine back to FPGA-oriented 2006 CS 152 ...
Cache Design: Datapath + Control
Datapath for performance, control for correctness.
Most design errors come from incorrect specification
of state machine behavior!
[Figure: a State Machine drives Control signals to the Tags and Blocks arrays; Addr, Din, and Dout connect the cache To CPU and To Lower Level Memory.]
Red text will highlight state machine requirements ...
Recall: State Machine Design ...
[Figure: state machine with states RYG = 100, 001, 010. Rst == 1 selects RYG = 100; each Change == 1 moves to the next state.]
Cache controller state machines are like this, but with more states, and perhaps several connected machines ...
Issue #1: Control for CPU interface ....
[Figure: CPU interface signals From CPU and To CPU; memories labeled “Small, fast” and “Large, slow”.]
For reads, your state machine must:
(1) sense REQ
(2) latch Addr
(3) create Wait
(4) put Data Out on the bus.
An example interface ... there are other possibilities.
Issue #2: Cache Block Replacement
After a cache read miss,
if there are no empty cache blocks,
which block should be removed
from the cache?
The Least Recently Used (LRU) block? Appealing, but hard to implement.
A randomly chosen block? Easy to implement, how well does it work?
Miss Rate for 2-way Set Associative Cache
Size      Random   LRU
16 KB     5.7%     5.2%
64 KB     2.0%     1.9%
256 KB    1.17%    1.15%
Part of your state machine decides which block to replace.
Also, try other LRU approx.
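The two replacement choices can be compared on a toy trace. The sketch below models a single fully-associative set of block numbers rather than the 2-way caches measured in the table, and the helper names `misses_lru` and `misses_random` are illustrative, not from the lecture.

```python
import random
from collections import OrderedDict

def misses_lru(trace, capacity):
    """Count misses with LRU replacement (OrderedDict keeps LRU block first)."""
    cache = OrderedDict()
    misses = 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)          # mark as most recently used
        else:
            misses += 1
            if len(cache) == capacity:
                cache.popitem(last=False)     # evict least recently used
            cache[block] = True
    return misses

def misses_random(trace, capacity, seed=0):
    """Count misses when the victim is chosen at random (seeded for repeatability)."""
    rng = random.Random(seed)
    cache = []
    misses = 0
    for block in trace:
        if block not in cache:
            misses += 1
            if len(cache) == capacity:
                cache[rng.randrange(capacity)] = block
            else:
                cache.append(block)
    return misses

trace = [1, 2, 3, 1, 2, 4, 1, 2]
print(misses_lru(trace, 3), misses_random(trace, 3))
```

On a cyclic trace one block larger than the cache (for example `[1, 2, 3, 4] * 3` with capacity 3), LRU in this sketch misses on every access, which is one reason random replacement is not as bad as it sounds.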
Issue #3: High performance block fetch
With proper memory layout, one row access delivers entire cache block to the sense amp.
Two state machine challenges:
(1) Bring in the word requested by CPU with lowest latency
(2) Bring in rest of cache block ASAP
[Figure: DRAM array. 12-bit row address input feeds a 1 of 4096 decoder; 4096 rows x 2048 columns, each column 4 bits deep; 33,554,432 usable bits (tester found good bits in bigger array); 8192 bits delivered by sense amps; select requested bits, send off the ...]
Issue #3 (continued): DRAM Burst Reads
One request ...
DRAM can be set up to request an N byte region
starting at an arbitrary N+k within region
Many returns ...
State machine challenges: (1) setting up correct block
read mode (2) delivering correct word direct to CPU
(3) putting all words in cache in right place.
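One way to picture “requested word first, rest of the block after” is a wrapping burst order. The sketch below assumes a simple sequential wrap within an aligned block; real DRAM parts also offer other burst orderings, and the function name `burst_order` is illustrative.

```python
def burst_order(requested_word, words_per_block):
    """Word addresses in delivery order: requested word first, then wrap
    around within the aligned block (sequential wrap is an assumption)."""
    offset = requested_word % words_per_block
    base = requested_word - offset            # first word of the aligned block
    return [base + (offset + i) % words_per_block
            for i in range(words_per_block)]

print(burst_order(6, 4))   # [6, 7, 4, 5]
```

The state machine forwards the first returned word straight to the CPU, and uses each word's position in the order to place it correctly in the cache line.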
Writes and Caches
Issue #4: When to write to lower level ...
Policy: Write-Through -- Data written to cache block also written to lower-level memory
Policy: Write-Back -- Write data only to the cache. Update lower level when a block falls out of the cache
                                            Write-Through   Write-Back
Do read misses produce writes?              No              Yes
Do repeated writes make it to lower level?  Yes             No
Related issue: Do writes to blocks not in the cache get put in the cache (“write-allocate”) or not?
State machine design: (1) Write-back puts most write logic in cache-miss machine. (2) Write-through isolates writing in its own state machine.
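The two rows of the policy table can be checked with a toy model. The sketch below assumes a cache that holds a single block and allocates on writes; both are simplifications chosen for brevity, and `lower_level_writes` is an illustrative name.

```python
def lower_level_writes(trace, policy):
    """Count writes reaching the lower level.
    trace: list of (op, block) with op 'R' or 'W'.
    policy: 'write-through' or 'write-back'.
    Assumes a one-block cache with write-allocate (illustrative only)."""
    cached, dirty, writes = None, False, 0
    for op, block in trace:
        if block != cached:                    # miss: replace the resident block
            if policy == "write-back" and dirty:
                writes += 1                    # dirty block falls out of the cache
            cached, dirty = block, False
        if op == "W":
            if policy == "write-through":
                writes += 1                    # every write goes to the lower level
            else:
                dirty = True                   # defer until the block is evicted
    return writes

trace = [("W", "A"), ("W", "A"), ("W", "A"), ("R", "B")]
print(lower_level_writes(trace, "write-through"))   # 3
print(lower_level_writes(trace, "write-back"))      # 1
```

With write-back, the three repeated writes collapse into one, and that one write is triggered by the read miss on block B.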
Issue #5: Write-back DRAM Burst Writes
One command ... Many bytes written
State machine challenges: (1) putting cache block into
correct location (2) what if a read or write wants to use
DRAM before the burst is complete? Must stall ...
If we choose write-through ...
Write-Through Policy: Data written to cache block also written to lower-level memory
Do read misses produce writes? No
Do repeated writes make it to lower level? Yes
State machine design issue: handling writes without
stalling the machine until the written word is safely
in the lower level (DRAM)
Issue #6: Avoid write-through write stalls
Solution: add a “write buffer” to cache datapath
[Figure: Processor, Cache, Write Buffer, Lower Level Memory]
Write Buffer: Holds data awaiting write-through to lower level memory
Q. Why a write buffer? A. So CPU doesn’t stall.
Q. Why a buffer, why not just one register? A. Bursts of writes are common.
Q. Are Read After Write (RAW) hazards an issue for write buffer? A. Yes! Drain buffer before next read, or check write buffers.
On reads, state machine checks cache and write buffer -- what if word was removed from cache before lower-level write?
On writes, state machine stalls for full write buffer, ...
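A minimal Python sketch of the “check write buffers” option for RAW hazards. The cache itself is left out so that only the buffer and the lower level are modeled; the class `WriteBuffer`, the function `load_word`, and the default capacity of 4 are all assumptions for illustration.

```python
from collections import deque

class WriteBuffer:
    """Hypothetical FIFO of pending (addr, data) writes, oldest first."""
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.fifo = deque()

    def is_full(self):
        return len(self.fifo) == self.capacity

    def push(self, addr, data):
        if self.is_full():
            raise RuntimeError("write buffer full: CPU must stall")
        self.fifo.append((addr, data))

    def lookup(self, addr):
        """Return the newest pending data for addr, or None if absent."""
        for pending_addr, data in reversed(self.fifo):
            if pending_addr == addr:
                return data
        return None

    def drain_one(self, memory):
        """Retire the oldest pending write to the lower level."""
        if self.fifo:
            addr, data = self.fifo.popleft()
            memory[addr] = data

def load_word(addr, buffer, memory):
    """Read that checks the write buffer before the lower level."""
    pending = buffer.lookup(addr)
    return pending if pending is not None else memory[addr]

memory = {0x100: 1}
buf = WriteBuffer()
buf.push(0x100, 2)
print(load_word(0x100, buf, memory), memory[0x100])   # 2 1
```

The read returns the buffered value 2 even though the lower level still holds the stale 1, which is exactly the hazard the slide warns about.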
Write buffer logic for a LW instruction
[Figure: Processor, Cache, Write Buffer, Lower Level Memory]
Cache state machines must be designed so that this algorithm
always yields correct memory semantics.
Write buffer logic for a SW instruction
[Figure: Processor, Cache, Write Buffer, Lower Level Memory]
LW + SW require complex state machine logic ... plus, state
machine needs to manage two buses ... plus the write buffer
FIFO logic ...
Issue #7: Optimizing the hit time ...
Hit time is directly tied
to clock rate of CPU.
If left unchecked, it
increases when cache
size and associativity
increase.
Note that XScale
pipelines both
instruction and data
caches, adding stages
to the CPU pipeline.
State machine design issue: pipelining cache control!
Common bug: When to write to cache?
A1. If no write-allocate ... when address is already in the cache.
Issue: Must check tag before writing, or else
may overwrite the wrong address!
Options: Stall and do tag check, or pipeline check.
A2. Always: we do allocate on write.
Key ideas about caches ...
The cache design spectrum: from
direct mapped to fully associative.
AMAT (Ave. Memory Access Time) =
Hit Time + (Miss Rate x Miss Penalty)
Cache misses: conflict, capacity,
compulsory, and coherency.
Cache design bugs are usually
from cache specification errors.
Influential memory systems paper
by Norm Jouppi
(then at DEC, now at Google)
1990
We look at it in depth ...
The baseline memory system ... that enhancements are trying to improve.
In 1990, one thousand 16Mb DRAM chips!
Small software benchmark suite:
How memory system falls short:
[Figure: Actual performance; green line shows how a system with perfect caching performs.]
First idea: Miss cache
[Figure: L1 Cache with “Miss Cache”]
Miss Cache: Small. Fully-associative. LRU replacement.
Checked in parallel with the L1 cache.
If L1 and miss cache both miss, the data block returned by
the next-lower cache is placed in L1 and miss cache.
Second idea: Victim cache
[Figure: L1 Cache with “Victim Cache”]
Victim Cache: Small. Fully-associative. LRU replacement.
Checked in parallel with the L1 cache.
If L1 and victim cache both miss, the data block removed
from the L1 cache is placed into the victim cache.
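A toy model of the victim cache behavior, assuming a direct-mapped L1 indexed by block number and a small LRU victim cache. On a victim-cache hit this sketch swaps the two blocks; that swap, the sizes, and the name `VictimCachedL1` are assumptions for illustration, since the slide only describes the both-miss case.

```python
from collections import OrderedDict

class VictimCachedL1:
    """Hypothetical direct-mapped L1 plus a small fully-associative victim cache."""
    def __init__(self, num_lines=4, victim_entries=2):
        self.lines = [None] * num_lines     # block number held by each L1 line
        self.victims = OrderedDict()        # evicted blocks, LRU first
        self.victim_entries = victim_entries

    def access(self, block):
        index = block % len(self.lines)
        if self.lines[index] == block:
            return "L1 hit"
        evicted = self.lines[index]
        self.lines[index] = block           # requested block moves into L1
        if block in self.victims:
            del self.victims[block]         # assumed swap: block leaves victim cache
            result = "victim hit"
        else:
            result = "miss"                 # block comes from the next-lower level
        if evicted is not None:
            self.victims[evicted] = True    # block removed from L1 becomes a victim
            if len(self.victims) > self.victim_entries:
                self.victims.popitem(last=False)
        return result

cache = VictimCachedL1()
print([cache.access(b) for b in (1, 5, 1, 5)])
# ['miss', 'miss', 'victim hit', 'victim hit']
```

Blocks 1 and 5 share an L1 line, so a plain direct-mapped cache would miss on all four accesses; here the later conflict misses are caught by the victim cache.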
[Figure: % of conflict misses removed, plotted vs number of {miss, victim} cache entries. Panels: Miss Cache, Victim Cache.]
Each symbol a benchmark. {Solid, dashed} line is L1 {I, D}.
Third idea: Streaming prefetch buffer (FIFO)
[Figure: L1 Cache with Prefetch FIFO]
Prefetch buffer: Small FIFO of cache lines and tags.
Check head of FIFO in parallel with L1 cache access.
If both miss, fetch missed block “k” for L1. Clear FIFO.
Prefetch blocks “k + 1”, “k + 2”, ... and place in FIFO tail.
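The FIFO behavior described above can be sketched as follows. Only block numbers are tracked, prefetches complete instantly, and on a head hit the sketch tops the FIFO back up with the next sequential block; that refill rule, the depth, and the name `StreamBuffer` are assumptions for illustration.

```python
from collections import deque

class StreamBuffer:
    """Hypothetical single-way stream buffer holding prefetched block numbers."""
    def __init__(self, depth=4):
        self.depth = depth
        self.fifo = deque()

    def on_l1_miss(self, block):
        """Called when L1 misses on `block`.
        Returns True if the head of the FIFO supplies the block."""
        if self.fifo and self.fifo[0] == block:
            self.fifo.popleft()
            # Assumed refill: prefetch the block after the current tail.
            tail = self.fifo[-1] if self.fifo else block
            self.fifo.append(tail + 1)
            return True
        # Both missed: clear the FIFO and start prefetching k+1, k+2, ...
        self.fifo.clear()
        self.fifo.extend(block + i for i in range(1, self.depth + 1))
        return False

sb = StreamBuffer(depth=2)
print([sb.on_l1_miss(b) for b in (10, 11, 12, 20)])   # [False, True, True, False]
```

A sequential walk hits the FIFO head after the first miss, while the jump to block 20 restarts the stream.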
Fourth idea: Multi-way streaming prefetch buffer
[Figure: L1 Cache with four FIFOs]
Multi-way buffer: 4-FIFO version of the original design.
Allows block streams for 4 misses to proceed in parallel.
If an access misses L1 and all FIFOs, clear LRU FIFO.
[Figure: % of all misses removed, plotted vs number of prefetches that follow a miss. Panels: Single-Way Buffer, Multi-Way Buffer.]
Each symbol a benchmark. {Solid, dashed} line is L1 {I, D}.
Complete memory system
[Figure: Baseline vs. enhanced system. Purple line: Performance of enhanced system. Green line: Shows how a system with perfect caching performs.]
4-way data victim cache + 4-way data stream buffer
Also: Instruction stream buffer.
Original Captive Model
IBM owns fabs, and designs CPU chips, for use in its server products.
IBM’s most important server line is the z-Series, compatible with the 360 mainframe architecture introduced in 1964.
CS 250 L1: Fab/Design Interface
UC Regents Fall 2013 © UCB
Who uses IBM mainframes? You do!
Imagine a TeleBears migration to a new platform ... that goes wrong ... and wipes out a semester!
IBM Mainframe: 32nm process, 2.75 billion transistors
6 cores, but ... most of the die is dedicated to the memory system
Most of power is spent in the 5.5 GHz cores ... and chips today are power limited ...
How to use 1.35 billion new transistors?
50% more cores + memory system scale-up ... use faster transistors to meet power budget ...
Photos shown to scale.
45 nm chip (2012): 5.2 GHz, 1.4B Transistors
32 nm chip (2014): 5.5 GHz, 2.75B Transistors
L2 operand hit time halved
How? Split instruction and data L2s simplified design.
Also: New L2-CPU floorplan + logic optimizations.
Branch predictor array sizes rival L1s ...
On Thursday
Virtual memory ...
Have fun in section!