Lecture 12: Caches and Memory
1
[Figure: "6T SRAM" cell: two small storage inverters (2T per inverter) plus 2 access gates connecting the cell to the bitlines b and b]
• Chained inverters maintain a stable state
• Access gates provide access to the cell
• Writing to a cell involves over-powering the two small storage inverters
Lecture 12: Caches and Memory
2
[Figure: SRAM array with a 1-of-8 decoder driving the "wordlines", "bitlines" running down each column, and a "column mux" (selected by another 1-of-8 decoder) picking the output. Question: why are we reading both b and b?]
Lecture 12: Caches and Memory
3
• 6T cell must be as small as possible to have dense storage
– Bigger caches
– Smaller transistors → slower transistors
The bitline is a *long* metal line with a lot of parasitic loading, so the dinky cell inverters cannot drive their outputs very quickly…
Lecture 12: Caches and Memory
4
• Type of differential amplifier
– Two inputs, amplifies the difference
[Figure: sense amplifier ("Diff Amp") with inputs X and Y attached to bitlines b and b, which are precharged to Vdd. When the wordline is enabled, the small cell discharges its bitline very slowly; the amplifier output is a × (X – Y) + Vbias, so the sense amp "sees" the difference quickly and outputs b's value]
Sometimes precharge bitlines to Vdd/2, which makes a bigger "delta" for faster sensing
Lecture 12: Caches and Memory
5
[Figure: multi-ported SRAM cell with wordlines Wordline1/Wordline2 and bitline pairs b1/b1 and b2/b2. Wordlines = 2 × ports, Bitlines = 4 × ports; since both the cell's height (wordlines) and width (bitlines) grow linearly with the port count, Area = O(ports²)]
Lecture 12: Caches and Memory
6
• ARF, PRF, RAT all need many read and write ports to support superscalar execution
– Luckily, these have a limited number of entries/bytes
• Caches also need multiple ports
– Not as many ports
– But the overall size is much larger
Lecture 12: Caches and Memory
7
• I$
– low port requirement (one fetch group/$-line per cycle)
– latency only exposed on branch mispredict
• D$
– higher port requirement (multiple LD/ST per cycle)
– latency often on critical path of execution
• L2
– lower port requirement (most accesses hit in L1)
– latency less important (only observed on L1 miss)
– optimizing for hit rate usually more important than latency
• difference between L2 latency and DRAM latency is large
Lecture 12: Caches and Memory
8
[Figure: a single big "4-ported" SRAM array with its decoders is slow due to quadratic area growth. Instead, the 4-ported L1 data cache is built from 4 banks with 1 port each, every bank having its own decoders, sense amps, and column muxing; each bank is much faster]
Lecture 12: Caches and Memory
9
• Banking provides high bandwidth
• But only if all accesses are to different banks
• Banks typically address interleaved
– For N banks
– Addr → bank[Addr % N]
• Addr on cache-line granularity (see the sketch below)
– For 4 banks, 2 accesses, chance of conflict is 25%
– Need to match # banks to access patterns/BW
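
A minimal C sketch of this address-interleaved bank mapping; the 64-byte line size and 4-bank count are assumptions for illustration, not values from the lecture:

    #include <stdint.h>

    #define LINE_BYTES 64   /* assumed cache-line size */
    #define NUM_BANKS   4   /* assumed number of banks */

    /* Map an address to a bank at cache-line granularity:
     * Addr -> bank[Addr % N], with N = NUM_BANKS. */
    static unsigned bank_of(uint64_t addr)
    {
        uint64_t line = addr / LINE_BYTES;      /* drop the line offset */
        return (unsigned)(line % NUM_BANKS);    /* interleave across banks */
    }

    /* Two simultaneous accesses conflict only if they hit the same bank. */
    static int bank_conflict(uint64_t a, uint64_t b)
    {
        return bank_of(a) == bank_of(b);
    }

With 4 banks and 2 independent accesses, the second access lands in the first one's bank with probability 1/4, which is the 25% conflict chance above.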
Lecture 12: Caches and Memory
10
• You should know this already
foo
foo
foo
foo
foo’s value
foo’s value
foo
foo’s value
set associative
CAM/RAM hybrid?
direct mapped
Lecture 12: Caches and Memory
RAM
CAM
fully associative
11
• Set-associativity good for reducing conflict misses
• Cost: slower cache access
– often dominated by the tag array comparisons
– Basically mini-CAM logic (see the sketch below)
• Must trade off:
– Smaller cache size
– Longer latency
– Lower associativity
[Figure: four parallel "=" comparators in the tag array, each doing a 40-50 bit comparison!]
• Every option hurts performance
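
A minimal C sketch of the per-way tag comparison in a set-associative lookup; the 4-way/64-set geometry is assumed for illustration, and hardware does the four wide compares in parallel, which is the mini-CAM cost above:

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_WAYS 4
    #define NUM_SETS 64                 /* assumed geometry */

    struct tag_set {
        uint64_t tag[NUM_WAYS];         /* ~40-50 bit tags in practice */
        bool     valid[NUM_WAYS];
    };

    static struct tag_set tag_array[NUM_SETS];

    /* Returns the hitting way, or -1 on a miss. */
    static int lookup_way(unsigned set_index, uint64_t tag)
    {
        for (int w = 0; w < NUM_WAYS; w++) {
            if (tag_array[set_index].valid[w] &&
                tag_array[set_index].tag[w] == tag)
                return w;               /* hit in way w */
        }
        return -1;                      /* miss */
    }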
Lecture 12: Caches and Memory
12
• If figuring out the way takes too long, then just guess!
[Figure: the load's PC indexes a small "Way Pred" table; the predicted way's payload is read out immediately, while the "=" tag checks still occur to validate the way prediction]
• May be hard to predict the way if the same load accesses different addresses (see the sketch below)
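
A minimal C sketch of a PC-indexed way predictor; the table size and indexing are assumptions for illustration:

    #include <stdint.h>

    #define PRED_ENTRIES 256            /* assumed predictor size */

    static uint8_t way_pred[PRED_ENTRIES];

    /* Guess the way from the load's PC, before the tag check finishes. */
    static unsigned predict_way(uint64_t load_pc)
    {
        return way_pred[(load_pc >> 2) % PRED_ENTRIES];
    }

    /* Once the full tag check resolves, train the predictor; on a
     * misprediction, the data read from the guessed way was wrong and
     * dependent instructions must be replayed. */
    static void train_way_pred(uint64_t load_pc, unsigned actual_way)
    {
        way_pred[(load_pc >> 2) % PRED_ENTRIES] = (uint8_t)actual_way;
    }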
Lecture 12: Caches and Memory
13
• Organize the data array s.t. the left-most way is the MRU
[Figure: ways ordered from MRU (left) to LRU (right). Way-predict the MRU way; as long as accesses keep hitting the MRU position, way-prediction continues to hit. On a way-miss (still a cache hit), move the block to the MRU position so way-prediction keeps hitting]
Complication: the data array needs a datapath for swapping blocks (maybe 100's of bits), whereas normally you just update a few LRU bits in the tag array (< 10 bits?); see the sketch below.
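
A toy C sketch of the block swap implied by keeping the MRU block in way 0; the 4-way, 64-byte-line geometry is assumed:

    #include <string.h>

    #define NUM_WAYS   4
    #define LINE_BYTES 64

    struct set_data {
        unsigned char line[NUM_WAYS][LINE_BYTES];
    };

    /* On a way-miss that is still a cache hit, move the hitting block
     * into way 0 (the MRU position) so the next "predict way 0" guess
     * hits. Note that whole cache lines move (100's of bits), versus
     * the few LRU bits a normal tag-array update would touch. */
    static void move_to_mru(struct set_data *s, int hit_way)
    {
        unsigned char tmp[LINE_BYTES];
        memcpy(tmp, s->line[hit_way], LINE_BYTES);
        memmove(&s->line[1], &s->line[0], (size_t)hit_way * LINE_BYTES);
        memcpy(s->line[0], tmp, LINE_BYTES);
    }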
Lecture 12: Caches and Memory
14
• Like BTBs, just use part of the tag
[Figure: comparing only partial tags makes the tag array lookup much faster, but partial tags lead to false hits: tag 0x45120001 looks like a hit for address 0x3B120001]
Similar to way-prediction, the full tag comparison is still needed to verify a "real" hit --- but it is not on the critical path.
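
A minimal C example of the false-hit case above; the 16-bit partial-tag width is an assumption for illustration:

    #include <stdint.h>
    #include <stdio.h>

    /* Fast check: compare only the low 16 bits of the tag. */
    static int partial_match(uint64_t stored_tag, uint64_t lookup_tag)
    {
        return (stored_tag & 0xFFFF) == (lookup_tag & 0xFFFF);
    }

    /* Slow check: the full comparison that verifies a "real" hit. */
    static int full_match(uint64_t stored_tag, uint64_t lookup_tag)
    {
        return stored_tag == lookup_tag;
    }

    int main(void)
    {
        uint64_t stored = 0x45120001, lookup = 0x3B120001;
        /* prints "partial: 1, full: 0": a false hit caught by the full check */
        printf("partial: %d, full: %d\n",
               partial_match(stored, lookup), full_match(stored, lookup));
        return 0;
    }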
Lecture 12: Caches and Memory
15
• Partial tagging can be used in the LSQ as well
[Figure: the LSQ address check is done on partial addresses only. On a partial hit, forward the data; the slower complete tag check then verifies the match/no-match, and the pipeline replays or flushes as needed. If a store finds a later partially-matched load, don't do the pipeline flush right away: the penalty is too severe, so wait for the slow check before flushing the pipe]
Lecture 12: Caches and Memory
16
• Bank conflicts, way-mispredictions, partial-tag false hits
– All change the latency of the load instruction
– Increases frequency of replays
• more “replay conditions” exist/encountered
– Need careful tradeoff between
• performance (reducing effective cache latency)
• performance (frequency of replaying instructions)
• power (frequency of replaying instructions)
Lecture 12: Caches and Memory
17
• More set-assoc needed when number of items mapping to same cache set > number of ways
• Not all sets suffer from high conflict rates
• Idea: provide a little extra associativity, but not for each and every set
Lecture 12: Caches and Memory
18
[Figure: access streams A, D, E, B, A, C, B, … and J, K, L, M, N, … to two hot sets (plus unrelated blocks X, Y, Z elsewhere). Five blocks (A through E, and J through N) do not "fit" in a 4-way set associative cache, so every access is a miss. Adding a small Victim Cache provides a "fifth way" so long as only four sets overflow into it at the same time; it can even provide 6th or 7th … ways. A sketch follows below]
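
A toy C sketch of the victim-cache probe on a set miss; the structures and the 4-entry victim size are assumptions for illustration:

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_WAYS   4
    #define VC_ENTRIES 4                /* assumed victim-cache size */

    struct line   { uint64_t tag; bool valid; };
    struct set    { struct line way[NUM_WAYS]; };
    struct victim { struct line entry[VC_ENTRIES]; };

    /* On a miss in the main set, probe the small fully-associative
     * victim cache. A victim hit swaps the line back into the set,
     * and the line being evicted from the set becomes the new victim. */
    static bool victim_lookup(struct set *s, struct victim *vc,
                              uint64_t tag, int evict_way)
    {
        for (int i = 0; i < VC_ENTRIES; i++) {
            if (vc->entry[i].valid && vc->entry[i].tag == tag) {
                struct line tmp = s->way[evict_way];
                s->way[evict_way] = vc->entry[i];   /* back into the set */
                vc->entry[i] = tmp;                 /* evicted line parked */
                return true;                        /* hit in the "fifth way" */
            }
        }
        return false;                               /* genuine miss */
    }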
Lecture 12: Caches and Memory
19
[Figure: the same blocks A, B, C, D, W, X, Y, Z mapped into a regular set-associative cache, where they pile into the same sets and cause lots of misses, versus a skewed-associative cache, where each way is indexed differently and there are fewer misses. A sketch of the per-way hashing follows below]
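
A minimal C sketch of skewed indexing: each way hashes the address differently, so blocks that conflict in one way usually do not conflict in the other. The hash functions and geometry are illustrative assumptions:

    #include <stdint.h>

    #define NUM_SETS 64                 /* assumed sets per way */

    /* Returns the set index to probe in the given way. */
    static unsigned index_for_way(uint64_t line_addr, int way)
    {
        if (way == 0)
            return (unsigned)(line_addr % NUM_SETS);
        /* way 1: XOR in higher address bits before indexing */
        return (unsigned)((line_addr ^ (line_addr >> 6)) % NUM_SETS);
    }

A lookup probes way 0 at index_for_way(addr, 0) and way 1 at index_for_way(addr, 1); two blocks that collide in way 0 are unlikely to also collide in way 1.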
Lecture 12: Caches and Memory
20
• Program stack needs very little associativity
– spatial locality
• stack frame is laid out sequentially
• function usually only refers to its own stack frame
[Figure: call stack with frames for f(), g(), h(), j(), k(); the addresses are laid out in a linear organization. The corresponding layout in a 4-way cache (MRU through LRU) shows the associativity not being used effectively]
Lecture 12: Caches and Memory
21
[Figure: "nice" stack accesses from f(), g(), h(), j(), k() go to a dedicated Stack Cache, while disorganized heap accesses (lots of conflicts!) go to the "regular" cache]
Lecture 12: Caches and Memory
22
• Stack cache portion can be a lot simpler due to direct-mapped structure
– relatively easily prefetched for by monitoring call/retn's
• "Regular" cache portion can have lower associativity
– doesn't have conflicts due to stack/heap interaction
Lecture 12: Caches and Memory
23
• Which cache does a load access?
– Many ISA’s have a “default” stack-pointer register
[Figure: LDQ 0[$sp], LDQ 12[$sp], and LDQ 24[$sp] are routed to the Stack Cache; LDQ 0[$t1] goes to the Regular Cache. After MOV $t3 = $sp, LDQ 8[$t3] also targets the stack but is routed to the Regular Cache (marked X)]
Need stack base and offset information, and then need to check each cache access against these bounds; wrong cache → replay. A sketch of the bounds check follows below.
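
A minimal C sketch of the routing check; the bounds-register names are assumptions for illustration:

    #include <stdbool.h>
    #include <stdint.h>

    /* Assumed bounds registers: the active stack region is
     * [stack_limit, stack_base). */
    static uint64_t stack_base;     /* highest stack address (exclusive) */
    static uint64_t stack_limit;    /* lowest stack address in use */

    /* true -> route the access to the stack cache,
     * false -> route it to the regular cache.
     * If the initial guess (e.g. "base register is $sp") disagrees
     * with this check, the load replays to the other cache. */
    static bool goes_to_stack_cache(uint64_t effective_addr)
    {
        return effective_addr >= stack_limit && effective_addr < stack_base;
    }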
Lecture 12: Caches and Memory
24
• Normal cache is "uni-lateral" in that everything goes into the same place
• Stack cache is an example of "multi-lateral" caches
– multiple cache structures with disjoint contents
– I$ vs. D$ could be considered multi-lateral
Lecture 12: Caches and Memory
25
• Stack cache showed how different loads exhibit different access patterns
[Figure: three access-pattern classes: Stack (multiple push/pop's of frames), Heap (heavily data-dependent access patterns), and Streaming (linear accesses with low/no reuse)]
Lecture 12: Caches and Memory
26
• Streaming
– once you're done decoding an MPEG frame, no need to revisit it
• Other

    struct tree_t {
        int valid;
        int other_fields[24];
        int num_children;
        struct tree_t * children;
    };

Fields map to different cache lines

    while(some condition) {
        struct tree_t * parent = getNextRoot(…);
        if(parent->valid) {
            doTreeTraversalStuff(parent);
            doMoreStuffToTree(parent);
            pickFruitFromTree(parent);
        }
    }

parent->valid is accessed once, and then not used again
Lecture 12: Caches and Memory
27
• Several proposed variations
– annex cache, pollution control cache, etc.
[Figure: a small Filter Cache sits alongside the Main Cache. First-time misses are placed (filled on miss) in the filter cache; if a line is accessed again, it is promoted to the main cache, and if not, it is eventually LRU'd out. The main cache therefore only contains lines with proven reuse: one-time-use lines have been filtered out. This can be thought of as the "dual" of the victim cache. A sketch follows below]
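
A toy C sketch of the fill/promote policy; the fully-associative arrays, sizes, and round-robin replacement are simplifying assumptions:

    #include <stdbool.h>
    #include <stdint.h>

    #define FILTER_LINES 16             /* assumed sizes */
    #define MAIN_LINES   64

    static uint64_t filter_tag[FILTER_LINES]; static bool filter_v[FILTER_LINES];
    static uint64_t main_tag[MAIN_LINES];     static bool main_v[MAIN_LINES];

    static int find(const uint64_t *tags, const bool *v, int n, uint64_t t)
    {
        for (int i = 0; i < n; i++)
            if (v[i] && tags[i] == t) return i;
        return -1;
    }

    static void insert(uint64_t *tags, bool *v, int n, uint64_t t)
    {
        static unsigned rr;             /* round-robin stands in for LRU */
        int i = (int)(rr++ % (unsigned)n);
        tags[i] = t; v[i] = true;
    }

    /* First-time misses fill only the small filter cache; a second
     * access ("proven reuse") promotes the line into the main cache. */
    void cache_access(uint64_t line)
    {
        if (find(main_tag, main_v, MAIN_LINES, line) >= 0)
            return;                                     /* main-cache hit */

        int f = find(filter_tag, filter_v, FILTER_LINES, line);
        if (f >= 0) {
            filter_v[f] = false;                        /* leave the filter */
            insert(main_tag, main_v, MAIN_LINES, line); /* promote */
            return;
        }
        insert(filter_tag, filter_v, FILTER_LINES, line); /* first-time miss */
    }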
Lecture 12: Caches and Memory
28
• More complexity
– load may need to be routed to different places
• may require some form of prediction to pick the right one
– guessing wrong can cause replays
• or accessing multiple in parallel increases power
– no bandwidth benefit
– more sources to bypass from
• costs both latency and power in bypass network
Lecture 12: Caches and Memory
29
• What if memory latency is 10000 cycles?
– Not enough traditional ILP to cover this latency
– Runtime dominated by waiting for memory
– What matters is overlapping memory accesses
• MLP: "number of outstanding cache misses [to main memory] that can be generated and executed in an overlapped manner." (illustrated in the sketch after this list)
• ILP is a property of a DFG; MLP is a metric
– ILP is independent of the underlying execution engine
– MLP is dependent on the microarchitecture assumptions
– You can measure MLP for uniprocessor, CMP, etc.
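
A small C illustration (not from the lecture) of one factor behind MLP: given sufficient hardware support, the code's dependence structure limits how many misses can overlap. Both loops perform n loads, but the first lets many misses be outstanding at once (high MLP) while the second serializes them (MLP close to 1):

    #include <stddef.h>

    long sum_array(const long *a, const size_t *idx, size_t n)
    {
        long sum = 0;
        /* independent loads: many misses can be in flight in parallel */
        for (size_t i = 0; i < n; i++)
            sum += a[idx[i]];
        return sum;
    }

    struct node { struct node *next; long val; };

    long sum_list(const struct node *p)
    {
        long sum = 0;
        /* each load's address depends on the previous miss: no overlap */
        for (; p != NULL; p = p->next)
            sum += p->val;
        return sum;
    }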
Lecture 12: Caches and Memory
30
• WIB – Waiting Instruction Buffer
[Figure: scheduler without vs. with a WIB. Without it, a load miss means no instructions in its forward slice can execute; eventually all independent insts issue and the scheduler contains only insts in the forward slice… stalled. With a WIB, the forward slice is moved to the separate buffer, independent insts continue to issue, new insts keep the scheduler busy, and other independent load misses are eventually exposed (MLP)]
Lecture 12: Caches and Memory
31
• Similar to replay – continue issuing dependent instructions, but need to shunt them to the WIB
• WIB hardware can potentially be large
– WIB doesn't do scheduling – no CAM logic needed
• Need to redispatch from the WIB back into the RS when the load comes back from memory
– like redispatching from a replay-queue (see the sketch below)
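
A toy C sketch of the WIB as a plain FIFO with no wakeup logic; the structures and sizes are assumptions for illustration:

    #include <stddef.h>

    #define WIB_SIZE 2048               /* can be large: it is just a buffer */

    struct insn { int uop_id; };        /* stand-in for a buffered instruction */

    static struct insn wib[WIB_SIZE];
    static size_t wib_head, wib_tail;

    /* Shunt an instruction in the miss's forward slice out of the
     * scheduler and into the WIB (no CAM, just an enqueue). */
    static void wib_insert(struct insn in)
    {
        wib[wib_tail] = in;
        wib_tail = (wib_tail + 1) % WIB_SIZE;
    }

    /* When the load's data returns from memory, redispatch the buffered
     * instructions back into the reservation stations, much like
     * redispatching from a replay queue. */
    static void wib_redispatch(void (*dispatch_to_rs)(struct insn))
    {
        while (wib_head != wib_tail) {
            dispatch_to_rs(wib[wib_head]);
            wib_head = (wib_head + 1) % WIB_SIZE;
        }
    }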
Lecture 12: Caches and Memory
32