
Accelerating Multiprocessor Simulation
Kenneth C. Barr
6.895 Final Project
Motivation
• Detailed simulation of large benchmarks can take days…
• …and that’s just a uniprocessor. Parallel simulations are
more complex.
– Cache coherence
– Interconnect / bus timing models
– N CPUs
• Memory Address Record (MAR): a structure to speed up
simulation of directory-based cache coherent computers.
Directory Based Cache Coherence: Review
• Same state idea as snooping (e.g., MESI), but more scalable
• Add directory to hold state, and replace bus with network
• Each cache line has state in directory
• On load and store, contact the "home node" for the line's state
  – Exclusive + Owner
  – Shared + Readers
  – Invalid
[Figure: example directory entries: EXCL with owner CPU17; SHRD with a sharer bit-vector (01101…01110); INV with don't-care bits]
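As a concrete picture of those entries, here is a minimal C++ sketch of a directory entry matching the figure; the names and the fixed sharer-vector width are illustrative assumptions, not code from the talk.

#include <bitset>

const int NCPUS = 32;               // illustrative width of the sharer bit-vector

enum dir_state { INV, SHRD, EXCL };

struct dir_entry {
    dir_state state = INV;          // per-line state, held at the home node
    int owner = -1;                 // meaningful when state == EXCL (e.g., CPU17)
    std::bitset<NCPUS> sharers;     // meaningful when state == SHRD (the 01101…01110 vector)
};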
Directory Example
• CPU1 prepares to modify word at 0x8000
– Contact home with “read-exclusive” request
– Home replies: “<data> is shared by CPU2,
CPU3, and CPU4”
– Home transitions line to Exclusive
– CPU1 sends invalidates to
CPU2, CPU3, and CPU4
– Sharers invalidate their copies
– CPU1 waits for all three replies
– CPU1 puts data in cache
(possibly evicting LRU line)
• Not hard to see that this is intense simulation effort!
  – 11 hard-to-predict if-statements in the top-level function!
• Let’s skip it!
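A hedged sketch of the directory-side work in this transaction, continuing the dir_entry sketch above; the function name and return convention are mine, and the real protocol also moves data and models network latency, which is exactly the effort being skipped.

#include <vector>

// Handle a read-exclusive request at the home node and report which
// sharers the requester must invalidate (a sketch, not UVSIM code).
std::vector<int> read_exclusive(dir_entry &d, int requester) {
    std::vector<int> must_invalidate;
    if (d.state == SHRD) {                            // e.g., CPU2, CPU3, CPU4
        for (int cpu = 0; cpu < NCPUS; ++cpu)
            if (d.sharers.test(cpu) && cpu != requester)
                must_invalidate.push_back(cpu);
    } else if (d.state == EXCL && d.owner != requester) {
        must_invalidate.push_back(d.owner);           // owner gives up the line
    }
    d.state = EXCL;                                   // home transitions the line
    d.owner = requester;
    d.sharers.reset();
    return must_invalidate;   // requester sends invalidates, waits for all
}                             // replies, then fills its cache (maybe evicting LRU)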
Sampling Microarchitecture Simulation
• MAR is used in conjunction with "warming"
  – Long periods of fast "functional warming"
  – Short periods of detailed simulation
  – 35-60 times faster than detailed simulation
  – Less than 1% error in CPI
  (Wunderlich et al., 2003)
[Figure: sampling timeline: long Functional Warming phases interleaved with short Detailed Warming and Detailed Simulation phases]
Proposal: Memory Address Record to enable fast warmup
• Quick updates to MAR during warming
– No detailed cache or directory model; all accesses go straight to the shared memory space (on the simulator's heap, so it's easy and fast to access)
– Everything looks like a hit
– For each access, record {CPU, read/write, time}
• Playback from MAR to enable detailed simulation
Proposal: Memory Address Record to enable fast warmup
• For each access, record {CPU, read/write, time}
  – 3:07am, CPU1 issues "load r1, 0x8004"
[Figure: physical memory, Line 0 through Line N, each mapped to a MAR entry; the accessed line's entry holds the last writer and writetime, and now records the reader <CPU1, 3:07am>]

struct mar_record {
    int writer;               // CPU id of the most recent writer
    stime_t writetime;        // time of that write
    vector<stime_t> readers;  // most recent read time, per CPU
};
Algorithm
• Update
  – Simply record <cpu, read/write, time>, overwriting old values.
• Playback
  – Two stages:
    1. Reconstruct caches
    2. Use caches to build directory
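A minimal sketch of the update step, restating the mar_record struct from the proposal slide and assuming a hypothetical hash table keyed by line address (the slides don't show the container). It is O(1) per access, which is where the warming speedup comes from.

#include <cstdint>
#include <unordered_map>
#include <vector>

typedef uint64_t stime_t;                      // simulated-time type (assumed)

struct mar_record {
    int writer = -1;                           // last writer (-1: never written)
    stime_t writetime = 0;                     // time of that write
    std::vector<stime_t> readers;              // last read time, per CPU
};

std::unordered_map<uint64_t, mar_record> mar;  // one record per touched line

void mar_update(uint64_t line, int cpu, bool is_write, stime_t now, int ncpus) {
    mar_record &r = mar[line];
    if (r.readers.empty()) r.readers.resize(ncpus, 0);
    if (is_write) { r.writer = cpu; r.writetime = now; }  // overwrite old values
    else          r.readers[cpu] = now;
}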
Algorithm
Reconstructing Caches
• Uses N w-deep priority queues and N "latest writers"

for each set {
    for each line in set {
        update latest w writers
        throw out all prior reads
        insert writers and remaining (subsequent) reads in queue
    }
    for each CPU {
        empty priority queue (tag, dirty, valid bits) into set
    }
}
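A hedged C++ sketch of the per-set reconstruction, continuing the mar table above; a sort stands in for the slide's priority queues, and the set-mapping and eviction details are simplified assumptions.

#include <algorithm>

struct rebuilt_line { uint64_t addr; stime_t time; bool dirty; };

// Rebuild one set of a W-way cache for each of ncpus CPUs.
void rebuild_set(const std::vector<uint64_t> &lines_in_set, int ncpus, int W,
                 std::vector<std::vector<rebuilt_line>> &cache /* [cpu] -> ways */) {
    std::vector<std::vector<rebuilt_line>> cand(ncpus);
    for (uint64_t line : lines_in_set) {
        auto it = mar.find(line);
        if (it == mar.end()) continue;           // line never touched
        const mar_record &r = it->second;
        if (r.writer >= 0)                       // last writer holds a dirty copy
            cand[r.writer].push_back({line, r.writetime, true});
        for (int cpu = 0; cpu < ncpus; ++cpu)    // reads before the last write
            if (cpu != r.writer && r.readers[cpu] > r.writetime)  // were invalidated
                cand[cpu].push_back({line, r.readers[cpu], false});
    }
    for (int cpu = 0; cpu < ncpus; ++cpu) {      // keep each CPU's W most recent
        std::sort(cand[cpu].begin(), cand[cpu].end(),
                  [](const rebuilt_line &a, const rebuilt_line &b)
                  { return a.time > b.time; });
        if ((int)cand[cpu].size() > W) cand[cpu].resize(W);
        cache[cpu] = cand[cpu];
    }
}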
Algorithm
Reconstructing Directory

for each CPU {
    for each set in current cache {
        check other caches' (N-1)·W places to get correct state
    }
}

• All lines start as I.
• Line present in one place (dirty) -> M
• Line present in one place (clean) -> S (or E, but evictions may not let us distinguish).
• Line present in many places (clean) -> S
• Other cases should never arise.
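A sketch of the directory-rebuild pass over the reconstructed caches, continuing the sketches above; instead of probing (N-1)·W places per line, it makes one pass and accumulates sharers per address, assuming cache[cpu] now holds all of a CPU's rebuilt lines.

enum line_state { I, S, M };                       // states named on this slide

struct dir_info { line_state state = I; std::vector<int> sharers; };

std::unordered_map<uint64_t, dir_info> directory;

void rebuild_directory(const std::vector<std::vector<rebuilt_line>> &cache,
                       int ncpus) {
    for (int cpu = 0; cpu < ncpus; ++cpu)
        for (const rebuilt_line &l : cache[cpu]) {
            dir_info &d = directory[l.addr];
            d.sharers.push_back(cpu);
            d.state = l.dirty ? M : S;             // one dirty copy -> M;
        }                                          // clean copies -> S
}

The cache-rebuild step guarantees at most one dirty copy per line, so the dirty and clean cases never collide (the slide's "other cases should never arise").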
Proof (or intuition) of correctness
• We've got everything the cache has and more!
• So how can MAR be faster?
  – No timing models, no state machines until playback. And when we do play back, we're only interested in the final state, not what happened in between.
• Default scheme time
  – all accesses ∙ T(simulate cache + simulate directory + simulate network)
• MAR time
  – Update: all accesses ∙ T(hash table update)
  – Playback: touched lines ∙ (writes ∙ sharers + reads ∙ cpus)
• In our favor
  – MAR's "all accesses" step is fast
  – Touched lines are tiny compared to all accesses
  – Sharing should be minimal at the application level
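In symbols (a restatement of the two cost bullets above; A, L, and the per-event terms are notation introduced here, not the slides'):

\[
T_{\text{default}} = A \cdot T_{\text{cache+dir+net}}
\qquad
T_{\text{MAR}} = \underbrace{A \cdot T_{\text{hash}}}_{\text{update}}
               + \underbrace{L \cdot (w \cdot \bar{s} + r \cdot N)}_{\text{playback}}
\]

where A is total accesses, L is touched lines, w and r are writes and reads per line, \(\bar{s}\) is the average sharer count, and N is the number of CPUs. Since a hash-table update is far cheaper than simulating cache, directory, and network, and L is tiny compared to A, the MAR side wins.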
Measured sharing and cache lines
[Figure: two charts over the benchmarks fft, lu, barnes, and ocean. "Ratio of Accesses to Touched Lines" compares total requests to touched lines on a log-scale count axis (1 to 10^9). "Sharing Characteristics" counts lines with 0, 1, 2, 3, or 4 sharers (0 to 25,000).]
Testing
• UVSIM
  – Models the SGI Origin 3000 CPU (R10K) and most structures
  – Used 4 CPUs
  – Added MAR update, playback, and stats
Splash2 Benchmarks
• Parallel scientific computing
  – Fast Fourier Transform (FFT)
  – Lower-upper (LU) dense matrix factorization
  – Barnes-Hut n-body interaction
  – Ocean current simulation
• Chosen subset has diverse behavior

[Figure: "Splash2 Miss Rates (64KB Cache/4 CPU)": miss rate in percent (0 to 16), split into capacity and coherence misses, for fft, lu, barnes, and ocean]

            Scaling of       Scaling of          Scaling of computation-
            computation      communication       to-communication
FFT         (n log n)/p      n/p                 log n
LU          n/p              √n/√p               √n/√p
Barnes      (n log n)/p      ≈ (√n log n)/√p     ≈ √n/√p
Ocean       n/p              √n/√p               √n/√p

Data from Hennessy & Patterson, 2003
Remaining work
• Time it!
• Note and reduce space requirements
• Simplify formulas
Future work
• Can we save space in the structure by making it more cache-like?
• Other optimizations (e.g., don't stride, but take advantage of locality during replay)
• What extra state needs to be stored for other schemes (e.g., MESI, MOESI, etc.)?
• A per-processor structure to ease parallel-hosted simulation
• Caveats
  – If application threads get out of sync, we're modeling a correct execution, but not one we'd see in real life.
  – Don't forget to sync/barrier when the applications sync.
Questions…
Reconstructing Caches and Directory
• Algorithm
  1. Reconstruct caches
     – "Stride" through the MAR, looking at all cache lines that map to a set
     – For each line:
       – Keep track of the most recent write (per CPU)
       – Throw out all reads older than the most recent write
       – If there are readers, then for each CPU, insert tag and read time into a priority queue (sorted by time)
     – For each CPU, for each set of the cache:
       – Copy the W most recent reads and writes
  2. Reconstruct directory from caches
     – For each CPU's cache, for each address, call an add_sharer()-style function
     – Note that the cache-building step leaves us in a consistent state (e.g., only one cache can be writing)
  3. Post-process directory for subtleties
     – If only one reader, it's E, not S.
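A one-pass sketch of step 3's E-vs-S fix-up, with names of my own choosing; as the main slides note, evictions can make E ambiguous, so this is a best-effort promotion.

#include <cstdint>
#include <unordered_map>
#include <vector>

enum mesi_state { MESI_I, MESI_S, MESI_E, MESI_M };

struct mesi_entry { mesi_state state = MESI_I; std::vector<int> sharers; };

// Promote clean lines with exactly one reader from S to E.
void postprocess(std::unordered_map<uint64_t, mesi_entry> &dir) {
    for (auto &kv : dir) {
        mesi_entry &d = kv.second;
        if (d.state == MESI_S && d.sharers.size() == 1)
            d.state = MESI_E;        // sole clean copy: Exclusive, not Shared
    }
}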
Conclusion
• Pros
  – Replaces most of the work in simulation with a quick update and O(N) replay
  – Supports multisimulation of cache params and directory schemes (MSI vs. MESI)
• Cons
  – Memory (grows with the number of touched cache lines)
Results
• Benchmarks: the Splash2 paper shows the fraction of time spent in load/store/lock/sync, etc. Make a graph for this.
• How far off are certain metrics if you don't warm up the directory?
• Pretty graphs showing timing of my scheme vs. the default (on four Splash2 apps)
• Growth of the structure over time with these apps