
Lecture 13 Multi-core Chips
Peng Liu
[email protected]
Introduction to Multi-core Processors
Motivation: Single Processor Performance Scaling
Multi-core Chips
(aka Chip Multi-Processors or CMPs)
Sample of Multi-core Options
Classic Organization of a Shared-Memory Multiprocessor
And There is Much More…
Supercomputers
Definition of a supercomputer:
• Fastest machine in world at given task
• A device to turn a compute-bound problem into an I/O-bound problem
• Any machine costing $30M+
• Any machine designed by Seymour Cray
CDC 6600 (Cray, 1964) regarded as first supercomputer
CDC 6600 (Seymour Cray, 1963)
• A fast pipelined machine with 60-bit words
  – 128 Kword main memory capacity, 32 banks
• Ten functional units (parallel, unpipelined)
  – Floating Point: adder, 2 multipliers, divider
  – Integer: adder, 2 incrementers, ...
• Hardwired control (no microcoding)
• Scoreboard for dynamic scheduling of instructions
• Ten Peripheral Processors for Input/Output
  – a fast multi-threaded 12-bit integer ALU
• Very fast clock, 10 MHz (FP add in 4 clocks)
• >400,000 transistors, 750 sq. ft., 5 tons, 150 kW, novel freon-based technology for cooling
• Fastest machine in world for 5 years (until 7600)
  – over 100 sold ($7-10M each)
Supercomputer Applications
Typical application areas:
• Military research (nuclear weapons, cryptography)
• Scientific research
• Weather forecasting
• Oil exploration
• Industrial design (car crash simulation)
• Bioinformatics
• Cryptography
All involve huge computations on large data sets.
In the 70s-80s, Supercomputer ≡ Vector Machine
BlueGene/Q Compute chip
System-on-a-Chip design: integrates processors, memory and networking logic into a single chip
• 360 mm² Cu-45 technology (SOI)
  – ~1.47 B transistors
• 16 user + 1 service processors
  – plus 1 redundant processor
  – all processors are symmetric
  – each 4-way multi-threaded
  – 64-bit PowerISA™
  – 1.6 GHz
  – L1 I/D cache = 16 kB/16 kB
  – L1 prefetch engines
  – each processor has a Quad FPU (4-wide double precision, SIMD)
  – peak performance 204.8 GFLOPS @ 55 W
• Central shared L2 cache: 32 MB
  – eDRAM
  – multiversioned cache; will support transactional memory, speculative execution
  – supports atomic ops
• Dual memory controller
  – 16 GB external DDR3 memory
  – 1.33 Gb/s
  – 2 × 16-byte-wide interface (+ECC)
• Chip-to-chip networking
  – router logic integrated into BQC chip
• External IO
  – PCIe Gen2 interface
Blue Gene/Q packaging hierarchy
1. Chip: 16 cores
2. Module: single chip
3. Compute Card: one single-chip module, 16 GB DDR3 memory
4. Node Card: 32 compute cards, optical modules, link chips, torus
5a. Midplane: 16 node cards
5b. I/O Drawer: 8 I/O cards, 8 PCIe Gen2 slots
6. Rack: 2 midplanes, 1, 2 or 4 I/O drawers
7. System: 20 PF/s
Ref: SC2010
Sequoia (PowerPC A2, custom interconnect), 2013: 17.173 PFlop/s
Supercomputer in China
• Sunway TaihuLight (神威太湖之光)
  – June 2016 Rank 1
  – 93.015 PFlop/s
• Tianhe-2
  – National University of Defense Technology (NUDT)
  – June 2013, June 2014, June 2015
  – Intel Xeon E5-2692 CPUs and Xeon Phi 31S1P with custom interconnect
  – 33.863 PFlop/s
• Sunway BlueLight MPP (Nov. 2011)
  – National Research Center of Parallel Computer Engineering and Technology
  – ShenWei SW1600 CPUs with an InfiniBand QDR interconnect
• Tianhe-1A (Nov. 2010)
  – National University of Defense Technology
  – Nvidia Tesla GPUs and Intel Xeon CPUs with custom interconnect
• Dawning 6000 (June 2012)
  – Chinese Academy of Sciences and Dawning Information Industry
  – Godson-3B CPUs
Vector Programming Model
[Figure: the vector programming model. Scalar registers r0–r15. Vector registers v0–v15, each holding elements [0], [1], [2], …, [VLRMAX-1], plus a vector length register (VLR). Vector arithmetic instructions, e.g. ADDV v3, v1, v2, operate element-wise on v1 and v2 into v3 for the first VLR elements. Vector load and store instructions, e.g. LV v1, r1, r2, transfer VLR elements between memory and vector register v1 using base address r1 and stride r2.]
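As a rough C sketch (mine, not from the slides) of the semantics: LV gathers VLR elements starting at base r1, spaced by stride r2, and ADDV adds element-wise over the first VLR elements.

#include <stddef.h>

/* LV v1, r1, r2: load vlr elements into v1 from base address r1 with stride r2 */
void lv(double v1[], const double *r1, size_t r2, size_t vlr) {
    for (size_t i = 0; i < vlr; i++)
        v1[i] = r1[i * r2];
}

/* ADDV v3, v1, v2: element-wise add across the first vlr elements */
void addv(double v3[], const double v1[], const double v2[], size_t vlr) {
    for (size_t i = 0; i < vlr; i++)
        v3[i] = v1[i] + v2[i];
}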
Multimedia Extensions (aka SIMD extensions)
[Figure: a 64b register partitioned into 2 × 32b, 4 × 16b, or 8 × 8b elements.]
• Very short vectors added to existing ISAs for microprocessors
• Use existing 64-bit registers split into 2x32b or 4x16b or 8x8b
  – This concept first used on Lincoln Labs TX-2 computer in 1957, with 36b datapath split into 2x18b or 4x9b
  – Newer designs have 128-bit registers (PowerPC Altivec, Intel SSE2/3/4)
• Single instruction operates on all elements within register
[Figure: four 16b lanes added in parallel — 4x16b adds with one instruction.]
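For the 128-bit registers mentioned above, a minimal Intel SSE2 intrinsics sketch: a single _mm_add_epi16 performs eight 16b adds at once (the figure's 4x16b case is the older 64-bit form).

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>
#include <stdio.h>

int main(void) {
    int16_t a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    int16_t b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    int16_t c[8];

    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vc = _mm_add_epi16(va, vb);      /* eight 16-bit adds, one instruction */
    _mm_storeu_si128((__m128i *)c, vc);

    for (int i = 0; i < 8; i++)
        printf("%d ", c[i]);                 /* 11 22 33 44 55 66 77 88 */
    printf("\n");
    return 0;
}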
GPUs
Graphics Processing Units
Graphics Processing Units (GPUs)
• Original GPUs were dedicated fixed-function devices for generating 3D graphics (mid-late 1990s) including high-performance floating-point units
  – Provide workstation-like graphics for PCs
  – User could configure graphics pipeline, but not really program it
• Over time, more programmability added (2001-2005)
  – E.g., new language Cg for writing small programs run on each vertex or each pixel, also Windows DirectX variants
  – Massively parallel (millions of vertices or pixels per frame) but very constrained programming model
• Some users noticed they could do general-purpose computation by mapping input and output data to images, and computation to vertex and pixel shading computations
  – Incredibly difficult programming model as had to use graphics pipeline model for general computation
General-Purpose GPUs (GP-GPUs)
• In 2006, Nvidia introduced the GeForce 8800 GPU supporting a new programming language: CUDA
  – “Compute Unified Device Architecture”
  – Subsequently, broader industry pushing for OpenCL, a vendor-neutral version of the same ideas
• Idea: Take advantage of GPU computational performance and memory bandwidth to accelerate some kernels for general-purpose computing
• Attached processor model: Host CPU issues data-parallel kernels to GP-GPU for execution
• This lecture has a simplified version of the Nvidia CUDA-style model and only considers GPU execution for computational kernels, not graphics
  – Would probably need another course to describe graphics processing
“Single Instruction, Multiple Thread”
• GPUs use a SIMT model, where individual scalar instruction streams for each CUDA thread are grouped together for SIMD execution on hardware (Nvidia groups 32 CUDA threads into a warp)
[Figure: microthreads µT0–µT7 each execute the scalar instruction stream ld x; mul a; ld y; add; st y, with SIMD execution across the warp.]
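To make the figure concrete, here is a plain-C sketch (an emulation, not CUDA) of the warp above: every microthread runs the same scalar stream ld x; mul a; ld y; add; st y, and the hardware steps through it one instruction at a time across all lanes.

#define WARP_SIZE 8   /* the figure shows 8 microthreads; Nvidia warps have 32 */

/* Each lane t computes y[t] = a*x[t] + y[t]; execution proceeds
   instruction-by-instruction across the whole warp (SIMD). */
void warp_step(float a, const float *x, float *y) {
    float rx[WARP_SIZE], ry[WARP_SIZE];
    for (int t = 0; t < WARP_SIZE; t++) rx[t] = x[t];      /* ld x  */
    for (int t = 0; t < WARP_SIZE; t++) rx[t] *= a;        /* mul a */
    for (int t = 0; t < WARP_SIZE; t++) ry[t] = y[t];      /* ld y  */
    for (int t = 0; t < WARP_SIZE; t++) ry[t] += rx[t];    /* add   */
    for (int t = 0; t < WARP_SIZE; t++) y[t] = ry[t];      /* st y  */
}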
A Multithreaded SIMD Processor
GPU Memory Structures
Nvidia Fermi GF100 GPU [Nvidia, 2010]
GPU Future
• High-end desktops have separate GPU chip, but trend towards integrating GPU on same die as CPU (already in laptops, tablets and smartphones)
  – Advantage is shared memory with CPU, no need to transfer data
  – Disadvantage is reduced memory bandwidth compared to dedicated smaller-capacity specialized memory system
    • Graphics DRAM (GDDR) versus regular DRAM (DDR3)
• Will GP-GPU survive? Or will improvements in CPU DLP make GP-GPU redundant?
  – On same die, CPU and GPU should have same memory bandwidth
  – GPU might have more FLOPS as needed for graphics anyway
Multiprocessor Benchmarks
• Stanford Parallel Applications for Shared Memory (SPLASH-2), 1995
• NAS (NASA Advanced Supercomputing) Parallel Benchmarks, 1991
• PARSEC Benchmark Suite, 2008
• Berkeley Design Patterns, 2006
Synchronization
Symmetric Multiprocessors
[Figure: processors on a shared CPU-Memory bus; a bridge connects the bus to an I/O bus with I/O controllers for disk, graphics output, and networks.]
Symmetric:
• All memory is equally far away from all processors
• Any processor can do any I/O (set up a DMA transfer)
Synchronization
The need for synchronization arises whenever there are concurrent processes in a system (even in a uniprocessor system).
• Producer-Consumer: A consumer process must wait until the producer process has produced data
• Mutual Exclusion: Ensure that only one process uses a resource at a given time
[Figure: a producer process feeding a consumer process; processes P1 and P2 contending for a shared resource.]
A Producer-Consumer Example
[Figure: producer and consumer sharing an in-memory queue with head and tail pointers.]

Producer posting Item x:
  Load Rtail, (tail)
  Store (Rtail), x
  Rtail=Rtail+1
  Store (tail), Rtail

Consumer:
      Load Rhead, (head)
spin: Load Rtail, (tail)
      if Rhead==Rtail goto spin
      Load R, (Rhead)
      Rhead=Rhead+1
      Store (head), Rhead
      process(R)

The program is written assuming instructions are executed in order.
Problems?
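A direct C rendering of the pseudocode may help (the names queue, head, tail and the fixed queue size are mine). It is only correct under the slide's in-order assumption; volatile stops compiler reordering but not hardware reordering.

enum { QSIZE = 1024 };                 /* assumed queue capacity */
static int queue[QSIZE];
static volatile unsigned head = 0, tail = 0;

void produce(int x) {                  /* Producer posting item x */
    unsigned rtail = tail;             /* Load Rtail, (tail)      */
    queue[rtail % QSIZE] = x;          /* Store (Rtail), x        */
    tail = rtail + 1;                  /* Store (tail), Rtail+1   */
}

int consume(void) {                    /* Consumer */
    unsigned rhead = head;             /* Load Rhead, (head)      */
    while (rhead == tail)              /* spin while queue empty  */
        ;
    int r = queue[rhead % QSIZE];      /* Load R, (Rhead)         */
    head = rhead + 1;                  /* Store (head), Rhead+1   */
    return r;                          /* caller does process(R)  */
}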
A Producer-Consumer Example (continued)

Producer posting Item x:
  Load Rtail, (tail)
  Store (Rtail), x        ← 1
  Rtail=Rtail+1
  Store (tail), Rtail     ← 2

Consumer:
      Load Rhead, (head)
spin: Load Rtail, (tail)  ← 3
      if Rhead==Rtail goto spin
      Load R, (Rhead)     ← 4
      Rhead=Rhead+1
      Store (head), Rhead
      process(R)

Can the tail pointer get updated before the item x is stored?
The programmer assumes that if 3 happens after 2, then 4 happens after 1.
Problem sequences are:
  2, 3, 4, 1
  4, 1, 2, 3
Sequential Consistency: A Memory Model
[Figure: processors P … P sharing a single memory M.]
“A system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in the order specified by the program” — Leslie Lamport

Sequential Consistency = arbitrary order-preserving interleaving of memory references of sequential programs
Sequential Consistency
Sequential concurrent tasks: T1, T2
Shared variables: X, Y (initially X = 0, Y = 10)

T1:
  Store (X), 1    (X = 1)
  Store (Y), 11   (Y = 11)

T2:
  Load R1, (Y)
  Store (Y'), R1  (Y' = Y)
  Load R2, (X)
  Store (X'), R2  (X' = X)

What are the legitimate answers for X' and Y'?
(X', Y') ∈ {(1,11), (0,10), (1,10), (0,11)}?
If Y' is 11 then X' cannot be 0.
Sequential Consistency
Sequential consistency imposes more memory ordering constraints than those imposed by uniprocessor program dependencies. What are these in our example?

T1:
  Store (X), 1    (X = 1)
  Store (Y), 11   (Y = 11)

T2:
  Load R1, (Y)
  Store (Y'), R1  (Y' = Y)
  Load R2, (X)
  Store (X'), R2  (X' = X)

[Figure: arrows mark the additional SC requirements between T1's stores and T2's loads.]

Does (can) a system with caches or out-of-order execution capability provide a sequentially consistent view of the memory?
more on this later
Multiple Consumer Example
[Figure: one producer and two consumers, each with its own Rhead, Rtail and R, sharing the queue's head and tail pointers.]

Producer posting Item x:
  Load Rtail, (tail)
  Store (Rtail), x
  Rtail=Rtail+1
  Store (tail), Rtail

Consumer:
      Load Rhead, (head)
spin: Load Rtail, (tail)
      if Rhead==Rtail goto spin
      Load R, (Rhead)
      Rhead=Rhead+1
      Store (head), Rhead
      process(R)

Critical section: needs to be executed atomically by one consumer ⇒ locks

What is wrong with this code?
Locks or Semaphores
E. W. Dijkstra, 1965

A semaphore is a non-negative integer, with the following operations:
  P(s): if s>0, decrement s by 1, otherwise wait
  V(s): increment s by 1 and wake up one of the waiting processes

P's and V's must be executed atomically, i.e., without
• interruptions or
• interleaved accesses to s by other processors

Process i:
  P(s)
  <critical section>
  V(s)

The initial value of s determines the maximum no. of processes in the critical section.
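For reference, POSIX semaphores expose exactly these two operations: sem_wait is P and sem_post is V. A minimal sketch:

#include <semaphore.h>
#include <pthread.h>

sem_t s;

void *worker(void *arg) {
    sem_wait(&s);            /* P(s): decrement s, or block while s == 0 */
    /* <critical section> */
    sem_post(&s);            /* V(s): increment s, wake one waiter       */
    return NULL;
}

int main(void) {
    sem_init(&s, 0, 1);      /* initial value 1 => at most 1 process in the critical section */
    /* ... create threads running worker() ... */
    return 0;
}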
Implementation of Semaphores
Semaphores (mutual exclusion) can be implemented using ordinary Load and Store instructions in the Sequential Consistency memory model. However, protocols for mutual exclusion are difficult to design...

Simpler solution: atomic read-modify-write instructions

Examples: m is a memory location, R is a register

Test&Set (m), R:
  R ← M[m];
  if R==0 then
    M[m] ← 1;

Fetch&Add (m), RV, R:
  R ← M[m];
  M[m] ← R + RV;

Swap (m), R:
  Rt ← M[m];
  M[m] ← R;
  R ← Rt;
Multiple Consumers Example using the Test&Set Instruction

P:    Test&Set (mutex), Rtemp
      if (Rtemp!=0) goto P
      Load Rhead, (head)
spin: Load Rtail, (tail)
      if Rhead==Rtail goto spin
      Load R, (Rhead)
      Rhead=Rhead+1
      Store (head), Rhead
V:    Store (mutex), 0
      process(R)

(The lines from Load Rhead through Store (head) form the critical section.)

Other atomic read-modify-write instructions (Swap, Fetch&Add, etc.) can also implement P's and V's.

What if the process stops or is swapped out while in the critical section?
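The same Test&Set idiom in C, sketched with the GCC/Clang __atomic builtins (a toolchain assumption; any atomic exchange instruction serves):

static volatile char mutex;   /* 0 = free, nonzero = held */

void tas_lock(void) {
    /* Test&Set (mutex), Rtemp: atomically set the flag, return old value;
       if (Rtemp != 0) goto P */
    while (__atomic_test_and_set(&mutex, __ATOMIC_ACQUIRE))
        ;
}

void tas_unlock(void) {
    __atomic_store_n(&mutex, 0, __ATOMIC_RELEASE);   /* Store (mutex), 0 */
}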
Nonblocking Synchronization

Compare&Swap (m), Rt, Rs:
  if (Rt==M[m])
  then M[m] ← Rs;
       Rs ← Rt;
       status ← success;
  else status ← fail;

(status is an implicit argument)

try:  Load Rhead, (head)
spin: Load Rtail, (tail)
      if Rhead==Rtail goto spin
      Load R, (Rhead)
      Rnewhead = Rhead+1
      Compare&Swap (head), Rhead, Rnewhead
      if (status==fail) goto try
      process(R)
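In C11 this maps onto atomic_compare_exchange_weak, which returns the status as a boolean and may fail spuriously, hence the retry loop (a sketch; queue, head, tail and process are assumed from the earlier example):

#include <stdatomic.h>

extern int queue[];
extern _Atomic unsigned head, tail;
extern void process(int r);

void consume(void) {
    unsigned rhead, rnewhead;
    int r;
    do {                                        /* try:  */
        rhead = atomic_load(&head);             /* Load Rhead, (head) */
        while (rhead == atomic_load(&tail))     /* spin: queue empty  */
            rhead = atomic_load(&head);
        r = queue[rhead];                       /* Load R, (Rhead)    */
        rnewhead = rhead + 1;
        /* Compare&Swap (head), Rhead, Rnewhead; status is the return value */
    } while (!atomic_compare_exchange_weak(&head, &rhead, rnewhead));
    process(r);                                 /* only the winner processes R */
}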
Load-reserve & Store-conditional
Special register(s) to hold reservation flag and address, and the outcome of store-conditional.

Load-reserve R, (m):
  <flag, adr> ← <1, m>;
  R ← M[m];

Store-conditional (m), R:
  if <flag, adr> == <1, m>
  then cancel other procs' reservation on m;
       M[m] ← R;
       status ← succeed;
  else status ← fail;

try:  Load-reserve Rhead, (head)
spin: Load Rtail, (tail)
      if Rhead==Rtail goto spin
      Load R, (Rhead)
      Rhead = Rhead + 1
      Store-conditional (head), Rhead
      if (status==fail) goto try
      process(R)
Performance of Locks
Blocking atomic read-modify-write instructions
  e.g., Test&Set, Fetch&Add, Swap
vs
Non-blocking atomic read-modify-write instructions
  e.g., Compare&Swap, Load-reserve/Store-conditional
vs
Protocols based on ordinary Loads and Stores

Performance depends on several interacting factors: degree of contention, caches, out-of-order execution of Loads and Stores

more on this later ...
Issues in Implementing Sequential Consistency
[Figure: processors P … P sharing a single memory M.]

Implementation of SC is complicated by two issues:

• Out-of-order execution capability (can the pair be reordered?)
  Load(a); Load(b)     yes
  Load(a); Store(b)    yes if a ≠ b
  Store(a); Load(b)    yes if a ≠ b
  Store(a); Store(b)   yes if a ≠ b

• Caches
  Caches can prevent the effect of a store from being seen by other processors

No common commercial architecture has a sequentially consistent memory model!
Memory Fences
Instructions to sequentialize memory accesses

Processors with relaxed or weak memory models (i.e., which permit Loads and Stores to different addresses to be reordered) need to provide memory fence instructions to force the serialization of memory accesses.

Examples of processors with relaxed memory models:
  Sparc V8 (TSO, PSO): Membar
  Sparc V9 (RMO): Membar #LoadLoad, Membar #LoadStore, Membar #StoreLoad, Membar #StoreStore
  PowerPC (WO): Sync, EIEIO

Memory fences are expensive operations; however, one pays the cost of serialization only when it is required.
Using Memory Fences
[Figure: producer and consumer sharing an in-memory queue with head and tail pointers.]

Producer posting Item x:
  Load Rtail, (tail)
  Store (Rtail), x
  MembarSS          ← ensures that tail ptr is not updated before x has been stored
  Rtail=Rtail+1
  Store (tail), Rtail

Consumer:
      Load Rhead, (head)
spin: Load Rtail, (tail)
      if Rhead==Rtail goto spin
      MembarLL      ← ensures that R is not loaded before x has been stored
      Load R, (Rhead)
      Rhead=Rhead+1
      Store (head), Rhead
      process(R)
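The same example in C11 atomics, where atomic_thread_fence(memory_order_release) plays the MembarSS role and atomic_thread_fence(memory_order_acquire) the MembarLL role (a sketch; queue, head, tail and process assumed as before):

#include <stdatomic.h>

extern int queue[];
extern _Atomic unsigned head, tail;
extern void process(int r);

void produce(int x) {
    unsigned rtail = atomic_load_explicit(&tail, memory_order_relaxed);
    queue[rtail] = x;                              /* Store (Rtail), x */
    atomic_thread_fence(memory_order_release);     /* MembarSS         */
    atomic_store_explicit(&tail, rtail + 1, memory_order_relaxed);
}

void consume(void) {
    unsigned rhead = atomic_load_explicit(&head, memory_order_relaxed);
    while (rhead == atomic_load_explicit(&tail, memory_order_relaxed))
        ;                                          /* spin             */
    atomic_thread_fence(memory_order_acquire);     /* MembarLL         */
    int r = queue[rhead];                          /* Load R, (Rhead)  */
    atomic_store_explicit(&head, rhead + 1, memory_order_relaxed);
    process(r);
}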
Mutual Exclusion Using Load/Store
A protocol based on two shared variables c1 and c2. Initially, both c1 and c2 are 0 (not busy).

Process 1:
  ...
  c1=1;
  L: if c2=1 then go to L
  <critical section>
  c1=0;

Process 2:
  ...
  c2=1;
  L: if c1=1 then go to L
  <critical section>
  c2=0;

What is wrong? Deadlock!
Mutual Exclusion: second attempt
To avoid deadlock, let a process give up the reservation (i.e., Process 1 sets c1 to 0) while waiting.

Process 1:
  ...
  L: c1=1;
     if c2=1 then
       { c1=0; go to L }
  <critical section>
  c1=0

Process 2:
  ...
  L: c2=1;
     if c1=1 then
       { c2=0; go to L }
  <critical section>
  c2=0

• Deadlock is not possible, but with a low probability a livelock may occur.
• An unlucky process may never get to enter the critical section ⇒ starvation
A Protocol for Mutual Exclusion (T. Dekker, 1966)
A protocol based on 3 shared variables c1, c2 and turn. Initially, both c1 and c2 are 0 (not busy).

Process 1:
  ...
  c1=1;
  turn = 1;
  L: if c2=1 & turn=1 then go to L
  <critical section>
  c1=0;

Process 2:
  ...
  c2=1;
  turn = 2;
  L: if c1=1 & turn=2 then go to L
  <critical section>
  c2=0;

• turn = i ensures that only process i can wait
• variables c1 and c2 ensure mutual exclusion

The solution for n processes was given by Dijkstra and is quite tricky!
Analysis of Dekker's Algorithm
[Figure: Scenario 1 and Scenario 2 show two different interleavings of the same two processes; the arrows marking the orderings are lost in transcription.]

Process 1:
  c1=1;
  turn = 1;
  L: if c2=1 & turn=1 then go to L
  <critical section>
  c1=0;

Process 2:
  c2=1;
  turn = 2;
  L: if c1=1 & turn=2 then go to L
  <critical section>
  c2=0;
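A C rendering of the protocol (mine): C11 seq_cst atomics stand in for the sequentially consistent loads and stores the protocol assumes, which, per the fence slides, plain accesses on a relaxed-memory machine would not provide.

#include <stdatomic.h>

_Atomic int c1, c2, turn;        /* initially c1 = c2 = 0 (not busy) */

void p1_enter(void) {
    atomic_store(&c1, 1);
    atomic_store(&turn, 1);
    while (atomic_load(&c2) == 1 && atomic_load(&turn) == 1)
        ;                        /* L: if c2=1 & turn=1 then go to L */
}
void p1_exit(void)  { atomic_store(&c1, 0); }

void p2_enter(void) {
    atomic_store(&c2, 1);
    atomic_store(&turn, 2);
    while (atomic_load(&c1) == 1 && atomic_load(&turn) == 2)
        ;                        /* L: if c1=1 & turn=2 then go to L */
}
void p2_exit(void)  { atomic_store(&c2, 0); }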
N-process Mutual Exclusion: Lamport's Bakery Algorithm

Process i
Initially num[j] = 0, for all j

Entry Code:
  choosing[i] = 1;
  num[i] = max(num[0], …, num[N-1]) + 1;
  choosing[i] = 0;
  for(j = 0; j < N; j++) {
    while( choosing[j] );
    while( num[j] &&
           ( ( num[j] < num[i] ) ||
             ( num[j] == num[i] && j < i ) ) );
  }

Exit Code:
  num[i] = 0;
Relaxed Memory Model needs Fences
(This slide repeats the producer-consumer example from "Using Memory Fences" above: MembarSS keeps the tail pointer from being updated before x is stored, and MembarLL keeps R from being loaded before x has been stored.)
Memory Coherence in SMPs
[Figure: CPU-1 and CPU-2 each have a private cache holding A = 100; main memory also holds A = 100; all sit on the CPU-Memory bus.]

Suppose CPU-1 updates A to 200.
  write-back: memory and cache-2 have stale values
  write-through: cache-2 has a stale value

Do these stale values matter?
What is the view of shared memory for programming?
Write-back Caches & SC

prog T1:        prog T2:
ST X, 1         LD Y, R1
ST Y, 11        ST Y', R1
                LD X, R2
                ST X', R2

Steps: T1 is executed; cache-1 writes back Y; T2 is executed; cache-1 writes back X; cache-2 writes back X' & Y'.

[Figure: contents of cache-1, memory, and cache-2 after each step. After T1, cache-1 holds X=1, Y=11 while memory still holds X=0, Y=10. Cache-1 writes back Y (memory: X=0, Y=11). T2 then reads Y=11 but the stale X=0, ending with X'=0, Y'=11 — a result sequential consistency forbids.]
Write-through Caches & SC

prog T1:        prog T2:
ST X, 1         LD Y, R1
ST Y, 11        ST Y', R1
                LD X, R2
                ST X', R2

Steps: T1 executed; T2 executed.

[Figure: contents of cache-1, memory, and cache-2 after each step. After T1, memory holds X=1, Y=11, but cache-2 still holds a stale X=0; T2 reads Y=11 and X=0, ending with X'=0, Y'=11.]

Write-through caches don't preserve sequential consistency either.
Cache Coherence vs. Memory Consistency
• A cache coherence protocol ensures that all writes by one processor are eventually visible to other processors, for one memory address
  – i.e., updates are not lost
• A memory consistency model gives the rules on when a write by one processor can be observed by a read on another, across different addresses
  – Equivalently, what values can be seen by a load
• A cache coherence protocol is not enough to ensure sequential consistency
  – But if sequentially consistent, then caches must be coherent
• The combination of the cache coherence protocol plus the processor memory reorder buffer implements a given machine's memory consistency model
Maintaining Cache Coherence
Hardware support is required such that
• only one processor at a time has write permission for a location
• no processor can load a stale copy of the location after a write
⇒ cache coherence protocols
Warmup: Parallel I/O
[Figure: a processor with its cache on a memory bus carrying address (A), data (D), and R/W signals, connected to physical memory and to a DMA engine attached to a disk. Either the cache or the DMA engine can be the bus master and effect transfers; page transfers occur while the processor is running.]

(DMA stands for "Direct Memory Access": the I/O device can read/write memory autonomously from the CPU)
Problems with Parallel I/O
[Figure: cached portions of a page sit in the processor's cache while DMA transfers move the page between physical memory and disk.]

Memory → Disk transfers: physical memory may be stale if the cache copy is dirty
Disk → Memory transfers: the cache may hold stale data and not see the memory writes
Snoopy Cache (Goodman 1983)
• Idea: Have cache watch (or snoop upon) DMA transfers, and then "do the right thing"
• Snoopy cache tags are dual-ported
[Figure: the cache's tags and state are dual-ported: one A/R/W port drives the memory bus when the cache is bus master; a second snoopy read port is attached to the memory bus.]
Shared Memory Multiprocessor
[Figure: processors M1, M2, M3, each behind a snoopy cache, share a memory bus with physical memory and a DMA engine attached to disks.]

Use the snoopy mechanism to keep all processors' view of memory coherent.
Snoopy Cache Coherence Protocols
write miss: the address is invalidated in all other caches before the write is performed
read miss: if a dirty copy is found in some cache, a write-back is performed before the memory is read
Cache State Transition Diagram: the MSI protocol
Each cache line has state bits (plus an address tag): M: Modified, S: Shared, I: Invalid.

[Figure: state transitions for a cache line in processor P1:
  I → M on a write miss (P1 gets line from memory)
  I → S on a read miss (P1 gets line from memory)
  M → M while P1 reads or writes
  M → S when another processor reads (P1 writes back)
  M → I on another processor's intent to write (P1 writes back)
  S → S on a read by any processor
  S → I on another processor's intent to write]
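The diagram can be written down directly as a next-state function; a C sketch (the event names are mine) for a line in P1's cache, also covering the standard write-to-shared-line upgrade:

typedef enum { I, S, M } state_t;
typedef enum { P1_READ, P1_WRITE,                 /* P1's own accesses  */
               OTHER_READ, OTHER_INTENT_WRITE }   /* snooped bus events */
        event_t;

/* Returns the next state; *writeback is set when P1 must write the line back. */
state_t msi_next(state_t st, event_t ev, int *writeback) {
    *writeback = 0;
    switch (st) {
    case I:
        if (ev == P1_READ)  return S;   /* read miss: line from memory  */
        if (ev == P1_WRITE) return M;   /* write miss: line from memory */
        return I;
    case S:
        if (ev == P1_WRITE) return M;   /* upgrade: broadcast intent to write */
        if (ev == OTHER_INTENT_WRITE) return I;
        return S;                       /* reads by any processor stay S */
    case M:
        if (ev == OTHER_READ)         { *writeback = 1; return S; }
        if (ev == OTHER_INTENT_WRITE) { *writeback = 1; return I; }
        return M;                       /* P1 reads or writes stay M     */
    }
    return st;
}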
Two Processor Example (Reading and writing the same cache line)
Access sequence: P1 reads; P1 writes; P2 reads; P2 writes; P1 reads; P1 writes; P2 writes; ...

[Figure: the MSI diagram instantiated for P1 and for P2. In P1's diagram the "other processor" events are P2's: P2 reads ⇒ P1 writes back and goes M → S; P2 intent to write ⇒ P1 goes to I. In P2's diagram the roles are swapped: P1 reads ⇒ P2 writes back and goes M → S; P1 intent to write ⇒ P2 goes to I.]
Observation
[Figure: the MSI state diagram again.]
• If a line is in the M state then no other cache can have a copy of the line!
  – Memory stays coherent; multiple differing copies cannot exist
MESI: An Enhanced MSI Protocol (increased performance for private data)
Each cache line has a tag and state bits: M: Modified Exclusive, E: Exclusive but unmodified, S: Shared, I: Invalid.

[Figure: state transitions for a cache line in processor P1:
  I → M on a write miss
  I → E on a read miss, not shared
  I → S on a read miss, shared
  E → M on a P1 write
  E → E on a P1 read
  E → S when another processor reads
  S → M on P1 intent to write
  M → M while P1 writes or reads
  M → S when another processor reads (P1 writes back)
  M → I on another processor's intent to write (P1 writes back)
  S, E → I on another processor's intent to write]
Optimized Snoop with Level-2 Caches
[Figure: four CPUs, each with an L1 cache backed by an L2 cache; a snooper sits between each L2 and the bus.]
• Processors often have two-level caches
  • small L1, large L2 (usually both on chip now)
• Inclusion property: entries in L1 must be in L2
  ⇒ invalidation in L2 ⇒ invalidation in L1
• Snooping on L2 does not affect CPU-L1 bandwidth

What problem could occur?
Intervention
[Figure: CPU-1's cache holds a modified copy A = 200; memory still holds the stale value A = 100; CPU-2's cache does not have A.]

When a read-miss for A occurs in cache-2, a read request for A is placed on the bus.
• Cache-1 needs to supply the data & change its state to shared
• The memory may respond to the request also! Does memory know it has stale data?

Cache-1 needs to intervene through the memory controller to supply the correct data to cache-2.
False Sharing
[Figure: a cache block with state, block address, and words data0, data1, ..., dataN.]

A cache block contains more than one word.
Cache coherence is done at the block level and not the word level.
Suppose M1 writes word_i and M2 writes word_k, and both words have the same block address.
What can happen?
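What can happen is needless coherence traffic; a common C illustration (a sketch assuming 64-byte blocks): in the first struct, M1's writes to word_i keep invalidating M2's cached copy of word_k's block and vice versa, even though no word is actually shared. Padding gives each word its own block.

/* word_i and word_k fall in the same cache block: every write by M1
   invalidates M2's cached copy, and vice versa (false sharing). */
struct counters_shared_block {
    long word_i;    /* written by M1 */
    long word_k;    /* written by M2 */
};

/* Padding (assuming 64-byte blocks) puts each word in its own block. */
struct counters_padded {
    long word_i;
    char pad[64 - sizeof(long)];
    long word_k;
};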
Synchronization and Caches: Performance Issues
[Figure: Processors 1, 2, and 3, each with a cache, share the CPU-Memory bus; mutex = 1 lives in one of the caches. Each processor runs:]

L: swap (mutex), R;
   if <R> then goto L;
   <critical section>
   M[mutex] ← 0;

Cache-coherence protocols will cause the mutex to ping-pong between P1's and P2's caches.
Ping-ponging can be reduced by first reading the mutex location (non-atomically) and executing a swap only if it is found to be zero, as in the sketch below.
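That read-then-swap idea is the test-and-test-and-set lock; a C sketch with the GCC/Clang __atomic builtins (a toolchain assumption). The inner read spins in the local cache, so the block only ping-pongs when the lock actually changes hands:

static volatile char mutex;   /* 0 = free, nonzero = held */

void ttas_lock(void) {
    for (;;) {
        while (__atomic_load_n(&mutex, __ATOMIC_RELAXED))  /* ordinary read: */
            ;                                              /* spins in-cache */
        if (!__atomic_exchange_n(&mutex, 1, __ATOMIC_ACQUIRE))
            return;                                        /* swap won the lock */
    }
}

void ttas_unlock(void) {
    __atomic_store_n(&mutex, 0, __ATOMIC_RELEASE);         /* M[mutex] <- 0 */
}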
Load-reserve & Store-conditional
Special register(s) to hold reservation flag and address, and the outcome of store-conditional.

Load-reserve R, (a):
  <flag, adr> ← <1, a>;
  R ← M[a];

Store-conditional (a), R:
  if <flag, adr> == <1, a>
  then cancel other procs' reservation on a;
       M[a] ← <R>;
       status ← succeed;
  else status ← fail;

If the snooper sees a store transaction to the address in the reserve register, the reserve bit is set to 0.
• Several processors may reserve 'a' simultaneously
• These instructions are like ordinary loads and stores with respect to the bus traffic

Can implement reservation by using cache hit/miss; no additional hardware required (problems?)
Performance of Symmetric Shared-Memory Multiprocessors
Cache performance is a combination of:
1. Uniprocessor cache miss traffic
2. Traffic caused by communication
   – Results in invalidations and subsequent cache misses
   – Adds a 4th C: coherence miss
     • Joins Compulsory, Capacity, Conflict
     • (Sometimes called a Communication miss)
Coherency Misses
1. True sharing misses arise from the communication of data through the cache coherence mechanism
   • Invalidates due to 1st write to shared block
   • Reads by another CPU of modified block in different cache
   • Miss would still occur if block size were 1 word
2. False sharing misses occur when a block is invalidated because some word in the block, other than the one being read, is written into
   • Invalidation does not cause a new value to be communicated, but only causes an extra cache miss
   • Block is shared, but no word in block is actually shared ⇒ miss would not occur if block size were 1 word
Homework
• Readings:
  – Read Chapter 6.1-6.14
Acknowledgements
• These slides contain material from courses:
– UCB CS152
– Stanford EE108B
Get involved in research
• Undergrad research experience is the most important part of an application to top grad schools, and fun too.
• Thanks for taking the class!