Lecture 1: Course Introduction and Overview


CS162
Computer Architecture
Lecture 16:
Multiprocessor 2: Directory Protocol,
Interconnection Networks
2/28/01
CS252/Patterson
Lec 12.1
Larger MPs
• Separate Memory per Processor
• Local or Remote access via memory controller
• 1 Cache Coherency solution: non-cached pages
• Alternative: directory per cache that tracks state of
every block in every cache
– Which caches have copies of block, dirty vs. clean, ...
• Info per memory block vs. per cache block?
– PLUS: In memory => simpler protocol (centralized/one location)
– MINUS: In memory => directory is ƒ(memory size) vs. ƒ(cache size)
• Prevent directory as bottleneck?
distribute directory entries with memory, each keeping
track of which Procs have copies of their blocks
Distributed Directory MPs
Network Examples
• Bi-directional Ring – EX: HP V Class
• 2-D Mesh and Hypercube – SGI Origin and
Cray T3E
• Crossbar and Omega Network – SMPs, IBM
SP3, and IP Routers
• Clusters using ethernet, Gigabit ethernet,
Myrinet, etc.
Properties of various networks will be
discussed later
CC-NUMA Multiprocessor: Directory
Protocol
• What is Cache Coherent Non-Uniform Memory
Access (CC-NUMA)?
• Similar to Snoopy Protocol: Three states
– Shared: ≥ 1 processors have data, memory up-to-date
– Uncached: no processor has it; not valid in any cache
– Exclusive: 1 processor (owner) has data;
memory out-of-date
• In addition to cache state, must track which
processors have data when in the shared state
(usually bit vector, 1 if processor has copy)
• Directory Size: Big => Limited Directory
Schemes (Not to be discussed)
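To make the "Directory Size: Big" point concrete, here is a rough sketch of the storage overhead of a full bit-vector directory entry per memory block. The 64-byte block size, the 2 state bits, and the processor counts are assumed values for illustration, not figures from the lecture.

```python
def directory_overhead(num_procs, block_bytes, state_bits=2):
    """Directory bits per memory block, as a fraction of the block's data bits.

    Assumes one presence bit per processor plus `state_bits` for the
    Shared/Uncached/Exclusive state (both are illustrative assumptions).
    """
    entry_bits = num_procs + state_bits
    return entry_bits / (block_bytes * 8)


if __name__ == "__main__":
    for procs in (16, 64, 256, 1024):
        pct = directory_overhead(procs, block_bytes=64) * 100
        print(f"{procs:5d} processors: {pct:6.1f}% overhead")
```

Under these assumptions the entry grows linearly with the processor count while the block size stays fixed, which is the pressure that motivates limited directory schemes.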
Directory Protocol
• No bus and don’t want to broadcast:
– interconnect no longer single arbitration point
– all messages have explicit responses
• Terms: typically 3 processors involved
– Local node where a request originates
– Home node where the memory location
of an address resides
– Remote node has a copy of a cache
block, whether exclusive or shared
• Example messages on next slide:
P = processor number, A = address
Example Directory Protocol
• Message sent to directory causes two actions:
– Update the directory
– More messages to satisfy request
• Block is in Uncached state: the copy in memory is the
current value; only possible requests for that block are:
– Read miss: requesting processor is sent data from memory & requestor is
made the only sharing node; state of block made Shared.
– Write miss: requesting processor is sent the value & becomes the
Sharing node. The block is made Exclusive to indicate that the only valid
copy is cached. Sharers indicates the identity of the owner.
• Block is Shared => the memory value is up-to-date:
– Read miss: requesting processor is sent back the data from memory &
requesting processor is added to the sharing set.
– Write miss: requesting processor is sent the value. All processors in the
set Sharers are sent invalidate messages, & Sharers is set to identity
of requesting processor. The state of the block is made Exclusive.
Example Directory Protocol
• Block is Exclusive: current value of the block is held in
the cache of the processor identified by the set
Sharers (the owner) => three possible directory
requests:
– Read miss: owner processor sent data fetch message, causing state of
block in owner’s cache to transition to Shared and causes owner to
send data to directory, where it is written to memory & sent back to
requesting processor.
Identity of requesting processor is added to set Sharers, which still
contains the identity of the processor that was the owner (since it
still has a readable copy). State is shared.
– Data write-back: owner processor is replacing the block and hence
must write it back, making memory copy up-to-date
(the home directory essentially becomes the owner), the block is now
Uncached, and the Sharer set is empty.
– Write miss: block has a new owner. A message is sent to old owner
causing the cache to send the value of the block to the directory
from which it is sent to the requesting processor, which becomes the
new owner. Sharers is set to identity of new owner, and state of
block is made Exclusive.
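The directory-side transitions described on the last two slides can be sketched as a small state machine. This is a minimal model under stated assumptions: the message names (`fetch`, `fetch_invalidate`, `invalidate`, `data_reply`) and the `DirectoryEntry` class are hypothetical labels chosen here, and the sketch tracks only directory state, not cache state or memory contents.

```python
class DirectoryEntry:
    """Directory state for one memory block: Uncached, Shared, or Exclusive."""

    def __init__(self):
        self.state = "Uncached"
        self.sharers = set()          # processors holding a copy (or the owner)

    def read_miss(self, p):
        msgs = []
        if self.state == "Exclusive":
            owner = next(iter(self.sharers))
            msgs.append(("fetch", owner))      # owner sends data home; memory updated
        msgs.append(("data_reply", p))
        self.sharers.add(p)                    # old owner stays in Sharers
        self.state = "Shared"
        return msgs

    def write_miss(self, p):
        msgs = []
        if self.state == "Shared":
            msgs += [("invalidate", q) for q in sorted(self.sharers) if q != p]
        elif self.state == "Exclusive":
            old_owner = next(iter(self.sharers))
            msgs.append(("fetch_invalidate", old_owner))
        msgs.append(("data_reply", p))
        self.sharers = {p}                     # Sharers now names the new owner
        self.state = "Exclusive"
        return msgs

    def write_back(self, p):
        # Only the owner of an Exclusive block can write it back.
        assert self.state == "Exclusive" and self.sharers == {p}
        self.sharers = set()
        self.state = "Uncached"
        return []


if __name__ == "__main__":
    d = DirectoryEntry()
    for op, p in [("read_miss", 0), ("read_miss", 1), ("write_miss", 2), ("read_miss", 0)]:
        msgs = getattr(d, op)(p)
        print(op, p, "->", d.state, sorted(d.sharers), msgs)
```

Each method returns the list of messages the home node would send, so a request that needs several messages (for example, a write miss to a Shared block) is visible as a multi-element list.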
Interconnection Network Routing,
Topology Design Trade-offs
Interconnection Topologies
• Class networks scaling with N
• Logical Properties:
– distance, degree
• Physical properties
– length, width
• Static vs. Dynamic Networks
• Fully connected network
– diameter = 1
– degree = N
– cost?
» bus => O(N), but BW is O(1) - actually worse
» crossbar => O(N^2) for BW O(N)
• VLSI technology determines switch degree
What characterizes a network?
• Topology (what)
– physical interconnection structure of the network graph
– direct: node connected to every switch
– indirect: nodes connected to specific subset of switches
• Routing Algorithm (which)
– restricts the set of paths that msgs may follow
– many algorithms with different properties
» gridlock avoidance?
• Switching Strategy (how)
– how data in a msg traverses a route
– circuit switching vs. packet switching
• Flow Control Mechanism (when)
– when a msg or portions of it traverse a route
– what happens when traffic is encountered?
Flow Control
• What do you do when push comes to shove?
– Ethernet: collision detection and retry after delay
– FDDI, token ring: arbitration token
– TCP/WAN: buffer, drop, adjust rate
– any solution must adjust to output rate
• Link-level flow control
[Figure: link between two nodes with Ready and Data signals]
Topological Properties
• Routing Distance - number of links on route
• Diameter - maximum routing distance
between any two nodes in the network
• Average Distance – Sum of distances
between nodes/number of nodes
• Degree of a Node – Number of links
connected to a node => Cost high if degree
is high
• A network is partitioned by a set of links if
their removal disconnects the graph
• Fault-tolerance – Number of alternate paths
between two nodes in a network
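The properties above can be computed mechanically for any small topology. The sketch below uses breadth-first search on an adjacency list; the helper names are hypothetical, the graph is assumed to be connected, and "average distance" is interpreted here as the mean over all ordered pairs of distinct nodes, which is one reading of the slide's definition.

```python
from collections import deque


def distances_from(adj, src):
    """Breadth-first search: routing distance (links) from src to every node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def topology_metrics(adj):
    """Diameter, average distance, and maximum node degree of a connected graph."""
    n = len(adj)
    all_d = [d for s in adj for t, d in distances_from(adj, s).items() if t != s]
    return {
        "diameter": max(all_d),
        "average_distance": sum(all_d) / (n * (n - 1)),
        "max_degree": max(len(nbrs) for nbrs in adj.values()),
    }


def ring(n):
    """Bidirectional ring of n nodes as an adjacency list."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


if __name__ == "__main__":
    print(topology_metrics(ring(8)))
```

For the 8-node bidirectional ring the diameter is 4 and every node has degree 2; swapping in a different adjacency list gives the same metrics for other topologies.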
Review: Performance Metrics
[Figure: message timeline. Sender: Sender Overhead (processor busy), then Transmission time (size ÷ bandwidth). Time of Flight. Receiver: Transmission time (size ÷ bandwidth), then Receiver Overhead (processor busy). Transport Latency and Total Latency are marked as spans.]
Total Latency = Sender Overhead + Time of Flight +
Message Size ÷ BW + Receiver Overhead
Includes header/trailer in BW calculation?
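As a worked instance of the total latency formula, the sketch below plugs in made-up numbers (1 µs sender overhead, 0.5 µs time of flight, a 1000-byte message on a 100 MB/s link, 1.5 µs receiver overhead). All values are assumptions for illustration, and MB is taken as 10^6 bytes so that bytes ÷ (MB/s) comes out in microseconds.

```python
def total_latency_us(sender_overhead_us, time_of_flight_us, message_bytes,
                     bandwidth_mbytes_per_s, receiver_overhead_us):
    """Total Latency = Sender Overhead + Time of Flight + Size / BW + Receiver Overhead."""
    # bytes / (10**6 bytes per second) = microseconds
    transmission_us = message_bytes / bandwidth_mbytes_per_s
    return (sender_overhead_us + time_of_flight_us
            + transmission_us + receiver_overhead_us)


if __name__ == "__main__":
    print(total_latency_us(1.0, 0.5, 1000, 100.0, 1.5), "microseconds")
```

With these numbers the transmission term (10 µs) dominates; for a very short message the two overhead terms would dominate instead.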
Example Static Network: 2-D Mesh
Architecture
[Figure: nodes arranged in a 4 × 4 grid]
Node 0    Node 1    Node 2    Node 3
Node 4    Node 5    Node 6    Node 7
Node 8    Node 9    Node 10   Node 11
Node 12   Node 13   Node 14   Node 15
(a) a 16-node mesh structure
More Static Networks: Linear
Arrays and Rings
[Figure: Linear Array; Torus; Torus arranged to use short wires]
• Linear Array
– Diameter?
– Average Distance?
– Bisection bandwidth?
– Route A -> B given by relative address R = B-A
• Torus?
• Examples: FDDI, SCI, FiberChannel Arbitrated Loop,
KSR1
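Routing by relative address R = B - A can be sketched in a few lines. The function names are hypothetical, and the ring version assumes a bidirectional 1-D torus in which the message takes the shorter way around.

```python
def route_linear(a, b):
    """Linear array: R = B - A. Sign gives direction, magnitude gives hop count."""
    r = b - a
    return ("+" if r >= 0 else "-", abs(r))


def route_ring(a, b, n):
    """n-node bidirectional ring: go whichever way around is shorter."""
    r = (b - a) % n                  # hops in the + direction
    return ("+", r) if r <= n - r else ("-", n - r)


if __name__ == "__main__":
    print(route_linear(2, 7))        # 5 hops in the + direction
    print(route_ring(2, 7, 8))       # 3 hops in the - direction
```

The wrap-around link is what lets the ring cover the same pair of nodes in 3 hops instead of 5.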
Multidimensional Meshes and Tori
[Figure: 2D Grid; 3D Cube]
• d-dimensional array
– n = k_{d-1} × ... × k_0 nodes
– described by d-vector of coordinates (i_{d-1}, ..., i_0)
• d-dimensional k-ary mesh: N = k^d
– k = N^{1/d}
– described by d-vector of radix k coordinate
• d-dimensional k-ary torus (or k-ary d-cube)?
Ex: Intel Paragon (2D), SGI Origin (Hypercube), Cray T3E
(3D Mesh)
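The radix-k coordinate description can be illustrated with a short sketch. It assumes nodes are numbered 0..N-1 so that the coordinates are simply the radix-k digits of the node number; that numbering, and the function names, are assumptions made here for illustration.

```python
def to_coords(node, k, d):
    """Return (i_{d-1}, ..., i_0): the d radix-k digits of node, most significant first."""
    digits = []
    for _ in range(d):
        digits.append(node % k)
        node //= k
    return tuple(reversed(digits))


def mesh_distance(a, b, k, d):
    """Routing distance in a k-ary d-dimensional mesh (no wrap-around links)."""
    return sum(abs(x - y) for x, y in zip(to_coords(a, k, d), to_coords(b, k, d)))


def torus_distance(a, b, k, d):
    """Routing distance in a k-ary d-cube (wrap-around in every dimension)."""
    return sum(min(abs(x - y), k - abs(x - y))
               for x, y in zip(to_coords(a, k, d), to_coords(b, k, d)))


if __name__ == "__main__":
    k, d = 4, 2                                  # N = k**d = 16 nodes
    print(to_coords(13, k, d))                   # (3, 1)
    print(mesh_distance(0, 15, k, d))            # corner to corner
    print(torus_distance(0, 15, k, d))           # same pair with wrap-around
```

In the 4-ary 2-D case the corner-to-corner distance drops from 6 hops in the mesh to 2 in the torus.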
Hypercubes
• Also called binary n-cubes. # of nodes = N = 2^n.
• O(logN) Hops
• Good bisection BW
• Complexity
– Out degree is n = logN
– correct dimensions in order
– with random comm. 2 ports per processor
[Figure: hypercubes of dimension 0-D, 1-D, 2-D, 3-D, 4-D, 5-D !]
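"Correct dimensions in order" can be sketched as follows: XOR the source and destination addresses and flip each differing bit in a fixed dimension order. Starting from the lowest dimension is an assumption made here; the function name is hypothetical.

```python
def ecube_route(src, dst, n):
    """Path through a binary n-cube, correcting differing address bits
    one dimension at a time, lowest dimension first."""
    path = [src]
    cur = src
    for dim in range(n):
        if (cur ^ dst) & (1 << dim):   # addresses differ in this dimension
            cur ^= 1 << dim            # cross the link in that dimension
            path.append(cur)
    return path


if __name__ == "__main__":
    print(ecube_route(0b000, 0b101, 3))   # [0, 1, 5]
```

The number of hops equals the number of 1 bits in `src ^ dst`, which is at most n = log N.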
Origin Network
[Figure: Origin network configurations built from nodes (N) and routers: (b) 4-node, (c) 8-node, (d) 16-node, (d) 32-node, (e) 64-node with meta-router]
• Each router has six pairs of 1.56 GB/s unidirectional links
– Two to nodes, four to other routers
– latency: 41ns pin to pin across a router
• Flexible cables up to 3 ft long
• Four “virtual channels”: request, reply, other two for
priority or I/O
Case Study: Cray T3D
[Figure: Cray T3D node block diagram with P, $, MMU, DTB, DRAM, and DMA blocks, and network ports Req in, Req out, Resp in, Resp out. Labels:]
3D torus of pairs of PEs
· share net and BLT
· up to 2,048
· 64 MB each
Message queue
· 4,080 × 4 × 64
Prefetch queue
· 16 × 64
Block transfer engine (DMA)
Special registers
· swaperand
· fetch&add
· barrier
150-MHz DEC Alpha (64 bit)
· 8-KB instruction + 8-KB data
· 43-bit virtual address
· 32-bit physical address
· 32- and 64-bit memory and byte operations
· Nonblocking stores and memory barrier
· Prefetch
· Load-lock, store-conditional
DTB: PE# + FC
• Build up info in ‘shell’
• Remote memory operations encoded in address
Trees
• Diameter and ave distance logarithmic
– k-ary tree, height d = log_k N
– address specified by d-vector of radix k coordinates describing path
down from root
• Fixed degree
• Route up to common ancestor and down
– R = B xor A
– let i be position of most significant 1 in R, route up i+1 levels
– down in direction given by low i+1 bits of B
• H-tree space is O(N) with O(√N) long wires
• Bisection BW?
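The up-then-down routing rule can be sketched for the binary case. This assumes a binary tree (k = 2) with processors at the leaves, addressed by the bits of the path from the root; the function name and the return format are hypothetical.

```python
def tree_route(a, b):
    """Route from leaf a to leaf b in a binary tree.

    Returns (levels_up, down_bits): climb levels_up levels to the common
    ancestor, then descend following down_bits (0 or 1 branch at each level).
    """
    r = a ^ b                         # R = B xor A
    if r == 0:
        return 0, []                  # same leaf, nothing to do
    i = r.bit_length() - 1            # position of the most significant 1 in R
    levels_up = i + 1
    down_bits = [(b >> j) & 1 for j in range(i, -1, -1)]   # low i+1 bits of B
    return levels_up, down_bits


if __name__ == "__main__":
    print(tree_route(0b0101, 0b0110))   # (2, [1, 0])
```

Leaves whose addresses differ only in low-order bits share a nearby ancestor, so the route length depends on the highest differing bit rather than on the number of differing bits.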
Real Machines
Machine         Topology    Cycle Time   Channel Width   Routing Delay   Flit
                            (ns)         (bits)          (cycles)        (data bits)
nCUBE/2         Hypercube   25           1               40              32
TMC CM-5        Fat-Tree    25           4               10              4
IBM SP-2        Banyan      25           8               5               16
Intel Paragon   2D Mesh     11.5         16              2               16
Meiko CS-2      Fat-Tree    20           8               7               8
CRAY T3D        3D Torus    6.67         16              2               16
DASH            Torus       30           16              2               16
J-Machine       3D Mesh     31           8               2               8
Monsoon         Butterfly   20           16              2               16
SGI Origin      Hypercube   2.5          20              16              160
Myricom         Arbitrary   6.25         16              50              16
• Wide links, smaller routing delay
• Tremendous variation
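One way to see the variation is to turn cycle time and channel width into a raw per-link rate. The sketch below assumes one channel-width transfer per cycle and ignores any control bits carried on the channel, so the results are rough upper bounds under that assumption rather than measured figures; only a few rows of the table are included.

```python
# (cycle time in ns, channel width in bits), taken from the table above
MACHINES = {
    "nCUBE/2": (25, 1),
    "TMC CM-5": (25, 4),
    "Intel Paragon": (11.5, 16),
    "CRAY T3D": (6.67, 16),
    "SGI Origin": (2.5, 20),
}


def link_mbytes_per_s(cycle_ns, width_bits):
    """Raw link rate in MB/s, assuming width_bits move across the link every cycle."""
    bytes_per_cycle = width_bits / 8
    return bytes_per_cycle / cycle_ns * 1000     # bytes per ns -> MB/s


if __name__ == "__main__":
    for name, (cycle_ns, width_bits) in MACHINES.items():
        print(f"{name:14s} {link_mbytes_per_s(cycle_ns, width_bits):8.1f} MB/s")
```

Under this assumption the raw rates in the table span more than two orders of magnitude, from about 5 MB/s to about 1000 MB/s.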