CS252
Graduate Computer Architecture
Lecture 15
Multiprocessor Networks
March 14th, 2011
John Kubiatowicz
Electrical Engineering and Computer Sciences
University of California, Berkeley
http://www.eecs.berkeley.edu/~kubitron/cs252
Topological Properties
• Routing Distance - number of links on route
• Diameter - maximum routing distance
• Average Distance
• A network is partitioned by a set of links if their removal disconnects the graph

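To make these definitions concrete, here is a minimal Python sketch (not from the lecture) that derives routing distance, diameter, and average distance from an adjacency list via BFS; the 4-node ring and the helper name `hop_distances` are illustrative assumptions.

```python
from collections import deque

def hop_distances(adj: dict, src) -> dict:
    """BFS hop counts from src; each link costs one hop."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

# 4-node ring: diameter 2, average distance 4/3 over distinct pairs
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
all_d = [d for s in ring for t, d in hop_distances(ring, s).items() if s != t]
print("diameter =", max(all_d), " average distance =", sum(all_d) / len(all_d))
```
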
Interconnection Topologies
• Class of networks scaling with N
• Logical Properties:
  – distance, degree
• Physical properties
  – length, width
• Fully connected network
  – diameter = 1
  – degree = N
  – cost?
    » bus => O(N), but BW is O(1) - actually worse
    » crossbar => O(N^2) for BW O(N)
• VLSI technology determines switch degree

Example: Linear Arrays and Rings
[Figure: Linear Array; Torus; Torus arranged to use short wires]
• Linear Array
  – Diameter?
  – Average Distance?
  – Bisection bandwidth?
  – Route A -> B given by relative address R = B-A
• Torus?
• Examples: FDDI, SCI, FiberChannel Arbitrated Loop, KSR1

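The questions above have standard closed-form answers (they reappear in the summary table at the end of the lecture). A small sketch, with illustrative function names:

```python
def linear_array_props(n: int) -> dict:
    """Closed forms for an N-node linear array (bidirectional links)."""
    return {"diameter": n - 1,        # end to end
            "avg_distance": n / 3,    # ~N/3, as on the summary slide
            "bisection_links": 1}     # one link splits it in half

def ring_props(n: int) -> dict:
    """Closed forms for an N-node ring (a torus in one dimension)."""
    return {"diameter": n // 2,
            "avg_distance": n / 4,
            "bisection_links": 2}

print(linear_array_props(1024), ring_props(1024))
```
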
Example: Multidimensional Meshes and Tori
[Figure: 3D Cube; 2D Grid; 2D Torus]
• n-dimensional array
  – N = k_{n-1} x ... x k_0 nodes
  – described by n-vector of coordinates (i_{n-1}, ..., i_0)
• n-dimensional k-ary mesh: N = k^n
  – k = N^{1/n}
  – described by n-vector of radix-k coordinates
• n-dimensional k-ary torus (or k-ary n-cube)?

On Chip: Embeddings in two dimensions
[Figure: a 6 x 3 x 2 array embedded in two dimensions]
• Embed multiple logical dimensions in one physical dimension using long wires
• When embedding a higher-dimensional network in a lower one, either some wires are longer than others, or all wires are long

Trees
• Diameter and average distance logarithmic
  – k-ary tree, height n = log_k N
  – address specified as n-vector of radix-k coordinates describing path down from root
• Fixed degree
• Route up to common ancestor and down (see the sketch below)
  – R = B xor A
  – let i be position of most significant 1 in R, route up i+1 levels
  – down in direction given by low i+1 bits of B
• H-tree space is O(N) with O(sqrt(N)) long wires
• Bisection BW?

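A minimal sketch of the XOR routing rule above, specialized to a binary tree (k = 2); the function name and the up/down hop encoding are assumptions for illustration:

```python
def tree_route(a: int, b: int, n: int) -> list[str]:
    """Route between leaves a and b of a binary tree of height n:
    climb to the common ancestor, then descend (the slide's rule)."""
    r = a ^ b                       # R = B xor A
    if r == 0:
        return []                   # same node
    i = r.bit_length() - 1          # position of most significant 1 in R
    hops = ["up"] * (i + 1)         # route up i+1 levels
    for level in range(i, -1, -1):  # low i+1 bits of B choose the way down
        hops.append("down-" + ("right" if (b >> level) & 1 else "left"))
    return hops

print(tree_route(0b0010, 0b0111, 4))  # up 3 levels, then down via low bits of B
```
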
Fat-Trees
• Fatter links (really more of them) as you go up, so bisection BW scales with N

Butterflies
[Figure: butterfly building block and 16-node butterfly, levels numbered 3 down to 0]
• Tree with lots of roots!
• N log N switches (actually N/2 x log N)
• Exactly one route from any source to any dest
• R = A xor B; at level i use 'straight' edge if r_i = 0, otherwise cross edge (see the sketch below)
• Bisection N/2 vs N(n-1)/n (for n-cube)

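A hedged sketch of the unique-path rule above; the level order (bit 0 first) is an assumption about how the figure numbers levels:

```python
def butterfly_route(a: int, b: int, levels: int) -> list[str]:
    """Select the unique butterfly path: at level i take the 'straight'
    edge if bit i of R = A xor B is 0, else the cross edge."""
    r = a ^ b
    return ["straight" if (r >> i) & 1 == 0 else "cross"
            for i in range(levels)]

print(butterfly_route(0b0101, 0b0110, 4))  # ['cross', 'cross', 'straight', 'straight']
```
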
k-ary n-cubes vs k-ary n-flies
• degree n vs degree k
• N switches vs N log N switches
• diminishing BW per node vs constant
• requires locality vs little benefit to locality
• Can you route all permutations?

Benes network and Fat Tree
[Figure: 16-node Benes Network (Unidirectional); 16-node 2-ary Fat-Tree (Bidirectional)]
• Back-to-back butterfly can route all permutations
• What if you just pick a random mid point?

Hypercubes
• Also called binary n-cubes. # of nodes = N = 2^n.
• O(log N) hops
• Good bisection BW
• Complexity
  – Out degree is n = log N
  – correct dimensions in order (e-cube routing; see the sketch below)
  – with random comm. 2 ports per processor
[Figure: hypercubes of increasing dimension, 0-D through 5-D]

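A minimal sketch of e-cube routing as described above: fix the differing address bits (dimensions) in a fixed order. The function name and lowest-dimension-first order are illustrative assumptions:

```python
def ecube_route(a: int, b: int, n: int) -> list[int]:
    """e-cube routing in a binary n-cube: traverse each dimension whose
    address bit differs, in a fixed (here: ascending) order."""
    r = a ^ b
    return [dim for dim in range(n) if (r >> dim) & 1]

print(ecube_route(0b000, 0b101, 3))  # hops in dimensions [0, 2]
```
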
Some Properties
• Routing
  – relative distance: R = (b_{n-1} - a_{n-1}, ..., b_0 - a_0)
  – traverse r_i = b_i - a_i hops in each dimension
  – dimension-order routing? Adaptive routing? (a torus sketch follows below)
• Average Distance / Wire Length?
  – n x 2k/3 for mesh
  – nk/2 for cube
• Degree?
• Bisection bandwidth? Partitioning?
  – k^{n-1} bidirectional links
• Physical layout?
  – 2D in O(N) space, short wires
  – higher dimension?

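A minimal sketch of the relative-address computation above for a k-ary n-cube (torus), where each dimension may wrap whichever way around the ring is shorter; names are illustrative:

```python
def torus_relative_route(a, b, k: int) -> list[int]:
    """Per-dimension signed hop counts r_i for a k-ary n-cube,
    taking the shorter direction around each ring."""
    route = []
    for a_i, b_i in zip(a, b):
        r_i = (b_i - a_i) % k
        if r_i > k // 2:
            r_i -= k              # wrapping backwards is shorter
        route.append(r_i)
    return route

print(torus_relative_route((0, 1), (3, 3), k=4))  # [-1, 2]
```
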
The Routing problem: Local decisions
[Figure: switch datapath - input ports, receiver, input buffer, cross-bar, output buffer, transmitter, output ports; control logic handles routing and scheduling]
• Routing at each hop: Pick next output port!

How do you build a crossbar?
[Figure: two implementations - a grid of crosspoints connecting inputs I0-I3 to outputs O0-O3, and a time-multiplexed version built from a phase-addressed RAM (addr, Din, Dout)]

Input buffered switch
[Figure: input-buffered switch - per-input routing logic R0-R3 feeding a cross-bar, with a scheduler]
• Independent routing logic per input
  – FSM
• Scheduler logic arbitrates each output
  – priority, FIFO, random
• Head-of-line blocking problem
  – Message at head of queue blocks messages behind it

Output Buffered Switch
[Figure: output-buffered switch - routing logic R0-R3 fans each input out to per-output buffers, under central control]
• How would you build a shared pool?

Properties of Routing Algorithms
• Routing algorithm:
  – R: N x N -> C, which at each switch maps the destination node n_d to the next channel on the route
  – which of the possible paths are used as routes?
  – how is the next hop determined?
    » arithmetic
    » source-based port select
    » table driven
    » general computation
• Deterministic
  – route determined by (source, dest), not intermediate state (i.e. traffic)
• Adaptive
  – route influenced by traffic along the way
• Minimal
  – only selects shortest paths
• Deadlock free
  – no traffic pattern can lead to a situation where packets are deadlocked and never move forward

Example: Simple Routing Mechanism
• need to select output port for each input packet
  – in a few cycles
• Simple arithmetic in regular topologies
  – ex: Δx, Δy routing in a grid (see the sketch below)
    » Δx < 0         => west (-x)
    » Δx > 0         => east (+x)
    » Δx = 0, Δy < 0 => south (-y)
    » Δx = 0, Δy > 0 => north (+y)
    » Δx = 0, Δy = 0 => processor
• Reduce relative address of each dimension in order
  – Dimension-order routing in k-ary d-cubes
  – e-cube routing in n-cube

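The port-selection table above is a few lines of arithmetic; a minimal sketch (function name and return strings are illustrative):

```python
def grid_port(dx: int, dy: int) -> str:
    """Output-port selection for Δx, Δy routing in a 2-D grid,
    matching the table above (x is resolved before y)."""
    if dx < 0: return "west (-x)"
    if dx > 0: return "east (+x)"
    if dy < 0: return "south (-y)"
    if dy > 0: return "north (+y)"
    return "processor"

print(grid_port(-2, 3))  # 'west (-x)': keep correcting x before touching y
```
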
Communication Performance
[Figure: packet format - routing and control header, data payload, error code trailer; a packet is a sequence of digital symbols transmitted over a channel]
• Typical packet includes data + encapsulation bytes
  – Unfragmented packet size S = S_data + S_encapsulation
• Routing Time:
  – Time(S)_{s-d} = overhead + routing delay + channel occupancy + contention delay
  – Channel occupancy = S/b = (S_data + S_encapsulation)/b
  – Routing delay in cycles (D):
    » Time to get head of packet to next hop
  – Contention?

Store&Forward vs Cut-Through Routing
[Figure: time-space diagrams of a 4-flit packet crossing three intermediate switches under store & forward vs cut-through routing]
• Time: store & forward h(S/b + Δ) vs cut-through S/b + hΔ (see the sketch below)
  OR (in cycles): h(S/w + D) vs S/w + hD
• what if message is fragmented?
• wormhole vs virtual cut-through

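A minimal sketch of the two latency formulas above, under the assumption that b is in bytes per cycle and Δ is the per-hop routing delay in cycles; parameter names are illustrative:

```python
def store_and_forward(S: float, b: float, delta: float, h: int) -> float:
    """Whole packet is received at each of the h hops before moving on."""
    return h * (S / b + delta)

def cut_through(S: float, b: float, delta: float, h: int) -> float:
    """Head moves on as soon as it is routed; the body pipelines behind it."""
    return S / b + h * delta

# S = 40 bytes, b = 1 byte/cycle, routing delay 2 cycles, 5 hops
print(store_and_forward(40, 1, 2, 5), cut_through(40, 1, 2, 5))  # 210.0 vs 50.0
```
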
Contention
• Two packets trying to use the same link at same time
  – limited buffering
  – drop?
• Most parallel machine networks block in place
  – link-level flow control
  – tree saturation
• Closed system - offered load depends on delivered
  – Source Squelching

Bandwidth
• What affects local bandwidth?
  – packet density: b x S_data/S
  – routing delay: b x S_data/(S + wD)
  – contention
    » endpoints
    » within the network
• Aggregate bandwidth
  – bisection bandwidth
    » sum of bandwidth of smallest set of links that partition the network
  – total bandwidth of all the channels: Cb
  – suppose N hosts each issue a packet every M cycles with average distance h
    » each msg occupies h channels for l = S/w cycles each
    » C/N channels available per node
    » link utilization for store-and-forward (see the sketch below):
      r = (hl/M channel cycles/node)/(C/N) = Nhl/MC < 1!
    » link utilization for wormhole routing?

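The utilization formula above is a one-liner; a sketch with illustrative numbers (the workload parameters are assumptions, not from the lecture):

```python
def sf_link_utilization(N: int, h: float, l: float, M: float, C: int) -> float:
    """Store-and-forward link utilization r = N*h*l / (M*C); must stay < 1."""
    return N * h * l / (M * C)

# 64 hosts, 4 hops average, 10-cycle packets, one packet per 100 cycles, 256 channels
print(sf_link_utilization(64, 4, 10, 100, 256))  # 0.1
```
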
Saturation
[Figure: two plots - latency vs delivered bandwidth, rising sharply at saturation; delivered bandwidth vs offered bandwidth, flattening past saturation]

How Many Dimensions?
• n = 2 or n = 3
  – Short wires, easy to build
  – Many hops, low bisection bandwidth
  – Requires traffic locality
• n >= 4
  – Harder to build, more wires, longer average length
  – Fewer hops, better bisection bandwidth
  – Can handle non-local traffic
• k-ary n-cubes provide a consistent framework for comparison
  – N = k^n
  – scale dimension (n) or nodes per dimension (k)
  – assume cut-through

Traditional Scaling: Latency scaling with N
[Figure: two plots of average latency vs machine size N (up to 10000 nodes), for T(S=40) and T(S=140), with curves n=2, n=3, n=4, and k=2; S/w marked on the latency axis]
• Assumes equal channel width
  – independent of node count or dimension
  – dominated by average distance

Average Distance
[Figure: average distance vs dimension for N = 256, 1024, 16384, 1048576; ave dist = n(k-1)/2]
• but, equal channel width is not equal cost!
• Higher dimension => more channels

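A small sketch reproducing the curve shape in the figure from the formula ave dist = n(k-1)/2 with k = N^(1/n); the dimensions sampled are an arbitrary choice:

```python
def ave_distance(n: int, N: int) -> float:
    """Average distance n(k-1)/2 in a k-ary n-cube with N = k**n nodes."""
    k = N ** (1.0 / n)
    return n * (k - 1) / 2

for n in (2, 5, 10, 20):            # average distance falls as dimension rises
    print(n, round(ave_distance(n, 1024), 1))
```
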
Dally Paper: In the 3D world
• For N nodes, bisection area is O(N^{2/3})
• For large N, bisection bandwidth is limited to O(N^{2/3})
  – Bill Dally, IEEE TPDS, [Dal90a]
  – For fixed bisection bandwidth, low-dimensional k-ary n-cubes are better (otherwise higher is better)
  – i.e., a few short fat wires are better than many long thin wires
  – What about many long fat wires?

Dally paper (con't)
• Equal bisection: W = 1 for hypercube => W = k/2
• Three wire models:
  – Constant delay, independent of length
  – Logarithmic delay with length (exponential driver tree)
  – Linear delay (speed of light/optimal repeaters)
[Figure: latency under the logarithmic-delay and linear-delay wire models]

Equal cost in k-ary n-cubes
• Equal number of nodes?
• Equal number of pins/wires?
• Equal bisection bandwidth?
• Equal area?
• Equal wire length?
What do we know? (see the sketch below)
• switch degree: n, diameter = n(k-1)
• total links = Nn
• pins per node = 2wn
• bisection = k^{n-1} = N/k links in each direction
• 2Nw/k wires cross the middle

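The cost figures above packaged as a minimal sketch (function name and the sample parameters are illustrative):

```python
def kary_ncube_costs(k: int, n: int, w: int) -> dict:
    """Cost figures from the slide for a k-ary n-cube with channel width w."""
    N = k ** n
    return {"nodes": N,
            "switch_degree": n,
            "diameter": n * (k - 1),
            "total_links": N * n,
            "pins_per_node": 2 * w * n,
            "bisection_links": N // k,        # k^(n-1) in each direction
            "bisection_wires": 2 * N * w // k}

print(kary_ncube_costs(k=4, n=3, w=16))
```
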
Latency for Equal Width Channels
[Figure: average latency (S = 40, D = 2) vs dimension for N = 256, 1024, 16384, 1048576]
• total links(N) = Nn

Latency with Equal Pin Count
[Figure: average latency T(S=40B) and T(S=140B) vs dimension n for 256, 1024, 16k, and 1M nodes]
• Baseline n=2, has w = 32 (128 wires per node)
• fix 2nw pins => w(n) = 64/n (modeled in the sketch below)
• distance down with n, but channel time up

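The trade-off above can be seen in a small model. This is a sketch under stated assumptions (cut-through latency h·D + S/w; S in bytes, so S contributes 8S wire-bits; w(n) = 64/n from the slide; h = n(k-1)/2), not the exact model behind the plots:

```python
def latency_equal_pins(S: int, N: int, n: int, D: int = 2) -> float:
    """Cut-through latency h*D + S/w under the equal-pin budget w(n) = 64/n."""
    k = N ** (1.0 / n)
    w = 64 / n                   # narrower channels as dimension rises
    h = n * (k - 1) / 2          # but fewer hops
    return h * D + S * 8 / w     # S bytes -> 8*S bits across w wires

for n in (2, 4, 8, 16):          # hop term shrinks, channel term grows
    print(n, round(latency_equal_pins(S=40, N=1024, n=n)))
```
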
Latency with Equal Bisection Width
• N-node hypercube has N/2 bisection links
• 2D torus has 2N^{1/2}
• Fixed bisection => w(n) = N^{1/n}/2 = k/2
• 1M nodes, n=2 has w=512!
[Figure: average latency T(S=40) vs dimension n for 256, 1024, 16k, and 1M nodes]

Larger Routing Delay (w/ equal pin)
[Figure: average latency T(S = 140 B) vs dimension n for 256, 1024, 16k, and 1M nodes]
• Dally's conclusions strongly influenced by assumption of small routing delay
  – Here, routing delay D=20

Saturation
[Figure: latency vs average channel utilization for S/w = 40, 16, 8, 4]
• Fatter links shorten queuing delays

Discussion of paper: Virtual Channel Flow Control
• Basic Idea: Use virtual channels to reduce contention
  – Provided a model of k-ary n-flies
  – Also provided simulation
• Tradeoff: Better to split buffers into virtual channels
  – Example (constant total storage for 2-ary 8-fly):

When are virtual channels allocated?
[Figure: input ports to output ports through a cross-bar; hardware-efficient design for crossbar]
• Two separate processes:
  – Virtual channel allocation
  – Switch/connection allocation
• Virtual Channel Allocation
  – Choose route and free output virtual channel
  – Really means: source of link tracks channels at destination
• Switch Allocation
  – For incoming virtual channel, negotiate switch on outgoing pin

Reducing routing delay: Express Cubes
• Problem: Low-dimensional networks have high k
  – Consequence: may have to travel many hops in a single dimension
  – Routing latency can dominate long-distance traffic patterns
• Solution: Provide one or more "express" links
  – Like express trains, express elevators, etc.
    » Delay linear with distance, lower constant
    » Closer to "speed of light" in medium
    » Lower power, since no router cost
  – "Express Cubes: Improving performance of k-ary n-cube interconnection networks," Bill Dally 1991
• Another Idea: route with pass transistors through links

Summary
• Network Topologies:

  Topology       Degree     Diameter        Ave Dist     Bisection   D (D_ave) @ P=1024
  1D Array       2          N-1             N/3          1           huge
  1D Ring        2          N/2             N/4          2
  2D Mesh        4          2(N^1/2 - 1)    2/3 N^1/2    N^1/2       63 (21)
  2D Torus       4          N^1/2           1/2 N^1/2    2N^1/2      32 (16)
  k-ary n-cube   2n         nk/2            nk/4         nk/4        15 (7.5) @ n=3
  Hypercube      n = log N  n               n/2          N/2         10 (5)

• Fair metrics of comparison
  – Equal cost: area, bisection bandwidth, etc.
• Routing Algorithms restrict set of routes within the topology
  – simple mechanism selects turn at each hop
  – arithmetic, selection, lookup
• Virtual Channels
  – Adds complexity to router
  – Can be used for performance
  – Can be used for deadlock avoidance

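As a spot-check of the last column of the table, a minimal sketch evaluating the diameter and average-distance formulas at P = 1024 for three of the rows (the row selection is arbitrary; small rounding differences from the slide's values are expected):

```python
import math

N = 1024
rows = {
    "2D Torus":     (math.sqrt(N),          math.sqrt(N) / 2),        # N^1/2, 1/2 N^1/2
    "k-ary n-cube": (3 * N**(1/3) / 2,      3 * N**(1/3) / 4),        # nk/2, nk/4 at n=3
    "Hypercube":    (math.log2(N),          math.log2(N) / 2),        # n, n/2
}
for name, (diam, ave) in rows.items():
    print(f"{name:14s} {diam:5.1f} ({ave:.1f})")  # 32 (16), 15.1 (7.6), 10 (5)
```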