CS252
Graduate Computer Architecture
Lecture 21
Multiprocessor Networks (con’t)
John Kubiatowicz
Electrical Engineering and Computer Sciences
University of California, Berkeley
http://www.eecs.berkeley.edu/~kubitron/cs252
Review: On Chip: Embeddings in Two Dimensions
[Figure: embedding a 6×3×2 logical topology in two physical dimensions]
• Embed multiple logical dimensions in one physical dimension using long wires
• When embedding a higher dimension in a lower one, either some wires are longer than others, or all wires are long
4/15/2009
cs252-S09, Lecture 21
2
Review:
Store&Forward vs Cut-Through Routing
[Figure: timing diagrams from Source to Dest contrasting store & forward routing (each hop receives all four flits 3 2 1 0 before forwarding) with cut-through routing (the head flit advances while the tail is still in flight)]
Time for an h-hop route, message size n, bandwidth b, channel width w, per-hop routing delay Δ:
• Store & forward: h(n/b + Δ), or in cycles h(n/w + Δ)
• Cut-through: n/b + hΔ, or in cycles n/w + hΔ
• what if message is fragmented?
• wormhole vs virtual cut-through
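The two latency expressions above can be sketched directly (a toy model; `n` is message size in bits, `w` channel width in bits per cycle, `h` hops, `delta` the per-hop routing delay Δ):

```python
def store_and_forward_cycles(n, w, h, delta):
    # Every hop buffers the full message before forwarding it
    return h * (n / w + delta)

def cut_through_cycles(n, w, h, delta):
    # Only the head flit pays the per-hop delay; the body pipelines behind it
    return n / w + h * delta

# 128-bit message, 16-bit channels, 5 hops, 2-cycle routing delay:
sf = store_and_forward_cycles(128, 16, 5, 2)   # 5 * (8 + 2) = 50.0
ct = cut_through_cycles(128, 16, 5, 2)         # 8 + 5*2 = 18.0
```

The gap between the two grows linearly with hop count, which is why cut-through matters most in low-dimensional (many-hop) networks.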
Contention
• Two packets trying to use the same link at same time
– limited buffering
– drop?
• Most parallel machine networks block in place
– link-level flow control
– tree saturation
• Closed system: offered load depends on delivered load
– Source Squelching
Bandwidth
• What affects local bandwidth?
– packet density: b × n_data/n
– routing delay: b × n_data/(n + wΔ)
– contention
» at endpoints
» within the network
• Aggregate bandwidth
– bisection bandwidth
» sum of bandwidth of smallest set of links that partition the network
– total bandwidth of all the channels: Cb
– suppose N hosts issue a packet every M cycles with average distance h
» each message occupies h channels for l = n/w cycles each
» C/N channels available per node
» link utilization for store-and-forward:
ρ = (hl/M channel cycles per node)/(C/N) = Nhl/MC < 1!
» link utilization for wormhole routing?
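The store-and-forward utilization bound can be checked numerically (a sketch; parameter names follow the slide):

```python
def link_utilization_sf(N, h, l, M, C):
    # Each node offers h*l channel-cycles every M cycles and owns C/N channels,
    # so rho = (h*l/M) / (C/N) = N*h*l / (M*C); the network saturates as rho -> 1
    return (N * h * l) / (M * C)

# 64 hosts, 4-hop average distance, l = n/w = 8 cycles per channel,
# one packet per 100 cycles, C = 256 total channels:
rho = link_utilization_sf(64, 4, 8, 100, 256)   # 0.08, well below saturation
```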
Saturation
[Figure: two plots. Left: latency (0–80) vs delivered bandwidth (0–1), rising steeply at saturation. Right: delivered bandwidth (0–0.8) vs offered bandwidth (0–1.2), leveling off at saturation]
How Many Dimensions?
• n = 2 or n = 3
– Short wires, easy to build
– Many hops, low bisection bandwidth
– Requires traffic locality
• n >= 4
– Harder to build, more wires, longer average length
– Fewer hops, better bisection bandwidth
– Can handle non-local traffic
• k-ary d-cubes provide a consistent framework for
comparison
– N = k^d
– scale dimension (d) or nodes per dimension (k)
– assume cut-through
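As a sketch of this framework, one can enumerate the (d, k) design points for a fixed N and compare hop count against bisection (helper names are mine; average distance uses the d(k−1)/2 figure from a later slide):

```python
import math

def design_points(N):
    """All (d, k) with k**d == N, with average distance and bisection links."""
    points = []
    for d in range(1, int(math.log2(N)) + 1):
        k = round(N ** (1 / d))
        if k ** d == N:
            # (dimension, radix, average distance, bisection links N/k)
            points.append((d, k, d * (k - 1) / 2, N // k))
    return points

# For N = 64 the points are d=1 (k=64), d=2 (k=8), d=3 (k=4), d=6 (k=2):
# low d means short wires but many hops and a narrow bisection
```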
Traditional Scaling: Latency scaling with N
[Figure: average latency T(n = 40) and T(n = 140) vs machine size N (up to 10,000 nodes), with curves for d = 2, d = 3, d = 4, k = 2, and the n/w floor]
• Assumes equal channel width
– independent of node count or dimension
– dominated by average distance
Average Distance
[Figure: average distance vs dimension (0–25) for N = 256, 1024, 16384, and 1048576; ave dist = d(k−1)/2 drops steeply as dimension grows]
• but, equal channel width is not equal cost!
• Higher dimension => more channels
In the 3D world
• For n nodes, bisection area is O(n^(2/3))
• For large n, bisection bandwidth is limited to O(n^(2/3))
– Bill Dally, IEEE TPDS, [Dal90a]
– For fixed bisection bandwidth, low-dimensional k-ary n-cubes are better (otherwise higher is better)
– i.e., a few short fat wires are better than many long thin wires
– What about many long fat wires?
Dally paper (con’t)
• Equal bisection: W = 1 for hypercube ⇒ W = k/2
• Three wire models:
– Constant delay, independent of length
– Logarithmic delay with length (exponential driver tree)
– Linear delay (speed of light/optimal repeaters)
[Figure: latency vs dimension under the logarithmic-delay and linear-delay wire models]
Equal cost in k-ary n-cubes
• Equal number of nodes?
• Equal number of pins/wires?
• Equal bisection bandwidth?
• Equal area?
• Equal wire length?
What do we know?
• switch degree: d; diameter = d(k−1)
• total links = Nd
• pins per node = 2wd
• bisection = k^(d−1) = N/k links in each direction
• 2Nw/k wires cross the middle
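These quantities can be bundled into a small helper (a sketch; the dictionary keys are mine):

```python
def kary_dcube_costs(k, d, w):
    # Cost/performance quantities for a k-ary d-cube with channel width w
    N = k ** d
    return {
        "nodes": N,
        "diameter": d * (k - 1),
        "total_links": N * d,
        "pins_per_node": 2 * w * d,
        "bisection_links": k ** (d - 1),     # = N/k, in each direction
        "bisection_wires": 2 * N * w // k,   # wires crossing the middle
    }

costs = kary_dcube_costs(k=4, d=3, w=16)
# {'nodes': 64, 'diameter': 9, 'total_links': 192, 'pins_per_node': 96,
#  'bisection_links': 16, 'bisection_wires': 512}
```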
Latency for Equal Width Channels
[Figure: average latency (n = 40, Δ = 2) vs dimension (0–25) for N = 256, 1024, 16384, and 1048576]
• total links(N) = Nd
Latency with Equal Pin Count
[Figure: average latency T(n = 40 B) and T(n = 140 B) vs dimension d (0–25) for 256, 1024, 16K, and 1M nodes]
• Baseline d=2, has w = 32 (128 wires per node)
• fix 2dw pins => w(d) = 64/d
• distance down with d, but channel time up (w shrinks)
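A sketch of the equal-pin trade-off, assuming the unloaded cut-through model from earlier and the slide's 128-pin budget (`delta` is the per-hop routing delay Δ):

```python
def width_equal_pins(d, pins=128):
    # Fix pins per node = 2*d*w  =>  w(d) = pins / (2*d); baseline d=2 gives w=32
    return pins // (2 * d)

def latency_equal_pins(n_bits, d, k, delta, pins=128):
    w = width_equal_pins(d, pins)
    h = d * (k - 1) / 2              # average distance: fewer hops at higher d...
    return n_bits / w + h * delta    # ...but narrower channels, so n/w grows

# 1024 nodes, 320-bit message, Δ = 2: d=2 (k=32) vs d=5 (k=4)
t2 = latency_equal_pins(320, 2, 32, 2)   # w=32: 320/32 + 31*2 = 72.0
t5 = latency_equal_pins(320, 5, 4, 2)    # w=12: 320/12 + 7.5*2 ≈ 41.7
```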
Latency with Equal Bisection Width
• N-node hypercube has N bisection links
• 2-d torus has 2N^(1/2)
• Fixed bisection ⇒ w(d) = N^(1/d)/2 = k/2
• 1M nodes, d = 2 has w = 512!
[Figure: average latency T(n = 40) vs dimension d (0–25) for 256, 1024, 16K, and 1M nodes]
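The fixed-bisection channel width can be sketched as (helper name mine):

```python
def width_equal_bisection(N, d):
    # Fixed bisection across designs => w(d) = N**(1/d) / 2 = k/2
    k = round(N ** (1 / d))
    return k // 2

# The extreme point from the slide: 1M nodes at d = 2 gives k = 1024, w = 512
w = width_equal_bisection(2**20, 2)   # 512
```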
Larger Routing Delay (w/ equal pin)
[Figure: average latency T(n = 140 B) vs dimension d (0–25) for 256, 1024, 16K, and 1M nodes, with routing delay Δ = 20]
• Dally’s conclusions strongly influenced by assumption
of small routing delay
– Here, routing delay Δ = 20
Latency under Contention
[Figure: latency vs channel utilization (0–1) for n = 40, 16, 8, 4 at (d = 2, k = 32) and (d = 3, k = 10); latency diverges as utilization approaches 1]
• Optimal packet size? Channel utilization?
Saturation
[Figure: latency vs average channel utilization for n/w = 40, 16, 8, and 4]
• Fatter links shorten queuing delays
Phits per cycle
[Figure: latency vs flits per cycle per processor (0–0.25) for (n = 8, d = 3, k = 10) and (n = 8, d = 2, k = 32)]
• higher degree network has larger available bandwidth
– cost?
Discussion
• Rich set of topological alternatives with deep
relationships
• Design point depends heavily on cost model
– nodes, pins, area, ...
– Wire length or wire delay metrics favor small dimension
– Long (pipelined) links increase optimal dimension
• Need a consistent framework and analysis to
separate opinion from design
• Optimal point changes with technology
Another Idea: Express Cubes
• Problem: Low-dimensional networks have high k
– Consequence: may have to travel many hops in single dimension
– Routing latency can dominate long-distance traffic patterns
• Solution: Provide one or more “express” links
– Like express trains, express elevators, etc
» Delay linear with distance, lower constant
» Closer to “speed of light” in medium
» Lower power, since no router cost
– “Express Cubes: Improving performance of k-ary n-cube
interconnection networks,” Bill Dally 1991
• Another Idea: route with pass transistors through links
The Routing problem: Local decisions
[Diagram: a switch: input ports (receiver, input buffer) feed a crossbar to output ports (output buffer, transmitter), under routing/scheduling control]
• Routing at each hop: Pick next output port!
How do you build a crossbar?
[Diagram: two crossbar implementations for inputs I0–I3 and outputs O0–O3: a grid of crosspoints selecting one input per output, and a time-multiplexed design built from a RAM (phase-driven addr, Din/Dout)]
Input-Buffered Switch
[Diagram: input ports with per-input routing logic R0–R3 feeding a crossbar to the output ports, with a scheduler arbitrating]
• Independent routing logic per input
– FSM
• Scheduler logic arbitrates each output
– priority, FIFO, random
• Head-of-line blocking problem
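Head-of-line blocking shows up even in a toy simulation (entirely illustrative; one packet per input and per output per cycle, strict FIFO inputs, first requester wins):

```python
from collections import deque

def drain_cycles(queues):
    # Each cycle, every queue's *head* packet competes for its output;
    # packets behind a blocked head cannot bypass it (FIFO inputs)
    cycles = 0
    while any(queues):
        granted = set()
        for q in queues:
            if q and q[0] not in granted:
                granted.add(q.popleft())
        cycles += 1
    return cycles

# Input 0 wants output A; input 1 wants A then B. In cycle 1, input 1's
# packet for the idle output B waits behind its head's loss for A:
cycles = drain_cycles([deque(["A"]), deque(["A", "B"])])   # 3 cycles
# With per-output (virtual output) queues this would drain in 2
```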
Output Buffered Switch
[Diagram: routing logic R0–R3 at each input fans out to buffers at every output port, under central control]
• How would you build a shared pool?
Output scheduling
[Diagram: input buffers with routing logic R0–R3, a crossbar, and output ports O0–O2, each output running its own arbiter]
• n independent arbitration problems?
– static priority, random, round-robin
• simplifications due to routing algorithm?
• general case is max bipartite matching
Switch Components
• Output ports
– transmitter (typically drives clock and data)
• Input ports
– synchronizer aligns data signal with local clock
domain
– essentially FIFO buffer
• Crossbar
– connects each input to any output
– degree limited by area or pinout
• Buffering
• Control logic
– complexity depends on routing logic and scheduling
algorithm
– determine output port for each incoming packet
– arbitrate among inputs directed at same output
Properties of Routing Algorithms
• Routing algorithm:
– R: N x N -> C, which at each switch maps the destination node nd to
the next channel on the route
– which of the possible paths are used as routes?
– how is the next hop determined?
» arithmetic
» source-based port select
» table driven
» general computation
• Deterministic
– route determined by (source, dest), not intermediate state (i.e. traffic)
• Adaptive
– route influenced by traffic along the way
• Minimal
– only selects shortest paths
• Deadlock free
– no traffic pattern can lead to a situation where packets are deadlocked
and never move forward
Routing Mechanism
• need to select output port for each input packet
– in a few cycles
• Simple arithmetic in regular topologies
– ex: Δx, Δy routing in a grid
» west (−x): Δx < 0
» east (+x): Δx > 0
» south (−y): Δx = 0, Δy < 0
» north (+y): Δx = 0, Δy > 0
» processor: Δx = 0, Δy = 0
• Reduce relative address of each dimension in order
– Dimension-order routing in k-ary d-cubes
– e-cube routing in n-cube
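The Δx, Δy table above maps directly onto a routing function (a sketch for a 2-D grid; port names follow the table):

```python
def dimension_order_next_port(cur, dest):
    # Correct x first, then y, as in the table above; coordinates are (x, y)
    dx = dest[0] - cur[0]
    dy = dest[1] - cur[1]
    if dx < 0:
        return "west"
    if dx > 0:
        return "east"
    if dy < 0:
        return "south"
    if dy > 0:
        return "north"
    return "processor"   # Δx = Δy = 0: deliver locally
```

Simple arithmetic like this is why regular topologies can pick the output port in a few cycles.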
Deadlock Freedom
• How can deadlock arise?
– necessary conditions:
» shared resource
» incrementally allocated
» non-preemptible
– think of a channel as a shared
resource that is acquired incrementally
» source buffer then dest. buffer
» channels along a route
• How do you avoid it?
– constrain how channel resources are allocated
– ex: dimension order
• How do you prove that a routing algorithm is
deadlock free?
– Show that channel dependency graph has no cycles!
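The acyclicity check itself is a standard depth-first search (a sketch; the graph maps each channel to the channels a packet holding it could be waiting on):

```python
def has_cycle(graph):
    # graph: dict channel -> iterable of channels it can wait on (the CDG)
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in graph}

    def dfs(v):
        color[v] = GRAY                      # on the current DFS path
        for u in graph.get(v, ()):
            if color.get(u, WHITE) == GRAY:  # back edge: a dependence cycle
                return True
            if color.get(u, WHITE) == WHITE and dfs(u):
                return True
        color[v] = BLACK                     # fully explored, no cycle through v
        return False

    return any(color[v] == WHITE and dfs(v) for v in graph)
```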
Consider Trees
• Why is the obvious routing on X deadlock free?
– butterfly?
– tree?
– fat tree?
• Any assumptions about routing mechanism? Amount of buffering?
Up*-Down* routing for general topology
• Given any bidirectional network
• Construct a spanning tree
• Number the nodes, increasing from leaves to root
• UP edges increase node numbers
• Any Source → Dest by an UP*-DOWN* route
– up edges, single turn, down edges
– Proof of deadlock freedom?
• Performance?
– Some numberings and routes much better than others
– interacts with topology in strange ways
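A route's legality under up*-down* can be checked mechanically (a sketch; `number` maps node → its spanning-tree number, higher toward the root):

```python
def is_updown(route, number):
    # Legal iff the node numbers rise (up edges), turn at most once, then fall
    nums = [number[n] for n in route]
    i = 0
    while i + 1 < len(nums) and nums[i] < nums[i + 1]:
        i += 1                                # consume the up* prefix
    while i + 1 < len(nums) and nums[i] > nums[i + 1]:
        i += 1                                # consume the down* suffix
    return i == len(nums) - 1                 # nothing left over => legal

numbering = {"a": 1, "b": 3, "c": 5, "d": 4, "e": 2}
ok = is_updown(["a", "b", "c", "d", "e"], numbering)   # up, up, down, down
bad = is_updown(["a", "b", "e", "d"], numbering)       # down then up again
```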
Turn Restrictions in X,Y
[Diagram: the eight possible turns among +X, −X, +Y, −Y in a 2-D mesh]
• XY routing forbids 4 of 8 turns and leaves no
room for adaptive routing
• Can you allow more turns and still be deadlock
free?
Minimal turn restrictions in 2D
[Diagram: allowed turns among +x, −x, +y, −y under the west-first, north-last, and negative-first restrictions]
Example legal west-first routes
• Can route around failures or congestion
• Can combine turn restrictions with virtual
channels
General Proof Technique
• resources are logically associated with channels
• messages introduce dependences between
resources as they move forward
• need to articulate the possible dependences that
can arise between channels
• show that there are no cycles in Channel
Dependence Graph
– find a numbering of channel resources such that every legal route follows a monotonic sequence
⇒ no traffic pattern can lead to deadlock
• network need not be acyclic, just channel
dependence graph
Example: k-ary 2D array
• Thm: Dimension-ordered (x,y) routing
is deadlock free
• Numbering
– +x channel (i,y) → (i+1,y) gets i
– similarly for −x, with 0 as the most positive edge
– +y channel (x,j) → (x,j+1) gets N+j
– similarly for −y channels
• any routing sequence: x direction,
turn, y direction is increasing
[Figure: a 4×4 array with its channels numbered according to this scheme]
• Generalization:
– “e-cube routing” on 3-D: X then Y then Z
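The monotonicity argument can be checked by code. This is a sketch for a k×k array; for simplicity I use k rather than N as the y-channel offset (any value above every x-channel number works):

```python
def xy_route_channel_numbers(src, dst, k):
    # Channel numbering per the slide:
    #   +x channel (i,y)->(i+1,y) gets i;   -x channel (i,y)->(i-1,y) gets k-1-i
    #   +y channel (x,j)->(x,j+1) gets k+j; -y channel (x,j)->(x,j-1) gets k+(k-1-j)
    nums = []
    x, y = src
    while x != dst[0]:                      # correct x first...
        if dst[0] > x:
            nums.append(x); x += 1
        else:
            nums.append(k - 1 - x); x -= 1
    while y != dst[1]:                      # ...then y
        if dst[1] > y:
            nums.append(k + y); y += 1
        else:
            nums.append(k + (k - 1 - y)); y -= 1
    return nums

# Every x-then-y route visits strictly increasing channel numbers,
# so the channel dependence graph can contain no cycle.
```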
Channel Dependence Graph
[Figure: the channel dependence graph for dimension-ordered routing on the 4×4 array; every edge goes from a lower-numbered channel to a higher-numbered one, so the graph is acyclic]
More examples:
• What about wormhole routing on a ring?
[Figure: an 8-node unidirectional ring, nodes 0–7]
• Or: Unidirectional Torus of higher dimension?
Deadlock free wormhole networks?
• Basic dimension order routing techniques don’t work for
unidirectional k-ary d-cubes
– only for k-ary d-arrays (bi-directional)
– And dimension-ordered routing is not adaptive!
• Idea: add channels!
– provide multiple “virtual channels” to break the dependence cycle
– good for BW too!
[Diagram: a switch with multiple virtual channels multiplexed over each input and output port of the crossbar]
– Do not need to add links, or xbar, only buffer resources
• This adds nodes to the CDG; does it remove edges?
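For a unidirectional ring (as on the earlier slide), the virtual-channel fix can be sketched concretely with a "dateline" scheme (names are mine; node 0's incoming wrap-around link is the dateline):

```python
def ring_vc_channels(src, dst, k):
    # Unidirectional k-node ring, two virtual channels per link.
    # Start on VC0; after crossing the wrap-around link into node 0, use VC1.
    chans = []
    node, vc = src, 0
    while node != dst:
        chans.append((node, vc))   # channel leaving `node` on virtual channel `vc`
        node = (node + 1) % k
        if node == 0:
            vc = 1                 # just crossed the dateline
    return chans

def channel_rank(chan, k):
    # Monotone numbering: VC0 channel at node i -> i, VC1 channel -> k + i
    node, vc = chan
    return vc * k + node

# Every route's channel ranks are strictly increasing, so the channel
# dependence graph over *virtual* channels is acyclic -- no added links needed.
```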
When are virtual channels allocated?
[Diagram: virtual channels multiplexed onto the crossbar's input and output ports; a hardware-efficient design for the crossbar]
• Two separate processes:
– Virtual channel allocation
– Switch/connection allocation
• Virtual Channel Allocation
– Choose route and free output virtual channel
• Switch Allocation
– For each incoming virtual channel, must negotiate switch on
outgoing pin
• In ideal case (not highly loaded), would like to
optimistically allocate a virtual channel
Breaking deadlock with virtual channels
[Figure: a channel dependence cycle broken by splitting each link into a lo and a hi virtual channel; a packet switches from the lo to the hi channel partway around]
Paper Discussion: Linder and Harden
• Paper: “An Adaptive and Fault Tolerant Wormhole Routing Strategy for k-ary n-cubes”
– Daniel Linder and Jim Harden
• General virtual-channel scheme for k-ary n-cubes
– With wrap-around paths
• Properties of result for uni-directional k-ary n-cube:
– 1 virtual interconnection network
– n+1 levels
• Properties of result for bi-directional k-ary n-cube:
– 2n-1 virtual interconnection networks
– n+1 levels per network
Example: Unidirectional 4-ary 2-cube
[Figure: left, the physical network; right, the corresponding virtual network]
• Physical network: wrap-around channels are necessary but can cause deadlock
• Virtual network: use VCs to avoid deadlock; 1 level for each wrap-around
Bi-directional 4-ary 2-cube: 2 virtual networks
[Figure: Virtual Network 1 and Virtual Network 2]