Lecture 9 - Suraj @ LUMS


Tuesday, October 03, 2006
If I have seen further,
it is by standing on the
shoulders of giants.
- Isaac Newton
1
Addition Example: Value of sum to be transmitted to all nodes
• Consider replicating computation as an option.
6
Collective Communication
• Global interaction operations
• Building block.
• Proper implementation is necessary
7
• Algorithms for rings can be extended for meshes.
8
• Parallel algorithms using regular data structures map naturally to a mesh.
9
• Many algorithms with recursive interactions map naturally onto the hypercube topology.
• Practical for interconnection networks used in modern computers.
• The time to transfer data between two nodes is considered to be independent of the relative location of the nodes in the interconnection network.
• Routing techniques
10
• ts : startup time
• Prepare the message: add headers, trailers, error-correction information, etc.
• tw : per-word transfer time
• tw = 1/r, where r is the bandwidth in words per second
• Transfer of m words between any pair of nodes in an interconnection network incurs a cost of ts + m tw
11
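As an illustration of the cost model (the numbers here are assumed for the example, not taken from the lecture): with ts = 50 µs, tw = 0.1 µs per word, and a message of m = 1000 words, a single transfer costs ts + m tw = 50 + 1000 × 0.1 = 150 µs.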
• Assumptions:
• Links are bidirectional
• A node can send a message on only one of its links at a time
• A node can receive a message on only one link at a time
• A node can send to and receive from the same or different links
• The effect of congestion is not shown in the total transfer time
12
One to all broadcast
Dual: All-to-One reduction
• Each node has a buffer M containing m words.
• Data from all nodes are combined through an operator and accumulated at a single destination process into one buffer of size m.
13
One to all broadcast
One way:
• Send p-1 messages from the source to the other p-1 nodes
14
One to all broadcast
Inefficient way:
• Send p-1 messages from the source to the other p-1 nodes
• The source becomes the bottleneck
• Only the connection between a single pair of nodes is used at a time
• Under-utilization of the communication network
15
One to all broadcast
• Recursive doubling
(Figure: recursive doubling broadcast among nodes 0-7, shown in Step 1, Step 2, and Step 3)
• log p steps
19
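A minimal sketch of this recursive-doubling pattern, assuming p is a power of two, node 0 is the source, and MPI point-to-point calls are used (the function name and structure are illustrative, not taken from the lecture):

#include <mpi.h>

/* One-to-all broadcast by recursive doubling (source = rank 0, p a power of two).
 * In step i, every rank below 2^i already holds the message and forwards it to
 * rank + 2^i, so the set of informed nodes doubles and log2(p) steps suffice. */
void broadcast_recursive_doubling(void *buf, int count, MPI_Datatype type,
                                  MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    for (int step = 1; step < p; step *= 2) {
        if (rank < step && rank + step < p) {
            /* already informed: pass the message one "distance" further */
            MPI_Send(buf, count, type, rank + step, 0, comm);
        } else if (rank >= step && rank < 2 * step) {
            /* becomes informed in this step */
            MPI_Recv(buf, count, type, rank - step, 0, comm, MPI_STATUS_IGNORE);
        }
    }
}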
One to all broadcast
• What if 0 had sent to 1, and then 0 and 1 had attempted to send to 2 and 3?
(Figure: nodes 0-7)
20
All to One Reduction
• Reverse the direction and sequence of communication
21
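In practice both patterns are usually invoked through a message-passing library rather than hand-coded. A minimal MPI sketch (the values are illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One-to-all broadcast: rank 0's value reaches every process. */
    int value = (rank == 0) ? 42 : 0;
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* All-to-one reduction (the dual): the sum of contributions ends at rank 0. */
    int contribution = rank, sum = 0;
    MPI_Reduce(&contribution, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("broadcast value = %d, reduced sum = %d\n", value, sum);

    MPI_Finalize();
    return 0;
}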
Matrix vector multiplication
22
One to all broadcast
Mesh
Regard each row and column as a linear array: the source first broadcasts along its own row, and then every node of that row broadcasts along its column.
23
One to all broadcast
Mesh
24
One to all broadcast
Hypercube
• Treat the hypercube as a d-dimensional mesh with two nodes in each dimension
• The broadcast completes in d steps
25
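Combining this with the cost model from earlier (this formula is not stated on the slide, but follows directly): each of the log p steps transfers the same m-word message, so the one-to-all broadcast costs (ts + m tw) log p in total.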
One to all broadcast
Hypercube
(Figure: one-to-all broadcast on a 3-dimensional hypercube, nodes 0-7, one dimension per step over three steps)
28
One to all broadcast
• What if 0 had sent to 1, and then 0 and 1 had attempted to send to 2 and 3?
29
One to all broadcast
30
All-to-All communication
31
All-to-All communication
All-to-All broadcast: p One-to-All broadcasts?
32
All-to-All broadcast
• Broadcasts are pipelined.
33
All-to-All broadcast
34
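On a ring the pipelining amounts to p-1 shift steps in which every node forwards the block it received in the previous step. A minimal sketch, assuming fixed-size integer blocks and illustrative names (not the lecture's code):

#include <mpi.h>
#include <string.h>

/* All-to-all broadcast on a ring: after p-1 steps every node holds all p blocks. */
void all_to_all_broadcast_ring(const int *myblock, int blocklen, int *allblocks,
                               MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    int left  = (rank - 1 + p) % p;
    int right = (rank + 1) % p;

    /* Start by placing our own block in its slot of the result buffer. */
    memcpy(allblocks + rank * blocklen, myblock, blocklen * sizeof(int));

    int current = rank;                       /* owner of the block we hold */
    for (int step = 0; step < p - 1; step++) {
        int incoming = (current - 1 + p) % p; /* block arriving from the left */
        MPI_Sendrecv(allblocks + current  * blocklen, blocklen, MPI_INT, right, 0,
                     allblocks + incoming * blocklen, blocklen, MPI_INT, left,  0,
                     comm, MPI_STATUS_IGNORE);
        current = incoming;                   /* forward this one in the next step */
    }
}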
All-to-All Reduction
35
All-to-All broadcast in Mesh
36
All-to-All broadcast in Mesh
37
All-to-All broadcast in Hypercube
(Figure: on a 3-dimensional hypercube, nodes exchange and concatenate data along one dimension per step; after the first step each node holds two blocks such as 0,1 or 6,7, after the second step four blocks such as 0,1,2,3 or 4,5,6,7, and after the third step all blocks 0,1,2,3,4,5,6,7)
41
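A minimal sketch of the hypercube exchange, assuming p is a power of two and using illustrative names (not the lecture's code). In step i each node swaps everything it has accumulated so far with its neighbour across dimension i, so the data held doubles each step and log2(p) steps suffice:

#include <mpi.h>
#include <string.h>

void all_to_all_broadcast_hypercube(const int *myblock, int blocklen,
                                    int *allblocks, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    /* Our own block starts in its final slot of the result buffer. */
    memcpy(allblocks + rank * blocklen, myblock, blocklen * sizeof(int));

    for (int mask = 1; mask < p; mask <<= 1) {
        int partner   = rank ^ mask;                          /* neighbour across this dimension */
        int my_base   = (rank    & ~(mask - 1)) * blocklen;   /* blocks gathered so far */
        int peer_base = (partner & ~(mask - 1)) * blocklen;   /* where the partner's blocks go */
        int count     = mask * blocklen;

        MPI_Sendrecv(allblocks + my_base,   count, MPI_INT, partner, 0,
                     allblocks + peer_base, count, MPI_INT, partner, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}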
All-reduce
• Each node has a buffer of size m
• The final results are identical buffers on each node, formed by combining the original p buffers
• All-to-one reduction followed by one-to-all broadcast
42
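MPI exposes all-reduce directly as MPI_Allreduce; a minimal usage sketch (the values are illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Every process contributes one value; every process ends up with the sum. */
    int contribution = rank + 1, total = 0;
    MPI_Allreduce(&contribution, &total, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d sees total %d\n", rank, total);

    MPI_Finalize();
    return 0;
}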
Gather and Scatter
43
Scatter
44
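These two operations also map directly onto library calls; a minimal MPI sketch of scatter and its dual, gather, with illustrative sizes:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* Scatter: rank 0 splits an array of p integers, one per process. */
    int *all = NULL;
    if (rank == 0) {
        all = malloc(p * sizeof(int));
        for (int i = 0; i < p; i++)
            all[i] = i * i;
    }
    int mine;
    MPI_Scatter(all, 1, MPI_INT, &mine, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Gather: the dual operation collects one integer back from each process. */
    MPI_Gather(&mine, 1, MPI_INT, all, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0)
        free(all);
    MPI_Finalize();
    return 0;
}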
• Speedup: the ratio between the execution time on a single processor and the execution time on multiple processors.
• Given a computational job that is distributed among N processors
• Does this result in an N-fold speedup? In a perfect world!
46
• Every algorithm has a sequential component that must be done by a single processor. This is not diminished when the parallel part is split up.
• There are also communication costs, idle time, replicated computation, etc.
47
Amdahl’s Law
• Let T(N) be the time required to complete the task on N processors. The speedup S(N) is the ratio
S(N) = T(1) / T(N)
• With serial portion Ts and parallelizable portion Tp:
S(N) ≤ (Ts + Tp) / (Ts + Tp/N)
• An optimistic estimate: it ignores the overhead incurred due to parallelizing the code.
50
Amdahl’s Law
S(N) ≤ (Ts + Tp) / (Ts + Tp/N), with Ts + Tp normalized to 1

N        Tp = 0.5   Tp = 0.9
10       1.8        5.26
100      1.98       9.17
1000     1.99       9.91
10000    1.99       9.99

51
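A short sketch that reproduces the table (assuming, as above, that Ts + Tp is normalized to 1; the code is illustrative, not from the lecture):

#include <stdio.h>

/* Amdahl speedup S(N) = (Ts + Tp) / (Ts + Tp/N) with Ts + Tp = 1. */
static double speedup(double tp, double n)
{
    double ts = 1.0 - tp;              /* serial fraction */
    return (ts + tp) / (ts + tp / n);
}

int main(void)
{
    double fractions[] = { 0.5, 0.9 };
    int    procs[]     = { 10, 100, 1000, 10000 };

    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 4; j++)
            printf("Tp = %.1f  N = %5d  S = %.2f\n",
                   fractions[i], procs[j], speedup(fractions[i], procs[j]));
    return 0;
}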
Amdahl’s Law
• If the sequential component is 5 percent, then the maximum speedup that can be achieved is ?
52
Amdahl’s Law
• If the sequential component is 5 percent, then the maximum speedup that can be achieved is 20: as N grows without bound, S(N) approaches (Ts + Tp)/Ts = 1/0.05 = 20.
• Useful when sequential programs are parallelized incrementally.
• A sequential program can be profiled to identify its computationally demanding components (hotspots).
53