Lecture 9 - Suraj @ LUMS
Tuesday, October 03, 2006
If I have seen further,
it is by standing on the
shoulders of giants.
- Isaac Newton
Addition Example: the value of the sum is to be transmitted to all nodes.
Consider replicating the computation as an option.
Collective Communication
• Global interaction operations
• Building blocks
• Proper implementation is necessary
• Algorithms for rings can be extended to meshes.
• Parallel algorithms using regular data structures map naturally to meshes.
• Many algorithms with recursive interactions map naturally onto the hypercube topology.
• Practical for the interconnection networks used in modern computers.
• The time to transfer data between two nodes is considered to be independent of the relative location of the nodes in the interconnection network.
Routing Techniques
ts : startup time, the time to prepare the message (add headers, trailers, error-correction information, etc.)
tw : per-word transfer time; tw = 1/r, where r is the bandwidth in words per second
Transferring m words between any pair of nodes in the interconnection network incurs a cost of ts + m tw.
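As an illustration (numbers assumed here, not from the slides): with ts = 50 microseconds and tw = 0.5 microseconds per word, transferring m = 1000 words costs 50 + 1000 * 0.5 = 550 microseconds; the startup term dominates only for short messages.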
Assumptions:
• Links are bidirectional.
• A node can send a message on only one of its links at a time.
• A node can receive a message on only one link at a time.
• A node can send to and receive from the same or different links.
• The effect of congestion is not included in the total transfer time.
One-to-All Broadcast
Dual: All-to-One Reduction
• One-to-all broadcast: a single source node sends identical data of size m to every other node.
• All-to-one reduction: each node has a buffer M containing m words; data from all nodes are combined through an operator and accumulated at a single destination process into one buffer of size m.
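As a concrete illustration, a minimal MPI sketch of both operations (MPI is assumed here; the slides do not name a particular library):

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, m = 1024;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double *buf = malloc(m * sizeof(double));
        double *sum = malloc(m * sizeof(double));
        if (rank == 0)                       /* the source fills its buffer */
            for (int i = 0; i < m; i++) buf[i] = i;

        /* One-to-all broadcast: node 0 sends m words to every node. */
        MPI_Bcast(buf, m, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* All-to-one reduction (the dual): element-wise sums accumulate at node 0. */
        MPI_Reduce(buf, sum, m, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        free(buf); free(sum);
        MPI_Finalize();
        return 0;
    }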
One-to-All Broadcast
A naive way: send p-1 separate messages from the source to the other p-1 nodes.
This is inefficient:
• The source becomes the bottleneck.
• Only the connection between a single pair of nodes is used at a time.
• The communication network is under-utilized.
One-to-All Broadcast: Recursive Doubling
[Figure: recursive doubling on eight nodes (0-7). In step 1 the source (node 0) sends to node 4; in step 2 nodes 0 and 4 send to nodes 2 and 6; in step 3 nodes 0, 2, 4 and 6 send to nodes 1, 3, 5 and 7. In each step the number of nodes holding the message doubles, so the broadcast completes in log p steps.]
One-to-All Broadcast
What if node 0 had first sent to node 1, and then nodes 0 and 1 had attempted to send to nodes 2 and 3?
[Figure: the same eight-node network, illustrating this alternative schedule.]
All-to-One Reduction
Reverse the direction and the sequence of the communication used in the broadcast.
Matrix-Vector Multiplication
[Figure: matrix-vector multiplication as an application of one-to-all broadcast of the vector.]
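A possible realization (a sketch only; it assumes MPI, a row-wise 1-D partitioning of A, and n divisible by p, none of which the slide states):

    #include <mpi.h>

    /* y = A x with each process owning `rows` consecutive rows of A.
       The vector x is valid on node 0 before the call and is broadcast to all. */
    void mat_vec(const double *Alocal, double *x, double *ylocal, int n, int rows) {
        MPI_Bcast(x, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);   /* one-to-all broadcast of x */
        for (int i = 0; i < rows; i++) {
            ylocal[i] = 0.0;
            for (int j = 0; j < n; j++)
                ylocal[i] += Alocal[i * n + j] * x[j];
        }
    }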
One-to-All Broadcast on a Mesh
• Regard each row and each column of the mesh as a linear array.
• First broadcast along the source's row; then every node in that row broadcasts along its own column.
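One way to express this with MPI sub-communicators (an illustrative sketch; the mesh side q with q*q = p and source node 0 are assumptions of this example):

    #include <mpi.h>

    void mesh_broadcast(double *buf, int m, int q) {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        int row = rank / q, col = rank % q;

        MPI_Comm row_comm, col_comm;
        MPI_Comm_split(MPI_COMM_WORLD, row, col, &row_comm);  /* nodes in the same row */
        MPI_Comm_split(MPI_COMM_WORLD, col, row, &col_comm);  /* nodes in the same column */

        if (row == 0)
            MPI_Bcast(buf, m, MPI_DOUBLE, 0, row_comm);       /* along the source's row */
        MPI_Bcast(buf, m, MPI_DOUBLE, 0, col_comm);           /* then down every column */

        MPI_Comm_free(&row_comm);
        MPI_Comm_free(&col_comm);
    }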
One-to-All Broadcast on a Hypercube
A hypercube is a d-dimensional mesh with two nodes per dimension, so the broadcast takes d = log p steps.
[Figure: one-to-all broadcast on a 3-dimensional hypercube (nodes 0-7), shown over three steps; in each step the message is forwarded along one further dimension, doubling the set of nodes that hold it.]
One-to-All Broadcast
What if node 0 had first sent to node 1, and then nodes 0 and 1 had attempted to send to nodes 2 and 3?
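The recursive-doubling broadcast on a hypercube can be written as a short loop over dimensions. A minimal point-to-point sketch (MPI is assumed; the source is taken to be node 0 and p a power of two):

    #include <mpi.h>

    /* Node 0 broadcasts m doubles. In each step, every node that already has
       the data sends it to the neighbour obtained by flipping one bit of its
       rank, so the set of informed nodes doubles: log p steps in total. */
    void hypercube_broadcast(double *buf, int m, MPI_Comm comm) {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);
        for (int mask = p >> 1; mask > 0; mask >>= 1) {
            int partner = rank ^ mask;
            if ((rank & (2 * mask - 1)) == 0)          /* I already have the data */
                MPI_Send(buf, m, MPI_DOUBLE, partner, 0, comm);
            else if ((rank & (2 * mask - 1)) == mask)  /* my partner has it; receive */
                MPI_Recv(buf, m, MPI_DOUBLE, partner, 0, comm, MPI_STATUS_IGNORE);
        }
    }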
All-to-All Communication
• All-to-all broadcast: p one-to-all broadcasts?
• The p broadcasts are pipelined rather than performed one after another.
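In MPI terms (assumed here as an illustration), all-to-all broadcast corresponds to an allgather: every node contributes m words and ends up with all p contributions:

    #include <mpi.h>

    /* Every node passes in its own block of m doubles and receives the
       concatenation of all p blocks, ordered by rank. */
    void all_to_all_broadcast(const double *myblock, double *allblocks, int m) {
        MPI_Allgather(myblock, m, MPI_DOUBLE,
                      allblocks, m, MPI_DOUBLE, MPI_COMM_WORLD);
    }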
All-to-All Reduction
The dual of all-to-all broadcast: every node contributes p buffers of m words, and node i ends up with the combination (for example, the sum) of the i-th buffer from every node.
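One way to read this in MPI terms (an interpretation, not stated on the slide) is as a reduce-scatter: the element-wise sum of the p send buffers is computed and block i of the result is delivered to node i:

    #include <mpi.h>
    #include <stdlib.h>

    /* sendbuf holds p blocks of m doubles (one destined for each node);
       recvbuf receives this node's m-word share of the element-wise sum. */
    void all_to_all_reduce(const double *sendbuf, double *recvbuf, int m, int p) {
        int *counts = malloc(p * sizeof(int));
        for (int i = 0; i < p; i++) counts[i] = m;   /* every node gets m elements */
        MPI_Reduce_scatter(sendbuf, recvbuf, counts, MPI_DOUBLE,
                           MPI_SUM, MPI_COMM_WORLD);
        free(counts);
    }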
All-to-All Broadcast on a Mesh
[Figure: all-to-all broadcast on a mesh, performed in two phases, first along the rows and then along the columns.]
All-to-All Broadcast on a Hypercube
[Figure: all-to-all broadcast on a 3-dimensional hypercube (nodes 0-7), shown step by step. Initially each node holds only its own data. After step 1 (exchange along the first dimension) the pairs hold {0,1}, {2,3}, {4,5}, {6,7}. After step 2 each node holds half of all the data, {0,1,2,3} or {4,5,6,7}. After step 3 every node holds {0,1,2,3,4,5,6,7}. The message exchanged doubles in size at every step.]
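A compact sketch of this exchange pattern (MPI assumed, p a power of two; the buffer layout is this example's own choice):

    #include <mpi.h>

    /* result holds p blocks of m doubles, ordered by source rank; only this
       node's own block (at offset rank*m) needs to be filled in on entry.
       In the step with the given mask, each node exchanges everything it has
       accumulated so far with its neighbour across that dimension, so the
       message size doubles: total cost ts*log p + tw*m*(p-1). */
    void all_to_all_broadcast_hypercube(double *result, int m, MPI_Comm comm) {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);
        for (int mask = 1; mask < p; mask <<= 1) {
            int partner   = rank ^ mask;               /* neighbour across this dimension */
            int mybase    = rank & ~(mask - 1);        /* first block I currently hold    */
            int theirbase = mybase ^ mask;             /* first block the partner holds   */
            MPI_Sendrecv(result + mybase * m,    mask * m, MPI_DOUBLE, partner, 0,
                         result + theirbase * m, mask * m, MPI_DOUBLE, partner, 0,
                         comm, MPI_STATUS_IGNORE);
        }
    }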
All-Reduce
• Each node has a buffer of size m.
• The final result is an identical buffer on every node, formed by combining the original p buffers.
• It can be performed as an all-to-one reduction followed by a one-to-all broadcast.
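A one-line MPI sketch (assumed API), using the global sum from the addition example:

    #include <mpi.h>

    /* Every node ends up with the element-wise sum of the local buffers of
       all p nodes, exactly as if an all-to-one reduction to node 0 were
       followed by a one-to-all broadcast from node 0. */
    void global_sum(const double *local, double *global, int m) {
        MPI_Allreduce(local, global, m, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    }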
Gather and Scatter
• Gather: a single node collects a distinct block of m words from every node.
• Scatter: a single node sends a distinct block of m words to every node; it is the dual of gather.
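An illustrative MPI sketch (the doubling of each element is a placeholder for real local work):

    #include <mpi.h>

    /* Node 0 scatters p blocks of m doubles (one per node), each node works on
       its block, and the blocks are gathered back to node 0 in rank order. */
    void scatter_work_gather(double *all, double *block, int m) {
        MPI_Scatter(all, m, MPI_DOUBLE, block, m, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        for (int i = 0; i < m; i++)
            block[i] *= 2.0;                 /* stand-in for local computation */
        MPI_Gather(block, m, MPI_DOUBLE, all, m, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    }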
Speedup
• The ratio between the execution time on a single processor and the execution time on multiple processors.
• Given a computational job that is distributed among N processors, does this result in an N-fold speedup? Only in a perfect world!
• Every algorithm has a sequential component that must be done by a single processor. This component is not diminished when the parallel part is split up.
• There are also communication costs, idle time, replicated computation, etc.
Amdahl's Law
Let T(N) be the time required to complete the task on N processors. The speedup S(N) is the ratio

    S(N) = T(1) / T(N)

With a serial portion Ts and a parallelizable portion Tp, so that T(1) = Ts + Tp and T(N) = Ts + Tp/N,

    S(N) = (Ts + Tp) / (Ts + Tp/N)

This is an optimistic estimate: it ignores the overhead incurred due to parallelizing the code.
Amdahl's Law

    S(N) = (Ts + Tp) / (Ts + Tp/N),  with Ts + Tp normalized to 1

    N        Tp = 0.5    Tp = 0.9
    10       1.82        5.26
    100      1.98        9.17
    1000     2.00        9.91
    10000    2.00        9.99
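The table can be reproduced with a few lines of C (a check of the arithmetic, not part of the slides):

    #include <stdio.h>

    int main(void) {
        double tp_values[] = {0.5, 0.9};
        int n_values[] = {10, 100, 1000, 10000};
        for (int i = 0; i < 2; i++) {
            double tp = tp_values[i], ts = 1.0 - tp;   /* total time normalized to 1 */
            for (int j = 0; j < 4; j++) {
                int n = n_values[j];
                printf("Tp=%.1f  N=%-5d  S=%.2f\n", tp, n, (ts + tp) / (ts + tp / n));
            }
        }
        return 0;
    }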
Amdahl's Law
• If the sequential component is 5 percent, then the maximum speedup that can be achieved is 20: as N grows, Tp/N approaches 0, so S(N) approaches (Ts + Tp)/Ts = 1/0.05 = 20.
• Amdahl's Law is useful when sequential programs are parallelized incrementally.
• A sequential program can be profiled to identify the computationally demanding components (hotspots).