Other Means of Executing Parallel Programs
OpenMP and Paraguin
(c) 2011 Clayton S. Ferner
OpenMP
• Jointly defined by a group of major computer hardware and software vendors, OpenMP is a portable, scalable model that gives shared-memory parallel programmers a simple and flexible interface for developing parallel applications for platforms ranging from the desktop to the supercomputer
The Paraguin Compiler
• The Paraguin Compiler is a compiler written by me (no group, no funding – just me by myself) at UNCW
• The intent is to create an abstraction similar to OpenMP but for use on a distributed-memory system
OpenMP
• MPI is a message-passing interface that
provides a means to implement parallel
algorithms on distributed-memory systems
(such as clusters)
• The OpenMP Application Program Interface
(API) supports multi-platform shared-memory
parallel programming in C/C++ and Fortran
on all architectures, including Unix platforms
and Windows NT platforms.*
* The OpenMP® API specification for parallel programming (http://openmp.org/wp/about-openmp/)
OpenMP (cont)
• Parallelization is directed by the programmer
through the use of pragmas
• Pragmas are used to pass information to the compiler, but are ignored (like comments) if the compiler does not recognize the pragma
• Pragmas can be inserted for a particular
compiler without “breaking” the code for
other compilers.
OpenMP Pragmas
• #pragma omp parallel
structured-block
• The block will be executed in parallel by all threads, as in the sketch below
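For illustration, a minimal sketch (an added example, not from the original slides) of a parallel region; every thread in the team executes the block once:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Each thread in the team runs this block. */
    #pragma omp parallel
    {
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}

With gcc this is compiled with the -fopenmp flag.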
OpenMP Pragmas
• #pragma omp for
for loop
• The loop will be executed in parallel by all threads
• The iterations are divided into “chunks” which the threads execute, although the programmer can control this (see the sketch below)
• There is a barrier at the end of the for loop (i.e. threads will synchronize at the end)
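A short sketch (an added example, not from the original slides) of controlling the chunking with the schedule clause; the chunk size of 4 is an arbitrary illustrative choice:

#include <stdio.h>
#include <omp.h>

#define N 16

int main(void)
{
    int i;
    /* Hand the iterations to the threads in fixed chunks of 4. */
    #pragma omp parallel for schedule(static, 4)
    for (i = 0; i < N; i++)
        printf("iteration %d executed by thread %d\n", i, omp_get_thread_num());
    /* implicit barrier at the end of the for construct */
    return 0;
}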
OpenMP Pragmas
• #pragma omp parallel for
for loop
• Equivalent to doing
#pragma omp parallel
#pragma omp for
for loop
OpenMP Pragmas
• #pragma omp critical
structured-block
• Defines a critical section
• Only one thread may be executing the block at any given time (see the sketch below)
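As an added illustration (not from the original slides), a critical section can protect an update to a shared variable:

#include <stdio.h>

#define N 1000

int main(void)
{
    int i;
    double sum = 0.0;

    #pragma omp parallel for
    for (i = 0; i < N; i++) {
        /* Only one thread at a time executes this block, so the
           update of the shared variable sum is not a race. */
        #pragma omp critical
        {
            sum += 0.5 * i;
        }
    }
    printf("sum = %f\n", sum);
    return 0;
}

In practice a reduction(+:sum) clause would usually perform better, but the critical section shows the mutual-exclusion semantics.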
OpenMP Pragmas
• #pragma omp barrier
• All threads will wait at the barrier until all
other threads have reached the same barrier
OpenMP Examples
void simple(int n, float *a, float *b)
{
    int i;
#pragma omp parallel for
    for (i=0; i<n; i++) /* i is private by default */
        b[i] = (a[i] + a[i-1]) / 2.0;
}
OpenMP Examples
Assume n=15 and the number of threads is 4:

Thread 0:  b[0] = …   b[1] = …   b[2] = …   b[3] = …
Thread 1:  b[4] = …   b[5] = …   b[6] = …   b[7] = …
Thread 2:  b[8] = …   b[9] = …   b[10] = …  b[11] = …
Thread 3:  b[12] = …  b[13] = …  b[14] = …
OpenMP Examples
int main(){
    int x = 2;
#pragma omp parallel num_threads(2) shared(x)
    {
        if (omp_get_thread_num() == 0)
            x = 5;
        else
            /* Print 1: the following read of x has a race */
            printf("1: Thread# %d: x = %d\n", omp_get_thread_num(), x);
#pragma omp barrier
        if (omp_get_thread_num() == 0)
            printf("2: Thread# %d: x = %d\n", omp_get_thread_num(), x);
        else
            printf("3: Thread# %d: x = %d\n", omp_get_thread_num(), x);
    }
    return 0;
}
OpenMP Examples
$ ./test
1: Thread# 3: x = 2
1: Thread# 2: x = 5
1: Thread# 1: x = 5
3: Thread# 2: x = 5
3: Thread# 1: x = 5
2: Thread# 0: x = 5
3: Thread# 3: x = 5
$
Paraguin Compiler
• The Paraguin Compiler is a parallelizing
compiler that produces parallel code using
MPI to run on a distributed-memory system
(cluster)
• Based on SUIF Compiler System
(suif.stanford.edu)
Pragma Directives
• Similar to OpenMP, the compiler is directed
through the use of pragma statements
• The goal is to create a similar abstraction as
OpenMP but on a distributed-memory system
Parallel Region
• Defining a parallel region
– #pragma paraguin begin_parallel
– #pragma paraguin end_parallel
• Statements between the begin and end
parallel region are executed by all processors
• Statements outside the parallel region are executed by the master process only (pid 0)
Hello World
int __guin_mypid = 0;

int main(int argc, char *argv[]) {
    char hostname[256];
    printf("Master process %d starting.\n", __guin_mypid);
    ;
#pragma paraguin begin_parallel
    gethostname(hostname, 255);
    printf("Hello world from process %3d on machine %s.\n",
           __guin_mypid, hostname);
    ;
#pragma paraguin end_parallel
    printf("Goodbye world from process %d.\n", __guin_mypid);
}
Hello World Results
Compiling:
$ runparaguin hello.c
Processing file hello.spd
Parallelizing procedure: "main"

Running:
$ mpirun -nolocal -np 8 hello.out
Hello world from process 3.
Hello world from process 1.
Hello world from process 7.
Hello world from process 5.
Hello world from process 4.
Hello world from process 2.
Hello world from process 6.
Master process 0 starting.
Hello world from process 0.
Goodbye world from process 0.
Hello World (cont.)
• Notice the semicolons in front of the pragma statements.
• SUIF attaches the pragmas to the most
recently seen statement, which may be
nested.
• In order to have them attach to a top level
statement, we introduce a blank statement
(‘;’) to which the pragma can be attached.
Paraguin Predefined Variables
• Notice the declaration and initialization of the
variable __guin_mypid.
• The predefined variables of paraguin may be
declared, initialized, and referenced by the
user program. They should not be modified
beyond initialization.
• This is useful to allow the same program to be
compiled using gcc.
Paraguin Predefined Variables
Identifier        Type        Description
__guin_NP         int         Number of Processors
__guin_blksz      int         Block size (number of partitions per processor)
__guin_mypid      int         Current Processor ID
__guin_pidr       int         Receiving thread's processor id
__guin_pidw       int         Sending thread's processor id
__guin_buffer     char []     Buffer of data to be transmitted
__guin_position   int         Number of bytes in the buffer
__guin_status     MPI_Status  Status of the message
__guin_p          int         Current Partition Number
__guin_pr         int         Receiving partition number
__guin_pw         int         Sending partition number
Parallel for
#pragma paraguin forall C    p    i    j    k   \
                        -1   -1   1    -1   0x0 \
                        1    1    -1   1    0x0

• The next for loop nest will be partitioned to run on multiple processors. The data that follows the “forall” is a matrix of inequalities to determine which iterations are mapped to partitions
• p stands for the partition number
• C stands for constant (or 1)
• 0x0 is hex for zero (to prevent SUIF from turning it into a string)
Parallel for (cont.)
// LU Decomposition
;
#pragma paraguin forall C    p    i    j    k   \
                        -1   -1   1    -1   0x0 \
                        1    1    -1   1    0x0
for (i = 0; i <= N; i++)
    for (j = i + 1; j <= N; j++) {
        X[j][i] = X[j][i] / X[i][i];
        for (k = i + 1; k <= N; k++)
            X[j][k] = X[j][k] - X[j][i] * X[i][k];
    }
Parallel for (cont.)
C    p    i    j    k   \
-1   -1   1    -1   0x0 \
1    1    -1   1    0x0

• This matrix represents the affine expressions of inequalities:

[ -1  -1   1  -1   0 ]   [ 1 ]       [ 0 ]
[  1   1  -1   1   0 ] x [ p ]   ≤   [ 0 ]
                         [ i ]
                         [ j ]
                         [ k ]
Parallel for (cont.)
[ -1  -1   1  -1   0 ]   [ 1 ]       [ 0 ]
[  1   1  -1   1   0 ] x [ p ]   ≤   [ 0 ]
                         [ i ]
                         [ j ]
                         [ k ]

-1 - p + i - j ≤ 0        p ≥ i - j - 1
 1 + p - i + j ≤ 0        p ≤ i - j - 1        =>    p = i - j - 1
Parallel for (cont.)
p = i - j - 1

p=0          p=1          p=2          p=3          ...
i=1 j=0      i=2 j=0      i=3 j=0      i=4 j=0
i=2 j=1      i=3 j=1      i=4 j=1      i=5 j=1
i=3 j=2      i=4 j=2      i=5 j=2      i=6 j=2
...          ...          ...          ...
Parallel for (cont.)
p = i - j - 1

Each anti-diagonal of the (i,j) iteration space is one partition:

        j=0    j=1    j=2    j=3    j=4
i=0     p=-1   p=-2   p=-3   p=-4   p=-5
i=1     p=0    p=-1   p=-2   p=-3   p=-4
i=2     p=1    p=0    p=-1   p=-2   p=-3
i=3     p=2    p=1    p=0    p=-1   p=-2
i=4     p=3    p=2    p=1    p=0    p=-1
Matrix Multiplication Example
;
#pragma paraguin forall C    p    i    j    k   \
                        0x0  -1   1    0x0  0x0 \
                        0x0  1    -1   0x0  0x0
for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        c[i][j] = 0.0;
        for (k = 0; k < N; k++) {
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
        }
    }
}
Matrix Multiplication Example (cont.)
#pragma paraguin forall C    p    i    j    k   \
                        0x0  -1   1    0x0  0x0 \
                        0x0  1    -1   0x0  0x0

p = i: each row i of the (i,j) iteration space is one partition:

p=0:   0,0   0,1   0,2   0,3   0,4
p=1:   1,0   1,1   1,2   1,3   1,4
p=2:   2,0   2,1   2,2   2,3   2,4
p=3:   3,0   3,1   3,2   3,3   3,4
p=4:   4,0   4,1   4,2   4,3   4,4
Mapping Partitions to Physical
Processors
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &__guin_NP);
MPI_Comm_rank(MPI_COMM_WORLD, &__guin_mypid);
__guin_blksz = ceil((ubp - lbp + 1) / __guin_NP);
if (0 <= __guin_mypid & __guin_mypid <= __guin_NP - 1) {
    for (__guin_p = __guin_blksz * __guin_mypid;
         __guin_p <= min(N, __guin_blksz * (1 + __guin_mypid) - 1);
         __guin_p++)
        ...

where lbp <= p <= ubp
Mapping Partitions to Physical
Processors
__guin_pid = 0                             __guin_pid = 1                             ...   __guin_pid = NP-1
p = __guin_blksz * __guin_mypid + 0        p = __guin_blksz * __guin_mypid + 0        ...   p = __guin_blksz * __guin_mypid + 0
p = __guin_blksz * __guin_mypid + 1        p = __guin_blksz * __guin_mypid + 1        ...   p = __guin_blksz * __guin_mypid + 1
p = __guin_blksz * __guin_mypid + 2        p = __guin_blksz * __guin_mypid + 2        ...   ...
...                                        ...                                        ...   ...
p = __guin_blksz * (__guin_mypid + 1) - 1  ...                                        ...   p = __guin_blksz * (__guin_mypid + 1) - 1

__guin_blksz = ceil((ubp - lbp + 1) / NP)

This is a block assignment of partitions to processors (as opposed to cyclic assignment).
Mapping Partitions to Physical
Processors
__guin_pid = 0             __guin_pid = 1             ...    __guin_pid = NP-1
p = blksz * 0 + 0          p = blksz * 1 + 0          ...    p = blksz * (NP - 1) + 0
p = blksz * 0 + 1          p = blksz * 1 + 1          ...    p = blksz * (NP - 1) + 1
p = blksz * 0 + 2          p = blksz * 1 + 2          ...    ...
...                        ...                        ...    ...
p = blksz * (0 + 1) - 1    p = blksz * (1 + 1) - 1    ...    ...

where blksz = ceil((ubp - lbp + 1) / NP)

The values of __guin_pid have been substituted in for __guin_mypid, and __guin_blksz has been replaced with blksz.
Broadcasting Data
• Scatter is not implemented in Paraguin
• One has to use broadcast to get the input to
other processors
• This uses the broadcast operation of MPI, which is O(log2(NP)) rather than O(N)
#pragma paraguin bcast X
MPI_Bcast(X, ..., MPI_COMM_WORLD);
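For comparison, a minimal stand-alone MPI sketch (an added illustration, not Paraguin output) in which the master broadcasts an input array to every process:

#include <stdio.h>
#include <mpi.h>

#define N 8

int main(int argc, char *argv[])
{
    double X[N];
    int i, mypid;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &mypid);

    if (mypid == 0)                 /* only the master fills in the input */
        for (i = 0; i < N; i++)
            X[i] = i;

    /* Every process receives a copy of X; the broadcast takes
       O(log2(NP)) communication steps. */
    MPI_Bcast(X, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    printf("Process %d: X[N-1] = %f\n", mypid, X[N - 1]);

    MPI_Finalize();
    return 0;
}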
Loop Carried Dependencies
• Consider the code for the elimination step of Gaussian Elimination:

for (i = 1; i <= N; i++)
    for (j = i+1; j <= N; j++)
        for (k = N+1; k >= i; k--)
            a[j][k] = a[j][k] - a[i][k] * a[j][i] / a[i][i];

Output Dependence

• There is a data dependence between the lhs of the assignment and the a[i][k] reference on the rhs such that iteration iw, jw, kw writes a value to a[jw][kw] that is used in iteration ir = iw + 1, jr = iw, kr = kw.
• There is also a data dependence between the lhs and a[i][i] on the rhs, but we will only consider one dependence here (a much simpler example of such a loop-carried dependence is sketched below).
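A much smaller illustration of the same kind of loop-carried dependence (an added sketch, not from the original slides):

/* Iteration i reads a[i-1], which iteration i-1 wrote, so the
   iterations cannot simply be divided among processors without
   communicating a[i-1] between them. */
void running_sum(int n, float *a, float *b)
{
    int i;
    for (i = 1; i < n; i++)
        a[i] = a[i-1] + b[i];
}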
Loop Carried Dependencies (cont)
i=0 j=1 k=3 :  a[1][3] = … - a[0][3] * …
i=0 j=1 k=2 :  a[1][2] = … - a[0][2] * …
i=0 j=1 k=1 :  a[1][1] = … - a[0][1] * …
i=0 j=1 k=0 :  a[1][0] = … - a[0][0] * …

i=1 j=2 k=3 :  a[2][3] = … - a[1][3] * …
i=1 j=2 k=2 :  a[2][2] = … - a[1][2] * …
i=1 j=2 k=1 :  a[2][1] = … - a[1][1] * …
i=1 j=2 k=0 :  a[2][0] = … - a[1][0] * …

The values a[1][3], a[1][2], a[1][1], and a[1][0] written in the i=0, j=1 iterations are read again in the i=1, j=2 iterations.
Loop Carried Dependencies
• Below is the pragma to specify the data dependence:

#pragma paraguin dep 0x0 2  C    iw   jw   kw   ir   jr   kr   \
                            0x0  0x0  1    0x0  -1   0x0  0x0  \
                            0x0  0x0  -1   0x0  1    0x0  0x0  \
                            0x0  0x0  0x0  1    0x0  0x0  -1   \
                            0x0  0x0  0x0  -1   0x0  0x0  1    \
                            -1   -1   0x0  0x0  1    0x0  0x0  \
                            1    1    0x0  0x0  -1   0x0  0x0

• Paraguin will insert the code for the processor writing the data to pack it up and send it to the processor that needs it
• It also inserts the code for the processor that needs that data to receive the message and unpack the data.
Gather
• Gathering is getting the partial results back
from the various processors to the master
process.
#pragma paraguin gather <write reference>
<X> <A>
Gather
• Example: LU Decomposition
;
#pragma paraguin gather 3   C    i    j    k   \
                            1    1    -1   0x0 \
                            -1   -1   1    0x0
for (i = 0; i <= N; i++) {
    for (j = i + 1; j <= N; j++) {
        X[j][i] = X[j][i] / X[i][i];
        for (k = i + 1; k <= N; k++)
            X[j][k] = X[j][k] - X[j][i] * X[i][k];
    }
}
Gather
#pragma paraguin gather 3   C    i    j    k   \
                            1    1    -1   0x0 \
                            -1   -1   1    0x0

• 3 indicates the 4th (starting at 0) array reference: X[j][k]
• The system of inequalities indicates which values of the loop variables produce the final values of that array: j = i + 1, for all k.
Gather
for (__guin_p = 1 + __guin_blksz * __guin_mypid;
     __guin_p <= __suif_min(N, __guin_blksz +
                            __guin_blksz * __guin_mypid); __guin_p++) {
    i = __guin_p - 1;
    j = i + 1;
    for (k = 1 * __guin_p; k <= 100; k++)
        MPI_Pack(&X[j][k], ..., MPI_COMM_WORLD);
}
MPI_Send(__guin_buffer, ... 0, ..., MPI_COMM_WORLD);
Some Results
Gaussian Elimination
#pragma paraguin forall C    p    i    j    k   \
                        0x0  -1   0x0  1    0x0 \
                        0x0  1    0x0  -1   0x0

#pragma paraguin dep 0x0 2  C    iw   jw   kw   ir   jr   kr   \
                            0x0  0x0  1    0x0  -1   0x0  0x0  \
                            0x0  0x0  -1   0x0  1    0x0  0x0  \
                            0x0  0x0  0x0  1    0x0  0x0  -1   \
                            0x0  0x0  0x0  -1   0x0  0x0  1    \
                            -1   -1   0x0  0x0  1    0x0  0x0  \
                            1    1    0x0  0x0  -1   0x0  0x0

#pragma paraguin dep 0x0 4  C    iw   jw   kw   ir   jr   kr   \
                            0x0  0x0  -1   0x0  1    0x0  0x0  \
                            0x0  0x0  1    0x0  -1   0x0  0x0  \
                            0x0  0x0  0x0  -1   1    0x0  0x0  \
                            0x0  0x0  0x0  1    -1   0x0  0x0  \
                            -1   -1   0x0  0x0  1    0x0  0x0  \
                            1    1    0x0  0x0  -1   0x0  0x0
Gaussian Elimination (cont.)
#pragma paraguin gather 0x0  C    i    j    k   \
                             -1   -1   1    0x0 \
                             1    1    -1   0x0

#pragma paraguin gather 0x0  C    i    j    k  \
                             0x0  -1   0x0  1  \
                             0x0  1    0x0  -1

for (i = 1; i <= N; i++)
    for (j = i+1; j <= N; j++)
        for (k = N+1; k >= i; k--)
            a[j][k] = a[j][k] - a[i][k] * a[j][i] / a[i][i];
Gaussian Elimination (cont.)
[Chart: Execution time for Gaussian Elimination, N=100, MPI code produced by the Paraguin Compiler. Y-axis: Time (milliseconds), 0 to 1200; X-axis: Number of Processors (NP), 1 to 22; series: MPI and Sequential.]
LU Decomposition
#pragma paraguin forall C    p    i    j    k   \
                        -1   -1   1    -1   0x0 \
                        1    1    -1   1    0x0

#pragma paraguin dep 3 2   C    iw   jw   kw   ir   jr   kr   \
                           1    1    0x0  0x0  -1   0x0  0x0  \
                           -1   -1   0x0  0x0  1    0x0  0x0  \
                           0x0  0x0  1    0x0  -1   0x0  0x0  \
                           0x0  0x0  -1   0x0  1    0x0  0x0  \
                           0x0  0x0  0x0  1    -1   0x0  0x0  \
                           0x0  0x0  0x0  -1   1    0x0  0x0

#pragma paraguin dep 3 6   C    iw   jw   kw   ir   jr   kr   \
                           -1   -1   0x0  0x0  1    0x0  0x0  \
                           1    1    0x0  0x0  -1   0x0  0x0  \
                           0x0  0x0  1    0x0  -1   0x0  0x0  \
                           0x0  0x0  -1   0x0  1    0x0  0x0  \
                           0x0  0x0  0x0  1    0x0  0x0  -1   \
                           0x0  0x0  0x0  -1   0x0  0x0  1
LU Decomposition (cont.)
#pragma paraguin gather 0   C    i    j   \
                            0x0  0x0  0x0

#pragma paraguin gather 3   C    i    j    k   \
                            1    1    -1   0x0 \
                            -1   -1   1    0x0

for (i = 0; i <= N; i++)
    for (j = i + 1; j <= N; j++) {
        X[j][i] = X[j][i] / X[i][i];
        for (k = i + 1; k <= N; k++)
            X[j][k] = X[j][k] - X[j][i] * X[i][k];
    }
LU Decomposition (cont)
[Chart: Execution time for LU Decomposition, N=100, MPI code produced by the Paraguin Compiler. Y-axis: Time (milliseconds), 0 to 800; X-axis: Number of Processors (NP), 1 to 22; series: MPI and Sequential.]
Redundant Data in Messages
• We discovered that the messages sent between processors for Gaussian Elimination contained redundant data
• Jerry Martin (MS student 2010) studied
detecting and reducing this redundant data
Redundant Data in Messages (cont)
...
<pid0, p2>: pack a[2][2] - Value: 63.000000
<pid0, p2>: pack a[2][3] - Value: 28.000000
<pid0, p2>: pack a[2][4] - Value: 91.000000
<pid0, p2>: pack a[2][5] - Value: 60.000000
<pid0, p2>: pack a[2][6] - Value: 64.000000
...
<pid0, p2>: pack a[2][2] - Value: 63.000000
<pid0, p2>: pack a[2][3] - Value: 28.000000
<pid0, p2>: pack a[2][4] - Value: 91.000000
<pid0, p2>: pack a[2][5] - Value: 60.000000
<pid0, p2>: pack a[2][6] - Value: 64.000000
...
<pid0>: send to <pid1>
Suppressing Redundant Data
[Figure: comparison of runs without redundant data in messages and with redundant data in messages.]
Suppressing Redundant Data (cont)
[Figure: comparison of runs without redundant data in messages and with redundant data in messages.]
Communication Pattern of Gaussian
Elimination
[Figure: communication among processors p0 through p7 during Gaussian Elimination; the processors exchange messages at every stage, and the pattern repeats throughout the computation.]
Loop Carried Dependencies and
Distributed-Memory Clusters
• Notice that both Gaussian Elimination and LU Decomposition do no better than sequential execution regardless of the number of processors
• In fact, the performance gets worse as the number of processors increases
• The issue is that we can't expect to obtain speedup on a distributed-memory cluster when we have communication between processors
• The communication is just too slow
Communication Pattern that works on
a distributed-memory system
Pattern: Beyond scattering the input and gathering the results, processors work independently.

[Figure: p0 sends the input to processors p0 through p7, each processor computes independently, and the results are gathered back to p0.]
Matrix Multiplication
;
#pragma paraguin begin_parallel
#pragma paraguin forall C    p    i    j    k   \
                        0x0  -1   1    0x0  0x0 \
                        0x0  1    -1   0x0  0x0
#pragma paraguin bcast a b
#pragma paraguin gather 1   C    i    j    k  \
                            0x0  0x0  0x0  1  \
                            0x0  0x0  0x0  -1
// We need to gather all c[i][j]. However, array reference
// one is inside the k loop. If we put in an empty gather
// then we'll have N copies of each c[i][j] sent to the
// master. To send just one, we use k = 0.
for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        c[i][j] = 0.0;
        for (k = 0; k < N; k++) {
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
        }
    }
}
;
#pragma paraguin end_parallel
Matrix Multiplication
;
#pragma paraguin begin_parallel
#pragma paraguin forall C    p    i    j    k   \
                        0x0  -1   1    0x0  0x0 \
                        0x0  1    -1   0x0  0x0
#pragma paraguin bcast a b
#pragma paraguin gather 1   C    i    j    k  \
                            0x0  0x0  0x0  1  \
                            0x0  0x0  0x0  -1
Matrix Multiplication
// We need to gather all c[i][j]. However, array reference
// one is inside the k loop. If we put in an empty gather
// then we'll have N copies of each c[i][j] sent to the
// master. To send just one, we use k = 0.
for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        c[i][j] = 0.0;
        for (k = 0; k < N; k++) {
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
        }
    }
}
;
#pragma paraguin end_parallel
Matrix Multiplication
[Chart: Execution time for Matrix Multiplication for 512x512 matrices, MPI code produced by the Paraguin Compiler. Y-axis: Time (seconds), 0 to 2; X-axis: Number of Processors/Cores (NP), 1 to 36; series: MPI and Sequential.]
Traveling Salesman Problem (TSP)
• The Traveling Salesman Problem is to find the Hamiltonian cycle of a set of cities that minimizes the distance traveled.
• Doing a brute-force search of the solution space requires us to consider all permutations of N cities. This is N! permutations.
• We can fix the first and last city to be city 0, since that will remove cyclic variations of the same solution.
  E.g. 0->1->2->3->4->0 is the same as 0->4->3->2->1->0
Traveling Salesman Problem (TSP)
N = n*n - 3*n + 2; // (n-1)(n-2)
;
#pragma paraguin bcast n
#pragma paraguin bcast N
#pragma paraguin bcast D
#pragma paraguin forall C    N    pid   p  \
                        0x0  0x0  -1    1  \
                        0x0  0x0  1     -1
for (p = 0; p < N; p++) {
    perm[1] = p / (n-2) + 1;
    perm[2] = p % (n-2) + 1;
    if (perm[2] >= perm[1])
        perm[2]++;
    initialize(perm, n, 3);
    do {
        dist = computeDist(D, n, perm);
        if (minDist < 0 || minDist > dist) {
            // … Details omitted.
            // Record the minimum distance and
            // permutation
        }
    } while (increment(perm, n));
}

• Creating permutations does not lend itself to easy parallelization
• We can make a loop that iterates (n-1)(n-2) times and base the first two cities on the loop variable
  • City 0 is fixed
  • City 1 = p / (n - 2) + 1
  • City 2 = p % (n - 2) + 1
Traveling Salesman Problem (TSP)
N = n*n - 3*n + 2; // (n-1)(n-2)
;
#pragma paraguin bcast n
#pragma paraguin bcast N
#pragma paraguin bcast D
#pragma paraguin forall C    N    pid   p  \
                        0x0  0x0  -1    1  \
                        0x0  0x0  1     -1
for (p = 0; p < N; p++) {
    perm[1] = p / (n-2) + 1;
    perm[2] = p % (n-2) + 1;
    …
TSP
[Chart: Execution time for the Traveling Salesman Problem (TSP) for 13 cities, MPI code produced by the Paraguin Compiler. Y-axis: Time (seconds), 0 to 700; X-axis: Number of Processors/Cores (NP), 1 to 36; series: MPI and Sequential.]
Hybrid
• Hybrid parallel programs are ones that make use of the distributed memory of a cluster as well as the multiple cores within each computer (node) of the cluster.
• We can use MPI to schedule processes to run on multiple nodes and then use OpenMP to schedule threads on the cores within a node (a minimal sketch follows).
• The threads on the cores within a node use a shared-memory model, whereas between nodes MPI uses a distributed-memory model.
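A minimal hybrid sketch (an added illustration, not from the original slides): each MPI process runs an OpenMP thread team on its node. With a gcc-based MPI wrapper it would typically be compiled with something like mpicc -fopenmp.

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    int rank;

    MPI_Init(&argc, &argv);                 /* one MPI process per node (distributed memory) */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* The OpenMP threads share memory within the node. */
    #pragma omp parallel
    printf("MPI process %d, thread %d of %d\n",
           rank, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}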
Doing Hybrid in Paraguin
• The Paraguin compiler is a source-to-source compiler: it creates C code with MPI calls from C code. This new code is compiled using the mpicc script, which uses gcc.
• gcc also has OpenMP support.
• The Paraguin compiler will simply pass through pragmas that it does not recognize, which is what makes a hybrid program possible.
Matrix Multiplication (Hybrid)
…
#pragma paraguin forall C    p    i    j    k   \
                        0x0  -1   1    0x0  0x0 \
                        0x0  1    -1   0x0  0x0
…
#pragma omp parallel for private(i,j,k) schedule(static) num_threads(NUM_THREADS)
for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        c[i][j] = 0.0;
        for (k = 0; k < N; k++)
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
    }
}
…
Matrix Multiplication (Hybrid)
[Chart: Execution time for Matrix Multiplication for 1000x1000 matrices, Hybrid (MPI and OpenMP) code produced by the Paraguin Compiler. Y-axis: Execution time (seconds), 0 to 20; X-axis: Number of Processors/Cores, 4 to 36; series: MPI Only, Hybrid, Sequential, OpenMP Only.]
Questions?