Transcript Chapter 6

DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING
CHAPTER 6: MESSAGE PASSING PROGRAMMING AND MPI
Distributed Memory Programming Paradigm

Message passing programming is targeted at distributed
memory machines (although it can be emulated on shared
memory machines).

Set of processors, each with its own memory.

Communication between processors via a network.

Each processor runs a copy of the same executable — Single Program, Multiple Data (SPMD):
Written in a conventional programming language.
With its own memory address space, aka ‘local’, ‘private’, ‘non-global’.
Passing of data between processors done using library subroutine calls.

SPMD is a special case of the Multiple Instruction Multiple
Data (MIMD) model. MIMD allows different executables to be
run on each processor, but the SPMD model is simpler, more
portable, and hardly ever a limitation.
Message Passing Paradigm

Data movement specified explicitly by programmer.

Data propagated between processors in messages (packets of data).

Communication types:


Point to point — most basic, one processor sending data,
and one processor receiving. Building block for all other
communications.
Collective, involving more than two processors, e.g.:

One processor broadcasting data to other processors.

Synchronisation, or barrier.

Reduction operation, e.g. global sum.

Collective communications should be faster than implementations
the user builds from point-to-point calls, since library
implementations are optimised.

Communications libraries allow programmer to specify a virtual
interconnect topology, which abstracts over the physical
connectivity of the parallel computer.

Programmer must ensure correctness of code (every send
must have a matching receive, deadlock must be avoided, etc.).
Message Passing Programs

In some cases the parallel program requires a completely different algorithm, so
the code will look very different to the sequential code.

However in many cases message passing codes use standard domain
decomposition parallelisation and look very similar to sequential codes, except
for:

Some code to distribute arrays across processors.

Some code to pass data across processors where necessary, e.g. exchange array
values at edges of domains, do global sums, etc.

The rest of the code is mostly the same, basically executing the same
sequential algorithm, except that it acts only on the local subset of the array.

In the latter case it is straightforward to use the same code on sequential and
parallel machines, with preprocessing directives to select the sequential or parallel
version (see the sketch below).

Alternatively, the code can be run sequentially by just using the parallel version on 1
processor, but there is an overhead.
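
A minimal sketch of the preprocessing-directive approach (using MPI routines that are introduced later in this chapter). The same loop body is used in both versions, and an assumed PARALLEL macro, chosen by the build (e.g. -DPARALLEL), selects between them; the macro name is an illustration, not anything defined by MPI.

/* Sketch: compile with an MPI compiler and -DPARALLEL for the parallel
 * version, or as plain C without the flag for the sequential version.   */
#include <stdio.h>
#ifdef PARALLEL
#include <mpi.h>
#endif

#define N 1000000                       /* total loop length (arbitrary)  */

int main(int argc, char *argv[])
{
    int first = 0, last = N;            /* local range of the loop        */
    double local = 0.0, total;

#ifdef PARALLEL
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    first = rank * (N / size);          /* simple block distribution      */
    last  = (rank == size - 1) ? N : first + N / size;
#endif

    for (int i = first; i < last; i++)  /* same sequential algorithm,     */
        local += 1.0 / (i + 1.0);       /* acting on the local subset     */

#ifdef PARALLEL
    MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0) printf("sum = %f\n", total);
    MPI_Finalize();
#else
    total = local;
    printf("sum = %f\n", total);
#endif
    return 0;
}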
Types of Message Passing Programs



Embarrassingly Parallel

Virtually no communication required

Easily load balanced (even dynamically)

Expect perfect speedup
Regular and Synchronous

Easily statically load balanced

Expect good speedup for local communication

Expect reasonable speedup for non-local communication
Irregular and/or Asynchronous

Difficult to load balance – often requires dynamic load balancing

Communications overhead usually large

Hard to get good speedup (but possible with some applications)

Usually cannot be done efficiently using data parallel
programming – must use message passing to get best performance
Embarrassingly Parallel Problems

Each element of an array can be processed independently of the others.

No communication required, except to combine the final results.

Static load balancing is usually trivial – can use any kind of distribution
(e.g. scattered) since communication is not a factor.

Dynamic load balancing can be done using a task farm approach (see the code sketch after this list):


A master (or controller) process sends parcels of work to each of a set of
workers (processors).

Workers return results when they are done, and request more work when
idle.

This approach gives automatic dynamic load balancing as long as:

1. the problem can be split into lots of parcels of work

2. the parcels are not too small, to minimise communications overhead

3. the parcels are not too large, so as not to be left waiting for one processor once
the others are done
Expect perfect speedup.
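
A minimal sketch of such a task farm, using the MPI point-to-point routines described later in this chapter. The task and result are just single integers here; NTASKS, WORK_TAG, STOP_TAG and do_work are illustrative names (not MPI identifiers), and at least one worker process is assumed.

#include <stdio.h>
#include <mpi.h>

#define NTASKS   100      /* number of work parcels (arbitrary)             */
#define WORK_TAG 1        /* tag names are my own, not MPI constants        */
#define STOP_TAG 2

static int do_work(int task) { return task * task; }   /* stand-in for real work */

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                    /* master: hands out parcels of work */
        int next = 0, busy = 0, result, sum = 0;

        /* Prime every worker with one task (or stop it if none is left).   */
        for (int w = 1; w < size; w++) {
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, w, WORK_TAG, MPI_COMM_WORLD);
                next++; busy++;
            } else {
                MPI_Send(&next, 0, MPI_INT, w, STOP_TAG, MPI_COMM_WORLD);
            }
        }

        /* Collect results; whichever worker replies gets the next parcel.  */
        while (busy > 0) {
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, WORK_TAG,
                     MPI_COMM_WORLD, &status);
            sum += result;
            busy--;
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, status.MPI_SOURCE, WORK_TAG,
                         MPI_COMM_WORLD);
                next++; busy++;
            } else {
                MPI_Send(&next, 0, MPI_INT, status.MPI_SOURCE, STOP_TAG,
                         MPI_COMM_WORLD);
            }
        }
        printf("all %d tasks done, sum of results = %d\n", NTASKS, sum);
    } else {                            /* worker: process tasks until told to stop */
        int task, result;
        while (1) {
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
            if (status.MPI_TAG == STOP_TAG) break;
            result = do_work(task);
            MPI_Send(&result, 1, MPI_INT, 0, WORK_TAG, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}

Receiving with MPI_ANY_SOURCE is what gives the automatic dynamic load balancing: whichever worker finishes first is the one that gets the next parcel of work.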
Regular and Synchronous Problems

Regular algorithm and data structures.

Synchronous (or loosely synchronous) communications.

Use standard domain decomposition.

Usually requires:
local communication (edges of domain);
collective communication (combine final results);
and (sometimes) non-local communication.

As long as the amount of data per processor is large enough so that
computation time is much greater than communication time, expect:
Good speedup for local communication.
Reasonable speedup for non-local communication.

For problems with local communication:
Computation goes like the volume of the grid (or array) on each processor.
Communication goes like the edge (perimeter or surface) of the grid on each processor.
Comms overhead ∼ Comm / Comp ∼ 1 / grid length.

So expect reasonable speedup as long as size of array is much larger than
number of processors.
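
As a rough worked example (with assumed numbers): take a 1000 × 1000 grid on 16 processors, so each processor holds a 250 × 250 sub-grid. Per time step, computation ∼ 250 × 250 = 62 500 grid-point updates, while local communication ∼ 4 × 250 = 1000 edge values, giving a comms overhead of roughly 1000 / 62 500 ≈ 1.6%. Halving the grid length per processor doubles this ratio, which is the ‘1 / grid length’ behaviour above.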
Irregular and/or Asynchronous

Irregular algorithm and/or data structures.

Cannot be implemented efficiently except using message passing.

Usually a high communications overhead.

Complex asynchronous (and usually non-local) communication requires
careful coding.

May require careful load balancing.

May require dynamic repartitioning of data between processors.

Difficult to get good speedup, but possible for some problems.

In some cases most of the problem is regular, and only part of it is
irregular, which makes things easier, but still need to use message
passing instead of (or as well as) data parallel.

HPF allows calls to extrinsic functions, which may be written in MPI, to
support these mixed regular and irregular applications.
Message Passing Libraries

Long and noble history (since the early 1980s).

Main idea: abstraction from machine topology — routines to
send/receive/broadcast messages between processors without having to manage
the routing through the particular machine topology.

User sees a library for an existing language (e.g. C or Fortran), not (usually)
extensions to the language.

Speed, not portability, was historically the main criterion (but the two are not
necessarily exclusive).

N(implementations) ∼ N(vendors,institutes):

Machine specific – NX, MPL, CMMD, CS-Tools

Portable – CROS, Express, P4, Chameleon, CHIMP, PARMACS, PICL, PVM, SHMEM,
TITCH, TINY, VERTEX, ZIPCODE

Many good ideas in them.

Basic functionality the same, interfaces similar but annoyingly different.

PVM (Parallel Virtual Machine) had been the de facto standard in recent times, but
is focused on distributed computing (more so after the introduction of MPI).
Message Passing Interface

In 1994 a standardized Message Passing Interface (MPI) was established by an
international committee (the MPI Forum).

Prompted by the success of the HPF Forum in creating a standard data parallel
language.

Interfaces to both C and Fortran; other languages could be supported (C++ bindings
are provided in version 2, and Java bindings are being investigated).

Different languages have slightly different syntax for the interface, e.g. error values
returned as integer return codes in C, and as extra parameters in Fortran.

Provides only a specification, which can be implemented on any parallel computer or
workstation network.

Some implementations of this specification are:


Versions for many different machines – e.g. MPICH, LAM, CHIMP, MPI-F

Versions from vendors for specific machines – e.g. IBM SP2
Initial specification (MPI-1) did not handle some aspects such as parallel I/O and
distributed computing support, but most of these issues are addressed in the latest
version (MPI-2).
Reasons for Using MPI:

Standardization - MPI is the only message passing library
which can be considered a standard. It is supported on
virtually all HPC platforms. Practically, it has replaced all
previous message passing libraries.

Portability - There is no need to modify your source code
when you port your application to a different platform that
supports (and is compliant with) the MPI standard.

Performance Opportunities - Vendor implementations
should be able to exploit native hardware features to optimize
performance. For more information about MPI performance
see the MPI Performance Topics tutorial.

Functionality - Over 115 routines are defined in MPI-1 alone.

Availability - A variety of implementations are available, both
vendor and public domain.
Programming Model:

MPI lends itself to virtually any distributed memory parallel programming
model. In addition, MPI is commonly used to implement (behind the
scenes) some shared memory models, such as Data Parallel, on
distributed memory architectures.

Hardware platforms:

Distributed Memory: Originally, MPI was targeted for distributed memory
systems.

Shared Memory: As shared memory systems became more popular,
particularly SMP / NUMA architectures, MPI implementations for these
platforms appeared.

Hybrid: MPI is now used on just about any common parallel architecture
including massively parallel machines, SMP clusters, workstation clusters and
heterogeneous networks.

All parallelism is explicit: the programmer is responsible for correctly
identifying parallelism and implementing parallel algorithms using MPI
constructs.

The number of tasks dedicated to run a parallel program is static. New
tasks cannot be dynamically spawned during run time (MPI-2 addresses
this issue).
MPI Preliminaries

MPI provides support for:
Point-to-Point message passing
Collective Communications
Communicators to group messages and processes
Inquiry routines to query the environment
Constants and data types

Typographical convention:
All MPI identifiers prefixed by ‘MPI_’.
C routine names may contain lower case, as in ‘MPI_Init’.
Fortran routines SHOUT AT YOU IN UPPER CASE, as in ‘MPI_INIT’.
MPI constants are all in upper case, e.g. ‘MPI_FLOAT’ is an MPI C datatype.
C routines are actually integer functions which return a status code.
Fortran routines really are subroutines, and return a status code as an argument.
Strongly advised to always check these status codes.

Number of processors used is specified on the command line, when running the
MPI loader that loads the MPI program onto the processors, to avoid hard-coding it into the program.
MPI Routines

MPI has 129 different routines, but most programs can be written using about a
dozen simple routines:
#include "mpi.h"        in C
include "mpif.h"        in Fortran

MPI_INIT          Initialize MPI computation
MPI_FINALIZE      Terminate MPI computation
MPI_COMM_SIZE     Determine number of processes
MPI_COMM_RANK     Determine process number
MPI_SEND          Send a message
MPI_RECV          Receive a message
MPI_BARRIER       Synchronize
MPI_BCAST         Broadcast same data to all procs
MPI_GATHER        Get data from all procs
MPI_SCATTER       Send different data to all procs
MPI_REDUCE        Combine data from all procs onto a single proc
MPI_ALLREDUCE     Combine data from all procs onto all procs
Reduction operations include +, *, AND, OR, min, max
Will give a general overview of these routines — check the MPI specification, MPI
programming books and courses, or the manuals for the implementation you are using,
for the exact usage of each routine.
Point to Point Communications

Communication between two processors: one to send, and one to receive the data.

Information required to specify the message:
Identification of sender processor.
Identification of destination processor.
Type of data — e.g. integers, floats, characters.
Number of data elements to send — a good idea if sending arrays...
Where the data to send lives — usually a pointer.
Where the received data should be stored — again a pointer.

Two different communication operations:
Blocking — sender & receiver wait until the communication is complete.
Non-blocking — send & receive handled ‘offline’, processor continues on.

In addition, four send modes:
Synchronous — waits for the recipient to complete.
Buffered — copies message data into a buffer.
Standard — MPI decides whether the message should be buffered.
Ready — a send is only started when a matching receive has already been initiated.
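
The non-blocking routines (MPI_Isend, MPI_Irecv, MPI_Wait/MPI_Waitall) are not covered further in this chapter, but a small sketch shows the contrast. This fragment assumes rank has already been obtained with MPI_Comm_rank and that at least two processes are running; the variable names are mine.

int out = rank, in, tag = 0, other = 1 - rank;   /* partner: ranks 0 <-> 1   */
MPI_Status status, stats[2];
MPI_Request req[2];

if (rank < 2) {
    /* (a) Blocking exchange: the receive is posted first on one side so the
     *     two processes are never both stuck inside MPI_Send.               */
    if (rank == 0) {
        MPI_Send(&out, 1, MPI_INT, other, tag, MPI_COMM_WORLD);
        MPI_Recv(&in,  1, MPI_INT, other, tag, MPI_COMM_WORLD, &status);
    } else {
        MPI_Recv(&in,  1, MPI_INT, other, tag, MPI_COMM_WORLD, &status);
        MPI_Send(&out, 1, MPI_INT, other, tag, MPI_COMM_WORLD);
    }

    /* (b) Non-blocking exchange: post both operations, overlap with other
     *     work, then wait before using 'in' or reusing 'out'.               */
    MPI_Irecv(&in,  1, MPI_INT, other, tag, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(&out, 1, MPI_INT, other, tag, MPI_COMM_WORLD, &req[1]);
    /* ... useful computation that touches neither 'in' nor 'out' ... */
    MPI_Waitall(2, req, stats);
}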
Collective Communications

Key concept: Groups are sets of processors that communicate with each other
in a certain way.

Communications can take place amongst Group members without cross-talk.

Rather like definable datatypes, this allows a better mapping between the language
and the problem.

Also allows construction of libraries that are guaranteed not to interfere with
communications within user program, since all their communications are
confined to a particular group.

MPI implements Groups using data objects called Communicators.

MPI_COMM_SIZE gives the number of processes (or processors, if there is one
process per processor) in a communicator.

MPI defines a special Communicator, MPI_COMM_WORLD, for the Group of all
processes (or processors).

Subgroups are sub-divided from the world Group.

Each Group member is identified by a number, its Rank, numbered 0, ...,
(Number in Group − 1).
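
One way to carve such a subgroup out of the world Group is MPI_Comm_split (not covered further in this chapter). A minimal sketch, splitting MPI_COMM_WORLD into two halves by rank; the variable names are mine.

int world_rank, world_size, sub_rank;
MPI_Comm half;                           /* new communicator for my half    */

MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
MPI_Comm_size(MPI_COMM_WORLD, &world_size);

/* Processes with the same 'color' end up in the same sub-communicator.    */
int color = (world_rank < world_size / 2) ? 0 : 1;
MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &half);

MPI_Comm_rank(half, &sub_rank);          /* ranks restart from 0 in each half */
MPI_Comm_free(&half);                    /* release it when no longer needed  */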
Header File:

Required for all programs/routines
which make MPI library calls.
C include file
#include "mpi.h"
Fortran include file
include 'mpif.h'
Format of MPI Calls (C binding)
Format:      rc = MPI_Xxxxx(parameter, ...)
Example:     rc = MPI_Bsend(&buf, count, type, dest, tag, comm)
Error code:  returned as "rc"; MPI_SUCCESS if successful.
General MPI Program Structure:
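A rough sketch of the typical shape of an MPI program in C: serial code, an MPI region bracketed by MPI_Init and MPI_Finalize, then serial code again.

#include <stdio.h>
#include <mpi.h>                       /* MPI header file                    */

int main(int argc, char *argv[])
{
    /* Serial code: declarations and non-MPI work only.                      */
    int rank, size;

    MPI_Init(&argc, &argv);            /* begin the parallel / MPI part      */

    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* ... computation and message passing calls ... */

    MPI_Finalize();                    /* end the parallel / MPI part        */

    /* Serial code again: no MPI calls allowed after MPI_Finalize.           */
    return 0;
}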
Communicators and Groups

MPI uses objects called communicators and groups to define which
collection of processes may communicate with each other. Most MPI
routines require you to specify a communicator as an argument.

Use MPI_COMM_WORLD whenever a communicator is required - it
is the predefined communicator that includes all of your MPI
processes.
Rank

Within a communicator, every process has
its own unique, integer identifier assigned by
the system when the process initializes. A
rank is sometimes also called a "task ID".
Ranks are contiguous and begin at zero.

Used by the programmer to specify the
source and destination of messages. Often
used conditionally by the application to
control program execution (if rank=0 do this
/ if rank=1 do that).
MPI_Init
Initializes the MPI execution environment. This function must be
called in every MPI program, must be called before any other
MPI functions and must be called only once in an MPI
program. For C programs, MPI_Init may be used to pass the
command line arguments to all processes, although this is not
required by the standard and is implementation dependent.
MPI_Init (&argc,&argv)
MPI_INIT (ierr)
MPI_Comm_size
Determines the number of processes in the group
associated with a communicator. Generally used
within the communicator MPI_COMM_WORLD to
determine the number of processes being used by
your application.
MPI_Comm_size (comm,&size)
MPI_COMM_SIZE (comm,size,ierr)
MPI_Comm_rank
Determines the rank of the calling process within the
communicator. Initially, each process will be
assigned a unique integer rank between 0 and
number of processors - 1 within the communicator
MPI_COMM_WORLD. This rank is often referred to
as a task ID. If a process becomes associated with
other communicators, it will have a unique rank
within each of these as well.
MPI_Comm_rank (comm,&rank)
MPI_COMM_RANK (comm,rank,ierr)
MPI_Abort
Terminates all MPI processes associated with the
communicator. In most MPI implementations it
terminates ALL processes regardless of the
communicator specified.
MPI_Abort (comm,errorcode)
MPI_ABORT (comm,errorcode,ierr)
MPI_Get_processor_name
Returns the processor name. Also returns the length of the name.
The buffer for "name" must be at least
MPI_MAX_PROCESSOR_NAME characters in size. What is
returned into "name" is implementation dependent - may not
be the same as the output of the "hostname" or "host" shell
commands.
MPI_Get_processor_name (&name,&resultlength)
MPI_GET_PROCESSOR_NAME (name,resultlength,ierr)
MPI_Initialized
Indicates whether MPI_Init has been called - returns
flag as either logical true (1) or false(0). MPI
requires that MPI_Init be called once and only once
by each process. This may pose a problem for
modules that want to use MPI and are prepared to
call MPI_Init if necessary. MPI_Initialized solves this
problem.
MPI_Initialized (&flag)
MPI_INITIALIZED (flag,ierr)
MPI_Wtime
Returns an elapsed wall clock time in seconds (double
precision) on the calling processor.
MPI_Wtime ()
MPI_WTIME ()
MPI_Wtick
Returns the resolution in seconds (double precision) of
MPI_Wtime.
MPI_Wtick ()
MPI_WTICK ()
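
A small sketch of how these two routines are typically used to time a section of code. It assumes rank has already been obtained with MPI_Comm_rank; the timed section is a placeholder.

double start, elapsed;

printf("timer resolution: %g seconds\n", MPI_Wtick());

start = MPI_Wtime();
/* ... code section to be timed ... */
elapsed = MPI_Wtime() - start;

printf("process %d took %f seconds\n", rank, elapsed);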
MPI_Finalize
Terminates the MPI execution environment. This
function should be the last MPI routine called in
every MPI program - no other MPI routines may be
called after it.
MPI_Finalize ()
MPI_FINALIZE (ierr)
Hello World in MPI
#include <stdio.h>
#include <mpi.h>
int main(int argc, char *argv[])
{
    int ierror, rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        printf("Hello World!\n");
    }
    ierror = MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (ierror != MPI_SUCCESS) {
        MPI_Abort(MPI_COMM_WORLD, ierror);
    }
    printf("I am %d out of %d.\n", rank, size);
    MPI_Finalize();
    return 0;
}



Must call initialize and finalize routines.
argc and argv required only for C language.
Note difference between a single output from the program (Hello World!) and
output from each of the processors.
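
How the program is compiled and launched is implementation dependent; with a typical implementation (e.g. MPICH or similar), it looks roughly like the following, where the mpicc wrapper, the mpirun launcher and the -np flag are common conventions rather than part of the MPI standard:

mpicc hello.c -o hello      (compile with the MPI compiler wrapper)
mpirun -np 4 ./hello        (run on 4 processors)

With 4 processors, "Hello World!" appears once (from rank 0), while the "I am ... out of 4." line appears four times, in no particular order.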
Sending a Message:
Blocking, standard mode send
int ierror, message[message_len], dest_rank, tag = 1;
...
ierror = MPI_Send(message, message_len, MPI_INT, dest_rank, tag, MPI_COMM_WORLD);

The message to send.
Message length.
Type of each data element.
Rank of destination processor.
Tag, for filtering out unwanted messages.
The communicator for the sender & receiver group.

Every SEND must have a corresponding RECEIVE (and vice versa), or else deadlock will
occur – processors will be waiting forever to send (or receive) messages that never get
received (or sent).

Deadlock can also occur if message buffers overflow – the MPI implementation
should handle this.
Receiving a Message:
Blocking Receive
int ierror, store[store_len], source_rank, tag = 1;
MPI_Status status;
...
ierror = MPI_Recv(store, store_len, MPI_INT, source_rank, tag, MPI_COMM_WORLD, &status);

Where to store the message, and its size.
Type of each data element.
Rank of source processor.
Same tag as used by the sender processor.
The communicator for the sender & receiver group.
Status data structure for the received message (provides the actual message length,
so it can be compared to what is expected).
If communication is regular, e.g. each processor sends a message to its right
while receiving one from its left (a shift), can combine separate SEND and
RECV calls into a single SENDRECV (send and receive) call.
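
A sketch of such a shift with the combined call, assuming the usual rank/size setup and a periodic ring of processors; the variable names are mine.

int right = (rank + 1) % size;          /* destination of my message        */
int left  = (rank - 1 + size) % size;   /* source of the message I receive  */
int send_val = rank, recv_val, tag = 0;
MPI_Status status;

/* Send to the right neighbour while receiving from the left neighbour.    */
MPI_Sendrecv(&send_val, 1, MPI_INT, right, tag,
             &recv_val, 1, MPI_INT, left,  tag,
             MPI_COMM_WORLD, &status);

The combined call also avoids the deadlock risk of every processor entering a blocking SEND at the same time.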
Global Summation:
Global sum, using Reduction Function
float part_sum = 1.0, global_sum;
int ierror, root_rank = 0;
...
ierror = MPI_Reduce(&part_sum, &global_sum, 1, MPI_FLOAT, MPI_SUM, root_rank, MPI_COMM_WORLD);

Partial and global sum variables.
Number of elements in the partial sum (= 1).
Datatype of element to sum.
Reduction operation.
Rank (processor) where results will be deposited.
Communicator for the participating processors.
MPI_REDUCE places the result on the specified processor; MPI_ALLREDUCE places
the result on all processors.

Other reduction operations include MPI_MAX, MPI_MIN, MPI_LAND (logical AND),
MPI_LXOR (logical exclusive OR), etc.

Also useful are MPI_MINLOC and MPI_MAXLOC, which return the location of the
minimum or maximum values in the data structure.
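
A minimal sketch of MPI_MAXLOC, using the MPI_FLOAT_INT pair datatype (a value and an index, here the owning rank). The values are arbitrary and purely illustrative.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;
    struct { float value; int rank; } local, global;   /* layout of MPI_FLOAT_INT */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each processor contributes one value (here just a function of its rank). */
    local.value = (float)((rank * 7) % 5);
    local.rank  = rank;

    /* MPI_MAXLOC finds the largest value and the rank that owns it;
     * using MPI_Allreduce, every processor receives the answer.               */
    MPI_Allreduce(&local, &global, 1, MPI_FLOAT_INT, MPI_MAXLOC, MPI_COMM_WORLD);

    if (rank == 0)
        printf("max value %f is on processor %d\n", global.value, global.rank);

    MPI_Finalize();
    return 0;
}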