Rama Shenai - Chp. 8 Beowulf


MPI - A User level Message-passing Interface
K. Ramakrishna Shenai
History of MPI
• Principal branches of MPI
• MPI version 1.2
• MPI version 2.0
• An MPI process is normally written in C (or C++) or Fortran and linked with an MPI library.
MPI Basic functionality
• An MPI program consists of
• A set of processes, each occupying a separate, unshared address space
• Communication of data by invoking MPI procedures
• MPI-2 adds the ability to create and destroy processes dynamically.
MPI Basic functionality (contd.)
• MPI processes can be
• Copies of the same executable file running in different address spaces
• Instances of different executables, built from different source files and linked together
• Started with different command-line arguments on different processes
• Rank in the MPI_COMM_WORLD communicator
• Each process has a unique rank in the range [0 .. p-1]
/* Example Program: Hello world */
#include <mpi.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int rank, size, partner;
    int namelen;
    char name[MPI_MAX_PROCESSOR_NAME];
    char greeting[sizeof(name) + 100];

    MPI_Init(&argc, &argv);                  /* Initialize MPI */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* How many of us are there? */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* Which one am I? */
    MPI_Get_processor_name(name, &namelen);  /* Where am I running? */

    sprintf(greeting, "hello world %d of %d running on %s\n",
            rank, size, name);
    if (rank == 0) {
        fputs(greeting, stdout);
        for (partner = 1; partner < size; partner++) {
            MPI_Status stat;
            MPI_Recv(greeting, sizeof(greeting), MPI_BYTE, partner,
                     1, MPI_COMM_WORLD, &stat);
            fputs(greeting, stdout);
        }
    } else {
        MPI_Send(greeting, strlen(greeting) + 1, MPI_BYTE,
                 0, 1, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    exit(0);
}
Basic Concepts
MPI_Init and MPI_Finalize
MPI_Init: In C, MPI_Init should be passed the addresses of argc and argv. This procedure must be called before any other MPI procedure.
MPI_Finalize: This is the terminating procedure; no other MPI procedure may be called after it.
Basic Concepts
Communicators
• Most MPI procedures require the caller to specify a communicator argument, which defines the context in which communication takes place
• In C, communicators have the type MPI_Comm
• A process can obtain the size of a communicator with MPI_Comm_size and its own rank within it with MPI_Comm_rank
MPI_Get_processor_name
• A procedure used to determine which physical processor the current process is running on. It usually returns the system's Internet hostname.
Basic Concepts: MPI_Send and MPI_Recv
• They are point-to-point communication primitives
• Messages are transmitted directly from the sender's memory to the receiver's memory.
• C prototypes
• int MPI_Send(void *buf, int count, MPI_Datatype datatype,
int dest, int tag, MPI_Comm comm)
• int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
int source, int tag, MPI_Comm comm, MPI_Status *status)
• These are mainly blocking calls; in certain cases a non-blocking mode of communication is possible.
• Other communication modes (buffered, synchronous and ready) exist.
Basic Concepts
• MPI_Status structure: can be examined to determine the source and tag of a received message
• Loosely synchronous operations: all processes engaged in a communication operation must call the corresponding MPI procedure at the same logical point in execution, with respect to all other MPI communication calls.
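As an illustration (a minimal sketch, not from the original slides), the MPI_Status fields can be examined after a receive; the buffer size and the use of wildcard source/tag here are assumptions, and the fragment is assumed to run inside an MPI program after MPI_Init:
/* Sketch: receive from any source/tag and inspect the status fields */
int buf[16];
int nints;
MPI_Status stat;
MPI_Recv(buf, 16, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
         MPI_COMM_WORLD, &stat);
printf("got a message from rank %d with tag %d\n",
       stat.MPI_SOURCE, stat.MPI_TAG);
MPI_Get_count(&stat, MPI_INT, &nints);   /* number of MPI_INTs actually received */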
Basic Concepts: Balancing Sends and Receives
/* Incorrect: may deadlock, depending on buffering */
MPI_Send(out, count, type, partner, tag, comm);
MPI_Recv(in, count, type, partner, tag, comm, &stat);

/* Correct: each send is matched exactly by a receive */
if (rank > partner) {
    MPI_Send(out, count, type, partner, tag, comm);
    MPI_Recv(in, count, type, partner, tag, comm, &stat);
} else if (rank < partner) {
    MPI_Recv(in, count, type, partner, tag, comm, &stat);
    MPI_Send(out, count, type, partner, tag, comm);
} else { /* rank == partner */
    MPI_Type_size(type, &sz);
    memcpy(in, out, count * sz);   /* contents of stat not set */
}

/* Also correct: use the single MPI_Sendrecv procedure */
MPI_Sendrecv(out, count, type, partner, tag, in, count, type,
             partner, tag, comm, &stat);
Basic Concepts
MPI data types: elementary MPI data types in C
• MPI_CHAR, MPI_SHORT, MPI_INT, MPI_LONG,
MPI_UNSIGNED_CHAR, MPI_UNSIGNED_SHORT,
MPI_UNSIGNED, MPI_UNSIGNED_LONG, MPI_FLOAT,
MPI_DOUBLE, MPI_LONG_DOUBLE, MPI_BYTE
Use of Standard Libraries
• MPI does not alter the behavior of the standard libraries.
• Calls to external libraries (especially I/O libraries) should be made from a distinguished process, usually the rank 0 process (see the sketch below).
• LAM and MPICH are both quite flexible when it comes to handling such libraries.
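As a hedged sketch (not from the original slides) of the rank-0 I/O pattern: rank 0 performs the I/O and broadcasts the result to all other processes. The file name "params.txt", the parameter nsteps, and the default value are illustrative assumptions.
/* Sketch: only rank 0 reads a parameter from a file, then broadcasts it */
#include <mpi.h>
#include <stdio.h>

void read_params(int *nsteps) {
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        FILE *fp = fopen("params.txt", "r");       /* hypothetical input file */
        if (fp == NULL || fscanf(fp, "%d", nsteps) != 1)
            *nsteps = 100;                         /* fall back to a default */
        if (fp) fclose(fp);
    }
    /* every other rank receives the value read by rank 0 */
    MPI_Bcast(nsteps, 1, MPI_INT, 0, MPI_COMM_WORLD);
}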
Errors in MPI
Compiling using MPI on the Beowulf
• MPICH is installed in a public location like /usr/local/mpich, also called <MPI_ROOT>
• All user-oriented utilities are stored in <MPI_ROOT>/bin, header files in <MPI_ROOT>/include, and libraries in <MPI_ROOT>/lib/LINUX/ch_p4/
• <MPI_ROOT>/bin contains mpicc, a script for compiling C programs with the appropriate options.
• Two ways of compiling a C program into an executable:
• MPIROOT=/usr/local/mpich
• mpicc hello.c -o hello
• cc -o hello -I$MPIROOT/include hello.c -L$MPIROOT/lib/LINUX/ch_p4 -lmpich
Running an MPI Program
mpirun utility
Options
• -np <number_of_processes>: start the program with the specified number of processes
• -machinefile <filename>: explicitly specify the list of processors
• -nolocal: force the first process to execute on the first processor specified in the machine file
By default, mpirun starts process 0 on the processor that invoked mpirun. A sample invocation is sketched below.
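For illustration (a hedged sketch, not from the original slides; the host file name and node names are assumptions), compiling and launching the hello program on four processes might look like:
• mpicc hello.c -o hello
• mpirun -np 4 -machinefile hosts.txt ./hello
where hosts.txt is a hypothetical file listing one node hostname per line.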
Parallel Data Structures with MPI
• MPI provides a simple memory model over which parallel data structures can be created.
• Every symbol is actually a parallel object that exists on every process; the symbol can take a different value on each process.
Example Implementation: A Parallel Array
• parray structure
typedef struct {
    int nelem_local;        /* the number of elements in local memory */
    void *data;
    MPI_Comm comm;
    int first_index_local;  /* global index of the first element of the
                               local data (whose local index is 0) */
    int nelem_global;       /* aggregate number of elements */
    size_t elem_size;
    int commrank, commsize;
} parray;

/* Procedure to create a self-consistent parallel array from
   information provided by each process about its own local data */
void pacreate(parray *pa, void *data, int nelem, size_t size,
              MPI_Comm comm) {
    int last_index;
    pa->nelem_local = nelem;
    pa->data = data;
    pa->comm = comm;
    pa->elem_size = size;
    /* My first element's global index is the sum of the number of
       elements on lower-ranked processes */
    MPI_Scan(&pa->nelem_local, &last_index, 1, MPI_INT, MPI_SUM,
             pa->comm);
    pa->first_index_local = last_index - pa->nelem_local;
    MPI_Comm_size(pa->comm, &pa->commsize);
    MPI_Comm_rank(pa->comm, &pa->commrank);
    /* The global element count is the last index on the highest-ranked
       process (commsize - 1); use a broadcast to distribute it */
    MPI_Bcast(&last_index, 1, MPI_INT, pa->commsize - 1, pa->comm);
    pa->nelem_global = last_index;
}
Example: A Parallel Array
• The parray structure, once defined, automatically exists on every process.
• Collective communication
• MPI_Bcast: sends the contents of a buffer on one process (the root) to all other processes in the communicator
• MPI_Scan: performs an operation (like '+') on values supplied by each process; the result returned on process i is the operation applied to the values supplied by processes of rank 0 through i. A small sketch follows this list.
• MPI predefined reduction/scan operations: MPI_MAX, MPI_SUM, MPI_PROD, MPI_MAXLOC (returns an extremum together with an associated value, such as an array index or rank), etc.
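A hedged sketch (not from the original slides) of MPI_Scan semantics; the variable names are arbitrary and the fragment assumes it runs after MPI_Init:
/* Sketch: each process contributes rank+1; the inclusive scan on
   process i returns 1 + 2 + ... + (i+1). */
int rank, mine, running_total;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
mine = rank + 1;
MPI_Scan(&mine, &running_total, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
printf("rank %d: running total = %d\n", rank, running_total);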
Example: A Parallel Array (contd.)
• MPI_Allreduce: computes the result of an operation applied to a value supplied by every process; the result is returned to every process. Its result equals the value that an inclusive MPI_Scan returns on the highest-ranked process.
• Some collectives are equivalent to combinations of two or more other operations; for example, MPI_Allreduce behaves like MPI_Reduce followed by MPI_Bcast. Other collectives include MPI_Gather and MPI_Scatter.
/* Procedure to compute the mean of the elements in a parallel
   array of doubles */
double parraymean(parray *pa) {
    double sum, sumall;
    double *elem = (double *)pa->data;   /* the elements are doubles */
    int i;
    /* Compute the local sum of elements */
    sum = 0;
    for (i = 0; i < pa->nelem_local; i++)
        sum = sum + elem[i];
    /* Use Allreduce to get the global sum */
    MPI_Allreduce(&sum, &sumall, 1, MPI_DOUBLE, MPI_SUM, pa->comm);
    return sumall / pa->nelem_global;
}
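A hedged usage sketch (not from the original slides) showing how the pacreate and parraymean procedures above might be called; the local array size and test values are arbitrary assumptions:
/* Sketch: each process contributes 10 local doubles to a parallel array
   and rank 0 prints the global mean. Assumes the parray code above. */
int main(int argc, char *argv[]) {
    double local[10], mean;
    parray pa;
    int i, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (i = 0; i < 10; i++)
        local[i] = rank * 10 + i;            /* arbitrary test data */
    pacreate(&pa, local, 10, sizeof(double), MPI_COMM_WORLD);
    mean = parraymean(&pa);                  /* collective: all ranks call it */
    if (rank == 0)
        printf("global mean = %g\n", mean);
    MPI_Finalize();
    return 0;
}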
A One-dimensional Cellular Automaton
• One-dimensional cellular automata (CA) are simple dynamic systems in which the placement of data is meaningful, requiring point-to-point communication.
• A CA with half-width hw is an array of values with an update rule stating that the next value in location i depends only on the previous values in locations (i-hw, ..., i, ..., i+hw).
• The values of a cellular automaton can be integers, bits, etc.
• The update rule can be an arbitrary function of the 2*hw+1 input values. Special cases include linear functions and functions like "parity", which count the number of values in the input neighborhood.
Structure of the one-dimensional cellular automaton (typedef definition):
typedef struct ca_s {
    unsigned char *state;   /* size ncells */
    int A; int hw; int ncells;
    unsigned char *old;     /* size ncells + 2*hw */
    /* other fields that control updateRule */
} CA_t;
/* Updating a cellular automaton */
void Caiterate(CA_t *ca, int ntimes) {
    int i;
    for (i = 0; i < ntimes; i++) {
        Cacopystate(ca);
        Caupdate(ca);
    }
}

static void Caupdate(CA_t *ca) {
    int n = ca->ncells;
    unsigned char *oldcenter = ca->old;
    unsigned char *new = ca->state;
    while (n-- > 0) {
        *new++ = updateRule(ca, oldcenter++);
    }
}

static void Cacopystate(CA_t *ca) {
    memcpy(ca->old, ca->state, ca->ncells);
    /* Now for periodic boundary conditions: pad both ends of old with
       hw values copied from the opposite end of state */
    memcpy(ca->old - ca->hw, ca->state + (ca->ncells - ca->hw), ca->hw);
    memcpy(ca->old + ca->ncells, ca->state, ca->hw);
}
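The updateRule function itself is not shown in the slides; as a hedged sketch (an assumption, not the book's code), a parity rule over the 2*hw+1 neighborhood might look like the following. It ignores the other control fields of CA_t.
/* Sketch: a possible updateRule computing the parity (sum mod 2) of the
   2*hw+1 cells centered on oldcenter. Hypothetical, for illustration only. */
static unsigned char updateRule(CA_t *ca, unsigned char *oldcenter) {
    int j, sum = 0;
    for (j = -ca->hw; j <= ca->hw; j++)
        sum += oldcenter[j];
    return (unsigned char)(sum % 2);
}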
A One-dimensional Cellular Automaton (contd.)
• Caiterate performs two operations per time step
• Cacopystate copies the contents of the state array into the old array so that new values can be written into state; Caupdate then computes the new state from the values in the old array.
• The periodic boundary conditions are imposed by padding the old array at both ends with hw values copied from the opposite end of state.
(Figure: the serial Cacopystate. The state array is copied into old and padded at both ends with hw cells taken from the opposite end; the arrows indicate calls to memcpy.)
Parallelizing the Problem!
• General guideline: keep as much of the sequential code intact as possible. The order of the elements and the relationship between elements stored on different processors is important.
• Boundary data should be exchanged between neighbors
• Modifications
• An MPI_Comm element is added to the CA structure
• The boundary calls to memcpy are replaced by calls to MPI_Sendrecv in Cacopystate
typedef struct ca_s {
    unsigned char *state;   /* size ncells */
    int A; int hw; int ncells;
    unsigned char *old;     /* size ncells + 2*hw */
    MPI_Comm comm;          /* communicator added for the parallel version */
    /* other fields that control updateRule */
} CA_t;
Parallelizing the Problem!
static void Cacopystate(CA_t *ca) {
    int up_neighbor, down_neighbor, myrank, nproc;
    MPI_Status stat;
    memcpy(ca->old, ca->state, ca->ncells);
    MPI_Comm_rank(ca->comm, &myrank);
    MPI_Comm_size(ca->comm, &nproc);
    up_neighbor = (myrank + 1) % nproc;
    down_neighbor = (myrank + nproc - 1) % nproc;
    /* CA_UPTAG and CA_DOWNTAG are message-tag constants defined elsewhere */
    /* send my top hw cells up; receive the low-end padding from below */
    MPI_Sendrecv(ca->state + (ca->ncells - ca->hw), ca->hw,
                 MPI_BYTE, up_neighbor, CA_UPTAG, ca->old - ca->hw, ca->hw,
                 MPI_BYTE, down_neighbor, CA_UPTAG, ca->comm, &stat);
    /* send my bottom hw cells down; receive the high-end padding from above */
    MPI_Sendrecv(ca->state, ca->hw, MPI_BYTE, down_neighbor,
                 CA_DOWNTAG, ca->old + ca->ncells, ca->hw,
                 MPI_BYTE, up_neighbor, CA_DOWNTAG, ca->comm, &stat);
}
(Figure: two processes are shown. In the parallel version, when the state of the CA is copied to old, it is padded at both ends by data from the neighboring processes; the diagonal arrows represent data transfers by MPI_Sendrecv, and the dotted arrow represents a memcpy.)
CONSIDERATIONS
• Domain decomposition and boundaries: build the parallel version of a problem as a collection of near-copies of a sequential implementation, with special boundary conditions derived by communication with neighbors.
• Start with a good sequential implementation
• Fundamental algorithms are usually designed sequentially, so the changes required for parallelization should be kept as small as possible.
• For the CA, it is a matter of obtaining boundary conditions from neighboring processes.
• Avoid changing the fundamental sequential parts of a code as far as possible.
CONSIDERATIONS
• Communication should be treated as a phase separate from local computation; it has a major effect on performance.
• Performance considerations
• Communication is slow: communication latency tends to dominate the time.
• Even when the local computation is fast, overall performance can be reduced by the time spent in communication.
CONSIDERATIONS
• Time taken to complete one iteration of a CA with N cells on P processors:
t_step = 2 t_latency + (N/P) t_update
       = (N t_update / P) * (1 + 2 t_latency / ((N/P) t_update))
The first factor is just the time taken on one processor divided by P, a perfect-speedup result.
The second factor represents how much worse than perfect the actual implementation is: it contains the ratio of the time spent in communication, 2 t_latency, to the time spent in Caupdate, (N/P) t_update.
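For illustration (assumed numbers, not from the slides), take t_latency = 100 microseconds, t_update = 1 microsecond, N = 1,000,000 cells, and P = 16 processors:
(N/P) t_update = 62,500 microseconds of local work per step
2 t_latency / ((N/P) t_update) = 200 / 62,500, about 0.3%, so the communication overhead is negligible;
with N = 10,000 the same ratio becomes 200 / 625 = 32%, a substantial slowdown.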
MPI Advanced Features
• Blocking and non-blocking calls and alternative sending modes (see the sketch after this list)
• Blocking calls always wait for the requested action to complete before returning control to the caller.
• Non-blocking calls initiate the operation and return immediately; they are used mainly to avoid unnecessary copying between system buffers and user memory.
• Standard-mode sends
• Buffered-mode sends
• Synchronous-mode sends
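A hedged sketch (not from the original slides) of the non-blocking pattern with MPI_Isend/MPI_Irecv and MPI_Wait; the buffer sizes, the partner rank, and the tag are assumptions, and outbuf is assumed to have been filled earlier:
/* Sketch: post a non-blocking receive and send, do other work, then wait.
   Neither buffer may be reused until the corresponding wait completes. */
double inbuf[100], outbuf[100];
MPI_Request sreq, rreq;
MPI_Status stat;
MPI_Irecv(inbuf, 100, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &rreq);
MPI_Isend(outbuf, 100, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &sreq);
/* ... useful local computation can overlap the communication here ... */
MPI_Wait(&rreq, &stat);   /* inbuf is now valid */
MPI_Wait(&sreq, &stat);   /* outbuf may now be reused */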
MPI Advanced Features
• Derived data types (see the sketch after this list)
• MPI includes a powerful set of procedures that can be used to refer to noncontiguous data, for example data of heterogeneous type such as an integer followed by a set of floating-point values.
• Latency is a major problem on Beowulf systems, especially when transferring many small messages; one remedy is to pack multiple logical messages into a single physical message and incur the startup cost only once.
• Intercommunicators
• Normally, communicators are derived by selecting subsets of MPI_COMM_WORLD, the group of all active processes; sometimes the resulting logical structures are insufficient.
• It is then desirable to construct a new communicator from the union of two existing communicators; these are called intercommunicators. They are used in applications involving functional parallelism.
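A hedged sketch (not from the original slides) of a simple derived datatype built with MPI_Type_contiguous; the element count and the partner rank are assumptions:
/* Sketch: describe a block of 100 doubles as a single derived datatype,
   then send one instance of it instead of 100 separate values. */
MPI_Datatype block_t;
double buf[100];
MPI_Type_contiguous(100, MPI_DOUBLE, &block_t);
MPI_Type_commit(&block_t);
MPI_Send(buf, 1, block_t, partner, 0, MPI_COMM_WORLD);
MPI_Type_free(&block_t);
For truly heterogeneous data (such as an integer followed by floating-point values), MPI_Type_create_struct (MPI_Type_struct in MPI-1) can be used instead.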
MPI 2
• Process creation and management: processes can be dynamically created and destroyed.
• One-sided communication: allows either the sender or the receiver to set the parameters for communication. This reduces complexity and can exploit fast communication mechanisms (see the sketch below).
• Extended collective operations: new collective procedures, including collective operations on intercommunicators.
• External interfaces: users and library designers can access the internals of MPI's objects, allowing new functionality to be layered atop MPI's internal data structures.
• New language bindings for C++ and Fortran.
• I/O: MPI-IO describes the application side of an abstract model of parallel I/O, i.e., how to define requests to an abstract parallel I/O device from a distributed-memory parallel program.
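A hedged sketch (not from the original slides) of MPI-2 one-sided communication using a window and MPI_Put; the window size, target rank, displacement, and value are assumptions, and rank is assumed to hold the caller's rank:
/* Sketch: each process exposes an array of 10 ints in a window; rank 0
   writes one value into element 3 of rank 1's window using MPI_Put. */
int local[10];
MPI_Win win;
MPI_Win_create(local, 10 * sizeof(int), sizeof(int),
               MPI_INFO_NULL, MPI_COMM_WORLD, &win);
MPI_Win_fence(0, win);                 /* open an access epoch */
if (rank == 0) {
    int value = 42;
    MPI_Put(&value, 1, MPI_INT, 1, 3, 1, MPI_INT, win);
}
MPI_Win_fence(0, win);                 /* complete the epoch */
MPI_Win_free(&win);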