Transcript Slide 1

Parallel Computing Through
MPI Technologies
Author: Nyameko Lisa
Supervisors: Prof. Elena Zemlyanaya, Prof. Alexandr P. Sapozhnikov and Tatiana F. Sapozhnikov
Outline – Parallel Computing through MPI Technologies
• Introduction
• Overview of MPI
• General Implementation
• Examples
• Application to Physics Problems
• Concluding Remarks

Introduction – Need for Parallelism
• There are more stars in the sky than there are grains of sand on all the beaches of the world
Introduction – Need for Parallelism
• It requires approximately 204 billion atoms to encode the human genome sequence
• A vast number of problems from a wide range of fields have significant computational requirements
Introduction – Aim of Parallelism
• Attempt to divide a single problem into multiple parts
• Distribute the segments of said problem amongst various processes or nodes
• Provide a platform layer to manage data exchange between multiple processes that solve a common problem simultaneously
Introduction – Serial Computation
• Problem divided into a discrete, serial sequence of instructions
• Each instruction executed individually, on a single CPU
Introduction – Parallel Computation
• Same problem distributed amongst several processes (program and allocated data)
Introduction – Implementation
• Main goal is to save time and hence money
  – Furthermore, can solve larger problems that would exhaust the resources of a single machine
  – Overcome intrinsic limitations of serial computation
  – Distributed systems provide redundancy, concurrency and access to non-local resources, e.g. SETI, Facebook, etc.
• 3 methodologies for implementation of parallelism:
  – Physical Architecture
  – Framework
  – Algorithm
• In practice will almost always be a combination of the above
• Greatest hurdle is managing the distribution of information and data exchange, i.e. overhead
Introduction – Top 500
• Japan's K Computer (Kei = 10 quadrillion)
• Currently the fastest supercomputer cluster in the world
• 8.162 petaflops (~8 × 10^15 calculations per second)
Overview – What is MPI?
• Message Passing Interface
• One of many frameworks and technologies for implementing parallelization
• Library of subroutines (FORTRAN), classes (C/C++) and bindings for Python packages that mediate communication (via messages) between single-threaded processes, executing independently and in parallel
Overview – What is needed?
• Common user accounts with the same password
• Administrator / root privileges for all accounts
• Common directory structure and paths
• MPICH2 installed on all machines
  – This is a combination of the MPI-1 and MPI-2 standards
  – CH – the Chameleon portability layer provides backward compatibility to existing MPI frameworks
Overview – What is needed?
• MPICC & MPIF77 – provide the options and special libraries needed to compile and link MPI programs
• MPIEXEC – initializes parallel jobs and spawns copies of the executable to all of the processes
• Each process executes its own copy of the code
• By convention, choose the root process (rank 0) to serve as the master process
General Implementation
Hello World - C++
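The code on this slide appears only as an image in the original deck. A minimal C++ sketch of an equivalent MPI "Hello World" (the file name and the compile/run commands in the comments are illustrative assumptions, not taken from the slides):

// hello_mpi.cpp – minimal MPI "Hello World" sketch, not the original slide code
// Possible compile/run with the MPICH2 wrappers: mpicxx hello_mpi.cpp -o hello ; mpiexec -n 4 ./hello
#include <mpi.h>
#include <iostream>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);                    // start the MPI environment

    int numProcs = 0, myProc = 0;
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);  // total number of processes
    MPI_Comm_rank(MPI_COMM_WORLD, &myProc);    // rank of this process

    if (myProc == 0)                           // by convention rank 0 is the master process
        std::cout << "Master: running on " << numProcs << " processes\n";
    std::cout << "Hello World from process " << myProc << "\n";

    MPI_Finalize();                            // shut down MPI
    return 0;
}

Every process runs this same executable and prints its own line, so the ordering of the output is generally non-deterministic.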
General Implementation
Hello World - FORTRAN
General Implementation
Hello World - Output
Example – Broadcast Routine
• Point-to-point (send & recv) and collective (bcast) routines are provided by the MPI library
• Source node mediates distribution of data to/from all other nodes
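For comparison with the custom broadcasts on the next slides, a minimal sketch of the library collective call, assuming a single integer payload and rank 0 as the source node:

// bcast_builtin.cpp – sketch using the collective library routine (illustrative)
#include <mpi.h>
#include <iostream>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int myProc = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &myProc);

    int data = 0;
    if (myProc == 0)
        data = 42;                                   // only the source node holds the value initially

    // Every process calls the collective routine; rank 0 is the source
    MPI_Bcast(&data, 1, MPI_INT, 0, MPI_COMM_WORLD);

    std::cout << "Process " << myProc << " received " << data << "\n";

    MPI_Finalize();
    return 0;
}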
Example – Broadcast Routine: Linear Case
• Apart from the root and last nodes, each node receives from and sends to the previous and next node respectively
• Use point-to-point library routines to build a custom collective routine (see the sketch below):
  – MPI_RECV(myProc - 1)
  – MPI_SEND(myProc + 1)
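A C++ sketch of this linear hand-off, assuming a single integer message and rank 0 as the root:

// bcast_linear.cpp – linear (chain) broadcast built from point-to-point calls (illustrative)
#include <mpi.h>
#include <iostream>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int numProcs = 0, myProc = 0;
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myProc);

    int data = 0;
    if (myProc == 0)
        data = 42;                                   // root node owns the message
    else                                             // every other node receives from the previous node
        MPI_Recv(&data, 1, MPI_INT, myProc - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    if (myProc < numProcs - 1)                       // all but the last node pass it on to the next node
        MPI_Send(&data, 1, MPI_INT, myProc + 1, 0, MPI_COMM_WORLD);

    std::cout << "Process " << myProc << " has " << data << "\n";

    MPI_Finalize();
    return 0;
}

The chain takes numProcs - 1 sequential steps, which is why the binary tree on the next slide scales better.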
Example – Broadcast Routine: Binary Tree
• Each parent node sends the message to two child nodes (see the sketch below):
  – MPI_SEND(2 * myProc)
  – MPI_SEND(2 * myProc + 1)
• Each non-root node receives from its parent:
  – IF( MOD(myProc, 2) == 0 ) MPI_RECV( myProc / 2 )
  – ELSE MPI_RECV( (myProc - 1) / 2 )
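A C++ sketch of the tree broadcast follows. It maps each rank to a hypothetical 1-based index node = myProc + 1 (my own convention, not from the slides) so that the parent/child relations above – children 2*node and 2*node+1, parent node/2 – hold for every process, with rank 0 as the root:

// bcast_tree.cpp – binary-tree broadcast built from point-to-point calls (illustrative)
#include <mpi.h>
#include <iostream>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int numProcs = 0, myProc = 0;
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myProc);

    const int node = myProc + 1;                     // 1-based index: children are 2*node and 2*node+1

    int data = 0;
    if (myProc == 0)
        data = 42;                                   // root node owns the message
    else
        // Parent of node is node/2 (integer division), covering both the even and odd cases on the slide
        MPI_Recv(&data, 1, MPI_INT, node / 2 - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    for (int child : { 2 * node, 2 * node + 1 })     // forward the message down the tree
        if (child <= numProcs)
            MPI_Send(&data, 1, MPI_INT, child - 1, 0, MPI_COMM_WORLD);

    std::cout << "Process " << myProc << " has " << data << "\n";

    MPI_Finalize();
    return 0;
}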
Example – Broadcast Routine: Output
Applications to Physics Problems
• Quadrature – discretize the interval [a,b] into N steps and divide them amongst the processes (see the sketch after this slide):
  – FOR LOOP (1 + myProc to N; increment of numProcs)
  – E.g. with N = 10 and numProcs = 3:
      Process: Iteration 1, Iteration 2, …
      0: 1, 4, 7, 10
      1: 2, 5, 8
      2: 3, 6, 9
• Finite Difference problems – similarly divide the mesh/grid amongst the processes
• Many applications, limited only by our ingenuity
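As an illustration of this round-robin decomposition, a C++ sketch of a parallel quadrature using the midpoint rule; the integrand f, the interval, and the use of MPI_Reduce to combine the partial sums are assumptions for the example, not taken from the slides:

// quad_mpi.cpp – round-robin parallel quadrature sketch (midpoint rule, illustrative)
#include <mpi.h>
#include <cmath>
#include <iostream>

double f(double x) { return std::sin(x); }           // placeholder integrand

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int numProcs = 0, myProc = 0;
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myProc);

    const double a = 0.0, b = 3.141592653589793;     // integration interval [a, b]
    const int    N = 10;                              // number of steps, as on the slide
    const double h = (b - a) / N;

    // Round-robin loop: i = 1 + myProc, 1 + myProc + numProcs, ... exactly as on the slide
    double localSum = 0.0;
    for (int i = 1 + myProc; i <= N; i += numProcs)
        localSum += f(a + (i - 0.5) * h) * h;        // contribution of the i-th step

    // Combine the partial sums on the master process (rank 0)
    double total = 0.0;
    MPI_Reduce(&localSum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (myProc == 0)
        std::cout << "Integral approx. " << total << "\n";

    MPI_Finalize();
    return 0;
}

With N = 10 and 3 processes this reproduces the iteration assignment shown above: process 0 handles steps 1, 4, 7, 10; process 1 handles 2, 5, 8; and process 2 handles 3, 6, 9.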
Closing Remarks
• In the 1970s, Intel co-founder Gordon Moore correctly predicted that the "number of transistors that can be inexpensively placed on an integrated circuit doubles approximately every 2 years"
• 10-core Xeon E7 processor family chips are currently commercially available
• MPI is easy to implement and well suited to many independent operations that can be executed simultaneously
• The only limitations are the overhead incurred by inter-process communications, our ingenuity, and strictly sequential segments of the program
Acknowledgements and Thanks
• NRF and the South African Department of Science and Technology
• JINR, University Center
• Dr. Jacobs and Prof. Lekala
• Prof. Elena Zemlyanaya, Prof. Alexandr P. Sapozhnikov and Tatiana F. Sapozhnikov
• Last but not least, my fellow colleagues