Transcript Slide 1

COMP60611
Fundamentals of Parallel and Distributed
Systems
Lecture 5
Parallel Programming:
Languages – pthreads (and MPI)
John Gurd, Graham Riley
Centre for Novel Computing
School of Computer Science
University of Manchester
Sept 2011
1
Overview
– Parallel Programming Fundamentals
• Different levels of programming…
• Managing parallel work units and their interaction
– The Unix Model ( http://www.unix.org )
• Processes and threads ---- memory mapping
– Overview of pthreads - single address space (and MPI multiple address spaces)
– Summary
Sept 2011
2
Extensions to C
– We need specific programming constructs to define parallel
computations. We shall use (sequential) C as a starting point.
– In this lecture, we investigate extensions to C that allow the
programmer to express parallel activity in the thread-based or
data-sharing style and in the message passing style.
– We approach this from two main directions:
• Extensions that allow the programmer to create and manage
threads explicitly and interact via shared memory.
• Extensions that allow the programmer to manage processes
explicitly and exchange data via messages.
Sept 2011
3
Different Levels of
Thread-based Programming
– None of these schemes is fully implicit (i.e. automatic);
unfortunately, autoparallelisation of C (or any other serial)
programs is beyond the present state-of-the-art. Instead,
different schemes offer increasing amounts of high-level
assistance for the creation and management of parallel
threads.
– POSIX Parallel Threads Library
• 'Bare-metal' approach --- the programmer is responsible for
everything except the posix call implementations.
– OpenMP API (a higher level alternative for threads)
• Much functionality is provided, e.g. at the loop level --- the
programmer is presented with a simpler picture, but the
scope for losing performance through naivety increases.
Sept 2011
4
How to Obtain Parallel Thread-based
Activity
– The general approach to developing a parallel code is the
same in each scheme. The basic idea is to create a number of
parallel threads and then find (relatively) independent units of
work for each of them to execute.
– Units of work come in two basic types which correspond to
task- and data-parallelism:
• Functionally different subprograms, each executed once;
• Single subprogram, executed multiple times – with different data
– In general, these forms of parallelism can be nested.
– Each scheme relies on run-time support routines, provided as
part of the operating system. It is important to know how
memory (address space) is laid out at run-time. An example is
given by the UNIX system, described on the next slide – this is
similar to other operating systems.
5
Sept 2011
The UNIX Model:
Processes and Threads
• There are two basic units:
– A Process, the basic unit of resource.
– A thread, the basic unit of execution.
– The simplest process is one having a single thread
of execution.
• This corresponds well to our programming models.
Code is shared by all threads in a process. The
general situation is illustrated in the following slide.
– (Note: the terminology used in other operating systems is
dangerously ambiguous.)
Sept 2011
6
Memory Map for UNIX Processes and
Threads
OS segments
Code segment
Process-shared
data
Task-shared data
Thread-shared
Thread-shared data
data
Sept 2011
PC
Master stack
PC
PC
PC
Thread stack
Thread stack
Thread stack
7
POSIX Threads and an example…
• An IEEE standard for UNIX (like) systems (defined for C)
– Standard ‘wrappers’ exist to support use from FORTRAN and other
languages
• A set of library routines (and a run-time system) to manage the explicit
creation and destruction of threads, and to manage their interaction
• Essentially, a pthread executes a user-defined function
– Scheduling of work to threads is down to the user
• Calls to pthread synchronisation routines manage the interaction
between shared data
• OpenMP implementations can be built on top of pthreads
– with details hidden from user
• See: https://computing.llnl.gov/tutorials/pthreads
– Good overview; starts with a description of the relationship between
processes and threads
Sept 2011
8
Pthreads – simple example
#include <pthread.h>
#include <stdio.h>
#define NUM_THREADS 5
int main (int argc, char *argv[]) {
pthread_t threads[NUM_THREADS];
Int rc; long t;
for (t=0; t<NUM_THREADS; t++) {
printf ("In main: creating thread %ld\n", t);
rc = pthread_create (&threads[t], NULL, PrintHello, (void *)t);
if (rc) {
printf ("ERROR; return code from pthread_create() is %d\n", rc);
exit(-1);
}
}
pthread_exit(NULL);
}
Sept 2011
9
void *PrintHello(void *threadid)
{
long tid;
tid = (long)threadid;
printf("Hello World! It's me, thread #%ld!\n", tid);
pthread_exit(NULL);
}
Sept 2011
10
Synchronisation mechanisms
• Locks and condition variables in pthreads
– Example calls are shown on the following slides
• Semphores (provided by the OS)
– An early (Dijkstra, 1965) mechanism to control resource
sharing in concurrent systems (providing mutual exclusion).
– Pthread synchronisation routines are frequently implemented
by OS-supported semaphores. For example, a binary
semaphore is a lock.
– More on these later…
Sept 2011
11
Pthread locks
pthread_mutex_t mutex1 = PTHREAD_MUTEX_INITIALIZER;
void *func() {
pthread_mutex_lock( &mutex1 );
counter++;
printf("Counter value: %d\n",counter);
pthread_mutex_unlock( &mutex1 );
}
Note: using Semaphores, typical routines are:
semaphore_wait(s), equivalent to pthread_mutex_lock()
semaphore_signal(s), equivalent to pthread_mutex_unlock()
Where s is a semaphore initialised to 1 (a binary semaphore)
Sept 2011
12
Pthread condition synchronisation
• Condition variables provide a “wait-notify” mechanism for
threads.
– For example, consider a producer-consumer model where one
producer thread is served by one (or more) consumer threads. A
consumer must check the shared state to see if there is anything
for it to do. If there is nothing, the thread must wait until there is
work. When the producer has work to be done, it updates the
shared state and must then notify any consumers waiting.
• Condition synchronisation involves a condition lock and a
condition variable.
• The combination of some shared data and some
synchronisation mechanism (locks and condition variables, for
example) is called a monitor.
– More in later lectures on monitors.
Sept 2011
13
We refer to a thread entering a monitor when a thread acquires the
mutual exclusion lock associated with the monitor and exiting the
monitor when it releases the lock.
wait - causes the thread to exit the monitor (release the lock),
permitting other threads to enter the monitor.
notify – causes one of any waiting threads to be allowed to enter the
monitor.
Monitor
Thread A
data
notify
Sept 2011
Thread B
wait
14
• A consumer thread takes the condition lock before examining
the relevant shared state.
– If there is work, the thread can modify the state (i.e. ‘get’ the work)
and then release the lock.
– If there is no work, the thread calls pthread_cond_wait() with the
names of the condition lock and condition variable. This routine
releases the condition lock and ensures that the thread waits
(inside the wait routine) until it is woken up by a notification from the
producer.
• When the producer thread has work to be done, it takes the
condition lock in order to modify the shared state (to make the
work available to consumers).
– The producer then calls pthread_cond_broadcast, with the name of
the condition variable, to wake up all waiting threads and then
releases the condition lock.
– If the producer knows that only a single thread is waiting it may call
pthread_cond_signal which wakes only a single thread.
Sept 2011
15
• On being awakened, a consumer thread (which is still
inside the pthread_cond_wait routine) will
immediately try to take the condition lock so that, on
return from pthread_cond_wait, it owns the lock.
• The consumer thread should then re-check the
shared state to confirm that there is some work to be
done (why?).
– Typically, the consumer will call pthread_cond_wait from
within a ‘while’ loop which checks the shared state. Thus, if
there is no work, the consumer thread will simply call
pthread_cond_wait again. If there is work, the consumer will
modify the shared state to ‘get’ the work and then release
the condition lock.
Sept 2011
16
An example
• Consider a producer thread which increments a
shared counter (up to the value of 5) before passing
on the work of further incrementing the shared
counter (up to a value of 10) to a consumer thread.
• In the following C/pthreads example, functionCount2
is the producer and functionCount1 is the consumer.
Sept 2011
17
Pthread condition variables
int count = 0;
#define COUNT_DONE 10
#define COUNT_HALT1 5
pthread_mutex_t condition_mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t condition_cond = PTHREAD_COND_INITIALIZER;
main() {
pthread_t thread1, thread2;
pthread_create( &thread1, NULL, &functionCount1, NULL);
pthread_create( &thread2, NULL, &functionCount2, NULL);
pthread_join( thread1, NULL); pthread_join( thread2, NULL);
exit(0);
}
Sept 2011
18
Producer
void *functionCount2() {
for(;;) {
pthread_mutex_lock( &condition_mutex );
if( count < COUNT_HALT1 ) {
count++;
pthread_cond_signal( &condition_cond );
/* OR pthread_cond_broadcast( &condition_cond ); */
printf("Counter value functionCount2: %d\n",count);
}
pthread_mutex_unlock( &condition_mutex );
if (count >= COUNT_DONE) return (NULL);
}
}
Sept 2011
19
Consumer
void *functionCount1() {
for(;;) {
pthread_mutex_lock( &condition_mutex );
while( count < COUNT_HALT1 ) {
pthread_cond_wait( &condition_cond, &condition_mutex );
printf("Counter value functionCount1 on waking: %d\n",count);
}
count++;
printf("Counter value functionCount1: %d\n",count);
pthread_mutex_unlock( &condition_mutex );
if(count >= COUNT_DONE) return(NULL);
}
}
Sept 2011
20
Behaviour
• The producer continually cycles taking the lock (entering the
monitor) and increments the counter if it is less than 5.
• After each increment, the producer signals a waiting thread (the
consumer) before exiting the monitor by releasing the lock.
• Either the producer or consumer will enter the monitor next.
– If it is the producer, it carries on as before.
• Note that once it has done its work, it simply releases the lock.
– If it is the consumer, and it isn’t time for it to take over (i.e. the state
is not 5), it simply announces the value it saw, and goes back to
waiting.
• Eventually the consumer takes over incrementing the counter
until it reaches the final value, at which point both it and the
producer thread terminate.
• The example code is in the directory CondVar in the teaching
domain directory ~griley/COMP60611/labs/extras
Sept 2011
21
Implicit Thread Control: OpenMP
– OpenMP is the de-facto standard API for writing shared
memory parallel applications in C, C++, and Fortran.
– OpenMP consists of:
• Compiler directives,
• Run time routines,
• Environment variables.
– Specification maintained by the OpenMP Architecture Review
Board (http://www.openmp.org).
– Version 3.0 was released May 2008.
Sept 2011
22
OpenMP Overview:
How do threads interact?
– OpenMP is a multi-threading, shared address model.
• Threads communicate by sharing variables.
– Unintended sharing of data causes race conditions:
• race condition – the program’s outcome changes as the
threads are scheduled differently.
– To control race conditions:
• Use synchronization to protect data conflicts.
– But synchronization can be expensive
Sept 2011
23
Advantages of OpenMP
– Good performance and scalability.
• if you do it right ....
– De-facto and mature standard.
– An OpenMP program is portable.
• Supported by a large number of compilers.
– Simplifies the writing of multi-threaded programs in Fortran, C
and C++.
– Allows the program to be parallelized incrementally.
– OpenMP is well suited for multicore architectures.
Sept 2011
24
Summary of Threads approach
– Programming multiple threads with explicit accesses to shared
data requires attention to detail and much low level control.
– This can be alleviated by providing the programmer with a
high-level data-sharing model and leaving low-level problems
to the implementation. Higher level abstractions make
programming increasingly easier, but they provide more
opportunity for performance to be lost, due to unforseen
actions by the compiler or run-time system.
– Experience shows that it is somewhat easier to program using
threads, compared to other approaches we shall study,
although it is still non-trivial.
Sept 2011
25
Overview of MPI
• Processes versus threads
– Shared and distributed memory systems
• Process-based Programming Fundamentals
– Managing Processes
– Passing Messages
– The Message-Passing Interface (MPI)
• See MPI forum on web
• Message-Passing Overheads
• Summary
Sept 2011
26
Parallel Computing with Multiple
Processes
• For anyone familiar with concurrent execution of processes
under a conventional uni-processor operating system, such as
Unix, the notion of parallel computing with multiple (singlethread) processes is quite natural.
• Each process is essentially a stand-alone sequential program,
with some form of interprocess communication mechanism
provided to permit controlled interaction with other processes.
• The simplest form of interprocess communication mechanism is
via input and output files. However, this does not allow very
'rich' forms of interaction.
• Hence, more complex varieties of message-passing have
evolved, e.g.:
– UNIX pipes, sockets, MPI, Java RMI…
Sept 2011
27
Message-Passing
• The process-based approach to parallel
programming is the main alternative to the threadbased approach. Again, we use (sequential) C as a
starting point.
• We will look at an extension to C which allow the
programmer to express parallel activity in the
message-passing style.
– Extensions that allow the programmer to send and receive
messages explicitly (to exchange program data and
synchronise). We illustrate this using the Message-Passing
Interface (MPI) standard library.
– MPI can also be used from FORTRAN and C++ (research
versions for, for example, Java are available too).
Sept 2011
28
Why MPI?
• Shared memory computers tend to be limited in size (numbers
of processors) and the cost of hardware to maintain cache
coherency across an interconnect grows rapidly with system
size. So the ‘biggest’ computers do not support shared memory.
• Distributed memory systems are relatively cheap to build. They
are essentially multiple copies of ‘independent’ processors
connected together. Interconnects for these are relatively simple
and cheap (e.g. based on routers). For example:
– Networks of workstations, NoWs, using Ethernet or Myrinet
– Supercomputers with specialised router-based interconnects:
HECToR, a Cray XT4 using fast SeaStar routers – more than
22,000 cores (5664 quad-core Opterons). Upgraded in 2010.
• Most of the Top100 computers in the world are DM systems and
MPI is the de-facto standard for writing parallel programs for
them (at least in the scientific world). See: www.top500.org.
Sept 2011
29
Managing Processes
• Remember: in our UNIX view, a process is a (virtual)
address space with one or more threads (program
counter plus stack). Processes are independent!
• A key requirement is to be able to create a new
process and know its (unique) identity. With process
identities known to one another in this way, it is
feasible within any process to construct a message
and direct it specifically to some other process.
• MPI has the concept of process ‘groups’ through
communicators, e.g. MPI_COMM_WORLD.
• Finally, there needs to be a mechanism for causing a
process to ‘die’ and allow the MPI ‘group’ to ‘die’.
Sept 2011
30
Passing Messages
• The fundamental requirements for passing a
message between two processes are:
– The sending process knows how to direct a message to an
appropriate receiving process.
– In MPI this is achieved explicitly by naming a process id or
through the use of a communicator (naming a group of
processes).
– There are several models of interacting and synchronising
processes in MPI. We shall keep it simple and look only at
basic sending and receiving:
• Where recvs block but sends do not (implying buffering of the
data)
• MPI also supports synchronous (sends block until a receive is
posted) and asynchronous sends and recvs (using poling)
Sept 2011
31
The MPI C Library
• Typical MPI scientific codes use processes that are
identical, thus implementing the Single-ProgramMultiple-Data (SPMD) scheme. For example:
mpiexec –n 4 a.out
! Runs a 4 process single SPMD job
• MPI also supports the Multiple-Program-MultipleData (MPMD) scheme, in which different code is
executed in each process:
mpiexec –n 3 a.out : -n 4 b.out : -n 6 c.out ! An MPMD job
• Chapter 8 in Foster's book is a good source for
additional MPI information. See also, LLNL tutorials.
Sept 2011
32
MPI Fundamentals
• Processes are grouped together; they are numbered within a group
•
using contiguous integers, starting from 0. Messages are passed using
the send (MPI_SEND) and receive (MPI_RECV) library subroutine
calls (many other forms exist!)
A message send has the general form:
MPI_Send(sbuf,icount,itype,idest,itag,icomm,ierr)
– A send may block or not – depends on the MPI implementation’s use of
buffering. (MPI_Ssend is a guaranteed blocking send)
– Programs should not assume buffering of sends! Can lead to deadlock (see
later example).
• A message receive has the general form:
•
MPI_Recv(rbuf,icount,itype,isrce,itag,icomm,istat,ierr)
The receiving process blocks until a message of the appropriate kind
becomes available. The buffer starting at rbuf has to be guaranteed to
be large enough to hold icount elements. The istat parameter shows
how many elements actually arrived, where from, etc.
Sept 2011
33
MPI Fundamentals
• There are four other core library functions, illustrated
in the following slides (which uses C syntax)
• The following slides show the MPI basics plus the
skeleton on an n-body MPI application in the SPMDstyle.
Sept 2011
34
#include “mpi.h”
/* include file of compile-time constants needed for MPI library calls */
/* main program */
main (int argc, char *argv[]) {
/* call to initialise this process - called once only per process */
ierr = MPI_Init(&argc, &argv);
/* find the number of processes */
MPI_Comm_size(MPI_COMM_WORLD, &np);
/* find the id (number) of this process */
MPI_Comm_rank(MPI_COMM_WORLD, &myid);
/* print a “Hello world” message from this process */
fprintf (“I am %d of %d processes!”, myid, nprocs);
/* shut down this process - last thing a process should do
MPI_Finalize();
}
Sept 2011
35
#include “mpi.h”
/* include file */
main(int argc, char *argv[]) {
/* main program */
int myid, np, ierr, lnbr, rnbr;
Real x[300], buff[300], forces[300];
MPI_Status status;
ierr = MPI_Init(&argc, &argv);
/* initialize */
if (ierr != MPI_SUCCESS) {
/* check return code */
fprintf(stderr,”MPI initialisation error\n”);
exit(1);
}
Sept 2011
36
MPI_Comm_size(MPI_COMM_WORLD, &np); /* nprocs */
MPI_Comm_rank(MPI_COMM_WORLD, &myid); /* my process id */
lnbr = (myid+np-1)%np;
/* id of left neighbour */
rnbr = (myid+1)%np;
/* id of right nbr */
Initialize(x, buff, forces);
for (i=0; i<np-1; i++) {
/* circulate messages */
/* Note: assumes sends do not block! What if they do?*/
MPI_Send(buff, 300, MPI_FLOAT, rnbr, 0, MPI_COMM_WORLD);
MPI_Recv(buff, 300, MPI_FLOAT, lnbr, 0, MPI_COMM_WORLD,
&status);
update_forces(x, buff, forces);
}
Print _forces(myid, forces);
/* print result */
MPI_Finalize();
/* shutdown */
}
Sept 2011
37
Other MPI Facilities
• The tag parameter is used to match up an input message with a
•
•
•
specific expected kind. If the kind of message is immaterial,
MPI_ANY_TAG will match with anything.
There are also constructs for: global 'barrier' synchronisation
(MPI_Barrier); transfer of data, including one-to-many 'broadcast'
(MPI_Bcast) and 'scatter' (MPI_Scatter), and many-to-one 'gather'
(MPI_Gather); and 'reduction' operators (MPI_Reduce and
MPI_All_reduce).
A reduction has the general form:
MPI_Reduce(src,result,icnt,ityp,op,iroot,icomm,ierr)
where op is the operator, typ is the element type, and root is the
number of the process that will receive the reduced result. All
processes in the group receive the same result when MPI_All_reduce
is used.
There are many other features, but these are too numerous to be
studied further here.
Sept 2011
38
MPI – pros and cons
• MPI is the de-facto standard for programming large
supercomputers because the current trend is only to build
distributed memory machines.
• The vast majority of current DM machines are built out of
multicore processors
– Mixed mode programming with MPI ‘outside’ and pthreads (or
OpenMP) ‘inside’ is possible…
• MPI forces the programmer to face up to the distributed nature
of machines – is this a good thing?
• MPI solutions tend to be more scalable than ptrhread (or
OpenMP) solutions
• (OpenMP is somewhat easier to use…)
Sept 2011
39
Summary of MPI
• Process-based programming using a library such as
MPI for explicit passing of messages requires
attention to detail and much low level description of
activities.
• Ultimately, the same underlying problems of
parallelism emerge, regardless of whether the shared
memory (e.g. pthreads or OpenMP) or distributed
memory (e.g. MPI) programming approach is used.
Sept 2011
40
Summary
• Pthreads exploit parallelism by exploiting multiple
threads within a single process
– A single address space model
• MPI exploits parallelism between processes and
supports the explicit exchange of messages between
processes
– A multiple address space model
• Note that MPI and pthreads can be nested!
– Multiple (p)threads can execute inside each (MPI) process
– An approach that appears to match the hierarchical
architecture of modern computers (i.e. muticore processors
in a distributed memory machine, e.g. HeCToR)
Sept 2011
41