Transcript lec7
High Performance
Computing – CISC 811
Dr Rob Thacker
Dept of Physics (308A)
thacker@physics
Today’s Lecture
Distributed Memory Computing I
Part 1: Networking issues and distributed
memory architectures (hardware)
Part 2: Brief overview of PVM
Part 3: Introduction to MPI
Part 1: Concepts & Distributed
Memory Architectures
Overview: Message passing APIs, RDMA
Networking, TCP/IP
Specialist interconnects
Distributed memory machine design and
balance
Parallel APIs from the decomposition-communication perspective
[Figure: APIs arranged along two axes, decomposition (implicit to explicit) and communication (implicit to explicit). OpenMP (implicit, shared memory only), HPF and CAF/UPC sit toward the implicit-communication side; the message passing APIs MPI, PVM and SHMEM are explicit in both respects and operate effectively on distributed memory architectures.]
Message Passing
Concept of sequential processes communicating via messages was developed by Hoare in the 70’s (Hoare, C.A.R., Comm. ACM, 21, 666, 1978)
Each process has its own local memory store
Remote data needs are served by passing messages containing the desired data
Naturally carries over to distributed memory architectures
Two ways of expressing message passing:
Coordination of message passing at the language level (e.g. Occam)
Calls to a message passing library
Two types of message passing
Point-to-point (one-to-one)
Broadcast (one-to-all,all-to-all)
Broadcast versus point-to-point
Broadcast (one-to-all): a collective operation involving a group of processes
Point-to-point (one-to-one): a non-collective operation involving a pair of processes
[Figure: a broadcast from Process 1 to Processes 2, 3 and 4, contrasted with a point-to-point message between a single pair of processes.]
Message passing APIs
Message passing APIs dominate
Often reflect the underlying hardware design
Legacy codes can frequently be converted more easily
Allows explicit management of the memory hierarchy
Message Passing Interface (MPI) is the predominant API
Parallel Virtual Machine (PVM) is an earlier API that
possesses some useful features over MPI
Remains the best paradigm for heterogeneous systems
http://www.csm.ornl.gov/pvm/pvm_home.html
PVM – An overview
API can be traced back to 1989
Geist & Sunderam developed the experimental version
Daemon based
Each host runs a daemon that controls resources
Each user may actively configure their host environment
PVM console
Processes can be dynamically created and destroyed
Process groups for domain decomposition
PVM group server controls this aspect
Limited number of collective operations
Barriers, broadcast, reduction
Roughly 40 functions in the API
PVM API and programming model
PVM most naturally fits a master-worker model
Master process is responsible for I/O
Workers are spawned by the master
Each process has a unique identifier
Messages are typed and tagged
The system is aware of the data-type, allowing easy portability across a heterogeneous network
Messages are passed via a three-phase process:
Clear (initialize) buffer
Pack buffer
Send buffer
Example code
/* Sender */
tid = pvm_mytid();
if (tid == source){
    bufid = pvm_initsend(PvmDataDefault);
    info  = pvm_pkint(&i1, 1, 1);
    info  = pvm_pkfloat(vec1, 2, 1);
    info  = pvm_send(dest, tag);
}
/* Receiver */
else if (tid == dest){
    bufid = pvm_recv(source, tag);
    info  = pvm_upkint(&i2, 1, 1);
    info  = pvm_upkfloat(vec2, 2, 1);
}
MPI – An overview
API can be traced back to 1992
First unofficial meeting of the MPI Forum at Supercomputing 92
Mechanism for creating processes is not specified within the API
Different mechanism on different platforms
MPI 1.x standard does not allow for creating or destroying processes
‘Communicators’
Process groups are central to the parallel model
Richer set of collective operations than PVM
Derived data-types an important advance
Can specify a data-type to control the pack-unpack step implicitly
125 functions in the API
MPI API and programming model
More naturally a true SPMD type programming model
Oriented toward HPC applications
Master-worker model can still be implemented effectively
As for PVM, each process has a unique identifier
Messages are typed, tagged and flagged with a
communicator
Messaging can be a single stage operation
Can send specific variables without need for packing
Packing is still an option
Remote Direct Memory Access
Message passing involves a number of expensive operations:
CPUs must be involved (possibly the OS kernel too)
Buffers are often required
RDMA cuts down on the CPU overhead
The CPU sets up channels for the DMA engine to write directly to the buffer, avoiding constantly taxing the CPU
Frequently discussed under the “zero-copy” label
Message passing APIs have been designed around this concept (but usually called remote memory access)
Cray SHMEM
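As an illustration of the remote-memory-access style, here is a minimal put sketched with an OpenSHMEM-flavoured interface; the exact call names on a given Cray system may differ, so treat this as an assumption-laden sketch rather than vendor documentation.

/* One-sided put: PE 0 writes directly into PE 1's memory, no receive call needed */
#include <shmem.h>
double src[8], dst[8];                      /* global arrays are symmetric (remotely accessible) */
int main(void)
{
    shmem_init();
    if (shmem_my_pe() == 0)
        shmem_double_put(dst, src, 8, 1);   /* copy src into PE 1's dst */
    shmem_barrier_all();                    /* make sure the put has completed everywhere */
    shmem_finalize();
    return 0;
}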
RDMA illustrated
[Figure: Host A and Host B, each with a CPU, memory/buffer, and a NIC containing an RDMA engine; data moves directly between the two buffers via the NICs, bypassing the CPUs.]
Networking issues
Networks have played a profound role in the
evolution of parallel APIs
Examine network fundamentals in more detail
Provides better understanding of programming
issues
Reasons for library design (especially RDMA)
OSI network model
Grew out of a 1982 attempt by ISO to develop Open Systems Interconnect (there were too many vendor proprietary protocols at that time)
Motivated from theoretical rather than practical standpoint
System of layers taken together = protocol stack
Each layer communicates with its peer layer on the remote host
Proposed stack was too complex and had too much freedom:
not adopted
e.g. X.400 email standard required several books of definitions
Simplified Internet TCP/IP protocol stack eventually grew out
of the OSI model
e.g. SMTP email standard takes a few pages
Conceptual structure of OSI network
Layer 7. Application (http, ftp, …)
Layer 6. Presentation (data std)
Layer 5. Session (application)
Layer 4. Transport (TCP, UDP, …) – data transfer
Layer 3. Network (IP, …) – routing
Layer 2. Data link (Ethernet, …)
Layer 1. Physical (signal)
(Layers 5–7 form the upper level of the stack; layers 1–4 the lower level.)
Internet Protocol Suite
Protocol stack on which the internet runs
Occasionally called the TCP/IP protocol stack
Motivated by engineering rather than concepts
Doesn’t map perfectly to the OSI model
OSI model lacks richness at lower levels
Higher levels of the OSI model were mapped into a single application layer
Expanded some layering concepts within the OSI model (e.g. internetworking was added to the network layer)
Internet Protocol Suite
“Layer 7” Application
e.g. FTP, HTTP, DNS
Layer 4.Transport
e.g. TCP, UDP, RTP, SCTP
Layer 3. Network
IP
Layer 2. Data link
e.g. Ethernet, token ring
Layer 1. Physical
e.g. T1, E1
Internet Protocol (IP)
Data-oriented protocol used by hosts for
communicating data across a packet-switched internetwork
Addressing and routing are handled at this level
IP sends and receives data between two IP addresses
Data segment = packet (or datagram)
Packet delivery is unreliable – packets may arrive corrupted, duplicated, out of order, or not at all
Lack of delivery guarantees allows fast switching
IP Addressing
On an ethernet network routing at the data link layer occurs
between 6 byte MAC (Media Access Control) addresses
IP adds its own configurable address scheme on top of this
4 byte address, expressed as 4 decimals on 0-255
Note 0 and 255 are both reserved numbers
Division of numbers determines network number versus node
Subnet masks determine how these are divided
Classes of networks are described by the first number in the IP address and the number of network addresses:
[192:223].35.91.*  = class C network (254 hosts)         (subnet mask 255.255.255.0)
[128:191].132.*.*  = class B network (65,534 hosts)      (subnet mask 255.255.0.0)
[1:126].*.*.*      = class A network (16 million hosts)  (subnet mask 255.0.0.0)
Note: the 35.91 in the class C example and the 132 in the class B example could be different; they are filled in to show how the network address is defined
Subnets
Type A networks are extremely large and are
better dealt with by subdivision
Any network class can be subdivided into subnets
Broadcasts then work within each subnet
Subnet netmasks are defined by extending the
netmask from the usual 1 byte boundary
e.g. 10000000=128, 11000000=192,
11100000=224
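As a worked illustration of how a netmask splits an address (the address 192.168.1.130 and the /26 mask are arbitrary choices, not taken from the slides), the split is just a bitwise AND:

/* 192.168.1.130 with subnet mask 255.255.255.192 */
unsigned int addr = (192u << 24) | (168u << 16) | (1u << 8) | 130u;
unsigned int mask = 0xFFFFFFC0u;            /* 255.255.255.192, i.e. 26 network bits */
unsigned int network = addr & mask;         /* 192.168.1.128: the network address    */
unsigned int host    = addr & ~mask;        /* 2: the host number within the subnet  */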
Class C subnets
Subnet Mask        Net Bits  Host Bits  # of Nets    Hosts/Net  Total Hosts
255.255.255.0         0         8          1           254        254
255.255.255.128       1         7          0 (2*)      126        0 (254*)
255.255.255.192       2         6          2 (4*)       62        124 (248*)
255.255.255.224       3         5          6 (8*)       30        180 (240*)
255.255.255.240       4         4         14 (16*)      14        196 (224*)
255.255.255.248       5         3         30 (32*)       6        180 (192*)
255.255.255.252       6         2         62 (64*)       2        124 (128*)
255.255.255.254       7         1        126 (128*)      0**      0 (128*)
255.255.255.255       8         0        254 (256*)      1***     254 (256*)
*Classic IP rules say you cannot use subnets with all zeros or all ones in the network portion
**A host address of all zeros frequently means "this host" in many IP implementations, while all ones is the broadcast address
***The netmask specifies the entire host address
Transmission Control Protocol
(TCP)
TCP is responsible for division of the applications
data-stream, error correction and opening the channel
(port) between applications
Applications send a byte stream to TCP
TCP divides the byte stream into appropriately sized
segments (set by the MTU* of the IP layer)
Each segment is given two sequence numbers to enable
the byte stream to be reconstructed
Each segment also has a checksum to ensure correct
packet delivery
Segments are passed to IP layer for delivery
*maximum transmission unit
UDP: Alternative to TCP
UDP=User Datagram Protocol
Only adds a checksum and multiplexing capability – the limited functionality allows a streamlined implementation: faster than TCP
No confirmation of delivery
Unreliable protocol: if you need reliability you must
build on top of this layer
Suitable for real-time applications where error
correction is irrelevant (e.g. streaming media, voice over
IP)
DNS and DHCP both use UDP
Encapsulation of layers
[Figure: encapsulation. The application produces data; the transport layer prepends a TCP header; the network layer prepends an IP header; the data link layer prepends an Ethernet header. Each layer treats everything from the layer above as payload data.]
Link Layer
For high performance clusters the link layer
frequently determines the networking above it
All high performance interconnects emulate IP
Some significantly better (e.g. IP over Myrinet) than
others (e.g. IP over Infiniband)
Each data link thus brings its own networking
layer with it
Overheads associated with TCP/IP
Moving data from the application to
the TCP layer involves a copy from
user space to the OS kernel
Recall the memory wall issue? Bandwidth to memory is not growing fast enough
Rule of thumb: each bit/s of TCP traffic requires about 1 Hz of CPU speed to perform the copy sufficiently quickly
“Zero copy” implementations remove the copy from user space to the OS (DMA)
TCP off-load engines (TOE) remove checksums etc. from the CPU
RDMA and removal of OS TCP overheads will be necessary for 10Gb to function effectively
[Figure: application data in user space is copied into the OS, where TCP control is applied before the data is handed to the link layer.]
Ping-Ping, Ping-Pong
Message passing benchmarking examines two main modes of communication
Ping-ping messaging: a single node fires off multiple messages (bandwidth test)
Ping-pong messaging: a pair of nodes cooperate on send-receive-send (latency and bandwidth test)
[Figure: timelines for node 0 and node 1. In ping-ping, node 1 issues successive sends which node 0 receives back to back; in ping-pong, node 1 sends, node 0 receives and replies, and node 1 receives.]
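A sketch of the ping-pong pattern, written with the MPI calls introduced in Part 3; the message size, iteration count and variable names are illustrative choices, not part of any particular benchmark suite.

/* Ping-pong between ranks 0 and 1: reports the average round-trip time */
#include <mpi.h>
#include <stdio.h>
#define NITER  1000
#define NBYTES 8
int main(int argc, char **argv)
{
    char buf[NBYTES];
    int rank, i;
    MPI_Status st;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double t0 = MPI_Wtime();
    for (i = 0; i < NITER; i++) {
        if (rank == 0) {
            MPI_Send(buf, NBYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, NBYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &st);
        } else if (rank == 1) {
            MPI_Recv(buf, NBYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(buf, NBYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    if (rank == 0)
        printf("average round trip = %g us\n", 1e6 * (MPI_Wtime() - t0) / NITER);
    MPI_Finalize();
    return 0;
}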
Overview of interconnect fabrics
Broadly speaking, interconnects break down into two camps: commodity vs specialist
Commodity: ethernet (cost < $50 per port)
Specialist: everything else (cost > $800 per port)
Specialist interconnects primarily provide two
features over gigabit:
Higher bandwidth
Lower message latency
Gigabit Ethernet
IEEE 802.3z
Most popular interconnect on top500 - 213 systems (none in top
10)
Optimal performance of around 80 MB/s
Significantly higher latency than the specialist interconnects (×10, at ~25 µs)
However, fast libraries have been written (e.g. GAMMA, Scali) with latencies around 10 µs
Popular because of hardware economics
Cables very cheap
Switches getting cheaper all the time (32 port switch = $250)
NICs beginning to include additional co-processing hardware for off-loading TCP overheads from the CPU (TCP off-load engines: TOEs)
Project to improve performance of the network layer (VIA) has been derailed
10Gigabit Ethernet
Relatively recent technology, still very expensive (estimates are in the range of $2000–$3000 per port)
Solutions in the HPC arena are limited to products supplied by already established HPC vendors, e.g. Myrinet, Quadrics
Commoditization is expected, but it is difficult to pick out a driver for it – few people need a GB/s out of their desktop
Myrinet
www.myrinet.com
Proprietary standard (1994) updated in 2000 to increase
bandwidth
Second most popular interconnect on Top 500 (79 systems) 1
system in top 10
E model has standard bandwidth, 770 MB/s
Myrinet 10G is a 10Gigabit Ethernet solution offering 1250
MB/s (new)
Latency 2.6µs (recent library update)
Most systems implemented via fully switched network
Dual channel cards available for higher bandwidth requirements
Scalable Coherent Interface
Developed out of IEEE standard (1992) updated in 2000
(popular in Europe)
No systems in top500
Originally conceived as a standard to build NUMA machines
Switchless topologies fixed to be either ring or torus (3d also)
Latency around 4.5 µs, but is well optimized for short messages
Bandwidth ~350 MB/s per ring channel (no PCI-X or PCI-Express implementation yet)
Card costs ~ 1500 USD for 3d torus version
European solution – hence less popular in N. America
Infiniband
Infiniband (Open) standard is designed to cover many arenas,
from database servers to HPC
78 systems on top500, 3 systems in top10
Large market => potential to become commoditized like gigabit
Serial bus, can add bandwidth by adding more channels (e.g.
1x,4x,12x standards, each 1x=250 MB/s)
Double data rate option now available (250→500 MB/s)
Highest bandwidth observed thus far ~900 MB/s for 4x single
data rate link
Slow adoption among HPC until very recently – was initially troubled by higher latencies (around 6.5 µs, lower now)
Usually installed as a fully switched fabric
Cards frequently come with 2 channels for redundancy
Costs and future
Current cost around $1000 per port (for
cards+switch+cable)
PCI-Express cards set to reduce cost of host channel
adapter (HCA=NIC), removes memory from card (cost
may fall to $600 per port)
Software support growing; a common Linux interface is under development: www.openib.org
Discussion of installing directly on motherboard,
would improve adoption rate
Latency is coming down all the time – a lot of people
working on this issue
Quadrics (Elan4)
www.quadrics.com
14 systems on top500, 1 system in top 10
Most expensive specialist interconnect ~1600 to 3000
USD per port depending upon configuration
Had lowest latency of any available interconnect (1.5 µs)
New product from Pathscale “Infinipath” may be slightly faster
Theoretical 1066 MB/s bandwidth, 900 achieved
Fat-tree network topology
High efficiency RDMA engine on board
NIC co-processor capable of off-loading communication
operations from CPU
Quick History
Quadrics grew out of Meiko
(Bristol UK)
Meiko CS-1 transputer
based system, CS-2 built
around HyperSparc and the
first generation Elan
network processor
Quadrics (Elan4)
Full 64-bit
addressing
STEN=Short
Transaction Engine
RDMA engine is
pipelined
64-bit RISC
processor
2.6 Gbytes/s total
Optimizing short messages
I/O writes coming from the PCI-X bus can be
formatted directly into network transactions
These events can be handled independently of
DMA operations – avoids setting up DMA
channel
Method becomes less efficient than using DMA
engine at around 1024 byte messages
A few Elan details
Performance strongly influenced by PCI
implementation & speed of memory bus
Opteron (Hypertransport) is best of breed PCI
Network fault detection and tolerance is done in
hardware (all communication is acknowledged)
Elanlib library provides low level interface to
point-to-point messaging, and tagged messaging
MPI libs built upon either elanlib or even lower
level library elan3lib
Breakdown of transaction times:
worst case 8-byte put
[Chart: time in nanoseconds (0–2500) broken down into switch, NIC, PCI and cable contributions, for Elan 3, Elan 4 (est), Elan 4 and Elan 4 (HT).]
From presentation by John Taylor, Quadrics Ltd
QsNetII MPI All Reduce
Performance: Off-load
Interconnect I/O bus comparison
[Chart: bandwidth in MB/s (0–2000) for Gigabit, Infiniband 4x, Myrinet E and Elan 4, compared across I/O buses (PCI-X 133, PCI-Express 8x).]
Latencies
[Chart: latency in µs (0–25) for Gigabit, Infiniband 4x, Myrinet E, Elan 4 and Gigabit with GAMMA.]
Summary Part 1
TCP/IP has a large number of overheads that
require OS input
Specialist interconnects alleviate these problems
and have significantly higher throughput and
lower latencies
Cost remains a barrier to the widespread
adoption of higher performance interconnects
RDMA will appear in the next generation of
gigabit (10Gb)
Part 2:
Parallel Virtual Machine
Brief overview of PVM environment and
console
Programming with PVM
Why PVM?
PVM was envisaged to run on heterogeneous networks
Technically MPI can run on heterogeneous networks too, but it is not designed around that idea (no common method for starting programs)
PVM is an entire framework
Completely specifies how programs start on different architectures (spawning)
Resource managers are designed into the framework
Interoperable at the language level
Could have a C master thread and FORTRAN workers
Fault tolerance is easily integrated into programs
Primarily a result of the master-worker type parallelism that is preferred
Some parallel languages use the PVM framework (e.g. Glasgow Haskell)
Comparatively small distribution – only a few MB
Secure message transfer
“Free” book: http://www.netlib.org/pvm3/book/pvm-book.html
Underlying design principles
User configuration of the ‘host pool’
Machines may be added or deleted during operation
Transparent access to hardware
Process-based granularity
Atom of computation is the task
Explicit message passing
Message sizes are only limited by available memory
Network and multiprocessor support
Messages and tasks are interoperable between the two
Multiple language support (Java added recently)
Evolution and incorporation of new techniques should be simple
i.e. not involve defining a new standard
Development History
1.0 Developed at Oak Ridge Nat. Lab. by Geist and Sunderam (1989)
2.0 Developed at U. of Tennessee (1991)
3.0 Rewrite (1993)
3.4 Released 1997, added a few more features over the 3.0 standard
Ultimately, PVM is now considered a legacy
parallel programming model
PVM daemon (pvmd3)
UNIX process that controls operation of the user
processes within the PVM environment
Process-to-process communication is mediated by
daemons:
Process-to-daemon communication occurs via the library
functions
Local daemon then messages remote
Architectural dependencies are dealt with at the daemon level
(pvmd3 must be compiled for the different architectures)
Each daemon carries a configuration table
Aware of all hosts within the machine
From presentation by Al Geist
Configuring the PVM environment
Details on installing PVM can be found in the www.netlib.org repository
Two environment variables, PVM_ROOT and PVM_ARCH, must be set in .cshrc
Syntax may differ on different platforms
Traditionally a .rhosts file was used to enable access for remote hosts
rsh was the main way of starting daemons on a remote machine; it is recommended that ssh be used now
Already installed on HPCVL
Once installed, you start the environment by executing the pvm command
PVM console
Console started by executing “pvm”
% pvm
pvm> add hostname          (configure the PVM)
pvm> delete hostname
pvm> conf                  (print configuration)
  1 host, 1 data format
  HOST   DTID    ARCH    SPEED
  ig     40000   ALPHA   1
pvm> ps -a                 (list of PVM jobs)
pvm> quit                  (quit but do not stop PVM)
pvm> halt                  (true halt of the system)
Starting tasks
Programs can be started from the console – but
they are restricted to not have interactive input
spawn -> program_name (-> forces output to
console stdout)
Specify # hosts with –n
Specify specific hosts –hostname,…
Specify specific architectures -architecture
Usual methodology: write programs that enroll
themselves into the pvm environment
More manageable configuration
A list of hosts can be entered in a hostfile
Can specify alternative login names
Can force password entry
%pvm hostfile will then configure the PVM – example:
#Configuration for a given run
Host1 lo=othername so=pw
Host2
Host3
Additional configurations including specialist paths for the
daemon can also be placed in the hostfile
Clearly easier to manage if machines have homogeneous set-up!
XPVM
X-windows based console
Provides information about communication between
tasks
Can help debugging and performance analysis
Real-time monitor of PVM tasks
No need to recompile for this environment
Point and click interface
XPVM GUI
Programming with PVM
PVM programs must:
Include the necessary header information (pvm3.h)
Link to the library functions at compile time (add -lpvm3, but may need to specify the path as well via -L)
FORTRAN programs link to -lfpvm3
Note – process group functionality requires an additional library to be compiled in: -lgpvm3
Each PVM task has its own task identity (tid)
The task must enroll into the environment:
call pvmfmytid(tid) or tid = pvm_mytid(void)
Exit from the environment on completion:
call pvmfexit
Note: if I do not provide all details of a function, please use the reference book
Error tracking
PVM routines return a negative number in the info variable if an error occurs (at least 27 error codes)
Successful calls return the constant PvmOk, which is 0
Good idea to track errors on key subroutine calls:
Spawning tasks
Communication
Can use an error checking subroutine to avoid code
bloat
Pass the info variable
Pass the name of the subroutine being checked to print an
error if necessary
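A minimal error-checking helper of the kind described above might look like the sketch below; the name check_pvm is an invention for illustration, while pvm_perror is the library's own error printer.

#include <stdio.h>
#include <stdlib.h>
#include "pvm3.h"

/* Abort with a message if a PVM call returned a negative info code */
void check_pvm(int info, char *where)
{
    if (info < 0) {
        pvm_perror(where);   /* print PVM's description of the error */
        pvm_exit();
        exit(1);
    }
}

/* usage: check_pvm(pvm_send(dest, tag), "pvm_send"); */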
Process Identification
Individual task identities enable parallel computation to
perform simple domain decompositions
Example: task identity is used to resolve to different
parts of an array
In PVM, request for a task ID is also equivalent to
enrolling into the VM
Task ID’s are returned in PVM via
call pvmfmytid(tid)
or
int tid = pvm_mytid(void)
Process Control
Tasks are spawned from within programs as follows:
call pvmfspawn(task, flag, where, ntask, tids, numt)
int numt = pvm_spawn(char *task, char **argv, int flag, char *where,
int ntask, int *tids)
task = name of executable (character string)
argv = command line arguments
flag = spawn flags
where = select host to spawn task
ntask = number of copies of the task to be spawned
tids = array of the spawned task ID’s
numt = number of tasks that were successfully spawned
Spawned tasks carry knowledge of their parent: the pvmfparent and pvm_parent functions can be used to obtain the tid of the parent task
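For example, a master might spawn four copies of a worker executable and check how many actually started; the executable name "worker" is a placeholder, and pvm3.h and stdio.h are assumed to be included.

int tids[4];
int numt = pvm_spawn("worker", (char **)0, PvmTaskDefault, "", 4, tids);
if (numt < 4) {
    /* some spawns failed; the negative error codes are returned in tids[numt..3] */
    fprintf(stderr, "only %d of 4 tasks started\n", numt);
}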
Killing tasks, and leaving the VM
Tasks may be killed:
call pvmfkill(tid, info)
int info = pvm_kill(int tid)
Tasks may also leave the VM:
call pvmfexit(info)
int info = pvm_exit(void)
Note that a parent task can spawn tasks and then
destroy itself – children will continue to execute
First example: enrolling and exiting
C version:
#include "pvm3.h"
main()
{
    int mytid, info;
    mytid = pvm_mytid();
    printf("Hello from task %d\n", mytid);
    info = pvm_exit();
    exit(0);
}

FORTRAN version:
      program main
      include "fpvm3.h"
      integer mytid, info
      call pvmfmytid(mytid)
      print *, "Hello from task ", mytid
      call pvmfexit(info)
      end

This trivial example can be spawned from the PVM console as follows:
pvm> spawn -3 -> myprog
[1]
3 successful
t40004
t40005
t40006
[1:t40005] Hello from task 262149
[1:t40005] EOF
[1:t40006] Hello from task 262150
[1:t40006] EOF
[1:t40004] Hello from task 262148
[1:t40004] EOF
[1] finished
pvm> halt

Not a useless example either – if you need to run multiple copies of a given program this is the PVM wrapper you would use.
Information about the VM
pvmfconfig(nhost, narch, dtid, name, arch, speed, info)
int info = pvm_config(int *nhost, int *narch, struct pvmhostinfo **hostp)
Provides information about the VM to active processes
Difference between the C and FORTRAN implementations:
C version returns all information about the hosts in a structure
FORTRAN version cycles through all hosts, one per call
pvmftasks(which, ntask, tid, ptid, dtid, flag, aout, info)
int info = pvm_tasks(int which, int *ntask, struct pvmtaskinfo **taskp)
Integer which specifies whether all tasks (0) or a specific pvmd tid (dtid)
As for the config call, the C version provides data in a structure while the FORTRAN version cycles through the information
Setting options within the PVM
system
Allows various options to changed within the
PVM by executing tasks
Includes error message printing
Debugging levels
Communication routing options
pvmfsetopt(what,val,oldval)
int oldval = pvm_setopt(int what, int val)
Getopt variant provides information about
current settings
Signalling
pvmfsendsig(tid, signum, info)
Send signal signum to task tid
pvmfnotify(what, msgtag, cnt, tids, info)
Request notification of events in the VM, delivered as a message with tag msgtag
Possible notification requests:
PvmTaskExit – notify if a task exits
PvmHostDelete – notify if a host is deleted (or fails)
PvmHostAdd – notify if a host is added
Useful for implementing fault tolerance
No counterpart in MPI
Message passing, buffers &
encoding
Messaging requires the initialization of buffers
int bufid = pvm_initsend(int encoding) in C
pvmfinitsend(encoding, bufid) in FORTRAN
bufid contains the identifier for the buffer to be used in the send
Encoding options:
PvmDataDefault: XDR (external data representation) encoding is used by
default because the VM may well be a heterogeneous environment.
Additionally there is no way of knowing whether the VM will be updated
before a message is sent.
PvmDataRaw: Message is sent in the native format (i.e. no encoding is
performed). Only useable if receiving processes understands native
format, otherwise an error will be returned during the unpack stage.
PvmDataInPlace: No packing is performed at all and data is left “in
place”. In this case the buffer contains only a set of pointers and sizes to
the user data in memory. Reduces copy and packing overheads. Memory
items must not be updated in between the initsend and actual send
Multiple buffers
Most programs can be written without multiple (user defined) buffers
Programs may only have one active send and one active receive buffer at any one time
User defined buffers can be created via
int bufid = pvm_mkbuf(int encoding) in C
call pvmfmkbuf(encoding, bufid) in FORTRAN
User defined buffers must be set as the active send/receive buffer using
pvm_setsbuf or pvm_setrbuf
pvm_getsbuf or pvm_getrbuf will return the bufid of the current defaults
Buffers can be removed using the freebuf command:
int info = pvm_freebuf(int bufid)
call pvmffreebuf(bufid, info)
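A sketch of juggling a second send buffer (variable names are illustrative):

int newbuf = pvm_mkbuf(PvmDataDefault);   /* create a user-defined buffer            */
int oldbuf = pvm_setsbuf(newbuf);         /* make it active; returns the previous id */
pvm_pkint(&value, 1, 1);
pvm_send(dest_tid, msgtag);
pvm_setsbuf(oldbuf);                      /* restore the original send buffer        */
pvm_freebuf(newbuf);                      /* release the user-defined buffer         */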
Packing
11 different options for data type packing in the C version:
1. pvm_pkbyte(char *cp, int nitem, int stride)   (byte data)
2. pvm_pkcplx(float *xp, …)                      (complex, float pairs)
3. pvm_pkdcplx(double *zp, …)                    (double complex)
4. pvm_pkdouble(double *dp, …)                   (double)
5. pvm_pkfloat(float *fp, …)                     (float)
6. pvm_pkint(int *ip, …)                         (integer)
7. pvm_pkuint(unsigned int *ip, …)               (unsigned int)
8. pvm_pkushort(unsigned short *ip, …)           (unsigned short)
9. pvm_pkulong(unsigned long *ip, …)             (unsigned long)
10. pvm_pklong(long *ip, …)                      (long)
11. pvm_pkshort(short *jp, …)                    (short)
C also has a printf-like interface, pvm_packf
FORTRAN: call pvmfpack(what, xp, nitem, stride, info)
what integer: STRING=0, BYTE1=1, INTEGER2=2, INTEGER4=3, REAL4=4, COMPLEX8=5, REAL8=6, COMPLEX16=7
Sending
Send (destination and tag required):
int info = pvm_send(int tid, int msgtag)
call pvmfsend(tid, msgtag, info)
Multicast (send to all processes in array tids):
int info = pvm_mcast(int *tids, int ntask, int msgtag)
call pvmfmcast(ntask, tids, msgtag, info)
Can specify a more MPI-like communication model:
int info = pvm_psend(int tid, int msgtag, void *vp, int cnt, int type)
call pvmfpsend(tid, msgtag, xp, cnt, type, info)
Receiving
Vanilla (blocking) receive:
int bufid = pvm_recv(int tid, int msgtag)
call pvmfrecv(tid, msgtag, bufid)
Non-blocking version (may be called multiple times – returns 0 if the message has not arrived):
int bufid = pvm_nrecv(int tid, int msgtag)
call pvmfnrecv(tid, msgtag, bufid)
pvm_probe can be used to probe for message arrival
pvm_precv is the matching receive for the pvm_psend variant
PVM also defines a receive variant that will time out
Useful for correctness checking if a message never arrives
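The psend/precv pair in use, collapsing the init-pack-send and receive-unpack phases into single calls (the array name and length are illustrative):

float vec[100];
int   rtid, rtag, rcnt, info;
/* sender */
info = pvm_psend(dest_tid, msgtag, vec, 100, PVM_FLOAT);
/* receiver: also reports who sent the message, its tag and the item count */
info = pvm_precv(src_tid, msgtag, vec, 100, PVM_FLOAT, &rtid, &rtag, &rcnt);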
Unpacking
As would be expected unpacking is exactly the
inverse of packing
API is the same as packing except functions are
labelled “*upk*” or “*unpack*”
Process groups
Process groups are dynamic objects
pvmfjoingroup(group, inum)
Join group – specify the group name; returns the instance number of the process within the group
pvmfleavegroup(group, info)
Leave group – note that if you leave and then rejoin you will likely get a different instance number
pvmfbarrier(group, count, info)
Barrier sync within groups
Implies many process group calls must still be given the number of tasks involved
pvmfreduce(func, data, count, datatype, msgtag, group, root, info)
Reduction operations for groups
PvmMin, PvmMax, PvmSum, PvmProduct
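Putting the group calls together in C (the group name "workers", and the nproc, root and msgtag variables, are assumptions for illustration):

int inum = pvm_joingroup("workers");         /* returns this task's instance number  */
pvm_barrier("workers", nproc);               /* wait until nproc members have joined */
double local = partial_result;
/* in-place reduction: on the root, 'local' is overwritten with the group sum */
pvm_reduce(PvmSum, &local, 1, PVM_DOUBLE, msgtag, "workers", root);
pvm_lvgroup("workers");                      /* leave the group when done            */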
Collective Operations in PVM
Tasks can issue broadcasts to groups whether or
not they belong to that group
As well as barrier and reduction operations:
pvmfbcast(group, msgtag, info)
pvmfgather(result, data, count, datatype, msgtag, group, root, info)
pvmfscatter(result, data, count, datatype, msgtag, group, root, info)
SPMD versus MPMD
When spawning tasks PVM tends toward an
MPMD type programming style
Note – can still spawn in an SPMD environment if
needs be, but requires additional logic overhead
SPMD programs are most naturally written
without spawning
Hello world: 2 program example
C version:
#include <pvm3.h>
main ()
{
    int tid ;
    char reply [100] ;
    printf ("I'm t%x\n", pvm_mytid ()) ;
    if (pvm_spawn("hello_other", (char**)0, 0, "", 1, &tid) == 1)
    {
        pvm_recv (-1, -1) ;
        pvm_bufinfo (pvm_getrbuf (), (int*)0, (int*)0, &tid) ;
        pvm_upkstr(reply) ;
        printf ("From t%x: %s\n", tid, reply) ;
    } else
        printf ("Can't start hello_other\n") ;
    pvm_exit () ;
}

FORTRAN version:
      PROGRAM HELLO
      INCLUDE 'fpvm3.h'
      INTEGER mytid, tid, bufid, info
      CHARACTER*100 reply1, reply2
      CALL PVMFMYTID (mytid)
      WRITE (*,*) 'I m t', mytid
      CALL PVMFSPAWN ('hello_other', 0, '*', 1, tid, info)
      IF (info .EQ. 1) THEN
         CALL PVMFRECV (-1, -1, info)
         CALL PVMFGETRBUF (bufid)
         CALL PVMFBUFINFO (bufid, 0, 0, tid)
         CALL PVMFUNPACK (STRING, reply1, 1, 1, info)
         CALL PVMFUNPACK (STRING, reply2, 1, 1, info)
         WRITE (*,*) 'From t', tid, ':', reply1, reply2
      ELSE
         WRITE (*,*) 'Can t start hello_other'
      ENDIF
      CALL PVMFEXIT (info)
      END
Master programs that spawn tasks called “hello other”
Spawned tasks
C version:
#include <pvm3.h>
#include <string.h>   /* for strcpy, strlen */
#include <unistd.h>   /* for gethostname    */
main ()
{
    int ptid ;           /* parent's tid */
    char buf [100] ;

    ptid = pvm_parent () ;
    strcpy (buf, "Hello world from ") ;
    gethostname (buf + strlen(buf), 64) ;
    pvm_initsend (PvmDataDefault) ;
    pvm_pkstr (buf) ;
    pvm_send (ptid, 1) ;
    pvm_exit () ;
}

FORTRAN version:
      PROGRAM HELLO_OTHER
      INCLUDE 'fpvm3.h'
      INTEGER ptid
      INTEGER bufid, info
      CHARACTER*100 buf

      CALL PVMFPARENT (ptid)
      CALL GETHOSTNAME (buf, 64)
      CALL PVMFINITSEND (PvmDataDefault)
      CALL PVMFPACK (STRING, 'Hello From ', 1, 1, info)
      CALL PVMFPACK (STRING, buf, 1, 1, info)
      CALL PVMFSEND (ptid, 1, info)
      CALL PVMFEXIT (info)
      END
SPMD version
C version:
#include <stdio.h>
#include "pvm3.h"
#define NTASKS          6
#define HELLO_MSGTYPE   1

main() {
    int  mytid, parent_tid, tids[NTASKS], msgtype, i, rc;
    char helloworld[13] = "HELLO WORLD!";

    mytid = pvm_mytid();
    parent_tid = pvm_parent();

    if (parent_tid == PvmNoParent) {
        printf("Parent task id= %d\n", mytid);
        printf("Spawning child tasks ...\n");
        for (i = 0; i < NTASKS; i++) {
            rc = pvm_spawn("hello", NULL, PvmTaskDefault, "", 1, &tids[i]);
            printf("   spawned child tid = %d\n", tids[i]);
        }
        printf("Saying hello to all child tasks...\n");
        msgtype = HELLO_MSGTYPE;
        rc = pvm_initsend(PvmDataDefault);
        rc = pvm_pkstr(helloworld);
        for (i = 0; i < NTASKS; i++)
            rc = pvm_send(tids[i], msgtype);
        printf("Parent task done.\n");
    }

    if (parent_tid != PvmNoParent) {
        printf("Child task id= %d\n", mytid);
        msgtype = HELLO_MSGTYPE;
        rc = pvm_recv(-1, msgtype);
        rc = pvm_upkstr(helloworld);
        printf(" ***Reply to: %d : HELLO back from %d!\n", parent_tid, mytid);
    }
    rc = pvm_exit();
}

FORTRAN version:
      program hello
      include 'fpvm3.h'
      parameter(NTASKS = 6)
      parameter(HELLO_MSGTYPE = 1)
      integer mytid, parent_tid, tids(NTASKS), msgtype, i, info
      character*12 helloworld/'HELLO WORLD!'/

      call pvmfmytid(mytid)
      call pvmfparent(parent_tid)

      if (parent_tid .eq. PvmNoParent) then
         print *,'Parent task id= ', mytid
         print *,'Spawning child tasks...'
         do 10 i=1,NTASKS
            call pvmfspawn("hello", PVMDEFAULT, " ", 1, tids(i), info)
            print *,'   spawned child tid = ', tids(i)
 10      continue
         print *,'Saying hello to all child tasks...'
         msgtype = HELLO_MSGTYPE
         call pvmfinitsend(PVMDEFAULT, info)
         call pvmfpack(STRING, helloworld, 12, 1, info)
         do 20 i=1,NTASKS
            call pvmfsend(tids(i), msgtype, info)
 20      continue
         print *, 'Parent task done.'
      endif

      if (parent_tid .ne. PvmNoParent) then
         print *, 'Child task id= ', mytid
         msgtype = HELLO_MSGTYPE
         call pvmfrecv (-1, msgtype, info)
         call pvmfunpack(STRING, helloworld, 12, 1, info)
         print *,' ***Reply to: ',parent_tid, ' : HELLO back from ',
     &        mytid,'!'
      endif
      call pvmfexit(info)
      end
More complicated examples
Check out the PVM distribution for some more
examples
Plenty more examples on the web too
Future of PVM?
PVMPI project was designed to provide interoperability between
different MPI environments – never really went anywhere
Grid concepts are driving interest towards meta-systems
Example: HARNESS (Heterogeneous Adaptable Reconfigurable
Networked Systems) project
Designed to support both PVM and MPI environments via plug-ins
Concept extends a single VM to many distributed VMs
Dynamic operation: new plugins can be added without taking the entire system down
PVM instruction set has been stable since 1999, core
development group has moved into Grid
Summary Part 2
PVM remains useful for heterogeneous environments
Instruction set is considerably less rich than MPI
Fault tolerance is a direct product of the preferred
Master-worker type model
Dynamic process groups are one of the unique features
Large investment in installed base of users – not likely
to disappear overnight
Part 3: Introduction to MPI
Overview of the MPI environment
Facilities within the API
Beginning programming
History of MPI
Many different message passing standards circa 1992
Most designed for high performance distributed memory systems
Following SC92 the MPI Forum was started
Open participation was encouraged (e.g. the PVM working group was asked for input)
Goal was to produce as portable an interface as possible
Vendors were included but not given control – specific hardware optimizations were avoided
Web address: http://www.mpi-forum.org
Web address: http://www.mpi-forum.org
MPI-1 standard released 1994
Forum reconvened in 1995-97 to define MPI-2
Fully functional MPI-2 implementations did not appear until 2002
though
Reference guide is available for download
http://www.netlib.org/utk/papers/mpi-book/mpi-book.ps
MPI distributions
Numerous vendor and open source distributions
Most popular Open Source versions:
Open MPI: http://www.open-mpi.org
MPI Chameleon (MPICH 1 & 2): http://www-unix.mcs.anl.gov/mpi/mpich/index.html
MPICH has been adopted by several vendors as the basis for their commercial MPI
Both of these versions are highly portable, and Open MPI and MPICH2 support MPI-2
C vs FORTRAN interface
As much effort as possible was extended to keep
the interfaces similar (unlike PVM!)
Only significant difference is C functions return
their value as the error code
FORTRAN versions pass a separate argument
Arguments to C functions may be more strongly
typed than FORTRAN equivalents
FORTRAN interface relies upon integers
MPI Communication model
As with PVM, messages are typed and tagged
Don’t need to explicitly define a buffer
Specify the start point of a message using a memory address
Packing interface is available if necessary (MPI_PACKED datatype) – the interface is provided if you want to use it
Communicators (process groups) are a vital component of the MPI standard
Destination processes must be addressed within a specific process group
Messages must therefore specify:
(address, count, datatype, destination, tag, communicator)
The address, count and datatype define the message data; the remaining variables define the message envelope
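A sketch of how the envelope appears in a matched send/receive pair (the buffer, tag and ranks are chosen arbitrarily):

double buf[100];
MPI_Status status;
if (myid == 0)
    MPI_Send(buf, 100, MPI_DOUBLE, 1, 7, MPI_COMM_WORLD);            /* dest = 1, tag = 7   */
else if (myid == 1)
    MPI_Recv(buf, 100, MPI_DOUBLE, 0, 7, MPI_COMM_WORLD, &status);   /* source = 0, tag = 7 */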
MPI-2
Significant advance over the 1.2 standard
Defines remote memory access (RMA) interface
Two modes of operation:
Active target: all processes participate in a single communication phase (although point-to-point messaging is allowed)
Passive target: individual processes participate in point-to-point messaging
Parallel I/O
Dynamic process management (MPI_SPAWN)
Missing pieces
MPI-1 does not specify how processes start
Recall PVM defines the console
Start-up is done using vendor/open source supplied package
MPI-2 defines mpiexec – a standardized startup routine
Standard buffer interface is implementation specific
Process groups are static – they can only be created or destroyed, not altered
No mechanism for obtaining details about the hosts
involved in the computation
Getting started: enrolling & exiting from the MPI environment
Every program must initialize by executing MPI_INIT(ierr) or int MPI_Init(int *argc, char ***argv)
argc, argv are historical hangovers in the C version and may be set to NULL
Default communicator is MPI_COMM_WORLD
Determine the process id by calling MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
Note PVM essentially puts enrollment and id resolution into one call
Determine the total number of processes via MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
To exit, processes must call MPI_FINALIZE(ierr)
Minimal MPI program
C version:
#include "mpi.h"
#include <stdio.h>

int main( int argc, char *argv[] )
{
    int myid;
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &myid );
    printf( "Hello, world from %d\n", myid );
    MPI_Finalize();
    return 0;
}

FORTRAN version:
      program main
      include 'mpif.h'
      integer ierr, myid
      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
      print *, 'Hello, world from ', myid
      call MPI_FINALIZE( ierr )
      end

Normally executed by: mpirun -np 4 my_program
Output:
Hello, world from 2
Hello, world from 1
Hello, world from 0
Hello, world from 3
Compiling MPI codes
Some implementations (e.g. MPICH) define additional
wrappers for the compiler:
mpif77, mpif90 for F77,F90
mpicc, mpicxx for C/C++
Code is then compiled using mpif90 (e.g.) rather than
f90, libraries are linked in automatically
Usually best policy when machine specific libraries are
required
Linking can always be done by hand:
On HPCVL use -I/opt/SUNWhpc/include
-L/opt/SUNWhpc/lib -R/opt/SUNWhpc/lib -lmpi
Running on HPCVL
Sun’s parallel environment is called Cluster Runtime Environment (CRE)
mprun is the job-starting command – same semantics as mpirun, e.g. mprun -np 4 my_prog
Additionally, you can specify which partition to run on using -p partition_name (but it is best to use the default)
Executable is in /opt/SUNWhpc/HPC5.0/bin
May want to add this to your path
Try a quick test: /opt/SUNWhpc/HPC5.0/bin/mprun –np 4 date
Fri Feb 24 20:10:57 EST 2006
Fri Feb 24 20:10:57 EST 2006
Fri Feb 24 20:10:57 EST 2006
Fri Feb 24 20:10:57 EST 2006
Large non-interactive jobs need to be submitted via the Sun Grid Engine load
balancing system
Sun Grid Engine
HPCVL uses SGE
to load balance
jobs on the cluster
Job submission
requires
preparation of a
batch script
mprun line can
contain more
options to send to
the cre
environment
#!/bin/csh
# Shell to use to run the job.
#$ -S /bin/csh
# Where to put standard output / error
#$ -o <full_path_output_file>
#$ -e <full_path_error_file>
# Email status to user. e.g. your email address
#$ -M <email_address>
# Be notified by email when the job is ended
#$ -m e
# Request of the cre parallel environment
#$ -pe cre <number_of_processors>
# Use mprun to launch mpi program - <program>
source /gridware/sge/default/common/settings.csh
cd <full_path_working_directory>
# make sure <number_of_processors> is the same as above
mprun -x sge -np <number_of_processors> <my_prog>
Subtleties of point-to-point
messaging
Process A: MPI_Send(B); MPI_Recv(B)     Process B: MPI_Send(A); MPI_Recv(A)
This kind of communication is `unsafe’. Whether it works correctly is dependent upon whether the system has enough buffer space.

Process A: MPI_Recv(B); MPI_Send(B)     Process B: MPI_Recv(A); MPI_Send(A)
This code leads to a deadlock, since the MPI_Recv blocks execution until it is completed.

Process A: MPI_Send(B); MPI_Recv(B)     Process B: MPI_Recv(A); MPI_Send(A)
You should always try to write communication patterns like this: a send is matched by a recv.
Buffered Mode communication
Buffered sends avoid the issue of whether
enough internal buffering is available
Programmer explicitly defines buffer space sufficient
to allow all messages to be sent
MPI_Bsend has same semantics as MPI_Send
MPI_Buffer_attach(buffer,size,ierr) must be
called to define the buffer space
Frequently better to rely on non-blocking
communication though
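A sketch of buffered-mode usage; the message size and variable names are assumptions, and stdlib.h is assumed to be included for malloc.

int size = 100 * sizeof(double) + MPI_BSEND_OVERHEAD;
char *buf = malloc(size);
MPI_Buffer_attach(buf, size);                   /* programmer-supplied buffer space         */
MPI_Bsend(data, 100, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);
MPI_Buffer_detach(&buf, &size);                 /* blocks until buffered sends have drained */
free(buf);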
Non-blocking communication
Helps alleviate two issues:
1. Blocking communication can potentially starve a process for data while it could be doing useful work
2. Problems related to buffering are circumvented, since the user must explicitly ensure the buffer is available
MPI_Isend adds a handle to the subroutine call which is later used to determine whether the operation has succeeded
The handle is used to identify which particular message
MPI_Irecv is the matching non-blocking receive operation
MPI_Test can be used to detect whether the send/receive has completed
MPI_Wait is used to wait for an operation to complete
MPI_Waitall is used to wait for a series of operations to complete
An array of handles is used
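A sketch of the non-blocking pattern, overlapping communication with useful work (buffer names, count and partner rank are illustrative):

MPI_Request reqs[2];
MPI_Status  stats[2];
MPI_Irecv(rbuf, n, MPI_DOUBLE, partner, tag, MPI_COMM_WORLD, &reqs[0]);
MPI_Isend(sbuf, n, MPI_DOUBLE, partner, tag, MPI_COMM_WORLD, &reqs[1]);
/* ... do useful local work here while the messages are in flight ... */
MPI_Waitall(2, reqs, stats);   /* both operations are complete after this call */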
Solutions to deadlocking
If sends and receives need to be matched use MPI_Sendrecv
Process A: MPI_Sendrecv(B)     Process B: MPI_Sendrecv(A)
Non-blocking versions of Isend and Irecv will prevent deadlocks
Process A: MPI_Isend(B); MPI_Irecv(B); MPI_Waitall     Process B: MPI_Isend(A); MPI_Irecv(A); MPI_Waitall
Use buffered mode sends so you know for sure that buffer space is available
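The MPI_Sendrecv form of the exchange (arguments illustrative); the matched send and receive are handled in one call, so neither side can block the other:

MPI_Sendrecv(sbuf, n, MPI_DOUBLE, partner, stag,
             rbuf, n, MPI_DOUBLE, partner, rtag,
             MPI_COMM_WORLD, &status);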
Other sending modes
Synchronous send (MPI_Ssend)
Only returns when the receiver has started receiving the
message
On return indicates that send buffer can be reused, and also
that receiver has started processing the message
Non-local communication mode: dependent upon speed of
remote processing
(Receiver) Ready send (MPI_Rsend)
Used to eliminate unnecessary handshaking on some systems
If posted before receiver is ready then outcome is undefined
(dangerous!)
Semantically, Rsend can be replaced by standard send
Collective Operations
Collectives apply to all processes within a given communicator
Three main categories:
Data movement (e.g. broadcast)
Synchronization (e.g. barrier)
Global reduction operations
All processes must have a matching call
Size of data sent must match size of data received
Unless specifically a synchronization function, these routines do
not imply synchronization
Blocking mode only – but unaware of status of remote
operations
No tags are necessary
Collective Data Movement
Types of data movement:
Broadcast (one to all, or all to all)
Gather (collect to single process)
Scatter (send from one processor to all)
MPI_Bcast(buff,count,datatype,root,comm,ierr)
[Diagram: MPI_Bcast copies the root’s data A0 to every processor; after the call each processor holds A0.]
Gather/scatter
MPI_Gather(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm, ierr)
MPI_Scatter has the same semantics
Note MPI_Allgather removes the root argument and all processes receive the result
Think of it as a gather followed by a broadcast
MPI_Alltoall(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm, ierr)
Each process sends a set of distinct data elements to the others – useful for transposing a matrix
[Diagram: MPI_Scatter distributes A0, A1, A2, A3 from the root, one element to each processor; MPI_Gather is the inverse operation, collecting the elements back to the root.]
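A sketch of a scatter/gather round trip (array names and sizes are assumptions):

/* root distributes n elements to each process, each works on its piece,
   then the pieces are collected back on the root                         */
MPI_Scatter(A, n, MPI_DOUBLE, apiece, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
/* ... operate on apiece locally ... */
MPI_Gather(apiece, n, MPI_DOUBLE, A, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);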
Global Reduction Operations
Plenty of operations covered:
Name of Operation    Action
MPI_MAX              maximum
MPI_MIN              minimum
MPI_SUM              sum
MPI_PROD             product
MPI_LAND             logical and
MPI_BAND             bit-wise and
MPI_LOR              logical or
MPI_BOR              bit-wise or
MPI_LXOR             logical xor
MPI_BXOR             bit-wise xor
MPI_MAXLOC           maximum value and location
MPI_MINLOC           minimum value and location
Reductions
MPI_REDUCE(sendbuf,recvbuf,count,datatype,
op,root,comm,ierr)
Result is stored in the root process
All members must call MPI_Reduce with the same
root, op, and count
MPI_Allreduce(sendbuf,recvbuf,count,datatype,
op,comm,ierr)
All members of group receive answer
Example using collectives
Numerically integrate
∫₀¹ 4/(1 + x²) dx = 4 (arctan(1) − arctan(0)) = π
Parallel algorithm: break up the integration region and sum separately over processors
Combine all values at the end
Very little communication required:
Number of pieces
Return of the values calculated
Example using broadcast and reduce
c
c  compute pi by integrating f(x) = 4/(1 + x**2)
c
c  each process:
c  - receives the # of intervals used in the apprxn
c  - calculates the areas of it's rectangles
c  - synchronizes for a global summation
c  process 0 prints the result and the time it took
c
      program main
      include 'mpif.h'

      double precision PIX
      parameter (PIX = 4*atan(1.0))
      double precision mypi, pi, h, sum, x, f, a
      integer n, myid, numprocs, i, ierr

c     function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)

      print *, "Process ", myid, " of ", numprocs, " is alive"

      if (myid .eq. 0) then
         print *, "Enter the number of intervals: (0 to quit)"
         read(5,*) n
      endif

      call MPI_BCAST(n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)

c     check for n > 0
      IF (N .GT. 0) THEN

c        calculate the interval size
         h = 1.0d0 / n
         sum = 0.0d0
         do 20 i = myid + 1, n, numprocs
            x = h * (dble(i) - 0.5d0)
            sum = sum + f(x)
 20      continue
         mypi = h * sum

c        collect all the partial sums
         call MPI_REDUCE(mypi, pi, 1,
     +        MPI_DOUBLE_PRECISION, MPI_SUM, 0,
     +        MPI_COMM_WORLD, ierr)

c        process 0 prints the result
         if (myid .eq. 0) then
            write(6, 97) pi, abs(pi - PIX)
 97         format(' pi is approximately: ',
     +           F18.16,' Error is: ', F18.16)
         endif

      ENDIF

      call MPI_FINALIZE(ierr)
      stop
      end
Summary of Part 3
MPI is a very rich instruction set
User defined standard
Role of buffers less significant than PVM
Multiple communication modes
Can program a wide variety of problems using a
handful of calls
Next lecture
More advanced parts of the MPI-1 API
User defined data-types
Cartesian primitives for meshes