Transcript Document

Parallel Seismic Computing
Theory and Practice
By Paul Stoffa
The University of Texas at Austin
Institute for Geophysics
4412 Spicewood Springs Road, Building 600
Austin, Texas 78759-8500
1
7/7/2015
Introduction
What one should know/learn before starting
 Why parallel?
 What we want/need to parallelize
 What can and cannot be parallelized
 Ways to parallelize – hardware
 Ways to parallelize – software (MPI/PVM/MYP)
 Ways to parallelize – algorithms
A.1
2
7/7/2015
Motivation
To process seismic data faster, using better
algorithms, and at reduced cost
 Better algorithms: 3D wave-equation-based
pre-stack migration & inversion
 Faster: days, not months, to complete the
processing
 Reduced cost: $10k–$100k per site, not $1M
or more
B.1
3
7/7/2015
Floating Point Performance
[Chart: floating point performance vs. year (1975–1995, log scale from .01 to 1000). 8-bit and 16-bit parts (8087, 80287, 80387, 68881) sit at the bottom, 32-bit microprocessors (MIPS R2000, i860, RS6000, DEC ALPHA) in the middle, and 64-bit Cray vector machines (Cray 1S, Cray X-MP, Cray 2, Cray Y-MP, Cray Y-MP/C90) at the top.]
B.2
4
7/7/2015
Moore’s Law
In 1965, Gordon Moore (co-founder of Intel)
was preparing a speech and when he started
to graph data about the growth in memory
chip performance, he realized there was a
striking trend:
 Each new chip contained roughly twice as much
capacity as its predecessor
 Each chip was released within 18–24 months of the
previous chip
 If this trend continued, computing power would rise
exponentially over relatively brief periods of time
B.3
5
7/7/2015
Moore’s Law (Diagram)
[Chart: speed (MIPS) and number of transistors vs. year (1975–2000) for the 4004, 8080, 8086, 80286, 80386, 80486, and Pentium processors; transistor counts grow from roughly 10K to 10M and speeds from well under 1 MIPS to hundreds of MIPS.]
B.4
6
7/7/2015
Breaking The Speed Limit
 IBM Interlocked Pipelined CMOS   4.5 GHz
 Intel Willamette                 1.5 GHz
 Intel Pentium III                1.0 GHz
 AMD Athlon                       1.0 GHz
 Current                          0.8 GHz
B.5
7
7/7/2015
What is MPP?
MPP = Massively parallel processing
 Up to thousands of processors
 Physically distributed memory

– Multicomputers: Each processor has a
separate address space
– Multiprocessors: All processors share a
common address space
C.1
8
7/7/2015
Processors may be commodity microprocessors or proprietary (i.e., custom)
designs
 Processors and memories are connected
with a network
 MPPs are scalable: The aggregate memory
bandwidth increases with the number of
processors

C.2
9
7/7/2015
Small-Scale Multiprocessing
[Diagram: uniprocessing - one processor (P) attached to one memory (M) - contrasted with small-scale multiprocessing, where several processors (P) share a single memory.]
C.3
10
7/7/2015
Massively Parallel Processing
Interconnect network
C.4
[Diagram: many processor–memory (P–M) pairs attached to the interconnect network.]
11
7/7/2015
Why MPP?
 Technology reasons:
– Limits to device (transistor) switching speeds
– Limits to signal propagation speeds
– Need for large memory
– Need for scalable systems
 Economic reasons:
– Low cost of microprocessors
– Low cost of DRAM
– Leverage R&D for the workstation market
C.5
12
7/7/2015
Problems with MPP

Hardware
– Commodity microprocessors are not designed
for MPP systems
– Slow divide
– Poor single-node efficiency
– Poor cache reuse on large-scale problems
– Low memory bandwidth
– Slow inter-processor communication
– Cost of synchronization
C.6
13
7/7/2015

Software
– MPP software is more complex
– It’s difficult to debug and optimize programs
– It’s difficult to port applications from
conventional systems

General
– Fundamental limits to parallel performance
– Load balance
– Data locality
C.7
14
7/7/2015
Clusters
Why Now ?
Commodity pricing of computer workstations,
servers and pc's has reduced the cost per
machine dramatically in the last ten years
 The power and speed of processors have
increased significantly during this time

D.1
15
7/7/2015
Cache and DRAM memory per machine
have also increased
 It is now possible to ‘cluster’ machines to
create a parallel machine using ethernet as
the communication vehicle
 Agreed programming standards (PVM &
MPI)

D.2
16
7/7/2015
Today’s Solution
Use widely available computer capability
designed for the general public, e.g. PC's,
workstations, servers
 Aggregate these machines into clusters that can
be dedicated to the seismic processing task
 Minimize the communication required between
processors to keep overhead to a minimum
 Minimize the transfer of data between
processors to keep I/O to a minimum

D.3
17
7/7/2015
Hardware
Parallel Computers
 Parallel Processing Concepts
 Memory Design
 Basic Hardware Concepts
 Processor Connectivity
 Amdahl’s Law

E.1
18
7/7/2015
Parallel Computers
Before commodity pricing and the
widespread availability of high-speed
components, parallel computers were
designed, built, and used for scientific
computing
 These machines were expensive, hard to
program, and of different designs
E.1.1
19
7/7/2015
Two distinct ways of organizing the
processing flow were investigated:
All processors issue the same instruction at
the same time
 The processors act independently of each
other

E.1.2
20
7/7/2015
Parallel Processing Concepts

SIMD single instruction multiple data
– Used in dedicated applications
– Fully synchronous

MIMD multiple instructions multiple data
– Used in general purpose applications
– Independent and asynchronous
– Can overlap computation and communication

SPMD single program multiple data
– Same as MIMD, except that all processors run
identical programs
E.2.1
21
7/7/2015
Flynn's taxonomy
A classification of computer architectures
based on the number of streams of instructions
and data:
– Single instruction/single data (SISD) - a sequential
computer
– Multiple instruction/single data (MISD) - unusual
– Single instruction/multiple data (SIMD) - e.g. DAP
– Multiple instruction/multiple data (MIMD) -
multiple autonomous processors simultaneously
executing different instructions on different data
E.2.2
22
7/7/2015
Memory Design
Several different memory distribution
designs were tried:
 Each processor has its own independent
memory bank
 Each processor’s independent memory is
shared
 All the memory is shared by all the
processors
E.3.1
23
7/7/2015
[Diagram: three memory organizations - independent memory banks (each processor P with its own memory M), shared independent memory banks (processors can also reach each other's banks), and a single shared memory bank (all processors attached to one MEMORY).]
E.3.2
24
7/7/2015
Parkinson's Law of Data
"Data expands to fill the space
available for storage"
 Buying more memory encourages the use
of more memory-intensive techniques
 It has been observed over the last 10 years that
the memory usage of evolving systems tends to
double roughly once every 18 months
 Fortunately, memory density available for
constant dollars also tends to double about once
every 12 months (see Moore's Law)
E.3.3
25
7/7/2015
Basic Hardware Concepts

Communication protocols
– Circuit switched
A physical circuit is established between any two
communicating processors
– Packet switched
Data communicated between processors is divided
into packets and routed along a path usually including
other processors
E.4.1
26
7/7/2015

Routing techniques
– Store and forward
Each packet is stored in the local memory of every
processor on its path before being forwarded to the
next processor
– Cut through
Each packet proceeds directly from one processor to
the next along a channel, without being stored in
memory
E.4.2
27
7/7/2015

Metrics and characteristics
– Latency
How long does it take to start sending a message
High latency implies use of large packets for efficiency
– Bandwidth
What data rate can be sustained once the message is started
Bi-directional bandwidth: Most systems can send data in
two (or more) directions at once
Aggregate bandwidth: the total bandwidth when all
processors are sending and receiving data
– Bisection bandwidth
At what rate can data be transferred from one half of the
system to the other
E.4.3
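As a hypothetical worked example (the numbers are assumed for illustration only): with 50 microseconds of latency and 100 Mbytes/s of sustained bandwidth, a 1 Kbyte message costs roughly 50 + 10 = 60 microseconds and is dominated by latency, while a 1 Mbyte message costs roughly 10,050 microseconds and is dominated by bandwidth - which is why high-latency networks favor large packets.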
28
7/7/2015

Synchronization
Necessary for all MIMD systems
– Barriers
Processors wait at a barrier in the code until they all
reach it
– Critical Sections
Used to protect code: A critical section of code can
only be executed by one processor at a time
E.4.4
29
7/7/2015
– Locks
Used to protect data: Locked data cannot be accessed
until it is unlocked. A lock is a primitive synchronization
mechanism that can be used to implement barriers and
critical sections
– Synchronization mechanisms should be
supported with special hardware (but aren’t in
many systems)
– Some systems support only barriers
E.4.5
30
7/7/2015
Processor Connectivity
Several ways of organizing the connections
between processors were used to optimize
inter processor communication & data
exchange efficiency, i.e. the architecture of the
processor connections is important. Network
topologies are characterized by:
– Diameter: maximum distance between processors
– Degree: number of connections to/from processor
– Extendability: Can an arbitrary size network be
built?
E.5.1
31
7/7/2015
Examples
X-Y grid (or 2D grid)
 Fat Tree
 Butterfly
 Hypercube
 2D Torus
 3D Torus

E.5.2
32
7/7/2015
X-Y Grid (Intel Paragon)
25 Node 2D Grid
E.5.3
33
7/7/2015
Fat Tree (CM5)
16 Node Fat Tree
E.5.4
34
7/7/2015
Butterfly (Kendall Square)
[Diagram: 8-node butterfly network - processors (P) on one side and memories (M) on the other, connected through stages of switches (S).]
8 Node Butterfly
E.5.5
35
7/7/2015
Hypercube (IPSC2)
16 Node Hypercube
E.5.6
36
7/7/2015
2D Torus (concept)
25 Node 2D Torus
E.5.7
37
7/7/2015
3D Torus (T3D)
[Diagram: 3D torus network with links in the +X/-X, +Y/-Y, and +Z/-Z directions.]
3D Torus Network
E.5.8
38
7/7/2015
Amdahl’s Law
 The overall speedup of a program is strictly
limited by the fraction of the execution time
spent in code running in parallel, no matter
how many processors are applied.
If p is the percentage of the time spent
(by a serial processor) on parts of the
program that can be done in parallel, the
theoretical speedup limit is
100 / (100 - p)
E.6.1
39
7/7/2015
Amdahl’s Law for N Processors
 If s is the percentage of the time spent
(by a serial processor) on the serial parts
of a program (so s + p = 100%), the speedup is
100 / (s + p/N)
E.6.2
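A minimal sketch (not from the original slides) that evaluates this formula; the program name and the assumed parallel fraction p = 90% are illustrative only.

      PROGRAM amdahl
c illustrative only: speedup = 100 / (s + p/N) for an assumed p = 90%
      REAL p, s, speedup
      INTEGER n, j
      p = 90.0
      s = 100.0 - p
      DO j = 1, 6
        n = 2**j
        speedup = 100.0 / (s + p/real(n))
        WRITE(*,*) 'N =', n, '  speedup =', speedup
      ENDDO
c as N grows, the speedup approaches the limit 100/(100-p) = 10
      WRITE(*,*) 'limit =', 100.0/(100.0 - p)
      END

For p = 90% the sketch prints a speedup of about 4.7 on 8 processors and about 8.8 on 64, already close to the limit of 10.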
40
7/7/2015
[Diagram: on a serial processor the run time is s + p = 100; on a parallel processor the serial part still takes s while the parallel part shrinks to p/N, so the run time is s + p/N.]
E.6.3
41
7/7/2015
Amdahl’s Law (Diagram 1)
[Chart: speedup vs. p (0–100%) for the theoretical limit and for N = 16; both curves stay low until p is large and rise steeply only as p approaches 100%.]
E.6.4
7/7/2015
Amdahl’s Law (Diagram 2)
[Chart: speedup vs. number of processors N (1 to 1024, log scale) for p = 50%, 60%, 70%, 80%, 90%, and 100%; only the p = 100% curve keeps rising linearly, the others flatten out.]
E.6.5
7/7/2015
Amdahl’s Law Reevaluated
 The steepness of the curves in diagram 1 implies that few
problems will experience even a 10-fold speedup
 The implicit assumption that p is independent of
N is virtually never the case
 In reality, the problem size scales with N, i.e.
given a more powerful processor, users gain
control over grid resolution, number of timesteps,
etc., allowing the program to run in some desired
amount of time. It is more realistic to assume that
run time, not problem size, is constant
E.6.6
44
7/7/2015
Amdahl’s Law Reevaluated
 If we use s´ and p´ to represent the percentages
of serial and parallel time spent on the
parallel system, then we arrive at an
alternative to Amdahl's law.
Scaled speedup is
(s´ + N·p´) / 100
E.6.7
45
7/7/2015
[Diagram: the run on the parallel processor takes time s´ + p´ = 100; a hypothetical run of the same (scaled) problem on a serial processor would take s´ + N·p´.]
E.6.8
46
7/7/2015
Scaled speedup (diagram 3)
[Chart: scaled speedup vs. number of processors N (1 to 1024, log scale) for p = 50% to 100%; the curves remain nearly linear in N instead of flattening out.]
E.6.9
7/7/2015
Brief History of Example
Machines
DAP (Distributed Array of Processors)
 MasPar-1
 Thinking machine CM2 – SIMD
 Intel - IPSC2 MIMD hypercube
 Thinking machine CM5 – MIMD
 Cray T3D & T3E - MIMD torus
 SGI origin 2000 - MIMD shared memory
 Microway Alpha Cluster

F.1
48
7/7/2015
AMT DAP Hardware
[Diagram: a host computer connects over VME to the host computer unit and the master control unit (with its code store), which drives the array of processor elements - 1-bit PEs with 8-bit co-processors arranged 64 x 64 - each with array memory of at least 32 Kbits per PE, plus a fast I/O channel.]
F.1.1
49
7/7/2015
DAP 610C is a SIMD machine. 4096
processor elements simultaneously execute
the same instruction on data within their
local memory
 The processor elements (PE) are arranged in
a square array (64 × 64), each comprising a
general-purpose bit-organized processor and
an 8-bit wide co-processor for performing
complex functions such as floating point
arithmetic

F.1.2
50
7/7/2015
Each PE has a local memory which ranges
from 32 to 256 Kbits per PE (i.e., 16 to
128 Mbytes in total)
 Each bit-organized processor is connected
to its four nearest neighbors in the square
array, and to row and column data paths
used for rapid broadcast or combination of
data

F.1.3
51
7/7/2015
Active Memory Solution
Simplicity in programming
 The array structure matches the parallelism
in a wide range of problems
 A large number of simple PEs can be
provided at reasonable cost

F.1.4
52
7/7/2015
Specialized functions such as floating
point operations or variable word length
mathematics can be micro-coded from
very general elementary operations
provided by the PE
 Overheads are very low since one control
operation (managed by the MCU)
simultaneously produces 4096 results (64
× 64 array), in contrast to a single result
for a conventional processor

F.1.5
53
7/7/2015
Conventional (SISD) vs. Active Memory (SIMD)
[Diagram: a conventional SISD machine has one memory, one control unit running the program, and one PE; an active-memory SIMD machine has a master control unit driving many PEs, each paired with its own memory.]
F.1.6
54
7/7/2015
MasPar-1
 MasPar-1 architecture consists of four
subsystems:
– The Processor Element (PE) Array
– The Array Control Unit (ACU)
– A UNIX subsystem with standard I/O
– A high-speed I/O subsystem
F.2.1
55
7/7/2015
F.2.2
56
7/7/2015
The PE Array is the computational heart of the
MP-1 family architecture and enables it to operate
on thousands of data elements in parallel
 Each PE is a RISC processor with dedicated
data memory and execution logic, and
operates at 1.6 MIPS (32-bit integer add)
and 73 KFLOPS (average of 32-bit floating
point multiply and add)
 Extensive use of CMOS VLSI allows
extremely dense packaging – 32 processors
per chip, with forty 32-bit registers per PE

F.2.3
57
7/7/2015
Using SIMD thousands of PEs can work on a
single problem simultaneously. Each PE can
share its own data memory with other
processors through a global communications
system. PEs receive instructions from ACU
 MP-1 systems support PE arrays with 1,024
to 16,384 PEs. Performance ranges from
1,600 to 26,000 integer MIPS (32-bit add)
and from 75 to 1,200 single precision
MFLOPS (average of add and multiply). The
aggregate PE Array data memory ranges
from 16 MB to 1 GB

F.2.4
58
7/7/2015
Processor Element Array
[Diagram: a single processing element (PE) - a RISC processor with its own data memory - replicated to form the PE array.]
F.2.5
59
7/7/2015
Thinking machine CM2 – SIMD
Thinking Machines provided a very cost-effective
hardware platform in the CM2,
which was a SIMD machine.
Unfortunately, it was hard to program.
The machine was well suited to explicit
finite difference algorithms. Mobil was one
of the many companies that used this
machine for seismic processing.
F.3.1
60
7/7/2015
CM-2 General Specifications
 Processors              65,536
 Memory                  512 Mbytes
 Memory Bandwidth        300 Gbits/Sec
 I/O Channels            8
 Capacity per Channel    40 Mbytes/Sec
 Max. Transfer Rate      320 Mbytes/Sec
F.3.2
61
7/7/2015
DataVault Specifications
 Storage Capacity        5 or 10 Gbytes
 I/O Interfaces          2
 Transfer Rate, Burst    40 Mbytes/Sec
 Max. Aggregate Rate     320 Mbytes/Sec
The CM-2 has performance in excess of 2500
MIPS and floating point performance
above 2500 MFLOPS
F.3.3
62
7/7/2015
CM-2 12-D Hypercube
Simple 1-bit processors, capable of
performing the same calculation, each on
its own separate data set, are grouped 16
to a chip, making total of 4096 chips
 The chips are wired together in a network
having the shape of a 12-D hypercube - a cube
with 2^12 corners, or 4096 chips, each
connected to 12 other chips

F.3.4
63
7/7/2015
Hypercube
A cube of more than three dimensions
 Single point (‘node’) can be considered as
a zero dimensional cube
 Two nodes joined by a line are a one
dimensional cube
 Four nodes arranged in a square are a two
dimensional cube
 Eight (2^3) nodes are an ordinary three
dimensional cube
F.4.1
64
7/7/2015
Zero dimensional cube
Three dimensional cube
Two dimensional cube
F.4.2
65
7/7/2015
Hypercube (Cont.)
The first hypercube has 2^4 = 16 nodes and is a
four dimensional shape (a "four-cube"), and
an N dimensional cube has 2^N nodes (an
"N-cube")
 Each node in an N dimensional cube is
directly connected to N other nodes

F.4.3
66
7/7/2015
Hypercube (Cont.)

F.4.4
The simple, regular geometrical structure
and the close relationship between the
coordinate system and binary numbers
make the hypercube an appropriate
topology for a parallel computer
interconnection network. The fact that the
number of directly connected, "nearest
neighbor", nodes increases with the total
size of the network is also highly desirable
for a parallel computer
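A small sketch (not in the original) of this binary-coordinate property, using the IEOR intrinsic: flipping one bit of a node's label gives its neighbor in that dimension. The node label and cube dimension are illustrative.

      PROGRAM hcube
c illustrative only: list the direct neighbors of one node of an N-cube
      INTEGER ndim, node, j, neighbor
      ndim = 4
      node = 5
c flipping bit j of the node label gives the neighbor in dimension j
      DO j = 0, ndim-1
        neighbor = ieor(node, 2**j)
        WRITE(*,*) 'node', node, ' <-> node', neighbor
      ENDDO
      END

For node 5 of a 4-cube this lists neighbors 4, 7, 1, and 13, one per dimension.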
67
7/7/2015
Intel - IPSC2 MIMD Hypercube
16 Node Hypercube
F.5.1
68
7/7/2015
Intel - IPSC2 MIMD hypercube
This machine was one of the very first
parallel machines used for scientific
computing. At that time no programming
standards were agreed upon and Intel
provided software for coordination and
control of the processors
Production seismic processing was done
using this machine and its successors by
Tensor
F.5.2
69
7/7/2015
nCUBE - MIMD hypercube
nCUBE was another parallel machine that
was used for seismic processing. MIT’s
ERL group had a machine and Shell used
several for production seismic processing.
nCUBE provided the specialized software
necessary for inter processor
communication and coordination.
F.6.1
70
7/7/2015
nCUBE 2
Architecture: Distributed Memory MIMD
hypercube
 Node: Based on a 64-bit custom chip and a 64-bit
floating point unit, running at 7.5 MIPS or
3.3 MFLOPS, with 1-64 Mbytes of memory
 I/O: Other I/O boards include an improved
graphics board, a parallel I/O sub-system
board, and an Open System board. Each
board can transfer data at a peak rate of 568
Mbytes/s per I/O slot

F.6.2
71
7/7/2015
Topology: A hypercube with extra I/O boards,
although host boards are no longer necessary
as the nCUBE 2 is hosted by a workstation
which is able to allocate sub-cubes separately
to different users. Wormhole routing is
implemented in hardware over high-speed
DMA communications channels
 Scalability: Scales from a 5-dimensional cube
(32 nodes) to a 13-dimensional cube (8192
nodes)

F.6.3
72
7/7/2015

F.6.4
Performance: Peak performance is 27
GFLOPS for a 13-dimensional hypercube
with 512 Gbytes of memory and a peak
I/O bandwidth of 36 Gbytes/s. It would
also be able to support 4096 disc drives.
The largest built to date is a 1024 node
machine which has been shown to scale
almost linearly on some applications
73
7/7/2015
Thinking machine CM5 – MIMD
Realizing the limitations of the CM2, Thinking
Machines developed the CM5, an
easier-to-program MIMD machine.
GECO, among other companies, used this
machine for production seismic processing
F.7.1
74
7/7/2015
Thinking Machine CM5 – MIMD

Architecture: coordinated homogeneous
array of RISC microprocessors with the
following communication mechanisms:
– A low-latency, high-bandwidth communications
mechanism that allows each processor to access
data stored in other processors
– A fast global synchronization mechanism that
allows the entire machine to be brought to a
defined state at a specific point in the course of
computation
F.7.2
75
7/7/2015

Processing node: each processing node
consists of
– a RISC microprocessor
– 32 MB of memory
– interface to the control and data interconnection
network
– 4 independent vector units with direct 64-bit
path to an 8MB bank of memory
This gives each processing node a
double-precision floating point rate of 128
MFLOPS and a memory bandwidth of 512
MB/sec
F.7.3
76
7/7/2015
[Diagram: CM5 processing node - a RISC microprocessor (SPARC) handles instruction fetch and program control and connects to the data network and control network through their interfaces; instructions flow over a 64-bit bus to four floating point vector units, which share 32 MB of data memory over 64-bit data paths.]
F.7.4
77
7/7/2015

Data network topology: uses a fat tree
topology
– Each interior node connects four children to two
or more parents
– Processing nodes form the leaf nodes of the tree
– As more processing nodes are added, the number
of levels of the tree increases, allowing the total
bandwidth across the network to increase
proportionally to the number of processors
F.7.5
78
7/7/2015
Processors and I/O nodes
16 Node Fat Tree
F.7.6
79
7/7/2015
Cray T3D Architecture
[Diagram: the Cray T3D processor array flanked by I/O gateways connecting it to disks, tapes, networks, and workstations.]
F.8.1
80
7/7/2015
Cray T3D System Configuration
Multi-cabinet CRAY T3D models:

Model     PEs    Total Memory   Peak Performance   Cooling
                 (Gbytes)       (GFLOPS)
MCA 32      32   0.5 or 2         4.8              Air/water
MCA 64      64   1 or 4           9.6              Air/water
MCA 128    128   2 or 8          19.2              Air/water
MC 128     128   2 or 8          19.2              Water
MC 256     256   4 or 16         38.4              Water
MC 512     512   8 or 32         76.8              Water
MC 1024   1024   16 or 64       153.6              Water
MC 2048   2048   32 or 128      307.2              Water
F.8.2
81
7/7/2015
Cray T3D Latency Hiding
The system design supports three major latency
hiding mechanisms that augment the
high-performance system interconnect
– Pre-fetch queue (PFQ)
– Read ahead
– Block Transfer Engine (BLT)
The PFQ and read-ahead features are designed to
hide latency at the level of a few words; the BLT
is designed to hide latency at the level of
hundreds or thousands of words
F.8.3
82
7/7/2015
Cray T3D Fast Synchronization
Cray Research has developed a range of
synchronization mechanisms, including
– Barrier/Eureka Synchronization
– Fetch-and-Increment Register
– Atomic Swap
– Messaging Facility
These mechanisms accommodate both
SIMD and MIMD programming styles
F.8.4
83
7/7/2015
Cray T3D Parallel I/O
The high-bandwidth I/O subsystem, based
on Cray Research’s Model E I/O subsystem
technology, provides access to the full range
of Cray Research disk, tape, and network
peripherals. Peak performance in the
gigabytes per second range can be achieved.
F.8.5
84
7/7/2015
Cray T3D Scalability
Cray T3D system is scalable
– From 32 to 2048 processors
– From 4.2 to over 300 GFLOPS peak
performance
– From 512 Mbytes to 128 Gbytes of memory
– terabytes of storage capacity
Scalability is used for improving system
throughput by increasing the number of job
streams, and for significantly increasing the
performance of individual user jobs
F.8.6
85
7/7/2015
Cray T3E
Provides a shared physical address space of
up to 2,048 processors over a 3D torus
interconnect. Each node contains an Alpha 21164
processor, system control chip, local memory,
and network router. The system logic runs at
75 MHz; the processor runs at some multiple of
this (300 MHz for the Cray T3E), with sustainable
bandwidths of 267 MB per second input and
output for every four processors
F.8.7
86
7/7/2015
Cray T3E - Torus Interconnect
[Diagram: 3D torus network with links in the +X/-X, +Y/-Y, and +Z/-Z directions.]
3D Torus Network
F.8.8
87
7/7/2015
Cray T3E node block diagram
[Diagram: Cray T3E node - a 300-450 MHz Alpha 21164 processor with streams logic, control logic, and E-registers, 64 MB to 2 GB of local memory, and a network router; the surrounding system logic runs at 75 MHz.]
F.8.9
88
7/7/2015
Node Organization
[Diagram: node types in the torus - processing nodes, redundant nodes, and I/O nodes.]
F.8.10
89
7/7/2015
Cray T3D, T3E - torus
Cray Research developed the T3D and
T3E using a commodity processor, the
DEC Alpha chip.
The interconnect between processors was
based on a 3D torus, which provided a very
efficient way to transfer data between the
compute nodes.
Phillips Petroleum used these machines for
pre-stack 3D depth migration.
F.8.11
90
7/7/2015
SGI Origin 2000
Hardware:
 8 SGI R10000 250 MHz processors
 4 MB cache per processor
 8 GB RAM
 100 GB local disk space
 800 MB/s link between nodes
 100 Mbit/s link to LAN
 1000 Mbit/s link to Microway Alpha cluster
Software:
 SGI IRIX64 (TM) v.6.4
 SGI Message Passing Toolkit for IRIX v.1.3
Price: $465,098
F.9.1
91
7/7/2015
SGI Origin 2000 (Cont.)
[Diagram: each Origin 2000 node pairs two 250 MHz processors (4 MB cache each) with 2 GB of memory and directory and a HUB; nodes are linked by routers over 800 MB/s Craylink, and Xbow crossbars provide the XIO connections, the 100 Mbit/s LAN link, and the 1 Gbit/s link to the Alpha cluster.]
F.9.2
92
7/7/2015
SGI Origin 2000 - SMP shared memory
The Origin 2000 (O2K) became popular
because the use of shared memory between
the processors made programming easier
and minimized data transfers.
Routine seismic processing is performed
using these multiprocessor machines in
parallel, as well as by running independent
processing tasks on each processor.
F.9.3
93
7/7/2015
Microway Alpha Cluster
Hardware:
 8 Alpha 21164 667 MHz processors
 8 MB cache per processor
 8 GB RAM
 72 GB local disk space
 100 Mbit/s link between nodes
 100 Mbit/s link to LAN
 1000 Mbit/s link to SGI Origin 2000
Software:
 Microsoft Windows NT v.4.0 SP 4
 MPI/PRO (TM) for Windows NT v.1.2.3 by
MPI Software Technology, Inc.
 Hummingbird NFS Maestro v.6.1
Price: $71,915
F.10.1
94
7/7/2015
Microway Alpha Cluster (Cont.)
[Diagram: each cluster node pairs an Alpha processor (667 MHz, 8 MB cache) with 1 GB of memory, an Ultra2 LVD SCSI controller with a 9 GB disk, and a PCI Gbit NIC; the eight nodes connect through a 3Com 36-port 100 Mbit switch, with 100 Mbit/s links to the LAN and a 1 Gbit/s link to the SGI.]
F.10.2
95
7/7/2015
Basic Software Concepts

Styles of parallel coding
– Data parallel
– Message passing
– Work sharing

The data parallel style
– Data parallel is the most natural style for
SIMD, but it can also be implemented in
MIMD systems
– Uses array syntax for concise coding
G.1
96
7/7/2015

The message passing style
– Message passing is the natural style for multicomputer systems (separate address spaces)
– Uses standard languages plus libraries for
communication, global operations, etc.
– Both synchronous and asynchronous
communications are possible
– Both node-to-node and broadcast
communications are possible
G.2
97
7/7/2015

The work sharing style
– Work sharing is a natural style for a
multiprocessor ( global address space)
– Work sharing is similar to Cray’s auto-tasking
– Compiler directives specify data distributions
– Loop iterations can be distributed across
processors
G.3
98
7/7/2015

Comparison of the different styles
– The data parallel style is relatively high-level,
elegant, and concise. The parallelism is fully
implicit
– The message passing style is relatively low-level,
general, efficient, flexible, and economical of
memory. The parallelism is fully explicit:
communication, synchronization, and data
decomposition must be specified in complete
detail by the programmer
G.4
99
7/7/2015
– The work sharing style is intermediate between
the other two styles
Communication is implicit, synchronization is
explicit (but more general types are allowed than
in message passing), and data distribution is
implicit (with guidance from the programmer)
Work sharing can be concise yet efficient, and it
is more similar to serial code than the other
styles
– A few systems allow mixing styles within a
program
G.5
100
7/7/2015
Software

Machine Specific Software
– Intel
– nCUBE
– Thinking Machine

Standards
– High performance Fortran
– PVM
– MPI
H.1
101
7/7/2015
Software
…
…
 MYP (my parallel interface)
 PVM (parallel virtual machine)
 MPI (message passing interface)
…
…

H.2
102
7/7/2015
PVM Standard and
Implementation

Developed by
– University of Tennessee
– Oak Ridge National Laboratory
– Emory University
I.1
103
7/7/2015
Message Passing under PVM
Structure of the Transfer Process
 FORM MESSAGE - Sender: creates a send buffer and assembles
the message (possibly from multiple sources)
 SEND MESSAGE - PVM: uses the IPC library or shared memory
implementation
 DEQUEUE MESSAGE - Receiver: identifies the message in PVM
 COPY TO DESTINATION - Receiver: initiates a block transfer or
memory copy
I.2
104
7/7/2015
Syntax for sending a message in C

Initialize send buffer
pvm_initsend (int encoding)
 PvmDataDefault = 0
 PvmDataRaw     = 1
 PvmDataInPlace = 2

I.3
105
7/7/2015

Insert data into send buffer
pvm_pkbyte(char*ptr, int num, int strd)
ptr - address of the source array to pack
 num - message length in bytes
 strd - stride or spacing of elements


Send the message
pvm_send (int dest, int type)
dest = id of destination processor
 type = message type

I.4
106
7/7/2015
Syntax for receiving a message in C

Receive a message
pvm_recv (int src, int type)
src - source processor id (-1 means any source)
 type - message type (-1 means any type accepted)


Extract data from message
pvm_upkbyte (char*ptr, int num, int strd)
ptr - address of destination array
 num - message length in bytes
 strd - stride or spacing of elements

I.5
107
7/7/2015
Syntax for sending a message in Fortran
 Initialize send buffer
call pvmfinitsend(pvminplace, istat)
 Insert data into send buffer
call pvmfpack(byte1, data, len, 1, istat)
 Send the message
call pvmfsend(dest, type, istat)
I.6
108
7/7/2015
Syntax for receiving a message in Fortran
 Receive a message
call pvmfrecv(src, type, istat)
 Extract data from message
call pvmfunpack(byte1, data, len, 1,istat)
I.7
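A minimal sketch (not from the slides) that puts these calls together: the parent task packs and sends a byte array, and a spawned task receives and unpacks it. The use of pvmfmytid/pvmfparent/pvmfexit, the message type, and the array length are illustrative assumptions, and 'dest' stands for the task id of a previously spawned task (obtaining it is not shown).

c sketch only: one send and one matching receive with the PVM Fortran interface
      PROGRAM pvmdemo
      INCLUDE 'fpvm3.h'
      INTEGER mytid, ptid, dest, bufid, istat, nbytes, msgtyp
      CHARACTER data(1000)
      nbytes = 1000
      msgtyp = 1
      CALL pvmfmytid(mytid)
      CALL pvmfparent(ptid)
      IF (ptid .lt. 0) THEN
c parent: pack the bytes and send them to a spawned task 'dest' (id not shown)
        CALL pvmfinitsend(pvmdefault, bufid)
        CALL pvmfpack(byte1, data, nbytes, 1, istat)
        CALL pvmfsend(dest, msgtyp, istat)
      ELSE
c child: receive a message of this type from the parent and unpack it
        CALL pvmfrecv(ptid, msgtyp, bufid)
        CALL pvmfunpack(byte1, data, nbytes, 1, istat)
      ENDIF
      CALL pvmfexit(istat)
      END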
109
7/7/2015
MPI Standard and
Implementation
MPI standardization effort involved about
40 organizations mainly from the US and
Europe. Most of the major vendors of
concurrent computers were involved in
MPI, along with researchers from
universities, government laboratories, and
industry
J.1
110
7/7/2015
Goals
Design an application programming
interface (not necessarily for compilers or a
system implementation library)
 Allow efficient communication: Avoid
memory-to-memory copying and allow
overlap of computation and communication
and offload to communication co-processor,
where available

J.2
111
7/7/2015
Allow for implementations that can be used
in a heterogeneous environment
 Allow convenient C and Fortran 77 bindings
for the interface
 Assume a reliable communication interface:
the user need not cope with communication
failures. Such failures are dealt with by the
underlying communication subsystem

J.3
112
7/7/2015
Define an interface that is not too different
from current practice, such as PVM, NX,
Express, p4, etc., and provides extensions
that allow greater flexibility
 Define an interface that can be
implemented on many vendors' platforms,
with no significant changes in the
underlying communication and system
software

J.4
113
7/7/2015
Semantics of the interface should be
language independent
 The interface should be designed to allow
for thread-safety

J.5
114
7/7/2015

The standardization process began with the
Workshop on Standards for Message
Passing in a Distributed Memory
Environment, sponsored by the Center for
Research on Parallel Computing, held
April 29-30, 1992, in Williamsburg,
Virginia

At this workshop a working group was
established to continue the standardization
process
J.6
115
7/7/2015
A preliminary draft proposal, known as
MPI1, was put forward by Dongarra,
Hempel, Hey, and Walker in November
1992, and a revised version was completed
in February 1993
 Version 1.0: June, 1994
 Version 1.1: June, 1995
 Version 1.2: April, 1997
(MPI 1.2 is part of MPI-2 document)

J.7
116
7/7/2015
UNIX MPICH v.1.2.0 : Portable
implementation of MPI, released on
12/2/1999
 SGI MPT for IRIX v.1.3 : MPI release
implements the MPI 1.2 standard
 MPI/PROTM for Windows NT® v.1.2.3: Full
MPI 1.2 specification is currently supported,
with MPI 2.0 support to be offered
incrementally in the future

J.8
117
7/7/2015
MPI Commands in C and Fortran

UNIX, SGI: Include files
– mpi.h (for C) and
– mpif.h (for Fortran)
contain definitions of MPI objects, e.g., data
types, communicators, collective operations,
error handles, etc.
J.9
118
7/7/2015
 MPI/Pro: File mpi.h refers to include
files for C
– MPIPro_SMP.H
– MPIPro_TCP.H
File mpif.H refers to include files for
Fortran
– MPIPro_SMP_f.H
– MPIPro_TCP_f.H
J.10
119
7/7/2015
Format Comparison
C:
  #include "mpi.h"
  int MPI_Init(argc, argv)
  int *argc;
  char ***argv;
  Input parameters:
   argc - Pointer to the number of arguments
   argv - Pointer to the argument vector

Fortran:
  include 'mpif.h'
  MPI_Init(ier)
  Output variables:
   ier - integer - returned error value
J.11
120
7/7/2015
MPI Subroutines in Fortran
MPI_Init
 MPI_Comm_size
 MPI_Comm_rank
 MPI_Barrier

J.12
121
7/7/2015
MPI Subroutines in Fortran
MPI_Send
 MPI_Recv
 MPI_Bcast
 MPI_Finalize
 more …

J.13
122
7/7/2015
Subroutine MPI_init ( ier)
Initializes the MPI execution environment
Include ‘mpif.h’
– Output variables

J.14
ier - integer - returned error value
123
7/7/2015
Subroutine MPI_Comm_size
(comm,size,ier)
Determines the size of the group associated
with a communicator
Include ‘mpif.h’
– Input variables

comm - integer - communicator (handle)
– Output variables
size - integer - # of processes in the group of comm
 ier - integer - returned error value

J.15
124
7/7/2015
Subroutine MPI_Comm_rank
(mpi_com_world, rank, ier)
Determines the rank of the calling process in
the communicator
Include ‘mpif.h’
– Input variables

comm - integer - communicator (handle)
– Output variables
rank - integer- rank of the calling process in comm
 ier - integer - returned error value

J.16
125
7/7/2015
Subroutine MPI_Barrier
(comm, ier)
Blocks until all processes have reached this
routine
Include ‘mpif.h’
– Input variables

comm - integer - communicator (handle)
– Output variables

J.17
ier - integer- returned error value
126
7/7/2015
Subroutine MPI_Send(buf,count,
datatype,dest,tag,comm, ier)
Performs a basic send
Include ‘mpif.h’
– Input variables



J.18
buf - integer- initial address of send buffer (choice)
count - integer- number of elements in send buffer
datatype - integer- datatype of each send buffer
element(handle)
127
7/7/2015
Subroutine MPI_Send(buf,count,
datatype,dest,tag,comm, ier)
dest - integer- rank of destination
 tag - integer- message tag
 comm - integer- communicator (handle)

– Output variables

J.19
ier - integer- returned error value
128
7/7/2015
Subroutine MPI_Recv (buf, count,
datatype,source,tag,comm,
status,ier)
Basic receive
Include ‘mpif.h’
– Input variables
count - integer- max number of elements in
receive buffer
 datatype - integer- datatype of each receive buffer
element (handle)

J.20
129
7/7/2015
Subroutine MPI_Recv (buf, count,
datatype,source,tag,comm,
status,ier)
source - integer- rank of source
 tag - integer- message tag
 comm - integer- communicator (handle)

– Output variables
buf - integer- initial address of receive buffer (choice)
 status - integer- status object (status)
 ier - integer- returned error value

J.21
130
7/7/2015
Subroutine MPI_Bcast (buffer,
count, datatype, root,comm,ier)
Broadcasts a message from ‘root’ to all
other processes of the group
Include ‘mpif.h’
– Input variables
buffer - integer- starting address of buffer (choice)
 count - integer- number of entries in buffer

J.22
131
7/7/2015
Subroutine MPI_Bcast (buffer,
count, datatype, root,comm,ier)
datatype - integer- data type of buffer (handle)
 root - integer- rank of broadcast root
 comm - integer- communicator (handle)

– Output variables

J.23
ier – integer - returned error value
132
7/7/2015
Subroutine MPI_Finalize (ier)
Terminates MPI execution environment
Include ‘mpif.h’
– Output variables

J.24
ier - integer- returned error value
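A minimal sketch (not in the original slides) that strings the subroutines documented above into one complete program: rank 0 fills a buffer and broadcasts it, every other rank sends its copy back, and rank 0 receives them. The array size and message tag are illustrative.

      PROGRAM mpidemo
      INCLUDE 'mpif.h'
      INTEGER ier, rank, nprocs, j, tag
      INTEGER status(mpi_status_size)
      REAL x(1000)
      tag = 1
      CALL mpi_init(ier)
      CALL mpi_comm_size(mpi_comm_world, nprocs, ier)
      CALL mpi_comm_rank(mpi_comm_world, rank, ier)
c rank 0 fills the buffer, then every rank gets a copy of it
      IF (rank .eq. 0) THEN
        DO j = 1, 1000
          x(j) = j
        ENDDO
      ENDIF
      CALL mpi_bcast(x, 1000, mpi_real, 0, mpi_comm_world, ier)
c every non-zero rank sends its buffer back to rank 0, which receives them in turn
      IF (rank .ne. 0) THEN
        CALL mpi_send(x, 1000, mpi_real, 0, tag, mpi_comm_world, ier)
      ELSE
        DO j = 1, nprocs-1
          CALL mpi_recv(x, 1000, mpi_real, j, tag, mpi_comm_world,
     &                  status, ier)
        ENDDO
      ENDIF
      CALL mpi_barrier(mpi_comm_world, ier)
      CALL mpi_finalize(ier)
      END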
133
7/7/2015
MYP Philosophy And
Implementation
A suite of Fortran subroutines that perform
the most common tasks in a parallel
program
 All application level software uses these
routines to avoid machine specific
software and changing ‘standard’ software
at the application level

K.1
134
7/7/2015
Subroutines can be modified to use any
specific parallel subroutines, e.g. current
version uses PVM and MPI
 A check program is used to verify accuracy
of any new implementation

K.2
135
7/7/2015
Software: MYP
Incorporates the following
PVM (parallel virtual machine)
 MPI (message passing interface)
 more…

K.3
136
7/7/2015
MYP Subroutines in Fortran
Myp_open
 Myp_barrier
 Myp_close
 Myp_writer
 Myp_readr
 Myp_bcast
 Myp_g2sum
 Myp_g1sum

K.4
137
7/7/2015
MYP Subroutines in Fortran
Myp_token
 Myp_transp
 Myp_copy
 Myp_rdxfer
 Myp_wrxfer
 Myp_message

K.5
138
7/7/2015
Subroutine myp_open
(iamnode,ier)
Initializes MPI
Include ‘mypf.h’
– Output variables
iamnode – integer – node identification number
 ier – integer – returned error value

K.6
139
7/7/2015
      SUBROUTINE myp_open(iamnode,ier)
      INTEGER iamnode, ier
      INCLUDE 'mpif.h'
      INCLUDE 'mypf.h'
      CALL mpi_init(ier)
      IF (ier .lt. 0) GOTO 7
      CALL mpi_comm_rank(mpi_comm_world, iamnode, ier)
      IF (ier .lt. 0) GOTO 7
      CALL mpi_comm_size(mpi_comm_world, nodes, ier)
      IF (ier .lt. 0) GOTO 7
      CALL mpi_barrier(mpi_comm_world, ier)
      IF (ier .lt. 0) GOTO 7
c get length of myname & myp_path for later use, then check if myname is ok
      mytid    = iamnode
      myp_path = ''
      myname   = 'myp_mpi'
      mygroup  = 'mpi_world'
      ncmyname = ichrlen(myname)
      IF (ncmyname .le. 0) GOTO 7
      ncmygroup = ichrlen(mygroup)
c get my node rank in ascii and save it, 5 characters long, pre-pend zeros
      ncmyp_path = ichrlen(myp_path)
      write(nodechar, '(i5)') iamnode
      DO j = 1, 5
        if (nodechar(j:j) .eq. ' ') nodechar(j:j) = '0'
      ENDDO
      ier = 0
      RETURN
    7 ier = -1
      RETURN
      END
K.6.1
140
7/7/2015
Subroutine myp_barrier
(mygroup, nodes,ier)
Blocks until all processes reach this routine
Include ‘mypf.h’
– Input variables
nodes – integer – number of nodes in the group
 mygroup – character*(*) – name of the group

– Output variables

K.7
ier – integer – returned error value
141
7/7/2015
      SUBROUTINE myp_barrier(mygroup,nodes,ier)
      INTEGER nodes, ier
      CHARACTER*(*) mygroup
      INCLUDE 'mpif.h'
      INCLUDE 'mypf.h'
      CALL mpi_barrier(mpi_comm_world,ier)
      IF (ier .lt. 0) GOTO 5
      ier = 0
      RETURN
    5 ier = -1
      RETURN
      END
K.7.1
142
7/7/2015
Subroutine myp_close(ier)
Terminates MPI execution environment
Include ‘mypf.h’
– Output variables

K.8
ier – integer – returned error value
143
7/7/2015
      SUBROUTINE myp_close(ier)
      INTEGER ier
      INCLUDE 'mpif.h'
      INCLUDE 'mypf.h'
      CALL mpi_barrier(mpi_comm_world,ier)
      IF (ier .lt. 0) GOTO 9
      CALL mpi_finalize(ier)
      IF (ier .lt. 0) GOTO 9
      ier = 0
      RETURN
    9 ier = -1
      RETURN
      END
K.8.1
144
7/7/2015
Subroutine myp_writer (to,buf,
count, tag,tag1,lasync,ier)
Writes buffer buf into ‘to’ node
Include ‘mypf.h’
– Input variables
to – integer – receiving node rank
 buf(*) – real – send buffer (choice)
 count – integer – number of elements in send buffer

K.9
145
7/7/2015
Subroutine myp_writer (to,buf,
count, tag,tag1,lasync,ier)
tag – integer – message tag
 tag1 – integer – message tag
 lasync – logical

– Output variables

K.10
ier – integer – returned error value
146
7/7/2015
      SUBROUTINE myp_writer(to,buf,count,tag,tag1,lasync,ier)
      INTEGER to, count, tag, tag1, ier
      REAL    buf(*)
      LOGICAL lasync
      INCLUDE 'mpif.h'
      INCLUDE 'mypf.h'
      CALL mpi_send(buf,count,mpi_real4,to,tag,mpi_comm_world,ier)
      IF (ier .lt. 0) GOTO 6
      ier = 0
      RETURN
    6 ier = -1
      RETURN
      END
K.10.1
147
7/7/2015
Subroutine myp_readr (from, buf,
count, tag,tag1,lasync,ier)
Reads into buffer buf from node 'from'
Include ‘mypf.h’
– Input variables
from – integer – sending node rank
 count – integer – maximum number of elements in
receive buffer

K.11
148
7/7/2015
Subroutine myp_readr (from, buf,
count, tag,tag1,lasync,ier)
tag – integer – message tag
 tag1- integer – message tag
 lasync – logical

– Output variables
buf(*) – real – receiving buffer
 ier - integer- returned error value

K.12
149
7/7/2015
      SUBROUTINE myp_readr(from,buf,count,tag,tag1,lasync,ier)
      INTEGER from, count, tag, tag1, ier
      REAL    buf(*)
      LOGICAL lasync
      INCLUDE 'mpif.h'
      INCLUDE 'mypf.h'
      INTEGER status(mpi_status_size)
      CALL mpi_recv(buf,count,mpi_real4,from,tag,mpi_comm_world,status,ier)
      IF (ier .lt. 0) GOTO 3
      ier = 0
      RETURN
    3 ier = -1
      RETURN
      END
K.12.1
150
7/7/2015
Subroutine myp_bcast (from,buf,
count,tag,tag1,lasync,node,ier)
Broadcasts data in buf to all nodes in group
Include ‘mypf.h’
– Input variables
from – integer – rank of broadcast root
 buf(*) – real – input buffer
 count – integer – number of entries in buffer

K.13
151
7/7/2015
Subroutine myp_bcast (from,buf,
count,tag,tag1,lasync,node,ier)
tag – integer – message tag
 tag1 – integer – message tag
 node – integer – node ID
 lasync – logical

– Output variables
buf – real – output buffer
 ier – integer – returned error value

K.14
152
7/7/2015
      SUBROUTINE myp_bcast(from,buf,count,tag,tag1,lasync,iamnode,ier)
      INTEGER from, count, tag, tag1, iamnode, ier
      REAL    buf(*)
      LOGICAL lasync
      INCLUDE 'mpif.h'
      INCLUDE 'mypf.h'
      CALL mpi_barrier(mpi_comm_world,ier)
      IF (ier .lt. 0) GOTO 3
      CALL mpi_bcast(buf,count,mpi_real4,from,mpi_comm_world,ier)
      IF (ier .lt. 0) GOTO 3
      CALL mpi_barrier(mpi_comm_world,ier)
      IF (ier .lt. 0) GOTO 3
      ier = 0
      RETURN
    3 ier = -1
      RETURN
      END
K.14.1
153
7/7/2015
Node To Node Transfer
[Diagram: node-to-node transfer - node i sends and node j receives, with nodes 0-3 shown.]
K.15
154
7/7/2015
Subroutine myp_g2sum(x,temp,
count,iamnode,ier)
Sums an array from all nodes into an array
on node zero. Requires 2^m nodes
Include ‘mypf.h’
– Input variables
count – integer – number of entries to transfer
 iamnode – integer – starting node ID

K.16
155
7/7/2015
Subroutine myp_g2sum(x,temp,
count,iamnode,ier)
– Output variables
x(*) – real – result buffer
 temp(*) – real – intermediate buffer
 ier - integer- returned error value

K.17
156
7/7/2015
      SUBROUTINE myp_g2sum(x,temp,count,iamnode,ier)
      REAL    x(*), temp(*)
      INTEGER count, iamnode, ier
      INCLUDE 'mpif.h'
      INCLUDE 'mypf.h'
      INTEGER tag, iinodes, node
      INTEGER ifromnode, itonode
      iinodes = nodes
      DO ndim = 0, 14
        iinodes = iinodes/2
        IF (iinodes .eq. 0) EXIT
      ENDDO
      tag  = 32768
      node = iamnode
      DO j = 1, ndim
        IF (mod(node,2) .eq. 0) THEN
          ifromnode = iamnode + 2**(j-1)
          CALL myp_readr(ifromnode, temp, count, tag, tag, .false., ier)
          IF (ier .lt. 0) GOTO 4
          DO ii = 1, count
            x(ii) = x(ii) + temp(ii)
          ENDDO
        ELSE
          itonode = iamnode - 2**(j-1)
          CALL myp_writer(itonode, x, count, tag, tag, .false., ier)
          IF (ier .lt. 0) GOTO 4
          ier = 0
          RETURN
        ENDIF
        node = node/2
      ENDDO
      ier = 0
      RETURN
    4 ier = -1
      RETURN
      END
K.17.1
157
7/7/2015
Power Of 2 Global Sum
[Diagram: power-of-2 global sum over nodes 0-7 - odd-numbered nodes add into nodes 0, 2, 4, 6; then nodes 2 and 6 add into nodes 0 and 4; finally node 4 adds into node 0.]
K.18
158
7/7/2015
Subroutine myp_g1sum
(x,temp,count,iamnode,ier)
Sums an array from all nodes to node zero
Include ‘mypf.h’
– Input variables
x(*) – real – data buffer
 count – integer – number of entries to transfer
 iamnode – integer- node ID

– Output variables
temp(*) – real – intermediate buffer
 ier – integer – returned error value

K.19
159
7/7/2015
      SUBROUTINE myp_g1sum(x,temp,count,iamnode,ier)
      REAL    x(*), temp(*)
      INTEGER count, iamnode, ier
      INCLUDE 'mpif.h'
      INCLUDE 'mypf.h'
      INTEGER tag, msend, mrecv
      ifromnode = iamnode - 1
      itonode   = iamnode + 1
      IF (itonode .gt. nodes-1) itonode = 0
      IF (ifromnode .lt. 0)     ifromnode = nodes - 1
      msend = 32768 + iamnode
      mrecv = 32768 + ifromnode
      IF (iamnode .eq. 0) THEN
        CALL myp_readr(ifromnode, temp, count, mrecv, mrecv, .false., ier)
        IF (ier .lt. 0) GOTO 5
        DO ii = 1, count
          x(ii) = x(ii) + temp(ii)
        ENDDO
      ELSE
        IF (iamnode .gt. 1) THEN
          CALL myp_readr(ifromnode, temp, count, mrecv, mrecv, .false., ier)
          IF (ier .lt. 0) GOTO 5
          DO ii = 1, count
            temp(ii) = x(ii) + temp(ii)
          ENDDO
          CALL myp_writer(itonode, temp, count, msend, msend, .false., ier)
          IF (ier .lt. 0) GOTO 5
        ELSE
          CALL myp_writer(itonode, x, count, msend, msend, .false., ier)
          IF (ier .lt. 0) GOTO 5
        ENDIF
      ENDIF
      CALL myp_barrier(mygroup, nodes, ier)
      ier = 0
      RETURN
    5 ier = -1
      RETURN
      END
K.19.1
160
7/7/2015
Global Sum
[Diagram: linear global sum - each node adds its contribution to the running partial sum and passes it to the next node until the total reaches node 0.]
K.20
161
7/7/2015
Subroutine myp_token(x,temp,
count,iamnode,incnode,ier)
Passes a buffer of data, x , between nodes
Include ‘mypf.h’
– Input variables
x(*) – real – data buffer
 count – integer
 iamnode – integer – node identification number
 incnode – integer – node increment for token ring

– Output variables
temp(*) – real – intermediate buffer
 ier – integer – returned error value

K.21
162
7/7/2015
      SUBROUTINE myp_token(x,temp,count,iamnode,incnode,ier)
      INTEGER count, iamnode
      INTEGER incnode, ier
      REAL    x(*), temp(*)
      INCLUDE 'mpif.h'
      INCLUDE 'mypf.h'
      INTEGER msend, mrecv
      INTEGER itonode, ifromnode
      itonode   = mod(iamnode + incnode, nodes)
      ifromnode = mod(iamnode - incnode, nodes)
      IF (ifromnode .lt. 0) ifromnode = ifromnode + nodes
      msend = 32768 + 1000*incnode + iamnode
      mrecv = 32768 + 1000*incnode + ifromnode
      IF (iamnode .eq. 0 .or. itonode .eq. ifromnode) THEN
        CALL myp_writer(itonode, x, count, msend, msend, .false., ier)
        IF (ier .lt. 0) GOTO 9
        CALL myp_readr(ifromnode, x, count, mrecv, mrecv, .false., ier)
        IF (ier .lt. 0) GOTO 9
      ELSE
        DO j = 1, count
          temp(j) = x(j)
        ENDDO
        CALL myp_readr(ifromnode, x, count, mrecv, mrecv, .false., ier)
        IF (ier .lt. 0) GOTO 9
        CALL myp_writer(itonode, temp, count, msend, msend, .false., ier)
        IF (ier .lt. 0) GOTO 9
      ENDIF
      ier = 0
      RETURN
    9 ier = -1
      RETURN
      END
K.21.1
163
7/7/2015
General Token Ring Transfer
[Diagram: general token ring transfer among nodes 0-7, shown for increments of 1 (each node passes to its neighbor) and 2 (each node skips one node).]
K.22
164
7/7/2015
Subroutine myp_transp(x, temp,
count,iamnode,ier)
Transposes data, x, between the nodes
Include ‘mypf.h’
– Input variables
x(*) – real – data buffer
 count – integer – number of entries to transfer
 iamnode – integer – node ID

– Output variables
temp(*) – real – intermediate buffer
 ier - integer- returned error value

K.23
165
7/7/2015
      SUBROUTINE myp_transp(x,temp,count,iamnode,ier)
      INTEGER count, iamnode, ier
      REAL    x(*), temp(*)
      INCLUDE 'mpif.h'
      INCLUDE 'mypf.h'
      INTEGER msend, mrecv
      msend = 32700
      mrecv = 32700
      DO jrecvnode = 0, nodes-2
        jxsend = jrecvnode*count + 1
        DO jsendnode = jrecvnode+1, nodes-1
          jxrecv = jsendnode*count + 1
          IF (iamnode .eq. jsendnode) THEN
            CALL myp_writer(jrecvnode, x(jxsend), count, msend, msend, .false., ier)
            IF (ier .lt. 0) GOTO 7
            CALL myp_readr(jrecvnode, x(jxsend), count, mrecv, mrecv, .false., ier)
            IF (ier .lt. 0) GOTO 7
          ENDIF
          IF (iamnode .eq. jrecvnode) THEN
            DO j = 1, count
              temp(j) = x(jxrecv+j-1)
            ENDDO
            CALL myp_readr(jsendnode, x(jxrecv), count, mrecv, mrecv, .false., ier)
            IF (ier .lt. 0) GOTO 7
            CALL myp_writer(jsendnode, temp, count, msend, msend, .false., ier)
            IF (ier .lt. 0) GOTO 7
          ENDIF
        ENDDO
      ENDDO
      ier = 0
      RETURN
    7 ier = -1
      RETURN
      END
K.23.1
166
7/7/2015
Transpose
[Diagram: a matrix distributed by rows across the nodes is redistributed so that, after the transpose, each node holds a block of columns.]
K.24
Multinode transposes
 in memory
 out of memory (via disk)
 with or without fast Fourier transform
167
7/7/2015
Subroutine pvmcopy
(i01,…,i20,n,ier)
Broadcasts up to 20 units around to the nodes
Assume default integers or 4 bytes per unit
Include ‘mypf.h’
– Input variables
i01,…,i20 – integer – units to be broadcast
 n – integer – number of units

– Output variables

K.25
ier – integer- returned error value
168
7/7/2015
Subroutine pvm_rdxfer (issd,
trace, temp,nt,krec,llasync,
llssd,iamnode,jnode,ier)
Node 'iamnode' reads data from disk and
sends it to node 'jnode'
Include ‘mypf.h’, ‘mypio.h’
– Input variables
issd – integer – logical unit number to read from
 nt – integer – number of samples
 krec – integer – record number
 llasync – logical

K.26
169
7/7/2015
Subroutine pvm_rdxfer (issd,
trace, temp,nt,k_rec,llasync,
llssd,iamnode,jnode,ier)
llssd – integer
 iamnode – integer – reading node ID
 jnode – integer – receiving node ID

– Output variables
trace(*) – real – buffer to read into
 temp(*) – real – transfer buffer
 ier – integer – returned error value

K.27
170
7/7/2015
Subroutine pvm_wrxfer(issd,
trace,temp,nt,krec,llasync,
llssd,iamnode, jnode,ier)
Writes data from jnode to disk
Include ‘mypf.h’, ‘mypio.h’
– Input variables
issd – integer – logical unit number to write to
 nt – integer – number of samples to be written
 krec – integer – record number
 llasync – logical

K.28
171
7/7/2015
Subroutine pvm_wrxfer(issd,
trace,temp,nt,krec,llasync,
llssd,iamnode, jnode,ier)
llssd – integer
 iamnode – integer – node ID
 jnode – integer – sending node ID

– Output variables
trace(*) – real – receiving buffer
 temp(*) - real – transfer buffer
 ier – integer – returned error value

K.29
172
7/7/2015
Subroutine myp_message
(mesg,ier)
Broadcasts message to all nodes
Include ‘mypf.h’
– Input variables

mesg – integer – message to be distributed to all
nodes
– Output variables

K.30
ier – integer – returned error value
173
7/7/2015
Algorithms

There are several ways to parallelize
seismic processing
– Task distribution

Independent processing tasks are distributed
among the processors
– Domain decomposition

Each processor works on a unique part of the
survey or image space
– Data decomposition

L.1
Each processor works on a unique part of the data
174
7/7/2015
Task distribution
 Develop a pipeline of independent tasks
which are handled by multiple processors
 One or more tasks can be done by each
processor depending on the time required, e.g.
CDP gather -> NMO -> Stack -> Migration
L.2
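A schematic sketch (not from the slides) of such a pipeline using the MPI subroutines documented earlier: each rank applies one stage to a block of traces and forwards it to the next rank. The stage routines are placeholders (commented out), and the block count, trace buffer size, and assumption of exactly four ranks are illustrative.

c sketch only: a 4-stage pipeline, one processing stage per rank
      PROGRAM pipeline
      INCLUDE 'mpif.h'
      INTEGER ier, rank, tag, nblk, k
      INTEGER status(mpi_status_size)
      REAL traces(60*2000)
      tag  = 1
      nblk = 100
      CALL mpi_init(ier)
      CALL mpi_comm_rank(mpi_comm_world, rank, ier)
      DO k = 1, nblk
c every stage but the first receives a block from the stage before it
        IF (rank .gt. 0) CALL mpi_recv(traces, 60*2000, mpi_real,
     &      rank-1, tag, mpi_comm_world, status, ier)
c placeholder stage routines: gather, NMO, stack, migrate (not shown here)
c       IF (rank .eq. 0) CALL cdp_gather(traces, k)
c       IF (rank .eq. 1) CALL nmo(traces)   ... etc.
c every stage but the last forwards its result to the next stage
        IF (rank .lt. 3) CALL mpi_send(traces, 60*2000, mpi_real,
     &      rank+1, tag, mpi_comm_world, ier)
      ENDDO
      CALL mpi_finalize(ier)
      END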
175
7/7/2015
Task distribution (cont.)

Advantage
– Existing serial codes are easily adapted with
minimal rewriting

Disadvantages
– Difficult to balance the load between
processors so efficiency is reduced
– All data must move from processor to
processor
L.3
176
7/7/2015
Domain decomposition
Each part of the image space is developed
by a processor
 Divide the image space into sub volumes
that can be processed independently, e.g

– Finite difference RTM where the
computational grid in x,y,z is divided between
the processors
– Kirchhoff pre stack depth migration
L.4
177
7/7/2015
[Diagram: the nx x ny x nz image volume is split in depth into slabs of nz1, nz2, ..., nzk levels, assigned to nodes 0, 1, ..., k.]
L.5
178
7/7/2015
Domain decomposition (cont.)

Advantage
– Image volume is generally less than data volume
so large problems can be solved in memory

 Disadvantage
– Depending on the algorithm, exchange of data at
the borders of the sub volumes between the
processors may be necessary (e.g. in RTM at
each time step; see the sketch after this list)
– All input data (in general) must be processed by
every processor
L.6
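A schematic sketch (not from the slides) of that border exchange for a depth-slab decomposition like the one pictured above: every time step each node swaps one ghost plane with its z-neighbors. The routine name, array layout, and tags are illustrative; mpi_sendrecv with mpi_proc_null at the ends keeps the exchange deadlock-free.

c sketch only: exchange one ghost plane with the z-neighbors each time step
      SUBROUTINE halo_swap(p, nx, ny, nzloc, rank, nprocs, ier)
      INCLUDE 'mpif.h'
      INTEGER nx, ny, nzloc, rank, nprocs, ier
      INTEGER up, down, status(mpi_status_size)
      REAL p(nx*ny, 0:nzloc+1)
c neighbors in z; mpi_proc_null turns the exchanges at the ends into no-ops
      up   = rank + 1
      down = rank - 1
      IF (rank .eq. nprocs-1) up   = mpi_proc_null
      IF (rank .eq. 0)        down = mpi_proc_null
c pass the top interior plane up while receiving the bottom ghost plane
      CALL mpi_sendrecv(p(1,nzloc), nx*ny, mpi_real, up,   1,
     &                  p(1,0),     nx*ny, mpi_real, down, 1,
     &                  mpi_comm_world, status, ier)
c pass the bottom interior plane down while receiving the top ghost plane
      CALL mpi_sendrecv(p(1,1),       nx*ny, mpi_real, down, 2,
     &                  p(1,nzloc+1), nx*ny, mpi_real, up,   2,
     &                  mpi_comm_world, status, ier)
      RETURN
      END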
179
7/7/2015
Data decomposition
The data are distributed among the processors
 Divide the data into sub sets that can be
processed independently, e.g.

– In phase shift migration, constant frequency
planes
– In plane wave migration, constant plane wave
volumes
– In Kirchhoff migration, constant offset sections
L.7
180
7/7/2015
Data decomposition (cont.)
[Diagram: a seismic data volume in x and y divided into independent sub-volumes.]
 Decomposed seismic data volumes or sections can be
processed by each node independently with
minimal communication overhead
 Seismic data can be imaged using
• Shot records
• CDPs
• Constant offsets
• Constant plane waves
• Frequency slices
L.8
181
7/7/2015
Plane wave volumes
[Diagram: the image in x and y decomposed into plane wave volumes, one (Px, Py) pair per node - node 0 gets (Px1, Py1), node 1 gets (Px2, Py2), etc.]
Processing applied per plane wave volume:
-p NMO
-p Time migration
-p Split-step Fourier depth migration
-p Kirchhoff depth migration
L.9
182
7/7/2015
Offset volumes
[Diagram: the image in x and y decomposed into constant offset volumes, one (Ox, Oy) pair per node - node 0 gets (Ox1, Oy1), node 1 gets (Ox2, Oy2), etc.]
Processing applied per offset volume:
-p NMO
-p Time migration
-p Kirchhoff depth migration
L.10
183
7/7/2015
Data decomposition (cont.)

Advantage
– if the data decomposition results in
independent sub-volumes, no exchanges of
data between the processors are required
during the algorithm

Disadvantage
– the independent data volumes may still be
large and may not fit in a single processor’s
memory requiring an out of memory solution
L.11
184
7/7/2015
Which approach?
Which approach to use in designing the
parallel program is based on:
 The algorithm - some are easier to parallelize
than others
 Data structure and size – number of: sources
and receivers; frequencies; plane waves etc.
 Domain size - image size in x, y, and z
L.12
185
7/7/2015
Implementation
Nearly all seismic imaging algorithms can
be implemented in a parallel mode
 Some require new coding while others can
simply use serial code run in each
processor

M.1
186
7/7/2015
Existing serial code
Shot record migrations (RTM, split-step, or
implicit finite differences) can be run on
each processor independently
 The resulting images can then be summed
at the end. No communication between
processors is required. No data exchanges
are required.

M.2
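A minimal sketch (not from the slides) of this idea: each rank migrates every nprocs-th shot with an unchanged serial routine, and the single reduction of the partial images onto rank 0 at the end is the only communication. The routine name migrate_shot, the image size, and the shot count are placeholders.

c sketch only: distribute shots round-robin, sum the partial images at the end
      PROGRAM shotpar
      INCLUDE 'mpif.h'
      INTEGER ier, rank, nprocs, is, nshot
      PARAMETER (nimg=1000*1000)
      REAL image(nimg), final(nimg)
      nshot = 1000
      CALL mpi_init(ier)
      CALL mpi_comm_size(mpi_comm_world, nprocs, ier)
      CALL mpi_comm_rank(mpi_comm_world, rank, ier)
      DO j = 1, nimg
        image(j) = 0.0
      ENDDO
c each rank migrates its own subset of shots with the existing serial code
      DO is = rank+1, nshot, nprocs
c       CALL migrate_shot(is, image)        (placeholder serial routine)
      ENDDO
c the only communication: one global sum of the partial images onto rank 0
      CALL mpi_reduce(image, final, nimg, mpi_real, mpi_sum, 0,
     &                mpi_comm_world, ier)
      CALL mpi_finalize(ier)
      END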
187
7/7/2015
Modifications to existing
serial code
Any migration method that uses the data
organized as independent subsets such as
frequency planes, plane waves etc.
 Do loops over the independent data
subsets are distributed across the
processors

M.3
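A minimal sketch (not from the slides) of the usual modification: the serial loop over independent frequency planes (or plane waves, or offsets) becomes a strided loop over this rank's share. The routine name migrate_plane and the plane count are placeholders.

c sketch only: the serial loop "DO iw = 1, nw" becomes a strided loop per rank
      PROGRAM planepar
      INCLUDE 'mpif.h'
      INTEGER ier, rank, nprocs, iw, nw
      nw = 2000
      CALL mpi_init(ier)
      CALL mpi_comm_size(mpi_comm_world, nprocs, ier)
      CALL mpi_comm_rank(mpi_comm_world, rank, ier)
c rank 0 does planes 1, 1+nprocs, 1+2*nprocs, ...; rank 1 starts at plane 2; etc.
      DO iw = rank+1, nw, nprocs
c       CALL migrate_plane(iw)              (placeholder serial routine)
      ENDDO
      CALL mpi_finalize(ier)
      END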
188
7/7/2015
Test code
Initialize myp (myp_open)
 Test myp_writer & myp_readr
 Test myp_bcast
 Test myp_g2sum
 Test myp_g1sum
 Test myp_token
 Test myp_transpose
 Finalize and close

L.1
189
7/7/2015
[Flowchart: initialize myp, then for each routine in turn (myp_writer & myp_readr, myp_bcast, myp_g2sum, myp_g1sum, myp_token, myp_transp) prepare data, run the routine, and check the result; finally finalize & close.]
L.2
190
7/7/2015
      PROGRAM myp_test
      INTEGER iamnode, ier
      PARAMETER (mxnodes=32, nx=mxnodes*1000)
      REAL x(nx), temp(nx)
      CHARACTER*80 logfil
      INCLUDE 'mypf.h'
c start up myp, get my node id & my tid
      CALL myp_open(iamnode, ier)
      IF (ier.lt.0) THEN
        WRITE(*,*) 'myp_open error'
        STOP
      ENDIF
c log file
      log = 32
      logfil = myp_path(1:ncmyp_path)//'myp_log.'//nodechar
      OPEN(log, FILE=logfil, STATUS='unknown', FORM='formatted')
      WRITE(log,*) 'iamnode & nodes=', iamnode, nodes
L.2.1
191
7/7/2015
c do a sequence of xfers & returns, define data & send data
ifromnode=0
IF(iamnode.eq.ifromnode) THEN
DO jnode=1,nodes-1
DO j=1,nx
x(j)=jnode+1
ENDDO
jrec =jnode
jrecid=jnode
CALL myp_writer(jnode,x,nx,jrec,jrecid,.false.,ier)
IF(ier.lt.0) GOTO 77
ENDDO
ELSE
CALL myp_readr(ifromnode,x,nx,iamnode,jdum,.false.,ier)
IF(ier.lt.0) GOTO 77
c check & post results
DO j=1,nx
IF(abs(x(j)-float(iamnode+1)).gt.1.e-5) THEN
WRITE(log,*)'myp_readr/myp_writer error from node ',ifromnode
GOTO 20
ENDIF
ENDDO
WRITE(log,*)'myp_readr/writer: ok,iamnode&ifrom=', iamnode,ifromnode
20
CONTINUE
ENDIF
L.2.2
192
7/7/2015
c now let ifromnode read them back & check
IF (iamnode.ne.ifromnode) THEN
jrec =iamnode
jrecid=iamnode
CALL myp_writer(ifromnode,x,nx,jrec,jrecid,.false.,ier)
IF(ier.lt.0) GOTO 77
ELSE
DO jnode=1,nodes-1
CALL myp_readr(jnode,x,nx,jnode,jdum,.false.,ier)
DO j=1,nx
IF(abs(x(j)-float(jnode+1)).gt.1.e-6) THEN
WRITE(log,*)'myp_readr/myp_writer error from node ',jnode
GOTO 25
ENDIF
ENDDO
WRITE(log,*) 'myp_readr/writer: ok,iamnode&jnode=',iamnode,jnode
25
CONTINUE
ENDDO
ENDIF
L.2.3
193
7/7/2015
c do a broadcast xfer & returns, define data & send data
ifromnode=0
IF (iamnode.eq.ifromnode) FORALL(j=1:nx) x(j)=1000
jrec =1000
jrecid=1000
CALL myp_bcast(ifromnode,x,nx,jrec,jrecid,.false.,iamnode,ier)
IF (ier.lt.0) goto 77
c get the data back & check
IF (iamnode.ne.ifromnode) THEN
jrec =iamnode ; jrecid=iamnode
CALL myp_writer(ifromnode,x,nx,jrec,jrecid,.false.,ier)
IF(ier.lt.0) goto 77
ELSE
DO jnode=1,nodes-1
FORALL( j=1:nx) x(j)=0.
CALL myp_readr(jnode,x,nx,jnode,jdum,.false.,ier)
DO j=1,nx
IF(ABS(x(j)-1000.).gt.1.e-5) THEN
WRITE(log,*)'myp_brdcast: error from node ',jnode
GOTO 30
ENDIF
ENDDO
WRITE(log,*)'myp_brdcast: ok, iamnode= ',iamnode
30
CONTINUE
ENDDO
ENDIF
L.2.4
194
7/7/2015
c do a 2**m global sum, define data
DO jnode=0,nodes-1
IF(iamnode.eq.jnode) FORALL(j=1:nx) x(j)=jnode+1
ENDDO
c get sum -> node 0
CALL myp_g2sum(x,temp,nx,iamnode,ier)
IF(ier.lt.0) GOTO 77
c check & post results
xtest=0.
FORALL(j=1:nodes) xtest=xtest+j
IF (iamnode.eq.0) THEN
DO j=1,nx
IF(abs(x(j)-xtest).gt.1.e-5) THEN
WRITE(log,*)'myp_2gsum error: iamnode,xtest, x=',iamnode,xtest,x(1)
GOTO 10
ENDIF
ENDDO
WRITE(log,*)'myp_g2sum: ok, iamnode & sum= ',iamnode,xtest
10
CONTINUE
ENDIF
L.2.5
195
7/7/2015
c do a linear global sum, define data
DO jnode=0,nodes-1
IF(iamnode.eq.jnode) FORALL(j=1:nx) x(j)=jnode+1
ENDDO
c get sum -> node 0
CALL myp_g1sum(x,temp,nx,iamnode,ier)
IF(ier.lt.0) GOTO 77
c check & post results
xtest=0.
FORALL(j=1:nodes) xtest=xtest+j
IF(iamnode.eq.0) THEN
DO j=1,nx
IF(abs(x(j)-xtest).gt.1.e-5) THEN
WRITE(log,*)'myp_1gsum error: iamnode,xtest,x=',iamnode,xtest,x(1)
GOTO 15
ENDIF
ENDDO
WRITE(log,*)'myp_g1sum: ok, iamnode & sum= ',iamnode,xtest
15
CONTINUE
ENDIF
L.2.6
196
7/7/2015
c token ring pass of xcbuf from 0->1, 1->2,... nodes-1->0
c define the data as the node id
DO incnode=1,nodes-1,nodes
FORALL(j=1:nx) x(j)=iamnode+10
c pass the data around the nodes displaced by incnode
CALL myp_token(x,temp,nx,iamnode,incnode,ier)
IF(ier.ne.0) GOTO 77
c check the result
ifrom=iamnode-incnode
IF(ifrom.lt.0) ifrom=ifrom+nodes
xtest=ifrom+10
DO j=1,nx
IF(abs(x(j)-xtest).gt.1.e-5) THEN
WRITE(log,*) 'myp_token error: iamnode,xtest,x,incnode=',
&
iamnode,xtest,x(1),incnode
GOTO 35
ENDIF
ENDDO
WRITE(log,*)'myp_token: ok, iamnode & inc= ', iamnode,incnode
35
CONTINUE
ENDDO
L.2.7
197
7/7/2015
c multinode transpose test data; for test purposes define the data such that
c iamnode*nbuf+jx-1, jx=1,nbuf + jbuf*10000, jbuf=1,nodes i.e. as follows:
c input: node 0: 10000->10099, 20000->20099, 30000->30099 ...
c
node 1: 10100->10199, 20100->20199, 30100->30199 ...
c output: node 0: 10000->10099, 10100->10199, 10200,10299 ...
c
node 1: 20000->20099, 20100->20199, 20200,20299 ... etc
c define the partial buffer as nbuf=100; it must be nbuf*nodes <= nx
nbuf=100
DO jbuf=1,nodes
jsa=(jbuf-1)*nbuf
FORALL(jx=1:nbuf) x(jsa+jx)=jbuf*10000+iamnode*nbuf+jx-1
ENDDO
CALL myp_transp(x,temp,nbuf,iamnode,ier)
IF(ier.ne.0) GOTO 77
c check transpose results
DO j=1,nodes*nbuf
test=(iamnode+1)*10000+j-1
IF(abs(x(j)-test).gt.1.e-5) THEN
WRITE(log,*)'myp_transp: error test ,x(j)=',test,x(j),iamnode
GOTO 40
ENDIF
ENDDO
WRITE(log,*) 'myp_transp: ok, iamnode= ',iamnode
40 CONTINUE
L.2.8
198
7/7/2015
c program finished, exit myp_
      CALL myp_close(iermyp)
      STOP
   77 WRITE(log,*)'a myp_ error occurred, ier, iermyp=', ier, iermyp
      CALL myp_close(iermyp)
      STOP
      END
L.2.9
199
7/7/2015
Testing results
L.3
200
7/7/2015
Seismic algorithms
Xtnmo
 Tpnmo
 Xttime
 Tptime
 Xtkirch
 Tpkirch
 Tp3d
 More …

M.1
201
7/7/2015
Xtnmo flowchart
M.2
202
7/7/2015
Xtnmo code
M.3
203
7/7/2015
204
7/7/2015
205
7/7/2015
206
7/7/2015
207
7/7/2015
x-t data
 Input seismic data: shot gathers in the x–t domain
– number of shots                        1000
– number of offsets/shot                   60
– number of samples/trace (time/depth)   2000
 Migration results: CIG gathers
– number of CIG's                        1000
– number of traces/CIG                     60
– number of samples                      1000
208
7/7/2015
-p data
 Input seismic data: shot gathers in the -p domain
– number of shots                        1000
– number of rays/shot                      61
– number of samples/trace                2000
 Migration results: CIG gathers
– number of CIG's                        1000
– number of rays/CIG                       61
– number of samples/trace (time/depth)   1000
209
7/7/2015
Timing diagrams
-p time vs. number of CPUs
 -p time speedup
 x-t time vs. number of CPUs
 x-t time speedup
 -p plane wave split step vs. number of
CPUs
 -p plane wave split step speedup

210
7/7/2015
-p time vs. number of CPUs
[Chart: -p time migration run time (sec) vs. number of nodes (1-8) for the SGI, the cluster w/network, and the new cluster.]
211
7/7/2015
-p time speedup
[Chart: -p time migration speedup (ratio to 1 CPU) vs. number of nodes (1-8) for the SGI and the cluster w/network.]
212
7/7/2015
x-t time vs. number of CPUs
[Chart: x-t time migration run time (sec) vs. number of nodes (1-8) for the SGI and the cluster w/network.]
213
7/7/2015
x-t time speedup
[Chart: x-t time migration speedup (ratio to 1 CPU) vs. number of nodes (1-8) for the SGI and the cluster w/network.]
214
7/7/2015
-p plane wave split step vs.
number of CPUs
[Chart: -p plane wave split-step migration run time (sec) vs. number of nodes (1-8) for the SGI, the cluster w/network, and the new cluster.]
215
7/7/2015
-p plane wave split step
speedup
[Chart: -p plane wave split-step speedup (ratio to 1 CPU) vs. number of nodes (1-8) for the SGI and the cluster w/network.]
216
7/7/2015
Conclusions
 The Beowulf Alpha cluster is a cost-effective
solution for parallel seismic imaging
– performance is 'comparable' to the SGI Origin
2000 for the imaging algorithms tested
– cost is $71,915 vs. $465,098 for the SGI Origin
2000
– a price/performance improvement of 6.5
217
7/7/2015
Conclusions (Cont.)
 The network interconnect from the cluster to the
large UNIX network is not a performance
bottleneck
 Better cluster 'management' tools are needed to
facilitate use
 A true 64-bit cluster operating system is needed
218
7/7/2015
Summary
State what has been learned
 Define ways to apply training
 Request feedback of training session

219
7/7/2015
Where to Get More Information
Other training sessions
 List books, articles, electronic sources
 Consulting services, other sources

220
7/7/2015