Analysis of Algorithms - Kent State University


Algorithmic Frameworks
Chapter 14 (starting with 14.2 - Parallel Algorithms)
Parallel Algorithms
Models for parallel machines and parallel processing
Approaches to the design of parallel algorithms
Speedup and efficiency of parallel algorithms
A class of problems NC
Some parallel algorithms (not all in the textbook)
2
What is a Parallel Algorithm?
Imagine you needed to find a lost child in the
woods.
Even in a small area, searching by yourself
would be very time consuming
Now if you gathered some friends and family
to help you, you could cover the woods
much faster.
3
Parallel Architectures
Single Instruction Stream, Multiple Data
Stream (SIMD)
 One global control unit connected to each
processor
Multiple Instruction Stream, Multiple Data
Stream (MIMD)
 Each processor has a local control unit
4
Architecture (continued)
Shared-Address-Space
 Each processor has access to main memory
 Processors may be given a small private
memory for local variables
Message-Passing
 Each processor is given its own block of
memory
 Processors communicate by passing messages
directly instead of modifying memory
locations
5
Interconnection Networks
Static
 Each processor is hard-wired to every other processor
 Examples: Completely Connected, Star-Connected, Bounded-Degree (Degree 4)
Dynamic
 Processors are connected to a series of switches
6
There are many different kinds of parallel
machines – this is only one type
A parallel computer must be capable of working on one
task even though many individual computers are tied
together.
Lucidor is a distributed memory computer (a cluster)
from HP. It consists of 74 HP servers and 16 HP
workstations. Peak: 7.2 GFLOPs
GFLOP = one billion decimal number operations per second
source: http://www.pdc.kth.se/compresc/machines/lucidor.html
7
ASCI White (10 teraops/sec)
• Third in ASCI series
• IBM delivered in 2000
8
Some Other Parallel Machines
Blue Gene
Earth
Simulator
9
Lemieux
Lighting
10
Connection Machines - SIMDs
CM-2
CM-5
11
Example: Finding the Largest Key in an Array
In order to find the largest key in an array of
size n, at least n-1 comparisons must be
done.
A parallel version of this algorithm will still perform the same number of comparisons, but by doing them in parallel it will finish sooner.
12
Example: Finding the Largest Key in an Array
Assume that n is a power of 2 and we have
n/2 processors executing the algorithm in
parallel.
Each processor reads two array elements into local variables called first and second.
It then writes the larger value into the first of the two array slots that it read.
Takes lg n steps for the largest key to be
placed in S[1]
13
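To make the lg n rounds concrete, here is a minimal sequential C sketch that simulates the n/2 processors on the slide's example; the array contents and the helper name simulate_parallel_max are illustrative, not from the slides.

#include <stdio.h>

/* Simulate the tournament: in each round, a "processor" compares
   S[i] with S[i + stride] and writes the larger value into S[i]. */
static int simulate_parallel_max(int S[], int n) {
    for (int stride = 1; stride < n; stride *= 2) {   /* lg n rounds */
        for (int i = 0; i + stride < n; i += 2 * stride) {
            if (S[i + stride] > S[i])                  /* one processor's step */
                S[i] = S[i + stride];
        }
    }
    return S[0];                                       /* largest key ends up in S[0] */
}

int main(void) {
    int S[] = {3, 12, 7, 6, 8, 11, 19, 13};            /* the slide's example array */
    printf("largest = %d\n", simulate_parallel_max(S, 8));
    return 0;
}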
Example: Finding the Largest Key in an Array
[Figure: first parallel step on S = (3, 12, 7, 6, 8, 11, 19, 13). P1-P4 each read one pair into first and second and write the larger value back, giving S = (12, 12, 7, 6, 11, 11, 19, 13).]
14
Example: Finding the Largest Key in an Array
[Figure: second step. P1 compares 12 with 7 and P3 compares 11 with 19; after the write, S = (12, 12, 7, 6, 19, 11, 19, 13).]
15
Example: Finding the Largest Key in an Array
[Figure: third step. P1 compares 12 with 19 and writes 19 into S[1]; the largest key now sits in S[1], giving S = (19, 12, 7, 6, 19, 11, 19, 13).]
16
Approaches to
The Design of Parallel Algorithms
Modify an existing sequential algorithm
 Exploiting those parts of the algorithm that are
naturally parallelizable.
Design a completely new parallel algorithm that
 May have no natural sequential analog.
Brute-force methods for parallel processing:
 Use an existing sequential algorithm, but with each processor using different initial conditions
 Use a compiler to optimize the sequential algorithm
 Use an advanced CPU to optimize the code
17
Parallelism
Some of the next slides are due to Rebecca
Hartman-Baker at Oak Ridge National Labs.
She permits their use for non-commercial,
educational use only.
18
Parallelization Concepts
When performing task, some subtasks depend
on one another, while others do not
Example: Preparing dinner
 Salad prep independent of lasagna baking
 Lasagna must be assembled before baking
Likewise, in solving problems, some tasks are
independent of one another
19
Serial vs. Parallel
Serial: tasks must be performed in sequence
Parallel: tasks can be performed
independently in any order
20
Serial vs. Parallel: Example
Example: Preparing
dinner
 Serial tasks: making
sauce, assembling
lasagna, baking
lasagna; washing
lettuce, cutting
vegetables,
assembling salad
 Parallel tasks: making
lasagna, making
salad, setting table
21
Serial vs. Parallel: Example
Could have several
chefs, each performing
one parallel task
This is concept behind
parallel computing
22
SPMD vs. MPMD
SPMD: Write single program that will perform same
operation on multiple sets of data
 Multiple chefs baking many lasagnas
 Rendering different frames of movie
MPMD: Write different programs to perform different
operations on multiple sets of data
 Multiple chefs preparing four-course dinner
 Rendering different parts of movie frame
Can also write hybrid program in which some
processes perform same task
23
Multilevel Parallelism
May be subtasks within a task that can be
performed in parallel
 Salad and lasagna making are parallel
tasks; chopping of vegetables for salad are
parallelizable subtasks
We can use subgroups of processors (chefs)
to perform parallelizable subtasks
24
Discussion: Jigsaw Puzzle*
Suppose we want to do
5000 piece jigsaw puzzle
Time for one person to
complete puzzle: n hours
How can we decrease wall
time to completion?
* Thanks to Henry Neeman for
example
25
Discuss: Jigsaw Puzzle
Add another person at
the table
 Effect on wall time
 Communication
 Resource contention
Add p people at the
table
 Effect on wall time
 Communication
 Resource contention
26
Discussion: Jigsaw Puzzle
What about: p people,
p tables, 5000/p pieces
each?
What about: one person
works on river, one
works on sky, one works
on mountain, etc.?
27
Parallel Algorithm Design: PCAM
Partition: Decompose problem into fine-
grained tasks to maximize potential
parallelism
Communication: Determine communication
pattern among tasks
Agglomeration: Combine into coarser-grained
tasks, if necessary, to reduce communication
requirements or other costs
Mapping: Assign tasks to processors, subject
to tradeoff between communication cost and
concurrency
28
Example: Computing Pi
We want to compute π
One method: method of darts*
The ratio of the area of the inscribed circle to the area of the square is proportional to π
29
Method of Darts
Imagine a dartboard with a circle of radius R inscribed in a square
Area of circle = πR²
Area of square = (2R)² = 4R²
Area of circle / Area of square = πR² / 4R² = π/4
30
Method of Darts
So, the ratio of the areas is proportional to π
How to find areas?
 Suppose we threw darts (completely
randomly) at dartboard
 Could count number of darts landing in circle and total
number of darts landing in square
 Ratio of these numbers gives approximation to ratio of
areas
 Quality of approximation increases with number of
darts
 = 4  # darts inside circle
# darts thrown
31
Method of Darts
How in the world do we simulate this
experiment on computer?
 Decide on length R
Generate pairs of random numbers (x, y) such that -R ≤ x, y ≤ R
 If (x, y) is within the circle (i.e. if x² + y² ≤ R²), add one to the tally for inside circle
 Lastly, find ratio
32
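A minimal sequential C sketch of this dart experiment, assuming R = 1 and a fixed random seed for repeatability; the dart count is an arbitrary illustrative choice.

#include <stdio.h>
#include <stdlib.h>

/* Monte Carlo estimate of pi by the "method of darts": throw darts
   uniformly into the square [-R, R] x [-R, R] and count how many
   land inside the inscribed circle. */
int main(void) {
    const double R = 1.0;
    const long throws = 10000000L;
    long inside = 0;

    srand(12345);                                /* fixed seed for repeatability */
    for (long i = 0; i < throws; i++) {
        double x = ((double)rand() / RAND_MAX) * 2.0 * R - R;
        double y = ((double)rand() / RAND_MAX) * 2.0 * R - R;
        if (x * x + y * y <= R * R)
            inside++;
    }
    printf("pi is approximately %f\n", 4.0 * (double)inside / (double)throws);
    return 0;
}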
500 Fastest Supercomputers
http://www.top500.org/lists/2008/11
Some of these are MIMDs and some are SIMDs (and some we don't know for sure because of proprietary information, but the timings often indicate which is which).
The Top500 list is updated twice yearly (Nov and
June) by researchers at the University of Mannheim,
the University of Tennessee and the U.S. Energy
Department's Lawrence Berkeley National Laboratory.
It ranks computers by how many trillions of
calculations per second, or teraflops, they can perform
using algebraic calculations in a test called Linpack.
33
Parallel Programming Paradigms and
Machines
Two primary programming paradigms:
 SPMD (single program, multiple data)
 MPMD (multiple programs, multiple data)
There Are No PRAM Machines, but parallel
machines are often classified into the two groups
 SIMD (single instruction, multiple data)
 MIMD (multiple instruction, multiple data)
which clearly mimic the programming paradigms.
A network of distributed computers all working on
the same problem, models a MIMD machine.
34
Speedup and Efficiency of Parallel
Algorithms
Let T*(n) be the time complexity of a sequential
algorithm to solve a problem P of input size n
Let Tp(n) be the time complexity of a parallel algorithm that solves P on a parallel computer with p processors
Speedup
 Sp(n) = T*(n) / Tp(n)
 <claimed by some; disputed by others>
 Sp(n) <= p
 Best possible, Sp(n) = p when Tp(n) = T*(n)/p
35
Speedup and Efficiency of Parallel
Algorithms
Let T*(n) be the time complexity of a sequential
algorithm to solve a problem P of input size n
Let Tp(n) be the time complexity of a parallel algorithm that solves P on a parallel computer with p processors
Efficiency
 Ep(n) = T1(n) / (p Tp(n))
 where T1(n) is when the parallel algorithm runs
on 1 processor
 Best possible, Ep(n) = 1
36
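A tiny C sketch that just evaluates the two slide formulas, Sp(n) = T*(n) / Tp(n) and Ep(n) = T1(n) / (p · Tp(n)); all the timing numbers below are hypothetical placeholders.

#include <stdio.h>

/* Speedup and efficiency as defined on the slides. */
int main(void) {
    double t_star = 100.0;  /* best sequential time T*(n), hypothetical */
    double t1     = 120.0;  /* parallel algorithm on 1 processor T1(n), hypothetical */
    double tp     = 8.0;    /* parallel algorithm on p processors Tp(n), hypothetical */
    int    p      = 16;

    printf("speedup    Sp(n) = %.2f\n", t_star / tp);
    printf("efficiency Ep(n) = %.2f\n", t1 / (p * tp));
    return 0;
}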
A Class of Problems NC
The class NC consists of problems that
 Can be solved by parallel algorithm using
Polynomially bounded number of processors p(n)
 p(n) ∈ O(n^k) for problem size n and some constant k
The number of time steps is bounded by a polynomial in the logarithm of the problem size n
 T(n) ∈ O((log n)^m) for some constant m
Theorem:
NC ⊆ P
37
Computing the Complexity of Parallel
Algorithms
Unlike sequential algorithms, we do not have one
kind of machine on which we must do our
algorithm development and our complexity analysis
Moreover, the complexity is predominantly a function of the input size.
The following will need to be considered for
parallel machines besides input size:
 What is the type of machine – SIMD?, MIMD?,
others?
 How many processors?
 How can data be exchanged between processors
– shared memory?, interconnection network?
38
SIMD Machines
SIMD machines are coming in (again) as dual and
quad processors.
In the past, there have been several SIMD machines:
 Connection Machine - 64K processors
 ILLIAC was the name given to a series of supercomputers built at the University of Illinois at Urbana-Champaign. In all, five computers were built in this series between 1951 and 1974. The design of the ILLIAC IV had 256 processors.
 Today: ClearSpeed: 192 high-performance processing elements, each with dual 32 & 64-bit FPUs and 6 Kbytes of high-bandwidth memory
39
PRAMs
A Simple Model for Parallel Processing
Parallel Random Access Machine (PRAM) model
 A number of processors all can access
 A large shared memory
 All processors are synchronized
 However, processors can run different programs.
Each processor has a unique id, pid, and
 may be instructed to do different things depending on its pid (if pid < 200, do this, else do that)
41
The PRAM Model
Parallel Random Access Machine
 Theoretical model for parallel machines
 p processors with uniform access to a large
memory bank
 UMA (uniform memory access) – Equal
memory access time for any processor to
any address
42
PRAM Models
PRAM models vary according
 How they handle write conflicts
 The models differ in how fast they can solve
various problems.
Concurrent Read Exclusive Write (CREW)
Only one processor is allowed to write to one particular memory cell at any one step
Concurrent Read Concurrent Write (CRCW)
 An algorithm that works correctly for CREW
will also work correctly for CRCW
but not vice versa
43
Summary of Memory Protocols
Exclusive-Read Exclusive-Write
Exclusive-Read Concurrent-Write
Concurrent-Read Exclusive-Write
Concurrent-Read Concurrent-Write
If concurrent write is allowed we must decide
which value to accept
44
Assumptions
There is no limit on the number of processors in the
machine.
Any memory location is uniformly accessible from any
processor.
There is no limit on the amount of shared memory in
the system.
Resource contention is absent.
The programs written on these machines can be
written for MIMDs or SIMDs.
45
More Details on the PRAM Model
Memory size is infinite, and the number of processors is unbounded
No direct communication between
processors
 they communicate via the memory
Every processor accesses any memory
location in 1 cycle
Typically all processors execute the same
algorithm in a synchronous fashion
although each processor can run a
different program.
 READ phase
 COMPUTE phase
 WRITE phase
Some subset of the processors can stay
idle (e.g., even numbered processors
may not work, while odd processors do,
and conversely)
[Figure: processors P1, P2, P3, …, PN all connected to a shared memory.]
46
PRAM CW?
What ends up being stored when multiple writes occur?
 priority CW: processors are assigned priorities and the top priority
processor is the one that counts for each group write
 Fail common CW: if values are not equal, no change
 Collision common CW: if values not equal, write a “failure value”
 Fail-safe common CW: if values not equal, then algorithm aborts
 Random CW: non-deterministic choice of the value written
 Combining CW: write the sum, average, max, min, etc. of the values
 etc.
The above means that when you write an algorithm for a CW PRAM you can use any of the above conventions, even different ones at different points in time
It doesn't correspond to any hardware in existence and is just a logical/algorithmic notion that could be implemented in software
In fact, most algorithms end up not needing CW
47
PRAM Example 1
Problem:
 We have a linked list of length n
 For each element i, compute its distance to the end of
the list:
d[i] = 0                 if next[i] = NIL
d[i] = d[next[i]] + 1    otherwise
Sequential algorithm in O(n)
We can define a PRAM algorithm in O(log n)
 Associate one processor to each element of the list
 At each iteration split the list in two with odd-placed and
even-placed elements in different lists
 List size is divided by 2 at each step, hence O(log n)
48
PRAM Example 1
Principle:
Look at the next element
Add its d[i] value to yours
Point to the next element’s next element
[Figure: pointer jumping on a list of six elements. The d values evolve from (1, 1, 1, 1, 1, 0) to (2, 2, 2, 2, 1, 0) to (4, 4, 3, 2, 1, 0) to (5, 4, 3, 2, 1, 0).]
The length of each remaining list is divided by 2 at each step, hence the O(log n) complexity
49
PRAM Example 1
Algorithm
forall i
    if next[i] == NIL then d[i] ← 0 else d[i] ← 1
while there is an i such that next[i] ≠ NIL
    forall i
        if next[i] ≠ NIL then
            d[i] ← d[i] + d[next[i]]
            next[i] ← next[next[i]]
What about the correctness of this algorithm?
50
forall Loop
At each step, the updates must be synchronized so that pointers point to the right things:
    next[i] ← next[next[i]]
This is ensured by the semantics of forall:
forall i
    A[i] = B[i]
behaves as
forall i
    tmp[i] = B[i]
forall i
    A[i] = tmp[i]
Nobody really writes it out, but one mustn’t forget
that it’s really what happens underneath
51
while Condition
while there is an i such that next[i] ≠NULL
How can one do such a global test on a PRAM?
 Cannot be done in constant time unless the PRAM
is CRCW
 At the end of each step, each processor could
write to a same memory location TRUE or
FALSE depending on next[i] being equal to
NULL or not, and one can then take the AND of
all values (to resolve concurrent writes)
 On a PRAM CREW, one needs O(log n) steps for
doing a global test like the above
In this case, one can just rewrite the while loop into
a for loop, because we have analyzed the way in
which the iterations go:
for step = 1 to log n
52
What Type of PRAM?
The previous algorithm does not require a CW machine,
but:
tmp[i] ← d[i] + d[next[i]]
which requires concurrent reads on proc i and j such that
j = next[i].
Solution:
 Split it into two instructions:
    tmp2[i] ← d[i]
    tmp[i] ← tmp2[i] + d[next[i]]
(note that the above are technically in two different forall loops)
Now we have an execution that works on an EREW PRAM, which is the most restrictive type
53
Final Algorithm on an EREW PRAM
forall i                                        O(1)
    if next[i] == NIL then d[i] ← 0 else d[i] ← 1
for step = 1 to log n                           O(log n) iterations
    forall i
        if next[i] ≠ NIL then
            tmp[i] ← d[i]
            d[i] ← tmp[i] + d[next[i]]          O(1)
            next[i] ← next[next[i]]             O(1)
Total: O(log n)
Conclusion: One can compute the length of a list of size n in time O(log n) on any PRAM
54
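A sequential C sketch that simulates this EREW pointer-jumping algorithm on a six-element list; every forall is mimicked by computing all new values from the old arrays before writing them back (the tmp trick above). The list contents are an illustrative example.

#include <stdio.h>

#define NIL   (-1)
#define NODES 6

int main(void) {
    /* A linked list 0 -> 1 -> 2 -> 3 -> 4 -> 5 (node 5 is the tail). */
    int next[NODES] = {1, 2, 3, 4, 5, NIL};
    int d[NODES], new_d[NODES], new_next[NODES];

    for (int i = 0; i < NODES; i++)
        d[i] = (next[i] == NIL) ? 0 : 1;

    /* ceil(log2(6)) = 3 pointer-jumping steps suffice for six nodes. */
    for (int step = 0; step < 3; step++) {
        for (int i = 0; i < NODES; i++) {        /* simulated forall */
            if (next[i] != NIL) {
                new_d[i]    = d[i] + d[next[i]];
                new_next[i] = next[next[i]];
            } else {
                new_d[i]    = d[i];
                new_next[i] = NIL;
            }
        }
        for (int i = 0; i < NODES; i++) {        /* synchronous write-back */
            d[i]    = new_d[i];
            next[i] = new_next[i];
        }
    }

    for (int i = 0; i < NODES; i++)
        printf("d[%d] = %d\n", i, d[i]);         /* prints 5 4 3 2 1 0 */
    return 0;
}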
Are All PRAMs Equivalent?
Consider the following problem
Given an array of n distinct elements e1, …, en, find whether some element e is in the array
On a CREW PRAM, there is an algorithm that works
in time O(1) on n processors:
 Initialize a boolean to FALSE
Each processor i reads ei and e and compares them
 If equal, then write TRUE into the boolean (only
one processor will write, so we’re ok for CREW)
55
Are All PRAMs Equivalent?
On an EREW PRAM, one cannot do better than log n
 Each processor must read e separately
 At worst a complexity of O(n), with sequential
reads
 At best a complexity of O(log n), with series of
“doubling” of the value at each step so that
eventually everybody has a copy (just like a
broadcast in a binary tree, or in fact a k-ary tree
for some constant k)
 Generally, “diffusion of information” to n
processors on an EREW PRAM takes O(log n)
Conclusion: CREW PRAMs are more powerful than
EREW PRAMs
56
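A small C sketch of the O(log n) "doubling" diffusion described above, assuming eight simulated processors and an array copies[] holding each processor's private copy of e; after round r, 2^r processors hold e and each one serves exactly one new reader in the next round, so no cell is read twice in the same round.

#include <stdio.h>

#define N 8

int main(void) {
    int copies[N];
    int e = 42;                                  /* the value to broadcast */

    copies[0] = e;                               /* only processor 0 starts with e */

    for (int owned = 1; owned < N; owned *= 2) { /* log N rounds */
        for (int i = 0; i < owned && owned + i < N; i++)
            copies[owned + i] = copies[i];       /* exclusive read of copies[i] this round */
    }

    for (int i = 0; i < N; i++)
        printf("processor %d holds %d\n", i, copies[i]);
    return 0;
}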
This is a Typical Question for Various
Parallel Models
Is model A more powerful than model B?
Basically, you are asking if one can simulate
the other.
Whether or not the model maps to a “real”
machine, is another question of interest.
Often a research group tries to build a
machine that has the characteristics of a
given model.
57
Simulation Theorem
Simulation theorem: Any algorithm running on a CRCW PRAM with p processors cannot be more than O(log p) times faster than the best algorithm on an EREW PRAM with p processors for the same problem
Proof: We will “simulate” concurrent writes
 When Pi writes value xi to address li, one replaces the
write by an (exclusive) write of (li ,xi) to A[i], where A[i] is
some auxiliary array with one slot per processor
 Then one sorts array A by the first component of its
content
Processor i of the EREW PRAM looks at A[i] and A[i-1]
 If their first components are different, or if i = 0, it writes value xi to address li
 Since A is sorted according to the first component, writing
is exclusive
58
Proof (continued)
Picking one processor for each competing write
[Figure: six processors attempt concurrent writes. P0, P2, and P3 all want to write (29,43), P1 and P5 want to write (8,12), and P4 wants to write (92,26). After sorting A by the first component, A = (8,12), (8,12), (29,43), (29,43), (29,43), (92,26), and only P0, P2, and P5 (the first holder of each distinct address) actually perform a write.]
59
Proof (continued)
Note that we said that we just sort array A
If we have an algorithm that sorts p elements
with O(p) processors in O(log p) time, we’re
set
Turns out, there is such an algorithm: Cole’s
Algorithm.
 Basically a merge-sort in which lists are
merged in constant time!
 It’s beautiful, but we don’t really have time
for it, and it’s rather complicated
Therefore, the proof is complete.
60
Brent Theorem
Theorem: Let A be an algorithm with m operations that
runs in time t on some PRAM (with some number of
processors). It is possible to simulate A in time O(t + m/p)
on a PRAM of same type with p processors
Example: maximum of n elements on an EREW PRAM
 Clearly can be done in O(log n) with O(n) processors
 Compute series of pair-wise maxima
 The first step requires O(n/2) processors
 What happens if we have fewer processors?
 By the theorem, with p processors, one can simulate the
same algorithm in time O(log n + n / p)
 If p = n / log n, then we can simulate the same algorithm
in O(log n + log n) = O(log n) time, which has the same
complexity!
 This theorem is useful to obtain lower-bounds on number
of required processors that can still achieve a given
complexity.
61
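A sequential C sketch of the Brent-style maximum discussed above: each of P simulated processors first scans a block of about n/p elements, then the P partial maxima are combined by a pairwise tree reduction. The array contents and the rounding of p ≈ n / log n to 8 are illustrative choices.

#include <stdio.h>

#define N 32
#define P 8     /* roughly n / log2(n) processors for n = 32 */

int main(void) {
    int a[N], partial[P];

    for (int i = 0; i < N; i++)
        a[i] = (7 * i + 3) % 101;                 /* arbitrary test data */

    /* Phase 1: block-sequential maxima, one block of N/P elements per "processor". */
    for (int pr = 0; pr < P; pr++) {
        int best = a[pr * (N / P)];
        for (int j = 1; j < N / P; j++)
            if (a[pr * (N / P) + j] > best)
                best = a[pr * (N / P) + j];
        partial[pr] = best;
    }

    /* Phase 2: pairwise tree reduction over the P partial results, O(log P) rounds. */
    for (int stride = 1; stride < P; stride *= 2)
        for (int pr = 0; pr + stride < P; pr += 2 * stride)
            if (partial[pr + stride] > partial[pr])
                partial[pr] = partial[pr + stride];

    printf("maximum = %d\n", partial[0]);
    return 0;
}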
Example: Merge Sort
[Figure: parallel merge sort of an eight-element array. In the first round P1-P4 each merge one pair of elements, in the second round P1 and P2 each merge two sorted pairs into runs of four, and in the final round P1 merges the two runs into the fully sorted array 1 2 3 4 5 6 7 8.]
62
Merge Sort Analysis
Number of comparisons
 1 + 3 + … + (2^i − 1) + … + (n − 1)
 Σ_{i=1..lg n} (2^i − 1) = 2n − 2 − lg n = Θ(n)
We have improved from Θ(n lg n) to Θ(n) simply by applying the old algorithm to parallel computing; by altering the algorithm we can further improve merge sort to Θ((lg n)²)
63
O(1) Sorting Algorithm
We assume a CRCW PRAM where concurrent write is
handled with addition
for (int i = 1; i <= n; i++) {
    for (int j = 1; j <= n; j++) {
        if (X[i] > X[j])
            Processor Pij stores 1 in memory location mi
        else
            Processor Pij stores 0 in memory location mi
    }
}
64
O(1) Sorting Algorithm with n2 Processors
Unsorted list {4, 2, 1, 3}
Note: each Pij is a different processor
 P11(4,4) P12(4,2) P13(4,1) P14(4,3): 0+1+1+1 = 3, so 4 is in position 3 when sorted
 P21(2,4) P22(2,2) P23(2,1) P24(2,3): 0+0+1+0 = 1, so 2 is in position 1 when sorted
 P31(1,4) P32(1,2) P33(1,1) P34(1,3): 0+0+0+0 = 0, so 1 is in position 0 when sorted
 P41(3,4) P42(3,2) P43(3,1) P44(3,3): 0+1+1+0 = 2, so 3 is in position 2 when sorted
65
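A sequential C sketch of this enumeration sort on the slide's four-element list; the addition-combining concurrent writes of the CRCW PRAM are simulated by summing the comparison results into a rank array (names rank and sorted are illustrative). It assumes distinct keys, as in the example.

#include <stdio.h>

#define N 4

int main(void) {
    int X[N] = {4, 2, 1, 3};
    int rank[N] = {0};
    int sorted[N];

    /* "Processor" Pij contributes 1 to rank[i] when X[i] > X[j];
       on the CRCW PRAM all n*n comparisons happen in one step. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (X[i] > X[j])
                rank[i]++;            /* simulated combining (addition) write */

    for (int i = 0; i < N; i++)
        sorted[rank[i]] = X[i];       /* X[i] belongs at position rank[i] */

    for (int i = 0; i < N; i++)
        printf("%d ", sorted[i]);     /* prints 1 2 3 4 */
    printf("\n");
    return 0;
}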
MIMDs
Embarrassingly Parallel Problem
In parallel computing, an embarrassingly parallel
problem is one for which little or no effort is
required to separate the problem into a number of
parallel tasks.
This is often the case where there exists no
dependency (or communication) between those
parallel tasks.
Embarrassingly parallel problems are ideally suited
to distributed computing and are also easy to
perform on server farms which do not have any of
the special infrastructure used in a true
supercomputer cluster. They are thus well suited to
large, internet based distributed platforms.
67
A common example of an embarrassingly parallel
problem lies within graphics processing units (GPUs)
for tasks such as 3D projection, where each pixel on
the screen may be rendered independently.
Some other examples of embarrassingly parallel
problems include:
 The Mandelbrot set and other fractal calculations,
where each point can be calculated independently.
 Rendering of computer graphics. In ray tracing,
each pixel may be rendered independently.
 In computer animation, each frame may be
rendered independently.
68
Brute force searches in cryptography.
BLAST searches in bioinformatics.
Large scale face recognition that involves
comparing thousands of input faces with similarly
large number of faces.
Computer simulations comparing many
independent scenarios, such as climate models.
Genetic algorithms and other evolutionary
computation metaheuristics.
Ensemble calculations of numerical weather
prediction.
Event simulation and reconstruction in particle
physics.
69
A Few Other Current Applications that Utilize
Parallel Architectures - Some MIMDs and Some
SIMDs
Computer graphics processing
Video encoding
Accurate weather forecasting
Often in a dynamic programming algorithm a
given row or diagonal can be computed
simultaneously
This makes many dynamic programming algorithms amenable to parallel architectures
70
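As one illustration of computing a dynamic programming row in parallel, here is a hedged OpenMP sketch of the 0/1 knapsack recurrence, where every entry of row i depends only on row i-1 and so the whole row can be filled simultaneously. The item weights, values, and capacity are made-up example data; compile with an OpenMP-capable compiler (e.g. cc -fopenmp).

#include <stdio.h>
#include <omp.h>

#define ITEMS 4
#define CAP   10

int main(void) {
    int weight[ITEMS] = {2, 3, 4, 5};
    int value[ITEMS]  = {3, 4, 5, 8};
    int dp[ITEMS + 1][CAP + 1] = {{0}};

    for (int i = 1; i <= ITEMS; i++) {
        #pragma omp parallel for          /* entries of row i are independent */
        for (int w = 0; w <= CAP; w++) {
            int best = dp[i - 1][w];      /* skip item i */
            if (w >= weight[i - 1] &&
                dp[i - 1][w - weight[i - 1]] + value[i - 1] > best)
                best = dp[i - 1][w - weight[i - 1]] + value[i - 1];
            dp[i][w] = best;
        }
    }
    printf("best value = %d\n", dp[ITEMS][CAP]);
    return 0;
}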
Atmospheric Model
Typically, these points are located on a rectangular
latitude-longitude grid of size 15-30 , in the range 50-500.
A vector of values is maintained at each grid point,
representing quantities such as pressure,
temperature, wind velocity, and humidity.
The three-dimensional grid
used to represent the state
of the atmosphere. Values
maintained at each grid
point represent quantities
such as pressure and
temperature.
71
Parallelization Concepts
When performing task, some subtasks depend
on one another, while others do not
Example: Preparing dinner
 Salad prep independent of lasagna baking
 Lasagna must be assembled before baking
Likewise, in solving problems, some tasks are
independent of one another
72
Parallel Algorithm Design: PCAM
Partition: Decompose problem into fine-
grained tasks to maximize potential
parallelism
Communication: Determine communication
pattern among tasks
Agglomeration: Combine into coarser-grained
tasks, if necessary, to reduce communication
requirements or other costs
Mapping: Assign tasks to processors, subject
to tradeoff between communication cost and
concurrency
73
Parallelization Strategies for MIMD Models
What tasks are independent of each other?
What tasks must be performed sequentially?
Use PCAM parallel algorithm design strategy due to
Ian Foster.
PCAM = Partition, Communicate, Agglomerate, and
Map.
Note: Some of the next slides on this topic are
adapted and modified from Randy H. Katz’s course.
74
Program Design Methodology for
MIMDs
75
Ian Foster’s Scheme
Partitioning.
 The computation that is to be performed and the data operated on by this computation are decomposed into small tasks.
 Practical issues such as the number of processors in the target computer are ignored, and attention is focused on recognizing opportunities for parallel execution.
Communication.
 The communication required to coordinate task execution is determined, and appropriate communication structures and algorithms are defined.
Agglomeration.
 The task and communication structures defined in the first two stages of a design are evaluated with respect to performance requirements and implementation costs.
 If necessary, tasks are combined into larger tasks to improve performance or to reduce development costs.
76
Mapping.
 Each task is assigned to a processor in a manner that attempts to satisfy the competing goals of maximizing processor utilization and minimizing communication costs.
 Mapping can be specified statically or determined at runtime by load-balancing algorithms.
Looking at these steps for the dart throwing problem …
77
Partition or Decompose
“Decompose problem into fine-grained
tasks to maximize potential parallelism”
Finest grained task: throw of one dart
Each throw independent of all others
If we had huge computer, could assign one
throw to each processor
78
Communication
“Determine communication pattern among
tasks”
Each processor throws dart(s) then sends
results back to manager process
79
Agglomeration
“Combine into coarser-grained tasks, if
necessary, to reduce communication
requirements or other costs”
To get a good value of π, we must use millions of darts
We don’t have millions of processors available
Furthermore, communication between
manager and millions of worker processors
would be very expensive
Solution: divide up number of dart throws
evenly between processors, so each processor
does a share of work
80
Mapping
“Assign tasks to processors, subject to tradeoff
between communication cost and
concurrency”
Assign role of “manager” to processor 0
Processor 0 will receive tallies from all the other processors, and will compute the final value of π
Every processor, including manager, will
perform equal share of dart throws
81
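A hedged MPI sketch of this mapping: every rank, including the "manager" rank 0, throws an equal share of darts, and rank 0 collects the tallies with MPI_Reduce and prints the estimate of π. The dart count per rank and the seeding scheme are illustrative choices, not part of the slides.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv) {
    const long darts_per_rank = 1000000L;
    long local_hits = 0, total_hits = 0;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    srand(1234 + rank);                       /* a different random stream per rank */
    for (long i = 0; i < darts_per_rank; i++) {
        double x = (double)rand() / RAND_MAX * 2.0 - 1.0;
        double y = (double)rand() / RAND_MAX * 2.0 - 1.0;
        if (x * x + y * y <= 1.0)
            local_hits++;
    }

    /* Manager (rank 0) receives the summed tallies from all ranks. */
    MPI_Reduce(&local_hits, &total_hits, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("pi is approximately %f\n",
               4.0 * (double)total_hits / (double)(darts_per_rank * size));

    MPI_Finalize();
    return 0;
}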
Program Partitioning
The partitioning stage of a design is intended to expose
opportunities for parallel execution.
 Thus, the focus is on defining a large number of small tasks
in order to yield what is termed a fine-grained
decomposition of a problem.
 Just as fine sand is more easily poured than a pile of bricks,
a fine-grained decomposition provides the greatest flexibility
in terms of potential parallel algorithms.
 In later design stages, evaluation of communication
requirements, the target architecture, or software
engineering issues may lead us to forego opportunities for
parallel execution identified at this stage.
 We then revisit the original partition and agglomerate tasks
to increase their size, or granularity.
 However, in this first stage we wish to avoid prejudging
alternative partitioning strategies.
82
Good Partitioning
A good partition divides into small pieces both the computation
associated with a problem and the data on which this
computation operates.
When designing a partition, programmers most commonly first
focus on the data associated with a problem, then determine
an appropriate partition for the data, and finally work out how
to associate computation with data.
This partitioning technique is termed domain decomposition.
The alternative approach---first decomposing the computation
to be performed and then dealing with the data---is termed
functional decomposition.
These are complementary techniques which may be applied to
different components of a single problem or even applied to the
same problem to obtain alternative parallel algorithms.
83
In this first stage of a design, we seek to
avoid replicating computation and data.
That is, we seek to define tasks that partition
both computation and data into disjoint sets.
Like granularity, this is an aspect of the
design that we may revisit later.
It can be worthwhile replicating either
computation or data if doing so allows us to
reduce communication requirements.
84
Domain Decomposition
Domain decompositions for a problem involving a
three-dimensional grid.
One-, two-, and three-dimensional
decompositions are possible; in each case, data
associated with a single task are shaded.
A three-dimensional decomposition offers the
greatest flexibility and is adopted in the early
stages of a design.
85
Functional decomposition in a computer model of climate.
Each model component can be thought of as a separate
task, to be parallelized by domain decomposition.
Arrows represent exchanges of data between components
during computation:
 The atmosphere model generates wind velocity data that
are used by the ocean model
 The ocean model generates sea surface temperature data
that are used by the atmosphere model, and so on.
86
Communication
The tasks generated by a partition are intended to execute
concurrently but cannot, in general, execute independently.
The computation to be performed in one task will typically
require data associated with another task.
Data must then be transferred between tasks so as to allow
computation to proceed.
This information flow is specified in the communication phase
of a design.
We use a variety of examples to show how communication
requirements are identified and how channel structures and
communication operations are introduced to satisfy these
requirements.
For clarity in exposition, we categorize communication
patterns along four loosely orthogonal axes: local/global,
structured/unstructured, static/dynamic, and
synchronous/asynchronous.
87
Types of Communication –Rather Obvious Definitions
In local communication, each task communicates with a small set of other tasks (its "neighbors"); in contrast, global
communication requires each task to communicate with
many tasks.
In structured communication, a task and its neighbors form
a regular structure, such as a tree or grid; in contrast,
unstructured communication networks may be arbitrary
graphs.
In static communication, the identity of communication
partners does not change over time; in contrast, the identity
of communication partners in dynamic communication
structures may be determined by data computed at runtime
and may be highly variable.
In synchronous communication, producers and consumers execute in a coordinated fashion, with producer/consumer pairs cooperating in data transfer operations; in contrast, asynchronous communication may require that a consumer obtain data without the cooperation of the producer.
88
Local Communication
• Task and channel structure for
a two-dimensional finite
difference computation with
five-point update stencil.
• In this simple fine-grained
formulation, each task
encapsulates a single element
of a two-dimensional grid and
must both send its value to
four neighbors and receive
values from four neighbors.
• Only the channels used by the
shaded task are shown.
89
Global Communication
A centralized summation algorithm that uses
a central manager task (S) to sum N
numbers distributed among N tasks.
Here, N=8 , and each of the 8 channels is
labeled with the number of the step in which
they are used.
90
Distributing Communication and Computation
We first consider the problem of distributing a computation
and communication associated with a summation.
We can distribute the summation of the N numbers by
making each task i , 0<i<N-1 , compute the sum:
•A summation algorithm that connects N tasks in an array
in order to sum N numbers distributed among these tasks.
•Each channel is labeled with the number of the step in
which it is used and the value that is communicated on it.
91
Divide and Conquer
Tree structure for divide-and-conquer summation
algorithm with N=8 .
The N numbers located in the tasks at the bottom of
the diagram are communicated to the tasks in the
row immediately above; these each perform an
addition and then forward the result to the next level.
The complete sum is available at the root of the tree
after log N steps.
92
Unstructured and Dynamic Communication
Example of a problem
requiring unstructured
communication.
In this finite element mesh
generated for an assembly
part, each vertex is a grid
point.
An edge connecting two
vertices represents a data
dependency that will require
communication if the vertices
are located in different tasks.
Note that different vertices
have varying numbers of
neighbors. (Image courtesy of
M. S. Shephard.)
93
Asynchronous Communication
Using separate ``data tasks'' to service read and write
requests on a distributed data structure.
In this figure, four computation tasks (C) generate read and
write requests to eight data items distributed among four data
tasks (D).
Solid lines represent requests; dashed lines represent replies.
One compute task and one data task could be placed on each
of four processors so as to distribute computation and data
equitably.
94
Agglomeration
In the third stage, agglomeration, we move from the
abstract toward the concrete.
We revisit decisions made in the partitioning and
communication phases with a view to obtaining an
algorithm that will execute efficiently on some class
of parallel computer.
In particular, we consider whether it is useful to
combine, or agglomerate, tasks identified by the
partitioning phase, so as to provide a smaller number
of tasks, each of greater size.
We also determine whether it is worthwhile to
replicate data and/or computation.
95
Examples of Agglomeration
In (a), the size of tasks is
increased by reducing the
dimension of the decomposition
from three to two.
In (b), adjacent tasks are combined to yield a three-dimensional decomposition of higher granularity.
In (c), subtrees in a divide-and-conquer structure are coalesced.
In (d), nodes in a tree algorithm
are combined.
96
Increasing Granularity
In the partitioning phase of the design process,
efforts are focused on defining as many tasks as possible.
This forces us to consider a wide range of opportunities for
parallel execution.
Note that defining a large number of fine-grained tasks does
not necessarily produce an efficient parallel algorithm.
One critical issue influencing parallel performance is
communication costs.
On most parallel computers, computing must stop in order to
send and receive messages.
Thus, we can improve performance by reducing the amount of
time spent communicating.
97
Increasing Granularity
Clearly, performance improvement can be achieved
by sending less data.
Perhaps less obviously, it can also be achieved by
using fewer messages, even if we send the same
amount of data.
This is because each communication incurs not only a
cost proportional to the amount of data transferred
but also a fixed startup cost.
98
The figure shows fine- and
coarse-grained two-dimensional
partitions of this problem.
In each case, a single task is
exploded to show its outgoing
messages (dark shading) and incoming
messages (light shading).
In (a), a computation on an
8x8 grid is partitioned into 64 tasks,
each responsible for a single point,
while in (b) the same computation is
partitioned into 2x2=4 tasks, each
responsible for 16 points.
In (a), 64x4=256
communications are required, 4 per
task; these transfer a total of 256 data
values.
In (b), only 4x4=16 communications are required, and only 16x4=64 data values are transferred.
99
Mapping
In the fourth and final stage of the parallel algorithm design
process, we specify where each task is to execute.
This mapping problem does not arise on uniprocessors or
on shared-memory computers that provide automatic task
scheduling.
In these computers, a set of tasks and associated
communication requirements is a sufficient specification for
a parallel algorithm; operating system or hardware
mechanisms can be relied upon to schedule executable
tasks to available processors.
Unfortunately, general-purpose mapping mechanisms have
yet to be developed for scalable parallel computers.
In general, mapping remains a difficult problem that must
be explicitly addressed when designing parallel algorithms.
100
Mapping
The goal in developing mapping algorithms is
normally to minimize total execution time.
We use two strategies to achieve this goal:
 We place tasks that are able to execute
concurrently on different processors, so as to
enhance concurrency.
 We place tasks that communicate frequently on
the same processor, so as to increase locality.
The mapping problem is known to be NP-complete.
However, considerable knowledge has been gained
on specialized strategies and heuristics and the
classes of problem for which they are effective.
Next, we provide a rough classification of problems
and present some representative techniques.
101
Mapping Example
Mapping in a grid problem in which each task
performs the same amount of computation and
communicates only with its four neighbors.
The heavy dashed lines delineate processor
boundaries.
The grid and associated computation is partitioned to
give each processor the same amount of computation
and to minimize off-processor communication.
102
Load-Balancing Algorithms
A wide variety of both general-purpose and application-specific load-balancing techniques have been proposed for use in parallel algorithms based on domain decomposition techniques.
We review several representative approaches here, namely recursive bisection methods, local algorithms, probabilistic methods, and cyclic mappings.
These techniques are all intended to agglomerate fine-grained tasks defined in an initial partition to yield one coarse-grained task per processor.
Alternatively, we can think of them as partitioning our
computational domain to yield one sub-domain for each
processor.
For this reason, they are often referred to as partitioning
algorithms.
103
Graph Partitioning Problem
Load balancing in a grid problem.
Variable numbers of grid points are placed on each
processor so as to compensate for load imbalances.
This sort of load distribution may arise if a local load-balancing scheme is used in which tasks exchange
load information with neighbors and transfer grid
points when load imbalances are detected.
104
Task Scheduling Algorithms
Manager/worker load-balancing structure.
Workers repeatedly request and process problem
descriptions; the manager maintains a pool of
problem descriptions ( p) and responds to requests
from workers.
105
SIMD, Associative, and
Multi-Associative Computing
SIMDs vs MIMDs
SIMDs keep coming into and out of favor over the
years.
They are beginning to rise in interest today due to
graphics cards and multi-processor core machines
that use SIMD architectures.
If anyone has taken an operating system course and learned about concurrency programming, they can program MIMDs.
But MIMDs spend a lot of time on communication between processors, load balancing, etc.
Many people don't understand that SIMDs are really simple to program - but they need a different mindset.
107
SIMDs vs MIMDs
Most SIMD programs are very similar to those for
SISD machines, except for the way the data is
stored.
Many people, frankly, can't see how such simple
programs can do so much.
Example: The air traffic control problem and
history.
Associative computing uses a modified SIMD model to solve problems in a very straightforward manner.
108
Associative Computing References
Note: Below KSU papers are available on the
website: http://www.cs.kent.edu/~parallel/
(Click on the link to “papers”)
1. Maher Atwah, Johnnie Baker, and Selim Akl, An Associative Implementation of Classical Convex Hull Algorithms, Proc. of the IASTED International Conference on Parallel and Distributed Computing and Systems, 1996, 435-438.
2. Johnnie Baker and Mingxian Jin, Simulation of Enhanced Meshes with MASC, a MSIMD Model, Proc. of the Eleventh IASTED International Conference on Parallel and Distributed Computing and Systems, Nov. 1999, 511-516.
109
Associative Computing References
3. Mingxian Jin, Johnnie Baker, and Kenneth Batcher, Timings for Associative Operations on the MASC Model, Proc. of the 15th International Parallel and Distributed Processing Symposium (Workshop on Massively Parallel Processing), San Francisco, April 2001.
4. Jerry Potter, Johnnie Baker, Stephen Scott, Arvind Bansal, Chokchai Leangsuksun, and Chandra Asthagiri, An Associative Computing Paradigm, Special Issue on Associative Processing, IEEE Computer, 27(11):19-25, Nov. 1994. (Note: MASC is called "ASC" in this article.)
5. Jerry Potter, Associative Computing - A Programming Paradigm for Massively Parallel Computers, Plenum Publishing Company, 1992.
110
Associative Computers
Associative Computer: A SIMD computer with a
few additional features supported in hardware.
These additional features can be supported (less
efficiently) in traditional SIMDs in software.
This model, however, gives a real sense of how
problems can be solved with SIMD machines.
The name “associative” is due to its ability to
locate items in the memory of PEs by content
rather than location.
Traditionally, such memories were called
associative memories.
111
Associative Models
The ASC model (for ASsociative Computing) gives a list of
the properties assumed for an associative computer.
The MASC (for Multiple ASC) Model
Supports multiple SIMD (or MSIMD) computation.
Allows model to have more than one Instruction Stream
(IS)
 The IS corresponds to the control unit of a SIMD.
ASC is the MASC model with only one IS.
 The one IS version of the MASC model is sufficiently
important to have its own name.
112
ASC & MASC are KSU Models
Several professors and their graduate students at Kent
State University have worked on models
The STARAN and the ASPRO computers fully support the
ASC model in hardware. The MPP computer supports
ASC, partly in hardware and partly in software.
 Prof. Batcher was chief architect or consultant for
STARAN and MPP.
Dr. Potter developed a language for ASC
Dr. Baker works on algorithms for models and
architectures to support models
Dr. Walker is working with a hardware design to support
the ASC and MASC models.
Dr. Batcher and Dr. Potter are currently not actively
working on ASC/MASC models but still provide advice.
113
Motivation
The STARAN Computer (Goodyear Aerospace, early
1970’s) and later the ASPRO provided an architectural
model for associative computing embodied in the ASC
model.
ASC extends the data parallel programming style to a
complete computational model.
ASC provides a practical model that supports massive
parallelism.
MASC provides a hybrid data-parallel, control parallel
model that supports associative programming.
Descriptions of these models allow them to be compared
to other parallel models
114
The ASC Model
[Figure: the ASC model. An array of cells, each consisting of a PE and its local memory, connected through a network to the IS.]
115
Basic Properties of ASC
Instruction Stream
 The IS has a copy of the program and can broadcast
instructions to cells in unit time
Cell Properties
 Each cell consists of a PE and its local memory
 All cells listen to the IS
 A cell can be active, inactive, or idle
 Inactive cells listen but do not execute IS commands
until reactivated
 Idle cells contain no essential data and are available
for reassignment
 Active cells execute IS commands synchronously
116
Basic Properties of ASC
Responder Processing
 The IS can detect if a data test is satisfied
by any of its responder cells in constant
time (i.e., any-responders property).
 The IS can select an arbitrary responder in
constant time (i.e., pick-one property).
117
Basic Properties of ASC
Constant Time Global Operations (across PEs)
 Logical OR and AND of binary values
 Maximum and minimum of numbers
 Associative searches
Communications
 There are at least two real or virtual networks
 PE communications (or cell) network
 IS broadcast/reduction network (which could
be implemented as two separate networks)
118
Basic Properties of ASC
The PE communications network is normally
supported by an interconnection network
 E.g., a 2D mesh
 The broadcast/reduction network(s) are
normally supported by a broadcast and a
reduction network (sometimes combined).
 See posted paper by Jin, Baker, & Batcher
(listed in associative references)
Control Features
 PEs and the IS and the networks all operate
synchronously, using the same clock

119
Non-SIMD Properties of ASC
Observation: The ASC properties that are unusual for
SIMDs are the constant time operations:
 Constant time responder processing
 Any-responders?
 Pick-one
 Constant time global operations
 Logical OR and AND of binary values
 Maximum and minimum value of numbers
 Associative Searches
These timings are justified by implementations using a
resolver in the paper by Jin, Baker, & Batcher (listed in
associative references and posted).
120
Typical Data Structure for ASC Model
[Figure: tabular data layout across PE1-PE7. Each PE holds one record with fields Make, Color, Year, Model, Price, On lot, and Busy-idle; example records include a red 1994 Dodge, a blue 1996 Ford, a white 1998 Ford, and a red 1997 Subaru. The IS is connected to all of the PEs.]
Make, Color – etc. are fields the programmer establishes
Various data types are supported. Some examples will show
string data, but they are not supported in the ASC simulator.
121
The Associative Search
[Figure: the same car table as on the previous slide.]
IS asks for all cars that are red and on the lot.
PE1 and PE7 respond by setting a mask bit in their PE.
122
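A sequential C sketch that simulates this associative search over a small car table: every "PE" tests its own record against the broadcast query and sets a mask bit. The struct fields mirror the slide's table, while the particular records and cell count are illustrative.

#include <stdio.h>
#include <string.h>

#define CELLS 4

/* One record per simulated PE, mirroring the slide's table. */
struct cell {
    char make[10];
    char color[10];
    int  year;
    int  on_lot;    /* 1 = car is on the lot */
    int  mask;      /* responder bit set by the associative search */
};

int main(void) {
    struct cell pe[CELLS] = {
        {"Dodge",  "red",   1994, 1, 0},
        {"Ford",   "blue",  1996, 1, 0},
        {"Ford",   "white", 1998, 0, 0},
        {"Subaru", "red",   1997, 1, 0},
    };

    /* IS broadcasts the query "red and on the lot"; every cell
       tests its own record "in parallel" and sets its mask bit. */
    for (int i = 0; i < CELLS; i++)
        pe[i].mask = (strcmp(pe[i].color, "red") == 0) && pe[i].on_lot;

    for (int i = 0; i < CELLS; i++)
        if (pe[i].mask)
            printf("responder: PE%d (%s %s %d)\n",
                   i + 1, pe[i].color, pe[i].make, pe[i].year);
    return 0;
}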
MASC Model
[Figure: an array of memory/PE cells connected by a PE interconnection network, plus several Instruction Streams (ISs) connected to the cells and to one another by an IS network.]
Basic Components
 An array of cells, each
consisting of a PE and
its local memory
 A PE interconnection
network between the
cells
 One or more
Instruction Streams
(ISs)
 An IS network
MASC is a MSIMD model
that supports
 both data and control
parallelism
 associative
programming
123
MASC Basic Properties
Each cell can listen to only one IS
Cells can switch ISs in unit time, based on the
results of a data test.
Each IS and the cells listening to it follow rules
of the ASC model.
Control Features:
 The PEs, ISs, and networks all operate
synchronously, using the same clock
 Restricted job control parallelism is used to
coordinate the interaction of the multiple ISs.
124
Characteristics of Associative
Programming
Consistent use of style of programming called data
parallel programming
Consistent use of global associative searching and
responder processing
Usually, frequent use of the constant time global
reduction operations: AND, OR, MAX, MIN
Broadcast of data using IS bus allows the use of the
PE network to be restricted to parallel data
movement.
125
Characteristics of Associative Programming
Tabular representation of data – think 2D arrays
Use of searching instead of sorting
Use of searching instead of pointers
Use of searching instead of the ordering provided by
linked lists, stacks, queues
Promotes a highly intuitive programming style that leads to high productivity
Uses structure codes (i.e., numeric representation) to
represent data structures such as trees, graphs,
embedded lists, and matrices.
Examples of the above are given in
 Ref: Nov. 1994 IEEE Computer article.
 Also, see “Associative Computing” book by Potter.
126
Languages Designed for the ASC
Professor Potter has created several languages for the
ASC model.
ASC is a C-like language designed for ASC model
ACE is a higher level language than ASC that uses
natural language syntax; e.g., plurals, pronouns.
Anglish is an ACE variant that uses an English-like
grammar (e.g., “their”, “its”)
An OOPs version of ASC for the MASC was discussed (by
Potter and his students), but never designed.
Language References:
 ASC Primer – Copy available on parallel lab website
www.cs.kent.edu/~parallel/
 “Associative Computing” book by Potter [11] – some
features in this book were never fully implemented in
ASC Compiler
127
Algorithms and Programs Implemented in ASC
A wide range of algorithms implemented in ASC
without the use of the PE network:
 Graph Algorithms
 minimal spanning tree
 shortest path
 connected components
 Computational Geometry Algorithms
 convex hull algorithms (Jarvis March,
Quickhull, Graham Scan, etc)
 Dynamic hull algorithms
128
ASC Algorithms and Programs
(not requiring PE network)



String Matching Algorithms
 all exact substring matches
 all exact matches with “don’t care” (i.e.,
wild card) characters.
Algorithms for NP-complete problems
 traveling salesperson
 2-D knapsack.
Data Base Management Software
 associative data base
 relational data base
129
ASC Algorithms and Programs
(not requiring a PE network)




A Two Pass Compiler for ASC – This compiler
uses ASC parallelism.
 first pass
 optimization phase
Two Rule-Based Inference Engines for AI
 An Expert System OPS-5 interpreter
 PPL (Parallel Production Language
interpreter)
A Context Sensitive Language Interpreter
 (OPS-5 variables force context sensitivity)
An associative PROLOG interpreter
130
Associative Algorithms & Programs
(using a network)
There are numerous associative programs that use a PE
network;
 2-D Knapsack ASC Algorithm using a 1-D mesh
 Image processing algorithms using 1-D mesh
 FFT (Fast Fourier Transform) using 1-D nearest
neighbor & Flip networks
 Matrix Multiplication using 1-D mesh
 An Air Traffic Control Program (using Flip network
connecting PEs to memory)
 Demonstrated using live data at Knoxville in mid
70’s.
All but first were developed in assembler at Goodyear
Aerospace
131
The MST Problem
The MST problem assumes the weights are positive, the graph is connected, and seeks to find the minimal spanning tree,
 i.e. a subgraph that is a tree, that includes all nodes (i.e. it spans), and
 where the sum of the weights on the arcs of the subgraph is the smallest possible weight (i.e. it is minimal).
Note: The solution may not be unique.
132
An Example
[Figure: the example weighted graph on nodes A through I.]
As we will see, the algorithm is simple.
The ASC program is quite easy to write.
A SISD solution is a bit messy because of the data
structures needed to hold the data for the problem
133
An Example – Step 0
[Figure: the example graph.]
We will maintain three sets of nodes whose membership will change during the run.
The first, V1, will be nodes selected to be in the tree.
The second, V2, will be candidates at the current step to be
added to V1.
The third, V3, will be nodes not considered yet.
134
An Example – Step 0
[Figure: the example graph.]
V1 nodes will be in red with their selected edges being in red also.
V2 nodes will be in light blue with their candidate edges in light
blue also.
V3 nodes and edges will remain white.
135
An Example – Step 1
[Figure: the example graph.]
Select an arbitrary node to place in V1, say A.
Put into V2, all nodes incident with A.
136
An Example – Step 2
[Figure: the example graph.]
Choose the edge with the smallest weight and put
its node, B, into V1. Mark that edge with red also.
Retain the other edge-node combinations in the “to
be considered” list.
137
An Example – Step 3
[Figure: the example graph.]
Add all the nodes incident to B to the “to be considered list”.
However, note that AG has weight 3 and BG has weight 6. So,
there is no sense of including BG in the list.
138
An Example – Step 4
[Figure: the example graph.]
Add the node with the smallest weight that is colored light blue and add it to V1.
Note that the nodes and edges in red are forming a subgraph which is a tree.
139
An Example – Step 5
[Figure: the example graph.]
Update the candidate nodes and edges by including all that are
incident to those that are in V1 and colored red.
140
An Example – Step 6
[Figure: the example graph.]
Select I as its edge is minimal. Mark node and edge as red.
141
An Example – Step 7
[Figure: the example graph.]
Add the new candidate edges.
Note that IF has weight 5 while AF has weight 7. Thus, we drop
AF from consideration at this time.
142
An Example – after several more passes, C is
added & we have …
[Figure: the example graph.]
Note that when CH is added, GH is dropped as CH has less weight.
Candidate edge BC is also dropped since it would form a back edge between two nodes already in the MST.
When there are no more nodes to be considered, i.e. no more in V3, we obtain the final solution.
143
An Example – the final solution
[Figure: the example graph showing the final minimal spanning tree.]
The subgraph is clearly a tree – no cycles and connected.
The tree spans – i.e. all nodes are included.
While not obvious, it can be shown that this algorithm always
produces a minimal spanning tree.
The algorithm is known as Prim’s Algorithm for MST.
144
The ASC Program vs a SISD Solution in, say, C, C++, or Java
First, think about how you would write the program
in C or C++.
The usual solution uses some way of maintaining the
sets as lists using pointers or references.
 See solutions to MST in Algorithms texts by Baase
listed in the posted references.
In ASC, pointers and references are not even
supported as they are not needed and their use is
likely to result in inefficient SIMD algorithms
The implementation of MST in ASC, basically follows
the outline that was provided to the problem, but
first, we need to learn something about the language
ASC.
145
ASC-MST Algorithm Preliminaries
Next, a “data structure” level presentation of Prim’s
algorithm for the MST is given.
The data structure used is illustrated in the next two
slides.
 This example is from the Nov. 1994 IEEE Computer
paper cited in the references.
There are two types of variables for the ASC model,
namely
 the parallel variables (i.e., ones for the PEs)
 the scalar variables (ie., the ones used by the IS).
 Scalar variables are essentially global variables.
 Can replace each with a parallel variable with this
scalar value stored in each entry.
146
ASC-MST Algorithm Preliminaries (cont.)
In order to distinguish between them here, the parallel
variables names end with a “$” symbol.
Each step in this algorithm takes constant time.
One MST edge is selected during each pass through the
loop in this algorithm.
Since a spanning tree has n-1 edges, the running time of this algorithm is O(n) and its cost is O(n²).
 Definition of cost: (running time) × (number of processors)
Since the sequential running time of the Prim MST algorithm is O(n²) and is time optimal, this parallel implementation is cost optimal.
147
Graph Used for Data Structure
[Figure: the weighted graph on nodes a through f from Figure 6 in [Potter, Baker, et al.].]
148
Data Structure for MST Algorithm
[Figure: the ASC data structure for the MST algorithm of Figure 6. Each PE holds one node with parallel weight fields a$, b$, c$, d$, e$, f$ (∞ where no edge exists) plus mask$, candidate$, parent$, and current_best$; the IS holds the scalar variables root and next_node.]
149
Algorithm: ASC-MST-PRIM
Initially assign any node to root.
All processors set
 candidate$ to "wait"
 current_best$ to ∞
 the candidate field for the root node to "no"
All processors whose distance d from their node to the root node is finite do
 Set their candidate$ field to "yes"
 Set their parent$ field to root.
 Set current_best$ = d.
150
Algorithm: ASC-MST-PRIM (cont. 2/3)
While the candidate field of some processor is “yes”,
 Restrict the active processors whose candidate field is
“yes” and (for these processors) do
 Compute the minimum value x of current_best$.
 Restrict the active processors to those with
current_best$ = x and do
 pick an active processor, say node y.
 Set the candidate$ value of node y to “no”
 Set the scalar variable next-node to y.
151
Algorithm: ASC-MST-PRIM (cont. 3/3)
If the value z in the next_node column of a processor is less than its current_best$ value, then
 Set current_best$ to z.
 Set parent$ to next_node
For all processors, if candidate$ is "waiting" and the distance of its node from next_node y is finite, then
 Set candidate$ to "yes"
 Set current_best$ to the distance of its node from y.
 Set parent$ to y
152
Algorithm: ASC-MST-PRIM(root)
Initialize candidates to "waiting"
If there are any finite values in root's field,
    set candidate$ to "yes"
    set parent$ to root
    set current_best$ to the values in root's field
    set root's candidate field to "no"
Loop while some candidate$ contain "yes"
    for them
        restrict mask$ to mindex(current_best$)
        set next_node to a node identified in the preceding step
        set its candidate to "no"
    if the values in their next_node's field are less than current_best$, then
        set current_best$ to value in next_node's field
        set parent$ to next_node
    if candidate$ is "waiting" and the value in its next_node's field is finite
        set candidate$ to "yes"
        set parent$ to next_node
        set current_best$ to the values in next_node's field
153
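A hedged sequential C sketch of the same data-parallel logic, with plain arrays candidate[], parent[], and current_best[] standing in for the parallel variables candidate$, parent$, and current_best$; the constant-time ASC minimum is simulated by a scan, and the five-node weight matrix is an arbitrary example, not the graph of Figure 6.

#include <stdio.h>

#define N   5
#define INF 1000000

enum cand { WAIT, YES, NO };

int main(void) {
    /* Symmetric weight matrix of a small connected example graph. */
    int w[N][N] = {
        {INF, 2,   INF, 6,   INF},
        {2,   INF, 3,   8,   5  },
        {INF, 3,   INF, INF, 7  },
        {6,   8,   INF, INF, 9  },
        {INF, 5,   7,   9,   INF},
    };
    int candidate[N], parent[N], current_best[N];
    int root = 0;

    for (int i = 0; i < N; i++) {            /* initialization "forall" */
        candidate[i]    = (w[root][i] < INF) ? YES : WAIT;
        parent[i]       = root;
        current_best[i] = w[root][i];
    }
    candidate[root] = NO;

    for (;;) {
        /* Constant-time global minimum over the responders on ASC; a scan here. */
        int next_node = -1;
        for (int i = 0; i < N; i++)
            if (candidate[i] == YES &&
                (next_node == -1 || current_best[i] < current_best[next_node]))
                next_node = i;
        if (next_node == -1) break;          /* no candidate contains "yes" */

        candidate[next_node] = NO;
        printf("MST edge: %d - %d (weight %d)\n",
               parent[next_node], next_node, current_best[next_node]);

        for (int i = 0; i < N; i++) {        /* update "forall" */
            if (candidate[i] == YES && w[next_node][i] < current_best[i]) {
                current_best[i] = w[next_node][i];
                parent[i]       = next_node;
            } else if (candidate[i] == WAIT && w[next_node][i] < INF) {
                candidate[i]    = YES;
                current_best[i] = w[next_node][i];
                parent[i]       = next_node;
            }
        }
    }
    return 0;
}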
Comments on ASC-MST Algorithm
The three preceding slides are Figure 6 in [Potter, Baker, et al., IEEE Computer, Nov. 1994].
Preceding slide gives a compact, data-structures level
pseudo-code description for this algorithm
 Pseudo-code illustrates Potter’s use of pronouns (e.g.,
them, its) and possessive nouns.
 The mindex function returns the index of a processor
holding the minimal value.
 This MST pseudo-code is much shorter and simpler
than data-structure level sequential MST pseudocodes
154
Tracing 1st Pass of MST Algorithm on Figure 6
[Figure: the data-structure table after the first pass of the algorithm on the graph of Figure 6, showing how mask$, candidate$, parent$, and current_best$ have been updated.]
155