Chapter 2: Evaluating Parallel Programs


Evaluating Parallel Programs
Cluster Computing, UNC-Charlotte, B. Wilkinson.
Sequential execution time, ts: Estimate by
counting computational steps of best sequential
algorithm.
Parallel execution time, tp: In addition to number
of computational steps, tcomp, need to estimate
communication overhead, tcomm:
tp = tcomp + tcomm
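For instance (the numbers are purely illustrative), if tcomp is 1000 computational steps and tcomm is equivalent to 200 steps, then tp = 1200 steps.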
Computational Time
Count number of computational steps.
When more than one process is executed simultaneously,
count the computational steps of the most complex process.
Generally, function of n and p, i.e.
tcomp = f (n, p)
Often break down computation time into parts. Then
tcomp = tcomp1 + tcomp2 + tcomp3 + …
Analysis is usually done assuming that all processors are
identical and operate at the same speed.
Communication Time
Many factors, including network structure.
As a first approximation, use
tcomm = tstartup + ntdata
tstartup -- startup time, essentially time to send a
message with no data. Assumed to be constant.
tdata -- transmission time to send one data word,
also assumed constant, and there are n data
words.
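As a rough illustration, the following C sketch evaluates this idealized model; the startup and per-word times are made-up illustrative values, not measurements of any real system:

#include <stdio.h>

/* Idealized point-to-point communication time: tcomm = tstartup + n*tdata */
double comm_time(long n, double t_startup, double t_data) {
    return t_startup + (double)n * t_data;
}

int main(void) {
    /* Illustrative constants only: 50 us startup, 0.1 us per data word */
    double t_startup = 50e-6, t_data = 0.1e-6;
    printf("n = 1000 words: tcomm = %g s\n", comm_time(1000, t_startup, t_data));
    return 0;
}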
Idealized Communication Time
[Figure: idealized communication time versus number of data items (n); the intercept on the time axis is the startup time and the slope is tdata.]
The equation for the communication time ignores the fact that the
source and destination may not be directly linked in a real system, so that the
message may pass through intermediate nodes.
It is also assumed that the overhead incurred by including information other
than data in the packet is constant and can be folded into the startup time.
Final communication time, tcomm
Summation of the communication times of all
sequential messages from one process, i.e.
tcomm = tcomm1 + tcomm2 + tcomm3 + …
The communication patterns of all processes are assumed
to be the same and to take place together, so that only one
process need be considered.
Both tstartup and tdata are measured in units of one
computational step, so that tcomp and tcomm can be added
together to obtain the parallel execution time, tp.
Communication Time of Broadcast/Gather
• If the broadcast is done through a single shared wire, as in
classical Ethernet, the time complexity is O(1) for a single data
item and O(w) for w data items.
• If a binary tree is used as the underlying network structure and
a 1-to-N fan-out broadcast is used, what is the
communication cost for p final destinations (leaf nodes) when
the message carries w data items?
– We assume the left and right children receive the message from their
parent sequentially. However, at each level, the different parent
nodes send out the message at the same time.
1-to-N fan-out Broadcast
• tcomm = 2 (log p) (tstartup + wtdata )
• It depends on the number of levels and the
number of nodes at each level.
• For a binary tree with p final destinations at the
leaf level, there are log p levels, and each parent
sends to its two children one after the other,
which gives the factor of 2.
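As a quick check on this formula, here is a minimal C sketch that evaluates the binary-tree fan-out cost model; the parameter values are illustrative assumptions:

#include <stdio.h>
#include <math.h>

/* Cost model for 1-to-N fan-out broadcast over a binary tree:
   tcomm = 2 * (log2 p) * (tstartup + w*tdata),
   p leaf destinations, messages of w data words. */
double tree_bcast_time(int p, int w, double t_startup, double t_data) {
    double levels = log2((double)p);   /* levels below the root */
    return 2.0 * levels * (t_startup + w * t_data);
}

int main(void) {
    double t_startup = 50e-6, t_data = 0.1e-6;   /* illustrative constants */
    printf("p = 64, w = 100: tcomm = %g s\n",
           tree_bcast_time(64, 100, t_startup, t_data));
    return 0;
}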
Benchmark Factors
With ts, tcomp, and tcomm, we can establish the speedup
factor and computation/communication ratio for
a particular algorithm/implementation:
Speedup factor S(p) = ts / tp
Computation/communication ratio = tcomp / tcomm
Both are functions of the number of processors, p, and the
number of data elements, n.
These factors give an indication of the scalability of the parallel
solution as the number of processors and the
problem size increase.
The computation/communication ratio will highlight the
effect of communication with increasing problem
size and system size.
We want computation, rather than communication, to be
the dominant factor; then, as n increases, the
communication can be ignored and adding more
processors improves the performance.
Example
• Adding n numbers using two computers, each adding n/2 numbers.
• The numbers are initially held in one computer (Computer 1).

Computer 1 sends n/2 numbers to Computer 2:  tcomm = tstartup + (n/2)tdata
Each computer adds its n/2 numbers:          tcomp = n/2
Computer 2 sends its result back:            tcomm = tstartup + tdata
Computer 1 adds the two partial sums:        tcomp = 1
Overall:
tcomm = 2tstartup + (n/2 + 1)tdata = O(n)
tcomp = n/2 + 1 = O(n)
Computation/communication ratio = O(1)
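As a concrete (hypothetical) rendering of this example in MPI, which the generic send()/recv() notation used later in this chapter resembles, one might write roughly the following; the problem size and data values are illustrative:

#include <stdio.h>
#include <mpi.h>

#define N 1000   /* illustrative problem size (assumption) */

int main(int argc, char **argv) {
    int rank, i;
    double nums[N], partial = 0.0, sum = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                       /* "Computer 1" holds the numbers */
        for (i = 0; i < N; i++) nums[i] = 1.0;
        /* Send n/2 numbers: tcomm = tstartup + (n/2)tdata */
        MPI_Send(nums + N/2, N/2, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        for (i = 0; i < N/2; i++) partial += nums[i];   /* tcomp = n/2 */
        /* Receive the other partial sum: tcomm = tstartup + tdata */
        MPI_Recv(&sum, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        sum += partial;                                 /* tcomp = 1 */
        printf("sum = %f\n", sum);
    } else if (rank == 1) {                /* "Computer 2" */
        MPI_Recv(nums, N/2, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (i = 0; i < N/2; i++) partial += nums[i];
        MPI_Send(&partial, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

Run with two processes, e.g. mpirun -np 2 ./sum2 (the program name is arbitrary).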
Another problem
Computation time complexity = O(n^2)
Communication time complexity = O(n)
Computation/communication ratio = O(n)
Here the ratio grows with n, so computation dominates and the
communication overhead matters relatively less as the problem size increases.
Cost
Cost = (execution time) x (number of processors)
Cost of sequential computation = ts
Cost of parallel computation = tp x p
Cost-optimal algorithm
A parallel algorithm is cost-optimal when the cost of the parallel
computation is proportional to the cost of the sequential computation:
Cost = tp x p = k x ts
where k is a constant.
Example
Suppose ts = O(n log n) for the best sequential program,
where n = number of data items and p = number of processors.
Cost-optimal if
tp = O((n log n)/p) = O((n/p) log n)
since then p x tp = O(n log n) = O(ts).
Not cost-optimal if
tp = O(n^2/p)
since then p x tp = O(n^2), which grows faster than O(n log n).
A parallel algorithm is cost-optimal if its parallel time complexity
times the number of processors equals the sequential time
complexity.
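A quick numeric illustration of this comparison, ignoring constant factors and with an arbitrarily chosen p: the cost/ts ratio stays constant for the cost-optimal tp but grows with n for the other.

#include <stdio.h>
#include <math.h>

int main(void) {
    int p = 16;   /* illustrative processor count */
    long n;
    for (n = 1024; n <= 1048576; n *= 32) {
        double ts    = n * log2((double)n);
        double cost1 = p * ((n * log2((double)n)) / p);  /* tp = (n log n)/p */
        double cost2 = p * ((double)n * n / p);          /* tp = n^2/p       */
        printf("n=%8ld  cost1/ts=%.2f  cost2/ts=%.3g\n",
               n, cost1 / ts, cost2 / ts);
    }
    return 0;
}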
Evaluating programs
• Measuring the execution time
• Time-complexity analysis gives insight into a parallel
algorithm and is useful for comparing different algorithms, but we
also want to know how the algorithm actually performs on a real
system.
• We can measure the elapsed time, in seconds, between two points in the
code.
– Using system calls such as clock(), time(), gettimeofday(), or MPI_Wtime()
– Example:
L1: time(&t1);
    /* ... code being timed ... */
L2: time(&t2);
    elapsed_time = difftime(t2, t1);   /* elapsed time in seconds */
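As a self-contained illustration using gettimeofday() from the list above (the loop being timed is an arbitrary placeholder), one might write:

#include <stdio.h>
#include <sys/time.h>

int main(void) {
    struct timeval t1, t2;
    double elapsed, sum = 0.0;
    long i;

    gettimeofday(&t1, NULL);               /* point L1 */
    for (i = 0; i < 100000000L; i++)       /* placeholder work being timed */
        sum += (double)i;
    gettimeofday(&t2, NULL);               /* point L2 */

    /* Seconds, with microsecond resolution */
    elapsed = (t2.tv_sec - t1.tv_sec) + (t2.tv_usec - t1.tv_usec) / 1e6;
    printf("elapsed = %f s (sum = %g)\n", elapsed, sum);
    return 0;
}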
Communication Time by the Ping-Pong Method
• The point-to-point communication time of a specific
system can be found using the ping-pong method.
• One process, say p0, sends a message to another
process, say p1. Immediately upon receiving the
message, p1 sends it back to p0. The round-trip time
is divided by two to obtain an estimate of the
one-way communication time. For example, at p0:
time(&t1);
send(&x, p1);    /* send message to p1 */
recv(&x, p1);    /* wait for p1 to echo it back */
time(&t2);
elapsed_time = 0.5 * difftime(t2, t1);   /* one-way estimate */
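In real MPI code one would typically use MPI_Wtime(), which has much finer resolution than time(); a minimal sketch along these lines, assuming two MPI processes:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, x = 0;        /* one-word payload; contents are irrelevant */
    double t1, t2;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                       /* p0 */
        t1 = MPI_Wtime();
        MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        t2 = MPI_Wtime();
        printf("one-way estimate = %g s\n", 0.5 * (t2 - t1));
    } else if (rank == 1) {                /* p1: echo the message back */
        MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

In practice the ping-pong is repeated many times and the times averaged, since a single round trip may be close to the timer's resolution.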
Profiling
• A profile of a program is a histogram or
graph showing the time spent in
different parts of the program, or the
number of times particular source statements
are executed.
• It helps to identify hot spots: places
in a program visited many times
during execution. These places
are candidates to be optimized first.
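As a concrete example (assuming the GNU toolchain, which the course does not specify), a program compiled with gcc -pg and run once produces a gmon.out file, and gprof then reports the time spent in each function and the call counts, which is exactly the kind of profile described above.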
Program Profile Histogram
[Figure: histogram of execution time against statement number or region of the program.]