Performance Issues in Parallel Programming


Programming for Performance
Laxmikant Kale
CS 433
Causes of performance loss
• If each processor is rated at k MFLOPS, and
there are p processors, why don’t we see k.p
MFLOPS performance?
– Several causes,
– Each must be understood separately
– but they interact with each other in complex ways
• Solution to one problem may create another
• One problem may mask another, which manifests itself
under other conditions (e.g. increased p).
Causes
• Sequential: cache performance
• Communication overhead
• Algorithmic overhead (“extra work”)
• Speculative work
• Load imbalance
• (Long) Critical paths
• Bottlenecks
Algorithmic overhead
• Parallel algorithms may have a higher
operation count
• Example: parallel prefix (also called “scan”)
– How to parallelize this?
B[0] = A[0];
for (i = 1; i < N; i++)
    B[i] = B[i-1] + A[i];
Parallel Prefix: continued
• How to do this operation in parallel?
– Seems inherently sequential
– Recursive doubling algorithm
– Operation count: N · log(P)
• A better algorithm:
– Take the blocking of data into account
– Each processor computes the sum of its block, then
participates in a parallel prefix over the block sums to get the sum
of everything to its left, and then adds that offset to all its elements
– N + log(P) + N ops: roughly a doubling of the sequential op. count
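A minimal sketch of the blocked algorithm above, assuming an MPI setting and the illustrative names a, b, n_local (none of these appear on the slides): each processor computes prefix sums over its own block, obtains the sum of all blocks to its left with MPI_Exscan, and then adds that offset to its elements.

/* Blocked parallel prefix: a sketch assuming MPI.
   a, b, n_local are illustrative names: the local input block,
   the local output block, and its length (roughly N/P). */
#include <mpi.h>

void block_prefix(const double *a, double *b, int n_local, MPI_Comm comm)
{
    /* Step 1: local prefix sums (about N/P ops per processor) */
    b[0] = a[0];
    for (int i = 1; i < n_local; i++)
        b[i] = b[i-1] + a[i];

    /* Step 2: exclusive scan of the block sums (log(P) parallel steps) */
    double block_sum = b[n_local-1], offset = 0.0;
    MPI_Exscan(&block_sum, &offset, 1, MPI_DOUBLE, MPI_SUM, comm);

    int rank;
    MPI_Comm_rank(comm, &rank);
    if (rank == 0)
        offset = 0.0;   /* MPI_Exscan leaves rank 0's result undefined */

    /* Step 3: add the offset (another N/P ops per processor) */
    for (int i = 0; i < n_local; i++)
        b[i] += offset;
}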
Bottleneck
• Consider the “primes” program (or the “pi”)
– What happens when we run it on 1000 PEs?
• How to eliminate bottlenecks:
– Two structures are useful in most such cases:
• Spanning trees: organize processors in a tree
• Hypercube-based dimensional exchange
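A sketch of the hypercube-style dimensional exchange mentioned above, assuming MPI and a power-of-two number of processors (both assumptions, not stated on the slide): each processor exchanges its partial result with one partner per hypercube dimension, so a global sum completes in log(P) steps with no single processor acting as a bottleneck.

/* Dimensional exchange on a hypercube for a global sum: a sketch
   assuming MPI and that the number of processors is a power of two. */
#include <mpi.h>

double hypercube_sum(double local, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    double sum = local;
    for (int mask = 1; mask < p; mask <<= 1) {      /* log2(P) steps */
        int partner = rank ^ mask;                  /* flip one dimension */
        double recv;
        MPI_Sendrecv(&sum, 1, MPI_DOUBLE, partner, 0,
                     &recv, 1, MPI_DOUBLE, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        sum += recv;
    }
    return sum;   /* every processor now holds the global sum */
}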
Communication overhead
• Components:
– per message and per byte
– sending, receiving and network
– capacity constraints
• Grainsize analysis:
– How much computation per message
– Computation-to-communication ratio
Communication overhead: examples
• Usually, must reorganize data or work to
reduce communication
• Combining communication also helps
• Examples:
Communication overhead
Communication delay: the time interval between a send on one
processor and the receipt on another:
time = a + b·N
Communication overhead: the time a processor is held up (both the
sender and the receiver are held up): again of the form a + b·N
Typical values: a = 10 - 100 microseconds, b = 2 - 10 ns per byte
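A rough worked instance of the formula above, using the typical constants (the 1000-byte message size is an assumed value for illustration): with a = 10 microseconds and b = 10 ns/byte, sending 1000 bytes costs about time = 10 us + 10 ns × 1000 = 20 us, so the per-message term a dominates for short messages and the per-byte term b·N takes over only for long ones.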
Grainsize control
• A simple definition of grainsize:
– Amount of computation per message
– Problem: this ignores whether messages are short or long
• More realistic:
– Computation to communication ratio
Example: matrix multiplication
• How to parallelize this?
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)      // C[i][j] starts at 0
        for (k = 0; k < N; k++)
            C[i][j] += A[i][k] * B[k][j];
A simple algorithm:
• Distribute A by rows, B by columns
– So, any processor can request a row of A and get
it (in two messages). Same for a column of B.
– Distribute the work of computing each element
of C using some load balancing scheme
• So it works even on machines with varying processor
capabilities (e.g. timeshared clusters)
– What is the computation-to-communication ratio?
• For each object: 2·N ops, 2 messages with N bytes each
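Reading those numbers in the alpha/beta terms used on the next slide (this restatement is an interpretation, not on the slide itself): roughly 2N ops per 2 messages gives an alpha ratio of about N ops per message, and 2N ops per about 2N bytes gives a beta ratio of only about 1 op per byte, which is what the blocked algorithm below improves.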
A better algorithm:
• Store A as a collection of row-bunches
– each bunch stores g rows
– Same for B’s columns
• Each object now computes a g×g section of C
• Computation-to-communication ratio:
– 2·g·g·N ops
– 2 messages, g·N bytes each
– alpha ratio: 2·g·g·N / 2 = g²·N, beta ratio: g
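A minimal sketch of one object's work in the blocked scheme (the names rows_of_A, cols_of_B, c_block and the flat layouts are illustrative assumptions): having received g rows of A and g columns of B, i.e. two messages of g·N values each, it performs 2·g·g·N operations.

/* One object's computation of a g x g block of C: a sketch.
   rows_of_A: g full rows of A, row-major (g*N values)
   cols_of_B: g full columns of B, stored column-by-column (g*N values)
   c_block:   the g x g output block. Names are illustrative. */
void compute_block(const double *rows_of_A, const double *cols_of_B,
                   double *c_block, int g, int N)
{
    for (int i = 0; i < g; i++)
        for (int j = 0; j < g; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)        /* 2*N ops per element of C */
                sum += rows_of_A[i*N + k] * cols_of_B[j*N + k];
            c_block[i*g + j] = sum;            /* g*g elements: 2*g*g*N ops total */
        }
}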
Alpha vs beta
• The per message cost is significantly larger
than per byte cost
– factor of several thousands
– So, several optimizations are possible that trade
off: accept a larger beta cost for a smaller alpha cost
– I.e. send fewer messages
– Applications of this idea:
• Message combining
• Complex communication patterns: each-to-all, ..
Example:
• Each to all communication:
– each processor wants to send a distinct N-byte
message to every other processor
– Simple implementation: alpha·P + beta·N·P per processor
• typical values?
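A rough worked instance of that cost, with assumed values P = 1000 processors, N = 1000 bytes, alpha = 10 us and beta = 10 ns/byte: alpha·P ≈ 10 ms and beta·N·P ≈ 10 ms, i.e. about 20 ms per processor for one each-to-all step. When the alpha term dominates, trading beta cost for fewer messages (message combining, or hypercube-style exchanges with log P larger messages) can pay off.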
Programming for performance: steps
• Select/design parallel algorithm
• Decide on decomposition
• Select load balancing strategy
• Plan communication structure
• Examine synchronization needs
– global synchronizations, critical paths
Design Philosophy:
• Parallel Algorithm design:
– Ensure good performance (total op count)
– Generate sufficient parallelism
– Avoid/minimize “extra work”
• Decomposition:
– Break into many small pieces:
• Smallest grain that sufficiently amortizes overhead
Design principles: contd.
• Load balancing
– Select static, dynamic, or quasi-dynamic strategy
• Measurement based vs prediction based load estimation
– Principle: it is better to let a processor idle than to
overload one (think about why)
• Reduce communication overhead
– Algorithmic reorganization (change mapping)
– Message combining
– Use efficient communication libraries
Design principles: Synchronization
• Eliminate unnecessary global synchronization
– If T(i,j) is the time spent in the i’th phase on the j’th PE
• With synchronization: sum over i of ( max over j of T(i,j) )
• Without: max over j of ( sum over i of T(i,j) ) (a small numeric example follows this slide)
• Critical Paths:
– Look for long chains of dependences
• Draw timeline pictures with dependences
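A small numeric illustration of the synchronization formulas, with assumed times: take 2 phases on 2 PEs with T(1,1)=10, T(1,2)=2, T(2,1)=2, T(2,2)=10. With a global synchronization after each phase the cost is max(10,2) + max(2,10) = 20; without it, each PE simply runs through its own work and the finish time is max(10+2, 2+10) = 12.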
Diagnosing performance problems
• Tools:
– Back of the envelope (i.e., simple) analysis
– Post-mortem analysis, with performance logs
• Visualization of performance data
• Automatic analysis
• Phase-by-phase analysis (prog. may have many phases)
– What to measure
• load distribution, (commun.) overhead, idle time
• Their averages, max/min, and variances
• Profiling: time spent in individual modules/subroutines
Diagnostic techniques
• Tell-tale signs:
– max load >> average, and # PEs with load > average is >> 1
• Load imbalance
– max load >> average, and # PEs with load > average is ~ 1
• Possible bottleneck (if there is a dependence)
– profile shows the total time in routine f increasing
as the number of PEs increases: algorithmic overhead
– Communication overhead: obvious