
Programming for Performance
CS433
Spring 2001
Laxmikant Kale
Causes of performance loss
• If each processor is rated at k MFLOPS, and there are p
processors, why don’t we see k.p MFLOPS performance?
– Several causes
– Each must be understood separately,
– but they interact with each other in complex ways
• Solution to one problem may create another
• One problem may mask another, which manifests itself under other
conditions (e.g. increased p).
2
Causes
• Sequential: cache performance
• Communication overhead
• Algorithmic overhead (“extra work”)
• Speculative work
• Load imbalance
• (Long) Critical paths
• Bottlenecks
3
Algorithmic overhead
• Parallel algorithms may have a higher operation count
• Example: parallel prefix (also called “scan”)
– How to parallelize this?
B[0] = A[0];
for (i = 1; i < N; i++)
    B[i] = B[i-1] + A[i];
4
Parallel Prefix: continued
• How to do this operation in parallel?
– Seems inherently sequential
– Recursive doubling algorithm
– Operation count: log(P) . N
• A better algorithm:
– Take the blocking of data into account
– Each processor calculates the sum of its block, then participates in a parallel
algorithm to obtain the sum of all elements to its left, and then adds that offset to all its elements (see the sketch below)
– N + log(P) + N: roughly a doubling of the op. count
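A minimal sketch of this blocked scheme in C with MPI, assuming each processor already holds a contiguous block of n = N/P elements of A; the function name block_prefix and the use of MPI_Exscan are illustrative choices, not part of the lecture:

#include <mpi.h>

/* Blocked parallel prefix: two local passes of about n ops each, plus an
   exclusive scan of the per-processor sums (log(P) steps inside MPI_Exscan). */
void block_prefix(const double *A, double *B, int n, MPI_Comm comm)
{
    int rank;
    double local_sum = 0.0, left_sum = 0.0;
    MPI_Comm_rank(comm, &rank);
    for (int i = 0; i < n; i++)        /* pass 1: sum of my block */
        local_sum += A[i];
    MPI_Exscan(&local_sum, &left_sum, 1, MPI_DOUBLE, MPI_SUM, comm);
    if (rank == 0)                     /* MPI_Exscan leaves rank 0's result undefined */
        left_sum = 0.0;
    double running = left_sum;         /* sum of all blocks strictly to my left */
    for (int i = 0; i < n; i++) {      /* pass 2: local prefix, shifted by the offset */
        running += A[i];
        B[i] = running;
    }
}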
5
Bottleneck
• Consider the “primes” program (or the “pi” program)
– What happens when we run it on 1000 PEs?
• How to eliminate bottlenecks:
– Two structures are useful in most such cases:
• Spanning trees: organize processors in a tree (see the sketch below)
• Hypercube-based dimensional exchange
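As one concrete illustration, a spanning-tree (binomial tree) reduction of the per-processor prime counts, written with MPI point-to-point calls so the tree structure is visible; the function name tree_reduce is an illustrative choice, not from the lecture:

#include <mpi.h>

/* Combine per-processor counts up a binomial spanning tree: the total ends up
   on rank 0, and no processor receives more than log(P) messages in total. */
long tree_reduce(long local_count, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    for (int step = 1; step < p; step *= 2) {
        if (rank % (2 * step) == step) {          /* I send at this level, then I am done */
            MPI_Send(&local_count, 1, MPI_LONG, rank - step, 0, comm);
            break;
        } else if (rank % (2 * step) == 0 && rank + step < p) {
            long partial;                         /* receive from my child at this level */
            MPI_Recv(&partial, 1, MPI_LONG, rank + step, 0, comm, MPI_STATUS_IGNORE);
            local_count += partial;
        }
    }
    return local_count;                           /* meaningful only on rank 0 */
}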
6
Communication overhead
• Components:
– per message and per byte
– sending, receiving and network
– capacity constraints
• Grainsize analysis:
– How much computation per message
– Computation-to-communication ratio
7
Communication overhead examples
• Usually, must reorganize data or work to reduce communication
• Combining communication also helps
• Examples:
8
Communication overhead
Communication delay: the time interval between a send on one processor and the receipt on another:
time = a + b*N
Communication overhead: the time for which a processor itself is held up (both the sender and the receiver are held up); again of the form a + b*N
Typical values: a = 10-100 microseconds, b = 2-10 ns per byte
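A quick check with these typical values (taking a = 10 us, b = 2 ns per byte): a 1000-byte message costs about 10 + 0.002*1000 = 12 us, so the per-message term dominates until N approaches a/b = 5000 bytes.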
9
Grainsize control
• A simple definition of grainsize:
– Amount of computation per message
– Problem: does not distinguish short messages from long messages
• More realistic:
– Computation-to-communication ratio
– computation time / (a + b*N) for one message
10
Example: matrix multiplication
• How to parallelize this?
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)      /* C[i][j] assumed initialized to 0 */
    for (k = 0; k < N; k++)
      C[i][j] += A[i][k] * B[k][j];
11
A simple algorithm:
• Distribute A by rows, B by columns
– So, any processor can request a row of A and get it (in two messages). Same for a column of B.
– Distribute the work of computing each element of C using some load balancing scheme
• So it works even on machines with varying processor capabilities (e.g. timeshared clusters)
– What is the computation-to-communication ratio? (worked out below)
• For each object: 2N ops, 2 messages with N bytes each
• Compute time / communication time = 2N*0.01 / (2*10 + 2*0.002*N), taking 0.01 us per operation, a = 10 us, b = 0.002 us per byte
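For an illustrative size N = 1000, that is 2*1000*0.01 = 20 us of computation against 2*10 + 2*0.002*1000 = 24 us of communication per element of C: a ratio below 1, so this decomposition spends more time communicating than computing.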
12
A better algorithm:
• Store A as a collection of row-bunches
– each bunch stores g rows
– Same for B’s columns
• Each object now computes a g x g section of C (see the sketch below)
• Computation-to-communication ratio:
– 2*g*g*N ops
– 2 messages, g*N bytes each
– alpha ratio: 2*g*g*N / 2 = g*g*N ops per message; beta ratio: 2*g*g*N / (2*g*N) = g ops per byte
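A sketch of the per-object kernel, only to make the 2*g*g*N operation count concrete; the data layout (g full rows of A, g full columns of B stored column by column) and the name compute_block are illustrative assumptions:

/* Compute a g x g block of C from g rows of A and g columns of B.
   One multiply plus one add, executed g*g*N times, gives 2*g*g*N ops. */
void compute_block(const double *Arows, const double *Bcols,
                   double *Cblock, int g, int N)
{
    for (int i = 0; i < g; i++)
        for (int j = 0; j < g; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += Arows[i*N + k] * Bcols[j*N + k];
            Cblock[i*g + j] = sum;
        }
}

With g = 10 and the earlier machine parameters, the beta ratio of g means roughly 10 ops (about 0.1 us of compute) per byte transferred, against b = 0.002 us per byte, so the per-byte cost is now well amortized.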
13
Alpha vs beta
• The per-message cost is significantly larger than the per-byte cost
– by a factor of several thousand
– So, several optimizations are possible that trade off:
• a larger beta cost in return for a smaller alpha cost
– i.e. send fewer, larger messages
– Applications of this idea:
• Examined in the last two lectures
14
Programming for performance: steps
• Select/design Parallel algorithm
• Decide on Decomposition
• Select Load balancing strategy
• Plan Communication structure
• Examine synchronization needs
– global synchronizations, critical paths
15
Design Philosophy:
• Parallel Algorithm design:
– Ensure good performance (total op count)
– Generate sufficient parallelism
– Avoid/minimize “extra work”
• Decomposition:
– Break into many small pieces:
• Smallest grain that sufficiently amortizes overhead
16
Design principles: contd.
• Load balancing
– Select static, dynamic, or quasi-dynamic strategy
• Measurement based vs prediction based load estimation
– Principle: it is better to let a processor sit idle than to overload one
• (think about this)
• Reduce communication overhead
– Algorithmic reorganization (change mapping)
– Message combining
– Use efficient communication libraries
17
Design principles: Synchronization
• Eliminate unnecessary global synchronization
– If T(i,j) is the time spent in the i’th phase on the j’th PE (tiny example below)
• With synchronization after every phase: total time = sum over i of ( max over j of T(i,j) )
• Without: total time = max over j of ( sum over i of T(i,j) )
• Critical Paths:
– Look for long chains of dependences
• Draw timeline pictures with dependences
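A tiny made-up example of the difference above: two phases on two PEs, with phase times T(1,·) = (3,1) and T(2,·) = (1,3). With a barrier after each phase the time is max(3,1) + max(1,3) = 6; without it, each PE overlaps its own slack and the time is max(3+1, 1+3) = 4.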
18
Diagnosing performance problems
• Tools:
– Back-of-the-envelope (i.e. simple) analysis
– Post-mortem analysis, with performance logs
• Visualization of performance data
• Automatic analysis
• Phase-by-phase analysis (prog. may have many phases)
– What to measure
• load distribution, (commun.) overhead, idle time
• Their averages, max/min, and variances
• Profiling: time spent in individual modules/subroutines
19
Diagnostic techniques
• Tell-tale signs:
– max load >> average, and the number of PEs above the average is >> 1:
Load imbalance
– max load >> average, and the number of PEs above the average is ~ 1:
Possible bottleneck (if there is a dependence)
– Profile shows the total time in some routine f increasing as the number of PEs increases:
Algorithmic overhead
– Communication overhead: obvious
20
Communication Optimization
• Example problem from earlier lecture: Molecular Dynamics
– Each processor, assumed to house just one cell, needs to send 26 short messages to “neighboring” processors
– Assume send/receive cost, each: alpha = 10 us, beta = 2 ns per byte
– Time spent (notice: 26 sends and 26 receives):
• 26 * 2 * (10 us) = 520 us
– If there is more than one cell on each PE, multiply this number accordingly!
– Can this be improved? How?
21
Message combining
• If there are multiple cells per processor:
– Neighbors of a cell may be on the same neighboring processor
– Neighbors of two different cells may be on the same processor
– Combine messages going to the same processor (see the sketch below)
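A minimal sketch of the combining step in C with MPI; the Update record (destination PE plus three coordinates) and the function name send_combined are hypothetical, introduced only to show the pattern:

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

typedef struct { int dest_pe; double coords[3]; } Update;

/* Bucket all outgoing updates by destination and send one message per
   destination PE, instead of one message per cell per neighbor. */
void send_combined(const Update *u, int n_updates, int n_pes, MPI_Comm comm)
{
    int *count = calloc(n_pes, sizeof(int));
    for (int i = 0; i < n_updates; i++)
        count[u[i].dest_pe]++;
    for (int q = 0; q < n_pes; q++) {
        if (count[q] == 0) continue;
        double *buf = malloc(count[q] * 3 * sizeof(double));
        int k = 0;
        for (int i = 0; i < n_updates; i++)
            if (u[i].dest_pe == q) {
                memcpy(buf + 3 * k, u[i].coords, 3 * sizeof(double));
                k++;
            }
        /* one per-message (alpha) cost per destination; payloads share one beta term */
        MPI_Send(buf, 3 * count[q], MPI_DOUBLE, q, 0, comm);
        free(buf);
    }
    free(count);
}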
22
Communication Optimization I
• Take advantage of the structure of the communication, and do the communication in stages:
– If my coordinates are (x,y,z):
• Send to (x+1,y,z) anything that goes to (x+1,*,*)
• Send to (x-1,y,z) anything that goes to (x-1,*,*)
• Wait for the messages from the x neighbors, then
• Send to the y neighbors a combined message (and then likewise for z)
– A total of 6 messages instead of 26 (worked numbers below)
– Apparently a longer critical path
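With the same per-message cost as before (10 us for each send and each receive), the staged scheme spends 6 * 2 * (10 us) = 120 us instead of 520 us, at the price of waiting for the x-stage messages before the y-stage can start.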
23
Communication Optimization II
• Send all migrating atoms to processor 0
– Let processor 0 sort them out and send 1 message to each processor
– Works OK if the number of processors is small
• Otherwise, processor 0 becomes a bottleneck
24
Communication Optimization III
• Generalized problem:
– Each to all, individualized messages
– Apply all previously learned techniques
25
Intro to Load Balancing
• Example: 500 processors, 50000 units of work
• What should the objective of load balancing be?
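(To frame the question: the average is 50000/500 = 100 units of work per processor, but the finish time is set by the most heavily loaded processor, so an objective of the form “minimize the maximum load” is more useful than merely equalizing the average.)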
26
Causes of performance loss
• If each processor is rated at k MFLOPS, and there are p
processors, why don’t we see k.p MFLOPS performance?
– Several causes
– Each must be understood separately,
– but they interact with each other in complex ways
• Solution to one problem may create another
• One problem may mask another, which manifests itself under other
conditions (e.g. increased p).
27
Causes
• Sequential: cache performance
• Communication overhead
• Algorithmic overhead (“extra work”)
• Speculative work
• Load imbalance
• (Long) Critical paths
• Bottlenecks
28
Algorithmic overhead
• Parallel algorithms may have a higher operation count
• Example: parallel prefix (also called “scan”)
– How to parallelize this?
B[0] = A[0];
for (i = 1; i < N; i++)
    B[i] = B[i-1] + A[i];
29
Parallel Prefix: continued
• How to do this operation in parallel?
– Seems inherently sequential
– Recursive doubling algorithm
– Operation count: log(P) . N
• A better algorithm:
– Take the blocking of data into account
– Each processor calculates the sum of its block, then participates in a parallel
algorithm to obtain the sum of all elements to its left, and then adds that offset to all its
elements
– N + log(P) + N: roughly a doubling of the op. count
30
Bottleneck
• Consider the “primes” program (or the “pi” program)
– What happens when we run it on 1000 PEs?
• How to eliminate bottlenecks:
– Two structures are useful in most such cases:
• Spanning trees: organize processors in a tree
• Hypercube-based dimensional exchange
31
Communication overhead
• Components:
– per message and per byte
– sending, receiving and network
– capacity constraints
• Grainsize analysis:
– How much computation per message
– Computation-to-communication ratio
32
Communication overhead examples
• Usually, must reorganize data or work to reduce communication
• Combining communication also helps
• Examples:
33
Communication overhead
Communication delay: the time interval between a send on one processor and the receipt on another:
time = a + b*N
Communication overhead: the time for which a processor itself is held up (both the sender and the receiver are held up); again of the form a + b*N
Typical values: a = 10-100 microseconds, b = 2-10 ns per byte
34
Grainsize control
• A simple definition of grainsize:
– Amount of computation per message
– Problem: does not distinguish short messages from long messages
• More realistic:
– Computation-to-communication ratio
35
Example: matrix multiplication
• How to parallelize this?
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)      /* C[i][j] assumed initialized to 0 */
    for (k = 0; k < N; k++)
      C[i][j] += A[i][k] * B[k][j];
36
A simple algorithm:
• Distribute A by rows, B by columns
– So, any processor can request a row of A and get it (in two messages). Same for a column of B.
– Distribute the work of computing each element of C using
some load balancing scheme
• So it works even on machines with varying processor
capabilities (e.g. timeshared clusters)
– What is the computation-to-communication ratio?
• For each object: 2N ops, 2 messages with N bytes each
37
A better algorithm:
• Store A as a collection of row-bunches
– each bunch stores g rows
– Same for B’s columns
• Each object now computes a g x g section of C
• Computation-to-communication ratio:
– 2*g*g*N ops
– 2 messages, g*N bytes each
– alpha ratio: 2*g*g*N / 2 = g*g*N ops per message; beta ratio: g ops per byte
38
Alpha vs beta
• The per-message cost is significantly larger than the per-byte
cost
– by a factor of several thousand
– So, several optimizations are possible that trade off: a larger
beta cost in return for a smaller alpha cost
– i.e. send fewer, larger messages
– Applications of this idea:
• Message combining
• Complex communication patterns: each-to-all, ..
39
Example:
• Each-to-all communication:
– each processor wants to send a distinct N-byte message to every other
processor
– Simple implementation: cost per processor is alpha*P + N*beta*P
• typical values?
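With the typical values from earlier (alpha = 10 us, beta = 2 ns per byte) and, as purely illustrative numbers, P = 1000 and N = 100 bytes: 10*1000 + 100*0.002*1000 = 10000 + 200 us, i.e. roughly 10 ms per processor, almost all of it per-message (alpha) cost, which is exactly what message combining and staged exchanges attack.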
40
Programming for performance: steps
• Select/design Parallel algorithm
• Decide on Decomposition
• Select Load balancing strategy
• Plan Communication structure
• Examine synchronization needs
– global synchronizations, critical paths
41
Design Philosophy:
• Parallel Algorithm design:
– Ensure good performance (total op count)
– Generate sufficient parallelism
– Avoid/minimize “extra work”
• Decomposition:
– Break into many small pieces:
• Smallest grain that sufficiently amortizes overhead
42
Design principles: contd.
• Load balancing
– Select static, dynamic, or quasi-dynamic strategy
• Measurement based vs prediction based load estimation
– Principle: it is better to let a processor sit idle than to overload one (think about this)
• Reduce communication overhead
– Algorithmic reorganization (change mapping)
– Message combining
– Use efficient communication libraries
43
Design principles: Synchronization
• Eliminate unnecessary global synchronization
– If T(i,j) is the time spent in the i’th phase on the j’th PE
• With synchronization after every phase: total time = sum over i of ( max over j of T(i,j) )
• Without: total time = max over j of ( sum over i of T(i,j) )
• Critical Paths:
– Look for long chains of dependences
• Draw timeline pictures with dependences
44
Diagnosing performance problems
• Tools:
– Back-of-the-envelope (i.e. simple) analysis
– Post-mortem analysis, with performance logs
• Visualization of performance data
• Automatic analysis
• Phase-by-phase analysis (prog. may have many phases)
– What to measure
• load distribution, (commun.) overhead, idle time
• Their averages, max/min, and variances
• Profiling: time spent in individual modules/subroutines
45
Diagnostic techniques
• Tell-tale signs:
– max load >> average, and the number of PEs above the average is >> 1
• Load imbalance
– max load >> average, and the number of PEs above the average is ~ 1
• Possible bottleneck (if there is a dependence)
– Profile shows the total time in some routine f increasing as the number of
PEs increases: algorithmic overhead
– Communication overhead: obvious
46