PowerPoint slides for lecture 26


• Optimizations (continued):
– Common subexpression elimination
– Code motion
– Strength reduction
– Induction variable elimination
– Instruction selection
– Register allocation
(A sketch of strength reduction and induction variable elimination follows below.)
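A minimal hedged sketch (my own illustration, not from the slides) of strength reduction followed by induction variable elimination, assuming a simple loop that indexes an array with a scaled loop counter:

/* Illustrative only: strength reduction and induction variable
   elimination on an assumed loop. */

/* Original: the offset i * 4 is recomputed with a multiply each iteration. */
void original(int a[], int n) {
    for (int i = 0; i < n; i++)
        a[i * 4] = 0;
}

/* Strength reduction: the multiply is replaced by a running addition. */
void strength_reduced(int a[], int n) {
    int t = 0;
    for (int i = 0; i < n; i++) {
        a[t] = 0;
        t += 4;
    }
}

/* Induction variable elimination: i is no longer needed, so the loop
   is driven by the derived variable t alone. */
void induction_eliminated(int a[], int n) {
    for (int t = 0; t < 4 * n; t += 4)
        a[t] = 0;
}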
– Branch chaining:
– While c do if a then b:
Before:
L2: c
    If (c = false) goto L3
    a
    If (a = false) goto L1
    b
L1: goto L2
L3:
After branch chaining (the branch to L1, which only jumps to L2, is retargeted directly to L2):
L2: c
    If (c = false) goto L3
    a
    If (a = false) goto L2
    b
L1: goto L2
L3:
– Jump elimination:
Before:
    If (a = 0) goto L1
    goto L2
L1:
After (the condition is inverted so the unconditional jump is no longer needed):
    If (a != 0) goto L2
L1:
– Instruction scheduling:
Before (K = R2 must wait for the load of j into R2):
    R2 = j
    K = R2
    R1 = R1 + 1
After (the independent increment is moved between the load and its use to hide the load latency):
    R2 = j
    R1 = R1 + 1
    K = R2
– Loop unrolling:
Before:
    For (I = 0; I < 4; I++)
        a(I) = 0;
After:
    a(0) = 0;
    a(1) = 0;
    a(2) = 0;
    a(3) = 0;
– Loop fusion:
Before:
    For (I = 0; I < 4; I++)
        a(I) = 0;
    For (I = 0; I < 4; I++)
        b(I) = 0;
After:
    For (I = 0; I < 4; I++) {
        a(I) = 0;
        b(I) = 0;
    }
• Difference between local optimization and global optimization:
– Common subexpression elimination:
• Local CSE: building a DAG over the basic block solves the problem.
• Global CSE: needs (data) flow analysis across basic blocks (see the sketch after the example).
Example with a repeated computation of I * 2:
    R1 = I * 2
    R1 = b[R1]
    R2 = I * 2
    R2 = b[R2]
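A hedged illustration (assumed C code, not the lecture's) of why local CSE can be handled with a DAG over one basic block while global CSE needs data-flow analysis:

/* Illustrative only: local vs. global common subexpression elimination. */

int local_cse(int b[], int i) {
    /* Local CSE: both uses of i * 2 sit in one basic block, so a DAG
       built for the block exposes the redundancy directly. */
    int t  = i * 2;     /* computed once */
    int r1 = b[t];
    int r2 = b[t];      /* reuses t instead of recomputing i * 2 */
    return r1 + r2;
}

int global_cse(int b[], int i, int cond) {
    /* Global CSE: the first i * 2 is in a different basic block (inside
       the branch), so available-expressions data-flow analysis must
       decide whether i * 2 can be reused at the last load. Here it is
       available only on the path through the branch, so it cannot be
       removed without further transformation. */
    int x = 0;
    if (cond)
        x = b[i * 2];
    int y = b[i * 2];
    return x + y;
}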
• Compilation techniques for distributed memory machines – problems:
– How to allocate memory?
• Shared arrays are allocated across different machines.
– Who performs the operations?
• The owner-computes rule.
• References to remote array elements result in explicit communication (the sketch after the figure enumerates them for the example below).
For (I=1; I<100; I++)
a(I) = b(I-1)
[Figure: arrays A and B are block-distributed across processors P0–P3; each processor owns 25 consecutive elements: P0 holds 0–24, P1 holds 25–49, P2 holds 50–74, P3 holds 75–99.]
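Under the block distribution above, only the iterations whose right-hand side crosses a processor boundary need communication. A small hedged sketch (block size 25 assumed, as in the figure) that enumerates them:

/* Which iterations of a(I) = b(I-1) need communication under a block
   distribution of 100 elements over 4 processors? (Illustration only.) */
#include <stdio.h>

int main(void) {
    int block = 25;
    for (int i = 1; i < 100; i++) {
        int owner_a = i / block;            /* owner of a(i)   */
        int owner_b = (i - 1) / block;      /* owner of b(i-1) */
        if (owner_a != owner_b)
            printf("I = %d: b(%d) must be sent from P%d to P%d\n",
                   i, i - 1, owner_b, owner_a);
    }
    return 0;                               /* prints I = 25, 50, 75 */
}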
• Issues for compiling for distributed memory machines:
– Data partitioning
– Computation partitioning
– Communication optimization
– Efficient code generation
• To motivate the problem, let us look at a straightforward implementation of computation partitioning (Pingali, Cornell, 1992):
• Assume that every processor has a ghost image of all the shared arrays.
For (I = 1; I < 100; I++)
    If (this processor owns b(I-1) but not a(I)) then
        send b(I-1) to the processor that owns a(I)
    If (this processor owns a(I) but not b(I-1)) then
        receive b(I-1) from the processor that owns b(I-1)
    If (this processor owns a(I)) then perform a(I) = b(I-1)
• This assumption (a ghost copy of every shared array on every processor) is exactly what we don’t want.
• Too many communications with small messages; a message-vectorized alternative is sketched below.
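A hedged sketch (my illustration, not the lecture's generated code) of what the compiler would rather produce for a(I) = b(I-1) under the block distribution shown above: the single boundary exchange is hoisted out of the loop, and each processor iterates only over the elements it owns. MPI, the block size of 25, and the function name are assumptions.

/* Illustrative SPMD code: 4 processes, 100 elements, block size 25. */
#include <mpi.h>

#define BLOCK 25

void shift_update(double a[BLOCK], double b[BLOCK]) {
    int me, np;
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    int left  = (me > 0)      ? me - 1 : MPI_PROC_NULL;
    int right = (me < np - 1) ? me + 1 : MPI_PROC_NULL;
    double left_b = 0.0;        /* b element owned by my left neighbor */

    /* Communication hoisted out of the loop: one boundary exchange
       instead of a per-iteration test and send. */
    MPI_Sendrecv(&b[BLOCK - 1], 1, MPI_DOUBLE, right, 0,
                 &left_b,       1, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Owner-computes: update only the locally owned a elements.
       Global I starts at 1, so P0 skips its local index 0. */
    int first = (me == 0) ? 1 : 0;
    for (int i = first; i < BLOCK; i++)
        a[i] = (i == 0) ? left_b : b[i - 1];
}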
• Data partitioning:
– Specified by hand – HPF (High Performance Fortran), Fortran D, Craft Fortran, Fortran 90D.
• Allows manual specification of data alignment and data distribution.
• Block, cyclic, and block-cyclic distributions (ownership computation is sketched after this list).
– Automatically determined by the compiler:
• Done in a number of research compilers (Fx (CMU), Paradigm (UIUC), …).
• The compiler looks at the program and determines the data distribution that minimizes the communication cost.
– Different loops usually require different distributions.
– Fancy distributions are usually useless.
– Related to another difficult problem: communication optimization.
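A small hedged sketch (helper names assumed, not taken from HPF) of how ownership is computed for a 1-D array of n elements over p processors under the three standard distributions:

/* Illustrative ownership functions for a 1-D distributed array. */

/* Block distribution: contiguous chunks of ceil(n/p) elements. */
int block_owner(int i, int n, int p) {
    int block = (n + p - 1) / p;        /* block size, rounded up */
    return i / block;
}

/* Cyclic distribution: element i goes to processor i mod p. */
int cyclic_owner(int i, int p) {
    return i % p;
}

/* Block-cyclic distribution with block size b:
   blocks of b consecutive elements are dealt out round-robin. */
int block_cyclic_owner(int i, int b, int p) {
    return (i / b) % p;
}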
• Computation partitioning:
– With the owner-computes rule, the computation partitioning depends only on the data partitioning (see the sketch after this list).
• Communication is only needed for reads.
– Other computation partitioning techniques are also possible.
• Communication may then also be needed for writes.
– The basic method for compiling a program is the same regardless of the computation partitioning method.
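A hedged sketch (block distribution, identically distributed arrays, and the helper parameters are all assumptions) of how the owner-computes rule turns a global loop into a local one: the compiler first guards each assignment with an ownership test, then folds the guard into the loop bounds.

/* Illustrative owner-computes partitioning for a block distribution. */

/* Step 1: guard each statement with "do I own the left-hand side?". */
void guarded(double a[], double c[], int n, int my_rank, int block) {
    for (int i = 0; i < n; i++)
        if (i / block == my_rank)       /* owner of a(i) */
            a[i % block] = 2.0 * c[i % block];
}

/* Step 2: fold the guard into the loop bounds, so each processor
   iterates only over the indices it owns. */
void shrunk(double a[], double c[], int n, int my_rank, int block) {
    int lo = my_rank * block;
    int hi = lo + block;
    if (hi > n) hi = n;
    for (int i = lo; i < hi; i++)
        a[i - lo] = 2.0 * c[i - lo];
}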
• Communication optimization:
– Message vectorization: combine the small messages issued inside a loop into one large message sent before the loop.
– Redundant communication elimination: eliminate communication of data that has already been communicated.
• Can be done partially.
– Communication scheduling: combine messages with the same pattern into one larger message.
– Overlap communication with computation (see the sketch after this list).
• Separate the send from the matching receive as far as possible.
– Exploit pipelined communication.
• Allow small messages instead of large messages.
– Generate efficient communication code:
• Buffer management
• Packing and unpacking of messages
• Fancy data distribution methods usually result in high overheads.
• Consider the features of the underlying communication system.
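A hedged sketch (MPI nonblocking calls; buffer and function names are my own) of overlapping communication with computation by separating the start of the receive from the wait, so the interior work hides the message latency:

/* Illustrative overlap of communication and computation.
   The matching MPI_Send on the neighboring process is omitted. */
#include <mpi.h>

void overlapped_step(double out[], const double in[], int n,
                     double *halo, int neighbor) {
    MPI_Request req;

    /* Start receiving the remote boundary value early ... */
    MPI_Irecv(halo, 1, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &req);

    /* ... overlap it with the interior computation, which does not
       need the remote value ... */
    for (int i = 1; i < n; i++)
        out[i] = 0.5 * (in[i] + in[i - 1]);

    /* ... and wait only right before the remote value is used. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    out[0] = 0.5 * (in[0] + halo[0]);
}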
• General note: compiling for distributed memory machines is very difficult; the main results came from the HPF group at Rice University.
– Future of HPF: