Transcript Document

The University of New Mexico
Software Optimization
Vikas, Chaudhary
EECE 360 –MA
Prof.471
Schamiloglu
The University of New Mexico
High-level
code
Intermediate code
Object code
Assembler
Assembly Code
Steps to create an executable
EECE 360 –MA
Prof.471
Schamiloglu
Executable
The University of New Mexico
Higher Optimizations
Procedure within basic blocks.
Procedure within single and nested loop structures.
Entire procedure including all blocks and structures.
File (inter-procedural analysis within a source file)
Cross file (inter-procedural analysis across all procedures)
EECE 360 –MA
Prof.471
Schamiloglu
The University of New Mexico
Compiler Options
There are no strict rules about what each level of
optimization means but generally
O0 does one to many translations.
O1 does basic block optimizations.
O2 does loop optimizations.
O4 does interfile optimizations.
Some compilers also provide +odataprefetch to indicate
that prefetch instructions should be inserted to
prefetch data from memory to cache.
EECE 360 –MA
Prof.471
Schamiloglu
The University of New Mexico
Increasing Register Pressure
When too many registers are needed, compilers must store values to
memory and restores values from memory. This degrades the
performance.
If we generate assembly code from compiler via –S and see that there
is an inordinate number of load and store instructions then it is implied
that compiler is generating too many spills.
Use register data type carefully.
EECE 360 –MA
Prof.471
Schamiloglu
The University of New Mexico
Dead Code Elimination
Dead code Elimination is merely the removal of code that is never used.
i=0
If (i!=0) deadcode(i);
EECE 360 –MA
Prof.471
Schamiloglu
The University of New Mexico
Constant Folding and Propagation
Constant folding is when expressions with multiple constants are
folded together and evaluated at compile time.
a = 1+ 2
Will be replaced by a = 3.
EECE 360 –MA
Prof.471
Schamiloglu
The University of New Mexico
Common Subexpression elimination
Common subexpression elimination analyzes lines of code,
determines where identical subexpressions are used and creates
a temporary variable to hold one instance of these values.
a = b + (c + d)
f = e + (c + d)
EECE 360 –MA
Prof.471
Schamiloglu
The University of New Mexico
Strength Reductions
Strength reduction means replacing expensive
operations with cheaper ones.
Replacing integer multiplication or division by constants
with shift operations.
Replacing 32-bit integer division by 64-bit floating point
division.
Replacing floating point multiplications by small constants
with floating point additions.
Replacing power function by floating point multiplications.
EECE 360 –MA
Prof.471
Schamiloglu
The University of New Mexico
Filling Branch Delay Slots
Branch delay slots are the instructions after a branch
that are always executed.
If the compiler is used with no optimization, it will
probably insert a nop into branch delay slot.
EECE 360 –MA
Prof.471
Schamiloglu
The University of New Mexico
Induction Variable Optimization
for (i=0 ; i< n ; i +=2)
ia[i] = i * k + m;
Where i is induction variable.
The above code can be replaced by
ic = m
for (i = 0; i< n ; i += 2)
{
ia[i] = ic;
ic = ic + k;
}
EECE 360 –MA
Prof.471
Schamiloglu
The University of New Mexico
Loop Fission
This technique is often used when an inner loop consists of a
large number of lines and the compiler has difficulty
generating code without spilling.
This technique is also helpful in improving cache
performance.
for(i = 0; i < n; i++)
Y[i] = y[i] + x[i] + x[i+m]
Suppose x[i] and x[i +m] maps to same cache location.
(Direct mapped cache). This will cause cache thrashing.
EECE 360 –MA
Prof.471
Schamiloglu
The University of New Mexico
Loop can be split as
for(i = 0; i < n; i++)
y[i] = y[i] + x[i];
for(i = 0; i < n; i++)
y[i] = y[i] + x[i + m];
This technique might not be very useful when cache is n-way
set associative.
EECE 360 –MA
Prof.471
Schamiloglu
The University of New Mexico
Loop Unrolling
This technique reduces the effect of branches, instruction latency, and
potentially the number of cache misses.
Do I = 1, N
Y(I) = X(I)
ENDDO
After Unrolling
NEND = 4 * (N/4)
Do I = I, N , 4
Y(I) = X(I)
Y(I + 1) = X(I + 1)
Y(I + 1) = X(I + 1)
ENDDO
Do I = NEND+1 , N
Y(I) = X(I)
ENDDO
EECE 360 –MA
Prof.471
Schamiloglu
The University of New Mexico
• Loading all the values of X before the values of Y reduces the
possibility of cache thrashing.
• Amount of unrolling can decrease the number of software prefetch
instructions.
• Excessive unrolling will cause data to be spilled from register to
memory.
• Unrolling increases size of object code, which might cause too many
instruction cache misses.
EECE 360 –MA
Prof.471
Schamiloglu
The University of New Mexico
Clock Cycles in an Unrolled Loop
Original order
CC
Modified order
CC
Load X(1)
1
Load X(1)
1
Store Y(1)
7
Load X(2)
2
Load X(2)
8
Load X(3)
3
Store X(2)
14
Load X(4)
4
Load X(3)
15
Store X(1)
7
Store X(3)
21
Store X(2)
8
Load X(4)
22
Store X(3)
9
Store X(4)
28
Store X(4)
10
EECE 360 –MA
Prof.471
Schamiloglu
The University of New Mexico
Loop peeling
This technique is used by compilers to handle boundary conditions.
Do I = 1, N
if(I .EQ. 1) then
X[I] = 0
ELSEIF (I. EQ. N) THEN
X(I) = N
ELSE
X(I) = X (I) + Y(I)
ENDDO
AFTER LOOP PEELING
X(1) = 0
Do I = 2, N-1
X(I) = X(I) + Y(I)
ENDDO
X(N) = N
EECE 360 –MA
Prof.471
Schamiloglu
The University of New Mexico
Software Pipelining
Software pipelining is a technique for recognizing loops such that each
iteration in the software-pipelined code is made from instructions
chosen from different iterations of the original loop.
Iteration 0
Iteration 1
Iteration 2
Iteration 3
EECE 360 –MA
Prof.471
Schamiloglu
The University of New Mexico
Software pipeline is an optimization that is impossible to
duplicate with high level code since the multiple
assembly language instruction that a single line of high
level language creates are moved around extensively.
Software pipeline is created only at high optimization
level.
EECE 360 –MA
Prof.471
Schamiloglu
The University of New Mexico
Compiler Speculation with Hardware Support
Modern compilers try to speculate either to improve the
scheduling or to increase issue rate.
Hurdle
Conditional instructions.
In moving instructions across a branch the compiler must
ensure that exception behavior is not changed and dynamic
data dependence remains same.
Compiler also finds out, which registers are not being used
and those registers are renamed.
EECE 360 –MA
Prof.471
Schamiloglu
The University of New Mexico
Thank you!
EECE 360 –MA
Prof.471
Schamiloglu