Compiler Research in HPC Lab


Compiler Research in HPC Lab
R. Govindarajan
High Performance Computing Lab.
[email protected]
Organization

- HPC Lab Research Overview
- Compiler Analysis & Optimizations
  - Precise Dataflow Analysis
- Energy Reduction for Embedded Systems
  - Array Allocation for Partitioned Memory Architectures
  - Dynamic Voltage Scaling
- Integrated Spill Code Generation & Scheduling
- Conclusions
HPC Team (or HPC-XI)

Coach:
- R. Govindarajan

Team:
- Mruggesh Gajjar
- B.C. Girish
- R. Karthikeyan
- R. Manikantan
- Santosh Nagarakatte
- Rupesh Nasre
- Sreepathi Pai
- Kaushik Rajan
- T.S. Rajesh Kumar
- V. Santhosh Kumar
- Aditya Thakur
HPC Lab Research Overview

- Compiler Optimizations: traditional analysis & optimizations, power-aware compiling techniques, compilation techniques for embedded systems
- Computer Architecture: superscalar architecture, architecture-compiler interaction, application-specific processors, embedded systems
- High Performance Computing: cluster computing, HPC applications
Compiler Research in HPC Lab

- ILP Compilation Techniques
- Compiling Techniques for Embedded Systems
- Compiling Techniques for Application-Specific Systems
- Dataflow Analysis
ILP Compilation Techniques

- Instruction scheduling
- Software pipelining
- Register allocation
- Power/energy-aware compilation techniques
- Compiling techniques for embedded systems / application-specific processors (DSP, network processors, ...)
Compiling Techniques for Embedded Systems

- Power-aware software pipelining method (using an integer linear program formulation)
- Simple Offset Assignment for code-size reduction
- Loop transformation and memory bank assignment for power reduction
- Compiler-assisted Dynamic Voltage Scaling
- Memory layout problem for embedded systems
- MMX code generation using vectorization
Compiling Techniques for Application-Specific Systems

- Framework for exploring the application design space for network applications
- Compiling techniques for streaming applications and program models
  - Buffer-aware, schedule-size-aware, throughput-optimal schedules
Compiler Analysis

- Precise Dataflow Analysis
- Pointer Analysis
So, What is the Connection?

Compiler problems are:

- Optimization problems, solved by formulating the problem as an Integer Linear Program.
  - Involves non-trivial effort!
  - An efficient formulation is needed to reduce execution time!
  - Other evolutionary approaches can also be used.
- Graph-theoretic problems, which leverage existing, well-known approaches.
- Problems modelled using automata: an elegant formulation that helps ensure correctness.

Precise Dataflow Analysis

The Problem: improve the precision of the data-flow analysis used in compiler optimization.

Constant Propagation example (the CFG from the slide):

  start -> A: ...
  A -> B: x=1;   A -> C: x=2;
  B -> D: ... {x = 1}   C -> D: ... {x = 2}   merge at D: {x = nc}
  D -> E: ... -> F: ... -> G: y = x;
  G -> H: ...   G -> I: ...
  H -> J: ...   I -> J: ...   J -> end

We can't replace the use of x at G with a constant.

(... : statements unrelated to x or y;  nc : not constant;  { } : data-flow information)
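The precision loss at the merge can be reproduced with a minimal constant-propagation lattice sketch (our illustration, not the lab's implementation; `meet`, `NC` and the variable names are ours):

```python
NC = "nc"  # bottom element: "not a constant"

def meet(a, b):
    """Combine two data-flow facts for x at a control-flow merge.

    None means "no information" (e.g. an unreachable predecessor)."""
    if a is None:
        return b
    if b is None:
        return a
    return a if a == b else NC

from_B = 1            # after B: {x = 1}
from_C = 2            # after C: {x = 2}
at_D = meet(from_B, from_C)   # "nc" -- the merge loses precision,
                              # so y = x at G cannot be folded

# After restructuring, each duplicate of D sees only one predecessor:
at_D1 = meet(from_B, None)    # 1 -- G1 can use the constant 1
at_D2 = meet(None, from_C)    # 2 -- G2 can use the constant 2
```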
Overview of our Solution

In the restructured (split) graph, the region from D to G is duplicated along the two incoming paths: D1 -> E1 -> F1 -> G1 carries {x = 1} and D2 -> E2 -> F2 -> G2 carries {x = 2}, while A, B, C, H, I and J remain single copies (A0, B0, C0, H0, I0, J0). We can now replace the uses of x at G1 and G2 with constants.
Challenges

- The Problem: improve the precision of data-flow analysis.
- Approach: restructure the control flow of the program.
- We developed a generic framework that:
  - guarantees optimization opportunities,
  - handles the precision / code-size trade-off,
  - and keeps the approach simple and clean.
A brief look at our example:

- At the control-flow merge D, {x = 1} (from B) meets {x = 2} (from C), and we lose precision: after D, x is not constant.
- We need to duplicate the region from D to G in order to optimize node G...
- ...such that paths with differing data-flow information do not intersect.
- There is no need to duplicate the remaining nodes (A, B, C, H, I and J).
Control-flow Graph = Automaton

View a control-flow graph G as a finite automaton with:
- states as nodes,
- the start state as the entry node,
- the accepting state as the exit node,
- transitions as the edges.
The Automaton

The split automaton for D has three states. From state 0, edge B-D leads to state 1 and edge C-D leads to state 2, recording which predecessor entered D; from states 1 and 2, edges G-H and G-I lead back to state 0 once control passes the use at G. All other CFG edges leave the automaton state unchanged.
CFG x Automaton = Split Graph

The product of the CFG with the split automaton for D is the split graph: A0, B0 and C0 in state 0; two copies of the D-to-G region, D1 -> E1 -> F1 -> G1 (entered via B-D, state 1) and D2 -> E2 -> F2 -> G2 (entered via C-D, state 2); and H0, I0 and J0 back in state 0 after the G-H and G-I edges.
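The product construction can be sketched in a few lines of Python; this is a minimal illustration assuming the example CFG and the split automaton for D described above (function and variable names are ours):

```python
from collections import deque

# CFG of the running example (successor lists).
cfg = {
    "start": ["A"], "A": ["B", "C"], "B": ["D"], "C": ["D"],
    "D": ["E"], "E": ["F"], "F": ["G"], "G": ["H", "I"],
    "H": ["J"], "I": ["J"], "J": ["end"], "end": [],
}

# Split automaton for D, keyed by (state, CFG edge).  Edges not listed
# leave the automaton state unchanged.
delta = {
    (0, ("B", "D")): 1, (0, ("C", "D")): 2,
    (1, ("G", "H")): 0, (1, ("G", "I")): 0,
    (2, ("G", "H")): 0, (2, ("G", "I")): 0,
}

def split_graph(cfg, delta, entry="start", state0=0):
    """Product of CFG and automaton: nodes are (cfg_node, state) pairs."""
    start = (entry, state0)
    edges, seen, work = [], {start}, deque([start])
    while work:
        node, state = work.popleft()
        for succ in cfg[node]:
            nstate = delta.get((state, (node, succ)), state)
            tgt = (succ, nstate)
            edges.append(((node, state), tgt))
            if tgt not in seen:
                seen.add(tgt)
                work.append(tgt)
    return edges

edges = split_graph(cfg, delta)
# D, E, F, G are duplicated (states 1 and 2); H, I, J stay in state 0.
```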
Energy Reduction: Array Allocation for Partitioned Memory Architectures

- Dynamic energy reduction in the memory subsystem:
  - the memory subsystem consumes significant energy,
  - many embedded applications are array intensive,
  - the memory architecture has multiple banks.
- Exploit the various low-power modes of partitioned memory architectures:
  - put idle memory banks in low-power mode,
  - allocate arrays to memory banks such that more memory banks can stay in low-power mode for longer durations.
Partitioned Memory Architectures

- Memory banks with low-power modes: Active, Standby, Napping, Power-down, Disabled.
- Resynchronization time: the time to move from a lower power mode back to Active mode.

  Mode        Resynch. Time (cycles)   Energy Consumed (nJ)
  Active              0                     0.718
  Standby             2                     0.468
  Napping            30                     0.0206
  Power Down       9000                     0.00875
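As a back-of-envelope illustration of the trade-off (our model, not from the slides): if we charge the resynchronization period at Active-mode energy, a bank must stay idle for some minimum number of cycles before a low-power mode pays off:

```python
MODES = {
    # mode: (resynch_cycles, energy_nJ_per_cycle) -- values from the table
    "active":     (0,    0.718),
    "standby":    (2,    0.468),
    "napping":    (30,   0.0206),
    "power_down": (9000, 0.00875),
}

def breakeven_idle_cycles(mode):
    """Idle cycles beyond which `mode` beats staying Active.

    We charge n idle cycles at the mode's energy plus the resynchronization
    period at Active energy, and compare with n cycles at Active energy:
        n * e_low + resynch * e_active < n * e_active
    """
    resynch, e_low = MODES[mode]
    e_active = MODES["active"][1]
    return resynch * e_active / (e_active - e_low)

# Under this model, Standby pays off after ~6 idle cycles, Napping
# after ~31, and Power-down only after ~9100.
```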
Motivating Example

Example:

  float a[N], d[N];
  double b[N], c[N];
  L1: for (ia = 0; ia < N; ia++) d[ia] = a[ia] + k;
  L2: for (ia = 0; ia < N; ia++) a[ia] = b[ia] * k;
  L3: for (ia = 0; ia < N; ia++) c[ia] = d[ia] / k;
  L4: for (ia = 0; ia < N; ia++) b[ia] = c[ia] - k;
  L5: for (ia = 0; ia < N; ia++) b[ia] = d[ia] + k;

The Array Relation Graph (ARG) connects arrays a, b, c and d, with edge weights (N, 2N, 4N, N and 8N in the figure) reflecting how often two arrays are accessed together.

Arrays a, d ~ 1 MB each; arrays b, c ~ 2 MB each; memory bank size = 4 MB. With a naive allocation, memory banks are active for a total of 32N cycles!
Motivating Example -- Our Approach

- Array allocation requires partitioning the ARG!
- Partition the graph such that each subgraph can be accommodated in a memory bank.
- The weight of the edges across subgraphs is the cost of keeping multiple banks active together. Minimize it!
- Result: arrays b and c go in one subgraph, and a and d in another.
- Memory banks are now active for a total of only 23N cycles!
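The bank-assignment step can be sketched as a small brute-force graph partition. Note that the slide's exact weight-to-edge mapping is not recoverable from the transcript, so the weights below are hypothetical, chosen only to illustrate the shape of the optimization:

```python
from itertools import combinations

# Array sizes (MB) and bank capacity from the slides.
sizes = {"a": 1, "d": 1, "b": 2, "c": 2}
BANK = 4

# HYPOTHETICAL edge weights (in units of N) over the ARG's edges.
edges = {("a", "d"): 8, ("b", "c"): 4, ("a", "b"): 1,
         ("c", "d"): 2, ("b", "d"): 1}

def cut_weight(part):
    """Total weight of ARG edges crossing the bank boundary."""
    return sum(w for (u, v), w in edges.items() if (u in part) != (v in part))

best = None
arrays = sorted(sizes)
for r in range(1, len(arrays)):
    for part in combinations(arrays, r):
        rest = [x for x in arrays if x not in part]
        # both sides must fit in a 4 MB bank
        if sum(sizes[x] for x in part) <= BANK and \
           sum(sizes[x] for x in rest) <= BANK:
            c = cut_weight(set(part))
            if best is None or c < best[0]:
                best = (c, set(part))
# With these weights, the best split pairs a with d and b with c.
```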
Dynamic Voltage Scaling

- Dynamically vary the CPU frequency and supply voltage.
- Dynamic power is proportional to C * V^2 * f, where
  - C is the capacitance,
  - V is the supply voltage,
  - f is the operating frequency.
- Processors support different voltage (and frequency) modes and can switch between them.
- AMD, Transmeta and XScale processors provide support for DVS and have multiple operating frequencies.
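A quick worked example of the formula (our numbers, not the slides'): because the supply voltage can usually be lowered together with the frequency, the energy of a fixed workload drops roughly quadratically with frequency, at the cost of a longer execution time:

```python
def energy_and_time(cycles, f, v, c=1.0, k=1.0):
    """Dynamic energy and execution time of a fixed workload."""
    power = k * c * v**2 * f    # dynamic power ~ C * V^2 * f
    time = cycles / f           # execution time at frequency f
    return power * time, time   # energy = power * time

e_full, t_full = energy_and_time(1e9, f=1.0, v=1.0)
e_half, t_half = energy_and_time(1e9, f=0.5, v=0.5)  # assume V scales with f
# Half frequency: 2x the execution time, but only 1/4 the energy.
```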
Compiler Assisted DVS

- Identify program regions where DVS can be performed.
- For each program region, identify the voltage (frequency) mode to operate in, such that energy is minimized.
- Ensure that performance is not degraded.
Motivating Example

(The slide's table lists, for program regions P1 through P5, the execution time and energy at operating frequencies of 200, 300 and 400, together with the per-region frequency chosen by DVS. The takeaway: by choosing a frequency per region, DVS obtains roughly a 30% decrease in energy for only about a 2% increase in execution time.)
DVS Problem Formulation

- The program is divided into a number of regions.
- Assign an operating frequency to each program region.
- Constraint: only a marginal increase in the execution time of the program.
- Objective: minimize the program's energy consumption.
- This is a Multiple-Choice Knapsack Problem.
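The formulation can be illustrated with a tiny exhaustive multiple-choice knapsack solver (the region/option numbers below are hypothetical; a real implementation would use dynamic programming or an ILP solver rather than enumeration):

```python
from itertools import product

def dvs_mckp(regions, time_budget):
    """regions: one list of (exec_time, energy) options per program region.

    Pick exactly one option per region, minimizing total energy subject to
    the execution-time budget -- a multiple-choice knapsack."""
    best = None
    for choice in product(*regions):      # one option per region
        time = sum(t for t, _ in choice)
        energy = sum(e for _, e in choice)
        if time <= time_budget and (best is None or energy < best[1]):
            best = (choice, energy)
    return best

# Two hypothetical regions with (exec_time, energy) options at three
# frequencies (fastest first); slower options take longer but save energy.
regions = [
    [(76, 198), (100, 149), (151, 82)],
    [(50, 120), (70, 90), (105, 60)],
]
choice, energy = dvs_mckp(regions, time_budget=160)
# Exhaustive search is exponential in the number of regions; it is only
# meant to show the structure of the problem.
```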
Compiler Problem as Optimization Problem

- Integrated register allocation, spill code generation and scheduling in a software-pipelined loop.
- Problem: given a machine M, a loop L, and a software-pipelined schedule S with initiation interval II, perform register allocation and generate spill code, if necessary, and schedule it such that the register requirement of the schedule <= the number of registers and the resource constraints are met!
Modeling Live Ranges

(Figure: a live range A, defined at time step 0 and used at time steps 4 and 7, is modelled by 0-1 decision variables TN A,r,t -- "live range A resides in register r at time t" -- one per register R0 ... Rn and per time step 0 ... 7.)
Modeling Spill Stores

(Figure: 0-1 store decision variables STN A,r,t, one per register and time step between the definition and the uses, model the possible spill stores of live range A. Latencies: load: 1, store: 1, instruction: 1.)
Modeling Spill Loads

(Figure: 0-1 load decision variables LTN A,r,t model the possible spill loads that bring live range A back into a register before each use.)
Constraints -- Overview

- Every live range must be in a register at its definition time and at each use time.
- A spill load can take place only if the corresponding spill store has already taken place.
- After a spill store, a live range can continue to live or cease to exist.
- Ensure that the spill loads and stores don't saturate the memory units.
- Minimize the number of spill loads and stores.
Objective

- There is no objective function: it is just a constraint-solving problem!
- Alternatively, minimize the number of spill loads and stores:
  Sum over i, r, t of (STN i,r,t + LTN i,r,t)
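As a toy stand-in for the 0-1 formulation (our sketch, not the actual ILP), the following enumerates register-residency patterns for two live ranges competing for one register and minimizes the implied spill loads and stores:

```python
from itertools import product

T = 5                       # time steps 0..4
ranges = {"A": (0, [4]),    # A: defined at 0, used at 4
          "B": (1, [3])}    # B: defined at 1, used at 3
NUM_REGS = 1

def spill_ops(residency, d):
    """Spill stores/loads implied by a 0/1 residency vector for one range.

    A load opens every residency segment except the one opened at the def;
    one store suffices to keep the value alive in memory in between."""
    loads = sum(1 for t in range(T)
                if residency[t] and (t == 0 or not residency[t - 1])
                and t != d)
    stores = 1 if loads else 0
    return stores + loads

best = None
patterns = list(product([0, 1], repeat=T))
for rA, rB in product(patterns, patterns):
    res = {"A": rA, "B": rB}
    # each live range must be in a register at its def and at every use
    if any(not res[i][d] or any(not res[i][u] for u in uses)
           for i, (d, uses) in ranges.items()):
        continue
    # register capacity at every time step
    if any(rA[t] + rB[t] > NUM_REGS for t in range(T)):
        continue
    cost = sum(spill_ops(res[i], ranges[i][0]) for i in ranges)
    if best is None or cost < best[0]:
        best = (cost, res)
# Minimum: spill A around B's lifetime -- one store plus one load.
```

The real formulation adds the scheduling dimension (memory-unit usage per cycle, load/store latencies) as further linear constraints over the same 0-1 variables.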
Conclusions

- Compiler research is fun!
- It is cool to do compiler research!
- But remember Proebsting's Law: compiler technology doubles CPU power every 18 YEARS!!
- Plenty of opportunities in compiler research!
- However, NO VACANCY in the HPC lab this year! :-)