Commutativity Analysis: A New Analysis
Framework for Parallelizing Compilers
Martin C. Rinard
Pedro C. Diniz
University of California, Santa Barbara
Santa Barbara, California 93106
{martin,pedro}@cs.ucsb.edu
http://www.cs.ucsb.edu/~{martin,pedro}
Goal
Develop a Parallelizing Compiler for Object-Oriented
Computations
• Current Focus
• Irregular Computations
• Dynamic Data Structures
• Future
• Persistent Data
• Distributed Computations
• New Analysis Technique:
Commutativity Analysis
Structure of Talk
• Model of Computation
• Example
• Commutativity Testing
• Steps To Practicality
• Experimental Results
• Conclusion
Model of Computation
[Diagram: an executing operation reads the initial object state, invokes further operations, and produces a new object state; the computation consists of operations acting on objects]
Graph Traversal Example
class graph {
  int val, sum;
  graph *left, *right;
};

void graph::traverse(int v) {
  sum += v;
  if (left != NULL) left->traverse(val);
  if (right != NULL) right->traverse(val);
}
[Diagram: example graph whose nodes hold val and sum fields]
Goal
Execute left and right traverse operations in parallel
Parallel Traversal
[Diagram: successive snapshots of an example graph during a parallel traversal, showing the val and sum fields of each node as traverse operations execute]
Commuting Operations in Parallel Traversal
[Diagram: two traverse operations update the sum field of the same node in different orders; both orders yield the same final object state, so the operations commute]
Model of Computation
• Operations: Method Invocations
• In Example: Invocations of graph::traverse
• left->traverse(3)
• right->traverse(2)
• Objects: Instances of Classes
• In Example: Graph Nodes
• Instance Variables Implement Object State
• In Example: val, sum, left, right
Separable Operations
Each Operation Consists of Two Sections
• Object Section: Only Accesses Receiver Object
• Invocation Section: Only Invokes Operations
• Both Sections Can Access Parameters
Basic Approach
• Compiler Chooses A Computation to Parallelize
• In Example: Entire graph::traverse Computation
• Compiler Computes Extent of the Computation
• Representation of all Operations in Computation
• Current Representation: Set of Methods
• In Example: { graph::traverse }
• Do All Pairs of Operations in Extent Commute?
• No - Generate Serial Code
• Yes - Generate Parallel Code
• In Example: All Pairs Commute
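The all-pairs check can be sketched as a small driver loop (a minimal sketch with our own hypothetical names, not the compiler's actual code). Note that each method must also be tested against itself, since two invocations of the same method may execute in parallel:

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Given the extent (the set of methods in the computation) and a
// predicate that tests whether two methods commute, parallelize only
// if every pair of operations in the extent commutes.
bool all_pairs_commute(
    const std::vector<std::string> &extent,
    const std::function<bool(const std::string &, const std::string &)> &commute) {
    for (std::size_t i = 0; i < extent.size(); i++)
        for (std::size_t j = i; j < extent.size(); j++)  // j starts at i:
            if (!commute(extent[i], extent[j]))          // a method must also
                return false;                            // commute with itself
    return true;  // all pairs commute: generate parallel code
}
```

In the example the extent is { graph::traverse }, so the only pair tested is graph::traverse against itself.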
Code Generation
For Each Method in Parallel Computation
• Augments Class Declaration With Mutual Exclusion Lock
• Generates Driver Version of Method
• Invoked from Serial Code to Start Parallel Execution
• Invokes Parallel Version of Operation
• Waits for Entire Parallel Computation to Finish
• Generates Parallel Version of Method
Object Section
• Lock Acquired at Beginning
• Lock Released at End
• Ensure Atomic Execution
Invocation Section
• Invoked Operations Execute in Parallel
• Invokes Parallel Version
Code Generation In Example
Class Declaration
class graph {
  lock mutex;
  int val, sum;
  graph *left, *right;
};
Driver Version
void graph::traverse(int v){
parallel_traverse(v);
wait();
}
Parallel Version In Example
void graph::parallel_traverse(int v) {
mutex.acquire();
sum += v;
mutex.release();
if (left != NULL)
spawn(left->parallel_traverse(val));
if (right != NULL)
spawn(right->parallel_traverse(val));
}
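The generated code relies on spawn and wait primitives from the run-time library. As a rough illustration only (using std::thread instead of the authors' run-time system, and joining children directly so the driver's wait() becomes implicit):

```cpp
#include <mutex>
#include <thread>

struct graph {
    std::mutex mutex;                       // lock added to the class declaration
    int val = 0, sum = 0;
    graph *left = nullptr, *right = nullptr;

    void parallel_traverse(int v) {
        {   // object section: executes atomically under the per-object lock
            std::lock_guard<std::mutex> g(mutex);
            sum += v;
        }
        // invocation section: recursive operations run in parallel;
        // val is never written in the extent, so reading it unlocked is safe
        std::thread t1, t2;
        if (left)  t1 = std::thread(&graph::parallel_traverse, left, val);
        if (right) t2 = std::thread(&graph::parallel_traverse, right, val);
        if (t1.joinable()) t1.join();
        if (t2.joinable()) t2.join();
    }

    // driver version: called from serial code, returns only when the
    // entire parallel computation has finished
    void traverse(int v) { parallel_traverse(v); }
};

// Builds a small DAG (two parents sharing one child) and returns the
// shared node's sum; the result is deterministic because += commutes.
int demo() {
    graph root, a, b, shared;
    root.val = 1; a.val = 2; b.val = 3;
    root.left = &a; root.right = &b;
    a.left = &shared; b.left = &shared;
    root.traverse(5);
    return shared.sum;   // 2 + 3, in either order
}
```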
Compiler Structure
[Flow diagram:]
• Computation Selection: Entire Computation of Each Method
• Extent Computation: Traverse Call Graph to Extract Extent
• Commutativity Testing: All Pairs of Operations in Extent
• If All Operations Commute: Generate Parallel Code
• If Operations May Not Commute: Generate Serial Code
Traditional Approach
• Data Dependence Analysis
• Analyzes Reads and Writes
• Independent Pieces of Code Execute in Parallel
• Demonstrated Success for Array-Based Programs
Data Dependence Analysis in Example
• For Data Dependence Analysis To Succeed in Example
• left and right traverse Must Be Independent
• left and right Subgraphs Must Be Disjoint
• Graph Must Be a Tree
• Depends on Global Topology of Data Structure
• Analyze Code that Builds Data Structure
• Extract and Propagate Topology Information
• Fails For Graphs
Properties of Commutativity Analysis
• Oblivious to Data Structure Topology
• Local Analysis
• Simple Analysis
• Wide Range of Computations
• Lists, Trees and Graphs
• Updates to Central Data Structure
• General Reductions
• Introduces Synchronization
• Relies on Commuting Operations
Commutativity Testing
Commutativity Testing Conditions
• Do Two Operations A and B Commute?
• Compiler Considers Two Execution Orders
• A;B - A executes before B
• B;A - B executes before A
• Compiler Must Check Two Conditions
• Instance Variables: New values of instance variables are the same in both execution orders
• Invoked Operations: A and B together directly invoke the same set of operations in both execution orders
Commutativity Testing Algorithm
• Symbolic Execution:
• Compiler Executes Operations
• Computes with Expressions not Values
• Compiler Symbolically Executes Operations
In Both Execution Orders
• Expressions for New Values of Instance Variables
• Expressions for Multiset of Invoked Operations
Expression Simplification and Comparison
• Compiler Applies Rewrite Rules to Simplify Expressions
• a*(b+c) → (a*b)+(a*c)
• b+(a+c) → (a+b+c)
• a+if(b<c,d,e) → if(b<c,a+d,a+e)
• Compiler Compares Corresponding Expressions
• If All Equal - Operations Commute
• If Not All Equal - Operations May Not Commute
Commutativity Testing Example
• Two Operations
r->traverse(v1) and r->traverse(v2)
• In Order r->traverse(v1);r->traverse(v2)
Instance Variables
New sum=
(sum+v1)+v2
Invoked Operations
if(right!=NULL,right->traverse(val)),
if(left!=NULL,left->traverse(val)),
if(right!=NULL,right->traverse(val)),
if(left!=NULL,left->traverse(val))
• In Order r->traverse(v2);r->traverse(v1)
Instance Variables
New sum=
(sum+v2)+v1
Invoked Operations
if(right!=NULL,right->traverse(val)),
if(left!=NULL,left->traverse(val)),
if(right!=NULL,right->traverse(val)),
if(left!=NULL,left->traverse(val))
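For this example, the instance-variable check reduces to the flattening rule b+(a+c) → (a+b+c): flatten each sum into its terms and compare the resulting term multisets. A minimal sketch (our own simplification, not the compiler's expression representation):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// A sum expression is represented as the list of its terms; sorting the
// terms implements the flattening/reassociation rewrite, so two sums
// are equal exactly when they contain the same multiset of terms.
using Sum = std::vector<std::string>;

Sum normalize(Sum s) {
    std::sort(s.begin(), s.end());
    return s;
}

bool same_new_value(const Sum &order1, const Sum &order2) {
    return normalize(order1) == normalize(order2);
}
```

Here (sum+v1)+v2 and (sum+v2)+v1 both normalize to the term multiset {sum, v1, v2}, so the new values of the instance variable agree in both execution orders.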
Important Special Case
• Independent Operations Commute
• Conditions for Independence
• Operations Have Different Receivers
• Neither Operation Writes an Instance Variable that
Other Operation Accesses
• Detecting Independent Operations
• In Type-Safe Languages
• Class Declarations
• Instance Variable Accesses
• Pointer or Alias Analysis
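The independence conditions can be sketched as a small predicate (the names and the representation are ours): identify each instance variable by the pair (object, name), so that operations on different receivers that touch only their own fields come out independent:

```cpp
#include <set>
#include <string>
#include <utility>

// An instance variable is identified by its object and its name.
using Var = std::pair<const void *, std::string>;

struct Op {
    const void *receiver;      // receiver object of the operation
    std::set<Var> reads;       // instance variables read
    std::set<Var> writes;      // instance variables written
};

template <typename T>
bool disjoint(const std::set<T> &a, const std::set<T> &b) {
    for (const auto &x : a)
        if (b.count(x)) return false;
    return true;
}

// Independent: the operations have different receivers, and neither
// writes an instance variable that the other reads or writes.
bool independent(const Op &a, const Op &b) {
    return a.receiver != b.receiver &&
           disjoint(a.writes, b.reads) && disjoint(a.writes, b.writes) &&
           disjoint(b.writes, a.reads);
}
```

In the example, two traverse operations on distinct graph nodes write only their own sum fields, so they satisfy both conditions.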
Analysis in Current Compiler
• Dependence Analysis
• Operations on Objects of Different Classes
• Independent Operations on Objects of Same Class
• Symbolic Commutativity Testing
• Dependent Operations on Objects of Same Class
• Future
• Integrate Pointer or Alias Analysis
• Integrate Array Data Dependence Analysis
Steps to Practicality
Programming Model Extensions
• Extensions for Read-Only Data
• Allow Operations to Freely Access Read-Only Data
• Enhances Ability of Compiler to Represent Expressions
• Increases Set of Programs that Compiler can Analyze
• Analysis Granularity Extensions
• Integrate Operations into Callers for Analysis Purposes
• Coarsens Commutativity Testing Granularity
• Reduces Number of Pairs Tested for Commutativity
• Enhances Effectiveness of Commutativity Testing
Optimizations
• Synchronization Optimizations
• Eliminate Synchronization Constructs in Methods that
Only Access Read-Only Data
• Reduce Number of Acquire and Release Constructs
• Parallel Loop Optimization
• Suppress Exploitation of Excess Concurrency
Extent Constants
Motivation: Allow Parallel Operations to Freely Access Read-Only Data
• Extent Constant Variable
Global variable or instance variable
written by no operation in extent
• Extent Constant Expression
Expression whose value depends only on
extent constant variables or parameters
• Extent Constant Value
Value computed by extent constant
expression
• Extent Constant
Automatically generated opaque constant
used to represent an extent constant value
• Requires: Interprocedural Data Usage Analysis
• Result Summarizes How Operations Access Instance Variables
• Interprocedural Pointer Analysis for Reference Parameters
Extent Constant Variables In Example
void graph::traverse(int v) {
  sum += v;
  if (left != NULL) left->traverse(val);
  if (right != NULL) right->traverse(val);
}
Extent Constant Variable: val (read but never written by any operation in the extent)
Advantages of Extent Constants
• Extent Constants Extend Programming Model
• Enable Direct Global Variable Access
• Enable Direct Access of Objects other than Receiver
• Extent Constants Make Compiler More Effective
• Enable Compact Representations of Large Expressions
• Enable Compiler to Represent Values Computed by
Otherwise Unanalyzable Constructs
Auxiliary Operations
Motivation: Coarsen Granularity of Commutativity Testing
• An Operation is an Auxiliary Operation if its Entire Computation
• Only Computes Extent Constant Values
• Only Externally Visible Writes are to Local Variables of Caller
• Auxiliary Operations are Conceptually Part of Caller
• Analysis Integrates Auxiliary Operations into Caller
• Represents Computed Values using Extent Constants
• Requires:
• Interprocedural Data Usage Analysis
• Interprocedural Pointer Analysis for Reference Parameters
• Intraprocedural Reaching Definition Analysis
Auxiliary Operation Example
int graph::square_and_add(int v) {
  return(val*val + v);
}
Extent Constant Variable: val    Parameter: v    Extent Constant Expression: val*val + v

void graph::traverse(int v) {
  sum += square_and_add(v);
  if (left != NULL) left->traverse(val);
  if (right != NULL) right->traverse(val);
}
Advantages of Auxiliary Operations
• Coarsen Granularity of Commutativity Testing
• Reduces Number of Pairs Tested for Commutativity
• Enhances Effectiveness of Commutativity Testing Algorithm
• Support Modular Programming
Synchronization Optimizations
• Goal: Eliminate or Reduce Synchronization Overhead
• Synchronization Elimination
  If an Operation Only Computes Extent Constant Values,
  Then the Compiler Does Not Generate Lock Acquire and Release
• Lock Coarsening
  • Data: Use One Lock for Multiple Objects
  • Computation: Generate One Lock Acquire and Release for Multiple Operations on the Same Object
Data Lock Coarsening Example
Original Code
class vector {
lock mutex;
double val[NDIM];
};
void vector::add(double *v){
mutex.acquire();
for(int i=0; i < NDIM; i++)
val[i] += v[i];
mutex.release();
}
class body {
lock mutex;
double phi;
vector acc;
};
void body::gravsub(body *b){
double p, v[NDIM];
mutex.acquire();
p = computeInter(b,v);
phi -= p;
mutex.release();
acc.add(v);
}
Optimized Code
class vector {
double val[NDIM];
};
void vector::add(double *v){
for(int i=0; i < NDIM; i++)
val[i] += v[i];
}
class body {
lock mutex;
double phi;
vector acc;
};
void body::gravsub(body *b){
double p, v[NDIM];
mutex.acquire();
p = computeInter(b,v);
phi -= p;
acc.add(v);
mutex.release();
}
Computation Lock Coarsening Example
Original Code
class body {
lock mutex;
double phi;
vector acc;
};
void body::gravsub(body *b){
double p, v[NDIM];
mutex.acquire();
p = computeInter(b,v);
phi -= p;
acc.add(v);
mutex.release();
}
void body::loopsub(body *b){
int i;
for (i = 0; i < N; i++) {
this->gravsub(b+i);
}
}
Optimized Code
class body {
lock mutex;
double phi;
vector acc;
};
void body::gravsub(body *b){
double p, v[NDIM];
p = computeInter(b,v);
phi -= p;
acc.add(v);
}
void body::loopsub(body *b){
int i;
mutex.acquire();
for (i = 0; i < N; i++) {
this->gravsub(b+i);
}
mutex.release();
}
Parallel Loops
Goal: Generate Efficient Code for Parallel Loops
If A Loop is in the Following Form
for (i = exp1; i < exp2; i += exp3) {
exp4->op(exp5,exp6, ...);
}
Where exp1, exp2, ... are Extent Constant Expressions
Then Compiler Generates Parallel Loop Code
Parallel Loop Optimization
• Without Parallel Loop Optimization
• Each Loop Iteration Generates a Task
• Tasks are Created and Scheduled Sequentially
• Each Iteration Incurs Task Creation and Scheduling Overhead
• With Parallel Loop Optimization
• Generated Code Immediately Exposes All Iterations
• Scheduler Operates on Chunks of Loop Iterations
• Each Chunk of Iterations Incurs Scheduling Overhead
• Advantages
• Enables Compact Representation for Loop Computation
• Reduces Task Creation and Scheduling Overhead
• Parallelizes Overhead
Suppressing Excess Concurrency
Goal: Reduce Overhead of Exploiting Parallelism
• Goal Achieved by Generating Computations that
• Execute Operations Serially with No Parallelization Overhead
• Use Synchronization Required to Execute Safely in Parallel Context
• Mechanism: Mutex Versions of Methods
Object Section
• Acquires Lock at Beginning
• Releases Lock at End
Invocation Section
• Operations Execute Serially
• Invokes Mutex Version
• Current Policy:
• Each Parallel Loop Iteration Invokes Mutex Version of Operation
• Suppresses Parallel Execution Within Iterations of Parallel Loops
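A rough sketch of such a mutex version for the running traverse example (using std::mutex; the real generated code uses the compiler's own lock constructs):

```cpp
#include <mutex>

struct graph {
    std::mutex mutex;
    int val = 0, sum = 0;
    graph *left = nullptr, *right = nullptr;

    // Object section still acquires the per-object lock, so the
    // operation is safe when other processors touch the same nodes;
    // the invocation section calls children directly, so the subtree
    // runs serially with no task creation or scheduling overhead.
    void mutex_traverse(int v) {
        {
            std::lock_guard<std::mutex> g(mutex);
            sum += v;
        }
        if (left)  left->mutex_traverse(val);
        if (right) right->mutex_traverse(val);
    }
};
```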
Experimental Results
Methodology
• Built Prototype Compiler
• Built Run Time System
• Concurrency Generation and Task Management
• Dynamic Load Balancing
• Synchronization
• Acquired Two Complete Applications
• Barnes-Hut N-Body Solver
• Water Code
• Automatically Parallelized Applications
• Ran Applications on Stanford DASH Machine
• Compare Performance with Highly Tuned, Explicitly Parallel
Versions from SPLASH-2 Benchmark Suite
Prototype Compiler
• Clean Subset of C++
• Sage++ is Front End
• Structured As a Source-To-Source Translator
• Analysis Finds Parallel Loops and Methods
• Compiler Generates Annotation File
• Identifies Parallel Loops and Methods
• Classes to Augment with Locks
• Code Generator Reads Annotation File
• Generates Parallel Versions of Methods
• Inserts Synchronization and Parallelization Code
• Parallelizes Unannotated Programs
Major Restrictions
Motivation: Simplify Implementation of Prototype
• No Virtual Methods
• No Operator or Method Overloading
• No Multiple Inheritance or Templates
• No typedef, struct, union or enum types
• Global Variables must be Class Types
• No Static Members or Pointers to Members
• No Default Arguments or Variable Numbers of Arguments
• No Operation Accesses a Variable Declared in a Class from which its Receiver Class Inherits
Run Time Library
Motivation: Provide Basic Concurrency Management
• Single Program, Multiple Data Execution Model
• Single Address Space
• Alternate Serial and Parallel Phases
• Library Provides
• Task Creation and Synchronization Primitives
• Dynamic Load Balancing
• Implemented
• Stanford DASH Shared-Memory Multiprocessor
• SGI Shared-Memory Multiprocessors
Applications
• Barnes-Hut
• O(NlgN) N-Body Solver
• Space Subdivision Tree
• 1500 Lines of C++ Code
• Water
• Simulates Liquid Water
• O(N^2) Algorithm
• 1850 Lines of C++ Code
Obtaining Serial C++ Version of Barnes-Hut
• Started with Explicitly Parallel Version (SPLASH-2)
• Removed Parallel Constructs to get Serial C
• Converted to Clean Object-Based C++
• Major Structural Changes
• Eliminated Scheduling Code and Data Structures
• Split a Loop in Force Computation Phase
• Introduced New Field into Particle Data Structure
Obtaining Serial C++ Version of Water
• Started with Serial C translated from FORTRAN
• Converted to Clean Object-Based C++
• Major Structural Change
• Auxiliary Objects for O(N^2) phases
Commutativity Statistics for Barnes-Hut
[Bar chart: pairs tested for commutativity, split into symbolically executed pairs and independent pairs, for each parallel extent: Position (3 Methods), Velocity (3 Methods), Force (6 Methods)]
Auxiliary Operation Statistics for Barnes-Hut
[Bar chart: auxiliary operation call sites for each parallel extent: Position (3 Methods), Velocity (3 Methods), Force (6 Methods)]
Performance Results for Barnes-Hut
[Speedup vs. number of processors (up to 32), comparing the Ideal, SPLASH-2, and Commutativity Analysis versions: Barnes-Hut on DASH, 8K-particle and 16K-particle data sets]
Performance Analysis
Motivation: Understand Behavior of Parallelized Program
• Instrumented Code to Measure Execution Time Breakdowns
Parallel Idle - Time Spent Idle in Parallel Section
Serial Idle - Time Spent Idle in a Serial Section
Blocked - Time Spent Waiting to Acquire a Lock Held by
Another Processor
Parallel Compute - Time Spent Doing Useful Work in a
Parallel Section
Serial Compute - Time Spent Doing Useful Work in a Serial
Section
Performance Analysis for Barnes-Hut
[Stacked bar charts: cumulative total time (seconds) vs. number of processors (1, 2, 4, 8, 16, 24, 32), broken into Parallel Idle, Serial Idle, Blocked, Parallel Compute, and Serial Compute: Barnes-Hut on DASH, 8K-particle and 16K-particle data sets]
Performance Results for Water
[Speedup vs. number of processors (up to 32), comparing the Ideal, SPLASH-2, and Commutativity Analysis versions: Water on DASH, 343-molecule and 512-molecule data sets]
Performance Results for Computation Replication Version of Water
[Speedup vs. number of processors (up to 32), comparing the Ideal, SPLASH-2, and Commutativity Analysis versions: Water on DASH, 343-molecule and 512-molecule data sets]
Commutativity Statistics for Water
[Bar chart: pairs tested for commutativity, split into symbolically executed pairs and independent pairs, for each parallel extent: Forces (2 Methods), Momenta (2 Methods), Virtual (3 Methods), Loading (4 Methods), Energy (5 Methods)]
Auxiliary Operation Statistics for Water
[Bar chart: auxiliary operation call sites for each parallel extent: Forces (2 Methods), Momenta (2 Methods), Virtual (3 Methods), Loading (4 Methods), Energy (5 Methods)]
Performance Analysis for Water
[Stacked bar charts: cumulative total time (seconds) vs. number of processors (1, 2, 4, 8, 16, 24, 32), broken into Parallel Idle, Serial Idle, Blocked, Parallel Compute, and Serial Compute: Water on DASH, 343-molecule and 512-molecule data sets]
Future Work
• Relative Commutativity
• Integrate Other Analysis Frameworks
• Pointer or Alias Analysis
• Array Data Dependence Analysis
• Analysis Problems
• Synchronization Optimizations
• Analysis Granularity Optimizations
• Generation of Self-Tuning Code
• Message Passing Implementation
Related Work
• Bernstein (IEEE Transactions on Computers 1966)
• Dependence Analysis for Pointer-Based Data Structures
• Ghiya and Hendren (POPL 96)
• Ruf (PLDI 95)
• Wilson and Lam (PLDI 95)
• Deutsch (PLDI 94)
• Choi, Burke and Carini (POPL 93)
• Landi, Ryder and Zhang (PLDI 93)
• Hendren, Hummel and Nicolau (PLDI 92)
• Plevyak, Karamcheti and Chien (LCPC 93)
• Chase, Wegman and Zadeck (PLDI 90)
• Larus and Hilfinger (PLDI 88)
• Reduction Analysis
• Ghuloum and Fisher (PPOPP 95)
• Pinter and Pinter (POPL 92)
• Callahan (LCPC 91)
• Commuting Operations in Parallel Languages
• Rinard and Lam (PPOPP 91)
• Steele (POPL 90)
• Barth, Nikhil and Arvind (FPCA 91)
Conclusion
• Commutativity Analysis
• New Analysis Framework for Parallelizing Compilers
• Basic Idea
• Recognize Commuting Operations
• Generate Parallel Code
• Current Focus
• Dynamic, Pointer-Based Data Structures
• Good Initial Results
• Future
• Persistent Data
• Distributed Computations
Latest Version of Paper
http://www.cs.ucsb.edu/~martin/paper/pldi96.ps
What if Operations Do Not Commute?
• Parallel Tree Traversal
• Example: Distance of Node from Root
class tree {
int distance;
tree *left;
tree *right;
};
void tree::set_distance(int d) {
distance = d;
if (left != NULL) left->set_distance(d+1);
if (right != NULL) right->set_distance(d+1);
}
Equivalent Computation with Commuting Operations
void tree::zero_distance() {
distance = 0;
if (left != NULL) left->zero_distance();
if (right != NULL) right->zero_distance();
}
void tree::sum_distance(int d) {
distance = distance + d;
if (left != NULL) left->sum_distance(d+1);
if (right != NULL) right->sum_distance(d+1);
}
void tree::set_distance(int d) {
zero_distance();
sum_distance(d);
}
Theoretical Result
• For Any Tree Traversal on Data With
  • A Commutative Operator (for example +) that has
  • A Zero Element (for example 0)
• There Exists a Program P such that
  • P Computes the Traversal
  • Commutativity Analysis Can Automatically Parallelize P
• Complexity Results:
  • Program P is Asymptotically Optimal if the Data Structure is a Perfectly Balanced Tree
  • Program P has Complexity O(N^2) if the Data Structure is a Linked List
Pure Object-Based Model of Computation
• Goal
• Obtain a Powerful, Clean Model of Computation
• Enable Compiler to Analyze Program
• Objects: Instances of Classes
• Implement State with Instance Variables
• Primitive Types from Underlying Language (int, ...)
• References to Other Objects
• Nested Objects
• Operations: Invocations of Methods
• Each Operation Has Single Receiver Object