STATIC ANALYSES FOR JAVA IN THE PRESENCE OF

Download Report

Transcript STATIC ANALYSES FOR JAVA IN THE PRESENCE OF

STATIC ANALYSES FOR JAVA IN THE
PRESENCE OF
DISTRIBUTED COMPONENTS AND
LARGE LIBRARIES
Mariana Sharp
Adviser: Prof. Atanas Rountev
Committee: Prof. Paul Sivilotti, Prof. Neelam Soundarajan
PRESTO: Program Analyses and Software Tools Research Group, Ohio State University
Introduction
 Static program analysis infers properties of program
behavior based on code structure and semantics
 Analyses considered in our research:
-
Points-to analysis
Type analysis
Side-effect analysis
Dependence analysis
 Static analysis techniques used in:
-
performance optimization
program understanding and maintenance
software testing
verification of program properties
Two Challenges for Analysis of Modern Java Software
 Distributed Java applications
- Traditional algorithms designed for non-distributed
software only
- Distributed Java applications using RMI have complex
semantics not modeled by existing static analysis
 Large-scale applications
- Existing algorithms designed to operate on whole
homogeneous programs
 Start from scratch at each analysis execution
 Library code analyzed together with the application
code (90% of methods are in libraries)
- Potential scalability problems limit the practical use of
the analyses for real-world Java applications
Contributions
 Theoretical model for the analysis of distributed
Java applications
- Extension of points-to analysis and side-effect analysis for
RMI-based applications
- Experimental evaluation on a collection of RMI programs
shows practical cost and high precision of technique
 Incremental approach for analyzing Java
applications built with reusable components
- Use of precomputed summaries for reusable components
as input to the analysis of client code
- Summary generation algorithms for type analysis and
dependence analysis
- Experimental evaluation on a collection of Java programs
shows dramatic savings compared to whole-program
analysis
Outline
 Static Analysis of Object References in RMI-based
Java Software
 Type Analysis in the Presence of Large Libraries
 Dependence Analysis in the Presence of Large
Libraries
Static Analysis of Object References in
RMI-based Java Software
 Points-to analysis: which objects may variable x
refer to?
- Builds a points-to graph
 Side-effect analysis: which objects could be
modified by a statement s?
- Based on points-to information
Analysis in the presence of RMI calls
 Java Remote Method Invocation (RMI)
- There is a set of components C1 , C2 , …, Cn
 Each runs on a different VM
- References to remote objects cross JVM boundaries
- Methods declared in remote interfaces can be invoked
remotely
- Parameter passing for remote calls involves object
serialization
 Entire graphs of serialized objects are copied over
Java RMI
 Example of a remote class:
interface Listener extends java.rmi.Remote
{ void update(Event b); }
class MyListener implements Listener extends …
{ void update(Event b) { … } }
 Parameter passing in the remote calls affects the
points-to information
- Remote references
- Passing of serialized objects
Passing a Remote Reference
(Points-to Relations)
Component 1
Component 2
(remote)
ochannel
f
g
(remote)
p
void add( Listener p ) {
...
}
olistener
Channel f = (Channel)
Naming.lookup(...);
Listener g = new MyListener();
f.add(g);
Passing a Serialized Object
(Points-to Relations)
Component 1
Component 2
(remote)
ochannel
f
e
p
(serializable)
(serializable)
copy of oevent
void notify( Event p ) {
...
}
oevent
Event e = new Event();
f.notify(e);
Analysis algorithm
 The result is a Pointer Assignment Graph (PAG)
- Nodes are variables and object fields
 Nodes have points-to sets attached
- Edges represent flow of values
 Each node has a local points-to set PtL and a remote
points-to set PtR
 v1 = v2 in Ci: creates edge
node( v2 i ) →node( v1 i )
- The edge represents following relationships:
PtL( v2 i )  PtL( v1 i )
PtR( v2 i )  PtR( v1 i )
Rules for Handling Statements
Rules for Handling Statements
Passing a Remote Reference
(PAG)
Component 1
(remote)
Component 2
PtR(f) = {ochannel}
f
ochannel
PtL(g) = {olistener}
g
PtR(p) = {olistener}
(remote)
p
void add( Listener p ) {
...
}
olistener
Channel f = (Channel)
Naming.lookup(...);
Listener g = new MyListener();
f.add(g);
Passing a Serialized Object
(PAG)
Component 1
(remote)
Component 2
PtR(f) = {ochannel}
f
ochannel
PtL(e) = {oevent}
p
PtL(p) =
{copy of oevent}
(serializable)
copy of oevent
void notify( Event p ) {
...
}
e
(serializable)
oevent
Event e = new Event();
f.notify(e);
Experimental results
 11 RMI applications
- Between 12 and 125 methods
- Analysis includes ~7000 library methods
 Implementation
- Generalized the points-to analysis in the Soot analysis
framework
 Analysis running time
- 2.8 GHz PC with 3 GB memory
- Time: 5-6 minutes per application
- Special handling of libraries: standard libraries are not
replicated across components
Experimental results
Number of remote call sites
Remote calls
Remote references
Serialization
40
35
30
25
20
15
10
5
ss
l
se
or
da
ta
ba
tr
an
sl
at
je
nu
t
jo
dl
n
au
ct
io
ba
nk
el
ch
an
n
tta
sk
rm
fi l
es
rv
st
oc
ks
0
• Passing of remote references and serialization are common
• High precision of the call graph at remote calls
• All remote call sites with serialization: used acyclic and uniquely-typed object
graphs
Static Analysis for RMI-based Software: Conclusions
 Existing formalisms generalized to handle RMI
features
 Key points:
- Two separate points-to sets per variable
- Remote PAG edges
- Propagation of references to deserialized copy objects
 Precise and practical choice for the analysis of RMI
applications
 Approach can be generalized to side-effect analysis
- Determine inter-component data dependencies
Outline
 Static Analysis of Object References in RMI-based
Java Software
 Type Analysis in the Presence of Large Libraries
 Dependence Analysis in the Presence of Large
Libraries
Type Analysis in the Presence of Large Libraries
 Type analysis: what are the types of objects
variable x may refer to?
- A form of points-to analysis
- Dynamically allocated objects are represented with one
abstract object per class
 Libraries represent a large part of the code
- Roughly 93% of methods in benchmark programs are in
the standard Java libraries
 Summary-based analysis: the code is split into a
user component and a library component. Two
steps:
- Computing the information about the library
- Running the analysis that uses this information
Summary Representation
 We restate the analysis in terms of graph
representations and operations
- Based on the formulation of Interprocedural Distributive
Environment (IDE) problems
- Compact representation of dataflow functions
- Efficient operations of functional meet and functional
composition
 Dataflow function: encoded by a graph that
represents the flow of data in a method
- Nodes represent variables and object types
- Edges:
 Object type to variable: represents a type relation
 Variable to variable: represents object flow
Summary Generation Type Analysis
 Input: the code of the library classes
 Output: for each method
- Dataflow graph of the method
- Information about the call sites occurring in the method
 Algorithm with 3 steps:
- Step 1: Computing dataflow functions for library methods
 No calls
- Step 2: Computing the closure of each dataflow function
 Type relationships due to transitivity
- Step 3: Minimizing dataflow functions
 Eliminating edges that have only intraprocedural
effect
Code and Dataflow Function for Sample Method
public class MyClass {
private String[] names;
this
r0
String replaceName(String) {
MyClass r0;
String r1, r3;
String[] r2;
r0 := this;
param0
r1
r1 := param0;
r2 := r0.names;
r2[0] := r1;
r4
r5
r3 := r1;
r4 := new String;
specialinvoke
String
r4.<init>("New name:");
r5 := r4;
r6 := virtualinvoke r5.concat(r3);
return r6;
}
}
names
r2
array_elem
r3
r6
return
Experimental Study: Generating the Summary
 Input of the summary generation: classes from Java
standard libraries from J2SE 1.4.2
- Packages java., javax., com., COM., org., and sun.
(10238 classes, 77190 methods, 1496003 statements)
 Output: file that stores the summary representation
of these classes, methods, and the corresponding
dataflow functions
- Summary file size: 12.2 MB
Study: Summary-based Type Analysis
 Summary-based type analysis implemented with the
Soot 2.2.2 framework (Spark)
 Jimple representation used for classes that belong
to the user component of the analyzed program
 For library classes, a representation is obtained
from the summary
 Experimental study with 20 Java programs
ja
ck
ja
j
va a v
c u ac
p0.
10
j
jb
-6
.1
j
jf l e ss
ex
-1
.
jl e 4 .1
x1.
2
j
t a .6
m
r
in
dt -1.2
er
m 1
m 1.1
p
.5
m eg a
uf
fi n udio
-0
.9
.3
s a ray a
t
bl
ec rac
e
c2.
s o 18
ck . 2
s
s o ec
ck ho
sp
ro
xy
vi
ol
et
db
fra
ct
al
ab
c o bi t
2
m
pr
es
s
R
Running time (in second)
Comparison of the Whole-program Analysis and the
Summary-based Analysis (Running Time)
90
80
Whole-program
70
Summary-based
60
50
40
30
20
10
0
vi
ol
et
ja
ck
ja
j
va a v
c u ac
p0.
10
j
jb
-6
.1
j
jf l e ss
ex
-1
.
jl e 4 .1
x1.
2.
m jt ar 6
in
dt -1.2
er
m 1
m 1.1
pe
.5
m ga
uf
fi n udio
-0
.9
.3
r
s a ay a
t
bl
ec rac
e
c2.
1
so
8
ck . 2
s
s o ec
ck ho
sp
ro
xy
db
fra
ct
al
ab
c o bi t
2
m
pr
es
s
R
Memory usage (in MB)
Comparison of the Whole-program Analysis and the
Summary-based Analysis (Memory Usage)
160
140
Whole-program
Summary-based
120
100
80
60
40
20
0
Experimental Study for Summary-Based Type Analysis
 Compared to its whole-program counterpart,
summary-based type analysis can achieve
significant savings of running time and memory
usage
- For all experimental subjects, the running time reduction
was at least 55%, with average running time reduction of
70%
 Savings come from avoiding the cost of reading the
library code and building its Jimple representation,
as well as the type propagation in library methods
Outline
 Static Analysis of Object References in RMI-based
Java Software
 Type Analysis in the Presence of Large Libraries
 Dependence Analysis in the Presence of Large
Libraries
Dependence Analysis in the Presence of Large Libraries
 Dependence analysis: two kinds of dependencies:
- Control dependencies: relationships between conditional
expressions and the statements guarded by them.
- Data dependencies: represent the flow of values from
one statement to another due to writes and reads of
shared memory locations.
 Our focus:
- Analysis of data dependencies
 Data dependencies between the formals and the
return of a method
 Dependence analysis based on an earlier type analysis
- Addressing the problem of reducing analysis cost for
applications built with large libraries.
Whole-Program Dependence Analysis Algorithm
 Step 1: Intraprocedural reaching definitions analysis
- Computes a set of reaching definitions for each
statement inside a method body
- Intraprocedural def-use analysis calculates direct
dependencies between the statements of a method
 Does not consider the effects of calls
- Output: “reduced” CFG
 Step 2: Intraprocedural Dependence Analysis
- Transitive dependencies are computed
- Calls are not considered
 Step 3: Interprocedural Dependence Analysis
- Bottom-up traversal of the call graph
- For each call site the callee's dependence information is
inlined into the caller's dependence information
Example: Reduced CFG for Sample Method
N1
N4
r0, N1
r0 := this
N3
r4 := new String
r4
r4
r5 := r4
N5
N2
r2 := r0.names
N6
r1, N6
specialinvoke r4...
N7
r5
r3 := r1
r3, N6
N9
r1 := param0
r6 := virtualinvoke r5.concat(r3)
r6, N9
N10
return r6
r1, N6
N8
r2[0] := r1
Summary Generation
 Summary information
- Only dependencies that may ultimately affect the return
value of a method are considered
- Dependencies related to the return value of a method
and the call sites
- Dependence pairs
 Summary Generation Algorithm
- Two phases that are similar to the first two phases of the
whole-program dependence analysis
- A third phase with summary optimizations
Example:
N1
N4
r0, N1
r0 := this
N3
r4 := new String
r4
r4
r5 := r4
N5
N2
r2 := r0.names
N6
r1, N6
specialinvoke r4...
N7
r5
r3 := r1
r3, N6
N9
r1 := param0
r6 := virtualinvoke r5.concat(r3)
r6, N9
N10
return r6
r1, N6
N8
r2[0] := r1
Summary Optimizations
 Definitions:
- Fixed call: call with exactly one target method
- Fixed method: method that either does not make any
calls, or it has only fixed calls to fixed methods
 Optimization 1: Inlining all fixed methods (into
callers that can be both fixed and non-fixed)
 Definition:
- Method with known callers: all the callers of such a
method can be determined when the summary is built
 Cannot be called by future user code
 Optimization 2: Completely removing from the
summary all fixed methods that have known callers
Experimental Study
 Generating the summary:
- Improvements after the first optimization:
 number of calls in the library reduced by 36.9%,
 number of dependence pairs reduced by 16.2%
- Improvements after the second optimization:
 number of dependence pairs resulted from first
optimization reduced by 16.8%
- File size on disk: 14.4 MB
 2.2 MB of dependence information
 Summary-based analysis
- Same 20 benchmarks as for type analysis
Experimental Study: Summary-based Analysis
60
50
Running time (in seconds)
40
Whole-program
30
Summary-based
Baseline
20
10
-

vi
ol
et
s
x-
1.
4
je
s
6.
1
jle .1
x1.
2.
jta 6
m
rin
dt 1.2
er
1
m
-1
m .1
p e .5
ga
m
uf ud
fin io
-0
.9
.3
r a
sa ay
bl tra
ec
c
c- e
2.
1
so 8.2
ck
se
so ch
ck o
sp
ro
xy
Baseline analysis:
jfl
e
ac
ja
v

jb
-
a
up c
-0
.1
0j
k
ja
v
ja
c
al
db
ac
t
fr
ss
pr
e
m
co
R
ab
bi
t2
0
Uses artificial summary with empty dependency information
Represents a limit in what time reduction can be achieved
The average reduction in time is 79.78% in the summary-based version

l
x1
je
ss
The memory usage is reduced on average by 96.93%
vi
o
le
t
.4
jle .1
x1.
2.
6
jt a
m
r1
in
.2
dt
1
er
m
-1
.
m 1 .5
pe
g
m aud
uf
fin io
-0
.9
.3
ra a
yt
sa
ra
bl
ce
ec
c2
.1
so 8.2
ck
se
so cho
ck
sp
ro
xy
jf l
e
ja
ja
va va
c
cu
p0.
10
j
jb
-6
.1
ja
ck
fr
ac
ta
db
bb
c o it 2
m
pr
es
s
R
a
Memory usage (in MB)
Experimental Study: Summary-based Analysis
70
60
50
40
Whole-program
Summary-based
Baseline
30
20
10
0
Conclusions
 Limitations of the traditional model of wholeprogram data-flow analysis
- Whole-program analysis cannot be applied to distributed
software
- Whole-program analysis does not scale to very large
programs
 Solutions to address these limitations
- Theoretical model for points-to analysis of distributed
Java applications
- Analysis approach which employs precomputed library
summary information
Future Work
 Static Analysis for RMI Software
- Various flow- and context-sensitive points-to analyses
- RMI generalizations for other analyses
 Summary-based static analyses
- Summary-based algorithms for other dataflow analyses
- Systems built with multiple library components