Transcript ppt

Mark Marron
IMDEA-Software (Madrid, Spain)
[email protected]
1

Many optimization and software engineering
applications utilize heap information
• Optimization
 Parallelization
 Memory management
• Software Engineering and Debugging
 Interactive debugging of heap data structures
 Tainted or Information Flow analysis through heap
objects

Existing heap analysis techniques often
inapplicable due to imprecision (points-to
style) or computational cost (shape analysis)
2

General purpose model capable of supporting
these applications:
• Must model the range of fundamental properties
needed by our target application domains
• Cannot place significant restrictions on the program
being analyzed
• Must be computationally efficient and compact
• Willing to sacrifice some precision

Our focus is on providing general classes of
information for others to build on
3

Connectivity
• Reachability
• Interference
• Paths

Logical data structures (Regions)
• Group related sections of the heap
• Keep unrelated sections of the heap separate

Shape of a region
• Cycle, Dag, Tree, List, Singleton
4

Identity
• Given an object at point p, track the flow of this
object at all later program points q

Heap Based Use-Mod
• Find all program points a given memory location
may be read/written at

Escape
• Objects that are freshly allocated
• Objects that escape the local call context
5

The theory of Abstract Interpretation provides
framework for static program analysis
• Takes a lattice (set) of abstract models, each of
which represents a set of concrete program states
• Computes, for each program point, an abstract
model that represents all possible heap states that
may occur at the program point
6

A surprising benefit of building a model
suitable for abstract interpretation that is the
model also works for dynamic analysis:
• Debugging
• Specification mining/checking

Given a snapshot of the single current
program heap compute the corresponding
abstract model
7




Handle a large fragment of Java 1.5 and
commonly used libraries (lang, util, io)
Precisely model (in static and dynamic
analyses) the properties of interest
Can efficiently (on the order of seconds)
statically analyze moderate sized programs
(~15KLOC to date)
Have simple implementation of debugger and
specification miner (a few seconds to
compute models of Multi-MB heaps)
8

Base on storage shape graph
• Nodes represent sets of objects (or recursive data
structures), edges represent sets of pointers
• Has natural representation for many of the
properties we are interested in
• Easy to visualize
• Efficient to compute with

Annotate nodes and edges with additional
instrumentation properties
9

Key issue in shape graph is how to pick nodes
that abstract concrete objects
• Too many nodes is confusing and computationally
expensive
• Too few nodes leads to imprecision (as a single
node must represent multiple logical structures)
• Often done via allocation site or types

Solution: nodes are related sets of objects
• Recursive type information (recursive vs. nonrecursive types)
• Objects stored in the same collection, array or
structure
10
11
12

Most general way objects in a region are
connected
•
•
•
•
•

(S)ingleton: no pointers between any objects
(L)ist: may contain a linear List or simpler structures
(T)ree: may contain a Tree or simpler structures
(D)ag: may contain a Dag or simpler structures
(C)ycle: may a cyclic or simpler structures
E.g. A region with a (T)ree layout may contain
tree, list or singleton structures, but no dag
or cyclic structures.
13
14


Edges abstract sets of references (variable
references or pointers)
Heap Graph has ability to track some sharing
properties but insufficiently precise to model
many important properties
• E.g. given an array of objects does any object
appear multiple times?

May occur between references abstracted by
same edge or two different edges
• Interference: abstracted by same edge
• Connectivity: abstracted by different edges
15


Does a single edge abstract only references
with disjoint targets or may some of these
references alias/related?
Edge e is:
• non-interfering: all pairs of references r1, r2 in γ(e)
must be unrelated (refer to disjoint data structures).
• interfering: may be a pair of references r1, r2 in γ(e)
that are related (refer to the same data structure).
16
17

Connectivity: Do two edges abstract sets of

Edges e1, e2 are:
references with disjoint targets or do some of these
references alias/related?
• disjoint: all pairs of references r1 in γ(e1), r2 in γ(e2)
are unrelated (refer to disjoint data structures).
• connected: may be pair of references r1 in γ(e1), r2
in γ(e2) that are related (refer to the same data
structure).
18
19

Object Identity
• Across each method call track how data structures
are split, merged, reconnected

Field Sensitive Use/Mod
• For each method track the fields for the objects in
each region (node) and if the field is used/modified
in the method
• At each line track which regions (nodes) and fields
may be used modified

Object Allocation
• Track which objects are allocated in this scope and
which may escape
20
1
2
3
4
5
void swap(Pair p) {
Data temp = p.first;
p.first = p.second;
p.second = temp;
}
21
22


N-Body simulation in 3-dimensions
Uses Fast Multi-Pole method with space
decomposition tree
• For nearby bodies use naive n2 algorithm
• For distant bodies compute center of mass of many
bodies and treat as single point mass


Updates space decomposition tree to account
for body motion
Has not been analyzed with other existing
(precise) heap analysis methods
23
24

Inline Double[] into MathVector objects, 23% serial
speedup 37% memory use reduction
25
Iterator b = this.bodyTabRev.iterator();
while(b.hasNext())
((Body) b.next()).hackGravity(rsize, root);
26

TLP update loop over bodyTabRev, factor 3.09
speedup on quad-core machine
27
Benchmark
LOC
Analysis
Time
Analysis Shape
Mem
Sharing
Use/Mod*
tsp
910
0.03s
<30MB
100%
98%
Y
em3d
1103
0.09s
<30MB
100%
100%
Y
voronoi
1324
0.50s
<30MB
98%
97%
Y
bh
2304
0.72s
<30MB
94%
96%
Y
db
1985
0.68s
<30MB
100%
82%
Y
raytrace
5809
15.5s
38MB
98%
92%
Y
Exp
3567
152.3s
48MB
100%
100%
Y
Interpreter
15293
114.8s
122MB
97%
86%
Y
28
Benchmark
Objects
Nodes
Time
bisort
~100
5
0.01s
bisort
~17000
5
1.20s
tsp
~250
4
0.12s
tsp
~8000
4
1.62s
health
~300
20
0.01s
health
~5000
20
1.35s
exp
~70
12
0.02s
exp
~4000
12
1.55s
29

Have the core of a practical analysis system
• Performance:
 Analyze moderate size non-trivial Java programs
 15KLoc programs in a 114 seconds using ~120MB of
memory (average 2 contexts per method)
 Debugging abstraction efficiently compresses large
heaps to compact abstract representation
• Accuracy:
 Precisely represent connectivity, sharing, shape
properties + region, frame, and dependence
information
• Qualitatively Useful
 Used results in multiple optimization domains and in
debugging applications
30

Currently working on transforming core
concepts from prototype to robust tools
• Implementing static analysis for MSIL bytecode +
core libraries
• Implementing full featured debugger support and
specification mining (for both MSIL and Java)

Enrich the model
• Wider range of properties (what is useful in general)
• Allow user to easily extend with new properties

Apply information in more client applications
• Additional optimization domains
• Support for programmer assisted refactorings
31
32


Simple interpreter and debug environment for
large subset of Java language
14,000+ Loc (in normalized form), 90 Classes
• Additional 1500 Loc for specialized standard library
handling stubs


Large recursive call structures, large
inheritance trees with numerous virtual
method implementations
Wide range of data structure types, extensive
use of java.util collections, uses both shared
and unshared structures
33