Modeling The Heap - The IMDEA Software Institute
Download
Report
Transcript Modeling The Heap - The IMDEA Software Institute
Mark Marron
IMDEA-Software (Madrid, Spain)
[email protected]
1
“Algorithms + Data Structures = Programs”
Niklaus Wirth
“Show me your flowcharts and conceal your
tables, and I shall continue to be mystified.
Show me your tables, and I won't usually need
your flowcharts; they'll be obvious.”
Fred Brooks
2
Understanding the heap is increasingly
important to understanding the program
• Object Oriented Programming makes objects (data
not code) central organizing principle of program
• Introduction of GC encourages extensive use of
heap allocated structures
Support large class of client applications in
IDE, compiler, and testing
Focus on scalable, manageable models/tools
even at cost of overall expressivity/analytic
power
3
Optimization
• Parallelization and Redundancy Elimination
• Memory Allocation and GC
Program Understanding
• Pre/Post Annotation Extraction
• Rich Class Invariant Computation
Testing
• Test Case Generation
• Debugger Visualizations
4
Model of the Program heap
• Represent key heap properties
• Amenable to efficient static analysis
• Understandable by non-experts
Static Analysis
• Scalable to module level
• Efficient (time/memory) to run interactively on
developer machines
Dynamic Support
• Debugger hooks
• Hooks for specification checking
• Heap structure test case generation
5
Representation
6
Logical data structures (Regions)
• Group related sections of the heap
• Keep unrelated sections of the heap separate
Connectivity Inside and Between Regions
• Shape
• Reachability/Interference
Between variables
Arbitrary objects on heap
Important special case is uniqueness or aliasing of
elements in collections/arrays
7
Object/Data Structure Identity
• Track objects (data structures) across time in
program execution (how are structures rearranged)
Heap Based Use-Mod
• Given a stmt which memory locations may be
used/mod at that line
• Given a method which memory locations may be
used/mod in the body (frame or footprint)
Allocation and Escape Sites
Field Nullity and Collection Size/Iterators
8
Base on storage shape graph
• Nodes represent sets of objects (or data structures),
edges represent sets of pointers
• Has natural representation for many of the
properties we are interested in
• Easy to visualize and understand by non-experts
• Efficient and compact for computation
Disjointness properties are natural (and mostly free)
Majority of computation is local
Imprecision spread is minimal
Annotate nodes and edges with additional
instrumentation properties
9
Key issue for shape graph approach is how to
group concrete objects into abstract nodes
• Too many nodes is confusing and computationally
expensive
• Too few nodes leads to imprecision (as a single
node must represent multiple logical structures)
• Often done via allocation site or types
Solution: nodes are related sets of objects
• Recursive type information (recursive vs. nonrecursive types)
• Objects stored in the same collection, array or
structure
10
11
12
Most general way objects in a region are
connected
•
•
•
•
(S)ingleton: no pointers between any objects
(L)ist: may contain a linear List or simpler structures
(T)ree: may contain a Tree or simpler structures
(C)ycle: may a cyclic or simpler structures
E.g. A region with a (T)ree layout may contain
Tree, List or Singleton structures, but no
Cyclic structures.
13
14
Edges abstract sets of references (variable
references or pointers)
Heap graph has ability to track some sharing
properties but insufficiently precise to model
many important properties
• E.g. given an array of objects does any object
appear multiple times?
May occur between references abstracted by
same edge or two different edges
• Interference: abstracted by same edge
• Connectivity: abstracted by different edges
15
Does a single edge abstract only references
with disjoint targets or may some of these
references be aliased/related?
Edge e is:
• non-interfering: all pairs of references r1, r2 must
refer to disjoint data structures.
• interfering: may exist a pair of references r1, r2 that
refer to the same data structure.
16
17
Want to know how heap structure is changed
between points in the program
• Often access paths are used to specify sets of
objects but these quickly become very complicated
Shape graph is a partition of the heap
• Pick an arbitrary points in the program to use as the
canonical partitions
• At each future program point track how partitioning
has changed relative to canonical partition
Most natural (and generally useful) point is
each method entry
18
Use the notion of Data Structure Identity
• Each node represents a set of objects
• Add tags to each node tracking the most recent
stmt that
Read from a field in an object represented by the node
Wrote to a field in an object represented by the node
Can then compute stmt level and method
level use/mod sets (footprints, frames) via
• Structure Identity
• Use/Mod tags in Nodes
19
int mangle(Pair p)
{
int res = p.first.val;
p.second = p.first;
p.first = null;
return res;
}
20
Track object allocation sites and escape sites
similarly to tracking use/mod
• Allocation sites of the objects
• Escape sites of the objects
Nullity and collection information
• Per field nullity tag in each node
• For nodes that may represent collections (lists, sets,
dictionaries) track size of collection (0, 1+, 0+)
21
N-Body simulation in 3-dimensions
Uses Fast Multi-Pole method with space
decomposition tree
• For nearby bodies use naive n2 algorithm
• For distant bodies compute center of mass of many
bodies and treat as single point mass
22
23
Inline double[] into MathVector objects and unroll
loops, 23% speedup 37% memory use reduction
24
Iterator b = this.bodyTabRev.iterator();
while(b.hasNext())
((Body) b.next()).hackGravity(rsize, root);
25
TLP update loop over bodyTabRev, factor 3.09
speedup on quad-core machine
26
Static Analysis
27
Computational Cost
• Flow Sensitivity
• Disjunctive Representations (powersets)
• Context Sensitivity
Extensive use of Collections and other
standard libraries
• Need to deal with IO, Files, etc.
• Collections particularly important (wrt. heap
analysis) and used extensively
28
Adopt existing work on Partially Disjunctive
Domains (SAS 04) to help with flow and
disjunctive costs
Our recent work (in submission) addresses
the disjunctive and context sensitivity issues
• Calls take/return a model set of size 1
• Virtual calls handled efficiently, e.g. do not result in
explosion in set size
• Average number of contexts per method (in all
benchmarks) is less than 2.2
• All with near optimal precision (more later)
29
Most libraries are not interesting from the
standpoint of userland code
• Don’t care how FileStream is implemented
• Just treat as uninterrupted heap region
Can analyze Collection implementations
directly but
• Many use specialized highly optimized data
structures, so very hard to model correctly
• Analysis is expensive and redundant
Use simple high level semantics for common
collections (precise and efficient)
30
Prototype implemented in C++ analyzing Java
source (timing results reported here)
Currently working on an implementation for
CLI (C#/.net) bytecode
• Working on support for bytecodes generated by C#
compiler (currently function pointers, references,
exceptions not handled)
• Support key portions of System, Collections, and IO
handled - plans to support GUI libs and threads
• Goal is module level analysis with other techniques
for inter-module analysis
31
Benchmark
LOC
Analysis
Time
Analysis
Mem
Region
Shape
Share
tsp
910
0.03s
<30MB
100%
100%
98%
em3d
1103
0.09s
<30MB
100%
100%
99%
voronoi
1324
0.50s
<30MB
97%
98%
98%
bh
2304
0.72s
<30MB
100%
96%
96%
db
1985
0.68s
<30MB
100%
100%
81%
raytrace
5809
15.5s
38MB
100%
95%
92%
Exp
3567
152.3s
48MB
100%
100%
100%
Interpreter
15293 114.8s
122MB
100%
97%
96%
32
33
Model of the Program heap
• Captures key properties for client applications
• Amenable to static analysis
• Intuitive and understandable by non-experts
Static Analysis
• Analyze 15KLoc in a few minutes and ~200MB
(average 2 contexts per method)
• Precisely represent range of useful information
Dynamic Support
• Debugger hooks in place for C# and Visual Studio
• Other tool support in progress
34
Transform core concepts from prototype to
robust tools
• Finish implementation of static analysis for CLI
bytecode + core libraries (also runtime support)
• Export results to Visual Studio for inspection, spec.
generation, or other tools
• Build strong foundation for other tools
Apply results in optimization, refactoring,
testing, or error detection applications
35
36
Simple interpreter and debug environment for
simple OO-Java like language
14,000+ Loc (in normalized form), 90 Classes
• Additional 1500 Loc for specialized standard library
handling stubs
Large recursive call structures, large
inheritance trees with numerous virtual
method implementations
Wide range of data structure types, extensive
use of java.util collections, uses both shared
and unshared structures
37
Track object allocation sites and escape sites
similarly to tracking use/mod
• Add tags to each node
Allocation sites of the objects represented by the node
Escape sites of the objects represented by the node
Then based on Identity information and these
Tags we can identify
• Where objects are allocated locally
• Which objects escape
• Which calls created objects that escaped into this
call and which data structures these objects are in
38
Via Common Compiler Infrastructure (CCI)
and reflection, rewrite bytecode with shadow
memory
• Extract snapshot
• Apply already defined abstraction function
More precise than static analysis (less chance
for user confusion)
• No error propagation issues
• Abstract Model represents a single heap (not a set)
• Does not have the overhead of static analysis
39
Experience with using model to understand
heap state shows promise
• Has been used by a number of individuals outside
the group with positive reports
Our implementation (done as a Visual Studio
addin) can abstract nontrivial heaps
• 220K objects in 9s and approx. 300MB of Memory
Code is available on CodePlex
• ...
40