Modeling The Heap - The IMDEA Software Institute

Download Report

Transcript Modeling The Heap - The IMDEA Software Institute

Mark Marron
IMDEA-Software (Madrid, Spain)
[email protected]
1
“Algorithms + Data Structures = Programs”
Niklaus Wirth
“Show me your flowcharts and conceal your
tables, and I shall continue to be mystified.
Show me your tables, and I won't usually need
your flowcharts; they'll be obvious.”
Fred Brooks
2

Understanding the heap is increasingly
important to understanding the program
• Object Oriented Programming makes objects (data
not code) central organizing principle of program
• Introduction of GC encourages extensive use of
heap allocated structures


Support large class of client applications in
IDE, compiler, and testing
Focus on scalable, manageable models/tools
even at cost of overall expressivity/analytic
power
3

Optimization
• Parallelization and Redundancy Elimination
• Memory Allocation and GC

Program Understanding
• Pre/Post Annotation Extraction
• Rich Class Invariant Computation

Testing
• Test Case Generation
• Debugger Visualizations
4

Model of the Program heap
• Represent key heap properties
• Amenable to efficient static analysis
• Understandable by non-experts

Static Analysis
• Scalable to module level
• Efficient (time/memory) to run interactively on
developer machines

Dynamic Support
• Debugger hooks
• Hooks for specification checking
• Heap structure test case generation
5
Representation
6

Logical data structures (Regions)
• Group related sections of the heap
• Keep unrelated sections of the heap separate

Connectivity Inside and Between Regions
• Shape
• Reachability/Interference
 Between variables
 Arbitrary objects on heap
 Important special case is uniqueness or aliasing of
elements in collections/arrays
7

Object/Data Structure Identity
• Track objects (data structures) across time in
program execution (how are structures rearranged)

Heap Based Use-Mod
• Given a stmt which memory locations may be
used/mod at that line
• Given a method which memory locations may be
used/mod in the body (frame or footprint)


Allocation and Escape Sites
Field Nullity and Collection Size/Iterators
8

Base on storage shape graph
• Nodes represent sets of objects (or data structures),
edges represent sets of pointers
• Has natural representation for many of the
properties we are interested in
• Easy to visualize and understand by non-experts
• Efficient and compact for computation
 Disjointness properties are natural (and mostly free)
 Majority of computation is local
 Imprecision spread is minimal

Annotate nodes and edges with additional
instrumentation properties
9

Key issue for shape graph approach is how to
group concrete objects into abstract nodes
• Too many nodes is confusing and computationally
expensive
• Too few nodes leads to imprecision (as a single
node must represent multiple logical structures)
• Often done via allocation site or types

Solution: nodes are related sets of objects
• Recursive type information (recursive vs. nonrecursive types)
• Objects stored in the same collection, array or
structure
10
11
12

Most general way objects in a region are
connected
•
•
•
•

(S)ingleton: no pointers between any objects
(L)ist: may contain a linear List or simpler structures
(T)ree: may contain a Tree or simpler structures
(C)ycle: may a cyclic or simpler structures
E.g. A region with a (T)ree layout may contain
Tree, List or Singleton structures, but no
Cyclic structures.
13
14


Edges abstract sets of references (variable
references or pointers)
Heap graph has ability to track some sharing
properties but insufficiently precise to model
many important properties
• E.g. given an array of objects does any object
appear multiple times?

May occur between references abstracted by
same edge or two different edges
• Interference: abstracted by same edge
• Connectivity: abstracted by different edges
15


Does a single edge abstract only references
with disjoint targets or may some of these
references be aliased/related?
Edge e is:
• non-interfering: all pairs of references r1, r2 must
refer to disjoint data structures.
• interfering: may exist a pair of references r1, r2 that
refer to the same data structure.
16
17

Want to know how heap structure is changed
between points in the program
• Often access paths are used to specify sets of
objects but these quickly become very complicated

Shape graph is a partition of the heap
• Pick an arbitrary points in the program to use as the
canonical partitions
• At each future program point track how partitioning
has changed relative to canonical partition

Most natural (and generally useful) point is
each method entry
18

Use the notion of Data Structure Identity
• Each node represents a set of objects
• Add tags to each node tracking the most recent
stmt that
 Read from a field in an object represented by the node
 Wrote to a field in an object represented by the node

Can then compute stmt level and method
level use/mod sets (footprints, frames) via
• Structure Identity
• Use/Mod tags in Nodes
19
int mangle(Pair p)
{
int res = p.first.val;
p.second = p.first;
p.first = null;
return res;
}
20

Track object allocation sites and escape sites
similarly to tracking use/mod
• Allocation sites of the objects
• Escape sites of the objects

Nullity and collection information
• Per field nullity tag in each node
• For nodes that may represent collections (lists, sets,
dictionaries) track size of collection (0, 1+, 0+)
21


N-Body simulation in 3-dimensions
Uses Fast Multi-Pole method with space
decomposition tree
• For nearby bodies use naive n2 algorithm
• For distant bodies compute center of mass of many
bodies and treat as single point mass
22
23

Inline double[] into MathVector objects and unroll
loops, 23% speedup 37% memory use reduction
24
Iterator b = this.bodyTabRev.iterator();
while(b.hasNext())
((Body) b.next()).hackGravity(rsize, root);
25

TLP update loop over bodyTabRev, factor 3.09
speedup on quad-core machine
26
Static Analysis
27

Computational Cost
• Flow Sensitivity
• Disjunctive Representations (powersets)
• Context Sensitivity

Extensive use of Collections and other
standard libraries
• Need to deal with IO, Files, etc.
• Collections particularly important (wrt. heap
analysis) and used extensively
28


Adopt existing work on Partially Disjunctive
Domains (SAS 04) to help with flow and
disjunctive costs
Our recent work (in submission) addresses
the disjunctive and context sensitivity issues
• Calls take/return a model set of size 1
• Virtual calls handled efficiently, e.g. do not result in
explosion in set size
• Average number of contexts per method (in all
benchmarks) is less than 2.2
• All with near optimal precision (more later)
29

Most libraries are not interesting from the
standpoint of userland code
• Don’t care how FileStream is implemented
• Just treat as uninterrupted heap region

Can analyze Collection implementations
directly but
• Many use specialized highly optimized data
structures, so very hard to model correctly
• Analysis is expensive and redundant

Use simple high level semantics for common
collections (precise and efficient)
30


Prototype implemented in C++ analyzing Java
source (timing results reported here)
Currently working on an implementation for
CLI (C#/.net) bytecode
• Working on support for bytecodes generated by C#
compiler (currently function pointers, references,
exceptions not handled)
• Support key portions of System, Collections, and IO
handled - plans to support GUI libs and threads
• Goal is module level analysis with other techniques
for inter-module analysis
31
Benchmark
LOC
Analysis
Time
Analysis
Mem
Region
Shape
Share
tsp
910
0.03s
<30MB
100%
100%
98%
em3d
1103
0.09s
<30MB
100%
100%
99%
voronoi
1324
0.50s
<30MB
97%
98%
98%
bh
2304
0.72s
<30MB
100%
96%
96%
db
1985
0.68s
<30MB
100%
100%
81%
raytrace
5809
15.5s
38MB
100%
95%
92%
Exp
3567
152.3s
48MB
100%
100%
100%
Interpreter
15293 114.8s
122MB
100%
97%
96%
32
33

Model of the Program heap
• Captures key properties for client applications
• Amenable to static analysis
• Intuitive and understandable by non-experts

Static Analysis
• Analyze 15KLoc in a few minutes and ~200MB
(average 2 contexts per method)
• Precisely represent range of useful information

Dynamic Support
• Debugger hooks in place for C# and Visual Studio
• Other tool support in progress
34

Transform core concepts from prototype to
robust tools
• Finish implementation of static analysis for CLI
bytecode + core libraries (also runtime support)
• Export results to Visual Studio for inspection, spec.
generation, or other tools
• Build strong foundation for other tools

Apply results in optimization, refactoring,
testing, or error detection applications
35
36


Simple interpreter and debug environment for
simple OO-Java like language
14,000+ Loc (in normalized form), 90 Classes
• Additional 1500 Loc for specialized standard library
handling stubs


Large recursive call structures, large
inheritance trees with numerous virtual
method implementations
Wide range of data structure types, extensive
use of java.util collections, uses both shared
and unshared structures
37

Track object allocation sites and escape sites
similarly to tracking use/mod
• Add tags to each node
 Allocation sites of the objects represented by the node
 Escape sites of the objects represented by the node

Then based on Identity information and these
Tags we can identify
• Where objects are allocated locally
• Which objects escape
• Which calls created objects that escaped into this
call and which data structures these objects are in
38

Via Common Compiler Infrastructure (CCI)
and reflection, rewrite bytecode with shadow
memory
• Extract snapshot
• Apply already defined abstraction function

More precise than static analysis (less chance
for user confusion)
• No error propagation issues
• Abstract Model represents a single heap (not a set)
• Does not have the overhead of static analysis
39

Experience with using model to understand
heap state shows promise
• Has been used by a number of individuals outside
the group with positive reports

Our implementation (done as a Visual Studio
addin) can abstract nontrivial heaps
• 220K objects in 9s and approx. 300MB of Memory

Code is available on CodePlex
• ...
40