Transcript PPT
CS590Z Statistical Debugging
Xiangyu Zhang
(Some of the slides are from Chao Liu)
A Very Important Principle
Traditional debugging techniques deal with
single (or very few) executions.
With a large set of executions, including both
passing and failing ones, statistical debugging
is often highly effective.
Failure reporting
In-house testing
Tarantula (ASE 2005, ISSTA 2007)
Scalable Remote Bug Isolation (PLDI
2004, 2005)
Look at predicates
Branches
Function returns (<0, <=0, >0, >=0, ==0, !=0)
Scalar pairs
For each assignment x = …, find all in-scope variables y_i and constants
c_j; form each comparison of x (==, <, <=, …) with y_i or c_j
Sample the predicate evaluations (Bernoulli
sampling)
Investigate the relation between the probability of a predicate
being true and the manifestation of the bug.
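The Bernoulli sampling step above can be sketched as follows; the names `SAMPLE_RATE`, `pred_counter`, and `observe` are hypothetical, chosen for illustration rather than taken from the cited work:

```c
#include <stdlib.h>

/* Hypothetical instrumentation sketch of Bernoulli sampling: each
   predicate evaluation is recorded with a small fixed probability,
   keeping runtime overhead low while the recorded counts remain
   unbiased (scaled-down) estimates of the true counts. */
#define SAMPLE_RATE 0.01

typedef struct { long true_cnt, false_cnt; } pred_counter;

static void observe(pred_counter *c, int value) {
    if ((double)rand() / RAND_MAX < SAMPLE_RATE) {
        if (value) c->true_cnt++;
        else       c->false_cnt++;
    }
}

/* Usage at an instrumented site, e.g. the function-return predicates
   of a call ret = foo():
       observe(&pred_lt0, ret < 0);
       observe(&pred_eq0, ret == 0);
       observe(&pred_gt0, ret > 0);                                   */
```
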
Bug Isolation
How much does P being true increase the probability of failure over
simply reaching the line where P is sampled?
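The question above is the heart of the ranking used in the remote bug isolation work. A minimal sketch, assuming the standard Failure/Context/Increase formulation; the count and function names here are mine:

```c
/* Per-predicate counts aggregated over many runs:
     f_true, s_true : failing / successful runs where P was observed true
     f_obs,  s_obs  : failing / successful runs where P's site was reached
   failure_score  = chance of failure given P was observed true.
   context_score  = chance of failure given P's line was merely reached.
   increase_score = how much P being true raises the failure probability. */
double failure_score(long f_true, long s_true) {
    return (double)f_true / (double)(f_true + s_true);
}

double context_score(long f_obs, long s_obs) {
    return (double)f_obs / (double)(f_obs + s_obs);
}

double increase_score(long f_true, long s_true, long f_obs, long s_obs) {
    return failure_score(f_true, s_true) - context_score(f_obs, s_obs);
}
```
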
An Example
Symptoms:
563 lines of C code
130 out of 5542 test cases fail to give correct outputs
No crashes

void subline(char *lin, char *pat, char *sub)
{
    int i, lastm, m;
    lastm = -1;
    i = 0;
    while ((lin[i] != ENDSTR)) {
        m = amatch(lin, i, pat, 0);
        if (m >= 0) {   /* faulty; correct: ((m >= 0) && (lastm != m)) */
            putsub(lin, i, m, sub);
            lastm = m;
        }
        if ((m == -1) || (m == i)) {
            fputc(lin[i], stdout);
            i = i + 1;
        } else
            i = m;
    }
}

The predicates here are evaluated to both true and
false within one execution, so recording only whether
a predicate was ever true is not enough.
void subline(char *lin, char *pat, char *sub)
{
int i, lastm, m;
lastm = -1;
i = 0;
while((lin[i] != ENDSTR)) {
m = amatch(lin, i, pat, 0);
if ((m >= 0) && (lastm != m) ){
putsub(lin, i, m, sub);
lastm = m;
}
if ((m == -1) || (m == i)){
fputc(lin[i], stdout);
i = i + 1;
} else
i = m;
}
}
P_f (A) = tilde P (A | A & !B)
P_t (A) = tilde P (A | !(A&!B))
Program Predicates
A predicate is a proposition about any program
property
e.g., idx < BUFSIZE, a + b == c, foo() > 0 …
Each can be evaluated multiple times during one
execution
Every evaluation gives either true or false
Therefore, a predicate is simply a boolean random
variable, which encodes program executions from a
particular aspect.
Evaluation Bias of Predicate P
Evaluation bias
Def’n: the probability of being evaluated as true within one execution
Maximum likelihood estimate: the number of true evaluations over the
total number of evaluations in one run
Each run gives one observation of evaluation bias for predicate P
Suppose we have n correct and m incorrect executions, for any
predicate P, we end up with
An observation sequence for correct runs
An observation sequence for incorrect runs
S_p = (X’_1, X’_2, …, X’_n)
S_f = (X_1, X_2, …, X_m)
Can we infer whether P is suspicious based on S_p and S_f?
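The estimate defined above is straightforward to compute from a run's per-predicate counters; a minimal sketch (the function name is mine):

```c
/* Maximum-likelihood estimate of the evaluation bias of predicate P
   in one run: the number of true evaluations over the total number
   of evaluations observed in that run. Collecting one such value per
   run yields the observation sequences S_p (over the n passing runs)
   and S_f (over the m failing runs). */
double evaluation_bias(long n_true, long n_false) {
    return (double)n_true / (double)(n_true + n_false);
}
```
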
Underlying Populations
Imagine the underlying distributions of evaluation bias for correct
and incorrect executions.
S_p and S_f can be viewed as random samples from the two
underlying populations, respectively.
One major heuristic is:
The larger the divergence between the two distributions, the more
relevant the predicate P is to the bug.
[Figure: two probability curves of evaluation bias over [0, 1], one
for correct executions and one for incorrect executions]
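As a crude illustration of this heuristic (not the method's actual statistic, which is a principled hypothesis test), one could compare the mean evaluation bias across the two observation sequences; all names here are mine:

```c
/* Illustrative divergence proxy: the gap between the average
   evaluation bias observed in failing runs (S_f) and in passing
   runs (S_p). A large gap suggests P behaves differently when
   the bug manifests. */
double mean_bias(const double *obs, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) sum += obs[i];
    return sum / n;
}

double divergence_score(const double *s_p, int n, const double *s_f, int m) {
    double d = mean_bias(s_f, m) - mean_bias(s_p, n);
    return d < 0 ? -d : d;  /* absolute difference of the means */
}
```
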
Major Challenges
No knowledge of the closed forms of either
distribution.
Usually, we do not have sufficient incorrect
executions to estimate the incorrect-run
distribution reliably.
Our Approach
Algorithm Outputs
A ranked list of program predicates w.r.t. the
bug relevance score s(P)
Higher-ranked predicates are regarded more
relevant to the bug
What’s the use?
Top-ranked predicates suggest the possible buggy
regions
Several predicates may point to the same region
……
Outline
Program Predicates
Predicate Rankings
Experimental Results
Case Study: bc-1.06
Future Work
Conclusions
Experiment Results
Localization quality metric
Software bug benchmark
Quantitative metric
Related works
Cause Transition (CT), [CZ05]
Statistical Debugging, [LN+05]
Performance comparisons
Bug Benchmark
Bug benchmark
Dreaming benchmark
Large number of known bugs on large-scale programs with adequate test suites
Siemens Program Suite
130 variants of 7 subject programs, each of 100-600 LOC
130 known bugs in total
Mainly logic (or semantic) bugs
Advantages
Known bugs, thus judgments are objective
Large number of bugs, thus comparative study is statistically significant
Disadvantages
Small-scale subject programs
State-of-the-art performance claimed in the literature so far:
the cause-transition approach, [CZ05]
Localization Quality Metric [RR03]
1st Example
[Figure: a 10-node program dependence graph; 70% of the nodes are
examined before the fault is reached]
T-score = 70%
2nd Example
[Figure: another 10-node program dependence graph; 20% of the nodes
are examined before the fault is reached]
T-score = 20%
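A sketch of how such a T-score can be computed, assuming the breadth-first-search-over-the-dependence-graph formulation of [RR03]; the fixed-size adjacency-matrix encoding and all names are mine:

```c
/* T-score sketch: starting from the nodes blamed by a report, do a
   breadth-first search over the program dependence graph until a
   faulty node is examined; the T-score is the percentage of all
   nodes examined up to that point. */
#define N_NODES 10

double t_score(int adj[N_NODES][N_NODES],
               const int *blamed, int n_blamed,
               const int *faulty, int n_faulty) {
    int seen[N_NODES] = {0}, queue[N_NODES];
    int head = 0, tail = 0, examined = 0;
    for (int i = 0; i < n_blamed; i++)
        if (!seen[blamed[i]]) { seen[blamed[i]] = 1; queue[tail++] = blamed[i]; }
    while (head < tail) {
        int u = queue[head++];
        examined++;
        for (int i = 0; i < n_faulty; i++)
            if (u == faulty[i])               /* fault reached */
                return 100.0 * examined / N_NODES;
        for (int v = 0; v < N_NODES; v++)     /* expand the BFS frontier */
            if (adj[u][v] && !seen[v]) { seen[v] = 1; queue[tail++] = v; }
    }
    return 100.0;  /* fault unreachable: all code would be examined */
}
```
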
Related Works
Cause Transition (CT) approach [CZ05]
A variant of delta debugging [Z02]
Previous state-of-the-art performance holder on Siemens
suite
Published in ICSE’05, May 15, 2005
Cons: it relies on memory abnormality, hence its
performance is restricted.
Statistical Debugging (Liblit05) [LN+05]
Predicate ranking based on discriminant analysis
Published in PLDI’05, June 12, 2005
Cons: Ignores evaluation patterns of predicates within each
execution
Localized bugs w.r.t. Examined Code
Cumulative Effects w.r.t. Code Examination
Top-k Selection
Regardless of the specific selection of k, both Liblit05 and
SOBER are better than CT, the current state-of-the-art holder.
From k = 2 to 10, SOBER is consistently better than Liblit05.
Outline
Evaluation Bias of Predicates
Predicate Rankings
Experimental Results
Case Study: bc-1.06
Future Work
Conclusions
Case Study: bc 1.06
bc 1.06
Two bugs were localized
14288 LOC
An arbitrary-precision calculator shipped with
most distributions of Unix/Linux
One was reported by Liblit in [LN+05]
One was not reported previously
Sheds some light on scalability
Outline
Evaluation Bias of Predicates
Predicate Rankings
Experimental Results
Case Study: bc-1.06
Future Work
Conclusions
Future Work
Further improve the localization quality
Robustness to sampling
Stress-test on large-scale programs to confirm
scalability to code size
…
Conclusions
We devised a principled statistical method for
bug localization.
No parameter setting hassles
It handles both crashing and noncrashing bugs.
Best quality so far.
Discussion
Features
Easy implementation
Difficult experimentation
More advanced statistical techniques may not be necessary
Go wide, not go deep…
Predicates are treated as independent random
variables.
Can execution indexing help?
Can statistical principles be combined with slicing or
IWIH ?