Transcript PPT

CS590Z Statistical Debugging
Xiangyu Zhang
(part of the slides are from Chao Liu)
A Very Important Principle


Traditional debugging techniques deal with a single execution (or very few executions).
Given a large set of executions, including both passing and failing ones, statistical debugging is often highly effective.


Failure reporting
In-house testing
Tarantula (ASE 2005, ISSTA 2007)
Scalable Remote Bug Isolation (PLDI
2004, 2005)

Look at predicates



Branches
Function returns (<0, <=0, >0, >=0, ==0, !=0)
Scalar pairs
  For each assignment x = …, find all variables y_i and constants c_j; form each pair x (=, <, <=, …) y_i/c_j
Sample the predicate evaluations (Bernoulli sampling)
Investigate the relation between the probability of a predicate being true and the bug manifestation.
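A minimal sketch of what such predicate instrumentation might look like. The counter arrays, the observe() helper, NUM_PREDS, and the 1% sampling rate are illustrative assumptions, not the actual CBI instrumentation (which, for instance, amortizes the coin flips with a countdown scheme):

#include <stdlib.h>

#define NUM_PREDS 3          /* illustrative: one slot per instrumented predicate */

/* Per-run counters: how often each predicate was observed (sampled)
   and how often it evaluated to true. */
static unsigned pred_obs[NUM_PREDS];
static unsigned pred_true[NUM_PREDS];

/* Bernoulli sampling: record an evaluation only with probability p. */
static int sampled(double p) { return (double)rand() / RAND_MAX < p; }

static void observe(int id, int value, double p)
{
    if (!sampled(p)) return;          /* this evaluation is skipped      */
    pred_obs[id]++;                   /* predicate reached and sampled   */
    pred_true[id] += (value != 0);    /* predicate evaluated to true     */
}

int foo(void) { return 1; }           /* stand-in callee for the example */

void example(int x, int y, int idx, int bufsize)
{
    observe(0, idx < bufsize, 0.01);  /* branch predicate                */
    observe(1, foo() > 0, 0.01);      /* function-return predicate       */
    x = y + 1;
    observe(2, x > y, 0.01);          /* scalar-pair predicate: x vs. y  */
}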
Bug Isolation

How much does P being true increase the probability of failure over simply reaching the line where P is sampled?
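One way to make this concrete, as a hedged sketch of the "increase" idea in Liblit's scalable bug isolation work: from the per-run reports, count how often P was observed and how often it was true, separately for failing and successful runs, and score P by how much its being true raises the failure probability over merely being reached. The struct and helper names are mine, and the published scoring involves further refinements:

/* Aggregated over all runs, for one predicate P:
 *   f_true / s_true: failing / successful runs in which P was observed true
 *   f_obs  / s_obs : failing / successful runs in which P's site was reached
 *                    (P sampled at least once), regardless of its value.   */
struct pred_counts {
    unsigned f_true, s_true;
    unsigned f_obs,  s_obs;
};

/* Increase(P) = Failure(P) - Context(P): how much observing P true raises
 * the chance that the run fails, beyond merely reaching the line where P
 * is sampled. */
static double increase(const struct pred_counts *c)
{
    if (c->f_true + c->s_true == 0 || c->f_obs + c->s_obs == 0)
        return 0.0;                                   /* P never observed */
    double failure = (double)c->f_true / (c->f_true + c->s_true);
    double context = (double)c->f_obs  / (c->f_obs  + c->s_obs);
    return failure - context;
}

Predicates with a large positive increase are the ones whose truth is unusually predictive of failure.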
An Example
Symptoms
  563 lines of C code
  130 out of 5542 test cases fail to give correct outputs
  No crashes

void subline(char *lin, char *pat, char *sub)
{
    int i, lastm, m;
    lastm = -1;
    i = 0;
    while ((lin[i] != ENDSTR)) {
        m = amatch(lin, i, pat, 0);
        if (m >= 0) {       /* faulty condition; should be ((m >= 0) && (lastm != m)) */
            putsub(lin, i, m, sub);
            lastm = m;
        }
        if ((m == -1) || (m == i)) {
            fputc(lin[i], stdout);
            i = i + 1;
        } else
            i = m;
    }
}

The predicate is evaluated to both true and false in one execution; a single true/false observation per run is not enough.
void subline(char *lin, char *pat, char *sub)
{
    int i, lastm, m;
    lastm = -1;
    i = 0;
    while ((lin[i] != ENDSTR)) {
        m = amatch(lin, i, pat, 0);
        if ((m >= 0) && (lastm != m)) {
            putsub(lin, i, m, sub);
            lastm = m;
        }
        if ((m == -1) || (m == i)) {
            fputc(lin[i], stdout);
            i = i + 1;
        } else
            i = m;
    }
}
P_f(A) = P̃(A | A & !B)
P_t(A) = P̃(A | !(A & !B))
where A denotes (m >= 0) and B denotes (lastm != m).
Program Predicates

A predicate is a proposition about any program property
  e.g., idx < BUFSIZE, a + b == c, foo() > 0, …
Each predicate can be evaluated multiple times during one execution
Every evaluation gives either true or false
Therefore, a predicate is simply a Boolean random variable, which encodes program executions from a particular aspect.
Evaluation Bias of Predicate P

Evaluation bias
  Def'n: the probability of P being evaluated as true within one execution
  Maximum likelihood estimation: the number of true evaluations over the total number of evaluations in one run
Each run gives one observation of the evaluation bias of predicate P
Suppose we have n correct and m incorrect executions; for any predicate P, we end up with
  an observation sequence for correct runs:   S_p = (X'_1, X'_2, …, X'_n)
  an observation sequence for incorrect runs: S_f = (X_1, X_2, …, X_m)
Can we infer whether P is suspicious based on S_p and S_f?
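A small sketch of how these observation sequences could be assembled from per-run counters (the array and function names are illustrative):

/* Evaluation bias of P in one run: MLE = (# true evaluations) / (# evaluations).
 * A run that never evaluates P contributes no observation (returned as -1). */
static double evaluation_bias(unsigned n_true, unsigned n_total)
{
    return n_total ? (double)n_true / n_total : -1.0;
}

/* Build the observation sequence for one predicate over a set of runs
 * (use the correct runs to get S_p, the incorrect runs to get S_f).
 * Returns the number of observations written into seq. */
static int collect_sequence(const unsigned *n_true, const unsigned *n_total,
                            int num_runs, double *seq)
{
    int k = 0;
    for (int r = 0; r < num_runs; r++) {
        double x = evaluation_bias(n_true[r], n_total[r]);
        if (x >= 0.0)
            seq[k++] = x;               /* one observation X_r per run */
    }
    return k;
}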
Underlying Populations



Imagine that the underlying distributions of evaluation bias for correct and incorrect executions are two (unknown) populations
S_p and S_f can be viewed as random samples from these underlying populations, respectively
One major heuristic: the larger the divergence between the two distributions, the more relevant the predicate P is to the bug

[Figure: two probability density curves over the evaluation bias (x-axis from 0 to 1, y-axis "Prob"), one for correct and one for incorrect executions]
Major Challenges
[Figure: the two evaluation-bias density curves again (x-axis from 0 to 1, y-axis "Prob")]

No knowledge of the closed forms of either distribution
Usually, we do not have sufficient incorrect executions to estimate the incorrect-run distribution reliably
Our Approach
Algorithm Outputs

A ranked list of program predicates w.r.t. the bug relevance score s(P)
  Higher-ranked predicates are regarded as more relevant to the bug
What's the use?
  Top-ranked predicates suggest the possible buggy regions
  Several predicates may point to the same region
  ……
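For illustration, one simple way to instantiate such a score s(P) without assuming closed-form distributions is a nonparametric two-sample (rank-sum) statistic over S_f and S_p. The sketch below is a stand-in of my own, not SOBER's actual ranking statistic:

#include <math.h>

/* Illustrative bug relevance score s(P): a Mann-Whitney-style comparison of
 * the failing-run biases S_f = sf[0..m-1] against the passing-run biases
 * S_p = sp[0..n-1]. Larger values mean a larger divergence between the two
 * samples, hence (by the heuristic) higher relevance to the bug. */
static double relevance_score(const double *sf, int m, const double *sp, int n)
{
    double u = 0.0;
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++)
            u += (sf[i] > sp[j]) ? 1.0 : (sf[i] == sp[j] ? 0.5 : 0.0);

    /* Normal approximation of U under "same distribution"
       (no tie correction, for brevity). */
    double mean = 0.5 * m * n;
    double sd   = sqrt((double)m * n * (m + n + 1) / 12.0);
    return sd > 0.0 ? fabs(u - mean) / sd : 0.0;
}

Predicates would then be sorted by this score in decreasing order to form the ranked list.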
Outline






Program Predicates
Predicate Rankings
Experimental Results
Case Study: bc-1.06
Future Work
Conclusions
Experiment Results

Localization quality metric
  Software bug benchmark
  Quantitative metric
Related works
  Cause Transition (CT) [CZ05]
  Statistical Debugging [LN+05]
Performance comparisons
Bug Benchmark

Bug benchmark
  Dreaming benchmark: a large number of known bugs on large-scale programs with an adequate test suite
  Siemens Program Suite
    130 variants of 7 subject programs, each of 100-600 LOC
    130 known bugs in total, mainly logic (or semantic) bugs
    Advantages
      Known bugs, thus judgments are objective
      Large number of bugs, thus a comparative study is statistically significant
    Disadvantages
      Small-scale subject programs
State-of-the-art performance, as claimed in the literature so far
  Cause-transition approach [CZ05]
Localization Quality Metric [RR03]
1st Example

[Figure: a program dependence graph of 10 numbered nodes; T-score = 70%]
2nd Example

[Figure: a program dependence graph of 10 numbered nodes; T-score = 20%]
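A sketch of how the T-score of [RR03] can be computed, under my reading of the metric: breadth-first search the program dependence graph from the reported node, count the nodes a developer would examine before covering a faulty node, and report the fraction of code not examined. The graph encoding, the fixed size bound, and the node-at-a-time stopping rule (the original metric expands whole BFS layers at a time) are simplifying assumptions:

#define MAX_NODES 64                    /* illustrative bound on PDG size (n <= MAX_NODES) */

/* adj[v][k] lists the neighbors of node v in the PDG, deg[v] their count.
 * report is the node the tool points to; is_faulty marks the real bug.
 * Returns the T-score as a percentage of code NOT examined. */
double t_score(int n, int adj[][MAX_NODES], const int *deg,
               int report, const int *is_faulty)
{
    int queue[MAX_NODES], seen[MAX_NODES] = {0};
    int head = 0, tail = 0, examined = 0;

    queue[tail++] = report;
    seen[report] = 1;
    while (head < tail) {
        int v = queue[head++];
        examined++;                     /* developer inspects this node */
        if (is_faulty[v])
            break;                      /* bug located                  */
        for (int k = 0; k < deg[v]; k++) {
            int w = adj[v][k];
            if (!seen[w]) { seen[w] = 1; queue[tail++] = w; }
        }
    }
    return 100.0 * (1.0 - (double)examined / n);
}

Under this reading, a T-score of 70% on a 10-node graph corresponds to examining 3 of the 10 nodes before reaching the bug.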
Related Works

Cause Transition (CT) approach [CZ05]





A variant of delta debugging [Z02]
Previous state-of-the-art performance holder on Siemens
suite
Published in ICSE’05, May 15, 2005
Cons: it relies on memory abnormality, hence its
performance is restricted.
Statistical Debugging (Liblit05) [LN+05]



Predicate ranking based on discriminant analysis
Published in PLDI’05, June 12, 2005
Cons: Ignores evaluation patterns of predicates within each
execution
Localized bugs w.r.t. Examined Code
Cumulative Effects w.r.t. Code Examination
Top-k Selection


Regardless of the specific selection of k, both Liblit05 and SOBER are better than CT, the current state-of-the-art holder
From k = 2 to 10, SOBER is consistently better than Liblit05
Outline






Evaluation Bias of Predicates
Predicate Rankings
Experimental Results
Case Study: bc-1.06
Future Work
Conclusions
Case Study: bc 1.06

bc 1.06
  14288 LOC
  An arbitrary-precision calculator shipped with most distributions of Unix/Linux
Two bugs were localized
  One was reported by Liblit in [LN+05]
  One was not reported previously
Sheds some light on scalability
Outline






Evaluation Bias of Predicates
Predicate Rankings
Experimental Results
Case Study: bc-1.06
Future Work
Conclusions
Future Work




Further improve the localization quality
Robustness to sampling
Stress tests on large-scale programs to confirm scalability to code size
…
Conclusions




We devised a principled statistical method for
bug localization.
No parameter setting hassles
It handles both crashing and noncrashing bugs.
Best localization quality reported so far.
Discussion

Features
  Easy implementation
  Difficult experimentation
  More advanced statistical techniques may not be necessary
  Go wide, not go deep …
Predicates are treated as independent random variables
Can execution indexing help?
Can statistical principles be combined with slicing or IWIH?