Transcript Slide 1

Prioritizing Test Cases for
Regression Testing
Article by: Rothermel et al.
Presentation by:
Martin, Otto, and Prashanth
• Test case prioritization techniques schedule test cases for
execution in an order that attempts to increase their
effectiveness at meeting some performance goal.
• One goal is the rate of fault detection: a measure of how
quickly faults are detected within the testing process.
An improved rate of fault detection during testing can
provide faster feedback on the system under test and let
software engineers begin correcting faults earlier than
might otherwise be possible.
• One application of prioritization techniques involves
regression testing.




This paper describes several techniques for using test execution
information to prioritize test cases for regression testing, including:
1) techniques that order test cases based on their total coverage of
code components,
2) techniques that order test cases based on their coverage of code
components not previously covered, and
3) techniques that order test cases based on their estimated ability
to reveal faults in the code components that they cover.




When the time required to re-execute an entire test suite is short,
test case prioritization may not be cost-effective; it may be sufficient
simply to schedule test cases in any order.
When the time required to execute an entire test suite is sufficiently
long, however, test case prioritization may be beneficial because, in
this case, meeting testing goals earlier can yield meaningful
benefits.
In general test case prioritization, given program P and test suite T,
we prioritize the test cases in T with the intent of finding an
ordering of test cases that will be useful over a succession of
subsequent modified versions of P.
In the case of regression testing, prioritization techniques can use
information gathered in previous runs of existing test cases to help
prioritize the test cases for subsequent runs.

This paper considers 9 different test case prioritization techniques.

The first three techniques serve as experimental controls: M1
(untreated, i.e., the original test suite order), M2 (random
ordering), and M3 (optimal ordering, described below).



The last six techniques represent heuristics that could be
implemented using software tools
A source of motivation for these approaches is the conjecture that
the availability of test execution data can be an asset.
This assumes that past test execution data can be used to predict,
with sufficient accuracy, subsequent execution behavior.




Definition 1. The Test Case Prioritization Problem:
Given: T, a test suite; PT, the set of permutations of T; and f, a
function from PT to the real numbers.
Problem: Find T' in PT such that, for all T'' in PT, f(T') >= f(T'').
PT represents the set of all possible prioritizations (orderings) of T.
f is a function that, applied to any such ordering, yields an award
value for that ordering.

A challenge: care must be taken to keep the cost of performing the
prioritization from excessively delaying the very regression testing
activities it is intended to facilitate.




M3: Optimal prioritization.
Given program P and a set of known faults for P, if we can
determine, for test suite T, which test cases in T expose which faults
in P, then we can determine an optimal ordering of the test cases in
T for maximizing T's rate of fault detection for that set of faults.
This is not a practical technique, as it requires a priori knowledge of
the existence of faults and of which test cases expose which faults.
However, by using this technique in the empirical studies, we can
gain insight into the success of other practical heuristics, by
comparing their solutions to optimal solutions.



M4: Total statement coverage prioritization.
By instrumenting a program, we can determine, for any test case,
which statements in that program were exercised (covered) by that
test case.
We can then prioritize test cases in terms of the total number of
statements they cover by counting the number of statements
covered by each test case and then sorting the test cases in
descending order of that number.
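
A minimal sketch of this procedure in Python (the data layout is an
assumption for illustration: coverage information is taken to be a
mapping from each test case to the set of statement IDs it exercises):

    # Total statement coverage prioritization (sketch): sort test cases
    # by the number of statements each covers, in descending order.
    # `coverage` maps a test case ID to the set of statements it covers,
    # as would be collected by instrumenting the program.

    def total_coverage_prioritization(coverage):
        return sorted(coverage, key=lambda t: len(coverage[t]), reverse=True)

    # Example: t2 covers four statements, so it is scheduled first.
    coverage = {"t1": {1, 2}, "t2": {1, 2, 3, 4}, "t3": {3}}
    print(total_coverage_prioritization(coverage))  # ['t2', 't1', 't3']

The same sketch yields total branch coverage prioritization (M6) if
branch outcomes are recorded instead of statements.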




M5: Additional statement coverage prioritization.
Total statement coverage prioritization schedules test cases in the
order of total coverage achieved; however, having executed a test
case and covered certain statements, more may be gained in
subsequent testing by executing statements that have not yet been
covered.
Additional statement coverage prioritization iteratively selects a test
case that yields the greatest coverage of statements not yet covered,
then adjusts the coverage information on all remaining test cases to
indicate their coverage of statements not yet covered, and repeats
this process until all statements covered by at least one test case
have been covered.
We may reach a point where each statement has been covered by
at least one test case, and the remaining unprioritized test cases
cannot add additional statement coverage. We could order these
remaining test cases using any prioritization technique.
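
A sketch of this greedy loop in Python, under the same assumed data
layout as above; leftover test cases that add no new coverage are
simply appended in their original order, one of the "any prioritization
technique" options the text allows:

    # Additional statement coverage prioritization (sketch): repeatedly
    # pick the test case covering the most statements not yet covered;
    # once no remaining test adds new coverage, append the rest as-is.

    def additional_coverage_prioritization(coverage):
        remaining = dict(coverage)      # test case -> set of statements
        covered, order = set(), []
        while remaining:
            best = max(remaining, key=lambda t: len(remaining[t] - covered))
            if not remaining[best] - covered:
                order.extend(remaining) # no test adds new coverage
                break
            covered |= remaining.pop(best)
            order.append(best)
        return order

    coverage = {"t1": {1, 2}, "t2": {1, 2, 3, 4}, "t3": {3, 5}}
    print(additional_coverage_prioritization(coverage))  # ['t2', 't3', 't1']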



M6: Total branch coverage prioritization.
Total branch coverage prioritization is the same as total statement
coverage prioritization, except that it uses test coverage measured
in terms of program branches rather than statements.
In this context, we define branch coverage as coverage of each
possible overall outcome of a (possibly compound) condition in a
predicate. Thus, for example, each if or while statement must be
exercised such that it evaluates at least once to true and at least
once to false.



M7: Additional branch coverage prioritization.
Additional branch coverage prioritization is the same as additional
statement coverage prioritization, except that it uses test coverage
measured in terms of program branches rather than statements.
After complete coverage has been achieved, the remaining test cases
are prioritized by resetting coverage vectors to their initial values
and reapplying additional branch coverage prioritization to the
remaining test cases.
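
The reset step is what distinguishes M7's handling of leftover test
cases; a sketch of it in Python, reusing the assumed data layout from
the statement coverage sketches (with branch IDs in place of statement
IDs; the guard for tests that cover nothing is an added assumption):

    # Additional branch coverage prioritization with reset (sketch):
    # when no remaining test adds new branch coverage, reset the
    # covered set and reapply the greedy selection to the leftovers.

    def additional_coverage_with_reset(coverage):
        remaining = dict(coverage)      # test case -> set of branches
        covered, order = set(), []
        while remaining:
            best = max(remaining, key=lambda t: len(remaining[t] - covered))
            if not remaining[best] - covered:
                if not covered:         # tests covering nothing: append as-is
                    order.extend(remaining)
                    break
                covered = set()         # reset coverage vectors, go again
                continue
            covered |= remaining.pop(best)
            order.append(best)
        return order

    coverage = {"t1": {"b1"}, "t2": {"b1", "b2"}, "t3": {"b2"}}
    print(additional_coverage_with_reset(coverage))  # ['t2', 't1', 't3']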




M8: Total fault-exposing-potential (FEP) prioritization.
Some faults are more easily exposed than other faults, and some
test cases are more adept at revealing particular faults than other
test cases.
The ability of a test case to expose a fault (that test case's
fault-exposing potential, or FEP) depends not only on whether the
test case covers (executes) a faulty statement, but also on the
probability that a fault in that statement will cause a failure for
that test case.
Three probabilities that could be used in determining FEP:



1) the probability that a statement s is executed (execution
probability),
2) the probability that a change in s can cause a change in
program state (infection probability), and
3) the probability that a change in state propagates to output
(propagation probability).




This paper adopts an approach that uses mutation analysis to
produce a combined estimate of propagation-and-infection that
does not incorporate independent execution probabilities.
Mutation analysis creates a large number of faulty versions
(mutants) of a program by altering program statements, and uses
these to assess the quality of test suites by measuring whether
those test suites can detect those faults ('kill' those mutants).
Given program P and test suite T, we first create a set of mutants
N = {n1, n2, ..., nm} for P, noting which statement sj in P contains
each mutant. Next, for each test case ti in T, we execute each
mutant version nk of P on ti, noting whether ti kills that mutant.
Having collected this information for every test case and mutant, we
consider each test case ti and each statement sj in P, and calculate
the fault-exposing potential FEP(sj, ti) of ti on sj as the ratio of
mutants of sj killed by ti to the total number of mutants of sj.


To perform total FEP prioritization, given these FEP(s, t) values, we
next calculate, for each test case ti in T, an award value, by
summing the FEP(sj, ti) values for all statements sj in P.
Given these award values, we then prioritize test cases by sorting
them in order of descending award value.
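
A sketch of both steps in Python, assuming the mutation results are
stored as, for each statement, a list of its mutants, with each mutant
represented by the set of test cases that kill it (an illustrative
layout, not the paper's tooling):

    # FEP(s, t) = (mutants of statement s killed by test t) / (mutants of s).
    # Total FEP prioritization: award(t) = sum of FEP(s, t) over all s;
    # test cases are sorted by descending award value.

    def total_fep_prioritization(tests, mutants):
        def award(t):
            return sum(
                sum(t in killers for killers in muts) / len(muts)
                for muts in mutants.values()
            )
        return sorted(tests, key=award, reverse=True)

    # Statement 1 has two mutants; t2 kills both, so FEP(1, t2) = 1.0.
    mutants = {1: [{"t1", "t2"}, {"t2"}], 2: [{"t3"}]}
    print(total_fep_prioritization(["t1", "t2", "t3"], mutants))
    # ['t2', 't3', 't1']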

M9: Additional fault-exposing-potential (FEP) prioritization.

Analogous to additional coverage prioritization, this technique
adjusts FEP information as test cases are scheduled. This lets us
account for the fact that additional executions of a statement may
be less valuable than initial executions.

We require a mechanism for measuring the value of an execution of
a statement that can be related to FEP values.

For this, we use the term confidence. We say that the confidence in
statement s, C(s), is an estimate of the probability that s is correct.

If we execute a test case t that exercises s and does not reveal a
fault in s, C(s) should increase.
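
A minimal sketch of how such confidence bookkeeping could drive
additional FEP prioritization. The update rule below,
C(s) <- C(s) + (1 - C(s)) * FEP(s, t), and the starting confidence of
0.5 are assumptions for illustration; the paper develops its own
confidence calculation:

    # Additional FEP prioritization via confidence values (sketch).
    # C(s) estimates the probability that statement s is correct; each
    # test that exercises s without failing raises C(s). A test's award
    # is the total confidence increase it would produce if run next.

    def additional_fep_prioritization(fep, initial_confidence=0.5):
        # fep maps each test case to {statement: FEP(s, t)}
        conf = {s: initial_confidence for t in fep for s in fep[t]}
        remaining, order = set(fep), []

        def gain(t):
            return sum((1 - conf[s]) * p for s, p in fep[t].items())

        while remaining:
            best = max(remaining, key=gain)
            for s, p in fep[best].items():
                conf[s] += (1 - conf[s]) * p   # assumed update rule
            remaining.remove(best)
            order.append(best)
        return order

    fep = {"t1": {1: 0.9}, "t2": {1: 0.8, 2: 0.8}}
    print(additional_fep_prioritization(fep))  # ['t2', 't1']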



Research Questions
• Can test case prioritization improve the rate of fault detection of
test suites?
• How do the various test case prioritization techniques discussed
earlier compare to one another in terms of effects on rate of
fault detection?
Effectiveness Measures
• Use a weighted Average of the Percentage of Faults Detected
(APFD)
• Ranges from 0 to 100
• Higher numbers mean faster detection
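
The paper defines APFD in terms of the positions at which faults are
first revealed: for an ordering of n test cases detecting m faults,
with TFi the position of the first test case that reveals fault i,
APFD = 1 - (TF1 + TF2 + ... + TFm)/(n*m) + 1/(2n). A direct Python
translation (the data layout is illustrative):

    # APFD for one prioritized ordering. Returned as a fraction in
    # [0, 1]; multiply by 100 for the percentage scale on the slides.

    def apfd(ordering, faults):
        # faults maps each fault to the set of test cases that reveal it
        n, m = len(ordering), len(faults)
        position = {t: i + 1 for i, t in enumerate(ordering)}
        first_reveal = [min(position[t] for t in tests)
                        for tests in faults.values()]
        return 1 - sum(first_reveal) / (n * m) + 1 / (2 * n)

    # Fault f1 is first revealed by test 1, f2 by test 3 (n=4, m=2):
    order = ["t3", "t1", "t4", "t2"]
    faults = {"f1": {"t3", "t2"}, "f2": {"t4"}}
    print(apfd(order, faults))  # 0.625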
Problems with APFD
• Doesn't measure the cost of prioritization
• Cost is normally amortized, because prioritization can be
performed in the period after one version is released and before
the next version's regression testing begins
Effectiveness Example

Programs used
• Aristotle program analysis system for test coverage and
control-flow graph information
• Proteum mutation system to obtain mutation scores
• Used 8 C programs as subjects
• First 7 were created at Siemens; the eighth is a European Space
Agency program

Siemens Programs - Description
• 7 programs used by Siemens in a study that observed the "fault
detecting effectiveness of coverage criteria"
• Faulty versions of these programs were created by manually
seeding them with single errors, yielding the "number of
versions" column
• Using single-line faults lets researchers determine exactly
whether a test case discovers the error or not
• For each of the seven programs, a pool of test cases was created
by Siemens, first via a black-box method; they then completed
the pool using white-box testing, so that each "executable
statement, edge, and definition-use pair ... was exercised by at
least 30 test cases"
• Kept faulty programs whose errors were detectable by between
3 and 350 test cases
• Test suites were created by the researchers by random selection
until a branch-coverage-adequate test suite was created
• Proteum was used to create mutants of the seven programs

Space Program – Description
• 33 versions of space, each containing a single fault, were
created by the ESA; 2 more were created by the research team
• An initial pool of 10,000 test cases was obtained from Vokolos
and Frankl
• Used these as a base and added cases until each statement and
edge was exercised by at least 30 test cases
• Created a branch-coverage-adequate test suite in the same way
as for the Siemens programs
• Also created mutants via Proteum


Empirical Studies and Results
4 different studies using the 8 programs:
1) Siemens programs with APFD measured relative to Siemens
faults
2) Siemens programs with APFD measured relative to mutants
3) Space with APFD measured relative to actual faults
4) Space with APFD measured relative to mutants

Siemens programs with APFD measured
relative to Siemens faults – Study Format
• M2 to M9 were applied to each of the 1000 test suites, resulting
in 8000 prioritized test suites
• The original 1000 suites served as M1 (untreated)
• Calculated the APFD relative to the faults provided with each
program
Example boxplot

Study 1 - Overall observations
• M3 is markedly better than all of the others (as expected)
• The test case prioritization techniques offered appear to yield
some improvement, but further statistical analysis was needed to
confirm this
• Upon completion of that analysis, more results were revealed:
  • Branch-based coverage did as well as or better than
  statement-based coverage
  • In all but one case, total branch coverage did as well as or
  better than additional branch coverage
  • In all cases, total statement coverage did as well as or better
  than additional statement coverage
• In 5 of 7 programs, even randomly prioritized test suites did
better than untreated test suites
Example Groupings

Siemens programs with APFD measured relative to
mutants – Study Format
• Same format as the first study, with 9000 test suites used, 1000
for each prioritization technique
• Rather than running the test cases against the small set of
known errors, they were applied to mutated versions of the
programs, created to form a larger test bed
Results
• Additional and total FEP prioritization outperformed all others
(except optimal)
• Branch almost always outperformed statement
• Total statement outperformed additional statement
• But additional branch coverage outperformed total branch
coverage
• However, in this study random did not outperform the untreated
control




Space with APFD measured relative to Actual
Faults
• M2 to M9 were applied to each of the 50 test suites, resulting in
400 prioritized test suites; with the original 50 (M1), 450 test
suites in total
• Additional FEP outperformed all others, but there was no
significant difference among the rest
• Also, random was no better than the control
Study 3 Groupings

Space with APFD measured relative to
mutants
• Same technique as the other space study, only using 132,163
mutant versions of the software
• Additional FEP outperformed all others
• Branch and statement are indistinguishable
• But additional coverage always outperforms its total counterpart
Study 4 Groupings

Threats to Validity
• Construct Validity – whether you are measuring what you say
you are measuring (and not something else)
• Internal Validity – the ability to claim that the observed causal
relationship is true
• External Validity – the ability to generalize results across the
field

Construct Validity
• APFD is highly accurate, but it is not the only way to measure
fault detection; one could also measure the percentage of the
test suite that must be run before all errors are found
• APFD assigns no value to later tests that detect the same error
• FEP-based calculations – other estimates may more accurately
capture the probability of a test case finding a fault
• Effectiveness is measured without accounting for cost

Internal Validity
• Instrumentation effects can bias results, especially in the APFD
and prioritization measurement tools
• Code review was performed to limit this threat
• Problems were also limited by running each prioritization
algorithm on every test suite and every subject program

External Validity
• The Siemens programs are non-trivial but not representative of
real-world programs; the space program is, but it is only one
program
• Faults in the Siemens programs were seeded (unlike those in
the real world)
• Faults in space were found during development, but these may
differ from those found later in the development process; they
are also only one set of faults found by one set of programmers
• Single-fault versions of programs are also not representative of
the real world
• The test suites were created with only a single method; other
real-world methods exist
• These threats can only be addressed by more studies with
different test suites, programs, and errors
Additional Discussion And
Practical Implications
• Test case prioritization can substantially improve the rate of
fault detection of test suites.
• Additional FEP prioritization techniques do not always justify the
additional expense they incur: in some cases specific
coverage-based techniques outperformed them, and even where
the additional FEP techniques did perform best, the total gain in
APFD was not always large enough.
• Branch-coverage-based techniques almost always performed as
well as, if not better than, statement-coverage-based
techniques. Thus, if the two techniques incur similar costs,
branch-coverage-based techniques are advocated.
• Total statement and branch coverage techniques perform almost
on par with the additional statement and branch coverage
techniques, favoring their use due to their lower complexity.
• However, this does not hold for the space program (Study 4),
where the additional branch and statement coverage techniques
outperformed their total counterparts by a huge margin.
• Randomly prioritized test suites typically outperform untreated
test suites.
Conclusion
• Any one of the prioritization techniques offers some amount of
improved fault detection capability.
• The FEP-based techniques are of interest mainly to research
groups, due to the high expense that they incur; however,
code-coverage-based techniques have immediate practical
implications.
Future Work
• Additional studies to be performed using a wider range of
programs, faults, and test suites.
• The gap between optimal prioritization and FEP prioritization
techniques is yet to be bridged.
• Determining which prioritization technique is warranted by
particular types of programs and test suites.
• Other prioritization objectives have to be investigated.
• Version-specific techniques.
• Techniques may be applied not only to regression testing but
also during the initial testing of the software.