Slides - University of Nebraska–Lincoln


Why do Record/Replay Tests of
Web Applications Break?
Mouna Hammoudi1, Gregg Rothermel1,
Paolo Tonella2
1University of Nebraska-Lincoln, USA
2Fondazione Bruno Kessler—IRST, Italy
This work has been partially supported by the National Science Foundation through award IIS-1314365.
1
Record/Replay (RR) Techniques
• RR techniques record interactions with web
apps and replay them
• Recordings can be used to regression test web
apps as they evolve
• RR tools include:
-Selenium
-Sahi
-TestCafe
-QTP
-Sikuli
-Watir
2
Record/Replay Illustration
(Start Recording … Stop Recording)
Recorded form: First Name "John", Last Name "Smith", Address "300 M Street",
Postal Code "89146", State "Nevada", County "Clark", City "Las Vegas"
Recorded Selenese commands:
<type, id=FirstName, John>
<type, id=LastName, Smith>
<type, id=Address, 300 M Street>
<type, id=PostalCode, 89146>
<select, id=State, Nevada>
<select, id=County, Clark>
<select, id=City, Las Vegas>
<click, id=Submit, >
Each Selenese command is denoted by
a 3-tuple <action, locator, value>
3
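The 3-tuple structure above can be sketched in a few lines of Python. This is a hypothetical illustration, not Selenium's actual implementation: `Command`, `replay`, and the dict-based "page" are invented stand-ins for a recorded script and a browser.

```python
from collections import namedtuple

# A Selenese command as an <action, locator, value> 3-tuple (illustrative names).
Command = namedtuple("Command", ["action", "locator", "value"])

def replay(commands, page):
    """Replay commands against a fake page: a dict mapping 'id=...' locators
    to element state. A KeyError here is how a locator breakage would surface."""
    for cmd in commands:
        element = page[cmd.locator]          # resolve the locator
        if cmd.action in ("type", "select"):
            element["value"] = cmd.value     # fill in the recorded value
        elif cmd.action == "click":
            element["clicked"] = True
    return page

script = [
    Command("type", "id=FirstName", "John"),
    Command("type", "id=LastName", "Smith"),
    Command("select", "id=State", "Nevada"),
    Command("click", "id=Submit", ""),
]
page = {c.locator: {} for c in script}
replay(script, page)
```

If a later app version renames an element's id, the dict lookup fails, which mirrors how a real replay tool reports an unresolvable locator.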
HTML elements
• HTML pages are composed of HTML elements
(drop-down lists, text fields, etc.)
• HTML elements have attributes, expressed as
name/value pairs
– id=FullName
– name=UserNameField
4
Locators
• Used by JavaScript, other languages and RR
tools to identify and manipulate elements
– Attribute-based locators: make use of attributes
such as IDs and names to identify elements
id=submitButton
– Structure-based locators: rely on the structure of
a web page to identify elements based on their
DOM tree position (XPaths, CSS selectors, etc)
//div/form/input[@type="submit"]
5
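The two locator families can be contrasted with Python's standard-library `xml.etree.ElementTree`, which supports a small XPath subset. This is a minimal sketch: the page fragment and ids (`submitButton`) are made up for illustration.

```python
import xml.etree.ElementTree as ET

# A tiny page fragment (hypothetical) with a form inside a div.
html = """
<body>
  <div>
    <form>
      <input type="text" id="userName"/>
      <input type="submit" id="submitButton"/>
    </form>
  </div>
</body>
"""
root = ET.fromstring(html)

# Attribute-based locator: match on the element's id attribute.
by_id = root.find(".//input[@id='submitButton']")

# Structure-based locator: follow the element's DOM-tree position,
# as the XPath //div/form/input[@type="submit"] would.
by_path = root.find("./div/form/input[@type='submit']")

assert by_id is by_path  # both locators resolve to the same element
```

Attribute-based locators survive layout changes but break when ids are renamed; structure-based locators survive renames but break when the tree is restructured.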
Problem: Test Breakage
• Problem: RR tests are brittle and easily break
in the face of web app evolution
• “Test Breakage”: A test that used to work on a
web app ceases to work on a new version of
that app, due to a change in the web app
6
Examples
• Examples of test breakages:
– The choices in a drop down menu have
changed
– A page element has been renamed
– The XPath of a button has been changed
– An assertion’s expected value has been
changed
7
Novelty
• No formal attempt has been made to
characterize the causes of test breakages by
studying them empirically, much less to
measure the frequencies at which those
different causes occur
• This is a prerequisite for research on testing
web apps
We set out to develop a taxonomy of test
breakages
8
Our Procedure: Training Set
• Collect 300 versions of 5 open source web
apps from SourceForge and construct test suites
for the earliest versions using a RR tool
• Apply tests to subsequent versions of each
web app, collect data on breakage causes
• Given test breakages on a subsequent version
V’, we repaired them and augmented the test
suite to account for added functionality
• This yielded data on 722 breakages
9
Our Procedure (Continued)
• Using these breakages, we developed a
taxonomy of breakage causes that
characterizes all the breakages observed
• Test Set: We gathered 153 versions of 3
additional apps and applied the same process
• This yielded 343 additional breakages
10
Objects of Study
• Have at least 20 versions or sequences of
commits
• Involve at least 30,000 LOC
• Are installable and executable
• Have been downloaded at least 5,000 times
11
Objects of Study
App Name         Rels./Commits      LOC    Downloads  Tests
Training Set
PHPFusion                  49   256,899    1,605,195     47
Address Book               79    35,675      126,146     44
PHPAgenda                  32    43,831       64,605     42
MyCollaboration            29   116,345        7,638     39
Joomla                    111   312,978  >50,000,000     56
Test Set
MyMovieLibrary             23    31,324        6,201     36
Dolibarr                   30    42,010      864,698     38
YourContacts              100    64,765      676,543     57
Average                    57   112,978    6,668,878     45
12
Choice of RR Tool
• We chose Selenium IDE as a representative RR
tool given its popularity, its open source nature
and the features it supports
• Selenium IDE is similar to many other RR tools
(Sahi, CasperJS, Watir) which have test structures
that utilize <action, locator, value> tuples and
utilize similar types of locators
• We expect our taxonomy to generalize to other
RR tools in this class, possibly with different
breakage frequencies
13
Test Suite Creation
• No RR test suites were available for the web apps
• We followed a systematic and iterative procedure
for each web app
– We installed the first version V0 and created a test
suite T for V0
– We used functional adequacy and partition/boundary
value coverage criteria
– We executed T on the next version V’, repaired (and
recorded) breakages and augmented T to cover new
functionality
– We continued until all versions were considered
14
Taxonomization Process
• Systematic Process:
– Study each breakage to determine its cause and write
its description
– Cluster the causes of test breakages in terms of
similarity factors
– Identify candidate equivalence classes of breakages
and assign labels to them
– Review equivalence classes and reach a consensus on
them
– Organize labels into hierarchies by clustering based on
the similarity of factors responsible for breakages
15
Taxonomization Process
• Taxonomies evolve as additional observations are
made
• Applying the taxonomy to subsequent apps may not
allow every type of test breakage in those apps to be
categorized
• We repeated the taxonomization process on a test set
of 3 additional apps
• The taxonomic categories we created in our training
set sufficed to categorize breakages in the test set
16
Causes of Test Breakages
• Proximal causes: Causes that “most nearly impact
the test code” and describe the direct cause of the
breakage
• Distal causes: Changes that were made to the web
app code that later led to the breakage
• Example: Locator Breakage
– Proximal Cause: Selenium cannot locate a particular
drop down list
– Distal Cause: an engineer deleted the drop down list
from the HTML code
• Taxonomy of proximal causes
17
Counting Test Breakages
• Correcting one breakage and rerunning the
test can allow other breakages to appear later;
we count and categorize each such breakage
• Example: <type, id=“password”, mypassword>
– Change 1: Text box ID is changed to "Password"
– Change 2: New restrictions on password length
• We count these two breakages separately
18
Taxonomy
19
Taxonomy
20
Taxonomy
21
Taxonomy
22
Results
23
Results
24
Results
25
Taxonomy
26
Element Attribute not Found
The element's id attribute is now id="Firstname":
Broken command: <type, id=firstname, Mouna>
Repaired command: <type, id=Firstname, Mouna>
27
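Because id values are matched exactly (case included), even a capitalization change breaks an attribute-based locator. A minimal sketch, using an invented dict-based page and `locate` helper rather than any real tool API:

```python
# Two versions of a page, modeled (hypothetically) as id -> element maps.
old_page = {"firstname": "<input id='firstname'>"}
new_page = {"Firstname": "<input id='Firstname'>"}  # id renamed: capital F

def locate(page, element_id):
    # id lookup is case-sensitive, so a renamed id silently stops matching.
    return page.get(element_id)

recorded_locator = "firstname"
assert locate(old_page, recorded_locator) is not None   # worked on V0
assert locate(new_page, recorded_locator) is None       # breaks on V'
assert locate(new_page, "Firstname") is not None        # repaired locator works
```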
Taxonomy
28
Hierarchy-Based Locator Target not Found
A wrapping <div> … </div> is added around the button:
Broken command: <click, //form/button, >
Repaired command: <click, //form/div/button, >
29
Taxonomy
30
Index-Based Locator Target not Found
A new <form> is added before the original form:
<form>
<button…> … </button>
</form>
Broken command: <click, //form[1]/button, >
Repaired command: <click, //form[2]/button, >
31
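The index shift can be reproduced with `xml.etree.ElementTree`, whose XPath subset supports positional predicates like `form[1]`. The page fragments here are invented for illustration:

```python
import xml.etree.ElementTree as ET

old_page = ET.fromstring(
    "<body><form><button>Save</button></form></body>")
# In the evolved page a new form is inserted before the original one.
new_page = ET.fromstring(
    "<body><form><input/></form><form><button>Save</button></form></body>")

# The recorded index-based locator //form[1]/button targets the first form.
assert old_page.find("./form[1]/button") is not None
# On the new page the first form has no button, so the old locator fails...
assert new_page.find("./form[1]/button") is None
# ...and the repaired locator //form[2]/button finds the target again.
assert new_page.find("./form[2]/button") is not None
```

This is why index-based XPaths are among the most fragile locators: any sibling inserted earlier in the page shifts every index after it.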
Taxonomy
32
Missing Value Required
Broken command: <click, css=input[type="submit"], >
Repaired commands (a new required field must be filled first):
<type, name=address, 300 S 16th street>
<click, css=input[type="submit"], >
33
Taxonomy
34
Value Absent from Drop Down List
Option removed: <option value="volvo">Volvo</option>
Broken command: <select, name=cars, label=Volvo>
35
Threats to Validity
• Generalization of findings: we considered only 8 web
apps, so our results may not generalize to other web
apps
• Our results are gathered relative to Selenium IDE tests;
generalization to programmable and visual tools is not
supported
• Developers who create tests may behave differently
when modifying their applications, resulting in
different frequencies of test breakages
• We created the tests and we classified test breakages
36
Implications
• Prioritization of breakages for test repair
• Prioritization of inspections
• Breakage avoidance
• Breakage prevention
• Detecting bad smells in web tests
• IDE enhancements
• Root cause analysis
37
Implications
• Prioritization of breakages for test repair
• Prioritization of inspections
• Breakage avoidance
• Breakage prevention
• Detecting bad smells in web tests
• IDE enhancements
• Root cause analysis
38
Breakage Avoidance
• Educate developers and maintainers of web
applications as to the causes and probabilities of
test breakages
• Our taxonomy could serve as a source of
guidelines on best practices to follow while
changing web application code
• Changing a locator name might be useful for code
readability but its cost in terms of impact on tests
can be quite high
39
Breakage Prevention
• Knowledge about causes of test breakages could
help web app and test developers prevent them
• Add meaningful and stable id’s to the core
elements of the web interface, or take advantage
of tools to create robust locators automatically
• Adopt design patterns such as the Page Object
design pattern to produce change-resilient tests
40
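The Page Object pattern mentioned above can be sketched briefly. This is a hypothetical illustration (the `FakeDriver`, `LoginPage`, and locator names are invented; a real page object would wrap a Selenium WebDriver): the point is that locators live in one class, so a renamed id requires one repair instead of one per test.

```python
class FakeDriver:
    """Stands in for a WebDriver: records typed values per locator."""
    def __init__(self):
        self.fields = {}

    def type(self, locator, value):
        self.fields[locator] = value

    def click(self, locator):
        self.fields[locator] = "clicked"

class LoginPage:
    # All locators for this page live here; a breakage is confined to this class.
    USERNAME = "id=username"
    PASSWORD = "id=password"
    SUBMIT = "id=submit"

    def __init__(self, driver):
        self.driver = driver

    def login(self, user, password):
        self.driver.type(self.USERNAME, user)
        self.driver.type(self.PASSWORD, password)
        self.driver.click(self.SUBMIT)

driver = FakeDriver()
LoginPage(driver).login("mouna", "secret")
```

Tests call `LoginPage(driver).login(...)` and never mention locators, so they survive locator renames unchanged.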
Implications
• Prioritization of breakages for test repair
• Prioritization of inspections
• Breakage avoidance
• Breakage prevention
• Detecting bad smells in web tests
• IDE enhancements
• Root cause analysis
41
Related Work:
Web App Test Breakages
• Stocco et al. investigate the automated generation of page
objects that confine causes of test breakages to a single class
• Yandrapally et al. address the problem of test script fragility
in relation to locators, proposing approaches for identifying
UI elements in more robust ways using contextual clues
• Leotta et al. developed ROBULA and multi-locators, which
are approaches for producing robust locators
• Choudhary et al. developed a differential testing approach
that compares the behavior of a broken test to its behavior
on a prior version of the web app and attempts to repair the
test
42
Related Work:
Taxonomy of Test Breakages
• Choudhary et al. focus on changes and associated problems that
may result in test breakages
• Their classification is not drawn from any formal empirical study of
web applications
• Identify three types of changes related to broken test scripts
1. Structural changes are changes in the DOM tree and may
cause locator problems that can be characterized as
"non-selection" or "mis-selection" problems
2. Content changes involve modifications of the text or HTML
contained in a DOM node; these can affect assertions that
check node content and lead to "obsolete content" problems
3. Blind changes involve changes in server-side code
43
Conclusion
• A quantitative and qualitative assessment of the
causes behind test breakages
• Taxonomy Uses:
– Help practitioners create less brittle tests
– Help researchers find better ways to repair tests,
prevent and avoid breakages
– Create more robust IDEs and RR tools
• Current Work: Extend the taxonomy to
programmable web testing (Selenium WebDriver)
44
45
Refined Taxonomy
46
Element Attribute Not Found Breakages
47
Breakage Manifestation Scenarios
• Direct Breakages: manifest themselves
precisely when the breakage cause is
encountered
• Propagated Breakages: do not manifest
themselves immediately but do manifest
themselves later on subsequent test actions
• Silent Breakages: never manifest themselves
explicitly
48
Taxonomy
49
Element Text not Found
<click, link=Visit our HTML tutorial>
50
Taxonomy
51
Invalid Text Field Value Input
<type, name=username, mouna>
52
Taxonomy
53
Unexpected Assertion Value
Broken assertion: <assertAlert, Thank you for your submission!>
New expected value: <assertAlert, Thank you for your submission, we will contact you shortly!>
54
Taxonomy
55
Page Reload Needed
Broken command: <click, css=input[type="submit"]>
Repaired command: <clickAndWait, css=input[type="submit"]>
56
Taxonomy
57
Page Reload No Longer Needed
Broken command: <clickAndWait, css=input[type="submit"]>
Repaired command: <click, css=input[type="submit"]>
58
Taxonomy
59
User Session Made Longer
<wait, 1800, >
<assertAlert, "You will be logged out shortly!">
ERROR: Undetected Alert
60
Taxonomy
61
User Session Made Shorter
<wait, 3600, >
<assertAlert, "You will be logged out shortly!">
ERROR: Unexpected Alert
62
Taxonomy
63
Popup Box Added
Error: Unexpected Popup Box
64
Taxonomy
65
Popup Box Deleted
ERROR: Undetected Pop Up Box
66
Related Work:
GUI Test Repair
• Memon and Soffa, Memon and Datchayani et al. use
event flow graphs and transformation techniques to
repair broken GUI tests
• Huang et al. use a genetic algorithm to repair GUI tests
• Grechanik et al. present an approach that analyzes an
initial and modified GUI for differences and generates
a report for engineers
• Daniel et al. use GUI refactorings to keep track of
changes engineers make to a GUI, suggesting that this
information could later be used to repair tests
67
Related Work:
Test Maintainability
• Leotta et al. conducted an empirical study in which
they measured the effort associated with the
evolution of two equivalent Selenium IDE and
Selenium WebDriver test suites, with the aim of
comparing the maintainability of record/replay vs.
programmable test suites
68
Related Work:
Fault Taxonomies
• Ocariza et al. characterize the classes of error
messages output by JavaScript in web apps and the
classes of faults found in JavaScript
• Ricca and Tonella present a preliminary fault
taxonomy for web apps, consisting of a single level of
general fault categories
• Marchetto et al. present a taxonomy of web app
faults.
69