mining - University of Illinois at Urbana

Download Report

Transcript mining - University of Illinois at Urbana

Software Analytics:
Towards Software Mining
that Matters
Tao Xie
University of Illinois at Urbana-Champaign
http://www.cs.illinois.edu/homes/taoxie/
[email protected]
Should I test\review my?
©A. Hassan
©A. Hassan
©A. Hassan
©A. Hassan
Software analytics is to enable software practitioners
to perform data exploration and analysis in order to
obtain insightful and actionable information for datadriven tasks around software and services.
[MALETS’11 Zhang et al.]
Software Intelligence &
Analytics for Software Development
http://people.engr.ncsu.edu/txie/publications/foser10-si.pdf
http://thomas-zimmermann.com/publications/files/buse-foser-2010.pdf
• use Data Exploration and Analysis
 Mining Software Repositories (MSR)
• for Software Practitioners
Beyond Software Developers
• obtain Insightful and Actionable info
Need get real as well
• Analytic Techniques
• Producing Impact on Practice
Look through your
software
data
©A. Hassan
Promise Data
Repository
Mine through
the data!
An international effort to
make software repositories actionable
http://msrconf.org
http://promisedata.org
©A. Hassan
Mining Software Repositories (MSR)
• Transforms static recordkeeping repositories to
active repositories
• Makes repository data
actionable by uncovering
hidden patterns and
trends
Field logs CVS/SVN
Bugzilla
Mailinglist
Crashes
11
©A. Hassan
Sourceforge
GoogleCode
Code Repos
Source Control
CVS/SVN
Bugzilla
Mailing
lists
Historical Repositories
Crash
Repos
Field
Logs
Runtime Repos
12
MSR researchers
analyze and cross-link repositories
discussions
Buggy change &
Fixing change
fixed
bug
Bugzilla
Mailinglist
CVS/SVN
Field
crashes
Crashes
New Bug Report
Estimate fix effort
Mark duplicates
Suggest experts and fix!
©A. Hassan
• use Data Exploration and Analysis
 Mining Software Repositories (MSR)
• for Software Practitioners
Beyond Software Developers
• obtain Insightful and Actionable info
Need get real as well
• Analytic Techniques
• Producing Impact on Practice
We continue to help
practitioners (esp. developers)
©A. Hassan
©A. Hassan
©A. Hassan
©A. Hassan
Detection and Management of
Code Clones
©A. Hassan
Support
Logs
Source
Code
©A. Hassan
©A. Hassan
• use Data Exploration and Analysis
 Mining Software Repositories (MSR)
• for Software Practitioners
Beyond Software Developers
• obtain Insightful and Actionable info
Need get real as well
• Analytic Techniques
• Case Studies
Predicting Bugs
• Studies have shown that most complexity metrics
correlate well with LOC!
– Graves et al. 2000 on commercial systems
– Herraiz et al. 2007 on open source systems
• Noteworthy findings:
– Previous bugs are good predictors of future bugs
– The more a file changes, the more likely it will have
bugs in it
– Recent changes affect more the bug potential of a file
over older changes (weighted time damp models)
– Number of developers is of little help in predicting bugs
– Hard to generalize bug predictors across projects
unless in similar domains [Nagappan, Ball et al. 2006]
23
Using Imports in Eclipse to Predict Bugs
71% of files that import compiler packages,
had to be fixed later on.
import org.eclipse.jdt.internal.compiler.lookup.*;
import org.eclipse.jdt.internal.compiler.*;
import org.eclipse.jdt.internal.compiler.ast.*;
import org.eclipse.jdt.internal.compiler.util.*;
...
import org.eclipse.pde.core.*;
import org.eclipse.jface.wizard.*;
import org.eclipse.ui.*;
14% of all files that import ui packages,
had to be fixed later on.
[Schröter et al. 06]
24
Don’t program on Fridays ;-)
Percentage of bug-introducing changes for eclipse
[Zimmermann et al. 05]
25
Failure is a 4-letter Word
[PROMISE’11 Zeller et al.]
26
Actionable Alone is not Enough!
[PROMISE’11 Zeller et al.]
27
Who produces more buggy code?
©A. Hassan
• use Data Exploration and Analysis
 Mining Software Repositories (MSR)
• for Software Practitioners
Beyond Software Developers
• obtain Insightful and Actionable info
Need get real as well
• Analytic Techniques
• Producing Impact on Practice
Analytic Techniques in SE
• Association rules and frequent patterns
• Classification
• Clustering
• Text mining/Natural language processing
• Visualization
More details are at
• https://sites.google.com/site/xsoftanalytics/
30
Solution-Driven Problem-Driven
Where can I apply X miner?
Basic
mining
algorithms
E.g., association rule,
frequent itemset mining…
Advanced
mining
algorithms
E.g., frequent partial order
mining [ESEC/FSE 07]
What patterns do
we really need?
New/adapted
mining
algorithms
E.g., [ICSE 09], [ASE 09]
31
Mining  Searching + Mining
Traditional approaches
Code repositories
1
patterns
mining
2
Eclipse, Linux, …
Often lack sufficient relevant data points (Eg. API call sites)
Our new approaches
Code repositories
1
2
…
Open source code
on the web
N
searching
mining
patterns
Code search engine
e.g.,
32
32
32
Existing approaches produce high % of false positives
One major observation:
Programmers often write code in different ways for
achieving the same task
Some ways are more frequent than others
Frequent
ways
mine patterns
Infrequent
ways
detect violations
Mined Patterns
S.Thummalapenta and T. Xie. Alattin: Mining Alternative Patterns for Detecting Neglected Conditions. ASE 2009.
Example: java.util.Iterator.next()
Java.util.Iterator.next() throws NoSuchElementException when invoked on a list
without any elements
Code Sample 1
PrintEntries1(ArrayList<string>
entries)
{
…
Iterator it = entries.iterator();
if(it.hasNext()) {
string last = (string) it.next();
}
…
}
Code Sample 2
PrintEntries2(ArrayList<string>
entries)
{
…
if(entries.size() > 0) {
Iterator it = entries.iterator();
string last = (string) it.next();
}
…
}
34
S.Thummalapenta and T. Xie. Alattin: Mining Alternative Patterns for Detecting Neglected Conditions. ASE 2009.
Example: java.util.Iterator.next()
Code Sample 1
Code Sample 2
PrintEntries1(ArrayList<string>
entries)
{
…
Iterator it = entries.iterator();
if(it.hasNext()) {
string last = (string) it.next();
}
…
}
PrintEntries2(ArrayList<string>
entries)
{
…
if(entries.size() > 0) {
Iterator it = entries.iterator();
string last = (string) it.next();
}
…
}
Sample 2 (6/1243)
Sample 1 (1218 / 1243)
1243 code
examples
Mined Pattern from existing
approaches:
“boolean check on return of Iterator.hasNext before Iterator.next”
35
S.Thummalapenta and T. Xie. Alattin: Mining Alternative Patterns for Detecting Neglected Conditions. ASE 2009.
Example: java.util.Iterator.next()
Code Sample 1
PrintEntries1(ArrayList<string>
entries)
{
…
Iterator it = entries.iterator();
if(it.hasNext()) {
string last = (string) it.next();
}
…
}
Code Sample 2
PrintEntries2(ArrayList<string>
entries)
{
…
if(entries.size() > 0) {
Iterator it = entries.iterator();
string last = (string) it.next();
}
…
}
 Require more general patterns (alternative patterns): P1 or P2
P1 : boolean check on return of Iterator.hasNext before Iterator.next
P2 : boolean check on return of ArrayList.size before Iterator.next
 Cannot be mined by existing approaches, since alternative P2 is
infrequent
S.Thummalapenta and T. Xie. Alattin: Mining Alternative Patterns for Detecting Neglected Conditions. ASE 2009.
Our Solution: ImMiner Algorithm
[ASE 09]
 Mines alternative patterns of the form P1 or P2
 Based on the observation that infrequent alternatives such as P2
are frequent among code examples that do not support P1
1243 code examples
Sample 2 (6/1243)
Sample 1 (1218 / 1243)
P2 is infrequent among
entire 1243 code examples
P2 is frequent among code
examples not supporting P1
37
S.Thummalapenta and T. Xie. Alattin: Mining Alternative Patterns for Detecting Neglected Conditions. ASE 2009.
Alternative Patterns
 ImMiner mines three kinds of alternative
patterns of the general form “P1 or P2”
Balanced: all alternatives (both P1 and P2) are frequent
Imbalanced: some alternatives (P1) are frequent and
others are infrequent (P2). Represented as “P1 or P^2”
Single: only one alternative
38
S.Thummalapenta and T. Xie. Alattin: Mining Alternative Patterns for Detecting Neglected Conditions. ASE 2009.
ImMiner Algorithm
 Uses frequent-itemset mining [Burdick et al. ICDE 01]
iteratively
 An input database with the following APIs
for Iterator.next()
Input database
Mapping of IDs to APIs
S.Thummalapenta and T. Xie. Alattin: Mining Alternative Patterns for Detecting Neglected Conditions. ASE 2009.
ImMiner Algorithm: Frequent Alternatives
Input database
Frequent itemset
mining
(min_sup 0.5)
Frequent item: 1
P1: boolean-check on the return of Iterator.hasNext()
before Iterator.next()
S.Thummalapenta and T. Xie. Alattin: Mining Alternative Patterns for Detecting Neglected Conditions. ASE 2009.
ImMiner: Infrequent Alternatives of P1
 Split input database into two databases: Positive and Negative
Positive database (PSD)
Negative database (NSD)
 Mine patterns that are frequent in NSD and are infrequent in PSD
 Reason: Only such patterns serve as alternatives for P1
 Alternative Pattern : P2 “const check on the return of ArrayList.size()
before Iterator.next()”
 Alattin applies ImMiner algorithm to detect neglected conditions
S.Thummalapenta and T. Xie. Alattin: Mining Alternative Patterns for Detecting Neglected Conditions. ASE 2009.
41
Neglected Conditions
 Neglected conditions refer to
 Missing conditions that check the arguments
or receiver of the API call before the API call
 Missing conditions that check the return or
receiver of the API call after the API call
 One primary reason for many fatal issues
 security or buffer-overflow vulnerabilities
[Chang et al. ISSTA 07]
S.Thummalapenta and T. Xie. Alattin: Mining Alternative Patterns for Detecting Neglected Conditions. ASE 2009.
• use Data Exploration and Analysis
 Mining Software Repositories (MSR)
• for Software Practitioners
Beyond Software Developers
• obtain Insightful and Actionable info
Need get real as well
• Analytic Techniques
• Producing Impact on Practice
Machine Learning that Matters
[ICML’12 Wagstaff]
http://arxiv.org/ftp/arxiv/papers/1206/1206.4656.pdf
• Hyper-Focus on Benchmark Data Sets
• Hyper-Focus on Abstract Metrics
• Lack of Follow-Through
[ICML’12 Wagstaff]
http://arxiv.org/ftp/arxiv/papers/1206/1206.4656.pdf
• Meaningful Evaluation Methods
• Involvement of the World Outside ML
• Eyes on the Prize
[ICML’12 Wagstaff]
http://arxiv.org/ftp/arxiv/papers/1206/1206.4656.pdf
MSRA Software Analytics Group
Utilize data-driven approach to help create highly performing, user
friendly, and efficiently developed and operated software and services.
Information Visualization
Software
Users
Software
Vertical
Development
Process
Analysis Algorithms
Horizontal
Software
Systems
Large-scale Computing
Research Topics
Contact: Dongmei Zhang ([email protected])
http://research.microsoft.com/groups/sa/
Technology Pillars
Software Analytics in Practice
Adoption Challenges for
Software Analytics
Must show value
before data quality
improves
Correlation vs.
Causation
ICSE Papers: Industry vs. Academia
OSDI 2008 26% vs. xSE ?%
Developers, Programmers, Architects
Among All Attendees
MSR 11 Keynote
ICSE 09 Keynote
Source© Carlo Ghezzi
MSR 12 Keynote
ICSM 11 Keynote
SCAM 12 Keynote
"Are Automated Debugging [Research]
Techniques Actually Helping Programmers?"
• 50 years of automated debugging research
– N papers  only 5 evaluated with actual programmers
“
”
[ISSTA11 Parnin&Orso]
Are Regression Testing [Research] Techniques
Actually Helping Industry?
• Likely most studied testing problems
– N papers
“
”
[STVR11 Yoo&Harman]
Are [Some] Failure-Proneness Prediction
[Research] Techniques Actually Helping?
• Empirical software engineering (on prediction)
– N papers
”
[PROMISE11 Zeller et al.]
A Researcher's Observation in HCI
Research Community
• “The reviewers simply do not value the
difficulty of building real systems and how
hard controlled studies are to run on real
systems for real tasks. This is in contrast
with how easy it is to build new interaction
techniques and then to run tight,
controlled studies on these new
techniques with small, artificial tasks”
“I give up on CHI/UIST” by James Landay
http://dubfuture.blogspot.com/2009/11/i-give-up-on-chiuist.html
Source©J. Landay
A Researcher's Observation in HCI
Research Community
• “This attitude is a joke and it offers
researchers no incentive to do systems
work. Why should they? Why should we
put 3-4 person years into every CHI
publication? Instead we can do 8 weeks of
work on an idea piece or create a new
interaction technique and test it tightly in
8-12 weeks and get a full CHI paper.”
“I give up on CHI/UIST” by James Landay
http://dubfuture.blogspot.com/2009/11/i-give-up-on-chiuist.html
Source©J. Landay
A Researcher's Observation in HCI
Research Community
• “When will this community wake up and
understand that they are going to run out any
work on creating new systems (rather than
Doesand cede that
small pieces of systems)
our
research
community
important
endeavor
to industry?”
have
issues??
• “We are our
ownsimilar
worst enemies.
I think we
have been blinded by the perception that "true
scientific" research is only found in controlled
experiments and nice statistics.”
“I give up on CHI/UIST” by James Landay
http://dubfuture.blogspot.com/2009/11/i-give-up-on-chiuist.html
Source©J. Landay
MS Academic Search: “Pointer
Analysis”
“Pointer Analysis: Haven’t We Solved
This Problem Yet?” [Hind PASTE’01]
“During the past 21 years, over 75 papers
and 9 Ph.D. theses have been published on
pointer analysis. Given the tones of work on
this topic one may wonder, “Haven't we
solved this problem yet?'' With input from
many researchers in the field, this paper
describes issues related to pointer analysis
and remaining open problems.”
Michael Hind. Pointer analysis: haven't we solved this problem yet?. In Proc.
ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and
Engineering (PASTE 2001)
Source©M. Hind
58
“Pointer Analysis: Haven’t We Solved
This Problem Yet?” [Hind PASTE’01]
Section 4.3 Designing an Analysis for a Client’s Needs
“Barbara Ryder expands on this topic: “… We can all
write an unbounded number of papers that compare
different pointer analysis approximations in the
abstract. However, this does not accomplish the key
goal, which is to design and engineer pointer
analyses that are useful for solving real
software problems for realistic programs.”
Michael Hind. Pointer analysis: haven't we solved this problem yet?. In Proc.
ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and
Engineering (PASTE 2001)
Source©M. Hind&B. Ryder
59
MS Academic Search: “Clone
Detection”
Typically focus/evaluate on
intermediate steps (e.g., clone
detection) instead of ultimate
tasks (e.g., bug detection or
refactoring), even when the
field already grows mature
with n years of efforts on
intermediate steps
Some Success Stories of Applying
Clone Detection [Focus on Ultimate Tasks]
MSRA
XIAO
Yingnong Dang, Dongmei Zhang, Song
Ge, Chengyun Chu, Yingjun Qiu, and
Tao Xie. XIAO: Tuning Code Clones at
Hands of Engineers in Practice. In
Proc. ACSAC 2012,
http://research.microsoft.com/en-us/groups/sa/
Zhenmin Li, Shan Lu, Suvda Myagmar,
and Yuanyuan Zhou. CP-Miner: a tool
for finding copy-paste and related
bugs in operating system code. In
Proc. OSDI 2004.
http://patterninsight.com/
http://www.blackducksoftware.com/
61
Suggested Actions  Tech Adoption
•
•
•
•
Get research problems from real practice
Get feedback from real practice
Collaborate across disciplines
Collaborate with industry
•Software Analytics
Data Exploration and Analysis
For Software Practitioners
Obtain Insightful and Actionable info
With Analytic Techniques
• Producing Impact on Practice
Acknowledgments
• Microsoft Research Asia Software Analytics
Group
• Ahmed Hassan, Lin Tan, Jian Pei
• Many other colleagues
64
Q&A
•Software Analytics
Data Exploration and Analysis
For Software Practitioners
Obtain Insightful and Actionable info
With Analytic Techniques
• Producing Impact on Practice