Toward Mining “Concept Keywords” from Identifiers in Large Software Projects
Masaru Ohba and Katsuhiko Gondow
Tokyo Institute of Technology
What are “concept keywords”?
• Most programmers try to name identifiers meaningfully.
• Concept keywords are defined as terms that describe key
concepts and so aid program understanding.
– e.g. read_dirent() : dirent is a concept keyword.
Category | Examples
Concept keywords | dirent, root, PTE, tss, path, signal, yield
Grouping words | kbd_, vga_, FAT12_, sys_, H, t
Attributes, less important concepts | busy, byte, offset, name, memory, end, int8, again
Generic verbs | read, set, is, move, wait, print, dump, make, init
Table: Human-selected concept keywords and other category words in udos
Suggestion
• We should use more “concept keywords” in
program understanding tools.
– concept keywords are concise and descriptive
• Our solution:
– provides a way to mine concept keywords.
• ckTF/IDF methods / Identifier Exploratory Framework
– could be used to build tools that support and utilize
extracted concept keywords (future work).
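As a rough illustration of mining keywords from identifiers, the sketch below splits identifiers into words and ranks the words per file with plain TF-IDF. It is not the ckTF/IDF method named on the slide, whose details are not given in this transcript. The function names and most of the sample identifiers are hypothetical; only read_dirent, sys_signal, and sys_kill appear in the slides.

```python
import math
import re
from collections import Counter

def split_identifier(identifier):
    """Split an identifier such as 'read_dirent' or 'readDirEntry' into lowercase words."""
    words = []
    for part in re.split(r"[_\W]+", identifier):
        # Break camelCase and letter/digit boundaries inside each part.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [w.lower() for w in words if w]

def tfidf_by_file(identifiers_by_file):
    """Return {file: {word: score}} using plain TF-IDF over identifier words."""
    word_counts = {
        name: Counter(w for ident in idents for w in split_identifier(ident))
        for name, idents in identifiers_by_file.items()
    }
    n_files = len(word_counts)
    # Document frequency: in how many files does each word occur?
    doc_freq = Counter(w for counts in word_counts.values() for w in counts)
    scores = {}
    for name, counts in word_counts.items():
        total = sum(counts.values())
        scores[name] = {
            w: (c / total) * math.log(n_files / doc_freq[w])
            for w, c in counts.items()
        }
    return scores

# Hypothetical identifier lists, loosely modeled on the udos examples.
identifiers = {
    "fat12.c": ["read_dirent", "write_dirent", "FAT12_init"],
    "task.c": ["sys_signal", "sys_kill", "task_init"],
    "kbd.c": ["kbd_read", "kbd_init"],
}
scores = tfidf_by_file(identifiers)
for name, word_scores in scores.items():
    best = max(word_scores, key=word_scores.get)
    print(name, best)
```

On this toy input, dirent ranks first for fat12.c and the generic verb init scores zero because it occurs in every file. The top word for task.c is the grouping prefix sys, which shows that plain TF-IDF alone does not separate concept keywords from grouping words.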
Future work
• Applying concept keywords to a Bug Tracking System
(BTS) to see the relationship between bug reports and
the corresponding faulty source code.
Bug-report no.1
Overview: It could not read directories.
Concept keyword: dirent
Source code (fat12.c):
read_dirent() {
return NULL;
}
Bug-report no.3
Overview: I could not catch system calls.
Concept keyword: signal
Source code (task.c):
sys_signal(){
sys_kill();
}
Concept keywords can bridge the gap between bug reports and source code.
IBM Watson Research Center
Source code that talks: an exploration of Eclipse task comments and their implication to repository mining
Annie Ying (joint work with Jim Wright & Steve Abrams)
© 2005 IBM Corporation
Annie Ying et al., IBM Research
In a software development task...
[Figure: labels “task-oriented info”, “communication”, and “development artifacts” (reqs, change reports, emails, source code). The source code artifact is class Foo, containing the comment “// Joan, please fix this” and a method m1().]
Empirical study on Eclipse task comments
Example Eclipse task comments:
// TODO an ugly hack for now –sue. Joan, please fix it
// TODO eliminate this once ECR 317 complete
Conclusion
• Presented observations on uses of comments
– e.g., task-oriented info and communication
• Take-home message:
– When mining software repositories, consider analyzing comments.
The End
Challenges in analyzing Eclipse task comments
• informality
// TODO an ugly hack for now –sue. Joan, please fix it
• implied context
// TODO eliminate this once ECR 317 complete
• other examples
// TODO explain why this method is public
// TODO once we have Eclipse-icon-decorator mechanism, use it here
• fuzzy scope
// TODO workaround for ...
...
// End workaround
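A first step toward analyzing such comments is pulling them out of the source text. The sketch below is one minimal way to do that for Java-style line comments. The tag list (TODO, FIXME, XXX) and the rule that following “//” lines continue a task comment are assumptions made for illustration, not the procedure used in the study, and the function name is hypothetical.

```python
import re

# Assumed task tags; a real project may configure a different set.
TASK_TAG = re.compile(r"//\s*(TODO|FIXME|XXX)\b[:\s]*(.*)")

def extract_task_comments(source):
    """Return (line_number, tag, text) for each task comment in the source text.

    Consecutive '//' lines directly after a task tag are treated as
    continuation lines of the same comment.
    """
    tasks = []
    current = None
    for number, line in enumerate(source.splitlines(), start=1):
        stripped = line.strip()
        match = TASK_TAG.search(stripped)
        if match:
            current = [number, match.group(1), match.group(2).strip()]
            tasks.append(current)
        elif current is not None and stripped.startswith("//"):
            current[2] = (current[2] + " " + stripped[2:].strip()).strip()
        else:
            current = None
    return [tuple(task) for task in tasks]

sample = """class Foo {
    // TODO an ugly hack for now
    // -sue. Joan, please fix it
    void m1() {
        // TODO eliminate this once
        // ECR 317 complete
    }
}"""
for task in extract_task_comments(sample):
    print(task)
```

The continuation rule only handles adjacent comment lines. It does not recover the “fuzzy scope” case, where a closing marker such as “// End workaround” appears after intervening code.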
Text Mining for Software Engineering:
How Analyst Feedback Impacts Final Results
Jane Huffman Hayes,
Alex Dekhtyar,
Senthil Karthikeyan Sundaram
*Funded by NASA
Department of Computer Science
University of Kentucky
Question of the Day
What can Data Mining Do for Software Engineering???
Answer 1: Help study the process
– After-the-fact
– Exploratory
– Conclusions help future projects
Answer 2: Help improve the process!!!
Our Approach
Use Data Mining
during the process
Use Mining During the Process?
[Figure: a Task is handled by an Automated “Mining” Tool and an Analyst, connected by a Feedback Loop, producing the Final Result. Labels: Objective Study (RE’04, PROMISE’05), Subjective Study.]
Ultimately, we are interested in the accuracy of the final result.
Preliminary Study
Question: What would the analyst do with machine-generated data?
Task: Requirements Tracing
Metrics: Precision, Recall
[Figure: the Automated “Mining” Tool produces candidate link lists, which the Analyst turns into the Final Result.]
List | Candidate Pr | Candidate Rec | Final Pr | Final Rec | ΔPr | ΔRec
1 | 40% | 60% | 45% | 56% | +5% | -4%
2 | 20% | 90% | 58% | 65% | +38% | -25%
3 | 80% | 30% | 23% | 27% | -57% | -2%
[Figure: Recall (0 to 100) plotted against Precision (0 to 100) for the series T1, T3, T4, “From RE2003”, and “reg”, annotated “Trend???”.]
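Precision and recall, the two metrics used in this study, can be computed from a candidate link list once the true links are known. The sketch below uses the standard definitions; the requirement and design element names are hypothetical and the numbers are not taken from the study.

```python
def precision_recall(candidate_links, true_links):
    """Return (precision, recall) of a candidate link list against the true links."""
    candidate_links = set(candidate_links)
    true_links = set(true_links)
    correct = candidate_links & true_links
    # Precision: share of proposed links that are correct.
    precision = len(correct) / len(candidate_links) if candidate_links else 0.0
    # Recall: share of true links that were proposed.
    recall = len(correct) / len(true_links) if true_links else 0.0
    return precision, recall

# Hypothetical requirement-to-design links.
true_links = {("R1", "D1"), ("R1", "D2"), ("R2", "D3"), ("R3", "D4"), ("R3", "D5")}
tool_output = {("R1", "D1"), ("R1", "D2"), ("R2", "D3"), ("R2", "D9"), ("R3", "D8")}
# The analyst drops two wrong links, drops one correct link, and adds a correct one.
analyst_result = {("R1", "D1"), ("R1", "D2"), ("R3", "D4")}

print(precision_recall(tool_output, true_links))
print(precision_recall(analyst_result, true_links))
```

Comparing the two pairs gives the kind of ΔPr and ΔRec values reported in the table: in this toy case the analyst raises precision while recall stays the same.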
(Not Quite) Conclusions
• New Field of Study
• Larger Study Needed
Call for Help!
WANTED!
VOLUNTEERS!
Thank You!
Signature Change Analysis
Sunghun Kim, Jim Whitehead, Jennifer Bevan
{hunkim, ejw, jbevan}@cs.ucsc.edu
University of California, Santa Cruz
Biological and Software Evolution
[Figure: software versions v1, v2, v3]
• Can we shape the software evolution path?
– LOC
– Number of Changes
– Structural Changes
– Signature Changes
Found Signature Change properties
• The most common signature change kinds are complex data type changes, parameter addition, parameter reordering, and parameter deletion.
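[Figure: signature change kinds compared across the projects A 1.3, A2, APR, APU, CVS, GCC, SVN, and their average (AVG); vertical axis from 0 to 60. Change kinds: parameter name change, only ordering change, addition, deletion, modifier change, array/pointer, complex type name change, primitive type change.]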
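To make the change kinds concrete, the sketch below assigns one coarse kind to a pair of parameter lists. It is a simplification written for illustration: it reports a single kind per change, and it lumps modifier, array/pointer, complex type, and primitive type changes into one “type change” label. The function name and the rules are assumptions, not the classification procedure used in the study.

```python
def classify_signature_change(old_params, new_params):
    """Classify a change between two parameter lists of (type, name) tuples."""
    if old_params == new_params:
        return "no change"
    if len(new_params) > len(old_params):
        return "addition"
    if len(new_params) < len(old_params):
        return "deletion"
    old_types = [param_type for param_type, _ in old_params]
    new_types = [param_type for param_type, _ in new_params]
    if old_types == new_types:
        # Same types in the same positions, so only names differ.
        return "parameter name change"
    if sorted(old_params) == sorted(new_params):
        # Same parameters, different positions.
        return "only ordering change"
    return "type change"

# Hypothetical C-style signature: read(int fd, char *buf)
old = [("int", "fd"), ("char *", "buf")]
print(classify_signature_change(old, old + [("size_t", "len")]))
print(classify_signature_change(old, [("char *", "buf"), ("int", "fd")]))
```

A fuller classifier would parse declarations rather than take pre-split tuples, and would report several kinds when one revision combines them.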
Found Signature Change properties
• More than half of function signatures never change. About 90% of function signatures change fewer than three times.
• A function’s signature changes after every 5-15 function body changes.
• A project’s average number of parameters per function remains relatively constant over time.
• Functions typically have parameter lists with 1, 2, or 3 parameters.
Found Signature Change properties
• Weak correlations between signature change and other changes including LOC and function body changes.
• Each project has its own signature change patterns, and the pattern can be discovered after analyzing the first 1000 to 1500 revisions.
[Figure: two panels, SVN and A 1.3, showing the signature change kinds (parameter name change, only ordering changes, addition, deletion, modifier change, complex type name change) after the first 100, 200, 300, 500, 1000, 1500, 2000, and 5000 revisions and for the full histories (6029 and 7747 revisions); vertical axis from 0 to 60.]
Found Signature Change properties
• Probability of a change kind depends on previous changes.
[Figure: trees of successive change kinds with transition probabilities on the edges, for (a) APR and (b) Apache 2. Node labels: A, D, O, C.]
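One simple way to quantify how a change kind depends on the previous one is to estimate first-order transition probabilities from per-function change histories. The sketch below does this by counting. The histories are hypothetical, the letters only mirror the node labels in the figure, and the slides may condition on longer histories than a single previous change.

```python
from collections import Counter, defaultdict

def transition_probabilities(histories):
    """Estimate P(next kind | previous kind) from change-kind sequences.

    histories: list of sequences, one per function, in chronological order.
    """
    counts = defaultdict(Counter)
    for history in histories:
        for previous, current in zip(history, history[1:]):
            counts[previous][current] += 1
    return {
        previous: {kind: n / sum(following.values()) for kind, n in following.items()}
        for previous, following in counts.items()
    }

# Hypothetical signature change histories of four functions.
histories = [
    ["C", "C", "A"],
    ["C", "C", "C"],
    ["A", "C"],
    ["O", "C", "C"],
]
probabilities = transition_probabilities(histories)
print(probabilities["C"])
```

In this toy data a C change is followed by another C change four times out of five, which is the kind of dependency the slide describes.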
Future Work
• Signature change analysis on OOP (Java)
– The results presented here are based on open source projects
written in a procedural programming language (C):
Apache HTTP 1.3, Apache HTTP 2.0, Apache Portable Runtime,
APR utility, CVS, GCC, and Subversion
– Find OOP signature change properties and compare
them with those from a procedural language
• Changes inside Struct/Class
– Variable addition/deletion
– Variable renaming
– Method addition/deletion
Linear Predictive Coding and Cepstrum coefficients for mining time variant information from software repositories
G. Antoniol, F. Rollo and G. Venturi
RCOST – University of Sannio - Italy
LPC Idea
• Model a time series with a polynomial approximation
• LPC Cepstrum: smooth the spectrum
• Define the distance between two time series as the distance between their polynomial approximations
• Use distance to cluster time series with identical or similar evolutions.
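The steps above can be sketched with textbook building blocks: the autocorrelation method with the Levinson-Durbin recursion for the LPC coefficients, the standard LPC-to-cepstrum recursion, and a Euclidean distance between cepstra. The model order, the number of cepstral coefficients, and the sample series are assumptions for illustration; the authors' exact preprocessing and distance are not given in this transcript.

```python
import math

def autocorrelation(series, max_lag):
    n = len(series)
    return [sum(series[i] * series[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def lpc(series, order):
    """Levinson-Durbin recursion; returns [a_1..a_p] for A(z) = 1 + sum a_k z^-k."""
    r = autocorrelation(series, order)
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:  # degenerate series, stop early
            break
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        updated = a[:]
        for j in range(1, i):
            updated[j] = a[j] + k * a[i - j]
        updated[i] = k
        a = updated
        err *= 1.0 - k * k
    return a[1:]

def lpc_to_cepstrum(a, n_coeffs):
    """Cepstral coefficients c_1..c_n of the all-pole model 1/A(z)."""
    p = len(a)
    c = []
    for n in range(1, n_coeffs + 1):
        acc = -a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc -= (k / n) * c[k - 1] * a[n - k - 1]
        c.append(acc)
    return c

def cepstral_distance(x, y, order=2, n_coeffs=8):
    """Euclidean distance between the LPC cepstra of two time series."""
    cx = lpc_to_cepstrum(lpc(x, order), n_coeffs)
    cy = lpc_to_cepstrum(lpc(y, order), n_coeffs)
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(cx, cy)))

# Hypothetical file-size histories over ten releases.
steady_growth = [10, 12, 15, 15, 18, 22, 21, 25, 30, 31]
same_shape_doubled = [2 * v for v in steady_growth]
oscillating = [30, 5, 28, 7, 26, 9, 24, 11, 22, 13]
print(cepstral_distance(steady_growth, same_shape_doubled))
print(cepstral_distance(steady_growth, oscillating))
```

Because the LPC coefficients do not change when a series is multiplied by a constant, two files whose sizes evolve with the same shape but at different scales end up at distance zero, while the oscillating series is farther away. Pairs whose distance falls below a chosen threshold can then be grouped.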
LPC and Linux Kernel
• 211 Linux releases, about 1700 files
• Study the influence of the number of coefficients
• Study the influence of distance thresholds
• Mine files with similar evolution: create groups of files with the same or very similar size evolution
[Figure: “Similar pairs for different thresholds and coefficients used”: number of similar pairs (axis ticks 100, 1000, 10000) for 12, 16, 20, and 32 coefficients at thresholds 1E-3, 1E-4, and 1E-5.]
[Figure: “Similar pair of evolving files”: vertical axis from 0 to 800, horizontal axis from 1 to 248.]
Complementing Each Other: GQM & DMAIC
GQM (Goal-Question-Metric)
• CMM sometimes criticized for emphasizing repeatability over improving productivity.
• GQM strong in defining metrics appropriate to business goals and nature of the process.
DMAIC (Define-Measure-Analyze-Improve-Control)
• Six Sigma sometimes criticized as inappropriate for processes characterized by knowledge efforts.
• DMAIC strong in focus on continuous iterative process improvement.
CMM+6σ Process Improvement Cycle
[Figure: the cycle Define, Measure, Analyze, Improve, Control. Other labels in the diagram: Baselines, Weaknesses, Opportunities; Hypotheses, Trends, Indicators, Causes; Progress, Defects, Delays, Dissatisfactions; Collect Data, Assess; Requirements, Activities, Changes, Time, Results.]
Areas of Concern
• Architecture
– Design weaknesses
– General or for new demands
• Bottlenecks
– Areas for focused attention
• Causal Connections
– System view of process
– Root cause analysis
Mining Version Histories to Verify the Learning
Process of LPP
• Mining the Boundary of Openness of an Open
Source Software Project
• Explore if we can apply Open Source
Development (OSD) Process to Proprietary
Software
• Show the Boundary of Openness during OSD
National Chiao Tung University
Shih-Kung Huang, Kang-min Liu
Method
• Team Members
– Core= Relatively Important Developers
– NonCore = All – Core
• Source Code
– Kernel = All – NonKernel
– NonKernel = {d | d is touched by one of the NonCore}
• Project Characteristic function
– f(x) = {y | y is the kernel ratio with respect to the core
ratio of x}
– Kernel Ratio = (Kernel Size)/All
– Core Ratio = (Core Team Size) / All
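The definitions above translate directly into set operations. The sketch below is a minimal version under stated assumptions: kernel size is measured as a file count, “relatively important” developers are taken to be those who touched the most files, and the developer and file names are hypothetical. The slides do not say how importance or size is measured.

```python
def core_and_kernel_ratio(touches, core_developers):
    """touches: {developer: set of files touched}. Returns (core ratio, kernel ratio)."""
    core = set(core_developers)
    all_files = set().union(*touches.values())
    non_core = set(touches) - core
    # NonKernel: everything touched by at least one NonCore developer.
    non_kernel = set().union(*(touches[d] for d in non_core))
    kernel = all_files - non_kernel
    return len(core) / len(touches), len(kernel) / len(all_files)

def characteristic_function(touches):
    """Kernel ratio for each core ratio, growing the core one developer at a time."""
    # Assumed ranking: developers who touched more files are more important.
    ranked = sorted(touches, key=lambda d: (-len(touches[d]), d))
    return [core_and_kernel_ratio(touches, ranked[:size])
            for size in range(1, len(ranked) + 1)]

# Hypothetical project history.
touches = {
    "alice": {"core.c", "vm.c", "fs.c", "doc.txt"},
    "bob": {"vm.c", "fs.c"},
    "carol": {"doc.txt"},
    "dave": {"fs.c"},
}
for core_ratio, kernel_ratio in characteristic_function(touches):
    print(core_ratio, kernel_ratio)
```

Each printed pair is one point of f(x): the share of the code base that only the core team has touched, for a given share of developers counted as core.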
[Figures for the projects gallery, phpmyadmin, moodle, GCC, Slashcode, and Pugs]
Conclusions
• Obtain the characteristic function of each project
team
– Reveal different team constitutions with varied
involvement in the software
• An implication: develop a hybrid software
process model that embeds OSD into commercial
software.
– OpenDarwin: Mac OS X
– Helix: Real Network Server
Towards a Taxonomy of Approaches for Mining of Source Code Repositories
Huzefa H. Kagdi, Michael L. Collard, Jonathan I. Maletic
Software Development Laboratory <SDML>
Department of Computer Science
Kent State University
Kent, Ohio, USA
Motivation
• A number of approaches have been proposed to
derive and express changes from source code
repositories in a more source-code “aware”
manner
• We need better insight into the current research in
the MSR community in order to facilitate building
efficient and effective MSR tools
Building a Taxonomy
• Draw similarities and variations between six MSR
approaches based on three dimensions
– Entity type and granularity
– How changes are expressed and defined
– Type of MSR question
• Define notations to describe MSR to facilitate a
taxonomic description of approaches
An Initial Taxonomy
Approach | Authors | Entity | Change | Question
Annotation Analysis | Gall et al | class | syntax and semantic - hidden dependencies | market basket and prevalence
Annotation Analysis | German | file & comment | syntax and semantic - file coupling | market basket and prevalence
Heuristic | Hassan et al | function & variable | syntax and semantic - dependencies | market basket
Data Mining (association rule) | Zimmermann et al | class & method | syntax and semantic - association rules | market basket
Differencing | Raghavan et al | logical statement | syntax and semantic - move | prevalence
Differencing | Collard et al | logical statement | syntax - add, delete, modify | prevalence
Conclusions
• Most of the approaches except Differencing work
with fairly high-level entities
• Very different semantic information is being used
in these approaches
• Further investigation is necessary to distinguish
how changes are expressed
A Framework for Describing and
Understanding Mining Tools in
Software Development
D.M. German, D. Čubranić, and M.-A. Storey
University of Victoria
Introduction
• Software engineering is a collaborative
activity → activity awareness is important
• Can be provided by mining software
repositories
• A variety of mining tools → how to
compare?
• Do we mine what is easy to mine and think
about the uses for it later?
Proposal
• Develop a framework for describing tools
for mining software repositories
• Purpose:
– Help designers understand and compare tools
– Help users assess tools
– Identify new research areas
• Keep the specific user needs and tasks in
the forefront!
The Framework
• Intent
– Role, time, cognitive support
• Information
– Change management, program code, defect tracking
– Informal communication, local history, correlated information
• Infrastructure
– Requirements, offline/online, storage backend
What Next?
• Applied the framework to three tools:
– softChange
– Hipikat
– Xia/Creole
• We invite researchers to apply it to their
tools and give us feedback on their
experiences