
Defect prediction using social
network analysis on issue repositories
Reporter: Dandan Wang
Date: 04/18/2011
Basic information
• Conference: ICSSP 2011
• Authors
– Serdar Bicer
• Gerger Consulting, Istanbul, Turkey
– Ayse Basar Bener
• Ryerson University, Ted Rogers School of Information Technology Management, Toronto, Canada
– Bora Caglayan
• Bogazici University, Department of Computer Engineering, Istanbul, Turkey
Outline
1. Introduction
2. Methodology
3. Results
4. Conclusion
Introduction
• Objective
– Overcome ceiling effects of defect predictors.
• Research question
– What is the benefit of social network metrics on issue repositories for predicting defects?
• Metrics
– Social network metrics
– Churn metrics
• Method
– Naive Bayes (learning-based prediction model)
Methodology
• Dataset
• Communication structure in projects
• Metrics used
• Defect prediction model
• Performance measures
Dataset
• RTC
– Years: 2007 and 2008
– Team: large distributed team using the Jazz platform
– Data sources: version control system, issue repository
• Drupal
– Years: 2009-2010
– Team: large distributed team
– Data sources: public CVS repository, issue repository (bug reports, feature requests, and other tasks)
Data extraction process for datasets
• Nodes in the graphs represent developers who commented on each file.
• Files were labeled as defective if they were modified after the snapshot date.
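The two extraction rules can be sketched in a few lines. This is only an illustration: the slides do not show the extraction scripts, so the input layout (issue id mapped to commenters, issue id mapped to files), the function names, and the toy data are all assumptions. In particular, a developer is attached to a file here when they commented on an issue linked to that file.

```python
from datetime import date

def developers_per_file(issue_comments, issue_files):
    """Map each file to the developers who commented on issues linked to it.

    Assumed inputs: issue_comments maps issue id -> set of commenters,
    issue_files maps issue id -> list of files touched for that issue.
    """
    devs = {}
    for issue_id, files in issue_files.items():
        commenters = issue_comments.get(issue_id, set())
        for path in files:
            devs.setdefault(path, set()).update(commenters)
    return devs

def label_files(last_modified, snapshot):
    """Label a file defective (True) if it was modified after the snapshot date."""
    return {path: when > snapshot for path, when in last_modified.items()}

# Hypothetical toy data, not taken from RTC or Drupal.
issue_comments = {101: {"alice", "bob"}, 102: {"bob", "carol"}}
issue_files = {101: ["core/a.c"], 102: ["core/a.c", "ui/b.c"]}
last_modified = {"core/a.c": date(2008, 7, 1), "ui/b.c": date(2008, 3, 1)}

nodes = developers_per_file(issue_comments, issue_files)
labels = label_files(last_modified, snapshot=date(2008, 6, 1))
```

With this toy data, `core/a.c` gets three developer nodes and is labeled defective, while `ui/b.c` gets two and is not.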
Communication structure in projects
• The RTC and Drupal projects have similar communication structures.
• Commenting on issues is the main task-related communication channel used by contributors in both projects. If a commit in the version control system is related to an issue, the issue number is written in the commit message.
• The Jazz framework automatically creates a link from an issue to its change set; this is not available in Drupal.
• Issues are assigned to and owned by contributors.
• Other project members express their opinions by commenting on issues.
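Where the issue-to-commit link is not created automatically, it has to be recovered from commit messages. A minimal sketch follows, assuming issue numbers appear in the form `#1234`. That pattern, the function names, and the sample messages are assumptions; the slides do not state the exact convention used in Drupal.

```python
import re

# Assumed convention: an issue is referenced as '#' followed by digits.
ISSUE_REF = re.compile(r"#(\d+)")

def issues_in_commit(message):
    """Return the issue numbers mentioned in one commit message."""
    return [int(number) for number in ISSUE_REF.findall(message)]

def link_commits_to_issues(commits):
    """Build issue number -> list of commit ids from {commit id: message}."""
    links = {}
    for commit_id, message in commits.items():
        for issue in issues_in_commit(message):
            links.setdefault(issue, []).append(commit_id)
    return links

# Hypothetical commit log.
commits = {
    "r1": "#5 fix menu rebuild",
    "r2": "code style cleanup",
    "r3": "follow-up for #5",
}
links = link_commits_to_issues(commits)
```

Commits without a reference, such as `r2` above, simply stay unlinked.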
Metrics used
• The first 6 metrics were used in previous studies [22, 33, 44, 42].
• Diameter, Clustering Coefficient, Bridge Rate, and Characteristic Path Length are new metrics.
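Three of the new metrics have standard graph-theoretic definitions, sketched below on a toy developer graph. The assumptions: the graph is undirected, connected, and stored as a dict of neighbour sets; the clustering coefficient is the average of the local coefficients. Bridge Rate is left out because the slides do not define it. Names and data are hypothetical.

```python
from collections import deque
from itertools import combinations

def bfs_distances(graph, source):
    """Shortest-path length (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def path_lengths(graph):
    """Shortest-path lengths over all ordered pairs of distinct nodes."""
    return [d for s in graph for t, d in bfs_distances(graph, s).items() if t != s]

def diameter(graph):
    """Longest shortest path in the graph."""
    return max(path_lengths(graph))

def characteristic_path_length(graph):
    """Average shortest-path length over all node pairs."""
    lengths = path_lengths(graph)
    return sum(lengths) / len(lengths)

def clustering_coefficient(graph):
    """Average fraction of each node's neighbour pairs that are connected."""
    total = 0.0
    for neighbours in graph.values():
        if len(neighbours) < 2:
            continue  # nodes with fewer than two neighbours contribute 0
        pairs = list(combinations(neighbours, 2))
        closed = sum(1 for a, b in pairs if b in graph[a])
        total += closed / len(pairs)
    return total / len(graph)

# Hypothetical developer graph for one file.
graph = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "carol"},
    "carol": {"alice", "bob", "dave"},
    "dave": {"carol"},
}
```

For this graph the diameter is 2 (for example, dave to alice), the characteristic path length is 4/3, and the clustering coefficient is 7/12.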
Defect prediction model
• Metrics
– Social network metrics on issue repositories
• Algorithm
– Naive Bayes data mining algorithm
• Validation
– 10 × 10-fold cross-validation to reduce sampling bias
– Cost-benefit analysis (Weka software)
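The model and validation scheme can be sketched without Weka. This is an illustration only: it assumes a Gaussian Naive Bayes over numeric metrics, unstratified folds, and accuracy as the per-fold score for brevity (the study reports pd, pf, and balance). All function names and the toy data are hypothetical.

```python
import math
import random

def fit_gaussian_nb(X, y):
    """Estimate the prior and per-feature mean/variance for each class."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        columns = list(zip(*rows))
        means = [sum(col) / len(rows) for col in columns]
        # A small constant keeps the variance strictly positive.
        variances = [
            sum((v - m) ** 2 for v in col) / len(rows) + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (len(rows) / len(y), means, variances)
    return model

def predict(model, x):
    """Return the class with the highest log posterior for feature vector x."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for v, m, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)

def repeated_kfold(n, k=10, repeats=10, seed=0):
    """Yield (train, test) index lists: k folds, reshuffled on every repeat."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n))
        rng.shuffle(order)
        for fold in range(k):
            test = order[fold::k]
            held_out = set(test)
            yield [i for i in order if i not in held_out], test

def cross_validate(X, y, k=10, repeats=10, seed=0):
    """Return one accuracy value per fold (k * repeats values in total)."""
    scores = []
    for train, test in repeated_kfold(len(X), k, repeats, seed):
        model = fit_gaussian_nb([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(model, X[i]) == y[i] for i in test)
        scores.append(hits / len(test))
    return scores

# Hypothetical, well-separated toy data: one metric, 10 clean and 10 defective files.
X = [[float(i)] for i in range(10)] + [[float(i + 20)] for i in range(10)]
y = [0] * 10 + [1] * 10
scores = cross_validate(X, y)
```

Reshuffling before each of the 10 repeats is what makes the 100 scores less dependent on any single ordering of the data.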
Performance measures
• Widely used performance measures
– Probability of detection (pd)
– Probability of false alarm (pf)
– Balance
• Higher balance is better because the point (pd, pf) is closer to the ideal point (1, 0).
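The measures can be computed from a confusion matrix as sketched below. The slides do not show the balance formula, so this assumes the definition commonly used in the defect prediction literature: one minus the Euclidean distance from (pd, pf) to the ideal point (1, 0), normalized by √2. The function name and the counts are hypothetical.

```python
import math

def pd_pf_balance(tp, fn, fp, tn):
    """Compute pd, pf, and balance from confusion-matrix counts.

    tp/fn: defective files predicted defective / clean.
    fp/tn: clean files predicted defective / clean.
    """
    pd = tp / (tp + fn)  # probability of detection (recall)
    pf = fp / (fp + tn)  # probability of false alarm
    # Assumed definition: normalized distance to the ideal point pd = 1, pf = 0.
    balance = 1 - math.sqrt((1 - pd) ** 2 + (0 - pf) ** 2) / math.sqrt(2)
    return pd, pf, balance

# Hypothetical counts: 10 defective and 10 clean files.
pd, pf, balance = pd_pf_balance(tp=8, fn=2, fp=3, tn=7)
```

A perfect predictor (pd = 1, pf = 0) has balance 1; the further the point drifts from the ideal, the lower the balance.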
Cost-benefit analysis
Cost curve
• The cost curve was proposed by Drummond and Holte to address the deficiencies of ROC curves. It is a visualization technique that shows a classifier's performance as a function of misclassification cost.
– X: PC(+), which combines the probability of the positive class and the two misclassification costs into a single value.
– Y: NEC, the normalized expected cost, which denotes the error rate.
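Both axes can be computed from quantities already defined. The sketch below follows the usual Drummond and Holte formulation, in which a classifier with a given (pd, pf) becomes a straight line from (0, pf) to (1, 1 − pd). The function and parameter names are assumptions, as are the example numbers.

```python
def probability_cost(p_pos, cost_fn, cost_fp):
    """X axis, PC(+): class prior and misclassification costs folded into one value.

    p_pos: prior probability of the positive (defective) class.
    cost_fn: cost of missing a defective file.
    cost_fp: cost of a false alarm on a clean file.
    """
    pos = p_pos * cost_fn
    neg = (1 - p_pos) * cost_fp
    return pos / (pos + neg)

def normalized_expected_cost(pd, pf, pc_pos):
    """Y axis, NEC: expected cost scaled to [0, 1] at a given PC(+)."""
    false_negative_rate = 1 - pd
    return false_negative_rate * pc_pos + pf * (1 - pc_pos)

# Hypothetical operating point: 20% defective files, equal costs.
pc = probability_cost(p_pos=0.2, cost_fn=1.0, cost_fp=1.0)
nec = normalized_expected_cost(pd=0.8, pf=0.3, pc_pos=pc)
```

With equal costs, PC(+) equals the class prior and NEC reduces to the plain error rate, which is why the Y axis can be read as an error rate.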
Results
• Prediction performance analysis
• T-test analysis: statistically significant differences
• Cost-benefit analysis
Cost curves for datasets
Beneficial outcomes
• Our proposed model either considerably decreases high false alarm rates without compromising detection rates, or considerably increases low prediction rates without compromising low false alarm rates, compared to churn metrics. In both cases overall prediction performance increases, which in turn decreases verification costs compared to churn metrics. We therefore recommend that practitioners collect social network metrics on issue repositories.
• We interpret this result as showing that the structure of information flow in a developer communication network has a significant effect on code quality. Since our metrics are directly related to the network's topology, this model can help managers build developer networks more efficiently.
• We used only a recent part of the developer communication history to construct our model. Communication between project members begins at the start of a project and continues until its end, but in this study we did not collect the full communication history. This matters for software teams that began recording developer communication after the project started, because our proposed model can also be used for these kinds of projects.
Conclusion
• Motivation: communication and coordination between developers are important, but patterns of interaction between developers have not been investigated for defect prediction.
• The main contribution of this study is the use of a new data source and new metrics in defect prediction.
• Performance analysis
– Churn metrics, social network metrics
– pd, pf, balance
• Cost-benefit analysis
– Social network metrics on issue repositories reduced the costs required for verifying prediction results and moved results closer to the cost-adverse region of the ROC curve.
Thank you!
Q&A