Software Quality Analysis with
Limited Prior Knowledge of Faults
Naeem (Jim) Seliya
Assistant Professor, CIS Department
University of Michigan – Dearborn
313 583 6669  [email protected]
Wayne State University CS Seminar, Oct. 3, 2006
Overview

- Introduction
- Knowledge-Based Software Quality Analysis
- Software Quality Analysis with Limited Prior Knowledge of Faults
- Empirical Case Study
  - Software Measurement Data
  - Empirical Results
- Conclusion
Introduction

- Software quality assurance is vital during the software development process
- Knowledge-based software quality models are useful for allocating limited resources to faulty programs
- Software measurements are often observed as predictors of software quality, i.e., the working hypothesis
- System operations and software maintenance benefit from targeting program modules that are likely to have defects
Introduction …

- Software quality modeling has been addressed in the related literature:
  - software quality classification models
  - software fault prediction models
  - software module-order modeling
- A supervised learning approach is typically taken for software quality modeling:
  - requires prior experience of developing systems relatively similar to the target system
  - requires complete knowledge of the defect data of previously developed program modules
Software Quality Analysis

[Diagram: model training uses previous experience (software metrics with known defect data) to produce a learnt hypothesis; model application then applies the learnt hypothesis to the target project (software metrics with unknown defect data).]
Software Quality Analysis …

- Practical software engineering problems:
  - Organization has limited software defect data from previous experiences with similar software projects
  - Organization does not have software defect data from previous experiences with similar software projects
  - Organization does not have experience with developing similar software projects
- Two very likely problem scenarios:
  - Software quality modeling with limited software defect data
  - Software quality modeling without software defect data
Limited Defect Data Problem

[Diagram: in model training, the previous experience provides software metrics, but defect data are known for only some modules and unknown for the rest; the learnt hypothesis is applied to the target project, whose software metrics have unknown defect data.]
No Software Defect Data Problem

[Diagram: the previous experience provides software metrics but no known defect data for model training; the learnt hypothesis must still be applied to the target project, whose software metrics have unknown defect data.]
Limited Defect Data Problem …

- Some contributing issues:
  - cost of metrics data collection may limit for which subsystems the software fault data is collected
  - software defect data collected for some modules may be error-prone due to data collection problems
  - defect data may be reliable for only some components
  - only some project components of a distributed software system may collect software fault data
  - in a multiple-release system, fault data may not be collected for all releases
Objectives

- Developing solutions to software quality analysis when there is only limited a priori knowledge of defect data
  - Learning software quality trends from both the labeled (small) and unlabeled (large) components of the software measurement data
- Providing empirical software engineering evidence of the effectiveness and practical appeal of the proposed solutions
Proposed Solutions

- Constraint-Based Clustering with Expert Input
- Semi-Supervised Classification with the Expectation-Maximization Algorithm
Constraint-Based Clustering with Expert Input

- Clustering is an appropriate choice for software quality analysis based on program attributes alone
- Clustering algorithms group program modules according to their software attributes
- Program modules with similar attributes will likely have similar software quality characteristics
  - High-quality modules will likely group together into nfp (not fault-prone) clusters
  - Low-quality modules will likely group together into fp (fault-prone) clusters
Constraint-Based Clustering with Expert Input …

- Labeled data instances are used to modify and enhance clustering results on the unlabeled data instances
- Investigated unsupervised clustering with expert input for software quality classification
- Constraint-based clustering can aid the expert in better labeling the clusters as fp or nfp
- Identify difficult-to-classify modules or noisy instances in the software measurement data
Proposed Algorithm

- A constraint-based clustering approach is implemented with the k-means algorithm
- Labeled program modules are used to initialize the centroids of a certain number of clusters
- The grouping of the labeled modules remains unchanged, serving as fixed constraints
- The expert has the flexibility to inspect and label additional clusters as nfp or fp during the semi-supervised clustering process
Proposed Algorithm …

Let D contain the L_nfp, L_fp, and U sets of program modules.

1. Obtain initial numbers of nfp and fp clusters (a sketch of the selection criterion follows this slide):
  - Execute the Cg algorithm to obtain the optimal number (p) of nfp clusters among {1, 2, …, Cin_nfp}
  - Execute the Cg algorithm to obtain the optimal number (q) of fp clusters among {1, 2, …, Cin_fp}
  - The Cg algorithm is the work of Krzanowski and Lai: A criterion for determining the number of groups in a data set using sums-of-squares clustering. Biometrics, 44(1):23-34, March 1988.
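To make Step 1 concrete, here is a minimal sketch of the Krzanowski-Lai C(g) criterion; scikit-learn's KMeans, the helper names, and the candidate-range handling are illustrative assumptions, not the study's implementation.

```python
# Minimal sketch of the Krzanowski-Lai C(g) criterion used in Step 1 to pick
# the number of nfp (or fp) clusters from the candidate range {1, ..., Cin}.
import numpy as np
from sklearn.cluster import KMeans

def within_ss(X, k, seed=0):
    """Total within-cluster sum of squares of a k-means partition of X."""
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_

def krzanowski_lai(X, c_in):
    """Return the k in {2, ..., c_in - 1} maximizing C(k) = |DIFF(k)| / |DIFF(k+1)|,
    where DIFF(k) = (k-1)^(2/d) * W(k-1) - k^(2/d) * W(k) and d is the number
    of software metrics (features)."""
    d = X.shape[1]
    W = {k: within_ss(X, k) for k in range(1, c_in + 1)}
    diff = {k: (k - 1) ** (2.0 / d) * W[k - 1] - k ** (2.0 / d) * W[k]
            for k in range(2, c_in + 1)}
    scores = {k: abs(diff[k]) / abs(diff[k + 1]) for k in range(2, c_in)}
    return max(scores, key=scores.get)

# Example: p = krzanowski_lai(X_nfp, c_in=20) for the labeled nfp modules.
```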
Proposed Algorithm …

2. Initialize the centroids of the clusters (see the sketch below):
  - Centroids of p out of the C_max clusters are initialized to the centroids of the nfp clusters
  - Centroids of q out of the remaining {C_max - p} clusters are initialized to the centroids of the fp clusters
  - Centroids of the remaining r (i.e., C_max - p - q) clusters are initialized to randomly selected modules from U
  - Randomly select 5 unique sets of modules for initializing the r unlabeled clusters
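A small sketch of the Step 2 initialization, assuming NumPy arrays for the metrics and centroids; the helper name and seed handling are illustrative.

```python
# Minimal sketch of Step 2: the first p centroids come from the nfp clusters,
# the next q from the fp clusters, and the remaining r = C_max - p - q
# centroids are modules drawn at random from the unlabeled set U. Using
# different seeds yields the 5 unique initialization settings.
import numpy as np

def init_centroids(nfp_centroids, fp_centroids, U, c_max, seed=0):
    """nfp_centroids: (p, d); fp_centroids: (q, d); U: (m, d) unlabeled modules."""
    rng = np.random.default_rng(seed)
    r = c_max - len(nfp_centroids) - len(fp_centroids)
    rows = rng.choice(len(U), size=r, replace=False)
    return np.vstack([nfp_centroids, fp_centroids, U[rows]])
```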
Proposed Algorithm …

3. Execute constraint-based clustering (see the sketch below):
  - k-means with Euclidean distance is run on D with the initialized centroids of the C_max clusters
  - Clustering is run under the constraint that the existing membership of a module in a labeled cluster remains unchanged
  - Clustering is repeated for all 5 centroid initialization settings
  - The clustering associated with the median SSE (sum of squared errors) value is selected for subsequent computation
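Below is a minimal sketch of the constrained k-means step described above; the function signature, convergence test, and in-place centroid updates are illustrative assumptions rather than the study's exact code.

```python
# Minimal sketch of Step 3: k-means with Euclidean distance in which modules
# belonging to a labeled cluster keep that membership, and only the remaining
# modules are reassigned to the nearest centroid.
import numpy as np

def constrained_kmeans(X, centroids, fixed, max_iter=100):
    """X: (n, d) metrics; centroids: (C_max, d) centroids from Step 2;
    fixed: dict {module index: cluster index} for the labeled modules."""
    centroids = centroids.astype(float).copy()
    assign = np.full(len(X), -1)
    for _ in range(max_iter):
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assign = dist.argmin(axis=1)
        for i, c in fixed.items():              # constraint: labeled memberships stay fixed
            new_assign[i] = c
        if np.array_equal(new_assign, assign):
            break
        assign = new_assign
        for c in range(len(centroids)):         # recompute centroids from current members
            members = X[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    sse = ((X - centroids[assign]) ** 2).sum()  # used to select the median-SSE run
    return assign, centroids, sse
```

Running this for the 5 initialization settings from Step 2 and keeping the run with the median SSE reproduces the selection rule described above.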
Proposed Algorithm …

4. Expert-based labeling of clusters:
  - The expert is presented with descriptive statistics of the r unlabeled clusters and asked to label them as nfp or fp
  - The expert labels only those clusters for which he is very confident in the label estimation
  - If at least 1 of the r clusters is labeled, go to Step 2 and continue
Proposed Algorithm …

5. Stop semi-supervised clustering:
  - The iterative semi-supervised clustering process is stopped when the sets C_nfp, C_fp, and C_ul are unchanged
  - Program modules in the p (or q) clusters are labeled and recorded as nfp (or fp)
  - Program modules in the remaining r unlabeled clusters are not assigned any fault-proneness labels
Software Measurement Data

- Software metrics datasets obtained from seven NASA software projects (MDP Initiative):
  - JM1, KC1, KC2, KC3, CM1, MW1, and PC1
- Projects characterized by the same set of software product metrics and built in similar software development environments
- Defect data reflect changes made to source code for correcting errors recorded in problem reporting systems
Software Measurement Data …

- The JM1 project dataset is used as the training data in our empirical case studies
  - it is the largest dataset among the seven software projects
- The remaining six datasets are used as test data for evaluating model generalization performance
- Among the 21 product metrics, only the 13 basic metrics are used in our study
Software Metrics

1. Cyclomatic complexity
2. Essential complexity
3. Design complexity
4. Number of unique operators
5. Number of unique operands
6. Total number of operators
7. Total number of operands
8. Total number of lines of source code
9. Executable lines of code
10. Lines with code and comments
11. Lines with only comments
12. Blank lines of code
13. Branch count
Case Study Datasets
Constraint-Based Clustering Case Study

- The JM1 dataset was pre-processed to yield a reduced dataset of 8850 modules, i.e., JM1-8850
  - program modules with identical software attributes but different fault-proneness labels were eliminated
- JM1 (i.e., JM1-8850) is used as the training data and to form the respective labeled and unlabeled datasets
- KC1, KC2, KC3, CM1, MW1, and PC1 are used as test datasets to evaluate the knowledge learnt after the constraint-based clustering analysis of the software data
Constraint-Based Clustering Case Study …

- Labeled datasets formed by random sampling:
  - LP = {100, 250, 500, 1000, 1500, 2000, 2500, 3000} labeled modules
  - Each LP dataset is randomly selected to maintain an 80:20 proportion of nfp:fp program modules (see the sampling sketch below)
- 3 samples were obtained for each LP value (5 samples for LP = {100, 250, 500}), and average results are reported in the paper
- Parameter settings:
  - C_max = {30, 40} clusters
  - Cin_nfp = {10, 20} for the Cg algorithm
  - Cin_fp = {10, 20} for the Cg algorithm
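The 80:20 sampling referenced above might be performed along these lines; the helper name, seed handling, and return values are illustrative assumptions.

```python
# Minimal sketch (assumed helper) of drawing an LP-sized labeled subset with
# the 80:20 nfp:fp proportion; the remaining JM1-8850 modules act as the
# unlabeled set U.
import numpy as np

def sample_labeled_subset(y, lp, seed=0):
    """y: NumPy array with 1 = fp, 0 = nfp. Returns (labeled_idx, unlabeled_idx)."""
    rng = np.random.default_rng(seed)
    n_fp = int(round(0.2 * lp))                     # 20% fault-prone modules
    n_nfp = lp - n_fp                               # 80% not fault-prone modules
    fp_idx = rng.choice(np.where(y == 1)[0], size=n_fp, replace=False)
    nfp_idx = rng.choice(np.where(y == 0)[0], size=n_nfp, replace=False)
    labeled = np.concatenate([nfp_idx, fp_idx])
    unlabeled = np.setdiff1d(np.arange(len(y)), labeled)
    return labeled, unlabeled
```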
Initial Clusters of Labeled Modules
Expert-Based Labeling Results
Classification of Labeled Modules
Classification of Labeled …
Classification of Labeled …
C_max = 30 Clusters
Classification of Labeled …
C_max = 40 Clusters
Classification of Test Datasets
Unsupervised Clustering
Classification of Test Datasets …
Constraint-Based Clustering (LP = 250)
Classification of Test Datasets …
Constraint-Based Clustering (LP = 1000)
Classification of Test Datasets …
Constraint-Based Clustering (LP = 3000)
Classification of Test Datasets …
Average Classification of Test Data Modules
Comparison with C4.5 Models

- C4.5 decision tree as implemented in Weka, an open-source data mining tool
- Supervised decision tree models built using 10-fold cross validation (see the sketch below)
- Decision tree parameters tuned for an appropriate comparison with constraint-based clustering
  - tuning for similar Type I (false positive) error rates
- C4.5 models yielded very low false positive rates in conjunction with very high false negative rates
- Performance of the C4.5 models generally remained unchanged as LP increased, compared to an improvement by constraint-based clustering
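For context, the baseline could be sketched as follows; scikit-learn's DecisionTreeClassifier stands in for Weka's J48 (C4.5), so the learner and its parameters are illustrative assumptions rather than the settings used in the study.

```python
# Minimal sketch (stand-in for Weka's C4.5/J48): a decision tree evaluated
# with 10-fold cross validation, reporting the Type I (false positive) and
# Type II (false negative) error rates for fp vs. nfp classification.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict

def c45_like_baseline(X, y):
    """X: (n, d) metrics; y: NumPy array with 1 = fp, 0 = nfp."""
    tree = DecisionTreeClassifier(min_samples_leaf=5, random_state=0)
    pred = cross_val_predict(tree, X, y, cv=10)
    type_i = np.mean(pred[y == 0] == 1)    # nfp modules misclassified as fp
    type_ii = np.mean(pred[y == 1] == 0)   # fp modules misclassified as nfp
    return type_i, type_ii
```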
Remaining Unlabeled Modules

- Do they constitute noisy data? Are they hard-to-model modules?
- Do they form new groups of program modules for the given system?
  - Are their software measurements uniquely different from those of the other program modules?
- Did something go wrong in the software metrics data collection process?
- Did the project not collect other software metrics that may better represent the software quality?
Remaining Unlabeled Modules …

- Ensemble Filter (EF) strategy
  - Comparison with the majority EF
  - Consists of 25 classifiers from different learning theories and methodologies (a simplified sketch follows this slide)
- Investigated the commonality between the modules detected as noisy by the EF and those that remain unlabeled after the constraint-based clustering process
  - About 40% to 50% were common with those considered noisy by the ensemble filter
  - A relatively large number of the same modules were consistently included in the pool of remaining unlabeled program modules
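A simplified sketch of a majority-vote ensemble filter; the study's filter combined 25 classifiers, whereas the three scikit-learn learners here are illustrative stand-ins.

```python
# Minimal sketch of a majority-vote ensemble filter: a module is flagged as
# noisy when more than half of the learners misclassify it under
# cross-validated prediction.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

def majority_ensemble_filter(X, y, cv=10):
    """Returns the indices of modules flagged as noisy."""
    learners = [DecisionTreeClassifier(random_state=0),
                GaussianNB(),
                KNeighborsClassifier(n_neighbors=5)]
    errors = np.zeros((len(learners), len(y)), dtype=bool)
    for i, clf in enumerate(learners):
        errors[i] = cross_val_predict(clf, X, y, cv=cv) != y
    return np.where(errors.sum(axis=0) > len(learners) / 2)[0]
```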
Constraint-Based Clustering Case Study … Summary

- Improved estimation performance compared to unsupervised clustering with expert input
- Better test data performance compared to a supervised learner trained on the labeled dataset
- For larger labeled datasets, generally improved performance compared to EM-based semi-supervised classification
- Several of the remaining unlabeled modules are likely to constitute noisy data, providing insight into their attributes
Semi-Supervised Classification with the EM Algorithm

- Learning from a small labeled and a large unlabeled software measurement dataset
- Expectation-Maximization (EM) algorithm for building semi-supervised software quality classification models
- Improve the supervised learner with the knowledge stored in the software attributes of the unlabeled program modules
- The labeled dataset is iteratively augmented with program modules from the unlabeled dataset
Semi-Supervised Classification with the EM Algorithm …

[Diagram: labeled program modules (LP = {100, 250, 500, 1000, 1500, 2000, 2500, 3000}) and the unlabeled program modules from JM1-8850 feed the EM algorithm for estimating class labels; confidence-based selection then picks the unlabeled modules to be added to the labeled set (a simplified sketch follows).]
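In simplified form, the iterative, confidence-based augmentation loop might look like the sketch below; it uses a self-training-style loop with a naive Bayes learner as an illustrative stand-in for the study's EM formulation, and the threshold value is an assumption.

```python
# Minimal sketch (simplified stand-in for the EM-based procedure): fit on the
# labeled modules, estimate class labels for the unlabeled modules, and move
# only the high-confidence ones into the labeled set until none qualify.
import numpy as np
from sklearn.naive_bayes import GaussianNB

def semi_supervised_augment(X_lab, y_lab, X_unl, threshold=0.95):
    """X_lab, y_lab: labeled modules and fp/nfp labels; X_unl: unlabeled modules."""
    model = GaussianNB()
    while len(X_unl) > 0:
        model.fit(X_lab, y_lab)
        proba = model.predict_proba(X_unl)
        pick = proba.max(axis=1) >= threshold          # confidence-based selection
        if not pick.any():
            break
        new_labels = model.classes_[proba[pick].argmax(axis=1)]
        X_lab = np.vstack([X_lab, X_unl[pick]])
        y_lab = np.concatenate([y_lab, new_labels])
        X_unl = X_unl[~pick]
    model.fit(X_lab, y_lab)                            # final model on augmented data
    return model, X_lab, y_lab
```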
Semi-Supervised Classification with the EM Algorithm …

- The proposed semi-supervised classification process improved generalization performance
- Semi-supervised software quality classification models generally yielded better performance than C4.5 decision trees
- About 40% to 50% of the remaining modules are likely to constitute noisy data
- The number of unlabeled modules selected for augmentation was largest when LP = 1000
Conclusion

- Practical solutions to the problem of software quality analysis with limited a priori knowledge of defect data
- Empirical investigation with software measurement data from real-world projects
- Constraint-based clustering vs. semi-supervised classification:
  - Constraint-based clustering generally yielded better performance than EM-based semi-supervised classification and has lower complexity
  - Semi-supervised classification with EM allows for control of the relative balance between the Type I and Type II error rates
Some Future Work

- Applying the limited defect data problem to quantitative software fault prediction models
- A software engineering study on the characteristics of program modules that remain unlabeled
  - Investigate the software development process
- Exploring semi-supervised learning schemes for detecting noisy instances in a dataset
- Investigating self-labeling heuristics for minimizing expert involvement in the unsupervised and constraint-based clustering approaches
Other SE Research Focus

- Knowledge-based software security modeling and analysis
- Studying the influence of diversity in pair programming teams
- Software engineering measurements for agile development teams and projects
- Software forensics with cyber security and education applications
Software Quality Analysis with
Limited Prior Knowledge of Faults
Questions !!!
Naeem (Jim) Seliya
Assistant Professor, CIS Department
University of Michigan – Dearborn
313 583 6669  [email protected]