
Evaluating data quality
issues from an industrial
data set
Gernot Liebchen
Bheki Twala
Mark Stephens
Martin Shepperd
Michelle Cartwright
[email protected]
What is it all about?
• Motivations
• Dataset – the origin & quality issues
• Noise & cleaning methods
• The Experiment
• Issues & conclusion
• Future Work
Motivations
• A previous investigation compared 3 noise-handling methods (robust algorithms [pruning], filtering, polishing)
• Predictive accuracy was highest with polishing, followed by pruning and only then by filtering
• But suspicions were raised (at EASE)
Suspicions about
previous investigation
• The dataset contained missing values
which were imputed (artificially created)
during the build of the model (decision
tree)
• Polishing alters the data (What impact can
that have?)
• The methods were evaluated by using the predictions of another decision tree -> Can the findings be supported by a metrics specialist?
Why do we bother?
• Good quality data is important for
good quality predictions and
assessments
• How can we hope for good quality
results if the quality of the input data
is not good?
• The data is used for a variety of
different purposes – esp. analysis
and estimation support
The Dataset
• Given a large dataset provided by EDS
• The original dataset contains more than
10 000 cases with 22 attributes
• Contains information about software
projects carried out since the beginning of
the 1990s
• Some attributes are more administrative
(e.g. Project Name, Project ID), and will not
have any impact on software productivity
Suspicions
• The data might contain noise
• This was confirmed by a preliminary analysis of the data, which also indicated the existence of outliers
How could it occur?
(in the case of the dataset)
• Input errors (some teams might be more
meticulous than others) / the person
approving the data might not be as
meticulous
• Misunderstood standards
• The input tool might not provide range checking (or only limited checking)
• “Service Excellence” dashboard in headquarters
• Local management pressure
Suspicious Data Example
                   Case 1           Case 2
Start Date:        01/08/2002       01/06/2002
Finish Date:       24/02/2004       09/02/2004
Name:              *******Rel 24    *******Rel 24
FP Count:          1522             1522
Effort:            38182.75         33461.5
Country:           IRELAND          UK
Industry Sector:   Government       Government
Project Type:      Enhance.         Enhance.
Etc.
But there were also examples with extremely high/low FP counts per hour (1 FP for 6916.25 hours; 52 FP in 4 hours; 1746 FP in 468 hours)
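For illustration, the implied productivity of these cases can be checked directly: 1 FP over 6916.25 hours is about 0.0001 FP per hour, while 52 FP in 4 hours is 13 FP per hour. A minimal sketch of such a check, assuming hypothetical column names and purely illustrative plausibility bounds:

```python
import pandas as pd

# Hypothetical plausibility bounds for function points delivered per hour of
# effort; the real limits would have to come from domain experts.
MIN_FP_PER_HOUR = 0.01
MAX_FP_PER_HOUR = 2.0

def flag_unrealistic(df: pd.DataFrame) -> pd.DataFrame:
    """Return cases whose implied productivity falls outside the assumed bounds.

    Assumes columns 'fp_count' and 'effort_hours' (names are illustrative).
    """
    productivity = df["fp_count"] / df["effort_hours"]  # FP per hour
    mask = (productivity < MIN_FP_PER_HOUR) | (productivity > MAX_FP_PER_HOUR)
    return df[mask]

# The extreme cases quoted above:
cases = pd.DataFrame({
    "fp_count":     [1.0,     52.0, 1746.0],
    "effort_hours": [6916.25,  4.0,  468.0],
})
print(flag_unrealistic(cases))
```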
What imperfections could
occur?
• Noise – Random Errors
• Outliers – Exceptional “True” Cases
• Missing data
• From now on Noise and Outliers will
be called Noise because both are
unwanted
Noise Detection
can be
• Distance based (e.g. visualisation methods; Cook's, Mahalanobis and Euclidean distance; distance clustering) – a sketch of a Mahalanobis check follows below
• Distribution based (e.g. neural
networks, forward search algorithms
and robust tree modelling)
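A minimal sketch of a distance-based check using the Mahalanobis distance listed above; the synthetic data and the cut-off of 3 are illustrative assumptions, not the settings used in the study:

```python
import numpy as np

def mahalanobis_outliers(X: np.ndarray, threshold: float = 3.0) -> np.ndarray:
    """Flag rows of X whose Mahalanobis distance from the attribute mean
    exceeds the threshold; returns a boolean mask of suspect cases."""
    diff = X - X.mean(axis=0)
    inv_cov = np.linalg.pinv(np.cov(X, rowvar=False))   # pinv guards against a singular covariance
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)  # squared Mahalanobis distances
    return np.sqrt(d2) > threshold

# Illustrative use: synthetic size/effort pairs at roughly 10 hours per FP,
# plus one of the implausible cases quoted earlier (1746 FP in 468 hours).
rng = np.random.default_rng(0)
size = rng.uniform(100, 2000, 60)
effort = size * 10 + rng.normal(0, 500, 60)
X = np.vstack([np.column_stack([size, effort]), [[1746.0, 468.0]]])
print(np.flatnonzero(mahalanobis_outliers(X)))          # the appended case (index 60) should be flagged
```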
What to do with noise?
• First, detection (we used decision trees – usually a pattern detection tool in data mining – but here used to categorise the data: trees are built on a training set and cases are tested against them in a test set; a sketch follows after this list)
• 3 basic options for cleaning: Polishing, Filtering, Pruning
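A minimal sketch of the decision tree used as a noise detector in this train/test fashion; a cross-validated classification filter is assumed here, which may not match the study's exact setup:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def flag_noise_with_tree(X: np.ndarray, y: np.ndarray, n_splits: int = 10) -> np.ndarray:
    """Cross-validated classification filter: an instance is flagged as a noise
    candidate when a tree trained on the other folds misclassifies it."""
    suspect = np.zeros(len(y), dtype=bool)
    for train_idx, test_idx in KFold(n_splits=n_splits, shuffle=True,
                                     random_state=0).split(X):
        tree = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
        suspect[test_idx] = tree.predict(X[test_idx]) != y[test_idx]
    return suspect
```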
Polishing/Filtering/Pruning
• Polishing – identifying the noise and correcting it
• Filtering – identifying the noise and eliminating it
• Pruning – avoiding overfitting (trying to ignore the leverage effects): the instances which lead to overfitting can be seen as noise and are, in effect, taken out of the model (a sketch of the three options follows below)
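A sketch of how the three options act on the instances flagged as suspect above; the function name and signature are illustrative, and polishing is reduced here to correcting only the class value:

```python
def clean(X, y, suspect, tree):
    """Apply the cleaning options to instances flagged as suspect (NumPy arrays
    assumed; 'tree' is a fitted classifier such as the one built above).

    - Filtering: drop the flagged instances.
    - Polishing: keep them, but replace the class value with the tree's
      prediction (a simplification; polishing proper may also adjust attributes).
    - Pruning does not touch the data at all: the tree itself is simplified
      (e.g. DecisionTreeClassifier(ccp_alpha=...)) so the flagged instances
      stop driving the fitted model.
    """
    filtered = (X[~suspect], y[~suspect])
    polished_y = y.copy()
    polished_y[suspect] = tree.predict(X[suspect])
    return filtered, (X, polished_y)
```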
What did we do?
& How did we do it?
• Compared the results of filtering and pruning and discussed the implications of pruning
• Reduced the dataset to eliminate cases with missing values and so avoid missing-value imputation (see the sketch after this list)
• Produced lists of “noisy” instances and polished counterparts
• Passed them on to Mark (as the metrics specialist)
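A minimal sketch of the complete-case reduction, assuming the data sits in a pandas DataFrame; the function name and column list are placeholders:

```python
import pandas as pd

def reduce_to_complete_cases(df: pd.DataFrame, analysis_cols: list) -> pd.DataFrame:
    """Keep only cases with no missing values in the attributes used for the
    analysis, so no imputation is needed before noise handling is applied.
    The column list is whatever the dataset uses; nothing is fixed here."""
    reduced = df.dropna(subset=analysis_cols)
    print(f"{len(df)} -> {len(reduced)} cases after removing incomplete records")
    return reduced
```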
Results
• Filtering produced a list of 226 cases from 436 (36% in noise list / 21% in cleaned set)
• Pruning produced a list of 191 cases from 436 (33% in noise list / 25% in cleaned set)
• Both lists were inspected, and both contain a large number of possible true cases as well as unrealistic cases (in terms of productivity)
Results 2
• By just inspecting historical data it
was not possible to judge which
method performed better
• The decision tree as a noise detector does not detect unrealistic instances, only outliers in the dataset – a limitation that can only be overcome with domain knowledge
So what about polishing?
• Polishing does not necessarily alter size or
effort, and we are still left with unrealistic
instances
• It makes them fit into the regression
model
• Is this acceptable from the point of view of
the data owner?
- depends on the application of the results
- What if unrealistic cases impact on the
model?
Issues/Conclusions
• In order to build the models we had to categorise the dependent variable into 3 categories (<=1042, <=2985.5, >2985.5), BUT these categories appeared too coarse for our evaluation of the predictions (a sketch of the banding follows after this list)
• If we know there are unrealistic cases, we
should really take them out before we
apply the cleaning methods (avoid the
inclusion of these cases in the building of
the model)
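A minimal sketch of the banding of the dependent variable, using the thresholds quoted in the first bullet; the effort values fed in are illustrative:

```python
import pandas as pd

# Effort thresholds as quoted above; the example effort values are made up.
effort = pd.Series([800.0, 2500.0, 38182.75, 33461.5])
bands = pd.cut(effort,
               bins=[-float("inf"), 1042, 2985.5, float("inf")],
               labels=["<=1042", "<=2985.5", ">2985.5"])
print(bands.value_counts())
```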
Where to go from here?
• Rerun the experiment without “unrealistic
cases”
• Simulate a dataset from a known model, induce noise and missing values, and evaluate the methods with knowledge of what the real underlying model is
What was it all about?
• Motivations
• Dataset – the origin & quality issues
• Noise & Cleaning methods
• The Experiment
• Issues & conclusion
• Future Work
Any Questions?
[email protected]