How to Find Relevant Data for Effort Estimation

毛 可
2012-03-28
1
Author
• Ekrem Kocaguneli
• Tim Menzies
• Specialties: Data Mining, Effort Estimation
• '11 TSE: Exploiting the Essential Assumptions of Analogy-Based Effort Estimation (TEAK)
• '11 TSE: On the Value of Ensemble Effort Estimation
• '11 ESEM: –
• '10 ASE: When to Use Data from Other Projects for Effort Estimation (short)
• Pre: Relevancy Filtering for Defect Estimation
2
Motivation (Why)
The Locality(1) Assumption
• Data divides best on one attribute:
1. project type (e.g. embedded);
2. development centers of developers;
3. development language;
4. application type (MIS, GNC, etc.);
5. targeted hardware platform;
6. in-house vs. outsourced projects.
• If Locality(1) holds:
– It is hard to use data across these boundaries
– Models are confined, so each site needs to collect local data
3
Motivation (Why)
The Locality(N) Assumption
• Data divides best on a combination of attributes
• If Locality(N) holds:
– It is easier to use data across these boundaries
4
Work
• Cross-vs-within + “relevancy filtering” for effort estimation
– Cross data is as good as within data
– Companies can use others’ data for their estimates
– If they first apply “relevancy filtering”
• "Cross" performs the same as "local"
5
Technology (How)
• How to find relevant training data?
6
Technology (How)
• Variance Pruning
7
Technology (How)
• TEAK = ABE0 + instance selection
– '11 TSE: Exploiting the Essential Assumptions of Analogy-Based Effort Estimation
• ABE0 = ABE version 0
– Most commonly used
– Normalized numerics, 0 to 1
– Euclidean distance
– Equal weight to all attributes
– Returns the median effort of the k nearest neighbors
• Instance selection
– smart way to adjust training data
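The ABE0 steps listed above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function names, the default k=3, and the choice to normalize training and test rows together are all assumptions.

```python
import math
from statistics import median

def normalize(rows):
    """Min-max scale each column of `rows` to [0, 1]; constant columns map to 0."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi)]
        for row in rows
    ]

def abe0_estimate(train_features, train_efforts, test_features, k=3):
    """Median effort of the k training projects nearest to the test project."""
    # Assumption: train and test rows are scaled together so they share one 0-1 range.
    scaled = normalize(train_features + [test_features])
    train_scaled, test_scaled = scaled[:-1], scaled[-1]
    # Unweighted Euclidean distance: every attribute counts equally.
    dists = [math.dist(row, test_scaled) for row in train_scaled]
    nearest = sorted(range(len(dists)), key=dists.__getitem__)[:k]
    return median(train_efforts[i] for i in nearest)

# Made-up projects: [size, complexity] and their efforts.
train_features = [[10, 1], [20, 2], [30, 3], [100, 10]]
train_efforts = [100, 200, 300, 1000]
estimate = abe0_estimate(train_features, train_efforts, [22, 2], k=3)
```

Instance selection, the other half of TEAK, would shrink train_features and train_efforts before this function is called.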
8
Technology (How)
• TEAK is a variance-based instance selector
• It is built via GAC (greedy agglomerative clustering) trees (binary for even)
• TEAK is a two-pass system
– First pass selects low-variance relevant projects (instance selection)
– Second pass retrieves projects to estimate from (instance retrieval)
• Variance Pruning
– > 10% * max(σ²)
– > (100% + 10%) * max(σ²) ?
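A much simplified sketch of the variance-pruning idea, in Python. It does not build a GAC tree: the candidate groups of projects are assumed to be given already, and it follows the first threshold on the slide (10% of the largest variance). The function name and the alpha parameter are hypothetical.

```python
from statistics import pvariance

def variance_prune(groups, alpha=0.10):
    """Keep groups whose effort variance is at most alpha * (largest group variance).

    `groups` is a list of non-empty lists of effort values. In TEAK these would
    come from sub-trees of a GAC tree; here they are simply passed in.
    """
    variances = [pvariance(g) for g in groups]
    threshold = alpha * max(variances)
    return [g for g, v in zip(groups, variances) if v <= threshold]

# Made-up effort values: the middle group is far noisier than the others.
groups = [[100, 102, 101], [50, 500, 950], [200, 210, 190]]
kept = variance_prune(groups)
```

With these made-up numbers the high-variance middle group is dropped and the two consistent groups remain as candidate training data.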
9
Technology (How)
• TEAK finds local regions important to the estimation of particular cases
• TEAK finds those regions via locality(N), not locality(1)
11
Experiments - Datasets
• Public availability: for reproducibility
• cross-within divisibility
• 6 out of 20+ datasets from PROMISE
12
Experiments - Datasets
For dataset X with subsets X1, X2, X3:
• Within
– Run TEAK on X1, X2, X3 separately, with LOOCV
• Cross
– X1 test, X2+X3 train. … N-fold CV
• Repeat 20 times: TEAK is greedy, so results vary with the order of the input data
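One way to read the within and cross setups above is as two split generators, sketched here in Python. The function names, the dictionary layout of the subsets, and the seeded shuffle used for the 20 repeats are assumptions made for illustration.

```python
import random

def within_splits(subset):
    """Leave-one-out: each project is tested once against the rest of its own subset."""
    for i in range(len(subset)):
        yield subset[:i] + subset[i + 1:], [subset[i]]

def cross_splits(subsets):
    """Each subset is the test set once; all other subsets together form the training set."""
    for name, test in subsets.items():
        train = [p for other, rows in subsets.items() if other != name for p in rows]
        yield name, train, test

def shuffled_repeats(rows, repeats=20, seed=0):
    """Yield `repeats` reorderings of `rows`, since a greedy learner depends on input order."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(rows)
        rng.shuffle(order)
        yield order

# Placeholder project ids stand in for real project records.
subsets = {"X1": [1, 2], "X2": [3], "X3": [4, 5]}
cross = list(cross_splits(subsets))
within = list(within_splits(subsets["X3"]))
```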
13
Experiments
• Win-Loss-Tie:
• Mann-Whitney test (95%)
– Tests whether the distributions of two populations differ significantly
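A minimal Python sketch of a win-tie-loss comparison between two estimators' error samples. It assumes a common convention: a tie when the Mann-Whitney test finds no significant difference at the 95% level, otherwise the lower median error wins. The p-value uses the normal approximation with tie correction and no continuity correction, so it is rough for very small samples. All names are hypothetical.

```python
import math
from statistics import median

def mann_whitney_p(a, b):
    """Two-sided Mann-Whitney p-value (normal approximation, tie-corrected)."""
    pooled = sorted((v, g) for g, xs in enumerate((a, b)) for v in xs)
    n1, n2, n = len(a), len(b), len(a) + len(b)
    rank_sum_a, tie_term, i = 0.0, 0.0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_a += avg_rank * sum(1 for _, g in pooled[i:j] if g == 0)
        i = j
    u = rank_sum_a - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1))))
    if sigma == 0:
        return 1.0  # all values identical: no evidence of a difference
    z = (u - mu) / sigma
    return math.erfc(abs(z) / math.sqrt(2))

def win_tie_loss(errors_a, errors_b, alpha=0.05):
    """Outcome for estimator A against B: 'win', 'tie', or 'loss'."""
    if mann_whitney_p(errors_a, errors_b) >= alpha:
        return "tie"
    return "win" if median(errors_a) < median(errors_b) else "loss"

# Made-up error samples: A's errors are clearly smaller than B's.
errors_a = list(range(1, 11))
errors_b = list(range(11, 21))
outcome = win_tie_loss(errors_a, errors_b)
```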
14
Experiment1 - Performance Comparison
MAR: Mean Absolute Residual
MdMRE: Median MRE
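For reference, the two measures can be computed as follows in Python, taking MRE in its usual form |actual - predicted| / actual. The function names and sample numbers are made up for illustration.

```python
from statistics import mean, median

def mar(actuals, predictions):
    """Mean Absolute Residual: average of |actual - predicted|."""
    return mean(abs(a - p) for a, p in zip(actuals, predictions))

def mdmre(actuals, predictions):
    """Median of the Magnitude of Relative Error, |actual - predicted| / actual."""
    return median(abs(a - p) / a for a, p in zip(actuals, predictions))

# Made-up efforts (e.g. person-hours) for three projects.
actuals = [100, 200, 400]
predictions = [110, 150, 400]
mar_value = mar(actuals, predictions)
mdmre_value = mdmre(actuals, predictions)
```

MAR is in the same unit as effort, while MdMRE is unit-free and less sensitive to a few badly estimated projects.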
15
Experiment1 - Performance Comparison
Analogy by 1-neighbor (PRED(25) > 0.3 on C81 subsets):
% Scale the nearest neighbor's effort by relative size, then by each cost-driver ratio
for i = 1:numTestCases
    nn = nearestCase(i);  % index of the nearest training project
    estimates(i) = effortTrain(nn) * sizeTest(i) / sizeTrain(nn);
    for k = 1:numTestFactors
        estimates(i) = estimates(i) * cdTestReady(i,k) / cdTrainReady(nn,k);
    end
end
Analogy by K-neighbor:
16
Experiment2 – Retrieval Tendency
17
Experiment2 – Retrieval Tendency
Diagonal (WC) vs. off-diagonal (CC) selection
Percentages sorted
Percentiles of diagonals and off-diagonals
18
Conclusion
1. Cross performance is no worse than within performance
2. The probability that the estimator retrieves a training instance from cross or within data is the same
Implication:
• Companies can learn from each other’s data
• Locality(N): maybe there are general effects in SE
– Effects that transcend boundaries of one company
– Local vs. Global Model…
19
Future work
• Check external validity
– After instance selection, does cross == within?
• Build more repositories
– More useful than previously thought for effort estimation
• Synonym discovery
– Can only use cross-data if it has the same ontology
– Auto-generate lexicons to map terms between data sets (“LOC” – “size”, “product complexity”)
20
Thanks!
Q&A?
21