uisp09-Evaluation


Evaluating Results of Learning
Blaž Zupan
www.ailab.si/blaz/predavanja/uisp
Evaluating ML Results
• Criteria
– Accuracy of induced concepts (predictive accuracy)
• accuracy = probability of correct classification
• error rate = 1 - accuracy
– Comprehensibility
• Both are important
– but comprehensibility is hard to measure
– so accuracy is the one usually studied
• Kinds of accuracy
– Accuracy on learning data
– Accuracy on new data (much more important)
– Major topic: estimating accuracy on new data
Usual Procedure to Estimate Accuracy
• Split all available data into a learning set (training set) and a test set (holdout set)
• The learning system induces a classifier from the learning set (internal validation)
• The induced classifier is then evaluated on the test set (external validation)
• Main idea: accuracy on test data approximates accuracy on new data
Problems
• Common mistake
– estimating accuracy on new data by accuracy on learning data (resubstitution accuracy)
• Size of the data set
– hopefully the test set is representative of new data
– no problem when available data abounds
• Scarce data: major problem
– much data is needed for successful learning
– much data is needed for a reliable accuracy estimate
Estimating Accuracy from Test Set
• Consider
– The induced classifier classifies a = 75% of the test cases correctly
– So we expect the accuracy on new data to be close to 75%. But:
• How close?
• How confident are we in this estimate? (this depends on the size of the test data set)
Confidence Intervals
• Can be used to assess the confidence of our accuracy estimates
• A 95% confidence interval is placed around the success rate measured on the test data (on a scale from 0% to 100%)
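
A minimal sketch of how such an interval can be computed, assuming the usual normal approximation to the binomial (the slides do not prescribe a particular method; the function name is illustrative):

import math

def accuracy_confidence_interval(correct, n, z=1.96):
    # 95% CI for accuracy via the normal approximation (a sketch,
    # not the method prescribed by the slides)
    a = correct / n                      # observed accuracy on the test set
    se = math.sqrt(a * (1 - a) / n)      # standard error of the estimate
    return a - z * se, a + z * se

# Example: 75 of 100 test cases classified correctly
print(accuracy_confidence_interval(75, 100))   # roughly (0.665, 0.835)
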
Evaluation Schemes (sampling methods)
3-Fold Cross Validation
• Reorder the dataset arbitrarily and split it into three parts
• Run three train & test iterations (#1, #2, #3): in each, two parts are used for training and the remaining part for testing
• Evaluate the statistics for each iteration and then compute the average
k-Fold Cross Validation
• Split the data into k subsets of approximately equal size (and class distribution, if stratified)
• For i = 1 to k:
– use the i-th subset for testing and the remaining (k-1) subsets for training
• Compute the average accuracy (see the sketch below)
• k-fold CV can be repeated several times, say, 100
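
A minimal sketch of k-fold cross validation in Python, assuming hypothetical train_fn and predict_fn callables supplied by the user (these names are illustrative, not from the slides):

import random

def k_fold_cv(data, labels, k, train_fn, predict_fn, seed=0):
    # k-fold cross validation sketch: returns the average accuracy.
    # train_fn(X, y) must return a model; predict_fn(model, x) a label.
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)              # reorder arbitrarily
    folds = [idx[i::k] for i in range(k)]         # k subsets of ~equal size
    accuracies = []
    for i in range(k):
        test = folds[i]                           # i-th subset for testing
        train = [j for f in folds if f is not folds[i] for j in f]
        model = train_fn([data[j] for j in train], [labels[j] for j in train])
        correct = sum(predict_fn(model, data[j]) == labels[j] for j in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k                    # average over the k folds
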
Random Sampling (70/30)
• Randomly split the data into, say,
– 70% for training
– 30% for testing
• Learn on the training data, test on the test data
• Repeat the procedure, say, 100 times, and compute the average accuracy and its confidence intervals (see the sketch below)
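
A similar sketch for repeated 70/30 random subsampling, reusing the hypothetical train_fn and predict_fn callables from the previous sketch:

import random

def random_sampling(data, labels, train_fn, predict_fn,
                    repeats=100, train_frac=0.7, seed=0):
    # Repeated 70/30 random subsampling; returns the mean accuracy.
    rng = random.Random(seed)
    accuracies = []
    for _ in range(repeats):
        idx = list(range(len(data)))
        rng.shuffle(idx)
        split = int(train_frac * len(idx))
        train, test = idx[:split], idx[split:]    # 70% train, 30% test
        model = train_fn([data[j] for j in train], [labels[j] for j in train])
        correct = sum(predict_fn(model, data[j]) == labels[j] for j in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / repeats
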
Statistics
• calibration
• discrimination
Calibration and Discrimination
• Calibration
– how accurate the probabilities assigned by the induced model are
– classification accuracy, sensitivity, specificity, ...
• Discrimination
– how well the model distinguishes between positive and negative cases
– area under the ROC curve
Test Statistics: Contingency Table of Classification Results

                          True class
                      +          -          Totals
Classified +          TP         FP         TP+FP
Classified -          FN         TN         FN+TN
Totals                TP+FN      FP+TN      N

• TP, FP: true positive, false positive
• FN, TN: false negative, true negative
Classification Accuracy
• CA = (TP + TN) / N
• Proportion of correctly classified examples (see the contingency table above)
Sensitivity
• Sensitivity = TP / (TP + FN)
• Proportion of correctly detected positive examples
• In medicine (+, -: presence and absence of a disease):
– the chance that our model correctly identifies a patient with the disease
Specificity
• Specificity = TN / (FP + TN)
• Proportion of correctly detected negative examples
• In medicine:
– the chance that our model correctly identifies a patient without the disease
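
A minimal sketch that computes these three statistics from paired lists of true and predicted labels (the function and argument names are illustrative, not from the slides):

def contingency_statistics(true_labels, predicted_labels, positive="+"):
    # Build the contingency table counts and derive CA, sensitivity, specificity.
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    n = tp + fp + fn + tn
    return {
        "CA": (tp + tn) / n,              # proportion of correct classifications
        "sensitivity": tp / (tp + fn),    # correctly detected positives
        "specificity": tn / (fp + tn),    # correctly detected negatives
    }
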
Other Statistics
From DL Sackett et al.: Evidence-Based Medicine, Churchill-Livingstone, 2000.
ROC Curves
• ROC = Receiver Operating Characteristic
• Used since the 1970s to evaluate medical prognostic models
• Recently popular within ML [rediscovery?]
• The curve plots sensitivity (TP rate) against 1 - specificity (FP rate), both from 0% to 100%; a very good model bows towards the top-left corner, a not-so-good model lies closer to the diagonal
ROC Curve
• Example: ten test cases with true class and predicted probability P(yes); a threshold T turns the probabilities into class predictions

Class   P(yes)   T = ∞   T = 0.5   T = 0
yes     0.89     no      yes       yes
yes     0.80     no      yes       yes
yes     0.80     no      yes       yes
no      0.80     no      yes       yes
yes     0.63     no      yes       yes
no      0.33     no      no        yes
yes     0.33     no      no        yes
no      0.10     no      no        yes
no      0.10     no      no        yes
no      0.10     no      no        yes
ROC Curve (Recipe)
1. Draw a grid: step 1/N horizontally and 1/P vertically, where N and P are the numbers of negative and positive test examples
2. Sort the results by descending predicted probabilities
3. Start at (0, 0)
4. From the table, select the top row(s) with the highest probability
5. Let the selected rows include p positive and n negative examples: move p grid points up and n to the right
6. Remove the selected rows
7. If any rows remain, go to 4
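
A minimal sketch of this recipe in Python, applied to the example table above (function and variable names are illustrative; labels are assumed to be the strings "yes" and "no"):

def roc_points(classes, probabilities, positive="yes"):
    # Follow the recipe above: return the (FP rate, TP rate) points
    # of the ROC curve for the given true classes and P(yes) estimates.
    P = sum(c == positive for c in classes)      # number of positive examples
    N = len(classes) - P                         # number of negative examples
    rows = sorted(zip(probabilities, classes), reverse=True)  # descending P(yes)
    x, y, points = 0.0, 0.0, [(0.0, 0.0)]        # start at (0, 0)
    i = 0
    while i < len(rows):
        prob = rows[i][0]
        # select all remaining rows sharing the current highest probability
        group = [c for p, c in rows[i:] if p == prob]
        p_cnt = sum(c == positive for c in group)
        n_cnt = len(group) - p_cnt
        y += p_cnt / P                           # move p grid points up
        x += n_cnt / N                           # move n grid points right
        points.append((x, y))
        i += len(group)
    return points

# Example table from the slides
classes = ["yes", "yes", "yes", "no", "yes", "no", "yes", "no", "no", "no"]
probs   = [0.89, 0.80, 0.80, 0.80, 0.63, 0.33, 0.33, 0.10, 0.10, 0.10]
print(roc_points(classes, probs))
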
Area Under ROC
• The area under the ROC curve (TP rate plotted against FP rate, both from 0% to 100%)
• For every negative example we count the positive examples with a higher estimate, and normalize this score by the product of the numbers of positive and negative examples
• AROC = P[ P+(positive example) > P+(negative example) ]
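
A minimal sketch of this pairwise counting, assuming that ties in the predicted probability count one half (a common convention, not stated on the slides):

def area_under_roc(classes, probabilities, positive="yes"):
    # AROC by pairwise counting: for every negative example, count the
    # positive examples ranked above it; normalize by (#pos * #neg).
    # Ties count one half (an assumption, not from the slides).
    pos = [p for p, c in zip(probabilities, classes) if c == positive]
    neg = [p for p, c in zip(probabilities, classes) if c != positive]
    score = sum((pp > pn) + 0.5 * (pp == pn) for pp in pos for pn in neg)
    return score / (len(pos) * len(neg))

# Example table from the slides
classes = ["yes", "yes", "yes", "no", "yes", "no", "yes", "no", "no", "no"]
probs   = [0.89, 0.80, 0.80, 0.80, 0.63, 0.33, 0.33, 0.10, 0.10, 0.10]
print(area_under_roc(classes, probs))   # 0.86 for this example
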
Area Under ROC
• Expected to range from 0.5 to 1.0
• The score is not affected by class distributions
• Characteristic landmarks
– 0.5: random classifier
– below 0.7: poor classification
– 0.7 to 0.8: reasonable classification
– 0.8 to 0.9: here is where very good predictive models start
Final Thoughts
• Never test on the learning set
• Use some sampling procedure for testing
• At the end, evaluate both
– predictive performance
– semantic content
• Bottom line: good models are those that are useful in practice