Evaluating Models

1
Evaluating Induced Models
with
Daniel L. Silver
Copyright (c), 2004
All Rights Reserved
CogNova
Technologies
2
Agenda
■ Interpretation and Evaluation Phase
■ Model accuracy (fitness) and confidence
■ Testing the difference between two models
■ Testing the difference between two DM methods (e.g. IDT versus ANN)
3
The KDD Process
[Figure: the KDD process as a pipeline. Data Sources → Data Consolidation → Consolidated Data (Data Warehouse) → Selection and Preprocessing → Prepared Data → Data Mining → Patterns & Models (e.g. p(x)=0.02) → Interpretation and Evaluation → Knowledge]
4
Inductive Modeling = Data Mining
Basic Framework for Inductive Learning
[Figure: the Environment supplies Training Examples (x, f(x)) to the Inductive Learning System, which produces the Induced Model of Classifier h(x). Testing Examples are given to the model, which produces the Output Classification (x, h(x)). Question posed: h(x) ~ f(x)?]
Focus is on developing models that can accurately classify new examples.
5
Model Accuracy and Confidence
■ Preferably a separate verification set is used to judge fitness or accuracy
■ Statistical confidence in the accuracy of a model can be expressed as an interval
[Figure: interval plotted for model h1; axis: Mean Error or Error Rate]
6
The Normal Curve and Confidence Intervals
■ Consider a class of 30 persons
■ True mean (average) mark of 75%
■ How can we estimate this from the marks of only 10 sample persons?
■ Let’s do an example using Excel
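The slide works this example in Excel. As a rough equivalent, the Python sketch below estimates the class mean from 10 sampled marks and attaches a 95% confidence interval. The marks are hypothetical (generated at random and shifted so the true mean is 75), and the critical value 2.262 is the tabulated two-tailed Student's t value for 9 degrees of freedom. The finite size of the class (30) is ignored for simplicity.

```python
import math
import random
import statistics

# Hypothetical population: marks for a class of 30, shifted so the true mean is 75.
random.seed(0)
raw = [random.gauss(75, 10) for _ in range(30)]
shift = 75 - statistics.mean(raw)
population = [m + shift for m in raw]

# Draw 10 persons at random, without replacement.
sample = random.sample(population, 10)

n = len(sample)
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)      # sample standard deviation (n - 1)
std_error = sample_sd / math.sqrt(n)      # standard error of the mean

# Two-tailed 95% critical value of Student's t for n - 1 = 9 degrees of freedom.
T_CRIT_DF9 = 2.262
half_width = T_CRIT_DF9 * std_error

print(f"sample mean = {sample_mean:.1f}")
print(f"95% CI = [{sample_mean - half_width:.1f}, {sample_mean + half_width:.1f}]")
```

With a different random sample the interval moves, but in about 95 of 100 such samples it would be expected to cover the true mean.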
7
Model Accuracy and Confidence
Approach #1: Large Sample
When the amount of available data is large ...
[Figure: Available Examples are divided randomly into a Training Set (70%), used to develop one model, and a Test Set (30%); a Verify Set is also shown. Compute the test error; generalization = test/verify fit.]
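A minimal sketch of the random 70/30 division, assuming the examples fit in a Python list. The function name `split_examples`, the seed and the toy data are illustrative only, and the verify set from the diagram is left out for brevity.

```python
import random

def split_examples(examples, train_fraction=0.7, seed=42):
    """Shuffle a copy of the examples and cut it into training and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical data: 100 (x, f(x)) pairs.
examples = [(x, x % 2) for x in range(100)]
train_set, test_set = split_examples(examples)
print(len(train_set), len(test_set))
```

The model is developed on `train_set` only; the error measured on `test_set` is the generalization estimate.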
8
Model Accuracy and Confidence
■ Generalization statistic (fit, error or accuracy) is provided by the learning system
■ Confidence interval must be computed:
  • Continuous target variable - Compute mean error over n examples and confidence interval using Excel (evaluate_models.xls)
  • Nominal (binary) target variable - Given an error rate of P from a sample of n examples, the 95% confidence interval = P ± 1.96 × stdev, where stdev = sqrt(P(1-P)/n) and P = number incorrect / n
  • Strictly speaking this is for n >= 30
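The error-rate interval for a binary target can be sketched in a few lines of Python. The function name `error_rate_interval` and the counts used (20 errors in 100 test examples) are hypothetical.

```python
import math

def error_rate_interval(num_incorrect, n, z=1.96):
    """Return (P, lower, upper) for the error rate P = num_incorrect / n."""
    p = num_incorrect / n
    stdev = math.sqrt(p * (1 - p) / n)   # standard deviation of the sample proportion
    return p, p - z * stdev, p + z * stdev

# Hypothetical result: 20 errors on 100 test examples.
p, lower, upper = error_rate_interval(20, 100)
print(f"P = {p:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

Here stdev = sqrt(0.2 × 0.8 / 100) = 0.04, so the interval is 0.2 ± 0.0784, roughly 0.12 to 0.28.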
9
Testing the Difference Between Two Models
■ Which of the following two hypotheses is the better? … h1 or h2?
[Figure: intervals plotted for hypotheses h1, h2 and h3; axis: Fitness or Error Rate]
10
Testing the Difference Between Two Models
■ Assumption: If some measurable characteristic of the models is statistically different then we will consider the models different
■ We will focus on the characteristics: mean error, and error rate (proportion incorrect), which can be computed from the test results
11
Testing the Difference Between Two Models
■ Continuous target variable
  • Use a Difference of Means Test
■ Nominal (binary) target variable
  • Use a Difference of Proportions Test
■ For 95% confidence in a difference, the p-value statistic must be <= 0.05 (see Excel spreadsheet example)
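The slides defer the calculation to the Excel spreadsheet. As an illustration only, a difference of proportions test can be sketched in Python as a two-tailed z-test with a pooled error rate. This assumes the two models were tested on independent samples; the function name and the counts are hypothetical.

```python
import math

def two_proportion_z_test(errors1, n1, errors2, n2):
    """Two-tailed z-test for the difference between two error rates."""
    p1, p2 = errors1 / n1, errors2 / n2
    pooled = (errors1 + errors2) / (n1 + n2)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / std_error
    # Standard normal CDF evaluated at |z|, via the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value

# Hypothetical test results: h1 makes 30 errors and h2 makes 50 errors,
# each on 200 examples.
z, p_value = two_proportion_z_test(30, 200, 50, 200)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
if p_value <= 0.05:
    print("The error rates differ at the 95% confidence level.")
```

For these counts z = -2.5, which gives a p-value of about 0.012, so the difference would be judged significant.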
12
Testing the Difference Between Two DM Methods
■ Cross-validation must be performed
■ Requires generating several models with different train, test and verify sets
■ With WEKA use the accuracy or error rate on the test sets
13
Network Training
Approach #2: Cross-validation
Provides a sense of confidence in model ...
[Figure: Available Examples are divided into a Training Set (90%), used to develop 10 different models, and a Test Set (10%); a Verify Set is also shown. Repeat 10 times and accumulate the test errors. Generalization is determined by mean test fit and stddev.]
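The 10-fold procedure can be sketched in Python as follows. The learner here is a hypothetical stand-in (it always predicts the most common training label) and the data are synthetic; the separate verify set from the diagram is omitted to keep the sketch short.

```python
import random
import statistics

def k_fold_splits(examples, k=10, seed=0):
    """Yield (train, test) pairs; each example lands in exactly one test fold."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        yield train, folds[i]

def majority_class_error(train, test):
    """Hypothetical stand-in learner: predict the most common training label."""
    labels = [y for _, y in train]
    prediction = max(set(labels), key=labels.count)
    return sum(1 for _, y in test if y != prediction) / len(test)

# Synthetic (x, f(x)) pairs; the label is 1 when x is a multiple of 3.
examples = [(x, int(x % 3 == 0)) for x in range(200)]

# Accumulate one test error per fold, then summarize.
errors = [majority_class_error(tr, te) for tr, te in k_fold_splits(examples)]
print(f"mean test error = {statistics.mean(errors):.3f}")
print(f"stddev = {statistics.stdev(errors):.3f}")
```

Replacing `majority_class_error` with a real learner gives one list of 10 test errors per method, which is the input a difference of means test needs.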
14
Testing the Difference Between Two DM Methods
■ A Difference of Means T-test can be used to determine a p-value statistic
■ For 95% confidence in a difference, the p-value statistic must be <= 0.05 (see Excel spreadsheet example)
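The slides compute the p-value in Excel. The sketch below is one possible Python equivalent, assuming an unpaired, pooled-variance two-sample t-test (the slides do not say which variant the spreadsheet uses). It uses only the standard library, so the two-tailed p-value is obtained by numerically integrating the Student's t density with Simpson's rule. All names and scores are hypothetical.

```python
import math
import statistics

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value: 1 - 2 * integral of the pdf over [0, |t|] (Simpson's rule)."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1 - 2 * total * h / 3)

def difference_of_means_t_test(a, b):
    """Pooled-variance two-sample t-test; returns (t, df, p_value)."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / df
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    return t, df, two_tailed_p(t, df)

# Hypothetical test-set scores from 10 cross-validation models per method.
method_a = [27, 25, 28, 26, 27, 26, 28, 25, 27, 27]
method_b = [31, 33, 32, 30, 32, 33, 31, 32, 32, 32]

t, df, p_value = difference_of_means_t_test(method_a, method_b)
print(f"t = {t:.2f}, df = {df}, p-value = {p_value:.5f}")
if p_value <= 0.05:
    print("The two methods differ at the 95% confidence level.")
```

If both methods are evaluated on the same folds, a paired t-test on the per-fold differences may be the more appropriate choice.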
15
Example: Using Census Data
■ Problem: To identify males given census data
■ Performance measure:
  • Accuracy = Goodness of fit
■ Model generation: IDT and ANN
16
Example: Using Census Data
■ Record results: Goodness of fit stats on test set for 10 different models
  • Mean fitness: ANN = 26.6, IDT = 31.8
■ Test difference between models: Use a difference of means T-test (see evaluate_models.xls)
  • p-value = 0.00124
  • Since p-value < 0.05, the two models are significantly different
17
THE END