Evaluation of Learning Models
Literature:
T. Mitchell, Machine Learning, chapter 5
I.H. Witten and E. Frank, Data Mining, chapter 5
1
Fayyad’s KDD Methodology
[Figure: the KDD process — Data → (Selection) → Target data → (Preprocessing & cleaning) → Processed data → (Transformation & feature selection) → Transformed data → (Data Mining) → Patterns → (Interpretation / Evaluation) → Knowledge]
2
Overview of the lecture
• Evaluating Hypotheses (errors, accuracy)
• Comparing Hypotheses
• Comparing Learning Algorithms (hold-out methods)
• Performance Measures
• Varia (Occam’s razor, warning)
• No Free Lunch
3
Evaluating Hypotheses:
Two definitions of error
True error of hypothesis h with respect to
target function f and distribution D is the
probability that h will misclassify an instance
drawn at random according to D.
$\text{error}_D(h) \equiv \Pr_{x \in D}[f(x) \neq h(x)]$
4
Two definitions of error (2)
Sample error of hypothesis h with respect to
target function f and data sample S is the
proportion of examples h misclassifies
$\text{error}_S(h) \equiv \frac{1}{n}\sum_{x \in S}\delta(f(x) \neq h(x))$
where $\delta(f(x) \neq h(x))$ is 1 if $f(x) \neq h(x)$ and 0 otherwise.
5
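As a small illustration of this definition, here is a minimal Python sketch of the sample-error computation; the labels below are invented for the example:

    # Sample error: the fraction of examples in S that h misclassifies.
    def sample_error(f_values, h_values):
        # f_values: true labels f(x); h_values: predictions h(x)
        mistakes = sum(1 for f_x, h_x in zip(f_values, h_values) if f_x != h_x)
        return mistakes / len(f_values)

    # Toy example: h disagrees with f on 2 of 5 examples -> error_S(h) = 0.4
    print(sample_error([1, 0, 1, 1, 0], [1, 1, 1, 0, 0]))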
Two definitions of error (3)
How well does errorS(h) estimate errorD(h)?
6
Problems estimating error
1. Bias: If S is training set, errorS(h) is
optimistically biased
$\text{bias} \equiv E[\text{error}_S(h)] - \text{error}_D(h)$
For unbiased estimate, h and S must be
chosen independently.
2. Variance: Even with unbiased S, errorS(h)
may still vary from errorD(h).
7
Example
Hypothesis h misclassifies 12 of the 40
examples in S
$\text{error}_S(h) = \frac{12}{40} = 0.30$
What is errorD(h)?
8
Estimators
Experiment:
1. choose sample S of size n according to
distribution D
2. measure errorS(h)
errorS(h) is a random variable (i.e., result of
an experiment)
errorS(h) is an unbiased estimator for
errorD(h)
Given observed errorS(h) what can we
conclude about errorD(h)?
9
Confidence intervals
If
S contains n examples, drawn independently of h
and each other
n ≥ 30
Then
With approximately 95% probability, errorD(h) lies
in the interval
$\text{error}_S(h) \pm 1.96\sqrt{\frac{\text{error}_S(h)\,(1-\text{error}_S(h))}{n}}$
10
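As a sketch of how this interval can be computed, reusing the earlier example (12 errors on 40 examples); the function name is just for illustration:

    import math

    def error_confidence_interval(error_s, n, z=1.96):
        # Approximate interval for error_D(h), valid for n >= 30.
        margin = z * math.sqrt(error_s * (1 - error_s) / n)
        return error_s - margin, error_s + margin

    low, high = error_confidence_interval(12 / 40, 40)
    print(f"95% interval: ({low:.3f}, {high:.3f})")  # roughly (0.158, 0.442)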
Confidence intervals (2)
If
S contains n examples, drawn independently of h
and each other
n ≥ 30
Then
With approximately N% probability, errorD(h) lies in
the interval
$\text{error}_S(h) \pm z_N\sqrt{\frac{\text{error}_S(h)\,(1-\text{error}_S(h))}{n}}$
where
N%:   50%    68%    80%    90%    95%    98%    99%
z_N:  0.67   1.00   1.28   1.64   1.96   2.33   2.58
11
errorS(h) is a random variable
Rerun the experiment with different randomly
drawn S (of size n)
Probability of observing r misclassified
examples:
$P(r) = \frac{n!}{r!\,(n-r)!}\,\text{error}_D(h)^r\,(1-\text{error}_D(h))^{n-r}$
12
Binomial probability distribution
Probability P(r) of r heads in n coin flips,
if p = Pr(heads)
[Plot: Binomial distribution for n = 10 and p = 0.3]
$P(r) = \frac{n!}{r!\,(n-r)!}\,p^r\,(1-p)^{n-r}$
13
Binomial probability distribution (2)
Expected, or mean value of X, E[X], is
$E[X] = \sum_{i=0}^{n} i\,P(i) = np$
Variance of X is
$\text{Var}(X) = E[(X - E[X])^2] = np(1-p)$
Standard deviation of X, $\sigma_X$, is
$\sigma_X = \sqrt{E[(X - E[X])^2]} = \sqrt{np(1-p)}$
14
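A short sketch of these binomial quantities in Python (math.comb gives the binomial coefficient); n and p are taken from the plot caption above:

    import math

    n, p = 10, 0.3

    def binom_pmf(r, n, p):
        # P(r) = n! / (r!(n-r)!) * p^r * (1-p)^(n-r)
        return math.comb(n, r) * p**r * (1 - p)**(n - r)

    mean = n * p                      # E[X] = np
    std = math.sqrt(n * p * (1 - p))  # sigma_X = sqrt(np(1-p))
    print(binom_pmf(3, n, p), mean, std)  # ~0.267, 3.0, ~1.45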
Normal distribution approximates
binomial
errorS(h) follows a Binomial distribution, with
mean $\mu_{\text{error}_S(h)} = \text{error}_D(h)$
standard deviation $\sigma_{\text{error}_S(h)} = \sqrt{\frac{\text{error}_D(h)\,(1-\text{error}_D(h))}{n}}$
Approximate this by a Normal distribution with
mean $\mu_{\text{error}_S(h)} = \text{error}_D(h)$
standard deviation $\sigma_{\text{error}_S(h)} \approx \sqrt{\frac{\text{error}_D(h)\,(1-\text{error}_D(h))}{n}}$
15
Normal probability distribution
$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$
The probability that X will fall into the interval (a, b) is given by
$\int_a^b p(x)\,dx$
Expected, or mean value of X, E[X], is $E[X] = \mu$
Variance of X is $\text{Var}(X) = \sigma^2$
Standard deviation of X, $\sigma_X$, is $\sigma_X = \sigma$
16
Normal probability distribution
80% of the area (probability) lies in $\mu \pm 1.28\sigma$
N% of the area (probability) lies in $\mu \pm z_N\sigma$
N%:   50%    68%    80%    90%    95%    98%    99%
z_N:  0.67   1.00   1.28   1.64   1.96   2.33   2.58
17
Confidence intervals, more correctly
If
S contains n examples, drawn independently of h
and each other
n ≥ 30
Then
with approximately 95% probability, errorS(h) lies in the interval
$\text{error}_D(h) \pm 1.96\sqrt{\frac{\text{error}_D(h)\,(1-\text{error}_D(h))}{n}}$
and errorD(h) approximately lies in the interval
$\text{error}_S(h) \pm 1.96\sqrt{\frac{\text{error}_S(h)\,(1-\text{error}_S(h))}{n}}$
18
Central Limit Theorem
Consider a set of independent, identically distributed random variables $Y_1, \ldots, Y_n$, all governed by an arbitrary probability distribution with mean $\mu$ and finite variance $\sigma^2$. Define the sample mean
$\bar{Y} \equiv \frac{1}{n}\sum_{i=1}^{n} Y_i$
Central Limit Theorem: as $n \to \infty$, the distribution governing $\bar{Y}$ approaches a Normal distribution with mean $\mu$ and variance $\sigma^2/n$.
19
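A quick simulation sketch of the theorem (the uniform distribution and the sample sizes are arbitrary choices): the mean of many sample means is close to mu, and their variance is close to sigma^2/n, even though the underlying distribution is not Normal.

    import random
    import statistics

    random.seed(0)
    n, trials = 50, 2000
    # Y_i ~ Uniform(0, 1): mu = 0.5, sigma^2 = 1/12
    sample_means = [statistics.mean(random.random() for _ in range(n))
                    for _ in range(trials)]
    print(statistics.mean(sample_means))      # close to mu = 0.5
    print(statistics.variance(sample_means))  # close to sigma^2/n = 1/600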
Comparing Hypotheses:
Difference between hypotheses
Test h1 on sample S1, test h2 on S2
1. Pick parameter to estimate
$d \equiv \text{error}_D(h_1) - \text{error}_D(h_2)$
2. Choose an estimator
$\hat{d} \equiv \text{error}_{S_1}(h_1) - \text{error}_{S_2}(h_2)$
3. Determine the probability distribution that governs the estimator
$\sigma_{\hat{d}} \approx \sqrt{\frac{\text{error}_{S_1}(h_1)\,(1-\text{error}_{S_1}(h_1))}{n_1} + \frac{\text{error}_{S_2}(h_2)\,(1-\text{error}_{S_2}(h_2))}{n_2}}$
20
Difference between hypotheses (2)
4. Find interval (L, U) such that N% of
probability mass falls in the interval
$\hat{d} \pm z_N\sqrt{\frac{\text{error}_{S_1}(h_1)\,(1-\text{error}_{S_1}(h_1))}{n_1} + \frac{\text{error}_{S_2}(h_2)\,(1-\text{error}_{S_2}(h_2))}{n_2}}$
21
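A sketch of this interval in Python; the error rates and sample sizes below are invented for illustration:

    import math

    def diff_confidence_interval(e1, n1, e2, n2, z=1.96):
        # Approximate interval for error_D(h1) - error_D(h2).
        d_hat = e1 - e2
        sd = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
        return d_hat - z * sd, d_hat + z * sd

    # e.g. h1: 30% errors on 100 examples, h2: 20% errors on 120 examples
    print(diff_confidence_interval(0.30, 100, 0.20, 120))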
Paired t test to compare hA, hB
1. Partition data into k disjoint test sets
T1,T2,…,Tk of equal size, where this size is at
least 30.
2. For i from 1 to k, do
$\delta_i \equiv \text{error}_{T_i}(h_A) - \text{error}_{T_i}(h_B)$
3. Return the value $\bar{\delta}$, where
$\bar{\delta} \equiv \frac{1}{k}\sum_{i=1}^{k}\delta_i$
22
Paired t test to compare hA, hB (2)
N% confidence interval estimate for d:
$\bar{\delta} \pm t_{N,k-1}\,s_{\bar{\delta}}$, where
$s_{\bar{\delta}} \equiv \sqrt{\frac{1}{k(k-1)}\sum_{i=1}^{k}(\delta_i - \bar{\delta})^2}$
Note: the $\delta_i$ are approximately Normally distributed.
23
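A sketch of the paired-t interval computed from the per-fold differences; the delta values are invented, and the t value (2.262 for a two-sided 95% interval with k-1 = 9 degrees of freedom) would normally be read from a t table:

    import math

    def paired_t_interval(deltas, t_value):
        # Interval d_bar +/- t_{N,k-1} * s_dbar for the mean difference.
        k = len(deltas)
        d_bar = sum(deltas) / k
        s = math.sqrt(sum((d - d_bar) ** 2 for d in deltas) / (k * (k - 1)))
        return d_bar - t_value * s, d_bar + t_value * s

    deltas = [0.02, -0.01, 0.03, 0.00, 0.01, 0.02, -0.02, 0.01, 0.00, 0.02]
    print(paired_t_interval(deltas, t_value=2.262))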
Comparing learning algorithms:
LA and LB
What we’d like to estimate:
$E_{S \subset D}\left[\text{error}_D(L_A(S)) - \text{error}_D(L_B(S))\right]$
where L(S) is the hypothesis output by learner
L using training set S
I.e., the expected difference in true error
between hypotheses output by learners LA and
LB, when trained using randomly selected
training sets S drawn according to distribution
D.
24
Comparing learning algorithms LA
and LB (2)
But, given limited data D0, what is a good
estimator?
Could partition D0 into training set S0 and test set
T0, and measure
$\text{error}_{T_0}(L_A(S_0)) - \text{error}_{T_0}(L_B(S_0))$
Even better, repeat this many times and average
the results
25
Comparing learning algorithms LA
and LB (3): k-fold cross validation
1. Partition data D0 into k disjoint test sets
T1,T2,…,Tk of equal size, where this size is at
least 30.
2. For i from 1 to k, do
use Ti for the test set, and the remaining
data for training set Si
3. Return the average of the errors on the test
sets
26
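A minimal sketch of this procedure in plain Python; the "learners" are hypothetical callables that take a training set of (x, y) pairs and return a hypothesis h(x), and the dummy learners in the usage lines exist only to make the sketch runnable:

    import random

    def k_fold_error_diff(examples, learner_a, learner_b, k=10, seed=0):
        # Average of error_Ti(LA(Si)) - error_Ti(LB(Si)) over k disjoint test sets.
        data = examples[:]
        random.Random(seed).shuffle(data)
        folds = [data[i::k] for i in range(k)]              # k disjoint test sets
        diffs = []
        for i in range(k):
            test = folds[i]
            train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
            h_a, h_b = learner_a(train), learner_b(train)
            error = lambda h: sum(h(x) != y for x, y in test) / len(test)
            diffs.append(error(h_a) - error(h_b))
        return sum(diffs) / k

    # Dummy learners, just to show the calling convention:
    data = [(i, i % 2) for i in range(300)]                 # 300 examples, folds of 30
    always_zero = lambda train: (lambda x: 0)
    always_one = lambda train: (lambda x: 1)
    print(k_fold_error_diff(data, always_zero, always_one))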
Practical Aspects
A note on parameter tuning
It is important that the test data is not used in any
way to create the classifier
Some learning schemes operate in two stages:
Stage 1: builds the basic structure
Stage 2: optimizes parameter settings
The test data can’t be used for parameter tuning!
Proper procedure uses three sets: training data,
validation data, and test data
Validation data is used to optimize parameters
27
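A sketch of the three-set procedure; the 60/20/20 split and the train_and_score callback (train on the first set, score on the second, for a given parameter value) are assumptions made for the example:

    import random

    def tune_and_evaluate(data, train_and_score, param_grid, seed=0):
        # Pick the parameter value that scores best on the validation set,
        # then report the final score on the untouched test set.
        rng = random.Random(seed)
        data = data[:]
        rng.shuffle(data)
        n = len(data)
        train = data[:int(0.6 * n)]
        validation = data[int(0.6 * n):int(0.8 * n)]
        test = data[int(0.8 * n):]
        best = max(param_grid, key=lambda p: train_and_score(train, validation, p))
        return best, train_and_score(train, test, best)   # test data used only once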
Holdout estimation, stratification
What shall we do if the amount of data is limited?
The holdout method reserves a certain amount for
testing and uses the remainder for training
Usually: one third for testing, the rest for training
Problem: the samples might not be representative
Example: class might be missing in the test data
Advanced version uses stratification
Ensures that each class is represented with approximately
equal proportions in both subsets
28
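A sketch of a stratified holdout split on (x, label) pairs; the one-third test fraction follows the slide, everything else is an implementation choice:

    import random
    from collections import defaultdict

    def stratified_holdout(examples, test_fraction=1/3, seed=0):
        # Split so each class keeps roughly the same proportion in both subsets.
        rng = random.Random(seed)
        by_class = defaultdict(list)
        for x, label in examples:
            by_class[label].append((x, label))
        train, test = [], []
        for label, items in by_class.items():
            rng.shuffle(items)
            cut = int(len(items) * test_fraction)
            test.extend(items[:cut])
            train.extend(items[cut:])
        return train, test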
More on cross-validation
Standard method for evaluation: stratified ten-fold
cross-validation
Why ten? Extensive experiments have shown that this
is the best choice to get an accurate estimate
There is also some theoretical evidence for this
Stratification reduces the estimate’s variance
Even better: repeated stratified cross-validation
E.g. ten-fold cross-validation is repeated ten times
and results are averaged (reduces the variance)
29
Estimation of the accuracy of a
learning algorithm
10-fold cross validation gives a pessimistic estimate of
the accuracy of the hypothesis built on all training
data, provided that the law “the more training data
the better” holds.
For model selection 10-fold cross validation often
works fine.
Another method is leave-one-out, or jackknife (N-fold
cross validation with N = training set size).
The standard deviation is also essential for comparing
learning algorithms.
30
Performance Measures:
Issues in evaluation
Statistical reliability of estimated differences in
performance
Choice of performance measure
Number of correct classifications
Accuracy of probability estimates
Error in numeric predictions
Costs assigned to different types of errors
Many practical applications involve costs
31
Counting the costs
In practice, different types of classification
errors often incur different costs
Examples:
Predicting when cows are in heat (“in estrus”)
“Not in estrus” correct 97% of the time
Loan decisions
Oil-slick detection
Fault diagnosis
Promotional mailing
32
Taking costs into account
The confusion matrix:
                         predicted class
                         yes                no
actual class   yes       True positive      False negative
               no        False positive     True negative
There are many other types of costs!
E.g.: costs of collecting training data
33
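A small sketch of filling these four counts from paired actual/predicted labels; the "yes"/"no" encoding is just for the example:

    def confusion_counts(actual, predicted, positive="yes"):
        # Return (TP, FP, TN, FN) for a two-class problem.
        tp = fp = tn = fn = 0
        for a, p in zip(actual, predicted):
            if p == positive:
                tp += (a == positive)
                fp += (a != positive)
            else:
                tn += (a != positive)
                fn += (a == positive)
        return tp, fp, tn, fn

    print(confusion_counts(["yes", "no", "yes", "no"],
                           ["yes", "yes", "no", "no"]))   # -> (1, 1, 1, 1)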
Lift charts
In practice, costs are rarely known
Decisions are usually made by comparing
possible scenarios
Example: promotional mailout
Situation 1: classifier predicts that 0.1% of all
households will respond
Situation 2: classifier predicts that 0.4% of the
10000 most promising households will respond
A lift chart allows for a visual comparison
34
Generating a lift chart
Instances are sorted according to their predicted
probability of being a true positive:
Rank   Predicted probability   Actual class
1      0.95                    Yes
2      0.93                    Yes
3      0.93                    No
4      0.88                    Yes
…      …                       …
In lift chart, x axis is sample size and y axis is number
of true positives
35
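A sketch of how the lift-chart points could be derived from such a ranked list; the list below mirrors the table above with its actual classes:

    def lift_chart_points(ranked):
        # ranked: (predicted probability, actual class) pairs, highest probability first.
        # Returns (sample size, number of true positives) points for the chart.
        points, true_positives = [], 0
        for size, (prob, actual) in enumerate(ranked, start=1):
            true_positives += (actual == "Yes")
            points.append((size, true_positives))
        return points

    ranked = [(0.95, "Yes"), (0.93, "Yes"), (0.93, "No"), (0.88, "Yes")]
    print(lift_chart_points(ranked))   # [(1, 1), (2, 2), (3, 2), (4, 3)]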
A hypothetical lift chart
36
Probabilities, reliability
To generate a lift chart we need information indicating which classifications are more probably/reliably of the class of interest than others.
The Naïve Bayes classifier and also Nearest Neighbor classifiers can output such information.
If we fill our subset with the most probable examples first, the subset will contain a larger proportion of the desired elements.
37
Summary of measures
38
Varia:
Model selection criteria
Model selection criteria attempt to find a good
compromise between:
A. The complexity of a model
B. Its prediction accuracy on the training data
Reasoning: a good model is a simple model
that achieves high accuracy on the given data
Also known as Occam’s Razor: the best theory
is the smallest one that describes all the facts
39
Warning
Suppose you are gathering hypotheses that
have a probability of 95% to have an error
level below 10%
What if you have found 100 hypotheses
satisfying this condition?
Then the probability that all of them have an error below 10% is $(0.95)^{100} \approx 0.006$, i.e. about 0.6%. So the probability of having at least one hypothesis with an error above 10% is about 99.4%!
40
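A one-line check of the arithmetic above:

    p_all_below_10 = 0.95 ** 100
    print(p_all_below_10, 1 - p_all_below_10)   # about 0.0059 and 0.994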
No Free Lunch!!
Theorem (no free lunch)
For any two learning algorithms LA and LB the
following is true, independently of the sampling
distribution and the number of training instances n:
Uniformly averaged over all target functions F,
E(errorS(hA) | F, n) = E(errorS(hB) | F, n)
Idem for a fixed training set D
See: Mitchell ch.2
41
No Free Lunch (2)
Sketch of proof:
If all target functions are possible, then for every function F on which LA outperforms LB, a function F′ can be found on which LB outperforms LA.
Conclusion: without assumptions on the underlying functions, i.e. on the hypothesis space, no generalization is possible!
In order to generalize, machine learning algorithms
need some kind of bias:
inductive bias: not all possible functions are in the hypothesis
space (restriction bias) or not all possible functions will be
found because of the search strategy (preference or search
bias) (Mitchell, ch. 2, 3).
42
No free lunch (3)
The other way around: for any ML algorithm
there exist data sets on which it performs well
and there exist data sets on which it performs
badly!
We hope that the latter sets do not occur too
often in real life.
43
Summary of notions
• True error, sample error
• Bias of sample error
• Accuracy, confidence intervals
• Central limit theorem
• Paired t test
• k-fold cross validation, leave-one-out, holdout
• Stratification
• Training set, validation set, and test set
• Confusion matrix, TP, FP, TN, FN
• Lift chart, ROC curve, Recall-precision curve
• Occam’s razor
• No free lunch
• Inductive bias; restriction bias; search bias
44