Evaluation of Learning Models



Literature:
T. Mitchell, Machine Learning, chapter 5
I.H. Witten and E. Frank, Data Mining, chapter 5
Fayyad’s KDD Methodology
[Diagram: Fayyad's KDD process — data → (selection) → target data → (preprocessing, feature selection & cleaning) → processed data → (transformation) → transformed data → (data mining) → patterns → (interpretation / evaluation) → knowledge]
Contents

- Estimation of errors for one hypothesis
- Comparison of hypotheses
- Comparison of learning models
- Practical aspects
Two definitions of error

True error of hypothesis h with respect to
target function f and distribution D is the
probability that h will misclassify an instance
drawn at random according to D.
$$\mathrm{error}_D(h) \equiv \Pr_{x \in D}[f(x) \neq h(x)]$$
Two definitions of error (2)

Sample error of hypothesis h with respect to
target function f and data sample S is the
proportion of examples h misclassifies
$$\mathrm{error}_S(h) \equiv \frac{1}{n}\sum_{x \in S} \delta(f(x) \neq h(x))$$
where $\delta(f(x) \neq h(x))$ is 1 if $f(x) \neq h(x)$ and 0 otherwise.
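As a concrete illustration (my own sketch, not from the slides), the sample error can be computed directly from a hypothesis's predictions and the true labels:

```python
def sample_error(predictions, labels):
    """Sample error of h: the fraction of examples on which h(x) != f(x)."""
    n = len(labels)
    return sum(1 for h_x, f_x in zip(predictions, labels) if h_x != f_x) / n

# e.g. sample_error([1, 0, 1, 1], [1, 1, 1, 0]) returns 0.5
```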
Two definitions of error (3)

How well does errorS(h) estimate errorD(h)?
Problems estimating error

1. Bias: If S is training set, errorS(h) is
optimistically biased
$$\mathrm{bias} \equiv E[\mathrm{error}_S(h)] - \mathrm{error}_D(h)$$

For an unbiased estimate, h and S must be chosen independently.

2. Variance: Even with an unbiased S, errorS(h) may still vary from errorD(h).
Example

Hypothesis h misclassifies 12 of the 40
examples in S
$$\mathrm{error}_S(h) = \frac{12}{40} = 0.30$$
What is errorD(h)?
Estimators

Experiment:
1. Choose sample S of size n according to distribution D
2. Measure errorS(h)

errorS(h) is a random variable (i.e., the result of an experiment).
errorS(h) is an unbiased estimator for errorD(h).
Given the observed errorS(h), what can we conclude about errorD(h)?
Confidence intervals

If
- S contains n examples, drawn independently of h and of each other
- n ≥ 30

Then
- with approximately 95% probability, errorD(h) lies in the interval
$$\mathrm{error}_S(h) \pm 1.96\sqrt{\frac{\mathrm{error}_S(h)(1-\mathrm{error}_S(h))}{n}}$$
Confidence intervals (2)

If
- S contains n examples, drawn independently of h and of each other
- n ≥ 30

Then
- with approximately N% probability, errorD(h) lies in the interval
$$\mathrm{error}_S(h) \pm z_N\sqrt{\frac{\mathrm{error}_S(h)(1-\mathrm{error}_S(h))}{n}}$$
where

N%:  50%   68%   80%   90%   95%   98%   99%
zN:  0.67  1.00  1.28  1.64  1.96  2.33  2.58
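To make this concrete, a small Python sketch (my own illustration) computes the interval for the earlier example of 12 misclassifications out of 40 examples:

```python
import math

def error_confidence_interval(error_s, n, z_n=1.96):
    """Approximate N% confidence interval for errorD(h), given errorS(h) on n examples."""
    half_width = z_n * math.sqrt(error_s * (1 - error_s) / n)
    return error_s - half_width, error_s + half_width

# 12 of 40 examples misclassified, 95% confidence (zN = 1.96)
low, high = error_confidence_interval(0.30, 40)
print(f"95% interval: ({low:.3f}, {high:.3f})")  # roughly (0.158, 0.442)
```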
errorS(h) is a random variable


Rerun the experiment with different randomly
drawn S (of size n)
Probability of observing r misclassified
examples:
$$P(r) = \frac{n!}{r!(n-r)!}\,\mathrm{error}_D(h)^r\,(1 - \mathrm{error}_D(h))^{n-r}$$
Binomial probability distribution

Probability P(r) of r heads in n coin flips, if p =
Pr(heads)
[Plot: Binomial distribution for n = 10 and p = 0.3]
$$P(r) = \frac{n!}{r!(n-r)!}\,p^r (1 - p)^{n-r}$$
(with p = errorD(h) in our setting)
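A short Python check of this formula, using only the standard library (my own illustration):

```python
from math import comb

def binomial_pmf(r, n, p):
    """Probability of exactly r misclassifications ('heads') in n trials."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

# The distribution plotted on the slide: n = 10, p = 0.3
for r in range(11):
    print(r, round(binomial_pmf(r, 10, 0.3), 4))
```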
Binomial probability distribution (2)

Expected, or mean value of X, E[X], is
$$E[X] = \sum_{i=0}^{n} i\,P(i) = np$$

Variance of X is
$$\mathrm{Var}(X) = E[(X - E[X])^2] = np(1-p)$$

Standard deviation of X, σX, is
$$\sigma_X = \sqrt{E[(X - E[X])^2]} = \sqrt{np(1-p)}$$
Normal distribution approximates
binomial

errorS(h) follows a Binomial distribution, with
- mean $\mu_{\mathrm{error}_S(h)} = \mathrm{error}_D(h)$
- standard deviation
$$\sigma_{\mathrm{error}_S(h)} = \sqrt{\frac{\mathrm{error}_D(h)(1-\mathrm{error}_D(h))}{n}}$$

Approximate this by a Normal distribution with the same mean and standard deviation.
Normal probability distribution
$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$

The probability that X will fall into the interval (a, b) is given by
$$\int_a^b p(x)\,dx$$

Expected, or mean value of X: E[X] = μ
Variance of X: Var(X) = σ²
Standard deviation of X: σX = σ
Normal probability distribution


80% of the area (probability) lies in μ ± 1.28σ
N% of the area (probability) lies in μ ± zNσ

N%:  50%   68%   80%   90%   95%   98%   99%
zN:  0.67  1.00  1.28  1.64  1.96  2.33  2.58
Confidence intervals, more correctly

If
- S contains n examples, drawn independently of h and of each other
- n ≥ 30

Then
- with approximately 95% probability, errorS(h) lies in the interval
$$\mathrm{error}_D(h) \pm 1.96\sqrt{\frac{\mathrm{error}_D(h)(1-\mathrm{error}_D(h))}{n}}$$
- and errorD(h) approximately lies in the interval
$$\mathrm{error}_S(h) \pm 1.96\sqrt{\frac{\mathrm{error}_S(h)(1-\mathrm{error}_S(h))}{n}}$$
Central Limit Theorem

Consider a set of independent, identically distributed random variables Y1...Yn, all governed by an arbitrary probability distribution with mean μ and finite variance σ². Define the sample mean
$$\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$$

Central Limit Theorem. As n → ∞, the distribution governing Ȳ approaches a Normal distribution with mean μ and variance σ²/n.
Difference between hypotheses


Test h1 on sample S1, test h2 on S2
1. Pick the parameter to estimate
$$d \equiv \mathrm{error}_D(h_1) - \mathrm{error}_D(h_2)$$

2. Choose an estimator
$$\hat{d} \equiv \mathrm{error}_{S_1}(h_1) - \mathrm{error}_{S_2}(h_2)$$

3. Determine the probability distribution that governs the estimator
$$\sigma_{\hat{d}} \approx \sqrt{\frac{\mathrm{error}_{S_1}(h_1)(1-\mathrm{error}_{S_1}(h_1))}{n_1} + \frac{\mathrm{error}_{S_2}(h_2)(1-\mathrm{error}_{S_2}(h_2))}{n_2}}$$
Difference between hypotheses (2)

4. Find the interval (L, U) such that N% of the probability mass falls in the interval:
$$\hat{d} \pm z_N\sqrt{\frac{\mathrm{error}_{S_1}(h_1)(1-\mathrm{error}_{S_1}(h_1))}{n_1} + \frac{\mathrm{error}_{S_2}(h_2)(1-\mathrm{error}_{S_2}(h_2))}{n_2}}$$
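A minimal Python sketch of steps 1 to 4 (my own illustration; the error rates and sample sizes in the example call are hypothetical):

```python
import math

def difference_interval(err1, n1, err2, n2, z_n=1.96):
    """N% confidence interval for d = errorD(h1) - errorD(h2)."""
    d_hat = err1 - err2
    sigma = math.sqrt(err1 * (1 - err1) / n1 + err2 * (1 - err2) / n2)
    return d_hat - z_n * sigma, d_hat + z_n * sigma

# Hypothetical results: h1 errs on 30% of 100 test examples, h2 on 20% of 120
print(difference_interval(0.30, 100, 0.20, 120))
```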
Paired t test to compare hA, hB


1. Partition data into k disjoint test sets
T1,T2,…,Tk of equal size, where this size is at
least 30.
2. For i from 1 to k, do
$$\delta_i \equiv \mathrm{error}_{T_i}(h_A) - \mathrm{error}_{T_i}(h_B)$$

3. Return the value $\bar{\delta}$, where
$$\bar{\delta} \equiv \frac{1}{k}\sum_{i=1}^{k}\delta_i$$
Paired t test to compare hA, hB (2)

N% confidence interval estimate for d:
$$\bar{\delta} \pm t_{N,k-1}\, s_{\bar{\delta}}, \qquad s_{\bar{\delta}} \equiv \sqrt{\frac{1}{k(k-1)}\sum_{i=1}^{k}(\delta_i - \bar{\delta})^2}$$

Note that δi is approximately normally distributed.
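A rough Python sketch of this paired t test (my own illustration; it assumes scipy is available for the t quantile, and the per-fold error rates in the example call are hypothetical):

```python
import math
from scipy import stats  # assumed available, used only for the t quantile

def paired_t_interval(errors_a, errors_b, confidence=0.95):
    """Confidence interval for the mean error difference of hA and hB over k paired test sets."""
    k = len(errors_a)
    deltas = [ea - eb for ea, eb in zip(errors_a, errors_b)]
    delta_bar = sum(deltas) / k
    s = math.sqrt(sum((d - delta_bar) ** 2 for d in deltas) / (k * (k - 1)))
    t = stats.t.ppf((1 + confidence) / 2, df=k - 1)
    return delta_bar - t * s, delta_bar + t * s

# Hypothetical error rates of hA and hB on k = 5 disjoint test sets
print(paired_t_interval([0.12, 0.15, 0.10, 0.14, 0.13],
                        [0.10, 0.13, 0.11, 0.12, 0.11]))
```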
Comparing learning algorithms LA
and LB

What we’d like to estimate:
ES D [errorD ( LA (S ))  errorD ( LB (S ))]

where L(S) is the hypothesis output by learner
L using training set S
I.e., the expected difference in true error
between hypotheses output by learners LA and
LB, when trained using randomly selected
training sets S drawn according to distribution
D.
Comparing learning algorithms LA
and LB (2)

But, given limited data D0, what is a good
estimator?
 Could partition D0 into training set S0 and test set
T0, and measure
$$\mathrm{error}_{T_0}(L_A(S_0)) - \mathrm{error}_{T_0}(L_B(S_0))$$
 Even better, repeat this many times and average
the results
Comparing learning algorithms LA
and LB (3): k-fold cross validation



1. Partition data D0 into k disjoint test sets
T1,T2,…,Tk of equal size, where this size is at
least 30.
2. For i from 1 to k, use Ti as the test set and the remaining data as the training set Si
3. Return the average over the k folds of the errors measured on the test sets (see the sketch below)
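A minimal Python sketch of this procedure for comparing two learners (my own illustration; it assumes scikit-learn-style estimators with fit/predict and uses sklearn.model_selection.KFold):

```python
import numpy as np
from sklearn.model_selection import KFold

def compare_learners_cv(learner_a, learner_b, X, y, k=10):
    """k-fold cross-validation estimate of the error difference between two learners.
    X and y are numpy arrays; learner_a and learner_b follow the fit/predict convention."""
    deltas = []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        X_train, y_train = X[train_idx], y[train_idx]
        X_test, y_test = X[test_idx], y[test_idx]
        err_a = np.mean(learner_a.fit(X_train, y_train).predict(X_test) != y_test)
        err_b = np.mean(learner_b.fit(X_train, y_train).predict(X_test) != y_test)
        deltas.append(err_a - err_b)
    return float(np.mean(deltas))
```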
Practical Aspects
A note on parameter tuning


It is important that the test data is not used in any
way to create the classifier
Some learning schemes operate in two stages:
 Stage 1: builds the basic structure
 Stage 2: optimizes parameter settings




The test data can’t be used for parameter tuning!
Proper procedure uses three sets: training data,
validation data, and test data
Validation data is used to optimize parameters
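For illustration, a minimal sketch of such a three-way split (my own addition; the 60/20/20 proportions and the use of sklearn.model_selection.train_test_split are assumptions, not from the slides):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.random.rand(100, 5), np.random.randint(0, 2, 100)  # placeholder data

# Set aside the test data first; it is never used for building or tuning the classifier.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Split the remainder into training data and validation data (used to optimize parameters).
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)
# 0.25 of the remaining 80% gives a 60/20/20 train/validation/test split overall.
```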
Holdout estimation, stratification



What shall we do if the amount of data is limited?
The holdout method reserves a certain amount for
testing and uses the remainder for training
 Usually: one third for testing, the rest for training

Problem: the samples might not be representative
 Example: class might be missing in the test data

Advanced version uses stratification
 Ensures that each class is represented with approximately
equal proportions in both subsets
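A small sketch of a stratified holdout split (my own illustration, assuming scikit-learn is available):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.random.rand(90, 4), np.random.randint(0, 3, 90)  # placeholder data, 3 classes

# Holdout: one third for testing, the rest for training; stratify=y keeps the class
# proportions approximately equal in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)
```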
More on cross-validation



Standard method for evaluation: stratified ten-fold
cross-validation
Why ten? Extensive experiments have shown that this
is the best choice to get an accurate estimate
 There is also some theoretical evidence for this


Stratification reduces the estimate’s variance
Even better: repeated stratified cross-validation
- E.g. ten-fold cross-validation is repeated ten times and the results are averaged (this reduces the variance); see the sketch below
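A minimal sketch of repeated stratified ten-fold cross-validation (my own illustration, assuming scikit-learn's RepeatedStratifiedKFold):

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

X, y = np.random.rand(200, 4), np.random.randint(0, 2, 200)  # placeholder data

# Stratified ten-fold cross-validation repeated ten times; the error estimates from
# the 100 train/test splits are averaged to reduce the variance of the estimate.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
for train_idx, test_idx in cv.split(X, y):
    pass  # train on X[train_idx], y[train_idx]; evaluate on X[test_idx], y[test_idx]
```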
Issues in evaluation


Statistical reliability of estimated differences in
performance
Choice of performance measure
 Number of correct classifications
 Accuracy of probability estimates
 Error in numeric predictions

Costs assigned to different types of errors
 Many practical applications involve costs
Counting the costs


In practice, different types of classification
errors often incur different costs
Examples:
 Predicting when cows are in heat (“in estrus”)
 “Not in estrus” correct 97% of the time
 Loan decisions
 Oil-slick detection
 Fault diagnosis
 Promotional mailing
Taking costs into account

The confusion matrix:
                          predicted class
                          yes               no
actual class    yes       True positive     False negative
                no        False positive    True negative
There are many other types of costs!
 E.g.: costs of collecting training data
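A small Python sketch of cost-sensitive evaluation from the confusion matrix counts (my own illustration; the labels and cost values are hypothetical):

```python
def confusion_counts(y_true, y_pred, positive="yes"):
    """Counts of true/false positives and negatives for a two-class problem."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fn, fp, tn

# Hypothetical costs: a false negative (e.g. a missed "in estrus" cow) costs 10,
# a false positive costs 1; correct predictions cost nothing.
tp, fn, fp, tn = confusion_counts(["yes", "no", "yes", "no"], ["yes", "yes", "no", "no"])
print(tp, fn, fp, tn, 10 * fn + 1 * fp)
```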
Lift charts



In practice, costs are rarely known
Decisions are usually made by comparing
possible scenarios
Example: promotional mailout
 Situation 1: classifier predicts that 0.1% of all
households will respond
 Situation 2: classifier predicts that 0.4% of the
10000 most promising households will respond

A lift chart allows for a visual comparison
Generating a lift chart

Instances are sorted according to their predicted
probability of being a true positive:
Rank   Predicted probability   Actual class
1      0.95                    Yes
2      0.93                    Yes
3      0.93                    No
4      0.88                    Yes
…      …                       …

In a lift chart, the x axis is the sample size and the y axis is the number of true positives.
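A short Python sketch of how the lift chart points can be computed (my own illustration, using the hypothetical values from the table above):

```python
def lift_chart_points(predicted_probs, actual_classes, positive="Yes"):
    """Return (sample size, cumulative true positives) pairs for a lift chart."""
    ranked = sorted(zip(predicted_probs, actual_classes),
                    key=lambda pair: pair[0], reverse=True)
    points, true_positives = [], 0
    for size, (_, actual) in enumerate(ranked, start=1):
        if actual == positive:
            true_positives += 1
        points.append((size, true_positives))
    return points

print(lift_chart_points([0.95, 0.93, 0.93, 0.88], ["Yes", "Yes", "No", "Yes"]))
# [(1, 1), (2, 2), (3, 2), (4, 3)]
```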
A hypothetical lift chart
Summary of measures
Model selection criteria



Model selection criteria attempt to find a good
compromise between:
A. The complexity of a model
B. Its prediction accuracy on the training data
Reasoning: a good model is a simple model
that achieves high accuracy on the given data
Also known as Occam’s Razor: the best theory
is the smallest one that describes all the facts
Warning



Suppose you are gathering hypotheses that each have a probability of 95% of having an error below 10%.
What if you have found 100 hypotheses satisfying this condition?
Then the probability that all of them have an error below 10% is equal to (0.95)^100 ≈ 0.006, i.e. about 0.6%. So the probability of having at least one hypothesis with an error above 10% is about 99.4%!
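A quick check of this calculation (my own addition):

```python
p_all_good = 0.95 ** 100
print(p_all_good, 1 - p_all_good)  # ≈ 0.0059 and ≈ 0.9941
```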