Machine Learning
Machine Learning: Lecture 5
Experimental Evaluation of
Learning Algorithms
(Based on Chapter 5 of Mitchell, T.,
Machine Learning, 1997)
1
Motivation
Evaluating the performance of learning
systems is important because:
Learning systems are usually designed to
predict the class of “future” unlabeled
data points.
In some cases, evaluating hypotheses is
an integral part of the learning process
(for example, when deciding whether to prune a decision tree).
2
Difficulties in Evaluating Hypotheses
when only limited data are available
Bias in the estimate: The observed accuracy of the
learned hypothesis over the training examples is a
poor estimator of its accuracy over future examples
==> we test the hypothesis on a test set chosen
independently of the training set and the hypothesis.
Variance in the estimate: Even with a separate test
set, the measured accuracy can vary from the true
accuracy, depending on the makeup of the particular
set of test examples. The smaller the test set, the
greater the expected variance.
3
Questions Considered
Given the observed accuracy of a hypothesis
over a limited sample of data, how well does
this estimate its accuracy over additional
examples?
Given that one hypothesis outperforms another
over some sample data, how probable is it that
this hypothesis is more accurate, in general?
When data is limited, what is the best way to use
it both to learn a hypothesis and to estimate
its accuracy?
4
Variance in Test Accuracy
Let errorS(h) denote the percentage of examples in an
independently sampled test set S of size n that are incorrectly
classified by hypothesis h.
Let errorD(h) denote the true error rate for the overall data
distribution D.
When n is at least 30, the central limit theorem ensures that
the distribution of errorS(h) for different random samples will
be closely approximated by a normal (Gaussian) distribution.
[Figure: bell-shaped curve of P(errorS(h)) versus errorS(h), centered near errorD(h); a simulation sketch follows below.]
5
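To make the figure concrete, here is a minimal Python sketch (my addition, not from the slides) that assumes a true error errorD(h) = 0.30 and simulates the sample error errorS(h) over many independently drawn test sets of size n:

import numpy as np

rng = np.random.default_rng(0)

true_error = 0.30    # assumed errorD(h) for this simulation
n = 50               # test-set size
num_trials = 10_000  # number of independently drawn test sets

# Each test set yields r misclassified examples out of n, so errorS(h) = r / n.
r = rng.binomial(n, true_error, size=num_trials)
sample_errors = r / n

print("mean of errorS(h):", sample_errors.mean())   # close to 0.30
print("std of errorS(h): ", sample_errors.std())    # close to sqrt(p(1-p)/n)
print("theoretical std:  ", np.sqrt(true_error * (1 - true_error) / n))

A histogram of sample_errors is approximately the bell curve sketched above.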
Comparing Two Learned Hypotheses
When evaluating two hypotheses, their observed
ordering with respect to accuracy may or may not
reflect the ordering of their true accuracies.
Assume h1 is tested on test set S1 of size n1
Assume h2 is tested on test set S2 of size n2
[Figure: two error distributions, P(errorS(h)) versus errorS(h), with the observed errorS1(h1) to the left of errorS2(h2): we observe h1 more accurate than h2.]
6
[Figure: the same two distributions, but with the observed errorS1(h1) to the right of errorS2(h2): we observe h1 less accurate than h2.]
7
Estimating Hypothesis Accuracy
Two Questions of Interest:
Given a hypothesis h and a data sample
containing n examples drawn at random
according to distribution D, what is the best
estimate of the accuracy of h over future
instances drawn from the same distribution?
==> sample vs. true error
What is the probable error in this accuracy
estimate? ==> confidence intervals
8
Sample Error and True Error
Definition 1: The sample error (denoted errors(h)) of
hypothesis h with respect to target function f and data
sample S is:
errorS(h) = (1/n) Σ_{x∈S} δ(f(x), h(x))
where n is the number of examples in S, and the quantity
δ(f(x), h(x)) is 1 if f(x) ≠ h(x), and 0 otherwise.
Definition 2: The true error (denoted errorD(h)) of
hypothesis h with respect to target function f and
distribution D, is the probability that h will misclassify
an instance drawn at random according to D.
errorD(h) = Pr_{x∼D}[ f(x) ≠ h(x) ]
9
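Definition 1 translates directly into a few lines of Python; the label arrays below are made up for illustration:

import numpy as np

def sample_error(f_values, h_values):
    """errorS(h) = (1/n) * sum over x in S of delta(f(x), h(x)),
    where delta is 1 when f(x) != h(x) and 0 otherwise."""
    f_values = np.asarray(f_values)
    h_values = np.asarray(h_values)
    return np.mean(f_values != h_values)

# Hypothetical target values f(x) and predictions h(x) on a test set S:
f_vals = [1, 0, 1, 1, 0, 1, 0, 0]
h_vals = [1, 0, 0, 1, 0, 1, 1, 0]
print(sample_error(f_vals, h_vals))  # 2 mistakes out of 8 examples -> 0.25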
Confidence Intervals for
Discrete-Valued Hypotheses
The general expression for approximate N%
confidence intervals for errorD(h) is:
errorS(h) ± zN · √( errorS(h)(1 - errorS(h)) / n )
where zN is given in [Mitchell, Table 5.1].
This approximation is quite good when
n · errorS(h)(1 - errorS(h)) ≥ 5
10
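A small sketch of this interval in Python (my illustration; it assumes scipy is available for the zN constant, and the numbers are invented):

import math
from scipy.stats import norm

def error_confidence_interval(error_s, n, confidence=0.95):
    """Two-sided N% interval errorS(h) +/- zN * sqrt(errorS(h)(1 - errorS(h)) / n)."""
    z = norm.ppf(0.5 + confidence / 2)            # e.g. about 1.96 for 95%
    sigma = math.sqrt(error_s * (1 - error_s) / n)
    return error_s - z * sigma, error_s + z * sigma

# Illustrative example: 12 errors on a test set of 40 examples.
err, n = 12 / 40, 40
print(n * err * (1 - err) >= 5)                   # rule-of-thumb check above
print(error_confidence_interval(err, n, 0.95))    # roughly (0.16, 0.44)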
Mean and Variance
Definition 1: Consider a random variable Y
that takes on possible values y1, …, yn. The
expected value (or mean value) of Y, E[Y],
is:
E[Y] = Σ_{i=1..n} yi · Pr(Y = yi)
Definition 2: The variance of a random
variable Y, Var[Y], is:
Var[Y] = E[(Y - E[Y])²]
Definition 3: The standard deviation of a
random variable Y is the square root of the
variance.
11
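As a quick check of these definitions (my addition, not on the slide), take Y to be a Bernoulli variable that equals 1 with probability p and 0 otherwise; this is exactly the "h misclassifies one random instance" variable used on the next slide:

E[Y]   = 0 · (1 - p) + 1 · p = p
Var[Y] = E[(Y - p)²] = (0 - p)² · (1 - p) + (1 - p)² · p = p(1 - p)

Averaging n such independent variables gives errorS(h), whose variance is p(1 - p)/n; its square root is the standard deviation that appears on the next slide.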
Estimators, Bias and Variance
Since errorS(h) (an estimator for the true error) obeys a
Binomial distribution (See, [Mitchell, Section 5.3]), we
have: errorS(h) = r/n and errorD(h) = p
where n is the number of instances in the sample S, r is the
number of instances from S misclassified by h, and p is the
probability of misclassifying a single instance drawn from D.
Definition: The estimation bias (distinct from the inductive
bias) of an estimator Y for an arbitrary parameter p is
E[Y] - p
Y is an unbiased estimator if E[Y] = p.
The standard deviation of errorS(h) is given by
σ_errorS(h) = √( p(1 - p) / n ) ≈ √( errorS(h)(1 - errorS(h)) / n )
12
Confidence Intervals
An N% confidence interval for some parameter p
is an interval that is expected, with probability
N%, to contain p.
Examples: pages 132 and 138 of [Mitchell].
The Binomial distribution is approximated by a
Normal distribution (central limit theorem) when
n · errorS(h)(1 - errorS(h)) ≥ 5
13
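The zN constants of [Mitchell, Table 5.1] can be reproduced with scipy (a sketch, assuming two-sided intervals):

from scipy.stats import norm

for pct in (50, 68, 80, 90, 95, 98, 99):
    z = norm.ppf(0.5 + pct / 200)      # two-sided N% confidence -> zN
    print(f"{pct}% -> zN = {z:.2f}")
# Approximately 0.67, 0.99, 1.28, 1.64, 1.96, 2.33, 2.58 (matching Table 5.1 up to rounding).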
Two-sided and One-sided Bounds
One-sided bound: what is the probability that errorD(h) is at most U?
A 100(1 - α)% confidence interval for two-sided bounds yields
100(1 - α/2)% confidence intervals for the
one-sided bounds (-∞, U] and [L, +∞).
Example: page 141 of [Mitchell]; a small worked example also follows below.
14
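A small worked example with hypothetical numbers (not the book's page-141 example): suppose errorS(h) = 0.30 on a test set of n = 40 examples, so the estimated standard deviation is √(0.3 · 0.7 / 40) ≈ 0.072. Then:

Two-sided 90% interval:  0.30 ± 1.64 · 0.072 ≈ [0.18, 0.42]
One-sided 95% bound:     errorD(h) ≤ 0.42  (the remaining 5% risk lies entirely in one tail)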
Difference in Error of two
Hypotheses
Let h1 and h2 be two hypotheses for some discrete-valued target
function. h1 has been tested on a sample S1 containing n1 (>=30)
randomly drawn examples, and h2 has been tested on an
independent sample S2 containing n2 (>=30) examples drawn
from the same distribution.
Let’s estimate the difference between the true errors of these two
hypotheses, d, by computing the difference between the sample
errors: d̂ = errorS1(h1) - errorS2(h2) (or d̂ = errorS(h1) - errorS(h2) if a single test set S is used).
The approximate N% confidence interval for d is:
d̂ ± zN · √( errorS1(h1)(1 - errorS1(h1))/n1 + errorS2(h2)(1 - errorS2(h2))/n2 )
15
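A sketch of this interval in Python (my illustration, assuming scipy for zN):

import math
from scipy.stats import norm

def error_difference_ci(e1, n1, e2, n2, confidence=0.95):
    """Approximate N% confidence interval for d = errorD(h1) - errorD(h2),
    using sample errors from two independent test sets S1 and S2."""
    d_hat = e1 - e2
    sigma = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    z = norm.ppf(0.5 + confidence / 2)
    return d_hat - z * sigma, d_hat + z * sigma

# Hypothetical sample errors on two independent test sets of 100 examples each:
print(error_difference_ci(0.20, 100, 0.30, 100, confidence=0.90))
# roughly (-0.20, 0.00): at 90% confidence the interval just reaches zero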
Sample Z-Score Test 1
Assume we test two hypotheses on different test sets of size
100 and observe:
errorS1(h1) = 0.20    errorS2(h2) = 0.30
d̂ = | errorS1(h1) - errorS2(h2) | = | 0.20 - 0.30 | = 0.10
σ_d̂ = √( errorS1(h1)(1 - errorS1(h1))/n1 + errorS2(h2)(1 - errorS2(h2))/n2 )
    = √( 0.2(1 - 0.2)/100 + 0.3(1 - 0.3)/100 ) ≈ 0.0608
z = d̂ / σ_d̂ = 0.1 / 0.0608 ≈ 1.644
Confidence for two-tailed test: 90%
Confidence for one-tailed test: 100 - (100 - 90)/2 = 95%
So with about 95% confidence, h1 has better true accuracy than h2.
16
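The same computation in Python, reproducing the numbers above and converting z into confidence levels (assuming scipy for the normal CDF):

import math
from scipy.stats import norm

e1, n1 = 0.20, 100   # errorS1(h1)
e2, n2 = 0.30, 100   # errorS2(h2)

d_hat = abs(e1 - e2)
sigma = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
z = d_hat / sigma
print(sigma, z)              # about 0.0608 and 1.644

print(2 * norm.cdf(z) - 1)   # two-tailed confidence, about 0.90
print(norm.cdf(z))           # one-tailed confidence, about 0.95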
Sample Z-Score Test 2
Assume we test two hypotheses on different test sets of size
100 and observe:
errorS1(h1) = 0.20    errorS2(h2) = 0.25
d̂ = | errorS1(h1) - errorS2(h2) | = | 0.20 - 0.25 | = 0.05
σ_d̂ = √( 0.2(1 - 0.2)/100 + 0.25(1 - 0.25)/100 ) ≈ 0.0589
z = d̂ / σ_d̂ = 0.05 / 0.0589 ≈ 0.848
Confidence for two-tailed test: 50% (rounding z down to the nearest entry in [Mitchell, Table 5.1])
Confidence for one-tailed test: 100 - (100 - 50)/2 = 75%
So this sample gives little evidence that either hypothesis is more accurate than the other.
17
Hypothesis Testing
What is Pr[errorD(h1) > errorD(h2)]?
Test-set size = 100, errorS1(h1) = 0.30, errorS2(h2) = 0.20
Observed difference d̂ = 0.10, with σ_d̂ ≈ 0.0608
Pr[errorD(h1) > errorD(h2)] = Pr[d > 0]
• Hypothesis H: errorD(h1) > errorD(h2)
– Null hypothesis: the opposite, errorD(h1) ≤ errorD(h2)
z = d̂ / σ_d̂ = 0.1 / 0.0608 ≈ 1.644
Confidence for two-tailed test: 90%
Confidence for one-tailed test: 100 - (100 - 90)/2 = 95%
So we reject the null hypothesis with about 95% confidence: errorD(h1) > errorD(h2), i.e. h2 is the more accurate hypothesis here.
18
Comparing Learning Algorithms
Which of LA and LB is the better learning method on
average for learning some particular target function f ?
To answer this question, we wish to estimate the
expected value of the difference in their errors:
E_{S⊂D}[ errorD(LA(S)) - errorD(LB(S)) ]
Of course, since we have only a limited sample D0, we
estimate this quantity by dividing D0 into a training set
S0 and a test set T0 and measuring:
errorT0(LA(S0)) - errorT0(LB(S0))
Problem: We are only measuring the difference in errors
for one training set S0 rather than the expected value of
this difference over all samples S drawn from D
19
Solution: k-fold Cross-Validation
k-Fold Cross-Validation
1. Partition the available data D0 into k disjoint
subsets T1, T2, …, Tk of equal size, where this size
is at least 30.
2. For i from 1 to k, do
use Ti for the test set, and the remaining data for
training set Si
• Si <- {D0 - Ti}
• hA <- LA(Si)
• hB <- LB(Si)
• δi <- errorTi(hA) - errorTi(hB)
3. Return the value avg(δ), where
avg(δ) = (1/k) Σ_{i=1..k} δi
20
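A minimal Python sketch of this procedure (my own illustration; learner_a and learner_b are hypothetical functions that take training data and return a predictor):

import numpy as np

def k_fold_compare(X, y, learner_a, learner_b, k=10, seed=0):
    """Return the per-fold differences delta_i = errorTi(hA) - errorTi(hB)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)   # T1, ..., Tk: disjoint, roughly equal-sized
    deltas = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])  # Si = D0 - Ti
        h_a = learner_a(X[train], y[train])   # hA <- LA(Si)
        h_b = learner_b(X[train], y[train])   # hB <- LB(Si)
        err_a = np.mean(h_a(X[test]) != y[test])
        err_b = np.mean(h_b(X[test]) != y[test])
        deltas.append(err_a - err_b)
    return np.array(deltas)   # avg(deltas) estimates the expected error difference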
K-Fold Cross Validation
Comments
Every example gets used as a test example once
and as a training example k–1 times.
All test sets are independent; however, training
sets overlap significantly.
Measures accuracy of hypothesis generated for
[(k–1)/k]|D| training examples.
Standard method is 10-fold.
If k is low, there are too few train/test trials; if k is
high, each test set is small, so the test variance is high,
and run time increases.
If k = |D|, the method is called leave-one-out cross-validation.
21
Significance Testing
Typically k < 30, so there are not enough trials for a z test.
Can use (Student's) t-test, which is more accurate
when the number of trials is low.
Can use a paired t-test, which can detect
smaller differences as significant when the
train/test splits are the same for both systems.
However, both the z and t tests assume the trials are
independent. This is not quite true for k-fold cross-validation:
Test sets are independent
Training sets are not independent
22
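Given the per-fold results from the cross-validation above, a paired t-test can be run directly on them; a sketch assuming scipy, with invented fold errors:

from scipy.stats import ttest_rel

# Hypothetical per-fold test errors of systems A and B on the same k = 10 splits:
errors_a = [0.21, 0.25, 0.19, 0.22, 0.24, 0.20, 0.23, 0.26, 0.22, 0.21]
errors_b = [0.18, 0.22, 0.20, 0.19, 0.21, 0.17, 0.20, 0.24, 0.19, 0.18]

stat, p_value = ttest_rel(errors_a, errors_b)   # paired t-test on the k differences
print(stat, p_value)   # a small p_value indicates a significant difference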
Confidence of the k-fold
Estimate
The approximate N% confidence interval for
estimating E_{S⊂D0}[ errorD(LA(S)) - errorD(LB(S)) ]
using avg(δ) is given by:
avg(δ) ± t_{N,k-1} · s_{avg(δ)}
where t_{N,k-1} is a constant analogous to zN (see [Mitchell,
Table 5.6]) and s_{avg(δ)} is an estimate of the standard
deviation of the distribution governing avg(δ):
s_{avg(δ)} = √( (1/(k(k-1))) · Σ_{i=1..k} (δi - avg(δ))² )
23
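The interval above in Python, assuming scipy for the t_{N,k-1} constant; the δi values would come from the k-fold procedure of slide 20 (the ones below are invented):

import numpy as np
from scipy.stats import t

def kfold_confidence_interval(deltas, confidence=0.95):
    """avg(delta) +/- t_{N,k-1} * s_avg, where
    s_avg = sqrt( (1 / (k(k-1))) * sum_i (delta_i - avg(delta))^2 )."""
    deltas = np.asarray(deltas, dtype=float)
    k = len(deltas)
    avg = deltas.mean()
    s_avg = np.sqrt(np.sum((deltas - avg) ** 2) / (k * (k - 1)))
    t_val = t.ppf(0.5 + confidence / 2, df=k - 1)   # two-sided N%, k - 1 degrees of freedom
    return avg - t_val * s_avg, avg + t_val * s_avg

# Hypothetical per-fold error differences delta_i:
print(kfold_confidence_interval([0.03, 0.01, 0.04, 0.02, 0.00, 0.03, 0.05, 0.02, 0.01, 0.03]))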
Experimental Evaluation Conclusions
Good experimental methodology is important to evaluating
learning methods.
It is important to test on a variety of domains to demonstrate
that a learner's bias is generally useful across many problems.
Testing on 20+ data sets is common.
Variety of freely available data sources
UCI Machine Learning Repository
http://www.ics.uci.edu/~mlearn/MLRepository.html
KDD Cup (large data sets for data mining)
http://www.kdnuggets.com/datasets/kddcup.html
CoNLL Shared Task (natural language problems)
http://www.ifarm.nl/signll/conll/
Data from real problems is preferable to artificial data for
demonstrating a bias that is useful on real-world problems.
Many available datasets have been subjected to significant
feature engineering to make them learnable.
24