
Performance Evaluation
and
Hypothesis Testing
1
Motivation
• Evaluating the performance of learning
systems is important because:
– Learning systems are usually designed to
predict the class of “future” unlabeled
data points.
– In some cases, evaluating hypotheses is
an integral part of the learning process
(for example, when pruning a decision tree)
2
Issues
• How well can a classifier be expected to perform
on “novel” data, not used for training?
• Which performance measure should we use?
• Since a performance measure is an ESTIMATE on
a sample, how accurate is our estimate?
• How to compare performances of different
hypotheses or those of different classifiers?
3
Performances of a given hypothesis
• Performances are usually reported in the form of a
CONFUSION MATRIX (also called contingency table)
• The table has four cells (in case of binary classifiers):
– TP: “true positive”, i.e., number (or %) of positive
instances classified as positive by the system
– FP: “false positive”, should be negative, the system
classified as positive
– TN: “true negative” negative instances classified as
negative
– FN: “false negative” positive instances classified as
negative
ACCURACY = (TP + TN) / |T|
where |T| = TP + TN + FP + FN
4
Performances of a given hypothesis
• Precision, Recall, F-measure
• P=TP/(TP+FP)
• R=TP/(TP+FN)
• F=2(P×R)/(P+R)
5
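As an illustration (not part of the original slides), a minimal Python sketch that computes these measures from the four confusion-matrix counts; the example counts are made up:

def confusion_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F-measure from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

# Hypothetical counts: 40 TP, 10 FP, 45 TN, 5 FN
print(confusion_metrics(40, 10, 45, 5))   # (0.85, 0.8, 0.888..., 0.842...)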
ROC curves plot the true positive rate against the false positive rate
A and B are two
alternative ML
systems
A Receiver Operating Characteristic curve (or ROC curve) is a graphical plot
that illustrates the performance of a binary classifier
system as its discrimination threshold is varied (e.g. in decision trees, the
support and confidence cutoff for accepting a decision).
The curve is created by plotting the true positive rate (TPR)
against the false positive rate (FPR) at various system settings.
6
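A minimal Python sketch of how such a curve can be traced by sweeping the decision threshold over classifier scores; the scores and labels below are made up for illustration:

import numpy as np

def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by sweeping the decision threshold."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    pos, neg = (labels == 1).sum(), (labels == 0).sum()
    points = []
    for t in sorted(set(scores), reverse=True):
        pred = scores >= t                        # classify as positive above threshold t
        tpr = (pred & (labels == 1)).sum() / pos  # true positive rate
        fpr = (pred & (labels == 0)).sum() / neg  # false positive rate
        points.append((fpr, tpr))
    return points

# Hypothetical classifier scores and true labels
print(roc_points([0.9, 0.8, 0.6, 0.4, 0.3], [1, 1, 0, 1, 0]))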
Learning Curves
• Plots accuracy vs. size of training set.
• Has maximum accuracy (“Bayes optimal”) nearly been
reached or will more examples help?
• Is one system better when training data is limited?
• Most learners eventually converge to Bayes optimal given
sufficient training examples.
[Figure: learning curves of test accuracy (up to 100%) vs. number of training examples; the upper asymptote is the Bayes optimal accuracy, and a dotted line marks some baseline, e.g. random guessing.]
7
Noise Curves
• Plot accuracy versus noise level to determine
relative resistance to noisy training data.
• Artificially add category or feature noise by
randomly replacing some specified fraction of
category or feature values with random values.
[Figure: noise curve of test accuracy (up to 100%) vs. % noise added.]
8
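A minimal Python sketch of the label-noise procedure described above; the fraction and label values are illustrative:

import numpy as np

def add_label_noise(labels, fraction, classes, rng=None):
    """Randomly replace `fraction` of the labels with random class values."""
    if rng is None:
        rng = np.random.default_rng(0)
    noisy = labels.copy()
    n_noisy = int(len(labels) * fraction)
    idx = rng.choice(len(labels), size=n_noisy, replace=False)  # which labels to corrupt
    noisy[idx] = rng.choice(classes, size=n_noisy)               # random replacement values
    return noisy

y = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0])
print(add_label_noise(y, fraction=0.3, classes=[0, 1]))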
Evaluating an hypothesis
• ROC and Accuracy not enough
• How well will the learned classifier perform
on novel data?
• Performance on the training data is not a
good indicator of performance on future
data (remember the SpamCollection
example)
9
Example
Learning set
Test: is it a lion?
Testing on the training data is not appropriate.
The learner will try to fit the available data as well as possible, and will not learn to generalize.
Possibly, it will misclassify new unseen instances.
10
Difficulties in Evaluating Hypotheses
when only limited data are available
• Bias in the estimate: The observed accuracy of the
learned hypothesis over the training examples is a
poor estimator of its accuracy over future
examples ==> we must test the hypothesis on a test
set chosen independently of the training set and the
hypothesis.
• Variance in the estimate: Even with a separate test
set, the measured accuracy can vary from the true
accuracy, depending on the makeup of the particular
set of test examples. The smaller the test set, the
greater the expected variance.
11
Variance (intuitive)
High variance
Low variance
Questions to be Considered
• Given the observed accuracy of a hypothesis over a
limited sample of data, how well does this estimate its
accuracy over additional (unseen) instances?
• Given that one hypothesis h1 outperforms another (h2)
over some sample data, how probable is it that this
hypothesis is more accurate, in general?
• When data is limited what is the best way to use this
data to both learn a hypothesis and estimate its
accuracy?
13
Evaluating error
• Evaluating the error rate of an hypothesis
• Evaluating alternative hypotheses
• Evaluating alternative learning algorithms,
making the best use of the available training data
14
1. Estimating Hypothesis Accuracy
Two Questions of Interest:
– Given a hypothesis h and a data sample containing
n examples (instances) drawn at random according
to distribution D, what is the best estimate of the
accuracy of h over future instances drawn from
the same distribution? => sample error vs. true
error
– What is the probable error in this accuracy
estimate? => confidence intervals (the error in
estimating the error)
Sample Error and True Error
• Definition 1: The sample error (denoted errorS(h) or eS(h)) of
hypothesis h with respect to target function c and data sample S
is:
errorS(h) = (1/n) · Σ_{x∈S} δ(c(x), h(x)) = r/n
where n is the number of instances in sample S, and the quantity
δ(c(x), h(x)) is 1 if c(x) ≠ h(x), and 0 otherwise.
• Definition 2: The true error (denoted errorD(h), or p) of
hypothesis h with respect to target function c and distribution D, is
the probability that h will misclassify an instance drawn at
random according to D:
errorD(h) = p = Pr_{x∈D}[c(x) ≠ h(x)]
Sample error is an estimate on S of the
true error on D, which is a probability
16
Estimate, probability and random
variables
• We are given a sample S of n instances; we apply h(x)
and we measure r errors. We then estimate the error
probability of h(x) as errorS(h) = r/n
• However, r (or r/n) is a RANDOM variable, governed
by chance. If we get another sample S’ of n different
instances, we may get a different number r’ and a
different estimate. In general errorS(h)≠errorS’(h)!!!
• Simple experiment: make k different sets of trials; in
each, toss a coin 10 times and measure the number of
“heads”
• So, how can we get an idea of the “real” probability of
error , errorD (h)?
17
Sample error and true error
• We do not know the “true” error rate however we know that
errorD(h) follows a binomial distribution with mean p
(unknown)
• Why? And what is this “binomial”?
• Say p is the (unknown) “true” error probability of h(x) in D. If
we have a sample of n instances, what is the probability that,
given instances x in S, c(x) ≠ h(x) for r times??
• Even if we do not know p, each instance x in S has p probability
of being misclassified by h(x) and (1-p) of being classified
correctly.
• The probability of observing r misclassified examples
in n instances is then:
P(errorD(h) = r/n) = C(n, r) · p^r · (1−p)^(n−r) = [ n! / (r!(n−r)!) ] · p^r · (1−p)^(n−r)
where C(n, r) = n!/(r!(n−r)!) is the number of ways we can select r items
from a population of n
18
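To make the formula concrete, a small Python check of the binomial probability; the true error rate p = 0.25 used here is a made-up value that roughly reproduces the percentages quoted in the example on the next slide:

from math import comb

def binom_prob(r, n, p):
    """P(exactly r errors in n instances) for a true error probability p."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

# e.g. with p = 0.25 and n = 20 instances
print(binom_prob(5, 20, 0.25))   # ~0.20: about a 20% chance of 5 errors out of 20
print(binom_prob(8, 20, 0.25))   # ~0.06: about a 6% chance of 8 errors out of 20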
Example
Abscissa is the value r, e.g. there is a 20% probability that h(x) misclassifies 5 out of
20 instances,
6% probability it misclassifies 8 out of 20, etc.
19
We can normalize and plot r/n
[Figure: the same binomial distribution with the x axis rescaled to r/n, i.e. the fraction of wrongly classified instances, ranging from 0 to 1.]
Example: coin toss
• Say we toss a coin, we know that p(head)=0.5
• However, if we toss a coin 4 times, we are not sure we
will get “head” two times. The number of observed
“heads” in each trial is governed by chance.
• What is the probability of getting just 1 head in 4 tosses?
(By the binomial formula: C(4,1) · 0.5¹ · 0.5³ = 4/16 = 0.25.)
21
Properties of Binomial distribution
If we have, e.g., a sample of
n=20 instances,
the expected (or mean) value
for the number of errors is:
E(r) = np = 20p
For a Binomial: E(X) = np, Var(X) = np(1−p), σ = √(np(1−p))
ED(X) = expected value of
random variable X in distribution D
(also denoted μ(X) or X̄)
Var = variance
σ = standard deviation (also
denoted SD)
Variance and Standard deviation of
Binomial: why p(1-p)?
For np times Xi=1, for
n(1-p) times Xi=0
Variance is defined as the mean of the
squared differences between the
n individual outcomes Xi and their
mean (p), i.e. the dispersion around the mean.
In a binomial, outcomes are
either 1 or 0 (e.g., wrong or correct
classification of an instance by h(x)),
so Var(Xi) = p·(1−p)² + (1−p)·(0−p)² = p(1−p).
Standard deviation of a random
variable is the square root of
its variance
23
Estimator of an error
So, we know that errorD(h) follows a binomial distribution with unknown
mean p. If we sample the error on a set of instances S, we obtain a value
r/n which is our current ESTIMATE errorS(h) of errorD(h).
Note that the “estimator” errorS(h) of p is also a random variable! If
we perform many experiments on different S we could get different values
(as in the coin flipping experiment! )
However, for a large enough sample,
the mean (expected value) of errorS(h) is the same as for errorD(h)!
Why? Because of the central limit theorem
24
Central Limit Theorem
• (General formulation) The theorem states that the
arithmetic mean of a sufficiently large number of
independent random variables (repeated experiments), each with a
well-defined expected value and well-defined variance,
will be approximately normally distributed
• This holds true regardless of whether the source
population is normal or skewed, provided the sample size
is sufficiently large (usually n > 30).
• Furthermore, the mean of all experiments will be the same
as the “real” population mean
25
avg(errorS(h)) → p
[Figure: the distribution P(r/n) of the measured error values r/n; the unknown p is marked on the axis.]
• p is the unknown expected error rate.
• Green bars are the results of different experiments on different samples.
• The average of these results tends to p as the number of experiments grows.
In other words, if you perform “several” experiments, E(errorS(h))=p
26
Central Limit Theorem
• Therefore, the theorem ensures that the
distribution of estimated errorS(h) for different
samples of at least 30 instances follows a
Gaussian (Normal) distribution whose mean is the
true error mean errorD(h)=p
27
Example (tossing n coins)
In this experiment, each figure
shows the distribution of values
with a growing number of
coins (from 1 to 20), and a
large number of tosses (say,
k=100). X axis is the ratio of
“head” when tossing n coins k
times, and Y is the number of
trials (or ratio of k trials) in
which the X value has
occurred.
For example, the black bar represents the number (or %) of times in which the ratio
of heads when tossing n=20 coins was 0.25. Here, head=1, tail=0. When using n=1 coin,
we will get 50% heads, 50% tails for a large enough number of tosses. As n grows,
the distribution gets closer to a Gaussian.
28
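A small Python simulation sketch of this experiment (the numbers of coins and trials are illustrative):

import numpy as np

rng = np.random.default_rng(42)
k = 100                      # number of trials
for n in (1, 2, 5, 20):      # number of coins tossed per trial
    ratios = rng.binomial(n, 0.5, size=k) / n   # ratio of heads in each trial
    print(f"n={n:2d}  mean={ratios.mean():.3f}  std={ratios.std():.3f}")
# The means stay near 0.5 while the spread shrinks like sqrt(p(1-p)/n),
# and the histogram of `ratios` looks increasingly Gaussian as n grows.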
..but..
• WHY is it important to approximate a
binomial distribution with a Gaussian
distribution??
• Stay tuned..
29
Standard deviation of a sample
• Thanks to the CLT, we know that E(errorS(h))=E(errorD(h))=p
• What about the standard deviation of errorS(h)??
• The standard deviation of a random variable is the square root of its
variance, as we said.
• The standard deviation of the error measured on a sample of n instances is:
σS = σD / √n
• However, we don't know σD, since we don't know p!!
• But here comes the advantage of the central limit theorem..
30
Estimating the standard deviation of
errorS(h) on the sample S
We replace the (unknown) p with our computed mean value r/n
This is an ESTIMATE since we assume that r/n is a good
approximation of the real error rate p, which holds approximately true
for large enough n, according to CLT!
| p − errorS(h) | = | p − r/n | = e
the bias of an estimator is the difference e between this estimator's
expected value and the true value of the parameter being estimated
31
Summary
• X=errorD(h)=|h(x)-c(x)| is a random boolean variable and P(X) is a
Binomial probability distribution with mean p, over the full population D.
• The sample error, errorS(h)=r /n, is also a random variable following a
Binomial distribution. In general, errorS(h)≠errorD(h), and
|errorD(h) − errorS(h)| = |p − r/n| = e is called the bias.
• If the number of examples n in S is >30, then according to Central Limit
Theorem (CLT) the underlying binomial distribution for errorS(h)
approximates a Gaussian distribution and has the same mean p as for
errorD(h) in D
• THIS DOES NOT MEAN that for n>30 errorD(h)=errorS(h)!! It means that,
if we repeat the experiment on different independent samples Si of the
same dimension n, we would obtain different values for errorSi(h), and the
DISTRIBUTION of these values approximates a Gaussian with mean p
and standard deviation σS = √( p(1−p)/n )
32
Confidence interval for an error estimate
• Confidence interval
LB ≤ errorD(h(x)) − errorS(h(x)) = p − r/n ≤ UB
• Provides an estimate of the minimum and maximum expected
discrepancy between the measured and real error rate
• Def: an N% confidence interval for an estimate e is an interval
[LB,UB] that includes e with probability N% (with probability N% we
have LB ≤ e ≤ UB)
• In our case, e= errorD (h(x))- errorS (h(x)) is a random variable (again!)
representing the difference between true and estimated error (e is the
error in estimating the error!!)
• If errorS and errorD obey a gaussian distribution, then also e does,
and the confidence interval can be easily estimated!!
33
Confidence interval
• Given an hypothesis h(x), if errorS(h(x)) obeys a gaussian
with mean = E(r/n) = errorD(h(x)) = p and SD = σD/√n
(which we said is approximately true for n>30), then the
estimated value errorS(h(x)) = r/n on a sample S of n examples
will, with probability N%, lie in the following interval:
μ ± zN σ
• zN is half of the length of the interval around μ that
includes N% of the total probability mass.
• Of course, we don't know where our estimator r/n is placed
in this distribution, because we don't actually know p; we only know
that r/n lies somewhere in a gaussian-shaped distribution.
• A z-score tells you “how many standard deviations” (zN σ) from the
mean your result is.
• As we said, the Gaussian approximates the probability
distribution of our outcomes (the measured error values r/n) for n>30.
Estimate in a gaussian distribution
[Figure: Gaussian distribution of the estimates r/n; the green line marks our estimate.]
A z-score tells you how many standard deviations from the mean your result is.
We don't know where our estimate (the green line) is placed, but we know that,
e.g., with 68.26% probability (34.13+34.13) it will be at a distance ±1σ from the mean
error, with probability 95.44% (68.26+2×13.59) it will be at a distance ±2σ,
and in general, with probability N% at a distance ±zNσ.
35
Probability mass in a gaussian
distribution
[Figure: probability mass in a Gaussian as a function of σ.]
μ − zNσ ≤ errorS ≤ μ + zNσ
So, what we actually KNOW is that WHEREVER our estimated error is in
this Gaussian, it will be with 68% probability at a distance ±1σ from the
true error rate p, with 80% probability at a distance ±1.28σ, with 98%
probability at a distance ±2.33σ, etc.
Even though we don't know σD, we have seen that it can be approximated
on S:
σS ≈ √( (r/n)·(1 − r/n) / n )
The same probability mass can be read as a function of zN on the standard
Gaussian (with mean 0 and standard deviation 1):
N% 50 68 80 90 95 98 99
zN 0.67 1.00 1.28 1.64 1.96 2.33 2.58
36
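The zN values in this table can be recovered from the standard normal distribution; a minimal sketch, assuming scipy is available:

from scipy.stats import norm

def z_for_confidence(n_percent):
    """Two-sided z_N such that N% of the Gaussian mass lies within ±z_N sigma."""
    alpha = 1 - n_percent / 100
    return norm.ppf(1 - alpha / 2)

for n_pct in (50, 68, 80, 90, 95, 98, 99):
    print(n_pct, round(z_for_confidence(n_pct), 2))
# 50 -> 0.67, 68 -> ~0.99 (the table rounds to 1.00), 80 -> 1.28,
# 90 -> 1.64, 95 -> 1.96, 98 -> 2.33, 99 -> 2.58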
Finding the N% Confidence Interval
• N% is a PARAMETER.
• We know that N% of the probability mass lies between μ ± zNσ.
• The Gaussian table gives us zN for any N%:
e.g., if N=80% then zN=1.28: we know that with 80%
probability our estimated error is at most 1.28σ away from the
true error.
N% 50 68 80 90 95 98 99
zN 0.67 1.00 1.28 1.64 1.96 2.33 2.58
And:
σS = σD/√n = (1/n)·√( np(1−p) ) = √( p(1−p)/n ) ≈ √( errorS(h)·(1 − errorS(h)) / n ) = √( (r/n)·(1 − r/n) / n )
37
Finding the N% confidence interval
[LB, UB] = [ (μ − zNσ), (μ + zNσ) ] ≈
[ e(h) − zN·√( e(h)·(1 − e(h)) / n ) , e(h) + zN·√( e(h)·(1 − e(h)) / n ) ]
where e(h) = errorS(h)
38
Calculating the N% Confidence Interval:
Example
• We have a classifier h(x) and a test set S of 100
instances
• We apply h(x) on S and compute 13% error rate
• Since n>30 we assume that the error distribution
follows a gaussian with mean 0.13 and standard
deviation σS = √( 0.13·(1 − 0.13)/100 ) ≈ 0.034
• If we wish to compute the 90% confidence
interval, on the table we have ZN=1.64
N% 50 68 80 90 95 98 99
zN 0.67 1.00 1.28 1.64 1.96 2.33 2.58
39
Calculating the N% Confidence Interval:
Example (2)
• We then have:
• ZN = 1.64, σS ≈ 0.034
• The 90% confidence interval is estimated
as 0.13 ± 1.64·0.034 ≈ 0.13 ± 0.055, i.e. [0.075, 0.185]
40
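A minimal Python sketch of this computation (the function name is ours, not from the slides), reproducing the numbers of the example above:

from math import sqrt

def error_confidence_interval(error_s, n, z_n):
    """N% confidence interval for the true error, given the error measured on n instances."""
    sigma = sqrt(error_s * (1 - error_s) / n)   # estimated standard deviation of errorS(h)
    return error_s - z_n * sigma, error_s + z_n * sigma

# Example above: 13% error on 100 instances, 90% confidence (zN = 1.64)
lb, ub = error_confidence_interval(0.13, 100, 1.64)
print(round(lb, 3), round(ub, 3))   # ~0.075 and ~0.185, i.e. 0.13 ± 0.055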
Example 2
N% 50 68 80 90 95 98 99
zN 0.67 1.00 1.28 1.64 1.96 2.33 2.58
41
Solution
Precision is 0.78 hence error rate r/n is 0.22; the test set has 50 instances, hence n=50
Choose, for example, to compute the 95% confidence interval (N = 95%, zN = 1.96)
42
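A self-contained numeric check of this solution in Python (using the values stated above: error rate 0.22, n = 50, zN = 1.96):

from math import sqrt

e, n, z95 = 0.22, 50, 1.96
sigma = sqrt(e * (1 - e) / n)                                  # ~0.059
print(round(e - z95 * sigma, 3), round(e + z95 * sigma, 3))    # ~0.105 and ~0.335, i.e. 0.22 ± 0.115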
One sided bound
α = (100 − 95)% = 5% is the percentage of the gaussian area
outside the lower and upper extremes of the curve; α/2 = 2.5% in each tail
43
One-sided and two-sided bounds
Two-sided: L ≤ P(x) ≤ U
One-sided: P(x) ≤ U
or: P(x) ≥ L
44
Upper bound
P(x) ≤ U
45
Evaluating error
Evaluating the error rate of an hypothesis
• Evaluating alternative hypotheses
• Evaluating alternative learning algorithms,
making the best use of the available training data
46
Comparing Two Learned Hypotheses
• When evaluating two hypotheses, their observed
ordering with respect to accuracy may or may not
reflect the ordering of their true accuracies.
– Assume h1 is tested on test set S1 of size n1
– Assume h2 is tested on test set S2 of size n2
[Figure: the two sampling distributions P(errorS(h)), with errorS1(h1) to the left of errorS2(h2).]
Observe: h1 appears more accurate than h2
47
Comparing Two Learned Hypotheses
• When evaluating two hypotheses, their observed
ordering with respect to accuracy may or may not
reflect the ordering of their true accuracies.
– Assume h1 is tested on test set S3 of size n1
– Assume h2 is tested on test set S4 of size n2
[Figure: the two sampling distributions P(errorS(h)), with errorS3(h1) to the right of errorS4(h2).]
Observe: h1 appears less accurate than h2
48
Hypothesis testing (1)
Our problem is the following: given that we
observed, e.g., that:
errorS1(h1)-errorS2(h2)=d>0
how confident can we be that h2 is indeed
better than h1?
PLEASE DO NOT BE CONFUSED with
terms: our problem here is to test the
HYPOTHESIS that hypothesis h2 is better
than h1 given that we observed a lower error
rate on our sample
49
Hypothesis testing (2)
• When we wish to understand how much we can
rely on a statistical finding (for example, that h2 is
more precise than h1 on a sample dataset), we
need to list alternative hypotheses (e.g. that h2 is
NOT more precise than h1 on the entire
population).
• One of these is called the NULL HYPOTHESIS
H0
• Usually, the null hypothesis is one that
disconfirms our findings
50
Hypothesis testing
• We measure the error rate of h1 and the error rate of h2,
and find a difference d=errorS1(h1)-errorS2(h2)
• Two-tail test (we obtain a value |d|≠0):
– H0: data do not support that there is a difference
between h1 and h2, hence errorD(h1)-errorD (h2)=0
– H1: there is indeed a difference between h1 and h2
• One-tail right-test (we find that d>0)
– H0: data do not support that h2>h1
– H1: h2>h1 (since the measured error of h1 is higher)
• One-tail left-test (we find that d<0)
– H0: data do not support that h1>h2
– H1: h1>h2 (since the measured error of h1 is lower)
51
Z-Score Test for Comparing
Learned Hypotheses
• Assumes h1 is tested on test set S1 of size n1 and h2
is tested on test set S2 of size n2. Assume both
n>30 for the Central Limit Theorem to hold.
• Compute the difference between the measured errors of
h1 and h2:
d̂ = errorS1(h1) − errorS2(h2)
• Note: for independent random variables, the variance of their sum
or difference is the sum of their variances.
• The difference d̂ is a random variable, and since it is
the difference between two variables each following a
gaussian distribution, it also follows a gaussian,
with standard deviation:
σd = √( σ²h1,D/n1 + σ²h2,D/n2 ) ≈ √( σ²h1,S1/n1 + σ²h2,S2/n2 )
   = √( errorS1(h1)·(1 − errorS1(h1))/n1 + errorS2(h2)·(1 − errorS2(h2))/n2 )
52
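A minimal Python sketch of this z-score computation (the function name is ours):

from math import sqrt

def z_score_two_errors(e1, n1, e2, n2):
    """z-score of the observed error difference under H0: errorD(h1) = errorD(h2)."""
    d_hat = e1 - e2                                          # observed difference d^
    sigma_d = sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)  # estimated SD of the difference
    return d_hat / sigma_d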
Testing for the null hypothesis
• If H0 holds true, then we should have:
errorD(h1)=errorD (h2) and therefore dD=0
I.e., the mean of error differences of h1 and h2 on D is zero.
• We then compute:
z = ( d̂ − dD ) / σd = ( d̂ − 0 ) / σd
where d̂ is the difference between the error estimates on the samples,
dD is the (unknown) difference between the true error probabilities on the
population (zero under H0), and σd is the error bound in estimating d̂.
If we compute z and look it up in a z-table, we will know “how many standard
deviations” our result d̂ is from the expected mean difference (which is zero according to H0)
53
Testing for the null hypothesis
• Assume that d̂ = 0.15 and σd = 0.05; then z = (0.15 − 0)/0.05 = 3 (μ = 0 under H0)
• z = 3
• N = 99.87%
54
We should reject H0!!
Z=3
Our z-test says that, if the true mean difference is zero, the probability of obtaining a value
|d| = 0.15 or more is below 0.3% (each tail beyond z = 3 holds 100 − 99.87 = 0.13%)!! So H0 is VERY UNLIKELY
55
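The probability quoted above can be checked against the standard normal CDF; a small sketch, assuming scipy is available:

from scipy.stats import norm

z = 3.0
print(1 - norm.cdf(z))        # one-tailed: ~0.00135 (0.135%)
print(2 * (1 - norm.cdf(z)))  # two-tailed: ~0.0027, far below 0.05, so H0 is rejected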
p-value
• The p-value is the probability of observing a result
at least as extreme as our estimate, given that H0 holds
true
• Common wisdom is to reject the null
hypothesis if p < 0.05 (5%)
• In the previous example we obtained p ≈ 0.003
56
One-tail test
In a one-tail test we test either h1>h2 or h1<h2
For example, if we test h1>h2 we have:
H0: data do not support that h1>h2 (hence h1≤h2)
H1: h1>h2 (in this case we expect an estimate d < 0, i.e. h1 has the lower measured error)
57
Example (one-tail test: is h1<h2?)
• errorS1(h1) = x1 = 17.5%, errorS2(h2) = x2 = 12.4%, d = 5.1%
• n1 = 50, n2 = 50
σd ≈ √( errorS1(h1)·(1 − errorS1(h1))/n1 + errorS2(h2)·(1 − errorS2(h2))/n2 )
   = √( 0.175·(1 − 0.175)/50 + 0.124·(1 − 0.124)/50 ) ≈ 0.07
z = ( (x1 − x2) − (errorD(h1) − errorD(h2)) ) / σd = (0.051 − 0) / 0.07 ≈ 0.73
The corresponding one-tailed p-value is about 0.23, well above 0.05.
The null hypothesis is accepted: the difference is not
large enough to support h1<h2
58
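A self-contained numeric check of this example in Python (assuming scipy for the normal CDF; the slides round σd to 0.07, which gives z ≈ 0.73):

from math import sqrt
from scipy.stats import norm

e1, n1, e2, n2 = 0.175, 50, 0.124, 50
sigma_d = sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)   # ~0.071
z = (e1 - e2) / sigma_d                                   # ~0.72
print(round(sigma_d, 3), round(z, 2))
print(round(1 - norm.cdf(z), 2))   # one-tailed p-value ~0.24: well above 0.05, do not reject H0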
Summary: two-sided test
59
One-sided test
H1: h2 is better than h1
H2: h1 is better than h2
60
Z-Score Test Assumptions: summary
• Hypotheses can be tested on different test sets; if the
same test set is used, stronger conclusions might be
warranted.
• Test set(s) must have at least 30 independently
drawn examples (to apply central limit theorem).
• Hypotheses must be constructed from
independent training sets.
• Only compare two specific hypotheses regardless
of the methods used to construct them. Does not
compare the underlying learning methods in
general.
61
Evaluating error
Evaluating the error rate of an hypothesis
Evaluating alternative hypotheses
• Evaluating alternative learning algorithms,
making the best use of the available training data
62
Comparing 2 Learning Algorithms
• Comparing the average accuracy of hypotheses produced
by two different learning systems is more difficult since
we need to average over multiple training sets. Ideally,
we want to measure:
E_{S⊂D}( errorD(LA(S)) − errorD(LB(S)) )
where LX(S) represents the hypothesis learned by learning
algorithm LX from training data S.
• To accurately estimate this, we need to average over
multiple, independent training and test sets.
• However, since labeled data is limited, generally must
average over multiple splits of the overall data set into
training and test sets (K-fold cross validation).
63
K-fold cross validation of an hypothesis
Partition the data set into k equally sized random samples
64
K-fold cross validation (2)
At each step, learn from Li and test on Ti, then compute the error
[Figure: the k train/test splits (Li, Ti), each yielding an error estimate e1, e2, e3, …]
65
K-FOLD CROSS VALIDATION
66
K-Fold Cross Validation: summary
• Every example in D used as a test example once
and as a training example k–1 times.
• All test sets are independent; however, training
sets overlap significantly (see two previous
slides).
• In each fold we train on [(k–1)/k]·|D| examples and test on the remaining |D|/k.
• Standard method is 10-fold.
• If k is low, not sufficient number of train/test
trials; if k is high, test set may be too small and
test variance is high and run time is increased.
• If k=|D|, method is called leave-one-out cross
validation (at each step, you leave out one
example). Used for specific cases (e.g. learning
recommendations)
67
How to use K-Fold Cross Validation to
evaluate different learning algorithms
Randomly partition data D into k disjoint equal-sized subsets P1…Pk (each of size N)
For i from 1 to k do:
  Use Pi for the test set and the remaining data for training:
  Si = (D - Pi)
  hA = LA(Si)
  hB = LB(Si)
  δi = errorPi(hA) - errorPi(hB)
Return the average difference in error:
  δ̄ = (1/k) · Σ_{i=1..k} δi
The error bound is computed as δ̄ ± Z·σδ̄, where:
  σδ̄ = √( (1/(k(k−1))) · Σ_{i=1..k} (δi − δ̄)² )
68
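A minimal Python sketch of this procedure; the training functions, the error function and the data arrays are placeholders that the caller must supply:

import numpy as np
from math import sqrt

def compare_learners_kfold(X, y, train_a, train_b, error_fn, k=10, seed=0):
    """Paired k-fold comparison: mean error difference between two learners and its std. error."""
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)                    # k disjoint, (almost) equal-sized parts
    deltas = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        h_a = train_a(X[train], y[train])             # hypothesis learned by algorithm A
        h_b = train_b(X[train], y[train])             # hypothesis learned by algorithm B
        deltas.append(error_fn(h_a, X[test], y[test]) - error_fn(h_b, X[test], y[test]))
    deltas = np.asarray(deltas)
    d_bar = deltas.mean()
    s_d = sqrt(((deltas - d_bar) ** 2).sum() / (k * (k - 1)))
    return d_bar, s_d    # report d_bar ± Z * s_d for the chosen confidence level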
Is LA better than LB?
• K-fold cross validation improves
confidence in our estimate of δ since we are
performing many experiments and
computing δ as the AVERAGE of δi.
• As K grows this average tends to the true
mean difference (however we cannot make
K too big since individual samples should
be large enough for the CLT to apply)
• We can in any case apply hypothesis testing
as before
69
Sample Experimental Results
Which experiment provides better evidence that SystemA is better than SystemB?

Experiment 1:
          SystemA  SystemB  δ
Trial 1   87%      82%      +5%
Trial 2   83%      78%      +5%
Trial 3   88%      83%      +5%
Trial 4   82%      77%      +5%
Trial 5   85%      80%      +5%
Average   85%      80%      +5%

Experiment 2:
          SystemA  SystemB  δ
Trial 1   90%      82%      +8%
Trial 2   93%      76%      +17%
Trial 3   80%      85%      –5%
Trial 4   85%      75%      +10%
Trial 5   77%      82%      –5%
Average   85%      80%      +5%

Experiment 1's mean δ has σ = 0, therefore we have perfect confidence in the estimate of δ.
70
Experimental Evaluation Conclusions
• Good experimental methodology is important to evaluating
learning methods.
• Important to test on a variety of domains to demonstrate a
general bias that is useful for a variety of problems.
Testing on 20+ data sets is common.
• Variety of freely available data sources
– UCI Machine Learning Repository
http://www.ics.uci.edu/~mlearn/MLRepository.html
– KDD Cup (large data sets for data mining)
http://www.kdnuggets.com/datasets/kddcup.html
– CoNLL Shared Task (natural language problems)
http://www.ifarm.nl/signll/conll/
• Data for real problems is preferable to artificial problems
to demonstrate a useful bias for real-world problems.
• Many available datasets have been subjected to significant
feature engineering to make them learnable.
71