Slides II (practice)
Evaluation (practice)
Predicting performance
Assume the estimated error rate is 25%. How close
is this to the true error rate?
Depends on the amount of test data
Prediction is just like tossing a (biased!) coin
“Head” is a “success”, “tail” is an “error”
In statistics, a succession of independent events like
this is called a Bernoulli process
Statistical theory provides us with confidence intervals
for the true underlying proportion
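To see how much the estimated error rate can fluctuate, here is a minimal simulation sketch of this Bernoulli-process view (assuming Python with NumPy; the 25% true error rate and the test-set sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
true_error = 0.25  # assumed true error rate

# Each prediction on a test instance is one Bernoulli trial: 1 = error, 0 = success.
for n_test in (100, 1000, 10000):
    errors = rng.random(n_test) < true_error
    print(f"N = {n_test:5d}   estimated error rate = {errors.mean():.3f}")
```

With 100 test instances the estimate can easily be several percentage points off; with 10,000 it is usually very close to 25%.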
Confidence intervals
We can say: p lies within a certain specified interval with a certain
specified confidence
Example: S=750 successes in N=1000 trials
Estimated success rate: 75%
How close is this to true success rate p?
Answer: with 80% confidence p ∈ [73.2%, 76.7%]
Another example: S=75 and N=100
Estimated success rate: 75%
With 80% confidence p ∈ [69.1%, 80.1%]
I.e. the probability that p ∈ [69.1%, 80.1%] is 0.8.
The bigger N is, the more confident we are, i.e. the surrounding interval is
narrower.
Above, for N=100 we were less confident than for N=1000.
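The intervals quoted above can be reproduced with the Wilson score interval for a proportion (that appears to be the formula behind these numbers; a simpler normal approximation is derived in the following slides). A sketch, assuming SciPy is available for the normal quantile; the helper name is illustrative:

```python
from math import sqrt
from scipy.stats import norm

def wilson_interval(S, N, confidence=0.80):
    """Wilson score confidence interval for the true success rate p."""
    f = S / N                             # observed success rate
    z = norm.ppf(0.5 + confidence / 2)    # about 1.28 for 80% confidence
    centre = f + z**2 / (2 * N)
    half = z * sqrt(f * (1 - f) / N + z**2 / (4 * N**2))
    denom = 1 + z**2 / N
    return (centre - half) / denom, (centre + half) / denom

print(wilson_interval(750, 1000))   # about (0.732, 0.767)
print(wilson_interval(75, 100))     # about (0.691, 0.801)
```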
Mean and Variance
Let Y be the random variable with possible values
1 for success and
0 for error.
Let probability of success be p.
Then probability of error is q=1-p.
What’s the mean?
1*p + 0*q = p
What’s the variance?
(1 − p)²*p + (0 − p)²*q
= q²*p + p²*q
= pq(p + q)
= pq
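A quick numerical check of the mean and variance of Y (a sketch assuming NumPy; p = 0.75 is an arbitrary example value):

```python
import numpy as np

rng = np.random.default_rng(1)
p = 0.75                                        # arbitrary success probability
y = (rng.random(1_000_000) < p).astype(float)   # one million samples of Y

print(y.mean())   # close to p = 0.75
print(y.var())    # close to p*(1 - p) = 0.1875
```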
Estimating p
Well, we don’t know p. Our goal is to estimate p.
For this we make N trials, i.e. tests.
The more trials we do, the more confident we are.
Let S be the random variable denoting the number of successes,
i.e. S is the sum of N samples of Y.
Now, we approximate p with the success rate in N trials, i.e. S/N.
By the Central Limit Theorem, when N is big, the probability
distribution of the random variable f=S/N is approximated by a
normal distribution with
mean p and
variance pq/N.
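A small simulation sketch of this approximation for f = S/N (assuming NumPy; p = 0.7 and N = 100 are illustrative values):

```python
import numpy as np

rng = np.random.default_rng(2)
p, N = 0.7, 100                              # illustrative values
f = rng.binomial(N, p, size=100_000) / N     # many independent samples of f = S/N

print(f.mean())   # close to p = 0.7
print(f.var())    # close to p*(1 - p)/N = 0.0021
```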
Estimating p
A c% confidence interval [−z, z] for a random variable X with mean 0
is given by:
Pr[−z ≤ X ≤ z] = c
With a symmetric distribution:
Pr[−z ≤ X ≤ z] = 1 − 2 × Pr[X ≥ z]
Confidence limits for the normal distribution with mean 0
and variance 1 are read from a standard normal table: Pr[X ≥ 1.65] ≈ 5%
Thus: Pr[−1.65 ≤ X ≤ 1.65] = 90%
To use this we have to reduce our random variable f=S/N to
have 0 mean and unit variance
Estimating p
Thus: Pr[−1.65 ≤ X ≤ 1.65] = 90%
To use this we have to reduce our random variable f = S/N to
have 0 mean and unit variance:
Pr[−1.65 ≤ (S/N − p) / σf ≤ 1.65] = 90%
Now we solve two equations:
(S/N − p) / σf = 1.65
(S/N − p) / σf = −1.65
Estimating p
Let N=100, and S=70
σf is sqrt(pq/N) and we approximate it by
sqrt(p'(1 − p')/N)
where p' is the estimate of p, i.e.
0.7
So σf is approximated by
sqrt(0.7*0.3/100) ≈ 0.046
The two equations become:
(0.7 − p) / 0.046 = 1.65
p = 0.7 − 1.65*0.046 = 0.624
(0.7 − p) / 0.046 = −1.65
p = 0.7 + 1.65*0.046 = 0.776
Thus, we say:
with 90% confidence the success rate p
of the classifier satisfies
0.624 ≤ p ≤ 0.776
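The same calculation as a small reusable sketch (assuming SciPy for the z value; the function name is illustrative):

```python
from math import sqrt
from scipy.stats import norm

def normal_approx_interval(S, N, confidence=0.90):
    """Confidence interval for p from the normal approximation of f = S/N."""
    f = S / N
    z = norm.ppf(0.5 + confidence / 2)   # about 1.65 for 90% confidence
    sigma = sqrt(f * (1 - f) / N)        # estimated standard deviation of S/N
    return f - z * sigma, f + z * sigma

# About (0.625, 0.775); rounding z to 1.65 as on the slide gives (0.624, 0.776).
print(normal_approx_interval(70, 100))
```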
Exercise
Suppose I want to be 95% confident in my estimation.
Looking at a detailed table we find: Pr[−2 ≤ X ≤ 2] ≈ 95%
Normalizing S/N, we need to solve:
(S/N − p) / σf = 2
(S/N − p) / σf = −2
We approximate σf with
σf ≈ sqrt( p'q' / N ) = sqrt( (S/N)(1 − S/N) / N )
where p' is the estimation of p through trials, i.e. S/N
So we need to solve:
(S/N − p) / sqrt( (S/N)(1 − S/N) / N ) = ±2
So,
p = S/N ∓ 2*sqrt( (S/N)(1 − S/N) / N )
Exercise
Suppose N=1000 trials, S=590 successes
p'=S/N=590/1000 = .59
p = S/N ∓ 2*sqrt( (S/N)(1 − S/N) / N )
p = 0.59 ∓ 2*sqrt( 0.59*0.41 / 1000 )
p = 0.59 ∓ 2*0.0155 = 0.59 ∓ 0.031
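A quick check of this exercise in plain Python (using the rounded value z = 2 from the slide):

```python
from math import sqrt

S, N, z = 590, 1000, 2                  # z = 2 for roughly 95% confidence
f = S / N
half = z * sqrt(f * (1 - f) / N)
print(f - half, f + half)               # about 0.559 and 0.621
```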
Cross-validation
k-fold cross-validation:
First step: split data into k subsets of equal
size
Second step: use each subset in turn for
testing, the remainder for training
The error estimates are averaged to yield an
overall error estimate
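A minimal 10-fold cross-validation sketch (assuming scikit-learn; the dataset and classifier are only placeholders):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)

print(1 - scores)          # error estimate on each of the 10 folds
print(1 - scores.mean())   # averaged overall error estimate
```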
Comparing data mining schemes
Frequent question: which of two learning
schemes performs better?
Obvious way: compare, for example, their 10-fold
cross-validation estimates
Problem: variance in estimate
We don’t know whether the results are reliable;
we need to use a statistical test for that
Paired t-test
Student’s t-test tells whether the means of two
samples are significantly different.
In our case the samples are cross-validation
estimates for different datasets from the
domain
Use a paired t-test because the individual
samples are paired
The same cross-validation splits are used for both schemes
Distribution of the means
x1, x2, … xk and y1, y2, … yk are the 2k samples for the
k different datasets
mx and my are the means
With enough samples, the mean of a set of
independent samples is normally distributed
Estimated variances of the means are
sx²/k and sy²/k
If μx and μy are the true means, then the following are
approximately normally distributed with mean 0 and
variance 1:
(mx − μx) / sqrt(sx²/k)    and    (my − μy) / sqrt(sy²/k)
Student’s distribution
With small samples (k < 30) the mean follows
Student’s distribution with k–1 degrees of freedom
similar shape, but wider than normal distribution
Confidence limits (mean 0 and variance 1) are read from a table of Student’s distribution; e.g. for 9 degrees of freedom, Pr[−1.83 ≤ X ≤ 1.83] = 90%
Distribution of the differences
Let md = mx – my
The difference of the means (md) also has a
Student’s distribution with k–1 degrees of
freedom
Let sd2 be the estimated variance of the
difference
The standardized version of md is called the
t-statistic:
t = md / sqrt( sd² / k )
Performing the test
Fix a significance level α
If a difference is significant at the α% level, there
is a (100 − α)% chance that the true means differ
Divide the significance level by two because
the test is two-tailed
Look up the value for z that corresponds to α/2
If t ≤ −z or t ≥ z then the difference is significant
I.e. the null hypothesis (that the difference is zero)
can be rejected
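The table lookup can also be done in code (a sketch assuming SciPy; α = 10% and k = 10 match the example on the next slide):

```python
from scipy.stats import t

alpha, k = 0.10, 10
z = t.ppf(1 - alpha / 2, df=k - 1)   # two-tailed critical value of Student's distribution
print(z)                             # about 1.83 for 9 degrees of freedom
```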
Example
We have compared two classifiers through cross-validation on 10 different datasets.
The error rates are:
Dataset        1     2     3     4     5     6     7     8     9    10
Classifier A  10.6   9.8  12.3   9.7   8.8  10.6   9.8  12.3   9.7   8.8
Classifier B  10.2   9.4  11.8   9.1   8.3  10.2   9.4  11.8   9.1   8.3
Difference     0.4   0.4   0.5   0.6   0.5   0.4   0.4   0.5   0.6   0.5
Example
md = 0.48
sd = 0.0789
t = md / sqrt( sd² / k ) = 0.48 / sqrt( 0.0789² / 10 ) = 19.24
The critical value of t for a two-tailed statistical
test with α = 10% and 9 degrees of freedom is
1.83
19.24 is far larger than 1.83, so the difference is
significant: classifier B performs better than A.
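The same test with SciPy's paired t-test, using the error rates from the table above (a sketch; ttest_rel returns the t-statistic and its two-tailed p-value):

```python
from scipy.stats import ttest_rel

a = [10.6, 9.8, 12.3, 9.7, 8.8, 10.6, 9.8, 12.3, 9.7, 8.8]   # classifier A
b = [10.2, 9.4, 11.8, 9.1, 8.3, 10.2, 9.4, 11.8, 9.1, 8.3]   # classifier B

t_stat, p_value = ttest_rel(a, b)
print(t_stat)    # about 19.2
print(p_value)   # far below 0.10, so the null hypothesis is rejected
```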
Dependent estimates
We assumed that we have enough data to create
several datasets of the desired size
Need to reuse data if that's not the case
E.g. running cross-validations with different
randomizations on the same data
Samples become dependent ⇒ insignificant
differences can become significant
A heuristic test is the corrected resampled t-test:
Assume we use the repeated holdout method, with n1
instances for training and n2 for testing
The new test statistic is:
t = md / sqrt( (1/k + n2/n1) * sd² )
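A sketch of this corrected statistic (assuming NumPy; the differences and the 90/10 train/test split are illustrative placeholders):

```python
import numpy as np

def corrected_resampled_t(diffs, n_train, n_test):
    """Corrected resampled t-statistic for k repeated holdout runs."""
    diffs = np.asarray(diffs, dtype=float)
    k = len(diffs)
    m_d = diffs.mean()
    s2_d = diffs.var(ddof=1)    # sample variance of the differences
    return m_d / np.sqrt((1 / k + n_test / n_train) * s2_d)

# Illustrative: the 10 differences from the earlier example, 90/10 holdout split.
print(corrected_resampled_t([0.4, 0.4, 0.5, 0.6, 0.5, 0.4, 0.4, 0.5, 0.6, 0.5],
                            n_train=900, n_test=100))
```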