Slides II (practice)


Evaluation (practice)
Predicting performance
 Assume the estimated error rate is 25%. How close
is this to the true error rate?
 Depends on the amount of test data
 Prediction is just like tossing a (biased!) coin
 “Head” is a “success”, “tail” is an “error”
 In statistics, a succession of independent events like
this is called a Bernoulli process
 Statistical theory provides us with confidence intervals
for the true underlying proportion
Confidence intervals
 We can say: p lies within a certain specified interval with a certain
specified confidence
 Example: S=750 successes in N=1000 trials
 Estimated success rate: 75%
 How close is this to true success rate p?
 Answer: with 80% confidence, p ∈ [73.2%, 76.7%]
 Another example: S=75 and N=100
 Estimated success rate: 75%
 With 80% confidence, p ∈ [69.1%, 80.1%]
 I.e. the probability that p ∈ [69.1%, 80.1%] is 0.8.
 The bigger N is, the more confident we are, i.e. the surrounding interval is smaller.
 Above, for N=100 we were less confident than for N=1000 (see the sketch below).
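The intervals quoted above match the Wilson score interval for a binomial proportion. A minimal sketch that reproduces them (assuming SciPy is available; the helper name wilson_interval is ours, not from the slides):

    from math import sqrt
    from scipy.stats import norm

    def wilson_interval(S, N, confidence=0.80):
        # Wilson score interval for a binomial success proportion
        z = norm.ppf((1 + confidence) / 2)          # two-sided critical value
        p_hat = S / N
        centre = p_hat + z**2 / (2 * N)
        half = z * sqrt(p_hat * (1 - p_hat) / N + z**2 / (4 * N**2))
        denom = 1 + z**2 / N
        return (centre - half) / denom, (centre + half) / denom

    print(wilson_interval(750, 1000))  # ~(0.732, 0.767)
    print(wilson_interval(75, 100))    # ~(0.691, 0.801)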
Mean and Variance
 Let Y be the random variable with possible values
1 for success and
0 for error.
 Let probability of success be p.
 Then probability of error is q=1-p.
 What’s the mean?
1*p + 0*q = p
 What’s the variance?
(1-p)^2 * p + (0-p)^2 * q
= q^2 * p + p^2 * q
= pq(p+q)
= pq
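These two identities are easy to check numerically; a quick sketch (assuming NumPy; p = 0.75 is an arbitrary choice):

    import numpy as np

    rng = np.random.default_rng(0)
    p = 0.75                                 # arbitrary success probability
    y = rng.binomial(1, p, size=1_000_000)   # many samplings of Y
    print(y.mean(), p)                       # sample mean     ~ p
    print(y.var(), p * (1 - p))              # sample variance ~ pq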
Estimating p
 Well, we don’t know p. Our goal is to estimate p.
 For this we make N trials, i.e. tests.
 The more trials we do, the more confident we are.
 Let S be the random variable denoting the number of successes, i.e. S is the sum of N samplings of Y.
 Now, we approximate p with the success rate in N trials, i.e. S/N.
 By the Central Limit Theorem, when N is big, the probability
distribution of the random variable f=S/N is approximated by a
normal distribution with
 mean p and
 variance pq/N.
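A simulation makes the CLT statement concrete: across many repetitions of N trials, the success rates S/N cluster around p with variance close to pq/N. A sketch (assuming NumPy; p, N and the repetition count are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    p, N, reps = 0.75, 1000, 100_000
    f = rng.binomial(N, p, size=reps) / N    # one S/N per repeated experiment
    print(f.mean(), p)                       # mean of f     ~ p
    print(f.var(), p * (1 - p) / N)          # variance of f ~ pq/N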
Estimating p
 The c% confidence interval [−z, z] for a random variable X with mean 0 is given by:
Pr[−z ≤ X ≤ z] = c
 With a symmetric distribution:
Pr[−z ≤ X ≤ z] = 1 − 2 × Pr[X ≥ z]
 Confidence limits for the normal distribution with mean 0 and variance 1:
Thus: Pr[−1.65 ≤ X ≤ 1.65] = 90%
To use this we have to reduce our random variable f=S/N to
have 0 mean and unit variance
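The 1.65 above is just the standard normal quantile for a two-sided 90% interval; it can be computed rather than read from a table. A sketch (assuming SciPy):

    from scipy.stats import norm

    c = 0.90
    z = norm.ppf(1 - (1 - c) / 2)        # upper critical value, two-sided
    print(z)                             # ~1.6449, rounded to 1.65 on the slide
    print(norm.cdf(z) - norm.cdf(-z))    # ~0.90 = Pr[-z <= X <= z]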
Estimating p
Thus: Pr[−1.65 ≤ X ≤ 1.65] = 90%
To use this we have to reduce our random variable f = S/N to have mean 0 and unit variance, dividing by its standard deviation σ_f = sqrt(pq/N):
Pr[−1.65 ≤ (S/N − p) / σ_f ≤ 1.65] = 90%
Now we solve two equations:
(S/N − p) / σ_f = 1.65
(S/N − p) / σ_f = −1.65
Estimating p
Let N=100, and S=70
σ_f = sqrt(pq/N), and we approximate it by
sqrt(p'(1-p')/N)
where p' is the estimate of p, i.e. 0.7
So σ_f is approximated by
sqrt(.7 * .3 / 100) ≈ .046
The two equations become:
(0.7 − p) / .046 = 1.65, giving p = .7 − 1.65 * .046 = .624
(0.7 − p) / .046 = −1.65, giving p = .7 + 1.65 * .046 = .776
Thus, we say:
With 90% confidence, the success rate p of the classifier satisfies
0.624 ≤ p ≤ 0.776
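The same arithmetic in code (standard library only; z = 1.65 for 90% confidence as above):

    from math import sqrt

    N, S, z = 100, 70, 1.65
    p_hat = S / N                          # estimated success rate p'
    se = sqrt(p_hat * (1 - p_hat) / N)     # approximation of sigma_f, ~0.046
    print(p_hat - z * se, p_hat + z * se)  # ~0.624 and ~0.776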
Exercise
 Suppose I want to be 95% confident in my estimation.
 Looking at a detailed table we find: Pr[−2 ≤ X ≤ 2] ≈ 95%
 Normalizing S/N, we need to solve:
(S/N − p) / σ_f = 2
(S/N − p) / σ_f = −2
 We approximate σ_f with
s_f = sqrt(p'q'/N) = sqrt((S/N)(1 − S/N)/N)
 where p' is the estimation of p through trials, i.e. S/N
 So we need to solve:
(S/N − p) / sqrt((S/N)(1 − S/N)/N) = ±2
 So,
p = S/N − 2 * sqrt((S/N)(1 − S/N)/N)
p = S/N + 2 * sqrt((S/N)(1 − S/N)/N)
Exercise
 Suppose N=1000 trials, S=590 successes
 p'=S/N=590/1000 = .59
p = S/N ± 2 * sqrt((S/N)(1 − S/N)/N)
p = .59 ± 2 * sqrt(.59 * .41 / 1000)
  = .59 ± 2 * .0155
  ≈ .59 ± .031
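Checking the exercise in code (standard library only):

    from math import sqrt

    N, S, z = 1000, 590, 2
    p_hat = S / N                          # 0.59
    se = sqrt(p_hat * (1 - p_hat) / N)     # ~0.0155
    print(p_hat - z * se, p_hat + z * se)  # ~0.559 and ~0.621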
Cross-validation
k-fold cross-validation:
First step: split the data into k subsets of equal size
Second step: use each subset in turn for testing, the remainder for training
 The error estimates are averaged to yield an overall error estimate (see the sketch below)
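A minimal sketch of the procedure, assuming NumPy and any classifier object with fit/predict methods (the classifier itself is a placeholder, not something defined in these slides):

    import numpy as np

    def cross_val_error(classifier, X, y, k=10, seed=0):
        # k-fold cross-validation estimate of the error rate
        idx = np.random.default_rng(seed).permutation(len(y))
        folds = np.array_split(idx, k)    # k subsets of (nearly) equal size
        errors = []
        for i in range(k):
            test = folds[i]               # one subset for testing
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            classifier.fit(X[train], y[train])   # the remainder for training
            errors.append(np.mean(classifier.predict(X[test]) != y[test]))
        return np.mean(errors)            # average the k error estimates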
Comparing data mining schemes
 Frequent question: which of two learning
schemes performs better?
 Obvious way: compare, for example, 10-fold cross-validation estimates
 Problem: variance in estimate
 We don’t know whether the results are reliable
⇒ need to use a statistical test for that
Paired t-test
 Student’s t-test tells whether the means of two
samples are significantly different.
 In our case the samples are cross-validation
estimates for different datasets from the
domain
 Use a paired t-test because the individual
samples are paired
 The same cross-validation is applied twice (once for each scheme)
Distribution of the means
 x1, x2, … xk and y1, y2, … yk are the 2k samples for the
k different datasets
 mx and my are the means
 With enough samples, the mean of a set of
independent samples is normally distributed
 Estimated variances of the means are sx²/k and sy²/k
 If μx and μy are the true means, then the following are approximately normally distributed with mean 0 and variance 1:
(mx − μx) / sqrt(σx²/k)   and   (my − μy) / sqrt(σy²/k)
Student’s distribution
 With small samples (k < 30) the mean follows
Student’s distribution with k–1 degrees of freedom
 similar shape, but wider than normal distribution
 Confidence limits (mean 0 and variance 1):
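These critical values can be computed rather than read from a table; a sketch (assuming SciPy; k = 10 matches the example that follows):

    from scipy.stats import norm, t

    k = 10
    print(norm.ppf(0.95))         # ~1.64: normal, two-sided 90%
    print(t.ppf(0.95, df=k - 1))  # ~1.83: Student's t, 9 degrees of freedom (wider)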
Distribution of the differences
 Let md = mx – my
 The difference of the means (md) also has a
Student’s distribution with k–1 degrees of
freedom
 Let sd² be the estimated variance of the difference
 The standardized version of md is called the t-statistic:
t = md / sqrt(sd²/k)
Performing the test
 Fix a significance level α
 If a difference is significant at the α% level, there is a (100 − α)% chance that the true means differ
 Divide the significance level by two because the test is two-tailed
 Look up the value for z that corresponds to α/2
 If t ≤ −z or t ≥ z then the difference is significant
 I.e. the null hypothesis (that the difference is zero) can be rejected
Example
 We have compared two classifiers through cross-validation on 10 different datasets.
 The error rates are:
Dataset   Classifier A   Classifier B   Difference
1         10.6           10.2           .4
2          9.8            9.4           .4
3         12.3           11.8           .5
4          9.7            9.1           .6
5          8.8            8.3           .5
6         10.6           10.2           .4
7          9.8            9.4           .4
8         12.3           11.8           .5
9          9.7            9.1           .6
10         8.8            8.3           .5
Example
 md = 0.48
 sd = 0.0789
t = md / sqrt(sd²/k) = 0.48 / sqrt(0.0789² / 10) ≈ 19.24
 The critical value of t for a two-tailed statistical test, α = 10% and 9 degrees of freedom, is: 1.83
 19.24 is way bigger than 1.83, so classifier B is
much better than A.
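The whole example can be replayed in code; a sketch assuming NumPy and SciPy:

    import numpy as np
    from scipy import stats

    a = np.array([10.6, 9.8, 12.3, 9.7, 8.8, 10.6, 9.8, 12.3, 9.7, 8.8])
    b = np.array([10.2, 9.4, 11.8, 9.1, 8.3, 10.2, 9.4, 11.8, 9.1, 8.3])
    d = a - b
    k = len(d)
    md, sd = d.mean(), d.std(ddof=1)     # 0.48 and ~0.0789
    print(md / np.sqrt(sd**2 / k))       # t ~ 19.24
    print(stats.t.ppf(0.95, df=k - 1))   # critical value ~1.83
    print(stats.ttest_rel(a, b))         # built-in paired t-test, same statistic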
Dependent estimates
 We assumed that we have enough data to create
 several datasets of the desired size
 Need to reuse data if that's not the case
 E.g. running cross-validations with different
randomizations on the same data
 Samples become dependent ⇒ insignificant differences can become significant
 A heuristic test is the corrected resampled t-test:
 Assume we use the repeated holdout method, with n1
instances for training and n2 for testing
 New test statistic is:
t = md / sqrt((1/k + n2/n1) * sd²)
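A sketch of this corrected statistic (assuming NumPy; the function name and the diffs/n1/n2 parameters are ours, for illustration):

    import numpy as np

    def corrected_resampled_t(diffs, n1, n2):
        # corrected resampled t-statistic for k dependent estimates
        k = len(diffs)
        md = np.mean(diffs)
        sd2 = np.var(diffs, ddof=1)   # estimated variance of the differences
        return md / np.sqrt((1 / k + n2 / n1) * sd2)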