Statistical Comparison of Two Learning Algorithms
Presented by: Payam Refaeilzadeh
Overview

How can we tell whether one algorithm learns better than another?
– Design an experiment to measure the accuracy of the two algorithms.
– Run multiple trials.
– Compare the samples, not just their means: do a statistically sound test of the two samples.
– Is any observed difference significant? Is it due to a true difference between the algorithms, or to natural variation in the measurements?
Statistical Hypothesis Testing

Statistical Hypothesis: A statement about the parameters of one or more populations.
Hypothesis Testing: A procedure for deciding whether to accept or reject the hypothesis:
– Identify the parameter of interest
– State a null hypothesis, H0
– Specify an alternate hypothesis, H1
– Choose a significance level α
– State an appropriate test statistic
Statistical Hypothesis Testing Cont.

Null Hypothesis (H0): A statement presumed to be true until statistical evidence shows otherwise.
– Usually specifies an exact value for a parameter
– Example H0: µ = 30 kg
Alternate Hypothesis (H1): Accepted if the null hypothesis is rejected.
Test Statistic: A particular statistic calculated from measurements of a random sample / experiment.
– A test statistic is assumed to follow a particular distribution (normal, t, chi-square, etc.)
– That distribution can be used to test the significance of the calculated test statistic, as sketched below.
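As a concrete illustration of that last step, here is a minimal Python sketch of looking up a critical value of the assumed distribution (here, Student's t) and comparing it against a computed statistic. The sample size, α, and the value of the test statistic are made-up numbers, not from the slides:

```python
# Compare a computed test statistic against the critical value of its
# assumed distribution (Student's t). All numbers are illustrative.
from scipy import stats

alpha = 0.05  # chosen significance level
n = 10        # sample size -> n - 1 degrees of freedom
t0 = 2.41     # hypothetical test statistic computed from the sample

# Two-sided critical value t_{alpha/2, n-1}
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
print(f"t_crit = {t_crit:.3f}, reject H0: {abs(t0) > t_crit}")
```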
Error in Hypothesis Testing

Type I error occurs when H0 is rejected but it is in fact true.
– P(Type I error) = α, the significance level
Type II error occurs when we fail to reject H0 but it is in fact false.
– P(Type II error) = β
– Power = 1 − β = probability of correctly rejecting H0
– Power is the ability to distinguish between the two populations.
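The two error rates can be estimated by simulation. The sketch below (all distribution parameters are illustrative assumptions, not from the slides) runs a one-sample t-test many times: once with H0 true, to estimate the Type I error rate, and once with H0 false, to estimate the power:

```python
# Monte Carlo estimate of Type I error rate and power for a one-sample
# t-test of H0: mu = 0. Parameters are illustrative placeholders.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n, trials = 0.05, 30, 10_000

def reject_rate(true_mean):
    rejections = 0
    for _ in range(trials):
        sample = rng.normal(loc=true_mean, scale=1.0, size=n)
        _, p = stats.ttest_1samp(sample, popmean=0.0)
        rejections += p < alpha
    return rejections / trials

print("Type I error (H0 true,  mu=0):  ", reject_rate(0.0))  # ~ alpha
print("power        (H0 false, mu=0.5):", reject_rate(0.5))  # 1 - beta
```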
Paired t-Test

Collect data in pairs:
– Example: Given a training set DTrain and a test set DTest, train both learning algorithms on DTrain and then test their accuracies on DTest.
Suppose n paired measurements have been made. Assume:
– The measurements are independent
– The measurements for each algorithm follow a normal distribution
Then the test statistic T0 will follow a t-distribution with n−1 degrees of freedom.
Paired t-Test cont.

Trial #   Algorithm 1 Accuracy (X1)   Algorithm 2 Accuracy (X2)
1         X11                         X21
2         X12                         X22
…         …                           …
n         X1n                         X2n

Assume: X1 follows N(µ1, σ1) and X2 follows N(µ2, σ2).

Let:
µD = µ1 − µ2
Di = X1i − X2i,  i = 1, 2, ..., n
D̄ = (1/n) Σi Di
SD = STDEV(X1i − X2i)

Null Hypothesis:
H0: µD = Δ0

Test Statistic:
T0 = (D̄ − Δ0) / (SD / √n)

Rejection Criteria:
H1: µD ≠ Δ0  →  |t0| > tα/2,n−1
H1: µD > Δ0  →  t0 > tα,n−1
H1: µD < Δ0  →  t0 < −tα,n−1
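Putting the formulas together, here is a minimal sketch of the paired t-test for Δ0 = 0; the two accuracy arrays are hypothetical placeholders:

```python
# Paired t-test on per-trial accuracies, testing H0: mu_D = 0.
# The accuracy values below are hypothetical placeholders.
import numpy as np
from scipy import stats

x1 = np.array([0.81, 0.79, 0.84, 0.80, 0.83])  # algorithm 1 accuracies
x2 = np.array([0.78, 0.77, 0.80, 0.79, 0.80])  # algorithm 2 accuracies

d = x1 - x2                      # D_i = X_1i - X_2i
n = len(d)
d_bar = d.mean()                 # sample mean of the differences
s_d = d.std(ddof=1)              # S_D, sample standard deviation
t0 = d_bar / (s_d / np.sqrt(n))  # T_0 = (D_bar - Delta_0) / (S_D / sqrt(n))

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
print(f"t0 = {t0:.3f}, reject H0: {abs(t0) > t_crit}")

# scipy computes the same statistic and a p-value directly:
print(stats.ttest_rel(x1, x2))
```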
Cross-Validated t-test

Paired t-Test on the 10 paired accuracies obtained from 10-fold cross validation (see the sketch after this list).
Advantages
– Large train set size
– Most powerful (Dietterich, 98)
Disadvantages
– Accuracy results are not independent (the training sets overlap)
– Somewhat elevated probability of Type I error (Dietterich, 98)
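A sketch of this test under some assumptions of mine: the dataset, the two classifiers, and the scikit-learn API usage are illustrative choices, and any pair of learners could be substituted. The key point is that both algorithms are evaluated on the same 10 folds, so the fold accuracies are paired:

```python
# 10-fold cross-validated paired t-test. The dataset and the two
# classifiers are illustrative placeholders.
import numpy as np
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
cv = KFold(n_splits=10, shuffle=True, random_state=0)  # same folds for both

acc1 = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=cv)
acc2 = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

# Paired t-test on the 10 paired fold accuracies
t0, p = stats.ttest_rel(acc1, acc2)
print(f"t0 = {t0:.3f}, p = {p:.4f}")
```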
5x2 Cross-Validated t-test

Run 2-fold cross validation 5 times.
Use the results from the first of the five replications to estimate the mean difference.
Use the results from all folds to estimate the variance (see the sketch after this list).
Advantage
– Lowest Type I error (Dietterich, 98)
Disadvantage
– Not as powerful as the 10-fold cross-validated t-test (Dietterich, 98)
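In Dietterich's formulation, the numerator is the accuracy difference from the first fold of the first replication, the denominator pools the per-replication variances, and the statistic is compared against a t-distribution with 5 degrees of freedom. A minimal sketch, where the difference values are hypothetical placeholders:

```python
# Dietterich's 5x2cv paired t-test. diff[i][j] holds the accuracy
# difference (algorithm 1 minus algorithm 2) on fold j of replication i;
# the values are hypothetical placeholders.
import numpy as np
from scipy import stats

diff = np.array([  # shape: (5 replications, 2 folds)
    [0.02, 0.03],
    [0.01, 0.04],
    [0.03, 0.02],
    [0.00, 0.03],
    [0.02, 0.01],
])

p_bar = diff.mean(axis=1)                        # mean per replication
s2 = ((diff - p_bar[:, None]) ** 2).sum(axis=1)  # variance estimate s_i^2
t0 = diff[0, 0] / np.sqrt(s2.mean())             # first fold, first replication
p_value = 2 * stats.t.sf(abs(t0), df=5)          # t-distribution with 5 df
print(f"t0 = {t0:.3f}, p = {p_value:.4f}")
```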
Re-sampled t-test

Randomly divide the data into train / test sets (usually 2/3 – 1/3).
Run multiple trials (usually 30).
Perform a paired t-test between the trial accuracies.
This test has a very high probability of Type I error and should never be used.
Calibrated Tests

Bouckaert – ICML 2003:
– It is very difficult to estimate the true degrees of freedom because the independence assumptions are being violated.
– Instead of correcting for the mean difference, calibrate the degrees of freedom.
– Recommendation: use 10-times repeated 10-fold cross validation with 10 degrees of freedom (sketched below).
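A minimal sketch of that recommendation, under the assumption that a plain t-statistic is computed over all 100 paired differences while only the degrees of freedom are calibrated; the exact form of Bouckaert's statistic should be checked against the paper, and the differences here are simulated placeholders:

```python
# Sketch of the calibrated 10x10cv test: a t-statistic over all 100
# paired differences, but with degrees of freedom calibrated to 10
# rather than 99. The differences are simulated placeholders, and the
# form of the statistic is an assumption; see Bouckaert (2003).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
diffs = rng.normal(0.02, 0.01, size=100)  # placeholder 10x10cv differences

m = diffs.mean()
t0 = m / (diffs.std(ddof=1) / np.sqrt(len(diffs)))
p_value = 2 * stats.t.sf(abs(t0), df=10)  # calibrated df = 10, not 99
print(f"t0 = {t0:.3f}, p = {p_value:.4f}")
```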
References

R. R. Bouckaert. Choosing between two learning algorithms based on calibrated tests. ICML '03, pp. 51–58, 2003.
T. G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10:1895–1924, 1998.
D. C. Montgomery et al. Engineering Statistics, 2nd edition. Wiley, 2001.