
Evaluating Hypotheses
• Sample error, true error
• Confidence intervals for observed hypothesis error
• Estimators
• Binomial distribution, Normal distribution,
Central Limit Theorem
• Paired t-tests
• Comparing Learning Methods
Problems Estimating Error
1. Bias: If S is the training set, errorS(h) is optimistically biased
$\text{bias} \equiv E[\text{error}_S(h)] - \text{error}_D(h)$
For an unbiased estimate, h and S must be chosen independently
2. Variance: Even with unbiased S, errorS(h) may
still vary from errorD(h)
Two Definitions of Error
The true error of hypothesis h with respect to target function f
and distribution D is the probability that h will misclassify
an instance drawn at random according to D.
$\text{error}_D(h) \equiv \Pr_{x \in D}[f(x) \neq h(x)]$
The sample error of h with respect to target function f and
data sample S is the proportion of examples h misclassifies
$\text{error}_S(h) \equiv \frac{1}{n}\sum_{x \in S}\delta(f(x) \neq h(x))$
where $\delta(f(x) \neq h(x))$ is 1 if $f(x) \neq h(x)$, and 0 otherwise
How well does errorS(h) estimate errorD(h)?
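As an illustration, a minimal Python sketch of computing the sample error from paired labels (the function and argument names are illustrative, not from the slides):

```python
def sample_error(true_labels, predicted_labels):
    """error_S(h): fraction of examples in S that h misclassifies."""
    assert len(true_labels) == len(predicted_labels)
    mistakes = sum(1 for f_x, h_x in zip(true_labels, predicted_labels) if f_x != h_x)
    return mistakes / len(true_labels)
```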
Example
Hypothesis h misclassifies 12 of 40 examples in S.
$\text{error}_S(h) = \frac{12}{40} = 0.30$
What is errorD(h)?
Estimators
Experiment:
1. Choose sample S of size n according to
distribution D
2. Measure errorS(h)
errorS(h) is a random variable (i.e., result of an
experiment)
errorS(h) is an unbiased estimator for errorD(h)
Given an observed errorS(h), what can we conclude
about errorD(h)?
Confidence Intervals
If
• S contains n examples, drawn independently of h and each
other
• n ≥ 30
Then
• With approximately N% probability, errorD(h) lies in
interval
$\text{error}_S(h) \pm z_N\sqrt{\frac{\text{error}_S(h)(1-\text{error}_S(h))}{n}}$
where
N% : 50% 68% 80% 90% 95% 98% 99%
z_N : 0.67 1.00 1.28 1.64 1.96 2.33 2.58
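A small sketch of computing this interval, reusing the z_N table above and plugging in the earlier example (12 of 40 examples misclassified) for illustration:

```python
import math

# Two-sided z_N values from the table above.
Z_N = {50: 0.67, 68: 1.00, 80: 1.28, 90: 1.64, 95: 1.96, 98: 2.33, 99: 2.58}

def error_confidence_interval(error_s, n, confidence=95):
    """Approximate N% confidence interval for error_D(h), given error_S(h) on n examples (n >= 30)."""
    margin = Z_N[confidence] * math.sqrt(error_s * (1 - error_s) / n)
    return error_s - margin, error_s + margin

# Example from the slides: 12 of 40 examples misclassified.
low, high = error_confidence_interval(12 / 40, 40, confidence=95)
print(f"95% interval: ({low:.3f}, {high:.3f})")  # roughly (0.158, 0.442)
```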
Confidence Intervals
If
• S contains n examples, drawn independently of h and each
other
• n ≥ 30
Then
• With approximately 95% probability, errorD(h) lies in
interval
$\text{error}_S(h) \pm 1.96\sqrt{\frac{\text{error}_S(h)(1-\text{error}_S(h))}{n}}$
errorS(h) is a Random Variable
• Rerun experiment with different randomly drawn S (size n)
• Probability of observing r misclassified examples:
[Figure: Binomial distribution for n = 40, p = 0.3; P(r) plotted against r = 0..40]
$P(r) = \frac{n!}{r!(n-r)!}\,\text{error}_D(h)^r\,(1-\text{error}_D(h))^{n-r}$
Binomial Probability Distribution
[Figure: Binomial distribution for n = 40, p = 0.3; P(r) plotted against r = 0..40]
$P(r) = \frac{n!}{r!(n-r)!}\,p^r(1-p)^{n-r}$
Probability P(r) of r heads in n coin flips, if p = Pr(heads)
• Expected, or mean value of X: $E[X] = \sum_{i=0}^{n} i\,P(i) = np$
• Variance of X: $\text{Var}(X) = E[(X - E[X])^2] = np(1-p)$
• Standard deviation of X: $\sigma_X = \sqrt{E[(X - E[X])^2]} = \sqrt{np(1-p)}$
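A short sketch evaluating these quantities directly with the standard library (parameters chosen to match the figure, n = 40 and p = 0.3):

```python
import math

n, p = 40, 0.3

def binomial_pmf(r, n, p):
    """P(r) = n! / (r! (n-r)!) * p^r * (1-p)^(n-r)."""
    return math.comb(n, r) * p**r * (1 - p)**(n - r)

mean = n * p                       # E[X] = np
std = math.sqrt(n * p * (1 - p))   # sigma_X = sqrt(np(1-p))
print(binomial_pmf(12, n, p), mean, std)  # r = 12 is the most likely count for these parameters
```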
Normal Probability Distribution
[Figure: Normal distribution with mean 0, standard deviation 1; density plotted for x from -3 to 3]
$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$
The probability that X will fall into the interval (a, b) is given by $\int_a^b p(x)\,dx$
• Expected, or mean value of X: $E[X] = \mu$
• Variance of X: $\text{Var}(X) = \sigma^2$
• Standard deviation of X: $\sigma_X = \sigma$
Normal Distribution Approximates Binomial
$\text{error}_S(h)$ follows a Binomial distribution, with
• mean $\mu_{\text{error}_S(h)} = \text{error}_D(h)$
• standard deviation $\sigma_{\text{error}_S(h)} = \sqrt{\frac{\text{error}_D(h)(1-\text{error}_D(h))}{n}}$
Approximate this by a Normal distribution with
• mean $\mu_{\text{error}_S(h)} = \text{error}_D(h)$
• standard deviation $\sigma_{\text{error}_S(h)} \approx \sqrt{\frac{\text{error}_S(h)(1-\text{error}_S(h))}{n}}$
Normal Probability Distribution
[Figure: standard Normal density, x from -3 to 3]
80% of area (probability) lies in $\mu \pm 1.28\sigma$
N% of area (probability) lies in $\mu \pm z_N\sigma$
N% : 50% 68% 80% 90% 95% 98% 99%
z_N : 0.67 1.00 1.28 1.64 1.96 2.33 2.58
Confidence Intervals, More Correctly
If
• S contains n examples, drawn independently of h and each
other
• n ≥ 30
Then
• With approximately 95% probability, errorS(h) lies in
interval
$\text{error}_D(h) \pm 1.96\sqrt{\frac{\text{error}_D(h)(1-\text{error}_D(h))}{n}}$
• equivalently, errorD(h) lies in interval
$\text{error}_S(h) \pm 1.96\sqrt{\frac{\text{error}_D(h)(1-\text{error}_D(h))}{n}}$
• which is approximately
$\text{error}_S(h) \pm 1.96\sqrt{\frac{\text{error}_S(h)(1-\text{error}_S(h))}{n}}$
Calculating Confidence Intervals
1. Pick parameter p to estimate
• errorD(h)
2. Choose an estimator
• errorS(h)
3. Determine probability distribution that governs estimator
• errorS(h) governed by Binomial distribution, approximated
by Normal when n ≥ 30
4. Find interval (L,U) such that N% of probability mass falls
in the interval
• Use table of zN values
Central Limit Theorem
Consider a set of independent, identically distributed random variables $Y_1, \ldots, Y_n$, all governed by an arbitrary probability distribution with mean $\mu$ and finite variance $\sigma^2$. Define the sample mean
$\bar{Y} \equiv \frac{1}{n}\sum_{i=1}^{n} Y_i$
Central Limit Theorem. As $n \to \infty$, the distribution governing $\bar{Y}$ approaches a Normal distribution, with mean $\mu$ and variance $\frac{\sigma^2}{n}$.
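A quick simulation sketch of the theorem (assuming NumPy; the exponential distribution is just an arbitrary non-Normal choice):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 50, 10_000

# Draw `trials` samples of size n from a skewed (non-Normal) distribution.
samples = rng.exponential(scale=2.0, size=(trials, n))  # mean 2, variance 4
sample_means = samples.mean(axis=1)

# The sample means should be approximately Normal with mean mu and variance sigma^2 / n.
print(sample_means.mean())  # close to 2.0
print(sample_means.std())   # close to sqrt(4 / n) ~ 0.28
```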
Difference Between Hypotheses
Test $h_1$ on sample $S_1$, test $h_2$ on sample $S_2$
1. Pick parameter to estimate
$d \equiv \text{error}_D(h_1) - \text{error}_D(h_2)$
2. Choose an estimator
$\hat{d} \equiv \text{error}_{S_1}(h_1) - \text{error}_{S_2}(h_2)$
3. Determine probability distribution that governs estimator
$\sigma_{\hat{d}} \approx \sqrt{\frac{\text{error}_{S_1}(h_1)(1-\text{error}_{S_1}(h_1))}{n_1} + \frac{\text{error}_{S_2}(h_2)(1-\text{error}_{S_2}(h_2))}{n_2}}$
4. Find interval (L, U) such that N% of probability mass falls in the interval
$\hat{d} \pm z_N\sqrt{\frac{\text{error}_{S_1}(h_1)(1-\text{error}_{S_1}(h_1))}{n_1} + \frac{\text{error}_{S_2}(h_2)(1-\text{error}_{S_2}(h_2))}{n_2}}$
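A sketch of steps 2-4 as a single function (names and the z_N table are as above; this assumes the two test sets are independent):

```python
import math

Z_N = {90: 1.64, 95: 1.96, 98: 2.33, 99: 2.58}

def difference_interval(error1, n1, error2, n2, confidence=95):
    """N% confidence interval for d = error_D(h1) - error_D(h2),
    estimated from independent test sets S1 (size n1) and S2 (size n2)."""
    d_hat = error1 - error2
    sigma = math.sqrt(error1 * (1 - error1) / n1 + error2 * (1 - error2) / n2)
    margin = Z_N[confidence] * sigma
    return d_hat - margin, d_hat + margin
```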
Paired t-test to Compare hA, hB
1. Partition data into k disjoint test sets $T_1, T_2, \ldots, T_k$ of
equal size, where this size is at least 30.
2. For i from 1 to k do
$\delta_i \equiv \text{error}_{T_i}(h_A) - \text{error}_{T_i}(h_B)$
3. Return the value $\bar{\delta}$, where
$\bar{\delta} \equiv \frac{1}{k}\sum_{i=1}^{k}\delta_i$
N% confidence interval estimate for d:
$\bar{\delta} \pm t_{N,k-1}\,s_{\bar{\delta}}$
$s_{\bar{\delta}} \equiv \sqrt{\frac{1}{k(k-1)}\sum_{i=1}^{k}(\delta_i - \bar{\delta})^2}$
Note: $\delta_i$ is approximately Normally distributed
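A sketch of this procedure (assuming SciPy is available for the $t_{N,k-1}$ value; the per-fold error lists are supplied by the caller):

```python
import math
from scipy import stats

def paired_t_interval(errors_a, errors_b, confidence=0.95):
    """Paired comparison of two hypotheses over k disjoint test sets.
    errors_a[i] and errors_b[i] are error_Ti(hA) and error_Ti(hB)."""
    k = len(errors_a)
    deltas = [a - b for a, b in zip(errors_a, errors_b)]
    delta_bar = sum(deltas) / k
    s_delta = math.sqrt(sum((d - delta_bar) ** 2 for d in deltas) / (k * (k - 1)))
    t_value = stats.t.ppf((1 + confidence) / 2, df=k - 1)  # two-sided t_{N, k-1}
    return delta_bar - t_value * s_delta, delta_bar + t_value * s_delta
```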
N-Fold Cross Validation
• Popular testing methodology
• Divide data into N even-sized random folds
• For n = 1 to N
– Train set = all folds except n
– Test set = fold n
– Create learner with train set
– Count number of errors on test set
• Accumulate the errors across the N test sets and divide by the total number of examples (the result is the error rate)
• For comparing algorithms, use the same set of folds to create learners (results are paired); a code sketch of this loop follows
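A sketch of the loop above (assuming NumPy arrays for X and y; `train` and `predict` are caller-supplied placeholders, not a specific library API):

```python
import numpy as np

def n_fold_cv_error(X, y, train, predict, n_folds=10, seed=0):
    """Estimate error rate by N-fold cross validation.
    `train(X_tr, y_tr)` returns a model; `predict(model, X_te)` returns labels."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))
    folds = np.array_split(indices, n_folds)

    errors = 0
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        model = train(X[train_idx], y[train_idx])
        errors += np.sum(predict(model, X[test_idx]) != y[test_idx])
    return errors / len(y)
```

Reusing the same random fold assignment for each algorithm keeps the per-fold results paired, as the slide notes.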
N-Fold Cross Validation
• Advantages/disadvantages
– Estimate of error within a single data set
– Every point used once as a test point
– At the extreme (when N = size of data set), called
leave-one-out testing
– Results affected by the random choice of folds (sometimes
addressed by repeating with multiple random fold assignments;
Dietterich expressed significant reservations about this in a paper)
Results Analysis: Confusion Matrix
• For many problems (especially multiclass
problems), often useful to examine the sources of
error
• Confusion matrix:
                      Predicted
Expected     ClassA   ClassB   ClassC   Total
ClassA         25        5       20      50
ClassB          0       45        5      50
ClassC         25        0       25      50
Total          50       50       50     150
Results Analysis: Confusion Matrix
• Building a confusion matrix
– Zero all entries
– For each data point, add one to the entry in the row of its
actual class and the column of its predicted class (see the
code sketch below)
• Perfect prediction has all values down the
diagonal
• Off diagonal entries can often tell us about what is
being mis-predicted
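A minimal sketch of that bookkeeping using plain dictionaries (class labels can be any hashable values):

```python
from collections import defaultdict

def confusion_matrix(actual, predicted):
    """matrix[actual_class][predicted_class] = count."""
    matrix = defaultdict(lambda: defaultdict(int))
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix

# Perfect prediction puts every count on the diagonal, i.e. in matrix[c][c].
```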
Receiver Operating Characteristic (ROC) Curves
• Originally from signal detection
• Becoming very popular for ML
• Used in:
– Two class problems
– Where predictions are ordered in some way (e.g., neural network
activation is often taken as an indication of how strong or weak a
prediction is)
• Plotting an ROC curve (see the code sketch below):
– Sort the predictions by their predicted strength
– Start at the bottom left
– For each positive example, go up 1/P units where P is the number
of positive examples
– For each negative example, go right 1/N units where N is the
number of negative examples
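A sketch of this plotting procedure (it assumes each example has a 0/1 label and a prediction strength such as a network activation, and returns the curve's points rather than drawing them):

```python
def roc_points(labels, scores):
    """Return (false_positive_rate, true_positive_rate) points of an ROC curve.
    labels: 1 for positive, 0 for negative (at least one of each);
    scores: higher means a stronger positive prediction."""
    P = sum(labels)
    N = len(labels) - P
    # Process examples from strongest to weakest prediction.
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)

    tp = fp = 0
    points = [(0.0, 0.0)]      # start at the bottom left
    for _, label in ranked:
        if label == 1:
            tp += 1            # go up 1/P units for a positive example
        else:
            fp += 1            # go right 1/N units for a negative example
        points.append((fp / N, tp / P))
    return points
```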
ROC Curve
[Figure: ROC curve plotting True Positives (%) against False Positives (%), both axes from 0 to 100]
ROC Properties
• Can visualize the tradeoff between coverage and accuracy
(as we lower the threshold for prediction, how many more
true positives do we gain in exchange for more false
positives)
• Gives a better feel when comparing algorithms
– Algorithms may do well in different portions of the curve
• A perfect curve would start in the bottom left, go to the top
left, then over to the top right
– A random prediction curve would be a line from the bottom left to
the top right
• When comparing curves:
– Can look to see if one curve dominates the other (is always better)
– Can compare the area under the curve, as sketched below (very
popular; some people even do t-tests on these numbers)
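One way to compute that area is the trapezoid rule over the ROC points (a sketch, reusing the `roc_points` helper above):

```python
def area_under_curve(points):
    """Trapezoid-rule area under an ROC curve given its (fpr, tpr) points in path order."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area
```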