Lecture 17 of 42
Methods for
Statistical Evaluation of Hypotheses
Friday, 29 February 2008
William H. Hsu
Department of Computing and Information Sciences, KSU
http://www.cis.ksu.edu/~bhsu
Readings:
Section 6.13, Han & Kamber 2e
CIS 732: Machine Learning and Pattern Recognition
Lecture Outline
•
Read Section 6.13 – 6.14, Han & Kamber 2e
•
Statistical Evaluation Methods for Learning: Three Questions
– Generalization quality
• Estimating hypothesis accuracy on future data: sampling theory
• How well does observed (sample) accuracy estimate generalization accuracy?
• Estimation bias and variance
• Confidence intervals
– Comparing generalization quality
• How certain are we that h1 is better than h2?
• Significance evaluation and t tests
– Learning and statistical evaluation
• What is the best way to make the most of limited data?
• Validation choices and tradeoffs
•
Next Lecture: Sections 6.1-6.5, Mitchell (Bayesian Learning Basics)
Motivation
•
Why Evaluate Performance?
– To decide whether to use hypothesis h (e.g., database of medical treatments)
– To boost performance (e.g., validation-set accuracy in DT, ANN learning)
•
Precise Evaluation from Limited Data: Issues
– Bias: overoptimistic estimates of generalization quality
– Variance: uncertainty about generalization quality (even if estimate unbiased)
•
What Are We Evaluating?
– Accuracy of learned hypothesis (percentage of examples correctly classified)
• Range of probable error
• Probability that observed accuracy was “due to chance”
– Relative accuracy of two hypotheses
– Relative accuracy of two hypotheses using limited data
– Other figures of merit
• Utility: cost of false positives, negatives
• Efficiency of model: e.g., syntactic complexity of rules
Estimating Hypothesis Error:
Definitions
•
Setting for Learning Problem
– Objective: learn target function f: X → A
– Unknown probability distribution 𝒟 over X (e.g., age of people met on the street)
– Input: D ≡ {<x ∈ X, f(x) ∈ A>} (aka {<x, c(x)>}), with x drawn from 𝒟
– Output: represented as a hypothesis h ∈ H
• Questions
– Given h and D (|D| = n), give the best estimate of the future accuracy of h on x drawn from 𝒟
– What is the probable error in this estimate?
•
Definitions
– Sample error of hypothesis h with respect to target function f and sample D (see the sketch after this slide):
$error_D(h) \equiv \frac{1}{n}\sum_{x \in D} \delta(f(x), h(x))$, where $\delta(f(x), h(x)) \equiv I[f(x) \neq h(x)] = \begin{cases} 1 & \text{if } f(x) \neq h(x) \\ 0 & \text{otherwise} \end{cases}$
– True error of h: the probability that h will misclassify a single x drawn at random from 𝒟:
$error_{\mathcal{D}}(h) \equiv \Pr_{x \sim \mathcal{D}}[c(x) \neq h(x)]$
– Main question: How well does sample error estimate true error?
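A minimal computational sketch (not from the slides; the threshold hypothesis and the five labeled points are hypothetical) of the sample-error definition above:

```python
# Sample error: fraction of examples in D that hypothesis h misclassifies.
def sample_error(h, D):
    """D is a list of (x, label) pairs; h is a callable hypothesis."""
    return sum(1 for x, y in D if h(x) != y) / len(D)

# Hypothetical example: a threshold classifier on five labeled points.
h = lambda x: int(x > 0.5)
D = [(0.2, 0), (0.4, 1), (0.6, 1), (0.7, 0), (0.9, 1)]
print(sample_error(h, D))   # 2 mistakes out of 5 -> 0.4
```

The true error, by contrast, cannot be computed directly: it would require knowing the unknown distribution 𝒟, which is exactly why the rest of the lecture estimates it from the sample.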
Confidence Intervals
for Discrete-Valued Hypotheses
•
Estimating True Error
– Discrete-valued hypothesis h (classification model)
– Estimate based on observed sample error over sample D
• Samples (instances) x are independent and identically distributed (i.i.d.) ~ 𝒟
• Let: $r \equiv \sum_{x \in D} \delta(f(x), h(x)) = n \cdot error_D(h)$ (i.e., $error_D(h) = r/n$)
• Suppose n ≥ 30 (heuristics for n: later today)
– Given no other information, the most probable value of error_𝒟(h) is error_D(h)
•
Definition
– N% confidence interval (CI) for a parameter θ: an interval that is expected with probability N% to contain θ
– Intuitive idea: want "error_𝒟(h) ∈ error_D(h) ± z_N·σ" for a given N, where the desired parameter is θ ≡ error_𝒟(h)
•
Example
– Want a 95% CI for r = 12, n = 40 (error_D(h) = 12/40 = 0.30)
– z_N = 1.96, σ ≈ 0.07, CI = 0.30 ± 0.14 (more on what z_N and σ are in a bit; a worked sketch follows this slide)
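A worked sketch of this example (the numbers r = 12, n = 40, and z_95 = 1.96 come from the slide; the code itself is illustrative):

```python
# 95% confidence interval for true error from r errors on n test examples.
import math

r, n = 12, 40
z95 = 1.96                              # z_N for N = 95% (table on a later slide)
err = r / n                             # sample error = 0.30
sigma = math.sqrt(err * (1 - err) / n)  # approx. standard deviation ~= 0.072
print(f"95% CI: {err:.2f} +/- {z95 * sigma:.2f}")   # 0.30 +/- 0.14
```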
Sampling Theory:
Definitions from Probability and Statistics
•
Estimating Error Rate
– How does |error_𝒟(h) − error_D(h)| depend on |D|?
– General statistical framework: parameter estimation
• Want to estimate Pr(P_h(x)): the proportion of x drawn from 𝒟 with property P_h
• P_h(x) ≡ "h misclassifies x" (i.e., h(x) ≠ c(x))
• Experiment with a random outcome: select an x and test h on it (a Bernoulli trial)
– Goal
• Estimate Pr(P_h(D, e)) ≡ Pr(error_D(h) = e) based on Pr(P_h(x))
• D: random sample of size n
•
Definitions to Review
– F: cumulative distribution function (cdf); aggregation of probability measure Pr
• Sum (discrete mass function p) or integral (continuous density function f)
• p(x) = Pr(X = x), F(x) = Pr(X ≤ x)
– Relevant distributions: Binomial, Normal
– Statistical measures on a random variable X ~ Pr(X): mean μ_X, variance σ_X²
Error Estimation and
Estimating Binomial Proportions
•
Bernoulli Random Variables
– Random variable (RV): a real-valued function over a sample space (X: Ω → R)
• Ω_X denotes the range of values that X takes
• The distribution of X maps each such value to its probability
– Bernoulli random variable: a random variable where Ω_X ≡ {0, 1}
– Bernoulli trial: an experiment whose outcome is denoted by a Bernoulli RV (e.g., a coin flip)
•
Understanding the Binomial Distribution
– Definition: Binomial RV
• One where Ω_X ⊆ N (the natural numbers, i.e., nonnegative integers)
• Its value denotes the number of observations in n Bernoulli trials (n ∈ N)
– Idea: calculate number of ways to get r observations in n trials, probability of r
•
Bernoulli Trials, Binomial Distributions, and Error Estimation
– Interpret an observation (X = 1) as "h incorrectly classifies x ∈ D" and no observation (X = 0) as "h correctly classifies x ∈ D"
– Consider n trials, i.i.d. ~ Bernoulli: distribution on error rate for all of D
The Binomial Distribution
•
Probability Mass Function for Binomial Distribution
– p(R = r) ≡ (number of ways to get r observations in n trials) · Pr(a particular combination of r errors)
• Count the number of ways: $\binom{n}{r} = C(n, r) = \frac{n!}{r!\,(n - r)!}$
• Calculate the probability of a particular combination of r errors: $p^r (1 - p)^{n - r}$
• Calculate the overall probability mass function: $\Pr(R = r) = \frac{n!}{r!\,(n - r)!}\, p^r (1 - p)^{n - r}, \quad 0 \le r \le n$
• Using the Estimate (see the code sketch after this slide)
– Binomial RV of interest: $R \equiv \sum_{i=1}^{n} \delta(h(x_i), f(x_i))$
– Based on the p of interest
• p is estimated by the sample error error_D(h) = R/n
• To answer questions about error_𝒟(h), characterize R (bias, variance)
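A minimal sketch (not from the slides) of the binomial probability mass function defined above, using Python's exact binomial coefficient:

```python
# Pr(R = r): probability of exactly r errors in n i.i.d. Bernoulli(p) trials.
from math import comb

def binom_pmf(r, n, p):
    return comb(n, r) * p**r * (1 - p)**(n - r)

# Hypothetical numbers: true error p = 0.3, n = 40 trials, exactly 12 errors.
print(binom_pmf(12, 40, 0.3))   # ~= 0.14
```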
Mean, Variance, and
Estimation Bias
•
Definitions
– Expectation of a random variable X (aka expected value or mean)
• General definition: $\mu_X \equiv E[X] = \sum_{x \in \Omega_X} x \cdot \Pr(X = x)$
• Suppose X is discrete-valued with possible values $x_1, \dots, x_n$: $\mu_X = E[X] = \sum_{i=1}^{n} x_i \cdot \Pr(X = x_i)$
• Binomial distribution: E[X] = np
– Variance of X: $Var(X) \equiv \sigma_X^2 \equiv E\big[(X - E[X])^2\big] = E[X^2] - (E[X])^2$
– Standard deviation of X: $\sigma_X \equiv \sqrt{Var(X)}$
– Estimation bias of an estimator X for a parameter θ: $\beta \equiv E[X] - \theta$
• NB: estimation bias ≠ inductive bias
– β = 0 ⇒ X is an unbiased estimator of θ
– Is error_D(h) an unbiased estimator of error_𝒟(h)? (Hint: E[R] = np; see the short derivation below)
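One way to answer the question above (a short derivation, not spelled out on the slide): since R ~ Binomial(n, p) with p = error_𝒟(h),

$E\big[error_D(h)\big] = E\!\left[\tfrac{R}{n}\right] = \tfrac{E[R]}{n} = \tfrac{np}{n} = p = error_{\mathcal{D}}(h)$

so β = E[error_D(h)] − error_𝒟(h) = 0, i.e., the sample error is an unbiased estimator of the true error.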
Confidence Intervals
•
Objective: Calculate Bounds for True Error (at Given Confidence Level)
•
Definitions
– N% confidence interval (CI) for θ: an interval that contains θ with probability N%
– Normal (aka Gaussian) distribution: $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$, where μ denotes the mean of X and σ² its variance
• Finding the CI for θ in Terms of a Given N
– θ of interest: error_𝒟(h)
– Evaluator specifies N
– The CI depends on a "constant" z_N (a function of N), on n = |D|, and on error_D(h)
– Want error_D(h) − z_N·σ ≤ error_𝒟(h) ≤ error_D(h) + z_N·σ
• N% CI for error_𝒟(h): error_D(h) ± z_N·σ
– Coefficient z_N: table lookup
  Confidence level N%:  50%   68%   80%   90%   95%   98%   99%
  Coefficient z_N:      0.67  1.00  1.28  1.64  1.96  2.33  2.58
– $\sigma = \sqrt{Var(error_D(h))} \approx \sqrt{\dfrac{error_D(h)\,\big(1 - error_D(h)\big)}{n}}$
Confidence Intervals:
Two-Sided and One-Sided Bounds
•
Two-Sided versus One-Sided Bounds
– Previous zN: for a two-sided bound (upper and lower extrema: [L, U])
– Sometimes we want to ask: what is Pr(error_𝒟(h) ≤ U)?
– In many experiments, maximum error bound U is all that matters
• One-sided bound: don’t care that expected error is at least L
• Want to adapt CI computation procedure to handle half-interval
•
Property: Normal Distribution is Symmetric about 
•
Modification to CI Procedure
– α ≡ probability of the error falling into the tails of the distribution (< L or > U)
– Simply convert the 100(1 − α)% CI [L, U] into a 100(1 − α/2)% CI (−∞, U]
•
Example
– Suppose again that h commits r = 12 errors on n = 40 examples
– Two-sided 95% CI: 0.30 ± 0.14
– Corresponding one-sided 97.5% bound: error_𝒟(h) ≤ 0.44 (see the sketch below)
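A sketch of the conversion (reusing the assumed r = 12, n = 40 from the example; the code is illustrative):

```python
# Convert the two-sided 95% interval into a one-sided 97.5% upper bound.
import math

r, n = 12, 40
err = r / n
sigma = math.sqrt(err * (1 - err) / n)
z = 1.96   # two-sided 95% coefficient, hence one-sided 97.5%
print(f"With ~97.5% confidence, true error <= {err + z * sigma:.2f}")   # 0.44
```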
Confidence Intervals:
Applying The Central Limit Theorem
•
General 4-Step Approach for Deriving CIs
– 1. Identify the population parameter θ to be estimated (e.g., error_𝒟(h))
– 2. Determine the estimator X (e.g., error_D(h))
• Prefer an unbiased X (E[X] − θ = 0) with minimum variance; the two goals can be in conflict!
– 3. Determine the distribution D_X such that X ~ D_X; particularly want μ_X and σ_X
• Other ways to characterize probability mass (or density) function for DX (see
moments, moment generating functions)
– 4. Determine the N% CI for a given N%: find L, U such that N% of the probability mass of X lies in [L, U]
•
Central Limit Theorem
– Consider: a set of n RVs $\{X_i\}$ i.i.d. ~ an arbitrary distribution with mean μ and variance σ², and the sample mean $\bar{X}_n \equiv \frac{1}{n}\sum_{i=1}^{n} X_i$
– Claim: as $n \to \infty$, $\dfrac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \sim$ Normal(0, 1)
– Ramification: for any estimator X that is a mean (e.g., error_D(h)), $X \sim D_X \approx$ Normal($\mu_X$, $\sigma_X$) for large enough n (see the simulation sketch below)
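A minimal simulation sketch (not from the slides; p, n, and the trial count are arbitrary choices) illustrating the claim for Bernoulli trials, the case used throughout this lecture:

```python
# Sample means of n Bernoulli(p) draws: approximately Normal(p, p(1-p)/n).
import random, statistics

random.seed(0)
p, n, trials = 0.3, 100, 2000
means = [sum(random.random() < p for _ in range(n)) / n for _ in range(trials)]

print(statistics.mean(means))    # close to p = 0.3
print(statistics.stdev(means))   # close to sqrt(p * (1 - p) / n) ~= 0.046
```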
Difference in Error
of Two Hypotheses
•
Estimating Difference in True Errors: Want a CI for Difference
•
Use Generic 4-Step Procedure to Derive Estimator and CI
– 1. Desired parameter: $d \equiv error_{\mathcal{D}}(h_1) - error_{\mathcal{D}}(h_2)$
– 2Q. What is a good estimator for d?
– 2A. The difference between the sample errors: $\hat{d} \equiv error_{D_1}(h_1) - error_{D_2}(h_2)$
– 3. Determine the distribution $D_X$ governing the estimator $X \equiv \hat{d}$
• Central Limit Theorem (CLT): the errors are approximately Normal for large enough $n_1$, $n_2$
• The difference of two Normal distributions is also Normal
• Result: $\mu_{\hat{d}} = d$, $\quad \sigma_{\hat{d}}^2 \approx \dfrac{error_{D_1}(h_1)\,\big(1 - error_{D_1}(h_1)\big)}{n_1} + \dfrac{error_{D_2}(h_2)\,\big(1 - error_{D_2}(h_2)\big)}{n_2}$
– 4. Use the parameters of $D_X$ to derive the N% CI (see the sketch after this slide):
$\hat{d} \pm z_N \sqrt{\dfrac{error_{D_1}(h_1)\,\big(1 - error_{D_1}(h_1)\big)}{n_1} + \dfrac{error_{D_2}(h_2)\,\big(1 - error_{D_2}(h_2)\big)}{n_2}}$
• NB: typically the samples are identical (D_1 = D_2)
– Ramification: (usually) lower estimator variance – why?
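A sketch of step 4 as a function (the error rates and sample sizes below are hypothetical):

```python
# N% CI for the difference in true errors of two hypotheses, from their
# sample errors e1, e2 on independent test sets of sizes n1, n2.
import math

def diff_error_ci(e1, n1, e2, n2, z_n=1.96):
    d_hat = e1 - e2
    sigma = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    return d_hat - z_n * sigma, d_hat + z_n * sigma

# e.g., h1 errs 30% on 100 examples, h2 errs 20% on another 100 examples.
print(diff_error_ci(0.30, 100, 0.20, 100))   # roughly (-0.02, 0.22)
```

Since the interval straddles zero in this hypothetical case, the observed difference alone would not justify a 95%-confidence conclusion that h1 is truly worse than h2.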
Hypothesis Testing
•
Neyman-Pearson Paradigm
– Intuitive idea: comparing two hypotheses h1 and h2
• Subtlety: "hypothesis" here refers to assertions about error_𝒟(h1), error_𝒟(h2)
• Define such a hypothesis H0 (the null hypothesis, e.g., "error_𝒟(h1) ≤ error_𝒟(h2)")
• Want to decide between it and H_A (the alternative hypothesis, ¬H0)
– The test question
• Event: observing error_D1(h1) > error_D2(h2) (i.e., d̂ > 0) makes us consider rejecting H0
• What is Pr[error_𝒟(h1) > error_𝒟(h2)]?
•
Observation of Estimators and Hypothesis Testing
– We see: $\hat{d} \equiv error_{D_1}(h_1) - error_{D_2}(h_2)$
– We want to use this to estimate $d \equiv error_{\mathcal{D}}(h_1) - error_{\mathcal{D}}(h_2)$
•
Confidence Intervals and Hypothesis Testing
– $\Pr(d > 0 \mid \text{observed } \hat{d})$ = the probability that the estimator falls within the one-sided interval $\hat{d} < \mu_{\hat{d}} + (\text{observed } \hat{d})$
– Pr(the estimate of d is within the observed value) = Pr(correct evaluation)
Hypothesis Testing:
An Example
•
Statistical Experiment Definition
– Common distribution D
– Two samples (data sets): D1, D2
– Two hypotheses h1 and h2 (e.g., old hypothesis and new one in learning algorithm)
– Parameter: error difference at confidence level N% (compute and compare)
• Observation: Measured Errors $error_{D_1}(h_1)$, $error_{D_2}(h_2)$
• Example
– Suppose $error_{D_1}(h_1) = 0.30$ and $error_{D_2}(h_2) = 0.20$, so $\hat{d} = 0.10$
– $\sigma_{\hat{d}} \approx 0.061$ (e.g., for test sets with $n_1 = n_2 = 100$)
– The observed $\hat{d} = 0.10 = 1.64\,\sigma_{\hat{d}}$, i.e., it lies at $\mu_{\hat{d}} + 1.64\,\sigma_{\hat{d}}$ when $d = 0$
Confidence level N%:  50%   68%   80%   90%   95%   98%   99%
Coefficient z_N:      0.67  1.00  1.28  1.64  1.96  2.33  2.58
– Here, we have solved for zN given sample errors
• Reverse table lookup: 90% two-sided level of confidence, 95% one-sided
• Can answer: how large a difference in sample errors is needed to reject H0 at the N% confidence level
• U ≡ boundary of the one-sided rejection region (see the sketch below)
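A sketch of the "reverse table lookup" using the Normal CDF instead of a table (the values of d̂ and σ_d̂ come from the example above; the code is illustrative):

```python
# One-sided confidence that the true difference d exceeds 0, given the
# observed d_hat and its estimated standard deviation.
import math

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

d_hat, sigma = 0.10, 0.061
z = d_hat / sigma            # ~= 1.64
print(normal_cdf(z))         # ~= 0.95 -> about 95% one-sided confidence
```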
Comparing Learning Algorithms
•
Another Statistical Experiment Definition
– Common distribution D
– Error computed over a single sample D drawn from 𝒟
– Two learning algorithms LA and LB
– Parameter: error difference at confidence level N% (compute and compare)
•
Observation: Measured Errors on Test Data
•
Observation of Estimators and Hypothesis Testing
– Break limited sample D into Dtrain, Dtest
– We see: $\hat{d} \equiv error_{D_{test}}(L_A(D_{train})) - error_{D_{test}}(L_B(D_{train}))$
– We want to use this to estimate $d \equiv E_{D \subset \mathcal{D}}\big[error_{\mathcal{D}}(L_A(D)) - error_{\mathcal{D}}(L_B(D))\big]$
– Need a method that limits how much of the limited data is lost from training to the test set
Estimating Difference in Error
between Learning Algorithms
•
Algorithm k-Fold-Cross-Validation(D, L_A, L_B)
– Partition D into k disjoint subsets T_1, T_2, …, T_k of equal size (|T_i| at least 30)
– FOR i ← 1 TO k DO
• Use T_i as the test set and the remaining data D_i ≡ D \ T_i for training
• h_A ← L_A(D_i)
• h_B ← L_B(D_i)
• δ_i ← error_Ti(h_A) − error_Ti(h_B)
– RETURN $\bar{\delta} \equiv \frac{1}{k}\sum_{i=1}^{k} \delta_i$
k-Fold Cross Validation: aka Jackknife
– N% confidence interval for estimating $d \equiv E_{D \subset \mathcal{D}}\big[error_{\mathcal{D}}(L_A(D)) - error_{\mathcal{D}}(L_B(D))\big]$ (a code sketch follows this slide):
$\bar{\delta} \pm t_{N,k-1}\, s_{\bar{\delta}}$, where $t_{N,k-1}$ is the t-test coefficient for k − 1 degrees of freedom and $s_{\bar{\delta}} \equiv \sqrt{\dfrac{1}{k(k-1)} \sum_{i=1}^{k} \big(\delta_i - \bar{\delta}\big)^2}$
Statistical Errors and Significance
•
Neyman-Pearson Revisited
– Two kinds of statistical errors
• Type I ≡ false positive (reject H0 when it is true)
• Type II ≡ false negative (accept H0 when it is false)
– Probability of statistical errors
• Pr(Type I error) ≡ α ≡ significance level of the test
• Pr(Type II error) ≡ β; 1 − β ≡ Pr(sound rejection) ≡ power of the test
•
Jargon
– “Significant at the N% level of confidence”: N = 100(1 − α)
– “p < α”
• p-value ≡ the smallest value of α at which H0 would be rejected, given the observed data
• The p-value shrinks as the observed effect (e.g., the difference in sample errors) grows
• “Significance of p”: p-value = p
Paired t Tests
•
Student t Test
– Estimator: the sample mean $\bar{\delta} \equiv \frac{1}{k}\sum_{i=1}^{k} \delta_i$
– Test statistic: tN,k-1 (k - 1 degrees of freedom)
• k counts number of independent random events used to estimate parameter
• Property: $\lim_{k \to \infty} t_{N,k-1} = z_N$ (illustrated in the sketch below)
• Result: cross-validated CI $\bar{\delta} \pm t_{N,k-1}\, s_{\bar{\delta}}$
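A small sketch (illustrative; uses scipy for the quantiles) of the limit property above, showing t_{N,k-1} approaching z_N as the degrees of freedom grow:

```python
# Two-sided 95% coefficients: t_{95, k-1} -> z_95 = 1.96 as k grows.
from scipy.stats import norm, t

for k in (5, 10, 30, 100):
    print(k, round(t.ppf(0.975, df=k - 1), 3))   # 2.776, 2.262, 2.045, 1.984
print("z_95 =", round(norm.ppf(0.975), 3))       # 1.96
```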
•
Paired Tests
– Evaluation of tests over identical samples
– Lower variance (no random differences between samples) ⇒ tighter CI
•
For More Information: Consult Statistics Textbook (e.g., [Rice, 1988])
The Bias-Variance Tradeoff
•
Intuitive Idea
– Variance: being uncertain about d
• Effort expended to reduce variance: coding the model
• Antithesis of precision
– (Estimation) Bias: being wrong about d
• Effort expended to reduce bias: coding the residual error
• Antithesis of accuracy
– Important in designing learning algorithms
•
Jargon
– Bias-variance tradeoff: idea that limited resources exist to code model and error
– Algorithms that exploit bias-variance tradeoff: exchange flexibility for complexity
• Flexibility: more trainable parameters (e.g., ANN weights)
• Complexity: slower convergence, overfitting, other problems
Using Statistics
in Evaluation of Learning Algorithms
•
Practical Considerations
– Setting k in k-fold CV
– “User-defined” target N%
•
Things to Keep in Mind
– Limited data resources
• More used to evaluate, less to generalize over
• Goal: find ways to evaluate in a mathematically sound fashion
– Limited computational resources
• Only so much space and time to perform experiments
• Goal: design efficient and informative tests
– No silver bullet
• No learning algorithm does equally well on all data sets
• Evaluation of learning algorithms ≡ evaluation of inductive biases
•
Further Reading: Duda and Hart (2e Due Out Any Day Now™)
Statistics and ANN Jargon
•
Many Common Concepts
•
Differences in Terminology
– Inductive learning
• ANNs and AI: “generalizing from noisy data”; “supervised learning”
• Statistics: “statistical inference”; “(multivariate) regression”
– Design of learning systems
• ANNs and AI: “architecture”
• Statistics: “model”
– Model components
• ANNs: “trainable weight or bias”
• Statistics: “parameter”
•
Many, Many More
•
Most Comprehensive Resources to Date
– http://www.aic.nrl.navy.mil/~aha/research/txt/searle.txt
– http://www.faqs.org/faqs/ai-faq/neural-nets/part1/section-14.html
Terminology
•
Statistical Evaluation Methods for Learning
– Foundations
• Mean, variance, estimation bias
• Confidence intervals
– Comparing generalization quality
• Significance (to a confidence level)
• p-value
• t test
– Making the most of limited data
• k-fold cross validation
• Degrees of freedom
•
Neyman-Pearson Paradigm
– Null hypothesis (H0) and alternative hypothesis (HA)
– Rejection, acceptance of H0
– Type I, Type II statistical errors; significance, power
Summary Points
•
Statistical Evaluation Methods for Learning: Three Questions
– Generalization quality
• How well does observed accuracy estimate generalization accuracy?
• Estimation bias and variance
• Confidence intervals
– Comparing generalization quality
• How certain are we that h1 is better than h2?
• Confidence intervals for paired tests
– Learning and statistical evaluation
• What is the best way to make the most of limited data?
• k-fold CV
•
Tradeoffs: Bias versus Variance
•
Next Lecture: Sections 6.1-6.5, Mitchell (Bayes’s Theorem; ML; MAP)