What is Confidence?

Download Report

Transcript What is Confidence?

What is Confidence?
How to Handle Overfitting When Given
Few Examples
Top Changwatchai
AIML seminar
27 February 2001
(based on my ongoing research, Fall 2000)
27 February 2001
What is Confidence?
Slide 1
Overview
• The problem of overfitting
• Bayesian network models
• Defining confidence
27 February 2001
What is Confidence?
Slide 2
A Learning Problem
• Say we want to learn a classifier
– fixed distribution of examples, each drawn independently
– example consists of set of features
– given a set of labeled examples drawn from the distribution
• Is our primary goal to fit these examples as well as possible?
– No!
– We want to fit the underlying distribution
• Overfitting
– finding a hypothesis which fits the training examples better than
some other hypothesis, but which fits the underlying distribution
worse than that hypothesis
• Moral
– focusing only on fitting the training data exposes you to the danger
of overfitting
27 February 2001
What is Confidence?
Slide 3
Handling Overfitting
• Approaches
– Assume you have enough examples
• Statistical anomalies are minimized
– Smoothing (handling unlikely examples)
– Other statistical methods, such as partitioning into
training/test data
– Other heuristics/assumptions
• My constraints
– Very small number of examples
• Shh! Ultimate goal is incorporating domain knowledge...
– If approximations are necessary, make only those that can
be directly, quantitatively justified
– Want a quantitative measure of overfitting
27 February 2001
What is Confidence?
Slide 4
Bayesian Networks
• Notes
– BN’s are simply an example application
– Focusing on inverted-tree BN’s used as classifiers
• Quick overview
– Nodes represent features (assume Boolean)
• Bottom node represents label
– Links represent direct dependence
• Absence of link represents lack of direct dependence
– Conditional Probability Tables (CPT’s)
• Each node has 2Pa entries (Pa = # parents)
– It is fairly straightforward to:
• Learn CPT entries
• Perform inference
27 February 2001
What is Confidence?
Slide 5
Bayesian Network Structures
• Structure determines expressiveness
– More parents per node = more expressive
– Directly related to the total number of CPT entries in the BN
• “Bayesian networks don’t overfit”
– Given:
• A BN structure
• A set of training examples
– There is a way of choosing CPT entries which fits the training
examples as well as possible
– Since we’re given the structure, we must assume that the best fit to
the training examples is also the best fit to the underlying
distribution
• However:
– Manually building BN structures is a lot of work
– We’d like to not only learn the CPT entries, but also learn the
correct structures
27 February 2001
What is Confidence?
Slide 6
BN’s and Overfitting
•
•
Choosing the “correct” structure is where overfitting becomes a
problem
If the goal is to maximize accuracy on the training data, then we always
prefer more expressive networks
– In our inverted-tree classifiers, it would be a naïve Bayes structure
•
•
Unfortunately, the more expressive the network, the greater the
tendency to overfit for a fixed number of training examples
Intuition:
– Fitting curves to data points
– “Spending” examples to increase confidence
•
Current approaches in addressing overfitting
– BIC, AIC, MDL, etc.
– Each network structure is given a two-part “score”
• Accuracy (the more accurate, the better)
• Expressiveness (the fewer CPT entries, the better)
– I think these rely on assumption that we have sufficiently many examples
27 February 2001
What is Confidence?
Slide 7
Confidence
•
Recall our needs:
– Given very few examples, we want a
– Quantitative measure of overfitting that is
– As exact as possible
•
Intuitive definition of “confidence” of a given BN structure
– Probability that we have seen enough examples to either accept or reject
this structure
•
Confidence and accuracy
– Low confidence: need more examples
– High confidence, low accuracy: reject this structure
– High confidence, high accuracy: accept this structure
•
Sadly, there is not enough time to cover my definition of confidence for
an inverted-tree Bayesian network classifier
– Coincidentally, I have run into certain technical difficulties in realizing a
practical algorithm for evaluating this confidence
– See me afterward for discussion
27 February 2001
What is Confidence?
Slide 8
A New Problem Domain
• Goal at the end of this section:
– Motivate a quantitative definition of confidence of a single-node
“Bayesian network” (each example has no other features except for
its Boolean label)
N H
N H
  w  1  w 
p
pheads
i
i
H  i
k
  wi  k  
N H
N H
i 1
  w j  1  w j 
 pj

j 1  H 
k

• Coin-flipping domain
i 1
– k coins
wi
H 1
 1  wi 
N H
 w  1  w 
k
j 1
N H
H
j
 pi
j
 pj
• Coin i has weight wi (probability of getting heads)
• One of these coins is picked at random (prior probability of picking coin
i is pi)
– This coin is flipped N times, and we observe heads H times
– Assuming we know the wi’s, pi’s, H, N, and the experimental setup,
we can calculate the probability that the next toss of the coin is
heads
27 February 2001
What is Confidence?
Slide 9
Confidence in Coin Flipping
•
•
•
How do we define confidence?
First, we need to define our decision algorithm
In this case it’s easy:
–
–
•
Define confidence as follows:
–
–
–
•
if pheads  0.5, then we predict “heads”
if pheads < 0.5, then we predict “tails”
Our confidence in our decision is the probability that if we saw an arbitrarily large
(infinite) number of tosses, we would still make the same decision
Seeing an infinite number of tosses is tantamount to knowing what the weight of the
coin (wi) is
In other words, confidence  Prob(make the same decision | know the coin’s weight)
Subtle point: we don’t know the coin’s weight, but we speculate that we do
–
Alternative POV: say Tasha is in the next room. She knows everything we know (wi’s,
pi’s, H, N, experimental setup). In addition, she knows the weight of the coin (wi) that
was picked. Her decision is likewise simple:
•
•
–
if wi  0.5, then predict “heads”
if wi < 0.5, then predict “tails”
Then we can restate the definition:
•
confidence = Prob(we make the same decision as Tasha)
27 February 2001
What is Confidence?
Slide 10
An Equation for Confidence
•
To repeat:
– confidence = Prob(we make the same decision as Tasha)
•
WLOG, let’s say, after calculating pheads = Prob(heads | H, N), we pick
heads (i.e., pheads  0.5)
– Then confidence = Prob(Tasha also picked heads)
– In other words: confidence   Pcoin i | H , N 
i:wi 0.5
– where
– Recall
N H
  wi  1  wi N  H  pi
H
Pcoin i | H , N   k  
N H
  w j  1  w j N  H  p j

j 1  H 
k
pheads   wi  Pcoin i | H , N 
i 1
•
Thus if we define a random variable X mapping wi to P(coin i | H, N),
then pheads = E(X) and confidence = Prob(X  0.5)
27 February 2001
What is Confidence?
Slide 11
Returning to Single-Node Network
•
•
•
Coin-flipping is a discrete domain (discrete set of coins)
Results generalize to the continuous case
Consider our Boolean-valued labeled examples
–
–
•
•
•
•
Underlying distribution (which we are trying to learn): a single number w0, which is the
probability that a given example will be labeled true
We observe N examples, with H of them labeled true
Let W be a random variable corresponding to the prior probability of the weight
w0
Let X be a random variable representing the posterior probability of the weight
w0 given H and N
It can be shown that if W has a beta distribution, then X also has a beta
distribution. In particular, if W is uniform (we have no information about the prior
probability), then Xbeta(H+1, N-H+1)
From the properties of a beta distribution, we see that
Pr(true)  E  X  
•
•
H 1
N 2
Note this is not H/N!
As before, confidence = Prob(X  0.5)
27 February 2001
What is Confidence?
Slide 12
Final Notes
• We have made few assumptions about the data (for
example, N can be small)
• We have come up with an exact, quantitative
expression for confidence (although it may be difficult
to evaluate)
• Analysis extends (not trivially) to multivariate case
(more than one node in BN)
• Defining confidence can be an important first step to
dealing with overfitting when given few examples (I
haven’t shown the next few steps)
27 February 2001
What is Confidence?
Slide 13
Summary
•
Overfitting is bad
– Overfitting is an issue any time we do learning from examples
– Often we make assumptions which allow us to assume we don't overfit
– At the very least, we should be aware of these assumptions when we do
learning
•
Too much expressiveness is bad
– Limiting expressiveness (introducing bias) not only helps to reduce the
number of examples needed to learn, but also reduces tendency to overfit
•
You can quantify overfitting
– I'm not aware of any other efforts in this direction, but it is doable and may
prove useful, especially in reducing reliance on assumptions
– To do so, you must clearly define your learning goals (not just the concept
to be learned)
– In this presentation, we define and use "confidence"
27 February 2001
What is Confidence?
Slide 14