Bayesian Decision Theory
• Basic Concepts
• Discriminant Functions
• The Normal Density
• Decision Theory for Discrete Features
Figure 1.1
Basic Concepts
Definitions.
State of nature.
Let c denote the state of nature. c is a random variable.
(e.g., c = c1 for sea bass or c = c2 for salmon)
A priori probabilities.
Let P(c1) and P(c2) denote the a priori probability of c1 and c2 respectively.
We know P(c1) + P(c2) = 1.
Decision rule.
Decide c1 if P(c1) > P(c2); otherwise choose c2.
But deciding from the priors alone, we always choose the same class, so we know there will be many mistakes…
Basic Concepts
Class-conditional probability density function.
Let x be a continuous random variable.
p(x|c) is the probability density for x given the state of nature c.
For example, what is the probability of lightness given that the
class is salmon?
p(lightness | salmon) ?
Or what is the probability of lightness given that the class is sea bass?
p(lightness | sea bass) ?
Figure 2.1
Bayes Formula
How do we combine the a priori and class-conditional probabilities
to obtain the probability of a state of nature given an observation x?
Bayes Formula.
P(cj | x) = p(x|cj) P(cj) / p(x)
posterior probability = (likelihood × prior probability) / evidence
Bayes Decision:
Choose c1 if P(c1|x) > P(c2|x); otherwise choose c2.
Figure 2.2
Simplifying Bayes Rule
Bayes Formula.
P(cj | x) = p(x|cj) P(cj) / p(x)
The evidence p(x) is the same for all states or classes so
we can dispense with it to make a decision.
Rule:
Choose c1 if p(x|c1)P(c1) > p(x|c2)P(c2); otherwise decide c2.
Observations:
• If p(x|c1) = p(x|c2) then the decision depends on the prior probabilities only.
• If P(c1) = P(c2) then the decision depends on the likelihoods only.
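A minimal sketch in Python of the rule above (the class names, priors, and Gaussian likelihood parameters are made-up illustration values, not taken from the slides): it computes the posteriors with Bayes formula and checks that comparing p(x|ci)P(ci) directly, with the evidence dropped, gives the same decision.

import math

# Hypothetical priors and class-conditional (Gaussian) parameters for lightness.
priors = {"salmon": 0.4, "sea bass": 0.6}
params = {"salmon": (3.0, 1.0), "sea bass": (6.0, 1.5)}    # (mean, std dev)

def likelihood(x, c):
    """Class-conditional density p(x|c), assumed univariate normal."""
    mu, sigma = params[c]
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def posteriors(x):
    """P(c|x) = p(x|c) P(c) / p(x) for every class c."""
    joint = {c: likelihood(x, c) * priors[c] for c in priors}
    evidence = sum(joint.values())                         # p(x)
    return {c: joint[c] / evidence for c in joint}

x = 4.2
post = posteriors(x)
full = max(post, key=post.get)                             # argmax P(c|x)
simplified = max(priors, key=lambda c: likelihood(x, c) * priors[c])
print(post, full, simplified)                              # both rules pick the same class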
Thomas Bayes
Born in London (1701).
Studied logic and theology (Univ. of Edinburgh).
Fellow of the Royal Society (1742).
Given white and black balls in an urn, what is the probability of drawing one or the other?
Given one or more balls, what can be said about the number of balls in the urn?
Minimizing Error
What is the probability of error?
P(error | x) = P(c1|x) if we decide c2
P(error | x) = P(c2|x) if we decide c1
Does Bayes rule minimize the probability of error?
P(error) = ∫ P(error, x) dx = ∫ P(error|x) p(x) dx
If for every x we choose the class with the larger posterior, P(error|x) is as small as it can be for that x, so the integral is minimized.
The answer is "yes": the Bayes decision rule minimizes the probability of error.
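A quick numerical check under assumed values (two Gaussian likelihoods with hypothetical means and equal priors): it integrates P(error|x) p(x) on a grid for the Bayes rule and for an arbitrary fixed threshold; the Bayes rule gives the smaller error.

import numpy as np

# Hypothetical two-class problem.
P1, P2 = 0.5, 0.5
u1, u2, sigma = 2.0, 4.0, 1.0

def gauss(x, u, s):
    return np.exp(-0.5 * ((x - u) / s) ** 2) / (s * np.sqrt(2 * np.pi))

x = np.linspace(-5.0, 11.0, 20001)
dx = x[1] - x[0]
joint1 = gauss(x, u1, sigma) * P1                  # P(c1, x)
joint2 = gauss(x, u2, sigma) * P2                  # P(c2, x)

# Bayes rule: at each x the error contribution is the smaller joint probability.
bayes_error = np.minimum(joint1, joint2).sum() * dx

# An arbitrary rule: decide c1 whenever x < 2.5, otherwise c2.
arbitrary_error = np.where(x < 2.5, joint2, joint1).sum() * dx

print(bayes_error, arbitrary_error)                # Bayes error is the smaller one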
Loss Function
Let {c1,c2, …} be the possible states of nature.
Let {a1, a2, …} be the possible actions.
Loss function
λ(ai|cj) is the loss incurred for taking action ai when
the state of nature is cj.
Suppose we observe a particular x and think about taking action ai.
If the true state is cj, the loss will be λ(ai|cj).
Expected loss
R(ai|x) = Σj λ(ai|cj) P(cj|x) (this is also called the conditional risk.)
Decision: Select the action that minimizes the conditional risk.
(This gives the best possible performance.)
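A minimal sketch with made-up losses and posteriors (none of these numbers appear in the slides): it evaluates R(ai|x) = Σj λ(ai|cj) P(cj|x) for each action and selects the action with the smallest conditional risk.

# Hypothetical loss matrix: loss[i][j] = λ(ai|cj).
loss = [
    [0.0, 2.0],    # action a1
    [5.0, 0.0],    # action a2
]
posteriors = [0.3, 0.7]                            # assumed P(c1|x), P(c2|x)

def conditional_risk(i):
    """R(ai|x) = Σj λ(ai|cj) P(cj|x)."""
    return sum(l * p for l, p in zip(loss[i], posteriors))

risks = [conditional_risk(i) for i in range(len(loss))]
best = min(range(len(risks)), key=risks.__getitem__)
print(risks, "-> take action a%d" % (best + 1))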
Two Categories
How does this apply when we only have two categories?
Assume λij = λ (ai|cj).
R(a1|x) = λ11 P(c1|x) + λ12 P(c2|x)
R(a2|x) = λ21 P(c1|x) + λ22 P(c2|x)
Decide a1 if R(a1|x) < R(a2|x), i.e., if
(λ21 - λ11) P(c1|x) > (λ12 - λ22) P(c2|x)
Two Categories
Another way of stating this is as follows.
Decide a1 if
p(x|c1) / p(x|c2) > [(λ12 - λ22) / (λ21 - λ11)] [P(c2) / P(c1)]
The left-hand side is the likelihood ratio; the right-hand side is a constant that does not depend on x.
Stated as:
Choose a1 if the likelihood ratio exceeds a threshold value
independent of the observation x.
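A short sketch of the threshold form, reusing the illustrative losses, priors, and Gaussian likelihoods assumed in the earlier snippets: the threshold is built from the losses and priors, and a1 is chosen when the likelihood ratio exceeds it.

import math

# Assumed losses λij = λ(ai|cj), priors, and Gaussian class-conditionals.
l11, l12, l21, l22 = 0.0, 2.0, 5.0, 0.0
P1, P2 = 0.5, 0.5
params = {1: (2.0, 1.0), 2: (4.0, 1.0)}            # (mean, std dev) per class

def p(x, c):
    u, s = params[c]
    return math.exp(-0.5 * ((x - u) / s) ** 2) / (s * math.sqrt(2 * math.pi))

threshold = ((l12 - l22) / (l21 - l11)) * (P2 / P1)   # constant, independent of x

x = 3.1
ratio = p(x, 1) / p(x, 2)                             # likelihood ratio
print("decide a1" if ratio > threshold else "decide a2")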
Figure 2.3
Zero-One Loss
We will normally be concerned with the
symmetrical or zero-one loss function:
λ(ai|cj) = 0 if i = j
           1 if i ≠ j
In this case the conditional risk is:
R(ai|x) = Σj λ(ai|cj) P(cj|x)
= Σj≠i P(cj|x)
= 1 – P(ci|x)
so minimizing the conditional risk amounts to choosing the class with the largest posterior P(ci|x).
Exercise
This question refers to Bayes's classifiers.
Let {C1,C2, …, Cn} be the possible states of nature or classes.
Let {a1, a2, …, ak} be the possible actions. We define a loss function λ(ai|Cj)
as the loss incurred for taking action ai when the state of nature is Cj.
Let's define a new loss function as follows:
λ(ai|Cj) = 0 if i = j and
λ(ai|Cj) = 3 if i ≠ j
Assume a uniform distribution on the posterior probabilities P(Cj|x)
(i.e., P(Cj|x) = 1/n for every class Cj).
What is the expected risk R(ai | x) ?
Exercise
R(ai|x) = Σj λ(ai|Cj) P(Cj|x)
= Σj λ(ai|Cj) (1/n)
= (1/n) · 3(n – 1)
= 3 – 3/n
= 3 (1 – 1/n)
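A quick numerical check of this result for a specific (assumed) number of classes: it builds the 0/3 loss matrix, sets every posterior to 1/n, and confirms the risk equals 3(1 – 1/n).

n = 4                                              # assumed number of classes
loss = [[0 if i == j else 3 for j in range(n)] for i in range(n)]
posteriors = [1.0 / n] * n                         # uniform posteriors P(Cj|x) = 1/n

risk = sum(loss[0][j] * posteriors[j] for j in range(n))   # R(a1|x)
print(risk, 3 * (1 - 1 / n))                       # both print 2.25 for n = 4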
Discriminant Functions
How do we represent pattern classifiers?
The most common way is through discriminant functions.
Remember we use {c1,c2, …} to be the possible states of nature.
For each class we create a discriminant function gi(x).
The classifier assigns class ci if
gi(x) > gj(x) for all j ≠ i
Our classifier is a network or machine that computes c discriminant functions.
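A minimal sketch of such a machine (the three discriminant functions are illustrative placeholders; in practice gi(x) would come from, e.g., p(x|ci) P(ci)): it evaluates one gi per class and assigns the class whose discriminant is largest.

# Placeholder discriminant functions g_i(x); not taken from the slides.
discriminants = [
    lambda x: 2.0 * x - 1.0,      # g1
    lambda x: -0.5 * x + 3.0,     # g2
    lambda x: 0.1 * x + 1.0,      # g3
]

def classify(x):
    """Assign class ci such that gi(x) > gj(x) for all j != i (argmax)."""
    scores = [g(x) for g in discriminants]
    return scores.index(max(scores)) + 1           # classes numbered c1, c2, ...

print(classify(0.5), classify(3.0))                # e.g. class 2, then class 1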
Figure 2.5
Simplifying the Discriminant Functions
Notice the decision is the same if we replace every gi(x) with f(gi(x)),
provided f(·) is a monotonically increasing function.
Alternatives:
• gi(x) = P(ci|x)
• gi(x) = p(x|ci) P(ci)
• gi(x) = ln p(x|ci) + ln P(ci) (where ln is the natural logarithm)
The net effect is to divide the feature space into c regions (one for each class).
We then have c decision regions separated by decision boundaries.
Figure 2.6
Normal Density
Recall that using Bayes rule a discriminant function needs two probabilities:
gi(x) = p(x|ci) P(ci)
How do we model the likelihood or class-conditional probability?
The most common approach is to assume a multivariate normal density.
In one dimension:
p(x) = [1 / (σ √(2π))] exp( -½ [(x – u) / σ]² )
u = E[x] = ∫ x p(x) dx
σ² = E[(x – u)²] = ∫ (x – u)² p(x) dx
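A small sketch combining the normal density with the log discriminant gi(x) = ln p(x|ci) + ln P(ci) from the previous slide (means, standard deviations, and priors are assumed for illustration):

import math

# Assumed per-class parameters: mean u, standard deviation sigma, prior P(ci).
classes = {
    "c1": {"u": 2.0, "sigma": 1.0, "prior": 0.6},
    "c2": {"u": 5.0, "sigma": 1.5, "prior": 0.4},
}

def normal_pdf(x, u, sigma):
    """p(x) = 1/(sigma * sqrt(2*pi)) * exp(-0.5 * ((x - u)/sigma)**2)."""
    return math.exp(-0.5 * ((x - u) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def g(x, c):
    """Discriminant gi(x) = ln p(x|ci) + ln P(ci)."""
    prm = classes[c]
    return math.log(normal_pdf(x, prm["u"], prm["sigma"])) + math.log(prm["prior"])

x = 3.4
print(max(classes, key=lambda c: g(x, c)))          # class with the largest discriminant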
Figure 2.7
Normal Density
If the features are statistically independent and all classes share the same
variance σ² for every feature (i.e., Σi = σ²I), the discriminant functions are simple and linear in x.
A classifier that uses linear discriminant functions is called
a linear machine.
The decision surfaces are pieces of hyperplanes defined
by linear equations.
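A sketch of this special case under assumed means, a shared variance, and priors: with Σi = σ²I each gi reduces to a linear function wi·x + wi0, where wi = ui/σ² and wi0 = −ui·ui/(2σ²) + ln P(ci).

import numpy as np

sigma2 = 1.0                                        # shared variance (assumed)
means  = {"c1": np.array([0.0, 0.0]),               # assumed class means
          "c2": np.array([3.0, 1.0])}
priors = {"c1": 0.5, "c2": 0.5}

def g(x, c):
    """Linear discriminant: g_i(x) = w_i . x + w_i0."""
    u = means[c]
    w = u / sigma2
    w0 = -u.dot(u) / (2 * sigma2) + np.log(priors[c])
    return w.dot(x) + w0

x = np.array([1.0, 0.2])
print(max(means, key=lambda c: g(x, c)))            # side of the hyperplane x falls on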
Figure 2.10
Figure 2.11
Signal Detection Theory
Suppose we want to detect a single pulse from a signal.
We assume the signal has some random noise.
When the signal is present we observe a normal distribution with mean u2.
When the signal is not present we observe a normal distribution with mean u1.
We assume both distributions have the same standard deviation σ.
Can we measure the discriminability of the problem?
Can we do so independently of the decision threshold x*?
Discriminability:
d’ = | u2 – u1 | / σ
Figure 2.19
Signal Detection Theory
How do we find d’ if we do not know u1, u2, or x*?
From the data we can compute:
1. P(x > x* | c2): a hit.
2. P(x > x* | c1): a false alarm.
3. P(x < x* | c2): a miss.
4. P(x < x* | c1): a correct rejection.
If we plot the hit rate against the false-alarm rate as the threshold x* varies,
we obtain a ROC (receiver operating characteristic) curve.
With it we can distinguish between discriminability and bias.
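A sketch of estimating these rates from data and tracing a ROC curve, using synthetic samples drawn under assumed means and a shared standard deviation:

import numpy as np

rng = np.random.default_rng(0)
u1, u2, sigma = 0.0, 2.0, 1.0                       # assumed no-signal / signal means

x_noise  = rng.normal(u1, sigma, 10000)             # samples from c1 (signal absent)
x_signal = rng.normal(u2, sigma, 10000)             # samples from c2 (signal present)

def rates(threshold):
    """Hit rate P(x > x*|c2) and false-alarm rate P(x > x*|c1)."""
    hit = np.mean(x_signal > threshold)
    false_alarm = np.mean(x_noise > threshold)
    return hit, false_alarm

# Sweeping the threshold x* traces out the ROC curve (one point per threshold).
roc = [rates(t) for t in np.linspace(-3.0, 5.0, 50)]

print(rates(1.0))                                    # one operating point
print("d' =", abs(u2 - u1) / sigma)                  # discriminability, independent of x*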
Figure 2.20
Points to Remember
• Bayes formula and Bayes decision theory
• Loss functions (e.g., zero-one loss)
• Discriminant functions
• ROC curves
• Conditional independence
• Naïve Bayes