
Pattern Recognition
Chapter 2
Bayesian Decision Theory
1
Bayesian Decision Theory
• Fundamental statistical approach to the pattern
classification problem.
• Quantifies the tradeoffs between various
classification decisions using probabilities and
the costs associated with such decisions.
– Each action is associated with a cost or risk.
– The simplest risk is the classification error.
– Design classifiers to recommend actions that minimize
some total expected risk.
2
Terminology
(using sea bass – salmon classification example)
• State of nature ω (random variable):
– ω1 for sea bass, ω2 for salmon.
• Probabilities P(ω1) and P(ω2 ) (priors)
– prior knowledge of how likely it is to get a sea bass or a
salmon
• Probability density function p(x) (evidence):
– how frequently we will measure a pattern with feature
value x (e.g., x is a lightness measurement)
Note: if x and y are different measurements, p(x) and p(y)
correspond to different pdfs: pX(x) and pY(y)
3
Terminology (cont’d)
(using sea bass – salmon classification example)
• Conditional probability density p(x/ωj) (likelihood):
– how frequently we will measure a pattern with feature
value x given that the pattern belongs to class ωj
e.g., lightness distributions
between salmon/sea-bass
populations
4
Terminology (cont’d)
(using sea bass – salmon classification example)
• Conditional probability P(ωj /x) (posterior):
– the probability that the fish belongs to class ωj
given measurement x.
Note: we will be using an uppercase P(.) to denote
a probability mass function (pmf) and a
lowercase p(.) to denote a probability density
function (pdf).
5
Decision Rule Using Priors Only
Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2
P(error) = min[P(ω1), P(ω2)]
• Favours the most likely class … (optimum if no other info is
available).
• This rule would make the same decision every time!
• Makes sense only if we must judge just one fish …
6
Decision Rule Using
Conditional pdf
• Using Bayes’ rule, the posterior probability of category ωj
given measurement x is given by:
P( j / x) 
where
p( x /  j ) P( j )
p ( x)
2
p( x)   p( x /  j ) P( j )
likelihood  prior

evidence
(scale factor – sum of probs = 1)
j 1
Decide ω1 if P(ω1 /x) > P(ω2/x); otherwise decide ω2
or
Decide ω1 if p(x/ω1)P(ω1) > p(x/ω2)P(ω2); otherwise decide ω2
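A minimal sketch of this decision rule (not from the slides; the Gaussian lightness models and the prior values are illustrative assumptions):

```python
# Two-class Bayes decision on a scalar lightness measurement x, assuming
# (for illustration only) Gaussian class-conditional densities.
from math import exp, pi, sqrt

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density N(mu, sigma^2) evaluated at x."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

# Hypothetical class models: omega1 = sea bass, omega2 = salmon.
priors      = {"sea bass": 0.6, "salmon": 0.4}                     # P(omega_j)
likelihoods = {"sea bass": lambda x: gaussian_pdf(x, 7.0, 1.5),    # p(x/omega_j)
               "salmon":   lambda x: gaussian_pdf(x, 4.0, 1.0)}

def classify(x):
    # Decide the class with the largest p(x/omega_j) * P(omega_j);
    # dividing by the evidence p(x) would not change the decision.
    return max(priors, key=lambda w: likelihoods[w](x) * priors[w])

def posterior(w, x):
    evidence = sum(likelihoods[c](x) * priors[c] for c in priors)  # p(x)
    return likelihoods[w](x) * priors[w] / evidence                # P(omega_j/x)

print(classify(5.0), posterior("salmon", 5.0))
```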
7
Decision Rule Using
Conditional pdf (cont’d)
P(1 ) 
2
3
P ( 2 ) 
1
3
8
Probability of Error
• The probability of error is defined as:
  P(error/x) = P(ω1/x) if we decide ω2
  P(error/x) = P(ω2/x) if we decide ω1
• The average probability of error is given by:
  P(error) = ∫ P(error, x) dx = ∫ P(error/x) p(x) dx
• The Bayes rule is optimum, that is, it minimizes the average
probability of error since:
P(error/x) = min[P(ω1/x), P(ω2/x)]
9
Where do Probabilities Come From?
• The Bayesian rule is optimal if the pmf or pdf is
known.
• There are two competing answers to this question:
(1) Relative frequency (objective) approach.
– Probabilities can only come from experiments.
(2) Bayesian (subjective) approach.
– Probabilities may reflect degree of belief and can be
based on opinion as well as experiments.
10
Example
• Classify cars on the UNR campus according to whether they cost more or less
than $50K:
– C1: price > $50K
– C2: price < $50K
– Feature x: height of car
• From Bayes’ rule, we know how to compute the posterior
probabilities:
P(Ci/x) = p(x/Ci)P(Ci) / p(x)
• Need to compute p(x/C1), p(x/C2), P(C1), P(C2)
11
Example (cont’d)
• Determine prior probabilities
– Collect data: ask drivers how much their car cost and measure its height.
– e.g., 1209 samples: #C1=221 #C2=988
P(C1) = 221/1209 = 0.183
P(C2) = 988/1209 = 0.817
12
Example (cont’d)
• Determine class conditional probabilities (likelihood)
– Discretize car height into bins and use normalized histogram
13
Example (cont’d)
• Calculate the posterior probability for each bin:
P(C1/x = 1.0) = p(x = 1.0/C1)P(C1) / [p(x = 1.0/C1)P(C1) + p(x = 1.0/C2)P(C2)]
             = (0.2081 × 0.183) / (0.2081 × 0.183 + 0.0597 × 0.817) = 0.438
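The same computation as a short sketch; the counts and the x = 1.0 histogram values come from the slides, while the histogram dictionary is otherwise a placeholder:

```python
# Car-height example: priors from counts, likelihoods from a normalized
# histogram, posterior per height bin via Bayes' rule.
n_c1, n_c2, n_total = 221, 988, 1209
prior_c1 = n_c1 / n_total        # 0.183
prior_c2 = n_c2 / n_total        # 0.817

# p(x = bin / Ci): only the x = 1.0 bin appears on the slide; treat these
# dictionaries as stand-ins for the full normalized histograms.
likelihood_c1 = {1.0: 0.2081}
likelihood_c2 = {1.0: 0.0597}

def posterior_c1(x_bin):
    """P(C1 / x = x_bin), with the evidence p(x) as the normalizer."""
    num = likelihood_c1[x_bin] * prior_c1
    den = num + likelihood_c2[x_bin] * prior_c2
    return num / den

print(round(posterior_c1(1.0), 3))   # 0.438, matching the slide
```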
14
A More General Theory
• Use more than one feature.
• Allow more than two categories.
• Allow actions other than classifying the input into
one of the possible categories (e.g., rejection).
• Introduce a more general error function
– loss function (i.e., associate “costs” with actions)
15
Terminology
• Features form a vector x ∈ R^d
• A finite set of c categories ω1, ω2, …, ωc
• A finite set of l actions α1, α2, …, αl
• A loss function λ(αi/ωj)
  – the loss incurred for taking action αi when the classification
  category is ωj
• Bayes rule using vector notation:
  P(ωj/x) = p(x/ωj)P(ωj) / p(x)
  where p(x) = Σj=1..c p(x/ωj)P(ωj)   (scale factor)
16
Expected Loss (Conditional Risk)
• Expected loss (or conditional risk) associated with taking action αi:
  R(αi/x) = Σj=1..c λ(αi/ωj) P(ωj/x)
• The expected loss can be minimized by selecting
the action that minimizes the conditional risk.
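A minimal sketch of conditional-risk minimization (the loss matrix and the posterior values are made-up illustrations; the third action plays the role of a reject option):

```python
# Choose the action that minimizes the conditional risk
#   R(alpha_i / x) = sum_j lambda(alpha_i / omega_j) * P(omega_j / x).

loss = [                      # loss[i][j] = lambda(alpha_i / omega_j)
    [0.0, 2.0],               # action alpha_1: decide omega_1
    [1.0, 0.0],               # action alpha_2: decide omega_2
    [0.3, 0.3],               # action alpha_3: reject
]

def conditional_risks(posteriors):
    """R(alpha_i / x) for every action, given the posteriors P(omega_j / x)."""
    return [sum(l_ij * p_j for l_ij, p_j in zip(row, posteriors)) for row in loss]

def bayes_action(posteriors):
    risks = conditional_risks(posteriors)
    return min(range(len(risks)), key=risks.__getitem__)   # index of the minimum risk

print(conditional_risks([0.7, 0.3]), bayes_action([0.7, 0.3]))
# -> [0.6, 0.7, 0.3], 2 : the reject action has the smallest conditional risk here
```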
17
Overall Risk
• Overall risk:
  R = ∫ R(α(x)/x) p(x) dx      (R(α(x)/x) is the conditional risk)
where α(x) determines which action α1, α2, …, αl to
take for every x (i.e., α(x) is a decision rule).
• To minimize R, find a decision rule α(x) that
chooses the action with the minimum conditional
risk R(ai/x) for every x.
• This rule yields optimal performance…
18
Bayes Decision Rule
• The Bayes decision rule minimizes R by:
– Computing R(αi /x) for every αi given an x
– Choosing the action αi with the minimum R(αi /x)
• Bayes risk (i.e., resulting minimum) is the best
performance that can be achieved.
R minR
*
19
Example:
Two-category classification
• Two possible actions
– α1 corresponds to deciding ω1
– α2 corresponds to deciding ω2
• Notation: λij = λ(αi/ωj)
• The conditional risks are:
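For reference, the standard two-category conditional risks with this notation are:

```latex
\begin{aligned}
R(\alpha_1/\mathbf{x}) &= \lambda_{11}\,P(\omega_1/\mathbf{x}) + \lambda_{12}\,P(\omega_2/\mathbf{x})\\
R(\alpha_2/\mathbf{x}) &= \lambda_{21}\,P(\omega_1/\mathbf{x}) + \lambda_{22}\,P(\omega_2/\mathbf{x})
\end{aligned}
```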
20
Example:
Two-category classification
• Decision rule (three equivalent forms, the last using the likelihood ratio;
see below):
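For reference, the standard equivalent forms of the two-category rule (assuming λ21 > λ11) are:

```latex
\begin{gathered}
% decide omega_1 if any of these equivalent conditions holds; otherwise decide omega_2
(\lambda_{21}-\lambda_{11})\,P(\omega_1/\mathbf{x}) > (\lambda_{12}-\lambda_{22})\,P(\omega_2/\mathbf{x})\\
(\lambda_{21}-\lambda_{11})\,p(\mathbf{x}/\omega_1)\,P(\omega_1) > (\lambda_{12}-\lambda_{22})\,p(\mathbf{x}/\omega_2)\,P(\omega_2)\\
\frac{p(\mathbf{x}/\omega_1)}{p(\mathbf{x}/\omega_2)} > \frac{\lambda_{12}-\lambda_{22}}{\lambda_{21}-\lambda_{11}}\cdot\frac{P(\omega_2)}{P(\omega_1)}
\end{gathered}
```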
21
Special Case:
Zero-One Loss Function
• It assigns the same loss to all errors:
• The conditional risk corresponding to this loss function:
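For reference, the standard zero-one loss and the conditional risk it induces are:

```latex
\lambda(\alpha_i/\omega_j) = \begin{cases} 0 & i = j \\ 1 & i \neq j \end{cases}
\qquad\Longrightarrow\qquad
R(\alpha_i/\mathbf{x}) = \sum_{j \neq i} P(\omega_j/\mathbf{x}) = 1 - P(\omega_i/\mathbf{x})
```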
22
Special Case:
Zero-One Loss Function (cont’d)
• The decision rule becomes: decide ωi if P(ωi/x) > P(ωj/x) for all j ≠ i
  (i.e., pick the maximum posterior)
• What is the overall risk in this case?
  (answer: the average probability of error)
23
Example
• θa was determined assuming a zero-one loss function:
  θa = P(ω2)/P(ω1)
• θb was determined assuming λ12 > λ21:
  θb = [P(ω2)(λ12 − λ22)] / [P(ω1)(λ21 − λ11)]
(figure: the resulting decision regions)
24
Discriminant Functions
• Functional structure of a general statistical classifier
Assign x to ωi if: gi(x) > gj(x) for all j ≠ i
(discriminant functions)
pick max
25
Discriminants for Bayes Classifier
• Using risks:
gj(x)=-R(αj/x)
• Using zero-one loss function (i.e., min error rate):
gj(x)=P(ωj/x)
• Is the choice of gj unique?
– Replacing gj(x) with f(gj(x)), where f() is monotonically
increasing, does not change the classification results.
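A minimal sketch of a discriminant-function classifier (the posterior values are made up); it also checks that a monotone transform such as the logarithm leaves the decision unchanged:

```python
# Generic discriminant-function classifier: each class omega_i supplies a
# function g_i(x); x is assigned to the class with the largest discriminant.
from math import log

def classify(x, discriminants):
    """discriminants: list of callables g_i; returns the index of the largest g_i(x)."""
    scores = [g(x) for g in discriminants]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical posteriors for a 3-class problem (functions of a scalar x).
posteriors = [lambda x: 0.2, lambda x: 0.5, lambda x: 0.3]        # P(omega_i / x)
log_posteriors = [lambda x, g=g: log(g(x)) for g in posteriors]   # f(g_i(x)), f monotone

assert classify(0.0, posteriors) == classify(0.0, log_posteriors)  # same decision
```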
26
Decision Regions and Boundaries
• Decision rules divide the feature space in decision regions
R1, R2, …, Rc
• The boundaries of the decision regions are the decision
boundaries.
g1(x)=g2(x)
at the decision
boundaries
27
Case of two categories
• More common to use a single discriminant function (dichotomizer) instead
of two: decide ω1 if g(x) > 0; otherwise decide ω2.
• Examples of dichotomizers:
g(x) = P(ω1/x) − P(ω2/x)

g(x) = ln [p(x/ω1) / p(x/ω2)] + ln [P(ω1) / P(ω2)]
28
Discriminant Function for
Multivariate Gaussian
• Assume Gaussian class-conditional densities: p(x/ωi) ~ N(μi, Σi)
• Assume the following discriminant function (written out below):
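The usual choice, written out for reference, is the log-posterior discriminant; for a multivariate Gaussian p(x/ωi) = N(μi, Σi) it expands to:

```latex
g_i(\mathbf{x}) = \ln p(\mathbf{x}/\omega_i) + \ln P(\omega_i)
  = -\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^{T}\Sigma_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i)
    - \tfrac{d}{2}\ln 2\pi - \tfrac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)
```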
29
Multivariate Gaussian Density:
Case I
• Assumption: Σi = σ²I
– Features are statistically independent
– Each feature has the same variance
favors the a-priori
more likely category
30
Multivariate Gaussian Density:
Case I (cont’d)
(slide equations: the weight vector wi and the threshold or bias wi0; see the
standard forms below)
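For reference, the standard Case I result for Σi = σ²I (after dropping terms common to all classes) is the linear discriminant:

```latex
g_i(\mathbf{x}) = \mathbf{w}_i^{T}\mathbf{x} + w_{i0},
\qquad
\mathbf{w}_i = \frac{\boldsymbol{\mu}_i}{\sigma^{2}},
\qquad
w_{i0} = -\frac{\boldsymbol{\mu}_i^{T}\boldsymbol{\mu}_i}{2\sigma^{2}} + \ln P(\omega_i)
\quad\text{(threshold or bias)}
```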
31
Multivariate Gaussian Density:
Case I (cont’d)
• Comments about this hyperplane:
  – It passes through x0.
  – It is orthogonal to the line linking the means.
  – What happens when P(ωi) = P(ωj)? (x0 lies halfway between the means.)
  – If P(ωi) ≠ P(ωj), then x0 shifts away from the more likely mean.
  – If σ is very small, the position of the boundary is insensitive to P(ωi)
  and P(ωj).
32
Multivariate Gaussian Density:
Case I (cont’d)
33
Multivariate Gaussian Density:
Case I (cont’d)
• Minimum distance classifier
– When P(ωi) is the same for each of the c classes
gi(x) = −||x − μi||²
34
Multivariate Gaussian Density:
Case II
• Assumption: Σi= Σ
35
Multivariate Gaussian Density:
Case II (cont’d)
• Comments about this hyperplane:
  – It passes through x0.
  – It is NOT orthogonal to the line linking the means.
  – What happens when P(ωi) = P(ωj)? (x0 lies halfway between the means.)
  – If P(ωi) ≠ P(ωj), then x0 shifts away from the more likely mean.
36
Multivariate Gaussian Density:
Case II (cont’d)
37
Multivariate Gaussian Density:
Case II (cont’d)
• Mahalanobis distance classifier
– When P(ωi) is the same for each of the c classes
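For reference, with equal priors the Case II discriminant reduces to the negative squared Mahalanobis distance:

```latex
g_i(\mathbf{x}) = -(\mathbf{x}-\boldsymbol{\mu}_i)^{T}\,\Sigma^{-1}\,(\mathbf{x}-\boldsymbol{\mu}_i)
```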
38
Multivariate Gaussian Density:
Case III
• Assumption: Σi = arbitrary
  – The decision boundaries are hyperquadrics,
  e.g., hyperplanes, pairs of hyperplanes, hyperspheres,
  hyperellipsoids, hyperparaboloids, etc.
39
Multivariate Gaussian Density:
Case III (cont’d)
(figure: P(ω1) = P(ω2); disconnected decision regions)
40
Multivariate Gaussian Density:
Case III (cont’d)
(figures: disconnected decision regions; non-linear decision boundaries)
41
Multivariate Gaussian Density:
Case III (cont’d)
• More examples (Σ arbitrary)
42
Multivariate Gaussian Density:
Case III (cont’d)
• A four category example
43
Example - Case III
• Decision boundary for P(ω1) = P(ω2): the boundary does not pass through
the midpoint of μ1 and μ2.
44
Missing Features
• Suppose x=(x1,x2) is a feature vector.
• What can we do when x1 is missing during classification?
– Maybe use the mean value of all x1 measurements?
– But p(x̂2/ω2) is the largest!
45
Missing Features (cont’d)
• Suppose x=[xg,xb] (xg: good features, xb: bad features)
• Derive the Bayes rule using only the good features, by marginalizing the
bad features out of the joint density (see below):
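For reference, the standard form of this marginalization is:

```latex
P(\omega_i/\mathbf{x}_g)
  = \frac{p(\omega_i,\mathbf{x}_g)}{p(\mathbf{x}_g)}
  = \frac{\int p(\omega_i,\mathbf{x}_g,\mathbf{x}_b)\,d\mathbf{x}_b}{\int p(\mathbf{x}_g,\mathbf{x}_b)\,d\mathbf{x}_b}
  = \frac{\int P(\omega_i/\mathbf{x}_g,\mathbf{x}_b)\,p(\mathbf{x}_g,\mathbf{x}_b)\,d\mathbf{x}_b}{\int p(\mathbf{x}_g,\mathbf{x}_b)\,d\mathbf{x}_b}
```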
46
Noisy Features
• Suppose x=[xg,xb] (xg: good features, xb: noisy features)
• Suppose noise is statistically independent and …
– We know noise model p(xb/xt)
– xb: observed feature values, xt: true feature values.
• Assume statistically independent noise
– if xt were known, xb would be independent of xg , ωi
P (i / x g , xb ) 
p (i , x g , xb )
p (x g , xb )

p
47
Noisy Features (cont’d)
use independence assumption
• What happens when p(xb/xt) is uniform?
48
Receiver Operating Characteristic
(ROC) Curve
• Every classifier employs some kind of threshold value.
• Changing the threshold affects the performance of the
system.
• ROC curves can help us distinguish between
discriminability and decision bias (i.e., choice of
threshold)
49
Example: Person Authentication
• Authenticate a person using biometrics (e.g., face image).
• There are two possible distributions:
– authentic (A) and impostor (I)
(figure: the impostor (I) and authentic (A) distributions, with regions labeled
correct rejection, correct acceptance, false rejection, and false positive)
50
Example: Person Authentication
(cont’d)
• Possible cases
– (1) correct acceptance (true positive):
• X belongs to A, and we decide A
– (2) incorrect acceptance (false positive):
• X belongs to I, and we decide A
– (3) correct rejection (true negative):
• X belongs to I, and we decide I
– (4) incorrect rejection (false negative):
• X belongs to A, and we decide I
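A minimal sketch (not from the slides) of how these four outcomes produce an ROC curve as the acceptance threshold is swept; the score lists are made-up stand-ins for authentic and impostor data:

```python
# Build ROC points by sweeping a decision threshold x* over authentication
# scores.  Scores at or above the threshold are accepted (decide A).
authentic_scores = [0.9, 0.8, 0.75, 0.6, 0.55]   # samples from class A
impostor_scores  = [0.7, 0.5, 0.4, 0.3, 0.2]     # samples from class I

def roc_points(auth, imp, thresholds):
    points = []
    for t in thresholds:
        tp = sum(s >= t for s in auth)   # correct acceptance (true positive)
        fn = sum(s <  t for s in auth)   # false rejection (false negative)
        fp = sum(s >= t for s in imp)    # incorrect acceptance (false positive)
        tn = sum(s <  t for s in imp)    # correct rejection (true negative)
        points.append((fp / (fp + tn), tp / (tp + fn)))   # (FPR, TPR)
    return points

for fpr, tpr in roc_points(authentic_scores, impostor_scores,
                           thresholds=[0.3, 0.5, 0.7]):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```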
51
Error vs Threshold
(figure: error probabilities as a function of the decision threshold x*)
52
False Negatives vs Positives
53
Statistical Dependences
Between Variables
• Many times, the only knowledge we have about a
distribution is which variables are or are not
dependent.
• Such dependencies can be represented graphically
using a Bayesian Belief Network (or Belief Net).
• In essence, Bayesian Nets allow us to represent a
joint probability density p(x,y,z,…) efficiently using
dependency relationships.
• p(x,y,z,…) could be either discrete or continuous.
54
Example of Dependencies
• State of an automobile
–
–
–
–
–
Engine temperature
Brake fluid pressure
Tire air pressure
Wire voltages
Etc.
• NOT causally related variables
– Engine oil pressure
– Tire air pressure
• Causally related variables
– Coolant temperature
– Engine temperature
55
Definitions and Notation
• A belief net is usually a Directed Acyclic Graph (DAG)
• Each node represents one of the system variables.
• Each variable can assume certain values (i.e., states) and each
state is associated with a probability (discrete or continuous).
56
Relationships Between Nodes
• A link joining two nodes is directional and represents a causal
influence (e.g., X depends on A or A influences X)
• Influences could be direct or indirect (e.g., A influences X
directly and A influences C indirectly through X).
57
Parent/Children Nodes
• Parent nodes P of X
– the nodes before X (connected to X)
• Children nodes C of X:
– the nodes after X (X is connected to them)
58
Conditional Probability Tables
• Every node is associated with a set of weights which
represent the prior/conditional probabilities (e.g., P(xi/aj),
i=1,2, j=1,2,3,4)
probabilities
sum to 1
59
Learning
• There exist algorithms for learning these probabilities from
data…
60
Computing Joint Probabilities
• We can compute the probability of any configuration of
variables in the joint density distribution:
e.g., P(a3, b1, x2, c3, d2)=P(a3)P(b1)P(x2/a3,b1)P(c3/x2)P(d2/x2)=
0.25 x 0.6 x 0.4 x 0.5 x 0.4 = 0.012
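The same computation as a short sketch; the CPT entries are the values used in the example, and the net structure A, B → X → C, D is taken from the factorization above:

```python
# Joint probability in the belief net A,B -> X -> C,D using
#   P(a,b,x,c,d) = P(a) P(b) P(x/a,b) P(c/x) P(d/x).
# Only the CPT entries needed for this example are filled in.
P_a = {"a3": 0.25}
P_b = {"b1": 0.6}
P_x_given_ab = {("x2", "a3", "b1"): 0.4}
P_c_given_x  = {("c3", "x2"): 0.5}
P_d_given_x  = {("d2", "x2"): 0.4}

def joint(a, b, x, c, d):
    return (P_a[a] * P_b[b] * P_x_given_ab[(x, a, b)]
            * P_c_given_x[(c, x)] * P_d_given_x[(d, x)])

print(joint("a3", "b1", "x2", "c3", "d2"))   # 0.012, as on the slide
```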
61
Computing the Probability at a Node
• E.g., determine the probability at D
62
Computing the Probability at a Node
(cont’d)
• E.g., determine the probability at H:
63
Computing Probability Given
Evidence (Bayesian Inference)
• Determine the probability of some particular configuration
of variables given the values of some other variables
(evidence).
e.g., compute P(b1/a2, x1, c1)
64
Computing Probability Given Evidence
(Bayesian Inference)(cont’d)
• In general, if X denotes the query variables and e denotes the
evidence, then
P(X/e) = P(X, e) / P(e) = α P(X, e)
where α=1/P(e) is a constant of proportionality.
65
An Example
• Classify a fish given that we only have evidence that the fish is light (c1)
and was caught in the south Atlantic (b2) -- no evidence about what time of
year the fish was caught or its thickness.
66
An Example (cont’d)
67
An Example (cont’d)
68
An Example (cont’d)
• Similarly, P(x2/c1,b2) = α 0.066
• Normalize the probabilities (not strictly necessary):
  P(x1/c1,b2) + P(x2/c1,b2) = 1   (α = 1/0.18)
  P(x1/c1,b2) = 0.63
  P(x2/c1,b2) = 0.37
salmon
69
Another Example:
Medical Diagnosis
Uppermost nodes: biological agents (bacteria, virus)
Intermediate nodes: diseases
Lowermost nodes: symptoms
• Given some evidence (biological agents, symptoms),
find most likely disease.
70
Naïve Bayes’ Rule
• When dependency relationships among features are
unknown, we can assume that features are
conditionally independent given the category:
P(a,b/x)=P(a/x)P(b/x)
• Naïve Bayes rule :
P(x/a,b) = P(a,b/x)P(x) / P(a,b) = P(a/x)P(b/x)P(x) / P(a,b)
• Simple assumption but … usually works well in
practice.
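A minimal sketch of the naive Bayes rule (all probability values are made up), normalizing P(a/x)P(b/x)P(x) over the classes so that dividing by P(a,b) is implicit:

```python
# Naive Bayes: assume features a, b are conditionally independent given the
# class x, so P(x/a,b) is proportional to P(a/x) * P(b/x) * P(x).
priors = {"x1": 0.6, "x2": 0.4}                       # P(x)
P_a_given_x = {("a1", "x1"): 0.7, ("a1", "x2"): 0.2}  # P(a/x)
P_b_given_x = {("b1", "x1"): 0.4, ("b1", "x2"): 0.9}  # P(b/x)

def posterior(a, b):
    """Return P(x/a,b) for every class x, normalized over the classes."""
    scores = {x: P_a_given_x[(a, x)] * P_b_given_x[(b, x)] * priors[x] for x in priors}
    total = sum(scores.values())                      # plays the role of P(a,b)
    return {x: s / total for x, s in scores.items()}

print(posterior("a1", "b1"))   # {'x1': 0.7, 'x2': 0.3}
```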
71