Transcript Document
Bayesian Decision Theory
Dr. Ghassabi
[email protected]
Spring 2015
Course Link: cipcv.ir/lessons
1
Outline
What is pattern recognition?
What is classification?
Need for Probabilistic Reasoning
Probabilistic Classification Theory
What is Bayesian Decision Theory?
HISTORY
PRIOR PROBABILITIES
CLASS-CONDITIONAL PROBABILITIES
BAYES FORMULA
A Casual Formulation
Decision
2
Outline
What is Bayesian Decision Theory?
Decision for Two Categories
3
Outline
What is classification?
Classification by Bayesian
Classification
Basic Concepts
Bayes Rule
More General Forms of Bayes Rule
Discriminated Functions
Bayesian Belief Networks
4
What is pattern recognition?
TYPICAL APPLICATIONS OF PR
IMAGE PROCESSING EXAMPLE
• Sorting Fish: incoming fish are
sorted according to species
using optical sensing (sea
bass or salmon?)
• Problem Analysis:
set up a camera and take
some sample images to
extract features
Consider features such as
length, lightness, width,
number and shape of fins,
position of mouth, etc.
Sensing
Segmentation
Feature Extraction
5
Pattern Classification System
Preprocessing
Feature Extraction
Segment (isolate) fishes from one another
and from the background
Reduce the data by measuring certain
features
Classification
Divide the feature space into decision regions
6
7
Classification
Initially use the length of the fish as a
possible feature for discrimination
8
TYPICAL APPLICATIONS
LENGTH AS A DISCRIMINATOR
• Length is a poor discriminator
9
Feature Selection
The length is a poor feature alone!
Select the lightness as a possible
feature
10
TYPICAL APPLICATIONS
ADD ANOTHER FEATURE
• Lightness is a better feature than length because it
reduces the misclassification error.
• Can we combine features in such a way that we improve
11
performance? (Hint: correlation)
Feature Vector
Adopt the lightness and add the width of
the fish to the feature vector
Fish
xT = [x1, x2]
Lightness
Width
12
TYPICAL APPLICATIONS
WIDTH AND LIGHTNESS
Straight line decision boundary
• Treat features as a N-tuple (two-dimensional vector)
• Create a scatter plot
• Draw a line (regression) separating the two classes
13
Features
We might add other features that are not
highly correlated with the ones we already
have. Be sure not to reduce the
performance by adding “noisy features”
Ideally, you might think the best decision
boundary is the one that provides optimal
performance on the training data (see the
following figure)
14
Threshold decision boundary
and cost relationship
Move decision boundary toward smaller
values of lightness in order to minimize
the cost (reduce the number of sea bass
that are classified salmon!)
Task of decision theory
15
TYPICAL APPLICATIONS
DECISION THEORY
• Can we do better than a linear classifier?
Is this a good decision boundary?
• What is wrong with this decision surface?
(hint: generalization)
16
Decision Boundary Choice
Our satisfaction is premature because the
central aim of designing a classifier is to
correctly classify new (test) input
Issue of generalization!
17
TYPICAL APPLICATIONS
GENERALIZATION AND RISK
Better decision boundary
• Why might a smoother decision surface be a better
choice? (hint: Occam’s Razor).
• PR investigates how to find such “optimal” decision
surfaces and how to provide system designers with the
18
tools to make intelligent trade-offs.
Need for Probabilistic Reasoning
Most everyday reasoning is based on uncertain
evidence and inferences.
Classical logic, which only allows conclusions to
be strictly true or strictly false, does not account
for this uncertainty or the need to weigh and
combine conflicting evidence.
Todays expert systems employed fairly ad hoc
methods for reasoning under uncertainty and for
combining evidence.
19
Probabilistic Classification Theory
بنابراین هر الگو با، کالسها با هم همپوشانی دارندclassification در
.یک احتمالی به یک کالس متعلق است
In practice, some overlap between classes and
random variation within classes occur, hence
perfect separation between classes can not be
achieved: Misclassification may occur.
اجازه می دهد تا تابع هزینه ای ارائهBayesian decision تئوری
شود برای موقعی که ممکن است که یک بردار ورودی که عضوی از کالس
(misclassify) . نسبت داده شودB است اشتباها به کالسA
20
What is Bayesian Decision Theory?
HISTORY
Bayesian Probability was
named after Reverend Thomas
Bayes (1702-1761).
He proved a special case of
what is currently known as the
Bayes Theorem.
The term “Bayesian” came into
use around the 1950’s.
21
http://en.wikipedia.org/wiki/Bayesian_probability
HISTORY (Cont.)
Pierre-Simon, Marquis de Laplace (17491827) independently proved a generalized
version of Bayes Theorem.
1970 Bayesian Belief Network at Stanford
University (Judea Pearl 1988)
The idea’s proposed above was not fully
developed until later. BBN became popular in
the 1990s.
22
HISTORY (Cont.)
Current uses of Bayesian Networks:
Microsoft’s printer troubleshooter.
Diagnose diseases (Mycin).
Used to predict oil and stock prices
Control the space shuttle
Risk Analysis – Schedule and Cost Overruns.
23
BAYESIAN DECISION THEORY
PROBABILISTIC DECISION THEORY
Bayesian decision theory is a fundamental
statistical approach to the problem of pattern
classification.
Using probabilistic approach to help
making decision (e.g., classification)
so as to minimize the risk (cost).
Assume all relevant probability distributions are
known (later we will learn how to estimate these
from data).
24
BAYESIAN DECISION THEORY
PRIOR PROBABILITIES
State of nature is prior information
denote the state of nature
Model as a random variable, :
= 1: the event that the next fish is a sea bass
category 1: sea bass; category 2: salmon
• A priori probabilities:
P(1) = probability of category 1
P(2) = probability of category 2
But we know there will be many mistakes ….
P(1) + P( 2) = 1 (either 1 or 2 must occur)
• Decision rule
Decide 1 if P(1) > P(2); otherwise, decide 2
25
BAYESIAN DECISION THEORY
CLASS-CONDITIONAL PROBABILITIES
A decision rule with only prior information always
produces the same result and ignores measurements.
If P(1) >> P( 2), we will be correct most of the time.
• Given a feature, x (lightness), which is a
continuous random variable, p(x|2) is the classconditional probability density function:
• p(x|1) and p(x|2) describe the difference in
lightness between populations of sea and salmon.
26
Let x be a continuous random variable.
p(x|w) is the probability density for x
given the state of nature w.
p(lightness | salmon) ?
P(lightness | sea bass) ?
27
BAYESIAN DECISION THEORY
BAYES FORMULA
How do we combine a priori and class-conditional
probabilities to know the probability of a state of nature?
Suppose we know both P(j) and p(x|j), and we can
measure x. How does this influence our decision?
The joint probability that of finding a pattern that is in
category j and that this pattern has a feature value of x is:
p( j , x ) P j x p x p x j P j
• Rearranging terms, we arrive at Bayes formula.
28
BAYESIAN THEOREM
A special case of Bayesian
Theorem:
P(A∩B) = P(B) x P(A|B)
P(B∩A) = P(A) x P(B|A)
Since P(A∩B) = P(B∩A),
P(B) x P(A|B) = P(A) x P(B|A)
=> P(A|B) = [P(A) x P(B|A)] / P(B)
A
B
P( A) P( B | A)
P( A) P( B | A)
P( A | B)
P( B)
P APB | A P A P B | A
29
Preliminaries and Notations
i {1 , 2 ,, c } :
a state of nature
P(i ) : prior probability
x : feature vector
class-conditional
p(x | i ) :
density
P(i | x) : posterior probability
30
A Casual
FormulationDECISION THEORY
BAYESIAN
•The prior probability reflects knowledge of the
POSTERIOR PROBABILITIES
relative frequency of instances of a class
•The likelihood is a measure of the probability that a
Bayes
formula:
measurement
value occurs in
p ax class
j P j
•The evidence is P
a scaling
j x term
p x
can be expressed in words as:
likelihood prior
posterior
evidence
By measuring x, we can convert the prior probability,
P(j), into a posterior probability, P(j|x).
Evidence can be viewed as a scale factor and is often
ignored in optimization applications (e.g., speech
recognition). Bayes Decision:
31
Choose w1 if P(w1|x) > P(w2|x); otherwise choose w2.
Decision
p(x | i ) P(i )
P(i | x)
p ( x)
D (x) arg max P(i | x)
i
Decide i if P(i|x) > P(j|x)
ji
The evidence, p(x), is a scale factor that assures conditional probabilities sum
to 1: P(1|x)+P(2|x)=1
We can eliminate the scale factor (which appears on both sides of the equation):
Decide i if p(x|i)P(i) > p(x|j)P(j) j i
decision is based entirely
Special cases:
on the likelihood, p(x|j).
1. P(1)=P(2)= =P(c)
x gives us no
2. p(x|1)=p(x|2) = = p(x|c) useful information
32
Decide i if P(i|x) > P(j|x)
ji
Decide i if p(x|i)P(i) > p(x|j)P(j) j i
Two Categories
Decide 1 if P(1|x) > P(2|x); otherwise decide 2
Decide 1 if p(x|1)P(1) > p(x|2)P(2); otherwise decide 2
Special cases:
1. P(1)=P(2)
Decide 1 if p(x|1) > p(x|2); otherwise decide 2
2. p(x|1)=p(x|2)
Decide 1 if P(1) > P(2); otherwise decide 2
33
Example
Special cases:
1. P(1)=P(2)
Decide 1 if p(x|> p(x|2); otherwise decide 1
2. p(x|1)=p(x|2)
Decide 1 if P(1) > P(2); otherwise decide 2
R2
R1
P(1)=P(2)
34
Example
P(1)=2/3
P(2)=1/3
R2
R2
R1
R1
Bayes Decision Rule
Decide 1 if p(x|1)P(1) > p(x|2)P(2); otherwise decide 352
BAYESIAN DECISION THEORY
POSTERIOR PROBABILITIES
Two-class fish sorting problem (P(1) = 2/3, P(2) = 1/3):
For every value of x, the posteriors sum to 1.0.
At x=14, the probability it is in category 2 is 0.08, and for
36
category 1 is 0.92.
BAYESIAN DECISION THEORY
BAYES DECISION RULE
Classification Error
Decision rule:
For an observation x, decide 1 if P(1|x) > P(2|x); otherwise,
decide 2
Probability of error:
P ( 2 x ) x 1
P error | x
P ( 1 x ) x 2
The average probability of error is given by:
P ( error ) P ( error , x )dx P ( error | x ) p( x )dx
Consider two categories:
P ( error | x ) min[ P ( 1 x ), P ( 2 x )]
If for every x we ensure that P(error|x) is as small as possible,
then the integral is as small as possible. Thus, Bayes decision
37
rule for minimizes P(error).
CONTINUOUS FEATURES
GENERALIZATION OF TWO-CLASS PROBLEM
Generalization
of the preceding ideas:
Use of more than one feature
(e.g., length and lightness)
Use more than two states of nature
(e.g., N-way classification)
Allowing actions other than a decision to decide on the state
of nature (e.g., rejection: refusing to take an action when
alternatives are close or confidence is low)
Introduce a loss of function which is more general than the
probability of error
(e.g., errors are not equally costly)
Let us replace the scalar x by the vector x in a
d-dimensional Euclidean space, Rd, called
the feature space.
38
The Generation
{1 , 2 ,, c } :
{1 , 2 ,, a } :
LOSS FUNCTION
ij ( i | j ) :
can be zero.
a set of c states of nature or
c categories
a set of a possible actions
The loss incurred for
taking action i when the
true state of nature is j.
We want to minimize the expected loss in making decision.
Risk
39
Examples
Ex 1: Fish classification
X= is the image of fish
x =(brightness, length, fin
#, etc.)
is our belief what the fish
type is
C
= {“sea bass”, “salmon”,
“trout”, etc}
is a decision for the
fish type, in this case
= {“sea bass”, “salmon”,
“trout”, “manual expection
needed”, etc}
Ex 2: Medical diagnosis
C
X= all the available medical
tests, imaging scans that a
doctor can order for a patient
x =(blood pressure, glucose
level, cough, x-ray, etc.)
is an illness type
={“Flu”, “cold”, “TB”,
“pneumonia”, “lung cancer”,
etc}
is a decision for treatment,
= {“Tylenol”, “Hospitalize”,
“more tests needed”, etc}
40
P( j | x)
Conditional Risk
p(x | j ) P( j )
p ( x)
c
p( x) p(x | j )P( j )
j 1
c
c
j 1
j 1
R( i | x) ( i | j ) P( j | x) ij P( j | x)
Given
x, the expected loss (risk)
associated with taking action
i.
41
0/1 Loss Function
c
c
j 1
j 1
R( i | x) ( i | j ) P( j | x) ij P( j | x)
0 i is a correct decision assiciated with j
( i | j )
P ( 2 x ) x 1
1 otherwise
P error | x
P ( 1 x ) x 2
R(i | x) P(error | x)
42
Decision
c
c
j 1
j 1
R( i | x) ( i | j ) P( j | x) ij P( j | x)
A general decision rule is a function (x) that tells us
which action to take for every possible observation.
Bayesian Decision Rule:
(x) arg min R( i | x)
i
43
Overall Risk
The overall risk is given by:
Compute the conditional risk for every
and select the action that minimizes
R(i|x). This is denoted R*, and is
referred to as the Bayes risk
R R( (x) | x) p(x)dx
Decision function
The Bayes risk is the best performance
that
can beisachieved.
If we choose (x) so that
R(i(x))
as small as possible for every x,
the overall risk will be minimized.
Bayesian decision rule: (x) arg min R( i | x)
i
the optimal one to minimize the overall risk
Its resulting overall risk is called the Bayesian risk
44
Two-Category Classification
Let
1 correspond to 1, 2 to 2, and ij = (i|j)
State of Nature
{1 , 2}
The
Action
{1 , 2 }
Loss Function
1
2
1 2
11 12
21 22
conditional risk is given by:
R(1 | x) 11P(1 | x) 12 P(2 | x)
R( 2 | x) 21P(1 | x) 22 P(2 | x)
45
Our decision rule is:
choose 1 if: R(1|x) < R(2|x);
otherwise decide 2
Two-Category Classification
Perform 1 if R(2|x) > R(1|x); otherwise perform 2
21P(1 | x) 22 P(2 | x) 11P(1 | x) 12 P(2 | x)
(21 11) P(1 | x) (12 22 ) P(2 | x)
R(1 | x) 11P(1 | x) 12 P(2 | x)
R( 2 | x) 21P(1 | x) 22 P(2 | x)
46
Two-Category Classification
Perform 1 if R(2|x) > R(1|x); otherwise perform 2
21P(1 | x) 22 P(2 | x) 11P(1 | x) 12 P(2 | x)
(21 11) P(1 | x) (12 22 ) P(2 | x)
positive
positive
Posterior probabilities are scaled before comparison.
If the loss incurred for making an error is greater than
that incurred for being correct, the factors (21- 11) and
(12- 22) are positive, and the ratio of these factors simply
scales the posteriors.
47
p(x | i ) P(i )
P(i | x)
p ( x)
irrelevan
t
Two-Category Classification
Perform 1 if R(2|x) > R(1|x); otherwise perform 2
21P(1 | x) 22 P(2 | x) 11P(1 | x) 12 P(2 | x)
(21 11) P(1 | x) (12 22 ) P(2 | x)
By employing Bayes formula, we can replace the posteriors
by the prior probabilities and conditional densities:
(21 11) p(x | 1 ) P(1 ) (12 22 ) p(x | 2 ) P(2 )
p(x | 1 ) (12 22 ) P(2 )
p(x | 2 ) (21 11 ) P(1 )
48
This slide will be recalled later.
Two-Category Classification
Stated as:
Choose a1 if the likelihood ration exceeds a
threshold value independent of the observation x.
Likelihood
Ratio
Perform 1 if
Threshold
p(x | 1 ) (12 22 ) P(2 )
p(x | 2 ) (21 11 ) P(1 )
49
50
Loss Factors
0
1
1
0
1 1
1 1
1 0
0 1
0 0
1
1
0 1
0
1
If the loss factors are identical, and the prior probabilities
are equal, this reduces to a standard likelihood ratio:
p( x | 1 )
choose 1 if :
1
p( x | 2 )
51
MINIMUM ERROR RATE
Error rate (the probability of error) is to be minimized
Consider a symmetrical or zero-one loss function:
0 i j
( i j )
i , j 1,2 ,..., c
1 i j
The conditional risk is:
c
R( i x ) R( i j )P ( j x)
j 1
c
P ( j x)
ji
1 P ( i x)
The conditional risk is the average probability of error.
To minimize error, maximize P(i|x) — also known as
maximum a posteriori decoding (MAP).
52
Minimum Error Rate
LIKELIHOOD RATIO
Minimum
error rate classification:
choose i if: P(i| x) > P(j| x) for all ji
53
Example
For sea bass population, the lightness x is a normal
random variable distributes according to N(4,1);
for salmon population x is distributed according to
N(10,1);
Select the optimal decision where:
a.
The two fish are equiprobable
b.
P(sea bass) = 2X P(salmon)
c.
The cost of classifying a fish as a salmon when it
truly is seabass is 2$, and t The cost of classifying a
fish as a seabass when it is truly a salmon is 1$.
54
2
55
56
57
How to define discriminant functions?
The Multicategory Classification
How do we represent pattern classifiers?
The most common way is through discriminant functions.
Remember we use {w1,w2, …, wc} to be the possible states of nature.
For each class we create a discriminant function gi(x).
g1(x)
x
g2(x)
gi(x)’s are called the
discriminant functions.
Action
(x)
(e.g., classification)
Our classifier is a network or machine
that computes c discriminant functions.
gc(x)
The classifier
Assign x to i if
gi(x) > gj(x) for all j i.
58
If f(.) is a monotonically increasing function,
than f(gi(.) )’s are also be discriminant functions.
Simple Discriminant Functions
Notice the decision is the same if we change every gi(x) for f(gi(x))
Assuming f(.) is a monotonically increasing function.
Minimum Risk case:
gi (x) R( i | x)
Minimum Error-Rate case:
gi (x) P(i | x)
gi (x) p(x | i ) P(i )
gi (x) ln p(x | i ) ln P(i )
59
Figure 2.5
60
Decision Regions
The net effect is to divide the feature space into c regions (one for each
class). We then have c decision regions separated by decision boundaries.
R i {x | g i (x) g j (x) j i}
Two-category example
Decision regions are
separated by decision
boundaries.
61
Figure 2.6
62
Bayesian Decision Theory
(Classification)
The Normal
Distribution
63
Basics of Probability
Discrete random variable (X) - Assume integer
Probability mass function (pmf):
Cumulative distribution function (cdf):
p( x) P( X x)
F ( x) P( X x)
x
p(t )
t
Continuous random variable (X)
Probability density function (pdf):
p( x) or f ( x) not a probability
x
Cumulative distribution function (cdf):
F ( x) P( X x) p(t )dt
64
Probability mass function
•The graph of a probability mass
function.
•All the values of this function must be
non-negative and sum up to 1 .
65
Probability density function
The pdf can be calculated by taking the integral of
the function f(x) by the integration interval of the
input variable x.
For example: the probability of the variable X being within
the interval [4.3,7.8] would be
66
Expectations
Let g be a function of random variable X.
g ( x) p( x)
E[ g ( X )] x
g ( x) p ( x)dx
The kth moment
X is discrete
X is continuous
E[ X k ]
The 1st moment X E[X ]
The kth central moments
E[( X X ) k ]
67
Fact: Var[ X ] E[ X 2 ] ( E[ X ]) 2
Important Expectations
Mean
xp( x)
X E[ X ] x
xp( x)dx
X is discrete
X is continuous
Variance
2
(
x
)
p( x)
X
2
2
X Var[ X ] E[( X X ) ] x
( x X ) 2 p ( x)dx
X is discrete
X is continuous
68
Entropy
The entropy measures the fundamental uncertainty
in the value of points selected randomly from a
distribution.
p ( x) ln p ( x)
H [ X ] x
p ( x) ln p( x)dx
X is discrete
X is continuous
69
Properties:
1. Maximize the entropy
Univariate Gaussian Distribution 2. Central limit theorem
X~N(μ,σ2)
1
p ( x)
e
2
E[X] =μ
p(x)
( x )2
2 2
μ
σ
2σ
3σ
x
σ
2σ
3σ
Var[X] =σ2
70
Probability density function
of the sum of two terms
Probability
Probability
density
density
function
function
ofof
thethe
sum
sum
of four
of three
terms
terms
Illustration of the central limit theorem
Let x1,x2,…,xn be a sequence of n independent and identically
distributed random variables having each finite values of expectation µ
and variance σ2>0 The central limit theorem states that as the sample
size n increases ,the distribution of the sample average of these
random variables approaches the normal distribution with a mean µ and
variance σ2 / n irrespective of the shape of the original distribution .
Original probability density function
71
A probability density function
X ( X 1 , X 2 , , X d )
T
Random Vectors
A d-dimensional
random vector
Vector Mean:
X: R
d
μ E[ X] ( 1 , 2 , , d )
Covariance Matrix:
Σ E[( X μ)( X μ) ]
T
12 12
2
2
21
d 1 d 2
T
1d
2d
722
d
1
p ( x)
e
2
( x )2
2 2
Multivariate Gaussian Distribution
X~N(μ,Σ)
p ( x)
A d-dimensional random vector
1
(2 ) d / 2 | Σ |1/ 2
1
T
1
exp (x μ) Σ (x μ)
2
E[X] =μ
E[(X-μ) (X-μ)T] =Σ
73
Properties of N(μ,Σ)
X~N(μ,Σ)
A d-dimensional random vector
Let Y=ATX, where A is a d × k matrix.
Y~N(ATμ, ATΣA)
74
Properties of N(μ,Σ)
X~N(μ,Σ)
A d-dimensional random vector
Let Y=ATX, where A is a d × k matrix.
Y~N(ATμ, ATΣA)
75
On Parameters of N(μ,Σ)
X~N(μ,Σ)
X ( X 1 , X 2 , , X d )
μ E[ X] ( 1 , 2 , , d )
T
T
i E[ X i ]
Σ E[( X μ)( X μ) ] [ ij ]d d
T
ij E[( X i i )( X j j )] Cov( X i , X j )
ii i2 E[( X i i ) 2 ] Var ( X i )
X i X j ij 0
76
Σ (ΦΛ )(ΦΛ )
1/ 2
1/ 2 T
More On Covariance Matrix
is symmetric and positive semidefinite.
Σ ΦΛΦ ΦΛ Λ Φ
T
1/ 2
1/ 2
T
: orthonormal matrix, whose columns are eigenvectors of .
: diagonal matrix (eigenvalues).
Σ E[( X μ)( X μ) ] [ ij ]d d
T
ij E[( X i i )( X j j )] Cov( X i , X j )
ii i2 E[( X i i ) 2 ] Var ( X i )
X i X j ij 0
77
Σ (ΦΛ )(ΦΛ )
1/ 2
1/ 2 T
Whitening Transform
X~N(μ,Σ)
Y=ATX
Let Aw ΦΛ
Y~N(ATμ, ATΣA)
1 / 2
A w X ~ N ( A μ, A ΣA w )
T
w
T
w
ATw ΣA w (ΦΛ1/ 2 )T (ΦΛ1/ 2 )(ΦΛ1/ 2 )T (ΦΛ1/ 2 ) I
A w X ~ N ( A μ, I )
T
w
78
Σ (ΦΛ )(ΦΛ )
1/ 2
1/ 2 T
Whitening Transform
X~N(μ,Σ)
Y=ATX
Let Aw ΦΛ
Whitening
Linear
Transform
T
Y~N(A μ, ATΣA)
1 / 2
Projection
A w X ~ N ( A μ, A ΣA w )
T
w
T
w
ATw ΣA w (ΦΛ1/ 2 )T (ΦΛ1/ 2 )(ΦΛ1/ 2 )T (ΦΛ1/ 2 ) I
A w X ~ N ( A μ, I )
T
w
79
Whitening Transform
•The whitening transformation is a decorrelation method that converts
the covariance matrix S of a set of samples into the identity matrix I.
•This effectively creates new random variables that are uncorrelated and
have the same variances as the original random variables.
•The method is called the whitening transform because it transforms the
input matrix closer towards white noise .
This can be expressed as
Aw ΦΛ1/ 2
where Φ is the matrix with the eigenvectors of "S" as its columns and Λ is the
diagonal matrix of non-increasing eigenvalues.
80
White noise
•White noise is a random signal( or process) with a flat power spectral density .
•In other words, the signal contains equal power within a fixed bandwidth
at any center frequency .
Energy spectral density
81
r 2 (x μ)T Σ 1 (x μ)
Mahalanobis Distance
X~N(μ,Σ)
p ( x)
1
(2 ) d / 2 | Σ |1/ 2
depends on
constant
the value of r2
1
T
1
exp (x μ) Σ (x μ)
2
r2
82
Mahalanobis distance
In statistics ,Mahalanobis distance is a distance measure introduced by
P. C. Mahalanobis in 1936.
It is based on correlations between variables by which different patterns
can be identified and analyzed.
It is a useful way of determining similarity of an unknown sample set to a
known one.
It differs from Euclidean distance in that it takes into account the correlations
of the data set and is scale-invariant ,i.e. not dependent on the scale of
measurements .
83
Mahalanobis distance
Formally, the Mahalanobis distance from a group of values
with above mean and covariance matrix Σ for a multivariate
vector is defined as:
Mahalanobis distance can also be defined as dissimilarity
measure between two random vectors
of the same
distribution with the covariance matrix Σ :
If the covariance matrix is the identity matrix, the Mahalanobis
distance reduces to the Euclidean distance. If the covariance
matrix is diagonal, then the resulting distance measure is
called the normalized Euclidean distance:
where σi is the standard deviation of the xi over the sample set.
84
r 2 (x μ)T Σ 1 (x μ)
Mahalanobis Distance
X~N(μ,Σ)
p ( x)
1
(2 ) d / 2 | Σ |1/ 2
depends on
constant
the value of r2
1
T
1
exp (x μ) Σ (x μ)
2
r2
85
Bayesian Decision Theory
(Classification)
Discriminant Functions for
the Normal Populations
86
Normal Density
If features are statistically independent and the variance is the same
for all features, the discriminant function is simple and is linear in nature.
A classifier that uses linear discriminant functions is called
a linear machine.
The decision surface are pieces of hyperplanes defined
by linear equations.
87
gi (x) P(i | x)
gi (x) p(x | i ) P(i )
Minimum-Error-Rate Classification
gi (x) ln p(x | i ) ln P(i )
Assuming
the measurements are normally distributed, we have
Xi~N(μi,Σi)
1
1
T
1
p(x | i )
exp
(
x
μ
)
Σ
(
x
μ
)
i
i
i
2
(2 ) d / 2 | Σi |1/ 2
1
d
1
T
1
g i (x) (x μ i ) Σ i (x μ i ) ln 2 ln | Σ i | ln P(88i )
2
2
2
Some Algebra to Simplify the
Discriminants
Since
We take the natural logarithm to rewrite the first term
gi ( x) ln p( x | i ) ln P(i )
1
( x i ) t i1 ( x i )
1
2
ln p(x | i ) ln
e
d /2
1/ 2
2
|
|
i
1
( x i ) t i1 ( x i )
1
2
ln
ln
e
d /2
1/ 2
2
|
|
i
89
Some Algebra to Simplify the
Discriminants (continued)
12 ( x i )t i1 ( x i )
1
ln e
ln
d /2
1/ 2
2
|
|
i
1
d /2
t 1
( x i ) i ( x i ) ln 2 | i |1/ 2
2
1
d /2
t 1
1/ 2
( x i ) i ( x i ) ln 2 ln | i |
2
1
d
1
t 1
( x i ) i ( x i ) ln 2 ln | i |
2
2
2
90
Minimum-Error-Rate Classification
Three Cases:
Case 1: Σ i 2 I
Classes are centered at different mean, and their feature
components are pairwisely independent have the same variance.
Case 2: Σ i Σ
Classes are centered at different mean, but have the same variation.
Case 3: Σ i Σ j
Arbitrary.
1
d
1
T
1
g i (x) (x μ i ) Σ i (x μ i ) ln 2 ln | Σ i | ln P(91i )
2
2
2
i 2 I is independen t of i
Case 1. i =
2
I Σ
Since | i | 2 d (a constant), g i ( x) can be reduced to
g i ( x)
1
|| x μ i ||2 ln P(i )
2 2
1
2 (xT x 2μTi x μTi μ i ) ln P (i )
2
t
1
i
1
2
I
2 0 0 0
0 2 ... 0
i
2d
0 ... ... ...
0 ... 2
0
but x x is independen t of i and may be omitted so
irrelevant
1 T
1 T
g i ( x) 2 μ i x
μ i μ i ln P(i )
2
2
irrelevant
d
ln 2 is independen t of i and may be omitted
2
1
d
1
T
1
g i (x) (x μ i ) Σ i (x μ i ) ln 2 ln | Σ i | ln P(92i )
2
2
2
w i 12 μi
Case 1. i =
g i (x) w Ti x wi 0
2I
wi 0 21 2 μTi μi ln P(i )
which is linear in x!
1 T
g i ( x) 2 μ x
μ
μ
ln
P
(
)
i i
i
2
2
1
T
i
93
w i 12 μi
Case 1. i =
2I
wi 0 21 2 μTi μi ln P(i )
g i (x) w x wi 0
wTi x wi 0 wTj x w j 0
g i ( x) g j ( x)
(w w )x w j 0 wi 0
T
i
T
j
P(i )
(μ μ )x (μ μi μ μ j ) ln
P( j )
T
i
T
j
T
i
1
2
T
j
2
(μ μ )x (μ μ )(μi μ j )
T
i
T
j
1
2
T
i
j
i
T
i
T
j
Boundary btw.
i and j
T
T
(
μ
μ
i
j )(μ i μ j )
2
|| μ i μ j ||2
P(i )
ln
P94( j )
The decision boundary will be a hyperplane
perpendicular to the line btw. the means at somewhere.
Case 1. i =
2
I
i
w (x x 0 ) 0
T
w
x0
w μi μ j
x 0 12 (μ i μ j )
2
|| μ i μ j ||2
ln
P (i )
(μ i μ j )
P( j )
midpoint
0 if P(i)=P(j)
wT
(μ μ )x (μ μ )(μi μ j )
T
i
T
j
x j
xx0
1
2
T
i
T
j
g i ( x) g j ( x)
Boundary btw.
i and j
T
T
(
μ
μ
i
j )(μ i μ j )
2
|| μ i μ j ||2
P(i )
ln
P95( j )
P(1 )
x 0 (μ1 μ 2 )
ln
(μ1 μ 2 )
2
|| μ1 μ 2 ||
P(2 )
1
2
Case 1. i =
2
2
I
The decision region when the priors are equal and the support
regions are spherical is simply halfway between the means
(Euclidean distance).
P(1 ) P(2 )
Minimum distance classifier (template matching)
96
97
P(1 )
x 0 (μ1 μ 2 )
ln
(μ1 μ 2 )
2
|| μ1 μ 2 ||
P(2 )
1
2
2
Case 1. i = 2I
Note how priors shift the boundary away from the more likely mean !!!
P(1 ) P(2 )
98
P(1 )
x 0 (μ1 μ 2 )
ln
(μ1 μ 2 )
2
|| μ1 μ 2 ||
P(2 )
1
2
2
Case 1. i = 2I
P(1 ) P(2 )
99
P(1 )
x 0 (μ1 μ 2 )
ln
(μ1 μ 2 )
2
|| μ1 μ 2 ||
P(2 )
1
2
2
Case 1. i = 2I
P(1 ) P(2 )
100
Case 2. i =
•Covariance matrices are arbitrary, but equal to each other for all classes.
• Features then form hyper-ellipsoidal clusters of equal size and shape.
1
g i (x) (x μ i )T Σ 1 (x μ i ) ln P(i )
2
Mahalanobis Irrelevant if
Distance P(i)= P(j) i, j
irrelevant
1
d
1
T
1
g i (x) (x μ i ) Σ i (x μ i ) ln 2 ln | Σ i | ln P(101i )
2
2
2
w i Σ 1μ i
Case 2. i =
wi 0 12 μTi Σ 1μ i ln P(i )
1
g i (x) (x μ i )T Σ 1 (x μ i ) ln P(i )
2
1 T 1
(x Σ x 2μTi Σ 1x μTi Σ 1μ i ) ln P(i )
2
Irrelevant
g i (x) w Ti x wi 0
102
w i Σ 1μ i
Case 2. i =
wi 0 12 μTi Σ 1μ i ln P(i )
g i (x) w Ti x wi 0
i
The
discriminant hyperplanes are often not
orthogonal to the segments joining the class means
x
w
j
x0
w T (x x 0 ) 0
g i ( x) g j ( x)
1
w Σ (μi μ j )
x 0 (μ i μ j )
1
2
ln[ P(i ) / P( j )]
1
(μ i μ j ) Σ (μ i μ j )
T
(μ i μ j )
103
Case 2. i =
104
Case 2. i =
105
1 1 w Σ 1μ w 1 μT Σ 1μ 1 ln | Σ 1 | ln P( )
i
i
i
i0
i
i
i
i
2 i
2
Wi Σ i
2
Case 3. i j
The covariance matrices are different for each category
In two class case, the decision boundaries form hyperquadratics.
1
1
T
1
g i (x) (x μ i ) Σ i (x μ i ) ln | Σ i | ln P(i )
2
2
g i (x) x Wi x w x wi 0
T
T
i
Without this term
In Case 1 and 2
Decision surfaces are hyperquadrics, e.g.,
• hyperplanes
• hyperspheres
• hyperellipsoids
• hyperhyperboloids
irrelevant
1
d
1
T
1
g i (x) (x μ i ) Σ i (x μ i ) ln 2 ln | Σ i | ln P(106i )
2
2
2
Case 3. i j
Non-simply connected decision
regions can arise in one dimensions
for Gaussians having unequal
variance.
107
Case 3. i j
108
Case 3. i j
109
Case 3. i j
110
Multi-Category
Classification
111
Example : A Problem
Exemplars
(transposed)
For 1 = {(2, 6), (3,
4), (3, 8), (4, 6)}
For 2 = {(1, -2), (3,
0), (3, -4), (5, -2)}
Calculated means
(transposed)
m1 = (3, 6)
m2 = (3, -2)
10
8
6
Class 1
4
x2
Class 2
2
m1
0
-2
m2
0
4
2
6
-4
-6
x1
112
Example : Covariance
Matrices
1 1 2 0 1 / 2 0
1 ai 1 ai 1
0
8
0
2
4
4
i 1
2
3
1
1
a1 1 a1 1 a1 1 t
6 6 0
0
3 3 0
0
t
a2 1 a2 1 a2 1
4 6 2
0
4
a3 1
t
3 3 0
0
t
a
a
3
1
3
1
6 2
0
8
4 3 1
1
t
a4 1 a4 1 a4 1
6 6 0
0
0
0
0
4
0
4
0
0
113
Example : Covariance
Matrices
1 1 8 0 2 0
2 bi 2 bi 2
0 2
0
8
4
4
i 1
1
3
2
4 0
b1 2 b1 2 b1 2 t
2 2 0
0 0
3 3 0
0 0
t
b2 2 b2 2 b2 2
0 2 2
0 4
4
b3 2
t
3 3 0
0
t
b
b
3
2
3
2
2 2
0
4
5 3 2
4
t
b4 2 b4 2 b4 2
2 2 0
0
0
4
0
0
114
Example: Inverse and Determinant
for Each of the Covariance Matrices
1 / 2 0
2 0
1
1
1
and 1 1
0 2
0 1 / 2
2 0
1 / 2 0
1
2
2
and 2 4
0 2
0 1 / 2
115
Example : A Discriminant
Function for Class 1
Since
1
1
t 1
g i ( x) ( x i ) i ( x i ) ln | i | ln P(i )
2
2
2 0 x1 3 1
1
g1 ( x) x1 3 x2 6
ln 1 ln P(1 )
2
0 1 / 2 x2 6 2
x2
x1 3
3
ln P(1 )
2
x2 6
1
x2
2 x1 6 x1 3 3 x2 6 ln P(1 )
2
2
1
2 x1 6
2
116
Example
1
x
g1 ( x) 2 x1 6 x1 3 2 3 x2 6 ln P(1 )
2
2
1 2
x22
2 x1 12 x1 6 x2 36 ln P(1 )
2
2
2
x
x12 6 x1 2 3x2 18 ln P(1 )
4
117
Example : A Discriminant
Function for Class 2
Also,
1
1
g i ( x) ( x i ) t i1 ( x i ) ln | i | ln P (i )
2
2
1 / 2 0 x1 3 1
1
g 2 ( x) x1 3 x2 2
ln 4 ln P (2 )
2
0 1 / 2 x2 2 2
x1 3
1
( x1 3) ( x2 2)
ln 2 ln P (2 )
4
x2 2
1
2
2
x1 3 x2 2 ln 2 ln P (2 )
4
118
Example
Hence,
1
x1 32 x2 22 ln 2 ln P(2 )
4
1
x12 6 x1 x22 4 x2 13 ln 2 ln P(2 )
4
x12 3
x22
13
x1 x2 ln 2 ln P(2 )
4 2
4
4
g 2 ( x)
119
Example : The Class
Boundary
Assuming P(ω1 ) P(ω2 ) and setting g1(x) g 2(x),
2
x
g1 ( x) x12 6 x1 2 3 x2 18 ln P(1 )
4
x12 3
x22
13
g 2 ( x) x1 x2 ln 2 ln P(2 )
4 2
4
4
2
2
2
x
x
3
x
13
2
2
1
2
x1 6 x1 3 x2 18 x1 x2 ln 2
4
4 2
4
4
3 x12 9
13
4 x2
x1 18 ln 2
4 2
4
120
Example : A Quadratic
Separator
3x12 9
13
4 x2
x1 18 ln 2
4 2
4
3x12 9
13 18 ln 2
x2
x1
16 8
16
4
x2 0.1875 x12 1.125 x1 3.514, (a quadratic! )
121
Example : Plot of the Discriminant
10
8
6
x2
Class 1
4
Class 2
2
m1
m2
0
-2
0
2
4
6
8
Discriminant
-4
-6
x1
122
Summary Steps for Building a
Bayesian Classifier
Collect class exemplars
Estimate class a priori probabilities
Estimate class means
Form covariance matrices, find the
inverse and determinant for each
Form the discriminant function for each
class
123
Using the Classifier
Obtain a measurement vector x
Evaluate the discriminant function gi(x)
for each class i = 1,…,c
Decide x is in the class j if gj(x) > gi(x)
for
all i j
124
Bayesian Decision Theory
(Classification)
Criterions
125
Bayesian Decision Theory
(Classification)
Minimun error rate
Criterion
126
Minimum-Error-Rate Classification
If action i is taken and the true state is j ,
then the decision is correct if i j and in error
if i j
Error rate (the probability of error) is to be
minimized
Symmetrical or zero-one loss function
0, i j
( i | j )
, i, j 1, , c
1, i j
Conditional risk
c
R( i | x) ( i | j ) P( j | x) 1 P(i | x)
j 1
127
Minimum-Error-Rate Classification
128
Bayesian Decision Theory
(Classification)
Minimax Criterion
129
Bayesian Decision Rule:
Two-Category Classification
Likelihood
Ratio
Decide 1 if
Threshold
p(x | 1 ) (12 22 ) P(2 )
p(x | 2 ) (21 11 ) P(1 )
Minimax criterion deals with the case that
the prior probabilities are unknown.
130
Basic Concept on Minimax
To choose the worst-case prior probabilities
(the maximum loss) and, then, pick the
decision rule that will minimize the overall risk.
Minimize the maximum possible overall risk.
So that the worst risk for any value of the priors is as small as possible
131
R(1 | x) 11P(1 | x) 12 P(2 | x)
R( 2 | x) 21P(1 | x) 22 P(2 | x)
Overall Risk
R R( (x) | x) p(x)dx
R(1 | x) p(x)dx R( 2 | x) p (x)dx
R1
R2
R [11P(1 | x) 12 P(2 | x)] p (x)dx
R1
R2
[21P(1 | x) 22 P(2 | x)] p(x)dx
132
Overall Risk
p(x | i ) P(i )
P(i | x)
p ( x)
R [11P(1 ) p (x | 1 ) 12 P(2 ) p (x | 2 )]dx
R1
R2
[21P(1 ) p (x | 1 ) 22 P(2 ) p (x | 2 )]dx
R [11P(1 | x) 12 P(2 | x)] p (x)dx
R1
R2
[21P(1 | x) 22 P(2 | x)] p(x)dx
133
Overall Risk
P(2 ) 1 P(1 )
R [11P(1 ) p (x | 1 ) 12 P(2 ) p (x | 2 )]dx
R1
R2
[21P(1 ) p (x | 1 ) 22 P(2 ) p (x | 2 )]dx
R {11P(1 ) p (x | 1 ) 12[1 P(1 )] p (x | 2 )}dx
R1
R2
{21P(1 ) p (x | 1 ) 22[1 P(1 )] p (x | 2 )}dx
R 12 p(x | 2 )dx 22 p(x | 2 )dx
R1
R2
11 P (1 ) p (x | 1 )dx 12 P(1 ) p(x | 2 )dx
R1
R1
21 P(1 ) p(x | 1 )dx 22 P(1 ) p(x | 2 )dx
R2
R2
134
Overall Risk
R1
p(x | i )dx p(x | i )dx 1
R2
R[ P(1 )] 22 (12 22 ) p(x | 2 )dx
R1
P(1 ) (11 22 ) (21 11 ) p(x | 1 )dx (12 22 ) p(x | 2 )dx
R2
R1
R 12 p(x | 2 )dx 22 p(x | 2 )dx
R1
R2
11 P (1 ) p (x | 1 )dx 12 P(1 ) p(x | 2 )dx
R1
R1
21 P(1 ) p(x | 1 )dx 22 P(1 ) p(x | 2 )dx
R2
R2
135
Overall Risk
R(x) = ax + b
The value depends on
the setting of decision boundary
R[ P(1 )] 22 (12 22 ) p(x | 2 )dx
R1
P(1 ) (11 22 ) (21 11 ) p(x | 1 )dx (12 22 ) p(x | 2 )dx
R2
R1
The value depends on
the setting of decision boundary
The overall risk for a particular P(1).
136
Overall Risk
R(x) = ax + b
= Rmm, minimax risk
R[ P(1 )] 22 (12 22 ) p(x | 2 )dx
R1
P(1 ) (11 22 ) (21 11 ) p(x | 1 )dx (12 22 ) p(x | 2 )dx
R2
R1
= 0 for minimax solution
Independent on the value of P(i).
137
Minimax Risk
R[ P(1 )] 22 (12 22 ) p(x | 2 )dx
R1
P(1 ) (11 22 ) (21 11 ) p(x | 1 )dx (12 22 ) p(x | 2 )dx
R2
R1
Rmm 22 (12 22 ) p(x | 2 )dx
R1
11 (21 11 ) p(x | 1 )dx
R2
138
Use 0/1 loss function
Error Probability
R[ P(1 )] 22 (12 22 ) p(x | 2 )dx
R1
P(1 ) (11 22 ) (21 11 ) p(x | 1 )dx (12 22 ) p(x | 2 )dx
R2
R1
Perror[ P(1 )] p(x | 2 )dx
R1
P(1 ) p(x | 1 )dx p(x | 2 )dx
R 2
R1
139
Use 0/1 loss function
Minimax Error-Probability
Pmm (error ) p(x | 2 )dx p (x | 1 )dx
R2
R1
P(1|2)
P(2|1)
Perror[ P(1 )] p(x | 2 )dx
R1
P(1 ) p(x | 1 )dx p(x | 2 )dx
R 2
R1
140
Perror[ P(1 )] p(x | 2 )dx
R1
P(1 ) p(x | 1 )dx p(x | 2 )dx
R 2
R1
Minimax Error-Probability
Pmm (error ) p(x | 2 )dx p (x | 1 )dx
R2
R1
P(1|2)
P(2|1)
1
2
R1
R2
141
Perror[ P(1 )] p(x | 2 )dx
R1
P(1 ) p(x | 1 )dx p(x | 2 )dx
R 2
R1
Minimax Error-Probability
142
Bayesian Decision Theory
(Classification)
Neyman-Pearson
Criterion
143
Bayesian Decision Rule:
Two-Category Classification
Likelihood
Ratio
Decide 1 if
Threshold
p(x | 1 ) (12 22 ) P(2 )
p(x | 2 ) (21 11 ) P(1 )
Neyman-Pearson Criterion deals with the
case that both loss functions and the prior
probabilities are unknown.
144
Signal Detection Theory
The theory of signal detection theory
evolved from the development of
communications and radar equipment the
first half of the last century.
It migrated to psychology, initially as part
of sensation and perception, in the 50's and
60's as an attempt to understand some of
the features of human behavior when
detecting very faint stimuli that were not
being explained by traditional theories of
thresholds.
145
The situation of interest
A person is faced with a stimulus (signal)
that is very faint or confusing.
The person must make a decision, is the
signal there or not.
What makes this situation confusing and
difficult is the presences of other mess that
is similar to the signal. Let us call this
mess noise.
146
Example
Noise is present both in the
environment and in the sensory
system of the observer.
The observer reacts to the
momentary total activation of
the sensory system, which
fluctuates from moment to
moment, as well as responding
to environmental stimuli, which
may include a signal.
147
Signal Detection Theory
Suppose we want to detect a single pulse from a signal.
We assume the signal has some random noise.
When the signal is present we observe a normal distribution with mean u2.
When the signal is not present we observe a normal distribution with mean
u1.
We assume same standard deviation.
Can we measure the discriminability of the problem?
Can we do this independent of the threshold x*?
Discriminability:
d’ = | u2 – u1 | / σ
148
Example
A radiologist is examining a CT scan, looking
for evidence of a tumor.
A Hard job, because there is always some
uncertainty.
There are four possible outcomes:
hit (tumor present and doctor says "yes'')
miss (tumor present and doctor says "no'')
Two types
false alarm (tumor absent and doctor says "yes") of Error
correct rejection (tumor absent and doctor says
149
"no").
Signal detection theory was developed to help us understand how a
continuous and ambiguous signal can lead to a binary yes/no decision.
The Four Cases
Signal (tumor)
No (1)
Absent (1)
Present (2)
P(1|1)
P(1|2)
Correct Rejection
Miss
P(2|1)
P(2|2)
False Alarms
Hit
Decision
Yes (2)
150
Discriminability
d'
| 2 1 |
Decision Making
Criterion
d’
Noise
Based on expectancy
(decision bias)
Noise + Signal
Hit
P(2|2)
False
P(2|1)
Alarm
1
No (1)
2
Yes (2)
151
152
Signal Detection Theory
How do we find d’ if we do not know u1, u2, or x*?
From the data we can compute:
1.
P( x > x* | w2) a hit.
2.
P( x > x* | w1) a false alarm.
3.
P( x < x* | w2) a miss.
4.
P( x < x* | w1) a correct rejection.
If we plot a point in a space representing hit and false alarm rates,
then we have a ROC (receiver operating characteristic) curve.
With it we can distinguish between discriminability and bias.
153
ROC Curve
(Receiver Operating Characteristic)
Hit
PH=P(2|2)
False
Alarm
154
PFA=P(2|1)
Neyman-Pearson Criterion
Hit
PH=P(2|2)
NP:
max. PH
subject to PFA ≦ a
False
Alarm
155
PFA=P(2|1)