Bayesian Decision Theory


CHAPTER 3: BAYESIAN DECISION THEORY
Making Decisions Under Uncertainty




• Use Bayes' rule to calculate the probabilities of the classes
• Make rational decisions among multiple actions to minimize expected risk
• Learn association rules from data

Based on E. Alpaydın (2004), Introduction to Machine Learning, © The MIT Press (V1.1)

Unobservable Variables
• Tossing a coin is a completely random process; we cannot predict the outcome.
• We can only talk about the probability that the outcome of the next toss will be heads or tails.
• If we had access to extra knowledge (the exact composition of the coin, its initial position, the force applied, etc.), the exact outcome of the toss could be predicted.

Unobservable Variable
• An unobservable variable is the extra knowledge that we do not have access to.
• Coin toss: the only observable variable is the outcome of the toss.
• x = f(z), where z is the unobservable variable and x is the observable variable.
• f is a deterministic function.

Bernoulli Random Variable
• The result of tossing a coin is in {Heads, Tails}.
• Define a random variable X ∈ {1, 0}, where X = 1 denotes heads.
• Let po be the probability of heads: P(X = 1) = po and P(X = 0) = 1 − P(X = 1) = 1 − po.
• Suppose we are asked to predict the next toss.
• If we know po, we predict heads if po > 1/2 and tails otherwise.
• Choosing the more probable case minimizes the probability of error, which is 1 − po when we predict heads.

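As a concrete illustration of this decision rule, here is a minimal Python sketch (not from the slides; the value of po is made up):

```python
# Bayes-optimal prediction for a Bernoulli variable when po = P(X = 1) is known.
def predict_toss(po: float) -> int:
    """Predict heads (1) if po > 1/2, otherwise tails (0)."""
    return 1 if po > 0.5 else 0

def error_probability(po: float) -> float:
    """Probability of error when always predicting the more probable outcome."""
    return min(po, 1 - po)

po = 0.7                      # assumed value for illustration
print(predict_toss(po))       # 1 (heads)
print(error_probability(po))  # 0.3, i.e. 1 - po
```
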
Estimation
• What if we do not know P(X)?
• We want to estimate it from a given sample X; this is the realm of statistics.
• The sample X is generated from the probability distribution of the observables x^t.
• We want to build an approximator of p(x) using the sample X.
• In the coin toss example, the sample is the outcomes of the past N tosses, and the distribution is characterized by the single parameter po.

Parameter Estimation
The estimate p̂o of po is the proportion of heads in the sample: p̂o = #{tosses with outcome heads} / #{tosses}.

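A minimal sketch of this estimator; the true po and the synthetic sample below are assumptions used only to generate toy data:

```python
import random

# Estimate po as the fraction of heads in a sample of N tosses.
random.seed(0)
true_po = 0.6   # assumed, used only to simulate data
sample = [1 if random.random() < true_po else 0 for _ in range(1000)]

po_hat = sum(sample) / len(sample)
print(po_hat)   # close to 0.6 for a large sample
```
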
Classification
• Credit scoring: two classes, high risk and low risk.
• The decision is based on observable information: income and savings.
• We have reason to believe that these two variables give us an idea about the credibility of a customer.
• Represent them by two random variables X1 and X2.
• We cannot observe a customer's intentions and moral codes.
• We can observe the credibility of past customers.
• C is a Bernoulli random variable conditioned on the observables X = [X1, X2]^T.
• The quantity of interest is P(C | X1, X2).

Classification
• Assume we know P(C | X1, X2).
• For a new application with X1 = x1 and X2 = x2, choose C = 1 (high risk) if P(C = 1 | x1, x2) > P(C = 0 | x1, x2), and C = 0 otherwise.

Classification
• This is similar to the coin toss, but here C is conditioned on two observable variables x = [x1, x2]^T.
• The problem: calculate P(C | x).
• Use Bayes' rule.

Conditional Probability
• P(A | B) is the probability of A (the point falls inside region A) given that we know B has happened (the point is inside region B):

P(A | B) = P(AB) / P(B)

Bayes' Rule

P(A | B) = P(AB) / P(B)
P(B | A) = P(AB) / P(A), so P(AB) = P(B | A) P(A)
Therefore P(A | B) = P(B | A) P(A) / P(B)

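A minimal numeric check of this derivation; the three input probabilities below are made up for illustration:

```python
# Compute P(A|B) from P(B|A), P(A) and P(B) using Bayes' rule.
p_a, p_b = 0.3, 0.5
p_b_given_a = 0.8

p_ab = p_b_given_a * p_a    # P(AB) = P(B|A) P(A), about 0.24
p_a_given_b = p_ab / p_b    # P(A|B) = P(B|A) P(A) / P(B), about 0.48
print(p_ab, p_a_given_b)
```
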
Bayes' Rule

posterior = likelihood × prior / evidence:

P(C | x) = p(x | C) P(C) / p(x)

• Prior P(C): the probability that a customer is high risk, regardless of the value of x.
• It is the knowledge we have about the value of C before looking at the observables x.

Bayes' Rule

• Likelihood p(x | C): the probability that an event belonging to class C has the associated observation value x.
• For example, p(x1, x2 | C = 1) is the probability that a high-risk customer has X1 = x1 and X2 = x2.

Bayes' Rule

• Evidence p(x): the probability that the observation x is seen, regardless of whether the example is positive or negative.

Bayes’ Rule
P(C | x) = p(x | C) P(C) / p(x)

P(C = 0) + P(C = 1) = 1
p(x) = p(x | C = 1) P(C = 1) + p(x | C = 0) P(C = 0)
P(C = 0 | x) + P(C = 1 | x) = 1

Bayes' Rule for Classification

• Assume we know the prior, the evidence, and the likelihood.
• We will see later how to estimate them from data.
• Plug them into Bayes' formula to obtain P(C | x).
• Choose C = 1 if P(C = 1 | x) > P(C = 0 | x).

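A minimal two-class sketch of this recipe; the prior and likelihood values below are made up for illustration:

```python
# Plug prior and class likelihoods into Bayes' rule to get P(C=1|x) and decide.
prior_c1 = 0.3                 # P(C = 1), assumed
likelihood = {1: 0.6, 0: 0.1}  # p(x | C = 1) and p(x | C = 0) for the observed x

evidence = likelihood[1] * prior_c1 + likelihood[0] * (1 - prior_c1)  # p(x)
posterior_c1 = likelihood[1] * prior_c1 / evidence                    # P(C=1|x)
posterior_c0 = 1 - posterior_c1

decision = 1 if posterior_c1 > posterior_c0 else 0
print(posterior_c1, decision)  # about 0.72, so choose C = 1
```
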
Bayes’ Rule: K>2 Classes
P(Ci | x) = p(x | Ci) P(Ci) / p(x)
          = p(x | Ci) P(Ci) / Σk p(x | Ck) P(Ck),  summing over k = 1, ..., K

P(Ci) ≥ 0 and Σi P(Ci) = 1

Choose Ci if P(Ci | x) = maxk P(Ck | x)

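A minimal K-class sketch of this rule; the priors and likelihoods are made up for illustration:

```python
# Compute posteriors with Bayes' rule and choose the class with the maximum one.
priors = [0.5, 0.3, 0.2]        # P(C_i), must sum to 1 (assumed values)
likelihoods = [0.2, 0.5, 0.4]   # p(x | C_i) for the observed x (assumed values)

evidence = sum(p * l for p, l in zip(priors, likelihoods))            # p(x)
posteriors = [p * l / evidence for p, l in zip(priors, likelihoods)]  # P(C_i | x)

choice = max(range(len(posteriors)), key=lambda i: posteriors[i])
print(posteriors, choice)       # posteriors sum to 1; class 1 is chosen here
```
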
Bayes’ Rule
P(Ci | x) = p(x | Ci) P(Ci) / p(x)
          = p(x | Ci) P(Ci) / Σk p(x | Ck) P(Ck)

• When deciding for a specific input x, p(x) is the same for all classes.
• We therefore do not need it when comparing the posteriors.

Losses and Risks




• Decisions and errors are not equally good or costly.
• Action αi: assigning the input to class i.
• λik: the loss incurred for taking action αi when the true state is Ck.
• Expected risk (Duda and Hart, 1973):

R(αi | x) = Σk λik P(Ck | x),  summing over k = 1, ..., K

Choose αi if R(αi | x) = mink R(αk | x)

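A minimal sketch of risk minimization with an asymmetric loss matrix; the posteriors and losses are made up, chosen so that the decision differs from simply picking the most probable class:

```python
# Choose the action minimizing expected risk R(a_i|x) = sum_k lambda_ik P(C_k|x).
posteriors = [0.7, 0.3]            # P(C_k | x), assumed
loss = [[0.0, 10.0],               # lambda_ik: row = action i, column = true class k
        [1.0,  0.0]]

risks = [sum(l * p for l, p in zip(row, posteriors)) for row in loss]
best_action = min(range(len(risks)), key=lambda i: risks[i])
print(risks, best_action)          # [3.0, 0.7] -> action 1, despite P(C_0|x) = 0.7
```
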
Losses and Risks: 0/1 Loss
λik = 0 if i = k, and 1 if i ≠ k

R(αi | x) = Σk λik P(Ck | x)
          = Σk≠i P(Ck | x)
          = 1 − P(Ci | x)

For minimum risk, choose the most probable class.

Losses and Risks: Reject
λik = 0 if i = k; λ if i = K + 1 (the reject action); 1 otherwise, with 0 < λ < 1

R(αK+1 | x) = Σk λ P(Ck | x) = λ
R(αi | x) = Σk≠i P(Ck | x) = 1 − P(Ci | x)

Choose Ci if P(Ci | x) > P(Ck | x) for all k ≠ i and P(Ci | x) > 1 − λ; reject otherwise.

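A minimal sketch of the reject option under 0/1 loss; the posteriors and the reject cost lam are made up for illustration:

```python
# With reject cost 0 < lam < 1, pick the most probable class only when its
# posterior exceeds 1 - lam; otherwise rejecting has lower expected risk.
def decide_with_reject(posteriors, lam=0.2):
    best = max(range(len(posteriors)), key=lambda i: posteriors[i])
    return best if posteriors[best] > 1 - lam else "reject"

print(decide_with_reject([0.55, 0.45], lam=0.2))  # 'reject' (0.55 <= 0.8)
print(decide_with_reject([0.95, 0.05], lam=0.2))  # 0
```
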
Discriminant Functions

• Define a function gi(x) for each class, measuring the "goodness" of selecting class Ci given the observables x.
• Choose Ci if gi(x) = maxk gk(x).
• Possible choices:

gi(x) = −R(αi | x)
gi(x) = P(Ci | x)
gi(x) = p(x | Ci) P(Ci)

• The maximum discriminant corresponds to the minimum conditional risk.

Decision Regions
gi x , i  1,, K
K decision regions R1,...,RK
Ri  x | gi x   max k gk x 
Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT24
Press (V1.1)
K=2 Classes

• g(x) = g1(x) − g2(x)
• Choose C1 if g(x) > 0, and C2 otherwise.
• Log odds: log [ P(C1 | x) / P(C2 | x) ]

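A minimal sketch of a two-class decision via the log odds; the posterior value is made up:

```python
import math

# Choose C1 when log(P(C1|x)/P(C2|x)) > 0, i.e. when P(C1|x) > P(C2|x).
p_c1_given_x = 0.8                                    # assumed posterior
log_odds = math.log(p_c1_given_x / (1 - p_c1_given_x))
print("C1" if log_odds > 0 else "C2")                 # C1
```
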
Correlation and Causality
Interaction Between Variables
• We have many variables that we want to represent and analyze:
  - the structure of dependency among them
  - conditional and marginal probabilities
• Use a graphical representation: nodes (events) and arcs (dependencies).
• Numbers on the nodes and arcs give the conditional probabilities.

Causes and Bayes’ Rule
(Figure: a causal arc from rain R to wet grass W; diagnostic inference runs in the opposite direction.)

Diagnostic inference: knowing that the grass is wet, what is the probability that rain is the cause?

P(R | W) = P(W | R) P(R) / P(W)
         = P(W | R) P(R) / [ P(W | R) P(R) + P(W | ~R) P(~R) ]
         = (0.9 × 0.4) / (0.9 × 0.4 + 0.2 × 0.6) = 0.75

Causal vs Diagnostic Inference
Causal inference: if the sprinkler is on, what is the probability that the grass is wet?

P(W | S) = P(W | R, S) P(R | S) + P(W | ~R, S) P(~R | S)
         = P(W | R, S) P(R) + P(W | ~R, S) P(~R)    (R and S are independent)
         = 0.95 × 0.4 + 0.9 × 0.6 = 0.92

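A minimal sketch reproducing the two computations above with the probabilities given on the slides (no other values are assumed):

```python
# Diagnostic and causal inference in the rain/sprinkler/wet-grass example.
p_r = 0.4                                   # P(R)
p_w_given_r, p_w_given_nr = 0.9, 0.2        # P(W|R), P(W|~R)
p_w_given_rs, p_w_given_nrs = 0.95, 0.9     # P(W|R,S), P(W|~R,S)

# Diagnostic: P(R|W) via Bayes' rule.
p_w = p_w_given_r * p_r + p_w_given_nr * (1 - p_r)
print(round(p_w_given_r * p_r / p_w, 2))    # 0.75

# Causal: P(W|S), using the independence of R and S.
print(round(p_w_given_rs * p_r + p_w_given_nrs * (1 - p_r), 2))   # 0.92
```
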
Causal vs Diagnostic Inference
Diagnostic inference: if the grass is wet, what is the probability that the sprinkler is on?

P(S | W) = 0.35 > 0.2 = P(S)
P(S | R, W) = 0.21

Explaining away: knowing that it has rained decreases the probability that the sprinkler is on.

Bayesian Networks: Causes
Bayesian Nets: Local Structure
P(C, S, R, W, F) = P(C) P(S | C) P(R | C) P(W | S, R) P(F | R)

We only need to specify the interactions between neighboring events.

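A minimal sketch of this factorization; only the structure comes from the slide, while all conditional probability values below are made up:

```python
import itertools

# Joint probability factored over the network:
# P(C,S,R,W,F) = P(C) P(S|C) P(R|C) P(W|S,R) P(F|R). CPT numbers are illustrative.
p_c = 0.5
p_s_given_c = {True: 0.1, False: 0.5}        # P(S=1 | C)
p_r_given_c = {True: 0.8, False: 0.1}        # P(R=1 | C)
p_w_given_sr = {(True, True): 0.95, (True, False): 0.9,
                (False, True): 0.9, (False, False): 0.1}   # P(W=1 | S, R)
p_f_given_r = {True: 0.7, False: 0.05}       # P(F=1 | R)

def bern(p_true, value):
    """Probability that a binary variable takes `value`, given P(value=True)."""
    return p_true if value else 1 - p_true

def joint(c, s, r, w, f):
    return (bern(p_c, c) * bern(p_s_given_c[c], s) * bern(p_r_given_c[c], r)
            * bern(p_w_given_sr[(s, r)], w) * bern(p_f_given_r[r], f))

# Sanity check: the joint sums to 1 over all 2^5 assignments.
print(round(sum(joint(*v) for v in itertools.product([True, False], repeat=5)), 6))
```
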
Bayesian Networks: Classification
Bayes' rule inverts the arc, giving the diagnostic inference P(C | x):

P(C | x) = p(x | C) P(C) / p(x)

Naïve Bayes' Classifier
Given C, the inputs xj are independent:

p(x | C) = p(x1 | C) p(x2 | C) ... p(xd | C)

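A minimal naïve Bayes sketch for binary inputs; the priors and per-feature likelihoods are made up for illustration:

```python
import math

priors = {0: 0.6, 1: 0.4}                      # P(C), assumed
p_feature_given_class = {0: [0.2, 0.7, 0.5],   # P(x_j = 1 | C) for d = 3 features
                         1: [0.8, 0.3, 0.6]}

def log_score(x, c):
    """log P(C=c) + sum_j log p(x_j | C=c), using the independence assumption."""
    s = math.log(priors[c])
    for xj, pj in zip(x, p_feature_given_class[c]):
        s += math.log(pj if xj else 1 - pj)
    return s

x = [1, 0, 1]
print(max(priors, key=lambda c: log_score(x, c)))   # 1
```
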
Naïve Bayes’ Classifier
• Why naïve? It ignores possible dependencies among the inputs.
• Example: the dependency between "baby food" and "diapers".
• Such a dependency can be captured by a hidden variable (e.g., "has children").
• To model it, insert the corresponding node and arcs and estimate their values from data.

Association Rules


• Association rule: X → Y
• Support(X → Y):

P(X, Y) = #{customers who bought X and Y} / #{customers}

• Confidence(X → Y):

P(Y | X) = P(X, Y) / P(X)
         = #{customers who bought X and Y} / #{customers who bought X}

Association Rule
• Only one customer bought chips; the same customer also bought beer.
• In this case P(C | B) = 1, but the support is tiny.
• Support shows the statistical significance of a rule.

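A minimal sketch computing support and confidence for a rule from a list of market baskets; the toy baskets are made up, arranged so that the chips/beer rule has high confidence but low support, echoing the example above:

```python
# Support(X -> Y) = #{baskets with X and Y} / #{baskets}
# Confidence(X -> Y) = #{baskets with X and Y} / #{baskets with X}
baskets = [{"chips", "beer"}, {"beer", "diapers"}, {"beer"},
           {"cola"}, {"diapers", "baby food"}]

def support(x, y):
    return sum(1 for b in baskets if x in b and y in b) / len(baskets)

def confidence(x, y):
    n_x = sum(1 for b in baskets if x in b)
    n_xy = sum(1 for b in baskets if x in b and y in b)
    return n_xy / n_x if n_x else 0.0

print(support("chips", "beer"), confidence("chips", "beer"))  # 0.2 1.0
```
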