Reasoning under uncertainty


Bayes’ Rule
P(a | b) = P(b | a) P(a) / P(b) = α P(b | a) P(a), where α = 1/P(b)

P(a | b ∧ c) = α P(b ∧ c | a) P(a)

If b and c are conditionally independent given a:
P(b ∧ c | a) = P(b | a) P(c | a)
Naive Bayes – Full Joint Probability Distribution

P(Cause, Effect1, ..., Effectn) = P(Cause) Π_{i=1..n} P(Effecti | Cause)
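A minimal sketch of this naive Bayes joint in Python; the prior and the per-effect conditionals below are invented for illustration:

```python
# Naive Bayes joint: P(Cause, Effect1, ..., Effectn) = P(Cause) * prod_i P(Effecti | Cause).
# All numbers here are made up for illustration.
p_cause = 0.2
p_eff = {True: [0.9, 0.7],    # P(Effecti = true | Cause = true)
         False: [0.3, 0.4]}   # P(Effecti = true | Cause = false)

def joint(cause, effects):
    """Probability of one full assignment to Cause and all Effects."""
    p = p_cause if cause else 1 - p_cause
    for p_e, e in zip(p_eff[cause], effects):
        p *= p_e if e else 1 - p_e
    return p

print(joint(True, [True, False]))   # P(Cause ∧ Effect1 ∧ ¬Effect2) = 0.2 * 0.9 * 0.3 = 0.054
```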
Bayes’ Rule and Reasoning
Allows use of uncertain causal knowledge.
- Knowledge: given a cause, what is the likelihood of seeing particular effects (conditional probabilities)?
- Reasoning: seeing some effects, how do we infer the likelihood of a cause?

P(H | e1 ∧ e2 ∧ ... ∧ ek) = P(e1 ∧ e2 ∧ ... ∧ ek | H) P(H) / P(e1 ∧ e2 ∧ ... ∧ ek)

This can be very complicated: it needs the joint probability distribution of (k+1) variables, i.e., 2^{k+1} numbers.
Use conditional independence to simplify expressions.
Allows sequential, step-by-step computation, as in the sketch below.
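With conditionally independent effects, the posterior can be updated one piece of evidence at a time. A minimal sketch, with an invented prior and likelihoods:

```python
# Sequential Bayesian updating: multiply in each likelihood, then renormalize.
# The prior and likelihoods below are made up for illustration.
prior = {True: 0.1, False: 0.9}                      # P(H), P(¬H)
likelihood = {True: [0.8, 0.7], False: [0.2, 0.1]}   # P(ei | H), P(ei | ¬H)

posterior = dict(prior)
for i in range(2):                                   # incorporate e1, then e2
    for h in (True, False):
        posterior[h] *= likelihood[h][i]
    alpha = 1 / (posterior[True] + posterior[False]) # normalization constant
    posterior = {h: alpha * p for h, p in posterior.items()}
    print(f"after e{i+1}: P(H | evidence so far) = {posterior[True]:.3f}")
```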
Bayesian/Belief Network
To avoid problems of enumerating large joint
probabilities
Use causal knowledge and independence to simplify
reasoning, and draw inferences
P(X1, X2, ..., Xn) = P(X1 | X2, ..., Xn) P(X2, ..., Xn)
= P(X1 | X2, ..., Xn) P(X2 | X3, ..., Xn) ... P(X_{n-1} | Xn) P(Xn)
Bayesian Networks
Also called Belief Network or probabilistic
network



- Nodes – random variables, one variable per node.
- Directed links between pairs of nodes: A → B means A has a direct influence on B.
  - With no directed cycles.
- A conditional distribution for each node given its parents: P(Xi | Parents(Xi)).
Must determine the domain-specific topology.
[Figure: example network with nodes Weather, Cavity, Toothache, and Catch; Cavity is linked to Toothache and Catch, while Weather stands alone.]
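The topology alone can be written down as a parent map. A minimal sketch for the network in the figure:

```python
# Bayesian network topology as a parent map: the directed acyclic graph only;
# CPTs are added separately.
parents = {
    "Weather":   [],           # no links to the dental variables
    "Cavity":    [],
    "Toothache": ["Cavity"],   # Cavity -> Toothache
    "Catch":     ["Cavity"],   # Cavity -> Catch
}
# sanity check: every listed parent is itself a node in the network
assert all(p in parents for ps in parents.values() for p in ps)
```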
Bayesian Networks
Next step is to determine the conditional
probability distribution for each variable.

Represented as a conditional probability table (CPT) giving the distribution over Xi for each combination of the parents’ values.
Once the CPTs are determined, the full joint probability distribution is represented by the network.
- The network provides a complete description of the domain.
Belief Networks: Example
If you go to college, this will affect the likelihood that you will study and the likelihood that you will party. Studying and partying affect your chances of exam success, and partying affects your chances of having fun.
Variables: College, Study, Party, Exam (success), Fun
Causal relations:
- College will affect studying.
- College will affect partying.
- Studying and partying will affect exam success.
- Partying affects having fun.
[Figure: College → Study and College → Party; Study → Exam and Party → Exam; Party → Fun.]
College example: CPTs

P(C) = 0.2

C     | P(S)
------+-----
True  | 0.8
False | 0.2

C     | P(P)
------+-----
True  | 0.6
False | 0.5

S     | P     | P(E)
------+-------+-----
True  | True  | 0.6
True  | False | 0.9
False | True  | 0.1
False | False | 0.2

P     | P(F)
------+-----
True  | 0.9
False | 0.7

[Figure: each CPT attached to its node in the College network.]
This tabular format handles discrete variables only.
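These tables translate directly into code. A minimal sketch, storing P(variable = true | parent values) for the College network:

```python
# College-network CPTs as dictionaries keyed by parent-value tuples;
# each entry is P(variable = true | parents).
cpt = {
    "College": {(): 0.2},
    "Study":   {(True,): 0.8, (False,): 0.2},           # parent: College
    "Party":   {(True,): 0.6, (False,): 0.5},           # parent: College
    "Exam":    {(True, True): 0.6, (True, False): 0.9,  # parents: Study, Party
                (False, True): 0.1, (False, False): 0.2},
    "Fun":     {(True,): 0.9, (False,): 0.7},           # parent: Party
}

def p(var, value, parent_values=()):
    """P(var = value | parents = parent_values) from the CPTs."""
    pt = cpt[var][tuple(parent_values)]
    return pt if value else 1 - pt

print(p("Exam", True, (True, False)))  # P(E | S, ¬P) = 0.9
```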
Belief Networks: Compactness
A CPT for a Boolean variable Xi with k Boolean parents has 2^k rows, one for each combination of parent values.
- Each row requires one number p for Xi = true (the number for Xi = false is 1 − p).
- Each row must sum to 1.
If each variable has no more than k parents, then the complete network requires O(n · 2^k) numbers,
- i.e., the numbers grow linearly in n, vs. O(2^n) for the full joint distribution.
The College net has 1 + 2 + 2 + 4 + 2 = 11 numbers.
Belief Networks:
Joint Probability Distribution Calculation
Global semantics defines the full joint distribution as the product of the local distributions:

P(X1, X2, ..., Xn) = Π_{i=1..n} P(Xi | Parents(Xi))

Can use the network to make inferences. For example, the probability of going to college, studying, and being successful on your exams, but not partying or having fun:

P(C ∧ S ∧ ¬P ∧ E ∧ ¬F) = P(C) P(S | C) P(¬P | C) P(E | S, ¬P) P(¬F | ¬P)
= 0.2 × 0.8 × 0.4 × 0.9 × 0.3 = 0.01728

Every value in the full joint probability distribution can be calculated this way.
College example: CPTs
[College network and CPTs repeated; see above.]

P(C ∧ S ∧ ¬P ∧ E ∧ ¬F) = P(C) P(S | C) P(¬P | C) P(E | S, ¬P) P(¬F | ¬P)
= 0.2 × 0.8 × 0.4 × 0.9 × 0.3 = 0.01728
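A quick numeric check of this product, reading each factor off the CPTs above:

```python
# P(C ∧ S ∧ ¬P ∧ E ∧ ¬F) from the College CPTs:
#   P(C)   P(S|C)  P(¬P|C)     P(E|S,¬P)  P(¬F|¬P)
p = 0.2 * 0.8 * (1 - 0.6) * 0.9 * (1 - 0.7)
print(p)  # 0.01728
```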
Network Construction
Must ensure the network and distribution are good representations of the domain.
- Want to rely on conditional independence relationships.
First, rewrite the joint distribution in terms of a conditional probability:

P(x1, ..., xn) = P(xn | x_{n-1}, ..., x1) P(x_{n-1}, ..., x1)

Repeat for each conjunctive probability:

P(x1, ..., xn) = P(xn | x_{n-1}, ..., x1) P(x_{n-1} | x_{n-2}, ..., x1) ... P(x2 | x1) P(x1)
= Π_{i=1..n} P(xi | x_{i-1}, ..., x1)    (the chain rule)
Network Construction
Note that P(x1, ..., xn) = Π_{i=1..n} P(xi | x_{i-1}, ..., x1) is equivalent to:

P(Xi | X_{i-1}, ..., X1) = P(Xi | Parents(Xi))

where the partial order is defined by the graph structure and Parents(Xi) ⊆ {X_{i-1}, ..., X1}.
The above equation says that the network correctly represents the domain only if each node is conditionally independent of its predecessors in the node ordering, given the node’s parents.
This means Parents(Xi) must contain all nodes in X1, ..., X_{i-1} that have a direct influence on Xi.
College example:
[College network and CPTs repeated; see above.]

P(F | C, S, P, E) = P(F | P)

Fun is conditionally independent of College, Study, and Exam once its parent, Party, is known.
Compact Networks
Bayesian networks are sparse, and therefore much more compact than the full joint distribution.
- Sparse: each subcomponent interacts directly with a bounded number of other nodes, independent of the total number of components.
- Usually linearly bounded complexity.
  - The College net has 1 + 2 + 2 + 4 + 2 = 11 numbers.
  - A fully connected domain = the full joint distribution.
Must determine the correct network topology.
- Add “root causes” first, then the variables that they influence.
Network Construction
Need a method such that a series of locally testable assertions of conditional independence guarantees the required global semantics.
1. Choose an ordering of variables X1, ..., Xn.
2. For i = 1 to n:
   add Xi to the network;
   select parents from X1, ..., X_{i-1} such that P(Xi | Parents(Xi)) = P(Xi | X1, ..., X_{i-1}).
The choice of parents guarantees the global semantics:

P(X1, ..., Xn) = Π_{i=1..n} P(Xi | X1, ..., X_{i-1})    (chain rule)
= Π_{i=1..n} P(Xi | Parents(Xi))    (by construction)

A sketch of the parent-selection test in step 2 follows.
Constructing Bayes Networks: Example
Choose the ordering F, E, P, S, C, and at each step ask which of the variables already added must be parents:
- Add Fun.
- Add Exam: is P(E | F) = P(E)?
- Add Party: is P(P | F) = P(P)?
- Add Study: is P(S | F, E) = P(S)? Is P(S | F, E) = P(S | E)?
- Add College: is P(C | F, E, P, S) = P(C)? Is P(C | F, E, P, S) = P(C | P, S)?
Note that this network has additional dependencies.
Compact Networks
[Figure: the network built with the ordering F, E, P, S, C, alongside the original College network; the poor ordering produces extra links.]
Network Construction: Alternative
Start with topological semantics that specify the conditional independence relationships, defined by either:
- A node is conditionally independent of its non-descendants, given its parents.
- A node is conditionally independent of all other nodes given its parents, children, and children’s parents: its Markov blanket.
Then reconstruct the CPTs.
Network Construction: Alternative
[Figure: a node X among its non-descendants, illustrating that each node is conditionally independent of its non-descendants given its parents.]
Local semantics ⇒ global semantics.
For example, Exam is independent of College, given the values of Study and Party.
Network Construction: Alternative
U1
Z1j
…
Um
X
Y1
…
Znj
Yn
Each node is conditionally
independent of its parents,
children and children’s
parents. – Markov Blanket
College is independent of fun,
given Party.
Canonical Distributions
Completing a node’s CPT requires up to O(2^k) numbers (k = the number of parents).
- If the parent–child relationship is arbitrary, this can be difficult to do.
Standard patterns can be named, along with a few parameters, to fill in the CPT.
- Such a pattern is called a canonical distribution.
Deterministic Nodes
The simplest form is the deterministic node.
- Its value is specified exactly by its parents’ values.
- No uncertainty.
But what about relationships that are uncertain?
- If someone has a fever, do they have a cold, the flu, or a stomach bug?
- Can you have a cold or stomach bug without a fever?
Noisy-OR Relationships
A noisy-OR relationship permits uncertainty about each parent’s ability to cause the child to be true.
- The causal relationship may be inhibited.
- Assumes:
  - All possible causes are known.
    - Can have a miscellaneous category if necessary (a leak node).
  - Inhibition of one parent is independent of inhibition of the other parents.
Can you have a cold or stomach bug without a fever?
In the underlying deterministic model, Fever is true iff Cold, Flu, or Malaria is true; noisy-OR allows each cause to be independently inhibited.
Example
Given:
P(¬fever | cold, ¬flu, ¬malaria) = 0.6
P(¬fever | ¬cold, flu, ¬malaria) = 0.2
P(¬fever | ¬cold, ¬flu, malaria) = 0.1
Example
Requires O(k) parameters rather than O(2^k).

Cold | Flu | Malaria | P(¬Fever)
-----+-----+---------+----------------------
F    | F   | F       | 1.0
F    | F   | T       | 0.1
F    | T   | F       | 0.2
F    | T   | T       | 0.2 × 0.1 = 0.02
T    | F   | F       | 0.6
T    | F   | T       | 0.6 × 0.1 = 0.06
T    | T   | F       | 0.6 × 0.2 = 0.12
T    | T   | T       | 0.6 × 0.2 × 0.1 = 0.012
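The whole table comes from the three inhibition probabilities. A minimal sketch of building it:

```python
# Noisy-OR: P(¬fever | parents) is the product of the inhibition
# probabilities of the parents that are true (values from the slide above).
from itertools import product

q = {"cold": 0.6, "flu": 0.2, "malaria": 0.1}   # per-cause inhibition probabilities

def p_not_fever(**causes):
    """P(¬fever | causes): multiply the inhibitions of the causes that are present."""
    result = 1.0
    for name, present in causes.items():
        if present:
            result *= q[name]
    return result

for cold, flu, malaria in product([False, True], repeat=3):
    print(cold, flu, malaria, p_not_fever(cold=cold, flu=flu, malaria=malaria))
```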
Networks with Continuous Variables
How are continuous variables represented?
- Discretization using intervals.
  - Can result in loss of accuracy and large CPTs.
- Define probability density functions specified by a finite number of parameters,
  - e.g., a Gaussian distribution.
Hybrid Bayesian Networks
Contain both discrete and continuous variables.
Specification of such a network requires:
- A conditional distribution for a continuous variable with discrete or continuous parents.
- A conditional distribution for a discrete variable with continuous parents.
Example
A continuous child (Cost) with a discrete parent (Subsidy) and a continuous parent (Harvest).
[Figure: network with Subsidy → Cost, Harvest → Cost, and Cost → Buys.]
- The discrete parent is explicitly enumerated: the distribution must be defined for both values of Subsidy.
- The continuous parent is represented as a distribution: Cost c depends on the distribution function for h. A linear Gaussian distribution can be used.
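A minimal sketch of such a linear Gaussian conditional; the slopes, intercepts, and standard deviations below are invented for illustration:

```python
# Linear Gaussian conditional: Cost ~ N(a*h + b, sigma^2), with one
# (a, b, sigma) triple per value of the discrete parent Subsidy.
# All parameter values here are made up for illustration.
import math

params = {True:  (-0.5, 10.0, 0.5),   # subsidy = true:  slope a, intercept b, sigma
          False: (-0.5, 12.0, 1.0)}   # subsidy = false

def p_cost(c, h, subsidy):
    """Density of Cost = c given Harvest = h and Subsidy."""
    a, b, sigma = params[subsidy]
    mu = a * h + b
    return math.exp(-0.5 * ((c - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

print(p_cost(7.0, 8.0, True))   # density at cost 7 when harvest is 8 and subsidy holds
```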
Example
A discrete child (Buys) with a continuous parent (Cost).
[Figure: network with Subsidy → Cost, Harvest → Cost, and Cost → Buys.]
Set a threshold for cost; can use an integral of the standard normal distribution.
- The underlying decision process has a hard threshold, but the threshold’s location moves based upon random Gaussian noise: the probit distribution.
Example
- Probit distribution:
  - Usually a better fit for real problems.
- Logit distribution:
  - Uses the sigmoid function to determine the threshold.
  - Can be mathematically easier to work with.
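A minimal sketch comparing the two; the threshold μ and spread σ are invented, and the logit is scaled to roughly match the probit:

```python
# Probit vs. logit models for P(buys | cost); mu and sigma are illustrative.
import math

mu, sigma = 10.0, 1.5

def probit(c):
    """P(buys | cost = c) as the standard normal CDF of (mu - c) / sigma."""
    return 0.5 * (1 + math.erf((mu - c) / (sigma * math.sqrt(2))))

def logit(c):
    """P(buys | cost = c) as a sigmoid with comparable location and scale."""
    return 1 / (1 + math.exp(-2 * (mu - c) / sigma))

for c in (8.0, 10.0, 12.0):
    print(c, round(probit(c), 3), round(logit(c), 3))
```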
Bayes Networks and Exact Inference
Notation:
- X: the query variable
- E: the set of evidence variables E1, ..., Em
- e: a particular observed event
- Y: the set of nonevidence variables Y1, ..., Yl
  - Also called hidden variables.
- The complete set of variables: X = {X} ∪ E ∪ Y
- A query: P(X | e)
College example: CPTs
[College network and CPTs repeated; see above.]
Example Query
If you succeeded on an exam and had fun, what is the probability of partying?
- P(Party | Exam = true, Fun = true)
Inference by Enumeration
From Chapter 13 we know:

P(X | e) = α P(X, e) = α Σ_y P(X, e, y)

From this chapter we have:

P(X1, X2, ..., Xn) = Π_{i=1..n} P(Xi | Parents(Xi))

So each term P(x, e, y) in the joint distribution can be represented as a product of conditional probabilities from the network.
Inference by Enumeration
A query can be answered using a Bayes net by computing the sums of products of conditional probabilities from the network.
Example Query
If you succeeded on an exam and had fun, what is the probability of partying?
- P(Party | Exam = true, Fun = true)
- What are the hidden variables? College and Study.
Example Query
Let:
- C = College
- PR = Party
- S = Study
- E = Exam
- F = Fun
Then we have, from eq. 13.6 (p. 476):

P(pr | e, f) = α P(pr, e, f) = α Σ_C Σ_S P(pr, e, f, S, C)
Example Query
Using

P(X1, X2, ..., Xn) = Π_{i=1..n} P(Xi | Parents(Xi))

we can put

P(pr | e, f) = α P(pr, e, f) = α Σ_C Σ_S P(pr, e, f, S, C)

in terms of the CPT entries:

P(pr | e, f) = α Σ_C Σ_S P(pr | C) P(S | C) P(e | S, pr) P(f | pr) P(C)

The worst-case complexity of this equation is O(n · 2^n) for n variables.
Example Query
Improving the calculation:
- P(f | pr) is a constant, so it can be moved outside the summations over C and S:

P(pr | e, f) = α P(f | pr) Σ_C Σ_S P(pr | C) P(S | C) P(e | S, pr) P(C)

- Then move the elements that involve only C and not S outside the summation over S:

P(pr | e, f) = α P(f | pr) Σ_C P(C) P(pr | C) Σ_S P(S | C) P(e | S, pr)

A sketch of this computation follows.
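A minimal sketch of enumerating this query with the College CPTs; it evaluates the factored expression for both values of Party and normalizes (it prints ≈ 0.478):

```python
# P(Party | exam = true, fun = true) by enumeration over C and S,
# using the College CPT values from the earlier slides.
P_C = 0.2
P_S = {True: 0.8, False: 0.2}           # P(S = true | C)
P_PR = {True: 0.6, False: 0.5}          # P(PR = true | C)
P_E = {(True, True): 0.6, (True, False): 0.9,
       (False, True): 0.1, (False, False): 0.2}   # P(E = true | S, PR)
P_F = {True: 0.9, False: 0.7}           # P(F = true | PR)

def unnormalized(pr):
    """P(f | pr) * sum_C P(C) P(pr | C) * sum_S P(S | C) P(e | S, pr)."""
    total = 0.0
    for c in (True, False):
        pc = P_C if c else 1 - P_C
        inner = sum((P_S[c] if s else 1 - P_S[c]) * P_E[(s, pr)]
                    for s in (True, False))
        ppr = P_PR[c] if pr else 1 - P_PR[c]
        total += pc * ppr * inner
    return P_F[pr] * total

t, f = unnormalized(True), unnormalized(False)   # 0.126 and 0.13776
print(t / (t + f))                               # P(party | e, f) ≈ 0.478
```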
College example:
[College network and CPTs repeated; see above.]

P(pr | e, f) = α P(f | pr) Σ_C P(C) P(pr | C) Σ_S P(S | C) P(e | S, pr)
Example Query
Evaluating the expression as a tree, for Party = true:
- Sum over S with C = true: P(s | c) P(e | s, pr) + P(¬s | c) P(e | ¬s, pr) = 0.8(0.6) + 0.2(0.1) = 0.48 + 0.02 = 0.5
- Sum over S with C = false: P(s | ¬c) P(e | s, pr) + P(¬s | ¬c) P(e | ¬s, pr) = 0.2(0.6) + 0.8(0.1) = 0.12 + 0.08 = 0.2
- Sum over C: P(c) P(pr | c)(0.5) + P(¬c) P(pr | ¬c)(0.2) = 0.2(0.6)(0.5) + 0.8(0.5)(0.2) = 0.06 + 0.08 = 0.14
- Multiply by P(f | pr): 0.9 × 0.14 = 0.126
Similarly for P(¬pr | e, f). Still O(2^n).
Variable Elimination
A problem with the enumeration method is that particular products can be computed multiple times, reducing efficiency.
- Reduce the number of duplicate calculations by doing each calculation once and saving the result for later.
Variable elimination evaluates expressions from right to left, stores the intermediate results, and sums over each variable only for the portions of the expression that depend on that variable.
Variable Elimination
First, factor the equation, labeling each term:

P(PR | e, f) = α P(f | PR) Σ_C P(C) P(PR | C) Σ_S P(S | C) P(e | S, PR)
                     (F)        (C)   (PR)         (S)        (E)

Second, store the factor for E:
- a 2×2 matrix f_E(S, PR).
Third, store the factor for S:
- a 2×2 matrix:
  f_S(S, C) = | P(s | c)   P(s | ¬c)  |
              | P(¬s | c)  P(¬s | ¬c) |
Variable Elimination
Fourth, sum out S from the product of the first two factors (f_SE denotes the factor in which S has been summed out):

f_SE(C, PR) = Σ_S f_S(S, C) × f_E(S, PR)
            = f_S(s, C) × f_E(s, PR) + f_S(¬s, C) × f_E(¬s, PR)

- This is called a pointwise product.
  - It creates a new factor whose variables are the union of the variables of the two factors in the product:
    f(X1...Xj, Y1...Yk, Z1...Zl) = f1(X1...Xj, Y1...Yk) × f2(Y1...Yk, Z1...Zl)
- Any factor that does not depend on the variable to be summed out can be moved outside the summation.
Variable Elimination
P(PR | e, f) = α P(f | PR) Σ_C P(C) P(PR | C) f_SE(C, PR)

Fifth, store the factor for PR:
- a 2×2 matrix:
  f_PR(PR, C) = | P(pr | c)   P(pr | ¬c)  |
                | P(¬pr | c)  P(¬pr | ¬c) |
Sixth, store the factor for C:
  f_C(C) = | P(c)  |
           | P(¬c) |
Variable Elimination
P(PR | e, f) = α P(f | PR) Σ_C P(C) P(PR | C) f_SE(C, PR)

Seventh, sum out C from the product of the factors:

f_CPRSE(PR) = Σ_C f_C(C) × f_PR(PR, C) × f_SE(C, PR)
            = f_C(c) × f_PR(PR, c) × f_SE(c, PR) + f_C(¬c) × f_PR(PR, ¬c) × f_SE(¬c, PR)
Variable Elimination
P(PR | e, f) = α P(f | PR) f_CPRSE(PR)

Next, store the factor for F:
  f_F(PR) = | P(f | pr)  |
            | P(f | ¬pr) |
Finally, calculate the final result:

P(PR | e, f) = α f_F(PR) × f_CPRSE(PR)

A sketch of these factor operations follows.
Elimination Simplification
Any leaf node that is not a query variable
or an evidence variable can be removed.
Every variable that is not an ancestor of a
query variable or an evidence variable is
irrelevant to the query and can be
eliminated.
Elimination Simplification
Book example:
- What is the probability that John calls if there is a burglary?

P(J | b) = α P(b) Σ_e P(e) Σ_a P(a | b, e) P(J | a) Σ_m P(m | a)

Does this matter? The last term, Σ_m P(m | a), equals 1: MaryCalls is a leaf that is neither a query nor an evidence variable, so it can be removed.
[Figure: network with Burglary → Alarm ← Earthquake, Alarm → JohnCalls, Alarm → MaryCalls.]
Complexity of Exact Inference
Variable elimination is more efficient than enumeration.
- Time and space requirements are dominated by the size of the largest factor constructed, which is determined by the order of variable elimination and the network structure.
Polytrees
Polytrees are singly connected networks:
- At most one undirected path between any two nodes.
- Time and space requirements for exact inference are linear in the size of the network.
  - Size is the number of CPT entries.
Polytrees
[Figure: the Burglary network and the College network side by side.]
Applying variable elimination to multiply connected networks has worst-case exponential time and space complexity.
Are these networks polytrees? The Burglary network is; the College network is not, since College and Exam are connected by two distinct paths (through Study and through Party).