Handling uncertainty: Propositional probabilistic logic


Solving problems using propositional logic
• Need to write what you know as propositional formulas
• Theorem proving will then tell you whether a given new sentence will hold given what you know
• Three kinds of queries
  – Is my knowledge base consistent? (i.e., is there at least one world where everything I know is true?) Satisfiability
  – Is the sentence S entailed by my knowledge base? (i.e., is it true in every world where my knowledge base is true?)
  – Is the sentence S consistent with / possibly true given my knowledge base? (i.e., is S true in at least one of the worlds where my knowledge base holds?)
    • S is consistent if ~S is not entailed
• But propositional logic cannot differentiate between degrees of likelihood among possible sentences
Example
• Pearl lives in Los Angeles. It is a high-crime area. Pearl installed a burglar alarm. He asked his neighbors John & Mary to call him if they hear the alarm. This way he can come home if there is a burglary. Los Angeles is also earthquake-prone. The alarm goes off when there is an earthquake.
Burglary => Alarm
Earth-Quake => Alarm
Alarm => John-calls
Alarm => Mary-calls
If there is a burglary, will Mary call?
  Check: KB & B |= M
If Mary didn't call, is it possible that Burglary occurred?
  Check: KB & ~M doesn't entail ~B
Example (Real)
• Pearl lives in Los Angeles. It is a high-crime area. Pearl installed a burglar alarm. He asked his neighbors John & Mary to call him if they hear the alarm. This way he can come home if there is a burglary. Los Angeles is also earthquake-prone. The alarm goes off when there is an earthquake.
• Pearl lives in the real world, where (1) burglars can sometimes disable alarms, (2) some earthquakes may be too slight to cause the alarm, (3) even in Los Angeles, burglaries are more likely than earthquakes, (4) John and Mary both have their own lives and may not always call when the alarm goes off, (5) between John and Mary, John is more of a slacker than Mary, and (6) John and Mary may call even without the alarm going off.
Burglary => Alarm
Earth-Quake => Alarm
Alarm => John-calls
Alarm => Mary-calls
If there is a burglary, will Mary call?
  Check: KB & B |= M
If Mary didn't call, is it possible that Burglary occurred?
  Check: KB & ~M doesn't entail ~B
John already called. If Mary also calls, is it more likely that Burglary occurred?
You now also hear on the TV that there was an earthquake. Is Burglary more or less likely now?
How do we handle Real Pearl?
Potato in the tail-pipe
• Omniscient & Eager way:
  – Model everything!
  – E.g. model exactly the conditions under which John will call: he shouldn't be listening to loud music, he hasn't gone on an errand, he didn't recently have a tiff with Pearl, etc. etc.
    A & c1 & c2 & c3 & … & cn => J
    (also, the exceptions may have interactions, e.g. c1 & c5 => ~c9)
  – The Qualification and Ramification problems make this an infeasible enterprise
• Ignorant (non-omniscient) and Lazy (non-omnipotent) way:
  – Model the likelihood
  – In 85% of the worlds where there was an alarm, John will actually call
  – How do we do this?
    • Non-monotonic logics
    • "Certainty factors"
    • "Fuzzy logic"
    • "Probability" theory?
Non-monotonic (default) logic
• Propositional calculus (as well as the first-order logic we shall discuss later) is monotonic, in that once you prove a fact F to be true, no amount of additional knowledge can allow us to disprove F.
• But, in the real world, we jump to conclusions by default, and revise them on additional evidence
  – Consider the way the truth of the statement "F: Tweety Flies" is revised by us when we are given facts in sequence: 1. Tweety is a bird (F) 2. Tweety is an Ostrich (~F) 3. Tweety is a magical Ostrich (F) 4. Tweety was cursed recently (~F) 5. Tweety was able to get rid of the curse (F)
• How can we make logic draw this sort of "defeasible" (aka defeatable) conclusion?
  – Many ideas, with one being negation as failure (see the sketch below)
  – Let the rule about birds be Bird & ~abnormal => Fly
    • The "abnormal" predicate is treated specially: if we can't prove abnormal, we can assume ~abnormal is true
    • (Note that in normal logic, failure to prove a fact F doesn't allow us to assume that ~F is true, since F may be holding in some models and not in other models.)
  – The non-monotonic logic enterprise involves (1) providing clean semantics for this type of reasoning and (2) making defeasible inference efficient
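A minimal sketch of negation as failure for the Tweety example, in Python. The facts, the trivial "proof" procedure, and the abnormality rule are all made up for illustration; they are not from the slides.

```python
# Rule: Bird & ~abnormal => Fly, where ~abnormal is assumed whenever
# abnormal cannot be proved (negation as failure).
facts = {"bird(tweety)", "ostrich(tweety)"}

def provable(fact):
    """Toy proof procedure: a fact is provable iff it is in the KB,
    plus one illustrative rule saying ostriches are abnormal w.r.t. flying."""
    if fact == "abnormal(tweety)":
        return "ostrich(tweety)" in facts
    return fact in facts

def flies(x):
    # Negation as failure: if we can't prove abnormal(x), assume ~abnormal(x).
    return provable(f"bird({x})") and not provable(f"abnormal({x})")

print(flies("tweety"))             # False: abnormal(tweety) is provable
facts.discard("ostrich(tweety)")
print(flies("tweety"))             # True: the abnormality can no longer be proved
```

Note how the conclusion is defeasible: adding or removing facts flips it, unlike in monotonic logic.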
Certainty Factors
• Associate numbers with each of the facts/axioms
• When you do derivations, compute the c.f. of the results in terms of the c.f. of the constituents ("truth functional")
• Problem: circular reasoning because of mixed causal/diagnostic directions
  – Raining => Grass-wet 0.9
  – Grass-wet => Raining 0.7
• If you know grass-wet with 0.4, then you conclude raining with some certainty, which makes grass-wet more certain, which…
Fuzzy Logic vs. Prob. Prop. Logic
• Fuzzy logic assumes that the world is made of statements that have different grades of truth
  – Recall the puppy example
• Fuzzy logic is "truth functional", i.e., it assumes that the truth value of a sentence can be established in terms of the truth values only of the constituent elements of that sentence.
• PPL assumes that the world is made up of statements that are either true or false
• PPL is truth functional for "truth value in a given world" but not truth functional for entailment status.
Prop. Logic vs. Prob. Prop. Logic
• Model theory for logic
  – Set of all worlds where the formula/KB holds
  – Each world is either a model or not
• Proof theory/inference for logic
  – Rather than enumerate all models, write KB sentences that constrain the models
• Model theory for prob. logic
  – For each world, give the probability p that the formula holds in that world
    • p(w) = 0 means that world is definitely not a model
    • Otherwise, 0 < p(w) <= 1
    • Sum of p(w) = 1
    • AKA the Joint Probability Distribution: 2^n - 1 numbers
• Proof theory for prob. logic
  – Statements on subsets of propositions
    • E.g. P(A)=0.5; P(B|A)=0.7 etc.
    • (Conditional) independences…
  – These constrain the joint probability distribution
Easy Special Cases
• If there are no relations between the propositions (i.e., they can take values independently of each other)
  – Then the joint probability distribution can be specified in terms of the probability of each proposition being true
  – Just n numbers instead of 2^n
• If, in addition, each proposition is equally likely to be true or false
  – Then the joint probability distribution can be specified without giving any numbers!
    • All worlds are equally probable! If there are n propositions, each world will be 1/2^n probable
  – The probability of any propositional conjunction with m (< n) propositions will be 1/2^m
Will we always need 2^n numbers?
• If every pair of variables is independent of each other, then
  – P(x1,x2,…,xn) = P(xn) * P(xn-1) * … * P(x1)
  – Need just n numbers! (See the sketch below.)
  – But if our world is that simple, it would also be very uninteresting & uncontrollable (nothing is correlated with anything else!)
• A more realistic middle ground is that interactions between variables are contained to regions.
  – e.g. the "school variables" and the "home variables" interact only loosely (are independent for most practical purposes)
  – We will wind up needing O(2^k) numbers (k << n)
• We need 2^n numbers if every subset of our n variables can be correlated together
  – P(x1,x2,…,xn) = P(xn|x1…xn-1) * P(xn-1|x1…xn-2) * … * P(x1)
  – But that is too pessimistic an assumption about the world
    • If our world were so interconnected, we would've been dead long back…
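A minimal sketch of the fully independent case, assuming three made-up propositions and probabilities: n marginals determine all 2^n joint entries.

```python
# Under full independence, P(x1,...,xn) = prod_i P(xi): n numbers give the
# whole joint. The propositions and probabilities below are illustrative.
from itertools import product

p_true = {"Burglary": 0.01, "Earthquake": 0.02, "Vacation": 0.30}

def joint(assignment):
    prob = 1.0
    for var, value in assignment.items():
        prob *= p_true[var] if value else 1 - p_true[var]
    return prob

# All 2^3 = 8 joint entries follow from just 3 marginals, and they sum to 1:
entries = [joint(dict(zip(p_true, vals)))
           for vals in product([True, False], repeat=3)]
print(abs(sum(entries) - 1.0) < 1e-9)   # True
```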
Probabilistic Calculus to the Rescue
Burglary => Alarm
Earth-Quake => Alarm
Alarm => John-calls
Alarm => Mary-calls
Suppose we know the likelihood of each of the (propositional) worlds (aka the Joint Probability Distribution). Then we can use the standard rules of probability to compute the likelihood of all queries (as I will remind you).
So, the Joint Probability Distribution is all that you ever need!
In the case of the Pearl example, we just need the joint probability distribution over B, E, A, J, M (32 numbers).
  -- In general, 2^n separate numbers (which should add up to 1)
The topology of the network encodes the conditional independence assertions: only 10 (instead of 32) numbers to specify!
If the Joint Distribution is sufficient for reasoning, what is domain knowledge supposed to help us with?
  -- Answer: Indirectly, by helping us specify the joint probability distribution with fewer than 2^n numbers
  -- The local relations between propositions can be seen as "constraining" the form the joint probability distribution can take!
Propositional Probabilistic Logic
Computing Prob. Queries Given a Joint Distribution
Joint distribution over CA (Cavity) and TA (Toothache):

          TA      ~TA
  CA      0.04    0.06
  ~CA     0.01    0.89

Queries:
  P(CA & TA) = ?
  P(CA) = ?
  P(TA) = ?
  P(CA V TA) = ?
  P(CA|~TA) = ?
  P(~TA|CA) = ?
  Check if P(CA|~TA) = P(~TA|CA) * P(CA)

Answers:
  P(CA & TA) = 0.04
  P(CA) = 0.04 + 0.06 = 0.1 (marginalizing over TA)
  P(TA) = 0.04 + 0.01 = 0.05
  P(CA V TA) = P(CA) + P(TA) - P(CA & TA) = 0.1 + 0.05 - 0.04 = 0.11
  P(CA|~TA) = P(CA & ~TA)/P(~TA) = 0.06/(0.06 + 0.89) = 0.06/0.95 ≈ 0.063

Think of this as analogous to entailment by truth-table enumeration!
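A small sketch of answering these queries mechanically from the joint table above (the table values are the ones on the slide; the helper function is just for illustration).

```python
# Joint distribution over (CA, TA) as given in the table.
joint = {(True, True): 0.04, (True, False): 0.06,
         (False, True): 0.01, (False, False): 0.89}

def P(event):
    """Probability of the event defined by event(ca, ta)."""
    return sum(p for (ca, ta), p in joint.items() if event(ca, ta))

print(P(lambda ca, ta: ca and ta))            # P(CA & TA)  = 0.04
print(P(lambda ca, ta: ca))                   # P(CA)       = 0.10 (marginalize over TA)
print(P(lambda ca, ta: ta))                   # P(TA)       = 0.05
print(P(lambda ca, ta: ca or ta))             # P(CA V TA)  = 0.11
print(P(lambda ca, ta: ca and not ta) /
      P(lambda ca, ta: not ta))               # P(CA|~TA)   ≈ 0.063
```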
The material in this slide was presented on the white board
If B=>A then
P(A|B) = ?
P(B|~A) = ?
P(B|A) = ?
Most useful probabilistic reasoning involves computing posterior distributions, e.g. P(CA | TA; Catch; Romney-won).
[Figure: the distributions P(CA), P(CA|TA), P(CA|TA; Catch) plotted over the variable values, showing how the posterior changes as evidence is added]
Important: computing a posterior distribution is inference, not learning.
The material in this slide was presented on the white board
CONDITIONAL PROBABILITIES
Non-monotonicity w.r.t. evidence: P(A|B) can be either higher than, lower than, or equal to P(A)
The material in this slide was presented on the white board
Generalized Bayes rule
  P(A|B,e) = P(B|A,e) P(A|e) / P(B|e)
It lets us get by with easier-to-assess numbers. Let A be Anthrax and Rn be Runny Nose:
  P(A|Rn) = P(Rn|A) P(A) / P(Rn)
Think of this as analogous to inference rules (like modus ponens).
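A hedged numeric sketch of the Anthrax/Runny-Nose instance of Bayes rule; all the numbers are made up for illustration.

```python
# P(A|Rn) = P(Rn|A) P(A) / P(Rn), with P(Rn) obtained by marginalization.
p_a = 1e-5                 # assumed prior P(Anthrax)
p_rn_given_a = 0.9         # assumed causal probability P(RunnyNose | Anthrax)
p_rn_given_not_a = 0.1     # assumed P(RunnyNose | ~Anthrax)

p_rn = p_rn_given_a * p_a + p_rn_given_not_a * (1 - p_a)
p_a_given_rn = p_rn_given_a * p_a / p_rn
print(p_a_given_rn)        # ~9e-5: a runny nose barely raises belief in anthrax
```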
The material in this slide was presented on the white board
Relative ease/utility of assessing various types of probabilities
• The joint distribution requires us to assess probabilities of the type P(x1,~x2,x3,…,~xn). This means we have to look at all the entities in the world and see what fraction of them have x1,~x2,x3,…,~xn true. A difficult experiment to set up..
• Conditional probabilities of the type P(A|B) are relatively easier to assess
  – You just need to look at the set of entities having B true, and look at the fraction of them that also have A true
  – Eventually, they too can get baroque: P(x1,~x2,…,xm | y1,…,yn)
• Among the conditional probabilities, causal probabilities of the form P(effect|cause) are better to assess than diagnostic probabilities of the form P(cause|effect)
  – Causal probabilities tend to be more stable compared to diagnostic probabilities
  – (For example, a textbook in dentistry can publish P(TA|Cavity) and hope that it will hold in a variety of places. In contrast, P(Cavity|TA) may depend on other fortuitous factors, e.g. in areas where people tend to eat a lot of ice cream, many toothaches may be prevalent, and few of them may actually be due to cavities.)
The material in this slide was presented on the white board
What happens if there are multiple symptoms…?
A patient walks in and complains of a toothache. You assess P(Cavity|Toothache).
Now you try to probe the patient's mouth with that steel thingie, and it catches… How do we update our belief in Cavity?
  P(Cavity|TA, Catch) = P(TA, Catch|Cavity) * P(Cavity) / P(TA, Catch)
                      = a * P(TA, Catch|Cavity) * P(Cavity)
We need to know P(TA, Catch|Cavity)! With n evidence variables, we would need 2^n such probabilities!
Conditional independence to the rescue: suppose
  P(TA, Catch|Cavity) = P(TA|Cavity) * P(Catch|Cavity)
(A sketch of this update follows below.)
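A minimal sketch of this conditionally independent ("naive Bayes") update; the CPT numbers are invented for illustration.

```python
# P(Cavity | TA, Catch) = a * P(TA|Cavity) * P(Catch|Cavity) * P(Cavity),
# using the assumption P(TA, Catch | Cavity) = P(TA|Cavity) * P(Catch|Cavity).
p_cavity = 0.1                                # assumed prior
p_ta_given    = {True: 0.6, False: 0.05}      # assumed P(TA | Cavity)
p_catch_given = {True: 0.9, False: 0.20}      # assumed P(Catch | Cavity)

unnorm = {c: p_ta_given[c] * p_catch_given[c] *
             (p_cavity if c else 1 - p_cavity)
          for c in (True, False)}
alpha = 1.0 / sum(unnorm.values())            # the normalization constant "a"
posterior = {c: alpha * v for c, v in unnorm.items()}
print(round(posterior[True], 3))              # P(Cavity | TA, Catch) ≈ 0.857
```

Each additional conditionally independent symptom costs only two more numbers, not a doubling of the table.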
Put variables in reverse topological sort; apply the chain rule:
  P(J, M, A, ~B, ~E)
    = P(J|M,A,~B,~E) * P(M|A,~B,~E) * P(A|~B,~E) * P(~B|~E) * P(~E)
    = P(J|A) * P(M|A) * P(A|~B,~E) * P(~B) * P(~E)   [by the conditional independence inherent in the Bayes net]
    = 0.9 * 0.7 * 0.001 * 0.999 * 0.998
    ≈ 0.000628
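A quick check of that arithmetic in code, using the CPT entries quoted above (P(~B) = 0.999 and P(~E) = 0.998 are taken directly from the slide's numbers).

```python
p_j_given_a     = 0.9    # P(J | A)
p_m_given_a     = 0.7    # P(M | A)
p_a_given_nb_ne = 0.001  # P(A | ~B, ~E)
p_not_b         = 0.999  # P(~B)
p_not_e         = 0.998  # P(~E)

p_joint = p_j_given_a * p_m_given_a * p_a_given_nb_ne * p_not_b * p_not_e
print(round(p_joint, 6))   # P(J, M, A, ~B, ~E) ≈ 0.000628
```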
Local Semantics: a node is independent of its non-descendants given its parents.
This gives the global semantics, i.e., the full joint.
Conditional Independence Assertions
• We write X || Y | Z to say that the set of variables X is conditionally independent of the set of variables Y given evidence on the set of variables Z (where X, Y, Z are subsets of the set of all random variables in the domain model)
• Inference can exploit conditional independence assertions. Specifically,
  – X || Y | Z implies
    • P(X & Y|Z) = P(X|Z) * P(Y|Z)
    • P(X|Y,Z) = P(X|Z)
    • P(Y|X,Z) = P(Y|Z)
  – If A||B|C then P(A,B,C) = P(A|B,C) P(B,C) = P(A|B,C) P(B|C) P(C) = P(A|C) P(B|C) P(C)
    (Can get by with 1+2+2 = 5 numbers instead of 8)
• Why not write down all conditional independence assertions that hold in a domain?
Cond. Indep. Assertions (Contd)
• Idea: Why not write down all conditional independence assertions (CIA) (X || Y | Z) that hold in a domain?
• Problem: There can be exponentially many conditional independence assertions that hold in a domain (recall that X, Y and Z are all subsets of the domain variables).
• Brilliant Idea: Maybe we should implicitly specify the CIA by writing down the "local dependencies" between variables using a graphical model
  – A Bayes network is a way of doing just this.
    • The Bayes net is a Directed Acyclic Graph whose nodes are random variables, and the immediate dependencies between variables are represented by directed arcs
  – The topology of a Bayes network shows the inter-variable dependencies. Given the topology, there is a way of checking whether any conditional independence assertion holds in the network (the Bayes Ball algorithm and the D-Sep idea)
Topological Semantics
Markov Blanket: parents; children; children's other parents
These two conditions are equivalent.
Many other conditional independence assertions follow from these.
CIA implicit in Bayes Nets
• So, what conditional independence assumptions are implicit in Bayes nets?
  – Local Markov Assumption:
    • A node N is independent of its non-descendants (including ancestors) given its immediate parents. (So if P are the immediate parents of N, and A is a subset of the ancestors and other non-descendants, then {N} || A | P.)
    • (Equivalently) A node N is independent of all other nodes given its Markov blanket (parents, children, children's parents)
  – Given this assumption, many other conditional independencies follow. For a full answer, we need to appeal to the D-Sep condition and/or Bayes Ball reachability
[Figure: two Bayes net structures over Burglary, Earthquake, Alarm, John-calls, Mary-calls obtained with different variable orderings; with a non-causal ordering the CPT sizes grow as 1, 2, 4, 8, 16, i.e. 31 numbers in all]
P(A|J,M) = P(A)?
Introduce variables in the causal order. This is easy when you know the causality of the domain, hard otherwise..
How many probabilities are needed? 13 for the new ordering; 10 for the old. Is this the worst?
Bayesian (Personal) Probabilities
• Continuing with bad friends: in the question above, suppose a second friend comes along and says that he can give you the conditional probabilities that you want to complete the specification of your Bayes net. You ask him a CPT entry, and pat comes a response, some number between 0 and 1. This friend is well meaning, but you are worried that the numbers he is giving may lead to some sort of inconsistent joint probability distribution. Is your worry justified (i.e., can your friend give you numbers that lead to an inconsistency)?
  (To understand "inconsistency", consider someone who insists on giving you P(A), P(B), P(A&B) as well as P(AVB), and the numbers wind up not satisfying P(AVB) = P(A) + P(B) - P(A&B) [or alternately, they insist on giving you P(A|B), P(B|A), P(A) and P(B), and the four numbers don't satisfy Bayes rule].)
• Answer: No. As long as we only ask the friend to fill up the CPTs in the Bayes network, there is no way the numbers won't make up a consistent joint probability distribution
  – This should be seen as a feature..
Personal Probabilities
– John may be an optimist and
believe that P(burglary)=0.01 and
Tom may be a pessimist and
believe that P(burglary)=0.99
– Bayesians consider both John and
Tom to be fine (they don’t insist
on an objective frequentist
interpretation for probabilities)
– However, Bayesians do think that
John and Tom should act
consistently with their own beliefs
• For example, it makes no
sense for John to go about
installing tons of burglar
alarms given his belief, just as
it makes no sense for Tom to
put all his valuables on his
lawn
Ideas for reducing the number of probabilities to be specified
• Problem 1: The joint distribution requires 2^n numbers to specify, and those numbers are harder to assess.
  Solution: Use Bayes nets to reduce the numbers and specify them as CPTs.
• Problem 2: But CPTs will be as big as the full joint if the network is dense.
  Solution: Introduce intermediate variables to induce sparsity into the network.
• Problem 3: But CPTs can still be quite hard to specify if there are too many parents (or if the variables are continuous).
  Solution: Parameterize the CPT (use Noisy-OR etc. for discrete variables; Gaussians etc. for continuous variables).
Making the network Sparse by introducing intermediate variables
• Consider a network of boolean variables where n parent nodes are connected to m children nodes (with each parent influencing each child).
  – You will need n + m*2^n conditional probabilities
• Suppose you realize that what is really influencing the child nodes is some single aggregate function of the parents' values (e.g. the sum of the parents).
  – We can introduce a single intermediate node called "sum" which has links from all the n parent nodes, and which separately influences each of the m child nodes
• Now you will wind up needing only n + 2^n + 2m conditional probabilities to specify this new network! (A count is worked out below.)
Learning such
hidden variables
from data poses
challenges..
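A quick parameter count for the construction just described, with assumed sizes n = 5 parents and m = 3 children (all boolean).

```python
n, m = 5, 3
dense  = n + m * 2**n      # every child conditioned directly on all n parents
sparse = n + 2**n + 2 * m  # children conditioned on a single intermediate "sum" node
print(dense, sparse)       # 101 vs 43
```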
Compact/parameterized distributions are pretty much the only way to go when continuous variables are involved!
Noisy-OR: think of a firing squad with up to k gunners trying to shoot you. You will live only if everyone who shoots misses..
We only consider the "failure to cause" probabilities of the causes that hold:
  P(~X | u_1, …, u_k) = product over i = j+1, …, k of r_i
where u_{j+1}, …, u_k are the parents that hold, and r_i is the failure-to-cause probability of the ith parent (the probability that X does not hold even though its ith parent holds).
How about Noisy-AND? (hint: A&B => ~(~A V ~B))
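A minimal noisy-OR sketch; the parent names and the r_i values are illustrative only.

```python
# Noisy-OR: P(~X | parents) is the product of the "failure to cause"
# probabilities r_i of just the parents that currently hold.
r = {"Cold": 0.6, "Flu": 0.2, "Malaria": 0.1}   # assumed r_i values

def p_x_given(parents_true):
    """P(X = true | the given set of parents hold), under noisy-OR."""
    p_not_x = 1.0
    for parent in parents_true:
        p_not_x *= r[parent]       # only the causes that hold contribute
    return 1.0 - p_not_x

print(p_x_given({"Cold", "Flu"}))  # 1 - 0.6*0.2 = 0.88
# The full CPT over k parents needs only k numbers instead of 2^k.
```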
Constructing Belief Networks:
Summary
• Decide on what sorts of queries you are interested in answering
– This in turn dictates what factors to model in the network
• Decide on a vocabulary of the variables and their domains for the
problem
– Introduce “Hidden” variables into the network as needed to make the
network “sparse”
• Decide on an order of introduction of variables into the network
– Introducing variables in causal direction leads to fewer connections
(sparse structure) AND easier to assess probabilities
• Try to use canonical distributions to specify the CPTs
– Noisy-OR
– Parameterized discrete/continuous distributions
• Such as Poisson, Normal (Gaussian) etc
Case Study: Pathfinder System
• Domain: Lymph node diseases
  – Deals with 60 diseases and 100 disease findings
• Versions:
  – Pathfinder I: A rule-based system with logical reasoning
  – Pathfinder II: Tried a variety of approaches for uncertainty
    • Simple Bayes reasoning outperformed the others
  – Pathfinder III: Simple Bayes reasoning, but with reassessed probabilities
  – Pathfinder IV: A Bayesian network was used to handle a variety of conditional dependencies.
    • Deciding vocabulary: 8 hours
    • Devising the topology of the network: 35 hours
    • Assessing the (14,000) probabilities: 40 hours
      – Physician experts liked assessing causal probabilities
• Evaluation: 53 "referral" cases
  – Pathfinder III: 7.9/10
  – Pathfinder IV: 8.9/10 [Saves one additional life in every 1000 cases!]
  – A more recent comparison shows that Pathfinder now outperforms the experts who helped design it!!
Independence in Bayes Networks: Causal Chains, Common Causes, Common Effects
• Common cause (diverging): X <- Z -> Y. X and Y are caused by Z; the path is blocked if Z is given.
• Causal chain (linear): X -> Z -> Y. X causes Y through Z; the path is blocked if Z is given.
• Common effect (converging): X -> Z <- Y. X and Y cause Z; the path is blocked only if neither Z nor its descendants are given.
D-Sep (direction-dependent separation)
• X || Y | E if every undirected path from X to Y is blocked by E
  – A path is blocked if there is a node Z on the path such that
    1. Z is in E and Z has one arrow coming in and another going out
    2. Z is in E and Z has both arrows going out
    3. Neither Z nor any of its descendants is in E, and both path arrows lead into Z
• Examples to check on the alarm network: B||M|A; (J,M)||E|A; B||E; B||E|A; B||E|M
• "Asia" network: Visit to Asia (V) and Smoking (S) are root nodes; V -> Tuberculosis (T); S -> Lung Cancer (L) and Bronchitis (B); T and L -> Abnormality in Chest (A); A -> X-Ray (X); A and B -> Dyspnea (D)
• How many probabilities are needed? (Assume all are boolean variables.)
• P(V,~T,S,~L,A,B,~X,~D) = ?
• Which conditional independence assertions hold? V||A|T; T||L; T||L|D; X||D; X||D|A; (V,S)||(X,D)|A; (V,S)||X|A
The 0th idea for Bayes Net Inference
• Given a bayes net, we can compute all the entries
of the joint distribution (by just multiplying entries
in CPTs)
• Given the joint distribution, we can answer any
probabilistic query.
• Ergo, we can do inference on bayes networks
• Qn: Can we do better?
– Ideas:
• Implicitly enumerate only the part of the joint that is needed
• Use sampling techniques to compute the probabilities
Factor Algebra…
[Factor table for P(A,B) shown on the slide]
Variable Elimination (Summing Away)
Elimination is what we mostly do to compute probabilities:
  P(A|B) = P(A,B)/P(B)
  P(B) = ? Eliminate A: sum the rows where A changes polarity.
Normalization
If the expression tree is evaluated in a depth-first fashion, then the space requirement is linear..
A join followed by summing out A:
  f_A(a,b,e)*f_J(a)*f_M(a) + f_A(~a,b,e)*f_J(~a)*f_M(~a)
Complexity depends on the size of the largest factor, which in turn depends on the order in which variables are eliminated..
*Read Worked Out Example of
Variable Elimination in the Lecture
Notes*
[From Lise Getoor’s notes]
A More Complex Example
• "Asia" network (V: Visit to Asia, S: Smoking, T: Tuberculosis, L: Lung Cancer, A: Abnormality in Chest, B: Bronchitis, X: X-Ray, D: Dyspnea)
• We want to compute P(d)
• Need to eliminate: v, s, x, t, l, a, b
• Initial factors:
  P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
• We want to compute P(d); need to eliminate: v, s, x, t, l, a, b
• Initial factors:
  P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
Eliminate: v
Compute: f_v(t) = Σ_v P(v) P(t|v)
  ⇒ f_v(t) P(s) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
Note: f_v(t) = P(t)
In general, the result of elimination is not necessarily a probability term
• We want to compute P(d); need to eliminate: s, x, t, l, a, b
• Factors so far:
  P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
  ⇒ f_v(t) P(s) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
Eliminate: s
Compute: f_s(b,l) = Σ_s P(s) P(b|s) P(l|s)
  ⇒ f_v(t) f_s(b,l) P(a|t,l) P(x|a) P(d|a,b)
Summing on s results in a factor with two arguments, f_s(b,l).
In general, the result of elimination may be a function of several variables
• We want to compute P(d); need to eliminate: x, t, l, a, b
• Factors so far:
  P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
  ⇒ f_v(t) P(s) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
  ⇒ f_v(t) f_s(b,l) P(a|t,l) P(x|a) P(d|a,b)
Eliminate: x
Compute: f_x(a) = Σ_x P(x|a)
  ⇒ f_v(t) f_s(b,l) f_x(a) P(a|t,l) P(d|a,b)
Note: f_x(a) = 1 for all values of a !!
• We want to compute P(d); need to eliminate: t, l, a, b
• Factors so far:
  P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
  ⇒ f_v(t) P(s) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
  ⇒ f_v(t) f_s(b,l) P(a|t,l) P(x|a) P(d|a,b)
  ⇒ f_v(t) f_s(b,l) f_x(a) P(a|t,l) P(d|a,b)
Eliminate: t
Compute: f_t(a,l) = Σ_t f_v(t) P(a|t,l)
  ⇒ f_s(b,l) f_x(a) f_t(a,l) P(d|a,b)
• We want to compute P(d); need to eliminate: l, a, b
• Factors so far:
  P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
  ⇒ f_v(t) P(s) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
  ⇒ f_v(t) f_s(b,l) P(a|t,l) P(x|a) P(d|a,b)
  ⇒ f_v(t) f_s(b,l) f_x(a) P(a|t,l) P(d|a,b)
  ⇒ f_s(b,l) f_x(a) f_t(a,l) P(d|a,b)
Eliminate: l
Compute: f_l(a,b) = Σ_l f_s(b,l) f_t(a,l)
  ⇒ f_l(a,b) f_x(a) P(d|a,b)
• We want to compute P(d); need to eliminate: a, b
• Factors so far:
  P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
  ⇒ f_v(t) P(s) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
  ⇒ f_v(t) f_s(b,l) P(a|t,l) P(x|a) P(d|a,b)
  ⇒ f_v(t) f_s(b,l) f_x(a) P(a|t,l) P(d|a,b)
  ⇒ f_s(b,l) f_x(a) f_t(a,l) P(d|a,b)
  ⇒ f_l(a,b) f_x(a) P(d|a,b)  ⇒  f_a(b,d)  ⇒  f_b(d)
Eliminate: a, b
Compute: f_a(b,d) = Σ_a f_l(a,b) f_x(a) P(d|a,b);  f_b(d) = Σ_b f_a(b,d)
Variable Elimination
• We now understand variable elimination as a sequence of rewriting operations
• The actual computation is done in the elimination steps
• The computation depends on the order of elimination
  – An optimal elimination order can be computed, but doing so is NP-hard
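A minimal factor-algebra sketch of the two operations above (pointwise product and summing out). The factor representation and the tiny two-node example are made up for illustration; this is not the lecture's code.

```python
from itertools import product

def multiply(f1, f2):
    """Pointwise product of two factors, each given as (variables, table)
    where table maps assignment tuples of 0/1 to values."""
    v1, t1 = f1
    v2, t2 = f2
    vs = tuple(dict.fromkeys(v1 + v2))           # union of variables, in order
    table = {}
    for assign in product([0, 1], repeat=len(vs)):
        env = dict(zip(vs, assign))
        table[assign] = t1[tuple(env[v] for v in v1)] * t2[tuple(env[v] for v in v2)]
    return (vs, table)

def sum_out(factor, var):
    """Eliminate `var` from a factor by summing it away."""
    vs, t = factor
    keep = tuple(v for v in vs if v != var)
    table = {}
    for assign, val in t.items():
        env = dict(zip(vs, assign))
        key = tuple(env[v] for v in keep)
        table[key] = table.get(key, 0.0) + val
    return (keep, table)

# Tiny chain A -> B with assumed CPTs: P(A=1)=0.3, P(B=1|A=1)=0.9, P(B=1|A=0)=0.2
fA  = (("A",), {(1,): 0.3, (0,): 0.7})
fBA = (("B", "A"), {(1, 1): 0.9, (0, 1): 0.1, (1, 0): 0.2, (0, 0): 0.8})
pB = sum_out(multiply(fA, fBA), "A")             # join, then eliminate A
print(pB)                                        # P(B=1) = 0.3*0.9 + 0.7*0.2 = 0.41
```

The Asia-network computation above is the same idea repeated, with the elimination order determining how large the intermediate factors get.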
Sufficient Condition 1
In general, any leaf node
that is not a query or
evidence variable
is irrelevant (and
can be removed)
(once it is removed, others
may be seen to be irrelevant)
Can drop irrelevant variables from the network before starting the
query off..
Sufficient Condition 2
Note that condition 2 doesn’t subsume condition 1.
In particular, it won’t allow us to say that M is irrelevant
for the query P(J|B)
Notice that sampling methods could in general be used even when
we don’t know the bayes net (and are just observing the world)!
We should strive to make the sampling more efficient given
that we know the bayes net
Generating a Sample from the Network
Note that the sample is generated in the causal order, so all the required probabilities can be read off from the CPTs!
(To see how useful this is, consider generating the sample in the order of wetgrass, rain, sprinkler… To sample WG, we would first need P(WG), then we would need P(Rain|WG), then we would need P(Sp|Rain,WG). NONE of these are directly given in the CPTs and would have to be computed.)
Note that in MCMC we do (re)sample a node given its Markov blanket, which is not in causal order, since the MB contains children and their parents.
Example sample: <C, ~S, R, W>
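A minimal sketch of prior (forward) sampling in causal order on the sprinkler network (Cloudy -> Sprinkler, Cloudy -> Rain, Sprinkler & Rain -> WetGrass). The CPT numbers below are the usual textbook ones, but treat them as illustrative.

```python
import random

def bernoulli(p):
    return random.random() < p

def prior_sample():
    c = bernoulli(0.5)                           # P(Cloudy)
    s = bernoulli(0.1 if c else 0.5)             # P(Sprinkler | Cloudy)
    r = bernoulli(0.8 if c else 0.2)             # P(Rain | Cloudy)
    p_w = {(True, True): 0.99, (True, False): 0.90,
           (False, True): 0.90, (False, False): 0.00}
    w = bernoulli(p_w[(s, r)])                   # P(WetGrass | Sprinkler, Rain)
    return {"Cloudy": c, "Sprinkler": s, "Rain": r, "WetGrass": w}

# Each variable is sampled only after its parents, so every probability needed
# is a CPT entry. Estimate P(Rain) by averaging over samples:
N = 10000
print(sum(prior_sample()["Rain"] for _ in range(N)) / N)   # ≈ 0.5
```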
Network Samples Joint distribution
That is, the rejection sampling method doesn’t really use
the bayes network that much…
Notice that to attach the likelihood to the evidence, we are using
the CPTs in the bayes net. (Model-free empirical observation,
in contrast, either gives you a sample or not; we can’t get fractional samples)
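A hedged sketch of likelihood weighting on the same sprinkler network as above: evidence variables are clamped to their observed values, and each sample is weighted by the CPT probability of that evidence. Numbers are illustrative.

```python
import random

P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.00}

def set_or_sample(name, p_true, evidence, sample, weight):
    """Clamp `name` to the evidence value (multiplying in its likelihood),
    or sample it from p_true if it is not observed."""
    if name in evidence:
        sample[name] = evidence[name]
        weight *= p_true if evidence[name] else 1 - p_true
    else:
        sample[name] = random.random() < p_true
    return weight

def weighted_sample(evidence):
    sample, w = {}, 1.0
    w = set_or_sample("Cloudy", 0.5, evidence, sample, w)
    c = sample["Cloudy"]
    w = set_or_sample("Sprinkler", 0.1 if c else 0.5, evidence, sample, w)
    w = set_or_sample("Rain", 0.8 if c else 0.2, evidence, sample, w)
    w = set_or_sample("WetGrass", P_W[(sample["Sprinkler"], sample["Rain"])],
                      evidence, sample, w)
    return sample, w

# Estimate P(Rain | WetGrass = true) as a weighted average:
num = den = 0.0
for _ in range(20000):
    s, w = weighted_sample({"WetGrass": True})
    den += w
    num += w * s["Rain"]
print(num / den)
```

Because the weight comes from CPT entries, this estimator uses the Bayes net itself, unlike plain rejection sampling.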
Note that the other parents of z_j are part of the Markov blanket.
  P(rain | cl, sp, wg) ∝ P(rain | cl) * P(wg | sp, rain)
Network Topology & Complexity of Inference
• Singly connected networks (poly-trees: at most one path between any pair of nodes): inference is polynomial. Examples of singly connected networks include Markov chains and Hidden Markov Models.
• Multiply connected networks (e.g. Cloudy -> Sprinklers, Cloudy -> Rain, Sprinklers -> Wetgrass, Rain -> Wetgrass): inference is NP-hard.
• A multiply connected network can be converted to a singly connected one by merging nodes (e.g. merging Sprinklers and Rain into a single node "Sprinklers+Rain" that takes 4 values, 2x2). But the "size" of the merged network can be exponentially larger, so polynomial inference on that network isn't exactly god's gift.
Summary of BN Inference Algorithms
• Complexity
  – NP-hard (actually #P-complete, since we "count" models)
    • Polynomial for "singly connected" networks (one path between each pair of nodes)
  – NP-hard also for absolute and relative approximation
• Exact inference algorithms
  – Enumeration
  – Variable elimination
    • Avoids the redundant computations of enumeration
  – [Many others, such as "message passing" algorithms, constraint-propagation based algorithms, etc.]
• Approximate inference algorithms
  – Based on stochastic simulation
    • Sampling from empty networks
    • Rejection sampling
    • Likelihood weighting
    • MCMC [and many more]
Conjunctive queries are essentially computing joint distributions on sets of query variables. A special case of computing the full joint on the query variables is finding just the query-variable configuration that is most likely given the evidence. There are two special cases here:
  MPE (Most Probable Explanation): the most likely assignment to all other variables given the evidence. Mostly involves max/product operations.
  MAP (Maximum A Posteriori): the most likely assignment to some variables given the evidence. Can involve max/product/sum operations.