Week 9: Prob. Prop. Logic


10/15
• Homework 3 due 10/17
• Mid-term on 10/24
• Project 2 accepted until 10/26 (4 days automatic extension)
Note: you can avoid assessing P(E=e) if you assess P(Y|E=e), since the distribution must add up to 1.
Relative ease/utility of assessing various types of probabilities
• The joint distribution requires us to assess probabilities of the type P(x1,~x2,x3,…,~xn)
  – This means we have to look at all entities in the world and see what fraction of them have x1,~x2,x3,…,~xn true: a difficult experiment to set up..
• Conditional probabilities of the type P(A|B) are relatively much easier to assess
  – You just need to look at the set of entities having B true, and find the fraction of them that also have A true
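The "fraction of entities" reading of P(A|B) is easy to sketch in code. The toy world below is entirely made up for illustration:

```python
# A toy "world" of entities with Boolean attributes (data made up for illustration).
world = [
    {"cavity": True,  "toothache": True},
    {"cavity": True,  "toothache": False},
    {"cavity": False, "toothache": True},
    {"cavity": False, "toothache": False},
    {"cavity": True,  "toothache": True},
]

def cond_prob(world, a, b):
    """Estimate P(a|b): among entities with b true, the fraction with a also true."""
    having_b = [e for e in world if e[b]]
    return sum(e[a] for e in having_b) / len(having_b)
```

Assessing the joint entry P(x1,~x2,…,~xn) would instead require counting one specific full configuration over all attributes, which is the "difficult experiment" the slide mentions.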
• Among the conditional probabilities, causal probabilities of the form P(effect|cause) are better to assess than diagnostic probabilities of the form P(cause|effect)
  – Causal probabilities tend to be more stable compared to diagnostic probabilities
  – (For example, a textbook in dentistry can publish P(TA|Cavity) and hope that it will hold in a variety of places. In contrast, P(Cavity|TA) may depend on other fortuitous factors, e.g. in areas where people tend to eat a lot of ice cream, many toothaches may be prevalent, and few of them may actually be due to cavities.)
Generalized Bayes rule
P(A|B,e) = P(B|A,e) P(A|e) / P(B|e)
This lets us get by with easier-to-assess numbers. For example, let A be Anthrax and Rn be Runny Nose:
P(A|Rn) = P(Rn|A) P(A) / P(Rn)
Think of this as analogous to inference rules (like modus ponens).
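A quick numeric sketch of the rule. All three input numbers below are made up; the point is that P(Rn|A) and P(A) are the "easy" causal/prior numbers, and the diagnostic probability falls out:

```python
def bayes(p_e_given_h, p_h, p_e):
    """Diagnostic probability from causal ones: P(H|E) = P(E|H) P(H) / P(E)."""
    return p_e_given_h * p_h / p_e

# Made-up numbers for the Anthrax / Runny Nose example:
p_rn_given_a = 0.9    # P(Rn|A): causal, relatively stable
p_a = 0.001           # P(A): prior on anthrax
p_rn = 0.3            # P(Rn): how common runny noses are overall
p_a_given_rn = bayes(p_rn_given_a, p_a, p_rn)   # P(A|Rn)
```

Note how the posterior stays small even for a highly sensitive symptom, because the prior on the cause is tiny.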
What happens if there are multiple symptoms…?
• A patient walks in and complains of a toothache. You assess P(Cavity|Toothache).
• Now you try to probe the patient's mouth with that steel thingie, and it catches… How do we update our belief in Cavity?
P(Cavity|TA,Catch) = P(TA,Catch|Cavity) * P(Cavity) / P(TA,Catch)
                   = α P(TA,Catch|Cavity) * P(Cavity)
• We need to know P(TA,Catch|Cavity)! With n evidence variables, we would need 2^n such probabilities!
• Conditional independence (written A||B) to the rescue: suppose
P(TA,Catch|Cavity) = P(TA|Cavity) * P(Catch|Cavity)
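The α above is just a normalization constant, and with conditional independence the likelihood factors into per-symptom terms. A sketch with made-up CPT numbers:

```python
def posterior_cavity(p_cavity, p_ta, p_catch):
    """P(Cavity | TA, Catch), assuming TA || Catch | Cavity, so the likelihood
    factors as P(TA|Cavity) * P(Catch|Cavity).
    p_ta, p_catch: dicts giving P(symptom observed | Cavity=c)."""
    unnorm = {}
    for c in (True, False):
        prior = p_cavity if c else 1 - p_cavity
        unnorm[c] = p_ta[c] * p_catch[c] * prior
    alpha = 1 / (unnorm[True] + unnorm[False])   # the normalization constant
    return alpha * unnorm[True]

# Made-up CPT numbers for illustration:
post = posterior_cavity(p_cavity=0.2,
                        p_ta={True: 0.6, False: 0.1},
                        p_catch={True: 0.9, False: 0.05})
```

Without the independence assumption we would have had to assess P(TA,Catch|Cavity) jointly; with it, two single-symptom probabilities per value of Cavity suffice.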
Conditional Independence Assertions
• We write X || Y | Z to say that the set of variables X is conditionally independent of the set of variables Y given evidence on the set of variables Z (where X, Y, Z are subsets of the set of all random variables in the domain model)
• We saw that Bayes Rule computations can exploit conditional independence assertions. Specifically, X || Y | Z implies
  – P(X & Y|Z) = P(X|Z) * P(Y|Z)
  – P(X|Y,Z) = P(X|Z)
  – P(Y|X,Z) = P(Y|Z)
• Idea: Why not write down all conditional independence assertions that hold in a domain?
Cond. Indep. Assertions (Contd)
• Idea: Why not write down all conditional independence assertions (CIA) (X || Y | Z) that hold in a domain?
• Problem: There can be exponentially many conditional independence assertions that hold in a domain (recall that X, Y and Z are all subsets of the domain variables).
• Brilliant Idea: Maybe we should implicitly specify the CIAs by writing down the "dependencies" between variables using a graphical model
  – A Bayes Network is a way of doing just this.
    • The Bayes Net is a Directed Acyclic Graph whose nodes are random variables, and the immediate dependencies between variables are represented by directed arcs
  – The topology of a Bayes network shows the inter-variable dependencies. Given the topology, there is a way of checking whether any conditional independence assertion holds in the network (the Bayes Ball algorithm and the D-Sep idea)
CIA implicit in Bayes Nets
• So, what conditional independence assumptions are implicit in Bayes nets?
  – Local Markov Assumption:
    • A node N is independent of its non-descendants (including ancestors) given its immediate parents. (So if P are the immediate parents of N, and A is a subset of ancestors and other non-descendants, then {N} || A | P.)
    • (Equivalently) A node N is independent of all other nodes given its Markov blanket (parents, children, children's parents)
  – Given this assumption, many other conditional independencies follow. For a full answer, we need to appeal to the D-Sep condition and/or Bayes Ball reachability
Topological Semantics
These two conditions (the topological and the local semantics) are equivalent; many other conditional independence assertions follow from them.
P(~B,~E,A,J,M) = P(~B) P(~E|~B) P(A|~E,~B) P(J|A,~E,~B) P(M|A,~B,~E,J)   (Chain Rule)
               = P(~B) P(~E) P(A|~E,~B) P(J|A) P(M|A)   (using the network's CIAs)
Local Semantics: a node is independent of its non-descendants given its parents.
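The chain-rule factorization above can be evaluated directly. A sketch for the burglary/alarm network, using the standard textbook CPT numbers for this example:

```python
# CPTs for the alarm network (standard textbook numbers for this example).
P_B = 0.001
P_E = 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A | B, E)
P_J = {True: 0.90, False: 0.05}                      # P(J | A)
P_M = {True: 0.70, False: 0.01}                      # P(M | A)

def joint(b, e, a, j, m):
    """P(b,e,a,j,m) via the chain rule plus the network's CIAs."""
    p = (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E)
    p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p *= P_J[a] if j else 1 - P_J[a]
    p *= P_M[a] if m else 1 - P_M[a]
    return p
```

With only 10 assessed numbers, every one of the 32 joint entries is recoverable, and they sum to 1 by construction.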
D-sep (direction-dependent Separation)
• X || Y | E if every undirected path from X to Y is blocked by E
  – A path is blocked if there is a node Z on the path s.t.
    1. Z is in E and Z has one arrow coming in and another going out, or
    2. Z is in E and Z has both arrows going out, or
    3. Neither Z nor any of its descendants are in E, and both path arrows lead to Z
• Examples (alarm network): B||M|A and (J,M)||E|A hold, and so does B||E; but B||E|A and B||E|M do not (A is a common effect of B and E, and M is a descendant of A)
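The three blocking rules can be transcribed directly into a checker. A minimal sketch for the alarm network that enumerates undirected simple paths and tests each interior node; this is the definition written out, not an efficient algorithm like Bayes Ball:

```python
# Alarm network: for each node, its parents (edges point parent -> child).
parents = {"B": [], "E": [], "A": ["B", "E"], "J": ["A"], "M": ["A"]}

def descendants(n):
    """All descendants of node n."""
    kids = [c for c, ps in parents.items() if n in ps]
    out = set(kids)
    for k in kids:
        out |= descendants(k)
    return out

def paths(x, y, visited=()):
    """All undirected simple paths from x to y."""
    if x == y:
        yield (x,)
        return
    nbrs = set(parents[x]) | {c for c, ps in parents.items() if x in ps}
    for n in nbrs:
        if n not in visited:
            for rest in paths(n, y, visited + (x,)):
                yield (x,) + rest

def blocked(path, evid):
    """Does the evidence set block this path (the three rules above)?"""
    for i in range(1, len(path) - 1):
        a, z, b = path[i - 1], path[i], path[i + 1]
        if a in parents[z] and b in parents[z]:     # a -> z <- b: common effect
            if z not in evid and not (descendants(z) & evid):
                return True                         # rule 3
        elif z in evid:                             # chain or common cause
            return True                             # rules 1 and 2
    return False

def d_separated(x, y, evid):
    return all(blocked(p, set(evid)) for p in paths(x, y))
```

Running it reproduces the example assertions: B and E are independent a priori, but conditioning on A (or on A's descendant M) opens the common-effect path.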
• A Bayes net thus implicitly represents a bunch of CIAs.
• Qn. If I tell you exactly which CIAs hold in a domain, can you give me a Bayes net that exactly models those and only those CIAs?
  – Unfortunately, NO. (See the example to the right.)
  – This is why there is another type of graphical model called "undirected graphical models".
• Exercise: give a Bayes net on X, Y, Z, W s.t. X||Y|(Z,W) and Z||W|(X,Y), and no other C.I. holds.
• In an undirected graphical model, also called a Markov random field, nodes correspond to random variables, and the immediate dependencies between variables are represented by undirected edges.
  – The CIAs modeled by an undirected graphical model are different:
    » X || Y | Z in an undirected graph if every path from a node in X to a node in Y must pass through a node in Z (so if we remove the nodes in Z, then X and Y will be disconnected)
  – Undirected models are good for representing "soft constraints" between random variables (e.g. the correlation between different pixels in an image), while directed models are good for representing causal influences between variables.
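Undirected separation is just graph reachability after deleting the conditioning set. A sketch for the diamond where X and Y are each connected to W and Z:

```python
# Undirected "diamond": X and Y each connected to W and Z (adjacency sets).
edges = {"X": {"W", "Z"}, "Y": {"W", "Z"},
         "W": {"X", "Y"}, "Z": {"X", "Y"}}

def separated(xs, ys, zs):
    """X || Y | Z in an undirected graph: removing Z disconnects X from Y."""
    frontier, seen = list(xs), set(xs)
    while frontier:
        node = frontier.pop()
        for nbr in edges[node]:
            if nbr in zs or nbr in seen:
                continue
            if nbr in ys:
                return False        # found a Z-avoiding path into Y
            seen.add(nbr)
            frontier.append(nbr)
    return True
```

On the diamond this confirms both target CIAs, X||Y|{W,Z} and W||Z|{X,Y}, which no 4-node Bayes net can represent exactly.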
[Figure: the four nodes X, Z, W, Y arranged as a diamond, labeled "Impossible!"]
• Bayes Nets are not sufficient to model all sets of CIAs.
• We said that a Bayes net implicitly represents a bunch of CIAs.
10/17
Causal Chains; Common Causes; Common Effects: another way of understanding D-SEP
• Common Cause: X and Y are caused by Z; the path is blocked if Z is given
• Causal Chain: X causes Y through Z; the path is blocked if Z is given
• Common Effect: X and Y cause Z; the path is blocked only if neither Z nor its descendants are given
Causality
• Logic (both deterministic and probabilistic) doesn't by default encapsulate causality
  – A logical dependency A => B is directionless (it is basically equivalent to ~A V B)
  – A probabilistic dependency A || B | C is also directionless. It only states that A and B are correlated
  – Causality, on the other hand, has direction.
  – If the logical/probabilistic model ignores this, it can reach logically valid and yet silly conclusions
    • E.g. If grass is wet, then it rained. If you break this bottle, then grass gets wet. Ergo: if you break this bottle, then it will rain.
    • E.g. The probability of there being two bombs on a plane is exceedingly low. So, to increase the safety of the flight, I will take a bomb along with me.
    • E.g.(?) Children whose homes have a lot of books do well in school. Go ahead and put a whole bunch of books in each home. (Actually, the state of Illinois had a program where they would send books every month to all kids.)
  – Learning "causal models" from data is thus harder than learning "correlation models"
  – Which is why it is so very great when causal models are available for the asking, through domain experts.
See the IJCAI 99 Research Excellence lecture by Judea Pearl.
Blog Questions
• You have been given the topology of a Bayes network, but haven't yet gotten the conditional probability tables (to be concrete, you may think of the Pearl alarm/earthquake scenario Bayes net). Your friend shows up and says he has the joint distribution all ready for you. You don't quite trust your friend and think he is making these numbers up. Is there any way you can prove that your friend's joint distribution is not correct?
• Answer:
  – Check whether the joint distribution given by your friend satisfies all the conditional independence assumptions.
  – For example, in the Pearl network, compute P(J|A,M,B) and P(J|A). These two numbers should come out the same!
• Notice that your friend could pass all the conditional independence assertions, and still be cheating re: the probabilities
  – For example, he filled up the CPTs of the network with made-up numbers (e.g. P(B)=0.9; P(E)=0.7 etc.) and computed the joint probability by multiplying the CPTs. This will satisfy all the conditional independence assertions..!
  – The main point to understand here is that the network topology does put restrictions on the joint distribution.
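The check in the answer can be automated: read conditional probabilities off the claimed joint table and compare. A sketch; the CPT numbers used in the test joint are made up, since any numbers multiplied along the alarm topology will pass:

```python
from itertools import product

VARS = ("B", "E", "A", "J", "M")

def make_joint(pB, pE, pA_given, pJ_given, pM_given):
    """A joint table that factors per the alarm topology (arbitrary CPT numbers)."""
    joint = {}
    for b, e, a, j, m in product((True, False), repeat=5):
        p = (pB if b else 1 - pB) * (pE if e else 1 - pE)
        p *= pA_given[(b, e)] if a else 1 - pA_given[(b, e)]
        p *= pJ_given[a] if j else 1 - pJ_given[a]
        p *= pM_given[a] if m else 1 - pM_given[a]
        joint[(b, e, a, j, m)] = p
    return joint

def cond_prob(joint, var, given):
    """P(var=True | given), read directly off a full joint table."""
    idx = {v: i for i, v in enumerate(VARS)}
    rows = [w for w in joint
            if all(w[idx[g]] == val for g, val in given.items())]
    return sum(joint[w] for w in rows if w[idx[var]]) / sum(joint[w] for w in rows)
```

If a claimed joint gives different values for P(J|A,M,B) and P(J|A), it cannot have come from the alarm topology.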
Blog Questions (2)
• Continuing bad friends: in the question above, suppose a second friend comes along and says that he can give you the conditional probabilities that you want to complete the specification of your Bayes net. You ask him a CPT entry, and pat comes a response: some number between 0 and 1. This friend is well meaning, but you are worried that the numbers he is giving may lead to some sort of inconsistent joint probability distribution. Is your worry justified (i.e., can your friend give you numbers that lead to an inconsistency)?
• (To understand "inconsistency", consider someone who insists on giving you P(A), P(B), P(A&B) as well as P(AVB), and they wind up not satisfying P(AVB) = P(A) + P(B) - P(A&B) [or alternately, they insist on giving you P(A|B), P(B|A), P(A) and P(B), and the four numbers don't satisfy Bayes rule].)
• Answer: No. As long as we only ask the friend to fill up the CPTs in the Bayes network, there is no way the numbers won't make up a consistent joint probability distribution.
  – This should be seen as a feature..
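The "no inconsistency possible" claim can be sanity-checked mechanically: fill every CPT entry with arbitrary numbers in [0,1], and the products always form a proper joint distribution. A sketch using a worst-case topology where each variable is conditioned on all of its predecessors:

```python
import random
from itertools import product

def random_cpt_joint(n_vars, seed=0):
    """Fill every CPT entry of a fully 'chained' Boolean net with random
    numbers in [0,1]; the products always yield a proper joint distribution."""
    rng = random.Random(seed)
    # CPT for variable i, conditioned on every assignment to variables 0..i-1:
    cpts = [{hist: rng.random() for hist in product((True, False), repeat=i)}
            for i in range(n_vars)]
    joint = {}
    for world in product((True, False), repeat=n_vars):
        p = 1.0
        for i, val in enumerate(world):
            q = cpts[i][world[:i]]          # P(var_i = True | its predecessors)
            p *= q if val else 1 - q
        joint[world] = p
    return joint
```

The sum telescopes to 1 regardless of the random CPT entries, which is exactly why the second friend cannot create an inconsistency.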
Personal Probabilities
– John may be an optimist and believe that P(burglary)=0.01, and Tom may be a pessimist and believe that P(burglary)=0.99
– Bayesians consider both John and Tom to be fine (they don't insist on an objective frequentist interpretation for probabilities)
– However, Bayesians do think that John and Tom should act consistently with their own beliefs
  • For example, it makes no sense for John to go about installing tons of burglar alarms given his belief, just as it makes no sense for Tom to put all his valuables on his lawn
Blog Questions (3)
• Your friend heard your claims that Bayes Nets can represent any possible conditional independence assertions exactly. He comes to you and says he has four random variables, X, Y, W and Z, and only TWO conditional independence assertions:
  – X .ind. Y | {W,Z}
  – W .ind. Z | {X,Y}
• He dares you to give him a Bayes network topology on these four nodes that exactly represents these and only these conditional independencies. Can you? (Note that you only need to look at 4-vertex directed graphs.)

Given a graphical model G and a distribution D:
– G is an I-map of D if every CIA reflected in G actually holds in D ["soundness"]
– G is a D-map of D if every CIA of D is reflected in G ["completeness"]
– G is a perfect map of D if it is both an I-map and a D-map
[Figure: Venn diagram over "All distributions", showing the distributions that can have a perfect map as a Bayes network (BN) and as a Markov network (MN), with 4-node examples of "a BN that MN can't represent" and "an MN that BN can't represent"]

• Answer: No, this is not possible. Here are two "wrong" answers:
  – Consider a disconnected graph where X, Y, W, Z are all unconnected. In this graph, the two CIAs hold. However, unfortunately so do many other CIAs.
  – Consider a graph where W and Z are both immediate parents of X and Y. In this case, clearly, X .ind. Y | {W,Z}. However, W and Z are definitely dependent given X and Y (explaining away).
• Undirected models can capture these CIAs exactly. Consider a graph where X is connected to W and Z, and Y is connected to W and Z (sort of a diamond).
  – In undirected models, CIA is defined in terms of graph separability. Since X and Y separate W and Z (i.e., every path between W and Z must pass through X or Y), W .ind. Z | {X,Y}. Similarly for the other CIA.
• Undirected graphs will be unable to model some scenarios that directed ones can, so you need both. There are also distributions that neither undirected nor directed models can perfect-map (see the picture above).
• If we can't have a perfect map, we can consider giving up either the I-map or the D-map property:
  – Giving up I-map leads to loss of accuracy (since your model assumes CIAs that don't exist in the distribution). It can however increase efficiency (e.g. naive Bayes models).
  – Giving up D-map leads to loss of efficiency but preserves accuracy (if you think more things are connected than really are, you will assess more probabilities, and some of them wind up being redundant anyway because of the CIAs that hold in the distribution).
[Figure: the alarm network (Burglary, Earthquake, Alarm, JohnCalls, MaryCalls); in the worst variable order the CPTs have 1, 2, 4, 8, 16 entries: 31 in all]
Introduce variables in the causal order.
Is P(A|J,M) = P(A)?
How many probabilities are needed? 13 for the new ordering; 10 for the old. Is this the worst?
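The counts can be checked mechanically. The parent sets for the non-causal ordering below follow the standard textbook version of this example (introducing M, J, A, B, E in that order):

```python
def num_params(parents_of):
    """Independent probabilities needed for Boolean variables:
    one CPT of size 2^|parents| per variable."""
    return sum(2 ** len(ps) for ps in parents_of.values())

# Causal order B, E, A, J, M (the "old" network):
causal = {"B": [], "E": [], "A": ["B", "E"], "J": ["A"], "M": ["A"]}
# Diagnostic order M, J, A, B, E (the "new" network, textbook example):
diagnostic = {"M": [], "J": ["M"], "A": ["J", "M"], "B": ["A"], "E": ["A", "B"]}
# Worst case: every variable conditioned on all of its predecessors:
worst = sum(2 ** i for i in range(5))   # 1 + 2 + 4 + 8 + 16 = 31
```

So no, 13 is not the worst: a fully "backwards" ordering degenerates to the full joint with 31 numbers, while the causal ordering needs only 10.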
Digression: Is assessing/learning numbers
the only hard model-learning problem?
• We are making it sound as if assessing the probabilities is a big deal
• In doing so, we are taking into account model acquisition/learning
costs.
• How come we didn’t care about these issues in logical reasoning? Is it
because acquiring logical knowledge is easy?
• Actually—if we are writing programs for worlds that we (the humans)
already live in, it is easy for us (humans) to add the logical knowledge
into the program. It is a pain to give the probabilities..
• On the other hand, if the agent is fully autonomous and is
bootstrapping itself, then learning logical knowledge is actually harder
than learning probabilities..
– For example, we will see that given the bayes network topology (“logic”),
learning its CPTs is much easier than learning both topology and CPTs
Making the network Sparse by
introducing intermediate variables
• Consider a network of Boolean variables where n parent nodes are connected to m children nodes (with each parent influencing each child).
  – You will need n + m*2^n conditional probabilities.
• Suppose you realize that what is really influencing the child nodes is some single aggregate function on the parents' values (e.g. the sum of the parents).
  – We can introduce a single intermediate node called "sum" which has links from all the n parent nodes, and which separately influences each of the m child nodes.
• You will wind up needing only n + 2^n + 2m conditional probabilities to specify this new network!
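The two counting formulas are easy to compare; note the sketch below follows the slide's count, which treats the aggregate node as if it were Boolean (2 probabilities per child):

```python
def direct_params(n, m):
    """n Boolean parents each influencing each of m children directly:
    n priors plus a 2^n-row CPT for every child."""
    return n + m * 2 ** n

def with_intermediate(n, m):
    """After inserting one aggregate ("sum") node between parents and children:
    n priors, one 2^n-row CPT for the aggregate, and 2 entries per child."""
    return n + 2 ** n + 2 * m
```

For example, with 10 parents and 5 children the direct network needs thousands of entries while the sparse version needs about a fifth as many; the savings grow with m.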
Noisy-OR: we only consider the failure to cause, multiplying over the causes that hold:
P(~X | ~u_1,…,~u_j, u_{j+1},…,u_k) = ∏_{i=j+1}^{k} r_i
where r_i is the probability that X fails to hold even though the i-th parent holds.
How about Noisy-AND? (hint: A&B ≡ ~(~A V ~B))
Constructing Belief Networks: Summary
• Decide on what sorts of queries you are interested in answering
– This in turn dictates what factors to model in the network
• Decide on a vocabulary of the variables and their domains for the
problem
– Introduce “Hidden” variables into the network as needed to make the
network “sparse”
• Decide on an order of introduction of variables into the network
– Introducing variables in causal direction leads to fewer connections
(sparse structure) AND easier to assess probabilities
• Try to use canonical distributions to specify the CPTs
– Noisy-OR
– Parameterized discrete/continuous distributions
• Such as Poisson, Normal (Gaussian) etc
Case Study: Pathfinder System
• Domain: Lymph node diseases
  – Deals with 60 diseases and 100 disease findings
• Versions:
  – Pathfinder I: a rule-based system with logical reasoning
  – Pathfinder II: tried a variety of approaches for uncertainty
    • Simple Bayes reasoning outperformed the rest
  – Pathfinder III: simple Bayes reasoning, but with reassessed probabilities
  – Pathfinder IV: a Bayesian network was used to handle a variety of conditional dependencies.
    • Deciding vocabulary: 8 hours
    • Devising the topology of the network: 35 hours
    • Assessing the (14,000) probabilities: 40 hours
      – Physician experts liked assessing causal probabilities
• Evaluation: 53 "referral" cases
  – Pathfinder III: 7.9/10
  – Pathfinder IV: 8.9/10 [saves one additional life in every 1000 cases!]
  – A more recent comparison shows that Pathfinder now outperforms the experts who helped design it!!