Rational statistical inference (Bayes, Laplace)


Logistics
• Class size? Who is new? Who is listening?
• Everyone on the Athena mailing list “concepts-and-theories”? If not, write to me.
• Everyone on Stellar yet? If not, write to
Melissa Yeh ([email protected]).
• Interest in having a printed course pack,
even if a few readings get changed?
Plan for tonight
• Why be Bayesian?
• Informal introduction to learning as
probabilistic inference
• Formal introduction to probabilistic
inference
• A little bit of mathematical psychology
• An introduction to Bayes nets
Virtues of Bayesian framework
• Generates principled models with strong explanatory
and descriptive power.
• Unifies models of cognition across tasks and domains.
– Tasks: categorization, concept learning, word learning,
inductive reasoning, causal inference, conceptual change
– Domains: biology, physics, psychology, language, ...
• Explains which processing models work, and why.
– Associative learning
– Connectionist networks
– Similarity to examples
– Toolkit of simple heuristics
• Allows us to move beyond classic dichotomies.
– Symbols (rules, logic, hierarchies, relations) versus Statistics
– Domain-general versus Domain-specific
– Nature versus Nurture
• A framework for understanding theory-based
cognition:
– How are theories used to learn about the structure of the
world?
– How are theories acquired?
Rational statistical inference
(Bayes, Laplace)
• Fundamental question:
How do we update beliefs in light of data?
• Fundamental (and only) assumption:
Represent degrees of belief as probabilities.
• The answer:
Mathematics of probability theory.
What does probability mean?
Frequentists: Probability as expected frequency
• P(A) = 1: A will always occur.
• P(A) = 0: A will never occur.
• 0.5 < P(A) < 1: A will occur more often than not.
Subjectivists: Probability as degree of belief
• P(A) = 1: believe A is true.
• P(A) = 0: believe A is false.
• 0.5 < P(A) < 1: believe A is more likely to be true
than false.
What does probability mean?
Frequentists: Probability as expected frequency
• P(“heads”) = 0.5 ~ “If we flip 100 times, we expect
to see about 50 heads.”
Subjectivists: Probability as degree of belief
• P(“heads”) = 0.5 ~ “On the next flip, it’s an even
bet whether it comes up heads or tails.”
• P(“rain tomorrow”) = 0.8
• P(“Saddam Hussein is dead”) = 0.1
• ...
Is subjective probability
cognitively viable?
• Evolutionary psychologists (Gigerenzer,
Cosmides, Tooby, Pinker) argue it is not.
“To understand the design of statistical inference mechanisms, then, one
needs to examine what form inductive-reasoning problems -- and the
information relevant to solving them -- regularly took in ancestral
environments. […] Asking for the probability of a single event seems
unexceptionable in the modern world, where we are bombarded with
numerically expressed statistical information, such as weather forecasts
telling us there is a 60% chance of rain today. […] In ancestral
environments, the only external database available from which to reason
inductively was one's own observations and, possibly, those
communicated by the handful of other individuals with whom one lived.
The ‘probability’ of a single event cannot be observed by an individual,
however. Single events either happen or they don’t -- either it will rain
today or it will not. Natural selection cannot build cognitive
mechanisms designed to reason about, or receive as input, information
in a format that did not regularly exist.”
(Brase, Cosmides and Tooby, 1998)
Is subjective probability
cognitively viable?
• Evolutionary psychologists (Gigerenzer,
Cosmides, Tooby, Pinker) argue it is not.
• Reasons to think it is:
– Intuitions are old and potentially universal
(Aristotle, the Talmud).
– Represented in semantics (and syntax?) of
natural language.
– Extremely useful ….
Why be subjectivist?
• Often need to make inferences about
singular events
– e.g., How likely is it to rain tomorrow?
• Cox Axioms
– A formal model of common sense
• “Dutch Book” + Survival of the Fittest
– If your beliefs do not accord with the laws of
probability, then you can always be out-gambled by
someone whose beliefs do so accord.
• Provides a theory of learning
– A common currency for combining prior knowledge
and the lessons of experience.
Cox Axioms (via Jaynes)
• Degrees of belief are represented by real numbers.
• Qualitative correspondence with common sense,
e.g.: Bel(¬A) = f [Bel(A)]
Bel(A ∧ B) = g[Bel(A), Bel(B | A)]
• Consistency:
– If a conclusion can be reasoned in more than one way,
then every possible way must lead to the same result.
– All available evidence should be taken into account
when inferring a degree of belief.
– Equivalent states of knowledge should be represented
with equivalent degrees of belief.
• Accepting these axioms implies Bel can be
represented as a probability measure.
Plan for tonight
• Why be Bayesian?
• Informal introduction to learning as
probabilistic inference
• Formal introduction to probabilistic
inference
• A little bit of mathematical psychology
• An introduction to Bayes nets
Example: flipping coins
• Flip a coin 10 times and see 5 heads, 5 tails.
• P(heads) on next flip? 50%
• Why? 50% = 5 / (5+5) = 5/10.
• “Future will be like the past.”
• Suppose we had seen 4 heads and 6 tails.
• P(heads) on next flip? Closer to 50% than to 40%.
• Why? Prior knowledge.
Example: flipping coins
• Represent prior knowledge as fictional
observations F.
• E.g., F ={1000 heads, 1000 tails} ~ strong
expectation that any new coin will be fair.
• After seeing 4 heads, 6 tails, P(heads) on next flip
= 1004 / (1004+1006) = 49.95%
• E.g., F ={3 heads, 3 tails} ~ weak expectation that
any new coin will be fair.
• After seeing 4 heads, 6 tails, P(heads) on next flip
= 7 / (7+9) = 43.75%. Prior knowledge too weak.
Example: flipping thumbtacks
• Represent prior knowledge as fictional
observations F.
• E.g., F ={4 heads, 3 tails} ~ weak expectation that
tacks are slightly biased towards heads.
• After seeing 2 heads, 0 tails, P(heads) on next flip
= 6 / (6+3) = 67%.
• Some prior knowledge is always necessary to
avoid jumping to hasty conclusions.
• Suppose F = { }: After seeing 2 heads, 0 tails,
P(heads) on next flip = 2 / (2+0) = 100%.
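The pseudo-count recipe above is a one-line computation; a minimal sketch in Python (the function name is my own, the numbers are the ones from the slides):

```python
def predict_heads(prior_heads, prior_tails, obs_heads, obs_tails):
    """P(heads on next flip): observed counts plus fictional observations F."""
    heads = prior_heads + obs_heads
    total = prior_heads + prior_tails + obs_heads + obs_tails
    return heads / total

# Strong fair-coin prior F = {1000 heads, 1000 tails}, then 4 heads, 6 tails:
print(predict_heads(1000, 1000, 4, 6))   # 1004/2010 ≈ 0.4995
# Weak fair-coin prior F = {3 heads, 3 tails}:
print(predict_heads(3, 3, 4, 6))         # 7/16 = 0.4375
# Thumbtack prior F = {4 heads, 3 tails}, then 2 heads, 0 tails:
print(predict_heads(4, 3, 2, 0))         # 6/9 ≈ 0.667
# No prior at all: two heads in a row already force certainty.
print(predict_heads(0, 0, 2, 0))         # 1.0
```

The same function shows the limitation on the next slide: with F = {1000 heads, 1000 tails}, even 25 heads in a row barely moves the prediction.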
Origin of prior knowledge
• Tempting answer: prior experience
• Suppose you have previously seen 2000 coin flips:
1000 heads, 1000 tails.
• By assuming all coins (and flips) are alike, these
observations of other coins are as good as actual
observations of the present coin.
Problems with simple empiricism
• Haven’t really seen 2000 coin flips, or any
thumbtack flips.
– Prior knowledge is stronger than raw experience justifies.
• Haven’t seen exactly equal number of heads and
tails.
– Prior knowledge is smoother than raw experience
justifies.
• Should be a difference between observing 2000
flips of a single coin versus observing 10 flips each
for 200 coins, or 1 flip each for 2000 coins.
– Prior knowledge is more structured than raw experience.
A simple theory
• “Coins are manufactured by a standardized
procedure that is effective but not perfect.”
– Justifies generalizing from previous coins to the
present coin.
– Justifies smoother and stronger prior than raw
experience alone.
– Explains why seeing 10 flips each for 200 coins is
more valuable than seeing 2000 flips of one coin.
• “Tacks are asymmetric, and manufactured to
less exacting standards.”
Limitations
• Can all domain knowledge be represented
so simply, in terms of an equivalent number
of fictional observations?
• Suppose you flip a coin 25 times and get all
heads. Something funny is going on ….
• But with F ={1000 heads, 1000 tails},
P(heads) on next flip = 1025 / (1025+1000)
= 50.6%. Looks like nothing unusual.
Plan for tonight
• Why be Bayesian?
• Informal introduction to learning as
probabilistic inference
• Formal introduction to probabilistic
inference
• A little bit of mathematical psychology
• An introduction to Bayes nets
Basics
• Propositions: A, B, C, . . . .
• Negation: ¬A
• Logical operators “and”, “or”: A ∧ B, A ∨ B
• Obey classical logic, e.g.,
A ∨ B = ¬(¬A ∧ ¬B)
Basics
• Conservation of belief: P(A) + P(¬A) = 1
• “Joint probability”: P(A ∧ B), also written P(A, B)
• For independent propositions:
P(A, B) = P(A) P(B)
• More generally:
P(A, B) = P(A) P(B | A) = P(B) P(A | B)
(“conditional probability”)
Basics
• Example:
– A = “Heads on flip 2”
– B = “Tails on flip 2”
Treating A and B as independent (wrongly):
P(A, B) = P(A) P(B) = 1/2 × 1/2 = 1/4
Correctly, using the conditional:
P(A, B) = P(A) P(B | A) = 1/2 × 0 = 0
Basics
• All probabilities should be conditioned on
background knowledge K: e.g., P(A | K)
• All the same rules hold conditioned on any
K: e.g., P(A, B | K) = P(A | K) P(B | A, K)
• Often background knowledge will be
implicit, brought in as needed.
Bayesian inference
• Definition of conditional probability:
P(A, B) = P(A) P(B | A) = P(B) P(A | B)
• Bayes’ theorem (Bayes’ rule):
P(B | A) = P(B) P(A | B) / P(A)
• In hypothesis/data form:
P(H | D) = P(H) P(D | H) / P(D)
• “Posterior probability”: P(H | D)
• “Prior probability”: P(H)
• “Likelihood”: P(D | H)
Bayesian inference
• Bayes’ rule: P(H | D) = P(H) P(D | H) / P(D)
• What makes a good scientific argument?
P(H | D) is high if:
– Hypothesis is plausible: P(H) is high
– Hypothesis strongly predicts the observed data:
P(D | H) is high
– Data are surprising: P(D) is low
Bayesian inference
• Deriving a more useful version. Bayes’ rule for B and for ¬B:
P(B | A) = P(B) P(A | B) / P(A)
P(¬B | A) = P(¬B) P(A | ¬B) / P(A)
• “Conditionalization”: P(B | A) + P(¬B | A) = 1, so
[ P(B) P(A | B) + P(¬B) P(A | ¬B) ] / P(A) = 1
and hence P(B) P(A | B) + P(¬B) P(A | ¬B) = P(A).
• “Marginalization”: the same fact, written
P(A, B) + P(A, ¬B) = P(A).
• Substituting for P(A) in Bayes’ rule:
P(B | A) = P(B) P(A | B) / [ P(B) P(A | B) + P(¬B) P(A | ¬B) ]
• In hypothesis/data form:
P(H | D) = P(H) P(D | H) / [ P(H) P(D | H) + P(¬H) P(D | ¬H) ]
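The “more useful version” turns directly into code, since the denominator is built from the same prior and likelihoods as the numerator. A minimal sketch (function name and the illustrative numbers are my own):

```python
def posterior(prior_h, lik_h, lik_not_h):
    """P(H|D) = P(H)P(D|H) / [P(H)P(D|H) + P(~H)P(D|~H)]."""
    num = prior_h * lik_h
    return num / (num + (1 - prior_h) * lik_not_h)

# Illustrative numbers: an a priori implausible hypothesis (P(H) = 0.01)
# that strongly predicts the data (P(D|H) = 0.9) versus an alternative
# that predicts it weakly (P(D|~H) = 0.1):
print(posterior(0.01, 0.9, 0.1))   # 0.009 / (0.009 + 0.099) ≈ 0.083
```

Note that P(D) never has to be supplied separately; it is recovered by marginalization in the denominator.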
Random variables
• Random variable X denotes a set of mutually
exclusive, exhaustive propositions (states of the
world): X ∈ {x1, …, xn}, with Σi P(X = xi) = 1.
• Bayes’ theorem for random variables:
P(Y = y | X = x) = P(Y = y) P(X = x | Y = y) / Σi P(Y = yi) P(X = x | Y = yi)
• The same rule, read as Bayes’ rule for more than two hypotheses:
P(H = h | D = d) = P(H = h) P(D = d | H = h) / Σi P(H = hi) P(D = d | H = hi)
Sherlock Holmes
• “How often have I said to you that when you have
eliminated the impossible whatever remains,
however improbable, must be the truth?” (The
Sign of the Four)
• In Bayesian terms, separate out the surviving hypothesis h:
P(h | d) = P(h) P(d | h) / [ P(h) P(d | h) + Σ_{hi ≠ h} P(hi) P(d | hi) ]
• Eliminating the impossible sets P(d | hi) = 0 for every
alternative hi ≠ h, so the sum vanishes:
P(h | d) = P(h) P(d | h) / P(h) P(d | h) = 1
Plan for tonight
• Why be Bayesian?
• Informal introduction to learning as
probabilistic inference
• Formal introduction to probabilistic
inference
• A little bit of mathematical psychology
• An introduction to Bayes nets
Representativeness in reasoning
Which sequence is more likely to be produced
by flipping a fair coin?
HHTHT
HHHHH
A reasoning fallacy
Kahneman & Tversky: people judge the
probability of an outcome based on the
extent to which it is representative of the
generating process.
But how does “representativeness” work?
Predictive versus inductive
reasoning
• Prediction: reason from hypothesis H down to data D.
Likelihood: P(D | H).
• Induction: reason from data D back up to hypothesis H.
Posterior: P(H | D).
• Representativeness is naturally read as a likelihood,
P(D | H1): how well the data match a hypothesized
generating process. But rational induction must also
weigh the priors, as Bayes’ rule in odds form makes explicit.
Bayes’ Rule in odds form
P(H1 | D) / P(H2 | D) = [ P(D | H1) / P(D | H2) ] × [ P(H1) / P(H2) ]
• D: data
• H1, H2: models
• P(H1 | D): posterior probability that model 1 generated
the data
• P(D | H1): likelihood of data given model 1
• P(H1): prior probability that model 1 generated the
data
Bayesian analysis of coin flipping
P(H1 | D) / P(H2 | D) = [ P(D | H1) / P(D | H2) ] × [ P(H1) / P(H2) ]
• D: HHTHT
• H1, H2: fair coin, trick “all heads” coin
• P(D | H1) = 1/32, P(H1) = 999/1000
• P(D | H2) = 0, P(H2) = 1/1000
• P(H1 | D) / P(H2 | D) = infinity
Bayesian analysis of coin flipping
P(H1 | D) / P(H2 | D) = [ P(D | H1) / P(D | H2) ] × [ P(H1) / P(H2) ]
• D: HHHHH
• H1, H2: fair coin, trick “all heads” coin
• P(D | H1) = 1/32, P(H1) = 999/1000
• P(D | H2) = 1, P(H2) = 1/1000
• P(H1 | D) / P(H2 | D) = 999/32 ~ 30:1
Bayesian analysis of coin flipping
P(H1 | D) / P(H2 | D) = [ P(D | H1) / P(D | H2) ] × [ P(H1) / P(H2) ]
• D: HHHHHHHHHH
• H1, H2: fair coin, trick “all heads” coin
• P(D | H1) = 1/1024, P(H1) = 999/1000
• P(D | H2) = 1, P(H2) = 1/1000
• P(H1 | D) / P(H2 | D) = 999/1024 ~ 1:1
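The three coin-flipping analyses above reduce to one odds-form computation; a minimal sketch (function name mine, numbers from the slides):

```python
def posterior_odds(prior_odds, lik1, lik2):
    """P(H1|D)/P(H2|D) = [P(D|H1)/P(D|H2)] * [P(H1)/P(H2)]."""
    if lik2 == 0:
        return float('inf')   # H2 is ruled out by the data
    return (lik1 / lik2) * prior_odds

# Prior odds 999:1 in favor of the fair coin (H1) over the trick coin (H2):
prior = (999 / 1000) / (1 / 1000)
print(posterior_odds(prior, 1 / 32, 0))      # D = HHTHT: infinity
print(posterior_odds(prior, 1 / 32, 1))      # D = HHHHH: 999/32, about 30:1
print(posterior_odds(prior, 1 / 1024, 1))    # D = 10 heads: 999/1024, about 1:1
```

Five heads leave the fair coin comfortably ahead; ten heads bring the trick coin to even odds, despite its 1-in-1000 prior.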
The role of theories
The fact that HHTHT looks representative of a fair
coin while HHHHH does not reflects our implicit
theories of how the world works.
– Easy to imagine how a trick all-heads coin
could work: high prior probability.
– Hard to imagine how a trick “HHTHT” coin
could work: low prior probability.
Plan for tonight
• Why be Bayesian?
• Informal introduction to learning as
probabilistic inference
• Formal introduction to probabilistic
inference
• A little bit of mathematical psychology
• An introduction to Bayes nets
Scaling up
• Three binary variables: Cavity, Toothache, Catch
(whether dentist’s probe catches in your tooth).
P(cav | ache, catch) = P(ache, catch | cav) P(cav) /
[ P(ache, catch | cav) P(cav) + P(ache, catch | ¬cav) P(¬cav) ]
• With n pieces of evidence, we need 2^(n+1)
conditional probabilities.
• Here n = 2. Realistically, many more: X-ray, diet,
oral hygiene, personality, . . . .
Conditional independence
• All three variables are dependent, but Toothache
and Catch are independent given the presence or
absence of Cavity.
• Both Toothache and Catch are caused by Cavity,
but via independent causal mechanisms.
• In probabilistic terms:
P(ache, catch | cav) = P(ache | cav) P(catch | cav)
P(ache, catch | ¬cav) = P(ache | ¬cav) P(catch | ¬cav)
• With n pieces of evidence, x1, …, xn, we need only 2n
conditional probabilities: P(xi | cav), P(xi | ¬cav)
A simple Bayes net
• Graphical representation of relations between a set of
random variables:
Cavity → Toothache, Cavity → Catch
• Causal interpretation: independent local mechanisms
• Probabilistic interpretation: factorizing complex terms
P(A, B, C) = Π_{V ∈ {A, B, C}} P(V | parents[V])
P(Ache, Catch, Cav) = P(Ache, Catch | Cav) P(Cav)
= P(Ache | Cav) P(Catch | Cav) P(Cav)
A more complex system
• Graph: Battery → Radio, Battery → Ignition,
Ignition → Starts, Gas → Starts, Starts → On time to work
• Joint distribution sufficient for any inference:
P(B, R, I, G, S, O) = P(B) P(R | B) P(I | B) P(G) P(S | I, G) P(O | S)
• For example:
P(O | G) = P(O, G) / P(G)
= [ Σ_{B, R, I, S} P(B) P(R | B) P(I | B) P(G) P(S | I, G) P(O | S) ] / P(G)
= Σ_S [ Σ_{B, I} P(B) P(I | B) P(S | I, G) ] P(O | S)
• General inference algorithm: local message passing
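Inference from the joint distribution can be sketched by brute-force enumeration. The graph structure below is the lecture’s; all the numeric CPT values are made up for illustration:

```python
from itertools import product

# Hypothetical CPT values; the lecture specifies only the graph structure.
p_b = 0.95                                         # P(Battery ok)
def p_r(b): return 0.90 if b else 0.05             # P(Radio | Battery)
def p_i(b): return 0.97 if b else 0.0              # P(Ignition | Battery)
p_g = 0.90                                         # P(Gas)
def p_s(i, g): return 0.95 if (i and g) else 0.0   # P(Starts | Ignition, Gas)
def p_o(s): return 0.90 if s else 0.10             # P(On time | Starts)

def bern(p, x):
    """Probability that a Boolean variable with success prob p takes value x."""
    return p if x else 1 - p

def joint(b, r, i, g, s, o):
    # P(B,R,I,G,S,O) = P(B) P(R|B) P(I|B) P(G) P(S|I,G) P(O|S)
    return (bern(p_b, b) * bern(p_r(b), r) * bern(p_i(b), i) *
            bern(p_g, g) * bern(p_s(i, g), s) * bern(p_o(s), o))

# P(O | G) = P(O, G) / P(G): sum out the hidden variables B, R, I, S.
num = sum(joint(b, r, i, True, s, True)
          for b, r, i, s in product([True, False], repeat=4))
den = sum(joint(b, r, i, True, s, o)
          for b, r, i, s, o in product([True, False], repeat=5))
print(num / den)
```

Enumeration is exponential in the number of hidden variables; the message-passing algorithms mentioned above exploit the factorization to avoid this.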
Explaining away
• Graph: Rain → Grass Wet, Sprinkler → Grass Wet
P(R, S, W) = P(R) P(S) P(W | S, R)
• Assume grass will be wet if and only if it rained last
night, or if the sprinklers were left on:
P(W = w | S, R) = 1 if S = s or R = r
= 0 if R = ¬r and S = ¬s.
• Compute the probability it rained last night, given
that the grass is wet:
P(r | w) = P(w | r) P(r) / P(w)
= P(w | r) P(r) / Σ_{r′, s′} P(w | r′, s′) P(r′, s′)
= P(r) / [ P(r, s) + P(r, ¬s) + P(¬r, s) ]
= P(r) / [ P(r) + P(¬r, s) ]
= P(r) / [ P(r) + P(¬r) P(s) ]
≥ P(r)   (the denominator lies between P(r) and 1)
• Now compute the probability it rained last night, given
that the grass is wet and the sprinklers were left on:
P(r | w, s) = P(w | r, s) P(r | s) / P(w | s)
= P(r | s)   (both terms = 1)
= P(r)       (Rain and Sprinkler are a priori independent)
• So P(r | w) ≥ P(r), but P(r | w, s) = P(r): observing the
sprinkler “discounts” Rain back to its prior probability.
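With concrete priors for Rain and Sprinkler (made-up numbers; the lecture leaves them symbolic), explaining away is a two-line computation:

```python
# Hypothetical priors for the two independent causes.
p_rain, p_sprinkler = 0.2, 0.4

# Grass is wet iff it rained or the sprinkler was on (deterministic OR).
# P(rain | wet) = P(r) / [P(r) + P(~r) P(s)]: pushed above the prior.
p_rain_given_wet = p_rain / (p_rain + (1 - p_rain) * p_sprinkler)
print(p_rain_given_wet)   # 0.2 / 0.52 ≈ 0.385 > 0.2

# P(rain | wet, sprinkler) = P(rain): the sprinkler fully explains the
# wet grass, discounting Rain back to its prior.
print(p_rain)             # 0.2
```

The wet grass raises belief in rain from 0.2 to about 0.385; learning the sprinkler was on drops it back to 0.2, exactly the discounting the slide describes.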
Contrast w/ spreading activation
• Graph: Rain → Grass Wet, Sprinkler → Grass Wet
• Excitatory links: Rain → Wet, Sprinkler → Wet
• Observing rain, Wet becomes more active.
• Observing grass wet, Rain and Sprinkler become
more active.
• Observing grass wet and sprinkler, Rain cannot
become less active. No explaining away!
Contrast w/ spreading activation
• Excitatory links: Rain → Wet, Sprinkler → Wet
• Inhibitory link: Rain — Sprinkler
• Observing grass wet, Rain and Sprinkler become
more active.
• Observing grass wet and sprinkler, Rain becomes
less active: explaining away.
Contrast w/ spreading activation
• Add a third cause, Burst pipe, alongside Rain and
Sprinkler: each new variable requires more inhibitory
connections.
• Interactions between variables are not causal.
• Not modular.
– Whether a connection exists depends on what other
connections exist, in non-transparent ways.
– Big holism problem.
– Combinatorial explosion.
Causality and the Markov property
P(A, B, C) = Π_{V ∈ {A, B, C}} P(V | parents[V])
• Markov property: Any variable is conditionally
independent of its non-descendants, given its
parents.
• Example: Cavity → Ache, Cavity → Catch
P(Ache, Catch | Cav) = P(Ache, Catch, Cav) / P(Cav)
= P(Ache | Cav) P(Catch | Cav) P(Cav) / P(Cav)
= P(Ache | Cav) P(Catch | Cav)
Causality and the Markov property
• Markov property: Any variable is conditionally
independent of its non-descendants, given its
parents.
• Example: Rain → Grass Wet ← Sprinkler
P(Rain, Sprinkler) = Σ_Wet P(Rain, Sprinkler, Wet)
= Σ_Wet P(Wet | Rain, Sprinkler) P(Rain) P(Sprinkler)
= P(Rain) P(Sprinkler)
(the sum over Wet equals 1 for any values of Rain and Sprinkler)
Causality and the Markov property
• Markov property: Any variable is conditionally
independent of its non-descendants, given its parents.
• Suppose we get the direction of causality wrong,
thinking that “symptoms” cause “diseases”:
Ache → Cavity ← Catch
• Does not capture the correlation between symptoms:
falsely believe P(Ache, Catch) = P(Ache) P(Catch).
Causality and the Markov property
• Suppose we get the direction of causality wrong,
thinking that “symptoms” cause “diseases”:
Ache → Cavity ← Catch, plus a new arrow Ache → Catch
• Inserting the new arrow allows us to capture the
correlation between symptoms.
• But this model is too complex: it no longer encodes
the conditional independence we do believe in,
P(Ache, Catch | Cav) = P(Ache | Cav) P(Catch | Cav).
Causality and the Markov property
• Suppose we get the direction of causality wrong,
thinking that “symptoms” cause “diseases”:
Ache → Cavity, Catch → Cavity, X-ray → Cavity,
plus arrows among the symptoms themselves.
• New symptoms require a combinatorial proliferation
of new arrows. Too general, not modular, holism,
yuck . . . .
Still to come
• Applications to models of categorization
• More on the relation between causality and
probability: causal structure ↔ statistical dependencies
• Learning causal graph structures.
• Learning causal abstractions (“diseases
cause symptoms”)
• What’s missing
The end
[Figures: Mathcamp data, raw; Mathcamp data, collapsed over parity; Zenith radio data, collapsed over parity]