
THE MATHEMATICS OF
CAUSE AND EFFECT
Judea Pearl
UCLA
November 8, 2012
OUTLINE
1. From Turing test to Bayes networks
2. From Bayes networks to do-calculus
3. From messy science to counterfactuals
4. From counterfactuals to practical victories
   a) policy evaluation
   b) attribution
   c) mediation
   d) generalizability
CAN MACHINES THINK?
Alan M. Turing (1912 – 1954)
• The Turing Test: "Computing Machinery and Intelligence" (1950)
• Turing: Yes, if it acts like it thinks.
• Acts = it answers non-trivial questions about a story, a topic, or a situation.
HOW TURING ENVISIONED
THE TEST CONVERSATION
Q: Please write me a sonnet on the subject of the Forth Bridge.
A: Count me out on this one. I never could write poetry.
Q: Add 34,957 and 70,764.
A: (Pause about 30 seconds and then give an answer) 105,721.
Q: Do you play chess?
A: Yes.
Q: I have K at my K1, and no other pieces. You have only K at K6 and R at R1. It is your move. What do you play?
A: (After a pause of 15 seconds) R-R8: mate!
A CHILD MACHINE AND EVOLUTION
• The survival of the fittest is a slow method for measuring advantages.
• The experimenter, by exercise of intelligence, should be able to speed it up.
• If he can trace a cause for some weakness he can probably think of the kind of mutation which will improve it.
(A.M. Turing, 1950)
A “MINI” TURING TEST
IN CAUSAL CONVERSATION
The Story
Input: Story
Question: What is? What if? Why?
Answers: I believe that...
Image adapted from Saygin, 2000.
Q1: If the season is dry and the pavement is slippery, did it rain?
A1: Unlikely; it is more likely that the sprinkler was ON, with a very slight possibility that the pavement is not even wet.
A “MINI” TURING TEST
IN CAUSAL CONVERSATION
The Story
Image adapted from Saygin, 2000.
Q2: But what if we SEE that the sprinkler is OFF?
A2: Then it is more likely that it rained.
A “MINI” TURING TEST
IN CAUSAL CONVERSATION
The Story
Image adapted from Saygin, 2000.
Q3: Do you mean that if we actually turn the sprinkler ON,
the rain will be less likely?
A3: No, the likelihood of rain would remain the same but
the pavement will surely get wet.
A “MINI” TURING TEST
IN CAUSAL CONVERSATION
The Story
Image adapted from Saygin, 2000.
Q4: Suppose we SEE that the sprinkler is ON and the
pavement wet. What if the sprinkler were OFF?
A4: The pavement would be dry, because the season
is likely dry.
SEARLE’S CHINESE ROOM
ARGUMENT
WHAT’S IN SEARLE’S
RULE BOOK?
Searle's oversight: there are not enough molecules in the
universe to make the book, even for the sprinkler example.
(Hence the focus on causal conversation.)
IS PARSIMONY NECESSARY
(SUFFICIENT) FOR UNDERSTANDING?
Understanding requires translating world constraints
into a grammar (constraints over symbol strings) and
harnessing it to answer queries swiftly and reliably.
Parsimony can only be achieved by exploiting the
constraints in the world to beat the combinatorial
explosion.
THE PLURALITY OF MINI TURING TESTS
[Diagram: the Turing Test fans out into many mini tests: Poetry, Arithmetic, Chess, Stock market, Robotics, Scientific thinking, Human Cognition and Ethics, Causal Reasoning, data-intensive scientific applications, and thousands of hungry and aimless customers.]
THE PLURALITY OF MINI TURING TESTS
[Diagram repeated: Poetry, Arithmetic, Chess, Stock market, Human Cognition and Ethics, Causal Reasoning, ...]
Causal Explanation
“She handed me the fruit
and I ate”
“The serpent deceived me,
and I ate”
COUNTERFACTUALS AND OUR
SENSE OF JUSTICE
Abraham:
Are you about to smite the
righteous with the wicked?
What if there were fifty
righteous men in the city?
And the Lord said,
“If I find in the city of Sodom fifty
good men, I will pardon the whole
place for their sake.”
Genesis 18:26
THE PLURALITY OF MINI TURING TESTS
[Diagram repeated: Poetry, Arithmetic, Chess, Stock market, Human Cognition and Ethics, Causal Reasoning, Scientific thinking, ...]
WHY PHYSICS IS COUNTERFACTUAL
Scientific equations (e.g., Hooke's Law) are non-algebraic.
e.g., Length (Y) equals a constant (2) times the weight (X).
Correct notation: Y := 2X  (or Y ← 2X)
Process information: X = 1
The solution: Y = 2
Had X been 3, Y would be 6.
If we raise X to 3, Y would be 6.
Must "wipe out" X = 1.
[Figure: the process Y ← 2X with input X = 1 and its solution Y = 2; the mutilated process with X set to 3; alternative algebraic forms X = ½Y and Y = X + 1.]
THE PLURALITY OF MINI TURING TESTS
[Diagram repeated: Poetry, Arithmetic, Chess, Stock market, Human Cognition and Ethics, Causal Reasoning, Robotics, Scientific thinking, ...]
CAUSATION AS A
PROGRAMMER'S NIGHTMARE
Input:
1. “If the grass is wet, then it rained”
2. "If we break this bottle, the grass
will get wet”
Output:
“If we break this bottle, then it rained”
WHAT KIND OF QUESTIONS
SHOULD THE ROBOT ANSWER?
• Observational Questions: "What if we see A?" (What is?)
• Action Questions: "What if we do A?" (What if?)
• Counterfactual Questions: "What if we did things differently?" (Why?)
• Options: "With what probability?"
THE CAUSAL HIERARCHY
THE PLURALITY OF MINI TURING TESTS
[Diagram repeated: Poetry, Arithmetic, Chess, Stock market, Human Cognition and Ethics, Causal Reasoning, Robotics, Scientific thinking, data-intensive scientific applications, thousands of hungry and aimless customers, ...]
THE FIVE NECESSARY STEPS FOR
EFFECT ESTIMATION
Define:
Express the target quantity Q as a property of
the model M.
P(Y_x = y)   or   P(y | do(x))
Assume: Express causal assumptions in structural or
graphical form.
Identify:
Determine if Q is identifiable.
Estimate: Estimate Q if it is identifiable; approximate it,
if it is not.
Test:
If M has testable implications
THE FIVE NECESSARY STEPS FOR
AVERAGE TREATMENT EFFECT
Define:
Express the target quantity Q as a property of
the model M.
ATE = E(Y | do(x1)) - E(Y | do(x0))
Assume: Express causal assumptions in structural or
graphical form.
Identify:
Determine if Q is identifiable.
Estimate: Estimate Q if it is identifiable; approximate it,
if it is not.
Test:
If M has testable implications
THE FIVE NECESSARY STEPS FOR
DYNAMIC POLICY ANALYSIS
Define:
Express the target quantity Q as a property of
the model M.
P(y | do(x = g(z)))
Assume: Express causal assumptions in structural or
graphical form.
Identify:
Determine if Q is identifiable.
Estimate: Estimate Q if it is identifiable; approximate it,
if it is not.
Test:
If M has testable implications
THE FIVE NECESSARY STEPS FOR
TIME VARYING POLICY ANALYSIS
Define:
Express the target quantity Q as a property of
the model M.
P(y | do(X = x, Z = z, W = w))
Assume: Express causal assumptions in structural or
graphical form.
Identify:
Determine if Q is identifiable.
Estimate: Estimate Q if it is identifiable; approximate it,
if it is not.
Test:
If M has testable implications
THE FIVE NECESSARY STEPS FOR
TREATMENT ON TREATED
Define:
Express the target quantity Q as a property of
the model M.
ETT = P(Y_x = y | X = x')
Assume: Express causal assumptions in structural or
graphical form.
Identify:
Determine if Q is identifiable.
Estimate: Estimate Q if it is identifiable; approximate it,
if it is not.
Test:
If M has testable implications
THE FIVE NECESSARY STEPS FOR
INDIRECT EFFECTS
Define:
Express the target quantity Q as a property of
the model M.
IE = E[Y_{x, Z(x')}] - E[Y_x]
Assume: Express causal assumptions in structural or
graphical form.
Identify:
Determine if Q is identifiable.
Estimate: Estimate Q if it is identifiable; approximate it,
if it is not.
Test:
If M has testable implications
THE FIVE NECESSARY STEPS
FROM DEFINITION TO ASSUMPTIONS
Define:
Express the target quantity Q as a property of
the model M.
Assume: Express causal assumptions in structural or
graphical form.
Identify:
Determine if Q is identifiable.
Estimate: Estimate Q if it is identifiable; approximate it,
if it is not.
Test:
If M has testable implications
THE LOGIC OF CAUSAL ANALYSIS
[Flowchart]
Inputs: A - causal assumptions, forming a causal model M_A; Q - queries of interest; Data (D).
Causal inference yields: A* - logical implications of A; Q(P) - identified estimands; T(M_A) - testable implications.
Statistical inference, applied to the data D, yields: Q-hat = Q(D, A) - estimates of Q(P) (provisional claims); g(T) - model testing, goodness of fit.
STRUCTURAL CAUSAL MODELS:
THE WORLD AS A COLLECTION
OF SPRINGS
Definition: A structural causal model is a 4-tuple
V,U, F, P(u), where
• V = {V1,...,Vn} are endogenous variables
• U = {U1,...,Um} are background variables
• F = {f1,..., fn} are functions determining V,
vi = fi(v, u)
e.g., y = βx + u_Y
• P(u) is a distribution over U
P(u) and F induce a distribution P(v) over
observable variables
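
As an illustrative sketch (not part of the lecture), the sprinkler story used throughout the talk can be written as such a 4-tuple in a few lines of Python; the particular functions and noise probabilities below are arbitrary assumptions, chosen only to make the example runnable.

    import random

    # A toy structural causal model for the sprinkler story.
    # Background variables U are drawn from P(u); each endogenous variable
    # is a deterministic function f_i of its parents and u (hypothetical functions).

    def sample_u():
        return {"uc": random.random(), "us": random.random(),
                "ur": random.random(), "uw": random.random()}

    def model(u, do=None):
        do = do or {}                                    # do={} means "no intervention"
        v = {}
        v["season_dry"] = u["uc"] < 0.5                                        # C = f_C(U_C)
        v["sprinkler"]  = do.get("sprinkler",
                                 v["season_dry"] and u["us"] < 0.8)            # S = f_S(C, U_S)
        v["rain"]       = do.get("rain",
                                 (not v["season_dry"]) and u["ur"] < 0.7)      # R = f_R(C, U_R)
        v["wet"]        = do.get("wet",
                                 v["sprinkler"] or v["rain"] or u["uw"] < 0.01)  # W = f_W(S, R, U_W)
        return v

    # P(u) and F induce a distribution P(v) over the observed variables:
    samples = [model(sample_u()) for _ in range(10000)]
    print("P(wet) ~", round(sum(s["wet"] for s in samples) / len(samples), 3))
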
COUNTERFACTUALS ARE
EMBARRASSINGLY SIMPLE
Definition:
The sentence: “Y would be y (in situation u), had X been x,”
denoted Yx(u) = y, means:
The solution for Y in a mutilated model Mx, (i.e., the equations
for X replaced by X = x) with input U=u, is equal to y.
The Fundamental Equation of Counterfactuals:
Y_x(u) = Y_{M_x}(u)
COUNTERFACTUALS ARE
EMBARRASSINGLY SIMPLE
Definition:
The sentence: “Y would be y (in situation u), had X been x,”
denoted Yx(u) = y, means:
The solution for Y in a mutilated model Mx, (i.e., the equations
for X replaced by X = x) with input U=u, is equal to y.
• Joint probabilities of counterfactuals:
  P(Y_x = y, Z_w = z) = Σ_{u: Y_x(u) = y, Z_w(u) = z} P(u)
In particular:
  P(y | do(x)) = P(Y_x = y) = Σ_{u: Y_x(u) = y} P(u)
  P(Y_{x'} = y' | x, y) = Σ_{u: Y_{x'}(u) = y'} P(u | x, y)
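
A minimal sketch of the fundamental equation Y_x(u) = Y_{M_x}(u), reusing the toy sprinkler model from the earlier sketch: for each background state u, the counterfactual is just the solution of the mutilated model. Conditioning on the evidence by rejection sampling is my own simplification, adequate only for this discrete toy.

    # Counterfactual query Q4 from the mini Turing test above:
    # we SEE the sprinkler ON and the pavement wet; what if the sprinkler were OFF?
    # Step 1 (abduction):  keep only the u's compatible with the evidence, i.e., P(u | x, y).
    # Step 2 (action):     mutilate the model, forcing sprinkler = OFF.
    # Step 3 (prediction): solve for W in the mutilated model for each retained u.

    cf_wet = []
    for _ in range(100000):
        u = sample_u()
        v = model(u)
        if v["sprinkler"] and v["wet"]:                                   # abduction by rejection
            cf_wet.append(model(u, do={"sprinkler": False})["wet"])       # Y_x(u) = Y_{M_x}(u)

    print("P(wet_{S=OFF} | S=ON, wet) ~", round(sum(cf_wet) / len(cf_wet), 3))
    # Mostly dry, matching A4: "The pavement would be dry, because the season is likely dry."
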
THE MIRACLE OF UNIVERSAL CONSTRAINTS
E PLURIBUS UNUM – OUT OF MANY, ONE
C (Climate):    C = f_C(U_C)
S (Sprinkler):  S = f_S(C, U_S)
R (Rain):       R = f_R(C, U_R)
W (Wetness):    W = f_W(S, R, U_W)
Each function summarizes millions of micro processes (U1, U2, U3, U4, ...).
THE MIRACLE OF UNIVERSAL CONSTRAINTS
E PLURIBUS UNUM – OUT OF MANY, ONE
C (Climate):    C = f_C(U_C)
S (Sprinkler):  S = f_S(C, U_S)
R (Rain):       R = f_R(C, U_R)
W (Wetness):    W = f_W(S, R, U_W)
Each function summarizes millions of micro processes.
Still, if the U's are independent, the observed distribution P(C,R,S,W) must satisfy certain constraints that are:
(1) independent of the f's and of P(U), and
(2) can be read from the structure of the graph.
D-SEPARATION: NATURE'S LANGUAGE
FOR COMMUNICATING ITS STRUCTURE
C (Climate):    C = f_C(U_C)
S (Sprinkler):  S = f_S(C, U_S)
R (Rain):       R = f_R(C, U_R)
W (Wetness):    W = f_W(S, R, U_W)
Every missing arrow advertises an independency, conditional on a separating set.
e.g., C ⊥ W | (S, R)
      S ⊥ R | C
Applications:
1. Structure learning
2. Model testing
3. Reducing "what if I do" questions to symbolic calculus
4. Reducing scientific questions to symbolic calculus
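
A quick numerical check (my own sketch, reusing the toy sprinkler model from the earlier sketch): whatever functions and noise probabilities are chosen, the missing C → W arrow forces C ⊥ W | (S, R) in the simulated data. The crude comparison of joint and product probabilities below is only an illustration, not a formal independence test.

    from collections import defaultdict

    # Verify C ⊥ W | (S, R) on simulated data from the toy sprinkler SCM above.
    counts = defaultdict(lambda: defaultdict(int))
    for _ in range(200000):
        v = model(sample_u())
        stratum = (v["sprinkler"], v["rain"])
        counts[stratum][(v["season_dry"], v["wet"])] += 1

    for stratum, cell in counts.items():
        n = sum(cell.values())
        p_c  = sum(k for (c, w), k in cell.items() if c) / n          # P(C=dry | s, r)
        p_w  = sum(k for (c, w), k in cell.items() if w) / n          # P(W=wet | s, r)
        p_cw = sum(k for (c, w), k in cell.items() if c and w) / n    # P(C=dry, W=wet | s, r)
        print(stratum, "P(C,W) - P(C)P(W) ~", round(p_cw - p_c * p_w, 3))   # ~0 in every stratum
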
SEEING VS. DOING
P(x1, ..., xn) = Π_i P(xi | pa_i)
P(x1, x2, x3, x4, x5) = P(x1) P(x2 | x1) P(x3 | x1) P(x4 | x2, x3) P(x5 | x4)
Effect of turning the sprinkler ON:
P_{X3=ON}(x1, x2, x4, x5) = P(x1) P(x2 | x1) P(x4 | x2, X3=ON) P(x5 | x4)
                          ≠ P(x1, x2, x4, x5 | X3 = ON)
THE MACHINERY OF CAUSAL CALCULUS
Rule 1: Ignoring observations
P(y | do{x}, z, w) = P(y | do{x}, w)
  if (Y ⊥ Z | X, W) holds in G_X̄ (the graph with arrows into X removed)
Rule 2: Action/observation exchange
P(y | do{x}, do{z}, w) = P(y | do{x}, z, w)
  if (Y ⊥ Z | X, W) holds in G_X̄Z̲ (arrows into X and out of Z removed)
Rule 3: Ignoring actions
P(y | do{x}, do{z}, w) = P(y | do{x}, w)
  if (Y ⊥ Z | X, W) holds in G_X̄Z̄(W) (arrows into X and into Z(W) removed)
Completeness Theorem (Shpitser, 2006)
“WHAT IF I SMOKE?”
REDUCED TO CALCULUS
[Diagram: Smoking → Tar → Cancer, with an unobserved Genotype influencing both Smoking and Cancer.]
P(c | do{s}) = Σ_t P(c | do{s}, t) P(t | do{s})                      Probability Axioms
             = Σ_t P(c | do{s}, do{t}) P(t | do{s})                  Rule 2
             = Σ_t P(c | do{s}, do{t}) P(t | s)                      Rule 2
             = Σ_t P(c | do{t}) P(t | s)                             Rule 3
             = Σ_{s'} Σ_t P(c | do{t}, s') P(s' | do{t}) P(t | s)    Probability Axioms
             = Σ_{s'} Σ_t P(c | t, s') P(s' | do{t}) P(t | s)        Rule 2
             = Σ_{s'} Σ_t P(c | t, s') P(s') P(t | s)                Rule 3
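
A small sketch of what the derived estimand buys us (my own simulation; the genotype mechanism, all probabilities, and the sample sizes are invented assumptions): the front-door expression, computed from observational data alone, recovers the interventional probability even though the genotype is never measured.

    import random

    def world(do_smoke=None):
        genotype = random.random() < 0.5                 # unobserved confounder
        smoke = do_smoke if do_smoke is not None else (random.random() < (0.8 if genotype else 0.2))
        tar = random.random() < (0.9 if smoke else 0.1)                   # smoking -> tar
        cancer = random.random() < (0.1 + 0.3 * tar + 0.4 * genotype)     # tar and genotype -> cancer
        return smoke, tar, cancer

    data = [world() for _ in range(200000)]              # observational data: (smoke, tar, cancer) only

    def p(pred, cond=lambda r: True):
        rows = [r for r in data if cond(r)]
        return sum(pred(r) for r in rows) / len(rows)

    # Front-door formula: P(c | do(s)) = sum_t P(t | s) * sum_s' P(c | t, s') P(s')
    def front_door(s):
        total = 0.0
        for t in (False, True):
            p_t_given_s = p(lambda r: r[1] == t, lambda r: r[0] == s)
            inner = sum(p(lambda r: r[2], lambda r: r[1] == t and r[0] == s2) *
                        p(lambda r: r[0] == s2)
                        for s2 in (False, True))
            total += p_t_given_s * inner
        return total

    truth = sum(world(do_smoke=True)[2] for _ in range(200000)) / 200000
    print("front-door estimate of P(cancer | do(smoke)) ~", round(front_door(True), 3))
    print("interventional truth                         ~", round(truth, 3))
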
EFFECT OF WARM-UP ON INJURY
(After Shrier & Platt, 2008)
No, no!
DETERMINING CAUSES OF EFFECTS
A COUNTERFACTUAL VICTORY
• Your Honor! My client (Mr. A) died BECAUSE he used that drug.
• Court to decide if it is MORE PROBABLE THAN NOT that A would be alive BUT FOR the drug!
PN = P(? | A is dead, took the drug) > 0.50
THE ATTRIBUTION PROBLEM
Definition:
1. What is the meaning of PN(x,y):
“Probability that event y would not have occurred if
it were not for event x, given that x and y did in fact
occur.”
Answer:
PN(x, y) = P(Y_{x'} = y' | x, y)
Computable from M
THE ATTRIBUTION PROBLEM
Definition:
1. What is the meaning of PN(x,y):
“Probability that event y would not have occurred if
it were not for event x, given that x and y did in fact
occur.”
Identification:
2. Under what condition can PN(x,y) be learned from
statistical data, i.e., observational, experimental
and combined.
ATTRIBUTION MATHEMATIZED
(Tian and Pearl, 2000)
• Bounds given combined nonexperimental and
  experimental data (P(y,x), P(y_x), for all y and x):

  max{ 0, [P(y) - P(y_{x'})] / P(x,y) }  ≤  PN  ≤  min{ 1, [P(y'_{x'}) - P(x',y')] / P(x,y) }

• Identifiability under monotonicity (combined data):

  PN = [P(y|x) - P(y|x')] / P(y|x)  +  [P(y|x') - P(y_{x'})] / P(x,y)
CAN FREQUENCY DATA DECIDE LEGAL RESPONSIBILITY?

                  Experimental           Nonexperimental
                  do(x)     do(x')       x         x'
Deaths (y)           16         14        2         28
Survivals (y')      984        986      998        972
                  1,000      1,000    1,000      1,000

• Nonexperimental data: drug usage predicts longer life.
• Experimental data: drug has negligible effect on survival.
• Plaintiff: Mr. A is special.
  1. He actually died.
  2. He used the drug by choice.
• Court to decide (given both data):
  Is it more probable than not that A would be alive but for the drug?
PN = P(Y_{x'} = y' | x, y) > 0.50 ?
SOLUTION TO THE ATTRIBUTION PROBLEM
• WITH PROBABILITY ONE:  1 ≤ PN = P(Y_{x'} = y' | x, y) ≤ 1
• Combined data tell more than each study alone.
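
As a worked check of the Tian-Pearl bounds on the courtroom numbers above (a plain arithmetic sketch, simply plugging the two tables into the bound given two slides earlier):

    # Nonexperimental (choice) data, pooled over 2,000 subjects:
    p_xy       = 2 / 2000      # P(x, y):   chose the drug and died
    p_y        = 30 / 2000     # P(y):      died
    p_xp_yp    = 972 / 2000    # P(x', y'): avoided the drug and survived

    # Experimental (randomized) data:
    p_y_do_xp  = 14 / 1000     # P(y_{x'}):  death rate with the drug withheld
    p_yp_do_xp = 986 / 1000    # P(y'_{x'}): survival rate with the drug withheld

    lower = max(0.0, (p_y - p_y_do_xp) / p_xy)
    upper = min(1.0, (p_yp_do_xp - p_xp_yp) / p_xy)
    print("bounds on PN:", round(lower, 3), "<= PN <=", round(upper, 3))
    # Both bounds round to 1.0: the drug was necessary for Mr. A's death with probability one.
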
MEDIATION:
ANOTHER COUNTERFACTUAL
TRIUMPH
Why decompose effects?
1. To understand how Nature works
2. To comply with legal requirements
3. To predict the effects of new types of interventions:
Signal re-routing and mechanism deactivating,
rather than variable fixing
COUNTERFACTUAL DEFINITION OF INDIRECT EFFECTS
[Diagram: X → Z → Y, with a direct arrow X → Y]
z = f(x, u)
y = g(x, z, u)
No controlled indirect effect.
Indirect Effect of X on Y, IE(x0, x1; Y):
The expected change in Y when we keep X constant, say
at x0, and let Z change to whatever value it would have
attained had X changed to x1.
IE(x0, x1; Y) = E[Y_{x0, Z_{x1}} - Y_{x0}]
In linear models, IE = TE - DE.
POLICY IMPLICATIONS OF INDIRECT EFFECTS
What is the indirect effect of X on Y?
The effect of Gender on Hiring if sex discrimination is eliminated.
[Diagram: GENDER (X) → QUALIFICATION (Z) → HIRING (Y), with the direct link X → Y marked IGNORE.]
Deactivating a link: a new type of intervention.
MEDIATION FORMULAS IN UNCONFOUNDED MODELS
[Diagram: X → Z → Y, with a direct arrow X → Y]
z = f(x, u1)
y = g(x, z, u2)
u1 independent of u2

DE = Σ_z [E(Y | x1, z) - E(Y | x0, z)] P(z | x0)
IE = Σ_z E(Y | x0, z) [P(z | x1) - P(z | x0)]
TE = E(Y | x1) - E(Y | x0)
TE ≠ DE + IE
IE      = fraction of responses explained by mediation (sufficient)
TE - DE = fraction of responses owed to mediation (necessary)
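
A small numeric sketch of these formulas (a stand-alone toy with a binary mediator; the data-generating probabilities are arbitrary assumptions):

    import random

    # Toy unconfounded mediation model: X -> Z -> Y and X -> Y.
    def draw(x):
        z = random.random() < (0.7 if x else 0.3)          # z = f(x, u1)
        y = random.random() < (0.2 + 0.3 * x + 0.4 * z)    # y = g(x, z, u2)
        return z, y

    data = {x: [draw(x) for _ in range(100000)] for x in (0, 1)}

    def E_y(x, z):            # E(Y | x, z)
        ys = [y for (zz, y) in data[x] if zz == z]
        return sum(ys) / len(ys)

    def P_z(z, x):            # P(z | x)
        return sum(zz == z for (zz, _) in data[x]) / len(data[x])

    DE = sum((E_y(1, z) - E_y(0, z)) * P_z(z, 0) for z in (0, 1))
    IE = sum(E_y(0, z) * (P_z(z, 1) - P_z(z, 0)) for z in (0, 1))
    TE = sum(y for _, y in data[1]) / len(data[1]) - sum(y for _, y in data[0]) / len(data[0])
    print("DE ~", round(DE, 3), " IE ~", round(IE, 3), " TE ~", round(TE, 3))
    # This toy is additive in (x, z), so TE ~ DE + IE here; in general TE != DE + IE.
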
TRANSPORTABILITY OF KNOWLEDGE
ACROSS DOMAINS
(with E. Bareinboim)
1. A Theory of causal transportability
When can causal relations learned from experiments
be transferred to a different environment in which no
experiment can be conducted?
2. A Theory of statistical transportability
When can statistical information learned in one domain
be transferred to a different domain in which
a. only a subset of variables can be observed? Or,
b. only a few samples are available?
MOTIVATION
WHAT CAN EXPERIMENTS IN LA TELL ABOUT NYC?
Experimental study in LA (population Π):
  X (Intervention) → Y (Outcome), Z (Age)
  Measured: P(x, y, z), P(y | do(x), z)
Observational study in NYC (population Π*):
  X (Observation) → Y (Outcome), Z (Age)
  Measured: P*(x, y, z);  P*(z) ≠ P(z)
Needed:
  P*(y | do(x)) = ?  =  Σ_z P(y | do(x), z) P*(z)
Transport Formula (calibration): F(P, P_do, P*)
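
A minimal sketch of the needed recalibration (all numbers are hypothetical; only the formula is from the slide):

    # Age-specific effects measured experimentally in LA:
    p_y_do_x_given_z = {"young": 0.30, "old": 0.60}   # P(y | do(x), z), from the LA experiment

    # Age distributions differ across cities:
    p_z_LA  = {"young": 0.70, "old": 0.30}            # P(z)
    p_z_NYC = {"young": 0.40, "old": 0.60}            # P*(z), from the NYC observational study

    # Transport formula: P*(y | do(x)) = sum_z P(y | do(x), z) P*(z)
    effect_LA  = sum(p_y_do_x_given_z[z] * p_z_LA[z]  for z in p_z_LA)
    effect_NYC = sum(p_y_do_x_given_z[z] * p_z_NYC[z] for z in p_z_NYC)
    print("P(y | do(x)) in LA   ~", round(effect_LA, 3))
    print("P*(y | do(x)) in NYC ~", round(effect_NYC, 3))   # recalibrated, no NYC experiment needed
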
TRANSPORT FORMULAS DEPEND ON THE STORY
[Selection diagrams (a) and (b), each over X, Y, Z, with selection nodes S marking the factors producing differences between populations.]
a) Z represents age:
   P*(y | do(x)) = Σ_z P(y | do(x), z) P*(z)
b) Z represents language skill:
   P*(y | do(x)) =? P(y | do(x))
TRANSPORT FORMULAS DEPEND ON THE STORY
[Selection diagrams (a), (b), and (c), each over X, Y, Z, with selection nodes S marking the factors producing differences.]
a) Z represents age:
   P*(y | do(x)) = Σ_z P(y | do(x), z) P*(z)
b) Z represents language skill:
   P*(y | do(x)) = P(y | do(x))
c) Z represents a bio-marker:
   P*(y | do(x)) =? Σ_z P(y | do(x), z) P*(z | x)
GOAL: ALGORITHM TO DETERMINE IF AN EFFECT IS TRANSPORTABLE
INPUT: Annotated causal graph
  [over X, W, Z, Y, V, T, U, with selection nodes S marking the factors creating differences]
OUTPUT:
1. Transportable or not?
2. Measurements to be taken in the experimental study
3. Measurements to be taken in the target population
4. A transport formula
   P*(y | do(x)) = f[ P(y, v, z, w, t, u | do(x)); P*(y, v, z, w, t, u) ]
TRANSPORTABILITY REDUCED TO CALCULUS
Theorem
A causal relation R is transportable from Π to Π* if
and only if it is reducible, using the rules of do-calculus,
to an expression in which S is separated from do( ).
[Example diagram: X → Y with auxiliary variables W and Z and a selection node S]
R(Π*) = P*(y | do(x)) = P(y | do(x), s)
      = Σ_w P(y | do(x), s, w) P(w | do(x), s)
      = Σ_w P(y | do(x), w) P(w | s)
      = Σ_w P(y | do(x), w) P*(w)
RESULT: ALGORITHM TO DETERMINE IF AN EFFECT IS TRANSPORTABLE
INPUT: Annotated causal graph
  [over X, W, Z, Y, V, T, U, with selection nodes S marking the factors creating differences]
OUTPUT:
1. Transportable or not?
2. Measurements to be taken in the experimental study
3. Measurements to be taken in the target population
4. A transport formula
5. Completeness (Bareinboim, 2012)
P*(y | do(x)) = Σ_z P(y | do(x), z) Σ_w P*(z | w) Σ_t P(w | do(x), t) P*(t)
WHICH MODEL LICENSES THE TRANSPORT OF THE CAUSAL EFFECT X → Y?
S = external factors creating disparities.
[Figure: six selection diagrams (a)-(f) over X, Y and auxiliary nodes W, Z, each marked "Yes" or "No" according to whether the effect of X on Y is transportable.]
STATISTICAL TRANSPORTABILITY
(Transfer Learning)
Why should we transport statistical information?
i.e., why not re-learn things from scratch?
1. Measurements are costly.
Limit measurements to a subset V * of variables
called “scope”.
2. Samples are scarce.
Pooling samples from diverse populations will
improve precision, if differences can be filtered
out.
STATISTICAL TRANSPORTABILITY
Definition (Statistical Transportability):
A statistical relation R(P) is said to be transportable from Π to Π*
over V* if R(P*) is identified from P, P*(V*), and D, where P*(V*)
is the marginal distribution of P* over a subset of variables V*.
[Diagrams: X → Z → Y with a selection node S]
R = P*(y | x) is transportable over V* = {X, Z}, i.e., R is estimable
without re-measuring Y:
  R = Σ_z P*(z | x) P(y | z)
Transfer Learning
If few samples (N2) are available from Π* and many samples (N1) from Π,
then estimating R = P*(y | x) by
  R = Σ_z P(y | x, z) P*(z | x)
achieves a much higher precision.
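
A rough sketch of the precision claim (entirely hypothetical populations over binary X, Z, Y, sharing the mechanism for Y but differing in P(z | x)): the composite estimator leans on the large source study for P(y | x, z) and uses the small target study only for P*(z | x).

    import random

    def sample(n, p_z_given_x):
        # X ~ Bernoulli(0.5); Z depends on X (population-specific); Y depends on X and Z (shared mechanism).
        rows = []
        for _ in range(n):
            x = random.random() < 0.5
            z = random.random() < p_z_given_x[x]
            y = random.random() < (0.2 + 0.3 * x + 0.4 * z)
            rows.append((x, z, y))
        return rows

    def cond_mean(rows, pick, cond):
        sel = [pick(r) for r in rows if cond(r)]
        return sum(sel) / len(sel) if sel else 0.0

    def estimates():
        big   = sample(20000, {False: 0.3, True: 0.6})   # source population, many samples (N1)
        small = sample(200,   {False: 0.5, True: 0.8})   # target population, few samples (N2)
        # Direct estimate of P*(y | x=1) from the small target sample alone:
        direct = cond_mean(small, lambda r: r[2], lambda r: r[0])
        # Composite estimate: sum_z P(y | x=1, z) P*(z | x=1)
        composite = sum(cond_mean(big,   lambda r: r[2],      lambda r: r[0] and r[1] == z) *
                        cond_mean(small, lambda r: r[1] == z, lambda r: r[0])
                        for z in (False, True))
        return direct, composite

    runs = [estimates() for _ in range(200)]
    for name, vals in zip(("direct   ", "composite"), zip(*runs)):
        mean = sum(vals) / len(vals)
        sd = (sum((v - mean) ** 2 for v in vals) / len(vals)) ** 0.5
        print(name, "mean ~", round(mean, 3), " sd ~", round(sd, 3))
    # The composite estimator shows a visibly smaller standard deviation across runs.
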
META-ANALYSIS OR MULTI-SOURCE LEARNING
Target population Π*;  R = P*(y | do(x))
[Figure: nine candidate study populations (a)-(i), each described by a selection diagram over X, Z, W, Y, some carrying selection nodes S.]
CAN WE GET A BIAS-FREE ESTIMATE OF THE TARGET QUANTITY?
Target population Π*;  R = P*(y | do(x))
Is R identifiable from (d) and (h)?
R = Σ_w P*(y | do(x), w) P*(w | do(x))
  = Σ_w P(h)(y | do(x), w) P(d)(w | do(x))
  = Σ_w P(h)(y | do(x), w) P(d)(w | x)
R(Π*) is identifiable from studies (d) and (h).
R(Π*) is not identifiable from studies (d) and (i).
[Figure: selection diagrams (a), (d), (h), (i) over X, Z, W, Y with selection nodes S.]
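
A tiny sketch of the pooling step above (hypothetical summary numbers standing in for the two studies; binary W):

    # Study (h): an experiment reporting the W-specific effect of X on Y.
    p_y_do_x_given_w__h = {0: 0.25, 1: 0.55}    # P_(h)(y | do(x), w)   (hypothetical)

    # Study (d): conducted on a population whose W-response to X matches the target.
    p_w_given_x__d = {0: 0.4, 1: 0.6}           # P_(d)(w | x)          (hypothetical)

    # R = P*(y | do(x)) = sum_w P_(h)(y | do(x), w) * P_(d)(w | x)
    R = sum(p_y_do_x_given_w__h[w] * p_w_given_x__d[w] for w in (0, 1))
    print("P*(y | do(x)) ~", round(R, 3))
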
FROM META-ANALYSIS
TO META-SYNTHESIS
The problem
How to combine results of several experimental
and observational studies, each conducted on a
different population and under a different set of
conditions, so as to construct an aggregate
measure of effect size that is "better" than any
one study in isolation.
META-SYNTHESIS REDUCED TO CALCULUS
Theorem
{Π1, Π2, ..., ΠK} – a set of studies.
{D1, D2, ..., DK} – selection diagrams (relative to Π*).
A relation R(Π*) is "meta estimable" if it can be
decomposed into terms of the form:
  Qk = P(Vk | do(Wk), Zk)
such that each Qk is transportable from Dk.
Open problem: systematic decomposition.
BIAS VS. PRECISION IN META-SYNTHESIS
Principle 1: Calibrate estimands before pooling (to minimize bias).
Principle 2: Decompose to sub-relations before calibrating (to improve precision).
R(Π*) = P*(y | do(x))
[Figure: selection diagrams (a), (g), (h), (i), (d) over X, W, Z, Y, with selection nodes S.]
Calibration: each study's estimand is recalibrated to the target, e.g., P*_(g)(y | do(x)), ..., P*_(d)(y | do(x)).
Pooling: the calibrated estimands are then pooled.
BIAS VS. PRECISION IN META-SYNTHESIS
R(Π*) = P*(y | do(x))
[Figure: selection diagrams (a), (g), (h), (i), (d) over X, W, Z, Y, with selection nodes S.]
Calibration of sub-relations:
  P*_(g)(y | do(x)),  P*_(h)(y | w, do(x)),  P*_(i)(w | do(x)),  P*_(d)(w | do(x))
Pooling:
  P*_(i,d)(w | do(x))
Composition:
  P*_(i,d,h)(y | do(x)) = Σ_w P*_(h)(y | w, do(x)) P*_(i,d)(w | do(x))
Pooling:
  P*_(all)(y | do(x))
CONCLUSIONS
• Counterfactuals are the building blocks of
scientific thought, free will and moral behavior.
• The algorithmization of counterfactuals has
benefited several problem areas in the empirical
sciences, including policy evaluation, mediation
analysis, generalizability, and credit / blame
determination.
• This brings us a step closer to achieving
cooperative behavior among computers and
humans.
CONCLUSIONS (cont.)
What is "understanding"?
Harnessing the grammars of science to answer
questions that scientists wish to ask, and do not
know how.
What is fun?
Seeing your intuition amplified through the
microscope of formal analysis.
Even more fun:
Watching with amazement how you can do things
today that you couldn't yesterday.
Thank you
Rumelhart (1976)
Figure 3
Rumelhart (1976)
Figure 10
Rumelhart (1976), p. 35
[Slide reproduces Rumelhart's (1976, p. 35) updating equation, which combines a value v_k with terms C_{i,j} Pr(h_j | h_i, h_k) summed over j, and a default term R_k otherwise.]
Pearl (1982)
Pearl (1982), (Belief Propagation)
Kim & Pearl (1983)
Explaining away
BELIEF PROPAGATION
IN POLYTREES
Bayes Net (1985)
Bayes Net (1985)
Breaking a loop
BELIEF PROPAGATION
WHEN THERE ARE LOOPS
APPLICATIONS OF
BAYESIAN NETWORKS
1. Medical Diagnosis
2. Clinical Decision Support
3. Complex Genetic Models
4. Crime Risk Factors Analysis
5. Spatial Dynamics in Geography
6. Inference Problems in Forensic Science
7. Conservation of a Threatened Bird
8. Classifiers for Modelling of Mineral Potential
9. Student Modelling
10. Sensor Validation
11. An Information Retrieval System
12. Reliability Analysis of Systems
13. Terrorism Risk Management
14. Credit-Rating of Companies
15. Classification of Wines
16. Pavement and Bridge Management
17. Complex Industrial Process Operation
18. Probability of Default for Large Corporates
19. Risk Management in Robotics
LESSON #1
Do not underestimate what we can learn from
fallible humans and from the AI paradigm that
emulating humans is healthy and doable.
BEYOND EVIDENCE, BELIEF,
AND STATISTICS
Data → P (Joint Distribution) → Q(P) (Aspects of P):  Inference
e.g., Infer whether customers who bought product A
would also buy product B.
Q = P(B | A)
STATISTICS 1ST LIMITATION
INTERVENTION
Data → P (Joint Distribution) → Q(P) (Aspects of P):  Inference
e.g., Infer whether customers who bought product A
would buy product B if we double the price.
Q = P(B | A, do(price = 2p1))   Not an aspect of P.
STATISTICS 2ND LIMITATION
RETROSPECTION
Data → P (Joint Distribution) → Q(P) (Aspects of P):  Inference
e.g., Infer whether Joe, who bought product A,
would have bought A had we doubled the price.
Q = P(A_{p2} | A_{p1})   Not an aspect of P.
THE CAUSAL HIERARCHY
1. Associational (Statistical, Evidential)
e.g., What if I see X=x?
2. Interventional ( Experimental, Causal)
e.g., What if I do X=x?
3. Retrospectional (Counterfactual, token)
e.g., What if I hadn't done X=x?
No mixing:
No claim at layer i without assumptions
from layer i or higher.
THE STRUCTURAL MODEL
PARADIGM
M (Data Generating Model) → Joint Distribution → Data;  inference now targets Q(M) (Aspects of M).
M – Invariant strategy (mechanism, recipe, law,
protocol) by which Nature assigns values to
variables in the analysis.
• "Think Nature, not experiment!"
PHYSICS AND COUNTERFACTUALS, OR
WHY PHYSICS DESERVES A NEW ALGEBRA?
Scientific equations (e.g., Hooke's Law) are non-algebraic.
e.g., Length (Y) equals a constant (2) times the weight (X).
Correct notation: Y := 2X  (or Y ← 2X)
Process information: X = 1
The solution: Y = 2
Had X been 3, Y would be 6.
If we raise X to 3, Y would be 6.
Must "wipe out" X = 1.
FAMILIAR CAUSAL MODEL
ORACLE FOR COUNTERFACTUALS
[Figure: a familiar device with labeled INPUT and OUTPUT and internal variables X, Y, Z, serving as an oracle for counterfactual queries.]
THE FUNDAMENTAL THEOREM
OF CAUSAL INFERENCE
Causal Markov Theorem:
Any distribution generated by a Markovian structural model M
(recursive, with independent disturbances) can be factorized as
  P(v1, v2, ..., vn) = Π_i P(vi | pa_i)
where pa_i are the (values of the) parents of Vi in the causal
diagram associated with M.
Corollary (Truncated Factorization, Manipulation Theorem):
The distribution generated by an intervention do(X=x)
(in a Markovian model M) is given by the truncated factorization
  P(v1, v2, ..., vn | do(x)) = Π_{i: Vi ∉ X} P(vi | pa_i),  evaluated at X = x
THE EVOLUTION OF CAUSAL CALCULUS
• Haavelmo's surgery (1943): r_i = u_i + v_i + g_i
  Add an adjustable force (+ g_i).
• Strotz and Wold surgery (1960): "Wipe out" the equation
  r_i = u_i + v_i and replace it with r_i = constant.
• Graphical surgery (Spirtes et al., 1993; Pearl, 1993):
  Wipe out incoming arrows to r.
  [Diagram: u → r ← v, r → y]
  P(u, v, r, y) = P(u) P(v) P(r | u, v) P(y | r)
• do-calculus (Pearl, 1994): P(Y = y | do(r)), a new operator.
• Structural counterfactuals (Balke and Pearl, 1995):
  Yr(u) = Y(u) in the r-mutilated model.
• Unification with Neyman-Rubin Yx(u) and Lewis (1973).
HISTORICAL OBSERVATIONS
"Development of Western science is based on two
great achievements: the invention of the formal
logical system (in Euclidean geometry) by the
Greek philosophers, and the discovery of the
possibility to find out causal relationships by
systematic experiment (during the Renaissance)."
(Albert Einstein, 1953)
Inspired by Turing, I have tried to put the two
together and base causal inference on a formal
system that is reducible to algorithmic
implementation.
Mission largely accomplished – more to be done.