CAUSAL INFERENCE AS A MACHINE LEARNING EXERCISE


CAUSAL MODELING AND THE LOGIC OF SCIENCE
Judea Pearl
Computer Science and Statistics
UCLA
www.cs.ucla.edu/~judea/
OVERVIEW
Scope and Language in Scientific Theories
1. Statistical models
(observations, PL)
2. Causal models
2.1 Stochastic causal model
(interventions, PL + modality)
2.2 Functional causal models
(counterfactuals, PL + subjunctives)
3. General equational models
(explicit interventions, PL)
4. General Scientific theories
(objects-properties, FOL-SOL ...)
OUTLINE
• Modeling: Statistical vs. Causal
• Causal models and identifiability
• Inference to three types of claims:
1. Effects of potential interventions,
2. Claims about attribution (responsibility)
3. Claims about direct and indirect effects
• Falsifiability and Corroboration
TRADITIONAL STATISTICAL INFERENCE PARADIGM

Data → [inference] → P (joint distribution) → Q(P) (aspects of P)

e.g., infer whether customers who bought product A would also buy product B:

  Q = P(B | A)
THE CAUSAL INFERENCE PARADIGM

Data → [inference] → M (data-generating model) → Q(M) (aspects of M)

Some Q(M) cannot be inferred from P alone.
e.g., infer whether customers who bought product A would still buy A if we double the price.
FROM STATISTICAL TO CAUSAL ANALYSIS:
1. THE DIFFERENCES

Probability and statistics deal with static relations:

Data → [statistics] → joint distribution → [probability] → inferences from passive observations

Causal analysis deals with changes (dynamics), i.e., what remains invariant when P changes.
• P does not tell us how it ought to change
  e.g., curing symptoms vs. curing diseases
  e.g., analogy: mechanical deformation

Data + causal assumptions (+ experiments) → causal model, which answers:
1. Effects of interventions
2. Causes of effects
3. Explanations
FROM STATISTICAL TO CAUSAL ANALYSIS:
1. THE DIFFERENCES (CONT)

1. Causal and statistical concepts do not mix.
   CAUSAL: spurious correlation, randomization, confounding / effect, instrument, holding constant, explanatory variables
   STATISTICAL: regression, association / independence, "controlling for" / conditioning, odds and risk ratios, collapsibility

2. No causes in – no causes out (Cartwright, 1989):
   statistical assumptions + data + causal assumptions ⇒ causal conclusions

3. Causal assumptions cannot be expressed in the mathematical language of standard statistics.

4. Non-standard mathematics:
   a) Structural equation models (SEM)
   b) Counterfactuals (Neyman-Rubin)
   c) Causal diagrams (Wright, 1920)
WHAT'S IN A CAUSAL MODEL?

Oracle that assigns truth values to causal sentences:
• Action sentences: B if we do A.
• Counterfactuals: B would be different if A were true.
• Explanation: B occurred because of A.

Optional: with what probability?
FAMILIAR CAUSAL MODEL: ORACLE FOR MANIPULATION

[Diagram: a logic circuit mapping INPUT to OUTPUT through internal variables X, Y, Z]
CAUSAL MODELS AND CAUSAL DIAGRAMS

Definition: A causal model is a 3-tuple

  M = ⟨V, U, F⟩

with a mutilation operator do(x): M → M_x, where:
(i) V = {V_1, ..., V_n} endogenous variables,
(ii) U = {U_1, ..., U_m} background variables,
(iii) F = a set of n functions, f_i : V \ V_i ∪ U → V_i,

  v_i = f_i(pa_i, u_i),   PA_i ⊆ V \ V_i,   U_i ⊆ U.

Example (supply-demand equations):

  q = b_1 p + d_1 i + u_1
  p = b_2 q + d_2 w + u_2

[Diagram: I → Q ← U_1 and W → P ← U_2, with Q and P mutually dependent; PA_Q = {I, P}]
CAUSAL MODELS AND MUTILATION

Definition (cont.): For M = ⟨V, U, F⟩ as above,
(iv) M_x = ⟨U, V, F_x⟩, X ⊆ V, x a realization of X,
     where F_x = {f_i : V_i ∉ X} ∪ {X = x}.
(Replace all functions f_i corresponding to X with the constant functions X = x.)

Example: the intervention do(P = p_0) yields the mutilated model M_{p_0}:

  q = b_1 p + d_1 i + u_1
  p = p_0

[Diagram: in M_{p_0} the arrows from W and U_2 into P are removed and P is held at p_0]
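To make the definition concrete, here is a minimal Python sketch of M = ⟨V, U, F⟩ and the mutilation operator do(x), built around the supply-demand example above. The coefficient values, the noise values, and the fixed-point solver are illustrative assumptions, not part of the talk.

```python
# Minimal sketch of a causal model M = <V, U, F> and the mutilation
# operator do(x). Coefficients and u-values are assumed for illustration.

def make_F(b1=0.5, d1=1.0, b2=-0.3, d2=1.0):
    """F: one function per endogenous variable, v_i = f_i(pa_i, u_i)."""
    return {
        "i": lambda v, u: u["ui"],     # driven directly by background U
        "w": lambda v, u: u["uw"],
        "q": lambda v, u: b1 * v["p"] + d1 * v["i"] + u["u1"],
        "p": lambda v, u: b2 * v["q"] + d2 * v["w"] + u["u2"],
    }

def do(F, x):
    """Mutilation: replace the equation of each X in x by the constant X = x."""
    Fx = dict(F)
    for var, val in x.items():
        Fx[var] = lambda v, u, val=val: val
    return Fx

def solve(F, u, iters=200):
    """Fixed-point (Jacobi) iteration; converges here since |b1*b2| < 1."""
    v = {name: 0.0 for name in F}
    for _ in range(iters):
        v = {name: f(v, u) for name, f in F.items()}
    return v

F = make_F()
u = {"u1": 0.1, "u2": -0.2, "ui": 1.0, "uw": 2.0}
print(solve(F, u))                   # solution of M for this unit u
print(solve(do(F, {"p": 1.0}), u))   # solution of the mutilated model M_{p=1.0}
```

Applying do({"p": 1.0}) reproduces M_{p_0}: the equation for p is replaced by a constant while every other mechanism stays intact.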
PROBABILISTIC CAUSAL MODELS

Definition (Probabilistic Causal Model): a pair

  ⟨M, P(u)⟩,

where M is a causal model as defined above and P(u) is a probability assignment to the variables in U.
CAUSAL MODELS AND COUNTERFACTUALS

Definition (Potential Response): The sentence "Y would be y (in unit u), had X been x," denoted Y_x(u) = y, is the solution for Y in the mutilated model M_x, with the equations for X replaced by X = x. ("unit-based potential outcome")

Joint probabilities of counterfactuals:

  P(Y_x = y, Z_w = z) = Σ_{u: Y_x(u) = y, Z_w(u) = z} P(u)

In particular:

  P(y | do(x)) = P(Y_x = y) = Σ_{u: Y_x(u) = y} P(u)

  PN = P(Y_{x'} = y' | x, y) = Σ_{u: Y_{x'}(u) = y'} P(u | x, y)
3-STEPS TO COMPUTING COUNTERFACTUALS

S5. If the prisoner is dead, he would still be dead if A were not to have shot: D ⇒ D_{¬A}.

[Diagram: firing-squad model U (court order) → C (captain) → A, B (riflemen) → D (death), shown at each of the three steps]

Step 1 (Abduction): from the evidence D = TRUE, infer U = TRUE.
Step 2 (Action): mutilate the model, setting A = FALSE.
Step 3 (Prediction): solve the mutilated model: C = TRUE, hence B = TRUE and D = TRUE.
COMPUTING PROBABILITIES OF COUNTERFACTUALS

P(S5). The prisoner is dead. How likely is it that he would be dead if A were not to have shot? P(D_{¬A} | D) = ?

[Diagram: the same firing-squad model at each of the three steps]

Step 1 (Abduction): update the prior P(u) to the posterior P(u | D).
Step 2 (Action): mutilate the model, setting A = FALSE.
Step 3 (Prediction): compute the probability of D in the mutilated model under P(u | D), yielding P(D_{¬A} | D).
CAUSAL INFERENCE
MADE EASY (1985-2000)
1. Inference with Nonparametric Structural Equations
made possible through Graphical Analysis.
2. Mathematical underpinning of counterfactuals
through nonparametric structural equations
3. Graphical-Counterfactuals symbiosis
IDENTIFIABILITY

Definition: Let Q(M) be any quantity defined on a causal model M, and let A be a set of assumptions. Q is identifiable relative to A iff

  P(M_1) = P(M_2) ⇒ Q(M_1) = Q(M_2)

for all M_1, M_2 that satisfy A.

In other words, Q can be determined uniquely from the probability distribution P(v) of the endogenous variables, V, and assumptions A.

In this talk:
A:   assumptions encoded in the diagram
Q_1: P(y | do(x))            Causal effect (= P(Y_x = y))
Q_2: P(Y_{x'} = y' | x, y)   Probability of necessity
Q_3: E(Y_{x Z_{x'}})         Direct effect
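To see the definition in action, the Python sketch below constructs two models (all numbers invented) that generate the same P(x, y) yet disagree on Q = P(y | do(x)): Q is therefore not identifiable relative to a set A that admits an unobserved confounder of X and Y.

```python
# Non-identifiability sketch: same P(v), different Q(M).
# All parameter values below are assumed for illustration.

from itertools import product

def analyze(p_x1_given_u, p_y1_given_xu, p_u1=0.5):
    """Model U -> X -> Y with U -> Y. Returns P(x, y) and P(y=1 | do(x=1))."""
    pxy = {}
    for u, x, y in product([0, 1], repeat=3):
        p = p_u1 if u else 1 - p_u1
        p *= p_x1_given_u[u] if x else 1 - p_x1_given_u[u]
        p *= p_y1_given_xu[x, u] if y else 1 - p_y1_given_xu[x, u]
        pxy[x, y] = pxy.get((x, y), 0.0) + p
    # do(x=1) deletes U -> X, so U keeps its prior distribution:
    q = (1 - p_u1) * p_y1_given_xu[1, 0] + p_u1 * p_y1_given_xu[1, 1]
    return pxy, q

# M1: X genuinely causes Y; U is irrelevant to Y.
m1 = analyze({0: 0.5, 1: 0.5},
             {(0, 0): 0.2, (0, 1): 0.2, (1, 0): 0.8, (1, 1): 0.8})
# M2: X = U and Y listens only to U (pure confounding, no causation).
m2 = analyze({0: 0.0, 1: 1.0},
             {(0, 0): 0.2, (0, 1): 0.8, (1, 0): 0.2, (1, 1): 0.8})

print(m1)   # P(x, y) identical to m2, but Q = 0.8
print(m2)   # same P(x, y), yet Q = 0.5
```

An assumption set A that excludes the confounding arc would rule out M2 and restore identifiability.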
THE FUNDAMENTAL THEOREM OF CAUSAL INFERENCE

Causal Markov Theorem: Any distribution generated by a Markovian structural model M (recursive, with independent disturbances) can be factorized as

  P(v_1, v_2, ..., v_n) = Π_i P(v_i | pa_i),

where pa_i are the (values of the) parents of V_i in the causal diagram associated with M.

Corollary (Truncated Factorization, Manipulation Theorem): The distribution generated by an intervention do(X = x) in a Markovian model M is given by the truncated factorization

  P(v_1, v_2, ..., v_n | do(x)) = Π_{i: V_i ∉ X} P(v_i | pa_i), evaluated at X = x.
RAMIFICATIONS OF THE FUNDAMENTAL THEOREM

Given P(x, y, z), should we ban smoking?

[Diagram: U (unobserved) → X (smoking), U → Y (cancer); X → Z (tar in lungs) → Y. In the post-intervention graph, X is held at x.]

Pre-intervention:   P(x, y, z) = Σ_u P(u) P(x | u) P(z | x) P(y | z, u)
Post-intervention:  P(y, z | do(x)) = Σ_u P(u) P(z | x) P(y | z, u)

To compute P(y, z | do(x)), we must eliminate u (a graphical problem).
THE BACK-DOOR CRITERION

Graphical test of identification: P(y | do(x)) is identifiable in G if there is a set Z of variables such that Z d-separates X from Y in G_X̲, the subgraph of G with all arrows emanating from X removed.

[Diagram: a graph G over X, Y and covariates Z_1, ..., Z_6, next to the subgraph G_X̲, illustrating a set Z that blocks every back-door path from X to Y]

Moreover,

  P(y | do(x)) = Σ_z P(y | x, z) P(z)     ("adjusting" for Z)
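A simulation sketch of the adjustment formula follows. The data-generating model (Z confounds X and Y) and every number in it are assumptions chosen so the true effect of X on Y is 0.3; adjustment recovers it while naive conditioning does not.

```python
# Back-door adjustment P(y|do(x)) = sum_z P(y|x,z) P(z) on simulated data.
# The generating model and its parameters are assumed for illustration.

import random
random.seed(0)

data = []
for _ in range(100_000):
    z = random.random() < 0.5                        # confounder
    x = random.random() < (0.8 if z else 0.2)        # Z -> X
    y = random.random() < 0.1 + 0.3 * x + 0.4 * z    # X -> Y <- Z
    data.append((x, y, z))

def p_y_given(x, z):
    ys = [yy for xx, yy, zz in data if xx == x and zz == z]
    return sum(ys) / len(ys)

p_z1 = sum(zz for _, _, zz in data) / len(data)

def p_y_do(x):   # adjust for Z
    return p_y_given(x, 0) * (1 - p_z1) + p_y_given(x, 1) * p_z1

print("adjusted ACE:", p_y_do(1) - p_y_do(0))        # ~0.30, the true effect

naive = (sum(yy for xx, yy, _ in data if xx) / sum(xx for xx, yy, _ in data)
         - sum(yy for xx, yy, _ in data if not xx)
           / sum(not xx for xx, yy, _ in data))
print("naive difference:", naive)                    # ~0.54: back-door bias
```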
RULES OF CAUSAL CALCULUS

Rule 1 (Ignoring observations):
  P(y | do{x}, z, w) = P(y | do{x}, w)          if (Y ⊥⊥ Z | X, W) in G_X̄
Rule 2 (Action/observation exchange):
  P(y | do{x}, do{z}, w) = P(y | do{x}, z, w)   if (Y ⊥⊥ Z | X, W) in G_X̄Z̲
Rule 3 (Ignoring actions):
  P(y | do{x}, do{z}, w) = P(y | do{x}, w)      if (Y ⊥⊥ Z | X, W) in G_X̄Z̄(W)

Here G_X̄ deletes all arrows entering X, G_Z̲ deletes all arrows leaving Z, and Z(W) is the set of Z-nodes that are not ancestors of any W-node in G_X̄.
DERIVATION IN CAUSAL CALCULUS

[Diagram: Smoking → Tar → Cancer, with an unobserved Genotype influencing both Smoking and Cancer]

P(c | do{s})
  = Σ_t P(c | do{s}, t) P(t | do{s})                     Probability axioms
  = Σ_t P(c | do{s}, do{t}) P(t | do{s})                 Rule 2
  = Σ_t P(c | do{s}, do{t}) P(t | s)                     Rule 2
  = Σ_t P(c | do{t}) P(t | s)                            Rule 3
  = Σ_{s'} Σ_t P(c | do{t}, s') P(s' | do{t}) P(t | s)   Probability axioms
  = Σ_{s'} Σ_t P(c | t, s') P(s' | do{t}) P(t | s)       Rule 2
  = Σ_{s'} Σ_t P(c | t, s') P(s') P(t | s)               Rule 3
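The bottom line of this derivation (the front-door formula) can be checked numerically. In the Python sketch below, an assumed fully specified model with unobserved u generates the observational joint P(s, t, c); the derived expression, computed from observational quantities only, reproduces the ground-truth P(c = 1 | do(s = 1)) obtained by truncated factorization. All parameter values are illustrative.

```python
# Front-door check: sum_t P(t|s) sum_{s'} P(c|t,s') P(s') vs ground truth.
# The model parameters are assumed for illustration; u is unobserved.

from itertools import product

P_u1 = 0.5                                  # P(u = 1), genotype
P_s1_u = [0.2, 0.7]                         # P(s = 1 | u)
P_t1_s = [0.1, 0.8]                         # P(t = 1 | s)
P_c1_tu = {(0, 0): 0.1, (0, 1): 0.4,        # P(c = 1 | t, u)
           (1, 0): 0.5, (1, 1): 0.8}

def bern(bit, p1):
    return p1 if bit else 1 - p1

# Observational joint: P(s, t, c) = sum_u P(u) P(s|u) P(t|s) P(c|t,u)
joint = {}
for u, s, t, c in product([0, 1], repeat=4):
    pr = (bern(u, P_u1) * bern(s, P_s1_u[u]) *
          bern(t, P_t1_s[s]) * bern(c, P_c1_tu[t, u]))
    joint[s, t, c] = joint.get((s, t, c), 0.0) + pr

def P(**kw):   # marginal of the observational joint
    return sum(pr for (s, t, c), pr in joint.items()
               if all({"s": s, "t": t, "c": c}[k] == v for k, v in kw.items()))

# Front-door estimand for P(c = 1 | do(s = 1)), observational terms only:
front_door = sum(
    P(t=t, s=1) / P(s=1) *                                  # P(t | s = 1)
    sum(P(c=1, t=t, s=s2) / P(t=t, s=s2) * P(s=s2)          # P(c|t,s') P(s')
        for s2 in (0, 1))
    for t in (0, 1))

# Ground truth from the full model via truncated factorization:
truth = sum(bern(u, P_u1) * bern(t, P_t1_s[1]) * P_c1_tu[t, u]
            for u, t in product([0, 1], repeat=2))
print(front_door, truth)   # should agree up to float rounding
```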
OUTLINE
• Modeling: Statistical vs. Causal
• Causal models and identifiability
• Inference to three types of claims:
1. Effects of potential interventions,
2. Claims about attribution (responsibility)
3. Claims about direct and indirect effects
DETERMINING THE CAUSES OF EFFECTS
(The Attribution Problem)

"Your Honor! My client (Mr. A) died BECAUSE he used that drug."

Court to decide if it is MORE PROBABLE THAN NOT that A would be alive BUT FOR the drug:

  P(? | A is dead, took the drug) > 0.50
THE PROBLEM

Theoretical problems:

1. What is the meaning of PN(x, y): "the probability that event y would not have occurred if it were not for event x, given that x and y did in fact occur"?

Answer:

  PN(x, y) = P(Y_{x'} = y' | x, y)
           = P(Y_{x'} = y', X = x, Y = y) / P(X = x, Y = y)

2. Under what conditions can PN(x, y) be learned from statistical data, i.e., observational, experimental, and combined?
WHAT IS INFERABLE FROM EXPERIMENTS?

Simple experiment:      Q = P(Y_x = y | z), Z nondescendants of X.
Compound experiment:    Q = P(Y_{X(z)} = y | z)
Multi-stage experiment: etc.
CAN FREQUENCY DATA DECIDE LEGAL RESPONSIBILITY?

                 Experimental       Nonexperimental
                 do(x)   do(x')       x      x'
  Deaths (y)        16       14       2      28
  Survivals (y')   984      986     998     972
                 1,000    1,000   1,000   1,000

• Nonexperimental data: drug usage predicts longer life.
• Experimental data: drug has negligible effect on survival.
• Plaintiff: Mr. A is special.
  1. He actually died.
  2. He used the drug by choice.
• Court to decide (given both data): Is it more probable than not that A would be alive but for the drug?

  PN = P(Y_{x'} = y' | x, y) > 0.50?
TYPICAL THEOREMS
(Tian and Pearl, 2000)

• Bounds given combined nonexperimental and experimental data:

  max{ 0, [P(y) − P(y_{x'})] / P(x, y) } ≤ PN ≤ min{ 1, [P(y'_{x'}) − P(x, y')] / P(x, y) }

• Identifiability under monotonicity (combined data):

  PN = [P(y | x) − P(y | x')] / P(y | x) + [P(y | x') − P(y_{x'})] / P(x, y)

  (corrected excess risk ratio)
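Plugging the frequency table from the previous slide into these bounds takes a few lines of Python (a sketch; it only transcribes the counts):

```python
# Tian-Pearl bounds on PN, evaluated on the drug example above.

# Experimental arms (1,000 subjects each):
p_y_do_x = 16 / 1000       # P(y_x):  death rate under do(x)
p_y_do_x2 = 14 / 1000      # P(y_x'): death rate under do(x')

# Nonexperimental sample (2,000 subjects total):
p_xy = 2 / 2000            # P(x, y):  chose drug, died
p_y = 30 / 2000            # P(y):     overall death rate
p_xy2 = 998 / 2000         # P(x, y'): chose drug, survived

lower = max(0.0, (p_y - p_y_do_x2) / p_xy)
upper = min(1.0, ((1 - p_y_do_x2) - p_xy2) / p_xy)
print(round(lower, 6), round(upper, 6))   # 1.0 1.0 -> PN is identified: PN = 1
```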
SOLUTION TO THE ATTRIBUTION PROBLEM (CONT)

• WITH PROBABILITY ONE: PN = P(y'_{x'} | x, y) = 1
• From population data to the individual case.
• Combined data tell more than each study alone.
OUTLINE
• Modeling: Statistical vs. Causal
• Causal models and identifiability
• Inference to three types of claims:
1. Effects of potential interventions,
2. Claims about attribution (responsibility)
3. Claims about direct and indirect effects
QUESTIONS ADDRESSED
• What is the semantics of direct and
indirect effects?
• Can we estimate them from data?
Experimental data?
TOTAL, DIRECT, AND INDIRECT EFFECTS HAVE SIMPLE SEMANTICS IN LINEAR MODELS

[Diagram: X → Z with coefficient b, Z → Y with coefficient c, X → Y with coefficient a]

  z = b x + ε_1
  y = a x + c z + ε_2

  TE = (∂/∂x) E(Y | do(x)) = a + bc
  DE = (∂/∂x) E(Y | do(x), do(z)) = a
  IE = TE − DE = bc     (Z-independent)
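These three identities are easy to confirm by simulation; the sketch below uses assumed coefficients a = 1.0, b = 2.0, c = 0.5 and standard-normal disturbances, and recovers TE ≈ a + bc and DE ≈ a by direct intervention.

```python
# Monte Carlo check of TE = a + b*c and DE = a in the linear model.
# Coefficients and noise distributions are assumed for illustration.

import random
random.seed(1)

a, b, c = 1.0, 2.0, 0.5
N = 100_000

def sample_y(do_x, do_z=None):
    z = b * do_x + random.gauss(0, 1) if do_z is None else do_z
    return a * do_x + c * z + random.gauss(0, 1)

def E(draw):
    return sum(draw() for _ in range(N)) / N

TE = E(lambda: sample_y(1.0)) - E(lambda: sample_y(0.0))
DE = E(lambda: sample_y(1.0, do_z=0.0)) - E(lambda: sample_y(0.0, do_z=0.0))
print(TE, DE, TE - DE)   # ~2.0 (= a + b*c), ~1.0 (= a), ~1.0 (= b*c)
```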
SEMANTICS BECOMES NONTRIVIAL IN NONLINEAR MODELS
(even when the model is completely specified)

[Diagram: X → Z → Y, X → Y]

  z = f(x, ε_1)
  y = g(x, z, ε_2)

  TE = (∂/∂x) E(Y | do(x))
  DE = (∂/∂x) E(Y | do(x), do(z))   Dependent on z?
  IE = ????                         Void of operational meaning?
THE OPERATIONAL MEANING OF DIRECT EFFECTS

[Diagram: X → Z → Y, X → Y]

  z = f(x, ε_1)
  y = g(x, z, ε_2)

"Natural" direct effect of X on Y: the expected change in Y per unit change of X, when we keep Z constant at whatever value it attains before the change:

  NDE = E[Y_{x_1, Z_{x_0}} − Y_{x_0}]

In linear models, NDE = controlled direct effect.
POLICY IMPLICATIONS
(Who cares?)

What is the direct effect of X on Y? E.g., the effect of Gender on Hiring if sex discrimination is eliminated.

[Diagram: X (GENDER) → Z (QUALIFICATION) → Y (HIRING), plus a direct arrow X → Y; the indirect path through Z is to be ignored]
THE OPERATIONAL MEANING OF INDIRECT EFFECTS

[Diagram: X → Z → Y, X → Y]

  z = f(x, ε_1)
  y = g(x, z, ε_2)

"Natural" indirect effect of X on Y: the expected change in Y when we keep X constant, say at x_0, and let Z change to whatever value it would have under a unit change in X:

  NIE = E[Y_{x_0, Z_{x_1}} − Y_{x_0}]

In linear models, NIE = TE − DE.
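Once f, g, and the noise distributions are fully specified, both natural effects can be estimated by simulating each unit u under the nested counterfactuals. The model below is an invented nonlinear example; it illustrates that, unlike the linear case, NDE + NIE need not equal TE.

```python
# Natural direct/indirect effects by unit-level simulation.
# f, g and the noise distributions are assumed illustrative choices.

import random
random.seed(2)

def f(x, e1):                 # mediator equation z = f(x, e1)
    return x + e1

def g(x, z, e2):              # outcome equation y = g(x, z, e2), nonlinear
    return x * z + x + 0.5 * z + e2

x0, x1, N = 0.0, 1.0, 100_000
nde = nie = te = 0.0
for _ in range(N):
    e1, e2 = random.gauss(0, 1), random.gauss(0, 1)   # one unit u
    z_x0, z_x1 = f(x0, e1), f(x1, e1)
    y_x0 = g(x0, z_x0, e2)
    nde += g(x1, z_x0, e2) - y_x0   # move X, hold Z at its x0 value
    nie += g(x0, z_x1, e2) - y_x0   # hold X, let Z respond as if X were x1
    te += g(x1, z_x1, e2) - y_x0
print(nde / N, nie / N, te / N)     # ~1.0, ~0.5, ~2.5: NDE + NIE != TE
```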
LEGAL DEFINITIONS TAKE THE NATURAL CONCEPTION
(FORMALIZING DISCRIMINATION)

"The central question in any employment-discrimination case is whether the employer would have taken the same action had the employee been of a different race (age, sex, religion, national origin, etc.) and everything else had been the same."
[Carson versus Bethlehem Steel Corp. (70 FEP Cases 921, 7th Cir. (1996))]

x = male, x' = female
y = hire, y' = not hire
z = applicant's qualifications

NO DIRECT EFFECT: Y_{x, Z_{x'}} = Y_{x'}, for all x, x'.
SEMANTICS AND IDENTIFICATION OF NESTED COUNTERFACTUALS

Consider the quantity

  Q = E_u[ Y_{x, Z_{x*}(u)}(u) ].

Given M and P(u), Q is well defined: given u, Z_{x*}(u) is the solution for Z in M_{x*}, call it z, and Y_{x, Z_{x*}(u)}(u) is the solution for Y in M_{xz}.

Can Q be estimated from experimental / nonexperimental data?
ANSWERS TO QUESTIONS

• Graphical conditions for estimability from experimental / nonexperimental data.
• The graphical conditions hold in Markovian models.
• Useful in answering new types of policy questions involving mechanism blocking instead of variable fixing.
THE OVERRIDING THEME

1. Define Q(M) as a counterfactual expression.
2. Determine conditions for the reduction

     Q(M) → P_exp(M)   or   Q(M) → P(M).

3. If the reduction is feasible, Q is inferable.

• Demonstrated on three types of queries:
  Q_1: P(y | do(x))            Causal effect (= P(Y_x = y))
  Q_2: P(Y_{x'} = y' | x, y)   Probability of necessity
  Q_3: E(Y_{x Z_{x'}})         Direct effect
FALSIFIABILITY AND CORROBORATION

[Diagram: the space P* of all distributions, with the subset P*(M) carved out by the constraints implied by M, and a data point D]

Falsifiability: P*(M) ⊂ P*.
Data D corroborate model M if M is (i) falsifiable and (ii) compatible with D.

Types of constraints:
1. conditional independencies
2. inequalities (for restricted domains)
3. functional, e.g.,

     Σ_x P(x | w) P(z | w, x, y) = f(y, z)

[Diagram: a four-node chain w → x → y → z with an unobserved common cause, for which the constraint above holds]
OTHER TESTABLE CLAIMS

Changes under interventions.

For all causal models:
  P(Y_{xz} = y) = P(Y_z = y | do(x)),   ∀ x, y, z

For all semi-Markovian models:
  P(Y_{xz} = y) ≥ P(X_z = x, Y_z = y)
  P(Y_x = y, Z_x = z) ≥ P(x, y, z)

For Markovian models (and X ∪ Y ∪ Z = V):
  P(Y_z = y, X_z = x) = P(X_{yz} = x) P(Y_{xz} = y)

For a given Markovian model:
  P(Y_{v \ y} = y) = P(y | pa_Y)
FROM CORROBORATING MODELS TO CORROBORATING CLAIMS

A corroborated model can imply identifiable yet uncorroborated claims.

[Diagram: two example models over x, y, z with labeled edge coefficients a and b]

Some claims can be more corroborated than others.

Definition: An identifiable claim C is corroborated by data if some minimal set of assumptions in M sufficient for identifying C is corroborated by the data.

Graphical criterion: minimal submodel = maximal supergraph.
OVERVIEW
Scope and Language in Scientific Theories
1. Statistical models
(observations, PL)
2. Causal models
2.1 Stochastic causal model
(interventions, PL + modality)
2.2 Functional causal models
(counterfactuals, PL + subjunctives)
3. General equational models
(explicit interventions, PL)
4. General Scientific theories
(objects-properties, FOL-SOL ...)