An intuitive introduction to path analysis and d


Bill Shipley,
département de biologie
Université de Sherbrooke
Sherbrooke (Qc) Canada
[Figure: Number of murders against Number of churches, both axes on log scales, with the fitted line Ln(murders)=0.009+0.99*Ln(churches)]
New causal context…
[Figure: Pop size → Number of churches per city; Pop size → Number of murders per year per city]
Passive prediction ONLY if the underlying causal processes are constant
2-D Shadow (what the audience sees); 3-D Object (hidden from view)
“2-D” correlational shadow (what the scientist sees); “3-D” causal process (hidden from view)
[Figure: causal process over A, B, C, D, E and its correlational shadow]
- B & C correlated, but independent given A
- A & D correlated, but independent given B & C
- And so on….
R.A. Fisher
Statistical Methods for Research Workers (1925)
15 plots with treatment (+fertilizer & water)
15 plots without treatment (+water)
Treatment: 80 g ± 6
Control: 55 g ± 6
T-test: p<0.0001
[Figure: Nitrogen fertilizer → ? → Crop growth, with the treatment assigned by random numbers]
Experimental (observational) unit...
- the UNIT to which the treatment is applied
variable 1, variable 2 … variable n
N fertilizer
N, P, K...
Worms….
No causal inferences between variables within the experimental unit
THE PLANT
[Figure: Nitrogen fertilizer, Nitrogen absorption, Photosynthetic enzymes, Carbon fixation, Seed yield]
[Figure: Scenario 1, Scenario 2 and Scenario 3, three alternative causal diagrams relating Fertilizer addition, Nitrogen absorption and Photosynthetic enzymes]
La méthode expérimentale (The experimental method)
Claude Bernard
1813 - 1878
[Figure, shown three times, once marked with an X: Color of blood in renal vein before entering the kidney; Active/inactive state of the kidney; Color of blood in the renal vein upon exiting the kidney]
1. Hypothesize a causal structure.
[Figure: alternative causal structures over A, B, C]
2. Measure the correlations between the variables in their natural state.
3. Predict how these correlations will change if various physical manipulations hold constant different variables.
4. Compare the new correlations after controlling the variables to the predictions assuming the causal structure.
5. If any of the predicted changes in the correlational structure disagree with the observed changes, then reject the causal structure.
Variables: Body size in autumn, sex, Survival to spring
Causal hypothesis 1 and Causal hypothesis 2
[Figure: two alternative causal diagrams relating sex, Body size in autumn and Survival to spring, each with Other causes]
[Figure: Quantity and quality of summer forage, Body weight in the autumn, Probability of survival until spring]
[Figure: joint density surface Z over X and Y, with Z = f(X,Y) = f(X)f(Y)]
[Figure: Survival (%) and Amount of forage (kg), each plotted against Body mass (kg)]
[Figure: Residuals (“residuals of Y given X”): Survival for a constant body mass plotted against Forage quality for constant body mass]
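The "residuals of Y given X" idea in the figure above can be sketched numerically. The following is a minimal illustration, assuming a linear chain forage → body mass → survival with made-up coefficients (0.8) and unit-variance normal noise; none of these numbers come from the talk, and survival is treated as a continuous score purely for simplicity.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def ols_residuals(x, y):
    """Residuals of y after a least-squares straight-line fit on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return [b - my - slope * (a - mx) for a, b in zip(x, y)]

rng = random.Random(1)
n = 5000
# Hypothetical chain: forage -> body mass -> survival (coefficients are assumptions)
forage = [rng.gauss(0, 1) for _ in range(n)]
mass = [0.8 * f + rng.gauss(0, 1) for f in forage]
survival = [0.8 * m + rng.gauss(0, 1) for m in mass]

# Raw correlation: forage and survival are clearly correlated
r_raw = pearson(forage, survival)
# Correlation of the residuals: forage and survival "for a constant body mass"
r_given_mass = pearson(ols_residuals(mass, forage), ols_residuals(mass, survival))
print(f"r(forage, survival) = {r_raw:.3f}")
print(f"r(forage, survival | body mass) = {r_given_mass:.3f}")
```

Under this assumed chain the raw correlation is clearly positive while the residual correlation is close to zero, which is the pattern the residual plot is meant to reveal.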
“3-D” causal process (over A, B, C, D, E) and its “2-D” correlational shadow:
- B & C independent given A
- A & D independent given B & C
- B & E independent given D
- and so on...
Hypothesis testing
Hypothesis generation
[Figure: two joint density surfaces Z over X and Y, panels A and B]
Z = f(X,Y) = f(X)f(Y)
P(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} e^{-\frac{1}{2} [x - \mu]^{T} \Sigma^{-1} [x - \mu]}
Z = f(X,Y) ≠ f(X)f(Y)
The dangers of mistranslation between languages...
French “demande” vs. English “demand”
Probability distributions
• Deal only in information content conditional on other information
• NOT causal relationships.
• There is no notion of a causal (asymmetric) relationship in probability theory
• Consistently mistranslates “X-->Y” as “Y=f(X)”
Bill Gates worth 1,000,000,000$
(machine translation into another language)
(machine translation back into English)
Payment request for doors in the fence worth 1,000,000,000$
[Figure: Rain → Mud ← Other causes of mud]
Mud (cm) = 0.1 Rain (cm) + N(0, 0.1)
Rain (cm) = 10 Mud (cm) + N(0, 1)
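The asymmetry hidden by the two equations above can be made concrete with a small simulation. This is only a sketch: it reads the 0.1 in N(0, 0.1) as a standard deviation and draws rain uniformly between 0 and 10 cm, both of which are assumptions, and the `simulate` helper with its `do_rain` / `do_mud` arguments is a hypothetical name introduced for illustration.

```python
import random

def simulate(rng, do_rain=None, do_mud=None):
    """One draw from the structural model Rain -> Mud.

    do_rain / do_mud force a variable to a value by manipulation
    instead of letting its causes generate it.
    """
    rain = rng.uniform(0, 10) if do_rain is None else do_rain  # assumed rain distribution
    if do_mud is None:
        mud = 0.1 * rain + rng.gauss(0, 0.1)  # Mud (cm) = 0.1 Rain (cm) + noise
    else:
        mud = do_mud                          # e.g. mud added or removed by hand
    return rain, mud

def mean(values):
    return sum(values) / len(values)

rng = random.Random(0)
n = 20000

# Manipulating rain changes mud by about 0.1 cm per cm of rain.
mud_no_rain = mean([simulate(rng, do_rain=0.0)[1] for _ in range(n)])
mud_5cm_rain = mean([simulate(rng, do_rain=5.0)[1] for _ in range(n)])
effect_of_rain = mud_5cm_rain - mud_no_rain

# Manipulating mud does not change rain, although the rearranged
# equation Rain = 10 Mud would "predict" 10 cm more rain per cm of mud.
rain_mud_0 = mean([simulate(rng, do_mud=0.0)[0] for _ in range(n)])
rain_mud_1 = mean([simulate(rng, do_mud=1.0)[0] for _ in range(n)])
effect_of_mud = rain_mud_1 - rain_mud_0

print(f"change in mud after forcing 5 cm of rain: {effect_of_rain:.3f} cm")
print(f"change in rain after forcing 1 cm of mud: {effect_of_mud:.3f} cm")
```

Both equations describe the same passive association, but only the first survives a manipulation, which is why "X-->Y" cannot be translated as a symmetric "Y=f(X)".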
1. Express causal claims using graph theory (directed acyclic graphs - DAGs).
Property: asymmetric relationships
[Figure: DAG A → B → C]
2. Apply a graph-theoretic operator (d-separation) on this graph.
A_||_C|B (A is d-separated from C given B in the graph)
3. If two vertices (X,Y) in this DAG are d-separated given a set Q of other vertices, then variables X and Y are probabilistically independent given the set Q of conditioning variables in ANY multivariate probability distribution generated by the DAG.
4. There always exists a basis set B of d-separation claims for the DAG that together completely specify the joint probability distribution over the variables represented by the DAG.
B={A_||_C|B..} implies P(X,Y,Z)
5. Test the predicted and observed independence claims implied by the graphical model.
- if there are significant differences, reject the causal model;
- if there aren’t significant differences, tentatively accept the causal model (and continue testing…)
6. Now, translate the graphical model into prediction equations.
[Figure: DAG A → B → C]
A=e1
B=f(A) + e2
C=f(B) + e3
7. The independence claims in the DAG are local; therefore, to change the causal structure, simply re-write the DAG and then go back to step 6.
[Figure: re-written DAG, A with no arrow to B; B → C]
A=e1
B=e2
C=f(B) + e3
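Steps 6 and 7 can be sketched as a simulation. The example below assumes linear functions f with a slope of 0.7 and standard normal errors e1, e2, e3; these choices, and the `simulate` helper itself, are illustrative assumptions rather than anything specified in the talk.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def simulate(n, a_causes_b, rng):
    """Generate n observations of (A, B, C) from the prediction equations."""
    rows = []
    for _ in range(n):
        a = rng.gauss(0, 1)                       # A = e1
        f_a = 0.7 * a if a_causes_b else 0.0      # f(A); dropped in the re-written DAG
        b = f_a + rng.gauss(0, 1)                 # B = f(A) + e2   (or B = e2)
        c = 0.7 * b + rng.gauss(0, 1)             # C = f(B) + e3   (unchanged)
        rows.append((a, b, c))
    return list(zip(*rows))                       # three columns: A, B, C

rng = random.Random(7)
a1, b1, c1 = simulate(5000, a_causes_b=True, rng=rng)    # original DAG
a2, b2, c2 = simulate(5000, a_causes_b=False, rng=rng)   # re-written DAG

r_ab_original = pearson(a1, b1)
r_ab_rewritten = pearson(a2, b2)
r_bc_rewritten = pearson(b2, c2)
print(f"r(A,B) original DAG:   {r_ab_original:.3f}")
print(f"r(A,B) re-written DAG: {r_ab_rewritten:.3f}")
print(f"r(B,C) re-written DAG: {r_bc_rewritten:.3f}")
```

Only the equation for B had to change when the DAG was re-written; the equation for C is reused as is, which is what "local" means in step 7.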
A few definitions...
[Figure: DAG over A, B, C, D, E]
If you can follow the arrows from i to j then there is a directed path from i to j.
Directed path from:
- A to C (NOT from A to E)
- E to C (NOT from E to A)
If you can go from i to j while ignoring the direction of the arrows then there is an undirected path from i to j.
Undirected path from:
- A to E
- E to A
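The two definitions can be checked mechanically. The sketch below uses a hypothetical graph, A → B → C ← D ← E, chosen only because it reproduces the examples listed above; the actual graph on the slide is not recoverable from the text.

```python
# Hypothetical DAG consistent with the examples: A -> B -> C <- D <- E
EDGES = [("A", "B"), ("B", "C"), ("E", "D"), ("D", "C")]

def has_directed_path(start, goal, edges):
    """True if goal can be reached from start by following the arrows."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        for tail, head in edges:
            if tail == node and head not in seen:
                if head == goal:
                    return True
                seen.add(head)
                stack.append(head)
    return False

def has_undirected_path(start, goal, edges):
    """True if goal can be reached from start while ignoring arrow direction."""
    both_ways = edges + [(head, tail) for tail, head in edges]
    return has_directed_path(start, goal, both_ways)

print(has_directed_path("A", "C", EDGES))    # directed path from A to C
print(has_directed_path("A", "E", EDGES))    # NOT from A to E
print(has_directed_path("E", "C", EDGES))    # directed path from E to C
print(has_directed_path("E", "A", EDGES))    # NOT from E to A
print(has_undirected_path("A", "E", EDGES))  # undirected path from A to E
```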
A few definitions...
[Figure: DAG over A, B, C, D, E with vertices labelled: Non-collider vertex; Unshielded collider vertex; Shielded collider vertex]
[Figure: Causal children of A / NOT causal children of A; Causal children of E / NOT causal children of E]
[Figure: Causal ancestors of C]
State of a vertex:
A non-collider vertex allows causal influence to flow through it (naturally ON); conditioning (holding constant) blocks causal influence through it (turns OFF).
A collider vertex prevents causal influence from flowing through it (naturally OFF); conditioning (holding constant) allows causal influence through it (turns ON).
[Figure: three-vertex graphs over A, B, C illustrating each case]
[Figure: rain → mud ← water hose]
1. It rained
2. Therefore mud
3. No idea about water hose

1. It didn’t rain
2. There was mud
3. Therefore the water hose was on
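The two lines of reasoning above can be reproduced with exact probabilities. This is a minimal sketch assuming that rain and the water hose are independent causes, that mud appears whenever either occurs, and that P(rain) = 3/10 and P(hose) = 1/5; the probabilities and the `prob` helper are hypothetical.

```python
from fractions import Fraction
from itertools import product

P_RAIN = Fraction(3, 10)  # hypothetical probability of rain
P_HOSE = Fraction(1, 5)   # hypothetical probability that the hose was on

def prob(event, given=lambda rain, hose, mud: True):
    """Exact P(event | given) by enumerating the four (rain, hose) states."""
    numerator = denominator = Fraction(0)
    for rain, hose in product([True, False], repeat=2):
        p = (P_RAIN if rain else 1 - P_RAIN) * (P_HOSE if hose else 1 - P_HOSE)
        mud = rain or hose  # collider: mud has two independent causes
        if given(rain, hose, mud):
            denominator += p
            if event(rain, hose, mud):
                numerator += p
    return numerator / denominator

hose_on = lambda rain, hose, mud: hose

# Knowing only that it rained says nothing about the hose.
p_hose_given_rain = prob(hose_on, given=lambda rain, hose, mud: rain)
# Knowing there is mud makes the hose more likely than before.
p_hose_given_mud = prob(hose_on, given=lambda rain, hose, mud: mud)
# Knowing there is mud AND no rain makes the hose certain.
p_hose_given_mud_no_rain = prob(hose_on, given=lambda rain, hose, mud: mud and not rain)

print(p_hose_given_rain, p_hose_given_mud, p_hose_given_mud_no_rain)
```

Rain and the hose are independent until the collider (mud) is known, at which point information about one cause changes what can be inferred about the other.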
Are X and Y d-separated given a set Q={A, B, …} of conditioning vertices?
1. List all undirected paths between X and Y.
For each such undirected path...
2. Are there any non-colliders along this path that are in Q? If yes, the path is blocked; go to the next undirected path.
3. Is every collider along this path, or at least one of its causal children (descendants), in Q? If no, then the path is blocked; go to the next undirected path.
If all undirected paths between X and Y are blocked by Q then X and Y are d-separated by Q.
If X and Y are d-separated by Q, then they are probabilistically independent given Q in any probability distribution generated by the graph.
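The procedure above translates almost line for line into code. The sketch below is one straightforward way to write it, not an optimized algorithm; the edge list (A → B, A → C, B → D, C → D, D → E) is an assumption, inferred from the basis set BU that appears later in the talk.

```python
# Assumed DAG, inferred from the basis set BU given in the talk
EDGES = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E")]

def descendants(node, edges):
    """All vertices reachable from node by following the arrows."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for tail, head in edges:
            if tail == current and head not in found:
                found.add(head)
                stack.append(head)
    return found

def undirected_paths(x, y, edges):
    """Step 1: every path from x to y, ignoring arrow direction."""
    neighbours = {}
    for tail, head in edges:
        neighbours.setdefault(tail, set()).add(head)
        neighbours.setdefault(head, set()).add(tail)

    def walk(path):
        if path[-1] == y:
            yield path
            return
        for nxt in sorted(neighbours.get(path[-1], ())):
            if nxt not in path:
                yield from walk(path + [nxt])

    yield from walk([x])

def d_separated(x, y, q, edges):
    """True if every undirected path between x and y is blocked by q."""
    q, edge_set = set(q), set(edges)
    for path in undirected_paths(x, y, edges):
        blocked = False
        for prev, mid, nxt in zip(path, path[1:], path[2:]):
            is_collider = (prev, mid) in edge_set and (nxt, mid) in edge_set
            if is_collider:
                # Step 3: a collider blocks unless it, or a descendant, is in q
                if mid not in q and not (descendants(mid, edges) & q):
                    blocked = True
            elif mid in q:
                # Step 2: a non-collider in q blocks the path
                blocked = True
        if not blocked:
            return False
    return True

print(d_separated("B", "C", {"A"}, EDGES))  # B_||_C|{A}?
print(d_separated("B", "C", {"D"}, EDGES))  # B_||_C|{D}?
```

Enumerating every path is exponential in the worst case, which is acceptable for the handful of vertices in a typical path model.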
Are B & C d-separated given A? B_||_C|{A}?
[Figure: DAG over A, B, C, D, E; A is a non-collider on the path between B and C]
YES, B & C are d-separated given A, therefore B & C will be independent conditional on A.
Are B & C d-separated given D? B_||_C|{D}?
[Figure: the same DAG; D is a collider on the path between B and C]
NO, B & C are not d-separated given D, therefore B & C will be dependent conditional on D.
[Figure: DAG over A, B, C, D, E]
A_||_E|{D}? YES
A_||_E|{D,B}? YES
B_||_C|{A,D}? NO
B_||_C|{A,E}? NO
D_||_A|{B}? NO
E_||_B|{D}? YES
… and so on for every unique pair (X,Y) conditioned on every unique subset of the remaining variables...
With V = 5 variables:
\binom{V}{2} \sum_{x=0}^{V-2} \binom{V-2}{x} = 10 × [1 + 3 + 3 + 1] = 80
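The count above is easy to reproduce for any number of variables. A short sketch, where `n_dsep_claims` is a hypothetical helper name:

```python
from math import comb

def n_dsep_claims(v):
    """Number of (pair, conditioning set) combinations among v variables."""
    pairs = comb(v, 2)                                             # unique pairs (X, Y)
    conditioning_sets = sum(comb(v - 2, x) for x in range(v - 1))  # x = 0 .. v-2
    return pairs * conditioning_sets

print(n_dsep_claims(5))  # 10 * (1 + 3 + 3 + 1)
```

The number grows quickly with V, which is the practical reason for working with a basis set rather than with every possible claim.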
Basis set: the smallest set of d-separation claims in a DAG that, together, imply all others.
[Figure: DAG over A, B, C, D, E]
If you know the basis set, then you can specify the entire structure of the joint probability distribution that is generated by the directed acyclic graph.
Therefore, you can test the causal structure by testing the d-separation claims given in the basis set.
Special basis set: BU = {X_||_Y|{Pa(X) U Pa(Y)}}
for each pair of vertices X,Y that are not directly connected
(each unique pair of non-adjacent vertices, conditioned on the set of parents of both)
BU={A_||_D|{B,C}, A_||_E|{D}, B_||_C|{A}, B_||_E|{A,D}, C_||_E|{A,D} }
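The rule for building BU can be sketched in a few lines. The edge list below (A → B, A → C, B → D, C → D, D → E) is an assumption: it is the graph whose parent sets reproduce the five claims listed above, and `basis_set` is a hypothetical helper name.

```python
from itertools import combinations

# Assumed DAG whose parent sets reproduce the basis set listed above
EDGES = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E")]

def parents(vertex, edges):
    """Pa(vertex): the direct causes of a vertex."""
    return {tail for tail, head in edges if head == vertex}

def basis_set(vertices, edges):
    """One claim X_||_Y|{Pa(X) U Pa(Y)} per pair of non-adjacent vertices."""
    edge_set = set(edges)
    claims = []
    for x, y in combinations(vertices, 2):
        if (x, y) in edge_set or (y, x) in edge_set:
            continue  # adjacent vertices generate no claim
        conditioning = (parents(x, edges) | parents(y, edges)) - {x, y}
        claims.append((x, y, tuple(sorted(conditioning))))
    return claims

for x, y, given in basis_set("ABCDE", EDGES):
    print(f"{x}_||_{y}|{{{','.join(given)}}}")
```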
List basis set BU:
A_||_D|{B,C}
A_||_E|{D}
B_||_C|{A}
B_||_E|{A,D}
C_||_E|{A,D}
Convert to probabilistic claims:
rA,D|{B,C}=0
rA,E|D=0
rB,C|A=0
rB,E|A,D=0
rC,E|A,D=0
Calculate probability of each claim in data:
p1=0.23
p2=0.50
p3=0.001
p4=0.45
p5=0.12
Calculate: C = -2 \sum_{i=1}^{k} \ln(p_i)
k=5
C = 23.98
IF all d-sep claims in the graph are true in the data, then C follows a chi-squared distribution with 2k degrees of freedom
THEREFORE if the probability of C is below the significance level……… the causal structure is rejected by the data.
THEREFORE if the probability of C is above the significance level……… the causal structure is consistent with the data.
χ² of 23.98 with 10 degrees of freedom gives p=0.008
REJECT causal structure
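The arithmetic of this test can be reproduced with the standard library alone. The sketch below uses the five probabilities from the slide; `fisher_c` and `chi2_upper_tail_even_df` are hypothetical helper names, and the tail probability uses the closed form that holds only for an even number of degrees of freedom, which is always the case here because df = 2k.

```python
from math import exp, factorial, log

def fisher_c(p_values):
    """C = -2 * sum of ln(p_i) over the k independence claims."""
    return -2 * sum(log(p) for p in p_values)

def chi2_upper_tail_even_df(x, df):
    """P(chi-squared with df degrees of freedom > x); valid for even df only."""
    if df % 2 != 0:
        raise ValueError("closed form requires an even number of degrees of freedom")
    half = x / 2
    return exp(-half) * sum(half ** j / factorial(j) for j in range(df // 2))

p_values = [0.23, 0.50, 0.001, 0.45, 0.12]  # p1..p5 from the slide
k = len(p_values)

c_stat = fisher_c(p_values)
p_model = chi2_upper_tail_even_df(c_stat, 2 * k)
print(f"C = {c_stat:.2f} with {2 * k} degrees of freedom, p = {p_model:.3f}")
```

A single very small probability (here p3 = 0.001) is enough to push C into the rejection region even when the other claims fit well.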
Claude Bernard
Ronald Fisher
Karl Pearson
Sewall Wright
Clark Glymour
Judea Pearl