
Markov Chains:
Transitional Modeling
Qi Liu
Contents

Terminology
Transitional Models without Explanatory Variables
Inference for Markov Chains
Data Analysis: Example 1 (ignoring explanatory variables)
Transitional Models with Explanatory Variables
Data Analysis: Example 2 (with explanatory variables)
Terminology

Transitional models
Markov chain
kth-order Markov chain
Transition probabilities and transition matrix
Transitional models

Let {y0, y1, …, yt-1} denote the responses observed previously. Our focus is on the dependence of Yt on {y0, y1, …, yt-1} as well as on any explanatory variables. Models of this type are called transitional models.
Markov chain

A stochastic process is a Markov chain if, for all t, the conditional distribution of Yt+1 given Y0, Y1, …, Yt is identical to the conditional distribution of Yt+1 given Yt alone; i.e., given Yt, Yt+1 is conditionally independent of Y0, Y1, …, Yt-1. So, knowing the present state of a Markov chain, information about the past states does not help us predict the future:

P(Yt+1 | Y0, Y1, …, Yt) = P(Yt+1 | Yt)
kth-order Markov chain

For all t, the conditional distribution of Yt+1 given Y0, Y1, …, Yt is identical to the conditional distribution of Yt+1 given (Yt-k+1, …, Yt):

P(Yt+1 | Y0, Y1, …, Yt) = P(Yt+1 | Yt-k+1, Yt-k+2, …, Yt)

i.e., given the states at the previous k times, the future behavior of the chain is independent of past behavior before those k times. What we discuss here is the first-order Markov chain, with k = 1.
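For example, with k = 2 the condition reads P(Yt+1 | Y0, Y1, …, Yt) = P(Yt+1 | Yt-1, Yt): only the two most recent states carry information about the next one.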
Transition probabilities

Denote the conditional probability P(Yt = j | Yt-1 = i) by πj|i(t); the {πj|i(t)} are called transition probabilities, and they satisfy Σj πj|i(t) = 1. The I × I matrix {πj|i(t), i = 1, …, I, j = 1, …, I} is a transition probability matrix. It is called one-step to distinguish it from the matrix of probabilities for k-step transitions from time t-k to time t.
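As a small illustration (hypothetical numbers, I = 2 states), if π1|1 = 0.6 and π1|2 = 0.1 at every t, the one-step transition probability matrix is

0.6  0.4
0.1  0.9

with each row summing to 1. For a time-homogeneous chain, the two-step matrix is its square; e.g., P(Yt+2 = 1 | Yt = 1) = 0.6(0.6) + 0.4(0.1) = 0.40.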
Transitional Models without Explanatory Variables

At first, we ignore explanatory variables. Let f(y0, …, yT) denote the joint probability mass function of (Y0, …, YT). Transitional models use the factorization

f(y0, …, yT) = f(y0) f(y1 | y0) f(y2 | y0, y1) … f(yT | y0, y1, …, yT-1)

This model is conditional on the previous responses. For Markov chains,

f(y0, …, yT) = f(y0) f(y1 | y0) f(y2 | y1) … f(yT | yT-1)    (*)

From (*), a Markov chain depends only on the one-step transition probabilities and the marginal distribution for the initial state. It also follows that the joint distribution satisfies the loglinear model

(Y0Y1, Y1Y2, …, YT-1YT)

For a sample of realizations of a stochastic process, a contingency table displays counts of the possible sequences. A test of fit of this loglinear model checks whether the process plausibly satisfies the Markov property; the first PROC GENMOD fit in Example 1 below carries out exactly this test.
Inference for Markov Chains

Use standard methods of categorical data analysis, e.g., ML estimation of transition probabilities. Let nij(t) denote the number of transitions from state i at time t-1 to state j at time t. For fixed t, the {nij(t)} form the two-way marginal table for dimensions t-1 and t of an I^(T+1) contingency table. For the ni+(t) subjects in category i at time t-1, suppose that {nij(t), j = 1, …, I} have a multinomial distribution with parameters {πj|i(t)}. Let {ni0} denote the initial counts; suppose that they also have a multinomial distribution, with parameters {πi0}.
Inference for Markov Chains (continued)

If subjects behave independently, then from (*) the likelihood function is proportional to

[ Π(i=1 to I) πi0^ni0 ] × Π(t=1 to T) Π(i=1 to I) Π(j=1 to I) [πj|i(t)]^nij(t)

The transition probabilities are parameters of I × T independent multinomial distributions. From Anderson and Goodman (1957), the ML estimates are

π̂j|i(t) = nij(t) / ni+(t)
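As a minimal sketch of this in SAS (a hypothetical illustration, assuming the breath data set of sequence counts entered in Example 1 below), the row percentages of the y9 × y10 cross-classification give the estimated one-step transition probabilities π̂j|i for the age 9 to age 10 step:

proc freq data=breath;
weight count;                      /* cell counts rather than raw records */
tables y9*y10 / nocol nopercent;   /* row percents = estimated transition probabilities */
run;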
Example 1 (ignoring explanatory variables)

A study at Harvard examined the effects of air pollution on respiratory illness in children. The children were examined annually at ages 9 through 12 and classified according to the presence or absence of wheeze. Let Yt denote the binary response at age t, t = 9, 10, 11, 12 (1 = wheeze, 2 = no wheeze).

y9  y10  y11  y12  count     y9  y10  y11  y12  count
1   1    1    1    94        2   1    1    1    19
1   1    1    2    30        2   1    1    2    15
1   1    2    1    15        2   1    2    1    10
1   1    2    2    28        2   1    2    2    44
1   2    1    1    14        2   2    1    1    17
1   2    1    2    9         2   2    1    2    42
1   2    2    1    12        2   2    2    1    35
1   2    2    2    63        2   2    2    2    572
Code of Example 1 (11.7)
data breath;
input y9 y10 y11 y12 count;
datalines;
1 1 1 1 94
1 1 1 2 30
1 1 2 1 15
1 1 2 2 28
1 2 1 1 14
1 2 1 2 9
1 2 2 1 12
1 2 2 2 63
2 1 1 1 19
2 1 1 2 15
2 1 2 1 10
2 1 2 2 44
2 2 1 1 17
2 2 1 2 42
2 2 2 1 35
2 2 2 2 572
;
run;
/* First-order Markov chain: loglinear model (y9*y10, y10*y11, y11*y12) */
proc genmod data=breath; class y9 y10 y11 y12;
model count = y9 y10 y11 y12 y9*y10 y10*y11 y11*y12 / dist=poi lrci type3 residuals obstats;
run;
/* Second-order Markov chain: loglinear model (y9*y10*y11, y10*y11*y12) */
proc genmod data=breath; class y9 y10 y11 y12;
model count = y9 y10 y11 y12 y9*y10 y9*y11 y10*y11 y10*y12 y11*y12 y9*y10*y11 y10*y11*y12 / dist=poi lrci type3 residuals obstats;
run;
/* All six pairwise associations */
proc genmod data=breath; class y9 y10 y11 y12;
model count = y9 y10 y11 y12 y9*y10 y9*y11 y9*y12 y10*y11 y10*y12 y11*y12 / dist=poi lrci type3 residuals obstats;
run;
/* Simpler structure: one common association for pairs of ages 1 year
   apart (a) and one for pairs more than 1 year apart (b) */
data breath_new; set breath;
a = y9*y10 + y10*y11 + y11*y12;
b = y9*y11 + y9*y12 + y10*y12;
run;
proc genmod data=breath_new; class y9 y10 y11 y12;
model count = y9 y10 y11 y12 a b / dist=poi lrci type3 residuals obstats;
run;
Data analysis

The loglinear model (y9y10, y10y11, y11y12) corresponds to a first-order Markov chain:

P(Y11 | Y9, Y10) = P(Y11 | Y10)
P(Y12 | Y10, Y11) = P(Y12 | Y11)

With G² = 122.9025, df = 8, and p-value < 0.0001, it fits poorly. So, given the state at time t, the classification at time t+1 depends on states at times previous to t.
Data analysis (cont.)

Next we consider the model (y9y10y11, y10y11y12), a second-order Markov chain, satisfying conditional independence of the responses at ages 9 and 12, given the states at ages 10 and 11. This model also fits poorly, with G² = 23.8632, df = 4, and p-value < 0.001.
Data analysis (cont.)

The loglinear model (y9y10, y9y11, y9y12, y10y11, y10y12, y11y12), which permits association for each pair of ages, fits well, with G² = 1.4585, df = 5, and p-value = 0.9178.

Parameter   Estimate   Std Error   LR 95% Limits       Chi-Square   Pr > ChiSq
y9*y10      1.8064     0.1943      1.4263   2.1888     86.42        <.0001
y9*y11      0.9478     0.2123      0.5282   1.3612     19.94        <.0001
y9*y12      1.0531     0.2133      0.6323   1.4696     24.37        <.0001
y10*y11     1.6458     0.2093      1.2356   2.0569     61.85        <.0001
y10*y12     1.0742     0.2205      0.6393   1.5045     23.74        <.0001
y11*y12     1.8497     0.2071      1.4449   2.2574     79.81        <.0001
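Since each response is binary, these estimates are the conditional log odds ratios reported two slides below; e.g., exp(1.8064) ≈ 6.1 estimates the odds ratio between wheeze at ages 9 and 10, conditional on the responses at the other two ages.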
Data analysis (cont.)

From the above, the association seems similar for pairs of ages 1 year apart, and somewhat weaker for pairs of ages more than 1 year apart. So we consider the simpler model in which

λij(y9y10) = λij(y10y11) = λij(y11y12)  and  λij(y9y11) = λij(y9y12) = λij(y10y12)

It also fits well, with G² = 2.3, df = 9, and p-value = 0.9858.
Estimated Conditional Log Odds Ratios

Association   Estimate   Simpler Structure
Y9Y10         1.81       1.75
Y10Y11        1.65       1.75
Y11Y12        1.85       1.75
Y9Y11         0.95       1.04
Y9Y12         1.05       1.04
Y10Y12        1.07       1.04
Transitional Models with Explanatory Variables

The joint mass function of T sequential responses is

f(y1, …, yT; x) = f(y1; x) f(y2 | y1; x) f(y3 | y1, y2; x) … f(yT | y1, y2, …, yT-1; x)

For binary y, we can use a logistic regression model for each term in the above factorization:

f(yt | y1, y2, …, yt-1; xt) = exp[yt(α + β1 y1 + … + βt-1 yt-1 + β′xt)] / [1 + exp(α + β1 y1 + … + βt-1 yt-1 + β′xt)],  yt = 0, 1

The model treats previous responses as explanatory variables. It is called the regressive logistic model (Bonney 1987). The interpretation and magnitude of β depend on how many previous observations are in the model.
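For instance, with first-order dependence the t-th term simplifies to f(yt | yt-1; xt) = exp[yt(α + β yt-1 + β′xt)] / [1 + exp(α + β yt-1 + β′xt)], the form fitted in Example 2 below.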
Within-cluster effects may diminish markedly when one conditions on previous responses. This is an important difference from marginal models, for which the interpretation does not depend on the specification of the dependence structure. In the special case of first-order Markov structure, the coefficients of (y1, …, yt-2) equal 0 in the model for yt.

Given the predictors, the model treats repeated transitions by a subject as independent. Thus, one can fit the model with ordinary GLM software, treating each transition as a separate observation (Bonney 1986).
Data Analysis: Example 2 (with explanatory variables)

At ages 7 through 10, children were evaluated annually on the presence of respiratory illness. A predictor is maternal smoking at the start of the study, with s = 1 for smoking regularly and s = 0 otherwise.
Child's Respiratory Illness by Age and Maternal Smoking

                                 No Maternal Smoking (s=0)   Maternal Smoking (s=1)
Child's Respiratory Illness            Age 10                      Age 10
Age 7    Age 8    Age 9              No      Yes                 No      Yes
No       No       No                 237     10                  118     6
                  Yes                15      4                   8       2
         Yes      No                 16      2                   11      1
                  Yes                7       3                   6       4
Yes      No       No                 24      3                   7       3
                  Yes                3       2                   3       1
         Yes      No                 6       2                   4       2
                  Yes                5       11                  4       7
Data analysis (cont.)

Let yt denote the response at age t (t = 7, 8, 9, 10). Regressive logistic model:

logit[P(yt = 1)] = α + β1 s + β2 t + β3 yt-1,   t = 8, 9, 10

Each subject contributes three observations to the model fitting. The data set consists of 12 binomials, for the 2 × 3 × 2 combinations of (s, t, yt-1). E.g., for the combination (0, 8, 0), i.e., s = 0, t = 8, y7 = 0, we have y8 = 0 for 237 + 10 + 15 + 4 = 266 subjects and y8 = 1 for 16 + 2 + 7 + 3 = 28 subjects.
Code of Example 2

data illness;
/* t = current age, tp = previous age, ytp = response at age tp,
   yt = response at age t, s = maternal smoking */
input t tp ytp yt s count;
datalines;
8 7 0 0 0 266
8 7 0 0 1 134
8 7 0 1 0 28
8 7 0 1 1 22
8 7 1 0 0 32
8 7 1 0 1 14
8 7 1 1 0 24
8 7 1 1 1 17
9 8 0 0 0 274
9 8 0 0 1 134
9 8 0 1 0 24
9 8 0 1 1 14
9 8 1 0 0 26
9 8 1 0 1 18
9 8 1 1 0 26
9 8 1 1 1 21
10 9 0 0 0 283
10 9 0 0 1 140
10 9 0 1 0 17
10 9 0 1 1 12
10 9 1 0 0 30
10 9 1 0 1 21
10 9 1 1 0 20
10 9 1 1 1 14
;
run;
/* Fit the regressive logistic model, treating each transition as a
   separate observation. scale=none aggregate requests goodness-of-fit
   statistics for the grouped data. */
proc logistic descending;
freq count;
model yt = t ytp s / scale=none aggregate;
run;
Output from SAS

Deviance and Pearson Goodness-of-Fit Statistics

Criterion   DF   Value    Value/DF   Pr > ChiSq
Deviance    8    3.1186   0.3898     0.9267
Pearson     8    3.1275   0.3909     0.9261

Analysis of Maximum Likelihood Estimates

Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept   1    -0.2926    0.8460           0.1196            0.7295
t           1    -0.2428    0.0947           6.5800            0.0103
ytp         1    2.2111     0.1582           195.3589          <.0001
s           1    0.2960     0.1563           3.5837            0.0583
Analysis

The ML fit is

logit[p̂(yt = 1)] = log[ p̂(yt = 1) / (1 - p̂(yt = 1)) ] = -0.2926 - 0.2428 t + 0.2960 s + 2.2111 yt-1

so that

p̂(yt = 1) = exp(-0.2926 - 0.2428 t + 0.2960 s + 2.2111 yt-1) / [1 + exp(-0.2926 - 0.2428 t + 0.2960 s + 2.2111 yt-1)]

is an increasing function of s and yt-1 and a decreasing function of t. Then:

If s and yt-1 are fixed, the fitted P(Yt = 1) decreases as t increases; that is, a younger child is more likely to have illness.

If t and yt-1 are fixed, P(Yt = 1) is larger when s = 1 than when s = 0: a child whose mother smokes has a greater chance of illness than one whose mother does not.

If t and s are fixed, P(Yt = 1) is larger when yt-1 = 1 than when yt-1 = 0: a child who had illness at age t-1 is more likely to have illness at age t than a child who did not.



The model fits well, with G² = 3.1186, df = 8, and p-value = 0.9267.

The coefficient of yt-1 is 2.2111, with SE 0.1582, chi-square statistic 195.3589, and p-value < .0001, which shows that the previous observation has a strong positive effect: a child who had illness at age t-1 is more likely to have illness at age t than a child who did not.

The coefficient of s is 0.2960; the Wald test of H0: β1 = 0 gives chi-square 3.5837, df = 1, with p-value 0.0583. There is slight evidence of a positive effect of maternal smoking.
Interpretation of Parameters β

logit[P(yt = 1)] = log[ P(yt = 1) / (1 - P(yt = 1)) ] = α + β1 s + β2 t + β3 yt-1

Then

P(yt = 1 | s = 0, t = 8, y7 = 0) / P(yt = 0 | s = 0, t = 8, y7 = 0) = exp(α + 8β2) = exp(-0.2926 + 8×(-0.2428)) = 0.107,

so P(yt = 1) = 0.0967. If a child did not have illness at age 7 and his mother did not smoke, the probability that he would have illness at age 8 is 0.0967.

P(yt = 1 | s = 0, t = 8, y7 = 1) / P(yt = 0 | s = 0, t = 8, y7 = 1) = exp(α + 8β2 + β3) = exp(-0.2926 - 8×0.2428 + 2.2111) = 0.9764,

so P(yt = 1) = 0.494 >> 0.0967. So for children whose mothers did not smoke, if the child had illness at age 7, the probability of illness at age 8 is 0.494.

β3 = log[ P(yt = 1 | s = 0, t = 8, y7 = 1) / P(yt = 0 | s = 0, t = 8, y7 = 1) ] - log[ P(yt = 1 | s = 0, t = 8, y7 = 0) / P(yt = 0 | s = 0, t = 8, y7 = 0) ]

i.e., β3 is the conditional log odds ratio between successive responses. And in this way, we can get the interpretation of the other parameters.
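As a quick check, a minimal SAS sketch (a hypothetical verification step, plugging in the ML estimates reported above) reproduces these two fitted probabilities:

data check;
alpha = -0.2926; beta2 = -0.2428; beta3 = 2.2111;  /* ML estimates from the output */
odds0 = exp(alpha + 8*beta2);                      /* s=0, t=8, y7=0: odds = 0.107  */
p0 = odds0 / (1 + odds0);                          /* fitted probability = 0.0967   */
odds1 = exp(alpha + 8*beta2 + beta3);              /* s=0, t=8, y7=1: odds = 0.9764 */
p1 = odds1 / (1 + odds1);                          /* fitted probability = 0.494    */
put odds0= p0= odds1= p1=;                         /* write the values to the log   */
run;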

Thank you!