Modelling Longitudinal Data
• General Points
• Single Event histories (survival analysis)
• Multiple Event histories
Motivation
• Attempt to go beyond the simpler material
in the first workshop.
• Begin to develop an appreciation of the
notation associated with these techniques.
• Gain a little “hands-on” experience.
Statistical Modelling Framework
Generalized Linear Models
An interest in generalized linear models is richly rewarded. Not only does
it bring together a wealth of interesting theoretical problems but it also
encourages an ease of data analysis sadly lacking from traditional
statistics….an added bonus of the glm approach is the insight provided
by embedding a problem in a wider context. This in itself encourages a
more critical approach to data analysis.
Gilchrist, R. (1985) ‘Introduction: GLIM and Generalized Linear Models’,
Springer Verlag Lecture Notes in Statistics, 32, pp.1-5.
Statistical Modelling
• Know your data.
• Start and be guided by
‘substantive theory’.
• Start with simple
techniques (these might
suffice).
• Remember John Tukey!
• Practice.
Willett and Singer (1995) conclude that discrete-time
methods are generally considered to be simpler and more
comprehensible; moreover, mastery of discrete-time methods
facilitates a transition to continuous-time approaches should
that be required.
Willett, J. and Singer, J. (1995) Investigating Onset, Cessation, Relapse, and Recovery: Using Discrete-Time Survival
Analysis to Examine the Occurrence and Timing of Critical Events. In J. Gottman (ed) The Analysis of Change (Hove:
Lawrence Erlbaum Associates).
As social scientists we are often
substantively interested in whether a
specific event has occurred.
Survival Data – Time to an event
In the medical area…
• Time from diagnosis to death.
• Duration from treatment to full health.
• Time to return of pain after taking a pain
killer.
Survival Data – Time to an event
Social Sciences…
• Duration of unemployment.
• Duration of housing tenure.
• Duration of marriage.
• Time to conception.
• Time to orgasm.
Consider a binary outcome or
two-state event
0 = Event has not occurred
1 = Event has occurred
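[Figure: timelines for three individuals, A, B and C, between the start and end of the study. Each begins in state 0 and moves to state 1 at one of the times t1, t2 and t3.]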
These durations are a continuous
Y so why can’t we use standard
regression techniques?
We can. It might be better to model
the log of Y however. These models
are sometimes known as ‘accelerated
life models’.
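As a rough illustration of modelling the log of Y, the sketch below fits an ordinary least squares line to logged durations. The durations, the 0/1 explanatory variable and the function name `fit_log_duration` are all hypothetical, and the sketch assumes every duration is fully observed.

```python
import math

def fit_log_duration(durations, x):
    """Least squares fit of log_e(t_i) = b0 + b1 * x_i + e_i (one explanatory variable)."""
    log_t = [math.log(t) for t in durations]
    n = len(log_t)
    mean_x = sum(x) / n
    mean_y = sum(log_t) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, log_t))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Hypothetical durations (in months) and a 0/1 explanatory variable.
durations = [2.0, 4.0, 8.0, 4.0, 8.0, 16.0]
x = [0, 0, 0, 1, 1, 1]

b0, b1 = fit_log_duration(durations, x)

# Because the model is for log t, exp(b1) is a multiplier on the duration itself.
time_ratio = math.exp(b1)
```

With these made-up numbers the group with x = 1 has durations twice as long, so `time_ratio` comes out at 2: the covariate stretches (or shrinks) the time scale rather than adding to it.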
1946 Birth Cohort Study
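Research Project 2060
[Figure: timelines for individuals A, B and C from the start of the study in 1946. Each begins in state 0 and moves to state 1 (1 = death); the time points t1 to t4 are marked. Annotation on the timeline: 1st August 2032, VG retires!]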
Breast Feeding Study –
Data Collection Strategy
1. Retrospective questioning of mothers
2. Data collected by Midwives
3. Health Visitor and G.P. Record
Breast Feeding Study –
[Figure: timeline from birth in 1995 (start of study) to age 6 in 2001. Individual histories are coded 0 or 1, with the time points t1, t2 and t3 marked.]
Accelerated Life Model
$\log_e t_i = b_0 + b_1 x_{1i} + e_i$
• $\log_e t_i$: beware, this is log t
• $b_0$: constant
• $x_{1i}$: explanatory variable
• $e_i$: error term
At this point something should
dawn on you – like fish scales
falling from your eyes – like
pennies from Heaven.
$b_0 + b_1 x_{1i} + e_i$ is the r.h.s.
Think about the l.h.s.
• $Y_i$ : Standard linear model
• $\log_e(\text{odds})\, Y_i$ : Standard logistic model
• $\log_e t_i$ : Accelerated life model
We can think of these as a single ‘class’ of models
and (with a little care) can interpret them in a similar
fashion (as Ian Diamond of the ESRC would say
“this is phenomenally groovy”).
[Figure: individual timelines between the start and end of the study. Some individuals move from state 0 to state 1 before the study ends. Others, including persons A and B, are labelled CENSORED OBSERVATIONS.]
These durations are a continuous
Y so why can’t we use standard
regression techniques?
What should be the value of Y for
person A and person B at the end of
our study (when we fit the model)?
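One way to record such cases, sketched below, is to store two values per person: the time observed so far, and a 0/1 code saying whether the event occurred (1) or the case was censored (0). The people, times and the helper `observe` are hypothetical.

```python
def observe(start, event_time, study_end):
    """Return (duration, event) for one person.

    event_time is None when the event never happens.
    event is 1 if the event was seen before the study ended, 0 if censored.
    """
    if event_time is not None and event_time <= study_end:
        return event_time - start, 1
    # Event not seen by the end of the study: we only know the duration
    # is at least this long.
    return study_end - start, 0

study_end = 24  # hypothetical 24-month study

# Hypothetical (start month, event month) for three people.
people = {"A": (0, None), "B": (6, 30), "C": (0, 10)}

records = {name: observe(start, event_time, study_end)
           for name, (start, event_time) in people.items()}
```

Here A and B both end up with an event code of 0. Treating their recorded durations (24 and 18 months) as completed durations in a standard regression would understate how long their spells really last.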
Cox Regression
(proportional hazard model)
is a method for modelling time-to-event data in the
presence of censored cases.
• Explanatory variables in your model (continuous and categorical).
• Estimated coefficients for each of the covariates.
• Handles the censored cases correctly.
Cox, D.R. (1972) ‘Regression models and life tables’ JRSS, B, 34, pp.187-220.
Childcare Study –
Studying a cohort of women who
returned to work after having
their first child.
• 24 month study
• The focus of the study was childcare spell #2
• 341 Mothers (and babies)
Variables
• ID
• Start of childcare spell #2 (month)
• End of childcare spell #2 (month)
• Gender of baby (male; female)
• Type of care spell #2 (a relative; childminder; nursery)
• Family income (crude measure)
Survival Function
(or survival curve)
Describes the decline in the size
of the risk set over time.
Survival Function
$S(t) = 1 - F(t) = \mathrm{Prob}(T > t)$
also
$S(t_1) \ge S(t_2)$ for all $t_2 > t_1$
Survival Function
$S(t) = 1 - F(t) = \mathrm{Prob}(T > t)$
• $S(t)$: survival probability
• $F(t)$: cumulative probability
• $T$: event time
• $1 - F(t)$: complement
Survival Function
All this means is… once you’ve left
the risk set you can’t return!!!
$S(t_1) \ge S(t_2)$ for all $t_2 > t_1$
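One common way to estimate S(t) when some cases are censored is the Kaplan-Meier (product-limit) estimator. The slides do not name the estimator behind their survival plots, so the sketch below is an illustration with hypothetical data, not the method used for the childcare study.

```python
def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct event time.

    durations: observed time for each case
    events:    1 if the event occurred at that time, 0 if censored
    """
    event_times = sorted({t for t, e in zip(durations, events) if e == 1})
    surv = 1.0
    curve = []
    for t in event_times:
        # Risk set: everyone still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        failed = sum(1 for d, e in zip(durations, events) if d == t and e == 1)
        surv *= 1 - failed / at_risk
        curve.append((t, surv))
    return curve

# Hypothetical durations (months) and event codes (1 = event, 0 = censored).
durations = [2, 3, 3, 5, 8, 8]
events = [1, 1, 0, 1, 1, 0]

curve = kaplan_meier(durations, events)
```

Censored cases never trigger a drop in the curve, but they do leave the risk set, so later events are divided by a smaller number at risk. The estimated curve can only stay level or fall, matching the rule that you cannot return to the risk set.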
Survival Functions
[Figure: cumulative survival against time (0 to 30) by family income (up to £30K; £30K+), with censored cases marked.]
Median Survival Times
[Figure: survival functions against time (0 to 30) by family income (up to £30K; £30K+), with censored cases marked.]
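A median survival time is usually read off a survival curve as the first time at which the curve reaches 0.5 or below. The sketch below assumes that definition and uses a hypothetical curve; `median_survival_time` is an illustrative helper, not something from the slides.

```python
def median_survival_time(curve):
    """Return the first time at which S(t) is at or below 0.5, else None.

    curve: list of (time, S(t)) pairs in increasing time order.
    """
    for t, s in curve:
        if s <= 0.5:
            return t
    # The curve never falls to 0.5, so the median is not reached.
    return None

# Hypothetical survival curve.
curve = [(2, 0.83), (3, 0.67), (5, 0.44), (8, 0.22)]
median = median_survival_time(curve)

# A curve that stays above 0.5 has no estimable median.
never_reached = median_survival_time([(2, 0.9), (6, 0.7)])
```

When many cases are censored the curve may never drop to 0.5, in which case the median cannot be estimated from the data in hand.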
One Minus Survival Functions
[Figure: one minus cumulative survival against time (0 to 30) by family income (up to £30K; £30K+), with censored cases marked.]
Too hard to interpret except for the Rain Man
Log Survival Function
[Figure: log survival against time (0 to 30) by family income (up to £30K; £30K+), with censored cases marked.]
HAZARD
In advanced analyses researchers
sometimes examine the shape of
something called the hazard. Unlike the
survival function, which can never increase,
the shape of the hazard is not constrained.
Therefore it can potentially tell us
something about the social process
that is taking place.
For the very keen…
Hazard –
the rate at which events occur
Or
the risk of an event occurring at a particular
time, given that it has not happened before t
For the even more keen…
Hazard –
The conditional probability of an event
occurring at time t, given that it has not
happened before. If we call the hazard
function h(t) and the pdf for the duration f(t),
then
$h(t) = f(t)/S(t)$
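To see the formula at work, the sketch below computes h(t) = f(t)/S(t) for a Weibull duration distribution. The Weibull is an assumption chosen only for illustration (the slides do not specify a distribution), and the shape and scale values are hypothetical.

```python
import math

def weibull_pdf(t, shape, scale):
    """f(t): density of the duration."""
    z = t / scale
    return (shape / scale) * z ** (shape - 1) * math.exp(-(z ** shape))

def weibull_survival(t, shape, scale):
    """S(t): probability that the duration exceeds t."""
    return math.exp(-((t / scale) ** shape))

def hazard(t, shape, scale):
    """h(t) = f(t) / S(t)."""
    return weibull_pdf(t, shape, scale) / weibull_survival(t, shape, scale)

times = [1.0, 5.0, 10.0, 20.0]

# shape = 1 gives a flat hazard; shape = 2 gives a hazard that rises with time.
flat = [hazard(t, 1.0, 10.0) for t in times]
rising = [hazard(t, 2.0, 10.0) for t in times]
```

Both distributions have survival functions that only fall, yet one hazard is flat and the other climbs. This is the sense in which the hazard's shape is free to vary.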
Hazard Function
[Figure: cumulative hazard against time (0 to 30) by family income (up to £30K; £30K+), with censored cases marked.]
A Statistical Model
[Diagram: explanatory variables X1, X2 and X3 pointing to the Y variable, a duration with censored observations.]
A Statistical Model
[Diagram: the explanatory variables family income, gender of baby, type of childcare and mother’s age (a continuous covariate) pointing to the Y variable, a duration with censored observations.]
For the keen..
Cox Proportional Hazard Model
$h(t) = h_0(t)\exp(bx)$
Cox Proportional Hazard Model
$h(t) = h_0(t)\exp(bx)$
• $h(t)$: hazard
• $h_0(t)$: baseline hazard (unknown)
• $\exp$: exponential
• $b$: estimate
• $x$: X var
For the very keen..
Cox Proportional Hazard Model
can be transformed into an
additive model
$\log h(t) = a(t) + bx$
Therefore…
For the very keen..
Cox Proportional Hazard Model
$\log h(t) = b_0(t) + b_1 x_1$
This should look distressingly
familiar!
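To spell out the step from the multiplicative to the additive form, here is a short derivation in the slides' notation. The covariate values $x_A$ and $x_B$ are hypothetical labels introduced only for this illustration.

$$
\begin{aligned}
h(t) &= h_0(t)\exp(bx) \\
\log h(t) &= \log h_0(t) + bx \\
\frac{h(t \mid x_B)}{h(t \mid x_A)} &= \frac{h_0(t)\exp(b x_B)}{h_0(t)\exp(b x_A)} = \exp\{b(x_B - x_A)\}
\end{aligned}
$$

So the time-dependent intercept is simply the log of the baseline hazard. The baseline cancels in the ratio, which is why the hazards are "proportional", and $\exp(b)$ is the hazard ratio for a one-unit difference in x.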
Define the code for the event
(i.e. 1 if occurred – 0 if censored)
Enter explanatory variables
(dummies and continuous)
Variables in the Equation
            B       SE     Wald      df   Sig.   Exp(B)
INC         1.282   .140   83.594    1    .000   3.605
GENDER      -.046   .118   .153      1    .696   .955
MUMAGE      .012    .010   1.258     1    .262   1.012
CHILDM      1.165   .151   59.157    1    .000   3.206
NURSERY     1.887   .157   144.903   1    .000   6.598
Key: rows = X var; B = estimate; SE = standard error; Wald = chi-square related; Exp(B) = un-logged estimate.
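The derived columns can be roughly reproduced from B and SE alone. The sketch below recomputes Exp(B) and the Wald statistic as (B/SE) squared, and adds an approximate 95% interval for Exp(B). The interval is not in the original output and is included here only as an illustration. Because the printed B and SE are rounded to three decimals, the recomputed Wald values will not match the table exactly.

```python
import math

# (B, SE) pairs copied from the "Variables in the Equation" output.
coefficients = {
    "INC": (1.282, 0.140),
    "GENDER": (-0.046, 0.118),
    "MUMAGE": (0.012, 0.010),
    "CHILDM": (1.165, 0.151),
    "NURSERY": (1.887, 0.157),
}

def summarise(b, se, z=1.96):
    """Un-log the estimate and attach a Wald statistic and approximate 95% interval."""
    return {
        "exp_b": math.exp(b),            # multiplicative effect on the hazard
        "wald": (b / se) ** 2,           # chi-square related, 1 df
        "ci": (math.exp(b - z * se), math.exp(b + z * se)),
    }

summary = {name: summarise(b, se) for name, (b, se) in coefficients.items()}

for name, row in summary.items():
    low, high = row["ci"]
    print(f"{name:8s} Exp(B)={row['exp_b']:.3f}  Wald={row['wald']:.1f}  "
          f"95% CI=({low:.2f}, {high:.2f})")
```

An Exp(B) above 1 means a higher hazard, and so a shorter expected spell, for that category or for each extra unit of the covariate. An interval that includes 1, as for GENDER, is consistent with no effect.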
What does this mean?
Our Y is the duration of childcare spell #2.
Note we are modelling the hazard!
Significant Variables
• Family income p<.001
• Gender of baby p=.696 (not significant)
• Mother’s age p=.262 (not significant)
• Childminder p<.001
• Nursery p<.001
Effects on the hazard
• Family income p<.001
£30K +
Up to £30K
• Childminder p<.001
• Nursery p<.001