Pubertal Timing and the Stages of Drug Use Stephanie Hyatt

Download Report

Transcript Pubertal Timing and the Stages of Drug Use Stephanie Hyatt

ABSTRACT
Discrete-time multi-level survival analysis is very useful in prevention research due to
the longitudinal and clustered nature of the data. Implementing a discrete-time multilevel survival model is generally straightforward using common software such as SAS,
MLN, or HLM. However, we may draw erroneous conclusions and policy implications if
the model fits poorly. Model checking for the above type of model is not well developed,
although it is very important for obtaining proper results. We present a graphical method
for assessing the fit of a discrete-time multi-level survival model. The method verifies
that the predicted probability of the event, based on our model, is approximately the
estimated probability of this same event based on the data. The method involves looking
at a plot while knowing what the plot would look like under the circumstances of a well
fitting model. Model selection can be achieved by comparing overall fit using the plots.
Using visual model checking techniques allows one to quickly draw conclusions. We
demonstrate the ease of using our technique with SAS.
Goal
 An example of a model is
log(ptjk/(1-ptjk)) = b0 + b1 * Xjk + b2 * X1jk + b3 * X2tjk + b4 * t
b0 = g00 + g01 * Wk + uk
b1 = g10
b2 = g20
b3 = g30
b4 = g40 ,
where Xjk and X1jk are time invariant individual level predictors, X2tjk is a time varying
individual level predictor, Wk is a time invariant group level predictor, t equals the time period,
and uk is a group level error term.
 We wish to check how well our model fits the data using the observed quantities ytjk and the
corresponding model based quantities ptjk’.
Procedure
Setting
 Discrete-time multi-level survival data consists of observations for each person at each
time period up until that person has the event or is censored.
 Each observation includes an indicator variable y that shows whether the event
occurred, an identification variable for the group the individual belongs to, the individual
and group level predictors (values may or may not change over time), and a time
variable.
 The outcome variable ytjk equals one if person j in group k had the event at time t, zero
otherwise.
 Discrete-time multi-level survival analysis models ptjk, a (conditional) probability that
ytjk = 1. This probability is called the hazard.
 Following model fitting, consider a cluster of observations with predicted hazard close to,
say, .08. If the true hazards corresponding to those observations truly equal .08, then the
proportion of 1’s in the corresponding cluster of ytjks should equal .08.
 We use the Loess procedure in SAS as an approximation of this proportion, obtaining a
value we term loessy for each observation. Hence in a well fitting model, we expect
e = loessy – p’ to be close to 0.
 We plot a standardized version of e as a residual quantity in the graphs. We find in
repeated simulations that when fitting the correct model, approximately 90 percent of these
residuals lie between –2 and 2.
Determining Model Fit
Indicators of a well fitting model are
 a small percentage of residuals outside of -2 to 2.
 residuals lie uniformly in a band.
EXAMPLES
Indicators of a poor fitting model are:
 a large percentage of residuals outside of -2 to 2.
 residuals do not lie in a band, but form a curve.
LEGEND: stdres = standardized residual. perc = proportion of residuals outside of [-2,2]. The horizontal axis is
the predicted probabilities, equally spaced. The corresponding estimated model is located below each plot.
logit(p) = log(p/(1-p)). Note: Unless otherwise noted, the plots use observations from all time periods.
A comparison of plots corresponding to potential models may reveal the better fitting model.
Q: Should a time-invariant individual level predictor be included in
the model?
The data is generated by logit(p) = -3 + 1*X + 2*X1 + .03*t + u.
The model on the right omits X1, a significant time-invariant
individual level predictor.
The u-shape of the residuals in the right hand plot and the
relative increase in the number of residuals outside [-2,2]
compared with the left hand plot indicate that X1 should be
considered for inclusion in the model.
Q: Should a time-varying individual level predictor be in the model?
The data is generated by logit(p) = -3 + 1*X1 + 1.5*X2 + .03*t + u.
The model on the right omits X2, a significant individual level timevarying predictor.
As above, the u-shape of the residuals in the right hand plot and
the relative increase in the number of residuals outside [-2,2]
compared with the left hand plot indicate that X2 should be
considered for inclusion in the model.
Q: Should a time-invariant group level predictor be in the model?
The data is generated by logit(p) = -3 + 1*X + 1.5*W + .02*t + u.
The model on the right omits W, a significant group level timeinvariant predictor.
The non-uniform shape of the residuals in the right hand plot and
the large increase in number of residuals outside of [-2,2]
compared with the left hand plot indicate that W should be
considered for inclusion in the model.
Q: Is the inclusion of a group level error term contributing to the
predicted hazard?
The data is generated by logit(p) = -1 + 1*X + u.
The model on the right omits u, a group level error term.
The large increase in the number of residuals outside [-2,2] in the
right hand plot suggests including the group level error term in the
model.
Q: What is a reasonable form for the baseline hazard?
The data is generated by logit(p) = -1 + 1*X + .4*t - .06*t*t + u.
The model on the right omits t*t, the time-squared term.
 It seems that the time-squared term is important. However, it is
more useful to look at plots that do not use all time periods
simultaneously, as the following two plots show.
The data is the same as in the previous example.
These plots use only observations corresponding to time periods
5 and later.
The squared term is needed in the model, based on the residual
plots. Linear and quadratic baseline hazards can be similar in
shape for earlier time periods, but for later time periods when the
true hazard is small, addition of a quadratic term is necessary to
get a good fit.
CONCLUSIONS
The approach described here provides the prevention researcher with a graphical means of verifying
model fit and justifying predictor inclusion based on its contribution to predicting event occurrence.
Over numerous simulated data sets, we find that the method detects significant, time-varying and timeinvariant individual level predictors and group level predictors, and ascertains an accurate form for the
baseline hazard.
In the case of a constant baseline hazard, omission of a group level error term can be detected. Future
research will address the non-constant case.
Future work includes plots of residuals versus predictors, martingale residual plots, and the case of when
all the predicted hazards very small.
References
Kay, R. & S. Little (1986). Assessing the Fit of the Logistic Model:
A Case Study of Children with the Haemolytic Uraemic Syndrome.
Applied Statistics, 36, 16-30.
Barber, J.S., S.A. Murphy, W.G. Axinn, & J. Maples (2000).
Discrete-time Multilevel Survival Analysis. Sociological Methodology, 30,
201-235.
This research was supported by the National Institute on Drug Abuse
Grant # 2 P50 DA10075