Covariate information in complex event history data: some thoughts arising from a case study
Elja Arjas
Department of Mathematics and Statistics, University of Helsinki
and
National Public Health Institute (KTL)
Based on ongoing joint work with Olli Saarela and Sangita Kulathinal
Background and motivation:
• Assessment of risk factors of cardiovascular
diseases (e.g. coronary heart disease, stroke);
• Traditional approach for cohort analysis: hazard
regression model, with covariates (e.g. blood
pressure, cholesterol level, or body mass index)
measured only at the baseline;
• Adding “a genetic component”: usually
candidate loci, potentially causative on the basis
of the available information about their function.
Emphasis on causal ideas:
• Stressing probabilistic predictions: "How would the probability of the outcome change if a covariate had a different value?"
• Association vs. causation: the issue of confounding (change by intervention, "do"-conditioning, Pearl 2000).
Considering causal effects …
• Compare, e.g., the predictive probabilities of a future response y*,
p(y*|data, attrib*, hist*, do(exposure*′))
vs.
p(y*|data, attrib*, hist*, do(exposure*)),
for a generic individual "*" (or for an equivalence class of exchangeable individuals) characterized by the attributes and past history used in the conditioning (cf. Arjas and Parner 2004); the contrast is formalized below.
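A minimal formalization of this comparison, as one hedged reading of the slide: the symbol Δ* and the choice of a difference (rather than, say, a ratio) are illustrative and not notation from the source.

```latex
% Causal contrast between two hypothetical exposure settings for the
% generic individual "*": difference of two predictive probabilities.
\[
\Delta^{*} \;=\;
p\bigl(y^{*}\mid \mathrm{data},\,\mathrm{attrib}^{*},\,\mathrm{hist}^{*},\,
        \mathrm{do}(\mathrm{exposure}^{*\prime})\bigr)
\;-\;
p\bigl(y^{*}\mid \mathrm{data},\,\mathrm{attrib}^{*},\,\mathrm{hist}^{*},\,
        \mathrm{do}(\mathrm{exposure}^{*})\bigr)
\]
```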
Causal ideas…:
• Causal mechanisms can involve pathways that are
– direct, in the sense that they influence the outcome variable directly in the postulated model structure, or
– indirect, in that their effect on the outcome is mediated via the levels of the measured risk factors.
MORGAM study
• Evans et al. (2005)
• Individuals of different ages in a cohort are monitored for
– (fatal and non-fatal) occurrences of coronary heart disease (CHD) or stroke,
– death from other causes.
• Information on risk factors such as
– smoking status,
– blood pressure (BP),
– body mass index (BMI),
– total cholesterol and HDL cholesterol, and
– possible earlier occurrences (yes/no) of CHD or stroke
is collected at cohort baseline.
Genetic information…
• SNP (single nucleotide polymorphism) level genotype data from candidate loci, e.g. loci
– functionally connected to blood clotting,
– associated with cardiovascular diseases,
– associated with increased lipid levels.
• Due to the cost involved, genotyping is done only on
– all known cases of CHD or stroke, and
– individuals belonging to a random subset of the original cohort.
Information missing…
• There is
– no genetic information of any kind available on most members of the original cohort, and even for those belonging to the case-cohort set, only on the chosen candidate loci;
– no knowledge of early fatal occurrences of CHD or stroke from outside the cohort.
Graphical representation
[Figure: graphical model over time (age) t, showing the underlying covariate process Xi, the covariate measurement X̃i with its measurement-error variance, the event endpoint Yi, the candidate gene Gi, and the parameters of interest.]
Aspects to be considered...
1. Time:
– BMI, BP and cholesterol level do not remain constant over time: "individually varying stochastic processes".
– Even an accurate measurement at a particular time cannot be directly related to the endpoints as a "cause".
– The interpretation, and the value for a causal analysis, of covariate measurements made in the past will generally depend on how long ago they were measured.
Further aspects…
2. Feedback to covariate values from earlier events:
Covariate values of individuals who had experienced
a CHD event or stroke already before being
recruited to the cohort may have been influenced by
this event (e.g., the person quits smoking, changes
diet, or gets medication to lower blood pressure).
3. Influence of an earlier treatment:
After a first occurrence of non-fatal CHD or stroke,
the risk for later similar events or death is likely to be
more strongly influenced by the availability and
success of the acute medical treatment than by the
values of the measured risk factors/covariates.
Further aspects…
4. Potential confounding issue:
The considered candidate loci can influence both
the values of the measured covariates and those of
the outcome variables. If this is not properly
accounted for in the modelling and analysis of data,
they become a potential source of confounding in
an observational study.
Here also: what about the rest of the genome, outside the selected candidate loci?
Further aspects…
5. Large dimension of the parameter space:
The number of SNP-based polymorphisms present in the data generally far exceeds what would allow, given the amount of data, reliable estimation of the risks associated with individual genotypes.
Particularly problematic in this sense is the
MHC/HLA region.
Some shortcuts…
• Problem 2:
Ignore the current-status covariate information that may have been influenced by the earlier occurrence, keeping only information on covariates that do not change over time (age, sex, genotype).
• Problem 3:
Consider follow-up data only up to the first occurrence of CHD or
stroke.
• Problems 1, 4 and 5:
Try something more systematic: For problem 5, apply a
monotonicity postulate and consequent partial ordering of risks.
For problems 1 and 4, treat the missing covariate information in
a distributional form (using data augmentation and MCMC).
Problem 5: dimension
Partial ordering:
• The two variants (alleles) of a biallelic SNP are labeled as 0 and 1, with 0 for the "common" and 1 for the "rare" form;
• Within each gene (more generally, linkage group), arrange the sequence of SNP genotypes (pairs of the form 00, 01, 10 and 11, each determined from the same SNP locus) into haplotypes. (Alleles belonging to the same - maternal or paternal - chromosome form a haplotype.) A small coding illustration follows below.
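As a small illustration of this coding (a sketch assuming the phase, i.e. the maternal/paternal assignment of alleles, is known; real SNP data would usually require statistical phasing):

```python
# Sketch: coding biallelic SNPs as 0 (common) / 1 (rare) and reading
# the two haplotypes of one individual off the two chromosomes.
# Phase is assumed known here, which typical SNP data would not give.
maternal = [0, 1, 0]   # alleles at three linked SNP loci
paternal = [0, 0, 1]

# Unphased genotype pairs at each locus, as observed in SNP data:
genotypes = [(m, p) for m, p in zip(maternal, paternal)]  # (0,0), (1,0), (0,1)

# The two haplotypes of the individual, i.e. the diplotype:
diplotype = (tuple(maternal), tuple(paternal))
print(genotypes, diplotype)
```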
Problem 5: dimension (2)
• Denote by (−, ø, +) the "less risky", "neutral" and "more risky" allele, respectively.
• For each pair of alleles, there are three possibilities:
1. allele 0 is less risky than allele 1 (−+),
2. no effect (øø), and
3. allele 1 is less risky than allele 0 (+−).
• Postulate: this ordering of alleles is extendible to a partial ordering of haplotype risks. For example, haplotype h1 is "more risky" than haplotype h2 if all its alleles are either "more risky" or "neutral" compared to the corresponding alleles in h2, and at least one is "more risky".
• Haplotypes can then be classified into groups, each represented by a vector with elements chosen from {−, ø, +}. Modelling of risks is then done via such classes.
• Extend this partial ordering to a partial ordering between haplotype pairs (diplotypes). A sketch of the class comparison follows below.
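A minimal sketch of the postulated comparison rule, assuming each haplotype class is coded as a tuple over {'-', 'o', '+'} (the coding and the function name are illustrative, not from the source):

```python
# Sketch: component-wise partial ordering of haplotype risk classes.
# Each class is a tuple with elements from {'-', 'o', '+'}
# ('-' = less risky, 'o' = neutral, '+' = more risky allele).
RANK = {'-': 0, 'o': 1, '+': 2}   # risk order of the single-locus labels

def compare(h1, h2):
    """Return '>', '<', '=' or None (incomparable) for classes h1 vs. h2."""
    diffs = [RANK[a] - RANK[b] for a, b in zip(h1, h2)]
    if all(d == 0 for d in diffs):
        return '='
    if all(d >= 0 for d in diffs):
        return '>'    # no allele less risky, at least one more risky
    if all(d <= 0 for d in diffs):
        return '<'
    return None       # mixed directions: the classes are not ordered

print(compare(('+', 'o'), ('o', 'o')))   # '>'
print(compare(('+', '-'), ('-', '+')))   # None (incomparable)
```

Iterating compare over all nine two-locus classes reproduces the pattern of the comparison table on the next slide (with "·" for incomparable pairs).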
Problem 5: dimension (3)
Pairwise comparison of the class parameters β (entry: row class vs. column class; ">" = more risky, "<" = less risky, "·" = not ordered):

        β−−  β−ø  β−+  βø−  βøø  βø+  β+−  β+ø  β++
β−−      =    <    <    <    <    <    <    <    <
β−ø      >    =    <    ·    <    <    ·    <    <
β−+      >    >    =    ·    ·    <    ·    ·    <
βø−      >    ·    ·    =    <    <    <    <    <
βøø      >    >    ·    >    =    <    ·    <    <
βø+      >    >    >    >    >    =    ·    ·    <
β+−      >    ·    ·    >    ·    ·    =    <    <
β+ø      >    >    ·    >    >    ·    >    =    <
β++      >    >    >    >    >    >    >    >    =
Problem 5: dimension (4)
Problem 5: dimension (5)
[Figure: graphical model with the event endpoint Yi, the diplotype Di, the genotype Gi, the population haplotype frequencies D, the ordering A of the alleles of the causal loci, the number N and the locations L of the causal loci, and the restrictions on the parameters implied by the allele ordering.]
Problem 1: time
• Regression dilution:
Measuring time-dependent and individually varying covariates (such as BP, cholesterol level and BMI) at a single time point generally leads to an underestimation of the effect size; a small simulation of this attenuation follows below.
• But what should one do if, for each individual, there is only a single covariate measurement in the data?
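A minimal simulation of the attenuation, under stated assumptions: a linear model stands in for the hazard regression, and the variance split between the true level and the measurement noise is an arbitrary illustrative choice.

```python
# Sketch: regression dilution. A single noisy measurement of a
# time-varying covariate attenuates the estimated effect toward zero.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
beta = 1.0                    # true effect of the long-term covariate level
x = rng.normal(0.0, 1.0, n)   # true (unobserved) long-term level
e = rng.normal(0.0, 0.7, n)   # within-person variation + measurement error
x_obs = x + e                 # the single baseline measurement
y = beta * x + rng.normal(0.0, 1.0, n)

# Least-squares slope of y on the noisy measurement:
slope = np.cov(x_obs, y)[0, 1] / np.var(x_obs)
print(slope)   # ≈ beta * 1.0 / (1.0 + 0.7**2) ≈ 0.67, not the true 1.0
```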
Problem 1: time (2)
• Modelling the underlying covariate process:
– For dealing with time-dependent covariates in an explicit form, one needs a generator (stochastic intensities) for the covariate process, considered as a function of pre-t histories, as well as corresponding stochastic intensities for the endpoint (T; X) itself.
– One possibility is to apply the Marked Point Process (MPP) framework. The considered endpoint, with a corresponding description of the outcome, can then be embedded into this process in a natural way as a marked point (T; X); a simulation sketch follows below.
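One hedged way to make this concrete: simulate a piecewise-constant covariate process jointly with an event time whose intensity depends on the current covariate value. All rates, the log-linear intensity, and the parameter values below are invented for illustration.

```python
# Sketch: MPP-style simulation. The covariate X(t) jumps at Poisson
# times; the endpoint arrives with intensity lambda(t) = exp(a + b*X(t)),
# a function of the pre-t history. Values are illustrative only.
import numpy as np

rng = np.random.default_rng(2)
a, b = -4.0, 0.5        # baseline log-intensity and covariate effect
jump_rate = 0.2         # rate of covariate jumps (per time unit)

def simulate(t_max=80.0):
    t, x = 0.0, rng.normal()        # start of follow-up, initial covariate
    while t < t_max:
        lam = np.exp(a + b * x)     # endpoint intensity given current X(t)
        # Minimum of two exponential waiting times (covariate jump vs. event):
        t += rng.exponential(1.0 / (lam + jump_rate))
        if t >= t_max:
            return t_max, 0                      # censored at end of follow-up
        if rng.random() < lam / (lam + jump_rate):
            return t, 1                          # the marked point: endpoint at T
        x += rng.normal(0.0, 0.3)                # otherwise the covariate jumps

print([simulate() for _ in range(3)])
```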
Problem 1: time (3)
• Measurement error:
If the covariate measurements also involve random error, a measurement model is needed. Its parameters can be estimated if additional data are available on the progression of the covariates.
• Numerical implementation:
Using MCMC and data augmentation methods – but the practical implementation can be difficult.
• Dependence of the covariates on genotype information?
Fortunately, only long-time averages of the covariates are likely to be of importance for the considered endpoints. But the potential confounding problem remains.
Problem 4: missing data, confounding…
• Genetic factors are potential confounders in causal
questions. If the relevant genotype information is known
and its role has been properly accounted for in the
statistical model, this problem can be dealt with by
proper conditioning on such information.
• But what to do when a majority of the cohort members,
as in MORGAM, have not been genotyped?
• Usual solution: restrict the analysis only to those
individuals who have been genotyped. But then the
relevant follow-up and covariate information that exists
on the other cohort members will not be used in the
analysis at all.
Problem 4: missing data, confounding…
• Treat problem 4, too, as a missing data problem, considering a probability model for the missing genotypes and applying "full likelihood" and Bayesian inference (Kulathinal and Arjas 2006; cf. Scheike and Martinussen 2004). This solution involves considering the unknown genotypes in a distributional form.
• Note, however, that a person's genotype, the measured risk factors and the phenotype (time to event and event type) may all be statistically dependent on each other. Therefore the likelihood contribution from an individual who has not been genotyped involves an integration with respect to a (conditional) genotype distribution (which is generally different for different individuals); a sketch of this marginalization follows below.
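A minimal sketch of that marginalization, with stated assumptions: the exponential-hazard toy likelihood, the three-level genotype coding, and both functions below are illustrative stand-ins, not the model of Kulathinal and Arjas (2006).

```python
# Sketch: likelihood contribution of a non-genotyped individual,
# obtained by summing over the unknown genotype g with weights
# p(g | covariates), the conditional genotype distribution.
import numpy as np

def lik_given_genotype(time, event, covariates, g, theta):
    """Placeholder event-history likelihood p(time, event | covariates, g)."""
    rate = np.exp(theta[0] + theta[1] * g + theta[2] * covariates)
    return rate**event * np.exp(-rate * time)   # exponential-hazard toy model

def genotype_probs(covariates, phi):
    """Placeholder conditional genotype distribution p(g | covariates)."""
    logits = phi * np.array([0.0, 1.0, 2.0]) * covariates
    w = np.exp(logits - logits.max())
    return w / w.sum()

def lik_missing_genotype(time, event, covariates, theta, phi):
    # Marginalize: sum_g p(time, event | covariates, g) * p(g | covariates)
    return sum(lik_given_genotype(time, event, covariates, g, theta) * p
               for g, p in enumerate(genotype_probs(covariates, phi)))

print(lik_missing_genotype(time=10.0, event=1, covariates=0.5,
                           theta=(-3.0, 0.4, 0.2), phi=0.3))
```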
Problem 4: missing data, confounding…
• In general, and depending on the information
available, one can consider different levels of
conditioning in the predictive probabilities
p(y*|data, attrib*, hist*, do(exposure*’)).
• Depending on such a level, the interpretation of
the results from causal analysis will differ, with
more detailed conditioning taking us closer to
“individual causal effect” - which, however, can
never be achieved by a statistical analysis of
data.
Problem 4: missing data, confounding…
• More detailed conditioning is also attractive as a recipe against potential confounders (the "no unmeasured confounders" postulate).
• Playing with finer-level conditioning by using latent variable modelling can be attractive, but also risky if there is very little data, noisy data, or no data at all to support such modelling efforts.
• In essence, such finer-level predictive probabilities are calibrated against data that are actually observed.
"Take-home" messages:
• Careful consideration of sources of
information is important.
• Interpretation of results is often facilitated
by establishing intuitive links to causal
”what if” ideas (”do”-conditioning).
• Less emphasis on inference (particularly
statistical significance testing) concerning
individual regression coefficients.
"Take-home" messages: (2)
• A general modelling approach based on MPPs is useful, offering possibilities to consider conditioning of probabilities on different levels of information.
• The Bayesian approach, with MCMC for the numerical computations, provides a flexible framework for statistical inference, keeping it within the domain of probability.