Combining administrative and survey data
in a study of low birth weight and air pollution
Chris Jackson
With Nicky Best and Sylvia Richardson
Department of Epidemiology and Public Health
Imperial College, London
NCRM BIAS node
http://www.bias-project.org.uk
BIAS: Biases in observational studies
Promote principled methods for accounting for potential
biases in observational data:
“non-response” bias:
selection bias (participation in a study)
missing data (on some variables for one individual)
confounding (important variables not available)
ecological bias (from aggregate / area-level data)
measurement error
Naïve methods not normally appropriate.
Alleviating biases
Suitable statistical models for the processes
underlying the data
Express uncertainty about biases as
probability distributions.
Uncertainty carries through to the results
Bayesian graphical models
Software, e.g. WinBUGS
Using multiple data sources to inform about the
potential biases
Application areas
Small area estimation (with Virgilio Gómez Rubio)
Using combination of aggregate (e.g. census) and
individual survey data
Selection bias in case-control and survey studies (with
Sara Geneletti)
Using directed acyclic graphs
Inference from combining datasets of different designs
from different sources (with Chris Jackson, Jassy
Molitor)
Using Bayesian hierarchical / graphical models
See (http://www.bias-project.org.uk)
Example: low birth weight and air pollution
Does exposure to air pollution during pregnancy
increase the risk of low birth weight?
Example illustrates various biases.
Combine datasets with different strengths:
Survey data (Millennium Cohort Study): small, great individual detail.
Administrative data (national births register): large, but little individual detail.
Single underlying model assumed to govern both
datasets: elaborate as appropriate to handle biases
Low birth weight
Important determinant of future health, and a population
health indicator.
Established risk factors:
Tobacco smoking during pregnancy.
Ethnicity (South Asian, issue for UK data)
Maternal age, weight, height, number of previous
births.
Role of environmental risk factors, such as air
pollution, less clear.
Various studies around the world suggest a link.
Exposure to urban air pollution correlated with
socioeconomic factors, ethnicity, tobacco
smoking → confounding
Data sources (1): Millennium Cohort Study
About 15,000 births in the UK between Sep 2000
and August 2001 (we study only England and Wales, singleton births)
Postcode made available to us under strict security
Match individuals with annual mean
concentration of certain air pollutants (PM10,
NO2, CO, SO2) (NETCEN)
Birth weight, and reasonably complete set of
confounder data available
Allows a reasonable analysis, but issues remain:
Low power to detect small effect → could be
improved by incorporating other data.
Selection bias.
Selection of Millennium Cohort
[Figure: tree of selection probabilities. ALL UK WARDS are split by country (ENGLAND, SCOTLAND, WALES, NORTHERN IRELAND) and then by stratum. Stratum probabilities in extraction order: high child poverty 0.04; low child poverty 0.02; high ethnic minority 0.11; high child poverty 0.07; low child poverty 0.04; high child poverty 0.18; low child poverty 0.06; high child poverty 0.16; low child poverty 0.08.]
Selection bias in the Millennium Cohort
Survey disproportionately represents population.
If selection probability related to exposure and
outcome, then estimate of association biased.
Ethnicity / child poverty probably related to both
pollution exposure and low birth weight.
Accounting for selection bias:
Adjust model for all variables affecting selection, or
Weight cases by inverse probability of selection
Cluster sampling → within-ward correlations:
for correct standard errors, use a hierarchical
(multilevel) model with groups defined by wards.
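The inverse-probability weighting idea above can be illustrated with a small simulation. This is a minimal sketch only: the two strata, their risks and their selection probabilities are hypothetical numbers chosen for illustration, not MCS values, and the helper `weighted_proportion` is an assumed name. It estimates a simple prevalence rather than the pollution association studied here.

```python
import random

def weighted_proportion(outcomes, selection_probs):
    """Proportion of 1s, weighting each case by 1 / (its selection probability)."""
    weights = [1.0 / p for p in selection_probs]
    return sum(w * y for w, y in zip(weights, outcomes)) / sum(weights)

random.seed(1)

# Hypothetical population: two equally sized strata with different risks.
# The higher-risk stratum is oversampled (0.16 versus 0.04).
strata = [
    {"n": 50000, "risk": 0.10, "p_select": 0.16},
    {"n": 50000, "risk": 0.05, "p_select": 0.04},
]

sample_y, sample_p = [], []
for s in strata:
    for _ in range(s["n"]):
        y = 1 if random.random() < s["risk"] else 0
        if random.random() < s["p_select"]:  # unequal-probability selection
            sample_y.append(y)
            sample_p.append(s["p_select"])

naive = sum(sample_y) / len(sample_y)               # ignores the sampling design
weighted = weighted_proportion(sample_y, sample_p)  # corrects for oversampling
print(f"naive={naive:.3f} weighted={weighted:.3f} (population rate is 0.075)")
```

The naive estimate is pulled towards the oversampled stratum, while the weighted estimate should sit close to the population rate. Weighting does not address the within-ward correlation, which still calls for a multilevel model or design-based standard errors.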
Data sources (2): National birth register
Every birth in the population recorded.
Individual data with postcode (→ pollution exposure)
and birth weight available to us under strict security.
Social class and employment status of parents also
available for a 10% sample.
We study only this 10% sample: 50,000 births
between Sep 2000 and Aug 2001.
Larger dataset, no selection bias,
…but no confounder information, especially
ethnicity and smoking.
Data sources (3): Aggregate data
Ethnic composition of the population
2001 census
for census output areas (~500 individuals)
Tobacco expenditure
consumer surveys (CACI, who produce ACORN
consumer classification data)
for census output areas.
…linked by postcode to Millennium Cohort and
national register data.
[Figure: Birth weight and pollution (source: MCS)]
[Figure: Birth weight and ethnicity (source: MCS)]
[Figure: Birth weight and smoking (source: MCS)]
[Figure: Pollution and confounders (source: MCS)]
Models for formally analysing combined data
Want estimate of the association between low
birth weight and pollution, using all data,
accounting for:
Selection bias in MCS
Adjust models for all predictors of selection
Or weight by inverse probability of selection
Missing confounders in register
Bayesian graphical model…
Graphical model representation
[Diagram: nodes POLLi, ETHi and LBWi for baby i in the register, and POLLj, ETHj and LBWj for baby j in the MCS, linked through a shared MODEL node. Nodes are marked as known or unknown.]
LBWi: low birth weight
POLLi: pollution exposure (plus other confounders observed in both datasets)
ETHi: ethnicity and smoking. Only observed in the MCS.
Same MODEL assumed to govern both datasets.
Adding in the imputation model
[Diagram: nodes AGGi, ETHi, POLLi and LBWi for baby i in the register, and AGGj, ETHj, POLLj and LBWj for baby j in the MCS, with two model nodes: MODEL (imputation) and MODEL (LBW).]
AGGi: aggregate ethnicity/smoking data for area of residence of baby i
MODEL for imputation of ETHi in terms of aggregate data and other variables.
Estimate it from observed ETHj in the MCS.
Bayesian model
Estimate both:
Imputation model for missing ethnicity and smoking
Outcome model for the association between low
birth weight and pollution.
All beliefs about unknown quantities expressed as
probability distributions.
Prior distributions (often ignorance) modified in light of
data → posterior distributions
Joint posterior distribution of all unknowns estimated by
Markov Chain Monte Carlo (MCMC) simulation
(WinBUGS software)
Graphical representation of the model guides the
MCMC simulation.
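To give a feel for how MCMC simulation approximates a posterior distribution, here is a minimal random-walk Metropolis sketch for a single unknown proportion with a flat prior. It is a generic illustration, not the sampling scheme WinBUGS uses for this model, and the data (7 low-weight births out of 100) are hypothetical.

```python
import math
import random

def log_posterior(theta, k, n):
    """Log posterior (up to a constant) of a binomial proportion, flat prior."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # zero posterior density outside (0, 1)
    return k * math.log(theta) + (n - k) * math.log(1.0 - theta)

def metropolis(k, n, n_iter=20000, step=0.05, seed=0):
    """Random-walk Metropolis sampler; returns draws after 25% burn-in."""
    rng = random.Random(seed)
    theta = 0.5
    current = log_posterior(theta, k, n)
    draws = []
    for _ in range(n_iter):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, k, n)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        draws.append(theta)
    return draws[n_iter // 4:]

draws = metropolis(k=7, n=100)  # hypothetical data
posterior_mean = sum(draws) / len(draws)
exact_mean = (7 + 1) / (100 + 2)  # mean of the exact Beta(8, 94) posterior
print(f"MCMC mean={posterior_mean:.3f}, exact mean={exact_mean:.3f}")
```

In this toy case the exact posterior is known, so the simulated mean can be checked against it. In the combined model no closed form is available, which is why simulation is used.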
Variables in the final models:
(1) regression model for low birth weight
Probability baby i has birth weight under 2.5 kg
modelled in terms of
Pollution (NO2 and SO2)
Ethnicity (White / South Asian / Black / other)
Smoking during pregnancy (yes/no)
Social class of mother
Survey selection strata (for MCS data)
Other variables not significant in multiple regression, or
not confounded with pollution (mother’s weight, height,
maternal age, number of previous births, hypertension
during pregnancy,…)
Variables in the final models:
(2) imputation model for missing data
Probability baby i is in one of eight categories:
ethnicity 1. White / 2. South Asian / 3. Black / 4. other
smoking during pregnancy 1. No / 2. Yes
Modelled in terms of small-area variables for baby i:
Proportion of population in each of three ethnic
minority categories (South Asian / Black / other)
Tobacco expenditure
MCS survey selection strata
…and some individual-level variables for baby i.
Pollution exposure
Low birth weight
Social class, employment status of mother.
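One way to write the two sub-models down is sketched below. The logit and multinomial-logit forms, and all symbols (β, γ, z_i, the indexing of categories), are assumptions made for illustration; the slides only list which variables enter each model.

```latex
% (1) Outcome model: low birth weight for baby i (assumed logistic form)
\operatorname{logit} \Pr(\mathrm{LBW}_i = 1)
  = \beta_0
  + \beta_{\mathrm{NO_2}}\,\mathrm{NO_2}_{,i}
  + \beta_{\mathrm{SO_2}}\,\mathrm{SO_2}_{,i}
  + \beta^{\mathrm{eth}}_{e(i)}
  + \beta^{\mathrm{smk}}\,\mathrm{SMK}_i
  + \beta^{\mathrm{class}}_{c(i)}
  + \beta^{\mathrm{strat}}_{s(i)}

% (2) Imputation model: joint ethnicity/smoking category k = 1,...,8
%     (assumed multinomial-logit form, category 1 as reference)
\Pr(\mathrm{ETH}_i = k)
  = \frac{\exp(\gamma_k^{\top} z_i)}{\sum_{l=1}^{8} \exp(\gamma_l^{\top} z_i)},
  \qquad \gamma_1 = 0
```

Here e(i), c(i) and s(i) denote the ethnicity, social class and selection stratum of baby i, and z_i collects the small-area and individual-level covariates listed above. For register babies ETH_i is unobserved and is treated as an unknown to be imputed.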
Odds ratios (posterior mean, 95% CI)

Data                              NO2 *              SO2 *              Smoking            South Asian
Register, ignore confounding      1.20 (1.13,1.27)   1.03 (1.00,1.07)   -                  -
MCS                               1.04 (0.89,1.21)   1.04 (0.96,1.12)   2.00 (1.71,2.34)   2.76 (2.14,3.56)
MCS, ignore selection             1.08 (0.94,1.23)   1.04 (0.96,1.12)   2.00 (1.71,2.34)   3.01 (2.42,3.74)
Register + MCS                    0.97 (0.91,1.03)   1.01 (0.97,1.05)   1.94 (1.80,2.10)   2.92 (2.61,3.26)
Register, adjust for confounding  0.97 (0.91,1.04)   1.01 (0.97,1.07)   1.94 (1.76,2.12)   2.93 (2.57,3.33)

*One unit of pollution concentration = interquartile range of pollution
concentration across England and Wales
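Reporting odds ratios per interquartile range amounts to rescaling the log-odds coefficient before exponentiating. A minimal sketch, with a hypothetical coefficient and interquartile range (neither is taken from the study):

```python
import math

def odds_ratio_per_iqr(beta_per_unit, iqr):
    """Odds ratio for an increase of one interquartile range in exposure.

    beta_per_unit: log odds ratio per one raw unit of concentration.
    iqr: interquartile range of the exposure, in the same raw units.
    """
    return math.exp(beta_per_unit * iqr)

# Hypothetical values: log odds ratio 0.01 per raw unit, IQR of 10 raw units.
or_iqr = odds_ratio_per_iqr(0.01, 10.0)
print(f"odds ratio per IQR = {or_iqr:.3f}")
```

Scaling by the interquartile range puts pollutants measured on different scales on a comparable footing, which is why the NO2 and SO2 columns can be read side by side.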
Conclusions so far
Combining the datasets can
increase power
alleviate bias due to confounding
No evidence for association of pollution
exposure with low birth weight.
Work in progress
Sensitivity to different choices for the imputation model
External data (e.g. small-area data) on confounders
not always available
More investigation of selection bias, and different ways
of accounting for it
Quantify relative influence of each dataset
Other biases, expected to be a smaller problem
Missing data in MCS
Exposure measurement error
Distinguish between preterm birth and low full-term
birth weight.
Other kinds of data synthesis
Aggregate (ecological) data
Administrative data usually aggregated to preserve confidentiality
Make inferences on individual-level risk factors and outcomes using
aggregate data: “Ecological bias” caused by
within-area variability of risk factors
confounding caused by limited number of variables.
Needs appropriate models, and often individual data:
survey/cohort data, case-control data.
Combining aggregate and individual data:
can reduce ecological bias and increase power
distinguish contextual effects from individual.
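The role of within-area variability in ecological bias can be seen in a small numeric sketch. It assumes a binary risk factor and a logistic individual-level model; the baseline risk of 0.05 and odds ratio of 3 are hypothetical, and both function names are illustrative.

```python
import math

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

def area_rate(prop_exposed, alpha, beta):
    """True area-level rate: average of individual risks over exposed and unexposed."""
    return (1.0 - prop_exposed) * expit(alpha) + prop_exposed * expit(alpha + beta)

def naive_ecological_rate(prop_exposed, alpha, beta):
    """Rate from plugging the area proportion into the individual-level model."""
    return expit(alpha + beta * prop_exposed)

alpha = math.log(0.05 / 0.95)  # hypothetical baseline risk of 0.05
beta = math.log(3.0)           # hypothetical individual-level odds ratio of 3

for q in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(q, round(area_rate(q, alpha, beta), 4),
          round(naive_ecological_rate(q, alpha, beta), 4))
```

The two curves agree only when everyone in the area has the same exposure (proportion 0 or 1). In between, the true area rate is linear in the proportion exposed, so fitting the individual-level logistic form directly to area proportions gives a biased coefficient.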
Publications
Our papers, presentations and software available from
http://www.bias-project.org.uk
C. Jackson, N. Best, S. Richardson. Hierarchical related regression
for combining aggregate and survey data in studies of socio-economic
disease risk factors. Under revision, Journal of the Royal Statistical
Society, Series A.
C. Jackson, N. Best, S. Richardson. Improving ecological inference
using individual-level data. Statistics in Medicine (2006) 25(12):2136-2159.
C. Jackson, S. Richardson, N. Best. Studying place effects on health
by synthesising area-level and individual data. Submitted.