Transcript UK11_welch

Testing the performance of the two-fold FCS
algorithm for multiple imputation of
longitudinal clinical records
Catherine Welch¹, Irene Petersen¹, Jonathan Bartlett²,
Ian White³, Richard Morris¹, Louise Marston¹, Kate Walters¹,
Irwin Nazareth¹ and James Carpenter²
¹ Department of Primary Care and Population Health, UCL
² Department of Medical Statistics, LSHTM
³ MRC Biostatistics, Cambridge
Funding: MRC
The Health Improvement Network (THIN)
primary care database
• GP records
• 9 million patients over 15 years in 450 practices
• Powerful data source for research into
coronary heart disease (CHD)
• Studies complicated by missing data
• Up to 38% of health indicator
measurements are missing
in newly registered patients¹
¹ Marston et al, 2010 Pharmacoepidemiology and Drug Safety
Partially observed data in THIN
• Missing data never intended
to be recorded
• Data recorded at irregular
intervals
• Non-monotone missingness pattern
Multiple Imputation (MI) and THIN
• Most MI designed for cross-sectional data
• Impute both continuous and discrete variables at
many time points
– Standard ICE using Stata struggles with this
• New method developed by Nevalainen et al
– Two-fold fully conditional specification (FCS) algorithm
– Imputes each time point separately
– Uses information recorded before and after time point
Nevalainen et al, 2009 Statistics in Medicine
A graphical illustration of the two-fold FCS
algorithm
Among-time iteration
Within-time iteration
Nevalainen et al, 2009 Statistics in Medicine
$f(X_{i,j}^{\mathrm{mis}} \mid X_{i-1}, X_{i,j}, X_{i+1}, Y_{i,j})$
Algorithm validation
• Nevalainen et al
– Proposed the two-fold FCS approach
– Validated algorithm using data sampled from a case-control study
– 3 time points included with a linear substantive model
• Our previous work
• Imputed data had accurate coefficients and
acceptable level of variation in these settings
Simulation
• Before we apply the algorithm to THIN we want to
test it in a complex setting similar to THIN
• Test algorithm in simulation study:
– Create 1000 full datasets
– Remove values
– Apply two-fold FCS algorithm
– Fit regression model for risk of CHD
• Full data
• Complete case data
• Imputed data
– Compare results
Advantages of using simulated data
• We know the original distributions so we can
compare with distribution of imputed data and test
for bias
• Create different scenarios to test the algorithm
• Design data so it is close to THIN data
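Because the true values are known in a simulation, bias can be checked by comparing the average estimate across the simulated datasets with the value used to generate them. The sketch below illustrates that comparison only. The function name and the three example estimates are hypothetical and are not results from this study.

```python
from statistics import mean, stdev

def summarise_replicates(estimates, true_value):
    """Summarise one coefficient estimated in many simulated datasets."""
    avg = mean(estimates)
    return {
        "mean": avg,                         # average estimate over replicates
        "bias": avg - true_value,            # distance from the generating value
        "empirical_se": stdev(estimates),    # spread of estimates over replicates
    }

# Hypothetical estimates of one log risk ratio from three replicate datasets.
summary = summarise_replicates([0.28, 0.29, 0.30], true_value=0.29)
print(summary)
```

In the study itself the summary would run over all 1000 datasets, once each for the full, complete case and imputed analyses.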
Simple dataset
• 5000 men, 10 years of data
• CHD diagnosis from 2000 – yes/no
• Age – 5 year age bands
• Smoking status recorded in 2000
– smokers, ex- and non-smokers
• Anti-hypertensive drug prescription – yes/no
• Systolic blood pressure (mmHg)
• Weight (kg)
• Townsend score quintile – 1 (least) to 5 (most)
• Registration – indicate if patient registered in 1999
Results from exponential regression model
• Outcome : Time to CHD
• Exposures in year 2000: age, Townsend score
quintile, weight, blood pressure, smoking status,
anti-hypertensive drug treatment, registration in
1999
• Analysis of 1000 datasets
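For reference, a common way to write an exponential regression model for time to CHD is given below. The slides do not state the parameterisation, so this proportional hazards form is an assumption. Here $t_i$ is the follow-up time for patient $i$, $d_i$ is 1 if CHD was observed and 0 otherwise, and $x_i$ holds the year 2000 exposures.

```latex
h(t \mid x_i) = \lambda_i = \exp(\beta_0 + x_i^{\top}\beta), \qquad
S(t \mid x_i) = \exp(-\lambda_i t)

\ell(\beta_0, \beta) = \sum_{i=1}^{n} \left[ d_i \log \lambda_i - \lambda_i t_i \right]
```

Under this form each element of $\beta$ is the log of a ratio of hazards, which would correspond to the "log risk ratio" reported in the results tables.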
Generated data results
Results of fitting exponential regression model
                                  THIN data       Full simulated data
Variable                          Log risk ratio  Log risk ratio  SE
Anti-hypertensive drug treatment  0.2935          0.2868          0.0957
Systolic blood pressure (mmHg)    0.0048          0.0049          0.0026
Weight (kg)                       0.0019          0.0019          0.0032
Smoking status
  Non-smoker                      Reference
  Ex-smoker                       0.0679          0.0692          0.1074
  Current smoker                  0.2386          0.2385          0.1143
Adjusted for age, registration in 1999 and Townsend score quintile
70% missing completely at random (MCAR) missingness mechanisms
• Missing data on blood pressure, weight, smoking
• In THIN:
– 30–70% missing in any given year
• E.g. 70% missing is equivalent to a health indicator being recorded
approximately every 3 years
– If one variable is missing, other variables are also more
likely to be missing
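Under MCAR, every value has the same chance of being removed, whatever its own value or the patient's other data. A minimal sketch of removing 70% of values this way is shown below. The function name, the seed and the example weights are hypothetical, and the sketch does not reproduce the THIN feature that variables tend to be missing together.

```python
import random

def make_mcar(values, prop_missing=0.7, seed=2011):
    """Return a copy of values with each entry set to None with equal probability."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    return [None if rng.random() < prop_missing else v for v in values]

# Hypothetical weight measurements (kg) for 10,000 patient-years.
weights = [60.0 + 0.004 * i for i in range(10_000)]
observed = make_mcar(weights)

# Proportion removed; close to 0.7, varying slightly with the random draw.
frac_missing = sum(v is None for v in observed) / len(observed)
print(round(frac_missing, 3))
```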
70% MCAR results
                                  THIN data  Simulated data
                                             Full data         Complete case
Variable                          Log RR     Log RR   SE       Log RR   SE
Anti-hypertensive drug treatment  0.2935     0.2868   0.0957   0.2852   0.1931
Systolic blood pressure (mmHg)    0.0048     0.0049   0.0026   0.0051   0.0055
Weight (kg)                       0.0019     0.0019   0.0032   0.0015   0.0062
Smoking status
  Non-smoker                      Reference
  Ex-smoker                       0.0679     0.0692   0.1074   0.0633   0.2151
  Current smoker                  0.2386     0.2385   0.1143   0.2307   0.2299
Log RR = log risk ratio
Adjusted for age, registration in 1999 and Townsend score quintile
Two-fold FCS algorithm
• Stata ICE – series of chained equations
• 3 among-time iterations, 10 within-time iterations
• Produce 3 imputed datasets
• 1 year time window
[Figure: 1 year time window shown on a timeline of time points i-3 to i+3]
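The nesting of among-time and within-time iterations can be sketched as a visiting schedule. The sketch below only lists the order in which variables at each time point would be imputed and which neighbouring time points fall inside the window. It fits no imputation model and is not the Stata ICE implementation. The function name and the variable names "sbp" and "weight" are hypothetical, and the defaults are the settings listed above.

```python
def two_fold_schedule(n_times, variables, among_iters=3, within_iters=10, window=1):
    """Yield (among, time, within, variable, neighbour_times) in visiting order."""
    for among in range(among_iters):              # among-time iterations
        for t in range(n_times):                  # one time point at a time
            lo = max(0, t - window)
            hi = min(n_times - 1, t + window)
            # Time points before and after t that supply predictor information.
            neighbours = [s for s in range(lo, hi + 1) if s != t]
            for within in range(within_iters):    # within-time iterations
                for var in variables:             # chained equations at time t
                    yield among, t, within, var, neighbours

# Ten yearly time points and two time-dependent variables.
steps = list(two_fold_schedule(n_times=10, variables=["sbp", "weight"]))
print(len(steps), steps[0], steps[-1])
```

A real implementation would replace each yielded step with a draw from the conditional model for that variable, given the other variables at time t and the measurements at the neighbouring time points.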
Imputing time-independent variables
• Algorithm designed to impute time-dependent
variables and does not account for imputing
time-independent variables
• Smoking status in 2000 is a time-independent
variable
• Need to extend algorithm for this
Imputing time-independent variables
• For each among-time iteration, time-independent
variables imputed first
Impute time-independent
variables
• Algorithm will then cycle through the time points with
smoking status included as an auxiliary variable.
Results following imputation
• We would expect to see similar log risk ratios to
the THIN data
• The standard errors for variables with no missing
data will be close to those from the full data
• The standard errors for variables with missing
data will be smaller than in the complete case analysis
but will not recover to the size of those from the full data
Results following imputation
                                  THIN data  Simulated data
                                             Full data         Complete case     Imputed data
Variable                          Log RR     Log RR   SE       Log RR   SE       Log RR   SE
Anti-hypertensive drug treatment  0.2935     0.2868   0.0957   0.2852   0.1931   0.2848   0.1066
Systolic blood pressure (mmHg)    0.0048     0.0049   0.0026   0.0051   0.0055   0.0050   0.0052
Weight (kg)                       0.0019     0.0019   0.0032   0.0015   0.0062   0.0023   0.0053
Smoking status
  Non-smoker                      Reference
  Ex-smoker                       0.0679     0.0692   0.1074   0.0633   0.2151   0.0654   0.2288
  Current smoker                  0.2386     0.2385   0.1143   0.2307   0.2299   0.2409   0.2453
Log RR = log risk ratio
Adjusted for age, registration in 1999 and Townsend score quintile
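Estimates from several imputed datasets are usually combined into one coefficient and one standard error using Rubin's rules. The slides do not say how the imputed data results were pooled, so treating Rubin's rules as the method here is an assumption. The function name and the input numbers below are hypothetical and are not taken from the table.

```python
from math import sqrt
from statistics import mean, variance

def pool_rubin(estimates, std_errors):
    """Combine per-imputation estimates and standard errors with Rubin's rules."""
    m = len(estimates)
    pooled = mean(estimates)
    within = mean([se ** 2 for se in std_errors])  # average within-imputation variance
    between = variance(estimates)                  # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)

# Hypothetical log risk ratios and standard errors from 3 imputed datasets.
pooled_est, pooled_se = pool_rubin([0.280, 0.290, 0.285], [0.10, 0.10, 0.10])
print(pooled_est, pooled_se)
```

The between-imputation term is why the pooled standard error exceeds the average standard error of the individual imputed datasets.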
Correlations
• Previous results imply accurate imputations for
missing data in 2000
• Alternative method required:
– Assess correlations between measurements recorded
at different times
• We would like the correlation structure of the
generated data to be maintained in the imputed data
at all time points
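One way to run this check is to correlate the weight measured in 2000 with the weight measured in each later year, once in the full data and once in the imputed data, and compare the two sets of values. The sketch below shows the calculation for a single dataset. The function names and the toy weights are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def correlations_with_baseline(weights_by_year, baseline=2000):
    """Correlate each year's weights with the weights in the baseline year."""
    base = weights_by_year[baseline]
    return {year: pearson(base, w) for year, w in sorted(weights_by_year.items())}

# Toy weights (kg) for four patients in three years.
toy = {
    2000: [70.0, 80.0, 90.0, 100.0],
    2001: [71.0, 79.0, 92.0, 101.0],
    2002: [75.0, 78.0, 95.0, 97.0],
}
corr = correlations_with_baseline(toy)
print(corr)
```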
Correlations
[Figure: correlation (0.0 to 1.0) of weight measured in each year, 2000 to 2009, with weight measured in 2000, for the full simulated data and the imputed simulated data]
Increase time window
• Increased the time window to 2 and 3 years
• This slightly improves the estimates of coefficients
and SE
[Figure: 2 year and 3 year time windows shown on a timeline of time points i-3 to i+3]
Increase time window
[Figure: correlation (0.0 to 1.0) of weight measured in each year, 2000 to 2009, with weight measured in 2000, for the full simulated data and for data imputed with 1, 2 and 3 year time windows]
In summary
• The two-fold FCS algorithm gives unbiased
imputations with:
– 70% missing data
– Exponential regression model, and
– MCAR missingness mechanisms
• The correlation structure is maintained as the time
window increases
Discussion
• Algorithm effective because each patient has at least one
measurement during follow-up
• Same results with missing at random (MAR)
• Future work…
– Introduce censoring
– Change smoking status to be time-dependent
– Interactions