Transcript: Multiple Imputation

Multiple Imputation
Julia Kozlitina
Steve Robertson
April 26, 2006
Outline
- Multiple Imputation (MI)
- How to impute (i.e., how to fill in values)
- How to analyze and draw inferences
- How many times to impute
- Alternatives to MI
- Applications
- Software
Multiple Imputation
- Idea: replace each missing item with 2 or more acceptable values, representing a distribution of possibilities (Rubin, 1987).
- This results in m complete datasets; each one is analyzed using standard methods, and the estimated parameters are averaged.
- Imputations can often be generated from simple modifications of existing single-imputation methods such as hot-deck or regression.
Dataset with m imputations
[Diagram: a survey dataset with N units and k variables, in which each missing entry is replaced by a row vector of m imputed values, one under each imputation model.]
- MI is most useful when the fraction of values missing is not excessive and when m is modest (say 2 to 10)
Advantages:
1. Allows the use of standard complete-data methods
2. Can incorporate the data collector's knowledge to reflect the uncertainty about imputed values (sampling variability and uncertainty about the reasons for nonresponse)
3. Increases the efficiency of estimation
4. Provides valid inferences (for variance estimators) under an assumed model for nonresponse
5. Allows one to study sensitivity to various models
Disadvantages:
1. More work is needed to generate multiple imputations
   - Often not difficult to implement using the existing single-imputation scheme
2. More space is needed to store the data
3. More work is required to analyze the data (not serious when m is modest)
   - Often not difficult to implement using standard statistical programs
How to fill in the values:
- Bayesian perspective (Rubin, 1987): draw multiple imputations to simulate the Bayesian posterior distribution of the missing values, that is, the conditional distribution of the missing data given the observed data,
  $\Pr(Y_{mis} \mid X, Y_{obs}, R_{inc}, I)$
  where the subscript obs denotes the set of observed values, the subscript inc denotes the set of units included in the sample, and $I$ is an indicator for inclusion
How to fill in the values:
- Impose a probability model on the complete data and the nonresponse mechanism (e.g., a normal regression or loglinear model)
- Create imputations through a two-step Bayesian process (a minimal sketch follows below):
  1. Specify prior distributions and draw the unknown model parameters from their posterior, and
  2. Simulate m independent draws from the conditional distribution of the missing data given the observed data and the drawn parameters
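To make the two-step draw concrete, here is a minimal Python/numpy sketch for a single continuous variable under a normal model with the standard noninformative prior; the variable names and the simple setting are illustrative assumptions, not an example from the slides or from Rubin (1987).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data with some values missing (names and values are hypothetical)
y = rng.normal(50.0, 5.0, size=40)
miss = rng.random(40) < 0.25            # missingness indicator
y_obs = y[~miss]
n_mis = int(miss.sum())

def draw_imputation(y_obs, n_mis, rng):
    """One MI draw for a univariate normal model with the standard
    noninformative prior p(mu, sigma^2) proportional to 1/sigma^2."""
    n = y_obs.size
    ybar, s2 = y_obs.mean(), y_obs.var(ddof=1)
    # Step 1: draw the unknown parameters from their posterior
    sigma2 = (n - 1) * s2 / rng.chisquare(n - 1)     # scaled inverse-chi-square draw
    mu = rng.normal(ybar, np.sqrt(sigma2 / n))
    # Step 2: draw the missing values given the drawn parameters
    return rng.normal(mu, np.sqrt(sigma2), size=n_mis)

m = 5
imputations = [draw_imputation(y_obs, n_mis, rng) for _ in range(m)]
```

Each element of `imputations` fills in the missing values once; redrawing the parameters for every imputation is what makes the procedure "proper" in Rubin's sense rather than a repeated single imputation.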
How to fill in the values:
- This requires deriving the posterior distribution; in simple problems, closed-form solutions exist
- In more complex applications, rely on special computational techniques such as Markov chain Monte Carlo (MCMC)
- Other possibilities: the approximate Bayesian bootstrap (Rubin, 1987)
- Modeling propensity scores to form sampling groups (Lavori et al., 1995)
Approximate Bayesian Bootstrap (ABB)
- Draw n1 values randomly with replacement from Y_obs to form Y*_obs (i.e., create a hot deck)
- Draw the n0 = n - n1 components of Y_mis randomly with replacement from Y*_obs (a minimal sketch follows below)
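A minimal numpy sketch of the two ABB draws just described; the array values and the choice of m = 5 are made up for illustration.

```python
import numpy as np

def abb_impute(y_obs, n_mis, rng):
    """Approximate Bayesian bootstrap: resample the observed values to build
    a hot deck Y*_obs, then draw the missing values from that deck."""
    n_obs = y_obs.size
    # Draw n1 = n_obs values with replacement from Y_obs to form Y*_obs
    y_star = rng.choice(y_obs, size=n_obs, replace=True)
    # Draw the n0 missing components with replacement from Y*_obs
    return rng.choice(y_star, size=n_mis, replace=True)

rng = np.random.default_rng(1)
y_obs = np.array([47.2, 50.1, 52.8, 49.5, 44.6, 55.0])
imputed = [abb_impute(y_obs, n_mis=3, rng=rng) for _ in range(5)]   # m = 5 imputations
```

Resampling Y_obs before drawing the donors is what adds the parameter-level uncertainty that a plain hot deck would miss.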
See Rubin (1987) for details on:
- Bayesian Bootstrap (BB), p. 44
- Approximate Bayesian Bootstrap (ABB), p. 124
Inference on combined estimates:
- The estimate is the average of the m repeated complete-data estimates:
  $\bar{\theta} = \sum_{j=1}^{m} \hat{\theta}_j / m$
- Let
  $\bar{W} = \sum_{j=1}^{m} W_j / m$
  be the average of the m repeated complete-data variances, and
  $B = \sum_{j=1}^{m} (\hat{\theta}_j - \bar{\theta})^2 / (m - 1)$
  the variance between imputations
- The total variance is approximately the sum of the two:
  $T = \bar{W} + (1 + m^{-1}) B$
Inference on combined estimates
- Confidence intervals and significance tests can be computed using a t reference distribution with
  $\nu = (m - 1)(1 + r_m^{-1})^2$
  degrees of freedom, where $r_m$ is the relative increase in variance due to nonresponse (Rubin, Ch. 3):
  $r_m = (1 + m^{-1}) B / \bar{W}$
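The combining rules translate directly into code; below is a minimal numpy/scipy sketch for a scalar estimand, with made-up complete-data estimates and variances standing in for the m analyses.

```python
import numpy as np
from scipy import stats

def combine(estimates, variances):
    """Rubin's rules for a scalar estimand: combined point estimate, total
    variance, and degrees of freedom for the t reference distribution."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = estimates.size
    theta_bar = estimates.mean()              # combined point estimate
    W = variances.mean()                      # within-imputation variance
    B = estimates.var(ddof=1)                 # between-imputation variance
    T = W + (1 + 1 / m) * B                   # total variance
    r = (1 + 1 / m) * B / W                   # relative increase in variance
    df = (m - 1) * (1 + 1 / r) ** 2           # degrees of freedom
    return theta_bar, T, df

# Example with m = 5 hypothetical complete-data results
theta_bar, T, df = combine([2.1, 1.9, 2.3, 2.0, 2.2],
                           [0.09, 0.11, 0.10, 0.08, 0.12])
ci = theta_bar + np.array([-1, 1]) * stats.t.ppf(0.975, df) * np.sqrt(T)
```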
How many imputations are needed?
- Rubin (1987, p. 114) shows that the relative efficiency of the finite-m estimator is
  $V(\hat{\theta}_\infty) / V(\hat{\theta}_m) = (1 + \gamma / m)^{-1}$
  where $\gamma$ is the rate of missing information for the quantity being estimated.
- Values are shown below. For small $\gamma$, m = 2 or 3 is nearly fully efficient.

Relative efficiency (%) of the finite-m estimator, in units of standard errors:

  m \ γ    0.1   0.3   0.5   0.7   0.9
  1         95    88    82    77    73
  2         98    93    89    86    83
  3         98    95    93    90    88
  5         99    97    95    94    92
  ∞        100   100   100   100   100
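The tabled percentages can be reproduced with a few lines of numpy; note that the table reports efficiency in standard-error units, i.e. the square root of the variance ratio given above.

```python
import numpy as np

gammas = np.array([0.1, 0.3, 0.5, 0.7, 0.9])   # rate of missing information
for m in (1, 2, 3, 5):
    eff = 100 / np.sqrt(1 + gammas / m)        # efficiency (%) in standard-error units
    print(m, np.round(eff).astype(int))
# m = 1 prints [95 88 82 77 73], matching the first row of the table
```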
Problems
- Difficulties with the MI variance estimator are discussed by Binder & Sun (1996), Fay (1996), and others
  - It gives inconsistent variance estimates under some simple conditions (improper imputation)
- Kott (1995) observes that sampling weights must be used for both point and variance estimation in order to satisfy the conditions for being proper
- Wang and Robins (1998) explore large-sample properties of MI estimators
Alternatives
- Advances have been made in making efficient and asymptotically valid inferences from single imputations
- Shao (2002) and Rao (2000, 2005): jackknife variance estimator for hot-deck imputation in which donors are selected with replacement with selection probability proportional to sampling weights
- Kalton & Kish (1984), Fay (1996): fractionally weighted imputation, which uses more than one donor for a recipient
Fractionally Weighted Imputation
- Idea: reduce imputation variance relative to single imputation
- Fractional hot-deck imputation replaces each missing value with a set of imputed values and assigns a weight to each (Kim & Fuller, 2004); that is, each imputed value receives a "fraction" of the original observation weight (a minimal sketch follows below)
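A minimal numpy sketch of the weighting idea, not of the full Kim & Fuller (2004) estimator: each missing value gets several donors drawn from the observed values, and the recipient's original weight is split equally among them. The equal split, the donor-selection rule, and all names here are illustrative assumptions.

```python
import numpy as np

def fractional_hot_deck(y, weights, n_donors, rng):
    """Replace each missing value with n_donors donor values, giving each
    imputed value an equal fraction of the recipient's original weight."""
    y = np.asarray(y, dtype=float)
    obs = ~np.isnan(y)
    rows = []                                   # (value, fractional weight) pairs
    for i in range(y.size):
        if obs[i]:
            rows.append((y[i], weights[i]))
        else:
            donors = rng.choice(y[obs], size=n_donors, replace=False)
            rows.extend((d, weights[i] / n_donors) for d in donors)
    return np.array(rows)

rng = np.random.default_rng(2)
y = np.array([3.2, np.nan, 4.1, 5.0, np.nan, 3.8])
w = np.full(y.size, 10.0)
completed = fractional_hot_deck(y, w, n_donors=3, rng=rng)
# Point estimate using the fractional weights
est = np.average(completed[:, 0], weights=completed[:, 1])
```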
Multiple Imputation Applications
- SAS has recently developed a procedure for multiple imputation (first available in Version 8.1)
- The procedure requires use of both:
  PROC MI
  PROC MIANALYZE
MI Applications
- Multiple imputation inference involves three distinct phases:
  1. The missing data are filled in m times to generate m complete data sets (PROC MI)
  2. The m complete data sets are analyzed using standard statistical procedures (PROC REG, PROC GLM, etc.)
  3. The results from the m complete data sets are combined to produce inferential results (PROC MIANALYZE)
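The slides use SAS; as an illustration of the same three-phase workflow in Python, here is a sketch using the MICE implementation in statsmodels. The simulated data, model formula, and settings are assumptions for illustration, and MICE's chained-equations approach differs from PROC MI's imputation methods.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.imputation import mice

# Simulated data with missing values in 'oxygen' (illustrative only)
rng = np.random.default_rng(3)
runtime = rng.normal(10.5, 1.4, size=31)
oxygen = 80.0 - 3.0 * runtime + rng.normal(0.0, 2.0, size=31)
df = pd.DataFrame({"oxygen": oxygen, "runtime": runtime})
df.loc[rng.choice(31, size=8, replace=False), "oxygen"] = np.nan

imp = mice.MICEData(df)                                # phase 1: impute repeatedly
model = mice.MICE("oxygen ~ runtime", sm.OLS, imp)     # phase 2: complete-data model
results = model.fit(n_burnin=10, n_imputations=10)     # phase 3: combine the m fits
print(results.summary())
```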
Three Imputation Mechanisms:
(Choice depends on the type of missing-data pattern)
1. Regression Method: a regression model is fitted for each variable with missing values, with the previous variables as covariates (monotone missing pattern)
2. Propensity Score Method: observations are grouped based on propensity scores, and an approximate Bayesian bootstrap imputation is applied within each group (monotone missing pattern)
3. MCMC (Markov chain Monte Carlo) Method: constructs a Markov chain long enough for the distribution of the elements to stabilize (arbitrary missing pattern under MAR)
Multiple Imputation Applications
- See the handout of SAS code and output
- Examples of the MI procedure can be shown using a data set that contains measurements of men running during a physical education course at N.C. State University
- Three variables of interest:
  Oxygen intake per minute (ml/kg body weight)
  Runtime (time in minutes to run 1.5 miles)
  RunPulse (heart rate while running)
Conclusions:
- Multiple imputation is a method of replacing missing values that has some theoretical advantages over other methods
- Software to handle multiple imputation is becoming more common, and the code is relatively simple
Software
Commercial:
- SAS PROC MI
- SOLAS for Missing Data Analysis
(http://www.statsolusa.com/)
Free:
- MIX - Software for multiple imputation
http://www.stat.psu.edu/~jls/misoftwa.html
References
- Binder, D.A., and Sun, W. (1996). Frequency valid multiple imputation for surveys with a complex design. Proceedings of the Section on Survey Research Methods, ASA, 281-286.
- Fay, R.E. (1996). Alternative paradigms for the analysis of imputed survey data. JASA, 91, 490-498.
- Kalton, G., and Kish, L. (1984). Some efficient random imputation methods. Communications in Statistics, A13, 1919-1939.
- Kim, J., and Fuller, W.A. (2004). Fractional hot deck imputation. Biometrika, 91, 559-578.
- Kott, P.S. (1995). A paradox of multiple imputation. Proceedings, 384-389.
- Lavori, P.W., Dawson, R., and Shera, D. (1995). A multiple imputation strategy for clinical trials with truncation of patient data. Statistics in Medicine, 14, 1913-1925.
- Rubin, D.B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons, Inc.
- SAS Manual Version 8.1, Chapter 11.
- Shao, J. (2002). Resampling methods for variance estimation in complex surveys with a complex design. In Survey Nonresponse. Edited by Groves, R.M., et al. New York: John Wiley & Sons, Inc., 303-314.