Best Practices for Handling Missing Data

David R. Johnson
Professor of Sociology, Demography and Family Studies
Pennsylvania State University
Outline
• What are missing data and why do we need to do something about them?
• Classic Approaches and their problems
• Modern Approaches
– Multiple Imputation (MI)
– Maximum Likelihood (ML or FIML)
• When to use modern approaches
• Focus on decisions in multiple imputation
What are Missing Data?
• Refusals
• Don't Know (can be treated either as a missing value or as an actual value)
• Not Applicable (may or may not be missing data)
• No response (common in self-report questionnaires)
• Bad data (when the value is clearly wrong and you don't know what the correct value would be)
• Planned Missing or Missing by Design (e.g., questions missing in some years of the GSS)
• When the partner does not respond
How common are missing data?
• In surveys there are almost always some missing values.
• In official statistics (states, countries, etc.) some values are either not collected or not available for some reason.
• Quite common in data collected using contingency questions or skip patterns (e.g., occupational prestige only answered if the respondent works).
• May occur when questions were not asked in some years of a multiyear survey, or were asked only of a random sample of respondents (a planned missing pattern).
Why should we be concerned about missing data?
• In multivariate analysis, like regression, all variables in the model must have valid values; these methods cannot handle incomplete data matrices. This is called analysis of Complete Cases.
• Any case with even one variable with a missing value is excluded from the analysis.
• Even a little missing data on each variable can result in the loss of many cases.
• When a case is lost you lose all the information on that case, and your standard errors go up.
• Using all available information requires some way of using both complete and incomplete cases.
How do we handle missing data in regression analysis?
• Most regression models require a complete data matrix: each cell of all rows and columns must have a valid, nonmissing value. This is called "Complete Case Analysis."
• When you have a large number of independent variables, the proportion of the sample that has complete data may be quite small even if the proportion missing on each variable is not large.
• If missingness on one variable is independent of missingness on another, then the proportion of cases with no missing data = (1 – m_x1) * (1 – m_x2) * … * (1 – m_xk), where m_xi is the proportion missing on variable x_i. (A quick check of this arithmetic appears below.)
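To see how quickly complete cases disappear, here is a minimal Python sketch of this arithmetic; the ten variables and their 5% missingness rates are hypothetical:

```python
import numpy as np

# Hypothetical proportions missing on each of ten independent variables
missing_rates = np.full(10, 0.05)  # 5% missing on each variable

# If missingness is independent across variables, the expected
# proportion of fully complete cases is the product of (1 - m_xi)
prop_complete = np.prod(1 - missing_rates)
print(f"Expected proportion of complete cases: {prop_complete:.3f}")  # ~0.599
```

With only 5% missing on each of ten variables, barely 60% of the sample survives listwise deletion.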
Many strategies for handling missing data in regression
• Complete Case Analysis (listwise or casewise)
• Pairwise Deletion
• Mean Substitution
• Regression Substitution or Regression Imputation
• Hot (and Cold) Deck methods
• Expectation-Maximization (EM) methods
• Full Information Maximum Likelihood (FIML)
• Multiple Imputation Methods
• Plus a number of other lesser-known (and less used) methods
Complete Case Analysis
(casewise or listwise deletion)
• May yield a small sample size.
• The remaining sample of cases with complete data may no longer be representative of the total sample or the population.
• May reduce statistical power: you are throwing away the partial information you have on incomplete cases.
• Probably the most common solution in the literature until the last several years.
• There are still situations in which it might be an appropriate method (Allison suggests that when you lose only a small proportion of cases, 5% or so, casewise is OK).
Pairwise Deletion (obsolete method)
• Incomplete data are analyzed by computing a correlation or covariance matrix based on the pairs of cases that have complete data.
• This matrix is then analyzed rather than the raw data. Works for OLS regression and is an option in some computer programs (SPSS).
• The main problem is that each correlation may be based on a very different set of cases. This can produce a covariance/correlation matrix that does not meet the assumptions of a proper matrix (not Gramian), and can result in very biased and poorly estimated coefficients.
• It is now seldom used in the literature.
Mean Substitution
• Fill in the missing value with the mean of that variable.
• Since the regression line always goes through the means of the variables, this seems like a good answer because it will not influence the regression line one way or another.
• However, putting values on the regression line means these cases contribute no error, which artificially increases the explained variance.
• The procedure also leads to biased estimates of the standard deviations of the variables, as the imputed cases all have a standard deviation of 0.
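For concreteness, a minimal pandas sketch of mean substitution; the variable name and values are hypothetical. Note how every filled-in case sits exactly at the mean and so contributes nothing to the variance:

```python
import pandas as pd
import numpy as np

# Hypothetical data with missing values
df = pd.DataFrame({"income": [42.0, np.nan, 55.0, 61.0, np.nan]})

# Mean substitution: every missing value gets the observed mean,
# which shrinks the variable's standard deviation toward zero
df["income_filled"] = df["income"].fillna(df["income"].mean())
print(df)
```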
Mean Substitution with a Missing Data Indicator (dummy variable)
• Occasionally researchers use mean substitution along with a dummy variable for each variable indicating whether or not that observation was missing on that variable. You often see this with income in published papers.
• Allison has demonstrated, both statistically and with simulation data, that including a dummy variable along with mean substitution will likely lead to biased estimates.
• Some articles indicate that mean substitution is not likely to yield biased estimates, particularly when the amount of missing data is quite low. However, I would not recommend it.
Regression Imputation
(Regression substitution)
• These methods may vary from program to program, but are basically as follows:
• Estimate a regression model from the complete cases in which the variable x is the dependent variable; it should include all other variables in the model as independent variables. Get a predicted value for each case with a missing observation and substitute that predicted value for the missing value.
• Better than mean substitution because the imputed data now have some variance for a given variable (different respondents get different values).
• Still biased, as the missing values are perfectly predicted by a linear combination of the other independent variables. This tends to inflate R-squared and bias the standard errors.
Regression Imputation
(Regression substitution with a stochastic error term)
• A form of regression imputation (stochastic regression imputation) adds a random error component to each predicted value.
• This reduces the problem of all cases being on the regression line and of inflated explained variance.
• Similar to more statistically complex methods used in some multiple imputation approaches.
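Here is a minimal Python sketch of stochastic regression imputation under hypothetical variable names: predict the missing x from an observed z, then add a random draw scaled by the complete-case residual standard deviation:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical data: impute missing 'x' from a fully observed 'z'
df = pd.DataFrame({"z": rng.normal(size=200)})
df["x"] = 2.0 + 0.5 * df["z"] + rng.normal(scale=1.0, size=200)
df.loc[rng.random(200) < 0.2, "x"] = np.nan  # 20% missing

obs = df["x"].notna()
model = LinearRegression().fit(df.loc[obs, ["z"]], df.loc[obs, "x"])

# Residual SD from the complete cases sets the size of the random draw
resid_sd = (df.loc[obs, "x"] - model.predict(df.loc[obs, ["z"]])).std()

# Predicted value plus random error, so imputed cases are not
# forced onto the regression line
pred = model.predict(df.loc[~obs, ["z"]])
df.loc[~obs, "x"] = pred + rng.normal(scale=resid_sd, size=pred.size)
```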
Cold and Hot Deck Imputation
• Method used by the Census Bureau since the 1940s to impute missing values in census data. (The Census uses the Hot version.)
• Defining a "Deck":
– A big N-way crosstab of (usually) 3 to 5 demographic variables (e.g., age, gender, education, employment, marital status), each with a small number of categories (<5).
– In each of the cells of the crosstab there is a value for another variable (e.g., income).
– When a record is missing on the variable (e.g., income), it is assigned the value in the cell that corresponds to its demographic characteristics.
– If the value in the cell stays the same, it is called a "Cold Deck"; if it changes as more records are processed, it is a "Hot Deck."
Hot Deck Imputation
• Create an n1 × n2 × n3 × … table, stored in the computer, using basic demographic characteristics of respondents (e.g., age in 4 categories, gender (2), marital status (5), education (4) yields 160 cells).
• Fill in the hot deck matrix for a given variable (say, employment status) by going through the records sequentially and filling in an observed value of employment status for each cell based on a person with that set of characteristics.
• The values in the deck keep changing as more records are read.
• If a person has a missing value on employment, they get the current value in the matrix based on their demographic characteristics (see the sketch below).
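A minimal Python sketch of this sequential hot deck logic, with hypothetical demographic cells; the "deck" is just a dictionary holding the most recently observed value per cell:

```python
import pandas as pd
import numpy as np

# Hypothetical survey records; 'employment' has missing values
df = pd.DataFrame({
    "age_cat":    [1, 1, 2, 1, 2],
    "gender":     [0, 0, 1, 0, 1],
    "employment": ["employed", np.nan, "retired", "unemployed", np.nan],
})

deck = {}  # one cell per (age_cat, gender) combination
for i, row in df.iterrows():
    cell = (row["age_cat"], row["gender"])
    if pd.isna(row["employment"]):
        # Missing: take the most recently observed value for this cell, if any
        if cell in deck:
            df.at[i, "employment"] = deck[cell]
    else:
        # Observed: update the deck so later records see this value
        deck[cell] = row["employment"]

print(df)
```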
Limitations of Hot Deck
• Requires a very large sample to work adequately.
• In the deck, you can use only a small number (3–5) of characteristics, with a limited number of categories in each.
• Can produce biased estimates.
• Does not take into consideration the uncertainty introduced by the imputation.
Expectation-Maximization (EM) Method
• This is an iterative method for filling in missing values based on the MAR assumption.
• Implemented in the SPSS MVA module for single imputations and in SAS MI for generating the starting point for multiple imputations.
• This is a "model-based" imputation method. The most common model is called the Normal Model, which assumes that the variables are quantitative (continuous) and that their distribution is multivariate normal.
• Although quantitative variables are required, it can be used with categorical variables if they are transformed into a set of dummy variables.
• Estimates have been found to be quite robust to violation of the multivariate normal assumption.
Example of Observed and Imputed Values for the Normal Model
[Figure: plot contrasting observed values with imputed values on the scale "Importance of having children" (1–4).]
Imputed values are continuous and follow a multivariate normal distribution.
How EM works (simplified)
• Uses regression (or some other method) to develop a plausible initial value for a missing value on variable x, taking into account another set of variables (e.g., use the predicted value of x from its regression on all the other variables in the model, plus an error term). This is the E (expectation) step.
• With these initial estimates substituted into the data matrix, it re-computes the regression (or other method) and develops a new plausible value. This is the M (maximization) step.
• It continues alternating these two steps until the estimated missing values stop changing (within a certain tolerance limit).
• Depending on the algorithm used, these estimates can be unbiased under the MAR assumption and the statistical assumptions of the "model" (see the simplified sketch below).
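As a rough illustration of the iterate-until-stable idea (a deliberately simplified sketch, not a full normal-model EM, and without the error-term refinement), here is a hypothetical two-variable loop that alternates refitting a regression with re-imputing from it:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical bivariate data; a quarter of the x-values go missing
z = rng.normal(size=300)
x = 1.0 + 0.8 * z + rng.normal(scale=0.5, size=300)
miss = rng.random(300) < 0.25
x[miss] = np.nan

x[miss] = np.nanmean(x)  # crude starting values (the observed mean)
for _ in range(100):
    old = x[miss].copy()
    # M-like step: refit the regression using the current imputations
    slope, intercept = np.polyfit(z, x, 1)
    # E-like step: replace the imputations with the new predicted values
    x[miss] = intercept + slope * z[miss]
    if np.max(np.abs(x[miss] - old)) < 1e-6:
        break  # the imputed values have stopped changing
```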
Full Information Maximum Likelihood
(FIML or ML)
• Works for statistical methods based on a covariance matrix (e.g., OLS regression, structural equation models).
• Models with binary outcomes can be estimated in some versions (e.g., Mplus).
• Does not produce values for the missing data; it just estimates the covariance matrix in the presence of missing data.
• Similar to EM, except it does not actually assign values to cases. (EM is itself a maximum likelihood method.)
Full Information Maximum Likelihood
(FIML or ML)
• Creates a proper covariance matrix taking into account the information in both complete and incomplete cases (in contrast with pairwise deletion, where the covariance matrix is often not "proper").
• This covariance matrix can then be used for regression or other covariance-based methods (e.g., SEM).
• Assumes MAR and correctly adjusts the standard errors for the uncertainty due to missingness.
• Available in most structural equation modeling (SEM) computer programs (Mplus, Amos, LISREL, Mx, EQS, and the new SEM package in Stata).
• In some of these packages you can estimate a wide variety of regression-type models (e.g., logistic regression); Mplus is probably the most flexible.
• Some missing data experts (e.g., Paul Allison) argue that this is probably the best approach if you can use it. (A sketch of the casewise FIML likelihood appears below.)
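To make "estimates the covariance matrix in the presence of missing data" concrete, here is a hypothetical numpy/scipy sketch of the casewise FIML log-likelihood under multivariate normality; each case contributes the density of only the variables it actually observed:

```python
import numpy as np
from scipy.stats import multivariate_normal

def fiml_loglik(data, mu, sigma):
    """Casewise FIML log-likelihood for data with NaNs, assuming
    multivariate normality. Each case contributes the log-density
    of just its observed subset of variables."""
    total = 0.0
    for row in data:
        obs = ~np.isnan(row)
        if not obs.any():
            continue  # a fully missing row carries no information
        total += multivariate_normal.logpdf(
            row[obs], mean=mu[obs], cov=sigma[np.ix_(obs, obs)]
        )
    return total
```

In practice, mu and sigma are chosen to maximize this quantity with a numerical optimizer; the SEM packages listed above do this internally.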
Multiple Imputation (MI)
• One of the "Modern Methods" that is widely recommended.
• You create several versions of the dataset that differ only in the missing values assigned. You should create at least 5 datasets, but more (e.g., 20+) is generally recommended for most models.
• The missing values assigned for a case on a variable x have both a fixed (predicted) component and a random error component. The size of the random error component is selected to reflect the degree of uncertainty in assigning a value, and the imputed datasets differ in the random components assigned: high certainty leads to little variation between datasets in the random values, low certainty to more variation. (A sketch of generating such datasets follows.)
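A hypothetical Python sketch of generating several such datasets; for simplicity it varies only the residual draws, whereas a fully proper MI procedure would also draw the regression parameters themselves from their posterior:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

def impute_once(df, b0, b1, resid_sd):
    """Fill missing 'x' with its prediction from 'z' plus a fresh random draw."""
    out = df.copy()
    miss = out["x"].isna()
    pred = b0 + b1 * out.loc[miss, "z"]
    out.loc[miss, "x"] = pred + rng.normal(scale=resid_sd, size=miss.sum())
    return out

# Hypothetical data; regression coefficients come from the complete cases
df = pd.DataFrame({"z": rng.normal(size=100)})
df["x"] = 1.0 + 0.5 * df["z"] + rng.normal(scale=0.8, size=100)
df.loc[rng.random(100) < 0.2, "x"] = np.nan

cc = df.dropna()
b1, b0 = np.polyfit(cc["z"], cc["x"], 1)
resid_sd = (cc["x"] - (b0 + b1 * cc["z"])).std()

# Twenty datasets, identical except for the random imputation draws
imputed_sets = [impute_once(df, b0, b1, resid_sd) for _ in range(20)]
```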
Multiple Imputation (MI)
• Once the datasets are generated, the researcher conducts the regression in each dataset and then combines the estimates using "Rubin's rules." These are a set of equations that yield "correct" standard errors, taking into account the uncertainty introduced by the imputation. (A sketch of these pooling rules appears below.)
• ICE and MI in Stata do multiple imputation, and SAS has the MI procedure. These have companion procedures that combine the multiple datasets and compute pooled estimates (SAS has mianalyze; Stata has micombine and mim).
• An MI module is now available in SPSS (or PASW). SPSS also has the capacity to combine the multiple datasets for many statistical procedures. The SPSS MI is similar to ICE in Stata.
• Version 11+ of Stata has an "official" MI program that uses either the same statistical method as SAS MI or the ICE procedure.
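Rubin's pooling rules are short enough to write out directly; a hypothetical Python sketch for pooling a single coefficient across m imputed datasets:

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed datasets via Rubin's rules."""
    estimates = np.asarray(estimates)
    variances = np.asarray(variances)  # squared standard errors
    m = len(estimates)

    q_bar = estimates.mean()           # pooled point estimate
    w = variances.mean()               # within-imputation variance
    b = estimates.var(ddof=1)          # between-imputation variance
    t = w + (1 + 1 / m) * b            # total variance
    return q_bar, np.sqrt(t)           # pooled estimate and its SE

# Hypothetical coefficients and variances from 5 imputed-data regressions
est, se = pool_rubin([0.52, 0.48, 0.55, 0.50, 0.49],
                     [0.010, 0.011, 0.009, 0.010, 0.012])
```

The between-imputation component b is what carries the extra uncertainty due to imputation into the pooled standard error.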
More on Multiple Imputation
• Lots of issues remain with MI.
• How many variables should you take into consideration when imputing values? If a variable is not used to inform the imputation, then the imputed missing values are assumed to be uncorrelated with that variable.
• MI with many variables and cases takes a LONG time to compute, because most imputation software uses "data augmentation" methods that are simulation-based and go through many thousands of iterations. More recent versions are more efficient.
• See Johnson & Young (2011), where the consequences of these decisions are compared and recommendations about the best choices are made.
Other Issues
• Are you "making up" data when you do imputations?
– If done properly, no. With a proper imputation, no new information is added; imputation just enables you to use methods that require a full data matrix. The missing values can be viewed as neutral "fillers" that allow the use of complete-case statistical procedures.
• Why are some of the imputed values so strange? (Sometimes imputed values fall outside the range of the observed data.)
– The values themselves are not all that important for the regression estimates. Many people recode them to fall within the range, but this is not necessary for proper estimation.
– Recoding them into the correct range and to discrete values is OK if the data will be used for other purposes as well. This may, but probably will not, produce a slight bias.
Other Issues
• What is the most acceptable approach to use today to get published in the literature?
– FIML is acceptable for SEM or OLS regression and is becoming more common for other approaches (e.g., mixture models).
– MI methods are accepted for other approaches.
• How many imputations should you use?
– The old standard was 5.
– Recent work suggests that the number of imputed datasets should increase with the amount of missing information.
– From 10 to 25 is acceptable. (We did not find much difference in estimates based on different numbers of datasets.)
Working with Missing Data
What will the Future bring?
• The literature on how missing data should be handled is still in transition. Many issues are still unresolved, and more work is needed.
• Our own work shows that, if you have a relatively small proportion of missing data, you are not likely to bias your findings as long as you use one of the "modern" methods.
• My prognosis is that in the near future, going through special steps, creating special datasets, etc. will no longer be necessary: proper handling of missing data will be built into the statistical software, as is already the case with Mplus and other SEM packages. Mplus can already handle a wide variety of models with incomplete data, and the newest version (6) can also do the imputations and pooling of estimates automatically if you request that option, with no need for a two-step process.