Research and Missing Data
Missing Data in Research Studies
Joseph A. Olsen
What do I do about missing data?
Introduction
• What is certain in life?
– Death
– Taxes
• What is certain in research?
– Measurement error
– Missing data
• Missing data can be:
– Due to preventable errors, mistakes, or lack of foresight by the
researcher
– Due to problems outside the control of the researcher
– Deliberate, intended, or planned by the researcher to reduce cost or
respondent burden
– Due to differential applicability of some items to subsets of
respondents
– Etc.
Some Characteristics of Missing Data
Facets of missing data
Persons
Variables
Occasions
Type of non-response
Unit non-response
Block non-response
Wave non-response
Item non-response
Special non-response problems in longitudinal and
clustered data
Attrition/drop-out
Group (e.g. family) member non-response
Missing Data Mechanisms (1)
Preliminaries:
Yobs: The non-missing or observed data
Ymiss: The missing or unobserved data
M: Whether the data on a given item for a given case is missing (1)
or not (0)
Missing Completely at Random (MCAR)
The probability that an item is missing (M) is unrelated to either the
observed (Yobs) or the unobserved (Ymiss) data
Missing at Random (MAR)
The probability that an item is missing (M) may be related to the
observed data (Yobs) but is unrelated to the unobserved data (Ymiss)
Missing Not at Random (MNAR)
The probability that an item is missing (M) is related to the
(unknown) value of the unobserved data (Ymiss), even after
conditioning on the observed data (Yobs)
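The three mechanisms can be illustrated with a small simulation. This is only a sketch under assumed settings: the variable names (x, y, m_mcar, m_mar, m_mnar), the missingness probabilities, and the relationship between x and y are all invented for illustration. Here x is always observed and y is the item subject to missingness.

```python
import random

def simulate_missingness(n=10000, seed=0):
    """Generate (x, y) cases plus three missingness indicators for y (1 = missing)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0, 1)                # always observed (part of Yobs)
        y = 0.5 * x + rng.gauss(0, 1)      # the item that may be missing
        rows.append({
            "x": x,
            "y": y,
            # MCAR: constant 30% chance, unrelated to x or y
            "m_mcar": int(rng.random() < 0.3),
            # MAR: chance depends only on the observed x
            "m_mar": int(rng.random() < (0.5 if x > 0 else 0.1)),
            # MNAR: chance depends on the value of y itself
            "m_mnar": int(rng.random() < (0.5 if y > 0 else 0.1)),
        })
    return rows

def observed_mean(rows, flag):
    """Mean of y among the cases where y is observed under the given mechanism."""
    seen = [r["y"] for r in rows if r[flag] == 0]
    return sum(seen) / len(seen)

rows = simulate_missingness()
full_mean = sum(r["y"] for r in rows) / len(rows)
means = {flag: observed_mean(rows, flag) for flag in ("m_mcar", "m_mar", "m_mnar")}
```

In this setup the observed-case mean of y stays close to the full-data mean under MCAR, and drifts downward under both MAR and MNAR. The difference is that under MAR the drift can be corrected by conditioning on x, while under MNAR it cannot be corrected from the observed data alone.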
Missing Data Mechanisms (2)
The appropriateness of different missing data
treatments depends (among other things) on the
underlying missing data mechanism
“Real” missing data can seldom be classified into
just one of the three (MCAR, MAR, MNAR)
Because we don’t have access to the missing data
(Ymiss), we cannot empirically test whether or not
the data is MNAR
If we know (or can convincingly argue) that the
data is not MNAR, a test of whether the data is
MCAR is available (e.g. in SPSS Missing Values
Analysis).
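The logic behind an MCAR check can be sketched as follows: if missingness on one item is unrelated to the observed data, then cases with and without that item should look alike on the other observed variables. The code below is a simplified per-variable comparison using Welch's t statistic, not the test that SPSS reports; the data and the names (x, m_mcar, m_mar) are simulated assumptions.

```python
import random
import statistics

def welch_t(a, b):
    """Welch's t statistic for the difference in means of two samples."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / (va / len(a) + vb / len(b)) ** 0.5

def t_by_missingness(x, m):
    """Compare observed x between cases where another item is missing (1) or present (0)."""
    missing = [xi for xi, mi in zip(x, m) if mi == 1]
    present = [xi for xi, mi in zip(x, m) if mi == 0]
    return welch_t(missing, present)

rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(2000)]
# Hypothetical indicators for whether some other item y is missing
m_mcar = [int(rng.random() < 0.3) for _ in x]                         # unrelated to x
m_mar = [int(rng.random() < (0.5 if xi > 0 else 0.1)) for xi in x]    # depends on x

t_mcar = t_by_missingness(x, m_mcar)
t_mar = t_by_missingness(x, m_mar)
```

A large |t| is evidence against MCAR (missingness is related to observed data). A small |t| is compatible with MCAR but cannot rule out MNAR, which is why the argument that the data is not MNAR has to come from outside the data.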
Missing Data in Research Studies
Missing data mechanism
Missing completely at random (MCAR)—Ignorable
Missing at random (MAR)—Conditionally ignorable
Missing not at random (MNAR)—Nonignorable
Amount of missing data
Percent of cases with missing data
Percent of variables having missing data
Percent of data values that are missing
Pattern of missing data
Missing by design
Missing data patterns
Univariate
Monotonic
File matching
General
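The three "amount" percentages and a monotone-pattern check can be computed directly from a cases-by-variables table. A minimal sketch, assuming missing values are coded as None and that the columns are already in the order of interest (for example, successive waves); the function names and toy data are illustrative only.

```python
def missing_summary(data):
    """data: list of cases, each a list of values with None marking a missing value."""
    n_cases, n_vars = len(data), len(data[0])
    cases_with_missing = sum(any(v is None for v in row) for row in data)
    vars_with_missing = sum(any(row[j] is None for row in data) for j in range(n_vars))
    values_missing = sum(v is None for row in data for v in row)
    return {
        "pct_cases": 100 * cases_with_missing / n_cases,
        "pct_variables": 100 * vars_with_missing / n_vars,
        "pct_values": 100 * values_missing / (n_cases * n_vars),
    }

def is_monotone(data):
    """True if, in the given column order, every value after a case's first missing value is also missing."""
    for row in data:
        seen_missing = False
        for v in row:
            if v is None:
                seen_missing = True
            elif seen_missing:
                return False
    return True

# Toy four-wave dataset with drop-out (attrition)
data = [
    [1, 2, 3, 4],
    [1, 2, 3, None],
    [1, 2, None, None],
    [1, None, None, None],
]
summary = missing_summary(data)
monotone = is_monotone(data)
```

Note how the three percentages can differ sharply for the same dataset: here most cases and most variables are affected, but well under half of the individual data values are missing.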
Goals of a Missing Data Treatment
Preserve the essential characteristics of the data
Distributions of the variables
Relationships among the variables
Maintain the representativeness of the analyzed
data
Provide valid statistical inference (control Type I
error)
Maximize the statistical power of the study and its
statistical analyses (minimize Type II error)
Avoid bias and instability in the parameter
estimates and standard errors for statistical
models
Older Missing Data Treatments (1)
Deletion methods
Listwise deletion (complete case analysis)
Pairwise deletion (available case analysis)
Cold deck imputation
Deterministic, logical, or rule-based imputation
Treat missing data for nominal predictors as an additional category
Hot deck (donor case) imputation
Cluster based methods
Distance based (e.g. nearest neighbor) methods
Mean substitution
(Variable) mean substitution
Mean substitution with added random error
Predictor mean substitution with missing data dichotomy
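Three of the treatments listed above (listwise deletion, pairwise deletion, and variable mean substitution) can be sketched in a few lines. The toy data and function names are made up for illustration, with None marking a missing value.

```python
import statistics

# Toy data: three variables per case, None marks a missing value (illustrative only)
data = [
    (1.0, 2.0, 3.0),
    (2.0, None, 5.0),
    (3.0, 6.0, None),
    (4.0, 8.0, 9.0),
    (5.0, 10.0, 11.0),
]

def listwise(data):
    """Keep only cases with no missing values (complete case analysis)."""
    return [row for row in data if None not in row]

def pairwise(data, i, j):
    """Keep cases observed on both variables i and j (available case analysis)."""
    return [(row[i], row[j]) for row in data
            if row[i] is not None and row[j] is not None]

def mean_substitute(data):
    """Replace each missing value with the observed mean of its variable."""
    means = [statistics.mean([v for v in col if v is not None]) for col in zip(*data)]
    return [tuple(m if v is None else v for v, m in zip(row, means)) for row in data]

complete = listwise(data)         # cases usable for any analysis
pairs_01 = pairwise(data, 0, 1)   # cases usable for this one pair of variables
filled = mean_substitute(data)

# Variance of the second variable before and after mean substitution
observed_var = statistics.variance([row[1] for row in data if row[1] is not None])
filled_var = statistics.variance([row[1] for row in filled])
```

Listwise deletion keeps fewer cases than pairwise deletion, and mean substitution shrinks the variance of the filled variable because the imputed values sit exactly at the mean. This is one reason mean substitution distorts distributions and relationships among variables.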
Older Missing Data Treatments (2)
Regression imputation
Regression predicted value imputation
Regression imputation with added random error
Special methods for longitudinal studies and randomized
controlled trials
Endpoint only analysis
Last observation carried forward (LOCF)
Intent to treat worst (best) case imputation
Summary growth parameters
Special methods for multi-item scales
Available item method of scale construction
Person mean imputation
Two-way imputation
Two-way imputation with added random error
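Several of these special-purpose methods are simple enough to sketch directly. The code below shows LOCF for one person's repeated measures, person mean imputation for one respondent's scale items, and two-way imputation in its usual form (person mean + item mean - overall mean, all computed from observed values). Function names and data are illustrative, None marks a missing value, and the sketch assumes every person and every item has at least one observed value.

```python
def locf(series):
    """Last observation carried forward; gaps before the first observation stay missing."""
    out, last = [], None
    for v in series:
        if v is not None:
            last = v
        out.append(last)
    return out

def _mean_observed(values):
    """Mean of the non-missing values (assumes at least one is observed)."""
    observed = [v for v in values if v is not None]
    return sum(observed) / len(observed)

def person_mean_impute(items):
    """Replace a respondent's missing items with the mean of that respondent's observed items."""
    m = _mean_observed(items)
    return [m if v is None else v for v in items]

def two_way_impute(matrix):
    """Impute each missing cell as person mean + item mean - overall mean."""
    overall = _mean_observed([v for row in matrix for v in row])
    item_means = [_mean_observed(col) for col in zip(*matrix)]
    return [
        [_mean_observed(row) + item_means[j] - overall if v is None else v
         for j, v in enumerate(row)]
        for row in matrix
    ]

carried = locf([None, 3, None, None, 5, None])
person_filled = person_mean_impute([4, None, 2, 3])
two_way_filled = two_way_impute([[1, 2, 3], [2, None, 4], [3, 4, 5]])
```

All three are single, deterministic imputations: they fill in one value per gap and carry no information about how uncertain that value is, which is what the "added random error" variants try to address.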
Newer Missing Data Treatments
• Modern state-of-the-art missing data
treatments for MAR data
– Maximum likelihood
– Multiple imputation
• Cutting edge investigational missing data
treatments for MNAR data
– Pattern mixture models
– Selection models
– Shared parameter models
– Inverse probability weighting
Statistical Analysis with Missing Data
What do you get when you don’t specify what you want? What
choices do you have within a given analysis procedure?
Often, listwise deletion is the default (and only) option (SPSS
Reliability and GLM)
Listwise default with pairwise and mean substitution as options
(SPSS Factor and Regression Analysis)
Pairwise default with listwise option (SPSS Correlation)
Modeling approaches that incorporate missing data handling
Survival models
Mixed effects models
Structural equation models
Missing data treatments carried out prior to analysis
Ad hoc methods (Listwise, pairwise, single imputation, etc.)
Modern methods (Maximum Likelihood, Multiple Imputation)
Modern Missing Data Treatments
Maximum likelihood (ML)
Estimates summary statistics or statistical models using all available data
Available in modern structural equation modeling software (Amos, EQS,
LISREL, Mplus, Mx, etc.)
The ML covariance matrix and mean vector can also be obtained from
SPSS MVA, and used for standard Regression, Factor analysis, Reliability,
and other procedures
There are also freeware and open source programs that can produce the
ML covariance matrix and mean vector, usually by using the Expectation
Maximization (EM) algorithm (e.g. EMCOV)
Multiple imputation
Imputes individual data values in multiple complete datasets, averaging the
results of the statistical analyses across these datasets
Available in the current versions of certain SEM software (Amos, Mplus).
Also available in SPSS (MVA), SAS (Proc MI and MIANALYZE), Stata (mi
impute and mi estimate), and stand-alone missing data packages such as
SOLAS
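"Averaging the results across datasets" is usually done with Rubin's rules: the pooled estimate is the mean of the per-dataset estimates, and the pooled variance combines the average within-imputation variance with the between-imputation variance. A minimal sketch of the pooling step only (the imputation itself is left to the software named above); the function name and the numbers are invented for illustration.

```python
import statistics

def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed datasets using Rubin's rules.

    estimates: the parameter estimate from each imputed dataset
    variances: the squared standard error from each imputed dataset
    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    q_bar = statistics.mean(estimates)        # pooled point estimate
    within = statistics.mean(variances)       # average within-imputation variance
    between = statistics.variance(estimates)  # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between    # total variance
    return q_bar, total ** 0.5

# Hypothetical regression coefficient estimated in m = 5 imputed datasets
estimates = [0.50, 0.54, 0.46, 0.52, 0.48]
variances = [0.010, 0.010, 0.010, 0.010, 0.010]
pooled_estimate, pooled_se = pool_rubin(estimates, variances)
```

The pooled standard error is larger than the standard error from any single imputed dataset. The extra between-imputation term is how multiple imputation carries the uncertainty about the missing values into the final inference, something single imputation cannot do.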
Why do social scientists use modern
missing data treatments so infrequently?
Lack of awareness or familiarity
They are not convinced of the problems with older
methods
The statistical literature on missing data is technically
daunting
The techniques aren’t incorporated into the standard
statistical analysis procedures used by social
scientists
Journal reviewers and editors have not required it