Missing Data - Survey Research Laboratory
Introduction to Survey
Data Analysis
Linda K. Owens, PhD
Assistant Director for Sampling & Analysis
Survey Research Laboratory
University of Illinois at Chicago
Focus of the seminar
Data cleaning/missing data
Sampling bias reduction
When analyzing survey data...
1. Understand & evaluate survey design
2. Screen the data
3. Adjust for sampling design
1. Understand & evaluate survey design
Conductor of survey
Sponsor of survey
Measured variables
Unit of analysis
Mode of data collection
Dates of data collection
1. Understand & evaluate survey design (continued)
Geographic coverage
Respondent eligibility criteria
Sample design
Sample size & response rate
Levels of measurement
Nominal
Ordinal
Interval
Ratio
2. Data screening
ALWAYS examine raw frequency
distributions for…
(a) out-of-range values (outliers)
(b) missing values
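
A raw frequency screen of this kind can be sketched in a few lines of Python. The item, its 1-5 codebook range, and the response values below are hypothetical and chosen only to illustrate the two checks.

```python
from collections import Counter

# Hypothetical raw responses to a 1-5 agreement item.
# None marks a blank; 7 and 9 are codes outside the assumed codebook range.
responses = [1, 2, 2, 3, 5, 9, None, 4, 4, None, 2, 7]
VALID_CODES = {1, 2, 3, 4, 5}  # assumed codebook range

def screen(values, valid_codes):
    """Return raw frequencies plus counts of missing and out-of-range values."""
    freq = Counter(values)
    n_missing = freq.get(None, 0)
    out_of_range = {v: n for v, n in freq.items()
                    if v is not None and v not in valid_codes}
    return freq, n_missing, out_of_range

freq, n_missing, out_of_range = screen(responses, VALID_CODES)

# Print the frequency table with the missing category listed last.
for value, count in sorted(freq.items(), key=lambda kv: (kv[0] is None, kv[0] or 0)):
    print(value, count)
print("missing:", n_missing)
print("out of range:", out_of_range)
```

The table is built from the raw values before any recoding, so unexpected codes stay visible rather than being silently absorbed into a valid category.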
2. Data screening
Out-of-range values:
Delete data
Recode values
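
The two options can be compared on a small example. This is only a sketch: the age values and the assumed eligibility range of 18 to 99 are hypothetical.

```python
VALID_AGE = range(18, 100)  # assumed eligibility range: ages 18 to 99

ages = [34, 27, 199, 45, -1, 61]  # hypothetical raw values

def recode_out_of_range(values, valid, missing=None):
    """Keep valid values; recode everything else to a missing-value marker."""
    return [v if v in valid else missing for v in values]

def delete_out_of_range(values, valid):
    """Drop out-of-range values entirely."""
    return [v for v in values if v in valid]

recoded = recode_out_of_range(ages, VALID_AGE)
deleted = delete_out_of_range(ages, VALID_AGE)
print(recoded)
print(deleted)
```

Recoding keeps the case in the file, so its other variables remain usable; deleting shortens the series and loses the link to the rest of the record.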
Missing data:
can reduce effective sample size
may introduce bias
Reasons for missing data
Refusals (question sensitivity)
Don’t know responses (cognitive problems,
memory problems)
Not applicable
Data processing errors
Questionnaire programming errors
Design factors
Attrition in panel studies
Effects of ignoring missing data
Reduced sample size – loss of statistical
power
Data may no longer be representative–
introduces bias
Difficult to identify effects
Assumptions on missing data
Missing completely at random
(MCAR)
Missing at random (MAR)
Ignorable
Nonignorable
Missing completely at random (MCAR)
Missingness is independent of all
variables, observed or unobserved.
Cases with complete data are
indistinguishable from cases with
missing data.
Missing cases are a random subsample of original sample.
Missing at random (MAR)
The probability of a variable being
observed is independent of the true
value of that variable after controlling
for one or more other observed variables.
Example: Probability of missing
income is unrelated to income within
levels of education.
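
The income example can be made concrete with a small constructed data set. The records below are invented so that missingness depends on education but not on income within an education level; they are an illustration, not survey data.

```python
from statistics import mean

# Constructed records: (education level, true income in $1000s, income observed?)
# Income is missing more often in the "low" group, but within each group
# the missing cases are not systematically higher or lower earners.
records = [
    ("low", 20, False), ("low", 22, True), ("low", 24, True), ("low", 26, False),
    ("high", 60, True), ("high", 62, True), ("high", 64, True), ("high", 66, True),
]

true_mean = mean(inc for _, inc, _ in records)
complete_case_mean = mean(inc for _, inc, obs in records if obs)

def level_means(rows, observed_only):
    """Mean income within each education level."""
    out = {}
    for level in sorted({edu for edu, _, _ in rows}):
        out[level] = mean(inc for edu, inc, obs in rows
                          if edu == level and (obs or not observed_only))
    return out

true_by_level = level_means(records, observed_only=False)
observed_by_level = level_means(records, observed_only=True)

# Education is known for every case, so the observed level means can be
# weighted by each level's share of the full sample.
share = {lvl: sum(edu == lvl for edu, _, _ in records) / len(records)
         for lvl in true_by_level}
adjusted_mean = sum(share[lvl] * observed_by_level[lvl] for lvl in share)

print(true_mean, complete_case_mean, adjusted_mean)
```

In this constructed case the complete-case mean overstates income because low-education cases are under-represented among respondents, while the within-level means are unaffected and the education-adjusted mean recovers the full-sample value.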
Ignorable missing data
The data are MAR.
The missing data mechanism is
unrelated to the parameters we want
to estimate.
Nonignorable missing data
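The pattern of data missingness is
non-MAR: the probability that a value
is missing depends on the unobserved
value itself, even after controlling
for observed variables.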
Methods of handling missing data
Listwise (casewise) deletion: uses only
complete cases
Pairwise deletion: uses all available cases
Dummy variable adjustment: Missing
value indicator method
Mean substitution: substitute mean value
computed from available cases (cf.
unconditional or conditional)
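
The difference between listwise and pairwise deletion shows up in the number of usable cases. A minimal sketch, using five hypothetical records with `None` standing in for a missing value:

```python
from itertools import combinations

# Hypothetical records; None marks a missing value.
rows = [
    {"age": 34, "income": 52, "educ": 16},
    {"age": 51, "income": None, "educ": 12},
    {"age": None, "income": 40, "educ": 14},
    {"age": 29, "income": 38, "educ": None},
    {"age": 45, "income": 61, "educ": 18},
]
VARIABLES = ("age", "income", "educ")

def listwise(rows, variables):
    """Keep only cases that are complete on every analysis variable."""
    return [r for r in rows if all(r[v] is not None for v in variables)]

def pairwise_n(rows, variables):
    """Count the cases available for each pair of variables."""
    return {
        (a, b): sum(r[a] is not None and r[b] is not None for r in rows)
        for a, b in combinations(variables, 2)
    }

complete_cases = listwise(rows, VARIABLES)
pair_counts = pairwise_n(rows, VARIABLES)
print(len(complete_cases))
print(pair_counts)
```

Listwise deletion keeps two of the five cases here, while each pairwise statistic rests on three, so pairwise results are computed from different subsets of the sample.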
Methods of handling missing data
Regression methods: predict value
based on regression equation with
other variables as predictors
Hot deck: identify the case most similar
to the case with a missing value and
impute that donor case's value
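
Both approaches can be sketched on a toy example. The cases are hypothetical, the regression uses a single predictor fitted by ordinary least squares, and the hot deck uses a nearest-neighbor match on that same predictor, which is only one of several ways to define "most similar".

```python
from statistics import mean

# Hypothetical cases: (years of education, income in $1000s); None = missing income.
cases = [(10, 30.0), (12, 37.0), (13, 38.0), (16, 49.0), (14, None)]

donors = [(x, y) for x, y in cases if y is not None]

def fit_line(pairs):
    """Ordinary least squares fit of y on a single predictor x."""
    xbar = mean(x for x, _ in pairs)
    ybar = mean(y for _, y in pairs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in pairs)
    sxx = sum((x - xbar) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    return ybar - slope * xbar, slope

intercept, slope = fit_line(donors)

def regression_impute(x):
    """Predicted income from the fitted regression equation."""
    return intercept + slope * x

def hot_deck_impute(x, donor_pool):
    """Income of the donor whose education is closest to x."""
    donor = min(donor_pool, key=lambda d: abs(d[0] - x))
    return donor[1]

target_educ = 14
print(regression_impute(target_educ))
print(hot_deck_impute(target_educ, donors))
```

The regression value lies on the fitted line, whereas the hot deck value is an income actually reported by another respondent.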
Methods of handling missing data
Maximum likelihood methods: use all
available data to generate maximum
likelihood-based statistics.
Methods of handling missing data
Multiple imputation: draws on ML
methods to produce multiple data sets
with imputed values for missing cases;
each data set is analyzed separately
and the results are combined
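
Once each imputed data set has been analyzed, the estimates are usually pooled with Rubin's combining rules. The slides do not spell these out, so the sketch below is an addition under that assumption, with made-up estimates and variances from five imputations.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool m point estimates and their variances using Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)            # pooled point estimate
    within = mean(variances)           # average within-imputation variance
    between = variance(estimates)      # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total

# Hypothetical results from m = 5 imputed data sets.
estimates = [10.0, 12.0, 11.0, 13.0, 9.0]
variances = [4.0, 4.0, 4.0, 4.0, 4.0]

pooled_estimate, total_variance = pool_rubin(estimates, variances)
print(pooled_estimate, total_variance)
```

The between-imputation term is what carries the extra uncertainty due to the missing values; a single imputed data set has no such term.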
Multiple Imputation Software
SAS—user written IVEware,
experimental MI & MIANALYZE
STATA—user written ICE
R—user written libraries and functions
SOLAS
S-PLUS
Methods of handling missing data:
summary
Listwise deletion assumes MCAR or MAR
Dummy variable adjustment is biased; don’t
use it
Conditional mean substitution assumes MCAR
within cells
Hot deck & regression are an improvement on
mean substitution, but...
Multiple Imputation is least biased but
computationally intensive
Missing Data Final Point
All methods of imputation underestimate
true variance
artificial increase in sample size
values treated as though obtained by data
collection
Rao (1996). On variance estimation with
imputed survey data. Journal of the American
Statistical Association 91:499-506.
Fay (1996) in bibliography
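
The variance problem is easy to see with unconditional mean substitution, the simplest case. The four observed values and two missing cases below are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

observed = [2.0, 4.0, 6.0, 8.0]  # hypothetical respondents
n_missing = 2                    # cases with no reported value

# Unconditional mean substitution: fill every gap with the observed mean.
fill = mean(observed)
imputed = observed + [fill] * n_missing

def se_of_mean(values):
    """Naive standard error of the mean, treating every value as collected data."""
    return sqrt(variance(values) / len(values))

print(variance(observed), se_of_mean(observed))
print(variance(imputed), se_of_mean(imputed))
```

The filled-in values add nothing to the sum of squares but raise the case count, so both the sample variance and the naive standard error shrink even though no new information was collected.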
Types of survey sample designs
Simple Random Sampling
Systematic Sampling
Complex sample designs
stratified designs
cluster designs
mixed mode designs
Why complex sample designs?
Increased efficiency
Decreased costs
Why complex sample designs?
Statistical software packages that
assume SRS typically underestimate the
sampling variance of complex-sample data.
Not accounting for the complex sample
design can bias the estimated sampling
variance and inflate the Type I error
rate.
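
One common way to gauge the size of the problem for cluster designs is Kish's approximate design effect, deff = 1 + (m - 1) * rho, where m is the cluster size and rho the intraclass correlation. The formula is not from the slides, and the cluster size, correlation, and sample size below are assumed values for illustration.

```python
from math import sqrt

def design_effect(cluster_size, icc):
    """Kish's approximation for equal-sized clusters: 1 + (m - 1) * rho."""
    return 1 + (cluster_size - 1) * icc

# Assumed values: 20 respondents per cluster, intraclass correlation 0.05.
deff = design_effect(cluster_size=20, icc=0.05)

n = 1000
effective_n = n / deff          # SRS sample size with the same precision

srs_se = sqrt(0.25 / n)         # SE of a proportion of 0.5 if SRS is assumed
design_se = srs_se * sqrt(deff) # SE after allowing for clustering

print(deff, effective_n)
print(srs_se, design_se)
```

With these assumed values the SRS formula gives a standard error roughly 30 percent smaller than the design-adjusted one, which is the kind of understatement that leads to overly optimistic significance tests.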
Sample weights
Used to adjust for differing
probabilities of selection.
In theory, simple random samples
are self-weighted.
In practice, simple random samples
are likely to also require adjustments
for nonresponse.
Types of sample weights
Poststratification weights: designed to bring
the sample proportions in demographic
subgroups into agreement with the
population proportion in the subgroups.
Nonresponse weights: designed to inflate the
weights of survey respondents to compensate
for nonrespondents with similar
characteristics.
“Blow-up” (expansion) weights: provide
estimates for the total population of interest.
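
The arithmetic behind the three weight types can be sketched as follows. The respondent counts, population proportions, population size, and response counts are all hypothetical, and real weighting schemes usually combine several adjustment steps.

```python
# Hypothetical inputs.
sample_counts = {"women": 60, "men": 40}        # respondents per subgroup
population_share = {"women": 0.5, "men": 0.5}   # assumed known population proportions
POPULATION_SIZE = 10_000                        # assumed population total

n = sum(sample_counts.values())

# Poststratification: population proportion divided by sample proportion.
poststrat = {g: population_share[g] / (sample_counts[g] / n) for g in sample_counts}

# Expansion ("blow-up"): number of population members each respondent represents.
expansion = {g: POPULATION_SIZE * population_share[g] / sample_counts[g]
             for g in sample_counts}

def nonresponse_weight(n_sampled, n_responded):
    """Inflation factor for respondents in a weighting class: inverse response rate."""
    return n_sampled / n_responded

# Checks: weighted sample proportions and the weighted total.
weighted_share = {g: sample_counts[g] * poststrat[g] / n for g in sample_counts}
estimated_total = sum(sample_counts[g] * expansion[g] for g in sample_counts)

print(poststrat)
print(weighted_share)
print(estimated_total)
print(nonresponse_weight(n_sampled=50, n_responded=40))
```

After poststratification the weighted subgroup proportions match the population proportions, and the expansion weights sum to the population total.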
Syntax examples of design-based
analysis in STATA, SUDAAN, & SAS
STATA
svyset strata strata
svyset psu psu
svyset pweight finalwt
svyreg fatintk age male black hispanic
SUDAAN
proc regress data="c:\nhanes.sav" filetype=spss design=wr;
nest strata psu;
weight finalwt;
subgroup sex race;
levels 2 3;
model fatintk = age sex race;
Syntax examples of design-based
analysis in STATA, SUDAAN, & SAS
SAS
proc surveyreg data=nhanes;
strata strata;
cluster psu;
class sex race;
model fatintk = age sex race;
weight finalwt;
run;
In summary, when analyzing survey
data...
Understand & evaluate survey design
Screen the data – deal with missing data
& outliers.
If necessary, adjust for study design
using weights and appropriate computer
software.
Bibliography
On website with slides
Big names in area of missing data are:
• Roderick Little
• Donald B. Rubin
• Paul Allison
• See also McKnight et al., Missing Data: A Gentle
Introduction
Large government datasets (NSFG, NHANES, NHIS)
often include detailed methodological
documentation on imputation and weight
construction.
Thank You!
www.srl.uic.edu