Pre-Processing & Item Analysis
DeShon - 2005
Pre-Processing
The method of pre-processing depends on the
type of measurement instrument used
General Issues
Responses within range?
Missing data
Item directionality
Scoring
Transforming responses into numbers that are
useful for the desired inference
Checking response range
First step…
Make sure there are no observations
outside the range of your measure.
If you use a 1-5 response measure, you can’t
have a response of 6.
Histograms and summary statistics (min,
max)
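As a minimal sketch of this first screening step (hypothetical 1-5 scale and made-up responses), the min/max check and a list of offending cases can be produced directly:

```python
# Raw item responses (hypothetical); legal range is 1-5.
responses = [3, 5, 1, 6, 2, 0, 4]
low, high = 1, 5

# Quick min/max screen plus the offending (index, value) pairs.
out_of_range = [(i, r) for i, r in enumerate(responses)
                if not low <= r <= high]
print(min(responses), max(responses))   # 0 and 6 flag a problem
print(out_of_range)                     # [(3, 6), (5, 0)]
```

In practice the same screen comes free from histograms and descriptive statistics in any package; the point is to run it before any scoring or imputation.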
Reverse Scoring
Used when combining multiple measures
(e.g., items) into a composite
All items should refer to the target trait in the
same direction
Algorithm: reversed score = (highest scale value + 1) - score
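A sketch of the reverse-scoring rule in Python; the `scale_min` parameter is my generalization, since the slide's (high scale score + 1) - score formula assumes the scale starts at 1:

```python
def reverse_score(score, scale_max, scale_min=1):
    """Reverse-key an item: (scale_max + scale_min) - score.
    With the usual scale_min = 1 this is the slide's
    (high scale score + 1) - score rule."""
    return (scale_max + scale_min) - score

print(reverse_score(1, 5))  # 5: strong disagreement becomes strong agreement
print(reverse_score(4, 5))  # 2
```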
Missing Data
Huge issue in most behavioral research!
Key issues:
Why is the data missing?
Planned, missing randomly, response bias?
What’s the best analytic strategy with missing data?
Statistical Power
Biased results
Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods, 6, 330-351.
Schafer, J. L., & Graham, J. W. (2002). Missing data: our view of the state of the art. Psychological Methods, 7, 147-177.
Causes of Missing Data
Common in social research
nonresponse, loss to follow-up
lack of overlap between linked data sets
social processes
dropping out of school, graduation, etc.
survey design
“skip patterns” between respondents
Missing Data
Step 1: Do everything ethically feasible to
avoid missing data during data collection
Step 2: Do everything ethically possible to
recover missing data
Step 3: Examine amount and patterns of
missing data
Step 4: Use statistical models and
methods that replace missing data or are
unaffected by missing data
Missing Data Mechanisms
Missing Completely at Random (MCAR)
Missing at Random (MAR)
Not Missing at Random (NMAR)
Setup: X not subject to nonresponse (age);
Y subject to nonresponse (income)
Missing Completely at Random
MCAR
Probability of response is independent
of X & Y
Ex: Probability that income is recorded
is the same for all individuals regardless
of age or income
Missing at Random
MAR
Probability of response is dependent
on X but not Y
Probability of missingness does not
depend on unobserved information
Ex: Probability that income is recorded
varies according to age but it is not
related to income within a particular age
group
Not Missing at Random
NMAR
Probability of missingness does depend on
unobserved information
Ex: Probability that income is recorded
varies according to income and possibly
age
How can you tell?
[Figure: bar chart "Missing response DVHST94 vs Gender"; counts (0 to 3000) of missing vs. observed responses for Male, Female, and Total]
• Look for patterns
• Run a logistic regression with your IVs predicting a dichotomous variable (1 = missing; 0 = nonmissing)
LOGISTIC REGRESSION
Coefficients:
                 Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)     -5.058793    0.367083  -13.781   < 2e-16 ***
NEW.AGE          0.181625    0.007524   24.140   < 2e-16 ***
SEXMale         -0.847947    0.131475   -6.450  1.12e-10 ***
DVHHIN94         0.047828    0.026768    1.787    0.0740 .
DVSMKT94        -0.015131    0.031662   -0.478    0.6327
NEW.DVPP94 = 0   0.233188    0.226732    1.028    0.3037
NUMCHRON        -0.087992    0.048783   -1.804    0.0713 .
VISITS           0.012483    0.006563    1.902    0.0572 .
NEW.WT6         -0.043935    0.077407   -0.568    0.5703
NEW.DVBMI94     -0.015622    0.017299   -0.903    0.3665
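The output above comes from R's glm. As an illustrative, stdlib-only Python sketch of the same diagnostic (all data synthetic; the variable names and the hand-rolled gradient-ascent fit are mine, not the slide's): build a 0/1 missingness indicator and regress it on an observed predictor. Here missingness in a hypothetical income variable is made to depend on age, a MAR pattern, and the fitted age coefficient comes out clearly positive.

```python
import math
import random

random.seed(0)

# Synthetic MAR pattern: P(income is missing) varies with age only.
n = 500
age = [random.uniform(20, 70) for _ in range(n)]
missing = [1 if random.random() < 1 / (1 + math.exp(-(0.1 * a - 4))) else 0
           for a in age]

# Fit P(missing = 1) = logistic(b0 + b1 * (age - 45)) by gradient ascent.
b0 = b1 = 0.0
lr = 0.01
for _ in range(2000):
    g0 = g1 = 0.0
    for a, yv in zip(age, missing):
        xc = a - 45.0                       # center age for stable steps
        p = 1 / (1 + math.exp(-(b0 + b1 * xc)))
        g0 += yv - p
        g1 += (yv - p) * xc
    b0 += lr * g0 / n
    b1 += lr * g1 / n

# A clearly positive b1 flags age-dependent (hence not MCAR) missingness.
print(round(b1, 3))
```

In real work you would simply hand the indicator to glm/PROC LOGISTIC as the slide suggests; the loop above only makes the mechanics visible.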
Missing Data Mechanisms
If MAR or MCAR, the missing data mechanism is
ignorable for full information likelihood-based
inferences
If MCAR, the mechanism is also ignorable for
sampling-based inferences (OLS regression)
If NMAR, the mechanism is nonignorable – thus
any statistic could be biased
Missing Data Methods
Always Bad Methods
Listwise deletion
Pairwise deletion a.k.a. available case analysis
Person or item mean replacement
Often Good Methods
Regression replacement
Full-Information Maximum Likelihood
SEM – must have full dataset
Multiple Imputation
Listwise Deletion
Assumes that the data are MCAR.
Only appropriate for small amounts of
missing data.
Can lower power substantially
Inefficient
Now very rare
Don’t do it!
FIML - AMOS
Imputation-based Procedures
Missing values are filled-in and the
resulting “Completed” data are analyzed
Hot deck
Mean imputation
Regression imputation
Some imputation procedures (e.g., Rubin’s
multiple imputation) are really model-based procedures.
Mean Imputation
Technique
  Calculate mean over cases that have values for Y
  Impute this mean where Y is missing
  Ditto for X1, X2, etc.
Implicit models
  Y = mY; X1 = m1; X2 = m2
Problems
  Ignores relationships among X and Y
  Underestimates covariances
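The covariance shrinkage can be shown numerically. A stdlib-only sketch with made-up correlated X and Y: deleting 30% of Y completely at random and filling with the observed mean leaves the means intact but pulls both the X-Y covariance and the Y variance toward zero, because the constant fill-ins carry no relationship and no spread.

```python
import random

random.seed(1)

# Hypothetical correlated X and Y; 30% of Y deleted completely at random.
n = 1000
x = [random.gauss(0, 1) for _ in range(n)]
y = [xi + random.gauss(0, 1) for xi in x]
miss = set(random.sample(range(n), 300))

# Mean imputation: fill every missing Y with the observed-case mean.
y_obs = [yi for i, yi in enumerate(y) if i not in miss]
y_bar = sum(y_obs) / len(y_obs)
y_imp = [y_bar if i in miss else yi for i, yi in enumerate(y)]

def cov(a, b):
    """Sample covariance (cov(a, a) is the sample variance)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)

# Both the X-Y covariance and the Y variance shrink under mean imputation.
print(round(cov(x, y), 2), round(cov(x, y_imp), 2))
print(round(cov(y, y), 2), round(cov(y_imp, y_imp), 2))
```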
Regression Imputation
Technique & implicit models
  If Y is missing: impute mean of cases with similar values for X1, X2
    Y = b0 + X1 b1 + X2 b2
  Likewise, if X2 is missing: impute mean of cases with similar values for X1, Y
    X2 = g0 + X1 g1 + Y g2
  If both Y and X2 are missing: impute means of cases with similar values for X1
    Y = d0 + X1 d1
    X2 = f0 + X1 f1
Problem
  Ignores random components (no e)
  Underestimates variances, SEs
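The "no e" problem is easy to demonstrate with synthetic data (variable names and the specific model Y = X + e are mine, for illustration): deterministic regression imputation places every imputed case exactly on the regression line, shrinking the variance of Y, while a stochastic version that adds a residual draw restores it.

```python
import random

random.seed(2)

# Hypothetical data: Y = X + e with a large error term; 40% of Y missing.
n = 2000
x = [random.gauss(0, 1) for _ in range(n)]
y = [xi + random.gauss(0, 2) for xi in x]
miss = set(random.sample(range(n), 800))

# Fit Y = b0 + b1*X on the complete cases.
obs = [(x[i], y[i]) for i in range(n) if i not in miss]
mx = sum(xi for xi, _ in obs) / len(obs)
my = sum(yi for _, yi in obs) / len(obs)
b1 = (sum((xi - mx) * (yi - my) for xi, yi in obs)
      / sum((xi - mx) ** 2 for xi, _ in obs))
b0 = my - b1 * mx
res_sd = (sum((yi - b0 - b1 * xi) ** 2 for xi, yi in obs)
          / (len(obs) - 2)) ** 0.5

# Deterministic imputation drops e; stochastic imputation restores it.
det = [b0 + b1 * x[i] if i in miss else y[i] for i in range(n)]
sto = [b0 + b1 * x[i] + random.gauss(0, res_sd) if i in miss else y[i]
       for i in range(n)]

def var(a):
    m = sum(a) / len(a)
    return sum((ai - m) ** 2 for ai in a) / (len(a) - 1)

print(round(var(y), 2), round(var(det), 2), round(var(sto), 2))
```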
Little and Rubin’s Principles
Imputations should be
Conditioned on observed variables
Multivariate
Draws from a predictive distribution
Single imputation methods do not provide
a means to correct standard errors for
estimation error.
Multiple Imputation
Context: Multiple regression (in general)
Missing values are replaced with “plausible” substitutes
based on distributions or model
Construct m>1 simulated versions
Analyze each of the m simulated complete datasets by
standard methods
Combine the m estimates
get confidence intervals using Rubin’s rules
(micombine)
ADVANTAGE: sampling variability is taken into account
by restoring error variance
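The impute-analyze-combine loop above can be sketched end to end. This is a deliberately simplified ("improper") version, estimating only a mean, drawing imputations from a fixed normal fit to the observed cases rather than propagating parameter uncertainty as Rubin's full procedure does; the data are synthetic.

```python
import random

random.seed(3)

# Hypothetical variable: 1000 cases, 30% missing completely at random.
y_full = [random.gauss(50, 10) for _ in range(1000)]
miss = set(random.sample(range(1000), 300))
y_obs = [y for i, y in enumerate(y_full) if i not in miss]

mu = sum(y_obs) / len(y_obs)
sd = (sum((v - mu) ** 2 for v in y_obs) / (len(y_obs) - 1)) ** 0.5

M = 5
ests, wvars = [], []
for _ in range(M):
    # IMPUTE: draw each missing value from a plausible distribution.
    data = [random.gauss(mu, sd) if i in miss else y
            for i, y in enumerate(y_full)]
    # ANALYZE: the mean and its squared standard error.
    n = len(data)
    m_hat = sum(data) / n
    s2 = sum((d - m_hat) ** 2 for d in data) / (n - 1)
    ests.append(m_hat)
    wvars.append(s2 / n)

# COMBINE (Rubin's rules): within-variance + (1 + 1/M) * between-variance.
qbar = sum(ests) / M
w = sum(wvars) / M
b = sum((q - qbar) ** 2 for q in ests) / (M - 1)
total_var = w + (1 + 1 / M) * b
print(round(qbar, 1), round(total_var ** 0.5, 3))
```

The between-imputation term b is exactly the restored sampling variability that single imputation throws away.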
Multiple Imputation
(Rubin, 1987, 1996)
[Diagram: the incomplete data (Yobs, Ymiss) are imputed m times, giving completed data sets (1), (2), ..., (m); each yields a point estimate Ty1, Ty2, ..., Tym with variance Var Ty1, ..., Var Tym; these are pooled into an overall estimate Ty whose total variance = within-imputation variance + between-imputation variance]
Another View
[Flow diagram: INCOMPLETE DATA → IMPUTED DATA → ANALYSIS RESULTS → FINAL RESULTS]
• IMPUTATION: Impute the missing entries of the incomplete data set M times, resulting in M complete data sets.
• ANALYSIS: Analyze each of the M completed data sets using weighted least squares.
• POOLING: Integrate the M analysis results into a final result. Simple rules exist for combining the M analyses.
How many Imputations?
Efficiency of an estimate: (1 + γ/m)^-1
  γ = fraction of missing information
  m = number of imputations
If 30% missing:
  3 imputations → 91% efficiency
  5 imputations → 94%
  10 imputations → 97%
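The efficiency formula is a one-liner; this sketch reproduces the slide's three figures for 30% missing information:

```python
def mi_efficiency(gamma, m):
    """Relative efficiency of an MI estimate: (1 + gamma/m) ** -1,
    where gamma is the fraction of missing information and m is the
    number of imputations."""
    return 1 / (1 + gamma / m)

# Reproduce the slide's figures for 30% missing information.
for m in (3, 5, 10):
    print(m, round(mi_efficiency(0.30, m), 2))  # 0.91, 0.94, 0.97
```

This is why a handful of imputations (often 5) was long considered adequate: the efficiency gain past m = 10 is tiny.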
Imputation in SAS
PROC MI
By default generates 5 imputation values for each missing value
Imputation method: MCMC (Markov Chain Monte Carlo)
EM algorithm determines initial values
MCMC repeatedly simulates the distribution of interest from which the imputed values are drawn
Assumption: Data follows multivariate normal distribution
PROC REG
Fits five weighted linear regression models to the
five complete data sets obtained from PROC MI
(uses the BY _Imputation_ statement)
PROC MIANALYZE
Reads the parameter estimates and associated
covariance matrix from the analyses performed on
the multiply imputed data sets and derives valid
statistics for the parameters
Example
Case 1 is missing weight. Given case 1's sex and age,
generate a plausible distribution for case 1's weight.
[Figure: density curve of plausible weights for case 1, spanning roughly 75 to 225]
At random, sample 5 (or more) plausible weights for case 1
Impute Y!
For case 6, sample from the conditional distribution of age.
Use Y to impute X!
For case 7, sample from the conditional bivariate distribution of age & weight.
[Data table: 20 cases with columns maleness, years_over_20, weight, with scattered missing entries; the first rows read case 1 = (0, 28, missing), case 2 = (1, 19, 218), case 3 = (1, 37, 235), case 4 = (0, 24, 150)]
Example
PROC MI DATA=missing_weight_age OUT=weight_age_mi;
  VAR years_over_20 weight maleness;
RUN;

PROC REG DATA=weight_age_mi;
  MODEL weight = maleness years_over_20;
  BY _Imputation_;
RUN;
_Imputation_  case  maleness  years_over_20  weight
     1          1      0           28          178
     1          2      1           19          218
     1          3      1           37          235
     2          1      0           28          101
     2          2      1           19          218
     2          3      1           37          235
     3          1      0           28          167
     3          2      1           19          218
     3          3      1           37          235
     4          1      0           28          152
     4          2      1           19          218
     4          3      1           37          235
     5          1      0           28          159
     5          2      1           19          218
     5          3      1           37          235
Example – Standard Errors
Total variance in b0
  = variation due to sampling + variation due to imputation
  = Mean(s²_b0) + Var(b0)
Actually, there's a correction factor of (1 + 1/M)
for the number of imputations M. (Here M = 5.)
So total variance in estimating b0 is
  Mean(s²_b0) + (1 + 1/M) Var(b0)
  = 179.53 + (1.2)(511.59) = 793.44
Standard error is √793.44 = 28.17
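The within-plus-between pooling can be wrapped as a small helper; the check at the bottom reproduces the slide's arithmetic (W = 179.53, B = 511.59, M = 5):

```python
def rubin_pool(estimates, sq_std_errors):
    """Rubin's rules: pooled estimate and total variance
    W + (1 + 1/M) * B (within- plus corrected between-imputation)."""
    m = len(estimates)
    qbar = sum(estimates) / m
    w = sum(sq_std_errors) / m                             # W: within
    b = sum((q - qbar) ** 2 for q in estimates) / (m - 1)  # B: between
    return qbar, w + (1 + 1 / m) * b

# Reproduce the slide's arithmetic: W = 179.53, B = 511.59, M = 5.
total = 179.53 + (1 + 1 / 5) * 511.59
print(round(total, 2), round(total ** 0.5, 2))  # 793.44  28.17
```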
Example
PROC MIANALYZE DATA=parameters;
  VAR intercept maleness years_over_20;
RUN;

Multiple Imputation Parameter Estimates
Parameter        Estimate     Std Error   95% Confidence Limits      DF
intercept      178.564526    28.168160     70.58553   286.5435   2.2804
maleness        67.110037    14.801696     21.52721   112.6929   3.1866
years_over_20   -0.960283     0.819559     -3.57294     1.6524   2.991

Other Software
www.stat.psu.edu/~jls/misoftwa.html
Item Analysis
Relevant for tests with a right / wrong
answer
Score the item so that 1=right and 0=wrong
Where do the answers come from?
Rational analysis
Empirical keying
Item Analysis
Goal of Item Analysis
Determine the extent to which the item is
useful in differentiating individuals with
respect to the focal construct
Improve the measure for future
administrations
Item Analysis
Item analysis provides info on the effectiveness
of the individual items for future use
Typology of Item Analysis
  Classical
  Item Response Theory
    Rasch
    IRT2
    IRT3
Item Analysis
Classical analysis is the easiest and most widely
used form of analysis
The statistics can be computed by generic
statistical packages (or by hand) and need no
specialist software
The item statistics apply only to that group of
testees on that collection of items
Sample Dependent!
Classical Item Analysis
Item Difficulty
Proportion Correct (1=correct; 0=wrong)
the higher the proportion the easier the item
In general, need a wide range of item difficulties
to cover the range of the trait being assessed
If mastery test, need item difficulties to cluster
around the cut score
Very easy (p = 1.0) or very hard (p = 0.0) items are
useless
Most variance at p = .5
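A tiny sketch of the difficulty statistic (scores are made up): the proportion correct, plus the item variance p(1 - p) that peaks at p = .5.

```python
def difficulty(scores):
    """Item difficulty = proportion correct (1 = right, 0 = wrong)."""
    return sum(scores) / len(scores)

# Hypothetical item answered by 10 examinees: fairly easy (p = .8).
item = [1, 1, 0, 1, 1, 0, 1, 1, 1, 1]
p = difficulty(item)
# Item variance p(1 - p) is largest at p = .5 (0.25) and shrinks
# toward 0 for very easy (p = 1.0) or very hard (p = 0.0) items.
print(p, round(p * (1 - p), 2))
```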
Classical Item Analysis
Item Discrimination – 2 methods
Difference in proportion correct between high
and low test score groups (27%)
Item-total correlation (output in Cronbach’s
alpha routines)
No negative discriminators
Check key or drop item
Zero discriminators are not useful
Item difficulty and discrimination are
interdependent
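A sketch of the item-total correlation method with made-up data. This is the uncorrected version (each item is included in its own total, which inflates the correlations somewhat); alpha routines typically report the corrected item-rest correlation instead.

```python
def pearson_r(a, b):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sxy = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    sxx = sum((ai - ma) ** 2 for ai in a) ** 0.5
    syy = sum((bi - mb) ** 2 for bi in b) ** 0.5
    return sxy / (sxx * syy)

# Rows = examinees, columns = items (1 = correct, 0 = wrong).
items = [[1, 1, 0],
         [1, 0, 0],
         [1, 1, 1],
         [0, 0, 0],
         [1, 1, 1]]
totals = [sum(row) for row in items]
for j in range(3):
    col = [row[j] for row in items]
    print(j, round(pearson_r(col, totals), 2))
```

A negative value here is the signal to check the scoring key or drop the item; values near zero mark items that do not differentiate examinees.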
Classical Item Analysis
Item   Answer   Item Diff   Item-total r   Item-Disc
I1       4        0.85          0.30          0.23
I2       2        0.96          0.26          0.07
I3       1        0.94          0.25          0.15
I4       3        0.78          0.45          0.39
I5       4        0.88          0.37          0.26
I6       2        0.81          0.34          0.32
I7       1        0.78          0.46          0.41
I8       2        0.87          0.44          0.29
I9       3        0.83          0.32          0.22
I10      3        0.88          0.18          0.11
I11      3        0.90          0.19          0.13
I12      2        0.95          0.24          0.10
I13      3        0.96          0.21          0.08
I14      2        0.94          0.32          0.14
I15      4        0.78          0.20          0.32
I16      3        0.50          0.24          0.21