Pre-Processing & Item Analysis
DeShon - 2005
Pre-Processing
The method of pre-processing depends on the
type of measurement instrument used
General Issues
Responses within range?
Missing data
Item directionality
Scoring
Transforming responses into numbers that are
useful for the desired inference
Checking response range
First step…
Make sure there are no observations
outside the range of your measure.
If you use a 1-5 response measure, you can’t
have a response of 6.
Histograms and summary statistics (min,
max)
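As a minimal sketch of this first screening step (hypothetical 1-5 scale and made-up responses), the min/max check and a list of offending cases can be produced directly:

```python
# Raw item responses (hypothetical); legal range is 1-5.
responses = [3, 5, 1, 6, 2, 0, 4]
low, high = 1, 5

# Quick min/max screen plus the offending (index, value) pairs.
out_of_range = [(i, r) for i, r in enumerate(responses)
                if not low <= r <= high]
print(min(responses), max(responses))   # 0 and 6 flag a problem
print(out_of_range)                     # [(3, 6), (5, 0)]
```

In practice the same screen comes free from histograms and descriptive statistics in any package; the point is to run it before any scoring or imputation.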
Reverse Scoring
Used when combining multiple measures
(e.g., items) into a composite
All items should refer to the target trait in the
same direction
Algorithm: reversed score = (highest scale value + 1) - score
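A sketch of the reverse-scoring rule in Python; the `scale_min` parameter is my generalization, since the slide's (high scale score + 1) - score formula assumes the scale starts at 1:

```python
def reverse_score(score, scale_max, scale_min=1):
    """Reverse-key an item: (scale_max + scale_min) - score.
    With the usual scale_min = 1 this is the slide's
    (high scale score + 1) - score rule."""
    return (scale_max + scale_min) - score

print(reverse_score(1, 5))  # 5: strong disagreement becomes strong agreement
print(reverse_score(4, 5))  # 2
```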
Missing Data
Huge issue in most behavioral research!
Key issues:
Why is the data missing?
Planned, missing randomly, response bias?
What’s the best analytic strategy with missing data?
Statistical Power
Biased results
Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods, 6, 330-351.
Schafer, J. L., & Graham, J. W. (2002). Missing data: our view of the state of the art. Psychological Methods, 7, 147-177.
Causes of Missing Data
Common in social research
nonresponse, loss to follow-up
lack of overlap between linked data sets
social processes
dropping out of school, graduation, etc.
survey design
“skip patterns” between respondents
Missing Data
Step 1: Do everything ethically feasible to
avoid missing data during data collection
Step 2: Do everything ethically possible to
recover missing data
Step 3: Examine amount and patterns of
missing data
Step 4: Use statistical models and
methods that replace missing data or are
unaffected by missing data
Missing Data Mechanisms
Missing Completely at Random (MCAR)
Missing at Random (MAR)
Not Missing at Random (NMAR)
Setup: X not subject to nonresponse (age);
Y subject to nonresponse (income)
Missing Completely at Random
MCAR
Probability of response is independent
of X & Y
Ex: Probability that income is recorded
is the same for all individuals regardless
of age or income
Missing at Random
MAR
Probability of response is dependent
on X but not Y
Probability of missingness does not
depend on unobserved information
Ex: Probability that income is recorded
varies according to age but it is not
related to income within a particular age
group
Not Missing at Random
NMAR
Probability of missingness does depend on
unobserved information
Ex: Probability that income is recorded
varies according to income and possibly
age
How can you tell?
[Figure: bar chart "Missing response DVHST94 vs Gender"; counts (0 to 3000) of missing vs. observed responses for Male, Female, and Total]
• Look for patterns
• Run a logistic regression with your IVs predicting a dichotomous variable (1 = missing; 0 = nonmissing)
LOGISTIC REGRESSION
Coefficients:
                 Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)     -5.058793    0.367083  -13.781   < 2e-16 ***
NEW.AGE          0.181625    0.007524   24.140   < 2e-16 ***
SEXMale         -0.847947    0.131475   -6.450  1.12e-10 ***
DVHHIN94         0.047828    0.026768    1.787    0.0740 .
DVSMKT94        -0.015131    0.031662   -0.478    0.6327
NEW.DVPP94 = 0   0.233188    0.226732    1.028    0.3037
NUMCHRON        -0.087992    0.048783   -1.804    0.0713 .
VISITS           0.012483    0.006563    1.902    0.0572 .
NEW.WT6         -0.043935    0.077407   -0.568    0.5703
NEW.DVBMI94     -0.015622    0.017299   -0.903    0.3665
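The output above comes from R's glm. As an illustrative, stdlib-only Python sketch of the same diagnostic (all data synthetic; the variable names and the hand-rolled gradient-ascent fit are mine, not the slide's): build a 0/1 missingness indicator and regress it on an observed predictor. Here missingness in a hypothetical income variable is made to depend on age, a MAR pattern, and the fitted age coefficient comes out clearly positive.

```python
import math
import random

random.seed(0)

# Synthetic MAR pattern: P(income is missing) varies with age only.
n = 500
age = [random.uniform(20, 70) for _ in range(n)]
missing = [1 if random.random() < 1 / (1 + math.exp(-(0.1 * a - 4))) else 0
           for a in age]

# Fit P(missing = 1) = logistic(b0 + b1 * (age - 45)) by gradient ascent.
b0 = b1 = 0.0
lr = 0.01
for _ in range(2000):
    g0 = g1 = 0.0
    for a, yv in zip(age, missing):
        xc = a - 45.0                       # center age for stable steps
        p = 1 / (1 + math.exp(-(b0 + b1 * xc)))
        g0 += yv - p
        g1 += (yv - p) * xc
    b0 += lr * g0 / n
    b1 += lr * g1 / n

# A clearly positive b1 flags age-dependent (hence not MCAR) missingness.
print(round(b1, 3))
```

In real work you would simply hand the indicator to glm/PROC LOGISTIC as the slide suggests; the loop above only makes the mechanics visible.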
Missing Data Mechanisms
If MAR or MCAR, the missing data mechanism is
ignorable for full information likelihood-based
inferences
If MCAR, the mechanism is also ignorable for
sampling-based inferences (OLS regression)
If NMAR, the mechanism is nonignorable – thus
any statistic could be biased
Missing Data Methods
Always Bad Methods
Listwise deletion
Pairwise deletion a.k.a. available case analysis
Person or item mean replacement
Often Good Methods
Regression replacement
Full-Information Maximum Likelihood
SEM – must have full dataset
Multiple Imputation
Listwise Deletion
Assumes that the data are MCAR.
Only appropriate for small amounts of
missing data.
Can lower power substantially
Inefficient
Now very rare
Don’t do it!
FIML - AMOS
Imputation-based Procedures
Missing values are filled-in and the
resulting “Completed” data are analyzed
Hot deck
Mean imputation
Regression imputation
Some imputation procedures (e.g., Rubin’s
multiple imputation) are really model-based procedures.
Mean Imputation
Technique
  Calculate mean over cases that have values for Y
  Impute this mean where Y is missing
  Ditto for X1, X2, etc.
Implicit models
  Y = mY; X1 = m1; X2 = m2
Problems
  Ignores relationships among X and Y
  Underestimates covariances
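The covariance shrinkage can be shown numerically. A stdlib-only sketch with made-up correlated X and Y: deleting 30% of Y completely at random and filling with the observed mean leaves the means intact but pulls both the X-Y covariance and the Y variance toward zero, because the constant fill-ins carry no relationship and no spread.

```python
import random

random.seed(1)

# Hypothetical correlated X and Y; 30% of Y deleted completely at random.
n = 1000
x = [random.gauss(0, 1) for _ in range(n)]
y = [xi + random.gauss(0, 1) for xi in x]
miss = set(random.sample(range(n), 300))

# Mean imputation: fill every missing Y with the observed-case mean.
y_obs = [yi for i, yi in enumerate(y) if i not in miss]
y_bar = sum(y_obs) / len(y_obs)
y_imp = [y_bar if i in miss else yi for i, yi in enumerate(y)]

def cov(a, b):
    """Sample covariance (cov(a, a) is the sample variance)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)

# Both the X-Y covariance and the Y variance shrink under mean imputation.
print(round(cov(x, y), 2), round(cov(x, y_imp), 2))
print(round(cov(y, y), 2), round(cov(y_imp, y_imp), 2))
```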
Regression Imputation
Technique & implicit models
  If Y is missing: impute mean of cases with similar values for X1, X2
    Y = b0 + X1 b1 + X2 b2
  Likewise, if X2 is missing: impute mean of cases with similar values for X1, Y
    X2 = g0 + X1 g1 + Y g2
  If both Y and X2 are missing: impute means of cases with similar values for X1
    Y = d0 + X1 d1
    X2 = f0 + X1 f1
Problem
  Ignores random components (no e)
  Underestimates variances, SEs
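The "no e" problem is easy to demonstrate with synthetic data (variable names and the specific model Y = X + e are mine, for illustration): deterministic regression imputation places every imputed case exactly on the regression line, shrinking the variance of Y, while a stochastic version that adds a residual draw restores it.

```python
import random

random.seed(2)

# Hypothetical data: Y = X + e with a large error term; 40% of Y missing.
n = 2000
x = [random.gauss(0, 1) for _ in range(n)]
y = [xi + random.gauss(0, 2) for xi in x]
miss = set(random.sample(range(n), 800))

# Fit Y = b0 + b1*X on the complete cases.
obs = [(x[i], y[i]) for i in range(n) if i not in miss]
mx = sum(xi for xi, _ in obs) / len(obs)
my = sum(yi for _, yi in obs) / len(obs)
b1 = (sum((xi - mx) * (yi - my) for xi, yi in obs)
      / sum((xi - mx) ** 2 for xi, _ in obs))
b0 = my - b1 * mx
res_sd = (sum((yi - b0 - b1 * xi) ** 2 for xi, yi in obs)
          / (len(obs) - 2)) ** 0.5

# Deterministic imputation drops e; stochastic imputation restores it.
det = [b0 + b1 * x[i] if i in miss else y[i] for i in range(n)]
sto = [b0 + b1 * x[i] + random.gauss(0, res_sd) if i in miss else y[i]
       for i in range(n)]

def var(a):
    m = sum(a) / len(a)
    return sum((ai - m) ** 2 for ai in a) / (len(a) - 1)

print(round(var(y), 2), round(var(det), 2), round(var(sto), 2))
```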
Little and Rubin’s Principles
Imputations should be
Conditioned on observed variables
Multivariate
Draws from a predictive distribution
Single imputation methods do not provide
a means to correct standard errors for
estimation error.
Multiple Imputation
Context: Multiple regression (in general)
Missing values are replaced with “plausible” substitutes
based on distributions or model
Construct m>1 simulated versions
Analyze each of the m simulated complete datasets by
standard methods
Combine the m estimates
get confidence intervals using Rubin’s rules
(micombine)
ADVANTAGE: sampling variability is taken into account
by restoring error variance
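The impute-analyze-combine loop above can be sketched end to end. This is a deliberately simplified ("improper") version, estimating only a mean, drawing imputations from a fixed normal fit to the observed cases rather than propagating parameter uncertainty as Rubin's full procedure does; the data are synthetic.

```python
import random

random.seed(3)

# Hypothetical variable: 1000 cases, 30% missing completely at random.
y_full = [random.gauss(50, 10) for _ in range(1000)]
miss = set(random.sample(range(1000), 300))
y_obs = [y for i, y in enumerate(y_full) if i not in miss]

mu = sum(y_obs) / len(y_obs)
sd = (sum((v - mu) ** 2 for v in y_obs) / (len(y_obs) - 1)) ** 0.5

M = 5
ests, wvars = [], []
for _ in range(M):
    # IMPUTE: draw each missing value from a plausible distribution.
    data = [random.gauss(mu, sd) if i in miss else y
            for i, y in enumerate(y_full)]
    # ANALYZE: the mean and its squared standard error.
    n = len(data)
    m_hat = sum(data) / n
    s2 = sum((d - m_hat) ** 2 for d in data) / (n - 1)
    ests.append(m_hat)
    wvars.append(s2 / n)

# COMBINE (Rubin's rules): within-variance + (1 + 1/M) * between-variance.
qbar = sum(ests) / M
w = sum(wvars) / M
b = sum((q - qbar) ** 2 for q in ests) / (M - 1)
total_var = w + (1 + 1 / M) * b
print(round(qbar, 1), round(total_var ** 0.5, 3))
```

The between-imputation term b is exactly the restored sampling variability that single imputation throws away.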
Multiple Imputation
(Rubin, 1987, 1996)
[Diagram: the incomplete data (Yobs, Ymiss) are imputed m times, giving completed data sets (1), (2), ..., (m); each yields a point estimate Ty1, Ty2, ..., Tym with variance Var Ty1, ..., Var Tym; these are pooled into an overall estimate Ty whose total variance = within-imputation variance + between-imputation variance]
Another View
[Flow diagram: INCOMPLETE DATA → IMPUTED DATA → ANALYSIS RESULTS → FINAL RESULTS]
• IMPUTATION: Impute the missing entries of the incomplete data set M times, resulting in M complete data sets.
• ANALYSIS: Analyze each of the M completed data sets using weighted least squares.
• POOLING: Integrate the M analysis results into a final result. Simple rules exist for combining the M analyses.
How many Imputations?
Efficiency of an estimate: (1 + γ/m)^-1
  γ = fraction of missing information
  m = number of imputations
If 30% missing:
  3 imputations → 91% efficiency
  5 imputations → 94%
  10 imputations → 97%
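The efficiency formula is a one-liner; this sketch reproduces the slide's three figures for 30% missing information:

```python
def mi_efficiency(gamma, m):
    """Relative efficiency of an MI estimate: (1 + gamma/m) ** -1,
    where gamma is the fraction of missing information and m is the
    number of imputations."""
    return 1 / (1 + gamma / m)

# Reproduce the slide's figures for 30% missing information.
for m in (3, 5, 10):
    print(m, round(mi_efficiency(0.30, m), 2))  # 0.91, 0.94, 0.97
```

This is why a handful of imputations (often 5) was long considered adequate: the efficiency gain past m = 10 is tiny.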
Imputation in SAS
PROC MI
By default generates 5 imputation values for each missing value
Imputation method: MCMC (Markov Chain Monte Carlo)
EM algorithm determines initial values
MCMC repeatedly simulates the distribution of interest from which the imputed values are drawn
Assumption: Data follows multivariate normal distribution
PROC REG
Fits five weighted linear regression models to the
five complete data sets obtained from PROC MI
(uses the BY _Imputation_ statement)
PROC MIANALYZE
Reads the parameter estimates and associated
covariance matrix from the analyses performed on
the multiply imputed data sets and derives valid
statistics for the parameters
Example
Case 1 is missing weight. Given case 1's sex and age,
generate a plausible distribution for case 1's weight.
[Figure: density curve of plausible weights for case 1, spanning roughly 75 to 225]
At random, sample 5 (or more) plausible weights for case 1
Impute Y!
For case 6, sample from the conditional distribution of age.
Use Y to impute X!
For case 7, sample from the conditional bivariate distribution of age & weight.
[Data table: 20 cases with columns maleness, years_over_20, weight, with scattered missing entries; the first rows read case 1 = (0, 28, missing), case 2 = (1, 19, 218), case 3 = (1, 37, 235), case 4 = (0, 24, 150)]
Example
PROC MI DATA=missing_weight_age OUT=weight_age_mi;
  VAR years_over_20 weight maleness;
RUN;

PROC REG DATA=weight_age_mi;
  MODEL weight = maleness years_over_20;
  BY _Imputation_;
RUN;
_Imputation_  case  maleness  years_over_20  weight
     1          1      0           28          178
     1          2      1           19          218
     1          3      1           37          235
     2          1      0           28          101
     2          2      1           19          218
     2          3      1           37          235
     3          1      0           28          167
     3          2      1           19          218
     3          3      1           37          235
     4          1      0           28          152
     4          2      1           19          218
     4          3      1           37          235
     5          1      0           28          159
     5          2      1           19          218
     5          3      1           37          235
Example – Standard Errors
Total variance in b0
  = variation due to sampling + variation due to imputation
  = Mean(s²_b0) + Var(b0)
Actually, there's a correction factor of (1 + 1/M)
for the number of imputations M. (Here M = 5.)
So total variance in estimating b0 is
  Mean(s²_b0) + (1 + 1/M) Var(b0)
  = 179.53 + (1.2)(511.59) = 793.44
Standard error is √793.44 = 28.17
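The within-plus-between pooling can be wrapped as a small helper; the check at the bottom reproduces the slide's arithmetic (W = 179.53, B = 511.59, M = 5):

```python
def rubin_pool(estimates, sq_std_errors):
    """Rubin's rules: pooled estimate and total variance
    W + (1 + 1/M) * B (within- plus corrected between-imputation)."""
    m = len(estimates)
    qbar = sum(estimates) / m
    w = sum(sq_std_errors) / m                             # W: within
    b = sum((q - qbar) ** 2 for q in estimates) / (m - 1)  # B: between
    return qbar, w + (1 + 1 / m) * b

# Reproduce the slide's arithmetic: W = 179.53, B = 511.59, M = 5.
total = 179.53 + (1 + 1 / 5) * 511.59
print(round(total, 2), round(total ** 0.5, 2))  # 793.44  28.17
```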
Example
PROC MIANALYZE DATA=parameters;
  VAR intercept maleness years_over_20;
RUN;

Multiple Imputation Parameter Estimates
Parameter        Estimate     Std Error   95% Confidence Limits      DF
intercept      178.564526    28.168160     70.58553   286.5435   2.2804
maleness        67.110037    14.801696     21.52721   112.6929   3.1866
years_over_20   -0.960283     0.819559     -3.57294     1.6524   2.991

Other Software
www.stat.psu.edu/~jls/misoftwa.html
Item Analysis
Relevant for tests with a right / wrong
answer
Score the item so that 1=right and 0=wrong
Where do the answers come from?
Rational analysis
Empirical keying
Item Analysis
Goal of Item Analysis
Determine the extent to which the item is
useful in differentiating individuals with
respect to the focal construct
Improve the measure for future
administrations
Item Analysis
Item analysis provides info on the effectiveness
of the individual items for future use
Typology of Item Analysis
  Classical
  Item Response Theory
    Rasch
    IRT2
    IRT3
Item Analysis
Classical analysis is the easiest and most widely
used form of analysis
The statistics can be computed by generic
statistical packages (or by hand) and need no
specialist software
The item statistics apply only to that group of
testees on that collection of items
Sample Dependent!
Classical Item Analysis
Item Difficulty
Proportion Correct (1=correct; 0=wrong)
the higher the proportion the easier the item
In general, need a wide range of item difficulties
to cover the range of the trait being assessed
If mastery test, need item difficulties to cluster
around the cut score
Very easy (p = 1.0) or very hard (p = 0.0) items are
useless
Most variance at p = .5
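A tiny sketch of the difficulty statistic (scores are made up): the proportion correct, plus the item variance p(1 - p) that peaks at p = .5.

```python
def difficulty(scores):
    """Item difficulty = proportion correct (1 = right, 0 = wrong)."""
    return sum(scores) / len(scores)

# Hypothetical item answered by 10 examinees: fairly easy (p = .8).
item = [1, 1, 0, 1, 1, 0, 1, 1, 1, 1]
p = difficulty(item)
# Item variance p(1 - p) is largest at p = .5 (0.25) and shrinks
# toward 0 for very easy (p = 1.0) or very hard (p = 0.0) items.
print(p, round(p * (1 - p), 2))
```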
Classical Item Analysis
Item Discrimination – 2 methods
Difference in proportion correct between high
and low test score groups (27%)
Item-total correlation (output in Cronbach’s
alpha routines)
No negative discriminators
Check key or drop item
Zero discriminators are not useful
Item difficulty and discrimination are
interdependent
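A sketch of the item-total correlation method with made-up data. This is the uncorrected version (each item is included in its own total, which inflates the correlations somewhat); alpha routines typically report the corrected item-rest correlation instead.

```python
def pearson_r(a, b):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sxy = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    sxx = sum((ai - ma) ** 2 for ai in a) ** 0.5
    syy = sum((bi - mb) ** 2 for bi in b) ** 0.5
    return sxy / (sxx * syy)

# Rows = examinees, columns = items (1 = correct, 0 = wrong).
items = [[1, 1, 0],
         [1, 0, 0],
         [1, 1, 1],
         [0, 0, 0],
         [1, 1, 1]]
totals = [sum(row) for row in items]
for j in range(3):
    col = [row[j] for row in items]
    print(j, round(pearson_r(col, totals), 2))
```

A negative value here is the signal to check the scoring key or drop the item; values near zero mark items that do not differentiate examinees.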
Classical Item Analysis
Item   Answer   Item Diff   Item-total r   Item-Disc
I1       4        0.85          0.30          0.23
I2       2        0.96          0.26          0.07
I3       1        0.94          0.25          0.15
I4       3        0.78          0.45          0.39
I5       4        0.88          0.37          0.26
I6       2        0.81          0.34          0.32
I7       1        0.78          0.46          0.41
I8       2        0.87          0.44          0.29
I9       3        0.83          0.32          0.22
I10      3        0.88          0.18          0.11
I11      3        0.90          0.19          0.13
I12      2        0.95          0.24          0.10
I13      3        0.96          0.21          0.08
I14      2        0.94          0.32          0.14
I15      4        0.78          0.20          0.32
I16      3        0.50          0.24          0.21