Transcript Class 2

Basics of Data Cleaning
Why Examine Your Data?
• Basic understanding of the data set
• Ensure statistical and theoretical
underpinnings of a given m.v. technique
are met
• Concerns about the data
– Departures from distribution assumptions (e.g., normality)
– Outliers
– Missing Data
Testing Assumptions
• MV Normality assumption
– When the assumption is met, the solution is better
• Violation of MV Normality
– Skewness (symmetry)
– Kurtosis (peakedness)
– Heteroscedasticity (unequal variances)
– Non-linearity
Negative Skew
[Figure: histogram of GPA (0.00 to 4.00) showing a negatively skewed distribution]
Positive Skew
[Figure: histogram of GPA (0.00 to 4.00) showing a positively skewed distribution]
Kurtosis
– Mesokurtic (normal peakedness)
– Leptokurtic (too peaked)
– Platykurtic (too flat)
Skewness & Kurtosis
SPSS Syntax
FREQUENCIES
VARIABLES=age
/STATISTICS=SKEWNESS
SESKEW KURTOSIS SEKURT
/ORDER= ANALYSIS.
z = Statistic / Std. Error
Critical Values for z score
.05  +/- 1.96
.01  +/- 2.58
Statistics (SPSS output for age):
  N: Valid = 140, Missing = 4
  Skewness = .354 (Std. Error = .205)
  Kurtosis = -.266 (Std. Error = .407)

z(Skewness) = .354 / .205 = 1.73
z(Kurtosis) = -.266 / .407 = -.654
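As a rough sketch (my addition, not from the slides), the same z computation can be scripted in Python. The simulated `age` data below are an assumption, so the numbers will not match the SPSS output above, and the simple standard-error formulas are approximations rather than SPSS's exact ones:

```python
# Hedged sketch: convert skewness and kurtosis to z scores, as in the
# SPSS output above. The simulated ages are hypothetical stand-in data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
age = rng.gamma(shape=9.0, scale=4.0, size=140)  # right-skewed, n = 140

n = len(age)
skew = stats.skew(age)        # sample skewness
kurt = stats.kurtosis(age)    # excess kurtosis (0 for a normal curve)
se_skew = np.sqrt(6.0 / n)    # approximate standard errors
se_kurt = np.sqrt(24.0 / n)   # (SPSS uses slightly different exact formulas)

z_skew = skew / se_skew
z_kurt = kurt / se_kurt

# compare |z| to 1.96 (alpha = .05) or 2.58 (alpha = .01)
print(f"z(skewness) = {z_skew:.2f}, z(kurtosis) = {z_kurt:.2f}")
```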
Homoscedasticity
[Figure: four group distributions with means m1 through m4 and variances σ²1 through σ²4]
σ²1 = σ²2 = σ²3 = σ²4 = σ²e
When there are multiple groups, each group has similar levels of variance (similar standard deviation).
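One common way to check this assumption in practice (my addition, not the lecture's) is Levene's test; the four simulated groups below are hypothetical:

```python
# Hedged sketch: Levene's test for equality of variances across groups.
# The four simulated groups (equal SD, different means) are assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
groups = [rng.normal(loc=m, scale=5.0, size=50) for m in (10, 12, 14, 16)]

w, p = stats.levene(*groups)  # robust to non-normality
print(f"Levene W = {w:.2f}, p = {p:.3f}")  # a large p gives no evidence of unequal variances
```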
Linearity
[Figure: scatterplot of Zscore(PERF) against Zscore(STRESS)]
Testing the Assumptions of
Absence of Correlated Errors
• Correlated errors mean there is an
unmeasured variable affecting the
analysis
• Key is to identify the unmeasured
variable and to include it in the analysis
• How often do we meet this assumption?
Data Cleaning
• Examine
– Individual items/scales (i.e., reliability)
– Bivariate relationships
– Multivariate relationships
• Techniques to use
– Graphs: non-normality, heteroscedasticity
– Frequencies: missing data, out-of-bounds values
– Univariate outliers (+/- 3 SD from mean)
– Mahalanobis Distance (.001)
Graphical Examination
• Single Variable: Shape of Distribution
– Histogram
– Stem and leaf
• Relationships between two+ variables
– Scatterplot
Histogram
[Figure: histogram of yrsexp; Mean = 11.8008, Std. Dev. = 10.1035, N = 364]
Scatterplot
[Figure: scatterplot of yrsexp (y-axis) against age (x-axis)]
Frequencies
race

                           Frequency   Percent   Valid Percent   Cumulative Percent
Valid    white, not hispanic      290      78.2            78.6                78.6
         african american/black    31       8.4             8.4                87.0
         asian/pacific islander    19       5.1             5.1                92.1
         hispanic                  19       5.1             5.1                97.3
         other                      9       2.4             2.4                99.7
         9.00                       1        .3              .3               100.0
         Total                    369      99.5           100.0
Missing  System                     2        .5
Total                             371     100.0

Note the single case coded 9.00: an out-of-bounds value that the frequency table reveals.
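A frequency count like the table above is easy to script (my sketch, not the lecture's); the race codes here, with 1 through 5 valid and 9.00 invalid, are an assumption:

```python
# Hedged sketch: a frequency table flags out-of-bounds codes. The coding
# scheme (1-5 valid, 9.00 invalid) is hypothetical.
import pandas as pd

race = pd.Series([1] * 290 + [2] * 31 + [3] * 19 + [4] * 19 + [5] * 9
                 + [9.00] + [None] * 2)

counts = race.value_counts(dropna=False)          # includes the missing cases
out_of_bounds = race[race.notna() & ~race.isin([1, 2, 3, 4, 5])]
print("out-of-bounds cases:", len(out_of_bounds))  # -> 1
```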
Outliers
• Where do outliers come from?
– Inclusion of subjects not part of the
population (e.g., ESL response to
vocabulary test)
– Legitimate data points*
– Extreme values of random error (X = t + e)
– Error in observation
– Error in data preparation
Univariate Outliers
• Criterion: mean +/- 3 SD
• Example: Age
– Mean = 34.68
– SD = 10.05
• Out of range values > 64.83 or < 4.53
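The mean ± 3 SD rule can be sketched as follows (my addition); the data are simulated around the slide's mean and SD, with one outlier planted deliberately:

```python
# Hedged sketch: flag univariate outliers beyond mean +/- 3 SD.
# Simulated data with one planted outlier (85.0) are assumptions.
import numpy as np

rng = np.random.default_rng(2)
age = np.append(rng.normal(34.68, 10.05, size=369), 85.0)

mean, sd = age.mean(), age.std(ddof=1)
lower, upper = mean - 3 * sd, mean + 3 * sd
outliers = age[(age < lower) | (age > upper)]
print(f"bounds = ({lower:.2f}, {upper:.2f}), outliers found = {len(outliers)}")
```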
Univariate Outliers
[Figure: histogram of age; Mean = 34.6784, Std. Dev. = 10.04525, N = 370]
Multivariate Outliers
Mahalanobis Distance SPSS Syntax
REGRESSION VARIABLES=case VAR1 VAR2
 /STATISTICS COLLIN
 /DEPENDENT=case
 /METHOD=ENTER
 /RESIDUALS=OUTLIERS(MAHAL).
Critical values (χ² with df = number of variables, α = .001; a case with D² > c.v. is an m.v. outlier):
  two variables: 13.82
  three variables: 16.27
  four variables: 18.47
  five variables: 20.52
  six variables: 22.46
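The same computation can be done directly (my sketch, not the course's SPSS route): compute each case's squared Mahalanobis distance and compare it to the χ² critical value. The bivariate data and the planted outlier are assumptions:

```python
# Hedged sketch: squared Mahalanobis distance vs. the chi-square critical
# value at alpha = .001 (df = number of variables), as in the table above.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(3)
X = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], size=200)
X = np.vstack([X, [6.0, -6.0]])      # each value is plausible; the pair is not

diff = X - X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)  # squared distances

cv = chi2.ppf(1 - 0.001, df=2)       # 13.82 for two variables
print("m.v. outlier cases:", np.where(d2 > cv)[0])
```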
Approaches to Outliers
• Leave them alone
• Delete entire case (listwise)
• Delete only relevant variables (pairwise)
• Trim – highest legitimate value
• Mean substitution
• Imputation
Effects of Outliers
[Figure: two scatterplots of Zscore(EXAM) against Zscore(HMWRK), one with r = .50 and one with r = .32]
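A quick simulation (my addition, with assumed variable names and a planted point) shows how a single discrepant case can move r:

```python
# Hedged sketch: one contrary case pulls r away from its true value.
# The simulated homework/exam scores and the planted point are assumptions.
import numpy as np

rng = np.random.default_rng(4)
hmwrk = rng.normal(size=100)
exam = 0.5 * hmwrk + rng.normal(scale=0.866, size=100)  # true r about .5

r_clean = np.corrcoef(hmwrk, exam)[0, 1]
r_outlier = np.corrcoef(np.append(hmwrk, 3.5),
                        np.append(exam, -3.5))[0, 1]    # one contrary case
print(f"r without outlier = {r_clean:.2f}, with outlier = {r_outlier:.2f}")
```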
Effects of Outliers
[Figure: regression lines of Interviews on Publications for N = 500 with no outlier and for N = 501, 401, 301, 201, 101, 51, and 26 with the outlier present]
Major Problems: Missing Data
• Generalizability issues
• Reduces power (sample size)
• Impacts accuracy of results
– Accuracy = dispersion around true score
(can be under- or over-estimation)
– Varies with the missing data technique (MDT) used
Dealing with Missing Data
• Listwise deletion
• Pairwise deletion
• Mean substitution
• Regression imputation
• Hot-deck imputation
• Multiple imputation
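Three of these techniques can be sketched in a few lines of pandas (my addition); the tiny data frame is hypothetical:

```python
# Hedged sketch: listwise deletion, pairwise deletion, and mean
# substitution on a hypothetical data frame with missing values.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25.0, 30.0, np.nan, 41.0],
                   "yrsexp": [2.0, np.nan, 10.0, 15.0]})

listwise = df.dropna()            # drop any case with a missing value
pairwise_r = df.corr()            # pandas correlates pairwise-complete cases
mean_sub = df.fillna(df.mean())   # mean substitution
print(len(listwise), mean_sub.loc[2, "age"])  # -> 2 32.0
```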
Dealing with Missing Data
In Order of Accuracy:
• Pairwise deletion
• Listwise deletion
• Regression imputation
• Mean substitution
• Hot-deck imputation
Dealing with Missing Data
• Listwise deletion. Pros: easy to use; high accuracy. Cons: reduces sample size.
• Pairwise deletion. Pros: easy to use; highest accuracy. Cons: problematic in MV analyses (can yield a non-positive definite correlation matrix).
• Mean substitution. Pros: easy to use; saves data (preserves sample size). Cons: moderate accuracy; attenuation of findings.
• Regression imputation (no error term adjustment). Pros: saves data (preserves sample size); moderate accuracy. Cons: difficult to use; can't be used when all predictors are missing.
• Hot-deck imputation. Pros: saves data (preserves sample size). Cons: lots of bias and error.
Transformations
Distribution → Best Transformation to Try
• Moderate deviation from normality → square root
• Substantial deviation from normality → log
• Severe deviation, especially J-shaped → inverse
• Negative skew → "reflect" (mirror image), then transform
• Interpretation of transformed variables?
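The transformation ladder above can be sketched on a simulated positively skewed variable (my addition; the data and variable names are assumptions):

```python
# Hedged sketch of the transformation ladder on simulated skewed data.
import numpy as np

rng = np.random.default_rng(5)
x = rng.gamma(shape=2.0, scale=2.0, size=500)  # positively skewed

sqrt_x = np.sqrt(x)        # moderate deviation from normality
log_x = np.log(x + 1)      # substantial deviation (+1 guards against log 0)
inv_x = 1.0 / (x + 1)      # severe deviation, especially J-shaped

neg = -x                               # hypothetical negatively skewed variable
reflected = neg.max() + 1 - neg        # "reflect" so the tail points right
sqrt_reflected = np.sqrt(reflected)    # then apply the usual transformation
```

Note that reflecting reverses the direction of scores, which is one reason the interpretation of transformed variables (the last bullet above) needs care.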