Lecture 7: Model Building with Multiple Regression (Nov 3)

Advanced Statistical Methods: Continuous Variables
http://statisticalmethods.wordpress.com
Model Building with Multiple Regression
[email protected]
Selecting appropriate variables;
Data Screening:
- frequencies/descriptives;
- correlations: ‘real’?
- missing data: pattern? how much? why?
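The lab does this screening in SPSS; purely as an illustration, a rough equivalent in Python (pandas) might look like the sketch below. The variables and data here are synthetic stand-ins, not the course data set.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data file
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age":      rng.integers(18, 80, 200),
    "income":   rng.normal(50_000, 15_000, 200),
    "attitude": rng.normal(3.5, 1.0, 200),
})
df.loc[rng.choice(200, 15, replace=False), "income"] = np.nan  # plant some missing values

# Frequencies / descriptives
print(df.describe())

# Correlations: are they 'real' (non-trivial in size)?
print(df.corr())

# Missing data: how much, and on which variables?
print(df.isna().sum())
print((df.isna().mean() * 100).round(1))  # percent missing per variable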
Missing Data – Pattern
- missing at random
- non-random: affects generalizability of results
Test:
E.g.: construct dummy variable: 1 = cases missing on income; 0 = nonmissing; test of mean differences in attitude btw. the 2 groups.
If non-significant test → decision on missing = open (various options would work)
If significant → careful with decision on missing
SPSS: Missing Value Analysis (do in Lab)
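The lab will run this through SPSS Missing Value Analysis; as a hedged illustration only, the same dummy-variable check can be sketched in Python with an independent-samples t-test (scipy) on synthetic data:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy data: attitude observed for everyone, income missing for some cases
rng = np.random.default_rng(1)
df = pd.DataFrame({"attitude": rng.normal(3.5, 1.0, 200),
                   "income":   rng.normal(50_000, 15_000, 200)})
df.loc[rng.choice(200, 20, replace=False), "income"] = np.nan

# Dummy variable: 1 = case missing on income, 0 = non-missing
df["inc_missing"] = df["income"].isna().astype(int)

# Test of mean differences in attitude between the two groups
t, p = stats.ttest_ind(df.loc[df.inc_missing == 1, "attitude"],
                       df.loc[df.inc_missing == 0, "attitude"])
print(f"t = {t:.2f}, p = {p:.3f}")  # non-significant -> decision on missing is open
```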
Missing Data – How many?
- 5% or less & at random & large file → different ways of dealing w. missing give similar solutions;
- else, problematic
Dealing with Missing Data (M.D.)
Deleting M.D.
- default in most software packages;
- problematic if missing values = scattered through cases & variables
Imputation of Missing Values (estimating M.D.)
- Mean substitution
- Regression
- Expectation maximization
- Multiple imputation
M.D. correlation matrix
Treating Missing as Data
Repeating Analyses w. and without M.D.
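A rough Python sketch of a few of these options on synthetic data: listwise deletion, mean substitution, and a regression-style iterative imputation (scikit-learn's IterativeImputer, used here as a stand-in for regression imputation). Expectation maximization and full multiple imputation would normally be done with dedicated tools such as SPSS Missing Value Analysis or a MICE implementation.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer

# Toy data with missing income values
rng = np.random.default_rng(2)
df = pd.DataFrame({"age":    rng.integers(18, 80, 200).astype(float),
                   "income": rng.normal(50_000, 15_000, 200)})
df.loc[rng.choice(200, 20, replace=False), "income"] = np.nan

# 1. Deletion (listwise) -- the default in most packages
complete_cases = df.dropna()

# 2. Mean substitution
mean_imputed = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(df),
                            columns=df.columns)

# 3. Regression-style imputation: each variable with missing values is
#    predicted from the others, iteratively
reg_imputed = pd.DataFrame(IterativeImputer(random_state=0).fit_transform(df),
                           columns=df.columns)

print(complete_cases.shape)
print(mean_imputed["income"].mean(), reg_imputed["income"].mean())
```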
The Multiple Regression Model
Ŷ = a + b1X1 + b2X2 + ... + biXi
- the best prediction of a DV from several continuous (or dummy) IVs;
- also allows for non-linear relationships, by redefining the IV(s): squaring, cubing, etc. of the original IV
Regression coefficients:
- minimize (the sum of squared) deviations between Ŷ and Y;
- optimize the correlation btw. Ŷ and Y for the data set.
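As a minimal illustration of fitting this model, and of redefining an IV by squaring it to capture curvature, here is a sketch using Python's statsmodels on synthetic data; the variable names y, x1, x2 are made up for the example:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic example: Y depends on X1 linearly and on X2 curvilinearly
rng = np.random.default_rng(3)
df = pd.DataFrame({"x1": rng.normal(size=300), "x2": rng.normal(size=300)})
df["y"] = 2 + 1.5 * df.x1 - 0.8 * df.x2 + 0.5 * df.x2**2 + rng.normal(size=300)

# Ŷ = a + b1*X1 + b2*X2 + ...; the I(x2**2) term is the squared (redefined) IV
model = smf.ols("y ~ x1 + x2 + I(x2**2)", data=df).fit()
print(model.params)    # a, b1, b2, coefficient on x2 squared
print(model.rsquared)  # squared correlation between Ŷ and Y
```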
Assumptions
- Random sampling;
- DV = continuous; IV(s) = continuous (or can be treated as such), or dummies;
- Linear relationship btw. the DV & the IVs (but we can model non-linear relations);
- Normally distributed characteristics of Y in the population;
- Normality, linearity, and homoskedasticity btw. predicted DV scores (Ŷs) and the errors of prediction (residuals);
- Independence of errors;
- No large outliers
4. Assumptions of normality, linearity, and homoskedasticity btw. predicted DV scores (Ŷs) and the errors of prediction (residuals)
4.a. Multivariate Normality
- each variable & all linear combinations of the variables are normally distributed;
- if this assumption is met → residuals of analysis = normally distributed & independent
For grouped data: the assumption pertains to the sampling distribution of means of variables;
→ Central Limit Theorem: with a sufficiently large sample size, sampling distributions are normally distributed regardless of the distribution of the variables
What to look for (in ungrouped data):
- is each variable normally distributed?
Shape of distribution: skewness & kurtosis. Frequency histograms; expected normal probability plots; detrended expected normal probability plots
- are the relationships btw. pairs of variables (a) linear, and (b) homoskedastic (i.e. the variance of one variable is the same at all values of other variables)?
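These checks are done with SPSS plots in the lab; a possible Python sketch of the univariate part (skewness, kurtosis, a frequency histogram, and an expected normal probability plot) on a synthetic skewed variable is shown below. Pairwise scatterplots would be examined analogously for linearity and homoskedasticity.

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
x = rng.lognormal(mean=0, sigma=0.6, size=300)  # a positively skewed variable

# Shape of the distribution
print("skewness:", stats.skew(x), "kurtosis:", stats.kurtosis(x))

# Frequency histogram and expected normal probability (Q-Q) plot
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].hist(x, bins=30)
stats.probplot(x, dist="norm", plot=axes[1])
plt.show()
```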
Homoskedasticity
- for ungrouped data: the variability in scores for one continuous variable is ~ the same at all values of another continuous variable
- for grouped data: the variability in the DV is expected to be ~ the same at all levels of the grouping variable
Heteroskedasticity = caused by:
- non-normality of one of the variables;
- one variable is related to some transformation of the other;
- greater error of measurement at some level of an IV
Residuals Scatter Plots to check if:
4.a. Errors of prediction are normally distributed around each & every Ŷ
4.b. Residuals have a straight-line relationship with the Ŷs
- If genuine curvilinear relation btw. an IV and the DV, include a square of the IV in the model
4.c. The variance of the residuals about the Ŷs is ~the same for all predicted scores (assumption of homoskedasticity)
- heteroskedasticity may occur when:
  - some of the variables are skewed, and others are not; → may consider transforming the variable(s)
  - one IV interacts with another variable that is not part of the equation
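One way to produce such a residuals scatter plot is sketched below in Python (statsmodels + matplotlib) on synthetic data; in the lab the equivalent plot comes from the SPSS regression output.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
df = pd.DataFrame({"x1": rng.normal(size=300), "x2": rng.normal(size=300)})
df["y"] = 1 + 0.8 * df.x1 + 0.5 * df.x2 + rng.normal(size=300)

fit = smf.ols("y ~ x1 + x2", data=df).fit()

# Residuals vs predicted scores: look for (a) roughly normal scatter around 0,
# (b) no curvature, and (c) constant spread across the range of Ŷ
plt.scatter(fit.fittedvalues, fit.resid, s=10)
plt.axhline(0, color="grey")
plt.xlabel("Predicted Ŷ")
plt.ylabel("Residual")
plt.show()
```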
5. Errors of prediction are independent of one another
Durbin-Watson statistic = measure of autocorrelation of errors over the sequence of cases; if significant, it indicates non-independence of errors
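statsmodels reports the Durbin-Watson statistic directly; a minimal sketch on synthetic data follows. Judging significance would still require Durbin-Watson tables or the SPSS output; as a rule of thumb, values near 2 are consistent with independent errors.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(6)
X = sm.add_constant(rng.normal(size=(200, 2)))
y = X @ np.array([1.0, 0.8, 0.5]) + rng.normal(size=200)

fit = sm.OLS(y, X).fit()
# Values far from 2 (toward 0 or 4) suggest autocorrelation of the errors
# over the sequence of cases
print(durbin_watson(fit.resid))
```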