Multiple regression refresher

Austin Troy
NR 245
Based primarily on material accessed from Garson, G. David. 2010. Multiple Regression. Statnotes: Topics in Multivariate Analysis. http://faculty.chass.ncsu.edu/garson/PA765/statnote.htm
Purpose
• Y (dependent) modeled as a function of a vector of X’s (independents)
• Y = a + b1X1 + b2X2 + … + bnXn + e
• Test whether each b = 0
• Each X adds a dimension
• With multiple X’s, each coefficient gives the effect of Xi controlling for all other X’s (see the R sketch below)
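
A minimal R sketch of fitting such a model, using simulated data and made-up variable names (x1, x2, y) rather than the course dataset:

# Hedged sketch: fit Y = a + b1*X1 + b2*X2 + e with lm() on simulated data
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 0.5 * x1 - 1.2 * x2 + rnorm(n)   # true a = 2, b1 = 0.5, b2 = -1.2

fit <- lm(y ~ x1 + x2)
summary(fit)   # betas, standard errors, t statistics, p values (tests of b = 0), and R-squared

Each coefficient reported by summary(fit) is the effect of that X holding the other X constant.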
Assumptions
• Proper specification of the model
• Linearity of relationships. Nonlinearity is usually not a problem when the SD of Y is more than the SD of the residuals.
• Normality in the error term (not in Y)
• Same underlying distribution for all variables
• Homoscedasticity/constant variance. Heteroskedasticity may mean an omitted interaction effect. Can use weighted least squares regression or a transformation.
• No outliers; check with leverage statistics (see the sketch after this list)
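
A hedged sketch of leverage and influence checks, continuing from the simulated fit above (the 2 x mean cutoff is a common rule of thumb, not from the slides):

# Leverage and influence diagnostics for the lm object 'fit'
lev <- hatvalues(fit)             # leverage of each observation
which(lev > 2 * mean(lev))        # flag unusually high-leverage points
cooks.distance(fit)               # combined leverage/residual influence
plot(fit, which = 5)              # residuals vs. leverage, with Cook's distance contours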
Assumptions
• Interval, continuous, unbounded data
• Non-simultaneity/recursivity: causality runs one way
• Unbounded data
• Absence of perfect or high partial multicollinearity
• Population error is uncorrelated with each of the independents ("assumption of mean independence"): mean error doesn’t vary with X
• Independent observations (absence of autocorrelation), leading to uncorrelated error terms; no spatial/temporal autocorrelation (see the sketch after this list)
• Mean population error = 0
• Random sampling
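
One common check on the no-autocorrelation assumption (not shown in the slides) is a Durbin-Watson test; a hedged sketch using the lmtest package and the simulated variables from the first sketch:

# Durbin-Watson test for first-order serial correlation in the residuals
# (a statistic near 2 is consistent with uncorrelated errors)
library(lmtest)
dwtest(y ~ x1 + x2)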
Outputs of regression
• Model fit
– R2 = 1 - (SSE/SST), where SSE = error sum of squares and SST = total sum of squares (recomputed in the sketch below)
– Coefficients table: intercept, betas, standard errors, t statistics, p values
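
As a check on the summary() output, R-squared can be recomputed by hand from the simulated fit above:

# R-squared = 1 - SSE/SST, computed from the residuals of 'fit'
SSE <- sum(residuals(fit)^2)      # error (residual) sum of squares
SST <- sum((y - mean(y))^2)       # total sum of squares
1 - SSE / SST                     # matches summary(fit)$r.squared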
A simple univariate model
A simple multivariate model
Another example: car price
Addressing multicollinearity
• Intercorrelation of the Xs. When excessive, the standard errors of the beta coefficients become large, making it hard to assess the relative importance of the Xs.
• Is a problem when the research purpose includes causal modeling.
• Increasing sample size can offset it.
• Options:
– Mean center the data
– Combine variables into a composite variable
– Remove the most intercorrelated variable(s) from the analysis
– Use partial least squares, which doesn’t assume no multicollinearity
• Ways to check: correlation matrix, variance inflation factors (VIF). VIF > 4 is a common rule of thumb (see the sketch after this slide).
• VIF from the last model:
  diasbp.1      age.1  generaldiet.1  exercise.1  drinker.1
  1.136293   1.120658       1.088769    1.101922   1.019268
• However, here is the VIF when we regress BMI, age and weight against blood pressure:
     age.1      bmi.1       wt.1
   1.13505   3.164127   3.310382
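
A hedged sketch of computing VIFs with the car package, on simulated data with deliberately collinear predictors (the variable names mimic the slide, but the data and numbers are made up):

# Variance inflation factors via car::vif
library(car)
set.seed(2)
m   <- 200
age <- rnorm(m, 50, 10)
wt  <- rnorm(m, 80, 12)
bmi <- wt / 2.9 + rnorm(m, 0, 1)             # bmi built from wt, so the two are highly correlated
bp  <- 90 + 0.3 * age + 0.2 * wt + rnorm(m, 0, 5)

vif(lm(bp ~ age + bmi + wt))                 # expect clearly inflated VIFs for bmi and wt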
Addressing nonconstant variance
• Bottom graph in the source figure is the ideal
• Diagnosed with residual plots (or absolute-residual plots)
• Look for a funnel shape
• Generally suggests the need for one of the following (see the sketch after this slide):
– a generalized linear model,
– a transformation,
– weighted least squares, or
– addition of variables (with which the error is correlated)
• Figure source: http://www.originlab.com/www/helponline/Origin8/en/regression_and_curve_fitting/graphic_residual_analysis.htm
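
A hedged sketch of the funnel-shape diagnosis and a weighted least squares refit; the inverse-x-squared weights are one common assumption, not the slides' prescription:

# Simulate data whose error SD grows with x, diagnose it, then refit with WLS
set.seed(3)
xh <- runif(200, 1, 10)
yh <- 3 + 2 * xh + rnorm(200, sd = 0.5 * xh)   # heteroskedastic errors

ols <- lm(yh ~ xh)
plot(fitted(ols), abs(resid(ols)))             # absolute-residual plot; a funnel signals nonconstant variance

wls <- lm(yh ~ xh, weights = 1 / xh^2)         # weight by assumed inverse variance
summary(wls)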
Considerations: Model specification
• A U shape or an upside-down U in the residuals suggests a nonlinear relationship between the Xs and Y.
• Note: full-model residual plots versus partial residual plots
• Possible transformations: semi-log, log-log, square root, inverse, power, Box-Cox (see the sketch below)
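
A hedged sketch of choosing a power transformation with the Box-Cox procedure from the MASS package, on simulated data with multiplicative error:

# Box-Cox search for a variance-stabilizing / linearizing power transformation
library(MASS)
set.seed(4)
xb <- runif(200, 1, 10)
yb <- exp(0.5 + 0.3 * xb + rnorm(200, sd = 0.3))   # built so that log(yb) is the natural scale

bc <- boxcox(lm(yb ~ xb))          # plots the profile log-likelihood over lambda
bc$x[which.max(bc$y)]              # lambda near 0 points to a log transformation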
Considerations: normality
• Normal quantile (Q-Q) plot of the residuals (see the sketch below)
• Close to normal: points track the reference line
• A population skewed to the right (i.e. with a long right-hand tail) shows up as a curved pattern
• Heavy-tailed populations are symmetric, with more members at a greater remove from the population mean than in a normal population with the same standard deviation.
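
A hedged sketch of a normal quantile plot of residuals in base R, reusing the simulated fit from the first sketch:

# Normal Q-Q plot of residuals; points near the line indicate approximately normal errors
qqnorm(resid(fit))
qqline(resid(fit))
plot(fit, which = 2)   # equivalent built-in Q-Q plot for an lm object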