Regression Model Assumptions
Applied Regression Analysis
BUSI 6220
KNN Ch. 3
Diagnostics and Remedial Measures
Diagnostics for the Predictor Variable
Dot Plots
Sequence Plots
Stem-and-Leaf Plots
Essentially to check for outlying observations, which will be useful in later diagnosis.
Residual Analysis
Why Look at the Residuals?
Detect non-linearity of regression function
Detect Heteroscedasticity (=lack of constant variance)
Auto-correlation
Outliers
Non-normality
Important predictor variables left out?
Regression Model Assumptions:
Errors are Independent (Have Zero Covariance)
Errors have Constant Variance
Errors are Normally Distributed
Diagnostics for Residuals
Plots of residuals can detect: non-linearity of the regression function, heteroscedasticity, auto-correlation, outliers, non-normality, and important predictor variables left out.
Plot the residuals:
1. against the predictor (if X1 only)
2. (absolute or squared residuals) against the predictor
3. against fitted values (for many Xi)
4. against time
5. against omitted predictor variables
6. box plot
7. normal probability plot
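The fitted values and residuals that feed plots 1-7 can be computed directly; a minimal NumPy sketch on synthetic data (the slope, intercept, and noise below are made up for illustration):

```python
# Minimal sketch: fit a simple linear regression and extract the
# quantities used in residual diagnostics. Data are synthetic.
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0, 10, 30)
y = 1.0 + 2.0 * x + rng.normal(size=30)   # made-up true model

b1, b0 = np.polyfit(x, y, 1)              # OLS slope and intercept
fitted = b0 + b1 * x
resid = y - fitted
# These are what you would plot: resid vs x, resid vs fitted,
# |resid| vs x, resid vs time order, a box plot of resid, and a
# normal probability plot of resid.
```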
Diagnostics for Residuals
Normal probability plot
Approximate expected value of the kth smallest residual:
  sqrt(MSE) * z[ (k - 0.375) / (n + 0.25) ]
where z(p) is the p-th percentile of the standard normal distribution.
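The approximate expected value of the kth smallest residual under normality is sqrt(MSE) * z[(k - 0.375)/(n + 0.25)]; a sketch using SciPy's `norm.ppf` for the percentile function (the n and MSE values here are made up):

```python
# Expected values of the ordered residuals under normality,
# sqrt(MSE) * z((k - 0.375) / (n + 0.25)), for a normal probability plot.
import numpy as np
from scipy.stats import norm

def expected_order_stats(n, mse):
    """Approximate expected values of the n ordered residuals."""
    k = np.arange(1, n + 1)
    return np.sqrt(mse) * norm.ppf((k - 0.375) / (n + 0.25))

exp_vals = expected_order_stats(n=10, mse=4.0)  # illustrative values
```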
Tests involving Residuals
The Correlation test for Normality
H0: The residuals are normal
HA: The residuals are not normal
Correlation between the ordered residuals ei and their expected values under normality.
Use Table B.6.
The observed coefficient of correlation should be at least as large as the table value for a given level of significance.
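The test statistic can be sketched as follows; synthetic draws stand in for actual regression residuals, and the critical values of Table B.6 are not reproduced here:

```python
# Correlation test for normality: correlate the ordered residuals with
# their expected values under normality; a coefficient near 1 supports H0.
import numpy as np
from scipy.stats import norm

def normality_correlation(resid, mse):
    """Correlation of ordered residuals with normal-theory expected values
    (compare against the Table B.6 critical value)."""
    e = np.sort(resid)
    n = e.size
    k = np.arange(1, n + 1)
    expected = np.sqrt(mse) * norm.ppf((k - 0.375) / (n + 0.25))
    return np.corrcoef(e, expected)[0, 1]

rng = np.random.default_rng(0)
e = rng.normal(size=50)                       # stand-in residuals
r = normality_correlation(e, mse=e.var(ddof=2))  # stand-in for regression MSE
```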
Tests involving Residuals
Other tests for Normality
H0: The residuals are normal
HA: The residuals are not normal
Anderson-Darling (very powerful, may be used for
small sets, n<25)
Ryan-Joiner
Shapiro-Wilk
Kolmogorov-Smirnov
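Several of these tests are available in SciPy (Ryan-Joiner is Minitab-specific and has no SciPy counterpart); a sketch on synthetic residuals:

```python
# Running the listed normality tests on (synthetic) residuals with SciPy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
resid = rng.normal(size=40)              # stand-in residuals

w, p_sw = stats.shapiro(resid)           # Shapiro-Wilk
ad = stats.anderson(resid, dist='norm')  # Anderson-Darling
ks = stats.kstest(resid, 'norm')         # Kolmogorov-Smirnov (known params)
# Small p-values (or AD statistic above its critical value) reject H0.
```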
Tests involving Residuals
(Constancy of Error Variance)
The Modified Levene Test
Partitions the independent variable into two groups
(High X values and low X values), then tests the null
H0: The groups have equal variances
Similar to a pooled variance t-test for difference in two means
of independent samples.
It is robust to departures from normality of the error terms.
A large sample size is essential so that dependencies among the error terms can be neglected.
Uses the group “median” instead of the “mean” (Why?)
Tests involving Residuals
(Constancy of Error Variance)
The Modified Levene Test

  t*_L = (dbar_1 - dbar_2) / ( s * sqrt(1/n1 + 1/n2) )

where d_i1 = |e_i1 - etilde_1| and d_i2 = |e_i2 - etilde_2|
(etilde_1 and etilde_2 are the medians of the residuals in the two groups).

Now, the d_i1 and d_i2 are the data points, i.e. the t-test is based on these two sets of data points, with pooled variance

  s^2 = [ (n1 - 1)s1^2 + (n2 - 1)s2^2 ] / (n1 + n2 - 2)
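The procedure can be sketched as: split on the median of X, take absolute deviations of the residuals from each group's median, and run a pooled-variance two-sample t-test on those deviations. The data below are synthetic and homoscedastic by construction:

```python
# Modified Levene (Brown-Forsythe) test sketch on synthetic data.
import numpy as np
from scipy import stats

def modified_levene(resid, x):
    """Split residuals by low/high X, then t-test the absolute
    deviations from each group's median (pooled variance)."""
    split = np.median(x)
    e1, e2 = resid[x <= split], resid[x > split]
    d1 = np.abs(e1 - np.median(e1))
    d2 = np.abs(e2 - np.median(e2))
    return stats.ttest_ind(d1, d2, equal_var=True)

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=60)
resid = rng.normal(size=60)          # constant variance by construction
t_stat, p_val = modified_levene(resid, x)
# A large p-value is consistent with H0: equal error variances.
```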
Read the “Comments” on page 118 and go through the Breusch-Pagan test on page 119.
F test for Lack of Fit
A comparison of the “Full Model” sum of squares error and the “Lack of Fit” sum of squares.
For best results, requires repeat observations at at least one X level.
Full model: Yij = mj + eij (mj = mean response when X = Xj)
Reduced model: Yij = b0 + b1Xj + eij
(Why “Reduced”?)
F test for Lack of Fit
SSE(Full) = SSPE = sum_j sum_i ( Y_ij - Ybar_j )^2
(Labeled “Pure Error” since it is an unbiased estimator of the true error variance. See 3.31 and 3.32, page 123.)
SSLF = SSE(Reduced) - SSPE, where SSE(Reduced) = SSE from the ordinary least squares regression model.
Test Statistic:
  F* = [ SSLF / (c - p) ] / [ SSPE / (n - c) ]
(what is “p”?)
Be sure to compare the ANOVA table on page 126 with the OLS ANOVA table.
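The computation can be sketched end to end on synthetic data with replicate observations at each X level (all numbers below are made up; p = 2 for simple linear regression):

```python
# Lack-of-fit F test sketch: SSPE from level means, SSLF by subtraction,
# F* = [SSLF/(c-p)] / [SSPE/(n-c)]. Synthetic data, truly linear.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = np.repeat([1.0, 2.0, 3.0, 4.0, 5.0], 4)   # c = 5 levels, 4 replicates
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=x.size)

# SSE from the ordinary least squares line (reduced model)
b1, b0 = np.polyfit(x, y, 1)
sse_reduced = np.sum((y - (b0 + b1 * x)) ** 2)

# Pure error: squared deviations from each level's mean (full model)
levels = np.unique(x)
sspe = sum(np.sum((y[x == lv] - y[x == lv].mean()) ** 2) for lv in levels)

n, c, p = x.size, levels.size, 2
sslf = sse_reduced - sspe
f_star = (sslf / (c - p)) / (sspe / (n - c))
p_value = stats.f.sf(f_star, c - p, n - c)
# A small p_value would indicate lack of fit of the linear model.
```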
Overview of some Remedial Measures
The Problem: Simple Linear Regression is not appropriate.
The solution:
1. Abandon the model (“Eagle to Hawk; abort mission and return to base”.)
2. Remedy the situation:
If the error terms are not independent, then work with a model that allows correlated error terms (Ch. 12).
If heteroscedasticity, then use the WLS method to estimate parameters (Ch. 10), or use transformations of the data.
If the scatter plot indicates non-linearity, then either use a non-linear regression function (Ch. 7) or transform to linear.
NEXT: We will look at one such powerful transformation
method.
The Box-Cox Transformation Method
The family of power transforms on Y is given as: Y' = Y^lambda
The family easily includes simple transforms such as the square root, the square, etc.
By definition, when lambda = 0, then Y' = log_e(Y)
When the response variable is so transformed, the normal error regression model becomes: Y_i^lambda = b0 + b1*X_i + e_i
We would like to determine the “best” value of lambda
Method 1: Maximum likelihood estimation
Maximize over b0, b1, sigma^2, and lambda:

  L(b0, b1, sigma^2, lambda) = [ 1 / (2*pi*sigma^2)^(n/2) ] * exp[ -(1/(2*sigma^2)) * sum_{i=1..n} ( Y_i^lambda - b0 - b1*X_i )^2 ]
The Box-Cox Transformation Method
Method 2: Numerical Search
Step 1: Set a value of lambda.
Step 2: Standardize the Yi observations:
  If lambda ≠ 0, then: Wi = K1 * (Yi^lambda - 1)
  If lambda = 0, then: Wi = K2 * log_e(Yi)
  where K2 = ( product_{i=1..n} Yi )^(1/n), the geometric mean of the Yi,
  and K1 = 1 / ( lambda * K2^(lambda - 1) )
Step 3: Now regress the set W on the set X.
Step 4: Note the corresponding SSE.
Step 5: Change lambda, and repeat Steps 2 to 4 until the lowest SSE is obtained.
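Steps 1-5 can be sketched as a grid search; the data below are synthetic and log-linear by construction (not the GMAT data), so a lambda near 0, the log transform, is expected to win:

```python
# Numerical search for the Box-Cox lambda (Steps 1-5 above).
import numpy as np

def boxcox_sse(x, y, lam):
    """Standardize Y for this lambda, regress W on X, return the SSE."""
    k2 = np.exp(np.mean(np.log(y)))              # geometric mean of Y
    if lam == 0:
        w = k2 * np.log(y)                       # lambda = 0 case
    else:
        k1 = 1.0 / (lam * k2 ** (lam - 1.0))
        w = k1 * (y ** lam - 1.0)                # lambda != 0 case
    b1, b0 = np.polyfit(x, w, 1)                 # Step 3: regress W on X
    return float(np.sum((w - (b0 + b1 * x)) ** 2))  # Step 4: SSE

rng = np.random.default_rng(4)
x = np.linspace(1.0, 10.0, 50)
y = np.exp(0.3 + 0.2 * x + rng.normal(scale=0.1, size=50))  # log-linear truth

lams = np.round(np.arange(-1.0, 2.01, 0.1), 1)  # Steps 1 and 5: the grid
sses = [boxcox_sse(x, y, lam) for lam in lams]
best_lam = float(lams[int(np.argmin(sses))])
```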
Let’s try both methods with the GMAT data.
What should we get as the best lambda?