Regression Model Assumptions


Applied Regression Analysis
BUSI 6220
KNN Ch. 3
Diagnostics and Remedial Measures
Diagnostics for the Predictor Variable

- Dot plots
- Sequence plots
- Stem-and-leaf plots

Essentially to check for outlying observations, which will be useful in later diagnosis; a short code sketch follows.
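A minimal sketch, not from the slides, of how these predictor checks might look in Python; the data column is hypothetical simulated data, and the stem-and-leaf display is printed by hand since matplotlib has no built-in for it.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(50, 10, size=30).round()        # hypothetical predictor values

fig, ax = plt.subplots(1, 2, figsize=(9, 3))
ax[0].plot(x, np.zeros_like(x), "o", alpha=0.4)  # dot plot: values on one axis
ax[0].set_title("dot plot")
ax[1].plot(x, marker="o")                        # sequence plot: values in order
ax[1].set_title("sequence plot")
plt.show()

# stem-and-leaf by tens: each row is one stem with its sorted leaves
for stem in sorted({int(v // 10) for v in x}):
    leaves = sorted(int(v % 10) for v in x if int(v // 10) == stem)
    print(f"{stem:2d} | {''.join(map(str, leaves))}")
```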
Residual Analysis

Why Look at the Residuals?
- Detect non-linearity of the regression function
- Detect heteroscedasticity (lack of constant variance)
- Detect auto-correlation
- Detect outliers
- Detect non-normality
- Check whether important predictor variables were left out

Regression Model Assumptions:
- Errors are independent (have zero covariance)
- Errors have constant variance
- Errors are normally distributed
Diagnostics for Residuals

What residual plots can reveal:
- Non-linearity of the regression function
- Heteroscedasticity
- Auto-correlation
- Outliers
- Non-normality
- Important predictor variables left out

Plots of residuals (sketched in code below):
1. Against the predictor (if X1 is the only predictor)
2. Absolute or squared residuals against the predictor
3. Against fitted values (for many Xi)
4. Against time
5. Against omitted predictor variables
6. Box plot
7. Normal probability plot
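A minimal sketch of these plots on hypothetical simulated data; the names `df` and `fit` are assumptions, not from the slides. The plot against an omitted predictor would follow the same pattern with that variable on the x-axis.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": np.linspace(1, 10, 30)})           # hypothetical data
df["y"] = 2 + 0.5 * df["x"] + rng.normal(scale=1.0, size=len(df))

fit = sm.OLS(df["y"], sm.add_constant(df["x"])).fit()
e = fit.resid

fig, ax = plt.subplots(2, 3, figsize=(12, 7))
ax[0, 0].scatter(df["x"], e);          ax[0, 0].set_title("e vs. predictor")
ax[0, 1].scatter(df["x"], np.abs(e));  ax[0, 1].set_title("|e| vs. predictor")
ax[0, 2].scatter(fit.fittedvalues, e); ax[0, 2].set_title("e vs. fitted values")
ax[1, 0].plot(e.values, marker="o");   ax[1, 0].set_title("e vs. time order")
ax[1, 1].boxplot(e);                   ax[1, 1].set_title("box plot of e")
sm.qqplot(np.asarray(e), line="q", ax=ax[1, 2])
ax[1, 2].set_title("normal probability plot")
plt.tight_layout()
plt.show()
```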
Diagnostics for Residuals
Normal probability plot
Approximate expected value of the kth smallest residual:

$\sqrt{MSE}\; z\!\left(\dfrac{k - 0.375}{n + 0.25}\right)$

where $z(\cdot)$ denotes the corresponding percentile of the standard normal distribution.
Tests involving Residuals
The Correlation Test for Normality

H0: The residuals are normal
HA: The residuals are not normal

- Compute the correlation between the ordered residuals e(i) and their expected values under normality.
- Use Table B.6.
- The observed coefficient of correlation should be at least as large as the table value for a given level of significance; otherwise, reject H0.
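A minimal sketch of this test, assuming `fit` is the statsmodels OLS result from the earlier sketch; the critical value still has to be looked up in Table B.6 of the text.

```python
import numpy as np
from scipy import stats

e_ord = np.sort(np.asarray(fit.resid))   # ordered residuals e(1) <= ... <= e(n)
n = len(e_ord)
k = np.arange(1, n + 1)

# expected value of the kth smallest residual under normality
expected = np.sqrt(fit.mse_resid) * stats.norm.ppf((k - 0.375) / (n + 0.25))

r = np.corrcoef(e_ord, expected)[0, 1]
print(f"correlation = {r:.4f}")
# reject H0 (normality) if r falls below the Table B.6 value
# for the chosen significance level
```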
Tests involving Residuals
Other Tests for Normality

H0: The residuals are normal
HA: The residuals are not normal

- Anderson-Darling (very powerful; may be used for small sets, n < 25)
- Ryan-Joiner
- Shapiro-Wilk
- Kolmogorov-Smirnov
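The same hypotheses can also be checked with library routines; a sketch using scipy on the ordered residuals from the previous sketch (Ryan-Joiner is Minitab's variant of the correlation test and has no direct scipy equivalent).

```python
from scipy import stats

w, p_sw = stats.shapiro(e_ord)                 # Shapiro-Wilk
ad = stats.anderson(e_ord, dist="norm")        # Anderson-Darling
d, p_ks = stats.kstest(                        # Kolmogorov-Smirnov, with
    e_ord, "norm",                             # mean/sd estimated from data
    args=(e_ord.mean(), e_ord.std(ddof=1)),
)

print(f"Shapiro-Wilk:       W  = {w:.4f}, p = {p_sw:.4f}")
print(f"Anderson-Darling:   A2 = {ad.statistic:.4f}")
print(f"Kolmogorov-Smirnov: D  = {d:.4f}, p = {p_ks:.4f}")
```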
Tests involving Residuals (Constancy of Error Variance)
The Modified Levene Test

- Partitions the observations on the independent variable into two groups (high X values and low X values), then tests the null hypothesis
  H0: The groups have equal error variances
- Similar to a pooled-variance t-test for the difference in two means of independent samples.
- It is robust to departures from normality of the error terms.
- A large sample size is essential so that dependencies of the error terms on each other can be neglected.
- Uses the group "median" instead of the "mean" (Why?)
Tests involving Residuals (Constancy of Error Variance)
The Modified Levene Test

$t_L^* = \dfrac{\bar{d}_1 - \bar{d}_2}{s\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$

where $d_{i1} = |e_{i1} - \tilde{e}_1|$ and $d_{i2} = |e_{i2} - \tilde{e}_2|$, with $\tilde{e}_1$ and $\tilde{e}_2$ the group medians.

Now the $d_{i1}$ and $d_{i2}$ are the data points, i.e., the t-test is based on these two sets of data points, with pooled variance

$s^2 = \dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$
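A minimal sketch of this test, splitting the observations from the earlier sketch at the median of X; `df` and `fit` are the hypothetical data and OLS result from above. For two groups, scipy's Brown-Forsythe statistic is the square of the t statistic computed by hand.

```python
import numpy as np
from scipy import stats

x = np.asarray(df["x"])
e = np.asarray(fit.resid)
low = x <= np.median(x)
e1, e2 = e[low], e[~low]                 # low-X and high-X residual groups

# absolute deviations from the group MEDIANS (not means)
d1 = np.abs(e1 - np.median(e1))
d2 = np.abs(e2 - np.median(e2))

t_stat, p = stats.ttest_ind(d1, d2)      # pooled-variance t-test on the d's
print(f"t*_L = {t_stat:.4f}, p = {p:.4f}")

# equivalent built-in; center="median" gives the modified (Brown-Forsythe)
# version, reported as an F statistic (F = t^2 for two groups)
W, p2 = stats.levene(e1, e2, center="median")
print(f"Levene W = {W:.4f}, p = {p2:.4f}")
```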
Read the "Comments" on page 118 and go through the Breusch-Pagan test on page 119.
F test for Lack of Fit

- A comparison of the "Full Model" sum of squares error and the "Lack of Fit" sum of squares.
- For best results, requires repeat observations at at least one X level.
- Full model: $Y_{ij} = \mu_j + \varepsilon_{ij}$ ($\mu_j$ = mean response when $X = X_j$)
- Reduced model: $Y_{ij} = \beta_0 + \beta_1 X_j + \varepsilon_{ij}$ (Why "Reduced"?)
F test for Lack of Fit

$SSE(\text{Full}) = SSPE = \sum_j \sum_i \left(Y_{ij} - \bar{Y}_j\right)^2$

(Labeled "Pure Error" since it is an unbiased estimator of the true error variance. See 3.31 and 3.32, page 123.)

$SSLF = SSE(\text{Reduced}) - SSPE$, where $SSE(\text{Reduced})$ is the SSE from the ordinary least squares regression model.

Test statistic:

$F^* = \dfrac{SSLF / (c - p)}{SSPE / (n - c)}$

where c is the number of distinct X levels. (What is "p"?)

Be sure to compare the ANOVA table on page 126 with the OLS ANOVA table.
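A minimal sketch of the lack-of-fit F test on hypothetical data with repeat observations at each X level (the class data set is not reproduced here).

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(1)
d = pd.DataFrame({"x": np.repeat([1.0, 2.0, 3.0, 4.0], 5)})  # 4 levels, 5 repeats
d["y"] = 3 + 2 * d["x"] + rng.normal(scale=0.5, size=len(d))

ols = sm.OLS(d["y"], sm.add_constant(d["x"])).fit()
sse_reduced = ols.ssr                      # SSE(Reduced) from the OLS fit

# SSPE: squared deviations of Y around the mean response at each X level
sspe = ((d["y"] - d.groupby("x")["y"].transform("mean")) ** 2).sum()
sslf = sse_reduced - sspe                  # SSLF = SSE(Reduced) - SSPE

n, c, p = len(d), d["x"].nunique(), 2      # p = 2 betas in the reduced model
F = (sslf / (c - p)) / (sspe / (n - c))
print(f"F* = {F:.4f}, p = {stats.f.sf(F, c - p, n - c):.4f}")
```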
Overview of some Remedial Measures

The problem: simple linear regression is not appropriate. The solution:
1. Abandon the model ("Eagle to Hawk; abort mission and return to base.")
2. Remedy the situation:
   - If non-independent error terms: work with a model that calls for correlated error terms (Ch. 12).
   - If heteroscedasticity: use the WLS method to estimate parameters (Ch. 10), or use transformations of the data.
   - If the scatter plot indicates non-linearity: either use a non-linear regression function (Ch. 7) or transform to linear.

NEXT: We will look at one such powerful transformation method.

The Box-Cox Transformation Method

- The family of power transforms on Y is given as: $Y' = Y^{\lambda}$
- The family easily includes simple transforms such as the square root, the square, etc.
- By definition, when $\lambda = 0$, then $Y' = \log_e Y$.
- When the response variable is so transformed, the normal error regression model becomes: $Y_i^{\lambda} = \beta_0 + \beta_1 X_i + \varepsilon_i$
- We would like to determine the "best" value of $\lambda$.

Method 1: Maximum likelihood estimation

$\max_{\beta_0,\,\beta_1,\,\lambda,\,\sigma^2} \; L = \left(\dfrac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left[-\dfrac{1}{2\sigma^2}\sum_{i=1}^{n}\left(Y_i^{\lambda} - \beta_0 - \beta_1 X_i\right)^2\right]$



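A minimal sketch of a likelihood-based estimate using scipy, on hypothetical positive responses. Note that `stats.boxcox` maximizes the likelihood for Y alone (no predictors), whereas the likelihood on the slide includes the regression on X; the numerical search below handles the regression case.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = np.linspace(1, 10, 40)                  # hypothetical predictor
y = np.exp(0.3 + 0.2 * x + rng.normal(scale=0.1, size=x.size))  # Y > 0

# with lmbda unspecified, boxcox returns the MLE of lambda for y itself
y_transformed, lam_mle = stats.boxcox(y)
print(f"lambda (MLE, unconditional on X) = {lam_mle:.3f}")
```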
The Box-Cox Transformation Method

Method 2: Numerical Search

Step 1: Set a value of $\lambda$.
Step 2: Standardize the $Y_i$ observations:
- If $\lambda \neq 0$: $W_i = K_1\left(Y_i^{\lambda} - 1\right)$
- If $\lambda = 0$: $W_i = K_2\left(\log_e Y_i\right)$
where $K_2 = \left(\prod_{i=1}^{n} Y_i\right)^{1/n}$ (the geometric mean of the $Y_i$) and $K_1 = \dfrac{1}{\lambda K_2^{\lambda - 1}}$.
Step 3: Now regress the set W on the set X.
Step 4: Note the corresponding SSE.
Step 5: Change $\lambda$, and repeat Steps 2 to 4 until the lowest SSE is obtained.

Let's try both methods with the GMAT data. What should we get as the best $\lambda$?
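A minimal sketch of the numerical search on the hypothetical x, y from the previous sketch (the GMAT data from class is not reproduced here).

```python
import numpy as np
import statsmodels.api as sm

def sse_for_lambda(lam, x, y):
    """Standardize Y with K1/K2, regress W on X, and return the SSE."""
    k2 = np.exp(np.mean(np.log(y)))           # K2: geometric mean of the Y's
    if np.isclose(lam, 0.0):
        w = k2 * np.log(y)                    # W_i = K2 * ln(Y_i)
    else:
        k1 = 1.0 / (lam * k2 ** (lam - 1.0))  # K1 = 1 / (lambda * K2^(lambda-1))
        w = k1 * (y ** lam - 1.0)             # W_i = K1 * (Y_i^lambda - 1)
    return sm.OLS(w, sm.add_constant(x)).fit().ssr

lambdas = np.linspace(-2, 2, 41)              # coarse grid, step 0.1
sses = [sse_for_lambda(l, x, y) for l in lambdas]
print(f"best lambda = {lambdas[int(np.argmin(sses))]:.1f}")
```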