Topic 19: Remedies
Outline
• Review regression diagnostics
• Remedial measures
– Weighted regression
– Ridge regression
– Robust regression
– Nonparametric regression
– Bootstrapping
Regression Diagnostics
Summary
• Check normality of the residuals with a
normal quantile plot
• Plot the residuals versus predicted
values, versus each of the X’s and
(when appropriate) versus time
• Examine the partial regression plots
– Use the graphics smoother to see if
there appears to be a curvilinear
pattern
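A minimal SAS sketch of these checks, assuming a hypothetical data set b1 with response y and predictors x1 and x2 (the partial option on the model statement requests the partial regression plots; the qqplot statement in proc univariate gives the normal quantile plot of the residuals):

proc reg data=b1;
  model y=x1 x2 / partial;
  output out=diag r=resid p=pred;
run;
* normal quantile plot of the residuals;
proc univariate data=diag noprint;
  var resid;
  qqplot resid / normal(mu=est sigma=est);
run;
* residuals vs predicted values and vs each X, with a smooth;
symbol1 v=circle i=sm70;
proc gplot data=diag;
  plot resid*(pred x1 x2);
run;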
Regression Diagnostics
Summary
• Examine
– the studentized deleted residuals
(RSTUDENT in the output)
– The hat matrix diagonals
– DFFITS, Cook’s D, and the DFBETAS
• Check observations that are extreme
on these measures relative to the
other observations
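These measures are all available from proc reg; a minimal sketch with the same hypothetical variables (the r option prints residual diagnostics including Cook's D; the influence option prints RSTUDENT, the hat matrix diagonals, DFFITS, and the DFBETAS):

proc reg data=b1;
  model y=x1 x2 / r influence;
run;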
Regression Diagnostics
Summary
• Examine the tolerance for each X
• If there are variables with low
tolerance, you need to do some
model building
– Recode variables
– Variable selection
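Tolerance (and its reciprocal, the variance inflation factor) can be requested on the model statement; a minimal sketch with hypothetical variables:

proc reg data=b1;
  model y=x1 x2 x3 / tol vif;
run;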
Remedial measures
• Weighted least squares
• Ridge regression
• Robust regression
• Nonparametric regression
• Bootstrapping
Maximum Likelihood
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \qquad \operatorname{Var}(\varepsilon_i) = \sigma_i^2$$

$$Y_i \sim N\left(\beta_0 + \beta_1 X_i,\; \sigma_i^2\right)$$

$$f_i = \frac{1}{\sqrt{2\pi}\,\sigma_i}\, e^{-\frac{1}{2}\left(\frac{Y_i - \beta_0 - \beta_1 X_i}{\sigma_i}\right)^2}$$

$$L = f_1 f_2 \cdots f_n \quad \text{(the likelihood function)}$$
Weighted regression
• Maximization of L with respect to the β's is equivalent to minimization of
$$\sum_i \frac{1}{\sigma_i^2}\left(Y_i - \beta_0 - \beta_1 X_{i1} - \cdots - \beta_{p-1} X_{i,p-1}\right)^2$$
• Weight of each observation: $w_i = 1/\sigma_i^2$
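Filling in the step behind this equivalence (a sketch based on the likelihood above): taking logs,

$$\ln L = -\frac{n}{2}\ln(2\pi) - \sum_i \ln \sigma_i - \frac{1}{2}\sum_i \frac{1}{\sigma_i^2}\left(Y_i - \beta_0 - \beta_1 X_{i1} - \cdots - \beta_{p-1} X_{i,p-1}\right)^2$$

With the $\sigma_i$ treated as known, only the last term involves the $\beta$'s, so maximizing $L$ is the same as minimizing the weighted sum of squares with weights $w_i = 1/\sigma_i^2$.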
Weighted least squares
• The least squares problem is to minimize the sum of $w_i$ times the squared residual for case i
• Computations are easy: use the weight statement in proc reg
• $b_w = (X'WX)^{-1}(X'WY)$, where W is a diagonal matrix of the weights (a matrix sketch follows below)
• The problem now becomes determining the weights
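A minimal matrix sketch of this formula, assuming SAS/IML is available and a hypothetical data set wdata with response y, predictor x, and weights w:

proc iml;
  use wdata;                           * hypothetical data set;
  read all var {y} into y;
  read all var {x} into xcol;
  read all var {w} into w;
  xmat = j(nrow(xcol), 1, 1) || xcol;  * design matrix with intercept column;
  wmat = diag(w);                      * diagonal matrix of the weights;
  bw = inv(xmat`*wmat*xmat) * (xmat`*wmat*y);  * bw = (X'WX)^(-1)(X'WY);
  print bw;
quit;

In practice the weight statement in proc reg does this computation for you.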
Determination of weights
• Find a relationship between the absolute
residual and another variable and use
this as a model for the standard
deviation
• Similarly for the squared residual and
another variable
• Use grouped data or approximately
grouped data to estimate the variance
Determination of weights
• With a model for the standard deviation
or the variance, we can approximate the
optimal weights
• Optimal weights are proportional to the
inverse of the variance
KNNL Example
• KNNL p 427
• Y is diastolic blood pressure
• X is age
• n = 54 healthy adult women aged 20 to 60 years old
Get the data and check it
data a1;
infile '../data/ch11ta01.txt';
input age diast;
proc print data=a1;
run;
Plot the relationship
symbol1 v=circle i=sm70;
proc gplot data=a1;
plot diast*age / frame;
run;
Diastolic bp vs age
[Figure: scatterplot of diast versus age with smooth; strong linear relationship but nonconstant variance]
Run the regression
proc reg data=a1;
model diast=age;
output out=a2 r=resid;
run;
Regression output

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1       2374.96833    2374.96833     35.79   <.0001
Error             52       3450.36501      66.35317
Corrected Total   53       5825.33333

Root MSE          8.14575    R-Square   0.4077
Dependent Mean   79.11111    Adj R-Sq   0.3963
Coeff Var        10.29659
Regression output

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1             56.15693          3.99367     14.06     <.0001
age          1              0.58003          0.09695      5.98     <.0001
Under nonconstant variance, the least squares estimators are still unbiased but no longer have minimum variance, and prediction interval coverage is often lower or higher than 95%.
Use the output data set to
get the absolute and
squared residuals
data a2;
set a2;
absr=abs(resid);
sqrr=resid*resid;
Do the plots with a
smooth
proc gplot data=a2;
plot (resid absr sqrr)*age;
run;
Absolute value of the residuals vs age
[Figure: scatterplot of absr (0 to 20) versus age (20 to 60), with smooth]
Squared residuals vs age
[Figure: scatterplot of sqrr versus age, with smooth]
Model the std dev vs age
(absolute value of the residual)
proc reg data=a2;
model absr=age;
output out=a3 p=shat;
Note that a3 has the predicted
standard deviations (shat)
Compute the weights
data a3;
set a3;
wt=1/(shat*shat);
Regression with weights
proc reg data=a3;
model diast=age / clb;
weight wt;
run;
Output

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1         83.34082      83.34082     56.64   <.0001
Error             52         76.51351       1.47141
Corrected Total   53        159.85432

Root MSE          1.21302    R-Square   0.5214
Dependent Mean   73.55134    Adj R-Sq   0.5122
Coeff Var         1.64921
Output

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   95% Confidence Limits
Intercept    1             55.56577          2.52092     22.04     <.0001    50.5072    60.6244
age          1              0.59634          0.07924      7.53     <.0001    0.43734    0.75534

Note the reduction in the standard error of the age coefficient (from 0.09695 to 0.07924).
Ridge regression
• Similar to a very old idea in numerical analysis
• If (X'X) is difficult to invert (near singular), then approximate the inverse by inverting (X'X + kI)
• Estimators of the coefficients are biased but more stable
• For some value of k, the ridge regression estimator has a smaller mean square error than the ordinary least squares estimator
• Can be used to reduce the number of predictors
• ridge=k is an option in proc reg (see the sketch below)
• Cross-validation is used to determine k
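A minimal sketch, assuming a hypothetical data set b1 with response y and correlated predictors x1 and x2; the ridge= values are paired with an outest= data set that collects the coefficients for each k (outvif adds the VIFs):

proc reg data=b1 outest=ridge_est outvif ridge=0 to 0.1 by 0.01;
  model y=x1 x2;
run;
proc print data=ridge_est;
run;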
Robust regression
• Basic idea is to have a procedure that is
not sensitive to outliers
• Alternatives to least squares, minimize
– sum of absolute values of residuals
– median of the squares of residuals
• Do weighted regression with weights based on the residuals, and iterate (a sketch follows below)
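One readily available route, rather than coding the iteration by hand: proc robustreg in SAS/STAT fits M-estimation by iteratively reweighting the observations. A minimal sketch on the blood pressure data:

proc robustreg data=a1 method=m;
  model diast=age;
run;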
Nonparametric regression
• Several versions
• We have used i=sm70
• Interesting theory
• All versions have some smoothing or penalty parameter similar to the 70 in i=sm70 (see the sketch below)
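For comparison, a minimal sketch using proc loess, where the smooth= option plays the same role as the 70 in i=sm70 (the value 0.5 here is an arbitrary choice):

proc loess data=a1;
  model diast=age / smooth=0.5;
run;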
Bootstrap
• Very important theoretical development
that has had a major impact on applied
statistics
• Based on simulation
• Sample with replacement from the data or residuals and repeatedly refit the model to get the distribution of the quantity of interest (a sketch follows below)
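A minimal sketch of a case-resampling bootstrap for the age coefficient in the KNNL example, assuming proc surveyselect is available (1000 replicates; the seed is arbitrary):

* draw 1000 bootstrap samples of the cases, with replacement;
proc surveyselect data=a1 out=boot seed=467
    method=urs samprate=1 outhits rep=1000;
run;
* refit the regression within each bootstrap sample;
proc reg data=boot outest=bootest noprint;
  by replicate;
  model diast=age;
run;
* percentile interval for the slope;
proc univariate data=bootest noprint;
  var age;
  output out=ci pctlpts=2.5 97.5 pctlpre=ci;
run;
proc print data=ci;
run;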
Background Reading
• We used the program topic19.sas
• This completes Chapter 11
• This completes the material for the
midterm