Transcript Chapter 10







Statistical model for linear regression
Simple linear regression model
Estimating the regression parameters
Confidence intervals and significance tests
Confidence interval for mean response
Prediction interval



We observe 92 males aged 20 to 29.
We measure skin-fold thickness and body
density.
Part of the data:

ID   Lskin   Den
1    1.27    1.093
2    1.56    1.063
3    1.45    1.078
4    1.52    1.056
5    1.51    1.073 …
The SAS System
17:47 Thursday, July 22, 2004
The REG Procedure
Model: MODEL1
Dependent Variable: Den

Root MSE          0.00854     R-Square    0.7204
Dependent Mean    1.06403     Adj R-Sq    0.7173
Coeff Var         0.80252

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   99% Confidence Limits
Intercept    1        1.16300             0.00656       177.30     <.0001    1.14574    1.18026
Lskin        1       -0.06312             0.00414       -15.23     <.0001   -0.07403   -0.05221
We will often be using software for calculations in this section.
We can use:
 proc reg
◦ see the file named Regression2.doc.
◦ This program has more tools than the regression.doc file (studied in Chapter 2).

Now we will think of the least squares regression line computed from the sample as an estimate of the true regression line for the population.

Type of line            Sample            Population
LSR equation of line    ŷ = b0 + b1x      μy = β0 + β1x
slope                   b1                β1
y-intercept             b0                β0

The model for each observation i:
yi = β0 + β1xi + εi
 Data: n observations in the form (x1, y1), (x2, y2), …, (xn, yn).
 The deviations εi are assumed to be independent and N(0, σ).
 The parameters of the model are β0, β1, and σ.
◦ Estimate β0 with b0, β1 with b1, and σ with s = √MSE (the Root MSE).

The actual data will not fit the regression line exactly:
DATA = FIT + RESIDUAL
◦ FIT is the least squares regression line
◦ RESIDUAL (“noise”) = ε
 the difference between the data and what the line predicts
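As a sketch, the FIT and RESIDUAL pieces can be computed directly from the least-squares formulas. This minimal Python example uses only the five observations shown above (the real analysis uses all 92):

```python
# Least squares fit: b1 = Sxy/Sxx, b0 = ybar - b1*xbar.
# Only the 5 observations listed in the slide; the full data set has n = 92.
x = [1.27, 1.56, 1.45, 1.52, 1.51]        # Lskin (skinfold thickness)
y = [1.093, 1.063, 1.078, 1.056, 1.073]   # Den (body density)

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b1 = Sxy / Sxx             # estimated slope
b0 = ybar - b1 * xbar      # estimated intercept

fit = [b0 + b1 * xi for xi in x]               # FIT
resid = [yi - fi for yi, fi in zip(y, fit)]    # RESIDUAL ("noise")

# DATA = FIT + RESIDUAL holds exactly for every observation:
assert all(abs(yi - (fi + ei)) < 1e-12 for yi, fi, ei in zip(y, fit, resid))
```

A property of least squares worth checking here: the residuals sum to (numerically) zero.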

 Less important: A level C confidence interval for the intercept β0 is
b0 ± t* SEb0
 More important: A level C confidence interval for the slope β1 is
b1 ± t* SEb1
 t* from the t table with n − 2 degrees of freedom
*** notice we are still using estimate ± MOE ***
Dependent Variable: Den

Root MSE          0.00854
Dependent Mean    1.06403
Coeff Var         0.80252

Manually calculate the 95% confidence interval for the mean decrease in body density per unit of skinfold thickness.

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value
Intercept    1        1.16300             0.00656       177.30
Lskin        1       -0.06312             0.00414       -15.23
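A sketch of that manual calculation in Python, using the printed slope and standard error (scipy is assumed to be available for the t critical value; n = 92, so df = 90):

```python
from scipy.stats import t

n = 92
b1 = -0.06312    # slope estimate (Lskin row of the output)
se_b1 = 0.00414  # its standard error
df = n - 2       # 90 degrees of freedom

t_star = t.ppf(0.975, df)               # 95% confidence: upper 2.5% point
margin = t_star * se_b1                 # margin of error (MOE)
lower, upper = b1 - margin, b1 + margin # estimate ± MOE

print(f"t* = {t_star:.4f}")
print(f"95% CI for the slope: ({lower:.4f}, {upper:.4f})")
```

With t* ≈ 1.987 the interval comes out to roughly (−0.0713, −0.0549): each extra unit of skinfold thickness is associated with an estimated mean decrease in density of between about 0.055 and 0.071.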

 Hypotheses: H0: β1 = 0 vs. Ha: β1 ≠ 0
 Test statistic:
t = b1 / SEb1,  df = n − 2
 If β1 = 0, then μy = β0
◦ the mean of y does not vary with x
◦ implies no linear relationship
 If we reject this hypothesis, we have a “linear relationship”, i.e. a non-zero (population) slope.
 SAS will give the test statistic and the 2-sided P-value. In most cases we only need to interpret the results.
Significance Test for Regression Slope
To test the hypothesis H0: β1 = hypothesized value, compute the test statistic:

t = (b1 − hypothesized value) / SEb1

Find the P-value by calculating the probability of getting a t statistic this large or larger in the direction specified by the alternative hypothesis Ha. Use the t distribution with df = n − 2.
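For the H0: β1 = 0 test, the statistic and two-sided P-value can be reproduced from the printed estimate and standard error (scipy assumed; the result differs slightly from SAS's −15.23 only because the printed inputs are rounded):

```python
from scipy.stats import t

n = 92
b1 = -0.06312     # slope estimate from the output
se_b1 = 0.00414   # standard error of the slope
df = n - 2

t_stat = (b1 - 0) / se_b1            # hypothesized value is 0 under H0
p_value = 2 * t.sf(abs(t_stat), df)  # two-sided P-value

print(f"t = {t_stat:.2f}, P = {p_value:.1e}")
```

Such a large |t| gives a P-value far below any usual significance level, matching the SAS output's <.0001.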
Dependent Variable: Den

Root MSE          0.00854
Dependent Mean    1.06403
Coeff Var         0.80252

Is there a linear relationship between body density and skinfold thickness?

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value
Intercept    1        1.16300             0.00656       177.30
Lskin        1       -0.06312             0.00414       -15.23

If you are trying to predict for a group
 C.I. for the Mean Response
 Example: Trying to predict the average blood pressure for all 40 year
olds
 Interval is usually narrower

If you are trying to predict for a single observation
 Prediction Interval
 Example: Trying to predict the blood pressure for a single 40 year old
 Interval is usually much wider because of individual variation
 Some 40 year olds will have much higher or lower B.P.s
A confidence interval for the population mean μy of all responses y for a certain x value (think: the average group response at a specific x value).
The level C confidence interval for the mean response μy at a given value x* of x is:

ŷ ± t* SEμ̂

where t* is the value such that the area under the t(n − 2) density curve between −t* and t* is C.
 SAS can do the work for us using the clm option. See Regression2.doc.
Prediction Intervals
The level C prediction interval for a single observation on y when x takes the value x* is:

ŷ ± t* SEŷ

t* is the critical value for the t(n − 2) distribution with area C between −t* and +t*.
 SAS can do the work for us using the cli option. See Regression2.doc.
 These will be wider than the mean response intervals since individuals have more variability!
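The two interval formulas differ only in the standard error: predicting a single new observation adds a "1 +" term inside the square root to account for individual variation. A numeric sketch with made-up summary statistics (hypothetical values for illustration, not the body-density data; scipy assumed):

```python
import math
from scipy.stats import t

# Hypothetical summary statistics, for illustration only
n = 30
s = 0.5                 # root MSE, the estimate of sigma
xbar, Sxx = 10.0, 40.0  # mean of x and sum of squared deviations of x
x_star = 12.0           # x value at which we predict
y_hat = 3.2             # fitted value at x_star (assumed)

t_star = t.ppf(0.975, n - 2)

# SE for estimating the mean response at x*:
se_mean = s * math.sqrt(1/n + (x_star - xbar)**2 / Sxx)
# SE for predicting one new observation at x* (note the extra "1 +"):
se_pred = s * math.sqrt(1 + 1/n + (x_star - xbar)**2 / Sxx)

ci = (y_hat - t_star * se_mean, y_hat + t_star * se_mean)  # clm in SAS
pi = (y_hat - t_star * se_pred, y_hat + t_star * se_pred)  # cli in SAS

# The prediction interval is always wider:
assert pi[1] - pi[0] > ci[1] - ci[0]
```

Both intervals are centered at ŷ; only their widths differ, and the gap grows as s grows relative to the sampling variability of the fitted line.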
Prediction Interval for individual observations (for x = Lskin)

Assumptions
◦ Correct Model (linearity)
◦ Independent Observations
◦ Normally Distributed Errors
◦ Constant Variance

 Checking these generally involves PLOTS of the residuals and predicted values.

 Residual = observed − predicted:
ei = Yi − Ŷi

INDEPENDENCE
◦ Repeated responses y are independent of each other.
 We may detect problems by plotting residuals vs. observation number (the order the data was collected).

Residuals vs. OBS

NORMALITY
◦ For any fixed value of x, the response y varies
according to a normal distribution.
 To check this assumption of normality one can do a normal
quantile plot of the residuals (SAS).

QQ plot

LINEARITY
◦ Mean response has a straight-line relationship with x.
 To check if the relationship is roughly linear, one does a
scatterplot or a residual plot.

CONSTANT VARIANCE
◦ The standard deviation of y (σy) is the same for all values of
x.
 The value of σy is unknown. To check for constant variability,
we look at a residual plot.

For both, use the residual plot:
◦ Residuals vs. x (we already saw this in Chapter 2)
 Or use the equivalent: residuals vs. predicted values (used later)

Also need to check for outliers or influential
observations before using regression.
◦ They can dramatically affect the results. Why?

On the SAS output, R-square:
R² = SSR/SST
◦ Shows what part of the total variation in Y is explained
by the least squares regression on x.
◦ Proportion of variability explained by your model.
◦ (Example MPG…)

Optional topics:
◦ What is the F-statistic on the output?
◦ How do we test for correlation? (see textbook)
 First look? What do you see?
◦ Other things we could look at that are not shown?
The ANOVA Table

Source   DF    Sum of Squares       Mean Square      F
Model    1     SSM = Σ(ŷi − ȳ)²     MSM = SSM/DFM    MSM/MSE
Error    n−2   SSE = Σ(yi − ŷi)²    MSE = SSE/DFE
Total    n−1   SST = Σ(yi − ȳ)²

(each sum runs over i = 1, …, n)

SST = SSM + SSE
DFT = DFM + DFE
F = MSM/MSE
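The table's bookkeeping can be verified numerically. A sketch using the same five-observation subset from the start of the chapter (the real ANOVA table uses all 92 observations):

```python
# Build the ANOVA table quantities for simple linear regression.
x = [1.27, 1.56, 1.45, 1.52, 1.51]        # Lskin
y = [1.093, 1.063, 1.078, 1.056, 1.073]   # Den

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
      / sum((xi - xbar) ** 2 for xi in x))
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]

SSM = sum((f - ybar) ** 2 for f in yhat)             # model sum of squares
SSE = sum((yi - f) ** 2 for yi, f in zip(y, yhat))   # error sum of squares
SST = sum((yi - ybar) ** 2 for yi in y)              # total sum of squares

DFM, DFE = 1, n - 2
MSM, MSE = SSM / DFM, SSE / DFE
F = MSM / MSE
R2 = SSM / SST      # R-square: fraction of variation in y explained by x

# The identities from the table:
assert abs(SST - (SSM + SSE)) < 1e-12
```

This also shows where the output's R-square comes from: the Model row's share of the Total sum of squares.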