Transcript Slide 1
Inference for regression
- More details about simple linear regression
IPS chapter 10.2
© 2006 W.H. Freeman and Company
Objectives (IPS chapter 10.2)
Inference for regression—more details
Analysis of variance for regression
Calculations for regression inference
Inference for correlation
Analysis of variance for regression
The regression model is:
Data = fit + residual
yi = (β0 + β1xi) + (εi)
where the εi are independent and normally distributed N(0, σ), and
σ is the same for all values of x.
It resembles an ANOVA, which also assumes equal variance, where
SST = SSmodel + SSerror
and
DFT = DFmodel + DFerror
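As an illustration of the regression model above, here is a minimal simulation sketch (assuming NumPy; the values β0 = 2, β1 = 0.5, and σ = 1 are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameter values, chosen only for illustration
beta0, beta1, sigma = 2.0, 0.5, 1.0

x = np.linspace(0, 10, 30)                 # fixed x values
eps = rng.normal(0, sigma, size=x.size)    # independent N(0, sigma) errors
y = beta0 + beta1 * x + eps                # data = fit + residual
```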
For a simple linear relationship, the ANOVA tests the hypotheses
H0: β1 = 0 versus Ha: β1 ≠ 0
by comparing MSM (model) to MSE (error): F = MSM/MSE
When H0 is true, F follows
the F(1, n − 2) distribution.
The p-value is the tail area above the observed F, P(F(1, n − 2) > F).
The ANOVA test and the two-sided t-test for H0: β1 = 0 yield the same p-value.
Software output for regression may provide t, F, or both, along with the p-value.
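A minimal sketch of this equivalence (assuming SciPy; the sample size and t statistic below are made-up placeholders): the square of the slope t statistic equals F, and the two-sided t p-value equals the upper-tail F p-value.

```python
from scipy import stats

n = 20                 # hypothetical sample size
t_stat = 2.5           # hypothetical t statistic for H0: beta1 = 0
F_stat = t_stat ** 2   # for simple linear regression, F = t^2

p_from_t = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # two-sided t-test p-value
p_from_F = stats.f.sf(F_stat, dfn=1, dfd=n - 2)    # upper tail of F(1, n-2)

print(p_from_t, p_from_F)  # the two p-values agree
```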
ANOVA table

Source   Sum of squares SS    DF       Mean square MS   F         P-value
Model    Σ(ŷi − ȳ)²           1        SSM/DFM          MSM/MSE   Tail area above F
Error    Σ(yi − ŷi)²          n − 2    SSE/DFE
Total    Σ(yi − ȳ)²           n − 1

SST = SSM + SSE
DFT = DFM + DFE
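As a sketch of how these quantities could be computed from data (assuming NumPy; the x, y values are a small invented sample used only for illustration):

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5., 6., 7., 8.])
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1, 7.8, 9.2])
n = len(y)

b1, b0 = np.polyfit(x, y, 1)      # least-squares slope and intercept
y_hat = b0 + b1 * x               # fitted values

SSM = np.sum((y_hat - y.mean()) ** 2)   # model sum of squares
SSE = np.sum((y - y_hat) ** 2)          # error sum of squares
SST = np.sum((y - y.mean()) ** 2)       # total sum of squares: SST = SSM + SSE

DFM, DFE, DFT = 1, n - 2, n - 1
MSM, MSE = SSM / DFM, SSE / DFE
F = MSM / MSE                            # compare to F(1, n - 2)
```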
The estimate s of the regression standard deviation σ, for n sample
data points, is calculated from the residuals ei = yi − ŷi:
s² = Σei² / (n − 2) = Σ(yi − ŷi)² / (n − 2) = SSE / DFE = MSE
s estimates the regression standard deviation σ; s² (the MSE) is an unbiased estimate of the regression variance σ².
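A compact sketch of this calculation (assuming NumPy; same invented x, y sample as above):

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5., 6., 7., 8.])
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1, 7.8, 9.2])
n = len(y)

b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)                    # e_i = y_i - y_hat_i
s = np.sqrt(np.sum(residuals ** 2) / (n - 2))    # s = sqrt(SSE / (n - 2)) = sqrt(MSE)
```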
Coefficient of determination, r²
The coefficient of determination, r², the square of the correlation
coefficient, is the proportion of the variation in y (vertical scatter
from the regression line) that can be explained by changes in x.
r² = (variation in y explained by x, i.e., by the regression line) / (total variation in observed y values around the mean)
r² = Σ(ŷi − ȳ)² / Σ(yi − ȳ)² = SSM / SST
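A quick sketch checking both forms of r², SSM/SST and the squared correlation (assuming NumPy; same invented x, y sample as above):

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5., 6., 7., 8.])
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1, 7.8, 9.2])

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

r2_anova = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)  # SSM / SST
r2_corr = np.corrcoef(x, y)[0, 1] ** 2                                    # square of the correlation

print(r2_anova, r2_corr)  # the two values agree
```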
What is the relationship between
the average speed a car is
driven and its fuel efficiency?
We plot fuel efficiency (in miles
per gallon, MPG) against average
speed (in miles per hour, MPH)
for a random sample of 60 cars.
The relationship is curved.
When speed is log-transformed
(log of miles per hour, LOGMPH),
the new scatterplot shows a
positive, linear relationship.
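A minimal sketch of this kind of transformation (assuming NumPy; the speed and MPG values are invented placeholders, not the textbook's 60-car sample):

```python
import numpy as np

speed_mph = np.array([20., 30., 40., 50., 60., 70.])   # hypothetical average speeds (MPH)
mpg = np.array([24., 28., 31., 33., 34.5, 35.5])        # hypothetical fuel efficiencies (MPG)

log_speed = np.log(speed_mph)          # LOGMPH: log-transformed explanatory variable

# After the transformation, a straight line is fit to (log_speed, mpg)
b1, b0 = np.polyfit(log_speed, mpg, 1)
```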
Calculations for regression inference
To estimate the parameters of the regression, we calculate the
standard errors for the estimated regression coefficients.
The standard error of the least-squares slope b1 is:
SEb1 = s / √Σ(xi − x̄)²
The standard error of the intercept b0 is:
SEb0 = s √(1/n + x̄² / Σ(xi − x̄)²)
To estimate or predict future responses, we calculate the following
standard errors.
The standard error of the mean response µy at a given value x* is:
SEµ̂ = s √(1/n + (x* − x̄)² / Σ(xi − x̄)²)
The standard error for predicting an individual response ŷ at x* is:
SEŷ = s √(1 + 1/n + (x* − x̄)² / Σ(xi − x̄)²)
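A sketch of these four standard errors computed from data (assuming NumPy; the x, y values and the new value x* are illustrative placeholders):

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5., 6., 7., 8.])
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1, 7.8, 9.2])
n = len(y)

b1, b0 = np.polyfit(x, y, 1)
s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))   # regression standard error
Sxx = np.sum((x - x.mean()) ** 2)                          # sum of squares of x about its mean

SE_b1 = s / np.sqrt(Sxx)                                   # SE of the slope b1
SE_b0 = s * np.sqrt(1 / n + x.mean() ** 2 / Sxx)           # SE of the intercept b0

x_star = 4.5                                               # hypothetical new x value
SE_mean = s * np.sqrt(1 / n + (x_star - x.mean()) ** 2 / Sxx)      # SE of the mean response
SE_pred = s * np.sqrt(1 + 1 / n + (x_star - x.mean()) ** 2 / Sxx)  # SE for an individual prediction
```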
1918 flu epidemic
[Figure: line graphs of the number of cases diagnosed and the number of deaths reported per week, weeks 1–17 of the 1918 influenza epidemic]

1918 influenza epidemic incidence, by week:
Date      # Cases   # Deaths
week 1    36        0
week 2    531       0
week 3    4233      130
week 4    8682      552
week 5    7164      738
week 6    2229      414
week 7    600       198
week 8    164       90
week 9    57        56
week 10   722       50
week 11   1517      71
week 12   1828      137
week 13   1539      178
week 14   2416      194
week 15   3148      290
week 16   3465      310
week 17   1440      149

The line graph suggests that about 7 to 8% of those diagnosed with the flu died within about a week of the diagnosis. We look at the relationship between the number of deaths in a given week and the number of new diagnosed cases one week earlier.

[Figure: scatterplot of # deaths reported in a given week against # cases diagnosed one week earlier] r = 0.91

1918 flu epidemic: Relationship between the number of
deaths in a given week and the number of new diagnosed
cases one week earlier.
MINITAB - Regression Analysis: FluDeaths1 versus FluCases0

The regression equation is
FluDeaths1 = 49.3 + 0.0722 FluCases0

Predictor    Coef       SE Coef    T      P
Constant     49.29      29.85      1.65   0.121
FluCases0    0.072222   0.008741   8.26   0.000

S = 85.07    R-Sq = 83.0%    R-Sq(adj) = 81.8%

Analysis of Variance
Source            DF    SS        MS        F       P
Regression         1    494041    494041    68.27   0.000
Residual Error    14    101308    7236
Total             15    595349

Annotations from the slide: S = 85.07 is s = √MSE; the SE Coef column gives SEb0 and SEb1; R-Sq is r² = SSM/SST; the P for FluCases0 (and the ANOVA P) is the p-value for H0: β1 = 0 versus Ha: β1 ≠ 0; in the ANOVA table, 494041 is SSM, 595349 is SST, and 7236 is MSE = s².
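As a sketch, output along these lines could be reproduced in Python (assuming statsmodels is installed; the case and death counts are the week 1–17 values transcribed above, paired so that deaths lag cases by one week):

```python
import numpy as np
import statsmodels.api as sm

cases = np.array([36, 531, 4233, 8682, 7164, 2229, 600, 164, 57,
                  722, 1517, 1828, 1539, 2416, 3148, 3465, 1440])
deaths = np.array([0, 0, 130, 552, 738, 414, 198, 90, 56,
                   50, 71, 137, 178, 194, 290, 310, 149])

flu_cases0 = cases[:-1]    # cases diagnosed in weeks 1-16
flu_deaths1 = deaths[1:]   # deaths reported one week later (weeks 2-17)

model = sm.OLS(flu_deaths1, sm.add_constant(flu_cases0)).fit()
print(model.summary())     # coefficients, standard errors, t, F, R-squared
```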
Inference for correlation
To test the null hypothesis of no linear association, we can also use the
correlation parameter ρ.
When x is clearly the explanatory variable, this test
is equivalent to testing the hypothesis H0: β1 = 0, since
b1 = r (sy / sx)
When there is no clear explanatory variable (e.g., arm length vs. leg length),
a regression of x on y is not any more legitimate than one of y on x. In that
case, the correlation test of significance should be used. Technically, in that
case, the test is a test of independence much like we saw in an earlier
chapter on contingency tables.
The test of significance for ρ uses the one-sample t-test for H0: ρ = 0.
We compute the t statistic
for sample size n and
correlation coefficient r:
t = r √(n − 2) / √(1 − r²)
This calculation turns out to be identical to the t statistic based on the slope b1.
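A sketch of this test (assuming SciPy; the x, y values are the same invented sample used above), also checked against scipy.stats.pearsonr, which reports the same two-sided p-value:

```python
import numpy as np
from scipy import stats

x = np.array([1., 2., 3., 4., 5., 6., 7., 8.])
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1, 7.8, 9.2])
n = len(x)

r = np.corrcoef(x, y)[0, 1]
t = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)     # t statistic for H0: rho = 0
p_value = 2 * stats.t.sf(abs(t), df=n - 2)       # two-sided p-value on n - 2 df

r_check, p_check = stats.pearsonr(x, y)          # same r and p-value from SciPy
```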