Prediction concerning Y variable

Prediction concerning
the response Y
Introduction
• If we want to estimate μ, the mean weight
of all American women, aged 18-24, what
would be a good estimate?
• If we want to predict y, the weight of a
randomly selected American woman, aged
18-24, what would be a good prediction?
Can we do better?
[Figure: plot of weight versus height showing the fitted line w = -266.5 + 6.1 h and the horizontal line y-bar = 158.8.]
Simple linear regression model
[Figure: college entrance test score versus high school gpa.]
$\mu_Y = E(Y) = \beta_0 + \beta_1 x$
$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
Three different research
questions
• What is the mean response, E(Yh), for a
given value, xh, of the predictor variable?
• What would one predict a new observation,
Yh(new), to be for a given value, xh, of the
predictor variable?
• What would one predict the mean of m new
observations, $\bar{Y}_{h(new)}$, to be for a given value,
xh, of the predictor variable?
Example:
Skin cancer mortality and latitude
• What is the expected (mean) mortality rate
for all locations at 40° N latitude?
• What is the predicted mortality rate for 1
new randomly selected location at 40° N?
• What is the predicted mortality rate for 10
new randomly selected locations at 40° N?
Example:
Skin cancer mortality and latitude
[Figure: regression plot of Mortality versus Latitude.]
Mortality = 389.189 - 5.97764 Latitude
S = 19.1150, R-Sq = 68.0%, R-Sq(adj) = 67.3%
“Point estimators”
$\hat{Y}_h = b_0 + b_1 x_h$
is the best point estimator in each case.
That is, it is:
• the best guess of the mean response at xh
• the best guess of a new observation at xh
• the best guess of a mean of m new observations at xh
But, as always, to be confident in the answer to our research
question, we should put an interval around our best guess.
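As a quick check of the point estimator, the fitted skin cancer line reported on the regression plot can be evaluated at 40° N and 28° N. This is a minimal Python sketch; the function name `point_estimate` is only illustrative.

```python
def point_estimate(b0, b1, xh):
    """Return the fitted value y-hat_h = b0 + b1 * xh."""
    return b0 + b1 * xh

# Coefficients read off the regression plot:
# Mortality = 389.189 - 5.97764 Latitude
B0, B1 = 389.189, -5.97764

fit_40 = point_estimate(B0, B1, 40.0)
fit_28 = point_estimate(B0, B1, 28.0)
print(round(fit_40, 2), round(fit_28, 2))
```

The same number serves as the best guess for all three research questions; only the interval placed around it differs.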
It is dangerous to “extrapolate”
beyond scope of model.
[Figure: regression plot of colonies versus conc, with the conc axis running from 0 to 6.]
colonies = 16.0667 + 1.61576 conc
S = 2.67546, R-Sq = 66.8%, R-Sq(adj) = 63.5%
[Figure: regression plot of colonies versus conc with a quadratic fit, with the conc axis extended to 10.]
colonies = 15.0205 + 3.22113 conc - 0.276956 conc**2
S = 2.74819, R-Sq = 69.6%, R-Sq(adj) = 64.5%
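To see why extrapolation is risky, the two fitted equations for colonies can be compared at a few conc values. The sketch below uses only the coefficients printed on the two regression plots; the choice of conc = 3, 6 and 10 is an assumption made for illustration (the first plot's axis stops at 6, the second extends to 10).

```python
def linear_fit(conc):
    # colonies = 16.0667 + 1.61576 conc
    return 16.0667 + 1.61576 * conc

def quadratic_fit(conc):
    # colonies = 15.0205 + 3.22113 conc - 0.276956 conc**2
    return 15.0205 + 3.22113 * conc - 0.276956 * conc ** 2

# Compare the two fitted curves inside and beyond the first plot's axis range.
for conc in (3, 6, 10):
    gap = linear_fit(conc) - quadratic_fit(conc)
    print(conc, round(linear_fit(conc), 2), round(quadratic_fit(conc), 2), round(gap, 2))
```

The two fits stay within about 1.5 colonies of each other at conc = 3 and 6 but differ by more than 12 at conc = 10, even though their S and R-Sq values are similar.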
Confidence interval for
the population mean response E(Yh)
Again, what are we estimating?
[Figure: college entrance test score versus high school gpa.]
$\mu_Y = E(Y) = \beta_0 + \beta_1 x$
$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
(1-α)100% t-interval
for mean response E(Yh)
Formula in words:
Sample estimate ± (t-multiplier × standard error)
Formula in notation:
$\hat{y}_h \pm t_{(1-\alpha/2,\, n-2)} \times \sqrt{MSE \left( \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right)}$
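The t-interval for the mean response can be sketched in plain Python as follows. The data set and the helper name `mean_response_ci` are hypothetical and used only for illustration, and the t-multiplier is passed in from a t table rather than computed.

```python
import math

def mean_response_ci(x, y, xh, t_mult):
    """CI for E(Y_h): y-hat_h +/- t_mult * sqrt(MSE * (1/n + (xh - xbar)^2 / Sxx))."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                       # least squares slope
    b0 = ybar - b1 * xbar                # least squares intercept
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    mse = sse / (n - 2)                  # estimate of sigma^2
    fit = b0 + b1 * xh                   # point estimate y-hat_h
    se_fit = math.sqrt(mse * (1 / n + (xh - xbar) ** 2 / sxx))
    return fit, fit - t_mult * se_fit, fit + t_mult * se_fit

# Hypothetical data, for illustration only (not from the lecture).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
T_MULT = 3.182  # t(0.975, n - 2 = 3), taken from a t table

fit, lower, upper = mean_response_ci(x, y, xh=3.0, t_mult=T_MULT)
print(fit, lower, upper)
```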
Implications on precision
• The greater the spread in the xi values, the
narrower the confidence interval, the more
precise the estimation of E(Yh).
• Given the same set of xi values, the further
xh is from the (sample) mean of the xi, the
wider the confidence interval, the less
precise the estimation of E(Yh).
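Both implications follow from the term under the square root in the interval formula. For a fixed MSE and t-multiplier, the interval width is proportional to sqrt(1/n + (xh - x-bar)^2 / sum (xi - x-bar)^2). The sketch below evaluates that factor for two hypothetical sets of x values with the same mean; the function name `se_fit_factor` is an illustrative assumption.

```python
import math

def se_fit_factor(x, xh):
    """sqrt(1/n + (xh - xbar)^2 / Sxx): the part of SE(fit) that depends only on x."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return math.sqrt(1 / n + (xh - xbar) ** 2 / sxx)

# Hypothetical x values, both with mean 5 (illustration only).
narrow_x = [4, 5, 5, 6]   # little spread
wide_x = [1, 3, 7, 9]     # large spread

# Greater spread in x gives a smaller factor at the same xh.
print(se_fit_factor(narrow_x, 6), se_fit_factor(wide_x, 6))

# For the same x values, the factor grows as xh moves away from the mean.
print(se_fit_factor(wide_x, 5), se_fit_factor(wide_x, 9))
```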
Predicted Values for New Observations
New Obs     Fit  SE Fit         95.0% CI          95.0% PI
1        150.08    2.75   (144.6, 155.6)   (111.2, 188.93)
2        221.82    7.42   (206.9, 236.8)   (180.6, 263.07) X
X denotes a row with X values away from the center

Values of Predictors for New Observations
New Obs  Latitude
1            40.0
2            28.0
Mean of Lat = 39.533
Comments on assumptions
• xh is a value within scope of model, but it is not
necessary that it is one of the x values in the data set.
• The confidence interval formula for E(Yh) works
okay even if the error terms are only approximately
normally distributed.
• If you have a large sample, the error terms can even
deviate substantially from normality without greatly
affecting appropriateness of the confidence interval.
Prediction interval for
a new response Yh(new)
Again, what are we predicting?
[Figure: college entrance test score versus high school gpa.]
$\mu_Y = E(Y) = \beta_0 + \beta_1 x$
$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
(1-α)100% prediction interval
for new response Yh(new)
Formula in words:
Sample prediction ± (t-multiplier × standard error)
Formula in notation:
$\hat{y}_h \pm t_{(1-\alpha/2,\, n-2)} \times \sqrt{MSE \left( 1 + \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right)}$
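The prediction interval differs from the confidence interval only by the extra "1 +" under the square root. A minimal Python sketch computing both on a hypothetical data set makes the difference visible; the name `ci_and_pi` and the data are assumptions for illustration, and the t-multiplier is supplied from a t table.

```python
import math

def ci_and_pi(x, y, xh, t_mult):
    """Return (fit, CI, PI) at xh for a simple linear regression of y on x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    mse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    fit = b0 + b1 * xh
    leverage = 1 / n + (xh - xbar) ** 2 / sxx
    se_fit = math.sqrt(mse * leverage)         # standard error for E(Y_h)
    se_pred = math.sqrt(mse * (1 + leverage))  # standard error for Y_h(new)
    ci = (fit - t_mult * se_fit, fit + t_mult * se_fit)
    pi = (fit - t_mult * se_pred, fit + t_mult * se_pred)
    return fit, ci, pi

# Hypothetical data, for illustration only (not from the lecture).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

fit, ci, pi = ci_and_pi(x, y, xh=4.0, t_mult=3.182)  # t(0.975, 3) from a t table
print(fit, ci, pi)
```

Because MSE(1 + ...) is always larger than MSE(...), the prediction interval at a given xh always contains the confidence interval.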
Prediction of Yh(new)
if mean E(Y) is known
Assume $\sigma^2 = 25$, so $\sigma = 5$.
[Figure: normal curve for Number of hours, horizontal axis from 47 to 77, with probability 0.997 marked.]
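When the mean is known, prediction limits come straight from the normal curve. The sketch below assumes the mean is 62 hours, which is inferred from the centre of the plotted axis (47 to 77) and is not stated explicitly, and takes sigma = 5 as given. It checks that the limits mean ± 3 sigma capture about 0.997 of the distribution.

```python
from statistics import NormalDist

MU = 62.0     # assumption: centre of the plotted axis, not stated in the text
SIGMA = 5.0   # given: sigma^2 = 25

dist = NormalDist(mu=MU, sigma=SIGMA)
lower, upper = MU - 3 * SIGMA, MU + 3 * SIGMA

# Probability that a new response falls within 3 standard deviations of the mean.
prob = dist.cdf(upper) - dist.cdf(lower)
print(lower, upper, round(prob, 3))
```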
Prediction of Yh(new)
if mean E(Y) is not known
Summary of prediction issues
• We cannot be certain of the mean of the
distribution of Y.
• Prediction limits for Yh(new) must take into
account:
– variation in the possible mean of the
distribution of Y
– variation in the responses Y within the
probability distribution
Variation of the prediction
The variation in the prediction of a new response depends
on two components:
1. the variation due to estimating the mean E(Yh) with $\hat{y}_h$
2. the variation in Y within the probability distribution
$\sigma^2 + \sigma^2(\hat{Y}_h)$
which is estimated by:
$MSE + MSE \left( \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right) = MSE \left( 1 + \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right)$
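The two components can be put in numbers using the skin cancer example: S = 19.1150 from the regression plot gives MSE = S^2, and SE Fit = 2.75 at Latitude 40 from the Minitab output is the square root of the estimated variance of the fitted value. The following sketch just adds the two pieces; the variable names are illustrative.

```python
import math

S = 19.1150     # from the regression plot; MSE = S**2
SE_FIT = 2.75   # SE Fit at Latitude 40, from the Minitab output

mse = S ** 2              # variation of Y within its distribution
var_fit = SE_FIT ** 2     # variation from estimating E(Y_h) with y-hat_h
var_pred = mse + var_fit  # estimated variance of the prediction
se_pred = math.sqrt(var_pred)

print(round(mse, 2), round(var_fit, 2), round(se_pred, 2))
```

Here the within-distribution variation dominates, which is why the prediction interval is so much wider than the confidence interval at the same latitude.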
(1-α)100% prediction interval
for new response Yh(new)
Formula in words:
Sample prediction ± (t-multiplier × standard error)
Formula in notation:
$\hat{y}_h \pm t_{(1-\alpha/2,\, n-2)} \times \sqrt{MSE \left( 1 + \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right)}$
Confidence intervals and prediction
intervals for response in Minitab
• Stat >> Regression >> Regression …
• Specify response and predictor(s).
• Select Options…
– In “Prediction intervals for new observations” box,
specify either the X value or a column name containing
multiple X values.
– Specify confidence level (default is 95%).
• Click on OK. Click on OK.
• Results appear in session window.
Worksheet column C6 (new Latitude values): 40, 28
S = 19.12, R-Sq = 68.0%, R-Sq(adj) = 67.3%

Predicted Values for New Observations
New Obs     Fit  SE Fit         95.0% CI          95.0% PI
1        150.08    2.75   (144.6, 155.6)   (111.2, 188.93)
2        221.82    7.42   (206.9, 236.8)   (180.6, 263.07) X
X denotes a row with X values away from the center

Values of Predictors for New Observations
New Obs  Latitude
1            40.0
2            28.0
Mean of Lat = 39.533
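One way to read this output is to back out the t-multiplier that each printed interval implies. The confidence interval uses SE Fit as its standard error, while the prediction interval uses sqrt(S^2 + SE Fit^2). The sketch below is a consistency check on the printed numbers only, with S = 19.1150 taken from the regression plot; it does not reproduce how Minitab computes the intervals.

```python
import math

S = 19.1150  # residual standard deviation from the regression plot

# (fit, SE fit, 95% CI, 95% PI) copied from the Minitab output
rows = [
    (150.08, 2.75, (144.6, 155.6), (111.2, 188.93)),
    (221.82, 7.42, (206.9, 236.8), (180.6, 263.07)),
]

multipliers = []
for fit, se_fit, ci, pi in rows:
    se_pred = math.sqrt(S ** 2 + se_fit ** 2)    # standard error for a new response
    t_from_ci = (ci[1] - ci[0]) / (2 * se_fit)   # half-width / SE Fit
    t_from_pi = (pi[1] - pi[0]) / (2 * se_pred)  # half-width / SE of prediction
    multipliers.append((t_from_ci, t_from_pi))
    print(fit, round(t_from_ci, 3), round(t_from_pi, 3))
```

All four implied multipliers come out close to 2.0, as expected for 95% intervals; the small differences reflect rounding in the printed output.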
Comments on assumptions
• xh is a value within scope of model, but it is
not necessary that it is one of the x values in
the data set.
• The formula for the prediction interval
depends strongly on the assumption that the
error terms are normally distributed.
A plot of the confidence interval and
prediction interval in Minitab
• Stat >> Regression >> Fitted line plot …
• Specify predictor and response.
• Under Options …
– Select Display confidence bands.
– Select Display prediction bands.
– Specify desired confidence level (95% default)
• Select OK. Select OK.
[Figure: fitted line plot of Mortality versus Latitude showing the regression line with 95% CI and 95% PI bands.]
Mortality = 389.189 - 5.97764 Latitude
S = 19.1150, R-Sq = 68.0%, R-Sq(adj) = 67.3%
