LECTURE 18 (Week 6)


Objectives (BPS chapter 24)
Inference for regression

Conditions for regression inference

Estimating the parameters

Using technology

Testing the hypothesis of no linear relationship

Testing lack of correlation

Confidence intervals for the regression slope

Inference about prediction

Checking the conditions for inference
[Figure: scatterplot of a random sample with its fitted least-squares line, ŷ = 0.125x + 41.4.]
The data in a scatterplot are a random sample from a population that may exhibit a linear relationship between x and y. Different sample → different plot.
Now we want to describe the population mean response μy as a function of the explanatory variable x: μy = α + βx.
We also want to assess whether the observed relationship is statistically significant (not entirely explained by chance variation due to random sampling).
The regression model
The least-squares regression line ŷ = a + bx is a mathematical model of the form "sample data = fit + residual." For each data point in the sample, the residual is the difference (y − ŷ).
At the population level, the model becomes yi = (α + βxi) + εi,
with the residuals εi independent and Normally distributed N(0, σ).
The population mean response μy is
μy = α + βx
The intercept α, the slope β, and the standard deviation σ of y are the unknown parameters of the regression model. We rely on the random sample data to provide unbiased estimates of these parameters.

The value of ŷ from the least-squares regression line is really a prediction of the mean value of y (μy) for a given value of x.

The least-squares regression line [ŷ = a + bx] obtained from sample data is the best estimate of the true population regression line [μy = α + βx].
ŷ is an unbiased estimate of the mean response μy
a is an unbiased estimate of the intercept α
b is an unbiased estimate of the slope β
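To make this concrete, here is a minimal Python sketch (assuming NumPy is available) of how the estimates a and b are computed from a sample; the data are the years-of-service and bonus values from the worked example at the end of this lecture.

    import numpy as np

    # Sample data from the worked example below:
    # years of service (x) and annual bonus in $1000 (y).
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([6.0, 1.0, 9.0, 5.0, 17.0, 12.0])

    # Least-squares estimates of the slope and intercept:
    # b = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2),  a = ybar - b*xbar
    xbar, ybar = x.mean(), y.mean()
    b = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    a = ybar - b * xbar
    print(a, b)  # approximately 0.933 and 2.114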
Conditions for inference

The observations are independent.

The relationship is indeed linear.

The standard deviation of y, σ, is
the same for all values of x.

The response y varies normally
around its mean.
For any fixed x, the responses y follow a Normal distribution with standard deviation σ.
Regression assumes equal variance of y (σ is the same for all values of x).
The population standard deviation σ for y at any given value of x represents the spread of the Normal distribution of the εi around the mean μy.
The regression standard error, s, for n sample data points is
calculated from the residuals (yi – ŷi):
s = √( Σ residual² / (n − 2) ) = √( Σ (yi − ŷi)² / (n − 2) )
s is an unbiased estimate of the regression standard deviation σ.
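A short Python sketch of this calculation (again assuming NumPy, and reusing the worked-example data):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([6.0, 1.0, 9.0, 5.0, 17.0, 12.0])
    n = len(x)

    # Fit the least-squares line, then pool the squared residuals.
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a = y.mean() - b * x.mean()
    residuals = y - (a + b * x)                    # y_i - yhat_i
    s = np.sqrt(np.sum(residuals ** 2) / (n - 2))  # regression standard error
    print(s)  # approximately 4.503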
Confidence interval for the slope β
Estimating the slope parameter β is a case of one-sample inference with σ unknown. Hence we rely on t distributions.
The standard error of the slope b is:
SEb = s / √( Σ (xi − x̄)² )
(s is the regression standard error.)
Thus, a level C confidence interval for the slope β is:
estimate ± t* SEestimate
b ± t* SEb
t* is the critical value for the t(df = n − 2) density curve with area C between −t* and +t*.
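A minimal sketch of this interval in Python (assuming NumPy and SciPy are available, with the worked-example data):

    import numpy as np
    from scipy import stats

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([6.0, 1.0, 9.0, 5.0, 17.0, 12.0])
    n = len(x)

    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a = y.mean() - b * x.mean()
    s = np.sqrt(np.sum((y - (a + b * x)) ** 2) / (n - 2))

    SE_b = s / np.sqrt(np.sum((x - x.mean()) ** 2))  # standard error of b
    t_star = stats.t.ppf(0.975, df=n - 2)            # t* for 95% confidence
    print(b - t_star * SE_b, b + t_star * SE_b)      # about (-0.87, 5.10)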
Testing the hypothesis of no relationship
To test for the existence of a significant relationship, we can test whether the slope parameter β is significantly different from zero using a one-sample t-test procedure.
The standard error of the slope b is: SEb = s / √( Σ (xi − x̄)² )
We test the hypotheses H0: β = 0 versus
Ha: β ≠ 0, β > 0, or β < 0 (two-sided or one-sided).
We calculate the test statistic
t = b / SEb
which has the t(n − 2) distribution when H0 is true,
and use it to find the P-value of the test.
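SciPy's stats.linregress carries out exactly this two-sided test; a brief sketch with the worked-example data:

    import numpy as np
    from scipy import stats

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([6.0, 1.0, 9.0, 5.0, 17.0, 12.0])

    res = stats.linregress(x, y)     # least-squares fit plus slope inference
    t_stat = res.slope / res.stderr  # t = b / SE_b
    print(t_stat, res.pvalue)        # about 1.96 and 0.121 (two-sided)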
Testing for lack of correlation
The regression slope b and the correlation coefficient r are related, and b = 0 ⇔ r = 0:
slope b = r (sy / sx)
Similarly, the population slope β is related to the population correlation coefficient ρ, and β = 0 ⇔ ρ = 0.
Thus, testing the hypothesis H0: β = 0 is the same as testing the
hypothesis of no correlation between x and y in the population from
which our data were drawn.
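A quick numerical check of this equivalence (assuming SciPy): the two-sided P-value from the slope t-test matches the P-value from the test of zero correlation.

    import numpy as np
    from scipy import stats

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([6.0, 1.0, 9.0, 5.0, 17.0, 12.0])

    p_slope = stats.linregress(x, y).pvalue  # test of H0: beta = 0
    r, p_corr = stats.pearsonr(x, y)         # test of H0: rho = 0
    print(p_slope, p_corr)                   # both about 0.121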
Inference about prediction
One use of regression is for predicting the value of y, ŷ, for any value
of x within the range of data tested: ŷ = a + bx.
But the regression equation depends on the particular sample drawn, so more reliable predictions require statistical inference.
To estimate an individual response y for a given value of x, we use a
prediction interval.
If we randomly sampled many times, there would be many different values of y obtained for a particular x, distributed N(0, σ) around the mean response μy.
The level C prediction interval for a single observation on y when x takes the value x* is:
ŷ ± t* SEŷ
where t* is the critical value for the t distribution with n − 2 df.
[Figure: 95% prediction interval for ŷ.]
The prediction interval represents mainly the error from the Normal distribution of the residuals εi.
Graphically, a series of prediction intervals for the whole range of x values is shown as a continuous band on either side of ŷ.
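A minimal sketch of the prediction-interval calculation in Python (assuming NumPy and SciPy; the value x* = 5 is taken from the worked example below):

    import numpy as np
    from scipy import stats

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([6.0, 1.0, 9.0, 5.0, 17.0, 12.0])
    n = len(x)

    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a = y.mean() - b * x.mean()
    s = np.sqrt(np.sum((y - (a + b * x)) ** 2) / (n - 2))

    x_star = 5.0
    y_hat = a + b * x_star
    # SE for predicting a single new observation at x*:
    SE_yhat = s * np.sqrt(1 + 1/n + (x_star - x.mean()) ** 2
                          / np.sum((x - x.mean()) ** 2))
    t_star = stats.t.ppf(0.975, df=n - 2)
    print(y_hat - t_star * SE_yhat, y_hat + t_star * SE_yhat)  # about (-2.72, 25.73)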
Confidence interval for µy
We may also want to predict the population mean value of y, µy, for any
value of x within the range of data tested.
Using inference, we calculate a level C confidence interval for the
population mean μy of all responses y when x takes the value x*:
This interval is centered on ŷ, the unbiased estimate of μy.
The true value of the population mean μy at a given
value of x will indeed be within our confidence
interval in C% of all intervals calculated
from many different random samples.
The level C confidence interval for the mean response μy at a given value x* of x is centered on ŷ (the unbiased estimate of μy):
ŷ ± t* SEμ̂
where t* is the critical value for the t distribution with n − 2 df.
[Figure: 95% confidence interval for μy.]
A separate confidence interval is
calculated for μy along all the values
that x takes.
Graphically, the series of confidence intervals for the whole range of x values is shown as a continuous band on either side of ŷ.
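The same sketch as for the prediction interval, with the leading 1 dropped under the square root, gives the confidence interval for μy (assuming NumPy and SciPy):

    import numpy as np
    from scipy import stats

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([6.0, 1.0, 9.0, 5.0, 17.0, 12.0])
    n = len(x)

    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a = y.mean() - b * x.mean()
    s = np.sqrt(np.sum((y - (a + b * x)) ** 2) / (n - 2))

    x_star = 5.0
    y_hat = a + b * x_star
    # SE for estimating the mean response at x* (no leading 1 under the root):
    SE_mu = s * np.sqrt(1/n + (x_star - x.mean()) ** 2
                        / np.sum((x - x.mean()) ** 2))
    t_star = stats.t.ppf(0.975, df=n - 2)
    print(y_hat - t_star * SE_mu, y_hat + t_star * SE_mu)  # about (4.71, 18.30)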
The confidence interval for μy contains with C% confidence the
population mean μy of all responses at a particular value of x.

The prediction interval contains C% of all the individual values
taken by y at a particular value of x.

[Figure: least-squares regression line with the 95% prediction interval for ŷ and the 95% confidence interval for μy.]
Estimating μy uses a smaller confidence interval than estimating an individual response in the population because the sampling distribution is narrower than the population distribution.
Checking the conditions for inference
Residuals are randomly scattered → good!
Curved pattern → the relationship is not linear.
Change in variability across the plot → σ is not equal for all values of x.
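A minimal sketch of such a residual plot in Python (assuming NumPy and Matplotlib):

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([6.0, 1.0, 9.0, 5.0, 17.0, 12.0])

    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a = y.mean() - b * x.mean()
    residuals = y - (a + b * x)

    # Plot residuals against x; look for random scatter around zero.
    plt.scatter(x, residuals)
    plt.axhline(0, linestyle="--")
    plt.xlabel("x")
    plt.ylabel("residual")
    plt.show()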
Example
The annual bonuses ($1000) of six randomly selected employees and their years of service were recorded. We wish to analyze the relationship between the two variables. The data were analyzed using MINITAB; the output is shown below.
Years (X)   1   2   3   4   5   6
Bonus (Y)   6   1   9   5  17  12
Predictor   Coef    SE Coef   T      P
Constant    0.933   4.192     0.22   0.835
Years       2.114   1.076     1.96   0.121

S = 4.50291   R-Sq = 49.1%   R-Sq(adj) = 36.4%

Predicted Values for New Observations
New Obs   Fit     SE Fit   95% CI           95% PI
1         11.50   2.45     (4.71, 18.30)    (-2.72, 25.73)

Values of Predictors for New Observations
New Obs   Years
1         5.00
Example
a. What is the equation of the least-squares regression line?
ŷ = 0.933 + 2.114x
b. Calculate the 95% confidence interval for the true slope coefficient.
b ± t* SEb = 2.114 ± 2.776 × 1.076 = [−0.873, 5.10]
c. Based on the above output, at the .05 level of significance, test if the slope β is significantly different from zero.
H0: β = 0
Ha: β ≠ 0
t = b / SEb = 1.96
P-value = 0.121 > 0.05
The test is not significant; we fail to reject the null hypothesis.
Example
d. What is the predicted annual bonus of an employee with 5 years of service?
ŷ = 0.933 + 2.114 × 5 = 11.50
e. What is the value of the residual for the data value (5, 17)?
residual = y − ŷ = 17 − 11.50 = 5.50
f. Construct a 95% prediction interval for a single employee's bonus whose years of service is 5 (the value computed in the MINITAB output above).
ŷ ± t* s √( 1 + 1/n + (x* − x̄)² / Σ (xi − x̄)² ) = [−2.72, 25.73]
Example
g. Construct a 95% confidence interval for the mean bonus μy when years of service is 5.
ŷ ± t* s √( 1/n + (x* − x̄)² / Σ (xi − x̄)² ) = [4.71, 18.30]