Transcript Chapt21_BPS

Chapter 21
Inference for Regression
BPS - 3rd Ed.
Linear Regression (from Chapter 5)
• Objective: to quantify the linear relationship between an explanatory variable (x) and a response variable (y).
• We can then predict the average response for all subjects with a given value of the explanatory variable.
Case Study
Crying and IQ
Karelitz, S. et al., “Relation of crying activity in early infancy
to speech and intellectual development at age three years,”
Child Development, 35 (1964), pp. 769-777.
Researchers explored the crying of infants four to ten days old and their IQ test scores at age three to determine whether more crying was a sign of higher IQ.
Case Study
Crying and IQ
Data collection
• Data collected on 38 infants
• A snap of a rubber band on the foot caused the infants to cry
– recorded the number of peaks in the most active 20 seconds of crying (explanatory variable x)
• Measured IQ score at age three years using the Stanford-Binet IQ test (response variable y)
Case Study
Crying and IQ
Data
Case Study
Crying and IQ
Data analysis
A scatterplot of y vs. x shows a moderate positive linear relationship, with no extreme outliers or potential influential observations.
Case Study
Crying and IQ
Data analysis
• Correlation between crying and IQ is r = 0.455 (as calculated in Chapter 4)
• Least-squares regression line for predicting IQ from crying is ŷ = a + bx = 91.27 + 1.493x (as in Ch. 5)
• R² = 0.207, so about 21% of the variation in IQ scores is explained by crying intensity
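These summaries can also be reproduced in software. A minimal Python sketch, not the textbook's own output; the arrays x and y stand in for the 38 crying-peak counts and IQ scores, which are not listed in this transcript:

import numpy as np
from scipy import stats

def describe_fit(x, y):
    # Least-squares fit of y on x plus the Chapter 4/5 summaries.
    fit = stats.linregress(np.asarray(x, float), np.asarray(y, float))
    return {
        "r": fit.rvalue,                 # correlation; about 0.455 for the crying data
        "a (intercept)": fit.intercept,  # about 91.27
        "b (slope)": fit.slope,          # about 1.493
        "R^2": fit.rvalue ** 2,          # about 0.207
    }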
Inference
• We now want to extend our analysis to include inferences on various components involved in the regression analysis
– slope
– intercept
– correlation
– predictions
Regression Model, Assumptions
• Conditions required for inference about regression (we have n observations on an explanatory variable x and a response variable y):
1. For any fixed value of x, the response y varies according to a Normal distribution. Repeated responses y are independent of each other.
2. The mean response µy has a straight-line relationship with x: µy = α + βx. The slope β and intercept α are unknown parameters.
3. The standard deviation of y (call it σ) is the same for all values of x. The value of σ is unknown.
Regression Model, Assumptions
• The regression model has three parameters: α, β, and σ
• The true regression line µy = α + βx says that the mean response µy moves along a straight line as x changes (we cannot observe the true regression line; instead we observe y for various values of x)
• Observed values of y vary about their means µy according to a Normal distribution (if we take many y observations at a fixed value of x, the Normal pattern will appear for these y values)
Regression Model, Assumptions
• The standard deviation σ is the same for all values of x, meaning the Normal distributions for y have the same spread at each value of x
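To make the model concrete, here is a small simulation sketch; the values of α, β, and σ below are arbitrary illustrative choices, not quantities from the text:

import numpy as np

rng = np.random.default_rng(0)
alpha, beta, sigma = 90.0, 1.5, 17.0        # assumed "true" parameters (illustrative only)
x = np.repeat(np.arange(10, 31, 5), 50)     # many responses at each fixed value of x
mu_y = alpha + beta * x                     # true regression line: mean response at each x
y = rng.normal(mu_y, sigma)                 # Normal scatter about mu_y, same sigma at every x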
Estimating Parameters:
Slope and Intercept
When using the least-squares regression line ŷ = a + bx, the slope b is an unbiased estimator of the true slope β, and the intercept a is an unbiased estimator of the true intercept α.
Estimating Parameters:
Standard Deviation
• The standard deviation σ describes the variability of the response y about the true regression line
• A residual is the difference between an observed value of y and the value ŷ predicted by the least-squares regression line: y − ŷ
• The standard deviation σ is estimated with a sample standard deviation of the residuals (this is a standard error since it is estimated from data)
Estimating Parameters:
Standard Deviation
The regression standard error is the square root of the sum of squared residuals divided by their degrees of freedom (n − 2):

s = √( Σ(y − ŷ)² / (n − 2) )
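As an illustration of this formula, a short Python sketch (the function name is ours, not from the text):

import numpy as np

def regression_standard_error(y, y_hat):
    # s = sqrt( sum of squared residuals / (n - 2) )
    resid = np.asarray(y, float) - np.asarray(y_hat, float)
    n = resid.size
    return np.sqrt(np.sum(resid ** 2) / (n - 2))

For the crying study this calculation gives s = 17.50, as reported on the following case-study slide.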
Case Study
Crying and IQ
• Since ŷ = a + bx = 91.27 + 1.493x, the slope b = 1.493 is an unbiased estimator of the true slope β, and the intercept a = 91.27 is an unbiased estimator of the true intercept α
– because the slope is b = 1.493, we estimate that, on average, IQ is about 1.5 points higher for each added crying peak
• The regression standard error is s = 17.50
– see pages 566-567 in the text for this calculation
Case Study
Crying and IQ
Using Technology: (regression output from statistical software shown on slide)
Confidence Interval for Slope
• A level C confidence interval for the true slope β is b ± t* SEb
– t* is the critical value for the t distribution with df = n − 2 degrees of freedom that has area (1 − C)/2 to the right of it
– the standard error of b is a multiple of the regression standard error:
  SEb = s / √( Σ(x − x̄)² )
Case Study
Crying and IQ
Confidence interval for slope β
(software regression output shown on slide, highlighting b and SEb)
Case Study
Crying and IQ
Confidence interval for slope β
b = 1.4929, SEb = 0.4870, df = n − 2 = 38 − 2 = 36
(df = 36 is not in Table C, so use the next smaller df = 30)
For a 95% C.I., (1 − C)/2 = .025, and t* = 2.042
So a 95% C.I. for the true slope β is:
b ± t* SEb = 1.4929 ± 2.042(0.4870)
           = 1.4929 ± 0.9944
           = 0.4985 to 2.4873
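For comparison, a short Python sketch that uses the exact t critical value for df = 36 instead of rounding down to df = 30 as Table C requires (the function name is ours):

from scipy import stats

def slope_ci(b, se_b, n, level=0.95):
    # b +/- t* SE_b with df = n - 2
    t_star = stats.t.ppf(1 - (1 - level) / 2, df=n - 2)
    return b - t_star * se_b, b + t_star * se_b

slope_ci(1.4929, 0.4870, 38) gives roughly 0.51 to 2.48, close to the Table C interval above.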
Hypothesis Tests for Slope
• The most common hypothesis to test regarding the slope is that it is zero: H0: β = 0
– says the regression line is horizontal (the mean of y does not change with x)
– no true linear relationship between x and y
– the straight-line dependence on x is of no value for predicting y
• Standardize b to get a t test statistic:
Hypothesis Tests for Slope
• Test statistic for H0: β = 0 is t = b / SEb
– follows the t distribution with df = n − 2
• P-value: [for T ~ t(n − 2) distribution]
Ha: β > 0 : P-value = P(T ≥ t)
Ha: β < 0 : P-value = P(T ≤ t)
Ha: β ≠ 0 : P-value = 2P(T ≥ |t|)
Case Study
Crying and IQ
Hypothesis Test for slope β
t = b / SEb = 1.4929 / 0.4870 = 3.07
P-value (from the software output shown on the slide)
Significant linear relationship
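A matching Python sketch for this test (the function name is ours):

from scipy import stats

def slope_t_test(b, se_b, n):
    # t = b / SE_b; two-sided P-value = 2 P(T >= |t|) with df = n - 2
    t = b / se_b
    return t, 2 * stats.t.sf(abs(t), df=n - 2)

slope_t_test(1.4929, 0.4870, 38) gives t = 3.07 and a two-sided P-value of about 0.004.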
Test for Correlation
• The correlation between x and y is closely related to the slope (for both the population and the observed data)
– in particular, the correlation is 0 exactly when the slope is 0
• Therefore, testing H0: β = 0 is equivalent to testing that there is no correlation between x and y in the population from which the data were drawn
Test for Correlation
• There is a test for correlation that does not require a regression analysis
– Table F on page 661 of the text gives critical values and upper tail probabilities for the sample correlation r under the null hypothesis that the correlation is 0 in the population
– look up n and r in the table (if r is negative, look up its positive value), and read off the associated probability from the top margin of the table to obtain the P-value, just as is done for the t table (Table C)
Case Study
Crying and IQ
Test for H0: correlation = 0
• Correlation between crying and IQ is r = 0.455
• Sample size is n = 38
• From Table F: for Ha: correlation > 0, the P-value is between .001 and .0025 (using n = 40)
– the P-value for the two-sided test is between .002 and .005 (matches the two-sided P-value for the test on the slope)
– the one-sided P-value would be between .005 and .01 if we were very conservative and used n = 30
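Software can replace the Table F lookup. A Python sketch (the two-sided P-value it reports can be halved for the one-sided test when r has the sign predicted by Ha):

from scipy import stats

def correlation_test(x, y):
    # Test of H0: population correlation = 0
    r, p_two_sided = stats.pearsonr(x, y)
    return r, p_two_sided

For the crying data this should give r = 0.455 with a two-sided P-value near 0.004, consistent with the Table F brackets above.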
Inference about Prediction
• Once a regression line is fit to the data, it is useful to obtain a prediction of the response for a particular value of the explanatory variable (x*); this is done by substituting the value x* for x in the equation of the line (ŷ = a + bx) to calculate the predicted value ŷ
• We now present confidence intervals that describe how accurate this prediction is
Inference about Prediction
• There are two types of predictions
– predicting the mean response of all subjects with a certain value x* of the explanatory variable
– predicting the individual response for one subject with a certain value x* of the explanatory variable
• Predicted values (ŷ) are the same in each case, but the margins of error are different
Inference about Prediction
• To estimate the mean response µy, use an ordinary confidence interval for the parameter µy = α + βx*
– µy is the mean of responses y when x = x*
– 95% confidence interval: in repeated samples of n observations, 95% of the confidence intervals calculated (at x*) from these samples will contain the true value of µy at x*
Inference about Prediction
• To estimate an individual response y, use a prediction interval
– estimates a single random response y rather than a parameter like µy
– 95% prediction interval: take an observation on y for each of the n values of x in the original data, then take one more observation y at x = x*; the prediction interval from the n observations will cover that one additional y in 95% of all repetitions
Inference about Prediction
 Both
confidence interval and prediction
interval have the same form:
yˆ  t * SE
– both t* values have df = n2
– the standard errors (SE) differ for the two
intervals (formulas on next slide)

the prediction interval is wider than the
confidence interval
Inference about Prediction
(standard-error formulas for the two intervals shown on slide)
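The formulas themselves are not reproduced in this transcript; the Python sketch below uses the standard simple-regression standard errors, which are assumed to match the ones on the slide:

import numpy as np
from scipy import stats

def intervals_at(x, y, x_star, level=0.95):
    # CI for the mean response mu_y at x*, and prediction interval for one new y at x*.
    # SE(mean) = s * sqrt(1/n + (x* - xbar)^2 / Sxx)
    # SE(pred) = s * sqrt(1 + 1/n + (x* - xbar)^2 / Sxx), where Sxx = sum((x - xbar)^2)
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = x.size
    fit = stats.linregress(x, y)
    y_hat = fit.intercept + fit.slope * x_star            # same point estimate for both intervals
    s = np.sqrt(np.sum((y - (fit.intercept + fit.slope * x)) ** 2) / (n - 2))
    sxx = np.sum((x - x.mean()) ** 2)
    se_mean = s * np.sqrt(1 / n + (x_star - x.mean()) ** 2 / sxx)
    se_pred = s * np.sqrt(1 + 1 / n + (x_star - x.mean()) ** 2 / sxx)
    t_star = stats.t.ppf(1 - (1 - level) / 2, df=n - 2)
    ci = (y_hat - t_star * se_mean, y_hat + t_star * se_mean)
    pi = (y_hat - t_star * se_pred, y_hat + t_star * se_pred)   # always wider than ci
    return ci, pi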
Checking Assumptions
• Independent observations
– no repeated observations on the same individual
• True relationship is linear
– look at the scatterplot to check the overall pattern
– a plot of residuals against x magnifies any unusual pattern (should see 'random' scatter about zero)
Checking Assumptions
• Constant standard deviation σ of the response at all x values
– scatterplot: the spread of the data points about the regression line should be similar over the entire range of the data
– easier to see with a plot of residuals against x, with a horizontal line drawn at zero (should see 'random' scatter about zero), or a plot of residuals against ŷ for linear regression
Checking Assumptions
• Response y varies Normally about the true regression line
– residuals estimate the deviations of the response from the true regression line, so they should follow a Normal distribution
– make a histogram or stemplot of the residuals and check for clear skewness or other departures from Normality
– numerous methods for carefully checking Normality exist (talk to a statistician!)
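A sketch of these checks in Python with matplotlib (the function name and plot styling are ours):

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def residual_checks(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    fit = stats.linregress(x, y)
    resid = y - (fit.intercept + fit.slope * x)        # residuals y - yhat
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.scatter(x, resid)                              # want 'random' scatter ...
    ax1.axhline(0.0)                                   # ... about this zero line
    ax1.set_xlabel("x"); ax1.set_ylabel("residual")
    ax2.hist(resid, bins="auto")                       # look for clear skewness or outliers
    ax2.set_xlabel("residual")
    fig.tight_layout()
    return fig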
Residual Plots
x = number of beers, y = blood alcohol
Stemplot of the residuals (key: 4|1 = .041):
-2 | 731
-1 | 871
-0 | 91
 0 | 5578
 1 | 1
 2 | 39
 3 |
 4 | 1
(close to Normal)
Roughly linear relationship; the spread is even across the entire data range ('random' scatter about zero)
Residual Plots
‘x’ = collection of explanatory variables, y = salary of player
Standard deviation is not constant everywhere
(more variation among players with higher salaries)
Residual Plots
x = number of years, y = logarithm of salary of player
A clear curved pattern – relationship is not linear