x - PRCACalculus

Chapter 13
Simple Linear Regression
and Correlation:
Inferential Methods
Suppose we were to investigate the relationship
between y = the first-year college grade point
average and x = high school grade point average.
Is the first-year college grade point average determined
solely by the high school grade point average?
The first-year college grade point average and the high
school grade point average do NOT have a deterministic
relationship.
A relationship in which the value of y is completely
determined by the value of an independent variable x is
called a deterministic relationship.
A description of the relationship between two variables that
are not deterministically related can be given by a
probabilistic model.
The equation for an additive probabilistic model is:
y = deterministic function of x + random deviation
  = f(x) + e
where e is an "error" variable.
The simple linear regression model assumes that
there is a line with y-intercept α and slope β,
called the population regression line.
When a value of the independent variable x is
fixed and an observation on the dependent
variable y is made,
y = α + βx + e
Without the random deviation e in
the equation, all observed (x, y)
points would fall exactly on the
population regression line.
[Figure: population regression line (slope β, y-intercept α) with the deviations e1 and e2 marked at x1 and x2]
Basic Assumptions of the Simple
Linear Regression Model
1. The distribution of e at any particular x
value has mean value 0; that is, $\mu_e = 0$.
2. The standard deviation of e is the same for
any particular value of x. This standard
deviation is denoted by σ.
3. The distribution of e at any particular value
of x is normal.
4. The random deviations e1, e2, . . ., en
associated with different observations are
independent of one another.
Let’s look at the heights and weights of a
population of adult women.
How much would an adult female weigh if she were 5 feet tall?
Weights of women that are 5 feet tall will vary; in other
words, there is a distribution of weights for adult females
who are 5 feet tall.
Are some of these weights more likely than others? What would
this distribution look like?
This distribution is normally distributed.
What would you expect for other heights?
Where would you expect the population regression line to be?
We want the standard deviations of all these normal
distributions to be the same.
[Figure: Weight versus Height (60 to 68), showing the distribution of weights at each height]
Basic Assumptions of the Simple
Linear Regression Model Revisited
1. The distribution of e at any particular x
value has mean value 0; that is, $\mu_e = 0$.
2. The standard deviation of e is the same for
any particular value of x. This standard
deviation is denoted by σ.
3. The distribution of e at any particular value
of x is normal.
4. The random deviations e1, e2, . . ., en
associated with different observations are
independent of one another.
The distribution of y at any particular x value is normal.
Remember the variable e is a measure of the extent that
individual y-values deviate from the population
regression line.
For any particular x value, the standard deviation of
y equals the standard deviation of e.
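We use $\hat{y} = a + bx$ to estimate the true
population regression line.
b = point estimate of β:
$$b = \frac{S_{xy}}{S_{xx}}$$
a = point estimate of α:
$$a = \bar{y} - b\bar{x}$$
where
$$S_{xy} = \sum xy - \frac{\left(\sum x\right)\left(\sum y\right)}{n} \quad \text{and} \quad S_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n}$$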
Medical researchers have noted that adolescent females are
much more likely to deliver low-birth-weight babies than are
adult females. Because low-birth-weight babies have higher
mortality rates, a number of studies have examined the
relationship between birth weight and mother’s age for babies
born to young mothers.
The following data is on x = maternal age (in years) and y =
birth weight of baby (in grams).
x    15    17    18    15    16    19    17    16    18    19
y  2289  3393  3271  2648  2897  3327  2970  2535  3138  3573
Sketch a scatterplot of these data.
The scatterplot shows a linear pattern and the
spread in the y values appears to be similar
across the range of x values. This supports
the appropriateness of the simple linear
regression model.
[Figure: scatterplot of Baby's Weight (g) versus Mother's Age (yrs)]
Birth Weight Continued . . .
The following data is on x = maternal age (in years) and y =
birth weight of baby (in grams).
x    15    17    18    15    16    19    17    16    18    19
y  2289  3393  3271  2648  2897  3327  2970  2535  3138  3573
$$\hat{y} = -1163.45 + 245.15x$$
The weight of babies increases approximately 245.15 grams
for each increase of 1 year in the mother's age.
What is the point estimate for the mean weight of
babies born to 18-year-old mothers?
$$\hat{y} = -1163.45 + 245.15(18) = 3249.25 \text{ grams}$$
This is the point estimate for the mean weight of all
babies born to 18-year-old mothers.
This is also the prediction of the weight of a single baby
born to a mother 18 years of age.
[Figure: scatterplot of Baby's Weight (g) versus Mother's Age (yrs) with the least-squares line]
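The point estimates above can be reproduced from the ten observations with the formulas b = Sxy/Sxx and a = y-bar - b(x-bar). The following is a minimal sketch in plain Python; the function name `least_squares` is a hypothetical helper chosen for illustration, not something from the slides.

```python
# Maternal age (years) and birth weight (grams), as listed on the slide.
x = [15, 17, 18, 15, 16, 19, 17, 16, 18, 19]
y = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]


def least_squares(x, y):
    """Return (a, b) for the least-squares line y-hat = a + b*x.

    Uses the computing formulas from the slides:
    Sxy = sum(xy) - sum(x)*sum(y)/n and Sxx = sum(x^2) - sum(x)^2/n.
    """
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    s_xx = sum(xi ** 2 for xi in x) - sum_x ** 2 / n
    b = s_xy / s_xx                   # point estimate of the slope
    a = sum_y / n - b * sum_x / n     # point estimate of the intercept
    return a, b


a, b = least_squares(x, y)
y_hat_18 = a + b * 18  # estimated mean weight for 18-year-old mothers
print(f"a = {a:.2f}, b = {b:.2f}, y-hat(18) = {y_hat_18:.2f}")
```

The same number, a + b(18), serves both as the estimate of the mean weight and as the prediction for a single baby; only the associated intervals differ.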
The statistic for estimating the variance σ² is
$$s_e^2 = \frac{SSResid}{n - 2}$$
where
$$SSResid = \sum \left(y - \hat{y}\right)^2$$
The subscript e reminds us that we are estimating the
variance of the "errors" or residuals.
The estimate for the standard deviation σ is
$$s_e = \sqrt{s_e^2}$$
Why n - 2?
Note that the degrees of freedom associated with
estimating σ² or σ in simple linear regression is
df = n - 2.
Since we must estimate both α and β for the regression
line, we reduce the sample size n by 2.
Recall the coefficient of determination, r², is the
proportion of observed y variation that is
attributed to the model relationship.
Birth Weight Revisited . . .
The following data is on x = maternal age (in years) and y =
birth weight of baby (in grams).
x    15    17    18    15    16    19    17    16    18    19
y  2289  3393  3271  2648  2897  3327  2970  2535  3138  3573
$$s_e = 205.308 \qquad r^2 = .781$$
For a particular mother's age, the typical deviation for
possible weights of babies is approximately 205 grams.
Approximately 78% of the variability observed in the
weights of babies can be explained by this model.
[Figure: scatterplot of Baby's Weight (g) versus Mother's Age (yrs) with the least-squares line]
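As a check on these two summary measures, the sketch below computes SSResid, SSTo, s_e and r² directly from the ten listed observations. It is only an illustration in plain Python; the variable names are my own. Run on the data as listed, it gives s_e of about 205.3 and r² of about 0.78.

```python
import math

# Maternal age (years) and birth weight (grams), as listed on the slide.
x = [15, 17, 18, 15, 16, 19, 17, 16, 18, 19]
y = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept.
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = s_xy / s_xx
a = y_bar - b * x_bar

# Residual and total sums of squares.
ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
ss_to = sum((yi - y_bar) ** 2 for yi in y)

s_e = math.sqrt(ss_resid / (n - 2))  # estimate of sigma, df = n - 2
r_squared = 1 - ss_resid / ss_to     # coefficient of determination

print(f"SSResid = {ss_resid:.2f}, SSTo = {ss_to:.2f}")
print(f"s_e = {s_e:.3f}, r^2 = {r_squared:.3f}")
```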
Properties of the Sampling
Distribution of b
Since β is almost always unknown, it must be estimated from
independently selected observations. The slope b of the
least-squares line gives a point estimate for β.
When the four basic assumptions of the simple
linear regression model are satisfied, the
following statements are true:
1. The mean value of b is β. That is, $\mu_b = \beta$.
2. The standard deviation of the statistic b is
$$\sigma_b = \frac{\sigma}{\sqrt{S_{xx}}}$$
3. The statistic b has a normal distribution (a
consequence of the model assumption that the
random deviation e is normally distributed).
Since σ is usually unknown, the estimated
standard deviation of the statistic b is
$$s_b = \frac{s_e}{\sqrt{S_{xx}}}$$
Confidence Interval for β
When the four basic assumptions of the simple
linear regression model are satisfied, a
confidence interval for β, the slope of the
population regression line, has the form
$$b \pm (t \text{ critical value}) \cdot s_b$$
where the t critical value is based on df = n - 2.
Is cardiovascular fitness (as measured by time to
exhaustion from running on a treadmill) related to an
athlete’s performance in a 20-km ski race?
The following data on x = treadmill time to exhaustion (in
minutes) and y = 20-km ski time (in minutes) were taken
from the article “Physiological Characteristics and
Performance of Top U.S. Biathletes” (Medicine and Science
in Sports and Exercise, 1995):
x   7.7   8.4   8.7   9.0   9.6   9.6  10.0  10.2  10.4  11.0  11.7
y  71.0  71.4  65.0  68.7  64.4  69.4  63.0  64.6  66.9  62.6  61.7
Sketch a scatterplot for the data.
The plot shows a linear pattern, and
the vertical spread of points does not
appear to be changing over the range
of x values in the sample. If we
assume that the distribution of errors
at any given x value is approximately
normal, then the simple linear
regression model seems appropriate.
[Figure: scatterplot of Ski Time (min) versus Treadmill Time (min)]
Biathletes Continued . . .
x = treadmill exhaustion time
y = ski time
x   7.7   8.4   8.7   9.0   9.6   9.6  10.0  10.2  10.4  11.0  11.7
y  71.0  71.4  65.0  68.7  64.4  69.4  63.0  64.6  66.9  62.6  61.7
Find a 95% confidence interval for the slope of the
true regression line.
$$-2.3335 \pm (2.26)(.591) = (-3.671, -.999)$$
We are 95% confident that the true average decrease
in ski time associated with a 1 minute increase in
treadmill exhaustion time is between 1 minute and
3.7 minutes.
[Figure: scatterplot of Ski Time (min) versus Treadmill Time (min)]
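The interval for the slope can be sketched in plain Python as follows. The t critical value 2.26 (95% confidence, df = 9) is taken from the slide rather than computed, and the function name `slope_interval` is a hypothetical helper. Because the slide rounds intermediate values, the endpoints agree with it only to about two decimal places.

```python
import math

# Treadmill time to exhaustion (min) and 20-km ski time (min).
x = [7.7, 8.4, 8.7, 9.0, 9.6, 9.6, 10.0, 10.2, 10.4, 11.0, 11.7]
y = [71.0, 71.4, 65.0, 68.7, 64.4, 69.4, 63.0, 64.6, 66.9, 62.6, 61.7]

T_CRIT = 2.26  # t critical value quoted on the slide (95%, df = 9)


def slope_interval(x, y, t_crit):
    """Return (b, s_b, (lower, upper)) for the slope of the regression line."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(ss_resid / (n - 2))
    s_b = s_e / math.sqrt(s_xx)  # estimated standard deviation of b
    return b, s_b, (b - t_crit * s_b, b + t_crit * s_b)


b, s_b, (lower, upper) = slope_interval(x, y, T_CRIT)
print(f"b = {b:.4f}, s_b = {s_b:.4f}, interval = ({lower:.3f}, {upper:.3f})")
```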
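Biathletes Continued . . .
Partial Minitab Output

The regression equation is
Ski time = 88.8 - 2.33 treadmill time

Predictor    Coef      StDev    T       P
Constant     88.796    5.750    15.44   0.000
Treadmill    -2.3335   0.5911   -3.95   0.003

S = 2.188    R-Sq = 63.4%    R-Sq (adj) = 59.3%

Analysis of Variance
Source            DF    SS        MS       F       P
Regression         1    74.630    74.630   15.58   0.003
Residual Error     9    43.097    4.789
Total             10    117.727

Reading the output:
The regression equation is the equation of the estimated regression line.
Constant, Coef = 88.796 is the estimated y intercept a.
Treadmill, Coef = -2.3335 is the estimated slope b.
Treadmill, StDev = 0.5911 is $s_b$, the estimated standard deviation of b.
S = 2.188 is $s_e$.
R-Sq = 63.4% is 100 × r².
R-Sq (adj): r² (adjusted) is not used in simple linear regression.
Residual Error, SS = 43.097 is SSResid.
Total, SS = 117.727 is SSTo.
Residual Error, MS = 4.789 is $s_e^2 = SSResid/(n - 2)$.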
Summary of Hypothesis Tests
Concerning β
Null hypothesis: H0: β = hypothesized value
Test Statistic:
$$t = \frac{b - \text{hypothesized value}}{s_b}$$
The test is based on df = n - 2.
Alternative Hypothesis:            P-value:
Ha: β > hypothesized value         area to right of t under the
                                   appropriate t curve
Ha: β < hypothesized value         area to left of t under the
                                   appropriate t curve
Ha: β ≠ hypothesized value         2(area to right of t) if +t or
                                   2(area to left of t) if -t
Often the hypothesized value is zero; this is called
the model utility test for simple linear regression.
Summary of Hypothesis Tests
Concerning β Continued . . .
Assumptions:
For this test to be appropriate the four basic assumptions of
the simple regression model must be met:
1. The distribution of e at any particular x value
has a mean of 0 ($\mu_e = 0$).
2. The standard deviation of e is σ, which does not
depend on x.
3. The distribution of e at any particular x value is
normal.
4. The random deviations e1, e2, …, en associated
with different observations are independent of
one another.
What is the slope of a horizontal line?
Suppose the least-squares line is horizontal; would
height be useful in predicting weight?
A slope of zero means that there is NO linear
relationship between x and y!
[Figure: Weight versus Height (60 to 68) with a horizontal least-squares line]
The Model Utility Test for Simple
Linear Regression
The model utility test for simple linear regression
is the test of
H0: β = 0
Ha: β ≠ 0
The null hypothesis specifies that there is no useful
linear relationship between x and y.
Test Statistic:
$$t = \frac{b}{s_b}$$
Biathletes Revisited . . .
x = treadmill exhaustion time
y = ski time
x   7.7   8.4   8.7   9.0   9.6   9.6  10.0  10.2  10.4  11.0  11.7
y  71.0  71.4  65.0  68.7  64.4  69.4  63.0  64.6  66.9  62.6  61.7
Even though the scatterplot indicates a linear relationship
between ski time and treadmill time, let's perform
the model utility test.
H0: β = 0
Ha: β ≠ 0
Where β is the slope of the population
regression line between treadmill time and
ski time
$$t = \frac{-2.3335}{0.5911} = -3.95$$
P-value = .003    df = 9    α = .05
Since the P-value < α, we reject H0. There is
sufficient evidence of a linear relationship
between treadmill time and ski time.
[Figure: scatterplot of Ski Time (min) versus Treadmill Time (min)]
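The test statistic and its two-sided P-value can be sketched with the standard library alone. Python has no built-in t distribution, so this illustration integrates the t density numerically with Simpson's rule; that numerical approach and the helper names `t_pdf` and `two_sided_p_value` are my own choices, not the method used by Minitab.

```python
import math


def t_pdf(t, df):
    """Density of the t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p_value(t_stat, df, steps=2000):
    """Approximate 2 * P(T > |t_stat|) using Simpson's rule on [0, |t_stat|].

    steps must be even for Simpson's rule.
    """
    upper = abs(t_stat)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3  # P(0 < T < |t_stat|)
    return 1 - 2 * central_area


# Slope estimate and its estimated standard deviation from the Minitab output.
b, s_b, df = -2.3335, 0.5911, 9

t_stat = b / s_b  # model utility test: hypothesized value is 0
p_value = two_sided_p_value(t_stat, df)
print(f"t = {t_stat:.2f}, two-sided P-value = {p_value:.4f}")
```

With α = .05 the P-value is well below the significance level, which matches the decision to reject H0.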
Biathletes Revisited . . .
Partial Minitab Output

The regression equation is
Ski time = 88.8 - 2.33 treadmill time

Predictor    Coef      StDev    T       P
Constant     88.796    5.750    15.44   0.000
Treadmill    -2.3335   0.5911   -3.95   0.003

S = 2.188    R-Sq = 63.4%    R-Sq (adj) = 59.3%

Analysis of Variance
Source            DF    SS        MS       F       P
Regression         1    74.630    74.630   15.58   0.003
Residual Error     9    43.097    4.789
Total             10    117.727

In the Treadmill row, T = -3.95 is the t test statistic
(-2.3335 ÷ 0.5911 = -3.95) and P = 0.003 is the P-value.
Statistical software usually performs the model utility
test with H0: β = 0 versus Ha: β ≠ 0.
Checking Model Adequacy
The simple linear regression model is
y = α + βx + e
where e represents the random deviation of an
observed y value from the population regression
line α + βx.
The assumptions for simple linear regression are based
on this random deviation e.
If we knew the deviations e1, e2, …, en, we could examine
them for any inconsistencies with model assumptions.
However, we do not know the deviations e1, e2, …, en
because the population regression line is unknown.
Therefore, we must estimate these deviations using the
residuals from the estimated regression line.
Thus, we use the residuals to check our assumptions.
Residual Analysis
• Standardize the residuals to look at their
magnitudes
$$\text{standardized residual} = \frac{\text{residual}}{\text{estimated standard deviation of residual}}$$
Most statistical software will perform this calculation.
It is tedious to do by hand.
• Create a residual plot (from Chapter 5) or a
standardized residual plot (which is a plot of
the (x, standardized residual) pairs)
A desirable plot is one that exhibits no particular
pattern (such as curvature or much greater spread
in one part of the plot than the other) and that
has no point that is far removed from all the
others.
Any observation with a large positive or negative
residual should be examined carefully for any error in
recording data, nonstandard experimental condition, or
atypical experimental unit.
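The slides leave the "estimated standard deviation of residual" to software. A common textbook formula, assumed here rather than taken from the slides, is s_e·sqrt(1 - 1/n - (x_i - x-bar)²/Sxx). The sketch below applies it to the biathlete data in plain Python; the function name `standardized_residuals` is hypothetical.

```python
import math

# Treadmill time to exhaustion (min) and 20-km ski time (min).
x = [7.7, 8.4, 8.7, 9.0, 9.6, 9.6, 10.0, 10.2, 10.4, 11.0, 11.7]
y = [71.0, 71.4, 65.0, 68.7, 64.4, 69.4, 63.0, 64.6, 66.9, 62.6, 61.7]


def standardized_residuals(x, y):
    """Return (residuals, standardized residuals) for the least-squares line.

    Assumes the estimated standard deviation of the i-th residual is
    s_e * sqrt(1 - 1/n - (x_i - x_bar)^2 / Sxx).
    """
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s_e = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))
    standardized = [
        r / (s_e * math.sqrt(1 - 1 / n - (xi - x_bar) ** 2 / s_xx))
        for xi, r in zip(x, residuals)
    ]
    return residuals, standardized


residuals, std_residuals = standardized_residuals(x, y)
for xi, r, sr in zip(x, residuals, std_residuals):
    print(f"x = {xi:5.1f}  residual = {r:6.2f}  standardized = {sr:6.2f}")
```

Points far from the mean of x have a smaller residual standard deviation, so the same raw residual is more notable there than near the center of the data.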
A Look at Standardized Residual Plots
This is a desirable plot in that it exhibits no pattern
and has no point that lies far away from the other
points.
This plot exhibits a curved pattern which indicates that
the fitted model should be changed to incorporate the
curvature.
In this plot, the standard deviation of the residuals
increases as the x-values increase. While a straight-line
model might still be appropriate, the best-fit line should
be found using weighted least-squares. Consult your
local statistician!
Both of these plots contain points far away from the
others. These points can have substantial effects on
estimates of α and β as well as other quantities.
Biathletes Revisited . . .
r = residuals
sr = standardized residuals (from Minitab)
x    7.7    8.4    8.7    9.0    9.6    9.6   10.0   10.2   10.4   11.0   11.7
y   71.0   71.4   65.0   68.7   64.4   69.4   63.0   64.6   66.9   62.6   61.7
r   0.17   2.21  -3.49   0.91  -1.99   3.01  -2.46  -0.39   2.37  -0.53   0.21
sr  0.10   1.13  -1.74   0.44  -0.96   1.44  -1.18  -0.19   1.16  -0.27   0.12
Let's look at a normal probability plot of the
standardized residuals.
The normal probability plot of the standardized
residuals is quite straight. There is no reason to doubt
the plausibility that the random deviations e are normally
distributed.
[Figure: scatterplot of Ski Time (min) versus Treadmill Time (min); normal probability plot of Standardized Residual versus Normal Score]
Biathletes Continued . . .
r = residuals
sr = standardized residuals (from Minitab)
x    7.7    8.4    8.7    9.0    9.6    9.6   10.0   10.2   10.4   11.0   11.7
y   71.0   71.4   65.0   68.7   64.4   69.4   63.0   64.6   66.9   62.6   61.7
r   0.17   2.21  -3.49   0.91  -1.99   3.01  -2.46  -0.39   2.37  -0.53   0.21
sr  0.10   1.13  -1.74   0.44  -0.96   1.44  -1.18  -0.19   1.16  -0.27   0.12
Sketch a residual plot.
Sketch a standardized residual plot.
Notice that these two plots have similar appearances.
Remember that residuals can also be plotted against y.
The standardized residual plot does not show evidence of
any pattern or of increasing spread.
[Figure: Residuals versus Treadmill Time; Standardized Residuals versus Treadmill Time]
Optional Topics
Inferences Based on the
Estimated Regression Line
and
Inference about the Population
Correlation Coefficient
Properties of the Sampling Distribution
of a + bx for a Fixed Value of x
Let x* denote a particular value of the independent variable
x. When the four basic assumptions of the simple linear
regression model are satisfied, the sampling distribution of
the statistic a + bx* has the following properties:
1) The mean value of a + bx* is α + βx*, so a + bx* is an
unbiased statistic for estimating the mean y value when x = x*.
2) The standard deviation of the statistic a + bx*, denoted by
$\sigma_{a+bx^*}$, is given by
$$\sigma_{a+bx^*} = \sigma \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$$
3) The distribution of a + bx* is normal.
The farther x* is from the center, the larger $\sigma_{a+bx^*}$ is.
Since σ is unknown, $\sigma_{a+bx^*}$ can be estimated by $s_{a+bx^*}$,
which substitutes $s_e$ in place of σ.
Confidence Interval for a Mean y
Value
When the basic assumptions of the simple linear
regression model are met, a confidence interval for
α + βx*, the mean y value when x has value x*, is
$$a + bx^* \pm (t \text{ critical value}) \cdot s_{a+bx^*}$$
where the t critical value is based on df = n - 2.
Because $s_{a+bx^*}$ is larger the farther x* is from $\bar{x}$,
the confidence interval becomes wider as x*
moves away from the center of the data.
Physical characteristics of sharks are of interest to
surfers and scuba divers as well as to marine researchers.
The data on x = length (in feet) and y = jaw width (in
inches) for 44 sharks were found in various articles
appearing in the magazines Skin Diver and Scuba News.
(These data are found on page 778 of the text.)
Because it is difficult to measure jaw width in living
sharks, researchers would like to determine whether it is
possible to estimate jaw width from body length, which is
more easily measured.
This scatterplot of the data shows a linear
pattern and is consistent with use of
the simple linear regression model.
[Figure: scatterplot of the shark data]
Jaws Continued . . .
The regression equation is
Jaw Width = 0.69 + 0.963 Length

Predictor    Coef      StDev     T       P
Constant     0.688     1.299     0.53    0.599
Length       0.96345   0.08228   11.71   0.000

S = 1.376    R-Sq = 76.6%    R-Sq (adj) = 76.0%

The model utility test confirms the usefulness of this model.
The simple linear regression model explains 76.6% of the
variability in jaw width.
Let's use the data to compute a 90% confidence interval for
the mean jaw width for 15-foot-long sharks.
The point estimate is
$$a + b(15) = .688 + .96345(15) = 15.140 \text{ in.}$$
The estimated standard deviation of a + b(15) is
$$s_{a+b(15)} = 1.376\sqrt{\frac{1}{44} + \frac{(15 - 15.586)^2}{279.8718}} = .213$$
Jaws Continued . . .
The regression equation is
Jaw Width = 0.69 + 0.963 Length

Predictor    Coef      StDev     T       P
Constant     0.688     1.299     0.53    0.599
Length       0.96345   0.08228   11.71   0.000

S = 1.376    R-Sq = 76.6%    R-Sq (adj) = 76.0%

The 90% confidence interval is
$$a + b(15) \pm (t \text{ critical value}) \cdot s_{a+b(15)} = 15.140 \pm (1.68)(.213) = (14.782, 15.498)$$
Based on these sample data, we can
be 90% confident that the mean jaw
width for sharks of length 15 feet is
between 14.782 and 15.498 inches.
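The calculation can be sketched in plain Python from the summary values quoted on the slides (n = 44, x-bar = 15.586, Sxx = 279.8718, s_e = 1.376). The t critical value 1.68 is the one used on the slide, and the function name `mean_response_interval` is a hypothetical helper.

```python
import math

# Summary values quoted on the slides for the shark data.
N = 44
X_BAR = 15.586
S_XX = 279.8718
A, B = 0.688, 0.96345   # estimated intercept and slope
S_E = 1.376             # estimate of sigma
T_CRIT = 1.68           # t critical value used on the slide (90%, df = 42)


def mean_response_interval(x_star):
    """Return (estimate, s_{a+bx*}, (lower, upper)) for the mean y at x_star."""
    estimate = A + B * x_star
    s_mean = S_E * math.sqrt(1 / N + (x_star - X_BAR) ** 2 / S_XX)
    margin = T_CRIT * s_mean
    return estimate, s_mean, (estimate - margin, estimate + margin)


estimate, s_mean, (ci_lower, ci_upper) = mean_response_interval(15)
print(f"a + b(15) = {estimate:.3f}, s = {s_mean:.3f}")
print(f"90% CI for the mean jaw width: ({ci_lower:.3f}, {ci_upper:.3f})")
```

Calling the function with an x* farther from 15.586 shows the interval widening, because the (x* - x-bar)² term grows.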
Prediction Interval for a Single y
Value
When the basic assumptions of the simple linear
regression model are met, a prediction interval for
y*, a single y observation made when x = x*, has
the form
$$a + bx^* \pm (t \text{ critical value}) \cdot \sqrt{s_e^2 + s_{a+bx^*}^2}$$
where the t critical value is based on df = n - 2.
The prediction interval and the confidence interval are
centered at exactly the same place, a + bx*.
The prediction interval is wider than the confidence
interval due to the addition of $s_e^2$ under the
square-root symbol.
Jaws Revisited . . .
Suppose that we were interested in predicting the
jaw width of a single shark of length 15 feet.
$$a + b(15) = .688 + .96345(15) = 15.140$$
$$s_e^2 = (1.376)^2 = 1.8934$$
$$s_{a+b(15)}^2 = (.213)^2 = .0454$$
The 90% prediction interval is
$$a + b(15) \pm (t \text{ critical value}) \cdot \sqrt{s_e^2 + s_{a+b(15)}^2} = 15.140 \pm (1.68)\sqrt{1.9388} = (12.801, 17.479)$$
Notice that this interval is much wider than the
confidence interval for the mean jaw width.
We can be 90% confident that an
individual shark of length 15 feet will
have a jaw width between 12.801 and
17.479 inches.
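A short sketch of the prediction interval in plain Python, again using only the summary values quoted on the slides and the slide's t critical value of 1.68; the function name `prediction_interval` is a hypothetical helper. The only change from the confidence interval for the mean is the extra s_e² term under the square root.

```python
import math

# Summary values quoted on the slides for the shark data.
N = 44
X_BAR = 15.586
S_XX = 279.8718
A, B = 0.688, 0.96345   # estimated intercept and slope
S_E = 1.376             # estimate of sigma
T_CRIT = 1.68           # t critical value used on the slide (90%, df = 42)


def prediction_interval(x_star):
    """Return (lower, upper) for a single y observation made at x_star."""
    estimate = A + B * x_star
    # Squared estimated standard deviation of a + b*x_star.
    var_mean = S_E ** 2 * (1 / N + (x_star - X_BAR) ** 2 / S_XX)
    # A single observation adds its own variance s_e^2.
    margin = T_CRIT * math.sqrt(S_E ** 2 + var_mean)
    return estimate - margin, estimate + margin


pi_lower, pi_upper = prediction_interval(15)
print(f"90% prediction interval: ({pi_lower:.3f}, {pi_upper:.3f})")
```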
Below is a Regression Plot from Minitab
showing the confidence interval and the
prediction interval for the shark data.
Notice that the prediction interval is substantially
wider than the confidence interval.
Also notice that the confidence interval is very
narrow close to $\bar{x}$, but widens the farther it is
from the mean.
A Test for Independence in a
Bivariate Normal Population
Many investigators are interested if ANY relationship
exists between x and y. That is, are x and y independent
of each other?
A bivariate normal population is one where for any fixed
x value, the distribution of associated y values is normal,
and for any fixed y value, the distribution of x values
is normal. An example would be the height x and weight y of
American adult males.
Null Hypothesis: H0: ρ = 0
ρ is the Greek letter "rho"; it is the population
correlation coefficient. It assesses the extent of any
linear relationship in the population. ρ must be
between -1 and 1.
However, ρ = 0 is NOT equivalent to x and y being
independent except in the case of a bivariate normal
population.
Test Statistic:
$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$
The test is based on df = n - 2.
Alternative Hypothesis:              P-value:
Ha: ρ > 0 (positive dependence)      Area to the right of t
Ha: ρ < 0 (negative dependence)      Area to the left of t
Ha: ρ ≠ 0 (dependence)               2(Area to the right of t) if +t
                                     or 2(Area to the left of t) if -t
A Test for Independence in a
Bivariate Normal Population
Assumptions:
r is the correlation coefficient for a random sample
from a bivariate normal population.
One way to verify that the population is
a bivariate normal population is to construct
separate normal probability plots of the x
and y variables.
The relationship between sleep duration and the level of
the hormone leptin (a hormone related to energy intake
and energy expenditure) in the blood was investigated.
Average nightly sleep (x, in hours) and blood leptin level
(y) were recorded for each person in a sample of 716
participants in the Wisconsin Sleep Cohort Study. The
sample correlation coefficient was r = 0.11. Does this
support the claim that short sleep duration is associated
with reduced leptin? Use α = .01.
State the hypotheses.
H0: ρ = 0
Ha: ρ > 0
Where ρ = the correlation between average
nightly sleep and blood leptin level for the
population of adult Americans
To verify the assumptions, we would look at
normal probability plots of the x values and of
the y values. However, data is not available, so
we will assume the bivariate normal population
is reasonable. We will also assume that it is
reasonable to regard the sample of participants
as representative of the population of adult
Americans.
Test Statistic:
$$t = \frac{.11}{\sqrt{\dfrac{1 - (.11)^2}{714}}} = 2.96$$
P-value = .0015    df = 714    α = .01
Sleepless Nights Continued . . .
H0: ρ = 0
Ha: ρ > 0
Where ρ = the correlation between average
nightly sleep and blood leptin level for the
population of adult Americans
Test Statistic:
$$t = \frac{.11}{\sqrt{\dfrac{1 - (.11)^2}{714}}} = 2.96$$
P-value = .0015    df = 714    α = .01
Since the P-value < .01, we reject H0. There is
evidence to suggest that there is a positive
association (perhaps a weak one since r = .11)
between sleep duration and blood leptin level.
Note: the hypothesis of no linear relationship
(H0: β = 0) can also be used to test for independence
in a bivariate normal population.
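The test statistic for the correlation can be sketched in plain Python as below. For the P-value this illustration uses a standard normal approximation to the t curve, which is my own simplification and is reasonable only because df = 714 is large; statistical software would use the t distribution itself, so the result differs slightly from the slide's .0015. The function names are hypothetical.

```python
import math


def correlation_t(r, n):
    """Test statistic t = r / sqrt((1 - r^2) / (n - 2)) for H0: rho = 0."""
    return r / math.sqrt((1 - r * r) / (n - 2))


def upper_tail_normal(z):
    """P(Z > z) for a standard normal variable, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))


r, n = 0.11, 716           # sample correlation and sample size from the study
t_stat = correlation_t(r, n)
p_approx = upper_tail_normal(t_stat)  # approximates the area to the right of t

print(f"t = {t_stat:.2f}, df = {n - 2}, approximate P-value = {p_approx:.4f}")
print("reject H0 at alpha = .01" if p_approx < 0.01 else "fail to reject H0")
```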