Chp 13 ppt - Wylie ISD

Chapter 13
Simple Linear Regression
and Correlation:
Inferential Methods
Suppose we were to investigate the relationship between y = the first-year college grade point average and x = high school grade point average.
Is the first-year college grade point average determined solely by the high school grade point average?
The first-year college grade point average and the high school grade point average do NOT have a deterministic relationship.
A description of a relationship in which the value of y is completely determined by the value of an independent variable x is called a deterministic relationship.
A relationship between two variables that are not deterministically related can be given by a probabilistic model.
The equation for an additive probabilistic model is:
$$y = \text{deterministic function of } x + \text{random deviation} = f(x) + e$$
where e is an “error” variable.
The simple linear regression model assumes that there is a line with y-intercept $\alpha$ and slope $\beta$, called the population regression line.
When a value of the independent variable x is fixed and an observation on the dependent variable y is made,
$$y = \alpha + \beta x + e$$
Without the random deviation e in the equation, all observed (x, y) points would fall exactly on the population regression line.
[Figure: the population regression line (y-intercept $\alpha$, slope $\beta$) with two observations at x1 and x2 and their deviations e1 and e2 from the line]
Basic Assumptions of the Simple
Linear Regression Model
1. The distribution of e at any particular x
value has mean value 0. That is, $\mu_e$ = 0.
2. The standard deviation of e is the same for
any particular value of x. This standard
deviation is denoted by $\sigma$.
3. The distribution of e at any particular value
of x is normal.
4. The random deviations e1, e2, . . ., en
associated with different observations are
independent of one another.
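The four assumptions can be illustrated with a small simulation. This is only a sketch using Python's standard library; the function name `simulate_observations` and the parameter values (intercept 2.0, slope 0.5, standard deviation 1.0) are hypothetical choices for illustration, not values from the chapter.

```python
import random

def simulate_observations(xs, alpha, beta, sigma, seed=0):
    """Draw one y for each fixed x from y = alpha + beta*x + e.

    Each e is an independent normal deviation with mean 0 and the same
    standard deviation sigma at every x (assumptions 1 to 4).
    """
    rng = random.Random(seed)
    return [alpha + beta * x + rng.gauss(0.0, sigma) for x in xs]

xs = [1, 2, 3, 4, 5]
# Hypothetical parameter values, chosen only for illustration.
ys = simulate_observations(xs, alpha=2.0, beta=0.5, sigma=1.0)
# With sigma = 0 there is no random deviation, so every point lies on the line.
on_line = simulate_observations(xs, alpha=2.0, beta=0.5, sigma=0.0)
print(ys)
print(on_line)
```

Setting sigma to 0 reproduces the deterministic case: every simulated point falls exactly on the population regression line.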
Let’s look at the heights and weights of a population of adult women.
How much would an adult female weigh if she were 5 feet tall?
Weights of women that are 5 feet tall will vary – in other words, there is a distribution of weights for adult females who are 5 feet tall.
Are some of these weights more likely than others? What would this distribution look like?
This distribution is normally distributed.
What would you expect for other heights?
We want the standard deviations of all these normal distributions to be the same.
Where would you expect the population regression line to be?
[Figure: Weight versus Height (60 to 68), with a normal distribution of weights drawn at each height]
Basic Assumptions of the Simple
Linear Regression Model Revisited
1. The distribution of e at any particular x value has mean value 0. That is, $\mu_e$ = 0.
2. The standard deviation of e is the same for any particular value of x. This standard deviation is denoted by $\sigma$.
3. The distribution of e at any particular value of x is normal.
4. The random deviations e1, e2, . . ., en associated with different observations are independent of one another.
The distribution of y at any particular x value is normal.
Remember the variable e is a measure of the extent that individual y-values deviate from the population regression line.
For any particular x value, the standard deviation of y equals the standard deviation of e.
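We use $\hat{y} = a + bx$ to estimate the true population regression line.
$$b = \text{point estimate of } \beta = \frac{S_{xy}}{S_{xx}}$$
$$a = \text{point estimate of } \alpha = \bar{y} - b\bar{x}$$
where
$$S_{xy} = \sum xy - \frac{\left(\sum x\right)\left(\sum y\right)}{n} \quad \text{and} \quad S_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n}$$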
Let x* denote a specific value of the predictor variable x.
Then a + bx* has two different interpretations:
1. It is a point estimate of the mean y value when x = x*.
2. It is a point prediction of an individual y value to be
observed when x = x*.
Medical researchers have noted that adolescent females are
much more likely to deliver low-birth-weight babies than are
adult females. Because low-birth-weight babies have higher
mortality rates, a number of studies have examined the
relationship between birth weight and mother’s age for
babies born to young mothers.
The following data are on x = maternal age (in years) and y = birth weight of baby (in grams).
x    15    17    18    15    16    19    17    16    18    19
y    2289  3393  3271  2648  2897  3327  2970  2535  3138  3573
Sketch a scatterplot of these data.
The scatterplot shows a linear pattern and the spread in the y values appears to be similar across the range of x values. This supports the appropriateness of the simple linear regression model.
[Figure: scatterplot of Baby’s Weight (g) versus Mother’s Age (yrs)]
Birth Weight Continued . . .
Summary statistics computed from the sample data (x = maternal age in years, y = birth weight in grams) are:
n = 10, $\sum x = 170$, $\sum y = 30{,}041$, $\sum x^2 = 2910$, $\sum xy = 515{,}600$, $\sum y^2 = 91{,}785{,}351$
Using these summary statistics:
$$S_{xy} = 515{,}600 - \frac{(170)(30{,}041)}{10} = 4903.0$$
$$S_{xx} = 2910 - \frac{(170)^2}{10} = 20.0$$
$$b = \frac{4903.0}{20.0} = 245.15$$
$$a = 3004.1 - (245.15)(17.0) = -1163.45$$
The estimated regression line is:
$$\hat{y} = -1163.45 + 245.15x$$
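The least-squares calculation for the birth weight data can be sketched in a few lines of Python. The variable names and the helper `predict` are illustrative choices; the data are the ten (age, weight) pairs listed on the slide.

```python
x = [15, 17, 18, 15, 16, 19, 17, 16, 18, 19]                      # maternal age (years)
y = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]  # birth weight (g)

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

s_xy = sum_xy - sum_x * sum_y / n   # S_xy
s_xx = sum_x2 - sum_x ** 2 / n      # S_xx
b = s_xy / s_xx                     # point estimate of the slope
a = sum_y / n - b * sum_x / n       # point estimate of the y-intercept

def predict(x_star):
    """Point estimate of the mean y (and point prediction of a single y) at x_star."""
    return a + b * x_star

print(round(s_xy, 1), round(s_xx, 1), round(b, 2), round(a, 2))
print(round(predict(18), 2))
```

The same `predict` value serves both interpretations of a + bx*: an estimate of the mean birth weight at a given age and a prediction for a single baby.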
Birth Weight Continued . . .
$$\hat{y} = -1163.45 + 245.15x$$
The weight of babies increases by approximately 245.15 grams for each increase of 1 year in the mother’s age.
What is the point estimate for the mean weight of babies born to 18-year-old mothers?
$$\hat{y} = -1163.45 + 245.15(18) = 3249.25 \text{ grams}$$
This is the point estimate for the mean weight of all babies born to 18-year-old mothers.
This is also the prediction of the weight of a single baby born to a mother 18 years of age.
[Figure: scatterplot of Baby’s Weight (g) versus Mother’s Age (yrs) with the estimated regression line]
The statistic for estimating the variance $\sigma^2$ is
$$s_e^2 = \frac{SSResid}{n - 2}$$
where
$$SSResid = \sum (y - \hat{y})^2$$
The subscript e reminds us that we are estimating the variance of the “errors” or residuals.
Why n – 2? Since we must estimate both $\alpha$ and $\beta$ in the regression line, we reduce the sample size n by 2.
The estimate for the standard deviation $\sigma$ is
$$s_e = \sqrt{s_e^2}$$
Note that the degrees of freedom associated with estimating $\sigma^2$ or $\sigma$ in simple linear regression is df = n - 2.
Recall the coefficient of determination, $r^2$, is the proportion of observed y variation that is attributed to the model relationship.
Birth Weight Revisited . . .
Use the birth weight data (x = maternal age in years, y = birth weight in grams) to compute SSResid and SSTo. Find $s_e$ and $r^2$.
$$SSResid = \sum (y - \hat{y})^2 = 337{,}212.45$$
$$SSTo = \sum (y - \bar{y})^2 = 1{,}539{,}182.9$$
$$s_e = \sqrt{\frac{337{,}212.45}{8}} = 205.31$$
$$r^2 = 1 - \frac{SSResid}{SSTo} = 1 - \frac{337{,}212.45}{1{,}539{,}182.9} = .78$$
For a particular mother’s age, the typical deviation of possible weights of babies from the regression line is approximately 205 grams.
Approximately 78% of the observed variability in the weight of babies can be explained by this model.
[Figure: scatterplot of Baby’s Weight (g) versus Mother’s Age (yrs)]
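As a check on the arithmetic, SSResid, SSTo, $s_e$ and $r^2$ can be computed directly from the ten listed data pairs. This is a minimal sketch; the variable names are illustrative.

```python
import math

x = [15, 17, 18, 15, 16, 19, 17, 16, 18, 19]                      # maternal age (years)
y = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]  # birth weight (g)

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = s_xy / s_xx
a = y_bar - b * x_bar

# Residual and total sums of squares
ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
ss_to = sum((yi - y_bar) ** 2 for yi in y)

s_e = math.sqrt(ss_resid / (n - 2))   # estimate of sigma, df = n - 2
r_squared = 1 - ss_resid / ss_to      # coefficient of determination

print(round(ss_resid, 2), round(ss_to, 1), round(s_e, 2), round(r_squared, 3))
```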
Properties of the Sampling
Distribution of b
When the four basic assumptions of the simple linear regression model are satisfied, the following statements are true:
1. The mean value of b is $\beta$. That is, $\mu_b = \beta$.
2. The standard deviation of the statistic b is
$$\sigma_b = \frac{\sigma}{\sqrt{S_{xx}}}$$
3. The statistic b has a normal distribution (a consequence of the model assumption that the random deviation e is normally distributed).
Since $\beta$ is almost always unknown, it must be estimated from independently selected observations. The slope b of the least-squares line gives a point estimate for $\beta$.
Since $\sigma$ is usually unknown, the estimated standard deviation of the statistic b is
$$s_b = \frac{s_e}{\sqrt{S_{xx}}}$$
Confidence Interval for $\beta$
When the four basic assumptions of the simple
linear regression model are satisfied, a
confidence interval for $\beta$, the slope of the
population regression line, has the form
$$b \pm (t \text{ critical value}) \cdot s_b$$
where the t critical value is based on df = n – 2.
Is cardiovascular fitness (as measured by time to
exhaustion from running on a treadmill) related to an
athlete’s performance in a 20-km ski race?
The following data on x = treadmill time to exhaustion (in
minutes) and y = 20-km ski time (in minutes) were taken
from the article “Physiological Characteristics and
Performance of Top U.S. Biathletes” (Medicine and
Science in Sports and Exercise, 1995):
x    7.7   8.4   8.7   9.0   9.6   9.6   10.0  10.2  10.4  11.0  11.7
y    71.0  71.4  65.0  68.7  64.4  69.4  63.0  64.6  66.9  62.6  61.7
Sketch a scatterplot for the data.
The plot shows a linear pattern, and the vertical spread of points does not appear to be changing over the range of x values in the sample. If we assume that the distribution of errors at any given x value is approximately normal, then the simple linear regression model seems appropriate.
[Figure: scatterplot of Ski Time (min) versus Treadmill Time (min)]
Biathletes Continued . . .
x = treadmill exhaustion time
y = ski time
Find a 95% confidence interval for the slope of the true regression line.
$$-2.3335 \pm (2.26)(.591) = (-3.671, -.999)$$
We are 95% confident that the true average decrease in ski time associated with a 1 minute increase in treadmill exhaustion time is between 1 minute and 3.7 minutes.
[Figure: scatterplot of Ski Time (min) versus Treadmill Time (min)]
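The interval for the slope can be recomputed from the raw biathlete data. This sketch uses only the standard library, so the t critical value (2.26 for df = 9) is typed in from the slide rather than computed; the variable names are illustrative.

```python
import math

x = [7.7, 8.4, 8.7, 9.0, 9.6, 9.6, 10.0, 10.2, 10.4, 11.0, 11.7]       # treadmill time (min)
y = [71.0, 71.4, 65.0, 68.7, 64.4, 69.4, 63.0, 64.6, 66.9, 62.6, 61.7]  # ski time (min)

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = s_xy / s_xx
a = y_bar - b * x_bar

ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s_e = math.sqrt(ss_resid / (n - 2))
s_b = s_e / math.sqrt(s_xx)          # estimated standard deviation of b

t_crit = 2.26                        # 95% t critical value for df = 9 (from the slide)
lower, upper = b - t_crit * s_b, b + t_crit * s_b
print(round(b, 4), round(s_b, 4), round(lower, 2), round(upper, 2))
```

Small differences in the last digit relative to the slide come from rounding $s_b$ and the critical value before multiplying.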
Biathletes Continued . . .
Partial Minitab Output

The regression equation is
Ski time = 88.8 – 2.33 treadmill time

Predictor    Coef      StDev    T       P
Constant     88.796    5.750    15.44   0.000
Treadmill    -2.3335   0.5911   -3.95   0.003

S = 2.188    R-Sq = 63.4%    R-Sq (adj) = 59.3%

Analysis of Variance
Source           DF    SS        MS       F       P
Regression       1     74.630    74.630   15.58   0.003
Residual Error   9     43.097    4.789
Total            10    117.727

• The regression equation is the equation of the estimated regression line, $\hat{y}$.
• Coef for Constant (88.796) is the estimated y-intercept a.
• Coef for Treadmill (-2.3335) is the estimated slope b.
• StDev for Treadmill (0.5911) is $s_b$, the estimated standard deviation of b.
• S is $s_e$.
• R-Sq is $100 \times r^2$.
• R-Sq (adj) is not used in simple linear regression.
• In the Residual Error row, DF is n - 2, SS is SSResid, and MS is $s_e^2$.
• In the Total row, SS is SSTo.
Summary of Hypothesis Tests
Concerning $\beta$
Null hypothesis: H0: $\beta$ = hypothesized value
Test Statistic:
$$t = \frac{b - \text{hypothesized value}}{s_b}$$
The test is based on df = n – 2.
Alternative Hypothesis:            P-value:
Ha: $\beta$ > hypothesized value      area to right of t under the appropriate t curve
Ha: $\beta$ < hypothesized value      area to left of t under the appropriate t curve
Ha: $\beta$ ≠ hypothesized value      2(area to right of t) if +t or 2(area to left of t) if -t
Often the hypothesized value is zero – this is called the model utility test for simple linear regression.
Summary of Hypothesis Tests
Concerning $\beta$ Continued . . .
Assumptions:
For this test to be appropriate the four basic assumptions
of the simple regression model must be met:
1. The distribution of e at any particular x value
has a mean of 0 ($\mu_e$ = 0).
2. The standard deviation of e is $\sigma$, which does
not depend on x.
3. The distribution of e at any particular x value
is normal.
4. The random deviations e1, e2, …, en associated
with different observations are independent of
one another.
What is the slope of a horizontal line?
Suppose the least-squares line is horizontal – would height be useful in predicting weight?
A slope of zero means that there is NO linear relationship between x and y!
[Figure: Weight versus Height (60 to 68) with a horizontal least-squares line]
The Model Utility Test for
Simple Linear Regression
The model utility test for simple linear regression is the test of
H0: $\beta$ = 0
Ha: $\beta$ ≠ 0
The null hypothesis specifies that there is no useful linear relationship between x and y.
Test Statistic:
$$t = \frac{b}{s_b}$$
Biathletes Revisited . . .
x = treadmill exhaustion time
y = ski time
Even though the scatterplot indicates a linear relationship between ski time and treadmill time, let’s perform the model utility test.
H0: $\beta$ = 0
Ha: $\beta$ ≠ 0
Where $\beta$ is the slope of the population regression line between treadmill time and ski time
$\alpha$ = .05    df = 9
$$t = \frac{-2.3335}{0.5911} = -3.95$$
P-value = .003
Since the P-value < $\alpha$, we reject H0. There is sufficient evidence of a linear relationship between treadmill time and ski time.
[Figure: scatterplot of Ski Time (min) versus Treadmill Time (min)]
Biathletes Revisited . . .
Partial Minitab Output

The regression equation is
Ski time = 88.8 – 2.33 treadmill time

Predictor    Coef      StDev    T       P
Constant     88.796    5.750    15.44   0.000
Treadmill    -2.3335   0.5911   -3.95   0.003

S = 2.188    R-Sq = 63.4%    R-Sq (adj) = 59.3%

Analysis of Variance
Source           DF    SS        MS       F       P
Regression       1     74.630    74.630   15.58   0.003
Residual Error   9     43.097    4.789
Total            10    117.727

In the Treadmill row, T is the t test statistic (-2.3335 ÷ 0.5911 = -3.95) and P is the P-value.
Statistical software usually performs the model utility test with H0: $\beta$ = 0 versus Ha: $\beta$ ≠ 0.
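The output also reports an F statistic with the same P-value. In simple linear regression, F equals the square of the t statistic for the slope; this is a standard result that the slide does not state, and the short sketch below only checks it numerically with values typed in from the Minitab output.

```python
# Values read from the Minitab output (rounded as printed there).
b, s_b = -2.3335, 0.5911
ms_regression, ms_residual = 74.630, 4.789

t = b / s_b                        # t statistic for H0: beta = 0
f = ms_regression / ms_residual    # F statistic from the ANOVA table

print(round(t, 2), round(t ** 2, 2), round(f, 2))
```

Because the printed inputs are rounded, t squared and F agree only to about two decimal places.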
Checking Model Adequacy
The simple linear regression model is
$$y = \alpha + \beta x + e$$
where e represents the random deviation of an observed y value from the population regression line $\alpha + \beta x$.
The assumptions for simple linear regression are based on this random deviation e.
If we knew the deviations e1, e2, …, en, we could examine them for any inconsistencies with model assumptions.
However, we do not know the deviations e1, e2, …, en because the population regression line is unknown.
Therefore, we must estimate these deviations using the residuals from the estimated line.
Thus, we use the residuals to check our assumptions.
Residual Analysis
• Standardize the residuals to look at their magnitudes
$$\text{standardized residual} = \frac{\text{residual}}{\text{estimated standard deviation of residual}}$$
Most statistical software will perform this calculation. It is tedious to do by hand.
• Create a residual plot (from Chapter 5) or a standardized residual plot (which is a plot of the (x, standardized residual) pairs)
A desirable plot is one that exhibits no particular pattern (such as curvature or much greater spread in one part of the plot than the other) and that has no point that is far removed from all the others.
Any observation with a large positive or negative standardized residual should be examined carefully for any error in recording data, nonstandard experimental condition, or atypical experimental unit.
A Look at Standardized Residual Plots
[Four standardized residual plots, each with a caption]
This is a desirable plot in that it exhibits no pattern and has no point that lies far away from the other points.
This plot exhibits a curved pattern which indicates that the fitted model should be changed to incorporate the curvature.
In this plot, the standard deviation of the residuals increases as the x-values increase. While a straight-line model might still be appropriate, the best-fit line should be found using weighted least-squares. Consult your local statistician!
Both of these plots contain points far away from the others. These points can have substantial effects on estimates of $\alpha$ and $\beta$ as well as other quantities.
Biathletes Revisited . . .
r = residuals
sr = standardized residuals (from Minitab)
x     7.7   8.4   8.7    9.0   9.6    9.6   10.0   10.2   10.4  11.0   11.7
y     71.0  71.4  65.0   68.7  64.4   69.4  63.0   64.6   66.9  62.6   61.7
r     0.17  2.21  -3.49  0.91  -1.99  3.01  -2.46  -0.39  2.37  -0.53  0.21
sr    0.10  1.13  -1.74  0.44  -0.96  1.44  -1.18  -0.19  1.16  -0.27  0.12
Let’s look at a normal probability plot of the standardized residuals.
The normal probability plot of the standardized residuals is quite straight. There is no reason to doubt the plausibility that the random deviations e are normally distributed.
[Figure: scatterplot of Ski Time (min) versus Treadmill Time (min); normal probability plot of Standardized Residual versus Normal Score]
Biathletes Continued . . .
r = residuals
sr = standardized residuals (from Minitab)
Sketch a residual plot.
Sketch a standardized residual plot.
Notice that these two plots have similar appearances.
The standardized residual plot does not show evidence of any pattern or of increasing spread.
Remember that residuals can also be plotted against y.
[Figure: residual plot (Residuals versus Treadmill Time) and standardized residual plot (Standardized Residuals versus Treadmill Time)]
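The residuals and standardized residuals in the table can be reproduced by hand. The slides leave the "estimated standard deviation of residual" to software; the sketch below assumes the usual formula for simple linear regression, $s_e\sqrt{1 - 1/n - (x - \bar{x})^2/S_{xx}}$, which is not shown in the slides. Variable names are illustrative.

```python
import math

x = [7.7, 8.4, 8.7, 9.0, 9.6, 9.6, 10.0, 10.2, 10.4, 11.0, 11.7]       # treadmill time (min)
y = [71.0, 71.4, 65.0, 68.7, 64.4, 69.4, 63.0, 64.6, 66.9, 62.6, 61.7]  # ski time (min)

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = s_xy / s_xx
a = y_bar - b * x_bar

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
s_e = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))

# Assumed formula: each residual has its own estimated standard deviation,
# smaller for x values far from x_bar.
resid_sd = [s_e * math.sqrt(1 - 1 / n - (xi - x_bar) ** 2 / s_xx) for xi in x]
standardized = [r / sd for r, sd in zip(residuals, resid_sd)]

for xi, r, sr in zip(x, residuals, standardized):
    print(xi, round(r, 2), round(sr, 2))
```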
Optional Topics
Inferences Based on the
Estimated Regression Line
and
Inference about the Population
Correlation Coefficient
Properties of the Sampling Distribution
of a + bx for a Fixed Value of x
Let x* denote a particular value of the independent variable x. When the four basic assumptions of the simple linear regression model are satisfied, the sampling distribution of the statistic a + bx* has the following properties:
1) The mean value of a + bx* is $\alpha + \beta x^*$, so a + bx* is an unbiased statistic for estimating the mean y value when x = x*.
2) The standard deviation of the statistic a + bx*, denoted by $\sigma_{a+bx^*}$, is given by
$$\sigma_{a+bx^*} = \sigma\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$$
3) The distribution of a + bx* is normal.
The farther x* is from the center, the larger $\sigma_{a+bx^*}$ is.
Since $\sigma$ is unknown, $\sigma_{a+bx^*}$ can be estimated by $s_{a+bx^*}$, which substitutes $s_e$ in place of $\sigma$.
Confidence Interval for a Mean y
Value
When the basic assumptions of the simple linear regression model are met, a confidence interval for $\alpha + \beta x^*$, the mean y value when x has value x*, is
$$a + bx^* \pm (t \text{ critical value}) \cdot s_{a+bx^*}$$
where the t critical value is based on df = n – 2.
Because $s_{a+bx^*}$ is larger the farther x* is from $\bar{x}$, the confidence interval becomes wider as x* moves away from the center of the data.
Physical characteristics of sharks are of interest to surfers and scuba divers as well as to marine researchers. The data on x = length (in feet) and y = jaw width (in inches) for 44 sharks were found in various articles appearing in the magazines Skin Diver and Scuba News. (These data are found on page 778 of the text.)
Because it is difficult to measure jaw width in living sharks, researchers would like to determine whether it is possible to estimate jaw width from body length, which is more easily measured.
This scatterplot of the data shows a linear pattern and is consistent with use of the simple linear regression model.
[Figure: scatterplot of jaw width versus length]
Jaws Continued . . .
The regression equation is
Jaw Width = 0.69 + 0.963 Length

Predictor    Coef      StDev     T       P
Constant     0.688     1.299     0.53    0.599
Length       0.96345   0.08228   11.71   0.000

S = 1.376    R-Sq = 76.6%    R-Sq (adj) = 76.0%

The model utility test confirms the usefulness of this model.
The simple linear regression model explains 76.6% of the variability in jaw width.
Let’s use the data to compute a 90% confidence interval for the mean jaw width for 15 foot long sharks.
The point estimate is
$$a + b(15) = .688 + .96345(15) = 15.140 \text{ in.}$$
The estimated standard deviation of a + b(15) is
$$s_{a+b(15)} = 1.376\sqrt{\frac{1}{44} + \frac{(15 - 15.586)^2}{279.8718}} = .213$$
Jaws Continued . . .
The 90% confidence interval is
$$a + b(15) \pm (t \text{ critical value}) \cdot s_{a+b(15)} = 15.140 \pm (1.68)(.213) = (14.782, 15.498)$$
Based on these sample data, we can
be 90% confident that the mean jaw
width for sharks of length 15 feet is
between 14.782 and 15.498 inches.
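The confidence interval can be reproduced from the summary values quoted on the shark slides (n = 44, $\bar{x}$ = 15.586, $S_{xx}$ = 279.8718, and the Minitab coefficients). A minimal sketch; the function name `mean_ci` is a hypothetical helper and the t critical value 1.68 is typed in from the slide.

```python
import math

# Summary values taken from the shark slides.
n, x_bar, s_xx = 44, 15.586, 279.8718
a, b, s_e = 0.688, 0.96345, 1.376
t_crit = 1.68   # 90% t critical value used on the slide

def mean_ci(x_star):
    """Confidence interval for the mean y value when x = x_star."""
    fit = a + b * x_star
    s_fit = s_e * math.sqrt(1 / n + (x_star - x_bar) ** 2 / s_xx)
    return fit - t_crit * s_fit, fit + t_crit * s_fit

lower, upper = mean_ci(15)
print(round(lower, 3), round(upper, 3))
```

Calling `mean_ci` at a length far from 15.586 feet gives a wider interval, which is the widening effect described for values of x* away from the center of the data.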
Prediction Interval for a Single y
Value
When the basic assumptions of the simple linear regression model are met, a prediction interval for y*, a single y observation made when x = x*, has the form
$$a + bx^* \pm (t \text{ critical value}) \cdot \sqrt{s_e^2 + s_{a+bx^*}^2}$$
where the t critical value is based on df = n – 2.
The prediction interval and the confidence interval are centered at exactly the same place, a + bx*.
The prediction interval is wider than the confidence interval due to the addition of $s_e^2$ under the square-root symbol.
Jaws Revisited . . .
Suppose that we were interested in predicting the jaw width of a single shark of length 15 feet.
$$a + b(15) = .688 + .96345(15) = 15.140$$
$$s_e^2 = (1.376)^2 = 1.8934$$
$$s_{a+b(15)}^2 = (.213)^2 = .0454$$
The 90% prediction interval is
$$a + b(15) \pm (t \text{ critical value}) \cdot \sqrt{s_e^2 + s_{a+b(15)}^2} = 15.140 \pm (1.68)\sqrt{1.9388} = (12.801, 17.479)$$
We can be 90% confident that an individual shark of length 15 feet will have a jaw width between 12.801 and 17.479 inches.
Notice that this interval is much wider than the confidence interval for the mean jaw width.
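The prediction interval follows the same pattern, with $s_e^2$ added under the square root. Again a sketch from the summary values on the slides; `prediction_interval` is a hypothetical helper name and 1.68 is the critical value used there.

```python
import math

# Summary values taken from the shark slides.
n, x_bar, s_xx = 44, 15.586, 279.8718
a, b, s_e = 0.688, 0.96345, 1.376
t_crit = 1.68   # 90% t critical value used on the slide

def prediction_interval(x_star):
    """Prediction interval for a single y observed when x = x_star."""
    fit = a + b * x_star
    s_fit_sq = s_e ** 2 * (1 / n + (x_star - x_bar) ** 2 / s_xx)
    half_width = t_crit * math.sqrt(s_e ** 2 + s_fit_sq)
    return fit - half_width, fit + half_width

lower, upper = prediction_interval(15)
print(round(lower, 3), round(upper, 3))
```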
Below is a Regression Plot from Minitab showing
the confidence interval and the prediction
interval for the shark data.
Notice that the prediction interval is substantially wider than the confidence interval.
Also notice that the confidence interval is very narrow close to $\bar{x}$, but widens the farther it is from the mean.
A Test for Independence in a
Bivariate Normal Population
Many investigators are interested if ANY relationship exists between x and y. That is, are x and y independent of each other?
$\rho$ (Greek letter “rho”) is the population correlation coefficient. It assesses the extent of any linear relationship in the population. $\rho$ must be between -1 and 1.
However, $\rho$ = 0 is NOT equivalent to x and y being independent except in the case of a bivariate normal population.
A bivariate normal population is one where for any fixed x value, the distribution of associated y values is normal, and for any fixed y value, the distribution of x values is normal. An example would be the height x and weight y of American adult males.
Null Hypothesis: H0: $\rho$ = 0
Test Statistic:
$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$
The test is based on df = n – 2.
Alternative Hypothesis:               P-value:
Ha: $\rho$ > 0 (positive dependence)     Area to the right of t
Ha: $\rho$ < 0 (negative dependence)     Area to the left of t
Ha: $\rho$ ≠ 0 (dependence)              2(Area to the right of t) if +t or 2(Area to the left of t) if -t
A Test for Independence in a
Bivariate Normal Population
Assumptions:
r is the correlation coefficient for a random
sample from a bivariate normal population.
One way to check that the population is plausibly a bivariate normal population is to construct individual normal probability plots of the x and y variables.
The relationship between sleep duration and the level of
the hormone leptin (a hormone related to energy intake
and energy expenditure) in the blood was investigated.
Average nightly sleep (x, in hours) and blood leptin level
(y) were recorded for each person in a sample of 716
participants in the Wisconsin Sleep Cohort Study. The
sample correlation coefficient was r = 0.11. Does this
support the claim that short sleep duration is associated
with reduced leptin? Use $\alpha$ = .01.
State the hypotheses.
H0: $\rho$ = 0
Ha: $\rho$ > 0
Where $\rho$ = the correlation between average nightly sleep and blood leptin level for the population of adult Americans
To verify the assumptions, we would look at normal probability plots of the x values and of the y values. However, the data are not available, so we will assume the bivariate normal population is reasonable. We will also assume that it is reasonable to regard the sample of participants as representative of the population of adult Americans.
Test Statistic:
$$t = \frac{.11}{\sqrt{\dfrac{1 - (.11)^2}{714}}} = 2.96$$
P-value = .0015    df = 714    $\alpha$ = .01
Sleepless Nights Continued . . .
H0: $\rho$ = 0
Ha: $\rho$ > 0
Where $\rho$ = the correlation between average nightly sleep and blood leptin level for the population of adult Americans
Test Statistic:
$$t = \frac{.11}{\sqrt{\dfrac{1 - (.11)^2}{714}}} = 2.96$$
P-value = .0015    df = 714    $\alpha$ = .01
Since the P-value < .01, we reject H0. There is evidence to suggest that there is a positive association (perhaps a weak one since r = .11) between sleep duration and blood leptin level.
Note: the hypothesis of no linear relationship (H0: $\beta$ = 0) can also be used to test for independence in a bivariate normal population.
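The test statistic for the sleep example is a one-line calculation. In this sketch the function name `correlation_t` is illustrative, and the P-value is not computed because that would need a t-distribution routine outside the standard library.

```python
import math

def correlation_t(r, n):
    """t statistic for testing H0: rho = 0, based on df = n - 2."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))

t = correlation_t(0.11, 716)   # r = 0.11 from n = 716 participants
print(round(t, 2))
```

Even a small sample correlation such as r = .11 gives a large t statistic here because the sample size is large.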