Transcript LinReg

Linear Regression t-Tests
Cardiovascular fitness among skiers
Cardiovascular fitness is measured by the time required to run
to exhaustion on a treadmill. In the following study,
cardiovascular fitness is compared to performance in a 20-km
ski race.
The following data are for biathletes, as reported in an article on
sports physiology:
x    7.7   8.4   8.7   9.0   9.6   9.6  10.0  10.2  10.4  11.0  11.7
y   71.0  71.4  65.0  68.7  64.4  69.4  63.0  64.6  66.9  62.6  61.7
x = treadmill time (minutes) y = 20-km ski time (minutes)
“Physiological Characteristics and Performance of Top U.S. Biathletes,”
Medicine and Science in Sports and Exercise (1995): 1302-1310.
When we encounter data in ordered pairs we usually examine
the data first by making a scatterplot.
First, we will enter the data into lists on the calculator.
Now setting up the scatterplot:
The scatterplot suggests a negative linear relationship
between treadmill time and ski race time. Note that while
my graphs do not have axes labeled, this is due to technical
constraints, and when you write your answers on paper you
should always label the axes and show the scale.
We perform linear regression to obtain the equation of the best-fit line.
On the TI-83 press <STAT> <CALC> <8:LinReg(a+bx)>
Recall that L1 and L2 are the default lists so
I don’t have to specify them, but do need to
specify Y1 in order to store the equation:
Press <VARS> <Y-VARS> <1:Function> <1:Y1>
Press <ENTER>.
$\hat{y} = a + bx$
$\hat{y} = 88.795 - 2.3335x$
The linear model shows that for every one-minute increase in treadmill
time there is a decrease of 2.3335 minutes (on average) in ski race
time. When the treadmill time is zero, the ski race time is expected
to be 88.795 minutes.
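For readers working away from the calculator, the fit can be checked with a minimal Python sketch. It assumes the usual least-squares formulas (slope = Sxy / Sxx, intercept = y-bar minus slope times x-bar); the function name `least_squares` is only illustrative and is not part of the original lesson.

```python
# Treadmill time (x, minutes) and 20-km ski time (y, minutes) from the data table.
x = [7.7, 8.4, 8.7, 9.0, 9.6, 9.6, 10.0, 10.2, 10.4, 11.0, 11.7]
y = [71.0, 71.4, 65.0, 68.7, 64.4, 69.4, 63.0, 64.6, 66.9, 62.6, 61.7]

def least_squares(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of squared deviations in x, and sum of cross-products.
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept
    return a, b

a, b = least_squares(x, y)
print(f"intercept a = {a:.4f}, slope b = {b:.4f}")
```

The printed intercept and slope should agree with the calculator's LinReg(a+bx) output up to rounding.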
Now graphing the line we see that the model looks good.
Recall that whenever we perform linear
regression we must confirm our results
by making a residual plot.
We go to the lists and enter the residuals in L3.
With the cursor on the header for L3,
press <2nd> <LIST> then scroll to
<RESID>.
Now make a scatterplot of the residuals.
Press <ZOOM>
<STAT>.
Here we see that the residuals are fairly
randomly scattered. This patternless
residual plot allows us to confirm our
linear model for the data.
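The RESID list the calculator stores can be reproduced by hand. The sketch below assumes the fitted coefficients quoted earlier (a = 88.795, b = -2.3335) and computes each residual as observed y minus predicted y.

```python
# Data from the table; coefficients are the rounded values from the fitted line.
x = [7.7, 8.4, 8.7, 9.0, 9.6, 9.6, 10.0, 10.2, 10.4, 11.0, 11.7]
y = [71.0, 71.4, 65.0, 68.7, 64.4, 69.4, 63.0, 64.6, 66.9, 62.6, 61.7]
a, b = 88.795, -2.3335

# residual = observed y minus predicted y-hat
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

for xi, ri in zip(x, residuals):
    print(f"x = {xi:5.1f}   residual = {ri:6.2f}")

# For a least-squares line the residuals sum to (approximately) zero;
# the small leftover here comes from using rounded coefficients.
print(f"sum of residuals = {sum(residuals):.3f}")
```

Plotting these residuals against x gives the residual plot described above; a mix of positive and negative values with no visible curve is what we hope to see.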
A new concern for us in this new test is that our residuals need
to be normally distributed. This reflects the requirement that, for
any fixed x, the response variable varies normally about the true
regression line. We have not seen this assumption before.
We have in the past needed to establish that data is derived
from a normal distribution. We will follow the same
approach here.
We make a normal probability plot to check this.
The normal probability
plot shows a linear
pattern.
This is consistent with the residuals having a normal distribution.
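A numeric companion to the normal probability plot is sketched below. It pairs the sorted residuals with normal scores and reports their correlation; a value close to 1 corresponds to a nearly straight plot. The plotting positions (i - 0.5)/n are an assumption made for this sketch, and the calculator may use slightly different ones.

```python
import math
from statistics import NormalDist

x = [7.7, 8.4, 8.7, 9.0, 9.6, 9.6, 10.0, 10.2, 10.4, 11.0, 11.7]
y = [71.0, 71.4, 65.0, 68.7, 64.4, 69.4, 63.0, 64.6, 66.9, 62.6, 61.7]
a, b = 88.795, -2.3335  # rounded coefficients from the fitted line

# Sorted residuals go on one axis of a normal probability plot.
residuals = sorted(yi - (a + b * xi) for xi, yi in zip(x, y))
n = len(residuals)

# Normal scores: z-values at the assumed plotting positions (i - 0.5)/n.
normal_scores = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

def correlation(u, v):
    """Pearson correlation between two equal-length lists."""
    u_bar = sum(u) / len(u)
    v_bar = sum(v) / len(v)
    suv = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
    suu = sum((ui - u_bar) ** 2 for ui in u)
    svv = sum((vi - v_bar) ** 2 for vi in v)
    return suv / math.sqrt(suu * svv)

r = correlation(residuals, normal_scores)
print(f"correlation of residuals with normal scores: {r:.3f}")
```

This is only a rough check; the visual plot remains the evidence you should show in a write-up.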
Another piece of information we need for the linear regression t-test
deals with an assumption that the standard deviation is the
same for all values of x. To check this we reexamine the residual
plot.
If the data are scattered to about the same
extent as we move from left to right, we
can say that the equal variance assumption
is met.
Some people call this visual inspection the "Does the plot
thicken?" condition. That is, do the residuals get closer
together in part of the graph? In our example the variance
seems the same throughout.
We really just have one more assumption, and that is that the
individual ordered pairs are independent of one another.
In practice this is difficult to truly satisfy, and we often move
forward without fully knowing that the data is independent.
The best we can do is carefully examine the data and the
residuals, looking for patterns that we might have overlooked.
If data is collected over time, we might want to graph it as a
function of time to see if there has been a general trend that
would represent a violation of the independence assumption.
Now we look more towards the test.
If we make repeated samples of data from a population and
fit each sample with linear regression, we will likely get
different equations each time.
We understand that, due to sampling variability, our
estimates of a and b in the equation are just that, estimates.
Since we calculate them on samples they are statistics.
They estimate values that are true for the population.
Ultimately, we seek the true regression line, and write it with
the Greek letters $\alpha$ and $\beta$ for the parameters that a and b estimate.
The true regression line is
$\mu_y = \alpha + \beta x$
Our significance test will attempt to determine whether $\beta$ is
zero. If the slope is zero, then the explanatory variable is
useless as a predictor of the response variable.
The null hypothesis is always $H_0: \beta = 0$.
The alternative hypothesis is one of $H_a: \beta < 0$, $\beta > 0$, or $\beta \neq 0$.


For us to be able to judge whether the variability we see is
explainable by chance alone, we must have an idea of how
much variability there is in this system.
We calculate the standard error about the line.
$s = \sqrt{\frac{1}{n-2}\sum \text{residual}^2} = \sqrt{\frac{1}{n-2}\sum (y - \hat{y})^2}$
We use $s$ to estimate $\sigma$ in the regression model.
We have degrees of freedom in this test because it is a t-test. We
will have n-2 degrees of freedom in these tests, where n is the
number of data points.
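As a quick check of the standard error about the line, here is a short Python sketch. It assumes the rounded coefficients from the fitted line and divides the sum of squared residuals by n - 2 before taking the square root.

```python
import math

x = [7.7, 8.4, 8.7, 9.0, 9.6, 9.6, 10.0, 10.2, 10.4, 11.0, 11.7]
y = [71.0, 71.4, 65.0, 68.7, 64.4, 69.4, 63.0, 64.6, 66.9, 62.6, 61.7]
a, b = 88.795, -2.3335  # rounded coefficients from the fitted line

n = len(x)
df = n - 2  # degrees of freedom for the regression t-test

# Sum of squared residuals, then the standard error about the line.
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / df)

print(f"df = {df}, s = {s:.3f}")
```

The value of s printed here should match the s reported on the calculator's LinRegTTest output screen up to rounding.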
We need one further concept, and that is the standard error of the
regression slope, $SE_b$.
$SE_b = \dfrac{s}{\sqrt{\sum (x - \bar{x})^2}}$
We are now ready to define our test statistic:
$t = \dfrac{b}{SE_b}$
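To see how the pieces fit together, the following sketch computes the slope, its standard error, and the t statistic from the raw data. It is a hand calculation using the formulas above, not the calculator's internal routine, and the variable names are illustrative.

```python
import math

x = [7.7, 8.4, 8.7, 9.0, 9.6, 9.6, 10.0, 10.2, 10.4, 11.0, 11.7]
y = [71.0, 71.4, 65.0, 68.7, 64.4, 69.4, 63.0, 64.6, 66.9, 62.6, 61.7]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept.
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = sxy / sxx
a = y_bar - b * x_bar

# Standard error about the line, with n - 2 degrees of freedom.
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))

# Standard error of the slope, and the test statistic t = b / SE_b.
se_b = s / math.sqrt(sxx)
t = b / se_b

print(f"b = {b:.4f}, SE_b = {se_b:.4f}, t = {t:.4f}, df = {n - 2}")
```

The t value should agree with the one reported by LinRegTTest in the write-up that follows.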
Let’s give a quick example of how you should write this as a
7-step write-up.
Step 1:
$H_0: \beta = 0$.
$H_a: \beta < 0$.
Step 2:
This scatterplot of ski-race time as
a function of treadmill time shows
a negative linear relationship.
$\hat{y} = 88.795 - 2.3335x$
This residual plot is patternless, which is
consistent with our linear model.
Further examination of this residual plot shows that the standard
deviation is the same throughout.
A normal probability plot of the residuals is roughly linear,
which is consistent with normally distributed residuals.
Our data appear to be independent; at least I do not find a
reason to say that they are not independent.
Step 3:
$t = \dfrac{b}{SE_b} = \dfrac{-2.3335}{SE_b} = -3.947$, df = 9.
This was found by running the linear regression t-test on the
calculator. To do this, press <STAT> <TESTS> <E:LinRegTTest>
<ENTER>. Set the values as shown.
Step 4:
The test statistic is too extreme
to see the shading on this graph.
Step 5:
$P(t < -3.9477) = 0.00168$
Step 6:
Reject H0; a value this extreme would occur by
chance alone less than 1% of the time.
Step 7:
We have evidence that cardiovascular fitness, as
measured by a treadmill test, does correspond to
reduced race time in a 20-km ski race.
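The P-value in Step 5 can be checked without a calculator. The sketch below integrates the t density numerically with Simpson's rule; this is a stand-in chosen for this example and makes no claim about how the TI-83 computes it. The function names `t_pdf` and `t_lower_tail` are illustrative.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def t_lower_tail(t, df, steps=2000):
    """P(T < t), using symmetry plus Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        # Simpson weights: 4 on odd interior points, 2 on even ones.
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # area under the density between 0 and |t|
    return 0.5 - area if t < 0 else 0.5 + area

# Test statistic and degrees of freedom from Step 3.
p_value = t_lower_tail(-3.9477, 9)
print(f"P(t < -3.9477) with df = 9: {p_value:.5f}")
```

Because the alternative hypothesis is one-sided (slope less than zero), only the lower tail is used.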
THE END