Regression
Several Explanatory Variables
Example: Scottish hill races data.
These data are made available in R as
> library(MASS)
> data(hills)
They give record times (minutes) in 1984
of 35 Scottish hill races, against distance
(miles) and total height climbed (feet).
We regard time as the response
variable, and seek to model how its
conditional distribution depends on
the explanatory variables distance
and climb.
The R code pairs(hills) produces the plots shown.
These show that the response variable
time has a strong positive association
with each of the explanatory variables
distance and climb, although the
dependence on distance is the stronger.
However, the two explanatory variables
distance and climb also have a strong
positive association with each other,
and this complicates the modelling.
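The strength of these associations can also be checked numerically. The following is a minimal sketch, assuming the hills data frame from MASS with its columns dist, climb and time. The use of cor and the name hills.cor are illustrative and not part of the original analysis.

```r
library(MASS)   # provides the hills data frame
data(hills)

# Scatterplot matrix of all pairs of variables, as in the text
pairs(hills)

# Pairwise correlations between dist, climb and time
hills.cor <- cor(hills)   # hills.cor is an illustrative name
print(round(hills.cor, 2))
```

A large correlation between dist and climb is what makes it hard to separate their individual effects on time.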
Preliminary analysis of the data
suggests that the observation (number
18) corresponding to Knock Hill is
almost certainly in error - the time is
much too great for the given distance
and climb, and it may have been
misrecorded by 1 hour. We therefore
omit Knock Hill from the analysis.
(Use the plot and identify commands
to locate this observation.)
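One possible way to locate and drop the suspect observation is sketched below. The names knock.row and hills.clean are hypothetical, and the choice to plot time against dist is an assumption; identify is only called in an interactive session because it waits for mouse clicks.

```r
library(MASS)
data(hills)

# Plot record time against distance to look for unusual points
plot(hills$dist, hills$time,
     xlab = "distance (miles)", ylab = "record time (minutes)")

# identify() waits for mouse clicks, so skip it when run as a script
if (interactive()) {
  identify(hills$dist, hills$time, labels = rownames(hills))
}

# Locate Knock Hill by its row name and drop it from the data
knock.row <- which(rownames(hills) == "Knock Hill")
print(knock.row)
hills.clean <- hills[-knock.row, ]   # hills.clean is an illustrative name
print(nrow(hills.clean))
```

Selecting the row by name rather than by number keeps the code correct even if the row order of the data were to change.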
On physical grounds we attempt to
find a model with zero intercept.
We consider first a linear model
(Model 1) involving both the
explanatory variables distance and
climb.
time = a × distance + b × climb + ε
The fitted model is
time = 5.47 × distance + 0.0106 × climb + ε
The “three stars” associated with the
estimates of the coefficients show
that distance and climb are both
important explanatory variables.
(This can be confirmed by noting the
much poorer fits obtained if either
of these variables is omitted.)
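A sketch of how Model 1 and the two reduced fits might be obtained is given below. It assumes Knock Hill has been removed and that the zero intercept is imposed with the "- 1" term in the formula. The names hills.clean, fit.dist.only, fit.climb.only and rss are illustrative, and comparing residual sums of squares is just one way to judge the poorer fits.

```r
library(MASS)
data(hills)
hills.clean <- hills[rownames(hills) != "Knock Hill", ]

# Model 1: zero intercept (the "- 1" term), linear in dist and climb
hills.model.1 <- lm(time ~ dist + climb - 1, data = hills.clean)
print(summary(hills.model.1))   # coefficient table with significance stars

# Reduced zero-intercept models, each omitting one explanatory variable
fit.dist.only  <- lm(time ~ dist - 1,  data = hills.clean)
fit.climb.only <- lm(time ~ climb - 1, data = hills.clean)

# Residual sums of squares: larger values mean a poorer fit
rss <- c(both       = deviance(hills.model.1),
         dist.only  = deviance(fit.dist.only),
         climb.only = deviance(fit.climb.only))
print(rss)
```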
The R code plot(hills.model.1) produces
the standard diagnostic plots for the fit.
The pattern of residuals leads us to
suspect that there may be some nonlinear
dependence on climb and/or distance. This
would be physically quite natural. It
seems reasonable here to introduce quadratic
terms as a first attempt to model any
nonlinearity.
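The residual checks described above could be carried out along the following lines. This is a sketch only: it assumes Model 1 is the zero-intercept fit without Knock Hill, and the extra plots of residuals against each explanatory variable are an addition for illustration.

```r
library(MASS)
data(hills)
hills.clean <- hills[rownames(hills) != "Knock Hill", ]
hills.model.1 <- lm(time ~ dist + climb - 1, data = hills.clean)

# The four standard diagnostic plots on a single page
op <- par(mfrow = c(2, 2))
plot(hills.model.1)
par(op)

# Residuals against each explanatory variable;
# systematic curvature would suggest a nonlinear dependence
op <- par(mfrow = c(1, 2))
plot(hills.clean$dist, resid(hills.model.1),
     xlab = "distance (miles)", ylab = "Model 1 residual")
abline(h = 0, lty = 2)
plot(hills.clean$climb, resid(hills.model.1),
     xlab = "climb (feet)", ylab = "Model 1 residual")
abline(h = 0, lty = 2)
par(op)
```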
We consider now the (quite elaborate)
model (Model 2):
time = a0 × distance + b0 × (distance)^2 +
c0 × climb + d0 × (climb)^2 + ε
The fitted model is now:
time = 5.62 × distance + 0.0323 × (distance)^2 +
0.000262 × climb + 0.00000180 × (climb)^2 + ε
The analysis, most notably the “star values”
associated with the estimate of the
coefficient of (climb)^2, shows that there is
indeed evidence of nonlinearity in the
dependence on climb, and (given also
physical considerations) quite possibly in
the dependence on distance.
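A possible way to fit Model 2 is sketched below. The name hills.model.2 is an assumption made by analogy with hills.model.1, and the anova comparison of the two nested models is an added illustration, not something the original analysis reports.

```r
library(MASS)
data(hills)
hills.clean <- hills[rownames(hills) != "Knock Hill", ]

# Model 1: linear terms only, zero intercept
hills.model.1 <- lm(time ~ dist + climb - 1, data = hills.clean)

# Model 2: adds quadratic terms; I() protects ^ inside a formula
hills.model.2 <- lm(time ~ dist + I(dist^2) + climb + I(climb^2) - 1,
                    data = hills.clean)
print(summary(hills.model.2))   # coefficient table with significance stars

# F-test of Model 1 against Model 2 (Model 1 is nested in Model 2)
print(anova(hills.model.1, hills.model.2))
```

Without I(), the ^ operator in a formula denotes term crossing, so dist^2 would not produce a squared term.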
The pattern of residuals is now
more randomly spread, indicating
a better model than the first one.
Finally, the residuals of model 1 can be plotted
against those of model 2.
This suggests that Model 2 is a
considerable improvement, at least
insofar as it reduces the large residuals
associated with the three labelled
observations.
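The comparison of residuals might be plotted as in the sketch below. Which three observations were labelled in the original plot is not stated, so labelling the three largest absolute Model 1 residuals is an assumption, as are the names r1, r2 and big.

```r
library(MASS)
data(hills)
hills.clean <- hills[rownames(hills) != "Knock Hill", ]
hills.model.1 <- lm(time ~ dist + climb - 1, data = hills.clean)
hills.model.2 <- lm(time ~ dist + I(dist^2) + climb + I(climb^2) - 1,
                    data = hills.clean)

r1 <- resid(hills.model.1)
r2 <- resid(hills.model.2)

# Model 2 residuals against Model 1 residuals
plot(r1, r2, xlab = "Model 1 residual", ylab = "Model 2 residual")
abline(0, 1, lty = 2)   # line of equal residuals under both models

# Assumption: label the three largest absolute Model 1 residuals
big <- order(abs(r1), decreasing = TRUE)[1:3]
text(r1[big], r2[big], labels = names(r1)[big], pos = 3, cex = 0.7)
```

Points lying closer to zero on the vertical axis than on the horizontal axis are fitted better by Model 2 than by Model 1.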
The observations corresponding to
Bens of Jura and Lairig Ghru remain
moderately influential.
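Influence can be examined with standard regression diagnostics. The sketch below uses leverage (hatvalues) and Cook's distance as one possible choice; the original does not say which measure was used, and the name infl is illustrative.

```r
library(MASS)
data(hills)
hills.clean <- hills[rownames(hills) != "Knock Hill", ]
hills.model.2 <- lm(time ~ dist + I(dist^2) + climb + I(climb^2) - 1,
                    data = hills.clean)

# Leverage and Cook's distance for every race
infl <- data.frame(leverage = hatvalues(hills.model.2),
                   cooks    = cooks.distance(hills.model.2),
                   row.names = rownames(hills.clean))

# The five races with the largest Cook's distance
print(head(infl[order(infl$cooks, decreasing = TRUE), ], 5))
```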