Week 7 - Massey University


Week 11
Regression Models and
Inference
Generalising from data

Data from 15 lakes in central Ontario.

Zinc concentrations in aquatic plant Eriocaulon
septangulare (mg per g dry weight) & zinc concentrations
in the lake sediment (mg per g dry weight).
Generalising from data



No interest in specific lakes
How are plant & sediment Zn related in general?
How accurately can you predict plant Zn from
sediment Zn?
Model for regression data

Sample ‘represents’ a larger ‘population’

Distinguish between regn lines for sample
and population

Sample regn line (least squares) is an
estimate of popn regn line.

How do you model randomness in sample?
Sample regression line (revision)
Least squares line
$\hat{y} = b_0 + b_1 x$
$b_0$: intercept, the predicted y when x = 0.
$b_1$: slope, the increase (or decrease) expected for y
when x increases by one unit.
$\hat{y}$: predicted y or estimated y.
$\hat{y}_i$: fitted value for the ith individual.
Height and handspan

Heights (inches) and Handspans (cm) of 167
college students.
Handspan = -3 + 0.35 Height
Handspan increases by 0.35 cm, on average, for each
increase of 1 inch in height.
Residuals (revision)
$e_i = y_i - \hat{y}_i$

Vertical distance from data point to LS line

Person 70 in tall with handspan 23 cm
$\hat{y} = -3 + 0.35(70) = 21.5$
residual $= y_i - \hat{y}_i = 23 - 21.5 = 1.5$ cm
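The handspan arithmetic above can be checked with a short Python sketch. The function names are illustrative only; the coefficients -3 and 0.35 and the person's measurements are the ones quoted on the slide.

```python
def fitted_handspan(height_in):
    """Fitted handspan (cm) from the slide's line: Handspan = -3 + 0.35 * Height."""
    return -3 + 0.35 * height_in


def residual(observed, fitted):
    """Residual = observed value minus fitted value."""
    return observed - fitted


y_hat = fitted_handspan(70)   # person 70 in tall
e = residual(23, y_hat)       # observed handspan is 23 cm
print(f"fitted = {y_hat:.1f} cm, residual = {e:.1f} cm")
```

A positive residual means the point lies above the least squares line.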

Model: population regn line
$E(Y) = \beta_0 + \beta_1 x$
$E(Y)$: mean or expected value of y for individuals in
the population who all have the same x.
$\beta_0$: intercept of line in the population.
$\beta_1$: slope of line in the population.
$\beta_1 = 0$ means no linear relationship.
$\beta_0$ and $\beta_1$ are estimated by the sample LS values $b_0$ and $b_1$.
Model: distribution of ‘errors’

Error = vertical distance of value from
population regn line
$\text{error} = Y - (\beta_0 + \beta_1 x)$

Assume errors all have normal(0, $\sigma$) distns
Constant standard deviation
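One way to get a feel for this error model is to simulate from it. The sketch below uses only the Python standard library. The parameter values and the x values are hypothetical, chosen for illustration, and are not taken from any data set in these slides.

```python
import random


def simulate_responses(xs, beta0, beta1, sigma, seed=0):
    """Draw Y = beta0 + beta1*x + error, with error ~ normal(0, sigma)."""
    rng = random.Random(seed)
    return [beta0 + beta1 * x + rng.gauss(0, sigma) for x in xs]


# Hypothetical population parameters (assumptions for illustration only).
beta0, beta1, sigma = -3.0, 0.35, 1.5
heights = [60 + (i % 16) for i in range(2000)]   # x values from 60 to 75
ys = simulate_responses(heights, beta0, beta1, sigma)

# Errors are the vertical distances from the population line.
errors = [y - (beta0 + beta1 * x) for x, y in zip(heights, ys)]
mean_error = sum(errors) / len(errors)
sd_error = (sum((e - mean_error) ** 2 for e in errors) / (len(errors) - 1)) ** 0.5
print(f"mean error = {mean_error:.3f}, s.d. of errors = {sd_error:.3f}")
```

With many simulated points, the mean error should be close to 0 and the s.d. of the errors close to the chosen sigma, whatever the value of x.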
Linear regression model
y = Mean + Error

Error is population equivalent of residual

Error is called “Deviation” in textbook
$Y = \beta_0 + \beta_1 x + \varepsilon$

Error distribution
$\varepsilon \sim \text{normal}(0, \sigma)$
Understanding parameters
Model assumptions

Linear relationship


No curvature
No outliers

Constant error standard deviation

Normal errors
Checking assumptions

Data should be in a symmetric band of
constant width round a straight line

Prices of Mazda cars in Melbourne paper
Transformations

Transformation of Y (or X) may help

Model regression line: $\log(\text{price}) = \beta_0 + \beta_1 \times \text{age}$
Parameter estimates


Least squares estimates, $b_0$ and $b_1$, are
estimates of $\beta_0$ and $\beta_1$
Best estimate of error s.d., $\sigma$, is
$s = \sqrt{\dfrac{\text{Sum of Squared Residuals}}{n - 2}} = \sqrt{\dfrac{SSE}{n - 2}} = \sqrt{\dfrac{\sum (y_i - \hat{y}_i)^2}{n - 2}}$
‘Typical’ size of residuals
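A minimal Python sketch of these estimates follows. It uses the standard least squares formulas for the slope and intercept (not written out on the slides) and then the formula for s. The five data points are hypothetical and the function names are illustrative.

```python
from math import sqrt


def least_squares(xs, ys):
    """Return (b0, b1), the least squares intercept and slope."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def residual_sd(xs, ys, b0, b1):
    """s = sqrt(SSE / (n - 2)), the estimate of the error s.d."""
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return sqrt(sse / (len(xs) - 2))


# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1 = least_squares(xs, ys)
s = residual_sd(xs, ys, b0, b1)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, s = {s:.3f}")
```

The divisor is n - 2 rather than n because two parameters (intercept and slope) have been estimated from the data.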
Minitab estimates
Data:
x = heights (in inches)
y = weight (pounds)
of n = 43 male students.
Standard deviation s = 24.00 (pounds):
Roughly measures, for any given height, the general size of the
deviations of individual weights from the mean weight for that height.
Interpreting s

About 95% of crosses in band ± 2s on each
side of least squares line.

s = 24, band ± 48
Inference about regn slope

Regn slope, $\beta_1$, is usually the most important
parameter

Expected increase in Y for unit increase in x

Point estimate is LS slope, $b_1$

How variable? What is std error of estimate?
$\text{s.e.}(b_1) = \dfrac{s}{\sqrt{\sum (x_i - \bar{x})^2}}$
where $s = \sqrt{\dfrac{SSE}{n - 2}}$
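The standard error formula can be sketched in Python as follows. The x values and the value of s are hypothetical, and `slope_standard_error` is an illustrative name.

```python
from math import sqrt


def slope_standard_error(xs, s):
    """s.e.(b1) = s / sqrt(sum of (x_i - x_bar)^2)."""
    x_bar = sum(xs) / len(xs)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return s / sqrt(sxx)


xs = [1, 2, 3, 4, 5]    # hypothetical x values
s = sqrt(0.8)           # hypothetical residual s.d.
se_b1 = slope_standard_error(xs, s)
print(f"s.e.(b1) = {se_b1:.4f}")
```

The formula shows that the slope is estimated more precisely when the residual s.d. is small or when the x values are widely spread.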
Inference: 95% C.I. for slope

Same pattern as earlier C.I.s
estimate ± t* × std. error

Value of t* :



approx 2 for large n
bigger for small n
use t-tables (n – 2) degrees of freedom
Example

Driver age and maximum legibility distance of
new highway sign
Average Distance = 577 – 3.01 × Age
95% C.I. from Minitab


Point estimate: reading distance decreases by 3.01
ft per year of age
n = 30 points
95% Confidence interval:
$t^*$ (28 d.f.) = 2.05
$b_1 \pm t^* \times \text{s.e.}(b_1) = -3.01 \pm 2.05 \times 0.4243$
$= -3.01 \pm 0.87$
$= -3.88$ to $-2.14$ ft
Interpretation
With 95% confidence, we estimate that …
in the population of drivers represented by this
sample, …
the mean sign-reading distance decreases between
3.88 and 2.14 ft …
per 1-year increase in age.
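The interval for the sign-reading example can be reproduced with a few lines of Python. The numbers are the ones quoted on the slides (slope -3.01, standard error 0.4243, t* = 2.05 for 28 d.f.); the function name is illustrative.

```python
def slope_confidence_interval(b1, se_b1, t_star):
    """Return (lower, upper) for b1 +/- t* x s.e.(b1)."""
    margin = t_star * se_b1
    return b1 - margin, b1 + margin


lower, upper = slope_confidence_interval(-3.01, 0.4243, 2.05)
print(f"95% C.I. for slope: {lower:.2f} to {upper:.2f} ft per year of age")
```

Because t* is passed in as an argument, the same function works for any sample size once the value for (n - 2) d.f. has been looked up in t tables.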
Importance of zero slope
$Y = \beta_0 + \beta_1 x + \varepsilon$

If slope is $\beta_1 = 0$,
then Y is normal with mean $\beta_0$ and st devn $\sigma$
Response distribution does not depend on x

It is therefore important to test whether $\beta_1 = 0$
Test for zero slope

Hypotheses:
H0: $\beta_1 = 0$
HA: $\beta_1 \neq 0$

Test statistic:
$t = \dfrac{\text{estimate} - \text{null value}}{\text{std error}} = \dfrac{b_1 - 0}{\text{s.e.}(b_1)}$

p-value:
tail area of t-distn (n - 2 d.f.)

Minitab: Age vs reading distance
$t = \dfrac{b_1 - 0}{\text{s.e.}(b_1)} = \dfrac{-3.0068 - 0}{0.4243} = -7.09$ and p-value = 0.000

Probability is virtually 0 that observed slope could be
as far from 0 or farther if there was no linear
relationship in population

Extremely strong evidence that distance and age are
related
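The test statistic for this example can be recomputed in Python from the slope and standard error quoted on the slides. Rather than computing the p-value, this sketch compares |t| with the t-table value 2.05 for 28 d.f. quoted earlier, which corresponds to a two-sided test at the 5% level. The function name is illustrative.

```python
def slope_t_statistic(b1, se_b1, null_value=0.0):
    """t = (estimate - null value) / std error."""
    return (b1 - null_value) / se_b1


t = slope_t_statistic(-3.0068, 0.4243)
t_star = 2.05                       # t tables, 28 d.f., 95%
significant = abs(t) > t_star       # two-sided test at the 5% level
print(f"t = {t:.2f}, significant at 5%: {significant}")
```

A value of |t| this far beyond t* is consistent with the very small p-value reported by Minitab.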
Testing zero correlation
H0: $\rho = 0$
(x and y are not correlated.)
HA: $\rho \neq 0$
(x and y are correlated.)
where $\rho$ = population correlation


Same test as for zero regression slope.
Can be performed even when a regression
relationship makes no sense.

e.g. leaf length & width
Significance and Importance

With very large n, weak relationships (low
correlation) can be statistically significant.
Moral: With a large sample size, saying two
variables are significantly related may only
mean the correlation is not precisely 0.
Look at a scatterplot of the data and examine
the correlation coefficient, r.
Prediction of new Y at x
If you knew the values of $\beta_0$, $\beta_1$ and $\sigma$
$\hat{y} = \beta_0 + \beta_1 x$

Prediction error
$\varepsilon \sim \text{normal}(0, \sigma)$
New value has s.d. $\sigma$

95% prediction interval
$\beta_0 + \beta_1 x \pm 1.96\sigma$
Prediction of new Y at x
In practice, you must use estimates
$\hat{y} = b_0 + b_1 x$

Prediction error has two components
New value still has s.d. $\sigma$, estimated by s
Also, prediction itself is random
$\text{s.d.}(\text{prediction}) = s \sqrt{\dfrac{1}{n} + \dfrac{(x - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$
Combining these,
$\text{s.d.}(\text{prediction error}) = \sqrt{s^2 + \text{s.d.}(\text{prediction})^2}$
Prediction of new Y at x

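Prediction interval
$\hat{y} \pm t^* \times \sqrt{s^2 + \text{s.d.}(\text{prediction})^2}$
where $\text{s.d.}(\text{prediction}) = s \sqrt{\dfrac{1}{n} + \dfrac{(x - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$

$t^*$ is from t tables with (n - 2) d.f.

Narrowest when x is near $\bar{x}$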
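The prediction interval formula can be sketched in Python as below. The data summary (x values, b0, b1, s) is hypothetical, t* = 3.182 is the t-table value for 3 d.f. at 95%, and the function name is illustrative.

```python
from math import sqrt


def prediction_interval(x_new, xs, b0, b1, s, t_star):
    """Interval for one new Y at x_new: y_hat +/- t* x s.d.(prediction error)."""
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    y_hat = b0 + b1 * x_new
    sd_prediction = s * sqrt(1 / n + (x_new - x_bar) ** 2 / sxx)
    sd_prediction_error = sqrt(s ** 2 + sd_prediction ** 2)
    margin = t_star * sd_prediction_error
    return y_hat - margin, y_hat + margin


# Hypothetical fitted model, for illustration only.
xs = [1, 2, 3, 4, 5]
b0, b1, s = 2.2, 0.6, sqrt(0.8)
t_star = 3.182          # t tables, n - 2 = 3 d.f., 95%

for x_new in (3, 5):
    low, high = prediction_interval(x_new, xs, b0, b1, s, t_star)
    print(f"x = {x_new}: {low:.2f} to {high:.2f} (width {high - low:.2f})")
```

Here the mean of the x values is 3, so the interval at x = 3 is narrower than the one at x = 5.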

Reading distance and age

Minitab output

95% confident that a 21-year-old will read sign
between 407 and 620 ft
Estimating mean Y at x

Different from estimating a new individual’s Y

Only takes into account variability in $\hat{y}$
$\text{s.d.}(\text{prediction}) = s \sqrt{\dfrac{1}{n} + \dfrac{(x - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$

95% CI for mean Y at x
$\hat{y} \pm t^* \times \text{s.d.}(\text{prediction})$
$t^*$ is from t tables with (n - 2) d.f.
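A small Python sketch of the confidence interval for the mean response is given below. The fitted model and data are hypothetical, t* = 3.182 is the t-table value for 3 d.f. at 95%, and the function name is illustrative.

```python
from math import sqrt


def mean_confidence_interval(x_new, xs, b0, b1, s, t_star):
    """Interval for the mean of Y at x_new: y_hat +/- t* x s.d.(prediction)."""
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    y_hat = b0 + b1 * x_new
    sd_prediction = s * sqrt(1 / n + (x_new - x_bar) ** 2 / sxx)
    margin = t_star * sd_prediction
    return y_hat - margin, y_hat + margin


# Hypothetical fitted model, for illustration only.
xs = [1, 2, 3, 4, 5]
b0, b1, s = 2.2, 0.6, sqrt(0.8)
ci_low, ci_high = mean_confidence_interval(3, xs, b0, b1, s, t_star=3.182)
print(f"95% CI for mean Y at x = 3: {ci_low:.2f} to {ci_high:.2f}")
```

This interval leaves out the s² term that a prediction interval for a single new value includes, so it is always the narrower of the two at the same x.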
Height and weight

95% CI: for the average of all college men of height x
95% PI: for one new college man of height x