Week 7 - Massey University
Week 11
Regression Models and Inference
Generalising from data
Data from 15 lakes in central Ontario.
Zinc concentrations in aquatic plant Eriocaulon
septangulare (mg per g dry weight) & zinc concentrations
in the lake sediment (mg per g dry weight).
Generalising from data
No interest in specific lakes
How are plant & sediment Zn related in general?
How accurately can you predict plant Zn from
sediment Zn?
Model for regression data
Sample ‘represents’ a larger ‘population’
Distinguish between regn lines for sample
and population
Sample regn line (least squares) is an
estimate of popn regn line.
How do you model randomness in sample?
Sample regression line (revision)
Least squares line
ŷ = b0 + b1x
b0 intercept — predicted y when x = 0.
b1 slope — increase (or decrease) expected for y
when x increases by one unit.
ŷ = predicted or estimated y.
ŷi = fitted value for the ith individual.
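As a rough illustration (not part of the lecture), a least squares line can be computed in a few lines of Python; the x and y arrays below are made-up values, not the lake or handspan data.

```python
import numpy as np

# Made-up illustrative data (not the lake or handspan data from the slides)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

# np.polyfit with degree 1 returns the least squares slope and intercept
b1, b0 = np.polyfit(x, y, 1)

# Fitted value for the ith individual: yhat_i = b0 + b1 * x_i
y_hat = b0 + b1 * x
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
```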
Height and handspan
Heights (inches) and Handspans (cm) of 167
college students.
Handspan = -3 + 0.35 Height
Handspan increases by 0.35 cm, on average, for each
increase of 1 inch in height.
Residuals (revision)
ei = yi − ŷi
Vertical distance from data point to LS line
Person 70 in tall with handspan 23 cm
ŷ = −3 + 0.35(70) = 21.5
residual = yi − ŷi = 23 − 21.5 = 1.5 cm
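A quick check of this arithmetic in Python, using the fitted handspan line from the slides:

```python
# Fitted line from the slides: handspan = -3 + 0.35 * height
b0, b1 = -3.0, 0.35

height, handspan = 70, 23      # one person: 70 in tall, handspan 23 cm
y_hat = b0 + b1 * height       # predicted handspan = 21.5 cm
residual = handspan - y_hat    # observed - predicted = 1.5 cm
print(y_hat, residual)
```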
Model: population regn line
E(Y) = β0 + β1x
E(Y) mean or expected value of y for individuals in
the population who all have the same x.
β0 = intercept of line in the population.
β1 = slope of line in the population.
β1 = 0 means no linear relationship.
β0 and β1 estimated by sample LS values b0 and b1.
Model: distribution of ‘errors’
Error = vertical distance of value from
population regn line
error = Y − (β0 + β1x)
Assume errors all have normal(0, σ) distns
Constant standard deviation
Linear regression model
y = Mean + Error
Error is population equivalent of residual
Error is called “Deviation” in textbook
Y = β0 + β1x + ε
Error distribution
ε ~ normal(0, σ)
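One way to see what the model says is to simulate from it. The sketch below is not from the lecture; the parameter values β0 = 2, β1 = 0.5, σ = 1 are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

beta0, beta1, sigma = 2.0, 0.5, 1.0    # arbitrary population parameters
x = np.linspace(0, 10, 50)

# Each Y is its mean (beta0 + beta1 * x) plus a normal(0, sigma) error
errors = rng.normal(loc=0.0, scale=sigma, size=x.size)
y = beta0 + beta1 * x + errors
```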
Understanding parameters
Model assumptions
Linear relationship
No curvature
No outliers
Constant error standard deviation
Normal errors
Checking assumptions
Data should be in a symmetric band of
constant width round a straight line
Prices of Mazda cars in Melbourne paper
Transformations
Transformation of Y (or X) may help
Model regression line: logprice = β0 + β1 age
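A sketch of the transformation step in Python; the age and price values below are hypothetical, not the Melbourne Mazda data.

```python
import numpy as np

# Hypothetical used-car data: age (years) and price (dollars)
age = np.array([1, 2, 4, 5, 7, 9, 11, 13])
price = np.array([19000, 17500, 13000, 11500, 8200, 6100, 4300, 3200])

logprice = np.log(price)                 # transform the response
b1, b0 = np.polyfit(age, logprice, 1)    # fit logprice = b0 + b1 * age
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
```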
Parameter estimates
Least squares estimates, b0 and b1, are estimates of β0 and β1.
Best estimate of error s.d., σ, is
s = √( Sum of Squared Residuals / (n − 2) ) = √( SSE / (n − 2) ) = √( Σ(yi − ŷi)² / (n − 2) )
‘Typical’ size of residuals
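A sketch of this calculation of s, again using the made-up data from the earlier least squares sketch:

```python
import numpy as np

# Made-up illustrative data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])
n = len(y)

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

SSE = np.sum((y - y_hat) ** 2)     # sum of squared residuals
s = np.sqrt(SSE / (n - 2))         # estimate of the error s.d. sigma
print(f"s = {s:.3f}")
```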
Minitab estimates
Data:
x = heights (in inches)
y = weight (pounds)
of n = 43 male students.
Standard deviation s = 24.00 (pounds):
Roughly measures, for any given height, the general size of the deviations of individual weights from the mean weight for that height.
Interpreting s
About 95% of the data points (crosses) lie in a band ± 2s on each
side of the least squares line.
s = 24, so the band is ± 48 (pounds)
Inference about regn slope
Regn slope, b1, is usually most important
parameter
Expected increase in Y for unit increase in x
Point estimate is LS slope, b1
How variable? What is std error of estimate?
s.e.(b1) = s / √( Σ(xi − x̄)² ), where s = √( SSE / (n − 2) )
Inference: 95% C.I. for slope
Same pattern as earlier C.I.s
estimate ± t* × std. error
Value of t* :
approx 2 for large n
bigger for small n
use t-tables (n – 2) degrees of freedom
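A sketch combining the standard error formula and the C.I. pattern, with t* taken from scipy rather than tables; the data are made up for illustration.

```python
import numpy as np
from scipy import stats

# Made-up illustrative data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])
n = len(y)

b1, b0 = np.polyfit(x, y, 1)
s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))

# s.e.(b1) = s / sqrt( sum of (x_i - xbar)^2 )
se_b1 = s / np.sqrt(np.sum((x - x.mean()) ** 2))

# estimate +/- t* x std. error, with t* from the t-distribution on n - 2 d.f.
t_star = stats.t.ppf(0.975, df=n - 2)
print(f"b1 = {b1:.3f}, s.e. = {se_b1:.3f}, "
      f"95% CI = ({b1 - t_star * se_b1:.3f}, {b1 + t_star * se_b1:.3f})")
```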
Example
Driver age and maximum legibility distance of
new highway sign
Average Distance = 577 – 3.01 × Age
95% C.I. from Minitab
Point estimate: reading distance decreases by 3.01
ft per year of age
n = 30 points
95% Confidence interval:
t* (28 d.f.) = 2.05
b1 ± t* × s.e.(b1) = −3.01 ± 2.05 × 0.4243
= −3.01 ± 0.87
= −3.88 to −2.14 ft
Interpretation
With 95% confidence, we estimate that …
in the population of drivers represented by this
sample, …
the mean sign-reading distance decreases between
3.88 and 2.14 ft …
per 1-year increase in age.
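The interval can be reproduced from the numbers reported on the slides:

```python
from scipy import stats

b1, se_b1, n = -3.01, 0.4243, 30           # values reported on the slides
t_star = stats.t.ppf(0.975, df=n - 2)      # about 2.05 for 28 d.f.
lower, upper = b1 - t_star * se_b1, b1 + t_star * se_b1
print(round(lower, 2), round(upper, 2))    # roughly -3.88 and -2.14
```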
Importance of zero slope
Y = β0 + β1x + ε
If slope is β1 = 0,
Y is normal with mean β0 and st devn σ
Response distribution does not depend on x
It is therefore important to test whether b1 = 0
Test for zero slope
Hypotheses:
H0: β1 = 0
HA: β1 ≠ 0
Test statistic:
t = (estimate − null value) / std error = (b1 − 0) / s.e.(b1)
p-value:
tail area of t-distn (n – 2 d.f.)
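A sketch of the test on made-up data; scipy's linregress reports the same slope, standard error and two-sided p-value, so it can serve as a cross-check.

```python
import numpy as np
from scipy import stats

# Made-up illustrative data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])
n = len(y)

b1, b0 = np.polyfit(x, y, 1)
s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
se_b1 = s / np.sqrt(np.sum((x - x.mean()) ** 2))

t_stat = (b1 - 0) / se_b1                         # (estimate - null value) / std error
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # two-sided tail area, n - 2 d.f.
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# Cross-check: same slope, s.e. and two-sided p-value
print(stats.linregress(x, y))
```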
Minitab: Age vs reading distance
t = (b1 − 0) / s.e.(b1) = (−3.0068 − 0) / 0.4243 = −7.09, and p-value ≈ 0.000
Probability is virtually 0 that the observed slope could be
as far from 0 as this, or farther, if there were no linear
relationship in the population
Extremely strong evidence that distance and age are
related
Testing zero correlation
H0: ρ = 0
(x and y are not correlated.)
HA: ρ ≠ 0
(x and y are correlated.)
where ρ = population correlation
Same test as for zero regression slope.
Can be performed even when a regression
relationship makes no sense.
e.g. leaf length & width
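A sketch of the correlation version of the test; on the same made-up data, the p-value agrees with the zero-slope test above.

```python
import numpy as np
from scipy import stats

# Made-up illustrative data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

# Tests H0: rho = 0 against HA: rho != 0; p-value matches the slope test
r, p = stats.pearsonr(x, y)
print(f"r = {r:.3f}, p = {p:.4f}")
```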
Significance and Importance
With very large n, weak relationships (low
correlation) can be statistically significant.
Moral: With a large sample size, saying two
variables are significantly related may only
mean the correlation is not precisely 0.
Look at a scatterplot of the data and examine
the correlation coefficient, r.
Prediction of new Y at x
If you knew values of β0, β1 and σ:
ŷ = β0 + β1x
Prediction error ε ~ normal(0, σ)
New value has s.d. σ
95% prediction interval: (β0 + β1x) ± 1.96σ
Prediction of new Y at x
In practice, you must use estimates:
ŷ = b0 + b1x
Prediction error has two components
New value still has s.d. σ, estimated by s
Also, prediction itself is random:
s.d.(prediction) = s √( 1/n + (x − x̄)² / Σ(xi − x̄)² )
Combining these,
s.d.(prediction error) = √( s² + s.d.(prediction)² )
Prediction of new Y at x
Prediction interval:
ŷ ± t* × √( s² + s.d.(prediction)² )
where s.d.(prediction) = s √( 1/n + (x − x̄)² / Σ(xi − x̄)² )
t* is from t tables, (n − 2) d.f.
Narrowest when x is near x̄
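A sketch of the prediction interval formula on made-up data; x_new is an arbitrary value at which a new Y is predicted.

```python
import numpy as np
from scipy import stats

# Made-up illustrative data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])
n = len(y)

b1, b0 = np.polyfit(x, y, 1)
s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))

x_new = 3.5                       # arbitrary x at which to predict a new Y
y_hat = b0 + b1 * x_new

# s.d.(prediction): uncertainty of the fitted line at x_new
sd_pred = s * np.sqrt(1 / n + (x_new - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2))

# 95% prediction interval: yhat +/- t* sqrt(s^2 + sd_pred^2)
t_star = stats.t.ppf(0.975, df=n - 2)
half_width = t_star * np.sqrt(s ** 2 + sd_pred ** 2)
print(f"95% PI: ({y_hat - half_width:.2f}, {y_hat + half_width:.2f})")
```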
Reading distance and age
Minitab output
95% confident that a 21-year-old will read sign
between 407 and 620 ft
Estimating mean Y at x
Different from estimating a new individual’s Y
Only takes into account the variability of the estimated mean, ŷ
s.d.(prediction) = s √( 1/n + (x − x̄)² / Σ(xi − x̄)² )
95% CI for mean Y at x:
ŷ ± t* × s.d.(prediction)
t* is from t tables (n – 2) d.f.
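For comparison with the prediction-interval sketch above, dropping the s² term gives the narrower interval for the mean Y at x (same made-up data):

```python
import numpy as np
from scipy import stats

# Made-up illustrative data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])
n = len(y)

b1, b0 = np.polyfit(x, y, 1)
s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))

x_new = 3.5
y_hat = b0 + b1 * x_new
sd_pred = s * np.sqrt(1 / n + (x_new - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2))

# 95% CI for the mean Y at x_new: no extra s^2 term, so narrower than the PI
t_star = stats.t.ppf(0.975, df=n - 2)
print(f"95% CI: ({y_hat - t_star * sd_pred:.2f}, {y_hat + t_star * sd_pred:.2f})")
```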
Height and weight
95% CI: for the average of all college men of height x
95% PI: for one new college man of height x