Stat 112 Notes 3

Stat 112 Notes 4
• Today:
– Review of p-values for one-sided tests
– Chapter 3.4.2 (Assessing the fit of the
regression line).
– Chapter 3.5.2 (Prediction Intervals)
– Chapter 3.7 (Some Cautions in Interpreting
Regression Results)
• Homework 1 due on Thursday
• For Thursday’s office hours, for this week only, I
will hold them from 1-2 instead of after class (I
have my usual office hours today after class).
p-values for one-sided tests example:
Poverty Rates and Doctors
[Figure: Bivariate Fit of MDs per 100,000 By Poverty Percent. Scatterplot of MDs per 100,000 (vertical axis, 150 to 450) against Poverty Percent (horizontal axis, 7.5 to 22.5).]
Parameter Estimates
Term             Estimate    Std Error  t Ratio  Prob>|t|
Intercept        286.84208   33.14046   8.66     <.0001
Poverty Percent  -4.329299   2.669525   -1.62    0.1114
Example: One Sided Test
Do there tend to be fewer doctors in states with higher poverty rates?
Let Y =MDs per 100,000
X =Poverty Percent
Simple Linear Regression Model:
$E(Y \mid X) = \beta_0 + \beta_1 X$
$H_0: \beta_1 = 0$
$H_a: \beta_1 < 0$
Because the t-ratio is negative and is on the same side as the alternative, the
p-value is (Prob>|t|)/2 = 0.1114/2 = 0.0557.
Suggestive but inconclusive evidence that there tend to be fewer
doctors in states with higher poverty rates.
Example Continued: One and Two
Sided Tests
Do there tend to be more doctors in states with higher poverty rates?
$H_0: \beta_1 = 0$
$H_a: \beta_1 > 0$
Because the t-ratio is negative and on the opposite side of the alternative,
the p-value is 1-(Prob>|t|)/2=1-0.1114/2=.9443
Is poverty rate associated with the number of doctors in a state?
$H_0: \beta_1 = 0$, $H_a: \beta_1 \neq 0$
Because the alternative is two-sided, the p-value is Prob>|t| = 0.1114.
There is not strong evidence that poverty rate is associated
with the number of doctors in a state.
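The halving rule used in these examples can be written as a small helper. This is only a sketch, not JMP's own procedure: the function name `one_sided_p_value` and its arguments are made up for illustration, and it assumes the reported Prob>|t| comes from a symmetric t distribution.

```python
def one_sided_p_value(two_sided_p, t_ratio, alternative):
    """Convert a two-sided p-value (Prob>|t|) into a one-sided p-value.

    alternative is "less" for H_a: beta_1 < 0 or "greater" for H_a: beta_1 > 0.
    """
    if alternative not in ("less", "greater"):
        raise ValueError("alternative must be 'less' or 'greater'")
    # The t-ratio supports the alternative when its sign matches the direction of H_a.
    same_side = (t_ratio < 0) if alternative == "less" else (t_ratio > 0)
    half = two_sided_p / 2
    return half if same_side else 1 - half


# Values from the Parameter Estimates table, Poverty Percent row.
p_fewer = one_sided_p_value(0.1114, -1.62, "less")     # H_a: beta_1 < 0
p_more = one_sided_p_value(0.1114, -1.62, "greater")   # H_a: beta_1 > 0
print(round(p_fewer, 4), round(p_more, 4))
```

The two one-sided p-values always add up to 1, which is a quick check that the right side was chosen.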
Teachers’ Salaries and Dating
• In U.S. culture, it is usually considered
impolite to ask how much money a person
makes.
• However, suppose that you are single and
are interested in dating a particular person.
• Of course, salary isn’t the most important
factor when considering whom to date but it
certainly is nice to know (especially if it is
high!)
• In this case, the person you are interested in
happens to be a high school teacher, so you
know a high salary isn’t an issue.
• Still you would like to know how much she or
he makes, so you take an informal survey of
11 high school teachers that you know.
Distributions: Salary
[Figure: distribution plot of Salary, axis from about 35000 to 60000.]
Moments
Mean             50881.818
Std Dev          6491.1968
Std Err Mean     1957.1695
upper 95% Mean   55242.664
lower 95% Mean   46520.973
N                11
Based on this data, what can you conclude?
Absent any other information, best guess for teacher’s salary is the
mean salary, $50,882.
But it is likely that this estimate will not be correct.
To get an idea of how far off you might be, you can calculate the
standard deviation:
$s = \sqrt{\frac{\sum_{i=1}^{11} (y_i - \bar{y})^2}{n-1}} \approx \sqrt{\frac{421356359}{10}} \approx 6491.20$
The standard deviation is the “typical” amount by which an
observation deviates from the mean.
Thus, your best estimate for your potential date’s salary is $50,882
but a typical estimate will be off by about $6,500.
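As a rough sketch, the sample standard deviation formula can be coded directly. The function name `sample_sd` and the short list of numbers are made up for illustration (the 11 surveyed salaries are not listed in these notes). The last lines only cross-check the Moments output, where Std Err Mean should equal Std Dev divided by the square root of N.

```python
import math

def sample_sd(values):
    """Sample standard deviation: sqrt(sum((y_i - y_bar)^2) / (n - 1))."""
    n = len(values)
    y_bar = sum(values) / n
    sum_sq = sum((y - y_bar) ** 2 for y in values)
    return math.sqrt(sum_sq / (n - 1))

# Hypothetical toy data, not the teacher survey.
toy = [2, 4, 4, 4, 5, 5, 7, 9]
toy_sd = sample_sd(toy)

# Cross-check of the Moments output: Std Err Mean = Std Dev / sqrt(N).
std_dev, n_teachers = 6491.1968, 11
std_err_mean = std_dev / math.sqrt(n_teachers)
print(round(toy_sd, 3), round(std_err_mean, 2))
```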
• You happen to know that the person you are
interested in has been teaching for 8 years.
• How can you use this information to better
predict your potential date’s salary?
• Regression Analysis to the Rescue!
• You go back to each of the original 11
teachers you surveyed and ask them for their
years of experience.
• Simple Linear Regression Model: $E(Y \mid X) = \beta_0 + \beta_1 X$; the distribution of Y given X is normal with mean $\beta_0 + \beta_1 X$ and standard deviation $\sigma_e$.
[Figure: Bivariate Fit of Salary By Years of Experience. Scatterplot of Salary (vertical axis, 35000 to 65000) against Years of Experience (horizontal axis, 0 to 12.5), shown first without and then with the fitted line.]
Linear Fit
Salary = 40612.135 + 1686.0674 Years of Experience
Summary of Fit
RSquare                      0.545881
RSquare Adj                  0.495423
Root Mean Square Error       4610.93
Mean of Response             50881.82
Observations (or Sum Wgts)   11
• Predicted salary of your potential date who has been
a teacher for 8 years = Estimated Mean salary for
teachers of 8 years = 40612.135+1686.0674*8 =
$54,100
• How far off will your estimate typically be? Root
mean square error = Estimated standard deviation of
Y|X = $4,610.93.
• Notice that the typical error of your estimate of
teacher salary using experience, $4,610.93, is less
than that of using only information on mean teacher
salary, $6,491.20.
• Regression analysis enables you to better predict
your potential date’s salary.
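The prediction and the comparison of typical errors can be reproduced in a few lines. This is a sketch using the coefficients printed in the JMP output; the names `predict_salary` and `reduction` are illustrative only.

```python
# Coefficients and error summaries copied from the JMP output in these notes.
INTERCEPT = 40612.135
SLOPE = 1686.0674
RMSE = 4610.93          # typical error when using years of experience
SD_SALARY = 6491.1968   # typical error when using only the mean salary

def predict_salary(years_of_experience):
    """Estimated mean salary for teachers with the given years of experience."""
    return INTERCEPT + SLOPE * years_of_experience

predicted = predict_salary(8)
reduction = 1 - RMSE / SD_SALARY  # fractional drop in typical prediction error
print(round(predicted, 2), round(reduction, 3))
```

On these numbers the typical error shrinks by roughly 29% once years of experience are used.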
R Squared
• How much better predictions of your potential date’s
salary does the simple linear regression model
provide than just using the mean teacher’s salary?
• This is the question that R squared addresses.
• R squared: Number between 0 and 1 that measures
how much of the variability in the response the
regression model explains.
• R squared close to 0 means that using regression for
predicting Y|X isn’t much better than using the mean of Y; R
squared close to 1 means that regression is much
better than the mean of Y for predicting Y|X.
R Squared Formula
• $R^2 = \frac{\text{Total sum of squares} - \text{Residual sum of squares}}{\text{Total sum of squares}}$
• Total sum of squares = $\sum_{i=1}^{n} (Y_i - \bar{Y})^2$ = the sum of squared prediction errors for using the sample mean of Y to predict Y
• Residual sum of squares = $\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$, where $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$ is the prediction of $Y_i$ from the least squares line.
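A minimal sketch of the R squared formula follows. The function names are made up for illustration. Because the 11 salaries themselves are not listed in these notes, the second helper rebuilds the two sums of squares from summary statistics, using total sum of squares = (n − 1) × (Std Dev of Y)² and residual sum of squares = (n − 2) × RMSE².

```python
def r_squared(y, y_hat):
    """R^2 = (TSS - RSS) / TSS for observed values y and fitted values y_hat."""
    y_bar = sum(y) / len(y)
    tss = sum((yi - y_bar) ** 2 for yi in y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return (tss - rss) / tss

def r_squared_from_summary(n, sd_y, rmse):
    """Same quantity, rebuilt from n, the Std Dev of Y, and the RMSE."""
    tss = (n - 1) * sd_y ** 2
    rss = (n - 2) * rmse ** 2
    return (tss - rss) / tss

# Teacher salary example: N = 11, Std Dev = 6491.1968, RMSE = 4610.93.
r2_teachers = r_squared_from_summary(11, 6491.1968, 4610.93)
print(round(r2_teachers, 4))
```

The result agrees with the RSquare value of 0.545881 in the Summary of Fit up to rounding of the inputs.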
What’s a good R squared?
• A good $R^2$ depends on the context. In precise
laboratory work, $R^2$ values under 90% might be
too low, but in social science contexts, where a
single variable rarely explains a great deal of the
variation in the response, $R^2$ values of 50% may be
considered remarkably good.
• The best measure of whether the regression
model is providing predictions of Y|X that are
accurate enough to be useful is the root mean
square error, which tells us the typical error in
using the regression to predict Y from X.
Connection between Correlation
and R Squared
The correlation r between two variables X and Y is
a measure of the direction and strength of the linear
association between X and Y .
The correlation ranges between -1 and 1, with a
correlation near -1 indicating a strong negative linear
association between X and Y, a correlation near 0
indicating little linear association between X and Y, and a
correlation near 1 indicating a strong positive linear
association between X and Y.
The $R^2$ from the regression of Y on X is the square
of the correlation between X and Y.
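Going the other way, the correlation can be recovered from R squared as long as the sign of the fitted slope is known, since in simple linear regression r has the same sign as the slope. A small sketch (the function name is illustrative):

```python
import math

def correlation_from_r_squared(r_squared, slope):
    """Recover r from R^2 in simple linear regression; r takes the sign of the slope."""
    return math.copysign(math.sqrt(r_squared), slope)

# Teacher salary example: RSquare = 0.545881, slope = 1686.0674 (positive).
r_teachers = correlation_from_r_squared(0.545881, 1686.0674)
print(round(r_teachers, 3))
```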
More Information About Your
Potential Date’s Salary:
Prediction Intervals
• From the regression model, you predict that your
potential date’s salary is $54,100 and the typical error
you expect to make in your prediction is $4,611.
• Suppose you want an interval that will contain your date’s
salary most of the time (say 95% of the time).
• We can find such a prediction interval by using the
fact that under the simple linear regression model,
the distribution of Y|X is normal. Here, the
subpopulation of teachers with 8 years of experience
has a normal distribution with estimated mean
$54,100 and estimated standard deviation $4,611.
Prediction Interval
• A 95% prediction interval has the property that if we repeatedly take samples $y_1, \ldots, y_n$ from a population with the simple regression model, where $x_1, \ldots, x_n$ are fixed at their current values, and then sample $y_p$ with $x_p$, the prediction interval will contain $y_p$ 95% of the time.
• Best prediction of $Y_p$: $\hat{Y}_p = \hat{E}(Y \mid X = X_p) = b_0 + b_1 X_p$
$s_p = \text{RMSE}\sqrt{1 + \frac{1}{n} + \frac{(X_p - \bar{X})^2}{(n-1)s_X^2}}$, where $s_X^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$.
95% Prediction Interval: $\hat{Y}_p \pm t_{.025,n-2}\, s_p$
Comment: For large n, the 95% prediction interval is approximately $\hat{Y}_p \pm 2 \times \text{RMSE}$.
Prediction Interval for Your Date’s
Salary
• Suppose your date has 8 years of experience. $\hat{Y}_p = 40612.14 + 1686.07 \times 8 = 54100.7$
$s_p = \text{RMSE}\sqrt{1 + \frac{1}{n} + \frac{(X_p - \bar{X})^2}{(n-1)s_X^2}} = 4610.93\sqrt{1 + \frac{1}{11} + \frac{(8 - 6.09)^2}{10 \times 2.844^2}} = 4610.93\sqrt{1.1360} = 4914.51$
95% Prediction Interval:
$\hat{Y}_p \pm t_{.025,n-2}\, s_p = 54100.7 \pm 2.262 \times 4914.51 = (42984.08, 65217.32)$
Your date’s salary will be in the range
(42984.08, 65217.32) most of the time.
We obtain $\bar{X}$ and $s_X$ (and hence $s_X^2$) from Analyze, Distribution on the X variable.
Distributions: Years of Experience
[Figure: distribution plot of Years of Experience, axis from 0 to 12.5.]
Moments
Mean             6.0909091
Std Dev          2.8444523
Std Err Mean     0.8576346
upper 95% Mean   8.0018382
lower 95% Mean   4.17998
N                11
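The prediction interval calculation can be sketched as a short function. The name `prediction_interval` and its argument names are made up for illustration, and the critical value 2.262 ($t_{.025,9}$) is taken from these notes rather than computed. The inputs are the rounded values used in the worked calculation (mean of X 6.09, Std Dev of X 2.844). With the square root in the $s_p$ formula applied, the sketch gives $s_p$ of about 4914.5 and an interval of about (42984, 65217).

```python
import math

def prediction_interval(x_p, b0, b1, rmse, n, x_bar, s_x, t_crit):
    """Return (y_hat, s_p, (lower, upper)) for a new observation at x_p."""
    y_hat = b0 + b1 * x_p
    # s_p = RMSE * sqrt(1 + 1/n + (x_p - x_bar)^2 / ((n - 1) * s_x^2))
    s_p = rmse * math.sqrt(1 + 1 / n + (x_p - x_bar) ** 2 / ((n - 1) * s_x ** 2))
    half_width = t_crit * s_p
    return y_hat, s_p, (y_hat - half_width, y_hat + half_width)

# Teacher example with the rounded inputs used in the notes.
y_hat, s_p, (lower, upper) = prediction_interval(
    x_p=8, b0=40612.14, b1=1686.07, rmse=4610.93,
    n=11, x_bar=6.09, s_x=2.844, t_crit=2.262,
)
print(round(y_hat, 1), round(s_p, 2), round(lower, 2), round(upper, 2))
```

Note that $s_p$ is only slightly larger than the RMSE here, because 8 years is close to the mean of the observed X values; the extra term grows as $x_p$ moves away from that mean.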
Prediction Intervals in JMP
• After using Fit Line, click the red triangle next to
Linear Fit and click Confid Curves Indiv.
[Figure: Salary (vertical axis, 35000 to 65000) against Years of Experience (horizontal axis, 0 to 12.5) with the fitted line and individual confidence curves.]
• Use the crosshair tool (under Tools) to find the
exact prediction interval for a particular x value.
Approximate Prediction Intervals
The 95% prediction interval is approximately $\hat{Y}_p \pm 2 \times \text{RMSE}$.
Under the simple linear regression model:
• Approximately 68% of the $Y_i$’s will be within one RMSE of their predicted value $\hat{E}(Y \mid X = X_i) = b_0 + b_1 X_i$.
• Approximately 95% of the $Y_i$’s will be within two RMSEs of their predicted value $\hat{E}(Y \mid X = X_i) = b_0 + b_1 X_i$.
• Approximately 99% of the $Y_i$’s will be within three RMSEs of their predicted value $\hat{E}(Y \mid X = X_i) = b_0 + b_1 X_i$.
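The 68%, 95% and 99% figures are the usual normal-distribution coverages for one, two and three standard deviations. The sketch below checks them with the error function and applies the two-RMSE shortcut to the teacher example; the function names are illustrative only.

```python
import math

def normal_coverage(k):
    """P(|Z| < k) for a standard normal Z: the share within k standard deviations."""
    return math.erf(k / math.sqrt(2))

def approx_prediction_interval(y_hat, rmse, k=2):
    """Rough interval y_hat +/- k * RMSE (ignores uncertainty in the fitted line)."""
    return (y_hat - k * rmse, y_hat + k * rmse)

coverages = {k: normal_coverage(k) for k in (1, 2, 3)}
approx_lo, approx_hi = approx_prediction_interval(54100.7, 4610.93)
print({k: round(v, 4) for k, v in coverages.items()})
print(round(approx_lo, 2), round(approx_hi, 2))
```

With only 11 observations the shortcut interval is noticeably narrower than the one based on $t_{.025,n-2}\, s_p$, which is why the shortcut is described as a large-n approximation.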
Forecasting Outside the Range of the
Explanatory Variable (Extrapolation)
• When constructing estimates of $E(Y \mid X_p)$ or predicting individual values of Y based on $x_p$, caution must be used if $x_p$ is outside the range of the observed x’s. The data do not provide information about whether the simple linear regression model continues to hold outside of the range of the observed x’s.
• Prediction intervals only account for (1) variability in Y given X; (2) uncertainty in the estimates of the slope and intercept given that the simple linear regression model is true. When $x_p$ is outside the range of the observed x’s, the prediction interval might not be accurate.
Olympic Long Jump: Length of
gold medal jump (Y) vs. Year
(X)
[Figure: Bivariate Fit of Length By Year. Scatterplot of Length (vertical axis, 20 to 30) against Year (horizontal axis, 1880 to 2020) with the fitted line.]
Linear Fit
Length = -72.49157 + 0.0504846*Year
Predictions from Long Jump
Simple Linear Regression
Model
• Predicted Olympic gold medal winning long jumps:
– 2012 (London): -72.49157 + 0.0504846*2012 = 29.08 feet
– 2032: -72.49157 + 0.0504846*2032 = 30.09 feet
– 3000: -72.49157 + 0.0504846*3000 = 78.96 feet
– 95% Prediction Interval for Year 3000: $\hat{Y}_p \pm t_{.025,n-2}\, s_p = 78.96 \pm 2.064 \times \text{RMSE}\sqrt{1 + \frac{1}{26} + \frac{(3000 - \bar{X})^2}{(n-1)s_X^2}} = (67.14, 90.78)$
Prediction interval is not reasonable! Predicting the winning
distance for the year 3000 is an extrapolation.
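A short sketch of these predictions, with a simple guard that flags extrapolation. The function names are made up, and the observed range of 1896 to 2008 is an assumption (it is consistent with n = 26 Summer Games but is not stated in the notes).

```python
INTERCEPT = -72.49157
SLOPE = 0.0504846

def predict_length(year):
    """Predicted gold medal long jump (feet) from the fitted line."""
    return INTERCEPT + SLOPE * year

def is_extrapolation(year, first_year=1896, last_year=2008):
    """True when year lies outside the (assumed) range of observed years."""
    return year < first_year or year > last_year

for year in (2012, 2032, 3000):
    print(year, round(predict_length(year), 2), is_extrapolation(year))
```

The line itself produces a number for any year; nothing in the arithmetic signals that a 79-foot jump is implausible, so the range check has to come from outside the model.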
Association vs. Causality
• A high $R^2$ in a simple linear regression of Y on X means that X has a strong linear relationship with Y; in other words, changes in X are strongly associated with changes in the mean of Y. It does not imply that changes in X cause changes in Y.
• Alternative explanations for high $R^2$:
– Reverse is true. Y causes X.
– There may be a lurking (confounding) variable related to both X and Y which is the common cause of X and Y.
Salary of Presbyterian Ministers in
[Figure: Bivariate Fit of Salary of Presbyterian Ministers in MA By Price of Rum. Scatterplot of salary (vertical axis, 0 to 50000) against Price of Rum (horizontal axis, 0 to 12.5), with points labeled by year: 1886, 1926, 1954, 1982, 1998.]
Are the Presbyterian ministers benefiting from the rum trade or
supporting it?
Neither – the lurking variable of inflation over time is the common
cause of increases in Presbyterian ministers’ salaries and the price
of rum.
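The effect of a lurking variable can be illustrated with made-up numbers. Everything in this sketch is hypothetical (a 3% yearly price level, and "real" salary and rum price series constructed to be uncorrelated with each other); it is not the ministers' data. Multiplying both series by the same price level is enough to produce a correlation close to 1.

```python
import math

def correlation(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical price level growing 3% per year for 100 years (not real data).
price_level = [1.03 ** t for t in range(100)]
# Made-up inflation-adjusted series that wiggle independently of each other.
real_salary = [500 * (1.05 if t % 2 == 0 else 0.95) for t in range(100)]
real_rum = [0.2 * (1.05 if (t // 2) % 2 == 0 else 0.95) for t in range(100)]
# Nominal values: both are scaled by the same price level (the lurking variable).
salary = [s * p for s, p in zip(real_salary, price_level)]
rum = [r * p for r, p in zip(real_rum, price_level)]

r_nominal = correlation(salary, rum)         # driven by the shared price level
r_real = correlation(real_salary, real_rum)  # essentially zero by construction
print(round(r_nominal, 3), round(abs(r_real), 3))
```

Once the common cause is divided out, the association disappears, which is the pattern the lurking-variable explanation predicts.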
Review
• R squared measures how much better the regression model
predicts Y than just using the mean of Y.
• 95% prediction interval: interval that contains new
observation’s Y given the new observation’s X with 95%
probability.
• Approximately 95% of observations $Y_i$ are within 2 RMSEs of
their predicted value $\hat{E}(Y \mid X = X_i)$ given their X.
• Cautions in Interpreting Regression:
– Prediction intervals for X values outside the range of the
observed X values may not be accurate.
– Regression measures the association between X and the
mean of Y and does not necessarily measure the causal
effect of X on Y.
• Next Class
– Sections 3.5.2, 3.6