Class 10: Tuesday, Oct. 12 - Wharton Statistics Department

Class 10: Tuesday, Oct. 12
• Hurricane data set, review of confidence
intervals and hypothesis tests
• Confidence intervals for mean response
• Prediction intervals
• Transformations
• Upcoming:
– Thursday: Finish transformations, Example
Regression Analysis
– Tuesday: Review for midterm
– Thursday: Midterm
– Fall Break!
Hurricane Data
• Is there a trend in the number of
hurricanes in the Atlantic over time
(possibly an increase because of global
warming)?
• hurricane.JMP contains data on the
number of hurricanes in the Atlantic basin
from 1950-1997.
Bivariate Fit of Hurricanes By Year
[Figure: scatterplot of Hurricanes vs. Year (1950-2000) with fitted line; residual plot vs. Year; histogram and normal quantile plot of the residuals]

Summary of Fit
RSquare                       0.047995
RSquare Adj                   0.0273
Root Mean Square Error        2.33302
Mean of Response              5.75
Observations (or Sum Wgts)    48

Parameter Estimates
Term        Estimate   Std Error  t Ratio  Prob>|t|  Lower 95%  Upper 95%
Intercept   78.80292   47.97217   1.64     0.1073    -17.76005  175.36589
Year        -0.037017  0.024308   -1.52    0.1346    -0.085946  0.0119117
Inferences for Hurricane Data
• Residual plots and normal quantile plots indicate
that assumptions of linearity, constant variance
and normality in simple linear regression model
are reasonable.
• 95% confidence interval for slope (change in
mean hurricanes between year t and year t+1):
(-0.086,0.012)
• Hypothesis test of the null hypothesis that the slope equals zero: test statistic = -1.52, p-value = 0.13. We accept H0: β1 = 0 since p-value > 0.05. No evidence of a trend in hurricanes from 1950-1997.
• Scale for interpreting p-values:
  p-value    Evidence
  <.01       very strong evidence against H0
  .01-.05    strong evidence against H0
  .05-.10    weak evidence against H0
  >.1        little or no evidence against H0
• A large p-value is not strong evidence in favor of H0; it only shows that there is not strong evidence against H0.
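The slope test and confidence interval above can be reproduced in a few lines. This is a hypothetical sketch: synthetic counts stand in for hurricane.JMP (in practice a CSV export of the JMP table would be read in), so the numbers below will not match the slides' output.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for hurricane.JMP: 48 yearly hurricane counts,
# 1950-1997, drawn with no true trend (an assumption for illustration).
rng = np.random.default_rng(0)
year = np.arange(1950, 1998)
hurricanes = rng.poisson(5.75, size=year.size)

res = stats.linregress(year, hurricanes)  # least squares fit

# 95% CI for the slope: estimate +/- t_{.975, n-2} * SE(slope)
t_crit = stats.t.ppf(0.975, df=year.size - 2)
ci = (res.slope - t_crit * res.stderr, res.slope + t_crit * res.stderr)

print(f"slope = {res.slope:.4f}, 95% CI = ({ci[0]:.4f}, {ci[1]:.4f})")
print(f"t = {res.slope / res.stderr:.2f}, two-sided p = {res.pvalue:.4f}")
```

With the real data, this computation would reproduce the t ratio (-1.52) and p-value (0.13) from the Parameter Estimates table.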
Inference in Regression
• Confidence intervals for slope
• Hypothesis test for slope
• Confidence intervals for mean response
• Prediction intervals
Car Price Example
• A used-car dealer wants to understand how
odometer reading affects the selling price of
used cars.
• The dealer randomly selects 100 three-year-old Ford Tauruses that were sold at auction during the past month. Each car was in top condition and equipped with automatic transmission, AM/FM cassette tape player, and air conditioning.
• carprices.JMP contains the price and number of
miles on the odometer of each car.
Bivariate Fit of Price By Odometer
[Figure: scatterplot of Price vs. Odometer (15,000-45,000 miles) with fitted line]

Linear Fit
Price = 17066.766 - 0.0623155 Odometer

Summary of Fit
RSquare                       0.650132
RSquare Adj                   0.646562
Root Mean Square Error        303.1375
Mean of Response              14822.82
Observations (or Sum Wgts)    100

Parameter Estimates
Term        Estimate   Std Error  t Ratio  Prob>|t|  Lower 95%  Upper 95%
Intercept   17066.766  169.0246   100.97   <.0001    16731.342  17402.19
Odometer    -0.062315  0.004618   -13.49   <.0001    -0.071479  -0.053152
• The used-car dealer has an opportunity to bid on
a lot of cars offered by a rental company. The
rental company has 250 Ford Tauruses, all
equipped with automatic transmission, air
conditioning and AM/FM cassette tape players.
All of the cars in this lot have about 40,000 miles
on the odometer. The dealer would like an
estimate of the average selling price of all cars
of this type with 40,000 miles on the odometer,
i.e., E(Y|X=40,000).
• The least squares estimate is
  Ê(Y | X = 40000) = 17067 - 0.0623 * 40000 = $14,575
Confidence Interval for Mean Response
• Confidence interval for E(Y|X=40,000): A range of plausible values for E(Y|X=40,000) based on the sample.
• Approximate 95% confidence interval:
  Ê(Y | X = X0) ± 2 * SE{Ê(Y | X = X0)}
  where
  SE{Ê(Y | X = X0)} = RMSE * sqrt( 1/n + (X0 - X̄)² / Σ_{i=1}^n (Xi - X̄)² )
• Notes about the formula for SE: the standard error becomes smaller as the sample size n increases, and the standard error is smaller the closer X0 is to X̄.
• In JMP, after Fit Line, click red triangle next to Linear Fit and click
Confid Curves Fit. Use the crosshair tool by clicking Tools, Crosshair
to find the exact values of the confidence interval endpoints for a
given X0.
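The SE formula above can be computed directly. A minimal sketch, with synthetic data standing in for carprices.JMP (the data-generating coefficients below are assumptions for illustration, not the slides' exact fit):

```python
import numpy as np

# Synthetic stand-in for carprices.JMP (assumed, for illustration)
rng = np.random.default_rng(1)
odometer = rng.uniform(15000, 45000, size=100)
price = 17067 - 0.0623 * odometer + rng.normal(0, 303, size=100)

n = odometer.size
b1, b0 = np.polyfit(odometer, price, 1)       # slope, intercept
rmse = np.sqrt(np.sum((price - (b0 + b1 * odometer)) ** 2) / (n - 2))

x0 = 40000
est = b0 + b1 * x0                            # estimate of E(Y | X = 40000)
# SE{E^(Y|X=x0)} = RMSE * sqrt(1/n + (x0 - xbar)^2 / sum((xi - xbar)^2))
se_mean = rmse * np.sqrt(1 / n + (x0 - odometer.mean()) ** 2
                         / np.sum((odometer - odometer.mean()) ** 2))
ci = (est - 2 * se_mean, est + 2 * se_mean)   # approximate 95% CI
print(f"E(Y|X=40000) estimate: {est:.0f}, 95% CI: ({ci[0]:.0f}, {ci[1]:.0f})")
```

Note how small se_mean is relative to RMSE: averaging over 100 cars pins down the mean response much more tightly than any single price.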
Bivariate Fit of Price By Odometer
[Figure: scatterplot of Price vs. Odometer with fitted line and confidence curves for the fit]

Approximate 95% confidence interval for E(Y | X = 40,000):
($14,514, $14,653)
A Prediction Problem
• The used-car dealer is offered a particular 3-year-old Ford Taurus equipped with automatic transmission, air conditioner and AM/FM cassette tape player, and with 40,000 miles on the odometer. The dealer would like to predict the selling price of this particular car.
• Best prediction based on the least squares estimate:
  Ê(Y | X = 40000) = 17067 - 0.0623 * 40000 = $14,575
Range of Selling Prices for Particular Car
• The dealer is interested in the range of selling prices that this particular car with 40,000 miles on it is likely to have.
• Under the simple linear regression model, Y|X follows a normal distribution with mean β0 + β1 * X and standard deviation σ. A car with 40,000 miles on it will be in the interval β0 + β1 * 40000 ± 2σ about 95% of the time.
• Class 5: We substituted the least squares estimates β̂0, β̂1, RMSE for β0, β1, σ and said a car with 40,000 miles on it will be in the interval β̂0 + β̂1 * 40000 ± 2 * RMSE about 95% of the time. This is a good approximation, but it ignores potential error in the least squares estimates.
Prediction Interval
• 95% Prediction Interval: An interval that has approximately a 95% chance of containing the value of Y for a particular unit with X = X0, where the particular unit is not in the original sample.
• Approximate 95% prediction interval:
  Ê(Y | X = X0) ± 2 * RMSE * sqrt( 1 + 1/n + (X0 - X̄)² / Σ_{i=1}^n (Xi - X̄)² )
• In JMP, after Fit Line, click red triangle next to Linear Fit
and click Confid Curves Indiv. Use the crosshair tool by
clicking Tools, Crosshair to find the exact values of the
prediction interval endpoints for a given X0.
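The prediction-interval formula differs from the mean-response interval only by the extra "1 +" under the square root. A sketch under the same synthetic-data assumption as before (the real carprices.JMP data would give the slides' numbers):

```python
import numpy as np

# Synthetic stand-in for carprices.JMP (assumed, for illustration)
rng = np.random.default_rng(2)
odometer = rng.uniform(15000, 45000, size=100)
price = 17067 - 0.0623 * odometer + rng.normal(0, 303, size=100)

n = odometer.size
b1, b0 = np.polyfit(odometer, price, 1)
rmse = np.sqrt(np.sum((price - (b0 + b1 * odometer)) ** 2) / (n - 2))

x0 = 40000
est = b0 + b1 * x0
# The leading "1 +" accounts for one new car's own variability, so the
# prediction interval is always wider than the mean-response CI.
se_pred = rmse * np.sqrt(1 + 1 / n + (x0 - odometer.mean()) ** 2
                         / np.sum((odometer - odometer.mean()) ** 2))
pi = (est - 2 * se_pred, est + 2 * se_pred)
print(f"approximate 95% prediction interval: ({pi[0]:.0f}, {pi[1]:.0f})")
```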
Bivariate Fit of Price By Odometer
[Figure: scatterplot of Price vs. Odometer with fitted line and individual prediction curves]

95% Confidence Interval for E(Y | X = 40000): (14514, 14653)
95% Prediction Interval for X = 40000: (13972, 15194)
A Violation of Linearity
Bivariate Fit of Life Expectancy By Per Capita GDP
[Figure: scatterplot of Life Expectancy vs. Per Capita GDP (0-30,000) with fitted line; residual plot vs. Per Capita GDP]

Y = Life Expectancy in 1999
X = Per Capita GDP (in US Dollars) in 1999
Data in gdplife.JMP

The linearity assumption of simple linear regression is clearly violated. The increase in mean life expectancy for each additional dollar of GDP is less for large GDPs than for small GDPs: decreasing returns to increases in GDP.
Transformations
• Violation of linearity: E(Y|X) is not a
straight line.
• Transformations: Perhaps E(f(Y)|g(X)) is a
straight line, where f(Y) and g(X) are
transformations of Y and X, and a simple
linear regression model holds for the
response variable f(Y) and explanatory
variable g(X).
Bivariate Fit of Life Expectancy By log Per Capita GDP
[Figure: scatterplot of Life Expectancy vs. log Per Capita GDP (6-10) with fitted line; residual plot vs. log Per Capita GDP]

Linear Fit
Life Expectancy = -7.97718 + 8.729051 log Per Capita GDP

The mean of Life Expectancy | log Per Capita GDP appears to be approximately a straight line.
How do we use the transformation?

Linear Fit
Life Expectancy = -7.97718 + 8.729051 log Per Capita GDP

Parameter Estimates
Term                Estimate   Std Error  t Ratio  Prob>|t|
Intercept           -7.97718   3.943378   -2.02    0.0454
log Per Capita GDP  8.729051   0.474257   18.41    <.0001
• Testing for association between Y and X: If the simple linear regression model holds for f(Y) and g(X), then Y and X are associated if and only if the slope in the regression of f(Y) on g(X) does not equal zero. The p-value for the test that the slope is zero is <.0001: strong evidence that per capita GDP and life expectancy are associated.
• Prediction and mean response: What would you predict the life
expectancy to be for a country with a per capita GDP of $20,000?
  Ê(Y | X = 20,000) = Ê(Y | log X = log 20,000) = Ê(Y | log X = 9.9035) = -7.9772 + 8.7291 * 9.9035 ≈ 78.47
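The back-transformed prediction is just the fitted line evaluated at log X. A one-line check using the coefficients from the slides' fit:

```python
import numpy as np

# Coefficients from the slides' fit of Life Expectancy on log Per Capita GDP
b0, b1 = -7.9772, 8.7291

gdp = 20000
pred = b0 + b1 * np.log(gdp)   # E^(Y | log X = log 20000)
print(f"predicted life expectancy: {pred:.2f}")  # prints 78.47
```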
How do we choose a
transformation?
• Tukey’s Bulging Rule.
• See Handout.
• Match curvature in data to the shape of
one of the curves drawn in the four
quadrants of the figure in the handout.
Then use the associated transformations,
selecting one for either X, Y or both.
Transformations in JMP
1. Use Tukey’s Bulging rule (see handout) to determine
transformations which might help.
2. After Fit Y by X, click red triangle next to Bivariate Fit and
click Fit Special. Experiment with transformations
suggested by Tukey’s Bulging rule.
3. Make residual plots of the residuals for transformed
model vs. the original X by clicking red triangle next to
Transformed Fit to … and clicking plot residuals.
Choose transformations which make the residual plot
have no pattern in the mean of the residuals vs. X.
4. Compare different transformations by looking for
transformation with smallest root mean square error on
original y-scale. If using a transformation that involves
transforming y, look at root mean square error for fit
measured on original scale.
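Step 4, comparing fits by root mean square error on the original y-scale, can be sketched as follows. Synthetic data stand in for gdplife.JMP (an assumption for illustration), and the y-transformation Y² shows why predictions must be back-transformed before computing RMSE:

```python
import numpy as np

# Synthetic stand-in for gdplife.JMP: life expectancy roughly linear in log GDP
rng = np.random.default_rng(3)
gdp = rng.uniform(500, 30000, size=115)
life = -8.0 + 8.73 * np.log(gdp) + rng.normal(0, 6, size=115)

def rmse_original_scale(pred):
    """Root mean square error of predictions on the original y-scale."""
    return np.sqrt(np.mean((life - pred) ** 2))

# Transformed fit to log x: predictions are already on the y-scale
b1, b0 = np.polyfit(np.log(gdp), life, 1)
rmse_log = rmse_original_scale(b0 + b1 * np.log(gdp))

# Transformed fit square: fit Y^2 on X, then back-transform with sqrt
c1, c0 = np.polyfit(gdp, life ** 2, 1)
rmse_sq = rmse_original_scale(np.sqrt(np.maximum(c0 + c1 * gdp, 0)))

print(f"RMSE (log x fit): {rmse_log:.2f}; RMSE (Y^2 fit): {rmse_sq:.2f}")
```

Because the synthetic data are generated with a log relationship, the log-x fit has the smaller RMSE on the original scale, mirroring the comparison on the next slide.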
Bivariate Fit of Life Expectancy By Per Capita GDP
[Figure: scatterplot of Life Expectancy vs. Per Capita GDP (0-30,000) with four fits overlaid: Linear Fit, Transformed Fit to Log, Transformed Fit to Sqrt, Transformed Fit Square]

Linear Fit
Life Expectancy = 56.176479 + 0.0010699 Per Capita GDP
Summary of Fit
RSquare                       0.515026
RSquare Adj                   0.510734
Root Mean Square Error        8.353485
Mean of Response              63.86957
Observations (or Sum Wgts)    115

Transformed Fit to Log
Life Expectancy = -7.97718 + 8.729051 Log(Per Capita GDP)
Summary of Fit
RSquare                       0.749874
RSquare Adj                   0.74766
Root Mean Square Error        5.999128
Mean of Response              63.86957
Observations (or Sum Wgts)    115

Transformed Fit to Sqrt
Life Expectancy = 47.925383 + 0.2187935 Sqrt(Per Capita GDP)
Summary of Fit
RSquare                       0.636551
RSquare Adj                   0.633335
Root Mean Square Error        7.231524
Mean of Response              63.86957
Observations (or Sum Wgts)    115

Transformed Fit Square
Square(Life Expectancy) = 3232.1292 + 0.1374831 Per Capita GDP
Fit Measured on Original Scale
Sum of Squared Error          7597.7156
Root Mean Square Error        8.1997818
RSquare                       0.5327083
Sum of Residuals              -70.29942

• By looking at the root mean square error on the original y-scale, we see that all of the transformations improve upon the untransformed model and that the transformation to log x is by far the best.
[Figure: residual plots vs. Per Capita GDP for the four fits: Linear Fit, Transformation to Sqrt X, Transformation to Log X, Transformation to Y²]

The transformation to Log X appears to have mostly removed the trend in the mean of the residuals. This means that E(Y | X) = β0 + β1 log X. There is still a problem of nonconstant variance.