Lecture 19 - University of Pennsylvania


Lecture 19
• Simple linear regression (Review, 18.5, 18.8)
• Homework 5 is posted and due next Tuesday by 3 p.m.
• Extra office hour on Thursday after class.
Review of Regression Analysis
• Goal: Estimate E(Y|X) – the regression function
• Uses:
  – E(Y|X) is a good prediction of Y based on X
  – E(Y|X) describes the relationship between Y and X
• Simple linear regression model: E(Y|X) is a straight line (the regression line):
  E(Y|X) = β₀ + β₁X
The Simple Linear Regression Line
• Example 18.2 (Xm18-02)
  – A car dealer wants to find the relationship between the odometer reading and the selling price of used cars.
  – A random sample of 100 cars is selected, and the data recorded.
  – Find the regression line.

  Car   Odometer (x, independent variable)   Price (y, dependent variable)
  1     37388                                14636
  2     44758                                14122
  3     45833                                14016
  4     30862                                15590
  5     31705                                15568
  6     34010                                14718
  ...   ...                                  ...
Simple Linear Regression Model
• The data (x₁, y₁), …, (xₙ, yₙ) are assumed to be a realization of
  yᵢ = β₀ + β₁xᵢ + εᵢ,  i = 1, …, n,  where ε₁, …, εₙ iid ~ N(0, σ_ε²)
• β₀, β₁, σ_ε are the unknown parameters of the model. The objective of regression is to estimate them.
• β₁, the slope, is the amount that Y changes on average for each one-unit increase in X.
• σ_ε, the standard error of estimate, is the standard deviation of the amount by which Y differs from E(Y|X), i.e., the standard deviation of the errors.
Estimation of Regression Line
• We estimate the regression line β₀ + β₁x by the least squares line b₀ + b₁x, the line that minimizes the sum of squared prediction errors for the data:
  b₁ = cov(X, Y) / s_x²
  b₀ = ȳ − b₁x̄
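A minimal Python sketch of these two formulas, assuming the odometer readings and prices are stored in NumPy arrays x and y (the array and function names are illustrative, not from the lecture):

```python
import numpy as np

def least_squares_line(x, y):
    b1 = np.cov(x, y)[0, 1] / np.var(x, ddof=1)  # b1 = cov(X, Y) / s_x^2
    b0 = np.mean(y) - b1 * np.mean(x)            # b0 = ybar - b1 * xbar
    return b0, b1
```

For the Example 18.2 data this should reproduce b₀ ≈ 17067 and b₁ ≈ −.0623.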
Fitted Values and Residuals
• The least squares line decomposes the data into two parts, yᵢ = ŷᵢ + eᵢ, where
  ŷᵢ = b₀ + b₁xᵢ
  eᵢ = yᵢ − ŷᵢ
• ŷ₁, …, ŷₙ are called the fitted or predicted values.
• e₁, …, eₙ are called the residuals.
• The residuals e₁, …, eₙ are estimates of the errors ε₁, …, εₙ.
Estimating σ_ε

  s_ε = √[ Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² / (n − 2) ]

• The standard error of estimate s_ε (root mean squared error) is an estimate of σ_ε.
• The standard error of estimate s_ε is basically the standard deviation of the residuals.
• s_ε measures how useful the simple linear regression model is for prediction.
• If the simple regression model holds, then approximately
  – 68% of the data will lie within one s_ε of the LS line.
  – 95% of the data will lie within two s_ε of the LS line.
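A sketch continuing the one above: fitted values, residuals, and s_ε, again assuming NumPy arrays x and y (names are illustrative):

```python
import numpy as np

def fit_summary(x, y, b0, b1):
    yhat = b0 + b1 * x                            # fitted values
    e = y - yhat                                  # residuals (estimates of the errors)
    s_eps = np.sqrt(np.sum(e**2) / (len(y) - 2))  # s_eps, divisor n - 2
    return yhat, e, s_eps
```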
18.4 Error Variable: Required Conditions
• The error ε is a critical part of the regression model.
• Four requirements involving the distribution of ε must be satisfied:
  – The probability distribution of ε is normal.
  – The mean of ε is zero for each x: E(ε|x) = 0 for each x.
  – The standard deviation of ε is σ_ε for all values of x.
  – The errors associated with different values of y are all independent.
The Normality of ε

[Figure: normal curves for y at x₁, x₂, x₃, centered at the means m₁, m₂, m₃ where E(y|xᵢ) = β₀ + β₁xᵢ. The standard deviation remains constant, but the mean value changes with x.]

From the first three assumptions we have: given x, y is normally distributed with mean E(y) = β₀ + β₁x and a constant standard deviation σ_ε.
Coefficient of determination
– To measure the strength of the linear relationship we use the coefficient of determination R².

  R² = [cov(X, Y)]² / (s_x² s_y²)

  or  R² = 1 − SSE / Σᵢ₌₁ⁿ (yᵢ − ȳ)²,  where SSE = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
Coefficient of determination
• To understand the significance of this coefficient note:

  SSE = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²    (the error)
  SSR = Σᵢ₌₁ⁿ (ŷᵢ − ȳ)²    (the regression model)
  SSTot = Σᵢ₌₁ⁿ (yᵢ − ȳ)²   (overall variability in y)

  SSTot = SSR + SSE
Coefficient of determination
[Figure: two data points (x₁, y₁) and (x₂, y₂) of a certain sample, shown with the fitted line and ȳ.]

Total variation in y = (y₁ − ȳ)² + (y₂ − ȳ)²
  = variation explained by the regression line + unexplained variation (error)
  = [(ŷ₁ − ȳ)² + (ŷ₂ − ȳ)²] + [(y₁ − ŷ₁)² + (y₂ − ŷ₂)²]
  = SSR + SSE
Coefficient of determination
• R² measures the proportion of the variation in y that is explained by the variation in x:

  R² = 1 − SSE / Σ(yᵢ − ȳ)² = [Σ(yᵢ − ȳ)² − SSE] / Σ(yᵢ − ȳ)² = SSR / Σ(yᵢ − ȳ)²

• R² takes on any value between zero and one.
  – R² = 1: Perfect match between the line and the data points.
  – R² = 0: There is no linear relationship between x and y.
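A sketch of the SSE/SSTot form of R² above, with illustrative array names:

```python
import numpy as np

def r_squared(y, yhat):
    sse = np.sum((y - yhat) ** 2)           # SSE: unexplained variation
    sstot = np.sum((y - np.mean(y)) ** 2)   # SSTot: total variation in y
    return 1 - sse / sstot                  # equivalently SSR / SSTot
```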
Coefficient of determination,
Example
• Example 18.5
– Find the coefficient of determination for Example
18.2; what does this statistic tell you about the
model?
• Solution
  – Solving by hand:

    R² = [cov(x, y)]² / (s_x² s_y²) = [−2,712,511]² / [(43,528,688)(259,996)] = .6501

  – About 65% of the variation in the auction selling price is explained by the variation in odometer reading; the rest remains unexplained by this model.
Example 18.2 in JMP
Bivariate Fit of Price By Odometer
[Scatterplot of Price (13500–16000) versus Odometer (15000–45000), with the fitted line.]

Linear Fit
Price = 17066.766 - 0.0623155 Odometer

Summary of Fit
RSquare                 0.650132
RSquare Adj             0.646562
Root Mean Square Error  303.1375

Parameter Estimates
Term       Estimate    Std Error   t Ratio   Prob>|t|
Intercept  17066.766   169.0246    100.97    <.0001
Odometer   -0.062315   0.004618    -13.49    <.0001
SEs of Parameter Estimates
• From the JMP output, se(β̂₀) = 169.02 and se(β̂₁) = .0046.
• Imagine yourself taking repeated samples of the
prices of cars with the odometer readings x1,, xn
from the “population.”
• For each sample, you could estimate the
regression line by least squares. Each time, the
least squares line would be a little different.
• The standard errors estimate how much the least
squares estimates of the slope and intercept would
vary over these repeated samples.
Confidence Intervals
• If the simple linear regression model holds, the estimated slope follows a t-distribution.
• A 95% confidence interval for the slope β₁ is given by β̂₁ ± t.025,n−2 se(β̂₁).
• A 95% confidence interval for the intercept β₀ is given by β̂₀ ± t.025,n−2 se(β̂₀).
Testing the slope
– When no linear relationship exists between two variables, the regression line should be horizontal.

[Figure: two scatterplots.]
  Linear relationship: different inputs (x) yield different outputs (y). The slope is not equal to zero.
  No linear relationship: different inputs (x) yield the same output (y). The slope is equal to zero.
Testing the Slope
• We can draw inference about β₁ from b₁ by testing
  H₀: β₁ = 0
  H₁: β₁ ≠ 0 (or < 0, or > 0)
  – The test statistic is
      t = (b₁ − β₁) / s_b₁,  where  s_b₁ = s_ε / √[(n − 1)s_x²]
    is the standard error of b₁.
  – If the error variable is normally distributed, the statistic has a Student t distribution with d.f. = n − 2.
Testing the Slope,
Example
• Example 18.4
– Test to determine whether there is enough
evidence to infer that there is a linear
relationship between the car auction price and
the odometer reading for all three-year-old
Tauruses, in Example 18.2.
Use α = 5%.
Testing the Slope,
Example
• Solving by hand
  – To compute t we need the values of b₁ and s_b₁:
      b₁ = −.0623
      s_b₁ = s_ε / √[(n − 1)s_x²] = 303.1 / √[(99)(43,528,690)] = .00462
      t = (b₁ − β₁) / s_b₁ = (−.0623 − 0) / .00462 = −13.49
  – The rejection region is t > t.025 or t < −t.025 with d.f. = n − 2 = 98.
    Approximately, t.025 = 1.984.
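A sketch of this hand calculation, with the two-sided p-value added (SciPy assumed):

```python
import numpy as np
from scipy import stats

n, s_eps, s2_x, b1 = 100, 303.1, 43_528_690, -0.0623
s_b1 = s_eps / np.sqrt((n - 1) * s2_x)  # about .00462
t = (b1 - 0) / s_b1                     # about -13.49
p = 2 * stats.t.sf(abs(t), df=n - 2)    # two-sided p-value, < .0001
```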
Testing the Slope,
Example Xm18-02
• Using the computer: the JMP output is the same as shown above for Example 18.2
  (Price = 17066.766 - 0.0623155 Odometer; for Odometer, t Ratio = -13.49 with Prob>|t| < .0001).

There is overwhelming evidence to infer that the odometer reading affects the auction selling price.
Cause-and-effect Relationship
• A test of whether the slope is zero is a test of
whether there is a linear relationship between x
and y in the observed data, i.e., is a change in x
associated with a change in y.
• This does not test whether a change in x causes a
change in y. Such a relationship can only be
established based on a carefully controlled
experiment or extensive subject matter knowledge
about the relationship.
Example of Pitfall
• A researcher measures the number of
television sets per person X and the average
life expectancy Y for the world’s nations.
The regression line has a positive slope –
nations with many TV sets have higher life
expectancies. Could we lengthen the lives
of people in Rwanda by shipping them TV
sets?
18.7 Using the Regression
Equation
• Before using the regression model, we need
to assess how well it fits the data.
• If we are satisfied with how well the model
fits the data, we can use it to predict the
values of y.
• To make a prediction we use
– Point prediction, and
– Interval prediction
Point Prediction
• Example 18.7
  – Predict the selling price of a three-year-old Taurus with 40,000 miles on the odometer (Example 18.2).
  – A point prediction:
      ŷ = 17067 − .0623x = 17067 − .0623(40,000) = 14,575
  – It is predicted that a car with 40,000 miles on the odometer would sell for $14,575.
  – How close is this prediction to the real price?
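The point prediction as a one-line sketch, straight from the fitted line:

```python
b0, b1, x_g = 17067, -0.0623, 40_000
y_hat = b0 + b1 * x_g   # 17067 - .0623 * 40000 = 14575.0
```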
Interval Estimates
• Two intervals can be used to discover how closely
the predicted value will match the true value of y.
– Prediction interval – predicts y for a given value of x,
– Confidence interval – estimates the average y for a
given x.
– The prediction interval:
    ŷ ± tα/2 s_ε √[1 + 1/n + (x_g − x̄)² / ((n − 1)s_x²)]
– The confidence interval:
    ŷ ± tα/2 s_ε √[1/n + (x_g − x̄)² / ((n − 1)s_x²)]
Interval Estimates,
Example
• Example 18.7 - continued
– Provide an interval estimate for the bidding
price on a Ford Taurus with 40,000 miles on the
odometer.
– Two types of predictions are required:
• A prediction for a specific car
• An estimate for the average price per car
Interval Estimates,
Example
• Solution
  – A prediction interval provides the price estimate for a single car:
      ŷ ± tα/2 s_ε √[1 + 1/n + (x_g − x̄)² / ((n − 1)s_x²)]
      = [17,067 − .0623(40,000)] ± 1.984(303.1)√[1 + 1/100 + (40,000 − 36,009)² / ((100 − 1)(43,528,690))]
      = 14,575 ± 605
    (approximately, t.025,98 = 1.984)
Interval Estimates,
Example
• Solution – continued
  – A confidence interval provides the estimate of the mean price per car for a Ford Taurus with 40,000 miles on the odometer.
    The 95% confidence interval:
      ŷ ± tα/2 s_ε √[1/n + (x_g − x̄)² / ((n − 1)s_x²)]
      = [17,067 − .0623(40,000)] ± 1.984(303.1)√[1/100 + (40,000 − 36,009)² / ((100 − 1)(43,528,690))]
      = 14,575 ± 70
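A sketch computing both half-widths at x_g = 40,000 from the Example 18.2 quantities on the slides (n, s_ε, x̄, s_x²), assuming SciPy is available:

```python
import numpy as np
from scipy import stats

n, s_eps, xbar, s2_x = 100, 303.1, 36_009, 43_528_690
y_hat, x_g = 14_575, 40_000
t_crit = stats.t.ppf(0.975, df=n - 2)
d = (x_g - xbar) ** 2 / ((n - 1) * s2_x)

pred_half = t_crit * s_eps * np.sqrt(1 + 1/n + d)  # ~605: a single car
conf_half = t_crit * s_eps * np.sqrt(1/n + d)      # ~70: mean price per car
```

The prediction interval is much wider because it must cover the variability of an individual car around the mean, not just the uncertainty in the estimated mean.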
The effect of the given x_g on the length of the interval
– As x_g moves away from x̄ the interval becomes longer. That is, the shortest interval is found at x_g = x̄.
    ŷ = b₀ + b₁x_g
    ŷ ± tα/2 s_ε √[1/n + (x_g − x̄)² / ((n − 1)s_x²)]
– For example, at x_g = x̄ ± 1 the interval is
    ŷ ± tα/2 s_ε √[1/n + 1² / ((n − 1)s_x²)]
  and at x_g = x̄ ± 2 it is
    ŷ ± tα/2 s_ε √[1/n + 2² / ((n − 1)s_x²)]
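A sketch showing this effect numerically: the half-width grows as x_g moves away from x̄ (same Example 18.2 quantities; SciPy assumed):

```python
import numpy as np
from scipy import stats

n, s_eps, s2_x = 100, 303.1, 43_528_690
t_crit = stats.t.ppf(0.975, df=n - 2)
for dist in (0, 1_000, 2_000, 4_000):    # |x_g - xbar|
    half = t_crit * s_eps * np.sqrt(1/n + dist**2 / ((n - 1) * s2_x))
    print(dist, round(half, 1))          # shortest at dist = 0
```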
Practice Problems
• 18.84,18.86,18.88,18.90,18.94