Chapters 8 & 9
Linear Regression & Regression Wisdom
Price of Homes Based on Size (in Square Feet)
Sold in Ames between Sep. 2004 and Oct. 2005
r = 0.8718945
Statistical Modeling
Statistical Model: An equation that fits the pattern
between a response variable and possible explanatory
variables, accounting for deviations from the model.
(Simplest case: one quantitative response variable and
one quantitative explanatory variable.)
Response Variable (Y): The quantitative outcome of a
study.
Explanatory Variable (X): A quantitative variable that
may explain or predict the response variable
What is the best model for: Predicting weight (Y) from
height (X)?
What is the best model for: Predicting blood pressure
(Y) from age (X)?
Correlation and the Line
Price of Homes Based on Square Feet
Price = -90.2458 + 0.1598SQFT
r = 0.8718945
Regression line
Explains how the response variable (y)
changes in relation to the explanatory
variable (x)
Use the line to predict value of y for a
given value of x
Regression line
Need a mathematical formula
We want to predict y from x
The predicted values are called ŷ.
The observed values are called y.
Which Line is Best?
What are some ways we can determine
which model out of all the possible
models is the “best” one?
What are some ways that we can
numerically rank the different models (i.e.,
the different lines)?
Which Model is Best?
Price = -90.2458 + 0.1598SQFT (red)
Price = -300 + 0.3SQFT (blue)
Price = 0 + 0.1SQFT (green)
Regression line
“Putting a hat on it” means we have
predicted something from the model
Look at vertical distance
$y - \hat{y}$
Amount of error in the regression line
The goal is to find the line so that these
errors are minimized.
Least squares regression
Most commonly used regression line
Makes the sum of the squared errors as
small as possible
Based on the statistics
$\bar{x}$, $\bar{y}$, $s_x$, $s_y$, $r$
Regression line equation
$\hat{y} = b_0 + b_1 x$
where
$b_1 = r \dfrac{s_y}{s_x}$
$b_0 = \bar{y} - b_1 \bar{x}$
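To make the least squares idea concrete, here is a minimal Python sketch. The five data points are hypothetical (invented for illustration, not from the slides). It computes the slope and intercept from the five summary statistics, then checks that a few nudged lines have a larger sum of squared errors.

```python
import math

# Hypothetical data, invented only to illustrate the formulas.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

# The five summary statistics the formulas are based on.
x_bar = sum(x) / n
y_bar = sum(y) / n
s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)

# Slope and intercept of the least squares line.
b1 = r * s_y / s_x
b0 = y_bar - b1 * x_bar


def sse(intercept, slope):
    """Sum of squared vertical errors (y - y_hat) for a candidate line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


best = sse(b0, b1)
print(f"least squares line: y_hat = {b0:.3f} + {b1:.3f} x, SSE = {best:.4f}")

# Any other line should give a larger sum of squared errors.
nudges = [(0.5, 0.0), (0.0, 0.2), (-0.5, -0.1)]
for db0, db1 in nudges:
    print(f"nudged line SSE = {sse(b0 + db0, b1 + db1):.4f}")
```

Changing the nudges to any other nonzero values should still give a larger SSE than the fitted line.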
Regression line equation
b1 = slope of line. For every unit increase in x, y
changes by the amount of the slope.
Interpreting b1 (slope):
For every one unit increase in the explanatory
variable, there will be, on average, a b1 unit(s)
increase/decrease in the response variable.
For example: For every one square foot increase in
size, on average, there will be a $159.80 increase in
home price.
MEMORIZE THIS!!!!!
Regression line equation
b0 = y-intercept of line. The value of y when x =
0.
Interpreting b0 (y-intercept):
When the explanatory variable = 0, on average, the
value of the response variable = b0.
For example: When the sq. ft. of a home is 0, the
price of the home will be -$90,245.80 on average.
MEMORIZE THIS!!!!!
BE CAREFUL. The interpretation of the intercept
does not always make sense. When interpreting, be
sure to mention if the interpretation does not make
sense.
Example – Kobe’s Shooting
I visited cnnsi’s website and checked out
some of Kobe Bryant’s personal scoring
numbers. I looked at the number of times
he shot the ball and his point total for
each game so far this year.
Let’s come up with the regression
equation for this data.
Kobe’s Shooting
r = 0.7293762
Form: Linear
Strength: Moderate to Strong
Direction: Positive
Calculating the regression line
Remember that:
Our explanatory variable(x) is the number of
shots
Our response variable(y) is the number of
points
So the five numbers needed are:
$\bar{x} = 27.04$, $s_x = 7.41$,
$\bar{y} = 35.71$, $s_y = 12.13$,
$r = 0.7293762$
Calculating the Regression Line
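Find the Slope
$b_1 = r \dfrac{s_y}{s_x} = (0.7293762)\dfrac{12.13}{7.41} = 1.19$
Find the Intercept
$b_0 = \bar{y} - b_1 \bar{x} = 35.71 - 1.19(27.04) \approx 3.436$
(The rounded values shown give about 3.53; the 3.436 used from here on presumably reflects unrounded inputs.)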
Calculating the regression line.
Don’t forget to write the equation.
$\hat{y} = 3.436 + 1.19x$
DON’T FORGET TO WRITE THE
EQUATION IN THE CONTEXT OF THE
PROBLEM.
pts = 3.436 + 1.19(number of shots)
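The slope and intercept calculation can be sketched in Python using the five summary statistics from the slides, with r = 0.7293762 as on the scatterplot. Because those inputs are rounded, the intercept may differ slightly from the 3.436 used in the slides.

```python
# Summary statistics from the slides (x = shots, y = points).
x_bar, s_x = 27.04, 7.41
y_bar, s_y = 35.71, 12.13
r = 0.7293762

b1 = r * s_y / s_x        # slope: change in predicted points per extra shot
b0 = y_bar - b1 * x_bar   # intercept: predicted points at zero shots

print(f"b1 = {b1:.2f}")
print(f"b0 = {b0:.2f}")
```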
Interpretation
How would we interpret b1?
For each additional shot Kobe Bryant takes,
we would expect him to score 1.19 more
points, on average.
How would we interpret b0?
If Kobe Bryant took no shots in a game, then on
average we would expect him to score 3.436
points.
Prediction
Use the regression equation to predict y from
x.
Ex. What is the predicted number of points when
Kobe shoots 30 times in a game?
$\hat{y} = 3.436 + 1.19(30) = 39.136$
Ex. What is the predicted number of points when
Kobe shoots 55 times in a game?
$\hat{y} = 3.436 + 1.19(55) = 68.886$
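Prediction is just plugging x into the fitted equation. A small Python sketch, with `predict_points` as a hypothetical helper name:

```python
def predict_points(shots: float) -> float:
    """Predicted points from the fitted line y_hat = 3.436 + 1.19 * shots."""
    return 3.436 + 1.19 * shots


for shots in (30, 55):
    print(f"{shots} shots -> {predict_points(shots):.3f} predicted points")
```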
Plotting the regression line
Find two points on the line:
Ex. x = 30, ŷ ≈ 39 and x = 55, ŷ ≈ 69
• If you are plotting by hand it is ok to round values
Plot these two points on the graph
Connect the points
This is the regression line
Plotting the Regression Line
Properties of regression line
r is related to the value of b1
r has the same sign as b1
One standard deviation change in x
corresponds to r times one standard
deviation change in y
The regression line always goes through the
point $(\bar{x}, \bar{y})$
Properties of regression line
$r^2$
Percent of variation in y that is explained by
the least squares regression of y on x
The higher the value of $r^2$, the more the
regression line explains the changes that
occur in the y variable
The higher the value of $r^2$, the better the
regression line fits the data
Properties of regression line
$r^2$
$0 \le r^2 \le 1$ since $-1 \le r \le 1$
Interpretation of $r^2$
$r^2$ is the percent of variation in the response
variable that can be explained by the least squares
regression of the response variable on the
explanatory variable.
For Kobe’s example: 53.2% of the variability in the
number of points Kobe Bryant scores in a game can
be explained by the LS regression of points per
game on number of shots per game.
MEMORIZE THIS!!!!
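As a quick check, $r^2$ is simply the correlation squared. A short Python sketch using the r reported on the Kobe scatterplot:

```python
r = 0.7293762            # correlation between shots and points, from the slides
r_squared = r ** 2       # proportion of variation in points explained

print(f"r^2 = {r_squared:.3f} ({r_squared:.1%} of variation explained)")
```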
Residuals
Amount of variation in y not taken into
account by regression line
Formula: residual $= y - \hat{y}$
There is a residual for each data point
Mean of the residuals is zero
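The claim that the residuals average to zero can be checked numerically. The sketch below uses hypothetical data (invented for illustration). It fits the least squares line, computes one residual per data point, and averages them.

```python
# Hypothetical data, invented for illustration.
x = [2.0, 4.0, 5.0, 7.0, 9.0]
y = [3.0, 6.5, 7.0, 11.0, 12.5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least squares slope and intercept (algebraically the same as r * s_y / s_x).
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# One residual (observed minus predicted) per data point.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
mean_residual = sum(residuals) / n

print("residuals:", [round(e, 3) for e in residuals])
print("mean residual is zero up to rounding:", abs(mean_residual) < 1e-9)
```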
Calculating Residuals – Kobe
$\hat{y} = 3.436 + 1.19x$
pts = 3.436 + 1.19(number of shots)
Find the residual for the point (46,81)
First find the predicted number of points for a game with
46 shots:
$\hat{y} = 3.436 + 1.19(46) = 58.176$
Now find residual:
residual $= y - \hat{y} = 81 - 58.176 = 22.824$
Calculating Residuals – Kobe
Find the residual for the point (26,35)
$\hat{y} = 3.436 + 1.19(26) = 34.376$
residual $= y - \hat{y} = 35 - 34.376 = 0.624$
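The two residual calculations can be sketched in Python, using the fitted equation pts = 3.436 + 1.19(number of shots) for both points. The function names are hypothetical.

```python
def predict_points(shots: float) -> float:
    """Predicted points from the fitted line y_hat = 3.436 + 1.19 * shots."""
    return 3.436 + 1.19 * shots


def residual(shots: float, points: float) -> float:
    """Observed points minus predicted points."""
    return points - predict_points(shots)


for shots, points in [(46, 81), (26, 35)]:
    print(f"({shots}, {points}): predicted = {predict_points(shots):.3f}, "
          f"residual = {residual(shots, points):.3f}")
```

A positive residual means Kobe scored more than the line predicts for that number of shots.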
Residual Plots
Scatterplot of Residuals
Explanatory variable on horizontal axis
Residuals on vertical axis
Horizontal line at residual = 0
Residual Plots
Interpreting Residual Plots
Is there a curved pattern?
This could mean a non-linear relationship
Is there increasing spread about the line as x
increases?
This could mean non-constant variance
Is there decreasing spread about the line as x
increases?
This could mean non-constant variance
Interpreting Residual Plots
Points with large residuals
These are probably outliers in the y direction
These will pull the regression line in the direction of
the outlier (up or down)
Extreme points in the x direction
These are called influential points
They do not always show up in residuals because
the residual could be small
Removing the outlier could markedly change the
regression line
Reading JMP Data
[Figure: Bivariate Fit of BAC by # of Beers. Scatterplot with BAC (0 to 0.2) on the vertical axis and Beers (0 to 10) on the horizontal axis.]
Reading JMP Data
Linear Fit
BAC = -0.011654 + 0.0180112 # of Beers
This is the regression line for the data.
Slope is 0.0180112. y-Intercept is -0.011654.
The response variable is the BAC.
The explanatory variable is the # of Beers.
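Once the line is read off the JMP output, it can be used like any other regression equation. A minimal Python sketch, with `predict_bac` as a hypothetical helper and beer counts chosen inside the 0 to 10 range shown on the plot:

```python
INTERCEPT = -0.011654   # y-intercept from the JMP Linear Fit
SLOPE = 0.0180112       # change in predicted BAC per additional beer


def predict_bac(beers: float) -> float:
    """Predicted BAC from the fitted line BAC = -0.011654 + 0.0180112 * beers."""
    return INTERCEPT + SLOPE * beers


for beers in (2, 5, 8):
    print(f"{beers} beers -> predicted BAC {predict_bac(beers):.4f}")
```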
Reading JMP Data
Summary of Fit
RSquare                       0.803536
RSquare Adj                   0.788424
Root Mean Square Error        0.020920
Mean of Response              0.076000
Observations (or Sum Wgts)    15
This gives some summary of the data.
RSquare = $r^2$ = (correlation)$^2$
Root Mean Square Error = s
Mean of response = $\bar{y}$
Observations = n
Reading JMP Data
Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model       1   0.02327041       0.023270      53.1700   <.0001
Error      13   0.00568959       0.000438
C. Total   14   0.02896000
This is called the ANOVA Table.
This is another way to analyze the data.
We aren’t going to discuss this in this class.
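Although the ANOVA table is not covered in this class, its numbers tie back to the Summary of Fit values. The sketch below recomputes RSquare, RSquare Adj, Root Mean Square Error, and the F Ratio from the sums of squares. The formulas are the standard ones for simple linear regression, not something stated on the slides.

```python
import math

# Values copied from the JMP ANOVA table.
ss_model, ss_error, ss_total = 0.02327041, 0.00568959, 0.02896000
df_model, df_error = 1, 13
n = 15

r_squared = ss_model / ss_total                            # RSquare
r_squared_adj = 1 - (1 - r_squared) * (n - 1) / df_error   # RSquare Adj
rmse = math.sqrt(ss_error / df_error)                      # Root Mean Square Error
f_ratio = (ss_model / df_model) / (ss_error / df_error)    # F Ratio

print(f"RSquare     = {r_squared:.6f}")
print(f"RSquare Adj = {r_squared_adj:.6f}")
print(f"RMSE        = {rmse:.6f}")
print(f"F Ratio     = {f_ratio:.2f}")
```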
Reading JMP Data
Parameter Estimates
Term        Estimate     Std Error   t Ratio   Prob>|t|
Intercept   -0.011654    0.013179    -0.88     0.3926
#beers      0.0180112    0.002470    7.29      <.0001
This tells you what the y-intercept and slope
are. It also gives the standard error for each of
the estimates. If you were to form confidence
intervals for the parameter estimates, you
would need these values. We won’t discuss
that in this class.
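For reference, each t Ratio in the table is the estimate divided by its standard error. A short Python sketch that reproduces the two t Ratios from the values in the table:

```python
# (estimate, standard error) pairs copied from the Parameter Estimates table.
estimates = {
    "Intercept": (-0.011654, 0.013179),
    "#beers": (0.0180112, 0.002470),
}

# t Ratio = estimate / standard error, for each term.
t_ratios = {term: est / se for term, (est, se) in estimates.items()}

for term, t in t_ratios.items():
    print(f"{term}: t Ratio = {t:.2f}")
```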
Reading JMP Data
[Figure: residual plot. Residuals (-0.03 to 0.05) on the vertical axis against Beers (0 to 10) on the horizontal axis.]
Here is your residual plot. Check it to see if there are any
problems with linearity of the data and constant variance.
Example
[Figure: scatterplot of Gesell score (vertical axis) against age at first word (horizontal axis).]
Example
Age at first word vs. Gesell score.
Scatterplot: Weak negative linear
relationship between two variables. Possible
outliers at (42,57) and (17,121).
Regression: r = -0.64, $r^2$ = 40.96%.
$\hat{y} = 109.87 - 1.13x$
Example
[Figure: scatterplot of Gesell score (vertical axis) against age at first word (horizontal axis).]
Example
Age at first word vs. Gesell score.
Prediction:
• When x=17
• When x=42
Residuals:
• point (17,121)
• point (42,57)
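The predictions and residuals for these two points can be worked out with a short Python sketch. It assumes the fitted line is ŷ = 109.87 - 1.13x, with a negative slope because r is negative. The function name is hypothetical.

```python
def predict_gesell(age: float) -> float:
    """Predicted Gesell score from y_hat = 109.87 - 1.13 * (age at first word)."""
    return 109.87 - 1.13 * age


for age, score in [(17, 121), (42, 57)]:
    y_hat = predict_gesell(age)
    print(f"x = {age}: predicted = {y_hat:.2f}, residual = {score - y_hat:.2f}")
```

Under that assumption, the point (17,121) sits far above the line (residual about 30.34), while (42,57) is close to it (residual about -5.41).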
Example
[Figure: residual plot. Residuals (vertical axis) against age at first word (horizontal axis).]
Example
Residual Plot
Outliers at x=17 and x=42
Small residual for x=42
• Could be influential
Remove (42,57) from data.
Regression line changes markedly.
r = -0.33, $r^2$ = 10.89%.
Example
[Figure: scatterplot of Gesell score (vertical axis) against age at first word (horizontal axis).]
Outliers--What should you do?
Make sure data points have been
recorded correctly
Collect more data
Remove the outlier
Examine collection techniques
Examine outside influences
Cautions about regression
Linear relationship only
Not resistant
Using averaged data
Makes relationship appear stronger
Taking average removes variation
Extrapolation
Predicting y when x value is outside the
original data
Cautions about Regression
Extrapolation
Remember the data about home prices vs. the
amount of sq. footage in the home.
The regression line we found based on data
collected from homes with 900 to 3,000 sq. ft. is
price = -75.47 + 0.69(sq. ft.)
This would mean that if my home has no square
footage, then I pay -$75,470.
If you must extrapolate, at least don’t expect that
your prediction will come true.
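One way to stay honest about extrapolation is to have the prediction code complain when x falls outside the range of the original data. The Python sketch below is illustrative only. The function name is hypothetical, and price is assumed to be in thousands of dollars, since the slide reads the intercept -75.47 as -$75,470.

```python
import warnings

SQFT_MIN, SQFT_MAX = 900, 3000   # range of the homes the line was fitted on


def predict_price_thousands(sqft: float) -> float:
    """Predicted price (assumed to be in $1000s): -75.47 + 0.69 * sqft."""
    if not SQFT_MIN <= sqft <= SQFT_MAX:
        warnings.warn(
            f"{sqft} sq. ft. is outside the fitted range "
            f"{SQFT_MIN} to {SQFT_MAX}; this prediction is an extrapolation."
        )
    return -75.47 + 0.69 * sqft


# A home with no square footage: the line still returns a number, with a warning.
print(predict_price_thousands(0))
```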
Cautions about regression
ASSOCIATION IS NOT CAUSATION!
Strong association between explanatory and
response variables does not mean that the
explanatory variable causes the response
variable.
Proving Causation
Experiment
Change the values of x and control for lurking
variables.
Not all problems can be solved by
experiment
• Smoking causes lung cancer.
• Living near power lines causes leukemia.
Proving Causation
Lurking variable
Important effect on variables, but not
included in study.
Example:
• Do taller people make more money? What do
you think a lurking variable might be?
Proving Causation
Proving smoking causes lung cancer
Association is strong
Association is consistent
High doses are associated with stronger
response
Cause precedes the effect in time
Cause is plausible
Review
[Figure: Number of Calories by Sugar Content (g) for 13 Cereals. Scatterplot with cals on the vertical axis and sugar (g) on the horizontal axis.]
Let’s calculate the
formula for this
regression line:
Review
Let’s review all the formulas we need:
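$\hat{y} = b_0 + b_1 x$
$b_1 = r \dfrac{s_y}{s_x}$
$b_0 = \bar{y} - b_1 \bar{x}$
$r = \dfrac{1}{n-1} \sum \dfrac{(x - \bar{x})(y - \bar{y})}{s_x s_y}$
$s_y = \sqrt{\dfrac{\sum (y - \bar{y})^2}{n-1}}$ (and likewise for $s_x$)
$\bar{y} = \dfrac{1}{n} \sum y$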
Review
Here are all the numbers you need:
$n = 13$
$\sum x = 94$
$\sum y = 1280$
$\sum (x - \bar{x})(y - \bar{y}) = 1014.66$
$\sum (y - \bar{y})^2 = 6169.21$
$\sum (x - \bar{x})^2 = 301.97$
Review
First, calculate sx and sy:
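$s_x = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n-1}} = \sqrt{\dfrac{301.97}{12}} = 5.02$
$s_y = \sqrt{\dfrac{\sum (y - \bar{y})^2}{n-1}} = \sqrt{\dfrac{6169.21}{12}} = 22.67$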
Review
Second, calculate r:
$r = \dfrac{1014.66}{(13-1)(22.67)(5.02)} = \dfrac{1014.66}{1365.64} = 0.743$
Third, calculate b1:
$b_1 = (0.743)\dfrac{22.67}{5.02} = 3.36$
Review
Fourth, calculate $\bar{x}$ and $\bar{y}$:
$\bar{x} = \dfrac{94}{13} = 7.23$
$\bar{y} = \dfrac{1280}{13} = 98.46$
Fifth, calculate $b_0$ (we’re almost done!!):
$b_0 = 98.46 - 3.36(7.23) = 74.17$
Review
Last, but definitely the most important,
WRITE DOWN THE EQUATION IN THE
CONTEXT OF THE PROBLEM:
Calories = 74.17 + 3.36(sugar)
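The whole review calculation can be checked with a short Python sketch that starts from the sums given in the slides. It carries full precision instead of rounding at each step, so intermediate values may differ slightly in the last digit.

```python
import math

# Sums given in the review slides (x = sugar in grams, y = calories).
n = 13
sum_x, sum_y = 94, 1280
sxy = 1014.66   # sum of (x - x_bar)(y - y_bar)
syy = 6169.21   # sum of (y - y_bar)^2
sxx = 301.97    # sum of (x - x_bar)^2

s_x = math.sqrt(sxx / (n - 1))
s_y = math.sqrt(syy / (n - 1))
r = sxy / ((n - 1) * s_x * s_y)

b1 = r * s_y / s_x            # slope
x_bar, y_bar = sum_x / n, sum_y / n
b0 = y_bar - b1 * x_bar       # intercept

print(f"s_x = {s_x:.2f}, s_y = {s_y:.2f}, r = {r:.3f}")
print(f"Calories = {b0:.2f} + {b1:.2f}(sugar)")
print(f"r^2 = {r ** 2:.2f}")
```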
Review
Interpret b1:
For every one gram increase in sugar, the
number of calories will increase by 3.36, on average.
Interpret $r^2$:
About 55% of the variability in the number of
calories in cereal can be explained by the LS
regression of calories on sugar content.