Transcript 10-3 PartII

Example: Old Faithful
Given the sample data in Table 10-1,
find the regression equation. Question: Is there
a correlation between duration time of eruptions
and the time interval after the eruption?
Duration (sec)   Interval After (min)
240              92
120              65
178              72
234              94
235              83
269              94
255              101
220              87
Slide
1
Solution
Using the same procedure as in the previous example, we find
that b1 = 0.234 and b0 = 34.8. Hence, the estimated
regression equation is:
y-hat = 34.8 + 0.234x
Slide
2
Example: Old Faithful - cont
Given the sample data in Table 10-1,
find the regression equation.
Slide
3
Example: Old Faithful
Given the sample data in Table 10-1, we found that the
regression equation is y-hat = 34.8 + 0.234x. Assuming
that the current eruption has a duration of x = 180 sec,
find the best predicted value of y, the time interval
after this eruption.
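The substitution can be sketched in a few lines of Python. The coefficients are the rounded values reported on the earlier slide; the function name predict_interval is only an illustrative choice, not something from the course materials.

```python
# Rounded coefficients of the regression equation y-hat = 34.8 + 0.234x.
b0 = 34.8    # intercept (minutes)
b1 = 0.234   # slope (minutes per second of eruption duration)

def predict_interval(duration_sec):
    """Predicted interval (minutes) after an eruption lasting duration_sec seconds."""
    return b0 + b1 * duration_sec

y_hat_180 = predict_interval(180)
print(round(y_hat_180, 1))  # 76.9
```

So an eruption lasting 180 seconds gives a best predicted interval of about 76.9 minutes.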
Slide
4
Part 2: Beyond the Basics of Regression
Predictions
In predicting a value of y based on
some given value of x ...
1. If there is not a linear correlation, the
best predicted y-value is y-bar (the mean of the sample y-values).
2. If there is a linear correlation, the
best predicted y-value is found by
substituting the x-value into the
regression equation.
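The two rules can be sketched as one small Python function. This is a minimal sketch with some assumptions: the helper names are made up for illustration, "linear correlation" is decided by comparing |r| with a critical value supplied by the caller, and 0.707 is assumed to be the critical value of r for n = 8 at the 0.05 significance level.

```python
from math import sqrt

def fit_summary(xs, ys):
    """Return (x_bar, y_bar, sxy, sxx, syy) for paired data."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return x_bar, y_bar, sxy, sxx, syy

def best_prediction(xs, ys, x_new, r_critical):
    """Rule 1: return y-bar if |r| does not exceed r_critical.
    Rule 2: otherwise substitute x_new into the regression equation."""
    x_bar, y_bar, sxy, sxx, syy = fit_summary(xs, ys)
    r = sxy / sqrt(sxx * syy)
    if abs(r) <= r_critical:
        return y_bar                 # no linear correlation
    b1 = sxy / sxx                   # slope
    b0 = y_bar - b1 * x_bar          # intercept
    return b0 + b1 * x_new

duration = [240, 120, 178, 234, 235, 269, 255, 220]   # seconds
interval = [92, 65, 72, 94, 83, 94, 101, 87]          # minutes

# Assumed critical value of r for n = 8 at the 0.05 significance level.
prediction = best_prediction(duration, interval, 180, r_critical=0.707)
print(round(prediction, 1))  # 76.9
```

Here r is about 0.926, which exceeds the assumed critical value, so the regression equation is used rather than y-bar.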
Slide
5
Guidelines for Using The
Regression Equation
1. If there is no linear correlation, don’t use the
regression equation to make predictions.
2. When using the regression equation for
predictions, stay within the scope of the
available sample data (no extrapolating!).
3. A regression equation based on old data is
not necessarily valid now.
4. Don’t make predictions about a population
that is different from the population from
which the sample data were drawn.
Slide
6
CwK p. 553
#7 and 8!
Slide
7
Definitions

Marginal Change
The marginal change is the amount that a variable changes
when the other variable changes by exactly one unit.
Example: The regression line y-hat = 34.8 + 0.234x has a slope of 0.234.
Interpretation: If we increase x (duration time) by 1 second, the
predicted time interval after the eruption will increase by 0.234
minutes.

Outlier
An outlier is a point lying far away from the other data points.

Influential Point
An influential point strongly affects the graph of the
regression line.
Slide
8
Definition
Residual
The residual for a sample of paired (x, y) data is the
difference (y - y-hat) between an observed sample y-value
and the value of y-hat, which is the value of y that is
predicted by using the regression equation.
residual = observed y - predicted y = y - y-hat
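As an illustration, the residuals for the Old Faithful sample can be computed directly from this definition. The sketch below assumes the rounded equation y-hat = 34.8 + 0.234x from the earlier slides, so the values differ slightly from what unrounded coefficients would give.

```python
duration = [240, 120, 178, 234, 235, 269, 255, 220]   # x, seconds
interval = [92, 65, 72, 94, 83, 94, 101, 87]          # observed y, minutes

b0, b1 = 34.8, 0.234   # rounded regression coefficients

# Predicted y for each x, then residual = observed y - predicted y.
predicted = [b0 + b1 * x for x in duration]
residuals = [y - y_hat for y, y_hat in zip(interval, predicted)]

for x, y, e in zip(duration, interval, residuals):
    print(f"x={x:3d}  y={y:3d}  residual={e:6.2f}")
```

For the first pair (240, 92), the predicted value is 90.96, so the residual is 92 - 90.96 = 1.04.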
Slide
9
Example
• Find the regression line for the following table:
   X    Y
   1    4
   2    24
   4    8
   5    32
• Find y-hat!
• Find residuals
• Graph residuals
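One way to check the hand calculation for this example is the short Python sketch below. It uses the standard least-squares formulas for the slope and intercept; the variable names are illustrative.

```python
xs = [1, 2, 4, 5]
ys = [4, 24, 8, 32]

n = len(xs)
x_bar = sum(xs) / n   # 3.0
y_bar = sum(ys) / n   # 17.0

# Slope b1 = Sxy / Sxx, intercept b0 = y-bar - b1 * x-bar.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Residual = observed y - predicted y.
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

print(f"y-hat = {b0:g} + {b1:g}x")   # y-hat = 5 + 4x
print(residuals)                     # [-5.0, 11.0, -13.0, 7.0]
```

The residual graph for this table would then plot the points (1, -5), (2, 11), (4, -13), and (5, 7).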
Slide
10
Definitions
• Least-Squares Property
A straight line has the least-squares property if the sum of the
squares of the residuals is the smallest sum possible.
• Residual Plot
A scatterplot of the (x, y) values after each of the y-coordinate values
has been replaced by the residual value y - y-hat. That is, a residual plot
is a graph of the points (x, y - y-hat).
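The least-squares property can be checked numerically on the four-point example table (x = 1, 2, 4, 5 and y = 4, 24, 8, 32). The sketch below assumes that y-hat = 5 + 4x is the least-squares line for that table, which follows from the usual slope and intercept formulas, and compares it with another line chosen only for illustration.

```python
xs = [1, 2, 4, 5]
ys = [4, 24, 8, 32]

def sum_squared_residuals(b0, b1):
    """Sum of squared residuals for the line y-hat = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

best = sum_squared_residuals(5, 4)     # least-squares line y-hat = 5 + 4x
other = sum_squared_residuals(-3, 7)   # line through (1, 4) and (5, 32)

print(best, other)  # 364 458
```

The second line passes exactly through two of the data points, yet its sum of squared residuals is larger, which is what the least-squares property predicts.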
Slide
11
Residual Plot Analysis
If a residual plot does not reveal any pattern, the
regression equation is a good representation of the
association between the two variables.
If a residual plot reveals some systematic pattern, the
regression equation is not a good representation of
the association between the two variables.
Slide
12
Residual Plots
Good model for the data; points are close to LSRL, no distinct
pattern
Slide
13
Residual Plots
Association is NOT linear; distinct pattern = linear model not a good
model in this case
Slide
14
Residual Plots
Shows pattern of increasing variation; violates requirement that for
different values of x, the distributions of y values have the same variance.
Slide
15
The SAT essay: longer is better?
(An observational study)
Following the debut of the new SAT writing test in March 2005, Dr. Les Perelman from M.I.T. stirred
controversy by reporting, “It appeared to me that regardless of what a student wrote, the longer the
essay, the higher the score.” He went on to say, “I have never found a quantifiable predictor in 25
years of grading that was anywhere as strong as this one. If you just graded them based on length
without ever reading them, you’d be right over 90 percent of the time.” The table below shows the
data set that Dr. Perelman used to draw his conclusions.
1) Identify the explanatory/response variables
2) Draw a scatter plot and the LSRL
3) Find the vital statistics
4) Interpret r and r-squared
5) Find the marginal change (slope interpretation).
6) Use the regression line to estimate the score of a paper that is 390 words
long.
7) Graph the residual plot. Do you feel confident that you could come up with
a good estimate for a score using the regression line? Explain.
Words  Score    Words  Score    Words  Score
460    6        401    6        135    3
422    6        388    6        73     1
402    5        320    5
365    5        258    4
357    6        236    4
278    5        189    3
236    4        128    2
201    4        67     1
168    4        697    6
156    3        387    6
133    2        355    5
114    2        337    5
108    1        325    4
100    1        272    4
403    5        150    2
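A possible way to check answers to parts 3 through 7 is sketched below in Python. It treats words as the explanatory variable x and score as the response variable y, and uses the standard least-squares formulas; the variable names are illustrative, and a graphing calculator would report the same statistics.

```python
from math import sqrt

words = [460, 422, 402, 365, 357, 278, 236, 201, 168, 156, 133, 114, 108, 100, 403,
         401, 388, 320, 258, 236, 189, 128, 67, 697, 387, 355, 337, 325, 272, 150,
         135, 73]
scores = [6, 6, 5, 5, 6, 5, 4, 4, 4, 3, 2, 2, 1, 1, 5,
          6, 6, 5, 4, 4, 3, 2, 1, 6, 6, 5, 5, 4, 4, 2,
          3, 1]

n = len(words)
x_bar = sum(words) / n
y_bar = sum(scores) / n

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(words, scores))
sxx = sum((x - x_bar) ** 2 for x in words)
syy = sum((y - y_bar) ** 2 for y in scores)

b1 = sxy / sxx               # slope: predicted change in score per extra word
b0 = y_bar - b1 * x_bar      # intercept
r = sxy / sqrt(sxx * syy)    # linear correlation coefficient

predicted_390 = b0 + b1 * 390   # part 6: estimated score for a 390-word essay

# Part 7: residuals to plot against words.
residuals = [y - (b0 + b1 * x) for x, y in zip(words, scores)]

print(f"y-hat = {b0:.3f} + {b1:.4f}x")
print(f"r = {r:.3f}, r-squared = {r * r:.3f}")
print(f"predicted score at 390 words = {predicted_390:.2f}")
```

Plotting the residuals list against words gives the residual plot for part 7. Note that the 697-word essay lies far from the other x-values, so it is worth checking whether it acts as an influential point.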
Slide
16
P. 553
Do #7, #8 then:
Answer each of the following questions for #16 and #17
a) Is there a linear correlation? Use your calculator commands
to find the p-value, then the critical values from Table A-6 to
prove it. Is your answer the same for each one?
b) Graph the points (don’t forget axis labels)
c) Find the vital statistics (r, r-squared, a, b, y-hat – don’t forget
to define x and y)
d) Tell me what r and r-squared mean in the context of the
problem (r: form, direction, strength) (r-squared: how much
of the variation in y can be explained by the linear relationship with x)
e) Find the residuals
f) Draw the residual plot – is the regression line a good model
for the data? Why?
Slide
17