
1 Functions and Applications
Copyright © Cengage Learning. All rights reserved.
1.4 Linear Regression
Linear Regression
To find a linear model given two data points, we find the equation of the line that passes through them. In practice, however, we often have more than two data points, and they will rarely all lie on a single straight line, though they may come close to doing so. The problem is then to find the line that comes closest to passing through all of the points.
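For instance, here is a small Python sketch (added for illustration; not part of the original slides) that computes the slope and intercept of the line through two given points:

# Line through two points: slope m and intercept b of y = mx + b.
def line_through(p1, p2):
    (x1, y1), (x2, y2) = p1, p2
    m = (y2 - y1) / (x2 - x1)   # slope = rise over run
    b = y1 - m * x1             # intercept, from y1 = m*x1 + b
    return m, b

# Example: the line through (0, 2) and (3, 6) is y = (4/3)x + 2.
print(line_through((0, 2), (3, 6)))   # (1.333..., 2.0)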
Linear Regression
Suppose, for example, that we are conducting research for
a company interested in expanding into Mexico. Of interest
to us would be current and projected growth in that
country’s economy.
The following table shows past and projected per capita
gross domestic product (GDP) of Mexico for 2000–2014.
Linear Regression
A plot of these data suggests roughly linear growth of the GDP (Figure 27(a)).

Figure 27(a): These points suggest a roughly linear relationship between t and y, although they clearly do not all lie on a single straight line.
Linear Regression
Figure 27(b)

Figure 27(b) shows the points together with several lines, some fitting better than others. Can we precisely measure which lines fit better? For instance, which of the two lines labeled "good fit" in Figure 27(b) models the data more accurately?
Linear Regression
We begin by considering, for each value of t, the difference between the actual GDP (the observed value) and the GDP predicted by a linear equation (the predicted value). The difference between the observed value and the predicted value is called the residual:

Residual = Observed Value – Predicted Value
Linear Regression
On the graph, the residuals measure the vertical distances between the (observed) data points and the line (Figure 28), and they tell us how far the linear model is from predicting the actual GDP.
Figure 28
Linear Regression
The more accurate our model, the smaller the residuals
should be.
We can combine all the residuals into a single measure of
accuracy by adding their squares. (We square the residuals
in part to make them all positive.)
The sum of the squares of the residuals is called the
sum-of-squares error, SSE. Smaller values of SSE
indicate more accurate models.
Linear Regression
Observed and Predicted Values

Suppose we are given a collection of data points (x1, y1), …, (xn, yn). The n quantities y1, y2, …, yn are called the observed y-values. If we model these data with a linear equation

ŷ = mx + b,   (ŷ stands for "estimated y" or "predicted y")

then the y-values we get by substituting the given x-values into the equation are called the predicted y-values:

ŷ1 = mx1 + b   (substitute x1 for x)
ŷ2 = mx2 + b   (substitute x2 for x)
…
ŷn = mxn + b.   (substitute xn for x)
Linear Regression
Quick Example
Consider the three data points (0, 2), (2, 5), and (3, 6). The
observed y-values are y1 = 2, y2 = 5, and y3 = 6. If we
model these data with the equation ŷ = x + 2.5, then the
predicted values are:
ŷ1 = x1 + 2.5 = 0 + 2.5 = 2.5
ŷ2 = x2 + 2.5 = 2 + 2.5 = 4.5
ŷ3 = x3 + 2.5 = 3 + 2.5 = 5.5.
Linear Regression
Residuals and Sum-of-Squares Error (SSE)
If we model a collection of data (x1, y1), …, (xn, yn) with a
linear equation ŷ = mx + b, then the residuals are the
n quantities (Observed Value – Predicted Value):
(y1 – ŷ1), (y2 – ŷ2), …, (yn – ŷn).
The sum-of-squares error (SSE) is the sum of the
squares of the residuals:
SSE = (y1 – ŷ1)² + (y2 – ŷ2)² + … + (yn – ŷn)².
Linear Regression
Quick Example
For the data and linear approximation given above, the
residuals are:
y1 – ŷ1 = 2 – 2.5 = –0.5
y2 – ŷ2 = 5 – 4.5 = 0.5
y3 – ŷ3 = 6 – 5.5 = 0.5
and so SSE = (–0.5)² + (0.5)² + (0.5)² = 0.75.
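To make the computation concrete, here is a short Python sketch (added for illustration; not part of the original slides) that reproduces this Quick Example, computing the predicted values, the residuals, and SSE for the model ŷ = x + 2.5:

# Quick Example data: (0, 2), (2, 5), (3, 6)
xs = [0, 2, 3]
ys = [2, 5, 6]

# Predicted values under the model y-hat = x + 2.5
predicted = [x + 2.5 for x in xs]                     # [2.5, 4.5, 5.5]

# Residuals: observed value minus predicted value
residuals = [y - yh for y, yh in zip(ys, predicted)]  # [-0.5, 0.5, 0.5]

# Sum-of-squares error: add the squares of the residuals
sse = sum(r ** 2 for r in residuals)                  # 0.75
print(predicted, residuals, sse)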
Example 1 – Computing SSE
Using the data above on the GDP in Mexico, compute SSE
for the linear models y = 0.5t + 8 and y = 0.25t + 9. Which
model is the better fit?
Solution:
We begin by creating a table showing the values of t, the observed (given) values of y, and the values predicted by the first model.
Example 1 – Solution (cont'd)
We now add two new columns for the residuals and their
squares.
SSE, the sum of the squares of the residuals, is then the sum of the entries in the last column: SSE = 8.
Example 1 – Solution (cont'd)
Repeating the process using the second model, y = 0.25t + 9, yields the following table:

This time SSE = 2, and so the second model is a better fit.
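This comparison is easy to script. Since the GDP table itself is not reproduced in this transcript, the Python sketch below (an illustration, not the book's computation) defines a reusable SSE function and compares two candidate lines on the Quick Example data instead:

# SSE for any linear model y = m*x + b on data (xs, ys).
def sse(xs, ys, m, b):
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

xs = [0, 2, 3]
ys = [2, 5, 6]

# The smaller SSE indicates the better-fitting model.
print(sse(xs, ys, 1.0, 2.5))   # y = x + 2.5      -> 0.75
print(sse(xs, ys, 1.5, 2.0))   # y = 1.5x + 2.0   -> 0.25, a better fit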
Example 1 – Solution (cont'd)
Figure 29 shows the data points and the two linear models
in question.
Figure 29
Linear Regression
Among all possible lines, there ought to be one with the
least possible value of SSE—that is, the greatest possible
accuracy as a model.
The line (and there is only one such line) that minimizes the
sum of the squares of the residuals is called the
regression line, the least-squares line, or the best-fit
line.
To find the regression line, we need a way to find values
of m and b that give the smallest possible value of SSE.
Linear Regression
Regression Line

The regression line (least-squares line, best-fit line) associated with the points (x1, y1), (x2, y2), …, (xn, yn) is the line that gives the minimum value of SSE.
Linear Regression
The regression line is

y = mx + b,

where m and b are computed as follows:

m = (n·Σ(xy) – (Σx)(Σy)) / (n·Σ(x²) – (Σx)²)
b = (Σy – m·(Σx)) / n
n = number of data points.

The quantities m and b are called the regression coefficients.
Linear Regression
Here, "Σ" means "the sum of." Thus, for example,

Σx = Sum of the x-values = x1 + x2 + … + xn
Σxy = Sum of products = x1y1 + x2y2 + … + xnyn
Σx² = Sum of the squares of the x-values = x1² + x2² + … + xn².

On the other hand,

(Σx)² = Square of Σx = Square of the sum of the x-values.
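As a concrete check (added for illustration; not part of the original slides), the following Python sketch evaluates these formulas for the three Quick Example points:

# Regression coefficients m and b from the summation formulas above.
def regression_coefficients(xs, ys):
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x ** 2 for x in xs)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = (sum_y - m * sum_x) / n
    return m, b

m, b = regression_coefficients([0, 2, 3], [2, 5, 6])
print(m, b)   # m = 19/14 ≈ 1.357, b = 29/14 ≈ 2.071

The resulting line y ≈ 1.357x + 2.071 has SSE ≈ 0.071, smaller than the 0.75 obtained above for y = x + 2.5, as the regression line must be.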
Coefficient of Correlation
Coefficient of Correlation
If the data points do not all lie on one straight line, we would like to be able to measure how closely they can be approximated by a straight line.
We know that SSE measures the sum of the squares of the
deviations from the regression line; therefore it constitutes
a measurement of what is called “goodness of fit.” (For
instance, if SSE = 0, then all the points lie on a straight
line.)
However, SSE depends on the units we use to measure y,
and also on the number of data points (the more data
points we use, the larger SSE tends to be).
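A quick numeric illustration of the dependence on units (again an added sketch, not from the slides): rescaling the y-values and the model with them, say from thousands of dollars to dollars, multiplies every residual by 1,000 and hence SSE by 1,000², even though the quality of the fit is unchanged.

# SSE depends on the units of y: rescaling y rescales SSE.
def sse(xs, ys, m, b):
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

xs = [0, 2, 3]
ys = [2, 5, 6]                    # say, in thousands of dollars
print(sse(xs, ys, 1.0, 2.5))      # 0.75

ys_dollars = [1000 * y for y in ys]          # same data in dollars
print(sse(xs, ys_dollars, 1000.0, 2500.0))   # 750000.0 (same fit, larger SSE)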
Coefficient of Correlation
Thus, while we can (and do) use SSE to compare the
goodness of fit of two lines to the same data, we cannot
use it to compare the goodness of fit of one line to one set
of data with that of another to a different set of data.
To remove this dependency, statisticians have found a
related quantity that can be used to compare the goodness
of fit of lines to different sets of data.
This quantity, called the coefficient of correlation or correlation coefficient and usually denoted r, is always between –1 and 1. The closer r is to –1 or 1, the better the fit.
Coefficient of Correlation
For an exact fit, we would have r = –1 (for a line with
negative slope) or r = 1 (for a line with positive slope). For a
bad fit, we would have r close to 0.
Figure 31 shows several collections of data points with their least-squares lines and the corresponding values of r.
Figure 31
Coefficient of Correlation
Correlation Coefficient

The coefficient of correlation of the n data points (x1, y1), (x2, y2), …, (xn, yn) is

r = (n·Σ(xy) – (Σx)(Σy)) / (√(n·Σ(x²) – (Σx)²) · √(n·Σ(y²) – (Σy)²)).

It measures how closely the data points (x1, y1), (x2, y2), …, (xn, yn) fit the regression line. (The value r² is sometimes called the coefficient of determination.)
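As an illustration (added; not part of the original slides), this Python sketch evaluates the formula for the three Quick Example points:

# Correlation coefficient r from the formula above.
def correlation_coefficient(xs, ys):
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x ** 2 for x in xs)
    sum_y2 = sum(y ** 2 for y in ys)
    num = n * sum_xy - sum_x * sum_y
    den = ((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)) ** 0.5
    return num / den

print(correlation_coefficient([0, 2, 3], [2, 5, 6]))   # ≈ 0.996, a good fit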
Coefficient of Correlation
Interpretation
If r is positive, the regression line has positive slope; if r is
negative, the regression line has negative slope.
If r = 1 or –1, then all the data points lie exactly on the
regression line; if it is close to ±1, then all the data points
are close to the regression line.
On the other hand, if r is not close to ±1, then the data
points are not close to the regression line, so the fit is not a
good one. As a general rule of thumb, a value of |r| less
than around 0.8 indicates a poor fit of the data to the
regression line.
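To see these interpretations numerically (an added illustration using made-up points), compare r for points that lie exactly on a line with r for points that do not, using the correlation_coefficient sketch above:

print(correlation_coefficient([0, 1, 2], [1, 3, 5]))   #  1.0: exactly on a line of positive slope
print(correlation_coefficient([0, 1, 2], [5, 3, 1]))   # -1.0: exactly on a line of negative slope
print(correlation_coefficient([0, 1, 2], [1, 5, 2]))   # ≈ 0.24: |r| < 0.8, a poor fit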
Example 3 – Computing the Coefficient of Correlation
Using the table given earlier of past and projected per capita gross domestic product (GDP) of Mexico for 2000–2014, find the correlation coefficient. Is the regression line a good fit?
Example 3 – Solution
The formula for r requires Σx, Σx², Σxy, Σy, and Σy². Let's organize our work in the form of a table, where the original data are entered in the first two columns and the bottom row contains the column sums.
Example 3 – Solution (cont'd)
Substituting these values into the formula, we get
As r is close to 1, the fit is a fairly good one; that is, the
original points lie nearly along a straight line.