Descriptive Methods in Regression and Correlation


Chapter 4
Descriptive Methods in
Regression and Correlation
Slide 2
C4S1 – Linear Equation with One
Independent Variable
Linear Equations
Linear equations with one independent variable can be
written as
y = b0 + b1x
b0 and b1 are constants (fixed numbers), x is the
independent variable, and y is the dependent variable.
The graph of a linear equation is a straight line. y = mx + b
Slide 3
y = mx + b
y-intercept: (0, y)
x-intercept: (x, 0)
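As a small illustration of the intercept notation, the following Python sketch computes both intercepts of a line y = b0 + b1x. The function name `intercepts` and the example coefficients are made up for illustration and are not from the slides.

```python
def intercepts(b0, b1):
    """Return the y-intercept and x-intercept of the line y = b0 + b1*x."""
    y_int = (0.0, b0)  # set x = 0
    # Set y = 0 and solve for x; a horizontal line (b1 == 0) has no single x-intercept.
    x_int = None if b1 == 0 else (-b0 / b1, 0.0)
    return y_int, x_int

# Hypothetical line y = 6 - 2x
y_int, x_int = intercepts(b0=6.0, b1=-2.0)
print(y_int, x_int)  # (0.0, 6.0) (3.0, 0.0)
```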
Slide 4
Figure 4.6
Positive Slope
Rises left to right
Negative Slope
Falls left to right
Horizontal Line
Has a slope of 0
Slide 5
C4S2 – The Regression Equation
Plotting the data in a scatterplot helps us visualize any apparent
relationship between x and y. Generally speaking, a scatterplot
(or scatter diagram) is a graph of data from two quantitative
variables of a population. To construct a scatterplot, we use a
horizontal axis for the observations of one variable and a vertical
axis for the observations of the other. Each pair of observations is
then plotted as a point.
Slide 6
Because we could draw many different lines through the
cluster of data points, we need a method to choose the
“best” line.
The method, called the least-squares criterion, is based
on an analysis of the errors made in using a line to fit the
data points.
ŷ = b0 + b1x
Slide 7
To avoid confusion, we use ŷ to denote the y-value predicted for a
value of x.
To measure quantitatively how well a line fits the data, we first
consider the errors, e, made in using the line to predict the
y-values of the data points.
In general, an error, e, is the signed vertical distance from the line to a
data point. The error made in using the line to predict the y-value is
e = y − ŷ
To decide which line best fits the data, we compute the sum of the
squared errors,
$\sum e_i^2$
The line with the smaller sum of squared errors is the one that fits
the data better.
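The comparison described here can be sketched in Python. The data and the two candidate lines below are invented for illustration, and `sum_squared_errors` is a hypothetical helper name.

```python
def sum_squared_errors(x, y, b0, b1):
    """Sum of e_i^2, where e_i = y_i - (b0 + b1 * x_i)."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical data, invented for illustration only
x = [1, 2, 3, 4]
y = [2, 3, 5, 6]

sse_a = sum_squared_errors(x, y, b0=1.0, b1=1.0)  # candidate line A: y-hat = 1 + x
sse_b = sum_squared_errors(x, y, b0=0.5, b1=1.4)  # candidate line B: y-hat = 0.5 + 1.4x

# The line with the smaller sum of squared errors fits these data better.
better = "B" if sse_b < sse_a else "A"
print(round(sse_a, 4), round(sse_b, 4), better)
```

For these made-up points, line B has the smaller sum of squared errors, so it is the better fit of the two.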
Slide 8
Slide 9
The regression equation for a set of n data points is
$\hat{y} = b_0 + b_1 x$
y-intercept:
$b_0 = \frac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{n \sum x^2 - (\sum x)^2}$
slope:
$b_1 = \frac{n \sum xy - (\sum x)(\sum y)}{n \sum x^2 - (\sum x)^2}$
Mean for y:
$\bar{y} = \frac{\sum y}{n}$
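A minimal Python sketch of the sum-based formulas for the slope and y-intercept follows. The function name `least_squares_line` and the data are assumptions made for illustration.

```python
def least_squares_line(x, y):
    """Return (b0, b1) for y-hat = b0 + b1*x using the sum-based formulas."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    denom = n * sum_x2 - sum_x ** 2  # zero only if all x-values are equal
    b1 = (n * sum_xy - sum_x * sum_y) / denom
    b0 = (sum_y * sum_x2 - sum_x * sum_xy) / denom
    return b0, b1

# Hypothetical data, invented for illustration only
x = [1, 2, 3, 4]
y = [2, 3, 5, 6]
b0, b1 = least_squares_line(x, y)
y_bar = sum(y) / len(y)  # mean for y
print(b0, b1, y_bar)
```

With these made-up points the sums are Σx = 10, Σy = 16, Σxy = 47, and Σx² = 30, which gives b1 = 28/20 = 1.4 and b0 = 10/20 = 0.5.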
Slide 10
Extrapolation
Suppose that a scatterplot indicates a linear relationship
between two variables.
Then, within the range of the observed values of the
predictor variable, we can reasonably use the regression
equation to make predictions for the response variable.
However, to do so outside that range, which is called
extrapolation, may not be reasonable because the linear
relationship between the predictor and response variables
may not hold there.
Grossly incorrect predictions can result from extrapolation.
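One way to guard against accidental extrapolation is to flag any prediction made outside the observed range of the predictor. The sketch below is only an illustration; the function `predict`, the coefficients, and the data are hypothetical.

```python
def predict(x_new, b0, b1, x_observed):
    """Predict y at x_new and flag whether x_new lies outside the observed x-range."""
    lo, hi = min(x_observed), max(x_observed)
    is_extrapolation = x_new < lo or x_new > hi
    return b0 + b1 * x_new, is_extrapolation

x_observed = [1, 2, 3, 4]  # hypothetical predictor values
b0, b1 = 0.5, 1.4          # hypothetical fitted coefficients

inside = predict(2.5, b0, b1, x_observed)   # within the observed range
outside = predict(10, b0, b1, x_observed)   # extrapolation: treat with caution
print(inside, outside)
```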
Slide 11
Outliers and Influential Observations
An outlier is an observation that lies outside the overall pattern of
the data.
In the context of regression, an outlier is a data point that lies far
from the regression line, relative to the other data points.
An outlier can sometimes have a significant effect on a regression
analysis.
We must also watch for influential observations.
In regression analysis, an influential observation is a data point
whose removal causes the regression equation (and line) to
change considerably.
A data point separated in the x-direction from the other data points
is often an influential observation because the regression line is
“pulled” toward such a data point without counteraction by other
data points.
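A simple check for an influential observation is to fit the line with and without the suspect point and compare the results. The sketch below uses invented data in which the last point is separated from the others in the x-direction; `slope` is a hypothetical helper.

```python
def slope(points):
    """Least-squares slope b1 for a list of (x, y) pairs."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_x2 = sum(x * x for x, _ in points)
    return (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

# Hypothetical data; the last point sits far from the others in the x-direction
points = [(1, 2), (2, 3), (3, 5), (4, 6), (10, 2)]

slope_all = slope(points)
slope_without_last = slope(points[:-1])
print(slope_all, slope_without_last)
```

Here removing the single point changes the slope from about -0.1 to 1.4, so the regression line changes considerably.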
Slide 12
Regression analysis is used when you want to show whether, and
how, one variable can predict changes in another
variable.
r is the correlation between x and y
sx and sy are the standard deviations of x and y
Slope of best-fit line:
$m = r \left( \frac{s_y}{s_x} \right)$
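The slope relation can be checked numerically. The sketch below assumes sample standard deviations (divisor n - 1) and uses invented data; for these points the sum-based least-squares slope is 1.4.

```python
from statistics import mean, stdev

# Hypothetical data, invented for illustration only
x = [1, 2, 3, 4]
y = [2, 3, 5, 6]
n = len(x)

x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)  # sample standard deviations

# Linear correlation coefficient between x and y
r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)

m = r * s_y / s_x  # slope of the best-fit line
print(round(r, 4), round(m, 4))
```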
Slide 13
C4S3 – The Coefficient of Determination
Slide 14
The coefficient of determination, r², always lies between 0 and 1.
r² near 0 suggests that the regression equation is not very useful
for making predictions.
r² near 1 suggests that the regression equation is quite useful for
making predictions.
It shows whether predicting with the regression equation improves on
simply predicting with the mean of the observed y-values.
It is the proportion (percentage) of the variation in the observed
y-values that is explained by the regression.
Slide 15
Regression Identity
The total sum of squares equals the regression
sum of squares plus the error sum of squares.
SST = SSR + SSE
This equation is always true for the least-squares regression line.
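The regression identity can be verified on a small example. The sketch below uses invented data and assumes the coefficients 0.5 and 1.4, which are the least-squares values for these particular points; the identity is not expected to hold for an arbitrary line.

```python
# Hypothetical data, invented for illustration only
x = [1, 2, 3, 4]
y = [2, 3, 5, 6]
b0, b1 = 0.5, 1.4  # least-squares coefficients for these data

y_bar = sum(y) / len(y)
y_hat = [b0 + b1 * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # regression sum of squares
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # error sum of squares

r_squared = ssr / sst  # coefficient of determination
print(round(sst, 4), round(ssr, 4), round(sse, 4), round(r_squared, 4))
```

For these points SST = 10, SSR = 9.8, and SSE = 0.2, so about 98% of the variation in y is explained by the regression.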
Slide 16
C4S4 – Linear Correlation
We hear statements like “there is a positive correlation between x and y” and
“x and y are uncorrelated”; these are explained in this section.
The linear correlation coefficient, r, measures the strength of the linear relationship
between two variables.
[Formula for r that reveals its meaning and basic properties]
[Formula for r used for hand calculations]
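The two formulas for r are not legible in this transcript. The Python sketch below assumes the standard textbook pair: a defining formula based on deviations from the means, and a computing formula based on sums. Both function names and the data are hypothetical.

```python
from math import sqrt

def r_defining(x, y):
    """r = sum((x - x_bar)(y - y_bar)) / ((n - 1) * s_x * s_y)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_x = sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return cross / ((n - 1) * s_x * s_y)

def r_computing(x, y):
    """r = S_xy / sqrt(S_xx * S_yy), with each S computed from raw sums."""
    n = len(x)
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    s_xx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
    s_yy = sum(yi ** 2 for yi in y) - sum(y) ** 2 / n
    return s_xy / sqrt(s_xx * s_yy)

# Hypothetical data, invented for illustration only
x = [1, 2, 3, 4]
y = [2, 3, 5, 6]
r1, r2 = r_defining(x, y), r_computing(x, y)
print(round(r1, 4), round(r2, 4))
```

The two versions agree up to rounding; the sum-based one needs fewer intermediate steps, which is why it suits hand calculation.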
Slide 17
Understanding the Linear Correlation Coefficient
r is independent of the choice of units and always lies between
-1 and 1.
If r is close to ±1, there is a strong linear relationship and x is useful for
predicting y. The regression equation is extremely useful, and the data
points are clustered closely about the regression line.
If r is near 0, the linear relationship is weak and x is a poor predictor of y. The data
points are essentially scattered about a horizontal line.
Keep in mind that r measures the strength of the linear relationship
between two variables and that the following properties of r are
meaningful only when the data points are scattered about a line.
• r reflects the slope of the scatterplot.
• The magnitude of r indicates the strength of the linear
relationship.
• The sign of r suggests the type of linear relationship.
• The sign of r and the sign of the slope of the regression line are
identical.
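Two of these properties, independence from units and the meaning of the sign, can be illustrated with a short Python sketch. The data, the metre-to-centimetre conversion, and the helper `correlation` are all invented for illustration.

```python
from math import sqrt

def correlation(x, y):
    """Linear correlation coefficient r of paired data."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / sqrt(s_xx * s_yy)

# Hypothetical data: x measured in metres
x_m = [1, 2, 3, 4]
y = [2, 3, 5, 6]

x_cm = [100 * xi for xi in x_m]  # same measurements in centimetres
r_m = correlation(x_m, y)
r_cm = correlation(x_cm, y)      # unchanged by the change of units

y_flipped = [-yi for yi in y]    # reverses the direction of the relationship
r_flipped = correlation(x_m, y_flipped)
print(round(r_m, 4), round(r_cm, 4), round(r_flipped, 4))
```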
Slide 18
Understanding the Linear Correlation Coefficient
To graphically portray the meaning of the linear correlation coefficient,
we present various degrees of linear correlation in Fig. 4.17.
Figure 4.17
Slide 19
Relationship Between the Correlation Coefficient
and the Coefficient of Determination

The coefficient of determination, r², is a descriptive measure
of the utility of the regression equation for making predictions.

The coefficient of determination, r², equals the square of the
linear correlation coefficient, r.

The linear correlation coefficient, r, is a descriptive measure of the
strength of the linear relationship between two variables.

Because the linear correlation coefficient describes the strength of
the linear relationship between two variables, it should be used
as a descriptive measure only when a scatterplot indicates
that the data points are scattered about a line.
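The claim that the coefficient of determination equals the square of r can be checked with a short derivation. The sketch below assumes the least-squares line and the regression identity SST = SSR + SSE; the shorthand S_xx, S_xy, and S_yy is introduced here for convenience and does not appear in the slides.

```latex
\begin{align*}
S_{xx} &= \sum (x - \bar{x})^2, \qquad
S_{xy} = \sum (x - \bar{x})(y - \bar{y}), \qquad
S_{yy} = \sum (y - \bar{y})^2 = SST \\
b_1 &= \frac{S_{xy}}{S_{xx}}, \qquad
\hat{y} - \bar{y} = b_1 (x - \bar{x}) \\
SSR &= \sum (\hat{y} - \bar{y})^2 = b_1^2 S_{xx} = \frac{S_{xy}^2}{S_{xx}} \\
\frac{SSR}{SST} &= \frac{S_{xy}^2}{S_{xx} S_{yy}}
  = \left( \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} \right)^2 = r^2
\end{align*}
```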
Slide 20
Relationship Between the Correlation Coefficient
and the Coefficient of Determination

When using the linear correlation coefficient, you must also watch
for outliers and influential observations because sample means
and sample standard deviations are not resistant to outliers and
other extreme values.

We cannot say that a value of r near 0 implies there is no
relationship, and we cannot say that values of r near ±1
imply that a linear relationship exists. Such interpretations are meaningful only when
the scatterplot indicates that the data points are scattered about
a line.
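A small example shows why r near 0 does not mean "no relationship". In the invented data below, y is completely determined by x through y = x², yet the linear correlation coefficient is 0 because the relationship is not linear.

```python
from math import sqrt

# Hypothetical data with a perfect, but nonlinear, relationship
x = [-2, -1, 0, 1, 2]
y = [xi ** 2 for xi in x]  # y is completely determined by x

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

r = s_xy / sqrt(s_xx * s_yy)
print(r)  # 0.0
```

A scatterplot of these points would show a clear curve, which is the cue that r should not be used as a descriptive measure here.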
Slide 21