
Chapter 7
Correlation, Bivariate Regression, and Multiple Regression
Pearson’s Product Moment Correlation





Correlation measures the association between two variables.
Correlation quantifies the extent to which the mean, variation, and direction of one variable are related to another variable.
r ranges from -1 to +1.
Correlation can be used for prediction.
Correlation does not indicate the cause of a relationship.
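As a concrete illustration, r is easy to compute directly (a minimal sketch with made-up numbers, not data from the chapter):

# Minimal sketch: computing Pearson's r (hypothetical data)
import numpy as np

x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])   # predictor X (made-up values)
y = np.array([1.0, 3.0, 4.0, 6.0, 9.0])   # criterion Y (made-up values)

# r = covariance(X, Y) / (SD_X * SD_Y)
r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))                          # near +1: strong positive association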
Scatter Plot


A scatter plot gives a visual description of the relationship between two variables.
The line of best fit is defined as the line that minimizes the squared vertical deviations from each data point up or down to the line.
Line of Best Fit Minimizes Squared Deviations from a Data Point to the Line
Always Do a Scatter Plot to Check the Shape of the Relationship
Will a Linear Fit Work?
[Scatter plot: x from 0 to 6, y from -4 to 2]
Will a Linear Fit Work?
y = 0.5246x - 2.2473
R² = 0.4259
[Scatter plot with linear fit line]
2nd Order Fit?
y = 0.0844x² + 0.1057x - 1.9492
R² = 0.4666
[Scatter plot with 2nd-order fit curve]
6th Order Fit?
y = 0.0341x⁶ - 0.6358x⁵ + 4.3835x⁴ - 13.609x³ + 18.224x² - 7.3526x - 2.0039
R² = 0.9337
[Scatter plot with 6th-order fit curve]
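Raising the polynomial order from 1 to 2 to 6 raised R² from .43 to .47 to .93 above, but the high-order curve is likely chasing individual points rather than a real trend. A minimal sketch of the same effect with made-up data:

# Sketch: R² rises with polynomial order even when the extra
# flexibility is only fitting noise (made-up data)
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.5, 6.0, 12)
y = 0.5 * x - 2.2 + rng.normal(0, 0.8, x.size)   # noisy linear relation

for order in (1, 2, 6):
    coefs = np.polyfit(x, y, order)              # least-squares polynomial fit
    resid = y - np.polyval(coefs, x)
    r2 = 1 - resid.var() / y.var()               # R² = proportion of variance explained
    print(order, round(r2, 3))                   # R² grows with order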
Will a Linear Fit Work?
y = 0.0012x - 1.0767
R² = 0.0035
[Scatter plot: x from 0 to 250, y from -4 to 2]
Linear Fit
[Scatter plot with fitted line: x from 0 to 250, y from -4 to 2]
Correlation Formulas
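Pearson's r is the covariance of X and Y divided by the product of their standard deviations. One common computational form (the slide's own notation may differ):

r = Σ(X - X̄)(Y - Ȳ) / √[Σ(X - X̄)² · Σ(Y - Ȳ)²]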
Evaluating the Strength of a Correlation


For predictions, an absolute value of r < .7 may produce unacceptably large errors, especially if the SDs of either or both X and Y are large.
As a general rule:
– Absolute value of r at or above .9 is good
– Absolute value of r of .7 - .8 is moderate
– Absolute value of r of .5 - .7 is low
– Values of r below .5 give R² of .25 (25%) or less, are poor, and thus not useful for prediction.
Significant Correlation??
If N is large (N = 90) then a .205 correlation is significant.
ALWAYS THINK ABOUT R²: how much variance in Y is X accounting for?
r = .205, so R² = .042; thus X is accounting for only 4.2% of the variance in Y.
This will lead to poor predictions.
A 95% confidence interval will also show how poor the prediction is.
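The usual significance test for a correlation is t = r·√(N-2)/√(1-r²) with N-2 degrees of freedom. A quick check of the slide's numbers (a sketch, not SPSS output):

# Sketch: testing r = .205 with N = 90
from math import sqrt
from scipy import stats

r, n = 0.205, 90
t = r * sqrt(n - 2) / sqrt(1 - r**2)     # t statistic, df = n - 2
p = 2 * stats.t.sf(abs(t), df=n - 2)     # two-tailed p value
print(round(t, 3), round(p, 4))          # p sits right at the .05 boundary,
                                         # yet R² is only about .042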
A Venn diagram shows R², the amount of variance in Y that is explained by X.
R² = .64 (64%): variance in Y that is explained by X.
Unexplained variance in Y: 1 - R² = .36 (36%).
The vertical distance (up or down) from a data point to the line of best fit is a RESIDUAL.
r = .845, R² = .714 (71.4%)
Y = mX + b
Y = .72X + 13
Calculation of Regression Coefficients (b, C)
– If r < .7, prediction will be poor.
– Large SDs adversely affect the accuracy of the prediction.
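For the bivariate prediction equation Y' = bX + C, the standard textbook forms are (the slide presumably presents these as images):

b = r · (SDY / SDX)
C = Ȳ - b·X̄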
Standard Error of Estimate (SEE): the standard deviation of the residuals (prediction errors) when predicting Y from X.
– The SEE is used to compute confidence intervals for the prediction equation.
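A common computational form ties the SEE to SDY and r (some texts include a degrees-of-freedom correction):

SEE = SDY · √(1 - r²)

An approximate 95% confidence interval for a predicted score is then Y' ± 1.96·SEE.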
Example of a 95% confidence interval.
– Both r and SDY are critical to the accuracy of prediction.
– If SDY is small and r is big, prediction errors will be small.
– If SDY is big and r is small, prediction errors will be large.
– We are 95% sure the mean falls between 45.1 and 67.3.
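The slide's interval is consistent with a predicted score of about 56.2 and an SEE of about 5.7 (values inferred from the endpoints, not given on the slide):

56.2 ± 1.96 × 5.7 ≈ 56.2 ± 11.1 → 45.1 to 67.3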
Multiple Regression


Multiple regression is used to predict one Y (dependent) variable from two or more X (independent) variables.
The advantage of multiple over bivariate regression is that it:
– Provides a lower standard error of estimate
– Determines which variables contribute to the prediction and which do not.
Multiple Regression




Y' = b1X1 + b2X2 + b3X3 + … + bnXn + C
b1, b2, b3, … bn are coefficients that give weight to the independent variables according to their relative contribution to the prediction of Y.
X1, X2, X3, … Xn are the predictors (independent variables).
C is a constant, similar to the Y intercept.
Body Fat = Abdominal + Tricep + Thigh
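A minimal sketch of estimating such an equation by least squares, using hypothetical skinfold numbers (the variable names only echo the body-fat example; none of these values come from the chapter):

# Sketch: multiple regression Y' = b1*X1 + b2*X2 + b3*X3 + C
import numpy as np

abdominal = np.array([20.0, 31.0, 26.0, 35.0, 28.0, 23.0, 33.0, 30.0])
tricep    = np.array([12.0, 18.0, 14.0, 20.0, 16.0, 13.0, 19.0, 17.0])
thigh     = np.array([15.0, 25.0, 20.0, 28.0, 22.0, 17.0, 26.0, 24.0])
body_fat  = np.array([14.0, 24.0, 19.0, 27.0, 21.0, 16.0, 25.0, 23.0])

# Design matrix: one column per predictor plus a column of 1s for C
X = np.column_stack([abdominal, tricep, thigh, np.ones_like(body_fat)])
coefs, *_ = np.linalg.lstsq(X, body_fat, rcond=None)
b1, b2, b3, C = coefs
print(b1, b2, b3, C)   # weights b1..b3 and the constant C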
List the variables and the order to enter them into the equation:
1. X2 has the biggest area (C), so it comes in first.
2. X1 comes in next; area (A) is bigger than area (E). Both A and E are unique, not common to C.
3. X3 comes in next; it uniquely adds area (E).
4. X4 is not related to Y, so it is NOT in the equation.
Ideal Relationship Between Predictors and Y
– Each variable accounts for unique variance in Y
– Very little overlap of the predictors
– Order to enter? X1, X3, X4, X2, X5
Regression Methods




Enter: forces all predictors (independent variables) into the equation, in one step.
Forward: each step adds a new predictor. Predictors enter based upon the unique variance in Y they explain (a sketch of this idea follows the list).
Backward: starts with the full equation (all predictors) and removes them one at a time on each step, beginning with the predictor that adds the least.
Stepwise: each step adds a new predictor. On any step a predictor can be added and another removed if it has high partial correlations with the newly added predictor.
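A minimal sketch of the forward method's core idea: at each step, add the candidate predictor that gains the most explained variance, and stop when the gain is negligible (made-up data; this is the general idea, not SPSS's exact entry criteria):

# Sketch: forward selection by R-squared gain
import numpy as np

def r_squared(X, y):
    # R² of a least-squares fit of y on the columns of X plus an intercept
    X1 = np.column_stack([X, np.ones(len(y))])
    coefs, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ coefs
    return 1 - resid.var() / y.var()

rng = np.random.default_rng(0)
n = 50
preds = rng.normal(size=(n, 4))                  # 4 candidate predictors
y = 2 * preds[:, 0] + preds[:, 2] + rng.normal(size=n)

chosen, remaining = [], list(range(4))
while remaining:
    base = r_squared(preds[:, chosen], y) if chosen else 0.0
    best_r2, best_j = max((r_squared(preds[:, chosen + [j]], y), j)
                          for j in remaining)
    if best_r2 - base < 0.01:                    # stop: negligible gain
        break
    chosen.append(best_j)
    remaining.remove(best_j)
    print("add X%d, R-squared = %.3f" % (best_j + 1, best_r2))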
Regression Methods in SPSS
– Choose the desired Regression Method.
Regression Assumptions



Homoscedasticity: equal variance of Y at any value of X.
The residuals are normally distributed around the line of best fit.
X and Y are linearly related.
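These checks are easy to run outside SPSS as well. A rough sketch on made-up data: Shapiro-Wilk for residual normality, plus a crude homoscedasticity check via the correlation between |residuals| and X:

# Sketch: quick checks of the regression assumptions (hypothetical data)
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 60)
y = 3.0 * x + 5.0 + rng.normal(0, 2.0, 60)

slope, intercept, r, p, se = stats.linregress(x, y)
resid = y - (slope * x + intercept)

# Normality of residuals: Shapiro-Wilk (p > .05 suggests normality)
print("Shapiro-Wilk p =", round(stats.shapiro(resid).pvalue, 3))

# Homoscedasticity (rough check): |residuals| should be unrelated to x
print("corr(|resid|, x) =", round(np.corrcoef(np.abs(resid), x)[0, 1], 3))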
Set 1   Set 2   Set 3   Set 4
11      123     2       5
25      144     5       29
14      155     4       24
17      144     7       25
14      125     1       31
10      147     9       37
9       182     5       35
22      166     6       22
25      122     8       24
27      165     7       25
24      143     9       30
11      156     4       28
19      154     2       25
25      149     22      26
Tests for Normality
– Use SPSS: Descriptives – Explore
Tests of Normality
        Kolmogorov-Smirnov(a)          Shapiro-Wilk
        Statistic   df   Sig.          Statistic   df   Sig.
set1    .175        14   .200*         .890        14   .081
*. This is a lower bound of the true significance.
a. Lilliefors Significance Correction
– Neither Sig. value is less than 0.05, so the data are normal.
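The Shapiro-Wilk test can be reproduced on the Set 1 column from the data table above (a sketch using scipy; SPSS's Lilliefors-corrected K-S statistic is not the plain K-S test, so only Shapiro-Wilk is checked here):

# Sketch: Shapiro-Wilk normality test on Set 1
from scipy import stats

set1 = [11, 25, 14, 17, 14, 10, 9, 22, 25, 27, 24, 11, 19, 25]
stat, p = stats.shapiro(set1)
print(round(stat, 3), round(p, 3))   # compare with the SPSS output above;
                                     # p > .05: no evidence against normality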
Tests for Normality: Normal Probability Plot or Q-Q Plot
– If the data are normal, the points cluster around a straight line.
[Normal Q-Q plot of set1: Observed Value (5 to 30) vs. Expected Normal (-2 to 2)]
Tests for Normality: Boxplots
– Bar is the median; the box extends from the 25th to the 75th percentile; whiskers extend to the largest and smallest values within 1.5 box lengths.
– Outliers are labeled with O; extreme values are labeled with a star.
[Boxplots of set1]
Cntry15.Sav Example of Regression Assumptions
Cntry15.Sav – Regression Statistics Settings
Cntry15.Sav – Regression Plot Settings
Cntry15.Sav – Regression Save Settings
Cntry15.Sav Example of Regression Assumptions
– Standardized Residual Stem-and-Leaf Plot:

Frequency    Stem &  Leaf
 3.00          -1 .  019
 4.00          -0 .  0148
 7.00           0 .  0466669
 1.00           1 .  7

Stem width: 1.00000
Each leaf:  1 case(s)
Cntry15.Sav Example of Regression Assumptions
– Distribution is normal.
– Two scores are somewhat outside.
[Normal Q-Q plot of Standardized Residual: Observed Value (-2 to 2) vs. Expected Normal (-2 to 2)]
Cntry15.Sav Example of Regression Assumptions
– No outliers [labeled O]
– No extreme scores [labeled with a star]
[Boxplot of Standardized Residual: values roughly -2 to 2]
Cntry15.Sav Example of Regression Assumptions
– The points should fall randomly in a band around 0 if the distribution is normal.
– In this distribution there is one extreme score.
[Detrended Normal Q-Q plot of Standardized Residual: Dev from Normal (-0.6 to 0.4) vs. Observed Value (-2 to 2)]
Cntry15.Sav Example of Regression Assumptions
Extreme Values: Standardized Residual

           Case Number   Value
Highest 1       6         1.73204
        2       3          .97994
        3      11          .67940
        4       8          .61749
        5       9          .60691(a)
Lowest  1      12        -1.98616
        2       1        -1.14647
        3       4        -1.02716
        4       7         -.83536
        5      10         -.49254

a. Only a partial list of cases with the value .60691 is shown in the table of upper extremes.
– The data are normal.
Tests of Normality
                        Kolmogorov-Smirnov(a)          Shapiro-Wilk
                        Statistic   df   Sig.          Statistic   df   Sig.
Standardized Residual   .137        15   .200*         .971        15   .866
*. This is a lower bound of the true significance.
a. Lilliefors Significance Correction
Regression Violations