Transcript File

The Pearson Product-Moment Correlation Coefficient
The regression coefficient is an asymmetrical statistic,
one that gives different values for the model Y = f(X)
and the model X = f(Y). The other major measure of
bivariate association is the Pearson product-moment
correlation coefficient (sometimes called "little r" for
short). The correlation coefficient is a symmetrical
statistic. That is, it simply describes the association
between X and Y without worrying about whether Y =
f(X) or X = f(Y). It would produce the same result in
either case. Unlike the regression coefficient, whose
values range from 0.0 to ±∞, the correlation coefficient
ranges from 0.0 when there is NO association between
X and Y to ±1.00 when there is PERFECT association
(either direct or inverse).
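The contrast between the two statistics can be checked numerically. The following Python sketch is an illustration only: it borrows the ten-family data from the Correlation Example at the end of this handout, and the helper names (sum_cross, slope, pearson_r) are chosen here, not taken from the handout.

```python
from statistics import mean

# Ten-family data from the Correlation Example in this handout
income = [25, 17, 20, 14, 11, 10, 6, 8, 8, 4]   # X, annual income in $1,000
children = [0, 0, 1, 2, 2, 3, 4, 5, 6, 7]       # Y, number of children


def sum_cross(u, v):
    """Sum of cross-products of deviations from the two means."""
    u_bar, v_bar = mean(u), mean(v)
    return sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))


def slope(x, y):
    """Least-squares slope b for the model y = f(x)."""
    return sum_cross(x, y) / sum_cross(x, x)


def pearson_r(x, y):
    """Pearson correlation; the N - 1 divisors cancel, so sums suffice."""
    return sum_cross(x, y) / (sum_cross(x, x) * sum_cross(y, y)) ** 0.5


b_yx = slope(income, children)       # model Y = f(X)
b_xy = slope(children, income)       # model X = f(Y)
r_xy = pearson_r(income, children)
r_yx = pearson_r(children, income)

print("b for Y = f(X):", round(b_yx, 3))
print("b for X = f(Y):", round(b_xy, 3))
print("r either way:  ", round(r_xy, 3), round(r_yx, 3))
```

The two slopes differ because each divides the same cross-product sum by a different sum of squares, while r divides by both and so does not depend on which variable is treated as the outcome.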
To generate the second set of statistics describing
association from the linear model, we partition the
sum of squares. Graphically, we begin with a single
data point, i, in two-dimensional space. Yi is its location
on the scale of y (on the y-axis); below that is the
predicted location of Y, Yi-hat. The dotted horizontal line
(- - - -) is the location of the mean of Y. (When there is
no association between X and Y, b = 0.0 and
therefore a = Y-bar.)
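\[ a = \bar{Y} - b\bar{X}; \quad \text{where } b = 0, \; a = \bar{Y} \]
[Figure: a single data point i plotted at (Xi, Yi), with the line of best fit passing below it at Yi-hat and a dotted horizontal line at Y-bar. Braces on the vertical line through the point mark its two segments: Yi - Yi-hat (upper) and Yi-hat - Y-bar (lower).]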
The vertical line represents the deviation of the ith
observation from the mean of Y (i.e., the difference
between Yi and Y-bar).
The line of best fit divides the deviation into its two
mathematical components. The component ABOVE the
line of best fit is the residual, the difference between Yi
and Yi - hat, the actual location of the ith observation on
the y-axis and the predicted location of this observation
on the y-axis. This is the error (or residual)
component.
The component BELOW the line of best fit is new. It is
the difference between the predicted Y-value, Yi - hat,
and the mean of Y (Y-bar). This component is called the
regression component.
Since these two components combined are the parts of
the deviation of the ith observation from the mean of Y,
the following is merely an algebraic summary of this
relationship:
deviation = regression component + error (residual)
\[ (Y_i - \bar{Y}) = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i) \]
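As a concrete check of this identity, the sketch below splits the deviation of one observation into its two components. It is an illustration only, using Family 1 from the ten-family Correlation Example later in this handout; the variable names are assumptions made here, and the slope and intercept come from the usual least-squares formulas.

```python
from statistics import mean

# Ten-family data from the Correlation Example in this handout
income = [25, 17, 20, 14, 11, 10, 6, 8, 8, 4]   # X
children = [0, 0, 1, 2, 2, 3, 4, 5, 6, 7]       # Y

x_bar, y_bar = mean(income), mean(children)

# Least-squares slope and intercept for Y = a + bX
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(income, children))
     / sum((x - x_bar) ** 2 for x in income))
a = y_bar - b * x_bar

# Family 1: X = 25, Y = 0
x_i, y_i = income[0], children[0]
y_hat_i = a + b * x_i

deviation = y_i - y_bar        # Yi - Y-bar
regression = y_hat_i - y_bar   # Yi-hat - Y-bar
residual = y_i - y_hat_i       # Yi - Yi-hat

print("deviation: ", round(deviation, 4))
print("regression:", round(regression, 4))
print("residual:  ", round(residual, 4))
```

For this family the predicted value lies below the mean, so the regression component is negative and the residual positive; the two still add up to the deviation.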
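Squaring both sides and summing across all
observations yields (the cross-product term sums to zero
for the least-squares line)
\[ \sum_{i=1}^{N} (Y_i - \bar{Y})^2 = \sum_{i=1}^{N} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2 \]
or
SSTotal = SSRegression + SSError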
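The partition can be verified numerically. The sketch below is an illustration, not part of the original handout: it fits the least-squares line to the ten-family data from the Correlation Example and computes the three sums of squares, along with the cross-product term that drops out. All variable names are assumptions made for this example.

```python
from statistics import mean

# Ten-family data from the Correlation Example in this handout
income = [25, 17, 20, 14, 11, 10, 6, 8, 8, 4]   # X
children = [0, 0, 1, 2, 2, 3, 4, 5, 6, 7]       # Y

x_bar, y_bar = mean(income), mean(children)
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(income, children))
     / sum((x - x_bar) ** 2 for x in income))
a = y_bar - b * x_bar
y_hat = [a + b * x for x in income]              # predicted values

ss_total = sum((y - y_bar) ** 2 for y in children)
ss_regression = sum((yh - y_bar) ** 2 for yh in y_hat)
ss_error = sum((y - yh) ** 2 for y, yh in zip(children, y_hat))

# Cross-product of the two components; zero (up to rounding) for least squares
cross = sum((yh - y_bar) * (y - yh) for y, yh in zip(children, y_hat))

print("SSTotal:     ", round(ss_total, 3))
print("SSRegression:", round(ss_regression, 3))
print("SSError:     ", round(ss_error, 3))
print("cross term:  ", round(cross, 9))
```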
We can express the amount of association between X
and Y as a ratio of the variance explained by the linear
model to the total variance in Y to be explained. SSTotal
is the variance to be explained and SSRegression the
variance accounted for by Y's relationship with X:
R2YX = SSRegression / SSTotal
This is the Coefficient of Determination. Its values
range from 0.0 when X and Y are independent (i.e.,
when Y-hat minus Y-bar = 0.0) to 1.0 with perfect
association (i.e., SSRegression = SSTotal). It is interpreted
as the percentage of the total variance in Y explained
by Y's association with X.
In algebraic form, the Coefficient of Determination is
calculated as
\[ R^2_{YX} = \frac{s_{XY}^2}{s_X^2 \, s_Y^2} \]
The denominator is the product of the variance
(standard deviation squared) of X and the variance of Y.
The numerator is the square of the covariance and can
be obtained by squaring the value from the following
short-cut equation
\[ s_{XY} = \frac{N \sum_{i=1}^{N} Y_i X_i - \sum_{i=1}^{N} Y_i \sum_{i=1}^{N} X_i}{N(N-1)} \]
In the time and temperature example, N = 3, the sum of
X (time) was 23.5, the sum of the squared time values
(ΣX²) was 194.25, the square of the sum of the time values
((ΣX)²) was 552.25, the sum of Y (temperature) was 248, the
sum of the squared temperature values (ΣY²) was 20,600, and
the sum of the cross-products (ΣXY) was 1,911.
sXY = [(3)(1911) - (248)(23.5)] / [(3)(3 - 1)]
sXY = (5733 - 5828) / 6
sXY = - 95 / 6
sXY = - 15.833
Squaring to get the covariance squared,
s2XY = 250.694
Next, we can use the short-hand equation to calculate
the two variances:
\[ s_X^2 = \frac{N \sum X^2 - (\sum X)^2}{N(N-1)} \]
(Here, the absence of an index and counter on the
summation sign implies summing from the first to the
last value.)
s2X = [(3)(194.25) - (23.5)²] / [(3)(3 - 1)]
s2X = [(582.75) - (552.25)] / [(3)(2)]
s2X = 30.5 / 6
s2X = 5.083
And for the variance of Y:
\[ s_Y^2 = \frac{N \sum Y^2 - (\sum Y)^2}{N(N-1)} \]
s2Y = [(3)(20,600) - (248)²] / [(3)(3 - 1)]
s2Y = [(61,800) - (61,504)] / 6
s2Y = 296 / 6
s2Y = 49.333
Now we can solve for the Coefficient of Determination:
R2YX = s2XY / (s2X s2Y)
R2YX = 250.694 / [(5.083)(49.333)]
R2YX = 250.694 / 250.760
R2YX = 0.9997
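The hand calculation can be reproduced from the six summary figures quoted in the text. The sketch below is an illustration only; the helper name shortcut is an assumption made here. It applies the same short-cut formula three times, since a variance is the covariance of a variable with itself.

```python
# Summary figures quoted in the text for the time and temperature example
n = 3
sum_x, sum_x2 = 23.5, 194.25     # sum of X, sum of X squared
sum_y, sum_y2 = 248, 20600       # sum of Y, sum of Y squared
sum_xy = 1911                    # sum of cross-products


def shortcut(n, sum_uv, sum_u, sum_v):
    """Short-cut formula: [N * sum(UV) - sum(U) * sum(V)] / [N(N - 1)]."""
    return (n * sum_uv - sum_u * sum_v) / (n * (n - 1))


s_xy = shortcut(n, sum_xy, sum_x, sum_y)    # covariance of X and Y
s2_x = shortcut(n, sum_x2, sum_x, sum_x)    # variance of X
s2_y = shortcut(n, sum_y2, sum_y, sum_y)    # variance of Y
r2 = s_xy ** 2 / (s2_x * s2_y)              # Coefficient of Determination

print("sXY  =", round(s_xy, 3))
print("s2X  =", round(s2_x, 3))
print("s2Y  =", round(s2_y, 3))
print("R2YX =", round(r2, 4))
```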
This is interpreted as meaning that 99.97 percent of the
variance in afternoon high temperature is statistically
explained by the association of this variable with the time
of the sun's first appearance. This is an extremely
high—and extremely unlikely—value, since R2YX varies
from a minimum of 0.0 (no variance explained) to a
maximum of 1.0 (100 percent if ALL the variance is
explained).
If the Coefficient of Determination is the percentage of
the variance in Y explained by its association with X,
then the converse is the percentage of variance in Y
NOT explained by its association with X. This is called
the Coefficient of Nondetermination, simply
KYX = 1 - R2YX
In this example, the percentage of variance NOT
explained is 1 - 0.9997 = 0.0003, or about 0.03 percent.
Conceptually, the Pearson product-moment correlation
coefficient is the square root of the Coefficient of
Determination:
\[ r_{XY} = \pm\sqrt{R^2_{YX}} \]
where the sign is that of the covariance.
For raw data, the correlation coefficient is found by
rXY = sXY / (sX sY)
where the numerator is the covariance and the
denominator is the product of the standard deviations
of X and Y. In our example,
rXY = - 15.833 / [(2.255)(7.024)]
rXY = - 15.833 / 15.839
rXY = - 0.9996
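The hand-calculated value differs slightly from the -0.99983 in the SAS output further on because the hand calculation rounds the covariance and standard deviations to three decimal places first. The sketch below, an illustration with variable names chosen here, computes r both ways from the sums quoted in the text.

```python
import math

# Summary figures quoted in the text for the time and temperature example
n = 3
sum_x, sum_x2 = 23.5, 194.25
sum_y, sum_y2 = 248, 20600
sum_xy = 1911

# Full precision: the N(N - 1) divisors cancel between numerator and denominator
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r_full = numerator / denominator

# Same calculation using the rounded intermediate values printed in the text
r_rounded = -15.833 / (2.255 * 7.024)

print("full precision:       ", round(r_full, 5))
print("rounded intermediates:", round(r_rounded, 4))
```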
Notice that, unlike the Coefficient of Determination,
which never takes negative values, the correlation
coefficient ranges from - 1.00 to + 1.00. Here, a
correlation of - 0.9996 shows an extremely STRONG
INVERSE relationship.
Finally, in the bivariate situation, the regression
coefficient (i.e., slope, b) and the correlation coefficient
(rXY) are related, as follows:
b = rXY (sY / sX)
and
rXY = b (sX / sY)
In the present little example,
b = (- 0.9996) (7.024 / 2.255)
b = (- 0.9996) (3.115)
b = - 3.114
(Carrying more decimal places through the calculation gives b = - 3.115.)
and
rXY = - 3.114 (2.255 / 7.024)
rXY = - 3.114 (0.321)
rXY = - 0.9996
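The two conversion formulas can be checked against a slope computed directly. The sketch below is an illustration with variable names chosen here; it assumes the standard least-squares result that the slope equals the covariance divided by the variance of X, which this section does not state.

```python
import math

# Summary figures quoted in the text for the time and temperature example
n = 3
sum_x, sum_x2 = 23.5, 194.25
sum_y, sum_y2 = 248, 20600
sum_xy = 1911

cov_xy = (n * sum_xy - sum_x * sum_y) / (n * (n - 1))
var_x = (n * sum_x2 - sum_x ** 2) / (n * (n - 1))
var_y = (n * sum_y2 - sum_y ** 2) / (n * (n - 1))
s_x, s_y = math.sqrt(var_x), math.sqrt(var_y)

r = cov_xy / (s_x * s_y)

b_direct = cov_xy / var_x          # assumed: least-squares slope = sXY / s2X
b_from_r = r * (s_y / s_x)         # b = rXY (sY / sX)
r_from_b = b_direct * (s_x / s_y)  # rXY = b (sX / sY)

print("b directly:", round(b_direct, 3))
print("b from r:  ", round(b_from_r, 3))
print("r from b:  ", round(r_from_b, 5))
```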
SAS Time and Temperature Example
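/* PERM holds the permanent data set; LIBRARY is where SAS looks for stored formats */
LIBNAME perm 'a:\';
LIBNAME library 'a:\';
/* Suppress the date and page number; 66 lines per page */
OPTIONS NODATE NONUMBER PS=66;
/* Pearson correlations for temp and time; NOSIMPLE omits the descriptive statistics */
PROC CORR DATA=perm.weather NOSIMPLE;
   VAR temp time;
   TITLE1 'Time and Temperature Example';
RUN;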
Time and Temperature Example

Correlation Analysis

2 'VAR' Variables:  TIME  TEMP

Pearson Correlation Coefficients / Prob > |R| under Ho: Rho=0 / N = 3

              TIME         TEMP

TIME       1.00000     -0.99983
           0.0          0.0116

TEMP      -0.99983      1.00000
           0.0116       0.0
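The Prob > |R| entry of 0.0116 can be reproduced by hand. The sketch below is an illustration only. It assumes the probability is the two-tailed p-value from the usual t test of Rho = 0 with N - 2 degrees of freedom, which the output does not state. With N = 3 there is one degree of freedom, and the t distribution with one degree of freedom is the standard Cauchy distribution, whose tail area has a closed form.

```python
import math

# Summary figures quoted in the text for the time and temperature example
n = 3
sum_x, sum_x2 = 23.5, 194.25
sum_y, sum_y2 = 248, 20600
sum_xy = 1911

r = ((n * sum_xy - sum_x * sum_y)
     / math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)))

# Assumed test statistic: t = r * sqrt(N - 2) / sqrt(1 - r^2), with N - 2 df
df = n - 2
t = r * math.sqrt(df) / math.sqrt(1 - r ** 2)

# Valid only for df = 1 (standard Cauchy): P(|T| > |t|) = 1 - (2 / pi) * atan(|t|)
p_two_tailed = 1 - (2 / math.pi) * math.atan(abs(t))

print("r =", round(r, 5))
print("t =", round(t, 2))
print("p =", round(p_two_tailed, 4))
```

The closed form in the last step applies only to this three-observation case; with more observations a t-distribution table or a statistics library would be needed.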
Time and Temperature Example

Correlation Analysis

2 'VAR' Variables:  TIME  TEMP

Pearson Correlation Coefficients / Prob > |R| under Ho: Rho=0
/ Number of Observations

              TIME         TEMP

TIME       1.00000     -0.99983
           0.0          0.0116
           3            3

TEMP      -0.99983      1.00000
           0.0116       0.0
           3            3
Correlation Example
For the following data on ten families, answer the questions below.
------------------------------------------------------------------------------
          Annual Income                Number of
           (in $1,000)                  Children
Family          X         (Xi - X̄)²        Y        (Yi - Ȳ)²   (Xi - X̄)(Yi - Ȳ)
------------------------------------------------------------------------------
   1           25                          0
   2           17                          0
   3           20                          1
   4           14                          2
   5           11                          2
   6           10                          3
   7            6                          4
   8            8                          5
   9            8                          6
  10            4                          7
------------------------------------------------------------------------------
         ΣX =                         ΣY =
          X̄ =                          Ȳ =
------------------------------------------------------------------------------
1. What is the value of the correlation coefficient? ______________
2. What is the value of the Coefficient of Determination? ______________
3. What is the value of the Coefficient of Nondetermination? ______________
Correlation Example Answers
For the following data on ten families, answer the questions below.
------------------------------------------------------------------------------
          Annual Income                Number of
           (in $1,000)                  Children
Family          X         (Xi - X̄)²        Y        (Yi - Ȳ)²   (Xi - X̄)(Yi - Ȳ)
------------------------------------------------------------------------------
   1           25          161.29          0           9            -38.1
   2           17           22.09          0           9            -14.1
   3           20           59.29          1           4            -15.4
   4           14            2.89          2           1             -1.7
   5           11            1.69          2           1              1.3
   6           10            5.29          3           0              0.0
   7            6           39.69          4           1             -6.3
   8            8           18.49          5           4             -8.6
   9            8           18.49          6           9            -12.9
  10            4           68.89          7          16            -33.2
------------------------------------------------------------------------------
         ΣX = 123       Σ = 398.1     ΣY = 30      Σ = 54        Σ = -129
          X̄ = 12.3                     Ȳ = 3.0
------------------------------------------------------------------------------
1. What is the value of the correlation coefficient? -0.880
2. What is the value of the Coefficient of Determination? 0.774
3. What is the value of the Coefficient of Nondetermination? 0.226
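The three answers follow from the column totals in the table. The sketch below is an illustration, not part of the original answer key, with variable names chosen here. It recomputes the totals from the raw data and then the three coefficients. Because the N - 1 divisors in the covariance and the two standard deviations cancel, the sums of squares and cross-products can be used directly.

```python
import math
from statistics import mean

# Raw data from the table above
income = [25, 17, 20, 14, 11, 10, 6, 8, 8, 4]   # X, annual income in $1,000
children = [0, 0, 1, 2, 2, 3, 4, 5, 6, 7]       # Y, number of children

x_bar, y_bar = mean(income), mean(children)

ss_x = sum((x - x_bar) ** 2 for x in income)       # column total 398.1
ss_y = sum((y - y_bar) ** 2 for y in children)     # column total 54
sp_xy = sum((x - x_bar) * (y - y_bar)
            for x, y in zip(income, children))     # column total -129

r = sp_xy / math.sqrt(ss_x * ss_y)   # correlation coefficient
r2 = r ** 2                          # Coefficient of Determination
k = 1 - r2                           # Coefficient of Nondetermination

print("r  =", f"{r:.3f}")
print("R2 =", f"{r2:.3f}")
print("K  =", f"{k:.3f}")
```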