Correlation Coefficient



RMTD 404
Lecture 9
Correlation & Regression
In a two independent samples t-test, the difference between the means of the independent variable groups on the dependent variable is itself a measure of association—if the group means differ, then there is a relationship between the independent and dependent variables.
But it is useful to make a distinction between statistical tests that evaluate differences and statistical tests that evaluate association. We have already seen that difference statistics and association statistics provide similar types of information (recall d and rpb², which are both effect size indicators).
This chapter deals with two general topics:
Correlation: Statistics that depict the strength of a relationship between
two variables.
Regression: Applications of correlational statistics to prediction
problems.
We typically begin depicting relationships between two variables using a
scatterplot—a bivariate plot that depicts three key characteristics of the
relationship between two variables.
• Strength: How closely related are the two variables? (Weak vs strong)
• Direction: Which values of each variable are associated with values of the other variable? (Positive vs negative)
• Shape: What is the general structure of the relationship? (Linear vs nonlinear)
By convention, when we want to use one variable as a predictor of the other
variable (called the criterion variable), we put the predictor on the X axis and
the criterion on the Y axis. (NOTE: we can still think of these as IV’s and DV’s)
Scatterplots
[Figure: four example scatterplots illustrating a strong positive linear relationship, a strong negative linear relationship, a weak positive linear relationship, and a strong curvilinear relationship.]
When we want to show that a certain function can describe the relationship and that the function is useful for predicting the Y variable from X, we include a regression line—the line that best fits the observed data.
Covariance
An important concept relating to correlation is the covariance of two variables
(covXY or sXY—notice that the latter designates the covariance as a measure
of dispersion between X and Y). The covariance reflects the degree to which
two variables vary together or covary. Notice that the equation for the
covariance is very similar to the equation for the variance, only the
covariance has two variables.
s_{XY} = \frac{\sum_{i=1}^{N}(X_i - \bar{X})(Y_i - \bar{Y})}{N - 1}
When the covariance is a large, positive number, Y tends to be above its mean when X is above its mean (the deviations tend to have the same sign).
When the covariance is a large, negative number, Y tends to be below its mean when X is above its mean (the deviations tend to have opposite signs).
When the covariance is near zero, there is no clear pattern like this—positive products tend to be cancelled by negative products.
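In R, the covariance can be computed from this formula by hand or with the built-in cov() function. Here is a minimal sketch using made-up vectors x and y (not the lecture data):

x <- c(2, 4, 5, 7, 9)
y <- c(1, 5, 4, 8, 10)
n <- length(x)
sum((x - mean(x)) * (y - mean(y))) / (n - 1)   # covariance by hand
cov(x, y)                                      # same value from the built-in function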
Correlation Coefficient
A problem occurs with the covariance—it is in raw score units, so we cannot tell
much about whether the covariance is indeed large enough to be important
by looking at it. It changes as the scales used to measure the variables
change.
The solution to this problem is to standardize the statistic by dividing by a
measure of the spread of the relevant distributions.
Thus, the correlation coefficient is defined as:
r_{XY} = \frac{s_{XY}}{s_X s_Y}
Because sXY cannot exceed sXsY, the limit of |r| is 1.00. Hence, one way to interpret r is as a measure of the degree to which the covariance reaches its maximum possible value—when the two variables covary as much as they possibly could, the correlation coefficient equals 1.00.
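In R, dividing the covariance by the two standard deviations reproduces what cor() computes. A sketch using the same made-up vectors as above:

x <- c(2, 4, 5, 7, 9)
y <- c(1, 5, 4, 8, 10)
cov(x, y) / (sd(x) * sd(y))   # covariance standardized by the two standard deviations
cor(x, y)                     # identical value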
Here is an example from SPSS.
Variable 1: Reading scores.
Variable 2: Math scores.
Covariance sRM = 68.069; r = .714

Correlations
                                                 READING        MATHEMATICS
                                                 STANDARDIZED   STANDARDIZED
                                                 SCORE          SCORE
READING              Pearson Correlation         1              .714**
STANDARDIZED SCORE   Sig. (2-tailed)                            .000
                     Sum of Squares and
                     Cross-products              24218.352      18310.598
                     Covariance                  89.698         68.069
                     N                           271            270
MATHEMATICS          Pearson Correlation         .714**         1
STANDARDIZED SCORE   Sig. (2-tailed)             .000
                     Sum of Squares and
                     Cross-products              18310.598      27350.633
                     Covariance                  68.069         101.675
                     N                           270            270
**. Correlation is significant at the 0.01 level (2-tailed).
[Figure: scatterplot of MATHEMATICS STANDARDIZED SCORE (Y axis) against READING STANDARDIZED SCORE (X axis) with a fitted line; R Sq Linear = 0.51.]
A positive and strong linear relationship is found between the two variables.
In R:
plot(bytxrstd,bytxmstd)
cor(bytxrstd,bytxmstd,
use="pairwise.complete")
[1] 0.7142943
Adjusted r
Although we usually report the value of the Pearson Product Moment correlation, there is a problem with that statistic—it is a biased estimate of the population correlation (ρ—rho). When the number of observations is small, the sample correlation will tend to be larger than the population correlation.
To compensate for this problem, we can compute the adjusted correlation
coefficient (radj), which is an unbiased estimate of the population correlation
coefficient.
r_{adj} = \sqrt{1 - \frac{(1 - r^2)(N - 1)}{N - 2}}

For our reading & math example, the computation gives us the following.

r_{adj} = \sqrt{1 - \frac{(1 - .714^2)(270 - 1)}{270 - 2}} = \sqrt{1 - \frac{131.86}{268}} = .713
Because our sample size is large, the correction does little to change the value of
r.
radj<-sqrt(1-(((1-.71^2)*269)/(268)))
radj
[1] 0.7086957
Hypothesis Testing for r
Occasionally, we want to perform a hypothesis test on r. That is, we want to determine the probability that an observed r came from a hypothetical null parameter (ρ).
The most common use of hypothesis testing relating to r is the test of the null
hypothesis, Ho: ρ = 0. When N is large and ρ = 0, the sampling distribution of r is
approximately normal in shape and is centered on 0.
The following t statistic can be formed
t = \frac{r\sqrt{N - 2}}{\sqrt{1 - r^2}}
which is distributed as t with N – 2 degrees of freedom.
Returning to the reading and math scores example, our r was .714. We
can test the null hypothesis that correlation came from a population in
which reading scores and math scores are unrelated. Since we had 270
participants in our study, then the t statistic would be computed as
follows.
t = \frac{r\sqrt{N - 2}}{\sqrt{1 - r^2}} = \frac{.714\sqrt{270 - 2}}{\sqrt{1 - .714^2}} = \frac{11.69}{.70} = 16.7
With 268 degrees of freedom, the p-value for this statistic is less than
.0001 (critical value for a two-tailed test is 1.96). Hence, we would reject
the null hypothesis and conclude that there is a non-zero correlation
between reading and math scores in the population.
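In R, this test can be obtained with cor.test() (using the variable names from the earlier R example), or the t statistic can be reproduced by hand from the numbers above. A sketch, assuming the variables are available in the workspace as before:

cor.test(bytxrstd, bytxmstd)      # reports t on N - 2 df and the two-tailed p-value
r <- .714; N <- 270
r * sqrt(N - 2) / sqrt(1 - r^2)   # about 16.7, matching the hand computation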
Regression Line
So, the correlation coefficient tells us the strength of the relationship
between two variables. If this relationship is strong, then we can use
knowledge about the values of one variable to predict the values of the
other variable.
Recall that the shape of the relationship being modeled by the correlation
coefficient is linear. Hence, r describes the degree to which a straight
line describes the values of the Y variable across the range of X values.
If the absolute value of r is close to 1, then the observed Y points all lie
close to the best-fitting line. As a result, we can use the best-fitting line
to predict what the values of the Y variable will be for any given value
of X. To make such a prediction, we obviously need to know how to
create the best-fitting (a.k.a. regression) line.
Recall that the equation for a line takes the form Y = bX + a. We will put a hat (^)
over the Y to indicate that, for our purposes, we are using the linear equation
to estimate Y. Note that all elements of this equation are estimated (from
data).
\hat{Y}_i = bX_i + a
where Ŷi is the value of Y predicted by the linear model for the ith value of X,
b is the estimated slope of the regression line (the difference in Ŷ associated with a one-unit difference in X),
a is the estimated intercept (the value of Ŷ when X = 0), and
Xi is the ith value of the predictor variable.
Our task is to identify the values of a and b that produce the best-fitting linear function. That is, we use the observed data to identify the values of a and b that minimize the distances between the observed values (Y) and the predicted values (Ŷ). But we can't simply minimize the difference between Y and Ŷ (called the residual from the linear model) because any line that passes through (X̄, Ȳ) on the coordinate plane will result in an average residual equal to 0.
To solve this problem, we take the same approach used in the computation of the variance—we find the values of a and b that minimize the squared residuals. This solution is called the least squares solution.
Fortunately, the least squares solution is simple to find, given statistics that
you already know how to compute.
a = \bar{Y} - b\bar{X}

b = \frac{s_{XY}}{s_X^2}

These values minimize \sum_{i=1}^{N}\left(Y_i - \hat{Y}_i\right)^2 (the sum of the squared residuals).
As an example, consider the data on reading and math. We are interested
in determining whether reading scores would be useful in predicting
math scores. We got the following descriptive statistics for the two
variables using the student dataset.
Descriptive Statistics
                                 Mean       Std. Deviation   N
READING STANDARDIZED SCORE       51.82690   9.470882         271
MATHEMATICS STANDARDIZED SCORE   51.71431   10.083413        270

X̄ = 51.83, Ȳ = 51.71, sX = 9.47, sY = 10.08, rXY = .714
From this, we can easily compute sXY.

r_{XY} = \frac{s_{XY}}{s_X s_Y}, \text{ so } .714 = \frac{s_{XY}}{9.47 \times 10.08}, \text{ which gives } s_{XY} = 68.16

And from this, we can compute a and b.

b = \frac{s_{XY}}{s_X^2} = \frac{68.16}{9.47^2} = 0.76

a = \bar{Y} - b\bar{X} = 51.71 - .76 \times 51.83 = 12.32
The predicted model (regression line) can be written as:
\hat{Y}_i = 12.32 + 0.76X_i
So what does this regression line and its parameters tell us?
• The intercept tells us that the best prediction of math score when reading score = 0 equals 12.32.
• The slope tells us that, for every 1-point increase in reading scores, we get an increase in math of .76 points.
• The covariance and correlation (as well as the slope) tell us that the relationship between reading and math is positive. That is, reading score tends to increase when math score increases.
• Note, however, that it is incorrect to ascribe a causal relationship between reading and math in this context.
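As a sketch, the fitted line can be used to predict a math score for a given reading score. The predict() call assumes a fitted model object such as result1 from the R output shown later:

12.32 + 0.76 * 60   # predicted math score for a reading score of 60 (about 57.9)
# or, with the fitted lm object (result1 <- lm(bytxmstd ~ bytxrstd)):
# predict(result1, newdata = data.frame(bytxrstd = 60))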
Standard Errors
An important question in regression is “does the regression line do a good job
of explaining the observed data?”
One way to address this question is to state how much confidence we have in
the predicted value. That is, how precise is our prediction?
Let’s begin with a simple case to demonstrate how we already know a good bit
about estimate precision. Suppose that you don’t know anything about
reading scores and you want to estimate what a student’s math score is.
The only things that you know are the mean math score and the standard deviation of math scores.
If you were to randomly choose a student from that population, what is the best guess at what that student's math score will be?
When we know nothing, our best guess of the value of the criterion variable
is the mean of the criterion variable. And you would expect 95% of the
observed cases to lie within 1.96 standard deviations of the mean
(assuming the population of math scores is normally distributed).
This statement would not change if you had knowledge about a student’s
reading score and r = 0 described the relationship between reading and
math in the population. That is, if there is no relationship between X
and Y (knowing something about X gives you no additional information
about Y), then your best predicted value of Y is the mean of Y and the
precision of that estimate is dictated by the standard deviation of Y (or
the sample variance of Y).
Let’s simplify the equation for the sample variance so that we can extend the
equation to describe the precision of predictions when r does not equal 0 on
the next slide.
s_{total}^2 = \frac{\sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2}{N - 1} = \frac{SS_{total}}{df_{total}}
That is, let's denote the numerator of the equation as the sum of squares (i.e., the sum of the squared deviations of the observed values around their mean—SStotal). It is called the "total" sum of squares because it captures all of the variability among the observations. The denominator is the degrees of freedom.
Now, let’s look at the standard error—the standard deviation of a sampling
distribution--in regression.
We can define the standard error of the estimate (sY.X) as the standard deviation
of observed Y values around the value of Y predicted based on our
knowledge of X.
That is, sY.X is the standard deviation of Y around Yˆ for any given value of X.
Computationally, sY.X is defined as the square root of the sum of the squared residuals over their degrees of freedom (called the residual because it is the deviation of the observations from the predictions) or the root mean square error (RMSE).

s_{Y \cdot X} = \sqrt{\frac{\sum_{i=1}^{N}\left(Y_i - \hat{Y}_i\right)^2}{N - 2}} = \sqrt{\frac{SS_{residual}}{df_{residual}}}
Obviously, from its form, we can see that this is a standard deviation. The only difference is that we are computing deviations of observed Ys from predicted values (Ŷ) rather than from the mean. That is, the standard error is the standard deviation of the conditional distributions of Y at each level of X.
The square of the standard error of estimate (a variance) is also known as the
residual variance or the error variance.
Graphically, here is the standard error of estimate.
[Figure: scatterplot with the regression line Ŷ; the vertical distances Y − Ŷ are the residuals, and sY·X is the spread of the conditional distributions of Y|X around the line.]
So, the square of the standard error of the estimate is another special type
of variance (like the square of the standard error of the mean)—a sum
of squared residuals divided by its degrees of freedom. We can also
state the error or residual variance as a function of the correlation
coefficient (r).
1  r 2  NN  12
sY  X  sY
N 1
Note that when the sample size is very large,
approaches one, so
N 2
the equation simplifies to.
sY  X  sY
or
1  r 2 

sY2  X  sY2 1  r 2

Hence, we can estimate the value of the standard error of the estimate if
we know the correlation coefficient between X and Y and the standard
deviation of Y.
For our previous example, we can obtain the error variance for the regression of math scores on reading scores as follows.

s_{Y \cdot X}^2 = s_Y^2\left(1 - r^2\right) = 10.08^2 \times \left(1 - .714^2\right) = 49.81

and

s_{Y \cdot X} = \sqrt{49.81} = 7.06

Hence, if we were to assume that the observed Y values were normally distributed around the prediction line, we would expect 95% of the observed Y values to lie within ±1.96 × 7.06 points (or within 13.84 points) of the predicted values.
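A sketch of this arithmetic in R:

s_y <- 10.08; r <- .714
s_y^2 * (1 - r^2)            # error variance, about 49.8
s_y * sqrt(1 - r^2)          # standard error of estimate, about 7.06
1.96 * s_y * sqrt(1 - r^2)   # half-width of the 95% band, about 13.8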
The first part of the SPSS output gives us the error variance as the mean
square residual.
ANOVA(b)
Model 1        Sum of Squares   df    Mean Square   F         Sig.
  Regression   13954.741        1     13954.741     279.180   .000(a)
  Residual     13395.892        268   49.985
  Total        27350.633        269
a. Predictors: (Constant), READING STANDARDIZED SCORE
b. Dependent Variable: MATHEMATICS STANDARDIZED SCORE
The SSresidual is designated as the SSerror and MSerror = SSerror/dferror. Also, the root mean square residual equals the standard error of the estimate. In this case, it equals 7.07 (√49.985). Hence, we know that, on average, the observed math scores are about 7.07 points from the prediction line. Again, this differs only slightly from what we computed by hand due to rounding error.
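In R, the same sums of squares and mean squares are available from anova(). A sketch, assuming result1 <- lm(bytxmstd ~ bytxrstd):

anova(result1)   # rows for bytxrstd (the regression) and Residuals, with SS, df, MS, and F
sigma(result1)   # root mean square residual, about 7.07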
This points out an important thing about the relationship between r and the standard error of the estimate—for a given value of the standard deviation of Y, the size of the standard error is proportional to the standard deviation of Y by a function of r—the strength of the relationship between X and Y.
When r = 0, the value of the standard error of the estimate equals the value of the standard deviation of Y (sY·X = sY). When r = 1, the value of the standard error of the estimate equals 0 (sY·X = 0). And when r is between 0 and 1, √(1 − r²) is the relative size of the standard error of the estimate versus the standard deviation of the sample.
r²
Another important way of stating the relationship between variability in the
sample and variability in the estimates as a function of r relates the sums of
squares for these two measures.

SS_{residual} = SS_{total}\left(1 - r^2\right)

where

SS_{total} = \sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2

Which can be solved for r² so that

r^2 = \frac{SS_{total} - SS_{residual}}{SS_{total}} = \frac{SS_{model}}{SS_{total}} = \frac{SS_{explained}}{SS_{observed}}
Hence, r2 is a proportion—the proportion of observed variability that is explained
by the relationship between X and Y. When the residual variability is small, r2
approaches 1. When the residual variance is large, r2 approaches 0.
In our reading and math example, about 51% (.714²) of the variability in math is explained by its relationship with reading scores. Another way of stating this is that 51% of the variance in math scores is accounted for by its covariance with reading scores.
r²
To summarize, there are several sources of variance in the regression equation between X and
Y.
SS_X = \sum_{i=1}^{N}\left(X_i - \bar{X}\right)^2    (variability of the predictor variable)

SS_{total} = SS_Y = \sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2    (variability of the outcome)

SS_{model} = SS_{\hat{Y}} = \sum_{i=1}^{N}\left(\hat{Y}_i - \bar{Y}\right)^2 = SS_Y - SS_{residual}    (variability explained by the model)

SS_{residual} = SS_{error} = \sum_{i=1}^{N}\left(Y_i - \hat{Y}_i\right)^2 = SS_Y - SS_{\hat{Y}}    (variability not explained by the model)

Note:

SS_{total} = SS_{model} + SS_{residual} = \text{predicted} + \text{error}
Given that r2 tells you the amount of variability in the Y variable that is explained by its
relationship with the X variable, we can use r2 as an effect size indicator. In fact, Cohen
(1988) suggests the following values for a rule-of-thumb: small = .01, medium = .09, large
= .25.
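A sketch of this decomposition in R, assuming result1 <- lm(bytxmstd ~ bytxrstd):

y <- model.response(model.frame(result1))   # the observed Y values used in the fit
ss_total    <- sum((y - mean(y))^2)
ss_residual <- sum(resid(result1)^2)
ss_model    <- ss_total - ss_residual
c(model = ss_model, residual = ss_residual, total = ss_total)
ss_model / ss_total                         # r^2, about .51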
In SPSS, the r and r² are reported, along with the adjusted r².
Model Summary
Model   R        R Square   Adjusted R Square   Std. Error of the Estimate
1       .714(a)  .510       .508                7.069984
a. Predictors: (Constant), READING STANDARDIZED SCORE
In R:
summary(result1)

Call:
lm(formula = bytxmstd ~ bytxrstd)

Residuals:
      Min        1Q    Median        3Q       Max
-17.01118  -4.98226   0.07116   4.73102  16.43518

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.17730    2.40505   5.063 7.68e-07 ***
bytxrstd     0.76211    0.04561  16.709  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.07 on 268 degrees of freedom
  (30 observations deleted due to missingness)
Multiple R-squared: 0.5102,  Adjusted R-squared: 0.5084
F-statistic: 279.2 on 1 and 268 DF,  p-value: < 2.2e-16
Residual Analysis
One way to evaluate the quality of the regression model is to examine the
residuals. By examining residuals, you’ll be able to tell whether your linear
model is appropriate for the data, the degree to which the data conform to
the linear model, and which specific cases do not jibe with the linear
model.
The plot that is often used when performing a residual analysis is a scatter plot
of the residuals and the predicted values.
A scatter plot (aka a residual plot) allows us to identify patterns in the residuals. The scatter of the residuals should be of equal magnitude across the range of the predicted values, and the residuals should become denser as they approach zero (that is, most observations should fall close to their predicted values).
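A sketch of a residual plot in R, assuming result1 <- lm(bytxmstd ~ bytxrstd):

plot(fitted(result1), resid(result1),
     xlab = "Predicted math score", ylab = "Residual")
abline(h = 0, lty = 2)   # residuals should scatter evenly around this line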
Residual Analysis
Cohen & Cohen suggest that non-random patterns within residual plots may indicate specific types of problems in the data. These plots show Ŷ on the X-axis and Y − Ŷ on the Y-axis.
Curvilinear = Non-linear relationship
Outliers = Special cases or data errors
Heteroscedasticity = Invalid inferences
Slope = Omitted time IV
Here is the residual plot.
Hypothesis Testing of b
The test of the null hypothesis that r = 0 is the same as the test of the null
hypothesis that β (the parameter estimated by the slope, b) equals 0. That is, if
there is no relationship between X and Y, then the correlation equals zero. This
is the same thing as the slope of the regression line equals zero, which also
translates to a situation in which the mean of Y is the best predictor and the
standard error of the estimate equals the standard deviation of Y.
Recall that the t-test compares an observed parameter to a hypothetical (null)
value, dividing the difference by the standard error. Hence, we need a standard
error for b.
s_b = \frac{s_{Y \cdot X}}{s_X\sqrt{N - 1}}

Hence, the t-test for comparing b to a null parameter (typically set to 0) is:

t = \frac{b}{s_b} = \frac{b}{\frac{s_{Y \cdot X}}{s_X\sqrt{N - 1}}} = \frac{b \, s_X\sqrt{N - 1}}{s_{Y \cdot X}}

where t has N – 2 df.
For the math and reading data, the standard error of b is computed below.
b = 0.76,  sY·X = √49.985 = 7.07,  sX = 9.47

Since we had 270 participants, sb is computed as follows.

s_b = \frac{s_{Y \cdot X}}{s_X\sqrt{N - 1}} = \frac{7.07}{9.47\sqrt{270 - 1}} = \frac{7.07}{9.47 \times 16.40} = 0.046

So the t statistic to test the null hypothesis that b = 0 equals

t = \frac{b}{s_b} = \frac{0.76}{0.046} = 16.52
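A sketch of this computation in R:

b <- 0.76; s_yx <- sqrt(49.985); s_x <- 9.47; N <- 270
s_b <- s_yx / (s_x * sqrt(N - 1))   # about 0.046
b / s_b                             # t statistic; about 16.7 with full precision (16.52 when s_b is rounded to .046)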
Hypothesis Testing for b in SPSS
The SPSS output contains this statistical test. In this example, we see that the slope for reading scores predicting math scores is statistically significant (t = 16.71, p < .0001). That is, the slope is non-zero.
Coefficients(a)
                               Unstandardized          Standardized
                               Coefficients            Coefficients                     95% Confidence Interval for B
Model 1                        B        Std. Error     Beta           t        Sig.     Lower Bound   Upper Bound
  (Constant)                   12.177   2.405                         5.063    .000     7.442         16.913
  READING STANDARDIZED SCORE   .762     .046           .714           16.709   .000     .672          .852
a. Dependent Variable: MATHEMATICS STANDARDIZED SCORE
In R:
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.17730    2.40505   5.063 7.68e-07 ***
bytxrstd     0.76211    0.04561  16.709  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Similarly, we can create a confidence interval around the observed b
using the following extension of the t-test formula.


95\% \, CI(\beta) = b \pm t_{\alpha/2} \, s_b = b \pm t_{\alpha/2} \, \frac{s_{Y \cdot X}}{s_X\sqrt{N - 1}}

For the reading & math example, the two-tailed critical t for α = .05 with 268 degrees of freedom would equal approximately 1.96, so the confidence interval would be computed as follows.

95\% \, CI(\beta) = .76 \pm 1.96 \times 0.046

LL = .67
UL = .85
Given that 0 does not fall within these limits, we would reject the null hypothesis that b came from a population in which β = 0.
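In R, confint() gives the same interval directly from the fitted model. A sketch, assuming result1 <- lm(bytxmstd ~ bytxrstd):

confint(result1, level = 0.95)   # rows for the intercept and the slope
0.76 + c(-1, 1) * 1.96 * 0.046   # by hand with the rounded values, about .67 and .85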
Assumptions
In both correlation and regression, it is assumed that the relationship
between X and Y is linear rather than curvilinear. Different procedures are
available for modeling curvilinear data.
If we simply wish to describe the relationship in the observed data or express
the proportion of the variance of Y that is accounted for by its linear
relationship with X, we need no additional assumptions.
However, inferential procedures for regression (i.e., issues relating to b and Ŷ) rely on two additional assumptions about the data being modeled.
Homogeneity of variance in arrays: The residual variance of Y conditioned on
X at each level of X is assumed to be equal. This is equivalent to the
homogeneous variance assumption we made with the t-test. The observed
variances do not have to be equal, but they have to be close enough. We can
examine the residual plot to get a sense about this assumption.
Normality of conditional arrays: The distribution of observed Y values
around the predicted Y value at each level of X is assumed to be
normally distributed. This is necessary because we use the standard
normal distribution in testing hypotheses. Histograms can be used to
check for normality as well as Q-Q plots.
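A sketch of these checks in R, assuming result1 <- lm(bytxmstd ~ bytxrstd):

plot(fitted(result1), resid(result1))            # look for roughly equal spread across the predicted values
hist(resid(result1))                             # rough check of the normality of the residuals
qqnorm(resid(result1)); qqline(resid(result1))   # Q-Q plot of the residuals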
If we wish to draw inferences about the correlation coefficient, on the
other hand, we need to make only one assumption (albeit a rather
demanding assumption):
Bivariate Normality: If we wish to test hypotheses about r or
establish confidence limits on r, we must assume that the joint
distribution of Xs and Ys is normal.
Factors that Influence the Correlation
The correlation coefficient can be substantially affected by characteristics of the
sample. Specifically, there are three potential problems that might lead to
spuriously high or low correlation coefficients.
1. Range Restriction: If the range of Xs or Ys is restricted (e.g., a ceiling or floor effect, or omission of data from certain sections of the population or sample), the correlation statistic (r) will likely underestimate ρ (although it is possible that it will overestimate it, as when restriction of range eliminates a portion of a curvilinear relationship).
[Figure: two scatterplots illustrating range restriction; the full range of scores gives r = .72, while the restricted range gives r = .63.]
2. Heterogeneous Samples: A second problem—one that is more likely to make r an overestimate of ρ—arises when you compare two variables (X and Y), but there are large differences on one of these variables with respect to a third variable (Z). For example, suppose that we are interested in the relationship between comfort with technology and scores on a technology-based test. Also suppose that males and females exhibit very large differences in comfort with technology. The joint distribution of males and females could give us a false impression of the relationship between comfort and scores.
[Figure: scatterplot illustrating a heterogeneous sample; the annotated correlations are r = .34, r = .50, and r = .34.]
3. Outliers: Extreme values of Y or X can artificially inflate or deflate r as an estimate of ρ. In most cases, these outliers are substantively interesting cases or are error-laden cases.
[Figure: two scatterplots illustrating the effect of outliers; the annotated correlations are r = .52 vs. r = .36 in one panel and r = .36 vs. r = .01 in the other.]
Regression steps in SPSS:
1. Analyze
2. Regression
3. Linear
4. Input dependent (outcome) variable
5. Input independent (predictor) variable
6. Press OK
7. The Model Summary table provides R² and adj R²
8. The ANOVA table provides the SSR, SSE, MSE, F, and overall predictive capacity of the IV.
9. The coefficients table is most important – it provides the parameter estimates for b and a along with tests of their statistical significance.
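For comparison, a sketch of the equivalent steps in R (using the variable names from the earlier output and assuming the variables are available in the workspace):

result1 <- lm(bytxmstd ~ bytxrstd)   # fit the regression of math on reading
summary(result1)                     # coefficients and their tests, R-squared, adjusted R-squared
anova(result1)                       # sums of squares for the model and the residuals
confint(result1)                     # 95% confidence intervals for a and b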