Transcript Document

Correlation Analysis
Pearson Product Moment Coefficient of Correlation:
$$r = \frac{s_{xy}}{s_x s_y} = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}$$

The variances and covariance are given by:

$$s_{xy} = \frac{S_{xy}}{n-1} \qquad s_x^2 = \frac{S_{xx}}{n-1} \qquad s_y^2 = \frac{S_{yy}}{n-1}$$
In general, when a sample of n individuals or experimental units is selected and two variables are measured on each individual or unit so that both variables are random, the correlation coefficient r is the appropriate measure of linearity.
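The computation of r from these definitions is direct. Here is a minimal Python sketch (standard library only; the function name pearson_r is our own) that forms the sums of squares and cross-products and returns r:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    # Corrected sums of squares and cross-products
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)
```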
Example
The heights and weights of n = 10 offensive backfield football players are randomly selected from a county's football all-stars. Calculate the correlation coefficient for the heights (in inches) and weights (in pounds) given in the table below.
Table: Heights and weights of n = 10 backfield all-stars

Player     1    2    3    4    5    6    7    8    9   10
Height x  73   71   75   72   72   75   67   69   71   69
Weight y 185  175  200  210  190  195  150  170  180  175
Solution
You should use the appropriate data entry method of your scientific calculator to verify the calculations for the sums of squares and cross-products:

$$S_{xy} = 328 \qquad S_{xx} = 60.4 \qquad S_{yy} = 2610$$

using the calculational formulas given earlier in this chapter. Then

$$r = \frac{328}{\sqrt{(60.4)(2610)}} = .8261$$
or r = .83. This value of r is fairly close to 1, the largest possible value of r, which indicates a fairly strong positive linear relationship between height and weight.
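To check the arithmetic, a short Python snippet (assuming Python 3.10+, where statistics.correlation is available) reproduces r from the table data:

```python
from statistics import correlation

height = [73, 71, 75, 72, 72, 75, 67, 69, 71, 69]
weight = [185, 175, 200, 210, 190, 195, 150, 170, 180, 175]

# Pearson correlation of the n = 10 height/weight pairs
r = correlation(height, weight)
print(round(r, 4))  # 0.8261
```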
- There is a direct relationship between the calculation formulas for the correlation coefficient r and the slope of the regression line b.
- Since the numerator of both quantities is Sxy, both r and b have the same sign.
- Therefore, the correlation coefficient has these general properties:
  - When r = 0, the slope is 0, and there is no linear relationship between x and y.
  - When r is positive, so is b, and there is a positive relationship between x and y.
  - When r is negative, so is b, and there is a negative relationship between x and y.
The relationship between r (correlation coefficient) and the regression model:

$$\frac{\hat{y} - \bar{y}}{s_y} = r\,\frac{x - \bar{x}}{s_x}$$

so that

$$\hat{y} = \bar{y} + r\frac{s_y}{s_x}(x - \bar{x}) = \left(\bar{y} - r\frac{s_y}{s_x}\bar{x}\right) + r\frac{s_y}{s_x}\,x$$

Therefore the fitted slope is

$$b = r\,\frac{s_y}{s_x}$$
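As a quick numerical check (our own arithmetic, using the height/weight example above), the least squares slope and the slope recovered from r agree:

$$b = \frac{S_{xy}}{S_{xx}} = \frac{328}{60.4} = 5.43, \qquad r\,\frac{s_y}{s_x} = .8261\sqrt{\frac{2610}{60.4}} = .8261\,(6.574) = 5.43$$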
Figure: Some typical scatter plots
- The population correlation coefficient ρ is calculated and interpreted as it is in the sample.
- The experimenter can test the hypothesis that there is no correlation between the variables x and y using a test statistic that is exactly equivalent to the test of the slope β in the previous section.
Test of Hypothesis Concerning the Correlation Coefficient ρ:
1. Null hypothesis: $H_0: \rho = 0$
2. Alternative hypothesis:
   One-Tailed Test: $H_a: \rho > 0$ (or $H_a: \rho < 0$)
   Two-Tailed Test: $H_a: \rho \neq 0$
3. Test statistic:

$$t_0 = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

When the assumptions are satisfied, the test statistic will have a Student's t distribution with (n − 2) degrees of freedom.
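A minimal Python sketch of this test (scipy assumed for the t distribution; the function name corr_t_test is our own):

```python
import math
from scipy import stats

def corr_t_test(r, n):
    """t statistic and two-tailed p-value for H0: rho = 0."""
    t0 = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    p_value = 2 * stats.t.sf(abs(t0), df=n - 2)  # two-tailed area
    return t0, p_value
```

For instance, corr_t_test(.8261, 10) returns t0 ≈ 4.15, matching the worked example below.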
When comparing to a non-zero constant:
1. Null hypothesis: $H_0: \rho = \rho_0$
2. Alternative hypothesis:
   One-Tailed Test: $H_a: \rho > \rho_0$ (or $H_a: \rho < \rho_0$)
   Two-Tailed Test: $H_a: \rho \neq \rho_0$
3. Test statistic:

$$t_0 = \frac{(r - \rho_0)\sqrt{n-2}}{\sqrt{(1-r^2)(1-\rho_0^2)}}$$

When the assumptions are satisfied, the test statistic will have a Student's t distribution with (n − 2) degrees of freedom.
4. Rejection region: Reject $H_0$ when
   One-Tailed Test: $t > t_{\alpha,\,n-2}$ (or $t < -t_{\alpha,\,n-2}$ when the alternative hypothesis is $H_a: \rho < 0$ or $H_a: \rho < \rho_0$)
   Two-Tailed Test: $t > t_{\alpha/2,\,n-2}$ or $t < -t_{\alpha/2,\,n-2}$
   or when p-value < α
Example  Refer to the height and weight data in the previous example. The correlation of height and weight was calculated to be r = .8261. Is this correlation significantly different from 0?
Solution
To test the hypotheses

$$H_0: \rho = 0 \qquad \text{versus} \qquad H_a: \rho \neq 0$$

the value of the test statistic is

$$t_0 = r\sqrt{\frac{n-2}{1-r^2}} = .8261\sqrt{\frac{10-2}{1-(.8261)^2}} = 4.15$$
which for n = 10 has a t distribution with 8 degrees of freedom. Since this value is greater than $t_{.005} = 3.355$, the two-tailed p-value is less than 2(.005) = .01, and the correlation is declared significant at the 1% level (P < .01). The value $r^2 = .8261^2 = .6824$ means that about 68% of the variation in one of the variables is explained by the other. The Minitab printout in Figure 12.17 displays the correlation r and the exact p-value for testing its significance.
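For reference, scipy.stats.pearsonr carries out exactly this two-tailed test; a quick sketch on the same data:

```python
from scipy import stats

height = [73, 71, 75, 72, 72, 75, 67, 69, 71, 69]
weight = [185, 175, 200, 210, 190, 195, 150, 170, 180, 175]

# pearsonr returns r and the two-tailed p-value for H0: rho = 0
r, p_value = stats.pearsonr(height, weight)
print(f"r = {r:.4f}, p-value = {p_value:.4f}")  # p-value falls below .01
```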
- Note that r is a measure of linear correlation only: x and y could be perfectly related by some curvilinear function even when the observed value of r is equal to 0.
Testing for Goodness of Fit
- In general, we do not know the underlying distribution of the population, and we wish to test the hypothesis that a particular distribution will be satisfactory as a population model.
- Probability plotting can only be used for examining whether a population is normally distributed.
- Histogram plotting and other methods can only be used to guess the possible underlying distribution type.
Goodness-of-Fit Test (I)
- A random sample of size n is taken from a population whose probability distribution is unknown.
- These n observations are arranged in a frequency histogram having k bins or class intervals.
- Let Oi be the observed frequency in the ith class interval and Ei be the expected frequency in the ith class interval under the hypothesized probability distribution; the test statistic is given below.
Goodness-of-Fit Test (II)
- If the population follows the hypothesized distribution, $\chi_0^2$ has approximately a chi-square distribution with k − p − 1 d.f., where p represents the number of parameters of the hypothesized distribution estimated by sample statistics. That is,

$$\chi_0^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \sim \chi^2_{k-p-1}$$

- Reject the hypothesis if

$$\chi_0^2 > \chi^2_{\alpha,\,k-p-1}$$
Goodness-of-Fit Test (III)
- Class intervals are not required to be of equal width.
- The minimum value of the expected frequency cannot be too small; 3, 4, and 5 are ideal minimum values.
- When the minimum expected frequency is too small, we can combine that class interval with its neighboring class intervals, in which case k is reduced by one.
Example 8-18  The number of defects in printed circuit boards is hypothesized to follow a Poisson distribution. A random sample of size 60 printed boards has been collected, and the numbers of defects observed are given in the table below:

Number of defects   0    1    2    3
Observed frequency  32   15   9    4

- The only parameter of the Poisson distribution is λ, which can be estimated by the sample mean: {0(32) + 1(15) + 2(9) + 3(4)}/60 = 0.75. Therefore, the expected frequency of the first cell is:

$$p_1 = P(X = 0) = \frac{e^{-0.75}(0.75)^0}{0!} = 0.472$$

$$E_1 = 0.472 \times 60 = 28.32$$
Example 8-18 (Cont.)
- Since the expected frequency in the last cell is less than 3, we combine the last two cells:

Number of defects    0      1      2 or more
Observed frequency   32     15     13
Expected frequency   28.32  21.24  10.44
Example 8-18 (Cont.)
1. The variable of interest is the form of the distribution of defects in printed circuit boards.
2. H0: The form of the distribution of defects is Poisson.
   H1: The form of the distribution of defects is not Poisson.
3. k = 3, p = 1, so k − p − 1 = 1 d.f.
4. At α = 0.05, we reject H0 if $\chi_0^2 > \chi_{0.05,1}^2 = 3.84$.
5. The test statistic is:

$$\chi_0^2 = \sum_{i=1}^{k}\frac{(O_i - E_i)^2}{E_i} = \frac{(32-28.32)^2}{28.32} + \frac{(15-21.24)^2}{21.24} + \frac{(13-10.44)^2}{10.44} = 2.94$$

6. Since $\chi_0^2 = 2.94 < \chi_{0.05,1}^2 = 3.84$, we are unable to reject the null hypothesis that the distribution of defects in printed circuit boards is Poisson.
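The whole test is easy to script. Here is a Python sketch (scipy assumed; the variable names are our own) that rebuilds the expected frequencies from the fitted Poisson model and computes the statistic, passing ddof=1 so the reference distribution has k − p − 1 = 1 degree of freedom:

```python
import math
from scipy import stats

observed = [32, 15, 13]   # counts for 0, 1, and >= 2 defects (last cells combined)
n, lam = 60, 0.75         # sample size and estimated Poisson mean

# Cell probabilities under the hypothesized Poisson(0.75) model
p0 = math.exp(-lam)            # P(X = 0)
p1 = lam * math.exp(-lam)      # P(X = 1)
expected = [n * p0, n * p1, n * (1 - p0 - p1)]

# ddof=1 removes one extra degree of freedom for the estimated lambda
chi2, p_value = stats.chisquare(observed, f_exp=expected, ddof=1)
print(f"chi2 = {chi2:.2f}, p-value = {p_value:.3f}")
```

This reproduces the hand calculation up to rounding (the slide rounds the expected frequencies to two decimals).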
Contingency Table Tests
Example 8-20
A company has to choose among three pension plans. Management wishes to know whether the preference for plans is independent of job classification and wants to use α = 0.05. The opinions of a random sample of 500 employees are shown in Table 8-4.
Contingency Table Test
- The Problem Formulation (I)
- There are two classifications, one with r levels and the other with c levels (3 pension plans and 2 types of workers).
- We want to know whether the two methods of classification are statistically independent (whether the preference for pension plans is independent of job classification).
- The data are arranged in an r × c table of observed frequencies.
Contingency Table Test
- The Problem Formulation (II)
- Let pij be the probability that a randomly selected element falls in the ijth cell, given that the two classifications are independent. Then pij = ui·vj, where the estimators of ui and vj are

$$\hat{u}_i = \frac{1}{n}\sum_{j=1}^{c} O_{ij} \qquad \hat{v}_j = \frac{1}{n}\sum_{i=1}^{r} O_{ij}$$

- Therefore, the expected frequency of each cell is

$$E_{ij} = n\,\hat{u}_i\hat{v}_j = \frac{1}{n}\left(\sum_{j=1}^{c} O_{ij}\right)\left(\sum_{i=1}^{r} O_{ij}\right)$$

- Then, for large n, the statistic

$$\chi_0^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

has an approximate chi-square distribution with (r − 1)(c − 1) d.f.
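In practice this entire procedure is available as scipy.stats.chi2_contingency. A short sketch; note the 2 × 3 counts below are illustrative placeholders only, not the actual Table 8-4 data:

```python
from scipy import stats

# Rows: job classification (e.g., salaried, hourly); columns: pension plans 1-3.
# Illustrative counts only -- substitute the real Table 8-4 frequencies.
observed = [[70, 60, 50],
            [40, 50, 80]]

# Returns the chi-square statistic, p-value, (r-1)(c-1) d.f., and expected counts
chi2, p_value, df, expected = stats.chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, df = {df}, p-value = {p_value:.4f}")
# Reject independence at alpha = 0.05 when p_value < 0.05
```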
Key Concepts and Formulas
I. A Linear Probabilistic Model
1. When the data exhibit a linear relationship, the appropriate model is $y = \alpha + \beta x + \varepsilon$.
2. The random error $\varepsilon$ has a normal distribution with mean 0 and variance $\sigma^2$.
II. Method of Least Squares
1. Estimates a and b, for α and β, are chosen to minimize SSE, the sum of the squared deviations about the regression line $\hat{y} = a + bx$.
2. The least squares estimates are $b = S_{xy}/S_{xx}$ and $a = \bar{y} - b\bar{x}$.
III. Analysis of Variance
1. Total SS = SSR + SSE, where Total SS = $S_{yy}$ and SSR = $(S_{xy})^2/S_{xx}$.
2. The best estimate of $\sigma^2$ is MSE = SSE/(n − 2).
IV. Testing, Estimation, and Prediction
1. A test for the significance of the linear regression, $H_0: \beta = 0$, can be implemented using one of two test statistics:

$$t = \frac{b}{\sqrt{\mathrm{MSE}/S_{xx}}} \qquad \text{or} \qquad F = \frac{\mathrm{MSR}}{\mathrm{MSE}}$$
2. The strength of the relationship between x and y can be
measured using
$$R^2 = \frac{\mathrm{MSR}}{\mathrm{Total\ SS}}$$
which gets closer to 1 as the relationship gets stronger.
3. Use residual plots to check for nonnormality, inequality of
variances, and an incorrectly fit model.
4. Confidence intervals can be constructed to estimate the intercept α and slope β of the regression line and to estimate the average value of y, E(y), for a given value of x.
5. Prediction intervals can be constructed to predict a particular
observation, y, for a given value of x. For a given x,
prediction intervals are always wider than confidence
intervals.
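The pieces of this summary can be run end to end with scipy.stats.linregress; here is a sketch on the height/weight data from this chapter (the variable names are our own):

```python
from scipy import stats

height = [73, 71, 75, 72, 72, 75, 67, 69, 71, 69]            # x
weight = [185, 175, 200, 210, 190, 195, 150, 170, 180, 175]  # y

# Least squares fit, plus the two-tailed t test of H0: beta = 0
fit = stats.linregress(height, weight)
print(f"b = {fit.slope:.4f}, a = {fit.intercept:.2f}")
print(f"R^2 = {fit.rvalue ** 2:.4f}, p-value = {fit.pvalue:.4f}")
```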
V. Correlation Analysis
1. Use the correlation coefficient to measure the relationship between x and y when both variables are random:

$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}$$

2. The sign of r indicates the direction of the relationship; r near 0 indicates no linear relationship, and r near 1 or −1 indicates a strong linear relationship.
3. A test of the significance of the correlation coefficient is identical to the test of the slope β.
Cause and Effect
A significant correlation does not by itself establish causation; the possibilities include:
- X could cause Y
- Y could cause X
- X and Y could cause each other
- X and Y could be caused by a third variable Z
- X and Y could be related by chance (bad or good luck)
A careful examination of the study is needed; try to find previous evidence or academic explanations.