Chapter 8 Review, Part 2
Chapter 8 – Regression 2
Basic review, estimating
the standard error of the
estimate and short cut
problems and solutions.
1
You can use the regression
equation when:
1. the relationship between X and Y is linear,
2. r falls outside the CI.95 around 0.000 and is
therefore a statistically significant correlation,
and
3. X is within the range of X scores observed in
your sample.
2
Simple problems using the
regression equation
tY' = r * tX
tY' = .150 * 0.40 = 0.06
tY' = .40 * -1.70 = -0.68
tY' = .40 * 1.70 = 0.68
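The three simple problems above can be checked with a few lines of Python. This is only a sketch: the function name `predict_t_score` is a hypothetical helper, not something from the slides.

```python
def predict_t_score(r: float, t_x: float) -> float:
    """Regression equation in t-score form: tY' = r * tX."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    return r * t_x


# The three (r, tX) pairs from the slide.
for r, t_x in [(0.150, 0.40), (0.40, -1.70), (0.40, 1.70)]:
    print(f"r = {r:.3f}, tX = {t_x:+.2f} -> tY' = {predict_t_score(r, t_x):+.2f}")
```

Because r is never larger than 1 in absolute value, the predicted tY' is never further from zero than tX.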
3
Predictions from Raw Data
1. Calculate the t score for X.
tX = (X - X-bar) / sX
2. Solve the regression equation.
tY' = r * tX
3. Transform the estimated t score for Y into a raw score.
Y' = Y-bar + (tY' * sY)
4
Predicting from and to raw
scores
Problem: Estimate the midterm point total
given a study time of 400 minutes.
It is given that the estimated mean of the study
time is 560 minutes and the estimated standard
deviation is 216.02. (Range = 260-860)
It is given that the estimated mean of midterm
points is 76 and their estimated standard
deviation is 7.98.
The estimated correlation coefficient is .851.
5
Predicting from and to raw
scores
1. Translate raw X to tX score.
X = 400, X-bar = 560, sX = 216.02
(X - X-bar) / sX = tX
(400 - 560) / 216.02 = -0.74
6
Use regression equation
2. Find value of tY'.
r = .851
r * tX = tY'
.851 * -0.74 = -0.63
7
Translate tY' to raw Y'
Y-bar = 76.00, sY = 7.98
Y-bar + (tY' * sY) = Y'
76.00 + (-0.63 * 7.98) = 70.97
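The three steps of the study-time problem can be strung together in one small function. This is a minimal sketch; `predict_raw_y` and its parameter names are made up for illustration, and the numbers are the ones given in the problem.

```python
def predict_raw_y(x, mean_x, s_x, mean_y, s_y, r):
    """Predict a raw Y score from a raw X score."""
    t_x = (x - mean_x) / s_x         # step 1: raw X to t score
    t_y_pred = r * t_x               # step 2: regression equation
    return mean_y + t_y_pred * s_y   # step 3: t score back to raw Y


y_pred = predict_raw_y(x=400, mean_x=560, s_x=216.02,
                       mean_y=76.00, s_y=7.98, r=0.851)
print(f"Estimated midterm points: {y_pred:.2f}")
```

Carried at full precision, the result still rounds to 70.97, the same answer as the hand calculation.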
8
A Caution
Never assume that a correlation will stay
linear outside of the range you originally
observed.
Therefore, never use the regression
equation to make predictions from X
values outside of the range you found in
your sample.
Example: Measuring heights of children.
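One way to enforce this caution in practice is to refuse to predict outside the observed range. The sketch below reuses the study-time problem (observed range 260 to 860 minutes); the function name and the choice to raise an error are assumptions made for illustration.

```python
def predict_midterm_points(study_minutes: float) -> float:
    """Predict midterm points, but only inside the observed range of X."""
    x_min, x_max = 260, 860        # range of study times observed in the sample
    mean_x, s_x = 560, 216.02      # study time: mean and standard deviation
    mean_y, s_y = 76.00, 7.98      # midterm points: mean and standard deviation
    r = 0.851

    if not x_min <= study_minutes <= x_max:
        raise ValueError(
            f"X = {study_minutes} is outside the observed range "
            f"{x_min}-{x_max}; the relationship may not stay linear there."
        )
    t_x = (study_minutes - mean_x) / s_x
    return mean_y + r * t_x * s_y


print(round(predict_midterm_points(400), 2))   # inside the range: a prediction
try:
    predict_midterm_points(900)                # outside the range: refused
except ValueError as err:
    print("No prediction:", err)
```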
9
Reviewing the r table and
reporting the results of
calculating r from a
random sample
10
How the r table is laid out:
the important columns
Column 1 of the r table shows degrees of freedom
for correlation and regression (dfREG)
dfREG=nP-2
Column 2 shows the CI.95 for varying degrees of
freedom
Column 3 shows the absolute value of the r that falls
just outside the CI.95. Any r this far or further from
0.000 falsifies the hypothesis that rho=0.000 and can
be used in the regression equation to make
predictions of Y scores for people who were not in
the original sample but who were part of the
population from which the sample is drawn.
11
df       nonsignificant     .05      .01
1        -.996 to .996      .997     .9999
2        -.949 to .949      .950     .990
3        -.877 to .877      .878     .959
4        -.810 to .810      .811     .917
5        -.753 to .753      .754     .874
6        -.706 to .706      .707     .834
7        -.665 to .665      .666     .798
8        -.631 to .631      .632     .765
9        -.601 to .601      .602     .735
10       -.575 to .575      .576     .708
11       -.552 to .552      .553     .684
12       -.531 to .531      .532     .661
...
100      -.194 to .194      .195     .254
200      -.137 to .137      .138     .181
300      -.112 to .112      .113     .148
500      -.087 to .087      .088     .115
1000     -.061 to .061      .062     .081
2000     -.043 to .043      .044     .058
10000    -.019 to .019      .020     .026

Find your degrees of freedom (nP - 2) in the df column.
If r falls within the 95% CI around 0.000, then the result is not significant. You cannot reject the null hypothesis. You must assume that rho = 0.00.
Does the absolute value of r equal or exceed the value in the .05 column? If so, r is significant with alpha = .05. You can use it in the regression equation to estimate Y scores.
If r is significant you can consider it an unbiased, least squares estimate of rho.
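The table lookup can be sketched as a small dictionary plus a comparison. The critical values below are copied from the r table; the names `R_TABLE` and `classify_r` are hypothetical, and the sketch only handles df values that appear in the table.

```python
# df -> (critical |r| at alpha = .05, critical |r| at alpha = .01)
R_TABLE = {
    1: (.997, .9999), 2: (.950, .990), 3: (.878, .959), 4: (.811, .917),
    5: (.754, .874), 6: (.707, .834), 7: (.666, .798), 8: (.632, .765),
    9: (.602, .735), 10: (.576, .708), 11: (.553, .684), 12: (.532, .661),
    100: (.195, .254), 200: (.138, .181), 300: (.113, .148),
    500: (.088, .115), 1000: (.062, .081), 2000: (.044, .058),
    10000: (.020, .026),
}


def classify_r(r: float, df: int) -> str:
    """Return 'n.s.', 'p<.05' or 'p<.01' for a sample r with df = nP - 2."""
    if df not in R_TABLE:
        raise KeyError(f"df = {df} is not in this excerpt of the r table")
    crit_05, crit_01 = R_TABLE[df]
    if abs(r) >= crit_01:
        return "p<.01"
    if abs(r) >= crit_05:
        return "p<.05"
    return "n.s."


print(classify_r(0.352, 8))    # inside the CI.95 for df = 8
print(classify_r(-0.790, 5))   # outside the CI.95 for df = 5
```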
Example: Anchovy pizza
and horror films, rho=0.000
(scale 0-9)
H1: People who enjoy food
with strong flavors also
enjoy other strong
sensations.
H0: There is no relationship
between enjoying food
with strong flavors and
enjoying other strong
sensations.
anchovies   horror films
7           7
7           9
3           8
3           6
0           9
8           6
4           5
1           2
1           1
1           6
Can we reject the null hypothesis?
13
Can we reject the null hypothesis?
[Scatterplot: Pizza (vertical axis, 0 to 8) against Horror films (horizontal axis, 0 to 8)]
14
Can we reject the null hypothesis?
We do the math and we find that:
r = .352
df = 8
15
r table
df       nonsignificant     .05      .01
8        -.631 to .631      .632     .765
This finding falls within the
CI.95 around 0.000
We call such findings “nonsignificant”
Nonsignificant is abbreviated n.s.
We would report these findings as follows:
r (8)=0.352, n.s.
Given that it fell inside the CI.95, we must
assume that rho actually equals zero and that
our sample r is .352 instead of 0.000 solely
because of sampling fluctuation.
We go back to predicting that everyone will
score at the mean of Y.
17
How to report a significant r
For example, let’s say that you had a sample
(nP=30) and r = -.400
Looking under nP-2=28 dfREG, we find the
interval consistent with the null is between -.360 and +.360
So we are outside the CI.95 for rho=0.000
We would write that result as r(28)=-.400, p<.05
That tells you the dfREG, the value of r, and that
you can expect an r that far from 0.000 five or
fewer times in 100 when rho = 0.000
18
Then there is Column 4
Column 4 shows the values that lie outside a CI.99
(The CI.99 itself isn’t shown like the CI.95 in Column 2
because it isn’t important enough.)
However, Column 4 gives you bragging rights.
If your r is as far or further from 0.000 as the
number in Column 4, you can say there is 1 chance or
fewer in 100 of an r being this far from zero when rho = 0.000
(p<.01).
For example, let’s say that you had a sample (nP=30)
and r = -.525.
The critical value at .01 is .463. You are further from
0.00 than that. So you can brag.
You write that result as r(28)=-.525, p<.01.
19
To summarize
If r falls inside the CI.95 around 0.000, it is
nonsignificant (n.s.) and you can’t use the
regression equation (e.g., r(28)=.300, n.s.).
If r falls outside the CI.95, but not as far from
0.000 as the number in Column 4, you have a
significant finding and can use the regression
equation (e.g., r(28)=-.400, p<.05).
If r is as far or further from zero as the number
in Column 4, you can use the regression
equation and brag while doing it (e.g., r(28)=.525, p<.01).
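The three reporting cases can be sketched as a formatting helper. The function `report_r` is hypothetical. The .01 critical value for 28 df (.463) is quoted on the slides; the .05 critical value of .361 is an assumption inferred from the CI.95 of -.360 to +.360.

```python
def report_r(r: float, df: int, crit_05: float, crit_01: float) -> str:
    """Format a correlation the way the slides report it, e.g. r(28)=-.400, p<.05."""
    if abs(r) >= crit_01:
        verdict = "p<.01"
    elif abs(r) >= crit_05:
        verdict = "p<.05"
    else:
        verdict = "n.s."
    # Drop the leading zero: 0.300 -> .300 and -0.400 -> -.400
    r_text = f"{r:.3f}".replace("0.", ".", 1)
    return f"r({df})={r_text}, {verdict}"


# Critical values for 28 df: .361 (assumed from the CI.95) and .463 (from the slides).
for r in (0.300, -0.400, 0.525):
    print(report_r(r, df=28, crit_05=0.361, crit_01=0.463))
```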
20
Can you reject H0?
r = .386
nP = 19
dfREG = 17

df       nonsignificant     .05      .01
10       -.575 to .575      .576     .708
11       -.552 to .552      .553     .684
12       -.531 to .531      .532     .661
13       -.513 to .513      .514     .641
14       -.496 to .496      .497     .623
15       -.481 to .481      .482     .606
16       -.467 to .467      .468     .590
17       -.455 to .455      .456     .575
18       -.443 to .443      .444     .561
19       -.432 to .432      .433     .549
...
40       -.303 to .303      .304     .393
50       -.272 to .272      .273     .354
60       -.249 to .249      .250     .325
22
Can you reject H0?
r = -.386
nP = 47
dfREG = 45

df       nonsignificant     .05      .01
10       -.575 to .575      .576     .708
11       -.552 to .552      .553     .684
12       -.531 to .531      .532     .661
13       -.513 to .513      .514     .641
14       -.496 to .496      .497     .623
15       -.481 to .481      .482     .606
16       -.467 to .467      .468     .590
17       -.455 to .455      .456     .575
18       -.443 to .443      .444     .561
19       -.432 to .432      .433     .549
...
40       -.303 to .303      .304     .393
50       -.272 to .272      .273     .354
60       -.249 to .249      .250     .325
23
How much better than the mean can we guess?
24
Improved prediction
If we can use the regression equation rather
than the mean to make individualized estimates
of Y scores, how much better are our estimates?
We are making predictions about scores on the
Y variable from our knowledge of the
statistically significant correlation between X & Y
and the fact that we know someone’s X score.
The average unsquared error when we predict
that everyone will score at the mean of Y equals
sY, the ordinary standard deviation of Y.
How much better than that can we do?
25
Estimating the standard error of
the estimate the (very) long way.
Calculate correlation (which includes calculating
s for Y).
If the correlation is significant, you can use the
regression equation to make individualized
predictions of scores on the Y variable.
The average unsquared error of prediction when
you do that is called the estimated standard
error of the estimate.
26
Example for Prediction Error
A study was performed to investigate
whether the quality of an image affects
reading time.
The experimental hypothesis was that
reduced quality would slow down reading
time.
Quality was measured on a scale of 1 to
10. Reading time was in seconds.
27
Quality vs Reading Time data:
Compute the correlation
Quality         Reading time
(scale 1-10)    (seconds)
4.30            8.1
4.55            8.5
5.55            7.8
5.65            7.3
6.30            7.5
6.45            7.3
6.45            6.0

Is there a relationship?
Check for linearity.
Compute r.
28
Calculate t scores for X
X        X - X-bar    (X - X-bar)²    tX = (X - X-bar) / sX
4.30     -1.31        1.71            -1.48
4.55     -1.06        1.12            -1.19
5.55     -0.06        0.00            -0.07
5.65      0.04        0.00             0.05
6.30      0.69        0.48             0.78
6.45      0.84        0.71             0.95
6.45      0.84        0.71             0.95

ΣX = 39.25
n = 7
X-bar = 5.61
SSW = 4.73
MSW = 4.73/(7-1) = 0.79
sX = 0.89
29
Calculate t scores for Y
Y        Y - Y-bar    (Y - Y-bar)²    tY = (Y - Y-bar) / sY
8.1       0.60        0.36             0.76
8.5       1.00        1.00             1.26
7.8       0.30        0.09             0.38
7.3      -0.20        0.04            -0.25
7.5       0.00        0.00             0.00
7.3      -0.20        0.04            -0.25
6.0      -1.50        2.25            -1.89

ΣY = 52.5
n = 7
Y-bar = 7.50
SSW = 3.78
MSW = 3.78/(7-1) = 0.63
sY = 0.79
30
Plot t scores
tX       tY
-1.48     0.76
-1.19     1.28
-0.07     0.39
 0.05    -0.25
 0.78     0.00
 0.95    -0.25
 0.95    -1.89
31
t score plot with best fitting
line: linear? YES!
[Plot: Reading Time (t score) on the vertical axis against Image quality (t score) on the horizontal axis, both running from -2.00 to 2.00, with the best fitting line]
32
Calculate r
tX       tY       tX - tY    (tX - tY)²
-1.48     0.76    -2.24      5.02
-1.19     1.28    -2.47      6.10
-0.07     0.39    -0.46      0.21
 0.05    -0.25     0.30      0.09
 0.78     0.00     0.78      0.61
 0.95    -0.25     1.20      1.44
 0.95    -1.88     2.83      8.01

Σ(tX - tY)² = 21.48
Σ(tX - tY)² / (nP - 1) = 21.48 / 6 = 3.580
r = 1 - (1/2 * 3.580) = 1 - 1.79 = -0.790
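The same calculation can be sketched in Python from the raw quality and reading-time scores. The helper names are hypothetical. `statistics.stdev` divides by n - 1, which matches the MSW step on the slides. Because the code carries full precision instead of rounding each t score to two decimals, it gives a value near -.78 rather than exactly -.790.

```python
from statistics import mean, stdev


def t_scores(values):
    """Convert raw scores to t scores using the estimated (n - 1) standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def r_from_t_scores(xs, ys):
    """r = 1 - 1/2 * [sum of (tX - tY)^2 / (nP - 1)]."""
    t_x, t_y = t_scores(xs), t_scores(ys)
    n_pairs = len(xs)
    mean_sq_diff = sum((a - b) ** 2 for a, b in zip(t_x, t_y)) / (n_pairs - 1)
    return 1 - 0.5 * mean_sq_diff


quality = [4.30, 4.55, 5.55, 5.65, 6.30, 6.45, 6.45]
reading_time = [8.1, 8.5, 7.8, 7.3, 7.5, 7.3, 6.0]

r = r_from_t_scores(quality, reading_time)
print(f"r = {r:.3f}")
```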
33
Check whether r is significant
r = -0.790
df = nP-2 = 5
alpha is .05
Look in r table: With 5 dfREG, the CI.95 goes
from -.753 to +.753
r(5)= -.790, p <.05
r is significant!
34
We can calculate the Y' for every raw X
X        Y'
4.30     8.42
4.55     8.23
5.55     7.54
5.65     7.47
6.30     7.01
6.45     6.91
6.45     6.91
35
Can we show mathematically that regression
estimates are better than mean estimates?
Y        Y-bar    Y'
8.1      7.5      8.42
8.5      7.5      8.23
7.8      7.5      7.54
7.3      7.5      7.47
7.5      7.5      7.01
7.3      7.5      6.91
6.0      7.5      6.91

We expect of course that there will be less error if we use regression.
To calculate the standard deviation we take deviations of Y from the mean of Y, square them, add them up, divide by degrees of freedom, and then take the square root.
To calculate the standard error of the estimate, sEST, we will take the deviations of each raw Y score from its regression equation estimate, square them, add them up, divide by degrees of freedom, and take the square root.
36
Estimated standard error
of the estimate
Y        Y'       Y - Y'    (Y - Y')²
8.1      8.42     -0.32     0.10
8.5      8.23      0.27     0.07
7.8      7.54      0.26     0.07
7.3      7.47     -0.17     0.03
7.5      7.01      0.49     0.24
7.3      6.91      0.39     0.15
6.0      6.91     -0.91     0.83

SSRES = 1.49
MSRES = 1.49/(7-2) = 0.30
sEST = 0.546
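The long way can be sketched in Python as follows. The variable names are illustrative, and r = -0.790 is taken from the slides rather than recomputed. Because nothing is rounded along the way, the result lands close to, but not exactly on, the hand-calculated 0.546.

```python
from math import sqrt
from statistics import mean, stdev

quality = [4.30, 4.55, 5.55, 5.65, 6.30, 6.45, 6.45]
reading_time = [8.1, 8.5, 7.8, 7.3, 7.5, 7.3, 6.0]
r = -0.790  # correlation as reported on the slides

mean_x, s_x = mean(quality), stdev(quality)            # stdev divides by n - 1
mean_y, s_y = mean(reading_time), stdev(reading_time)

# Regression estimate Y' for every raw X.
predicted = [mean_y + r * ((x - mean_x) / s_x) * s_y for x in quality]

# Deviations of each Y from its own Y', squared and summed.
ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(reading_time, predicted))
ms_res = ss_res / (len(quality) - 2)                   # divide by dfREG = nP - 2
s_est = sqrt(ms_res)

print(f"SSRES = {ss_res:.2f}, MSRES = {ms_res:.2f}, sEST = {s_est:.3f}")
```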
37
How much better?
sY = 0.80
sEST = 0.546
(.80 - .546) / .80 = .32 = 32%
32% less error when we use the regression
equation instead of the mean to predict.
38
Mathematical magic
There is usually an alternative formula for calculating
statistics that is easier to perform.
We went through a lot of extra steps to calculate sEST = 0.546.
It is not necessary to calculate all of the Y' values.
39
Another way to phrase it:
How much error did we get
rid of?
Treat it as a weight loss problem.
If Jack is 30 pounds overweight and he
loses 40% of it, how much is he still
overweight?
He lost .400 x 30 pounds = 12 pounds.
He has 30 – 12 = 18 pounds left to lose.
40
SSY = error to start
r² = proportion of error lost
SSY is the total amount of error we start
with when predicting scores on Y. It is the
amount of error when everyone is
predicted to score at the mean.
The proportion of error you get rid of
using the regression equation as your
predictor equals Pearson’s correlation
coefficient squared (r²).
41
To get the total error left
find how much you got rid
of, then subtract from
what you started with
Amount you got rid of: SSY * r²
Amount left: SSRES = SSY - (SSY * r²)
Average amount of squared error left:
MSRES = SSRES/dfREG = SSRES/(nP-2)
sEST = square root of MSRES
42
Computing sEST the easier
way!
We already knew that SSY = 3.80 and r = -0.790.
SSRES = SSY - (SSY * r²) = 3.80 - (3.80 * (-0.79)²) = 1.43
MSRES = 1.43/(7-2) = 0.286
sEST = 0.535
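The shortcut needs only SSY, r and the number of pairs. A minimal sketch, with a hypothetical function name:

```python
from math import sqrt


def s_est_shortcut(ss_y: float, r: float, n_pairs: int) -> float:
    """Estimated standard error of the estimate from SSY, r and nP."""
    ss_res = ss_y - ss_y * r ** 2      # error left after removing the r^2 share
    ms_res = ss_res / (n_pairs - 2)    # divide by dfREG = nP - 2
    return sqrt(ms_res)


s_est = s_est_shortcut(ss_y=3.80, r=-0.790, n_pairs=7)
print(f"sEST = {s_est:.3f}")
```

The code does not round SSRES or MSRES along the way, so its third decimal can differ slightly from the hand calculation.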
43
How much better?
sY = 0.80
sEST = 0.535
(.80 - .535) / .80 = .33 = 33%
33% less error when we use the regression
equation instead of the mean to predict.
The difference between 33% and 32%
when we calculated using the long way
is due to rounding error.
44
Stating the obvious:
The estimated standard deviation (s) was the
estimated average unsquared distance of scores
in the population from mu.
The estimated standard error of the estimate
(sEST) is the estimated average unsquared
distance of scores in the population from the
regression equation based predicted Y scores.
Both reflect the error of prediction. Using the
regression equation individualizes prediction
and, if r is significant, leads to less error.
45
Do one yourself.
Assume the original sum of squares for error is
420.00, nP=22 and the sum of the squared
differences between the tX and tY scores is
12.60.
What is r?
Is r statistically significant? Write the results as
you would in a report.
What is the estimated average unsquared
distance of Y scores from the regression line?
What percent improvement is obtained when s
is compared to sEST?
46
Answers: What is r? Is it
significant?
Compute r
Σ(tX - tY)² = 12.60
Σ(tX - tY)² / (nP - 1) = 12.60/21 = .600
r = 1.000 - 1/2(.600) = .700
Is r significant?
r(20)= .700, p<.01
47
What is the estimated average
unsquared distance of scores in
the population from the
regression line?
That is the same as asking “What is the
estimated standard error of the estimate?”
SSRES = SSY - (SSY * r²) = 420.00 - [420.00 * (0.70)²] = 214.20
MSRES = 214.20/(20) = 10.71
SEST = 3.27
48
What percent improvement is obtained
when s is compared to sEST?
MSW = SSW/df = 420.00/21 = 20.00
s = √20.00 = 4.47
49
Last and (perhaps) least:
Proportion improvement = (s-sEST)/s
(4.47 – 3.27)/4.47=.268
Percent improvement = proportion improvement *100
In this case there was about a 26.8%
improvement in unsquared error when you use
the regression equation rather than the mean as
your basis for predicting Y scores.
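The whole practice problem can be checked in a few lines. This is a sketch using the givens from the exercise; the variable names are illustrative.

```python
from math import sqrt

ss_y = 420.00          # original sum of squares for error
n_pairs = 22           # nP
sum_sq_t_diff = 12.60  # sum of squared differences between tX and tY

r = 1 - 0.5 * (sum_sq_t_diff / (n_pairs - 1))
s = sqrt(ss_y / (n_pairs - 1))                      # error when predicting the mean
s_est = sqrt((ss_y - ss_y * r ** 2) / (n_pairs - 2))  # error with the regression equation
improvement = (s - s_est) / s

print(f"r = {r:.3f}")
print(f"s = {s:.2f}, sEST = {s_est:.2f}")
print(f"Percent improvement = {improvement * 100:.1f}%")
```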
50
End chapter 8 slides here
Slides past here were not
covered in lecture and will
not be on the exam.
51
Error Types: Type 1 Error
Type 1 error occurs when you accidentally
get a random sample with an r outside
the range predicted by the null hypothesis
even though rho=0.000. This forces you
to reject the null hypothesis when there
really is no relationship between X and Y
in the population as a whole.
Scientists are conservative and set up
conditions to avoid Type 1 errors.
52
Error Types: Type 2 Error
A type 2 error can only occur when there
really is a correlation between X and Y in
the population, but you accidentally get a
sample r that falls within the range
predicted by the null hypothesis. You must
then fail to reject the null and assume
rho=0.000
This is incorrect and results in Type 2
error.
53
Alpha levels
Any result can be found by chance.
However some results are so strong that
they are very unlikely.
Unlikely is defined as occurring by chance
5 (or fewer) times in 100.
The risk of getting a weird sample that
causes a Type 1 error is called alpha.
alpha = .05
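A small simulation can illustrate what alpha = .05 means. The sketch below draws many random samples of 10 pairs in which X and Y are truly unrelated (rho = 0) and counts how often the sample r still reaches the .05 critical value for 8 df (.632, from the r table). The normal scores, the number of trials and the seed are all assumptions made for illustration.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation computed from sums of squares and cross products."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def type1_error_rate(n_pairs=10, critical_r=0.632, trials=2000, seed=0):
    """Proportion of rho = 0 samples whose |r| reaches the critical value."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        xs = [rng.gauss(0, 1) for _ in range(n_pairs)]
        ys = [rng.gauss(0, 1) for _ in range(n_pairs)]  # drawn independently of xs
        if abs(pearson_r(xs, ys)) >= critical_r:
            rejections += 1                              # a Type 1 error
    return rejections / trials


rate = type1_error_rate()
print(f"Proportion of samples that wrongly reject H0: {rate:.3f}")
```

The proportion should come out in the neighborhood of .05, which is the risk of a Type 1 error that alpha describes.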
54