The t-test - University of South Florida


Regression



Correlation and regression are closely related in use and in math.
Correlation summarizes the relation between two variables.
Regression is used to predict values of one variable from values of the other (e.g., SAT to predict GPA).
Basic Ideas (2)




Sample value: $Y_i = a + bX_i + e_i$
Intercept (a): the value of Y where X = 0.
Slope (b): the change in Y if X changes by 1 unit. Rise over run.
If error is removed, we have a predicted value for each person at X (the line): $Y' = a + bX$
Suppose on average houses are worth about $75.00 a square
foot. Then the equation relating price to size would be
Y’=0+75X. The predicted price for a 2000 square foot house
would be $150,000.
Linear Transformation
$Y' = a + bX$
1 to 1 mapping of variables via a line.
Permissible operations are addition and multiplication (interval data).
[Figure: two panels plotting Y against X, titled "Add a constant" and "Multiply by a constant".]
Linear Transformation (2)



Centigrade to Fahrenheit
Note the 1 to 1 map: 0 degrees C is 32 degrees F, and 100 degrees C is 212 degrees F.
Intercept? Slope? $Y' = a + bX$
[Figure: line plot of Degrees F against Degrees C, with the points (0, 32) and (100, 212) marked.]
Intercept is 32. When X (Cent) is 0, Y (Fahr) is 32.
Slope is 1.8. When Cent goes from 0 to 100 (run), Fahr goes from 32 to 212 (rise), and 212 - 32 = 180. Then 180/100 = 1.8 is the rise over run, which is the slope. Y = 32 + 1.8X, or F = 32 + 1.8C.
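The rise-over-run reasoning above can be sketched in a few lines of Python. The function names `line_through` and `c_to_f` are made up for this illustration; only the two reference points (0, 32) and (100, 212) come from the slide.

```python
def line_through(x1, y1, x2, y2):
    """Return (intercept, slope) of the straight line through two points."""
    slope = (y2 - y1) / (x2 - x1)   # rise over run
    intercept = y1 - slope * x1     # value of Y when X = 0
    return intercept, slope


# Freezing and boiling points of water as (Centigrade, Fahrenheit) pairs.
a, b = line_through(0, 32, 100, 212)


def c_to_f(celsius):
    """Apply the linear transformation F = a + b*C."""
    return a + b * celsius


print(f"F = {a:.0f} + {b:.1f}C")
print(f"{c_to_f(100):.0f}")
```

Any two distinct points on the line would give the same intercept and slope, which is why the freezing and boiling points are enough.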
Regression Line (1) Basics
1. Passes through both means.
2. Passes close to points. Note errors.
3. Described by an equation.
[Figure: scatterplot of weight by height with the regression line.]
Regression Line (2) Slope
Equation for a line is Y = mX + b in algebra.
In regression, the equation is usually written Y = a + bX.
[Figure: Plot of Weight by Height. Regression line: Weight = -327 + 7.15*Height. Mean weight = 150.7 lbs.; mean height = 66.8 inches.]
Y is the DV (weight), X is the IV (height), a is the intercept (-327) and b is the slope (7.15).
The slope, b, indicates rise over run. It tells how many units Y changes for a 1 unit change in X. In our example, the slope is a bit over 7, so a change of 1 inch is expected to produce a change of a bit more than 7 pounds.
Regression Line (3) Intercept
The Y intercept, a, tells where the line crosses the Y axis; it's the value of Y when X is zero.
[Figure: Plot of Weight by Height. Regression line: Weight = -327 + 7.15*Height. Mean weight = 150.7 lbs.; mean height = 66.8 inches.]
The intercept is calculated by: $a = \bar{Y} - b\bar{X}$
Sometimes the intercept has meaning; sometimes not. It depends on the meaning of X = 0. In our example, the intercept is -327. This means that if a person were 0 inches tall, we would expect them to weigh -327 lbs. Nonsense. But if X were the number of smiles, then a would have meaning.
Correlation & Regression
Correlation & regression are closely related.
1. The correlation coefficient is the slope of the regression line if X and Y are measured as z scores. It is interpreted as the change in Y, in SD_Y units, for a change of 1 SD_X.
2. For raw scores, the slope is: $b = r \frac{SD_Y}{SD_X}$
The slope for raw scores is the correlation times the ratio of 2 standard deviations. (These SDs are computed with (N-1), not N.) In our example, the correlation was .96, so the slope can be found by b = .96*(33.95/4.54) = .96*7.48 = 7.18 with r rounded to two places; the fitted slope of 7.15 is consistent with an unrounded r of about .956.
Recall that $a = \bar{Y} - b\bar{X}$. Our intercept is 150.7 - 7.15*66.8 ≈ -327.
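As a rough check on the two formulas, the sketch below plugs in the values reported on the slides (r = .96, SDs of 33.95 and 4.54, means of 150.7 and 66.8). The function names are hypothetical. Because r is rounded, the computed slope lands near 7.18 instead of exactly at the fitted 7.15.

```python
def slope_from_r(r, sd_y, sd_x):
    """Raw-score slope: the correlation times the ratio of the two SDs."""
    return r * sd_y / sd_x


def intercept_from_means(mean_y, mean_x, b):
    """Intercept that makes the line pass through both means."""
    return mean_y - b * mean_x


# Values reported on the slides (weight in lbs., height in inches).
r = 0.96
sd_y, sd_x = 33.95, 4.54
mean_y, mean_x = 150.7, 66.8

b_from_rounded_r = slope_from_r(r, sd_y, sd_x)   # close to, not exactly, 7.15
a = intercept_from_means(mean_y, mean_x, 7.15)   # uses the fitted slope

print(f"slope from rounded r: {b_from_rounded_r:.2f}")
print(f"intercept: {a:.2f}")
```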
Correlation & Regression (2)
3. The regression equation is used to make predictions. The formula to do so is just: $Y' = a + bX$
Suppose someone is 68 inches tall. Predicted weight is -327 + 7.15*68 = 159.2.
[Figure: Estimating Y for X = 3. Regression line Y = 2 + .5*X, with intercept = 2 and slope = .5; at X = 3, Y = 2 + .5(3) = 3.5.]
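Both predictions on this slide are the same one-line computation. A minimal sketch, with `predict` as an illustrative name:

```python
def predict(x, a, b):
    """Predicted value Y' = a + b*X for a line with intercept a and slope b."""
    return a + b * x


# Weight predicted from a height of 68 inches, using the fitted line.
weight_at_68 = predict(68, a=-327, b=7.15)

# The small example from the figure: Y = 2 + .5*X evaluated at X = 3.
y_at_3 = predict(3, a=2, b=0.5)

print(f"{weight_at_68:.1f} lbs.")
print(y_at_3)
```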
Review




What is the slope? What does it tell or
mean?
What is the intercept? What does it tell or
mean?
How are the slope of the regression line and
the correlation coefficient related?
What is the main use of the regression line?
Test Questions
[Figure: four scatterplots labeled A, B, C, and D. Axis labels recovered from the panels: Miles per Gallon, Time to Accelerate from 0 to 60 mph (sec), Vehicle Weight (lbs.), Engine Displacement (cu. inches), Model Year (modulo 100).]
What is the approximate value of the intercept for Figure C?
a. 0
b. 10
c. 15
d. 20
Test Questions
In a regression line, the equation used is typically $Y' = a + bX$. What does the value a stand for?
independent variable
intercept
predicted value (DV)
slope
Regression of Weight on Height
      Ht (X)   Wt (Y)
      61       105
      62       120
      63       120
      65       160
      65       120
      68       145
      69       175
      70       160
      72       185
      75       210
N     10       10
M     67       150
SD    4.57     33.99
Correlation (r) = .94.
Regression equation: Y' = -316.86 + 6.97X
[Figure: scatterplot of Wt by Ht with the regression line $Y' = a + bX$.]
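The slope, intercept, and correlation for these ten cases can be recomputed from sums of squares and cross-products. This is a plain-Python sketch (`fit_line` is an illustrative name, not anything from the slides). For this data it gives a slope near 6.97, an intercept near -316.86, and r near .94, which also matches the predicted values listed for each height.

```python
from math import sqrt

heights = [61, 62, 63, 65, 65, 68, 69, 70, 72, 75]            # X, inches
weights = [105, 120, 120, 160, 120, 145, 175, 160, 185, 210]  # Y, lbs.


def fit_line(xs, ys):
    """Least-squares intercept and slope, plus the Pearson correlation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)                        # SS for X
    syy = sum((y - mean_y) ** 2 for y in ys)                        # SS for Y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))  # cross-products
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # line passes through both means
    r = sxy / sqrt(sxx * syy)  # correlation
    return a, b, r


a, b, r = fit_line(heights, weights)
print(f"Y' = {a:.2f} + {b:.2f}X, r = {r:.2f}")
```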
Predicted Values & Errors
$Y' = a + bX$
Numbers for linear part and error.
           Ht      Wt        Y'        Error
           61      105       108.19    -3.19
           62      120       115.16    4.84
           63      120       122.13    -2.13
           65      160       136.06    23.94
           65      120       136.06    -16.06
           68      145       156.97    -11.97
           69      175       163.94    11.06
           70      160       170.91    -10.91
           72      185       184.84    0.16
           75      210       205.75    4.25
M          67      150       150.00    0.00
SD         4.57    33.99     31.85     11.89
Variance   20.89   1155.56   1014.37   141.32
Note M of Y' and Residuals. Note variance of Y is V(Y') + V(res).
[Figure: plot of Wt by Ht with the regression line.]
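The two notes above (the residuals average zero, and the variance of Y splits into V(Y') + V(res)) can be checked numerically. The sketch below refits the line from the ten cases and uses the standard library's `statistics.variance`, which divides by N - 1. The results should agree with the table up to rounding.

```python
from statistics import mean, variance

heights = [61, 62, 63, 65, 65, 68, 69, 70, 72, 75]
weights = [105, 120, 120, 160, 120, 145, 175, 160, 185, 210]

# Least-squares slope and intercept.
mx, my = mean(heights), mean(weights)
b = (sum((x - mx) * (y - my) for x, y in zip(heights, weights))
     / sum((x - mx) ** 2 for x in heights))
a = my - b * mx

predicted = [a + b * x for x in heights]                 # linear part, Y'
residuals = [y - p for y, p in zip(weights, predicted)]  # error, Y - Y'

var_y = variance(weights)       # N - 1 denominator
var_pred = variance(predicted)
var_res = variance(residuals)

print(f"mean of Y' = {mean(predicted):.2f}, mean of residuals = {mean(residuals):.2f}")
print(f"V(Y) = {var_y:.2f}, V(Y') + V(res) = {var_pred + var_res:.2f}")
```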
Error variance
$S^2_{Y'} = \frac{\sum (Y - Y')^2}{N}$
$S^2_{Y'} = S^2_Y (1 - r^2)$
(Heiman's notation for error is not standard.)
In our example, r = .94; r^2 = .88
$S^2_{Y'} = \frac{\sum (Y - Y')^2}{N} = 141.32$
$S^2_{Y'} = S^2_Y (1 - r^2) = 1156 \times (1 - .878) \approx 141$ (using r^2 before it is rounded to .88)
Standard error of the estimate: average distance from prediction
$S_{Y'} = S_Y \sqrt{1 - r^2}$
In our example, $S_{Y'} = \sqrt{141.32} \approx 12$
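The two routes to the error variance can be compared directly on the ten height and weight cases. One caveat, stated as an observation and not as the author's intent: the formula above divides by N, but the tabled 141.32 matches an N - 1 denominator, so this sketch uses N - 1 throughout to reproduce the table.

```python
from math import sqrt

heights = [61, 62, 63, 65, 65, 68, 69, 70, 72, 75]
weights = [105, 120, 120, 160, 120, 145, 175, 160, 185, 210]

n = len(heights)
mx = sum(heights) / n
my = sum(weights) / n
sxx = sum((x - mx) ** 2 for x in heights)
syy = sum((y - my) ** 2 for y in weights)
sxy = sum((x - mx) * (y - my) for x, y in zip(heights, weights))

b = sxy / sxx
a = my - b * mx
r = sxy / sqrt(sxx * syy)
var_y = syy / (n - 1)  # assumption: N - 1 denominator, to match the table

# Route 1: average squared distance between observed and predicted Y.
ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(heights, weights))
error_var_direct = ss_res / (n - 1)

# Route 2: shrink the variance of Y by the proportion not accounted for.
error_var_formula = var_y * (1 - r ** 2)

std_error_estimate = sqrt(error_var_formula)
print(f"{error_var_direct:.2f} {error_var_formula:.2f} {std_error_estimate:.2f}")
```

The rounded r^2 of .88 gives a noticeably smaller answer (about 139), which is why the sketch keeps r unrounded.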
Variance Accounted for
$r^2 = 1 - \frac{S^2_{Y'}}{S^2_Y}$
(Heiman's notation for error is not standard.)
The basic idea is to try to maximize r-square, the variance accounted for. The closer this value is to 1.0, the more accurate the predictions will be.
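Using the variances tabled for the height and weight example (error variance 141.32, variance of Y 1155.56), the formula can be evaluated directly. A small sketch with illustrative variable names:

```python
error_variance = 141.32   # variance of the residuals, from the table
total_variance = 1155.56  # variance of Y (weight), from the table

# Variance accounted for: one minus the proportion of variance left as error.
r_squared = 1 - error_variance / total_variance

print(f"r^2 = {r_squared:.2f}")
```

The result rounds to .88, matching the squared correlation reported for that example.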
Sample Exam Data from Previous Class
Exam 1 Exam 2
86.00
98.00
70.00
84.00
82.00
92.00
92.00
72.00
96.00
82.00
56.00
70.00
76.00
82.00
74.00
94.00
78.00
56.00
66.00
72.00
A sample of 10 scores from both exams
Assuming these are representative, what can you say about
the exams? The students?
Scatterplot & Boxplots of 2 Exams
[Figure: scatterplot and boxplots of Exam 1 and Exam 2.]
Descriptive Stats
Descriptives
                      Exam1       Exam2
Mean                  83.4412     70.7721
Std. Error of Mean    .89508      1.27332
Median                86.0000     72.0000
Variance              108.959     220.503
Std. Deviation        10.43837    14.84935
Minimum               52.00       24.00
Maximum               100.00      100.00
Range                 48.00       76.00
Correlations
Correlations
                               Exam1     Exam2
Exam1   Pearson Correlation    1         .420**
        Sig. (2-tailed)                  .000
        N                      165       136
Exam2   Pearson Correlation    .420**    1
        Sig. (2-tailed)        .000
        N                      136       139
**. Correlation is significant at the 0.01 level (2-tailed).
Scatterplot with means and regression line
Note that the correlation, r, is .42 and the squared correlation, $R^2$, is .177. $R^2$ is also the variance accounted for. We can predict a bit less than 20 percent of the variance in Exam 2 from Exam 1.
Coefficients (a)
Model 1       Unstandardized B   Std. Error   Standardized Beta   t       Sig.
(Constant)    20.895             9.377                            2.228   .028
Exam1         .598               .112         .420                5.360   .000
a. Dependent Variable: Exam2
Predicted Scores
$Y' = a + bX$
Predicted Exam 2 = 20.895 + .598*Exam1
For example, if I got 85 on Exam 1, then my predicted score for Exam 2 is 20.895 + .598*85 = 71.73, or about 72 percent.
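The prediction can be wrapped in a small function, sketched below with the coefficients from the output. As a cross-check, the sketch also recovers the slope from the correlation and the two standard deviations in the descriptive statistics. The names `predict_exam2` and `slope_from_r` are made up for this example.

```python
INTERCEPT = 20.895  # (Constant) from the coefficients output
SLOPE = 0.598       # unstandardized coefficient for Exam1


def predict_exam2(exam1_score):
    """Predicted Exam 2 score from an Exam 1 score: Y' = a + b*X."""
    return INTERCEPT + SLOPE * exam1_score


predicted = predict_exam2(85)

# Cross-check: slope = r * SD_Y / SD_X, with r = .420 and the reported SDs.
slope_from_r = 0.420 * 14.84935 / 10.43837

print(f"predicted Exam 2 = {predicted:.2f}")
print(f"slope recovered from r and SDs = {slope_from_r:.3f}")
```

The recovered slope matches the reported .598 to within rounding of r.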