Chapter 5
Multicollinearity, Dummy and Interaction Variables
1
We noted from the previous chapter that there are several
dangers of adding new variables indiscriminately to the model.
First, although unadjusted R2 goes up, we lose degrees of
freedom because we have to estimate additional coefficients.
The smaller the degrees of freedom, the less precise the
parameter estimates. There is another serious consequence of
adding too many variables to a model. If a model has several
variables, it is likely that some of the variables will be strongly
correlated. This property, known as multicollinearity, can
drastically alter the results from one model to another, making
them much harder to interpret.
2
Example 1
Let
Housing_t = number of housing starts (in thousands) in the U.S.
Pop_t = U.S. population in millions
GDP_t = U.S. Gross Domestic Product in billions of dollars
Intrate_t = new home mortgage rate
t = 1963 to 1985
3
SAS codes:
data housing;
infile 'd:\teaching\MS3215\housing.txt';
input year housing pop gdp unemp intrate ;
proc reg data= housing ;
model housing= pop intrate;
run;
proc reg data= housing ;
model housing= gdp intrate;
run;
proc reg data= housing ;
model housing= pop gdp intrate;
run;
4
The REG Procedure
Model: MODEL1
Dependent Variable: housing

Number of Observations Read          23
Number of Observations Used          23

Analysis of Variance

                                Sum of         Mean
Source               DF        Squares       Square    F Value    Pr > F
Model                 2        1125359       562679       7.50    0.0037
Error                20        1500642        75032
Corrected Total      22        2626001

Root MSE              273.91987    R-Square    0.4285
Dependent Mean       1601.07826    Adj R-Sq    0.3714
Coeff Var              17.10846

Parameter Estimates

                      Parameter      Standard
Variable      DF       Estimate         Error    t Value    Pr > |t|
Intercept      1    -3813.21672    1588.88417      -2.40      0.0263
pop            1       33.82138       9.37464       3.61      0.0018
intrate        1     -198.41880      51.29444      -3.87      0.0010
5
The REG Procedure
Model: MODEL1
Dependent Variable: housing

Number of Observations Read          23
Number of Observations Used          23

Analysis of Variance

                                Sum of         Mean
Source               DF        Squares       Square    F Value    Pr > F
Model                 2        1134747       567374       7.61    0.0035
Error                20        1491254        74563
Corrected Total      22        2626001

Root MSE              273.06168    R-Square    0.4321
Dependent Mean       1601.07826    Adj R-Sq    0.3753
Coeff Var              17.05486

Parameter Estimates

                      Parameter      Standard
Variable      DF       Estimate         Error    t Value    Pr > |t|
Intercept      1      687.92418     382.69637       1.80      0.0874
gdp            1        0.90543       0.24899       3.64      0.0016
intrate        1     -169.67320      43.83996      -3.87      0.0010
6
The REG Procedure
Model: MODEL1
Dependent Variable: housing

Number of Observations Read          23
Number of Observations Used          23

Analysis of Variance

                                Sum of         Mean
Source               DF        Squares       Square    F Value    Pr > F
Model                 3        1147699       382566       4.92    0.0108
Error                19        1478302        77805
Corrected Total      22        2626001

Root MSE              278.93613    R-Square    0.4371
Dependent Mean       1601.07826    Adj R-Sq    0.3482
Coeff Var              17.42177

Parameter Estimates

                      Parameter      Standard
Variable      DF       Estimate         Error    t Value    Pr > |t|
Intercept      1    -1317.45317    4930.68042      -0.27      0.7922
pop            1       14.91398      36.55401       0.41      0.6878
gdp            1        0.52186       0.97391       0.54      0.5983
intrate        1     -184.77902      58.10610      -3.18      0.0049
7
Note that in the last model, the t statistics for Pop and GDP are
insignificant, but they are both significant when entered
separately in the first and second models. This is because the
three variables Pop, GDP and Intrate are highly correlated. It
can be shown that
Cor(GDP, Pop) = 0.99
Cor(GDP, Intrate) = 0.88
Cor(Pop, Intrate) = 0.91
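These correlations can be verified with PROC CORR; a minimal sketch, assuming the housing data set created in the earlier DATA step:
proc corr data= housing;
/* pairwise correlations among the explanatory variables */
var pop gdp intrate;
run;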
8
Example 2
Let expenses_i be the cumulative expenditure on maintenance
for a given automobile, miles_i its cumulative mileage in
thousands of miles, and weeks_i its age in weeks since the
original purchase, i = 1, …, 57.
9
SAS codes:
data automobile;
infile 'd:\teaching\MS3215\automobile.txt';
input weeks miles expenses;
proc reg data= automobile;
model expenses = weeks;
run;
proc reg data= automobile;
model expenses = miles;
run;
proc reg data= automobile;
model expenses = weeks miles;
run;
10
The REG Procedure
Model: MODEL1
Dependent Variable: expenses

Number of Observations Read          57
Number of Observations Used          57

Analysis of Variance

                                Sum of          Mean
Source               DF        Squares        Square    F Value    Pr > F
Model                 1       66744854      66744854     491.16    <.0001
Error                55        7474117        135893
Corrected Total      56       74218972

Root MSE              368.63674    R-Square    0.8993
Dependent Mean       1426.57895    Adj R-Sq    0.8975
Coeff Var              25.84061

Parameter Estimates

                      Parameter      Standard
Variable      DF       Estimate         Error    t Value    Pr > |t|
Intercept      1     -626.35977     104.71371      -5.98      <.0001
weeks          1        7.34942       0.33162      22.16      <.0001
11
The REG Procedure
Model: MODEL1
Dependent Variable: expenses

Number of Observations Read          57
Number of Observations Used          57

Analysis of Variance

                                Sum of          Mean
Source               DF        Squares        Square    F Value    Pr > F
Model                 1       63715228      63715228     333.63    <.0001
Error                55       10503743        190977
Corrected Total      56       74218972

Root MSE              437.00933    R-Square    0.8585
Dependent Mean       1426.57895    Adj R-Sq    0.8559
Coeff Var              30.63338

Parameter Estimates

                      Parameter      Standard
Variable      DF       Estimate         Error    t Value    Pr > |t|
Intercept      1     -796.19928     134.75770      -5.91      <.0001
miles          1       53.45246       2.92642      18.27      <.0001
12
The REG Procedure
Model: MODEL1
Dependent Variable: expenses

Number of Observations Read          57
Number of Observations Used          57

Analysis of Variance

                                Sum of          Mean
Source               DF        Squares        Square    F Value    Pr > F
Model                 2       70329066      35164533     488.16    <.0001
Error                54        3889906         72035
Corrected Total      56       74218972

Root MSE              268.39391    R-Square    0.9476
Dependent Mean       1426.57895    Adj R-Sq    0.9456
Coeff Var              18.81381

Parameter Estimates

                      Parameter      Standard
Variable      DF       Estimate         Error    t Value    Pr > |t|
Intercept      1        7.20143     117.81217       0.06      0.9515
weeks          1       27.58405       2.87875       9.58      <.0001
miles          1     -151.15752      21.42918      -7.05      <.0001
13
A car that is driven more should have a greater maintenance
expense. Similarly, the older the car, the greater the cost of
maintaining it. So we would expect both slope coefficients to
be positive. It is interesting to note that even though the
coefficient for miles is positive in the second model, it is
negative in the third model. Thus, there is a reversal in sign.
The magnitude of the coefficient for weeks also changes
substantially. The t statistics for miles and weeks are also
much lower in the third model even though both variables are
still significant.
The problem is again the high correlation between weeks and
miles.
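This correlation can be checked with PROC CORR; a minimal sketch, assuming the automobile data set from the earlier DATA step:
proc corr data= automobile;
var weeks miles;
run;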
14
Consider the model
$y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i$
and let $\hat\beta_0$, $\hat\beta_1$ and $\hat\beta_2$ be the least
squares estimates of $\beta_0$, $\beta_1$ and $\beta_2$ respectively. It can be shown
that
$\mathrm{var}(\hat\beta_1) = \dfrac{\sigma^2}{(1 - r_{12}^2)\sum_{i=1}^{n}(x_{i1} - \bar{x}_1)^2}$
$\mathrm{var}(\hat\beta_2) = \dfrac{\sigma^2}{(1 - r_{12}^2)\sum_{i=1}^{n}(x_{i2} - \bar{x}_2)^2}$
$\mathrm{cov}(\hat\beta_1, \hat\beta_2) = \dfrac{-r_{12}\,\sigma^2}{(1 - r_{12}^2)\sqrt{\sum_{i=1}^{n}(x_{i1} - \bar{x}_1)^2}\sqrt{\sum_{i=1}^{n}(x_{i2} - \bar{x}_2)^2}}$
where $r_{12}$ = Cor($X_1$, $X_2$)
15
The effect of increasing $r_{12}$ on $\mathrm{var}(\hat\beta_2)$. Write $A = \sigma^2 / \sum_{i=1}^{n}(x_{i2} - \bar{x}_2)^2$, the variance when the two regressors are uncorrelated.

Value of r12      var(beta2-hat)
0.00              A
0.5               1.33 x A
0.7               1.96 x A
0.8               2.78 x A
0.9               5.26 x A
0.95              10.26 x A
0.97              16.92 x A
0.99              50.25 x A
0.995             100 x A
0.999             500 x A

The sign reversal and decrease in t value are caused by the inflated
variance of the estimators.
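Each factor in the table is just $1/(1 - r_{12}^2)$; a short DATA step sketch (the data set name vif_factor is arbitrary) reproduces the column:
data vif_factor;
/* variance inflation factor 1/(1 - r12**2) for selected correlations */
do r12= 0, 0.5, 0.7, 0.8, 0.9, 0.95, 0.97, 0.99, 0.995, 0.999;
factor= 1/(1 - r12**2);
output;
end;
run;
proc print data= vif_factor;
run;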
16
Consequences of Multicollinearity
Wider confidence intervals.
Insignificant t values.
High R2, so that the F test can convincingly
reject $H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$, yet few t values are significant.
Sensitivity of least squares estimates and their standard
errors to small changes in the model.
17
Exact multicollinearity exists if two or more independent
variables have a perfect linear relationship between them. In
this case there is no unique solution to the normal equations
derived from least squares. When this happens, one or more
variables should be dropped from the model.
Multicollinearity is very much the norm in regression analysis
involving non-experimental (or uncontrolled) data. It can never
be eliminated. The question is not about the existence or
non-existence of multicollinearity, but how serious the problem is.
18
Identifying Multicollinearity
High R2 (and significant F value) but low values for t
statistics.
High correlation coefficients between the explanatory
variables. But the converse need not be true. In other words,
multicollinearity may still be a problem even though the
correlation between two variables does not appear to be high.
This is because three or more variables may be strongly
correlated even though the pairwise correlations are not high.
Regression coefficient estimates and standard errors
sensitive to small changes in specification.
19
Variance Inflation Factor (VIF):
Let $x_1, x_2, \ldots, x_p$ be the p explanatory variables in a regression.
Perform the regression of $x_k$ on the remaining p - 1 explanatory
variables and call the coefficient of determination from that
regression $R_k^2$. The VIF for variable $x_k$ is
$VIF_k = \dfrac{1}{1 - R_k^2}$
VIF is a measure of the strength of the relationship between
each explanatory variable and all other explanatory variables in
the model.
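$R_k^2$ can be obtained directly from the auxiliary regression; for example, a minimal sketch for the VIF of pop in the housing model:
proc reg data= housing;
/* R-square from this auxiliary regression gives VIF = 1/(1 - R-square) for pop */
model pop= gdp intrate;
run;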
20
Relationship between $R_k^2$ and $VIF_k$:

R_k^2      VIF_k
0          1
0.9        10
0.99       100
21
How large does $VIF_k$ have to be to suggest a serious problem
with multicollinearity?
a) An individual $VIF_k$ larger than 10 indicates that
multicollinearity may be seriously influencing the least
squares estimates of the regression coefficients.
b) If the average of the $VIF_k$, $\overline{VIF} = \sum_{k=1}^{p} VIF_k / p$, is larger than 5,
then serious problems may exist. $\overline{VIF}$ indicates how many
times larger the error of the regression is due to
multicollinearity than it would be if the variables were
uncorrelated.
22
SAS codes:
data housing;
infile 'd:\teaching\MS3215\housing.txt';
input year housing pop gdp unemp intrate ;
proc reg data= housing ;
model housing= pop gdp intrate/vif;
run;
23
The REG Procedure
Model: MODEL1
Dependent Variable: housing

Number of Observations Read          23
Number of Observations Used          23

Analysis of Variance

                                Sum of         Mean
Source               DF        Squares       Square    F Value    Pr > F
Model                 3        1147699       382566       4.92    0.0108
Error                19        1478302        77805
Corrected Total      22        2626001

Root MSE              278.93613    R-Square    0.4371
Dependent Mean       1601.07826    Adj R-Sq    0.3482
Coeff Var              17.42177

Parameter Estimates

                      Parameter      Standard                            Variance
Variable      DF       Estimate         Error    t Value    Pr > |t|    Inflation
Intercept      1    -1317.45317    4930.68042      -0.27      0.7922            0
pop            1       14.91398      36.55401       0.41      0.6878     87.97808
gdp            1        0.52186       0.97391       0.54      0.5983     64.66953
intrate        1     -184.77902      58.10610      -3.18      0.0049      7.42535
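Applying the earlier guidelines as a quick check: the average VIF is $(87.98 + 64.67 + 7.43)/3 \approx 53.36$, far above 5, and two individual VIFs exceed 10, so multicollinearity is a serious problem in this model.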
24
Solutions to Multicollinearity
1) Benign Neglect
If an analyst is less interested in interpreting individual
coefficients and more interested in forecasting, then
multicollinearity may not be a serious concern. Even with high
correlations among independent variables, if the regression
coefficients are significant and have meaningful signs and
magnitudes, one need not be too concerned with
multicollinearity.
25
2) Eliminating Variables
Removing a variable that is strongly correlated with the rest
would generally improve the significance of the other variables.
There is a danger, however, in removing too many variables
from the model, because that would lead to bias in the
estimates.
3) Re-specifying the model
For example, in the housing regression, we can express the
variables in per capita terms rather than include population as
an explanatory variable, leading to
$\dfrac{Housing_t}{Pop_t} = \beta_0 + \beta_1 \dfrac{GDP_t}{Pop_t} + \beta_2 Intrate_t + \varepsilon_t$
26
SAS codes:
data housing;
infile 'd:\teaching\MS3215\housing.txt';
input year housing pop gdp unemp intrate ;
phousing= housing/pop;
pgdp= gdp/pop;
proc reg data= housing ;
model phousing = pgdp intrate/vif;
run;
27
The REG Procedure
Model: MODEL1
Dependent Variable: phousing

Number of Observations Read          23
Number of Observations Used          23

Analysis of Variance

                               Sum of        Mean
Source               DF       Squares      Square    F Value    Pr > F
Model                 2      26.33472    13.16736       7.66    0.0034
Error                20      34.38472     1.71924
Corrected Total      22      60.71944

Root MSE            1.31120    R-Square    0.4337
Dependent Mean      7.50743    Adj R-Sq    0.3771
Coeff Var          17.46531

Parameter Estimates

                     Parameter    Standard                            Variance
Variable      DF      Estimate       Error    t Value    Pr > |t|    Inflation
Intercept      1       2.07920     3.34724       0.62      0.5415            0
pgdp           1       0.93567     0.36701       2.55      0.0191      3.45825
intrate        1      -0.69832     0.18640      -3.75      0.0013      3.45825
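Note that the two VIFs are identical. With only two explanatory variables, each auxiliary $R_k^2$ equals the same squared correlation $r^2$ between pgdp and intrate, so both VIFs are $1/(1 - r^2)$; here $r^2 = 1 - 1/3.45825 \approx 0.711$, a far milder problem than before.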
28
4) Increasing the sample size
This solution is often recommended on the grounds that such
an increase improves the precision of an estimator and
hence reduces the adverse effects of multicollinearity. But
sometimes additional sample information may not be available.
5) Other estimation techniques (beyond the scope of this
course)
Ridge regression
Principal component analysis
29
Dummy variables
In regression analysis, qualitative or categorical variables are
often useful. Qualitative variables such as sex, marital status or
political affiliation can be represented by dummy variables,
usually coded as 0 and 1. The two values signify that the
observation belongs to one of two possible categories.
30
Example
The Salary Survey data set was developed from a salary survey of
computer professionals in a large corporation. The objective of the
survey was to identify and quantify those variables that determine
salary differentials. In addition, the data could be used to determine
if the corporation’s salary administration guidelines were being
followed. The data appear in the file salary.txt. The response variable
is salary (S) and the explanatory variables are: (1) experience (X),
measured in years; (2) education (E), coded as 1 for completion of a
high school (H.S.) diploma, 2 for completion of a bachelor degree
(B.S.), and 3 for the completion of an advanced degree; (3)
management (M), which is coded as 1 for a person with management
responsibility and 0 otherwise. We shall try to measure the effects of
these three variables on salary using regression analysis.
31
So, the regression model is
$S_i = \beta_0 + \beta_1 X_i + \beta_2 E_i + \beta_3 M_i + \varepsilon_i$
where
$M_i = 1$ if employee i takes on management responsibility, and 0 otherwise.
This leads to two possible regressions:
i) For managerial positions:
$S_i = (\beta_0 + \beta_3) + \beta_1 X_i + \beta_2 E_i + \varepsilon_i$
ii) For non-managerial positions:
$S_i = \beta_0 + \beta_1 X_i + \beta_2 E_i + \varepsilon_i$
$\beta_3$ therefore represents the average salary difference between
employees with and without managerial responsibilities.
32
SAS codes:
data salary;
infile 'd:\teaching\MS3215\salary.txt';
input s x e m;
proc reg data= salary;
model s= x e m;
run;
33
The REG Procedure
Model: MODEL1
Dependent Variable: s

Number of Observations Read          46
Number of Observations Used          46

Analysis of Variance

                                 Sum of          Mean
Source               DF         Squares        Square    F Value    Pr > F
Model                 3       928714168     309571389     179.63    <.0001
Error                42        72383410       1723415
Corrected Total      45      1001097577

Root MSE           1312.78883    R-Square    0.9277
Dependent Mean          17270    Adj R-Sq    0.9225
Coeff Var             7.60147

Parameter Estimates

                      Parameter      Standard
Variable      DF       Estimate         Error    t Value    Pr > |t|
Intercept      1     6963.47772     665.69473      10.46      <.0001
x              1      570.08738      38.55905      14.78      <.0001
e              1     1578.75032     262.32162       6.02      <.0001
m              1     6688.12994     398.27563      16.79      <.0001
34
So the estimated regressions are
i) For managerial positions:
$\hat{S}_i = 6963.48 + 6688.13 + 570.09 X_i + 1578.75 E_i = 13651.61 + 570.09 X_i + 1578.75 E_i$
ii) For non-managerial positions:
$\hat{S}_i = 6963.48 + 570.09 X_i + 1578.75 E_i$
Note that all variables are significant and all estimated coefficients have
positive signs, indicating that, other things being equal,
a. Each additional year of work experience is worth a salary increment
of $570.
b. An improvement in qualification from high school to a bachelor
degree, or from a bachelor degree to an advanced degree, is worth $1579.
c. On average, employees with managerial responsibility receive $6688
more than employees without managerial responsibility.
35
Note that only one dummy variable is needed to represent M,
which contains two categories. Suppose we define a new
variable which is the complement of M, that is,
$M'_i = 1$ if employee i does not take on management responsibility, and 0 otherwise.
Note that whenever $M_i = 1$, $M'_i = 0$. If $M'_i$ is used in conjunction
with $M_i$ then we have
$S_i = \beta_0 + \beta_1 X_i + \beta_2 E_i + \beta_3 M_i + \beta_4 M'_i + \varepsilon_i$
but note that the "implicit" explanatory variable (call it $I_i$)
attached to the intercept term is represented by a vector of 1s.
Hence, $I_i = M_i + M'_i$:
PERFECT MULTICOLLINEARITY
36
The method of least squares fails as a result and there is no
unique solution to the normal equations.
This problem is known as the "dummy variable trap". In general,
for a qualitative variable containing J categories, only J - 1
dummy variables are required.
Question: If $M_i$ is replaced by $M'_i$, will there be any
difference in the result?
Answer: Note that $M_i$ and $M'_i$ contain essentially the same
information. The results will be exactly the same.
37
SAS codes:
data salary;
infile 'd:\teaching\MS3215\salary.txt';
input s x e m;
mp= 0;
If m eq 0 then mp= 1;
proc reg data= salary;
model s= x e mp;
run;
38
The REG Procedure
Model: MODEL1
Dependent Variable: s

Number of Observations Read          46
Number of Observations Used          46

Analysis of Variance

                                 Sum of          Mean
Source               DF         Squares        Square    F Value    Pr > F
Model                 3       928714168     309571389     179.63    <.0001
Error                42        72383410       1723415
Corrected Total      45      1001097577

Root MSE           1312.78883    R-Square    0.9277
Dependent Mean          17270    Adj R-Sq    0.9225
Coeff Var             7.60147

Parameter Estimates

                      Parameter      Standard
Variable      DF       Estimate         Error    t Value    Pr > |t|
Intercept      1          13652     734.39164      18.59      <.0001
x              1      570.08738      38.55905      14.78      <.0001
e              1     1578.75032     262.32162       6.02      <.0001
mp             1    -6688.12994     398.27563     -16.79      <.0001
39
So the estimated regressions are
i) For managerial positions:
$\hat{S}_i = 13652 + 570.09 X_i + 1578.75 E_i$
ii) For non-managerial positions:
$\hat{S}_i = 13652 - 6688.13 + 570.09 X_i + 1578.75 E_i = 6963.87 + 570.09 X_i + 1578.75 E_i$
The results, except for minor differences due to rounding, are essentially
the same as those when M, instead of M', is used as an explanatory variable.
40
So far, education has been treated in a linear fashion. This may
be too restrictive. Instead, we shall view education as a
categorical variable and define two dummy variables to
represent its three categories:
$B_i = 1$ if employee i completed a bachelor degree as his/her highest level of educational attainment, and 0 otherwise
$A_i = 1$ if employee i completed an advanced degree as his/her highest level of educational attainment, and 0 otherwise
So, when
$E_i = 1$: $B_i = A_i = 0$
$E_i = 2$: $B_i = 1$, $A_i = 0$
$E_i = 3$: $B_i = 0$, $A_i = 1$
So the model to be estimated is
$S_i = \beta_0 + \beta_1 X_i + \beta_2 B_i + \beta_3 A_i + \beta_4 M_i + \varepsilon_i$
41
SAS codes:
data salary;
infile 'd:\teaching\MS3215\salary.txt';
input s x e m;
a= 0;
b= 0;
If e eq 2 then b= 1;
If e eq 3 then a= 1;
proc reg data= salary;
model s= x b a m;
run;
42
The REG Procedure
Model: MODEL1
Dependent Variable: s

Number of Observations Read          46
Number of Observations Used          46

Analysis of Variance

                                 Sum of          Mean
Source               DF         Squares        Square    F Value    Pr > F
Model                 4       957816858     239454214     226.84    <.0001
Error                41        43280719       1055627
Corrected Total      45      1001097577

Root MSE           1027.43725    R-Square    0.9568
Dependent Mean          17270    Adj R-Sq    0.9525
Coeff Var             5.94919

Parameter Estimates

                      Parameter      Standard
Variable      DF       Estimate         Error    t Value    Pr > |t|
Intercept      1     8035.59763     386.68926      20.78      <.0001
x              1      546.18402      30.51919      17.90      <.0001
b              1     3144.03521     361.96827       8.69      <.0001
a              1     2996.21026     411.75271       7.28      <.0001
m              1     6883.53101     313.91898      21.93      <.0001
43
The interpretation of the coefficients of X and M is the same
as before. The estimated coefficient of $B_i$ (3144.04) measures
the salary differential of bachelor degree holders relative
to high school leavers. Similarly, the estimated coefficient of $A_i$
(2996.21) measures the salary differential of advanced
degree holders relative to high school leavers. The difference
(3144.04 - 2996.21 = 147.83) measures the salary differential between
bachelor degree and advanced degree holders. Interestingly, the
results suggest that a bachelor degree is worth more than an
advanced degree! (But is the difference significant?)
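That question can be answered with a TEST statement in PROC REG; a minimal sketch (the label diff is arbitrary), reusing the salary data set with the dummies defined above:
proc reg data= salary;
model s= x b a m;
diff: test b - a = 0; /* H0: the b and a coefficients are equal */
run;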
44
Interaction Variables
The previous models all suggest that the effects of education
and management status on salary determination are additive.
For example, the effect of a management position is
measured by $\beta_4$ independently of the level of educational
attainment. Possible non-additive effects may be
evaluated by constructing additional variables designed to
capture interaction effects. Interaction variables are products
of existing variables; for example, $B \times M$ and $A \times M$ are
interaction variables capturing interaction effects between
educational levels and managerial responsibility.
The expanded model is
$S_i = \beta_0 + \beta_1 X_i + \beta_2 B_i + \beta_3 A_i + \beta_4 M_i + \beta_5 B_i M_i + \beta_6 A_i M_i + \varepsilon_i$
45
SAS codes:
data salary;
infile 'd:\teaching\MS3215\salary.txt';
input s x e m;
a= 0;
b= 0;
If e eq 2 then b= 1;
If e eq 3 then a= 1;
bm= b*m;
am= a*m;
proc reg data= salary;
model s= x b a m bm am;
run;
46
The REG Procedure
Model: MODEL1
Dependent Variable: s

Number of Observations Read          46
Number of Observations Used          46

Analysis of Variance

                                 Sum of          Mean
Source               DF         Squares        Square    F Value    Pr > F
Model                 6       999919409     166653235    5516.60    <.0001
Error                39         1178168         30209
Corrected Total      45      1001097577

Root MSE            173.80861    R-Square    0.9988
Dependent Mean          17270    Adj R-Sq    0.9986
Coeff Var             1.00641

Parameter Estimates

                      Parameter      Standard
Variable      DF       Estimate         Error    t Value    Pr > |t|
Intercept      1     9472.68545      80.34365     117.90      <.0001
x              1      496.98701       5.56642      89.28      <.0001
b              1     1381.67063      77.31882      17.87      <.0001
a              1     1730.74832     105.33389      16.43      <.0001
m              1     3981.37690     101.17472      39.35      <.0001
bm             1     4902.52307     131.35893      37.32      <.0001
am             1     3066.03512     149.33044      20.53      <.0001
47
Interpretation of regression results:
There are 6 regression models altogether:
i) high school leavers in non-managerial positions
$S_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
ii) high school leavers in managerial positions
$S_i = \beta_0 + \beta_1 X_i + \beta_4 M_i + \varepsilon_i$
iii) bachelor degree holders in non-managerial positions
$S_i = \beta_0 + \beta_1 X_i + \beta_2 B_i + \varepsilon_i$
iv) bachelor degree holders in managerial positions
$S_i = \beta_0 + \beta_1 X_i + \beta_2 B_i + \beta_4 M_i + \beta_5 B_i M_i + \varepsilon_i$
48
v) advanced degree holders in non-managerial positions
$S_i = \beta_0 + \beta_1 X_i + \beta_3 A_i + \varepsilon_i$
vi) advanced degree holders in managerial positions
$S_i = \beta_0 + \beta_1 X_i + \beta_3 A_i + \beta_4 M_i + \beta_6 A_i M_i + \varepsilon_i$
For example, to ascertain the marginal change in salary due to
the acquisition of an advanced degree,
$\dfrac{\partial S_i}{\partial A_i} = \beta_3 + \beta_6 M_i = 1730.75 + 3066.04 M_i$ (estimate)
Thus, the marginal change is $1730.75 for non-managerial
employees and $4796.79 for managerial employees.
49
Similarly, to investigate the impact of a change from a non-managerial
to a managerial position,
$\dfrac{\partial S_i}{\partial M_i} = \beta_4 + \beta_5 B_i + \beta_6 A_i = 3981.38 + 4902.52 B_i + 3066.04 A_i$ (estimate)
Thus, the marginal change is $3981.38 for high school
leavers, $8883.90 for bachelor degree holders and $7047.42
for advanced degree holders.
50
Comparing different groups of regression models by dummy
variables
Sometimes a collection of data may consist of two or more
distinct subsets, each of which may require a separate
regression. Serious bias may be incurred if a combined
regression relationship is used to represent the pooled data set.
51
Example
A job performance test was given to a group of 20 trainees on a
special employment program at the end of the job training
period. All these 20 trainees were eventually employed by the
company and given a performance evaluation score after 6
months. The data are given in the file employment.txt.
Let Y represent the job performance score of an employee and
X the score on the pre-employment test. We are concerned with
equal employment opportunity. We want to compare
52
Model 1 (pooled):
$y_{ij} = \beta_0 + \beta_1 x_{ij} + \varepsilon_{ij}$,  j = 1, 2;  i = 1, 2, …, n
Model 2A (Minority):
$y_{i1} = \beta_{01} + \beta_{11} x_{i1} + \varepsilon_{i1}$
Model 2B (White):
$y_{i2} = \beta_{02} + \beta_{12} x_{i2} + \varepsilon_{i2}$
In Model 1, race distinction is ignored, the data are pooled and
there is a single regression line. In Models 2A and 2B, there are
two separate regression relationships for the two subgroups,
each with a distinct set of regression coefficients.
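Models 2A and 2B can be fitted separately with BY-group processing; a minimal sketch, assuming the employ data set read as in the later DATA step (race = 1 for minority):
proc sort data= employ;
by race;
run;
proc reg data= employ;
model y= x;
by race; /* one regression per race group */
run;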
53
54
So, formally, we want to test
$H_0: \beta_{11} = \beta_{12}, \beta_{01} = \beta_{02}$
vs.
$H_1$: at least one of the equalities in $H_0$ is false
The test may be performed using dummy and interaction
variables. Define
$Z_{ij} = 1$ if j = 1 (minority), and $Z_{ij} = 0$ if j = 2 (white)
and formulate the following model, which we call Model 3:
$y_{ij} = \beta_0 + \beta_1 x_{ij} + \beta_2 Z_{ij} + \beta_3 Z_{ij} x_{ij} + \varepsilon_{ij}$,  j = 1, 2;  i = 1, 2, …, n
55
Note that Model 3 is equivalent to Models 2A and 2B. When
j = 1, $Z_{i1} = 1$, and Model 3 becomes
$y_{i1} = \beta_0 + \beta_1 x_{i1} + \beta_2 + \beta_3 x_{i1} + \varepsilon_{i1} = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) x_{i1} + \varepsilon_{i1}$,
which is Model 2A,
and when j = 2, $Z_{i2} = 0$, and Model 3 reduces to
$y_{i2} = \beta_0 + \beta_1 x_{i2} + \varepsilon_{i2}$,
which is Model 2B.
So a comparison between Model 1 and Models 2A and 2B is
equivalent to a comparison between Model 1 and Model 3.
56
Now, Model 1
$y_{ij} = \beta_0 + \beta_1 x_{ij} + \varepsilon_{ij}$,  j = 1, 2;  i = 1, 2, …, n
may be obtained by setting $\beta_2 = \beta_3 = 0$ in Model 3
$y_{ij} = \beta_0 + \beta_1 x_{ij} + \beta_2 Z_{ij} + \beta_3 Z_{ij} x_{ij} + \varepsilon_{ij}$,  j = 1, 2;  i = 1, 2, …, n
Thus, the hypothesis of interest becomes
$H_0: \beta_2 = \beta_3 = 0$
$H_1$: at least one of $\beta_2$ and $\beta_3$ is non-zero
57
The test may be carried out using a partial F test, defined as
$F = \dfrac{(SSE_R - SSE_F)/k}{SSE_F/(n - p - 1)}$
where k is the number of restrictions under $H_0$, $SSE_R$ is the
SSE of the restricted model, $SSE_F$ is the SSE of the full
model, and n - p - 1 is the error degrees of freedom of the full model.
58
SAS codes:
data employ;
infile 'd:\teaching\MS3215\employ.txt';
input x race y;
z= 0;
if race eq 1 then z= 1;
zx= z*x;
proc reg data= employ;
model y= x z zx;
test: test z= 0, zx= 0;
run;
proc reg data= employ;
model y= x;
run;
59
Regression results of Model 3 (full model)

The REG Procedure
Model: MODEL1
Dependent Variable: y

Number of Observations Read          20
Number of Observations Used          20

Analysis of Variance

                               Sum of        Mean
Source               DF       Squares      Square    F Value    Pr > F
Model                 3      62.63578    20.87859      10.55    0.0005
Error                16      31.65547     1.97847
Corrected Total      19      94.29125

Root MSE            1.40658    R-Square    0.6643
Dependent Mean      4.50850    Adj R-Sq    0.6013
Coeff Var          31.19840

Parameter Estimates

                     Parameter    Standard
Variable      DF      Estimate       Error    t Value    Pr > |t|
Intercept      1       2.01028     1.05011       1.91      0.0736
x              1       1.31340     0.67037       1.96      0.0677
z              1      -1.91317     1.54032      -1.24      0.2321
zx             1       1.99755     0.95444       2.09      0.0527
60
The REG Procedure
Model: MODEL1

Test test Results for Dependent Variable y

                           Mean
Source         DF        Square    F Value    Pr > F
Numerator       2       6.95641       3.52    0.0542
Denominator    16       1.97847
61
Regression results of Model 1 (restricted model)

The REG Procedure
Model: MODEL1
Dependent Variable: y

Number of Observations Read          20
Number of Observations Used          20

Analysis of Variance

                               Sum of        Mean
Source               DF       Squares      Square    F Value    Pr > F
Model                 1      48.72296    48.72296      19.25    0.0004
Error                18      45.56830     2.53157
Corrected Total      19      94.29125

Root MSE            1.59109    R-Square    0.5167
Dependent Mean      4.50850    Adj R-Sq    0.4899
Coeff Var          35.29093

Parameter Estimates

                     Parameter    Standard
Variable      DF      Estimate       Error    t Value    Pr > |t|
Intercept      1       1.03497     0.86803       1.19      0.2486
x              1       2.36053     0.53807       4.39      0.0004
62
Here, k = 2, n = 20, p = 3 and n - p - 1 = 16. Hence,
$F = \dfrac{(45.57 - 31.66)/2}{31.66/16} = 3.52$
Now let $\alpha = 0.10$. Then $F_{0.10, 2, 16} = 2.67$ and the p-value
of 3.52 is 0.0542. Hence we reject $H_0$ and conclude that the
relationship is different for the two groups. Specifically, for
minorities, we have
$\hat{y}_{i1} = 0.097 + 3.31 x_{i1}$
and for whites,
$\hat{y}_{i2} = 2.01 + 1.31 x_{i2}$
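The arithmetic can be verified in a short DATA step (a sketch using the SSE values and degrees of freedom from the two outputs above; probf is SAS's F distribution function):
data partialf;
sse_r= 45.56830; /* SSE of the restricted model */
sse_f= 31.65547; /* SSE of the full model */
k= 2; n= 20; p= 3;
f= ((sse_r - sse_f)/k) / (sse_f/(n - p - 1));
pvalue= 1 - probf(f, k, n - p - 1); /* upper-tail p-value, about 0.0542 */
run;
proc print data= partialf;
run;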
63