Chapter 15, part D

VI. Qualitative Independent Variables
For most of our models we have restricted our
independent variables to quantitative data: variables
that can take any numerical value in a range.
Past examples include salary, G.P.A., number of
customers, and repair cost ($).
Qualitative variables, which enter a regression as
dummy variables, are those whose values are two or
more categories rather than numbers (Gender,
Political Party, Region of Country).
A. A Dummy Variable
The simplest of dummy variables is one in which
there are only two possibilities for a qualitative
variable. You arbitrarily assign a value of 1 to one
possibility and a value of 0 to the other.
Examples: X=1 if Female; X=0 if Male
X=1 if Union worker; X=0 if Nonunion
X=1 if College Graduate; X=0 if not
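The 0/1 coding above can be sketched in a few lines of Python. This is only an illustration: the worker records and the helper name to_dummy are made up, not part of the problem set.

```python
# Hypothetical records; the names and categories are illustrative only.
workers = [
    {"name": "A", "union": "Union"},
    {"name": "B", "union": "Nonunion"},
    {"name": "C", "union": "Union"},
]

def to_dummy(value, category_coded_as_one):
    """Return 1 if value is the category we chose to code as 1, else 0."""
    return 1 if value == category_coded_as_one else 0

# X=1 if Union worker; X=0 if Nonunion
union_dummy = [to_dummy(w["union"], "Union") for w in workers]
print(union_dummy)
```

Which category gets the 1 is arbitrary; swapping the coding only flips the sign of the dummy's coefficient and shifts the intercept.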
B. Inclusion in a Regression
Problem #38 builds a model to relate Age (x1), Blood
Pressure (x2) and Smoking (x3) to the Risk of
Strokes (y).
Smoking is a dummy variable:
X3=1 if a smoker; X3=0 if a nonsmoker.
$y = \beta_0 + \beta_A \text{Age} + \beta_P \text{Pressure} + \beta_S \text{Smoke}$
Output
Overall, what do you make of these results?
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.934605168
R Square             0.87348682
Adjusted R Square    0.849765599
Standard Error       5.756574565
Observations         20

ANOVA
             df    SS             MS             F              Significance F
Regression   3     3660.739588    1220.246529    36.82301223    2.06404E-07
Residual     16    530.2104116    33.13815073
Total        19    4190.95

             Coefficients    Standard Error    t Stat          P-value
Intercept    -91.75949844    15.22276009       -6.027783261    1.75755E-05
Age          1.076741057     0.165963611       6.48781412      7.4873E-06
Pressure     0.251813473     0.045225519       5.567951023     4.24366E-05
Dummy        8.739871056     3.000815432       2.912498704     0.010173553
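One way to read the output is to check how its pieces fit together. The sketch below recomputes the t statistics, R Square, and F from the other numbers in the output, using the standard relationships (t = coefficient / standard error, R Square = SSR / SST, F = MSR / MSE). The variable names are illustrative.

```python
# (coefficient, standard error) pairs copied from the regression output.
coefficients = {
    "Intercept": (-91.75949844, 15.22276009),
    "Age": (1.076741057, 0.165963611),
    "Pressure": (0.251813473, 0.045225519),
    "Dummy": (8.739871056, 3.000815432),
}
ss_regression, ss_residual = 3660.739588, 530.2104116
df_regression, df_residual = 3, 16

# t Stat = coefficient / standard error
t_stats = {name: coef / se for name, (coef, se) in coefficients.items()}

# R Square = SSR / SST and F = MSR / MSE
ss_total = ss_regression + ss_residual
r_square = ss_regression / ss_total
f_stat = (ss_regression / df_regression) / (ss_residual / df_residual)

for name, t in t_stats.items():
    print(f"{name:10s} t = {t:.3f}")
print(f"R Square = {r_square:.4f}, F = {f_stat:.2f}")
```

The recomputed values should agree with the t Stat, R Square, and F entries reported in the output, up to rounding.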
C. Interpretation
The estimated coefficient on the Dummy for
smoking is 8.74.
Since X3=1 for a smoker, this means the estimated
probability a patient has a stroke in the next 10 years is
8.74 percentage points higher for a smoker than for a
nonsmoker of the same age and blood pressure.
You can’t do much about your age, but if you lower
your blood pressure by 10 points, you lower the
estimated risk by about 2.5 percentage points
(10 x 0.2518). Hmmm, what should a person do?
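These two comparisons can be reproduced from the fitted equation. The sketch below uses the coefficients rounded to four decimals; the patient (age 60, blood pressure 150) is a made-up example, and the function name predicted_risk is illustrative.

```python
# Rounded coefficients from the regression output (Intercept, Age, Pressure, Dummy).
B0, B_AGE, B_PRESSURE, B_SMOKE = -91.7595, 1.0767, 0.2518, 8.7399

def predicted_risk(age, pressure, smoker):
    """Estimated risk of stroke (y) from the fitted equation; smoker is 1 or 0."""
    return B0 + B_AGE * age + B_PRESSURE * pressure + B_SMOKE * smoker

# Hypothetical patient: age 60, blood pressure 150 (values chosen for illustration).
smoker = predicted_risk(60, 150, 1)
nonsmoker = predicted_risk(60, 150, 0)
lower_bp = predicted_risk(60, 140, 1)

print(round(smoker - nonsmoker, 2))  # effect of smoking, age and pressure held fixed
print(round(smoker - lower_bp, 2))   # effect of 10 extra points of blood pressure
```

Because the model is linear, these differences are the same for any age and blood pressure you plug in.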
D. Multi-level Dummy Variables
Many wage/salary regression models examine
differences in a wage variable by region of the
country.
For example, we could divide the country into 4
regions and create a dummy variable for each one,
equal to 1 if the worker is from that region and 0
otherwise.
Example
Suppose we have 3 workers in a set of data. Franklin is from
the North, Elly May is from the South, and Chet is from
the West. Our table of data might look like this:
Worker      Wage    North   South   East    West
Franklin    $$$     1       0       0       0
Elly May    $$      0       1       0       0
Chet        $       0       0       0       1
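A table like this can be built from each worker's home region. A minimal Python sketch, assuming the region is stored as a text label; the helper name region_dummies is illustrative, and wages are left out because the example gives no figures.

```python
REGIONS = ["North", "South", "East", "West"]

# Home regions from the example.
workers = {"Franklin": "North", "Elly May": "South", "Chet": "West"}

def region_dummies(region):
    """Return one 0/1 indicator per region, in the order of REGIONS."""
    return [1 if region == r else 0 for r in REGIONS]

for name, region in workers.items():
    print(name, region_dummies(region))
```

Each worker gets exactly one 1, in the column for his or her own region.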
The Model
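• If you have 4 levels for the qualitative variable
“Region”, you can only include 3 dummies in the equation.
Including all 4 alongside the intercept creates perfect
multicollinearity (the 4 dummies always sum to 1), so
least squares cannot find a unique set of coefficients
that minimizes the sum of squared residuals.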
• The omission of one region creates a benchmark
and allows you to compare all other regions to the
one omitted.
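To see why all 4 regions cannot be included, note that every worker belongs to exactly one region, so the four dummies always add up to 1, the same value the intercept multiplies. The sketch below shows the consequence with two made-up coefficient sets that produce identical fitted values; the numbers and function names are hypothetical.

```python
REGIONS = ["North", "South", "East", "West"]

def region_dummies(region):
    """One 0/1 indicator per region, in the order of REGIONS."""
    return [1 if region == r else 0 for r in REGIONS]

def fitted(intercept, slopes, dummies):
    """Fitted value: intercept plus the coefficient of whichever dummy equals 1."""
    return intercept + sum(b * d for b, d in zip(slopes, dummies))

# Two different, hypothetical coefficient sets that both use all 4 dummies.
set_a = (100, [50, -25, 0, -10])
set_b = (0, [150, 75, 100, 90])  # 100 moved from the intercept into every dummy

for region in REGIONS:
    d = region_dummies(region)
    # sum(d) is always 1, which duplicates the intercept's column of ones.
    print(region, sum(d), fitted(*set_a, d), fitted(*set_b, d))
```

Both sets fit every worker equally well, so the data cannot choose between them. Dropping one region removes the ambiguity and makes that region the benchmark.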
Hypothetical Regression Results
Let’s say that we leave out “East” and we find the
following:
Wage(Y) = 100 + 50(North) - 25(South) - 10(West)
Remember, “North”=1 only if a worker is from the
North, and the other region dummies, “South” and
“West”, are 0 for that worker.
Interpretation
Franklin is from the North, so “North”=1 and
“South”=“West”=0. His estimated wage is then
100+50=$150. Thus we could say that a worker
from the North, all else held constant, earns $50
more than a worker from the East, the omitted region.
Elly May is from the South, so “South”=1 and
“North”=“West”=0. Her estimated wage is then
100-25=$75.
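The same arithmetic works for every region. A short Python sketch using the hypothetical equation above, with East given an effect of 0 because it is the omitted benchmark; the function name estimated_wage is illustrative.

```python
# Coefficients from the hypothetical equation; East is the omitted benchmark.
INTERCEPT = 100
REGION_EFFECT = {"North": 50, "South": -25, "West": -10, "East": 0}

def estimated_wage(region):
    """Intercept plus the coefficient of the worker's region dummy."""
    return INTERCEPT + REGION_EFFECT[region]

for worker, region in [("Franklin", "North"), ("Elly May", "South"), ("Chet", "West")]:
    print(worker, estimated_wage(region))
```

A worker from the East gets the intercept alone, $100, which is why every other coefficient reads as a difference from the East.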