Class 18 – Thursday, Nov. 11

• Omitted Variables Bias
• Specially Constructed Explanatory
Variables
– Interactions
– Squared Terms for Curvature
– Dummy variables for categorical variables
(next class)
• I will e-mail you Homework 7 after class. It
will be due next Thursday.
California Test Score Data
• The California Standardized Testing and
Reporting (STAR) data set californiastar.JMP
contains data on test performance, school
characteristics and student demographic
backgrounds from 1998-1999.
• Average Test Score is the average of the
reading and math scores for a standardized test
administered to 5th grade students.
• One interesting question: What would be the
causal effect of decreasing the student-teacher
ratio by one student per teacher?
Multiple Regression and
Causal Inference
• Goal: Figure out what the causal effect on average
test score would be of decreasing student-teacher
ratio and keeping everything else in the world fixed.
• Lurking variable: A variable that is associated with
both average test score and student-teacher ratio.
• In order to figure out whether a drop in student-teacher ratio causes higher test scores, we want to
compare mean test scores among schools with
different student-teacher ratios but the same values
of the lurking variables.
• If we include all of the lurking variables in the multiple
regression model, the coefficient on student-teacher
ratio represents the change in the mean of test
scores that is caused by a one unit increase in
student-teacher ratio.
Omitted Variables Bias
Response: Average Test Score
Parameter Estimates
Term                    Estimate    Std Error   t Ratio   Prob>|t|
Intercept               698.93295   9.467491    73.82     <.0001
Student Teacher Ratio   -2.279808   0.479826    -4.75     <.0001
Response: Average Test Score
Parameter Estimates
Term                          Estimate    Std Error   t Ratio   Prob>|t|
Intercept                     686.03225   7.411312    92.57     <.0001
Student Teacher Ratio         -1.101296   0.380278    -2.90     0.0040
Percent of English Learners   -0.649777   0.039343    -16.52    <.0001
• Schools with many English learners tend to have worse resources.
The multiple regression shows how mean test score changes when
student-teacher ratio changes but percent of English learners is held
fixed. It gives a better idea of the causal effect of the student-teacher
ratio than the simple linear regression, which does not hold
percent of English learners fixed.
• Omitted variables bias from omitting percentage of English learners =
-2.28 - (-1.10) = -1.18.
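As a quick arithmetic check, the bias can be recomputed from the Student Teacher Ratio estimates reported in the two regression outputs above. This is a minimal Python sketch; the variable names are illustrative and are not part of the original slides.

```python
# Student Teacher Ratio coefficients as reported in the two JMP outputs.
slope_simple = -2.279808    # simple regression (Percent of English Learners omitted)
slope_multiple = -1.101296  # multiple regression (Percent of English Learners included)

# Omitted variables bias = estimate without the lurking variable
# minus the estimate with it.
bias = slope_simple - slope_multiple
print(round(bias, 2))
```

The negative sign says the simple regression makes a higher student-teacher ratio look more harmful than the regression that holds percent of English learners fixed.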
Omitted Variables Bias: General
Formula
• What happens if we omit a lurking variable from
the regression?
• Suppose we are interested in the causal effect of $x_1$
on y and believe that there are lurking variables
$x_2, \ldots, x_{p+1}$ and that
$E\{y \mid x_1, \ldots, x_{p+1}\} = \beta_0 + \beta_1 x_1 + \cdots + \beta_{p+1} x_{p+1}$
$E\{y \mid x_1, \ldots, x_p\} = \beta_0^* + \beta_1^* x_1 + \cdots + \beta_p^* x_p$
• $\beta_1$ is the causal effect of $x_1$ on y. If we omit
the lurking variable $x_{p+1}$, then the multiple
regression will be estimating the coefficient $\beta_1^*$
as the coefficient on $x_1$. How different are $\beta_1^*$
and $\beta_1$?
Omitted Variables Bias Formula
• Suppose that
$E\{y \mid x_1, \ldots, x_{p+1}\} = \beta_0 + \beta_1 x_1 + \cdots + \beta_{p+1} x_{p+1}$
$E\{y \mid x_1, \ldots, x_p\} = \beta_0^* + \beta_1^* x_1 + \cdots + \beta_p^* x_p$
$E\{x_{p+1} \mid x_1, \ldots, x_p\} = \delta_0 + \delta_1 x_1 + \cdots + \delta_p x_p$
• Then
$\beta_1^* = \beta_1 + \delta_1 \beta_{p+1}$
• Formula tells us about direction and magnitude
of bias from omitting a variable in estimating a
causal effect.
• Formula also applies to least squares estimates,
i.e., $\hat{\beta}_1^* = \hat{\beta}_1 + \hat{\delta}_1 \hat{\beta}_{p+1}$
• Key point: In order for there to be omitted
variable bias, the omitted variable must be
associated with both the explanatory variable of
interest and the response.
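The least squares version of the omitted variables bias formula can be checked numerically. The sketch below is an illustration on simulated data, not the California data: it assumes one explanatory variable of interest (x1) and one lurking variable (x2, playing the role of the omitted variable), and all function names, coefficients and the random seed are made up for the example.

```python
import random

def simple_slope(x, y):
    """Least squares slope from regressing y on x (with an intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def two_var_slopes(x1, x2, y):
    """Least squares slopes from regressing y on x1 and x2 (with an intercept),
    found by solving the 2x2 normal equations on centered data."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

# Simulated data (hypothetical): x2 is associated with both x1 and y.
random.seed(0)
x1 = [random.gauss(0, 1) for _ in range(500)]
x2 = [0.8 * a + random.gauss(0, 1) for a in x1]
y = [2 + 1.0 * a - 3.0 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

b1, b2 = two_var_slopes(x1, x2, y)   # coefficients with x2 included
b1_star = simple_slope(x1, y)        # coefficient on x1 with x2 omitted
delta1 = simple_slope(x1, x2)        # slope from regressing x2 on x1

print(b1_star, b1 + delta1 * b2)     # the two numbers agree
```

If either delta1 or b2 were zero, that is, if the omitted variable were unrelated to x1 or to y, the two estimates of the coefficient on x1 would coincide.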
Omitted Variables Bias Examples
• Would you expect the slope coefficient on
X to be too high, too low or have no bias
for the regression that omits the given
variable?
• Y = Test Score, X= Number of Music
Classes Taken, Omitted Variable =
Student Ability
• Y = Salary, X = Gender (1=Female,
0=Male), Omitted Variable = Education
Key Warning About Multiple
Regression
• Even if we have included many lurking
variables in the multiple regression, we
may have failed to include one or not have
enough data to include one. There will
then be omitted variables bias.
• The best way to study causal effects is to
do a randomized experiment (coming up
next week).
Specially Constructed
Explanatory Variables
• Interaction variables
• Squared and higher polynomial terms for
curvature
• Dummy variables for categorical variables.
Interaction
• Interaction is a three-variable concept. One of
these is the response variable (Y) and the other
two are explanatory variables (X1 and X2).
• There is an interaction between X1 and X2 if the
impact of an increase in X2 on Y depends on the
level of X1.
• To incorporate interaction in a multiple regression
model, we add the explanatory variable
$(X_1 - \bar{X}_1)(X_2 - \bar{X}_2)$. There is evidence of an
interaction if the coefficient on $(X_1 - \bar{X}_1)(X_2 - \bar{X}_2)$
is significant (t-test has p-value < .05).
An experiment to study how noise affects the performance of children tested second
grade hyperactive children and a control group of second graders who were not
hyperactive. One of the tasks involved solving math problems. The children solved
problems under both high-noise and low-noise conditions. Here are the mean scores:
[Figure: bar chart of Mean Mathematics Score (vertical axis from 0 to 250) for Control and Hyperactive children under High Noise and Low Noise conditions.]
Let Y = Mean Mathematics Score, X1 = Type of Child (0 = Control, 1 = Hyperactive),
X2 = Type of Noise (0 = Low Noise, 1 = High Noise). There is an interaction between
type of child and type of noise: the impact of increasing noise from low to high depends on
the type of child.
Interaction variables in JMP
• To add an interaction variable in Fit Model
in JMP, add the usual explanatory
variables first, then highlight X1 in the
Select Columns box and X2 in the
Construct Model Effects box. Then click
Cross in the Construct Model Effects box.
• JMP creates the explanatory variable
$(X_1 - \bar{X}_1)(X_2 - \bar{X}_2)$
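Outside JMP, the same centered product column can be built by hand. The following is a minimal Python sketch of that construction; the function name `centered_product` and the small speed and cars lists are hypothetical and only illustrate the arithmetic.

```python
def centered_product(x1, x2):
    """Return the column (x1 - mean(x1)) * (x2 - mean(x2)), row by row."""
    m1 = sum(x1) / len(x1)
    m2 = sum(x2) / len(x2)
    return [(a - m1) * (b - m2) for a, b in zip(x1, x2)]

# Hypothetical values, not taken from any data set in these notes.
speed = [55, 60, 65]
cars = [8, 10, 12]

interaction = centered_product(speed, cars)
print(interaction)
```

Passing the same column twice, as in `centered_product(speed, speed)`, gives the centered squared term for that variable.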
Interaction Example
• The number of car accidents on a stretch of highway
seems to be related to the number of vehicles that travel
over it and the speed at which they are traveling.
• A city alderman has decided to ask the county sheriff for
statistics covering the last few years, with the intention of
examining these data statistically in order to introduce
new speed laws that will reduce traffic accidents.
• accidents.JMP contains data for different time periods on
the number of cars passing along the stretch of road, the
average speed of the cars and the number of accidents
during the time period.
Interactions in Accident Data
Response: Accidents
Parameter Estimates
Term                           Estimate    Std Error   t Ratio   Prob>|t|
Intercept                      -0.852117   7.314465    -0.12     0.9077
Cars                           0.4154531   0.136048    3.05      0.0035
Speed                          0.0644162   0.118519    0.54      0.5889
(Speed-60.0017)*(Cars-9.935)   1.0763228   0.087791    12.26     <.0001
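$\hat{E}(\text{Accidents} \mid \text{Cars}=8, \text{Speed}=66) - \hat{E}(\text{Accidents} \mid \text{Cars}=8, \text{Speed}=65)$
$= [-0.852 + 0.415 \times 8 + 0.064 \times 66 + 1.076 \times (66 - 60.0017) \times (8 - 9.935)]$
$\quad - [-0.852 + 0.415 \times 8 + 0.064 \times 65 + 1.076 \times (65 - 60.0017) \times (8 - 9.935)]$
$= 0.064 \times (66 - 65) + 1.076 \times (66 - 65) \times (8 - 9.935) = -2.02$
$\hat{E}(\text{Accidents} \mid \text{Cars}=11, \text{Speed}=66) - \hat{E}(\text{Accidents} \mid \text{Cars}=11, \text{Speed}=65)$
$= [-0.852 + 0.415 \times 11 + 0.064 \times 66 + 1.076 \times (66 - 60.0017) \times (11 - 9.935)]$
$\quad - [-0.852 + 0.415 \times 11 + 0.064 \times 65 + 1.076 \times (65 - 60.0017) \times (11 - 9.935)]$
$= 0.064 \times (66 - 65) + 1.076 \times (66 - 65) \times (11 - 9.935) = 1.21$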
Increases in speed have a worse impact on number of accidents when there are
a large number of cars on the road than when there are a small number of cars on
the road.
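The two differences above can be reproduced from the reported parameter estimates. This is a small Python sketch using the coefficients and centering values from the JMP output; the function names are illustrative, not part of the slides.

```python
# Estimates and centering constants as reported in the JMP output.
INTERCEPT = -0.852117
B_CARS = 0.4154531
B_SPEED = 0.0644162
B_INTERACTION = 1.0763228
MEAN_SPEED = 60.0017
MEAN_CARS = 9.935

def predicted_accidents(cars, speed):
    """Estimated mean number of accidents for given cars and speed."""
    return (INTERCEPT + B_CARS * cars + B_SPEED * speed
            + B_INTERACTION * (speed - MEAN_SPEED) * (cars - MEAN_CARS))

def speed_effect(cars, speed_from=65, speed_to=66):
    """Change in estimated mean accidents when speed rises, cars held fixed."""
    return predicted_accidents(cars, speed_to) - predicted_accidents(cars, speed_from)

print(round(speed_effect(8), 2))   # effect of one unit of speed at Cars = 8
print(round(speed_effect(11), 2))  # effect of one unit of speed at Cars = 11
```

Because of the interaction term, the effect of one extra unit of speed is 0.0644 + 1.0763 * (Cars - 9.935), so it changes sign depending on whether Cars is below or above roughly its mean.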
Notes on Interactions
• The need for interactions is not easily spotted
with residual plots. It is best to try including an
interaction term and see if it is significant.
• To understand better the multiple regression
relationship when there is an interaction, it is
useful to make an Interaction Plot. After Fit
Model, click red triangle next to Response, click
Factor Profiling and then click Interaction Plots.
Interaction Profiles
[Figure: JMP interaction profile plots. Left panel: Accidents versus Cars, with separate lines for Speed = 56.6 and Speed = 62.5. Right panel: Accidents versus Speed, with separate lines for Cars = 7 and Cars = 12.6.]
Plot on left displays E(Accidents|Cars, Speed=56.6) and E(Accidents|Cars, Speed=62.5)
as a function of Cars. Plot on right displays E(Accidents|Cars=12.6, Speed) and
E(Accidents|Cars=7, Speed) as a function of Speed. We can see that the impact of speed on
Accidents depends critically on the number of cars on the road.
Fast Food Locations
• An analyst working for a fast food chain is
asked to construct a multiple regression
model to identify new locations that are
likely to be profitable. For a sample of 25
locations, the analyst has the annual gross
revenue of the restaurant (y), the mean
annual household income and the mean
age of children in the area. Data are in
fastfoodchain.jmp
Multivariate Correlations
          Revenue   Income   Age
Revenue   1.0000    0.4355   0.3769
Income    0.4355    1.0000   0.0201
Age       0.3769    0.0201   1.0000
[Figure: Scatterplot Matrix of Revenue, Income and Age.]
The relationships between revenue and income and between
revenue and age are quadratic. Members of relatively
poor or relatively affluent households are less likely to
eat at this chain’s restaurants, since the restaurants
attract mostly middle-income customers.
The quadratic relationship cannot be easily captured by a
transformation. The curvature between y and x falls into two
quadrants of the circle in Tukey’s Bulging Rule.
Squared Terms for Curvature
• To capture a quadratic relationship
between X1 and Y, we add $(X_1 - \bar{X}_1)(X_1 - \bar{X}_1)$
as an explanatory variable.
• To do this in JMP, add X1 to the model,
then highlight X1 in the Select Columns
box and highlight X1 in the Construct
Model Effects box and click Cross.
Response: Revenue
Parameter Estimates
Term                          Estimate    Std Error   t Ratio   Prob>|t|
Intercept                     1062.4317   72.9538     14.56     <.0001
Income                        5.4563847   2.162126    2.52      0.0202
Age                           1.6421762   5.413888    0.30      0.7648
(Income-24.2)*(Income-24.2)   -3.979104   0.570833    -6.97     <.0001
(Age-8.392)*(Age-8.392)       -4.112892   1.267459    -3.24     0.0041
The t-tests indicate strong evidence of curvature for both income and age. The curvature
in age means that the impact of an extra year of age on mean revenue, for a fixed level of
income, depends on the starting value of age.
$\hat{E}(\text{Revenue} \mid \text{Income}=24.2, \text{Age}=8) - \hat{E}(\text{Revenue} \mid \text{Income}=24.2, \text{Age}=7)$
$= 1.642 + (-4.113) \times [(8 - 8.392) \times (8 - 8.392) - (7 - 8.392) \times (7 - 8.392)] = 8.98$
$\hat{E}(\text{Revenue} \mid \text{Income}=24.2, \text{Age}=10) - \hat{E}(\text{Revenue} \mid \text{Income}=24.2, \text{Age}=9)$
$= 1.642 + (-4.113) \times [(10 - 8.392) \times (10 - 8.392) - (9 - 8.392) \times (9 - 8.392)] = -7.47$
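The two one-year differences can be recomputed from the parameter estimates of the model with squared terms. The Python sketch below uses the coefficients reported in the JMP output; the function names are illustrative assumptions.

```python
# Estimates from the model with squared terms (no interaction).
INTERCEPT = 1062.4317
B_INCOME = 5.4563847
B_AGE = 1.6421762
B_INCOME_SQ = -3.979104
B_AGE_SQ = -4.112892
MEAN_INCOME = 24.2
MEAN_AGE = 8.392

def predicted_revenue(income, age):
    """Estimated mean revenue for a given income and age."""
    return (INTERCEPT + B_INCOME * income + B_AGE * age
            + B_INCOME_SQ * (income - MEAN_INCOME) ** 2
            + B_AGE_SQ * (age - MEAN_AGE) ** 2)

def age_effect(age_from, age_to, income=24.2):
    """Change in estimated mean revenue as age changes, income held fixed."""
    return predicted_revenue(income, age_to) - predicted_revenue(income, age_from)

print(round(age_effect(7, 8), 2))    # extra year of age starting from 7
print(round(age_effect(9, 10), 2))   # extra year of age starting from 9
```

The income terms cancel in each difference, so the same two numbers result for any fixed income in this model.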
Notes on Squared Terms for
Curvature
• If t-test for squared term $(X_1 - \bar{X}_1)^2$ has p-value < .05,
indicating that there is curvature, then we keep the linear
term $X_1$ in the model regardless of its p-value.
• Coefficients in model with squared terms for curvature
are tricky to interpret. If we have explanatory variables $X_1$
and $(X_1 - \bar{X}_1)^2$ in the model, then we can’t keep $X_1$
fixed and change $(X_1 - \bar{X}_1)^2$.
• As with interactions, to better understand the multiple
regression relationship when there is a squared term for
curvature, a plot is useful. After Fit Model, click red
triangle next to Response, click Factor Profiling and click
Profiler. JMP shows a plot for each explanatory variable
of how the mean of Y changes as the explanatory
variable is increased and the other explanatory variables
are held fixed at their mean value.
Prediction Profiler
[Figure: JMP Prediction Profiler. Left panel: Revenue versus Income. Right panel: Revenue versus Age. Predicted Revenue at Income = 24.2, Age = 8.392 is 1208.257 ± 32.825.]
Left hand plot is a plot of Mean Revenue for different levels of income when Age is
held fixed at its mean value of 8.392. The value 1208.257 ± 32.825 is a confidence interval
for the mean response at Income = 24.2, Age = 8.392.
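The center of that interval can be recovered from the parameter estimates of the model with squared terms: at Income = 24.2 and Age = 8.392 both centered squared terms are zero, so only the intercept and linear terms contribute. A short Python check, with illustrative variable names:

```python
# Intercept and linear coefficients from the model with squared terms.
intercept = 1062.4317
b_income = 5.4563847
b_age = 1.6421762

mean_income = 24.2
mean_age = 8.392

# At the means, (Income - 24.2)^2 and (Age - 8.392)^2 are both zero,
# so the squared-term coefficients drop out of the prediction.
center_prediction = intercept + b_income * mean_income + b_age * mean_age
print(round(center_prediction, 3))
```

The half-width 32.825 depends on the standard error of the fitted mean, which cannot be recomputed from the coefficient table alone.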
Regression Model for Fast Food
Chain Data
• Interactions and polynomial terms can be
combined in a multiple regression model.
Parameter Estimates
Term                          Estimate    Std Error   t Ratio   Prob>|t|
Intercept                     921.11967   95.703      9.62      <.0001
Income                        9.3678491   2.743887    3.41      0.0029
Age                           6.2254725   5.472777    1.14      0.2695
(Income-24.2)*(Income-24.2)   -3.726129   0.542156    -6.87     <.0001
(Age-8.392)*(Age-8.392)       -3.868707   1.179054    -3.28     0.0039
(Age-8.392)*(Income-24.2)     1.9672682   0.944082    2.08      0.0509
• Strong evidence of a quadratic relationship
between revenue and age, revenue and
income. Moderate evidence of an
interaction between age and income.
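To see what the interaction term implies, the fitted model can be evaluated at different income levels. The Python sketch below uses the parameter estimates reported above; the function names and the example incomes of 20 and 30 are illustrative choices, and the interaction estimate itself is only moderately supported (p = 0.0509).

```python
# Estimates from the model with squared terms and the Age*Income interaction.
INTERCEPT = 921.11967
B_INCOME = 9.3678491
B_AGE = 6.2254725
B_INCOME_SQ = -3.726129
B_AGE_SQ = -3.868707
B_INTERACTION = 1.9672682
MEAN_INCOME = 24.2
MEAN_AGE = 8.392

def predicted_revenue(income, age):
    """Estimated mean revenue for a given income and age."""
    inc_c = income - MEAN_INCOME
    age_c = age - MEAN_AGE
    return (INTERCEPT + B_INCOME * income + B_AGE * age
            + B_INCOME_SQ * inc_c ** 2
            + B_AGE_SQ * age_c ** 2
            + B_INTERACTION * age_c * inc_c)

def age_effect(income, age_from=8, age_to=9):
    """Change in estimated mean revenue for one more year of age at a given income."""
    return predicted_revenue(income, age_to) - predicted_revenue(income, age_from)

# The effect of an extra year of age now depends on income as well as on age.
print(age_effect(20), age_effect(30))
```

With a positive interaction coefficient, the estimated effect of an extra year of age is larger in higher-income areas; the gap between the two printed values is 1.9672682 times the 10-unit difference in income.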