Chapter 15 - McGraw Hill Higher Education


Chapter 15
Model Building and Model
Diagnostics
McGraw-Hill/Irwin
Copyright © 2011 by The McGraw-Hill Companies, Inc. All rights reserved.
15.1 The Quadratic Regression Model
15.2 Interaction
15.3 Logistic Regression
15.4 Model Building and the Effects of Multicollinearity
15.5 Improving the Regression Model I: Diagnosing and Using Information about Outlying and Influential Observations
15.6 Improving the Regression Model II: Transforming the Dependent and Independent Variables
15.7 Improving the Regression Model III: The Durbin-Watson Test and Dealing with Autocorrelation
LO 1: Model quadratic relationships by using the quadratic regression model.

15.1 The Quadratic Regression Model

One useful form of linear regression is the quadratic regression model
Assume we have n observations of x and y
The quadratic regression model relating y to x is

y = β0 + β1x + β2x² + ε

1. β0 + β1x + β2x² is the mean value of the dependent variable y when the value of the independent variable is x
2. β0, β1, and β2 are unknown regression parameters relating the mean value of y to x
3. ε is an error term that describes the effects on y of all factors other than x and x²
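The quadratic model above is still linear in its parameters, so it can be fit by ordinary least squares on a design matrix with columns 1, x, and x². A minimal sketch with made-up data (NumPy assumed; the coefficient values 2.0, 0.5, 1.5 are purely illustrative):

```python
import numpy as np

# Hypothetical data generated from an exact quadratic, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = 2.0 + 0.5 * x + 1.5 * x**2

# Build the design matrix [1, x, x^2] and estimate b0, b1, b2 by least squares
X = np.column_stack([np.ones_like(x), x, x**2])
b0, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]
```

Because the fabricated data contain no error term, the fit recovers the generating coefficients essentially exactly.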
LO1

More Variables

We have only looked at the simple case where we have y and x
That gave us the quadratic regression model y = β0 + β1x + β2x² + ε
However, we are not limited to just two terms
The following would also be a valid quadratic regression model

y = β0 + β1x1 + β2x1² + β3x2 + β4x3 + ε
LO 2: Detect and model interaction between two independent variables.

15.2 Interaction

Multiple regression models often contain interaction variables
These are variables that are formed by multiplying two independent variables together
For example, x1·x2
In this case, the x1·x2 variable would appear in the model along with both x1 and x2
We use interaction variables when the relationship between the mean value of y and one of the independent variables is dependent on the value of another independent variable
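An interaction model is fit the same way as any multiple regression: the cross-product x1·x2 simply becomes an extra column in the design matrix. A minimal sketch with fabricated data (NumPy assumed; the true interaction coefficient 0.5 is an invented value for illustration):

```python
import numpy as np

# Hypothetical data where the effect of x1 on y depends on the value of x2
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, 50)
x2 = rng.uniform(0, 10, 50)
y = 1.0 + 2.0 * x1 + 3.0 * x2 + 0.5 * x1 * x2  # interaction term 0.5*x1*x2

# The interaction variable x1*x2 appears alongside both x1 and x2
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
coefs = np.linalg.lstsq(X, y, rcond=None)[0]
```

With noiseless fabricated data the estimated interaction coefficient matches the generating value.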
LO 3: Use a logistic model to estimate probabilities and odds ratios.

15.3 Logistic Regression

Logistic regression and least squares regression are very similar
Both produce prediction equations
The y variable is what makes logistic regression different
With least squares regression, the y variable is a quantitative variable
With logistic regression, it is usually a dummy 0/1 variable
With large data sets, the y variable may be the probability of a set of observations having a dummy variable value of one
LO3

General Logistic Regression Model

p(x1, x2, …, xk) = e^(β0 + β1x1 + β2x2 + … + βkxk) / (1 + e^(β0 + β1x1 + β2x2 + … + βkxk))

p(x1, x2, …, xk) is the probability that the event under consideration will occur when the values of the independent variables are x1, x2, …, xk
The odds of the event occurring are p(x1, x2, …, xk)/(1 − p(x1, x2, …, xk))
That is, the probability that the event will occur divided by the probability it will not occur
LO 4: Describe and measure multicollinearity.

15.4 Model Building and the Effects of Multicollinearity

Multicollinearity is the condition where the independent variables are dependent, related, or correlated with each other

Effects
Hinders the ability to use t statistics and p-values to assess the relative importance of predictors
Does not hinder the ability to predict the dependent (or response) variable

Detection
Scatter plot matrix
Correlation matrix
Variance inflation factors (VIF)
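The variance inflation factor for predictor j is 1/(1 − R²j), where R²j comes from regressing that predictor on the remaining predictors. A minimal sketch with fabricated data, where x2 is built to be nearly collinear with x1 while x3 is independent (NumPy assumed):

```python
import numpy as np

def vif(X, j):
    # VIF_j = 1 / (1 - R_j^2): regress column j on the other columns
    # (with an intercept) and measure how well they explain it
    y = X[:, j]
    Z = np.column_stack([np.ones(X.shape[0]), np.delete(X, j, axis=1)])
    resid = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    r2 = 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 + 0.1 * rng.normal(size=100)  # nearly collinear with x1
x3 = rng.normal(size=100)             # unrelated to x1 and x2
X = np.column_stack([x1, x2, x3])
```

A common rule of thumb flags VIF values greater than 10 as indicating severe multicollinearity; here x1 and x2 far exceed that while x3 stays near 1.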
LO 5: Use various model comparison criteria to identify one or more appropriate regression models.

Comparing Regression Models on R², s, Adjusted R², and Prediction Interval

Multicollinearity causes problems evaluating the p-values of the model
Therefore, we need to evaluate more than the additional importance of each independent variable
We also need to evaluate how the variables work together
One way to do this is to determine if the overall model gives a high R² and adjusted R², a small s, and short prediction intervals
LO5

C Statistic

Another quantity for comparing regression models is the C statistic, also known as the Cp statistic
First, calculate the mean square error for the model containing all p potential independent variables, denoted s²p
Next, calculate the SSE for a reduced model with k independent variables
Calculate C as

C = SSE / s²p − [n − 2(k + 1)]
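The C statistic above is a direct arithmetic recipe, so it is easy to sketch. All input values below (SSE, s²p, n, k) are invented for illustration; a reduced model with little bias should give C close to k + 1:

```python
def c_statistic(sse_reduced, s2_full, n, k):
    # C = SSE / s_p^2 - [n - 2(k + 1)]
    # sse_reduced: SSE of the reduced model with k independent variables
    # s2_full:     mean square error of the model with all p variables
    return sse_reduced / s2_full - (n - 2 * (k + 1))

# Hypothetical values: SSE = 40, s_p^2 = 2, n = 25 observations, k = 3
c = c_statistic(sse_reduced=40.0, s2_full=2.0, n=25, k=3)
```

Here C = 40/2 − (25 − 8) = 3, which is close to k + 1 = 4, so this hypothetical reduced model would look competitive.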
LO 6: Use diagnostic measures to detect outlying and influential observations.

15.5 Diagnosing and Using Information About Outlying and Influential Observations

Observation 1: Outlying with respect to y value
Observation 2: Outlying with respect to x value
Observation 3: Outlying with respect to x value, and y value not consistent with the regression relationship (influential)
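One standard diagnostic for observations that are outlying with respect to their x values is the leverage value: the diagonal of the hat matrix H = X(X′X)⁻¹X′. A minimal sketch with made-up data where the last observation sits far from the others in x (NumPy assumed):

```python
import numpy as np

# Hypothetical data: the last x value (20) is far from the rest
x = np.array([1.0, 2.0, 3.0, 4.0, 20.0])
X = np.column_stack([np.ones_like(x), x])  # intercept plus one predictor

# Leverage values are the diagonal of the hat matrix H = X (X'X)^-1 X'
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)
```

The leverages always sum to the number of model parameters (here 2), and the outlying x value receives by far the largest leverage, flagging it for further scrutiny.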
LO 7: Use data transformations to help remedy violations of the regression assumptions.

15.6 Transforming the Dependent and Independent Variables

A possible remedy for violations of the constant variance, correct functional form, and normality assumptions is to transform the dependent variable
Possible transformations include
Square root
Quartic root
Logarithmic
The appropriate transformation will depend on the specific problem with the original data set
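The three transformations listed above are one-liners in practice. A sketch on a fabricated, strongly right-skewed dependent variable (NumPy assumed); the logarithm compresses large values the most, the square root the least:

```python
import numpy as np

# Hypothetical right-skewed dependent variable
y = np.array([1.0, 4.0, 9.0, 100.0, 10000.0])

y_sqrt = np.sqrt(y)      # square root transformation
y_quartic = y ** 0.25    # quartic (fourth) root transformation
y_log = np.log(y)        # natural logarithm transformation
```

After transforming, the regression is fit to the transformed y, and predictions must be back-transformed to the original scale.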
LO 8: Use the Durbin-Watson test to detect autocorrelated error terms.

15.7 The Durbin-Watson Test and Dealing with Autocorrelation

One type of autocorrelation is called first-order autocorrelation
This is when the error term in time period t (εt) is related to the error term in time period t−1 (εt−1)
The Durbin-Watson statistic checks for first-order autocorrelation
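The Durbin-Watson statistic is d = Σ(e_t − e_(t−1))² / Σe_t², computed from the time-ordered residuals. A sketch with two fabricated residual series: values near 0 suggest positive first-order autocorrelation, near 2 suggest none, and near 4 suggest negative autocorrelation (NumPy assumed):

```python
import numpy as np

def durbin_watson(resid):
    # d = sum_{t=2..n} (e_t - e_{t-1})^2 / sum_{t=1..n} e_t^2
    diff = np.diff(resid)
    return (diff @ diff) / (resid @ resid)

# Hypothetical residuals drifting slowly -> positively autocorrelated, d near 0
pos = np.array([1.0, 1.1, 1.2, 1.1, 1.0, 0.9, 1.0, 1.1])
# Hypothetical residuals flipping sign every period -> d near 4
alt = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
```

In practice d is compared against the Durbin-Watson critical bounds dL and dU, which depend on n and the number of independent variables.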