Comparing the Various Types of Multiple Regression

• Suppose we have the following hypothesis about some variables from the World95.sav data set:
• A country’s rate of male literacy (Y) is associated with a smaller rate of annual population increase (X1), a greater gross domestic product (X2), and a larger percentage of people living in cities (X3)
• First let’s look at the intercorrelations among these four variables
• What we hope to find is that each of the three predictors has at least a moderate correlation with the Y variable, male literacy, while the predictors themselves are not too highly intercorrelated (avoiding multicollinearity)
• Let’s check this out by obtaining the zero-order correlations
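The zero-order (Pearson) correlation SPSS computes is just the covariance of two variables scaled by their standard deviations. A minimal pure-Python sketch with made-up numbers (not the World95.sav data):

```python
from math import sqrt

def pearson_r(x, y):
    """Zero-order (Pearson) correlation between two variables."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy data: literacy rises with urbanization, falls with population growth
literacy = [60, 70, 80, 90]
urban    = [30, 45, 55, 70]        # positive association expected
pop_inc  = [3.0, 2.5, 1.5, 1.0]    # negative association expected

print(round(pearson_r(literacy, urban), 3))    # → 0.997
print(round(pearson_r(literacy, pop_inc), 3))  # → -0.99
```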
Starting with the Zero-order
Correlation Matrix
• In SPSS Data Editor, go to Analyze/ Correlate/ Bivariate and put the four variables into the Variables window: Males who read, People living in cities, Population increase, and Gross domestic product (put them in in that order)
• Under Options, select means and standard deviations, and exclude cases pairwise
• Select Pearson, one-tailed, flag significant correlations, and press OK
• Examine your table of intercorrelations
• Note that all of the predictors have significant correlations with the Y variable, male literacy, and their intercorrelations are all well below .80, so multicollinearity may not be a big problem
Correlations (Pearson r)

                                    Read      Cities    Pop. inc.   GDP
Males who read (%)                  1         .587**    -.619**     .417**
People living in cities (%)         .587**    1         -.375**     .605**
Population increase (% per year)    -.619**   -.375**   1           -.521**
Gross domestic product / capita     .417**    .605**    -.521**     1

Sig. (2-tailed) = .000 for all off-diagonal entries. N = 85 for pairs involving males who read; N = 108–109 for the remaining pairs (pairwise exclusion of missing cases).

**. Correlation is significant at the 0.01 level (2-tailed).

Intercorrelations among predictors
SPSS Setup for Simultaneous Multiple
Regression on Three Predictor
Variables
• Now let’s run a standard (simultaneous) multiple
regression of Y (male literacy) on the three predictor
variables
– In Data Editor go to Analyze/ Regression/ Linear and click the Reset button
– Put Male Literacy into the Dependent box
– Put Population Increase, People Living in Cities, and Gross Domestic Product into the Independents box
– Under Method, select Enter. This will enter all of the variables into the regression equation at once
– Under Statistics, select Estimates, Confidence Intervals, Model Fit, R squared change, Descriptives, Part and Partial Correlations, Collinearity Diagnostics, and click Continue
– Under Options, check Include Constant in the Equation, click Continue, and then OK
• Compare your output to the next several slides
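Under the hood, the Enter method fits all predictors at once by ordinary least squares. As a rough illustration of what that computation does, here is a minimal pure-Python sketch that solves the normal equations on made-up data (not the World95.sav variables):

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def ols(X, y):
    """Unstandardized coefficients [b0, b1, ...] via normal equations X'Xb = X'y."""
    Z = [[1.0] + row for row in X]          # prepend the constant
    k = len(Z[0])
    XtX = [[sum(r[i] * r[j] for r in Z) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(Z, y)) for i in range(k)]
    return solve(XtX, Xty)

# Synthetic data generated from y = 10 + 2*x1 - 3*x2 + 0.5*x3 (no noise),
# so the fitted coefficients should recover those values
X = [[1, 2, 3], [2, 1, 5], [3, 3, 2], [4, 2, 7], [5, 5, 1], [6, 1, 4]]
y = [10 + 2*a - 3*b + 0.5*c for a, b, c in X]
print([round(v, 3) for v in ols(X, y)])   # → [10.0, 2.0, -3.0, 0.5]
```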
Variables Entered/Removed Table
• First look at the table of variables entered
– You will see that all three of the predictor variables have
been included in the regression equation
Model Summary, Simultaneous
Multiple Regression
• Next look at the Model Summary. You will see that
– The multiple correlation (R) between male literacy and the three predictors is strong: .764
– The combination of the three predictors accounts for nearly 60% of the variation in male literacy (R Square = .583)
– The regression equation is significant, F(3, 81) = 37.783, p < .001. This information is also contained in the ANOVA table
Model Summary

Model   R       R Square   Adjusted   Std. Error of   R Square   F Change   df1   df2   Sig. F
                           R Square   the Estimate    Change                             Change
1       .764a   .583       .568       13.441          .583       37.783     3     81    .000

a. Predictors: (Constant), Gross domestic product / capita, Population increase (% per year), People living in cities (%)
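The overall F reported in the Model Summary and ANOVA tables can be recomputed by hand from R Square, the sample size, and the number of predictors. A quick arithmetic check (using the rounded R Square from the slide, so the result differs slightly from SPSS’s 37.783):

```python
def overall_F(r2, n, k):
    """F test that a k-predictor regression beats predicting mean(Y),
    with df1 = k and df2 = n - k - 1."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

# Rounded values from the SPSS output: R^2 = .583, n = 85 cases, k = 3 predictors
print(round(overall_F(0.583, 85, 3), 2))   # close to SPSS's 37.783
```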
ANOVAb

Model 1       Sum of Squares   df   Mean Square   F        Sig.
Regression    20478.670        3    6826.223      37.783   .000a
Residual      14634.107        81   180.668
Total         35112.776        84

a. Predictors: (Constant), Gross domestic product / capita, Population increase (% per year), People living in cities (%)
b. Dependent Variable: Males who read (%)

Listwise exclusion: exclude a case if it is missing data on any one of the variables (more typical for multivariate analyses). Pairwise exclusion (typical for bivariate): exclude a case only if it is missing x or y.
Regression Weights, Simultaneous
Multiple Regression
• Now let’s look at the regression weights (the beta coefficients). (I have divided this table into two halves; this is the left side below.) From this table you will learn that
• Two of the predictors have significant standardized regression weights (population increase, Beta = -.517, t = -6.698, p < .001; people living in cities, Beta = .493, t = 5.539, p < .001): that is, each of the two is a significant contributor to predicting male literacy
• GDP does not appear to add unique predictive power when the effects of the other predictors are held constant (Beta = -.063, t = -.676, p = .501)
• The signs of the regression weights are in the predicted direction, with male literacy being positively associated with % people living in cities and GDP, but negatively associated with % annual population increase
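A standardized (beta) weight is just the unstandardized slope rescaled by the standard deviations of the predictor and the criterion. A minimal sketch; the numbers here are hypothetical, not taken from World95.sav:

```python
def standardized_beta(b, sd_x, sd_y):
    """Convert an unstandardized slope b into a standardized beta weight:
    the predicted change in Y (in SD units) per 1-SD change in X."""
    return b * sd_x / sd_y

# Hypothetical slope: literacy drops 6 points per extra 1% population growth,
# with SD(pop. increase) = 1.2 and SD(literacy) = 20
print(round(standardized_beta(-6.0, 1.2, 20.0), 2))   # → -0.36
```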
Multicollinearity Statistics
• So far you have found partial but not full support for your hypothesis: given this analysis it would have to be revised to leave out GDP as a predictor. What you appear to have found is that Male literacy (in standard scores) = -.517 (Population increase in standard units) + .493 (People living in cities in standard units) (not quite; you would need to rerun the analysis without GDP to get the final weights)
• But not so fast! First let’s check to make sure we don’t have any multicollinearity issues. Below are the collinearity statistics from the coefficients table. Recall that multicollinearity becomes a problem as tolerance approaches zero and VIF approaches 10. Everything below looks OK, so you can report that you have found modified support for your hypothesis, minus the effect of GDP
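Tolerance and VIF come directly from the R Square obtained when each predictor is regressed on the other predictors. A quick sketch: in the two-predictor case that R Square is simply the squared correlation between the predictors, illustrated here with the r = -.375 between population increase and urbanization from the correlation matrix above:

```python
def tolerance_and_vif(r2_j):
    """Tolerance = 1 - R^2 of predictor j regressed on the other predictors;
    VIF is its reciprocal. Tolerance near 0 (VIF near 10+) signals trouble."""
    tol = 1.0 - r2_j
    return tol, 1.0 / tol

# Two-predictor case: R^2_j is the squared correlation between the predictors
tol, vif = tolerance_and_vif((-0.375) ** 2)
print(round(tol, 3), round(vif, 3))   # → 0.859 1.164
```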
Hierarchical Multiple Regression
• Now let’s analyze the same data and ask the same
question with a different method of multiple regression
• This time we will try a hierarchical model where we will enter the variables based on some external criterion of our own, like a theoretical model
• Based on our theory of why men read, we have decided
to first enter the variable people living in cities, then
annual population increase, then GDP
• We are going to make some changes to the way we set
up the analysis
SPSS Setup for Hierarchical
Multiple Regression
• Go to Analyze/ Regression/ Linear
– Click on the reset button to get rid of your old settings
– Move Males who Read into the Dependent Box
– Now we are going to enter variables one at a time, in the order predicted by our theory. Move your first-to-enter variable, People Living in Cities, into the Independent box and click Next
– Move your second-to-enter variable, Population Increase Annual, into the Independent box and click Next
– Finally, move your third-to-enter variable, Gross Domestic Product, into the Independent box. DON’T click Next again
– Make sure the enter option is selected under Method
– Under Statistics, select Estimates, Confidence Intervals, Model Fit, R
squared change, Descriptives, Part and Partial Correlation, and Collinearity
Diagnostics, and click Continue
– Under Options, check Include Constant in the Equation, click Continue and
then OK
– Compare results to next slides
Entered/Removed Table for
Hierarchical Multiple Regression
• The box called Variables Entered/Removed gives you a summary of what’s in the model and the order in which variables were entered or removed. Here you have all three variables entered, and you are going to be comparing three different “models” or regression equations: one with only people living in cities as a predictor, a two-variable model with people living in cities and population increase (% annual) as predictors, and finally a model with all three of the predictors combined
Variables Entered/Removedb

Model   Variables Entered                    Variables Removed   Method
1       People living in cities (%)a         .                   Enter
2       Population increase (% per year)a    .                   Enter
3       Gross domestic product / capitaa     .                   Enter

a. All requested variables entered.
b. Dependent Variable: Males who read (%)
Model Summary for Hierarchical
Multiple Regression
• Next we are going to look at our model summary, which compares each of the three models (one, two, or three predictors). Note that for model 1, with only the people living in cities predictor, R is the same as the zero-order correlation between male literacy and people living in cities. The associated R Square is significant (i.e., the regression equation predicts better than using the mean of Y), F(1, 83) = 43.668, p < .001. Model 2, with two of the three predictors, is even better, with an R of .762 and an R Square of .581. This change in R Square is significant, F(1, 82) = 46.198, p < .001, indicating that the second predictor, population increase (% annual), added significantly to the regression equation after the first predictor had done its work. But the third predictor, GDP, came up short: it increased R Square only from .581 to .583, and the change in R Square was not significant, F(1, 81) = .457, n.s.
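The F test for a change in R Square can be checked by hand: divide the R Square gained per added predictor by the residual variance per remaining degree of freedom. A small sketch using the rounded values reported on this slide, so the result differs from SPSS’s 46.198 only by rounding:

```python
def F_change(delta_r2, r2_full, df1, df2):
    """Test whether adding df1 predictors significantly increases R^2,
    where r2_full is the R^2 of the larger model and df2 its error df."""
    return (delta_r2 / df1) / ((1 - r2_full) / df2)

# Step 2 of the hierarchy: adding population increase raised R^2
# from .345 to .581, leaving 82 error df
print(round(F_change(0.581 - 0.345, 0.581, 1, 82), 1))   # ~46.2
```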
Model Summary

Model   R       R Square   Adjusted   Std. Error of   R Square   F Change   df1   df2   Sig. F
                           R Square   the Estimate    Change                             Change
1       .587a   .345       .337       16.649          .345       43.668     1     83    .000
2       .762b   .581       .571       13.397          .236       46.198     1     82    .000
3       .764c   .583       .568       13.441          .002       .457       1     81    .501

a. Predictors: (Constant), People living in cities (%)
b. Predictors: (Constant), People living in cities (%), Population increase (% per year)
c. Predictors: (Constant), People living in cities (%), Population increase (% per year), Gross domestic product / capita
ANOVA Tests, Hierarchical Multiple
Regression
• Our ANOVA table gives us the significance of each of the three models (one predictor, two predictors, three predictors), and we see that the F is largest for the two-predictor model. (These Fs are for the overall predictive effect and are different from the Fs for the change in R Square we get when adding an additional variable, shown on the previous slide.) The F for the three-variable equation (37.783) is also equal to the final F we got in the standard (simultaneous) method when we entered all of the variables at once. So we have all the evidence we need to toss out the third variable as a predictor, unless we have some reason to assume that GDP “causes” one of the other predictors
ANOVAd

Model          Sum of Squares   df   Mean Square   F        Sig.
1  Regression  12104.898        1    12104.898     43.668   .000a
   Residual    23007.879        83   277.203
   Total       35112.776        84
2  Regression  20396.129        2    10198.065     56.823   .000b
   Residual    14716.647        82   179.471
   Total       35112.776        84
3  Regression  20478.670        3    6826.223      37.783   .000c
   Residual    14634.107        81   180.668
   Total       35112.776        84

a. Predictors: (Constant), People living in cities (%)
b. Predictors: (Constant), People living in cities (%), Population increase (% per year)
c. Predictors: (Constant), People living in cities (%), Population increase (% per year), Gross domestic product / capita
d. Dependent Variable: Males who read (%)
Regression Coefficients, Hierarchical
Multiple Regression
• If we look at the regression coefficients for the
hierarchical analysis we will see that they are the same
as for the previous, simultaneous analysis in the case of
model 3
Writing up your Results
•
Reporting the results of a hierarchical multiple regression analysis
– To test the hypothesis that a country’s level of male literacy is a function of three variables, the country’s annual increase in population, percentage of people living in cities, and gross domestic product, a hierarchical multiple regression analysis was performed. Tests for multicollinearity indicated that a low level of multicollinearity was present (tolerance = .864, .649, and .601 for annual increase in population, percentage of people living in cities, and gross domestic product, respectively). People living in cities was the first variable entered, followed by annual population increase and then GDP, according to our theory. Results of the regression analysis provided partial confirmation for the research hypothesis. Beta coefficients for the three predictors were: people living in cities, β = .493, t = 5.539, p < .001; annual population increase, β = -.517, t = -6.698, p < .001; and gross domestic product, β = -.063, t = -.676, p = .501, n.s. The best-fitting model for predicting rate of male literacy is a linear combination of the country’s annual population increase and the percentage of people living in cities (R = .762, R2 = .581, F(2, 82) = 56.823, p < .001). Addition of the GDP variable did not significantly improve prediction (R2 change = .002, F = .457, p = .501).
Stepwise Multiple Regression
• Finally, let’s look at a stepwise multiple regression
– In SPSS Data Editor, go to Analyze/ Regression/ Linear
– Click the reset button
– Put Male Literacy into the Dependent box
– Put Population Increase, People Living in Cities, and Gross Domestic
Product into the Independents box
– Under Method, select Stepwise
– Under Statistics, select Estimates, Confidence Intervals, Model Fit, R
squared change, Descriptives, Part and Partial Correlation, and Collinearity
Diagnostics, and click Continue
– Under Options, check Include Constant in the Equation, and under Stepping Method Criteria select “Use probability of F”; set the probability of F to enter a variable to .005 and the probability of F to remove a variable to .01. We are making this alpha adjustment to control the overall error rate, which may increase because of the more frequent significance testing that is done in stepwise regression. Click Continue and then OK
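The stepwise algorithm repeatedly adds (or removes) the predictor that most improves the fit, subject to the probability-of-F criteria set above. A simplified pure-Python sketch of a single forward step on made-up data, selecting by R Square gain; a full routine would also convert each gain to an F and compare its p value with the .005 entry criterion before admitting the variable:

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def r_squared(cols, y):
    """R^2 of y regressed (with a constant) on the given predictor columns."""
    n = len(y)
    Z = [[1.0] + [col[i] for col in cols] for i in range(n)]
    k = len(Z[0])
    b = solve([[sum(r[i] * r[j] for r in Z) for j in range(k)] for i in range(k)],
              [sum(r[i] * yi for r, yi in zip(Z, y)) for i in range(k)])
    ybar = sum(y) / n
    ss_tot = sum((yi - ybar) ** 2 for yi in y)
    ss_res = sum((yi - sum(bi * zi for bi, zi in zip(b, row))) ** 2
                 for yi, row in zip(y, Z))
    return 1.0 - ss_res / ss_tot

def forward_step(candidates, included, y):
    """One forward step: among (name, column) candidates, pick the one that
    most increases R^2 over the already-included columns."""
    base = r_squared([c for _, c in included], y) if included else 0.0
    name, col = max(candidates,
                    key=lambda nc: r_squared([c for _, c in included] + [nc[1]], y))
    return name, r_squared([c for _, c in included] + [col], y) - base

# Made-up data: y is driven almost entirely by x1, so x1 enters first
x1 = [1, 2, 3, 4, 5, 6]
x2 = [1, 1, 2, 2, 3, 3]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
print(forward_step([("x1", x1), ("x2", x2)], [], y)[0])   # → x1
```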
Variables Entered/Removed Table
for Stepwise Multiple Regression
• The table of variables entered and removed shows that only the first two predictors, population increase (annual) and people living in cities, were ever entered in the analysis. The third variable, GDP, evidently did not pass the entry test of an F with associated probability level of .005
Variables Entered/Removeda

Model   Variables Entered                   Variables Removed   Method
1       Population increase (% per year)    .                   Stepwise (Criteria: Probability-of-F-to-enter <= .005, Probability-of-F-to-remove >= .010)
2       People living in cities (%)         .                   Stepwise (same criteria)

a. Dependent Variable: Males who read (%)
Model Summary for Stepwise
Multiple Regression
• With our third variable out of the picture, the model summary looks a little different, because the effects of the third variable are no longer partialed out of the first two predictors’ relationships with the dependent variable. The increase in R Square with Model 1 (only population increase) is significant, as is the further increase in R Square with the addition of the second variable (Model 2)
Model Summary

Model   R       R Square   Adjusted   Std. Error of   R Square   F Change   df1   df2   Sig. F
                           R Square   the Estimate    Change                             Change
1       .619a   .383       .376       16.155          .383       51.538     1     83    .000
2       .762b   .581       .571       13.397          .198       38.699     1     82    .000

a. Predictors: (Constant), Population increase (% per year)
b. Predictors: (Constant), Population increase (% per year), People living in cities (%)
ANOVAc

Model          Sum of Squares   df   Mean Square   F        Sig.
1  Regression  13450.814        1    13450.814     51.538   .000a
   Residual    21661.963        83   260.988
   Total       35112.776        84
2  Regression  20396.129        2    10198.065     56.823   .000b
   Residual    14716.647        82   179.471
   Total       35112.776        84

a. Predictors: (Constant), Population increase (% per year)
b. Predictors: (Constant), Population increase (% per year), People living in cities (%)
c. Dependent Variable: Males who read (%)
Here’s your overall F test for significance of
the two-predictor model in explaining male literacy
Regression Coefficients for
Stepwise Multiple Regression
• The Beta coefficients (regression weights) for the two variables included in the final model (Model 2) are, respectively, -.502 for population increase and .460 for people living in cities. Both coefficients are significant (see t values). These values are the same as in the two-variable model in the previous (hierarchical) analysis
Sample Writeup of a Step-Wise Multiple
Regression
– To test the hypothesis that a country’s level of male literacy is a function of three variables, the country’s annual increase in population, percentage of people living in cities, and gross domestic product, a stepwise multiple regression analysis was performed. Levels of F to enter and F to remove were set to correspond to p levels of .005 and .01, respectively, to adjust for familywise alpha error rates associated with multiple significance tests. Tests for multicollinearity indicated that a low level of multicollinearity was present (tolerance = .864, .649, and .601 for annual increase in population, percentage of people living in cities, and gross domestic product, respectively). Results of the stepwise regression analysis provided partial confirmation for the research hypothesis: rate of male literacy is a linear function of the country’s annual population increase and the percentage of people living in cities (R = .762, R2 = .581). The overall F for the two-variable model was 56.823, df = 2, 82, p < .001. Standardized beta weights were -.502 for annual population increase and .460 for percentage of people living in cities.
Variable Exclusion in Stepwise
Regression
• As an example of stepwise multiple regression with a larger number of predictor variables, I regressed daily calorie intake on these six predictors: population in thousands; number of people / sq. kilometer; people living in cities (%); people who read (%); population increase (% per year); and gross domestic product. The final model consisted of only two of the variables, GDP and cities. The rest were excluded because they did not have a low enough p value (.005) to enter: their partial correlation with the dependent variable, Y (daily caloric intake), with the effects of the other predictors held constant, was not significant, even though their zero-order correlation with Y may have been. Now see if you can duplicate this analysis.
• In the final model, only GDP and people living in cities were retained
Points to Remember about Multiple
Regression
• To sum up
– In doing a regression, first obtain a matrix of the zero-order
correlations among the candidate predictor variables and
look for multicollinearity problems (variables too highly
correlated)
– Consider whether your theory would dictate that the
variables be entered in any particular order
– If there is no theory to guide you, consider if you want to
enter all the variables at once or let empirical criteria like a
fixed probability level determine if they are allowed to enter
the equation
– Adjust the significance level to make the alpha levels smaller on F to enter and F to remove, especially if you have a lot of variables