Multiple Regression

Multiple Regression
The test you choose depends on level of measurement:

Independent Variable | Dependent Variable | Test
Dichotomous | Interval-Ratio, Dichotomous | Independent Samples t-test
Nominal, Dichotomous | Nominal, Dichotomous | Cross Tabs
Nominal, Dichotomous | Interval-Ratio, Dichotomous | ANOVA
Interval-Ratio, Dichotomous | Interval-Ratio | Bivariate Regression/Correlation
Two or More… Interval-Ratio, Dichotomous | Interval-Ratio | Multiple Regression
Multiple Regression

Multiple Regression is very popular among
sociologists.



Most social phenomena have more than one
cause.
It is very difficult to manipulate just one social
variable through experimentation.
Sociologists must attempt to model complex
social realities to explain them.
Multiple Regression

Multiple Regression allows us to:




Use several variables at once to explain the variation in a
continuous dependent variable.
Isolate the unique effect of one variable on the continuous
dependent variable while taking into consideration that
other variables are affecting it too.
Write a mathematical equation that tells us the overall
effects of several variables together and the unique effects
of each on a continuous dependent variable.
Control for other variables to demonstrate whether
bivariate relationships are spurious.
Multiple Regression

For example:
A sociologist may be interested in the relationship
between Education and Income and Number of
Children in a family.
Independent Variables: Education, Family Income
Dependent Variable: Number of Children
Multiple Regression

For example:
Null Hypothesis: There is no relationship between
education of respondents and the number of children in
families. Ho : b1 = 0
Null Hypothesis: There is no relationship between family
income and the number of children in families. Ho : b2 = 0
Independent Variables: Education, Family Income
Dependent Variable: Number of Children
Multiple Regression


Bivariate regression is based on fitting a line as close
as possible to the plotted coordinates of your data on
a two-dimensional graph.
Trivariate regression is based on fitting a plane as
close as possible to the plotted coordinates of your
data on a three-dimensional graph.
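Case | Children (Y) | Education (X1) | Income, 1 = $10K (X2)
1 | 2 | 12 | 3
2 | 5 | 16 | 4
3 | 1 | 20 | 9
4 | 9 | 12 | 5
5 | 6 | 9 | 4
6 | 3 | 18 | 12
7 | 0 | 16 | 10
8 | 3 | 14 | 1
9 | 7 | 9 | 4
10 | 7 | 12 | 3
11 | 2 | 12 | 10
12 | 5 | 10 | 4
13 | 1 | 20 | 9
14 | 9 | 11 | 4
15 | 6 | 9 | 4
16 | 3 | 18 | 12
17 | 0 | 16 | 10
18 | 3 | 14 | 6
19 | 7 | 9 | 4
20 | 14 | 8 | 1
21 | 2 | 12 | 10
22 | 5 | 10 | 3
23 | 1 | 20 | 9
24 | 9 | 11 | 2
25 | 6 | 9 | 4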
Multiple Regression
Plotted coordinates (cases 1 – 10) for Education, Income and Number of Children
[Figure: three-dimensional scatterplot with axes Y (Children), X1 (Education), and X2 (Income)]
Multiple Regression
What multiple regression does is fit a plane to these coordinates.
[Figure: the three-dimensional plot of cases 1 – 10 with a fitted plane; axes Y, X1, and X2]
Multiple Regression

Mathematically, that plane is:

Y = a + b1X1 + b2X2
a = y-intercept, the value of Y where the X’s equal zero
b = coefficient, or slope, for each variable
For our problem, SPSS says the equation is:

Y = 11.8 - .36X1 - .40X2
Expected # of Children = 11.8 - .36*Educ - .40*Income
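
As a rough check on the SPSS equation, the plane can be fit by hand. The sketch below is only an illustration: it assumes the 25 cases as read from the data table in these slides, and the helper names (`cross_dev`, `fit_plane`) are made up for this example. It solves the least-squares equations for two independent variables directly rather than reproducing whatever SPSS does internally.

```python
# The 25 cases as read from the slides' data table (income in $10K units).
children = [2, 5, 1, 9, 6, 3, 0, 3, 7, 7, 2, 5, 1, 9, 6,
            3, 0, 3, 7, 14, 2, 5, 1, 9, 6]
education = [12, 16, 20, 12, 9, 18, 16, 14, 9, 12, 12, 10, 20, 11, 9,
             18, 16, 14, 9, 8, 12, 10, 20, 11, 9]
income = [3, 4, 9, 5, 4, 12, 10, 1, 4, 3, 10, 4, 9, 4, 4,
          12, 10, 6, 4, 1, 10, 3, 9, 2, 4]


def mean(values):
    return sum(values) / len(values)


def cross_dev(u, v):
    """Sum of cross-products of deviations from the two means."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


def fit_plane(y, x1, x2):
    """Least-squares a, b1, b2 for Y = a + b1*X1 + b2*X2."""
    s11, s22, s12 = cross_dev(x1, x1), cross_dev(x2, x2), cross_dev(x1, x2)
    s1y, s2y = cross_dev(x1, y), cross_dev(x2, y)
    det = s11 * s22 - s12 ** 2          # zero only if X1 and X2 are collinear
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    a = mean(y) - b1 * mean(x1) - b2 * mean(x2)
    return a, b1, b2


a, b1, b2 = fit_plane(children, education, income)
print(f"Y = {a:.2f} + ({b1:.3f})*X1 + ({b2:.3f})*X2")
```

The printed coefficients should land very close to the rounded values SPSS reports (11.8, -.36, -.40).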
Multiple Regression
Conducting a Test of Significance for the slopes of the Regression Shape
By slapping the sampling distribution for the slopes over a guess of the
population’s slopes, Ho, we can find out whether our sample could have
been drawn from a population where the slopes are equal to our guess.
1. Two-tailed significance test for α-level = .05
2. Critical t = +/- 1.96
3. To find if there is a significant slope in the population,
   Ho: β1 = 0 ; β2 = 0
   Ha: β1 ≠ 0 ; β2 ≠ 0
4. Collect Data
5. Calculate t (z) for each slope: $t = \frac{b - \beta_o}{s.e.}$, where
   $s.e. = \frac{\sqrt{\sum (Y - \hat{Y})^2 / (n - 2)}}{\sqrt{\sum (X - \bar{X})^2}}$
   (this is the bivariate form of the standard error; with several
   independent variables, SPSS reports the adjusted standard errors for you)
6. Make decision about the null hypotheses
7. Find P-values
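
Step 5 can be sketched in a few lines. This is only an illustration, assuming the slopes and standard errors that SPSS reports for this example (Education: -.364 and .173; Income: -.403 and .194) and the slide's critical value of 1.96; the function name `t_statistic` is made up here.

```python
# Unstandardized slopes and standard errors as reported by SPSS for this example.
coefficients = {
    "Education": (-0.364, 0.173),
    "Income": (-0.403, 0.194),
}
CRITICAL_T = 1.96   # two-tailed, alpha = .05, as stated on the slide
NULL_SLOPE = 0.0    # the guess under Ho


def t_statistic(b, se, null_value=NULL_SLOPE):
    """t = (sample slope - hypothesized slope) / standard error."""
    return (b - null_value) / se


results = {}
for name, (b, se) in coefficients.items():
    t = t_statistic(b, se)
    reject = abs(t) > CRITICAL_T
    results[name] = (t, reject)
    print(f"{name}: t = {t:.3f}, reject Ho: {reject}")
```

Because the inputs are rounded, the t values differ slightly in the third decimal from the ones SPSS prints.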
Multiple Regression
Model Summary
Model | R | R Square | Adjusted R Square | Std. Error of the Estimate
1 | .757 (a) | .573 | .534 | 2.33785
a. Predictors: (Constant), Income, Education

Y = 11.8 - .36X1 - .40X2

ANOVA (b)
Model 1 | Sum of Squares | df | Mean Square | F | Sig.
Regression | 161.518 | 2 | 80.759 | 14.776 | .000 (a)
Residual | 120.242 | 22 | 5.466 | |
Total | 281.760 | 24 | | |
a. Predictors: (Constant), Income, Education
b. Dependent Variable: Children

Coefficients (a)
Model 1 | B (Unstandardized) | Std. Error | Beta (Standardized) | t | Sig.
(Constant) | 11.770 | 1.734 | | 6.787 | .000
Education | -.364 | .173 | -.412 | -2.105 | .047
Income | -.403 | .194 | -.408 | -2.084 | .049
a. Dependent Variable: Children

Sig. Tests: t-scores and P-values are in the "t" and "Sig." columns.
Multiple Regression

R² = (TSS – SSE) / TSS

TSS = Sum of squared distances from the mean to the value on Y for each case
SSE = Sum of squared distances from the shape to the value on Y for each case
Can be interpreted the same for multiple regression: the joint explanatory
value of all of your variables (or “your model”)
Can request a change in R² test from SPSS to see if adding new
variables improves the fit of your model
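
The arithmetic behind the Model Summary can be retraced from the sums of squares SPSS reports for this example. This is a small sketch, not SPSS's own procedure; the adjusted R² and F formulas used are the standard textbook ones, and the variable names are chosen for illustration.

```python
# Sums of squares from the SPSS ANOVA table for this example.
regression_ss = 161.518
residual_ss = 120.242   # SSE
total_ss = 281.760      # TSS
n, k = 25, 2            # cases, independent variables

r_squared = (total_ss - residual_ss) / total_ss
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
f_ratio = (regression_ss / k) / (residual_ss / (n - k - 1))
std_error_estimate = (residual_ss / (n - k - 1)) ** 0.5

print(f"R Square = {r_squared:.3f}")
print(f"Adjusted R Square = {adjusted_r_squared:.3f}")
print(f"F = {f_ratio:.3f}")
print(f"Std. Error of the Estimate = {std_error_estimate:.5f}")
```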
Multiple Regression
57% of the variation in
number of children is
explained by education
and income!
Multiple Regression
R²

$R^2 = \frac{\sum (Y - \bar{Y})^2 - \sum (Y - \hat{Y})^2}{\sum (Y - \bar{Y})^2}$

From the ANOVA table, Regression Sum of Squares ÷ Total Sum of Squares:
161.518 ÷ 281.760 = .573
Multiple Regression
So what does our equation tell us?

Y = 11.8 - .36X1 - .40X2
Expected # of Children = 11.8 - .36*Educ - .40*Income
Try “plugging in” some values for your
variables.
Multiple Regression
So what does our equation tell us?
Ŷ = 11.8 - .36X1 - .40X2
Expected # of Children = 11.8 - .36*Educ - .40*Income

If Education equals: | & If Income equals: | Then, children equals:
0 | 0 | 11.8
10 | 0 | 8.2
10 | 10 | 4.2
20 | 10 | 0.6
20 | 11 | 0.2
Multiple Regression
So what does our equation tell us?
Ŷ = 11.8 - .36X1 - .40X2
Expected # of Children = 11.8 - .36*Educ - .40*Income

If Education equals: | & If Income equals: | Then, children equals:
1 | 0 | 11.44
1 | 1 | 11.04
1 | 5 | 9.44
1 | 10 | 7.44
1 | 15 | 5.44
Multiple Regression
So what does our equation tell us?
Ŷ = 11.8 - .36X1 - .40X2
Expected # of Children = 11.8 - .36*Educ - .40*Income

If Education equals: | & If Income equals: | Then, children equals:
0 | 1 | 11.40
1 | 1 | 11.04
5 | 1 | 9.60
10 | 1 | 7.80
15 | 1 | 6.00
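
The "plugging in" exercise is easy to automate. A minimal sketch, using the rounded equation from the slides; the function name `expected_children` is just a label chosen for this example.

```python
def expected_children(educ, income):
    """Expected # of Children = 11.8 - .36*Educ - .40*Income."""
    return 11.8 - 0.36 * educ - 0.40 * income


# Hold income constant at 1 and vary education, as in the last table.
for educ in (0, 1, 5, 10, 15):
    print(educ, 1, round(expected_children(educ, 1), 2))

# Hold education constant at 1 and vary income.
for income in (0, 1, 5, 10, 15):
    print(1, income, round(expected_children(1, income), 2))
```

Each one-unit step in education lowers the expected value by .36, and each one-unit step in income lowers it by .40, whatever value the other variable is held at.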
Multiple Regression
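If graphed, holding one variable constant produces a two-dimensional graph for the other variable.
[Figure: two line graphs. X1 = Education (income held at 1): Y falls from 11.40 at 0 to 6.00 at 15, b = -.36. X2 = Income (education held at 1): Y falls from 11.44 at 0 to 5.44 at 15, b = -.4.]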
Multiple Regression


An interesting effect of controlling for other
variables is “Simpson’s Paradox.”
The direction of relationship between two
variables can change when you control for
another variable.
Education (+) → Crime Rate

Y = -51.3 + 1.5X
Multiple Regression

“Simpson’s Paradox”
Urbanization is related to both: Urbanization (+) → Education; Urbanization (+) → Crime Rate

Bivariate regression: Education (+) → Crime Rate
Y = -51.3 + 1.5X1

Regression Controlling for Urbanization: Education (-) → Crime Rate; Urbanization (+) → Crime Rate
Y = 58.9 - .6X1 + .7X2
Multiple Regression
[Figure: Crime (vertical axis) plotted against Education (horizontal axis), showing the original regression line plus new lines looking at each level of urbanization: Rural, Small town, Suburban, City.]
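
The sign flip can be reproduced with a toy data set. Everything in this sketch is hypothetical: the twelve made-up places, the four urbanization levels coded 0 to 3, and the rule used to generate crime rates are invented for illustration and are not the data behind the slides' equations.

```python
def mean(values):
    return sum(values) / len(values)


def cross_dev(u, v):
    """Sum of cross-products of deviations from the two means."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


# Hypothetical places: more urban places have more education AND more crime,
# but within each urbanization level, more education goes with less crime.
urban, educ, crime = [], [], []
for level in range(4):              # 0 = rural ... 3 = city (made-up coding)
    for offset in (-1, 0, 1):
        x = 8 + 2 * level + offset
        urban.append(level)
        educ.append(x)
        crime.append(10 + 8 * level - 1.0 * x)

# Bivariate regression of crime on education alone.
bivariate_slope = cross_dev(educ, crime) / cross_dev(educ, educ)

# Regression of crime on education, controlling for urbanization.
s11, s22, s12 = cross_dev(educ, educ), cross_dev(urban, urban), cross_dev(educ, urban)
s1y, s2y = cross_dev(educ, crime), cross_dev(urban, crime)
det = s11 * s22 - s12 ** 2
b_educ = (s1y * s22 - s2y * s12) / det
b_urban = (s2y * s11 - s1y * s12) / det

print(f"Education alone: slope = {bivariate_slope:.2f}")
print(f"Controlling for urbanization: education = {b_educ:.2f}, urbanization = {b_urban:.2f}")
```

The education slope is positive on its own and negative once urbanization is held constant, which is the pattern the slides call Simpson's Paradox.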
Multiple Regression
Now… More Variables!
 The social world is very complex.
 What happens when you have even more variables?

For example:
A sociologist may be interested in the effects of Education, Income,
Sex, and Gender Attitudes on Number of Children in a family.
Independent Variables: Education, Family Income, Sex, Gender Attitudes
Dependent Variable: Number of Children
Multiple Regression
Null Hypotheses:

1. There will be no relationship between education of respondents and
   the number of children in families. Ho : b1 = 0 Ha : b1 ≠ 0
2. There will be no relationship between family income and the number
   of children in families. Ho : b2 = 0 Ha : b2 ≠ 0
3. There will be no relationship between sex and number of children.
   Ho : b3 = 0 Ha : b3 ≠ 0
4. There will be no relationship between gender attitudes and number
   of children. Ho : b4 = 0 Ha : b4 ≠ 0
Multiple Regression



Bivariate regression is based on fitting a line as close
as possible to the plotted coordinates of your data on
a two-dimensional graph.
Trivariate regression is based on fitting a plane as
close as possible to the plotted coordinates of your
data on a three-dimensional graph.
Regression with more than two independent variables
is based on fitting a shape to your constellation of
data on a multidimensional graph.
The shape will be placed so that it minimizes the
distance (sum of squared errors) from the shape to
every data point.
The shape is no longer a line, but if you hold all other
variables constant, it is linear for each independent
variable.
Multiple Regression
Imagining a graph with four dimensions!
[Figure: five three-dimensional plots, each with axes Y, X1, and X2]
Multiple Regression
For our problem, our equation could be:

Y = 7.5 - .30X1 - .40X2 + 0.5X3 + 0.25X4
E(Children) =
7.5 - .30*Educ - .40*Income + 0.5*Sex + 0.25*Gender Att.
Multiple Regression
So what does our equation tell us?
Ŷ = 7.5 - .30X1 - .40X2 + 0.5X3 + 0.25X4
E(Children) =
7.5 - .30*Educ - .40*Income + 0.5*Sex + 0.25*Gender Att.

Education | Income | Sex | Gender Att | Children
10 | 5 | 0 | 0 | 2.5
10 | 5 | 0 | 5 | 3.75
10 | 10 | 0 | 5 | 1.75
10 | 5 | 1 | 0 | 3.0
10 | 5 | 1 | 5 | 4.25
Multiple Regression
Each variable, holding the other variables constant, has a linear, two-dimensional graph of its relationship with the dependent variable.
Here we hold every other variable constant at “zero.”
Ŷ = 7.5 - .30X1 - .40X2 + 0.5X3 + 0.25X4
[Figure: two line graphs. X1 = Education: Y falls from 7.5 at 0 to 4.5 at 10, b = -.3. X2 = Income: Y falls from 7.5 at 0 to 3.5 at 10, b = -.4.]
Multiple Regression
Each variable, holding the other variables constant, has a linear, two-dimensional graph of its relationship with the dependent variable.
Here we hold every other variable constant at “zero.”
Ŷ = 7.5 - .30X1 - .40X2 + 0.5X3 + 0.25X4
[Figure: two line graphs. X3 = Sex: Y rises from 7.5 at 0 to 8 at 1, b = .5. X4 = Gender Attitudes: Y rises from 7.5 at 0 to 8.75 at 5, b = .25.]
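
The endpoints of those four graphs follow directly from the equation. A small sketch, assuming the illustrative four-variable equation from the slides; the keyword names (`educ`, `income`, `sex`, `gender_att`) are labels invented for this example, and any variable not passed in is held at zero.

```python
INTERCEPT = 7.5
SLOPES = {"educ": -0.30, "income": -0.40, "sex": 0.5, "gender_att": 0.25}


def expected_children(**values):
    """Expected children; variables not supplied are held constant at zero."""
    return INTERCEPT + sum(SLOPES[name] * values.get(name, 0) for name in SLOPES)


# Right-hand endpoint of each graph, all other variables held at zero.
print(expected_children(educ=10))        # education graph at X1 = 10
print(expected_children(income=10))      # income graph at X2 = 10
print(expected_children(sex=1))          # sex graph at X3 = 1
print(expected_children(gender_att=5))   # gender attitudes graph at X4 = 5
```

The same function reproduces the rows of the earlier table when several variables are supplied at once, for example `expected_children(educ=10, income=5, sex=1, gender_att=5)`.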
Multiple Regression
Okay, we’re almost
through with regression!
Multiple Regression

Dummy Variables
What are dummy variables?!

They are simply dichotomous variables that are entered into
regression. They have 0 – 1 coding where 0 = absence of
something and 1 = presence of something. E.g., Female
(0=M; 1=F) or Southern (0=Non-Southern; 1=Southern).
Multiple Regression
Dummy Variables are especially nice because they allow us to use nominal
variables in regression.
“But YOU said we CAN’T do that!”
A nominal variable has no rank or order, rendering the numerical coding
scheme useless for regression.
Multiple Regression

The way you use nominal variables in regression is by
converting them to a series of dummy variables.
Nominal Variable: Race (1 = White, 2 = Black, 3 = Other)
Recode into different Dummy Variables:
1. White: 0 = Not White; 1 = White
2. Black: 0 = Not Black; 1 = Black
3. Other: 0 = Not Other; 1 = Other
Multiple Regression
The way you use nominal variables in regression is by converting them to
a series of dummy variables.
Nominal Variable: Religion (1 = Catholic, 2 = Protestant, 3 = Jewish, 4 = Muslim, 5 = Other Religions)
Recode into different Dummy Variables:
1. Catholic: 0 = Not Catholic; 1 = Catholic
2. Protestant: 0 = Not Prot.; 1 = Protestant
3. Jewish: 0 = Not Jewish; 1 = Jewish
4. Muslim: 0 = Not Muslim; 1 = Muslim
5. Other Religions: 0 = Not Other; 1 = Other Relig.

Multiple Regression
When you need to use a nominal variable in
regression (like race), just convert it to a
series of dummy variables.
When you enter the variables into your model,
you MUST LEAVE OUT ONE OF THE
DUMMIES.
Leave out one of White, Black, Other; enter the rest into regression.

Multiple Regression
The reason you MUST LEAVE OUT ONE OF THE
DUMMIES is that regression is mathematically
impossible without an excluded group.
If all were in, every case would have a 1 on exactly one dummy, so
knowing the values of all but one dummy tells you the last one exactly.
Holding the others constant would leave it no room to vary.
Leave out one of Catholic, Protestant, Jewish, Muslim, Other Religion;
enter the rest into regression.
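
The recoding step and the leave-one-out rule can be sketched as follows. The helper `make_dummies`, the seven sample respondents, and the choice of Catholic as the excluded group are all assumptions made for this illustration; any one category could be left out instead.

```python
RELIGION = {1: "Catholic", 2: "Protestant", 3: "Jewish", 4: "Muslim", 5: "Other"}


def make_dummies(codes, labels, omit=None):
    """Turn nominal codes into 0/1 columns, optionally leaving one group out."""
    return {
        label: [1 if code == value else 0 for code in codes]
        for value, label in labels.items()
        if label != omit
    }


respondents = [1, 2, 2, 5, 3, 4, 1]   # hypothetical coded answers

# With every dummy included, each case sums to exactly 1, so the columns
# carry redundant information. That is why one group must be left out.
all_dummies = make_dummies(respondents, RELIGION)
row_sums = [sum(column[i] for column in all_dummies.values())
            for i in range(len(respondents))]
print(row_sums)

# The set actually entered into the regression: Catholic is the excluded group.
model_dummies = make_dummies(respondents, RELIGION, omit="Catholic")
for label, column in model_dummies.items():
    print(label, column)
```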

Multiple Regression

The regression equations for dummies will
look the same.
For Race, with 3 categories (two dummies entered, white excluded), predicting self-esteem:

Y = a + b1X1 + b2X2
a = the y-intercept, which in this case is the predicted value of self-esteem for the excluded group, white.
b1 = the slope for variable X1, black
b2 = the slope for variable X2, other
Multiple Regression

If our equation were:
For Race, with 3 dummies, predicting self-esteem:

Y = 28 + 5X1 – 2X2
28 = the y-intercept, which in this case is the predicted value of self-esteem for the excluded group, white.
5 = the slope for variable X1, black
-2 = the slope for variable X2, other
Plugging in values for the dummies tells you each group’s self-esteem average:
White = 28
Black = 33
Other = 26
When cases’ values for X1 = 0 and X2 = 0, they are white;
when X1 = 1 and X2 = 0, they are black;
when X1 = 0 and X2 = 1, they are other.
Multiple Regression
Dummy variables can be entered into multiple
regression along with other dichotomous and
continuous variables.
For example, you could regress self-esteem
on sex, race, and education:

Y = a + b1X1 + b2X2 + b3X3 + b4X4
X1 = Female
X2 = Black
X3 = Other
X4 = Education

How would you interpret this?

Y = 30 – 4X1 + 5X2 – 2X3 + 0.3X4
Multiple Regression
How would you interpret this?

Y = 30 – 4X1 + 5X2 – 2X3 + 0.3X4
X1 = Female
X2 = Black
X3 = Other
X4 = Education
1. Women’s self-esteem is 4 points lower than men’s.
2. Blacks’ self-esteem is 5 points higher than whites’.
3. Others’ self-esteem is 2 points lower than whites’
   and consequently 7 points lower than blacks’.
4. Each year of education improves self-esteem by 0.3
   units.
Multiple Regression
How would you interpret this?

Y = 30 – 4X1 + 5X2 – 2X3 + 0.3X4
X1 = Female
X2 = Black
X3 = Other
X4 = Education
Plugging in some select values, we’d get self-esteem for
select groups:

White males with 10 years of education = 33

Black males with 10 years of education = 38

Other females with 10 years of education = 27

Other females with 16 years of education = 28.8
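
These group values can be checked by plugging the dummies and education into the equation. A minimal sketch using the illustrative self-esteem equation from the slides; the function name and argument names are chosen for this example.

```python
def self_esteem(female, black, other, education):
    """Y = 30 - 4*Female + 5*Black - 2*Other + 0.3*Education."""
    return 30 - 4 * female + 5 * black - 2 * other + 0.3 * education


groups = {
    "White males, 10 years": (0, 0, 0, 10),
    "Black males, 10 years": (0, 1, 0, 10),
    "Other females, 10 years": (1, 0, 1, 10),
    "Other females, 16 years": (1, 0, 1, 16),
}
for label, values in groups.items():
    print(label, round(self_esteem(*values), 1))
```

White is the excluded race group and male the excluded sex group, so a white male is represented by zeros on all three dummies.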
Multiple Regression
How would you interpret this?

Y = 30 – 4X1 + 5X2 – 2X3 + 0.3X4
X1 = Female
X2 = Black
X3 = Other
X4 = Education
The same regression rules apply. The slopes represent
the linear relationship of each independent variable
in relation to the dependent while holding all other
variables constant.
Make sure you get into the habit of saying the
slope is the effect of an independent variable
on the dependent variable “while holding
everything else constant.”
Multiple Regression
Standardized Coefficients
 Sometimes you want to know whether one variable
has a larger impact on your dependent variable than
another.
 If your variables have different units of measure, it is
hard to compare their effects.
 For example, if wages go up one thousand dollars
for each year of education, is that a greater effect
than if wages go up five hundred dollars for each
year increase in age?
Multiple Regression
Standardized Coefficients
 So which is better for increasing wages, education or
aging?
 One thing you can do is “standardize” your slopes so
that you can compare the standard deviation increase
in your dependent variable for each standard
deviation increase in your independent variables.
 You might find that Wages go up 0.3 standard
deviations for each standard deviation increase in
education, but 0.4 standard deviations for each
standard deviation increase in age.
Multiple Regression
Standardized Coefficients

Recall that standardizing regression coefficients is
accomplished by the formula: b(Sx/Sy)
Coefficients (a)
Model 1 | B (Unstandardized) | Std. Error | Beta (Standardized) | t | Sig.
(Constant) | 11.770 | 1.734 | | 6.787 | .000
Education | -.364 | .173 | -.412 | -2.105 | .047
Income | -.403 | .194 | -.408 | -2.084 | .049
a. Dependent Variable: Children


In the example above, education and income have very
comparable effects on number of children.
Each lowers the number of children by .4 standard deviations
for a standard deviation increase in each, controlling for the
other.
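
The b(Sx/Sy) formula can be applied by hand to see where the Beta column comes from. This sketch assumes the 25 cases as read from the data table in these slides and the rounded unstandardized slopes SPSS reports, so the results may differ from SPSS in the third decimal.

```python
import statistics

# The 25 cases as read from the slides' data table (income in $10K units).
children = [2, 5, 1, 9, 6, 3, 0, 3, 7, 7, 2, 5, 1, 9, 6,
            3, 0, 3, 7, 14, 2, 5, 1, 9, 6]
education = [12, 16, 20, 12, 9, 18, 16, 14, 9, 12, 12, 10, 20, 11, 9,
             18, 16, 14, 9, 8, 12, 10, 20, 11, 9]
income = [3, 4, 9, 5, 4, 12, 10, 1, 4, 3, 10, 4, 9, 4, 4,
          12, 10, 6, 4, 1, 10, 3, 9, 2, 4]

# Unstandardized slopes as reported by SPSS.
slopes = {"Education": -0.364, "Income": -0.403}
predictors = {"Education": education, "Income": income}

s_y = statistics.stdev(children)
betas = {
    name: slopes[name] * statistics.stdev(predictors[name]) / s_y
    for name in slopes
}
for name, beta in betas.items():
    print(f"{name}: Beta = {beta:.3f}")
```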
Multiple Regression
Standardized Coefficients
 One last note of caution...


It does not make sense to standardize slopes for
dichotomous variables.
It makes no sense to refer to standard deviation increases
in sex, or in race--these are either 0 or they are 1 only.
Multiple Regression
Give yourself a hand…
You now understand more
statistics than 99% of the
population!
You are well-qualified for
understanding most
sociological research papers.