Multiple Regression Lecture


Multiple Regression
MARE 250
Dr. Jason Turner
Linear Regression
y = b0 + b1x
y = dependent variable
b0 and b1 are constants:
b0 = y-intercept
b1 = slope
x = independent variable
Example: Urchin density = b0 + b1(salinity)
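As a rough illustration, here is a minimal sketch of fitting this line in Python with statsmodels; the salinity and urchin-density numbers below are made up for demonstration.

    import numpy as np
    import statsmodels.api as sm

    # Hypothetical data: salinity (x) and urchin density (y)
    salinity = np.array([30.1, 31.5, 32.0, 33.2, 34.0, 34.8, 35.5])
    density = np.array([12.0, 10.5, 9.8, 8.1, 7.4, 6.0, 5.2])

    X = sm.add_constant(salinity)      # adds the b0 (intercept) column
    model = sm.OLS(density, X).fit()   # ordinary least squares fit
    print(model.params)                # [b0, b1]: intercept and slope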
Multiple Regression
Multiple regression allows us to learn more about the
relationship between several independent or predictor
variables and a dependent or criterion variable
For example, we might be looking for a reliable way to
estimate the age of AHI at the dock instead of waiting
for laboratory analyses
y = b0 + b1x
y = b0 + b1x1 + b2x2 + … + bnxn
Multiple Regression
In the social and natural sciences, multiple regression procedures are very widely used in research
Multiple regression allows the researcher to ask "what is the best predictor of ...?"
For example, educational researchers might want to learn what the best predictors of success in high school are
Psychologists may want to determine which personality variable best predicts social adjustment
Sociologists may want to find out which of the multiple social indicators best predict whether or not a new immigrant group will adapt and be absorbed into society
Multiple Regression
The general computational problem that needs to be solved in multiple regression analysis is to fit a straight line to a number of points
In the simplest case, there is one dependent and one independent variable
This can be visualized in a scatterplot
[Figure: Scatterplot of Age vs SL]
The Regression Equation
A line in a two-dimensional or two-variable space is defined by the equation Y = a + b*X; the animation below shows a two-dimensional regression equation plotted with three different confidence intervals (90%, 95%, 99%)
In the multivariate case, when there is more than one independent variable, the regression line cannot be visualized in two-dimensional space, but it can be computed rather easily
[Figure: Matrix Plot of Age vs SL, BM, OP, PF]
Residual Variance and R-square
The smaller the variability of the residual values around the regression line relative to the overall variability, the better our prediction is
Coefficient of determination (r2) - if we have an r2 of 0.4, we have explained 40% of the original variability and are left with 60% residual variability. Ideally, we would like to explain most if not all of the original variability
Therefore, the r2 value is an indicator of how well the model fits the data (e.g., an r2 close to 1.0 indicates that we have accounted for almost all of the variability with the variables specified in the model)
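A minimal sketch of that relationship in Python, using made-up observed values and predictions; r2 is one minus the ratio of residual variability to overall variability:

    import numpy as np

    # Hypothetical observed values and model predictions
    y = np.array([5.0, 6.2, 7.1, 8.3, 9.0, 10.4])
    y_hat = np.array([5.3, 6.0, 7.4, 8.1, 9.2, 10.1])

    ss_res = np.sum((y - y_hat) ** 2)     # residual variability
    ss_tot = np.sum((y - y.mean()) ** 2)  # overall variability
    r2 = 1 - ss_res / ss_tot              # fraction of variability explained
    print(r2)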
Assumptions, Assumptions…
Assumption of Linearity
It is assumed that the relationship between variables is linear - always look at a bivariate scatterplot of the variables of interest
Normality Assumption
It is assumed in multiple regression that the residuals (observed minus predicted values) are distributed normally (i.e., follow the normal distribution)
Most tests (specifically the F-test) are quite robust with regard to violations of this assumption
Review the distributions of the major variables with histograms
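A minimal sketch of these two checks in Python, assuming a fitted statsmodels result named model (hypothetical, e.g., from the fitting sketches above) is already available:

    import matplotlib.pyplot as plt

    residuals = model.resid  # model is an assumed fitted statsmodels OLS result

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.scatter(model.fittedvalues, residuals)  # residuals vs fitted: look for nonlinearity
    ax1.set_xlabel("Fitted values")
    ax1.set_ylabel("Residuals")
    ax2.hist(residuals, bins=15)                # histogram: check approximate normality
    ax2.set_xlabel("Residuals")
    plt.show()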
Effects of Outliers
Outliers may be influential observations
An influential observation is a data point whose removal causes the regression equation (line) to change considerably
Consider removing an influential observation much like an outlier
If there is no explanation for the point, removal is up to the researcher
Stepwise Regression:
When is too much too much?
Building Models via Stepwise Regression
Stepwise model-building techniques for regression involve three basic procedures:
(1) identifying an initial model
(2) iteratively "stepping" - repeatedly altering the model from the previous step by adding or removing a predictor variable in accordance with the "stepping criteria"
(3) terminating the search when stepping is no longer possible given the stepping criteria, as in the sketch below
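A minimal sketch of forward stepwise selection under such criteria, assuming the data sit in a pandas DataFrame df (hypothetical) with the response and predictors as columns; the alpha-to-enter of 0.15 mirrors the Minitab output shown later.

    import statsmodels.api as sm

    def forward_stepwise(df, response, alpha_enter=0.15):
        """Greedy forward selection: at each step, add the candidate predictor
        with the smallest p-value; stop when none falls below alpha_enter."""
        selected = []
        candidates = [c for c in df.columns if c != response]
        while candidates:
            best_p, best_var = None, None
            for var in candidates:
                X = sm.add_constant(df[selected + [var]])
                fit = sm.OLS(df[response], X).fit()
                p = fit.pvalues[var]
                if best_p is None or p < best_p:
                    best_p, best_var = p, var
            if best_p is None or best_p >= alpha_enter:
                break  # stepping is no longer possible given the criteria
            selected.append(best_var)
            candidates.remove(best_var)
        return selected

    # e.g., forward_stepwise(fish, "Age") might return ["BM", "OP", "SL"]

A full stepwise procedure would also re-test already-entered variables against an alpha-to-remove threshold after each addition; this sketch shows only the forward half.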
For Example…
We are interested in predicting values for Y based upon several X's… Age of AHI based upon SL, BM, OP, PF
We run a multiple regression and get the equation:
Age = -2.64 + 0.0382 SL + 0.209 BM + 0.136 OP + 0.467 PF
We then run a STEPWISE regression to determine the best subset of these variables
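A minimal sketch of that full fit in Python, assuming the 84 fish measurements live in a pandas DataFrame named fish (hypothetical) with columns Age, SL, BM, OP, PF:

    import statsmodels.formula.api as smf

    # fish = pd.read_csv("ahi_measurements.csv")  # hypothetical data source
    model = smf.ols("Age ~ SL + BM + OP + PF", data=fish).fit()
    print(model.params)     # intercept (b0) plus one coefficient per predictor
    print(model.summary())  # t-values, p-values, R-Sq, etc.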
How does it work…
Best subsets output - response is Age

Vars   R-Sq  R-Sq(adj)    C-p        S   Variables in model
   1   77.7       77.4    8.0  0.96215   BM
   1   60.3       59.8   76.6  1.2839
   2   78.9       78.3    5.4  0.94256   BM, OP
   2   78.6       78.0    6.6  0.94962
   3   79.8       79.1    3.6  0.92641   SL, BM, OP
   3   79.1       78.3    6.5  0.94353
   4   80.0       79.0    5.0  0.92897   SL, BM, OP, PF
How does it work…
Stepwise Regression: Age versus SL, BM, OP, PF
Alpha-to-Enter: 0.15  Alpha-to-Remove: 0.15
Response is Age on 4 predictors, with N = 84

Step              1         2         3
Constant    -0.8013   -1.1103   -5.4795

BM            0.355     0.326     0.267
T-Value       16.91     13.17      6.91
P-Value       0.000     0.000     0.000

OP                       0.096     0.101
T-Value                   2.11      2.26
P-Value                  0.038     0.027

SL                                 0.087
T-Value                             1.96
P-Value                            0.053

S             0.962     0.943     0.926
R-Sq          77.71     78.87     79.84
R-Sq(adj)     77.44     78.35     79.08
Mallows C-p     8.0       5.4       3.6
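Reading the final (step 3) column of this output, the model selected by the stepwise procedure would be:
Age = -5.4795 + 0.087 SL + 0.267 BM + 0.101 OP
PF never entered the model, since its p-value never fell below the 0.15 alpha-to-enter threshold.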
Who Cares?
Stepwise analysis allows you (i.e., the computer) to determine which predictor variables (or combinations of them) best explain (can be used to predict) Y
This becomes much more important as the number of predictor variables increases
It helps to make better sense of complicated multivariate data