Transcript Week 11 Mar. 20-21

ANOVA, Regression and
Multiple Regression
March 20-21, 2012
PPAL 6200
Why ANOVA
• ANOVA allows us to compare the means
for groups and ask if they are sufficiently
different from one another to say that such
differences are statistically significant.
• In other words, there is a low probability that a difference of such magnitude would occur between the sample groups by chance if the actual difference between the populations were zero.
• Unlike “t” tests, ANOVA can be performed
with more than two groups
• The statistic associated with ANOVA is the
“f” test.
As the book notes…
• “The details of ANOVA are a bit daunting (they
appear in an optional section at the end of this
chapter). The main idea of ANOVA is more
accessible and much more important. Here it is:
when we ask if a set of sample means gives
evidence for differences among the population
means, what matters is not how far apart the
sample means are but how far apart they are
relative to the variability of individual
observations.”
In other words, let's look both at the means and at the overlap of the distributions!
(Graphics Moore 2009)
Statistically speaking
• The "f" statistic looks at the variation among the sample means over the variation among individuals in the same sample
F = variation among the sample means / variation among individual cases in the same sample
• Like “t” the “f” statistic is very robust and
therefore you should not worry too much
about deviations from normality if your
sample is large
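As a minimal sketch of this ratio in practice, here is a one-way ANOVA in Python (the course itself uses SPSS and Excel, so this is just an aside); the three groups are made-up numbers for illustration only.

    # One-way ANOVA on three hypothetical groups
    from scipy import stats

    group_a = [12.1, 13.4, 11.8, 12.9, 13.0]
    group_b = [14.2, 15.1, 13.8, 14.6, 15.0]
    group_c = [12.5, 12.0, 13.1, 12.8, 12.2]

    # f_oneway returns the "f" statistic (among-group variation over
    # within-group variation) and its p-value
    f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
    print(f"F = {f_stat:.2f}, p = {p_value:.4f}")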
One warning
• ANOVA assumes that the variability of observations, measured by the standard deviation, is the same in all populations
• In the real world, if you keep the sizes of the groups you are comparing roughly similar, few problems occur, but you must check.
The book gives this rule
• Results of the “f” test are usually okay if the
largest sample standard deviation is no more than
twice as large as the smallest sample standard
deviation
• Another way to check is Levene's test for equality of variances.
• If it is significant (low p), there is little probability that the standard deviations of the groups being compared are equal, and you have a problem.
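Both checks can be scripted; this sketch (Python, using the same kind of made-up groups as before, not the book's data) applies the twice-the-smallest-SD rule and Levene's test.

    import numpy as np
    from scipy import stats

    # Hypothetical groups for illustration only
    groups = [
        [12.1, 13.4, 11.8, 12.9, 13.0],
        [14.2, 15.1, 13.8, 14.6, 15.0],
        [12.5, 12.0, 13.1, 12.8, 12.2],
    ]

    # Rule of thumb: largest sample SD no more than twice the smallest
    sds = [np.std(g, ddof=1) for g in groups]
    print("SD ratio (largest/smallest):", round(max(sds) / min(sds), 2))

    # Levene's test: a small p-value suggests the group variances differ
    w_stat, p_value = stats.levene(*groups)
    print(f"Levene W = {w_stat:.2f}, p = {p_value:.4f}")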
From Table 24.2 (richness of trees by Group)
Why Regression
• Regression is commonly used in the social
sciences because it allows us to
– Describe
– Explain
– Predict
which are the three big goals of social
science
Recall
• Regression involves mathematically describing a
linear relationship between a response (or
dependent) variable and an explanatory (or
independent) variable
• That line is given in the form y = a + b(x), where:
– y is the response variable
– a is the y axis intercept of the line
– b is the slope of the line
– x is the explanatory variable
Requirements for use of regression
• Also recall that if the relationship between our response
and explanatory variable is not linear then regression will
give misleading results. Therefore we always do a
scatter plot before attempting regression. The
mathematical notation for linearity is below.
• This is sometimes called the "least-squares regression line" because this regression procedure finds the line that minimizes the sum of the squared differences from each data point.
μy = α + βx
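A minimal sketch of fitting such a least-squares line in Python (made-up (x, y) pairs, not the book's example):

    from scipy import stats

    x = [1, 2, 3, 4, 5, 6]
    y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]

    # Least-squares fit of y = a + b(x)
    fit = stats.linregress(x, y)
    print(f"a (intercept) = {fit.intercept:.2f}")
    print(f"b (slope)     = {fit.slope:.2f}")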
Regression requirements continued
• For any value of x the values of y are normally
distributed and repeated responses of y are
independent of each other
• The standard deviation of y is the same for all
values of x
Graphics Moore 2009
Regression Analysis
• As well as estimating the regression line,
we also estimate the goodness of fit
between the line and the data by using a
statistic known as Rsq
• Rsq (as the name implies) is the square of
the correlation measure known as “r”.
• We also have to know the significance of the association between the explanatory and response variables (as well as the coefficient "a") for the line we have found; we use a variation of the "t" test for this.
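A short sketch of both quantities in Python, again with made-up data rather than the book's example:

    from scipy import stats

    x = [1, 2, 3, 4, 5, 6]
    y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]

    fit = stats.linregress(x, y)
    print(f"Rsq = {fit.rvalue ** 2:.3f}")        # goodness of fit
    # The p-value comes from the "t" test of the null hypothesis slope = 0
    print(f"p-value for the slope = {fit.pvalue:.4f}")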
A useful tool: Regression Standard
Error
• The regression standard error is a useful tool that can help us diagnose whether we have met the various conditions needed to perform a regression (don't worry, your software will do this).
s = √( Σ(y − ŷ)² / (n − 2) )
So, looking at example 23.1 in your book, here is the scatter plot
Here is the regression control showing
how I have selected the standard errors
called “residuals” in SPSS
Here is the dialogue box in
Excel using the plug in
Here is a portion of the printout
that was generated in SPSS
Here you can see the standardized residuals, or errors, that were calculated
In Excel it looks like this
A happy coincidence
• As the book notes, Rsq is "closely related" to r.
• In fact it is literally the square of r in a simple OLS regression with one explanatory variable.
• Therefore, when you test the null hypothesis of the regression line (that it is actually flat), you have also pretty much tested the correlation too. However, most software also prints it out in case you want to see it.
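You can see the coincidence directly in a quick sketch (made-up data; r comes from the correlation, Rsq from the regression):

    from scipy import stats

    x = [1, 2, 3, 4, 5, 6]
    y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]

    r, p_corr = stats.pearsonr(x, y)               # correlation "r" and its p-value
    fit = stats.linregress(x, y)                   # simple OLS regression

    print(f"r squared    = {r ** 2:.4f}")
    print(f"Rsq from OLS = {fit.rvalue ** 2:.4f}") # identical with one explanatory variable
    print(f"p (correlation) = {p_corr:.4f}, p (slope) = {fit.pvalue:.4f}")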
Does it matter that the estimate of
the intercept is insignificant?
• In practice, no.
• What really matters is the estimate of the
slope
Calculating the confidence interval
for your line
• If you look back at our printout you will see the slope is given, as is the standard error of the slope and a "t" value. Put them together and you have the 95% confidence interval for the population slope.
b ± t* × SEb gives the 95% confidence interval
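If you want to put the pieces together yourself rather than read them off the printout, here is a sketch (made-up data; t* comes from the t distribution with n − 2 degrees of freedom):

    from scipy import stats

    x = [1, 2, 3, 4, 5, 6]
    y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]

    fit = stats.linregress(x, y)
    n = len(x)
    t_star = stats.t.ppf(0.975, df=n - 2)          # t* for 95% confidence

    lower = fit.slope - t_star * fit.stderr        # fit.stderr is the SE of the slope
    upper = fit.slope + t_star * fit.stderr
    print(f"95% CI for the population slope: ({lower:.2f}, {upper:.2f})")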
Or you could have had the computer
calculate the confidence intervals for you
• Here is a portion of the regression printout from MS-Excel
As noted before we can use our
standardized errors to check our
assumptions
• The y values vary normally for each x value: do a histogram of your residuals and check for relative normality of distribution
• Plot the residuals as the dependent variable with the x variable as independent to check for linearity and that the observations for y are independent of each other
• The standard deviation of responses can be checked by looking for a roughly symmetrical distribution above and below the zero point
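If you are not working in SPSS or Excel, these residual checks can be sketched in Python with matplotlib (made-up data for illustration only):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
    y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8, 10.1, 10.7])

    fit = stats.linregress(x, y)
    residuals = y - (fit.intercept + fit.slope * x)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.hist(residuals)                  # check for rough normality
    ax1.set_title("Histogram of residuals")
    ax2.scatter(x, residuals)            # check linearity and constant spread
    ax2.axhline(0)
    ax2.set_title("Residuals vs x")
    plt.show()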
Our previous example had too few
cases to check residuals so here is
example 23.9 from the book on
climate change
Graphics Source: Moore 2009
Graphics Source: Moore 2009
Moving from OLS to Multiple OLS:
three big changes
1. We have to use Beta instead of B
2. We have to be aware of multicollinearity and other multiple impacts (in short, that we are not just piling on independent variables but that each independent variable is demonstrating a unique explanatory power)
3. The book gives you a third. We have to be aware of interaction terms and other factors that lead us to pick one model over another
The equation is now changed to reflect the greater number of variables and the change from b to beta
μy = β0 + β1x1 + β2x2 + … + βnxn
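For readers outside SPSS, here is a minimal sketch of fitting such a model with statsmodels in Python; the variable names and values are hypothetical stand-ins, not the book's Table 27.6 data:

    import pandas as pd
    import statsmodels.api as sm

    # Hypothetical data for illustration only
    data = pd.DataFrame({
        "purchase":  [25, 40, 32, 55, 47, 60, 38, 52],
        "age":       [22, 35, 28, 44, 39, 50, 31, 42],
        "frequency": [2, 5, 3, 7, 6, 8, 4, 6],
    })

    X = sm.add_constant(data[["age", "frequency"]])   # adds the intercept term
    model = sm.OLS(data["purchase"], X).fit()
    print(model.summary())                            # betas, t tests, Rsq, ANOVA F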
How to do it?
• Start from the beginning and look at each variable
separately using our descriptive and exploratory
techniques
• Now look at our dependent variable in pairs with each
independent variable using correlations to see which
ones might have a big impact
• Fit different models. Pay attention to changes in
explanatory power and also the t statistics
• If using stats software, use stepwise procedures. Stepwise adds and removes variables, in the order you input them, based on a selection criterion (the change in the F statistic from the ANOVA test). In short, the computer tells you which model best fits with as few variables as possible.
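SPSS handles the stepwise bookkeeping for you. Purely as an illustration of the idea, here is a crude forward-selection sketch in Python that adds whichever hypothetical variable most improves adjusted Rsq; it is not SPSS's exact procedure, which the slide describes as based on the change in the F statistic.

    import pandas as pd
    import statsmodels.api as sm

    # Hypothetical data for illustration only
    data = pd.DataFrame({
        "purchase":  [25, 40, 32, 55, 47, 60, 38, 52],
        "age":       [22, 35, 28, 44, 39, 50, 31, 42],
        "frequency": [2, 5, 3, 7, 6, 8, 4, 6],
        "income":    [30, 55, 42, 70, 61, 80, 48, 66],
    })
    y = data["purchase"]
    candidates = ["age", "frequency", "income"]
    selected = []

    improved = True
    while improved and candidates:
        # Score each remaining variable by the adjusted Rsq it would give
        scores = {v: sm.OLS(y, sm.add_constant(data[selected + [v]])).fit().rsquared_adj
                  for v in candidates}
        best = max(scores, key=scores.get)
        current = (sm.OLS(y, sm.add_constant(data[selected])).fit().rsquared_adj
                   if selected else 0.0)
        improved = scores[best] > current
        if improved:
            selected.append(best)
            candidates.remove(best)

    print("Selected variables:", selected)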
In SPSS you can do more than one scatterplot at once
The data provided in Table 27.6 represent a random
sample of 60 customers from a large clothing retailer. The
manager of the store is interested in predicting how much a
customer will spend on his or her next purchase.
Our goal is to find a regression model for predicting the
amount of a purchase from the available explanatory
variables. A short description of each variable is provided
below.
Here are the printouts for Ex 27.19 using SPSS
Let’s add a new variable
• Purchase 12 shows the total purchases each customer made over the last 12 months divided by the frequency of their visits to the store
As you will see it changes things
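The new column is just a derived variable; in Python/pandas the construction would look something like this (hypothetical column names, not the actual retailer data set):

    import pandas as pd

    # Hypothetical stand-ins for the retailer data
    df = pd.DataFrame({
        "total_purchases_12mo": [480, 1200, 300, 900],
        "visit_frequency":      [12, 24, 6, 18],
    })

    # Purchase 12: total 12-month purchases divided by frequency of visits
    df["purchase12"] = df["total_purchases_12mo"] / df["visit_frequency"]
    print(df)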
Here is the OLS for it alone
• The last slide was
basically an
interaction of the two
variables we
previously identified
as helpful. Let’s go
back to when they
were separate for a
second and test
whether each has a
separate impact or if
multicollinearity is at
play. Look for
tolerances of .1 or less
as evidence of
multicollinearity
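Tolerance is simply 1/VIF (the variance inflation factor), so the same check can be sketched outside SPSS; the predictors below are hypothetical stand-ins, not the retailer variables:

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    # Hypothetical predictors for illustration only
    data = pd.DataFrame({
        "age":       [22, 35, 28, 44, 39, 50, 31, 42],
        "frequency": [2, 5, 3, 7, 6, 8, 4, 6],
        "income":    [30, 55, 42, 70, 61, 80, 48, 66],
    })

    X = sm.add_constant(data)
    # Tolerance = 1 / VIF; values of .1 or less flag multicollinearity
    for i, name in enumerate(X.columns):
        if name == "const":
            continue
        vif = variance_inflation_factor(X.values, i)
        print(f"{name}: tolerance = {1 / vif:.3f}")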
Finally let’s look at our residual
plots
• Often you might have the
chance to use more
elaborate residuals than
standardized ones, such
as studentized residuals.
• As there is no pattern we
assume the variance for y
is the same for all values
of x
• The sequence chart also tells us that the y values are independent of each other
• The QQ plot tells us the residuals are roughly normal, meaning that the condition that values of y vary normally for each value of x might be met
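For completeness, the QQ plot and sequence chart can also be produced outside SPSS; this sketch simply generates hypothetical standardized residuals to show the two plots:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    # Hypothetical residuals; in practice use the ones your software saved
    rng = np.random.default_rng(0)
    residuals = rng.standard_normal(60)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    stats.probplot(residuals, dist="norm", plot=ax1)   # QQ plot: rough normality
    ax2.plot(residuals, marker="o")                    # sequence chart: look for trends
    ax2.axhline(0)
    ax2.set_title("Residuals in observation order")
    plt.show()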