ANOVA and linear regression

July 15, 2004
ANOVA
For comparing means between more than two groups
ANOVA (ANalysis Of VAriance)

Idea: for two or more groups, test the difference between means of a quantitative, normally distributed outcome variable.

- ANOVA is just an extension of the t-test: an ANOVA with only two groups is mathematically equivalent to a t-test.
- Like the t-test, ANOVA is a "parametric" test: it assumes that the outcome variable is roughly normally distributed, with a mean and standard deviation (the parameters) that we can estimate.
ANOVA Assumptions

Assumptions: normally distributed outcome variable; homogeneity of variances across groups (like the t-test).
The "F-test"

Is the difference in the means of the groups more than background noise (= the variability within groups)?

$$F = \frac{\text{variability between groups}}{\text{variability within groups}}$$
Spine bone density vs. menstrual regularity

[Figure: spine bone mineral density (g/cm²) plotted for the amenorrheic, oligomenorrheic, and eumenorrheic groups, annotated to show the between-group variation and the within-group variability of each group.]
Group means and standard deviations

- Amenorrheic group (n = 11): mean spine BMD = .92 g/cm², standard deviation = .10 g/cm²
- Oligomenorrheic group (n = 11): mean spine BMD = .94 g/cm², standard deviation = .08 g/cm²
- Eumenorrheic group (n = 11): mean spine BMD = 1.06 g/cm², standard deviation = .11 g/cm²
The F-Test

The between-group variance combines the size of the groups (n = 11 per group) with the between-group variation, i.e., the deviation of each group's mean from the overall mean (.97):

$$s^2_{\text{between}} = \frac{n \sum_k (\bar{x}_k - \bar{x})^2}{k-1} = \frac{11 \times \left[(.92-.97)^2 + (.94-.97)^2 + (1.06-.97)^2\right]}{3-1} = .063$$

The within-group variance is the average amount of variation within groups, i.e., the average of each group's variance:

$$s^2_{\text{within}} = \overline{s^2} = \frac{1}{3}\left(.10^2 + .08^2 + .11^2\right) = .0095$$

$$F_{2,30} = \frac{s^2_{\text{between}}}{s^2_{\text{within}}} = \frac{.063}{.0095} = 6.6$$

A large F value indicates that the between-group variation exceeds the within-group variation (= the background noise).
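As a check, the same arithmetic can be run in SAS with a short data step (a minimal sketch; the dataset name fcheck is illustrative, and PROBF is the F cumulative distribution function):

* recompute the F statistic from the summary statistics above;
data fcheck;
  grand = (.92 + .94 + 1.06) / 3;               * grand mean, ~.97;
  s2_between = 11 * ((.92 - grand)**2
             + (.94 - grand)**2
             + (1.06 - grand)**2) / (3 - 1);    * ~.063;
  s2_within = (.10**2 + .08**2 + .11**2) / 3;   * = .0095;
  F = s2_between / s2_within;                   * ~6.6;
  p = 1 - probf(F, 2, 30);                      * upper-tail p-value on (2, 30) d.f.;
run;
proc print data=fcheck;
run;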
The F-distribution

The F-distribution is a continuous probability distribution that depends on two parameters, n and m (the numerator and denominator degrees of freedom, respectively).
The F-distribution

A ratio of sample variances follows an F-distribution:

$$\frac{s^2_{\text{between}}}{s^2_{\text{within}}} = F \sim F_{n,m}$$

The F-test tests the hypothesis that two sample variances are equal; F will be close to 1 if they are.

$$H_0:\ \sigma^2_{\text{between}} = \sigma^2_{\text{within}} \qquad H_a:\ \sigma^2_{\text{between}} > \sigma^2_{\text{within}}$$
ANOVA Table

| Source of variation | d.f. | Sum of squares | Mean sum of squares | F-statistic | p-value |
|---|---|---|---|---|---|
| Between (k groups) | k − 1 | SSB (sum of squared deviations of group means from the grand mean) | SSB/(k − 1) | [SSB/(k − 1)] / [SSW/(nk − k)] ~ F(k−1, nk−k) | Go to F chart |
| Within (n individuals per group) | nk − k | SSW (sum of squared deviations of observations from their group mean) | s² = SSW/(nk − k) | | |
| Total variation | nk − 1 | TSS (sum of squared deviations of observations from the grand mean) | | | |

TSS = SSB + SSW
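In practice this table comes from software; here is a minimal SAS sketch, assuming a one-row-per-subject dataset with hypothetical names runners (data), group (predictor), and bmd (outcome):

proc glm data=runners;
  class group;         * categorical predictor that defines the groups;
  model bmd = group;   * prints the between/within ANOVA table and the F test;
run;
quit;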
ANOVA = t-test

| Source of variation | d.f. | Sum of squares | Mean sum of squares | F-statistic | p-value |
|---|---|---|---|---|---|
| Between (2 groups) | 1 | SSB (squared difference in means) | SSB (squared difference in means) | $\left(\dfrac{\bar{X}-\bar{Y}}{s_p\sqrt{2/n}}\right)^2 = (t_{2n-2})^2 \sim F_{1,\,2n-2}$ | Go to F chart |
| Within | 2n − 2 | SSW (equivalent to the numerator of the pooled variance) | pooled variance $s_p^2$ | | |
| Total variation | 2n − 1 | TSS | | | |

Notice that the F values are just $(t_{2n-2})^2$.
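To see why the two-group ANOVA reproduces the t-test: for two groups of equal size n, the between sum of squares works out to SSB = n(X̄ − Ȳ)²/2, and SSW/(2n − 2) is exactly the pooled variance, so

$$F_{1,\,2n-2} = \frac{\mathrm{SSB}/1}{\mathrm{SSW}/(2n-2)} = \frac{n(\bar{X}-\bar{Y})^2/2}{s_p^2} = \left(\frac{\bar{X}-\bar{Y}}{s_p\sqrt{2/n}}\right)^2 = (t_{2n-2})^2$$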
ANOVA summary

- A statistically significant ANOVA (F-test) only tells you that at least two of the groups differ, but not which ones.
- Determining which groups differ (when it's unclear) requires more sophisticated analyses to correct for the problem of multiple comparisons…
Question: Why not just do 3 pairwise t-tests?

- Answer: because, at an error rate of 5% per test, you have an overall chance of up to 1 − (.95)³ ≈ 14% of making a type-I error (if all 3 comparisons were independent).
- If you wanted to compare 6 groups, you'd have to do 6C2 = 15 pairwise t-tests, which would give you a high chance of finding something significant just by chance (if all tests were independent with a type-I error rate of 5% each); the probability of at least one type-I error is 1 − (.95)¹⁵ ≈ 54%.
Multiple comparisons

- With 18 independent comparisons, we have a 60% chance of at least 1 false positive.
- Put another way: with 18 independent comparisons, we expect about 1 false positive.
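The arithmetic behind both statements is easy to reproduce in a SAS data step (a sketch, assuming independent tests at a 5% per-test error rate as above):

* chance of at least one type-I error among m independent tests at alpha = .05;
data fwer;
  do m = 3, 15, 18;
    p_at_least_one = 1 - .95**m;   * 14%, 54%, and 60% for m = 3, 15, 18;
    expected_fp = .05 * m;         * expected number of false positives (~1 when m = 18);
    output;
  end;
run;
proc print data=fwer;
run;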
Correction for multiple comparisons

How to correct for multiple comparisons post hoc (see the SAS sketch below):

- Bonferroni correction (adjusts by the most conservative amount; assuming all tests are independent, divide the significance level α by the number of tests)
- Holm/Hochberg (gives a p-cutoff beyond which results are not significant)
- Tukey's (adjusts p)
- Scheffé's (adjusts p)
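Several of these adjustments are available directly in SAS's PROC GLM (a sketch, reusing the hypothetical runners dataset and variable names from above):

proc glm data=runners;
  class group;
  model bmd = group;
  means group / bon tukey scheffe;   * pairwise comparisons with Bonferroni, Tukey, and Scheffe adjustments;
run;
quit;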
Non-parametric ANOVA

- Kruskal-Wallis one-way ANOVA: an extension of the Wilcoxon rank-sum test for 2 groups; based on ranks.
- PROC NPAR1WAY in SAS, sketched below.
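A sketch of the call, with the same assumed dataset and variable names; when the CLASS variable has more than two levels, the WILCOXON option gives the Kruskal-Wallis test:

proc npar1way data=runners wilcoxon;
  class group;   * menstrual group;
  var bmd;       * spine bone density;
run;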
Linear regression

Outline

1. Simple linear regression and prediction
2. Multiple linear regression and multivariate analysis
3. Dummy coding categorical predictors
Review: what is "Linear"?

Remember this: Y = mX + B, where m is the slope and B is the intercept.
Review: what’s slope?
A slope of 2 means that every 1-unit change in X
yields a 2-unit change in Y.
Example
What’s the relationship between gestation
time and birth-weight?
Birth-weight depends on gestation time (hypothetical data)

[Figure: scatterplot of Y = birth-weight (g) against X = gestation time (weeks), with a best-fit line of slope 100 g/wk.]

The best-fit line is chosen such that the sum of the squared distances of the points (the Yi's) from the line is minimized (why squared?).
Linear regression equation:

$$\text{Birth-weight (g)} = \alpha + \beta \cdot (X \text{ weeks}) + \text{random variation}$$

Here α = 0 and β = 100:

$$\text{Birth-weight (g)} = 0 + 100 \cdot (X \text{ weeks})$$
Prediction
If you know something about X, this knowledge helps you
predict something about Y.
Prediction
Baby weights at Stanford are normally distributed
with a mean value of 3400 grams.
Your "best guess" at a random baby's weight, given no information about the baby, is what? 3400 grams. But what if you have relevant information? Can you make a better guess?
Prediction

- A new baby is born that had gestated for just 30 weeks. What's your best guess at the birth-weight?
- Are you still best off guessing 3400? NO!
At 30 weeks…

[Figure: the regression line of Y = birth-weight (g) on X = gestation time (weeks), with the point (x, y) = (30, 3000) marked.]
At 30 weeks…

The babies that gestate for 30 weeks appear to center around a weight of 3000 grams. Our linear regression equation predicts that a baby of 30 weeks gestation will weigh 3000 g:

Expected weight (g) = 100 × (30 weeks) = 3000
And, if X = 20, 30, or 40…

[Figure: the distributions of baby weights (g) at gestation times of 20, 30, and 40 weeks; each distribution is centered on the regression line.]
Mean values fall on the line

- At 40 weeks, expected weight = 4000
- At 30 weeks, expected weight = 3000
- At 20 weeks, expected weight = 2000

In general, expected weight = 100 grams/week × X weeks.
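A minimal sketch of fitting this line in SAS, assuming a hypothetical dataset births with variables birthweight and gestweeks:

proc reg data=births;
  model birthweight = gestweeks;   * fits birthweight = alpha + beta*gestweeks;
  output out=preds p=predicted;    * saves the predicted mean weight at each gestation time;
run;
quit;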
Assumptions (or the fine print)

Linear regression assumes that:

1. The relationship between X and Y is linear
2. Y is distributed normally at each value of X
3. The variance of Y at every value of X is the same (homogeneity of variances)
Non-homogeneous variance

[Figure: scatterplot of Y = birth-weight (100 g) against X = gestation time (weeks) in which the spread of Y changes with X.]
A t-test is linear regression!

- A t-test is an example of linear regression with a binary predictor.
- For example, if the mean difference in spine bone density between a sample of men and a sample of women is .11 g/cm² and the women have an average value of .99 g/cm², then the t-test for the difference in the means is mathematically equivalent to the linear regression model (see the sketch below):

Spine BMD (g/cm²) = .99 (intercept) + .11 × (1 if male, 0 if female)
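To check the equivalence in SAS (a sketch; the dataset bmdstudy and its variable names are assumptions), note that the regression slope equals the t-test's mean difference and the p-values match:

proc ttest data=bmdstudy;
  class male;     * 1 if male, 0 if female;
  var spinebmd;
run;

* equivalent model: intercept = women's mean (.99), slope = difference in means (.11);
proc reg data=bmdstudy;
  model spinebmd = male;
run;
quit;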
Multiple Linear Regression

More than one predictor:

$$Y = \alpha + \beta_1 X + \beta_2 W + \beta_3 Z$$

Each regression coefficient is the amount of change in the outcome variable that would be expected per one-unit change of the predictor, if all other variables in the model were held constant.
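As a sketch (with placeholder names y, x, w, and z), such a model is fit the same way:

proc reg data=mydata;
  model y = x w z;   * each coefficient: expected change in y per one-unit
                       change in that predictor, holding the others constant;
run;
quit;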
ANOVA is linear regression!

A categorical variable with more than two groups, e.g., groups 1, 2, and 3 (mutually exclusive):

$$Y = \alpha + \beta_1 \cdot (1 \text{ if in group 2}) + \beta_2 \cdot (1 \text{ if in group 3})$$

where α is the value for group 1. This is called "dummy coding": multiple binary variables are created to represent being in each category (or not) of a categorical variable.
Example: ANOVA = linear regression

In SAS:

data stats210.runners;
  set stats210.runners;
  * dummy-code menstrual group (the remaining category is the reference);
  if mencat = 1 then amenorrheic = 1; else amenorrheic = 0;
  if mencat = 2 then oligomenorrheic = 1; else oligomenorrheic = 0;
run;

The good news is that SAS will often do this for you with a CLASS statement!
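For example, with a CLASS statement PROC GLM dummy-codes mencat internally (a sketch; the outcome variable name spinebmd is an assumption):

proc glm data=stats210.runners;
  class mencat;              * SAS creates the dummy variables for you;
  model spinebmd = mencat;   * equivalent to the dummy-coded regression above;
run;
quit;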
Functions of multivariate analysis:

- Control for confounders
- Test for interactions between predictors (effect modification)
- Improve predictions
Multiple linear regression caveats

- Multicollinearity arises when two variables that measure the same or similar things (e.g., weight and BMI) are both included in a multiple regression model; they will, in effect, cancel each other out and generally destroy your model.
- Model building and diagnostics are tricky business!
Other types of multivariate regression

- Multiple linear regression is for normally distributed outcomes
- Logistic regression is for binary outcomes
- Cox proportional hazards regression is used when time-to-event is the outcome
Reading for this week

Chapters 6-8, 10

Note: Midterm next week

One "cheat" sheet allowed for the in-class portion and one for the in-lab portion.