ANOVA and linear regression
July 15, 2004
ANOVA
For comparing means between more than 2 groups
ANOVA
(ANalysis Of VAriance)
Idea: test for a difference in means between two or more groups, for a quantitative, normally distributed outcome variable.
Just an extension of the t-test (an ANOVA with only two groups is mathematically equivalent to a t-test).
Like the t-test, ANOVA is a "parametric" test: it assumes that the outcome variable is roughly normally distributed with a mean and standard deviation (parameters) that we can estimate.
ANOVA Assumptions
Normally distributed outcome variable; homogeneity of variances across groups (like the t-test).
The “F-test”
Is the difference in the means of the groups more
than background noise (=variability within groups)?
$$F = \frac{\text{variability between groups}}{\text{variability within groups}}$$
Spine bone density vs. menstrual regularity

[Figure: spine BMD (g/cm2, roughly 0.7 to 1.2) plotted by group (amenorrheic, oligomenorrheic, eumenorrheic), annotated to show the within-group variability around each group's mean and the between-group variation in the means.]
Group means and standard
deviations
Amenorrheic group (n=11):
– Mean spine BMD = .92 g/cm2
– standard deviation = .10 g/cm2
Oligomenorrheic group (n=11)
– Mean spine BMD = .94 g/cm2
– standard deviation = .08 g/cm2
Eumenorrheic group (n=11)
– Mean spine BMD =1.06 g/cm2
– standard deviation = .11 g/cm2
The F-Test
The between-group variance reflects the deviation of each group's mean from the overall mean (.97), scaled by the size of the groups (n = 11 per group):

$$s^2_{\text{between}} = n \cdot s^2_{\bar{x}} = 11 \times \frac{(.92 - .97)^2 + (.94 - .97)^2 + (1.06 - .97)^2}{3 - 1} = .063$$

The within-group variance is the average amount of variation within groups, i.e., the average of each group's variance:

$$s^2_{\text{within}} = \overline{s^2} = \frac{1}{3}(.10^2 + .08^2 + .11^2) = .0095$$

$$F_{2,30} = \frac{s^2_{\text{between}}}{s^2_{\text{within}}} = \frac{.063}{.0095} = 6.6$$

A large F value indicates that the between-group variation exceeds the within-group variation (= the background noise).
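This same calculation can be run in SAS; a minimal sketch (the dataset name bmd and the variable names spine and mencat are hypothetical, chosen to match this example):

proc anova data=bmd;        * one-way ANOVA; appropriate here because the design is balanced (n=11 per group);
  class mencat;             * grouping variable: 1, 2, or 3;
  model spine = mencat;     * spine BMD as the quantitative outcome;
run;

The output includes the ANOVA table and the F statistic computed above.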
The F-distribution
The F-distribution is a continuous probability distribution that depends on two parameters, n and m (the numerator and denominator degrees of freedom, respectively).
The F-distribution
A ratio of sample variances follows an F-distribution:

$$\frac{s^2_{\text{between}}}{s^2_{\text{within}}} = F \sim F_{n,m}$$

The F-test tests the hypothesis that two sample variances are equal; F will be close to 1 if the sample variances are equal.

$$H_0: \sigma^2_{\text{between}} = \sigma^2_{\text{within}} \qquad H_a: \sigma^2_{\text{between}} > \sigma^2_{\text{within}}$$
ANOVA Table

| Source of variation | d.f. | Sum of squares | Mean sum of squares | F-statistic | p-value |
|---|---|---|---|---|---|
| Between (k groups) | k − 1 | SSB (sum of squared deviations of group means from the grand mean) | SSB/(k − 1) | [SSB/(k − 1)] / [SSW/(nk − k)] | Go to F(k−1, nk−k) chart |
| Within (n individuals per group) | nk − k | SSW (sum of squared deviations of observations from their group mean) | s² = SSW/(nk − k) | | |
| Total variation | nk − 1 | TSS (sum of squared deviations of observations from the grand mean) | | | |

TSS = SSB + SSW
ANOVA = t-test

For two groups of size n each, the ANOVA table reduces to the t-test:

| Source of variation | d.f. | Sum of squares | Mean sum of squares | F-statistic | p-value |
|---|---|---|---|---|---|
| Between (2 groups) | 1 | SSB (squared difference in means) | squared difference in means | $F_{1,2n-2} = \dfrac{(\bar{X} - \bar{Y})^2}{s_p^2 \cdot \frac{2}{n}} = (t_{2n-2})^2$ | Go to F(1, 2n−2) chart; notice the values are just $(t_{2n-2})^2$ |
| Within | 2n − 2 | SSW (equivalent to the numerator of the pooled variance) | pooled variance $s_p^2$ | | |
| Total variation | 2n − 1 | TSS | | | |
ANOVA summary
A statistically significant ANOVA (F-test)
only tells you that at least two of the groups
differ, but not which ones differ.
Determining which groups differ (when it’s
unclear) requires more sophisticated
analyses to correct for the problem of
multiple comparisons…
Question: Why not just do 3 pairwise t-tests?
Answer: because, at an error rate of 5% per test, you have an overall chance of up to 1 − (.95)³ ≈ 14% of making a type-I error (if all 3 comparisons were independent).
If you wanted to compare 6 groups, you'd have to do 6C2 = 15 pairwise t-tests, which would give you a high chance of finding something significant just by chance (if all tests were independent with a type-I error rate of 5% each); the probability of at least one type-I error is 1 − (.95)¹⁵ ≈ 54%.
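These probabilities are easy to verify with data-step arithmetic in SAS (a minimal sketch):

data _null_;
  p3  = 1 - 0.95**3;     * 3 independent tests: about 0.14;
  p15 = 1 - 0.95**15;    * 15 independent tests: about 0.54;
  put p3= p15=;          * write the results to the log;
run;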
Multiple comparisons
With 18 independent comparisons, we have a 1 − (.95)¹⁸ ≈ 60% chance of at least one false positive.
Multiple comparisons
With 18 independent comparisons, we expect about one false positive (18 × .05 = 0.9).
Correction for multiple comparisons
How to correct for multiple comparisons post hoc (a SAS sketch follows the list)…
Bonferroni's correction (adjusts by the most conservative amount; assuming all tests are independent, divide the significance cutoff, or equivalently multiply each p-value, by the number of tests; e.g., for 15 tests, use .05/15 ≈ .0033 as the cutoff)
Holm/Hochberg (gives a p-cutoff beyond which results are not significant)
Tukey's (adjusts p)
Scheffé's (adjusts p)
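In SAS, several of these corrections can be requested on the MEANS statement of proc glm; a minimal sketch, reusing the hypothetical bmd dataset from the ANOVA example:

proc glm data=bmd;
  class mencat;
  model spine = mencat;
  means mencat / bon tukey scheffe;   * pairwise comparisons with Bonferroni, Tukey, and Scheffe adjustments;
run;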
Non-parametric ANOVA
Kruskal-Wallis one-way ANOVA
Extension of the Wilcoxon Rank-Sum test
for 2 groups; based on ranks
Proc NPAR1WAY in SAS
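A minimal sketch of the Kruskal-Wallis test with Proc NPAR1WAY (again using the hypothetical bmd dataset; the WILCOXON option yields the Kruskal-Wallis test when the class variable has more than two levels):

proc npar1way data=bmd wilcoxon;
  class mencat;   * three menstrual-regularity groups;
  var spine;      * outcome analyzed on ranks;
run;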
Linear regression
Outline
1. Simple linear regression and prediction
2. Multiple linear regression and
multivariate analysis
3. Dummy coding categorical predictors
Review: what is "Linear"?
Remember this: Y = mX + B? Here m is the slope and B is the intercept.
Review: what’s slope?
A slope of 2 means that every 1-unit change in X
yields a 2-unit change in Y.
Example
What’s the relationship between gestation
time and birth-weight?
Birth-weight depends on gestation time (hypothetical data)

[Figure: scatter plot of Y = birth-weight (g) against X = gestation time (weeks), with a best-fit line of slope 100 g/wk.]

The best-fit line is chosen such that the sum of the squared (why squared?) distances of the points (the Yi's) from the line is minimized.
Linear regression equation:
Birth-weight (g) = α + β × (gestation time, weeks) + random variation
Birth-weight (g) = 0 + 100 × (X weeks)
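Fitting this line by least squares in SAS; a minimal sketch (the dataset births and the variables birthweight and gestation are hypothetical):

proc reg data=births;
  model birthweight = gestation;   * estimates the intercept and the slope (g per week);
run;
quit;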
Prediction
If you know something about X, this knowledge helps you
predict something about Y.
Prediction
Baby weights at Stanford are normally distributed
with a mean value of 3400 grams.
Your “Best guess” at a random baby’s weight, given
no information about the baby, is what?
3400 grams
But, what if you have relevant information? Can you
make a better guess?
Prediction
A new baby is born that had gestated for
just 30 weeks. What’s your best guess at
the birth-weight?
Are you still best off guessing 3400?
NO!
At 30 weeks…

[Figure: on the regression line of Y = birth-weight (g) vs. X = gestation time (weeks), the point at X = 30 weeks is (x, y) = (30, 3000).]
At 30 weeks…
The babies that gestate for 30 weeks appear
to center around a weight of 3000 grams.
Our linear regression equation predicts that
a baby of 30 weeks gestation will weigh
3000g:
Expected weight (g) = 100 × 30 weeks = 3000 g
And, if X = 20, 30, or 40…

[Figure: baby weights (g) plotted at gestation times of 20, 30, and 40 weeks; at each gestation time the weights form a distribution, and the mean of each distribution falls on the regression line.]
Mean values fall on the line
At 40 weeks, expected weight = 4000 g
At 30 weeks, expected weight = 3000 g
At 20 weeks, expected weight = 2000 g
In general,
Expected weight = 100 g/week × (X weeks)
Assumptions (or the fine print)
Linear regression assumes that…
– 1. The relationship between X and Y is linear
– 2. Y is distributed normally at each value of X
– 3. The variance of Y at every value of X is the
same (homogeneity of variances)
Non-homogeneous variance

[Figure: Y = birth-weight (100 g) vs. X = gestation time (weeks); the spread of Y is not constant across values of X, violating the equal-variance assumption.]
A t-test is linear regression!
A t-test is an example of linear regression with a binary predictor.
For example, if the mean difference in spine bone density between a sample of men and a sample of women is .11 g/cm2, and the women have an average value of .99 g/cm2, then the t-test for the difference in the means is mathematically equivalent to the linear regression model:
Spine BMD (g/cm2) = .99 (intercept) + .11 × (1 if male, 0 if female)
Multiple Linear Regression
More than one predictor…
ŷ = α + β₁*X + β₂*W + β₃*Z
Each regression coefficient is the amount of change
in the outcome variable that would be expected
per one-unit change of the predictor, if all other
variables in the model were held constant.
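As a minimal sketch in SAS (the dataset mydata, outcome y, and predictors x, w, and z are hypothetical):

proc reg data=mydata;
  model y = x w z;   * each coefficient is adjusted for the other predictors in the model;
run;
quit;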
ANOVA is linear regression!
A categorical variable with more than two groups:
E.g.: groups 1, 2, and 3 (mutually exclusive)
ŷ = α (= value for group 1) + β₁*(1 if in group 2) + β₂*(1 if in group 3)
This is called “dummy coding”—where multiple
binary variables are created to represent being in
each category (or not) of a categorical variable
Example: ANOVA = linear regression
In SAS:
data stats210.runners;
  set stats210.runners;
  * dummy-code the 3-level variable mencat; the remaining category (mencat=3) serves as the reference group;
  if mencat=1 then amenorrheic=1; else amenorrheic=0;
  if mencat=2 then oligomenorrheic=1; else oligomenorrheic=0;
run;
The good news is that SAS will often do this for you with a
class statement!
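For instance, a minimal sketch with proc glm (the outcome variable spine is hypothetical; mencat matches the data step above):

proc glm data=stats210.runners;
  class mencat;           * SAS builds the dummy variables internally;
  model spine = mencat;   * one-way ANOVA fit as a linear model;
run;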
Functions of multivariate
analysis:
Control for confounders
Test for interactions between predictors
(effect modification)
Improve predictions
Multiple linear regression
caveats
Multicollinearity arises when two variables that measure the same thing or similar things (e.g., weight and BMI) are both included in a multiple regression model; they will, in effect, cancel each other out (the coefficients become unstable, with inflated standard errors) and generally destroy your model.
Model building and diagnostics are tricky
business!
Other types of multivariate
regression
Multiple linear regression is for normally
distributed outcomes
Logistic regression is for binary outcomes
Cox proportional hazards regression is used when
time-to-event is the outcome
Reading for this week
Chapters 6-8, 10
Note: Midterm next week
One “cheat” sheet allowed for the in-class portion and one for the in-lab portion