Transcript Topic 8

Topic 8 – One-Way ANOVA
Single Factor Analysis of Variance
Reading: 17.1, 17.2, & 17.5
Skim: 12.3, 17.3, 17.4
Overview

Categorical Variables (Factors)

Fixed vs. Random Effects

Review: Two-sample T-test

ANOVA as a generalization of the two-sample T-test

Cell-Means and Factor-Effects ANOVA
Models (same model, different form)
Terminology: Factors & Levels


The term factor is generally used to refer to
a categorical predictor variable.

Blood Type

Gender

Drug Treatment

Other Examples?
The term levels is used to refer to the
specific categories for a factor.

A / B / AB / O (could also consider +/-)

Male / Female
Factors: Fixed or Random?

A factor is fixed if the levels under
consideration are the only ones of interest.

The levels of the factor are selected by a
non-random process AND are the only levels
of interest.

For the time being, all factors that we will
consider will be fixed.

Examples?
Factors: Fixed or Random? (2)

A factor is random if the levels under
consideration may be regarded as a sample
from a larger population.

Not all levels of interest are included in the
study – only a random sample.

We want inferences to be applicable to the
entire (larger) population of levels.

Examples?

Analysis is a little more complicated; we’ll
save this topic for near the end of the course.
Example: Random or Fixed?
To study the effect of diet on cattle, an experimenter
randomly (and equally) allocates 50 cows to 5 diets (a
control and 4 experimental diets). After 1 year, the
cows are butchered and the amount of good meat (in
pounds) is measured.

Response = ______________

Cow = _______ Factor

Diet = _______ Factor
Notation


In general, we label our factors A, B, C, etc.

Factor A has levels i = 1, 2, 3, ..., a

Factor B has levels j = 1, 2, 3, ..., b

Factor C has levels k = 1, 2, 3, ..., c
More on notation later; remember for now
we are considering single factor ANOVA, so
we will have only a “Factor A”.
Comparing Groups
Suppose I want to compare
heights between men and women.
How would I do this?
Notation for Two-Sample Settings

Suppose an SRS (simple random sample) of size n1 is
selected from the 1st population, and another SRS of size n2
is selected from the 2nd population.
Population   Sample size   Sample mean     Sample standard deviation
1            $n_1$         $\bar{y}_1$     $s_1$
2            $n_2$         $\bar{y}_2$     $s_2$
Estimating Differences

A natural estimator of the difference $\mu_1 - \mu_2$
is the difference between the sample means:
$\bar{y}_1 - \bar{y}_2$

If we assume that both populations are normally
distributed (or CLT applies) then both sample
means and their difference will be normally
distributed as well.

Because we are estimating standard deviations, a
confidence interval for the difference in means
uses the T-distribution.
CI for Difference

If variances are unknown, then a 95% confidence
interval for the difference in means is given by
$$(\bar{y}_1 - \bar{y}_2) \pm t_{crit} \sqrt{s^2_{pooled} \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}$$
The critical value is $t_{crit} = t_{0.975,\,df}$. The degrees of
freedom is $n_1 + n_2 - 2$.
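The pooled variance in the interval is not defined on the slide. Assuming both populations share a common variance, the standard definition is a weighted average of the two sample variances:

```latex
s^2_{pooled} = \frac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2}{n_1 + n_2 - 2}
```

The weights $n_1 - 1$ and $n_2 - 1$ add up to the degrees of freedom $n_1 + n_2 - 2$ used for the critical value.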
Test for Difference = 0

Can also be viewed as a hypothesis test

Test statistic for testing whether the
difference is zero:
$$T = \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{s^2_{pooled} \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}$$
Compare to critical value used in CI.
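As a rough sketch of the calculation (in Python rather than SAS, standard library only), the pooled test statistic can be computed as below. The function name `pooled_t` and the height values are made up for illustration; they are not course data.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(sample1, sample2):
    """Return (T, df) for the pooled two-sample t-test of equal means."""
    n1, n2 = len(sample1), len(sample2)
    df = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances.
    s2_pooled = ((n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)) / df
    se = sqrt(s2_pooled * (1 / n1 + 1 / n2))
    return (mean(sample1) - mean(sample2)) / se, df

# Hypothetical heights in inches (invented for illustration only).
men = [70.0, 72.0, 68.0, 71.0, 69.0]
women = [64.0, 66.0, 63.0, 65.0, 67.0]
T, df = pooled_t(men, women)
print(f"T = {T:.3f} on {df} df")
```

For these made-up values T = 5.0 on 8 df. A t table gives a two-sided 5% critical value of about 2.306 for 8 df, so this example would reject the hypothesis of no difference.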
Conclusions

If the test statistic is of larger magnitude
(ignore sign) than the critical value, we
reject the hypothesis

There is a significant difference between the
two groups


The same conclusion results if the CI
doesn’t contain zero.
If the statistic is smaller (CI does contain
zero), we fail to reject the hypothesis

Fail to show a difference between the two
groups
Comparison of Several Groups
Suppose instead of two groups,
we have “a” groups that we wish to
compare (where a > 2).
Note: In Chapter 17, the textbook defines the number of groups
as “k”. Remember this is just a letter, and the letter we use
really has nothing to do with anything in particular. So I’m
using a to correspond (consistently) to Factor A.
Multiple treatment model

With a groups (treatments), we could
do $\tfrac{1}{2} a(a-1)$ two-sample t-tests. But...

This does not test the equality of all means at
once: $H_0: \mu_1 = \mu_2 = \dots = \mu_a$

Multiple tests means we have greater chance of
making Type I errors (a Bonferroni correction
can get expensive because of the large number
of tests).

We usually expect variances to be the same
across groups, but it isn’t clear how we should
estimate variance with more than two samples.
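To get a feel for how quickly the number of pairwise tests grows, here is a small Python sketch (the function name `pairwise_burden` is hypothetical). It also shows the Bonferroni per-test level and, under the simplifying assumption that the tests are independent, the chance of at least one Type I error when no correction is made.

```python
from math import comb

def pairwise_burden(a, alpha=0.05):
    """Count pairwise tests for a groups and the Bonferroni per-test level."""
    m = comb(a, 2)                  # a(a-1)/2 pairs of groups
    per_test_alpha = alpha / m      # Bonferroni: split alpha across the m tests
    # Familywise error rate if every test ran at level alpha and the tests
    # were independent (an assumption; pairwise t-tests share data).
    fwer_if_independent = 1 - (1 - alpha) ** m
    return m, per_test_alpha, fwer_if_independent

for a in (3, 4, 10):
    m, per_test, fwer = pairwise_burden(a)
    print(f"a={a}: {m} tests, per-test alpha={per_test:.4f}, uncorrected FWER~{fwer:.3f}")
```

With 10 groups there are already 45 pairwise tests, which is why a single overall test is attractive.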
Multiple treatment model (2)

Analysis of Variance (ANOVA) models
provide a more efficient way to compare
multiple groups. For example, in a single
factor ANOVA,

The Model (or ANOVA) F-test will test the
equality of all group means at the same time.

There are methods of doing pairwise
comparisons that are much more efficient than
Bonferroni.

All observations (from all groups) are used to
estimate the overall variance (by MSE).
Three Ways to View ANOVA

Views observations in terms of their group
means: the cell means model

Views observations as the sum of an overall
mean and a deviation from that mean related to
the particular group to which the observation
belongs (plus random error): the factor effects model

As regression, using indicator variables.
ANOVA Model
Cell Means Model
ANOVA

ANOVA is generally viewed as an
extension of the T-test but used for
comparisons of three or more population
means.

These populations are denoted by the
levels of our factor.


Only one variable, but has 3+ levels or groups
Hence we call the means of these levels
factor level means or simply cell means.
Cell Means Model

Basic ANOVA Model is:
$$Y_{ij} = \mu_i + \varepsilon_{ij}$$
where $\varepsilon_{ij} \sim N(0, \sigma^2)$

Notation:

“i” subscript indicates the level of the factor:
$i = 1, 2, 3, \dots, a$

“j” subscript indicates observation number within
the group: $j = 1, 2, 3, \dots, n_i$
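One way to make the cell means model concrete is to simulate from it. The Python sketch below is illustrative only: the function name `simulate_cell_means`, the cell means, and the error standard deviation are all hypothetical choices, not values from the course.

```python
import random

def simulate_cell_means(mu, n, sigma, seed=0):
    """Draw n observations per level from Y_ij = mu_i + eps_ij, eps_ij ~ N(0, sigma^2)."""
    rng = random.Random(seed)   # fixed seed so the draw is reproducible
    return {level: [m + rng.gauss(0, sigma) for _ in range(n)]
            for level, m in mu.items()}

# Hypothetical cell means and error SD, chosen only for illustration.
mu = {"level 1": 30.0, "level 2": 35.0, "level 3": 37.0}
data = simulate_cell_means(mu, n=5, sigma=2.0)
for level, ys in data.items():
    print(level, [round(y, 1) for y in ys])
```

Every observation in a level shares the same mean; only the random error differs from one observation to the next.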
Cell Sizes

For the time being, we will assume that all
the cell sizes are the same:
$n_i = n$ for all $i$

The total sample size will be denoted
$$N = \sum_{i=1}^{a} n_i = an$$
(when cell sizes are all $n$)
Assumptions for fixed effects

Random samples have been selected for
each level of the factor. All observations are
independent.

Response variable is normally distributed for
each population (level) and the population
variances are the same.

Hence, independence, normality and
constant variance

What happened to linearity?
Robustness

ANOVA procedures are generally robust to
minor departures from the assumptions (i.e.
minor deviations from the assumptions will
not affect the performance of the
procedure).

For major departures, transformations of the
response variable [e.g. Log(Y)] may help.

Transforming the factor (i.e., the predictor) in ANOVA
doesn’t help because it’s categorical.
Components of Variation

Variation between groups gets “explained”
by allowing the groups to have different
means.

We know this as SSM, SSR, or now SSA!

Variation within groups is unexplained.

We know this as SSE (it stays the same).

The ratio F = MSM / MSE forms the basis
for testing the hypothesis that all group
means are the same (or F = MSA / MSE).
Variation: Between vs. Within

A convenient way to view the SS

SSA is called the “between” SS because it
represents variation between the different
groups. It is determined by the squared
differences between group means and the grand
(overall) mean.

SSE is called the “within” SS because it
represents variation within groups. It is
determined by the squared differences of
observations from their group means.
Quick Comment on Notation

DOT indicates “sum”

BAR indicates “average” or “divide by
cell/sample size”

$\bar{Y}_{\cdot\cdot}$ is the mean for all observations

$\bar{Y}_{i\cdot}$ is the mean for the observations in
Level i of Factor A.
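Pictorial Representation
[Figure: observations in GROUP 1, GROUP 2, and GROUP 3 plotted with their group means $\bar{Y}_{i\cdot}$ and the grand mean $\bar{Y}_{\cdot\cdot}$. Braces split the total deviation $Y_{ij} - \bar{Y}_{\cdot\cdot}$ into $Y_{ij} - \bar{Y}_{i\cdot}$ and $\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot}$.]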
SS Breakdown (Algebraic)

Break down difference between observation
and grand mean into two parts:
$$\underbrace{(Y_{ij} - \bar{Y}_{\cdot\cdot})}_{\text{Total Deviation}} = \underbrace{(\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})}_{\text{Deviation of Estimated Factor Level Mean Around Grand Mean}} + \underbrace{(Y_{ij} - \bar{Y}_{i\cdot})}_{\text{Deviation Around Estimated Factor Level Mean}}$$
The first term on the right is the BETWEEN GROUPS part; the second is the WITHIN GROUPS part.
Components of Variation (2)

Of course the individual components would
sum to zero, so we must square them. It
turns out that all cross-product terms cancel,
and we have:
$$\underbrace{\sum_{i,j} (Y_{ij} - \bar{Y}_{\cdot\cdot})^2}_{SST} = \underbrace{\sum_{i,j} (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2}_{SSA\ (\text{BETWEEN GROUPS})} + \underbrace{\sum_{i,j} (Y_{ij} - \bar{Y}_{i\cdot})^2}_{SSE\ (\text{WITHIN GROUPS})}$$
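The cross-product term cancels because deviations from a group mean sum to zero within each group. A short derivation in the same dot/bar notation:

```latex
\sum_{i=1}^{a} \sum_{j=1}^{n_i} (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})(Y_{ij} - \bar{Y}_{i\cdot})
  = \sum_{i=1}^{a} (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot}) \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_{i\cdot})
  = \sum_{i=1}^{a} (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot}) \cdot 0
  = 0
```

The inner sum is zero because $\sum_j Y_{ij} = n_i \bar{Y}_{i\cdot}$ by the definition of the group mean.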
ANOVA Table
Source     SS    df     MS    F
Factor A   SSA   a–1    MSA   MSA / MSE
Error      SSE   N–a    MSE
Total      SST   N–1
Model F Test (Cell Means)

Null Hypothesis
$H_0: \mu_1 = \mu_2 = \dots = \mu_a$

Alternative Hypothesis
$H_a$: There exists some pair of
population means not equal.
Conclusion

If we reject the null hypothesis, we have
shown differences between groups (levels)


Remember it does not tell us which groups are
different. Only that at least one group is different
from at least one other group!
If we fail to reject the null hypothesis, we
have failed to show any significant
differences with the ANOVA F test

Unfortunately sometimes if we look a little closer
(we’ll do this later) we still might find some
differences!
Calculations: A Brief Look

We’ll consider these for only a balanced
design (cell sizes all the same n).

The purpose in doing this is not that you
memorize formulas, but that you further your
conceptual understanding of the sums of
squares.
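SS Calculations (Balanced)
$$SSA = \sum_{i=1}^{a} \sum_{j=1}^{n} (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2 = n \sum_{i=1}^{a} (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2$$
$$SSE = \sum_{i=1}^{a} \sum_{j=1}^{n} (Y_{ij} - \bar{Y}_{i\cdot})^2$$
$$SST = \sum_{i=1}^{a} \sum_{j=1}^{n} (Y_{ij} - \bar{Y}_{\cdot\cdot})^2$$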
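These three sums translate almost line for line into code. The Python sketch below is illustrative only: the function name `ss_breakdown` and the toy data are hypothetical. It weights each group by its own size, which reduces to the balanced formula when every cell has n observations.

```python
from statistics import mean

def ss_breakdown(groups):
    """Return (SSA, SSE, SST) for a dict mapping level -> list of observations."""
    all_obs = [y for ys in groups.values() for y in ys]
    grand = mean(all_obs)                                   # grand mean
    # Between: squared distance of each group mean from the grand mean.
    ssa = sum(len(ys) * (mean(ys) - grand) ** 2 for ys in groups.values())
    # Within: squared distance of each observation from its own group mean.
    sse = sum((y - mean(ys)) ** 2 for ys in groups.values() for y in ys)
    # Total: squared distance of each observation from the grand mean.
    sst = sum((y - grand) ** 2 for y in all_obs)
    return ssa, sse, sst

# Hypothetical balanced data (a = 3 levels, n = 2 per level), for illustration only.
toy = {"g1": [1.0, 3.0], "g2": [4.0, 6.0], "g3": [8.0, 8.0]}
ssa, sse, sst = ss_breakdown(toy)
print(ssa, sse, sst)
```

For this toy data the function gives SSA = 36, SSE = 4, and SST = 40, so SSA + SSE = SST.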
Blood Type Example (1)

Suppose we have 3 observations of a
certain response variable for each blood
type
A B O AB
28 32 21 32
27 34 22 32
28 35 25 34

Want to construct the ANOVA table
Blood Type Example (2)

We can compute the sample means using
SAS:
proc means data=bloodtype;
  class type;    /* one summary row per blood type */
  var resp;
  output out=means mean=YBAR;
run;
proc print data=means; run;

Obs   type   _TYPE_   _FREQ_   YBAR
1            0        12       29.1667
2     A      1        3        27.6667
3     AB     1        3        32.6667
4     B      1        3        33.6667
5     O      1        3        22.6667
Blood Type Example (3)

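SSA (Between)
$$SSA = n \sum_{i=1}^{a} (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2$$
$$= 3 \left[ (\bar{Y}_A - \bar{Y}_{\cdot\cdot})^2 + (\bar{Y}_{AB} - \bar{Y}_{\cdot\cdot})^2 + (\bar{Y}_B - \bar{Y}_{\cdot\cdot})^2 + (\bar{Y}_O - \bar{Y}_{\cdot\cdot})^2 \right]$$
$$= 3 \left[ (-1.5)^2 + (3.5)^2 + (4.5)^2 + (-6.5)^2 \right]$$
$$= 231$$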

At this point, we have a choice – to calculate
SSE or SST.
Blood Type Example (4)
$$SSE = \sum_{i=1}^{a} \sum_{j=1}^{n} (Y_{ij} - \bar{Y}_{i\cdot})^2$$
$$= (Y_{A1} - \bar{Y}_A)^2 + (Y_{A2} - \bar{Y}_A)^2 + \dots + (Y_{AB3} - \bar{Y}_{AB})^2$$
$$= (28 - 27.667)^2 + (27 - 27.667)^2 + \dots + (34 - 32.667)^2$$
$$= 16.67$$
$$SST = SSA + SSE = 231 + 16.67 = 247.67$$
Blood Type Example (5)

DF: 4 – 1 = 3 for Factor A

DF: N – 1 = 11 for Total

DF: 11 – 3 = 8 for Error

Mean Squares:
MSA = 231 / 3 = 77
MSE = 16.67 / 8 = 2.08
Blood Type Example (6)

ANOVA Table
Source    SS       df   MS      F
Between   231.00   3    77.00   36.96
Within    16.67    8    2.083
Total     247.67   11
F-test is significant, and so we conclude that there
is some difference among the means (we just
don’t know exactly which means are different).
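The whole table can be checked numerically. The sketch below uses Python rather than SAS and the blood type data from the example; the function name `anova_table` is hypothetical, and the p-value is left out because the Python standard library has no F distribution.

```python
from statistics import mean

# Blood type data from the example (3 observations per type).
blood = {
    "A":  [28, 27, 28],
    "B":  [32, 34, 35],
    "O":  [21, 22, 25],
    "AB": [32, 32, 34],
}

def anova_table(groups):
    """One-way ANOVA quantities for a dict mapping level -> observations."""
    all_obs = [y for ys in groups.values() for y in ys]
    grand = mean(all_obs)
    a, N = len(groups), len(all_obs)
    ssa = sum(len(ys) * (mean(ys) - grand) ** 2 for ys in groups.values())
    sse = sum((y - mean(ys)) ** 2 for ys in groups.values() for y in ys)
    msa, mse = ssa / (a - 1), sse / (N - a)
    return {"SSA": ssa, "SSE": sse, "SST": ssa + sse,
            "dfA": a - 1, "dfE": N - a, "MSA": msa, "MSE": mse, "F": msa / mse}

table = anova_table(blood)
for name, value in table.items():
    print(f"{name}: {value:.4f}")
```

The results agree with the hand calculation: SSA = 231, SSE of about 16.67, and F of about 36.96.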
SAS Coding

Will use PROC GLM with an important
addition: CLASS statement

CLASS statement identifies categorical
variables for SAS

Note that failure to use CLASS statement for
categorical variable will result in:

SYNTAX ERROR if character variable

INAPPROPRIATE ANALYSIS if class levels
are numeric
Blood Type Example (SAS)
proc glm data=bloodtype;
  class type;                        /* treat type as categorical */
  model resp=type;
  output out=diag p=pred r=resid;    /* save fitted values and residuals */
run;

Source   DF   Sum of Squares   Mean Square   F Value   Pr > F
Model    3    231.0000000      77.0000000    36.96     <.0001
Error    8    16.6666667       2.0833333
Total    11   247.6666667

R-Square   Coeff Var   Root MSE   resp Mean
0.932705   4.948717    1.443376   29.16667
Residual Diagnostics

Very similar to what we did in regression

Normality plot is the same – keep in mind that
most of the tests in ANOVA are robust to minor
violations of normality (thanks to the CLT).

In constant variance plot, still may see
megaphone shape in RESID vs. PRED if nonconstant variance is a problem.

In plots against the factor levels (commonly
used), would simply see differing vertical
spreads (not megaphone, because generally the
labels on the horizontal axis are not “ordered”)
[Figure: Blood Type (QQ Plot)]
[Figure: Blood Type (Residual Plot)]
Model Estimates

In SAS, using /solution as an option in the
MODEL statement of PROC GLM, we can
get the parameter estimates for our model.
Parameter    Estimate         Standard Error   t Value   Pr > |t|
Intercept    22.66666667 B    0.83333333       27.20     <.0001
type A        5.00000000 B    1.17851130        4.24     0.0028
type AB      10.00000000 B    1.17851130        8.49     <.0001
type B       11.00000000 B    1.17851130        9.33     <.0001
type O        0.00000000 B    .                 .        .

NOTE: The X'X matrix has been found to be singular, and a generalized inverse
was used to solve the normal equations. Terms whose estimates are
followed by the letter 'B' are not uniquely estimable.

Unfortunately these are not the cell means!
Cell or Group Means

To get each cell mean $\bar{Y}_{i\cdot}$, just add the
intercept to each parameter estimate:
$\bar{Y}_A = 22.67 + 5 = 27.67$
$\bar{Y}_B = 22.67 + 11 = 33.67$
$\bar{Y}_{AB} = 22.67 + 10 = 32.67$
$\bar{Y}_O = 22.67 + 0 = 22.67$
Model Estimates

The reason for this is that there are infinitely
many ways to write down the model for
ANOVA.

SAS tells us this by flagging ALL estimates
with a “B” (not uniquely estimable). So what is SAS actually doing?
Parameter    Estimate         Standard Error   t Value   Pr > |t|
Intercept    22.66666667 B    0.83333333       27.20     <.0001
type A        5.00000000 B    1.17851130        4.24     0.0028
type AB      10.00000000 B    1.17851130        8.49     <.0001
type B       11.00000000 B    1.17851130        9.33     <.0001
type O        0.00000000 B    .                 .        .

NOTE: The X'X matrix has been found to be singular, and a generalized inverse
was used to solve the normal equations. Terms whose estimates are
followed by the letter 'B' are not uniquely estimable.
ANOVA Model
Factor Effects Model
(Another convenient view)
A simple example

Three groups:
$\mu_1 = 30$, $\mu_2 = 35$, $\mu_3 = 37$
Grand Mean: $\mu = 34$
$\alpha_1 = 30 - 34 = -4$
$\alpha_2 = 35 - 34 = 1$
$\alpha_3 = 37 - 34 = 3$
Factor Effects Model

As an alternative to viewing each observation
as a deviation from the cell mean, we may
consider observations as deviations from
the grand (or overall) mean.

Part of that deviation is explained by the cell
(or group). We call that part $\alpha_i$, the factor
level effect.

We essentially break $\mu_i$ from the cell-means
model into two pieces: $\mu_i = \mu + \alpha_i$
Factor Effects Model

$$Y_{ij} = \mu + \alpha_i + \varepsilon_{ij}, \quad i = 1, 2, \dots, a, \quad j = 1, 2, \dots, n_i$$

$\mu$ is the grand (or overall) mean.

$\alpha_i$ is the ith treatment effect (difference
between group mean and $\mu$).

$\varepsilon_{ij} \sim N(0, \sigma^2)$ is the error component.

$\mu_i = \mu + \alpha_i$ is the ith treatment mean.

Restriction $\sum_i \alpha_i = 0$ is made.
Why the Restriction?


Note that estimating $(\mu, \alpha_1, \alpha_2, \dots, \alpha_a)$
would require one more estimate than in the
cell means model $(\mu_1, \mu_2, \dots, \mu_a)$.
So for the models to be identical, we must
add a constraint.

Convenient: $\sum_i \alpha_i = 0$ makes $\mu$ the grand (or
overall) mean.

What exactly does SAS do?
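One way to see why a restriction is needed: without it, many different parameter sets give exactly the same cell means. For any constant $c$:

```latex
\mu_i = \mu + \alpha_i = (\mu + c) + (\alpha_i - c), \qquad i = 1, \dots, a
```

So the data cannot tell $(\mu, \alpha_1, \dots, \alpha_a)$ apart from $(\mu + c, \alpha_1 - c, \dots, \alpha_a - c)$. Fixing $\sum_i \alpha_i = 0$ picks out one value of $c$; summing $\mu_i = \mu + \alpha_i$ over $i$ then gives $\mu = \frac{1}{a} \sum_i \mu_i$.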
Restriction made by SAS
Parameter    Estimate         Standard Error   t Value   Pr > |t|
Intercept    22.66666667 B    0.83333333       27.20     <.0001
type A        5.00000000 B    1.17851130        4.24     0.0028
type AB      10.00000000 B    1.17851130        8.49     <.0001
type B       11.00000000 B    1.17851130        9.33     <.0001
type O        0.00000000 B    .                 .        .

NOTE: The X'X matrix has been found to be singular, and a generalized inverse
was used to solve the normal equations. Terms whose estimates are
followed by the letter 'B' are not uniquely estimable.

Last level (alphabetically!!!) is set to ZERO.

This means the intercept (estimate for $\mu_O$)
will represent the mean for the “last” group.
So they are not exactly the factor effects,
but can we recover factor effects from this?
Estimating Factor Effects

We previously calculated the cell means
(this is the first step):
$\bar{Y}_A = 22.67 + 5 = 27.67$
$\bar{Y}_B = 22.67 + 11 = 33.67$
$\bar{Y}_{AB} = 22.67 + 10 = 32.67$
$\bar{Y}_O = 22.67 + 0 = 22.67$
Estimating Factor Effects (2)

The overall mean will be the weighted
average of the group means (in this case,
it’s a straightforward average since the cell
sizes are identical):
$$\bar{Y}_{\cdot\cdot} = \frac{3(27.67) + 3(33.67) + 3(32.67) + 3(22.67)}{12} = 29.167$$
Estimating Factor Effects (3)

The factor effects are the differences
between the group and overall means:
$\hat{\alpha}_A = 27.67 - 29.17 = -1.5$
$\hat{\alpha}_B = 33.67 - 29.17 = 4.5$
$\hat{\alpha}_{AB} = 32.67 - 29.17 = 3.5$
$\hat{\alpha}_O = 22.67 - 29.17 = -6.5$

Note: Sum of these is ZERO always.
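The three steps (cell means, overall mean, factor effects) can be sketched in Python, starting from the SAS parameter estimates for the blood type example. This is an illustration rather than anything SAS does internally, and the variable names are my own.

```python
# Estimates copied from the SAS /solution output (type O is the reference level).
intercept = 22.66666667
offsets = {"A": 5.0, "AB": 10.0, "B": 11.0, "O": 0.0}
n = {"A": 3, "AB": 3, "B": 3, "O": 3}           # cell sizes

# Step 1: cell mean = intercept + offset for that level.
cell_means = {lvl: intercept + d for lvl, d in offsets.items()}

# Step 2: overall mean = cell means weighted by cell size.
N = sum(n.values())
grand_mean = sum(n[lvl] * m for lvl, m in cell_means.items()) / N

# Step 3: factor effect = cell mean minus overall mean.
effects = {lvl: m - grand_mean for lvl, m in cell_means.items()}

for lvl in cell_means:
    print(f"{lvl:>2}: mean = {cell_means[lvl]:.2f}, effect = {effects[lvl]:+.2f}")
print(f"grand mean = {grand_mean:.3f}")
```

In this balanced example the effects come out to -1.5, 3.5, 4.5, and -6.5 for A, AB, B, and O, and they sum to zero.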
Estimates / Tests


Alphas are estimated by $\hat{\alpha}_i = \bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot}$

For the model F test: Testing the
hypothesis that all the means are the same
is equivalent to testing
$H_0: \alpha_1 = \alpha_2 = \dots = \alpha_a = 0$
against the alternative
$H_a: \alpha_i \neq 0$ for some $i$
ANOVA as REGRESSION
We’ll look at this only briefly, as in
practice we don’t generally view
ANOVA in this way. But SAS does!
So part of the context here is to help
us understand (eventually) how
ANOVA models work in SAS.
Dummy Variables

When we view ANOVA as a regression
model, we do so using dummy variables.

We’ve already seen such a variable and
even used it in some examples where
we had only two possible categories:

Smoking Status (Yes = 1, No = 0)

Gender (Male = 1, Female = 0)
What is a Dummy Variable?

The most important thing about dummy
variables is that the numeric value has no
meaning beyond defining the category.

We could, for example, take (No = 1, Yes = 0)
or (Female = 1, Male = 0) on the previous
slide.

Additionally, we could use (Yes = 1, No = -1)
without changing the flavor of the results.
(the meaning of your parameter estimates
would change, but the final interpretations
would remain the same)
Extension to Many Groups

If my categorical factor has a levels, then I
will need a – 1 dummy variables to represent
the factor.

Example: Blood Type (A, B, AB, O)

X1 = 1 if blood type = A; else X1 = 0

X2 = 1 if blood type = B; else X2 = 0

X3 = 1 if blood type = AB; else X3 = 0
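The coding of the three indicators can be written as a tiny Python function (the name `dummy_code` is hypothetical). Type O gets no indicator of its own, so it is the level with all three indicators equal to zero.

```python
def dummy_code(blood_type):
    """Return (X1, X2, X3) indicators for A, B, AB; type O is the all-zero baseline."""
    return tuple(int(blood_type == level) for level in ("A", "B", "AB"))

for bt in ("A", "B", "AB", "O"):
    print(bt, dummy_code(bt))
```

Four levels are fully described by three 0/1 columns, which is why only a – 1 dummy variables are needed.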
Degrees of Freedom

Recall our ANOVA model used a – 1 DF in
the model (one fewer than the number of
levels for the factor). Why?

Because of these indicator variables. It takes
a – 1 indicator variables to encompass our
categorical variable. That’s a – 1 slope
estimates, and hence a – 1 DF.

In general, any categorical variable in your
model will cost DF equal to the number of
levels minus one.
Extension to Many Groups (2)

My “Regression” Model will be
$$Y = \beta_0 + \beta_1 X_{1(A)} + \beta_2 X_{2(B)} + \beta_3 X_{3(AB)} + \varepsilon$$

What do the parameters represent?

What is being tested with the overall model
F test?
Blood Type Example

Model:
$$Y = \beta_0 + \beta_1 X_{1(A)} + \beta_2 X_{2(B)} + \beta_3 X_{3(AB)} + \varepsilon$$

$\beta_0$ is the true mean for blood type O.
$\beta_0 + \beta_1$ is the true mean for type A.
$\beta_0 + \beta_2$ is the true mean for type B.
$\beta_0 + \beta_3$ is the true mean for type AB.

And here are some fairly natural estimates:
$b_0 = \bar{Y}_O$
$b_1 = \bar{Y}_A - \bar{Y}_O$
$b_2 = \bar{Y}_B - \bar{Y}_O$
$b_3 = \bar{Y}_{AB} - \bar{Y}_O$
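These "natural" estimates are in fact the least squares estimates. One way to check this without a regression routine is to confirm that they satisfy the normal equations $X'(y - Xb) = 0$. The Python sketch below does that for the blood type data; it is a verification sketch under that setup, not how SAS computes the fit.

```python
from statistics import mean

blood = {"A": [28, 27, 28], "B": [32, 34, 35], "O": [21, 22, 25], "AB": [32, 32, 34]}
levels = ("A", "B", "AB")                       # O is the baseline level

# Design rows (1, X1, X2, X3) paired with the response, one per observation.
rows = [((1,) + tuple(int(t == lv) for lv in levels), y)
        for t, ys in blood.items() for y in ys]

# The natural estimates: b0 = mean of O, b_k = mean of level k minus mean of O.
b = [mean(blood["O"])] + [mean(blood[lv]) - mean(blood["O"]) for lv in levels]

# Least squares requires X'(y - Xb) = 0; check each column of X.
residuals = [y - sum(x_k * b_k for x_k, b_k in zip(x, b)) for x, y in rows]
normal_eqs = [sum(x[k] * r for (x, _), r in zip(rows, residuals)) for k in range(4)]

print("b =", [round(v, 4) for v in b])
print("X'(y - Xb) =", [round(v, 10) for v in normal_eqs])
```

All four normal equations come out to zero (up to rounding), and the estimates match the SAS output: 22.67 for the intercept and 5, 11, and 10 for types A, B, and AB.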
Blood Type Example (2)

Standard errors for these estimates are also
fairly intuitive since in general the standard
error for a mean is of the form $SEM = \sigma / \sqrt{n}$

For example,
$SE\{b_0\} = \sqrt{MSE / n_O}$
$SE\{b_1\} = \sqrt{MSE / n_O + MSE / n_A}$
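Plugging the blood type numbers into these formulas reproduces the standard errors and t values in the SAS parameter table. A quick Python check, using the SSE, error df, and estimates reported earlier in the example:

```python
from math import sqrt

mse = 16.6666667 / 8          # MSE from the ANOVA table (SSE / error df)
n_O = n_A = 3                 # cell sizes in the blood type example

se_b0 = sqrt(mse / n_O)                 # SE of the baseline (type O) mean
se_b1 = sqrt(mse / n_O + mse / n_A)     # SE of a difference of two cell means

# t value = estimate / standard error, using the estimates from the SAS output.
print(f"SE(b0) = {se_b0:.4f}, t = {22.66666667 / se_b0:.2f}")
print(f"SE(b1) = {se_b1:.4f}, t = {5.0 / se_b1:.2f}")
```

The values agree with the SAS output: standard errors of about 0.8333 and 1.1785, with t values of 27.20 and 4.24.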
Blood Type Example (3)

How do we test hypotheses?

H0: All means the same

H0: Mean for Type AB = Mean for Type O

H0: Mean for Type AB = Mean for Type A
Summary

One level of our factor gets represented by
the intercept. The slope estimates compare
all other levels to that “base” level.

We can compare any set of levels that we
want using a general linear test

This is exactly what SAS does for any
ANOVA! But the output in SAS will be in a
different form to make the interpretations
easier.
CLG Activity
Questions?
Upcoming in Topic 9...
Pairwise Comparisons (Sec. 17.7-17.8)
Randomized Blocks (Chapter 18)