Transcript Chapter 3

Comparison of groups
The purpose of an analysis is often to compare different groups of
data. Suppose, for example, that a meat scientist wants to examine
the effect of three different storage conditions on the tenderness of
meat. For that purpose 24 pieces of meat have been collected and
allocated into three storage (or treatment) groups, each of size eight.
In each group all eight pieces of meat are stored under the same
conditions, and after some time the tenderness of each piece of meat
is measured. The main question is whether the different storage
conditions affect the tenderness: are the observed differences
between the groups due to a real effect, or due to random variation?
Example: Parasite counts for salmons
An experiment with two different salmon stocks, from River Conon in
Scotland and from River Ätran in Sweden, was carried out as follows.
Thirteen fish from each stock were infected and after four weeks the number
of a certain type of parasites was counted for each of the 26 fish with the
following results:
The purpose of the study was to investigate if the number of parasites during
an infection is the same for the two salmon stocks.
Example: Parasite counts for salmons
The sample means and sample standard deviations are computed for each stock.
The summary statistics and the boxplots tell the same story: the observed
parasite counts are generally higher for the Ätran group than for the
Conon group, suggesting that Ätran salmon are more susceptible to
parasites. The purpose of the statistical analysis is to clarify whether the
observed difference is caused by an actual difference between the stocks or by
random variation.
Example: Dung decomposition
An experiment with dung from heifers was carried out in order to explore the
influence of antibiotics on the decomposition of dung organic material. As
part of the experiment, 36 heifers were divided into six groups. All heifers
were fed a standard feed, and antibiotics of different types (alpha-Cypermethrin,
Enrofloxacin, Fenbendazole, Ivermectin, Spiramycin) were
added to the feed for heifers in five of the groups. No antibiotics were added
for heifers in the remaining group (the control group). For each heifer, a bag
of dung was dug into the soil, and after eight weeks the amount of organic
material was measured for each bag.
Example: Dung decomposition
The observations together with group means (solid lines) and the total mean
(dashed line) are shown in the left panel, and parallel boxplots are shown in the
right panel. The amount of organic material appears to be lower for the
control group compared to any of the five types of antibiotics, suggesting that
decomposition is generally inhibited by antibiotics. However, there is
variation from group to group (between-group variation) as well as a
relatively large variation within each group (within-group variation). The
within-group variation seems to be roughly the same for all types, except
perhaps for spiramycin, but that is hard to evaluate because there are fewer
observations in that group.
Example: Dung decomposition
The sample means and the sample standard deviations are computed for each
group separately. We find the same indications as we did in the boxplots. On
average the amount of organic material is lower for the control group than for
the antibiotics groups, and except for the spiramycin group the standard
deviations are roughly the same in all groups.
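A minimal R sketch of such per-group summaries follows. The measurements and group labels are made up for illustration and are not the data from the dung experiment; the variable names organic and type are hypothetical.

```r
# Hypothetical measurements for three groups (NOT the experimental data)
organic <- c(2.4, 2.6, 2.5, 2.9, 3.1, 3.0, 2.7, 2.8, 3.2)
type <- factor(rep(c("Control", "TypeA", "TypeB"), each = 3))

# Split the observations by group and summarize each group separately
by_group <- split(organic, type)
group_n <- sapply(by_group, length)   # group sizes
group_mean <- sapply(by_group, mean)  # sample means
group_sd <- sapply(by_group, sd)      # sample standard deviations (divisor: group size - 1)

print(data.frame(n = group_n, mean = group_mean, sd = group_sd))
```

Comparing the sd column across rows is the numerical counterpart of comparing box heights in parallel boxplots.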
Group means and SDs
Consider the situation with n observations split into k groups. Label the
groups 1 through k. Let g(i) denote the group for observation i. Then g(i) has
one of the values 1,…,k. The sample mean and sample standard deviation in
group j are given by
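Writing y_i for observation i and n_j for the number of observations in group j, the standard definitions read as follows. The symbol y_i is an assumption, and the label (3.1) is inferred from the later reference to the denominator n_j − 1 in (3.1).

```latex
\[
\bar{y}_j = \frac{1}{n_j} \sum_{i:\, g(i)=j} y_i ,
\qquad
s_j = \sqrt{\frac{1}{n_j - 1} \sum_{i:\, g(i)=j} \left( y_i - \bar{y}_j \right)^2}
\tag{3.1}
\]
```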
Residual variance
The residual variance s^2 can also be computed as a weighted average of the
group variance estimates s_j^2, as follows
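Assuming the usual pooled form, with n_j observations in group j and n − k residual degrees of freedom, the weighted average can be written as:

```latex
\[
s^2 = \frac{1}{n-k} \sum_{j=1}^{k} (n_j - 1)\, s_j^2
\]
```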
Note that the group variance s_j^2 is assigned the weight n_j − 1, the denominator
in (3.1). The summation of (3.5) is called the residual sum of squares.
Within-group variation
Within-group variation refers to the variation in each of the groups. It is
illustrated by the vertical deviations between the observations and their
corresponding group means. The residual sum of squares is given by
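In symbols, assuming y_i denotes observation i and \bar{y}_{g(i)} the mean of the group that observation i belongs to (notation not fixed by the transcript):

```latex
\[
\mathrm{SS}_e = \sum_{i=1}^{n} \left( y_i - \bar{y}_{g(i)} \right)^2
\]
```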
SSe describes the within-group variation since it measures squared deviations
between the observations and the group means. The residual degrees of
freedom is dfe = n−k, so the residual mean squares is
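Under the standard definitions (a reconstruction in assumed notation, with y_i for observation i and \bar{y}_{g(i)} for its group mean), the residual mean squares coincides with the residual variance s^2:

```latex
\[
\mathrm{MS}_e = \frac{\mathrm{SS}_e}{\mathrm{df}_e}
             = \frac{1}{n-k} \sum_{i=1}^{n} \left( y_i - \bar{y}_{g(i)} \right)^2
             = s^2
\]
```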
Between-group variation
Between-group variation refers to differences between the groups, such as the
deviations between the different treatments in the antibiotics example.
It is illustrated by the vertical differences between the group means
(horizontal line segments) and the overall mean (dashed line):
When we examine the between-group variation, the k group means
essentially act as our “observations”; hence, dfgrp = k−1, and the “average”
squared difference MSgrp per group becomes the between-group variation.
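Written out, assuming \bar{y}_j is the mean of group j, n_j its size, and \bar{y} the overall mean (standard one-way ANOVA notation, not fixed by the transcript):

```latex
\[
\mathrm{SS}_{grp} = \sum_{j=1}^{k} n_j \left( \bar{y}_j - \bar{y} \right)^2 ,
\qquad
\mathrm{MS}_{grp} = \frac{\mathrm{SS}_{grp}}{\mathrm{df}_{grp}}
                  = \frac{\mathrm{SS}_{grp}}{k-1}
\]
```

Each squared difference is weighted by the group size n_j, so large groups contribute more to the between-group sum of squares.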
Analysis of variance
If there is no difference between any of the groups, then the group averages
will be of similar size and be similar to the overall mean. Hence, MSgrp will be
“small”. On the other hand, if groups 1 and 2, say, are different, then the
group averages will be somewhat different; hence, MSgrp will be “large”.
“Small” and “large” should be measured relative to the within-group
variation, and MSgrp is thus standardized with MSe. We use
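The test statistic is the ratio of the between-group mean square to the residual mean square, the standard F statistic of one-way ANOVA, reconstructed here from the surrounding text:

```latex
\[
F_{obs} = \frac{\mathrm{MS}_{grp}}{\mathrm{MS}_e}
\]
```

Under the hypothesis of no group differences, and the usual assumptions of normality and equal variances, F_obs follows an F distribution with (k − 1, n − k) degrees of freedom; the p-value is computed from that distribution.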
Large values of Fobs are critical; that is, not in agreement with the assumption
(hypothesis) that there is no difference between any of the groups.
Analysis of variance
Disagreement with the hypothesis thus corresponds to a large value of Fobs. The
value of Fobs and the corresponding p-value are often inserted in an analysis of
variance table. A p-value smaller than 0.05 indicates significant evidence against
the hypothesis, that is, evidence of differences between the groups.
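The entries of an analysis of variance table can be computed step by step, as in the following R sketch. The data are hypothetical (not taken from any of the examples), and the names y, group, and fitted_means are chosen for illustration; the result is compared with R's built-in aov().

```r
# Hypothetical data for three groups (NOT data from the examples)
y <- c(2.4, 2.6, 2.5, 2.9, 3.1, 3.0, 2.7, 2.8, 3.2)
group <- factor(rep(c("A", "B", "C"), each = 3))

n <- length(y)        # total number of observations
k <- nlevels(group)   # number of groups

fitted_means <- ave(y, group)             # group mean repeated for each observation
SSe   <- sum((y - fitted_means)^2)        # within-group (residual) sum of squares
SSgrp <- sum((fitted_means - mean(y))^2)  # between-group sum of squares

MSe   <- SSe / (n - k)     # residual mean squares
MSgrp <- SSgrp / (k - 1)   # between-group mean squares
Fobs  <- MSgrp / MSe

# p-value: probability of an F value at least as large as Fobs
p_value <- pf(Fobs, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

# The same quantities from the built-in one-way ANOVA
anova_table <- summary(aov(y ~ group))[[1]]
print(anova_table)
print(c(Fobs = Fobs, p_value = p_value))
```

Only the upper tail of the F distribution is used, because only large values of Fobs contradict the hypothesis.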
Example: Dung decomposition
We conclude that there is strong evidence of group differences. Subsequently,
we need to quantify the conclusion further: Which groups are different and
how large are the differences?
Paired sample and dependence
Paired samples occur, for example, if two measurements are collected for each
subject in the sample under different circumstances (treatments), or if
measurements are taken on pairs of related observational units such as twins.
In dietary studies with two diets under investigation, for example, it is
common that the subjects try one diet in one period and the other diet in
another period; the two measurements on the same subject are thus “dependent.”
As a consequence, the between-group variation is confounded with the
within-group variation, making the analysis of variance inappropriate for paired data.
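The transcript does not name a method for paired data. A common choice, used here purely as an assumption, is the paired t-test, which analyzes the within-subject differences. The sketch below uses made-up measurements for six subjects and contrasts the paired analysis with one that wrongly treats the two samples as independent.

```r
# Hypothetical measurements for six subjects, each trying both diets
diet_a <- c(72.1, 68.4, 80.3, 75.0, 66.2, 71.8)
diet_b <- c(70.9, 67.0, 79.5, 73.1, 65.8, 70.2)

# Paired analysis: works on the within-subject differences
differences <- diet_a - diet_b
paired_test <- t.test(diet_a, diet_b, paired = TRUE)
diff_test <- t.test(differences)  # one-sample t-test on the differences, same result

# Treating the samples as independent ignores the pairing
independent_test <- t.test(diet_a, diet_b, var.equal = TRUE)

print(c(paired = paired_test$p.value, independent = independent_test$p.value))
```

In these made-up data the subjects differ much more from each other than the diets do, so the independent-samples analysis buries the diet effect in between-subject variation.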
Independent samples
• It is important to distinguish paired samples from unpaired, or
independent, samples, because different methods of analysis are
appropriate. For unpaired samples like the dung decomposition data, we
impose an assumption of independence between all observations. This
means that the observations do not share information.
• This setup with independent samples corresponds to a one-way analysis of
variance, or one-way ANOVA. It is called “analysis of variance” because
different sources of variation are compared and “one-way” because only
one factor, the treatment or grouping, is varied in the experiment.
Summary of grouped data
• Two independent samples, where the samples correspond to two different
groups or treatments and can be assumed to be independent.
• k independent samples, where the samples correspond to k different groups
or treatments and can be assumed to be independent.
• Paired samples, where the observations consist of pairs of measurements,
with the observations in a pair corresponding to two different groups or
treatments.
The first case is a special case of the second, but we emphasize it anyway, for
two reasons. First, it is very important to distinguish two independent
samples from paired samples because different analysis methods are
appropriate.
Example: Word count
The attach() command makes it possible to use the variables GENDER, GROUP,
and COUNT without explicit reference to the data frame Data.
> attach(Data)
The following command produces boxplots grouped by GENDER.
> boxplot(COUNT ~ GENDER, col="green", ylab="Word counts per day")
Example: Word count
In the call to aov(), COUNT on the left-hand side of ~ is the response, and
GENDER on the right-hand side indicates the grouping. The output then lists
the analysis of variance table and the group means.
> outcome <- aov(COUNT ~ GENDER)           # fit the one-way ANOVA
> summary(outcome)                         # analysis of variance table
> Means <- model.tables(outcome, "means")  # grand mean and group means
> Means
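The data frame Data itself is not shown in the transcript. As a self-contained sketch, the following builds a small made-up data frame with the same column names and runs the same analysis through the data argument of aov(), which avoids attach(). All values are hypothetical.

```r
# Hypothetical word counts (NOT the data used in the lecture)
Data <- data.frame(
  GENDER = factor(rep(c("F", "M"), each = 5)),
  COUNT  = c(16200, 15800, 17100, 14900, 16500,
             15900, 16700, 14200, 15100, 16000)
)

# Same model as above, but the variables are looked up in Data via `data =`
outcome <- aov(COUNT ~ GENDER, data = Data)
print(summary(outcome))

Means <- model.tables(outcome, "means")
print(Means)

# Cross-check the group means directly
group_means <- tapply(Data$COUNT, Data$GENDER, mean)
print(group_means)
```

Passing data = Data keeps the variable lookup explicit and avoids name clashes that can occur when several data frames are attached at once.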