Transcript lecture6

Review of ANOVA and linear
regression
Review of simple ANOVA
ANOVA for comparing means between more than 2 groups
Hypotheses of One-Way
ANOVA

H0: μ1 = μ2 = μ3 = … = μc



All population means are equal
i.e., no treatment effect (no variation in means among
groups)
H1 : Not all of the population means are the same

At least one population mean is different

i.e., there is a treatment effect

Does not mean that all population means are different
(some pairs may be the same)
The F-distribution

A ratio of variances follows an F-distribution:

$$\frac{\sigma^2_{between}}{\sigma^2_{within}} \sim F_{n,m}$$

The F-test tests the hypothesis that two variances are equal.
F will be close to 1 if the sample variances are equal.

$$H_0: \sigma^2_{between} = \sigma^2_{within}$$
$$H_a: \sigma^2_{between} \neq \sigma^2_{within}$$
How to calculate ANOVA’s by
hand…
Treatment 1    Treatment 2    Treatment 3    Treatment 4
y1,1           y2,1           y3,1           y4,1
y1,2           y2,2           y3,2           y4,2
y1,3           y2,3           y3,3           y4,3
y1,4           y2,4           y3,4           y4,4
y1,5           y2,5           y3,5           y4,5
y1,6           y2,6           y3,6           y4,6
y1,7           y2,7           y3,7           y4,7
y1,8           y2,8           y3,8           y4,8
y1,9           y2,9           y3,9           y4,9
y1,10          y2,10          y3,10          y4,10
k=4 groups, n=10 obs./group

The group means:

$$\bar{y}_{i\cdot} = \frac{\sum_{j=1}^{10} y_{ij}}{10}, \quad i = 1, 2, 3, 4$$

The (within) group variances:

$$\frac{\sum_{j=1}^{10} (y_{ij} - \bar{y}_{i\cdot})^2}{10 - 1}, \quad i = 1, 2, 3, 4$$
Sum of Squares Within (SSW),
or Sum of Squares Error (SSE)
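The (within) group variances:

$$\frac{\sum_{j=1}^{10} (y_{1j} - \bar{y}_{1\cdot})^2}{10 - 1}, \quad \frac{\sum_{j=1}^{10} (y_{2j} - \bar{y}_{2\cdot})^2}{10 - 1}, \quad \frac{\sum_{j=1}^{10} (y_{3j} - \bar{y}_{3\cdot})^2}{10 - 1}, \quad \frac{\sum_{j=1}^{10} (y_{4j} - \bar{y}_{4\cdot})^2}{10 - 1}$$

Sum of Squares Within (SSW) (or SSE, for chance error):

$$\sum_{j=1}^{10} (y_{1j} - \bar{y}_{1\cdot})^2 + \sum_{j=1}^{10} (y_{2j} - \bar{y}_{2\cdot})^2 + \sum_{j=1}^{10} (y_{3j} - \bar{y}_{3\cdot})^2 + \sum_{j=1}^{10} (y_{4j} - \bar{y}_{4\cdot})^2 = \sum_{i=1}^{4} \sum_{j=1}^{10} (y_{ij} - \bar{y}_{i\cdot})^2$$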
Sum of Squares Between (SSB), or
Sum of Squares Regression (SSR)
Overall mean of all 40 observations (“grand mean”):

$$\bar{y}_{\cdot\cdot} = \frac{\sum_{i=1}^{4} \sum_{j=1}^{10} y_{ij}}{40}$$

$$SSB = 10 \times \sum_{i=1}^{4} (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2$$

Sum of Squares Between (SSB). Variability of the group means compared to the grand mean (the variability due to the treatment).
Total Sum of Squares (SST)
$$TSS = \sum_{i=1}^{4} \sum_{j=1}^{10} (y_{ij} - \bar{y}_{\cdot\cdot})^2$$

Total sum of squares (TSS). Squared difference of every observation from the overall mean (the numerator of the variance of Y!).
Partitioning of Variance
$$\sum_{i=1}^{4} \sum_{j=1}^{10} (y_{ij} - \bar{y}_{i\cdot})^2 + 10 \times \sum_{i=1}^{4} (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2 = \sum_{i=1}^{4} \sum_{j=1}^{10} (y_{ij} - \bar{y}_{\cdot\cdot})^2$$

SSW + SSB = TSS
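Why the two pieces add up exactly takes only a few lines of algebra. The sketch below is a standard derivation, not taken from the slides; it assumes k groups of n observations each and writes the group means as \bar{y}_{i\cdot} and the grand mean as \bar{y}_{\cdot\cdot}.

```latex
\begin{aligned}
\sum_{i=1}^{k}\sum_{j=1}^{n} (y_{ij}-\bar{y}_{\cdot\cdot})^2
 &= \sum_{i=1}^{k}\sum_{j=1}^{n} \bigl[(y_{ij}-\bar{y}_{i\cdot}) + (\bar{y}_{i\cdot}-\bar{y}_{\cdot\cdot})\bigr]^2 \\
 &= \sum_{i=1}^{k}\sum_{j=1}^{n} (y_{ij}-\bar{y}_{i\cdot})^2
  + n\sum_{i=1}^{k} (\bar{y}_{i\cdot}-\bar{y}_{\cdot\cdot})^2
  + 2\sum_{i=1}^{k} (\bar{y}_{i\cdot}-\bar{y}_{\cdot\cdot}) \sum_{j=1}^{n} (y_{ij}-\bar{y}_{i\cdot}) \\
 &= SSW + SSB + 0
\end{aligned}
```

The cross term vanishes because deviations from a group's own mean sum to zero within each group.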
ANOVA Table
Source of variation                 d.f.    Sum of squares    Mean Sum of Squares    F-statistic                   p-value
Between (k groups)                  k-1     SSB               SSB/(k-1)              [SSB/(k-1)] / [SSW/(nk-k)]    Go to F(k-1, nk-k) chart
Within (n individuals per group)    nk-k    SSW               s² = SSW/(nk-k)
Total variation                     nk-1    TSS

SSB: sum of squared deviations of group means from grand mean
SSW: sum of squared deviations of observations from their group mean
TSS: sum of squared deviations of observations from grand mean
TSS = SSB + SSW
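The "go to the F chart" step can also be done numerically. The following is a minimal sketch, not part of the lecture: it integrates the F density with Simpson's rule using only the Python standard library. The function names `f_pdf` and `f_p_value` are made up for this illustration, and the sketch assumes at least 2 numerator degrees of freedom so that the density stays finite at zero.

```python
import math

def f_pdf(x, d1, d2):
    """Density of the F distribution with (d1, d2) degrees of freedom."""
    if x < 0:
        return 0.0
    if x == 0:
        # Limit of the density at zero (this sketch assumes d1 >= 2).
        return 1.0 if d1 == 2 else 0.0
    log_pdf = (math.lgamma((d1 + d2) / 2)
               - math.lgamma(d1 / 2) - math.lgamma(d2 / 2)
               + (d1 / 2) * math.log(d1 / d2)
               + (d1 / 2 - 1) * math.log(x)
               - ((d1 + d2) / 2) * math.log(1 + d1 * x / d2))
    return math.exp(log_pdf)

def f_p_value(f_stat, d1, d2, steps=2000):
    """Upper-tail area P(F > f_stat), via Simpson's rule on [0, f_stat]."""
    if d1 < 2:
        raise ValueError("this sketch assumes d1 >= 2")
    if steps % 2:
        raise ValueError("steps must be even for Simpson's rule")
    if f_stat <= 0:
        return 1.0
    h = f_stat / steps
    total = f_pdf(0.0, d1, d2) + f_pdf(f_stat, d1, d2)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * f_pdf(i * h, d1, d2)
    cdf = total * h / 3
    return 1.0 - cdf

# Example: a statistic of 1.14 with k-1 = 3 and nk-k = 36 degrees of freedom.
print(f_p_value(1.14, 3, 36))
```

In practice a statistics package would be used instead; the point is only that the chart lookup is an area under the F curve to the right of the observed statistic.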
Example
Heights in inches:

Treatment 1    Treatment 2    Treatment 3    Treatment 4
60             50             48             47
67             52             49             67
42             43             50             54
67             67             55             67
56             67             56             68
62             59             61             65
64             67             61             65
59             64             60             56
72             63             59             60
71             65             64             65
Example
Step 1) calculate the sum
of squares between groups:
Mean for group 1 = 62.0
Mean for group 2 = 59.7
Mean for group 3 = 56.3
Mean for group 4 = 61.4
Grand mean = 59.85

SSB = [(62-59.85)² + (59.7-59.85)² + (56.3-59.85)² + (61.4-59.85)²] × n per group = 19.65 × 10 = 196.5
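Step 1 can be checked with a few lines of Python. This is an illustrative sketch using only the standard library; the data are the heights from the example table, read column by column, and the variable names are chosen here rather than taken from the lecture.

```python
from statistics import mean

# Heights (inches) from the example table, one list per treatment group.
groups = {
    "Treatment 1": [60, 67, 42, 67, 56, 62, 64, 59, 72, 71],
    "Treatment 2": [50, 52, 43, 67, 67, 59, 67, 64, 63, 65],
    "Treatment 3": [48, 49, 50, 55, 56, 61, 61, 60, 59, 64],
    "Treatment 4": [47, 67, 54, 67, 68, 65, 65, 56, 60, 65],
}

group_means = {name: mean(values) for name, values in groups.items()}
grand_mean = mean(v for values in groups.values() for v in values)

# Each squared deviation of a group mean is weighted by the group size.
ssb = sum(len(values) * (group_means[name] - grand_mean) ** 2
          for name, values in groups.items())

print(group_means)
print(grand_mean, round(ssb, 1))
```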
Example
Step 2) calculate the sum
of squares within groups:
(60-62)² + (67-62)² + (42-62)² + (67-62)² + (56-62)² + (62-62)² + (64-62)² + (59-62)² + (72-62)² + (71-62)² + (50-59.7)² + (52-59.7)² + (43-59.7)² + (67-59.7)² + (67-59.7)² + (59-59.7)² + … (sum of 40 squared deviations) = 2060.6
Step 3) Fill in the ANOVA table
Source of variation    d.f.    Sum of squares    Mean Sum of Squares    F-statistic    p-value
Between                3       196.5             65.5                   1.14           .344
Within                 36      2060.6            57.2
Total                  39      2257.1
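The whole table can be reproduced from the raw heights. The function below is a sketch under the assumptions of this example (one list of observations per group); `one_way_anova` is a name chosen for this illustration, and the p-value is left to an F table or statistical software.

```python
from statistics import mean

def one_way_anova(groups):
    """Return the sums of squares, mean squares and F for a list of groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    means = [mean(g) for g in groups]

    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ssw = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    tss = sum((v - grand_mean) ** 2 for v in all_values)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    msb, msw = ssb / df_between, ssw / df_within
    return {
        "df_between": df_between, "df_within": df_within,
        "ssb": ssb, "ssw": ssw, "tss": tss,
        "msb": msb, "msw": msw, "f": msb / msw,
        "r_squared": ssb / tss,
    }

# Heights (inches) from the example, one list per treatment.
heights = [
    [60, 67, 42, 67, 56, 62, 64, 59, 72, 71],
    [50, 52, 43, 67, 67, 59, 67, 64, 63, 65],
    [48, 49, 50, 55, 56, 61, 61, 60, 59, 64],
    [47, 67, 54, 67, 68, 65, 65, 56, 60, 65],
]
table = one_way_anova(heights)
for key, value in table.items():
    print(key, round(value, 2))
```

TSS is computed directly from the grand mean rather than as SSB + SSW, so the output also confirms that the two pieces add up to the total.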
INTERPRETATION of ANOVA:
How much of the variance in height is explained by treatment group?
R² = “Coefficient of Determination” = SSB/TSS = 196.5/2257.1 = 9%
Coefficient of Determination
$$R^2 = \frac{SSB}{SSB + SSE} = \frac{SSB}{SST}$$
The amount of variation in the outcome variable (dependent
variable) that is explained by the predictor (independent variable).
ANOVA example
Table 6. Mean micronutrient intake from the school lunch by school

                 Calcium (mg)       Iron (mg)       Folate (μg)      Zinc (mg)
                 Mean     SD (e)    Mean    SD      Mean    SD       Mean    SD
S1 (a), n=25     117.8    62.4      2.0     0.6     26.6    13.1     1.9     1.0
S2 (b), n=25     158.7    70.5      2.0     0.6     38.7    14.5     1.5     1.2
S3 (c), n=25     206.5    86.2      2.0     0.6     42.6    15.1     1.3     0.4
P-value (d)      0.000              0.854           0.000            0.055

(a) School 1 (most deprived; 40% subsidized lunches).
(b) School 2 (medium deprived; <10% subsidized).
(c) School 3 (least deprived; no subsidization, private school).
(d) ANOVA; significant differences are highlighted in bold (P<0.05).

FROM: Gould R, Russell J, Barker ME. School lunch menus and 11 to 12 year old children's food choice in three secondary schools in England-are the nutritional standards being met? Appetite. 2006 Jan;46(1):86-92.
Answer
Step 1) calculate the sum of squares between groups:
Mean for School 1 = 117.8
Mean for School 2 = 158.7
Mean for School 3 = 206.5
Grand mean: 161
SSB = [(117.8-161)² + (158.7-161)² + (206.5-161)²] × 25 per group = 98,113
Answer
Step 2) calculate the sum of squares within groups:
S.D. for S1 = 62.4
S.D. for S2 = 70.5
S.D. for S3 = 86.2
Therefore, sum of squares within is:
(24)[62.4² + 70.5² + 86.2²] = 391,066
Answer
Step 3) Fill in your ANOVA table
Source of variation    d.f.    Sum of squares    Mean Sum of Squares    F-statistic    p-value
Between                2       98,113            49056                  9              <.05
Within                 72      391,066           5431
Total                  74      489,179

**R² = 98113/489179 = 20%
School explains 20% of the variance in lunchtime calcium
intake in these kids.
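The same calculation works from published summary statistics alone. The sketch below uses the calcium means, SDs and group sizes from Table 6; the variable names are chosen for this illustration. Recomputed from the rounded table means, SSB comes out near 98,545 rather than the 98,113 quoted above, but the conclusions (F of about 9, R² of about 20%) are the same. The p-value line relies on a closed form for the F upper tail that holds only when the numerator has exactly 2 degrees of freedom, as it does here.

```python
# Calcium (mg) summary statistics from Table 6: (mean, SD, n) per school.
schools = [(117.8, 62.4, 25), (158.7, 70.5, 25), (206.5, 86.2, 25)]

k = len(schools)
n_total = sum(n for _, _, n in schools)
grand_mean = sum(m * n for m, _, n in schools) / n_total

# Between groups: squared deviation of each school mean, weighted by n.
ssb = sum(n * (m - grand_mean) ** 2 for m, _, n in schools)
# Within groups: a sample variance is SS/(n-1), so each SS is (n-1)*SD^2.
ssw = sum((n - 1) * sd ** 2 for _, sd, n in schools)

df_between, df_within = k - 1, n_total - k
f_stat = (ssb / df_between) / (ssw / df_within)
r_squared = ssb / (ssb + ssw)

# Closed form for P(F > f) when the numerator d.f. is exactly 2.
assert df_between == 2
p_value = (1 + 2 * f_stat / df_within) ** (-df_within / 2)

print(round(ssb, 1), round(ssw, 1))
print(round(f_stat, 2), round(r_squared, 3), p_value)
```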
Beyond one-way ANOVA
Often, you may want to test more than 1
treatment. ANOVA can accommodate
more than 1 treatment or factor, so long
as they are independent. Again, the
variation partitions beautifully!
TSS = SSB1 + SSB2 + SSW
Linear regression review
What is “Linear”?

Remember this:

Y=mX+B? (m is the slope, B is the intercept)
What’s Slope?
A slope of 2 means that every 1-unit change in X
yields a 2-unit change in Y.
Regression equation…
Expected value of y at a given level of x:

$$E(y_i \mid x_i) = \alpha + \beta x_i$$

Predicted value for an individual…

$$y_i = \alpha + \beta x_i + \text{random error}_i$$

The α + βxi part is fixed (exactly on the line); the random error follows a normal distribution.
Assumptions (or the fine print)

Linear regression assumes that…





1. The relationship between X and Y is linear
2. Y is distributed normally at each value of X
3. The variance of Y at every value of X is the
same (homogeneity of variances)
4. The observations are independent**
**When we talk about repeated measures
starting next week, we will violate this
assumption and hence need more
sophisticated regression models!
The standard error of Y given X (Sy/x) is the average variability around the regression line at any given value of X. It is assumed to be equal at all values of X.
Regression Picture
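[Figure: scatter plot with the fitted line ŷi = βxi + α and the naïve mean ȳ. For one observation yi, A is its distance from ȳ, B is the distance from the regression line to ȳ, and C is the distance from yi to the regression line.]

*Least squares estimation gave us the line (β) that minimized C²

$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$

A² = B² + C²
SStotal = SSreg + SSresidual

SStotal (A²): Total squared distance of observations from naïve mean of y. Total variation.
SSreg (B²): Distance from regression line to naïve mean of y. Variability due to x (regression).
SSresidual (C²): Variance around the regression line. Additional variability not explained by x; this is what the least squares method aims to minimize.

R² = SSreg/SStotal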
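A small numeric check makes the decomposition concrete. The sketch below fits a least-squares line to five made-up points (hypothetical data, not from the lecture) and computes the three sums of squares; `fit_line` is a helper name chosen for this illustration.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares intercept (alpha) and slope (beta) for y = alpha + beta*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta = sxy / sxx
    alpha = y_bar - beta * x_bar
    return alpha, beta

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

alpha, beta = fit_line(x, y)
y_hat = [alpha + beta * xi for xi in x]
y_bar = mean(y)

ss_total = sum((yi - y_bar) ** 2 for yi in y)                  # A^2
ss_reg = sum((yh - y_bar) ** 2 for yh in y_hat)                # B^2
ss_resid = sum((yh - yi) ** 2 for yh, yi in zip(y_hat, y))     # C^2
r_squared = ss_reg / ss_total

print(alpha, beta)
print(ss_total, ss_reg, ss_resid, r_squared)
```

For these points the fitted line is y = 2.2 + 0.6x, and the regression and residual sums of squares (3.6 and 2.4) add up to the total of 6, up to floating-point rounding.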
Recall example: cognitive
function and vitamin D

Hypothetical data loosely based on [1]; cross-sectional study of 100 middle-aged and older European men.

Cognitive function is measured by the Digit
Symbol Substitution Test (DSST).
1. Lee DM, Tajar A, Ulubaev A, et al. Association between 25-hydroxyvitamin D levels and cognitive performance in middle-aged
and older European men. J Neurol Neurosurg Psychiatry. 2009 Jul;80(7):722-9.
Distribution of vitamin D
Mean= 63 nmol/L
Standard deviation = 33 nmol/L
Distribution of DSST
Normally distributed
Mean = 28 points
Standard deviation = 10 points
Four hypothetical datasets

I generated four hypothetical datasets,
with increasing TRUE slopes (between
vit D and DSST):




0
0.5 points per 10 nmol/L
1.0 points per 10 nmol/L
1.5 points per 10 nmol/L
Dataset 1: no relationship
Dataset 2: weak relationship
Dataset 3: weak to moderate
relationship
Dataset 4: moderate
relationship
The “Best fit” line

Dataset 1. Regression equation: E(Yi) = 28 + 0*vit Di (in 10 nmol/L)

Dataset 2. Regression equation: E(Yi) = 26 + 0.5*vit Di (in 10 nmol/L)
Note how the line is a little deceptive; it draws your eye, making the relationship appear stronger than it really is!

Dataset 3. Regression equation: E(Yi) = 22 + 1.0*vit Di (in 10 nmol/L)

Dataset 4. Regression equation: E(Yi) = 20 + 1.5*vit Di (in 10 nmol/L)

Note: all the lines go through the point (63, 28)!
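The shared point is not a coincidence: 63 nmol/L and 28 points are the means of vitamin D and DSST, and a least-squares line always passes through the point of means. A one-line derivation, writing \hat{\alpha} and \hat{\beta} for the fitted intercept and slope (notation assumed here):

```latex
\hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}
\quad\Longrightarrow\quad
\hat{y}(\bar{x}) = \hat{\alpha} + \hat{\beta}\,\bar{x} = \bar{y}
```

So changing the slope pivots the line around (\bar{x}, \bar{y}) rather than shifting it.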
Significance testing…
Slope
Distribution of slope ~ T(n-2) (β, s.e.(β̂))

H0: β1 = 0 (no linear relationship)
H1: β1 ≠ 0 (linear relationship does exist)

$$T_{n-2} = \frac{\hat{\beta} - 0}{s.e.(\hat{\beta})}$$
Example: dataset 4

Standard error (beta) = 0.03
T98 = 0.15/0.03 = 5, p<.0001

95% Confidence interval = 0.09 to 0.21
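The arithmetic behind this example can be sketched in a few lines. The slope of 0.15 is per 1 nmol/L (that is, 1.5 points per 10 nmol/L). The critical value 1.984 is an assumption taken from a standard t table for roughly 98 degrees of freedom; it is not stated in the lecture.

```python
beta_hat = 0.15   # estimated slope: DSST points per 1 nmol/L of vitamin D
se_beta = 0.03    # standard error of the slope
n = 100           # men in the hypothetical dataset
df = n - 2

# Test statistic for H0: beta = 0.
t_stat = (beta_hat - 0) / se_beta

# Assumed two-sided 5% critical value for about 98 d.f. (from a t table).
t_crit = 1.984
ci_low = beta_hat - t_crit * se_beta
ci_high = beta_hat + t_crit * se_beta

print(df, round(t_stat, 2))
print(round(ci_low, 2), round(ci_high, 2))
```

Because the interval excludes 0, it agrees with the significance test: both say the slope differs from zero.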

Multiple linear regression…

What if age is a confounder here?



Older men have lower vitamin D
Older men have poorer cognition
“Adjust” for age by putting age in the
model:

DSST score = intercept + slope1 × vitamin D + slope2 × age
2 predictors: age and vit D…
Different 3D view…
Fit a plane rather than a line…
On the plane, the
slope for vitamin
D is the same at
every age; thus,
the slope for
vitamin D
represents the
effect of vitamin
D when age is
held constant.
Equation of the “Best fit”
plane…




DSST score = 53 + 0.0039 × vitamin D (in 10 nmol/L) - 0.46 × age (in years)
P-value for vitamin D >>.05
P-value for age <.0001
Thus, relationship with vitamin D was
due to confounding by age!
Multiple Linear Regression

More than one predictor…
E(y) = α + β1*X + β2*W + β3*Z…
Each regression coefficient is the amount of
change in the outcome variable that would be
expected per one-unit change of the
predictor, if all other variables in the model
were held constant.
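Confounding of this kind is easy to reproduce with made-up numbers. In the sketch below (hypothetical data, standard library only) DSST depends on age alone, while vitamin D merely falls with age. The crude slope for vitamin D is clearly positive, but it drops to zero once age enters the model. The helpers `solve` and `least_squares` are written for this illustration and solve the normal equations directly; real analyses would use a statistics package.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def least_squares(predictors, y):
    """Coefficients [intercept, b1, b2, ...] from the normal equations."""
    rows = [[1.0] + [col[i] for col in predictors] for i in range(len(y))]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)

# Hypothetical data: DSST is driven by age only.
age = [50, 55, 60, 65, 70, 75]                  # years
vit_d = [9.0, 7.5, 7.0, 5.0, 5.5, 3.0]          # in 10 nmol/L, lower in older men
dsst = [53 - 0.46 * a for a in age]

crude = least_squares([vit_d], dsst)            # [intercept, vitamin D]
adjusted = least_squares([vit_d, age], dsst)    # [intercept, vitamin D, age]

print("crude vitamin D slope:   ", round(crude[1], 3))
print("adjusted vitamin D slope:", round(adjusted[1], 3))
print("adjusted age slope:      ", round(adjusted[2], 3))
```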
Functions of multivariate
analysis:



Control for confounders
Test for interactions between predictors
(effect modification)
Improve predictions
ANOVA is linear regression!

Divide vitamin D into three groups:



Deficient (<25 nmol/L)
Insufficient (>=25 and <50 nmol/L)
Sufficient (>=50 nmol/L), reference group
DSST = α (= value for sufficient) + β_insufficient*(1 if insufficient) + β_deficient*(1 if deficient)
This is called “dummy coding”—where multiple
binary variables are created to represent
being in each category (or not) of a
categorical variable
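A short sketch shows what dummy coding does to the coefficients. The eight data points are hypothetical, chosen so the group means are round numbers. With "sufficient" as the reference group, the intercept is that group's mean and each dummy coefficient is a difference in means. The final check confirms these values satisfy the least-squares normal equations: the residuals sum to zero overall and within each dummy-coded group.

```python
from statistics import mean

# Hypothetical vitamin D (nmol/L) and DSST values, for illustration only.
vit_d = [10, 20, 30, 40, 45, 60, 80, 100]
dsst = [28, 32, 31, 35, 33, 38, 41, 41]

# Dummy coding: one 0/1 column per non-reference category.
deficient = [1 if v < 25 else 0 for v in vit_d]
insufficient = [1 if 25 <= v < 50 else 0 for v in vit_d]
sufficient = [1 - d - i for d, i in zip(deficient, insufficient)]  # reference

def group_mean(flags):
    """Mean DSST among observations whose flag is 1."""
    return mean(y for y, f in zip(dsst, flags) if f)

intercept = group_mean(sufficient)                  # mean of reference group
b_deficient = group_mean(deficient) - intercept     # difference in means
b_insufficient = group_mean(insufficient) - intercept

residuals = [y - (intercept + b_deficient * d + b_insufficient * i)
             for y, d, i in zip(dsst, deficient, insufficient)]

# Normal equations: residuals are orthogonal to every column of the design.
checks = [
    sum(residuals),
    sum(r * d for r, d in zip(residuals, deficient)),
    sum(r * i for r, i in zip(residuals, insufficient)),
]
print(intercept, b_deficient, b_insufficient, checks)
```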
The picture…
Sufficient vs.
Insufficient
Sufficient vs.
Deficient
Results…
Parameter Estimates

Variable        DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept       1     40.07407              1.47817           27.11      <.0001
deficient       1     -9.87407              3.73950           -2.64      0.0096
insufficient    1     -6.87963              2.33719           -2.94      0.0041
Interpretation:


The deficient group has a mean DSST 9.87 points
lower than the reference (sufficient) group.
The insufficient group has a mean DSST 6.88
points lower than the reference (sufficient) group.