
Ch12: Analysis of Variance
(ANOVA)
12.1: INTRO:
This is an extension of the two-sample design of
Chap 11 to more than two samples. The name ANOVA is
misleading: the method compares the means of the
data, not their variances (it analyzes variances in
order to compare means).
One-way and Two-way Layouts will be applied to
parametric and nonparametric methods.
12.2: The One-Way Layout
The One-Way Layout is an experimental design in which
independent measurements are made under each of
several (more than 2) treatments.
Thus, the One-Way Layout ANOVA focuses on
comparing more than 2 pop’ns or treatment means.
Terminology:
The characteristic that differentiates the pop’ns or
treatments from one another is called the factor under
study.
The different pop'ns or treatments are termed the
levels of the factor.
12.2.1: Normal Theory; the F-test
This section deals with the ANOVA F-test for the case
of I samples (treatments or levels) of the same size, J.
Notation:
Model: $Y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$,
where $\sum_{i=1}^{I} \alpha_i = 0$ and the $\varepsilon_{ij}$ are independent with $\varepsilon_{ij} \sim N(0, \sigma^2)$.

$Y_{ij}$ = $j$th observation of the $i$th treatment
$\mu$ = overall mean level
$\alpha_i$ = differential effect of the $i$th treatment
$\varepsilon_{ij}$ = random error in the $j$th observation under the $i$th treatment
12.2.1: (cont'd)
Response: $Y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$; expected response: $E(Y_{ij}) = \mu + \alpha_i$.
Thus $\alpha_i = 0 \Rightarrow E(Y_{ij}) = \mu$, i.e. all treatments have the SAME expected response.
Hypothesis of interest (to be tested later): $H_0: \alpha_1 = \alpha_2 = \dots = \alpha_I = 0$ (equal means).
ANOVA is based on the following identity:
$$SS_{Total} = SS_{Within} + SS_{Between}: \quad \sum_{i=1}^{I}\sum_{j=1}^{J}\left(Y_{ij} - \bar{Y}_{..}\right)^2 = \sum_{i=1}^{I}\sum_{j=1}^{J}\left(Y_{ij} - \bar{Y}_{i.}\right)^2 + J\sum_{i=1}^{I}\left(\bar{Y}_{i.} - \bar{Y}_{..}\right)^2$$
where $\bar{Y}_{i.} = \frac{1}{J}\sum_{j=1}^{J} Y_{ij}$ (average under the $i$th treatment)
and $\bar{Y}_{..} = \frac{1}{IJ}\sum_{i=1}^{I}\sum_{j=1}^{J} Y_{ij}$ (overall average).
12.2.1: (cont'd)
Lemma A: $E\left[\left(X_i - \bar{X}\right)^2\right] = \left(\mu_i - \bar{\mu}\right)^2 + \frac{n-1}{n}\sigma^2$,
where $X_i$ ($i = 1, \dots, n$) are independent r.v. with $E(X_i) = \mu_i$, $Var(X_i) = \sigma^2$, and $\bar{\mu} = \frac{1}{n}\sum_{i=1}^{n}\mu_i$.

Theorem A: $E\left[\frac{SS_{Within}}{I(J-1)}\right] = \sigma^2$ [unbiasedness; this ratio is $s^2_{pooled}$]
and $E\left(SS_{Between}\right) = J\sum_{i=1}^{I}\alpha_i^2 + (I-1)\sigma^2$.

Theorem B: If the $\varepsilon_{ij} \sim N(0, \sigma^2)$ independently, then $\frac{SS_{Within}}{\sigma^2} \sim \chi^2_{I(J-1)}$.
If, additionally, all $\alpha_i = 0$, then $\frac{SS_{Between}}{\sigma^2} \sim \chi^2_{I-1}$ and is independent of $SS_{Within}$.

Theorem C: The null distribution of $F = \frac{SS_{Between}/(I-1)}{SS_{Within}/I(J-1)}$ is $F_{I-1,\, I(J-1)}$,
since the errors are normally distributed by assumption.
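The decomposition identity and the F statistic of Theorem C can be checked numerically. The sketch below uses made-up data (not from the textbook) with I = 3 treatments and J = 3 observations each:

```python
# One-way layout: compute SS_Within, SS_Between, and the F statistic.
# The data values are made up for illustration (I = 3 treatments, J = 3 obs each).
data = [
    [4, 5, 6],    # treatment 1
    [6, 7, 8],    # treatment 2
    [8, 9, 10],   # treatment 3
]
I, J = len(data), len(data[0])

grand_mean = sum(sum(row) for row in data) / (I * J)     # Y_bar_..
treat_means = [sum(row) / J for row in data]             # Y_bar_i.

ss_within = sum((y - m) ** 2 for row, m in zip(data, treat_means) for y in row)
ss_between = J * sum((m - grand_mean) ** 2 for m in treat_means)
ss_total = sum((y - grand_mean) ** 2 for row in data for y in row)

# Identity: SS_Total = SS_Within + SS_Between
assert abs(ss_total - (ss_within + ss_between)) < 1e-9

# Theorem C: under H0, F ~ F_{I-1, I(J-1)}
F = (ss_between / (I - 1)) / (ss_within / (I * (J - 1)))
print(ss_within, ss_between, F)  # 6.0 24.0 12.0
```

A large F (compared to the $F_{I-1, I(J-1)}$ critical value from the tables) leads to rejecting $H_0$.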
12.2.2: Multiple Comparisons
The F-test in Ex. A of Sec. 12.2.1 concludes that the
means of measurements from the 7 different
labs are NOT all equal, but how much do they
differ, and which pairs are significantly different?
These questions will be addressed in this section.
Our main focus is to compare pairs or groups of
treatments & to estimate the treatment means
and their differences.
Naïve approach: compare all pairs through t-tests (?)
New approaches: Tukey & Bonferroni methods
12.2.2.1: Tukey’s method
It is used to construct CIs for the differences of all
pairs of means in such a way (unlike the naïve
approach) that the intervals simultaneously
have a set coverage probability. Using the
duality between CI & HT, one can determine
which particular pairs are significantly different.
Tukey’s procedure depends on the so-called
studentized range dist’n (A14-A19, textbook)
characterized by 2 parameters: I=the number of
samples being compared & I(J-1)=the degrees
of freedom in the pooled sample std deviation.
12.2.2.1: Tukey’s method (cont’d)
(1) Let the sample sizes be ALL EQUAL and the errors normally distributed
with a CONSTANT VARIANCE. Then the centered sample means $\bar{Y}_{i.} - \mu_i$ are independent
and $\bar{Y}_{i.} - \mu_i \sim N\!\left(0, \frac{\sigma^2}{J}\right)$.
Note that $\frac{s_p^2}{J}$ estimates $\frac{\sigma^2}{J}$.

(2) The probability dist'n of the random variable
$$Q = \max_{i_1, i_2} \frac{\left|\left(\bar{Y}_{i_1 .} - \mu_{i_1}\right) - \left(\bar{Y}_{i_2 .} - \mu_{i_2}\right)\right|}{\sqrt{s_p^2 / J}}$$
is called the Studentized Range Dist'n (SRD) with parameters $I$ and $I(J-1)$.

(3) With probability $1 - \alpha$, for every pair $(i_1, i_2)$ with $i_1 \neq i_2$, $i_1, i_2 \in \{1, \dots, I\}$,
$$\bar{Y}_{i_1 .} - \bar{Y}_{i_2 .} \pm \frac{s_p}{\sqrt{J}}\, q_{I,\, I(J-1)}(\alpha)$$
is a $100(1-\alpha)\%$ set of simultaneous CIs for all differences $\mu_{i_1} - \mu_{i_2}$,
where $q_{I, I(J-1)}(\alpha)$ denotes the upper-tail $\alpha$ critical value of the SRD.
Tukey’s method: (steps)
(read example A on page 452)
Step 1: Select $\alpha$ and find $q_{I, I(J-1)}(\alpha)$
from pages A15-A19 (Appendix B: Table 6).
Step 2: Calculate the quantity $w = \frac{s_p}{\sqrt{J}}\, q_{I, I(J-1)}(\alpha)$,
where $s_p = \sqrt{MSE}$ (read from the ANOVA table).
Step 3: List the sample means in DECREASING order
and underline each pair that differs by less than $w$.
Pairs that are not underlined indicate that the
corresponding pop'n means differ significantly at level $\alpha$.
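The three steps can be sketched in code. The numbers below are hypothetical (not the page-452 example): I = 3 sample means with J = 5 observations each, an assumed MSE, and $q_{3,12}(0.05) \approx 3.77$ taken from the SRD table:

```python
# Tukey's method, sketched with hypothetical numbers (not the textbook example).
# Step 1: q_{I, I(J-1)}(alpha) is read from the SRD table (Appendix B, Table 6);
# 3.77 is approximately the value for I = 3, I(J-1) = 12, alpha = 0.05.
q = 3.77
means = {"A": 10.2, "B": 12.9, "C": 13.1}   # sample means (made up)
MSE, J = 2.5, 5                              # MSE from a hypothetical ANOVA table

# Step 2: w = (s_p / sqrt(J)) * q, with s_p = sqrt(MSE)
s_p = MSE ** 0.5
w = s_p / J ** 0.5 * q

# Step 3: list means in decreasing order; a pair differing by >= w is
# significantly different at level alpha ("not underlined").
names = sorted(means, key=means.get, reverse=True)
significant = {
    (a, b): abs(means[a] - means[b]) >= w
    for i, a in enumerate(names) for b in names[i + 1:]
}
print(round(w, 3), significant)
```

Here only the pair (C, B) would be underlined; both pairs involving A differ by more than w.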
12.2.2.2: The Bonferroni method
This method was briefly introduced in Section 11.4.8.
If $k$ null hypotheses are to be tested, then a desired
overall type I error rate of at most $\alpha$ can be
guaranteed by testing each null hypothesis at level $\frac{\alpha}{k}$.
By duality between CI & HT, one can say that:
if $k$ confidence intervals are each formed to have
confidence level $100\left(1 - \frac{\alpha}{k}\right)\%$, then they all hold
simultaneously with confidence level at least $100(1-\alpha)\%$.
Nice results are obtained for $k$ not too large.
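A minimal sketch of the Bonferroni rule, with made-up p-values (the function name and data are illustrative only):

```python
# Bonferroni correction: testing each of k null hypotheses at level alpha/k
# guarantees an overall type I error rate of at most alpha.
def bonferroni_reject(p_values, alpha=0.05):
    k = len(p_values)
    return [p < alpha / k for p in p_values]

# Illustrative (made-up) p-values for k = 3 tests: each is compared to 0.05/3.
decisions = bonferroni_reject([0.001, 0.02, 0.04])
print(decisions)  # [True, False, False]
```

Note that 0.02 would be rejected at the unadjusted level 0.05 but not at 0.05/3; this is the price paid for controlling the overall error rate.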
12.2.3: The Kruskal-Wallis Test:
(a nonparametric method)
The Kruskal-Wallis test is a generalization of the
Mann-Whitney test seen in Section 11.2.3.
Thus, the Kruskal-Wallis test makes no
normality assumption and has a wider range
of applicability than does the F test.
The Kruskal-Wallis test is especially useful for small-sample problems.
Data are replaced by their ranks, so outliers
have less influence in the Kruskal-Wallis test
(nonparametric) than they do on its counterpart,
the F test (parametric).
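A minimal sketch of the rank-based statistic, using the standard form $K = \frac{12}{N(N+1)} \sum_i n_i \left(\bar{R}_{i.} - \frac{N+1}{2}\right)^2$ with average ranks for ties (the data below are made up):

```python
# Kruskal-Wallis test statistic: pool all N observations, replace them by
# their ranks (average ranks for ties), and compare each group's mean rank
# to the overall mean rank (N + 1) / 2.
def kruskal_wallis(groups):
    pooled = sorted(x for g in groups for x in g)
    N = len(pooled)

    def avg_rank(x):
        lo = pooled.index(x)             # first position of x (0-based)
        hi = lo + pooled.count(x) - 1    # last position of x
        return (lo + hi) / 2 + 1         # average 1-based rank (handles ties)

    K = 0.0
    for g in groups:
        r_bar = sum(avg_rank(x) for x in g) / len(g)
        K += len(g) * (r_bar - (N + 1) / 2) ** 2
    return 12 * K / (N * (N + 1))

# Made-up data for I = 3 groups; under H0, K is approximately chi^2_{I-1}.
K = kruskal_wallis([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(K)  # 7.2
```

A large K (compared to the $\chi^2_{I-1}$ critical value) leads to rejecting the hypothesis of identical distributions.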
12.3: The Two-Way Layout
Here the experimental design involves 2 factors (each
factor has 2 or more levels).
Goal: We would like to assess the effect of 2 factors
(e.g. Temperature & Humidity) on a variable of interest
(e.g. the yield of a chemical reaction).
Example: factor 1 = Temperature with I = 4 levels (T1, T2, T3, T4);
factor 2 = Humidity with J = 2 levels (H1, H2)
⇒ I*J = 8 combinations (cells).
Take K independent obs. in each cell.
Another situation: an agricultural scientist may be
interested in the corn yield using 3 different fertilizers
with 4 different types of soils. What are the effects?
12.3.1: Additive Parametrization
Additive Model: $Y_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij}$
(NO INTERACTION; K = 1 obs per cell)
$\hat{Y}_{ij} = \hat{\mu} + \hat{\alpha}_i + \hat{\beta}_j$ is the fitted or predicted value of $Y_{ij}$
$\Rightarrow \hat{Y}_{i1} - \hat{Y}_{i2} = (\hat{\mu} + \hat{\alpha}_i + \hat{\beta}_1) - (\hat{\mu} + \hat{\alpha}_i + \hat{\beta}_2) = \hat{\beta}_1 - \hat{\beta}_2$
(same difference among ranges on all menu days)
→ read example on pg 455
If there exists an INTERACTION between menu days & ranges,
then differences among ranges on menu days will NOT be the same
→ look at Figure 12.3 on pg 457.
Interactions can be incorporated in the model to better fit the data.
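The additive fit can be sketched with the usual least-squares estimates $\hat{\mu} = \bar{Y}_{..}$, $\hat{\alpha}_i = \bar{Y}_{i.} - \bar{Y}_{..}$, $\hat{\beta}_j = \bar{Y}_{.j} - \bar{Y}_{..}$ (the data below are made up, not the page-455 example); it confirms that under additivity the column difference $\hat{\beta}_1 - \hat{\beta}_2$ is the same in every row:

```python
# Additive two-way model with K = 1 observation per cell, fitted with the
# usual least-squares estimates: mu_hat = grand mean, alpha_hat_i = row mean
# minus grand mean, beta_hat_j = column mean minus grand mean.
Y = [
    [3, 5],     # made-up data: I = 3 rows (e.g. menu days), J = 2 columns (ranges)
    [4, 6],
    [8, 10],
]
I, J = len(Y), len(Y[0])

mu_hat = sum(sum(r) for r in Y) / (I * J)
alpha_hat = [sum(r) / J - mu_hat for r in Y]
beta_hat = [sum(Y[i][j] for i in range(I)) / I - mu_hat for j in range(J)]

# Fitted values: with no interaction, the difference between columns 1 and 2
# equals beta_hat[0] - beta_hat[1] in every row.
Y_hat = [[mu_hat + alpha_hat[i] + beta_hat[j] for j in range(J)] for i in range(I)]
diffs = [Y_hat[i][0] - Y_hat[i][1] for i in range(I)]
print(diffs)  # [-2.0, -2.0, -2.0]
```

With real (interacting) data the observed column differences would vary across rows even though the fitted ones cannot; that discrepancy is what an interaction term absorbs.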
12.3.2: Normal Theory
for the 2-Way Layout
Assume the number of observations per cell K > 1.
If K is the same for each cell, then the design is said to be balanced.
Model: $Y_{ijk} = \mu + \alpha_i + \beta_j + \delta_{ij} + \varepsilon_{ijk}$ ($k$th observation in cell $(i, j)$)
$\Rightarrow E(Y_{ijk}) = \mu + \alpha_i + \beta_j + \delta_{ij}$
and $Var(Y_{ijk}) = \sigma^2$ because $\varepsilon_{ijk} \overset{iid}{\sim} N(0, \sigma^2)$
Assumption: $\sum_{i=1}^{I}\alpha_i = \sum_{j=1}^{J}\beta_j = \sum_{i=1}^{I}\delta_{ij} = \sum_{j=1}^{J}\delta_{ij} = 0$
12.3.2: (cont'd)
Theorem A: Under the assumption that $\varepsilon_{ijk} \overset{iid}{\sim} N(0, \sigma^2)$:
$$E(SS_A) = (I-1)\sigma^2 + JK\sum_{i=1}^{I}\alpha_i^2; \qquad E(SS_B) = (J-1)\sigma^2 + IK\sum_{j=1}^{J}\beta_j^2$$
$$E(SS_{AB}) = (I-1)(J-1)\sigma^2 + K\sum_{i=1}^{I}\sum_{j=1}^{J}\delta_{ij}^2 \quad \text{and} \quad E(SS_E) = IJ(K-1)\sigma^2$$
where
$$SS_A = JK\sum_{i=1}^{I}\left(\bar{Y}_{i..} - \bar{Y}_{...}\right)^2; \qquad SS_B = IK\sum_{j=1}^{J}\left(\bar{Y}_{.j.} - \bar{Y}_{...}\right)^2$$
$$SS_{AB} = K\sum_{i=1}^{I}\sum_{j=1}^{J}\left(\bar{Y}_{ij.} - \bar{Y}_{i..} - \bar{Y}_{.j.} + \bar{Y}_{...}\right)^2 \quad \text{and} \quad SS_E = \sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K}\left(Y_{ijk} - \bar{Y}_{ij.}\right)^2$$
Identity: $SS_A + SS_B + SS_{AB} + SS_E = \sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K}\left(Y_{ijk} - \bar{Y}_{...}\right)^2 = SS_{TOT}$
12.3.2: (cont'd)
Theorem B: Under the assumption $\varepsilon_{ijk} \overset{iid}{\sim} N(0, \sigma^2)$:
$$\frac{SS_E}{\sigma^2} \sim \chi^2_{IJ(K-1)}$$
Under the null $H_0^A: \alpha_i = 0,\ i = 1, \dots, I$: $\quad \frac{SS_A}{\sigma^2} \sim \chi^2_{I-1}$
Under the null $H_0^B: \beta_j = 0,\ j = 1, \dots, J$: $\quad \frac{SS_B}{\sigma^2} \sim \chi^2_{J-1}$
Under the null $H_0^{AB}: \delta_{ij} = 0,\ i = 1, \dots, I;\ j = 1, \dots, J$: $\quad \frac{SS_{AB}}{\sigma^2} \sim \chi^2_{(I-1)(J-1)}$
The SS (sums of squares) are INDEPENDENTLY distributed.
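The sums of squares of Theorem A and the identity can be checked numerically; a minimal sketch with made-up balanced data (I = J = K = 2):

```python
# Two-way layout sums of squares (balanced design, K > 1 obs per cell),
# following the definitions of SS_A, SS_B, SS_AB, SS_E and the identity.
Y = [  # Y[i][j][k], made-up data with I = 2, J = 2, K = 2
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
]
I, J, K = len(Y), len(Y[0]), len(Y[0][0])

cell = [[sum(Y[i][j]) / K for j in range(J)] for i in range(I)]   # Y_bar_ij.
row = [sum(cell[i]) / J for i in range(I)]                        # Y_bar_i..
col = [sum(cell[i][j] for i in range(I)) / I for j in range(J)]   # Y_bar_.j.
grand = sum(row) / I                                              # Y_bar_...

SS_A = J * K * sum((r - grand) ** 2 for r in row)
SS_B = I * K * sum((c - grand) ** 2 for c in col)
SS_AB = K * sum((cell[i][j] - row[i] - col[j] + grand) ** 2
                for i in range(I) for j in range(J))
SS_E = sum((y - cell[i][j]) ** 2
           for i in range(I) for j in range(J) for y in Y[i][j])
SS_TOT = sum((y - grand) ** 2
             for i in range(I) for j in range(J) for y in Y[i][j])

# Identity: SS_A + SS_B + SS_AB + SS_E = SS_TOT
assert abs(SS_TOT - (SS_A + SS_B + SS_AB + SS_E)) < 1e-9
print(SS_A, SS_B, SS_AB, SS_E)  # 32.0 8.0 0.0 2.0
```

For this data set SS_AB = 0: the cell means are exactly additive, so there is no estimated interaction.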