The Analysis of Variance

One-Way ANOVA


We use ANOVA when we want to
look at statistical relationships
(difference in means for example)
between more than 2 populations or
samples
ANOVA is a natural extension of
ideas used in 2-pop t-tests and
other methods we have explored
Trouble on the School Board!



Despite the school board’s best efforts –
sensitive test score data for a large urban
school district was leaked to the press!
The issue is a long-standing argument
that children in the inner city do not
receive the same quality of education as
do children in the suburban parts of the
city. This could be very embarrassing for
both the board and the mayor!
Here’s the data
NOT SO FAST!
Take a closer look
at the data –
check for
“structure”
A school board official states: “The data is roughly normally distributed and
is what you would expect for a random sample of 90 students – 30 from
each of the East, Central and West districts”

Our investigative reporter took Stats 300 in
college! Here is what she did:

Sort the data into East, Central and West “bins”

The box plot suggests a cover-up!
Digging further… the full set becomes
Further tests…Thanks StatsMan!
Summary of the 3 data sets:

Is there a statistical hypothesis
lurking about?
The Hypotheses

Let μ1, μ2, and μ3 be the mean
scores for the three populations:
Pop 1 = East
Pop 2 = Central
Pop 3 = West


Ho: μ1 = μ2 = μ3

Ha: ?
The null hypothesis is pretty
straightforward
Why is this a problem?
Could we do this with pairwise t-tests?

YES!
What does this imply?
We have good evidence to reject the null hypothesis – the central
district scores are statistically lower than the other two districts.
Could we just use pairwise t-tests?

If we had 12 school districts that we
were testing in the same way as the
previous case – how would the
analysis change?


How many pairs
How many false positives would we get
at a 95% Confidence level?
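
The two questions above can be answered with a quick count. This is a minimal sketch, assuming each pair of districts gets its own two-sample test at the 5% level and, as a simplification (the pairs share data), that the tests are independent. The function name `pairwise_burden` is made up for illustration.

```python
from math import comb

def pairwise_burden(n_groups, alpha=0.05):
    """Count pairwise comparisons and the false positives they invite."""
    n_pairs = comb(n_groups, 2)        # number of distinct pairs of groups
    expected_fp = n_pairs * alpha      # expected false positives if Ho holds for every pair
    # Chance of at least one false positive, ASSUMING independent tests
    familywise = 1 - (1 - alpha) ** n_pairs
    return n_pairs, expected_fp, familywise

pairs, expected_fp, familywise = pairwise_burden(12)
print(pairs, round(expected_fp, 1), round(familywise, 3))
```

With 12 districts there are 66 pairs, so about 3.3 false positives are expected even when every district has the same mean, and under the independence assumption at least one false positive is very likely.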
Why we can’t use multiple pairs of t-tests, or why we should consider the
entire set:
1. Decreases the chance of false positives: as the number of pairs increases, the chance of a false positive, or erroneous conclusion on the null hypothesis, increases.
2. Pooling gives more precision in the statistics: by pooling all of the information (not just pairs) we get a much more precise value for the standard deviation in the population.
3. Detect interesting correlations: by treating all of the data we can potentially detect interesting correlations between subgroups. This could easily be overlooked if we approached the data in a pair-wise fashion.
Setting up for ANOVA


You guessed it – yet more
terminology!
In 12.1 and 12.2 we will introduce:



A method to get an estimate for the
standard deviation σ for the entire
population (Pooled Estimator)
A new spin on degrees of freedom (df)
A new test for significance – the F-test
Pooled Estimator for σ

This is a generalization of the
method we used in pooled two-sample t-tests:
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + \cdots + (n_I - 1)s_I^2}{(n_1 - 1) + (n_2 - 1) + \cdots + (n_I - 1)}$$
This expression begins to measure the total variation
in a population. Each s_i^2 term measures variation
within a given sample. “I” represents the total
number of independent SRS’s.
Sigma Rule…

If the largest standard deviation in
a set of I SRS’s is less than twice as
large as the smallest then we can
approximate the standard deviation
by using the pooled estimator.
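
The rule of thumb is easy to check mechanically. Here is a small sketch; the helper name `pooling_is_reasonable` is hypothetical, and the three standard deviations are the ones quoted in the school district example.

```python
def pooling_is_reasonable(sample_sds):
    """Rule of thumb: the largest SD must be less than twice the smallest."""
    sds = list(sample_sds)
    return max(sds) < 2 * min(sds)

# Sample standard deviations quoted in the school district example
district_sds = [35.04, 33.56, 26.13]
ratio = max(district_sds) / min(district_sds)
print(pooling_is_reasonable(district_sds), round(ratio, 2))
```

The ratio of largest to smallest is about 1.34, comfortably under 2, so pooling looks reasonable for this data.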
Example: What is the pooled estimate
for sigma for the 3 school districts?


I = 3 (East, Central, West are SRS’s)
n1=n2=n3=30
$$s_p^2 = \frac{(30-1)(35.04)^2 + (30-1)(33.56)^2 + (30-1)(26.13)^2}{(30-1) + (30-1) + (30-1)} = 1012.28, \qquad s_p \approx 31.8$$
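
The arithmetic of this example can be reproduced in a few lines. This is a sketch that takes the sample sizes and standard deviations from the example as given; `pooled_variance` is an illustrative name, not something from the course software.

```python
from math import sqrt

def pooled_variance(sizes, sds):
    """Pooled estimate of the common variance from I independent samples."""
    numerator = sum((n - 1) * s ** 2 for n, s in zip(sizes, sds))
    denominator = sum(n - 1 for n in sizes)   # equals N - I
    return numerator / denominator

sizes = [30, 30, 30]
sds = [35.04, 33.56, 26.13]
sp2 = pooled_variance(sizes, sds)
sp = sqrt(sp2)
print(round(sp2, 2), round(sp, 1))
```

Because the three samples are the same size, the pooled variance here is simply the average of the three sample variances.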
Part II – Developing the F-Test

Conceptual Model

A collection of
SRS’s drawn from
a larger population
illustrates two
different kinds of
variation:
Internal variation around a sample mean within a given SRS
Variation of the SRS means about the overall population mean
Ways of quantifying variation





ANOVA compares the two kinds of
variability
The null hypothesis often is equivalent to
saying that the populations overlap (have
the same mean for example)
Another way of saying this is that the
SRS’s share the “grand mean” of the
entire population
This could happen if the individual SRS’s
have large variation within each sample but
little variation between the sample means
We need a way to quantify this
The F-Value


We can compare variation between
samples with the variation within samples
by calculating the Mean Square of the
error in both cases.
This is expressed as:
$$F = \frac{MS(\text{between})}{MS(\text{within})}$$

We will get to F-distributions in a few
moments
Mean Square Error – MSE(within)

This is what the pooled estimator
determines:
$$s_p^2 = MSE(\text{within})$$

This means that our school board
data has an internal MSE of (31.8)^2
Mean Square Error – MSE(between)

To determine this we need the
“grand mean” for all of the data:
Mean Square Error – MSE(between)

Define as:
$$MS(\text{between}) = \frac{\sum_i n_i(\bar{x}_i - \mu_{\text{grand}})^2}{df(\text{between})} = \frac{\sum_i n_i(\bar{x}_i - \mu_{\text{grand}})^2}{I - 1}$$
A new application of the idea of degrees of freedom
Example – school board data:
$$MS(\text{between}) = \frac{30(649 - 611)^2 + 30(548 - 611)^2 + 30(635 - 611)^2}{3 - 1} = 89835$$
We can now determine the “F-Value” for this data:
$$F = \frac{MS_b}{MS_w} = \frac{89835}{(31.8)^2} \approx 88.8$$
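
The same calculation can be sketched from the summary statistics alone. The sample means (649, 548, 635) and the pooled variance 1012.28 are the values used in the slides; the function name `ms_between` is an assumption for illustration. The grand mean is computed without rounding here, so the results differ slightly from the slide figures, which round the grand mean to 611 and s_p to 31.8.

```python
def ms_between(sizes, means):
    """Mean square between groups: weighted squared deviations of the
    sample means from the grand mean, divided by I - 1."""
    grand_mean = sum(n * m for n, m in zip(sizes, means)) / sum(sizes)
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means))
    return ss_between / (len(sizes) - 1), grand_mean

sizes = [30, 30, 30]
means = [649, 548, 635]   # sample means used in the slides
ms_w = 1012.28            # pooled variance s_p^2 from the earlier example

ms_b, grand_mean = ms_between(sizes, means)
f_ratio = ms_b / ms_w
print(round(grand_mean, 1), round(ms_b), round(f_ratio, 1))
```

Whichever rounding is used, MSb comes out near 89,800 and F lands between roughly 88.7 and 88.8.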
I Don’t Get It!

Confused? We are almost
there.


We now know how to
quantify the variation within
SRS’s (MSw) and the
variation between the
means of the SRS’s (MSb)
The “F-ratio” can be
compared against tables
just like we did for z-tests
and t-tests
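
To tie the two pieces together, here is a minimal sketch that computes MSb, MSw and F directly from raw samples. The function name `one_way_f` and the toy numbers are made up for illustration (they are not the school district scores), and the within-groups degrees of freedom are taken as N - I, the same count that forms the denominator of the pooled estimator.

```python
from statistics import mean

def one_way_f(groups):
    """Return (MSb, MSw, F) for a list of samples (each a list of numbers)."""
    n_total = sum(len(g) for g in groups)
    n_groups = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Between: how far each sample mean sits from the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within: spread of each observation around its own sample mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    ms_b = ss_between / (n_groups - 1)
    ms_w = ss_within / (n_total - n_groups)   # same as the pooled variance
    return ms_b, ms_w, ms_b / ms_w

# Made-up toy data, NOT the school district scores
toy = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
ms_b, ms_w, f_ratio = one_way_f(toy)
print(ms_b, ms_w, f_ratio)
```

In the toy data the sample means (2, 3 and 6) are far apart relative to the spread inside each sample, so F comes out near 13, well above 1.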
How to Use an “F-ratio”

You need to know some important
numbers:
$$F = \frac{MS_b}{MS_w}$$
with MSb the numerator and MSw the denominator.
The number of SRS’s (I): from this we
form the degrees of freedom for the MSb
term: dfb = I-1
The total number of data points (the
pooled data) = N: from this we form the
degrees of freedom for the MSw term,
dfw = N-I (the denominator of the pooled estimator)
The F-ratio tests the null hypothesis (i.e.
that the means are equal)
If Ho is true then F ≈ 1

Testing the School Board’s Claim



The school board’s claim was that there
was no difference between the three
districts’ mean test scores.
Since there were 90 students (N=90) and
3 groups (I=3) we should use the F(I-1, N-I) = F(2,87) distribution
So … use Table E with F = 88.8 on the F(2,87) distribution.
Since this is not listed we need to
approximate. You should be able to
bracket the p-value between an upper
and a lower bound.
With an F-ratio as big as 88.8 you really
don’t normally need to look it up: the evidence against Ho is overwhelming!
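
If no table is at hand, the tail probability can be sketched directly. This relies on a closed form that holds only when the numerator degrees of freedom equal 2, which is the case for three groups; it also takes the within-groups degrees of freedom to be N - I = 87, the count in the denominator of the pooled estimator. The function name `f_tail_prob_df1_2` is hypothetical.

```python
def f_tail_prob_df1_2(f_value, df_within):
    """P(F >= f_value) for an F(2, df_within) distribution.

    Closed form valid ONLY when the numerator df is 2:
    P(F >= f) = (1 + 2 f / df_within) ** (-df_within / 2)
    """
    return (1 + 2 * f_value / df_within) ** (-df_within / 2)

N, I = 90, 3
df_between = I - 1   # 2, so the closed form applies
df_within = N - I    # 87 (assumption: within df is N - I)

p_value = f_tail_prob_df1_2(88.8, df_within)
print(f"p-value is about {p_value:.1e}")
```

The result is vanishingly small, far below any conventional significance level, which agrees with the remark that an F-ratio this large rarely needs a table lookup. For other numbers of groups a statistics package or an F table is needed instead.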
Use Minitab or EXCEL

Life is short! ANOVA is a complex
(number intensive) process. Let’s
look at two approaches:

Minitab
Next lecture …



We will spend next lecture working
through several examples of ANOVA
When doing this keep in mind what
it is that you are calculating
Don’t get overwhelmed by the
detail!