Goodness-of-fit tests (and further issues)

(Session 16)
SADC Course in Statistics
Learning Objectives
By the end of this session, you will be able to
• conduct and interpret results from a chi-square test for testing the goodness-of-fit
of data to a particular distribution
• understand how two-way contingency
tables can be further examined by looking at
their residuals
• present results from a standard chi-square
test, paying attention to the table’s
summary features
To put your footer here go to View > Header and Footer
Goodness-of-fit tests
• In previous sessions, we have seen that
many tests are based on the assumption of
normality
• On some occasions, it is also important to
ascertain whether the data follow other
distributions, e.g. the binomial or Poisson
distributions
• We shall now look at how the chi-square
test can be applied to examine the extent
to which assumptions concerning the
distribution of a given variable hold
Goodness-of-fit tests
• The basic idea is first to calculate the
probability of each possible value occurring
• e.g. the number of cows getting a disease on a
farm which has 6 cows may be assumed to
follow a binomial distribution.
• e.g. the number of visits made by a
pregnant woman in a region to the region’s
single antenatal clinic may be assumed to
follow a Poisson distribution.
Can we check these assumptions before
subjecting the data to tests based on them?
Goodness-of-fit test: Normal distn
• Because the Normal distribution applies to a
continuous random variable, it is necessary
to group the data and obtain observed
frequencies in each group.
• The next step is to determine the probability
of an observation falling in each group, and
hence the expected value.
• The chi-square test can then be applied in
the usual way: the d.f. being number of
groups – 1 – number of parameters
estimated in computing expected values.
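
The grouping steps above can be sketched in code. This is a minimal illustration using only the Python standard library; the function names, the group limits and the mean and standard deviation in the example call are hypothetical and do not come from the course data.

```python
from statistics import NormalDist

def expected_normal_counts(upper_limits, mean, sd, n_obs):
    """Expected frequency in each group under a Normal(mean, sd) model.

    upper_limits: increasing upper limits of every group except the last,
    which is taken to be open-ended (all values above the final limit).
    """
    dist = NormalDist(mu=mean, sigma=sd)
    cdf = [0.0] + [dist.cdf(b) for b in upper_limits] + [1.0]
    # Probability of a group = difference between successive CDF values
    probs = [hi - lo for lo, hi in zip(cdf[:-1], cdf[1:])]
    return [n_obs * p for p in probs]

def gof_degrees_of_freedom(n_groups, n_estimated_params):
    """d.f. = number of groups - 1 - number of parameters estimated."""
    return n_groups - 1 - n_estimated_params

# Hypothetical values, for illustration only
expected = expected_normal_counts([100, 125, 150], mean=130.0, sd=30.0, n_obs=50)
df = gof_degrees_of_freedom(len(expected), n_estimated_params=2)
```

The expected frequencies always add up to the number of observations, which is a useful check before computing the chi-square statistic.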
An example: Normal distn
• Consider the total rainfall in June at a
particular site from 1928 to 1983. Suppose
we wish to test the assumption that these data
follow a normal distribution
• A histogram for the data appears below.
[Figure: histogram of the June rainfall totals. Horizontal axis: Rainfall totals, grouped as <=100, to 125, to 150, to 175, to 200, to 225, to 250, > 250. Vertical axis: Frequency, from 0 to 14.]
An example: Normal distn
Expected values are now calculated for each group, assuming a normal distribution.
The table shows observed and expected frequencies.

RainTotal   Observed   Expected
<=100           4        6.86
to 125         11        7.45
to 150         12       10.31
to 175          9       11.12
to 200          9        9.33
to 225          6        6.10
to 250          3        3.11
> 250           2        1.72
Totals         56       56

The chi-square value is 3.6 with d.f.=5 (8 groups - 1 - 2 estimated parameters, the mean and standard deviation).
P-value = 0.6083.
Conclusions?
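
As a check on the figures quoted for the rainfall example, the chi-square statistic can be recomputed from the observed and expected frequencies in the table. This is a small sketch in plain Python; the function name is hypothetical, and the count of two estimated parameters (mean and standard deviation) is an assumption inferred from the quoted d.f.

```python
# Observed and expected frequencies copied from the rainfall table
observed = [4, 11, 12, 9, 9, 6, 3, 2]
expected = [6.86, 7.45, 10.31, 11.12, 9.33, 6.10, 3.11, 1.72]

def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all groups."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

stat = chi_square_statistic(observed, expected)

# Assumption: mean and standard deviation were both estimated from the data
df = len(observed) - 1 - 2

print(round(stat, 1), df)  # prints: 3.6 5
```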
An example: Binomial distn
• First recall (from Module H1) the form of the
probability function for the binomial
random variable with parameters n and p,
where p is the probability of a “success” in each of a
sequence of n trials, each trial having just 2
possible outcomes.
• The number of successes (X) in n trials has a
binomial distribution.
\[
P(X = k) = \frac{n!}{k!\,(n-k)!}\, p^{k} (1-p)^{n-k}, \qquad k = 0, 1, \dots, n
\]
• This formula gives the binomial probabilities,
obtained also from Excel’s function
Binomdist(x,n,p,false).
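
For readers working outside Excel, the same probabilities can be computed directly from the binomial formula. The sketch below uses only the Python standard library; the function name `binomial_pmf` and the values n = 6, p = 0.1 are hypothetical choices for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for a binomial variable with n trials and success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Hypothetical illustration: 6 trials, each with an assumed success probability of 0.1
n, p = 6, 0.1
probs = [binomial_pmf(k, n, p) for k in range(n + 1)]
```

The probabilities for k = 0, 1, ..., n should sum to 1, which gives a quick sanity check on the calculation.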
An example: Binomial distn
Suppose we have a binomial variable with observed values as shown (n=7, p=0.222).
Expected values can be derived using [P(X=k)]*404.

k         Observed   Expected
0             81       69.7
1            130      139.2
2            129      119.2
3             37       56.7
4             14       16.2
5,6,7         23        3.0
Totals       404      404

The chi-square value is 141.3 with d.f.=4 since p has been estimated from the data.
p-value = 0.000
What are your conclusions?
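
The expected column for this example can be reproduced from the stated values n = 7, p = 0.222 and a total of 404 observations. This is a minimal sketch in plain Python; the helper name `binomial_pmf` is hypothetical, and the last three values of k are pooled as on the slide.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for a binomial variable with parameters n and p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p, n_obs = 7, 0.222, 404

probs = [binomial_pmf(k, n, p) for k in range(n + 1)]
# Pool k = 5, 6, 7 into a single group, as in the table
pooled = probs[:5] + [sum(probs[5:])]
expected = [round(n_obs * q, 1) for q in pooled]

# One parameter (p) estimated from the data
df = len(expected) - 1 - 1

print(expected)  # prints: [69.7, 139.2, 119.2, 56.7, 16.2, 3.0]
```

Note that the pooled group still has an expected frequency of only about 3, so the chi-square approximation should be treated with some caution here.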
Other issues
There are two more issues to discuss
concerning chi-square tests for testing the
association between two categorical variables.
These relate to
• further examination of the table of
frequencies when a significant result is found;
and
• how to present the results
Example of Session 15
For the data below, we found a significant chi-square
value, with p=0.0024, i.e. evidence that the
proportion of diseased animals is not the same
for all vaccines.

Vaccine   diseased   healthy   Total
A             43       237      280
B             52       198      250
C             25       245      270
D             48       212      260
E             57       233      290
Total        225      1125     1350

Question:
But what contributes most to the chi-square statistic?
i.e. which cells depart most from Pr(diseased)=0.167?
Cell contributions to chi-square:
Table gives the chi-square contribution of each cell, i.e. the values (O-E)^2/E.

Vaccine   diseased   healthy
A           0.288     0.057
B           2.563     0.512
C           8.889     1.778
D           0.502     0.100
E           1.554     0.311

Rule of thumb:
Focus on cells with values > 4 and, in larger tables, focus on those > 9.
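
The cell contributions can be recomputed from the vaccine counts given for this example. The sketch below is plain Python with hypothetical variable names; the expected count in each cell is taken as row total × column total / grand total, and the p-value line uses a closed form of the chi-square upper tail that holds for 4 d.f. only.

```python
from math import exp

# (diseased, healthy) counts for each vaccine, from the Session 15 example
table = {"A": (43, 237), "B": (52, 198), "C": (25, 245), "D": (48, 212), "E": (57, 233)}

col_totals = [sum(row[j] for row in table.values()) for j in range(2)]
grand_total = sum(col_totals)

contributions = {}
for vaccine, row in table.items():
    row_total = sum(row)
    # Expected count under H0: row total * column total / grand total
    expected = [row_total * c / grand_total for c in col_totals]
    contributions[vaccine] = [(o - e) ** 2 / e for o, e in zip(row, expected)]

chi_square = sum(sum(cells) for cells in contributions.values())
df = (len(table) - 1) * (2 - 1)

# Upper-tail probability; this closed form is valid for df = 4 only
p_value = exp(-chi_square / 2) * (1 + chi_square / 2)
```

Here `contributions["C"]` comes out at roughly 8.889 and 1.778, and the p-value rounds to 0.0024, in line with the figures quoted in the text.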
Standardised residuals
Better still, use standardised residuals so that signs are also included, i.e. use
SR = (O-E)/√E.
Rule of thumb:
Focus on |SR| > 2, or in larger tables, focus on those with |SR| > 3.

Vaccine   diseased   healthy
A          -0.54       0.24
B           1.60      -0.72
C          -2.98       1.33
D           0.71      -0.33
E           1.25      -0.56

Conclusion:
Vaccine C shows the largest discrepancy from H0.
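
A short sketch of the standardised residual calculation, using the vaccine counts from the Session 15 example. It is written in plain Python with hypothetical variable names, takes the residual as (O - E) divided by the square root of E, and flags cells by the rule of thumb of an absolute value above 2.

```python
from math import sqrt

columns = ("diseased", "healthy")
table = {"A": (43, 237), "B": (52, 198), "C": (25, 245), "D": (48, 212), "E": (57, 233)}

col_totals = [sum(row[j] for row in table.values()) for j in range(2)]
grand_total = sum(col_totals)

std_residuals = {}
for vaccine, row in table.items():
    row_total = sum(row)
    expected = [row_total * c / grand_total for c in col_totals]
    # Signed residual: negative means fewer cases than expected under H0
    std_residuals[vaccine] = [(o - e) / sqrt(e) for o, e in zip(row, expected)]

# Cells whose standardised residual exceeds 2 in absolute value
flagged = [
    (vaccine, columns[j])
    for vaccine, residuals in std_residuals.items()
    for j, r in enumerate(residuals)
    if abs(r) > 2
]
print(flagged)  # prints: [('C', 'diseased')]
```

The negative sign for vaccine C in the diseased column shows that fewer animals were diseased than expected, which the unsigned chi-square contributions cannot reveal.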
Presentation of results
In this example, it would be appropriate to present a table of the percentage
of animals diseased under each vaccine.

Vaccine   % DISEASED
C            9.3%
A           15.4%
D           18.5%
E           19.7%
B           20.8%

Sorting the table from the most to the least effective vaccine makes the results easier to see.
Note there are more advanced methods, e.g.
modelling, to make specific comparisons
between the above percentages.
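
The sorted percentage table can be produced with a few lines of code. This is a minimal sketch in plain Python using the vaccine counts from the Session 15 example; the variable names are hypothetical.

```python
# (number diseased, total animals) for each vaccine
counts = {"A": (43, 280), "B": (52, 250), "C": (25, 270), "D": (48, 260), "E": (57, 290)}

percent_diseased = {v: 100 * d / n for v, (d, n) in counts.items()}

# Sort from the lowest to the highest percentage diseased
ranked = sorted(percent_diseased.items(), key=lambda item: item[1])

for vaccine, pct in ranked:
    print(f"{vaccine}  {pct:.1f}%")
```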
Presentation: Example from Sess 14
Recall the results below from Session 14. The test of
association gave p=0.000.

                     Usually sleep under a mosquito net?
Suffered malaria?      Yes         No        Total
Yes                    649        3849        4498
                      62.5%       55.8%       56.6%
No                     390        3055        3445
                      37.5%       44.2%       43.4%
Total                 1039        6904        7943
                     100.0%      100.0%      (100%)
Presentation and conclusions
Test results indicate that there is an
association between use of a mosquito net
and incidence of malaria. However, the
direction of the association is unexpected. Note:
malaria incidence
for those using a net = 62.5%
for those not using a net = 55.8%.
This emphasises the danger of ignoring other
factors that may affect malaria incidence,
e.g. altitude, housing conditions, etc.
Further, could it be that those who had
malaria then started using mosquito nets?
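
The incidences quoted above, and the strength of the association, can be checked from the counts in the Session 14 table. The sketch below is plain Python with hypothetical variable names. It assumes a Pearson chi-square without continuity correction and uses the fact that, for 1 d.f., the chi-square upper tail equals the two-sided tail of a standard normal variable.

```python
from math import erfc, sqrt

# Rows: suffered malaria (yes, no); columns: usually sleeps under a net (yes, no)
counts = [[649, 3849], [390, 3055]]

row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]
n = sum(row_totals)

# Malaria incidence (%) among net users and non-users
incidence = [100 * counts[0][j] / col_totals[j] for j in range(2)]

chi_square = 0.0
for i in range(2):
    for j in range(2):
        expected = row_totals[i] * col_totals[j] / n
        chi_square += (counts[i][j] - expected) ** 2 / expected

# For 1 d.f.: P(chi-square > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2))
p_value = erfc(sqrt(chi_square / 2))
```

The p-value is far below 0.001, consistent with the reported p=0.000, but as the slide stresses, a small p-value says nothing about why net users show the higher incidence.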
Some final remarks
• Performing a chi-square analysis is simple, but
it does not take account of other factors that
may affect the results.
• More advanced (e.g. log-linear modelling)
procedures do exist for exploring factors
affecting a categorical response, here use of a
bednet.
• Recall that the chi-square test is an
approximation. This approximation is poor if
the expected frequencies are very small (e.g.
< 5). Try collapsing some rows or columns if
this happens.
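
One possible way to collapse small cells is sketched below for a one-way table of ordered groups, using the observed and expected frequencies from the rainfall example. The function name and the pooling strategy (accumulate neighbouring groups until the expected frequency reaches 5, and fold any leftover tail into the last pooled group) are assumptions for illustration, not a prescribed method.

```python
def pool_small_cells(observed, expected, min_expected=5.0):
    """Merge neighbouring groups until each expected frequency is at least min_expected."""
    pooled_obs, pooled_exp = [], []
    acc_obs, acc_exp, pending = 0, 0.0, False
    for o, e in zip(observed, expected):
        acc_obs += o
        acc_exp += e
        pending = True
        if acc_exp >= min_expected:
            pooled_obs.append(acc_obs)
            pooled_exp.append(acc_exp)
            acc_obs, acc_exp, pending = 0, 0.0, False
    if pending:
        if pooled_exp:
            # Leftover small groups at the end join the last pooled group
            pooled_obs[-1] += acc_obs
            pooled_exp[-1] += acc_exp
        else:
            pooled_obs.append(acc_obs)
            pooled_exp.append(acc_exp)
    return pooled_obs, pooled_exp

# Rainfall example: the last two groups have expected frequencies below 5
observed = [4, 11, 12, 9, 9, 6, 3, 2]
expected = [6.86, 7.45, 10.31, 11.12, 9.33, 6.10, 3.11, 1.72]
pooled_obs, pooled_exp = pool_small_cells(observed, expected)
```

After pooling, the number of groups has changed, so the degrees of freedom must be recalculated before the statistic is referred to the chi-square distribution.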
Some practical work follows…