Transcript Slide 1
A New Rule of Thumb for 2×2 Tables
with Low Expected Counts
Bruce Weaver
Northern Health Research Conference
June 4-5, 2010
NHRC 2010
1
Speaker Acceptance & Disclosure
I have no affiliations, sponsorships, honoraria,
monetary support or conflict of interest from any
commercial source.
However…it is only fair to caution you that this talk
has not undergone ethical review of any sort.
Therefore, you listen at your own peril.
A Very Common Problem
“One of the commonest
problems in statistics is the
analysis of a 2×2
contingency table.”
Ian Campbell
(Statist. Med. 2007; 26:3661–3675)
What’s a contingency table?
See the example on
the next slide.
Example: A 2×2 Contingency Table
What the heck is malocclusion?
Counts in the cells
Normal Occlusion vs. Malocclusion
Class I Occlusion: Normal occlusion. The upper teeth bite slightly ahead of the lowers.
Class II Malocclusion: Upper teeth bite greatly ahead of the lower teeth (i.e., overbite).
Class III Malocclusion: Upper front teeth bite behind the lower teeth (i.e., under-bite).
What statistical test can I use to analyze
the data in my contingency table?
It depends.
The Most Commonly Used Test
The most common statistical test for
contingency tables is Pearson’s chi-squared test of association.
Karl Pearson
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
χ = Greek letter chi; Σ = sum; O = observed count; E = expected count
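A minimal Python sketch of the formula on this slide, for anyone who wants to see the arithmetic. The function name and the example counts are invented for illustration and are not from the talk. Expected counts are computed as row total × column total / grand total.

```python
def pearson_chi_square(observed):
    """Pearson's chi-square for a table given as a list of rows of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count under H0: row total x column total / grand total
            e = row_totals[i] * col_totals[j] / grand_total
            chi2 += (o - e) ** 2 / e
    return chi2


# Hypothetical 2x2 table of counts (not data from the talk)
table = [[10, 20], [20, 10]]
print(round(pearson_chi_square(table), 3))  # 6.667
```

For this made-up table every expected count is 15, so the statistic is 4 × (5² / 15) ≈ 6.667.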
A Shortcut for 2×2 Tables Only
a  b  |  m
c  d  |  n
r  s  |  N
(m, n = row totals; r, s = column totals; N = grand total)
$$\chi^2 = \frac{N(ad - bc)^2}{mnrs}$$
But you can’t always use Pearson’s χ²
It is well known (to those who know it well)* that Pearson’s
chi-square is an approximate test
The sampling distribution of the test
statistic (under a true null hypothesis)
is approximated by a chi-square
distribution with df = (r-1)(c-1)
A typical chi-square distribution
The approximation becomes poor when the expected counts
(assuming H0 is true) are too low
* Robert Rankin, author of The Hollow Chocolate Bunnies of the Apocalypse.
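For a 2×2 table, df = (2-1)(2-1) = 1, and the chi-square approximation to the p-value can be sketched with the Python standard library alone. The function name is hypothetical. The identity used (a chi-square variable with 1 df is a squared standard normal) is standard, but this is an illustration rather than anything shown in the talk.

```python
import math


def chi_square_p_value_df1(chi2):
    """Upper-tail probability of a chi-square distribution with 1 df."""
    # A chi-square variable with 1 df is a squared standard normal, so
    # P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


# 3.841459 is the usual critical value for df = 1 at alpha = .05
print(round(chi_square_p_value_df1(3.841459), 3))  # 0.05
```

The p-value obtained this way is only as good as the approximation, which is why the expected counts matter.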
How low is too low for
expected counts?
It depends.
Again, it depends!
This guy is starting
to get on my
nerves.
A Rule of Thumb for 2×2 Tables
A common rule of thumb for when it’s OK to
analyze a 2×2 table with Pearson’s chi-squared test
of association says:
1) All expected counts should be 5 or greater
2) If any expected counts are < 5, another test should be
used
The most frequently recommended alternative test
under point 2 above is Fisher’s exact test (aka the
Fisher-Irwin test)
Some History
The standard rule of thumb for 2×2 tables dates back
to Cochran (1952, 1954), or even earlier
But, the minimum expected count of 5 appears to
have been an arbitrary choice (probably by Fisher)
Cochran (1952) suggested that it may need to be
modified when new evidence became available.
Computations by Ian Campbell (2007) have provided
some new & relevant evidence.
The Role of Research Design
Three distinct research designs
can give rise to 2×2 tables
Barnard (1947) classified them
as follows:
G.A. Barnard
Model I: Both row & column totals fixed in advance
Model II: Row totals fixed, column totals free to vary
Model III: Both row & column totals free to vary
Campbell on Model I
“Here, there is no dispute
that the Fisher–Irwin test …
should be used.”
Ian Campbell
“This last research design is
rarely used and will not be
discussed in detail.”
(Statist. Med. 2007; 26:3661–3675, emphasis added)
Review of Models II and III
Model II
Sometimes called the 2×2 comparative trial
Row totals fixed, column totals free to vary
E.g., researcher fixes group sizes for Treatment & Control
groups, or for Males & Females
Model III
Also called a cross-sectional study
Both row & column totals are free to vary
Only the total N is fixed
So what did Campbell do?
“Computer-intensive
techniques were used … to
compare seven two-sided
tests of two-by-two tables in
terms of their Type I errors.”
Ian Campbell
(Statist. Med. 2007; 26:3661–3675)
Let’s try that again…
Null hypothesis was always true – i.e., there was no
association between the row & column variables
Therefore, statistically significant results were Type I errors
For values of N ranging from 4 to 80, Campbell computed the
maximum probability of Type I error (with alpha set to .05)
He also examined all possible values of π, where π is the
proportion of subjects (in the population) having the binary
characteristic(s) of interest (e.g., the proportion of males,
or the proportion of smokers, etc.)
The statistical tests of interest
Campbell examined 7 different statistical tests
I will focus on only 2 of those tests today:
Pearson’s chi-square
The ‘N-1’ chi-square
Yoo-hoo! What’s the
‘N-1’ chi-square?
The ‘N-1’ chi-square
Pearson’s chi-square (shortcut for 2×2 tables only)
$$\chi^2 = \frac{N(ad - bc)^2}{mnrs}$$
a  b  |  m
c  d  |  n
r  s  |  N
The ‘N-1’ chi-square (for 2×2 tables only)
$$\chi^2 = \frac{(N - 1)(ad - bc)^2}{mnrs}$$
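The two shortcut formulas differ only in the leading factor, so the ‘N-1’ chi-square is Pearson’s chi-square multiplied by (N-1)/N. A small Python sketch under that reading follows. The function name and the example counts are made up for illustration.

```python
def chi_squares_2x2(a, b, c, d):
    """Return (Pearson chi-square, 'N-1' chi-square) for a 2x2 table."""
    m, n = a + b, c + d  # row totals
    r, s = a + c, b + d  # column totals
    N = m + n            # grand total
    core = (a * d - b * c) ** 2 / (m * n * r * s)
    return N * core, (N - 1) * core


# Hypothetical counts (not data from the talk)
pearson, n_minus_1 = chi_squares_2x2(10, 20, 20, 10)
print(round(pearson, 3), round(n_minus_1, 3))  # 6.667 6.556
```

The difference shrinks as N grows, which fits the talk’s focus on small samples.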
Whence the ‘N-1’ chi-square?
First derived by E.S. Pearson (1947)
Egon Sharpe Pearson, son of Karl
Derived again by Kendall & Stuart (1967)
Richardson (1994) asserted that it is “the appropriate
chi-square statistic to use in analysing all 2×2
contingency tables” (p. 116, emphasis added)
Campbell summarizes the theoretical argument for
preferring the N-1 chi-square on his website:
www.iancampbell.co.uk/twobytwo/n-1_theory.htm
Campbell’s Procedure
Campbell computed the maximum Type I error probability for:
N ranging from 4 to 80
Over all values of π
For minimum expected count = 0, 1, 3, and 5
He did all of that using both:
Pearson’s chi-squared test of association
The N-1 chi-squared test
Compared the actual Type I error rate to the nominal alpha
All of the above done for Models II and III separately
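To make the idea concrete, here is a rough Python sketch of an exact Type I error calculation for Model II (row totals fixed). It is not Campbell’s code and simplifies his procedure: the group sizes m and n are fixed by the caller, π is searched over a coarse grid, and no restriction is placed on the minimum expected count. All function names are hypothetical, and tables with a zero margin are treated as non-significant by assumption.

```python
import math

CRITICAL_VALUE = 3.841459  # chi-square critical value for df = 1, alpha = .05


def n_minus_1_chi_square(a, b, c, d):
    m, n, r, s = a + b, c + d, a + c, b + d
    if m * n * r * s == 0:
        return 0.0  # statistic undefined; assumed "not significant" here
    return (m + n - 1) * (a * d - b * c) ** 2 / (m * n * r * s)


def binom_pmf(k, size, pi):
    return math.comb(size, k) * pi ** k * (1 - pi) ** (size - k)


def type_i_error_rate(m, n, pi, statistic=n_minus_1_chi_square):
    """Exact rejection probability under H0 for fixed group sizes m and n."""
    rate = 0.0
    for a in range(m + 1):
        for c in range(n + 1):
            # Under H0 both groups share the same success probability pi
            if statistic(a, m - a, c, n - c) > CRITICAL_VALUE:
                rate += binom_pmf(a, m, pi) * binom_pmf(c, n, pi)
    return rate


def max_type_i_error(m, n, steps=99):
    """Largest rejection probability over an evenly spaced grid of pi values."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    return max(type_i_error_rate(m, n, pi) for pi in grid)


print(round(max_type_i_error(10, 10), 4))
```

Passing a different `statistic` function (for example, one that returns Pearson’s chi-square) allows the same comparison between tests.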
An Ideal Test
For an ideal test, the actual
proportion of Type I errors is equal
to the nominal alpha level
E.g., if you set alpha at .05, Type I
errors occur 5% of the time (when
the null hypothesis is true)
A Conservative Test
A test is conservative if
the actual Type I error rate is
lower than the nominal alpha
Conservative tests have low
power – they don’t reject H0
as often as they should (i.e.,
too many Type II errors)
A Liberal Test
A test is liberal if the
actual Type I error rate is
higher than the nominal
alpha
Liberal tests reject H0 too
easily, or too frequently
(i.e., too many Type I
errors)
Cochran’s Criterion for
Acceptable Test Performance
With discrete data (like counts) and small sample sizes, the
actual Type I error rate is generally not exactly equal to the
nominal alpha
Cochran (1942) suggested allowing a 20% error in the
actual Type I error rate—e.g., for nominal alpha = .05, an
actual Type I error rate between .04 and .06 is acceptable
Cochran’s criterion is admittedly arbitrary, but other authors
have generally followed it (or a similar criterion) – and
Campbell (2007) uses it.
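Cochran’s criterion amounts to a simple band around the nominal alpha. A small Python sketch of that check, with a hypothetical function name and labels chosen for illustration:

```python
def classify_type_i_rate(actual_rate, nominal_alpha=0.05, tolerance=0.20):
    """Label an actual Type I error rate relative to Cochran's 20% band."""
    lower = nominal_alpha * (1 - tolerance)  # .04 when alpha = .05
    upper = nominal_alpha * (1 + tolerance)  # .06 when alpha = .05
    if actual_rate < lower:
        return "too conservative"
    if actual_rate > upper:
        return "too liberal"
    return "acceptable"


print(classify_type_i_rate(0.045))  # acceptable
print(classify_type_i_rate(0.080))  # too liberal
```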
Figure 2A: Pearson chi-square (Model II)
with minimum E = 0, 1, 3, and 5
Minimum value of E
Maximum over
all values of π
.05 ± 20% (from Cochran)
For Model II, Pearson’s chi-squared
test meets Cochran’s criterion only if
the minimum E ≥ 5 (the blue line).
Figure 2B: N-1 chi-square (Model II)
with minimum E = 0, 1, 3, and 5
Minimum value of E
For Model II, the N-1 chi-squared test
meets Cochran’s criterion quite well
for expected counts as low as 1.
Figure 4A: Pearson chi-square (Model III)
with minimum E = 0, 1, 3, and 5
Minimum value of E
For Model III, Pearson’s chi-squared test meets Cochran’s
criterion fairly well for E as low as 3.
Figure 4B: N-1 chi-square (Model III)
with minimum E = 0, 1, 3, and 5
Minimum value of E
For Model III, the N-1 chi-squared
test meets Cochran’s criterion very
well for expected counts as low as 1.
Campbell’s New Rule of Thumb
for 2×2 Tables
For Model I – row & column totals both fixed
Use the two-sided Fisher Exact Test (as computed by SPSS)
Aka the Fisher-Irwin Test “by Irwin’s rule”
For Models II and III – comparative trials & cross-sectional
If all E ≥ 1, use the ‘N − 1’ chi-squared test
Otherwise, use the Fisher–Irwin Test by Irwin’s rule
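The rule on this slide can be written as a short decision function. The following Python sketch is one possible reading of it: the function name, the `model` argument, and the returned labels are all hypothetical. It relies on the fact that the smallest expected count in a 2×2 table is the smaller row total times the smaller column total, divided by N.

```python
def recommended_test(a, b, c, d, model="II"):
    """Suggest a test for a 2x2 table; model is 'I', 'II', or 'III'."""
    if model == "I":
        # Row and column totals both fixed in advance
        return "Fisher-Irwin test by Irwin's rule"
    m, n = a + b, c + d  # row totals
    r, s = a + c, b + d  # column totals
    N = m + n
    min_expected = min(m, n) * min(r, s) / N
    if min_expected >= 1:
        return "'N-1' chi-squared test"
    return "Fisher-Irwin test by Irwin's rule"


# Hypothetical tables (not data from the talk)
print(recommended_test(3, 1, 1, 3, model="II"))   # min E = 2
print(recommended_test(1, 0, 0, 1, model="III"))  # min E = 0.5
```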
Increased Power
Campbell’s new rule of thumb “extends the use of the chi-squared test to smaller samples … with a resultant increase
in the power to detect real differences.” (Campbell, 2007, p.
3674, emphasis added)
And as everyone knows, the
more power, the better!
Tim “the Stats-Man” Taylor & Al
Campbell’s Online Calculator
http://www.iancampbell.co.uk/twobytwo/calculator.htm
Computing the N-1 chi-square with SPSS
I have written 2 SPSS syntax files to compute the N-1 chi-square
Ian Campbell provides a link to them beside his online
calculator
A link to my two
SPSS syntax files
Questions?
Yeah, I have a
question. Did you
have to include
that picture?
Severe Malocclusion
References
Barnard GA. Significance tests for 2×2 tables. Biometrika 1947; 34:123–138.
Campbell I. Chi-squared and Fisher–Irwin tests of two-by-two tables with small sample
recommendations. Statist. Med. 2007; 26:3661–3675. [See also:
http://www.iancampbell.co.uk/twobytwo/twobytwo.htm]
Cochran WG. The χ2 test of goodness of fit. Annals of Mathematical Statistics 1952; 23:315–345.
Cochran WG. Some methods for strengthening the common χ2 tests. Biometrics 1954; 10:417–
451.
Kempthorne O. In dispraise of the exact test: reactions. Journal of Statistical Planning and
Inference 1979;3:199–213.
Kendall MG, Stuart A. The advanced theory of statistics, Vol. 2, 2nd Ed. London: Griffin, 1967.
Pearson ES. The choice of statistical tests illustrated on the interpretation of data classed in a
2×2 table. Biometrika 1947; 34:139–167.
Rankin R. The Hollow Chocolate Bunnies of the Apocalypse. Gollancz (August 1, 2003).
Richardson JTE. The analysis of 2×1 and 2×2 contingency tables: A historical review. Statistical
Methods in Medical Research 1994; 3:107–133.
The Cutting Room Floor
Etymology of rule of thumb
Some have claimed that the expression
rule of thumb derives from an old legal ruling
in England that allowed men to beat
their wives with a stick, provided it was
no thicker than their thumb
However, there is no solid evidence to support that claim
http://www.phrases.org.uk/meanings/rule-of-thumb.html
http://www.canlaw.com/rights/thumbrul.htm
http://womenshistory.about.com/od/mythsofwomenshistory/a/rule_of_thumb.htm
http://www.straightdope.com/columns/read/2550/does-rule-of-thumb-refer-to-an-old-law-permitting-wife-beating
An Important Topic
"The importance of the topic cannot be
stressed too heavily."
"2×2 contingency tables are the most
elemental structures leading to ideas
of association.... The comparison of two
binomial parameters runs through all
sciences."
Dr. Oscar Kempthorne
(J Stat Planning and Inf 1979;3:199–213, emphasis added)
Oscar Kempthorne (1919-2000)
Farm boy from Cornwall who became
a Cambridge-trained statistician
In 1941, he joined Rothamsted
Experiment Station, where he met
Ronald Fisher and Frank Yates
Strongly influenced by Fisher—e.g.,
areas of interest were experimental
design, genetic statistics, and
statistical inference
Kempthorne & Fisher
J.O. Irwin (1898-1982)
“J. O. Irwin was a soft spoken kind soul
who took a tremendous interest in his
students and their achievements.... He
was a lovable absent-minded kind of
professor who smoked more matches
than he did tobacco in his ever-present
pipe while he was deeply involved in
thinking about other important matters.”
Major Greenwood
“His old boss Pearson and his new boss
R. A. Fisher were bitter enemies but
Irwin's conciliatory nature allowed him to
remain on good terms with both men.”
From http://en.wikipedia.org/wiki/Joseph_Oscar_Irwin
A Variation on the Rule
A variation on that rule of thumb says that:
1) All expected counts should be 10 or greater.
2) If any expected counts are less than 10, but greater than
or equal to 5, Yates' Correction for continuity should be
applied. (However, the use of Yates' correction is
controversial, and is not recommended by all authors).
3) If any expected counts are less than 5, then some other
test should be used.
Again, the most frequently recommended alternative test
under point 3 has been Fisher’s exact test.
Figure 1: Maximum Type I error probability
for comparative trials (Model II)
Maximum over
all values of π
Cochran’s range:
± 20% of .05
Far too liberal if we
impose no restrictions
on minimum value of E
Arguably too
conservative for
smaller values of N
Figure 3: Maximum Type I error probability
for cross-sectional studies (Model III)
Too liberal if we
impose no restrictions
on minimum value of E
Again, the FET is
too conservative
Pearson’s chi-square
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
General formula for
contingency tables of any size
O = observed count
E = expected count (assuming a true null hypothesis)
Σ = Greek letter sigma & means to sum across all cells
I don’t remember what expected counts
are—can you explain that?
Of course. See
the next slide.
Example: A 5×2 Table
E = row total × column total / grand total
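The same formula in a short Python sketch, applied to every cell of a table. The function name is hypothetical and the 5×2 counts below are invented for illustration, since the slide’s own data are not reproduced in this transcript.

```python
def expected_counts(observed):
    """Expected counts under H0 for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # E = row total x column total / grand total, cell by cell
    return [[rt * ct / grand_total for ct in col_totals] for rt in row_totals]


# Hypothetical 5x2 table of counts (not the table from the slide)
table = [[12, 8], [9, 11], [3, 2], [20, 15], [6, 14]]
for row in expected_counts(table):
    print([round(e, 2) for e in row])
```

The expected counts in each row add up to that row’s observed total, which is a handy sanity check.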
How low is too low for
expected counts?
It depends.
If I had a dollar for
every time I heard
a statistician say
that, I’d be rich.
It depends on the table dimensions
For contingency tables larger than 2×2, the chi-square approximation is pretty good if:
“…no more than 20% of the expected
counts are less than 5 and all individual
expected counts are 1 or greater.”
(Yates, Moore & McCabe, 1999, p. 734)
Many people do not know this, and mistakenly assume that
all expected counts must be 5 or more for tables of any size
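The quoted guideline is easy to check mechanically. A minimal Python sketch, assuming the expected counts have already been computed. The function name and the example values are hypothetical.

```python
def chi_square_approximation_ok(expected):
    """Check the guideline for tables larger than 2x2.

    `expected` is a list of rows of expected counts.
    """
    cells = [e for row in expected for e in row]
    share_below_5 = sum(e < 5 for e in cells) / len(cells)
    # No more than 20% of cells below 5, and every cell at least 1
    return share_below_5 <= 0.20 and min(cells) >= 1


# Hypothetical expected counts for a 5x2 table: 2 of 10 cells are below 5
expected = [[12.0, 9.0], [8.5, 6.5], [3.2, 2.4], [15.0, 11.0], [7.0, 5.4]]
print(chi_square_approximation_ok(expected))  # True
```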
Example 1: A 5×2 Contingency Table
Each person is classified on 2 different categorical variables
Each person appears in only one cell of the table
Expected Counts for the 5×2 Table
Two of 10 cells (20%) have E < 5; but all E >= 1
La-la-la-la-la …
MAJOR NERD ALERT
Fisher’s Exact Test
Fisher’s formula for working out the exact probability of an
observed set of counts (and of more extreme sets under H0):
$$p = \frac{(a + b)!\,(c + d)!\,(a + c)!\,(b + d)!}{N!\,a!\,b!\,c!\,d!} = \frac{m!\,n!\,r!\,s!}{N!\,a!\,b!\,c!\,d!}$$
a  b  |  m
c  d  |  n
r  s  |  N
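A Python sketch of the calculation, using binomial coefficients rather than raw factorials (the two forms are algebraically the same). The two-sided p-value follows Irwin’s rule as I understand it: sum the probabilities of all tables with the same margins that are as likely as, or less likely than, the observed table. Function names and the example counts are hypothetical.

```python
from math import comb


def table_probability(a, b, c, d):
    """Probability of a 2x2 table when all margins are fixed."""
    m, n, r = a + b, c + d, a + c
    # Equivalent to m! n! r! s! / (N! a! b! c! d!)
    return comb(m, a) * comb(n, c) / comb(m + n, r)


def fisher_irwin_two_sided(a, b, c, d):
    """Two-sided p-value by Irwin's rule for a 2x2 table."""
    m, n, r = a + b, c + d, a + c
    p_observed = table_probability(a, b, c, d)
    p_value = 0.0
    # Enumerate every value of the top-left cell allowed by the margins
    for x in range(max(0, r - n), min(m, r) + 1):
        p = table_probability(x, m - x, r - x, n - (r - x))
        if p <= p_observed * (1 + 1e-7):  # tolerance for floating-point ties
            p_value += p
    return p_value


# Hypothetical table (not data from the talk)
print(round(fisher_irwin_two_sided(3, 1, 1, 3), 4))  # 0.4857
```

For this table the five possible tables have probabilities 1/70, 16/70, 36/70, 16/70 and 1/70, so the two-sided p-value is 34/70.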
Kendall & Stuart’s Derivation
of the ‘N-1’ Chi-square
For Model I, if a is known, b, c, and d can be worked out
using the fixed row & column totals
Kendall & Stuart demonstrated that under a true null
hypothesis, a is asymptotically normal with:
$$\text{Mean} = \frac{(a + b)(a + c)}{N}$$
i.e., row total × column total divided by grand total
$$\text{Variance} = \frac{(a + b)(c + d)(a + c)(b + d)}{N^2 (N - 1)}$$
Therefore…
$$z = \frac{a - \dfrac{(a + b)(a + c)}{N}}{\sqrt{\dfrac{(a + b)(c + d)(a + c)(b + d)}{N^2 (N - 1)}}}$$
N-1 chi-square
$$z^2 = \frac{(N - 1)(ad - bc)^2}{(a + b)(c + d)(a + c)(b + d)} = \chi^2, \quad df = 1$$
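The step from z to the N-1 chi-square is easier to follow once the numerator of z is simplified. A brief sketch of the algebra, using only N = a + b + c + d:

```latex
a - \frac{(a + b)(a + c)}{N}
  = \frac{a(a + b + c + d) - (a + b)(a + c)}{N}
  = \frac{ad - bc}{N}

z^2 = \frac{(ad - bc)^2}{N^2} \cdot \frac{N^2 (N - 1)}{(a + b)(c + d)(a + c)(b + d)}
    = \frac{(N - 1)(ad - bc)^2}{(a + b)(c + d)(a + c)(b + d)}
```

The N² terms cancel, leaving the (N - 1) factor in place of the N in Pearson’s shortcut formula.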
END OF MAJOR NERD ALERT
J.T.E. Richardson on the N-1 chi-square
“It will become clear later that
[the N-1 chi-square] rather than
[Pearson’s chi-square] is in fact
the appropriate chi-square
statistic to use in analysing all
2×2 contingency tables
regardless of the underlying
model.” (Richardson, 1994, p. 116,
emphasis added)
J.T.E. Richardson
What is the Purpose of Research?
“The purpose of most
research is to discover
relations—relations
between or among
variables or between
treatment interventions
and outcomes.”
Dr. David Streiner
(Can J Psychiatry 2002;47:262–266)
What is the Role of Statistical Tests?
They test the null hypothesis that in
the population from which you have
sampled, there is no association
between the variables.
So when you reject the null
hypothesis, you infer that there is
an association between the
variables (in the population).
Yours truly