Tutorial on the chi2 test for goodness-of-fit testing
Mobile Computing Group
A quick-and-dirty tutorial on the chi2 test
for goodness-of-fit testing
Outline
The presentation follows the pyramid schema
Chi2 tests for GoF
Goodness-of-fit (GoF)
Background -concepts
Background
Descriptive vs. inferential statistics
–
Descriptive : data used only for descriptive purposes (use
tables, graphs, measures of variability etc.)
–
Inferential : data used for drawing inferences, make
predictions etc.
Sample vs. population
–
A sample is drawn from a population, assumed to have some
characteristics.
–
The sample is often used to make inferences about the
population (inferential statistics) :
Hypothesis testing
Estimation of population parameters
Background
Statistic vs. parameter
–
A statistic is computed (estimated) from a sample. It can be
used for both descriptive and inferential purposes
–
A parameter refers to the whole population. A sample
statistic is often used to infer a population parameter
Example : the sample mean may be used to infer the population
mean (expected value)
Hypothesis testing
–
A procedure where sample data are used to evaluate a
hypothesis regarding the population
–
A hypothesis may refer to several things : properties of a
single population, relation between two populations etc.
–
Two statistical hypotheses are defined: a null H0 and an
alternative H1
H0 is often a statement of no effect or no difference. It is
the hypothesis the researcher seeks to reject
Background
Inferential statistical test
–
Hypothesis testing is carried out via an inferential statistical
test :
Sample data are manipulated to yield a test statistic
The obtained value of the test statistic is evaluated with respect
to a sampling distribution, i.e., a theoretical probability
distribution for the possible values of the test statistic
The theoretical values of the statistic are usually tabulated and
let one assess the statistical significance of the result of
the statistical test
Goodness-of-fit testing is a type of hypothesis testing
–
Devise inferential statistical tests, apply them to the sample,
and infer how well a theoretical distribution matches the
population distribution
GoF as hypothesis testing
Hypothesis H0:
–
The sample is derived from a theoretical distribution F()
The sample data are manipulated to derive a test
statistic
–
In the case of the chi2 statistic this includes aggregation of
data into bins and some computations
The statistic, as computed from data, is checked
against the sampling distribution
–
For the chi2 test, the sampling distribution is the chi2
distribution, hence the name
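The idea of a sampling distribution can be made concrete with a small simulation. The sketch below is only an illustration under stated assumptions: H0 is taken to be a uniform distribution over six categories, the sample size is 120, and the function names `pearson_x2` and `simulate_x2` are hypothetical. It repeatedly draws samples for which H0 is true and collects the resulting values of the statistic.

```python
import random


def pearson_x2(observed, expected):
    """Pearson statistic: sum over bins of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def simulate_x2(n, num_bins, repetitions, seed=0):
    """Draw `repetitions` samples of size n from a uniform distribution over
    `num_bins` categories (so H0 is true) and return the X2 value of each."""
    rng = random.Random(seed)
    expected = [n / num_bins] * num_bins
    values = []
    for _ in range(repetitions):
        counts = [0] * num_bins
        for category in rng.choices(range(num_bins), k=n):
            counts[category] += 1
        values.append(pearson_x2(counts, expected))
    return values


x2_values = simulate_x2(n=120, num_bins=6, repetitions=5000)
mean_x2 = sum(x2_values) / len(x2_values)
# A chi2 distribution with 5 df has mean 5, so the simulated mean should be close.
print(f"mean of simulated X2 under H0: {mean_x2:.2f}")
```

The histogram of `x2_values` approximates the chi2 distribution with 5 df, which is the theoretical curve the computed statistic is checked against.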
Goodness-of-fit
Statistical tests and statistics : the big picture
[Figure: classification of goodness-of-fit tests. EDF-based tests (e.g., KS test, Anderson-Darling test); chi2 type tests, covering classical chi2 statistics and generalized chi2 statistics, with the Pearson chi2 statistic, the log-likelihood ratio statistic and the modified chi2 statistic shown; specialized tests (e.g., Shapiro-Wilk test for normality).]
Pearson chi2 statistic
If X1, X2, X3, ..., Xn is the random sample and F() the theoretical
distribution under test,
the Pearson chi2 statistic is computed as:
$$X^2 = \sum_{i=1}^{M} \frac{(O_i - E_i)^2}{E_i} = \sum_{i=1}^{M} \frac{(N_i - n p_i)^2}{n p_i}$$
M : number of bins
Oi (Ni) : observed frequency in bin i
Ei (npi) : expected frequency in bin i according to the theoretical
distribution F()
n : sample size
pi : probability that an observation falls in bin i,
$$p_i = P(X_j \text{ falls in bin } i) = \int_{\text{bin } i} dF(x)$$
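The definition above translates directly into a few lines of code. This is a minimal sketch, not a reference implementation: the function name `pearson_chi2` and the example counts are hypothetical, and the bin probabilities pi are assumed to be already known.

```python
def pearson_chi2(observed, probabilities):
    """Pearson chi2 statistic from observed bin counts N_i and the bin
    probabilities p_i of the theoretical distribution under test."""
    if len(observed) != len(probabilities):
        raise ValueError("need one probability per bin")
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("bin probabilities must sum to 1")
    n = sum(observed)  # sample size
    return sum((n_i - n * p_i) ** 2 / (n * p_i)
               for n_i, p_i in zip(observed, probabilities))


# Hypothetical data: 100 observations in 4 equiprobable bins (E_i = 25 each).
x2 = pearson_chi2([18, 22, 30, 30], [0.25, 0.25, 0.25, 0.25])
print(f"X2 = {x2:.2f}")  # (49 + 9 + 25 + 25) / 25 = 4.32
```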
Interpretation of chi2 statistic
Theory says that the Pearson chi2 statistic follows
(asymptotically, for large n) a chi2 distribution, whose df are
–
M-1, when the parameters of the fitted distribution are given a
priori (case 0 test)
–
Somewhere between M-1 and M-1-q, when the q parameters
of the distribution are estimated from the sample data
–
Usually, the df for this case are taken to be M-1-q
Having computed the value of the chi2 statistic X2, I
check the chi2 distribution with M-1 (or M-1-q) df to
find
–
The probability of getting a value equal to or greater than
the computed value X2, called the p-value
–
If p < a, where a is the significance level of my test, the
hypothesis is rejected, otherwise it is retained
–
Standard values for a are 0.1, 0.05, 0.01 – the lower a is the
more conservative I am in rejecting the hypothesis H0
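The p-value lookup can also be done in code instead of with tables. The sketch below uses only the standard library and relies on the closed-form upper tail of the chi2 distribution for integer df; the names `chi2_sf` and `gof_decision` and the example numbers are hypothetical. In practice a statistics library would normally be used for this step.

```python
import math


def chi2_sf(x, df):
    """Upper tail P(Z >= x) for Z ~ chi2 with integer df >= 1 and x >= 0."""
    if df < 1 or x < 0:
        raise ValueError("need df >= 1 and x >= 0")
    z = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-z) * sum_{k=0}^{df/2 - 1} z^k / k!
        return math.exp(-z) * sum(z ** k / math.factorial(k)
                                  for k in range(df // 2))
    # Odd df: erfc(sqrt(z)) + exp(-z) * sum_{k=0}^{m-1} z^(k+1/2) / Gamma(k+3/2)
    m = (df - 1) // 2
    series = sum(z ** (k + 0.5) / math.gamma(k + 1.5) for k in range(m))
    return math.erfc(math.sqrt(z)) + math.exp(-z) * series


def gof_decision(x2, num_bins, num_estimated_params=0, alpha=0.05):
    """Return (df, p_value, reject_h0), taking df = M - 1 - q."""
    df = num_bins - 1 - num_estimated_params
    p_value = chi2_sf(x2, df)
    return df, p_value, p_value < alpha


# Hypothetical case: X2 = 12.0 from M = 6 bins with q = 1 estimated parameter.
df, p_value, reject = gof_decision(12.0, num_bins=6, num_estimated_params=1)
print(f"df = {df}, p-value = {p_value:.4f}, reject H0 at 0.05: {reject}")
```

H0 is rejected exactly when the p-value falls below the chosen significance level a.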
Example
A die is rolled 120 times
1 comes up 20 times, 2 comes up 14, 3 comes up 18, 4
comes up 17, 5 comes up 22 and 6 comes up 29 times
The question is: “Is the die biased?” –or better: “Do
these data suggest that the die is biased?”
Hypothesis H0 : the die is not biased
–
Therefore, according to the null hypothesis these numbers
should be distributed uniformly
–
F() : the discrete uniform distribution
Example – cont.
Computations:
Bin              1     2     3     4     5     6     Sums
Oi               20    14    18    17    22    29    120
Ei               20    20    20    20    20    20    120
Oi-Ei            0     -6    -2    -3    2     9     0
(Oi-Ei)2         0     36    4     9     4     81
(Oi-Ei)2/Ei      0     1.8   0.2   0.45  0.2   4.05  X2=6.7
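The computations for the die example can be reproduced with a short script. This is only a sketch that recomputes the rows of the table from the observed counts; the variable names are illustrative.

```python
# Observed counts for faces 1..6 in 120 rolls (from the example).
observed = [20, 14, 18, 17, 22, 29]
n = sum(observed)
expected = [n / 6] * 6  # uniform H0: each face expected 20 times

differences = [o - e for o, e in zip(observed, expected)]
squared = [d ** 2 for d in differences]
contributions = [s / e for s, e in zip(squared, expected)]
x2 = sum(contributions)

for face, (o, e, d, s, c) in enumerate(
        zip(observed, expected, differences, squared, contributions), start=1):
    print(f"bin {face}: O={o} E={e:.0f} O-E={d:+.0f} (O-E)^2={s:.0f} term={c:.2f}")
print(f"X2 = {x2:.1f}")  # 0 + 1.8 + 0.2 + 0.45 + 0.2 + 4.05 = 6.7
```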
Interpretation
–
The distribution of the test statistic has 5 df
–
The probability of getting a value smaller than or equal to 6.7 under
a chi2 distribution with 5 df is about 0.75, so the p-value (the
probability of a value equal to or greater than 6.7) is about 0.25,
which is > a for all a in {0.01, 0.05, 0.1}.
–
Therefore the hypothesis that the die is not biased cannot be
rejected
Interpretation of Pearson chi2
Graphical illustration
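[Figure: density f(z) of the chi2 distribution with 5 df. The computed X2 = 6.7 is marked, with p-value 0.25 (area under the curve to its right). The critical values 9.24, 11.07 and 15.09 correspond to significance levels 0.1, 0.05 and 0.01. At 10% significance level, I would reject the hypothesis if the computed X2 > 9.24, since 10% of the area under the curve lies to the right of 9.24.]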
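The same decision can be expressed as a comparison against critical values. The sketch below hard-codes the three critical values for 5 df quoted on this slide (9.24, 11.07, 15.09); the dictionary and function names are hypothetical.

```python
# Critical values of the chi2 distribution with 5 df, keyed by significance level.
CRITICAL_VALUES_5DF = {0.10: 9.24, 0.05: 11.07, 0.01: 15.09}


def reject_at_levels(x2, critical_values):
    """Map each significance level to True if H0 is rejected at that level,
    i.e. if the computed statistic exceeds the critical value."""
    return {alpha: x2 > critical
            for alpha, critical in critical_values.items()}


decisions = reject_at_levels(6.7, CRITICAL_VALUES_5DF)
for alpha, reject in decisions.items():
    verdict = "reject H0" if reject else "cannot reject H0"
    print(f"a = {alpha}: {verdict}")
```

For the die example (X2 = 6.7) the statistic stays below all three critical values, so H0 is retained at every level.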
Properties of Pearson chi2 statistic
It can be computed for both discrete and
continuous variables
–
This holds for all chi2 statistics. It gives maximum flexibility but
fails to make use of all available information for continuous variables
It is perhaps the simplest one from a computational
point of view
As with all chi2 statistics, one needs to define
number and borders of bins
–
These are generally a function of sample size and the
theoretical distribution under test
Bin selection
How many and which?
–
Different opinions in literature, no rigorous proof of optimality
There seems to be convergence on the following
aspects
–
Probability of bins
The bins should be chosen equiprobable with respect to the
theoretical distribution under test
–
Minimum expected frequencies npi :
(Cramer, 46) : npi > 10, for all bins
(Cochran, 54) : npi > 1 for all bins, npi >= 5 for 80% of bins
(Roscoe and Byars,71)
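The minimum-frequency rules quoted above are easy to check once the expected frequencies npi are known. The sketch below encodes the two rules exactly as stated on this slide; the function names and the second example are hypothetical.

```python
def cramer_rule(expected):
    """(Cramer, 46) as quoted above: np_i > 10 for all bins."""
    return all(e > 10 for e in expected)


def cochran_rule(expected):
    """(Cochran, 54) as quoted above: np_i > 1 for all bins and
    np_i >= 5 for at least 80% of the bins."""
    at_least_five = sum(1 for e in expected if e >= 5)
    return (all(e > 1 for e in expected)
            and at_least_five >= 0.8 * len(expected))


# Die example: 120 rolls, six equiprobable bins, so np_i = 20 everywhere.
die_expected = [120 / 6] * 6
print(cramer_rule(die_expected), cochran_rule(die_expected))

# Hypothetical skewed case: expected frequencies 20, 15, 10, 3, 2.
skewed_expected = [20, 15, 10, 3, 2]
print(cramer_rule(skewed_expected), cochran_rule(skewed_expected))
```

When a rule fails, the usual remedy is to merge neighbouring bins until the expected frequencies are large enough.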
Bin selection
Relation of the number of bins M to the sample size n
–
(Mann and Wald, 42), (Schorr, 74) : for large sample sizes
$1.88\,n^{2/5} < M < 3.76\,n^{2/5}$
–
(Koehler and Larntz,80) : for small sample size
$M \ge 3$, $n \ge 10$ and $n^2/M \ge 10$
–
(Roscoe and Byars, 71)
Equi-probable bins hypothesis : n > M when a = 0.01 and a =
0.05
Non-equiprobable bins : n > 2M (a = 0.05) and n > 4M (a = 0.01)
Bin selection
[Figure: number of bins M vs. sample size n according to Mann and Wald]
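The relation between bins and sample size for large samples can be tabulated directly from the bounds 1.88 n^(2/5) < M < 3.76 n^(2/5). The sketch below is an illustration only; the function names and the chosen sample sizes are hypothetical.

```python
import math


def bin_bounds(n):
    """Lower and upper bounds 1.88 * n^(2/5) and 3.76 * n^(2/5) on M."""
    factor = n ** (2 / 5)
    return 1.88 * factor, 3.76 * factor


def admissible_bin_counts(n):
    """Smallest and largest integer M strictly between the two bounds."""
    lower, upper = bin_bounds(n)
    return math.floor(lower) + 1, math.ceil(upper) - 1


for n in (100, 500, 1000, 5000):
    lower, upper = bin_bounds(n)
    m_min, m_max = admissible_bin_counts(n)
    print(f"n = {n}: {lower:.1f} < M < {upper:.1f}, i.e. M in {m_min}..{m_max}")
```

The number of bins grows much more slowly than the sample size, since it scales with n to the power 2/5.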
Bin selection : cont. vs. discrete
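[Figure: two plots of the cumulative distribution function F_X(x). Continuous case: equi-probable bins are easy to select (the F_X(x) axis from 0.1 to 1.0 is cut into equal steps, each mapping to a bin i on the x axis). Discrete case (x = 1, ..., 7): less straightforward to define equi-probable bins.]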
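For a continuous distribution, equi-probable bin borders are the quantiles at levels 1/M, 2/M, ..., (M-1)/M. The sketch below illustrates this for an exponential distribution, chosen here only because its inverse CDF has a closed form; the distribution, the function names and the sample values are assumptions made for the example.

```python
import math
from bisect import bisect_right


def exponential_quantile(u, rate=1.0):
    """Inverse CDF of the exponential distribution: F^-1(u) = -ln(1 - u) / rate."""
    return -math.log(1.0 - u) / rate


def equiprobable_edges(num_bins, quantile):
    """Interior bin borders such that each bin has probability 1 / num_bins."""
    return [quantile(i / num_bins) for i in range(1, num_bins)]


def bin_counts(sample, edges):
    """Count the observations falling in each of the len(edges) + 1 bins."""
    counts = [0] * (len(edges) + 1)
    for x in sample:
        counts[bisect_right(edges, x)] += 1
    return counts


edges = equiprobable_edges(4, exponential_quantile)
print([round(e, 4) for e in edges])  # borders at the 25%, 50%, 75% quantiles

# Hypothetical sample, just to show the binning step.
sample = [0.1, 0.5, 0.9, 2.0, 0.3, 1.5]
print(bin_counts(sample, edges))
```

For a discrete distribution the CDF is a step function, so the quantile levels i/M generally cannot be hit exactly and the bins can only be made approximately equi-probable by grouping values.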
References
Textbooks
D.J. Sheskin, Handbook of parametric and nonparametric
statistical procedures
–
Introduction (descriptive vs. inferential statistics, hypothesis testing,
concepts and terminology)
–
Test 8 (chap. 8) – The Chi-Square Goodness-of-Fit Test (high-level
description with examples and discussion on several aspects)
R. D'Agostino, M. Stephens, Goodness-of-fit techniques
–
Chapter 3 – Tests of Chi-square type
Reviews the theoretical background and looks more generally at chi2
tests, not only the Pearson test.
References
Papers
S. Horn, Goodness-of-Fit tests for discrete data: A review
and an Application to a Health Impairment scale
–
Good discussion of the properties and pros/cons of most goodness-of-fit tests for discrete data
–
accessible, tutorial-like