Tutorial on the chi2 test for goodness-of-fit testing


Mobile Computing Group
A quick-and-dirty tutorial on the chi2 test
for goodness-of-fit testing
Outline

The presentation follows a pyramid schema:
- Chi2 tests for GoF
- Goodness-of-fit (GoF)
- Background concepts
Background


- Descriptive vs. inferential statistics
  - Descriptive: data are used only for descriptive purposes (tables, graphs, measures of variability, etc.)
  - Inferential: data are used for drawing inferences, making predictions, etc.
- Sample vs. population
  - A sample is drawn from a population that is assumed to have some characteristics.
  - The sample is often used to make inferences about the population (inferential statistics):
    - Hypothesis testing
    - Estimation of population parameters
Background

- Statistic vs. parameter
  - A statistic is estimated from a sample. It can be used for both descriptive and inferential purposes.
  - A parameter refers to the whole population. A sample statistic is often used to infer a population parameter.
  - Example: the sample mean may be used to infer the population mean (expected value).
- Hypothesis testing
  - A procedure where sample data are used to evaluate a hypothesis regarding the population.
  - A hypothesis may refer to several things: properties of a single population, the relation between two populations, etc.
  - Two statistical hypotheses are defined: a null H0 and an alternative H1.
    - H0 is often a statement of no effect or no difference. It is the hypothesis the researcher seeks to reject.
Background

- Inferential statistical test
  - Hypothesis testing is carried out via an inferential statistical test:
    - Sample data are manipulated to yield a test statistic.
    - The obtained value of the test statistic is evaluated with respect to a sampling distribution, i.e., a theoretical probability distribution for the possible values of the test statistic.
    - The theoretical values of the statistic are usually tabulated, letting one assess the statistical significance of the result of the test.
- Goodness-of-fit is a type of hypothesis testing
  - Devise inferential statistical tests, apply them to the sample, and infer how well a theoretical distribution matches the population distribution.
GoF as hypothesis testing

- Hypothesis H0: the sample is derived from a theoretical distribution F()
- The sample data are manipulated to derive a test statistic
  - In the case of the chi2 statistic this includes aggregation of the data into bins and some computations
- The statistic, as computed from the data, is checked against the sampling distribution
  - For the chi2 test, the sampling distribution is the chi2 distribution, hence the name
Goodness-of-fit

Statistical tests and statistics: the big picture

- EDF-based tests: e.g., KS test, Anderson-Darling test
- Chi2-type tests
  - Classical chi2 statistics: Pearson chi2 statistic, log-likelihood ratio statistic, modified chi2 statistic
  - Generalized chi2 statistics
- Specialized tests: e.g., Shapiro-Wilk test for normality
Pearson chi2 statistic
If X1, X2, ..., Xn is the random sample and F() the theoretical distribution under test, the Pearson chi2 statistic is computed as:

X² = Σ_{i=1}^{M} (Oi − Ei)² / Ei = Σ_{i=1}^{M} (Ni − n·pi)² / (n·pi)

where
- M: number of bins
- Oi (Ni): observed frequency in bin i
- Ei (= n·pi): expected frequency in bin i according to the theoretical distribution F()
- n: sample size
- pi = P(Xj falls in bin i) = ∫_{bin i} dF(x)
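The statistic above takes only a few lines to compute. A minimal sketch (numpy assumed; the function name `pearson_chi2` is mine, not from the tutorial):

```python
import numpy as np

def pearson_chi2(observed, probs):
    """Pearson chi2 statistic for observed bin counts against
    theoretical bin probabilities p_i (which must sum to 1)."""
    observed = np.asarray(observed, dtype=float)
    probs = np.asarray(probs, dtype=float)
    n = observed.sum()                  # sample size n
    expected = n * probs                # E_i = n * p_i
    return float(((observed - expected) ** 2 / expected).sum())

# Die example from later in the tutorial: 120 rolls, fair-die probabilities
x2 = pearson_chi2([20, 14, 18, 17, 22, 29], [1/6] * 6)   # → 6.7
```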
Interpretation of chi2 statistic

- Theory says that the Pearson chi2 statistic follows a chi2 distribution, whose df are
  - M−1, when the parameters of the fitted distribution are given a priori (case 0 test)
  - Somewhere between M−1 and M−1−q, when the q parameters of the distribution are estimated from the sample data
    - Usually, the df for this case are taken to be M−1−q
- Having computed the value of the chi2 statistic X², I check the chi2 distribution with M−1 (or M−1−q) df to find
  - The probability of getting a value equal to or greater than the computed value X², called the p-value
  - If p <= a, where a is the significance level of my test, the hypothesis is rejected; otherwise it is retained
  - Standard values for a are 0.1, 0.05, 0.01 – the lower a is, the more conservative I am in rejecting the hypothesis H0
Example

- A die is rolled 120 times
- 1 comes up 20 times, 2 comes up 14, 3 comes up 18, 4 comes up 17, 5 comes up 22 and 6 comes up 29 times
- The question is: "Is the die biased?" – or better: "Do these data suggest that the die is biased?"
- Hypothesis H0: the die is not biased
  - Therefore, according to the null hypothesis, these counts should be distributed uniformly
  - F(): the discrete uniform distribution
Example – cont.

Computations:

Bin    Oi    Ei    Oi−Ei   (Oi−Ei)²   (Oi−Ei)²/Ei
1      20    20      0        0         0
2      14    20     −6       36         1.8
3      18    20     −2        4         0.2
4      17    20     −3        9         0.45
5      22    20      2        4         0.2
6      29    20      9       81         4.05
Sums   120   120     0                  X² = 6.7
Interpretation
- The distribution of the test statistic has 5 df
- The probability of getting a value equal to or greater than 6.7 under a chi2 distribution with 5 df (the p-value) is 0.25, which is > a for all a in {0.01, 0.05, 0.1}
- Therefore the hypothesis that the die is not biased cannot be rejected
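The whole example can be reproduced with scipy's ready-made test (a sketch; note that when no expected frequencies are passed, `scipy.stats.chisquare` assumes equal expectations, which is exactly H0 here):

```python
from scipy.stats import chisquare

# Observed die counts from the example; expected frequencies default
# to uniform (20 per bin), matching the "not biased" null hypothesis.
stat, p = chisquare([20, 14, 18, 17, 22, 29])
# stat = 6.7, p ≈ 0.24: greater than a for all a in {0.01, 0.05, 0.1},
# so H0 (the die is not biased) is retained
```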
Interpretation of Pearson chi2

Graphical illustration

[Figure: density f(z) of the chi2 distribution with 5 df. The computed X² = 6.7 corresponds to a p-value of 0.25. The critical values 9.24, 11.07 and 15.09 mark the upper-tail areas 0.1, 0.05 and 0.01. At the 10% significance level, I would reject the hypothesis if the computed X² > 9.24, i.e., if it fell within the 10% of the area under the curve beyond that point.]
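The critical values marked on the figure come straight from the chi2 quantile function; a short sketch with scipy:

```python
from scipy.stats import chi2

df = 5  # M − 1 for the die example
for a in (0.10, 0.05, 0.01):
    crit = chi2.ppf(1 - a, df)          # reject H0 when X² > crit
    print(f"a = {a:.2f}: critical value {crit:.2f}")
# critical values: 9.24 (a = 0.1), 11.07 (a = 0.05), 15.09 (a = 0.01)

p_value = chi2.sf(6.7, df)              # upper-tail area beyond X² = 6.7, ≈ 0.25
```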
Properties of Pearson chi2 statistic

- It can be computed for both discrete and continuous variables
  - This holds for all chi2 statistics: maximum flexibility, but it fails to make use of all available information for continuous variables
- It is perhaps the simplest one from a computational point of view
- As with all chi2 statistics, one needs to define the number and borders of the bins
  - These are generally a function of the sample size and the theoretical distribution under test
Bin selection

How many bins, and which?
- Different opinions in the literature, no rigorous proof of optimality
- There seems to be convergence on the following aspects
  - Probability of bins
    - The bins should be chosen equiprobable with respect to the theoretical distribution under test
  - Minimum expected frequencies n·pi:
    - (Cramer, 46): n·pi > 10 for all bins
    - (Cochran, 54): n·pi > 1 for all bins, n·pi >= 5 for 80% of the bins
    - (Roscoe and Byars, 71)
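Cochran's criterion above is simple to check mechanically. A minimal sketch (the function name `cochran_ok` is mine):

```python
def cochran_ok(expected):
    """Cochran (1954) rule as quoted above: every expected
    frequency n*p_i must exceed 1, and at least 80% of the
    bins must have n*p_i >= 5."""
    expected = list(expected)
    if any(e <= 1 for e in expected):
        return False
    return sum(e >= 5 for e in expected) / len(expected) >= 0.8

cochran_ok([20] * 6)        # True – the die example easily qualifies
cochran_ok([0.5, 10, 10])   # False – one bin has expected frequency <= 1
```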
Bin selection

Relevance of the number of bins M to the sample size n
- (Mann and Wald, 42), (Schorr, 74): for large sample sizes, 1.88·n^(2/5) < M < 3.76·n^(2/5)
- (Koehler and Larntz, 80): for small sample sizes, M >= 3, n >= 10 and n²/M >= 10
- (Roscoe and Byars, 71)
  - Equiprobable bins hypothesis: n > M when a = 0.01 and a = 0.05
  - Non-equiprobable bins: n > 2M (a = 0.05) and n > 4M (a = 0.01)
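The Mann–Wald large-sample rule quoted above is easy to evaluate for a given sample size. A sketch (the function name `mann_wald_bounds` is mine):

```python
def mann_wald_bounds(n):
    """Large-sample bounds on the number of bins M from the rule
    quoted above: 1.88 * n^(2/5) < M < 3.76 * n^(2/5)."""
    return 1.88 * n ** 0.4, 3.76 * n ** 0.4

lo, hi = mann_wald_bounds(1000)   # roughly 30 < M < 60 for n = 1000
```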
Bin selection

[Figure: number of bins vs. sample size according to Mann and Wald]
Bin selection: continuous vs. discrete

[Figure: two CDF plots F_X(x), with the vertical axis graded 0.1 to 1.0. For a continuous distribution, equiprobable bins are easy to select by cutting the CDF at equally spaced probabilities. For a discrete distribution (here with support 1–7), it is less straightforward to define equiprobable bins.]
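The continuous case illustrated above amounts to inverting the CDF at equally spaced probabilities. A sketch with scipy (the standard normal and M = 10 are purely illustrative choices):

```python
import numpy as np
from scipy.stats import norm

# Equiprobable bin edges for a continuous F: invert the CDF (ppf)
# at equally spaced probabilities 0, 1/M, 2/M, ..., 1.
M = 10
edges = norm.ppf(np.linspace(0, 1, M + 1))   # -inf, ..., 0, ..., +inf
widths = np.diff(norm.cdf(edges))            # probability mass per bin
# each of the M bins carries probability 1/M = 0.1 under F
```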
References

Textbooks

- D.J. Sheskin, Handbook of Parametric and Nonparametric Statistical Procedures
  - Introduction (descriptive vs. inferential statistics, hypothesis testing, concepts and terminology)
  - Test 8 (chap. 8) – The Chi-Square Goodness-of-Fit Test (high-level description with examples and discussion of several aspects)
- R. D'Agostino, M. Stephens, Goodness-of-Fit Techniques
  - Chapter 3 – Tests of Chi-Square Type (reviews the theoretical background and looks more generally at chi2 tests, not only the Pearson test)
References

Papers

- S. Horn, Goodness-of-Fit Tests for Discrete Data: A Review and an Application to a Health Impairment Scale
  - Good discussion of the properties and pros/cons of most goodness-of-fit tests for discrete data
  - Accessible, tutorial-like