Progress with GoF algorithms


S. Donadio, B. Mascialino
July 2nd, 2003
Status of algorithms
• Chi2 (binned distributions)
• Chi2 (curves – sets of points)
• Kolmogorov-Smirnov-Goodman
• Kolmogorov-Smirnov
• Cramer-von Mises (binned)
• Cramer-von Mises (unbinned)
• Anderson-Darling (binned)
• Anderson-Darling (unbinned)
• Kuiper
Status of Quality Checkers
• Chi2
• Kolmogorov-Smirnov-Goodman
• Kolmogorov-Smirnov
• Cramer-von Mises
• Anderson-Darling
• Kuiper
Last algorithm
(still to be added)
The Lilliefors test is similar to the Kolmogorov-Smirnov test, but is
based on the null hypothesis that the continuous random variable is
distributed as a normal N(μ, σ²), where μ and σ² are unknown. In
practice, since the parameters are unknown, the researcher must
estimate them from the sample itself (x1, x2, ..., xn), which makes it
possible to study the standardized sample (z1, z2, ..., zn). The test
is performed by comparing the empirical distribution function Fn of
(z1, z2, ..., zn) with the standard normal distribution function Φ(z):

D* = sup_z |Fn(z) − Φ(z)|
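As an illustration, a minimal sketch of this statistic in Python with numpy/scipy (names are ours; this is not meant as the toolkit's implementation):

```python
# Minimal sketch of the Lilliefors statistic (illustrative Python,
# not the toolkit's implementation).
import numpy as np
from scipy.stats import norm

def lilliefors_statistic(sample):
    """D* = sup_z |Fn(z) - Phi(z)| for the standardized sample."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    # mu and sigma are unknown: estimate them from the sample itself,
    # then standardize (z1, ..., zn).
    z = (x - x.mean()) / x.std(ddof=1)
    phi = norm.cdf(z)                     # standard normal CDF Phi(z)
    ecdf_hi = np.arange(1, n + 1) / n     # Fn just after each order statistic
    ecdf_lo = np.arange(0, n) / n         # Fn just before each order statistic
    return max(np.max(ecdf_hi - phi), np.max(phi - ecdf_lo))
```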
Lilliefors needs a theoretical function as input.
[Diagram: TOOLKIT architecture. INPUT: binned distributions, unbinned
distributions, theoretical distributions. DISTRIBUTION 1 and
DISTRIBUTION 2 currently enter the toolkit; a THEORETICAL FUNCTION
must also be accepted as input for the test for normality, etc.]
New algorithm
Cramer-von Mises-Tiku
It approximates the Cramer-von Mises test statistic with a χ².
It uses the χ² Quality Checker.
Tiku M.L., Chi-squared approximations for the distributions of
goodness-of-fit statistics U_N² and W_N². Biometrika, 52, (1965b), 630.
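As a hedged illustration of the idea (not Tiku's published coefficients): matching the first two asymptotic moments of W² (mean 1/6, variance 1/45) to a scaled χ² gives W² ≈ (1/15)·χ² with 2.5 degrees of freedom; the exact approximation used by the algorithm should follow the Biometrika reference.

```python
# Sketch of a chi-squared approximation to the Cramer-von Mises W^2
# statistic. The scale 1/15 and df 2.5 come from a generic two-moment
# match to the asymptotic law of W^2 (mean 1/6, variance 1/45); they
# stand in for Tiku's published coefficients.
import numpy as np
from scipy.stats import chi2, norm

def cvm_chi2_pvalue(sample, cdf=norm.cdf):
    """p-value of the one-sample W^2 test against a fully specified cdf."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    u = cdf(x)
    i = np.arange(1, n + 1)
    w2 = 1.0 / (12 * n) + np.sum((u - (2 * i - 1) / (2.0 * n)) ** 2)
    # W^2 ~ (1/15) * chi2(2.5)  =>  hand 15 * W^2 to a chi2 quality checker
    return chi2.sf(15.0 * w2, df=2.5)
```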
New algorithm
Kolmogorov-Smirnov (binned)
It allows the calculation of the Kolmogorov-Smirnov test statistic
in the case of binned distributions.
It uses a different quality checker (see Conover (1971),
Gibbons and Chakraborti (1992)).
We must find it!
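A minimal sketch of the binned test statistic, assuming two histograms with identical binning (the quality checker, i.e. the p-value computation from Conover / Gibbons and Chakraborti, is precisely the part still to be found):

```python
# Sketch of the Kolmogorov-Smirnov distance between two binned
# distributions with identical binning (illustrative only).
import numpy as np

def ks_binned_statistic(counts1, counts2):
    c1 = np.cumsum(counts1) / np.sum(counts1)  # empirical CDF at bin edges
    c2 = np.cumsum(counts2) / np.sum(counts2)
    return np.max(np.abs(c1 - c2))
```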
Uncertainties treatment
We must decide how to treat errors inside the statistical toolkit.
Distributions are entered as a pair of DataPointSets:
• Data
• Weight
The handling of Data and Weight in the computation of the test
statistic differs between the case of distributions and the case of
curves or sets of points.
An example
2 =  {(y1i – y2i)2 / [(1i)2 + (2i)2]}
In the case of two distributions, χ² is computed using only “Weights”.
In the case of two curves or sets of points, the numerator involves
“Data” and the denominator uses “Weights”.
THIS COULD BE MISLEADING!
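A hypothetical illustration of the two conventions just described (the function names and the Poisson-style variance w1 + w2 for bin contents are our assumptions, not toolkit code):

```python
# Hypothetical illustration of the two chi2 conventions.
import numpy as np

def chi2_distributions(weights1, weights2):
    # Two binned distributions: only "Weights" (bin contents) enter,
    # here with the bin contents also providing the variance.
    w1, w2 = np.asarray(weights1, float), np.asarray(weights2, float)
    return np.sum((w1 - w2) ** 2 / (w1 + w2))

def chi2_points(data1, errors1, data2, errors2):
    # Two curves / sets of points: "Data" (y values) in the numerator,
    # "Weights" (actually the errors sigma_i) in the denominator.
    y1, y2 = np.asarray(data1, float), np.asarray(data2, float)
    s1, s2 = np.asarray(errors1, float), np.asarray(errors2, float)
    return np.sum((y1 - y2) ** 2 / (s1 ** 2 + s2 ** 2))
```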
Data Weights Errors
So, in order to have a coherent language for all the algorithms,
we should have:
• Data
• Weights
• Errors
Whenever errors are not necessary for the computation of the
test statistic, we could fill them with a null vector.
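A minimal sketch of the proposed coherent input, assuming a simple container with the three vectors (GoFInput is a hypothetical name, not an actual toolkit class):

```python
# Minimal sketch of the proposed Data / Weights / Errors input.
from dataclasses import dataclass
import numpy as np

@dataclass
class GoFInput:
    data: np.ndarray
    weights: np.ndarray
    errors: np.ndarray = None  # null vector when errors are not needed

    def __post_init__(self):
        if self.errors is None:
            self.errors = np.zeros_like(np.asarray(self.data, dtype=float))
```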
Selecting data 1
Elimination of data points if n ≥ 30
CRITERION OF 3-SIGMA:
If a point lies more than three standard deviations from the mean of
the data points, there is only about a 0.3% probability of obtaining,
in a single measurement, a value that far from the mean.
We can choose to eliminate this data point.
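A minimal sketch of this criterion (illustrative only, not toolkit code):

```python
# Sketch of the 3-sigma criterion for large samples.
import numpy as np

def three_sigma_filter(sample):
    x = np.asarray(sample, dtype=float)
    mean, std = x.mean(), x.std(ddof=1)
    # Keep only points within three standard deviations of the mean.
    return x[np.abs(x - mean) <= 3 * std]
```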
Selecting data 2
Elimination of data points if n ≤ 10
CHAUVENET’S CRITERION:
Given n sample observations from a Gaussian distribution, we should
expect n′ of them to be in error by ε or more, where (z standard normal)

P(−ε/σ ≤ z ≤ ε/σ) = 1 − n′/n

“n′ < 0.5 means that even one observation with this amount of
error is unlikely. We can discard a data point if we expect less
than half an event to be further from the mean than the suspect
data point.”
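A minimal sketch of Chauvenet's criterion under the same Gaussian assumption (illustrative names, not toolkit code):

```python
# Sketch of Chauvenet's criterion: reject a point when the expected
# number of equally deviant observations n' falls below 0.5.
import numpy as np
from scipy.stats import norm

def chauvenet_filter(sample):
    x = np.asarray(sample, dtype=float)
    n = len(x)
    z = np.abs(x - x.mean()) / x.std(ddof=1)
    # Two-sided tail probability of deviating at least this much.
    expected = n * 2 * norm.sf(z)       # n' for each point
    return x[expected >= 0.5]
```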