2015_9_Hypothesis tests (3)

STATISTICS
HYPOTHESIS TESTS (III)
Nonparametric Goodness-of-fit (GOF) tests
Professor Ke-Sheng Cheng
Department of Bioenvironmental Systems Engineering
National Taiwan University
Description of nonparametric problems
• Until now, in the estimation and hypothesis
testing problems, we have assumed that the
available observations come from distributions
for which the exact form is known, even
though the values of some parameters are
unknown. In other words, we have assumed
that the observations come from a certain
parametric family of distributions, and a
statistical inference must be made about the
values of the parameters defining that family.
4/6/2016
Dept of Bioenvironmental Systems Engineering
National Taiwan University
• In many situations, we do not assume that the
available observations come from a particular
family of distributions. Instead, we want to
study inferences that can be made about the
distribution from which the observations come,
without making special assumptions about the
form of that distribution.
• For example, we might simply assume that
observations form a random sample from a
continuous distribution, without specifying the
form of this distribution any further; and we
then investigate the possibility that this
distribution is a normal distribution.
• Problems in which the possible distributions of
the observations are not restricted to a specific
parametric family are called nonparametric
problems, and the statistical methods that are
applicable in such problems are called
nonparametric methods.
Goodness-of-fit test
• A very common statistical problem in
hydrological frequency analysis and water
resources planning is deciding whether the available
observations (a random sample available to us)
come from a particular type of distribution. For
example, before we can estimate the magnitude
of the 24-hour rainfall depth with a 100-year
return period, we must identify the type
of probability distribution of the rainfall data
(the annual maximum series) through statistical
tests.
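To make the link between return period and distribution type concrete, here is a short worked relation. It assumes the annual maxima are independent with a common CDF $F$; the symbols $T$, $x_T$ and $F$ are introduced here for illustration and are not defined in the slides.

```latex
P(X > x_T) = \frac{1}{T}
\quad\Longrightarrow\quad
x_T = F^{-1}\!\left(1 - \frac{1}{T}\right),
\qquad
x_{100} = F^{-1}(0.99).
```

The 100-year depth is therefore the 0.99 quantile of $F$. Because this quantile lies far in the upper tail, the estimate depends strongly on which distribution type is selected.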
• Let’s consider statistical problems based on
data such that each observation can be
classified as belonging to one of a finite
number of possible categories. Suppose a large
population consists of items of k different
categories, and let $p_i$ denote the probability that
an observation will belong to category i (i = 1,
2, …, k). Of course, $p_i \ge 0$ for i = 1, 2, …, k
and $\sum_{i=1}^{k} p_i = 1$.
• Therefore, it seems reasonable to base a test on
the values of the differences $n_i - e_i$
(observed count minus expected count under Ho)
for i = 1, 2, …, k and reject Ho when the
magnitudes of these differences are relatively
large.
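A minimal sketch of this idea in R, taking $n_i$ as the observed count in category i and $e_i$ as the count expected under Ho. The counts and hypothesized probabilities are made-up illustration values, not data from the lecture, and the names `n_obs`, `p0`, `e` and `d` are hypothetical.

```r
# Hypothetical counts for k = 4 categories (illustration only)
n_obs <- c(18, 30, 27, 25)           # observed counts n_i
p0    <- c(0.25, 0.25, 0.25, 0.25)   # probabilities specified by Ho (assumed)

n <- sum(n_obs)                      # sample size
e <- n * p0                          # expected counts e_i under Ho
d <- n_obs - e                       # differences n_i - e_i

# Side-by-side view of observed counts, expected counts and differences
print(data.frame(category = seq_along(n_obs),
                 observed = n_obs, expected = e, difference = d))
```

The differences always sum to zero, so their plain sum carries no information; a test statistic has to combine their magnitudes instead.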
Chi-square GOF test
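For reference, the standard Pearson chi-square statistic can be written with $n_i$ the observed count and $e_i$ the expected count in category i. This sketch assumes $e_i = n p_i^0$, where $n$ is the sample size and $p_i^0$ is the probability specified by Ho; the symbols $Q$, $p_i^0$ and $\alpha$, and the quantile notation $\chi^2_{1-\alpha,\,k-1}$, are introduced here and may differ from the notation on the original slides.

```latex
Q = \sum_{i=1}^{k} \frac{(n_i - e_i)^2}{e_i},
\qquad
\text{reject } H_0 \text{ at level } \alpha \text{ if } Q > \chi^2_{1-\alpha,\,k-1}.
```

Here $\chi^2_{1-\alpha,\,k-1}$ denotes the $(1-\alpha)$ quantile of the chi-square distribution with k-1 degrees of freedom. When Ho is true and the sample is large, $Q$ is approximately chi-square distributed with k-1 degrees of freedom, provided no parameters were estimated from the data.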
[Figure not recovered; surviving labels: number of categories, sample size]
Kolmogorov-Smirnov GOF test
• The chi-square test compares the empirical
histogram against the theoretical histogram.
• In contrast, the K-S test compares the empirical
cumulative distribution function (ECDF)
against the theoretical CDF.
• In order to measure the difference between
$F_n(x)$ and $F(x)$, ECDF statistics based on the
vertical distances between $F_n(x)$ and $F(x)$
have been proposed.
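A minimal sketch of one such statistic, the largest vertical distance $D_n = \sup_x |F_n(x) - F(x)|$, computed by hand in R. The sample is simulated, the hypothesized N(30, 10) distribution is borrowed from the ks.test example in these slides, and all variable names are illustrative assumptions.

```r
set.seed(1)                               # reproducible simulated sample
x <- rnorm(50, mean = 30, sd = 10)        # illustration data, not lecture data
n <- length(x)

xs <- sort(x)                             # ordered observations
F0 <- pnorm(xs, mean = 30, sd = 10)       # hypothesized CDF F at the data points

# F_n is a step function that jumps by 1/n at each ordered observation,
# so the largest gap must occur just before or just after one of the jumps.
d_plus  <- max((1:n) / n - F0)            # F_n just after the jump, minus F
d_minus <- max(F0 - (0:(n - 1)) / n)      # F minus F_n just before the jump
Dn <- max(d_plus, d_minus)                # two-sided K-S distance

Dn
```

The value agrees with the statistic reported by ks.test(x, "pnorm", 30, 10) for the same sample.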
Values of Dn , for the
Kolmogorov-Smirnov
test
Goodness-of-fit tests using R
• χ² test for GOF
– chisq.test
– chisq.test does not account for any parameters
estimated from the data in determining the expected values.
– The test statistic has k-1 degrees of freedom.
• Kolmogorov-Smirnov GOF test
– ks.test (one-sample test)
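A minimal sketch of chisq.test used as a GOF test with fully specified category probabilities. The counts and probabilities are made-up illustration values, and `n_obs`, `p0`, `m` and `p_adj` are hypothetical names. The manual adjustment at the end is a common convention for the case where parameters are estimated from the data, not something chisq.test does itself.

```r
n_obs <- c(18, 30, 27, 25)                # hypothetical observed counts
p0    <- c(0.25, 0.25, 0.25, 0.25)        # category probabilities under Ho (assumed)

res <- chisq.test(x = n_obs, p = p0)      # GOF test with k - 1 = 3 degrees of freedom
res$statistic                             # Pearson chi-square statistic
res$parameter                             # degrees of freedom used by chisq.test
res$p.value

# If m parameters had been estimated from the data, the same statistic can be
# referred by hand to a chi-square distribution with k - 1 - m degrees of freedom.
m <- 0                                    # no parameters estimated in this sketch
p_adj <- pchisq(res$statistic, df = length(n_obs) - 1 - m, lower.tail = FALSE)
```

With m = 0 the hand-computed p-value matches the one reported by chisq.test; with m > 0 the degrees of freedom shrink and the p-value decreases for the same statistic.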
ks.test(x, y, parameters, alternative="…")
where x is the data vector to be tested, y is a character
string naming the hypothesized cumulative distribution
function (e.g. "pnorm"), parameters are the values of the
distribution parameters corresponding to y, and alternative
is a character string ("less", "greater", or "two.sided")
selecting a one-tailed or two-tailed test.
• Examples
ks.test(x, "pnorm", 30, 10, alternative="two.sided")  # normal, mean 30, sd 10
ks.test(x, "pexp", 0.2, alternative="greater")        # exponential, rate 0.2
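A self-contained sketch of the two calls above on simulated data, since the slides do not supply x. The samples, the seed, and the names `x_norm`, `x_exp`, `res1` and `res2` are assumptions for illustration.

```r
set.seed(2016)                                    # reproducible illustration data
x_norm <- rnorm(40, mean = 30, sd = 10)           # simulated from N(mean 30, sd 10)
x_exp  <- rexp(40, rate = 0.2)                    # simulated from Exp(rate 0.2)

# Two-sided test of Ho: the data come from N(mean = 30, sd = 10)
res1 <- ks.test(x_norm, "pnorm", 30, 10, alternative = "two.sided")

# One-sided test against the exponential distribution with rate 0.2
res2 <- ks.test(x_exp, "pexp", 0.2, alternative = "greater")

res1$statistic; res1$p.value                      # K-S distance and p-value
res2$statistic; res2$p.value
```

The parameters passed here are fixed in advance. The R documentation for ks.test notes that in the one-sample case the parameters must be pre-specified and not estimated from the data being tested.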