CSI5388 Functional Elements of Statistics for Machine Learning

CSI5388: Functional Elements of Statistics for Machine Learning
Part I
Contents of the Lecture

Part I (This set of lecture notes):
• Definition and Preliminaries
• Hypothesis Testing: Parametric Approaches

Part II (The next set of lecture notes):
• Hypothesis Testing: Non-Parametric Approaches
• Power of a Test
• Statistical Tests for Comparing Multiple Classifiers
Definitions and Preliminaries I





A Random Variable is a function, which
assigns unique numerical values to all
possible outcomes of a random
experiment under fixed conditions.
If X takes on N values x1, x2, ..., xN, such
that each xi ∈ R, then,
The Mean of X is μ = (x1 + x2 + ... + xN)/N
The Variance is σ² = ((x1 – μ)² + ... + (xN – μ)²)/N
The Standard Deviation is σ = sqrt(σ²)
Definitions and Preliminaries II


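Sample Variance: s² = ((x1 – X̄)² + ... + (xN – X̄)²)/(N – 1),
where X̄ = (x1 + ... + xN)/N is the sample mean
Sample Standard Deviation: s = sqrt(s²)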
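The definitions on these two slides can be checked numerically. The following is a minimal Python sketch, assuming a small hypothetical list of accuracy values (the numbers are illustrative and not taken from the lecture). It contrasts the variance that divides by N with the sample variance that divides by N – 1.

```python
import math
import statistics

# Hypothetical sample of classifier accuracies (illustrative values only).
x = [0.80, 0.84, 0.78, 0.90, 0.88]
n = len(x)

mean = sum(x) / n

# Variance with an N denominator (all values of X are observed).
var_pop = sum((xi - mean) ** 2 for xi in x) / n
# Sample variance with an N - 1 denominator.
var_sample = sum((xi - mean) ** 2 for xi in x) / (n - 1)

sd_pop = math.sqrt(var_pop)
sd_sample = math.sqrt(var_sample)

print("mean:", mean)
print("variance (N):", var_pop, " sd:", sd_pop)
print("variance (N - 1):", var_sample, " sd:", sd_sample)
```

The standard library functions `statistics.pvariance` and `statistics.variance` compute the same two quantities, so they can serve as a cross-check.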
Hypothesis Testing





• Generalities
• Sampling Distributions
• Procedure
• One- versus Two-tailed tests
• Parametric approaches
Generalities


Purpose: If we assume a given sampling
distribution, we want to establish whether
or not a sample result is representative of
that sampling distribution. This is
interesting because it helps us decide
whether the results we obtained in an
experiment can generalize to future data.
Approaches to Hypothesis Testing:
There are two different approaches to
hypothesis testing: Parametric and
Non-Parametric approaches.
Sampling Distributions


Definition: The sampling distribution of a
statistic (for example, the mean, the median or any
other description/summary of a data set) is the
distribution of values obtained for that statistic
over all possible samples of the same size from a
given population.
Note: Since the populations under study are
usually infinite or, at least, very large, the true
sampling distribution is usually unknown.
Therefore, rather than being computed exactly, it
has to be estimated. Nonetheless, we can do
so quite well, especially when considering the
mean of the data.
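To make the definition concrete, a sampling distribution can be approximated by drawing many samples of the same size and recording the statistic each time. The sketch below is only an illustration: the uniform "population", the sample size of 25 and the helper name `sampling_distribution` are all assumptions made for the example.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical "population": 10,000 values drawn from a uniform distribution.
population = [random.uniform(0, 1) for _ in range(10_000)]

def sampling_distribution(stat, sample_size, n_samples=2_000):
    """Approximate the sampling distribution of `stat` by repeated sampling."""
    return [stat(random.sample(population, sample_size)) for _ in range(n_samples)]

medians = sampling_distribution(statistics.median, sample_size=25)
means = sampling_distribution(statistics.mean, sample_size=25)

print("median: centre", statistics.mean(medians), "spread", statistics.stdev(medians))
print("mean:   centre", statistics.mean(means), "spread", statistics.stdev(means))
```

Only a finite number of samples is drawn here, so the result is an estimate of the sampling distribution rather than the distribution over all possible samples.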
Procedure I


Idea: If we assume a given sampling
distribution, we want to establish whether or not
a sample result is representative of that sampling
distribution. This is interesting because it
helps us decide whether the results we obtained
in an experiment can generalize to future data.
Example: If the sample mean we obtain on a
particular data sample is representative of the
sampling distribution, then we can conclude that
our data sample is representative of the whole
population. If not, it means that the values in our
sample are unrepresentative (perhaps this
sample contained data that were particularly
easy or particularly difficult to classify).
Procedure II
1. State your research hypothesis.
2. Formulate a null hypothesis stating the opposite of your
   research hypothesis. In particular, the null hypothesis
   regards the relationship between the sampling statistics
   of the basic population and the sample result you
   obtained from your specific set of data.
3. Collect your specific data and compute the statistic’s
   sample result on it.
4. Calculate the probability of obtaining the sample result
   you obtained if the sample emanated from the data set
   that gave you the original sample statistic.
5. If this probability is low, reject the null hypothesis, and
   state that the sample you considered does not emanate
   from the data set that gave you the original sample
   statistic.
One- and Two-Tailed Tests

If H0 is expressed as an equality, then
there are two ways to reject H0: either the
statistic computed from your sample at
hand is lower than the sampling statistic
or it is higher. If you are concerned
about only one of these directions (lower or
higher), then you should perform a one-tailed test.
If you are simultaneously concerned about
the two ways in which H0 can be rejected,
then you should perform a two-tailed test.
Parametric Approaches to
Hypothesis Testing


The classical approach to hypothesis
testing is parametric. This means that in
order to be applied, this approach makes
a number of assumptions regarding the
distribution of the population and the
available sample.
Non-parametric approaches, discussed
later, do not make these strong
assumptions, although they do make
some assumptions as well, as will be
discussed there.
Why are Hypothesis Tests often
applied to means?


Hypothesis tests are often applied to
means. The reason is that, unlike for other
statistics, the standard deviation of the
sampling distribution of the mean is known
and simple to calculate.
Having access to this standard deviation is
essential: the probability that the sample
under consideration emanates from the
population represented by the original
sampling statistic is computed from it, so
without it hypothesis testing could not be
performed.
Why is the standard deviation of
the mean easy to calculate?

Because of the important Central Limit
Theorem, which states that no matter
how your original population is distributed
(provided it has a finite variance),
if you use large enough samples, then the
sampling distribution of the mean of these
samples approaches a normal distribution.
If the mean of the original population is μ
and its standard deviation is σ, then the
mean of the sampling distribution is μ and
its standard deviation is σ/sqrt(N), where N
is the size of each sample.
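The σ/sqrt(N) result can be checked by simulation. The sketch below assumes an exponential parent population with rate 1 (so μ = 1 and σ = 1); this choice, the sample size and the number of samples are illustrative assumptions only.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Assumed parent population: exponential with rate 1, so mu = 1 and sigma = 1.
mu, sigma = 1.0, 1.0
N = 50             # size of each sample
n_samples = 5_000  # number of samples drawn

sample_means = [
    statistics.mean([random.expovariate(1.0) for _ in range(N)])
    for _ in range(n_samples)
]

observed_mean = statistics.mean(sample_means)
observed_sd = statistics.stdev(sample_means)
predicted_sd = sigma / math.sqrt(N)

print("mean of sample means:", observed_mean, "(predicted:", mu, ")")
print("sd of sample means:  ", observed_sd, "(predicted:", predicted_sd, ")")
```

The observed values should be close to, but not exactly equal to, the predicted ones, since only a finite number of samples is simulated.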
When is the sampling distribution of
the mean Normal?




The sample size necessary for the
sampling distribution of the mean to approach
normality depends on the distribution of the parent
population.
If the parent population is normal, then the
sampling distribution of the mean is also normal.
If the parent population is not normal, but
symmetrical and uni-modal, then the sampling
distribution of the mean will be approximately
normal, even for small sample sizes.
If the population is very skewed, then sample
sizes of at least 30 will be required for the
sampling distribution of the mean to be
approximately normal.
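One way to see the effect of sample size on a skewed population is to measure how skewed the simulated sampling distribution of the mean still is. The sketch below is illustrative only: it assumes an exponential parent population, and the helpers `skewness` and `skew_of_sample_means` are hypothetical names introduced for the example. A skewness near 0 is consistent with a symmetric, normal-looking shape.

```python
import random
import statistics

random.seed(2)  # fixed seed so the sketch is repeatable

def skewness(values):
    """Mean cubed deviation divided by the cubed standard deviation."""
    m = statistics.mean(values)
    sd = statistics.pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * sd ** 3)

def skew_of_sample_means(sample_size, n_samples=4_000):
    """Skewness of the simulated sampling distribution of the mean."""
    means = [
        statistics.mean([random.expovariate(1.0) for _ in range(sample_size)])
        for _ in range(n_samples)
    ]
    return skewness(means)

for n in (2, 5, 30):
    print("sample size", n, "-> skewness", round(skew_of_sample_means(n), 2))
```

In this setup the skewness should shrink as the sample size grows, although it does not vanish at 30; the threshold of 30 is a rule of thumb rather than a sharp boundary.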
How are hypothesis tests set up?
t-tests


Hypothesis Tests are used to find out
whether a sample mean comes from a
sampling distribution with a specified
mean.
We will consider:
• One-sample t-tests
  • μ, σ known
  • μ, σ unknown
• Two-sample t-tests
  • Two matched samples
  • Two independent samples
One-sample t-test
σ known




If σ is known, we can use the central limit theorem
to obtain the sampling distribution of this
population’s mean (mean is μ and standard
deviation is σ/sqrt(N)).
Let X be the mean of our data sample. We compute
z = (X – μ)/(σ/sqrt(N)) (1)
From the z-table, we find the probability of obtaining
a z at least as large as the value we computed. We output
this probability if we are solely interested in a one-tailed
test, and double it before outputting it if we are
interested in a two-tailed test.
If this output probability is smaller than .05, we
reject H0 at the .05 level of significance.
Otherwise, we state that we have no evidence
to conclude that H0 does not hold.
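The steps on this slide can be sketched in Python, with `statistics.NormalDist` standing in for the z-table. The function name `z_test` and all numbers below are hypothetical and chosen only to illustrate equation (1).

```python
import math
from statistics import NormalDist

def z_test(sample_mean, mu, sigma, n):
    """Return (z, one-tailed p, two-tailed p) for equation (1)."""
    z = (sample_mean - mu) / (sigma / math.sqrt(n))
    # One-tailed probability, taken in the direction of the observed deviation.
    p_one = 1.0 - NormalDist().cdf(abs(z))
    # The two-tailed probability doubles the one-tailed one.
    return z, p_one, 2.0 * p_one

# Hypothetical numbers: H0 says mu = 0.80 with known sigma = 0.05;
# a sample of N = 25 gives a mean of 0.82.
z, p_one, p_two = z_test(sample_mean=0.82, mu=0.80, sigma=0.05, n=25)

print(f"z = {z:.2f}, one-tailed p = {p_one:.4f}, two-tailed p = {p_two:.4f}")
print("reject H0 at the .05 level (two-tailed)?", p_two < 0.05)
```

Returning both probabilities keeps the one-tailed versus two-tailed choice with the caller, as on the slide.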
What is the meaning and
purpose of z?



Normal distributions can all be easily mapped
into a single one, using a specific transformation.
This means that, in our hypothesis tests, we can
use the same information about the sampling
distribution over and over (if we assume that our
population is normally distributed), no matter
what the mean and variance of our actual
population are.
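Any observation X can be changed into a standard
score, z, that is, a score with respect to mean = 0
and standard deviation = 1, as follows:
z = (X – mean)/sd
where mean and sd are the mean and standard deviation
of the distribution X comes from.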
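As a small illustration of why a single table suffices, the sketch below assumes a hypothetical normal population with mean 100 and standard deviation 15. It checks that a tail probability computed on the original scale matches the one computed on the standard scale.

```python
from statistics import NormalDist

# Hypothetical normal population with mean 100 and standard deviation 15.
mean, sd = 100.0, 15.0
x = 130.0

z = (x - mean) / sd  # standard score of the observation

# The probability of exceeding x under N(mean, sd) equals the probability
# of exceeding z under the standard normal N(0, 1).
p_original = 1.0 - NormalDist(mean, sd).cdf(x)
p_standard = 1.0 - NormalDist(0.0, 1.0).cdf(z)

print("z =", z)
print("tail probability on the original scale:", p_original)
print("tail probability on the standard scale:", p_standard)
```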
One-sample t-test
σ unknown



In most situations, σ, the standard deviation of the
population, is unknown. In this case, we replace σ by s, the
sample standard deviation, in equation (1), yielding
t = (X – μ)/(s/sqrt(N)) (2)
Because s is likely to under-estimate σ, and, thus,
return a t-value larger than z would have been had σ
been known, it is inappropriate to use the
distribution of z to accept or reject the null
hypothesis.
Instead, we use the Student’s t distribution, which
corrects for this problem, and compare t to the
t-table with N-1 degrees of freedom. We then proceed
as we did for z on the slide about σ known, above.
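A minimal sketch of equation (2) follows, assuming a hypothetical sample of ten accuracy scores and H0: μ = 0.80. Python's standard library has no t-table, so the two-tailed .05 critical value for 9 degrees of freedom (2.262) is hard-coded from a standard t-table; comparing |t| with it is equivalent to checking whether the two-tailed probability is below .05.

```python
import math
import statistics

# Hypothetical sample of N = 10 accuracy scores; H0: mu = 0.80.
sample = [0.81, 0.85, 0.79, 0.88, 0.84, 0.83, 0.86, 0.80, 0.87, 0.82]
mu0 = 0.80

n = len(sample)
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)             # sample standard deviation (N - 1)
t = (x_bar - mu0) / (s / math.sqrt(n))   # equation (2)

# Two-tailed critical value at the .05 level for N - 1 = 9 degrees of
# freedom, copied from a standard t-table.
T_CRIT_DF9 = 2.262

print(f"t = {t:.3f} with {n - 1} degrees of freedom")
print("reject H0 at the .05 level (two-tailed)?", abs(t) > T_CRIT_DF9)
```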
What is the meaning and
purpose of t?



t follows the same principle as z except for
the fact that t should be used when the
standard deviation is unknown.
t, however, represents a family of curves
rather than a single curve. The shape of
the t distribution changes from sample
size to sample size (through its N-1 degrees
of freedom).
As the sample size grows larger and
larger, t looks more and more like a
normal distribution.
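The convergence of the t curves towards the normal curve can be inspected directly. The sketch below writes out the standard density of Student's t distribution (the helper name `t_pdf` is introduced here for illustration) and reports its largest gap from the standard normal density on a grid from -4 to 4.

```python
import math
from statistics import NormalDist

def t_pdf(x, df):
    """Density of Student's t distribution with `df` degrees of freedom."""
    # Work with logarithms so that large df values do not overflow.
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

normal = NormalDist()
for df in (1, 5, 30, 1000):
    gap = max(abs(t_pdf(x / 10, df) - normal.pdf(x / 10))
              for x in range(-40, 41))
    print(f"df = {df:4d}: largest gap from the normal density = {gap:.4f}")
```

The printed gap should shrink as the degrees of freedom increase, which is the behaviour the slide describes.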
Assumption of the t-test with
σ unknown



Please note that one assumption is made in the
use of the t-test: we assume that the
sample was drawn from a normally distributed
population.
This is required because the derivation of t by
Student was based on the assumption that the
sample mean and the sample variance are
independent, an assumption that is true in the
case of a normal distribution.
In practice, however, the assumption about the
distribution from which the sample was drawn can
be lifted whenever the sample size is sufficiently
large to produce a normal sampling distribution of
the mean. In general, n = 25 or 30 (number of
cases in a sample) is sufficiently large. Often, it
can be smaller than that.
Two-sample t-tests
matched samples




Given two matched populations, we want to test
whether the difference in means between these
two populations is significant or not. We do so
by looking at the mean, D, and the standard
deviation, SD, of the paired differences between
these two populations and comparing D to a mean of 0.
We can then apply the t-test as we did above, in
the case where σ was unknown.
This time, we have
t = (D – 0)/(SD/sqrt(n)) (3)
where n is the number of matched pairs.
We use the t-table as before with n-1 degrees of
freedom, and the same assumptions about the
normality of the distribution.
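Equation (3) can be sketched as follows, assuming two hypothetical classifiers evaluated on the same eight data sets (all accuracy values are invented for illustration). The critical value 2.365 is the two-tailed .05 entry for 7 degrees of freedom in a standard t-table.

```python
import math
import statistics

# Hypothetical accuracies of two classifiers on the same 8 data sets
# (matched samples: position i in both lists refers to the same data set).
clf_a = [0.81, 0.76, 0.90, 0.68, 0.85, 0.79, 0.88, 0.73]
clf_b = [0.78, 0.74, 0.89, 0.65, 0.80, 0.78, 0.84, 0.72]

diffs = [a - b for a, b in zip(clf_a, clf_b)]
n = len(diffs)
d_bar = statistics.mean(diffs)   # D in equation (3)
s_d = statistics.stdev(diffs)    # SD in equation (3)
t = (d_bar - 0) / (s_d / math.sqrt(n))

# Two-tailed .05 critical value for n - 1 = 7 degrees of freedom (t-table).
T_CRIT_DF7 = 2.365

print(f"mean difference = {d_bar:.3f}, t = {t:.3f}, df = {n - 1}")
print("reject H0 at the .05 level (two-tailed)?", abs(t) > T_CRIT_DF7)
```

Working on the per-pair differences reduces the problem to the one-sample test with σ unknown, which is why the same t-table lookup applies.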
Two-sample t-tests
independent samples


This time, we are interested in comparing
two populations with different means and
variances. The two populations are
completely independent.
We can, again, apply the t-test, with the
same conditions applying, using the formula:
t = (X1 – X2)/sqrt((s1²/n1) + (s2²/n2))
where X1 and X2 are the two sample means, s1² and s2²
the two sample variances, and n1 and n2 the two sample sizes.
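The formula above can be sketched as a small function. The function name `two_sample_t` and the data are hypothetical. The slide does not state which degrees of freedom to use for this test; the sketch assumes the conservative choice min(n1, n2) – 1, which is an assumption of this example rather than something taken from the notes.

```python
import math
import statistics

def two_sample_t(sample1, sample2):
    """t statistic for two independent samples with unequal variances."""
    n1, n2 = len(sample1), len(sample2)
    m1, m2 = statistics.mean(sample1), statistics.mean(sample2)
    # Sample variances s1^2 and s2^2 (N - 1 denominators).
    v1, v2 = statistics.variance(sample1), statistics.variance(sample2)
    return (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)

# Hypothetical error rates measured on two unrelated groups of data sets.
group1 = [0.20, 0.24, 0.19, 0.23, 0.22, 0.21]
group2 = [0.17, 0.15, 0.19, 0.16, 0.18]

t = two_sample_t(group1, group2)

# ASSUMPTION (not from the notes): conservative degrees of freedom
# min(n1, n2) - 1 = 4, whose two-tailed .05 critical value is 2.776.
T_CRIT_DF4 = 2.776

print(f"t = {t:.3f}")
print("reject H0 at the .05 level (two-tailed)?", abs(t) > T_CRIT_DF4)
```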
Confidence Intervals




Sample means represent point estimates of the mean
parameter. Here, we are interested in interval
estimates, which tell us how large or small the true
value of μ could be without causing us to reject H0,
given that we ran a t-test on the mean of our sample.
To calculate these intervals, we simply take the
equations presented on the previous slides and express
them in terms of μ, as a function of t.
We then replace t by the two-tailed value we are
interested in from the t-table. This value can be positive
or negative, meaning that we will obtain two values for
μ: μupper and μlower. This gives us the limits of the
confidence interval. For equation (2), for example,
μupper = X + t·s/sqrt(N) and μlower = X – t·s/sqrt(N).
The confidence interval means that μ has a certain
probability (attached to the value of t chosen) of
belonging to this interval. For a given sample, the
greater the size of the interval, the greater the
probability that μ is included.
Conversely, the smaller the interval, the smaller the
probability that it is included.
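Solving equation (2) for μ gives the limits X ± t·s/sqrt(N). The sketch below applies this to a hypothetical sample of ten accuracy scores. The critical values 2.262 (two-tailed .05) and 3.250 (two-tailed .01) for 9 degrees of freedom are copied from a standard t-table.

```python
import math
import statistics

# Hypothetical sample of N = 10 accuracy scores.
sample = [0.81, 0.85, 0.79, 0.88, 0.84, 0.83, 0.86, 0.80, 0.87, 0.82]
n = len(sample)
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)
std_err = s / math.sqrt(n)

# Two-tailed critical values for N - 1 = 9 degrees of freedom (t-table).
T_95_DF9 = 2.262   # .05 level, i.e. a 95% interval
T_99_DF9 = 3.250   # .01 level, i.e. a 99% interval

half_95 = T_95_DF9 * std_err
half_99 = T_99_DF9 * std_err
mu_lower, mu_upper = x_bar - half_95, x_bar + half_95

print(f"95% interval: [{mu_lower:.4f}, {mu_upper:.4f}]")
print(f"99% interval: [{x_bar - half_99:.4f}, {x_bar + half_99:.4f}]")
```

The 99% interval is wider than the 95% one, which matches the remark that a larger interval goes with a larger probability of including μ.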