STAT05 – Inferential Statistics


Applied statistics for testing and evaluation – MED4
Lecturer: Smilen Dimitrov
Introduction
• In descriptive statistics we previously discussed measures of
  – central tendency (location) of a data sample (collection): the arithmetic mean, median and mode; and
  – statistical dispersion (variability): the range, variance and standard deviation.
• Today we look at the concept of distributions in more detail, and introduce quantiles as a final topic in descriptive statistics.
• We will also look at how we perform these operations in R, and a bit more about plotting.
Review of frequency distributions
[Figure: a relative frequency distribution and a cumulative frequency distribution, each plotted as y against x.]
• Relative frequency distribution: the value of y is the relative frequency of occurrence of x – the percentage of cases/times x has occurred in the sample.
  rf <- table(Data.Sample)/length(Data.Sample)
  y <- rf["x"]
• Cumulative frequency distribution: the value of y is the relative frequency of occurrence of values less than x; the quantile function inverts this relationship:
  x <- quantile(Data.Sample, y)
Review of PDF and CDF
[Figure: the PDF and CDF of a continuous distribution, each plotted as y against x.]
• PDF – probability density function: y = p(x) = pdf(x), the probability of getting exactly x.
• CDF – cumulative distribution function: y = f(x) = cdf(x), the probability of getting less than x (the area under the PDF curve from –infinity to x).
• Quantiles: the x values obtained by dividing the y range of the CDF (from 0 to 1) into q equal parts – e.g. the second quartile is the median.
Review of PDF and CDF – Uniform distribution in R
[Figure: the PDF and CDF of the uniform distribution on (-3, 3).]
• runif gives random samples, uniformly distributed. Its parameters are the range (min and max value): e.g. runif(n, -3, 3) specifies the range (-3, 3).
• PDF: y = pdf(x) = dunif(x, -3, 3), the probability of getting exactly x. E.g. dunif(0.7, -3, 3) gives y = 0.166...
• CDF: y = cdf(x) = punif(x, -3, 3), the probability of getting less than x (the area under the PDF curve). E.g. punif(0.7, -3, 3) gives y = 0.616...
• Inverse CDF: x = cdf⁻¹(y) = qunif(y, -3, 3). E.g. qunif(0.616, -3, 3) gives x = 0.7.
Review of PDF and CDF – Normal distribution in R
[Figure: the PDF and CDF of the normal distribution.]
• rnorm gives random samples, normally distributed. Its parameters are the mean and the standard deviation (these translate and scale the curve). Default: the standard normal distribution, mean 0, sd = 1.
• PDF: y = pdf(x) = dnorm(x), the probability of getting exactly x. E.g. dnorm(0.7) gives y = 0.312...
• CDF: y = cdf(x) = pnorm(x), the probability of getting less than x (the area under the PDF curve). E.g. pnorm(0.7) gives y = 0.758...
• Inverse CDF: x = cdf⁻¹(y) = qnorm(y). E.g. qnorm(0.758) gives x = 0.7.
Review of PDF and CDF – T distribution in R
[Figure: the PDF and CDF of Student's t distribution with 5 degrees of freedom.]
• rt gives random samples, t-distributed. Its parameter is the degrees of freedom (which can only scale the curve): e.g. rt(n, 5) specifies 5 degrees of freedom.
• PDF: y = pdf(x) = dt(x, 5), the probability of getting exactly x. E.g. dt(0.7, 5) gives y = 0.286...
• CDF: y = cdf(x) = pt(x, 5), the probability of getting less than x (the area under the PDF curve). E.g. pt(0.7, 5) gives y = 0.742...
• Inverse CDF: x = cdf⁻¹(y) = qt(y, 5). E.g. qt(0.742, 5) gives x = 0.7.
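As a quick check, the quoted values can be reproduced directly in R (a small sketch, not from the slides; the (-3, 3) range and 5 d.o.f. match the figures above):

  dunif(0.7, min = -3, max = 3)        # 0.1666... (the PDF value)
  punif(0.7, min = -3, max = 3)        # 0.6166... (the CDF value)
  qunif(0.6166667, min = -3, max = 3)  # 0.7 (the inverse CDF)
  dnorm(0.7)                           # 0.3122...
  pnorm(0.7)                           # 0.7580...
  qnorm(0.7580363)                     # 0.7
  dt(0.7, df = 5)                      # 0.2866...
  pt(0.7, df = 5)                      # 0.7428...
  qt(0.7428466, df = 5)                # 0.7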
Samples and population - sampling
• Descriptive statistics summarize the characteristics of a sample of data.
• Inferential statistics attempt to say something about a population on the basis of a sample of data – to infer about all on the basis of some.
• 'How many penguins are there on a particular ice floe in the Antarctic?'
• Penguins tend to move around and swim off, and it's cold! So scientists use aerial photographs and statistical sampling to estimate population size.
Samples and population - sampling
• Imagine a large, snow-covered, square region of the Antarctic that is inhabited by penguins. From above, it would look like a white square sprinkled with black dots.
• If you had such access, you could count the dots to determine the number of penguins in this region.
• The region is too large for one photo – instead, take 100 photographs of the 100 smaller square sub-regions and count them all. But that would take too long and be too expensive!
• (The total count for the population in the image is 500.)
Samples and population - sampling
• Another alternative: select a representative sample of the sub-regions, obtain photos of only these, and use the counts from these sub-regions to estimate the total number of penguins.
Samples and population - sampling
• Suppose you had access to three samples; use the results from each to obtain an estimate.
• Notice the effect of sample size on the estimate!
Samples and population - sampling
• There is a balancing act in selecting the sample size.
• A larger sample size may cost more money or be more difficult to generate, but it should provide a more accurate estimate of the population characteristic.
• For this sample of 10 photographs, the estimate is 450.
• Estimates for the total penguin population vary quite a bit based on both the sample size and which sub-regions were sampled. The decision about how to select a sample, accordingly, is a critical one in statistics.
Samples and population - sampling
• There are different ways to randomly select a sample.
• Think of a way to pick 10 numbers between 00 and 99 at random.
• One possible method is to use two 10-sided dice, one red and one blue.
• The sub-region can then be determined by the two dice (read in the order red, then blue).
• This random selection process will sometimes produce duplicates.
Samples and population - sampling
• For instance, you might find that seven tosses of the dice produced these sub-region choices: 19 22 39 50 34 05 39.
• If we do not want duplicates, we can skip them until we get 10 distinct numbers, for example: 19 22 39 50 34 05 75 62 87 13.
• This is called sampling without replacement.
• The estimate of the total number of penguins for the entire region based on this random sample is the mean count per photographed sub-region, scaled up to all 100 sub-regions: 450 (see the sketch below).
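A minimal R sketch of this procedure, with simulated sub-region counts (rpois here is a hypothetical stand-in for the real photographs, so the resulting estimate is illustrative rather than exactly 450):

  set.seed(1)
  region.counts <- rpois(100, lambda = 5)   # penguins in each of the 100 sub-regions
  photo <- sample(1:100, size = 10)         # pick 10 sub-regions without replacement
  mean(region.counts[photo]) * 100          # scale the mean count up to the whole region
  sum(region.counts)                        # the true population total, for comparison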
Effect of sample size on calculated parameters
• Answers for an estimate for a population will vary depending on which particular elements were taken in a sample.
• To see the effects, we can perform sampling with replacement on our raisin sample in R – bootstrapping – and plot the results (see the sketch below).
[Figure: bootstrap samples of increasing size, marking the sample mean and one-s.d. range against the population mean and one-s.d. range.]
• In this plot, the darker a dot is, the more times the value occurs in a given sample.
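A minimal sketch of such a bootstrap, assuming Data.Sample holds the raisin counts (the vector below is a hypothetical stand-in for the real data):

  Data.Sample <- c(25, 28, 26, 30, 27, 29, 24, 31, 28, 27)  # hypothetical raisin counts
  boot.means <- replicate(1000, mean(sample(Data.Sample, replace = TRUE)))
  mean(boot.means)   # centre of the bootstrapped means
  sd(boot.means)     # their spread
  hist(boot.means)   # their distribution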
Effect of sample size on calculated parameters
• Conclusion: the greater the sample size, the more closely the sample mean and variance approach those of the population.
Effect of sample size on calculated parameters
• We can repeat this exercise many more times, each time taking a random sample from a normally distributed variable, and showing only the variance.
• As sample size declines, the range of estimates of the sample variance increases dramatically (remember that the population variance is constant at σ² = 4 throughout).
• The problem becomes severe below samples of 13 or so, and is very serious for samples of seven or fewer.
• For small samples, the estimated variance is badly behaved, and this has serious consequences for estimation and hypothesis testing. The sketch below reproduces the idea.
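A sketch of this experiment in R, assuming a normal population with sd = 2 (variance 4); the 1000 repetitions per sample size are an arbitrary choice:

  sizes <- 3:30
  spread <- sapply(sizes, function(n)
    range(replicate(1000, var(rnorm(n, mean = 0, sd = 2)))))
  matplot(sizes, t(spread), type = "l", lty = 1,
          xlab = "sample size", ylab = "range of estimated variances")
  abline(h = 4, lty = 2)   # the constant population variance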
Effect of sample size on calculated parameters
How many samples?
• For general statistics:
  – Take n = 30 samples if you can afford it, and you won't go far wrong.
  – Anything less than this is a small sample, and anything less than 10 is a very small sample.
• Usually, our data forms only a small part of all the possible data we could collect.
• Not all possible users participate in a usability test, and not every possible respondent answers our questions.
• The mean we observe is therefore unlikely to be the exact mean for the whole population – the scores of our users in a test are not going to be an exact index of how all users would perform.
  – How can we relate our sample to everyone else?
Confidence intervals
• If we repeatedly sample and calculate means from a population (with any distribution), our list of means will itself be normally distributed (the central limit theorem).
[Figure: the wide population distribution and the narrower plot of means from samples, with the probability p, the CI, the (true) mean and a sample mean marked.]
• This implies that our observed mean follows the same rules as all data under the normal curve.
• The distribution of means is normal around the "true" population mean – so our observed mean is 68% likely to fall within 1 SD of the true population mean; but we don't know the "true" population mean.
• We only have the sample, not the population
  – so we use an estimate of this SD of means, known as the Standard Error of the Mean.
  – And we can say that we are 68% confident that the true mean = sample mean ± standard error of the sample mean -> a confidence interval.
• A confidence interval (CI) for a population parameter is an interval between two numbers with an associated probability p, which is generated from a random sample of an underlying population.
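R has no built-in standard-error function, but it is a one-liner (a sketch):

  se <- function(x) sd(x) / sqrt(length(x))   # standard error of the mean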
Confidence intervals
Example
• We test 20 users on a new interface: mean error score 10, sd 4.
• What can we infer about the broader user population?
• According to the central limit theorem, our observed mean (10 errors) is itself 95% likely to be within 2 s.d. of the true (but unknown to us) mean of the population.
• The standard error of the mean is 4/√20 = 0.89, and the observed (sample) mean lies within a normal distribution about the 'true' or population mean, so we can be:
  – 68% confident that the true mean = 10 ± 0.89 (sample mean ± 1*s.e.)
  – 95% confident that the population mean = 10 ± 1.78 (sample mean ± 2*s.e.)
  – 99% confident that it is within 10 ± 2.67 (sample mean ± 3*s.e.)
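The same arithmetic as a short R sketch:

  n <- 20; m <- 10; s <- 4
  se <- s / sqrt(n)        # about 0.894
  m + c(-1, 1) * se        # ~68% CI: 9.11 to 10.89
  m + c(-1, 1) * 2 * se    # ~95% CI: 8.21 to 11.79
  m + c(-1, 1) * 3 * se    # ~99% CI: 7.32 to 12.68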
Confidence intervals
Example B – assumed normal distr.
• A machine fills cups with margarine, and is supposed to be adjusted so that the mean content of the cups is close to 250 grams of margarine (the true population mean µ should be 250, but we don't know if it is yet).
• To check whether the machine is adequately adjusted, a sample of n = 25 cups of margarine is chosen at random and the cups weighed.
• The general sample mean is the estimator of the expectation, or population mean µ:
  \hat{\mu} = \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i
• For this sample, with actual weights x_1, ..., x_25:
  \bar{x} = \frac{1}{25} \sum_{i=1}^{25} x_i = 250.2 grams,
  with standard deviation
  s_{n-1} = s = \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 }
• There is a whole interval around the observed value 250.2 of the sample mean within which, if the whole population mean actually takes a value in this range, the observed data would not be considered particularly unusual.
• Such an interval is called a confidence interval for the parameter µ.
Confidence intervals
Example B
• In our case we may determine the endpoints by considering that the sample mean \bar{X} from a normally distributed sample is also normally distributed, with the same expectation µ, but with standard deviation σ/√n = 0.5 grams – note this is the standard error of the mean!
• So we make the standardization replacement:
  Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{\bar{X} - \mu}{0.5}
• with Z now having a standard normal distribution (mean = 0, sd = 1), independent of the parameter µ to be estimated.
• Hence it is possible to find numbers -z and z, independent of µ, between which Z lies with probability 1 - α, a measure of how confident we want to be. We take 1 - α = 0.95.
Confidence intervals
Example B
• We take 1 - α = 0.95, so we have
  P(-z \le Z \le z) = 1 - \alpha = 0.95
• z can be calculated from the CDF. Remember that the CDF Φ(z) gives the probability of Z ≤ z only! So
  \Phi(z) = P(Z \le z) = 1 - \frac{\alpha}{2} = 0.975
  z = \Phi^{-1}(0.975) = 1.96
[Figure: the standard normal curve with Φ(z) and the 95% range between -z and z marked.]
• So
  0.95 = P\left( -1.96 \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le 1.96 \right) = P\left( \bar{X} - 0.98 \le \mu \le \bar{X} + 0.98 \right)
• So the 95% confidence interval is between \bar{X} - 0.98 and \bar{X} + 0.98.
Confidence intervals
Example B
• So the 95% confidence interval is between \bar{X} - 0.98 and \bar{X} + 0.98.
• Or, with probability 0.95, one will find the parameter µ between these stochastic endpoints (in this example: 249.22 and 251.18).
• Every time the measurements are repeated, there will be another value for the mean \bar{X} of the sample.
  – In 95% of the cases µ will be between the endpoints calculated from this mean, but in 5% of the cases it will not be.
• We cannot say: 'with probability (1 − α) the parameter μ lies in the confidence interval.'
  – We only know that, by repetition, in 100(1 − α)% of the cases μ will be in the calculated interval.
Confidence intervals
Bootstrap
• We can also use the bootstrap to find a confidence interval: by sampling with replacement many times from a sample, finding the means, and looking for a confidence interval based on these (see the sketch below).
• In R we use the quantile function for this, which will generate the cumulative frequency distribution based on a sample.

Student's t
• For small sample sizes (n < 30), instead of the CDF for the normal distribution (qnorm) we can use Student's t distribution (qt).
Student’s t distribution
• Student's t-distribution is a probability distribution that arises in the problem of estimating the mean of a normally distributed population when the sample size is small:
  f(t) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)} \left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}}
• The distribution depends on ν (d.o.f.), but not on μ (mean) or σ (s.d.); the lack of dependence on μ and σ is what makes the t-distribution important.
• The overall shape of the pdf of the t-distribution resembles the bell shape of a normally distributed variable with mean 0 and variance 1, except that it is a bit lower and wider. As the number of d.o.f. grows, the t-distribution approaches the normal distribution with mean 0 and variance 1.
Student’s t distribution
[Figure: in red, Student's t distribution with 5 d.o.f.; in black, the standard normal distribution (mean = 0, sd = 1).]
Single sample inference and tests
• Suppose we have a single sample. The questions we might want to answer are these:
  – What is the mean value?
  – Is the mean value significantly different from current expectation or theory?
  – What is the level of uncertainty associated with our estimate of the mean value?
• We use statistical tests to infer significant differences in the single-sample case.
Single sample inference and tests
Procedure for testing a statistical hypothesis:
1. State the null hypothesis – the current knowledge (or lack of knowledge) before the experiment takes place.
2. State the alternative hypothesis – the research hypothesis that we want to prove; our claim.
3. Choose a test statistic T. It must be suitable to differentiate between the null and alternative hypotheses. Calculate the value of T from the data.
4. Choose a significance level for the test: α = the probability of observing a value of the statistic which falls in the critical region. It may be given; the most popular value of α is 5%.
5. Calculate the rejection region, the acceptance region and the critical value.
6. Decision: if T falls into the rejection region, we reject H0. If T does not fall into the rejection region, we do not reject H0.
Indeed, the wording is always the same for all kinds of tests, and you should try to memorize it. The abbreviated form is easier to remember: larger, reject; smaller, accept.
Single sample inference – Student’s t test
The t statistic has a t distribution with n-1 degrees of freedom.
The formal procedure is as follows:
1. Null hypothesis: H0: µ = µ0
2. Alternative hypothesis:
   a) Ha: µ > µ0 (one-sided)
   b) Ha: µ < µ0 (one-sided)
   c) Ha: µ ≠ µ0 (two-sided)
3. Test statistic:
   t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}
4. Decide on the value of α.
5. Calculate the p-value:
   a) p-value = P(t > observed t)
   b) p-value = P(t < observed t)
   c) p-value = 2 P(t > |observed t|)
   The probabilities of t are based on a t distribution with n-1 d.f.
6. Reject H0 when the p-value < α. Do not reject H0 when the p-value ≥ α.
Single sample inference – Student’s t test
Example: A casino makes the assumption that the average number of bets a customer plays is at least 7. A floor manager suspects that the number may be less than 7, and in order to confirm his suspicions he takes a sample of n = 6 customers and calculates the mean number of plays and the sample variance, obtaining x̄ = 6.15, s² = 3.955.
Perform a hypothesis test to check the manager's suspicions.
1. H0: µ = 7
2. Ha: µ < 7
3. x̄ = 6.15, s² = 3.955, t = (6.15 - 7)/((3.955/6)^(1/2)) = -1.047
4. α = 0.1
5. df = 6 - 1 = 5, p-value = P(t < -1.047) = P(t > 1.047) = 0.17 (approx. from tables); in R: 1 - pt(1.047, 5)
6. p-value = 0.17 > α = 0.1, so we do not reject the casino's assumption of an average of 7 or more bets per customer.
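The same test as an R sketch, starting from the summary statistics (the raw data are not given):

  xbar <- 6.15; s2 <- 3.955; n <- 6; mu0 <- 7
  t.obs <- (xbar - mu0) / sqrt(s2 / n)   # -1.047
  pt(t.obs, df = n - 1)                  # one-sided p-value, about 0.17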
”Tails”
Predicting the direction of the difference:
• Since we stated that we wanted to see if [something] was BETTER (> 70%), not just DIFFERENT (< or > 70%), this asks for a one-sided test.
• For a one-tailed (directional) test, the tester narrows the odds by half by testing for a specific difference. One-sided predictions specify which part of the normal curve the observed difference must reside in (left or right).
• For a two-sided test, one just wants to see if there is ANY difference (better or worse) between A and B.
• (Here the manager wants to see if a customer makes fewer than 7 bets – not whether he makes any number of bets different from 7 – so it is a one-tailed test.)
Two sample inference and tests
• The so-called classical tests deal with some of the most frequently used kinds of analysis, and they are the models of choice for:
  – comparing two variances (Fisher's F test, var.test),
  – comparing two sample means with normal errors (Student's t-test, t.test),
  – comparing two sample means with non-normal errors (Wilcoxon's rank test, wilcox.test),
  – comparing two proportions (the binomial test, prop.test),
  – testing for independence in contingency tables (the chi-square test, chisq.test, or Fisher's exact test, fisher.test).
• First, we must realize the following: is it right to say samples with the same mean are identical? No!
  – When the variances are different, don't compare the means. If you do, you run the risk of coming to entirely the wrong conclusion.
Two sample inference – Fisher F test (two variances)
• Before we can carry out a test to compare two sample means, we need to test whether the sample variances are significantly different – the Fisher F test.
• To compare two variances, all you do is divide the larger variance by the smaller variance (this is the F ratio).
• In order for the variances to be significantly different, the F ratio will need to be significantly bigger than 1. How will we know a significant value of the variance ratio from a non-significant one?
  – The answer, as always, is to look up a critical value – this time from the F distribution.
• The F distribution needs the d.o.f. (sample size - 1) in the numerator and the denominator of the F ratio.
Two sample inference – Fisher F test (two variances)
• Example:
• Two gardens with n = 10 entries (9 d.o.f. each), the same mean, two variances.
• Null hypothesis: the two variances are not significantly different.
• Set the α = 0.05 confidence level.
• Find the critical value of F for this α = 0.05 and 9 d.o.f. for both numerator and denominator (through quantiles of the F function) – e.g. 4.
• E.g. the F ratio of the variances is 10, which is greater than the critical value of F (4); therefore reject the null hypothesis
  – and accept the alternative hypothesis that the two variances are significantly different.
• Because the variances are significantly different, it would be wrong to compare the two sample means using Student's t-test.
• The F test is the simplest analysis of variance (ANOVA) problem.
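A sketch of this comparison in R; the two garden samples below are hypothetical entries with the same mean but very different variances:

  gardenB <- c(5, 6, 6, 5, 4, 5, 3, 5, 7, 4)   # mean 5, variance 1.33
  gardenC <- c(1, 9, 9, 1, 1, 9, 5, 5, 1, 9)   # mean 5, variance 14.22
  var(gardenC) / var(gardenB)                  # F ratio, about 10.7
  qf(0.975, df1 = 9, df2 = 9)                  # critical value, about 4.03
  var.test(gardenB, gardenC)                   # the same test done by R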
Two sample inference
• How likely is it that our two sample means were drawn from populations with the same average?
  – If the answer is "highly likely", then we shall say that our two sample means are not significantly different.
• There are two simple tests for comparing two sample means:
  – Student's t-test, when the samples are independent, the variances constant, and the errors normally distributed; or
  – the Wilcoxon rank-sum test, when the samples are independent but the errors are not normally distributed (e.g. they are ranks or scores of some sort).
Two sample inference - Student’s t test (two means)
• When we are looking at the differences between scores for two groups, we have to judge the difference between their means relative to the spread or variability of their scores – the t test does this.
• The test statistic (t-value) is the number of standard errors by which the two sample means are separated:
  t = \frac{\text{difference between the two means}}{\text{standard error of the difference}} = \frac{\bar{x}_A - \bar{x}_B}{SE_{diff}}
• Now we know the standard error of the mean, but we have not yet met the standard error of the difference between two means. For two independent (i.e. non-correlated) variables, the variance of a difference is the sum of the separate variances, so:
  SE_{diff} = \sqrt{ \frac{s_A^2}{n_A} + \frac{s_B^2}{n_B} }
Two sample inference – Student’s t test (two means)
• Example:
• Two gardens with n = 10 entries (9 d.o.f. each), different means.
• The null hypothesis is that the two sample means are the same, and we shall accept this unless the value of Student's t is so large that it is unlikely that such a difference could have arisen by chance alone.
• Set the α = 0.05 confidence level – the chance of rejecting the null hypothesis when it is true (this is the Type I error rate).
• Find the critical value of t for this α = 0.05 and 18 d.o.f. for both gardens together (through quantiles of the t function) – e.g. 2.1.
• E.g. the value of Student's t is -3.872; take the absolute value.
• 3.872 is greater than the critical value of t (2.10); therefore reject the null hypothesis
  – and accept the alternative hypothesis that the two means are significantly different.
• This t test is equivalent to a one-way ANOVA problem – note that ANOVA can deal with three means at once.
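A sketch in R; the garden data below are hypothetical, chosen so that the test reproduces the quoted t value of about -3.87:

  gardenA <- c(3, 4, 4, 3, 2, 3, 1, 3, 5, 2)   # mean 3
  gardenB <- c(5, 6, 6, 5, 4, 5, 3, 5, 7, 4)   # mean 5
  qt(0.975, df = 18)                           # critical value, about 2.10
  t.test(gardenA, gardenB, var.equal = TRUE)   # gives t = -3.873, p < 0.05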
Review
Descriptive statistics
• Measures of central tendency (location): arithmetic mean, median, mode
• Measures of statistical variability (dispersion – spread): range, variance, standard deviation, quantiles, interquartile range
• Probability distributions: uniform, normal (Gaussian) and t

Inferential statistics
• Standard error – inference about unreliability
• Confidence interval
• Single sample t-test
• Two sample F-test and t-test
Exercise for mini-module 5 – STAT03
None