Probability & Statistical Inference
Lecture 4
MSc in Computing (Data Analytics)
Lecture Outline
Modern statistics uses a number of mathematical results to
relate descriptive statistics and probability theory.
These can be divided (roughly) under three headings:
- Central Limit theorem (large samples)
- Maximum Likelihood Methods (large samples)
- Small sample results
Although the mathematical details are quite different in each
case – the end results and the reasoning used are almost
identical.
We will look in detail at the Central Limit Theorem but
without the higher mathematics.
If you can understand the working of the Central Limit
Theorem, then you also get the essential understanding of
the other methods.
Sampling Theory – Statistical Models
Central Limit Theorem (CLT) – A description
How many voters will give F.F. a first preference in the
next general election?
We have 2 different estimates
1. Researcher A (10 people) => 40%
2. Researcher B (100 people) => 25%
How much 'better' is estimate B than estimate A ?
Real Question: What makes a 'good' estimate ?
• unbiased
• low variability
• i.e. if the survey was repeated, we should get a 'similar' answer
Example
Suppose an engineer wants to estimate the lifetime of an electronic
component.
Using simple random sampling, they select a sample and test it. The sample is taken
so that the component lifetimes can be considered to be independent of each
other.
God's-eye view: mean lifetime, µ = 4,900 hours
σ = 3959 hours
(you would never know this in practice however)
• This is the population
• Note: it is highly skewed and is
NOT normal
• What would happen if we took
repeated samples of the same
size and calculated their
means?
Example Continued
Experiment: take a sample of size 2 from this population
and get the mean of the sample
Repeat this 2,000 times
Now have 2,000 means - what would the histogram of all
these means look like?
What would happen if you did the same experiment, but
with samples of sizes 10, 20 and 30?
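A minimal Python sketch of this experiment. The slides do not say which distribution the population follows, so a gamma distribution matched to the stated mean (4,900) and standard deviation (3,959) is assumed here purely as a stand-in skewed population; the function name and seed are illustrative.

```python
import random
import statistics

MU, SIGMA = 4900.0, 3959.0   # population mean and SD from the example
# Assumption: a gamma population with this mean and SD (skewed, not Normal).
SHAPE = (MU / SIGMA) ** 2
SCALE = SIGMA ** 2 / MU

def sample_means(n, reps=2000, seed=1):
    """Return `reps` sample means, each from an independent sample of size n."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.gammavariate(SHAPE, SCALE) for _ in range(n)])
        for _ in range(reps)
    ]

if __name__ == "__main__":
    # Print the sample size, the mean of the means, and their spread.
    for n in (2, 10, 20, 30):
        means = sample_means(n)
        print(n, round(statistics.fmean(means)), round(statistics.stdev(means)))
```

Plotting a histogram of each list of means would show the shape and spread changing with the sample size.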
Distribution of the Sample Means
varying the sample size
Original Distribution
Note that the histogram
becomes more Normal as
the sample size increases
Same result but
plotted on the same scale
Note the spread
decreases with
increasing sample size
Central Limit Theorem
What has happened?
As the sample size increased, the shape of the histogram of means approached
the Normal distribution
As the sample size increased, the spread (standard deviation) of the
sample means decreased
These histograms are pictures of The Sampling Distribution of the
Mean
This phenomenon will happen for ANY population with a finite mean and standard deviation
The theorem that establishes this is called the Central Limit Theorem (CLT)
The proof of the CLT involves some fairly non-trivial mathematics
Central Limit Theorem
Since bigger samples are more representative, two means
from samples of size=100 are more likely to be closer
together than two means from samples of size=10
The larger the sample size is the more the sample means
will tend to agree, so the standard deviation of the
Sampling Distribution of the Mean will decrease
When the sample size is sufficiently large, the Sampling
Distribution of the Mean will be Normally distributed
Central Limit Theorem
If a random sample is taken from a population, where:
Each member of the sample can be considered to be
independent of the others
They are all members of the same population
That population has a mean value μ and a standard deviation σ
Then,
A sample mean ($\bar{X}$) can be considered a random
variable sampled from a probability distribution of
possible sample means of the same size, called the
Sampling Distribution of the Mean.
Definition: Central Limit Theorem
continued…
The sampling distribution of the mean has an average value = μ (the
population mean).
The sampling distribution of the mean has a standard deviation =
$\sigma / \sqrt{n}$
Where σ is the population standard deviation, and n is the sample
size taken.
This value is called the standard error of the mean.
The Sampling Distribution of the Mean will be a Normal distribution
if the sample size is large.
CLT - Summary
When the sample size is sufficiently large, the
Sampling Distribution of the Mean will be
normally distributed
with a mean = μ,
and a standard deviation (i.e. standard error) = $\sigma / \sqrt{n}$
From the simulation above;
For a sample size of 2, the standard error of the mean
should be
= 3959 / √2 = 2,799
              Mean from        Standard deviation    Actual standard
              2,000 samples    predicted by CLT      deviation
Population    4,900            -                     3,959
Size = 2      5,017            2,799                 2,805
Size = 10     4,899            1,251                 1,232
Size = 20     4,915            885                   871
Size = 30     4,934            722                   732
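The "predicted by CLT" figures can be reproduced directly from σ / √n. A small sketch (the function name is illustrative); the results agree with the tabulated values up to rounding.

```python
import math

def standard_error(sigma, n):
    """Standard error of the mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

# Population SD from the example, evaluated at the four sample sizes used.
for n in (2, 10, 20, 30):
    print(f"n = {n:2d}   predicted standard error = {standard_error(3959, n):,.1f}")
```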
Practical use for the CLT continued…
This avoids the necessity of specifying a complete
statistical model for all the sampled data.
All we have to do is specify a probability model for the
sample mean.
For any sample mean, calculated from a large
independent random sample taken from ANY population
with a mean μ and standard deviation σ, we know from
the CLT, that this sample mean is a random variable from
a Normal distribution with a mean = μ and a standard
deviation = $\sigma / \sqrt{n}$
Practical use for the CLT continued…
Take a single sample and calculate $\bar{X}$
This is an estimate of μ – the true (but unknown)
population mean.
But, how good is this estimate?
We assume that $\bar{X}$ is not exactly μ, but is
somewhere near μ - but how near is it likely to be?
Confidence Intervals Introduction
We would like to make probability statements as to
how close $\bar{X}$ is likely to be to μ.
If the sample size is sufficiently large, then the estimate $\bar{X}$
can be considered as:
a random variable from a Normal distribution,
so probability statements are possible.
This is how we use the CLT in practical data analysis.
For a Normal distribution, we know that 95% of
values will be within 1.96 standard deviations of the mean.
So, given one estimate $\bar{X}$, we can say that this estimate
is within 1.96 standard errors of the actual
population mean μ, with 95% confidence
[Figure: Normal curve with the central 95% of the area shaded]
• We can turn this knowledge on its
head: given $\bar{X}$,
we can be 95% confident that the
true mean μ is within 1.96 standard
errors of it.
Confidence Interval
From this we can specify a range of values within which we are
95% confident that the population mean (μ) lies
This is called a confidence interval
95% Confidence Interval for a population mean
(from large enough sample):
$\bar{x} \pm 1.96 \times \text{standard error} = \bar{x} \pm 1.96 \frac{\sigma}{\sqrt{n}}$
Remarkably, as a rule of thumb this result holds well for samples of size 30 or more.
So, a large sample in this context is a sample of 30 or more.
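What "95% confident" means can be checked by simulation: draw many samples of size 30, build the interval from each, and count how often it contains μ. The sketch below assumes a gamma population matched to the mean (4,900) and standard deviation (3,959) of the electronic-components example, since the slides do not name the actual distribution; the function name, repetition count and seed are illustrative.

```python
import math
import random
import statistics

MU, SIGMA = 4900.0, 3959.0
# Assumption: gamma stand-in for the skewed population.
SHAPE, SCALE = (MU / SIGMA) ** 2, SIGMA ** 2 / MU

def coverage(n=30, reps=5000, z=1.96, seed=2):
    """Fraction of intervals x_bar +/- z*sigma/sqrt(n) that contain MU."""
    rng = random.Random(seed)
    half_width = z * SIGMA / math.sqrt(n)
    hits = 0
    for _ in range(reps):
        x_bar = statistics.fmean([rng.gammavariate(SHAPE, SCALE) for _ in range(n)])
        if abs(x_bar - MU) <= half_width:
            hits += 1
    return hits / reps

if __name__ == "__main__":
    print(f"Proportion of intervals containing the true mean: {coverage():.3f}")
```

The proportion should come out close to 0.95, even though the assumed population is far from Normal.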
Example
One sample of size 30 from the electronic components yields
a sample mean = 5,873 hours. We know σ = 3,959, so a 95%
confidence interval would be:
$\bar{x} \pm 1.96 \times \text{standard error} = \bar{x} \pm 1.96 \frac{\sigma}{\sqrt{n}}$
$= 5873 \pm 1.96 \times \frac{3959}{\sqrt{30}}$
$= 5873 \pm 1417$, i.e. 4456 to 7290
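The same calculation as a short Python sketch (the function name is illustrative, not from the lecture):

```python
import math

def mean_ci_known_sigma(x_bar, sigma, n, z=1.96):
    """Large-sample CI for a mean when sigma is known: x_bar +/- z*sigma/sqrt(n)."""
    half_width = z * sigma / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width

low, high = mean_ci_known_sigma(5873, 3959, 30)
print(f"{low:.0f} to {high:.0f}")   # 4456 to 7290
```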
So, we would say that the average lifetime of all components
(μ) is between 4,456 and 7,290 hours with 95% confidence
Confidence Intervals
Why is this any good?
Before: one estimate, $\bar{x}$ = 5,873, but no idea of how good or bad
it was, i.e. how close to μ it was likely to be.
Now: 95% confident that μ is between 4,456 and 7,290 hours.
So, using CLT ~> Confidence Intervals ~> able to get an
estimate with certain level of confidence that can be justified,
i.e. it gives us an objective measure of the actual amount of
information contained in our sample about the likely location
of μ.
General Confidence Interval for μ (σ known)
The general formula is:
$CI_{1-\alpha} = \bar{x} \pm z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}}$
Where:
• α is a value between 0 and 1,
• (1-α)×100% is the confidence level you want
• Z1-α/2 is a value from the Normal distribution table.
• Example: for a 95% CI, α = 0.05
(1-α)×100% = 95%
Z1-α/2 = 1.96
Problem with σ
All of the above assumes that the population standard deviation (i.e.
σ) is known.
In practice this is not known (just like μ).
=> So, we need to estimate σ as well as μ
=> we get this estimate from the standard
deviation of the sample
Sample Standard Deviation is called ‘s’
=> Estimate σ by s, when the sample size is large:
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
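A sketch of the sample standard deviation, written out with the n − 1 divisor and checked against Python's built-in `statistics.stdev`. The lifetimes below are made-up illustrative values, not data from the lecture.

```python
import math
import statistics

def sample_sd(xs):
    """Sample standard deviation: sqrt(sum((x - x_bar)^2) / (n - 1))."""
    x_bar = sum(xs) / len(xs)
    return math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (len(xs) - 1))

# Hypothetical component lifetimes in hours (illustration only).
lifetimes = [5200.0, 1300.0, 9800.0, 4100.0, 2600.0]

print(sample_sd(lifetimes))
print(statistics.stdev(lifetimes))   # the library version uses the same n - 1 divisor
```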
Z-Values
The values of Z1-α/2 for other % confidence intervals
are given in standard tables.
Confidence Level    α/2               Z1-α/2
90%                 0.05 (5%)         1.6449
95%                 0.025 (2.5%)      1.96
99%                 0.005 (0.5%)      2.5758
99.9%               0.0005 (0.05%)    3.2905
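Instead of a printed table, the Z value for any confidence level can be computed as the 1 − α/2 quantile of the standard Normal distribution. A sketch using Python's standard library (the function name is illustrative):

```python
from statistics import NormalDist

def z_value(confidence):
    """Two-sided critical value: the (1 - alpha/2) quantile of Normal(0, 1)."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.90, 0.95, 0.99, 0.999):
    print(f"{level:.1%}   Z = {z_value(level):.4f}")
```

For the 99.9% level this gives approximately 3.2905.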
Example
Using these we get the following results for the electronic
component example:
Confidence Level    Z1-α/2    CI
90%                 1.6449    4684 to 7062
95%                 1.96      4456 to 7290
99%                 2.5758    4011 to 7735
99.9%               3.2905    3495 to 8251
Note: as α gets smaller, the CI gets wider
Also, as n gets bigger the CI narrows –
so big samples lead to more precise estimates (i.e.
narrower confidence intervals)
What CI’s and sample sizes should I use?
• You can’t control s – it is inherent in the data
(population).
• You can’t control x-bar either.
• You can control Z1-α/2, but in practice scientific
convention sets this to reflect 90%, 95% or 99%
confidence, with 95% being the accepted default.
• You can choose n – but resources may limit you.
• There is a whole topic called sample size
determination which you may want to review before
collecting data or starting research
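As a taste of sample size determination: rearranging the margin of error z·σ/√n = E gives n = (z·σ/E)². The sketch below is only this rearrangement, not a full treatment of the topic; it assumes a planning value for σ is available (for example from a pilot study), and the function name is illustrative.

```python
import math

def required_sample_size(sigma, margin, z=1.96):
    """Smallest n for which z * sigma / sqrt(n) is no larger than `margin`."""
    return math.ceil((z * sigma / margin) ** 2)

# With sigma = 3959 hours, a margin of about 1,417 hours needs n = 30;
# tightening the margin to 500 hours needs a much larger sample.
print(required_sample_size(3959, 1417))
print(required_sample_size(3959, 500))
```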
Assumptions for hypothesis testing about μ
(large sample) and Calculation of CIs
Sample size 30 or greater
Experimental units are independent of each other
Experimental units were randomly sampled
The independence assumption requires that the value of the
variable for one experimental unit should not tell us anything
about the value for another.
e.g. in the rats experiment – different and unrelated rats
should be used – not 1 rat tested 100 times.
Randomness is required to avoid systematic bias in selection.
Exercise
Complete Exercises 1 & 2
Calculation of CIs for small samples
What about small samples?
In the case of CIs about a mean we can use the Student-t
distribution.
The process turns out to be very similar – but the CLT can no
longer be relied on
History of the Student t test
William Gosset used the publishing pseudonym ‘Student’.
He derived the correct sampling distribution for the mean
of samples < 30 – and called it the ‘t distribution’.
In his honour, it is often called the ‘Student t’ distribution.
Gosset was a chief brewer for Guinness.
The mathematical details are complicated, but, it turns out
that we perform exactly the same calculations as before,
with the one change that the t distribution instead of the
normal distribution is used.
Assumptions
Student’s t result applies only to a mean where the
population is normally
distributed with some mean μ and finite standard
deviation σ.
This is in contrast to the CLT for large samples that
required no such assumption about normality.
The t-test also requires the assumption regarding
independence in the sample.
Statistical Model for mean from small
samples
The experimental units are independently sampled from a
population with mean=μ and standard deviation = σ
The population is normally distributed (we don’t need
this with large samples)
So, to use the t-test for a small sample, you need to
establish that data is sampled from a population that is
normally distributed – you could look at the histogram of
the sample and see if it is symmetric and bell shaped – or
use other methods.
The t - Statistic
If the assumptions are met:
The statistic:
$t = \frac{\bar{X} - \mu}{s / \sqrt{n}}$
can be shown to be distributed according to a
(Student) t-distribution.
The t-distribution has one parameter, called ‘degrees
of freedom’ (df).
The t-Distribution
The t-distribution itself is bell shaped and symmetric –
just like the normal distribution but is ‘flatter’.
There are many t distributions – one for each sample size.
The rule used is: for a sample of size n – use the t
distribution with degrees of freedom = n−1
Example: if the sample size is 15, then use a t distribution
with degrees of freedom 15 − 1=14.
Note: degrees of freedom is often abbreviated to df.
The t-Distribution
[Figure: density curves for Normal(0,1), t(df=4) and t(df=1), plotted for x from -4 to 4; vertical axis from 0.0 to 0.4]
The t probability density function with k degrees of freedom:
$f(x) = \frac{\Gamma\left(\frac{k+1}{2}\right)}{\sqrt{k\pi}\,\Gamma\left(\frac{k}{2}\right)} \left(1 + \frac{x^2}{k}\right)^{-(k+1)/2}$
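A sketch that evaluates the t density in its standard form, f(x) = Γ((k+1)/2) / (√(kπ) Γ(k/2)) · (1 + x²/k)^(−(k+1)/2), and compares it with the standard Normal density. Function names are illustrative.

```python
import math

def t_pdf(x, k):
    """Density of the t distribution with k degrees of freedom."""
    coef = math.gamma((k + 1) / 2) / (math.sqrt(k * math.pi) * math.gamma(k / 2))
    return coef * (1 + x * x / k) ** (-(k + 1) / 2)

def normal_pdf(x):
    """Density of the standard Normal distribution."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# Lower peak at 0 and heavier tails at 3 than the Normal curve.
for x in (0.0, 3.0):
    print(x, round(normal_pdf(x), 4), round(t_pdf(x, 4), 4), round(t_pdf(x, 1), 4))
```

The comparison shows what 'flatter' means in practice: the t curves are lower in the middle and higher in the tails, more so for small df.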
General Confidence Interval for μ (small
Samples)
The general formula is:
$CI_{1-\alpha} = \bar{x} \pm t_{(n-1,\,1-\alpha/2)} \frac{s}{\sqrt{n}}$
Where (1-α)×100% is the confidence level you want and
t(n−1, 1−α/2) is a value from the t distribution with df = n−1 and with
the specified α level.
What is t(n−1, 1−α/2)?
A value from the t distribution with n−1 df such that
100(1 − α)% of values lie within that range around the mean.
How do you find t(n−1, 1−α/2)?
From a table specifically designed to give it to you, or use
a computer
Confidence Level    α/2               t(df=1)    t(df=10)    t(df=30)
90%                 0.05 (5%)         6.314      1.812       1.697
95%                 0.025 (2.5%)      12.71      2.228       2.042
99%                 0.005 (0.5%)      63.66      3.169       2.750
99.9%               0.0005 (0.05%)    636.6      4.587       3.646
Note: as α gets smaller, the CI gets wider
as df gets smaller, the CI gets wider
Example
Internal temperature of autoclaved aerated concrete used
in building. An engineer recorded the following data:
23.01, 22.22, 22.04, 22.62, 22.59
95% CI for the population mean?
Here n = 5, $\bar{x}$ = 22.5, s = 0.3783 and t(4, 0.975) = 2.776, so
$CI_{0.95} = \bar{x} \pm t_{(n-1,\,1-\alpha/2)} \frac{s}{\sqrt{n}} = 22.5 \pm 2.776 \times \frac{0.3783}{\sqrt{5}}$
$= 22.5 \pm 0.4696 = (22.03, 22.97)$
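The calculation can be checked with a few lines of Python. The critical value 2.776 is read from a t table (df = 4, 95% confidence) rather than computed, because the standard library has no t quantile function; the function name is illustrative.

```python
import math
import statistics

temps = [23.01, 22.22, 22.04, 22.62, 22.59]
T_CRIT = 2.776   # t table value for df = n - 1 = 4 at 95% confidence

def t_interval(xs, t_crit):
    """Small-sample CI for a mean: x_bar +/- t * s / sqrt(n)."""
    n = len(xs)
    x_bar = statistics.fmean(xs)
    s = statistics.stdev(xs)          # sample SD with the n - 1 divisor
    half_width = t_crit * s / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width

low, high = t_interval(temps, T_CRIT)
print(f"s = {statistics.stdev(temps):.4f}")   # s = 0.3783
print(f"({low:.2f}, {high:.2f})")             # (22.03, 22.97)
```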
Exercise
Answer Questions 3-6
Confidence Intervals for Proportions
(Large Samples)
Proportions (including %) are often a statistic of interest
Think of the proportion of defective items on a
production line, the proportion of people who respond
favourably to a survey question, the proportion of successes
versus failures in some experiment
Proportions are also covered by the CLT - remember
that a proportion is a different kind of average
Confidence Intervals for Proportions
(Large Samples)
Take a sample of size n of electronic components
coming off a production line, and test each one for
defects. The statistic of interest is the proportion of
defectives produced by the production process.
The estimated proportion from the sample is,
$\hat{p} = \frac{\text{No. of defectives in the sample}}{n \text{ (the total sample size)}}$
where $\hat{p}$ (p-hat) is the symbol used for the estimated
proportion from the sample
Confidence Intervals for Proportions
(Large Samples)
If the sample size is sufficiently large and we repeat
the experiment a large number of times, then:
The sampling distribution of the proportion will be
normally distributed by the CLT
The mean of this distribution will be p - i.e. the 'true'
population proportion
The standard deviation of the sampling distribution of the
proportion, called the standard error of the proportion,
is estimated by
$\text{S.E. of proportion} = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$
Example:
A pharmaceutical company produces 400,000 capsules
per day of a particular drug. They test 200 of the capsules
for defects (too much/little active compound). If the
population p = 0.05, and they take 10,000 repeated
samples, this is the histogram they would get
[Figure: histogram of the 10,000 sample proportions]
Sample Size
How big does the sample have to be for the CLT to work
with proportions?
The rule is different from the rule for means. Do the
following test.
A rule of thumb: the sample size is big enough if
1. np > 5 and
2. n(1-p) > 5
General Confidence Interval Formula for a
Population Proportion (large Sample)
$CI_{1-\alpha} = \hat{p} \pm z_{1-\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$
where (1-α)×100% = the confidence level and Z1-α/2 = a value from
the standard normal distribution such that 100(1-α)% of
values of a standard normal distribution lie within that
range around the mean
So the Z1-α/2 values used for a population proportion are
the same as those used for a population mean
Example
How many voters will give F.F. a first preference in the
next general election? There are 2 different estimates
Researcher A (10 people) => 40%
Researcher B (100 people) => 25%
How much 'better' is estimate B than estimate A?
Step one: Can we use the formula for large samples?
1. Researcher A: np = 10 * 0.4 = 4 => 4 is not greater than 5,
therefore you cannot use the large sample method
2. Researcher B: np = 100 * 0.25 = 25
n(1-p) = 100 * (1 - 0.25) = 75
both figures are greater than 5, therefore you can use the
large sample method
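The check in step one is easy to automate. A small sketch (the function name and the `threshold` parameter are illustrative):

```python
def large_sample_ok(n, p, threshold=5):
    """Rule of thumb: both n*p and n*(1-p) must exceed the threshold."""
    return n * p > threshold and n * (1 - p) > threshold

print(large_sample_ok(10, 0.40))    # Researcher A: False, since np = 4
print(large_sample_ok(100, 0.25))   # Researcher B: True, since np = 25 and n(1-p) = 75
```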
Example Continued
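Researcher B - 95% Confidence Interval
$CI_{1-\alpha} = \hat{p} \pm z_{1-\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$
$CI_{95} = 0.25 \pm 1.96 \sqrt{\frac{0.25 \times 0.75}{100}}$
$CI_{95} = 0.25 \pm 1.96 \times 0.043$
$CI_{95} = 0.25 \pm 0.08$
$CI_{95}$ = 0.17 to 0.33
So, the 95% CI is 17% to 33%.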
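Researcher B's interval as a short Python sketch (the function name is illustrative):

```python
import math

def proportion_ci(p_hat, n, z=1.96):
    """Large-sample CI for a proportion: p_hat +/- z * sqrt(p_hat*(1-p_hat)/n)."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

low, high = proportion_ci(0.25, 100)
print(f"{low:.2f} to {high:.2f}")   # 0.17 to 0.33
```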
Example Continued
NB: In fact we can get a 95% CI for researcher A's findings
using small sample theory (exact CI) - this is available in
SAS and other software:
Exact CI’s are often based on direct use of probability
models.
The method is based directly on calculations for the
binomial distribution (see lecture 3)
What do we have to do?
Using the CLT, we found that the 95% CI was composed
of the set of values for the mean such that a hypothesis
test would not reject the null hypothesis for any of those
values at the α = 0.05 level.
Using SAS we can calculate a 95% CI for Researcher A:
CI 95% for Researcher A = 12% to 74%
which is too wide to be informative anyway!
If we use the same technique for researcher B we get:
CI95 for Researcher B = 17% to 35%
Which is virtually the same as before using the CLT.
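For readers without SAS, an exact interval can be sketched directly from the binomial distribution. The slides do not say which exact method was used; the code below assumes the Clopper-Pearson construction, a common choice, and finds each limit by bisection. All function names are illustrative.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def _solve_increasing(g, target, tol=1e-10):
    """Bisection on [0, 1] for g(p) = target, where g increases with p."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_ci(x, n, confidence=0.95):
    """Clopper-Pearson interval for x successes out of n trials (assumed method)."""
    alpha = 1 - confidence
    # Lower limit: the p at which P(X >= x) equals alpha/2.
    lower = 0.0 if x == 0 else _solve_increasing(
        lambda p: 1 - binom_cdf(x - 1, n, p), alpha / 2)
    # Upper limit: the p at which P(X <= x) equals alpha/2.
    upper = 1.0 if x == n else _solve_increasing(
        lambda p: -binom_cdf(x, n, p), -alpha / 2)
    return lower, upper

for label, x, n in (("A", 4, 10), ("B", 25, 100)):
    low, high = exact_ci(x, n)
    print(f"Researcher {label}: {low:.0%} to {high:.0%}")
```

Under this assumption the intervals should be in line with the figures quoted above for both researchers.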
Exact CI and tests for population
proportions
These work for small samples as well as large samples
With large sample will give essentially the same results as
CLT
Must be used for small samples, however
Based on the binomial probability distribution.
Difference between Exact and CLT based
methods
When sample sizes are ‘large’ they will give the same
results – but exact tests can be very hard to compute
even with modern PCs
When sample sizes are small exact methods must be
used
The CIs from small samples tend to be very wide – there
is no substitute for collecting as much high quality data
as you can manage.
Exercise
Answer Questions 7-9