Week 7: Sampling Distributions


Transcript Week 7: Sampling Distributions

Sampling Distributions
Week 7
Objectives
On completion of this module you should be able to:
- calculate the standard error of the mean and explain the effect of sample size on the standard error,
- explain the concept of a sampling distribution for samples taken from either normal or non-normal populations and understand the central limit theorem,
- calculate the standard error of the proportion,
Objectives
On completion of this module you should be able to:
- calculate probabilities relating to sample means and proportions,
- use the normal approximation to the binomial and Poisson distributions and
- understand and apply sampling techniques for finite populations.
Sampling distributions




We use sample data to estimate population parameters.
For example, $\bar{X}$ gives us an estimate of $\mu$, $S$ gives us an estimate of $\sigma$, and $p$ estimates $\pi$.
Sampling error occurs since the sample does not reflect the population exactly.
The standard error measures how the parameter estimate varies from sample to sample.
Sampling distribution of the mean


The arithmetic mean $\bar{X}$ is unbiased since the average of all possible sample means will be equal to the population mean $\mu$.
We’ll demonstrate with a very small population…
Sampling distribution of the
mean
Four components of a coffee machine are tested and the number of faults found on each is recorded.

Component    Number of faults
A            5
B            3
C            6
D            2

$$\mu = \frac{5 + 3 + 6 + 2}{4} = 4$$

[Figure: probability distribution of the number of faults, X, in the population]
Samples of size two (n=2) are drawn from the population (N=4).

Sample    Sample mean
A, B      (5 + 3)/2 = 4
A, C      (5 + 6)/2 = 5.5
A, D      (5 + 2)/2 = 3.5
B, C      (3 + 6)/2 = 4.5
B, D      (3 + 2)/2 = 2.5
C, D      (6 + 2)/2 = 4

$$\mu_{\bar{X}} = \frac{2.5 + 3.5 + 4 + 4 + 4.5 + 5.5}{6} = 4$$

Note: there are
$$_N C_n = \binom{N}{n} = \binom{4}{2} = \frac{4!}{2!\,2!} = 6$$
possible samples.

[Figure: sampling distribution of the sample mean for n = 2]
Sampling distribution of the mean



We see from this example that $\mu_{\bar{X}} = \mu$.
The arithmetic mean is an unbiased estimator of the population mean.
Although we can’t be sure that a sample mean is close to the population mean, we can be sure that the average of all sample means is equal to the population mean.
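The coffee machine example can be checked by enumerating every possible sample. The sketch below is only an illustration of that check: the variable names are invented here, and it assumes samples are drawn without replacement with order ignored, as in the six samples listed on the slide.

```python
from itertools import combinations
from statistics import mean

# Population from the slides: faults found on components A to D.
faults = {"A": 5, "B": 3, "C": 6, "D": 2}
population_mean = mean(faults.values())

# Every sample of size n = 2, drawn without replacement (order ignored).
samples = list(combinations(faults, 2))
sample_means = {pair: mean(faults[c] for c in pair) for pair in samples}

for pair, m in sample_means.items():
    print(pair, m)

# Average of all possible sample means, to compare with the population mean.
mean_of_sample_means = mean(sample_means.values())
print("population mean:", population_mean)
print("mean of sample means:", mean_of_sample_means)
```

Individual sample means range from 2.5 to 5.5, yet their average matches the population mean, which is what unbiasedness means here.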
Sampling distribution of the mean



There is variation in the sample means but
not as much as in the population.
Standard error of the mean – a measure
of the variability in the mean from sample
to sample.

The standard error is:
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$
where $\sigma$ = population standard deviation and $n$ = sample size.
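To see the effect of sample size on the standard error, the formula can be evaluated for a few values of n. This is a minimal sketch: the function name `standard_error` and the value sigma = 2 are illustrative choices, not part of the lecture.

```python
from math import sqrt

def standard_error(sigma: float, n: int) -> float:
    """Standard error of the mean: sigma / sqrt(n)."""
    return sigma / sqrt(n)

sigma = 2.0  # illustrative population standard deviation
for n in (10, 40, 160):
    print(n, round(standard_error(sigma, n), 4))
```

Because n sits under a square root, quadrupling the sample size only halves the standard error.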
Sampling distribution of the mean



If sampling is from a normal population (with mean $\mu$ and standard deviation $\sigma$) then the sampling distribution of the mean will also be normally distributed with mean $\mu_{\bar{X}} = \mu$ and standard deviation $\sigma_{\bar{X}} = \sigma/\sqrt{n}$.
This allows us to find the probability that a sample mean is greater than or less than certain values etc.
We use:
$$Z = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$
Sampling distribution of the mean

We can also rearrange the expression for Z to find an interval within which a fixed proportion of sample means fall:
$$\bar{X} = \mu \pm Z \frac{\sigma}{\sqrt{n}}$$
So, for example, to include 95% of sample means we would substitute in $Z = \pm 1.96$ to obtain the following two values:
$$\bar{X}_{\text{Lower}} = \mu - 1.96 \frac{\sigma}{\sqrt{n}}$$
$$\bar{X}_{\text{Upper}} = \mu + 1.96 \frac{\sigma}{\sqrt{n}}$$
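The interval can be computed directly, as in the sketch below. The values mu = 100, sigma = 15 and n = 36 are hypothetical and chosen only for illustration. The Z value comes from Python's `statistics.NormalDist`, so it is slightly more precise than the rounded 1.96.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical population values, for illustration only.
mu, sigma, n = 100.0, 15.0, 36

standard_error = sigma / sqrt(n)
z = NormalDist().inv_cdf(0.975)  # about 1.96 for a central 95% interval

lower = mu - z * standard_error
upper = mu + z * standard_error
print(f"95% of sample means fall between {lower:.2f} and {upper:.2f}")

# Check: the sampling distribution puts 95% of its area in the interval.
coverage = NormalDist(mu, standard_error).cdf(upper) - NormalDist(mu, standard_error).cdf(lower)
print(f"coverage = {coverage:.3f}")
```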
Sampling distribution of the mean

If sampling is from a non-normal population (with mean $\mu$ and standard deviation $\sigma$) then the Central Limit Theorem states that the sampling distribution of the mean can be approximated by the normal distribution when the sample size is sufficiently large (usually $n \geq 30$).
Sampling distribution of the mean



The CLT applies regardless of the shape of the distribution of individual values in the population.
If the population is highly skewed (rare), more than 30 observations may be needed for normality to be approximated…
If the population is fairly symmetrical, sample sizes may only need to be 15 or more.
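A small simulation can illustrate the theorem. The sketch below is an illustration only: the exponential population, the seed, and the number of repetitions are all assumptions made here, not part of the lecture. It draws many samples of size 30 from a strongly skewed population and compares the spread of the sample means with $\sigma/\sqrt{n}$.

```python
import random
from math import sqrt
from statistics import fmean, stdev

rng = random.Random(7)  # fixed seed so the run is repeatable
n, num_samples = 30, 5000

# Assumed population: exponential with rate 1 (mean 1, standard deviation 1),
# which is strongly right-skewed.
sample_means = [
    fmean(rng.expovariate(1.0) for _ in range(n))
    for _ in range(num_samples)
]

mean_of_means = fmean(sample_means)
sd_of_means = stdev(sample_means)
theoretical_se = 1.0 / sqrt(n)

print(f"mean of sample means:       {mean_of_means:.3f} (population mean 1)")
print(f"std dev of sample means:    {sd_of_means:.3f}")
print(f"theoretical standard error: {theoretical_se:.3f}")
```

Plotting a histogram of `sample_means` would show a roughly bell-shaped curve even though the population itself is far from normal.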
Example 7-1
The distribution of times it takes an office
worker to complete a particular task is known
to have a mean of eight minutes and a
standard deviation of two minutes.
If random samples of forty tasks are taken,
find:
(a) the probability that the average time spent
per task will be more than nine minutes
Solution 7-1



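We are given $\mu = 8$, $\sigma = 2$ and $n = 40$.
The standard error of the mean is:
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{2}{\sqrt{40}}$$
Given that we are looking for $P(\bar{X} > 9)$, the Z-value is:
$$Z = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{9 - 8}{2/\sqrt{40}} = 3.16$$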
Solution 7-1
$$P(\bar{X} > 9) = P(Z > 3.16) = 1 - 0.99921 = 0.00079$$
The probability that the average time spent per task will be more than nine minutes is 0.00079.
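Solution 7-1
(b) the proportion of sample means that will be between 7.2 and 8.5 minutes
We are looking for $P(7.2 < \bar{X} < 8.5)$.
$$Z_1 = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{7.2 - 8}{2/\sqrt{40}} = -2.53$$
$$Z_2 = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{8.5 - 8}{2/\sqrt{40}} = 1.58$$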
Solution 7-1
$$P(7.2 < \bar{X} < 8.5) = P(-2.53 < Z < 1.58) = 0.9429 - 0.0057 = 0.9372$$
The proportion of sample means that can be expected to be between 7.2 and 8.5 minutes is 0.9372 (93.72%).
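Parts (a) and (b) can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library in place of printed normal tables. The variable names are chosen here for illustration. Z is not rounded to two decimals, so the results differ slightly from the table-based answers.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 8.0, 2.0, 40        # values given in Example 7-1
standard_error = sigma / sqrt(n)   # sigma / sqrt(n)

# Sampling distribution of the mean, treated as normal (CLT, n >= 30).
sample_mean_dist = NormalDist(mu, standard_error)

# (a) P(X-bar > 9)
p_a = 1 - sample_mean_dist.cdf(9)

# (b) P(7.2 < X-bar < 8.5)
p_b = sample_mean_dist.cdf(8.5) - sample_mean_dist.cdf(7.2)

print(f"(a) {p_a:.5f}")
print(f"(b) {p_b:.4f}")
```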
Solution 7-1
(c) If the random sample had been of only 20 tasks, what changes would this make to your answers in (a) and (b)?
What assumptions would you need to make in order to be able to answer (a) and (b) based on a sample of 20 tasks?
Solution 7-1



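A sample of 20 is smaller than the usual CLT guideline of 30.
If the population is known to be normal (we are not told this), the sample means are normally distributed for any sample size; if it is fairly symmetrical, the CLT approximation may still be adequate. In either case we could solve the problem using the methods in (a) and (b), with the larger standard error $2/\sqrt{20}$.
If not, then we can’t assume the means are normally distributed and couldn’t solve (a) and (b).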
Solution 7-1
(d) Which of the following is more likely to occur:
- a sample mean below 7.5 minutes in a sample of 30 tasks?
- a sample mean below 7.5 minutes in a sample of 50 tasks?
- an individual task taking less than two minutes?
Solution 7-1
A sample mean below 7.5 minutes in a sample of 30 tasks?
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{7.5 - 8}{2/\sqrt{30}} = -1.37$$
$$P(\bar{X} < 7.5) = P(Z < -1.37) = 0.0853$$
Solution 7-1
A sample mean below 7.5 minutes in a sample of 50 tasks?
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{7.5 - 8}{2/\sqrt{50}} = -1.77$$
$$P(\bar{X} < 7.5) = P(Z < -1.77) = 0.0384$$
Solution 7-1
An individual task taking less than 2 minutes?
$$Z = \frac{X - \mu}{\sigma} = \frac{2 - 8}{2} = -3$$
$$P(X < 2) = P(Z < -3) = 0.00135$$
(This step assumes that individual task times are normally distributed.)
Therefore, the most likely outcome is a sample mean below 7.5 minutes in a sample of 30.
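The three outcomes in part (d) can be compared side by side. This is a sketch with invented names. It assumes, as the worked solution implicitly does, that individual task times are normally distributed.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 8.0, 2.0  # values given in Example 7-1

def prob_mean_below(x: float, n: int) -> float:
    """P(X-bar < x) for a sample of size n, using the normal sampling distribution."""
    return NormalDist(mu, sigma / sqrt(n)).cdf(x)

p_n30 = prob_mean_below(7.5, 30)
p_n50 = prob_mean_below(7.5, 50)
# Individual observation: n = 1, so the standard deviation is sigma itself.
p_individual = NormalDist(mu, sigma).cdf(2)

for label, p in [("n = 30", p_n30), ("n = 50", p_n50), ("individual", p_individual)]:
    print(f"{label}: {p:.5f}")
```

The larger sample has the smaller standard error, so its mean is less likely to stray as far as 7.5 minutes.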
Sampling distribution of the
proportion


Often we are interested in the proportion of
items in a population which possess a
certain characteristic.
When we can’t examine every item in the
population, we estimate this proportion with:
$$p_s = \frac{X}{n} = \frac{\text{number of items having the characteristic}}{\text{sample size}}$$
Sampling distribution of the
proportion


As with sample means, estimates of the
proportion will differ between samples.
The standard error of the proportion is:
$$\sigma_{p_s} = \sqrt{\frac{p(1 - p)}{n}}$$
Sampling distribution of the
proportion


When sampling with replacement, the
sampling distribution of the proportion follows
the binomial distribution.
We’ll see shortly that when the following conditions are met, this can be approximated by the normal distribution:
$$np \geq 5 \quad \text{and} \quad n(1 - p) \geq 5$$
Sampling distribution of the
proportion

The difference between the sample
proportion and the population proportion in
standardised normal units is:
$$Z = \frac{p_s - p}{\sqrt{\dfrac{p(1 - p)}{n}}}$$
Example 7-2
Recent research has indicated a growing
number of young children from two parent
families are being placed in child care so that
both parents can work.
Although the families increase their income
with two pay packets, the cost of child care
is often prohibitively high.
A particular study indicated that 40% of
families have children in child care facilities.
Example 7-2
If a random sample of 100 two-parent families is selected, find:
(a) the proportion of samples which will have between 40% and 50% of families using child care facilities
Solution 7-2
We are given p = 0.4 and n = 100.
$$np = 100(0.4) = 40 \geq 5$$
$$n(1 - p) = 100(0.6) = 60 \geq 5$$
Therefore the sample size is large enough to use the normal distribution approximation.
We are looking for $P(0.4 < p_s < 0.5)$.
Solution 7-2
$$Z_1 = \frac{p_s - p}{\sqrt{\dfrac{p(1 - p)}{n}}} = \frac{0.4 - 0.4}{\sqrt{\dfrac{0.4(1 - 0.4)}{100}}} = 0$$
$$Z_2 = \frac{p_s - p}{\sqrt{\dfrac{p(1 - p)}{n}}} = \frac{0.5 - 0.4}{\sqrt{\dfrac{0.4(1 - 0.4)}{100}}} = 2.04$$
Solution 7-2
$$P(0.4 < p_s < 0.5) = P(0 < Z < 2.04) = 0.9793 - 0.5 = 0.4793$$
So the proportion of samples between 40% and 50% will be 0.4793.
Example 7-2
If a random sample of 100 two-parent
families is selected, find:
(b) the probability of obtaining a sample percentage of greater than 45%
$$Z = \frac{p_s - p}{\sqrt{\dfrac{p(1 - p)}{n}}} = \frac{0.45 - 0.4}{\sqrt{\dfrac{0.4(1 - 0.4)}{100}}} = 1.02$$
Solution 7-2
$$P(p_s > 0.45) = P(Z > 1.02) = 1 - 0.8461 = 0.1539$$
So the probability of obtaining a sample percentage of greater than 45% is 0.1539.
Example 7-2
(c) Within what symmetrical limits of the
population percentage will 95% of the sample
percentages fall?

95% of the standard normal curve is between ±1.96 and so
$$P(-1.96 < Z < 1.96) = 0.95$$
Solution 7-2
Now rearranging
$$Z = \frac{p_s - p}{\sqrt{\dfrac{p(1 - p)}{n}}}$$
we get
$$p_s = p \pm Z \sqrt{\frac{p(1 - p)}{n}}$$
Solution 7-2
Substituting in the two values of Z, we find that:
$$p_s = 0.4 - 1.96 \sqrt{\frac{0.4(0.6)}{100}} = 0.3040 \text{ (to 4 dec. pl.)}$$
$$p_s = 0.4 + 1.96 \sqrt{\frac{0.4(0.6)}{100}} = 0.4960 \text{ (to 4 dec. pl.)}$$
So 95% of the sample percentages will fall between 30.40% and 49.60%.
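The three parts of Example 7-2 can be reproduced in a few lines. This is a sketch under the same normal approximation used in the slides. The variable names are illustrative, and the answers may differ in the fourth decimal place because Z is not rounded.

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.4, 100  # population proportion and sample size from Example 7-2

# Rule-of-thumb check from the slides before using the normal approximation.
assert n * p >= 5 and n * (1 - p) >= 5

standard_error = sqrt(p * (1 - p) / n)
proportion_dist = NormalDist(p, standard_error)

# (a) P(0.4 < p_s < 0.5)
part_a = proportion_dist.cdf(0.5) - proportion_dist.cdf(0.4)

# (b) P(p_s > 0.45)
part_b = 1 - proportion_dist.cdf(0.45)

# (c) symmetrical limits containing 95% of sample proportions
z = NormalDist().inv_cdf(0.975)  # about 1.96
lower, upper = p - z * standard_error, p + z * standard_error

print(f"(a) {part_a:.4f}")
print(f"(b) {part_b:.4f}")
print(f"(c) {lower:.4f} to {upper:.4f}")
```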
Normal approximation to the
binomial distribution



As we saw earlier, we require that
$$np \geq 5 \quad \text{and} \quad n(1 - p) \geq 5$$
in order for the normal approximation to the binomial to be appropriate.
We also need to consider a continuity correction since the normal distribution is continuous whilst the binomial is discrete.
We’ll demonstrate this via an example.
Normal approximation to the
binomial distribution

We know that for a binomial distribution
$$\mu = np \quad \text{and} \quad \sigma = \sqrt{np(1 - p)}$$
and so
$$Z = \frac{X - \mu}{\sigma}$$
becomes
$$Z = \frac{X_a - np}{\sqrt{np(1 - p)}}$$
where $X_a$ is adjusted using the continuity correction.
Example 7-3
A company offers its sales staff a choice of
three salary packages.
Package A includes a base salary of $50000
per year as well as 1% commission on all sales
made by the staff member.
Package B includes a base salary of $20000
plus a 4% sales commission and package C
consists solely of a 7% sales commission.
Example 7-3
The company has designed the packages in such a way as to expect equal numbers of staff to choose each option.
(a) If a random selection of six sales staff is taken, what is the probability that at least three will select package C?
We are given n = 6.
Solution 7-3


Since we are interested only in whether they select package C or not (i.e. when the result is not choosing package C, we aren’t interested in whether it is A or B), we can say that
$$p = \frac{1}{3}$$
We want to find the value of $P(X \geq 3)$.
Solution 7-3


Given that
$$np = 6\left(\frac{1}{3}\right) = 2 < 5 \quad \text{and} \quad n(1 - p) = 6\left(1 - \frac{1}{3}\right) = 4 < 5$$
we cannot use the normal approximation (and must use the binomial distribution).
We will use the binomial formula (since the exact p value is not tabulated).
Solution 7-3
$$P(X \geq 3) = 1 - P(X = 2) - P(X = 1) - P(X = 0)$$
$$= 1 - \frac{6!}{2!\,4!}\left(\frac{1}{3}\right)^2\left(\frac{2}{3}\right)^4 - \frac{6!}{1!\,5!}\left(\frac{1}{3}\right)^1\left(\frac{2}{3}\right)^5 - \frac{6!}{0!\,6!}\left(\frac{1}{3}\right)^0\left(\frac{2}{3}\right)^6$$
$$= 1 - 0.3292 - 0.2634 - 0.0878 = 0.3196$$
Given a random selection of six sales staff, the probability that at least three will select salary package C is 0.3196.
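The binomial calculation above can be verified with a short script. This is a sketch: the helper name `binomial_pmf` is invented here and `math.comb` supplies the binomial coefficient.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for a binomial distribution with n trials and success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 6, 1 / 3  # six staff, each choosing package C with probability 1/3

# P(X >= 3) = 1 - P(X = 0) - P(X = 1) - P(X = 2)
terms = [binomial_pmf(k, n, p) for k in range(3)]
p_at_least_3 = 1 - sum(terms)

print([round(t, 4) for t in terms])
print(round(p_at_least_3, 4))
```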
Example 7-3
(b) If a random selection of twenty sales staff is taken, what is the approximate probability that at least three will select package C?
$$np = 20\left(\frac{1}{3}\right) = 6\tfrac{2}{3} \geq 5$$
$$n(1 - p) = 20\left(\frac{2}{3}\right) = 13\tfrac{1}{3} \geq 5$$
So we can use the normal approximation.
Continuity correction




The normal distribution is continuous but the
binomial is discrete.
This means the binomial can only take on
certain values (like the bars in a histogram).
The normal distribution can take on any value
so is drawn as a continuous line (see next
slide).
The two distributions will therefore have
differences.
Continuity correction
[Figure: binomial distribution (bars) overlaid with the approximating normal curve]
Continuity correction

In our example we are looking for $P(X \geq 3)$ on the binomial distribution (the shaded area).
[Figure: binomial bars with the area for X ≥ 3 shaded]
Continuity correction


To include everything greater than or equal to 3, the entire bar must be included.
On the normal curve this means everything from 2.5 upwards.
[Figure: normal curve over the binomial bars, shaded from 2.5 upwards]
Solution 7-3
$$Z = \frac{X_a - np}{\sqrt{np(1 - p)}} = \frac{2.5 - 20\left(\frac{1}{3}\right)}{\sqrt{20\left(\frac{1}{3}\right)\left(\frac{2}{3}\right)}} = -1.98$$
$$P(X \geq 3) \approx P(Z > -1.98) = 1 - 0.0239 = 0.9761$$
Given a random selection of twenty employees, the approximate probability that at least three will select salary package C is 0.9761.
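To judge how good the approximation is, the continuity-corrected normal probability can be set against the exact binomial probability. The comparison with the exact value is an addition made here, not part of the slides, and the variable names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 20, 1 / 3  # twenty staff, each choosing package C with probability 1/3

# Exact binomial: P(X >= 3) = 1 - P(X <= 2)
exact = 1 - sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(3))

# Normal approximation with continuity correction: X >= 3 becomes X_a = 2.5
mu = n * p
sigma = sqrt(n * p * (1 - p))
x_a = 3 - 0.5
z = (x_a - mu) / sigma
approx = 1 - NormalDist().cdf(z)

print(f"Z = {z:.2f}")
print(f"normal approximation: {approx:.4f}")
print(f"exact binomial:       {exact:.4f}")
```

Without the half-unit adjustment the normal curve would be cut at 3 rather than 2.5 and would leave out half of the bar at X = 3.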
Normal approximation to the
Poisson distribution



With the Poisson distribution
$$\mu = \sigma^2 = \lambda$$
We can use the normal distribution to approximate the Poisson distribution whenever $\lambda \geq 5$.
Then $\sigma = \sqrt{\lambda}$ and
$$Z = \frac{X_a - \lambda}{\sqrt{\lambda}}$$
where $X_a$ is adjusted (continuity correction).
Example 7-4
Customers arrive at a busy takeaway coffee
counter at the rate of five per minute.
(a) What is the probability that in any given minute three or fewer customers arrive?
$\lambda = 5$
$$P(X \leq 3) = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3)$$
$$= 0.0067 + 0.0337 + 0.0842 + 0.1404 = 0.2650 \text{ (using Poisson tables)}$$
Solution 7-4
(b) What is the approximate probability that in any given minute three or fewer customers arrive?
- Since $\lambda = 5 \geq 5$ we can use the normal approximation to the Poisson distribution.
- $X_a = 3.5$ since we are looking for less than or equal to 3.
$$Z = \frac{X_a - \lambda}{\sqrt{\lambda}} = \frac{3.5 - 5}{\sqrt{5}} = -0.67$$
$$P(X \leq 3) \approx P(Z < -0.67) = 0.2514$$
Solution 7-4
(c) Compare your answers from (a) and (b).
- There is a difference of 0.0136.
- The normal distribution has been a reasonably accurate approximation of the Poisson distribution in this case.
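Parts (a) to (c) can be reproduced as follows. This is a sketch with illustrative names: the exact probability is computed from the Poisson formula rather than read from tables, so the final decimal place may differ slightly from the slide values.

```python
from math import exp, factorial, sqrt
from statistics import NormalDist

lam = 5.0  # arrival rate: five customers per minute

# (a) Exact Poisson: P(X <= 3) = sum of P(X = k) for k = 0..3
exact = sum(exp(-lam) * lam**k / factorial(k) for k in range(4))

# (b) Normal approximation with continuity correction: X <= 3 becomes X_a = 3.5
x_a = 3 + 0.5
approx = NormalDist(lam, sqrt(lam)).cdf(x_a)

# (c) Difference between the two answers
difference = exact - approx

print(f"exact:      {exact:.4f}")
print(f"approx:     {approx:.4f}")
print(f"difference: {difference:.4f}")
```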
Sampling from finite
populations



Until now we have assumed sampling with
replacement and an infinite population in our
calculations, or at least that our sample size is
very small relative to the population size.
Sampling is more often without replacement.
We use the finite population correction factor when
- the population is finite of size N and
- the sample size n is not small relative to the population size, i.e. $n/N > 0.05$.
Sampling from finite
populations


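Finite population correction factor (fpc)
$$\text{fpc} = \sqrt{\frac{N - n}{N - 1}}$$
Standard error of the mean for finite populations
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}$$
Standard error of the proportion for finite populations
$$\sigma_{p_s} = \sqrt{\frac{p(1 - p)}{n}} \sqrt{\frac{N - n}{N - 1}}$$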
Example 7-5
The management team of a large company has
been investigating the work habits of the
employees of the company.
They have been concerned that some of the
employees are spending large portions of the
working day outside having smoking breaks.
It is known that 500 employees are regular
smokers.
Example 7-5
It is expected that the time these employees
spend smoking per day is normally distributed with
a mean of 25 minutes and a standard deviation of
eight minutes.
If a random sample of 50 of the smokers is
selected without replacement, what proportion of
the sample means would be greater than 26
minutes?
Example 7-5


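We have $\mu = 25$ and $\sigma = 8$.
Since
$$\frac{n}{N} = \frac{50}{500} = 0.1 > 0.05$$
and the sample is without replacement, the finite population correction factor is needed.
$$\mu_{\bar{X}} = \mu = 25$$
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}} = \frac{8}{\sqrt{50}} \sqrt{\frac{500 - 50}{500 - 1}} = 1.0744 \text{ (to 4 dec. pl.)}$$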
Solution 7-5
$$Z = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} = \frac{26 - 25}{1.0744} = 0.93 \text{ (2 dec. pl.)}$$
$$P(\bar{X} > 26) = P(Z > 0.93) = 1 - 0.8238 = 0.1762$$
So 0.1762 (17.62%) of the sample means can be expected to be greater than 26 minutes.
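The finite population calculation can be sketched in code as below. The function name `fpc` and the variable names are illustrative. The uncorrected standard error is printed alongside to show the size of the adjustment.

```python
from math import sqrt
from statistics import NormalDist

def fpc(population_size: int, sample_size: int) -> float:
    """Finite population correction factor: sqrt((N - n) / (N - 1))."""
    return sqrt((population_size - sample_size) / (population_size - 1))

mu, sigma = 25.0, 8.0  # minutes, from Example 7-5
N, n = 500, 50         # population and sample sizes

# n / N = 0.1 > 0.05 and sampling is without replacement, so the fpc applies.
uncorrected_se = sigma / sqrt(n)
corrected_se = uncorrected_se * fpc(N, n)

# P(X-bar > 26) using the corrected standard error
p_above_26 = 1 - NormalDist(mu, corrected_se).cdf(26)

print(f"uncorrected SE: {uncorrected_se:.4f}")
print(f"corrected SE:   {corrected_se:.4f}")
print(f"P(mean > 26):   {p_above_26:.4f}")
```

The correction factor is always below 1 when n > 1, so it shrinks the standard error, reflecting that a sample covering a sizeable share of the population leaves less room for sampling variation.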
After the lecture each week…





- Review the lecture material
- Complete all readings
- Complete all of the recommended problems (listed in the SG) from the textbook
- Complete at least some of the additional problems
- Consider (briefly) the discussion points prior to tutorials