Stat200: pre7 - Sampling Distributions


Presentation 7
Sampling Distributions
Statistics VS parameters

Statistic – a numerical value computed from a
sample.

Parameter – a numerical value associated with a
population.

Ideally, we would like to know the parameter. But
in most cases the population is too large for us to
measure the parameter directly. So we estimate
the parameter with a suitable statistic computed
from a sample.
Quick Review




p = population proportion
p̂ = sample proportion (called p-hat)
μ = population mean
x̄ = sample mean (called x-bar)
Empirical rule:
For variables with a normal (bell-shaped) distribution:
~68% of the values fall within +/- 1 standard deviation of
the mean.
~95% of the values fall within +/- 2 standard deviations
of the mean.
Sampling Distribution of the
Sample Proportion
Situation 1: A survey is undertaken to
determine the proportion of PSU students who
engage in under-age drinking. The survey asks
200 random under-age students (assume no
problems with bias). Suppose the true population
proportion of those who drink is 60%, or p = .6.
p̂ is the proportion in the sample who drink.
Repeated Samples
Imagine repeating this survey many times, and each
time we record the sample proportion of those who have
engaged in under-age drinking. What would the
sampling distribution of p̂ look like?
Sample (n=200)    Sample Proportion
1                 p̂1
2                 p̂2
3                 p̂3
4                 p̂4
5                 p̂5
…                 …
150,000           p̂150,000

p̂ is a random variable assigning a value to each sample!

[Figure: Histogram of p̂ for 150k samples; horizontal axis p̂ from 0.4 to 0.8.]
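The repeated-sampling idea above can be imitated in code. The following is a minimal Python sketch, assuming the values from Situation 1 (n = 200, p = 0.6). The function name `simulate_phats`, the fixed seed, and the use of 2,000 repetitions instead of the 150,000 on the slide are illustrative choices made to keep the run quick; they are not part of the original presentation.

```python
import random
import statistics

def simulate_phats(n, p, num_samples, seed=0):
    """Draw num_samples samples of size n and return each sample proportion."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the run is repeatable
    phats = []
    for _ in range(num_samples):
        # Count respondents who answer "yes"; each does so with probability p.
        yes_count = sum(1 for _ in range(n) if rng.random() < p)
        phats.append(yes_count / n)
    return phats

# 2,000 repetitions is an illustrative choice, far fewer than the slide's 150,000.
phats = simulate_phats(n=200, p=0.6, num_samples=2000)
print("mean of the p-hat values:", round(statistics.mean(phats), 3))
print("std. dev. of the p-hat values:", round(statistics.stdev(phats), 3))
```

The printed mean should land close to p = 0.6, and a histogram of `phats` should look roughly bell-shaped, like the figure described above.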
Sampling Distribution of p̂
Derived from the Binomial Distribution
Let X be the number of respondents who say they engage in under-age drinking.
What is the PDF of X?
X is binomial with n=200 and p=.6, so we can calculate the probability of X for
each possible outcome (0-200). The PDF is plotted below:
[Figure: probability of each value of X; horizontal axis X from 69 to 169, vertical axis Probability from 0.00 to 0.06.]
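The binomial probabilities behind that plot can be computed directly. This is a small sketch, assuming n = 200 and p = 0.6 as in Situation 1; `binomial_pmf` is a helper name introduced here for illustration. Because X is a count, this is strictly a probability mass function, which is what the slide refers to as the PDF.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 200, 0.6
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# The most likely count sits near n * p = 120.
most_likely = max(range(n + 1), key=lambda k: pmf[k])
print("most likely count:", most_likely)
print("P(X = most likely count):", round(pmf[most_likely], 4))
print("total probability:", round(sum(pmf), 6))
```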
Sampling Dist. of p̂
Since p̂ is simply X/n, it follows that the sampling
distribution of p̂ is the same as that of the binomial
distribution divided by n.

Mean of p̂: $E(\hat{p}) = E(X/n) = p$
Std. Dev. of p̂: $sd(\hat{p}) = sd(X/n) = \frac{\sqrt{np(1-p)}}{n} = \sqrt{\frac{p(1-p)}{n}}$
Standard Error of p̂: $se(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$
Normal Approximation for Sample
Proportions

The sampling distribution of p̂ is approximately
normal with mean p and standard deviation
$\sqrt{p(1-p)/n}$ if the following conditions are
satisfied:
1. A random sample is selected from the population. Even
if the sample is not perfectly random, as long as it is
free from bias it will be okay.
2. The sample must be large enough: np and n(1-p) MUST be
greater than 5, and should be greater than 10.
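The sample-size condition is easy to check mechanically. Below is a minimal sketch; the helper name `normal_approx_ok` and its default threshold of 10 (the stricter of the two cutoffs mentioned above) are assumptions made for illustration. The randomness condition cannot be checked by arithmetic and is not covered.

```python
def normal_approx_ok(n, p, threshold=10):
    """Return True when both np and n(1 - p) exceed the threshold."""
    return n * p > threshold and n * (1 - p) > threshold

# Situation 1: n = 200, p = 0.6 gives np = 120 and n(1 - p) = 80.
print(normal_approx_ok(200, 0.6))

# A hypothetical small sample: n = 20, p = 0.9 gives n(1 - p) = 2.
print(normal_approx_ok(20, 0.9))
print(normal_approx_ok(20, 0.9, threshold=5))
```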
Example: Problem 9.11

Recent studies have shown that about 20% of
American adults fit the medical definition of being
obese. A large medical clinic would like to
estimate what percent of their patients are obese,
so they take a random sample of 100 patients and
find that 18 percent are obese. Suppose in truth,
the same percentage holds for the patients of the
medical clinic as for the general population, 20%.
Give a numerical value for each of the following…
Problem 9.11 Cont.
a. The population proportion of obese patients
in the medical clinic: p = .2
b. The proportion of obese patients in the
sample of 100 patients: p̂ = 18/100 = 0.18
c. The standard error of p̂: $\sqrt{\hat{p}(1-\hat{p})/n} = 0.0384$
d. The mean of the sampling distribution of p̂: p = .2
e. The standard deviation of the sampling
distribution of p̂: $\sqrt{p(1-p)/n} = .04$
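The arithmetic in parts c and e can be reproduced with a short sketch. The function names `sd_of_phat` and `se_of_phat` are introduced here for illustration. The only difference between them is whether the true proportion p or the observed proportion p̂ is plugged into the formula.

```python
import math

def sd_of_phat(p, n):
    """Standard deviation of p-hat, using the true population proportion p."""
    return math.sqrt(p * (1 - p) / n)

def se_of_phat(p_hat, n):
    """Standard error of p-hat, using the observed sample proportion."""
    return math.sqrt(p_hat * (1 - p_hat) / n)

n = 100          # patients sampled
p = 0.20         # assumed true proportion in the clinic
p_hat = 18 / n   # observed sample proportion

print("standard error (part c):", round(se_of_phat(p_hat, n), 4))
print("standard deviation (part e):", round(sd_of_phat(p, n), 4))
```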
Sampling Distribution of the
Sample Mean
Situation 2: The height of women aged 20
to 30 is normally distributed (bell-shaped) with a
mean of 65 inches and a standard deviation of 3
inches. A random sample of 200 women was
taken and the sample mean x̄ recorded.
Now IMAGINE taking MANY samples of size 200
from the population of women. For each sample
we record x̄. What is the sampling
distribution of x̄?
Histograms for the Distribution
of X and X-Bar
[Figure panel: Original Population of Women: X = height of random woman; horizontal axis X from 50 to 80.]
[Figure panel: Distribution of Sample Means: X-bar = mean of random sample of size 200; horizontal axis x̄ from 62 to 68.]
For Normal Data:

Consider a random variable X with mean μ and
standard deviation σ.
The sampling distribution of the sample mean x̄,
for a sample of size n, is normal with…
Mean of x̄: $E(\bar{x}) = \mu$
Std. Dev. of x̄: $sd(\bar{x}) = \sigma/\sqrt{n}$
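Applied to Situation 2 (μ = 65 inches, σ = 3 inches, n = 200), these formulas give the spread of the sample mean. The sketch below is illustrative only: `sd_of_sample_mean` is a name introduced here, and the interval uses the empirical rule's "about 95% within 2 standard deviations" from the quick review.

```python
import math

def sd_of_sample_mean(sigma, n):
    """Standard deviation of the sample mean for samples of size n."""
    return sigma / math.sqrt(n)

mu, sigma, n = 65, 3, 200
sd_xbar = sd_of_sample_mean(sigma, n)

# Empirical rule: about 95% of sample means fall within 2 sd of mu.
low, high = mu - 2 * sd_xbar, mu + 2 * sd_xbar

print("sd of x-bar:", round(sd_xbar, 3))
print("about 95% of sample means between", round(low, 2), "and", round(high, 2))
```

Individual heights vary with a standard deviation of 3 inches, while the mean of 200 heights varies far less, which is why the sample-mean histogram is so much narrower than the population histogram.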
What about for skewed or non-normal data?
[Figure: histogram of CD Data from the Class Survey; horizontal axis CDs from 0 to 600.]
Situation 3: The CD data are clearly right-skewed. Suppose our
population looked something like this. Let us take repeated samples
from this population and see what the distribution of the sample mean looks like.
Suppose we take repeated
samples of size 4, 8, 16, 32.
[Figure: four histograms of the sample mean, with panels n=4, n=8, n=16, and n=32.]
Statistics From Skewed Data

Using that CD sample as the population,
µ = 87.6, σ = 87.8
The sample means from the previous slide had the
following summary statistics:
Sample Size    Mean    Std. Deviation
n = 4          86.6    43.2
n = 8          86.8    30.9
n = 16         86.7    21.9
n = 32         86.6    15.6
Note that the mean remains nearly constant, and the std. deviation
decreases as the sample size increases!
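The shrinking standard deviations can be compared with the value σ/√n that the theory predicts. The sketch below uses σ = 87.8 and the observed values from the table; the variable names are illustrative.

```python
import math

sigma = 87.8  # population standard deviation of the CD data

# Observed std. deviations of the sample means, copied from the table.
observed_sd = {4: 43.2, 8: 30.9, 16: 21.9, 32: 15.6}

predicted_sd = {n: sigma / math.sqrt(n) for n in observed_sd}

for n, observed in observed_sd.items():
    print(f"n = {n:2d}: sigma/sqrt(n) = {predicted_sd[n]:.2f}, observed = {observed}")
```

Each observed value is within about one unit of σ/√n, even though the underlying data are strongly skewed.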
Conclusions and Conditions for
the Sample Mean

For non-normal data the sampling distribution
of the sample mean is approximately normal
with mean μ and standard deviation $\sigma/\sqrt{n}$.
Conditions!
The above is true if the sample size is large
enough; usually n greater than 30 is sufficient.
What next?



We have shown that the sampling
distribution of the sample proportion and the
sampling distribution of the sample mean are both
approximately normal under certain conditions.
Now we can use what we know about normal
distributions to draw conclusions about p̂ and
x̄!
Situation 4 demonstrates how to use the sampling
distribution of p-hat to draw conclusions.

Situation 4: A certain antibiotic is known to cure 85% of
strep bacteria infections. A scientist wants to make sure
the drug does not lose its potency over time. He treats
100 strep patients with a 1-year-old supply of the
antibiotic. Let p̂ be the proportion of individuals who
are cured.
ASSUME the drug has NOT lost potency, and answer the following questions…
1. What is the sampling distribution of p̂?
2. If we repeated this study many times, we would expect 95% of
p̂ to fall within what interval?
3. What is the probability that more than 90% in the sample are
cured?
4. Suppose the scientist observed a cure rate of only 75%. Would he
be justified in concluding the 1-year-old drug is less effective?
1. What is the sampling distribution of p̂?

Since both np = 85 and n(1-p) = 15 are
greater than 10, and if we assume the sample
is random/representative…
then the sampling distribution of p̂ is
approximately normal with mean p = .85 and
standard deviation $\sqrt{p(1-p)/n} = \sqrt{(.85)(.15)/100} \approx .036$.
2. If we repeated this study many times we would
expect 95% of p̂ to fall within what interval?
The empirical rule states that for a normally
distributed variable ~95% of the values fall
within +/- 2 standard deviations of the mean.
So 95% of the p̂ values should fall within
.85 +/- 2*.036,
or
there is about a 95% probability that the proportion
cured will be between 78% and 92%.
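The interval can be checked with a few lines of Python. This is a sketch using the values of Situation 4 (p = 0.85, n = 100) and the empirical rule's multiplier of 2; the variable names are illustrative.

```python
import math

p, n = 0.85, 100

# Standard deviation of p-hat under the assumption that p = 0.85.
sd = math.sqrt(p * (1 - p) / n)

# Empirical rule: about 95% of values lie within 2 sd of the mean.
low, high = p - 2 * sd, p + 2 * sd

print("sd of p-hat:", round(sd, 3))
print("approximate 95% range:", round(low, 2), "to", round(high, 2))
```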
3. What is the probability that more than 90% in
the sample are cured?

In other words, what is P(p̂ > .9)?
First calculate a z-score…
Z-score = [value - mean]/StdDev
Z-score = [.9 - .85]/.036 ≈ 1.4
P(p̂ > .9) = P(Z > 1.4) = 1 - P(Z < 1.4)
= 1 - .9192 = .0808
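The z-score steps can be followed in code, with the standard normal table lookup replaced by `statistics.NormalDist` from the Python standard library (Python 3.8 or later). This is a sketch of the same calculation, not a different method. It uses the unrounded standard deviation, so the result may differ from the slide in the later decimal places.

```python
import math
from statistics import NormalDist

p, n = 0.85, 100
sd = math.sqrt(p * (1 - p) / n)

# Z-score = [value - mean] / StdDev
z = (0.90 - p) / sd

# P(p-hat > .9) = P(Z > z) = 1 - P(Z < z), using the standard normal CDF.
prob = 1 - NormalDist().cdf(z)

print("z-score:", round(z, 2))
print("P(p-hat > 0.9):", round(prob, 4))
```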
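4. Suppose the scientist observed a cure rate of
only 75%. Would he be justified in concluding
the 1-year-old drug is less effective?
In other words, assuming the cure rate is actually
85%, what is the chance he would observe a sample
proportion equal to or less than 75%? What is P(p̂ ≤ .75)?
Z-score = [.75 - .85]/.0357 ≈ -2.80 (using the unrounded standard deviation)
P(p̂ ≤ .75) = P(Z < -2.80) = .0026
A result this low would be very unlikely if the drug had kept
its potency, so he would be justified in concluding that the
1-year-old drug is less effective.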
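This tail probability can also be computed without an explicit z-score, by giving the normal distribution the mean and standard deviation of p̂ directly. The sketch below assumes the normal approximation from question 1; the helper name `prob_phat_at_most` is introduced here for illustration.

```python
import math
from statistics import NormalDist

def prob_phat_at_most(value, p, n):
    """Normal-approximation probability that p-hat is at most `value`."""
    sd = math.sqrt(p * (1 - p) / n)
    return NormalDist(mu=p, sigma=sd).cdf(value)

# Chance of a cure rate of 75% or lower if the true rate is still 85%.
prob = prob_phat_at_most(0.75, p=0.85, n=100)
print("P(p-hat <= 0.75):", round(prob, 4))
```

A probability this small means an observed cure rate of 75% would be a rare outcome if the drug were still 85% effective.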
We will see some examples of how to
use the sampling distribution of the
sample mean in class activities… but the
idea is similar.