Probability Probability Distributions


Estimation in Sampling
GTECH 201
Lecture 15
Conceptual Setting

How do we come to conclusions from
empirical evidence?



Systematic methods for drawing
conclusions from data


Isn’t common sense enough?
Why?
Statistical inference
Inductive versus Deductive Reasoning
Drawing Conclusions

Statistical inference


Based on the laws of probability
What would happen if?





You ran your experiment hundreds of times
You repeated your survey over and over again
Statistic and Parameter
The proportion of the population who are
<disabled> is a parameter, usually denoted by: p
In an SRS of 1000 people, the proportion of
the people who are <disabled> is a statistic,
usually denoted by: p̂ (p-hat)
Estimating with Confidence

Say you are conducting an opinion poll…






SRS of 1000 adult television viewers
You ask these folks if they trust Walter
Cronkite when he delivers the nightly news
Out of 1000, 570 say they trust him
57% of the people trust Walter
p̂ is 0.57
If you collect another set of 1000 television
viewers, what will the rating be?
Confidence Statement




We need to add a confidence statement
We need to say something about the margin
of error
Confidence statements are based on the
distribution of the values of the sample
proportion p̂ that would occur if many
independent SRSs were taken from the same
population
The sampling distribution of the statistic p̂
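The "what would happen if" question can be explored by simulation. The sketch below is only an illustration: it assumes a hypothetical true proportion of 0.57 (the poll itself cannot tell us the true value), draws many independent SRSs of 1000 viewers, and summarizes the resulting values of p-hat. The function name, the seed, and the number of repetitions are assumptions made for this example.

```python
import random
import statistics

def simulate_p_hat(true_p, n, repetitions, seed=0):
    """Return the sample proportions from repeated simple random samples."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    p_hats = []
    for _ in range(repetitions):
        # Count how many of the n sampled viewers answer "yes"
        successes = sum(1 for _ in range(n) if rng.random() < true_p)
        p_hats.append(successes / n)
    return p_hats

# Hypothetical setting: true proportion 0.57, samples of 1000, 500 repeated polls
p_hats = simulate_p_hat(true_p=0.57, n=1000, repetitions=500)
print("mean of p-hat values:", round(statistics.mean(p_hats), 3))
print("std. dev. of p-hat values:", round(statistics.stdev(p_hats), 4))
print("smallest and largest p-hat:", min(p_hats), max(p_hats))
```

The spread of the printed values is what a margin of error tries to describe: each repeated poll gives a slightly different p-hat, clustered around the assumed true proportion.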
Terminology Review



Sample
Population
Statistic
A numerical characteristic associated with a
sample
Parameter
A numerical characteristic associated with
the population
Sampling error

The need for interval estimation
Point Estimation

Point estimation of a parameter is the
value of a statistic that is used to estimate
the parameter



Compute statistic (e.g., mean)
Use it to estimate corresponding population
parameter
Point Estimators of Population Parameters
(see next slide)
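Point Estimators for
Population Parameters

Mean: population parameter $\mu$; sample statistic $\bar{x}$; calculating formula $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Standard deviation: population parameter $\sigma$; sample statistic $s$; calculating formula $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
Total: sample statistic $T$; calculating formula $T = N \bar{x} = \frac{N \sum_{i=1}^{n} x_i}{n}$
Proportion: sample statistic $p$; calculating formula $p = \frac{x}{n}$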
Interval Estimation

Sample point estimators are usually not absolutely
precise
How close or how distant is the calculated
sample statistic from the population parameter?
We can say that the sample statistic is within a
certain range or interval of the population
parameter.
The determination of this range is the basis for
interval estimation
Interval Estimation (2)


A confidence interval (CI) represents the level
of precision associated with a population
estimate
Width of the interval is determined by



Sample size,
variability of the population, and
the probability level or the level of confidence
selected
Sampling Distribution
of the Mean



The distribution of all possible sample
means for a sample of a given size
Use the mean of a sample to estimate and
draw conclusions about the mean of that
entire population
So we have samples of a particular size

We need formulas to determine the mean and
the standard deviation of all possible sample
means for samples of a given size from a
population
Sample and Population Mean

For samples of size n, the mean of the
variable $\bar{x}$

Is equal to the mean of the variable under
consideration
Mean of all possible sample means is equal
to the population mean: $\mu_{\bar{x}} = \mu$
Sample Standard Deviation

For samples of size n, the standard
deviation of the variable $\bar{x}$

Is equal to the standard deviation of the
variable under consideration, divided by the
square root of the sample size
For each sample size, the standard deviation
of all possible sample means equals the
population standard deviation divided by the
square root of the sample size: $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
Central Limit Theorem


Suppose all possible random samples of size n
are drawn from an infinitely large, normally
distributed population having a mean $\mu$ and a
standard deviation $\sigma$
The frequency distribution of these sample
means will have:
A mean of $\mu$ (the population mean)
A normal distribution around this population mean
A standard deviation of $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
Sampling Error

Standard Error of the mean (SEM) is a basic
measure for the amount of sampling error
$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
SEM indicates how much a typical sample mean is
likely to differ from a true population mean
Sample size, and population standard deviation
affect the sampling error
Sampling Error (2)


The larger the sample size, the smaller the
amount of sampling error
The larger the standard deviation, the
greater the amount of sampling error
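Both effects can be checked numerically. The following is a minimal sketch; the function name and the values of the standard deviation and sample size are assumptions chosen only for illustration.

```python
import math

def standard_error(sigma, n):
    """Standard error of the mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

# Illustrative (assumed) values: two population standard deviations,
# three sample sizes
for sigma in (3.0, 6.0):
    for n in (25, 100, 400):
        print(f"sigma={sigma}, n={n}: SEM={standard_error(sigma, n):.3f}")
```

Quadrupling the sample size halves the standard error, while doubling the standard deviation doubles it.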
[Figure: amount of sampling error in relation to sample size (n), from small to large, and standard deviation of population ($\sigma$), from small to large]
Finite Population
Correction Factor



The frequency distribution of the sample means is
approximately normal if the sample size is large
n < 30 (small sample); n > 30 (large sample)
If you have a finite population, then you need to
introduce a correction, i.e., the fpc rule/factor in
the estimation process
$\mathrm{fpc} = \sqrt{\frac{N - n}{N - 1}}$
where fpc = finite population correction;
n = sample size;
N = population size
Standard Error of the Mean
for Finite Populations
When the fpc is included, the standard error of the mean becomes:
$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \, (\mathrm{fpc})$
In general, you include the fpc in the
population estimates only when the ratio of
sample size to population size exceeds 5%,
i.e., when n / N > 0.05
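The 5% rule can be written as a small helper. This is a sketch under stated assumptions: the function names, the threshold parameter, and the two population sizes are hypothetical and chosen only to show one case on each side of the rule.

```python
import math

def fpc(n, N):
    """Finite population correction: sqrt((N - n) / (N - 1))."""
    return math.sqrt((N - n) / (N - 1))

def standard_error_finite(sigma, n, N, threshold=0.05):
    """SEM, multiplied by the fpc only when n / N exceeds the threshold."""
    sem = sigma / math.sqrt(n)
    if n / N > threshold:
        sem *= fpc(n, N)
    return sem

# Hypothetical populations: same sample (n = 50, sigma = 3), different N
print(standard_error_finite(3.0, 50, 10_000))  # n / N = 0.005, no correction
print(standard_error_finite(3.0, 50, 500))     # n / N = 0.10, correction applied
```

Because the fpc is always below 1 when n > 1, applying it can only shrink the standard error.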
Constructing Confidence
Intervals



A random sample of 50 commuters reveals
that their average journey-to-work distance
was 9.6 miles
A recent study has determined that the std.
deviation of journey-to-work distance is
approximately 3 miles
What is the CI around this sample mean of
9.6 that guarantees with 90 % certainty that
the true population mean is enclosed within
that interval?
Confidence Interval
for the Mean
$\bar{x} = 9.6$
$\sigma = 3$
$n = 50$




Z value associated with a 90 % confidence level
(Z =1.65)
The sample mean is the best estimate of the
true population mean
$CI = \bar{x} \pm z \sigma_{\bar{x}}$
$9.6 + 1.65 \, (3 / \sqrt{50}) = 10.30$ miles
$9.6 - 1.65 \, (3 / \sqrt{50}) = 8.90$ miles
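The same arithmetic can be reproduced in a few lines. This is a minimal sketch that uses the values from the commuter example; the function name is an assumption, and Z = 1.65 is the rounded table value used on the slide.

```python
import math

def confidence_interval(mean, sigma, n, z):
    """Return (lower, upper) for mean +/- z * sigma / sqrt(n)."""
    margin = z * sigma / math.sqrt(n)
    return mean - margin, mean + margin

# Commuter example: sample mean 9.6 miles, sigma 3 miles, n = 50, Z = 1.65
lower, upper = confidence_interval(mean=9.6, sigma=3.0, n=50, z=1.65)
print(f"90% CI: {lower:.2f} to {upper:.2f} miles")  # 90% CI: 8.90 to 10.30 miles
```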
Confidence Interval

We say that the sample statistic is within a certain
range or interval of the population parameter




e.g., in our sample, 57% of the viewers thought Walter
Cronkite is trustworthy
In the general population, between 54% and 60%
of viewers think that Walter Cronkite is trustworthy
Or, in our sample, the average commuting distance
was 9.6 miles
In the population, we calculated that the average
commute is likely to be somewhere between 8.9
miles and 10.3 miles
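The slides do not say how the 54% to 60% range for the opinion poll was obtained. One common way to get a range of that size, shown here only as an assumption, is the normal approximation for a proportion with a 95% confidence level (Z = 1.96) and a standard error of sqrt(p-hat (1 - p-hat) / n). The function name is hypothetical.

```python
import math

def proportion_interval(p_hat, n, z):
    """Normal-approximation interval for a proportion (an assumed method)."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of p-hat
    return p_hat - z * se, p_hat + z * se

# Poll example: 570 of 1000 viewers, assumed 95% confidence level
low, high = proportion_interval(p_hat=0.57, n=1000, z=1.96)
print(f"{low:.3f} to {high:.3f}")
```

Under these assumptions the interval rounds to roughly 54% to 60%, which agrees with the range quoted for the poll.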
Confidence Level




Gives you an understanding of how reliable your
previous statement regarding the confidence
interval is
The probability that the interval actually includes
the population parameter
For example, the confidence level refers to the
probability that the interval (8.9 miles to 10.3
miles) actually encompasses the TRUE population
mean (90%, 95%, 99.7%)
Confidence Level probability is $1 - \alpha$
Significance Level

$\alpha$ (alpha)




The probability that the interval that surrounds
the sample statistic DOES NOT include the
population parameter
E.g., the probability that the average
commuting distance does not fall between 8.9
miles and 10.3 miles
$\alpha$ = 0.10 (90%); 0.05 (95%); 0.003 (99.7%)
Confidence interval width increases as $\alpha$ decreases
Sampling Error


Total sampling error = $\alpha$

Probability that the sample statistic will fall
into either tail of the distribution is: $\alpha / 2$

If you want 99.7% confidence (i.e., low
error), then you have to settle for giving a
less precise estimate (the CI is wider)
If the Standard Deviation
is Unknown




If we don’t know the population mean, it’s likely
we don’t know the standard deviation
What you are likely to have is the variance and
standard deviation of your sample
Also, you have a small population, so you have to
use the finite population correction factor that
was discussed earlier
Once you have the formula for standard error,
then you can proceed as before to determine the
confidence interval
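Standard Error
$\sigma_{\bar{x}} = \sqrt{\frac{s^2}{n}} \, (\mathrm{fpc})$
$\mathrm{fpc} = \sqrt{\frac{N - n}{N - 1}}$
$\sigma_{\bar{x}} = \sqrt{\frac{s^2}{n} \left( \frac{N - n}{N} \right)}$
$CI = \bar{x} \pm z \sigma_{\bar{x}}$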
Student’s T Distribution

William Gosset (1876-1937)


Published his contributions to statistical
theory under a pseudonym
Student’s t distribution is used in
performing inferences for a population
mean, when,



The population being sampled is approximately
normally distributed
The population standard deviation is unknown
And the sample size is small (n < 30)
Characteristics of the
t - Distribution





A t curve is symmetric, bell shaped
Exact shape of distribution varies with sample size
When n nears 30, the value of t approaches the
standard normal Z value
A particular distribution is identified by defining its
degrees of freedom (df)
For a t distribution, df = (n - 1)
$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$
Properties of t Curves





The total area under a t curve = 1
A t curve extends indefinitely in both directions,
approaching, but never touching the horizontal
axis
A t-curve is symmetrical about 0
As the degrees of freedom become larger,
t curves look increasingly like the standard
normal curve
We need to use a t-table and look for values of t,
instead of Z to determine the confidence interval
Calculating various CIs



Sampling: SRS, systematic, or stratified
Parameters: mean, total, or proportion
Six situations

Consider whether to use fpc


when n/N > 0.05
Consider whether to use Z or t

when n < 30
If Random or Systematic Sample

Estimate of Population Mean


Best estimate is ?
Estimate of sampling error

Standard error of the mean (inc. fpc)
$\sigma_{\bar{x}} = \sqrt{\frac{s^2}{n} \left( \frac{N - n}{N} \right)}$
$CI = \bar{x} \pm z \sigma_{\bar{x}}$
If Stratified Sample

Estimate of population mean

Still equal to sample mean but…
$\bar{X} = \frac{1}{N} \sum_{i=1}^{m} N_i \bar{X}_i$
Where m = number of strata; i refers to a particular stratum

Std. Error of the mean (inc. fpc)
$\sigma_{\bar{x}} = \sqrt{\frac{1}{N^2} \sum_{i=1}^{m} N_i^2 \, \frac{s_i^2}{n_i} \left( \frac{N_i - n_i}{N_i} \right)}$
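The stratified estimates can be sketched as follows. The two strata and all of their numbers are hypothetical, as are the function names and the dictionary keys; the formulas are the weighted mean and the standard error with a per-stratum correction term (N_i - n_i) / N_i.

```python
import math

def stratified_mean(strata):
    """Weighted mean: (1 / N) * sum of N_i * mean_i over all strata."""
    N = sum(s["N"] for s in strata)
    return sum(s["N"] * s["mean"] for s in strata) / N

def stratified_standard_error(strata):
    """Standard error of the stratified mean, including the fpc per stratum."""
    N = sum(s["N"] for s in strata)
    total = 0.0
    for s in strata:
        fpc_term = (s["N"] - s["n"]) / s["N"]
        total += s["N"] ** 2 * (s["var"] / s["n"]) * fpc_term
    return math.sqrt(total / N ** 2)

# Hypothetical strata: population size N, sample size n, sample mean, sample variance
strata = [
    {"N": 1000, "n": 20, "mean": 2.0, "var": 1.0},
    {"N": 3000, "n": 60, "mean": 3.0, "var": 2.0},
]
print("estimated population mean:", stratified_mean(strata))
print("standard error:", round(stratified_standard_error(strata), 4))
```

Each stratum contributes in proportion to its population size, so the larger stratum dominates both the mean and the standard error.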
Minimum Sample Size



Before going out to the field, you want to
know how big the sample ought to be for
your research problem
Sample must be large enough to achieve
precision and CI width that you desire
Formulas to determine the three basic
population parameters with random
sampling
Sample Size Selection - Mean

Your goal is to determine the minimum sample size
You want to situate the estimated population
mean in a specified CI
$CI = \bar{x} \pm z \sigma_{\bar{x}}$
E = amount of error you are willing to tolerate
$E = Z \sigma_{\bar{x}} = Z \frac{\sigma}{\sqrt{n}}$
$n = \left( \frac{Z \sigma}{E} \right)^2$
Example 1

We are looking at Neighborhood X






3,500 households
Sample size = 25 households
Sample mean = 2.73
Sample variance = 2.6
Confidence level = 90%
Find the confidence interval for the mean
number of people per household
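One possible working of Example 1 is sketched below; it is not a definitive solution. Two choices are made here. First, n = 25 is below 30 and only the sample variance is known, so a t value is used instead of Z. The value 1.711 is the two-sided 90% critical value for 24 degrees of freedom, hard-coded from a standard t table rather than computed. Second, n / N = 25 / 3500 is about 0.007, well under 0.05, so the fpc is left out. The variable names are assumptions.

```python
import math

N = 3500                 # households in Neighborhood X
n = 25                   # sampled households
sample_mean = 2.73       # people per household
sample_variance = 2.6
t_critical = 1.711       # t table value: 90% two-sided, df = n - 1 = 24

use_fpc = n / N > 0.05   # the 5% rule from the earlier slides
standard_error = math.sqrt(sample_variance / n)
if use_fpc:
    # Correction in the (N - n) / N form used with sample-based estimates
    standard_error *= math.sqrt((N - n) / N)

margin = t_critical * standard_error
lower, upper = sample_mean - margin, sample_mean + margin
print(f"fpc used: {use_fpc}")
print(f"90% CI: {lower:.2f} to {upper:.2f} people per household")
```

Under these assumptions the interval runs from roughly 2.18 to 3.28 people per household.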
Example 2



Sample of 30 households
Sample standard deviation is 1.25
What sample size is needed to estimate
the mean number of persons per
household in neighborhood X

and be 90% confident that your estimate
will be within 0.3 persons of the true
population mean?
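Example 2 can be worked with the minimum sample size formula n = (Z sigma / E)^2. The sketch below makes two assumptions: the sample standard deviation of 1.25 from the 30 households stands in for the unknown population standard deviation, and Z = 1.65 is the rounded 90% value used earlier in the lecture. The function name is hypothetical.

```python
import math

def minimum_sample_size(z, sigma, error):
    """Smallest whole n for which z * sigma / sqrt(n) does not exceed error."""
    return math.ceil((z * sigma / error) ** 2)

# Example 2: s = 1.25 used in place of sigma, E = 0.3 persons, Z = 1.65
n_needed = minimum_sample_size(z=1.65, sigma=1.25, error=0.3)
print("minimum sample size:", n_needed)  # minimum sample size: 48
```

The raw value is about 47.3, which is rounded up to 48 because a sample size must be a whole number. With the more precise Z = 1.645 the raw value is about 47.0, so the answer is sensitive to the table value used.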