Concepts of Sampling and Sampling Distributions


Topic 9: Estimation Bias, Standard Error and Sampling Distribution
From sample to population
Inductive (inferential) statistical methods
Make inference about a population based on information from a sample derived from that population
[Figure: population and sample, linked by inductive statistical methods]
Statistical Concepts of Sampling
• Suppose we want to estimate the mean
birthweight of Malay male live births in
Singapore, 1992
• Due to logistical constraints, we decide to
take a random sample of 50 live births from
the records of all Malay male live births for
that year
Sampling from Target Population
Target population: all Malay male live births in Singapore, 1992
Sample: random sample of 50 Malay male live births in Singapore, 1992
Suppose
sample mean = 3.55 kg
sample SD (S) = 0.92 kg
What can we say about the
population mean?
Statistical Modeling
• Assume the population values follow a normal
or some other appropriate distribution. This
means a relative frequency histogram of the
population values will look like a normal or
that appropriate distribution.
• Assume we have a random sample, i.e., we
sample n (=50 in example) values
independently from the population
Notation
Sample data: $X_1, \ldots, X_n$
Assume $X_1, \ldots, X_n$ are independent and each is distributed according to, say, a normal distribution
Population parameters:
Population mean = mean of the normal population = $\mu$
Population variance = variance of the normal population = $\sigma^2$
Population standard deviation = $\sigma$
Statistical Inference
Two general areas:
(a) Statistical Estimation
i.e. estimating population parameters based on sample statistics
(b) Hypothesis Testing
i.e. testing certain assumptions about the
population
Also called Test of Statistical Significance
Statistical Estimation
There are two ways by which a population
parameter can be estimated from a sample:
(1) Point estimate
(2) Interval estimate
Point Estimate
Estimate the population parameter by a single value:
Sample statistic      Population parameter estimated
Sample mean           population mean
Sample median         population median
Sample variance       population variance
Sample SD             population SD
Sample proportion     population proportion
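As a rough illustration of the pairings above, the sketch below computes each point estimate from a small sample using Python's standard `statistics` module. The eight birthweights and the 3.5 kg threshold used for the proportion are made-up values for illustration only; they do not come from the birthweight study.

```python
import statistics

# Hypothetical birthweights in kg (illustrative values, not from the study).
weights = [3.1, 3.6, 2.9, 4.0, 3.4, 3.8, 3.3, 3.7]


def point_estimates(sample, threshold):
    """Return single-value estimates of several population parameters."""
    return {
        "mean": statistics.mean(sample),
        "median": statistics.median(sample),
        "variance": statistics.variance(sample),  # sample variance, divides by n - 1
        "sd": statistics.stdev(sample),           # square root of the sample variance
        # Proportion of the sample below the (hypothetical) threshold.
        "proportion_below": sum(x < threshold for x in sample) / len(sample),
    }


estimates = point_estimates(weights, threshold=3.5)
for name, value in estimates.items():
    print(f"{name}: {value:.3f}")
```

Each number printed is a point estimate of the corresponding population quantity; none of them says anything yet about how far off the estimate might be.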
Point Estimate
If the average birthweight for a random sample of Malay male births was 3.55 kg and we use it to estimate $\mu$, the mean birthweight of all Malay male births in the population, we would be making a point estimate for $\mu$
• Poor practice to report just the point estimate because people cannot judge how good the estimate is
• Should also report the accuracy of the estimate.
• Remember that the quality of an estimator is judged by its performance over REPEATED SAMPLING although we have just one sample in hand.
Inference for population parameter should make allowance
for sampling error
Accuracy of statistical estimation
Two types of error:
(a) Sampling error or fluctuation
“random” error or fluctuation that is due entirely to chance
in the process of sampling. Minimizing the sampling error
maximizes the precision of a statistical estimate.
(b) Systematic error or bias
Non-random error/bias which is either a property of the estimator itself or due to bias in the sampling or measurement process. Minimizing the systematic error maximizes the validity of a statistical estimate. Systematic errors can be minimized by making efforts to reduce sampling and measurement bias (e.g. non-random sampling, non-response and non-coverage, untruthful answers, unreliable calibration, errors in data recording and coding, etc.)
Unbiased estimation of the mean
i.e., the sample mean equals the population mean
when averaged over repeated samples
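The unbiasedness claim follows from linearity of expectation. A short derivation, writing $\bar{X}$ for the sample mean of $X_1, \ldots, X_n$ and assuming each $X_i$ has mean $\mu$:

```latex
E[\bar{X}]
  = E\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right]
  = \frac{1}{n}\sum_{i=1}^{n} E[X_i]
  = \frac{1}{n}\, n\mu
  = \mu
```

Note that this step does not need normality or even independence, only that every observation has the same mean $\mu$.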
Hypothetical results of repeated sampling
Sample   Mean (kg)
1        3.55
2        3.59
3        3.48
4        3.51
5        3.49
6        3.46
7        3.48
8        3.52
9        3.51
10       3.49
• Unbiasedness means the sample mean equals the population mean when averaged over repeated samples
• However, there is fluctuation from sample to sample
• Variance = ?
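One way to get a feel for the "Variance = ?" question is to summarize the ten hypothetical sample means directly. The sketch below does only that; with just ten repetitions the result is a crude empirical picture of the sample-to-sample fluctuation, not the theoretical variance of the sample mean.

```python
import statistics

# The ten hypothetical sample means from the repeated-sampling table (kg).
sample_means = [3.55, 3.59, 3.48, 3.51, 3.49, 3.46, 3.48, 3.52, 3.51, 3.49]

# Average of the sample means: under unbiasedness this settles near the
# population mean as the number of repeated samples grows.
average_of_means = statistics.mean(sample_means)

# Spread of the sample means around their average (divides by 10 - 1).
variance_of_means = statistics.variance(sample_means)
sd_of_means = statistics.stdev(sample_means)

print(f"average of the sample means:  {average_of_means:.3f} kg")
print(f"variance of the sample means: {variance_of_means:.5f} kg^2")
print(f"SD of the sample means:       {sd_of_means:.3f} kg")
```

The SD of the sample means computed here is an empirical stand-in for the quantity that the next slides call the standard error.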
Standard Error (SE) of an estimator
• The SE of an estimator (e.g., the sample mean) is
just the standard deviation (SD) of the estimator.
It measures the variability of the estimator under
“repeated” sampling
• SE is just a special case of SD
• The standard deviation of an estimator is called the standard error because it is a measure of the magnitude of the estimation error due to sampling fluctuation
Standard Deviation vs Standard Error
• The population standard deviation (SD) measures the
amount of variation among the individual measurements
that make up the population and can be estimated from a
sample using the sample standard deviation.
• The standard error (e.g. of the sample mean), on the other
hand, measures how much the value of the estimator
changes from sample to sample under repeated sampling.
• As we take only 1 sample rather than repeated samples in practice, it seems impossible at first to estimate the standard error, which is defined with reference to repeated sampling.
• Fortunately, the standard error of the sample mean is a
function of the population SD. As the latter is estimable
from a single sample, so is the standard error.
Estimated standard error of the sample mean
• Let $\sigma$ denote the population SD
• It was shown earlier that SE = SD(sample mean) = $\sigma/\sqrt{n}$, where n is the sample size
• Since $\sigma$ can be estimated by the sample standard deviation S, we can estimate the standard error by $\widehat{SE} = S/\sqrt{n}$
Note that SE decreases with n at the rate $1/\sqrt{n}$, i.e., the precision of the sample mean improves as sample size increases
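The estimated standard error can be sketched in a few lines. The example uses the birthweight figures quoted earlier (S = 0.92 kg, n = 50); the larger sample sizes are hypothetical and only illustrate the 1/sqrt(n) rate, and the function name `estimated_se` is an illustrative choice.

```python
import math


def estimated_se(sample_sd, n):
    """Estimated standard error of the sample mean: S / sqrt(n)."""
    return sample_sd / math.sqrt(n)


# Birthweight example: S = 0.92 kg from a sample of n = 50.
se_50 = estimated_se(0.92, 50)

# Hypothetical larger samples with the same S, to show the 1/sqrt(n) rate:
# quadrupling the sample size halves the standard error.
se_200 = estimated_se(0.92, 200)
se_800 = estimated_se(0.92, 800)

for n, se in ((50, se_50), (200, se_200), (800, se_800)):
    print(f"n = {n:4d}: estimated SE = {se:.3f} kg")
```

Halving the standard error therefore costs four times as many observations, which is why precision gains become expensive at large n.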
Knowing the mean and standard
error of an estimator still doesn’t
tell us the whole story
The whole story is told by the
sampling distribution since that
helps in calculating the
probabilities
Sampling distribution of the sample mean
• The distribution of the sample mean under “repeated”
sampling from the population
Sample   Mean (kg)
1        3.55
2        3.59
3        3.48
4        3.51
5        3.49
6        3.46
7        3.48
8        3.52
9        3.51
10       3.49
• Distribution of the sample mean rather than individual measurements
• In practice, we take only one sample, not repeated samples, and so the sampling distribution is unobserved, but fortunately it can often be derived theoretically
Demo: http://www.ruf.rice.edu/~lane/stat_sim/index.html
Exact result when sampling from a normal population
• If the population is normal with mean $\mu$ and variance $\sigma^2$, then the sample mean based on a random sample of size n is also normal with mean $\mu$ and variance $\sigma^2/n$,
i.e., $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$

• Note how we can derive theoretically the distribution of
the sample mean under repeated sampling without
actually drawing repeated samples
• This is important because we usually only have one
sample at our disposal in practice
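The theoretical result can be checked informally by simulation, which does draw the repeated samples that are unavailable in practice. In the sketch below the population values mu = 3.5 kg and sigma = 0.9 kg are assumptions chosen for illustration (the true population parameters are unknown), and the number of repetitions and the seed are arbitrary.

```python
import math
import random
import statistics

# Hypothetical population parameters (assumptions, not estimates from data).
MU, SIGMA = 3.5, 0.9
N = 50        # size of each sample, as in the birthweight example
REPS = 5000   # number of repeated samples to simulate


def simulate_sample_means(mu, sigma, n, reps, seed=0):
    """Draw `reps` independent samples of size n from N(mu, sigma^2)
    and return the list of their sample means."""
    rng = random.Random(seed)
    means = []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return means


means = simulate_sample_means(MU, SIGMA, N, REPS)
theoretical_se = SIGMA / math.sqrt(N)

print(f"mean of simulated sample means: {statistics.fmean(means):.4f} (theory: {MU})")
print(f"SD of simulated sample means:   {statistics.stdev(means):.4f} "
      f"(theory: {theoretical_se:.4f})")
```

Under these assumptions the simulated mean and SD should land close to mu and sigma/sqrt(n), and a histogram of `means` would look approximately normal.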
Topic 10: Interval Estimate
• Provides an estimate of the population
parameter by defining an interval or range
of plausible values within which the
population parameter could be found with a
given confidence.
• This interval is called a confidence interval.
• The sampling distribution is used in
constructing confidence intervals.
Confidence interval for the mean of a normal population
Fact: With probability 0.95, a normally distributed variable is within 1.96 standard deviations of its mean.
Now $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$, with $SD(\bar{X}) = SE = \frac{\sigma}{\sqrt{n}}$
• It follows that the sample mean must be within 1.96 standard errors of the population mean with probability 0.95.
• Equivalently, the population mean is within 1.96 standard errors of the sample mean:
$0.95 = P\left(\bar{X} - 1.96\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right)$
We call
$\left(\bar{X} - 1.96\frac{\sigma}{\sqrt{n}},\ \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right) = \bar{X} \pm 1.96\, SE(\bar{X})$
a 95% confidence interval for the population mean.
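The constant 1.96 and the known-sigma interval can be reproduced with the standard library's `NormalDist`. This is a minimal sketch: the function name `ci_known_sigma` is illustrative, and the example treats 0.92 kg as if it were the known population SD purely for demonstration (in the birthweight example it is actually a sample SD).

```python
import math
from statistics import NormalDist

# Upper 2.5-percentile of the standard normal distribution (about 1.96).
z_975 = NormalDist().inv_cdf(0.975)


def ci_known_sigma(xbar, sigma, n, level=0.95):
    """Confidence interval for the mean of a normal population, sigma known."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sigma / math.sqrt(n)  # z times the standard error
    return xbar - half_width, xbar + half_width


# Illustration only: pretend sigma = 0.92 kg is known, with n = 50.
lo, hi = ci_known_sigma(xbar=3.55, sigma=0.92, n=50)
print(f"z = {z_975:.4f}")
print(f"95% CI with sigma treated as known: ({lo:.3f} kg, {hi:.3f} kg)")
```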
If  is unknown, replace it by the sample SD S
and replace 1.96 by the upper 2.5-percentile of a
t-distribution with n-1 degrees of freedom to yield
^
S
S 
 X t
, X  tn1,0.025   X  tn1,0.025 SE ( X )

n 1, 0.025
n
n

as a 95% confidence interval for the
population mean
The t densities
• t densities are symmetric
and similar in
appearance to N(0,1)
density but with heavier
tails
• Tables for t distributions
are widely available
• As d.f. increases, t
distribution converges to
standard normal
distribution
Demo: http://www.isds.duke.edu/sites/java.html
95% confidence interval for the population mean
$\left(\bar{X} - t_{n-1,0.025}\frac{S}{\sqrt{n}},\ \bar{X} + t_{n-1,0.025}\frac{S}{\sqrt{n}}\right) = \bar{X} \pm t_{n-1,0.025}\,\widehat{SE}(\bar{X})$
Birthweight data revisited
• n = 50, Sample mean = 3.55 kg, S = 0.92 kg
• SE = 0.92/sqrt(50) = 0.13 kg
• d.f. = 49, upper 2.5-percentile of t = 2.01
• 95% C.I. for the mean Malay male birthweight is 3.55 +/- 2.01 (0.13) = (3.29 kg, 3.81 kg)
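The birthweight interval can be reproduced with a short calculation. The sketch below takes n = 50, the sample size stated when the example was introduced, and uses the critical value 2.01 quoted on the slide rather than computing it, since the Python standard library has no t-distribution quantile function. The function name `t_interval` is an illustrative choice.

```python
import math


def t_interval(xbar, s, n, t_crit):
    """Confidence interval xbar +/- t_crit * S / sqrt(n).

    t_crit is the upper-tail critical value of the t-distribution with
    n - 1 degrees of freedom, supplied by the caller (e.g. from a table).
    """
    se = s / math.sqrt(n)      # estimated standard error of the mean
    half_width = t_crit * se
    return xbar - half_width, xbar + half_width


# Birthweight data: sample mean 3.55 kg, S = 0.92 kg, n = 50, t(49, 0.025) = 2.01.
lo, hi = t_interval(xbar=3.55, s=0.92, n=50, t_crit=2.01)
print(f"95% CI: ({lo:.2f} kg, {hi:.2f} kg)")
```

If SciPy is available, `scipy.stats.t.ppf(0.975, 49)` is one way to obtain the critical value instead of reading it from a table.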
The meaning of confidence interval
Under repeated sampling, the interval
$\bar{X} \pm t_{n-1,0.025}\frac{S}{\sqrt{n}}$
will contain the true mean 95% of the time.
Demo: http://www.isds.duke.edu/sites/java.html
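The repeated-sampling interpretation can also be explored with a small simulation in place of the applet. In this sketch the population parameters (mu = 3.5 kg, sigma = 0.9 kg) are assumptions chosen for illustration, the critical value 2.01 for 49 degrees of freedom is the one quoted in the birthweight example, and the number of repetitions and the seed are arbitrary.

```python
import math
import random
import statistics


def coverage(mu, sigma, n, t_crit, reps, seed=1):
    """Fraction of simulated t-intervals that contain the true mean mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        # One sample of size n from the hypothetical normal population.
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = statistics.fmean(sample)
        half_width = t_crit * statistics.stdev(sample) / math.sqrt(n)
        if xbar - half_width <= mu <= xbar + half_width:
            hits += 1
    return hits / reps


# Hypothetical population; n = 50 and t = 2.01 as in the birthweight example.
rate = coverage(mu=3.5, sigma=0.9, n=50, t_crit=2.01, reps=2000)
print(f"proportion of intervals containing the true mean: {rate:.3f}")
```

The proportion should come out close to 0.95. Any single interval either contains the true mean or does not; the 95% describes the procedure over many repetitions.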