Probability vs Statistics


Point estimates and sampling distribution
A Typical Statistics Problem






1. Have a question in mind.
2. ***Design a scheme to collect data
(including how much data to collect and how
to collect it).
3. Collect the data.
4. ***Conduct the analysis.
5. ***Draw conclusions.
Steps 2, 4 and 5 (marked ***) usually involve a lot of
probability.
Example I

Suppose you want to study the average GPA
of students majoring in management. What is
your plan for collecting data?
Example II


Suppose you took a simple random sample
of students majoring in management and got
the following data:
2.1, 1.8, 2.3, 3.2, 3.6, 3.1, 2.0, 2.8, 3.2, 2.2,
2.1, 1.2, 1.6, 3.2, 3.4, 2.5, 3.5, 3.8,
2.1, 1.9.
What can you say about the mean GPA of
management students?
Example II


1. The quantities we are interested in are the
mean and variance of the GPA of all
management students.
2. We took a simple random sample of size
20 and use the mean and variance of the
GPA in this sample to estimate the mean and
variance of the GPA of all management students.
Example II

3. The average GPA of our sample is 2.58
and the variance is 0.56.
– Formula: mean: $\bar{X} = \frac{\sum x_i}{n} = 2.58$
– variance: $s^2 = \frac{\sum (x_i - \bar{X})^2}{n - 1} = 0.56$
(the sample standard deviation is $s = \sqrt{s^2} \approx 0.75$)
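As a check on the arithmetic above, here is a minimal Python sketch that uses only the standard library. The variable names are illustrative and do not come from the slides. It should print values close to 2.58, 0.56 and 0.75.

```python
import statistics

# GPA sample from Example II (20 students)
gpa = [2.1, 1.8, 2.3, 3.2, 3.6, 3.1, 2.0, 2.8, 3.2, 2.2,
       2.1, 1.2, 1.6, 3.2, 3.4, 2.5, 3.5, 3.8, 2.1, 1.9]

sample_mean = statistics.mean(gpa)     # sum of x_i divided by n
sample_var = statistics.variance(gpa)  # sum of squared deviations divided by n - 1
sample_sd = statistics.stdev(gpa)      # square root of the sample variance

print(round(sample_mean, 2), round(sample_var, 2), round(sample_sd, 2))
```

Note that 0.56 is the sample variance; the sample standard deviation is its square root.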
Example II



Also, we are interested in the proportion of
management students whose GPA is greater
than 2.8.
2.1, 1.8, 2.3, 3.2, 3.6, 3.1, 2.0, 2.8, 3.2, 2.2,
2.1, 1.2, 1.6, 3.2, 3.4, 2.5, 3.5, 3.8,
2.1, 1.9.
In this case, the proportion is 8 out of 20, or
0.4.
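The count above can be reproduced with a short Python sketch (variable names are illustrative). Note that "greater than 2.8" is strict, so the one student with exactly 2.8 is not counted.

```python
# GPA sample from Example II (20 students)
gpa = [2.1, 1.8, 2.3, 3.2, 3.6, 3.1, 2.0, 2.8, 3.2, 2.2,
       2.1, 1.2, 1.6, 3.2, 3.4, 2.5, 3.5, 3.8, 2.1, 1.9]

# Keep only the GPAs strictly greater than 2.8
above = [x for x in gpa if x > 2.8]

# Sample proportion = count with the characteristic / sample size
p_bar = len(above) / len(gpa)

print(len(above), p_bar)
```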
Some Concepts




We are interested in some quantities that describe the
population; we call them population parameters.
From the sample, we get corresponding
quantities that describe the sample; we call them
sample statistics.
The process of calculating sample statistics to estimate
population parameters is called point estimation.
Think of it as a point-to-point correspondence: each
population parameter is matched with one sample statistic.
Some Concepts


We call the sample quantity the point
estimator of its population counterpart.
For example,
– The sample mean is the point estimator of the
population mean.
– The sample variance is the point estimator of the
population variance.
– The sample proportion is the point estimator of the
population proportion.
Some Concepts



Given a sample, we are usually able to
calculate those sample quantities.
In the previous examples, we calculated the
sample mean, sample variance and sample
proportion. They are 2.58, 0.56 and 0.4,
respectively.
Those values, 2.58, 0.56 and 0.4, are the
numeric values of the point estimators; they
are called the point estimates.
Some Concepts

Therefore, in statistics, when we say “find
the point estimate of the population mean,
variance or proportion”, we mean:
calculate the sample mean,
sample variance or sample proportion.
Some Concepts

In the previous examples:
– The population parameters are: population
mean, population variance and population
proportion.
– The point estimators of those parameters are:
sample mean, sample variance and sample
proportion.
– The point estimates of those parameters are:
2.58, 0.56 and 0.4.
Some Concepts

To be fair, we actually should say:
– The sample mean is one of the point estimators of the
population mean.
– The sample variance is one of the point estimators of the
population variance.
– The sample proportion is one of the point estimators of the
population proportion.
– There are many different ways to estimate a population
parameter, and using the sample counterpart is just one of
them.
– But we only talk about the sample mean, sample variance and sample
proportion in this course.
Example III





A researcher is interested in the average height of
NFL players. He took a simple random sample of
100 NFL players, and the sample mean was 6’5”.
What is the population, and what is the sample?
The population is all NFL players, and the sample is the
100 NFL players selected.
What is the population parameter?
The average height of all NFL players.
Example III

What is the point estimator of the population
parameter?

The average height of the 100 NFL players in the
sample.

What is the point estimate of the population
parameter?

6’5”
Example IV


In a clinical study on the proportion of people
having diabetes in a Lafayette community,
the researchers took a sample of 356
subjects, among which 56 have diabetes.
Identify the population, the sample, the population
parameter, its point estimator and the point
estimate.
Example IV



1. The population is all people living in the Lafayette
community.
2. The sample is the 356 people selected.
3. The population parameter is the proportion of
people with diabetes in the Lafayette community. The
point estimator of that parameter is the sample
proportion, and the point estimate is 56/356 ≈ 0.157.
Sampling Distribution



In all examples above, we took just ONE sample and
calculated the point estimates.
But we know that samples differ from each
other. If we take many different samples, we will
get different point estimates of the same
population parameter. (That is what we call
sampling variability.)
How do we account for this variability?
Sampling Distribution

Review:
– Random variable: a numerical description of the
outcome of an experiment.

If we consider drawing a sample from a
population and calculating the point
estimate of a population parameter to be an
experiment, then the outcome of the
experiment is the point estimate.
Sampling Distribution



Therefore, the different point estimates from
different samples of the same population will
be considered different numerical values of a
random variable.
And random variables always have an
associated probability distribution.
That probability distribution is called the
sampling distribution.
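To make the idea concrete, here is a small simulation sketch in Python. The population is made up (1,000 hypothetical GPA values), and the sample size and number of repetitions are arbitrary choices, not from the slides. Each repetition draws a simple random sample and records its sample mean; the collection of recorded values approximates the sampling distribution.

```python
import random
import statistics

random.seed(0)

# Hypothetical finite population: 1,000 made-up GPA values between 1.0 and 4.0
population = [round(random.uniform(1.0, 4.0), 1) for _ in range(1000)]
pop_mean = statistics.mean(population)

def sample_mean(pop, n):
    """Draw a simple random sample (without replacement) and return its mean."""
    return statistics.mean(random.sample(pop, n))

# Repeat the "experiment" many times: each run gives one point estimate
estimates = [sample_mean(population, 20) for _ in range(2000)]

print("first five estimates:", [round(e, 2) for e in estimates[:5]])
print("population mean:", round(pop_mean, 3))
print("average of the estimates:", round(statistics.mean(estimates), 3))
```

The individual estimates differ from sample to sample, while their average lands close to the population mean.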
Sampling Distribution


Therefore, from previous examples, we know
the values of sample mean, sample variance
and sample proportion are all considered
random variables.
Then the next question is, what is the
probability distribution of those random
variables?
Sampling Distribution




What do we need to find out now?
For each of these random variables, we need to find its probability
distribution, its mean (expected value) and its
variance (or standard deviation).
Now let’s answer these questions for the sample
mean and the sample proportion.
There are answers for the sample variance too, but they are
much more complicated than the previous two and
beyond the scope of this course.
Sampling Distribution

*** All the answers below assume we have a
simple random sample.

Another thing to keep in mind is that there is
a difference between taking a simple random
sample from a finite population and from an infinite population.
Sampling Distribution of Sample Mean


1. The expected value of the sample mean is the
population mean: $E(\bar{X}) = \mu$. (Any point estimator that
possesses this property is called an
unbiased estimator.)
2. The standard deviation of the sample mean
depends on whether we have a finite or
infinite population.
Standard deviation of sample mean


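With INFINITE population:
$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$
With FINITE population:
$\sigma_{\bar{X}} = \sqrt{\frac{N - n}{N - 1}} \left( \frac{\sigma}{\sqrt{n}} \right)$
Finite population correction factor:
$\sqrt{\frac{N - n}{N - 1}}$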
Empirical Rule



When the population is infinite,
or when the population is finite but the
sample size is less than or equal to 5% of the
population size ($n/N \le 0.05$),
we use:
$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$
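To see why 5% is a reasonable cutoff, consider a worked example at the boundary. The values N = 1000 and n = 50 are chosen only for illustration.

```latex
\sqrt{\frac{N - n}{N - 1}}
  = \sqrt{\frac{1000 - 50}{1000 - 1}}
  = \sqrt{\frac{950}{999}}
  \approx 0.975
```

So even at the 5% boundary, ignoring the correction factor changes the standard deviation of the sample mean by only about 2.5%, and the difference shrinks as n/N gets smaller.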
Terminology


We usually call the population standard
deviation the standard deviation.
We call the standard deviation of a point
estimator, in this case the sample mean, the
standard error.
Form of the sampling distribution of
sample mean



Now the question is: what is the form of the distribution
of the sample mean?
When the population from which we select a simple
random sample (SRS) is normally or nearly normally
distributed, the sampling distribution of the sample
mean is normal.
When the population from which we select an SRS is
NOT normally distributed, the sampling distribution
can be approximated by a normal distribution for
large sample sizes. This follows from the central limit theorem (CLT).
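A quick simulation can illustrate the CLT claim. This is a sketch under my own assumptions: the population is Exponential(1), which is clearly skewed with mean 1 and standard deviation 1, and the sample size of 50 and the 5,000 repetitions are arbitrary. If the sample means are roughly normal, about 68% of them should fall within one standard error of the population mean.

```python
import math
import random
import statistics

random.seed(1)

n = 50            # sample size
reps = 5000       # number of simulated samples
mu = sigma = 1.0  # mean and standard deviation of an Exponential(1) population

# Each entry is the mean of one sample of size n from the skewed population
sample_means = [
    sum(random.expovariate(1.0) for _ in range(n)) / n
    for _ in range(reps)
]

se = sigma / math.sqrt(n)  # theoretical standard error
mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)
within_one_se = sum(abs(m - mu) < se for m in sample_means) / reps

print("mean of sample means:", round(mean_of_means, 3), "vs mu =", mu)
print("sd of sample means:  ", round(sd_of_means, 3), "vs sigma/sqrt(n) =", round(se, 3))
print("share within one SE: ", round(within_one_se, 3), "(normal gives about 0.68)")
```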
Sampling distribution of sample mean

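For a finite population of size N
$\bar{X} \sim N\left(\mu,\ \sqrt{\frac{N - n}{N - 1}} \left( \frac{\sigma}{\sqrt{n}} \right)\right)$
For an infinite population
$\bar{X} \sim N\left(\mu,\ \frac{\sigma}{\sqrt{n}}\right)$
(Here the second argument is the standard error, not the variance.)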
GPA example revisited



Using the information from our GPA example, if we
know that the average GPA of Purdue students is
2.8 with a standard deviation of 1.2, what is the
probability that we get a sample of size 20 with a
mean between 2 and 3?
1. Since our sample size is 20, which is far less than
5% of all Purdue students, we can use the infinite
population version.
2. $P(2 < \bar{X} < 3)$
GPA example revisited





$= P\left( \frac{2 - 2.8}{1.2/\sqrt{20}} < Z < \frac{3 - 2.8}{1.2/\sqrt{20}} \right)$
$= P(-2.98 < Z < 0.75)$
$= \Phi(0.75) - \Phi(-2.98)$
$= 0.7734 - 0.0014$
$= 0.7720$
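The same probability can be computed without a z-table using Python's standard library `statistics.NormalDist`. This is a sketch with illustrative variable names. Because the table calculation rounds the z-values to two decimals, the software answer (about 0.77) can differ slightly in the third decimal place.

```python
import math
from statistics import NormalDist

mu = 2.8     # population mean GPA
sigma = 1.2  # population standard deviation
n = 20       # sample size

# Sampling distribution of the sample mean (infinite-population version)
se = sigma / math.sqrt(n)
x_bar_dist = NormalDist(mu=mu, sigma=se)

# P(2 < X-bar < 3) = CDF(3) - CDF(2)
prob = x_bar_dist.cdf(3) - x_bar_dist.cdf(2)
print(round(prob, 3))
```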
Relationship between the sample size and sampling
distribution of sample mean



It comes to us pretty naturally that the point
estimates should be better given a larger
sample size.
But how do we measure the “goodness” of
an estimate?
Actually, the “goodness” of an estimate is
measured by the standard deviation of its estimator,
that is, the standard error.
Relationship between the sample size and
sampling distribution of sample mean

Look at the formula for the standard error:
$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$

We can see that the standard error decreases as n
increases.
Also, the sampling distribution of the sample mean is
considered normal.
Compare two normal distributions with the same
mean but different standard deviations: which one
gives us less uncertainty?
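A few numbers make the relationship visible. In this Python sketch, sigma = 1.2 is borrowed from the GPA example and the sample sizes are arbitrary. Because n sits under a square root, the sample size must be quadrupled to cut the standard error in half.

```python
import math

sigma = 1.2  # population standard deviation (from the GPA example)

def se(n):
    """Standard error of the sample mean for sample size n."""
    return sigma / math.sqrt(n)

# Each sample size is four times the previous one
for n in (5, 20, 80, 320):
    print(n, round(se(n), 4))
```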
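Sampling distribution of $\bar{p}$

We know that the point estimator of the
population proportion, p, is the sample
proportion, $\bar{p}$, and $\bar{p} = \frac{x}{n}$,
where x is the number of sampled units with the
characteristic of interest.

Also, the expected value of the sample
proportion is the population proportion.
– $E(\bar{p}) = p$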
Sampling Distribution of $\bar{p}$
Standard deviation of $\bar{p}$, or standard error
Infinite population:
$\sigma_{\bar{p}} = \sqrt{\frac{p(1 - p)}{n}}$
Finite population:
$\sigma_{\bar{p}} = \sqrt{\frac{N - n}{N - 1}} \sqrt{\frac{p(1 - p)}{n}}$
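As with the sample mean, both cases fit in one small Python function. This is a sketch: the function name and the `N=None` convention for an infinite population are my own, and the usage values (p = 0.56, n = 20, N = 100) are illustrative.

```python
import math

def standard_error_proportion(p, n, N=None):
    """Standard deviation of the sample proportion.

    p: population proportion
    n: sample size
    N: population size, or None for an infinite population
    """
    se = math.sqrt(p * (1 - p) / n)
    if N is not None:
        # Apply the finite population correction factor
        se *= math.sqrt((N - n) / (N - 1))
    return se

print(round(standard_error_proportion(0.56, 20), 3))         # infinite population
print(round(standard_error_proportion(0.56, 20, N=100), 3))  # finite population
```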
Form of the sampling distribution
of $\bar{p}$

According to the central limit theorem (or
think about the normal approximation to
the binomial distribution), the sampling
distribution of $\bar{p}$ can be approximated by a
normal distribution when $np \ge 5$ and
$n(1 - p) \ge 5$.
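Sampling Distribution of $\bar{p}$

For a finite population of size N
$\bar{p} \sim N\left(p,\ \sqrt{\frac{N - n}{N - 1}} \sqrt{\frac{p(1 - p)}{n}}\right)$

For an infinite population
$\bar{p} \sim N\left(p,\ \sqrt{\frac{p(1 - p)}{n}}\right)$
(Here the second argument is the standard error, not the variance.)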
GPA example re-revisited

Again, using information from our GPA
example, if we also know that 56% of Purdue
students have a GPA of 2.8 or higher,
what is the probability that the proportion
calculated from a sample of size 20 will fall
below 0.3?

Since np = 20(0.56) = 11.2 and n(1-p) = 20(0.44) = 8.8 are both at least 5, the CLT tells us that the sample proportion, $\bar{p}$, approximately follows a normal distribution with mean 0.56 and
variance 0.56*(1-0.56)/20.
GPA example re-revisited




$P(\bar{p} < 0.3)$
$= P\left( Z < \frac{0.3 - 0.56}{\sqrt{0.56 \times 0.44 / 20}} \right)$
$= \Phi(-2.34)$
$= 0.0096$
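This calculation can also be checked with `statistics.NormalDist` from Python's standard library. It is a sketch with illustrative variable names, and it relies on the normal approximation described above. The result should agree with the table value of 0.0096 up to rounding of the z-value.

```python
import math
from statistics import NormalDist

p = 0.56  # population proportion with GPA of 2.8 or higher
n = 20    # sample size

# Normal approximation to the sampling distribution of the sample proportion
se = math.sqrt(p * (1 - p) / n)
p_bar_dist = NormalDist(mu=p, sigma=se)

z = (0.3 - p) / se
prob = p_bar_dist.cdf(0.3)  # P(p-bar < 0.3)

print(round(z, 2), round(prob, 4))
```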
What makes a “GOOD” point
estimator

Suppose we are interested in the population
parameter $\theta$, and we come up with an
estimator $\hat{\theta}$. How do we decide whether $\hat{\theta}$ is a
“good” estimator of $\theta$?
What makes a “GOOD” point
estimator

1. Unbiased:
– $E(\hat{\theta}) = \theta$
2. Efficient: each point estimator has a
sampling distribution with a mean and a
variance; the one with the smaller variance is
considered more efficient.
3. Consistent: a point estimator is consistent
if it becomes closer to the population
parameter as the sample size increases.
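The unbiasedness property can be illustrated with a simulation. The following Python sketch is my own illustration, not from the slides: it draws many small samples from a standard normal population (true variance 1) and compares two estimators of the population variance, one dividing by n - 1 (the sample variance formula used earlier) and one dividing by n. The sample size and number of repetitions are arbitrary.

```python
import random

random.seed(2)

n = 5            # small sample size makes the bias easy to see
reps = 20000     # number of simulated samples
true_var = 1.0   # variance of the standard normal population

sum_unbiased = 0.0
sum_biased = 0.0
for _ in range(reps):
    x = [random.gauss(0.0, 1.0) for _ in range(n)]
    m = sum(x) / n
    ss = sum((v - m) ** 2 for v in x)  # sum of squared deviations
    sum_unbiased += ss / (n - 1)       # divisor n - 1
    sum_biased += ss / n               # divisor n

# Averages over many samples approximate the expected values
avg_unbiased = sum_unbiased / reps
avg_biased = sum_biased / reps

print("divisor n - 1:", round(avg_unbiased, 3), "(target", true_var, ")")
print("divisor n:    ", round(avg_biased, 3), "(systematically too small)")
```

On average the n - 1 version lands near the true variance, while the n version comes out near (n - 1)/n times the true variance, which is why the sample variance uses n - 1.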