Day 7: Simple Random Sampling
Sampling and Estimating
Population Percentages and
Averages
Math 1680
Overview
Introduction
Picking a Sample
Chance Errors in Sampling
Estimating the Population Percentage
Estimating the Population Average
Summary
Introduction
In real life, we often want to know about a
population of individuals
Voters for an election
Families who watch TV
Students at a university
Most of the time the population is far too big
to be able to question directly
Introduction
Instead, we take a sample of the population,
determine the characteristics of the sample, and then
use these to infer the parameters of the population at
large
Characteristics take the form of statistics
Average height
Median exam score
Proportion of voters in favor of a candidate
Can adjust these statistics to apply to the population
Which would better represent the population (all
other things being equal), a large sample or a small
sample?
Picking a Sample
Survey tools include
Interview
Mail
Phone
Internet
Sample types include
Convenience
Quota
Simple Random
Picking a Sample
Generally, the best bet is to make the sample
random
Allows the surveyor no choice in determining who
will be interviewed
Provides the best defense against bias
Simple random sampling is one way to do
this
Draw randomly as if from a box
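Drawing "as if from a box" can be sketched in a few lines of Python. This is only an illustration: the helper name `simple_random_sample` and the population of ID numbers are assumptions, not part of the lecture.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct members of `population`, each equally likely,
    without replacement (the surveyor has no say in who is picked)."""
    rng = random.Random(seed)
    return rng.sample(list(population), n)

# Hypothetical population: ID numbers for 10,000 individuals.
population_ids = range(1, 10_001)
sample_ids = simple_random_sample(population_ids, 100, seed=7)

# Number of people drawn, and number of distinct people drawn.
print(len(sample_ids), len(set(sample_ids)))
```

Because the draw is made without replacement, nobody can be interviewed twice.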
Picking a Sample
Sources of bias
People who respond to voluntary surveys tend to have different
characteristics than people who don’t respond (volunteer bias)
People tend to exaggerate/round off things such as their weight
or age when asked, depending on the context of the question
The wording of the questions or the nature of the interviewer can
suggest certain types of answers from people (response bias)
Phone surveys in particular are susceptible to volunteer bias and
also tend to over-represent the middle class
In general, poor people and rich people are less likely to have listed
numbers
Picking a Sample
Give at least two types of bias (with
examples) that one encounters when trying to
conduct a phone survey by looking up
numbers randomly in a phone book
Many people have their numbers
unlisted.
Many people (especially the ones with
Caller ID) will hang up the phone or
choose not to answer, creating a nonresponse bias.
Picking a Sample
In order to obtain a representative sample of
10,000 American citizens, a polling
organization takes one morning to go door-to-door in the capital city of each state until they
have enough people to get that state’s
percentage out of the 10,000
Find at least two problems with this sampling
procedure and justify why they are problems
The capital city will not represent the entire
state, especially rural states like Wyoming.
Going in the morning will eliminate anyone who
is working that morning.
Picking a Sample
(Hypothetical) The teacher of a 400-person class wants to determine
whether or not to curve his first exam. His exams are graded by four
different teaching assistants (TAs). Since there are so many students, the
teacher does not want to look through every exam himself. Instead,
he pulls out the first 5 tests from each TA’s section and takes the
average grade. He then makes a curve based on this average.
Give at least two possible confounding factors in this
process.
The TAs may not have graded to the same standard, and some TAs
may have sorted their tests, either alphabetically or by
grade. Especially in the latter case, the average from the first five
tests will not be representative of all the students.
Chance Errors in Sampling
Warning: The formulae and methods used in
the rest of the lecture only apply to simple
random samples
If the sample in question is not obtained by this
method, using these formulae will give
meaningless results!
Chance Errors in Sampling
Suppose we are dealing with a population of
10,000 students, 5,300 of which are women
and 4,700 of which are men
If we were to sample 100 people at random, how
many would we expect to be women?
53
What percentage would we expect to be women?
53%
Chance Errors in Sampling
Suppose we are dealing with a population of
10,000 students, 5,300 of which are women
and 4,700 of which are men
If we were to sample 100 people, how far off of
our expected value are we likely to be?
≈5
By what percentage are we likely to be off?
≈ 5%
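One way to see where the "about 5" comes from is to repeat the survey many times on a computer. The sketch below is an illustration, not the lecture's method: the 2,000 repetitions and the random seed are arbitrary choices.

```python
import random
import statistics

rng = random.Random(0)

# Box with 5,300 tickets marked 1 (women) and 4,700 marked 0 (men).
box = [1] * 5300 + [0] * 4700

# Repeat the survey 2,000 times: draw 100 students without replacement
# and count the women in each sample.
counts = [sum(rng.sample(box, 100)) for _ in range(2000)]

mean_count = statistics.mean(counts)   # should land near the expected 53
spread = statistics.pstdev(counts)     # should land near the chance error of about 5
print(round(mean_count, 1), round(spread, 1))
```

Since each sample has 100 people, a spread of about 5 in the count is also a spread of about 5 percentage points.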
Chance Errors in Sampling
The expected value for a percentage (%EV) in a
sample size of n is the probability of drawing a “1”
from the corresponding box
$$\%EV = \frac{EV_n}{n} = \frac{(\text{Avg})(n)}{n} = \text{Avg}$$
Chance Errors in Sampling
The standard error for a percentage (%SEn) in a
sample size of n, if we draw with replacement, is the
standard deviation of the box, divided by the square
root of the number of draws
$$\%SE_n = \frac{SE_n}{n} = \frac{(SD)(\sqrt{n})}{n} = \frac{(SD)(\sqrt{n})}{(\sqrt{n})^2} = \frac{SD}{\sqrt{n}}$$
Note that %SEn depends on n, while %EV does not!
Chance Errors in Sampling
As sample size increases, %SEn goes to 0%
Also, dividing the SD by the square root of the
sample size is exact only when the draws are being
made with replacement
Since in sampling we usually do not re-interview the people
we sample, this formula is only an approximation
If the number of people we sample is small relative to the
population size, then this approximation is good enough
If not, we need a correction factor
$$\%SE_n = \left(\frac{SD}{\sqrt{n}}\right)\sqrt{\frac{N - n}{N - 1}}$$
Chance Errors in Sampling
Approximate %SEn and exact %SEn for the percentage
of men in a sample of size n, for different sample
sizes
Population (N) is 10,000
5,000 are men
n        %SEn (exact)    %SEn (approximate)
100      4.98%           5%
900      1.59%           1.7%
1,000    1.5%            1.6%
6,400    0.38%           0.63%
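The two columns can be recomputed with a short sketch. The function name `pct_se` is an assumption made for illustration. It divides the SD of the box by the square root of n and, when a population size is supplied, multiplies by the correction factor for draws without replacement.

```python
import math

def pct_se(sd, n, population_size=None):
    """Standard error of a sample percentage, in percentage points.

    sd is the SD of the 0-1 box (as a fraction). If population_size is
    given, the correction factor sqrt((N - n) / (N - 1)) for drawing
    without replacement is applied.
    """
    se = sd / math.sqrt(n)
    if population_size is not None:
        se *= math.sqrt((population_size - n) / (population_size - 1))
    return 100 * se

N = 10_000
sd = math.sqrt(0.5 * 0.5)  # box with 5,000 ones and 5,000 zeros

# Columns printed: n, exact %SE (with correction), approximate %SE (without).
for n in (100, 900, 1_000, 6_400):
    print(n, round(pct_se(sd, n, N), 2), round(pct_se(sd, n), 2))
```

The correction barely matters when n is 1% of the population, but it cuts the standard error by about 40% when n is 64% of it.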
Chance Errors in Sampling
Suppose we are dealing with a population of
10,000 students, 5,300 of which are women
and 4,700 of which are men
If we were to sample 100 people at random, what
is the probability that we get between 45% and
50% men?
≈ 38%
Chance Errors in Sampling
To verify that simple random sampling is accurate,
we need to see that there is a high probability of
being “close” to the %EV
We can use the normal curve to answer this kind of
question
Find the expected value and standard error for the
percentage
Standardize the range in question according to the %EV
and %SEn
Find the corresponding area under the curve by using the
normal table
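The three steps above can be sketched in Python, with `statistics.NormalDist` from the standard library standing in for the normal table. The function name `prob_pct_between` is an assumption, and the sketch uses the with-replacement standard error (no correction factor).

```python
import math
from statistics import NormalDist

def prob_pct_between(p, n, low, high):
    """Approximate chance that a sample percentage lands between low and high.

    p is the fraction of 1s in the box; low and high are fractions too.
    """
    # Step 1: the expected value is p; the standard error is SD / sqrt(n).
    se = math.sqrt(p * (1 - p)) / math.sqrt(n)
    # Step 2: standardize both ends of the range.
    z_low = (low - p) / se
    z_high = (high - p) / se
    # Step 3: area under the standard normal curve between the two z-values.
    std_normal = NormalDist()
    return std_normal.cdf(z_high) - std_normal.cdf(z_low)

# 47% of the 10,000 students are men; sample of 100; between 45% and 50% men.
prob = prob_pct_between(0.47, 100, 0.45, 0.50)
print(round(prob, 2))
```

For the student example this should come out close to the ≈ 38% quoted earlier.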
Chance Errors in Sampling
According to the 2000 US Census, of
Americans 25 years or older, 80.4% are high
school graduates and 24.4% have at least a
Bachelor’s degree. There were about 281
million Americans in 2000.
If one were to take a simple random sample of 900
Americans age 25 or more, what is the probability
that the sample would contain less than 79% high
school graduates?
≈ 14.5%
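A worked version of this answer, assuming the with-replacement formula is adequate (900 is a tiny fraction of the population):

$$SD = \sqrt{(0.804)(0.196)} \approx 0.397, \qquad \%SE_{900} = \frac{0.397}{\sqrt{900}} \approx 1.32\%$$

$$z = \frac{79\% - 80.4\%}{1.32\%} \approx -1.06$$

The area under the normal curve to the left of z ≈ −1.06 is about 14.5%.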
Chance Errors in Sampling
According to the 2000 US Census, of
Americans 25 years or older, 80.4% are high
school graduates and 24.4% have at least a
Bachelor’s degree. There were about 281
million Americans in 2000.
What is the probability that a simple random sample of
900 Americans age 25 or more would contain between
23% and 26% college graduates?
≈ 70%
Chance Errors in Sampling
(Hypothetical) A polling organization wants to find out if a simple
random sample really achieves the correct demographic
proportions. They decide to conduct surveys in Dallas, where they
know before-hand from the US Census that 50.4% of Dallas
residents are male, and that 38.8% of families in Dallas have a
married couple heading them. The population of Dallas at that time
was 1,188,580.
If they take a simple random sample of 5,000 people, estimate the
probability that their sample would contain less than 49% males
≈ 2.4%
Chance Errors in Sampling
(Hypothetical) A polling organization wants to find out if a simple
random sample really achieves the correct demographic
proportions. They decide to conduct surveys in Dallas, where they
know before-hand from the US Census that 50.4% of Dallas
residents are male, and that 38.8% of families in Dallas have a
married couple heading them. The population of Dallas at that time
was 1,188,580.
If they take a simple random sample of 500,000 people estimate the
probability that their sample would contain less than 49% males
≈ 0%
Chance Errors in Sampling
(Hypothetical) A polling organization wants to find out if a simple
random sample really achieves the correct demographic
proportions. They decide to conduct surveys in Dallas, where they
know before-hand from the US Census that 50.4% of Dallas
residents are male, and that 38.8% of families in Dallas have a
married couple heading them. The population of Dallas at that time
was 1,188,580.
If they take a simple random sample of 10,000 people, estimate the
probability that their sample would contain between 38.5% and 39.5%
families headed by married couples
≈ 66%
Estimating the Population Percentage
Often, we are interested in estimating the
percentage of a population that has some
characteristic
Characteristics can be seen as answers to a yes/no
question
Do you favor this candidate?
Do you smoke?
To estimate the percentage, we start by
taking a simple random sample and
calculating its average and SD
Estimating the Population Percentage
A simple random sample of 1,000 people is
taken, of which 543 are Democrats
What percentage of the population do we expect
to be Democrats?
How far off do we expect to be?
The expected percentage in the population is
just the percentage in the sample
54.3%
Estimating the Population Percentage
Ideally, we would set up our box model for the
entire population and calculate the SD, then
divide by the square root of the sample size
However, we don’t know the population SD
Instead, we just use the sample SD in its place,
making the assumption that it should reflect the
population SD
When estimating %SEn for the expected
percentage in the population, use the sample’s
SD in your calculations
So the %SE in this case is…
1.58%
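The arithmetic behind this figure, using the sample's SD (543 ones and 457 zeros) in place of the population SD:

$$SD \approx \sqrt{(0.543)(0.457)} \approx 0.498, \qquad \%SE_{1000} \approx \frac{0.498}{\sqrt{1000}} \approx 1.58\%$$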
Estimating the Population Percentage
In the previous example, we saw that we expected
the population to be 54.3% Democrats, and we
expected to be off by about 1.6%
The sample percentage is really a random variable
If the sample is large enough, we can assume the sample percentage is
approximately normal and say that the interval 54.3% ± 1.6%
is the 68% confidence interval for the percentage of
Democrats in the population
What would the 95% confidence interval be? 54.3% ± 3.2%
What would the 99.7% confidence interval be? 54.3% ± 4.8%
Estimating the Population Percentage
Warning: If the sample percentage is very
close to 0% or 100%, a very large sample is
needed to use the normal approximation!
One way to check this is to calculate the sample
SD
If the SD is close to 50%, a small sample will allow for a
normal approximation
If the SD is well below 50%, very large samples are
needed to use the normal approximation
Estimating the Population Percentage
At this point, we are tempted to say that there is
a 95% chance that the true percentage of
Democrats falls between 51.1% and 57.5%
This is not so
Remember that the true percentage of Democrats in the
population is determined by the entire population
Sample percentages are random numbers determined by the
people we sample
It is correct to say that if we were to take 100 samples
and calculate 100 different 95% confidence intervals,
then about 95 of them should encompass the true
percentage of Democrats
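This interpretation can be checked by simulation. The sketch below is an illustration under stated assumptions: the true percentage (55%), the sample size, the number of repeated samples, and the seed are all made up, and draws are made with replacement as a stand-in for sampling from a very large population.

```python
import math
import random

rng = random.Random(1)
true_p = 0.55          # hypothetical true fraction of Democrats (an assumption)
n = 1_000              # size of each simple random sample
num_samples = 1_000    # number of repeated surveys

covered = 0
for _ in range(num_samples):
    # Each draw is a 1 (Democrat) with chance true_p, otherwise a 0.
    hits = sum(rng.random() < true_p for _ in range(n))
    p_hat = hits / n
    # The sample SD stands in for the unknown population SD.
    se = math.sqrt(p_hat * (1 - p_hat)) / math.sqrt(n)
    # Does this sample's interval (sample percentage ± 2 SE) contain the truth?
    if p_hat - 2 * se <= true_p <= p_hat + 2 * se:
        covered += 1

coverage = covered / num_samples
print(coverage)
```

The true percentage never moves; the intervals do. Roughly 95% of them should contain it.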
Estimating the Population Percentage
Homer Simpson decides to run against
Mayor Quimby for the leadership of
Springfield
After a grueling campaign, Election Day finally
arrives, and the early exit polls show that Homer
has the votes of 58 out of the 100 people polled
Find the 95% confidence interval for the percentage of
votes for Homer
Can Homer break out the Duff champagne?
58% ± 9.88%, no
Estimating the Population Percentage
Homer Simpson decides to run against
Mayor Quimby for the leadership of
Springfield
A few hours later, the exit polls show a turn for the
worse for Homer, with only 485 out of 1,000
sampled voting for him
Calculate the 95% confidence interval for the
percentage of votes for Homer
Is he out of the running?
48.5% ± 3.16%, no
Estimating the Population Percentage
Homer Simpson decides to run against
Mayor Quimby for the leadership of
Springfield
By midnight, 10,000 votes have been counted,
and Homer has 5,204 of them
One last time, calculate the 95% confidence interval for
the percentage of votes for Homer
Has Quimby’s regime finally been toppled?
52.04% ± 1.0%, it appears so
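The three exit-poll intervals can be recomputed with one small helper. The name `pct_confidence_interval` is an assumption. It uses the sample SD in place of the population SD and takes ± 2 standard errors for 95% confidence, as in the slides.

```python
import math

def pct_confidence_interval(successes, n, num_se=2):
    """Interval (low, high) in percent: sample percentage ± num_se standard errors."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat)) / math.sqrt(n)
    return 100 * (p_hat - num_se * se), 100 * (p_hat + num_se * se)

for votes, polled in ((58, 100), (485, 1_000), (5_204, 10_000)):
    low, high = pct_confidence_interval(votes, polled)
    if low > 50:
        verdict = "interval lies entirely above 50%"
    elif high < 50:
        verdict = "interval lies entirely below 50%"
    else:
        verdict = "interval includes 50%"
    print(f"{votes}/{polled}: {low:.2f}% to {high:.2f}% ({verdict})")
```

Only the last interval lies entirely above 50%, which is why the first two polls settle nothing.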
Estimating the Population Average
Often, we are interested in estimating a
population’s average for some quantity
Height
IQ
Income
As before, to estimate the average, we start
by taking a simple random sample and
calculating its average and SD
Estimating the Population Average
The expected value for the average (mEV) in a
sample size of n is the average of the
corresponding box
$$mEV = \frac{EV_n}{n} = \frac{(\text{Avg})(n)}{n} = \text{Avg}$$
Estimating the Population Average
The standard error for the average (mSEn) in a
sample size of n, if we draw with replacement, is the
standard deviation of the box, divided by the square
root of the number of draws
$$mSE_n = \frac{SE_n}{n} = \frac{(SD)(\sqrt{n})}{n} = \frac{(SD)(\sqrt{n})}{(\sqrt{n})^2} = \frac{SD}{\sqrt{n}}$$
Note that mSEn depends on n, while mEV does not!
Estimating the Population Average
Just as with percentages, we estimate the
standard error for the population’s average by
applying the sample’s SD in place of the
population SD
If our sample is large enough, we can
assume the distribution on the sample
average is approximately normal
This allows us to obtain confidence intervals for
the population average
Estimating the Population Average
(Hypothetical) A large company takes a
simple random sample of 500 employees and
asks them how long they have worked there
The employees averaged 4.2 years with an SD of
1.3 years
Give a 95% confidence interval for the average length of
employment at the company
(4.08 years, 4.32 years)
Estimating the Population Average
(Hypothetical) Out of a county containing a college, a simple
random sample of 1000 people is taken
From the sample, the average level of education (years of school
completed, not counting Kindergarten) is 14 years, with an SD of
2 years
Give an expected value and standard error for the average
educational level of people in the county
14 ± 0.0632 years
Have 68% of the county’s residents completed 14 ± 0.063 years
of schooling?
No, the SD for the population is 2, not 0.0632 years.
On top of this, we can’t determine if the population’s
education level is normally distributed.
Estimating the Population Average
(Hypothetical) As a way of measuring the quality of
English education in one high school, the senior
class is required to take the ACT
The English department wants to see an English
ACT average of 25
Of the class, 125 tests are checked
The average English ACT score was a 25.4, with an SD of
2.2
Can you say with 99.7% confidence that the English
department will get the average it desires?
No, because the 99.7% confidence
interval is (24.81, 25.99), which includes
values below the goal of 25.
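The intervals in the employment and ACT examples can be checked with a short sketch. The helper name `mean_confidence_interval` is an assumption. It takes ± 2 standard errors for 95% confidence and ± 3 for 99.7%, with the sample SD standing in for the population SD.

```python
import math

def mean_confidence_interval(sample_avg, sample_sd, n, num_se):
    """Interval (low, high): sample average ± num_se standard errors."""
    se = sample_sd / math.sqrt(n)
    return sample_avg - num_se * se, sample_avg + num_se * se

# Employment example: 500 employees, average 4.2 years, SD 1.3 years, 95% interval.
employment = mean_confidence_interval(4.2, 1.3, 500, num_se=2)
# ACT example: 125 tests, average 25.4, SD 2.2, 99.7% interval.
act = mean_confidence_interval(25.4, 2.2, 125, num_se=3)

print(tuple(round(x, 2) for x in employment))
print(tuple(round(x, 2) for x in act))
```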
Summary
When trying to determine characteristics of a
large population, researchers use smaller
samples to infer the population
characteristics
Ideal samples are randomly selected
Simple random samples can be modeled with a
box and ticket model
A simple random sample will give an accurate
representation of the population, provided
that it is large enough
Summary
We use a sample percentage/average to
estimate the population percentage/average
We find the standard error by using the
sample SD in place of the population SD
If the sample is large enough, we can
assume the sample percentage/average as a
random variable is approximately normal
We can calculate confidence intervals around our
sample percentage/average to narrow down the
location of the population percentage/average