Transcript Open
Understanding sample survey data
Underlying Concept
A sample statistic is our best estimate of a
population parameter
If we took 100 different samples from the
same population to measure, for example,
the mean height of men, we would get a
100 different estimates of mean height.
The mean of these means would be very
close to the real population mean.
Population of men
(Population Size = N)
Sample of men from population
(Sample size = n)
We take a sample
from the
population and
measure the
heights of all our
sample members.
The mean height
from this sample
is 174.4 cm.
Sample of men from population
(Sample size = n)
We take a
another sample
from the same
population and
measure the
heights of all
sample members.
The mean height
from this sample
is 165.9 cm.
Sample of sample means
We take a another 100 samples from the same population and measure the
heights of all sample members.
Sample 1 mean was 174.4 cm
Sample 2 mean was 165.9 cm
Sample 3 mean was 171.0 cm
Sample 4 mean was 175.2 cm
Sample 5 mean was 162.8 cm
Etc
Etc
Etc
Sample of sample means
We don’t ever take hundreds of samples. We just take 1.
The concept of the mean of sample means is central to all survey
statistics.
The central limit theorem says that if we took a sufficiently large
number of samples, the mean of the sample means would be
normally distributed.
This is true even if the thing we are measuring is not normally
distributed.
The central limit theorem can be proved mathematically.
It is the basis of how we calculate our required sample size and
how we calculate confidence intervals around our
estimates……………….
Variance, standard deviation and
standard error
Variance = the sum of
squared differences
from the mean
divided by n-1
Sample
values (n=5)
Difference from
mean
Squared difference
from the mean
172
171 - 172 = -1
-1 x -1 = 1
169
2
4
168
3
9
175
-4
16
171
0
0
Mean = 171
Sum = 0
Sum of squares = 30
Variance = 30 / 4 = 7.5
Standard deviation = the
square root of the
variance
SD = √ variance
= √7.5 = 2.74
Standard error = the
square root of the
variance divided by
the sample size
SE = √ (variance / n)
= √ (7.5 / 5)
= 1.22
Standard Error
The standard error is our best estimate of
the standard deviation of the sample
means.
In other words if we took 100 samples from
the same population and got 100
estimates of men’s mean height, the
standard deviation of that mean is the
standard error.
Confidence Intervals
Because the means of sample means are normally distributed, we can use the
characteristics of the normal distribution to look at our mean and standard
error.
We know that in a normal distribution 68.3% of values fall within one standard
deviation of the mean and 95% fall within 1.96 standard deviations of the
mean.
So 1.96 times the standard error gives us the 95% confidence limits.
Our standard error is 1.22.
1.96 x 1.22 = 2.4
Our sample mean is 171.0
171.0 – 2.4 = 168.6
171.0 + 2.4 = 173.4
So.. If we took 100 samples, 95 of them would have a mean somewhere
between 168.6 and 173.4.
Or… we can be 95% confident that the true mean (the population mean) lies
between 168.6 and 173.4.
It works the same for proportions
The 95% confidence interval around a proportion is 1.96 times the standard error of the
estimate.
The standard error of a proportion is √ ( (p (100-p)) / n )
Where p is the percentage and n is the sample size.
So if we estimate that 75% of people prefer dogs from a sample of 45, p=75 and n= 45.
= √ (( 75 x (100-75)) / 45 )
= √ ( (75 x 25) / 45 )
= √ ( 1875
/ 45 )
=√
42.7
= 6.5
75 – 6.5 = 68.5 and 75 + 6.5 = 81.5
So.. if we took 100 samples, 95 of them would have a percentage somewhere between
68.5 and 81.5.
Or… we can be 95% confident that the true percentage of people who prefer dogs (the
population percentage) lies between 68.5 and 81.5.
Sample Size
• We can use our understanding of
confidence intervals to decide how big we
need our sample to be
• First we think through
– what inferences about the population we are
going to be making
– what level of uncertainty we can live with
For example
• We decide to conduct a survey to find out how many
people in Scotland believe in the Loch Ness monster
• We define the population and source an appropriate
sampling frame from which we will take a simple
random sample
• We decide we want to be 95% confident that our
estimate will be accurate to 3 percentage points
• The confidence intervals for proportions are widest for a
50% estimate
• We have no good reason to expect the proportion of
people believing in the Loch Ness monster will be much
more or less than 50% so we will use that as our basis
• Excel spreadsheet example
Design effects
• If the sample is not a simple random sample
then an adjustment will need to be made to the
standard error
• Proportionate stratification will decrease the
standard error
• Disproportionate stratification will increase the
standard error
• Clustering will increase the standard error
• See PEAS website for information about design
effects
http://www2.napier.ac.uk/depts/fhls/peas/index.htm
Finite Population Correction
If the sample size is a large proportion of
the population size (>5%) then applying
the finite population correction will reduce
the standard error
Weighting and Grossing Factors
• How many people in the population the
sample respondent represents
• Weights are used to alter proportions (e.g.
to adjust for unequal selection probabilities
or non-response)
• Grossing factors gross up to the total
population number
• Often combined
For example
Achieved
sample
Known
population
Weighting
factor
Men
150
(36%)
4,500
(45%)
45 / 36 =
1.25
4,500 / 150 =
30
Women
270
(64%)
5,400
(55%)
55 / 64 =
0.86
5,400 / 270 =
20
420
9,900
Total
Grossing
factor
9,900 / 420 =
23.6
Weighting &
grossing factor