Transcript Document

STK 4600: Statistical methods for social sciences.
Survey sampling and statistical demography.
Surveys for households and individuals.
Survey sampling: 4 major topics
1. Traditional design-based statistical inference (7 weeks)
2. Likelihood considerations (1 week)
3. Model-based statistical inference (3 weeks)
4. Missing data - nonresponse (2 weeks)
Statistical demography
• Mortality
• Life expectancy
• Population projections
• 2 weeks
Course goals
• Give students knowledge about:
– planning surveys in social sciences
– major sampling designs
– basic concepts and the most important estimation methods in
traditional applied survey sampling
– Likelihood principle and its consequences for survey
sampling
– Use of modeling in sampling
– Treatment of nonresponse
– A basic knowledge of demography
But first: Basic concepts in sampling

Population (target population): the universe of all units of interest for a certain study.
• Denoted U = {1, 2, ..., N}, with N being the size of the population. All units can be identified and labeled.
• Ex: Political poll - all adults eligible to vote
• Ex: Employment/unemployment in Norway - all persons in Norway, age 15 or more
• Ex: Consumer expenditure: unit = household

Sample: a subset of the population, to be observed. The sample should be "representative" of the population.
Sampling design:
• The sample is a probability sample if the units in the sample have been chosen with known probabilities, such that each unit in the population has a positive probability of being chosen for the sample
• We shall only be concerned with probability sampling
• Example: simple random sample (SRS). Let n denote the sample size. Every possible subset of n units has the same chance of being the sample. Then all units in the population have the same probability n/N of being chosen for the sample.
• The probability distribution for SRS over all subsets of U is an example of a sampling design - the probability plan for selecting a sample s from the population:

  p(s) = 1 / C(N,n)  if |s| = n
  p(s) = 0           if |s| ≠ n

where C(N,n) = N!/(n!(N-n)!) is the number of subsets of size n.
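The SRS design above can be checked numerically by brute force on a tiny population. This is a sketch in Python (the course's own examples are in R; the N = 5, n = 2 population here is hypothetical): enumerating all C(N,n) samples with equal probability and summing p(s) over the samples containing a given unit recovers the inclusion probability n/N.

```python
from itertools import combinations
from math import comb

# Hypothetical small population: N = 5 units, SRS of size n = 2.
N, n = 5, 2
U = range(1, N + 1)
samples = list(combinations(U, n))   # all samples s with |s| = n
p = 1 / comb(N, n)                   # p(s) = 1 / C(N,n) for each of them

# Inclusion probability of unit 1: sum of p(s) over samples containing it.
pi_1 = sum(p for s in samples if 1 in s)
print(len(samples))   # C(5,2) = 10
print(pi_1)           # n/N = 0.4
```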
Basic statistical problem: Estimation
• A typical survey has many variables of interest
• Aim of a sample is to obtain information regarding totals or
averages of these variables for the whole population
• Example: Unemployment in Norway - want to estimate the total number t of individuals unemployed. For each person i (at least 15 years old) in Norway:

  y_i = 1 if person i is unemployed, 0 otherwise

Then:

  t = Σ_{i=1}^N y_i
• In general, the variable of interest is y, with y_i equal to the value of y for unit i in the population, and the total is denoted

  t = Σ_{i=1}^N y_i

• The typical problem is to estimate t or t/N
• Sometimes it is also of interest to estimate ratios of totals.
Example - estimating the rate of unemployment:

  y_i = 1 if person i is unemployed, 0 otherwise
  x_i = 1 if person i is in the labor force, 0 otherwise

with totals t_y, t_x. Unemployment rate: t_y / t_x
Sources of error in sample surveys
1. Target population U vs frame population U_F
Access to the population is through a list of units - a register U_F. U and U_F may not be the same.
Three possible errors in U_F:
- Undercoverage: some units in U are not in U_F
- Overcoverage: some units in U_F are not in U
- Duplicate listings: a unit in U is listed more than once in U_F
• U_F is sometimes called the sampling frame
2. Nonresponse - missing data
• Some persons cannot be contacted
• Some refuse to participate in the survey
• Some may be ill and incapable of responding
• In postal surveys: can be as much as 70% nonresponse
• In telephone surveys: 50% nonresponse is not uncommon
• Possible consequences:
- Bias in the sample, which is then not representative of the population
- Estimation becomes more inaccurate
• Remedies:
- imputation, weighting
3. Measurement error - the correct value of y_i is not measured
- In interviewer surveys:
• Incorrect marking
• Interviewer effect: people may say what they think the interviewer wants to hear - underreporting of alcohol use, tobacco use
• Misunderstanding of the question, or not remembering correctly
4. Sampling «error»
- The error (uncertainty, tolerance) caused by observing a sample instead of the whole population
- To assess this error - the margin of error: measure sample-to-sample variation
- The design approach deals with calculating sampling errors for different sampling designs
- One such measure, the 95% confidence interval: if we draw repeated samples, then 95% of the calculated confidence intervals for a total t will actually include t
• The first 3 errors: nonsampling errors
- Can be much larger than the sampling error
• In this course:
- Sampling error
- Nonresponse bias
- We shall assume that the frame population is identical to the target population
- No measurement error
Summary of basic concepts
• Population, target population
• Unit
• Sample
• Sampling design
• Estimation
- estimator
- measure of bias
- measure of variance
- confidence interval
• Survey errors:
- register/frame population
- measurement error
- nonresponse
- sampling error
Example - Psychiatric Morbidity Survey 1993 from Great Britain
• Aim: provide information about the prevalence of psychiatric problems among adults in GB, as well as their associated social disabilities and use of services
• Target population: adults aged 16-64 living in private households
• Sample: through several stages, 18,000 addresses were chosen and 1 adult in each household was chosen
• 200 interviewers, each visiting 90 households
Result of the sampling process

  Sample of addresses                    18,000
    Vacant premises                         927
    Institutions/business premises          573
    Demolished                              499
    Second home/holiday flat                236
  Private household addresses            15,765
    Extra households found                  669
  Total private households               16,434
    Households with no one aged 16-64     3,704
  Eligible households                    12,730
    Nonresponse                           2,622
  Sample                                 10,108
  (households with responding adults aged 16-64)
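The accounting in the table above can be verified step by step; a small Python check (figures taken directly from the table):

```python
# Reproduce the accounting in the sampling-process table above.
addresses = 18000
ineligible = 927 + 573 + 499 + 236      # vacant, institutions, demolished, holiday
private = addresses - ineligible        # private household addresses
total_households = private + 669        # extra households found
eligible = total_households - 3704      # minus households with no one aged 16-64
sample = eligible - 2622                # minus nonresponse
print(private, total_households, eligible, sample)  # 15765 16434 12730 10108
```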
Why sampling?
• reduces costs for an acceptable level of accuracy (money, manpower, processing time, ...)
• may free up resources to reduce nonsampling error and collect more information from each person in the sample
- ex, for a fixed budget:
  400 interviews at $5 per interview: lower sampling error
  200 interviews at $10 per interview: lower nonsampling error
• much quicker results
When is a sample representative?
• Balance on gender and age:
- proportion of women in sample ≈ proportion in population
- proportions of age groups in sample ≈ proportions in population
• An ideal representative sample:
- a miniature version of the population,
- implying that every unit in the sample represents the characteristics of a known number of units in the population
• Appropriate probability sampling ensures a representative sample "on the average"
Alternative approaches for statistical inference
based on survey sampling
• Design-based:
– No modeling, only stochastic element is the
sample s with known distribution
• Model-based: The values yi are assumed to be
values of random variables Yi:
– Two stochastic elements: Y = (Y1, …,YN) and s
– Assumes a parametric distribution for Y
- Example: suppose we have an auxiliary variable x (could be age, gender, or education). A typical model is a regression of Y_i on x_i.
• Statistical principles of inference imply that the model-based approach is the most sound and valid approach
• We start by learning the design-based approach, since it is the most applied approach to survey sampling, used by national statistical institutes and most research institutes for social sciences.
- It is the easy way out: no modeling is needed. All statisticians working with survey sampling in practice need to know this approach
Design-based statistical inference
• Can also be viewed as a distribution-free
nonparametric approach
• The only stochastic element: Sample s, distribution
p(s) for all subsets s of the population U={1, ..., N}
• No explicit statistical modeling is done for the
variable y. All yi’s are considered fixed but
unknown
• Focus on sampling error
• Sets the sample survey theory apart from usual
statistical analysis
• The traditional approach, started by Neyman in 1934
Estimation theory - simple random sample

SRS of size n: each sample s of size n has p(s) = 1 / C(N,n).
Can be performed in principle by drawing one unit at a time, at random, without replacement.

Estimation of the population mean of a variable y:

  μ = Σ_{i=1}^N y_i / N

A natural estimator - the sample mean:

  ȳ_s = Σ_{i∈s} y_i / n

Desirable properties:
(I) Unbiasedness: an estimator μ̂ is unbiased if E(μ̂) = μ.
ȳ_s is unbiased under the SRS design.
The uncertainty of an unbiased estimator is measured by its estimated sampling variance or standard error (SE):

  Var(μ̂) = E(μ̂ - μ)², if E(μ̂) = μ
  V̂(μ̂) is an (unbiased) estimate of Var(μ̂)
  SE(μ̂) = sqrt(V̂(μ̂))

Some results for SRS:
(1) Let π_i be the probability that unit i is in the sample. Then π_i = n/N = f, the sampling fraction.
(2) E(ȳ_s) = μ
(3) Let σ² be the population variance:

  σ² = Σ_{i=1}^N (y_i - μ)² / (N - 1)

Then:

  Var(ȳ_s) = (1 - f) σ²/n

Here the factor (1 - f) is called the finite population correction
• usually unimportant in social surveys:
  n = 10,000 and N = 5,000,000: 1 - f = 0.998
  n = 1000 and N = 400,000: 1 - f = 0.9975
  n = 1000 and N = 5,000,000: 1 - f = 0.9998
• the effect of changing n is much more important than the effect of changing n/N
An unbiased estimator of σ² is given by the sample variance

  s² = Σ_{i∈s} (y_i - ȳ_s)² / (n - 1)

The estimated variance:

  V̂(ȳ_s) = (1 - f) s²/n

Usually we report the standard error of the estimate:

  SE(ȳ_s) = sqrt(V̂(ȳ_s))

Confidence intervals for μ are based on the Central Limit Theorem:

  For large n, N - n:  Z = (ȳ_s - μ) / (σ sqrt((1 - f)/n)) ~ N(0,1)

Approximate 95% CI for μ:

  (ȳ_s - 1.96 SE(ȳ_s), ȳ_s + 1.96 SE(ȳ_s)) = ȳ_s ± 1.96 SE(ȳ_s)
Example – Student performance in California
schools
• Academic Performance Index (API) for all California
schools
• Based on standardized testing of students
• Data from all schools with at least 100 students
• Unit in population = school (Elementary/Middle/High)
• Full population consists of N = 6194 observations
• Concentrate on the variable y = api00 = API in 2000
• mean(y) = 664.7, with min(y) = 346 and max(y) = 969
• Data set in R: apipop, and y = apipop$api00
[Figure: histogram of y for the population, with fitted normal density; y ranges from about 350 to 1000]
Histogram for sample mean and fitted normal density
y = api scores from 2000. Sample size n = 10, based on 10,000 simulations

R-code:
> b=10000
> N=6194
> n=10
> ybar=numeric(b)
> for (k in 1:b){
+ s=sample(1:N,n)
+ ybar[k]=mean(y[s])
+ }
> hist(ybar,seq(min(ybar)-5,max(ybar)+5,5),prob=TRUE)
> x=seq(mean(ybar)-4*sqrt(var(ybar)),mean(ybar)+4*sqrt(var(ybar)),0.05)
> z=dnorm(x,mean(ybar),sqrt(var(ybar)))
> lines(x,z)
[Figure: histogram of ybar with fitted normal density; api scores, sample size n = 10, based on 10,000 simulations; ybar ranges from about 500 to 800]
y = api00 for 6194 California schools
10,000 simulations of SRS. Confidence level of the approximate 95% CI

  n      Conf. level
  10     0.915
  30     0.940
  50     0.943
  100    0.947
  1000   0.949
  2000   0.951
For one sample of size n = 100:

  ȳ_s = 654.5
  SE(ȳ_s) = 12.6

Approximate 95% CI for the population mean:

  654.5 ± 1.96 × 12.6 = 654.5 ± 24.7 = (629.8, 679.2)
R-code:
>s=sample(1:6194,100)
> ybar=mean(y[s])
> se=sqrt(var(y[s])*(6194-100)/(6194*100))
> ybar
[1] 654.47
> var(y[s])
[1] 16179.28
> se
[1] 12.61668
The absolute value of the sampling error is not informative when not related to the value of the estimate.
For example, SE = 2 is small if the estimate is 1000, but very large if the estimate is 3.

The coefficient of variation for the estimate:

  CV(ȳ_s) = SE(ȳ_s) / ȳ_s

In the example: CV(ȳ_s) = 12.6 / 654.5 = 0.019 = 1.9%

• A measure of the relative variability of an estimate.
• It does not depend on the unit of measurement.
• More stable over repeated surveys; can be used for planning, for example for determining sample size.
• More meaningful when estimating proportions.
Estimation of a population proportion p with a certain characteristic A

p = (number of units in the population with A) / N
Let y_i = 1 if unit i has characteristic A, 0 otherwise.
Then p is the population mean of the y_i's.
Let X be the number of units in the sample with characteristic A. Then the sample mean can be expressed as

  p̂ = ȳ_s = X/n
Then, under SRS:

  E(p̂) = p

and

  Var(p̂) = (p(1 - p)/n) (1 - (n - 1)/(N - 1))

since the population variance equals σ² = N p(1 - p)/(N - 1).

The sample variance is s² = n p̂(1 - p̂)/(n - 1), so the unbiased estimate of the variance of the estimator is:

  V̂(p̂) = (p̂(1 - p̂)/(n - 1)) (1 - n/N)
Examples
A political poll: suppose we have a random sample of 1000 eligible voters in Norway, with 280 saying they will vote for the Labor party. Then the estimated proportion of Labor votes in Norway is given by:

  p̂ = 280/1000 = 0.28

  SE(p̂) = sqrt((p̂(1 - p̂)/(n - 1)) (1 - n/N)) ≈ sqrt(0.28 × 0.72 / 999) ≈ 0.0142

A confidence interval requires the normal approximation. Can use the guideline from the binomial distribution, when N - n is large: np ≥ 5 and n(1 - p) ≥ 5.
In this example: n = 1000 and N = 4,000,000.

  Approximate 95% CI: p̂ ± 1.96 SE(p̂) = 0.280 ± 0.028 = (0.252, 0.308)

Ex: Psychiatric Morbidity Survey 1993 from Great Britain
p = proportion with psychiatric problems
n = 9792 (partial nonresponse on this question: 316)
N ≈ 40,000,000

  p̂ = 0.14
  SE(p̂) = sqrt((1 - 0.00024) × 0.14 × 0.86 / 9791) ≈ 0.0035
  95% CI: 0.14 ± 1.96 × 0.0035 = 0.14 ± 0.007 = (0.133, 0.147)
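The political-poll computation above can be checked numerically; a short Python sketch (the course's examples are in R; Python is used here for illustration only):

```python
from math import sqrt

# The poll above: n = 1000 voters, 280 for Labor, N = 4,000,000.
n, N = 1000, 4_000_000
p_hat = 280 / n                                         # 0.28
se = sqrt((p_hat * (1 - p_hat) / (n - 1)) * (1 - n / N))
ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)
print(round(se, 4))                       # 0.0142
print(tuple(round(c, 3) for c in ci))     # (0.252, 0.308)
```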
General probability sampling
• Sampling design: p(s) - known probability of selection for each subset s of the population U
• Actually: the sampling design is the probability distribution p(·) over all subsets of U
• Typically, for most s: p(s) = 0. In SRS of size n, all s with size different from n have p(s) = 0.
• The inclusion probability:

  π_i = P(unit i is in the sample) = P(i ∈ s) = Σ_{s: i∈s} p(s)
Illustration
U = {1,2,3,4}
Sample of size 2; 6 possible samples
Sampling design:
p({1,2}) = 1/2, p({2,3}) = 1/4, p({3,4}) = 1/8, p({1,4}) = 1/8
The inclusion probabilities:

  π_1 = Σ_{s: 1∈s} p(s) = p({1,2}) + p({1,4}) = 5/8
  π_2 = Σ_{s: 2∈s} p(s) = p({1,2}) + p({2,3}) = 3/4 = 6/8
  π_3 = Σ_{s: 3∈s} p(s) = p({2,3}) + p({3,4}) = 3/8
  π_4 = Σ_{s: 4∈s} p(s) = p({3,4}) + p({1,4}) = 2/8
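The inclusion probabilities in the illustration above can be computed mechanically from the design; a Python sketch (illustrative only, using the same four-sample design):

```python
# The design above: p(s) over subsets of U = {1,2,3,4}.
design = {frozenset({1, 2}): 1/2, frozenset({2, 3}): 1/4,
          frozenset({3, 4}): 1/8, frozenset({1, 4}): 1/8}

def inclusion_prob(i, design):
    """pi_i = sum of p(s) over all samples s containing unit i."""
    return sum(p for s, p in design.items() if i in s)

pis = [inclusion_prob(i, design) for i in (1, 2, 3, 4)]
print(pis)        # [0.625, 0.75, 0.375, 0.25] = 5/8, 6/8, 3/8, 2/8
print(sum(pis))   # 2.0 = the fixed sample size n
```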
Some results
(I) π_1 + π_2 + ... + π_N = E(n); n is the sample size
(II) If the sample size is fixed at n in advance: π_1 + π_2 + ... + π_N = n

Proof:
Let Z_i = 1 if unit i is included in the sample, 0 otherwise. Then

  π_i = P(Z_i = 1) = E(Z_i)
  n = Σ_{i=1}^N Z_i  ⇒  E(n) = Σ_{i=1}^N E(Z_i) = Σ_{i=1}^N π_i
Estimation theory - probability sampling in general

Problem: estimate a population quantity for the variable y.
For the sake of illustration: the population total t = Σ_{i=1}^N y_i.
An estimator of t based on the sample: t̂.

  Expected value: E(t̂) = Σ_s t̂(s) p(s)
  Variance: Var(t̂) = E[t̂ - E(t̂)]² = Σ_s [t̂(s) - E(t̂)]² p(s)
  Bias: E(t̂) - t
  t̂ is unbiased if E(t̂) = t
Let V̂(t̂) be an (unbiased, if possible) estimate of Var(t̂).

  The standard error of t̂: SE(t̂) = sqrt(V̂(t̂))
  Coefficient of variation of t̂: CV(t̂) = SE(t̂) / t̂

CV is a useful measure of uncertainty, especially when the standard error increases as the estimate increases.

  Margin of error: 2 SE(t̂)

because typically we have that

  P(t̂ - 2 SE(t̂) ≤ t ≤ t̂ + 2 SE(t̂)) ≈ 0.95 for large n, N - n

Since t̂ is approximately normally distributed for large n, N - n:

  t̂ ± 2 SE(t̂) is approximately a 95% CI
Some peculiarities in the estimation theory
Example: N = 3, n = 2, simple random sample

  s_1 = {1,2}, s_2 = {1,3}, s_3 = {2,3}
  p(s_k) = 1/3 for k = 1,2,3

Let t̂_1 = 3 ȳ_s, which is unbiased.
Let t̂_2 be given by:

  t̂_2(s_1) = 3 (y_1 + y_2)/2 = t̂_1(s_1)
  t̂_2(s_2) = 3 ((1/2) y_1 + (2/3) y_3) = t̂_1(s_2) + (1/2) y_3
  t̂_2(s_3) = 3 ((1/2) y_2 + (1/3) y_3) = t̂_1(s_3) - (1/2) y_3
Also t̂_2 is unbiased:

  E(t̂_2) = Σ_s t̂_2(s) p(s) = (1/3) Σ_{k=1}^3 t̂_2(s_k) = (1/3) · 3t = t

  Var(t̂_1) - Var(t̂_2) = (1/6) y_3 (3y_2 - 3y_1 - y_3)

⇒ Var(t̂_1) > Var(t̂_2) if y_3 > 0 and 3y_2 - 3y_1 > y_3

If the y_i are 0/1 variables, this happens when y_1 = 0, y_2 = y_3 = 1.
For this set of values of the y_i's:

  t̂_1(s_1) = 1.5, t̂_1(s_2) = 1.5, t̂_1(s_3) = 3: never correct
  t̂_2(s_1) = 1.5, t̂_2(s_2) = 2, t̂_2(s_3) = 2.5

t̂_2 clearly has less variability than t̂_1 for these y-values.
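The example above can be verified by direct enumeration over the three equally likely samples; a Python sketch with y = (0, 1, 1):

```python
from itertools import combinations

# The example above: N = 3, n = 2 SRS, y = (0, 1, 1), so t = 2.
y = {1: 0, 2: 1, 3: 1}
t = sum(y.values())
samples = list(combinations([1, 2, 3], 2))   # each with p(s) = 1/3

def t1(s):                                   # t1 = 3 * sample mean
    return 3 * sum(y[i] for i in s) / 2

def t2(s):                                   # modified estimator from the slide
    if s == (1, 2): return t1(s)
    if s == (1, 3): return t1(s) + y[3] / 2
    return t1(s) - y[3] / 2                  # s = (2, 3)

e1 = sum(t1(s) for s in samples) / 3         # both unbiased: equal t = 2
e2 = sum(t2(s) for s in samples) / 3
v1 = sum((t1(s) - t)**2 for s in samples) / 3
v2 = sum((t2(s) - t)**2 for s in samples) / 3
print(e1, e2)     # 2.0 2.0
print(v1 > v2)    # True: t2 beats t1 for these y-values
```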
Let y be the population vector of the y-values. This example shows that N ȳ_s is not uniformly best (minimum variance for all y) among linear design-unbiased estimators.

The example shows that the "usual" basic estimators do not have the same properties in design-based survey sampling as they do in ordinary statistical models.

In fact, we have the following much stronger result:

Theorem: Let p(·) be any sampling design. Assume each y_i can take at least two values. Then there exists no uniformly best design-unbiased estimator of the total t.
Proof:
Let t̂ be unbiased, and let y⁰ be one possible value of y. Then there exists an unbiased t̂_0 with Var(t̂_0) = 0 when y = y⁰:

  t̂_0(s, y) = t̂(s, y) - t̂(s, y⁰) + t_0, where t_0 is the total for y⁰

1) t̂_0 is unbiased: E(t̂_0) = t - Σ_s t̂(s, y⁰) p(s) + t_0 = t - t_0 + t_0 = t
2) When y = y⁰: t̂_0 = t_0 for all samples s ⇒ Var(t̂_0) = 0

This implies that a uniformly best unbiased estimator must have variance equal to 0 for all values of y, which is impossible.
Determining sample size
• The sample size has a decisive effect on the cost of the survey
• How large n should be depends on the purpose of the survey
• In a poll for determining voting preference, n = 1000 is typically enough
• In the quarterly labor force survey in Norway, n = 24,000

Mainly three factors to consider:
1. Desired accuracy of the estimates for many variables. Focus on one or two variables of primary interest
2. Homogeneity of the population. Smaller samples are needed if there is little variation in the population
3. Estimation for subgroups, domains, of the population
It is often factor 3 that puts the highest demand on the survey.
• If we want to estimate totals for domains of the population, we should take a stratified sample
• A sample from each domain
• A stratified random sample: from each domain, a simple random sample

H strata that constitute the whole population:

  Sample sizes: n_1, n_2, ..., n_H
  Total sample size: n = n_1 + n_2 + ... + n_H

Must determine each n_h.
Assume the problem is to estimate a population proportion p for a certain stratum, and we use the sample proportion from the stratum to estimate p.
Let n be the sample size of this stratum, and assume that n/N is negligible.
Desired accuracy for this stratum: the 95% CI for p should be ± 5%.

  95% CI for p: p̂ ± 1.96 sqrt(p̂(1 - p̂)/n)

The accuracy requirement:

  1.96 sqrt(p̂(1 - p̂)/n) ≤ 0.05 = 1/20
  ⇒ n ≥ 1.96² · 20² · p̂(1 - p̂), which is at most 384 (worst case p̂ = 0.5)
The estimate is unknown in the planning phase.
Use the conservative size 384, or a planning value p_0 with n = 1536 p_0(1 - p_0).
E.g., with p_0 = 0.2: n = 246.

In general, with accuracy requirement d (95% CI = p̂ ± d):

  n ≥ 3.84 p_0(1 - p_0) / d²

Alternative accuracy requirement: the length of the 95% CI is proportional to p̂ (when p̂ ≤ 0.5; otherwise estimate 1 - p):

  1.96 sqrt(p̂(1 - p̂)/n) ≤ d · p̂  ⇔  CV(p̂) ≤ d / 1.96 = e
  SE(p̂)/p̂ ≤ e  ⇒  n ≥ (1/e²) · (1 - p̂)/p̂

  Planning value p_0:  n ≥ (1/e²) · (1 - p_0)/p_0

With e = 0.1 we require approximately that:

  when p_0 = 0.5: 95% CI = p̂ ± 0.10 and n = 100
  when p_0 = 0.1: 95% CI = p̂ ± 0.02 and n = 900
Example: monthly unemployment rate
Important to detect changes in unemployment rates from month to month.
Planning value p_0 = 0.05.
Desired accuracy:

  1.96 SE(p̂) ≤ d  ⇒  n ≥ 3.84 p_0(1 - p_0)/d² = 0.1824/d²

  d = 0.001 (margin of error ± 0.1%)  ⇒  n = 182,400
  d = 0.002  ⇒  n = 45,600
  d = 0.005  ⇒  n ≈ 7300

Note: d = 0.005 ⇒ CV(p̂) = 0.00255/0.05 = 0.051 ≈ 5%
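The sample-size rule above is easy to turn into a small calculator; a Python sketch using the slide's rounded constant 3.84 ≈ 1.96² (n/N assumed negligible):

```python
def n_required(p0, d):
    """Slide's rule n = 3.84 * p0 * (1 - p0) / d**2 (n/N negligible)."""
    return 3.84 * p0 * (1 - p0) / d**2

# Monthly unemployment example: planning value p0 = 0.05.
for d in (0.001, 0.002, 0.005):
    print(d, round(n_required(0.05, d)))   # 182400, 45600, 7296 (~7300)
```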
Two basic estimators:
- Ratio estimator
- Horvitz-Thompson estimator

• Ratio estimator in simple random samples
• H-T estimator for unequal probability sampling: the inclusion probabilities are unequal
• The goal is to estimate a population total t for a variable y
Ratio estimator
Suppose we have known auxiliary information for the whole population:

  x = (x_1, x_2, ..., x_N)

Ex: age, gender, education, employment status.
Let X = Σ_{i=1}^N x_i.
The ratio estimator for the y-total t:

  t̂_R = X · (Σ_{i∈s} y_i) / (Σ_{i∈s} x_i) = X · ȳ_s / x̄_s
We can express the ratio estimator in the following form:

  t̂_R = (X / (N x̄_s)) · (N ȳ_s)

It adjusts the usual "sample mean estimator" in the cases where the x-values in the sample are too small or too large. This is reasonable if there is a positive correlation between x and y.

Example: university of 4000 students, SRS of 400.
Estimate the total number t of women who are planning a career in teaching, t = Np, where p is the proportion.
y_i = 1 if student i is a woman planning to be a teacher; t is the y-total.
Results: 84 out of the 240 women in the sample plan to be a teacher.

  p̂ = 84/400 = 0.21
  t̂ = N p̂ = 840

HOWEVER: it was noticed that the university has 2700 women (67.5%), while in the sample we had 60% women.
A better estimate, which corrects for the underrepresentation of women, is obtained by the ratio estimate using the auxiliary variable x = 1 if the student is a woman:

  t̂_R = (2700 / (4000 × 0.6)) × 840 = 945
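The correction above can be checked in a couple of lines; a Python sketch with the numbers from the university example:

```python
# The university example above: N = 4000 students, SRS of n = 400.
N = 4000
t_hat = N * 84 / 400            # sample-mean estimate: 840
X = 2700                        # known x-total: number of women
x_bar_s = 0.6                   # proportion of women in the sample
t_hat_R = (X / (N * x_bar_s)) * t_hat
print(t_hat, t_hat_R)           # 840.0 945.0
```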
In business surveys it is very common to use a ratio estimator.
Ex: y_i = amount spent on health insurance by business i
    x_i = number of employees in business i

We shall now compare the ratio estimator with the sample-mean-based estimator. We need to derive the expectation and variance of the ratio estimator.
First: must define the population covariance

  σ_xy = Σ_{i=1}^N (x_i - μ_x)(y_i - μ_y) / (N - 1)

where μ_x, μ_y are the population means of the x and y variables, and

  σ_y² = Σ_{i=1}^N (y_i - μ_y)² / (N - 1)
  σ_x² = Σ_{i=1}^N (x_i - μ_x)² / (N - 1)

The population correlation coefficient:

  ρ_xy = σ_xy / (σ_x σ_y)
Let R = Σ_{i=1}^N y_i / Σ_{i=1}^N x_i = t/X and R̂ = (N ȳ_s)/(N x̄_s) = ȳ_s / x̄_s.

(I) Bias: E(t̂_R - t) = -Cov(R̂, N x̄_s)

Proof:

  t̂_R - t = (N ȳ_s) X / (N x̄_s) - t = N ȳ_s (1 - (N x̄_s - X)/(N x̄_s)) - t
  ⇒ E(t̂_R - t) = -E[(ȳ_s / x̄_s)(N x̄_s - X)] = -Cov(R̂, N x̄_s)

using E(N ȳ_s) = t and E(N x̄_s) = X.
It follows that

  |Bias(t̂_R)| / sqrt(Var(t̂_R)) = |Cov(R̂, N x̄_s)| / sqrt(Var(t̂_R))
    = |Corr(R̂, N x̄_s)| sqrt(Var(R̂) Var(N x̄_s)) / sqrt(Var(t̂_R))
    ≈ |Corr(R̂, N x̄_s)| · CV(x̄_s)   (using Var(t̂_R) ≈ X² Var(R̂))
    ≤ CV(x̄_s)

Hence, in SRS, the absolute bias of the ratio estimator is small relative to the true SE of the estimator if the coefficient of variation of the x-sample mean is small. This is certainly true for large n.
(II) E(t̂_R) ≈ t for large n.

(III) Var(t̂_R) ≈ N² ((1 - f)/n) (σ_y² - 2R σ_xy + R² σ_x²)
             = N² ((1 - f)/n) (1/(N - 1)) Σ_{i=1}^N (y_i - R x_i)²
Note: the ratio estimator is very precise when the population points (y_i, x_i) lie close to a straight line through the origin with slope R.
The regression model generates the ratio estimator.
Comparing

  Var(t̂_R) ≈ N² ((1 - f)/n) (1/(N - 1)) Σ_{i=1}^N (y_i - R x_i)²

with

  Var(N ȳ_s) = N² ((1 - f)/n) (1/(N - 1)) Σ_{i=1}^N (y_i - μ_y)²

we get:

  Var(t̂_R) ≤ Var(N ȳ_s)  ⇔  Σ_{i=1}^N (y_i - R x_i)² ≤ Σ_{i=1}^N (y_i - μ_y)²

The ratio estimator is more accurate if R x_i predicts y_i better than μ_y does.
Estimated variance for the ratio estimator

Estimate Σ_{i=1}^N (y_i - R x_i)² / (N - 1) by Σ_{i∈s} (y_i - R̂ x_i)² / (n - 1):

  V̂(t̂_R) = (x̄ / x̄_s)² N² ((1 - f)/n) (1/(n - 1)) Σ_{i∈s} (y_i - R̂ x_i)²

where x̄ = X/N.

Note: if x̄_s is very small, then R̂ is more uncertain, and the variance estimate becomes larger to reflect that.
For large n, N - n, approximate normality holds, and an approximate 95% confidence interval is given by

  t̂_R ± 1.96 (X / x̄_s) sqrt(((1 - f)/n) (1/(n - 1)) Σ_{i∈s} (y_i - R̂ x_i)²)
R computation of the estimate, variance estimate and confidence interval:

> y=apipop$api00
> x=apipop$col.grad  # col.grad = percent of parents with college degree
> # calculating the ratio estimator:
> s=c(20,2000,3900,5000)
> N=6194
> n=4
> r=mean(y[s])/mean(x[s])
> # ratio estimate of t/N:
> muhatr=r*mean(x)
> muhatr
[1] 542.3055
> # variance estimate:
> ssqr=(1/(n-1))*sum((y[s]-r*x[s])^2)
> varestr=(mean(x)/mean(x[s]))^2*(1-n/N)*ssqr/n
> ser=sqrt(varestr)
> ser
[1] 63.85705
> # confidence interval:
> CI=muhatr+qnorm(c(0.025,0.975))*ser
> CI
[1] 417.1479 667.4630
y = api00 for 6194 California schools
10,000 simulations of SRS. Confidence level of the approximate 95% CI

  n      Conf. level
  10     0.927
  30     0.946
  50     0.946
  100    0.947
  1000   0.947
  2000   0.948
R-code for simulations to estimate the true confidence level of the 95% CI, based on the ratio estimator:

> simtratio=function(b,n,N)
+ {
+ muhatr=numeric(b)
+ se=numeric(b)
+ r=numeric(b)
+ ssqr=numeric(b)
+ for (k in 1:b){
+ s=sample(1:N,n)
+ r[k]=mean(y[s])/mean(x[s])
+ muhatr[k]=r[k]*mean(x)
+ ssqr[k]=(1/(n-1))*sum((y[s]-r[k]*x[s])^2)
+ se[k]=sqrt((mean(x)/mean(x[s]))^2*(1-n/N)*ssqr[k]/n)
+ }
+ (sum(mean(y)<muhatr+1.96*se)-sum(mean(y)<muhatr-1.96*se))/b
+ }
Unequal probability sampling
Inclusion probabilities:

  π_i = P(i ∈ s) > 0 for all i = 1, ..., N

Example - Psychiatric Morbidity Survey: one individual selected from each household:

  π_i = 1/M_i

where M_i = number of adults 16-64 in the household that individual i belongs to.
Horvitz-Thompson estimator - unequal probability sampling

  π_i = P(i ∈ s) > 0 for all i = 1, ..., N

Let's try to use N ȳ_s. Let Z_i = 1 if i ∈ s, 0 otherwise, with E(Z_i) = π_i. Then

  E(N ȳ_s) = (N/n) E(Σ_{i=1}^N y_i Z_i) = (N/n) Σ_{i=1}^N y_i π_i ≠ t

so it is not unbiased. The bias is large if the inclusion probabilities tend to increase or decrease systematically with y_i.
Use weighting to correct for the bias:

  t̂ = Σ_{i∈s} w_i y_i, where w_i does not depend on s

  E(t̂) = E(Σ_{i=1}^N w_i y_i Z_i) = Σ_{i=1}^N w_i π_i y_i

t̂ is unbiased for all possible values y_i if and only if w_i = 1/π_i:

  t̂_HT = Σ_{i∈s} y_i / π_i

In SRS, π_i = n/N and t̂_HT = N ȳ_s.
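The H-T estimator above is one line of code; a minimal Python sketch (the y-values and design here are hypothetical), which also checks that under SRS it reduces to N ȳ_s:

```python
# Horvitz-Thompson estimator: t_HT = sum over the sample of y_i / pi_i.
def horvitz_thompson(sample, y, pi):
    """sample: unit labels; y, pi: dicts mapping unit -> value / inclusion prob."""
    return sum(y[i] / pi[i] for i in sample)

# Check under SRS: pi_i = n/N, so t_HT reduces to N * ybar_s.
y = {1: 2.0, 2: 4.0, 3: 6.0, 4: 8.0}
N, n = 4, 2
pi = {i: n / N for i in y}
print(horvitz_thompson([1, 3], y, pi))   # (2 + 6) / 0.5 = 16.0 = N * ybar_s
```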
a) Var(t̂_HT) = Σ_{i=1}^N ((1 - π_i)/π_i) y_i² + 2 Σ_{i=1}^N Σ_{j>i} ((π_ij - π_i π_j)/(π_i π_j)) y_i y_j

If |s| = n, then

b) Var(t̂_HT) = Σ_{i=1}^N Σ_{j>i} (π_i π_j - π_ij) (y_i/π_i - y_j/π_j)²

where π_ij = P(i, j ∈ s) = P(Z_i Z_j = 1).

The Horvitz-Thompson estimator is widely used, e.g., in official statistics.
Note that the variance is small if we determine the inclusion probabilities such that the y_i/π_i are approximately equal, i.e., π_i increases with increasing y_i.

Of course, we do not know the value of y_i when planning the survey; use a known auxiliary x_i and choose

  π_i ∝ x_i  ⇒  π_i = n x_i / X

since Σ_{i=1}^N π_i = n.
If y_i and π_i are unrelated or negatively "correlated", Var(t̂_HT) can be enormous, and one should not use the H-T estimator, even though the π_i's are unequal.

Example: population of 3 elephants, to be shipped. We need an estimate of the total weight.
• Weighing an elephant is no simple matter. The owner wants to estimate the total weight by weighing just one elephant.
• He knows from earlier that elephant 2 has a weight y_2 close to the average weight, and wants to use this elephant and take 3 y_2 as the estimate.
• However: to get an unbiased estimator, all inclusion probabilities must be positive.
• Sampling design: |s| = 1 with π_2 = 0.90, π_1 = π_3 = 0.05
• The weights: 1, 2, 4 tons; total t = 7 tons
• H-T estimator: t̂_HT = y_i/π_i if s = {i}:

  t̂_HT = 20   if s = {1}
  t̂_HT = 2.22 if s = {2}
  t̂_HT = 80   if s = {3}

Hopeless! Always far from the true total of 7. It cannot be used, even though E(t̂_HT) = 7 = t.
Problem:

  Var(t̂_HT) = (20 - 7)² × 0.05 + (2.22 - 7)² × 0.90 + (80 - 7)² × 0.05 ≈ 295.46

  True SE(t̂_HT) = sqrt(Var(t̂_HT)) ≈ 17.2 !!!

The planned estimator, even though not based on an SRS:

  t̂_eleph = 3 ȳ_s = 3 y_i if s = {i}

Possible values: 3, 6, 12.
  E(t̂_eleph) = 6.15

Not unbiased, but look at:

  Var(t̂_eleph) = 2.2275, so SE(t̂_eleph) ≈ 1.49
  MSE(t̂_eleph) = E(t̂_eleph - t)² = Bias² + Var(t̂_eleph) = 2.95
  sqrt(MSE(t̂_eleph)) ≈ 1.72

t̂_eleph is clearly preferable to t̂_HT.
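The elephant comparison above can be reproduced by summing over the three one-element samples; a Python sketch (computing root-MSE for both estimators; for the unbiased t̂_HT this equals its SE):

```python
from math import sqrt

# The elephant example: weights y = (1, 2, 4), t = 7, |s| = 1.
y = {1: 1.0, 2: 2.0, 3: 4.0}
pi = {1: 0.05, 2: 0.90, 3: 0.05}
t = sum(y.values())

ht = {i: y[i] / pi[i] for i in y}                 # H-T: 20, 2.22..., 80
mse_ht = sum(pi[i] * (ht[i] - t)**2 for i in y)   # = Var, since unbiased
eleph = {i: 3 * y[i] for i in y}                  # planned estimator 3*y_i
mse_el = sum(pi[i] * (eleph[i] - t)**2 for i in y)
print(round(sqrt(mse_ht), 1))   # 17.2
print(round(sqrt(mse_el), 2))   # 1.72
```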
Variance estimate for the H-T estimator
Assume the size of the sample is determined in advance to be n.
An unbiased estimator of Var(t̂_HT), provided all joint inclusion probabilities π_ij > 0:

  V̂(t̂_HT) = Σ_{i∈s} Σ_{j∈s, j>i} ((π_i π_j - π_ij)/π_ij) (y_i/π_i - y_j/π_j)²

Approximate 95% CI, for large n, N - n:

  t̂_HT ± 1.96 sqrt(V̂(t̂_HT))

• We can always compute this variance estimate, since necessarily π_ij > 0 for all i, j in the sample s
• But: if not all π_ij > 0, we should not use this estimate! It can give very incorrect estimates
• The variance estimate can be negative, but for most sampling designs it is always positive
A modified H-T estimator
Consider first estimating the population mean μ_y = t/N.
An obvious choice: ŷ_HT = t̂_HT / N.
Alternative: estimate N as well, whether N is known or not:

  N̂ = Σ_{i∈s} 1/π_i   (the H-T estimator with y_i = 1 for all i)

  E(N̂) = E(Σ_{i=1}^N Z_i/π_i) = Σ_{i=1}^N π_i/π_i = N

For SRS, π_i = n/N and N̂ = Σ_{i∈s} N/n = N.

  ŷ_w = t̂_HT / N̂ = (Σ_{i∈s} y_i/π_i) / (Σ_{i∈s} 1/π_i)  and  t̂_w = N ŷ_w

Interestingly, t̂_w is often better than t̂_HT, even though it is only approximately unbiased. It usually has smaller variance. So t̂_w is ordinarily the estimator to use, whether N is known or not. We note that it is a ratio estimator.

Illustration: y_i = c for all i = 1, ..., N. Then

  t̂_HT = c Σ_{i∈s} 1/π_i = c N̂
  while t̂_w = Nc = t, a better estimate if Var(N̂) > 0
If the sample size varies, then the "ratio" estimator performs better than the H-T estimator; the ratio is more stable than the numerator.

Example: y_i = c for i = 1, ..., N.
Sampling design = Bernoulli sampling: each unit in the population is selected with probability π, independently.
The Z_i's are i.i.d. with π_i = P(Z_i = 1) = π.
n is a stochastic variable with a binomial (N, π) distribution, E(n) = Nπ.
  t̂_HT = (n/π) c   (E(t̂_HT) = (Nπ/π) c = Nc = t)

  t̂_w = N (nc/π) / (n/π) = Nc = t

The H-T estimator varies because n varies, while the modified H-T is perfectly stable.
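The Bernoulli-sampling contrast above can be seen in a short simulation; a Python sketch (the population size, constant c, and π are hypothetical):

```python
import random

# Bernoulli sampling with y_i = c: compare t_HT = n*c/pi with t_w = N*c.
random.seed(1)
N, c, pi = 1000, 5.0, 0.1
t = N * c

def draw():
    n = sum(random.random() < pi for _ in range(N))   # n ~ binomial(N, pi)
    return n * c / pi, N * c                          # (t_HT, t_w)

sims = [draw() for _ in range(2000)]
var_ht = sum((a - t)**2 for a, _ in sims) / len(sims)
var_w = sum((b - t)**2 for _, b in sims) / len(sims)
print(var_ht > 0, var_w == 0)   # True True: t_w is perfectly stable here
```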
Review of Advantages of Probability
Sampling
• Objective basis for inference
• Permits unbiased or approximately unbiased
estimation
• Permits estimation of sampling errors of
estimators
– Use central limit theorem for confidence interval
– Can choose n to reduce SE or CV for estimator
Outstanding issues in design-based inference
• Estimation for subpopulations, domains
• Choice of sampling design –
– discuss several different sampling designs
– appropriate estimators
• More on use of auxiliary information to improve
estimates
• More on variance estimation
Estimation for domains
• Domain (subpopulation): a subset of the population of interest
• Ex: population = all adults aged 16-64. Examples of domains:
- women
- adults aged 35-39
- men aged 25-29
- women of a certain ethnic group
- adults living in a certain city
• Partition population U into D disjoint domains U_1, ..., U_d, ..., U_D of sizes N_1, ..., N_d, ..., N_D
Estimating domain means
Simple random sample from the population.
True domain mean:

  μ_d = Σ_{i∈U_d} y_i / N_d

• e.g., the proportion of divorced women with psychiatric problems.
Estimate μ_d by the sample mean from U_d:

  s_d = the part of the sample s in U_d, with n_d = |s_d|
  ȳ_sd = Σ_{i∈s_d} y_i / n_d

Note: n_d is a random variable.
The estimator is a ratio estimator. Define

  u_i = y_i if i ∈ U_d, 0 otherwise
  x_i = 1 if i ∈ U_d, 0 otherwise

Then

  μ_d = Σ_{i=1}^N u_i / Σ_{i=1}^N x_i = R
  ȳ_sd = Σ_{i∈s} u_i / Σ_{i∈s} x_i = ū_s / x̄_s = R̂
ȳ_sd is approximately unbiased for large n. From the ratio-estimator variance formula (with x̄_s = n_d/n):

  V̂(ȳ_sd) = (1/x̄_s²) ((1 - f)/n) (1/(n - 1)) Σ_{i∈s} (u_i - ȳ_sd x_i)²

Let s_d² be the sample variance for the domain:

  s_d² = Σ_{i∈s_d} (y_i - ȳ_sd)² / (n_d - 1)

Since the terms with i ∉ s_d vanish, Σ_{i∈s} (u_i - ȳ_sd x_i)² = (n_d - 1) s_d², and

  V̂(ȳ_sd) = (1 - f) (n (n_d - 1) / (n_d² (n - 1))) s_d² ≈ (1 - f) s_d² / n_d

  SE(ȳ_sd) = sqrt((1 - f) s_d² / n_d), with f = n/N
For large samples, f_d = n_d/N_d ≈ f.
• We can then treat s_d as an SRS from U_d
• Whatever the size of n is, conditional on n_d, s_d is an SRS from U_d - conditional inference

Example: Psychiatric Morbidity Survey 1993. Proportions with psychiatric problems:

  Domain d         n_d    ȳ_sd   SE(ȳ_sd)
  Women            4933   0.18   sqrt(0.18 × 0.82 / 4932) = 0.005
  Divorced women   314    0.29   sqrt(0.29 × 0.71 / 313) = 0.026
Estimating domain totals
• N_d known: use N_d ȳ_sd
• N_d unknown: must be estimated. Since N_d is the x-total:

  N̂_d = N x̄_s = N n_d / n
  t̂_d = N̂_d ȳ_sd = N (1/n) Σ_{i∈s} u_i = N ū_s

  SE(t̂_d) = N sqrt((1 - f) s_u² / n)

where s_u² is the sample variance of the u_i's.
Stratified sampling
• Basic idea: Partition the population U into H
subpopulations, called strata.
• Nh = size of stratum h, known
• Draw a separate sample from each stratum, sh of size nh
from stratum h, independently between the strata
• In social surveys: Stratify by geographic regions, age
groups, gender
• Ex: business survey. Canadian survey of employment.
Establishments stratified by
o Standard Industrial Classification – 16 industry
divisions
o Size – number of employees, 4 groups: 0-19, 20-49, 50-199, 200+
o Province – 12 provinces
Total number of strata: 16×4×12 = 768
92
Reasons for stratification
1. Strata form domains of interest for which
separate estimates of given precision are required,
e.g. strata = geographical regions
2. To “spread” the sample over the whole
population. Easier to get a representative sample
3. To get more accurate estimates of population
totals, reduce sampling variance
4. Can use different modes of data collection in
different strata, e.g. telephone versus home
interviews
93
Stratified simple random sampling
• The most common stratified sampling design
• SRS from each stratum
• Notation:
From stratum h: sample s_h of size n_h
Total sample size: n = Σ_{h=1}^H n_h
Values from stratum h: y_hi, i = 1,…,N_h
Sample: (y_hi : i ∈ s_h)
Sample mean: ȳ_h = Σ_{i∈s_h} y_hi / n_h
94
t_h = y-total for stratum h: t_h = Σ_{i=1}^{N_h} y_hi
The population total: t = Σ_{h=1}^H t_h
Consider estimation of t_h: t̂_h = N_h·ȳ_h
Assuming no auxiliary information in addition to
the “stratifying variables”
The stratified estimator of t:
t̂_st = Σ_{h=1}^H t̂_h = Σ_{h=1}^H N_h·ȳ_h
95
To estimate the population mean t/N:
Stratified mean: ȳ_st = t̂_st/N = Σ_{h=1}^H (N_h/N)·ȳ_h
A weighted average of the sample stratum means.
• Properties of the stratified estimator follow
from properties of SRS estimators.
• Notation:
Mean in stratum h: μ_h = Σ_{i=1}^{N_h} y_hi / N_h
Variance in stratum h: σ_h² = 1/(N_h−1) · Σ_{i=1}^{N_h} (y_hi − μ_h)²
96
E(t̂_st) = t, so t̂_st is unbiased
Var(t̂_st) = Σ_{h=1}^H Var(t̂_h) = Σ_{h=1}^H N_h²·(σ_h²/n_h)·(1 − f_h)
Estimated variance is obtained by estimating the stratum
variance with the stratum sample variance
s_h² = 1/(n_h−1) · Σ_{i∈s_h} (y_hi − ȳ_h)²
V̂(t̂_st) = Σ_{h=1}^H N_h²·(s_h²/n_h)·(1 − f_h)
Approximate 95% confidence interval if n and N−n are large:
t̂_st ± 1.96·√(V̂(t̂_st))
97
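The estimator, its estimated variance and the confidence interval can be sketched numerically. The stratum sizes, sample means and sample variances below are hypothetical, chosen only to illustrate the formulas:

```python
import math

# Hypothetical mini-example: H = 2 strata with known sizes N_h,
# SRS of size n_h within each stratum, giving sample means ybar_h
# and sample variances s2_h (all numbers invented).
N_h  = [4000, 1000]
n_h  = [200, 50]
ybar = [670.0, 640.0]
s2   = [110.0 ** 2, 130.0 ** 2]

# t_hat_st = sum_h N_h * ybar_h
t_hat = sum(N * yb for N, yb in zip(N_h, ybar))
# V_hat(t_hat_st) = sum_h N_h^2 * (s_h^2 / n_h) * (1 - f_h)
v_hat = sum(N ** 2 * s / n * (1 - n / N)
            for N, n, s in zip(N_h, n_h, s2))
se = math.sqrt(v_hat)
ci = (t_hat - 1.96 * se, t_hat + 1.96 * se)   # approx 95% CI
print(t_hat, se, ci)
```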
Estimating population proportion in stratified
simple random sampling
p_h: proportion in stratum h with a certain characteristic A
p̂_h = ȳ_h,
where y_hi = 1 if unit i in stratum h has characteristic A
p is the population mean: p = t/N = Σ_{h=1}^H N_h·p_h / N
Stratified estimator of p:
p̂_st = ȳ_st = Σ_{h=1}^H (N_h/N)·p̂_h
Stratified estimator of the total t = number of units in the
population with characteristic A:
t̂_st = N·p̂_st = Σ_{h=1}^H N_h·p̂_h
98
Estimated variance:
V̂(p̂_h) = (p̂_h(1 − p̂_h)/(n_h − 1))·(1 − n_h/N_h)   (slide 31)
⟹ V̂(p̂_st) = Σ_{h=1}^H V̂(W_h·p̂_h) = Σ_{h=1}^H W_h²·(1 − n_h/N_h)·p̂_h(1 − p̂_h)/(n_h − 1)
where W_h = N_h/N
and
V̂(t̂_st) = V̂(N·Σ_h W_h·p̂_h) = N²·Σ_{h=1}^H W_h²·(1 − n_h/N_h)·p̂_h(1 − p̂_h)/(n_h − 1)
99
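A minimal sketch of these proportion formulas, with made-up stratum counts and proportions:

```python
# Hypothetical numbers: two strata of sizes N_h, SRS of n_h in each,
# observed sample proportions p_h.
N_h = [3000, 2000]
n_h = [150, 100]
p_h = [0.20, 0.35]
N = sum(N_h)
W = [Nh / N for Nh in N_h]                       # W_h = N_h / N

# p_hat_st = sum_h W_h * p_hat_h
p_st = sum(w * p for w, p in zip(W, p_h))
# V_hat(p_hat_st) = sum_h W_h^2 * (1 - n_h/N_h) * p_h(1-p_h)/(n_h - 1)
v_p = sum(w ** 2 * (1 - n / Nh) * p * (1 - p) / (n - 1)
          for w, Nh, n, p in zip(W, N_h, n_h, p_h))
v_t = N ** 2 * v_p                               # V_hat(t_hat_st) for the total
print(p_st, v_p, v_t)
```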
Allocation of the sample units
• Important to determine the sizes of the stratum samples,
given the total sample size n and given the strata
partitioning
– how to allocate the sample units to the different strata
• Proportional allocation
– A representative sample should mirror the population
– Strata proportions: Wh=Nh/N
– Strata sample proportions should be the same:
nh/n = Wh
– Proportional allocation:
n_h = n·N_h/N, i.e. n_h/N_h = n/N for all h
100
The stratified estimator under
proportional allocation
• Inclusion probabilities: π_hi = n_h/N_h = n/N,
the same for all units in the population, but it is not a SRS
• t̂_st = Σ_{h=1}^H N_h·ȳ_h = Σ_{h=1}^H (N_h/n_h)·Σ_{i∈s_h} y_hi
= (N/n)·Σ_{h=1}^H Σ_{i∈s_h} y_hi = N·ȳ_s
• The stratified mean: ȳ_st = t̂_st/N = ȳ_s
The equally weighted sample mean (the sample is self-weighting: every unit in the sample represents the
same number of units in the population, N/n)
101
Variance and estimated variance under
proportional allocation
Var(t̂_st) = Σ_{h=1}^H N_h²·(σ_h²/n_h)·(1 − f_h) = N²·((1−f)/n)·Σ_{h=1}^H W_h·σ_h²,
f = n/N,  W_h = N_h/N
V̂(t̂_st) = N²·((1−f)/n)·Σ_{h=1}^H W_h·s_h²
102
• The estimator in simple random sampling:
t̂_SRS = N·ȳ_s
• Under proportional allocation:
t̂_st = t̂_SRS
• but the variances are different:
Under SRS: Var(t̂_SRS) = N²·((1−f)/n)·σ²
Under proportional allocation: Var(t̂_st) = N²·((1−f)/n)·Σ_{h=1}^H W_h·σ_h²
103
Nh 1 Nh
Nh
Using the approximat ions

and
 1:
N 1
N
Nh 1
 2  h 1Wh h2  h 1Wh (  h   ) 2
H
H
Total variance = variance within strata + variance between strata
Implications:
1. No matter what the stratification scheme is:
Proportional allocation gives more accurate estimates of
population total than SRS
2. Choose strata with little variability, smaller strata variances. Then
the strata means will vary more and between variance becomes
larger and precision of estimates increases compared to SRS
3. This is also essentiall y true in general, as seen from
H
2
2 1 fh
ˆ
V (t st )  N h 1Wh
 h2
nh
104
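The within/between decomposition can be checked numerically. The sketch below uses divisor-N ("population") variances, for which the decomposition is exact; the slide's (N_h − 1)-divisor version then follows from the stated approximations. The tiny data set is arbitrary:

```python
def pvar(xs):
    # population variance (divisor N), not the (N-1)-divisor version
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

strata = [[2.0, 3.0, 4.0], [10.0, 11.0, 12.0, 13.0]]  # arbitrary example
allval = [y for h in strata for y in h]
N = len(allval)
mu = sum(allval) / N

W = [len(h) / N for h in strata]                      # W_h = N_h / N
within = sum(w * pvar(h) for w, h in zip(W, strata))
between = sum(w * (sum(h) / len(h) - mu) ** 2 for w, h in zip(W, strata))

# total variance = within-strata + between-strata (exact with divisor N)
print(pvar(allval), within + between)
```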
Constructing stratification and drawing stratified
sample in R
Use API in California schools as example with
schooltype as stratifier.
3 strata: Elementary, middle and high schools.
Stratum 1: Elementary schools, N1 = 4421
Stratum 2: Middle schools, N2 = 1018
Stratum 3: High schools, N3 = 755
5% stratified sample with proportional allocation:
n1 = 221
n2 = 51
n3 = 38
n = 310
105
R-code: making strata
> x=apipop$stype
> # To make a numeric stratum variable from school type:
> make123 = function(x)
+ {
+   x = as.factor(x)
+   levels_x = levels(x)
+   x = as.numeric(x)
+   attr(x,"levels") = levels_x
+   x
+ }
> strata = make123(x)
> y=apipop$api00
> tapply(y,strata,mean)
       1        2        3
672.0627 633.7947 655.7230
> # 1=E, 2=H, 3=M. Will change stratum 2 and 3
106
> x1=as.numeric(strata<1.5)
> x2=as.numeric(strata<2.5)-x1
> x3=as.numeric(strata>2.5)
> stratum=x1+2*x3+3*x2
> tapply(y,stratum,mean)
       1        2        3
672.0627 655.7230 633.7947
> # stratified random sample with proportional allocation
> N1=4421
> N2=1018
> N3=755
> n1=221
> n2=51
> n3=38
> s1=sample(N1,n1)
> s2=sample(N2,n2)
> s3=sample(N3,n3)
107
> y1=y[stratum==1]
> y2=y[stratum==2]
> y3=y[stratum==3]
> y1s=y1[s1]
> y2s=y2[s2]
> y3s=y3[s3]
> t_hat1=N1*mean(y1[s1])
> t_hat2=N2*mean(y2[s2])
> t_hat3=N3*mean(y3[s3])
> t_hat=t_hat1+t_hat2+t_hat3
> muhat=t_hat/6194
> muhat
[1] 661.8897
> mean(y1s)
[1] 671.1493
> mean(y2s)
[1] 652.6078
> mean(y3s)
[1] 620.1842
108
> varest1=N1^2*var(y1s)*(N1-n1)/(N1*n1)
> varest2=N2^2*var(y2s)*(N2-n2)/(N2*n2)
> varest3=N3^2*var(y3s)*(N3-n3)/(N3*n3)
> se=sqrt(varest1+varest2+varest3)
> se
[1] 44915.56
> semean=se/6194
> semean
[1] 7.251463
> CI=muhat+qnorm(c(0.025,0.975))*semean
> CI
[1] 647.6771 676.1023
#CI = (647.7, 676.1)
109
Suppose we regard the sample as a SRS
> z=c(y1s,y2s,y3s)
> mean(z)
[1] 661.8516
> var(z)
[1] 17345.13
> sesrs=sqrt(var(z)*(6194-310)/(6194*310))
> sesrs
[1] 7.290523
Compared to 7.25 for the stratified SE.
Note: the estimate is the same, 661.9, since we
have proportional allocation
110
Optimal allocation
If the only concern is to estimate the population total t:
• Choose nh such that the variance of the stratified
estimator is minimum
• Solution depends on the unknown stratum variances
• If the stratum variances are approximately equal,
proportional allocation minimizes the variance of
the stratified estimator
111
Optimal allocation: n_h = n · N_h·σ_h / Σ_{k=1}^H N_k·σ_k
Proof:
Minimize Var(t̂_st) with respect to the sample sizes
n_h, subject to n = Σ_{h=1}^H n_h being fixed
Use the Lagrange multiplier method: Minimize
Q = Σ_{h=1}^H N_h²·σ_h²·(1/n_h − 1/N_h) + λ·(Σ_{h=1}^H n_h − n)
∂Q/∂n_h = 0 ⟺ −(1/n_h²)·N_h²·σ_h² + λ = 0 ⟺ n_h = N_h·σ_h/√λ
Result follows since the sample sizes must add up to n
112
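The allocation formula is easy to compute. In this sketch the stratum sizes reuse the API school counts from slide 105, while the stratum standard deviations are purely hypothetical guesses:

```python
# Neyman allocation: n_h = n * N_h*sigma_h / sum_k N_k*sigma_k
N_h     = [4421, 1018, 755]       # stratum sizes (API schools, slide 105)
sigma_h = [100.0, 80.0, 120.0]    # hypothetical stratum SDs
n       = 310                     # total sample size

w = [Nh * s for Nh, s in zip(N_h, sigma_h)]
n_h = [n * wh / sum(w) for wh in w]
print([round(x) for x in n_h])    # rounds to [223, 41, 46], summing to 310
```

With equal σ_h the weights reduce to N_h, i.e. proportional allocation, as stated on the slide.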
• Called Neyman allocation (Neyman, 1934)
• Should sample heavily in strata if
– The stratum accounts for a large part of the population
– The stratum variance is large
• If the stratum variances are equal, this is
proportional allocation
• Problem, of course: Stratum variances are unknown
– Take a small preliminary sample (pilot)
– The variance of the stratified estimator is not very
sensitive to deviations from the optimal allocation. Need
just crude approximations of the stratum variances
113
Optimal allocation when considering the cost
of a survey
• C represents the total cost of the survey, fixed – our
budget
• c0 : overhead cost, like maintaining an office
• ch : cost of taking an observation in stratum h
– Home interviews: traveling cost + interview
– Telephone or postal surveys: ch is the same for all
strata
– In some strata: telephone, in others home interviews
C = c_0 + Σ_{h=1}^H n_h·c_h
• Minimize the variance of the stratified estimator for a
given total cost C
114
Minimize Var(t̂_st) = N²·Σ_{h=1}^H W_h²·σ_h²·(1/n_h − 1/N_h)
subject to: c_0 + Σ_{h=1}^H n_h·c_h = C
Solution:
n_h ∝ W_h·σ_h/√c_h
⟹ n_h = (C − c_0) · (W_h·σ_h/√c_h) / Σ_{k=1}^H W_k·σ_k·√c_k
Hence, for a fixed total cost C:
n = (C − c_0) · Σ_{h=1}^H N_h·σ_h/√c_h / Σ_{h=1}^H N_h·σ_h·√c_h
115
In particular, if c_h = c for all h:
n = (C − c_0)/c
We can express the optimal sample sizes in relation to n:
n_h = n · (W_h·σ_h/√c_h) / Σ_{k=1}^H W_k·σ_k/√c_k
1. Large samples in inexpensive strata
2. If the c_h's are equal: Neyman allocation
3. If the c_h's are equal and the σ_h's are equal:
proportional allocation
116
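A sketch of the cost-constrained allocation, with invented stratum weights, SDs, per-unit costs and budget:

```python
import math

# n_h = (C - c0) * (W_h sigma_h / sqrt(c_h)) / sum_k W_k sigma_k sqrt(c_k)
W     = [0.6, 0.4]           # W_h = N_h / N (hypothetical)
sigma = [50.0, 100.0]        # hypothetical stratum SDs
c     = [4.0, 25.0]          # per-unit costs, e.g. phone vs home interview
C, c0 = 10000.0, 1000.0      # total budget and overhead cost

num = [w * s / math.sqrt(ch) for w, s, ch in zip(W, sigma, c)]
den = sum(w * s * math.sqrt(ch) for w, s, ch in zip(W, sigma, c))
n_h = [(C - c0) * x / den for x in num]
cost = sum(n * ch for n, ch in zip(n_h, c))
print(n_h, cost)             # total field cost equals C - c0
```

Note how the cheap stratum gets the larger sample even though its W_h·σ_h is smaller, illustrating point 1 above.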
Other issues with optimal allocation
• Many survey variables
• Each variable leads to a different optimal solution
– Choose one or two key variables
– Use proportional allocation as a compromise
• If nh > Nh, let nh =Nh and use optimal allocation
for the remaining strata
• If nh = 1, we cannot estimate the variance. Force nh = 2 or
collapse strata for variance estimation
• Number of strata: For a given n often best to
increase number of strata as much as possible.
Depends on available information
117
• Sometimes the main interest is in precision
of the estimates for stratum totals and less
interest in the precision of the estimate for
the population total
• Need to decide nh to achieve desired
accuracy for the estimate of th, as discussed earlier
– If we decide to do proportional allocation, this
can mean that in small strata (small Nh) the sample
size nh must be increased
118
Poststratification
• Stratification reduces the uncertainty of the
estimator compared to SRS
• In many cases one wants to stratify according to
variables that are not known or used in sampling
• Can then stratify after the data have been collected
• Hence, the term poststratification
• The estimator is then the usual stratified estimator
according to the poststratification
• If we take a SRS and N-n and n are large, the
estimator behaves like the stratified estimator with
proportional allocation
119
Poststratification to reduce nonresponse bias
• Poststratification is mostly used to correct for
nonresponse
• Choose strata with different response rates
• Poststratification amounts to assuming that the
response sample in poststratum h is representative
for the nonresponse group in the sample from
poststratum h
120
Systematic sampling
• Idea: Order the population and select every kth unit
• Procedure: U = {1,…,N} and N = nk + c, c < n
1. Select a random integer r between 1 and k, with equal
probability
2. Select the sample sr by the systematic rule
sr = {i: i = r + (j−1)k, j = 1, …, nr}
where the actual sample size nr takes values
[N/k] or [N/k] + 1
k : sampling interval = [N/n]
• Very easy to implement: Visit every 10th house or
interview every 50th name in the telephone book
121
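The selection rule above is a few lines of code. A sketch (the function name is ours):

```python
import random

def systematic_sample(N, n, rng=random):
    # sampling interval k = [N/n]; random start r in 1..k; every kth unit
    k = N // n
    r = rng.randint(1, k)
    return k, [i for i in range(r, N + 1, k)]

k, s = systematic_sample(20, 4)     # example 1: k = 5, e.g. r = 1 gives {1,6,11,16}
print(k, s)
```

Depending on r, the realized sample size is [N/k] or [N/k] + 1, matching example 2 (N = 149, n = 12, sizes 12 or 13).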
• k distinct samples, each selected with probability 1/k
p(s) = 1/k if s = sr for some r = 1,…,k; p(s) = 0 otherwise
• Unlike in SRS, many subsets of U have zero probability
Examples:
1) N =20, n=4. Then k=5 and c=0. Suppose we select r =1.
Then the sample is {1,6,11,16}
5 possible distinct samples. In SRS: 4845 distinct samples
2) N= 149, n = 12. Then k = 12, c=5. Suppose r = 3.
s3 = {3,15,27,39,51,63,75,87,99,111,123,135,147} and
sample size is 13
3) N=20, n=8. Then k=2 and c = 4. Sample size is nr =10
4) N= 100 000, n = 1500. Then k = 66 , c=1000 and c/k
=15.15 with [c/k]=15. nr = 1515 or 1516
122
Estimation of the population total
t(s) = Σ_{i∈s} y_i,  n(s) = sample size
Two estimators (equal when N = nk):
1) t̂(s) = k·t(s) = [N/n]·t(s)
2) t̃(s) = N·ȳ_s = N·t(s)/n(s)
These estimators are approximately the same:
n(s) = [N/k] or [N/k] + 1  ⟹  N·(1/n(s)) ≈ N·(1/(N/k)) = k
123
t̂ is unbiased:
E(t̂) = Σ_{r=1}^k t̂(s_r)·p(s_r) = Σ_{r=1}^k k·t(s_r)·(1/k) = Σ_{r=1}^k t(s_r) = t
t̃ is only approximately unbiased (it's a ratio estimator)
- usually slightly smaller variance than t̂
• Advantage of systematic sampling: Can be
implemented even where no population frame exists
• E.g. sample every 10th person admitted to a hospital,
every 100th tourist arriving at LA airport.
124
Var(t̂) = E(t̂ − t)² = Σ_{r=1}^k (t̂(s_r) − t)²·p(s_r)
= (1/k)·Σ_{r=1}^k (k·t(s_r) − t)² = k·Σ_{r=1}^k (t(s_r) − t̄)²
where t̄ = Σ_{r=1}^k t(s_r)/k
is the average of the sample totals (and t = k·t̄ when N = nk)
• The variance is small if
t(s_r) varies little, i.e., if the “strata”
{1,…,k}, {k+1,…,2k}, etc., are very homogeneous
• Or, equivalently, if the values within the possible samples
s_r are very different; the samples are heterogeneous
• Problem: The variance cannot be estimated properly
because we have only one observation of t(s_r)
125
Systematic sampling as Implicit Stratification
In practice: Very often when using systematic sampling
(common design in national statistical institutes):
The population is ordered such that the first k units constitute a
homogeneous “stratum”, the second k units another “stratum”, etc.
Implicit stratum    Units
1                   1, 2, …, k
2                   k+1, …, 2k
:                   :
n                   (n−1)k+1, …, nk
(n = N/k assumed)
Systematic sampling selects 1 unit from each stratum at
random
126
Systematic sampling vs SRS
• Systematic sampling is more efficient if the study variable
is homogeneous within the implicit strata
– Ex: households ordered according to house numbers
within neighbourhoods and study variable related to
income
• Households in the same neighbourhood are usually
homogeneous with respect to socio-economic variables
• If population is in random order (all N! permutations are
equally likely): systematic sampling is similar to SRS
• Systematic sampling can be very bad if y has periodic
variation relative to k:
– Approximately: y1 = yk+1, y2 = yk+2 , etc
127
Variance estimation
• No direct estimate; impossible to obtain an unbiased estimate
• If the population is in random order: can use the variance
estimate from SRS as an approximation
• Develop a conservative variance estimator by
collapsing the “implicit strata”, overestimate the variance
• The most promising approach may be:
Under a statistical model, estimate the expected value
of the design variance
• Typically, systematic sampling is used in the second stage
of two-stage sampling (to be discussed later), may not be
necessary to estimate this variance then.
128
Cluster sampling and multistage sampling
• Sampling designs so far: Direct sampling of the
units in a single stage of sampling
• For economical and practical reasons it may be
necessary to modify these sampling designs:
– There exists no population frame (register: list of all
units in the population), and it is impossible or very
costly to produce such a register
– The population units are scattered over a wide area,
and a direct sample will also be widely scattered. In
case of personal interviews, the traveling costs would
be very high and it would not be possible to visit the
whole sample
129
• Modified sampling can be done by
1. Selecting the sample indirectly in groups, called
clusters; cluster sampling
– Population is grouped into clusters
– Sample is obtained by selecting a sample of
clusters and observing all units within the
clusters
– Ex: In Labor Force Surveys: Clusters =
Households, units = persons
2. Selecting the sample in several stages;
multistage sampling
130
3. In two-stage sampling:
• Population is grouped into primary sampling
units (PSU)
• Stage 1: A sample of PSUs
• Stage 2: For each PSU in the sample at stage
1, we take a sample of population units, now
also called secondary sampling units (SSU)
• Ex: PSUs are often geographical regions
131
Examples
1. Cluster sampling. Want a sample of high school
students in a certain area, to investigate smoking and
alcohol use. If a list of high school classes is
available, we can then select a sample of high school
classes and give the questionnaire to every student in the
selected classes; cluster sampling with high school class
being the clusters
2. Two-stage cluster sampling. If a list of classes is not
available, we can first select high schools, then classes
and finally all students in the selected classes. Then we
have a 2-stage cluster sample.
1. PSU = high school
2. SSU = classes
3. Units = students
132
Psychiatric Morbidity Survey is a 4-stage
sample
- Population: adults aged 16-64 living in private
households in Great Britain
- PSUs = postal sectors
- SSUs = addresses
- 3SUs = households
- Units = individuals
Sampling process:
1) 200 PSUs selected
2) 90 SSUs selected within each sampled PSU
(interviewer workload)
3) All households selected per SSU
4) 1 adult selected per household
133
Cluster sampling
Number of clusters in the population: N
Number of units in cluster i: M_i
Population size: M = Σ_{i=1}^N M_i
s_I = sample of clusters, n = |s_I|
Final sample of units: s = all units in the clusters in s_I
Size of final sample s: m = Σ_{i∈s_I} M_i,
not fixed in advance
t_i = y-total in cluster i,  t = Σ_{i=1}^N t_i
Population mean for the y-variable: μ_y = t/M
134
Simple random cluster sampling
Ratio-to-size estimator
Use auxiliary information: Size of the sampled clusters
t̂_R = M · Σ_{i∈s_I} t_i / Σ_{i∈s_I} M_i
Approximately unbiased with approximate variance
Var(t̂_R) = N² · ((1−f)/n) · 1/(N−1) · Σ_{i=1}^N M_i²·(ȳ_i − μ)²
where ȳ_i = t_i/M_i, the cluster mean, and μ = t/M
135
estimated by
V̂(t̂_R) = ((M/N)/(m/n))² · N² · ((1−f)/n) · 1/(n−1) · Σ_{i∈s_I} M_i²·(ȳ_i − ȳ_s)²
where f = n/N and ȳ_s = Σ_{i∈s_I} t_i / Σ_{i∈s_I} M_i
is the usual sample mean
Note that this ratio estimator is in fact the usual sample mean
based estimator with respect to the y-variable:
t̂_R = M·ȳ_s
And the corresponding estimator of the population mean of y is
ȳ_s
Can be used also if M is unknown
136
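The ratio-to-size estimator and its estimated variance can be sketched with invented cluster totals and sizes (all numbers hypothetical):

```python
import math

t_i = [40.0, 55.0, 30.0]   # y-totals of the n sampled clusters (invented)
M_i = [10, 12, 8]          # sizes of the sampled clusters (invented)
N, M = 30, 300             # clusters / units in the population (invented)
n = len(t_i)
f = n / N

ybar_s = sum(t_i) / sum(M_i)          # ratio R_hat; also t_hat_R / M
t_hat_R = M * ybar_s
# residual sum of squares: sum_{i in s_I} M_i^2 (ybar_i - ybar_s)^2
s2 = sum(Mi ** 2 * (ti / Mi - ybar_s) ** 2
         for ti, Mi in zip(t_i, M_i)) / (n - 1)
Mbar, mbar = M / N, sum(M_i) / n      # mean cluster size: population / sample
v_hat = (Mbar / mbar) ** 2 * N ** 2 * (1 - f) / n * s2
print(t_hat_R, math.sqrt(v_hat))
```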
• Estimator’s variance is highly influenced by how the clusters
are constructed.
Choose clusters to make Σ M_i²·(ȳ_i − μ)² small
⟹ make the clusters heterogeneous,
such that most of the variation in the y-values
lies within the clusters, making the ȳ_i-values similar
• Note: The opposite of stratified sampling
• Typically, clusters are formed by “nearby units” like households,
schools, hospitals for economical and practical reasons,
with little variation within the clusters:
Simple random cluster sampling will then lead to much less precise
estimates compared to SRS, but this is offset by big cost reductions
Sometimes SRS is not possible; information is only known for
clusters
137
Design Effects
A design effect (deff) compares the efficiency of two
design-estimation strategies (sampling design and estimator) for the
same sample size
Now: Compare
Strategy 1: simple random cluster sampling with ratio
estimator
Strategy 2: SRS, of the same sample size m, with the usual
sample mean estimator
In terms of estimating the population mean:
Strategy 1 estimator: t̂_R/M = ȳ_s
Strategy 2 estimator: ȳ_s
138
The design effect of simple random cluster sampling, SCS, is then
deff(SCS, ȳ_s) = Var_SCS(ȳ_s)/Var_SRS(ȳ_s)
Estimated deff: V̂_SCS(ȳ_s)/V̂_SRS(ȳ_s)
In the probation example:
V̂_SRS(p̂) = [p̂(1−p̂)/(m−1)]·(1−f) ≈ p̂(1−p̂)/(m−1) = 0.00387²
Estimated deff = 0.0302²/0.00387² = 60.9
Conclusion: Cluster sampling is much less efficient
Note: We can estimate the f.p.c. factor 1 − m/M by letting M̂ = N·(m/n),
so that 1 − m/M̂ = 1 − n/N = 16/26 = 0.615
estdeff = 60.9/0.615 = 99 !
139