SiStaN in Brief


A balanced sampling approach for multiway stratification design for small area estimation
Piero Demetrio Falorsi - Paolo Righi
ISTAT
Index
1. The issue of multivariate-multidomain sampling strategy
2. The proposed sampling strategy
3. Balanced sample for multiway stratification
4. Modified GREG estimator
5. The algorithm for the sample size definition
6. Application fields and experiments
1. The issue of multivariate-multidomain sampling strategy
When planning a sampling strategy for a survey that must produce
estimates for several domains (defined by non-nested partitions of
the population), one issue is to set the sample size so that the
sampling errors of the domain estimates of several parameters stay
below given thresholds.
A sampling strategy is proposed here dealing with
multivariate-multidomain surveys when the overall sample size must
satisfy budget constraints.
The standard solution of a stratification given by cross-classification of
the domain variables is often not feasible because the number of
strata can be larger than the overall sample size. Moreover, even if
the overall sample size allows covering all the strata, the resulting
allocation could lead to an inefficient design.
[Figure: population, and planned versus actual sample under cross-classification stratification]
Example: Business Structural Statistics
Table 1.2. Number of domains of the Italian Structural Business Statistics Survey by partition

Partition                                                                               Number of domains
Economic activity class (4 digits of the NACE rev.1 classification)                     465
Economic activity group (3 digits of the NACE rev.1 classification) by size class (1)   395
Economic activity division (2 digits of the NACE rev.1 classification) by region (2)    961
Total number of estimation domains                                                      1,821

(1) Size classes are defined in terms of number of persons employed.
(2) Regions are 21 including autonomous provinces.

36,000 cross-classification strata
Standard strategy
The standard solution for obtaining planned domains adopts a cross-stratified
sampling design obtained by combining the domains.
Consequences:
– when the population size in many strata is small, the stratification scheme could be inefficient;
– if the different partitions into domains of interest are not nested, the allocation of the sample in the cross-classified strata may be substantially different from the optimal allocation for the domains of a given partition;
– the sample size needed to cover all strata could be too large for the survey budget constraints;
– in surveys repeated over time, the statistical burden may grow if some strata contain only a few units in the population.
One possible solution is multi-way stratification:
Several sophisticated solutions have been proposed to keep the sample
size under control in all the categories of the stratifying
variables without using a cross-classification design. These methods
are generally referred to as multi-way stratification techniques, and
have been developed under two main approaches:
(i) Latin Squares or Latin Lattices schemes (Bryant et al., 1960;
Jessen, 1970). Independence among rows and columns is assumed, and
these methods work only if all the cross-strata exist in the population.
(ii) Controlled rounding problems via linear programming (Causey et
al., 1985; Sitter and Skinner, 1994). These methods are computationally
very complex, do not always reach a solution, and the inclusion
probabilities (both simple and joint) cannot be computed immediately.
The main weaknesses of these approaches derive from their
computational complexity and from the fact that a solution is not
always reached.
2. The proposed sampling strategy
The aim of this work is to define a sampling strategy that is optimal with regard to both
the sampling scheme and the estimator used, by exploiting the available auxiliary
information in both phases:
• define a probabilistic sampling method;
• realize a multiway stratification based on balanced sampling, controlling the sample size of the marginal domains;
• use a modified GREG estimator;
• define the sample allocation, aiming at controlling the sampling errors on the margins, using a variance estimator that jointly takes into account the regression model underlying the GREG estimator and the balanced sampling design.
The strategy may take into account a simple (Fay Herriot) small area estimator.
The proposed overall sampling strategy is easy to implement, and software has been
developed for each phase.
It can be extended to different contexts (considering the anticipated variance or the use
of indirect small area estimators).
It is possible to develop a sampling strategy for small area estimation that considers the
sampling and estimation phases jointly.
Notation
Denote with:
$U$ the population of size $N$;
$U_b$ the $b$-th partition of $U$ into $M_b$ domains $U_{bd}$ ($b=1,\dots,B$; $d=1,\dots,M_b$);
$y_{r,k}$ the value of the $r$-th variable of interest ($r=1,\dots,R$) in the $k$-th population unit;
${}_{bd}\delta_k$ the domain membership indicator (1 if $k \in U_{bd}$, 0 otherwise);
$n$ the overall fixed sample size;
${}_{bd}t_r$ the $r$-th parameter of interest,
\[ {}_{bd}t_r = \sum_{k \in U} y_{r,k}\,{}_{bd}\delta_k = \sum_{k \in U_{bd}} y_{r,k}. \]
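The notation can be illustrated with a small sketch. The five-unit population, the two partitions (region and size class) and all the values below are hypothetical; the only point is that a domain total is the sum over the whole population of the variable times the membership indicator.

```python
# Hypothetical toy population: one value of y per unit and one domain label
# per partition (partition 0 = region, partition 1 = size class).
y = [10.0, 4.0, 7.0, 3.0, 6.0]
labels = [("north", "small"), ("north", "large"), ("south", "small"),
          ("south", "small"), ("south", "large")]


def delta(k, b, d):
    """Domain membership indicator: 1 if unit k is in domain d of partition b."""
    return 1 if labels[k][b] == d else 0


def domain_total(b, d):
    """Total of y over domain d of partition b, written as a sum over all of U."""
    return sum(y[k] * delta(k, b, d) for k in range(len(y)))


t_north = domain_total(0, "north")  # units 0 and 1
t_small = domain_total(1, "small")  # units 0, 2 and 3
```

The two partitions are not nested: "north" and "small" overlap without one containing the other, which is the situation the multiway design addresses.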
3. Balanced sampling and multi-way stratification
Balanced sampling is a class of designs using auxiliary information.
Properties have been studied in the
• model based approach (Royall and Herson, 1973; Valliant et
al., 2000);
• design based approach (Deville and Tillé, 2004, 2005).
In the following we consider the design based or model assisted
approach
Let the sampling design $p(\cdot)$ with inclusion probabilities
$\pi = (\pi_1,\dots,\pi_k,\dots,\pi_N)'$ be a design which assigns a probability $p(s)$ to each
sample $s$ such that
\[ E(\lambda) = \sum_{s \in S} p(s)\,\lambda = \pi, \]
being $\lambda = (\lambda_1,\dots,\lambda_k,\dots,\lambda_N)'$ a vector of sample indicators.
Let $z_k = (z_{1k},\dots,z_{hk},\dots,z_{Qk})'$ be a vector of $Q$ auxiliary variables known for
each unit in the population. The sampling design $p(s)$ is said to be
balanced with respect to the $Q$ auxiliary variables if and only if it satisfies
the balancing equations given by
\[ \hat{t}_{z,ht} = t_z, \quad \text{that is} \quad \sum_{k \in U} z_k \lambda_k a_k = \sum_{k \in U} z_k, \]
being $a_k = 1/\pi_k$ the sample weight.
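A minimal sketch of what the balancing equations require of a single realised sample: for every auxiliary variable, the Horvitz-Thompson estimate computed on the sample must reproduce the known population total. The function names, the four-unit population and the auxiliary values are illustrative assumptions, not part of the original material.

```python
def ht_total(z, pi, sample):
    """Horvitz-Thompson estimate of the total of z from the sampled units."""
    return sum(z[k] / pi[k] for k in sample)


def is_balanced(z_vectors, pi, sample, tol=1e-9):
    """True if the HT estimate equals the population total for every auxiliary variable."""
    return all(abs(ht_total(z, pi, sample) - sum(z)) <= tol for z in z_vectors)


# Hypothetical population of N = 4 units with equal inclusion probabilities.
pi = [0.5, 0.5, 0.5, 0.5]
z1 = list(pi)               # taking z = pi forces a fixed sample size
z2 = [1.0, 3.0, 3.0, 1.0]   # an arbitrary auxiliary variable

sample_a_ok = is_balanced([z1, z2], pi, sample=[0, 1])
sample_b_ok = is_balanced([z1, z2], pi, sample=[1, 2])
```

Here the sample {0, 1} reproduces both totals, while {1, 2} reproduces the size but overestimates the total of the second variable, so a balanced design would give it zero probability.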
A multi-way stratification design can be seen as a special case of balanced
design in which, for unit $k$, the auxiliary variable vector is the vector of
indicators of membership in the domains of the different partitions,
multiplied by the unit's inclusion probability.
The $z$ vector, in this case, is defined as
\[ z_k' = (0,\dots,\pi_k,\dots,0,\dots,0,\dots,\pi_k,\dots,0) = \pi_k\,({}_{11}\delta_k,\dots,{}_{bd}\delta_k,\dots,{}_{BM_B}\delta_k). \]
The balancing equations ensure that, for each selected sample $s$,
the size of the subsample $s_{bd} = s \cap U_{bd}$ is a non-random quantity
equal to
\[ n_{bd} = \sum_{k \in U_{bd}} \pi_k. \]
For multiway stratification the balancing equations become
\[ \sum_{k \in U} \frac{\pi_k\,{}_{bd}\delta_k\,\lambda_k}{\pi_k} = \sum_{k \in U} \pi_k\,{}_{bd}\delta_k, \]
being
\[ \sum_{k \in U_{bd}} \pi_k = n_{bd} \]
and $n_{bd}$ the sample size for the $d$-th domain of the $b$-th partition.
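The following sketch shows, under hypothetical data, why balancing on the inclusion probability times the domain indicators fixes the marginal sample sizes: the Horvitz-Thompson estimate of each component of z is simply the number of sampled units in that domain. The population (one unit per cell of a 2 x 2 table), the domain names and the helper functions are assumptions made for illustration.

```python
def multiway_z(pi_k, unit_domains, all_domains):
    """z_k = pi_k times the membership indicators of every domain of every partition."""
    return [pi_k if unit_domains[b] == d else 0.0
            for b, doms in enumerate(all_domains) for d in doms]


def realised_sizes(sample, z, pi):
    """HT estimate of each z total, i.e. the realised sample count per domain."""
    return [sum(z[k][j] / pi[k] for k in sample) for j in range(len(z[0]))]


# Hypothetical population: two partitions with two domains each.
all_domains = [["north", "south"], ["small", "large"]]
units = [("north", "small"), ("north", "large"),
         ("south", "small"), ("south", "large")]
pi = [0.5, 0.5, 0.5, 0.5]

z = [multiway_z(pi[k], units[k], all_domains) for k in range(len(units))]
planned = [sum(col) for col in zip(*z)]      # n_bd = sum of pi_k over U_bd
sizes_ok = realised_sizes([0, 3], z, pi)     # one unit in every marginal domain
sizes_bad = realised_sizes([0, 1], z, pi)    # two units in "north", none in "south"
```

Only samples such as {0, 3} or {1, 2} match the planned marginal sizes, without any requirement on the cross-classified cells.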
A relevant drawback of balanced sampling has always been the lack of
a general procedure for selecting a multivariate balanced
random sample.
Deville and Tillé (2004) proposed a sample selection method (the cube
method) that draws balanced samples for a large set of auxiliary
variables and for different vectors of inclusion probabilities.
A free macro for the selection of balanced samples for large data sets
may be downloaded (SAS or R routine):
http://www.insee.fr/fr/nom_df_met/outils_stat/cube/accueil_cube.htm
Deville and Tillé (2000) show that, with our specification of the
auxiliary vectors, the balancing equations can be satisfied exactly,
while in general the balancing equations are satisfied only approximately.
4. Modified GREG estimator
In the context of multivariate estimation, the $r$-th parameter of interest is
\[ {}_{bd}t_r = \sum_{k \in U_{bd}} y_{r,k}. \]
The modified GREG estimator is (through a domain-specific weight)
\[ {}_{bd}\hat{t}_{r,greg} = \sum_{k \in s} {}_{bd}w_k\, y_{r,k}, \]
\[ {}_{bd}w_k = a_k\,{}_{bd}\delta_k + ({}_{bd}t_x - {}_{bd}\hat{t}_{x,ht})' \Big( \sum_{k \in s} a_k x_k x_k' / c_k \Big)^{-1} a_k x_k / c_k. \]
The superpopulation working model is
\[ y_{r,k} = x_k' \beta_r + \eta_{r,k}. \]
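As a hedged illustration of the domain-specific weight, the sketch below computes modified GREG weights for a single scalar auxiliary variable with the constants c_k set to 1. Both simplifications, the function names and the toy sample are assumptions made here, and the domain total of x is taken to be known. Units outside the domain receive a non-zero weight through the regression adjustment, which is what distinguishes the modified estimator from a purely domain-based one.

```python
def modified_greg_weights(x, a, in_domain, t_x_domain):
    """w_k = a_k*delta_k + (t_x - t_x_ht) * a_k*x_k / sum_s(a_k*x_k^2).

    x, a and in_domain are lists over the sampled units. A scalar x and
    c_k = 1 are simplifying assumptions of this sketch.
    """
    t_x_ht = sum(ak * dk * xk for xk, ak, dk in zip(x, a, in_domain))
    m = sum(ak * xk * xk for xk, ak in zip(x, a))
    g = (t_x_domain - t_x_ht) / m
    return [ak * dk + g * ak * xk for xk, ak, dk in zip(x, a, in_domain)]


def weighted_total(values, w):
    """Weighted sample sum, used for both y and x."""
    return sum(wk * vk for wk, vk in zip(w, values))


# Hypothetical sample of four units, the first two belonging to the domain.
x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 2.0, 3.0, 4.0]      # here y = 0.5 * x exactly
a = [5.0, 5.0, 5.0, 5.0]      # sampling weights 1 / pi_k
in_domain = [1, 1, 0, 0]
t_x_domain = 40.0             # assumed known domain total of x

w = modified_greg_weights(x, a, in_domain, t_x_domain)
t_hat = weighted_total(y, w)
```

The weights reproduce the known domain total of x, and because y is exactly proportional to x in this toy sample the estimate equals half of that total.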
4. Modified GREG estimator: variance
Variance of the Horvitz-Thompson estimator under balanced sampling
Deville and Tillé (2005) proposed an approximation of the variance
of the HT estimator for the overall domain:
\[ V(\hat{t}_{ht} \mid \hat{t}_{z,ht} = t_z) \cong \frac{N}{N-Q} \sum_{k \in U} \Big( \frac{1}{\pi_k} - 1 \Big) (y_k - z_k' B_z)^2, \]
with
\[ B_z = \Big[ \sum_{k \in U} z_k z_k' \Big( \frac{1}{\pi_k} - 1 \Big) \Big]^{-1} \sum_{k \in U} z_k y_k \Big( \frac{1}{\pi_k} - 1 \Big). \]
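The approximation can be evaluated directly when the whole population is available, as in a simulation. The sketch below restricts it to a single balancing variable (Q = 1) so that the coefficient B_z is a scalar; that restriction, the function name and the data are assumptions of this example.

```python
def balanced_ht_variance(y, z, pi):
    """N/(N-Q) * sum_k (1/pi_k - 1) * (y_k - z_k*B_z)^2 for one balancing variable."""
    N, Q = len(y), 1
    c = [1.0 / p - 1.0 for p in pi]
    b_z = (sum(ck * zk * yk for ck, zk, yk in zip(c, z, y))
           / sum(ck * zk * zk for ck, zk in zip(c, z)))
    return N / (N - Q) * sum(ck * (yk - zk * b_z) ** 2
                             for ck, zk, yk in zip(c, z, y))


# Hypothetical population of four units, balanced on z = pi (fixed size n = 2).
pi = [0.5, 0.5, 0.5, 0.5]
v_general = balanced_ht_variance([1.0, 2.0, 3.0, 4.0], pi, pi)
v_proportional = balanced_ht_variance([1.0, 1.0, 1.0, 1.0], pi, pi)
```

With equal probabilities and z = pi the value coincides with the textbook variance of the expansion estimator under simple random sampling without replacement, and it drops to zero when y is proportional to the balancing variable.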
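Starting from the result by Deville and Tillé (2005), it is possible to derive the
approximate expression of the variance of the modified GREG
estimator under balanced sampling:
\[ V_p({}_{bd}\hat{t}_{r,greg} \mid \hat{t}_{z,ht} = t_z) \cong \frac{N}{N-Q} \sum_{k \in U} \Big( \frac{1}{\pi_k} - 1 \Big)\, {}_{bd}\eta_{r,k}^2, \]
being
\[ {}_{bd}\eta_{r,k} = \begin{cases} \eta_{r,k} - z_k'\,{}_{bd}B_{z,\eta} & \text{for } k \in U_{bd} \\ -\,z_k'\,{}_{bd}B_{z,\eta} & \text{for } k \notin U_{bd} \end{cases} \]
and
\[ {}_{bd}B_{z,\eta} = \Big[ \sum_{k \in U} z_k z_k' \Big( \frac{1}{\pi_k} - 1 \Big) \Big]^{-1} \sum_{k \in U} z_k\, \eta_{r,k} \Big( \frac{1}{\pi_k} - 1 \Big) {}_{bd}\delta_k, \]
where $\eta_{r,k}$ is the residual of the working model.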
5. The algorithm for the sample size definition
In order to calculate the inclusion probabilities it is necessary to
fix the sample size for each domain so that the constraints
on the sampling errors are met.
Considering each marginal partition separately would give
a different set of inclusion probabilities for each of them.
In our methodology we calculate a single set of inclusion probabilities
through a two-step procedure:
• Optimisation (calculation of the optimal probabilities)
• Calibration (calculation of the “working” probabilities)
Optimisation: the inclusion probabilities (sample size and domain
allocation) are calculated by minimizing the expected sample size
subject to bounds on the expected sampling errors, in a setting that is:
• multi-domain;
• multi-variable.
The problem is solved through the system
\[ \begin{cases} \min \sum_{k \in U} \pi_k \\ \dfrac{N}{N-Q} \sum_{k \in U} \Big( \dfrac{1}{\pi_k} - 1 \Big)\, {}_{bd}\eta_{r,k}^2 \le {}_{bd}V_r \\ 0 < \pi_k \le 1 \quad (k = 1,\dots,N) \end{cases} \]
where ${}_{bd}\eta_{r,k}$ is the residual term and ${}_{bd}V_r$ the variance threshold.
The solution can be obtained through the Chromy algorithm
(the one used in the allocation software MAUSS, which can be downloaded
from www.istat.it).
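To give a feeling for the structure of the optimisation problem, the sketch below solves the special case of a single variance constraint in closed form. This is not the Chromy algorithm: it is a simplified Lagrangian solution, valid only when the residual terms are treated as fixed, all of them are non-zero and the upper bound on the probabilities is not active. In that case the optimal probabilities are proportional to the absolute residuals. Function names and numbers are hypothetical.

```python
def min_size_probabilities(eta, v_bound, q):
    """Probabilities minimising sum(pi) under one variance bound met with equality.

    Stationarity of the Lagrangian gives pi_k proportional to |eta_k|; the
    constant follows from N/(N-Q) * sum((1/pi_k - 1) * eta_k^2) = v_bound.
    """
    n_units = len(eta)
    c = v_bound * (n_units - q) / n_units + sum(e * e for e in eta)
    s_abs = sum(abs(e) for e in eta)
    pi = [abs(e) * s_abs / c for e in eta]
    if max(pi) > 1.0:
        raise ValueError("the bound pi_k <= 1 is active; this closed form does not apply")
    return pi


def variance_lhs(eta, pi, q):
    """Left-hand side of the variance constraint for given probabilities."""
    n_units = len(eta)
    return n_units / (n_units - q) * sum((1.0 / p - 1.0) * e * e
                                         for e, p in zip(eta, pi))


# Hypothetical residual terms, threshold and number of balancing variables.
eta = [1.0, 2.0, 3.0, 4.0]
pi_opt = min_size_probabilities(eta, v_bound=40.0, q=1)
expected_n = sum(pi_opt)
```

With several domains and variables the constraints interact and the residual terms themselves depend on the probabilities, which is why an iterative algorithm is needed in the general case.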
Calibration: the optimal inclusion probabilities lead to non-integer values for
the domain sample sizes. The calibration step consists of:
• rounding the expected domain sample sizes to the next integer;
• calculating the “working” probabilities nearest to the optimal ones.
The problem is defined through the system
\[ \begin{cases} \min \sum_{k \in U} G(\tilde{\pi}_k; \pi_k) \\ \sum_{k \in U} \tilde{\pi}_k = n \\ \sum_{k \in U_{bd}} \tilde{\pi}_k = n_{bd} \quad (b = 1,\dots,B;\ d = 1,\dots,M_b - 1) \end{cases} \]
where $\pi_k$ are the optimal probabilities, $\tilde{\pi}_k$ the working probabilities and $G$ a distance function.
The solution is obtained by means of the Newton algorithm (with
some changes), the same used in the calibration software
Genesees, which can be downloaded from www.istat.it.
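The calibration step can be sketched with iterative proportional fitting (raking), which rescales the starting probabilities within each domain of each partition in turn until the integer marginal sizes are met. This is a swapped-in technique: the distance function is not specified here and the Newton algorithm mentioned above is not reproduced. The sketch also does not enforce the upper bound of 1 on the probabilities. All names and numbers are hypothetical.

```python
def rake_probabilities(pi0, memberships, targets, iters=100):
    """Rescale pi domain by domain until the sums over each domain match n_bd.

    memberships[b][k] is the domain of unit k in partition b;
    targets[b][d] is the integer sample size planned for that domain.
    """
    pi = list(pi0)
    for _ in range(iters):
        for b, target in enumerate(targets):
            sums = {d: 0.0 for d in target}
            for k, p in enumerate(pi):
                sums[memberships[b][k]] += p
            pi = [p * target[memberships[b][k]] / sums[memberships[b][k]]
                  for k, p in enumerate(pi)]
    return pi


def domain_sums(pi, membership):
    """Sum of the probabilities within each domain of one partition."""
    out = {}
    for k, p in enumerate(pi):
        out[membership[k]] = out.get(membership[k], 0.0) + p
    return out


# Hypothetical starting ("optimal") probabilities for four units.
pi_start = [0.45, 0.55, 0.60, 0.40]
memberships = [["A", "A", "B", "B"], ["x", "y", "x", "y"]]
targets = [{"A": 1, "B": 1}, {"x": 1, "y": 1}]   # rounded domain sizes

pi_work = rake_probabilities(pi_start, memberships, targets)
```

Raking keeps the working probabilities proportional to the starting ones up to one factor per domain, so units that were given a high probability by the optimisation keep a relatively high probability after calibration.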
6. Application fields and experiments
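Artificial data
[Table: population contingency table]
Variable for the allocation and estimation model:
\[ y_{1,k} = 0.35\,x_k + \varepsilon_{1,k}, \quad E_m(\varepsilon_{1,k}) = 0, \quad E_m(\varepsilon_{1,k}\,\varepsilon_{1,l}) = 0 \ (k \ne l), \quad V_m(\varepsilon_{1,k}) = 1.5\,x_k. \]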
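A short sketch of how values could be generated from this model. Only the mean, the lack of correlation and the variance of the errors are specified, so the Gaussian errors used below are an assumption, as are the values of x and the function name.

```python
import random


def simulate_y1(x, seed=0):
    """Draw y_1k = 0.35*x_k + e_k with E(e_k) = 0 and V(e_k) = 1.5*x_k.

    Independent Gaussian errors are an assumption of this sketch.
    """
    rng = random.Random(seed)
    return [0.35 * xk + rng.gauss(0.0, (1.5 * xk) ** 0.5) for xk in x]


# Hypothetical auxiliary values for five units.
x = [10.0, 20.0, 50.0, 100.0, 200.0]
y1 = simulate_y1(x, seed=1)
```

The error variance grows with x, so larger units are noisier in absolute terms, which is why they tend to receive higher inclusion probabilities in the allocation.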
[Table: compared sampling designs and expected CV (%)]
Real data
A simulation on real enterprise data (N=10,392) has been carried out to
evaluate the effects of the planned sample sizes for small domains of
estimation (Falorsi et al., 2006):
• U1 partition: regions (20 domains);
• U2 partition: economic activity by size class (24 domains);
• cross-classification strata containing population units: 360;
• variables of interest: value added and labour cost;
• the overall sample size is n=360;
• the sample sizes of the U1 and U2 partitions have been planned separately by means of a compromise allocation;
• the 2 allocations guarantee a CV of 34.5% for U1 and 8.7% for U2 with regard to the variable number of employees (supposed known at the sampling stage).
The experiment examines a situation typical of many real survey contexts, in
which the overall sample size $n$ is fixed and the marginal sample sizes are determined
by a quite simple rule that is a compromise between the Allocation Proportional to
Population size (APP) and the uniform allocation over the domains of a given partition:
\[ n_{bd} = \alpha_b\, n\, (N_{bd}/N) + (1 - \alpha_b)\, n / M_b, \qquad 0 \le \alpha_b \le 1. \]
The probabilities of both designs for the U1 and U2 partitions have been obtained as the
solution of the calibration problem below, where the initial probabilities are set
uniformly equal to $\pi_k = n/N$:
\[ \begin{cases} \min \sum_{k \in U} G(\tilde{\pi}_k; \pi_k) \\ \sum_{k \in U} \tilde{\pi}_k = n \\ \sum_{k \in U_{bd}} \tilde{\pi}_k = n_{bd} \quad (b = 1,\dots,B;\ d = 1,\dots,M_b - 1) \end{cases} \]
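The compromise rule between proportional and uniform allocation is easy to compute. In the sketch below the mixing coefficient is called alpha, and the domain population sizes are invented for illustration; whatever the coefficient, the domain sizes of one partition add up to the overall n.

```python
def compromise_allocation(n, domain_sizes, alpha):
    """n_bd = alpha * n * N_bd / N + (1 - alpha) * n / M_b for one partition."""
    total = sum(domain_sizes)
    m_b = len(domain_sizes)
    return [alpha * n * n_bd_pop / total + (1 - alpha) * n / m_b
            for n_bd_pop in domain_sizes]


# Hypothetical partition with three domains and overall sample size n = 360.
sizes = [600, 300, 100]
proportional = compromise_allocation(360, sizes, alpha=1.0)
uniform = compromise_allocation(360, sizes, alpha=0.0)
halfway = compromise_allocation(360, sizes, alpha=0.5)
```

Moving the coefficient towards 0 protects the smallest domains at the cost of the overall estimate; the results are in general not integers, so they still need rounding before being used as balancing constraints.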
7. Extension to the Fay Herriot Model
Let $\bar{b}$ denote the partition for which it is necessary to adopt a small
area indirect estimator, and let us consider the model (7.1.1) described in Rao (2005, p.
116). For the domains of the $\bar{b}$-th partition, this model may be
defined as
\[ {}_{\bar{b}d}\hat{\bar{t}}_{r,greg} = {}_{\bar{b}d}\hat{t}_{r,greg} / N_{\bar{b}d} = {}_{\bar{b}d}a'\varphi_r + {}_{\bar{b}d}h\;{}_{\bar{b}d}v_r + {}_{\bar{b}d}u_r, \]
where ${}_{\bar{b}d}a$ is a $p \times 1$ vector of area level covariates, $\varphi_r$ is an
unknown $p \times 1$ vector of regression coefficients, ${}_{\bar{b}d}h$ is a known
quantity related to the $\bar{b}d$-th domain, and the random effects
${}_{\bar{b}d}v_r \overset{iid}{\sim} (0, {}_{\bar{b}}\sigma_r^2)$ are independent
of the sampling errors ${}_{\bar{b}d}u_r \overset{ind}{\sim} (0, {}_{\bar{b}d}\sigma_{rt}^2)$ (approximately), being
\[ {}_{\bar{b}d}\sigma_{rt}^2 = V_p({}_{\bar{b}d}\hat{t}_{r,greg} \mid \hat{t}_{z,ht} = t_z) / N_{\bar{b}d}^2. \]
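For known ${}_{\bar{b}}\sigma_r^2$ and ${}_{\bar{b}d}\sigma_{rt}^2$ values, the BLUP estimator of ${}_{\bar{b}d}t_r$ is
\[ {}_{\bar{b}d}\hat{t}_{r,blup} = N_{\bar{b}d} \big( {}_{\bar{b}d}\gamma_r\; {}_{\bar{b}d}\hat{\bar{t}}_{r,greg} + (1 - {}_{\bar{b}d}\gamma_r)\; {}_{\bar{b}d}a'\hat{\varphi}_r \big), \]
being
\[ {}_{\bar{b}d}\gamma_r = {}_{\bar{b}}\sigma_r^2\; {}_{\bar{b}d}h^2 \big/ \big( {}_{\bar{b}d}\sigma_{rt}^2 + {}_{\bar{b}}\sigma_r^2\; {}_{\bar{b}d}h^2 \big). \]
The MSE of the BLUP estimator is
\[ MSE({}_{\bar{b}d}\hat{t}_{r,blup}) = N_{\bar{b}d}^2 \Big\{ {}_{\bar{b}d}\gamma_r\; {}_{\bar{b}d}\sigma_{rt}^2 + (1 - {}_{\bar{b}d}\gamma_r)^2\; {}_{\bar{b}d}a' \Big[ \sum_{d=1}^{M_{\bar{b}}} {}_{\bar{b}d}a\; {}_{\bar{b}d}a' \big/ \big( {}_{\bar{b}d}\sigma_{rt}^2 + {}_{\bar{b}}\sigma_r^2\; {}_{\bar{b}d}h^2 \big) \Big]^{-1} {}_{\bar{b}d}a \Big\}. \]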

Looking at the previous expressions, it is possible to note that, for a given
value of the variance ${}_{\bar{b}}\sigma_r^2$, the $MSE({}_{\bar{b}d}\hat{t}_{r,blup})$ can be controlled
in the sampling design phase by defining a proper value of the variance
${}_{\bar{b}d}\sigma_{rt}^2$.
An iterative procedure finds the inclusion probabilities $\pi_k$ which
guarantee the minimum sample size and ensure that the
following constraints are respected:
\[ \frac{N}{N-Q} \sum_{k \in U} \Big( \frac{1}{\pi_k} - 1 \Big)\, {}_{bd}\eta_{r,k}^2 \le {}_{bd}V_r \quad (b \ne \bar{b};\ d = 1,\dots,M_b;\ r = 1,\dots,R) \]
and
\[ MSE({}_{\bar{b}d}\hat{t}_{r,blup}) \le {}_{\bar{b}d}V_r \quad (d = 1,\dots,M_{\bar{b}};\ r = 1,\dots,R). \]