TRUNCATED AND CENSORED VARIABLES

SAMPLE SELECTION
Cheti Nicoletti
ISER, University of Essex
2009
Wage equation and labour
participation for women
Gourieroux C. (2000), Econometrics of Qualitative
Dependent Variables, Cambridge University Press,
Cambridge
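• Let y* be the potential offered wage and let w be the
reservation wage; then the observed wage y and the participation
indicator d are given by
\[
y = \begin{cases} y^* & \text{if } y^* > w \\ 0 & \text{if } y^* \le w \end{cases}
\qquad
d = \begin{cases} 1 & \text{if } y^* > w \text{, i.e. the woman works} \\ 0 & \text{if } y^* \le w \text{, i.e. the woman does not work} \end{cases}
\]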
• Let us consider the following very simple earnings
profile equation
\[ y^* = \beta_0 + \beta_1\,\text{age} + \varepsilon \]
Women in the labour force are not a
random sample
• “Women’s labour force participation rates are highly
dependent on age.” Gourieroux (2000)
• Labour participation is in general lower for women aged:
– 16-20 because some women are still studying
– 25-44 for work interruption linked to children
– 55-60 because some women prefer to retire early
• Presumably the earnings observed for women aged
– 16-20 are lower than if all women worked
– 25-44 are higher, because women with higher earnings are less
inclined to interrupt work
– 55-60 are higher, because women with higher earnings are less
inclined to retire early
Women career profile
[Figure: earnings (vertical axis, 0 to 5000) plotted against age (horizontal axis, 0 to 80)]
Sample selection model
Labour participation equation
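• Probit model for labour participation
\[ d = \begin{cases} 1 & \text{if a woman works } (d^* > 0) \\ 0 & \text{if a woman does not work } (d^* \le 0) \end{cases} \]
where d* is the propensity to work:
\[ d^* = z\delta + u, \qquad u \sim \text{iid } N(0,1) \]
\[ \Pr(d = 1 \mid z) = \Phi(z\delta) \]
\[ L = \prod_{i=1}^{n} \Phi(z_i\delta)^{d_i}\,\left[1 - \Phi(z_i\delta)\right]^{1-d_i} \]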
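The probit likelihood above can be evaluated directly. The following is a minimal sketch in Python using only the standard library; the function name `probit_loglik` and the toy data are illustrative assumptions, not part of the original slides.

```python
from math import log
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal; cdf() plays the role of Phi


def probit_loglik(delta, z_rows, d):
    """Sum over i of d_i*log(Phi(z_i delta)) + (1 - d_i)*log(1 - Phi(z_i delta))."""
    total = 0.0
    for z_i, d_i in zip(z_rows, d):
        index = sum(z_ij * delta_j for z_ij, delta_j in zip(z_i, delta))
        # 1 - Phi(a) equals Phi(-a), which is numerically safer in the tails
        prob = STD_NORMAL.cdf(index) if d_i == 1 else STD_NORMAL.cdf(-index)
        total += log(prob)
    return total


# Hypothetical toy data: a constant and one regressor (e.g. rescaled age)
z_rows = [(1.0, -1.0), (1.0, 0.0), (1.0, 0.5), (1.0, 2.0)]
d = [0, 1, 0, 1]

print(probit_loglik((0.0, 0.0), z_rows, d))  # every probability is 0.5 here
print(probit_loglik((0.0, 1.0), z_rows, d))  # a positive slope fits these data better
```

Maximising this function over delta (for instance with a Newton-type routine) would give the probit estimates; the sketch only evaluates the objective.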
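Joint model for the log-earnings and the labour participation equations
Generalized TOBIT MODEL
\[ y^* = x\beta + \varepsilon, \quad \varepsilon \sim \text{iid } N(0, \sigma^2); \qquad y^* \text{ is observed only if } d = 1 \]
• Possible candidates for x: education dummies, age, work
experience
\[ d^* = z\delta + u, \quad u \sim \text{iid } N(0,1), \qquad d = \begin{cases} 1 & \text{if } d^* > 0 \\ 0 & \text{if } d^* \le 0 \end{cases} \]
• Possible candidates for z: age, education, number of children,
dummies for the presence of children <5, for cohabiting, for
widow, regional unemployment rate.
\[ \begin{pmatrix} \varepsilon \\ u \end{pmatrix} \sim \text{iid } N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma^2 & \sigma_{\varepsilon u} \\ \sigma_{\varepsilon u} & 1 \end{pmatrix} \right) \]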
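Bivariate normal
\[ \text{If } \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} \text{ is } N\!\left( \begin{pmatrix} m_1 \\ m_2 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix} \right) \]
\[ \text{then } y_1 \text{ is } N(m_1, \sigma_1^2) \text{ and } y_2 \mid y_1 \text{ is } N\!\left( m_2 + \sigma_{12}\,(y_1 - m_1)/\sigma_1^2,\; \sigma_2^2 - \sigma_{12}^2/\sigma_1^2 \right) \]
\[ \text{If } \begin{pmatrix} \varepsilon \\ u \end{pmatrix} \text{ is } N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma^2 & \sigma_{\varepsilon u} \\ \sigma_{\varepsilon u} & 1 \end{pmatrix} \right) \]
\[ \text{then } u \text{ is } N(0,1) \text{ and } \varepsilon \mid u \text{ is } N\!\left( \sigma_{\varepsilon u}\,u,\; \sigma^2 - \sigma_{\varepsilon u}^2 \right) \]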
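Truncated Normal
If Y ~ N(μ, σ²) then Y | Y > c follows a truncated normal distribution with
\[ E(Y \mid Y > c) = \mu + \sigma\,\lambda(\alpha) \quad \text{and} \quad \operatorname{Var}(Y \mid Y > c) = \sigma^2 \left[ 1 + \alpha\,\lambda(\alpha) - \lambda(\alpha)^2 \right] \]
\[ \text{where } \lambda(\alpha) = \frac{\phi(\alpha)}{1 - \Phi(\alpha)} \text{ (inverse Mills ratio) and } \alpha = \frac{c - \mu}{\sigma} \]
Suggestions for the proof
\[ Z = \frac{Y - \mu}{\sigma} \sim N(0,1); \qquad \text{if } Y > c \text{ then } Z > \frac{c - \mu}{\sigma} = \alpha \]
\[ \frac{d\phi(z)}{dz} = -\phi(z)\,z \quad \text{and} \quad \frac{d[\phi(z)\,z]}{dz} = -\phi(z)\,z^2 + \phi(z) \;\Rightarrow\; \phi(z)\,z^2 = \phi(z) - \frac{d[\phi(z)\,z]}{dz} \]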
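As a check on the truncated-normal formulas, the closed-form mean and variance can be compared with a direct numerical integration of the normal density above c. This is a minimal sketch with the Python standard library; the function names and the integration settings (`steps`, `width`) are illustrative assumptions.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def inverse_mills(alpha):
    """lambda(alpha) = phi(alpha) / (1 - Phi(alpha))."""
    return STD_NORMAL.pdf(alpha) / (1.0 - STD_NORMAL.cdf(alpha))


def truncated_moments(mu, sigma, c):
    """Closed-form mean and variance of Y | Y > c when Y ~ N(mu, sigma^2)."""
    alpha = (c - mu) / sigma
    lam = inverse_mills(alpha)
    mean = mu + sigma * lam
    var = sigma ** 2 * (1.0 + alpha * lam - lam ** 2)
    return mean, var


def truncated_moments_numeric(mu, sigma, c, steps=200_000, width=12.0):
    """Midpoint-rule integration of the density over (c, mu + width*sigma)."""
    dist = NormalDist(mu, sigma)
    h = (mu + width * sigma - c) / steps
    m0 = m1 = m2 = 0.0
    for k in range(steps):
        y = c + (k + 0.5) * h
        w = dist.pdf(y) * h
        m0 += w          # probability mass above c
        m1 += w * y      # first raw moment
        m2 += w * y * y  # second raw moment
    mean = m1 / m0
    return mean, m2 / m0 - mean ** 2


print(truncated_moments(2.0, 1.5, 1.0))
print(truncated_moments_numeric(2.0, 1.5, 1.0))
```

For the standard normal truncated at c = 0 the formulas reduce to the half-normal moments, a mean of sqrt(2/pi) and a variance of 1 - 2/pi, which gives a second quick sanity check.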


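Sample selection problem
\[ E(y^* \mid d = 1, x, z) = x\beta + E(\varepsilon \mid d = 1, x, z) \]
\[ E(\varepsilon \mid d = 1, x, z) = E(\varepsilon \mid u > -z\delta) = \sigma_{\varepsilon u}\,E(u \mid u > -z\delta) = \sigma_{\varepsilon u}\,\frac{\phi(z\delta)}{\Phi(z\delta)} \]
\[ E(y^* \mid d = 1, x, z) = x\beta + \sigma_{\varepsilon u}\,\frac{\phi(z\delta)}{\Phi(z\delta)} \]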
Two-step estimation
• STEP 1: estimation of a probit model for the
probability of being in the labour market,
\[ \prod_i \Pr(d_i = 1 \mid z_i)^{d_i}\,\Pr(d_i = 0 \mid z_i)^{1-d_i} = \prod_i \Phi(z_i\delta)^{d_i}\,\Phi(-z_i\delta)^{1-d_i} \]
• STEP 2: estimation of the regression model
with an additional variable (the inverse Mills
ratio) using the subsample of individuals with
d_i = 1 (and using some IV restrictions)
\[ Y = X\beta + \sigma_{\varepsilon u}\,\frac{\phi(Z\delta)}{\Phi(Z\delta)} + v \]
Testing selectivity
\[ Y = X\beta + \sigma_{\varepsilon u}\,\frac{\phi(Z\delta)}{\Phi(Z\delta)} + v \]
• If the error terms ε and u are uncorrelated, then the
selection problem is ignorable.
H0: σ_εu = 0
• Testing H0 is equivalent to testing whether the
coefficient of the additional variable in the
equation is zero (using, for example, a Wald test)
• Notice that the errors are heteroskedastic, so a suitable
estimator should be adopted for the
standard errors
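The role of the additional regressor can be illustrated by simulation. The sketch below, in standard-library Python, generates data from the generalized Tobit model and compares OLS on the selected sample with and without the inverse Mills ratio term. All parameter values, the variable `w` (a regressor that enters the participation equation only) and the function names are illustrative assumptions. To keep the example short, the inverse Mills ratio is computed at the true δ rather than at a first-step probit estimate.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def inverse_mills_selected(index):
    """phi(z delta) / Phi(z delta): the correction term when d = 1."""
    return STD_NORMAL.pdf(index) / STD_NORMAL.cdf(index)


def ols(rows, y):
    """OLS via the normal equations (X'X) b = X'y, solved by Gauss-Jordan."""
    k = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


def simulate(n=40_000, seed=0):
    rng = random.Random(seed)
    beta0, beta1 = 1.0, 0.5      # outcome equation: y* = beta0 + beta1*x + eps
    delta = (0.0, 0.5, 1.0)      # selection equation: constant, x, w
    sigma, sigma_eu = 1.0, 0.7   # sd of eps and Cov(eps, u)
    naive_rows, corrected_rows, y_obs = [], [], []
    for _ in range(n):
        x, w, u = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        # eps has variance sigma^2 and covariance sigma_eu with u
        eps = sigma_eu * u + (sigma ** 2 - sigma_eu ** 2) ** 0.5 * rng.gauss(0, 1)
        index = delta[0] + delta[1] * x + delta[2] * w
        if index + u > 0:        # d = 1: the outcome is observed
            y_obs.append(beta0 + beta1 * x + eps)
            naive_rows.append((1.0, x))
            corrected_rows.append((1.0, x, inverse_mills_selected(index)))
    return ols(naive_rows, y_obs), ols(corrected_rows, y_obs)


naive, corrected = simulate()
print("OLS on the selected sample:  ", naive)
print("with the inverse Mills term: ", corrected)
```

In this design the uncorrected intercept should be pushed well above its true value of 1, while the corrected regression should recover coefficients close to (1, 0.5) and a coefficient on the inverse Mills ratio close to σ_εu = 0.7, the quantity tested under H0.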
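Generalized Tobit: Maximum Likelihood Estimation
\[ d^* = z\delta + u, \qquad y^* = x\beta + \varepsilon \]
\[ \begin{pmatrix} \varepsilon \\ u \end{pmatrix} \sim \text{iid } N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma^2 & \sigma_{\varepsilon u} \\ \sigma_{\varepsilon u} & 1 \end{pmatrix} \right) \;\Rightarrow\; \begin{pmatrix} y^* \\ d^* \end{pmatrix} \Big|\, x, z \sim \text{iid } N\!\left( \begin{pmatrix} x\beta \\ z\delta \end{pmatrix}, \begin{pmatrix} \sigma^2 & \sigma_{\varepsilon u} \\ \sigma_{\varepsilon u} & 1 \end{pmatrix} \right) \]
\[ y^* \mid x \sim \text{iid } N(x\beta, \sigma^2) \]
\[ d^* \mid y^*, x, z \sim \text{iid } N\!\left( z\delta + \frac{\sigma_{\varepsilon u}}{\sigma^2}\,(y^* - x\beta),\; 1 - \frac{\sigma_{\varepsilon u}^2}{\sigma^2} \right) \]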
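heckman
• The heckman command estimates the Generalized Tobit (Tobit of the 2nd type)
model by ML (the default) or by two-step estimation (option twostep)
\[ y^* = x\beta + \varepsilon, \quad \varepsilon \sim \text{iid } N(0, \sigma^2) \]
\[ d^* = z\delta + u, \quad u \sim \text{iid } N(0,1) \]
\[ d = \begin{cases} 1 & \text{if } d^* > 0 \text{ (employed people)} \\ 0 & \text{if } d^* \le 0 \text{ (otherwise)} \end{cases} \qquad y = \begin{cases} y^* & \text{if } d^* > 0 \\ \text{. (missing)} & \text{otherwise} \end{cases} \]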
heckman y x1 x2 … xk, select(z1 z2 … zs)
heckman y x1 x2 … xk, select(d = z1 z2 … zs)
heckman y x1 x2 … xk, select(z1 z2 … zs) twostep
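The ML option maximises the likelihood of the generalized Tobit model. As a rough illustration of what one observation contributes, the sketch below (standard-library Python) combines Pr(d* ≤ 0 | z) for non-participants with f(y* | x) Pr(d* > 0 | y*, x, z) for participants, using the conditional normal distribution of d* given y*. The function name and argument layout are illustrative assumptions; this is not Stata's internal code.

```python
from math import log, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def heckman_loglik_i(d_i, y_i, xb_i, zd_i, sigma, sigma_eu):
    """Log-likelihood contribution of one observation.

    xb_i = x_i*beta and zd_i = z_i*delta are the two linear indices;
    y_i is ignored when d_i == 0. Requires |sigma_eu| < sigma.
    """
    if d_i == 0:
        return log(STD_NORMAL.cdf(-zd_i))                 # Pr(d* <= 0 | z)
    resid = y_i - xb_i
    log_density = log(STD_NORMAL.pdf(resid / sigma) / sigma)   # f(y* | x)
    cond_mean = zd_i + sigma_eu / sigma ** 2 * resid      # E(d* | y*, x, z)
    cond_sd = sqrt(1.0 - sigma_eu ** 2 / sigma ** 2)      # sd(d* | y*, x, z)
    return log_density + log(STD_NORMAL.cdf(cond_mean / cond_sd))


# Hypothetical values for a non-participant and a participant
print(heckman_loglik_i(0, None, 1.5, 0.3, 1.0, 0.5))
print(heckman_loglik_i(1, 2.0, 1.5, 0.3, 1.0, 0.5))
```

With sigma_eu = 0 the participant's contribution separates into a regression part and a probit part, which is the case in which selection is ignorable.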
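Generalized Tobit: Maximum Likelihood Estimation
\[ L = \prod_{i=1}^{n} \Pr(d_i^* \le 0 \mid z_i)^{1-d_i} \left[ f(y_i^* \mid x_i)\,\Pr(d_i^* > 0 \mid z_i, y_i^*) \right]^{d_i} \]
\[ \Pr(d_i^* \le 0 \mid z_i) = \Pr(z_i\delta + u_i \le 0) = \Phi(-z_i\delta), \qquad f(y_i^* \mid x_i) = \frac{1}{\sigma}\,\phi\!\left( \frac{y_i^* - x_i\beta}{\sigma} \right) \]
\[ d_i^* \mid y_i^*, x_i, z_i \sim \text{iid } N\!\left( z_i\delta + \frac{\sigma_{\varepsilon u}}{\sigma^2}\,(y_i^* - x_i\beta),\; 1 - \frac{\sigma_{\varepsilon u}^2}{\sigma^2} \right) \]
\[ \Pr(d_i^* > 0 \mid y_i^*, z_i, x_i) = \Pr\!\left( d_i^* - z_i\delta - \frac{\sigma_{\varepsilon u}}{\sigma^2}(y_i^* - x_i\beta) > -z_i\delta - \frac{\sigma_{\varepsilon u}}{\sigma^2}(y_i^* - x_i\beta) \;\Big|\; y_i^*, z_i, x_i \right) = \Phi\!\left( \frac{ z_i\delta + \frac{\sigma_{\varepsilon u}}{\sigma^2}(y_i^* - x_i\beta) }{ \sqrt{ 1 - \sigma_{\varepsilon u}^2/\sigma^2 } } \right) \]
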
                                   Joint model          Two-step estimation
Variable                           Coeff    p-value     Coeff    p-value
LABOUR PARTICIPATION MODEL
Constant                           -0.57    0.06        -0.99    0.04
No. children <18                   -0.12    0.00        -0.13    0.00
No. children <4                    -0.09    0.00        -0.07    0.00
log husband's wage                 -0.10    0.04        -0.08    0.06
Years of education                  0.15    0.00         0.14    0.00
Age                                 0.81    0.01         0.91    0.02
Age square                         -0.12    0.03        -0.14    0.01
Correlation between errors          0.35    0.00
Inverse Mills ratio                                      0.29    0.00
WAGE MODEL
Constant                            4.50    0.02         4.70    0.01
Years of education                  0.11    0.01         0.10    0.00
Work experience                     0.13    0.01         0.08    0.01
Work experience square              0.00    0.02         0.01    0.00
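Joint model for log-income and response probability
\[ y^* = x\beta + \varepsilon, \quad \varepsilon \sim \text{iid } N(0, \sigma^2) \]
• Possible candidates for x: education dummies, age, work
experience
\[ d^* = z\delta + u, \quad u \sim \text{iid } N(0,1) \]
• d* is the propensity to respond to the earnings question
• z: mode of interview, education, gender, age, etc.
\[ \begin{pmatrix} \varepsilon \\ u \end{pmatrix} \sim \text{iid } N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma^2 & \sigma_{\varepsilon u} \\ \sigma_{\varepsilon u} & 1 \end{pmatrix} \right) \]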
Item nonresponse for income equation
or poverty model in cross section
sample surveys:
Potential explanatory variables:
• Socio-demographic variables: age, gender, level
of education, number of adults, number of
children.
• Situational economic circumstance: labour
status activity.
• Data collection characteristics: mode of the
interview, number of visits, duration of the
interview. (These are plausible IV)
Maximum Likelihood estimation of the joint model

RESPONSE MODEL
Variable                           Coeff    p-value
Constant                            2.13    0.00
Duration of the interview          -0.34    0.12
No. of interview attempts          -0.02    0.01
Mode (type) of interview (reference category: Post interview)
  Face to face interview            0.15    0.00
  Telephone interview               0.05    0.00
Age                                -0.02    0.01
Age square                          0.00    0.45
Female                              0.31    0.01
Years of education                  0.02    0.05

INCOME MODEL
Variable                           Coeff    p-value
Constant                            2.10    0.00
Years of education                  0.02    0.00
Labour Status (reference category: employed)
  Inactive                         -0.13    0.01
  Self-employed                    -0.21    0.02
  Unemployed                       -0.56    0.00
Age                                 0.02    0.00
Age square                         -0.00    0.00

Correlation between errors         -0.23    0.00
Attrition in panel surveys has two possible
causes: failed contact and refusal
The potential variables explaining attrition (contact and
cooperation) are lagged variables observed in the last
wave.
The equation of interest has to use lagged variables
(otherwise we have missing explanatory variables too)
• Socio-demographic variables: age, gender, level of
education, number of adults, number of children.
• Social integration: talking often to neighbours,
cohabitation, house ownership.
• Situational economic circumstances: labour status activity,
household equivalised income.
• Data collection characteristics: mode of the interview,
number of visits, duration of the interview, same
interviewer across waves, duration of the panel, length of
the fieldwork. (These are plausible IVs)
Attrition due to lack of cooperation (BHPS 1994-96)
Variables                            Coefficients    Test      p-value
Wave 1996                             0.17108          2.21    0.027
Workload                             -0.01619        -22.04    0.000
Item nonresponse by interviewer      -3.08725         -3.74    0.000
Co-operation rate by interviewer      1.62772          4.85    0.000
Age 35 or less                       -0.05109         -0.58    0.560
Age 60 or more                       -0.01904         -0.15    0.882
Female                                0.20994          2.77    0.006
Living without a spouse              -0.15878         -1.90    0.057
No. of children                      -0.03666         -0.96    0.337
No. of adults                        -0.06812         -1.68    0.092
Unemployed                           -0.38718         -3.00    0.003
Inactive                              0.16281          1.64    0.100
No. of visits                        -0.02887         -2.33    0.020
Same interviewer                      0.61158          7.78    0.000
Item nonresponse                      0.04194          0.20    0.843
Constant                              1.54751          7.30    0.000
Wald joint significance test          2068.9
No. obs.                              14265
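Weighted estimation
\[ y^* = x\beta + \varepsilon, \quad \varepsilon \sim \text{iid } N(0, \sigma^2) \]
\[ d^* = z\delta + u, \quad u \sim \text{iid } N(0,1) \]
\[ d = \begin{cases} 1 & \text{if } d^* > 0 \text{, i.e. if } y^* \text{ is observed} \\ 0 & \text{if } d^* \le 0 \text{, i.e. if } y^* \text{ is missing} \end{cases} \qquad y = \begin{cases} y^* & \text{if } d^* > 0 \\ \text{. (missing)} & \text{otherwise} \end{cases} \]
Weights are given by the inverse of
\[ \pi = \Pr(d^* > 0 \mid z) = \Pr(d = 1 \mid z) = \Phi(z\delta) \]
Assumption of missing at random (MAR):
y* is independent of d given the observed variables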
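Weighted estimation
\[ y^* = x\beta + \varepsilon, \quad \varepsilon \sim \text{iid } N(0, \sigma^2) \]
OLS is based on \( E[x'(y^* - x\beta) \mid x] = 0 \)
but, since y* is observed only when d = 1, we can consider only
\( E[x'(y^* - x\beta)\,d \mid x] = 0 \)
If d is independent of y* given (x, z) then we can prove that
\( E[x'(y^* - x\beta)\,d\,\pi^{-1} \mid x] = 0 \)
so that the weighted least squares estimator is consistent
Proof that \( E[x'(y^* - x\beta)\,d\,\pi^{-1} \mid x] = 0 \)
Conditioning on z and then integrating it out (marginalizing):
\[ E_Z\!\left( E[x'(y^* - x\beta)\,d\,\pi^{-1} \mid x, z] \right) \]
\[ = E_Z\!\left( E[x'(y^* - x\beta) \mid x, z, d = 1]\,\Pr(d = 1 \mid x, z)\,\pi^{-1} \right) \]
\[ = E_Z\!\left( E[x'(y^* - x\beta) \mid x, z] \right) = E[x'(y^* - x\beta) \mid x] = 0 \]
(the step from the second to the third line uses π = Pr(d = 1 | x, z) and the independence of d and y* given (x, z))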
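The consistency argument can be checked numerically. The sketch below (standard-library Python) simulates a case where response depends on a variable z that also affects y* but is not among the regressors x, then compares unweighted and inverse-probability-weighted least squares on the respondents. All parameter values and function names are illustrative assumptions, and π is taken as known rather than estimated by a probit.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def weighted_ols(rows, y, weights):
    """Solve (X'WX) b = X'Wy by Gauss-Jordan elimination; W is diagonal."""
    k = len(rows[0])
    a = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights)) for j in range(k)]
         + [sum(w * r[i] * yi for r, yi, w in zip(rows, y, weights))]
         for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


def simulate_ipw(n=40_000, seed=1):
    rng = random.Random(seed)
    rows, y, pi = [], [], []
    for _ in range(n):
        x, z = rng.gauss(0, 1), rng.gauss(0, 1)
        # z is independent of x, so regressing y* on x in the full
        # population gives intercept 1 and slope 2
        y_star = 1.0 + 2.0 * x + 0.8 * z + rng.gauss(0, 1)
        prob = STD_NORMAL.cdf(0.5 + 0.5 * z)   # pi = Pr(d = 1 | z), assumed known
        if rng.random() < prob:                # d depends on z only (MAR given z)
            rows.append((1.0, x))
            y.append(y_star)
            pi.append(prob)
    unweighted = weighted_ols(rows, y, [1.0] * len(y))
    ipw = weighted_ols(rows, y, [1.0 / p for p in pi])
    return unweighted, ipw


unweighted, ipw = simulate_ipw()
print("respondents, unweighted:", unweighted)
print("respondents, weighted by 1/pi:", ipw)
```

In this design the unweighted intercept is expected to be biased upwards, because respondents have above-average z, while the weighted estimates should be close to (1, 2).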
How to use weights in Stata
• Most Stata commands can deal with weighted data.
Stata allows four kinds of weights:
1. fweights, or frequency weights, are weights that
indicate the number of duplicated observations.
2. pweights, or sampling weights, are weights that
denote the inverse of the probability that the
observation is included due to the sampling
design, nonresponse or sample selection.
3. aweights, or analytic weights, are weights that are
inversely proportional to the variance of an
observation; i.e., the variance of the j-th observation
is assumed to be sigma^2/w_j, where w_j are the
weights.
4. iweights, or importance weights, are weights that
indicate the "importance" of the observation in some
vague sense.
Option pweights
• Usually sample surveys provide weights to take account of the sampling
design and nonresponse.
• Let p be the individual weight
• Then we can run a regression with weighted observations
regress y x1 x2 … xk [pweight=p]
• Suppose we have a random sample affected by nonresponse,
but weights to take account of unit nonresponse are not available
• A possible way to estimate your own weights is the
following:
probit d z1 z2 … zs
predict prop
gen invprop=1/prop
reg y x1 x2 … xk [pweight=invprop]
For complex survey designs it is
better to use
• svyset [pweight=p]
• svy: regress y x1 x2 … xk
• svyset has options for cluster sampling
designs and other complex designs
• To declare a survey design with strata:
• svyset [pweight=p], strata(stratid)
Stata propensity score methods for
evaluation of treatment
Abadie A., Drukker D., Herr J.L., Imbens G.W. (2001),
Implementing Matching Estimators for Average Treatment
Effects in Stata, The Stata Journal, 1, 1-18
http://ksghome.harvard.edu/~.aabadie.academic.ksg/software.html
Becker S.O., Ichino A. (2002), Estimation of average treatment
effects based on propensity scores, The Stata Journal, 2, 358-377
http://www.lrz-muenchen.de/~sobecker/pscore.html
Sianesi B. (2001), Implementing Propensity Score Matching
Estimators with STATA, UK Stata Users Group, VII Meeting
London, http://ideas.repec.org/c/boc/bocode/s432001.html
Some references for regressions with sample selection
• Buchinsky, M. (2001), Quantile regression with sample selection: Estimating women's
return to education in the U.S., Empirical Economics, 26, 86-113.
• Ibrahim, J.G., Chen, M.-H., Lipsitz, S.R., Herring, A.H. (2005), Missing-data methods
for generalized linear models: A comparative review, Journal of the American
Statistical Association, 100, 469, 332-346.
• Lipsitz, S.R., Fitzmaurice, G.M., Molenberghs, G., Zhao, L.P. (1997), Quantile
regression methods for longitudinal data with drop-outs, Applied Statistics, 46, 463-476.
• Robins, J.M., Rotnitzky, A. (1995), Semiparametric Efficiency in Multivariate
Regression Models With Missing Data, Journal of the American Statistical
Association, 90, 122-129.
• Vella, F. (1998), Estimating models with sample selection bias: a survey,
The Journal of Human Resources, vol. 33, 127-169.
• Wooldridge, J.M. (2007), Inverse probability weighted M-Estimation for General
missing data problems, Journal of Econometrics, 141, 2, 1281-1301.