Matching Methods

Matching Estimators
Methods of Economic
Investigation
Lecture 11
Last Time

General Theme: If you don’t have an
experiment, how do you get a ‘control
group’?

Difference in Differences



How it works: compare before-after between
two comparable entities
Assumptions: Fixed differences over time
Tests to improve credibility of assumption


Pre-treatment trends
Ashenfelter Dip
Today’s Class

Another way to get a control group:
Matching



Assumptions for identification
Specific form of matching called “propensity
score matching”
Is it better than just a plain old regression?
The Counterfactual Framework

Counterfactual: what would have
happened to the treated subjects, had
they not received treatment?

Idea: individuals selected into treatment
and nontreatment groups have potential
outcomes in both states:


the one in which they are observed
the one in which they are not observed.
Reminder of Terms

For the treated group, we have observed
mean outcome under the condition of
treatment E(Y1|T=1) and unobserved
mean outcome under the condition of
nontreatment E(Y0|T=1).

For the control group we have both
observed mean E(Y0|T=0) and unobserved
mean E(Y1|T=0)
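
A small simulation can make these four means concrete. This is only a sketch: the data-generating process (a single covariate x that raises both treatment take-up and the untreated outcome, and a constant treatment effect of 1) is an assumption made for illustration, not something from the lecture. In real data E(Y0|T=1) is never observed; here it can be computed only because both potential outcomes are simulated.

```python
import random

random.seed(0)

# Hypothetical data-generating process (assumed for illustration only):
# x raises both the chance of treatment and the untreated outcome.
rows = []
for _ in range(20000):
    x = random.random()
    t = 1 if random.random() < x else 0   # higher x -> more likely treated
    y0 = 2.0 * x + random.gauss(0, 1)     # potential outcome without treatment
    y1 = y0 + 1.0                         # potential outcome with treatment
    rows.append((t, y0, y1))

def mean(values):
    values = list(values)
    return sum(values) / len(values)

e_y1_t1 = mean(y1 for t, y0, y1 in rows if t == 1)  # E(Y1|T=1): observed
e_y0_t1 = mean(y0 for t, y0, y1 in rows if t == 1)  # E(Y0|T=1): unobserved in practice
e_y0_t0 = mean(y0 for t, y0, y1 in rows if t == 0)  # E(Y0|T=0): observed

att = e_y1_t1 - e_y0_t1             # effect of treatment on the treated
naive = e_y1_t1 - e_y0_t0           # raw treated-minus-control comparison
selection_bias = e_y0_t1 - e_y0_t0  # naive = att + selection_bias
print(att, naive, selection_bias)
```

The raw comparison overstates the effect here because the treated would have had higher outcomes even without treatment.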
What is “matching”?

Pairing treatment and comparison units
that are similar in terms of observable
characteristics

Can do this in regressions (with
covariates) or prior to regression to define
your treatment and control samples
Matching Assumption

Conditioning on observables (X) we
can take assignment to treatment ‘as if’
random, i.e.
$(Y_{i0}, Y_{i1}) \perp T_i \mid X_i$

What is the implicit statement?
Unobservables (stuff not in X) play no
role in treatment assignment (T)
A matched estimator

$E[Y_1 - Y_0 \mid X, T=1] =$
$\{E[Y_1 \mid X, T=1] - E[Y_0 \mid X, T=0]\} - \{E[Y_0 \mid X, T=1] - E[Y_0 \mid X, T=0]\}$
First term: matched treatment effect
Second term: assumed to be zero
Key idea: all selection occurs only through
observed X
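
If X is discrete, the matched comparison can be sketched as exact matching: compare treated and control outcomes within each value of X, then average those differences over the treated. The function name `exact_match_att` and the toy data are hypothetical and used only for illustration.

```python
from collections import defaultdict

def exact_match_att(rows):
    """rows: iterable of (x, t, y) with discrete x. Returns the matched estimate."""
    cells = defaultdict(lambda: {0: [], 1: []})
    for x, t, y in rows:
        cells[x][t].append(y)
    total, n_treated = 0.0, 0
    for groups in cells.values():
        treated, control = groups[1], groups[0]
        if not treated or not control:
            continue  # no comparison is possible in this cell
        diff = sum(treated) / len(treated) - sum(control) / len(control)
        total += diff * len(treated)   # weight each cell by its number of treated
        n_treated += len(treated)
    return total / n_treated

# Toy data: (x, t, y)
rows = [("a", 1, 5), ("a", 0, 3), ("a", 0, 4),
        ("b", 1, 10), ("b", 1, 12), ("b", 0, 9)]
matched = exact_match_att(rows)

treated_y = [y for x, t, y in rows if t == 1]
control_y = [y for x, t, y in rows if t == 0]
naive = sum(treated_y) / len(treated_y) - sum(control_y) / len(control_y)
print(matched, naive)
```

In the toy data the treated are concentrated in cell "b", where outcomes are higher for everyone, so the raw difference is larger than the within-cell comparison.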
Just do a regression…

Regressions are flexible




if you only put in a “main effect” the
regression will estimate a purely linear
specification
Interactions and fixed effects allow different
slopes and intercepts for any combination of
variables
Can include quadratic and higher order
polynomial terms if necessary
But fundamentally specify additively separable
terms
Sometimes regression not feasible…

The issue is largely related to
dimensionality

Each time you add an observable
characteristic, you partition your data into
bins.

Imagine all variables are zero-one variables


Then if you have k X’s, you have $2^k$ potential
different values
Need enough observations in each value to
estimate that precisely
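
A quick way to see the problem is to count how many cells end up nearly empty as k grows. This is a sketch under the assumption that each binary covariate is an independent coin flip. The sample size of 1000 and the "fewer than 5 observations" threshold are arbitrary choices for illustration.

```python
import random
from itertools import product

random.seed(1)

def cell_counts(n_obs, k):
    """Count observations in each of the 2**k cells formed by k binary covariates."""
    counts = {cell: 0 for cell in product((0, 1), repeat=k)}
    for _ in range(n_obs):
        cell = tuple(random.randint(0, 1) for _ in range(k))
        counts[cell] += 1
    return counts

for k in (2, 5, 10):
    counts = cell_counts(n_obs=1000, k=k)
    n_cells = len(counts)
    n_thin = sum(1 for c in counts.values() if c < 5)
    print(f"k={k}: {n_cells} cells, {n_thin} with fewer than 5 observations")
```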
Reducing the Dimensionality

Use of propensity score: Probability of
receiving treatment, conditional on
covariates

Key assumption: if $(Y_{i0}, Y_{i1}) \perp T_i \mid X_i$
and defining $p(X_i) = \Pr(T_i = 1 \mid X_i)$,
then $(Y_{i0}, Y_{i1}) \perp T_i \mid p(X_i)$
If this is true, can interpret estimate of
differences in outcomes conditional on X as
causal effect
Why not control for X

Matching is flexible in a different way

Avoid specifying a particular form for the outcome
equation, decision process or unobservable
term

Just need the “right” observables

Flexible in the form of how X’s affect treatment
probability but inflexible in how treatment
probability affects outcome
Participation decision

Remember our 3 groups:



Always takers: take the treatment if offered
AND take the treatment if not offered
We observe them if T=0 but R=1
Never takers: don’t take the treatment if not
offered AND don’t take it even if it is offered
We observe them if T=1 but R=0
Compliers: just do what they’re assigned to do
T=1 & R=1 OR T=0 & R=0
Conditions for Matching to Work
Take 1-sided non-compliance for ease…if
not offered, can’t take it, but some people
don’t take it even if offered
Error term for never takers: if it’s zero → perfect compliance, so
conditioning on X replicates the experimental
setting
Error term for compliers: on average, conditional
on X, unobservables are the same
Common Support

Can only exist if there is a region of
“common support”



People with the same X values are in both the
treatment and the control groups
Let S be the set of all observables X, then
0 < Pr(T=1 | X) < 1 for some S* subset of S
Intuition: Someone in control close
enough to match to treatment unit OR
enough overlap in the distribution of
treated and untreated individuals
[Figure: kernel densities of x for the treatment and control groups, titled “Lots of common support”. Between the red and blue lines is the area of common support.]
[Figure: kernel densities of x for the treatment and control groups, titled “Not so much common support”.]
Trimming
Option 1: define min and max values of X for the region
of overlap and drop all units not in that
region
Option 2: remove regions which do not have
strictly positive propensity score density in both
treatment and control distributions
(Smith and Todd, 2005)
Both are quite similar when used in
practice, but if sections are missing in the middle
of the distribution, use the second option
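
The min/max rule can be sketched in a few lines. As an assumption for illustration, it is applied here to estimated propensity scores rather than to X itself, since the score is one-dimensional. The function names and the scores are hypothetical.

```python
def minmax_common_support(p_treated, p_control):
    """Return (low, high): the overlap of the two groups' propensity-score ranges."""
    low = max(min(p_treated), min(p_control))
    high = min(max(p_treated), max(p_control))
    if low > high:
        raise ValueError("no overlap between treated and control scores")
    return low, high

def trim(scores, low, high):
    """Keep only the scores inside the region of overlap."""
    return [p for p in scores if low <= p <= high]

p_treated = [0.35, 0.50, 0.62, 0.80, 0.95]
p_control = [0.05, 0.20, 0.40, 0.55, 0.70]

low, high = minmax_common_support(p_treated, p_control)
kept_treated = trim(p_treated, low, high)
kept_control = trim(p_control, low, high)
print(low, high, kept_treated, kept_control)
```

This rule does not detect gaps in the middle of the distribution, which is where the density-based option is useful.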

How do we match on p(X)

Taken literally, should match on exactly
p(Xi)
In practice this is hard to do, so the strategy is to
match treated units to comparison units
whose p-scores are sufficiently close to
be considered a match
Issues:




How many times can 1 unit be a “match”?
How many to match to each treatment unit?
How to “match” if using more than 1 control
unit per treatment unit?
Replacement

Issue: once control group person Z is a
match for individual A, can she also be a
match for individual B?

Trade-off between bias and precision:


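With replacement: minimizes the propensity
score distance between each treated unit and its
matched comparison unit (less bias), but the same comparison unit may be reused many times (less precision)
Without replacement: each comparison unit is used at most once, so some treated units may get poorer matches (more bias)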
Are we doing a one-to-one match?

If 1-to-1 match: units closely related but
may not be very precise estimates

The more units you include in a match, the more the
p-score of the control group will differ
from the treatment group

Trade-off between bias and precision

Typically use 1-to-many match because 1-to-1
is extremely data intensive if X is multidimensional
Different matching algorithms-1

Can use nearest neighbor, which chooses the m
closest comparison units
Implicitly weights these all the same
Get fixed m but may end up with different p-scores

Can use ‘caliper’: radius around a point
Again implicitly weights these the same
Fixed difference in p-scores, but may not be many
units in radius

Stratify
Break sample up into intervals
Estimate treatment effect separately in each region
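
The three selection rules can be sketched as small functions on a list of control-group p-scores. All names here (`nearest_neighbors`, `caliper_matches`, `stratum`) and the example scores are hypothetical. Equal-width strata are an assumption made to keep the sketch short.

```python
def nearest_neighbors(p_i, controls, m=1):
    """Indices of the m control units whose p-scores are closest to p_i."""
    order = sorted(range(len(controls)), key=lambda j: abs(controls[j] - p_i))
    return order[:m]

def caliper_matches(p_i, controls, radius):
    """Indices of all control units with a p-score within `radius` of p_i."""
    return [j for j, p_j in enumerate(controls) if abs(p_j - p_i) <= radius]

def stratum(p, n_strata=5):
    """Index of the equal-width p-score interval that contains p."""
    return min(int(p * n_strata), n_strata - 1)

controls = [0.10, 0.32, 0.38, 0.45, 0.90]

nn = nearest_neighbors(0.40, controls, m=2)        # always returns m units
cal = caliper_matches(0.40, controls, radius=0.03)  # may return few units
far = caliper_matches(0.70, controls, radius=0.03)  # or none at all
print(nn, cal, far, stratum(0.45))
```

Nearest neighbor always finds m matches, however far away they are. The caliper guarantees closeness but can leave a treated unit with no match.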
Different Matching Algorithms-2

Can also use some type of distribution:

Kernel estimator puts some type of distribution
(e.g. normal) around each treatment unit
and weights closer control units more and
farther control units less

Explicit weighting function can be used if you
have some knowledge of how related units of
certain distances are to each other
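
A minimal sketch of kernel weighting, assuming a normal (Gaussian) kernel on the p-score distance. The bandwidth of 0.1 is an arbitrary illustrative value, and the function names are hypothetical.

```python
import math

def gaussian_kernel_weights(p_i, controls, bandwidth=0.1):
    """Normalised weights for control p-scores, larger for scores closer to p_i."""
    raw = [math.exp(-0.5 * ((p_j - p_i) / bandwidth) ** 2) for p_j in controls]
    total = sum(raw)
    return [w / total for w in raw]

def kernel_counterfactual(p_i, controls, outcomes, bandwidth=0.1):
    """Weighted average of control outcomes: the estimated Y0 for treated unit i."""
    weights = gaussian_kernel_weights(p_i, controls, bandwidth)
    return sum(w * y for w, y in zip(weights, outcomes))

control_scores = [0.10, 0.38, 0.45]
control_outcomes = [4.0, 6.0, 7.0]

weights = gaussian_kernel_weights(0.40, control_scores)
y0_hat = kernel_counterfactual(0.40, control_scores, control_outcomes)
print(weights, y0_hat)
```

Every control unit contributes, but the unit at 0.10 gets almost no weight. A smaller bandwidth concentrates the weight on the closest units.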
How close is close enough?

No “right” answer in these choices—will
depend heavily on sample issues


How deep is the common support (i.e. are
there lots of people in both control and
treatment group at all the p-score values)?
Should all be the same asymptotically but
in finite samples (which is everything)
may differ
Tradeoffs in different methods
Source: Caliendo and Kopeinig, 2005
How to estimate a p-score

Typically use a logit


Specific, useful functional form for estimating
“discrete choice” models
You haven’t learned these yet but you will

For now, think of running a regular OLS
regression where the outcome is 1 if you
got the treatment and zero if you didn’t

Take the E[T | X] and that’s your
propensity score
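
For readers who want to see the logit version, here is a self-contained sketch that fits Pr(T=1|X) by plain gradient ascent on the log-likelihood. It is illustrative only: the simulated data, the coefficient values, the step size and the function names are all assumptions. In practice you would use a statistics package rather than hand-rolled optimisation.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def linear_index(beta, x):
    """beta[0] is the intercept; beta[1:] multiply the covariates."""
    return beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))

def fit_logit(X, T, steps=500, lr=1.0):
    """Fit a logit of T on X by gradient ascent on the mean log-likelihood."""
    n, k = len(X), len(X[0])
    beta = [0.0] * (k + 1)
    for _ in range(steps):
        grad = [0.0] * (k + 1)
        for x, t in zip(X, T):
            err = t - sigmoid(linear_index(beta, x))
            grad[0] += err
            for j, xi in enumerate(x):
                grad[j + 1] += err * xi
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

# Simulated data (assumed for illustration): two covariates, known coefficients.
random.seed(2)
TRUE_BETA = [-0.5, 1.0, -1.0]
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(1000)]
T = [1 if random.random() < sigmoid(linear_index(TRUE_BETA, x)) else 0 for x in X]

beta_hat = fit_logit(X, T)
pscores = [sigmoid(linear_index(beta_hat, x)) for x in X]  # estimated p(X_i)
print(beta_hat)
```

Unlike OLS fitted values, the logit predictions always lie strictly between 0 and 1, which is one reason it is the usual choice.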
The Treatment Effect


CIA (conditional independence assumption) holds and there is a sufficient region of common
support
Difference in outcome between treated
individual i and weighted comparison group J,
with weight generated by the p-score
distribution in the common support region
N is the treatment group
and |N| is the size of the
treatment group
J is the comparison group and |J| is
the number of comparison group
units matched to i
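
The estimator's formula did not survive in this transcript. A standard form consistent with the definitions above is sketched below. It is an assumption about what the slide showed, not a recovered equation. $J_i$ is written with a subscript to make explicit that the comparison group is specific to treated unit $i$, and $w_{ij}$ denotes the weights, which are all equal under nearest neighbor or caliper matching.

```latex
\hat{\tau}_{ATT}
  = \frac{1}{|N|} \sum_{i \in N}
    \Big( Y_i - \sum_{j \in J_i} w_{ij} \, Y_j \Big),
\qquad
\sum_{j \in J_i} w_{ij} = 1,
\qquad
w_{ij} = \frac{1}{|J_i|} \ \text{under equal weighting}
```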
General Procedure
Run regression:
• Dependent variable: T = 1 if participate; T = 0 otherwise
• Choose appropriate conditioning variables, X
• Obtain propensity score: predicted probability (p)
1-to-1 match:
• Estimate difference in outcomes for each pair
• Take average difference as treatment effect
1-to-n match:
• Nearest neighbor matching
• Caliper matching
• Nonparametric/kernel matching
Multivariate analysis based on new sample
Standard Errors

Problem: Estimated variance of treatment
effect should include additional variance
from estimating p

Typically people “bootstrap” which is a nonparametric form of estimating your coefficients
over and over until you get a distribution of
those coefficients—use the variance from that

Will do this in a few weeks
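
As a preview, the mechanics of the bootstrap can be sketched as follows: resample the data with replacement, recompute the estimate each time, and take the standard deviation across resamples. To keep the sketch short, the estimator here is a simple difference in means on simulated data, which is an assumption for illustration. In a matching application the estimator passed in would re-estimate the p-score and redo the matching on every resample. All names are hypothetical.

```python
import random
import statistics

def diff_in_means(sample):
    """Illustrative estimator: mean outcome of treated minus mean of controls."""
    treated = [y for t, y in sample if t == 1]
    control = [y for t, y in sample if t == 0]
    return statistics.mean(treated) - statistics.mean(control)

def bootstrap_se(data, estimator, n_boot=500, seed=0):
    """Standard deviation of the estimator across resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_boot):
        sample = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(sample))
    return statistics.stdev(estimates)

# Simulated (t, y) pairs: 100 treated with mean 1, 100 controls with mean 0.
rng = random.Random(3)
data = [(i % 2, rng.gauss(1.0 if i % 2 else 0.0, 1.0)) for i in range(200)]

point = diff_in_means(data)
se = bootstrap_se(data, diff_in_means)
print(point, se)
```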
Some concerns about Matching

Data intensive in propensity score
estimation



May reduce dimensionality of treatment effect
estimation but still need enough of a sample to
estimate propensity score over common support
Need LOTS of X’s for this to be believable
Inflexible in how p-score is related to
treatment


Worry about heterogeneity
Bias terms much more difficult to sign (nonlinear p-score bias)
Matching + Diff-in-Diff
Worry that unobservables are causing selection
because matching on X is not sufficient
Can combine this with difference-in-differences
estimates





Take control group J for each individual i
Estimate difference before treatment
If the groups are truly ‘as if’ random should be
zero
If it’s not zero: can assume fixed differences
over time and take the before-after difference in
treatment and control groups
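
These steps can be sketched as a small function. It assumes the matching has already been done, so each treated unit i comes with a list of matched control indices. The function name, the data layout and the numbers are hypothetical.

```python
def matched_did(treated, controls, matches):
    """
    treated:  list of (y_before, y_after) for treated units
    controls: list of (y_before, y_after) for control units
    matches:  matches[i] lists the control indices matched to treated unit i
    Returns (average pre-treatment gap, average difference-in-differences).
    """
    pre_gaps, effects = [], []
    for (y_pre, y_post), js in zip(treated, matches):
        c_pre = sum(controls[j][0] for j in js) / len(js)
        c_post = sum(controls[j][1] for j in js) / len(js)
        pre_gaps.append(y_pre - c_pre)   # should be near zero if 'as if' random
        effects.append((y_post - y_pre) - (c_post - c_pre))
    n = len(treated)
    return sum(pre_gaps) / n, sum(effects) / n

treated = [(10.0, 15.0), (12.0, 16.0)]
controls = [(8.0, 10.0), (9.0, 11.0), (11.0, 12.0)]
matches = [[0, 1], [2]]

pre_gap, did = matched_did(treated, controls, matches)
print(pre_gap, did)
```

In this toy example the pre-treatment gap is not zero, so the matched groups differ before treatment. The before-after comparison removes that gap under the fixed-differences assumption.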
Next Time

Comparing Non-Experimental Methods to
the experiments they are trying to
replicate

Goal: See how well these techniques work
to get the estimated experimental effect