Count Data Models in SAS
April 5, 2016
© 2006 ChoicePoint Asset Company. All Rights Reserved.
Introduction
* A comprehensive survey of models for count data in SAS
* Why? Gaining popularity since 1980
  => Insurance: # of auto/medical insurance claims
  => Banking: # of delinquencies / missed payments
  => Marketing: # of responses / purchases
* 5 models to be covered:
  Poisson regression, negative binomial regression,
  hurdle Poisson regression, zero-inflated Poisson regression,
  finite mixture (latent class) Poisson regression
SAS Capability
Procedures: GENMOD, GLIMMIX, NLIN, NLMIXED, COUNTREG, MODEL
Models: Poisson Regression, NB Regression, Hurdle Regression, ZIP Regression, LC Poisson Regression
[Table: check marks indicating which procedures can fit which models; cell positions not recoverable]
Count Data
* Nature of count data
  - nonnegative, discrete, skewed distribution
  - high proportion of zero outcomes
  - potential problems: over-dispersion (variance >> mean), excess zeroes
* Why won't OLS work?
  - counts are heteroskedastic (variance depends on the mean)
  - predictions have to be nonnegative (a log transformation won't work because log(0) is undefined)
* A case study: model # of hospital stays
Data Summary
Classical data for count models:
- 4406 elderly respondents sampled from the National Medical Expenditure Survey (NMES) in 1987
- Information included: 7 health, demographic, and socioeconomic variables
Starting Point
Observations:
1) 80% zeroes ==> excess zeroes
2) Variance = 2 * Mean ==> possible over-dispersion
3) Poor fit with univariate Poisson
[Figure: Observed Probability vs. Univariate Poisson Probability for counts 0 to 8; vertical axis 0% to 100%]
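The two diagnostics on this slide (share of zeroes and variance relative to the mean) can be checked on any count sample before fitting a model. The following is a minimal Python sketch, not part of the original SAS material; the function name `zero_diagnostics` and the toy sample are hypothetical and are not the NMES data.

```python
import math
from statistics import mean, variance


def zero_diagnostics(counts):
    """Return (observed zero share, Poisson zero probability, variance/mean ratio)."""
    m = mean(counts)
    observed_zero = sum(1 for c in counts if c == 0) / len(counts)
    poisson_zero = math.exp(-m)        # P(Y = 0) for a Poisson with the sample mean
    dispersion = variance(counts) / m  # close to 1 when the data are Poisson
    return observed_zero, poisson_zero, dispersion


# Hypothetical toy sample (NOT the NMES data): 80% zeroes plus a few positive counts.
toy = [0] * 80 + [1] * 10 + [2] * 5 + [3] * 3 + [5] * 2
observed_zero, poisson_zero, dispersion = zero_diagnostics(toy)
print(f"observed zeroes: {observed_zero:.3f}")
print(f"Poisson zeroes:  {poisson_zero:.3f}")
print(f"variance / mean: {dispersion:.3f}")
```

An observed zero share well above the Poisson zero probability suggests excess zeroes, and a variance/mean ratio well above 1 suggests over-dispersion.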
Baseline Model
* Probability Function of Poisson Regression
$$f(Y_i \mid X_i) = \frac{\exp(-u_i)\, u_i^{Y_i}}{Y_i!}, \quad \text{where } u_i = \exp(X_i \beta)$$
proc nlmixed data = data;
  parms b0 = 0 b1 = 0 b2 = 0 ... ...;
  mu = exp(b0 + b1 * x1 + b2 * x2...);
  p = exp(-mu) * mu ** y / fact(y);  /* identical to the probability function */
  ll = log(p);
  model y ~ general(ll);
run;
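As a cross-check of the Poisson log-likelihood outside SAS, here is a minimal Python sketch (not part of the original material). It evaluates the same probability function on the log scale with `math.lgamma`, which avoids computing large factorials directly. The function name `poisson_loglik`, the data, and the coefficient values are hypothetical.

```python
import math


def poisson_loglik(y, x_rows, beta):
    """Sum of log f(y_i | x_i) with mu_i = exp(b0 + b1 * x1 + ...)."""
    total = 0.0
    for yi, xi in zip(y, x_rows):
        xb = beta[0] + sum(b * x for b, x in zip(beta[1:], xi))
        mu = math.exp(xb)
        # log(exp(-mu) * mu**y / y!) written on the log scale
        total += -mu + yi * xb - math.lgamma(yi + 1)
    return total


# Hypothetical toy data: one covariate, three observations.
y = [0, 1, 3]
x_rows = [[0.0], [1.0], [2.0]]
beta = [0.1, 0.2]

ll = poisson_loglik(y, x_rows, beta)

# Direct evaluation of the probability function, for comparison.
direct = 0.0
for yi, xi in zip(y, x_rows):
    mu = math.exp(beta[0] + beta[1] * xi[0])
    direct += math.log(math.exp(-mu) * mu ** yi / math.factorial(yi))

print(ll, direct)
```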
Result of Poisson Model
Observations:
1) Improvement by including observed heterogeneity
2) Significantly under-fit at zeroes
What's wrong? ==> Over-Dispersion
[Figure: Observed Probability vs. Predicted Probability of Poisson Regression for counts 0 to 8; vertical axis 0% to 100%]
Test for Over-Dispersion
* Auxiliary OLS regression (Cameron, 1996):
$$\frac{(y_i - u_i)^2 - y_i}{u_i} = \alpha\, u_i + e_i$$
data ols_tmp;
  set poi_out;
  dep = ((y - yhat) ** 2 - y) / yhat;
run;
proc reg data = ols_tmp;
  model dep = yhat / noint;  /* a significant yhat coefficient indicates over-dispersion */
run;
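The auxiliary regression can be reproduced by hand, since a no-intercept OLS with one regressor has a closed-form slope. The following is a minimal Python sketch, not part of the original SAS material; the function name `overdispersion_test` and the toy values of `y` and `yhat` are hypothetical, and `yhat` stands for the fitted means from the Poisson model.

```python
import math


def overdispersion_test(y, yhat):
    """No-intercept OLS of dep on yhat; returns (slope estimate, t statistic)."""
    dep = [((yi - mi) ** 2 - yi) / mi for yi, mi in zip(y, yhat)]
    sxx = sum(m * m for m in yhat)
    slope = sum(m * d for m, d in zip(yhat, dep)) / sxx
    resid = [d - slope * m for d, m in zip(dep, yhat)]
    s2 = sum(r * r for r in resid) / (len(y) - 1)  # one estimated coefficient
    se = math.sqrt(s2 / sxx)
    return slope, slope / se


# Hypothetical toy values (far too few observations for a real test).
y = [0, 5, 4, 1]
yhat = [1.0, 2.0, 1.0, 2.0]
slope, t_stat = overdispersion_test(y, yhat)
print(f"slope = {slope:.4f}, t = {t_stat:.4f}")
```

A large positive t statistic on the slope is the signal of over-dispersion that the slide refers to.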
Alternative I
* Most common alternative: Negative Binomial Regression
* NB can be viewed as a generalization of Poisson that adds a dispersion parameter.
$$u_i = \exp(X_i \beta + e_i) = \exp(X_i \beta)\,\exp(e_i)$$
$$\text{where } \exp(e_i) \sim \mathrm{Gamma}(\alpha^{-1}, \alpha^{-1}) \quad \text{s.t. } E[\exp(e_i)] = 1 \text{ and } V[\exp(e_i)] = \alpha$$
Alternative I
* Probability Function of Negative Binomial Regression
$$f(Y_i \mid X_i) = \frac{\Gamma(Y_i + \alpha^{-1})}{\Gamma(Y_i + 1)\,\Gamma(\alpha^{-1})} \left(\frac{\alpha^{-1}}{\alpha^{-1} + u_i}\right)^{\alpha^{-1}} \left(\frac{u_i}{\alpha^{-1} + u_i}\right)^{Y_i}$$
proc nlmixed data = data;
  parms b0 = 0 b1 = 0 b2 = 0 ... ...;
  mu = exp(b0 + b1 * x1 + b2 * x2 ... ...);
  /* alpha is the NB dispersion parameter */
  p = gamma(y + 1/alpha) / (gamma(y + 1) * gamma(1/alpha))
      * ((1/alpha) / (1/alpha + mu)) ** (1/alpha)
      * (mu / (1/alpha + mu)) ** y;
  ll = log(p);
  model y ~ general(ll);
run;
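A quick numerical check of the negative binomial probability function: it should sum to 1, have mean mu, and have variance mu + alpha * mu ** 2, which is how the dispersion parameter inflates the Poisson variance. This is a minimal Python sketch, not part of the original SAS material; the function name `nb_pmf` and the values of `mu` and `alpha` are hypothetical.

```python
import math


def nb_pmf(y, mu, alpha):
    """Negative binomial probability with mean mu and dispersion alpha."""
    r = 1.0 / alpha
    log_p = (
        math.lgamma(y + r) - math.lgamma(y + 1) - math.lgamma(r)
        + r * math.log(r / (r + mu))
        + y * math.log(mu / (r + mu))
    )
    return math.exp(log_p)


mu, alpha = 0.3, 2.0  # hypothetical values
probs = [nb_pmf(y, mu, alpha) for y in range(200)]  # upper tail beyond 199 is negligible here

total = sum(probs)
nb_mean = sum(y * p for y, p in enumerate(probs))
nb_var = sum((y - nb_mean) ** 2 * p for y, p in enumerate(probs))

print(total, nb_mean, nb_var)
```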
Result of NB Model
Observations:
1) Significant improvement by including unobserved heterogeneity
Comparison with Poisson model:
Likelihood Ratio = 2 * (LL_nb - LL_poi)
= 2 * (-2857 - (-3048))
= 378 (as reported; the rounded log-likelihoods shown give 382)
[Figure: Observed Probability vs. Predicted Probability of NB Regression for counts 0 to 8; vertical axis 0% to 100%]
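The likelihood ratio comparison can be reproduced from the two log-likelihoods. The following is a minimal Python sketch, not part of the original SAS material. It assumes the statistic is referred to a chi-square distribution with 1 degree of freedom (the NB model adds one parameter, alpha), and it uses the rounded log-likelihoods quoted on the slide, which give 382 rather than the reported 378. The function name `lr_test` is hypothetical.

```python
import math


def lr_test(ll_restricted, ll_full):
    """Likelihood ratio statistic and chi-square(1) upper-tail p-value."""
    stat = 2.0 * (ll_full - ll_restricted)
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Rounded log-likelihoods from the slide: Poisson (restricted) and NB (full).
stat, p_value = lr_test(-3048.0, -2857.0)
print(f"LR = {stat:.0f}, p-value = {p_value:.3g}")
```

Because alpha = 0 lies on the boundary of the parameter space, the chi-square(1) p-value is usually treated as conservative; with a statistic this large the conclusion does not change.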
Alternative II
* Hurdle Regression (Mullahy, 1986)
  Two Parts:
  - zero outcomes: Logistic regression
  - positive outcomes: Truncated Poisson regression
* Probability Function of Hurdle Regression
$$f(Y_i \mid X_i) = \begin{cases} \theta_i & \text{for } Y_i = 0 \\ \dfrac{1 - \theta_i}{1 - \exp(-u_i)} \cdot \dfrac{\exp(-u_i)\, u_i^{Y_i}}{Y_i!} & \text{for } Y_i > 0 \end{cases}$$
(here $\theta_i$ denotes the probability of a zero outcome from the logistic part)
Alternative II
proc nlmixed data = data;
  parms b0 = 0 b1 = 0 ... a0 = 0 a1 = 0 ...;
  xb = b0 + b1 * x1 + b2 * x2 ... ...;
  mu = exp(xb);
  xa = a0 + a1 * x1 + a2 * x2 ... ...;
  /* prob function for zeroes */
  if y = 0 then p = exp(xa) / (1 + exp(xa));
  /* prob function for positive counts */
  else p = (1 - exp(xa) / (1 + exp(xa))) / (1 - exp(-mu))
           * (exp(-mu) * mu ** y / fact(y));
  ll = log(p);
  model y ~ general(ll);
run;
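To see that the two parts of the hurdle model form a proper distribution, the probability function can be evaluated numerically: the zero probability comes entirely from the logistic part, and the truncated Poisson part shares out the remaining mass over the positive counts. This is a minimal Python sketch, not part of the original SAS material; the names `logistic` and `hurdle_pmf` and the values of `xa` and `mu` are hypothetical.

```python
import math


def logistic(xa):
    """exp(xa) / (1 + exp(xa)), written in a numerically equivalent form."""
    return 1.0 / (1.0 + math.exp(-xa))


def hurdle_pmf(y, xa, mu):
    """Hurdle Poisson: logistic zero part, zero-truncated Poisson positive part."""
    p_zero = logistic(xa)
    if y == 0:
        return p_zero
    poisson = math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))
    return (1.0 - p_zero) / (1.0 - math.exp(-mu)) * poisson


xa, mu = 1.4, 1.5  # hypothetical linear predictor and Poisson mean
total = sum(hurdle_pmf(y, xa, mu) for y in range(100))
print(f"P(Y = 0) = {hurdle_pmf(0, xa, mu):.3f}, total probability = {total:.6f}")
```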
Result of Hurdle Model
Observations:
1) Significant improvement by modeling zeroes separately
How to compare with Poisson model?
AIC, BIC, & Vuong statistic
[Figure: Observed Probability vs. Predicted Probability of Hurdle Regression for counts 0 to 8; vertical axis 0% to 100%]
Alternative III
* Zero-inflated Poisson Regression (Lambert, 1992)
  Two sources of zeroes:
  - a point mass of zeroes
  - zeroes from the standard Poisson distribution
* Probability Function of Zero-inflated Poisson Regression
$$f(Y_i \mid X_i) = \begin{cases} \theta_i + (1 - \theta_i)\,\exp(-u_i) & \text{for } Y_i = 0 \\ (1 - \theta_i)\, \dfrac{\exp(-u_i)\, u_i^{Y_i}}{Y_i!} & \text{for } Y_i > 0 \end{cases}$$
(here $\theta_i$ denotes the probability of the point mass at zero)
Alternative III
proc nlmixed data = data;
  parms b0 = 0 b1 = 0 ... a0 = 0 a1 = 0 ...;
  xb = b0 + b1 * x1 + b2 * x2 ... ...;
  mu = exp(xb);
  xa = a0 + a1 * x1 + a2 * x2 ... ...;
  /* prob function for zeroes */
  if y = 0 then p = exp(xa) / (1 + exp(xa))
                    + (1 - exp(xa) / (1 + exp(xa))) * exp(-mu);
  /* prob function for positive counts */
  else p = (1 - exp(xa) / (1 + exp(xa)))
           * (exp(-mu) * mu ** y / fact(y));
  ll = log(p);
  model y ~ general(ll);
run;
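The difference from the hurdle model shows up in the zero probability: in ZIP, zeroes come both from the point mass and from the Poisson part, so P(Y = 0) exceeds the point-mass probability. This is a minimal Python sketch, not part of the original SAS material; the names `logistic` and `zip_pmf` and the values of `xa` and `mu` are hypothetical.

```python
import math


def logistic(xa):
    """exp(xa) / (1 + exp(xa)), written in a numerically equivalent form."""
    return 1.0 / (1.0 + math.exp(-xa))


def zip_pmf(y, xa, mu):
    """Zero-inflated Poisson: point mass at zero mixed with a standard Poisson."""
    point_mass = logistic(xa)
    poisson = math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))
    if y == 0:
        return point_mass + (1.0 - point_mass) * poisson
    return (1.0 - point_mass) * poisson


xa, mu = 1.0, 1.5  # hypothetical linear predictor and Poisson mean
probs = [zip_pmf(y, xa, mu) for y in range(100)]
total = sum(probs)
zip_mean = sum(y * p for y, p in enumerate(probs))
print(f"P(Y = 0) = {probs[0]:.3f}, mean = {zip_mean:.3f}, total = {total:.6f}")
```

The mean of the mixture is (1 - point mass) * mu, which is why a ZIP model can reproduce a low overall mean together with a large share of zeroes.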
Result of ZIP Model
Observations:
1) Significant improvement by assuming 2 sources of zeroes
How to compare with other models?
AIC, BIC, & Vuong statistic
[Figure: Observed Probability vs. Predicted Probability of ZIP Regression for counts 0 to 8; vertical axis 0% to 100%]
Alternative IV
* Latent Class Poisson Regression (Wedel, 1993):
  - Existence of S >= 2 latent segments (classes) in the data
  - Each latent segment is Poisson with a different set of parameters
  - Each case is drawn from one of these latent segments with a certain probability
  - Of interest in marketing: segment and model at the same time
* Probability Function of LC Poisson Regression
$$f(Y_i \mid X_i) = \sum_{s=1}^{S} p_s \, \frac{\exp(-u_{i|s})\, u_{i|s}^{Y_i}}{Y_i!}$$
Alternative IV
proc nlmixed data = data;
  parms a0 = 0 ... b0 = 1 ... c0 = 2 ...
        prior1 = 0 to 1 by 0.1 prior2 = 0 to 1 by 0.1;
  /* class a */
  xa = a0 + a1 * x1 + a2 * x2 ... ...;
  ma = exp(xa);
  pa = exp(-ma) * ma ** y / fact(y);
  /* class b */
  xb = b0 + b1 * x1 + b2 * x2 ... ...;
  mb = exp(xb);
  pb = exp(-mb) * mb ** y / fact(y);
  /* class c */
  xc = c0 + c1 * x1 + c2 * x2 ... ...;
  mc = exp(xc);
  pc = exp(-mc) * mc ** y / fact(y);
  /* mixture of the three class-specific Poisson probabilities */
  p = prior1 * pa + prior2 * pb + (1 - prior1 - prior2) * pc;
  ll = log(p);
  ... ...
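The latent class probability function is a weighted sum of ordinary Poisson probabilities, so it can be checked with a few lines of code. This is a minimal Python sketch, not part of the original SAS material; the names `poisson_pmf` and `lc_poisson_pmf`, the three class means, and the class probabilities are hypothetical.

```python
import math


def poisson_pmf(y, mu):
    """Standard Poisson probability, evaluated on the log scale."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))


def lc_poisson_pmf(y, mus, priors):
    """Finite mixture: class probabilities times class-specific Poisson pmfs."""
    return sum(p * poisson_pmf(y, m) for p, m in zip(priors, mus))


# Hypothetical three-class example; the class probabilities must sum to 1.
mus = [0.1, 1.0, 4.0]
priors = [0.7, 0.2, 0.1]

probs = [lc_poisson_pmf(y, mus, priors) for y in range(100)]
total = sum(probs)
lc_mean = sum(y * p for y, p in enumerate(probs))
print(f"total = {total:.6f}, mean = {lc_mean:.3f}")
```

With a dominant low-mean class and a small high-mean class, the mixture produces both many zeroes and a long right tail, which is the kind of heterogeneity the slide describes.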
Result of LC Poisson
Observations:
1) Significant improvement by assuming 3 latent classes with different sets of parameters
How to compare with other models?
AIC, BIC, & Vuong statistic
[Figure: Observed Probability vs. Predicted Probability of LC Poisson Regression for counts 0 to 8; vertical axis 0% to 100%]
Models Prediction
1) Poisson cannot give an adequate fit to the data.
2) Hurdle and ZIP are better at modeling excess zeroes.
3) NB and LC are better at handling heterogeneity.
Models Comparison
1) AIC and BIC are convenient and easy to compute for model comparison, and good enough for practitioners. BIC tends to select a more parsimonious model.
2) The Vuong test is good but computationally tedious (code available in the paper); recommended for researchers.
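Both kinds of comparison can be sketched in a few lines. This is a minimal Python sketch, not part of the original SAS material and not the code from the paper. It uses AIC = -2 LL + 2k and BIC = -2 LL + k ln(n), and the Vuong statistic computed from per-observation log-likelihoods of two non-nested models. The parameter counts `k_poi` and `k_nb` are assumptions (7 covariates plus an intercept, plus alpha for NB), and the per-observation log-likelihoods are hypothetical toy values.

```python
import math


def aic(ll, k):
    """Akaike information criterion (smaller is better)."""
    return -2.0 * ll + 2.0 * k


def bic(ll, k, n):
    """Bayesian information criterion (smaller is better)."""
    return -2.0 * ll + k * math.log(n)


def vuong(loglik_1, loglik_2):
    """Vuong statistic; large positive values favor model 1, large negative model 2."""
    m = [a - b for a, b in zip(loglik_1, loglik_2)]
    n = len(m)
    m_bar = sum(m) / n
    sd = math.sqrt(sum((v - m_bar) ** 2 for v in m) / n)
    return math.sqrt(n) * m_bar / sd


# Rounded log-likelihoods from the slides; parameter counts are ASSUMED.
n = 4406
k_poi, k_nb = 8, 9
aic_poi, aic_nb = aic(-3048.0, k_poi), aic(-2857.0, k_nb)
bic_poi, bic_nb = bic(-3048.0, k_poi, n), bic(-2857.0, k_nb, n)
print(aic_poi, aic_nb, round(bic_poi, 1), round(bic_nb, 1))

# Hypothetical per-observation log-likelihoods for two non-nested models.
ll_model_1 = [-1.0, -0.5, -2.0, -0.2]
ll_model_2 = [-1.2, -0.9, -2.1, -0.8]
v = vuong(ll_model_1, ll_model_2)
print(f"Vuong statistic = {v:.3f}")
```

The Vuong statistic is compared with standard normal critical values (for example 1.96). BIC penalizes each extra parameter by ln(n), which is larger than the AIC penalty of 2 for any realistic sample size, hence its preference for more parsimonious models.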
Conclusion
* In practice, the Poisson model is usually not sufficient for over-dispersed data, but it is useful as a baseline model.
  (Rule of thumb for over-dispersion: Variance ≥ 2 * Mean)
* It is important to identify the reason for over-dispersion: long tail, excess zeroes, or … … ?
  (Excess zeroes might be the most common reason)
* Statistics shouldn't be the only consideration for model selection.
  Examples:
  1) Both Hurdle and ZIP suggest a positive effect of private insurance on hospital stays, which makes perfect sense.
  2) LC provides a way to segment the population, which is invaluable in marketing, insurance, and credit risk.