
Discrete and Categorical Data
William N. Evans
Department of Economics/MPRC
University of Maryland
1
Part I
Introduction
2
Introduction
• Workhorse statistical model in social
sciences is the multivariate regression
model
• Ordinary least squares (OLS)
• yi = β0 + x1i β1+ x2i β2+… xki βk+ εi
• yi = xi β + εi
3
Linear model
yi = α + βxi + εi
• α and β are “population” values – they represent
the true relationship between x and y
• Unfortunately – these values are unknown
• The job of the researcher is to estimate these
values
• Notice that if we differentiate y with respect to
x, we obtain
• dy/dx = β
4
• β represents how much y will change for
a fixed change in x
– Increase in income for more education
– Change in crime or bankruptcy when slots
are legalized
– Increase in test score if you study more
5
Put some concreteness
on the problem
• State of Maryland budget problems
– Drop in revenues
– Expensive k-12 school spending initiatives
• Short-term solution – raise tax on
cigarettes by 34 cents/pack
• Problem – a tax hike will reduce
consumption of taxable product
• Question for state – as taxes are raised,
how much will cigarette consumption
fall?
6
• Simple model: yi = α + βxi + εi
• Suppose y is a state’s per capita
consumption of cigarettes
• x represents taxes on cigarettes
• Question – how much will y fall if x is
increased by 34 cents/pack?
• Problem – there are many reasons why people
smoke – cost is but one of them
7
• Data
– (Y) State per capita cigarette consumption for the
years 1980-1997
– (X) tax (State + Federal) in real cents per pack
– “Scatter plot” of the data
– Negative covariance between variables
• When x > x̄, it is more likely that y < ȳ
• When x < x̄, it is more likely that y > ȳ
• Goal: pick values of α and β that “best fit” the
data
– Define best fit in a moment
8
Notation
• True model
• yi = α + βxi + εi
• We observe data points (yi,xi)
• The parameters α and β are unknown
• The actual error (εi) is unknown
• Estimated model
• (a,b) are estimates for the parameters (α,β)
• ei is an estimate of εi where
• ei = yi – a – bxi
• How do you estimate a and b?
9
Objective: Minimize sum of
squared errors
• Min Σi ei² = Σi (yi – a – bxi)²
• Minimize sum of squared errors (SSE)
• Treat (+) and (-) errors equally
– Over or under predict by “5” is the same
magnitude of error
– “Quadratic form”
– The optimal values for a and b are those that
make the 1st derivatives equal zero
– Functions reach min or max values when
derivatives are zero
10
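Setting the two first derivatives (with respect to a and b) to zero
and solving gives the standard OLS formulas (a standard result, filled
in here for completeness):

b = Σi (xi – x̄)(yi – ȳ) / Σi (xi – x̄)²
a = ȳ – b·x̄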
[Figure: “Cigarette Consumption and Taxes” – scatter plot of per capita packs/year (0 to 300) against tax per pack in cents (0 to 120)]
11
[Figure: “Cigarette Consumption and Taxes” – scatter plot of per capita packs/year (0 to 300) against tax per pack in cents (0 to 120)]
12
• The model has a lot of nice features
– Statistical properties easy to establish
– Optimal estimates easy to obtain
– Parameter estimates are easy to interpret
– Model maximizes prediction
• If you minimize SSE you maximize R2
• The model does well as a first order
approximation to lots of problems
13
Discrete and Qualitative Data
• The OLS model works well when y is a
continuous variable
– Income, wages, test scores, weight, GDP
• It does not have as many nice properties
when y is not continuous
• Example: doctor visits
• Integer values
• Low counts for most people
• Mass of observations at zero
14
Downside of forcing non-standard
outcomes into OLS world?
• Can predict outside the allowable range
– e.g., negative MD visits
• Does not describe the data generating
process well
– e.g., mass of observations at zero
• Violates many properties of OLS
– e.g. heteroskedasticity
15
This talk
• Look at situations when the data
generating process does not lend itself
well to OLS models
• Mathematically describe the data
generating process
• Show how we use a different optimization
procedure to obtain estimates
• Describe the statistical properties
16
• Show how to interpret parameters
• Illustrate how to estimate the models
with the popular program STATA
17
Types of data generating
processes we will consider
• Dichotomous events (yes or no)
– 1=yes, 0=no
– Graduate high school? Work? Obese?
Smoke?
• Ordinal data
– Self reported health (poor, fair, good, excellent)
– Strongly disagree, disagree, agree, strongly
agree
18
• Count data
– Doctor visits, lost workdays, fatality counts
• Duration data
– Time to failure, time to death, time to reemployment
19
Recommended Textbooks
• Jeffrey Wooldridge, “Econometric
Analysis of Cross Section and Panel
Data”
– Lots of insight and mathematical/statistical
detail
– Very good examples
• William Greene, “Econometric Analysis”
– more topics
– Somewhat dated examples
20
Course web page
• www.bsos.umd.edu/econ/evans/jpsm.html
• Contains
– These notes
– All STATA programs and data sets
– A couple of “Introduction to STATA” handouts
– Links to some useful web sites
21
STATA Resources
Discrete Outcomes
• “Regression Models for Categorical
Dependent Variables Using STATA”
– J. Scott Long and Jeremy Freese
• Available for sale from STATA website
for $52 (www.stata.com)
• Post-estimation subroutines that
translate results
– Do not need to buy the book to use the
subroutines
22
• In STATA command line type
• net search spost
• Will give you a list of available programs
to download
• One is spostado, from
http://www.indiana.edu/~jslsoc/stata
• Click on the link and install the files
23
Part II
A brief introduction to STATA
24
STATA
• Very fast, convenient, well-documented,
cheap and flexible statistical package
• Excellent for cross-section/panel data
projects, not as great for time series but
getting better
• Not as easy to manipulate large data
sets from flat files as SAS
• I usually clean data in SAS, estimate
models in STATA
25
• Key characteristic of STATA
– All data must be loaded into RAM
– Computations are very fast
– But, size of the project is limited by available
memory
• Results can be generated two different ways
– Command line
– Write a program (*.do), then submit it from the
command line
26
Sample program to get you
started
• cps87_or.do
• Program gets you to the point where you can
• Load data into memory
• Construct new variables
• Get simple statistics
• Run a basic regression
• Store the results on a disk
27
Data (cps87_or.dta)
• Random sample of data from 1987
Current Population Survey outgoing
rotation group
• Sample selection
– Males
– 21-64
– Working 30+hours/week
• 19,906 observations
28
Major caveat
• Hardest thing to learn/do: get data from
some other source and get it into a STATA
data set
• We skip over that part
• All the data sets are loaded into a
STATA data file that can be called by
saying:
use data file name
29
Housekeeping at the top of the
program
* this line defines the semicolon as the;
* end of line delimiter;
# delimit ;

* set memory to 10 meg;
set memory 10m;

* write results to a log file;
* the replace option writes over old;
* log files;
log using cps87_or.log,replace;

* open stata data set;
use c:\bill\stata\cps87_or;

* list variables and labels in data set;
desc;
30
------------------------------------------------------------------------------
              storage  display    value
variable name   type   format     label      variable label
------------------------------------------------------------------------------
age             float  %9.0g                 age in years
race            float  %9.0g                 1=white, non-hisp, 2=black, n.h., 3=hisp
educ            float  %9.0g                 years of education
unionm          float  %9.0g                 1=union member, 2=otherwise
smsa            float  %9.0g                 1=live in 19 largest smsa, 2=other smsa, 3=non smsa
region          float  %9.0g                 1=east, 2=midwest, 3=south, 4=west
earnwke         float  %9.0g                 usual weekly earnings
------------------------------------------------------------------------------
31
Constructing new variables
• Use the ‘gen’ command to generate new
variables
• Syntax
– gen new-variable-name = math-statement
• Easily construct new variables via
– Algebraic operations
– Math/trig functions (ln, exp, etc.)
– Logical operators (when true, =1, when false, =0)
32
From program
* generate new variables;
* lines 1-2 illustrate basic math functions;
* lines 3-4 illustrate logical operators;
* line 5 illustrates the OR statement;
* line 6 illustrates the AND statement;
* after you construct new variables, compress the data again;
gen age2=age*age;
gen earnwkl=ln(earnwke);
gen union=unionm==1;
gen topcode=earnwke==999;
gen nonwhite=((race==2)|(race==3));
gen big_ne=((region==1)&(smsa==1));
33
Getting basic statistics
• desc -- describes variables in the data
set
• sum – gets summary statistics
• tab – produces frequencies (tables) of
discrete variables
34
From program
* get descriptive statistics;
sum;
* get detailed descriptive statistics for continuous variables;
sum earnwke, detail;
* get frequencies of discrete variables;
tabulate unionm;
tabulate race;
* get two-way table of frequencies;
tabulate region smsa, row column cell;
35
Results from sum
    Variable |     Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         age |   19906    37.96619    11.15348         21         64
        race |   19906    1.199136     .525493          1          3
        educ |   19906    13.16126    2.795234          0         18
      unionm |   19906    1.769065    .4214418          1          2
        smsa |   19906    1.908369    .7955814          1          3
-------------+--------------------------------------------------------
36
Detailed summary
                      usual weekly earnings
-------------------------------------------------------------
      Percentiles      Smallest
 1%          128             60
 5%          178             60
10%          210             60       Obs            19906
25%          300             63       Sum of Wgt.    19906

50%          449                      Mean         488.264
                        Largest       Std. Dev.   236.4713
75%          615            999
90%          865            999       Variance     55918.7
95%          999            999       Skewness     .668646
99%          999            999       Kurtosis    2.632356
37
Results for tab
    1=union |
    member, |
2=otherwise |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      4,597       23.09       23.09
          2 |     15,309       76.91      100.00
------------+-----------------------------------
      Total |     19,906      100.00
38
2x2 Table
   1=east, |      1=live in 19 largest smsa,
2=midwest, |      2=other smsa, 3=non smsa
  3=south, |
    4=west |         1          2          3 |     Total
-----------+---------------------------------+----------
         1 |     2,806      1,349        842 |     4,997
           |     56.15      27.00      16.85 |    100.00
           |     38.46      18.89      15.39 |     25.10
           |     14.10       6.78       4.23 |     25.10
-----------+---------------------------------+----------
         2 |     1,501      1,742      1,592 |     4,835
           |     31.04      36.03      32.93 |    100.00
           |     20.58      24.40      29.10 |     24.29
           |      7.54       8.75       8.00 |     24.29
-----------+---------------------------------+----------
         3 |     1,501      2,542      1,904 |     5,947
           |     25.24      42.74      32.02 |    100.00
           |     20.58      35.60      34.80 |     29.88
           |      7.54      12.77       9.56 |     29.88
-----------+---------------------------------+----------
         4 |     1,487      1,507      1,133 |     4,127
           |     36.03      36.52      27.45 |    100.00
           |     20.38      21.11      20.71 |     20.73
           |      7.47       7.57       5.69 |     20.73
-----------+---------------------------------+----------
     Total |     7,295      7,140      5,471 |    19,906
           |     36.65      35.87      27.48 |    100.00
           |    100.00     100.00     100.00 |    100.00
           |     36.65      35.87      27.48 |    100.00
39
Running a regression
• Syntax
reg dependent-variable independent-variables
• Example from program
*run simple regression;
reg earnwkl age age2 educ nonwhite union;
40
      Source |       SS       df       MS              Number of obs =   19906
-------------+------------------------------           F(  5, 19900) = 1775.70
       Model |  1616.39963     5  323.279927           Prob > F      =  0.0000
    Residual |  3622.93905 19900  .182057239           R-squared     =  0.3085
-------------+------------------------------           Adj R-squared =  0.3083
       Total |  5239.33869 19905  .263217216           Root MSE      =  .42668

------------------------------------------------------------------------------
     earnwkl |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0679808   .0020033    33.93   0.000     .0640542    .0719075
        age2 |  -.0006778   .0000245   -27.69   0.000    -.0007258   -.0006299
        educ |    .069219   .0011256    61.50   0.000     .0670127    .0714252
    nonwhite |  -.1716133   .0089118   -19.26   0.000    -.1890812   -.1541453
       union |   .1301547   .0072923    17.85   0.000     .1158613    .1444481
       _cons |   3.630805   .0394126    92.12   0.000     3.553553    3.708057
------------------------------------------------------------------------------
41
Analysis of variance
• R2 = .3085
– Variables explain 31% of the variation in
log weekly earnings
• F(5,19900)
– Tests the hypothesis that all covariates
(except constant) are jointly zero
42
Interpret results
• Y = β0 + β1Xi + εi
• dY/dX = β1
• But in this case Y = ln(W), where W is weekly
wages
• dln(W)/dX = (dW/W)/dX = β1
– Percentage change in wages given a change in x
43
• For each additional year of education,
wages increase by 6.9%
• Non whites earn 17.2% less than whites
• Union members earn 13% more than
nonunion members
44
Part III
Some notes about probability
distributions
45
Continuous Distributions
• Random variables with an infinite
number of possible values
• Examples -- units of measure (time,
weight, distance)
• Many discrete outcomes can be
treated as continuous, e.g., SAT scores
46
How to describe a continuous
random variable
• The Probability Density Function (PDF)
• The PDF for a random variable x is
defined as f(x), where
f(x) ≥ 0
∫ f(x) dx = 1
• Calculus review: The integral of a
function gives the “area under the curve”
47
[Figure: graph of a PDF, y = f(x), with a point a marked on the x axis]
48
Cumulative Distribution
Function (CDF)
• Suppose x is a “measure” like distance or
time
• 0 ≤ x ≤ ∞
• We may be interested in Pr(x ≤ a)
49
CDF
F(a) = ∫₀ᵃ f(x) dx = Pr(x ≤ a)

What if we consider all values?

Pr(x ≤ ∞) = ∫₀^∞ f(x) dx = 1
50
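A concrete example (added for illustration): if f(x) = e^(–x) for x ≥ 0, then

F(a) = ∫₀ᵃ e^(–x) dx = 1 – e^(–a)

which rises from 0 at a = 0 toward 1 as a → ∞, exactly as a CDF must.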
Properties of CDF
• Note that Pr(x ≤ b) + Pr(x > b) = 1
• Pr(x > b) = 1 – Pr(x ≤ b)
• Many times, it is easier to work with
complements
51
General notation for
continuous distributions
• The PDF is described by lower case such
as f(x)
• The CDF is defined as upper case such
as F(a)
52
Standard Normal Distribution
• Most frequently used continuous
distribution
• Symmetric “bell-shaped” distribution
• As we will show, the normal has useful
properties
• Many variables we observe in the real
world look normally distributed.
• Can translate normal into ‘standard
normal’
53
Examples of variables that
look normally distributed
• IQ scores
• SAT scores
• Heights of females
• Log income
• Average gestation (weeks of pregnancy)
• As we will show in a few weeks – sample
means are normally distributed!!!
54
Standard Normal Distribution
• PDF:
1
f (z) 
e
2
1
 z2
2
 ( z )
• For - 4# z # 4
55
Notation
• φ(z) is the standard normal PDF
evaluated at z
• Φ(a) = Pr(z ≤ a)

Pr(z ≤ a) = ∫₋∞ᵃ φ(z) dz = Φ(a)
56
Standard Normal PDF
[Figure: the standard normal PDF, f(Z), plotted for Z from –3 to 3]
57
Standard Normal
• Notice that:
– Normal is symmetric: φ(a) = φ(–a)
– Normal is “unimodal”
– Median = mean
– Area under curve = 1
– Almost all of the area is between (–3, 3)
• Evaluations of the CDF are done with
– Statistical functions (Excel, SAS, etc.)
– Tables
58
Standard Normal CDF
• Pr(z ≤ –0.98) = Φ(–0.98) = 0.1635
[Figure: “Area Under Standard Normal PDF” – the standard normal PDF with the area to the left of z = –0.98 shaded]
59
Z        0.09    0.08    0.07    0.06    0.05    0.04    0.03    0.02    0.01    0.00
-1.50  0.0559  0.0571  0.0582  0.0594  0.0606  0.0618  0.0630  0.0643  0.0655  0.0668
-1.40  0.0681  0.0694  0.0708  0.0721  0.0735  0.0749  0.0764  0.0778  0.0793  0.0808
-1.30  0.0823  0.0838  0.0853  0.0869  0.0885  0.0901  0.0918  0.0934  0.0951  0.0968
-1.20  0.0985  0.1003  0.1020  0.1038  0.1056  0.1075  0.1093  0.1112  0.1131  0.1151
-1.10  0.1170  0.1190  0.1210  0.1230  0.1251  0.1271  0.1292  0.1314  0.1335  0.1357
-1.00  0.1379  0.1401  0.1423  0.1446  0.1469  0.1492  0.1515  0.1539  0.1562  0.1587
-0.90  0.1611  0.1635  0.1660  0.1685  0.1711  0.1736  0.1762  0.1788  0.1814  0.1841
-0.80  0.1867  0.1894  0.1922  0.1949  0.1977  0.2005  0.2033  0.2061  0.2090  0.2119
-0.70  0.2148  0.2177  0.2206  0.2236  0.2266  0.2296  0.2327  0.2358  0.2389  0.2420
-0.60  0.2451  0.2483  0.2514  0.2546  0.2578  0.2611  0.2643  0.2676  0.2709  0.2743
-0.50  0.2776  0.2810  0.2843  0.2877  0.2912  0.2946  0.2981  0.3015  0.3050  0.3085
60
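These table lookups can also be done in software. In STATA, for example,
normal() returns the standard normal CDF (a quick check, not part of the
original slides):

* check the table entry for z = -0.98;
di normal(-0.98);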
• Pr(z ≤ 1.41) = Φ(1.41) = 0.9207
[Figure: “Area Under Standard Normal PDF” – the standard normal PDF with the area to the left of z = 1.41 shaded]
61
Z        0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.50   0.6915  0.6950  0.6985  0.7019  0.7054  0.7088  0.7123  0.7157  0.7190  0.7224
0.60   0.7257  0.7291  0.7324  0.7357  0.7389  0.7422  0.7454  0.7486  0.7517  0.7549
0.70   0.7580  0.7611  0.7642  0.7673  0.7704  0.7734  0.7764  0.7794  0.7823  0.7852
0.80   0.7881  0.7910  0.7939  0.7967  0.7995  0.8023  0.8051  0.8078  0.8106  0.8133
0.90   0.8159  0.8186  0.8212  0.8238  0.8264  0.8289  0.8315  0.8340  0.8365  0.8389
1.00   0.8413  0.8438  0.8461  0.8485  0.8508  0.8531  0.8554  0.8577  0.8599  0.8621
1.10   0.8643  0.8665  0.8686  0.8708  0.8729  0.8749  0.8770  0.8790  0.8810  0.8830
1.20   0.8849  0.8869  0.8888  0.8907  0.8925  0.8944  0.8962  0.8980  0.8997  0.9015
1.30   0.9032  0.9049  0.9066  0.9082  0.9099  0.9115  0.9131  0.9147  0.9162  0.9177
1.40   0.9192  0.9207  0.9222  0.9236  0.9251  0.9265  0.9279  0.9292  0.9306  0.9319
1.50   0.9332  0.9345  0.9357  0.9370  0.9382  0.9394  0.9406  0.9418  0.9429  0.9441
62
• Pr(z > 1.17) = 1 – Pr(z ≤ 1.17) = 1 – Φ(1.17)
• = 1 – 0.8790 = 0.1210
[Figure: “Area Under Standard Normal PDF” – the standard normal PDF with the area to the right of z = 1.17 shaded]
63
[The standard normal CDF table for Z = 0.50 to 1.50 is repeated here; see the table two slides above. The relevant entry is Φ(1.17) = 0.8790.]
64
• Pr(0.1 ≤ z ≤ 1.9)
= Pr(z ≤ 1.9) – Pr(z ≤ 0.1)
= Φ(1.9) – Φ(0.1) = 0.9713 – 0.5398
= 0.4315
65
[Figure: “Area Under Standard Normal PDF” – the standard normal PDF with the area between z = 0.1 and z = 1.9 shaded]
66
Z        0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.00   0.5000  0.5040  0.5080  0.5120  0.5160  0.5199  0.5239  0.5279  0.5319  0.5359
0.10   0.5398  0.5438  0.5478  0.5517  0.5557  0.5596  0.5636  0.5675  0.5714  0.5753
0.20   0.5793  0.5832  0.5871  0.5910  0.5948  0.5987  0.6026  0.6064  0.6103  0.6141
0.30   0.6179  0.6217  0.6255  0.6293  0.6331  0.6368  0.6406  0.6443  0.6480  0.6517
0.40   0.6554  0.6591  0.6628  0.6664  0.6700  0.6736  0.6772  0.6808  0.6844  0.6879
0.50   0.6915  0.6950  0.6985  0.7019  0.7054  0.7088  0.7123  0.7157  0.7190  0.7224
0.60   0.7257  0.7291  0.7324  0.7357  0.7389  0.7422  0.7454  0.7486  0.7517  0.7549
0.70   0.7580  0.7611  0.7642  0.7673  0.7704  0.7734  0.7764  0.7794  0.7823  0.7852
0.80   0.7881  0.7910  0.7939  0.7967  0.7995  0.8023  0.8051  0.8078  0.8106  0.8133
0.90   0.8159  0.8186  0.8212  0.8238  0.8264  0.8289  0.8315  0.8340  0.8365  0.8389
1.00   0.8413  0.8438  0.8461  0.8485  0.8508  0.8531  0.8554  0.8577  0.8599  0.8621
67
Z        0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
1.00   0.8413  0.8438  0.8461  0.8485  0.8508  0.8531  0.8554  0.8577  0.8599  0.8621
1.10   0.8643  0.8665  0.8686  0.8708  0.8729  0.8749  0.8770  0.8790  0.8810  0.8830
1.20   0.8849  0.8869  0.8888  0.8907  0.8925  0.8944  0.8962  0.8980  0.8997  0.9015
1.30   0.9032  0.9049  0.9066  0.9082  0.9099  0.9115  0.9131  0.9147  0.9162  0.9177
1.40   0.9192  0.9207  0.9222  0.9236  0.9251  0.9265  0.9279  0.9292  0.9306  0.9319
1.50   0.9332  0.9345  0.9357  0.9370  0.9382  0.9394  0.9406  0.9418  0.9429  0.9441
1.60   0.9452  0.9463  0.9474  0.9484  0.9495  0.9505  0.9515  0.9525  0.9535  0.9545
1.70   0.9554  0.9564  0.9573  0.9582  0.9591  0.9599  0.9608  0.9616  0.9625  0.9633
1.80   0.9641  0.9649  0.9656  0.9664  0.9671  0.9678  0.9686  0.9693  0.9699  0.9706
1.90   0.9713  0.9719  0.9726  0.9732  0.9738  0.9744  0.9750  0.9756  0.9761  0.9767
2.00   0.9772  0.9778  0.9783  0.9788  0.9793  0.9798  0.9803  0.9808  0.9812  0.9817
68
Important Properties of Normal
Distribution
• Pr(z ≤ A) = Φ(A)
• Pr(z > A) = 1 – Φ(A)
• Pr(z ≤ –A) = Φ(–A)
• Pr(z > –A) = 1 – Φ(–A) = Φ(A)
69
Section IV
Maximum likelihood estimation
70
Maximum likelihood
estimation
• Observe n independent outcomes, all
drawn from the same distribution
• (y1, y2, y3….yn)
• yi is drawn from f(yi; θ), where θ is an
unknown parameter for the PDF f
• Recall the definition of independence: if a
and b are independent, Pr(a and b) =
Pr(a)Pr(b)
71
• Because all the draws are independent,
the probability these particular n values
of Y would be drawn at random is called
the ‘likelihood function’ and it equals
• L = Pr(y1)Pr(y2)…Pr(yn)
• L = f(y1; θ)f(y2; θ)…..f(yn; θ)
72
• MLE: pick a value for θ that best
represents the chance these n values of y
would have been generated randomly
• To maximize L, maximize a monotonic
function of L
• Recall ln(abcd)=ln(a)+ln(b)+ln(c)+ln(d)
73
• Max L = ln(L) = ln[f(y1; θ)] + ln[f(y2; θ)] +
….. + ln[f(yn; θ)]
= Σi ln[f(yi; θ)]
• Pick θ so that L is maximized
• dL/dθ = 0
74
[Figure: the likelihood L plotted against θ, with two candidate values θ1 and θ2 marked]
75
Example: Poisson
• Suppose y measures ‘counts’ such as
doctor visits.
• yi is drawn from a Poisson distribution
• f(yi; λ) = e^(–λ) λ^yi / yi!   for λ > 0
• E[yi] = Var[yi] = λ
76
• Given n observations, (y1, y2, y3….yn)
• Pick value of λ that maximizes L
• Max L = Σi ln[f(yi; λ)] = Σi ln[e^(–λ) λ^yi / yi!]
= Σi [– λ + yi ln(λ) – ln(yi!)]
= -n λ + ln(λ) Σi yi – Σi ln(yi!)
77
• L = -n λ + ln(λ) Σi yi – Σi ln(yi!)
• dL/dλ = -n + (1/λ) Σi yi = 0
• Solve for λ
• λ = Σi yi / n = ȳ = the sample mean of y
78
• In most cases however, cannot find a
‘closed form’ solution for the parameter
in ln[f(yi; θ)]
• Must ‘search’ over all possible solutions
• How does the search work?
• Start with candidate value of θ.
• Calculate dL/dθ
79
• If dL/dθ > 0, increasing θ will increase L
so we increase θ some
• If dL/dθ < 0, decreasing θ will increase L
so we decrease θ some
• Keep changing θ until dL/dθ = 0
• How far you ‘step’ when you change θ is
determined by a number of different
factors
80
[Figure: likelihood rising at θ1, where dL/dθ > 0]
81
[Figure: likelihood falling at θ3, where dL/dθ < 0]
82
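A minimal sketch of such a search for the Poisson case, written in Mata
(STATA’s matrix language), with hypothetical counts and shown with the
default line delimiter. It uses a Newton step, one common way of choosing
how far to move:

mata:
    y = (0, 1, 1, 2, 3, 5)'           // hypothetical counts
    n = rows(y)
    lambda = 1                        // starting guess
    for (i = 1; i <= 20; i++) {
        g = -n + sum(y)/lambda        // score: dL/d(lambda)
        h = -sum(y)/(lambda^2)        // second derivative
        lambda = lambda - g/h         // Newton step: move by -g/h
    }
    lambda                            // converges to mean(y) = 2
end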
Properties of MLE estimates
• Sometimes called efficient estimation: you can
never generate a smaller variance than one
obtained by MLE
• Parameter estimates are distributed as a
normal distribution when sample sizes are
large
• Therefore, if we divide a parameter by its
standard error, the ratio should be normally
distributed with mean zero and variance 1 if
the null (=0) is correct
83
Section V
Dichotomous outcomes
84
Dichotomous Data
• Suppose the data are discrete but there
are only 2 outcomes
• Examples
– Graduate high school or not
– Patient dies or not
– Working or not
– Smoker or not
• In data, yi=1 if yes, yi =0 if no
85
How to model the data generating
process?
• There are only two outcomes
• Research question: What factors impact
whether the event occurs?
• To answer, will model the probability
the outcome occurs
• Pr(Yi=1) when yi=1 or
• Pr(Yi=0) = 1- Pr(Yi=1) when yi=0
86
• Think of the problem from a MLE
perspective
• Likelihood for i’th observation
• Li = Pr(Yi=1)^Yi [1 – Pr(Yi=1)]^(1–Yi)
• When yi=1, only relevant part is Pr(Yi=1)
• When yi=0, only relevant part is [1 - Pr(Yi=1)]
87
• L = Σi ln[Li] =
= Σi {yi ln[Pr(yi=1)] + (1-yi)ln[Pr(yi=0)] }
• Notice that up to this point, the model is
generic. The log likelihood function will be
determined by the assumptions
concerning how we determine Pr(yi=1)
88
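For the probit case developed below, this generic log likelihood can be
handed directly to STATA’s ml routines. A sketch, with hypothetical
variable names y, x1, x2 (this mirrors, not replaces, the built-in
probit command):

program define myprobit;
    args lnf xb;
    * each observation's log likelihood contribution;
    quietly replace `lnf' = ln(normal(`xb'))  if $ML_y1 == 1;
    quietly replace `lnf' = ln(normal(-`xb')) if $ML_y1 == 0;
end;

ml model lf myprobit (y = x1 x2);
ml maximize;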
Modeling the probability
• There is some process (biological, social,
decision theoretic, etc) that determines
the outcome y
• Some of the variables impacting the
outcome are observed, some are not
• Requires that we model how these
factors impact the probabilities
• Model from a ‘latent variable’
perspective
89
• Consider a woman’s decision to work
• yi* = the person’s net benefit to work
• Two components of yi*
– Characteristics that we can measure
• Education, age, income of spouse, prices of child
care
– Some we cannot measure
• How much you like spending time with your kids
• how much you like/hate your job
90
• We aggregate these two components into one equation
• yi* = β0 + x1i β1+ x2i β2+… xki βk+ εi
= xi β + εi
• xi β (measurable characteristics but with uncertain weights)
• εi random unmeasured characteristics
• Decision rule: person will work if yi* > 0
(if net benefits are positive)
yi=1 if yi*>0
yi=0 if yi*≤0
91
• yi=1 if yi*>0
• yi* = xi β + εi > 0 only if
• εi > - xi β
• yi=0 if yi*≤0
• yi* = xi β + εi ≤ 0 only if
• εi ≤ - xi β
92
How to interpret ε?
• When we look at certain people, we have
expectations about whether y should equal 1
or 0
• These expectations do not always hold true
• The error ε represents deviations from what
we expect
• Go back to the work example, suppose xi β is
‘big.’ We observe a woman with:
– High wages
– Low husband’s income
– Low cost of child care
93
• We would expect this person to work,
UNLESS, there is some unmeasured
variable that counteracts this
• For example:
– Suppose a mom really likes spending time
with her kids, or she hates her job
– The unmeasured net benefit of working then
includes a big negative εi
94
• If we observe someone working, then εi must
have been above a certain value
• yi=1 if εi > - xi β
• Now consider the opposite: suppose we observe
someone NOT working
• Then εi must not have been big, or it was a
bigger negative number, since
• yi=0 if εi ≤ - xi β
95
The Probabilities
• The estimation procedure used is
determined by the assumed distribution
of ε
• What is the probability we observe
someone with y=1?
– Use definition of the CDF
– Pr(yi=1) = Pr(yi*>0) = Pr(εi > - xi β)
= 1 – F(-xi β)
96
• What is the probability we observe
someone with y=0?
– Use definition of the CDF
– Pr(yi=0) = Pr(yi*≤ 0) = Pr(εi ≤ - xi β) =
F(-xi β)
• Two standard models: ε is either
– normal or
– logistic
97
Normal (probit) Model
• ε is distributed as a standard normal
– Mean zero
– Variance 1
• Evaluate probability (y=1)
– Pr(yi=1) = Pr(εi > - xi β) = 1 – Ф(-xi β)
– Given symmetry: 1 – Ф(-xi β) = Ф(xi β)
• Evaluate probability (y=0)
– Pr(yi=0) = Pr(εi ≤ - xi β) = Ф(-xi β)
– Given symmetry: Ф(-xi β) = 1 - Ф(xi β)
98
• Summary
– Pr(yi=1) = Ф(xi β)
– Pr(yi=0) = 1 -Ф(xi β)
• Notice that Ф(a) is increasing in a.
Therefore, if one of the x’s increases the
probability of observing y=1, we would
expect the coefficient on that variable to
be (+)
99
• The standard normal assumption
(variance=1) is not critical
• In practice, the variance may not equal
1, but given the math of the
problem, we cannot identify the
variance. It is absorbed into the parameter
estimates
100
Logit
• CDF: F(a) = exp(a)/(1+exp(a))
– Symmetric, unimodal distribution
– Looks a lot like the normal
– Incredibly easy to evaluate the CDF and PDF
– Mean of zero, variance > 1 (more variance than
the standard normal)
• Evaluate probability (y=1)
– Pr(yi=1) = Pr(εi > - xi β) = 1 – F(-xi β)
– Given symmetry: 1 – F(-xi β) = F(xi β)
– F(xi β) = exp(xi β)/(1+exp(xi β))
101
• Evaluate probability (y=0)
– Pr(yi=0) = Pr(εi ≤ - xi β) = F(-xi β)
– Given symmetry: F(-xi β) = 1 - F(xi β)
– 1 - F(xi β) = 1 /(1+exp(xi β))
• When εi is a logistic distribution
– Pr(yi =1) = exp(xi β)/(1+exp(xi β))
– Pr(yi=0) = 1/(1+exp(xi β))
102
Example: Workplace smoking
bans
• Smoking supplements to 1991 and 1993
National Health Interview Survey
• Asked all respondents whether they
currently smoke
• Asked workers about workplace tobacco
policies
• Sample: workers
• Key variables: current smoking and
whether the worker faces a workplace ban
103
• Data: workplace1.dta
• Sample program: workplace1.do
• Results: workplace1.log
104
Description of variables in data
. desc;

              storage  display    value
variable name   type   format     label      variable label
------------------------------------------------------------------------
smoker          byte   %9.0g                 is current smoking
worka           byte   %9.0g                 has workplace smoking bans
age             byte   %9.0g                 age in years
male            byte   %9.0g                 male
black           byte   %9.0g                 black
hispanic        byte   %9.0g                 hispanic
incomel         float  %9.0g                 log income
hsgrad          byte   %9.0g                 is hs graduate
somecol         byte   %9.0g                 has some college
college         float  %9.0g
------------------------------------------------------------------------
105
Summary statistics
sum;

    Variable |     Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      smoker |   16258      .25163     .433963          0          1
       worka |   16258    .6851396    .4644745          0          1
         age |   16258    38.54742    11.96189         18         87
        male |   16258    .3947595     .488814          0          1
       black |   16258    .1119449    .3153083          0          1
    hispanic |   16258    .0607086    .2388023          0          1
     incomel |   16258    10.42097    .7624525   6.214608   11.22524
      hsgrad |   16258    .3355271    .4721889          0          1
     somecol |   16258    .2685447    .4432161          0          1
     college |   16258    .3293763    .4700012          0          1
106
Running a probit
• probit smoker age incomel male black
hispanic hsgrad somecol college worka;
• The first variable after ‘probit’ is the
discrete outcome, the rest of the
variables are the independent variables
• Includes a constant as a default
107
Running a logit
• logit smoker age incomel male black
hispanic hsgrad somecol college worka;
• Same as probit, just change the first
word
108
Running linear probability
• reg smoker age incomel male black hispanic
hsgrad somecol college worka, robust;
• Simple regression.
• Standard errors are incorrect
(heteroskedasticity)
• robust option produces standard errors
with arbitrary form of heteroskedasticity
109
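As a check on the logit formulas from the previous slides, the predicted
probabilities can be built by hand and compared with STATA’s own (a
sketch; xbl, p_hand, and p_auto are made-up names):

logit smoker age incomel male black hispanic hsgrad somecol college worka;
predict xbl, xb;                       /* the index x*b */
gen p_hand = exp(xbl)/(1 + exp(xbl));  /* logistic CDF by hand */
predict p_auto, pr;                    /* STATA's own prediction */
sum p_hand p_auto;                     /* the two should be identical */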
Probit Results
Probit estimates                                  Number of obs   =      16258
                                                  LR chi2(9)      =     819.44
                                                  Prob > chi2     =     0.0000
Log likelihood = -8761.7208                       Pseudo R2       =     0.0447

------------------------------------------------------------------------------
      smoker |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -.0012684   .0009316    -1.36   0.173    -.0030943    .0005574
     incomel |   -.092812   .0151496    -6.13   0.000    -.1225047   -.0631193
        male |   .0533213   .0229297     2.33   0.020     .0083799    .0982627
       black |  -.1060518    .034918    -3.04   0.002      -.17449   -.0376137
    hispanic |  -.2281468   .0475128    -4.80   0.000    -.3212701   -.1350235
      hsgrad |  -.1748765   .0436392    -4.01   0.000    -.2604078   -.0893453
     somecol |   -.363869   .0451757    -8.05   0.000    -.4524118   -.2753262
     college |  -.7689528   .0466418   -16.49   0.000     -.860369   -.6775366
       worka |  -.2093287   .0231425    -9.05   0.000    -.2546873   -.1639702
       _cons |    .870543    .154056     5.65   0.000     .5685989    1.172487
------------------------------------------------------------------------------
110
How to measure fit?
• Regression (OLS)
– minimize sum of squared errors
– Or, maximize R2
– The model is designed to maximize predictive
capacity
• Not the case with Probit/Logit
– MLE models pick distribution parameters so as
to best describe the data generating process
– May or may not ‘predict’ the outcome well
111
Pseudo R2
• LLk log likelihood with all variables
• LL1 log likelihood with only a constant
• 0 > LLk > LL1 so | LLk | < |LL1|
• Pseudo R2 = 1 - |LLk/LL1| = 1 - LLk/LL1
• Bounded between 0-1
• Not anything like an R2 from a
regression
112
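After probit or logit, the pseudo R2 can be reproduced from the stored
results (a quick sketch; e(ll) and e(ll_0) are the log likelihoods with
all variables and with only a constant):

* McFadden's pseudo R-squared by hand;
di 1 - e(ll)/e(ll_0);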
Predicting Y
• Let b be the estimated value of β
• For any candidate vector of xi , we can predict
probabilities, Pi
• Pi = Ф(xib)
• Once you have Pi, pick a threshold value, T, so
that you predict
• Yp = 1 if Pi > T
• Yp = 0 if Pi ≤ T
• Then compare, fraction correctly predicted
113
• Question: what value to pick for T?
• Can pick .5
– Intuitive. More likely to engage in the
activity than to not engage in it
– However, when ȳ is small, this criterion
does a poor job of predicting yi=1
– However, when ȳ is close to 1, this
criterion does a poor job of picking yi=0
114
* predict probability of smoking;
predict pred_prob_smoke;
* get detailed descriptive data about predicted prob;
sum pred_prob, detail;
* predict binary outcome with 50% cutoff;
gen pred_smoke1=pred_prob_smoke>=.5;
label variable pred_smoke1 "predicted smoking, 50% cutoff";
* compare actual values;
tab smoker pred_smoke1, row col cell;
115
. sum pred_prob, detail;

                         Pr(smoker)
-------------------------------------------------------------
      Percentiles      Smallest
 1%     .0959301       .0615221
 5%     .1155022       .0622963
10%     .1237434       .0633929      Obs            16258
25%     .1620851       .0733495      Sum of Wgt.    16258

50%     .2569962                     Mean        .2516653
                        Largest      Std. Dev.   .0960007
75%     .3187975       .5619798
90%     .3795704       .5655878      Variance    .0092161
95%     .4039573       .5684112      Skewness    .1520254
99%     .4672697       .6203823      Kurtosis    2.149247
116
• Notice a few things
– Sample mean of the predicted probabilities
is close to the sample mean outcome
– 99% of the probabilities are less than .5
– Should predict few smokers if we use a 50%
cutoff
117
           | predicted smoking,
is current |     50% cutoff
   smoking |         0          1 |     Total
-----------+----------------------+----------
         0 |    12,153         14 |    12,167
           |     99.88       0.12 |    100.00
           |     74.93      35.90 |     74.84
           |     74.75       0.09 |     74.84
-----------+----------------------+----------
         1 |     4,066         25 |     4,091
           |     99.39       0.61 |    100.00
           |     25.07      64.10 |     25.16
           |     25.01       0.15 |     25.16
-----------+----------------------+----------
     Total |    16,219         39 |    16,258
           |     99.76       0.24 |    100.00
           |    100.00     100.00 |    100.00
           |     99.76       0.24 |    100.00
118
• Check on-diagonal elements.
• The last number in each 2x2 element is
the fraction in the cell
• The model correctly predicts 74.75 +
0.15 = 74.90% of the obs
• It only predicts a small fraction of
smokers
119
• Do not be amazed by the 75% percent
correct prediction
• If you said everyone has a ȳ chance of
smoking (a case of no covariates), you
would be correct Max[ȳ, (1-ȳ)] percent of
the time
120
• In this case, 25.16% smoke.
• If everyone had the same chance of
smoking, we would assign everyone
Pr(y=1) = .2516
• We would be correct for the 1 - .2516 =
0.7484 people who do not smoke
121
Key points about prediction
• MLE models are not designed to
maximize prediction
• Should not be surprised they do not
predict well
• In this case, percent correctly predicted is not
a particularly good measure of predictive capacity
122
Translating coefficients in probit:
Continuous Covariates
• Pr(yi=1) = Φ[β0 + x1i β1+ x2i β2+… xki βk]
• Suppose that x1i is a continuous variable
• d Pr(yi=1) /d x1i = ?
• What is the change in the probability of
an event given a change in x1i?
123
Marginal Effect
• d Pr(yi=1) /d x1i
• = β1 φ[β0 + x1i β1+ x2i β2+… xki βk]
• Notice that the marginal effect is a
function of all the parameters and of the
values of x.
124
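A sketch of this calculation in STATA for the continuous covariate
incomel, after the probit of smoker above (xbhat and me_incomel are
made-up names; normalden() is the standard normal PDF in recent
versions, normden() in older ones). It mirrors what the mfx command
reports below:

predict xbhat, xb;                              /* the index x*b */
gen me_incomel = _b[incomel]*normalden(xbhat);  /* person-specific effect */
sum me_incomel;                                 /* average marginal effect */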
Translating Coefficients:
Discrete Covariates
• Pr(yi=1) = Φ[β0 + x1i β1+ x2i β2+… xki βk]
• Suppose that x2i is a dummy variable (1
if yes, 0 if no)
• Marginal effect makes no sense, cannot
change x2i by a little amount. It is either
1 or 0.
• Redefine the variable of interest.
Compare outcomes with and without x2i
125
• y1 = Pr(yi=1 | x2i=1)
= Φ[β0 + x1iβ1+ β2 + x3iβ3 +… ]
• y0 = Pr(yi=1 | x2i=0)
= Φ[β0 + x1iβ1+ x3iβ3 … ]
Marginal effect = y1 – y0: the difference in
probabilities with and without x2i (a sketch of
this calculation follows below)
126
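The same idea for the dummy variable worka, after the probit above (a
sketch; xball, xb0, and dp are made-up names; each person’s other x’s
are held at their actual values):

predict xball, xb;
gen xb0 = xball - _b[worka]*worka;                /* index with worka = 0 */
gen dp = normal(xb0 + _b[worka]) - normal(xb0);   /* y1 - y0 per person */
sum dp;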
In STATA
• For marginal effects of continuous
variables, and changes in probabilities
for dichotomous outcomes, STATA picks
sample means for the X’s
127
STATA command for Marginal
Effects
• mfx compute;
• Must come after the estimation command,
while the estimates are still active in the program.
128
Marginal effects after probit
      y  = Pr(smoker) (predict)
         =  .24093439
------------------------------------------------------------------------------
variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]      X
---------+--------------------------------------------------------------------
     age |  -.0003951      .00029    -1.36  0.173  -.000964  .000174   38.5474
 incomel |  -.0289139      .00472    -6.13  0.000   -.03816 -.019668    10.421
    male*|   .0166757       .0072     2.32  0.021   .002568  .030783    .39476
   black*|  -.0320621      .01023    -3.13  0.002  -.052111 -.012013   .111945
hispanic*|  -.0658551      .01259    -5.23  0.000  -.090536 -.041174   .060709
  hsgrad*|   -.053335      .01302    -4.10  0.000   -.07885  -.02782   .335527
 somecol*|  -.1062358      .01228    -8.65  0.000  -.130308 -.082164   .268545
 college*|  -.2149199      .01146   -18.76  0.000  -.237378 -.192462   .329376
   worka*|  -.0668959      .00756    -8.84  0.000   -.08172 -.052072    .68514
------------------------------------------------------------------------------
(*) dy/dx is for discrete change of dummy variable from 0 to 1
129
Interpret results
• 10% increase in income will reduce
smoking by 2.9 percentage points
• 10 year increase in age will decrease
smoking rates .4 percentage points
• Those with a college degree are 21.5
percentage points less likely to smoke
• Those that face a workplace smoking
ban have 6.7 percentage point lower
probability of smoking
130
• Do not confuse percentage point and
percent differences
– A 6.7 percentage point drop is 29% of the
sample mean of 24 percent.
– Blacks have smoking rates that are 3.2
percentage points lower than others, which
is 13 percent of the sample mean
131
Comparing Marginal Effects
Variable          LP        Probit      Logit
age          -0.00040    -0.00048    -0.00048
incomel      -0.0289     -0.0287     -0.0276
male          0.0167      0.0168      0.0172
black        -0.0321     -0.0357     -0.0342
hispanic     -0.0658     -0.0706     -0.0602
hsgrad       -0.0533     -0.0661     -0.0514
college      -0.2149     -0.2406     -0.2121
worka        -0.0669     -0.0661     -0.0658
132
Marginal effects for specific
characteristics
• Can generate marginal effects for a
specific x
• prchange, x(age=40 black=0 hispanic=0 hsgrad=0
somecol=0 worka=0);
• If an x is not specified, STATA will use
the sample mean (e.g., log income in this
case)
• Make sure when you specify a particular
dummy variable (=1) you set the rest to
zero
133
probit: Changes in Predicted Probabilities for smoker

            min->max      0->1     -+1/2    -+sd/2  MargEfct
age          -0.0323   -0.0005   -0.0005   -0.0056   -0.0005
incomel      -0.1795   -0.0320   -0.0344   -0.0263   -0.0345
male          0.0198    0.0198    0.0198    0.0097    0.0198
black        -0.0385   -0.0385   -0.0394   -0.0124   -0.0394
hispanic     -0.0804   -0.0804   -0.0845   -0.0202   -0.0847
hsgrad       -0.0625   -0.0625   -0.0648   -0.0306   -0.0649
somecol      -0.1235   -0.1235   -0.1344   -0.0598   -0.1351
college      -0.2644   -0.2644   -0.2795   -0.1335   -0.2854
worka        -0.0742   -0.0742   -0.0776   -0.0361   -0.0777
134
Testing significance of individual
parameters
• In large samples, MLE estimates are
normally distributed
• Null hypothesis, βj=0
• If the null is true and the sample is
large, βj is distributed as a normal with
variance σj².
• Using the notes from before, if we divide βj
by its standard error, we get a
standard normal
135
• βj/se(βj) should be N(0,1)
• βj/se(βj) = z-score
• 95% of the distribution of a N(0,1) is
between -1.96, 1.96
• Reject the null if |z-score| > 1.96
• Only age is statistically insignificant
(cannot reject the null)
136
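The z-score can be recovered from the stored coefficient and standard
error; e.g., after the probit above (a sketch):

di _b[age]/_se[age];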
When will results differ?
• The normal and logistic CDFs look:
– Similar in the mid point of the distribution
– Different in the tails
• You obtain more observations in the
tails of the distribution when
– Sample sizes are large
– ȳ approaches 1 or 0
• These situations will produce more
differences in estimates
137
Some nice properties of the Logit
• Outcome, y=1 or 0
• Treatment, x=1 or 0
• Other covariates, z
• Context,
– y = whether a baby is born with a low
weight birth
– x = whether the mom smoked or not during
pregnancy
138
• Risk ratio
RR = Prob(y=1|x=1)/Prob(y=1|x=0)
The ratio of the probabilities of the event
when x is and is not present
How much does smoking elevate the chance
your child will be a low weight birth?
139
• Let Yyx be the probability y=1 or 0 given
x=1 or 0
• Think of the risk ratio the following way
• Y11 is the probability Y=1 when X=1
• Y10 is the probability Y=1 when X=0
• Y11 = RR*Y10
140
• Odds Ratio
OR=A/B = [Y11/Y01]/[Y10/Y00]
A = [Pr(Y=1|X=1)/Pr(Y=0|X=1)]
= odds of Y occurring if you are a smoker
B = [Pr(Y=1|X=0)/Pr(Y=0|X=0)]
= odds of y happening if you are not a smoker
What are the relative odds of Y happening if you do
or do not experience X
141
• Suppose Pr(Yi =1) = F(βo+ β1Xi + β2Z) and
F is the logistic function
• Can show that
• OR = exp(β1) = e^β1
• This number is typically reported by
most statistical packages
142
• Details
• Y11 = exp(βo+ β1 + β2Z) /(1+ exp(βo+ β1+ β2Z) )
• Y10 = exp(βo+ β2Z)/(1+ exp(βo+β2Z))
• Y01 = 1 /(1+ exp(βo+ β1 + β2Z) )
• Y00 = 1/(1+ exp(βo+β2Z))
• [Y11/Y01] = exp(βo+ β1 + β2Z)
• [Y10/Y00] = exp(βo+ β2Z)
• OR=A/B = [Y11/Y01]/[Y10/Y00]
= exp(βo+ β1 + β2Z)/ exp(βo + β2Z)
= exp(β1)
143
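In STATA, once the logit of the low-birth-weight example below is
estimated, the odds ratio is one line (a sketch):

di exp(_b[smoked]);    /* e^b1, the odds ratio for smoking */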
• Suppose Y is rare, ȳ close to 0
– Pr(Y=0|X=1) and Pr(Y=0|X=0) are both
close to 1, so they cancel
• Therefore, when ȳ is close to 0
– Odds Ratio ≈ Risk Ratio
• Why is this nice?
144
Population attributable risk
• Average outcome in the population, where x̄ is
the fraction of the population with X=1:
• ȳ = (1-x̄)Y10 + x̄Y11 = (1-x̄)Y10 + x̄(RR)Y10
• Average outcomes are a weighted average of
outcomes for X=0 and X=1
• What would the average outcome be in the
absence of X (e.g., reduce smoking rates to 0)?
• Ya = Y10
145
Population Attributable Risk
• PAR
• Fraction of outcome attributed to X
• The difference between the current rate
and the rate that would exist without X,
divided by the current rate
• PAR = (ȳ – Ya)/ȳ
= x̄(RR – 1)/[(1-x̄) + x̄RR]
146
Example: Maternal Smoking and
Low Weight Births
• 6% births are low weight
– < 2500 grams (
– Average birth is 3300 grams (5.5 lbs)
• Maternal smoking during pregnancy has
been identified as a key cofactor
– 13% of mothers smoke
– This number was falling about 1 percentage
point per year during 1980s/90s
– Doubles chance of low weight birth
147
Natality detail data
• Census of all births (4 million/year)
• Annual files starting in the 60s
• Information about
– Baby (birth weight, length, date, sex, plurality,
birth injuries)
– Demographics (age, race, marital, educ of mom)
– Birth (who delivered, method of delivery)
– Health of mom (smoke/drank during preg, weight
gain)
148
• Smoking not available from CA or NY
• ~3 million usable observations
• I pulled .5% random sample from 1995
• About 12,500 obs
• Variables: birthweight (grams), smoked,
married, 4-level race, 5 level education,
mothers age at birth
149
------------------------------------------------------------------------------
              storage  display    value
variable name   type   format     label      variable label
------------------------------------------------------------------------------
birthw          int    %9.0g                 birth weight in grams
smoked          byte   %9.0g                 =1 if mom smoked during pregnancy
age             byte   %9.0g                 moms age at birth
married         byte   %9.0g                 =1 if married
race4           byte   %9.0g                 1=white,2=black,3=asian,4=other
educ5           byte   %9.0g                 1=0-8, 2=9-11, 3=12, 4=13-15, 5=16+
visits          byte   %9.0g                 prenatal visits
------------------------------------------------------------------------------
150
     dummy |
 variable, |   =1 if mom smoked
=1 if BW<  |   during pregnancy
2500 grams |         0          1 |     Total
-----------+----------------------+----------
         0 |    11,626      1,745 |    13,371
           |     86.95      13.05 |    100.00
           |     94.64      89.72 |     93.96
           |     81.70      12.26 |     93.96
-----------+----------------------+----------
         1 |       659        200 |       859
           |     76.72      23.28 |    100.00
           |      5.36      10.28 |      6.04
           |      4.63       1.41 |      6.04
-----------+----------------------+----------
     Total |    12,285      1,945 |    14,230
           |     86.33      13.67 |    100.00
           |    100.00     100.00 |    100.00
           |     86.33      13.67 |    100.00
151
• Notice a few things
– 13.7% of women smoke
– 6% have low weight birth
• Pr(LBW | Smoke) =10.28%
• Pr(LBW |~ Smoke) = 5.36%
• RR
= Pr(LBW | Smoke)/ Pr(LBW |~ Smoke)
= 0.1028/0.0536 = 1.92
152
Logit results
Log likelihood = -3136.9912                       Pseudo R2       =     0.0330

------------------------------------------------------------------------------
       lowbw |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      smoked |   .6740651   .0897869     7.51   0.000     .4980861    .8500441
         age |   .0080537    .006791     1.19   0.236    -.0052564    .0213638
     married |  -.3954044   .0882471    -4.48   0.000    -.5683654   -.2224433
   _Ieduc5_2 |  -.1949335   .1626502    -1.20   0.231    -.5137221    .1238551
   _Ieduc5_3 |  -.1925099   .1543239    -1.25   0.212    -.4949791    .1099594
   _Ieduc5_4 |  -.4057382   .1676759    -2.42   0.016    -.7343769   -.0770994
   _Ieduc5_5 |  -.3569715   .1780322    -2.01   0.045    -.7059081   -.0080349
   _Irace4_2 |   .7072894   .0875125     8.08   0.000     .5357681    .8788107
   _Irace4_3 |    .386623    .307062     1.26   0.208    -.2152075    .9884535
   _Irace4_4 |   .3095536   .2047899     1.51   0.131    -.0918271    .7109344
       _cons |  -2.755971   .2104916   -13.09   0.000    -3.168527   -2.343415
------------------------------------------------------------------------------
153
Odds Ratios
• Smoked
– exp(0.674) = 1.96
– Smokers are twice as likely to have a low
weight birth
• _Irace4_2 (Blacks)
– exp(0.707) = 2.02
– Blacks are twice as likely to have a low
weight birth
154
Asking for odds ratios
• Logistic y x1 x2;
• In this case
• xi: logistic lowbw smoked age
married i.educ5 i.race4;
155
Log likelihood = -3136.9912                       Pseudo R2       =     0.0330

------------------------------------------------------------------------------
       lowbw | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      smoked |   1.962198   .1761796     7.51   0.000     1.645569     2.33975
         age |   1.008086   .0068459     1.19   0.236     .9947574    1.021594
     married |   .6734077   .0594262    -4.48   0.000     .5664506    .8005604
   _Ieduc5_2 |   .8228894   .1338431    -1.20   0.231     .5982646    1.131852
   _Ieduc5_3 |   .8248862   .1272996    -1.25   0.212     .6095837    1.116233
   _Ieduc5_4 |   .6664847   .1117534    -2.42   0.016     .4798043    .9257979
   _Ieduc5_5 |   .6997924   .1245856    -2.01   0.045     .4936601    .9919973
   _Irace4_2 |   2.028485   .1775178     8.08   0.000      1.70876    2.408034
   _Irace4_3 |   1.472001   .4519957     1.26   0.208     .8063741    2.687076
   _Irace4_4 |   1.362817   .2790911     1.51   0.131     .9122628    2.035893
------------------------------------------------------------------------------
156
PAR
• PAR = x̄(RR – 1)/[(1-x̄) + x̄RR]
• x̄ = 0.137
• RR = 1.96
• PAR = 0.116
• 11.6% of low weight births are attributed to
maternal smoking
157
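The arithmetic can be checked directly (x̄ = 0.137, RR = 1.96):

di (.137*(1.96 - 1))/((1 - .137) + .137*1.96);   /* = .116 */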
Hypothesis Testing in MLE
models
• MLE are asymptotically normally
distributed, one of the properties of MLE
• Therefore, standard t-tests of hypothesis
will work as long as samples are ‘large’
• What ‘large’ means is open to question
• What to do when samples are ‘small’ we
set aside for the moment
158
Testing a linear combination of
parameters
• Suppose you have a probit model
• Φ[β0 + x1iβ1+ x2i β2 + x3iβ3 +… ]
• Test a linear combination of parameters
• Simplest example, test a subset are zero
• β1= β2 = β3 = β4 =0
• To fix the discussion
• N observations
• K parameters
• J restrictions (count the equals signs, j=4)
159
Wald Test
• Based on the fact that the parameters
are distributed asymptotically normal
• Probability theory review
– Suppose you have m draws from a standard
normal distribution (zi)
– M = z12 + z22 + …. zm2
– M is distributed as a Chi-square with m
degrees of freedom
160
• Wald test constructs a ‘quadratic form’
suggested by the test you want to perform
• This combination, because it contains squares
of the true parameters, should, if the
hypothesis is true, be distributed as a Chi
square with J degrees of freedom.
• If the test statistic is ‘large’, relative to the
degrees of freedom of the test, we reject,
because there is a low probability we would
have drawn that value at random from the
distribution
161
Reading critical values
from a table
• All stats books will report the
‘percentiles’ of a chi-square
– Vertical axis (degrees of freedom)
– Horizontal axis (percentiles)
– Entry is the value where ‘percentile’ of the
distribution falls below
162
• Example: Suppose 4 restrictions
• 95% of a chi-square distribution falls
below 9.488.
• So there is only a 5% chance that a number
drawn at random will exceed 9.488
• If your test statistic is below, cannot
reject null
• If your test statistics is above, reject null
163
Chi-square
Percentiles of the Chi-squared Distribution

DOF    0.500   0.750   0.800   0.900   0.950   0.990   0.995
  1    0.455   1.323   1.642   2.706   3.841   6.635   7.879
  2    1.386   2.773   3.219   4.605   5.991   9.210  10.597
  3    2.366   4.108   4.642   6.251   7.815  11.345  12.838
  4    3.357   5.385   5.989   7.779   9.488  13.277  14.860
  5    4.351   6.626   7.289   9.236  11.070  15.086  16.750
  6    5.348   7.841   8.558  10.645  12.592  16.812  18.548
  7    6.346   9.037   9.803  12.017  14.067  18.475  20.278
  8    7.344  10.219  11.030  13.362  15.507  20.090  21.955
  9    8.343  11.389  12.242  14.684  16.919  21.666  23.589
 10    9.342  12.549  13.442  15.987  18.307  23.209  25.188
164
Wald test in STATA
• Default test in MLE models
• Easy to do. Look at program
• test hsgrad somecol college
• Does not estimate the ‘restricted’ model
• ‘Lower power’ than other tests, i.e., high
chance of false negative
165
. test hsgrad somecol college;

 ( 1)  hsgrad = 0
 ( 2)  somecol = 0
 ( 3)  college = 0

          chi2(  3) =  504.78
        Prob > chi2 =   0.0000
166
• Notice the high value of the test
statistic. There is a low chance that a
value drawn at random from a chi-square
with three degrees of freedom
will be this large.
• Reject null
167
-2 Log likelihood test
* how to run the same tests with a -2 log like test;
* estimate the unrestricted model and save the estimates
* in urmodel;
probit smoker age incomel male black hispanic
   hsgrad somecol college worka;
estimates store urmodel;
* estimate the restricted model. save results in rmodel;
probit smoker age incomel male black hispanic
   worka;
estimates store rmodel;
lrtest urmodel rmodel;
168
• I prefer -2 log likelihood test
– Estimates the restricted and unrestricted
model
– Therefore, has more power than a Wald test
• In most cases, they give the same
‘decision’ (reject/not reject)
169
Section VI
Categorical Data
170
Ordered Probit
• Many discrete outcomes are responses to questions
that have a natural ordering but no
quantitative interpretation:
• Examples:
– Self reported health status
• (excellent, very good, good, fair, poor)
– Do you agree with the following statement
• Strongly agree, agree, disagree, strongly
disagree
171
• Can use the same type of model as in the
previous section to analyze these
outcomes
• Another ‘latent variable’ model
• Key to the model: there is a monotonic
ordering of the qualitative responses
172
Self reported health status
• Excellent, very good, good, fair, poor
• Coded as 1, 2, 3, 4, 5 on National Health
Interview Survey
• We will code as 5,4,3,2,1 (easier to think
of this way)
• Asked on every major health survey
• Important predictor of health outcomes,
e.g. mortality
• Key question: what predicts health
status?
173
• Important to note – the numbers 1-5
mean nothing in terms of their value,
just an ordering to show you the lowest
to highest
• The example below is easily adapted to
include categorical variables with any
number of outcomes
174
Model
• yi* = latent index of reported health
• The latent index measures your own
scale of health. Once yi* crosses a
certain value you report poor, then good,
then very good, then excellent health
175
• yi = (1,2,3,4,5) for (poor, fair, good, very
good, excellent)
• Interval decision rule
• yi=1 if yi* ≤ u1
• yi=2 if u1 < yi* ≤ u2
• yi=3 if u2 < yi* ≤ u3
• yi=4 if u3 < yi* ≤ u4
• yi=5 if yi* > u4
176
• As with logit and probit models, we will
assume yi* is a function of observed and
unobserved variables
• yi* = β0 + x1i β1 + x2i β2 …. xki βk + εi
• yi* = xi β + εi
177
• The threshold values (u1, u2, u3, u4) are
unknown. We do not know the value of
the index necessary to push you from
very good to excellent.
• In theory, the threshold values are
different for everyone
• Computer will not only estimate the β’s,
but also the thresholds – average across
people
178
• As with probit and logit, the model will
be determined by the assumed
distribution of ε
• In practice, most people pick the normal,
generating an ‘ordered probit’ (I have no
idea why)
• We will generate the math for the probit
version
179
Probabilities
• Let’s do the outliers, Pr(yi=1) and
Pr(yi=5), first
• Pr(yi=1)
= Pr(yi* ≤ u1)
= Pr(xi β + εi ≤ u1)
= Pr(εi ≤ u1 - xi β)
= Φ[u1 - xi β] = 1 - Φ[xi β – u1]
180
• Pr(yi=5)
= Pr(yi* > u4)
= Pr(xi β + εi > u4)
= Pr(εi > u4 - xi β)
= 1 - Φ[u4 - xi β] = Φ[xi β – u4]
181
Sample calculation for y=3
• Pr(yi=3) = Pr(u2 < yi* ≤ u3)
= Pr(yi* ≤ u3) – Pr(yi* ≤ u2)
= Pr(xi β +εi ≤ u3) – Pr(xi β +εi ≤ u2)
= Pr(εi ≤ u3- xi β) - Pr(εi ≤ u2 - xi β)
= Φ[u3- xi β] - Φ[u2 - xi β]
= 1 - Φ[xi β - u3] – 1 + Φ[xi β - u2]
= Φ[xi β - u2] - Φ[xi β - u3]
182
Summary
• Pr(yi=1) = 1- Φ[xi β – u1]
• Pr(yi=2) = Φ[xi β – u1] - Φ[xi β – u2]
• Pr(yi=3) = Φ[xi β – u2] - Φ[xi β – u3]
• Pr(yi=4) = Φ[xi β – u3] - Φ[xi β – u4]
• Pr(yi=5) = Φ[xi β – u4]
183
Likelihood function
• There are 5 possible choices for each
person
• Only 1 is observed
• L = Σi ln[Pr(yi = k)], where k is the outcome
chosen by person i
184
Programming example
• Cancer control supplement to 1994
National Health Interview Survey
• Question: what observed characteristics
predict self reported health (1-5 scale)
• 1=poor, 5=excellent
• Key covariates: income, education, age,
current and former smoking status
• Programs
• sr_health_status.do, .dta, .log
185
desc;

male            byte   %9.0g    =1 if male
age             byte   %9.0g    age in years
educ            byte   %9.0g    years of education
smoke           byte   %9.0g    current smoker
smoke5          byte   %9.0g    smoked in past 5 years
black           float  %9.0g    =1 if respondent is black
othrace         float  %9.0g    =1 if other race (white is ref)
sr_health       float  %9.0g    1-5 self reported health, 5=excel, 1=poor
famincl         float  %9.0g    log family income
186
tab sr_health;

   1-5 self |
   reported |
    health, |
   5=excel, |
     1=poor |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |        342        2.65        2.65
          2 |        991        7.68       10.33
          3 |      3,068       23.78       34.12
          4 |      3,855       29.88       64.00
          5 |      4,644       36.00      100.00
------------+-----------------------------------
      Total |     12,900      100.00
187
In STATA
• oprobit sr_health male age educ
famincl black othrace smoke
smoke5;
188
Ordered probit estimates                          Number of obs   =      12900
                                                  LR chi2(8)      =    2379.61
                                                  Prob > chi2     =     0.0000
Log likelihood = -16401.987                       Pseudo R2       =     0.0676

------------------------------------------------------------------------------
   sr_health |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   .1281241   .0195747     6.55   0.000     .0897583    .1664899
         age |  -.0202308   .0008499   -23.80   0.000    -.0218966     -.018565
        educ |   .0827086   .0038547    21.46   0.000     .0751535    .0902637
     famincl |   .2398957   .0112206    21.38   0.000     .2179037    .2618878
       black |   -.221508    .029528    -7.50   0.000    -.2793818   -.1636341
     othrace |  -.2425083   .0480047    -5.05   0.000    -.3365958   -.1484208
       smoke |  -.2086096   .0219779    -9.49   0.000    -.2516855   -.1655337
      smoke5 |  -.1529619   .0357995    -4.27   0.000    -.2231277   -.0827961
-------------+----------------------------------------------------------------
       _cut1 |   .4858634    .113179          (Ancillary parameters)
       _cut2 |   1.269036     .11282
       _cut3 |   2.247251   .1138171
       _cut4 |   3.094606   .1145781
------------------------------------------------------------------------------
189
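After oprobit, the probability of each of the five categories can be
predicted for every person (a sketch; p1–p5 are made-up names, and pr
is the default statistic for ordered models):

predict p1 p2 p3 p4 p5, pr;
sum p1 p2 p3 p4 p5;   /* means roughly match the sample category shares */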
Interpret coefficients
• Marginal effects/changes in probabilities
are now a function of 2 things
– Point of expansion (x’s)
– Frame of reference for outcome (y)
• STATA
– Picks mean values for x’s
– You pick the value of y
190
Continuous x’s
• Consider y=5
• d Pr(yi=5)/dxi
= d Φ[xi β – u4]/dxi = βφ[xi β – u4]
• Consider y=3
• d Pr(yi=3)/dxi = βφ[xi β – u3] - βφ[xi β – u4]
191
Discrete X’s
• xi β = β0 + x1i β1 + x2i β2 …. xki βk
– X2i is yes or no (1 or 0)
• ΔPr(yi=5) =
• Φ[β0 + x1i β1 + β2 + x3i β3 +.. xki βk]
- Φ[β0 + x1i β1 + x3i β3 …. xki βk]
• Change in the probabilities when x2i=1
and x2i=0
192
Ask for marginal effects
• mfx compute,
predict(outcome(5));
193
mfx compute, predict(outcome(5));

Marginal effects after oprobit
      y  = Pr(sr_health==5) (predict, outcome(5))
         =  .34103717
------------------------------------------------------------------------------
variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]      X
---------+--------------------------------------------------------------------
    male*|   .0471251      .00722     6.53  0.000    .03298   .06127   .438062
     age |  -.0074214      .00031   -23.77  0.000  -.008033  -.00681   39.8412
    educ |   .0303405      .00142    21.42  0.000   .027565  .033116   13.2402
 famincl |   .0880025      .00412    21.37  0.000    .07993  .096075   10.2131
   black*|  -.0781411      .00996    -7.84  0.000  -.097665 -.058617   .124264
 othrace*|  -.0843227      .01567    -5.38  0.000  -.115043 -.053602    .04124
   smoke*|  -.0749785      .00773    -9.71  0.000   -.09012 -.059837   .289147
  smoke5*|  -.0545062      .01235    -4.41  0.000  -.078719 -.030294   .081395
------------------------------------------------------------------------------
(*) dy/dx is for discrete change of dummy variable from 0 to 1
194
Interpret the results
• Males are 4.7 percentage points more
likely to report excellent
• Each year of age decreases chance of
reporting excellent by 0.7 percentage
points
• Current smokers are 7.5 percentage
points less likely to report excellent
health
195
Minor notes about estimation
• Wald tests and -2 log likelihood tests are
done the exact same way as in PROBIT
and LOGIT
• Tests of individual parameters are done
the same way (z-score)
196
• Use PRCHANGE to calculate marginal
effect for a specific person
prchange, x(age=40 black=0
othrace=0 smoke=0 smoke5=0
educ=16);
– When a variable is NOT specified (famincl),
STATA takes the sample mean.
197
• PRCHANGE will produce results for all
outcomes
male
          Avg|Chg|           1           2           3           4           5
 0->1    .0203868   -.0020257  -.00886671  -.02677558  -.01329902   .05096698
198
age
            Avg|Chg|           1           2           3           4
Min->Max   .13358317    .0184785   .06797072   .17686112   .07064757
   -+1/2   .00321942   .00032518   .00141642   .00424452   .00206241
  -+sd/2   .03728014   .00382077   .01648743   .04910323    .0237889
MargEfct   .00321947   .00032515   .00141639   .00424462   .00206252
199
Section VII
Count Data Models
200
Introduction
• Many outcomes of interest are integer
counts
– Doctor visits
– Low work days
– Cigarettes smoked per day
– Missed school days
• OLS models can easily handle some
integer outcomes
201
• Example
– SAT scores are essentially integer values
– Few at ‘tails’
– Distribution is fairly continuous
– OLS models well
• In contrast, suppose
– High fraction of zeros
– Small positive values
202
• OLS models will
– Predict negative values
– Do a poor job of predicting the mass of observations
at zero
• Example
– Dr visits in past year, Medicare patients (65+)
– 1987 National Medical Expenditure Survey
– Top code (for now) at 10
– 17% have no visits
203
     visits |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        915       17.18       17.18
          1 |        601       11.28       28.46
          2 |        533       10.01       38.46
          3 |        503        9.44       47.91
          4 |        450        8.45       56.35
          5 |        391        7.34       63.69
          6 |        319        5.99       69.68
          7 |        258        4.84       74.53
          8 |        216        4.05       78.58
          9 |        192        3.60       82.19
         10 |        949       17.81      100.00
------------+-----------------------------------
      Total |      5,327      100.00
204
Poisson Model
• yi is drawn from a Poisson distribution
• Poisson parameter varies across
observations
• f(yi; λi) = exp(-λi) λi^yi / yi!   for λi > 0
205
• λi must be positive at all times
• Therefore, we CANNOT let λi = xiβ
• Let λi = exp(xiβ)
• ln(λi) = (xiβ)
206
• d ln(λi)/dxi = β
• Remember that d ln(λi) = dλi/λi
• Interpret β as the percentage change in
the mean outcome for a one-unit change in x
207
Problems with Poisson
• Variance grows with the mean
– E[yi]= Var[yi] = λi = f(xi, β)
• Most data sets have over dispersion,
where the variance grows faster than
the mean
• In the dr. visits sample, the mean is 5.6
and the standard deviation is 6.7
• Imposing Mean=Var is a severe restriction,
and it tends to shrink standard errors
208
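A quick check of overdispersion using the sample moments just quoted:

display (6.7^2)/5.6
* variance/mean ≈ 8, far above the value of 1 that the Poisson imposes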
Negative Binomial Model
Pr(yi) = [Γ(γi + yi) / (Γ(γi) Γ(yi + 1))] · [1/(1+δ)]^γi · [δ/(1+δ)]^yi
• Where γi = exp(xiβ) and δ ≥ 0
• E[yi] = δγi = δexp(xiβ)
• Var[yi] = δ (1+δ) γi
• Var[yi]/ E[yi] = (1+δ)
209
• δ must always be ≥ 0
• In this case, the variance grows faster
than the mean
• If δ=0, the model collapses into the
Poisson
• Always estimate negative binomial
• If you cannot reject the null that δ=0,
report the Poisson estimates
210
• Notice that ln(E[yi]) = ln(δ) + ln(γi), so
• d ln(E[yi]) /dxi = β
• Parameters have the same
interpretation as in the Poisson model
211
In STATA
• POISSON estimates a MLE model for
poisson
– Syntax
POISSON y independent variables
• NBREG estimates MLE negative
binomial
– Syntax
NBREG y independent variables
212
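A minimal sketch with the dr. visits data; the variable names here are hypothetical. The dispersion(constant) option matches the constant Var/Mean form on the previous slides (Stata's default is mean dispersion):

poisson visits age65 age70 chronic excel lninc
nbreg visits age65 age70 chronic excel lninc, dispersion(constant)
* nbreg reports a LR test of the no-overdispersion (Poisson) restriction
* at the foot of its output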
Interpret results for Poisson
• Those with CHRONIC condition have
50% more mean MD visits
• Those in EXCELent health have 78%
fewer MD visits
• BLACKS have 33% fewer visits than
whites
• Income elasticity is 0.021, 10% increase
in income generates a 2.1% increase in
visits
213
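One caveat: these readings treat the coefficient itself as the percentage change, which is the log approximation; the exact proportional change is exp(β) – 1, and the gap matters when β is large. A minimal check with the coefficients from the table below:

display exp(.500) - 1    // CHRONIC: exactly 65% more visits (≈50% in logs)
display exp(-.784) - 1   // EXCEL: exactly 54% fewer visits (≈78% in logs)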
Negative Binomial
• Interpret results the same way as
Poisson
• Look at the coefficient/standard error on
delta
• Ho: delta = 0 (Poisson model is correct)
• In this case, delta = 5.21 with standard
error 0.15; easily reject the null.
• Var/Mean = 1+delta = 6.21; the Poisson is
mis-specified, and you should expect standard
errors that are too small in the wrong model
214
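The test statistic itself is immediate:

display 5.21/0.15
* z ≈ 34.7, so δ = 0 (the Poisson) is overwhelmingly rejected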
Selected Results, Count Models
Parameter (Standard Error)

Variable    Poisson           Negative Binomial
Age65        0.214  (0.026)    0.103  (0.055)
Age70        0.787  (0.026)    0.204  (0.054)
Chronic      0.500  (0.014)    0.509  (0.029)
Excel       -0.784  (0.031)   -0.527  (0.059)
Ln(Inc)      0.021  (0.007)    0.038  (0.016)
215
Section VIII
Duration Data
216
Introduction
• Sometimes we have data on length of
time of a particular event or ‘spells’
– Time until death
– Time on unemployment
– Time to complete a PhD
• Techniques we will discuss were
originally used to examine lifespan of
objects like light bulbs or machines.
These models are often referred to as
“time to failure”
217
Notation
• T is a random variable that indicates
duration (time until death, time to find a
new job, etc.)
• t is the realization of that variable
• f(t) is a PDF that describes the process
that determines the time to failure
• CDF is F(t) represents the probability
an event will happen by time t
218
F(t) = Pr(T ≤ t) = ∫₀ᵗ f(s) ds
• F(t) represents the probability that the
event happens by ‘t’.
• What is the probability a person will die
on or before the 65th birthday?
219
• Survivor function, what is the chance
you live past (t)
• S(t) = 1 – F(t)
• If 10% of a cohort dies by their 65th
birthday, 90% will die sometime after
their 65th birthday
220
• Hazard function, h(t)
• What is the probability the spell will end
at time t, given that it has already
lasted t
• What is the chance you find a new job in
month 12 given that you’ve been
unemployed for 12 months already
λ(t) = lim(h→0) Pr(t ≤ T < t + h | T ≥ t) / h
221
• PDF, CDF (Failure function), survivor
function and hazard function are all
related
• λ(t) = f(t)/S(t) = f(t)/(1-F(t))
• We focus on the ‘hazard’ rate because its
relationship to time indicates ‘duration
dependence’
222
• Example: suppose the longer someone is
out of work, the lower the chance they
will exit unemployment – ‘damaged
goods’
• This is an example of duration
dependence, the probability of exiting a
state of the world is a function of the
length
223
• Mathematically
• d λ(t) /dt = 0
then there is no duration dep.
• d λ(t) /dt > 0
there is + duration dependence
the probability the spell will end
increases with time
• d λ(t) /dt < 0
there is – duration dependence
the probability the spell will end
decreases over time
224
• Your choice is to pick a functional form for
f(t) that has +, – or no duration dependence
225
Different Functional Forms
• Exponential
– λ(t)= λ
– Hazard is the same over time, a ‘memory less’
process
• Weibull
– F(t) = 1 – exp(-γt^ρ) where ρ, γ > 0
– λ(t) = ργt^(ρ-1)
– if ρ>1, increasing hazard
– if ρ<1, decreasing hazard
– if ρ=1, exponential
226
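A minimal sketch of how ρ governs duration dependence; the values γ = .01 and the two ρ's are purely illustrative:

display 1.2*.01*10^(1.2-1)    // ρ = 1.2: hazard at t=10 (≈ .019)
display 1.2*.01*50^(1.2-1)    // larger at t=50 (≈ .026): increasing hazard
display 0.8*.01*10^(0.8-1)    // ρ = 0.8: hazard at t=10 (≈ .0050)
display 0.8*.01*50^(0.8-1)    // smaller at t=50 (≈ .0037): decreasing hazard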
• Others: Lognormal, log-logistic,
Gompertz.
• A little more difficult – examine them once
you are comfortable with the Weibull
227
A note about most data sets
• Most data sets have ‘censored’ spells
– Follow people over time
– All will eventually die, but some do not in
your period of analysis
– Incomplete spells or censored data
• Must build into the log likelihood
function
228
• Let ti be the duration we observe for all
people
• Some people die, and they lived until
period ti
• Others are observed for ti periods, but do
not die
• Let di=1 if the data is a complete spell
• di=0 if incomplete
229
• Recall, that f(s) is the PDF for someone
who dies at period s
F(t) = Pr(T ≤ t) = ∫₀ᵗ f(s) ds
• F(t) is the probability you die by t
• 1 – F(t) is the probability you die after t
230
• If di=1 then we observe f(ti), someone
who died in period ti
• If di=0 then someone lived past period ti
and the probability of that is [1-F(ti)]
• L = Σi {di ln[f(ti)] + (1-di)ln[1-F(ti)]}
231
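For example, with the exponential, f(t) = λ exp(-λt) and 1 – F(t) = exp(-λt), so the log likelihood reduces to
L = Σi { di ln(λ) – λti }
a complete spell contributes both terms, while a censored spell contributes only –λti.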
Introducing covariates
• Look at exponential
• λ(t)= λ
• Allow this to vary across people
• λ i(t)= λi
• But like Poisson, λi is always positive
• Let λi = exp(β0 + x1i β1 + x2i β2 …. xki βk)
232
• In the Weibull
• λ(t) = αγt^(α-1)
• Allow it to vary across people
• λi(t) = αγi t^(α-1)
• γi = exp(β0 + x1i β1 + x2i β2 + … + xki βk)
233
Interpreting Coefficients
• This is the same for both Weibull and
Exponential
• In Weibull, λ(ti) = αγi ti^(α-1)
• Suppose x1i is a dummy variable
• When x1i=1, then
• γi1 = exp(β0 + β1 + x2i β2 + … + xki βk)
• When x1i=0, then
• γi0 = exp(β0 + x2i β2 + … + xki βk)
234
• When you construct the ratio γi1/γi0,
all the other parameters cancel, so
• (αγi1 t^(α-1) – αγi0 t^(α-1)) / αγi0 t^(α-1) = e^β1 – 1
• Percentage change in the hazard when
x1i turns from 0 to 1.
• STATA prints out e^β1; just subtract 1
235
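A minimal check against the Weibull output below, using the coefficient on black (.477):

display exp(.4770152) - 1
* ≈ .61: a 61% higher hazard, the same reading as on the hazard-ratio slide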
Suppose x2i is continuous
• Suppose we increase x2i by 1 unit
• γi1 = exp(β0 + x1i β1 + x2i β2 + … + xki βk)
• γi2 = exp(β0 + x1i β1 + (x2i + 1) β2 + … + xki βk)
• Can show that
• (αγi2 t^(α-1) – αγi1 t^(α-1)) / αγi1 t^(α-1) = exp(β2) – 1
• Percentage change in the hazard for a 1 unit
increase in x
236
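The same check for a continuous variable, using the age coefficient (.0453) from the output below:

display exp(.0452588) - 1
* ≈ .046: each year of age raises the hazard by about 4.6%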
NHIS Multiple Cause of Death
• NHIS
– annual survey of 60K households
– Data on individuals
– Self-reported health, DR visits, lost workdays, etc.
• MCOD
– Linked NHIS respondents from 1986-1994 to
National Death Index through Dec 31, 1995
– Identified whether respondent died and of what
cause
237
• Our sample
– Males, 50-70, who were married at the time
of the survey
– 1987-1989 surveys
– Give everyone 5 years (60 months) of
followup
238
Key Variables
• max_mths maximum months in the
survey.
• Diedin5 respondent died during the 5
years of followup
• Note: if diedin5=0, then max_mths=60.
Diedin5 identifies whether the data is
censored or not.
239
    Variable |      Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
   age_s_yrs |    26654    59.42586    5.962435         50         70
    max_mths |    26654    56.49077    11.15384          0         60
       black |    26654    .0928566    .2902368          0          1
    hispanic |    26654    .0454716      .20834          0          1
      income |    26654    3.592181    1.327325          1          5
-------------+--------------------------------------------------------
        educ |    26654    2.766677     .961846          1          4
     diedin5 |    26654    .1226082    .3279931          0          1
240
Duration Data in STATA
• Need to identify which is the duration
data
stset length, failure(failvar)
• Length=duration variable
• Failvar=1 when durations end in failure, =0 for
censored values
• If all data is uncensored, omit
failure(failvar)
241
• In our case
• stset max_mths, failure(diedin5)
242
Getting Kaplan-Meier Curves
• Tabular presentation of results
sts list
• Graphical presentation
sts graph
• Results by subgroup
sts graph, by(educ)
243
[Figure: Kaplan-Meier survival estimate; survival probability (0.00–1.00)
against analysis time (0–60)]
244
[Figure: Kaplan-Meier survival estimates, by educ (educ = 1, 2, 3, 4);
survival probability (0.00–1.00) against analysis time (0–60)]
245
MLE of duration model with
Covariates
• Basic syntax
• streg covariates, d(distribution)
• streg age_s_yrs black hispanic _Ie* _Ii*,
d(weibull);
• In this model, STATA will print out
exp(β)
• If you want the coefficients, add ‘nohr’
option (no hazard ratio)
246
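For example, the coefficients on the next slide come from adding nohr, while the hazard-ratio table two slides later is the default output:

streg age_s_yrs black hispanic _Ie* _Ii*, d(weibull) nohr;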
Weibull coefficients

No. of subjects =     26631                   Number of obs =   26631
No. of failures =      3245
Time at risk    =   1505705                   LR chi2(10)   =  595.74
Log likelihood  = -12425.055                  Prob > chi2   =  0.0000

------------------------------------------------------------------------------
          _t |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   age_s_yrs |   .0452588   .0031592    14.33   0.000     .0390669    .0514508
       black |   .4770152   .0511122     9.33   0.000     .3768371    .5771932
    hispanic |   .1333552    .082156     1.62   0.105    -.0276676     .294378
    _Ieduc_2 |   .0093353   .0591918     0.16   0.875    -.1066786    .1253492
    _Ieduc_3 |   -.072163   .0503131    -1.43   0.151    -.1707748    .0264488
    _Ieduc_4 |  -.1301173   .0657131    -1.98   0.048    -.2589126   -.0013221
  _Iincome_2 |  -.1867752   .0650604    -2.87   0.004    -.3142914   -.0592591
  _Iincome_3 |  -.3268927   .0688635    -4.75   0.000    -.4618627   -.1919227
  _Iincome_4 |  -.5166137   .0769202    -6.72   0.000    -.6673747   -.3658528
  _Iincome_5 |  -.5425447   .0722025    -7.51   0.000     -.684059   -.4010303
       _cons |  -9.201724   .2266475   -40.60   0.000    -9.645945   -8.757503
-------------+----------------------------------------------------------------
       /ln_p |   .1585315   .0172241     9.20   0.000     .1247729    .1922901
-------------+----------------------------------------------------------------
           p |   1.171789    .020183                      1.132891    1.212022
         1/p |   .8533961    .014699                      .8250675    .8826974
------------------------------------------------------------------------------
247
• The sign of the parameters is informative
– Hazard increasing in age
– Blacks, hispanics have higher mortality rates
– Hazard decreases with income and education
• The parameter ρ = 1.17.
– Check 95% confidence interval (1.13, 1.21). Can
reject null ρ=1 (exponential)
– Hazard is increasing over time
248
Hazard ratios

No. of subjects =     26631                   Number of obs =   26631
No. of failures =      3245
Time at risk    =   1505705                   LR chi2(10)   =  595.74
Log likelihood  = -12425.055                  Prob > chi2   =  0.0000

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   age_s_yrs |   1.046299   .0033055    14.33   0.000      1.03984    1.052797
       black |   1.611258    .082355     9.33   0.000     1.457667    1.781032
    hispanic |   1.142656    .093876     1.62   0.105     .9727116    1.342291
    _Ieduc_2 |   1.009379    .059747     0.16   0.875     .8988145    1.133544
    _Ieduc_3 |   .9303792   .0468103    -1.43   0.151     .8430114    1.026802
    _Ieduc_4 |   .8779924   .0576956    -1.98   0.048     .7718905    .9986788
  _Iincome_2 |   .8296302   .0539761    -2.87   0.004     .7303062    .9424625
  _Iincome_3 |   .7211611   .0496617    -4.75   0.000     .6301089    .8253706
  _Iincome_4 |   .5965372   .0458858    -6.72   0.000     .5130537    .6936049
  _Iincome_5 |   .5812672    .041969    -7.51   0.000     .5045648    .6696297
-------------+----------------------------------------------------------------
       /ln_p |   .1585315   .0172241     9.20   0.000     .1247729    .1922901
-------------+----------------------------------------------------------------
           p |   1.171789    .020183                      1.132891    1.212022
         1/p |   .8533961    .014699                      .8250675    .8826974
------------------------------------------------------------------------------
249
Interpret coefficients
• Age: every year hazard increases by
4.6%
• Black: have 61% greater hazard than
whites
• Hispanics: 14% greater hazard than
non-hispanics
• Educ 2, 3, 4 are dummies for 9-11, 12-15
and 16+ years of school
250
• Educ 3: those with 12-15 years of educ
have .93 – 1 = -0.07 or a 7% lower
hazard than those with <9 years of
school
• Educ 4: those with a college degree have
0.88 – 1 = -0.12 or a 12% lower hazard
than those with <9 years of school
251
• Income 2-5 are dummies for people with
$10-$20K, $20-$30K, $30-$40K, >$40K
• Income 2: Those with $10-$20K have
0.83 – 1 = -0.17 or a 17% lower hazard
than those with income <$10K
• Income 5, those with >$40K in income
have a 0.58 – 1 = -0.42 or a 42% lower
hazard than those with income <$10K
252
Topics not covered
• Time varying covariates
• Competing risk models
– Die from multiple causes
• Cox proportional hazard model
– Heterogeneity in baseline hazard
253