Graphical Methods for Complex Surveys


1
Types of Surveys
Cross-sectional
• surveys a specific population at a given point in
time
• will have one or more of the design components
• stratification
• clustering with multistage sampling
• unequal probabilities of selection
Longitudinal
• surveys a specific population repeatedly over a
period of time
• panel
• rotating samples
2
Cross Sectional Surveys
Sampling Design Terminology
3
Methods of Sample Selection
Basic methods
• simple random sampling
• systematic sampling
• unequal probability sampling
• stratified random sampling
• cluster sampling
• two-stage sampling
4
Simple Random Sampling
[Figure: simple random sampling illustration; axis from 0 to 100]
Why?
• basic building block of sampling
• sample from a homogeneous group of units
How?
• physically make draws at random of the units
under study
• computer selection methods: R, Stata
5
Systematic Sampling
[Figure: systematic sampling illustration; axis from 0 to 100]
Why?
• easy
• can be very efficient depending on the structure of
the population
How?
• get a random start in the population
• sample every kth unit for some chosen number k
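The two steps above (random start, then every kth unit) can be sketched in a few lines of Python. This is only an illustration: the frame of 100 units, the choice k = 10 and the function name `systematic_sample` are assumptions made for the example, not part of the original slides.

```python
import random

def systematic_sample(units, k, rng=None):
    """Return every k-th unit after a random start among the first k positions."""
    rng = rng or random.Random()
    start = rng.randrange(k)      # random start: position 0, 1, ..., k-1
    return units[start::k]        # then take every k-th unit in frame order

# Hypothetical frame of 100 units labelled 1..100 (illustrative only).
population = list(range(1, 101))
sample = systematic_sample(population, k=10, rng=random.Random(1))
print(sample)
```

Because the frame is walked in its listed order, the sketch only behaves like a simple random sample when that order is unrelated to the measurements.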
6
Additional Note
Simplifying assumption:
• in terms of estimation a systematic sample
is often treated as a simple random sample
Key assumption:
• the order of the units is unrelated to the
measurements taken on them
7
Unequal Probability Sampling
Why?
• may want to give greater or lesser weight to
certain population units
• two-stage sampling with probability proportional
to size at the first stage and equal sample sizes at
the second stage provides a self-weighting design
(all units have the same chance of inclusion in the
sample)
How?
• with replacement
• without replacement
8
With or Without Replacement?
• in practice sampling is usually done without
replacement
• the formula for the variance based on without
replacement sampling is difficult to use
• the formula for with replacement sampling at the
first stage is often used as an approximation
Assumption: the population size is large and the
sample size is small – sampling fraction is less
than 10%
9
Stratified Random Sampling
[Figure: stratified random sampling illustration; axis from 0 to 100]
Why?
• for administrative convenience
• to improve efficiency
• estimates may be required for each stratum
How?
• independent simple random samples are chosen
within each stratum
10
Example: Survey of Youth in Custody
• first U.S. survey of youths confined to long-term,
state-operated institutions
• complemented existing Children in Custody
censuses.
• companion survey to the Surveys of State Prisons
• the data contain information on criminal histories,
family situations, drug and alcohol use, and peer
group activities
• survey carried out in 1989 using stratified
systematic sampling
11
SYC Design
strata
• type (a) groups of smaller institutions
• type (b) individual larger institutions
sampling units
• strata type (a)
• first stage – institution by probability proportional to size of
the institution
• second stage – individual youths in custody
• strata type (b)
• individual youths in custody
• individuals chosen by systematic random sampling
12
Cluster Sampling
[Figure: cluster sampling illustration; axis from 0 to 100]
Why?
• convenience and cost
• the frame or list of population units may be
defined only for the clusters and not the units
How?
• take a simple random sample of clusters and
measure all units in the cluster
13
Two-Stage Sampling
[Figure: two-stage sampling illustration; axis from 0 to 100]
Why?
• cost and convenience
• lack of a complete frame
How?
• take either a simple random sample or an unequal
probability sample of primary units and then within a
primary take a simple random sample of secondary units
14
Synthesis to a Complex Design
Stratified two-stage cluster sampling
Strata
• geographical areas
First stage units
• smaller areas within the larger areas
Second stage units
• households
Clusters
• all individuals in the household
15
Why a Complex Design?
• better coverage of the entire region of interest
(stratification)
• efficient for interviewing: less travel, lower
cost
Problem: estimation and analysis are more
complex
16
Ontario Health Survey
• carried out in 1990
• health status of the population was
measured
• data were collected relating to the risk
factors associated with major causes of
morbidity and mortality in Ontario
• survey of 61,239 persons was carried out in
a stratified two-stage cluster sample by
Statistics Canada
17
OHS
Sample Selection
• strata: public health units
– divided into rural and
urban strata
• first stage: enumeration
areas defined by the 1986
Census of Canada and
selected by pps
• second stage: dwellings
selected by SRS
• cluster: all persons in the
dwelling
18
Longitudinal Surveys
Sampling Design
19
Schematic Representation
Panel Survey
[Figure: schematic of a panel survey; axes Time (0 to 4) and Respondents]
20
Schematic Representation
Rotation Survey
[Figure: schematic of a rotation survey; axes Time (0 to 4) and Respondents]
21
British Household
Panel Survey
Objectives of the survey
• to further understanding of social and economic
change at the individual and household level in
Britain
• to identify, model and forecast such changes, their
causes and consequences in relation to a range of
socio-economic variables.
22
BHPS: Target
Population and Frame
Target population
• private households in Great Britain
Survey frame
• small users Postcode Address File
(PAF)
23
BHPS: Panel Sample
• designed as an annual survey of each adult (16+)
member of a nationally representative sample
• 5,000 households approximately
• 10,000 individual interviews approximately.
• the same individuals are re-interviewed in successive
waves
• if individuals split off from original households, all
adult members of their new households are also
interviewed.
• children are interviewed once they reach the age of 16
• 13 waves of the survey from 1991 to 2004
24
BHPS: Sampling Design
Uses implicit stratification embedded in two-stage
sampling
• postcode sector ordered by region
• within a region postcode sector ordered by socioeconomic group as determined from census data and
then divided into four or five strata
Sample selection
• systematic sampling of postcode sectors from ordered
list
• systematic sampling of delivery points (≈ addresses or
households)
25
BHPS: Schema for Sampling
26
Survey Weights
27
Survey Weights: Definitions
initial weight
• equal to the inverse of the inclusion probability
of the unit
final weight
• initial weight adjusted for nonresponse,
poststratification and/or benchmarking
• interpreted as the number of units in the
population that the sample unit represents
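As a small illustration of the initial weight, the sketch below assumes simple random sampling within each stratum, so the inclusion probability is n/N and the weight is N/n. The stratum names and sizes are made-up numbers, and the nonresponse and poststratification adjustments that produce the final weight are not shown.

```python
# Hypothetical strata: population size N and sample size n (illustrative numbers only).
strata = {"A": {"N": 1000, "n": 50}, "B": {"N": 200, "n": 40}}

def initial_weight(N, n):
    """Inverse of the inclusion probability n/N under SRS within a stratum."""
    return N / n   # equals 1 / (n / N)

weights = {name: initial_weight(s["N"], s["n"]) for name, s in strata.items()}

# Each sampled unit "represents" weight-many population units, so the
# weights summed over all sampled units recover the population size.
estimated_total = sum(weights[name] * s["n"] for name, s in strata.items())
print(weights, estimated_total)
```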
28
Interpretation
• the survey weight for a particular sample unit is
the number of units in the population that the unit
represents
[Figure legend: Not sampled, Wt = 2, Wt = 5, Wt = 6, Wt = 7]
29
Effect of the Weights
• Example: age distribution, Survey of Youth in Custody

Age      Counts   Sum of Weights
11            1               28
12            9              149
13           53              764
14          167             2143
15          372             3933
16          622             5983
17          634             5189
18          334             2778
19          196             1763
20          122             1164
21           57              567
22           27              273
23           14              150
24           13              128
Totals     2621            25012
30
Unweighted Histogram
Age Distribution of Youth in Custody
[Figure: histogram; horizontal axis Age (11 to 24), vertical axis Proportion (0 to 0.3)]
31
Weighted Histogram
Age Distribution of Youth in Custody
[Figure: histogram; horizontal axis Age (11 to 24), vertical axis Proportion (0 to 0.3)]
32
Weighted versus Unweighted
Weighted and Unweighted Histograms
[Figure: two histograms compared; horizontal axis Age (11 to 24), vertical axis Proportion (0 to 0.3); legend: Weighted, Unweighted]
33
Observations
• the histograms are similar in overall shape but
differ noticeably
• the design probably utilized approximate
proportional allocation
• the distribution of ages in the unweighted
case tends to be shifted to the right when
compared to the weighted case
• older ages are over-represented in the dataset
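The comparison can be reproduced from the Survey of Youth in Custody age table (counts and sums of weights by age). The sketch below converts both columns to proportions, which are the bar heights of the two histograms, and compares the implied mean ages. The helper names `proportions` and `mean_age` are introduced here for illustration only.

```python
# Age table from the Survey of Youth in Custody example (ages 11 to 24).
ages = list(range(11, 25))
counts = [1, 9, 53, 167, 372, 622, 634, 334, 196, 122, 57, 27, 14, 13]
weight_sums = [28, 149, 764, 2143, 3933, 5983, 5189,
               2778, 1763, 1164, 567, 273, 150, 128]

def proportions(values):
    """Bar heights: each class total divided by the overall total."""
    total = sum(values)
    return [v / total for v in values]

unweighted = proportions(counts)        # heights of the unweighted histogram
weighted = proportions(weight_sums)     # heights of the weighted histogram

def mean_age(props):
    """Mean age implied by a set of class proportions."""
    return sum(a * p for a, p in zip(ages, props))

for age, u, w in zip(ages, unweighted, weighted):
    print(f"{age:>3}  unweighted={u:.3f}  weighted={w:.3f}")
print(round(mean_age(unweighted), 2), round(mean_age(weighted), 2))
```

The unweighted mean age comes out at about 16.8 against about 16.6 for the weighted one, which matches the observation that the unweighted distribution is shifted to the right.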
34
Survey Data Analysis
Issues and Simple Examples from
Graphical Methods
35
Basic Problem
in
Survey Data Analysis

36
Issues
iid (independent and identically distributed)
assumption
• the assumption does not hold in
complex surveys because of correlations
induced by the sampling design or because
of the population structure
• blindly applying standard programs to the
analysis can lead to incorrect results
37
Example: Rank Correlation Coefficient
Pay equity survey dispute: Canada Post and PSAC
• two job evaluations on the same set of people (and
same set of information) carried out in 1987 and
1993
• rank correlation between the two sets of job values
obtained through the evaluations was 0.539
• assumption to obtain a valid estimate of
correlation: pairs of observations are iid
38
Scatterplot of Evaluations
[Figure: scatterplot; horizontal axis Rank in 1987 (0 to 200), vertical axis Rank in 1993 (0 to 200)]
• Rank correlation is 0.539
39
A Stratified Design with Distinct
Differences Between Strata
• the pay level increases with each pay
category (four in number)
• the job value also generally increases with
each pay category
• therefore the observations are not iid
40
Scatterplot by Pay Category
[Figure: scatterplot with points labelled by pay category (2, 3, 4, 5); horizontal axis Rank in 1987 (0 to 200), vertical axis Rank in 1993 (0 to 200)]
41
Correlations within Level
Correlations within each pay level
• Level 2: –0.293
• Level 3: –0.010
• Level 4: 0.317
• Level 5: 0.496
Only Level 4 is significantly different from 0
42
Graphical Displays
first rule of data analysis
• always try to plot the data to get some initial
insights into the analysis
common tools
• histograms
• bar graphs
• scatterplots
43
Histograms
unweighted
• height of the bar in the ith class is proportional to
the number in the class
weighted
• height of the bar in the ith class is proportional to
the sum of the weights in the class
44
Body Mass Index
measured by
• weight in kilograms
divided by square of height
in meters
• 7.0 < BMI < 45.0
• BMI < 20: health
problems such as eating
disorders
• BMI > 27: health
problems such as
hypertension and coronary
heart disease
45
BMI: Women
[Figure: histogram of BMI for women; horizontal axis BMI (11 to 44), vertical scale 0 to 0.16]
46
BMI: Men
[Figure: histogram of BMI for men; horizontal axis BMI (11 to 44), vertical scale 0 to 0.16]
47
BMI: Comparisons
[Figure: BMI histograms for women and men compared; horizontal axis BMI (11 to 44), vertical scale 0 to 0.16; legend: Women, Men]
48
Bar Graphs
Same principle as histograms
unweighted
• size of the ith bar is proportional to the number in
the class
weighted
• size of the ith bar is proportional to the sum of the
weights in the class
49
Ontario Health Survey
Distribution of Levels of Happiness by Marital
Status
[Figure: bar chart of happiness level by marital status (Divorced, Widowed, Single, Married); horizontal axis Percentage (0% to 100%); legend: Happy, Somewhat happy, Somewhat unhappy, Unhappy, Very unhappy]
50
Scatterplots
unweighted
• plot the outcomes of one variable versus
another
problem in complex surveys
• there are often several thousand respondents
51
Scatterplot of BMI by Age and Sex
[Figure: scatterplot; horizontal axis Age (20 to 60), vertical axis BMI (10 to 40)]
52
Solution
• bin the data on one variable and find a
representative value
• at a given bin value the representative value
for the other variable is the weighted sum of
the values in the bin divided by the sum of
the weights in the bin
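A minimal sketch of this binning rule follows: within each bin of the first variable, the representative value of the second is sum(w*y)/sum(w). The ages, BMI values, weights and the bin width of 10 are invented for the example, and `binned_weighted_means` is a hypothetical helper name.

```python
from collections import defaultdict

def binned_weighted_means(x, y, w, bin_width):
    """For each bin of x, return sum(w*y)/sum(w) over the observations in that bin."""
    num = defaultdict(float)   # weighted sum of y per bin
    den = defaultdict(float)   # sum of weights per bin
    for xi, yi, wi in zip(x, y, w):
        b = int(xi // bin_width) * bin_width   # left edge of the bin containing xi
        num[b] += wi * yi
        den[b] += wi
    return {b: num[b] / den[b] for b in sorted(den)}

# Hypothetical records: age, BMI and survey weight (made-up values).
age = [21, 24, 27, 33, 38]
bmi = [22.0, 24.0, 23.0, 26.0, 25.0]
wt = [100, 300, 200, 150, 50]
trend = binned_weighted_means(age, bmi, wt, bin_width=10)
print(trend)
```

Plotting one point per bin instead of one per respondent is what keeps the display readable when there are several thousand respondents.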
53
BMI Trends by Age and Sex
[Figure: binned BMI plotted against age; horizontal axis Age (ticks 0 to 40), vertical axis Binned-BMI (ticks 23 to 27)]
54
Bubble Plots
• size of the circle is related to the sum of the survey weights in the
estimate
• more data in the BMI range 17 to 29 approximately
DBMI versus BMI (binned)
[Figure: bubble plot; horizontal axis BMI (12 to 42), vertical axis DBMI (15 to 30)]
55
Computing Packages
STATA and R
56
Available Software for Complex
Survey Analysis
• commercial packages:
• STATA
• SAS
• SPSS
• Mplus
• noncommercial package:
• R
57
STATA
defining the sampling design: svyset
– example
svyset [pweight=indiv_wt], strata(newstrata) psu(ea) vce(linear)
– output:
      pweight: indiv_wt
          VCE: linearized
     Strata 1: newstrata
         SU 1: ea
        FPC 1: <zero>
58
R: survey package
• define the sampling design: svydesign
– wk1de <- svydesign(id=~ea, strata=~newstrata, weight=~indiv_wt, nest=T, data=work1)
• output
> summary(wk1de)
Stratified 1 - level Cluster Sampling design
With (1860) clusters.
svydesign(id = ~ea, strata = ~newstrata, weight = ~indiv_wt,
    nest = T, data = work1)
59
Syntax
• STATA:
– svy: estimate
– Example: least squares estimation
– svyset [pweight=indiv_wt], strata(newstrata) psu(ea)
– svy: regress dbmi bmi
• R:
– svy***(*, design, data=, ...)
– Example: least squares estimation
– wk2de <- svydesign(id=~ea, strata=~newstrata, weight=~indiv_wt, nest=T, data=work2)
– svyglm(dbmi~bmi, data=work2, design=wk2de)
60
Available Survey Commands
               R      STATA
Descriptive    Yes    Yes
Regression     Yes    Yes (More)
Resampling     Yes    Yes
Longitudinal   Yes    No
PMLE           Yes    Yes
Calibration    Yes    No
61
Survey Data Analysis
Contingency Tables
and
Issues of Estimation of Precision
62
General Effect of Complex Surveys
on Precision
• stratification decreases variability (more
precise than SRS)
• clustering increases variability (less precise than
SRS)
• overall, the multistage design has the effect
of increasing variability (less precise than SRS)
63
Illustration Using Contingency Tables
• two categorical variables that can be set out
in I rows and J columns
• can get a survey estimate of the proportion
of observations in the cell defined by the ith
row and jth column: p̂ij
64
Example:
Ontario Health Survey
• rows: five levels describing levels of
happiness that people feel
• columns: four levels describing the amount
of stress people feel
• Is there an association between stress and
happiness?
65
STATA Commands
use "I:\workshopjune\work.dta", clear
svyset [pweight=indiv_wt], strata(newstrata) psu(ea)
svy: tabulate happiness stress
(running tabulate on estimation sample)
Number of strata   =        72          Number of obs      =      48057
Number of PSUs     =      1860          Population size    =  7961780.7
                                        Design df          =       1788
66
STATA Output
• table on stress and happiness
• estimated proportions in the table with test
statistic
----------------------------------------------------------
          |                   stress
happiness |       1        2        3        4    Total
----------+-----------------------------------------------
        1 |    .042    .2567    .2856     .085    .6692
        2 |    .026    .1426    .0935    .0109    .2731
        3 |   .0106    .0246    .0085  8.5e-04    .0446
        4 |    .004    .0045    .0015  8.4e-04    .0108
        5 |   .0016  3.4e-04  2.0e-04  2.1e-04    .0023
          |
    Total |   .0841    .4288    .3893    .0978        1
----------------------------------------------------------
  Key: cell proportions

  Pearson:
    Uncorrected   chi2(12)           = 3674.8280
    Design-based  F(8.66, 15484.10)  =   89.2775     P = 0.0000
67
Possible Test Statistics
adapt the classical test statistic
• need the sampling distribution of the statistic
Wald Test
• need an estimate of the variance-covariance matrix
68
Estimation of Variance or Precision
• variance estimation with complex multistage
cluster sample design:
• exact formula for variance estimation is often too
complex; use of an approximate approach required
• NOTE: taking account of the design in variance
estimation is as crucial as using the sampling weights
for the estimation of a statistic
69
Some Approximate Methods
• Taylor series methods
• Replication methods
• Balanced Repeated Replication (BRR)
• Jackknife
• Bootstrap
70
Replication Methods
• you can estimate the variance of an
estimated parameter by taking a large
number of different subsamples from your
original sample
• each subsample, called a replicate, is used to
estimate the parameter
• the variability among the resulting estimates is used
to estimate the variance of the full-sample estimate
• covariance between two different parameter
estimates is obtained from the covariance in
replicates
• the replication methods differ in the way the
replicates are built
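To make the replication idea concrete, here is a minimal sketch of one such method, a delete-one-PSU jackknife for a weighted mean. It deliberately ignores stratification (in a stratified design the deletion and reweighting would be done within strata), and the PSU labels, values and weights are invented; `jackknife_variance` is a name chosen for this example only.

```python
def weighted_mean(y, w):
    """Survey-weighted mean: sum(w*y) / sum(w)."""
    return sum(wi * yi for yi, wi in zip(y, w)) / sum(w)

def jackknife_variance(psu_ids, y, w):
    """Delete-one-PSU jackknife variance of the weighted mean (no strata)."""
    psus = sorted(set(psu_ids))
    n = len(psus)
    full = weighted_mean(y, w)               # full-sample estimate
    replicate_estimates = []
    for dropped in psus:                     # one replicate per deleted PSU
        keep = [i for i, p in enumerate(psu_ids) if p != dropped]
        replicate_estimates.append(
            weighted_mean([y[i] for i in keep], [w[i] for i in keep])
        )
    # Variability of the replicate estimates around the full-sample estimate.
    return (n - 1) / n * sum((r - full) ** 2 for r in replicate_estimates)

# Hypothetical data: 4 PSUs with 2 respondents each and equal weights.
psu = [1, 1, 2, 2, 3, 3, 4, 4]
y = [3.0, 5.0, 4.0, 6.0, 8.0, 6.0, 1.0, 3.0]
w = [10.0] * 8
var_jk = jackknife_variance(psu, y, w)
print(var_jk)
```

With equal cluster sizes and equal weights, as in this toy data, the result coincides with the sample variance of the PSU means divided by the number of PSUs, which is a convenient sanity check.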
71
Assumptions
The resulting distribution of the test statistic is
based on having a large sample size with the
following properties
• the total number of first stage sampled clusters
(or primary sampling units) is assumed large
• the primary sample size in each stratum is small
but the number of strata is large
• the number of primary units in a stratum is large
• no survey weight is disproportionately large
72
Possible Violations of Assumptions
• the complex survey (stratified two-stage
sampling, for example) was done on a relatively
small scale
• a large-scale survey was done but inferences are
desired for small subpopulations
• stratification in which a few strata (or just one)
have very small sampling fractions compared to
the rest of the strata
• the sampling design was poor, resulting in large
variability in the sampling weights
73
Survey Data Analysis
Linear and Logistic
Regression
74
General Approach
• form a census statistic (model estimate or
expression or estimating equation)
• for the census statistic obtain a survey
estimate of the statistic
• the analysis is based on the survey estimate
75
Regression
Use of ordinary least squares can lead to
• badly biased estimates of the regression
coefficients if the design is not ignorable
• underestimation of the standard errors of the
regression coefficient if clustering (and to a
lesser extent the weighting) is ignored
76
Example:
Ontario Health Survey
Regress desired body mass index (DBMI) on body
mass index (BMI)
              STATA     Unweighted   Weighted
Intercept
  Estimate    10.877    11.196       10.877
  S.E.         0.141     0.064        0.065
Slope
  Estimate    0.4958    0.4716       0.4858
  S.E.        0.0058    0.0025       0.0026
77
Simple Linear Regression Model
• typical regression model
$$y_i = \alpha + \beta x_i + e_i$$
$$E(e_i) = 0, \quad E(e_i^2) = \sigma^2, \quad E(e_i e_j) = 0 \ (i \neq j)$$
• linear relationship plus random error
• errors are independent and identically
distributed
78
Census Statistic
• census estimate of the slope parameter $\beta$
$$B = \hat{\beta} = \frac{\sum_{i=1}^{N} (x_i - \bar{X})(y_i - \bar{Y})}{\sum_{i=1}^{N} (x_i - \bar{X})^2}$$
• Problem: the assumption of independent
errors in the population does not hold
• Solution: the least squares estimate is a
consistent estimate of the slope $\beta$
79
Survey Estimate
• the census estimate B is now the parameter of
interest
• the survey estimate is given by
$$b = \frac{\sum_i w_i (x_i - \hat{\bar{X}})(y_i - \hat{\bar{Y}})}{\sum_i w_i (x_i - \hat{\bar{X}})^2}$$
• estimate obtained from an estimating equation
• the estimate of variance cannot be taken from the
analysis of variance table in the regression of y on
x using either a weighted or unweighted analysis
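The weighted slope estimate can be computed directly, as in the sketch below. It assumes that the estimated means in the formula are the survey-weighted sample means of x and y. The data and the function name `survey_slope` are illustrative, and the sketch gives only the point estimate: its variance still has to come from a design-based method such as linearization or replication.

```python
def survey_slope(x, y, w):
    """Weighted slope: sum w(x - xbar_w)(y - ybar_w) / sum w(x - xbar_w)^2."""
    total_w = sum(w)
    x_bar = sum(wi * xi for xi, wi in zip(x, w)) / total_w   # weighted mean of x
    y_bar = sum(wi * yi for yi, wi in zip(y, w)) / total_w   # weighted mean of y
    num = sum(wi * (xi - x_bar) * (yi - y_bar) for xi, yi, wi in zip(x, y, w))
    den = sum(wi * (xi - x_bar) ** 2 for xi, wi in zip(x, w))
    return num / den

# Hypothetical observations with survey weights (made-up values).
x = [20.0, 25.0, 30.0, 35.0]
y = [21.0, 23.0, 26.0, 27.0]
w = [100.0, 250.0, 400.0, 50.0]

b = survey_slope(x, y, w)                 # weighted (survey) estimate
b_equal = survey_slope(x, y, [1.0] * 4)   # equal weights reduce to ordinary least squares
print(b, b_equal)
```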
80
Variance Estimation
Again, estimate of the variance of b is
obtained from one of the following
procedures
• Taylor linearization
• Jackknife
• BRR
• Bootstrap
81
Issues in Analysis
• application of the large sample
distributional results
• small survey
• regression analysis on small domains of interest
• multicollinearity
• survey data files often have many variables
recorded that are related to one another
82
Multicollinearity Example:
Ontario Health Survey
Two regression models: regress desired body
mass index on
• actual body mass index, age, gender, marital
status, smoking habits, drinking habits, and
amount of physical activity
• all of the above variables plus interaction
terms: marital status by smoking habits,
marital status by drinking habits, physical
activity by age
83
Partial STATA Output
No interaction terms
------------------------------------------------------------------------------
             |             Linearized
        dbmi |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
         bmi |   .4375517   .0066716    65.58   0.000     .4244667    .4506368
         age |   .0157202   .0014647    10.73   0.000     .0128475    .0185929
 _Imarital_2 |   .1413547   .0498052     2.84   0.005     .0436718    .2390377
 _Imarital_3 |   .4752516   .1416521     3.36   0.001     .1974293    .7530739
 _Imarital_4 |  -.0349268   .0749697    -0.47   0.641    -.1819648    .1121113
     _Isex_2 |  -2.192169    .036238   -60.49   0.000    -2.263243   -2.121095

Interaction terms present
------------------------------------------------------------------------------
             |             Linearized
        dbmi |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
         bmi |   .4369983   .0066473    65.74   0.000     .4239608    .4500357
         age |   .0027515   .0045811     0.60   0.548    -.0062335    .0117364
 _Imarital_2 |    .020803    .283399     0.07   0.941    -.5350276    .5766337
 _Imarital_3 |   .8300453   .3153888     2.63   0.009     .2114731    1.448618
 _Imarital_4 |   -.486307   .4478352    -1.09   0.278    -1.364646    .3920324
     _Isex_2 |  -2.193464   .0362143   -60.57   0.000    -2.264491   -2.122437
84
Comparison of Domain Means
Domains and Strata
• both are nonoverlapping parts or segments of a
population
• usually a frame exists for the strata so that
sampling can be done within each stratum to
reduce variation
• for domains the sample units cannot be separated
in advance of sampling
Inferences are required for domains.
85
Regression Approach
• use the regression commands in STATA and
declare the variables of interest to be
categorical
• example: DBMI relative to BMI related to
sex and happiness index
STATA commands
use "I:\workshopjune\work.dta", clear
svyset [pweight=indiv_wt], strata(newstrata) psu(ea)
. xi:svy: regress ratio i.sex*i.happiness
i.sex             _Isex_1-2           (naturally coded; _Isex_1 omitted)
i.happiness       _Ihappiness_1-5     (naturally coded; _Ihappiness_1 omitted)
i.sex*i.happi~s   _IsexXhap_#_#       (coded as above)
(running regress on estimation sample)
86
STATA Output
------------------------------------------------------------------------------
             |             Linearized
       ratio |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
     _Isex_2 |  -.0555096   .0022378   -24.81   0.000    -.0598986   -.0511206
_Ihappines~2 |   .0036588   .0033689     1.09   0.278    -.0029487    .0102663
_Ihappines~3 |   .0038151   .0082526     0.46   0.644    -.0123708    .0200009
_Ihappines~4 |   .0256273   .0181474     1.41   0.158    -.0099653    .0612199
_Ihappines~5 |   .0736566    .086237     0.85   0.393    -.0954801    .2427933
_IsexXhap_~2 |  -.0088389   .0046613    -1.90   0.058    -.0179811    .0003032
_IsexXhap_~3 |  -.0292948   .0114269    -2.56   0.010    -.0517063   -.0068833
_IsexXhap_~4 |  -.0720886   .0224737    -3.21   0.001    -.1161663   -.0280108
_IsexXhap_~5 |  -.1428534   .0978592    -1.46   0.145    -.3347848    .0490779
       _cons |   .9628054   .0016317   590.05   0.000     .9596051    .9660058
------------------------------------------------------------------------------
87
Logistic Regression
• probability of success pi for the ith individual
• vector of covariates xi associated with ith
individual
• dependent variable must be 0 or 1, independent
variables xi can be categorical or continuous
Does the probability of success pi depend on the
covariates xi – and in what way?
88
Census Parameter
Obtained from the logistic link function
$$\ln\left(\frac{p_i}{1 - p_i}\right) = \alpha + \beta x_i$$
and the census likelihood equation for the
regression parameters
Note: it is the log odds that is being modeled
in terms of the covariate
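Since the model is linear on the log-odds scale, a short sketch may help show how a fitted value maps back to a probability. The coefficients alpha = -2.0 and beta = 0.1 and the covariate value 25 are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import math

def log_odds(p):
    """Logit link: ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))

def probability(alpha, beta, x):
    """Invert the logit link: p = 1 / (1 + exp(-(alpha + beta * x)))."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * x)))

alpha, beta = -2.0, 0.1            # hypothetical coefficients, not estimates from the survey
p = probability(alpha, beta, x=25.0)

# A one-unit increase in x multiplies the odds p/(1-p) by exp(beta).
odds_ratio = math.exp(beta)
print(p, log_odds(p), odds_ratio)
```

Here the linear predictor is -2.0 + 0.1 * 25 = 0.5, so applying `log_odds` to the resulting probability returns 0.5 again.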
89
Example:
Ontario Health Survey
How does the chance of suffering from hypertension
depend on:
• body mass index
• age
• gender
• smoking habits
• stress
• a well-being score that is determined from self-perceived factors such as the energy one has, control
over emotions, state of morale, interest in life and so on
90
STATA Commands
use "I:\workshopjune\work.dta", clear
svyset [pweight=indiv_wt], strata(newstrata) psu(ea)
recode hyper (1=1) (2=0)
(hyper: 24258 changes made)
xi:svy: logit hyper bmi age i.sex i.smoktype i.stress i.wellbe
i.sex             _Isex_1-2           (naturally coded; _Isex_1 omitted)
i.smoktype        _Ismoktype_1-4      (naturally coded; _Ismoktype_1 omitted)
i.stress          _Istress_1-4        (naturally coded; _Istress_4 omitted)
i.wellbe          _Iwellbe_1-4        (naturally coded; _Iwellbe_1 omitted)
(running logit on estimation sample)
91
STATA Output part I
xi:svy: logit hyper bmi age i.sex i.smoktype i.stress i.wellbe
i.sex             _Isex_1-2           (naturally coded; _Isex_1 omitted)
i.smoktype        _Ismoktype_1-4      (naturally coded; _Ismoktype_1 omitted)
i.stress          _Istress_1-4        (naturally coded; _Istress_4 omitted)
i.wellbe          _Iwellbe_1-4        (naturally coded; _Iwellbe_1 omitted)
(running logit on estimation sample)

Number of strata   =        72          Number of obs      =      25871
Number of PSUs     =      1849          Population size    =  4341226.9
                                        Design df          =       1777
                                        F(  12,   1766)    =      64.99
                                        Prob > F           =     0.0000
92
STATA Output part II
------------------------------------------------------------------------------
             |             Linearized
       hyper |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
         bmi |   .1029348     .00803    12.82   0.000     .0871855     .118684
         age |   .0850085   .0040016    21.24   0.000     .0771601    .0928569
     _Isex_2 |  -.0094895   .0832978    -0.11   0.909    -.1728615    .1538825
_Ismoktype_2 |  -.1068761    .100976    -1.06   0.290    -.3049203    .0911682
_Ismoktype_3 |  -.1391754   .2245528    -0.62   0.535    -.5795907    .3012399
_Ismoktype_4 |  -.1862018   .1050622    -1.77   0.077    -.3922601    .0198566
  _Istress_1 |   .4201336   .2115243     1.99   0.047      .005271    .8349961
  _Istress_2 |   .0103797   .2055384     0.05   0.960    -.3927428    .4135022
  _Istress_3 |   -.177385   .2015597    -0.88   0.379     -.572704     .217934
  _Iwellbe_2 |  -.6197166   .2755986    -2.25   0.025    -1.160248   -.0791852
  _Iwellbe_3 |  -.7841664   .2593617    -3.02   0.003    -1.292853   -.2754803
  _Iwellbe_4 |   -1.07929   .2600326    -4.15   0.000    -1.589292   -.5692879
       _cons |   -8.12002    .441972   -18.37   0.000     -8.98686    -7.25318
------------------------------------------------------------------------------
93
GEE: Generalized
Estimating Equations
Dependent or response variable
• well-being measured on a 0 to 10 scale
• focus is on women only
Independent or explanatory variables
• has responsibility for a child under age 12 (yes = 1, no = 2)
• marital status (married = 1, separated = 2, divorced = 3,
never married = 5 [widowed removed from the dataset])
• employment status (employed = 1, unemployed = 2, family
care = 3)
STATA syntax
tsset pid year, yearly
xi: xtgee wellbe i.mlstat i.job i.child i.sex
[pweight = axrwght], family(poisson)
link(identity) corr(exchangeable)
94
GEE Results
------------------------------------------------------------------------------
             |             Semi-robust
      wellbe |      Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
  _Imlstat_2 |   1.206905   .2036603     5.93   0.000     .8077382    1.606072
  _Imlstat_3 |   .3732488    .120658     3.09   0.002     .1367635    .6097342
  _Imlstat_5 |  -.0250266    .077469    -0.32   0.747    -.1768631    .1268098
   _Ichild_2 |  -.0456858    .063007    -0.73   0.468    -.1691773    .0778056
    _Ijobc_2 |   .9498503   .4045538     2.35   0.019     .1569394    1.742761
    _Ijobc_3 |   .0124392   .1827747     0.07   0.946    -.3457926     .370671
       _cons |   1.922769   .0554797    34.66   0.000     1.814031    2.031507
------------------------------------------------------------------------------
95
For each type of initial marital status
Married
------------------------------------------------------------------------------
             |             Semi-robust
      wellbe |      Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
   _Ichild_2 |   .0666723   .0672237     0.99   0.321    -.0650836    .1984283
    _Ijobc_2 |    .888502    .720494     1.23   0.218    -.5236403    2.300644
    _Ijobc_3 |   .2989137   .2369747     1.26   0.207    -.1655482    .7633756
       _cons |   1.825918   .0562928    32.44   0.000     1.715586     1.93625
------------------------------------------------------------------------------
Separated or divorced
-------------+----------------------------------------------------------------
   _Ichild_2 |  -.6732289   .1847309    -3.64   0.000    -1.035295   -.3111629
    _Ijobc_2 |   1.239189   .8163575     1.52   0.129    -.3608422     2.83922
    _Ijobc_3 |  -.2405778   .6582919    -0.37   0.715    -1.530806    1.049651
       _cons |   2.777478   .1734716    16.01   0.000      2.43748    3.117476
------------------------------------------------------------------------------
Never married
-------------+----------------------------------------------------------------
   _Ichild_2 |  -.5800375   .2041848    -2.84   0.005    -.9802324   -.1798426
    _Ijobc_2 |   .9851042   .5063179     1.95   0.052    -.0072607    1.977469
    _Ijobc_3 |  -.2799635    .290873    -0.96   0.336    -.8500642    .2901371
       _cons |      2.406   .1951377    12.33   0.000     2.023538    2.788463
------------------------------------------------------------------------------
96
Cox Proportional Hazards
Model
Dependent or outcome variable
• time to breakdown of first marriage
Independent or explanatory variables
• gender
• race (white/non-white)
• Age in 1991 (restricted to 18 – 60)
• financial position: comfortable=1, doing
alright=2, just about getting by=3, quite
difficult=4, very difficult =5
97
STATA Commands
• Command for survival data set up
stset tvariable [pweight = axrwght], failure(fail==1) scale(1)
• Command for Cox proportional hazards model
xi: stcox i.sex i.arace aage i.afisit
98
STATA Output
------------------------------------------------------------------------------
             |               Robust
          _t | Haz. Ratio   Std. Err.       z    P>|z|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
     _Isex_2 |   1.251224   .1483865     1.89   0.059     .9917185    1.578635
   _Iarace_1 |   1.979298   .7844764     1.72   0.085     .9102175    4.304047
        aage |      .9366   .0056464   -10.86   0.000     .9255984    .9477324
  _Iafisit_2 |   1.226635    .201547     1.24   0.214     .8889056    1.692682
  _Iafisit_3 |   1.519284   .2527755     2.51   0.012     1.096523     2.10504
  _Iafisit_4 |    1.95182   .3985054     3.28   0.001     1.308124    2.912263
  _Iafisit_5 |   1.936742   .5864388     2.18   0.029     1.069869    3.506006
------------------------------------------------------------------------------
99