Linear Regression 1 - University of California, Irvine

Event History Analysis 1
Sociology 8811 Lecture 14
Copyright © 2007 by Evan Schofer
Do not copy or distribute without permission
Announcements
• Paper #1 due on Thursday!
• Questions?
• New Topic: Event History Analysis
Regression and EHA: Examples
• Medical Research on Drug Efficacy
• Question #1: Do patients with larger doses
of a drug have lower cholesterol?
• Approach: OLS Regression
• If assumptions are met, OLS is appropriate
• Independent Variable = dosage (“level” of drug)
• Dependent Variable = cholesterol (“level”)
Regression Example: Cholesterol
[Figure: scatterplot of Cholesterol Level (y-axis, 100 to 300) against Drug Dosage in mg (x-axis, 0 to 70)]
• The relationship between the level of X and Y is modeled as a linear function: Y = a + bX + e
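A minimal sketch of fitting Y = a + bX + e by ordinary least squares. The (dosage, cholesterol) pairs and the function name ols_fit are invented for illustration and do not come from the lecture.

```python
# Hypothetical (dosage in mg, cholesterol level) pairs, invented for illustration.
dosage = [0, 10, 20, 30, 40, 50, 60, 70]
cholesterol = [290, 270, 262, 240, 221, 210, 185, 170]


def ols_fit(x, y):
    """Return (a, b) that minimize the sum of squared errors in Y = a + bX + e."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = covariance of X and Y divided by the variance of X.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


a, b = ols_fit(dosage, cholesterol)
print(f"intercept a = {a:.1f}, slope b = {b:.2f}")
```

A negative slope b would correspond to "larger doses, lower cholesterol" in Question #1.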
Example 2: Drug & Mortality
• Suppose a different question:
• Does increased drug dosage reduce the
incidence of mortality among patients?
• The dependent variable has a different character
• 1. Whereas cholesterol is measured as a
“level” (continuously), mortality is “discrete”
• Either the patient lives or they don’t (not a “level”)
• 2. Also, TIMING is an issue
• Not just if a patient survives, but how long
• A drug that extends life is good, even if patients die
Logit/Probit Strategies
• Research strategies to address this problem:
• 1. Use a non-linear regression model for
discrete outcomes: Logit, Probit, etc.
• Dependent variable is a dummy for patient mortality
• Look for relationship between dosage and mortality
• Benefit: Easy. An analog of regression
• Limitation: Doesn’t take timing into account
• All patients that die have the same influence on the
model (whether they live 5 days or 20 years due to the
drug dosage).
Logit/Probit Strategy: Visual
[Figure: Mortality (y-axis, No/Yes) plotted against Drug Dosage in mg (x-axis, 0 to 70)]
• The relationship between the level of X and the discrete variable Y is modeled as a nonlinear function
Drug & Mortality: OLS Regression
• Option #2: Use OLS regression to model the
time elapsed (duration) until mortality
– Rather than ask “did they live or die”
(logit/probit), you ask “how long did they live”?
• Compute a variable that reflects the time until mortality
(in relevant time units – e.g., months since drug
therapy is started)
• Model time as the dependent variable
• Observe: Do patients with high drug doses die later
than ones with low doses?
OLS Duration Strategy: Visual
[Figure: scatterplot of Months Until Mortality (y-axis, 0 to 80) against Drug Dosage in mg (x-axis, 0 to 70)]
• Q: Where do you put individuals who were alive at the end of the study?
Drug & Mortality: OLS Regression
• Problem #1: What about patients who don’t
experience mortality during study?
• This is called “censored data”
• If study is 80 months, you know that Y>80…
– But, you don’t have an exact value
• What do you do?
– Treat them as experiencing mortality at the very end of the
study? Or approximate time of mortality?
– Exclude them? NO! That selects on the dependent variable!
• Possible solution: Use models for censored data
– Ex: tobit model.
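A small sketch of why neither workaround is satisfactory, assuming an 80-month study and invented "true" survival times (which a real study could never see for patients who outlive it). Both naive averages understate the true mean duration.

```python
STUDY_END = 80  # months, as in the example above

# Hypothetical true survival times in months (invented for illustration).
true_months = [12, 25, 40, 55, 70, 95, 120, 150]

# What the study actually records: time is cut off at the end of the study.
observed = [min(t, STUDY_END) for t in true_months]
died_in_study = [t <= STUDY_END for t in true_months]

true_mean = sum(true_months) / len(true_months)

# Workaround 1: treat censored patients as dying at the very end of the study.
mean_treat_as_dead = sum(observed) / len(observed)

# Workaround 2: exclude censored patients (selects on the dependent variable).
uncensored = [t for t, d in zip(observed, died_in_study) if d]
mean_excluded = sum(uncensored) / len(uncensored)

print(f"true mean:               {true_mean:.1f}")
print(f"censored treated as dead: {mean_treat_as_dead:.1f}")
print(f"censored excluded:        {mean_excluded:.1f}")
```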
Drug & Mortality: OLS Regression
• Problem #2: Temporal data often violate the
normality assumption of OLS regression
• Often the violations are quite bad
• “Censored” data is a surmountable problem, but the
normality violation usually is not
• So, we shouldn’t typically use OLS!
Drug and Mortality: EHA Strategy
• Event History Analysis (EHA) provides purchase on
this exact type of problem
• And others, as well
• In essence, EHA models a dependent variable that
reflects both:
– 1. Whether or not a patient experiences mortality (like
logit),
and…
– 2. When it occurs (like an OLS regression of duration)
• Note: This information is typically encoded in 2 or more variables
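One way to picture the "2 or more variables" encoding is a duration plus an event indicator per patient. The sketch below uses invented records and a hypothetical PatientRecord type, and summarizes each dosage group with a crude rate (deaths divided by total months observed).

```python
from dataclasses import dataclass


@dataclass
class PatientRecord:
    dosage_mg: float
    months: float  # "when": months until death, or until observation stopped
    died: int      # "whether": 1 if death was observed, 0 if not


# Hypothetical records, invented for illustration.
records = [
    PatientRecord(10, 8, 1),
    PatientRecord(10, 33, 1),
    PatientRecord(10, 60, 0),
    PatientRecord(50, 41, 1),
    PatientRecord(50, 60, 0),
    PatientRecord(50, 60, 0),
]


def crude_rate(rows):
    """Events per month at risk: total deaths divided by total months observed."""
    deaths = sum(r.died for r in rows)
    exposure = sum(r.months for r in rows)
    return deaths / exposure


low = [r for r in records if r.dosage_mg == 10]
high = [r for r in records if r.dosage_mg == 50]
print(f"low dose:  {crude_rate(low):.4f} deaths per patient-month")
print(f"high dose: {crude_rate(high):.4f} deaths per patient-month")
```

Patients who survive still contribute their observed months to the denominator, so neither the "whether" nor the "when" information is discarded.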
Drug and Mortality: EHA Strategy
• Moreover: EHA is very flexible and can
address various situations:
• 1. EHA can address “repeated” events
• Mortality can only occur once per patient.
• But, heart attack can occur repeatedly, at different
points in time – further confounding OLS or probit
• 2. EHA can address different time-clocks
• Durations could be coded in a number of contexts:
• From start of study. Age of patient. Historical time.
• And even more complex issues
EHA: Overview and Terminology
• EHA is referred to as “dynamic” modeling
• i.e., addresses the timing of outcomes: rates
• Dependent variable is best conceptualized
as a rate of some occurrence
• Not a “level” or “amount” as in OLS regression
• Think: “How fast?” “How often?”
• The “occurrence” may be something that can
occur only once for each case: e.g., mortality
• Or, it may be repeatable: e.g., marriages, strategic
alliances.
EHA: Overview
• EHA involves both descriptive and
parametric analysis of data
• Just like regression
• Scatterplots, partial plots = descriptive
• OLS model/hypothesis tests = parametric
• Descriptive analyses/plots
• Allow description of the overall rate of some outcome
• For all cases, or for various subgroups
• Parametric Models
• Allow hypothesis testing about variables that affect
rate (and can include control variables).
EHA: Types of Questions
• Some types of questions EHA can address:
• 1. Mortality: Does drug dosage reduce rates?
• Does “rate” decrease with larger doses?
• Also: control for race, gender, treatment options, etc.
• 2. Life stage transitions: timing of marriage
• Is rate affected by gender, class, religion?
• 3. Organizational mortality
• Is rate affected by size, historical era, competition?
• 4. Civil war
• Is rate affected by economic, political factors?
EHA Terminology: States & Events
• EHA has evolved its own terminology:
• “State” = the “state of being” of a case
• Conceptualized in terms of discrete phenomena
• e.g., alive vs. dead
• “State space” = the set of all possible states
• Can be complex: Single, married, divorced, widowed
• “Event” = Occurrence of the outcome of
interest
• Shift from “alive” to “dead”, “single” to “married”
• Occurs at a specific, known point in time
Terminology: Risk & Spells
• “Risk Set” = the set of all cases capable of
experiencing the event
• e.g., those “at risk” of experiencing mortality
• Note: the risk set changes over time…
• “Spell” = A chunk of time that a case
experiences, bounded by: events, and/or the
start or end of the study
• As in “I’m gonna sit here for a spell…”
• EHA is, in essence, an analysis of a set of spells
(experienced by a given sample of cases).
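A sketch of how the risk set shrinks over time, using four invented spells. The tuple layout and the function name risk_set are assumptions made for illustration.

```python
# Hypothetical spells: (case id, day the spell ended, True if it ended in the event).
spells = [
    ("A", 5, True),
    ("B", 12, False),  # observation stopped without the event
    ("C", 20, True),
    ("D", 30, True),
]


def risk_set(spells, t):
    """Ids of cases still observed and event-free just before time t."""
    return [case for case, end, _ in spells if end >= t]


for day in (1, 10, 20, 25):
    print(day, risk_set(spells, day))
```

A case leaves the risk set either by experiencing the event or by dropping out of observation.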
States, Spells, & Events: Visually
• If we assign numeric values to states, it is
easy to graph cases over time
• As they experience 1 or more spells
• Example: drug & mortality study
• States:
• Alive = 0
• Dead = 1
• Time = measured in months
• Starting at zero, when the study begins
• Ending at 60 months, when study ends (5 years).
States, Spells, & Events: Visually
• Example of mortality at month 33
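[Figure: State (y-axis; 0 = alive, 1 = dead) over Time in months (x-axis, 0 to 60). Spell #1 runs at state 0 until the event at month 33; Spell #2 runs at state 1 from the event to the end of the study]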
• Note: It takes 2 spells to describe this case
– But, we may only be interested in the first spell. (Because there is no
possibility of change after transition to state = 1)
States, Spells, & Events: Visually
• Example of a patient who is cured
– Doesn’t experience mortality during study
[Figure: State (y-axis, 0 or 1) over Time in months (x-axis, 0 to 60). Spell #1 stays at state 0 from the start to the end of the study]
• Note: Only 1 spell is needed
– The spell indicates a consistent state (0), for the
period of time in which we have information
More Terminology: Censoring
• Note: In both cases, data runs out after
month 60
• Even if the patient is still alive
• In temporal analysis, we rarely have data for
all relevant time for all cases
• “Censored” = indicates the absence of data
before or after a certain point in time
• As in: “data on cases is censored at 60 months”
• “Right Censored” = no data after a time point
• “Left Censored” = no data before a time point
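A sketch of right censoring in the 60-month drug study: what the study records depends on whether death falls inside the observation window. The function name observe and the idea of a known "true" month of death are assumptions for illustration.

```python
STUDY_END = 60  # months, as in the drug and mortality example


def observe(death_month):
    """Return (duration, event) as recorded by a study that ends at STUDY_END.

    death_month is the hypothetical true month of death, or None if the
    patient is never seen to die.
    """
    if death_month is not None and death_month <= STUDY_END:
        return death_month, 1  # event observed
    return STUDY_END, 0        # right censored at the end of the study


print(observe(33))    # patient who dies at month 33
print(observe(75))    # patient who dies after the study ends
print(observe(None))  # patient who is cured
```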
States, Spells, & Events: Visually
• A more complex state space: partnership
• 0 = single, 1 = married, 2 = divorced, 3 = widowed
• Individual history:
• Married at 20, divorced at 27, remarried at 33
[Figure: State (y-axis, 0 to 3) over Age in years (x-axis, 16 to 44), showing Spells #1 through #4; the final spell is right censored at age 45]
Measuring States and Times
• EHA, in short, is the analysis of spells
• It takes into account the duration of spells, and
whether or not there was a change of state at the end
• States at start and end of spell are measured
by assigning pre-defined values to a variable
• Much like logit/probit or multinomial logit
• Times at the start and end of spell must also
be measured
• Time Unit = The time metric in the study
• e.g., minutes, hours, days, months, years, etc.
Time Clock
• Time Clock = time reference of the analysis
• Possibilities:
• Duration since start of study
• Chronological age of case (person, firm, country)
• Duration since end of last spell
• i.e., clock is set to zero at start of each spell
• Historical time – the actual calendar date
• The choice of time-clock can radically
change the analysis and meaning of results
• It is crucial to choose a clock that makes sense for the
hypotheses you wish to test
Time Clocks Visually: Age
[Figure: State (y-axis, 0 to 3) over Age in years (x-axis, 16 to 44), showing Spells #1 through #4 up to the end of the study]
• EHA examines rate of transitions as a function of
a person’s age
Time Clocks Visually: Duration
• Single from 16-20 (4 years), married from 20-27 (7 years), divorced from 27-33 (6 years), remarried from 33-45 (12 years)
[Figure: State (y-axis, 0 to 3) over Duration in years (x-axis), with Spells #1 through #4 each starting at duration zero]
• EHA examines rate of transitions as a function
of a person’s duration in their current state
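A sketch of moving the partnership history from the age clock to the duration clock. The tuple layout is an assumption for illustration; the ages and states are those of the example.

```python
# Partnership history on the age clock: (start age, end age, state, ended in event).
# States: 0 = single, 1 = married, 2 = divorced. The last spell is right censored at 45.
age_clock = [
    (16, 20, 0, True),
    (20, 27, 1, True),
    (27, 33, 2, True),
    (33, 45, 1, False),
]

# Duration clock: the clock is reset to zero at the start of each spell.
duration_clock = [
    (0, end - start, state, event) for start, end, state, event in age_clock
]

for spell in duration_clock:
    print(spell)
```

The same four spells now end at durations 4, 7, 6, and 12 rather than at ages 20, 27, 33, and 45, which is why the choice of clock changes what a "rate" means.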
Time Clocks: General Advice
• Different time-clocks have different strengths
• We’ll discuss this more…
• Chronological Age = good for processes
clearly linked to age
• Biological things: fertility, mortality
• Liability of newness
• Historical time = useful for examining the
impact of historical change on ongoing
phenomena
• E.g., effects of changing regulatory regimes on rates
of strategic alliances
Moving Toward Analyses: Example
• Example: Employee retention
• How long after hiring before employees quit?
• Data: Sample of 12 employees at McDonald’s
• Time-Clock/Time Unit: duration of employment
from time of hiring (measured in days)
• 2 Possible states:
• Employed & No longer employed
• We are uninterested in subsequent hires
• Therefore, we focus on initial spell, ending in quitting.
Example: Employee Retention
• Visually – red line indicates length of
employment spell for each case:
[Figure: one horizontal line per case showing the length of each employment spell, over Time in days (x-axis, 0 to 120); some cases are right censored]
Simple EHA Descriptives
• Question: What simple things can we do to
describe this sample of 12 employees?
• 1. Average duration of employment
• Only works if all (or nearly all) have quit
• Many censored cases make “average” meaningless
– This is a fairly useful summary statistic
• Gives a sense of overall speed of events
• Especially useful when broken down by sub-groups
• e.g., average by gender or compensation plan.
Descriptives: Average Duration
• Simply calculate the mean time-to-quitting
• Average = 33.4 days
[Figure: employment spells by case over Time in days (x-axis, 0 to 120), with the average duration marked; some cases are right censored]
Simple EHA Descriptives
• Question: What simple things can we do to
describe this sample of 12 employees?
• 2. Compute “Half Life” of employee tenure
• Determine time at which attrition equals 50%
• Also highlights the overall turnover rate
• Note: Exact value is calculable, even if there are
censored cases
• Again, computing for sub-groups is useful
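A sketch of the half-life calculation for a sample of 12, assuming invented quit days. They were picked so that the result matches the 23 days reported on the next slide, but they are not the lecture's data. Simple counting like this is only valid when no case is censored before the half life is reached; here all censored cases are assumed to outlast the last quit.

```python
import math

# Hypothetical data for 12 employees (invented for illustration).
quit_days = [3, 8, 12, 17, 19, 23, 35, 52, 71]  # the 9 employees who quit
still_employed = 3                               # right censored after the last quit


def half_life(quit_days, n_total):
    """Earliest day by which at least half of the sample has quit, else None."""
    needed = math.ceil(n_total / 2)
    ordered = sorted(quit_days)
    if len(ordered) < needed:
        return None  # fewer than half ever quit: half life not reached
    return ordered[needed - 1]


n = len(quit_days) + still_employed
print(half_life(quit_days, n))
```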
Descriptives: Half Life
• Determine time when ½ of sample has had
event
• Half Life = 23 days
[Figure: employment spells by case over Time in days (x-axis, 0 to 120), with the half life marked; some cases are right censored]
Simple EHA Descriptives
• Question: What simple things can we do to
describe this sample of 12 employees?
• 3. Tabulate (or plot) quitters in different time periods: e.g., 1-20 days, 21-40 days, etc.
• Absolute numbers of “quitters” or “stayers”
– or
• Numbers of quitters as a proportion of “stayers”
• Or look at number (or proportion) who have “survived”
(i.e., not quit)
Descriptives: Tables
• For each period, determine number or
proportion quitting/staying
[Figure: employment spells by case over Time in days (x-axis, 0 to 120), divided into periods: Day 1-20, 20-40, 40-60, 60-80, 80-100]
EHA Descriptives: Tables
| Period | Time Range | Quitters: Total #, % | # staying |
|---|---|---|---|
| 1 | Day 1-20 | 5 quit, 42% of all, 42% of remaining | 7 left, 58% of all |
| 2 | Day 21-40 | 2 quit, 17% of all, 29% of remaining | 5 left, 42% of all |
| 3 | Day 41-60 | 1 quit, 8% of all, 20% of remaining | 4 left, 33% of all |
| 4 | Day 61-80 | 1 quit, 8% of all, 25% of remaining | 3 left, 25% of all |
EHA Descriptives: Tables
• Remarks on EHA tables:
• 1. Results of tables change depending on
time-ranges chosen (like a histogram)
• E.g., comparing 20-day ranges vs. 10-day ranges
• 2. % quitters vs. % quitters as a proportion
of those still employed
• Absolute % can be misleading since the number of
people left in the risk set tends to decrease
• A low # of quitters can actually correspond to a very
high rate of quitting for those remaining in the firm
• Typically, these ratios are more socially meaningful
than raw percentages.
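A sketch of how such a table can be computed, contrasting "% of all" with "% of remaining". The individual quit days are invented; only the per-period counts (5, 2, 1, 1 out of 12) follow the table above. It assumes nobody is censored inside the tabulated periods.

```python
# Hypothetical quit days for a sample of 12 (3 employees never quit in this window).
quit_days = [3, 8, 12, 17, 19, 23, 35, 52, 71]
n_total = 12
periods = [(1, 20), (21, 40), (41, 60), (61, 80)]


def period_table(quit_days, n_total, periods):
    """One row per period: quitters, % of all, % of those still employed, # staying."""
    rows = []
    remaining = n_total  # size of the risk set at the start of the period
    for start, end in periods:
        quits = sum(1 for d in quit_days if start <= d <= end)
        rows.append({
            "period": f"Day {start}-{end}",
            "quit": quits,
            "pct_of_all": 100 * quits / n_total,
            "pct_of_remaining": 100 * quits / remaining if remaining else 0.0,
            "staying": remaining - quits,
        })
        remaining -= quits
    return rows


rows = period_table(quit_days, n_total, periods)
for row in rows:
    print(row["period"], row["quit"],
          f"{row['pct_of_all']:.0f}% of all",
          f"{row['pct_of_remaining']:.0f}% of remaining",
          f"{row['staying']} left")
```

Because the denominator shrinks from period to period, a single quitter in a late period can represent a higher rate than several quitters early on.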
EHA Descriptives: Plots
• We can also plot tabular information:
[Figure: plot of Percent (y-axis, 0 to 100) by Time Period (x-axis, 0 to 5), with two series: % Quit (of Remaining) and % Remaining]
The Survivor Function
• A more sophisticated version of % remaining
• Calculated based on continuous time (calculus), rather
than based on some arbitrary interval (e.g., day 1-20)
• Survivor Function – S(t): The probability (at
time = t) of not having the event prior to time t.
• Always equal to 1 at time = 0 (when no events can have
happened yet)
• Decreases as more cases experience the event
• When graphed, it is typically a decreasing curve
• Looks a lot like % remaining
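In practice S(t) has to be estimated from data that include censored cases. A minimal sketch of one common estimator, Kaplan-Meier, is given below; the six durations and event flags are invented for illustration, and the function name is an assumption.

```python
def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct event time.

    durations: time observed for each case
    events:    1 if the case had the event at that time, 0 if it was censored
    """
    event_times = sorted({t for t, e in zip(durations, events) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        # Risk set: cases still under observation at time t.
        at_risk = sum(1 for d in durations if d >= t)
        occurred = sum(1 for d, e in zip(durations, events) if e and d == t)
        # Multiply by the share of the risk set that survives past t.
        survival *= 1 - occurred / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical data: two cases are censored (at times 8 and 15).
durations = [5, 8, 8, 12, 15, 20]
events = [1, 1, 0, 1, 0, 1]
for t, s in kaplan_meier(durations, events):
    print(t, round(s, 3))
```

Censored cases count in the risk set until they drop out, so they inform the estimate without being treated as events.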
Survivor Function
• McDonald’s Example:
[Figure: Survivor Function: McDonald’s Employees. S(t) (y-axis, 0 to 1) over Time (x-axis, 0 to 120). Steep decreases indicate lots of quitting at around 20 days]
The Hazard Function
• A more sophisticated version of # events
divided by # remaining
• Hazard Function – h(t) = The instantaneous rate
(probability per unit of time) of an event occurring at a
given point in time, given that it hasn’t already occurred
• Formula:
$$h(t) = \lim_{\Delta t \to 0} \frac{P(t + \Delta t > T \ge t \mid T \ge t)}{\Delta t}$$
• Think of it as: the rate of events occurring for
those at risk of experiencing the event
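The link between the hazard and the survivor function can be written out as follows. This is a standard identity, sketched under the assumption that the event time T is continuous with density f(t); the symbol f(t) is not used in the lecture.

```latex
\begin{align}
S(t) &= P(T \ge t), \qquad f(t) = -\frac{dS(t)}{dt} \\
h(t) &= \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}
      = \frac{f(t)}{S(t)}
      = -\frac{d}{dt} \ln S(t)
\end{align}
```

Dividing by S(t) is what restricts attention to "those at risk": the hazard rescales the rate of events by the share of cases that have not yet had the event.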
The Hazard Function
• Example:
[Figure: McDonald’s Employees: Hazard Rate. h(t) (y-axis, 0.00 to 0.12) over Time (x-axis, 0 to 80). High (and wide) peaks indicate lots of quitting]
Cumulative Hazard Function
• Problem: the Hazard Function is often very
spiky and hard to read/interpret
• Alternative #1: “Smooth” the hazard function
(using a smoothing algorithm)
• Alternative #2: The “cumulative” or
“integrated” hazard
• Use calculus to “integrate” the hazard function
• Recall – An integral represents the area under the
curve of another function between 0 and t.
• Integrated hazard functions never decrease (the opposite
of the survivor function, which never increases).
• Steep growth indicates that the hazard is high.
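With real data the integral is usually replaced by a running sum. A minimal sketch of the Nelson-Aalen estimator of the integrated hazard follows: at each event time it adds (number of events) / (number at risk). The data and function name are invented for illustration.

```python
def nelson_aalen(durations, events):
    """Return [(t, H(t))] at each distinct event time, H being the cumulative hazard.

    durations: time observed for each case
    events:    1 if the case had the event at that time, 0 if it was censored
    """
    event_times = sorted({t for t, e in zip(durations, events) if e})
    cumulative = 0.0
    curve = []
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)
        occurred = sum(1 for d, e in zip(durations, events) if e and d == t)
        # Each step adds the share of the current risk set having the event.
        cumulative += occurred / at_risk
        curve.append((t, cumulative))
    return curve


# Hypothetical data: two cases are censored (at times 8 and 15).
durations = [5, 8, 8, 12, 15, 20]
events = [1, 1, 0, 1, 0, 1]
for t, h in nelson_aalen(durations, events):
    print(t, round(h, 3))
```

The steps grow larger as the risk set shrinks, so the same number of events late in the study produces a steeper rise than early on.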
Integrated Hazard Function
• Example:
[Figure: McDonald’s Employees: Integrated Hazard. Integrated hazard (y-axis, 0 to 1.8) over Time (x-axis, 0 to 100). “Flat” areas indicate a low hazard rate; steep increases indicate peaks in the hazard rate]
Descriptive EHA: Marriage
• Example: Event = Marriage
• Time Clock: Person’s Age
• Data Source: NORC General Social Survey
• Sample: 29,000 individuals
Survivor: Marriage
• Compare survivor for women, men:
[Figure: Kaplan-Meier survival estimates, by dfem (series dfem = 0 and dfem = 1). Survival (y-axis, 0.00 to 1.00) over analysis time (x-axis, 0 to 100). The survivor plot for men declines later; the survivor plot for women declines earlier]
Integrated Hazard: Marriage
• Compare Integrated Hazard for women, men:
[Figure: Nelson-Aalen cumulative hazard estimates, by dfem (series dfem = 0 and dfem = 1). Cumulative hazard (y-axis, 0.00 to 3.00) over analysis time (x-axis, 0 to 100). The integrated hazard for men increases more slowly (and remains lower) than for women]
Hazard Plot: Marriage
• Hazard Rate: Full Sample
[Figure 3: Estimated hazard rate of entry into first marriage for the entire sample. Estimated Hazard Rate (y-axis, 0 to .2) by Age in Years (x-axis, 12 to 80)]
Survivor Plot: Pros/Cons
• Benefits:
• 1. Clear, simple interpretation
• 2. Useful for comparing subgroups in data
• Limitations:
• 1. Mainly useful for a fixed risk set with a single non-repeating event (e.g., drug trials/mortality)
– If events recur frequently, the survivor drops to zero (and
becomes uninterpretable)
• 2. If the risk set fluctuates a lot, the survivor function
becomes harder to interpret.
Hazard Plot Pros/Cons
• Benefits:
• Directly shows the rate over time
– This is the actual dependent variable modeled
• Works well for repeating events
• Limitations:
• Can be difficult to interpret – requires practice
• Spikes make it hard to get a clear picture of trend
– Pay close attention to width of spikes, not just height!
• Choice of smoothing algorithms can affect results
• Hard to compare groups (due to spikiness).
Integrated Hazard Plot Pros/Cons
• Benefits:
• Closely related to the dependent variable that you’ll be
modeling
• Very good for comparing groups
• Works for repeating events
• Limitations:
• Not as intuitive as the actual hazard rate
• Still takes some practice to interpret.