Lecture 6: Analysis of Categorical and Ordinal Data - Sortie-ND
Download
Report
Transcript Lecture 6: Analysis of Categorical and Ordinal Data - Sortie-ND
Lecture 6
Analysis of Categorical and Ordinal Data:
Binomial and Logistic Regression
Analysis of Binary Data
Binomial Regression
-
Used when the individual “trial” is not the unit of study, but rather
when there are replicates of a set of trials (i.e. seedlings in a quadrat)
» In the past, folks often analyzed this type of dataset by converting
the response variable to a percentage, and then doing regression on
the percentages (after doing ugly transformations…)
-
Model predicts the underlying Binomial probability that would
produce the observed number of successes given a number of trials
Logistic Regression
-
Used when the individual Bernoulli trial is the unit of study (i.e. did
the seedling die…)
-
Model predicts the probability of “success” of a given trial
Steps in a likelihood analysis for binomial
regression
In R:
1. Specify the “scientific model” that predicts the probability of
“success” as a function of a set of independent variables…
-- Note that your scientific model should predict expected values
bounded by 0 and 1 (since the predicted value is a probability)
2. Define the likelihood function (using dbinom)
binom_log_lh_function <- function(successes,trials,p)
{ dbinom(x=successes,size=trials, prob = p, log = TRUE) }
3. Set up optimization to find the parameters of the scientific model
that maximize likelihood across the dataset
Logistic Regression Example:
analysis of windthrow data
Traditionally: Summarize variation in degree and type of damage,
across species and tree sizes, from the storm, as a whole...
A likelihood alternative: Use the spatial variation in storm
intensity that occurs within a given storm to estimate parameters
of functions that describe
susceptibility to windthrow,
as a function of variation
in storm severity and individual
tree attributes...
Types of Response Variables
(with examples from analysis of windthrow data)
BINARY: Only two possible outcomes (yes, no; lived, died; etc.)
-
This is termed a “Bernoulli trial”
CATEGORICAL: Multiple categories (uprooted, snapped,...)
ORDINAL: Ordered categories (degree of damage): none, light,
medium, heavy, complete canopy loss {usually estimated visually}
CONTINUOUS: just what the term implies, but rarely used in
analyses of wind damage because of the difficulties of quantifying
damage accurately...
Analysis of Binary Data:
Traditional Logistic Regression
Consider a sample space consisting of two outcomes (A,B) where
the probability that event A occurs is p
Definition: Logit = log of an odds ratio (i.e log[p/(1-p)])
Benefits of logits
• A logit is a continuous variable
• Ranges from negative when p < 0.5 to positive when p > 0.5
Standard logistic regression involves fitting a linear function to
the logit:
log p /(1 p) a bX1 cX 2 ...
What if your terms are multiplicative?
log p /(1 p) a bX1 X 2 ...
Example: Assume that the probability of windthrow is a joint
(multiplicative) function of
(1) Storm severity, and
(2) Tree size
In addition, assume that the effect of DBH is nonlinear....
A model that incorporates these can be written as:
log( pisj /(1 pisj )) as cs * Si * DBH isjbs
A little more detail....
log( pisj /(1 pisj )) as cs * Si * DBH isjbs
Pisj is the probability of windthrow of the jth individual of
species s in plot i
DBHisj is the DBH of that individual
as, bs, and cs are species-specific, estimated parameters, and
Si is the estimated storm severity in plot i
-
NOTE: storm severity is an arbitrary index, and was allowed to
range from 0-1
NOTE: you can think of this as a hierarchical model, with trees
nested in plots, and S is the plot term
But don’t you have to measure storm severity (not estimate it)?
Likelihood Function for Logistic Regression
It couldn’t be any easier... (since the scientific model is already
expressed as a probabilistic equation):.
logit log( pisj /(1 pisj )) as cs * Si * DBH isjbs
elogit
pisj
1 elogit
log( pisj ) if treeisj was windthrown
log - likelihood
log(
1
p
)
if
tree
isj
was
not
windthrow
n
isj
isj
loglikelihood <- function(pred,observed)
{ ifelse(observed == 1, log(pred), log(1-pred)) }
Example: Windthrow in the Adirondacks
Highly variable damage
due to:
• variation within storm
• topography
• susceptibility of species
within a stand
Reference:
Canham, C. D., Papaik, M. J., and Latty, E. F. 2001. Interspecific variation in
susceptibility to windthrow as a function of tree size and storm severity for
northern temperate tree species. Canadian Journal of Forest Research 31:1-10.
The dataset
Study area: 15 x 6 km area perpendicular to the storm
path
43 circular plots: 0.125 ha (19.95 m radius) censused
in 1996 (20 of the 43 were in oldgrowth forests)
The plots were chosen to span a wide range of
apparent damage
All trees > 10 cm DBH censused
Tallied as windthrown if uprooted or if stem was < 45o
from the ground
Critical data requirements
Variation in storm severity across plots
Variation in DBH and species mixture within plots
log( pisj /(1 pisj )) as cs * Sij * DBH isjbs
The analysis...
log( pisj /(1 pisj )) as cs * Sij * DBH isjbs
7 species comprised 97% of stems – only stems of
those 7 species were included in the dataset for
analysis
# parameters = 64 (43 plots + 3 parameters for each of
7 species)
Parameters estimated using simulated annealing
Observed Proportion Windthrown
Model evaluation
1.0
62
0.9
67
The solid line is a
1:1 relationship
0.8
70
111
0.7
123
0.6
174
0.5
0.4
312
0.3
527
0.2
0.1
787
279
0.0
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
Predicted Probability of Windthrow
Numbers
above bars
represent the
number of
observations
in the class
Estimating Storm Severity
Storm Severity Index
1.0
0.8
0.6
0.4
0.2
density
basal area
0.0
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
Plot-Level Windthrow
Probability of Mortality
Results: Big trees...
1.0
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
70 cm DBH
ACRU
ACSA
BEAL
FAGR
PIRU
PRSE
TSCA
ALL
0.0
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
Storm Severity
Probability of Mortality
Little trees...
1.0
10 cm DBH
0.9
0.8
0.7
0.6
0.5
ACRU
ACSA
BEAL
FAGR
PIRU
PRSE
TSCA
ALL
0.4
0.3
0.2
0.1
0.0
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
Storm Severity
New twists
Effects of partial harvesting on risk of windthrow to
residual trees
Effects of proximity to edges of clearings on risk of
windthrow
Research with Dave Coates in cedar-hemlock
forests of interior B.C.
Effects of harvest intensity and
proximity to edge…
Equation (1): basic model – probability of windthrow is a species-specific
function of tree size and storm severity:
log( pisj /(1 pisj )) as Si * cs * DBH isjbs
(1)
Equation (2) introduces the effect of prior harvest removal to equation (1) by
adding basal area removal and assumes the effect is independent and additive
log( pisj /(1 pisj )) as Si * (cs * DBH isjbs hBA / 100)
(2)
Equation (3) assumes the effects of prior harvest interact with tree size:
log( pisj /(1 pisj )) as Si * cs * DBH isj(bs hBA / 100)
Models 1a – 3a: test models where separate c coefficients are estimated for
“edge” vs. “non-edge” trees (edge = any tree within 10 m of a forest edge)
(3)
Other issues…
Is the risk of windthrow independent of the fate of
neighboring trees? (not likely)
- Should we examine spatially-explicit models that
factor in the “nucleating” process of spread of
windthrow gaps?…
Analysis for ORDINAL Response Variables
The categories in this case are ranked (i.e. none, light,
heavy damage)
Analysis shifts to cumulative probabilities...
The “Parallel Slopes” form of ordinal logistic
regression
The Challenge: Since the response categories are ordinal, and the
model predicts cumulative probabilities, we need a scientific model
that generates predictions that keep the categories in order (i.e.
the cumulative probability that a response should be in or less
than level k needs to be greater than the predicted cumulative
probability for level k-1
The Parallel Slopes solution:
-
Just allow the intercept term in the equation for the logit to vary
among the k ordinal responses, while the slope stays constant
(Note that you only need k-1 intercepts…)
Simple Ordinal Logistic Regression
If p Pr( y Yk X i)
(i.e. the probability that an observation y will be less than or equal to
ordinal level Yk (k = 1.. n-1 levels) , given a vector of X explanatory
variables),
Then simple ordinal logistic regression fits a model of the form:
logit( p ) ak b1x1 b2 x2 ...
Remember:
p e logit /(1 e logit )
The probability that an event will fall into a single class k (rather
than the cumulative probability) is simply
pk Pr( y Yk X i) Pr( y Yk 1 X i)
In our case...
logit ( p ) aks cs Si DBH isj
where
bs
p Pr( y Yk )
and where aks, cs and bs are species specific parameters (s = 1.. S species),
and Si are the estimated storm severities for the i = 1..N plots.
# of parameters: N + (K-1+2)*S,
where N = # of plots, K = # of ordinal response levels, and
S = # of species
The Likelihood Function Stays the Same
The probability that an event will fall into a single class k (rather
than the cumulative probability) is simply
pk Pr( y Yk X i) Pr( y Yk 1 X i)
Again, since the scientific model is already expressed as a
probabilistic equation:
log( pk ) if treeisj experienced damage level k
log - likelihood
isj log( 1 pk ) if treeisj did not
Hurricane Damage in Puerto Rico
Storm damage assessment in the permanent plot at the Luquillo
LTER site
-
Combined the data into a single analysis: 136 plots, 13 species
(including 1 lumped category for “other” species), and 3 damage
levels:
-
Hurricane Hugo - 1989
Hurricane Georges – 1998
No or light damage
Partial damage
Complete canopy loss
Total # of parameters = 188 (15,647 trees)
Canham, C. D., J. Thompson, J. K. Zimmerman, and M. Uriarte. 2010. Variation in
susceptibility to hurricane damage as a function of storm intensity in Puerto
Rican tree species. Biotropica 42:87-94.
Parameter Estimation with Simulated Annealing
0
2500000
5000000
-11300
Likelihood
-11320
-11340
-11360
-11380
-11400
-11420
-11440
-11460
Iteration
Solving simultaneously for 188 parameters in a dataset
containing > 15,000 trees takes time!
Model Evaluation
Goodness of Fit
Combined Hugo and Georges Analysis
Observed Proportion
1.0
0.8
Complete Damage
Partial Damage
No Damage
1:1 Line
0.6
0.4
0.2
0.0
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
Predicted Probability
Comparison of the two storms...
Figure 1. Relative Variation in Severity
of Hurricanes Hugo and Georges
0.30
Hugo
Georges
Statistics on variation in storm
severity from Hurricanes Hugo
and Georges
Proportion of Plots
0.25
0.20
0.15
0.10
0.05
Severity
n
minimum
maximum
mean
S.D.
Hugo
96
0.20
1.00
0.63
0.17
Georges
40
0.02
0.69
0.44
0.16
0.00
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
Storm Severity Index
0.8
0.9
1.0
None
Medium
Complete
0.8
1
BUCCAP
Probability of Damage
Probability of Damage
1
0.6
0.4
0.2
0
None
Medium
Complete
0.6
0.4
0.2
0
0
0.2
0.4
0.6
0.8
1
0
Storm Severity
0.2
0.4
0.6
0.8
1
Storm Severity
1
1
None
Medium
Complete
0.8
CECSCH
Probability of Damage
Probability of Damage
ALCLAT
0.8
0.6
0.4
0.2
CASARB
0.8
0.6
None
Medium
Complete
0.4
0.2
0
0
0
0.2
0.4
0.6
Storm Severity
0.8
1
0
0.2
0.4
0.6
Storm Severity
0.8
1