Bayesian Analysis of Discrete Compositional Data
Models for the Analysis of Discrete
Compositional Data
An Application of Random Effects Graphical Models
Devin S. Johnson
STARMAP
Department of Statistics
Colorado State University
Developed under the EPA STAR Research Assistance Agreement
CR-829095
Dissertation Contributions
• Chapter 2: Discrete regression graphical chain model.
– New graphical chain model.
– Markov properties of the DR model are illustrated.
• Chapter 3: Random effects graphical models.
– Single discrete response.
– Derived Markov properties and integrability conditions.
• Chapter 4: Multi-way composition models.
– Allows analysis of multi-way compositions.
– Derived Markov properties.
– Integrability shown for preservative models.
• Chapter 5: Autoregressive models for capture-recapture
data (Biometrics, 2003).
Motivating Problem
• Various stream sites in the Mid-Atlantic region of the
United States were visited in Summer 1994.
– For each site, each observed fish species was cross-categorized according to several traits.
– Environmental variables were also measured at each site (e.g., precipitation, chloride concentration, …).
• Relative proportions are more informative.
• How can we determine whether, and which, collected environmental variables affect species richness compositions?
Outline
• Introduction
– Compositional data
– Probability models
• Brief introduction to chain graphs
• A graphical model for compositional data
– Modeling individual probabilities
– Markov properties of random effects graphical models
• Analysis of fish species richness compositional data
• Conclusions and Future Research
Discrete Compositions and Probability Models
• Compositional data are multivariate observations
$Z = (Z_1,\dots,Z_D)$ subject to the constraints $\sum_i Z_i = 1$ and
$Z_i \ge 0$.
• Compositional data are usually modeled with the
Logistic-Normal distribution (Aitchison 1986).
– Scale and location parameters provide a large
amount of flexibility compared to the Dirichlet model
– LN model defined for positive compositions only
• Problem: With discrete counts one has a non-trivial
probability of observing 0 individuals in a particular
category
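The zero-count problem can be illustrated with the additive log-ratio transform on which the logistic-normal model of Aitchison (1986) is built. The sketch below is only an illustration: the function name `alr` and the example counts are assumptions, not taken from the presentation.

```python
import math

def alr(z):
    """Additive log-ratio transform: log(z_j / z_D) for j = 1, ..., D-1."""
    if any(v <= 0 for v in z):
        raise ValueError("alr is undefined when any component is zero")
    return [math.log(v / z[-1]) for v in z[:-1]]

positive = [0.5, 0.2, 0.3]               # strictly positive composition
print(alr(positive))                     # finite log-ratios

counts = [12, 0, 8]                      # hypothetical counts with an empty category
total = sum(counts)
props = [c / total for c in counts]      # observed composition containing a zero
try:
    alr(props)
except ValueError as err:
    print("observed composition:", err)
```

An observed composition with an empty category has no finite log-ratio representation, which is why a count-level model is needed.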
Existing Compositional Data Models
• Billheimer and Guttorp (2001) proposed using a
multinomial state-space model for a single composition,
$$(C_{i1},\dots,C_{iD}) \sim \text{multinomial}(N_i;\ Z_{i1},\dots,Z_{iD})$$
$$(Z_{i1},\dots,Z_{iD}) \sim \mathrm{LN}(\mu_i, \Sigma_i),$$
where $C_{ij}$ is the number of individuals belonging to
category j = 1,…,D at site i = 1,…,S.
Limitations:
– Models proportions of a single categorical variable.
– Abstract interpretation of included covariate effects
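The two-stage structure of the multinomial state-space model can be sketched by simulation: draw a latent composition from a logistic-normal distribution, then draw counts given that composition. This is a minimal sketch under stated assumptions: the covariance is taken to be diagonal for simplicity, and the function names, parameter values, and seed are hypothetical.

```python
import math
import random

def inverse_alr(w):
    """Map a vector in R^(D-1) to a composition of length D (additive logistic)."""
    expw = [math.exp(v) for v in w] + [1.0]
    s = sum(expw)
    return [v / s for v in expw]

def simulate_site(n_individuals, mu, sd, rng):
    """Draw Z ~ LN(mu, diag(sd^2)), then counts C ~ multinomial(N, Z)."""
    # Independent normals: a diagonal covariance is an assumption of this sketch.
    w = [rng.gauss(m, s) for m, s in zip(mu, sd)]
    z = inverse_alr(w)
    cells = rng.choices(range(len(z)), weights=z, k=n_individuals)
    counts = [cells.count(j) for j in range(len(z))]
    return z, counts

rng = random.Random(1)
z, counts = simulate_site(20, mu=[0.5, -2.0], sd=[0.3, 0.3], rng=rng)
print(z, counts)
```

The latent composition is always strictly positive, while the simulated counts may contain zeros.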
Existing Graphical Models
• Graph model theory (see Lauritzen 1996) has been used
for many years to
– model cell probabilities for high dimensional
contingency tables
– determine dependence relationships among
categorical and continuous variables
Limitation:
– Graphical models are designed for a single sample
(or site in the case of the Oregon stream data).
Compositional data may arise at many sites
New Improvements for Compositional Data Models
• The Billheimer and Guttorp model can be generalized by
the application of graphical model theory.
– Generalized models can be applied to cross-classified
compositions
– Simple interpretation of covariate effects as a variable
in a Markov random field
• Conversely, graphical model theory can be expanded to
include models for multiple site sampling schemes
Chain Graphs
[Figure: example chain graph on the vertices a, b, c, d, e]
• Mathematical graphs are used to illustrate complex
dependence relationships in a multivariate distribution.
• A random vector is represented as a set of vertices, V .
• Pairs of vertices are connected by directed edges if a
causal relationship is assumed, and by undirected edges if the
relationship is mutual.
Probability Model for Individuals
(Unobserved Composition)
• Response variables
– Set F of discrete categorical variables
– Notation: y is a specific cell
• Explanatory variables
– Set $\Gamma \cup \Delta$ of categorical ($\Delta$) and/or continuous ($\Gamma$)
variables
– Notation: x refers to a specific explanatory
observation
• Random effects
– Allows flexibility when sampling many “sites”
– Unobserved covariates
– Notation: $\varepsilon_f$, $f \subseteq F$, refers to a random effect.
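Probability Model and Extended Chain Graph, $G_\varepsilon$
• Joint distribution
$$f(y, x, \varepsilon) = f(y \mid x, \varepsilon)\, f(x)\, f(\varepsilon)$$
• Graph illustrating possible dependence relationships for
the full model, $G_\varepsilon$.
[Figure: extended chain graph $G_\varepsilon$ on the vertices $X_1$, $X_2$, $Y_1$, $Y_2$, $\varepsilon_1$, $\varepsilon_2$, $\varepsilon_{\{1,2\}}$]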
Random Effects Discrete Regression Model
(REDR)
• Sampling of individuals occurs at many different random
sites, i = 1,…,S, where covariates are measured only
once per site
• Hierarchical model for individual probabilities:
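$$
f_{\mathrm{REDR}}(y \mid x, \varepsilon) = \exp\Big\{ \alpha_F(x, \varepsilon)
+ \sum_{f \subseteq F} \sum_{c \subseteq \Gamma} \sum_{d \subseteq \Delta} \beta_{fcd}(y, x_\Delta) \prod_{\gamma \in c} x_\gamma
+ \sum_{f \subseteq F} \sum_{\gamma \in \Gamma} \sum_{d \subseteq \Delta} \sum_{m=2}^{M} \omega_{f\gamma dm}(y, x_\Delta)\, x_\gamma^{m}
+ \sum_{f \subseteq F} \varepsilon_f(y) \Big\}
$$
$$
\varepsilon_f \sim
\begin{cases}
0 & \text{if } f \subseteq F \text{ is not complete in } G_\varepsilon \\
\mathrm{MVN}(0, \Sigma_f) & \text{if } f \subseteq F \text{ is complete in } G_\varepsilon
\end{cases}
$$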
Random Effects Discrete Regression Model
(REDR)
Response parameter constraints:
• The function $\alpha_F(x, \varepsilon)$ is a normalizing constant w.r.t.
$y \mid (x, \varepsilon)$ and therefore is not a function of y.
• The parameters $\beta_{fcd}(y, x_\Delta)$, $\omega_{f\gamma dm}(y, x_\Delta)$, and $\varepsilon_f(y)$ are
interaction effects that depend on y and $x_\Delta$ through the
levels of the variables in f and d only.
• Interaction parameters (and random effects) are set to
zero for identifiability of the model if the cells y or $x_\Delta$ are
indexed by the first level of any variable in f or d.
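The role of the normalizing constant and of the first-level constraint can be sketched in a deliberately reduced case: one response variable with K levels, one continuous covariate, no discrete covariates, and no polynomial terms. The function name and all parameter values below are hypothetical, and the reduction itself is an assumption made for illustration rather than the model fitted in the presentation.

```python
import math

def redr_cell_probs(x, beta0, beta1, eps):
    """Cell probabilities for one response with K levels and one covariate x.

    beta0, beta1 and eps hold the K-1 free values; the first level is fixed
    at zero, which is the identifiability constraint.
    """
    eta = [0.0] + [b0 + b1 * x + e for b0, b1, e in zip(beta0, beta1, eps)]
    # The normalising constant depends on x and eps but not on the cell y.
    alpha = -math.log(sum(math.exp(v) for v in eta))
    probs = [math.exp(alpha + v) for v in eta]
    return probs, alpha

probs, alpha = redr_cell_probs(
    x=0.4, beta0=[0.2, -1.0], beta1=[1.5, 0.0], eps=[0.1, -0.3]
)
print(probs, alpha)
```

Because the constant is recomputed for every value of the covariate and random effect, the probabilities always sum to one.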
Random Effects Discrete Regression Model
(REDR)
• Model for explanatory variables (CG distribution):
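$$
f(x) = f(x_\Delta)\, f(x_\Gamma \mid x_\Delta) \propto \exp\Big\{ \sum_{d \subseteq \Delta} \lambda_d(x_\Delta) \Big\} \times \mathrm{MVN}\Big( \sum_{d \subseteq \Delta} \eta_d(x_\Delta),\ \Psi^{-1} \Big)
$$
• Again, interactions depend on $x_\Delta$ through the levels of the
variables in the set d only, and identifiability constraints
are imposed.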
Graphical Models for Discrete Compositions
• For a set F of categorical responses
– Let D be the number of cross-classified cells
– Let Cij = Number of observations in cell j=1,…,D at
site i=1,…,S
• Likelihood
$(C_{i1},\dots,C_{iD}) \mid X_G = x_G \sim \text{multinomial}(N_i;\ p_{i1},\dots,p_{iD})$,
where $p_{ij}$ is given by the REDR model
• Covariate distribution
$X_G \sim \mathrm{CG}(\lambda, \eta, \Psi)$
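The site-level likelihood contribution is a multinomial probability, which stays finite when some cells are empty. A minimal sketch of the log probability mass function follows; the function name and the example counts and probabilities are assumptions for illustration.

```python
import math

def multinomial_logpmf(counts, probs):
    """Log multinomial probability of the cell counts given cell probabilities."""
    n = sum(counts)
    out = math.lgamma(n + 1)              # log N!
    for c, p in zip(counts, probs):
        out -= math.lgamma(c + 1)         # minus log c_j!
        if c > 0:
            out += c * math.log(p)        # empty cells contribute nothing
    return out

site_counts = [5, 0, 3]                   # hypothetical counts with an empty cell
site_probs = [0.6, 0.1, 0.3]              # hypothetical cell probabilities
print(multinomial_logpmf(site_counts, site_probs))
```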
Markov Properties of Chain Graph Models
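• Let P denote a probability measure on the product
space
$\mathcal{X} = \prod_{\alpha \in V} \mathcal{X}_\alpha$
• Markov (Global) property
The probability measure P is Markovian with respect to
a graph G if for any triple (A, B, S) of disjoint sets in V,
such that S separates A from B in $(G_{\mathrm{an}(A \cup B \cup S)})^m$, the moral graph of the
smallest ancestral set containing $A \cup B \cup S$, we have
$A \perp\!\!\!\perp B \mid S$.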
• There are two weaker Markov properties, pairwise and
local Markov properties.
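The global Markov property can be checked mechanically: restrict the chain graph to the smallest ancestral set, moralize it, and test whether S blocks every path from A to B. The sketch below assumes a chain graph given as lists of directed and undirected edges; the function name and the small example graph are hypothetical.

```python
from itertools import combinations

def separated(directed, undirected, A, B, S):
    """True if S separates A from B in the moral graph of the smallest
    ancestral set containing A, B and S."""
    A, B, S = set(A), set(B), set(S)
    # 1. Smallest ancestral set: close under parents and undirected neighbours.
    anc = A | B | S
    changed = True
    while changed:
        changed = False
        for u, v in directed:
            if v in anc and u not in anc:
                anc.add(u)
                changed = True
        for u, v in undirected:
            if (u in anc) != (v in anc):
                anc.update((u, v))
                changed = True
    dir_e = [(u, v) for u, v in directed if u in anc and v in anc]
    und_e = [(u, v) for u, v in undirected if u in anc and v in anc]
    # 2. Chain components: connected pieces after removing directed edges.
    comp = {v: {v} for v in anc}
    for u, v in und_e:
        merged = comp[u] | comp[v]
        for w in merged:
            comp[w] = merged
    # 3. Moralise: drop directions, then marry parents of each chain component.
    adj = {v: set() for v in anc}
    for u, v in dir_e + und_e:
        adj[u].add(v)
        adj[v].add(u)
    parents = {}
    for u, v in dir_e:
        parents.setdefault(frozenset(comp[v]), set()).add(u)
    for pa in parents.values():
        for u, v in combinations(sorted(pa), 2):
            adj[u].add(v)
            adj[v].add(u)
    # 4. Search outward from A without ever entering S.
    seen, stack = set(A), list(A)
    while stack:
        for w in adj[stack.pop()] - S - seen:
            seen.add(w)
            stack.append(w)
    return not (seen & B)

directed = [("X1", "Y1"), ("X2", "Y2")]   # hypothetical arrows
undirected = [("Y1", "Y2")]               # hypothetical undirected edge
print(separated(directed, undirected, {"X1"}, {"Y2"}, {"X2", "Y1"}))
```

In this example graph, conditioning on both responses joins the two covariates through moralization, so they are separated marginally but not given the responses.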
Markov Properties of the REDR Model
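Proposition 1. A REDR model is $G_\varepsilon$ Markovian if and only
if the following six constraints are satisfied for a given
extended graph $G_\varepsilon$.
Response model
1. $\beta_{fcd}(y, x_\Delta) = 0$ unless $f \cup c \cup d$ is complete, for
$c \cup d \neq \emptyset$.
2. $\omega_{f\gamma dm}(y, x_\Delta) = 0$ for m = 2,…,M, unless $f \cup \{\gamma\} \cup d$ is
complete, where $\{\gamma\} \subseteq \Gamma$ and $d \subseteq \Delta$.
3. $\varepsilon_f(y) = -\beta_{f\emptyset\emptyset}(y)$ with probability 1 if f is not complete.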
Markov Properties of the REDR Model
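Proposition 1. A REDR model is $G_\varepsilon$ Markovian if and only
if the following six constraints are satisfied for a given
extended graph $G_\varepsilon$.
Covariate model
4. $\lambda_d(x_d) = 0$ unless d is complete.
5. $\eta_{d\gamma}(x_d) = 0$ unless $\{\gamma\} \cup d$ is complete, where $\{\gamma\} \subseteq \Gamma$
and $d \subseteq \Delta$.
6. $\psi_{\gamma\mu} = 0$ unless $\{\gamma, \mu\}$ is complete, where $\gamma, \mu \in \Gamma$ and
$\psi_{\gamma\mu}$ is the $(\gamma, \mu)$ element of $\Psi$.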
Markov Properties of the REDR Model
Sketch of proof.
• Lauritzen and Wermuth (1989) prove conditions
concerning the $\lambda$, $\eta$, and $\Psi$ parameters for the CG
distribution.
• If the $\beta$ and $\omega$ parameters are 0 for the specified sets,
then the density factorizes according to Frydenberg's
theorem.
• A modified version of the proof of the Hammersley-Clifford
theorem shows that if $f(y \mid x, \varepsilon)$ separates into
complete factors, then the corresponding $\beta$ and $\omega$
vectors for non-complete sets must be 0.
Preservative REDR Models
Preservative REDR models are defined by the following
conditions:
1. All connected components $a_q$, q = 1,…,Q, of F in $G_\varepsilon$ are
complete, where Q is the total number of connected
components.
2. Any $d \in \Gamma$ that is a parent of some $f \in a_q$ is also a parent of
every other $f \in a_q$, q = 1,…,Q.
Markov Properties of the REDR Model
Proposition 2. If P is a preservative REDR model, and P is
$G_\varepsilon$ Markovian, then the marginal distribution, $P_{F \cup \Gamma \cup \Delta}$, of
the covariates and response variables is $G = (G_\varepsilon)_{F \cup \Gamma \cup \Delta}$
Markovian.
Sketch of Proof.
The integrated REDR density follows Frydenberg’s (1990)
factorization criterion. The factorizing functions, however,
do not exist in closed form.
Parameter Estimation
• A Gibbs sampling approach is used for parameter
estimation
• Hierarchical centering
– Produces Gibbs samplers which converge to the
posterior distributions faster
– Most parameters have standard full conditionals if
given conditional conjugate distributions.
• Independent priors imply that covariate and response
models can be analyzed with separate MCMC
procedures.
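The idea behind hierarchical centering can be sketched in a toy Gaussian model that is much simpler than the REDR model: each site effect is centered on the population mean, so both full conditionals are normal and can be sampled directly. Everything in the sketch is an assumption made for illustration, including the known variances, the flat prior on the mean, the data, and the function name.

```python
import random

def gibbs_centred(y, n_iter, rng, sigma2=1.0, tau2=1.0):
    """Gibbs sampler for y_i ~ N(theta_i, sigma2), theta_i ~ N(mu, tau2),
    with a flat prior on mu and known variances."""
    n_sites = len(y)
    post_var = 1.0 / (1.0 / sigma2 + 1.0 / tau2)
    mu, mu_draws = 0.0, []
    for _ in range(n_iter):
        # theta_i | mu, y_i: normal with a precision-weighted mean of y_i and mu.
        theta = [
            rng.gauss(post_var * (yi / sigma2 + mu / tau2), post_var ** 0.5)
            for yi in y
        ]
        # mu | theta: normal around the average of the centred site effects.
        mu = rng.gauss(sum(theta) / n_sites, (tau2 / n_sites) ** 0.5)
        mu_draws.append(mu)
    return mu_draws

y = [1.2, 0.4, 2.1, 1.7, 0.9]             # hypothetical site-level observations
draws = gibbs_centred(y, n_iter=5000, rng=random.Random(7))
kept = draws[500:]                         # discard burn-in draws
print(sum(kept) / len(kept))
```

In this toy model the posterior mean of the population mean equals the sample average of the observations, which gives a simple check on the sampler.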
Fish Species Richness in the Mid-Atlantic Highlands
• 91 stream sites in the Mid-Atlantic region of the United
States were visited in an EPA EMAP study.
• Response composition: Observed fish species were
cross-categorized according to 2 discrete variables:
1. Habit
– Column species
– Benthic species
2. Pollution tolerance
– Intolerant
– Intermediate
– Tolerant
Stream Covariates
• Environmental covariates: values were measured at
each site for the following covariates
1. Mean watershed precipitation (m)
2. Minimum watershed elevation (m)
3. Turbidity (ln NTU)
4. Chloride concentration (ln meq/L)
5. Sulfate concentration (ln meq/L)
6. Watershed area (ln km²)
Fish Species Richness Model
• Composition Graphical Model:
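$$
f_{\mathrm{RE}}(y_i \mid x_i, \varepsilon_i) = \exp\Big\{ \alpha_F(x_i, \varepsilon_i)
+ \sum_{f \subseteq F} \sum_{\gamma=1}^{6} \beta_{f\gamma}(y_i)\,(x_{i\gamma} - \bar{x}_\gamma)\, s_\gamma^{-1}
+ \sum_{f \subseteq F} \varepsilon_{fi}(y_i) \Big\}
$$
$$\varepsilon_{fi} \sim \mathrm{MVN}(0, T_f^{-1})$$
and
$$x_i \sim \mathrm{MVN}(\mu, \Psi^{-1})$$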
• Prior distributions
$$\beta_{f\gamma}(y) \overset{iid}{\sim} N(0, 2); \quad \gamma = 0,\dots,6,\ f \subseteq F$$
$$T_f \sim \mathrm{Wish}(\nu, R_f)$$
$$\Psi \sim \mathrm{Wish}(6, R)$$
Model Selection
Three different models are considered:
1. Independent response
(i.e., $\beta_f(y_i) = \varepsilon_f(y_i) = 0$ for $f = \{H, T\}$)
2. Dependent response w/ independent errors
3. Dependent response w/ correlated errors
(equivalent to the Billheimer-Guttorp model)
Model                        DIC      ΔDIC    pD
Independent                  1107.7   --      68.7
Dependent (indep. errors)    1117.8   10.1    106.1
Dependent (corr. errors)     1166.8   59.1    162.5
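The ΔDIC column is each model's DIC minus the smallest DIC in the table. The sketch below reproduces that arithmetic and also backs out the posterior mean deviance using the standard identity DIC = mean deviance + pD. The function name is hypothetical, and the mean-deviance column is derived here rather than reported in the presentation.

```python
def dic_summary(models):
    """Map each label to (delta DIC, posterior mean deviance).

    `models` maps a label to the pair (DIC, pD).
    """
    best = min(dic for dic, _ in models.values())
    return {
        name: (round(dic - best, 1), round(dic - p_d, 1))
        for name, (dic, p_d) in models.items()
    }

models = {
    "Independent": (1107.7, 68.7),
    "Dependent (indep. errors)": (1117.8, 106.1),
    "Dependent (corr. errors)": (1166.8, 162.5),
}
for name, (delta, mean_dev) in dic_summary(models).items():
    print(f"{name:28s} dDIC={delta:5.1f}  mean deviance={mean_dev:7.1f}")
```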
Fish Species Functional Groups
Posterior suggested chain graph for independence model
(lowest DIC model)
[Figure: chain graph on the response vertices Habit and Tolerance and the covariate vertices Precipitation, Elevation, Area, Turbidity, Sulfate, Chloride]
Edge exclusion determined from 95% HPD intervals for $\beta$
parameters and off-diagonal elements of $\Psi$.
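The edge-exclusion rule can be sketched as follows: compute the shortest interval holding 95% of the posterior draws for a parameter, and drop the corresponding edge when that interval covers zero. The function names and the example draws are hypothetical; this is one common empirical way to obtain an HPD interval from MCMC output and is not necessarily the exact procedure used in the analysis.

```python
import math

def hpd_interval(samples, mass=0.95):
    """Shortest interval containing a fraction `mass` of the sorted draws."""
    xs = sorted(samples)
    n = len(xs)
    k = max(1, math.ceil(mass * n))       # number of draws inside the interval
    widths = [xs[i + k - 1] - xs[i] for i in range(n - k + 1)]
    start = widths.index(min(widths))     # first shortest window
    return xs[start], xs[start + k - 1]

def keep_edge(samples, mass=0.95):
    """Keep the edge only if the HPD interval excludes zero."""
    lo, hi = hpd_interval(samples, mass)
    return not (lo <= 0.0 <= hi)

away_from_zero = [0.01 * i for i in range(1, 101)]   # hypothetical draws, all positive
around_zero = [0.01 * i for i in range(-50, 50)]     # hypothetical draws straddling zero
print(keep_edge(away_from_zero), keep_edge(around_zero))
```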
Comments and Conclusions
• Using the discrete regression model with random effects, the
Billheimer-Guttorp model can be generalized
– Relationships evaluated through a graphical model
– Multi-way compositions can be analyzed with
specified dependence structure between cells
– MVN random effects imply that the cell probabilities
have a constrained LN distribution
• DR models also extend the capabilities of graphical
models
– Data can be analyzed from multiple sites
– Overdispersion in cell counts can be added
Future Work
• Model determination under a Bayesian framework
– Models involve regression coefficients as well as
many random effects
– Initial investigation suggests that selection based on
parameter inclusion, rather than edge inclusion, produces models with
higher posterior mass
• Accounting for spatial correlation
Any Questions?
The work reported here was developed under the STAR
Research Assistance Agreement CR-829095 awarded by
the U.S. Environmental Protection Agency (EPA) to
Colorado State University. This presentation has not been
formally reviewed by EPA. The views expressed here are
solely those of the presenter and of STARMAP, the program
he represents. EPA does not endorse any products or
commercial services mentioned in this presentation.