Lecture_Bayesian_Statistics - Sortie-ND


Lecture 8
Bayesian Statistics
Likelihood Methods in Forest Ecology
October 9th–20th, 2006
“Real knowledge is to know the extent
of one’s ignorance”
-Confucius
How do we measure our knowledge (ignorance)?
• Scientific point of view: Knowledge is
acceptable if it explains a body of natural
phenomena (Scientific model).
• Statistical point of view: Knowledge is
uncertain but we can use it if we can
measure its uncertainty. The question is
how to measure uncertainty and make use
of available knowledge.
Limitations of likelihoodist
& frequentist approaches
• Parsimony is often an insufficient criterion for
inference particularly if our objective is
forecasting.
• Model selection uncertainty is the big elephant in
the room.
• Since parameters do not have probability
distributions, error propagation in models cannot
be interpreted in a probabilistic manner.
• Cannot deal with multiple sources of error and
complex error structures in an efficient way.
• New data require new analyses.
Standard statistics revisited:
Complex Variance Structures
Inference
Addresses three basic questions:
1. What do I believe now that I have these
data? [Credibility or confidence]
2. What should I do now that I have these
data? [Decision]
3. How should I interpret these data as
evidence of one hypothesis vs. other
competing hypotheses? [Evidence]
(Diagram)
Body of knowledge → Scientific Hypothesis → Scientific Model
DATA → Statistical Model → Statistical Hypothesis
An example

Body of knowledge = fruit production in trees
Scientific hypothesis: yi = DBH^b
Scientific explanation = physiology, life history
Statistical model = Poisson distribution
Statistical hypothesis: b = value → Pred(y)
The Frequentist Take

Body of knowledge = fruit production in trees
Scientific hypothesis: log yi = b log(DBH)
Scientific explanation = physiology
Statistical model = Poisson distribution
Statistical hypothesis: b = 0 (null) vs. b = 0.4 (estimate)

Belief: Only with reference to an infinite series of trials
Decision: Accept or reject that b = 0
Evidence: None
The Likelihoodist Take

Body of knowledge = fruit production in trees
Scientific hypothesis: log yi = b log(DBH)
Scientific explanation = physiology
Statistical model = Poisson distribution
Statistical hypothesis: b = 0 (null) vs. b = 0.4 (estimate)

Belief: None; only relevant to the data at hand
Decision: Only with reference to alternate models
Evidence: Likelihood Ratio Test or AIC
The Bayesian Take

Body of knowledge = fruit production in trees
Scientific hypothesis: log yi = b log(DBH)
Scientific explanation = physiology
Statistical model = Poisson distribution
Statistical hypothesis: b = 0 vs. b = 0.4

Belief: Credible intervals
Decision: Parameter in or out of a distribution
Evidence: None
Parallels and differences in Bayesian &
Frequentist statistics
• Bayesian and frequentist approaches use the
data to derive a parameter estimate and a
measure of uncertainty around the parameter
that can be interpreted using probability.
• In Bayesian inference, parameters are treated
as random variables that have a distribution.
• If we know their distribution, we can assess the
probability that they will take on a particular
value (posterior ratios or credible intervals).
Evidence vs Probability
“As a matter of principle, the infrequency with which, in particular circumstances, decisive evidence is obtained, should not be confused with the force, or cogency, of such evidence.”
-Fisher 1959
Frequentist vs Bayesian

Frequentist:
• Prob = objective relative frequencies
• Params are fixed unknown constants, so cannot write e.g. P(θ = 0.5 | D)
• Estimators should be good when averaged across many trials

Bayesian:
• Prob = degrees of belief (uncertainty)
• Can write P(anything | D)
• Estimators should be good for the available data

Source: “All of Statistics”, Larry Wasserman
Frequentism
• Probability only defined as a long-term average
in an infinite sequence of trials (that typically
never happen!).
• p-value is the probability of an outcome at least that extreme,
given a specified null hypothesis.
• Null hypotheses are often strawmen set up to be
rejected
• Improperly used p values are poor tools for
statistical inference.
• We are interested in parameter estimation rather
than p values per se.
Frequentist statistics violates the
likelihood principle
“The use of p-values implies that a
hypothesis that may be true can be
rejected because it has not predicted
observable results that have not actually
occurred.”
Jeffreys, 1961
Some rules of probability

Prob(A ∪ B) = Prob(A) + Prob(B) − Prob(A ∩ B)
Prob(A ∩ B) = Prob(A) × Prob(B)   (assuming independence)
Prob(A | B) = Prob(A ∩ B) / Prob(B)
Prob(B | A) = Prob(A ∩ B) / Prob(A)
Prob(A ∩ B) = Prob(B | A) Prob(A)

(Venn diagram of overlapping events A and B)
Bayes Theorem

Prob(A | B) = Prob(A ∩ B) / Prob(B)
Prob(B | A) = Prob(A ∩ B) / Prob(A)
Prob(A ∩ B) = Prob(B | A) Prob(A)

Prob(B | A) = Prob(A | B) Prob(B) / Prob(A)
Bayes Theorem

Prob(Hyp | Data) = Prob(Data | Hyp) Prob(Hyp) / Prob(Data)
Prob(θ | Data) = Prob(Data | θ) Prob(θ) / Prob(Data)
Prob(θ | Data) = Lik(Data | θ) Prob(θ) / Prob(Data)
Bayes Theorem

Prob(θ | Data) = Lik(Data | θ) Prob(θ) / Prob(Data)

?
For a set of mutually exclusive hypotheses…

Prob(θi | Data) = Lik(Data | θi) Prob(θi) / Prob(Data)
               = Lik(Data | θi) Prob(θi) / Σ_{i=1}^{N} Prob(θi) Prob(Data | θi)

Bolker
An example from medical testing

Prob(ill) = 10⁻⁶
Prob(test+ | ill) = 1
Prob(test+ | healthy) = 10⁻³

What is Prob(ill | test+)?

Prob(ill | +) = Prob(+ | ill) Prob(ill) / Prob(+)
An example from medical testing (continued)

(Venn diagram: Test+ overlapping both ill and not ill)

Prob(+) = Prob(ill ∩ +) + Prob(healthy ∩ +)
        = Prob(ill) Prob(+ | ill) + Prob(healthy) Prob(+ | healthy)
        = 1 × 10⁻⁶ + (1 − 10⁻⁶) × 10⁻³ ≈ 10⁻³

So Prob(ill | +) ≈ 10⁻⁶ / 10⁻³ = 10⁻³: even after a positive test, illness remains very unlikely.
Bayes Theorem

Prob(θ | Data) = Lik(Data | θ) Prob(θ) / Prob(Data)

Prob(θ): rarely known
Prob(Data): hard-to-integrate function → MCMC methods
Joint and Marginal distributions:
Probability that 2 pigeon species (S & R) occupy an island

Events      S              Sc             Marginal
R           2              9              Pr{R} = 11/32
Rc          18             3              Pr{Rc} = 21/32
Marginal    Pr{S} = 20/32  Pr{Sc} = 12/32  N = 32

Event        Prob       Estimate
R given S    Pr{R|S}    2/20
S given R    Pr{S|R}    2/11

Pr{R | S} = Pr{R, S} / Pr{S} = Pr{S | R} Pr{R} / Pr{S}

Diamond 1975
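The table's marginal and conditional probabilities can be reproduced directly from the counts. A short Python sketch (the variable names are ours):

```python
# Island-occupancy counts from the slide's table (N = 32).
counts = {("R", "S"): 2, ("R", "Sc"): 9, ("Rc", "S"): 18, ("Rc", "Sc"): 3}
N = sum(counts.values())                               # 32

p_S = (counts[("R", "S")] + counts[("Rc", "S")]) / N   # marginal Pr{S} = 20/32
p_R = (counts[("R", "S")] + counts[("R", "Sc")]) / N   # marginal Pr{R} = 11/32
p_RS = counts[("R", "S")] / N                          # joint Pr{R, S} = 2/32

p_R_given_S = p_RS / p_S                               # 2/20
p_S_given_R = p_RS / p_R                               # 2/11

# Bayes' theorem recovers Pr{R|S} from the other conditional:
assert abs(p_R_given_S - p_S_given_R * p_R / p_S) < 1e-12
print(p_R_given_S, p_S_given_R)
```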
Pr{y | x} = Pr{y, x} / Pr{x} = Pr{x | y} Pr{y} / Pr{x} = Pr{x | y} Pr{y} / ∫ p(x | y) p(y) dy
Conjugacy
• In Bayesian probability theory, a conjugate
prior is a family of prior probability
distributions which has the property that the
posterior probability distribution also belongs
to that family.
• A conjugate prior is an algebraic
convenience: otherwise a difficult numerical
integration may be necessary.
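A minimal sketch of a conjugate update, using the standard Beta-binomial pair. The choice of prior and the fruiting-tree numbers are illustrative assumptions, not from the lecture:

```python
# Conjugacy sketch: a Beta prior with a binomial likelihood yields a Beta
# posterior, so the update is pure algebra (no numerical integration).
def beta_binomial_update(a, b, successes, failures):
    """Return posterior Beta(a', b') parameters after observing the data."""
    return a + successes, b + failures

# Beta(1, 1) is a flat prior; suppose 7 of 10 sampled trees set fruit.
a_post, b_post = beta_binomial_update(1, 1, successes=7, failures=3)
post_mean = a_post / (a_post + b_post)
print(a_post, b_post, post_mean)   # 8 4 0.6666666666666666
```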
Jointly distributed random variables

Prob(A, B, C) = Prob(A, B | C) Prob(C) = Prob(A | B, C) Prob(B | C) Prob(C)

Prob(parameters | data) ∝ p(data | process, data parameters)
                        × p(process | process parameters)
                        × p(all parameters)

We have to normalize this to turn it into a probability.
Hierarchical Bayes
• Ecological models tend to be high-dimensional and
include many sources of stochasticity.
• These sources of “noise” often don’t comply with
assumptions of traditional statistics:
– Independence (spatial or temporal)
– Balanced among groups
– Distributional assumptions
• HB can deal with these problems by partitioning a complex
problem into a series of univariate distributions that we can
solve, typically using sophisticated computational methods.
Hierarchical Bayes
Clark et al. 2004
Hierarchical Bayes

• Marginal distribution of a parameter, averaged over all other parameters and hyperparameters:

p(θ | Data) = ∫∫ p(θ, γ, ν | Data) dγ dν,
where p(θ, γ, ν | Data) ∝ p(Data | θ, γ, ν) × p(θ | γ, ν) × p(γ, ν)
Hierarchical Bayes: Advantages
1. Complex models can be constructed from simple,
conditional relationships. We don’t need an
integrated specification of the problem, only the
conditional components. We are drawing boxes and
arrows.
2. We relax the traditional requirement for independent
data. Conditional independence is enough. We typically
take up the relationships that cause correlation at a
lower process stage. We can accommodate multiple
data types within a single analysis, even treating model
output as ‘data’.
3. Sampling based approaches (MCMC) can do the
integration for us (the thing we avoided in advantage
1).
Why Hierarchical Bayes?
Useful approach for understanding ecological
processes because:
– Incorporates uncertainty using a probabilistic
framework
– Model parameters are random variables – output is
a probability distribution (the posterior distribution)
– Complex models are partitioned into a hierarchical
structure
– Performs well for high-dimensional models (i.e., many parameters) with little data
Bayes’ Rule

Posterior = prior × likelihood, divided by the normalizing density:

p(θ | y) = p(θ) p(y | θ) / p(y)

θ is the set of model parameters
y is the observed data

p(y) = ∫ p(θ) p(y | θ) dθ
(marginal distribution of y, or prior predictive distribution)

• Posterior distribution is affected by the data only through the likelihood function
• If the prior distribution is non-informative, then the data dominate the outcome
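For a single parameter, the normalizing density p(y) can be approximated by summing over a grid of parameter values. A sketch with a Poisson likelihood, echoing the fruit-count example; the data, grid, and flat prior are our illustrative assumptions:

```python
# Grid approximation of Bayes' rule for a single Poisson rate parameter.
import math

y = [4, 6, 5, 3, 7]                          # hypothetical fruit counts
grid = [0.1 * k for k in range(1, 151)]      # candidate values of lambda

def poisson_loglik(lam, data):
    # log-likelihood of the data under a Poisson(lam) model
    return sum(-lam + yi * math.log(lam) - math.lgamma(yi + 1) for yi in data)

prior = [1.0 / len(grid)] * len(grid)        # flat (non-informative) prior
unnorm = [math.exp(poisson_loglik(lam, y)) * p for lam, p in zip(grid, prior)]
p_y = sum(unnorm)                            # normalizing density p(y) on the grid
posterior = [u / p_y for u in unnorm]        # sums to 1

post_mean = sum(lam * p for lam, p in zip(grid, posterior))
print(post_mean)                             # near the sample mean of 5
```

With a flat prior the posterior is proportional to the likelihood, so the data dominate, exactly as the bullet above states.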
How do we do this?
Baby steps: Rejection sampling

• Suppose we have a target distribution p(x).
(Figure: target distribution, the density 1/(pi * (1 + (x - 1)^2)), plotted over [-2, 6])

• Bound the target distribution with a function f(x) so that Cf(x) >= p(x).
(Figure: proposal distribution Cf(x) enveloping the target distribution)

• Draw a value X from the proposal and calculate the ratio

a = p(X) / (Cf(X))

• With probability a, accept X as a random draw from p(x). With probability 1 − a, reject this value of X and repeat the procedure. To do this, draw a random variate z from the uniform density. If z < a, accept X.

• Build an empirical distribution of accepted draws, which approximates the target distribution.
(Figure: histogram of accepted draws, with the smoothed empirical distribution overlaid on the theoretical distribution)
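The steps above can be sketched in Python, using the Cauchy-shaped target from the figures and a uniform proposal over the plotted range. The constant C, the truncation to [-2, 6], and the seed are our choices:

```python
# Rejection sampling for the target p(x) = 1/(pi * (1 + (x - 1)^2)),
# restricted to the plotted range [-2, 6].
import math
import random

def p(x):
    # target density (unnormalized on the truncated range)
    return 1.0 / (math.pi * (1.0 + (x - 1.0) ** 2))

LO, HI = -2.0, 6.0
f = 1.0 / (HI - LO)          # uniform proposal density on [-2, 6]
C = 2.6                      # chosen so C*f(x) >= p(x) everywhere (max p = 1/pi)

def draw(rng):
    while True:
        x = rng.uniform(LO, HI)      # propose from f
        a = p(x) / (C * f)           # acceptance ratio a = p(x) / (C f(x))
        if rng.random() < a:         # accept with probability a, else retry
            return x

rng = random.Random(42)
samples = [draw(rng) for _ in range(20000)]
# The accepted draws should concentrate near the target's peak at x = 1
near_peak = sum(1 for s in samples if 0 <= s <= 2) / len(samples)
print(near_peak)   # roughly 0.6 for this truncated target
```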
MCMC Methods
• Markov process – a random process whose next
step depends only on the prior realization (lag of 1)
• The joint probability distribution (p(|y), which is the
posterior distribution of the parameters) is generally
impossible to integrate in closed form
• So…use a simulation approach based on conditional
probability
• The goal is to sample from this joint distribution of
all parameters in the model, given the data, (the
target distribution) in order to estimate the
parameters, but…
• …we don’t know what the target distribution looks
like, so we have to make a proposal distribution
Monte Carlo principle
• Given a very large set X and a distribution p(x)
over it
• We draw an i.i.d. set of N samples
• We can then approximate the distribution using
these samples:

p_N(x) = (1/N) Σ_{i=1}^{N} 1(x⁽ⁱ⁾ = x) → p(x) as N → ∞
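A quick sketch of the principle for a small discrete distribution (the distribution itself is made up): the empirical frequencies of i.i.d. samples converge to p(x) as N grows.

```python
# Monte Carlo principle: empirical frequencies approximate p(x).
import random
from collections import Counter

p = {"A": 0.5, "B": 0.3, "C": 0.2}    # a small discrete distribution
rng = random.Random(0)
N = 100000
draws = rng.choices(list(p), weights=list(p.values()), k=N)

# Empirical approximation p_N(x) = (1/N) * count of samples equal to x
p_hat = {x: n / N for x, n in Counter(draws).items()}
for x in p:
    print(x, p[x], p_hat[x])          # empirical vs true probabilities
```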
Markov Chain Monte Carlo (MCMC)
• Recall again the set X and the distribution
p(x) we wish to sample from
• Suppose that it is hard to sample p(x) but
that it is possible to “walk around” in X
using only local state transitions
• Insight: we can use a “random walk” to
help us draw random samples from p(x)
(Figure: distribution p(x) over the set X)
MCMC Methods
• Metropolis-Hastings algorithms are a way to
construct a Markov chain in such a way that its
equilibrium (or stationary) distribution is the
target distribution.
– Proposal is some kind of bounding distribution
that completely contains the target distribution
– Acceptance-rejection methods are used to
decide whether a proposed value is accepted
or rejected as being drawn from the target
distribution
– Jumping rules determine when and how the
chain moves on to new proposal values
MCMC
• The basic rule is that the ratio of successful jump
probabilities is proportional to the ratio of posterior
probabilities:

Post(A) / Post(B) = [P(jump B → A) P(accept A | B)] / [P(jump A → B) P(accept B | A)]
• This means that over the long term, we stay in areas
with high probability and the long-term occupancy of
the chain matches the posterior distribution of interest.
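A minimal random-walk Metropolis sketch of this rule: with a symmetric jump rule, accepting with probability min(1, posterior ratio) makes long-run chain occupancy match the target. The standard-normal target and jump scale are our illustrative choices, not from the lecture:

```python
# Random-walk Metropolis sampler for an unnormalized target density.
import math
import random

def unnorm_post(x):
    # posterior known only up to a constant (standard normal shape here)
    return math.exp(-0.5 * x * x)

rng = random.Random(1)
x = 0.0
chain = []
for _ in range(50000):
    prop = x + rng.gauss(0.0, 1.0)             # symmetric jump rule
    a = unnorm_post(prop) / unnorm_post(x)     # ratio of posterior probabilities
    if rng.random() < a:                       # accept with probability min(1, a)
        x = prop
    chain.append(x)

kept = chain[25000:]                           # discard first half as burn-in
mean = sum(kept) / len(kept)
var = sum((s - mean) ** 2 for s in kept) / len(kept)
print(mean, var)                               # should be near 0 and 1
```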
MCMC Methods
• Eventually, through many proposals that are updated
iteratively (based on jumping rules), the Markov chain
will converge to the target distribution, at which time it
has reached equilibrium (or stationarity)
• This is achieved after the so-called “burn-in” (“the chain
converged”)
• Simulations (proposals) made prior to reaching
stationarity (i.e., during burn-in) are not used in
estimating the target
• Burning questions: When have you achieved
stationarity and how do you know??? (some diagnostics,
but no objective answer because the target distribution is not
known)
More burning questions
• How can you pick a proposal distribution when you
don’t know what the target distribution is?
(this is what M-H figured out!)
– Series of proposals depends on a ratio involving
the target distribution, which itself cancels out in
the ratio
– So you don’t need to know the target distribution in
order to make a set of proposals that will eventually
converge to the target
– This is (vaguely) analogous in K-L information
theory to not having to “know the truth” in order to
estimate the difference between 2 models in their
distance from the truth (truth drops out in the
comparison)
Posterior Distributions
• Assuming the chain converged, you obtain an estimate
for each parameter of its marginal distribution,
p(1| 2, 3 … n, y)
That is, the distribution of 1 , averaged over the
distributions for all other parameters in the model &
given the data
• This marginal distribution is the posterior distribution
that represents the probability distribution of this
parameter, given the data & other parameters in the
model
• These posterior distributions of the parameters are the
basis of inference
Assessing Convergence
• Run multiple chains (chains are independent)
• Many iterations (>2000)
• First half are burn-in
• “Thin” the chain (take every xth value; depends
on auto-correlation)
• Compare traces of chains
(Figure: trace plots for Chains 1–3, plus an example of traces that have not converged)
Assessing Convergence
• Estimate R̂, the square root of the ratio of the
pooled variance (within-chain plus between-chain)
to the within-chain variance
• At stationarity, and as n → ∞, R̂ → 1
• Above R̂ ~ 1.1, chains may not have
converged, and a greater number of
simulations is recommended
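A sketch of the R̂ computation described above, following the standard potential-scale-reduction formula; the well-mixed chains are simulated here purely for illustration:

```python
# Gelman-Rubin R-hat diagnostic for m chains of equal length n.
import math
import random

def r_hat(chains):
    m = len(chains)
    n = len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    # between-chain variance
    B = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)
    # within-chain variance (average of the per-chain sample variances)
    W = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m
    var_plus = (n - 1) / n * W + B / n     # pooled variance estimate
    return math.sqrt(var_plus / W)

# Three chains drawn from the same distribution should give R-hat near 1
rng = random.Random(3)
chains = [[rng.gauss(0, 1) for _ in range(2000)] for _ in range(3)]
print(r_hat(chains))   # close to 1 for well-mixed chains
```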