MCMC in R - CU-Boulder Psychology and Neuroscience


Markov Chain Monte Carlo in R
Thanks to Lindon Eaves for first getting me interested (and for some of the slides herein)
and Scott Lynch for writing an excellent introduction which I use throughout
Markov Chain
Markov chain = a chain of events (e.g., sampling), where the
present state of the chain depends only upon the previous state
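This "memoryless" property is easy to see in a tiny R simulation (a hypothetical two-state chain with made-up transition probabilities):

```r
# Simulate a two-state Markov chain: the next state depends only on the
# current state, via a transition probability matrix (values are made up).
set.seed(1)
P <- matrix(c(0.9, 0.1,    # from state 1: P(stay), P(move to 2)
              0.5, 0.5),   # from state 2: P(move to 1), P(stay)
            nrow = 2, byrow = TRUE)
n <- 10000
state <- integer(n)
state[1] <- 1
for (i in 2:n) {
  state[i] <- sample(1:2, size = 1, prob = P[state[i - 1], ])
}
# Long-run fraction of time spent in state 1 approaches the stationary
# probability 0.5 / (0.1 + 0.5) = 5/6
mean(state == 1)
```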
Monte Carlo
Monte Carlo:
1. Administrative area of Monaco;
famous for gambling and
James Bond movies
2. Algorithms that rely on
repeated random sampling to
compute their results
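The classic toy example of definition 2 in R: estimating π by repeated random sampling.

```r
# Monte Carlo: throw random points at a square and count how many land
# inside the inscribed unit circle. That fraction estimates pi / 4.
set.seed(1)
n <- 1e6
x <- runif(n, -1, 1)
y <- runif(n, -1, 1)
est <- 4 * mean(x^2 + y^2 <= 1)
est   # close to pi; accuracy improves as n grows
```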
Markov Chain Monte Carlo
Markov Chain Monte Carlo:
Repeatedly sample from probability distributions, each of which is
determined by the previous step (the Markov Chain part) and
by some random noise (the Monte Carlo part). Given a good
method of determining whether the current state stays the
same or moves to a new place, the chain approximates the
true sampling distribution of the statistic.
MCMC is Bayesian
Bayesians interpret probability as a measure
of a state of knowledge
Frequentists interpret probability as a
measure of how often a thing occurs out
of a large number of trials
MCMC is inherently Bayesian because the
parameters of interest (e.g., beta
coefficients in regression) are treated as
random variables that we sample from
in order to understand their distribution (i.e.,
how uncertain we are about where the true beta is).
What good is it?
“Model Liberation” - we can do a lot of cool things much more
easily and naturally
It’s easy: coding models using Maximum Likelihood can be
extraordinarily cumbersome. Doing so in MCMC is very
simple. Basically, if you can simulate it, you can do MCMC with
it.
Provides non-parametric sampling distributions ‘for free’ - no need
to make assumptions about the shape of the sampling
distribution.
Obtain subject-level parameters (genetic and environmental factor
scores, imputed data, etc.) at no extra cost
Maximum Likelihood
1. Make a guess at a parameter Ω
2. Find the likelihood that this is the true parameter given your data:
Likelihood = P(Data|Ω) where Data is X = (x1, x2…xn)
3. This means multiplying lots of very tiny numbers, so in practice, you should
take the logs of these small numbers and add them: Log Likelihood =
∑log(P(x[i]|Ω))
4. Figure out some way to move the next guess in the right direction. Refigure
3 above.
5. Keep doing 3 & 4 until your results don’t change much and the second
derivative is negative. This is your maximum likelihood solution
6. Obtain standard errors from the curvature of the LL. The shape of the LL
near the maximum gives information, under assumptions (such as
normality), about the standard error of your estimate
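The steps above can be sketched in R with optim() on a toy problem: the ML estimate of a normal mean with known sd (here the analytic answer is just the sample mean, which makes the result easy to check; data and names are illustrative).

```r
# Maximum likelihood for the mean of a normal (sd fixed at 1) via optim().
set.seed(1)
x <- rnorm(100, mean = 3, sd = 1)

# Steps 2-3: log-likelihood = sum of log densities (adding logs instead of
# multiplying tiny probabilities); optim() minimizes, so negate it.
negLL <- function(mu) -sum(dnorm(x, mean = mu, sd = 1, log = TRUE))

# Steps 1 & 4-5: start from a guess and let the optimizer move it.
fit <- optim(par = 0, fn = negLL, method = "BFGS", hessian = TRUE)
fit$par                           # ML estimate; close to mean(x)

# Step 6: standard error from the curvature (Hessian) at the maximum.
se <- sqrt(1 / fit$hessian[1, 1]) # close to 1 / sqrt(100) = 0.1
se
```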
Problems with ML
1. Many models require integration over values of latent
variables (evaluate each likelihood and derivative of each
parameter holding all others constant). This can be
prohibitively expensive when the number of dimensions is
large.
2. It can be much harder to code a problem in ML than to
simulate it (and use MCMC on it)
3. It is sensitive to distributional assumptions (e.g., normality),
and assumed distributions of statistics may not reflect the
true distribution of statistics
Type of MCMC: Metropolis algorithm
1. Set things up:
a) Provide priors (Ω1) for the Ω’s of interest. Priors don’t have to be good,
but they can affect convergence speed
b) Provide a scale, ∂, (variance) for the proposal distribution, q().
c) Provide a good posterior distribution function (likelihood function) π()
2. Draw a new Ω’ (Ωi+1) from the proposal distribution q(.| Ωi & scale=∂)
3. Find the likelihood of both Ωi+1 & Ωi: π(Ωi+1 ) & π(Ωi )
4. If π(Ωi+1) > π(Ωi), accept Ωi+1. Otherwise, accept Ωi+1 with probability
= π(Ωi+1)/π(Ωi)
5. Redo steps 2-4 for i = 1…R where R is some large number.
6. “Eventually,” a sample of your chosen statistics, Ω’, will converge to the true
distribution of Ω. This may require some time (called the ‘burn in’ period).
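Steps 1-6 can be sketched in R for a single parameter, here the mean of a normal with known sd (a toy posterior with a flat prior; the tuning values and variable names are illustrative, and the ratio in step 4 is computed on the log scale to avoid underflow):

```r
# Random-walk Metropolis for the mean of a normal with known sd = 1.
set.seed(1)
x <- rnorm(50, mean = 2)            # data

log_post <- function(omega) {       # pi(): log posterior (flat prior)
  sum(dnorm(x, mean = omega, sd = 1, log = TRUE))
}

R     <- 10000                      # chain length
scale <- 0.5                        # proposal scale, the "delta" above
chain <- numeric(R)
chain[1] <- 0                       # starting value Omega_1
for (i in 2:R) {
  prop <- rnorm(1, mean = chain[i - 1], sd = scale)  # draw from q()
  # Accept with probability min(1, pi(prop) / pi(current)), on the log scale
  if (log(runif(1)) < log_post(prop) - log_post(chain[i - 1])) {
    chain[i] <- prop
  } else {
    chain[i] <- chain[i - 1]        # stay put
  }
}
keep <- chain[-(1:1000)]            # discard burn-in
mean(keep)                          # posterior mean; close to mean(x)
```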
Facts about MCMC
1. Choosing too large a scale, ∂, means the proposed Ωi+1 is accepted too
rarely. Choosing too small a scale means it is accepted too often (but the
chain moves in tiny steps). Both make burn-in times too long. The optimal
acceptance rate is between .3 and .7.
2. Remarkably, a sample of your chosen statistics, Ω’, will converge to the true
distribution of Ω regardless of the shape of the proposal distribution, q(),
that you choose.
3. The statistics Ω’ will show very high autocorrelation (by the nature of the
Markov Chain!). Nevertheless, the property in 2 above is maintained so
long as R is large.
4. Difficult problems (high multicollinearity, many latent variables, hierarchical
models) can take a long time (very large R).
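Fact 3 (high autocorrelation) is easy to check in R with acf(). A quick sketch, using an AR(1) series as a stand-in for a Metropolis chain (the AR coefficient 0.9 is an illustrative value):

```r
# MCMC draws behave like an autocorrelated time series; acf() quantifies it.
set.seed(1)
chain <- arima.sim(model = list(ar = 0.9), n = 5000)  # stand-in for a chain
a <- acf(chain, lag.max = 20, plot = FALSE)
a$acf[2]   # lag-1 autocorrelation, close to 0.9
# The effective sample size is far below 5000 for such a chain, which is
# why a large R is needed, as fact 2 notes.
```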
The normal distribution π() function
The normal distribution: f(x | μ, σ²) = (1/√(2πσ²)) · exp(−(x − μ)²/(2σ²))
The normal distribution for multiple regression: yi ~ N(Xiβ, σ²), so the
likelihood is a product of normal densities evaluated at the residuals yi − Xiβ
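The π() function for multiple regression can be written directly with dnorm() (a sketch under flat priors; the data and variable names are illustrative):

```r
# Log posterior (flat priors) for multiple regression: y ~ N(X %*% beta, sigma^2)
log_post <- function(beta, sigma, y, X) {
  if (sigma <= 0) return(-Inf)     # keep sigma in its legal range
  sum(dnorm(y, mean = X %*% beta, sd = sigma, log = TRUE))
}

# Toy data to show it runs: intercept 1, slope 2, noise sd 1
set.seed(1)
X <- cbind(1, rnorm(100))
y <- X %*% c(1, 2) + rnorm(100)
log_post(c(1, 2), 1, y, X)   # higher than for a bad guess:
log_post(c(0, 0), 1, y, X)
```

This is exactly the π() a Metropolis sampler would evaluate at each proposed (β, σ).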
Excellent MCMC introductions