Powerpoints - Center for Biological Sequence Analysis

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS
Bayesian Inference
Anders Gorm Pedersen
Molecular Evolution Group
Center for Biological Sequence Analysis
Technical University of Denmark (DTU)
Bayes Theorem
$$P(A \mid B) = \frac{P(B \mid A) \times P(A)}{P(B)}$$
$$P(M \mid D) = \frac{P(D \mid M) \times P(M)}{P(D)}$$
Reverend Thomas Bayes
(1702-1761)
P(D|M): Probability of data given model = likelihood
P(M): Prior probability of model
P(D): Probability of data summed over all models; essentially a normalizing constant ensuring that the posterior probabilities sum to one
P(M|D): Posterior probability of model
Bayesians vs. Frequentists
• Meaning of probability:
– Frequentist: long-run frequency of event in repeatable experiment
– Bayesian: degree of belief, way of quantifying uncertainty
• Finding probabilistic models from empirical data:
– Frequentist: parameters are fixed constants whose true values we are
trying to find good (point) estimates for.
– Bayesian: uncertainty concerning parameter values expressed by means
of probability distribution over possible parameter values
Bayes theorem: example
William is concerned that he may have reggaetonitis, a rare, inherited disease that affects 1 in 1000 people.
Prior probability distribution: P(Y) = 0.001, P(N) = 0.999
Doctor Bayes investigates William using new efficient test:
If you have disease, test will always be positive
If not: false positive rate = 1%
William tests positive. What is the probability that he actually has the disease?
Bayes theorem: example II
Model parameter:
d (does William have disease)
Possible parameter values:
d=Y, d=N
Possible outcomes (data):
+ : Test is positive
- : Test is negative
Observed data:
+
Likelihoods:
P(+|Y) = 1.0 (Test positive given disease)
P(-|Y) = 0.0 (Test negative given disease)
P(+|N) = 0.01 (Test positive given no disease)
P(-|N) = 0.99 (Test negative given no disease)
Bayes theorem: example III
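$$P(M \mid D) = \frac{P(D \mid M) \times P(M)}{P(D)}$$
$$
\begin{aligned}
P(Y \mid +) &= \frac{P(+ \mid Y) \times P(Y)}{P(+)} \\
&= \frac{P(+ \mid Y) \times P(Y)}{P(+ \mid Y) \times P(Y) + P(+ \mid N) \times P(N)} \\
&= \frac{1 \times 0.001}{1 \times 0.001 + 0.01 \times 0.999} \\
&\approx 9.1\%
\end{aligned}
$$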
Other way of understanding result:
(1) Test 1,000 people.
(2) Of these, 1 has the disease and tests positive.
(3) 1% of the remaining 999 (≈ 10 people) will give a false positive test
=> Out of 11 positive tests, 1 is true: P(Y|+) = 1/11 = 9.1%
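The same number can be checked with a few lines of Python. This is a minimal sketch: the function name `posterior_disease` and its argument names are illustrative choices, not part of the slides. It computes the posterior once with Bayes' theorem and once by counting expected positives among 1,000 people.

```python
def posterior_disease(prior, p_pos_given_disease, p_pos_given_healthy):
    """Return P(Y|+) from the prior P(Y) and the two likelihoods of a positive test."""
    # P(+) = P(+|Y) P(Y) + P(+|N) P(N): the normalizing constant
    evidence = p_pos_given_disease * prior + p_pos_given_healthy * (1.0 - prior)
    return p_pos_given_disease * prior / evidence

# Route 1: Bayes' theorem with the numbers from the example
p_formula = posterior_disease(prior=0.001, p_pos_given_disease=1.0, p_pos_given_healthy=0.01)

# Route 2: expected counts among 1,000 tested people
population = 1000
true_positives = population * 0.001 * 1.0        # 1 person with the disease
false_positives = population * 0.999 * 0.01      # 9.99, which the slide rounds to 10
p_counts = true_positives / (true_positives + false_positives)

print(f"P(Y|+) by formula: {p_formula:.3f}")
print(f"P(Y|+) by counts:  {p_counts:.3f}")
```

Both routes give about 0.091. The slide's 1/11 differs only because 9.99 false positives are rounded to 10.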
Bayes theorem: example IV
In Bayesian statistics we use data to update our ever-changing view of reality
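One way to picture this updating: the posterior from one test becomes the prior for the next. The sketch below assumes, purely for illustration, that William takes the test a second time, that it is positive again, and that the two results are independent given his disease status. None of this is stated on the slides, and the function name `update` is hypothetical.

```python
def update(prior, p_data_given_disease, p_data_given_healthy):
    """One application of Bayes' theorem: return the posterior P(Y | data)."""
    evidence = p_data_given_disease * prior + p_data_given_healthy * (1.0 - prior)
    return p_data_given_disease * prior / evidence

belief = 0.001          # prior: 1 in 1000 people has the disease
beliefs = []
for test_number in (1, 2):
    # Assumption: each test is positive and independent of the other given disease status
    belief = update(belief, p_data_given_disease=1.0, p_data_given_healthy=0.01)
    beliefs.append(belief)
    print(f"After positive test {test_number}: P(Y) = {belief:.3f}")
```

Under these assumptions the belief moves from 0.1% to about 9.1% after the first positive test, and to about 90.9% after the second.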
MCMC: Markov chain Monte Carlo
Problem: for complicated models parameter space is enormous.
=> Not easy/possible to find posterior distribution analytically
Solution: MCMC = Markov chain Monte Carlo
Start in random position on probability landscape.
Attempt step of random length in random direction.
(a) If move ends higher up: accept move
(b) If move ends below: accept move with probability
$P(\text{accept}) = P_{\text{low}} / P_{\text{high}}$
Record current parameter values in file after each step (after a rejected move, the old position is recorded again).
After many, many repetitions points will be sampled in
proportion to the height of the probability landscape
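The steps above can be sketched in Python as a random-walk Metropolis sampler. Everything specific here is an assumption made for illustration: the one-dimensional landscape (a standard normal curve), the Gaussian step proposal, the fixed starting point, and the names `log_density` and `metropolis`. Heights are handled on the log scale to avoid numerical underflow.

```python
import math
import random

def log_density(x):
    """Log height of a toy probability landscape: an unnormalized standard normal."""
    return -0.5 * x * x

def metropolis(log_p, start, n_steps, step_sd=2.0, seed=0):
    """Random-walk Metropolis sampling; returns the position after every step."""
    rng = random.Random(seed)
    x = start
    samples = []
    for _ in range(n_steps):
        # Step of random length in a random direction (symmetric proposal)
        proposal = x + rng.gauss(0.0, step_sd)
        log_ratio = log_p(proposal) - log_p(x)
        # Uphill moves are always accepted; downhill moves with probability P_low / P_high
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
        # Record the current position, whether or not the move was accepted
        samples.append(x)
    return samples

samples = metropolis(log_density, start=5.0, n_steps=50000)
kept = samples[5000:]                      # discard the early steps as burn-in
mean = sum(kept) / len(kept)
variance = sum((s - mean) ** 2 for s in kept) / len(kept)
print(f"sample mean = {mean:.2f}, sample variance = {variance:.2f}")
```

If the sampler works, the retained points should have mean near 0 and variance near 1, matching the landscape they were drawn from. The acceptance rule needs only the ratio of heights, so the normalizing constant P(D) never has to be computed.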
MCMCMC: Metropolis-coupled Markov Chain Monte Carlo
Problem: If there are multiple peaks in the probability landscape, then MCMC may get stuck on one of them
Solution: Metropolis-coupled Markov Chain Monte Carlo = MCMCMC = MC3
MC3 essential features:
• Run several Markov chains simultaneously
• One chain “cold”: this chain performs MCMC sampling
• Rest of chains are “heated”: move faster across valleys
• Each turn the cold and warm chains may swap position (swap probability is
proportional to ratio between heights)
=> More peaks will be visited
More chains means better chance of visiting all important peaks, but each additional
chain increases run-time
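A minimal Python sketch of the idea follows. The details are assumptions made for illustration, not taken from the slides: a one-dimensional landscape with two peaks, heated chains that sample the landscape raised to a power beta < 1 (which flattens the valleys), the heating schedule beta_i = 1 / (1 + heat × i), and one proposed swap between two randomly chosen chains per turn. The names `log_density` and `mc3` are hypothetical.

```python
import math
import random

def log_density(x):
    """Log height of a toy landscape with two peaks, at x = -3 and x = +3."""
    a = -0.5 * (x + 3.0) ** 2
    b = -0.5 * (x - 3.0) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))  # stable log(e^a + e^b)

def mc3(log_p, n_chains=4, n_steps=20000, step_sd=1.0, heat=0.5, seed=1):
    """Metropolis-coupled MCMC; returns the samples of the cold chain only."""
    rng = random.Random(seed)
    # Chain i sees the landscape raised to the power betas[i]; betas[0] = 1 is the cold chain
    betas = [1.0 / (1.0 + heat * i) for i in range(n_chains)]
    states = [rng.uniform(-5.0, 5.0) for _ in range(n_chains)]
    cold_samples = []
    for _ in range(n_steps):
        # Ordinary Metropolis step in every chain, each on its own flattened landscape
        for i, beta in enumerate(betas):
            proposal = states[i] + rng.gauss(0.0, step_sd)
            log_ratio = beta * (log_p(proposal) - log_p(states[i]))
            if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
                states[i] = proposal
        # Propose swapping the positions of two randomly chosen chains
        i, j = rng.sample(range(n_chains), 2)
        log_swap = (betas[i] - betas[j]) * (log_p(states[j]) - log_p(states[i]))
        if log_swap >= 0 or rng.random() < math.exp(log_swap):
            states[i], states[j] = states[j], states[i]
        cold_samples.append(states[0])
    return cold_samples

cold_samples = mc3(log_density)
fraction_right = sum(1 for x in cold_samples if x > 0) / len(cold_samples)
print(f"fraction of cold-chain samples on the right-hand peak: {fraction_right:.2f}")
```

Only the cold chain's samples are kept, because only that chain samples the unmodified landscape. With two equally high peaks, the fraction printed should be close to one half if both peaks are being visited.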
MCMCMC for inference of phylogeny
Result of run:
(a) Substitution parameters
(b) Tree topologies
Posterior probability distributions
of substitution parameters
Posterior Probability Distribution over Trees
• MAP (maximum a posteriori) estimate of phylogeny: tree topology occurring most often in MCMCMC output
• Clade support: posterior probability of group = frequency of clade in sampled trees.
• 95% credible set of trees: order trees from highest to lowest posterior probability, then add trees with highest probability until the cumulative posterior probability is 0.95
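These three summaries can be sketched in Python on a made-up sample of trees. The data are hypothetical: 100 sampled rooted topologies on taxa A to D, each stored as the set of clades it contains, with the trivial clade of all taxa left out. The function name `summarize` and this representation are illustrative assumptions, not the output format of any particular program.

```python
from collections import Counter

# Hypothetical clades, each a set of taxon names
AB, CD, ABC, AC = (frozenset(s) for s in ("AB", "CD", "ABC", "AC"))
tree1 = frozenset({AB, CD})    # ((A,B),(C,D))
tree2 = frozenset({AB, ABC})   # (((A,B),C),D)
tree3 = frozenset({AC, ABC})   # (((A,C),B),D)

# Hypothetical MCMCMC output: how often each topology was sampled
sampled_trees = [tree1] * 60 + [tree2] * 36 + [tree3] * 4

def summarize(trees, level=0.95):
    """Return MAP tree, its posterior probability, clade supports, and the credible set."""
    n = len(trees)
    ranked = Counter(trees).most_common()          # topologies, most frequent first
    map_tree, map_count = ranked[0]
    # Clade support = fraction of sampled trees that contain the clade
    clade_counts = Counter(clade for tree in trees for clade in tree)
    clade_support = {clade: count / n for clade, count in clade_counts.items()}
    # Credible set: add trees in order of probability until the level is reached
    credible_set, cumulative = [], 0.0
    for tree, count in ranked:
        credible_set.append(tree)
        cumulative += count / n
        if cumulative >= level:
            break
    return map_tree, map_count / n, clade_support, credible_set

map_tree, map_prob, clade_support, credible_set = summarize(sampled_trees)
print(f"MAP tree posterior probability: {map_prob:.2f}")
print(f"Support for clade (A,B): {clade_support[AB]:.2f}")
print(f"Trees in 95% credible set: {len(credible_set)}")
```

In this made-up sample the MAP tree has posterior probability 0.60, clade (A,B) has support 0.96 because it occurs in two of the three topologies, and the 95% credible set contains the two most frequent trees (cumulative probability 0.96).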