Bayesian inference for RNA viruses


The Coalescent and Measurably Evolving Populations
Alexei Drummond
Department of Computer Science
University of Auckland, NZ
Overview
1. Introduction to the Coalescent
2. Hepatitis C in Egypt
• An example using the coalescent
3. Measurably evolving populations
4. HIV-1 evolution within and among hosts
• An example using MEP concepts
5. Summary + Conclusions
The coalescent
• The coalescent is a model of the ancestral
relationships of a small sample of individuals taken
from a large background population.
• The coalescent describes a probability distribution on
ancestral genealogies (trees) given a population history.
– Therefore the coalescent can convert information from ancestral
genealogies into information about population history and vice
versa.
• The coalescent is a model of ancestral genealogies, not
sequences, and its simplest form assumes neutral
evolution.
The history of coalescent theory
• 1930-40s: Genealogical arguments well known to Wright & Fisher
• 1964: Crow & Kimura: Infinite Allele Model
• 1966: (Hubby & Lewontin) & (Harris) make first surveys of population allele variation by protein electrophoresis
• 1968: Motoo Kimura proposes neutral explanation of molecular evolution & population variation. So do King & Jukes
• 1971: Kimura & Ohta propose infinite sites model.
• 1975: Watterson makes explicit use of “The Coalescent”
• 1982: Kingman introduces “The Coalescent”.
• 1983: Hudson introduces “The Coalescent with Recombination”
• 1983: Kreitman publishes first major population sequences.
• 1987: Cann et al. trace human origin and migrations with mitochondrial DNA.
The history of coalescent theory (continued)
• 1988: Hughes & Nei: Genes with positive Darwinian Selection.
• 1989-90: Kaplan, Hudson, Takahata and others: Selection regimes with coalescent structure (MHC, Incompatibility alleles).
• 1991: McDonald & Kreitman: Data with surplus of replacement interspecific substitutions.
• 1994-95: Griffiths-Tavaré + Kuhner-Yamato-Felsenstein introduce sampling techniques to estimate parameters in population models.
• 1997-98: Krone-Neuhauser introduce Ancestral Selection Graph
• 1999: Wiuf & Donnelly use coalescent theory to estimate age of disease allele
• 2000: Wiuf et al. introduce gene conversion into coalescent.
• 2000-: A flood of SNP data & haplotypes are on their way.
[Diagram: coalescent theory links population processes and the genealogy]
Coalescent inference
1. Randomly sample individuals from population
2. Obtain gene sequences from sampled individuals
3. Either:
– Reconstruct tree / trees from sequences, then infer coalescent results from tree / trees
– Infer coalescent results directly from sequences
Demographic History
• Change in population size through time
• Applications include
– Estimating history of human populations
– Conservation biology
– Reconstructing infectious disease epidemics
– Investigating viral dynamics within hosts
Idealized Wright-Fisher populations
[Figure: haploid and diploid Wright-Fisher populations over three generations: grandparents, parents, now]
Random mating in an ideal population
•A constant population size of N individuals
•Each individual in the new generation “chooses” its
parent from the previous generation at random
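The random-mating rule above can be sketched as a small simulation. This is only an illustrative sketch: the function name `wright_fisher_tmrca`, the haploid assumption, and the values N = 10 and a sample of 3 are choices made here, not taken from the slides. It traces a sample backwards, letting each lineage pick its parent uniformly at random, until a single common ancestor remains.

```python
import random

def wright_fisher_tmrca(n_pop, sample_size, rng):
    """Trace a sample back through a haploid Wright-Fisher population.

    Returns the number of generations until the sample shares one ancestor.
    """
    lineages = set(rng.sample(range(n_pop), sample_size))
    generations = 0
    while len(lineages) > 1:
        # Each lineage "chooses" its parent uniformly at random;
        # lineages that choose the same parent merge (coalesce).
        lineages = {rng.randrange(n_pop) for _ in lineages}
        generations += 1
    return generations

rng = random.Random(1)
times = [wright_fisher_tmrca(10, 3, rng) for _ in range(20000)]
mean_tmrca = sum(times) / len(times)
print(mean_tmrca)
```

The average should land near the coalescent approximation 2N(1 - 1/n), which is about 13.3 generations for N = 10 and n = 3; the discrete model differs slightly because N is small.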
Genetic drift: extinction and ancestry
If you trace the ancestry of a sample of individuals back in time you inevitably reach a single most recent common ancestor.
If you pick a random individual and trace their descendants forward in time, all the descendants of that individual will with high probability eventually die out.
[Figure: ancestry traced through discrete generations into the past]
A sample genealogy from an idealized
Wright-Fisher population
[Figure: a sample genealogy of 3 sequences from a population (N = 10), drawn from past to present]
The coalescent: distributions and expectations on a sample genealogy
[Figure: sample genealogy from past to present with waiting times t_2 and t_3]
Here Exp(m) denotes an exponential distribution with mean m.
$t_2 \sim \mathrm{Exp}(N)$, $E[t_2] = N$
$t_3 \sim \mathrm{Exp}(N/3)$, $E[t_3] = N/3$
$E[t_k] = \frac{2N}{k(k-1)}$
$E[t_{root}] = 2N\left(1 - \frac{1}{n}\right)$
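These expectations can be checked numerically. The sketch below assumes that the waiting time while k lineages remain is exponential with mean 2N / (k(k-1)), and that the root age is the sum of the waiting times; the helper name `simulate_waiting_times` and the values n = 5, N = 100 are illustrative only.

```python
import random

def simulate_waiting_times(n, pop_size, rng):
    """Draw t_n, ..., t_2 with t_k ~ Exponential(mean = 2N / (k(k-1)))."""
    return {k: rng.expovariate(k * (k - 1) / (2.0 * pop_size))
            for k in range(n, 1, -1)}

rng = random.Random(42)
n, pop_size, reps = 5, 100.0, 50000
# The root age is the sum of all waiting times in one simulated genealogy.
root_ages = [sum(simulate_waiting_times(n, pop_size, rng).values())
             for _ in range(reps)]
mean_root = sum(root_ages) / reps
expected_root = 2 * pop_size * (1 - 1 / n)   # 2N(1 - 1/n)
print(mean_root, expected_root)
```

The simulated mean root age should agree with 2N(1 - 1/n) up to Monte Carlo error.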
The coalescent: probability density distribution
$t_2 \sim \mathrm{Exp}(N)$: $P(t_2 \mid N) = \frac{1}{N}\exp\left(-\frac{t_2}{N}\right)$
$t_3 \sim \mathrm{Exp}(N/3)$: $P(t_3 \mid N) = \frac{3}{N}\exp\left(-\frac{3 t_3}{N}\right)$
$f_G(g \mid N) = \prod_{k=2}^{n} \frac{k(k-1)}{2N}\exp\left(-\frac{k(k-1)\,t_k}{2N}\right)$
Kingman (1982a,b)
$g = \{E_g, \mathbf{t}\}$: the genealogy is an edge graph $E_g$ and a vector of times $\mathbf{t}$.
The coalescent: estimating population size from a sample genealogy
[Figure: sample genealogy from past to present with $t_2 = 7$ and $t_3 = 8$]
$\hat{N}_k = \frac{k(k-1)}{2}\,t_k$, so $\hat{N}_2 = 7$ and $\hat{N}_3 = 24$
$\hat{N} = \frac{1}{n-1}\sum_{k=2}^{n}\frac{k(k-1)}{2}\,t_k = 15.5$
Felsenstein (1992)
The coalescent: estimating population size confidence limits via ML
[Figure: relative log likelihood (-6 to -20) against population size (N) on a log scale from 1 to 1000; $\hat{N} = 15.5$ (5.1, 93.1)]
Maximum likelihood can be used to estimate population size by choosing a population size that maximizes the probability of the observed coalescent waiting times.
The confidence intervals are calculated from the curvature of the likelihood. For a single parameter model the 95% confidence limits are defined by the points where the log-likelihood drops 1.92 log units below the maximum log-likelihood.
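The estimate and its confidence limits can be reproduced with a short calculation. This sketch assumes the waiting times t_2 = 7 and t_3 = 8 from the example genealogy and an exponential waiting time with rate k(k-1)/(2N) for each interval; the function names and the bisection helper are illustrative, not part of any particular package.

```python
import math

def log_likelihood(pop_size, waiting_times):
    """Coalescent log-likelihood of waiting times {k: t_k} given N."""
    total = 0.0
    for k, t_k in waiting_times.items():
        rate = k * (k - 1) / (2.0 * pop_size)
        total += math.log(rate) - rate * t_k
    return total

def ml_estimate(waiting_times):
    """Average of the per-interval estimates k(k-1)/2 * t_k."""
    per_interval = [k * (k - 1) / 2.0 * t for k, t in waiting_times.items()]
    return sum(per_interval) / len(per_interval)

def bisect(f, lo, hi, iters=200):
    """Find a root of f on [lo, hi], assuming f changes sign there."""
    f_lo = f(lo)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if (f(mid) > 0) == (f_lo > 0):
            lo, f_lo = mid, f(mid)
        else:
            hi = mid
    return 0.5 * (lo + hi)

times = {2: 7.0, 3: 8.0}          # waiting times from the example genealogy
n_hat = ml_estimate(times)
# 95% limits: where the log-likelihood is 1.92 units below its maximum.
target = log_likelihood(n_hat, times) - 1.92
drop = lambda n: log_likelihood(n, times) - target
lower = bisect(drop, 0.1, n_hat)
upper = bisect(drop, n_hat, 10000.0)
print(n_hat, lower, upper)
```

The point estimate is 15.5, and the limits come out close to the (5.1, 93.1) interval quoted with the likelihood curve. Note how asymmetric the interval is: two waiting times say very little about an upper bound on N.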
The coalescent: shapes of gene
genealogies
Exponential growth
Constant size
The coalescent can be used to convert coalescent times into knowledge about population size and its change through time.
Constant population size: N(t)=N0
small N0
large N0
TIME
Coalescent and serial samples
Constant population
Exponential growth
Uncertainty in Genealogies
How similar are these two trees? Both of them are plausible given the data.
We can use MCMC to get the average result over all plausible trees.
Coalescent Summary
• The coalescent provides a theory of how population size
is related to the distribution of coalescent events in a
tree.
• Big populations have old trees
• Exponentially growing populations have star-like trees
• Given a genealogy the most likely population size can be
estimated.
• MCMC can be used to get a distribution of trees from
which a distribution of population sizes can be estimated.
Markov chain Monte Carlo (MCMC)
• Imagine you would like to estimate two parameters (θ,φ) from some data (D).
• You want to find values of θ and φ that have high probability given the data: p(θ,φ|D)
• Say you have a likelihood function of the form: Pr{D|θ,φ}
• Bayes rule tells us that:
– p(θ,φ|D) = Pr{D|θ,φ}p(θ,φ) / Pr{D}
– So that p(θ,φ|D) ∝ Pr{D|θ,φ}p(θ,φ)
Markov chain Monte Carlo (MCMC)
• p(θ,φ|D) is called the posterior probability (density) of θ,φ given D
• In an ideal world we want to know the posterior density for all possible values of θ,φ.
• Then we could pick a “credible region” in two dimensions that contained values of θ,φ that account for the majority of the posterior probability mass.
• This credible region would serve as an estimate that incorporates our uncertainty, and this credible set could be used to address hypotheses like: θ is greater than x.
• In reality we have to make do with a “sample” of the posterior, so that we evaluate p(θ,φ|D) for a finite number (say 10,000,000) of pairs of θ,φ.
• So which pairs should we choose?
Markov chain Monte Carlo (MCMC)
• Let's construct a random walk in 2-dimensional space
• In each step of the random walk we propose to make an (unbiased) small jump from our current position (θ,φ) to a new position (θ’,φ’)
• If p(θ’,φ’|D) > p(θ,φ|D) then we make the proposed jump
• However, if p(θ’,φ’|D) < p(θ,φ|D), then we make the proposed jump with probability α = p(θ’,φ’|D) / p(θ,φ|D), otherwise we stay where we are.
• It can be shown (trust me!) that if you proceed in this fashion for an infinite time then the equilibrium distribution of this random walk will be p(θ,φ|D)!
• That is, the random walk will visit a particular region [θ0, θ1] x [φ0, φ1] of the state space this often:
$\int_{\theta=\theta_0}^{\theta_1}\int_{\phi=\phi_0}^{\phi_1} p(\theta,\phi \mid D)\,d\theta\,d\phi$
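The random walk described above is the Metropolis algorithm with a symmetric proposal. A minimal sketch follows, assuming a toy unnormalized posterior (two independent normal densities) in place of a real likelihood and prior; the names `theta`, `phi`, `log_posterior` and `metropolis`, the step size and the chain length are all illustrative choices. Working on the log scale avoids numerical underflow, and only the ratio of posterior densities is needed, so the normalizing constant Pr{D} never has to be computed.

```python
import math
import random

def log_posterior(theta, phi):
    """Toy unnormalized log posterior (an assumption for illustration):
    theta ~ Normal(1, 1) and phi ~ Normal(-2, 0.5), independent."""
    return -0.5 * ((theta - 1.0) ** 2 + ((phi + 2.0) / 0.5) ** 2)

def metropolis(log_post, start, n_steps, step_size, rng):
    theta, phi = start
    current = log_post(theta, phi)
    samples = []
    for _ in range(n_steps):
        # Unbiased (symmetric) small jump from the current position.
        theta_new = theta + rng.gauss(0.0, step_size)
        phi_new = phi + rng.gauss(0.0, step_size)
        proposed = log_post(theta_new, phi_new)
        # Uphill moves are always accepted; downhill moves are accepted
        # with probability equal to the ratio of posterior densities.
        if proposed >= current or rng.random() < math.exp(proposed - current):
            theta, phi, current = theta_new, phi_new, proposed
        # The current state is recorded whether or not the jump was made.
        samples.append((theta, phi))
    return samples

rng = random.Random(7)
samples = metropolis(log_posterior, (0.0, 0.0), 60000, 0.8, rng)
kept = samples[10000:]                      # discard burn-in
mean_theta = sum(s[0] for s in kept) / len(kept)
mean_phi = sum(s[1] for s in kept) / len(kept)
print(mean_theta, mean_phi)
```

The sample means should settle near the centre of the toy posterior (1, -2), and the fraction of samples falling in any region estimates the posterior probability of that region.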
Markov chain Monte Carlo (MCMC)
$p(\theta,\phi \mid D) = \frac{1}{Z}\sum_{g}\Pr\{D \mid \phi, g\}\, f(g \mid \theta)\, f(\theta,\phi)$

Population genetics of Hepatitis C in Egypt
Hepatitis C Virus (HCV)
• Identified in 1989
• 9.6kb single-stranded RNA genome
• Polyprotein cleaved by proteases
• No efficient tissue culture system
How important is HCV?
• 170m+ infected
• ~80% infections are chronic
• Liver cirrhosis & cancer risk
• 10,000 deaths per year in USA
• No protective immunity?
HCV Transmission
Percutaneous exposure to infected blood
• Blood transfusion / blood products
• Injecting & nasal drug use
• Sexual & vertical transmission
• Unsafe injections
• Unidentified routes
Estimating demographic history of HCV
using the coalescent
• Egyptian HCV gene sequences
• n=61
• E1 gene, 411bp
• All sequences contemporaneous
• Egypt has highest prevalence of HCV worldwide (10-20%)
• But low prevalence in neighbouring states
• Why is Egypt so seriously affected?
• Parenteral antischistosomal therapy (PAT)
Demographic model
• The coalescent can be
extended to model
deterministically varying
populations.
• The model we used was a
const-exp-const model.
• A Bayesian MCMC method
was developed to sample the
gene genealogy, the
substitution model and
demographic function
simultaneously.
$N(t) = \begin{cases} N_C & \text{if } t \le x \\ N_C \exp[-r(t - x)] & \text{if } x < t < y \\ N_A & \text{if } t \ge y \end{cases}$
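The const-exp-const demographic function can be written out directly. The sketch below assumes that t is time before the present, that the exponential phase shrinks the population going back in time (a negative exponent), and that the ancestral size N_A is fixed by continuity at t = y; the function name and the example values are hypothetical.

```python
import math

def const_exp_const(t, n_c, r, x, y):
    """Population size at time t before present (t = 0 is the sampling time).

    n_c: current (constant) size for t <= x
    r:   growth rate during the exponential phase x < t < y
    The ancestral size N_A for t >= y follows from continuity at t = y.
    """
    if t <= x:
        return n_c
    if t < y:
        return n_c * math.exp(-r * (t - x))
    return n_c * math.exp(-r * (y - x))   # ancestral size N_A

# Hypothetical parameter values, for illustration only.
n_c, r, x, y = 10000.0, 0.1, 20.0, 60.0
n_a = const_exp_const(y, n_c, r, x, y)
print(const_exp_const(0.0, n_c, r, x, y), const_exp_const(40.0, n_c, r, x, y), n_a)
```

Treating N_A as a derived quantity leaves four free parameters (N_C, r, x, y); a model that estimates N_A separately would need the exponential phase to be constrained to meet it.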
Estimated demographic history
[Figure panels: estimate based on a single tree; estimate averaged over all trees]
Parameter estimates
Uncertainty in parameter estimates
Demographic parameters: growth rate of the growth phase. Grey box is the prior.
Mutational parameters: rates at different codon positions. All significantly different.
Full Bayesian Estimation
• Marginalized over uncertainty in genealogy and mutational processes
• Yellow band represents the region over which PAT was employed in Egypt
Measurably evolving populations
(MEPs)
• MEP pathogens:
– HIV
– Hepatitis C
– Influenza A
• MEPs from ancient DNA
– Bison
– Brown Bears
– Adelie penguins
– Anything cold and numerous
• Even over short periods (less than a year) HIV sequences can exhibit measurable evolutionary change
• Time-structure cannot be ignored in our models
[Figure: genealogy sampled at an earlier time point (n = 5) and at the present time point (n = 5)]
Time structure in samples
[Figure: a contemporary sample (no time structure) and a serial sample (with time structure) on a time axis running from 1980 to 2000]
Molecular evolution and population
genetics of MEPs
• Given sequence data that is time-structured, estimate true values of:
– substitution parameters
• Overall substitution rate and relative rates of different substitutions
– population history: N(t)
– Ancestral genealogy
• Topology
• Coalescent times
[Figure: Ne plotted against time alongside a serially sampled genealogy with tips A to E]
Molecular evolutionary model: Felsenstein’s
likelihood (1981)
[Figure: unrooted tree with tips AA, GA, AC and GC and branch lengths b1 to b5]
The probability of the sequence alignment, Pr{D | T,Q}, can be efficiently calculated given a tree and branch lengths (T), and a probabilistic model of mutation represented by an instantaneous rate matrix (Q).
In phylogenetics, branch
lengths are usually
unconstrained.
Combining the coalescent with
Felsenstein’s likelihood
[Figure: the unconstrained tree (tips AA, GA, AC, GC; branch lengths b1 to b5; 2n–3 branch lengths) beside the tree under the “molecular clock” constraint (waiting times t2, t3, t4; n–1 waiting times)]
$p(N, g, Q \mid D) \propto \Pr\{D \mid g, Q\}\, f_G(g \mid N)\, f_N(N)\, f_Q(Q)$
The joint posterior probability of the population history (N), the genealogy (g) and the mutation matrix (Q) is estimated using Markov chain Monte Carlo (Drummond et al, Genetics, 2002)
Full Bayesian Model
$P(g, \mu, N_e, Q \mid D) = \frac{1}{Z}\, P(D \mid g, \mu, Q)\, f_G(g \mid N_e)\, f_\mu(\mu)\, f_N(N_e)\, f_Q(Q)$
• Left-hand side: probability of what we don’t know given what we do know.
• P(D | g, μ, Q): likelihood function
• f_G(g | Ne): coalescent prior
• f_μ(μ), f_N(Ne), f_Q(Q): other priors
• Z: unknown normalizing constant
Q = substitution parameters
Ne = population parameters
g = tree
μ = overall substitution rate
In the software package BEAST, MCMC integration can be used to provide a chain of samples from this density.
HIV-1 (env) evolution in nine infected
individuals
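[Figure: phylogeny of HIV-1 env sequences from the patients of Shankarappa et al (1999) (Pt.1, Pt.2, Pt.3, Pt.5, Pt.6, Pt.7, Pt.8, Pt.9), with Patient #6 from Wolinsky et al. and the sequences HIV1U35926, HIVU95460, HIV1U36148, HIV1U36073, HIV1U36015 and HIV1U35980; scale bar 10%]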
Molecular clock: HIV-1 (env) evolution in 9
individuals
[Figure: viral divergence (0 to 10%) against years post seroconversion (0 to 10). Shankarappa et al (1999)]
MEP Summary
• Most RNA viruses, including HCV and HIV, are measurably evolving
• Most vertebrate populations that have well-preserved recent fossil records are MEPs.
• If sequence data comes from different times the time-structure can’t be ignored
• Time structure permits the direct estimation of:
– Substitution rate
– Concerted changes in substitution rate
– Coalescent times in calendar units
– Demographic function N(t) in calendar units
Intermission
My brain is fried!
Population genetics of HIV
What is HIV?
[Figure: HIV replication cycle. The HIV virion's gp120 binds to CD4 T cell surface receptors; the viral core is inserted into the host cell; the virus genome is replicated by reverse transcription (ssRNA to dsDNA); the dsDNA migrates to the nucleus; proviral DNA (flanked by LTRs) is integrated into the DNA of the infected cell; viral RNA is transcribed and translated into structural proteins, viral enzymes and regulatory genes; viral genomic RNA is packaged and virions assemble; virus buds from the cell and matures]
• HIV is a retrovirus.
• Within infected individuals HIV exhibits extremely high genetic variability due to:
– Error-prone reverse transcriptase (RT) that converts RNA to DNA (error rate is about one mutation per genome per replication cycle).
– DNA-dependent polymerase also error-prone
– High turnover of virus within infected individual throughout infection.
Patient 2 (Shankarappa et al, 1999)
[Figure: number of sequences obtained per sample (between 8 and 22) at sampling times of 0, 12, 20, 30, 40, 51, 61, 68, 73, 80, 85, 91, 103 and 126 months post seroconversion]
• 210 sequences collected over a period of 9.5 years
• 660 nucleotides from env: C2-V5 region
• Effective population size and mutation rate were co-estimated using
Bayesian MCMC.
A tree sampled from the posterior
distribution
‘Ladder-like’ appearance
Lineage A
Lineage B
Estimated substitution rate
• Patient 2:
– 0.77–1.0% per year
• BUT….
Long term rates in HIV
– Korber et al:
• 0.24% (0.18-0.28%) per year
• Only 1/4 of the intrapatient
rate
Bayesian MCMC of Shankarappa data
Columns: Patient; Best-fitting demographic model; Estimated rate (per site per year); Effective population size*; Rate heterogeneity (alpha parameter); Bottleneck (at seroconversion) estimate; Bottleneck upper limit

Patient      Model        Rate    Ne*   Alpha  Bottleneck  Upper limit
p1           Logistic     0.0123   882  0.278     1.57%       6.68%
p2 plasma    Logistic     0.0166  1708  0.242     0.63%       3.34%
p2 provirus  Logistic     0.0090  2798  0.278     0.04%       0.18%
p3           Logistic     0.0175   620  0.237     1.80%       4.97%
p5 plasma    Exponential  0.0223   938  0.192    11.80%      27.50%
p5 provirus  Logistic     0.0215  1345  0.293     8.19%      15.20%
p6           Logistic     0.0195   581  0.221     0.52%       1.35%
p7           Logistic     0.0085  3320  0.322     2.94%      13.60%
p8           Exponential  0.0162  2309  0.455    28.70%      48.50%
p9           Logistic     0.0071  2757  0.346    17.90%      44.80%
p11          Logistic     0.0128  2502  0.239     0.15%       0.53%
Overall      Logistic     0.0148  1796  0.282     6.75%      15.15%

* At the time of last sample assuming a generation length of 2.6 days
Intra- and inter-patient rate estimates (C2V3 envelope)
[Figure: rate (per site per year, 0 to 0.03) against sampling interval (months, 0 to 250); intrapatient estimates (p1 - p11) and interpatient estimates (A, B, C)]
Summary: HIV intra-patient evolution
• HIV evolutionary rates appear to be faster within patients than across the pandemic
– Different selection pressure at transmission?
– Transmitted viruses undergoing fewer rounds of replication?
– Latent viruses?
– Reversion of escape mutants?
• Effective population size is changing over time
(bottleneck in envelope at least)
Goodness-of-fit tests
But how good is our best model?
• We can use standard statistical model-choice criteria to
choose between different models of substitution and
demography, but are any of the models we consider any
good at all?
• One way to look at this is to ask the following question:
– Does our real data look anything like what we would
expect data from our model to look like?
• So what aspect of the data should we look at?
• And what should we expect?
We could look at branch length
distributions…
[Figure: genealogy annotated with $t_{root}$, $L_n$ and $J_n$]
$E[t_{root}] = 2N_e\left(1 - \frac{1}{n}\right)$
$E[L_n] = 2N_e\sum_{k=1}^{n-1}\frac{1}{k}$
$E[J_n] = 2N_e$
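Two of these expectations follow directly from the expected waiting time $E[t_k] = 2N_e/(k(k-1))$ while $k$ lineages remain. The short derivation below assumes that $L_n$ denotes the total branch length of the tree (each of the $k$ lineages contributes $t_k$) and that $t_{root}$ is the sum of the waiting times.

```latex
\begin{align*}
E[t_{root}] &= \sum_{k=2}^{n} E[t_k]
             = 2N_e \sum_{k=2}^{n} \left(\frac{1}{k-1} - \frac{1}{k}\right)
             = 2N_e \left(1 - \frac{1}{n}\right), \\
E[L_n]      &= \sum_{k=2}^{n} k\,E[t_k]
             = 2N_e \sum_{k=2}^{n} \frac{1}{k-1}
             = 2N_e \sum_{k=1}^{n-1} \frac{1}{k}.
\end{align*}
```

The root age is bounded by $2N_e$ however large the sample, whereas the total tree length keeps growing (slowly, like $\ln n$) as more sequences are added.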
Tree imbalance measures might also
be interesting…
[Figure: three example trees]
• 4 cherries: $I_c = 0$, $\bar{N} = 3$
• 3 cherries: $I_c = 0.24$, $\bar{N} = 3.125$
• 2 cherries: $I_c = 0.81$, $\bar{N} = 4.125$
Posterior predictive simulation
• A method of testing the goodness-of-fit of a Bayesian model.
1. Run a Bayesian MCMC analysis on the data
2. Calculate the value of your favourite summary statistic, T(.) from
the data, D
3. For each state in the chain
1. Simulate a synthetic dataset, Di, using the parameter values of
state i.
2. Calculate T(Di) from the simulated data set.
4. Compare the T(D) value with predictive distribution of T(Di)
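The four steps above can be sketched as a generic routine. Everything concrete here is a stand-in chosen for illustration: the "chain" is a list of made-up parameter values rather than real MCMC output, the simulator draws exponential waiting times, and the summary statistic T is the sample maximum. Only the structure (simulate one dataset per state, compute T on each, compare with T of the real data) reflects the procedure.

```python
import random

def posterior_predictive_p(observed_stat, posterior_states, simulate, statistic, rng):
    """Fraction of simulated datasets whose statistic is <= the observed one."""
    below = 0
    for state in posterior_states:
        synthetic = simulate(state, rng)            # D_i from state i
        if statistic(synthetic) <= observed_stat:   # T(D_i) versus T(D)
            below += 1
    return below / len(posterior_states)

# Toy illustration: "data" are 20 exponential waiting times, the chain
# states are plausible means, and T is the sample maximum.
rng = random.Random(3)
data = [rng.expovariate(1 / 5.0) for _ in range(20)]
states = [rng.uniform(4.0, 6.0) for _ in range(2000)]   # stand-in for MCMC output
simulate = lambda mean, r: [r.expovariate(1 / mean) for _ in range(20)]
p = posterior_predictive_p(max(data), states, simulate, max, rng)
print(p)
```

A predictive probability close to 0 or 1 indicates that the observed statistic sits in the tail of what the fitted model generates, which is evidence of poor fit for that aspect of the data.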
So we need some summary statistics
• Summary statistics that can be measured directly from a sequence alignment:
– Mean pairwise distance (π)
– Tajima’s D
– Fu & Li’s D
– Number of segregating sites (S)
– …
• Summary statistics that can be measured directly from a genealogy:
– Genealogical mean pairwise distance
– Genealogical Tajima’s D
– Genealogical Fu & Li’s D
– Tree-imbalance statistics
– Age of the root
– Length of the tree
Posterior predictive simulation (2)
• Testing the goodness-of-fit of the neutral coalescent model under variable demographic functions.
1. Run a Bayesian MCMC analysis on the data
2. For each state in the chain
1. Simulate a coalescent genealogy (GiS) using the population parameter
values of state i.
2. Calculate T(GiS) from the ith simulated genealogy
3. Calculate T(GiP) from the ith posterior genealogy
3. Calculate the predictive probability by comparing the posterior
distribution of T(.) with predictive distribution of T(.):
Human influenza A (HA gene) trees
[Figure: posterior genealogy and predictive simulations at two states of the chain. State 5m: $N_e = 9.12$, $t_2 = 11.03$ years. State 10m: $N_e = 5.00$, $t_2 = 15.29$ years]
Human influenza A trees:
Genealogical Fu & Li’s D statistic
Puerto Rican Dengue-4 gene trees:
multivariate summary statistics
Results of test of neutrality
Table 2. The predictive probabilities (P_T) for summary statistics on each of the example data sets are shown. Significant departures from neutrality are marked (*) and marginally significant departures (x < 0.05 or x > 0.95) are marked with (†). Significant departures on the best fitting model for each data set are in bold.

Dataset
Brown bear
Demographic
model
Constant
T
Predictive probabilities
troot
DFL
IC
Cn
B1
0.739
0.815
0.863
0.693
0.163 0.103
0.623
0.800
0.679
0.163 0.111
RSVA
Exponential growth 0.615
Constant
0.956†
0.964†
0.946
0.163
0.152 0.134
(g gene)
Exponential growth
0.693
0.656
0.884
0.206
0.149 0.134
Dengue-4
Constant
0.9574†
0.9958*
0.9997*
0.562
0.608 0.427
(E gene)
Human influenza A
Exponential growth 0.745
Constant
0.9510†
0.809
0.9792*
0.559
0.653 0.505
0.900
0.9999* 0.0462† 0.605 0.610
(HA)
Exponential growth
0.620
0.9995*
(d-loop)
0.910
0.0866 0.575 0.677
Results for 28 HIV-1 infected individuals
[Figure: P* for Fu & Li's D (0 to 1) against proportion of data sets (0 to 1); series: simulated constant size, simulated exponential growth, target, constant size, exponential growth]
Is the population size constant?
[Figure: Patient 2, population size (Ne / 30, log scale from 10 to 1000) against months post seroconversion (0 to 120); mean, lower and upper]
Phylodynamics
Virus population dynamics
Measles virus
Human influenza virus
Dengue-4: Modeling complex demography
[Figure: Dengue-4 Ne by year and Dengue-4 isolates by year, 1980 to 2000, both on axes from 0 to 120]
N(t) = N0exp(-rt): -10566.421
N(t) = scaled translated case data: -10478.572
Hospital case data courtesy of Shannon Bennett
Population size changes
The generalized skyline plot
• Visual framework for exploring the demographic history of
sampled DNA sequences
• Input: a single estimated ancestral genealogy (a tree)
• Output: nonparametric plot of the population size through time
– Groups adjacent coalescent intervals
– Converts information within these intervals to estimates of
population size
Estimate of population size from a single coalescent interval:
$\hat{N}_k = \frac{k(k-1)}{2}\,t_k$
Estimate of population size from l adjacent coalescent intervals:
$\hat{N}_{k,l} = \frac{k(k-l)}{2l}\sum_{i=k-l+1}^{k} t_i$
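Both estimators are one-liners once the waiting times are known. The sketch below assumes the waiting times are stored in a dictionary keyed by the number of lineages k, and reuses illustrative values t_2 = 7 and t_3 = 8; the function names are hypothetical.

```python
def classic_skyline(k, t_k):
    """Population size estimate from a single coalescent interval."""
    return k * (k - 1) / 2.0 * t_k

def generalized_skyline(k, l, waiting_times):
    """Estimate from l adjacent intervals: k(k-l)/(2l) * sum_{i=k-l+1}^{k} t_i."""
    pooled = sum(waiting_times[i] for i in range(k - l + 1, k + 1))
    return k * (k - l) / (2.0 * l) * pooled

times = {2: 7.0, 3: 8.0}                      # illustrative waiting times
single = generalized_skyline(3, 1, times)     # l = 1 recovers the classic estimate
grouped = generalized_skyline(3, 2, times)    # pools the intervals for k = 2 and k = 3
print(single, grouped)
```

Pooling adjacent intervals trades temporal resolution for lower variance: each single-interval estimate rests on one exponential draw, so it is very noisy on its own.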
Generalized Skyline Plot
Examples
I: Constant population size
N(t)=N(0)
Skyline Plot
I: Constant population size
N(t)=N(0)
II: Exponential growth
$N(t) = N(0)e^{-rt}$
Skyline Plot
III: HIV-1 group M
(tree estimated in Yusim et al (2001)
Phil. Trans. Roy. Soc. Lond. B 356:
855-866)
– Black curve is a parametric
estimate obtained from the
same data under the
“expansion model”
– Results follow accepted
demographic pattern for
the HIV pandemic
The Bayesian skyline plot
Estimate a demographic function that has a certain fixed number of steps (in this example 15) and then integrate over all possible positions of the break points.
[Figure: Dengue-4 Bayesian skyline plot (15 epochs), log scale: population size × generation length (0.01 to 1000) against years ago (0 to 20); median, lower and upper]
[Figure: Dengue-4 Bayesian skyline plot (15 epochs), linear scale: population size × generation length (0 to 100) against years ago (0 to 20); median, lower and upper]
Explains the Dengue data quite well (tests of neutrality do not reject the data if we use the Bayesian skyline plot to describe the demographic history).
Prior/Model: population is autocorrelated through time
In addition to this model we also introduce a simple smoothing on Θ which represents our belief that effective population size is auto-correlated through time. The prior distribution we assume in all subsequent simulations and analyses is that, going back in time, each new population size is drawn from an exponential distribution with a mean equal to the previous population size:
$\theta_j \sim \mathrm{Exp}(\theta_{j-1}), \quad 2 \le j \le m.$ (5)
In addition we introduce a scale-invariant prior (Jeffreys 1946) on the first element, $f_{\theta_1}(\theta_1) \propto 1/\theta_1$, to signify that our prior belief is invariant to changes in time scale. This results in the following simple prior distribution on Θ:
$f_\Theta(\Theta) \propto \frac{1}{\theta_1}\prod_{j=2}^{m}\frac{1}{\theta_{j-1}}\exp\left(-\frac{\theta_j}{\theta_{j-1}}\right).$ (6)
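The smoothing prior in equation (6) is cheap to evaluate on the log scale. The sketch below assumes the population sizes are supplied as a list ordered from the first element going back in time, and returns the log prior up to an additive constant; the function name is hypothetical.

```python
import math

def log_skyline_prior(theta):
    """Unnormalized log prior for theta = [theta_1, ..., theta_m]."""
    log_p = -math.log(theta[0])                 # scale-invariant 1/theta_1 term
    for prev, cur in zip(theta, theta[1:]):
        # theta_j ~ Exponential(mean = theta_{j-1})
        log_p += -math.log(prev) - cur / prev
    return log_p

smooth = log_skyline_prior([2.0, 2.0, 2.0])
jumpy = log_skyline_prior([2.0, 20.0, 2.0])
print(smooth, jumpy)
```

Because each size is drawn with a mean equal to its predecessor, a trajectory that jumps upward by an order of magnitude is penalized relative to one that stays level, which is what produces the smoothing.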
Validating the Bayesian skyline plot (1)
Simulated data: Constant population
Simulated data: Exponential growth
[Figure: Bayesian skyline (49 or 12 epochs): Theta (log scale, 1 to 100); median, lower and upper for 49 and for 12 epochs, and the truth]
Validating the Bayesian skyline plot (2)
[Figure: log scale from 0.00001 to 0.1 against time (mutations, 0 to 0.008)]
Comparing Bayesian skyline plot of Dengue-4 with incidence data
[Figure: effective number of infections (log scale, 1 to 100000) against months before November 1998 (0 to 200); median, upper and lower, with smoothed translated incidence]
Example of Bayesian skyline plot
(1920-1980) Anti-schistosomal needle-based treatment
Effective population size jumped from 300 to 100 to 10,000
Comparison to parametric model
http://evolve.zoo.ox.ac.uk/BEAST
Structured populations
Coalescent with population structure
Population subdivision - two demes
Population subdivision - two demes
Stepping stone model of subdivision
Human migration
From Cavalli-Sforza,2001
Simplified model of human evolution
[Figure: two populations, Africa and Non-Africa, drawn from past to present; rate of common ancestry = 1; mutation rate = 2.5; 0.2]
Why Bayesian?
• Probabilistic model-based inference
– Can make simple statements about the probability of alternative
hypotheses given the data
• Markov chain Monte Carlo
– Convenient computational technique
– Allows for complex models: “if you can simulate you can sample”
• Incorporates prior probabilities
– P(θ|D) ∝ P(D|θ)P(θ)
– Convenient means of assessing alternative sets of assumptions
– Allows incorporation of independent sources of information
• Easy to include sources of uncertainty
– Don’t need to assume perfect knowledge of tree (for example)
– Can treat the tree as a nuisance parameter and focus on parameters of interest (strength of selection, mutation rate, growth rate, etc)
Conclusions & cautionary remarks
• Bayesian MCMC has advantages
– a useful tool for exploring prior hypotheses
– Good for assessing levels of uncertainty
– Complex models can be investigated on practical datasets
• Bayesian MCMC has disadvantages
– Diagnostics are difficult, and it is essentially impossible to
guarantee correctness
– Model comparison can be difficult
– Requires large programs that are difficult to optimize and debug.
Conclusions & cautionary remarks (2)
• Population genetics has advantages
– provides a framework for objective analysis of genetic data
– Allows interpretation of genetic data in terms of biological
properties of virus
– Can be extended to include selection, recombination et cetera
• Population genetics has disadvantages
– Models are still too simple
– Assumptions are too strong
– Extending to complex models that include changing selection pressures and recombination is possible in MCMC but still very difficult!