Transcript Slide 1

Population assignment likelihoods
in a phylogenetic and
demographic model.
Jody Hey
Rutgers University
DNA Barcoding is great!
• But it is useful to keep in mind that
species taxa are provisional – they are
hypotheses to be revised with more data
• Taxa are tools, not truth
• Mitochondrial-based DNA barcodes
– Can be misleading due to chance factors
(different genes have different histories)
– Can be misleading due to deterministic factors
(mitochondria are a large target for natural
selection)
A general problem…
You have some genetic data
For example, a gene sequenced multiple
times
Or a microsatellite locus genotyped in a
number of individuals
Suppose you are willing to assume that
positive or balancing selection has not
played a big role in the history of the
data
What could you figure out about the history
of the organisms from which the genes
came?
A General Parameterization for questions on
population demography, population divergence,
speciation, population identification etc
X genetic data (e.g. aligned sequences, microsatellites)
may (or may not) come with population labels
may (or may not) be given as diploid genotypes
may include multiple loci for each sampled organism
P population phylogeny
T splitting times – i.e. the times of branch points in the
phylogeny P
Θ Demography - population size and migration rate parameters
I Population labels – assignment of genes to populations, i.e. which genes came from which populations or species
G Genealogy – the gene tree for the data
G is a necessary ‘nuisance’ parameter – it provides a
mathematical connection between X and (P,T, Θ and I)
It is possible to calculate the probability of G as a function
of P,T, Θ and I, p(G| P,T, Θ,I), using coalescent models
It is possible to calculate the probability of a data set given
G, p(X|G), using mutation models.
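As a minimal sketch of p(G| P,T, Θ,I) in the simplest possible setting, assume a single constant-size population with no migration, so that P, T and I drop out, and measure time in coalescent units. Under the standard Kingman coalescent, while k lineages remain each specific pair coalesces at rate 1, so the density of a genealogy is the product over intervals of exp(-k(k-1)/2 × t_k). The function name and the example genealogy are hypothetical and chosen only for illustration.

```python
def log_prob_genealogy(interval_lengths):
    """Log density of a labeled genealogy under the Kingman coalescent.

    interval_lengths[i] is the time (in coalescent units) during which
    n - i lineages exist, where n = len(interval_lengths) + 1.
    Assumes one constant-size population and no migration.
    """
    n = len(interval_lengths) + 1
    log_p = 0.0
    for i, t in enumerate(interval_lengths):
        k = n - i                   # lineages present in this interval
        pairs = k * (k - 1) / 2     # number of pairs that could coalesce
        log_p += -pairs * t         # each specific pair coalesces at rate 1
    return log_p

# Hypothetical genealogy of 3 gene copies:
# 0.2 time units with 3 lineages, then 1.0 time units with 2 lineages.
print(log_prob_genealogy([0.2, 1.0]))
```

With several populations, splitting times and migration, the same idea applies interval by interval, but the rates then depend on which population each lineage is in.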
Connecting Data to the General Model – Parts 1-3
For unlabeled data - without information on the number
of populations, or on which populations were sampled
1 Unlabeled data, for example:
Sequence1 ACgTACgACgCACgAAT
Sequence2 ACgTACgACgCACgAAT
Sequence3 ACCTTCgACgTACgAGT
Sequence4 ACgTTCgACgTACgAAT
Sequence5 ACCTTCgACgTACgAAT
Sequence6 ACgTTCgACgTATgAAT
2 Specify a random G with topology and branch lengths
[Figure: genealogy G with tips ordered 5, 6, 4, 3, 1, 2]
3 With a mutation model, and a value of G, we can calculate the probability of the data given G: p(X|G)
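To make the mutation-model step concrete, here is a small sketch of p(X|G) for the simplest genealogy, one that joins just two sequences. It assumes the Jukes-Cantor (JC69) model, which is a choice made here for illustration and is not named on the slide. The parameter d is the expected number of substitutions per site along the path joining the two sequences, and the function name is hypothetical.

```python
import math

def jc69_log_likelihood(seq_a, seq_b, d):
    """Log p(X | G) for two aligned sequences under Jukes-Cantor (JC69).

    d > 0 is the expected number of substitutions per site along the
    path joining the two sequences in the genealogy.
    """
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 * (0.25 + 0.75 * e)   # joint probability of a matching site
    p_diff = 0.25 * (0.25 - 0.25 * e)   # joint probability of one specific mismatch
    log_l = 0.0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        log_l += math.log(p_same if a == b else p_diff)
    return log_l

seq1 = "ACgTACgACgCACgAAT"   # Sequence1 from the slide
seq3 = "ACCTTCgACgTACgAGT"   # Sequence3 from the slide
n_diff = sum(a != b for a, b in zip(seq1.upper(), seq3.upper()))
print("differences:", n_diff)
for d in (0.05, 0.28, 1.0):
    print(d, jc69_log_likelihood(seq1, seq3, d))
```

For a full genealogy of six sequences the same per-site calculation is carried out over the whole tree, typically with Felsenstein's pruning algorithm.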
Connecting Data to the General Model – Parts 4&5
4 Specify a random phylogeny P with multiple populations and with splitting times T, for example:
[Figure: population phylogeny ((Pop 2, Pop 3), Pop 1) with splitting times T1 and T2; sampled populations Pop 1, Pop 2 and Pop 3 have sizes N1, N2 and N3; ancestral populations Pop (2,3) and Pop (2,3),1 have sizes N(2,3) and N(2,3),1]
5 With a phylogeny that depicts populations in time, we can also pick random values for population sizes and migration rates – Θ = {N1, N2... m1>2, m2>1…}
Connecting Data to the General Model – Parts 6-8
6 Overlay the genealogy on the phylogeny
[Figure: genealogy with tips 5, 6, 4, 3, 1, 2 drawn inside the phylogeny of Pop 1, Pop 2 and Pop 3]
7 Add implied migration events and other random migration events to the phylogeny
8 Identify I, the data labels representing the populations containing the data
Calculating the likelihood of
P, T, Θ, and I, given the data
L(P, T, Θ, I | X) = p(X | P, T, Θ, I) = ∫ p(X | G) p(G | P, T, Θ, I) dG
• If we can solve this then we can obtain maximum
likelihood estimates of P,T, I and Θ
• We know how to calculate p(X|G) and p(G|P,T,I,Θ)
– The math is not the hard part
• The greatest challenge is finding efficient ways to
sample the space of genealogies and the space of P,
T, Θ, and I
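The integral over genealogies can be illustrated with a toy case in which it has a closed form. The sketch below assumes two gene copies from one population and an infinite-sites mutation model, so G reduces to a single coalescence time t (exponential with mean 1 in coalescent units) and p(X|G) is Poisson with mean θ × t for the observed number of differences. Here θ stands for a single scaled mutation parameter, not the full Θ of the model, and all function names are hypothetical. Genealogies are drawn directly from the coalescent prior, which works only for toy problems like this one; realistic analyses typically need more efficient sampling, such as Markov chain Monte Carlo.

```python
import math
import random

def poisson_pmf(k, mean):
    return math.exp(-mean) * mean ** k / math.factorial(k)

def mc_likelihood(k_diffs, theta, n_samples=100_000, seed=1):
    """Monte Carlo estimate of the integral of p(X|G) p(G) dG.

    Toy case: two gene copies, one population, infinite-sites mutation.
    G is the coalescence time t ~ Exponential(1) in coalescent units,
    and p(X|G) is Poisson(theta * t) evaluated at k_diffs differences.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        t = rng.expovariate(1.0)                   # draw G from the coalescent prior
        total += poisson_pmf(k_diffs, theta * t)   # weight by p(X | G)
    return total / n_samples

def exact_likelihood(k_diffs, theta):
    """Closed form of the same integral, used to check the sampler."""
    return theta ** k_diffs / (1.0 + theta) ** (k_diffs + 1)

print(mc_likelihood(4, 3.0), exact_likelihood(4, 3.0))
```

Evaluating mc_likelihood over a grid of theta values and taking the largest gives a crude maximum likelihood estimate for this toy problem.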
Genetic Data and different types of data labels
Often Population Labels are known (come with data)
Population Labels   Aligned DNA Sequences
A                   ACgTACgACgCACgAAT
A                   ACgTACgACgCACgAAT
B                   ACCTTCgACgTACgAGT
B                   ACgTTCgACgTACgAAT
C                   ACCTTCgACgTACgAAT
C                   ACgTTCgACgTATgAAT
Population labels are already known and do not need
to be estimated. Parameter I (population labels) is not
included in the model.
Case 1 Data has no labeling at all
Population Labels   Aligned DNA Sequences
?                   ACgTACgACgCACgAAT
?                   ACgTACgACgCACgAAT
?                   ACCTTCgACgTACgAGT
?                   ACgTTCgACgTACgAAT
?                   ACCTTCgACgTACgAAT
?                   ACgTTCgACgTATgAAT
Case 2, no population labels, but data comes in
diploid genotype pairs
Population Labels   Genotype Pairs   Aligned DNA Sequences
?                   Individual #1    ACgTACgACgCACgAAT
?                   Individual #1    ACgTACgACgCACgAAT
?                   Individual #2    ACCTTCgACgTACgAGT
?                   Individual #2    ACgTTCgACgTACgAAT
?                   Individual #3    ACCTTCgACgTACgAAT
?                   Individual #3    ACgTTCgACgTATgAAT
Gene copies are identified in genotype pairs only.
Parameter I (Population labels) is unknown (?) and
needs to be estimated.
Two kinds of data sets without
population labels
1. Alleles or gene copies provided without
any additional information on populations
- e.g. locus may be haploid
- or for whatever reason, data not
collected in a way that yields diploid
genotypes
2. Alleles or sequences provided in diploid
(genotype) pairs
This is a common situation for population
assignment
Case 1: Alleles or gene copies
come without any additional
information on populations
• The only available information on
population labels (parameter I) and all
other parameters (P, T, Θ) is in the actual
variation in the data
• This is a lot to ask of a single-locus data set.
• With multiple loci, it can be possible to
estimate P, T, Θ, and I
• Can include information from a database
on the same locus (loci) – i.e. DNA
barcoding
Case 2: Data comes in diploid
(genotype) pairs
• Such data contains two types of
information for population identification:
– Patterns of variation (as in case 1)
– Knowledge that both gene copies from a single
individual must come from the same
population (assume no hybrids)
• This problem (identifying populations
based on diploid genotypes) is traditionally
called population assignment
Population Assignment based on
diploid genotype data
• Many methods exist for population
assignment, using allelic data, based
on an assumption of Hardy-Weinberg
equilibrium within populations
• These methods do not otherwise
incorporate phylogenetics or
population genetics (no P, T, or Θ)
• Have to overcome difficulty of not
knowing the underlying allele
frequencies
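A minimal sketch of the Hardy-Weinberg approach at a single locus: the probability of a diploid genotype in a population is p² for a homozygote and 2pq for a heterozygote, and the individual is assigned to the population where that probability is highest. The allele frequencies below are hypothetical and are assumed to be known, which is exactly the difficulty the slide points out; real methods have to estimate them. Function names are also hypothetical.

```python
def genotype_prob(genotype, freqs):
    """Hardy-Weinberg probability of a diploid genotype (a, b) at one locus."""
    a, b = genotype
    pa, pb = freqs.get(a, 0.0), freqs.get(b, 0.0)
    return pa * pa if a == b else 2.0 * pa * pb

def assign(genotype, pop_freqs):
    """Return the population with the highest genotype probability, plus all scores."""
    scores = {pop: genotype_prob(genotype, f) for pop, f in pop_freqs.items()}
    return max(scores, key=scores.get), scores

# Hypothetical allele frequencies, assumed known for this illustration
pop_freqs = {
    "Pop 1": {"A": 0.8, "B": 0.2},
    "Pop 2": {"A": 0.3, "B": 0.7},
}
print(assign(("A", "B"), pop_freqs))
print(assign(("A", "A"), pop_freqs))
```

With several independent loci, the per-locus probabilities would be multiplied, or their logarithms summed, before comparing populations.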
Considering the probability of a
particular genotype configuration, D
The actual configuration D that comes with the data
is one of many possible configurations.
6 Sequences
1 ACgTACgACgCACgAAT
2 ACgTACgACgCACgAAT
3 ACCTTCgACgTACgAGT
4 ACgTTCgACgTACgAAT
5 ACCTTCgACgTACgAAT
6 ACgTTCgACgTATgAAT
3 Genotype pairs
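To see how many configurations are possible, the sketch below enumerates every way of grouping the 6 sequences into 3 unordered genotype pairs. It assumes that a configuration is simply such a pairing and that individuals are interchangeable; the function name is hypothetical.

```python
def pairings(items):
    """Yield every way to split items into unordered pairs."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for tail in pairings(remaining):
            yield [(first, partner)] + tail

configs = list(pairings([1, 2, 3, 4, 5, 6]))
print(len(configs))   # 15
print(configs[0])     # [(1, 2), (3, 4), (5, 6)]
```

In general, 2n gene copies can be paired in (2n)! / (2^n n!) ways, which is 15 for n = 3.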
Calculating the probability of a
particular genotype configuration, D
• Assume that genes come together and
form zygotes at random with respect to
their time of common ancestry
– This is a genealogical version of the
assumption of random mating that is
usually made with respect to
segregating alleles (e.g. in Hardy-Weinberg)
• Assume that both gene copies within an
individual are in the same population
Given a genealogy, G,
Some genotype
configurations are more
probable than others
under an assumption of
random union of
gametes