Transcript 3000_13_3a
genomic diversity and
differentiation
heading toward exam 3
genome region of arbitrary
size, what can you
measure and describe?
what else might you want
to know?
if given these data and
nothing else, what could
you say about them?
learning goals for
coalescent theory
• how do patterns in sequence data tell us about
effective population size?
• what if there are multiple populations
contributing information?
• how is our answer changed if the population
changes in size, or if there is selection for a
particular allele?
• why is this important for understanding
phylogenetics (species trees)?
patterns
• mutations happen at a more-or-less constant rate at
random locations along the genome (assumptions can be
tested)
• drift, selection, gene flow, recombination, etc. influence
how these mutations turn into patterns
• we interpret with statistical models - mostly beyond this
class
assume
genealogy
descent with
modification
focus on non-reticulate
gene trees
assume every mutation
happens at new
genome location
AVISE 1987, 1994
neutral model
• assume all these mutations have NO
effect on fitness (null model)
• thus, only drift influences whether allele
goes to fixation
• remember: probability an allele goes to
fixation is its frequency in the population
• so every new mutation has a low but
equal probability that it will get FIXED
(frequency 100%)
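A minimal sketch of the claim above, assuming a simple Wright-Fisher model (each generation, every gene copy is drawn at random from the previous generation). The function name `fixation_fraction` and all parameter values are hypothetical choices for illustration, not from the lecture.

```python
import random

def fixation_fraction(copies, start_count, replicates, seed=1):
    """Fraction of neutral replicates in which the allele reaches 100%.

    copies      -- total gene copies in the population (assumed constant)
    start_count -- how many of those copies carry the allele at the start
    """
    rng = random.Random(seed)
    fixed = 0
    for _ in range(replicates):
        count = start_count
        # Drift only: resample every copy from the current allele frequency
        # until the allele is lost (0) or fixed (all copies).
        while 0 < count < copies:
            freq = count / copies
            count = sum(1 for _ in range(copies) if rng.random() < freq)
        if count == copies:
            fixed += 1
    return fixed / replicates

# Allele starting at 25% frequency: expect roughly 25% of runs to fix it.
print(fixation_fraction(copies=20, start_count=5, replicates=2000))
```

Setting `start_count=1` mimics a brand-new mutation; the fraction fixed should then sit near 1 / copies, which is low but the same for every new mutation.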
SPECIES
GENE COPY
POPULATION(DEME)
so you are collecting data not generally knowing the history of
inheritance or how discrete these units may be (actually discrete,
resolvably discrete)
we are working on how to infer (at least probabilities) how this diversity
partitions in space (population), time (frequencies), across genome
(paralogs), across species (orthologs)
also: copy number variation among loci, among populations, among
species
how many
whales?
Roman and Palumbi
2003
currently ~10,000
humpback whales; prewhaling (genetic
estimate) maybe
~250,000
how could there be so
many?
1. count whales - currently done using censusing and monitoring of whaling
vessels, about 10,000 humpback whales in Atlantic
2. collect DNA samples from some of them, and sequence at least one gene
(more is better!)
3. remember π is proportional to effective population size (times mutation rate
µ)
4. we know µ (~0.00000001 substitutions per site per DNA replication/reproduction)
from fossil and biogeographic data, and we can calculate π (average #
differences between every pair of sequences)
5. Ne = π/µ, adjusted for inheritance of marker (haploid, maternally inherited
mtDNA, versus diploid, biparental nuclear gene)
6. Ne of humpback whales ~250,000 even though only 10,000 whales now!
7. the genetic diversity is older than human whaling efforts and tells us
about the past
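Step 5 can be sketched as below. The "adjustment for inheritance" is assumed here to be the common textbook scaling (π ≈ 4·Ne·µ for a diploid biparental nuclear gene, π ≈ 2·Ne·µ for haploid maternally inherited mtDNA, where Ne then refers to females); the slide does not state these factors. The function name and the input values are hypothetical, not whale data.

```python
# Assumed scaling factors (not given on the slide):
#   nuclear: pi ~ 4 * Ne * mu   (diploid, biparental)
#   mtdna:   pi ~ 2 * Ne * mu   (haploid, maternal; Ne = female effective size)
SCALE = {"nuclear": 4.0, "mtdna": 2.0}

def effective_size(pi_per_site, mu_per_site, marker="nuclear"):
    """Estimate Ne from per-site diversity and per-site mutation rate."""
    return pi_per_site / (SCALE[marker] * mu_per_site)

# Hypothetical numbers for illustration only.
print(effective_size(pi_per_site=0.004, mu_per_site=1e-8, marker="nuclear"))
print(effective_size(pi_per_site=0.004, mu_per_site=1e-8, marker="mtdna"))
```

The same π gives a different Ne depending on the marker, which is why the marker's mode of inheritance has to be stated with any estimate.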
AUTOSOMES: ALL 4 COPIES CAN CONTRIBUTE MUTATIONS
MTDNA: ONE COMPONENT CONTRIBUTES MUTATIONS
WHEN PEOPLE REFER TO THE SMALLER EFFECTIVE SIZE OF
MITOCHONDRIAL GENOME, THEY ARE REFERRING TO COPY NUMBER,
NOT THE NUMBER OF INDIVIDUALS IN THE POPULATION!
another look at Ne: drift
neutrality: mean Time to Most
Recent Common Ancestor
(tmrca) = time to homozygosity
= -4Ne[ p ln p + (1-p) ln(1-p) ] gens
(DO NOT MEMORIZE THIS)
proportional to Ne; for p=0.5,
~2.77Ne gens
heterozygosity declines by
1/(2Ne) per generation
compare nuclear gene vs.
mitochondrial gene...?
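A short sketch of the two expressions on this slide, so the "~2.77Ne" value and the per-generation loss of heterozygosity can be checked numerically. Function names and the example Ne are hypothetical; the starting heterozygosity `h0` is an added parameter.

```python
import math

def mean_time_to_homozygosity(p, ne):
    """-4Ne[ p ln p + (1-p) ln(1-p) ], in generations, for 0 < p < 1."""
    return -4.0 * ne * (p * math.log(p) + (1 - p) * math.log(1 - p))

def heterozygosity_after(h0, ne, generations):
    """Heterozygosity shrinks by a factor (1 - 1/(2Ne)) each generation."""
    return h0 * (1 - 1 / (2 * ne)) ** generations

# With Ne = 1 the first result is in units of Ne generations.
print(mean_time_to_homozygosity(p=0.5, ne=1))
# Hypothetical population: Ne = 100, starting heterozygosity 0.5.
print(heterozygosity_after(h0=0.5, ne=100, generations=50))
```

For the nuclear vs. mitochondrial comparison, one could rerun both functions with a smaller Ne for the mitochondrial marker; how much smaller is an assumption to state explicitly.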
basic summary stats
S, number of segregating sites (how many below?)
π, average number of differences among sequences (what is it below?)
ηi, folded site pattern: how many segregating sites appear i times?
caccgtattagcattatgctggtata
cgccgtactggcattatgctggtata
caccgtactagcattgtgctggtatg
caccgtactagcattatgccggtatg
cactgtactggcattatgctggtgta
cactgtactggcattatgctggtata
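One way to check your hand counts for the six sequences above is a short script like this. It is a sketch that assumes every segregating site has exactly two states (consistent with the "every mutation happens at a new genome location" assumption); the function name `summary_stats` is hypothetical.

```python
from collections import Counter
from itertools import combinations

SEQS = [
    "caccgtattagcattatgctggtata",
    "cgccgtactggcattatgctggtata",
    "caccgtactagcattgtgctggtatg",
    "caccgtactagcattatgccggtatg",
    "cactgtactggcattatgctggtgta",
    "cactgtactggcattatgctggtata",
]

def summary_stats(seqs):
    """Return (S, pi, folded spectrum {i: number of sites with minor count i})."""
    n = len(seqs)
    spectrum = Counter()
    for column in zip(*seqs):              # walk the alignment site by site
        counts = Counter(column)
        if len(counts) > 1:                # site is segregating
            minor = n - max(counts.values())   # minor-allele count (biallelic)
            spectrum[minor] += 1
    s = sum(spectrum.values())
    pairs = list(combinations(seqs, 2))
    total_diffs = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    pi = total_diffs / len(pairs)          # average differences per pair
    return s, pi, dict(spectrum)

s, pi, eta = summary_stats(SEQS)
print("S =", s)
print("pi =", round(pi, 3))
print("folded spectrum =", eta)
```

Here π is reported as differences per pair across the whole region; dividing by the sequence length gives the per-site value.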
standard coalescent
sample size n has n-1
coalescent events
each step has i extant lineages and lasts time Ti,
E[Ti] = 2/(i(i-1)), measured in units
of N
genetic (label) differences have
no fitness consequence
single population
constant population size (for
now)
TREE IS UNKNOWN; ANALYSIS IS ASKING WHICH TREES FIT THE DATA, AND
WHAT THAT TELLS US ABOUT THE INTERVALS BETWEEN BRANCH NODES
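The expected interval lengths under the standard coalescent can be tabulated directly from E[Ti] = 2/(i(i-1)). This is only a sketch of the expectations (no random trees are drawn); function names are hypothetical, and times stay in the slide's units of N.

```python
def expected_interval(i):
    """Expected time during which exactly i lineages are extant."""
    return 2.0 / (i * (i - 1))

def expected_tmrca(n):
    """Sum of the n-1 intervals, from n lineages down to 2."""
    return sum(expected_interval(i) for i in range(2, n + 1))

def expected_total_length(n):
    """Total branch length: each interval counts once per extant lineage."""
    return sum(i * expected_interval(i) for i in range(2, n + 1))

n = 6
for i in range(n, 1, -1):
    print(i, "lineages:", round(expected_interval(i), 3))
print("expected TMRCA:", round(expected_tmrca(n), 3))
print("expected total tree length:", round(expected_total_length(n), 3))
```

The last interval (2 lineages) is the longest on average, so most of the expected time to the common ancestor is spent waiting for the final two lineages to coalesce.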
mutation
# mutations (K) Poisson distributed on genealogy, based on
total time t = (Ttotal)
Poisson process: stochastic, each time interval is
independent, waiting time is exponentially distributed
across time intervals (but when many branches,
multiplies opportunity in interval)
Applications
The classic example of a phenomenon well modelled by a Poisson process is deaths due to horse kick in the Prussian army, as
shown by Ladislaus Bortkiewicz in 1898. The following examples are also well modelled by the Poisson process:
• Requests for telephone calls at a switchboard.
• Goals scored in a soccer match.
• Requests for individual documents on a web server.
• Particle emissions due to radioactive decay by an unstable substance. In this case the Poisson process is nonhomogeneous in a predictable manner - the emission rate declines as particles are emitted.
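A sketch of how mutations might be dropped onto a genealogy as a Poisson process: events are generated from exponential waiting times, and the count K over total time Ttotal should average rate × Ttotal. The function names, the rate, and the total time are hypothetical values for illustration.

```python
import random

def count_mutations(rate, total_time, rng):
    """Count events in [0, total_time] by summing exponential waiting times."""
    elapsed = rng.expovariate(rate)      # waiting time to the first event
    events = 0
    while elapsed <= total_time:
        events += 1
        elapsed += rng.expovariate(rate) # each wait is independent of the last
    return events

def mean_count(rate, total_time, draws=20000, seed=1):
    """Average K over many independent genealogies of the same total length."""
    rng = random.Random(seed)
    total = sum(count_mutations(rate, total_time, rng) for _ in range(draws))
    return total / draws

# Hypothetical: rate 0.5 per unit time, total branch length 10 -> mean near 5.
print(mean_count(rate=0.5, total_time=10.0))
```

Using total branch length rather than tree height is what captures the "many branches multiply the opportunity" point: more lineages in an interval means more time for mutations to land.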
Ewens distribution
under neutral model, mutations arise at rate µ
and are lost or drift to higher frequency (frequency
proportional to AGE)
thus we’ve come to expect a certain distribution of allele frequencies,
e.g. p=q is unlikely
(DO NOT MEMORIZE THIS)
generally a small number of very common alleles, and increasing
number of very rare alleles
(DO RECOGNIZE THIS)
um, huh?
• here is the context: DRIFT causes some alleles to
increase in frequency, some to be lost (moving
forward in time)
• moving back in time from NOW, the same process
can explain the frequency of alleles in the context of
how individuals are related
(most recent common ancestor)
• this means we have expectations for how long it
takes for a sample of sequences from NOW to
coalesce to a common ancestor in the past (about 2
times effective population size, in generations)
• one reason two separate evolutionary populations
may not APPEAR completely different: it takes time
for ancestral diversity to sort out
(now)
>1 population?
this pop
descended
from ‘red allele’
ancestor
this pop
descended
from ‘green allele’
ancestor
let's imagine two populations that rarely exchange migrants
but have a common ancestry in the recent evolutionary past
drift (moving forwards in time from ancestral population)
leads to many copies that descended from one particular allele,
a different allele in each population -> how do we know there are two populations?
evolutionary biology: the
populations tell us who they
are!
• shown at right are two LOCATIONS, not
necessarily two distinct populations
• may be one evolutionary population
• however: if one is 90% A1 and 10% A2,
the other is 10% A1 and 90% A2
• that means overall 50% A1, 50% A2
• should see 25% A1A1 homozygotes,
25% A2A2 if Hardy-Weinberg fits
• instead see overall ~41% A1A1, 41%
A2A2 because we are ‘pooling’ 2
diverged populations
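The arithmetic behind the 25% vs. ~41% comparison can be sketched as follows, assuming each location is itself in Hardy-Weinberg proportions and the two are sampled equally. The function name is hypothetical.

```python
def homozygote_frequencies(p_by_population):
    """Compare pooled observed homozygosity with the Hardy-Weinberg expectation.

    p_by_population -- frequency of A1 in each (equally sampled) population
    Returns (observed A1A1, observed A2A2, expected A1A1, expected A2A2).
    """
    k = len(p_by_population)
    # Observed: average the within-population genotype frequencies.
    observed_a1a1 = sum(p * p for p in p_by_population) / k
    observed_a2a2 = sum((1 - p) ** 2 for p in p_by_population) / k
    # Expected: apply Hardy-Weinberg to the pooled allele frequency.
    p_pooled = sum(p_by_population) / k
    expected_a1a1 = p_pooled ** 2
    expected_a2a2 = (1 - p_pooled) ** 2
    return observed_a1a1, observed_a2a2, expected_a1a1, expected_a2a2

obs11, obs22, exp11, exp22 = homozygote_frequencies([0.9, 0.1])
print("observed A1A1, A2A2:", round(obs11, 2), round(obs22, 2))
print("expected A1A1, A2A2:", round(exp11, 2), round(exp22, 2))
```

Passing two identical frequencies, e.g. `[0.5, 0.5]`, makes observed and expected match, which is the "one evolutionary population" case.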
excess of common
alleles
• excess homozygosity could mean that two
evolutionary populations are being analyzed as
though they are one
• so we don’t trust “even” allele frequencies: now
think frequency dependent selection, balancing
selection, or pooling of multiple evolutionary
populations
neutral theory: sort of like Goldilocks story
just right
= “neutral”
η1=2
η2=2
η3=1
η4=1
excess rare alleles
= purifying selection
or population expansion
η1=3
η2=2
η3=1
η4=0
excess common alleles
= positive selection or
long-term decline
η1=0
η2=1
η3=2
η4=3 (2, +1 for “η5”)
why is this important for understanding
phylogenetics (species trees)?
• coalescent theory lets us test our
assumptions of how DNA sequences
evolve before we use them to
reconstruct phylogeny
• coalescent theory explains why
recently-diverged populations may not
yet have synapomorphies despite
already being on different evolutionary
paths
• this model gives us basis for estimating
time to ancestor of ANY two sequences
• DNA characters are
just like phenotypic
characters
• 4 character states
A,C,T,G plus
information in
insertion-deletion,
gene copy number,
etc.
• same concerns of
homology and
shared descent
apply
• “mitochondrial Eve”
sets up
misunderstanding
• every locus sampled
now has a point in the
past where all current
alleles coalesce to a
common ancestor
• in recently diverged
species, diversity is
often older than the
species
human population isolated ~200kya
Ne
isolation
isolation
understanding coalescence
1. larger effective size (Ne), more diversity
2. when time between branching events short
relative to Ne, more likely that allelic diversity is older
than branching event
"This coalescence does not mean that the population originally
consisted of a single individual with that ancestral allele. It just
means that particular individual’s allele was the one that, out of all
the alleles present at that time, later became fixed in the
population."
phylogeny
inference
• 2 basic approaches: algorithm vs.
criterion
• “neighbor joining” shown in book is an
algorithm that generates a single tree
by finding shortest “distances”
(proportion of differences at nucleotide
sites)
• algorithm approaches do not help
identify our uncertainty: one answer
comes out, whether well supported or
not
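The "distances" that neighbor joining starts from can be sketched as below. This shows only the distance step (proportion of differing sites for each pair), not the joining algorithm itself; the taxon names and sequences are made up for illustration.

```python
from itertools import combinations

def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two sequences differ."""
    differences = sum(a != b for a, b in zip(seq_a, seq_b))
    return differences / len(seq_a)

def distance_table(named_seqs):
    """Pairwise p-distances for every pair of named sequences."""
    return {
        (a, b): p_distance(named_seqs[a], named_seqs[b])
        for a, b in combinations(named_seqs, 2)
    }

# Hypothetical aligned sequences of equal length.
example = {
    "taxonA": "acgtacgtac",
    "taxonB": "acgtacgtaa",
    "taxonC": "ttgtacgaac",
}
for pair, dist in distance_table(example).items():
    print(pair, dist)
```

An algorithm working from this table returns one tree no matter how close the competing distances are, which is the uncertainty problem noted above.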
criterion-based
phylogeny
30 tips results in 8.7 x 10^36 possible trees
computer search necessary
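Where a number like this comes from can be sketched with the standard count of unrooted, fully bifurcating trees, 3 × 5 × ... × (2n - 5) for n tips. That the slide means unrooted trees is an assumption, made because it matches the quoted figure; the function name is hypothetical.

```python
def unrooted_tree_count(tips):
    """Number of unrooted bifurcating trees: 3 * 5 * ... * (2*tips - 5)."""
    count = 1
    for k in range(3, 2 * tips - 4, 2):
        count *= k
    return count

for tips in (4, 5, 8, 30):
    print(tips, "tips:", unrooted_tree_count(tips))
```

The count grows so fast that checking every tree is out of reach well before 30 tips, which is why searches rely on heuristics rather than exhaustive enumeration.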
3 of >10,000
possible trees
which fits data best?
depends on the
criterion
11 changes
7 changes = most
parsimonious of these 3
11 changes
criteria used in
phylogeny
• parsimony - the fewest # of changes indicates
the most acceptable tree topology
• maximum likelihood - both topology
(arrangement of branches) and branch lengths
are iteratively searched for tree(s) that fit
statistical model of molecular evolution (e.g.
transitions > transversions)
• Bayesian - criterion is still likelihood-based
(combined with prior probabilities),
search strategy is different (sums result over
many similar-likelihood trees)
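To make the parsimony criterion concrete, here is a sketch that counts the minimum number of changes one site needs on a given tree, using Fitch's counting method (a standard technique, not named in the lecture). Trees are written as nested pairs of tip names; the tip names and character states are hypothetical.

```python
def fitch_changes(tree, tip_states):
    """Return (possible ancestral states, minimum changes) for one site.

    tree       -- a tip name (str) or a pair (left_subtree, right_subtree)
    tip_states -- dict mapping tip name to its character state
    """
    if isinstance(tree, str):                  # reached a tip
        return {tip_states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_changes(left, tip_states)
    right_set, right_cost = fitch_changes(right, tip_states)
    shared = left_set & right_set
    if shared:                                 # children agree: no new change
        return shared, left_cost + right_cost
    # Children disagree: one change is needed on this part of the tree.
    return left_set | right_set, left_cost + right_cost + 1

# Hypothetical site: tips A and B carry 'a', tips C and D carry 'g'.
states = {"A": "a", "B": "a", "C": "g", "D": "g"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print("tree_1 changes:", fitch_changes(tree_1, states)[1])
print("tree_2 changes:", fitch_changes(tree_2, states)[1])
```

Summing this count over all sites gives a tree's parsimony score; the tree with the fewest total changes is preferred under this criterion, while likelihood and Bayesian methods would score the same trees with an explicit model instead.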
why different criteria?
1. we are making our assumptions explicit
for inference of the unknown
2. different scientists have different
backgrounds that drive their assumptions
3. using multiple methods/criteria lets us test
how safe our assumptions are
4. next time: how do we decide if a tree
hypothesis is strongly supported?