Transcript Apr7
Molecular Clocks, Base Substitutions, & Phylogenetic Distances
Definition: A mutation is either an exchange of one nucleotide for
another within a DNA sequence or an indel (insertion or deletion)
event. In effect, it is a mistake in the replication or repair of DNA.
Mutations are divided into three categories:
1. Deleterious – disadvantageous to the survival of the
organism.
2. Advantageous – contribute to the continued survival
of the organism.
3. Neutral – for example, a third nucleotide change in
the coding for valine.
Advantageous changes are in the minority. Also, some
changes can greatly affect an organism.
A deceptively simple, important equation:
$$ r = \frac{K}{2T} $$
Where:
r = the rate at which substitutions occur
K = the number of substitutions two sequences have
undergone since they last shared a common ancestor
expressed in substitutions per site.
T = the divergence time
Unfortunately, none of these variables are known.
T can be estimated by archaeological evidence, if it exists.
K can be approximated by sequence comparison.
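As a minimal sketch of the arithmetic in r = K/(2T), the snippet below computes a rate from assumed inputs. The values of K and T are hypothetical and chosen only for illustration; they are not taken from the lecture or from K&R. The factor of 2 reflects that substitutions accumulate along both lineages after they split from the common ancestor.

```python
def substitution_rate(k: float, t: float) -> float:
    """Return r = K / (2T), in substitutions per site per unit of time."""
    if t <= 0:
        raise ValueError("divergence time T must be positive")
    return k / (2.0 * t)


# Hypothetical inputs, for illustration only.
K = 0.1          # substitutions per site separating the two sequences
T = 10_000_000   # years since the two sequences shared an ancestor

r = substitution_rate(K, T)
print(f"r = {r:.2e} substitutions per site per year")
```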
Different portions of genes accumulate changes at widely
varying rates:
Amino Acids experience different substitution rates.
• Four-fold Degenerate Sites, those where replacing the nucleotide
with any one of the other three nucleotides does not change the
amino acid, accumulate substitutions most rapidly, e.g. the third
site of the glycine codons.
• Two-fold Degenerate Sites, those where two of the four
nucleotides code for one amino acid and the other two code for
another, e.g. the third site of the aspartic acid and glutamic acid
codons, accumulate substitutions less rapidly.
• Nondegenerate Sites, those where any change at the site
always results in a change in the amino acid, e.g. almost any
of the middle sites in Table 1.1 on p. 11 of K&R, accumulate
substitutions most slowly.
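The three site classes above can be sketched in code by counting, for a given codon position, how many of the three alternative nucleotides leave the amino acid unchanged. This is a simplified illustration that assumes the standard genetic code; the function names are hypothetical. A count of 1 is labeled two-fold degenerate here even when the two non-synonymous alternatives code for different amino acids, which is slightly looser than the definition given above.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_alternatives(codon: str, pos: int) -> int:
    """Count the other nucleotides at `pos` (0-based) that keep the amino acid."""
    original = CODON_TABLE[codon]
    count = 0
    for base in BASES:
        if base == codon[pos]:
            continue
        mutated = codon[:pos] + base + codon[pos + 1:]
        if CODON_TABLE[mutated] == original:
            count += 1
    return count


def site_class(codon: str, pos: int) -> str:
    """Label a codon position by its degeneracy."""
    labels = {
        3: "four-fold degenerate",
        1: "two-fold degenerate",
        0: "nondegenerate",
    }
    return labels.get(synonymous_alternatives(codon, pos), "other")


print(site_class("GGT", 2))  # third site of a glycine codon
print(site_class("GAT", 2))  # third site of an aspartic acid codon
print(site_class("GAT", 1))  # middle site of the same codon
```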
Natural selection makes it difficult to assess mutation rates
because it tends to eliminate deleterious mutations.
Substitutions are mutations that have been filtered through
selection.
We consider two types of substitutions:
Synonymous – those that do not result in a change of the
amino acid.
Nonsynonymous – those that result in a change of the amino
acid.
Synonymous changes are less affected by selection and thus are
more reflective of the true mutation rate than nonsynonymous
changes.
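The distinction between the two types can be sketched by translating a codon before and after a substitution and comparing the amino acids. This is a minimal illustration that assumes the standard genetic code; `classify_substitution` is a hypothetical helper name, not something defined in the lecture.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(codon_before: str, codon_after: str) -> str:
    """Return 'synonymous', 'nonsynonymous', or 'no change' for a codon pair."""
    if codon_before == codon_after:
        return "no change"
    if CODON_TABLE[codon_before] == CODON_TABLE[codon_after]:
        return "synonymous"
    return "nonsynonymous"


# A third-site change within the valine codons leaves the amino acid unchanged.
print(classify_substitution("GTT", "GTC"))
# A third-site change from aspartic acid (GAT) to glutamic acid (GAA).
print(classify_substitution("GAT", "GAA"))
```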
Table of synonymous and nonsynonymous substitution rates for various genes
in four mammalian species. See Table 3.3 on page 64 of K&R for
identification of the genes.
Because of differences in the selective constraints on various
substitutions in individual proteins, differences in amino acid
replacement rates between nuclear genes can be quite striking.
On the other hand, rates of molecular evolution for loci with
similar functional constraints can be quite uniform over long
periods of evolutionary time.
This observation caused Zuckerkandl and Pauling in the 1960s to
suggest that within homologous proteins the substitution rates
were so constant that they were like the ticking of a Molecular
Clock.
While the clock may run at different rates for different proteins,
the number of differences between two homologous proteins
correlated well with the time since speciation caused them to
diverge.
This hypothesis is controversial. Classical evolutionists
maintain that the erratic tempo of morphological evolution is
inconsistent with a steady rate of molecular change.
Furthermore, disagreements regarding divergence times
have also called into question the uniformity of evolutionary
rates promised by a “molecular clock.” See, as one
example, the article on the time of divergence of humans
and chimpanzees. One of the hypotheses there is that humans,
because of their longer life span, have a ‘slower’ molecular
clock.
On the other hand these varying rates can be explained in
several different ways and much useful information has been
obtained from sequence comparison.
For the moment we will proceed with the assumption of a
molecular clock for highly conserved sequences.
However, we are not yet out of the woods. For sequences with
relatively few substitutions a simple count will provide a
reasonable approximation of K. On the other hand, simple
counting in sequences with many differences may cause a
significant underestimation of the actual number of substitutions.
Why?
Jukes and Cantor in 1969 developed the first, and simplest,
model of nucleotide substitution that accounts for the
underestimate produced by simply counting differences and gives
a more accurate estimate of the number of substitutions since two
sequences last shared a common ancestor.
In 1980 Kimura developed a more sophisticated model that took
into account different rates for transitions and transversions.
To begin, we will investigate the ramifications of the Jukes-Cantor model.
This model assumes that a certain proportion of any of the given
nucleotides will change during any one evolutionary period and
that any one of them is likely to change to any of the other
nucleotides without restriction, i.e. with equal probability.
This assumption leads to a table that can be expressed in the
following way:
α = the proportion of a particular nucleotide that changes during
any one evolutionary time period.
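One way to make the model concrete is to write it as a 4 x 4 table of per-step probabilities and apply it repeatedly. The sketch below assumes the usual Jukes-Cantor form, in which a nucleotide stays the same with probability 1 − α and changes to each of the other three with probability α/3. It then checks the iterated result against the closed form p = (3/4)(1 − (1 − 4α/3)^t) that follows from this model. The values of α and t are illustrative only.

```python
def jc_matrix(alpha: float) -> list[list[float]]:
    """Jukes-Cantor table: stay with 1 - alpha, move to each other base with alpha / 3."""
    return [
        [1.0 - alpha if i == j else alpha / 3.0 for j in range(4)]
        for i in range(4)
    ]


def step(dist: list[float], matrix: list[list[float]]) -> list[float]:
    """Advance a base distribution one time step: new[j] = sum_i dist[i] * matrix[i][j]."""
    return [sum(dist[i] * matrix[i][j] for i in range(4)) for j in range(4)]


def fraction_changed(alpha: float, t: int) -> float:
    """Probability that a site differs from its starting base after t steps."""
    matrix = jc_matrix(alpha)
    dist = [1.0, 0.0, 0.0, 0.0]  # start with certainty in one base
    for _ in range(t):
        dist = step(dist, matrix)
    return 1.0 - dist[0]


alpha, t = 0.01, 50  # illustrative values, not from the lecture
p_iterated = fraction_changed(alpha, t)
p_closed = 0.75 * (1.0 - (1.0 - 4.0 * alpha / 3.0) ** t)
print(p_iterated, p_closed)
```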
Reiterating the formula for p implied by the Jukes-Cantor model:
$$ p = \frac{3}{4}\left(1 - \left(1 - \frac{4}{3}\alpha\right)^{t}\right) $$
We can solve for the elapsed time, t, based on α and p:
$$ t = \frac{\ln\left(1 - \frac{4}{3}p\right)}{\ln\left(1 - \frac{4}{3}\alpha\right)} $$
p can be approximated by the number of observed differences in the two
sequences. However, that still leaves us with one equation in two unknowns, α
and t. This is not good! Or is it?
If we look at the product αt and think about its meaning for a minute, we see
that this product is the number of time steps times the mutation rate, i.e. the
expected number of substitutions per site during the elapsed time. This
includes even those that do not appear in the count of differences, i.e. the
“hidden substitutions” (such as those that eventually resulted in a position
once again being occupied by its original nucleotide).
We define a new variable d = αt which is called the Jukes-Cantor distance.
Notice that this distance is proportional to t.
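A small simulation can illustrate why αt exceeds the observed fraction of differing sites. The sketch below assumes the Jukes-Cantor rule (each site changes with probability α per step, to one of the other three nucleotides chosen with equal probability) and counts both the substitution events that actually occurred and the sites that end up different from where they started. The parameter values, seed, and function name are illustrative assumptions.

```python
import random


def simulate_site_history(alpha: float, steps: int, n_sites: int, seed: int = 0):
    """Return (substitution events per site, fraction of sites that differ at the end)."""
    rng = random.Random(seed)
    bases = "ACGT"
    events = 0
    differing = 0
    for _ in range(n_sites):
        original = rng.choice(bases)
        current = original
        for _ in range(steps):
            if rng.random() < alpha:
                # Change to one of the three other nucleotides, equally likely.
                current = rng.choice([b for b in bases if b != current])
                events += 1
        if current != original:
            differing += 1
    return events / n_sites, differing / n_sites


actual, observed = simulate_site_history(alpha=0.01, steps=100, n_sites=5000)
print(f"substitutions per site that occurred: {actual:.3f}")
print(f"fraction of sites that differ:        {observed:.3f}")
```

With these settings αt = 1, so about one substitution per site is expected, while the fraction of sites that visibly differ is noticeably smaller because repeated hits at the same site are hidden.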
We are almost where we want to be.
We make one last observation: If x is small, ln(1 − x) ≈ −x. For example,
ln(1 − .00001) = −.00001000005
Thus, since α is very small, we have:
$$ \ln\left(1 - \frac{4}{3}\alpha\right) \approx -\frac{4}{3}\alpha $$
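As a quick numerical check of this approximation, the sketch below compares ln(1 − 4α/3) with −4α/3 for a few values of α. The values are arbitrary and chosen only to show that the gap shrinks as α gets smaller.

```python
import math

for alpha in (0.1, 0.01, 0.001, 0.00001):
    x = 4.0 * alpha / 3.0
    exact = math.log(1.0 - x)   # ln(1 - 4*alpha/3)
    approx = -x                 # first-order approximation
    print(f"alpha={alpha:<8} exact={exact:.12f} approx={approx:.12f} "
          f"gap={abs(exact - approx):.2e}")

# The example from the text: ln(1 - .00001) is very close to -.00001.
gap_text_example = abs(math.log(1.0 - 0.00001) + 0.00001)
```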
This approximation allows us to solve for d, the Jukes-Cantor distance.
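$$ t = \frac{\ln\left(1 - \frac{4}{3}p\right)}{\ln\left(1 - \frac{4}{3}\alpha\right)} \approx \frac{\ln\left(1 - \frac{4}{3}p\right)}{-\frac{4}{3}\alpha} $$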
Multiplying both sides by α,
$$ d = \alpha t \approx -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right) $$
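To see how little is lost by the approximation, the sketch below computes the elapsed time both ways: exactly, as ln(1 − 4p/3) / ln(1 − 4α/3), and approximately, as −(3/(4α)) ln(1 − 4p/3). The value of α is a hypothetical choice for illustration, since in practice α is unknown; the function names are likewise illustrative.

```python
import math


def elapsed_time_exact(p: float, alpha: float) -> float:
    """t = ln(1 - 4p/3) / ln(1 - 4*alpha/3)."""
    return math.log(1.0 - 4.0 * p / 3.0) / math.log(1.0 - 4.0 * alpha / 3.0)


def elapsed_time_approx(p: float, alpha: float) -> float:
    """t ~ -(3 / (4*alpha)) * ln(1 - 4p/3), using ln(1 - x) ~ -x for small x."""
    return -3.0 / (4.0 * alpha) * math.log(1.0 - 4.0 * p / 3.0)


p = 0.125       # observed proportion of differing sites
alpha = 0.001   # hypothetical per-step substitution probability

t_exact = elapsed_time_exact(p, alpha)
t_approx = elapsed_time_approx(p, alpha)
print(t_exact, t_approx)

# Multiplying the approximate t by alpha gives d, which no longer depends on alpha.
d = alpha * t_approx
print(d)
```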
Thus, given two sequences, S0 and S1
$$ d_{JC}(S_0, S_1) = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right) $$
We conclude with an example:
Consider two sequences with 40 sites
S0: AGCTTCCGATCCGCTATAATCGTTAGTTGTTACACCTCTG
S1: AGCTTCTGATACGCTATAATCGTGAGTTGTTACATCTCCG
Five sites have undergone substitution. Thus p = 5/40 = 1/8 = .125
Thus,
$$ d_{JC}(S_0, S_1) = -\frac{3}{4}\ln\left(1 - \frac{4}{3}(.125)\right) = -\frac{3}{4}\ln(.833333) = -\frac{3}{4}(-.182321556) = .13674117 $$
This is the expected number of substitutions per site. Over the 40 sites,
40 × .13674117 ≈ 5.5 is the expected number of substitutions, slightly more
than the 5 observed differences between the two sequences.
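The example can be reproduced with a short script. This is a minimal sketch using the two sequences above; the function names are illustrative, and the guard on p reflects that the logarithm is only defined when p < 3/4.

```python
import math


def proportion_different(s0: str, s1: str) -> float:
    """Fraction of aligned sites at which two equal-length sequences differ."""
    if len(s0) != len(s1):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(s0, s1)) / len(s0)


def jukes_cantor_distance(p: float) -> float:
    """d_JC = -(3/4) * ln(1 - (4/3) * p), defined for 0 <= p < 3/4."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


S0 = "AGCTTCCGATCCGCTATAATCGTTAGTTGTTACACCTCTG"
S1 = "AGCTTCTGATACGCTATAATCGTGAGTTGTTACATCTCCG"

p = proportion_different(S0, S1)
d = jukes_cantor_distance(p)
print(f"p = {p}")
print(f"d_JC = {d:.8f}")
print(f"expected substitutions over {len(S0)} sites = {d * len(S0):.2f}")
```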