Transcript week7

Today’s Agenda


John M.’s presentation (15-25 min)
Phylogenetic Trees
– Overview
– Construction
– Algorithms
– Etc.
Phylogenetic Trees

The means by which biologists portray
– the history of life over millions of years and
– the branching events that gave rise to biological diversity
Phylogenetic Trees

Phylogenetic trees of different alleles of a
particular gene are the byproduct of
1. many rounds of mutation,
2. drift and
3. selection
resulting in nearly unique sequences at that
allele for individuals within species.
Phylogenetic Trees
Questions You Should Be Asking:




What are Alleles?
What exactly is mutation; what causes it?
What is drift?
What is selection?
Alleles & Genes

Remember a gene is a segment of DNA that
ultimately encodes a protein
– A protein that performs an important biological function.
– Or leads to some other important trait
Very similar organisms have the same genes
– For example, we all have the gene for hemoglobin
– However, a slight change in that gene might result in sickle
cell anemia
– Another slight change might not cause any effect
– A severe change might lead to a complete failure to create
hemoglobin (i.e., death)
Different versions of a gene are called alleles.
Natural Selection

A prime objective for
all species is to
reproduce and survive.
When species do this they tend to produce more
offspring than the environment can support.
The lack of resources to nourish these
individuals places pressure on the size of the
species population.
The lack of resources also means increased
competition and, as a consequence, some
organisms will not survive.
Natural Selection



The organisms that die as a
consequence of this competition
are not a random sample.
Darwin found that those organisms
more suited to their environment
were more likely to survive.
Those organisms that are better
suited to their environment exhibit
desirable characteristics, which are
a consequence of their genome
being more suitable to begin with.
Natural Selection




As a particular species spreads over a large
geographic area, genetic branches arise.
Different areas (geographic regions) provide different
selection criteria
A sub-population of a species might find itself
isolated in a tropical environment,
while another sub-population gets isolated in a desert
environment.
Mutation





A mutation or polymorphism is a change in
the DNA "letters" of a gene or an alteration
in the chromosomes.
Most DNA variation is neutral (neither
beneficial nor harmful),
but harmful sequence changes sometimes
do occur.
Changes within genes can result in proteins
that don't work normally or don't work at all.
Some of these changes can contribute to
disease or affect how someone responds to
a medicine.
Mutation

Mutations
1. may be passed down from parent
to child (in the sperm or egg cells),
2. may occur around the time of conception or
3. may be acquired during a person's lifetime.
Mutations can arise
– spontaneously during normal cell functions,
such as when a cell divides, or
– in response to environmental factors such as toxins,
radiation, hormones, and even diet.
Mutation



Nature provides us with a
system of finely tuned
repair enzymes that find
and fix most DNA errors.
But as our bodies change
in response to age, illness
and other factors, our
repair systems may
become less efficient.
Uncorrected mutations can
accumulate and
contribute to disease.
Genetic Drift




Allele frequencies can change due to chance alone.
Alleles that form the next generation's gene pool are a sample
of the alleles from the current generation.
When sampled from a population, the frequency of alleles
differs slightly due to chance alone.
A small percentage of alleles may continually change
frequency in a single direction for several generations
– just as flipping a fair coin may, on occasion, result in a string of
heads or tails.
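The sampling idea above can be sketched as a small simulation. This is only an illustration, assuming a Wright-Fisher-style model in which every allele copy in the next generation is drawn at random from the current gene pool; the function name, population sizes, and seed are hypothetical choices, not values from the presentation.

```python
import random

def simulate_drift(start_freq, pop_size, generations, seed=0):
    """Track one allele's frequency when each generation is a random
    sample (with replacement) of the previous generation's gene pool."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    freq = start_freq
    history = [freq]
    for _ in range(generations):
        # Each of the pop_size allele copies carries the allele with
        # probability equal to its current frequency.
        copies = sum(1 for _ in range(pop_size) if rng.random() < freq)
        freq = copies / pop_size
        history.append(freq)
    return history

small = simulate_drift(0.5, pop_size=20, generations=50)
large = simulate_drift(0.5, pop_size=2000, generations=50)
print("small population:", [round(f, 2) for f in small[:10]])
print("large population:", [round(f, 2) for f in large[:10]])
```

No selection is involved here: any change in frequency comes from sampling alone, and the generation-to-generation jumps are typically larger when the population is small.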
Genetic Drift
[Figure labels: Parent Population; Next Generation; Next Generation]
Genetic Drift




Sharp drops in population size can change allele frequencies
substantially.
When a population crashes, the alleles in the surviving
sample may not be representative of the pre-crash gene pool.
This change in the gene pool is known as a bottleneck effect; it is
called the founder effect when it arises because a small population of
organisms (founders) invades a new territory.
Many biologists feel the genetic changes brought about by
founder effects may contribute to isolated populations
developing reproductive isolation from their parent
populations.
Genetic Drift

The founder effect
[Figure labels: Invaders; Large Population; Small subset Survives]
Genetic Drift & Fitness



Large populations are often divided into smaller
subpopulations.
– Drift causes allele frequency differences
between subpopulations
If a subpopulation is small enough, the population
could even drift through fitness valleys in the
adaptive landscape.
Then, the subpopulation could climb a larger fitness
hill.
Genetic Drift & Fitness




Both natural selection and genetic drift decrease
genetic variation.
If they were the only mechanisms of evolution,
populations would eventually become
homogeneous and further evolution would be
impossible.
There are, however, mechanisms that replace
variation depleted by selection and drift.
Thank God for mutation and environmental
diversity.
Trees and Distance

http://babbage.clarku.edu/~djoyce/java/Phyltree/intro.html
Trees and Distance
Reconstructing Phylogenetic Trees

There are ten extant species (species currently living),
named from 1 through 10.
The lines above the extant species represent the same
species, just in the past.
Reconstructing Phylogenetic Trees


When two lines converge to a point, that should be
interpreted as the point when the two species diverged
from a common ancestral species,
the point being that common ancestral species.
Reconstructing Phylogenetic Trees


The horizontal dimension doesn't mean anything!
It is completely arbitrary whether a branch of the tree
is placed to the left or to the right.
Reconstructing Phylogenetic Trees


The vertical dimension corresponds to time.
Although it's imprecise, the difference between two
species can be used to estimate when they diverged.
Reconstructing Phylogenetic Trees
A tree isn't always the best model. Here are some
times when it isn't best.
– For individuals within a species. The genetic
material of an individual doesn't derive from a single
earlier existing individual.
  – Animals and plants that multiply by sexual reproduction
  receive half their genetic material from each of two parents,
  so a tree like this is inappropriate.
Reconstructing Phylogenetic Trees
Here are some other examples:
– For closely related species. Individuals do
occasionally mate between closely related species,
and their progeny survive to contribute to the gene
pool of one or both of the parent species.
– Hybrid species. In the plant world it occasionally
happens that a new tetraploid species arises from
two diploid species. The two parent species need to
be somewhat related for this to happen.
Reconstructing Phylogenetic Trees
Here is one last example:
– Distant interaction. There are a couple of ways that
genetic material from one species can find its way
into unrelated species.
  – Sometimes a bacterium of one species can ingest the
  genetic material of a bacterium of another species and
  incorporate part of it into its own genetic material.
  – Sometimes viruses can inadvertently transport genetic
  material from one species to another.
In spite of these exceptions, a tree model is usually a
pretty good model to show the relations among
species.
Mutation Rates & Vertical Dimension





Differences among species are the key to reconstructing
the phylogenetic tree.
Species differ in their characteristics, also called characters.
The characters may be observable and measurable properties
of the individuals.
For instance, among mammals, the number of each kind of
tooth that individuals of a species have has been
a successful character for classification.
This character has been especially important among extinct
species since fossilized teeth are commonly found.
Mutation Rates & Vertical Dimension

Any characters can be used to classify species and
reconstruct a phylogenetic tree of species,
but some are more useful than others.
If a species depends on a character for its continued
survival, that character will not change, as any
mutations of it will be eliminated.
Call such characters essential. And most visible
characters are essential for the species.
This means that if we choose essential characters,
any differences should count as very significant.
Mutation Rates & Vertical Dimension




There are, however, some difficulties with considering essential
characters.
If one species evolves by changing an essential characteristic,
whatever ecological forces supported that change may also
apply to other species, and that could lead to parallel evolution.
Thus, differences or similarities in essential characters are very
relevant to the reconstruction of the general shape of the
phylogenetic tree, but they really can't be used to determine the
relative lengths of the lines within the tree.
Some species have been stable for millions of years. Others
evolve very fast.
Mutation Rates & Vertical Dimension



Irrelevant mutations. We could, on the other hand,
consider nonessential characters.
Changes in nonessential characters are effected by
mutations, mutations that we can call irrelevant.
The rate of change of irrelevant mutations should be
fairly uniform among species, especially among
species that are fairly closely related.
Mutation Rates & Vertical Dimension





Much of the genome sequence of an organism is irrelevant.
For example, there are 64 (4^3) different codons for 20 amino
acids.
Many amino acids are coded by four or more different codons
(a few by as many as six).
For these multiply coded amino acids, typically the third
nucleotide can take any of the four possible values.
In other words, a mutation in this third nucleotide is irrelevant.
The DNA can mutate at this site and the resulting protein
doesn't change.
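A small check makes the third-position point concrete. This sketch uses only a partial codon table (the alanine and glycine families of the standard genetic code); the helper name `is_silent` is hypothetical and introduced just for illustration.

```python
# Partial codon table from the standard genetic code:
# only the two four-codon families used in this example.
CODON_TO_AMINO = {
    "GCT": "Ala", "GCC": "Ala", "GCA": "Ala", "GCG": "Ala",
    "GGT": "Gly", "GGC": "Gly", "GGA": "Gly", "GGG": "Gly",
}

def is_silent(codon, position, new_base):
    """Return True if replacing the base at `position` (0, 1 or 2)
    leaves the encoded amino acid unchanged."""
    mutated = codon[:position] + new_base + codon[position + 1:]
    return CODON_TO_AMINO[mutated] == CODON_TO_AMINO[codon]

# Every choice of third base still encodes alanine.
third_position = [is_silent("GCT", 2, base) for base in "ACGT"]
# Changing the second base (GCT -> GGT) switches alanine to glycine.
second_position = is_silent("GCT", 1, "G")
print(third_position, second_position)
```

Mutations of the first kind are the "irrelevant" ones in the sense used here: the DNA changes but the protein does not.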
Mutation Rates & Vertical Dimension
By concentrating on irrelevant mutations,
– not only can the shape of the phylogenetic
tree be reconstructed, but
– the relative lengths of the lines within the
phylogenetic tree can also be estimated.
Mutations as a measure of time


Let's concentrate on one character to begin with.
Our first questions are:
– What is the probability p(t) that the character has the same
value at the beginning of a time interval of length t as it does
at the end?
– What is the probability q(t) that the character has one value
at the beginning of a time interval of length t but a particular
different value at the end of the interval?
Mutations as a measure of time


Suppose that there are m different possible alternate values,
and suppose that the mutation rate is r mutations per unit time
interval.
Some statistical analysis (which we'll skip) gives us the
answers to these questions.
Mutations as a measure of time

Note that initially, when t = 0, p(0) is 1, while q(0) is 0
since there are no mutations in no time. Also, as t
approaches infinity, p(t) and q(t) both approach 1/m,
which means that in the long run, each of the m
alternative values is equally probable.
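The formulas themselves are not reproduced in this transcript. As a sketch only, here is one simple model that matches the boundary behavior just described: it assumes mutation events arrive at rate r and that each event redraws the character uniformly from all m values (so an event may leave the value unchanged). Under the alternative assumption that every mutation changes the value, the exponent would be -m r t / (m - 1) instead of -r t. The function names are hypothetical.

```python
import math

def p_same(t, m, r):
    """Probability the character has the same value after time t
    (assumed model: events at rate r, each redrawing uniformly from m values)."""
    return 1 / m + (m - 1) / m * math.exp(-r * t)

def q_other(t, m, r):
    """Probability the character ends at one particular different value."""
    return 1 / m - 1 / m * math.exp(-r * t)

m, r = 4, 0.1
for t in (0, 5, 20, 200):
    p, q = p_same(t, m, r), q_other(t, m, r)
    # p plus the (m - 1) alternatives always sums to 1.
    print(t, round(p, 4), round(q, 4), round(p + (m - 1) * q, 4))
```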
Mutations as a measure of time

Now let's assume that there are n different
characters, not just one. Then E(t), the expected
number of characters that are not the same at the
end of a time interval of length t as they were at the
beginning, is n(m - 1) q(t).
Mutations as a measure of time

Here's the graph of that function when there are
m = 4 alternate values for each character, there are
n = 40 characters, and the mutation rate is r = 0.1.
Mutations as a measure of time


Time t is shown on the horizontal axis, while the vertical axis
gives y, the expected number of character differences.
Note that when t gets large, the expected number of character
differences approaches 30.
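The shape of this curve can be tabulated with a few lines of code. The sketch below uses E(t) = n(m - 1) q(t) from the slides, but the specific form q(t) = (1 - e^(-rt)) / m is an assumption (each mutation event redraws the character uniformly from all m values), since the original formula is not shown in the transcript.

```python
import math

def expected_differences(t, n=40, m=4, r=0.1):
    """Expected number of differing characters after time t,
    E(t) = n (m - 1) q(t), with the assumed q(t) = (1 - exp(-r t)) / m."""
    q = (1 - math.exp(-r * t)) / m
    return n * (m - 1) * q

for t in (0, 5, 10, 20, 50, 100):
    print(t, round(expected_differences(t), 2))

# The curve levels off at n (m - 1) / m, which is 30 for these parameters.
saturation = 40 * (4 - 1) / 4
print("saturation level:", saturation)
```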
Mutations as a measure of time


We can take the inverse function of y = E(t), that is,
turn this graph around, to give us an estimate for
time t in terms of the observed number of character
differences. Let g denote the inverse function.
The inverse involves a logarithm, whose base here is e.
Mutations as a measure of time


The graph of t = g(y) is shown
to the right with the same
parameter values m = 4,
n = 40, and r = 0.1.
Note that as the number of
expected differences
approaches 30, the
corresponding time
approaches infinity.
Mutations as a measure of time


The observed number of
differences may be near the
expected number, but it's
usually more or less.
So the observed number of
differences could easily be
greater than 30.
Mutations as a measure of time


Should that happen, the best
conclusion to make is that the
time is very great, but can't be
estimated.
It would be prudent not to
estimate the time when the
number of differences is
slightly less than 30, too.
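A short sketch shows both the inverse function and why estimates near 30 are unreliable. It assumes the same hypothetical model E(t) = (n(m - 1)/m)(1 - e^(-rt)), which is not given explicitly in the transcript; inverting it gives t = -(1/r) ln(1 - y m / (n(m - 1))). The function names are illustrative.

```python
import math

def expected_differences(t, n=40, m=4, r=0.1):
    """Assumed forward model: E(t) = (n (m - 1) / m) (1 - exp(-r t))."""
    return n * (m - 1) / m * (1 - math.exp(-r * t))

def estimate_time(y, n=40, m=4, r=0.1):
    """Invert the assumed model: time t for y observed differences.
    Returns infinity when y is at or above the saturation level."""
    saturation = n * (m - 1) / m
    if y >= saturation:
        return math.inf  # time is very great but cannot be estimated
    return -math.log(1 - y / saturation) / r

# Round trip: converting a time to differences and back recovers the time.
print(round(estimate_time(expected_differences(7.0)), 6))

# One extra observed difference moves the estimate far more near saturation.
early_step = estimate_time(2) - estimate_time(1)
late_step = estimate_time(29) - estimate_time(28)
print(round(early_step, 2), round(late_step, 2))
```

Because the observed count fluctuates around its expected value, a swing of one or two differences near the saturation level translates into a very large swing in the estimated time.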
Reconstruction


How do you reconstruct the phylogenetic tree
when all you know are characters of extant
species?
When there are only a few species, only a
few characters, and the number of mutations
is small but not too small, then common
sense and a little bit of logic do a pretty
good job, at least for deciding on the shape
of the tree.
Reconstruction



As the number of species goes up, and the number
of characters goes up, conflicting data begin to
appear.
Then common sense and logic are insufficient for the
job.
The mutation rate may not be high enough to
distinguish closely related species
(those near the bottom of the tree),
but too high to make confident conclusions for
reconstructing the top of the tree in order to connect
distantly related species.
Reconstruction

Also, deciding the relative lengths of the lines
in the tree,
or the equivalent problem of deciding how high to
join the various lines,
requires computations and a basis for
making computations.
Reconstruction




A simplification of the problem.
There's a lot of information in the gene sequences,
and it's difficult to analyze it all.
One way to simplify things is to look at just pairs of
species at a time.
This will ignore some useful information, but enough
will remain to do a pretty good job on reconstructing
a phylogenetic tree, and the computations become
simpler.
Reconstruction



A simplification of the problem.
When we look at two species,
we have two sequences of characters, and the
relevant measure is the number of differences in
these two sequences,
– a measure that we can interpret as the distance between
the species.
Algorithms that depend only on distances between
species are called distance matrix algorithms.
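Counting the differences between two character sequences takes only a few lines. The sketch below is illustrative: the function name and the two example sequences are made up, and it assumes both sequences list the same characters in the same order.

```python
def character_distance(seq_a, seq_b):
    """Number of positions at which two equal-length character
    sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same number of characters")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)

# Hypothetical sequences for two species; they differ at two positions.
distance = character_distance("ACGTACGT", "ACGTTCGA")
print(distance)
```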
Reconstruction


Distances between species.
If two species have a small distance between
them
(as measured by the number of differences in
their character sequences),
then they have a recent common ancestor;
but if they are far apart, then their common
ancestor is in the remote past.
Reconstruction



Distances between species.
We can use the distance between the species as a
measure of the distance in time since the species
diverged.
These two distances,
1. the number of character differences and
2. the time since divergence,
will be approximately proportional when they're
relatively small.
Reconstruction




The difference matrix.
Here is a model phylogenetic tree with six extant
species alongside a matrix.
This 6 by 6 matrix results from mutations of 40
irrelevant characteristics each with 4 alternate
values.
The mutation rate is uniform with a value of 100
mutations per 1000 time units, that is, 0.1 mutations
per time unit.
Reconstruction



The difference matrix.
The (i,j)th entry in the matrix indicates how many of
the 40 characters differ between species i and
species j.
If two species are not very distant in the tree, then
there hasn't been much time for mutations to occur,
so the entry in this matrix should be small.
Reconstruction


The difference matrix.
If two species are very distant, the entry in the
matrix should be large, that is, close to 30, which is
3/4 of the number of characters. You won't see such
large entries in the matrix unless you increase the
mutation rate or the number of species.
Reconstruction


The difference matrix.
Note that the matrix is symmetric, that is, the (i,j)th
entry is the same as the (j,i)th entry. Also, the
entries along the diagonal are all 0, denoted here as
*, since each (i,i)th entry indicates how many
differences there are between the ith character sequence and
itself, which, of course, is 0.
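As a sketch of how such a matrix could be built, the code below counts pairwise differences for a handful of made-up character sequences (they are not the data behind the matrix on the slide) and fills both halves of the matrix at once, which is why it comes out symmetric with zeros on the diagonal.

```python
def difference_matrix(sequences):
    """Build the matrix whose (i, j) entry counts the positions at
    which sequence i and sequence j differ."""
    size = len(sequences)
    matrix = [[0] * size for _ in range(size)]  # diagonal stays 0
    for i in range(size):
        for j in range(i + 1, size):
            diff = sum(1 for a, b in zip(sequences[i], sequences[j]) if a != b)
            matrix[i][j] = diff
            matrix[j][i] = diff  # symmetric entry
    return matrix

# Hypothetical character sequences for four species.
species = ["ACGTAC", "ACGTAA", "ACCTGA", "TCCTGA"]
matrix = difference_matrix(species)
for row in matrix:
    print(row)
```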
Reconstruction Algorithms




The problem
Suppose all we know is how far apart the
species are as measured by the number of
differences in their characters, that is, the
entries in the difference matrix.
How can we reconstruct the phylogenetic
tree?
First, we can convert the differences to times.
Reconstruction Algorithms


The problem
The conversion is given by the inverse function g described earlier,
applied to each entry of the difference matrix.
Reconstruction Algorithms

Of course, we might not be able to reconstruct the
actual tree,
since the mutations are random and need not reflect the
actual distances between species.
So a better question is:
How can we reconstruct the most likely
phylogenetic tree?
We can come up with some promising algorithms
that should give trees that aren't too far away from
the most likely tree.
Reconstruction Algorithms





A solution: the "minimum" reconstruction
method.
It seems reasonable that the two species that share
the greatest number of characters are the most
closely related.
That is, the smallest entry in the difference matrix
indicates which two species diverged most recently.
Also, the next smallest entry should indicate which
two species diverged just before that.
And so forth.
Reconstruction Algorithms




A solution: the "minimum" reconstruction method.
That's the idea of the algorithm, but it needs a little clarification.
Suppose species 1 and 2 are closest with 4 differences in their
characters, and species 1 and 3 are next closest with 6
differences.
Then since we conclude that species 1 and 2 diverged most
recently, it won't be species 1 and 3 that diverged just before
that, rather it will be the ancestral species of 1 and 2 that
diverged from species 3 just before that.
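The procedure described above can be sketched as follows. This is a minimal illustration rather than the presenter's implementation: the function name is hypothetical, ancestors are represented as tuples of the two joined groups, and only the distances 4 (species 1 and 2) and 6 (species 1 and 3) come from the example in the text; the remaining distances are made up.

```python
def minimum_method(names, distances):
    """Repeatedly join the closest pair. `distances` maps
    frozenset({a, b}) to the number of character differences."""
    dist = dict(distances)
    active = list(names)
    joins = []
    while len(active) > 1:
        pairs = [(x, y) for i, x in enumerate(active) for y in active[i + 1:]]
        a, b = min(pairs, key=lambda pair: dist[frozenset(pair)])
        joins.append((a, b, dist[frozenset((a, b))]))
        ancestor = (a, b)  # the ancestral species of a and b
        rest = [s for s in active if s != a and s != b]
        for other in rest:
            # Minimum method: the ancestor is as close to `other`
            # as the closer of its two descendants.
            dist[frozenset((ancestor, other))] = min(
                dist[frozenset((a, other))], dist[frozenset((b, other))]
            )
        active = rest + [ancestor]
    return joins

example = {
    frozenset((1, 2)): 4, frozenset((1, 3)): 6, frozenset((2, 3)): 7,
    frozenset((1, 4)): 12, frozenset((2, 4)): 11, frozenset((3, 4)): 10,
}
joins = minimum_method([1, 2, 3, 4], example)
for step in joins:
    print(step)
```

In this run species 1 and 2 join first, and the second join is between species 3 and the ancestor (1, 2), not between species 1 and 3 directly. The recorded values are difference counts; they would still need to be converted to times to place the joins on the vertical axis.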
Reconstruction Algorithms






Two other solutions: the "average" and "maximum" methods.
They start out exactly the same by joining the two species that share
the most characters.
To explain these methods, suppose that species 1 and 2 are closest.
Name their ancestral species as species 6.
With the minimum method, we effectively determined that the distance
between species 6 and any other species such as species 3 was the
minimum of the distance from 1 to 3 and the distance from 2 to 3.
With the maximum method, instead take the distance from species 6
to species 3 to be the maximum of these two distances.
And, of course, for the average method, take the average of those two
distances.
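The three methods differ only in how the distance from the new ancestral species to every other species is computed. A minimal sketch of that single step, with a hypothetical function name and made-up distances:

```python
def ancestor_distance(d_first, d_second, method="minimum"):
    """Distance from the ancestor of two joined species to a third
    species, given each descendant's distance to that third species."""
    if method == "minimum":
        return min(d_first, d_second)
    if method == "maximum":
        return max(d_first, d_second)
    if method == "average":
        return (d_first + d_second) / 2
    raise ValueError(f"unknown method: {method}")

# Suppose species 1 is 6 differences from species 3 and species 2 is 8.
results = {m: ancestor_distance(6, 8, m) for m in ("minimum", "maximum", "average")}
print(results)
```

Everything else in the reconstruction (picking the closest pair, joining, repeating) stays the same across the three methods.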