Transcript Slide 1
PART III. MACROEVOLUTION
We already considered History of Life and learned a lot about Macroevolution that occurred
in the past - but only at the shallow level of chronology and generalizations.
3b) Cladogenesis and extinction are extremely unfair processes
Now, after studying Microevolution, we are ready to treat the same subject in more depth, and to try
to understand the hidden mechanisms of Macroevolution.
For some generalizations simple explanations may be enough, but Macroevolution is such a
complex and mysterious process that it must be based on theory, which is so far absent.
GENERALIZATION:
New genes mostly appear from pre-existing genes - of course, this is an easy way.
IN NEED OF A DEEP THEORY:
Changing <1% of the genome is enough to turn an ape into a human - how?
We will consider partial theories of Macroevolution at all levels, starting from sequences.
Macroevolution at different levels:
At the level of sequences (genomes), Macroevolution is
relatively well-understood. In contrast, Macroevolution at the
next three levels - molecules, cells and organisms - is
understood very poorly. However, the two upper levels - of
populations and of ecosystems, are simpler again, and there
are many useful partial theories of their Macroevolution.
Sequences are just genetic texts - they are not doing
anything directly, and are, thus, relatively easy to study. In
contrast, molecules, cells and organisms are levels where
real action occurs. Not surprisingly, studying them is tough.
However, complexity of adaptations can be ignored again
when we consider Macroevolution of populations and
ecosystems.
Organism
Individual
Macroevolution of genomes is tightly connected to the
evolution of populations: the genome of an organism is just
the record of allele replacements in its ancestral lineage. In
contrast, Macroevolution of complex phenotypes appears to
be mostly independent of Microevolution.
ACGATCGACGACGATCGATCGACGATCGA
Topic 15. Lecture 23-24. Macroevolution of Genomes
What do we know already about the evolution of sequences?
Level-specific generalizations:
1. Sequences
a) Mutation strongly affects sequence evolution, and selfish segments are common
b) Functionally important segments and sites of genomes usually evolve slower
c) Complex organisms have larger genomes, mostly due to noncoding sequences
Generalizations concerned with adaptation and complexity:
1. Genetical aspects of adaptive evolution
a) Evolution of both coding and non-coding sequences is important for adaptation
b) The target for strong positive selection is narrow at each moment
c) Tightly related genes can perform rather different functions
3. Origin of novelties
a) New non-coding regulatory sites, but not new genes, often appear from scratch
So, what do we want to know, on top of these generalizations and their simple explanations?
It makes sense to think of two aspects of sequence evolution.
On the one hand, there are properties of sequence evolution that are mostly dictated by
selection that acts at the upper levels of organization. We will not consider them here.
On the other hand, there are properties of sequence evolution that are not dictated by
fitness landscapes in the spaces of molecules, cells, or multicellular organisms.
MATEGDKLLGGRFVGSTDPIMEILSSSISTEQRLTEVDIQASMAYAKALEKASILTKTEL...
MA+EGDKL GGRF GSTDPIME+L+SSI+ +QRL+EVDIQ SMAYAKALEKA ILTKTEL...
MASEGDKLWGGRFSGSTDPIMEMLNSSIACDQRLSEVDIQGSMAYAKALEKAGILTKTEL...
Similarity of the delta-crystallin sequence (top) to the argininosuccinate lyase sequence
(bottom) is a sequence-level, and not a molecular-level, phenomenon.
Still, before we can do this, I wish to briefly address two fundamental concepts of the theory
of sequence evolution that are not directly concerned with any deep understanding of
evolution, but are necessary to reconstruct its past course.
Reconstructing the course of past Macroevolution of genomes: Evolutionary distance.
Evolutionary distance (ED) between two sequences that diverged from the same ancestral
sequence is the number of accepted nucleotide replacements per site. If two sequences can
be aligned without gaps (simply placed one above the other), their alignment will contain the
fraction 1-M of matches and the fraction M of mismatches.
ACACGACACGATGCATACTA
|||||| ||||||||| |||
ACACGATACGATGCATGCTA
If two sequences are very similar to each other, their ED is approximately equal to M. However,
multiple events per site become important if we consider more dissimilar sequences.
Indeed, homoplasy can create a match at a site where multiple substitutions occurred after
divergence. Can we estimate the total number of replacements, including hidden ones, from
the observed dissimilarity M?
We observe the fraction of mismatches M, but we want to know ED, the total number of
replacements that occurred per site. If we know how evolution occurred, we can derive the
function that relates M (observable) to ED (unobservable). Then, we invert this function, and
estimate unobservable from observable.
In the simplest case, known as the one-parameter Jukes-Cantor model, we assume that all 12
possible nucleotide substitutions (A -> T, A -> G, ...) are equally frequent. If the total
substitution rate per site is a, the rate at which matches become mismatches is 2a (any
replacement in either sequence will turn a match into a mismatch), and the rate at which
mismatches become matches is 2a/3 (only one replacement out of 3 possible ones will turn
a mismatch into a match). Thus,
$$\frac{dM}{dt} = 2a(1 - M) - \frac{2a}{3}M = 2a\left(1 - \frac{4}{3}M\right)$$
This equation can be easily integrated:
$$\int_0^{M(t)} \frac{dy}{1 - \frac{4}{3}y} = 2a\int_0^{t} d\tau$$
so that
$$-\frac{3}{4}\ln\left(1 - \frac{4M(t)}{3}\right) = 2at$$
Because ED = 2at, our goal has already been achieved:
$$ED = -\frac{3}{4}\ln\left(1 - \frac{4M(t)}{3}\right)$$
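As a check on the earlier statement that ED is close to M for very similar sequences, the Jukes-Cantor formula can be expanded for small M. This is a standard Taylor expansion of the logarithm, added here as an illustration and not taken from the slides:

```latex
\mathrm{ED} = -\frac{3}{4}\ln\!\left(1 - \frac{4M}{3}\right)
            = M + \frac{2}{3}M^{2} + \frac{16}{27}M^{3} + \dots
```

The leading term is M itself, and the correction for hidden multiple replacements grows roughly as the square of M, so it matters only for more dissimilar sequences.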
We can also recover, from the same equation, M as a function of time:
$$M(t) = \frac{3}{4}\left(1 - e^{-\frac{8}{3}at}\right)$$
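The Jukes-Cantor relations can be tried out numerically. The following is a minimal sketch in Python. The function names are illustrative assumptions, and the two sequences are the ungapped example alignment shown earlier in this section.

```python
import math

def mismatch_fraction(seq1, seq2):
    """Fraction M of mismatching sites in an ungapped alignment."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    mismatches = sum(1 for x, y in zip(seq1, seq2) if x != y)
    return mismatches / len(seq1)

def jukes_cantor_distance(m):
    """ED = -(3/4) ln(1 - 4M/3); defined only for 0 <= M < 3/4."""
    if not 0.0 <= m < 0.75:
        raise ValueError("M must be in [0, 3/4)")
    return -0.75 * math.log(1.0 - 4.0 * m / 3.0)

def expected_mismatch_fraction(a, t):
    """M(t) = (3/4)(1 - exp(-8at/3)) for per-site substitution rate a."""
    return 0.75 * (1.0 - math.exp(-8.0 * a * t / 3.0))

# Example alignment from the text: 2 mismatches out of 20 sites.
m = mismatch_fraction("ACACGACACGATGCATACTA", "ACACGATACGATGCATGCTA")
ed = jukes_cantor_distance(m)
print(m, round(ed, 4))
```

For this alignment M = 0.1 and the estimated ED is only slightly larger, as expected for similar sequences. As M approaches 3/4 the estimate diverges, which is why the sketch rejects such values.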
Reconstructing the course of past Macroevolution of genomes: Sequence alignment.
Common ancestry of individual nucleotides. If divergence of sequences involved insertions
and deletions, nucleotides derived from the same ancestral nucleotide can become shifted.
Thus, establishing common ancestry of individual nucleotides from different species
requires sequence alignment.
Let us consider alignment of just 2 sequences, each of length n. They can be aligned, under
reasonable assumptions, in time that is proportional only to n^2. How could this be done?
One option is to construct a "dot-matrix" that describes matches/mismatches between all
the nucleotides in two sequences (hence n^2). After this, the best path in this matrix can be
found, and this path corresponds to the best alignment.
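[Figure: dot-matrix of matches between the sequence ACGCATTGA (vertical axis, read from bottom to top) and the sequence ACGTCAGTGA (horizontal axis). Matches on the best path are marked X, all other matches are marked x.]
The best path corresponds to this alignment:
ACG-CATTGA
||| || |||
ACGTCAGTGA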
Tricks can be used to find alignments faster, but the basic idea is to consider a dot-matrix.
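The idea of finding the best path through an n-by-n matrix can be sketched with a standard dynamic-programming global alignment (the Needleman-Wunsch scheme). This is named here as one common way to do it, not necessarily the method meant in the lecture. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for illustration.

```python
def global_alignment(s, t, match=1, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming; time and memory ~ len(s)*len(t)."""
    n, m = len(s), len(t)
    # score[i][j] = best score for aligning the prefixes s[:i] and t[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align s[i-1] with t[j-1]
                              score[i - 1][j] + gap,       # gap in t
                              score[i][j - 1] + gap)       # gap in s
    # Trace one best path back from the bottom-right corner.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))

best_score, row1, row2 = global_alignment("ACGCATTGA", "ACGTCAGTGA")
print(best_score)
print(row1)
print(row2)
```

Each cell of the matrix is filled once from its three neighbours, which is where the n^2 running time comes from. When several paths have the same score, the traceback returns just one of them.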
Reconstructing the course of past Macroevolution of genomes: Orthologous segments.
In the field of sequence evolution, homology traditionally means common ancestry. It is
necessary to distinguish two kinds of common ancestry ("homology") of sequences: orthology and paralogy.
Two segments of different genomes are orthologous if they originated from the same
segment of the genome of the last common ancestor.
Two segments of the same genome are paralogous if they originated from the same
segment, by duplication. Two segments of different genomes are paralogous if they
originated from different paralogous segments of the genome of the last common ancestor.
The last common ancestor of two modern species, A and B, had
two paralogs in its genome (red and purple). Red segments of A
and B, which originated from the ancestral red segment, are each other's
orthologs. The same is true for purple segments, of course. Red
segment of A is a paralog to purple segments of A and B. Purple
segment of A is a paralog to red segments of A and B.
Orthology is established using the bidirectional best hit test. If for segment a in genome A
segment b in genome B provides the best hit when a is compared against the whole genome
B, and if a provides the best hit for b, when b is compared against the whole genome A, we
conclude that a and b are orthologs.
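The bidirectional best hit test can be sketched as follows. The similarity scores, segment names, and function names are all hypothetical: in practice the scores would come from comparing each segment against the whole other genome with an alignment tool.

```python
def best_hit(query, scores):
    """Return the target with the highest score for `query`, or None if no hits."""
    hits = scores.get(query, {})
    if not hits:
        return None
    return max(hits, key=hits.get)

def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) such that b is the best hit of a in B and a is the best hit of b in A."""
    pairs = []
    for a in a_vs_b:
        b = best_hit(a, a_vs_b)
        if b is not None and best_hit(b, b_vs_a) == a:
            pairs.append((a, b))
    return pairs

# Hypothetical scores: genome A segments searched against genome B, and vice versa.
a_vs_b = {
    "a_red":    {"b_red": 95, "b_purple": 60},
    "a_purple": {"b_red": 58, "b_purple": 90},
}
b_vs_a = {
    "b_red":    {"a_red": 94, "a_purple": 57},
    "b_purple": {"a_red": 61, "a_purple": 91},
}
orthologs = bidirectional_best_hits(a_vs_b, b_vs_a)
print(orthologs)
```

With these toy scores the red segments pair with each other and the purple segments pair with each other. If one ortholog had been lost in genome B, a paralog could become the reciprocal best hit, which is the failure case for this test.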
If Nature conspires against us, the
bidirectional best hit approach may falsely
conclude that paralogs are orthologs.
Thus, genomic contexts can also be used,
when A and B are not too distant.
A genome may contain two (or
more) orthologs to a segment in
some other genome, due to post-divergence duplication.
Now, we are ready for theory of Macroevolution at the sequence level. There are useful
partial theories, describing a variety of phenomena:
1) genes and other functional genome segments often form families of paralogs.
2) TEs and other junk genome segments often form families of paralogs.
3) non-recombining sex chromosomes and organelle genomes often undergo profound
degeneration.
4) Nucleotide composition (GC-content) often varies greatly along the genome.
5) Genome sizes of even not-too-distant species can differ greatly.
6) at functional nucleotide sites, the strength of selection is often s ~1/Ne.
So, let us try to understand the first 3 of these sequence-level Macroevolutionary phenomena:
1) genes and other functional genome segments often form families of paralogs.
First, let us review the facts. For example, the human genome contains 1434 multigene families
of three or more paralogous genes.
Some paralogs form clusters and are located close to each other, but many other paralogs
are scattered across the genome.
A sample of clusters of human paralogous genes, formed by recent duplications.
A majority of genes within a multigene family have at least one very close paralog.
KS was estimated for each human gene and its most closely related human paralog.
Now, what do we need to understand? Three things:
1) Why are some gene duplications maintained, and not eliminated by negative selection?
2) What happens to the paralogs, after a duplication is fixed? They can either:
i) evolve different functions (neofunctionalization)
or
ii) each retain only a part of the original function (subfunctionalization).
3) What processes affect the overall properties of multigene families?
The "life history" of a successful gene duplication consists of 3 phases: i) its origin by a
unique mutation, ii) its fixation within the population, and iii) divergence of paralogs.
Mutations that involve a duplication of a long sequence occur occasionally. The small fraction of
duplications that become successfully fixed are probably favored by positive selection.
Haploinsufficient genes, such that heterozygotes carrying a loss-of-function allele have low
fitness, have more paralogs than haplosufficient genes. If a gene is haploinsufficient,
duplicating it may be a good idea!
After a duplication becomes fixed, two things can happen. One of the two paralogs can be
lost, reversing the duplication. However, if both paralogs are retained, they will diverge.
There are 2 possibilities: subfunctionalization
or neofunctionalization. Only in the second
case is the outcome of a duplication better
than the initial state.
subfunctionalization
neofunctionalization
How to explain the distribution of sizes of families and the excess of similar paralogs?
One possibility is episodes of expansion
and contraction of a multigene family.
There is little data for this scenario.
However, paralogs often "talk to each
other" through gene conversion, which
can explain the apparent excess of
"recent" duplications.
So, we at least know what questions to ask regarding the evolution of multigene families.
2) TEs and other junk genome segments often form families of paralogs.
First, let us review the facts. We already know them:
1. In many species, families of paralogous transposable
elements (TEs) constitute a large fraction of the genome.
2. Evolutionary distances between paralogs within a family
indicate the time when the family was formed.
3. In some species (Drosophila) individual TEs are rare,
while in others (Mammals) they are mostly fixed.
We need to understand factors that control the dynamics of
the families of TEs.
The ability of TEs to cause their own duplications
(transpositions) is the cause of the formation of TE families.
But what regulates the number of TEs in a family?
Is there an equilibrium number of TEs within a family? Theoretically, both yes and no
answers are possible.
Paralogous TEs may help each other
to propagate. Thus, an insertion rate
grows with the size of a TE family.
1. Equilibrium: insertion rate does not depend on the TE number, elimination rate increases.
2. Equilibrium: both rates increase, but elimination rate increases faster.
3. No equilibrium: both rates increase, but elimination rate increases slower.
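The three scenarios can be illustrated with a toy deterministic model in which the number n of TEs in a family changes at the rate (insertion rate minus elimination rate). This is only a sketch: the functional forms and all numerical constants below are assumptions picked to match each scenario qualitatively, not values from the lecture.

```python
def simulate_te_number(insertion_rate, elimination_rate, n0=1.0, dt=0.01, steps=20000):
    """Euler integration of dn/dt = insertion_rate(n) - elimination_rate(n)."""
    n = n0
    for _ in range(steps):
        n += dt * (insertion_rate(n) - elimination_rate(n))
        n = max(n, 0.0)  # a copy number cannot be negative
    return n

# Scenario 1: insertion rate constant, elimination rate grows with n.
n1 = simulate_te_number(lambda n: 1.0, lambda n: 0.1 * n)

# Scenario 2: both rates grow, elimination faster (quadratic vs linear).
n2 = simulate_te_number(lambda n: 0.5 * n, lambda n: 0.05 * n * n)

# Scenario 3: both rates grow, elimination slower (smaller linear slope).
n3 = simulate_te_number(lambda n: 0.5 * n, lambda n: 0.3 * n)

print(n1, n2, n3)
```

In the first two scenarios the copy number settles where the two rates are equal (n = 10 with these constants). In the third it keeps growing, which corresponds to the unlimited expansion discussed next.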
Unlimited expansion of TEs of a particular kind in the genome must eventually lead to
extinction of the host lineage. If so, why have TEs not killed all life?
Another way to ask this question is: what increases the rate of elimination of TEs when their
number grows? Apparently, the only force which can eliminate TEs is selection against
those host genotypes that carry many of them. Still, there are two options:
1) Selection against genotypes with many
TEs may be stronger, due to epistasis.
2) When TEs accumulate, the probability of
ectopic recombination increases.
Perhaps both these effects are responsible
for preventing unlimited expansion of TEs
and saving life from extinction.
3) non-recombining sex chromosomes and organelle genomes often undergo profound
degeneration.
First, let us review the facts. In many clades, sex chromosomes evolved independently.
If males are heterogametic, females are
XX, and males are XY. If females are
heterogametic, females are ZW, and
males are ZZ.
Often, the chromosome restricted to the heterogametic sex (Y or W) never undergoes
recombination. Such non-recombining sex chromosomes have only a small number of
functional genes, contain a lot of repetitive junk DNA, and encode proteins that carry
multiple mildly deleterious amino acid replacements.
Evolutionary degeneration of a nonrecombining sex chromosome.
Why does it happen? Apparently, four processes contribute to this effect.
Models (a–c) assume that purifying selection against deleterious mutations is less efficient
on the Y, and model (d) assumes the same about positive selection for beneficial mutations.
(a) Accumulation of weakly deleterious mutations by background selection.
(b) Muller's ratchet.
(c) Genetic hitchhiking by favorable mutations.
(d) Lack of adaptation on the non-recombining Y chromosome.
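Of these processes, Muller's ratchet (b) is the easiest to illustrate with a simulation. Below is a minimal sketch of a non-recombining haploid population in which each individual carries some number of deleterious mutations. All parameter names and values (population size, mutation rate u, selection coefficient s) are assumptions for illustration, and back mutations are ignored.

```python
import math
import random

def poisson(lam, rng):
    """Draw a Poisson-distributed integer (Knuth's multiplication method)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def mullers_ratchet(pop_size=200, u=0.1, s=0.02, generations=300, seed=1):
    """Return the minimum mutation load in the population at each generation."""
    rng = random.Random(seed)
    loads = [0] * pop_size          # number of deleterious mutations per individual
    min_load_history = []
    for _ in range(generations):
        # Fitness declines multiplicatively with the number of mutations.
        weights = [(1.0 - s) ** k for k in loads]
        # No recombination: each offspring copies the load of a single parent...
        parents = rng.choices(loads, weights=weights, k=pop_size)
        # ...and adds its own new mutations.
        loads = [k + poisson(u, rng) for k in parents]
        min_load_history.append(min(loads))
    return min_load_history

history = mullers_ratchet()
print(history[0], history[-1])
```

Because offspring inherit all mutations of their single parent, the least-loaded class can be lost by chance but never rebuilt, so the minimum load can only stay the same or increase. With recombination, a less loaded genotype could be reassembled.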
In fact, long-term degeneration of non-recombining Y chromosomes is not the whole story. Y
chromosomes reside only in males, and X chromosomes reside in females 2/3 of the time.
Thus, genes with a net male benefit can accumulate on Y chromosome. In contrast, X
chromosome can accumulate genes with female benefits.
The accumulation of sexually antagonistic alleles on X and Y selects for the suppression of
recombination between the nascent sex chromosomes, creating a male-specific region on
the Y (MSY). The lack of recombination within the MSY causes genes in this region to
degenerate, whereas their homologs on the X might evolve dosage compensation.
Next slide shows a more realistic scenario of the evolution of sex chromosomes. A number
of open questions remain, but the key process of degeneration of a non-recombining sex
chromosome appears to be well-understood.
Concluding remarks on the evolution of genomes:
A genome is a chronicle of past allele replacements, and Macroevolution of genomes can be
to a large extent explained through Microevolution of populations. This is good news.
The most interesting facets of the evolution of genomes are concerned with their
suboptimality - due to mutation-imposed limits on adaptive evolution (responsible for the
origin of multigene families), mutational pressures (responsible for proliferation of TEs), and
inefficient selection (responsible for degeneration of non-recombining chromosomes).
Is accumulation of mildly deleterious junk DNA essential for adaptive evolution? Functional
sequences often evolve from junk DNA. However, it is not clear whether availability of junk
was ever a limiting factor for adaptive evolution. If yes, efficient selection against junk DNA
in unicellular organisms with large populations may prevent evolution of complexity.
Are we complex because our ancestors
somehow accumulated a lot of junk DNA?
OR
Do we carry a lot of junk DNA because
we are complex and, thus, large?
Currently, we do not know the answer.
Quiz
So, we know that complex multicellular organisms have large, "bloated" genomes that
contain a lot of long introns, transposable elements, and other mostly junk DNA. Two
scenarios can be responsible for this correlation:
1) (Complexity as the cause of large genomes). Complex multicellular organisms are
physically large. Thus, their populations are necessarily small - and in small populations
weak selection against new pieces of junk DNA is inefficient. Thus, genomes became
bloated.
2) (Large genomes as the cause of complexity). Initially, the genomes of simple unicellular
ancestors of modern complex organisms became bloated - perhaps, these ancestors had
low population size due to some reason. After this, complexity and multicellularity evolved,
due to recruitment of some initially junk sequences for regulation of gene expression.
What kinds of data and analyses could determine which of the two scenarios corresponds to
reality?