Lecture PPT - Carol Lee Lab

Phylogenetics
Reconstructing the Tree of Life
• In the Speciation lecture, I talked about
a “Phylogenetic Species Concept”
– What is a “Phylogeny?”
– How do you construct one?
– Why on earth should I care?
Why you should care:
• All biological relationships can be determined by
constructing phylogenies: Even if phylogenies are not
always the best way to define species boundaries, they
do tell you the genetic and evolutionary relationships
among groups and individuals
– Your ancestry
– Diseases: figure out the evolutionary origins and evolutionary
pathways of diseases such as HIV, Ebola, and SARS
– Crops and livestock (food security): rescue from
inbreeding, create new varieties
– Endangered species: figure out how endangered
populations are related and how to perform genetic rescue
Tree of Life Web Project
http://www.tolweb.org/tree/
Textbook Version
[Figure: tree of life with three domains descending from the COMMON ANCESTOR OF ALL LIFE]
EUKARYA: Dinoflagellates, Forams, Ciliates, Diatoms, Red algae, Land plants, Green algae, Cellular slime molds, Amoebas, Euglena, Trypanosomes, Leishmania, Animals, Fungi
ARCHAEA: Sulfolobus, Thermophiles, Halophiles, Methanobacterium
BACTERIA: Green nonsulfur bacteria, (Mitochondrion), Spirochetes, Chlamydia, Green sulfur bacteria, Cyanobacteria, (Plastids, including chloroplasts)
Updated Tree of Life 2016
Hug et al. 2016 Nature Microbiology
Bacteria
Eukarya
Archaea
Outline
1. What is a phylogeny?
2. How do you construct a phylogeny?
The Molecular Clock
Statistical Methods
Think about relationships among the major lineages of
life and when they appeared in the fossil record
Are Genetic Distances
and fossil record
roughly congruent?
Fossil Record vs
Molecular Clock
• Molecular clock and fossil record are not always congruent
– Fossil record is incomplete, and soft bodied species are usually
not preserved
– Mutation rates can vary among species (depending on
generation time, replication error, mismatch repair)
• But they provide complementary information
– Fossil record contains extinct species, while molecular data is
based on extant taxa
– Major events in fossil record could be used to calibrate the
molecular clock
Evolutionary History of HIV
HIV evolved multiple
times from SIV (Simian
Immunodeficiency Virus)
Time
Evolutionary Analysis
Freeman & Herron, 2004
Charles Darwin (1809-1882)
On the Origin of Species (1859)
– Living species are related by
common ancestry
– Change through time occurs at
the population not the organism
level
– The main cause of adaptive
evolution is natural selection
Darwin envisaged evolution as a tree
The affinities of all the beings of the same class have
sometimes been represented by a great tree. I believe
this simile largely speaks the truth……
…The green and budding twigs may represent
existing species; and those produced during former
years may represent the long succession of extinct
species…..
….the great Tree of Life….covers the earth with
ever-branching and beautiful ramifications
Charles Darwin, On the Origin of Species; pages 131-132
Reconstructing the Tree of Life
The only figure in The Origin of Species
What did people believe
before Darwin?
Lamarck proposed a ladder of life
Past
Future
Jean-Baptiste Lamarck
• French Naturalist (1744-1829)
• “Professor of Worms and
Insects” in Paris
• The first scientific theory of
evolution (inheritance of acquired
traits)
Lamarck’s View of
Evolution
• Continuum between physical
and biological world (followed
Aristotle)
• Scala Naturae (“Ladder of
Life” or “Great Chain of
Being”)
[Figure: the Scala Naturae, ranked from top to bottom: God, Angels, Demons, Man, Animals, Plants, Minerals; the ladder spans Being (Realm of Being), the Realm of Becoming, and Non-Being]
What is wrong with a ladder?
• Evolution is not linear but
branching
• Living organisms are not
ancestors of one another
• The ladder implies progress
What is right with the tree?
• Evolution is a branching process
• If a mutation occurs, one species
is not turning into another, but
there is a split, and both lineages
continue to evolve
• So, evolution is not progressive: all living taxa are equally
“successful”
• Phylogenies (Trees) reflect the
hierarchical structuring of
relationships
The only figure in The Origin of Species
The Tree of Life is a Fractal
http://tolweb.org/tree/phylogeny.html
Genealogical structures
• Phylogeny
– A depiction of the ancestry relations between
species (it includes speciation events)
– Tree-like (divergent)
• Pedigree
– A depiction of the ancestry relations within
populations
– Net-like (reticulating)
[Figure: four butterflies (offspring) connected to their parents; individuals make up a population, populations traced from past to future make up a lineage/species, and branching lineages make up a phylogeny]
What happened here? Lineage-branching (speciation)
What happened here? Extinction
Representation of phylogenies?
[Figure: two drawings of the same tree for taxa A, B, and C: "The True History" and "A simplified representation"]
Some terms used to describe a phylogenetic tree
Taxon (taxa)
Tip
Internal branch
Internode
Node (Speciation event)
Root
Outline
1. What is a phylogeny?
2. How do you construct a phylogeny?
The Molecular Clock
Statistical Methods
What is a Phylogeny?
• A phylogenetic tree represents a hypothesis
about evolutionary relationships
• Each branch point represents the divergence
of two taxa (e.g. species)
• Sister taxa are groups that share an
immediate common ancestor
Molecular Clock
• Phylogenies rely on the
“Molecular Clock,” namely the
observation that mutations, on average,
accumulate at a roughly constant rate
• So, on average, more mutational
differences between taxa means
that they branched from a
common ancestor longer ago
Example:
Mitochondria: ~2.2% sequence
divergence per million years
• So longer branches on a
phylogeny often mean greater
evolutionary distance
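The clock arithmetic can be sketched in a few lines. This is a minimal illustration, assuming the ~2.2% per million years figure describes how fast two mitochondrial lineages diverge from each other (pairwise divergence); the function name and the 11% example distance are hypothetical.

```python
def divergence_time_myr(pairwise_divergence, rate_per_myr=0.022):
    """Estimate the time since two taxa split, in millions of years.

    pairwise_divergence: proportion of sites that differ between the two taxa.
    rate_per_myr: assumed pairwise divergence rate (2.2% per million years).
    """
    if rate_per_myr <= 0:
        raise ValueError("rate must be positive")
    return pairwise_divergence / rate_per_myr


# Hypothetical example: mitochondrial sequences that differ at 11% of sites.
print(round(divergence_time_myr(0.11), 2))  # prints 5.0
```

The estimate is only as good as the assumed rate, which is why the next slide's caveat about rate variation among species matters.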
Molecular Clock
Problem: mutation rate
can vary among
species
• Mutation rate is faster with:
– Shorter generation time
(greater number of meiosis or
mitosis events in a given time)
– Replication error (e.g. sloppy
DNA or RNA polymerase; poor
mismatch repair mechanisms)
Phylogenetic Trees with
Proportional Branch Lengths
• In some trees, the length of a branch can
reflect the number of genetic changes that
have taken place in a particular DNA sequence
in that lineage
• So longer branches = greater evolutionary
distance
Neutral data are better for capturing genetic
distances (the molecular clock) than genes that
might be under selection
• Why?
Phylogenetic Informative Characters
(mutations)
• Neutral mutations:
– Mutations that are not subjected to selection
– Better for constructing phylogenies because
selection could make unrelated taxa appear
more similar or related taxa more different
– Examples: Noncoding regions of DNA, 3rd
codon position in protein-coding genes, introns,
microsatellites (“junk DNA”)
Codon Bias
• In the case of amino acids:
• Mutations in codon positions 1 and 2 usually change the amino acid
• Mutations in codon position 3 often do not change the amino acid
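The position effect can be checked directly against the standard genetic code. The sketch below is an illustration only: it counts, for each codon position, the share of single-base changes that leave the amino acid unchanged. The function name is hypothetical, and changes that create a stop codon are counted as amino acid changes.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG ("*" = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_fraction(position):
    """Fraction of single-base changes at a codon position (0, 1, or 2)
    that leave the encoded amino acid unchanged, over all 61 sense codons."""
    synonymous = total = 0
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue  # stop codons are not used as starting points
        for base in BASES:
            if base == codon[position]:
                continue
            mutant = codon[:position] + base + codon[position + 1:]
            total += 1
            if CODON_TABLE[mutant] == aa:
                synonymous += 1
    return synonymous / total


for pos in range(3):
    print(f"codon position {pos + 1}: {synonymous_fraction(pos):.2f} synonymous")
```

Most changes at position 3 are silent, a few at position 1 are, and none at position 2 are, so third positions are closer to neutral than the other two without being perfectly neutral.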
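[Figure: hierarchical classification by order, family, genus, and species]
Order Carnivora
– Family Mustelidae: genus Taxidea (Taxidea taxus); genus Lutra (Lutra lutra)
– Family Felidae: genus Panthera (Panthera pardus)
– Family Canidae: genus Canis (Canis latrans, Canis lupus)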
[Figure: tree of taxa A–F labeled with the ancestral lineage, a branch point (node), sister taxa, the common ancestor of taxa A–F, and a polytomy (unresolved branching point)]
A monophyletic clade consists of an
ancestral taxon and all of its descendants
[Figure: the same tree of taxa A–G drawn three times, each with a different group highlighted]
(a) Group I: Monophyletic group (clade)
(b) Group II: Paraphyletic group
(c) Group III: Polyphyletic group
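The definition of a monophyletic group can be turned into a small check. This is a sketch under stated assumptions: the tree is written as nested tuples, and the topology below is hypothetical, since the branching order on the slide is not recoverable here.

```python
def leaves(tree):
    """Return the set of tip labels under a node; a tip is a plain string."""
    if isinstance(tree, str):
        return {tree}
    result = set()
    for child in tree:
        result |= leaves(child)
    return result


def is_monophyletic(tree, group):
    """True if some node's descendants are exactly the taxa in `group`."""
    group = set(group)
    tips = leaves(tree)
    if tips == group:
        return True
    if isinstance(tree, str) or not group <= tips:
        return False
    return any(is_monophyletic(child, group) for child in tree)


# Hypothetical rooted tree, not the topology drawn on the slide.
TREE = ((("A", "B"), "C"), (("D", "E"), ("F", "G")))

print(is_monophyletic(TREE, {"A", "B", "C"}))  # an ancestor plus all its descendants
print(is_monophyletic(TREE, {"A", "B", "D"}))  # taxa drawn from separate lineages
```

A group that fails this check is either paraphyletic (it leaves out some descendants of its common ancestor) or polyphyletic (its members come from separate ancestors).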
Examples of Paraphyletic Groups
(not recognized as legitimate groups in the
Phylogenetic Species Concept, which only
recognizes monophyletic groups)
[Figure: tree of taxa A–G with Group I highlighted]
(a) Group I: Monophyletic group (clade)
(in the lecture on species concepts we discussed
that the “smallest” monophyletic group is a
“phylogenetic species”)
Synapomorphies
• Synapomorphies are shared derived homologous
traits
• They can be DNA nucleotides or other heritable
traits
• They are used to group taxa that are more closely
related to one another
synapomorphies
Sometimes similar looking traits are not
homologous, and are not synapomorphies,
but are the result of convergent evolution
How do we construct Phylogenies?
Phylogenetic Methods
• Parsimony: Minimize # steps
• Distance Matrix: minimize pairwise genetic
distances
• Maximum Likelihood: Probability of the data
given the tree
• Bayesian: Probability of the tree given the data
Parsimony
Uses Discrete
Characters (like mutations,
or some heritable trait)
Select the tree with
the minimum number
of character-state
transitions summed
across all characters
Parsimony: Example 1 (Fig. 26-15)
Three species (I, II, III) and three phylogenetic hypotheses, one for each possible pair of sister species
Site          1  2  3  4
Species I     C  T  A  T
Species II    C  T  T  C
Species III   A  G  A  C
Ancestral
sequence      A  G  T  T
[Figure: each base change (for example, 1/C = site 1 changes to C) mapped onto the branches of the three trees]
Results: one tree requires 6 events and the other two require 7 events each, so the 6-event tree is the most parsimonious
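The event counts in Example 1 can be reproduced with Fitch's small-parsimony algorithm. This is a sketch, not necessarily how the figure was prepared: the ancestral sequence is attached as an outgroup tip, trees are nested tuples, and the hypothesis names are made up for illustration.

```python
SEQS = {
    "I": "CTAT",
    "II": "CTTC",
    "III": "AGAC",
    "Anc": "AGTT",  # ancestral sequence, attached here as an outgroup tip
}


def fitch(tree, site):
    """Return (possible states, change count) for one site on a rooted binary tree."""
    if isinstance(tree, str):
        return {SEQS[tree][site]}, 0
    left_states, left_cost = fitch(tree[0], site)
    right_states, right_cost = fitch(tree[1], site)
    shared = left_states & right_states
    if shared:
        return shared, left_cost + right_cost
    # No state in common: one change is needed on a branch below this node.
    return left_states | right_states, left_cost + right_cost + 1


def tree_length(tree):
    """Total number of changes summed over all sites."""
    n_sites = len(SEQS["Anc"])
    return sum(fitch(tree, site)[1] for site in range(n_sites))


HYPOTHESES = {
    "I and II sisters": ("Anc", (("I", "II"), "III")),
    "I and III sisters": ("Anc", (("I", "III"), "II")),
    "II and III sisters": ("Anc", (("II", "III"), "I")),
}

for name, tree in HYPOTHESES.items():
    print(name, tree_length(tree))
```

The tree that groups Species I and II needs 6 changes and the other two need 7 each, which matches the event counts in the figure.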
Parsimony: Example 2
Three possible trees (Tree 1, Tree 2, Tree 3) for taxa O, A, B, and C
[Table: states of characters 1-5 for taxa O, A, B, and C; the cell layout is not recoverable from the slide text]
Map the characters (mutations) onto tree 1
Total number of steps = 6
Actually, there is more than one way to
map character 3
Either way the character contributes 2 steps
to the overall tree length
Map the characters onto tree 2
Number of steps = 5
Map the characters onto tree 3
Length = 6 steps
Which tree has the shortest length
(most parsimonious)?
Tree 1: length = 6
Tree 2: length = 5 (most parsimonious tree)
Tree 3: length = 6
Where do the Whales belong?
Example from
Freeman &
Herron, Fig. 4.8
Freeman & Herron, Fig. 4.9: Using maximum parsimony,
it looks like the whales cluster with the hippos (and cows)
Parsimony
• Simplest and fastest method of phylogenetic
reconstruction
• Can give misleading results if rates of evolution
(rates at which mutations occur) differ among
lineages
• Tends to become less accurate as genetic
distances get greater
• Can be misled by reversals and homoplasy:
with only 4 nucleotides, after a while the same
mutations occur repeatedly at a given site (called
“saturation,” or “multiple hits (mutations) per site”)
Distance Matrix
Continuous or
Discrete Characters
Distance Matrix
• Calculate pairwise distances between taxa
• Choose the tree that minimizes overall
distances between taxa
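The two steps above can be sketched as follows. The slide does not name a specific algorithm, so this example swaps in UPGMA, a simple average-linkage clustering method that assumes roughly clock-like rates. The sequences are invented toy data and the function names are hypothetical.

```python
from itertools import combinations

# Hypothetical aligned sequences, invented for illustration only.
ALIGNMENT = {
    "mouse": "CCCCCAAAAA",
    "cat": "AAAAAAAAAA",
    "dog": "AAAAAAAAAT",
    "seal": "AAAAAAATTT",
}


def p_distance(a, b):
    """Proportion of aligned sites at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(alignment):
    """Cluster taxa by average pairwise distance; returns a nested-tuple tree."""
    dist = {
        frozenset((a, b)): p_distance(alignment[a], alignment[b])
        for a, b in combinations(alignment, 2)
    }
    # Each key is a subtree (a name or nested tuple); each value lists its tips.
    clusters = {name: [name] for name in alignment}

    def cluster_distance(x, y):
        pairs = [(a, b) for a in clusters[x] for b in clusters[y]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Join the two clusters with the smallest average distance.
        x, y = min(combinations(clusters, 2), key=lambda pair: cluster_distance(*pair))
        clusters[(x, y)] = clusters.pop(x) + clusters.pop(y)
    return next(iter(clusters))


print(upgma(ALIGNMENT))
```

With these toy sequences, cat and dog are joined first, then seal, then mouse. Other distance methods such as neighbor joining relax the clock assumption.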
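Proportional sequence distance at 2 genes
(hypothetical data)
[Table: pairwise distance matrix for mouse, cat, dog, dolphin, and seal. The cell positions are not recoverable; values in extraction order: 1, 0.05, 0.03, 0.08, 0.09; 1, 0.03, 0.01; 1, 0.02; 1, 0.02, 0.15, 0.23; 1]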
Freeman & Herron, Fig. 4.10: Using genetic distances, it looks
like the whales again cluster with the hippos (and cows)
Distance Matrix
• Generally more accurate than parsimony
• Like parsimony, it tends to be computationally
fast
Maximum Likelihood (R.A. Fisher)
• Probability of the data given the tree
• This is a “Frequentist” method: one true answer
(one true tree)
• Draw from the data (probability distribution of DNA
sequence data) to find the true tree
• Choose the tree (x, y axis) that maximizes the
probability of the observed data (z axis)
Z: Probability of the data
Felsenstein, J. 1981. Evolutionary trees from DNA
sequences: a maximum likelihood approach. Journal
of Molecular Evolution. 17(6):368-76.
x,y: Tree space
Maximum Likelihood (R.A. Fisher)
• Probability of the data given the tree
• The aim of maximum likelihood estimation is to find
the parameter value(s) that makes the observed data
most likely.
• For example: finding a mean. If you want a single
number that describes data such as human heights,
the sample mean is the maximum likelihood estimate
of the mean of a normal distribution
P(data | tree) = likelihood(tree | data)
Tree = hypothesis
Z: Probability of the data
Felsenstein, J. 1981. Evolutionary trees from DNA
sequences: a maximum likelihood approach. Journal
of Molecular Evolution. 17(6):368-76.
x,y: Tree space
Maximum Likelihood
(R.A. Fisher)
• Often yields more accurate tree than
parsimony or distance
• Relies on an accurate model of molecular
evolution, i.e. accurate assumptions about
which mutations are more probable (for example, does A->G
occur more often than A->T or A->C?)
• Computationally intensive
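A full likelihood search over trees is too long to show here, but the core idea fits in a two-sequence example. The sketch below assumes the Jukes-Cantor model (all substitutions equally probable), which is an illustrative choice and not a model named in the lecture. It finds the evolutionary distance that makes the observed number of differences most probable. The counts are hypothetical.

```python
import math


def jc69_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood (up to a constant) of distance d, in substitutions
    per site, under the Jukes-Cantor model."""
    p_diff = 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))  # P(a site differs | d)
    return n_diff * math.log(p_diff) + (n_sites - n_diff) * math.log(1.0 - p_diff)


def ml_distance(n_sites, n_diff, step=0.0001, max_d=2.0):
    """Grid search for the distance that maximizes the likelihood."""
    grid = [step * i for i in range(1, int(max_d / step) + 1)]
    return max(grid, key=lambda d: jc69_log_likelihood(d, n_sites, n_diff))


# Hypothetical data: 10 differences observed across 100 aligned sites.
n_sites, n_diff = 100, 10
closed_form = -0.75 * math.log(1.0 - 4.0 * (n_diff / n_sites) / 3.0)
print(round(ml_distance(n_sites, n_diff), 3), round(closed_form, 3))
```

The grid search and the known closed-form Jukes-Cantor distance agree to within the grid resolution. The estimate (about 0.107) is larger than the raw 10% difference because the model allows for multiple hits at the same site.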
Bayesian Inference
Reverend Thomas Bayes (c. 1701-1761)
• Probability of a tree given the data
• Uses prior information on the tree
• Does not assume that there is one correct tree
• Will modify estimate based on additional information
• Uses Bayes’ Theorem
P(A|B) = P(B|A) P(A) / P(B)
Bayesian Inference
Reverend Thomas Bayes (c. 1701-1761)
• Probability of a tree given the data:
• Will modify estimate based on additional information: so
as you get more data, you update your hypothesis for the
tree
• Uses prior information on the tree: this is where you start
• The sequential use of the Bayes' formula (recursive):
when more data become available after calculating a
posterior distribution, the posterior becomes the next prior
• Does not assume that there is one correct tree
Bayesian Inference
Reverend Thomas Bayes (c. 1701-1761)
• Uses Bayes’ Theorem
P(A|B) = P(B|A) P(A) / P(B)
P(tree|data) = P(data|tree) P(tree) / P(data)
P(A) = prior probability, the probability of a tree
P(A|B) = posterior probability, the probability of the tree given the data
P(B|A) = the probability of observing B (the data) given A (the tree), also
known as the likelihood. It indicates the compatibility of the
evidence with the given hypothesis.
P(B) = probability of the data
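Bayes' theorem and the sequential updating described earlier can be illustrated over a small set of candidate trees. All the numbers below are hypothetical. The sketch also assumes the two data sets are independent given the tree, so the posterior from the first can serve as the prior for the second.

```python
def posterior(prior, likelihood):
    """Apply Bayes' theorem over a discrete set of candidate trees.

    prior: dict tree -> P(tree); likelihood: dict tree -> P(data | tree).
    """
    unnormalized = {tree: likelihood[tree] * prior[tree] for tree in prior}
    p_data = sum(unnormalized.values())  # P(data), summed over the candidate trees
    return {tree: value / p_data for tree, value in unnormalized.items()}


# Hypothetical numbers for three candidate trees; none come from the lecture.
prior = {"tree1": 1 / 3, "tree2": 1 / 3, "tree3": 1 / 3}
first_gene = {"tree1": 0.010, "tree2": 0.020, "tree3": 0.010}
second_gene = {"tree1": 0.004, "tree2": 0.012, "tree3": 0.004}

after_first = posterior(prior, first_gene)
# Sequential updating: the posterior from the first data set is the next prior.
after_second = posterior(after_first, second_gene)

for tree in prior:
    print(tree, round(after_first[tree], 3), round(after_second[tree], 3))
```

Support for tree2 rises from 1/3 to 0.5 and then to 0.75 as data accumulate. Real analyses cannot enumerate all trees, so they approximate the posterior by sampling (for example with Markov chain Monte Carlo).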
Bayesian Inference
• Like Likelihood, often yields more accurate
tree than parsimony or distance
• Computationally more intensive than
parsimony or distance matrix, but less
intensive than likelihood
• Needs a prior probability for the tree and a
model of evolution
Potential problems of
Phylogenetic Reconstruction
• Sufficient Amount of Data:
– With enough data most statistical methods
usually yield the same tree
– Insufficient data would yield a tree that lacks
resolution (lacks statistical power)
• Gene trees vs species trees
– Evolutionary histories of individual genes are
not necessarily the same as one another
– Should try to get data from many genes, or
the whole genome
Challenges of Phylogenetic Reconstructions
• Different parts of the genome might have
different evolutionary histories (different
gene genealogies, horizontal gene
transfers, allopolyploidy, etc)
• So, there might not be one true tree for a
group of taxa, and relationships might be
difficult to resolve because they are
inherently complex
• Current trend is to use whole genome
data to reconstruct phylogenies
• Gain a comprehensive picture of the
evolutionary relationships among taxa
for the whole genome
Phylogenetic Reconstructions
• Typically, evolutionary biologists will use a variety of
methods to reconstruct a phylogeny.
• Maximum likelihood and Bayesian methods are considered
more robust.
• Tree is only as good as the data. Having many
homoplastic characters (due to convergent evolution,
reversals, etc.) will make the reconstruction less robust
• It is standard to use bootstrapping to assess how well the
data support the tree
• Understanding statistics is fundamental to understanding
evolution
• Much of statistics was in fact developed in order to model
evolutionary processes (such as ANOVA, analysis of
variance)
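Bootstrapping resamples alignment columns with replacement and asks how often a grouping is recovered. The sketch below is deliberately simplified: with three taxa, "the tree" reduces to which pair of sequences is closest. A real analysis would rebuild the full tree for each replicate. The alignment and function names are hypothetical.

```python
import random


def p_distance(a, b):
    """Proportion of aligned sites at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def closest_pair(alignment):
    """Return the pair of taxa (as a frozenset) with the smallest p-distance.
    Ties go to the first pair in alphabetical order."""
    names = sorted(alignment)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    best = min(pairs, key=lambda p: p_distance(alignment[p[0]], alignment[p[1]]))
    return frozenset(best)


def bootstrap_support(alignment, pair, replicates=1000, seed=0):
    """Fraction of resampled alignments in which `pair` is the closest pair."""
    rng = random.Random(seed)
    n_sites = len(next(iter(alignment.values())))
    hits = 0
    for _ in range(replicates):
        # Draw n_sites column indices with replacement.
        columns = [rng.randrange(n_sites) for _ in range(n_sites)]
        resampled = {
            name: "".join(seq[i] for i in columns) for name, seq in alignment.items()
        }
        if closest_pair(resampled) == frozenset(pair):
            hits += 1
    return hits / replicates


# Hypothetical three-taxon alignment, invented for illustration.
ALIGNMENT = {
    "A": "ACGTACGTACGTACGTACGT",
    "B": "ACGTACGTACGTACGTACGA",
    "C": "TCGAACGATCGTTCGTACTT",
}
print(bootstrap_support(ALIGNMENT, ("A", "B")))
```

A support value near 1 means the grouping is recovered in almost every resampled data set. Low values flag branches that depend on only a few sites.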
1. Sometimes the Molecular Clock (based on genetic data)
conflicts with the Geological Record. Why would this
happen?
(A) Sometimes there are gaps in the geological record, because fossils
do not form everywhere, and mutation rate might vary between
different species
(B) Radiometric dating relies on chance events in the preservation of
isotopes, making the timing events in the geological time scale less
accurate than the molecular clock
(C) Mutation rates slow down as you go back in time, making
estimation of timing of events less accurate as you go back in time
(D) The molecular clock is calculated from radioisotopes, while the
geological record is obtained from fossil data. The two can conflict
when fossils end up displaced from their original sedimentary layer
2. You are a medical researcher working on HIV. A novel strain has appeared in Madison, Wisconsin. To determine which drugs
would be most effective in treating this new strain (because different strains are resistant to different drugs), you need to determine
its recent evolutionary history. You decide to reconstruct the evolutionary history of HIV by using a phylogenetic approach. Thus,
you collect samples from patients in various geographic locations and sequence a fragment of RNA. Using parsimony, which is the
correct phylogeny for HIV-1 based on the data below?
HIV-1, Uganda, Africa: ACAUG
HIV-1, San Francisco, USA: UGAUG
HIV-1, Madison, USA: UAAGG
HIV-1, New York, USA: UAAAG
HIV-1, Paris: ACAUC
HIV-2 Africa (ancestral outgroup): ACCUG
3. Which of the following is most TRUE
regarding phylogenetic reconstructions?
(A) Phylogenetic reconstruction based on any gene
would yield the same tree
(B) Parsimony is the most accurate method for
reconstructing phylogenies
(C) Some DNA sequence data is better for phylogenetic
reconstruction than others, such as those that tend to
be less subjected to selection (3rd codon, introns)
(D) Maximum likelihood relies on maximizing distances
among taxa
4. Which of the following types of data would
be best for constructing a
phylogeny?
(a) Non-coding and regulatory sequences
(b) Non-coding and non-functional
sequences
(c) Paralogous genes
(d) Genes that have undergone purifying
selection
(e) Intron sequences within rapidly evolving
genes
5. Which of the following reasons is FALSE on why the
type of data chosen in the question above would be
optimal for constructing a phylogeny?
(a) Because selection might make taxa seem more
closely related due to convergent evolution
(b) Because selection might make taxa seem more
distantly related due to disruptive evolution
(c) Because selection might make taxa seem more
closely related due to purifying selection
(d) Because non-coding regulatory sequences are likely
to be neutral
(e) Because coding sequences are likely to be under
selection
Answers
• 1A
• 2C
• 3C
• 4B
• 5D