Molecular Phylogenetic

Bioinformatics
Lecture 3
Molecular Phylogenetic
By: Dr. Mehdi Mansouri
Mehr 1395
Phylogenetics Basics
• Biological sequence analysis is founded on solid evolutionary
principles.
• Similarities and divergence among related biological sequences
revealed by sequence alignment often have to be rationalized and
visualized in the context of phylogenetic trees
What is evolution?
• Evolution can be defined as the development of a biological form from
other preexisting forms, or from its origin to the currently existing form,
through natural selection and modification.
• Phylogenetics is the study of the evolutionary history of living
organisms using treelike diagrams to represent pedigrees of these
organisms.
• The tree branching patterns representing the evolutionary divergence
are referred to as phylogeny.
Studying phylogenetics
• Fossil records – contain morphological information about ancestors of
current species and the timeline of divergence. The fossil record is
nonexistent for microorganisms.
• Molecular data (molecular fossils) – more numerous than fossils and
easier to obtain; favored for reconstruction of the evolutionary history.
DNA sequence evolution
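[Figure: tree of DNA sequences descending from a common ancestor, with a time axis running from -3 mil yrs through -2 mil yrs and -1 mil yrs to today. Sequences shown: AAGACTT, AAGGCCT, AGGGCAT, TGGACTT, TAGCCCT, TAGCCCA, TAGACTT, AGCACTT, AGCACAA, AGCGCTT.]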
Major Assumptions
• Molecular sequences used in phylogenetic construction are
homologous
• Phylogenetic divergence is assumed to be bifurcating
• Each position in a sequence evolved independently
Tree terminology
Terminal node = Operational taxonomic unit (OTU)
Internal node = Hypothetical taxonomic unit (HTU)
Peripheral (or terminal) branch = relationship between OTU and HTU
Internal branch = relationship between two HTUs
Clades
• A clade is a group of all the taxa that have been
derived from a common ancestor plus the
common ancestor itself.
Cladograms & Phylograms
[Figure: trees of Bacterium 1-3 and Eukaryote 1-4, drawn once as a cladogram and once as a phylogram.]
• Cladograms show branching order; branch lengths are meaningless.
• Phylograms show branching order and branch lengths.
• dichotomy – all branches bifurcate
• polytomy – result of a taxon giving rise to more than two descendants
or unresolved phylogeny
• unrooted – no knowledge of a common ancestor,
shows relative relationship of taxa, no direction of an
evolutionary path
• rooted – has a root node representing the common ancestor of all taxa, so
it shows the direction of evolution; more informative
Rooting the tree
• outgroup – taxa that are known to fall outside of the group of interest. Requires some prior
knowledge about the relationships among the taxa. The outgroup can be either a species (e.g., birds
to root a mammalian tree) or previous gene duplicates (e.g., α-globins to root β-globins).
Based on lectures by Tal Pupko
Rooting the tree
• Midpoint rooting approach - roots the tree at the midway point
between the two most distant taxa in the tree, as determined by branch
lengths. Assumes that the taxa are evolving in a clock-like manner.
[Figure: tree with taxa A, B, C and D; branch lengths 10, 3, 2, 2 and 5.]
d(A,D) = 10 + 3 + 5 = 18
Midpoint = 18 / 2 = 9
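The midpoint calculation can be sketched in a few lines of Python. This is only an illustration: the function name `locate_midpoint` is hypothetical, and it assumes the path from A to D crosses three branches of lengths 10, 3 and 5 in that order, as the sum d(A,D) = 10 + 3 + 5 suggests.

```python
def locate_midpoint(path_lengths):
    """Find where the midpoint falls along a path between two taxa.

    path_lengths: branch lengths met when walking from one of the two
    most distant taxa to the other (assumed input format).
    Returns (branch_index, distance_from_start_taxon).
    """
    half = sum(path_lengths) / 2
    walked = 0.0
    for index, length in enumerate(path_lengths):
        # The root goes on the first branch that reaches the halfway point.
        if walked + length >= half:
            return index, half
        walked += length
    raise ValueError("path must contain at least one branch")


# Path A -> D from the slide: 10 + 3 + 5 = 18, midpoint at 9.
branch, from_a = locate_midpoint([10, 3, 5])
print(branch, from_a)
```

Under these assumptions the root lands on the first branch of the path (the one of length 10 leading to A), 9 units away from A.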
Molecular clock
• This concept was proposed by Emil Zuckerkandl and Linus Pauling (1962)
as well as Emanuel Margoliash (1963).
• This hypothesis states that for every given gene (or protein), the rate of
molecular evolution is approximately constant.
• Pioneering study by Zuckerkandl and Pauling
• They observed the number of amino acid differences between human globins – β and
δ (~ 6 differences), β and γ (~ 36 differences), α and β (~ 78 differences), and α and
γ (~ 83 differences).
• They could also compare human to gorilla (both β and α globins), observing 2 and 1
differences, respectively.
• They knew from fossil evidence that humans and gorillas diverged from a common
ancestor about 11 MYA.
• Using this divergence time as a calibration point, they estimated that the gene
duplication giving rise to β and δ occurred 44 MYA; β and γ derived
from a common ancestor 260 MYA; α and β 565 MYA; and α and γ 600 MYA.
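The calibration logic can be illustrated with a strictly linear clock. The sketch below is an assumption-laden reading of the slide, not the original authors' procedure: it averages the two human-gorilla counts (2 and 1) into 1.5 differences per 11 million years and scales the other counts proportionally. The names `CALIBRATION_DIFFS` and `divergence_time` are hypothetical.

```python
CALIBRATION_MYA = 11.0            # human-gorilla divergence quoted on the slide
CALIBRATION_DIFFS = (2 + 1) / 2   # assumption: average of the beta and alpha counts


def divergence_time(differences):
    """Scale an amino acid difference count to time, assuming a constant rate."""
    return differences * CALIBRATION_MYA / CALIBRATION_DIFFS


globin_pairs = {"beta/delta": 6, "beta/gamma": 36, "alpha/beta": 78, "alpha/gamma": 83}
for pair, diffs in globin_pairs.items():
    print(f"{pair}: about {divergence_time(diffs):.0f} MYA")
```

With these assumptions the β/δ estimate comes out at 44 MYA and the others land close to, but not exactly on, the quoted 260, 565 and 600 MYA.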
3 OTUs: 1 unrooted tree, 3 rooted trees
4 OTUs: 3 unrooted trees, 15 rooted trees
Finding a true tree is difficult
• Correct reconstruction of the evolutionary history = find a correct tree
topology with correct branch lengths.
• Number of potential tree topologies can be enormously large even with a
moderate number of taxa.
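$N_R = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$
$N_U = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$
where n is the number of taxa, N_R the number of rooted trees and N_U the number of unrooted trees.
6 taxa … N_R = 945, N_U = 105
10 taxa … N_R = 34 459 425, N_U = 2 027 025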
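A short Python sketch of the two counting formulas, useful for checking the figures quoted for 6 and 10 taxa. The function names are illustrative only.

```python
from math import factorial


def rooted_tree_count(n):
    """Number of rooted bifurcating trees for n taxa: (2n-3)! / (2^(n-2) (n-2)!)."""
    if n < 2:
        raise ValueError("need at least 2 taxa")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def unrooted_tree_count(n):
    """Number of unrooted bifurcating trees for n taxa: (2n-5)! / (2^(n-3) (n-3)!)."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


for taxa in (3, 4, 6, 10):
    print(taxa, rooted_tree_count(taxa), unrooted_tree_count(taxa))
```

The counts for 3 and 4 taxa match the small cases listed earlier (3 and 1, then 15 and 3).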
The Newick format
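Newick notation writes a tree as nested parentheses: sibling nodes are separated by commas, an optional branch length follows a colon, and the string ends with a semicolon. For example, `((A:0.1,B:0.2):0.3,C:0.4);` groups A with B and attaches C to their common ancestor. The parser below is a minimal sketch for this basic form only (no quoted labels or comments); the names `parse_newick` and `leaf_names` are hypothetical.

```python
def parse_newick(text):
    """Parse a basic Newick string into nested dicts (name, length, children)."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("a Newick string must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # step over '('
            while True:
                children.append(parse_node())
                if text[pos] == ",":
                    pos += 1
                elif text[pos] == ")":
                    pos += 1
                    break
                else:
                    raise ValueError(f"unexpected {text[pos]!r} at position {pos}")
        # Optional node label.
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]
        # Optional branch length after ':'.
        length = None
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    root = parse_node()
    if text[pos] != ";":
        raise ValueError("trailing characters before ';'")
    return root


def leaf_names(node):
    """Collect terminal node (OTU) names from left to right."""
    if not node["children"]:
        return [node["name"]]
    return [name for child in node["children"] for name in leaf_names(child)]


tree = parse_newick("((A:0.1,B:0.2):0.3,C:0.4);")
print(leaf_names(tree))
```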
Gene phylogeny vs. species phylogeny
• Main objective of building phylogenetic trees based on molecular sequences:
reconstruct the evolutionary history of the species involved.
• A gene phylogeny only describes the evolution of that particular gene or encoded
protein. This sequence may evolve more or less rapidly than other genes in the
genome.
• The evolution of a particular sequence does not necessarily correlate with the
evolutionary path of the species.
• Branching point in a species tree – the speciation event
• Branching point in a gene tree – which event?
• The two events may or may not coincide.
• To obtain a species phylogeny, phylogenetic trees from a variety of gene families
need to be constructed to give an overall assessment of the species evolution.
A gene tree may differ from a species tree
[Figure: a gene tree of alleles drawn inside the species tree for species 1 and 2.]
• S = divergence time for species 1 and 2
• G1 = inferred divergence time by using alleles a and f
• Alleles d and b are closer to each other than alleles d and f.
• Incomplete lineage sorting due to polymorphism at speciation time
Closest living relatives of humans?
[Figure: two alternative trees for humans, chimpanzees, bonobos, gorillas and orangutans, each with a time axis in MYA.]
• The pre-molecular view was that the great apes (chimpanzees, gorillas and
orangutans) formed a clade separate from humans, and that humans diverged
from the apes at least 15-30 MYA.
• Mitochondrial DNA, most nuclear DNA-encoded genes, and DNA/DNA
hybridization all show that bonobos and chimpanzees are related more
closely to humans than either is to gorillas.
[Figure: tree of orangutan, gorilla, chimpanzee and human. From the Tree of Life website, University of Arizona.]
Procedure
1. Choice of molecular markers
2. Multiple sequence alignment
3. Choice of a model of evolution
4. Determine a tree building method
5. Assess tree reliability
Choice of molecular markers
• Nucleotide or protein sequence data?
• NA sequences evolve more rapidly.
• They can be used for studying very closely related organisms.
• E.g., for evolutionary analysis of different individuals within a population,
noncoding regions of mtDNA are often used.
• Evolution of more divergent organisms – either slowly evolving NA (e.g.,
rRNA) or protein sequences.
• Deepest level (e.g., relationships between bacteria and eukaryotes) –
conserved protein sequences
• NA sequences: good if sequences are closely related, reveal
synonymous/nonsynonymous substitutions
Positive and negative selection
• synonymous substitution – nucleotide changes in a sequence that do not
result in amino acid sequence changes
• nonsynonymous substitution – nucleotide changes that do alter the amino
acid sequence
• positive selection – nonsynonymous substitution rate ≫ synonymous
• certain parts of the protein are undergoing active mutations that may contribute
to the evolution of new function
• negative selection – synonymous > nonsynonymous
• most changes are neutral at the AA level; the protein sequence is critical enough
that changes to it are not tolerated
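The distinction between the two substitution types can be made concrete with a naive codon-by-codon comparison. The sketch below assumes two aligned, gap-free coding sequences in the same reading frame and the standard genetic code; it only counts differing codons and does not normalize by the number of synonymous and nonsynonymous sites, so it is not a full rate estimate. The name `count_codon_changes` is hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_codon_changes(seq_a, seq_b):
    """Return (synonymous, nonsynonymous) counts of differing codons."""
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned and a multiple of 3 long")
    synonymous = nonsynonymous = 0
    for i in range(0, len(seq_a), 3):
        codon_a = seq_a[i:i + 3].upper()
        codon_b = seq_b[i:i + 3].upper()
        if codon_a == codon_b:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1       # same amino acid, different codon
        else:
            nonsynonymous += 1    # amino acid changed
    return synonymous, nonsynonymous


# GCT -> GCC keeps alanine; AAA -> AGA changes lysine to arginine.
print(count_codon_changes("ATGGCTAAA", "ATGGCCAGA"))
```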
MSA
• Critical step
• Multiple state-of-the-art alignment programs (e.g., T-Coffee and
Praline) should be used.
• The alignment results from multiple sources should be inspected and
compared carefully to identify the most reasonable one.
Model of evolution
• A simple measure of the divergence of two sequences is the number of
substitutions in the alignment; the distance between two sequences can be
expressed as the proportion of substituted positions.
• If A was replaced by C: A → C or A → T → G → C?
• Back mutation: G → C → G.
• Parallel mutations – both sequences mutate into the same base (e.g., T) at the same position.
• All of this obscures the estimation of the true evolutionary distances
between sequences.
• This effect is known as homoplasy and must be corrected.
• Statistical models infer the true evolutionary distances between sequences.
Model of evolution
• Homoplasy is corrected by substitution (evolutionary) models.
• Many such models exist.
• Jukes-Cantor model
$d_{AB} = -\frac{3}{4} \ln\left(1 - \frac{4}{3}\, p_{AB}\right)$
• d_AB … distance, p_AB … proportion of substitutions
• example: alignment of A and B is 20 nucleotides long, 6 pairs are different, p_AB = 0.3,
d_AB = 0.38
• Kimura model
$d_{AB} = -\frac{1}{2} \ln\left(1 - 2p_{ti} - p_{tv}\right) - \frac{1}{4} \ln\left(1 - 2p_{tv}\right)$
• p_ti … frequency of transitions, p_tv … frequency of transversions
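Both corrections are easy to check numerically. The following Python sketch implements the two formulas as written above; the function names are illustrative, and the input checks simply reject proportions for which the logarithm is undefined.

```python
from math import log


def jukes_cantor(p):
    """Jukes-Cantor distance from the proportion p of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the correction to be defined")
    return -0.75 * log(1 - 4 * p / 3)


def kimura(p_ti, p_tv):
    """Kimura two-parameter distance from transition/transversion frequencies."""
    a = 1 - 2 * p_ti - p_tv
    b = 1 - 2 * p_tv
    if a <= 0 or b <= 0:
        raise ValueError("sequences too divergent for the correction to be defined")
    return -0.5 * log(a) - 0.25 * log(b)


# Slide example: 6 differences in 20 aligned nucleotides.
print(round(jukes_cantor(6 / 20), 2))  # 0.38
```

Note that the corrected distance (0.38) is larger than the observed proportion (0.3), reflecting substitutions hidden by homoplasy.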
Among site variations
• Up to now we have assumed that different positions in a sequence
evolve at the same rate.
• However, in reality this may not be true.
• In DNA, the rates of substitution differ for different codon positions. The third
codon position mutates much faster.
• In proteins, some AA change much more rarely than others owing to
functional constraints.
Tree building methods
• Two major categories.
• Distance based methods.
• Based on the amount of dissimilarity between pairs of sequences, computed on
the basis of sequence alignment.
• Character-based methods.
• Based on discrete characters, which are molecular sequences from individual
taxa.
Distance based methods
• Calculate evolutionary distances dAB between sequences using one
of the evolutionary models.
• Construct a distance matrix – distances between all pairs of taxa.
• Based on the distance scores, construct a phylogenetic tree.
• clustering algorithms – UPGMA, neighbor joining (NJ)
• optimality based – Fitch-Margoliash (FM), minimum evolution (ME)
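The first two steps (pairwise distances, then a distance matrix) can be sketched as follows. This is a minimal illustration assuming aligned sequences of equal length and the Jukes-Cantor correction; the function names and the taxon labels `s1` to `s3` are hypothetical, and the three sequences are taken from the "DNA sequence evolution" slide purely as sample input.

```python
from math import log


def p_distance(seq_a, seq_b):
    """Proportion of differing sites, ignoring columns with a gap ('-')."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    columns = [(x, y) for x, y in zip(seq_a, seq_b) if x != "-" and y != "-"]
    return sum(x != y for x, y in columns) / len(columns)


def distance_matrix(alignment):
    """Build a symmetric Jukes-Cantor distance matrix as nested dicts."""
    names = list(alignment)
    matrix = {a: {b: 0.0 for b in names} for a in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            p = p_distance(alignment[a], alignment[b])
            # Jukes-Cantor correction; undefined once p reaches 0.75.
            d = -0.75 * log(1 - 4 * p / 3)
            matrix[a][b] = matrix[b][a] = d
    return matrix


alignment = {"s1": "AAGACTT", "s2": "AAGGCCT", "s3": "TGGACTT"}
for name, row in distance_matrix(alignment).items():
    print(name, [round(value, 3) for value in row.values()])
```

The resulting matrix is the input that clustering or optimality-based methods then turn into a tree.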
Clustering methods
• UPGMA (Unweighted Pair Group Method with Arithmetic Mean)
• Produces a rooted tree (most phylogenetic methods produce unrooted trees).
• Basic assumption of the UPGMA method: all taxa evolve at a constant rate,
they are equally distant from the root, implying that a molecular clock is in
effect.
• However, real data rarely meet this assumption. Thus, UPGMA often produces
erroneous tree topologies.
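A compact sketch of UPGMA clustering, assuming the pairwise distances are supplied as a dictionary keyed by taxon pairs (an input format chosen here for illustration). At each step the two closest clusters are merged, the new node is placed at half their distance above the leaves (the molecular clock assumption), and distances to the merged cluster are averaged, weighted by cluster size. The output is a Newick-style topology string without branch lengths plus the root height.

```python
from itertools import combinations


def upgma(distances):
    """Cluster taxa with UPGMA.

    distances: dict {(taxon_a, taxon_b): distance} covering every pair.
    Returns (topology_string, root_height).
    """
    dist = {frozenset(pair): d for pair, d in distances.items()}
    # cluster label -> (number of leaves, node height above the leaves)
    clusters = {leaf: (1, 0.0) for pair in distances for leaf in pair}
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: dist[frozenset(p)])
        size_a, _ = clusters.pop(a)
        size_b, _ = clusters.pop(b)
        merged = f"({a},{b})"
        height = dist[frozenset((a, b))] / 2
        # Distance to the new cluster is the size-weighted arithmetic mean.
        for other in clusters:
            dist[frozenset((merged, other))] = (
                size_a * dist[frozenset((a, other))]
                + size_b * dist[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = (size_a + size_b, height)
    (topology, (_, root_height)), = clusters.items()
    return topology + ";", root_height


tree, height = upgma({("A", "B"): 2.0, ("A", "C"): 6.0, ("B", "C"): 6.0})
print(tree, height)
```

In this toy example A and B are joined first at height 1, and C joins them at height 3, so all leaves end up equally distant from the root.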
Distance based – pros and cons
• clustering
• Fast, can handle large datasets
• Not guaranteed to find the best tree
• The actual sequence information is lost when all the sequence variation is
reduced to a single value. Hence, ancestral sequences at internal nodes cannot
be inferred.
• NJ – does not assume that the rate of evolution is the same in all branches of
the tree
• NJ is slower but better than UPGMA
• exhaustive tree searching (FM)
• better accuracy
Character based methods
• Also called discrete methods
• Based directly on the sequence characters
• They count mutational events accumulated on the sequences and may
therefore avoid the loss of information when characters are converted
to distances.
• Evolutionary dynamics of each character can be studied
• Ancestral sequences can also be inferred.
• The two most popular character-based approaches: maximum
parsimony (MP) and maximum likelihood (ML) methods.
Maximum parsimony
• A tree with the least number of substitutions is probably the best to
explain the differences among the taxa under study.
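Counting the minimum number of substitutions on a given tree can be done with the Fitch algorithm, a standard technique that the slide does not name and that is swapped in here as an illustration. The sketch assumes a rooted bifurcating tree written as nested 2-tuples of taxon names and an alignment without gaps; the toy sequences are invented for the example.

```python
def fitch(tree, states):
    """Return (possible ancestral states, substitution count) for one site.

    tree: a taxon name (str) or a 2-tuple of subtrees.
    states: dict mapping taxon name -> character at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    shared = left_set & right_set
    if shared:
        # Children agree on at least one state: no extra substitution needed.
        return shared, left_cost + right_cost
    # Children disagree: one substitution is required on this part of the tree.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the per-site Fitch counts over all alignment columns."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )


# Toy alignment (hypothetical data) and the three possible 4-taxon groupings.
alignment = {"A": "AAG", "B": "AAA", "C": "GGA", "D": "GGG"}
for topology in ((("A", "B"), ("C", "D")),
                 (("A", "C"), ("B", "D")),
                 (("A", "D"), ("B", "C"))):
    print(topology, parsimony_score(topology, alignment))
```

For this data the grouping of A with B and C with D needs the fewest substitutions (4, versus 6 and 5), so it would be chosen as the most parsimonious tree.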
MP – pros and cons
• The character-based method is able to provide evolutionary
information about the sequence characters, such as information
regarding homoplasy and ancestral states.
• It tends to produce more accurate trees than the distance-based
methods when sequence divergence is low because this is the
circumstance when the parsimony assumption of rarity in
evolutionary changes holds true.
• When sequence divergence is high, tree estimation by MP can be
less effective, because the original parsimony assumption no
longer holds.
• Estimation of branch lengths may also be erroneous because MP
does not employ substitution models to correct for multiple
substitutions.
Maximum likelihood – ML
• Uses probabilistic models to choose a best tree that has the highest
probability (likelihood) of reproducing the observed data.
• In principle, ML is an exhaustive method that evaluates every possible
tree topology (feasible only for a small number of taxa; heuristic searches
are used otherwise) and considers every position in an alignment, not just
informative sites.
• By employing a particular substitution model that has probability
values of residue substitutions, ML calculates the total likelihood of
ancestral sequences evolving to internal nodes and eventually to
existing sequences.
• It sometimes also incorporates parameters that account for rate
variations across sites.
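The likelihood idea can be shown on the smallest possible case: two sequences joined by a single branch, under the Jukes-Cantor model. The sketch below is a simplification (a full ML analysis also sums over ancestral states at internal nodes and compares topologies); it finds the branch length that maximizes the likelihood by a plain grid search. The function names and the grid parameters are assumptions made for illustration.

```python
from math import exp, log


def jc_log_likelihood(d, n_sites, n_diffs):
    """Log-likelihood of two aligned sequences separated by distance d (JC model)."""
    decay = exp(-4 * d / 3)
    p_same = 0.25 + 0.75 * decay   # probability that a site is unchanged
    p_diff = 0.25 - 0.25 * decay   # probability of one specific different base
    return (
        n_sites * log(0.25)        # equal base frequencies at the first sequence
        + (n_sites - n_diffs) * log(p_same)
        + n_diffs * log(p_diff)
    )


def ml_distance(n_sites, n_diffs, step=0.001, upper=3.0):
    """Grid search for the distance with the highest likelihood."""
    candidates = [step * i for i in range(1, int(upper / step) + 1)]
    return max(candidates, key=lambda d: jc_log_likelihood(d, n_sites, n_diffs))


# 20 aligned nucleotides with 6 differences, as in the Jukes-Cantor example.
print(round(ml_distance(20, 6), 2))  # 0.38
```

For this two-sequence case the maximum-likelihood distance coincides with the Jukes-Cantor corrected distance of about 0.38.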
ML – pros and cons
• Based on well-founded statistics instead of a medieval philosophy
(Occam's razor, which underlies parsimony).
• More robust, uses the full sequence information, not just informative
sites.
• Employs substitution model – strength, but also weakness (choosing
wrong model leads to incorrect tree).
• Accurately reconstructs the relationships between sequences that have
been separated for a long time.
• Very time consuming, considerably more than MP which is itself more
time consuming than clustering methods.