lecture06_06

Introduction to Bioinformatics
Molecular
Phylogeny
Lesson 5
Theory of Evolution:
Life is monophyletic
• All organisms on
Earth had a
common ancestor.
Ancestor
• Any two organisms
share a common
ancestor in their
past.
Descendant 1
Descendant 2
2
Theory of Evolution:
• Speciation events
lead to creation of
different species
(two species).
• Speciation caused by
physical separation
into groups where
different genetic
variants become
dominant.
Ancestor
Descendant 1
Descendant 2
3
The genetic distance between any two extant
organisms is computable.
extinct
extant 1
extant 2
7
ancestor
descendant 1
The differences
between 1 and 2
are the result of
changes on the
lineage leading to
descendant 1 +
those on the
lineage leading to
descendant 2.
descendant 2
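Written as an equation, the statement above says that the changes separating the two descendants add up along the two lineages. The symbol d(x, y), for the number of changes between x and y, is introduced here for illustration only and does not appear on the slide. Note that this counts actual changes; the number of observed differences can be smaller when the same site changes more than once.

```latex
d(\text{descendant 1}, \text{descendant 2})
  = d(\text{ancestor}, \text{descendant 1})
  + d(\text{ancestor}, \text{descendant 2})
```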
8
Thus, any set of species is related; this
relationship is called phylogeny.
The relationships can be represented by a
phylogenetic tree (or dendrogram).
9
MYA = Million Years Ago
1,500 MYA
120 MYA
5 MYA
10
Phylogenetic Tree Terminology
• Graph composed of nodes & branches
• Each branch connects two adjacent nodes
R
F
E
A
B
C
D
11
Phylogenetic Tree Terminology
• Nodes represent the taxonomic units
• Taxonomic units = species/genes/individuals
• Branch = relations among the taxonomic units
(descent & ancestry)
• Branching pattern = Topology
• Branch lengths correspond to number of
substitutions. Longer branch means more
substitutions.
12
Phylogenetic Tree Terminology
Root
Branches
A
B
C
D
E
internal node - hypothetical most recent common ancestors
leaf (terminal node) - current day species or gene “taxa”
13
OTUs & HTUs
• OTUs = Operational Taxonomic Units
– leaves of the tree
• HTUs = Hypothetical Taxonomic Units
– internal nodes of the tree
14
Trees
[Figure: the same Gorilla/Human/Chimp tree drawn in several equivalent ways, joined by "=" signs]
15
Same thing
[Figure: two equivalent drawings of one tree for taxa s1, s2, s3, s4, s5]
16
Newick format
A
B
C
D
E
((A,B),(C,(D,E)));
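A minimal sketch of how the Newick string above maps to a nested structure. It assumes the simplest form of the format (leaf labels only, no branch lengths, no quoted names); the function names `parse_newick` and `leaves` are made up for this illustration.

```python
def parse_newick(text):
    """Parse a simple Newick string (labels only) into nested tuples."""
    s = text.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        if pos < len(s) and s[pos] == "(":
            pos += 1  # consume "("
            children = [parse_node()]
            while pos < len(s) and s[pos] == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            if pos >= len(s) or s[pos] != ")":
                raise ValueError("unbalanced parentheses")
            pos += 1  # consume ")"
            return tuple(children)
        # Otherwise read a leaf label up to the next delimiter.
        start = pos
        while pos < len(s) and s[pos] not in "(),":
            pos += 1
        return s[start:pos].strip()

    tree = parse_node()
    if pos != len(s):
        raise ValueError("unexpected trailing characters")
    return tree


def leaves(tree):
    """List the leaf labels of a nested-tuple tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]


tree = parse_newick("((A,B),(C,(D,E)));")
print(tree)
print(leaves(tree))
```

Each pair of parentheses becomes one internal node, so the grouping of A with B and of D with E is visible directly in the nesting.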
17
Rooted vs. unrooted trees
[Figure: a rooted tree and an unrooted tree for taxa 1, 2 and 3]
18
Gorilla gorilla
(Gorilla)
Pan troglodytes
(Chimpanzee)
Homo sapiens
(human)
Gallus gallus
(chicken)
19
3 possible UNROOTED trees:
Human
Chicken
Human
Chimp
Gorilla
Human
Chicken
Gorilla
Chimp
Chimp
the best tree
Gorilla
Chicken
20
Rooting based on a priori knowledge:
Human
Chimp
Chicken
Gorilla
Chicken
Gorilla
Human Chimp
21
Ingroup / Outgroup:
Chicken
OUTGROUP
Gorilla
Human Chimp
INGROUP
22
Monophyletic groups (clades):
A group is monophyletic (a clade) if it has a common
ancestor and all the descendants of this ancestor are in
the group.
23
Monophyletic groups
Chicken
Gorilla
Human Chimp
The Gorilla+Human+Chimp are monophyletic
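The definition can be checked mechanically on a rooted tree: a group is monophyletic exactly when some node has that group, and nothing else, as its set of descendant leaves. The sketch below assumes the rooted tree is written as nested tuples, with Chicken as the outgroup; the helper names are hypothetical.

```python
def leaf_set(tree):
    """Return the set of leaf names below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def is_monophyletic(tree, group):
    """True if some node of the rooted tree has exactly `group` as its leaves."""
    group = frozenset(group)
    if leaf_set(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)


# Rooted tree assumed for this example: Chicken is the outgroup.
rooted = ("Chicken", ("Gorilla", ("Human", "Chimp")))

print(is_monophyletic(rooted, {"Gorilla", "Human", "Chimp"}))  # a clade
print(is_monophyletic(rooted, {"Human", "Gorilla"}))           # not a clade
```

The check only makes sense for a rooted tree, because the notion of "descendants of an ancestor" depends on where the root is placed.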
24
Non-monophyletic groups
Drosophila
Zebra-fish
Whale
Chimp
Zebra-fish + Whale are not monophyletic:
adaptation to water occurred more than once
during evolution, independently (or was lost in
the lineage leading to Chimp).
25
Monophyletic groups:
Human
Chicken
Rat
Chimp
Gorilla
When an unrooted tree is given, you cannot know which
groups are monophyletic; you can only say which are not.
For example, Chicken + Rat might be monophyletic if the root
were between Chicken + Rat and the rest. In fact, the real root
of the tree is between Chicken and the rest, hence Chicken
and Rat are not monophyletic. But Human and Gorilla are not
monophyletic no matter where the root is.
26
What data can be used?
(1) Molecular data (DNA, RNA, proteins)
(2) Morphological data (living or fossilized organisms)
27
Advantages of molecular data:
• Heritable entities
• Characters’ description is unambiguous
• Molecular data are amenable to quantitative
treatment
• Can assess evolutionary relationship among
distantly related organisms (ribosomal
RNA)
• More abundant data (bacteria, algae)
28
What can we learn from a
phylogenetic tree?
Determining the closest relatives of the
organism that you are interested in.
29
Example 1:
Which species are closest to Human?
Human
Gorilla
Chimpanzee
Chimpanzee
Gorilla
Orangutan
Orangutan
Human
Molecular analysis:
Chimpanzee is more closely related
to human than the gorilla is
Pre-molecular analysis:
The great apes
(chimpanzee, gorilla & orangutan)
are separate from the human
30
Example 2 :
Guilty Sequence - scientists map a
murder weapon
“In 1998, a Louisiana doctor was convicted
of attempting to murder his ex-girlfriend, a
nurse. The murder weapon was a syringe of
HIV-infected blood drawn from a patient
under the doctor's care.”
31
History of the virus:
Phylogenetic analysis of the RT region.
The smaller set of boxed sequences represents the sequences from the
victim, and the larger set of boxed sequences represents the patient plus
victim sequences. LA denotes viral sequences from control HIV-1
infected individuals.
Metzker, Michael L. et al. (2002) Proc. Natl. Acad. Sci. USA 99, 14292-14297
©2002 National Academy of Sciences, U.S.A.
32
Species trees and Gene trees
• Species trees - representing the
evolutionary relationships among species
(the speciation process).
• Gene trees – Different genes may have
different evolutionary histories.
33
What is Homology ?
Before Darwin, homology was defined morphologically.
Similarity between properties in various species.
Example:
• Bats and butterflies fly, but the structures are different.
• Bats fly and whales swim, yet the bones in a bat's wing
and a whale's flipper are strikingly alike.
Conclusions:
1. Bat wings and butterfly wings are not homologous.
2. Bat wings and whale flippers are homologous.
34
Homology Interpretation:
from Darwin to 21st Century
• Darwin (1859): Homology is a result of descent
with modifications from a common ancestor.
• Modern genetics: Homology is determined by genes.
• Two sequences are homologous if they are similar and
share a common ancestor (similarity by itself is not
enough).
• Large enough similarities typically imply homology.
35
Homolog
• A gene related to a second gene by descent
from a common ancestral DNA sequence.
36
Orthologs
Homologous sequences are orthologous if
they were separated by a speciation
event:
If a gene exists in a species, and that
species diverges into two species, then
the copies of this gene in the resulting
species are orthologous.
37
Orthologs
• Orthologs typically retain the same or
similar function in the course of evolution.
• Identification of orthologs is critical for
reliable prediction of gene function in
newly sequenced genomes.
38
Orthologs
ancestor
a
speciation
a
descendant 1
a
descendant 2
39
Paralogs
Homologous sequences are paralogous if
they were separated by a gene
duplication event:
If a gene in an organism is duplicated,
then the two copies are paralogous.
40
Paralogs
• Orthologs will typically have the same or
similar function.
• This is not always true for paralogs: without
the original selective pressure on one copy of
the duplicated gene, this copy is free to
mutate and acquire new functions.
41
Paralogs
a
Duplication
a
b
42
Orthologs & Paralogs
a
Paralogs
Duplication
a
b
Orthologs
Speciation
Orthologs
a
Species a
b
a
b
Species b
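The two definitions reduce to one rule: look at the event at the most recent common ancestor of the two genes in the gene tree. The sketch below assumes a gene tree whose internal nodes are already labelled "speciation" or "duplication"; the tuple layout and the gene names (`a_sp1`, `b_sp2`, and so on) are hypothetical and only mirror the duplication-then-speciation scenario drawn above.

```python
def leaf_names(node):
    """Return the set of gene names below a node."""
    if isinstance(node, str):
        return {node}
    _event, left, right = node
    return leaf_names(left) | leaf_names(right)


def relation(node, gene1, gene2):
    """Classify two distinct genes by the event at their most recent common ancestor."""
    event, left, right = node
    for child in (left, right):
        # Descend while both genes still sit under the same child.
        if not isinstance(child, str) and {gene1, gene2} <= leaf_names(child):
            return relation(child, gene1, gene2)
    return "orthologs" if event == "speciation" else "paralogs"


# Gene a is duplicated into a and b; a speciation event then splits both copies.
gene_tree = (
    "duplication",
    ("speciation", "a_sp1", "a_sp2"),
    ("speciation", "b_sp1", "b_sp2"),
)

print(relation(gene_tree, "a_sp1", "a_sp2"))  # separated by speciation
print(relation(gene_tree, "a_sp1", "b_sp1"))  # separated by duplication
print(relation(gene_tree, "a_sp1", "b_sp2"))  # separated by duplication
```

In practice the event labels are not given and must themselves be inferred, for example by comparing the gene tree with the species tree.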
43
TR = “TREE ROOTED”
How many rooted trees?
N=2, TR(2) = 1
N=3, TR(3) = 3
N=4, TR(4) = 15
[Figure: all rooted tree topologies on the taxa a, b, c, d for N = 2, 3 and 4]
44
Number of possible trees:
Number of taxa    Number of rooted trees    Number of unrooted trees
2                 1                         1
3                 3                         1
4                 15                        3
5                 105                       15
6                 945                       105
7                 10,395                    945
8                 135,135                   10,395
9                 2,027,025                 135,135
10                34,459,425                2,027,025
11                654,729,075               34,459,425
12                13,749,310,575            654,729,075
45
Number of possible trees
$N_{\mathrm{rooted}} = \dfrac{(2n-3)!}{2^{n-2}\,(n-2)!}$
$N_{\mathrm{unrooted}} = \dfrac{(2n-5)!}{2^{n-3}\,(n-3)!}$
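The two formulas are easy to evaluate directly. The following sketch uses exact integer arithmetic; the function names are chosen for this example, and the unrooted formula is applied only for n of at least 3, since (2n-5)! is undefined for n = 2.

```python
from math import factorial


def n_rooted(n):
    """Number of rooted binary trees on n taxa (n >= 2)."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def n_unrooted(n):
    """Number of unrooted binary trees on n taxa (n >= 3)."""
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


for n in (3, 4, 5, 6, 10, 20):
    print(n, n_rooted(n), n_unrooted(n))
```

The number of unrooted trees on n taxa equals the number of rooted trees on n-1 taxa, because a root can be placed on any branch of an unrooted tree.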
46
Evolution is an historical process.
Only one historical narrative is true.
From 8,200,794,532,637,891,559,375 possibilities for 20 taxa, 1
possibility is true and 8,200,794,532,637,891,559,374 are false.
Truth is one, falsehoods are many.
47
How do we know which of the
8,200,794,532,637,891,559,375 trees is true?
We don’t, we infer by using decision criteria.
48
Methods
49
Approach 1 - Distance methods
• Two steps:
– Compute a distance between every pair of
sequences from the MSA.
– Find the tree that agrees most with the distance
table.
Approach 2 - Character state methods
• Input: multiple sequence alignment
• Algorithms:
– Maximum parsimony (MP)
– Maximum likelihood (ML)
50
Step 1: Distance estimation
There are different methods to compute the
distance between any two sequences. For
example, one can take into account different
probabilities for transitions and
transversions.
OTU   A    B    C
B     8
C     7    9
D     12   14   11
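As one illustration of a distance that treats transitions and transversions differently, the sketch below computes the simple proportion of differing sites (p-distance) and the Kimura two-parameter distance for two aligned DNA sequences. The slide does not name a specific model, so the choice of Kimura's model here is an assumption, and the two short sequences are invented for the example.

```python
from math import log, sqrt

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_differences(seq1, seq2):
    """Count transitions, transversions and compared sites in two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = sites = 0
    for x, y in zip(seq1.upper(), seq2.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue  # skip gaps and ambiguous characters
        sites += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1   # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            transversions += 1  # purine <-> pyrimidine
    return transitions, transversions, sites


def p_distance(seq1, seq2):
    """Proportion of compared sites that differ."""
    ts, tv, n = count_differences(seq1, seq2)
    return (ts + tv) / n


def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance: -1/2 * ln((1 - 2P - Q) * sqrt(1 - 2Q))."""
    ts, tv, n = count_differences(seq1, seq2)
    p, q = ts / n, tv / n
    return -0.5 * log((1 - 2 * p - q) * sqrt(1 - 2 * q))


s1 = "AGCTAGCTAG"
s2 = "AGTTAGCAAG"  # one transition (C/T) and one transversion (T/A)
print(count_differences(s1, s2))
print(p_distance(s1, s2), k2p_distance(s1, s2))
```

The corrected distance is larger than the p-distance because it estimates substitutions that are hidden by multiple changes at the same site.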
51
Step 2:
From a distance table to a tree
• Algorithms:
– UPGMA
– Neighbor Joining (NJ)
52
Neighbor Joining (NJ)
• Reconstructs an unrooted tree
• Calculates branch lengths
• Based on star decomposition
• In each stage, the two nearest nodes of the
tree are chosen and defined as neighbors in
our tree. This is done recursively until all of the
nodes are paired together.
53
Neighbors, we are …
What are neighbours?
Neighbours are defined as a pair of OTUs that
have one internal node connecting them.
C
A
B
D
A and B are neighbours,
C and D are neighbours,
But…
A and C are not
neighbours…
54
Neighbors, we are …
Which pair is closest?
$r_i = \dfrac{\sum_{k} d_{ik}}{N-2}$
average distance from all nodes
$M_{ij} = d_{ij} - (r_i + r_j)$
distance of i,j relative to the rest
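The two formulas can be applied directly to a distance table to find the pair with the smallest M value. This is only a sketch of the selection step, not a full Neighbor Joining implementation. The five-OTU distances below are an assumption: they are the values as they can be read from the lecture's worked example (B: 8; C: 7, 9; D: 12, 1, 3; E: 11, 10, 2, 6), and the function names are made up here.

```python
from itertools import combinations

# Assumption: lower-triangle distances read from the lecture's five-OTU example.
LOWER_TRIANGLE = {
    "B": {"A": 8},
    "C": {"A": 7, "B": 9},
    "D": {"A": 12, "B": 1, "C": 3},
    "E": {"A": 11, "B": 10, "C": 2, "D": 6},
}


def build_distances(lower_triangle):
    """Store each pairwise distance under an unordered pair of names."""
    dist = {}
    for row, cols in lower_triangle.items():
        for col, value in cols.items():
            dist[frozenset((row, col))] = value
    return dist


def m_matrix(dist, names):
    """Compute M_ij = d_ij - (r_i + r_j) with r_i = sum_k d_ik / (N - 2)."""
    n = len(names)
    r = {i: sum(dist[frozenset((i, k))] for k in names if k != i) / (n - 2)
         for i in names}
    return {(i, j): dist[frozenset((i, j))] - (r[i] + r[j])
            for i, j in combinations(names, 2)}


names = ["A", "B", "C", "D", "E"]
dist = build_distances(LOWER_TRIANGLE)
m = m_matrix(dist, names)
closest = min(m, key=m.get)
print(closest, round(m[closest], 3))
```

With these distances the smallest M value belongs to the pair B and D, so they are the first pair to be treated as neighbours.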
55
OTU   A    B    C    D
B     8
C     7    9
D     12   1    3
E     11   10   2    6
[Figure: star tree of A, B, C, D and E, in which B and D are joined as the node (B,D)]
OTU     A    (B,D)   C
(B,D)   10
C       7    6
E       11   8       2
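A sketch of the reduction step, in which the joined pair (B,D) replaces B and D in the distance table. The new distances in the lecture's reduced table (10, 6 and 8) appear to match a plain average of the two old distances, so that is what is coded here; this is an inference from the numbers, and the standard Neighbor Joining update, (d_ik + d_jk - d_ij) / 2, would give slightly different values. The input distances are the five-OTU values as read from the example, and `join_pair` is a hypothetical helper name.

```python
from itertools import combinations

# Assumption: pairwise distances read from the lecture's five-OTU example.
dist = {frozenset(pair): d for pair, d in {
    ("A", "B"): 8,
    ("A", "C"): 7, ("B", "C"): 9,
    ("A", "D"): 12, ("B", "D"): 1, ("C", "D"): 3,
    ("A", "E"): 11, ("B", "E"): 10, ("C", "E"): 2, ("D", "E"): 6,
}.items()}


def join_pair(dist, names, i, j):
    """Replace nodes i and j by one merged node with averaged distances."""
    merged = f"({i},{j})"
    rest = [k for k in names if k not in (i, j)]
    # Distances among the untouched nodes are carried over unchanged.
    reduced = {frozenset(pair): dist[frozenset(pair)]
               for pair in combinations(rest, 2)}
    # Distance from the merged node to k: mean of d(i, k) and d(j, k).
    for k in rest:
        reduced[frozenset((merged, k))] = (
            dist[frozenset((i, k))] + dist[frozenset((j, k))]
        ) / 2
    return reduced, [merged] + rest


reduced, new_names = join_pair(dist, ["A", "B", "C", "D", "E"], "B", "D")
for a, b in combinations(new_names, 2):
    print(a, b, reduced[frozenset((a, b))])
```

Repeating the select-and-join steps on the reduced table continues until every node has been paired.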
56
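OTU     A    (B,D)   C
(B,D)   10
C       7    6
E       11   8       2
[Figure: star tree of A, (B,D), C and E; C and E are joined next, giving a tree that connects A, (B,D) and (C,E), with (B,D) expanded into B and D and (C,E) into C and E]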
57
Advantages and disadvantages of NJ
• Advantages
– is fast and thus suited for large datasets and for
bootstrap analysis
– permits lineages with largely different branch
lengths
– permits correction for multiple substitutions
• Disadvantages
– sequence information is reduced
– gives only one possible tree
– strongly dependent on the model of evolution
used.
58