Class 9: Phylogenetic Trees

Download Report

Transcript Class 9: Phylogenetic Trees

Phylogenetic Trees
Lecture 11
Sections 6.1, 6.2, in Setubal et. al., 7.1, 7.1 Durbin et. al.
© Shlomo Moran, based on Nir Friedman. Danny Geiger, Ilan Gronau
.
Evolution
Evolution of new organisms
is driven by
 Diversity
 Different individuals
carry different
variants of the same
basic blue print
 Mutations
 The DNA sequence can
be changed due to
single base changes,
deletion/insertion of
DNA segments, etc.
 Selection bias
2
Theory of Evolution
 Basic

idea
speciation events lead to creation of different
species (speciation: physical separation into groups where different
genetic variants become dominant)
 Any
two species share a (possibly distant) common
ancestor
 This is described by a rooted tree – the tree of life.
3
Tree of Life


Any two species share a (possibly distant) common ancestor
The process of evolution consists of:
 speciation events.
 mutations along
evolutionary branches.
Source: Alberts et al
4
Often only a subtree is studied
Definition: A phylogeny is a tree that describes the
sequence of speciation events that lead to the forming
of a set of current day species; also called a
phylogenetic tree.
5
Components of Phylogenenetic Trees
Aardvark Bison Chimp Dog
Elephant
Leaves - current day species (or taxa – plural of taxon)
 Internal vertices - hypothetical common ancestors
 Edges length - “time” from one speciation to the next
 The Tree Topology – the tree structure, ignoring edge
lengths. Usually the goal is to find the topolgy.

6
Historical Note
 Until
mid 1950’s phylogenies were constructed by
experts based on their opinion (subjective criteria)
 Since
then, focus on objective criteria for
constructing phylogenetic trees
 Thousands of articles in the last decades
 Important


for many aspects of biology
Classification
Understanding biological mechanisms
7
Outline
A. Introduction (this lecture)
1. The phylogenetic Reconstruction Problem: from
sequences to trees
2.Morphological vs. molecular sequences
3. Possible pitfalls
4. Directed and undirected trees
5. The “big” problem, the “small” problem.
.
Outline (cont)
B. Character based methods (this + next lectures)
1. Perfect Phylogeny
2. Maximum Parsimony
3. Maximum Likelihood (not studied in this course)
• These methods consider the evolution of each character separately.
• Try to find the tree which gives the “best” evolutionary explanation:
- least number of observed mutations (1&2), or most probable tree (3).
• These optimization problems are typically NP-hard.
• We’ll discuss ways for solving simplified versions of the problems.
.
Outline (cont)
C. Distance based methods (last 1-2 lectures)
- Run in polynomial time
- Compute distances between all taxon-pairs
- Find a tree (edge-weighted) best-describing the distances
D 
0









9
0
21 19 15 16 

22 20 16 17 
0 18 14 15 

0 8 9
0 3 
0 
4
5
10
7
6
1
1
2
2
Outline (end)
Distance Methods (cont.)
1. Efficient reconstruction (O(n2) time) from accurate distances
2.
Reconstruction from noisy distances: Can we reconstruct
accurate trees from approximate distances?
•
Worst-case noise model
•
More realistic noise models: inter-species distances derived from
probabilistic models of mutations.
.
Evolution as a Tree
ACGGTCA
AAAGTCA
ACGGATA
ACGGGTA
AAAGGCG
AAAGCTG
AAACACA
GGGGATT
TCTGGTA
GAACGTA
ACCCGTG
AATCCTG
AATGGGC
ATAGCTG
AAACCGA
TCTGGGA
TCCGGAA
ACCGTTG
AGCCGTG
12
Phylogenetic Reconstruction
GGGGATT
GAACGTA
AATCCTG
AATGGGC
ATAGCTG
AAACCGA
TCTGGGA
TCCGGAA
ACCGTTG
AGCCGTG
13
Phylogenetic Reconstruction
A : AATGGGC
B:
C:
D:
E:
F:
G:
H:
I:
J:
AATCCTG
ATAGCTG
GAACGTA
AAACCGA
GGGGATT
TCTGGGA
TCCGGAA
AGCCGTG
ACCGTTG
reconstruct
A
F
D
B
G
C
E
H
I
J
Goal: reconstruct the ‘true’ tree as accurately as possible
14
What are the sequences?
Morphological vs. Molecular

Classical methods. morphological features:
 number of legs, lengths of legs, etc.

Modern methods. molecular features:
 Gene (DNA) sequences
 Protein sequences

Analysis based on homologous sequences (e.g., globins)
in different species
16
Possible pitfall in reconstruction:
Misleading selection of sequences
 Gene/protein
sequences can be homologous for
several different reasons:

Orthologs -- sequences diverged after a
speciation event
  Paralogs -- sequences diverged after a
duplication event (next slides)
  Xenologs -- sequences diverged after a
horizontal transfer (e.g., by virus)
17
Misleading selection of sequences:
Using paralogs instead of orthologs
Consider evolutionary tree of three taxa:
Gene Duplication
…and assume that at some point
in the past a gene duplication
event occurred.
1
2
3
18
Paralogs instead of Orthologs
The gene evolution is described by this tree
(1,2,3 are species; A, B are the copies of the same gene).
Gene Duplication
Copy B
Copy A
Speciation events
1A
2A
3A
3B
2B
1B
19
Paralogs instead of Orthologs
If we happen to consider genes 1A, 2B, and 3A of species
1,2,3, we get a wrong tree.
S
Gene Duplication
S
1A
2A
Speciation events
3A
3B
S
2B
1B
In the sequel we assume all given sequences are orthologs –
created from a common ancestor by specification events.
20
Rooted vs. Undirected Trees
A natural representation of phylogeny is rooted trees
Common
Ancestor
21
Types of trees
Unrooted tree represents the same phylogeny without
the root node
Most known tree-reconstruction techniques do not distinguish
between different placements of the root.
22
Rooted versus unrooted trees
Tree a
Tree b
Tree c
b
a
c
Represents the three rooted trees
23
Positioning Roots in Unrooted Trees
 We
can estimate the position of the root by
introducing an outgroup:
 a set of species that are definitely distant from all
the species of interest
Proposed root
Falcon
Aardvark Bison Chimp Dog
Elephant
24
Two phylogenenetic trees of the same species:
Do these trees represent the same evolutionary history?
Aardvark Bison Chimp Dog
Chimp Aardvark
Bison
Elephant
Elephant
Dog
25
When two unrooted phylogenetic trees are
considered different?
Trees T1 and T2 on the same set of species are considered
identical if they represent the same evolutionary history, i.e.:
they have the same topology.
Formally, this is equivalent to:
There is a tree isomorphism h: T1  T2 s.t: For each species
x, h(x)=x.
26
The two trees represent the same evolution
v
u
w
Aardvark Bison Chimp Dog
h(v)
Elephant
h(u)
h(w)
Chimp Aardvark
Bison
Elephant
Dog
27
The “Big” reconstruction problem,
the “Small” problem
The “big” problem: compute the whole phylogenetic tree
from the n input sequences.
The “small” problem: Assume the tree topology and the
identities of the leaf-species are known.
Reconstruct the sequences at the internal vertices,
and give a score to the resulted phylogeny.
Connection between the problems: In order to solve the
big problem, solve the small problem on all possible
trees with n leaves, and output the tree(s) with the
highest “score”.
This is impossible in practice for more than few taxa.
28
Input for the “big” problem
A:
B:
C:
D:
E:
CAGGTA
CAGACA
CGGGTA
TGCACT
TGCGTA
Our task: Find evolutionary tree with leafs
corresponding to the 5 sequences, which
best explains the evolution of the strings.
29
Input for the “small” problem
Aardvark Bison Chimp Dog
A:
B:
C:
D:
E:
CAGGTA
CAGACA
CGGGTA
TGCACT
TGCGTA
Elephant
The tree and assignments of strings
to the leaves is given, and we need
only to assign strings to internal
vertices.
30
Character-based methods
for constructing phylogenies
In this approach, trees are constructed by comparing the characters of
the corresponding sequences. Characters may be morphological
(teeth structures) or molecular (nucleotides in homologous DNA
sequences). We will present two methods: “Perfect Phylogeny” and
“Maximum Parsimony”
Basic Assumption in these methods:
Best tree is one with minimal number of observed mutations
(character changes along the edges, aka substitutions).
31
Character based methods:
Input data
species
C1
C2
C3
C4
…
dog
A
A
C
A
G
G
T
C
T
T
C
G
A
G
G
C
C
C
horse
A
A
C
A
G
G
C
C
T
A
T
G
A
G
A
C
C
C
frog
A
A
C
A
G
G
T
C
T
T
T
G
A
G
T
C
C
C
human
A
A
C
A
G
G
T
C
T
T
T
G
A
T
G
A
C
C
pig
A
A
C
A
G
T
T
C
T
T
C
G
A
T
G
G
C
C
*
*
*
*
*
*
*
*
*
*
*
Cm
•
Each character (column) is processed independently.
•
The green character will separate the human and pig from frog, horse and dog.
•
The red character will separate the dog and pig from frog, horse and human.
32
The perfect phylogeny problem



A character is assumed to be a significant property,
which distinguishes between species (e.g. dental
structure, number of legs/limbs).
A characters state is a value of the character (eg: human
dental structure).
Assumption: It is unlikely that a given state will be
created twice in the evolution tree. Such characters are
called “Homoplasy free”, and are detailed next.
33
Homoplasy-free characters 1
Homoplasy free characters should avoid:
reversal transitions

A species regains a state it’s direct ancestor has
lost.

Famous known exceptions:
 Teeth in birds.
 Legs in snakes.
34
Homoplasy-free characters 2
…and also avoid
convergence transitions

Two species possess the same state while their
least common ancestor possesses a different state.

Famous known exceptions: The marsupials.
35
The Perfect Phylogeny Problem
Input:
1. A set of species
2. A set of characters
3. For each character, assignment of states to the species
Problem: Is there a phylogenetic tree T=(V,E), s.t. the evolution of
all characters is “homoplasy free” (no reversal, no convergence)
First, we define the problem using graphtheoretic terms.
36
Characters = Colorings
A coloring of a tree T=(V,E) is a mapping
C:V [set of colors]
A partial coloring of T is a coloring of a subset of the vertices U  V:
C:U [set of colors]
U=
37
Characters as Colorings
Each character defines a (partial) coloring of the
corresponding phylogenetic tree:
Species ≡ Vertices
States ≡ Colors
38
Convex Colorings (and Characters)
Let T=(V,E) be a partially colored tree, and d be a
color. The d-carrier is the minimal subtree of T
containing all vertices colored d
Definition: A (partial/total) coloring of a tree is convex
iff all d-carriers are disjoint
39
Convexity  Homoplasy Freedom
A character is Homoplasy free (avoids reversal and
convergence transitions)
↕
The corresponding (partial) coloring is convex
40
The Perfect Phylogeny Problem
(pure graph theoretic setting)
Input: Partial colorings (C1,…,Ck) of a set of vertices U (in the
example: 3 total colorings: left, center, right, each by two colors).
R G A
B B P
R G P
R B P
Problem: Is there a tree T=(V,E), s.t. UV and for i=1,…,k,, Ci is a
convex (partial) coloring of T?
PP is NP-Hard In general
In the tutorial you will see a special case
solvable in p-time.
41
Maximum Parsimony
Perfect Phylogeny is not only hard to compute, but in
many cases it doesn’t exist.
Next we discuss a more common approach, called
“Maximum Parsimony”, which looks for a tree which
minimizes the number of mutations.
42
Maximum Parsimony
A Character-based method
Input:

h sequences (one per species), all of length k.
Goal:

Find a tree whose leaves are labeled by the input
sequences, and an assignment of sequences to internal
nodes, such that the total number of substitutions is
minimized.
43
Example
Input: four nucleotide sequences: AAG, AAA, GGA, AGA taken
from four species.
By the parsimony principle, we seek a tree whose leaves are labeled
by the input sequences, and assignment of sequences to internal
vertices, with minimum total number of mutations (ie, letter
changes) along the tree edges.
Here is one possible tree + sequences assignment.
AAA
AAA
1
AAG
2
GGA
AAA
AAA
1
AGA
Total #substitutions = 4
44
Example Continued
Here are two other trees+ sequence assignments:
AAA 1
AAA
1
AAG
AAA
AAA
AAA
AGA
1
GGA
AGA
1
AAG
1
AGA
AAA
AAA
2
GGA
Total #substitutions = 3
Total #substitutions = 4
The left solution is preferred over the right one.
A solution has two parts: First, select a tree and label its leaves
by the input sequences; then, assign sequences to the internal
vertices.
45
Parsimony score
AAA 1
AAA
1
AAG
AAA
AAA
AAA
AGA
1
GGA
Parsimony score = 3
AGA
1
AAG
1
AGA
AAA
AAA
2
GGA
Parsimony score = 4
The parsimony score of a leaf-labeled tree T is the minimum
possible number of mutations over all assignments of sequences
to internal vertices of T.
47
Parsimony Based Reconstruction
We have here both the small and big problems:
1. The small problem: find the parsimony score for a
given leaf labeled tree.
2. The big problem: Find a tree whose leaves are
labeled by the input sequences, with the minimum
possible parsimony score.
3. We will see efficient algorithms for (1). (2) is hard.
48