GST II: ---Title - University of Missouri
Download
Report
Transcript GST II: ---Title - University of Missouri
Phylogenetic Tree
Construction
Dong Xu
Computer Science Department
271C Life Sciences Center
1201 East Rollins Road
University of Missouri-Columbia
Columbia, MO 65211-2060
E-mail: [email protected]
573-882-7064 (O)
http://digbio.missouri.edu
Outline
Evolution theory
Concept of phylogeny
Molecular clock
Types of trees
UPGMA
Parsimony
Maximum likelihood
An example for bird flu
Evolution
Many
theories of evolution
Basic
idea:
speciation events lead to creation of
different species
Any two species share a (possibly
distant) common ancestor
Evolutionary Events
Extinction: A new node u is created at the end of a
lineage, no new lineage is started from u
Speciation: A new node u is created at the end of a
lineage, and two new lineages are started from u
Hybridization: A new node u is created
when two lineages combine (diploid or polyploid)
when one lineage creates u and the new lineage from
u has double the number of homologs (autopolyploid)
Tree of Life
http://tolweb.org/
Toxonomy
Glycine max
Taxonomy ID: 3847
Genbank common name: soybean
Rank: species
Genetic code: Translation table 1 (Standard)
Mitochondrial genetic code: Translation table
1 (Standard)
Other names:
common name:soybeans
Lineage( full )
cellular organisms; Eukaryota;
Viridiplantae; Streptophyta;
Streptophytina; Embryophyta; Tracheophyta;
Euphyllophyta; Spermatophyta;
Magnoliophyta; eudicotyledons; core
eudicotyledons; rosids; eurosids I;
Fabales; Fabaceae; Papilionoideae;
Kingdom Plantae
Evolutionary
From
tree of plants
primitive more advanced
traits
moncot
Gymnosperms
Non-vascular _______
Dicot
Green
alga
ancestor
Flowers
Vascular
__________
Monocot vs. dicot plants (1)
FEATURE
MONOCOTS
DICOTS
1
2
Leaf venation
parallel
broad
Root system
Fibrous
Tap
In 3’s
In 4’s or 5’s
Scattered
Arranged in a
circle
Herbaceous
Either
Cotyledons
Number of floral
parts
Vascular bundle
position
Woody or
herbaceous
Monocot vs. dicot plants (2)
Number
of cotyledons: one vs. two
Monocot vs. dicot plants (3)
Leaf
venation pattern:
Monocot
Dicot
is parallel
is net pattern
Monocot vs. dicot plants (4)
Flower parts:
Monocot: in groups of three
Dicot: in groups of four or five
Outline
Evolution theory
Concept of phylogeny
Molecular clock
Types of trees
UPGMA
Parsimony
Maximum likelihood
An example for bird flu
Phylogenies (1)
A phylogeny is a tree that describes the
sequence of speciation events that lead to
the forming of a set of current day species
Aardvark Bison Chimp Dog
Elephant
Phylogenies (2)
Leafs - current day species
Nodes - hypothetical most recent common
ancestors
Edges length - “time” from one speciation
to the next
Primate Evolution
Tree Terminology
a
b
c
d
leaf
{a,b}
cluster
edge
{a,b,c}
{a,b,c,d}
internal node
root
Rooted/Unrooted Tree
Rooted trees
Single common ancestor
Requires more information
Unrooted trees
Objects are leaves
Internal nodes are some common ancestors
Insufficient information to tell whether not not a given
internal node is a common ancestor of any 2 leaves
Motivation
Understand the lineage of different species
Organizing principle to sort species into a
taxonomy
Understand how various functions evolved
Understand forces and constraints on evolution
Perform multiple sequence alignment
Predict gene function (phylogenetic footprint)
Tree Basis
Phylogenies are reconstructed based on
comparisons between present-day objects
Two main aspects
Topology
How its interior nodes connect to one another and
to the leaves
Distance
An estimate of the evolutionary distance between
the nodes
Assumptions
homology
reflects common ancestry
single common ancestor
treelike relationship exists
positional homology
independent processes
no reversals or convergence
molecular clock
Outline
Evolution theory
Concept of phylogeny
Molecular clock
Types of trees
UPGMA
Parsimony
Maximum likelihood
An example for bird flu
Molecular Clock Theory (1)
For any given protein, accepted mutations in the
amino acid sequence for the protein occur at
constant rate
Accepted = mutations that allow protein to
function without death
Implication
# of accepted mutations proportional to length of
time interval
i.e. relatively constant rate of accepted mutations
within a protein
Molecular Clock Theory (2)
Rate of accepted mutations maybe different
for different proteins (depending on their
tolerance for mutations)
Different parts of a protein may evolve at
different rates
Thus, if A and B differ by k accepted
mutations, then roughly k/2 mutations have
occurred since divergence
Science vol. 289
Outline
Evolution theory
Concept of phylogeny
Molecular clock
Types of trees
UPGMA
Parsimony
Maximum likelihood
An example for bird flu
Species/Gene Trees (1)
Species tree (how are my species
related?)
contains only one representative from each species
when did speciation take place?
all nodes indicate speciation events
Gene tree (how are my genes related?)
normally contains a number of genes from a single
species
nodes relate either to speciation or gene duplication
events
Species/Gene Trees (2)
• Your sequence data may not have the same
phylogenetic history as the species from
which they were isolated
•Different genes evolve at different speeds,
and there is always the possibility of
horizontal gene transfer (hybridization,
vector mediated DNA movement, or direct
uptake of DNA).
Morphological vs. Molecular
Classical
phylogenetic analysis:
morphological features
number of legs, lengths of legs, etc.
Modern
biological methods allow to
use molecular features
Gene sequences
Protein sequences
Dangers in Molecular
Phylogenies
Gene/protein sequence can be
homologous for different reasons:
Orthologs -- sequences diverged after a
speciation event
Paralogs -- sequences diverged after a
duplication event
Xenologs -- sequences diverged after a
horizontal transfer (e.g., by virus)
Ultrametric trees (1)
A metric on a set of objects O given by the
assignment of a real number d(x,y) to every pair
x,y in O
Ultrametric trees (2)
An ultrametric has to fulfill the additional
requirement
An ultrametric tree is characterized by the three
point condition
Additive Trees
Generalization of ultrametric trees
# of mutations were assumed to be proportional to
temporal distance of a node to ancestor
Also assumed, mutations took place at same rate in
all branches
Additive trees model different rates of
mutation along different branches
Additivity
In “real” tree, distances between
species are the sum of distances
between intermediate nodes
k
d (i , j ) a b
c
a
i
m
b
d (i , k ) a c
j
d ( j ,k ) b c
1
c = d (m, k ) (d (i , k ) d ( j , k ) d (i , j ))
2
Phylogeny Construction
parsimony
methods: fewest changes
likelihood
methods: maximize the
probability
distance
methods: based on pairwise
evolutionary distances (sequence
similarity, nucleotide composition,
etc.)
Outline
Evolution theory
Concept of phylogeny
Molecular clock
Types of trees
UPGMA
Parsimony
Maximum likelihood
An example for bird flu
UPGMA
UPGMA is the unweighted pair group
method with arithmetic mean
Distance matrix can come from (e.g) DNADNA hybridization, or be constructed from
sequence data etc.
Iteratively group the most closely related
groups. The average distance between
elements in two groups is the distance
between the groups.
UPGMA Procedure
1.
find closest pair of units (species, to start
with)
2.
connect this pair, defining an
evolutionary unit (branch)
3.
compute distances from the ancestor of
this unit to all other ungrouped units -Branch length is distance/2
4.
go back to #1 and repeat
Evolutionary distances
among primates (1)
nucleotide substitutions per 100 sites
Chimp
Gorilla
Orang
Rhesus
Human
1.45
1.51
2.98
7.51
Chimp
Gorilla
Orang
1.57
2.98
7.55
3.04
7.39
7.10
Humans and chimps are closest:
lump them and recompute distances
H
C
Evolutionary distances
among primates (2)
Gorilla
Orang
Rhesus
H-C
1.54
2.98
7.53
Gorilla
Orang
3.04
7.39
7.10
e.g., (H-C) to gorilla distance
G
= (H-G+C-G)/2
= (1.51+1.57)/2 = 1.54
Gorilla is closest to H-C clade
(((H, C), 1.45), G, 1.54)
H
C
Evolutionary distances
among primates (3)
Orang
Rhesus
H-C-G
3
7.46
Orang
7.10
R O
Human-ChimpGorilla is closer
to Orang than to Rhesus
G
H
C
UPGMA Clustering
Let Ci and Cj be clusters, define distance
between them to be
1
d (Ci ,C j )
d ( p, q )
| Ci || C j | pCi q C j
When we combine two cluster, Ci and Cj,
to form a new cluster Ck, then
d (C k , Cl )
| Ci | d (Ci , Cl ) | C j | d (C j , Cl )
| Ci | | C j |
UPGMA: conclusions
UPGMA gives
branch lengths or
evolutionary distances as well as
branching order
if (a big if) mutations occur at a
constant rate, we can estimate dates
of divergence from sequence
differences
Outline
Evolution theory
Concept of phylogeny
Molecular clock
Types of trees
UPGMA
Parsimony
Maximum likelihood
An example for bird flu
Possible Evolutionary Tree (1)
t1
1 three-taxa tree
t2
1*(2*3-3) = 3 four-taxa trees
t3
t1
t1
t3
t2
t2
t4
t3
t4
t2
t3
t1
t4
Possible Evolutionary Tree (2)
Taxa (n)
rooted
unrooted
(2n-3)!/(2n-2(n-2)!) (2n-5)!/(2n-3(n-3)!)
2
1
1
3
3
1
4
15
3
5
105
15
6
954
105
7
10,395
954
8
135,135
10,395
9
2,027,025
135,135
10
34,459,425
2,027,025
Possible Evolutionary Tree (3)
Taxa (n): 2
Taxa (n)
3
Unrooted/rooted
2
1/1
3
1/3
4
3/15
4
Maximum parsimony (1)
Minimizes the number of steps required to
generate the observed variation in the
sequences
Guaranteed to find the "best" tree - danger
of over-fitting the data
Columns representing greater variation
dominate
Works best for small, highly conserved
sequences
Maximum parsimony (2)
Begin with a multiple sequence alignment
Identify informative sites within the
sequences
Tree requiring smallest number of changes
identified
Repeat over all informative sites
Length = sum of the # of steps in each
branch
Choose tree with smallest length
Maximum parsimony (3)
Sequence position and character
Taxa
1
2
3
4
5
6
7
8
9
1
A
A
G
A
G
T
G
C
A
2
A
G
C
C
G
T
G
C
G
3
A
G
A
T
A
T
C
C
A
4
A
G
A
G
A
T
C
C
G
Maximum parsimony (4)
A
A ACGA
C
Tree 1
B
A
B ATGC
C GTGC
D
B
D GCAA
Tree 2
C
D
A
B
1 2 3 4 Total
Tree 3
D
C
Tree 1
1 2 1 2
6
Tree 2
2 2 1 2
7
Tree 3
2 1 1 1
5
Parsimony on
genomic sequence
Site
Human
Chimp
Gorilla
Orang
Recent branch
34
A
G
A
G
human-gorilla
560
C
C
A
A
human-chimp
1287
*
*
T
T
human-chimp
3057-
****
****
TAAT
TAAT
human-chimp
A
C
C
A
chimp-gorilla
3060
5153
human-chimp chimp-gorilla
12
3
human-gorilla
4
Outline
Evolution theory
Concept of phylogeny
Molecular clock
Types of trees
UPGMA
Parsimony
Maximum likelihood
An example for bird flu
Probabilistic Approaches
to Phylogeny (1)
Notation and definitions:
Let P(x•|T,t•) denote the probability of a set
of data given a tree, where:
x• denotes n sequences
T denotes a tree with n leaves with sequence j at
leaf j
t• denotes the edge lengths of the tree
The definition of P(x•|T,t•) depends on our
choice of model of evolution.
Probabilistic Approaches
to Phylogeny (2)
Let P(x|y,t) denote the probability that
sequence y evolves into x along an edge
of length t.
Assume that we can define P(x|y,t).
If we can do this for each edge of T we
can calculate the probability of T.
Probabilistic Approaches
to Phylogeny (3)
Ridiculously simplistic model of evolution:
1. Every site is independent
2. Deletions and insertions do not occur
3. Substitution accounts for all evolution
Let P(b|a, t) denote the probability of the
substitution of residue b for residue a over an
edge length of t.
Extending to aligned gapless sequences x and y,
P(x | y, t) = PuP(xu|yu, t), where u indexes over
sites
Probabilistic Approaches
to Phylogeny (4)
P(x1,.., x5|T, t•) =
P(x1|x4,t1)P(x2|x4,t2)P(x3|x5,t3)P(x4|x5,t4)P(x5)
root
x5
t4
x4
t2
t3
t1
x2
x1
x3
Divergence time estimates for major groups. Thick bars on branches denote fossil record of
fungi; solid circles are calibration points. From Heckman et al. 2001. Science 293: 1132
Confidence Assessment
Bootstrap values
Bootstrapping is a statistical
technique that can use random
resampling of data to determine
sampling error for tree topologies
Bootstrapping phylogenies
Characters are resampled with replacement to
create many bootstrap replicate data sets
Each bootstrap replicate data set is analysed (e.g.
with parsimony, distance, ML etc.)
Agreement among the resulting trees is
summarized with a majority-rule consensus tree
Frequencies of occurrence of groups, bootstrap
proportions (BPs), are a measure of support for
those groups
Outline
Evolution theory
Concept of phylogeny
Molecular clock
Types of trees
UPGMA
Parsimony
Maximum likelihood
An example for bird flu
Avian Influenza Viruses
Single strand
Negative RNA
Fragmented
Polymorphic
2003/2004 H5N1 Pandemic
Highly pathogenic; can be transmitted to
people and some cases are fatal
Virus: 8 genomic segments (PB1, PB2, PA,
HA, NP, NA, M, and NS) and genetic
reassortment
DNA sequences
o South China Agricultural University, China
o Genbank
Sources and evolution of flu viruses?
Outbreak History
HA
HA gene
Other 6 segments
(excluding PA) have a
similar tree structure
PA gene
Our analyses suggest that
2003-04 H5N1 pandemic
be caused by multiple
independent
transmissions with
multiple genotypes from
genetic reassortments.
(2001)
Reading Assignments
Suggested reading:
Chapter 14 in “Warren J. Ewens and
Gregory R. Grant: Statistical Methods in
Bioinformatics – An Introduction. Springer.
2001”.
Optional reading:
Chapter 17 in “Dan Gusfield: Algorithms on
Strings, Trees, and Sequences. Cambridge
University Press. 1997”.
Project Assignment
Develop a program that implement the
UPGMA algorithm
1. Modify your code in the assignment for global
alignment.
2. Use edit distance (match 1; otherwise 0) with
gap penalty –1 – k (k is gap size) for pairwise
sequence alignment.
3. Use the sequence identity as the tree
distance between leaves.
4. Input format: FASTA in one file.
5. Output format: (((a, b), d1), c, d2))…