Phylogenetic trees


CS 177
Phylogenetics I
Taxonomy and phylogenetics
Phylogenetic trees
Cladistic versus phenetic analyses
Model of sequence evolution
Phylogenetic trees and networks
Taxonomy and
phylogenetics
Phylogenetic trees
Cladistic versus
phenetic analyses
Homology and
homoplasy
Cladistic and phenetic methods
Computer software and demos
Phylogenetic Inference I
Recommended readings
A science primer: Phylogenetics
http://www.ncbi.nlm.nih.gov/About/primer/phylo.html
(very) basic
Brown, S.M. (2000) Bioinformatics, Eaton Publishing, pp. 145-160
Brown, S.M.: Molecular Phylogenetics
www.med.nyu.edu/rcr/rcr/course/PPT/phylogen.ppt
Hillis, D.M., Moritz, C. & Mable, B.K. (1996) Molecular Systematics,
2nd edition, Sinauer Associates, 655 pp.
Mount, D.W. (2001) Bioinformatics,
Cold Spring Harbor Lab Press, pp.237-280
advanced
Evolution
The theory of evolution is the foundation upon which all of modern biology is built
From anatomy to behavior to genomics, the scientific method requires an appreciation of
changes in organisms over time
It is impossible to evaluate relationships among gene
sequences without taking into consideration the
way these sequences have been modified over time
Ernst Haeckel (1834-1919)
Relationships
Similarity searches and multiple alignments of sequences naturally lead to the question
“How are these sequences related?”
and more generally:
“How are the organisms from which these sequences come related?”
Classifying Organisms
Nomenclature is the science of naming organisms
Evolution has created an enormous diversity, so how do we deal with it?
Names allow us to talk about groups of organisms.
- Scientific names were originally descriptive phrases; not practical
- Binomial nomenclature
> Developed by Linnaeus, a Swedish naturalist
> Names are in Latin, formerly the language of
science
> binomials - names consisting of two parts
> The generic name is a noun.
> The epithet is a descriptive adjective.
- Thus a species' name is two words
e.g. Homo sapiens
Carolus Linnaeus (1707-1778)
Classifying Organisms
Taxonomy is the science of the classification of organisms
Taxonomy deals with the naming and ordering of taxa.
The Linnaean hierarchy:
1. Kingdom
2. Division
3. Class
4. Order
5. Family
6. Genus
7. Species
Taxonomic classification of man (Homo sapiens):
Superkingdom: Eukaryota
Kingdom: Metazoa
Phylum: Chordata
Class: Mammalia
Order: Primates
Family: Hominidae
Genus: Homo
Species: sapiens
Subspecies: sapiens
Classifying Organisms
Systematics is the science of the relationships of organisms
Systematics is the science of how organisms are related and the evidence for those
relationships
Systematics is divided primarily into phylogenetics and taxonomy
Speciation -- the origin of new species from previously existing ones
- anagenesis - one species changes into another over time
- cladogenesis - one species splits to make two
Reconstruct evolutionary history
Phylogeny
Phylogenetics
Phylogenetics is the science of the pattern of evolution.
A. Evolutionary biology is the study of the processes that generate diversity, while
phylogenetics is the study of the pattern of diversity produced by
those processes.
B. The central problem of phylogenetics:
1. How do we determine the relationships between species?
2. Use evidence from shared characteristics, not differences
3. Use homologies, not analogies
4. Use derived condition, not ancestral
a. synapomorphy - shared derived characteristic
b. plesiomorphy - ancestral characteristic
C. Cladistics is phylogenetics based on synapomorphies.
1. Cladistic classification creates and names taxa based only on synapomorphies.
2. This is the principle of monophyly
3. monophyletic, paraphyletic, polyphyletic
4. Cladistics is now the preferred approach to phylogeny
[Figure: the phylogeny and classification of life as proposed by Haeckel (1866)]
Phylogenetics
Evolutionary theory states that groups of similar organisms are descended
from a common ancestor.
Phylogenetic systematics is a method of taxonomic classification that groups
organisms according to their evolutionary history.
It was developed by Hennig, a German entomologist, in 1950.
Willi Hennig (1913-1976)
Phylogenetics
Phylogenetics is the science of the pattern of evolution
Evolutionary biology versus phylogenetics
- Evolutionary biology is the study of the processes that generate diversity
- Phylogenetics is the study of the pattern of diversity produced by those processes
Phylogenetics
Who uses phylogenetics? Some examples:
Evolutionary biologists (e.g. reconstructing tree of life)
Systematists (e.g. classification of groups)
Anthropologists (e.g. origin of human populations)
Forensics (e.g. transmission of HIV virus to a rape victim)
Parasitologists (e.g. phylogeny of parasites, co-evolution)
Epidemiologists (e.g. reconstruction of disease transmission)
Genomics/Proteomics (e.g. homology comparison of new proteins)
Phylogenetic trees
The central problem of phylogenetics:
how do we determine the relationships between taxa?
In phylogenetic studies, the most convenient way of presenting evolutionary
relationships among a group of organisms is the phylogenetic tree
Phylogenetic trees
Node: a branchpoint in a tree (a presumed ancestral OTU)
Branch: defines the relationship between the taxa in terms of descent and ancestry
Topology: the branching patterns of the tree
Branch length (scaled trees only): represents the number of changes that have occurred
in the branch
Root: the common ancestor of all taxa
Clade: a group of two or more taxa or DNA sequences that includes both their common
ancestor and all of its descendants
Operational Taxonomic Unit (OTU): taxonomic level of sampling selected by the user
to be used in a study, such as individuals, populations, species, genera, or bacterial strains
[Figure: example tree for Species A to E with a branch, a node, the root and a clade labelled]
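The terms above can be made concrete with a small sketch. The nested-tuple encoding, the helper names (`leaves`, `clades`, `is_monophyletic`) and the five-taxon topology are illustrative assumptions, not taken from the lecture figure.

```python
# A rooted tree as nested tuples: a string is a terminal node (an OTU),
# a tuple is an internal node whose items are its descendant branches.
# The topology is hypothetical and chosen only for illustration.
tree = ((("A", "B"), "C"), ("D", "E"))

def leaves(node):
    """Return the set of OTUs descended from a node."""
    if isinstance(node, str):
        return {node}
    result = set()
    for child in node:
        result |= leaves(child)
    return result

def clades(node):
    """List the leaf sets of all internal nodes (one clade per node)."""
    if isinstance(node, str):
        return []
    found = [frozenset(leaves(node))]
    for child in node:
        found.extend(clades(child))
    return found

def is_monophyletic(tree, group):
    """A group is a clade if some internal node has exactly these leaves."""
    return frozenset(group) in clades(tree)

print(sorted(sorted(c) for c in clades(tree)))
print(is_monophyletic(tree, {"A", "B", "C"}))
print(is_monophyletic(tree, {"B", "C"}))
```

In this encoding the root is simply the outermost tuple, and every internal node defines one clade: the ancestor plus all of its descendants.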
Phylogenetic trees
There are many ways of drawing a tree
[Figure: several equivalent drawings of the same tree for taxa A to E, joined by "=" signs; one dimension of the drawing is labelled "no meaning"]
[Figure: a bifurcation and a trifurcation, drawn for taxa A to E and marked as not equivalent]
Bifurcation versus Multifurcation (e.g. Trifurcation)
Multifurcation (also called polytomy): a node in a tree that connects more than three
branches. A multifurcation may represent a lack of resolution because of too few data
available for inferring the phylogeny (in which case it is said to be a soft multifurcation)
or it may represent the hypothesized simultaneous splitting of several lineages (in
which case it is said to be a hard multifurcation).
Phylogenetic trees
Trees can be scaled or unscaled (with or without branch lengths)
[Figure: trees for taxa A to E drawn without and with branch lengths; the scaled versions carry a bar marking one unit of branch length]
Phylogenetic trees
Trees can be unrooted or rooted
[Figure: an unrooted tree for taxa A to D next to rooted trees for the same taxa, with the root marked]
Phylogenetic trees
Trees can be unrooted or rooted
[Figure: an unrooted tree for taxa A to D with branches numbered 1 to 5, and five rooted trees labelled Rooted tree 1 to Rooted tree 5]
These trees show five different evolutionary relationships among the taxa!
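One way to see where the five rooted trees come from is to place the root on each branch of the unrooted tree in turn. The sketch below assumes the unrooted topology pairs A with B and C with D; the internal node names `x` and `y` and the helper names are hypothetical.

```python
# Unrooted four-taxon tree: A and B join internal node "x",
# C and D join internal node "y", and x-y is the internal branch.
# The topology and the node names x and y are assumptions of this sketch.
edges = [("x", "y"), ("A", "x"), ("D", "y"), ("C", "y"), ("B", "x")]

def neighbours(node):
    """All nodes directly connected to the given node."""
    out = []
    for u, v in edges:
        if u == node:
            out.append(v)
        elif v == node:
            out.append(u)
    return out

def subtree(node, parent):
    """Newick-style string for the part of the tree on this side of parent."""
    children = sorted(subtree(n, node) for n in neighbours(node) if n != parent)
    if not children:
        return node
    return "(" + ",".join(children) + ")"

def root_on(edge):
    """Place the root in the middle of one branch and write out the rooted tree."""
    u, v = edge
    halves = sorted([subtree(u, v), subtree(v, u)])
    return "(" + ",".join(halves) + ");"

rooted = [root_on(e) for e in edges]
for newick in rooted:
    print(newick)
```

Each of the five branches gives a different rooted tree, which is why one unrooted four-taxon topology corresponds to five rooted ones.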
Phylogenetic trees
Possible evolutionary trees
Taxa (n)   Unrooted/rooted
2          1/1
3          1/3
4          3/15
Phylogenetic trees
Possible evolutionary trees
Number of rooted trees: $(2n-3)! / (2^{n-2}\,(n-2)!)$
Number of unrooted trees: $(2n-5)! / (2^{n-3}\,(n-3)!)$
Taxa (n)   Rooted        Unrooted
2          1             1
3          3             1
4          15            3
5          105           15
6          945           105
7          10,395        945
8          135,135       10,395
9          2,027,025     135,135
10         34,459,425    2,027,025
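The counts can be checked directly from the two formulas, with the exponents read as powers of two: (2n-3)!/(2^(n-2) (n-2)!) for rooted trees and (2n-5)!/(2^(n-3) (n-3)!) for unrooted trees. The function names below are illustrative, and the sketch covers strictly bifurcating trees only.

```python
from math import factorial

def rooted_trees(n):
    """Number of rooted bifurcating trees for n >= 2 taxa."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 3 taxa (1 for n = 2)."""
    if n == 2:
        return 1
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))

for n in range(2, 11):
    print(n, rooted_trees(n), unrooted_trees(n))
```

The number of rooted trees for n taxa equals the number of unrooted trees for n + 1 taxa, because choosing a root is equivalent to attaching one extra branch.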
Phylogenetic trees
How to root?
Use information from ancestors
In most cases not available
[Figure: unrooted tree for taxa A to D with branches numbered 1 to 5]
Phylogenetic trees
How to root?
Statistical tools can root trees automatically (e.g. mid-point rooting)
[Figure: unrooted tree for taxa A to D with branch lengths 10, 3, 2, 2 and 5]
d(A,D) = 10 + 3 + 5 = 18
Midpoint = 18/2 = 9
This must involve assumptions … BEWARE!
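Mid-point rooting can be sketched as: find the two most distant taxa, then place the root halfway along the path between them. The example reuses the branch lengths of the slide (10, 3, 2, 2, 5, with d(A,D) = 10 + 3 + 5). Which of the remaining branches belongs to B and which to C is an assumption, as are the internal node names `x` and `y`.

```python
from itertools import combinations

# Branch lengths as read from the slide; the assignment of the two
# length-2 branches to B and C is an assumption of this sketch.
branches = {("A", "x"): 10, ("B", "x"): 2, ("x", "y"): 3,
            ("C", "y"): 2, ("D", "y"): 5}

def path(start, goal, visited=()):
    """Return the steps (node, next_node, length) leading from start to goal."""
    if start == goal:
        return []
    for (u, v), length in branches.items():
        for a, b in ((u, v), (v, u)):
            if a == start and b not in visited:
                rest = path(b, goal, visited + (start,))
                if rest is not None:
                    return [(a, b, length)] + rest
    return None  # dead end

leaves = ["A", "B", "C", "D"]
# The pair of taxa separated by the greatest path length
far = max(combinations(leaves, 2),
          key=lambda pair: sum(step[2] for step in path(*pair)))
steps = path(*far)
total = sum(length for _, _, length in steps)
half = total / 2

# Walk along the longest path until half of its length is used up.
walked = 0
for a, b, length in steps:
    if walked + length >= half:
        root_branch, offset = (a, b), half - walked
        break
    walked += length

print(far, total, half, root_branch, offset)
```

With these lengths the root falls on the long branch leading to A. This only recovers the true root if rates of change are roughly equal across lineages, which is the assumption the slide warns about.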
Phylogenetic trees
How to root?
Using “outgroups”
[Figure: unrooted tree for taxa A to D with branches numbered 1 to 5 and an outgroup marked]
- the outgroup should be a taxon known to be less closely related to the rest of
the taxa (ingroups)
- it should ideally be as closely related as possible to the rest of the taxa while
still satisfying the above condition
Phylogenetic trees
Exercise: rooted/unrooted; scaled/unscaled
[Figure: example trees to classify as rooted or unrooted and as scaled or unscaled]
Phylogenetics
What are useful characters?
Use homologies, not analogies!
- Homology: common ancestry of two or more character states
- Analogy: similarity of character states not due to shared ancestry
- Homoplasy: a collection of phenomena that leads to similarities in character states
for reasons other than inheritance from a common ancestor
(e.g. convergence, parallelism, reversal)
Homoplasy is a huge problem in morphology data sets!
But in molecular data sets, too!
Cactaceae (cactus spines are modified leaves)
Euphorbiaceae (euphorb spines are modified shoots)
Phylogenetics
Molecular data and homoplasy
[Alignment, positions 260 to 320; the last row is the consensus]
0841r : CCTTCAATTTTTATT-----------------------AGAGTTTTAGGAGAAATAAGTATGTG : 272
0992r : CCTCCAATTTTTATTAGCTTGCCTACTCCTTTGGGCACAGAGTTTTAGGAGAAATAAGTATGTG : 213
3803r : CCTCCAATTTTTATTAGCTTGCCTACTCCTTTGGGCACAGAGTTTTAGGAGAAATAAGTATGTG : 305
4062r : CCTCCAATTTTTATTAGCTTGCCTACTCCTTTGGGAACAGAGTTTTAGGAGAAATAAGTATGTG : 319
3802r : CCTCCAATTTTTATTAGTTTGCCTACTCCTTTGGGCACAGAGTTTTAGGAGAAATAAGTATGTG : 282
ph2f  : CCTCCAATTTTTATTAGCTTGCCTACTCCTTTGGGCACAGAGTTTTAGGAGAAATAAGTATGTG : 306
        CCTcCAATTTTTATTag ttgcctactcctttggg acAGAGTTTTAGGAGAAATAAGTATGTG
gene sequences represent character data
characters are positions in the sequence (not all workers agree; some say one
gene is one character)
character states are the nucleotides in the sequence (or amino acids in the case
of proteins)
Problems:
the probability that two nucleotides are the same just by chance mutation is 25%
what to do with insertions or deletions (which may themselves be characters)
homoplasy in sequences may cause alignment errors
Phylogenetics
Molecular data and homoplasy: Orthologs vs. Paralogs
When comparing gene sequences, it is important to distinguish between genes that
correspond to each other across organisms and genes that are merely similar
Orthologs are homologous genes in different species that diverged through speciation;
they usually retain the same function
Paralogs are homologous genes that are the result of a gene duplication
A phylogeny that mixes orthologs and paralogs is likely to misrepresent the species relationships
Sometimes phylogenetic analysis is the best way to determine if a new gene is an
ortholog or paralog to other known genes
Phylogenetics
What are useful characters?
Use derived condition, not ancestral
- Synapomorphy (shared derived character): homologous traits share the same
character state because it originated in their immediate common ancestor
- Plesiomorphy (shared ancestral character): homologous traits share the same
character state because they are inherited from a common distant ancestor
[Figure: tree illustrating analogy, autapomorphy (unique derived character), plesiomorphy (shared ancestral character) and synapomorphy (shared derived character)]
Phenetics versus cladistics
Within the field of taxonomy there are two different
methods and philosophies of building phylogenetic trees:
cladistic and phenetic
Phenetic methods construct trees (phenograms) by considering the current
states of characters without regard to the evolutionary history that brought the
species to their current phenotypes;
phenograms are based on overall similarity
Cladistic methods construct trees (cladograms) that rely on assumptions about
ancestral relationships as well as on current data;
cladograms are based on character evolution (e.g. shared derived characters)
Cladistics is becoming the method of choice; it is considered to be more powerful
and to provide more realistic estimates. However, it is slower than phenetic algorithms
Phenetics vs. cladistics
An example
            characteristics                                                                  identity
critter A   4 limbs   meta. kidney   hair               endothermy   vivip.   no cloaca   placental
critter B   4 limbs   meta. kidney   hair               endothermy   ovip.    cloaca      echidna
critter C   4 limbs   meta. kidney   feathers           endothermy   ovip.    cloaca      bird
ancestor    4 limbs   meta. kidney   no hair/feathers   ectothermy   ovip.    cloaca      turtle
Phenetics vs. cladistics
Phenetic (overall similarity)
(character table for critters A, B, C and the ancestor as on the previous slide)
[Figure: pairwise overall similarity among critters A, B and C, with values 4, 3 and 5, and the resulting phenogram]
Phenetics vs. cladistics
Cladistics (character evolution; e.g. shared derived characters)
(character table for critters A, B, C and the ancestor as on the previous slides)
[Figure: pairwise counts of shared derived characters among critters A, B and C, with values 2, 1 and 1, and the resulting cladogram]
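The difference between the two approaches can be reproduced by counting. The sketch below scores each pair of critters twice: once by overall similarity (all matching character states) and once by shared derived states only (matches that differ from the ancestor). The column assignment is my reading of the flattened table, so treat it, along with the short state labels, as an assumption.

```python
from itertools import combinations

# Character states in the order: limbs, kidney, body covering,
# thermoregulation, reproduction, cloaca.
# The column assignment is an interpretation of the slide's table.
taxa = {
    "A": ["4", "meta", "hair",     "endo", "vivip", "no"],   # placental
    "B": ["4", "meta", "hair",     "endo", "ovip",  "yes"],  # echidna
    "C": ["4", "meta", "feathers", "endo", "ovip",  "yes"],  # bird
}
ancestor = ["4", "meta", "none", "ecto", "ovip", "yes"]      # turtle

def overall_similarity(x, y):
    """Phenetic score: number of characters with identical states."""
    return sum(a == b for a, b in zip(taxa[x], taxa[y]))

def shared_derived(x, y):
    """Cladistic score: identical states that differ from the ancestor."""
    return sum(a == b and a != anc
               for a, b, anc in zip(taxa[x], taxa[y], ancestor))

for x, y in combinations("ABC", 2):
    print(x, y, overall_similarity(x, y), shared_derived(x, y))
```

Overall similarity is highest for B and C (echidna and bird), mostly because both retain ancestral states, whereas shared derived characters group A with B (placental and echidna).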
Model of sequence evolution
[Figure: the sequence alignment of 0841r, 0992r, 3803r, 4062r, 3802r and ph2f shown under "Molecular data and homoplasy"]
The problem
- A basic process in the evolution of a sequence is change in that sequence over time
- Now we are interested in a mathematical model to describe that
- Such a model is essential for understanding the mechanisms of change and is required to
estimate both the rate of evolution and the evolutionary history of sequences
Model of sequence evolution
Nucleotide = base + sugar + phosphate
Pyrimidines (C4N2H4): thymine, cytosine
Purines (C5N4H4): adenine, guanine
[Figure: chemical structure of a nucleotide chain with the sugar, the phosphate group and the 5' and 3' ends labelled]
Models of sequence evolution
Examples
Jukes-Cantor model (1969)

[Figure: diagram of substitutions among A, G, C and T]
All substitutions have an equal probability and
base frequencies are equal
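Under these assumptions the observed proportion of differing sites p underestimates the true number of substitutions per site, and the standard Jukes-Cantor correction is d = -(3/4) ln(1 - 4p/3). The sketch below applies it to two short made-up sequences; the function names and the sequences are hypothetical.

```python
from math import log

def p_distance(seq1, seq2):
    """Proportion of differing sites, ignoring positions with a gap."""
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    return sum(a != b for a, b in pairs) / len(pairs)

def jc69_distance(p):
    """Jukes-Cantor corrected distance; undefined for p >= 0.75."""
    if p >= 0.75:
        raise ValueError("sequences too divergent for the JC69 correction")
    return -0.75 * log(1 - 4 * p / 3)

# Two made-up sequences of 20 sites that differ at 2 positions
s1 = "ACGTACGTACGTACGTACGT"
s2 = "ACGTACGAACGTACGTACGA"

p = p_distance(s1, s2)
print(p, jc69_distance(p))
```

The corrected distance is always at least as large as p, because some sites may have changed more than once.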
Models of sequence evolution
Examples
Felsenstein (1981)

[Figure: diagram of substitutions among A, G, C and T]
All substitutions have an equal probability, but there are unequal
base frequencies
Models of sequence evolution
Examples
Kimura 2 parameter model (K2P) (1980)
[Figure: diagram of substitutions among the purines (A, G) and the pyrimidines (C, T)]
Transitions and transversions have different probabilities
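A transition swaps a purine for a purine or a pyrimidine for a pyrimidine; a transversion swaps one class for the other. With P and Q the observed proportions of transitions and transversions, the usual K2P distance is d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q). The following sketch, with hypothetical function names and made-up sequences, classifies the differences and applies that formula.

```python
from math import log

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify(a, b):
    """Label a pair of nucleotides as same, transition or transversion."""
    if a == b:
        return "same"
    both_purines = a in PURINES and b in PURINES
    both_pyrimidines = a in PYRIMIDINES and b in PYRIMIDINES
    return "transition" if both_purines or both_pyrimidines else "transversion"

def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance from two aligned sequences."""
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a in "ACGT" and b in "ACGT"]
    n = len(pairs)
    P = sum(classify(a, b) == "transition" for a, b in pairs) / n
    Q = sum(classify(a, b) == "transversion" for a, b in pairs) / n
    return -0.5 * log(1 - 2 * P - Q) - 0.25 * log(1 - 2 * Q)

# Made-up sequences: one transition (A to G) and one transversion (A to C)
s1 = "AAAAAAAAAA"
s2 = "GAAAAAAAAC"
print(k2p_distance(s1, s2))
```

Positions with gaps or ambiguous symbols are skipped here, which is one simple choice among several.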
Models of sequence evolution
Examples
Hasegawa, Kishino & Yano (HKY) (1985)
[Figure: diagram of substitutions among the purines (A, G) and the pyrimidines (C, T)]
Transitions and transversions have different probabilities,
base frequencies are unequal
Models of sequence evolution
Examples
General time reversible model (GTR)

[Figure: diagram of substitutions among A, C, G and T]
Different probabilities for each substitution,
base frequencies are unequal
Models of sequence evolution

[Figure: the substitution diagrams of the Jukes-Cantor, Felsenstein, K2P, HKY and GTR models side by side]
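The five models can be read as one family: each rate from base i to base j is an exchangeability term times the frequency of the target base j, and the models differ only in which terms are allowed to vary. The sketch below builds the rate matrices under that common parameterisation. The base frequencies, the transition/transversion ratio `kappa` and the six GTR exchangeabilities are made-up values, and all names are illustrative.

```python
BASES = "ACGT"
TRANSITIONS = {frozenset("AG"), frozenset("CT")}

def rate_matrix(freqs, exchange):
    """Q[i][j] = exchange(i, j) * freqs[j] for i != j; each row sums to zero."""
    q = {}
    for i in BASES:
        row = {j: exchange(i, j) * freqs[j] for j in BASES if j != i}
        row[i] = -sum(row.values())
        q[i] = row
    return q

equal = dict.fromkeys(BASES, 0.25)
skewed = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}  # hypothetical frequencies

def same_rate(i, j):
    """All substitutions equally likely."""
    return 1.0

def ts_tv(kappa):
    """Transitions are kappa times as likely as transversions."""
    return lambda i, j: kappa if frozenset((i, j)) in TRANSITIONS else 1.0

# Six hypothetical exchangeabilities, one per unordered pair of bases
gtr_rates = {frozenset("AC"): 1.0, frozenset("AG"): 4.0, frozenset("AT"): 0.5,
             frozenset("CG"): 0.8, frozenset("CT"): 5.0, frozenset("GT"): 1.0}

models = {
    "JC69": rate_matrix(equal, same_rate),
    "F81": rate_matrix(skewed, same_rate),
    "K2P": rate_matrix(equal, ts_tv(2.0)),
    "HKY": rate_matrix(skewed, ts_tv(2.0)),
    "GTR": rate_matrix(skewed, lambda i, j: gtr_rates[frozenset((i, j))]),
}

for name, q in models.items():
    print(name, [round(q["A"][j], 3) for j in BASES])
```

Matrices built this way are not scaled to one expected substitution per unit time; real software normally adds that normalisation.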
More models of sequence evolution …
Currently, there are more than 60 models described
- plus gamma distribution and invariable sites
- accuracy of models rapidly decreases for highly divergent sequences
- problem: more complicated models have more parameters to estimate, so their estimates tend to be less precise (and slower to compute)
How to pick an appropriate model?
- use a likelihood ratio test
- implemented in Modeltest 3.06 (Posada & Crandall, 1998)
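A likelihood ratio test compares two nested models through the statistic 2 (lnL_complex - lnL_simple), which is compared with a chi-square distribution whose degrees of freedom equal the number of extra free parameters. The sketch below is not Modeltest's code; the log-likelihood values are made up, and the function names are hypothetical.

```python
# Chi-square critical values at the 5% level for 1 to 4 degrees of freedom
CHI2_CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}

def likelihood_ratio_statistic(lnl_simple, lnl_complex):
    """Twice the gain in log-likelihood from the more complex model."""
    return 2.0 * (lnl_complex - lnl_simple)

def prefer_complex(lnl_simple, lnl_complex, extra_parameters):
    """True if the complex model fits significantly better at the 5% level."""
    stat = likelihood_ratio_statistic(lnl_simple, lnl_complex)
    return stat > CHI2_CRITICAL_05[extra_parameters]

# Hypothetical log-likelihoods of the same tree under JC69 and K2P;
# K2P has one extra free parameter (the transition/transversion ratio).
lnl_jc69 = -2150.3
lnl_k2p = -2141.7

print(likelihood_ratio_statistic(lnl_jc69, lnl_k2p))
print(prefer_complex(lnl_jc69, lnl_k2p, extra_parameters=1))
```

If the more complex model does not improve the fit significantly, the simpler one is kept, which guards against adding parameters that the data cannot support.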