Advanced Methods in
Reconstructing Phylogenetic
Relationships
2010 Practical Course:
March 8th to 13th, 2010, Rio de Janeiro
Darwin’s letter to
Thomas Huxley 1857
• The time will come I
believe, though I shall not
live to see it, when we
shall have fairly true
genealogical (phylogenetic)
trees of each great
kingdom of nature
Haeckel’s pedigree of man
Aims of the course:
• To introduce the theory and practice of
phylogenetic inference from molecular
data
• To introduce some of the most useful
methods and computer programmes
• To encourage a critical attitude to data
and its analysis
Some definitions
Richard Owen
Owen’s definition of homology
• Homologue: the same organ under every
variety of form and function (true or
essential correspondence)
• Analogy: superficial or misleading similarity
Richard Owen 1843
Charles Darwin
Darwin and homology
• “The natural system is based upon descent with
modification .. the characters that naturalists
consider as showing true affinity (i.e. homologies)
are those which have been inherited from a common
parent, and, in so far as all true classification is
genealogical; that community of descent is the
common bond that naturalists have been seeking”
Charles Darwin, Origin of species 1859 p. 413
Homology is...
• Homology: similarity that is the result
of inheritance from a common ancestor
• The identification and analysis of
homologies is central to phylogenetic
systematics
Phylogenetic systematics
• Sees homology as evidence of common
ancestry
• Uses tree diagrams to portray relationships
based upon recency of common ancestry
• Monophyletic groups (clades) - contain
species which are more closely related to
each other than to any outside of the group
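The definition of a monophyletic group can be checked mechanically on a rooted tree. A minimal sketch, assuming the tree is written as nested Python tuples; the tree and taxon names below are hypothetical and chosen only for illustration:

```python
def leaves(tree):
    """Return the set of tip labels under a node (a label or a tuple of subtrees)."""
    if isinstance(tree, str):
        return {tree}
    result = set()
    for child in tree:
        result |= leaves(child)
    return result


def clades(tree):
    """Yield the tip set of every node in a rooted tree, tips included."""
    yield frozenset(leaves(tree))
    if not isinstance(tree, str):
        for child in tree:
            yield from clades(child)


def is_monophyletic(tree, group):
    """True if some node's descendants are exactly the taxa in `group`."""
    return frozenset(group) in set(clades(tree))


# Hypothetical rooted tree: a bacterial outgroup, then archaea and eukaryotes.
tree = ("bacterium", (("archaeon_1", "archaeon_2"), ("eukaryote_1", "eukaryote_2")))

print(is_monophyletic(tree, {"eukaryote_1", "eukaryote_2"}))  # True
print(is_monophyletic(tree, {"bacterium", "archaeon_1"}))     # False
```

The second group fails because no single node has exactly those two tips as its descendants.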
Cladograms and phylograms
[Figure: the same tree of Bacterium 1 to 3 and Eukaryote 1 to 4, drawn as a cladogram and as a phylogram]
Cladograms show branching order; branch lengths are meaningless
Phylograms show branch order and branch lengths
Rooting using an outgroup
[Figure: an unrooted tree of archaea and eukaryotes, and the same tree rooted by a bacteria outgroup, with the root and two monophyletic groups marked]
What kind of data?
Fossil skulls
Family tree for
humans
Microbial morphologies - some are complex but
many are simple - for example look at a drop
of lake water:
Linus Pauling
Molecules as documents of
evolutionary history
• “We may ask the question where in the now
living systems the greatest amount of
information of their past history has survived
and how it can be extracted”
• “Best fit are the different types of
macromolecules (sequences) which carry the
genetic information”
Small subunit ribosomal RNA
18S or 16S rRNA
An alignment involves hypotheses of
positional homology between bases or
amino acids
<---------------(--------------------HELIX 19---------------------)
<---------------(22222222-000000-111111-00000-111111-0000-22222222
Thermus ruber    UCCGAUGC-UAAAGA-CCGAAG=CUCAA=CUUCGG=GGGU=GCGUUGGA
Th. thermophilus UCCCAUGU-GAAAGA-CCACGG=CUCAA=CCGUGG=GGGA=GCGUGGGA
E.coli           UCAGAUGU-GAAAUC-CCCGGG=CUCAA=CCUGGG=AACU=GCAUCUGA
Ancyst.nidulans  UCUGUUGU-CAAAGC-GUGGGG=CUCAA=CCUCAU=ACAG=GCAAUGGA
B.subtilis       UCUGAUGU-GAAAGC-CCCCGG=CUCAA=CCGGGG=AGGG=UCAUUGGA
Chl.aurantiacus  UCGGCGCU-GAAAGC-GCCCCG=CUUAA=CGGGGC=GAGG=CGCGCCGA
match            **        ***        * ** ** *                 **
Alignment of 16S rRNA sequences from different bacteria
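The match row of such an alignment marks the columns where every sequence carries the same base. A minimal sketch of how that row can be computed, using the six helix 19 sequences from the slide; the function name `match_line` and the treatment of `-` and `=` as separators are assumptions made for this example:

```python
def match_line(sequences, gap_chars="-="):
    """Return a string with '*' under columns where all sequences share one base."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("aligned sequences must have equal length")
    marks = []
    for column in zip(*sequences):
        same = len(set(column)) == 1 and column[0] not in gap_chars
        marks.append("*" if same else " ")
    return "".join(marks)


alignment = {
    "Thermus ruber":    "UCCGAUGC-UAAAGA-CCGAAG=CUCAA=CUUCGG=GGGU=GCGUUGGA",
    "Th. thermophilus": "UCCCAUGU-GAAAGA-CCACGG=CUCAA=CCGUGG=GGGA=GCGUGGGA",
    "E.coli":           "UCAGAUGU-GAAAUC-CCCGGG=CUCAA=CCUGGG=AACU=GCAUCUGA",
    "Ancyst.nidulans":  "UCUGUUGU-CAAAGC-GUGGGG=CUCAA=CCUCAU=ACAG=GCAAUGGA",
    "B.subtilis":       "UCUGAUGU-GAAAGC-CCCCGG=CUCAA=CCGGGG=AGGG=UCAUUGGA",
    "Chl.aurantiacus":  "UCGGCGCU-GAAAGC-GCCCCG=CUUAA=CGGGGC=GAGG=CGCGCCGA",
}

line = match_line(list(alignment.values()))
for name, seq in alignment.items():
    print(f"{name:<17}{seq}")
print(f"{'match':<17}{line}")
```

Each starred column is a position where the hypothesis of positional homology is backed by identity across all six sequences; the unstarred columns are where the alignment has to be judged.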
Automated Progressive Alignment
of Sequences
• Essentially a heuristic method and as such
is not guaranteed to find the ‘optimal’
alignment.
• Most successful implementation is Clustal
(Des Higgins). This software is cited
3,000 times per year in the scientific
literature.
Des Higgins is
very famous
Automatic alignment programs
• There are a variety available:
• Clustal W 2.0, Muscle, T-Coffee are
among the most popular
• All are easy to use and relatively quick
(but this depends on how many sequences
and how similar they are).
• Output files are produced which can be
read by most phylogenetic analysis
programmes.
• Can fail badly with highly divergent
sequences.
James McInerney
is not here
• But he has produced a nice lecture on
some background issues for multiple
alignment
• This can be downloaded from the embo
world 2009 directory on our lab
webpage:
http://research.ncl.ac.uk/microbial_eukaryotes/index.html
Advice on alignments
• Treat cautiously
• Can be improved by eye (usually)
• Often helps to have colour-coding
• Depending on the use, the user should be
able to make a judgement on those regions
that are reliable or not
• For phylogeny reconstruction, only use
those positions whose hypothesis of
positional homology is unimpeachable (or do
experiments)
Patterns in sequence data
Exploring patterns in sequence data 1:
• Which sequences should we use?
• Do the sequences contain phylogenetic
signal for the relationships of interest?
(might be too conserved or too variable)
• Are there features of the data which
might mislead us about evolutionary
relationships?
Is there a molecular clock?
• The idea of a molecular clock was
initially suggested by Zuckerkandl and
Pauling in 1962
• They noted that rates of amino acid
replacements in animal haemoglobins
were roughly proportional to time - as
judged against the fossil record
Rate Heterogeneity
Rates of amino acid replacement in
different proteins
There is no universal molecular
clock
• The initial proposal saw the clock as a Poisson
process with a constant rate
• Now known to be more complex - differences in
rates occur for:
– different sites in a molecule
– different genes
– different regions of genomes
– different genomes in the same cell
– different taxonomic groups for the same gene
• There is no universal molecular clock
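What a constant-rate Poisson clock predicts can be written down in a few lines. A minimal sketch, assuming a single rate per site and ignoring multiple hits; the rate, time and sequence length in the usage line are hypothetical values, not data from the course:

```python
import math


def expected_differences(rate, time, n_sites):
    """Expected substitutions separating two lineages that split `time` ago.

    Both lineages accumulate changes after the split, hence the factor 2.
    """
    return 2.0 * rate * time * n_sites


def poisson_pmf(k, mean):
    """Probability of exactly k events under a Poisson process with this mean."""
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))


def index_of_dispersion(mean, k_max=100):
    """Variance divided by mean for a Poisson count (summed up to k_max)."""
    probs = [poisson_pmf(k, mean) for k in range(k_max + 1)]
    m = sum(k * p for k, p in enumerate(probs))
    v = sum(k * k * p for k, p in enumerate(probs)) - m * m
    return v / m


# Hypothetical numbers: 1e-9 changes per site per year, 100 million years, 1000 sites.
print(expected_differences(1e-9, 1e8, 1000))
print(index_of_dispersion(4.0))
```

Under this simple model the variance of the count equals its mean, so the ratio is 1. A ratio well above 1 in real data would be one sign that a single constant-rate clock does not fit.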
Small subunit ribosomal RNA
18S or 16S rRNA
Failure To Accommodate Rate
Heterogeneity Can Lead To Problems
When Making Trees
Unequal rates in different lineages may
cause problems for phylogenetic analysis
• Felsenstein (1978) made a simple model phylogeny including
four taxa and a mixture of short and long branches
[Figure: four-taxon trees for A, B, C and D with branch lengths p and q, where p > q. Panels: TRUE TREE, WRONG TREE]
• All methods are susceptible to “long branch” problems
• Methods which assume that all sites change at the same
rate are particularly poor at recovering the true tree
Chaperonin 60 Protein Maximum Likelihood Tree
(PROTML, Roger et al. 1998, PNAS 95: 229)
Longest
branches
Bootstrap values are a
common way of assessing
support for relationships
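Bootstrapping resamples the alignment columns with replacement, rebuilds a tree from each pseudoreplicate, and reports how often a group is recovered. A minimal sketch of the resampling step; `supports_group` stands for whatever tree-building and group-checking routine is used and is a hypothetical caller-supplied function, and the Hamming-distance criterion in the usage example is only a crude stand-in for real tree inference:

```python
import random


def bootstrap_alignment(sequences, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    n_sites = len(sequences[0])
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[i] for i in picks) for seq in sequences]


def bootstrap_support(sequences, supports_group, n_replicates=100, seed=0):
    """Percentage of pseudoreplicates for which `supports_group` returns True."""
    rng = random.Random(seed)
    hits = sum(
        1
        for _ in range(n_replicates)
        if supports_group(bootstrap_alignment(sequences, rng))
    )
    return 100.0 * hits / n_replicates


def hamming(a, b):
    """Number of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def first_two_closest(seqs):
    """Stand-in criterion: sequences 0 and 1 are closer to each other than to 2."""
    a, b, c = seqs
    return hamming(a, b) < min(hamming(a, c), hamming(b, c))


toy = ["ACGUACGUAC", "ACGUACGAAC", "UCGAACCUAG"]
print(bootstrap_support(toy, first_two_closest, n_replicates=200))
```

The percentage only measures how consistently the data and method return a group. It does not show that the group is correct, which is why a systematic bias can produce high support for a wrong relationship.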
High bootstrap values can be misleading
[Figure: two versions of the chaperonin 60 tree, labelled "adding a single new sequence". Spironucleus barkhanus appears in only one of the two tip lists]
Tips: Cucurbita sp., Arabidopsis thaliana, Plasmodium falciparum, Dictyostelium discoideum, Spironucleus barkhanus, Giardia lamblia, Entamoeba histolytica, Trichomonas vaginalis, Drosophila melanogaster, Homo sapiens, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Trypanosoma brucei, Euglena gracilis, Holospora obtusa, Ehrlichia sp., Ehrlichia chaffeensis, Rickettsia tsutsugamushi, Rhizobium meliloti, Bartonella bacilliformis, Bradyrhizobium japonicum, Caulobacter crescentus, Rhodobacter sphaeroides, Pseudomonas aeruginosa, Escherichia coli, Chromatium vinosum, Neisseria gonorrhoeae, Chlamydia trachomatis, Treponema pallidum, Thermus thermophilus
A proposal for three domains of
life
(Woese, Kandler and Wheelis 1990 PNAS 87, 4576)
Concatenated LSU+SSU rRNA analyzed
using a standard (GTR plus gamma*2) model
The 3-domains tree of life
[Figure: tree with groups labelled eukaryotes, archaebacteria, eocyte, archaebacteria and bacteria; the two longest branches are marked]
Cox et al. 2008. PNAS
The same RNA data analyzed using better
models (Cox et al. 2008)
[Figure: tree with groups labelled eukaryotes, eocytes, other archaebacteria and bacteria; support values 0.75 and 0.95 are shown]
Models: NDCH (GTR+g+2cv)*2, heterogeneous across tree; CAT model
Saturation in sequence data:
• Saturation is due to multiple changes at the
same site subsequent to lineage splitting
• Most data will contain some fast evolving sites
which are potentially saturated (e.g. in
protein-coding genes, often the third codon position)
• In severe cases the data becomes essentially
random and all information about relationships
can be lost
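The gap between observed and actual change can be illustrated with a simple distance correction. A minimal sketch using the Jukes-Cantor (1969) formula, which assumes equal base frequencies and equal rates for all changes; it is brought in here only as an illustration and is not a method named on the slides:

```python
import math


def p_distance(seq1, seq2):
    """Observed proportion of aligned sites that differ between two sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(a != b for a, b in zip(seq1, seq2))
    return diffs / len(seq1)


def jukes_cantor_distance(p):
    """Estimated substitutions per site given the observed proportion p."""
    if p >= 0.75:
        # At or beyond 75% difference the sequences look random under this model.
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


for p in (0.05, 0.30, 0.60, 0.70):
    print(f"observed {p:.2f} -> corrected {jukes_cantor_distance(p):.3f}")
```

The correction is small when few sites differ and grows rapidly as the observed difference approaches 0.75, the point at which the estimate becomes undefined and the signal is effectively gone.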
Multiple changes at a single site
- hidden changes
Seq 1 AGCGAG
Seq 2 GCGGAC
[Figure: changes at a single site traced along the lineages leading to Seq 1 and Seq 2, with the number of changes (1, 2, 3) marked]
Exploring patterns in sequence
data
• Do sequences manifest biased base
compositions (e.g. thermophilic
convergence) or biased codon usage
patterns which may obscure
phylogenetic signal?
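Base composition is easy to inspect before building a tree. A minimal sketch that reports %G+C for each sequence, over all sites and over variable sites only; the function names and the decision to skip gap characters are assumptions made for this example:

```python
def gc_percent(sequence):
    """Percentage of G and C among the bases of a sequence (gaps are skipped)."""
    bases = [b for b in sequence.upper() if b in "ACGTU"]
    if not bases:
        raise ValueError("sequence contains no bases")
    gc = sum(b in "GC" for b in bases)
    return 100.0 * gc / len(bases)


def variable_site_gc(sequences):
    """%G+C of each sequence, restricted to alignment columns that vary."""
    keep = [i for i, col in enumerate(zip(*sequences)) if len(set(col)) > 1]
    return [gc_percent("".join(seq[i] for i in keep)) for seq in sequences]


# Hypothetical toy alignment, not real 16S rRNA data.
toy = ["GGCGCAUG", "GGCGCAAU", "GACAUAAU"]
print([round(gc_percent(s), 1) for s in toy])
print([round(x, 1) for x in variable_site_gc(toy)])
```

Large differences between sequences, especially at the variable sites, are a warning that similar composition rather than shared ancestry could be driving a grouping.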
A case study in phylogenetic analysis:
Deinococcus and Thermus
• Deinococcus are radiation resistant bacteria
• Thermus are thermophilic bacteria
– BUT:
– Both have the same very unusual cell wall
based upon ornithine
– Both have the same menaquinones (Mk 9)
– Both have the same unusual polar lipids
• Congruence between these complex characters
supports a phylogenetic relationship between
Deinococcus and Thermus
% Guanine + Cytosine in 16S rRNA
genes from mesophiles and thermophiles
                          %GC          %GC
                          all sites    variable sites
Thermophiles:
Thermotoga maritima       62           72
Thermus thermophilus      64           72
Aquifex pyrophilus        65           73
Mesophiles:
Deinococcus radiodurans   55           52
Bacillus subtilis         55           50
Shared nucleotide or amino acid composition biases
can also cause problems for phylogenetic analysis
[Figure: true tree and wrong tree from 16S rRNA for Aquifex (73% G+C), Thermus (72%), Deinococcus (52%) and Bacillus (50%)]
The correct tree can be obtained if a
model is used which allows base/aa
composition to vary between
sequences:
- LogDet/Paralinear distances
- Heterogeneous maximum likelihood
Gene trees and species trees
[Figure: a gene tree for genes a, b and c shown with the species tree for species A, B and C]
We often assume that gene trees give us
species trees
Orthologues and paralogues
[Figure: gene copies a, b, c and A, B, C descending from an ancestral gene, with pairs marked orthologous or paralogous and some copies starred]
Duplication to give 2 copies
on the same genome =
paralogues of each other
A mixture of
orthologues and
paralogues sampled
The malic enzyme gene tree contains a
mixture of orthologues and paralogues
[Figure: malic enzyme gene tree with a gene duplication marked and bootstrap values between 75 and 100 on the labelled nodes]
Tips: Homo sapiens 2 Mit, Ascaris suum Mit, Homo sapiens 1 Cyt, Anas platyrhynchos Cyt (Anas = a duck!), Zea mays Ch, Flaveria trinervia Ch, Populus trichocarpa Ch, Solanum tuberosum Mit, Amaranthus Mit, Neocallimastix Hyd, Trichomonas vaginalis Hyd, Giardia lamblia Cyt, Schizosaccharomyces, Saccharomyces, Lactococcus lactis
Group labels: Plant chloroplast, Plant mitochondrion
Summary:
• There may be conflicting patterns in data which
can potentially mislead us about evolutionary
relationships
• Our methods of analysis need to be able to deal
with the complexities of sequence evolution and
to recover any underlying phylogenetic signal
• Some methods may do this better than others
depending on the properties of individual data
sets
• All trees are simply hypotheses!
Phylogenetic analysis requires
careful thought
• Phylogenetic analysis is frequently treated
as a black box into which data are fed
(often gathered at considerable cost) and
out of which “The Tree” springs
• (Hillis, Moritz & Mable 1996, Molecular
Systematics)