Transcript Gene tree

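Graduate course, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences: Human Population Genetics
Human Population Genetics
Basic Principles and Analytical Methods
Shuhua Xu (徐书华)
Li Jin (金力)
CAS-MPG Partner Institute for Computational Biology
Lecture 3
Construction Methods and Applications of Phylogenetic Trees
Construction Methods and Applications of Phylogenetic Trees
► Concepts and terminology of phylogenetic trees;
► Types of phylogenetic trees;
► Common methods for constructing phylogenetic trees;
► Methods for testing phylogenetic trees;
► Applications of phylogenetic trees;
► Which method is most appropriate in which situation?
► Common software for building phylogenetic trees;
► Exercise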
Concepts and Terminology of Phylogenetic Trees
Phylogeny (phylo = tribe + genesis)
The purpose of a phylogenetic tree is to illustrate how a group of objects (usually genes or organisms) are related to one another
Phylogeny
Phylogenetic trees are about visualising evolutionary relationships
[Figure: tree of Orangutan, Gorilla, Chimpanzee and Human. From the Tree of Life Website, University of Arizona]
Phylogenetic trees diagram the evolutionary
relationships between the taxa
[Figure: tree with tips Taxon A to Taxon E]
There is no meaning to the spacing between the taxa, or to the order in which they appear from top to bottom.
The other dimension (along the branches) either can have no scale (for ‘cladograms’), can be proportional to genetic distance or amount of change (for ‘phylograms’ or ‘additive trees’), or can be proportional to time (for ‘ultrametric trees’ or true evolutionary trees).
((A,(B,C)),(D,E)) = The above phylogeny as nested parentheses
These say that B and C are more closely related to each other than either is to A,
and that A, B, and C form a clade that is a sister group to the clade composed of
D and E. If the tree has a time scale, then D and E are the most closely related.
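The nested-parentheses notation can be read mechanically. The following is a minimal sketch, assuming a string with no branch lengths or spaces; the helper names `parse_nested` and `clades` are hypothetical and not part of any standard library. It lists the clades implied by the string, which matches the reading given above (B and C together, then A with B and C, then D and E).

```python
def parse_nested(text):
    """Parse a nested-parentheses tree such as '((A,(B,C)),(D,E))' into nested tuples."""
    pos = 0

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # skip '('
            children = [node()]
            while text[pos] == ",":
                pos += 1  # skip ','
                children.append(node())
            pos += 1  # skip ')'
            return tuple(children)
        start = pos
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        return text[start:pos]  # a leaf name

    return node()


def clades(tree):
    """Return the set of leaves under every internal node (each set is a clade)."""
    found = []

    def leaves(n):
        if isinstance(n, str):
            return frozenset([n])
        members = frozenset().union(*(leaves(child) for child in n))
        found.append(members)
        return members

    leaves(tree)
    return found


tree = parse_nested("((A,(B,C)),(D,E))")
for clade in clades(tree):
    print(sorted(clade))
```

A and B alone do not appear in the output, because no internal node has exactly those two taxa as descendants.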
Clades
► Evolutionary trees depict clades.
► A clade is a group of organisms that includes an ancestor and all descendants of that ancestor. You can think of a clade as a branch on the tree of life.
Molecular Evolution - Li
Terminology
• External nodes: the things under comparison; operational taxonomic units (OTUs)
• Internal nodes: ancestral units; hypothetical; the goal is to group current-day units
• Root: common ancestor of all OTUs under study. The path from the root to a node defines an evolutionary path
• Unrooted: specifies relationships but not the evolutionary path
 - With an outgroup (an external reason to believe a certain OTU branched off first), the tree can be rooted
• Topology: branching pattern of a tree
• Branch length: amount of difference that occurred along a branch
Common Phylogenetic Tree Terminology
[Figure: rooted tree of taxa A to E, labelled as follows]
Terminal nodes: represent the TAXA (genes, populations, species, etc.) used to infer the phylogeny
Branches or lineages
Internal nodes or divergence points: represent hypothetical ancestors of the taxa
Ancestral node or ROOT of the tree
Terminology
► Homologue
► Orthologue
► Paralogue
Homologs are commonly classified as orthologs, paralogs, or xenologs.
Orthologs are homologs resulting from speciation. They are genes in different species that stem from a common ancestor. Orthologs often have similar functions. Example: SPO11 (Baudat et al. Mol Cell 2000)
Paralogs are homologs resulting from gene duplication. They are genes derived from a common ancestral locus that was duplicated within the genome of an organism. Paralogs tend to have different functions. Example: CLB1/CLB2 (Brachat et al. Genome Biology 2003)
Xenologs are homologs resulting from the horizontal transfer of a gene between two organisms. The function of xenologs can be variable. Example: VDE (Okuda et al. Yeast 2003)
Types of Similarity
Observed similarity between two entities can be due to:
Evolutionary relationship:
 • Shared ancestral characters (‘plesiomorphies’)
 • Shared derived characters (‘synapomorphies’)
Homoplasy (independent evolution of the same character):
 • Convergent events (in either related or unrelated entities)
 • Parallel events (in related entities)
 • Reversals (in related entities)
[Figure: small trees with tips labelled C, G and T illustrating each case]
Character-based methods can tease apart types of similarity and theoretically
find the true evolutionary tree. Similarity = relationship only if certain conditions
are met (if the distances are ‘ultrametric’).
Homology and Homoplasy
[Figure: bat, chimp and hawk grouped either by hair or by wings]
Homology: identity due to shared ancestry (evolutionary signal)
Homoplasy: identity despite separate ancestry (evolutionary noise)
[Figure: orthologs and paralogs. From Erik L.L. Sonnhammer, "Orthology, paralogy and proposed classification for paralog subtypes", TRENDS in Genetics Vol.18 No.12, December 2002, http://tig.trends.com. © 2002 Elsevier Science Ltd.]
The Molecular Clock
For a given protein the rate of sequence evolution is approximately constant across lineages
Zuckerkandl and Pauling (1965)
This would allow speciation and duplication events to be dated accurately based on molecular data
Local and approximate molecular clocks are more reasonable
Relative Rate Test
► Test whether sets of sequences are evolving at equal rates (local molecular clock hypothesis)
► With C as the outgroup to A and B, equal rates imply $K_{AC} - K_{BC} = 0$
e.g. RRTree, Robinson-Rechavi
http://pbil.univlyon1.fr/software/rrtree.html
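As an illustration of the quantity being tested, the sketch below computes K_AC - K_BC for three aligned sequences. It assumes that K is the simple proportion of differing sites (a p-distance) and that C is the outgroup; the sequences are hypothetical, and a real test (such as the one in RRTree) would also need a variance estimate to judge whether the difference is significantly different from zero.

```python
def p_distance(seq1, seq2):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(1 for x, y in zip(seq1, seq2) if x != y)
    return diffs / len(seq1)


def relative_rate_difference(seq_a, seq_b, outgroup_c):
    """K_AC - K_BC; expected to be close to 0 if A and B evolve at equal rates."""
    return p_distance(seq_a, outgroup_c) - p_distance(seq_b, outgroup_c)


# Hypothetical aligned sequences (not real data); C is the outgroup.
seq_a = "ACGTACGTAC"
seq_b = "ACGTTCGTAC"
seq_c = "ACGAACGTTC"

print(relative_rate_difference(seq_a, seq_b, seq_c))
```

A negative value means that B has accumulated more differences from the outgroup than A has.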
Types of Phylogenetic Trees
Trees
► Diagram consisting of branches and nodes
► Species tree (how are my species related?)
 • contains only one representative from each species.
 • all nodes indicate speciation events
► Gene tree (how are my genes related?)
 • normally contains a number of genes from a single species
 • nodes relate either to speciation or gene duplication events
Gene tree, species tree
[Figure: a gene tree with genes a, b and c alongside a species tree]
We often assume that gene trees give us
species trees
Gene tree - Species tree
[Figure: gene tree of Genes A to E, whose nodes are mutation events, beside the species tree of Species A to E, whose nodes are speciation events]
The two events, mutation and speciation, are not expected to occur at the same time, so a gene tree does not necessarily match the species tree.
Gene tree - Species tree
[Figure: gene tree and species tree for A, B and C drawn against a time axis, with nodes marked as duplication or speciation events]
Three types of trees
Cladogram: branch lengths have no meaning
Phylogram: branch lengths represent genetic change
Ultrametric tree: branch lengths represent time
[Figure: the same tree of Taxa A to D drawn in each of the three styles]
All show the same evolutionary relationships, or branching orders, between the taxa.
Tree Properties
Ultrametricity: all tips are an equal distance from the root (in the figure, $a = b + c + d + e$).
Additivity: the distance between any two tips equals the total branch length between them (in the figure, $XY = a + b + c + d + e$).
In simple scenarios, evolutionary trees are ultrametric and phylograms are additive.
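Both properties can be checked directly on a matrix of pairwise distances. The sketch below uses the standard three-point condition for ultrametricity and the four-point condition for additivity; the two small matrices are hypothetical examples chosen for illustration, not data from the lecture.

```python
from itertools import combinations


def is_ultrametric(d, tol=1e-9):
    """Three-point condition: in every triple the two largest distances are equal."""
    for i, j, k in combinations(range(len(d)), 3):
        smallest, middle, largest = sorted([d[i][j], d[i][k], d[j][k]])
        if abs(largest - middle) > tol:
            return False
    return True


def is_additive(d, tol=1e-9):
    """Four-point condition: in every quadruple the two largest pairwise sums are equal."""
    for i, j, k, l in combinations(range(len(d)), 4):
        sums = sorted([d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


# Hypothetical distances for taxa A, B, C, D from a clock-like tree.
clock_like = [
    [0, 2, 6, 6],
    [2, 0, 6, 6],
    [6, 6, 0, 4],
    [6, 6, 4, 0],
]
# Hypothetical distances from a tree with unequal branch lengths
# (A:1 and B:2 on one side, C:3 and D:1 on the other, internal branch 1).
unequal_rates = [
    [0, 3, 5, 3],
    [3, 0, 6, 4],
    [5, 6, 0, 4],
    [3, 4, 4, 0],
]

print(is_ultrametric(clock_like), is_additive(clock_like))
print(is_ultrametric(unequal_rates), is_additive(unequal_rates))
```

The second matrix is additive but not ultrametric, which is the situation a phylogram with unequal rates describes.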
Cladograms vs Phylograms
Cladograms show branching order; branch lengths are meaningless
Phylograms show branch order and branch lengths
[Figure: the same tree of Bacterium 1 to 3 and Eukaryote 1 to 4 drawn once as a cladogram and once as a phylogram]
Phenetics
► Phenetics, when first introduced (Michener and Sokal, 1957), challenged the prevailing view that classifications should be based on comparisons between a limited number of characters that taxonomists believed to be important for one reason or another.
► Pheneticists argued that classifications should encompass as many variable characters as possible, these characters being scored numerically and analyzed by rigorous mathematical methods.
Cladistics
► Cladistics (Hennig, 1966) also emphasizes the need for large datasets but differs from phenetics in that it does not give equal weight to all characters.
► The argument is that in order to infer the branching order in a phylogeny it is necessary to distinguish those characters that provide a good indication of evolutionary relationships from other characters that might be misleading.
► This might appear to take us back to the pre-phenetic approach but cladistics is much less subjective: rather than making assumptions about which characters are ‘important’, cladistics demands that the evolutionary relevance of individual characters be defined.
► In particular, errors in the branching pattern within a phylogeny are minimized by recognizing two types of anomalous data.
Why Cladistics?
Convergent evolution and Derived character states
Convergent evolution
Derived character state
Phenetics versus Cladistics
► Phenetics
 • is the study of relationships among a group of organisms on the basis of the degree of similarity between them, be that similarity molecular, phenotypic, or anatomical.
 • A tree-like network expressing phenetic relationships is called a phenogram.
Phenetics versus Cladistics
► Cladistics
 • can be defined as the study of the pathways of evolution.
 • In other words, cladists are interested in such questions as: how many branches there are among a group of organisms; which branch connects to which other branch; and what is the branching sequence.
 • A tree-like network that expresses such ancestor-descendant relationships is called a cladogram.
 • Thus, a cladogram refers to the topology of a rooted phylogenetic tree.
Phenetics versus Cladistics
► While a phenogram may serve as an indicator of cladistic relationships, it is not necessarily identical to the cladogram.
► If there is a linear relationship between the time of divergence and the degree of genetic (or morphological) divergence, the two types of trees may become identical to each other.
Cladistics and Phenetics
► Cladistic methods: trees are drawn based on the conserved characters
► Phenetic methods: trees are based on some measure of distance between the leaves
► Molecular phylogenies are inferred from molecular (usually sequence) data
 • either cladistic (e.g. gene order) or phenetic
Cladistics and Phenetics
► The maximum parsimony method is a typical representative of the cladistic approach, whereas the UPGMA method is a typical phenetic method.
► The other methods, however, cannot be classified easily according to the above criteria.
Unrooted vs Rooted tree
[Figure: an unrooted tree of archaea and eukaryotes, and the same tree rooted by a bacteria outgroup, with the monophyletic groups marked]
Rooting the Tree
► In an unrooted tree the direction of evolution is unknown.
► The root is the hypothesized ancestor of the sequences in the tree.
► The root can either be placed on a branch or at a node.
Inferring evolutionary relationships between
the taxa requires rooting the tree:
To root a tree mentally, imagine that the tree is made of string. Grab the string at the root and tug on it until the ends of the string (the taxa) fall opposite the root.
[Figure: unrooted tree of taxa A to D with a root position marked, and the resulting rooted tree]
Note that in this rooted tree, taxon A is no more closely related to taxon B than it is to C or D.
Now, try it again with the root at another position:
[Figure: the same unrooted tree of taxa A to D with the root marked at a different position, and the resulting rooted tree]
Note that in this rooted tree, taxon A is most closely related to taxon B, and together they are equally distantly related to taxa C and D.
An unrooted, four-taxon tree theoretically can be rooted in five different places (one per branch) to produce five different rooted trees.
[Figure: unrooted tree 1 of taxa A to D with its branches numbered 1 to 5, and the resulting rooted trees 1a to 1e]
These trees show five different evolutionary relationships among the taxa!
All of these rearrangements show the same evolutionary relationships between the taxa
[Figure: rooted tree 1a redrawn several times with taxa A to D in different top-to-bottom orders]
There are two major ways to root trees:
By outgroup:
Uses taxa (the “outgroup”) that are known to fall outside of the group of interest (the “ingroup”). Requires some prior knowledge about the relationships among the taxa. The outgroup can either be species (e.g., birds to root a mammalian tree) or previous gene duplicates (e.g., α-globins to root β-globins).
By midpoint or distance:
Roots the tree at the midway point between the two most distant taxa in the tree, as determined by branch lengths. Assumes that the taxa are evolving in a clock-like manner. This assumption is built into some of the distance-based tree building methods.
[Figure: tree of taxa A to D with branch lengths; A and D are the two most distant taxa]
d(A,D) = 10 + 3 + 5 = 18
Midpoint = 18 / 2 = 9
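The midpoint calculation can be sketched in code. The tree below is an assumption: only the path from A to D (10 + 3 + 5) is given in the text, so the internal node names `n1` and `n2` and the placement of B and C (each on a branch of length 2) are hypothetical. The function finds the two most distant tips and reports the branch on which the midpoint falls.

```python
def path_between(adj, start, goal):
    """Return the list of nodes on the unique path from start to goal in a tree."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            return path
        for neighbour in adj[node]:
            if neighbour not in path:
                stack.append((neighbour, path + [neighbour]))
    raise ValueError("no path found")


def midpoint_root(adj):
    """Locate the midpoint of the longest tip-to-tip path.

    Returns (tip1, tip2, total, (u, v, offset)): the root lies on branch u-v,
    `offset` units from u when walking from tip1 towards tip2.
    """
    tips = [n for n in adj if len(adj[n]) == 1]
    best = None
    for i, a in enumerate(tips):
        for b in tips[i + 1:]:
            path = path_between(adj, a, b)
            total = sum(adj[u][v] for u, v in zip(path, path[1:]))
            if best is None or total > best[0]:
                best = (total, path)
    total, path = best
    walked = 0.0
    for u, v in zip(path, path[1:]):
        if walked + adj[u][v] >= total / 2:
            return path[0], path[-1], total, (u, v, total / 2 - walked)
        walked += adj[u][v]


# Hypothetical tree consistent with d(A,D) = 10 + 3 + 5 = 18.
edges = [("A", "n1", 10), ("B", "n1", 2), ("n1", "n2", 3), ("C", "n2", 2), ("D", "n2", 5)]
adj = {}
for u, v, length in edges:
    adj.setdefault(u, {})[v] = length
    adj.setdefault(v, {})[u] = length

result = midpoint_root(adj)
print(result)
```

Under these assumptions the root falls on the branch leading to A, 9 units from A, because that single branch is longer than half of the A-to-D path.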
Rooting and Tree Interpretation
[Figure: trees of chicken, human, fruit fly, oak, bacteria and archaea (archaebacteria) rooted in different places; depending on the root, the characters are read as gains (+ bones, + cell nuclei) or as losses (- bones, - cell nuclei)]
How Many Trees?
(assuming bifurcation only)

# sequences   # pairwise    Unrooted trees                Rooted trees
              distances     # trees        # branches     # trees        # branches
                                           /tree                         /tree
3             3             1              3              3              4
4             6             3              5              15             6
5             10            15             7              105            8
6             15            105            9              945            10
10            45            2,027,025      17             34,459,425     18
30            435           8.69 × 10^36   57             4.95 × 10^38   58

For N sequences:
# pairwise distances: $\frac{N(N-1)}{2}$
# unrooted trees: $\frac{(2N-5)!}{2^{N-3}\,(N-3)!}$, each with $2N-3$ branches
# rooted trees: $\frac{(2N-3)!}{2^{N-2}\,(N-2)!}$, each with $2N-2$ branches
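The counts in the table follow directly from the two formulas. The short sketch below evaluates them with exact integer arithmetic; the function names are chosen here for illustration.

```python
from math import factorial


def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 3 sequences: (2n-5)! / (2^(n-3) (n-3)!)."""
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


def rooted_trees(n):
    """Number of rooted bifurcating trees for n >= 2 sequences: (2n-3)! / (2^(n-2) (n-2)!)."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


for n in (3, 4, 5, 6, 10, 30):
    pairs = n * (n - 1) // 2
    print(n, pairs, unrooted_trees(n), 2 * n - 3, rooted_trees(n), 2 * n - 2)
```

Note that the number of rooted trees for N sequences equals the number of unrooted trees for N + 1 sequences, since adding a root is equivalent to adding one more tip.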
Common Methods for Constructing Phylogenetic Trees
Basic methods for building phylogenetic trees
► Maximum parsimony (MP)
► Distance methods
► Maximum likelihood (ML)
Maximum Parsimony
► Check each topology
► Count the minimum number of changes required to explain the data
► Choose the tree with the smallest number of changes
Maximum Parsimony
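[Figure: three alternative trees for the four sequences ACT, ACA, GTT and GTA, with inferred sequences at the internal nodes and the number of changes on each branch]
MP score = 7
MP score = 5
MP score = 4: optimal MP tree, grouping ACT with ACA and GTT with GTA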
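The "count the minimum number of changes" step can be sketched with Fitch's algorithm, which is one standard way of scoring a fixed topology (the slides do not say which counting method was used). The code below scores the three possible groupings of the four example sequences; the labels `s1` to `s4` are hypothetical names for ACT, ACA, GTT and GTA.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, minimum changes) for one site.

    `tree` is a rooted binary tree of nested 2-tuples with leaf names as strings.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: at least one extra change is needed at this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, seqs):
    """Sum the Fitch minimum over all alignment columns."""
    length = len(next(iter(seqs.values())))
    total = 0
    for site in range(length):
        states = {name: seq[site] for name, seq in seqs.items()}
        total += fitch_site(tree, states)[1]
    return total


seqs = {"s1": "ACT", "s2": "ACA", "s3": "GTT", "s4": "GTA"}
topologies = {
    "(ACT,ACA),(GTT,GTA)": (("s1", "s2"), ("s3", "s4")),
    "(ACT,GTT),(ACA,GTA)": (("s1", "s3"), ("s2", "s4")),
    "(ACT,GTA),(ACA,GTT)": (("s1", "s4"), ("s2", "s3")),
}
scores = {label: parsimony_score(tree, seqs) for label, tree in topologies.items()}
for label, score in scores.items():
    print(label, score)
```

The optimal tree scores 4 and the second grouping scores 5, as in the example. For the third grouping the Fitch minimum is 6; a higher count such as 7 can arise when the internal nodes are labelled by hand rather than optimised.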
Maximum Parsimony: Limitations
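Exhaustive search becomes computationally intractable even for a modest number of sequences (finding the MP tree is “NP-hard”), because the number of trees to check grows explosively with the number of sequences n:
# of rooted trees = $\frac{(2n-3)!}{2^{n-2}\,(n-2)!}$
# of unrooted trees = $\frac{(2n-5)!}{2^{n-3}\,(n-3)!}$
Number of possible trees (Felsenstein 1978)

# of species   # rooted trees   # unrooted trees
2              1                1
3              3                1
4              15               3
5              105              15
10             3.44 × 10^7      2.03 × 10^6
15             2.13 × 10^14     7.91 × 10^12
20             8.20 × 10^21     2.21 × 10^20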
Maximum Parsimony: Limitations
► Long-Branch Attraction
 • In a set of sequences evolving at different rates, the rapidly evolving sequences have been observed to be drawn together in the inferred tree, whether or not they are closely related.
Long-Branch Attraction
[Figure: NJ tree based on CNVs]
Distance Methods
Distance Method Criteria
Distance methods
► Normally fast and simple
► e.g. UPGMA, Neighbour Joining, Minimum Evolution
UPGMA
UPGMA: Visually
UPGMA: example
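Since the worked UPGMA example did not survive in this transcript, the following is a minimal sketch of the usual algorithm rather than a reconstruction of the original slides: repeatedly merge the two closest clusters and set the distance from the merged cluster to every other cluster to the size-weighted average of the old distances. The distance matrix is hypothetical.

```python
def upgma(names, dist):
    """Cluster taxa by UPGMA; dist[i][j] is the distance between names[i] and names[j].

    Returns a nested tuple (left, right, height), where height is half the
    average distance between the two merged clusters.
    """
    clusters = {i: (names[i], 1) for i in range(len(names))}  # id -> (subtree, size)
    d = {(i, j): dist[i][j] for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        # Find the closest pair of clusters (keys are stored with a < b).
        (a, b), dab = min(d.items(), key=lambda item: item[1])
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        # New distances are size-weighted means, i.e. averages over all member pairs.
        for c in clusters:
            dac = d.pop((min(a, c), max(a, c)))
            dbc = d.pop((min(b, c), max(b, c)))
            d[(c, next_id)] = (size_a * dac + size_b * dbc) / (size_a + size_b)
        del d[(a, b)]
        clusters[next_id] = ((tree_a, tree_b, dab / 2), size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]


# Hypothetical clock-like distances for taxa A, B, C, D.
names = ["A", "B", "C", "D"]
dist = [
    [0, 2, 6, 6],
    [2, 0, 6, 6],
    [6, 6, 0, 4],
    [6, 6, 4, 0],
]
upgma_tree = upgma(names, dist)
print(upgma_tree)
```

Because every merged node is placed at half the inter-cluster distance, the resulting tree is always ultrametric, which is why UPGMA implicitly assumes a molecular clock.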
UPGMA weaknesses
Neighbor Joining
Neighbor Joining (NJ)
Start off with a star tree; pull out pairs one at a time
[Figure: two diagrams of taxa 1 to 8, starting from a star tree]
NJ Algorithm
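The algorithm slides themselves are not preserved here, so the following is a sketch of the standard neighbor-joining procedure (Saitou and Nei), not a copy of the original material. At each step it picks the pair minimising Q(i, j) = (n - 2) d(i, j) - r_i - r_j, where r_i is the sum of distances from i to all other nodes, joins the pair to a new internal node, and updates the distances. The internal node names `U0`, `U1`, ... and the example matrix are hypothetical.

```python
def neighbor_joining(names, dist):
    """Return the NJ tree as a list of (node, node, branch_length) edges."""
    nodes = list(names)
    d = {a: {b: dist[i][j] for j, b in enumerate(names)} for i, a in enumerate(names)}
    edges = []
    new_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        r = {a: sum(d[a][b] for b in nodes) for a in nodes}
        # Choose the pair that minimises the Q criterion.
        i, j = min(
            ((a, b) for x, a in enumerate(nodes) for b in nodes[x + 1:]),
            key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]],
        )
        u = f"U{new_id}"  # hypothetical name for the new internal node
        new_id += 1
        # Branch lengths from the joined pair to the new node.
        length_i = d[i][j] / 2 + (r[i] - r[j]) / (2 * (n - 2))
        edges.append((i, u, length_i))
        edges.append((j, u, d[i][j] - length_i))
        # Distances from the new node to every remaining node.
        d[u] = {u: 0.0}
        for k in nodes:
            if k not in (i, j):
                d[u][k] = d[k][u] = (d[i][k] + d[j][k] - d[i][j]) / 2
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    a, b = nodes
    edges.append((a, b, d[a][b]))
    return edges


# Hypothetical additive distances (A:1 and B:2 on one side, C:3 and D:1 on the
# other, internal branch of length 1); rates are unequal, so no clock is assumed.
names = ["A", "B", "C", "D"]
dist = [
    [0, 3, 5, 3],
    [3, 0, 6, 4],
    [5, 6, 0, 4],
    [3, 4, 4, 0],
]
nj_edges = neighbor_joining(names, dist)
for edge in nj_edges:
    print(edge)
```

On an additive matrix such as this one, NJ recovers the generating tree and its branch lengths, without requiring the tips to be equidistant from a root.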
NJ Performance
Minimum Evolution
► The total length of all branches in the tree should be a minimum.
► Neighbour joining is an approximation to minimum evolution.
► It has been shown that the minimum evolution tree is expected to be the true tree, provided that branch lengths are corrected for multiple hits.
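One common way to correct for multiple hits is the Jukes-Cantor (JC69) formula, $d = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$, where p is the observed proportion of differing sites. The slides do not name a specific correction, so the choice of JC69 here is an assumption made for illustration.

```python
import math


def jukes_cantor_distance(p):
    """Convert an observed proportion of differing sites p into a JC69 distance."""
    if not 0 <= p < 0.75:
        raise ValueError("JC69 correction is defined only for 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


for p in (0.05, 0.30, 0.60):
    print(p, round(jukes_cantor_distance(p), 3))
```

The corrected distance is always at least as large as p, and the gap widens as sequences become more divergent, which is where uncorrected distances underestimate the true amount of change.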

Maximum Likelihood
► Maximum likelihood is a character-based method that is statistically well founded.
► It often has lower variance than other methods (i.e. it is frequently the estimation method least affected by sampling error) and tends to be robust to many violations of the assumptions in the evolutionary model. Even with very short sequences, maximum likelihood tends to outperform alternative methods such as parsimony or distance methods. Different tree topologies are evaluated by their likelihood under the model.
► An important disadvantage is that it is very CPU intensive, and thus time consuming and poorly suited to large datasets.
Phylogeny Flowchart
Difference in Methods
Comparison of methods
► Neighbour Joining (NJ) is very fast but depends on accurate estimates of distance. This is more difficult with very divergent data
► Parsimony suffers from Long Branch Attraction. This may be a particular problem for very divergent data
► NJ can also suffer from Long Branch Attraction
► Parsimony is also computationally intensive
► Codon usage bias can be a problem for MP and NJ
► Maximum Likelihood is the most reliable but depends on the choice of model and is very slow
► Methods may be combined
Comparison of Methods
Neighbor-joining
 • Very fast
 • Easily trapped in local optima
 • Good for generating a tentative tree, or choosing among multiple trees
Maximum parsimony
 • Slow
 • Assumptions fail when evolution is rapid
 • Best option when tractable (<30 taxa, strong conservation)
Maximum likelihood
 • Very slow
 • Highly dependent on the assumed evolution model
 • Good for very small data sets and for testing trees built using other methods
Methods for Testing Phylogenetic Trees
How confident am I that my
tree is correct?
Bootstrapping: how dependent is the tree on the dataset?
1. Randomly choose n objects from your dataset of n, with replacement
2. Rebuild the tree based on the resampled data
3. Repeat 1,000 to 10,000 times
4. How often are the same children joined?
Jackknifing: how dependent is the tree on the dataset?
1. Randomly choose k objects from your dataset of n (k < n), without replacement
2. Rebuild the tree based on the subset of the data
3. Repeat 1,000 to 10,000 times
4. How often are the same children joined?
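The two resampling schemes above can be sketched as follows, taking the "objects" to be alignment columns. To keep the example short, the tree-building step is replaced by a stand-in, `closest_pair`, which simply reports the two taxa with the fewest differences; in practice that step would be a full tree reconstruction. The alignment is hypothetical, and the seed is fixed only to make runs repeatable.

```python
import random


def resample_columns(alignment, rng, k=None):
    """Bootstrap (k is None): draw n columns with replacement.
    Jackknife (k given): draw k of the n columns without replacement."""
    n = len(next(iter(alignment.values())))
    if k is None:
        cols = [rng.randrange(n) for _ in range(n)]
    else:
        cols = rng.sample(range(n), k)
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}


def closest_pair(alignment):
    """Stand-in for tree building: the two taxa with the fewest differences."""
    names = sorted(alignment)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    return min(pairs, key=lambda p: sum(x != y for x, y in zip(alignment[p[0]], alignment[p[1]])))


def support(alignment, pair, replicates=1000, k=None, seed=0):
    """Fraction of resampled datasets in which `pair` is again the closest pair."""
    rng = random.Random(seed)
    hits = sum(closest_pair(resample_columns(alignment, rng, k)) == pair for _ in range(replicates))
    return hits / replicates


# Hypothetical alignment of four taxa (not real data).
alignment = {
    "A": "ACGTACGTACGT",
    "B": "ACGTACGTACGA",
    "C": "TCGAACGTTCGT",
    "D": "TGGAACGATCCT",
}
print("bootstrap support for (A, B):", support(alignment, ("A", "B")))
print("jackknife support for (A, B):", support(alignment, ("A", "B"), k=6))
```

The returned fraction plays the role of the answer to step 4: how often the same grouping reappears across replicates.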
Assessing Reliability:
Bootstrap
Bootstrap - interpretation
► Bootstrapping is a very valuable and widely used technique (it is demanded by some journals)
► Bootstrap proportions (BPs) give an idea of how likely a given branch would be to be unaffected if additional data, with the same distribution, became available
► BPs are not the same as confidence intervals. There is no simple mapping between bootstrap values and confidence intervals. There is no agreement about what constitutes a ‘good’ bootstrap value (> 70%, > 80%, > 85%?)
► Some theoretical work indicates that BPs can be a conservative estimate of confidence intervals
► If the estimated tree is inconsistent, all the bootstraps in the world won’t help you.
Bootstrapping – an example
Ciliate SSUrDNA - parsimony bootstrap
[Figure: majority-rule consensus tree of Ochromonas (1), Symbiodinium (2), Prorocentrum (3), Loxodes (4), Tracheloraphis (5), Spirostomum (6), Gruberia (7), Euplotes (8) and Tetrahymena (9); the bootstrap values shown on the branches are 100, 84, 96, 100, 100 and 100]
Bootstrapping – an example
[Figure: two trees of the same nine taxa (Ochromonas, Symbiodinium, Prorocentrum, Loxodes, Spirostomum, Tetrahymena, Euplotes, Tracheloraphis, Gruberia) with low bootstrap values (16, 59, 26, 21, 71) on their branches]
Majority-rule consensus (with minority components)
Jack-knifing
► Jack-knifing is very similar to bootstrapping and differs only in the character resampling strategy
► Jack-knifing is not as widely available or widely used as bootstrapping
► Tends to produce broadly similar results
Likelihood-based tests of topologies
► Kishino-Hasegawa (KH) test
 • Trees are specified a priori
 • KH can be used to test whether two competing hypotheses have significantly different likelihoods
 • NB: should not be used to test trees that have been chosen on the basis of the data!
► Shimodaira-Hasegawa (SH) test
 • Can be used to test confidence in the ML tree compared with related trees (e.g. the second most likely tree from the data)
 • Andrew Rambaut, http://evolve.zoo.ox.ac.uk/software/shtests
Applications of Phylogenetic Trees
Phylogeny Applications
A few examples of what can be inferred from phylogenetic trees built from DNA or protein sequence data:
► Which species are the closest living relatives of modern humans?
► Did the infamous Florida Dentist infect his patients with HIV?
► What were the origins of specific transposable elements?
► Plus countless others.
Which species are the closest living
relatives of modern humans?
[Figure: two trees of Humans, Chimpanzees, Bonobos, Gorillas and Orangutans on time axes in MYA (millions of years ago), one running from 14 MYA to 0 and one from 15-30 MYA to 0]
Mitochondrial DNA, most nuclear DNA-encoded genes, and DNA/DNA hybridization all show that bonobos and chimpanzees are related more closely to humans than either is to gorillas.
The pre-molecular view was that the great apes (chimpanzees, gorillas and orangutans) formed a clade separate from humans, and that humans diverged from the apes at least 15-30 MYA.
Did the Florida Dentist infect his patients with HIV?
Phylogenetic tree of HIV sequences from the DENTIST, his Patients, & Local HIV-infected People:
[Figure: tree with tips DENTIST (two sequences), Patient A (two sequences), Patients B, C, D, E, F and G, and Local controls 2, 3, 9 and 35]
Yes: The HIV sequences from these patients fall within the clade of HIV sequences found in the dentist.
No: The sequences from Patients F and D fall outside that clade.
From Ou et al. (1992) and Page & Holmes (1998)
What data is used to build trees?
Data for Phylogeny
NIH & University of Michigan
Stanford University
Neighbour-joining trees of
population relationships
NJ tree based on SNP
genotypes
NJ tree based on SNP
haplotypes
NJ tree based on CNVs
Maximum likelihood tree of 51 populations
Oceania
150,000 SNPs
America
East Asia
South/Central Asia
Europe
Middle East
North Africa
Phylogeny programs on the web
Commonly used software
► PHYLIP, Phylogeny Inference Package
 • http://evolution.genetics.washington.edu/phylip.html
► MEGA, Molecular Evolutionary Genetics Analysis
 • http://www.megasoftware.net
► PAUP, Phylogenetic Analysis Using Parsimony
 • paup.csit.fsu.edu
Exercise
► Use HapMap data to build phylogenetic trees of populations and individuals;
 • http://www.hapmap.org