Transcript Slide 1

Estimating divergence time
1. Genomic clocks and evolutionary timescales
S. Blair Hedges and Sudhir Kumar
Trends in Genetics 2003, Vol.19 No. 4:200-206
2. A genomic timescale of prokaryote evolution:
insights into the origin of methanogenesis,
phototrophy, and the colonization of land
Fabia U Battistuzzi, Andreia Feijao and S Blair Hedges
BMC Evolutionary Biology 2004, 4:44
Things you need
to get a date
1. Organisms of interest
2. A phylogram, which describes
-the relationships
-the number of evolutionary
changes along a branch
-branch lengths that represent
nucleotide substitutions
3. Black box of dating methods
4. External information on time
-fossil records
5. Branch length + evolutionary
rate + time (age), resulting in a
chronogram
-the branch lengths represent
time units between divergences
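The step from phylogram to chronogram can be sketched in a few lines of Python. This is only an illustration, assuming a strict clock with one known rate, so that time = branch length / rate. The function name, branch labels and numbers are hypothetical and do not come from the slides.

```python
def branch_lengths_to_times(branch_lengths, rate):
    """Convert phylogram branch lengths (substitutions per site) to durations.

    Assumes one constant rate (substitutions per site per time unit).
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    return {branch: length / rate for branch, length in branch_lengths.items()}


# Hypothetical phylogram branches and rate, for illustration only.
phylogram = {"A": 0.10, "B": 0.10, "AB_stem": 0.05}
chronogram = branch_lengths_to_times(phylogram, rate=0.05)
```

Real dating methods (the "black box" in item 3) estimate the rate from calibrations instead of taking it as given.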
A genomic timescale of prokaryote evolution
(Background)
• The timescale of prokaryote evolution
has been difficult to reconstruct.
1. A limited fossil record
Some elements have forms (called isotopes) with
unstable atomic nuclei that have a tendency to
change, or decay.
Example: U-235 (unstable parent isotope) decays to
Pb-207 (stable daughter isotope).
Where the amounts of parent and daughter
isotopes can be accurately measured, the ratio
can be used to determine how old the rock is.
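The ratio-to-age step can be sketched as follows. This is a simplified illustration, assuming a closed system with no daughter isotope present when the rock formed, and an approximate U-235 half-life of 704 million years. The function and constant names are hypothetical.

```python
import math

U235_HALF_LIFE_YEARS = 7.04e8  # approximate half-life of U-235


def radiometric_age(daughter_to_parent_ratio, half_life_years=U235_HALF_LIFE_YEARS):
    """Age in years from a measured daughter/parent isotope ratio.

    Uses t = ln(1 + D/P) / lambda, with lambda = ln(2) / half-life.
    Assumes no initial daughter isotope and no later gain or loss.
    """
    if daughter_to_parent_ratio < 0:
        raise ValueError("ratio must be non-negative")
    decay_constant = math.log(2) / half_life_years
    return math.log(1.0 + daughter_to_parent_ratio) / decay_constant
```

With equal amounts of parent and daughter (ratio 1), the age equals one half-life.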
• For eukaryotes, the fossil record provides an abundant source of such data, but this has
not been true for prokaryotes, which are difficult to identify as fossils
• In 1999, analysis of cyanobacteria-specific molecular fossils such as 2-methylhopanoids,
used as biomarkers for cyanobacterial oxygenic photosynthesis, in late Archean shales
(with ages given between 2.04 and 3.08 Ga) allowed calibration of distance trees and
permitted estimates of the major prokaryotic divergences.
A genomic timescale of prokaryote evolution
(Background)
2. Complexities associated with molecular
clocks and deep divergences
• Horizontal gene transfer (HGT) is any process in
which an organism transfers genetic material to
another cell that is not its offspring. These events
are of great interest for their role in creating
functionally new combinations of genes, but they
pose problems for investigating the phylogenetic
history and divergence times of organisms.
S. Garcia-Vallve, E. Guzman, MA. Montero and A.
Romeu. 2003. HGT-DB: a database of putative
horizontally transferred genes in prokaryotic
complete genomes. Nucleic Acids Research
31(1):187-189
What is a molecular clock?
• The molecular clock is a technique in
genetics for dating when two species
diverged. Elapsed time is deduced by
applying a time scale to the number of
molecular differences measured between
the species' DNA or protein sequences.
• Current molecular clock methods follow three
basic approaches for the analysis of multiple
genes or proteins:
• 1. Methods that use a molecular clock and
one global rate of substitution
• 2. Methods that correct for rate
heterogeneity (local clocks)
• 3. Methods that try to incorporate rate
heterogeneity
• There are two basic approaches to
estimating divergence time from
multiple genes, whether constant-rate or
rate-heterogeneous: the multigene method and
the supergene method.
The difference between the data sets
• Multigene method
1. Divergence times are estimated for each gene separately.
2. The average or modal divergence time and its error are estimated from the pooled time estimates.
3. Multigene time distributions are usually symmetric and have a strong central tendency.
4. This method can be used with genes having widely varying species samples and rates of change.
• Supergene method
1. Nucleotide or protein sequences from all relevant genes of a species are concatenated to form a single alignment for time estimation.
2. Rate variation between genes and among sites within genes should be modeled.
3. The evolutionary distance between species can be computed for each gene and then averaged over all genes for a given species pair.
4. Divergence time is estimated by dividing the average distance by the average calibration distance.
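The two approaches can be contrasted in a minimal sketch. It assumes that each gene has a distance for the pair of interest and a distance for a calibration pair of known age, and that a distance ratio is scaled by the calibration time. The function names and the numbers are hypothetical.

```python
from statistics import mean


def multigene_time(distances, calib_distances, calib_time):
    """Estimate a time for each gene separately, then average the estimates."""
    per_gene = [d / c * calib_time for d, c in zip(distances, calib_distances)]
    return mean(per_gene)


def supergene_time(distances, calib_distances, calib_time):
    """Average the distances over genes first, then convert to a single time."""
    return mean(distances) / mean(calib_distances) * calib_time


# Hypothetical per-gene distances (two genes) and a calibration of 100 time units.
gene_distances = [0.2, 0.4]
gene_calib_distances = [0.1, 0.4]
t_multi = multigene_time(gene_distances, gene_calib_distances, 100.0)
t_super = supergene_time(gene_distances, gene_calib_distances, 100.0)
```

The two estimates differ whenever rates vary among genes, which is why the supergene method calls for explicit modeling of rate variation.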
Global clock or one global rate of substitution method
• All branches of a phylogenetic tree evolve
at the same global substitution rate. The
clock-like tree is ultrametric, which means
that the total distance between the root
and every tip is constant.
• If a data set follows a clock, the
evolutionary rate is constant over time,
and each branch length equals the product
of rate and time.
• The branch lengths can therefore be used to infer
relative rates and times.
• Relative rate tests are used in almost all
global clock studies. Genes and lineages
that are rejected in the rate tests are
usually removed from later analyses. Each
gene that is not rejected in the relative rate
tests can be considered to be evolving
under a constant rate of substitution.
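The ultrametric property mentioned above can be checked with a short sketch. The tree encoding (a tip is a string, an internal node is a list of (child, branch_length) pairs) and the example trees are hypothetical. This is not a statistical relative rate test, only a check of root-to-tip distances.

```python
def root_to_tip_distances(tree, distance_so_far=0.0):
    """Return {tip_name: distance from the root}.

    A node is either a tip name (str) or a list of
    (child, branch_length) pairs.
    """
    if isinstance(tree, str):
        return {tree: distance_so_far}
    distances = {}
    for child, branch_length in tree:
        distances.update(
            root_to_tip_distances(child, distance_so_far + branch_length)
        )
    return distances


def is_ultrametric(tree, tolerance=1e-9):
    """True if every tip is equally distant from the root (within tolerance)."""
    distances = list(root_to_tip_distances(tree).values())
    return max(distances) - min(distances) <= tolerance


# Hypothetical trees: the first is clock-like, the second has a fast lineage B.
clock_tree = [([("A", 0.1), ("B", 0.1)], 0.2), ("C", 0.3)]
non_clock_tree = [([("A", 0.1), ("B", 0.4)], 0.2), ("C", 0.3)]
```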
Table 1 Comparison table of different molecular dating methods.
Part 1: Methods that use a molecular clock and one global rate
of substitution
Software: PAUP* (Swofford, 2001), DNAMLK (part of the PHYLIP package; Felsenstein, 1993), BASEML (part of the PAML
package; Yang, 1997), MRBAYES (Huelsenbeck and Ronquist, 2001), BEAST (Drummond & Rambaut, 2003), etc.
Advantage - The global molecular clock seemed very useful for calculating
divergence times and setting up models of evolution for many groups of organisms.
Drawback - The clock turned out to be an oversimplified model, and many studies
presented highly unlikely results. Comparisons with the fossil record often showed
large discrepancies between molecular and fossil ages.
Local clock or heterogeneous rate of substitution method
• Several reasons are given for deviations
from the clock-like model of sequence
evolution: generation time, metabolic rate,
mutation rate and population size.
• Local clock methods use a model of nucleotide
or amino acid substitution in which the rate is not
constant among all branches; the
global rate is divided into several rate classes (local
rates).
• They use Bayesian inference and
maximum likelihood to estimate divergence
time.
• Advantage - Local clock methods are promising
because they can use genes discarded by global
clock methods, permitting a larger total number
of genes for estimating time, a positive attribute
in data-limited situations.
• Drawback - Some assignments of rates to
branches are not feasible because they cause the
model to become unidentifiable.
Table 2 Comparison table of different molecular dating methods.
Part 2: Methods that correct for rate heterogeneity
Methods that estimate divergence times by
incorporating rate heterogeneity
• Methods that relax rate constancy must necessarily be
guided by specifications about how rates are expected to
change among lineages
• All methods estimate branch lengths without assuming
rate constancy, and then model the distribution of
divergence times and rates by minimizing the
discrepancies between branch lengths and the rate
changes over the branches.
• The methods differ in their strategy to incorporate
age constraints (calibration points) into the analysis.
Table 3 Comparison table of different molecular dating methods.
Part 3: Methods that incorporate rate heterogeneity
Prokaryote evolution timescale
• In 2004, Battistuzzi et al. reconstructed
a genomic timescale of prokaryote evolution
from a data set of available sequences
of 32 proteins common
to 72 species.
• They estimated phylogenetic relationships and
divergence times with a local clock method.
• They assembled many protein families and many
species.
The objective of this study
• Most information on the timescale of prokaryote evolution has come
from analysis of DNA and amino acid sequence data with molecular clocks.
• The increasing number of prokaryotic genomes available has facilitated the
detection of HGT through more accurate detection of orthology, paralogy, and
monophyletic groups, and the concatenation of gene and protein sequences
has helped increase the confidence of nodes and decrease the variance of
time estimates.
• 1. To assemble a data set of sequences from 32 proteins (~7600
amino acids) common to 72 species.
• 2. To estimate phylogenetic relationships and divergence times
with a local clock method.
• 3. To investigate the origin of metabolic pathways of
importance in evolution of the biosphere.
Methods(Data assembly)
• Data assembly began with the Clusters of Orthologous Groups of proteins (COG),
a systematic grouping of gene families from completed genomes. These groups
are formed by comparing protein sequences of known origin with the proteins of
unicellular genomes that have been studied extensively, have a phylogenetic
lineage, and have been deemed complete.
• A COG consists of a protein or a group of proteins (typically paralogs) that come from a
minimum of 3 lineages and thus correspond to an ancestral
domain. Currently 66 clusters exist, and with more research and study the list will
continue to expand.
• In this study, the COGs consisted of 84 proteins common to 43 species.
• Other species from among the completed microbial genomes
(NCBI; National Center for Biotechnology Information) were added to that initial dataset
with the help of BLAST and PSI-BLAST.
• In total, 72 species were included in the study: 54 eubacteria, 15 archaebacteria and
three eukaryotes (Arabidopsis thaliana, Drosophila melanogaster, and Homo
sapiens).
• This dataset consisted of 60 proteins that were individually analyzed as a step in
orthology determination.
• The proteins were aligned with CLUSTALW.
Methods(Data assembly)
• Phylogenetic trees of each protein were built and visually inspected. Initial trees were
constructed using Minimum Evolution (ME) with MEGA version 2.1.
• The Minimum Evolution (ME) method takes as the best tree the one with the smallest sum of all
branch lengths (S), because the expected value of S is smallest for the true tree.
• S = \sum_{i=1}^{T} b_i
• where b_i is the estimate of the length of the i-th branch and T is the total number of branches.
[Figure: estimates of branch lengths obtained by ME for three candidate trees, with S = 35.3, S = 37.5 and S = 37.6]
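The ME criterion can be sketched as follows: sum the branch length estimates of each candidate tree and keep the tree with the smallest S. The S values are those shown on the slide, while the tree labels, the function names and the short branch list in the usage line are hypothetical.

```python
def tree_length(branch_lengths):
    """Sum of branch length estimates, S = b_1 + ... + b_T."""
    return sum(branch_lengths)


def minimum_evolution_choice(candidate_s_values):
    """Return the name of the candidate tree with the smallest S."""
    return min(candidate_s_values, key=candidate_s_values.get)


# S values from the slide; the tree labels are hypothetical.
candidates = {"tree_1": 35.3, "tree_2": 37.5, "tree_3": 37.6}
best = minimum_evolution_choice(candidates)

# Hypothetical branch estimates; ME estimates can be negative, as on the slide.
example_s = tree_length([1.5, 2.0, -0.5])
```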
Methods(Data assembly)
• The major criterion used in determining which genes to include or exclude was the
monophyly of domains.
• A monophyletic taxon is a group of organisms descended from a single ancestor.
A polyphyletic taxon is composed of unrelated organisms descended from more than
one ancestor. A paraphyletic taxon
includes an ancestor and some, but not all, of the organisms descended from it.
Methods(Data assembly)
• Genes in which the domains (archaebacteria and eubacteria) were non-monophyletic
were rejected, as these would be the best examples of HGT; this amounted to
61% of the genes rejected.
• The effectiveness of the criterion was tested by examining the stability of individual
protein trees under different gamma values (α = 1, 0.5 and 0.3).
• Only the genes that were stable to such perturbations (in terms of
remaining in the category of non-HGT genes) were kept.
• The 32 remaining proteins were concatenated for analysis. The majority
(81%) of the 32 proteins that were used are classified in the "information
storage and processing" functional category of the COG. The other categories
represented are "cellular processes" (10%), "metabolism" (3%), and
"information storage and processing" + "metabolism" (proteins with
combined functions; 6%).
• From the concatenation, trees were constructed with ME, Maximum
Likelihood (ML; log likelihood score, optimized over branch lengths and
model parameters) and Bayesian methods (posterior probability, calculated by
integrating over branch lengths and substitution parameters).
Result (Phylogeny)
• The phylogenies obtained with ME, ML and Bayesian were similar, differing
only at nonsignificant nodes assessed by the bootstrap method.
• The bootstrap test is a method of testing the reliability of the topology
of a tree obtained by distance methods. It examines the reliability of
each interior branch of a tree by computing a confidence value
called the bootstrap value. Nucleotide or amino acid sites are sampled
randomly, with replacement, and a new tree is constructed. This is repeated
many times, and the frequency with which a particular node appears is its
bootstrap value. If the
value is higher than 95%, the interior branch is considered to be statistically
significant.
• The phylogeny of eubacteria (Fig. 1) shows significant bootstrap support for
most of the major groups and subgroups.
• 1.All proteobacteria form a monophyletic group (support values 95/47/99 for
ME, ML and Bayesian respectively) with the following relationships of the
subgroups: (epsilon (alpha (beta, gamma))).
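The bootstrap procedure described above can be sketched in Python. To stay short, this sketch replaces full tree building with a three-sequence check: a replicate "supports" a pair when the two sequences are closer to each other (by proportion of differing sites) than either is to a third sequence. That stand-in, the function names and the toy alignment are all assumptions made for illustration.

```python
import random


def p_distance(seq_a, seq_b):
    """Proportion of sites at which two aligned sequences differ."""
    return sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def bootstrap_support(alignment, pair, outsider, replicates=100, seed=0):
    """Fraction of replicates in which `pair` groups together.

    Sites (columns) are resampled with replacement in each replicate.
    """
    rng = random.Random(seed)
    first, second = pair
    n_sites = len(alignment[outsider])
    hits = 0
    for _ in range(replicates):
        columns = [rng.randrange(n_sites) for _ in range(n_sites)]
        resampled = {
            name: [seq[i] for i in columns] for name, seq in alignment.items()
        }
        d_pair = p_distance(resampled[first], resampled[second])
        d_out = min(
            p_distance(resampled[first], resampled[outsider]),
            p_distance(resampled[second], resampled[outsider]),
        )
        if d_pair < d_out:
            hits += 1
    return hits / replicates


# Toy alignment: A and B are identical, C differs at every site.
toy = {"A": "AAAAAAAA", "B": "AAAAAAAA", "C": "CCCCCCCC"}
support = bootstrap_support(toy, pair=("A", "B"), outsider="C")
```

In a real analysis each replicate would rebuild the whole tree (for example with ME), and the support of every interior branch would be counted across replicates.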
Result(Phylogeny)
• 2. There has been debate about the effect of base composition and
substitution rate on the phylogenetic position of the endosymbiont
Buchnera among γ-proteobacteria. Its position differs slightly from both
studies; accordingly, any conclusions concerning its divergence time should be
treated with caution.
• 3.Spirochaetes cluster with Chlamydiae, Actinobacteria with Cyanobacteria
and Deinococcus (support values for Cyanobacteria + Deinococcus are
92/80/99) and the hyperthermophiles (Thermotoga, Aquifex) branch basally
in the tree. These groups and relationships are similar to those found
previously with analyses of prokaryote genome sequences.
Figure 1. Phylogenetic tree (ME; α = 0.94) of eubacteria rooted with archaebacteria, using
sequences of 32 proteins (7,597 amino acids). Bootstrap values are shown on nodes; asterisks
indicate support values > 95%. For major groups, support values from three phylogenetic
methods (ME/ML/Bayesian) are indicated in italics (dash indicates a group was not present).
Result(Phylogeny)
• 4.The phylogeny of archaebacteria (Fig. 2) agrees with some but not all
aspects of previous phylogenetic analyses of prokaryote genomes using
sequence data and the presence and absence of genes
• For example, each of the two major clades of Archaebacteria is
monophyletic. This is consistent with some analyses but not others. Also, the
position of Crenarchaeota as closest relatives of eukaryotes (Fig. 2), instead
of Euryarchaeota, has been debated.
• 5.Methanogens were found to be monophyletic in some previous analyses
but were paraphyletic in other analyses and in our analysis (Fig. 2). The
phylogenetic position of one species of methanogen in particular,
Methanopyrus kandleri, has differed among previous studies. However, it is
difficult to make direct comparisons among various studies because they
have included different sets of taxa.
Figure 2. Phylogenetic tree (ME; α = 1.20) of archaebacteria rooted with eubacteria, using
sequences of 32 proteins (7,338 amino acids). Bootstrap values are
shown on nodes; asterisks indicate support values > 95%. For major groups, support values from
three phylogenetic methods (ME/ML/Bayesian) are indicated in italics.
Method (Time estimation)
• Time estimation was conducted separately within each domain (Archaebacteria and Eubacteria)
using reciprocal rooting and several calibration points.
• All time estimates were calculated with a Bayesian local clock approach utilizing concatenated
data sets of multiple proteins and a JTT+gamma model of substitution.
• The idea behind the Bayesian local clock is this: if evolutionary events (e.g., nucleotide substitutions) occur
independently, then the number of evolutionary events that occur on a branch existing from time 0 to
time T and having rate R(t) at time t follows a Poisson distribution with mean
B(T) = \int_0^T R(t)\,dt,
where B(T) is referred to as the branch length. R(t) cannot be directly observed. One way to overcome this problem is
to adopt the restrictive molecular clock assumption of a constant rate with respect to time.
• The rate of branch i is denoted R_i. The autocorrelation of
rates between an ancestral branch and its direct descendant
depends on the time difference between the midpoints of the
ancestral and the descendant branches. For example, for the
thickened ancestral and descendant
branches in figure 1, this is the time between their midpoints.
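For a Poisson process with a time-varying rate R(t), the mean number of events on a branch lasting from 0 to T is the integral of R(t) over that interval. A minimal numerical sketch of that quantity follows, assuming a simple trapezoidal rule. The function name and the example rate functions are hypothetical.

```python
def expected_branch_length(rate_function, total_time, steps=1000):
    """Trapezoidal approximation of B(T), the integral of R(t) from 0 to T."""
    step = total_time / steps
    total = 0.0
    for k in range(steps):
        left = rate_function(k * step)
        right = rate_function((k + 1) * step)
        total += 0.5 * (left + right) * step
    return total


# Under a strict clock R(t) is constant, so B(T) reduces to rate * time.
clock_length = expected_branch_length(lambda t: 0.02, total_time=10.0)

# A hypothetical rate that rises linearly over the branch.
varying_length = expected_branch_length(lambda t: 0.01 * t, total_time=10.0)
```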
Method (Time estimation)
• In Bayesian analyses, a priori knowledge about parameter values is summarized through
the assignment of probability distributions known as priors.
• The observed data and the prior distributions are then used to determine probability distributions
known as posteriors.
• The posterior distribution is a probability distribution representing uncertainty about the
parameters after observing the data.
• The logarithm of the rate on the descendant branch has a normal distribution with a mean equal to
the logarithm of the rate on the ancestral branch and with a variance equal to the time difference
multiplied by a constant that we will refer to as v. A high value of v means there is little rate
autocorrelation, and a low value implies strong rate autocorrelation.
• By Bayesian convention, a parameter governing a prior distribution is called a hyperparameter. In
this model, the value of v determines the prior distribution for the rates of molecular evolution on
different branches given the internal node times.
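The rate-change model described above can be sketched as a single random draw: the log of the descendant rate is normal, centered on the log of the ancestral rate, with variance v times the time difference. The function name and the numbers are hypothetical, and this sketch only simulates from the prior; it does not perform any inference.

```python
import math
import random


def sample_descendant_rate(ancestral_rate, time_difference, v, rng=random):
    """Draw a descendant rate under the autocorrelated log-normal model.

    log(descendant rate) ~ Normal(log(ancestral rate), v * time_difference)
    """
    standard_deviation = math.sqrt(v * time_difference)
    log_rate = rng.gauss(math.log(ancestral_rate), standard_deviation)
    return math.exp(log_rate)


rng = random.Random(1)
# v = 0: perfect autocorrelation, the descendant keeps the ancestral rate.
same_rate = sample_descendant_rate(0.02, time_difference=5.0, v=0.0, rng=rng)
# Larger v: the descendant rate can drift far from the ancestral rate.
drifted_rate = sample_descendant_rate(0.02, time_difference=5.0, v=1.0, rng=rng)
```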
• Posterior distribution: for a data set X of aligned homologous sequences, the posterior distribution
p(T, R, v | X) follows from Bayes' theorem:
p(T, R, v | X) = \frac{p(X | T, R, v)\, p(R | T, v)\, p(T)\, p(v)}{p(X)}
If both the rates R and the divergence times T are known, v carries no further information about the data X.
Letting B = (B_0, ..., B_k) represent the lengths of the branches on the tree, the distribution p(T, R, v | X) is
p(T, R, v | X) = \frac{p(X | B)\, p(R | T, v)\, p(T)\, p(v)}{p(X)}
because p(X | T, R) = p(X | B).
Method (Time estimation)
• Calibration of rate in this method was implemented by assigning constraints to nodes
in the phylogeny. Five different initial settings (prior distributions) were used in each
domain.
• These were chosen at intervals of 0.5 Ga from 4.5 Ga, which is
approximately the age of the Earth and Solar System, to 2.5 Ga, which is slightly
before the major rise in oxygen (Great Oxidation Event; GOE) as recorded in the
geologic record and related to the presence of oxygenic cyanobacteria.
• Those constraints pertained to the ingroup root, or deepest divergence in the tree
excluding the outgroup. Because of the relatively small number of duplicate genes
available for rooting the tree of life, we were unable to estimate the time of the last
common ancestor (the divergence of eubacteria and archaebacteria).
Prior distribution for the rate of molecular evolution
(Rttm = ingroup root constraint)

EUBACTERIA              ARCHAEBACTERIA
Rttm       Rtrate       Rttm       Rtrate
2500 Ma    0.034        2500 Ma    0.026
3000 Ma    0.028        3000 Ma    0.022
3500 Ma    0.024        3500 Ma    0.019
4000 Ma    0.020        4000 Ma    0.016
4500 Ma    0.019        4500 Ma    0.014
-My understanding is that 2500, 3000, 3500, 4000 and 4500 Ma
are the calibration points (time or age), and that the
branch lengths of the data set come from the phylogram. Together, these two
quantities determine the rate (Rtrate) at each
calibration time.
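That reading (rate = branch length / time) can be sketched in a few lines. The root-to-tip length of 0.85 substitutions per site and the time unit of 100 Myr are hypothetical values chosen only so that the results fall in the same range as the eubacterial column of the table. They are not taken from the paper, and the function name is made up for the sketch.

```python
def prior_rate(root_to_tip_length, root_age_ma, time_unit_ma=100.0):
    """Rate = branch length / elapsed time, with time counted in `time_unit_ma`."""
    return root_to_tip_length / (root_age_ma / time_unit_ma)


# Hypothetical root-to-tip length (substitutions per site); see the note above.
ROOT_TO_TIP = 0.85
root_ages_ma = [2500, 3000, 3500, 4000, 4500]
rates = {age: round(prior_rate(ROOT_TO_TIP, age), 3) for age in root_ages_ma}
```

An older assumed root age spreads the same branch length over more time, so the implied rate falls, which is the trend seen in both columns of the table.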
Chronogram of Archaebacteria
Time estimation
* The fossil calibration was the first appearance of a representative of the plant
lineage (red algae) at 1.198 ± 0.022 Ga. The molecular time estimate for this
divergence was 1.609 ± 0.060 Ga from a study of 143 rate-constant proteins.
Figure 3. A timescale of prokaryote evolution
Chronogram of Eubacteria
Time estimation
Time estimation
• For the eubacterial data set, we used four internal time constraints in separate
analyses, all involving the origin of cyanobacteria.
• The first and most conservative constraint was a fixed origin (minimum and maximum
bounds) at 2.3 Ga, which corresponds to the GOE.
• For the second constraint we used 2.3 Ga as a minimum bound, with no maximum
bound.
• For the third constraint we used a previous molecular time estimate (2.56 Ga) for the
divergence of cyanobacteria from their closest living relatives among eubacteria, and fixed
the minimum (2.04 Ga) and maximum (3.08 Ga) values to the 95% confidence limits
of that time estimate.
• The fourth constraint for the origin of cyanobacteria was set at 2.7 Ga (minimum
constraint) based on biomarker evidence for the presence of 2α-methylhopanes.
• The use of these four alternative constraints for the origin of cyanobacteria considers
most of the widely discussed hypotheses but does not rule out an origin prior to 2.7
Ga. Although the results of the four different calibrations are provided for comparison,
our preferred calibration is the 2.3 Ga (minimum) geologic calibration because it has the
best justification (supporting evidence).
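The four constraints can be written down as (minimum, maximum) bounds and checked against a candidate node age. This is only a bookkeeping sketch: the dictionary keys and function names are hypothetical, and real dating software applies such bounds inside the estimation, not as an after-the-fact check.

```python
# Bounds in Ga, as listed above; None means no bound on that side.
CYANOBACTERIA_CONSTRAINTS = {
    "fixed_GOE": (2.3, 2.3),
    "minimum_GOE": (2.3, None),
    "molecular_estimate": (2.04, 3.08),
    "biomarker_minimum": (2.7, None),
}


def satisfies(age_ga, bounds):
    """True if a node age lies within the (minimum, maximum) bounds."""
    minimum, maximum = bounds
    if minimum is not None and age_ga < minimum:
        return False
    if maximum is not None and age_ga > maximum:
        return False
    return True


def compatible_constraints(age_ga):
    """Names of the constraints that a given node age satisfies."""
    return [
        name
        for name, bounds in CYANOBACTERIA_CONSTRAINTS.items()
        if satisfies(age_ga, bounds)
    ]


# 2.56 Ga is the previous molecular estimate quoted for the third constraint.
allowed = compatible_constraints(2.56)
```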
Result (time estimation)
• A single timetree was constructed from the phylogenetic and divergence
time data. The time estimates summarized in that tree derive only from the
best-justified calibrations.
• For eubacteria, the 2.3 Ga minimum calibration (constraint), from the
geologic record, was chosen because it encompasses all of the
hypothesized time estimates for the origin of cyanobacteria.
• For archaebacteria, the 1.2 Ga calibration (minimum 1.174 Ga, maximum
1.222 Ga), from the red algae fossil record, was selected because it
provides a conservative constraint on the divergence of plants and animals.
Discussion
1. Hyperthermophiles
- Mostly in a basal position
- Debated in this tree
2. The E. coli and Salmonella divergence is consistent.
3. Inconsistent with the fossil record representing Cyanobacteria
4. Archaebacterial divergences are closely spaced in time
Discussion
5. Origin of life on Earth
• A Hadean (4.5–4.0 Ga) origin for life on Earth is also consistent with the
early establishment of a hydrosphere. Nevertheless, the earliest geologic
and fossil evidence for life has been debated leaving no direct support for
such old time estimates.
6. Methanogenesis
• Archaebacteria are the only prokaryotes known to produce methane. Our
time estimate of between 4.11 Ga (3.31–4.49 Ga) (node P-O) and 3.78 Ga
(3.05–4.16 Ga) for the origin of methanogenesis suggests that
methanogens were present on Earth during the Archean, consistent with the
methane greenhouse theory.
7. Anaerobic methanotrophy
• Anaerobic methanotrophy, or anaerobic oxidation of methane (AOM), is a
metabolism associated with oxidation of methane and sulfate reduction. The
methane oxidizers are represented by archaebacteria phylogenetically
related to the Methanosarcinales (from 3.09 Ga (2.47–3.51 Ga; node M) to 0.23
Ga (0.12–0.39 Ga; node L)), while the sulfate reducers, when present, are
eubacterial members of the δ-proteobacteria division.
Discussion
8. Aerobic methanotrophy
• Both anaerobic and aerobic methanotrophy have been used to explain the
highly depleted carbon isotopic values found in 2.8–2.6 Ga geologic
formations. The distribution of aerobic methanotrophy among divisions of the
proteobacteria suggests an origin of
this metabolism between node C (2.80 Ga; 2.45–3.22 Ga) and node B
(2.51 Ga; 2.15–2.93 Ga). The time estimates for these two metabolisms
are both compatible with the isotopic record.
9. Phototrophy
• The ability to utilize light as an energy source (phototrophy, photosynthesis)
is restricted to eubacteria among prokaryotes. Phototrophic eubacteria are
found in five major phyla (groups): proteobacteria, green sulfur
bacteria, green filamentous bacteria, gram-positive heliobacteria, and
cyanobacteria. This broad taxonomic distribution of phototrophic
metabolism has been attributed to HGT. The authors assumed that the common
ancestor (node I) was phototrophic. Therefore, phototrophy evolved prior to
3.19 (2.80–3.63) Ga (node I). Because the hyperthermophiles Aquifex and
Thermotoga are not phototrophic and branch more basally, 3.64 (3.17–4.13)
Ga (node J) can be considered a maximum date for phototrophy.
Discussion
10. The colonization of land
• The synthesis of pigments such as carotenoids, which function as
photoprotective compounds against the reactive oxygen species created by
UV radiation, is an ability present in all the photosynthetic eubacteria and in
groups that are partly or mostly associated with terrestrial habitats such as the
actinobacteria, cyanobacteria, and Deinococcus-Thermus. Pigmentation was
probably a fundamental step in the colonization of surface environments. An
early colonization of land is inferred to have occurred after the divergence of
this terrestrial lineage from Firmicutes (node H), 3.05 (2.70–3.49) Ga, and prior
to the divergence of Actinobacteria from Cyanobacteria + Deinococcus (node
F), 2.78 (2.49–3.20) Ga. These molecular time estimates are compatible with
time estimates (2.6–2.7 Ga) based on geological evidence for the earliest
colonization of land by organisms (prokaryotes).
11. Oxygenic photosynthesis
• Some of the early steps leading to oxygenic photosynthesis apparently were the
acquisition of protective pigments, phototrophy, and the colonization of land.
Species of cyanobacteria are known, broadly distributed among the orders.
The calibration for the origin of cyanobacteria was 2.3 Ga, a geologic date based
on the GOE. In this case, the time estimated for node E (2.56 Ga; 2.31–2.97 Ga)
was not much older than the constraint itself.
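Several points in this discussion call a molecular estimate "compatible" with geologic evidence. One simple way to read that, taken here as an assumption and not as the authors' stated criterion, is that the credibility interval of the estimate overlaps the geologic age range. The function name is hypothetical; the intervals in the usage lines are the ones quoted above.

```python
def intervals_overlap(interval_a, interval_b):
    """True if two closed intervals (in Ga, in any order) share at least one point."""
    low_a, high_a = sorted(interval_a)
    low_b, high_b = sorted(interval_b)
    return low_a <= high_b and low_b <= high_a


# Node F (2.49-3.20 Ga) against geologic evidence for land colonization (2.6-2.7 Ga).
land_compatible = intervals_overlap((2.49, 3.20), (2.6, 2.7))

# Node B (2.15-2.93 Ga) against the isotopic record (2.8-2.6 Ga).
methanotrophy_compatible = intervals_overlap((2.15, 2.93), (2.8, 2.6))
```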
Figure 4. A time line of metabolic innovations and events on Earth. The minimum time for oxygenic photosynthesis is
constrained by the Great Oxidation Event (2.3 Ga) whereas the maximum time for the origin of life is
constrained by the origin of Earth (4.5 Ga). Horizontal lines indicate credibility intervals, white boxes indicate
minimum and maximum time constraints on the origin of a metabolism or event, and colored boxes indicate
the presence of the metabolism or event.
Our phylogenetic results support most of the currently recognized higher-level groupings of prokaryotes.
Divergence time estimates for the major groups of eubacteria are between 2.5–3.2 billion years ago (Ga)
while those for archaebacteria are mostly between 3.1–4.1 Ga.
The time estimates suggest a Hadean origin of life (prior to 4.1 Ga),
an early origin of methanogenesis (3.8–4.1 Ga),
an origin of anaerobic methanotrophy after 3.1 Ga,
an origin of phototrophy prior to 3.2 Ga,
an early colonization of land 2.8–3.1 Ga, and
an origin of aerobic methanotrophy 2.5–2.8 Ga.
Conclusions: Our early time estimates for methanogenesis support the consideration of methane,
in addition to carbon dioxide, as a greenhouse gas responsible for the early warming of the Earth's
surface.
Our divergence times for the origin of anaerobic methanotrophy are compatible with
highly depleted carbon isotopic values found in rocks dated 2.8–2.6 Ga. An early origin of phototrophy is
consistent with the earliest bacterial mats and structures identified as stromatolites, but a 2.6 Ga origin of
cyanobacteria suggests that those Archean structures, if biologically produced, were made by anoxygenic
photosynthesizers. The resistance to desiccation of Terrabacteria and their elaboration of photoprotective
compounds suggests that the common ancestor of this group inhabited land. If true, then oxygenic
photosynthesis may owe its origin to terrestrial adaptations.