Introduction to Phylogenomics
Julie Thompson
Laboratory of Integrative Bioinformatics and Genomics
IGBMC, Strasbourg, France
Phylogenomics
A combination of :
genomics (study of function and structure of genes and genomes)
molecular phylogenetics (study of evolutionary relationships among organisms)
Two different aspects :
using phylogenetic data to infer functions for DNA and protein sequences
(Eisen. Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Res. 1998)
using genomic data to infer phylogenetic relationships (species trees) and to
gain insights into the mechanisms of molecular evolution
(O'Brien and Stanyon. Phylogenomics. Ancestral primate viewed. Nature 1999)
Julie Thompson – IGBMC
1. Phylogeny-based functional inference
• Homology based methods
• Non-homology based methods
Phylogeny-based functional inference
Used in molecular biology, genetics, development, behavior,
epidemiology, ecology, systematics, conservation biology, forensics…
draw structural/functional inferences from the structure of the tree or
from the way the character states map onto the tree
use these clues to build hypotheses and models of important events,
systems, predict behavior, etc.
Reviewed in: Brown & Sjölander, PLoS Comput Biol. 2006
Levasseur et al, Evolutionary Bioinformatics, 2008
Phylogeny-based functional inference
A two-step process:
[Figure: (1) knowledge extraction from model systems (human, mouse, drosophila, yeast …): sequence, structure, function, evolution; (2) propagation, modelling and phylogenetic inference of data and information to new systems (partial experimental data)]
Levels of inference shown in the figure:
Complexes & Networks: copresence/coabsence, fusion/fission
Interactome: interologous approach
Interfaceome: conserved residues
Promotome: phylogenetic footprint
Transcriptome, proteome…: similar expression
Propagation from model systems
Classical method : similarity-based functional annotation (from Blast best hit)
Perform Blast search to detect similar sequences (e.g. human query; hits: mouse1, mouse2, worm, yeast)
Transfer function from highest scoring sequence with known function (e.g. from mouse1 to human)
Errors:
• gene duplications (ortholog/paralog)
• multi-domain proteins
• existing database errors
Propagation from model systems
Classical method : similarity-based functional annotation (from Blast best hit)
Problems :
distantly related sequences may have different functions
spurious hits in low complexity regions
propagation of existing database annotation errors
Example : complex domain organisation
Problems: domain organisation
[Figure: domain organisation of SW:Y449_MYCGE, SW:Y663_MYCPN, SW:SYFB_IDILO and SPT:A5IAL4_LEGPC; labelled regions: RNA binding domain, phenylalanyl-tRNA synthetase]
Annotation errors
Sequence prediction errors :
65% of the sequences are in silico predictions
44% of eukaryote predicted proteins are partially incorrect: at least one
suspicious indel or divergent segment (Bianchetti et al, 2005)
Function annotation errors :
66% of sequences in the UniProt database have GO annotations, but only
3% have evidence codes indicating experimental support (Krishnamurthy et
al, 2006)
10-30% of genome functional annotations are erroneous (Devos, Valencia,
2000)
Propagation from model systems
Phylogeny-based inference
Perform Blast search to detect similar sequences (human, mouse1, mouse2, worm, yeast)
Perform multiple alignment of sequences representing potential homologs
Construct phylogenetic tree and identify orthologs
[Figure: tree of the human, mouse1, mouse2, worm and yeast sequences, with duplication and fusion events marked]
Infer function from set of orthologs, domain organisation, conserved motifs (also 3D structure, etc.)
Assumption
We can identify a set of homologous sequences and differentiate
orthologs from paralogs
Orthologous sequences (which diverged by speciation) are more reliable predictors of
protein function than paralogous sequences (which diverged by gene duplication)
[Figure: ancestor gene A; a speciation event separates mouse gene A and human gene A (orthologs); a duplication event produces human gene A’ (paralogs)]
Define orthologous groups
pairwise orthology: reciprocal best hits (RBH)
Inparanoid (Remm et al., 2001)
COGs: Clusters of Orthologous Groups (Tatusov et al., 1997)
orthoMCL (Li et al., 2003)
EggNOG (Jensen et al., 2008)
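Pairwise orthology by reciprocal best hits can be sketched as follows. This is a minimal illustration, not the procedure used by Inparanoid, COGs, orthoMCL or EggNOG; the gene names and bit scores are hypothetical, and the scores are assumed to come from two all-against-all similarity searches (genome A against B, and B against A).

```python
from typing import Dict, List, Tuple

Scores = Dict[Tuple[str, str], float]  # (query, hit) -> similarity score


def best_hits(scores: Scores) -> Dict[str, str]:
    """Map each query gene to its highest-scoring hit."""
    best: Dict[str, Tuple[str, float]] = {}
    for (query, hit), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (hit, score)
    return {query: hit for query, (hit, _) in best.items()}


def reciprocal_best_hits(a_vs_b: Scores, b_vs_a: Scores) -> List[Tuple[str, str]]:
    """Return pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical scores: hA/hA2 are human paralogs, mA/mA2 are mouse paralogs.
human_vs_mouse = {("hA", "mA"): 900.0, ("hA", "mA2"): 400.0,
                  ("hA2", "mA"): 450.0, ("hA2", "mA2"): 300.0}
mouse_vs_human = {("mA", "hA"): 890.0, ("mA", "hA2"): 440.0,
                  ("mA2", "hA"): 410.0, ("mA2", "hA2"): 310.0}

print(reciprocal_best_hits(human_vs_mouse, mouse_vs_human))  # [('hA', 'mA')]
```

In this toy case only hA/mA is reported; hA2 and mA2 each point to a gene whose own best hit lies elsewhere, so no pair is called for them.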
Problems leading to wrong orthology assumptions:
sequencing errors, non-predicted sequences
gene duplication followed by differential gene loss
varying rates of evolution
[Figure: two example trees (sub-families A and B; sub-families X and Y) in which a duplication followed by gene loss makes reciprocal best hits pair genes from different sub-families (RBH: human B / mouse A; RBH: human X / rat Y)]
Define orthologous groups
Tree-based orthology: build a phylogenetic tree of a group of genes and
compare gene tree to species tree to define speciation, duplication events
Resampled Inference of Orthologs (RIO) (Zmasek and Eddy, 2002)
Orthostrapper (Storm and Sonnhammer, 2002)
Levels Of Orthology From Trees (LOFT) (van der Heijden et al, 2007)
Example: G protein-coupled receptors
Unknown sequence: prediction: opioid receptor
Unknown sequence: more general prediction: GPCR of unknown specificity
Large scale analysis pipelines
FIGENIX (Gouret et al, 2005): automatic pipeline for
structural/functional annotation and phylogeny
SIFTER (Engelhardt et al, 2005): statistical inference algorithms to
propagate function annotations within a phylogeny
PhyloFacts (Krishnamurthy et al, 2006): database of protein families,
integrating different predictions and experimental data in a phylogeny
MACSIMS (Thompson et al, 2006): information management system, to
propagate structural/functional data within a multiple alignment
Large scale analysis: example
Phylogenies of peroxisomal proteins (yeast, rat) were reconstructed to determine their
origin : eukaryotic, bacterial or archaeal
39–58% were of eukaryotic origin (biogenesis or maintenance)
13–18% were of bacterial origin (enzymes) by recruitment of proteins originally targeted to
mitochondria
Gabaldón et al. Biology Direct, 2006.
Large scale analysis : example
Figenix functional analysis of genes lost in mammals/vertebrates, but present
in other animals
More than 50% of lost genes are involved in biomolecular metabolism/catabolism
e.g. TPS synthesizes trehalose 6-phosphate from UDP-glucose; trehalose is a disaccharide crucial for the
survival of species in dry and freezing periods and other stress conditions
Danchin, Gouret and Pontarotti. BMC Evolutionary Biology 2006
Online resource : PhyloFacts
The Berkeley Phylogenomics Group : a phylogenomic encyclopedia containing 10,000
'books' for protein families, pre-calculated structural, functional and evolutionary analyses.
[Figure: PhyloFacts analysis pipeline; components shown: FlowerPower, SAM, Blast to PDB, MSA analysis, MUSCLE, NJ, MP, ML, SCI-PHY, PFAM]
http://phylogenomics.berkeley.edu/phylofacts
Online resource : PhyloFacts
Search with fasta sequence
Phylogeny-based inference
Warning: inference accuracy depends on evolutionary
distance and the particular functional attribute under
consideration
Some attributes of protein families, such as the 3D
structure, are conserved across large evolutionary
distances
Other attributes, such as substrate specificity, can be
modified by a few amino acid substitutions in critical
positions
MACSIMS : Information Management System
http://bips.u-strasbg.fr/MACSIMS/
Data collection :
• creation of a relational database
(BIRD, H. Nguyen)
Information management :
• data validation
• reliable propagation
Efficient exploitation :
• automatic, high-throughput processing
(XML format)
• visualisation (JalView editor)
Thompson et al, 2006
Substrate specificity
[Figure: alignment of rhodocoxin reductase and thioredoxin reductase, with the FAD binding positions marked (***)]
MACSIMS : Information Management System
Sulfatase protein family :
GALNS
Mutations in GALNS gene are implicated in Morquio A syndrome :
• mutation C79Y -> severe phenotype
• others -> milder phenotypes
Non-homology based methods
When no characterised homologs are available, 'non-homology' methods
can be used to analyze other patterns:
gene co-inheritance (phylogenetic profiling)
gene context
domain fusion
gene neighborhood (operon, synteny, …)
gene regulation (phylogenetic footprinting / shadowing)
They predict functional associations between proteins :
physical interactions
co-membership in pathways, regulons or other cellular processes
Phylogenetic profiling
Joint presence or joint absence of two traits across large numbers
of species can be used to infer a biological connection
e.g. involvement of two different proteins in the same biological
pathway (Pellegrini et al., 1999)
Hypothesis:
A biological process (photosynthesis, methanogenesis, histidine
biosynthesis, …) may require the concerted action of many proteins
If some protein critical to a process is lost, other proteins dedicated to
that process would become useless; natural selection makes it unlikely
they will be retained over evolutionary time
Therefore, genes that are functionally related should be gained and lost
together from genomes during evolution, which results in a correlation
of their occurrence vectors
Phylogenetic profiling
For each gene, code Presence (1)
or Absence (0) in each species
Group genes with same or similar
profiles
Genes with similar profiles are
likely to be functionally related
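The three steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the gene names and profiles are invented, each profile covers the same five (hypothetical) species in the same order, and "similar" is taken to mean a Hamming distance of at most `max_diff`, which is one simple choice among many.

```python
from typing import Dict, List, Tuple

# Hypothetical profiles: 1 = gene present, 0 = gene absent, one column per species.
profiles: Dict[str, Tuple[int, ...]] = {
    "geneA": (1, 1, 0, 1, 0),
    "geneB": (1, 1, 0, 1, 0),
    "geneC": (0, 1, 1, 0, 1),
    "geneD": (1, 1, 0, 0, 0),
}


def hamming(p: Tuple[int, ...], q: Tuple[int, ...]) -> int:
    """Number of species in which the two genes differ in presence/absence."""
    return sum(x != y for x, y in zip(p, q))


def similar_pairs(profiles: Dict[str, Tuple[int, ...]],
                  max_diff: int = 1) -> List[Tuple[str, str]]:
    """Gene pairs whose profiles differ in at most max_diff species."""
    genes = sorted(profiles)
    return [(g, h)
            for i, g in enumerate(genes)
            for h in genes[i + 1:]
            if hamming(profiles[g], profiles[h]) <= max_diff]


print(similar_pairs(profiles, max_diff=0))  # [('geneA', 'geneB')]
print(similar_pairs(profiles, max_diff=1))
```

With real data the profiles would span many more genomes, and a statistical measure of co-occurrence would usually replace the raw Hamming distance.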
Phylogenetic profiling: example
Comparative Genomics Identifies a Flagellar and Basal Body Proteome that Includes the BBS5 Human Disease Gene
Li et al, Cell, 2004
Phylogenetic profiling
Other methods:
Similarity-based methods (correlating rates of evolution) (eg. Marcotte, 2000)
Comparison of trees, rather than simple co-presence/co-absence (eg. for
STRING database, von Mering et al, 2003)
Problems/limitations :
Need to include a large number of genomes
Genes may not be predicted (or badly predicted)
Functional link is inferred, but no clues to exact gene functions
Domain fusion (Rosetta stone)
Hypothesis:
some pairs of interacting proteins have homologs in another
organism fused into a single protein chain
A comparison of sequence homologs from multiple organisms
can reveal these fused sequences
called Rosetta Stone sequences because they decipher the
interactions between the protein pairs (Marcotte et al, 1999)
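A toy version of the Rosetta Stone search is sketched below. It assumes that homology has already been reduced to family labels: each query-genome protein belongs to one family, and each protein of the other genome is described by the set of families it contains. All protein and family names are hypothetical, and real analyses work from sequence alignments rather than precomputed labels.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def rosetta_links(query_proteins: Dict[str, str],
                  other_proteins: Dict[str, Set[str]]) -> List[Tuple[str, str, str]]:
    """Return (protein1, protein2, fused_protein) triples.

    Two separate query proteins are linked when a single protein in the
    other genome contains homologs of both (a candidate Rosetta Stone).
    """
    links = set()
    for fused, families in other_proteins.items():
        members = sorted(p for p, fam in query_proteins.items() if fam in families)
        for p, q in combinations(members, 2):
            if query_proteins[p] != query_proteins[q]:  # must cover two distinct families
                links.add((p, q, fused))
    return sorted(links)


# Hypothetical input data.
query = {"protA": "fam1", "protB": "fam2", "protC": "fam3"}
other = {"fusedAB": {"fam1", "fam2"}, "single3": {"fam3"}}

print(rosetta_links(query, other))  # [('protA', 'protB', 'fusedAB')]
```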
Rosetta stone : genome analysis
Prediction of E. coli genome-wide gene network
Significance score threshold    Number of functional links    Number of proteins (% of genome)
1                               4613                          1124 (26%)
1 x 10^-6                       111                           583 (14%)
1 x 10^-10                      854                           475 (11%)
Problems :
The networks generated are sparse, but begin to define cellular systems
May not be scalable to higher eukaryotes because of the large numbers of duplicate
genes and promiscuous domains
Gene neighborhood methods
genes that frequently co-occur in the same operon (genomic region) in
a diverse set of species are more likely to physically interact or be
involved in the same pathway (Dandekar et al, 1998; Huynen et al, 2000;…)
Example:
fatty acid biosynthesis
fatty acid degradation
predicted transcription factor
TF may regulate fatty acid degradation and biosynthesis
From Harrington et al, PNAS 2007
Protein function prediction using combined methods
E.g. PLEX (Date and Marcotte, 2005)
MySQL relational database, with gene sequences, chromosomal positions, pre-computed
phylogenetic profiles and Rosetta Stone linkages, accessible via a web-based interface
http://bioinformatics.icmb.utexas.edu/plex/
Protein function prediction using combined methods
Study of protein function prediction in genomes and metagenomes
Combination of homology and non-homology approaches
specific function
non-specific function
conserved protein
singleton
From Harrington et al, PNAS 2007
Phylogenetic footprinting
Used to identify Transcription Factor Binding Sites (TFBS) within a
non-coding region of DNA
Hypothesis: selective pressure causes regulatory elements to evolve
at a slower rate than the non-functional surrounding sequence
Phylogenetic shadowing : a related technique used with closely related species
Tagle et al, 1988
Phylogenetic footprinting
Protocol:
Carefully choose species with orthologous genes to provide enough
sequence divergence
Decide on the length of the upstream / downstream region to be analysed
Align the sequences
Look for conserved regions and analyse them
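The last step of the protocol can be illustrated with a deliberately simple scan for fully conserved, ungapped stretches in an existing alignment. The sequences below are invented, the `min_len` cutoff is an arbitrary assumption, and the footprinting programs listed in this presentation use more tolerant scoring than strict identity.

```python
from typing import List, Tuple


def conserved_regions(alignment: List[str], min_len: int = 6) -> List[Tuple[int, int, str]]:
    """Return (start, end, motif) for runs of identical, ungapped columns.

    start is inclusive and end is exclusive, both 0-based alignment columns.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    regions = []
    start = None
    for i in range(length + 1):  # one extra step closes a run at the end
        conserved = (i < length
                     and alignment[0][i] != "-"
                     and all(seq[i] == alignment[0][i] for seq in alignment))
        if conserved and start is None:
            start = i
        elif not conserved and start is not None:
            if i - start >= min_len:
                regions.append((start, i, alignment[0][start:i]))
            start = None
    return regions


# Hypothetical aligned upstream regions from three species.
upstream = [
    "ACGTTGACGTCATTGCA",
    "TCGATGACGTCAGTGAA",
    "GCTTTGACGTCACAGCT",
]

print(conserved_regions(upstream))  # [(4, 12, 'TGACGTCA')]
```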
Example:
From Zhang and Gerstein Journal of Biology 2003
Footprinting programs…
Multiple alignment of genomic regions: PipMaker, AVID, Multiz
Experimentally validated motif databases: DBTSS, EPD
Motif prediction: FirstEF, Eponine and GenScan
Integrated systems: CONREAL, ConSite, Footer, PHYLONET,
PromAn, PhyloScan
Problems:
Species specific binding sites
Very short binding sites
Less specific binding factors
Compound binding regions
2. Construction of species trees
Problem
phylogenetic trees based on single gene families may conflict
with each other for a variety of reasons (gene duplication, loss,
horizontal transfer, convergent or parallel evolution…)
Solution
integrate the phylogenetic information from the different gene
families to form a single species phylogeny
Construction of species trees
Define groups of orthologous sequences
Then use:
Whole genome features (complete genome alignment, gene content)
Supermatrix (simultaneous-analysis, combined-analysis)
Supertree (separate analysis)
Delsuc et al, Nature reviews, 2005
Gene content
No multiple alignments, but sequence information is used to
determine the orthologous genes
Build a matrix indicating the presence or absence of OGs in all
species (phylogenetic profile)
Binary matrix can be treated in the same way as a multiple
sequence alignment (2 states: P present, A absent, instead of 4 states: ACGT)
Infer a phylogenomic tree from matrix (alignment)
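A minimal sketch of the gene content matrix, followed by one possible pairwise distance, is given below. The species names and orthologous groups (OGs) are hypothetical, and the Jaccard distance is an assumption chosen for brevity; it is not necessarily the measure used in the studies cited on the following slide.

```python
from typing import Dict, List, Set


def presence_absence(genomes: Dict[str, Set[str]], ogs: List[str]) -> Dict[str, str]:
    """One row per species: 'P' if the OG is present, 'A' if absent."""
    return {species: "".join("P" if og in content else "A" for og in ogs)
            for species, content in genomes.items()}


def jaccard_distance(row_a: str, row_b: str) -> float:
    """1 - (OGs present in both) / (OGs present in at least one)."""
    shared = sum(a == "P" and b == "P" for a, b in zip(row_a, row_b))
    either = sum(a == "P" or b == "P" for a, b in zip(row_a, row_b))
    return 1.0 - shared / either if either else 0.0


# Hypothetical gene content of three species.
genomes = {
    "sp1": {"OG1", "OG2", "OG3"},
    "sp2": {"OG1", "OG2", "OG4"},
    "sp3": {"OG4", "OG5"},
}
ogs = ["OG1", "OG2", "OG3", "OG4", "OG5"]

matrix = presence_absence(genomes, ogs)
for species, row in matrix.items():
    print(species, row)
print(jaccard_distance(matrix["sp1"], matrix["sp2"]))  # 0.5
```

The resulting distances could then be passed to any distance-based tree-building method, or the P/A rows analysed directly by parsimony.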
Gene content
Distance methods:
Maximum parsimony:
Snel, Bork & Huynen. (1999) Nature Genet.
Tekaia, Lazcano & Dujon. (1999) Genome Res.
Lin & Gerstein. (2000) Genome Res.
Wolf, Rogozin, Grishin, Tatusov & Koonin. (2001) BMC
Evol. Biol.
Fitz-Gibbon & House. (1999) Nucleic Acids Res.
Gene order (synteny)
Estimate evolutionary distance from the number of
rearrangements necessary to transform one genome into
another (computationally complex)
construct phylogenetic trees by minimizing the number of
breakpoints between genomes (Blanchette et al 1999)
More practical solution: simply score the presence or
absence of pairs of orthologous genes (Korbel et al. 2002,
Wolf et al 2001)
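The simplified scoring of adjacent gene pairs can be sketched as follows. The sketch assumes that both genomes contain the same orthologous genes on a single linear chromosome and that gene orientation is ignored; the gene orders are invented, and this is not the algorithm of the cited papers.

```python
from typing import FrozenSet, List, Set


def adjacencies(order: List[str]) -> Set[FrozenSet[str]]:
    """Unordered pairs of genes that are neighbours in the genome."""
    return {frozenset(pair) for pair in zip(order, order[1:])}


def breakpoint_count(order_a: List[str], order_b: List[str]) -> int:
    """Adjacencies of genome A that are disrupted in genome B."""
    return len(adjacencies(order_a) - adjacencies(order_b))


def shared_pairs(order_a: List[str], order_b: List[str]) -> int:
    """Adjacent gene pairs conserved between the two genomes."""
    return len(adjacencies(order_a) & adjacencies(order_b))


# Hypothetical gene orders: genes g2 and g3 are swapped in genome B.
genome_a = ["g1", "g2", "g3", "g4", "g5"]
genome_b = ["g1", "g3", "g2", "g4", "g5"]

print(breakpoint_count(genome_a, genome_b))  # 2
print(shared_pairs(genome_a, genome_b))      # 2
```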
Gene content / gene order
Problems
Orthology assessment
‘big genome attraction’: distantly related species with
large genomes may share more genes than more closely
related species with small genomes.
Sequence information is lost
Superalignments (supermatrix)
multiple alignments for each gene are
concatenated to form a superalignment
Use conventional phylogenetic reconstruction
methods (e.g. distance or MP)
(Brown et al. 2001, Wolf, et al 2001)
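Concatenation itself is straightforward, as the sketch below suggests. It assumes each gene has already been aligned separately and is supplied as a mapping from species to aligned sequence; padding a missing gene with gap characters is one common convention and is an assumption here, as are the species names and sequences.

```python
from typing import Dict, List


def concatenate(gene_alignments: List[Dict[str, str]], species: List[str]) -> Dict[str, str]:
    """Join per-gene alignments into one superalignment row per species.

    A species missing from a gene alignment is padded with '-' so that
    all rows keep the same length.
    """
    parts: Dict[str, List[str]] = {sp: [] for sp in species}
    for alignment in gene_alignments:
        length = len(next(iter(alignment.values())))
        for sp in species:
            parts[sp].append(alignment.get(sp, "-" * length))
    return {sp: "".join(chunks) for sp, chunks in parts.items()}


# Hypothetical alignments for two genes; yeast lacks gene 2.
gene1 = {"human": "ACGT", "mouse": "ACGA", "yeast": "TCGA"}
gene2 = {"human": "GG-C", "mouse": "GGTC"}

superalignment = concatenate([gene1, gene2], ["human", "mouse", "yeast"])
for sp, row in superalignment.items():
    print(sp, row)
```

The output rows can be written to any alignment format accepted by the tree-building program of choice.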
Superalignments
Example: RibAlign
analysis of 16S ribosomal RNA (rRNA) sequences
has been the de-facto gold standard for the
assessment of phylogenetic relationships among
prokaryotes
concatenation of ribosomal protein sequences
(MAFFT, Phylip: ProML, MrBayes)
Superdistance (supermatrix)
Superdistance methods first calculate distance
matrices for all gene families.
The phylogenomic distance between two
species is then defined as the average distance
between all the shared gene families
(Kunin et al., 2005)
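The averaging step can be sketched as below, assuming one distance matrix per gene family stored as a dictionary keyed by unordered species pairs. The families, species and distance values are hypothetical, and the plain arithmetic mean is an assumption; the cited method may weight or normalise distances differently.

```python
from typing import Dict, FrozenSet, List, Optional

DistanceMatrix = Dict[FrozenSet[str], float]  # {species pair} -> distance


def superdistance(gene_distances: List[DistanceMatrix],
                  sp_a: str, sp_b: str) -> Optional[float]:
    """Average distance over the gene families shared by both species."""
    key = frozenset((sp_a, sp_b))
    values = [matrix[key] for matrix in gene_distances if key in matrix]
    if not values:
        return None  # the two species share no gene family
    return sum(values) / len(values)


# Hypothetical per-family distances; family 2 is absent from yeast.
family1 = {frozenset(("human", "mouse")): 0.125,
           frozenset(("human", "yeast")): 0.75,
           frozenset(("mouse", "yeast")): 0.875}
family2 = {frozenset(("human", "mouse")): 0.375}

print(superdistance([family1, family2], "human", "mouse"))  # 0.25
print(superdistance([family1, family2], "human", "yeast"))  # 0.75
```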
Supertree
Reconstruct phylogenetic trees for each gene
family separately
Combine the multiple gene family trees to
form a single phylogenomic tree (Gene Tree
Reconciliation)
(Bininda-Emonds, 2004; Daubin et al., 2002)
Gene tree reconciliation methods
Consensus tree methods are used to combine fully overlapping
source trees (strict, majority consensus rules, …)
(e.g. MinCut, Semple and Steel 2000)
From de Queiroz and Gatesy, Trends Ecol Evol, 2007
Gene tree reconciliation methods
Indirect supertree construction represents individual source trees as
matrices, then combines them using an optimization criterion :
Matrix representation using parsimony (MRP)
“flip” supertrees
Average consensus procedure
Most Similar Supertree (MSSA)
Maximum Quartet Fit (QFIT)
Maximum Splits Fit (SFIT).
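The matrix encoding step used by MRP can be sketched as follows; only the encoding is shown, and the parsimony search on the resulting matrix is left to a phylogenetics package. Source trees are assumed to be supplied as a set of taxa plus a list of clades (sets of taxa), which sidesteps tree parsing; the taxa and clades are hypothetical.

```python
from typing import Dict, List, Set, Tuple

SourceTree = Tuple[Set[str], List[Set[str]]]  # (taxa in the tree, its clades)


def mrp_matrix(source_trees: List[SourceTree], taxa: List[str]) -> Dict[str, str]:
    """Encode each clade of each source tree as one character column.

    '1' = taxon inside the clade, '0' = taxon in the tree but outside the
    clade, '?' = taxon not sampled in that source tree (missing data).
    """
    rows: Dict[str, List[str]] = {taxon: [] for taxon in taxa}
    for tree_taxa, clades in source_trees:
        for clade in clades:
            for taxon in taxa:
                if taxon not in tree_taxa:
                    rows[taxon].append("?")
                elif taxon in clade:
                    rows[taxon].append("1")
                else:
                    rows[taxon].append("0")
    return {taxon: "".join(states) for taxon, states in rows.items()}


# Two hypothetical, partially overlapping source trees.
tree1 = ({"A", "B", "C", "D"}, [{"A", "B"}, {"A", "B", "C"}])
tree2 = ({"A", "B", "E"}, [{"A", "E"}])

matrix_mrp = mrp_matrix([tree1, tree2], ["A", "B", "C", "D", "E"])
for taxon, row in matrix_mrp.items():
    print(taxon, row)
```

The '?' entries are what allow source trees with only partially overlapping taxon sets to be combined.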
From Bininda-Emonds et al, 2002
Software Clann, http://bioinf.may.ie/software/clann/
Problems
Large amounts of data: need automatic pipelines
Need a reliable method to identify genuine orthologues
Missing data: some genes missing from some species (incomplete
sequencing)
Factors leading to an incorrect tree, even with use of genome-scale
data:
nucleotide or amino acid compositional bias
long-branch attraction caused by unequal evolutionary rates among
lineages
sparse taxon sampling
heterotachy (the shift of position specific evolutionary rates)
Comparison supermatrix/supertree
Supermatrix methods
Include all sequence information (reduces noise)
Can yield relationships that are not present in the set of source trees
Ignore differences in rates or modes of evolution
More sensitive to missing data
Computationally expensive
Supertree methods
Relatively efficient => allow construction of large trees
Estimate an independent set of parameters for every gene
Allow incorporation of diverse kinds of data, e.g. characters from fossils, MorphoBank
Less sensitive to missing data
Use heuristic algorithms that cannot be justified rigorously on a statistical basis.
Ignore uncertainties in the subtrees (bootstrap values, Bayesian posterior probabilities,…)
but some recent algorithms may solve this problem (Burleigh, 2006; Moore, 2006)
May over-fit the data and cause large variances in the estimates
Statistical modelling approach
statistical likelihood provides a framework for combining information from
different experiments
combine data from multiple genes while accommodating differences in the
evolutionary process
define a model that estimates the probability of obtaining a series of subtree
topologies, given a hypothesized supertree
select the supertree that maximizes the likelihood (product of the likelihoods
of all subtrees)
Ren, F., et al. A likelihood look at the supermatrix–supertree controversy, Gene (2008)
Applications: tree of life
Mammalian tree topology
70 mammalian species, plus Marsupialia and Monotremata as outgroups
Supermatrix approach using 1st and 2nd codon positions of mitochondrial protein-coding genes and MrBayes
Reyes, et al. Mol. Biol. Evol. 2004.
Applications: tree of life
3 hypotheses for the root of the eutherian tree
basal Afrotheria
basal Xenarthra
basal Boreotheria,
or Afrotheria/Xenarthra clade
concatenated dataset of the 2,789 gene sequences
Supermatrix (ML analyses) supported the Boreotheria (cow, dog, mouse, rat,
human, chimpanzee, and macaque) monophyly robustly, but root was sensitive to
substitution model
Supertree (ML analyses), takes account of variations in the rates and modes of
evolution among genes by assigning different parameters to different genes =>
all models support tree 1
Nishihara et al, Genome Biol. 2007 Rooting the eutherian tree: the power and pitfalls of phylogenomics
Applications: tree of life
Current status and future challenges
[Figure legend: relationships identified or confirmed by phylogenomics; hypothetical relationships; hypothetical relationships from phylogenomics; main uncertainties]
Extinct species
Ancestral sequence reconstruction
putative archosaur rhodopsin
visual pigment synthesised and
tested for function using
biochemical methods
archosaurs may have had visual pigments that would support dim-light vision
Were ancestors nocturnal or diurnal?
Chang, et al. Mol Biol Evol 2002
Perspectives: Jurassic genome?
Zimmer C. Evolution. Jurassic genome. Science. 2007