Phylogenomics and the evolution of gene repertoires in bacteria
Paris, MEP, June 18th 2005
Vincent Daubin
Bioinformatique et Génomique Evolutive
Menu
• Introduction: phylogenomics
– A neologism and an old quote.
• Phylogenomics in Bacteria/Prokaryotes
– What phylogenetic framework???
• Approaches for finding the Tree (if there is one)
– Results obtained from different methods
• Reconstructing the history of complete genomes
• Conclusion
Phylogenomics
• Nothing makes sense in genomics except in a
phylogenetic framework
– Understanding the organization of genomes, the
evolution of functions, the histories of duplications,
etc…
• Numerous prokaryotic genomes (relatively small,
dense in genes…)
• But what phylogenetic framework for
prokaryotes?
Woese, 1987
SSU rRNA phylogeny
From one gene…
ATTTGAC…
ACTTGAC…
ATTCGCC…
ATTCGCC…
… to another
TTTAGAC…
TCTAGAC…
TTACGCC…
TTACGAC…
Evidence for Lateral Gene Transfer
[Figure: UMP-kinase phylogeny (scale bar: 0.5) covering Bacteria, Archaea and Eukaryota. Highlighted labels: "Green plant" (Arabidopsis thaliana) and "cyanobacteria" (Synechocystis sp.).]
Multiple LGT or … ?
[Figure: orotate phosphoribosyltransferase phylogeny (scale bar: 0.5) covering Bacteria, Archaea and Eukaryota, including Caenorhabditis elegans and Saccharomyces cerevisiae.]
Lateral gene transfer in bacteria
Transduction
Conjugation
Transformation
Acquisition of function via LGT
Ochman, et al., 2000
Massive gene “exchanges” !
Ochman, et al., 2000
The alternative to the tree ?
Zhaxybayeva and Gogarten, 2002
Methods used to reconstruct the tree of life using complete genomes
• Oligonucleotide/peptide (word) frequencies in the genome/proteome
• Global index of similarity (BLAST)
• Gene content
• Gene order
• Gene concatenation
• Supertrees
Slide annotations on the underlying data: statistics on genomes; whole genomes (proteome), where the hypothesis of homology is not always clear; mostly gene homology; mostly gene orthology; gene orthology (alignments); finding xenology.
Word frequency
Word counts per genome (example values from the slide):
        sp1   sp2   sp3   …
AAAA    104   63    307   …
AAAC    …     …     …
AAAG    …
AAAT    …
→ Tree
• Count words (correct for % of letters)
• Compute distances (= differences in word usage)
• Build a tree
• Re-sample words for support
Hypothesis of homology ?
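The counting and distance steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the correction for letter composition is taken to be "observed minus expected from single-letter frequencies", the distance is Euclidean, and the function names are hypothetical (they do not come from any of the cited papers).

```python
from collections import Counter
from itertools import product
import math

def word_frequencies(seq, k):
    """Relative frequency of every overlapping k-letter word in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

def usage_profile(seq, k=4, alphabet="ACGT"):
    """Observed minus expected word frequency (assumed correction for % of letters)."""
    seq = seq.upper()
    letters = Counter(seq)
    n = sum(letters[a] for a in alphabet)
    base_freq = {a: letters[a] / n for a in alphabet}
    observed = word_frequencies(seq, k)
    profile = {}
    for letters_of_word in product(alphabet, repeat=k):
        word = "".join(letters_of_word)
        expected = math.prod(base_freq[a] for a in word)
        profile[word] = observed.get(word, 0.0) - expected
    return profile

def word_distance(seq_a, seq_b, k=4):
    """Euclidean distance between two usage profiles (an assumed choice of metric)."""
    pa, pb = usage_profile(seq_a, k), usage_profile(seq_b, k)
    return math.sqrt(sum((pa[w] - pb[w]) ** 2 for w in pa))

# Toy sequences (made up for illustration), compared with 2-letter words.
toy_a = "ATTTGACATTTGACATTCGCC"
toy_b = "GGCGCGCCGCGGGCCGCATCG"
d_ab = word_distance(toy_a, toy_b, k=2)
```

A full analysis would compute this distance for every pair of genomes, build a tree from the matrix, and repeat on re-sampled word sets for support.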
Statistics on genomes
• Pride et al. 2003
• Based on tetranucleotide frequency in 27 genomes
• Distance ~ differences in usage
• Relatively little signal for resolving the tree of bacteria BUT resolves recent and very deep nodes (i.e., domains).
Statistics on genomes
• Qi et al., 2004
• K-strings in proteins (i.e., words of K letters) – here, K=6
• 109 genomes
• Gets better with longer strings (relationship to gene homology?)
Blast scores
Compare proteomes (BLASTP…) → distance matrix → Tree
        sp1   sp2   sp3   …
sp1     0     0.5   …
sp2     0.5   0     …
sp3     …     …     0
…                         0
• Average %identity, normalized BLAST scores… (restrict to orthologous genes)
• Transform into distance
• Build a tree
• Re-sample pairs of matching genes for support (remove discordant matches)
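A minimal sketch of the score-to-distance step, assuming that each matching gene pair contributes a normalized score (match score divided by the query's self-match score) and that the distance is one minus the mean normalized score. The transformation and all names are illustrative assumptions; published methods use their own transformations.

```python
import random

def normalized_scores(hits, self_scores):
    """hits: {(gene_a, gene_b): score}; self_scores: {gene_a: score against itself}."""
    return [score / self_scores[a] for (a, _b), score in hits.items()]

def blast_distance(hits, self_scores):
    """Assumed transformation: 1 - mean(match / self_match)."""
    norm = normalized_scores(hits, self_scores)
    return 1.0 - sum(norm) / len(norm)

def bootstrap_distances(hits, self_scores, replicates, seed=0):
    """Re-sample pairs of matching genes with replacement and recompute the distance."""
    rng = random.Random(seed)
    pairs = list(hits.items())
    distances = []
    for _ in range(replicates):
        sample = [rng.choice(pairs) for _ in pairs]
        norm = [score / self_scores[a] for (a, _b), score in sample]
        distances.append(1.0 - sum(norm) / len(norm))
    return distances

# Toy scores (made up): two gene pairs between sp1 and sp2.
toy_hits = {("a1", "b1"): 50.0, ("a2", "b2"): 100.0}
toy_self = {"a1": 100.0, "a2": 200.0}
d_sp1_sp2 = blast_distance(toy_hits, toy_self)
```

With these toy values both pairs have a normalized score of 0.5, so the distance is 0.5, as in the example matrix on the slide.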
Blast scores
• Clarke et al., 2002
• Normalized BLASTP
scores (=match/self_match)
• 37 genomes (3 domains)
• Finds most of the phyla
defined by rRNA
• Remove phylogenetically
discordant matches (little
effect)
Gene content
Compare proteomes (BLASTP…) → parsimony matrix → Tree
        sp1   sp2   sp3   …
Gene1   0     1     0     …
Gene2   1     1     1     …
…
• Code presence/absence of:
- Orthologs (reciprocal best matches)
- Homologs (families)
- Domains, Folds (superfamilies)
• Compute distance (correct for genome size), or use Dollo parsimony…
• Build a tree
• Sample subset of genes for statistical support
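As a sketch of the presence/absence coding and of one possible size-corrected distance: here the shared fraction is computed relative to the smaller genome, which is an assumption chosen for illustration, and the function names are hypothetical.

```python
def presence_absence_matrix(genomes):
    """genomes: {species: set of gene families}. Returns {family: [0/1 per species]}."""
    species = sorted(genomes)
    families = sorted(set().union(*genomes.values()))
    return {fam: [int(fam in genomes[sp]) for sp in species] for fam in families}

def gene_content_distance(families_a, families_b):
    """1 - (shared families / size of the smaller genome); assumed size correction."""
    shared = len(families_a & families_b)
    return 1.0 - shared / min(len(families_a), len(families_b))

# Toy data reproducing the two rows of the example matrix.
toy_genomes = {
    "sp1": {"Gene2"},
    "sp2": {"Gene1", "Gene2"},
    "sp3": {"Gene2"},
}
toy_matrix = presence_absence_matrix(toy_genomes)
```

Dividing by the smaller genome keeps a reduced genome that is a strict subset of a larger one at distance 0 from it, which is one way to avoid grouping genomes merely because they are small.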
Gene content
• Yang et al. 2004
• Folds (=superfamilies)
in 119 bacterial
genomes
• Distance method
• Finds a few phyla
defined by rRNA
Gene content
• Snel et al. 1999
• Orthologs in 23
genomes
• Finds most of the
phyla defined by
rRNA
Gene order
Compare proteomes (BLASTP…) → gene order → Tree
[Diagram: genes a, b, c, d, e arranged in different orders in sp1, sp2 and sp3.]
• Assign orthologs
- keep those present in ≥ 2
- keep those present in all
• Compute distances based on:
- conservation of pairs of neighbors
- number of breakpoints
- sequence of inversions…
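The first two distance options can be illustrated as follows. The sketch assumes circular genomes given as ordered lists of ortholog identifiers, ignores strand, and first restricts both genomes to their shared orthologs; these are simplifying assumptions, not the procedure of any cited study.

```python
def neighbor_pairs(order, circular=True):
    """Set of unordered adjacent gene pairs in a gene order."""
    pairs = {frozenset(p) for p in zip(order, order[1:])}
    if circular and len(order) > 2:
        pairs.add(frozenset((order[-1], order[0])))
    return pairs

def shared_orders(order_a, order_b):
    """Restrict both gene orders to the orthologs present in both genomes."""
    common = set(order_a) & set(order_b)
    return ([g for g in order_a if g in common],
            [g for g in order_b if g in common])

def pair_conservation(order_a, order_b):
    """Fraction s of neighbor pairs of genome A that are also neighbors in genome B."""
    a, b = shared_orders(order_a, order_b)
    pairs_a, pairs_b = neighbor_pairs(a), neighbor_pairs(b)
    return len(pairs_a & pairs_b) / len(pairs_a)

def breakpoints(order_a, order_b):
    """Number of adjacencies of genome A that are broken in genome B."""
    a, b = shared_orders(order_a, order_b)
    pairs_a, pairs_b = neighbor_pairs(a), neighbor_pairs(b)
    return len(pairs_a - pairs_b)

# Toy gene orders (made up): sp2 differs from sp1 by swapping c and d.
toy_sp1 = ["a", "b", "c", "d", "e"]
toy_sp2 = ["a", "b", "d", "c", "e"]
s_12 = pair_conservation(toy_sp1, toy_sp2)
bp_12 = breakpoints(toy_sp1, toy_sp2)
```

Here three of the five adjacencies survive (a-b, c-d, e-a), so s is 0.6 and there are 2 breakpoints; s can then be turned into a distance, for example with -ln(s).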
Gene order
• Wolf et al., 2001
• Based on
conservation of pairs
of neighbors
• Finds most of the
phyla defined by
rRNA + suggests
some non-trivial
groups
Wolf et al., 2001
Gene concatenation
gene alignments
Super-alignment
select genes that can be
concatenated:
- reduce missing data
- analyze congruence (… or not)
Bootstrap:
Re-sample sites
(Re-sample genes)
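A minimal sketch of building a super-alignment and re-sampling its sites. It assumes each gene alignment is a dictionary from species to aligned sequence and that a missing gene is padded with "?"; both the data layout and the function names are hypothetical.

```python
import random

def concatenate(alignments, species):
    """Join gene alignments end to end; pad species lacking a gene with '?'."""
    parts = {sp: [] for sp in species}
    for gene in sorted(alignments):
        aln = alignments[gene]
        length = len(next(iter(aln.values())))
        for sp in species:
            parts[sp].append(aln.get(sp, "?" * length))
    return {sp: "".join(chunks) for sp, chunks in parts.items()}

def bootstrap_sites(superalignment, rng):
    """One bootstrap replicate: draw alignment columns with replacement."""
    length = len(next(iter(superalignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {sp: "".join(seq[c] for c in columns)
            for sp, seq in superalignment.items()}

# Toy alignments (made up): gene g2 is missing in species B.
toy_alignments = {
    "g1": {"A": "ATG", "B": "ATC"},
    "g2": {"A": "GG"},
}
toy_super = concatenate(toy_alignments, ["A", "B"])
toy_replicate = bootstrap_sites(toy_super, random.Random(0))
```

Re-sampling genes instead of sites would draw whole entries of `toy_alignments` with replacement before concatenating.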
Gene concatenation
• 57 genes in 45 species
(8857 positions)
• unrooted tree of bacteria
• Finds all phyla defined
by rRNA + suggests some
non-trivial groupings
Brochier et al., 2002
Comparison of (some) phylogenomic distances
[Figure: scatter plot of several phylogenomic distances (y-axis "Whatever distance", 0 to 1.8) against 16S rRNA divergence (F84) (x-axis, 0 to 0.6). Fits reported on the plot:]
Concatenated proteins (9 genes, JTT): R² = 0.6477
Gene order = -ln(s): R² = 0.0913
Gene order = (s-1): R² = 0.132
Presence/absence = -ln(s): R² = 0.0756
Presence/absence = (s-1): R² = 0.1849
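For readers who want to reproduce this kind of comparison, the sketch below shows how a similarity s can be turned into a distance with -ln(s) and how R² of a simple linear regression against a reference distance is computed. The numbers are made up; none of the values from the plot are reproduced here.

```python
import math

def log_distance(s):
    """Distance from a similarity s in (0, 1], using the -ln(s) transformation."""
    return -math.log(s)

def r_squared(x, y):
    """Coefficient of determination of the least-squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

# Toy values (made up): reference divergences and similarities for 4 species pairs.
toy_reference = [0.05, 0.10, 0.20, 0.40]
toy_similarity = [0.90, 0.70, 0.65, 0.30]
toy_distance = [log_distance(s) for s in toy_similarity]
toy_r2 = r_squared(toy_reference, toy_distance)
```

A low R² means the genome-scale distance tracks rRNA divergence poorly, which is the pattern the plot reports for gene order and presence/absence.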
Supertrees
Combination of trees
[Diagram: several gene trees with partially overlapping sets of taxa (A to G) are combined into one tree.]
• Select trees that can be combined = analyze congruence (… or not)
• Bootstrap:
- Re-sample sites (MRP)
- (Re-sample trees)
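The matrix step of Matrix Representation with Parsimony can be sketched as below: each clade of each source tree becomes one column, coded 1 for taxa in the clade, 0 for other taxa of that tree, and ? for taxa absent from the tree. The sketch assumes rooted trees written as nested tuples, which is a simplification (the gene trees discussed here are unrooted); the parsimony search on the resulting matrix is left to a dedicated program.

```python
def leaves(tree):
    """Set of taxon names under a node; a leaf is a plain string."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def clades(tree):
    """Internal clades of a rooted nested-tuple tree, excluding the root itself."""
    found = []

    def visit(node, is_root):
        if isinstance(node, str):
            return
        if not is_root:
            found.append(leaves(node))
        for child in node:
            visit(child, False)

    visit(tree, True)
    return found

def mrp_matrix(trees):
    """One column per clade: '1' inside, '0' outside, '?' if the taxon is not in the tree."""
    all_taxa = sorted(frozenset().union(*(leaves(t) for t in trees)))
    matrix = {taxon: "" for taxon in all_taxa}
    for tree in trees:
        in_tree = leaves(tree)
        for clade in clades(tree):
            for taxon in all_taxa:
                if taxon not in in_tree:
                    matrix[taxon] += "?"
                elif taxon in clade:
                    matrix[taxon] += "1"
                else:
                    matrix[taxon] += "0"
    return matrix

# Toy gene trees (made up) with overlapping taxon sets.
toy_trees = [(("A", "B"), ("C", "D")), (("A", "B"), "E")]
toy_mrp = mrp_matrix(toy_trees)
```

Re-sampling the columns of this matrix is the "re-sample sites (MRP)" option on the slide; re-sampling entries of `toy_trees` corresponds to re-sampling trees.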
Supertree of bacteria
• Daubin et al. 2002
• Bacterial supertree based on the combination of 121 gene trees, each with 7 to 32 species
• Matrix Representation with Parsimony (MRP)
• Finds all phyla defined by rRNA + suggests some non-trivial groupings
[Figure: supertree of 32 bacterial species, with bootstrap values ranging from 43 to 100. Species shown: Streptococcus pyogenes, Lactococcus lactis, Staphylococcus aureus, Bacillus subtilis, Bacillus halodurans, Ureaplasma parvum, Mycoplasma genitalium, Mycoplasma pneumoniae, Synechocystis sp., Deinococcus radiodurans, Mycobacterium tuberculosis, Mycobacterium leprae, Streptomyces coelicolor, Helicobacter pylori, Campylobacter jejuni, Rickettsia prowazekii, Caulobacter crescentus, Neisseria meningitidis, Xylella fastidiosa, Pseudomonas aeruginosa, Buchnera sp., Haemophilus influenzae, Pasteurella multocida, Escherichia coli, Vibrio cholerae, Aquifex aeolicus, Thermotoga maritima, Chlamydophila pneumoniae, Chlamydia muridarum, Chlamydia trachomatis, Borrelia burgdorferi, Treponema pallidum. Labelled groups: Low G+C Gram-positives, Mycoplasmas, High G+C Gram-positives, Proteobacteria, Chlamydiales, Spirochaetes.]
A tree of bacteria?
[Figure: the bacterial tree shown again, with the same 32 species, bootstrap values and group labels (Low G+C Gram-positives, Mycoplasmas, High G+C Gram-positives, Proteobacteria, Chlamydiales, Spirochaetes). It is captioned with two analyses:]
Super-tree (Daubin et al. 2002): 121 genes
Concatenation of ribosomal proteins (Brochier et al., 2002): 57 genes
A consensus for the tree of life
• Black: already known
from rRNA
• Red: established from
complete genome
analysis (congruence
among methods)
• Dashed red:
suggested by
complete genome
analysis
Wolf et al., 2002
Phylogenomics in bacteria
Nature of gene innovation in bacteria?
In eukaryotes: mainly duplication
What about bacteria?
The origin of “duplicates” in bacterial genomes
- Intra-genomic duplication: a → a + a’ (PARALOGS)
- LGT of a gene that already has a homolog in the genome: b → b + bx (XENOLOGS)
Calling these genes “duplicates” or “paralogs” is an overstatement:
“SYNOLOGS” = PARALOGS or XENOLOGS
Phylogenomics of Gammaproteobacteria (13 species)
• Ancient group (0.5-1 billion years – May et al., 2001)
• Model of bacterial diversification:
– Escherichia coli K12: commensal
– Salmonella typhimurium LT2: human pathogen
– Buchnera aphidicola AP: endosymbiont of aphids
– Haemophilus influenzae: commensal
– Pasteurella multocida: human pathogen
– Yersinia pestis (CO92 and KIM): animal pathogen (agent of plague)
– Pseudomonas aeruginosa PAO1: human opportunistic pathogen
– Xanthomonas axonopodis: plant pathogen (Citrus)
– Xanthomonas campestris: plant pathogen (crucifers)
– Xylella fastidiosa: plant pathogen (Citrus)
– Wigglesworthia brevipalpis: endosymbiont of tsetse fly
– Vibrio cholerae: human pathogen (agent of cholera)
• High rate of LGT reported (e.g., E. coli)
Gene families in γ-proteobacteria
[Figure: histogram of the number of families (y-axis, 0 to 8000) by number of species (x-axis, 0 to 13). Bar labels in extraction order: 8035, 2693, 988, 552, 332, 224, 266, 205, 145, 127, 174, 142, 275. Annotations: "Genes unique to a genome" and "minimal core of genes in γ-proteobacteria".]
The core of genes
• Among the 275 families that group genes from the 13 species: 205 families with 1 gene per species.
→ True orthologs.
→ Do these genes have the same history?
ML tests (ELW, SH, KH…)
1. Sequence alignment → ML tree (likelihood Ln1)
2. Phylogenetic hypothesis to test (e.g., “species phylogeny”) → likelihood LnX
3. Are Ln1 and LnX significantly different?
– NO: accept the phylogenetic hypothesis
– YES: possible LGT
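To make the decision step concrete, here is a hedged sketch of a KH-style comparison using a normal approximation on per-site log-likelihood differences. It assumes the per-site log-likelihoods of both trees have already been produced by a phylogenetics program; the function name and inputs are hypothetical. Note that the plain KH test is not strictly valid when one of the two trees is the ML tree chosen from the same data, which is one reason the slide also lists SH and ELW.

```python
import math

def kh_style_test(site_lnl_ml, site_lnl_hypothesis):
    """Return (delta, p): total log-likelihood difference and a one-sided p-value.

    delta = Ln1 - LnX summed over sites; its standard error is estimated from
    the variance of the per-site differences (normal approximation).
    """
    diffs = [a - b for a, b in zip(site_lnl_ml, site_lnl_hypothesis)]
    n = len(diffs)
    delta = sum(diffs)
    mean = delta / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    std_error = math.sqrt(n * variance)
    if std_error == 0.0:
        # No variation across sites: identical fit, or a constant difference.
        return delta, (1.0 if delta == 0.0 else 0.0)
    z = delta / std_error
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
    return delta, p_value

# Toy per-site log-likelihoods (made up) for a 4-site alignment.
toy_ml = [-1.0, -2.0, -1.5, -1.0]
toy_hyp = [-1.5, -2.5, -2.5, -1.5]
toy_delta, toy_p = kh_style_test(toy_ml, toy_hyp)
```

A small p-value corresponds to the "YES: possible LGT" branch; a large one means the tested topology cannot be rejected for that gene.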
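[Figure: bar chart (y-axis, 0 to 200) showing, for each tested topology, the number of genes for which it is "not different from the ML tree" versus "different from the ML tree". Tested topologies: best topology, SSU rRNA, concatenated proteins, and other hypotheses (x-axis numbered 0 to 13). Bar labels in extraction order: 197, 196, 203, 186, 172, 181, 178, 177, 133, 130, 117, 110, 95, 108, 97, 88, 72, 33, 8, 9, 27, 24, 19, 75, 28, 2.]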
The organismal phylogeny
[Figure: tree based on the concatenation of 203 genes (Lerat et al., 2003); scale bar 0.2; all bootstrap values shown are 100. Number shown next to each genome:]
E. coli: 4183
S. typhimurium: 4203
Y. pestis CO92: 3599
Y. pestis KIM: 3879
B. aphidicola: 564
W. brevipalpis: 653
H. influenzae: 1709
P. multocida: 2015
V. cholerae: 2724+1081
P. aeruginosa: 5540
X. fastidiosa: 2680
X. axonopodis: 4192
X. campestris: 4029
Example: maximum likelihood test with one synolog
[Diagram: for a family with a synolog in species A, the ML trees are compared with the species topology (test on ΔL).]
- Allows detection of possible LGT and identification of the true ortholog
- !!! Incongruence can result from duplication + loss (results need to be checked manually) !!!
Results for the phylogenetically “informative” fraction of the genomes
[Figure: percentage of LGT (y-axis, 0 to 80) by number of species (x-axis, 6 to 13), for families with 0, 1, 2 or >2 synologs.]
Synology is associated with a high frequency of LGT
Many of the so-called duplicates in bacterial genomes in fact arise by LGT
Lerat et al., 2005
But families having synologs are rare
[Figure: number of families (y-axis, 0 to 250; bar labels 7655, 2429, 835 and 457) by number of species (x-axis, 1 to 13), broken down by number of synologs (0, 1, 2, … 10, >10).]
Number of synologs = # genes – # genomes
→ Families having synologs represent less than 2% of the total
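The synolog count defined on this slide (# genes minus # genomes in a family) is simple to compute once families are available. The sketch below assumes each family is a list of (genome, gene) pairs; the toy families and function names are made up for illustration.

```python
from collections import Counter

def synolog_count(members):
    """Number of synologs in a family: # genes minus # genomes represented."""
    genomes = {genome for genome, _gene in members}
    return len(members) - len(genomes)

def synolog_distribution(families):
    """How many families have 0, 1, 2, ... synologs."""
    return Counter(synolog_count(members) for members in families.values())

def fraction_with_synologs(families):
    """Share of families that contain at least one synolog."""
    with_synologs = sum(1 for m in families.values() if synolog_count(m) > 0)
    return with_synologs / len(families)

# Toy families (made up): only fam2 has two genes in the same genome.
toy_families = {
    "fam1": [("Ecoli", "a1"), ("Salmonella", "a2")],
    "fam2": [("Ecoli", "b1"), ("Ecoli", "b2"), ("Vibrio", "b3")],
    "fam3": [("Ecoli", "c1")],
    "fam4": [("Vibrio", "d1")],
}
toy_distribution = synolog_distribution(toy_families)
toy_fraction = fraction_with_synologs(toy_families)
```

In this toy set one family out of four has a synolog, so the fraction is 0.25; the slide reports less than 2% for the real data.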
The auxiliary genome of bacteria is an
ORFanage
Welch et al., 2002
Genes unique to a genome
- Some genes are annotated as phages or secretion
proteins…
- Most have unknown function
- Most are ORFans (no homolog known in
databases)
What are ORFans ?
• Rapidly evolving genes, or possibly pseudogenes
(cf. Amiri et al., 2003)
• Genes produced de novo from non-coding
sequences
• Artifacts resulting from the algorithms used to
recognize coding sequences in genomes.
• Genes transferred from organisms that have no
representatives in the databases
How can we understand ORFans ?
• By definition, no possible comparative study
(evolutionary rate, structure determination by
homology…)
• But… if the mechanism producing ORFans is continuous over time, we can find ORFans for every node in a tree
• Search for ORFans in the lineage leading to E. coli
MG1655 (K12)
Examine genes restricted to each clade at increasing phylogenetic depths (n0, n1, n2, etc.) as well as those ancestral to all taxa (native).
At each node, define two types of genes:
- ORFans: genes restricted to a clade and having no other homologs
- HOPs (Heterogeneous Occurrence in Prokaryotes): genes restricted to a clade but with matches in some distantly related organism (LGT events)
This approach allows:
1. comparisons of the sequence features of ORFans of different ages
2. comparisons of ORFans to acquired & ancestral genes
3. use of comparative methods to obtain information about evolutionary rate and functional status of ORFans (e.g., n2: E. coli vs. Salmonella)
[Figure legend: ORFans, HOPs]
Daubin & Ochman, 2004
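The node-by-node definition can be sketched as a small classifier. It assumes that, for each gene of the focal genome, we already know the set of genomes with a detectable homolog, and that the clades along the lineage are given from shallowest (n0) to deepest. The genome names, the data layout and the simplified rule for "native" are illustrative assumptions, not the published procedure.

```python
def classify_gene(hit_genomes, clades, ingroup):
    """Return (age, kind) for one gene.

    hit_genomes: genomes with a homolog of the gene (including the focal genome).
    clades: nested sets of genomes, ordered n0, n1, n2, ...
    ingroup: all genomes of the group under study.
    """
    inside = hit_genomes & ingroup
    outside = hit_genomes - ingroup
    for depth, clade in enumerate(clades):
        if inside <= clade:
            # Restricted to this clade: ORFan if nothing else matches,
            # HOP if some distantly related organism also has a match.
            return f"n{depth}", ("HOP" if outside else "ORFan")
    return "native", None

# Toy lineage (hypothetical genome names).
toy_clades = [
    {"Ecoli_MG1655"},
    {"Ecoli_MG1655", "Ecoli_other"},
    {"Ecoli_MG1655", "Ecoli_other", "Salmonella"},
]
toy_ingroup = toy_clades[-1] | {"Vibrio"}

toy_results = [
    classify_gene({"Ecoli_MG1655"}, toy_clades, toy_ingroup),
    classify_gene({"Ecoli_MG1655", "Salmonella"}, toy_clades, toy_ingroup),
    classify_gene({"Ecoli_MG1655", "Salmonella", "Archaeon_X"}, toy_clades, toy_ingroup),
    classify_gene({"Ecoli_MG1655", "Vibrio"}, toy_clades, toy_ingroup),
]
```

Genes classified at n2 in this toy example are the ones for which an E. coli vs. Salmonella comparison of evolutionary rates becomes possible.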
Length of ORFans and HOPs
[Figure: length (bp; y-axis, 400 to 1200) of HOPs and ORFans at each node, from n0 (younger) through n1, n2, n3 and n4 to native (older). Clade labels along the lineage: E. coli, E. coli + Shigella, S. enterica, enterics, Vibrio/Haem, γ-proteobacteria.]
Evolutionary rates of ORFans and HOPs
[Figure: Ka/Ks from the Escherichia-Salmonella comparison (y-axis, 0.06 to 0.14) for HOPs and ORFans at nodes n2, n3 and n4 and for native genes. Clade labels along the lineage: E. coli, E. coli + Shigella, S. enterica, enterics, Vibrio/Haem, γ-proteobacteria.]
Ka/Ks is low, indicating that ORFans encode proteins; however, both Ka & Ks are elevated
G+C content of ORFans and HOPs
[Figure: % G+C3 (y-axis, 42 to 58) of HOPs and ORFans at each node, from n0 (younger) through n1, n2, n3 and n4 to native (older). Clade labels along the lineage: E. coli, E. coli + Shigella, S. enterica, enterics, Vibrio/Haem, γ-proteobacteria.]
ORFans in A+T rich genomes
[Figure: two panels, Helicobacter pylori and Streptococcus pneumoniae, comparing G+C3 of “Natives” and ORFans; the y-axes span 0.29 to 0.39 and 0.28 to 0.44.]
Daubin et al., 2003
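G+C3, the quantity plotted on these slides, is the G+C content at third codon positions. A minimal sketch, assuming an in-frame coding sequence given as a string (the function name and the toy sequence are made up):

```python
def gc3(cds):
    """Fraction of G or C at third codon positions of an in-frame coding sequence."""
    cds = cds.upper()
    n_codons = len(cds) // 3
    if n_codons == 0:
        raise ValueError("sequence shorter than one codon")
    # Third positions are indices 2, 5, 8, ...; ignore any trailing partial codon.
    third_positions = cds[2::3][:n_codons]
    return sum(base in "GC" for base in third_positions) / n_codons

# Toy coding sequence (made up): codons ATG GCC AAA TAA.
toy_gc3 = gc3("ATGGCCAAATAA")
```

Comparing the distribution of this value between ORFans and native genes of one genome is the comparison summarized in the figure.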
Features of ORFans
ORFans arise quickly in genomes & can be strain-specific
Do not originate from native DNA that is shared among strains
ORFans are short and very A+T-rich
Consistently A+T-richer donor ?
Average Ka/Ks of ORFans is much less than 1 (often < 0.2)
Most ORFans are functional (although functions are unassigned)
ORFans evolve faster than other genes in the genome
Under fewer constraints or possibly due to positive selection
ORFans originate by lateral gene transfer
but by different vehicles, mechanisms or processes than HOPs
(which are present in other Bacteria, Archaea or Eukaryotes)
Given their base compositions, lack of homologs & functional status
ORFans most likely derive from DNA phages
(which are poorly represented in the databases)
Rocha and Danchin, 2002
And if you are not yet convinced,
1. Younger ORFans tend to be clustered (as if co-inherited in a single event),
whereas older ORFans are dispersed (by rearrangements & deletions)
ORFan cluster sizes average 2.1 genes in n0/n1 and 1.3 genes in n4
2. Genes in DNA phage genomes are short.
Average is 615 bp, and only 471 bp for those encoding ‘hypothetical’ proteins
3. ORFans often occur at tRNA genes or near translocatable sequences
4. ORFans in E. coli have di-nucleotide frequencies close to coliphages
ORFans are conserved through time
and may assume key functions
- An ORFan from n2 (only in E. coli and Salmonella) is
the ribosomal protein S22, expressed in stationary phase
- Some ORFans from n3 (restricted to the enterics) have
been retained in the highly reduced genome of Buchnera:
e.g., dnaT and dnaC, which are essential to E. coli
Daubin & Ochman, 2004
The genealogy of bacterial genomes
Ubiquitous genes are rare and show little evidence of LGT
Genes seem to be acquired continuously
Most of the acquired genes are completely new for the genome (no homologs)
A lot of them are even ORFans
ORFan genes appear as a contribution of phage to bacterial evolution
Because genomes are not increasing in size, non-homologous replacement may play a major role
[Figure: tree of the 13 genomes, each labelled with a number: S. typhimurium (4206), E. coli (4187), Y. pestis KIM (3883), Y. pestis CO92 (3599), W. brevipalpis (653), B. aphidicola (564), P. multocida (2015), H. influenzae (1709), V. cholerae (3805), P. aeruginosa (5540), X. fastidiosa (2680), X. campestris (4030), X. axonopodis (4193).]
Acknowledgements
• Emmanuelle Lerat (esp. LGT in Gamma-proteobacteria !!!)
• Manolo Gouy
• Guy Perrière
• Howard Ochman
• Nancy Moran