Pathway Tools Meeting - December 1, 2005, Geneva (SIB)
Putting together synteny and
metabolic information to achieve
relevant expert annotation of
microbial genomes
Dr Claudine Médigue
What is MaGe ? Yet another bacterial annotation platform !…
Its development started in Oct. 2002
Context : the Acinetobacter sp. ADP1
genome annotation (Summer 2004)
Shares functionalities with other existing annotation systems:
➢ An automatic annotation process:
Syntactic and functional annotations
Functional annotation and classification inferences
➢ A relational database (MySQL) used to store the sequences and
the analysis results.
➢ A Web interface allowing multiple users to simultaneously annotate a
genome.
➢ Connectivity to other databases or systems
Developed by biologists involved in manual expert annotation
Graphical interface which focuses on gene context and synteny
results with available bacterial proteomes.
Introduction to the Prokaryotic Genome DataBase (PkGDB)
Purpose: storage of ‘clean’ and complete annotation data which are
subsequently used in the genomic comparative analysis.
➢ Relational DBMS (MySQL)
• Complete bacterial genomes
(RefSeq NCBI and Genome Reviews EBI)
→ Integration in PkGDB
Correction of obvious errors
Management of frameshifts
→ Syntactic re-annotation
NAR (WS), 2003
Add missing gene annotations
NAR (WS), 2005
• New bacterial genomes (annotation projects)
• Annotation tool results:
➢ Intrinsic: genes, signals, repeats, …
➢ Extrinsic: BLAST, InterPro, COG, synteny, …
Simplified structure of PkGDB
Re-annotation project: published genomes (NCBI RefSeq, Genome Reviews)
Annotation project: newly sequenced genomes; gene prediction (AMIGene); project customization
Reference annotation for model organisms: EcoGene, GenProtEC, SubtiList
Annotation management: sequence updates and annotation transfer; Genomic Objects; automatic and manual functional assignments; annotation history; annotator management
Functional classification: MultiFun, GeneOntology
Functional predictions: protein similarities; α-helices and signal peptides; enzymatic functions (KEGG, BioCyc); domains and motifs
Uniprot, Interpro, COG
Specific regions
Orthologs & Paralogs
Syntenies
• Multiple correspondences
• Local rearrangements (ins/del)
Boyer et al. Bioinformatics (Nov 2005)
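The idea of a synteny group that tolerates multiple correspondences and local rearrangements can be sketched as follows. This is a simplified illustration, not the method of Boyer et al.: it assumes homolog pairs are already known as (gene rank in genome A, gene rank in genome B), and the function name `synteny_groups` and the `max_gap` parameter are hypothetical. Two pairs are linked when they lie within `max_gap` genes of each other in both genomes, and groups are the connected components of that graph.

```python
from collections import defaultdict


def synteny_groups(pairs, max_gap=5):
    """Group homolog pairs (rank_a, rank_b) into synteny groups.

    Two pairs are linked when their gene ranks differ by at most max_gap
    in BOTH genomes; a group is a connected component with >= 2 pairs.
    """
    pairs = sorted(set(pairs))
    parent = list(range(len(pairs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, (a1, b1) in enumerate(pairs):
        for j in range(i + 1, len(pairs)):
            a2, b2 = pairs[j]
            if a2 - a1 > max_gap:
                break  # pairs are sorted by rank in genome A
            if abs(b2 - b1) <= max_gap:  # abs() tolerates local inversions
                parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i, pair in enumerate(pairs):
        groups[find(i)].append(pair)
    return sorted((g for g in groups.values() if len(g) > 1), key=lambda g: g[0])


# Hypothetical gene ranks, for illustration only.
example = [(1, 101), (2, 102), (3, 104), (50, 300), (51, 299), (80, 10)]
print(synteny_groups(example, max_gap=3))
```

Because the gap is measured in gene ranks rather than base pairs, small insertions or deletions inside a conserved region do not split the group.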
How to read the synteny maps ?
ACIAD0574 (hutH)
Two ‘homologs’ to ACIAD0574 on the P. aeruginosa genome.
These two P. syringae genes (PSPTO5274/hutH-2 and PSPTO5276/hutH-3) are similar to ACIAD0574 (putative paralogs of PSPTO0599).
This P. syringae gene (PSPTO0599/hutH-1) is a putative ‘ortholog’ to ACIAD0574 and is involved in a synteny group containing 17 genes (in green).
A larger view of the previous Acinetobacter ADP1 region
Map labels: 0562, 0574, 0582-0583, hisS, hutH, fabG-fabF
4 of 138 genomes in PkGDB
9 of 284 complete microbial proteomes (RefSeq section)
How are genes organized in a synteny group ?
Synteny with Ralstonia solanacearum chromosome
Synteny with Ralstonia solanacearum Mega Plasmid
Synteny maps are useful to annotate gene fusion/fission
Fusion of genes involved in DNA replication
dnaQ (DNA polIII, epsilon subunit + proofreading 3’-5’ exonuclease)
rnhA (degradation of Okazaki fragments)
(dnaQ) YPO1082 / YPO1081 (rnhA)
(dnaQ) STM0264 / STM0263 (rnhA)
(dnaQ) NMB1514 / (rnhA) NMB1618
(dnaQ) PA1816 / PA1815 (rnhA)
(dnaQ) PSPTO3711 / PSPTO3712 (rnhA)
Colored rectangles represent the part of the protein which aligns with the corresponding Acinetobacter protein.
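A fusion like the one above shows up as two separate proteins of another genome aligning to disjoint parts of a single query protein. The sketch below is only an illustration of that reasoning: the function name, the thresholds and the gene identifiers are hypothetical, and alignment coordinates are assumed to come from any pairwise similarity search.

```python
from itertools import combinations


def fusion_candidates(query_len, hits, min_joint_cov=0.8,
                      max_single_cov=0.7, max_overlap=10):
    """Find pairs of target genes that together cover one query protein.

    hits: list of (target_gene, query_start, query_end), 1-based inclusive
    coordinates on the query. A pair is reported when the two aligned
    segments barely overlap, jointly cover most of the query, and neither
    segment covers most of it alone.
    """
    candidates = []
    ordered = sorted(hits, key=lambda h: h[1])
    for (t1, s1, e1), (t2, s2, e2) in combinations(ordered, 2):
        len1, len2 = e1 - s1 + 1, e2 - s2 + 1
        overlap = max(0, min(e1, e2) - max(s1, s2) + 1)
        joint_cov = (len1 + len2 - overlap) / query_len
        if (overlap <= max_overlap
                and joint_cov >= min_joint_cov
                and max(len1, len2) / query_len <= max_single_cov):
            candidates.append((t1, t2, round(joint_cov, 2)))
    return candidates


# Hypothetical query of 450 residues hit by two adjacent target genes.
print(fusion_candidates(450, [("geneX", 1, 240), ("geneY", 250, 445)]))
```

In practice one would also check that the two target genes are neighbours on their chromosome, which is what the synteny map makes visible.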
Simplified structure of PkGDB: sources of enzymatic function predictions
PRIAM (http://bioinfo.genopoletoulouse.prd.fr/priam/): position-specific scoring matrices ('profiles') built with SwissProt proteins
KEGG (www.genome.jp/kegg/): dynamic requests
BioCyc (http://www.biocyc.org/): local installation
Setting up a new annotation project : an example
Available related sequences
• Rhizobium leguminosarum
(Sanger Center)
• Rhodobacter sphaeroides
(DOE/JGI)
• Rhodospirillum rubrum (DOE/JGI)
Newly sequenced genomes
Genomes in public DataBanks
• Mesorhizobium loti (00)
• Sinorhizobium meliloti (01)
• Bradyrhizobium japonicum (02)
• Rhodopseudomonas palustris (03)
Automatic syntactic annotations (in some cases, functional annotations)
Re-annotation process (pseudogenes, missing genes)
• Bradyrhizobium sp. ORS278
(Genoscope) -> 1 chr (7.5 Mb)
• Bradyrhizobium sp. BTAi
(DOE/JGI) -> 1 chr (8.5 Mb)
Complete pipeline of
automatic annotations
Searching for synteny groups with complete proteomes available in RefSeq section
(NCBI, 284 to date) and in PkGDB (curated genomes, 138 to date)
PkGDB
Pathway Tools: metabolic pathway reconstruction (Ocelot object model)
PGDBs: BrajapCyc, BradyBTCyc, BradyORCyc, RhizoCyc
Projects: YersiniaScope, AcinetoScope, ColiScope, RhizoScope, FrankiaScope, CloacaScope
BioWareHouse relational model
Comparative Metabolic Capabilities : an example
Reaction content comparisons between the 3 Bradyrhizobium organisms (BioWareHouse SQL query on reactions having gene -> protein -> reaction correspondences)

Venn diagram (number of reactions):
Bradyrhizobium sp. ORS278: 830 in total, 14 specific
Bradyrhizobium sp. BTAi: 873 in total, 43 specific
Bradyrhizobium japonicum USDA 110: 897 in total, 127 specific
ORS278 and BTAi only: 76
ORS278 and B. japonicum only: 16
BTAi and B. japonicum only: 30
Common to the 3 organisms: 724

ORS278 | BTAi genes coding the same reaction | Pathway | Reaction
BRAOR5732 | BRABT1389, BRABT0754, BRABT0723, BRABT0755, BRABT0724 | protocatechuate degradation I | PROTOCATECHUATE-4,5-DIOXYGENASE-RXN
BRAOR5733 | BRABT1389, BRABT0754, BRABT0723, BRABT0755, BRABT0724 | protocatechuate degradation I | PROTOCATECHUATE-4,5-DIOXYGENASE-RXN
BRAOR5771 | BRABT1389, BRABT0754, BRABT0723, BRABT0755, BRABT0724 | protocatechuate degradation I | PROTOCATECHUATE-4,5-DIOXYGENASE-RXN
BRAOR5772 | BRABT1389, BRABT0754, BRABT0723, BRABT0755, BRABT0724 | protocatechuate degradation I | PROTOCATECHUATE-4,5-DIOXYGENASE-RXN
BRAOR5776 | BRABT0759 | protocatechuate degradation I | RXN-2463
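The slide obtained its counts with a BioWareHouse SQL query. As a stand-in, the same three-way comparison can be sketched in Python with plain sets once each organism's reaction identifiers have been exported. The function name and the reaction identifiers below are hypothetical and only illustrate how the seven Venn regions are derived.

```python
def venn3_counts(a, b, c):
    """Return the sizes of the seven regions of a three-set Venn diagram."""
    a, b, c = set(a), set(b), set(c)
    return {
        "a_only": len(a - b - c),
        "b_only": len(b - a - c),
        "c_only": len(c - a - b),
        "a_b_only": len((a & b) - c),
        "a_c_only": len((a & c) - b),
        "b_c_only": len((b & c) - a),
        "all": len(a & b & c),
    }


# Hypothetical reaction identifiers, for illustration only.
ors278 = {"R1", "R2", "R3", "R4"}
btai = {"R1", "R2", "R5"}
japonicum = {"R1", "R3", "R6", "R7"}

counts = venn3_counts(ors278, btai, japonicum)
print(counts)

# Sanity check: the four regions inside a set add up to its size.
assert (counts["a_only"] + counts["a_b_only"]
        + counts["a_c_only"] + counts["all"]) == len(ors278)
```

The final assertion is the same consistency check one can apply to a published Venn diagram: each organism's total must equal the sum of its four regions.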
Bradyrhizobium ORS278 region containing CDS 5771&5772
BRAOR5771-5772 - 5773
15277747
“Cloning and Characterization of the Genes Encoding Enzymes for the Protocatechuate Meta-degradation Pathways of Pseudomonas ochraceae NGJ1”, Maruyama et al. (2004) Biosci. Biotechnol. Biochem., 68, 1434-1441.
AUTOmatic vs EXPert annotation of the region
Columns: PRODUCT | EC-number | Gene | Evidence
BRAOR5770
AUTO: 4-carboxy-2-hydroxymuconate-6-semialdehyde dehydrogenase | 1.1.1.18 | ligC | BLAST R. palustris, PRIAM (medium)
EXP: 4-carboxy-2-hydroxymuconate-6-semialdehyde dehydrogenase | 1.2.1.45 | ligC | BLAST P. testosteroni, Publication + Enzyme
BRAOR5771
AUTO = EXP: Protocatechuate 4,5-dioxygenase, alpha subunit | 1.13.11.8 | ligB | BLAST R. palustris, PRIAM (high)
BRAOR5772
AUTO = EXP: Protocatechuate 4,5-dioxygenase, beta subunit | 1.13.11.8 | ligA | BLAST R. palustris, PRIAM (high)
BRAOR5773
AUTO: 2-pyrone-4,6-dicarboxylic acid hydrolase | none | ligI | BLAST R. palustris
EXP: 2-pyrone-4,6-dicarboxylic acid hydrolase | 3.1.1.57 | ligI | BLAST R. palustris, Publication + Enzyme
BRAOR5774
AUTO: Putative dehydrogenase | none | none | BLAST R. palustris
EXP: Putative dehydrogenase with NAD binding protein | 1.1.1.- | none | BLAST R. palustris, InterproScan
BRAOR5775
AUTO: Putative acyl transferase | none | fidZ | BLAST R. palustris
EXP: 4-hydroxy-4-methyl-2-oxoglutarate aldolase | 4.1.3.17 | ligK | BLAST P. ochraceae, Publication + Enzyme
BRAOR5776
AUTO: 4-oxalomesaconate hydratase | none | ligJ | BLAST R. palustris
EXP: 4-oxalomesaconate hydratase | 4.2.1.83 | ligJ | BLAST R. palustris, Publication + Enzyme
Bradyrhizobium ORS278 region after expert annotation
BRAOR5770: 1.2.1.45, ligC
BRAOR5771-72: 1.13.11.8, ligBA
BRAOR5773: 3.1.1.57, ligI
BRAOR5775: 4.1.3.17, ligK
BRAOR5776: 4.2.1.83, ligJ
BRAOR5777
BRAOR5778
Connectivity to KEGG database
Enzymes encoded by genes in the MaGe region
Enzymes encoded by genes elsewhere in the
Bradyrhizobium genome
Additional enzymes in E. coli
4.2.1.83
?
Bradyrhizobium ORS278 region after expert annotation
Probable protocatechuate transporter
5770
5771
5772
Probable transcriptional regulator of protocatechuate degradation
5776
5773
5775
BRAOR5777
BRAOR5778
ligR
BRAOR5770_ligC: 4-carboxy-2-hydroxymuconate-6-semialdehyde dehydrogenase (1.2.1.45)
BRAOR5776_ligJ: 4-oxalomesaconate hydratase (4.2.1.83)
The reactions catalyzed by 1.2.1.45 and 4.2.1.83 exist in MetaCyc but they are not involved in a pathway.
Enzymatic activity predictions (PRIAM): some results
■ Comparison of PRIAM predictions [P] and Expert annotations [E]
 | Acinetobacter ADP1 | Pseudoalteromonas haloplanktis | Frankia alni | Pseudomonas entomophila
Total genes | 3325 | 3514 | 6861 | 5182
Nb EC_[P] vs EC_[E] | 1012 / 947 | 927 / 993 | 1729 / 1498 | 1455 / 1232
EC_[P] = EC_[E] | 632 (62.5%) | 697 (75.2%) | 912 (52.8%) | 820 (56.3%)
EC_[P](3 digit) = EC_[E] | 47 (4.6%) | 23 (2.5%) | 68 (3.9%) | 46 (3.2%)
EC_[P] <> EC_[E] | 131 (12.9%) | 102 (11.0%) | 401 (23.2%) | 285 (19.6%)
EC_[P] & (NO EC_[E]) | 202 (20.0%) | 105 (11.3%) | 348 (20.1%) | 304 (20.9%)
EC_[E] & (NO EC_[P]) | 111 (11.7%) | 152 (15.3%) | 111 (7.4%) | 90 (7.3%)
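The comparison categories used in this table can be expressed as a small classification function. This is a sketch under assumptions: each gene is taken to carry at most one predicted and one expert EC number, the "3 digit" category is interpreted as agreement on the first three EC fields, and the function and category names are hypothetical. The example values are the EC numbers shown for BRAOR5770, BRAOR5771 and BRAOR5773 in the AUTO vs EXP slide, assuming the AUTO EC number there is the PRIAM prediction.

```python
from collections import Counter


def compare_ec(predicted, expert):
    """Classify one gene by comparing a predicted and an expert EC number.

    Either argument may be None when no EC number was assigned.
    """
    if not predicted and not expert:
        return "no_ec"
    if predicted and not expert:
        return "P_only"
    if expert and not predicted:
        return "E_only"
    if predicted == expert:
        return "identical"
    # Same class, subclass and sub-subclass, but different last field.
    if predicted.split(".")[:3] == expert.split(".")[:3]:
        return "same_3_digits"
    return "different"


genes = {
    "BRAOR5770": ("1.1.1.18", "1.2.1.45"),
    "BRAOR5771": ("1.13.11.8", "1.13.11.8"),
    "BRAOR5773": (None, "3.1.1.57"),
}
tally = Counter(compare_ec(p, e) for p, e in genes.values())
print(tally)
```

Note that in the table the first four percentages are relative to the number of predicted EC numbers (for Acinetobacter ADP1, 632 + 47 + 131 + 202 = 1012), while the last row is relative to the number of expert EC numbers (111 / 947 = 11.7%).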
■ Limitations of PRIAM sequence-based enzyme prediction
✓ Availability of at least one UniProt/SwissProt sequence in the Enzyme entry!
✓ Existence of closely related enzymes with different substrate specificity
✓ Relaxed substrate specificity exhibited by some enzymes
✓ Several wrong predictions in case of Medium/Low PRIAM confidence
PGDBs built at Genoscope
■ Our PGDBs are currently available in the MaGe interface
HomePage: http://www.genoscope.cns.fr/agc/mage/
■ NO curation to date (Tier 3* Databases)
(except for Acinetobacter ADP1 -> Metabolic Thesaurus project)
■ MaGe training courses include a quick overview of how to explore PathoLogic results to perform relevant expert annotation
■ Automatic updates of PathoLogic predictions: every week
■ To date: about 60 Tier 3 PGDBs
▪ 16 PGDBs common to SRI/EBI PGDBs Tier 3* (and 4 with Tier 2*):
«Expansion of the BioCyc collection of pathway/genome databases to 160 genomes» Karp et al. Nucleic Acids Research, 2005, 33: 6083-6089.
• The number of enzymes and pathways is slightly greater in our PGDBs (source of annotations + process of PathoLogic file format generation)
• Important discrepancies with Sinorhizobium meliloti (44 predicted pathways in the SRI/EBI PGDB vs 259 in the Genoscope PGDB)
▪ 18 PGDBs: other published bacterial genomes
▪ 25 PGDBs for newly sequenced and annotated bacterial genomes
*Tier 3: Computationally-Derived Databases Subject to No Curation
*Tier 2: Computationally-Derived Databases Subject to Moderate Curation
Some Questions / Perspectives
■ Better correspondences between BioCyc and MaGe
• Optional fields in the PathoLogic file format (PubMedID, Funcat, …)
■ How to tackle the pseudogene information?
Pathway X doesn’t exist because:
• No enzyme has been found
• Some enzymes correspond to pseudogenes
■ Curation of PGDBs?
▪ Integration and evaluation of Pathway Hole Filler
▪ Remove false-positive pathways (Tier 3 -> Tier 2)
• Automatic reduction of false-positive pathway predictions stored in the PGDBs
• Finding a way to get a list of false-positive pathways at the end of the manual process of annotation.
▪ Tier 2 -> Tier 1*, especially creation of new metabolic pathways:
!!! Not an easy task !!! (a strong knowledge of metabolism is required)
• PGDBs freely available for «adoption» by biologists
*Tier 1: Intensively Curated Databases
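The pseudogene question raised above can be made concrete with a small check over predicted pathways. This is only a sketch of the reasoning, not a Pathway Tools feature: the function name, the input dictionaries and all identifiers are hypothetical, and the inputs are assumed to have been exported from a PGDB and from the annotation database.

```python
def pathway_status(pathway_reactions, reaction_genes, pseudogenes):
    """Flag predicted pathways whose enzymatic support is doubtful.

    pathway_reactions: pathway name -> list of reaction ids
    reaction_genes:    reaction id -> list of genes encoding an enzyme
    pseudogenes:       set of genes annotated as pseudogenes
    """
    pseudogenes = set(pseudogenes)
    status = {}
    for pathway, reactions in pathway_reactions.items():
        genes = {g for r in reactions for g in reaction_genes.get(r, ())}
        if not genes:
            status[pathway] = "no enzyme found"
        elif genes & pseudogenes:
            status[pathway] = "some enzymes are pseudogenes"
        else:
            status[pathway] = "supported by intact genes"
    return status


# Hypothetical pathways, reactions and genes, for illustration only.
pathways = {"pathway X": ["rxn1", "rxn2"], "pathway Y": ["rxn3"], "pathway Z": ["rxn4"]}
enzymes = {"rxn1": ["gene1"], "rxn2": ["gene2"], "rxn3": ["gene3"]}
print(pathway_status(pathways, enzymes, pseudogenes={"gene2"}))
```

A list produced this way could serve as a starting point for the manual review of false-positive pathways, with the final decision left to the annotator.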
Metabolic Thesaurus project at Genoscope
Véronique de Berardinis’s team
Knock-out collection: 2240 ADP1 genes knocked out
Biological evidence: systematic phenotyping, accurate phenotyping, biochemical studies, functional complementation
Annotation: 3325 Acinetobacter ADP1 annotated genes
Model: network reconstruction, flux models, metabolism prediction
Vincent Schächter’s bioinformatics team
Transcriptome analyses
Metabolic Pathway Reconstruction / Experimental Data
Metabolic Thesaurus
ColiScope
Acinetobacter ADP1
KO collection
Sequencing of 2 commensal
and 4 pathogenic E. coli strains
Phenotypic analysis: growth assay on different nutrient sources
+
Metabolome analysis: LC/MS and CE/MS
Data Integration and Comparative Analysis
Linked enzymatic activity
to genes of unknown function
Evolution of metabolic
capabilities => adaptation of
microorganisms
commensalism / virulence
emergence
Participating teams
✓ AGC team:
➢ Zoé Rouy
➢ David Vallenet
➢ Aurélie Lajus
➢ Stéphane Cruveiller
➢ Claudine Médigue
✓ Genoscope informatics system team
➢ Claude Scarpelli
➢ Laurent Sainte-Marthe
➢ Sylvain Bonneval
✓ … and with the help of:
➢ François Lefèvre (V. Schächter team)
✓ Feedback from MaGe users helps improve many functionalities of our system!