The EcoCyc and MetaCyc Pathway/Genome Databases
Peter D. Karp, Ph.D.
Bioinformatics Research Group
SRI International
[email protected]
http://www.ai.sri.com/pkarp/
http://EcoCyc.org/
SRI International
Bioinformatics
Overview
Motivations and terminology
Pathway/genome databases
BioCyc collection
EcoCyc, MetaCyc
Pathway Tools software
Bioinformatics Database Warehouse project
What to Do When Theories Become Larger than Minds Can Grasp?
Example: E. coli metabolic network
160 pathways involving 744 reactions and 791 substrates
Example: E. coli genetic network
Control by 97 transcription factors of 1174 genes in 630 transcription units
Past solutions:
Partition theories across multiple minds
Encode theories in natural-language text
We cannot compute with theories in those forms
Evaluate theories for consistency with new data: microarrays
Refine theories with respect to new data
Compare theories describing different organisms
Solution: Biological Knowledge Bases
Store biological knowledge and theories in computers in a declarative form
Amenable to computational analysis and generative user interfaces
Establish ongoing efforts to curate (maintain, refine, embellish) these knowledge bases
It is accepted practice to store data in computers, but not yet knowledge
Such knowledge bases are an integral part of the scientific enterprise
Pathway Definition
Chemical reactions interconvert chemical compounds
A + B → C + D
An enzyme is a protein that accelerates chemical reactions
A pathway is a linked set of reactions
Often regulated as a unit
A conceptual unit of the cell’s biochemical machine
Terminology
Model Organism Database (MOD) – DB describing genome and other information about an organism
Pathway/Genome Database (PGDB) – MOD that combines information about
Pathways, reactions, substrates
Enzymes, transporters
Genes, replicons
Transcription factors, promoters, operons, DNA binding sites
BioCyc – Collection of 15 PGDBs at BioCyc.org
EcoCyc, AgroCyc, YeastCyc
BioCyc Collection of Pathway/Genome DBs
http://BioCyc.org/
Literature-based Datasets:
MetaCyc
Escherichia coli (EcoCyc)
Computationally Derived Datasets:
Agrobacterium tumefaciens
Caulobacter crescentus
Chlamydia trachomatis
Bacillus subtilis
Helicobacter pylori
Haemophilus influenzae
Mycobacterium tuberculosis H37Rv
Mycobacterium tuberculosis CDC1551
Mycoplasma pneumoniae
Pseudomonas aeruginosa
Saccharomyces cerevisiae
Treponema pallidum
Vibrio cholerae
Yellow = Open Database
Terminology – Pathway Tools Software
PathoLogic
Prediction of metabolic network from genome
Computational creation of new Pathway/Genome Databases
Pathway/Genome Editors
Distributed curation of PGDBs
Distributed object database system, interactive editing tools
Pathway/Genome Navigator
WWW publishing of PGDBs
Querying, visualization of pathways, chromosomes, operons
Analysis operations
Pathway visualization of gene-expression data
Global comparisons of metabolic networks
Bioinformatics 18:S225 2002
Pathway Tools Algorithms
Query, visualization and editing tools for these datatypes:
Full Metabolic Map: paint gene expression data on metabolic network; compare metabolic networks
Pathways: pathway prediction
Reactions: balance checker
Compounds: chemical substructure comparison
Enzymes, Transporters, Transcription Factors
Genes: Blast search
Chromosomes
Operons: operon prediction
Model Organism Databases
DBs that describe the genome and other information about an organism
Every sequenced organism with an active experimental community requires a MOD
Integrate genome data with information about the biochemical and genetic network of the organism
MODs are platforms for global analyses of an organism
Interpret gene expression data in a pathway context
Characterize systems properties of metabolic and genetic networks
Determine consistency of metabolic and transport networks
In silico prediction of essential genes
EcoCyc Project – EcoCyc.org
E. coli Encyclopedia
Model-Organism Database for E. coli
Computational symbolic theory of E. coli
Electronic review article for E. coli – over 3500 literature citations
Tracks the evolving annotation of the E. coli genome
Collaborative development via Internet
Karp (SRI) -- Bioinformatics architect
John Ingraham -- Advisor
(SRI) -- Metabolic pathways
Saier (UCSD) and Paulsen (TIGR) -- Transport
Collado (UNAM) -- Regulation of gene expression
Database content: 18,000 objects
EcoCyc = E. coli Dataset + Pathway/Genome Navigator
Pathways: 165
Reactions: 2,760
Compounds: 774
Enzymes: 914
Transporters: 162
Proteins: 4,273
Genes: 4,393
Transcription Units: 724
Promoters: 812
Transcription Factors: 110
TransFac Sites: 956
Citations: 3,508
http://EcoCyc.org/
EcoCyc Procedures
All DB updates by 5 staff curators
Information gathered from biomedical literature
Corrections solicited from E. coli researchers
Review-level database
Four releases per year
Available through WWW site, as data files, as downloadable application
Quality assurance of data and software:
Evaluate database consistency constraints
Perform element balancing of reactions
Run other checking programs
Display every DB object
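The element-balancing check listed above can be illustrated with a minimal sketch. It assumes flat chemical formulas (no parentheses, charges, or generic groups), and the function names parse_formula, side_totals and imbalance are hypothetical. It is not the Pathway Tools balance checker.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Tuple

# One side of a reaction: (stoichiometric coefficient, flat formula) pairs.
Side = Iterable[Tuple[int, str]]


def parse_formula(formula: str) -> Counter:
    """Count atoms in a flat formula such as 'C6H12O6' (assumed format)."""
    counts: Counter = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def side_totals(side: Side) -> Counter:
    """Sum atom counts over one side, weighted by coefficient."""
    totals: Counter = Counter()
    for coeff, formula in side:
        for element, n in parse_formula(formula).items():
            totals[element] += coeff * n
    return totals


def imbalance(left: Side, right: Side) -> Dict[str, int]:
    """Per-element surplus of left over right; an empty dict means balanced."""
    l, r = side_totals(left), side_totals(right)
    return {e: l[e] - r[e] for e in sorted(set(l) | set(r)) if l[e] != r[e]}


# Balanced: 2 H2 + O2 -> 2 H2O
ok = imbalance([(2, "H2"), (1, "O2")], [(2, "H2O")])
# Unbalanced: 2 H2 + O2 -> H2O (coefficient on water is missing)
bad = imbalance([(2, "H2"), (1, "O2")], [(1, "H2O")])
print(ok, bad)
```

A curation pipeline could run such a check over every reaction and flag those returning a non-empty dict for manual review.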
MetaCyc: Metabolic Encyclopedia
Nonredundant metabolic pathway database
Describe a representative sample of every experimentally determined metabolic pathway
Literature-based DB with extensive references and commentary
Pathways, reactions, enzymes, substrates
460 pathways, 1267 enzymes, 4294 reactions
172 E. coli pathways, 2735 citations
Nucleic Acids Research 30:59-61 2002.
Jointly developed by SRI and Carnegie Institution
New focus on plant pathways
Family of Pathway/Genome Databases
MetaCyc
Pathway Tools Implementation Details
Allegro Common Lisp
Sun and PC platforms
Ocelot object database
250,000 lines of code
Lisp-based WWW server at BioCyc.org
Manages 15 PGDBs
Pathway Tools Architecture
[Architecture diagram: WWW Server; Pathway/Genome Navigator; X-Windows Graphics; GFP API; Object Editor, Pathway Editor, Reaction Editor; Object DBMS; Oracle]
Ocelot Knowledge Server Architecture
Frame data model
Classes, instances, inheritance
Frames have slots that define their properties, attributes, relationships
A slot has one or more values
Each value can be any Lisp datatype
Slotunits define metadata about slots:
Domain, range, inverse
Collection type, number of values, value constraints
Transaction logging facility
Schema evolution
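The frame data model above can be sketched in a few lines. This is an illustration in Python, not Ocelot's Lisp implementation: the Frame class, the SLOTUNITS table and its field names, and the example frame names are all hypothetical.

```python
from typing import Any, Dict, List, Optional


class Frame:
    """A class or instance frame whose slots each hold a list of values."""

    def __init__(self, name: str, parent: Optional["Frame"] = None):
        self.name = name
        self.parent = parent  # class frame this frame inherits from
        self.slots: Dict[str, List[Any]] = {}

    def put(self, slot: str, *values: Any) -> None:
        self.slots[slot] = list(values)

    def get(self, slot: str) -> List[Any]:
        """Return local values, else values inherited up the parent chain."""
        frame: Optional[Frame] = self
        while frame is not None:
            if slot in frame.slots:
                return frame.slots[slot]
            frame = frame.parent
        return []


# Slot metadata in the spirit of a slotunit (field names are assumptions).
SLOTUNITS = {
    "catalyzes": {"range": str, "max_values": None, "inverse": "catalyzed-by"},
}


def check_slot(frame: Frame, slot: str) -> bool:
    """Check one slot's values against its range and cardinality metadata."""
    meta = SLOTUNITS[slot]
    values = frame.get(slot)
    if meta["max_values"] is not None and len(values) > meta["max_values"]:
        return False
    return all(isinstance(v, meta["range"]) for v in values)


enzymes = Frame("Enzymes")
enzymes.put("frame-type", "protein")
trp_synthase = Frame("tryptophan-synthase", parent=enzymes)
trp_synthase.put("catalyzes", "RXN-A", "RXN-B")
print(trp_synthase.get("frame-type"), check_slot(trp_synthase, "catalyzes"))
```

The instance frame inherits frame-type from its class frame, while the slot metadata lets a checker validate values without knowing anything about the biology.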
Ocelot Storage System Architecture
Persistent storage via disk files, Oracle DBMS
Concurrent development: Oracle
Single-user development: disk files
Read-only delivery: bundle data into binary program
Oracle storage
DBMS is submerged within Ocelot, invisible to users
Relational schema is domain independent, supports multiple KBs simultaneously
Frames transferred from DBMS to Ocelot
On demand
By background prefetcher
Memory cache
Persistent disk cache to speed performance via Internet
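On-demand frame transfer with a memory cache can be sketched as follows. The FrameCache class, its method names, and the dict standing in for the DBMS are assumptions for illustration. The prefetch here runs synchronously, whereas a background prefetcher would run in a separate thread.

```python
from typing import Callable, Dict, Iterable


class FrameCache:
    """Fetch frames from a backing store on first access, then serve from memory."""

    def __init__(self, fetch: Callable[[str], dict]):
        self._fetch = fetch            # stand-in for a DBMS lookup
        self._cache: Dict[str, dict] = {}
        self.backend_reads = 0         # counts trips to the backing store

    def get(self, frame_id: str) -> dict:
        if frame_id not in self._cache:
            self._cache[frame_id] = self._fetch(frame_id)
            self.backend_reads += 1
        return self._cache[frame_id]

    def prefetch(self, frame_ids: Iterable[str]) -> None:
        """Warm the cache ahead of use (synchronous in this sketch)."""
        for frame_id in frame_ids:
            self.get(frame_id)


# Toy backing store (hypothetical contents, not EcoCyc data).
BACKEND = {
    "TRP": {"common-name": "tryptophan"},
    "ACP": {"common-name": "acyl carrier protein"},
}

cache = FrameCache(lambda frame_id: BACKEND[frame_id])
first = cache.get("TRP")    # goes to the backing store
second = cache.get("TRP")   # served from memory
cache.prefetch(["TRP", "ACP"])
print(cache.backend_reads)
```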
The Common Lisp Programming Environment
Gat studied Lisp and Java implementation of 16 programs by 14 programmers (Intelligence 11:21 2000)
EcoCyc WWW Server
Pathway/Genome DBs Created by External Users
Plasmodium falciparum, Stanford University
plasmocyc.stanford.edu
Mycobacterium tuberculosis, Stanford University
BioCyc.org
Arabidopsis thaliana and Synechocystis, Carnegie Institution of Washington
Arabidopsis.org:1555
Methanococcus jannaschii, EBI
Maine.ebi.ac.uk:1555
Other PGDBs in progress by 24 other users
Software freely available
Each PGDB owned by its creator
Global Consistency Checking of Biochemical Network
Given:
A PGDB for an organism
A set of initial metabolites
Infer:
What set of products can be synthesized by the small-molecule metabolism of the organism
Can a known growth medium yield known essential compounds?
Pacific Symposium on Biocomputing p471 2001
Algorithm: Forward Propagation
[Diagram: nutrient set; metabolite set; PGDB reaction pool; reactants; “fire” reactions; products]
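Forward propagation can be sketched as a fixed-point loop. Start the metabolite set from the nutrient set, fire every reaction whose reactants are all present, add its products, and repeat until nothing new fires. The sketch below rests on assumptions: the Reaction type, the function name, and the toy reaction pool are hypothetical, and reactions are treated as one-directional with no enzyme or compartment constraints.

```python
from typing import FrozenSet, Iterable, List, NamedTuple, Set, Tuple


class Reaction(NamedTuple):
    name: str
    reactants: FrozenSet[str]
    products: FrozenSet[str]


def forward_propagate(
    nutrients: Iterable[str], reactions: Iterable[Reaction]
) -> Tuple[Set[str], List[str]]:
    """Return (metabolites reached, names of fired reactions in firing order)."""
    metabolites: Set[str] = set(nutrients)
    pending = list(reactions)
    fired: List[str] = []
    changed = True
    while changed:
        changed = False
        still_pending = []
        for rxn in pending:
            if rxn.reactants <= metabolites:   # all reactants available
                metabolites |= rxn.products    # "fire" the reaction
                fired.append(rxn.name)
                changed = True
            else:
                still_pending.append(rxn)
        pending = still_pending
    return metabolites, fired


# Toy reaction pool (hypothetical compounds, not EcoCyc data).
POOL = [
    Reaction("r1", frozenset({"A", "B"}), frozenset({"C"})),
    Reaction("r2", frozenset({"C"}), frozenset({"D", "E"})),
    Reaction("r3", frozenset({"X"}), frozenset({"Y"})),
]

reached, fired = forward_propagate({"A", "B"}, POOL)
essential = {"D", "Y"}
missing = essential - reached   # essential compounds the network cannot make
print(sorted(reached), fired, sorted(missing))
```

Essential compounds left in missing point to a gap to investigate: a database bug, a nutrient absent from the initial set, or a pathway not yet described.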
Results
Phase I: Forward propagation
21 initial compounds yielded only half of the 38 essential compounds for E. coli
Phase II: Manually identify
Bugs in EcoCyc (e.g., two objects for tryptophan)
Missing initial protein substrates (e.g., ACP)
Missing pathways in EcoCyc
Phase III: Forward propagation with 11 more initial metabolites
Yielded all 38 essential compounds
Nutrient-Related Analysis: Validation of the EcoCyc Database
Results on EcoCyc:
Phase I:
• Essential compounds
  • produced: 19
  • not produced: 19
• Total compounds
  • produced: (28%)
• Reactions
  • fired: (31%)
Missing Essential Compounds Due To
Bugs in EcoCyc
Narrow conceptualization of the problem
Protein substrates
Incomplete biochemical knowledge
Nutrient-Related Analysis: Validation of the EcoCyc Database
Results on EcoCyc:
Phase II (After adding 11 extra metabolites):
• Essential compounds
  • produced: 38
  • not produced: 0
• Total compounds
  • produced: (49%)
  • not produced: (51%)
• Reactions
  • fired: (58%)
  • not fired: (42%)
Pathway Tools Misconceptions
PathoLogic does not re-annotate genomes
Pathway Tools does not handle quantitative information
Pathway/Genome Editors do not work through the web
HumanCyc: Human Metabolic Pathway Database Consortium
Construct DB of human metabolic pathways using PathoLogic
Link to human genome web sites
Hire one curator to refine and curate with respect to literature over a 2-year period
Remove false-positive predictions
Insert known pathways missed by PathoLogic
Add comments and citations from pathways and enzymes to the literature
Add enzyme activators, inhibitors, cofactors, tissue information
Available as flatfiles and with Pathway/Genome Navigator
New versions to be released every 6 months
Summary
Pathway/Genome Databases
MetaCyc non-redundant DB of literature-derived pathways
14 organism-specific PGDBs available through SRI at BioCyc.org
Computational theories of biochemical machinery
Pathway Tools software
Extract pathways from genomes
Morph annotated genome into structured ontology
Distributed curation tools for MODs
Query, visualization, WWW publishing
BioCyc and Pathway Tools Availability
WWW BioCyc freely available to all
BioCyc.org
Six BioCyc DBs openly available to all
BioCyc DBs freely available to non-profits
Flatfiles downloadable from BioCyc.org
Binary executable:
Sun UltraSparc-170 w/ 64MB memory
PC, 400MHz CPU, 64MB memory, Windows-98 or newer
PerlCyc API
Pathway Tools freely available to non-profits
Acknowledgements
SRI
Suzanne Paley, Pedro Romero, John Pick, Cindy Krieger, Martha Arnaud
EcoCyc Project
Julio Collado-Vides, Ian Paulsen, Monica Riley, Milton Saier
MetaCyc Project
Sue Rhee, Lukas Mueller, Peifen Zhang, Chris Somerville
Stanford
Gary Schoolnik, Harley McAdams, Lucy Shapiro, Russ Altman, Iwei Yeh
Funding sources:
NIH National Center for Research Resources
NIH National Institute of General Medical Sciences
NIH National Human Genome Research Institute
Department of Energy Microbial Cell Project
DARPA BioSpice, UPC
BioCyc.org