The EcoCyc and MetaCyc Pathway/Genome Databases
Peter D. Karp, Ph.D.
Bioinformatics Research Group
SRI International
[email protected]
http://www.ai.sri.com/pkarp/
http://EcoCyc.org/
SRI International
Bioinformatics
Overview
 Motivations and terminology
 Pathway/genome databases
 BioCyc collection
 EcoCyc, MetaCyc
 Pathway Tools software
 Bioinformatics Database Warehouse project
What to do When Theories Become Larger than Minds can Grasp?
Example: E. coli metabolic network
 160 pathways involving 744 reactions and 791 substrates
Example: E. coli genetic network
 Control by 97 transcription factors of 1174 genes in 630 transcription units
Past solutions:
 Partition theories across multiple minds
 Encode theories in natural-language text
We cannot compute with theories in those forms
 Evaluate theories for consistency with new data: microarrays
 Refine theories with respect to new data
 Compare theories describing different organisms
Solution: Biological Knowledge Bases

Store biological knowledge and theories in computers in a declarative form
 Amenable to computational analysis and generative user interfaces
Establish ongoing efforts to curate (maintain, refine, embellish) these knowledge bases
It is accepted practice to store data in computers, but not yet knowledge
Such knowledge bases are an integral part of the scientific enterprise
Pathway Definition

Chemical reactions interconvert chemical compounds
 A + B → C + D
An enzyme is a protein that accelerates chemical reactions
A pathway is a linked set of reactions
 Often regulated as a unit
 A conceptual unit of the cell’s biochemical machine
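The definition of a pathway as a linked set of reactions can be illustrated with a small sketch. It assumes that "linked" means each reaction produces at least one compound consumed by the next one; the function name `is_linked`, the tuple encoding, and the compound letters are hypothetical and are not taken from Pathway Tools.

```python
def is_linked(pathway):
    """True if every reaction shares a product with the next reaction's reactants."""
    return all(set(prev[1]) & set(nxt[0]) for prev, nxt in zip(pathway, pathway[1:]))


# Each reaction is a (reactants, products) pair; compound names are placeholders.
r1 = ({"A", "B"}, {"C", "D"})
r2 = ({"C"}, {"E"})
r3 = ({"X"}, {"Y"})

linked = is_linked([r1, r2])   # r1 produces C, which r2 consumes
broken = is_linked([r1, r3])   # r3 consumes nothing that r1 produces
print(linked, broken)
```

A real pathway representation would also record enzymes and reaction direction; this sketch only checks the linkage property.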
Terminology
Model Organism Database (MOD) – DB describing genome and other information about an organism
Pathway/Genome Database (PGDB) – MOD that combines information about
 Pathways, reactions, substrates
 Enzymes, transporters
 Genes, replicons
 Transcription factors, promoters, operons, DNA binding sites
BioCyc – Collection of 15 PGDBs at BioCyc.org
 EcoCyc, AgroCyc, YeastCyc
BioCyc Collection of Pathway/Genome DBs
Literature-based Datasets:
 MetaCyc
 Escherichia coli (EcoCyc)
Computationally Derived Datasets:
 Agrobacterium tumefaciens
 Caulobacter crescentus
 Chlamydia trachomatis
 Bacillus subtilis
 Helicobacter pylori
 Haemophilus influenzae
 Mycobacterium tuberculosis H37Rv
 Mycobacterium tuberculosis CDC1551
 Mycoplasma pneumoniae
 Pseudomonas aeruginosa
 Saccharomyces cerevisiae
 Treponema pallidum
 Vibrio cholerae
Yellow = Open Database
http://BioCyc.org/
Terminology – Pathway Tools Software

PathoLogic
 Prediction of metabolic network from genome
 Computational creation of new Pathway/Genome Databases

Pathway/Genome Editors
 Distributed curation of PGDBs
 Distributed object database system, interactive editing tools

Pathway/Genome Navigator
 WWW publishing of PGDBs
 Querying, visualization of pathways, chromosomes, operons
 Analysis operations
    Pathway visualization of gene-expression data
    Global comparisons of metabolic networks
Bioinformatics 18:S225 2002
Pathway Tools Algorithms
Query, visualization and editing tools for these datatypes:
Full Metabolic Map
 Paint gene expression data on metabolic network; compare metabolic networks
Pathways
 Pathway prediction
Reactions
 Balance checker
Compounds
 Chemical substructure comparison
Enzymes, Transporters, Transcription Factors
Genes: Blast search
Chromosomes
Operons
 Operon prediction
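The reaction balance checker listed above can be illustrated with a small sketch. It is not the Pathway Tools implementation: it assumes simple formulas without parentheses or charges, and the function names `parse_formula`, `side_totals`, and `imbalance` are hypothetical.

```python
import re
from collections import Counter


def parse_formula(formula):
    """Count atoms in a simple formula such as 'C6H12O6' (no parentheses, no charges)."""
    counts = Counter()
    for element, digits in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(digits) if digits else 1
    return counts


def side_totals(side):
    """Sum atom counts over one side of a reaction given as (coefficient, formula) pairs."""
    totals = Counter()
    for coefficient, formula in side:
        for element, n in parse_formula(formula).items():
            totals[element] += coefficient * n
    return totals


def imbalance(reactants, products):
    """Return {element: left_count - right_count} for every unbalanced element."""
    left, right = side_totals(reactants), side_totals(products)
    return {e: left[e] - right[e] for e in set(left) | set(right) if left[e] != right[e]}


# Glucose oxidation, balanced: C6H12O6 + 6 O2 -> 6 CO2 + 6 H2O
balanced = imbalance([(1, "C6H12O6"), (6, "O2")], [(6, "CO2"), (6, "H2O")])
# The same reaction with a wrong water coefficient
unbalanced = imbalance([(1, "C6H12O6"), (6, "O2")], [(6, "CO2"), (5, "H2O")])
print(balanced, unbalanced)
```

An empty result means every element balances; a non-empty result names the elements a curator would need to inspect.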
Model Organism Databases



DBs that describe the genome and other information about an organism
Every sequenced organism with an active experimental community requires a MOD
 Integrate genome data with information about the biochemical and genetic network of the organism
MODs are platforms for global analyses of an organism
 Interpret gene expression data in a pathway context
 Characterize systems properties of metabolic and genetic networks
 Determine consistency of metabolic and transport networks
 In silico prediction of essential genes
EcoCyc Project – EcoCyc.org
E. coli Encyclopedia
 Model-Organism Database for E. coli
 Computational symbolic theory of E. coli
 Electronic review article for E. coli – over 3500 literature citations
 Tracks the evolving annotation of the E. coli genome

Collaborative development via Internet
 Karp (SRI) -- Bioinformatics architect
 John Ingraham -- Advisor
 (SRI) Metabolic pathways
 Saier (UCSD) and Paulsen (TIGR) -- Transport
 Collado (UNAM) -- Regulation of gene expression

Database content: 18,000 objects
EcoCyc = E. coli Dataset + Pathway/Genome Navigator
Pathways: 165
Reactions: 2,760
Enzymes: 914
Transporters: 162
Proteins: 4,273
Promoters: 812
TransFac Sites: 956
Citations: 3,508
Compounds: 774
Genes: 4,393
Transcription Units: 724
Transcription Factors: 110
http://EcoCyc.org/
EcoCyc Procedures
 All DB updates by 5 staff curators
 Information gathered from biomedical literature
 Corrections solicited from E. coli researchers
 Review-level database
 Four releases per year
 Available through WWW site, as data files, as downloadable application
 Quality assurance of data and software:
 Evaluate database consistency constraints
 Perform element balancing of reactions
 Run other checking programs
 Display every DB object
MetaCyc: Metabolic Encyclopedia
Nonredundant metabolic pathway database
Describe a representative sample of every experimentally determined metabolic pathway
Literature-based DB with extensive references and commentary
Pathways, reactions, enzymes, substrates
460 pathways, 1267 enzymes, 4294 reactions
 172 E. coli pathways, 2735 citations
Nucleic Acids Research 30:59-61 2002.
Jointly developed by SRI and Carnegie Institution
 New focus on plant pathways
Family of Pathway/Genome Databases
MetaCyc
Pathway Tools Implementation Details
 Allegro Common Lisp
 Sun and PC platforms
 Ocelot object database
 250,000 lines of code
 Lisp-based WWW server at BioCyc.org
 Manages 15 PGDBs
Pathway Tools Architecture
Diagram components:
 WWW Server
 Pathway Genome Navigator
 X-Windows Graphics
 GFP API
 Object Editor, Pathway Editor, Reaction Editor
 Object DBMS
 Oracle
Ocelot Knowledge Server Architecture
 Frame data model
    Classes, instances, inheritance
    Frames have slots that define their properties, attributes, relationships
    A slot has one or more values
    Each value can be any Lisp datatype
    Slotunits define metadata about slots:
       Domain, range, inverse
       Collection type, number of values, value constraints
 Transaction logging facility
 Schema evolution
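The frame data model above can be sketched in a few lines. This is an illustration in Python, not Ocelot's Lisp API: the `Frame` and `SlotUnit` classes, their methods, and the frame and slot names are all hypothetical. It assumes slots are inherited from parent frames and that a slotunit constrains a slot's domain, value type, and number of values.

```python
class Frame:
    """A class or instance frame; slots hold lists of values and are inherited from parents."""

    def __init__(self, name, parents=(), slots=None):
        self.name = name
        self.parents = list(parents)
        self.slots = {k: list(v) for k, v in (slots or {}).items()}

    def is_a(self, cls):
        """True if this frame is cls or descends from it."""
        return self is cls or any(p.is_a(cls) for p in self.parents)

    def get(self, slot):
        """Return local slot values, else the first non-empty values found via the parents."""
        if slot in self.slots:
            return self.slots[slot]
        for parent in self.parents:
            inherited = parent.get(slot)
            if inherited:
                return inherited
        return []


class SlotUnit:
    """Metadata about one slot: its domain class, value type, and maximum number of values."""

    def __init__(self, slot, domain, value_type, max_values=None):
        self.slot, self.domain = slot, domain
        self.value_type, self.max_values = value_type, max_values

    def violations(self, frame):
        """List constraint violations for the frame's local values of this slot."""
        values = frame.slots.get(self.slot, [])
        problems = []
        if values and not frame.is_a(self.domain):
            problems.append(f"{frame.name}: slot {self.slot} is outside its domain")
        if self.max_values is not None and len(values) > self.max_values:
            problems.append(f"{frame.name}: too many values for {self.slot}")
        problems += [f"{frame.name}: bad value {v!r} for {self.slot}"
                     for v in values if not isinstance(v, self.value_type)]
        return problems


# Hypothetical class hierarchy and instance
proteins = Frame("Proteins", slots={"polymer-type": ["polypeptide"]})
enzymes = Frame("Enzymes", parents=[proteins])
enzyme_x = Frame("enzyme-x", parents=[enzymes],
                 slots={"molecular-weight": [28.7, "heavy"]})

mw = SlotUnit("molecular-weight", domain=proteins, value_type=float, max_values=1)
inherited = enzyme_x.get("polymer-type")   # value inherited from Proteins
problems = mw.violations(enzyme_x)         # too many values, and one of the wrong type
print(inherited, problems)
```

Checks of this kind are one way the "database consistency constraints" mentioned for quality assurance could be evaluated; the actual Ocelot mechanism may differ.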
Ocelot Storage System Architecture

Persistent storage via disk files, Oracle DBMS
 Concurrent development: Oracle
 Single-user development: disk files
 Read-only delivery: bundle data into binary program

Oracle storage
 DBMS is submerged within Ocelot, invisible to users
 Relational schema is domain independent, supports multiple KBs simultaneously
 Frames transferred from DBMS to Ocelot
    On demand
    By background prefetcher
    Memory cache
    Persistent disk cache to speed performance via Internet
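The on-demand and prefetch transfer of frames can be sketched as follows. This is a simplified illustration, not Ocelot's implementation: a plain dictionary stands in for the DBMS, prefetching runs synchronously rather than in the background, there is no persistent disk cache, and the `FrameStore` class and frame identifiers are hypothetical.

```python
class FrameStore:
    """Serve frames from an in-memory cache, reading the backend only on a miss."""

    def __init__(self, backend):
        self.backend = backend     # stand-in for the DBMS: {frame_id: frame_data}
        self.cache = {}            # frames already transferred into memory
        self.backend_reads = 0     # number of reads that reached the backend

    def _load(self, frame_id):
        self.backend_reads += 1
        self.cache[frame_id] = self.backend[frame_id]

    def get(self, frame_id):
        """On-demand transfer: fetch from the backend only if the frame is not cached."""
        if frame_id not in self.cache:
            self._load(frame_id)
        return self.cache[frame_id]

    def prefetch(self, frame_ids):
        """Warm the cache ahead of use (synchronous here; a real prefetcher would run in the background)."""
        for frame_id in frame_ids:
            if frame_id not in self.cache:
                self._load(frame_id)


store = FrameStore({"trp": {"type": "compound"}, "rxn-1": {"type": "reaction"}})
store.get("trp")
store.get("trp")                            # second call is served from the cache
reads_after_repeat = store.backend_reads
store.prefetch(["trp", "rxn-1"])            # only rxn-1 needs a backend read
store.get("rxn-1")                          # already cached by the prefetch
reads_after_prefetch = store.backend_reads
print(reads_after_repeat, reads_after_prefetch)
```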
The Common Lisp Programming Environment
 Gat studied Lisp and Java implementations of 16 programs by 14 programmers (Intelligence 11:21 2000)
EcoCyc WWW Server
Pathway/Genome DBs Created by External Users
Plasmodium falciparum, Stanford University
 plasmocyc.stanford.edu
Mycobacterium tuberculosis, Stanford University
 BioCyc.org
Arabidopsis thaliana and Synechocystis, Carnegie Institution of Washington
 Arabidopsis.org:1555
Methanococcus jannaschii, EBI
 Maine.ebi.ac.uk:1555
Other PGDBs in progress by 24 other users
Software freely available
Each PGDB owned by its creator
Global Consistency Checking of Biochemical Network
 Given:
    A PGDB for an organism
    A set of initial metabolites
 Infer:
    What set of products can be synthesized by the small-molecule metabolism of the organism
 Can known growth medium yield known essential compounds?
 Pacific Symposium on Biocomputing p471 2001
Algorithm: Forward Propagation
[Diagram labels: Nutrient set, Metabolite set, PGDB reaction pool, Reactants, “Fire” reactions, Products]
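The forward-propagation loop can be sketched as follows. This is a minimal illustration, assuming that a reaction fires once all of its reactants are in the metabolite set and that firing adds its products to that set. The function name, the reaction encoding, and the toy compounds are hypothetical and are not taken from Pathway Tools.

```python
def forward_propagate(reactions, nutrients):
    """Fire reactions until no new one can fire.

    reactions: {name: (reactants, products)}, each side a set of compound names.
    Returns (metabolites reachable from the nutrients, names of fired reactions).
    """
    metabolites = set(nutrients)
    fired = set()
    changed = True
    while changed:
        changed = False
        for name, (reactants, products) in reactions.items():
            # A reaction fires once, when every reactant is already available.
            if name not in fired and set(reactants) <= metabolites:
                fired.add(name)
                metabolites |= set(products)
                changed = True
    return metabolites, fired


# Toy network: X is never supplied or produced, so r3 cannot fire.
reactions = {
    "r1": ({"A", "B"}, {"C"}),
    "r2": ({"C"}, {"D", "E"}),
    "r3": ({"E", "X"}, {"F"}),
}
metabolites, fired = forward_propagate(reactions, {"A", "B"})
essential = {"D", "F"}
missing = essential - metabolites   # essential compounds the network cannot reach
print(sorted(metabolites), sorted(fired), sorted(missing))
```

Essential compounds left in `missing` point either to gaps in the database or to nutrients that should have been in the initial set.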
Results
 Phase I: Forward propagation
    21 initial compounds yielded only half of 38 essential compounds for E. coli
 Phase II: Manually identify
    Bugs in EcoCyc (e.g., two objects for tryptophan)
    Missing initial protein substrates (e.g., ACP)
    Missing pathways in EcoCyc
 Phase III: Forward propagation with 11 more initial metabolites
    Yielded all 38 essential compounds
Nutrient-Related Analysis: Validation of the EcoCyc Database
Results on EcoCyc:
Phase I:
• Essential compounds
  • produced: 19
  • not produced: 19
• Total compounds
  • produced: (28%)
• Reactions
  • fired: (31%)
Missing Essential Compounds Due To
 Bugs in EcoCyc
 Narrow conceptualization of the problem
    Protein substrates
 Incomplete biochemical knowledge
Nutrient-Related Analysis: Validation of the EcoCyc Database
Results on EcoCyc:
Phase II (After adding 11 extra metabolites):
• Essential compounds
  • produced: 38
  • not produced: 0
• Total compounds
  • produced: (49%)
  • not produced: (51%)
• Reactions
  • fired: (58%)
  • not fired: (42%)
Pathway Tools Misconceptions
 PathoLogic does not re-annotate genomes
 Pathway Tools does not handle quantitative information
 Pathway/Genome Editors do not work through the web
HumanCyc: Human Metabolic Pathway Database Consortium
Construct DB of human metabolic pathways using PathoLogic
Link to human genome web sites
Hire one curator to refine and curate with respect to literature over a 2-year period
 Remove false-positive predictions
 Insert known pathways missed by PathoLogic
 Add comments and citations from pathways and enzymes to the literature
 Add enzyme activators, inhibitors, cofactors, tissue information
Available as flatfiles and with Pathway/Genome Navigator
New versions to be released every 6 months
Summary
 Pathway/Genome Databases
    MetaCyc non-redundant DB of literature-derived pathways
    14 organism-specific PGDBs available through SRI at BioCyc.org
    Computational theories of biochemical machinery
 Pathway Tools software
    Extract pathways from genomes
    Morph annotated genome into structured ontology
    Distributed curation tools for MODs
    Query, visualization, WWW publishing
BioCyc and Pathway Tools Availability
 WWW BioCyc freely available to all
    BioCyc.org
 Six BioCyc DBs openly available to all
 BioCyc DBs freely available to non-profits
    Flatfiles downloadable from BioCyc.org
    Binary executable:
       Sun UltraSparc-170 w/ 64MB memory
       PC, 400MHz CPU, 64MB memory, Windows-98 or newer
       PerlCyc API
 Pathway Tools freely available to non-profits
Acknowledgements
SRI
 Suzanne Paley, Pedro Romero, John Pick, Cindy Krieger, Martha Arnaud
EcoCyc Project
 Julio Collado-Vides, Ian Paulsen, Monica Riley, Milton Saier
MetaCyc Project
 Sue Rhee, Lukas Mueller, Peifen Zhang, Chris Somerville
Stanford
 Gary Schoolnik, Harley McAdams, Lucy Shapiro, Russ Altman, Iwei Yeh
Funding sources:
 NIH National Center for Research Resources
 NIH National Institute of General Medical Sciences
 NIH National Human Genome Research Institute
 Department of Energy Microbial Cell Project
 DARPA BioSpice, UPC
BioCyc.org