krummenacker - Buffalo Ontology Site
The BioCyc Ontologies
Markus Krummenacker
Bioinformatics Research Group
SRI International
BioCyc.org
EcoCyc.org, MetaCyc.org, HumanCyc.org
SRI International Bioinformatics
Overview
Pathway/Genome Databases (PGDBs)
BioCyc collection
EcoCyc, MetaCyc
Pathway Tools Software & Applications
Visualization, Editing, Analysis, Omics data
Inference tools: PathoLogic, Operon predictor, Pathway hole filler
Tools for debugging a predicted metabolic network
Some Ontology Details
Pathways, Reactions and Compounds, Enzymes, Genes
Regulation
Integration with other efforts: BioPAX, GO, NCBI Taxonomy
Model Organism Databases / PGDBs
DBs that describe the genome and molecular machinery of one specific organism.
Integrating many diverse types of data into a coherent model of a cell
Every sequenced organism with an active experimental community requires a MOD
Integrate genome data with information about the biochemical and genetic network of the organism
Integrate literature-based information with computational predictions
Ongoing updating of sequence, gene positions and functions, regulatory sites, pathways
MODs are platforms for global analyses of the organism
Interpret omics data in a pathway context
In silico prediction of essential genes
Characterize systems properties of metabolic and genetic networks
BioCyc Collection of Pathway/Genome Databases
Pathway/Genome Database (PGDB): combines information about
Pathways, reactions, substrates
Enzymes, transporters
Genes, replicons
Transcription factors/sites, promoters, operons
Tier 1: Literature-Derived PGDBs
MetaCyc
EcoCyc -- Escherichia coli K-12
Tier 2: Computationally-derived DBs, Some Curation -- 20 PGDBs
HumanCyc
Mycobacterium tuberculosis
Tier 3: Computationally-derived DBs, No Curation -- 349 DBs
Pathway Tools: PathoLogic Inference
[Diagram: Annotated Genome + MetaCyc Reference Pathway DB -> PathoLogic -> Pathway/Genome Database; the Pathway/Genome Editors and the Pathway/Genome Navigator operate on the Pathway/Genome Database]
Pathway Tools Software:
PGDBs Created Outside SRI
1,300+ licensees: 75+ groups applying software to 200+ organisms
Saccharomyces cerevisiae, SGD project, Stanford University
Mouse, MGD, Jackson Laboratory
dictyBase, Northwestern University
Under development:
CGD (Candida albicans), Stanford University
Drosophila, P. Ebert in collaboration with FlyBase
C. elegans, P. Ebert in collaboration with WormBase
Planned:
RGD (Rat), Medical College of Wisconsin
Arabidopsis thaliana, TAIR, Carnegie Institution of Washington
PlantCyc, ~20 plant PGDBs, Carnegie Institution of Washington
Six Solanaceae species, Cornell University
GrameneDB, Cold Spring Harbor Laboratory
Medicago truncatula, Samuel Roberts Noble Foundation
Pathway Tools Software:
PGDBs Created Outside SRI
BioHealthBase (M. tuberculosis, F. tularensis), PATRIC, ApiDB
Gary Xie, Los Alamos Lab, Dental pathogens
F. Brinkman, Simon Fraser Univ, Pseudomonas aeruginosa
V. Schachter, Genoscope, Acinetobacter
M. Bibb, John Innes Centre, Streptomyces coelicolor
G. Church, Harvard, Prochlorococcus marinus, multiple strains
E. Uberbacher, ORNL and G. Serres, MBL, Shewanella oneidensis
R.J.S. Baerends, University of Groningen, Lactococcus lactis IL1403, Lactococcus lactis MG1363, Streptococcus pneumoniae TIGR4, Bacillus subtilis 168, Bacillus cereus ATCC14579
Matthew Berriman, Sanger Centre, Trypanosoma brucei, Leishmania major
Herbert Chiang, Washington University, Bacteroides thetaiotaomicron
Sergio Encarnacion, UNAM, Sinorhizobium meliloti
Gregory Fournier, MIT, Mesoplasma florum
Mark van der Giezen, University of London, Entamoeba histolytica, Giardia intestinalis
Michael Gottfert, Technische Universität Dresden, Bradyrhizobium japonicum
Artiva Maria Goudel, Universidade Federal de Santa Catarina, Brazil, Chromobacterium violaceum ATCC 12472
Pathway Tools Software:
PGDBs Created Outside SRI
Large scale users:
C. Medigue, Genoscope, 150+ PGDBs
G. Burger, U Montreal, 60+ PGDBs
Bart Weimer, Utah State University, Lactococcus lactis, Brevibacterium linens, Lactobacillus acidophilus, Lactobacillus plantarum, Lactobacillus johnsonii, Listeria monocytogenes
Partial listing of outside PGDBs at BioCyc.org
Pathway Evidence
Pathway Tools Overviews and Omics Viewers
Provide genome-scale visualizations of cellular networks
Harness human visual system to interpret patterns in biological contexts
Designed to avoid the hairball effect
Generated automatically from PGDB
Magnify, interrogate
Omics viewers paint omics data onto overview diagrams
Different perspectives on same dataset
Use animation for multiple time points or conditions
Paint any data that associates numbers with genes, proteins, reactions, or metabolites
Regulatory Overview and Omics Viewer
Show regulatory relationships among gene groups
Comparative Analysis
Via Cellular Overview
Comparative genome browser
Comparative pathway table
Comparative analysis reports
Compare reaction complements
Compare pathway complements
Compare transporter complements
Pathway Tools Ontology
1621 Classes
Main classes such as:
Pathways, Reactions, Compounds, Macromolecules, Proteins, Replicons, DNA-Segments (Genes, Operons, Promoters)
Taxonomies for Pathways, Reactions (EC), Compounds
Cell Component Ontology
Protein Feature ontology
221 Slots for attributes and relationships
Meta-data: Creator, Creation-Date
Comment, Citations, Common-Name, Synonyms
Attributes: Molecular-Weight, DNA-Footprint-Size
Relationships: Catalyzes, Component-Of, Product
Evidence codes, supporting citations
Pathway/Genome Database Schema
Protein Feature Ontology
Advanced Query Form
Intuitive construction of complex database queries of SQL power
Enzymatic-Reactions
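[Diagram: frames linked by the slots in-pathway, reaction, catalyzes, component-of, and product]
TCA Cycle (pathway)
Succinate + FAD = fumarate + FADH2 (reaction, linked to the pathway by in-pathway)
Enzymatic-reaction (linked to the reaction by reaction)
Succinate dehydrogenase (enzyme complex, linked to the enzymatic-reaction by catalyzes)
Sdh-flavo, Sdh-Fe-S, Sdh-membrane-1, Sdh-membrane-2 (subunits, linked to the complex by component-of)
sdhA, sdhB, sdhC, sdhD (genes, linked to the subunits by product)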
Need for Enzymatic-Reactions
Reactions can have isozymes
Enzymes can be multi-functional
Enzymatic-Reaction frames are needed to decouple the many-to-many relationships
Isozymes may have different inhibitors, etc.
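The decoupling can be illustrated with a minimal Python sketch. It is not the Pathway Tools implementation: the class, the helper functions, and the frame names (E1, E2, R1, R2, compound-X, compound-Y) are all hypothetical. Each enzymatic-reaction object pairs one enzyme with one reaction, so pairing-specific data such as inhibitors has a place to live.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EnzymaticReaction:
    """One enzyme/reaction pairing plus data specific to that pairing."""
    enzyme: str
    reaction: str
    inhibitors: List[str] = field(default_factory=list)


# Hypothetical frames: E1 and E2 are isozymes for R1; E2 is also multi-functional.
enzrxns = [
    EnzymaticReaction("E1", "R1", inhibitors=["compound-X"]),
    EnzymaticReaction("E2", "R1", inhibitors=["compound-Y"]),
    EnzymaticReaction("E2", "R2"),
]


def enzymes_for(reaction):
    """All enzymes that catalyze the given reaction (its isozymes)."""
    return sorted({er.enzyme for er in enzrxns if er.reaction == reaction})


def reactions_for(enzyme):
    """All reactions catalyzed by the given enzyme."""
    return sorted({er.reaction for er in enzrxns if er.enzyme == enzyme})


def inhibitors_of(enzyme, reaction):
    """Inhibitors recorded for one specific enzyme/reaction pairing."""
    return [
        inhibitor
        for er in enzrxns
        if (er.enzyme, er.reaction) == (enzyme, reaction)
        for inhibitor in er.inhibitors
    ]
```

Because inhibitors hang off the pairing rather than off the enzyme or the reaction, the two isozymes of R1 can carry different inhibitor lists without ambiguity.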
[Figure: Gene-Reaction schema diagrams]
New Representation of Regulation
Previously, regulation was represented idiosyncratically:
One representation for modulation of enzymes
Completely different representation for regulation of transcription initiation
Now unified under single Regulation class w/ subclasses
This enables us to easily add support for new kinds of regulation, e.g.
Transcriptional attenuation (done)
Regulation of translation by small RNAs (in progress)
New tools for display and editing of new Regulation classes
Operons and Transcription Units
Operon: A set of two or more genes that are transcribed as a unit. May include multiple promoters.
Transcription Unit: A set of one or more genes that are transcribed as a unit from a single promoter.
Pathway Tools schema does not represent operons explicitly, only transcription-units
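Since only transcription units are stored, operons have to be derived from them. One plausible derivation, sketched here in Python, merges transcription units that share a gene and keeps merged groups of two or more genes. The merging rule, the function name, and the gene names are assumptions made for illustration, not a description of what Pathway Tools actually does.

```python
def operons_from_transcription_units(tus):
    """Merge transcription units that share a gene into candidate operons.

    tus: dict mapping a transcription-unit name to its set of gene names.
    Returns a sorted list of sorted gene lists, one per group of >= 2 genes.
    """
    groups = []  # pairwise-disjoint gene sets built so far
    for genes in tus.values():
        merged = set(genes)
        remaining = []
        for group in groups:
            if group & merged:
                merged |= group  # overlapping group is absorbed
            else:
                remaining.append(group)
        remaining.append(merged)
        groups = remaining
    return sorted(sorted(group) for group in groups if len(group) >= 2)


# Hypothetical transcription units: tu1 and tu2 overlap on geneB,
# tu3 is a single-gene unit, tu4 is an independent two-gene unit.
example_tus = {
    "tu1": {"geneA", "geneB"},
    "tu2": {"geneB", "geneC"},
    "tu3": {"geneD"},
    "tu4": {"geneE", "geneF"},
}
example_operons = operons_from_transcription_units(example_tus)
```

In this example tu1 and tu2 collapse into one three-gene operon with two promoters, tu3 is dropped because a single gene does not meet the operon definition, and tu4 stands alone.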
Ontology for Transcriptional Regulation
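[Diagram: frames and slots describing regulation of the trp transcription unit]
Binding reaction BR001: left = trp, apoTrpR; right = TrpR*trp
Regulation frame reg001: regulator = TrpR*trp; associated-binding-site = site001; linked to promoter trpLEDCBAp1 by regulated-by
Transcription unit trpLEDCBA: components = trpLEDCBAp1, trpL, trpE, trpD, trpC, trpB, trpA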
Representation of Transcriptional Regulation
Transcription-Unit
Components include genes, a single promoter, zero or more terminators
Binding-Sites
Linked to regulation frames
Regulation frames
Transcriptional Initiation: defines a 3-way pairing between promoter, transcription factor and binding-site
Transcriptional Attenuation: defines relationship between terminator and the entity (tRNA, protein, small molecule) that regulates it.
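As a rough illustration of these frame types, the Python sketch below models a transcription unit and the two kinds of regulation frame as plain dataclasses. The class and field names are hypothetical and are not the Pathway Tools slot names. The trp identifiers come from the preceding slide; term001 and tRNA-X are invented placeholders.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TranscriptionUnit:
    name: str
    promoter: str            # exactly one promoter
    genes: List[str]
    terminators: List[str]   # zero or more


@dataclass
class InitiationRegulation:
    """3-way pairing of promoter, transcription factor and binding site."""
    promoter: str
    regulator: str
    binding_site: str


@dataclass
class AttenuationRegulation:
    """Pairs a terminator with the entity that regulates it."""
    terminator: str
    regulator: str  # tRNA, protein, or small molecule


def regulators_of(tu, regulations):
    """Regulators acting on a unit through its promoter or its terminators."""
    found = []
    for reg in regulations:
        if isinstance(reg, InitiationRegulation) and reg.promoter == tu.promoter:
            found.append(reg.regulator)
        elif isinstance(reg, AttenuationRegulation) and reg.terminator in tu.terminators:
            found.append(reg.regulator)
    return found


trp_tu = TranscriptionUnit(
    name="trpLEDCBA",
    promoter="trpLEDCBAp1",
    genes=["trpL", "trpE", "trpD", "trpC", "trpB", "trpA"],
    terminators=["term001"],  # placeholder terminator name
)
regulations = [
    InitiationRegulation("trpLEDCBAp1", "TrpR*trp", "site001"),
    AttenuationRegulation("term001", "tRNA-X"),  # placeholder regulator
]
trp_regulators = regulators_of(trp_tu, regulations)
```

Keeping each regulation in its own object means a new kind of regulation can be added as another class without changing the transcription-unit representation.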
Infer Anti-Microbial Drug Targets
Infer drug targets as genes coding for enzymes that catalyze chokepoint reactions
Two types of chokepoint reactions:
[Figure: the two types of chokepoint reactions]
Genome Research 14:917 2004
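The Python sketch below assumes the definition we take from the cited paper: a chokepoint reaction is one that uniquely consumes a given metabolite or uniquely produces a given metabolite. The function name, the input format, and the toy network are hypothetical. A real analysis would read reactions from a PGDB and would likely apply further filters.

```python
from collections import defaultdict


def chokepoint_reactions(reactions):
    """Find reactions that are the only consumer or only producer of a metabolite.

    reactions: dict mapping reaction name -> (set of substrates, set of products).
    Returns (consuming, producing), each a sorted list of reaction names.
    """
    consumers = defaultdict(set)
    producers = defaultdict(set)
    for name, (substrates, products) in reactions.items():
        for metabolite in substrates:
            consumers[metabolite].add(name)
        for metabolite in products:
            producers[metabolite].add(name)

    consuming = set()
    for rxns in consumers.values():
        if len(rxns) == 1:  # a single reaction consumes this metabolite
            consuming |= rxns
    producing = set()
    for rxns in producers.values():
        if len(rxns) == 1:  # a single reaction produces this metabolite
            producing |= rxns
    return sorted(consuming), sorted(producing)


# Toy network: A is consumed by r1 and r2; D is produced by r3 and r4.
toy_network = {
    "r1": ({"A"}, {"B"}),
    "r2": ({"A"}, {"C"}),
    "r3": ({"B"}, {"D"}),
    "r4": ({"C"}, {"D"}),
}
consuming_chokepoints, producing_chokepoints = chokepoint_reactions(toy_network)
```

Candidate drug targets would then be the genes whose products catalyze the reactions in either list.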
Reachability Analysis of Metabolic Network
Given:
A PGDB for an organism
A set of initial metabolites
Infer:
What set of products can be synthesized by the small-molecule metabolism of the organism
Can known growth medium yield known essential compounds?
Romero and Karp, Pacific Symposium on Biocomputing, 2001
Algorithm: Forward Propagation Through Production System
Each reaction becomes a production rule
Each metabolite in nutrient set becomes an axiom
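[Diagram: the nutrient set seeds the metabolite set; reactions from the PGDB reaction pool whose reactants are all in the metabolite set are "fired", and their products are added to the metabolite set]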
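Forward propagation can be sketched in a few lines of Python. This is a minimal illustration, not the published implementation: it treats each reaction as a rule that fires once all of its reactants are available, and it repeats until nothing new can fire. The function name, the input format, and the toy compounds are assumptions.

```python
def forward_propagate(reactions, nutrients):
    """Compute the metabolites reachable from a nutrient set.

    reactions: iterable of (reactants, products) pairs, each a set of names.
    nutrients: iterable of metabolite names available at the start (the axioms).
    """
    metabolites = set(nutrients)
    pending = list(reactions)
    fired_any = True
    while fired_any:
        fired_any = False
        still_pending = []
        for reactants, products in pending:
            if reactants <= metabolites:
                # All reactants are available: fire the rule once.
                metabolites |= products
                fired_any = True
            else:
                still_pending.append((reactants, products))
        pending = still_pending
    return metabolites


# Toy reaction pool; the second rule only becomes usable after the first pass.
toy_reactions = [
    ({"C"}, {"D"}),
    ({"A", "B"}, {"C"}),
    ({"E"}, {"F"}),
]
reachable = forward_propagate(toy_reactions, {"A", "B"})

# Which essential compounds cannot be produced from this nutrient set?
essential = {"D", "F"}
missing = essential - reachable
```

Here D is reachable in two passes, while F is reported as missing because its precursor E is neither a nutrient nor a product of any fired reaction.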
Results
Phase I: Forward propagation
21 initial compounds yielded only half of the 41 essential compounds for E. coli
Phase II: Manually identify
Bugs in EcoCyc (e.g., two objects for tryptophan)
“Bootstrap compounds”
Missing initial protein substrates (e.g., ACP)
Incomplete knowledge of E. coli metabolic network
Protein synthesis not represented
[Figure: schematic reactions among compounds A, B, B’, C, D]
Phase III: Forward propagation with 11 more initial metabolites
Yielded all 41 essential compounds
Integration with other efforts
Export of
BioPAX
SBML
Import of
Enzyme DB (EC hierarchy of reactions)
GO
NCBI Taxonomy
BioPAX (work in progress)
Near Future
Signalling pathways
Validating the design
Regulation
Small RNAs, and other additional types
Higher Eukaryotes
Gene expression, Multiple splice forms
Cell types, localization
Summary
Pathway/Genome Databases
MetaCyc non-redundant DB of literature-derived pathways
370 organism-specific PGDBs available through SRI at BioCyc.org
Computational theories of biochemical machinery
Pathway Tools software
Extract pathways from genomes
Morph annotated genome into structured ontology
Distributed curation tools for MODs
Query, visualization, WWW publishing
BioCyc and Pathway Tools Availability
BioCyc.org Web site and database files freely available to all
Pathway Tools freely available to non-profits
Macintosh, PC/Windows, PC/Linux
References
Pathway Tools User’s Guide
Appendix A: Guide to the Pathway Tools Schema
Ontology Papers section of http://biocyc.org/publications.shtml
Acknowledgements
SRI
Suzanne Paley, Ron Caspi, Ingrid Keseler, Carol Fulcher, Markus Krummenacker, Alex Shearer, Tomer Altman, Joe Dale, Fred Gilham, Pallavi Kaipa
EcoCyc Collaborators
Julio Collado-Vides, Robert Gunsalus, Ian Paulsen
MetaCyc Collaborators
Sue Rhee, Peifen Zhang, Kate Dreher
Lukas Mueller, Anuradha Pujar
Funding sources:
NIH National Center for Research Resources
NIH National Institute of General Medical Sciences
NIH National Human Genome Research Institute
BioCyc.org
Learn more from BioCyc webinars: biocyc.org/webinar.shtml
BioWarehouse: A Bioinformatics Database Warehouse
Peter D. Karp, Tom J. Lee, Valerie Wagner
BMC Bioinformatics 7:170 2006
bioinformatics.ai.sri.com/biowarehouse/
[Diagram: source DBs BioCyc, BioPAX, ENZYME, CMR, Genbank, GO, Eco2DBase, KEGG, UniProt, Taxonomy, and MAGE-ML are loaded into BioWarehouse, which runs on Oracle (10g) or MySQL (4.1.11)]
Motivations
Hundreds of bioinformatics DBs exist
Important problems involve queries across multiple DBs
Why is the Multidatabase Approach Alone Not Sufficient?
Multidatabase query approaches assume databases are in a queryable DBMS
Most sites that do operate DBMSs do not allow remote query access because of security and loading concerns
Users want to control data stability
Users want to control speed of their hardware
Internet bandwidth limits query throughput
Users need to capture, integrate and publish locally produced data of different types
Multidatabase and Warehouse approaches complementary
Key Challenges for BioWarehouse
Designing a schema that accurately captures the contents of source DBs
Designing a schema that is understandable and scalable
Addressing poorly-specified syntax & semantics of source DBs
Balancing the preservation of source data with mapping into common semantics
Technical Approach
Multi-platform support: Oracle (10g) and MySQL
Schema support for multitude of bioinformatics datatypes
Create loaders for public bioinformatics DBs
Parse file format of the source DB
Semantic transformations
Insert DB contents into warehouse tables
Provide Warehouse query access mechanisms
SQL queries via ODBC, JDBC, OAA
Operate public BioWarehouse server: publichouse
BMC Bioinformatics 7:170 2006
PublicHouse Server
Publicly queryable BioWarehouse server operated by SRI
Manages a set of biological DBs constructed using BioWarehouse
CMR
Open BioCyc DBs
ENZYME
NCBI Taxonomy
UniProt
Large-scale data mining using
Dashboard Warehouse Query Analyzer
MySQL client command line
See: http://bioinformatics.ai.sri.com/biowarehouse/publichouse.html
Host: publichouse.sri.com
Port: 3306
Database: biospice
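For the MySQL client route, the connection details above translate into a command line along the following lines. This Python helper only assembles the command string. The helper name is made up, and the user name is a placeholder because the slide gives no account details. `--host`, `--port`, `--user`, and `--password` are standard `mysql` client options.

```python
import shlex


def mysql_client_command(user, host="publichouse.sri.com", port=3306,
                         database="biospice"):
    """Build a mysql client invocation for the PublicHouse server.

    A bare --password makes the client prompt for the password, so the
    trailing argument is read as the database name.
    """
    args = [
        "mysql",
        f"--host={host}",
        f"--port={port}",
        f"--user={user}",
        "--password",
        database,
    ]
    return shlex.join(args)


# YOUR_USER is a placeholder account name, not a real credential.
command = mysql_client_command("YOUR_USER")
```

The account to use, and whether the server is still reachable, should be checked against the PublicHouse page referenced on the slide.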
BioWarehouse Schema
Manages many bioinformatics datatypes simultaneously
Pathways, Reactions, Chemicals
Proteins, Genes, Replicons
Sequences, Sequence Features
Organisms, Taxonomic relationships
Computations (sequence matches)
Citations, Controlled vocabularies
Links to external databases
Gene expression datasets
Protein-protein interactions datasets
Flow cytometry datasets
Each type of warehouse object implemented through one or more relational tables (currently ~150)
Warehouse Schema
Manages multiple datasets simultaneously
Dataset = Single version of a database
Version comparison
Multiple software tools or experiments that require access to different versions
Each dataset is a warehouse entity
Every warehouse object is registered in a dataset
Warehouse Schema
Different databases storing the same biological datatypes are coerced into same warehouse tables
Design of most datatypes inspired by multiple databases
Representational tricks to decrease schema bloat
Single space of primary keys
Single set of satellite tables such as for synonyms, citations, comments, etc.
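The two representational tricks can be illustrated with a small sketch using Python's built-in sqlite3 module. The table and column names here are invented for illustration and are not the actual BioWarehouse schema. A single registry table hands out every object's key and records its dataset, so one synonym table can serve genes and proteins alike.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dataset (id INTEGER PRIMARY KEY, name TEXT, version TEXT);
-- Every warehouse object draws its key (wid) from this one table.
CREATE TABLE object_registry (
    wid INTEGER PRIMARY KEY AUTOINCREMENT,
    dataset_id INTEGER REFERENCES dataset(id)
);
CREATE TABLE gene (wid INTEGER PRIMARY KEY REFERENCES object_registry(wid), name TEXT);
CREATE TABLE protein (wid INTEGER PRIMARY KEY REFERENCES object_registry(wid), name TEXT);
-- One satellite table for all object types, since wids never collide.
CREATE TABLE synonym (wid INTEGER REFERENCES object_registry(wid), syn TEXT);
""")


def add_object(table, name, dataset_id):
    """Register a new object in its dataset, then store it in its type table."""
    if table not in ("gene", "protein"):
        raise ValueError(f"unknown object table: {table}")
    cur = conn.execute(
        "INSERT INTO object_registry (dataset_id) VALUES (?)", (dataset_id,)
    )
    wid = cur.lastrowid
    conn.execute(f"INSERT INTO {table} (wid, name) VALUES (?, ?)", (wid, name))
    return wid


# Hypothetical dataset and objects.
conn.execute("INSERT INTO dataset (id, name, version) VALUES (1, 'source-db-A', 'v1')")
gene_wid = add_object("gene", "trpA", 1)
protein_wid = add_object("protein", "TrpA", 1)
conn.execute("INSERT INTO synonym (wid, syn) VALUES (?, ?)", (gene_wid, "alias-1"))

gene_synonyms = conn.execute(
    "SELECT syn FROM synonym WHERE wid = ?", (gene_wid,)
).fetchall()
```

With per-type key spaces, each object type would need its own synonym, citation, and comment tables. A shared key space keeps that to one table per kind of annotation.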