krummenacker - Buffalo Ontology Site

The BioCyc Ontologies
Markus Krummenacker
Bioinformatics Research Group
SRI International
[email protected]
BioCyc.org
EcoCyc.org, MetaCyc.org, HumanCyc.org
SRI International Bioinformatics
Overview
• Pathway/Genome Databases (PGDBs)
  • BioCyc collection
  • EcoCyc, MetaCyc
• Pathway Tools Software & Applications
  • Visualization, Editing, Analysis, Omics data
  • Inference tools: PathoLogic, Operon predictor, Pathway hole filler
  • Tools for debugging a predicted metabolic network
• Some Ontology Details
  • Pathways, Reactions and Compounds, Enzymes, Genes
  • Regulation
  • Integration with other efforts: BioPAX, GO, NCBI Taxonomy

Model Organism Databases / PGDBs
• DBs that describe the genome and molecular machinery of one specific organism
  • Integrating many diverse types of data into a coherent model of a cell
• Every sequenced organism with an active experimental community requires a MOD
  • Integrate genome data with information about the biochemical and genetic network of the organism
  • Integrate literature-based information with computational predictions
  • Ongoing updating of sequence, gene positions and functions, regulatory sites, pathways
• MODs are platforms for global analyses of the organism
  • Interpret omics data in a pathway context
  • In silico prediction of essential genes
  • Characterize systems properties of metabolic and genetic networks

BioCyc Collection of Pathway/Genome Databases
• Pathway/Genome Database (PGDB): combines information about
  • Pathways, reactions, substrates
  • Enzymes, transporters
  • Genes, replicons
  • Transcription factors/sites, promoters, operons
• Tier 1: Literature-derived PGDBs
  • MetaCyc
  • EcoCyc: Escherichia coli K-12
• Tier 2: Computationally derived DBs, some curation (20 PGDBs)
  • HumanCyc
  • Mycobacterium tuberculosis
• Tier 3: Computationally derived DBs, no curation (349 DBs)

Pathway Tools: PathoLogic Inference
[Diagram: an Annotated Genome and the MetaCyc Reference Pathway DB feed into PathoLogic, which produces a Pathway/Genome Database; the Pathway/Genome Editors and the Pathway/Genome Navigator operate on that database.]

Pathway Tools Software: PGDBs Created Outside SRI
• 1,300+ licensees: 75+ groups applying software to 200+ organisms
• Saccharomyces cerevisiae, SGD project, Stanford University
• Mouse, MGD, Jackson Laboratory
• dictyBase, Northwestern University
• Under development:
  • CGD (Candida albicans), Stanford University
  • Drosophila, P. Ebert in collaboration with FlyBase
  • C. elegans, P. Ebert in collaboration with WormBase
• Planned:
  • RGD (Rat), Medical College of Wisconsin
• Arabidopsis thaliana, TAIR, Carnegie Institution of Washington
• PlantCyc, ~20 plant PGDBs, Carnegie Institution of Washington
• Six Solanaceae species, Cornell University
• GrameneDB, Cold Spring Harbor Laboratory
• Medicago truncatula, Samuel Roberts Noble Foundation

Pathway Tools Software: PGDBs Created Outside SRI
• BioHealthBase (M. tuberculosis, F. tularensis), PATRIC, ApiDB
• Gary Xie, Los Alamos Lab, Dental pathogens
• F. Brinkman, Simon Fraser Univ, Pseudomonas aeruginosa
• V. Schachter, Genoscope, Acinetobacter
• M. Bibb, John Innes Centre, Streptomyces coelicolor
• G. Church, Harvard, Prochlorococcus marinus, multiple strains
• E. Uberbacher, ORNL and G. Serres, MBL, Shewanella oneidensis
• R.J.S. Baerends, University of Groningen, Lactococcus lactis IL1403, Lactococcus lactis MG1363, Streptococcus pneumoniae TIGR4, Bacillus subtilis 168, Bacillus cereus ATCC14579
• Matthew Berriman, Sanger Centre, Trypanosoma brucei, Leishmania major
• Herbert Chiang, Washington University, Bacteroides thetaiotaomicron
• Sergio Encarnacion, UNAM, Sinorhizobium meliloti
• Gregory Fournier, MIT, Mesoplasma florum
• Mark van der Giezen, University of London, Entamoeba histolytica, Giardia intestinalis
• Michael Gottfert, Technische Universität Dresden, Bradyrhizobium japonicum
• Artiva Maria Goudel, Universidade Federal de Santa Catarina, Brazil, Chromobacterium violaceum ATCC 12472

Pathway Tools Software: PGDBs Created Outside SRI
• Large scale users:
  • C. Medigue, Genoscope, 150+ PGDBs
  • G. Burger, U Montreal, 60+ PGDBs
  • Bart Weimer, Utah State University, Lactococcus lactis, Brevibacterium linens, Lactobacillus acidophilus, Lactobacillus plantarum, Lactobacillus johnsonii, Listeria monocytogenes
• Partial listing of outside PGDBs at BioCyc.org

Pathway Evidence
Pathway Tools Overviews and Omics Viewers
• Provide genome-scale visualizations of cellular networks
• Harness human visual system to interpret patterns in biological contexts
• Designed to avoid the hairball effect
• Generated automatically from PGDB
• Magnify, interrogate
• Omics viewers paint omics data onto overview diagrams
  • Different perspectives on same dataset
  • Use animation for multiple time points or conditions
  • Paint any data that associates numbers with genes, proteins, reactions, or metabolites

Regulatory Overview and Omics Viewer
• Show regulatory relationships among gene groups

Comparative Analysis
• Via Cellular Overview
• Comparative genome browser
• Comparative pathway table
• Comparative analysis reports
  • Compare reaction complements
  • Compare pathway complements
  • Compare transporter complements

Pathway Tools Ontology
• 1621 Classes
  • Main classes such as: Pathways, Reactions, Compounds, Macromolecules, Proteins, Replicons, DNA-Segments (Genes, Operons, Promoters)
  • Taxonomies for Pathways, Reactions (EC), Compounds
  • Cell Component Ontology
  • Protein Feature ontology
• 221 Slots for attributes and relationships
  • Meta-data: Creator, Creation-Date, Comment, Citations, Common-Name, Synonyms
  • Attributes: Molecular-Weight, DNA-Footprint-Size
  • Relationships: Catalyzes, Component-Of, Product
• Evidence codes, supporting citations

Pathway/Genome Database Schema
Protein Feature Ontology
Advanced Query Form
• Intuitive construction of complex database queries of SQL power

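Enzymatic-Reactions
[Diagram: chain of frames for succinate dehydrogenase, with slot names on the links. The pathway TCA Cycle is linked (in-pathway) to the reaction Succinate + FAD = fumarate + FADH2; the reaction is linked (reaction) to an Enzymatic-reaction frame; the Enzymatic-reaction is linked (catalyzes) to the enzyme Succinate dehydrogenase; the enzyme is linked (component-of) to its subunits Sdh-flavo, Sdh-Fe-S, Sdh-membrane-1 and Sdh-membrane-2; the subunits are linked (product) to the genes sdhA, sdhB, sdhC and sdhD.]
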
Need for Enzymatic-Reactions
• Reactions can have isozymes
• Enzymes can be multi-functional
• Enzymatic-Reaction frames are needed to decouple the many-to-many relationships
• Isozymes may have different inhibitors, etc.
• Gene-Reaction schema diagrams [diagrams not reproduced]

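The role of the Enzymatic-Reaction frame can be illustrated with a small sketch. This is not Pathway Tools code: the class, field and function names are assumptions chosen for illustration, and the enzymes E1/E2 and reactions R1/R2 are made-up data. Each pairing object ties one enzyme to one reaction and carries data specific to that pairing (here, inhibitors), so isozymes and multi-functional enzymes both fall out of the same list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnzymaticReaction:
    """One enzyme paired with one reaction, plus pairing-specific data."""
    enzyme: str
    reaction: str
    inhibitors: frozenset = frozenset()


# Hypothetical data: R1 has two isozymes (E1, E2); E2 is multi-functional
# because it also catalyzes R2. Inhibitors differ per pairing.
ENZRXNS = [
    EnzymaticReaction("E1", "R1", frozenset({"compound-X"})),
    EnzymaticReaction("E2", "R1"),
    EnzymaticReaction("E2", "R2", frozenset({"compound-Y"})),
]


def enzymes_for(reaction, enzrxns):
    """All enzymes (isozymes) that catalyze the given reaction."""
    return sorted({er.enzyme for er in enzrxns if er.reaction == reaction})


def reactions_for(enzyme, enzrxns):
    """All reactions catalyzed by the given (possibly multi-functional) enzyme."""
    return sorted({er.reaction for er in enzrxns if er.enzyme == enzyme})


def inhibitors_of(enzyme, reaction, enzrxns):
    """Inhibitors recorded for one specific enzyme/reaction pairing."""
    for er in enzrxns:
        if er.enzyme == enzyme and er.reaction == reaction:
            return set(er.inhibitors)
    return set()
```

Storing inhibitors on the pairing rather than on the reaction or the enzyme is what lets two isozymes of the same reaction be inhibited differently.
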
New Representation of Regulation
• Previously, regulation was represented idiosyncratically:
  • One representation for modulation of enzymes
  • Completely different representation for regulation of transcription initiation
• Now unified under single Regulation class w/ subclasses
• This enables us to easily add support for new kinds of regulation, e.g.
  • Transcriptional attenuation (done)
  • Regulation of translation by small RNAs (in progress)
• New tools for display and editing of new Regulation classes

Operons and Transcription Units
• Operon: A set of two or more genes that are transcribed as a unit. May include multiple promoters.
• Transcription Unit: A set of one or more genes that are transcribed as a unit from a single promoter.
• Pathway Tools schema does not represent operons explicitly, only transcription-units

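Because only transcription units are stored, operons have to be derived when needed. The sketch below shows one plausible derivation, under the assumption (not stated on the slide) that transcription units sharing at least one gene belong to the same operon. The function name and the gene and TU names are hypothetical.

```python
def operons_from_transcription_units(tus):
    """Group transcription units that share at least one gene.

    tus: dict mapping transcription-unit name -> set of gene names.
    Returns a sorted list of sorted gene lists, one per group that
    contains two or more genes (the operon definition used above).
    """
    groups = []  # pairwise-disjoint sets of genes
    for genes in tus.values():
        merged = set(genes)
        remaining = []
        for group in groups:
            if group & merged:
                merged |= group       # overlapping group: absorb it
            else:
                remaining.append(group)
        remaining.append(merged)
        groups = remaining
    return sorted(sorted(g) for g in groups if len(g) >= 2)


# Hypothetical example: tu-2 starts at an internal promoter of tu-1,
# and tu-3 is a single-gene transcription unit (not an operon).
TUS = {
    "tu-1": {"geneA", "geneB", "geneC"},
    "tu-2": {"geneB", "geneC"},
    "tu-3": {"geneD"},
}
OPERONS = operons_from_transcription_units(TUS)
```

Here the two overlapping transcription units collapse into a single operon with two promoters, which matches the "may include multiple promoters" part of the definition.
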
Ontology for Transcriptional Regulation
[Diagram: frames describing regulation of the trp transcription unit. Frame labels: trp, apoTrpR, TrpR*trp, BR001, reg001, site001, trpLEDCBAp1, trpLEDCBA, trpL, trpE, trpD, trpC, trpB, trpA. Slot labels on the links: left, right, components, regulator, regulated-by, associated-binding-site.]

Representation of Transcriptional Regulation
• Transcription-Unit
  • Components include genes, a single promoter, zero or more terminators
• Binding-Sites
  • Linked to regulation frames
• Regulation frames
  • Transcriptional Initiation: defines a 3-way pairing between promoter, transcription factor and binding-site
  • Transcriptional Attenuation: defines relationship between terminator and the entity (tRNA, protein, small molecule) that regulates it.

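A minimal sketch of these frames as Python dataclasses may make the relationships easier to see. The class and field names are assumptions for illustration and do not reproduce the actual schema. The trp identifiers are taken from the diagram labels, but how they are wired together here is an interpretation; the attenuation entry uses purely hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TranscriptionUnit:
    name: str
    genes: List[str]
    promoter: str            # a single promoter per transcription unit
    terminators: List[str]   # zero or more


@dataclass
class Regulation:
    """Common parent, so new kinds of regulation are just new subclasses."""
    name: str
    regulator: str


@dataclass
class TranscriptionalInitiation(Regulation):
    promoter: str       # 3-way pairing: promoter, transcription factor
    binding_site: str   # (the regulator) and binding site


@dataclass
class TranscriptionalAttenuation(Regulation):
    terminator: str     # pairing of a terminator with its regulator


def regulators_of(tu, regulations):
    """Regulators acting on a transcription unit via any Regulation subclass."""
    found = []
    for reg in regulations:
        if isinstance(reg, TranscriptionalInitiation) and reg.promoter == tu.promoter:
            found.append(reg.regulator)
        elif isinstance(reg, TranscriptionalAttenuation) and reg.terminator in tu.terminators:
            found.append(reg.regulator)
    return found


TRP_TU = TranscriptionUnit(
    name="trpLEDCBA",
    genes=["trpL", "trpE", "trpD", "trpC", "trpB", "trpA"],
    promoter="trpLEDCBAp1",
    terminators=["terminator-x"],  # hypothetical identifier
)
REGS = [
    TranscriptionalInitiation("reg001", "TrpR*trp", "trpLEDCBAp1", "site001"),
    TranscriptionalAttenuation("reg-x", "regulator-x", "terminator-x"),  # hypothetical
]
```

A query such as `regulators_of(TRP_TU, REGS)` treats both kinds of regulation uniformly, which is the benefit of placing them under one parent class.
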
Infer Anti-Microbial Drug Targets
• Infer drug targets as genes coding for enzymes that catalyze chokepoint reactions
• Two types of chokepoint reactions (shown in a diagram on the slide)
Genome Research 14:917 2004

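The sketch below assumes the usual definition of a chokepoint: a reaction that is the only consumer of some metabolite, or the only producer of some metabolite. It is an illustration, not the published implementation; it ignores any refinements the cited paper applies, and the function name and the three-reaction network are made up.

```python
from collections import defaultdict


def find_chokepoints(reactions):
    """Find reactions that uniquely consume or uniquely produce a metabolite.

    reactions: dict mapping reaction name -> (set of substrates, set of products).
    Returns dict mapping reaction name -> sorted list of (metabolite, kind),
    where kind is "unique-consumer" or "unique-producer".
    """
    consumers = defaultdict(set)
    producers = defaultdict(set)
    for name, (substrates, products) in reactions.items():
        for metabolite in substrates:
            consumers[metabolite].add(name)
        for metabolite in products:
            producers[metabolite].add(name)

    chokepoints = defaultdict(list)
    for metabolite, rxns in consumers.items():
        if len(rxns) == 1:
            chokepoints[next(iter(rxns))].append((metabolite, "unique-consumer"))
    for metabolite, rxns in producers.items():
        if len(rxns) == 1:
            chokepoints[next(iter(rxns))].append((metabolite, "unique-producer"))
    return {name: sorted(hits) for name, hits in chokepoints.items()}


# Hypothetical network: R1 and R2 are alternative routes from A to B,
# so neither is a chokepoint; R3 is the only consumer of B and the
# only producer of C.
NETWORK = {
    "R1": ({"A"}, {"B"}),
    "R2": ({"A"}, {"B"}),
    "R3": ({"B"}, {"C"}),
}
CHOKEPOINTS = find_chokepoints(NETWORK)
```

Genes encoding the enzymes of the reactions returned here would be the candidate drug targets in the sense of the slide.
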
Reachability Analysis of Metabolic Network
• Given:
  • A PGDB for an organism
  • A set of initial metabolites
• Infer:
  • What set of products can be synthesized by the small-molecule metabolism of the organism
• Can known growth medium yield known essential compounds?
Romero and Karp, Pacific Symposium on Biocomputing, 2001

Algorithm: Forward Propagation Through Production System
• Each reaction becomes a production rule
• Each metabolite in nutrient set becomes an axiom
[Diagram: the nutrient set seeds the metabolite set; reactions from the PGDB reaction pool whose reactants are all in the metabolite set are “fired”, and their products are added to the metabolite set.]

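Forward propagation can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the code from the cited paper: reactions are treated as irreversible (reactant set, product set) pairs, a reaction fires once all of its reactants are available, and the metabolite and reaction data are invented.

```python
def forward_propagate(reactions, nutrients):
    """Return every metabolite reachable from the nutrient set.

    reactions: iterable of (reactants, products) pairs, each a set of names.
    nutrients: iterable of initial metabolite names (the "axioms").
    """
    metabolites = set(nutrients)
    pending = list(reactions)
    fired_any = True
    while fired_any:
        fired_any = False
        still_pending = []
        for reactants, products in pending:
            if reactants <= metabolites:
                # "Fire" the rule: all reactants are present, add the products.
                metabolites |= products
                fired_any = True
            else:
                still_pending.append((reactants, products))
        pending = still_pending
    return metabolites


# Hypothetical network and growth medium.
REACTIONS = [
    ({"A", "B"}, {"C"}),
    ({"C"}, {"D"}),
    ({"E"}, {"F"}),        # never fires: E is not reachable
    ({"D", "G"}, {"H"}),   # never fires: G is not reachable
]
NUTRIENTS = {"A", "B"}
ESSENTIAL = {"D", "F"}

REACHABLE = forward_propagate(REACTIONS, NUTRIENTS)
MISSING = ESSENTIAL - REACHABLE
```

Comparing the reachable set against a list of essential compounds, as in the last two lines, is the check posed by the reachability question: which essential compounds the growth medium fails to yield.
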
Results
• Phase I: Forward propagation
  • 21 initial compounds yielded only half of the 41 essential compounds for E. coli
• Phase II: Manually identify
  • Bugs in EcoCyc (e.g., two objects for tryptophan)
  • “Bootstrap compounds”
  • Missing initial protein substrates (e.g., ACP)
  • Incomplete knowledge of E. coli metabolic network
  • Protein synthesis not represented
  [Reaction sketches accompanying these items on the slide: A + B → C + D; B’ → C; A → B]
• Phase III: Forward propagation with 11 more initial metabolites
  • Yielded all 41 essential compounds

Integration with other efforts
• Export of
  • BioPAX
  • SBML
• Import of
  • Enzyme DB (EC hierarchy of reactions)
  • GO
  • NCBI Taxonomy
  • BioPAX (work in progress)

Near Future
• Signalling pathways
  • Validating the design
• Regulation
  • Small RNAs, and other additional types
• Higher Eukaryotes
  • Gene expression, Multiple splice forms
  • Cell types, localization

Summary
• Pathway/Genome Databases
  • MetaCyc non-redundant DB of literature-derived pathways
  • 370 organism-specific PGDBs available through SRI at BioCyc.org
  • Computational theories of biochemical machinery
• Pathway Tools software
  • Extract pathways from genomes
  • Morph annotated genome into structured ontology
  • Distributed curation tools for MODs
  • Query, visualization, WWW publishing

BioCyc and Pathway Tools Availability
• BioCyc.org Web site and database files freely available to all
• Pathway Tools freely available to non-profits
  • Macintosh, PC/Windows, PC/Linux
• References
  • Pathway Tools User’s Guide, Appendix A: Guide to the Pathway Tools Schema
  • Ontology Papers section of http://biocyc.org/publications.shtml

Acknowledgements
• SRI: Suzanne Paley, Ron Caspi, Ingrid Keseler, Carol Fulcher, Markus Krummenacker, Alex Shearer, Tomer Altman, Joe Dale, Fred Gilham, Pallavi Kaipa
• Funding sources:
  • NIH National Center for Research Resources
  • NIH National Institute of General Medical Sciences
  • NIH National Human Genome Research Institute
• EcoCyc Collaborators: Julio Collado-Vides, Robert Gunsalus, Ian Paulsen
• MetaCyc Collaborators: Sue Rhee, Peifen Zhang, Kate Dreher; Lukas Mueller, Anuradha Pujar
BioCyc.org
Learn more from BioCyc webinars: biocyc.org/webinar.shtml

BioWarehouse: A Bioinformatics Database Warehouse
Peter D. Karp, Tom J. Lee, Valerie Wagner
BMC Bioinformatics 7:170 2006
bioinformatics.ai.sri.com/biowarehouse/
[Diagram: source databases loaded into BioWarehouse, which runs on Oracle (10g) or MySQL (4.1.11). Sources shown: BioCyc, BioPAX, ENZYME, CMR, GenBank, GO, Eco2DBase, KEGG, UniProt, Taxonomy, MAGE-ML.]

Motivations
• Hundreds of bioinformatics DBs exist
• Important problems involve queries across multiple DBs

Why is the Multidatabase Approach Alone Not Sufficient?
• Multidatabase query approaches assume databases are in a queryable DBMS
• Most sites that do operate DBMSs do not allow remote query access because of security and loading concerns
• Users want to control data stability
• Users want to control speed of their hardware
• Internet bandwidth limits query throughput
• Users need to capture, integrate and publish locally produced data of different types
• Multidatabase and Warehouse approaches are complementary

Key Challenges for BioWarehouse
• Designing a schema that accurately captures the contents of source DBs
• Designing a schema that is understandable and scalable
• Addressing poorly-specified syntax & semantics of source DBs
• Balancing the preservation of source data with mapping into common semantics

Technical Approach
• Multi-platform support: Oracle (10g) and MySQL
• Schema support for multitude of bioinformatics datatypes
• Create loaders for public bioinformatics DBs
  • Parse file format of the source DB
  • Semantic transformations
  • Insert DB contents into warehouse tables
• Provide Warehouse query access mechanisms
  • SQL queries via ODBC, JDBC, OAA
• Operate public BioWarehouse server: publichouse
BMC Bioinformatics 7:170 2006

PublicHouse Server
• Publicly queryable BioWarehouse server operated by SRI
• Manages a set of biological DBs constructed using BioWarehouse
  • CMR
  • Open BioCyc DBs
  • ENZYME
  • NCBI Taxonomy
  • UniProt
• Large-scale data mining using
  • Dashboard Warehouse Query Analyzer
  • MySQL client command line
• See: http://bioinformatics.ai.sri.com/biowarehouse/publichouse.html
  • Host: publichouse.sri.com
  • Port: 3306
  • Database: biospice

BioWarehouse Schema
• Manages many bioinformatics datatypes simultaneously
  • Pathways, Reactions, Chemicals
  • Proteins, Genes, Replicons
  • Sequences, Sequence Features
  • Organisms, Taxonomic relationships
  • Computations (sequence matches)
  • Citations, Controlled vocabularies
  • Links to external databases
  • Gene expression datasets
  • Protein-protein interactions datasets
  • Flow cytometry datasets
• Each type of warehouse object implemented through one or more relational tables (currently ~150)

Warehouse Schema
• Manages multiple datasets simultaneously
  • Dataset = Single version of a database
• Version comparison
• Multiple software tools or experiments that require access to different versions
• Each dataset is a warehouse entity
• Every warehouse object is registered in a dataset

Warehouse Schema
• Different databases storing the same biological datatypes are coerced into same warehouse tables
• Design of most datatypes inspired by multiple databases
• Representational tricks to decrease schema bloat
  • Single space of primary keys
  • Single set of satellite tables such as for synonyms, citations, comments, etc.

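The two representational tricks, together with dataset registration, can be sketched with an in-memory SQLite database. The table and column names below are hypothetical and are not taken from the actual BioWarehouse schema, and SQLite stands in for Oracle or MySQL only so that the example is self-contained.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not the real BioWarehouse tables.
SCHEMA = """
CREATE TABLE dataset (id INTEGER PRIMARY KEY, name TEXT, version TEXT);
-- One registry allocates ids for every object type (single key space)
-- and records the dataset each object belongs to.
CREATE TABLE object (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    dataset_id INTEGER NOT NULL REFERENCES dataset(id)
);
CREATE TABLE protein (id INTEGER PRIMARY KEY REFERENCES object(id), name TEXT);
CREATE TABLE gene (id INTEGER PRIMARY KEY REFERENCES object(id), name TEXT);
-- One satellite table serves all object types, instead of one per type.
CREATE TABLE synonym (object_id INTEGER NOT NULL REFERENCES object(id), syn TEXT);
"""


def new_object(conn, dataset_id):
    """Register an object in a dataset and return its warehouse-wide id."""
    cur = conn.execute("INSERT INTO object (dataset_id) VALUES (?)", (dataset_id,))
    return cur.lastrowid


def synonyms(conn, object_id):
    """Synonyms of any object, regardless of its type."""
    rows = conn.execute(
        "SELECT syn FROM synonym WHERE object_id = ? ORDER BY syn", (object_id,)
    )
    return [row[0] for row in rows]


def build_demo():
    """Load two made-up datasets with one protein and one gene."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO dataset VALUES (1, 'source-db-a', 'v1')")
    conn.execute("INSERT INTO dataset VALUES (2, 'source-db-b', 'v7')")
    protein_id = new_object(conn, 1)
    conn.execute("INSERT INTO protein VALUES (?, ?)", (protein_id, "protein-x"))
    gene_id = new_object(conn, 2)
    conn.execute("INSERT INTO gene VALUES (?, ?)", (gene_id, "gene-y"))
    conn.executemany(
        "INSERT INTO synonym VALUES (?, ?)",
        [(protein_id, "px"), (gene_id, "gy"), (gene_id, "gene-y-alt")],
    )
    return conn, protein_id, gene_id
```

Because ids never collide across object types, the satellite table needs only an object id to point at a protein, a gene, or any other warehouse object.
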