Interoperation of Molecular Biology Databases
Peter D. Karp, Ph.D.
Bioinformatics Research Group
SRI International
Menlo Park, CA
[email protected]
Main Message
SRI International
Bioinformatics
Interoperation of molecular-biology databases is a challenging problem of critical importance
DOE should initiate a program in interoperation of molecular biology databases
Pursue both warehouse approach and multidatabase approach
Major progress possible within 5 years
Motivations
Important biological problems require access to multiple bioinformatics databases
Different problems require different sets of databases
Hundreds of bioinformatics databases exist
Nucleic Acids Research 32:2004 – Database issue
Nucleic Acids Research DB list: http://www3.oup.co.uk/nar/database/a/
350 databases listed in 2002
560 databases listed in 2004
Applications of integration include
Complex queries
Comparison of overlapping sources
Data mining
Bioinformatics Databases
Tremendous progress in point-and-click access for biologist users
Less progress toward providing a computable, interoperable infrastructure for large-scale data mining
Every large-scale mining/learning problem requires time-consuming crafting of input/training datasets
Warehouse Approach vs. Multidatabase Approach
Multidatabase query approaches assume databases are in a queryable DBMS
Most sites that do operate DBMSs do not allow remote query access because of security and loading concerns
Users want to control data stability
Users want to control hardware applied to problem
Internet bandwidth limits query throughput
Users need to capture, integrate, and publish locally produced data of different types
Replicating and refreshing very large sources is expensive
Multidatabase and warehouse approaches are complementary
SRI BioWarehouse
Project Goal
Create a toolkit for constructing bioinformatics database warehouses that integrate sets of bioinformatics databases into one physical DBMS
BioWarehouse Approach
Warehouse schema defines many bioinformatics datatypes
Create loaders for public bioinformatics DBs
Parse file format for the DB
Apply semantic transformations
Insert database into warehouse tables
Oracle and MySQL implementations
Warehouse query access mechanisms
SQL queries via JDBC, Lisp, Perl, ODBC, OAA
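The three loader steps above (parse, transform, insert) can be sketched as follows. This is a minimal illustration under stated assumptions, not the BioWarehouse loader code: the flat-file layout is a simplified, hypothetical ENZYME-style sample, the table and column names are invented for the example, and Python's built-in sqlite3 stands in for the Oracle and MySQL implementations named on the slide.

```python
import sqlite3

# Hypothetical sample in a simplified ENZYME-style flat-file layout
# (two-letter line code, three spaces, value; "//" ends a record).
SAMPLE = """\
ID   1.1.1.1
DE   Alcohol dehydrogenase.
AN   Aldehyde reductase.
//
ID   1.1.1.2
DE   Alcohol dehydrogenase (NADP(+)).
//
"""


def parse_records(text):
    """Step 1: parse the file format into one dict of lists per record."""
    record = {}
    for line in text.splitlines():
        if line.startswith("//"):
            if record:
                yield record
            record = {}
            continue
        code, _, value = line.partition("   ")
        record.setdefault(code, []).append(value.strip())
    if record:
        yield record


def transform(record):
    """Step 2: semantic transformation into the (assumed) warehouse fields."""
    return {
        "ec_number": record["ID"][0],
        "name": record["DE"][0].rstrip("."),
        "synonyms": [s.rstrip(".") for s in record.get("AN", [])],
    }


def load(conn, dataset_name, text):
    """Step 3: insert the transformed records into warehouse tables."""
    # Table and column names are assumptions made for this sketch.
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS dataset (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE IF NOT EXISTS enzyme (
            id INTEGER PRIMARY KEY,
            dataset_id INTEGER REFERENCES dataset(id),
            ec_number TEXT,
            name TEXT);
        CREATE TABLE IF NOT EXISTS synonym (object_id INTEGER, name TEXT);
    """)
    dataset_id = conn.execute(
        "INSERT INTO dataset (name) VALUES (?)", (dataset_name,)
    ).lastrowid
    count = 0
    for rec in map(transform, parse_records(text)):
        enzyme_id = conn.execute(
            "INSERT INTO enzyme (dataset_id, ec_number, name) VALUES (?, ?, ?)",
            (dataset_id, rec["ec_number"], rec["name"]),
        ).lastrowid
        conn.executemany(
            "INSERT INTO synonym (object_id, name) VALUES (?, ?)",
            [(enzyme_id, s) for s in rec["synonyms"]],
        )
        count += 1
    conn.commit()
    return count


conn = sqlite3.connect(":memory:")
loaded = load(conn, "ENZYME (sample)", SAMPLE)
rows = conn.execute(
    "SELECT ec_number, name FROM enzyme ORDER BY ec_number"
).fetchall()
synonym_count = conn.execute("SELECT COUNT(*) FROM synonym").fetchone()[0]
```

Once loaded, the data is reachable through ordinary SQL, which is what lets the same warehouse be queried from any of the access mechanisms listed above.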
Warehouse Schema
Manages many bioinformatics datatypes simultaneously
Pathways, Reactions, Chemicals
Proteins, Genes, Replicons
Sequences, Sequence Features
Organisms, Taxonomic relationships
Computations (sequence matches)
Citations, Controlled vocabularies
Links to external databases
Each type of warehouse object implemented through one or more relational tables (currently 43)
Warehouse Schema
Manages multiple datasets simultaneously
Dataset = Single version of a database
Allows version comparison
Multiple software tools or experiments require access to different versions
Each dataset is a warehouse entity
Every warehouse object is registered in a dataset
Different databases storing the same biological datatypes are coerced into same warehouse tables
Design of most datatypes inspired by multiple databases
Representational tricks to decrease schema bloat
Single space of primary keys
Single set of satellite tables such as for synonyms, citations, comments, etc.
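The schema ideas on this slide (datasets as warehouse entities, a single space of primary keys, shared satellite tables, and version comparison) can be sketched as below. This is an illustration only: every table, column, and function name is hypothetical rather than taken from the actual 43-table schema, the protein names are made-up sample data, and sqlite3 is used as a stand-in DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per dataset, i.e. per loaded version of a source database.
    CREATE TABLE dataset (id INTEGER PRIMARY KEY, name TEXT, version TEXT);
    -- Single space of primary keys: every warehouse object gets its id here
    -- and is thereby registered in exactly one dataset.
    CREATE TABLE entry (
        id INTEGER PRIMARY KEY,
        dataset_id INTEGER NOT NULL REFERENCES dataset(id));
    CREATE TABLE protein (id INTEGER PRIMARY KEY REFERENCES entry(id), name TEXT);
    CREATE TABLE gene (id INTEGER PRIMARY KEY REFERENCES entry(id), name TEXT);
    -- One satellite table serves all object types, keyed by the shared id.
    CREATE TABLE synonym (object_id INTEGER REFERENCES entry(id), name TEXT);
""")


def new_dataset(name, version):
    """Register one version of a source database as a dataset."""
    return conn.execute(
        "INSERT INTO dataset (name, version) VALUES (?, ?)", (name, version)
    ).lastrowid


def new_object(table, dataset_id, name, synonyms=()):
    """Create an object of the given type inside a dataset."""
    if table not in ("protein", "gene"):
        raise ValueError(f"unknown object table: {table}")
    object_id = conn.execute(
        "INSERT INTO entry (dataset_id) VALUES (?)", (dataset_id,)
    ).lastrowid
    conn.execute(f"INSERT INTO {table} (id, name) VALUES (?, ?)", (object_id, name))
    conn.executemany(
        "INSERT INTO synonym (object_id, name) VALUES (?, ?)",
        [(object_id, s) for s in synonyms],
    )
    return object_id


def proteins_added(old_dataset, new_dataset_id):
    """Version comparison: protein names in the new dataset but not the old."""
    rows = conn.execute(
        """
        SELECT p.name FROM protein p JOIN entry e ON e.id = p.id
        WHERE e.dataset_id = ?
        EXCEPT
        SELECT p.name FROM protein p JOIN entry e ON e.id = p.id
        WHERE e.dataset_id = ?
        """,
        (new_dataset_id, old_dataset),
    ).fetchall()
    return sorted(r[0] for r in rows)


# Two versions of the same (made-up) source database held side by side.
v1 = new_dataset("ExampleDB", "1.0")
v2 = new_dataset("ExampleDB", "2.0")
p1 = new_object("protein", v1, "TrpA", synonyms=["tryptophan synthase alpha"])
g1 = new_object("gene", v1, "trpA")
p2 = new_object("protein", v2, "TrpA")
p3 = new_object("protein", v2, "TrpB")
added = proteins_added(v1, v2)
```

Because proteins and genes draw their ids from the same `entry` table, a row in `synonym` identifies its object unambiguously without a separate synonym table per datatype, which is one way to keep the table count down.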
Current Databases Supported by BioWarehouse
BioCyc
15 genomes and metabolic networks
Swiss-Prot, TrEMBL
1.3M proteins
ENZYME
KEGG
NCBI Taxonomy
CMR
105 genomes, 250K genes, 250K proteins
Applications:
DARPA BioSpice program on biological simulation
Study of sequence coverage of known enzymes
Summary
Interoperation of molecular-biology databases is a challenging problem of critical importance
DOE should initiate a program in interoperation of molecular biology databases
Pursue both warehouse approach and multidatabase approach
Major progress possible within 5 years