Bioinformatics for CS People - Artificial Intelligence Center
Overview of Genome Databases
Peter D. Karp, Ph.D.
SRI International
www-db.stanford.edu/dbseminar/seminar.html
Talk Overview
Definition of bioinformatics
Motivations for genome databases
Issues in building genome databases
Definition of Bioinformatics
Computational techniques for management and analysis of biological data and knowledge
Methods for disseminating, archiving, interpreting, and mining scientific information
Computational theories of biology
Genome Databases is a subfield of bioinformatics
Motivations for Bioinformatics
Growth in molecular-biology knowledge
(literature)
Genomics
1. Study of genomes through DNA sequencing
2. Industrial Biology
Example Genomics Datatypes
Genome sequences
DOE Joint Genome Institute
511M bases in Dec 2001
11.97G bases since Mar 1999
Gene and protein expression data
Protein-protein interaction data
Protein 3-D structures
Genome Databases
Experimental data
Archive experimental datasets
Retrieving past experimental results should be faster than repeating the
experiment
Capture alternative analyses
Lots of data, simpler semantics
Computational symbolic theories
Complex theories become too large to be grasped by a single mind
The database is the theory
Biology is very much concerned with qualitative relationships
Less data, more complex semantics
Bioinformatics
Distinct intellectual field at the intersection of CS and
molecular biology
Distinct field because researchers in the field must know
CS, biology, and bioinformatics
Spectrum from CS research to biology service
Rich source of challenging CS problems
Large, noisy, complex data-sets and knowledge-sets
Biologists and funding agencies demand working solutions
Bioinformatics Research
algorithms + data structures = programs
algorithms + databases = discoveries
Combine sophisticated algorithms with the right content:
Properly structured
Carefully curated
Relevant data fields
Proper amount of data
Reference on Major Genome Databases
Nucleic Acids Research Database Issue
http://nar.oupjournals.org/content/vol30/issue1/
112 databases
Questions to Ask of a New Genome Database
What are Database Goals and Requirements?
What problems will database be used to solve?
Who are the users and what is their expertise?
What is its Organizing Principle?
Different DBs partition the space of genome information in different dimensions
Experimental methods (Genbank, PDB)
Organism (EcoCyc, Flybase)
What is its Level of Interpretation?
Laboratory data
Primary literature (Genbank)
Review (SwissProt, MetaCyc)
Does DB model disagreement?
What are its Semantics and Content?
What entities and relationships does it model?
How does its content overlap with similar DBs?
How many entities of each type are present?
Sparseness of attributes and statistics on attribute values
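Attribute sparseness can be measured directly once entries are available in a program. The following is a minimal Python sketch, assuming entries are held as dictionaries; the field names and sample records are hypothetical and are not taken from any particular database.

```python
from collections import Counter

def attribute_sparseness(records, attributes):
    """Return, for each attribute, the fraction of records where it is missing."""
    missing = Counter()
    for record in records:
        for attr in attributes:
            # Treat both absent/None values and empty strings as missing.
            if record.get(attr) in (None, ""):
                missing[attr] += 1
    total = len(records)
    return {attr: (missing[attr] / total if total else 0.0) for attr in attributes}

# Hypothetical sample records (illustration only).
sample_proteins = [
    {"name": "ACC synthase", "sequence": "MKV", "pi_exp": None},
    {"name": "Enzyme B", "sequence": "", "pi_exp": None},
    {"name": "Enzyme C", "sequence": "MAL", "pi_exp": "6.1"},
    {"name": "Enzyme D", "sequence": "MTT", "pi_exp": None},
]
sparseness = attribute_sparseness(sample_proteins, ["name", "sequence", "pi_exp"])
```

A high missing fraction for an attribute suggests that queries relying on it will silently skip most of the database.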
What are Sources of its Data?
Potential information sources
Laboratory instruments
Scientific literature
Manual entry
Natural-language text mining
Direct submission from the scientific community
Genbank
Modification policy
DB staff only
Submission of new entries by scientific community
Update access by scientific community
What DBMS is Employed?
None
Relational
Object oriented
Frame knowledge representation system
Distribution / User Access
Multiple distribution forms enhance access
Browsing access with visualization tools
API
Portability
What Validation Approaches are Employed?
None
Declarative consistency constraints
Programmatic consistency checking
Internal vs external consistency checking
What types of systematic errors might DB contain?
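The difference between declarative constraints and programmatic checking can be illustrated with a small sketch. It uses Python's built-in sqlite3 module rather than any DBMS named in the talk, and the two tables and their columns are hypothetical. The CHECK and REFERENCES clauses are declarative: the DBMS rejects bad rows. The function at the end is programmatic: it searches for a condition the schema does not forbid.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE Gene (WID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
CREATE TABLE Protein (
    WID      INTEGER PRIMARY KEY,
    Name     TEXT NOT NULL,
    Fragment TEXT CHECK (Fragment IN ('T', 'F')),
    GeneWID  INTEGER REFERENCES Gene(WID)
);
INSERT INTO Gene VALUES (1, 'ACS1');
INSERT INTO Gene VALUES (2, 'ACCW');
INSERT INTO Protein VALUES (10, 'ACC synthase', 'F', 1);
""")

def try_insert(row):
    """Attempt to insert a Protein row; report whether the DBMS accepted it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO Protein VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

# Declarative constraints: violations are rejected by the database itself.
bad_flag_accepted = try_insert((11, "Enzyme B", "X", 1))        # violates CHECK
dangling_gene_accepted = try_insert((12, "Enzyme C", "T", 99))  # violates foreign key

def genes_without_protein(connection):
    """Programmatic check: list genes that have no protein product recorded."""
    rows = connection.execute(
        "SELECT g.Name FROM Gene g "
        "LEFT JOIN Protein p ON p.GeneWID = g.WID "
        "WHERE p.WID IS NULL ORDER BY g.Name"
    ).fetchall()
    return [name for (name,) in rows]

suspicious_genes = genes_without_protein(conn)
```

Both checks here are internal. An external check would compare the contents against an independent source.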
Database Documentation
Schema and its semantics
Format
API
Data acquisition techniques
Validation techniques
Size of different classes
Coverage of subject matter
Sparseness of attributes
Error rates
Update frequency
Relationship of Database Field to Bioinformatics
Scientists generally unaware of basic DB principles
Complex queries vs click-at-a-time access
Data model
Defined semantics for DB fields
Controlled vocabularies
Regular syntax for flatfiles
Automated consistency checking
Most biologists take one programming class
Evolution of typical genome database
Finer points of DB research off their radar screen
Handful of DB researchers work in bioinformatics
Database Field
For many years, the majority of bioinformatics DBs did not employ a DBMS
Flatfiles were the rule
Scientists want to see the data directly
Commercial DBMSs too expensive, too complex
DBAs too expensive
Most scientists do not understand
Differences between BA, MS, PhD in CS
CS research vs applications
Implications for project planning, funding, bioinformatics research
Recommendation
Teaching scientists programming is not enough
Teaching scientists how to build a DBMS is irrelevant
Teach scientists basic aspects of databases and symbolic computing
Database requirements analysis
Data models, schema design
Knowledge representation, ontologies
Formal grammars
Complex queries
Database interoperability
BioSPICE Bioinformatics Database Warehouse
Peter Karp, Dave Stringer-Calvert, Tom Lee, Kemal Sonmez
SRI International
http://www.BioSPICE.org/
Project Goal
Create a toolkit for constructing bioinformatics database warehouses that collect together a set of bioinformatics databases into one physical DBMS
Motivations
Important bioinformatics problems require
access to multiple bioinformatics databases
Hundreds of bioinformatics databases exist
Nucleic Acids Research 30(1) 2002 – DB issue
Nucleic Acids Research DB list: 350 DBs at
http://www3.oup.co.uk/nar/database/a/
Different problems require different sets of
databases
Motivations
Combining multiple databases allows for data verification and complementation
Simulation problems require access to data on pathways, enzymes, reactions, genetic regulation
Why is the Multidatabase Approach
Not Sufficient?
Multidatabase query approaches assume
databases are in a DBMS
Internet bandwidth limits query throughput
Most sites that do operate DBMSs do not allow
remote SQL access because of security and
loading concerns
Control data stability
Need to capture, integrate and publish locally
produced data of different types
Multidatabase and Warehouse approaches
complementary
Scenario 1
BioSPICE scientist wants to model multiple metabolic pathways in a given organism
Enumerate pathways and reactions
What enzymes catalyze each reaction?
What genes code for each enzyme?
What control regions regulate each gene?
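Once the relevant databases sit in one DBMS, the questions in this scenario become a chain of joins. The sketch below uses Python's built-in sqlite3 module as a stand-in for the warehouse; every table name, column name, and row is a hypothetical illustration, not the actual warehouse schema. Control regions are left out to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Pathway (WID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Reaction (WID INTEGER PRIMARY KEY, ECNumber TEXT);
CREATE TABLE PathwayReaction (PathwayWID INTEGER, ReactionWID INTEGER);
CREATE TABLE Gene (WID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Protein (WID INTEGER PRIMARY KEY, Name TEXT, GeneWID INTEGER);
CREATE TABLE EnzymaticReaction (ReactionWID INTEGER, ProteinWID INTEGER);

-- Hypothetical sample rows
INSERT INTO Pathway VALUES (1, 'glycolysis');
INSERT INTO Reaction VALUES (10, '2.7.1.1');
INSERT INTO Reaction VALUES (11, '5.3.1.9');
INSERT INTO PathwayReaction VALUES (1, 10);
INSERT INTO PathwayReaction VALUES (1, 11);
INSERT INTO Gene VALUES (100, 'hxk');
INSERT INTO Gene VALUES (101, 'pgi');
INSERT INTO Protein VALUES (200, 'hexokinase', 100);
INSERT INTO Protein VALUES (201, 'glucose-6-phosphate isomerase', 101);
INSERT INTO EnzymaticReaction VALUES (10, 200);
INSERT INTO EnzymaticReaction VALUES (11, 201);
""")

def pathway_details(connection, pathway_name):
    """For one pathway, list (pathway, EC number, enzyme, gene) per reaction."""
    return connection.execute(
        """
        SELECT pw.Name, r.ECNumber, p.Name, g.Name
        FROM Pathway pw
        JOIN PathwayReaction pr   ON pr.PathwayWID = pw.WID
        JOIN Reaction r           ON r.WID = pr.ReactionWID
        JOIN EnzymaticReaction er ON er.ReactionWID = r.WID
        JOIN Protein p            ON p.WID = er.ProteinWID
        JOIN Gene g               ON g.WID = p.GeneWID
        WHERE pw.Name = ?
        ORDER BY r.ECNumber
        """,
        (pathway_name,),
    ).fetchall()

details = pathway_details(conn, "glycolysis")
```

The same query would need several separate remote lookups if each database were reachable only through its own web interface.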
Approach
Oracle and MySQL implementations
Warehouse schema defines many bioinformatics
datatypes
Create loaders for public bioinformatics DBs
Parse file format for the DB
Semantic transformations
Insert database into warehouse tables
Warehouse query access mechanisms
SQL queries via Perl, ODBC, OAA
Example: Swiss-Prot DB
Version 40.0 describes 101K proteins in a 320MB
file
Each protein described as one block of records
(an entry) in a large text file
Loader tool parses file one entry at a time
Creates new entries in a set of warehouse tables
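Entry-at-a-time parsing can be sketched in a few lines of Python. This is not the loader described in the talk. It only assumes the usual Swiss-Prot flat-file layout, in which each line begins with a two-letter line code and each entry ends with a line containing `//`. The second sample entry is invented for illustration.

```python
import io

def iter_entries(handle):
    """Yield one entry at a time as a list of (line_code, text) pairs."""
    entry = []
    for raw in handle:
        line = raw.rstrip("\n")
        if line.startswith("//"):       # entry terminator
            if entry:
                yield entry
            entry = []
        elif line.strip():
            # The first two characters are the line code; the rest is the data.
            entry.append((line[:2], line[2:].strip()))
    if entry:                            # tolerate a missing final terminator
        yield entry

def fields(entry, code):
    """Return all values in an entry that carry the given line code."""
    return [text for (line_code, text) in entry if line_code == code]

SAMPLE = """\
ID   1A11_CUCMA     STANDARD;      PRT;   493 AA.
AC   P23599;
DT   01-NOV-1991 (Rel. 20, Created)
DT   01-NOV-1991 (Rel. 20, Last sequence update)
DT   15-DEC-1998 (Rel. 37, Last annotation update)
GN   ACS1 OR ACCW.
//
ID   TEST_ENTRY     STANDARD;      PRT;   10 AA.
AC   X00000;
//
"""

entries = list(iter_entries(io.StringIO(SAMPLE)))
```

Because the function is a generator, only one entry is held in memory at a time, which matters for a file of several hundred megabytes.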
Warehouse Schema
Manages many bioinformatics datatypes
simultaneously
Pathways, Reactions, Chemicals
Proteins, Genes, Replicons
Citations, Organisms
Links to external databases
Each type of warehouse object implemented
through one or more relational tables (currently
43)
Warehouse Schema
Databases on our wish list:
Genbank (nucleotide sequences)
Protein expression database
Protein-protein interactions database
Gene expression database
NCBI Taxonomy database
Gene Ontology
CMR
Warehouse Schema
Manages multiple datasets simultaneously
Dataset = Single version of a database
Support alternative measurements and
viewpoints
Version comparison
Multiple software tools or experiments that
require access to different versions
Each dataset is a warehouse entity
Every warehouse object is registered in a dataset
Warehouse Schema
Different databases storing the same biological
types are coerced into same warehouse tables
Design of most datatypes inspired by multiple
databases
Representational tricks to decrease schema
bloat
Single space of primary keys
Single set of satellite tables such as for synonyms, citations,
comments, etc.
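The two tricks can be illustrated with a small sketch in Python and sqlite3. The table and column names are hypothetical. One counter issues identifiers for every object type, so a single satellite table can hold synonyms for proteins and pathways alike without a type column.

```python
import itertools
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Protein (WID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Pathway (WID INTEGER PRIMARY KEY, Name TEXT);
-- One satellite table serves every object type.
CREATE TABLE SynonymTable (OtherWID INTEGER, Syn TEXT);
""")

next_wid = itertools.count(1)            # single key space shared by all tables
OBJECT_TABLES = ("Protein", "Pathway")

def new_object(table, name, synonyms=()):
    """Insert an object with a warehouse-wide unique ID and its synonyms."""
    if table not in OBJECT_TABLES:       # guard the interpolated table name
        raise ValueError(f"unknown object table: {table}")
    wid = next(next_wid)
    conn.execute(f"INSERT INTO {table} (WID, Name) VALUES (?, ?)", (wid, name))
    conn.executemany(
        "INSERT INTO SynonymTable VALUES (?, ?)",
        [(wid, syn) for syn in synonyms],
    )
    return wid

def synonyms_of(wid):
    """Look up synonyms by ID alone; the object's type is not needed."""
    rows = conn.execute(
        "SELECT Syn FROM SynonymTable WHERE OtherWID = ? ORDER BY Syn", (wid,)
    )
    return [syn for (syn,) in rows]

protein_wid = new_object("Protein", "ACC synthase", ["ACS1"])
pathway_wid = new_object("Pathway", "ethylene biosynthesis", ["ethene biosynthesis"])
```

Without a shared key space, each object type would need its own synonym, citation, and comment tables.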
Warehouse Schema
Examples
Protein data from Swiss-Prot, TrEMBL, KEGG, and EcoCyc
all loaded into same relational tables
Pathway data from MetaCyc and KEGG are loaded into the
same relational tables
Example: Swiss-Prot DB
ID   1A11_CUCMA     STANDARD;      PRT;   493 AA.
AC   P23599;
DT   01-NOV-1991 (Rel. 20, Created)
DT   01-NOV-1991 (Rel. 20, Last sequence update)
DT   15-DEC-1998 (Rel. 37, Last annotation update)
DE   1-AMINOCYCLOPROPANE-1-CARBOXYLATE SYNTHASE CMW33 (EC 4.4.1.14) (ACC
DE   SYNTHASE) (S-ADENOSYL-L-METHIONINE METHYLTHIOADENOSINE-LYASE).
GN   ACS1 OR ACCW.
How Swiss-Prot is Loaded into The Warehouse
Register Swiss-Prot in Datasets table
Create entry in Entry and Protein tables for each
Swiss-Prot protein
Satellite tables store
Protein synonyms, citations, comments, accession numbers,
organism, sequence features, subunits/complexes, DB links
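The loading steps above can be sketched as follows, using Python and sqlite3 in place of the Oracle and MySQL implementations. The column layouts, the Accession satellite table, and the helper names are assumptions made for illustration. For brevity each table uses its own key counter here, and the Entry table is omitted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Datasets (WID INTEGER PRIMARY KEY, Name TEXT, Version TEXT);
CREATE TABLE Protein (
    WID INTEGER PRIMARY KEY,
    Name TEXT,
    DataSetWID INTEGER REFERENCES Datasets(WID)
);
-- Hypothetical satellite table for accession numbers.
CREATE TABLE Accession (ProteinWID INTEGER, AccessionNumber TEXT);
""")

def register_dataset(name, version):
    """Step 1: register the source database version as a dataset."""
    cur = conn.execute(
        "INSERT INTO Datasets (Name, Version) VALUES (?, ?)", (name, version)
    )
    return cur.lastrowid

def load_protein(dataset_wid, entry):
    """Steps 2 and 3: create the protein row, then fill a satellite table."""
    cur = conn.execute(
        "INSERT INTO Protein (Name, DataSetWID) VALUES (?, ?)",
        (entry["name"], dataset_wid),
    )
    protein_wid = cur.lastrowid
    conn.executemany(
        "INSERT INTO Accession VALUES (?, ?)",
        [(protein_wid, acc) for acc in entry["accessions"]],
    )
    return protein_wid

# A parsed entry, represented here as a plain dictionary (hypothetical shape).
parsed_entry = {
    "name": "1-aminocyclopropane-1-carboxylate synthase CMW33",
    "accessions": ["P23599"],
}
swissprot_wid = register_dataset("Swiss-Prot", "40.0")
protein_wid = load_protein(swissprot_wid, parsed_entry)
loaded = conn.execute(
    "SELECT d.Name, d.Version, p.Name, a.AccessionNumber "
    "FROM Protein p "
    "JOIN Datasets d ON d.WID = p.DataSetWID "
    "JOIN Accession a ON a.ProteinWID = p.WID"
).fetchall()
```

Because every protein row carries its dataset ID, two versions of the same database can be loaded side by side and compared.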
Protein Table
CREATE TABLE Protein
(
  WID                  NUMBER,          --The warehouse ID of this protein
  Name                 VARCHAR2(500),   --Common name of the protein
  AASequence           VARCHAR2(4000),  --Amino-acid sequence for this prote
  Charge               NUMBER,          --Charge of the chemical
  Fragment             CHAR(1),         --Is this protein a fragment or not,
  MolecularWeightCalc  NUMBER,          --Molecular weight calculated from s
  MolecularWeightExp   NUMBER,          --Molecular Weight determined throug
  PICalc               VARCHAR2(50),    --pI calculated from its sequence.
  PIExp                VARCHAR2(50),    --pI value determined through experi
  DataSetWID           NUMBER           --Reference to the data set from whi
);
Database Loaders
Loader tool defined for each DB to be loaded into
Warehouse
Example loaders available in several languages
Loaders
KEGG (C)
BioCyc collection of 15 pathway DBs (C)
Swiss-Prot (Java)
ENZYME (Java)
Terminology
Model Organism Database (MOD) – DB describing genome and other information about an organism
Pathway/Genome Database (PGDB) – MOD that combines information about
Pathways, reactions, substrates
Enzymes, transporters
Genes, replicons
Transcription factors, promoters, operons, DNA binding sites
BioCyc – Collection of 15 PGDBs at BioCyc.org
EcoCyc, AgroCyc, YeastCyc
Loader Architecture
[Diagram boxes: Swiss-Prot Datafile; Grammar for Swiss-Prot; ANTLR Parser Generator; Parser for SwissProt; SQL Insert Commands; Oracle Loadable File]
Current Warehouse Contents
                       KEGG   ENZYME   SwissProt   BsubCyc   Warehouse Total
Chemicals             7,284    2,952           0       576            10,812
Genes                 5,714        0      88,605     4,221            98,540
Organisms                60        0     103,807         1           103,868
Proteins              3,829    3,870     101,602     4,150           113,451
Enzymatic Reactions   3,509        0           0       717             4,226
Pathways              4,517        0           0       138             4,655
Pathway Reactions    36,271        0           0       530            36,801
Example Warehouse Uses
Check completeness of data sources
Count reactions in ENZYME database with (and without)
associated protein sequences in SWISS-PROT database:
3870 reactions in ENZYME
1662 reactions (43%) with a sequence in SWISS-PROT
2208 reactions (57%) without a sequence in SWISS-PROT
Count #of distinct non-partial EC numbers in SWISS-PROT:
1554 distinct EC numbers in SWISS-PROT (non-partial)
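A completeness check of this kind reduces to set operations on EC numbers. The following Python sketch assumes the EC numbers have already been pulled out of each source as strings; the function names and the small sample lists are hypothetical. A partial EC number is taken to be one with a dash in any position, such as 2.7.1.-.

```python
def is_partial(ec_number):
    """An EC number is partial if any of its four fields is a dash."""
    return "-" in ec_number.split(".")

def completeness(enzyme_ecs, swissprot_ecs):
    """Count ENZYME reactions with and without a Swiss-Prot sequence."""
    enzyme = set(enzyme_ecs)
    # Distinct, fully specified EC numbers annotated on Swiss-Prot proteins.
    sequenced = {ec for ec in swissprot_ecs if not is_partial(ec)}
    return {
        "total": len(enzyme),
        "with_sequence": len(enzyme & sequenced),
        "without_sequence": len(enzyme - sequenced),
        "distinct_non_partial_in_swissprot": len(sequenced),
    }

def percent(part, whole):
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)

# Hypothetical sample inputs (illustration only).
enzyme_sample = ["1.1.1.1", "2.7.1.1", "4.4.1.14", "5.3.1.9"]
swissprot_sample = ["4.4.1.14", "2.7.1.1", "2.7.1.-", "4.4.1.14"]
report = completeness(enzyme_sample, swissprot_sample)

# The percentages on the slide follow from its counts.
slide_with = percent(1662, 3870)
slide_without = percent(2208, 3870)
```

In the warehouse the same counts would come from a join between the ENZYME and Swiss-Prot tables, since both are loaded into one DBMS.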