Genomics Algebra
A New, Integrating Data Model, Language,
and Tool for Processing and Querying
Genomic Information
Joachim Hammer and Markus Schneider
University of Florida
CIDR 2003
Asilomar, CA
Jan. 5-8, 2003
Overview
Data Management Problems in Bioinformatics
Proposed Solution
Genomics Algebra and Unifying Database
Summary and Expected Impact
Bioinformatics
Growing field of problems in the biological sciences that require the application of computing and mathematics
The term "bioinformatics" was coined in the mid-1980s
Genome Projects
Construct detailed genetic and physical maps of a variety of organisms
E.g., the Human Genome Project
Functional Genomics
What do genes do and how do they interact?
E.g., drug discovery, agro-food, pharmacogenomics (individualized medicine)
Why is Bioinformatics Important?
Acquiring sequences is first step …
Ultimate goal is to decipher structural, functional, evolutionary information encoded in language of biological sequences
Decoding an unknown language
Alphabet (amino acids), words (motifs), sentences (proteins)
To date, unable to predict structure (i.e., words and sentences) from sequence
Mostly pattern-matching techniques: detect similarity between sequences and infer related structures and functions
Number of experimentally determined protein structures is VERY small
An Information Revolution …
Emergence of rapid DNA sequencing and high-throughput gene analysis techniques
Flood of genomic data
Nucleic acid and protein sequences, motifs, folding units, modules, interaction information, etc.
Complex data, e.g., sequential lists, deeply nested record structures, image & video data
Data stored in more than 500 repositories
E.g., EMBL (150 GB, 2001), GenBank, SWISS-PROT, SANGER Centre (20 TB, 2001), …
Sequence repositories increase 4x per year
Known sequence data outweighs protein structural data ~100:1 (sequence/structure deficit)
… and the Resulting Problems
for Biologists
Scientists are overwhelmed by data which is awaiting further refinement and analysis
Number and size of available data sources continuously growing
Little or no agreement on terminology
Unmanageable query results
Forced to understand low-level data management
Often required to learn and write SQL or code in some other programming language (Perl)
Proliferation of interfaces and portals
Familiar sources sometimes disappear or get merged
Noisy data
Overlap and conflicting information
E.g., estimated that 30-60% of sequences in GenBank are erroneous
Corresponding CS Problems
Management of heterogeneous, autonomous sources
Missing standard for genomic data representation
Formatted files prevail over conventional database representations (few sources use DBMSs)
Lots of redundancies and inconsistencies
Query languages not suitable for intended users
Many different interfaces (e.g., Web-based, specialized GUIs and retrieval packages)
Limited interaction functionality of repositories
Query results are often unmanageable
CS Problems Cont’d
Low-level treatment of data
Users manipulate strings and integers instead of genes and sequences
No high-level operations either
Lack of extensibility of software managing sources
E.g., no personal scratch pad that can be integrated with existing data
Not possible to integrate new, specialty evaluation functions
Dealing with uncertainty and erroneous data
E.g., frameshift problem
Extraction of new knowledge from existing sources without much computational support
Integration of new knowledge into repositories is tedious
State-of-the-Art
Current research is focused mainly on integrating existing repositories
Federated and query-driven approaches (e.g., SRS, BioNavigator, DiscoveryLink, K2/Kleisli, Tambis, …)
Work on standardizing terminology and representations (e.g., Gene Ontology, EcoCyc, …)
Analysis is performed outside of the repositories
Sequence similarity search: e.g., Basic Local Alignment Search Tool (BLAST) and its derivatives, …
Visualization tools: e.g., BEAUTY, BioWidgets, …
Complex middleware tiers between end-users and the data servers
Inefficient, lots of user involvement (human query processor)
Iterative Query and Analysis
Construct a database query
While not done …
Query relevant database(s)
Store query output
Analyze query results
Done?
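The query-and-analysis loop on this slide can be sketched in Python. This is only an illustration of the control flow: `iterative_analysis`, `run_query`, `analyze`, and `max_rounds` are hypothetical names that do not come from the talk, and in the workflow being criticized a human carries out each of these steps by hand.

```python
from typing import Callable, List, Optional


def iterative_analysis(
    initial_query: str,
    run_query: Callable[[str], List[str]],
    analyze: Callable[[List[str]], Optional[str]],
    max_rounds: int = 10,
) -> List[List[str]]:
    """Repeat query -> store -> analyze until the analysis asks for no more.

    `analyze` returns the next query to run, or None when the user is done.
    `max_rounds` is a safety bound added for this sketch.
    """
    stored: List[List[str]] = []
    query = initial_query
    for _ in range(max_rounds):
        results = run_query(query)    # query relevant database(s)
        stored.append(results)        # store query output
        next_query = analyze(results) # analyze query results
        if next_query is None:        # done?
            break
        query = next_query
    return stored
```

Every pass through the loop requires the user to reformulate a query from the previous output, which is the "human query processor" role described on the preceding slide.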
Fundamental Challenge
Development of a more principled approach to genomic data management
Leverage capabilities provided by modern DBMS
Services tightly integrated
Shield scientists from low-level data management details as much as possible
Integrating Approach to
Genomics Data Management
Extensible Genomics Algebra
Formal data model, query language, and software for representing, storing, retrieving, querying, and manipulating genomic information
Provides a set of high-level genomic data types (GDTs) together with genomic operations or functions
Unifying Database
Persistent storage for high-level, structured GDT values of Genomics Algebra
Warehouse for data from existing genomic repositories
Mini Genomics Algebra
types
codon, aminoAcid, gene, primaryTranscript, mRNA, protein
operators
decode: codon → aminoAcid
“given a codon, computes the corresponding amino acid”
transcribe: gene → primaryTranscript
“given a gene, returns its primary transcript”
splice: primaryTranscript → mRNA
“given a primary transcript, removes its introns to produce the mRNA”
translate: mRNA → protein
“given a messenger RNA, determines the corresponding protein”
.
.
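A minimal Python sketch of how the four operators above might look as typed functions. Everything beyond the operator and type names is an assumption made for illustration: the field layouts, the exon intervals, the four-entry codon table (the standard genetic code has 64 codons), and the shortcut of transcribing by replacing T with U on the coding strand. It is not the algebra's actual definition.

```python
from dataclasses import dataclass
from typing import Tuple

# Deliberately partial codon table, for illustration only.
CODON_TABLE = {"AUG": "Met", "UUU": "Phe", "GGC": "Gly", "UAA": "Stop"}


@dataclass(frozen=True)
class Gene:
    dna: str                            # coding strand, 5' to 3'
    exons: Tuple[Tuple[int, int], ...]  # half-open (start, end) intervals


@dataclass(frozen=True)
class PrimaryTranscript:
    rna: str
    exons: Tuple[Tuple[int, int], ...]


@dataclass(frozen=True)
class MRNA:
    rna: str


@dataclass(frozen=True)
class Protein:
    residues: Tuple[str, ...]


def decode(codon: str) -> str:
    """decode: codon -> aminoAcid"""
    return CODON_TABLE[codon]


def transcribe(gene: Gene) -> PrimaryTranscript:
    """transcribe: gene -> primaryTranscript (simplified: T becomes U)."""
    return PrimaryTranscript(gene.dna.replace("T", "U"), gene.exons)


def splice(pt: PrimaryTranscript) -> MRNA:
    """splice: primaryTranscript -> mRNA (keep exons, drop introns)."""
    return MRNA("".join(pt.rna[start:end] for start, end in pt.exons))


def translate(mrna: MRNA) -> Protein:
    """translate: mRNA -> protein (read codons until a stop codon)."""
    residues = []
    for i in range(0, len(mrna.rna) - 2, 3):
        amino_acid = decode(mrna.rna[i:i + 3])
        if amino_acid == "Stop":
            break
        residues.append(amino_acid)
    return Protein(tuple(residues))
```

Because each operator's result type is the next operator's argument type, a type checker can reject ill-formed pipelines such as `translate(transcribe(g))`.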
What Can We Do with a
Genomics Algebra?
Can use the algebra to formally express existing biological operations
E.g., given a DNA fragment and a sequence, return true if the fragment contains the specified sequence
contains(frag, "ATTGCCATA")
Create new operations using function composition
E.g., express the central dogma of molecular biology as
translate(splice(transcribe(g)))
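The two examples above can be sketched in Python. The `compose` helper is a hypothetical convenience that does not come from the talk, `contains` is reduced to a plain substring test, and the three operators are symbolic stand-ins that only record the order in which they are applied.

```python
from functools import reduce
from typing import Callable


def compose(*fns: Callable) -> Callable:
    """compose(f, g, h)(x) == f(g(h(x)))"""
    return reduce(lambda f, g: lambda x: f(g(x)), fns)


def contains(frag: str, seq: str) -> bool:
    """True if the DNA fragment contains the given sequence (substring test)."""
    return seq in frag


# Symbolic stand-ins: each wraps its input in the name of its result type.
def transcribe(g: str) -> str:
    return f"primaryTranscript({g})"


def splice(pt: str) -> str:
    return f"mRNA({pt})"


def translate(m: str) -> str:
    return f"protein({m})"


# A new operation built purely by composing existing ones.
express = compose(translate, splice, transcribe)
```

Here `express(g)` is the same as `translate(splice(transcribe(g)))`, so a derived operation needs no new implementation beyond the composition itself.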
Research Challenges
What data types and operations do we need?
Need comprehensive ontology defining terminology, data objects, and operations
Formalize definition of GDTs and operations
Vague or lacking knowledge of many biological processes makes this hard
Implement algebra
Design of data structures and efficient algorithms for genomic operations
Must be extensible
Suitable for integration with a database system
Unifying Database
Persistent storage manager for Genomics Algebra
Integrated repository (warehouse) for genomics sources
GUS (U Penn) is only other known genomics warehouse prototype system
Provides superior query processing performance in multi-source environments
Ability to maintain and annotate extracted source data after it has been cleansed, reconciled and corrected
Option to preserve historical data from those repositories that do not archive their contents
Integrated System Architecture
[Architecture diagram. Components: GUI; Genomics Algebra; DBMS-specific Adapter; Extensible DBMS (Oracle, DB2, …); Unifying Database with a public space and multiple user spaces; ETL; External Repositories (e.g., GenBank, NCBI, …)]
Implementation
Adapter provides DBMS-specific coupling mechanism between Genomics Algebra and DBMS
Use UDT mechanism (opaque types and user-defined operators linked as external functions)
Supported by all major DB vendors
User interface component consisting of
Biological query language together with graphical output
XML application as standardized exchange format for sharing genomics data
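The coupling idea, a user-defined operator implemented outside the DBMS and called from SQL, can be illustrated with Python's built-in `sqlite3` module. SQLite is swapped in here only because it is self-contained; the talk targets extensible DBMSs such as Oracle and DB2, and SQLite has no opaque types, so the sequence is stored as plain text. The table and column names are hypothetical.

```python
import sqlite3


def contains(frag: str, seq: str) -> int:
    """External function backing the SQL operator; returns 1 or 0."""
    return int(seq in frag)


conn = sqlite3.connect(":memory:")

# Register the Python function as a two-argument SQL function.
conn.create_function("contains", 2, contains)

conn.execute("CREATE TABLE fragments (id INTEGER PRIMARY KEY, dna TEXT)")
conn.executemany(
    "INSERT INTO fragments (dna) VALUES (?)",
    [("GGATTGCCATACC",), ("CCCCGGGG",)],
)

# The user-defined operator is now usable inside an ordinary query.
rows = conn.execute(
    "SELECT id FROM fragments WHERE contains(dna, ?)",
    ("ATTGCCATA",),
).fetchall()
```

In a full UDT-based design the `dna` column would hold an opaque genomic data type rather than text, and the adapter would hide vendor-specific registration details like the `create_function` call above.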
Research Challenges
Design of the integrated schema
Iterative process with input from domain experts
Detecting changes in underlying sources
Push capabilities are slowly being offered
Tools for computing what has changed
Database maintenance
View maintenance problem
Derived data (annotations) based on update must be recomputed
Knowing provenance of data could be used to determine which annotations need to be recomputed
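One way to picture the provenance idea is a reverse index from source records to the annotations derived from them. The sketch below is a hypothetical illustration: `ProvenanceIndex`, its methods, and the record identifiers are invented here and are not part of the proposed system.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


class ProvenanceIndex:
    """Maps each source record to the annotations derived from it."""

    def __init__(self) -> None:
        self._by_source: Dict[str, Set[str]] = defaultdict(set)

    def record(self, annotation_id: str, source_records: Iterable[str]) -> None:
        """Remember which source records an annotation was derived from."""
        for rec in source_records:
            self._by_source[rec].add(annotation_id)

    def stale_annotations(self, changed_records: Iterable[str]) -> Set[str]:
        """Return the annotations that depend on any changed source record."""
        stale: Set[str] = set()
        for rec in changed_records:
            stale |= self._by_source.get(rec, set())
        return stale
```

When change detection reports updated source records, only the annotations returned by `stale_annotations` need to be recomputed, rather than every derived value in the warehouse.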
Vision and Expected Impact
Advocate a “back to the roots” strategy of database technology for bioinformatics
Fundamental change in way biologists analyze data
Single interface specifically designed for biologists
No need to become “computer scientists”
New knowledge about design and implementation of biological type system and its operations
Demonstrate extensibility of modern DBMS
Help development of algebras for other applications