Genomics Algebra
A New, Integrating Data Model, Language,
and Tool for Processing and Querying
Genomic Information
Joachim Hammer and Markus Schneider
University of Florida
CIDR 2003
Asilomar, CA
Jan. 5-8, 2003
Overview
- Data Management Problems in Bioinformatics
- Proposed Solution
  - Genomics Algebra and Unifying Database
- Summary and Expected Impact

Bioinformatics
- Growing field of problems in biological sciences that require application of computing and mathematics
  - The term "bioinformatics" was coined in the mid-1980s
- Genome Projects
  - Construct detailed genetic and physical maps of a variety of organisms
  - E.g., Human Genome Project
- Functional Genomics
  - What do genes do and how do they interact?
  - E.g., drug discovery, agro-food, pharmacogenomics (individualized medicine)

Why is Bioinformatics Important?
- Acquiring sequences is first step …
- Ultimate goal is to decipher structural, functional, evolutionary information encoded in language of biological sequences
  - Alphabet (amino acids), words (motifs), sentences (proteins)
  - Decoding an unknown language
- To date, unable to predict structure (i.e., words and sentences) from sequence
  - Mostly pattern-matching techniques: detect similarity between sequences and infer related structures and functions
- Number of experimentally determined protein structures is VERY small

An Information Revolution …
- Emergence of rapid DNA sequencing and high-throughput gene analysis techniques
- Flood of genomic data
  - Nucleic acid and protein sequences, motifs, folding units, modules, interaction information, etc.
  - Complex data, e.g., sequential lists, deeply nested record structures, image & video data
- Data stored in more than 500 repositories
  - E.g., EMBL (150 GB, 2001), GenBank, SWISS-PROT, SANGER Centre (20 TB, 2001), …
  - Sequence repositories increase 4x per year
- Known sequence data outweighs protein structural data ~100:1 (sequence/structure deficit)

… and the Resulting Problems for Biologists
- Scientists are overwhelmed by data which is awaiting further refinement and analysis
- Number and size of available data sources continuously growing
  - Overlap and conflicting information
  - Proliferation of interfaces and portals
  - Familiar sources sometimes disappear or get merged
- Little or no agreement on terminology
- Unmanageable query results
- Forced to understand low-level data management
  - Often required to learn and write SQL or code in some other programming language (Perl)
- Noisy data
  - E.g., estimated that 30-60% of sequences in GenBank are erroneous

Corresponding CS Problems
- Management of heterogeneous, autonomous sources
  - Missing standard for genomic data representation
  - Formatted files prevail over conventional database representations (few sources use DBMSs)
  - Lots of redundancies and inconsistencies
  - Many different interfaces (e.g., Web-based, specialized GUIs and retrieval packages)
- Query languages not suitable for intended users
  - Limited interaction functionality of repositories
  - Query results are often unmanageable

CS Problems Cont’d
- Low-level treatment of data
  - Users manipulate strings and integers instead of genes and sequences
  - No high-level operations either
- Lack of extensibility of software managing sources
  - Not possible to integrate new, specialty evaluation functions
- Extraction of new knowledge from existing sources without much computational support
- Integration of new knowledge into repositories is tedious
  - E.g., no personal scratch pad that can be integrated with existing data
- Dealing with uncertainty and erroneous data
  - E.g., frameshift problem

State-of-the-Art
- Current research is focused mainly on integrating existing repositories
  - Federated and query-driven approaches (e.g., SRS, BioNavigator, DiscoveryLink, K2/Kleisli, Tambis, …)
  - Work on standardizing terminology and representations (e.g., Gene Ontology, EcoCyc, …)
- Analysis is performed outside of the repositories
  - Sequence similarity search: e.g., Basic Local Alignment Search Tool (BLAST) and its derivatives, …
  - Visualization tools: e.g., BEAUTY, BioWidgets, …
- Complex middleware tiers between end-users and the data servers
  - Inefficient, lots of user involvement (human query processor)

Iterative Query and Analysis
[Flowchart boxes: Query Relevant Database(s); Store Query Output; Analyze Output; Done?]
While not done …
- Construct a database query
- Store query output
- Analyze query results

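The manual loop on this slide can be written down as a short sketch. It only illustrates the workflow the scientist currently performs by hand; the callables `construct_query`, `run_query`, and `analyze`, and the toy `REPOSITORY`, are hypothetical stand-ins and do not correspond to any real repository interface.

```python
from typing import Callable, List, Optional, Tuple

Finding = Tuple[str, int]  # (query name, some derived number); illustrative only


def iterative_analysis(
    construct_query: Callable[[List[Finding]], Optional[str]],
    run_query: Callable[[str], List[str]],
    analyze: Callable[[str, List[str]], List[Finding]],
) -> List[Finding]:
    """Repeat construct -> store -> analyze until no further query is needed."""
    findings: List[Finding] = []
    while True:
        query = construct_query(findings)        # decide what to ask next
        if query is None:                        # "Done?" answered with yes
            return findings
        stored_output = list(run_query(query))   # store query output (in memory)
        findings.extend(analyze(query, stored_output))


# Toy stand-in for an external repository (invented data).
REPOSITORY = {"geneA": ["ATTGCC", "ATA"], "geneB": ["GGC"]}


def next_query(findings: List[Finding]) -> Optional[str]:
    seen = {name for name, _ in findings}
    remaining = [name for name in REPOSITORY if name not in seen]
    return remaining[0] if remaining else None


def lookup(name: str) -> List[str]:
    return REPOSITORY[name]


def total_length(name: str, sequences: List[str]) -> List[Finding]:
    return [(name, sum(len(s) for s in sequences))]


results = iterative_analysis(next_query, lookup, total_length)
```

Every step in the loop body is a point where the user acts as the "human query processor": nothing is pushed down into the repository itself.
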
Fundamental Challenge
- Development of a more principled approach to genomic data management
  - Leverage capabilities provided by modern DBMS
  - Services tightly integrated
  - Shields scientists from knowing low-level data management details as much as possible

Integrating Approach to Genomics Data Management
- Extensible Genomics Algebra
  - Formal data model, query language, and software for representing, storing, retrieving, querying, and manipulating genomic information
  - Provides a set of high-level genomic data types (GDTs) together with genomic operations or functions
- Unifying Database
  - Persistent storage for high-level, structured GDT values of Genomics Algebra
  - Warehouse for data from existing genomic repositories

Mini Genomics Algebra
types
  codon, aminoAcid, gene, primaryTranscript, mRNA, protein
operators
  decode: codon → aminoAcid
    “given a codon, computes the corresponding amino acid”
  transcribe: gene → primaryTranscript
    “given a gene, returns its primary transcript”
  splice: primaryTranscript → mRNA
    “given a primary transcript, removes its introns to produce the mRNA”
  translate: mRNA → protein
    “given a messenger RNA, determines the corresponding protein”
  …

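A minimal sketch of how these types and signatures might look in code. Everything beyond the type and operator names is an assumption made for illustration: the field layout of each type, the intron representation as half-open intervals, the simplification that a gene is stored as coding-strand DNA, and the deliberately partial codon table (the standard code has 64 entries).

```python
from dataclasses import dataclass
from typing import NewType, Tuple

Codon = NewType("Codon", str)          # three RNA letters, e.g. "AUG"
AminoAcid = NewType("AminoAcid", str)  # one-letter code, "*" marks a stop
Interval = Tuple[int, int]             # half-open [start, end) positions


@dataclass(frozen=True)
class Gene:
    dna: str                           # assumed: coding strand, 5' to 3'
    introns: Tuple[Interval, ...] = ()


@dataclass(frozen=True)
class PrimaryTranscript:
    rna: str
    introns: Tuple[Interval, ...] = ()


@dataclass(frozen=True)
class MRNA:
    rna: str


@dataclass(frozen=True)
class Protein:
    residues: str


# Partial table for illustration; unknown codons raise KeyError.
CODON_TABLE = {"AUG": "M", "UUU": "F", "GGC": "G", "GCC": "A", "AAA": "K",
               "UAA": "*", "UAG": "*", "UGA": "*"}


def decode(codon: Codon) -> AminoAcid:
    return AminoAcid(CODON_TABLE[codon])


def transcribe(gene: Gene) -> PrimaryTranscript:
    # Simplification: copy the coding strand, replacing T with U.
    return PrimaryTranscript(gene.dna.replace("T", "U"), gene.introns)


def splice(transcript: PrimaryTranscript) -> MRNA:
    exons, position = [], 0
    for start, end in sorted(transcript.introns):
        exons.append(transcript.rna[position:start])
        position = end
    exons.append(transcript.rna[position:])
    return MRNA("".join(exons))


def translate(mrna: MRNA) -> Protein:
    start = mrna.rna.find("AUG")       # begin at the first start codon
    residues = []
    if start >= 0:
        for i in range(start, len(mrna.rna) - 2, 3):
            residue = decode(Codon(mrna.rna[i:i + 3]))
            if residue == "*":         # stop codon ends translation
                break
            residues.append(residue)
    return Protein("".join(residues))


# Invented toy gene: two exons around one intron at positions 6..11.
gene = Gene(dna="ATGTTTGTAAAGGGCTAA", introns=((6, 12),))
transcript = transcribe(gene)
mrna = splice(transcript)
protein = translate(mrna)
```

Giving each biological concept its own type means a static type checker can reject ill-formed applications such as `translate(gene)`, which is the point of working with genes and sequences rather than raw strings.
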
What Can We Do with a Genomics Algebra?
- Can use the algebra to formally express existing biological operations
  - E.g., given DNA fragment and sequence, returns true if fragment contains specified sequence
    contains(frag, “ATTGCCATA”)
- Create new operations using function composition
  - E.g., express central dogma of molecular biology as
    translate(splice(transcribe(g)))

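Both ideas on this slide can be sketched briefly. The `compose` helper and the string-based stand-ins for `transcribe`, `splice`, and `translate` are assumptions for illustration only (exons upper case, introns lower case, a tiny partial codon table); they are not the authors' implementation.

```python
from functools import reduce
from typing import Callable


def contains(fragment: str, sequence: str) -> bool:
    """True if the DNA fragment contains the given sequence."""
    return sequence.upper() in fragment.upper()


def compose(*ops: Callable) -> Callable:
    """compose(f, g, h)(x) == f(g(h(x))): right-to-left composition."""
    return reduce(lambda f, g: lambda x: f(g(x)), ops)


# Stand-in operators on plain strings (assumed convention:
# exons are upper case, introns are lower case).
def transcribe(gene: str) -> str:
    return gene.replace("T", "U").replace("t", "u")


def splice(primary_transcript: str) -> str:
    return "".join(ch for ch in primary_transcript if ch.isupper())


CODONS = {"AUG": "M", "UUU": "F", "GGC": "G", "UAA": "*"}  # partial table


def translate(mrna: str) -> str:
    residues = []
    for i in range(0, len(mrna) - 2, 3):
        residue = CODONS[mrna[i:i + 3]]
        if residue == "*":             # stop codon ends translation
            break
        residues.append(residue)
    return "".join(residues)


# A new operation built purely by composing existing ones.
express = compose(translate, splice, transcribe)

found = contains("GGATTGCCATACC", "ATTGCCATA")
protein = express("ATGTTTgtaaagGGCTAA")
```

Here `express` is defined without writing any new biological logic, which is what makes composition attractive as an extension mechanism.
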
Research Challenges
- What data types and operations do we need?
  - Need comprehensive ontology defining terminology, data objects, and operations
- Formalize definition of GDTs and operations
  - Vague or lacking knowledge of many biological processes makes this hard
- Implement algebra
  - Design of data structures and efficient algorithms for genomic operations
  - Must be extensible
  - Suitable for integration with a database system

Unifying Database
- Persistent storage manager for Genomics Algebra
- Integrated repository (warehouse) for genomics sources
  - GUS (U Penn) is only other known genomics warehouse prototype system
- Provides superior query processing performance in multi-source environments
- Ability to maintain and annotate extracted source data after it has been cleansed, reconciled and corrected
- Option to preserve historical data from those repositories that do not archive their contents

Integrated System Architecture
[Architecture diagram. Components shown: GUI; Genomics Algebra; DBMS-specific Adapter; ETL; Extensible DBMS (Oracle, DB2, …); Unifying Database (public space; user space, user space, …); External Repositories (e.g., GenBank, NCBI, …)]

Implementation
- Adapter provides DBMS-specific coupling mechanism between Genomics Algebra and DBMS
  - Use UDT mechanism (opaque types and user-defined operators linked as external functions)
  - Supported by all major DB vendors
- User interface component consisting of
  - Biological query language together with graphical output
  - XML application as standardized exchange format for sharing genomics data

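The idea of linking an algebra operator into the DBMS as an external function can be illustrated with a small sketch. It uses Python's built-in `sqlite3` module purely as a stand-in for an extensible DBMS; the slides name Oracle and DB2, whose UDT mechanisms differ, and SQLite has no opaque types, so only the user-defined-operator half is shown. The table `fragments` and its rows are invented.

```python
import sqlite3


def contains(fragment: str, sequence: str) -> int:
    """Algebra operator exposed to SQL; returns 1 or 0 as SQLite has no boolean."""
    return int(sequence in fragment)


conn = sqlite3.connect(":memory:")

# Register the Python function as a two-argument SQL function.
conn.create_function("contains", 2, contains)

conn.execute("CREATE TABLE fragments (id TEXT PRIMARY KEY, dna TEXT)")
conn.executemany(
    "INSERT INTO fragments VALUES (?, ?)",
    [("f1", "GGATTGCCATACC"), ("f2", "CCCGGGTTT")],  # invented toy data
)

# The operator is now usable directly inside a query.
rows = conn.execute(
    "SELECT id FROM fragments WHERE contains(dna, ?) ORDER BY id",
    ("ATTGCCATA",),
).fetchall()
conn.close()
```

With this kind of coupling the filtering happens inside the database engine, instead of exporting strings and analyzing them in an external tool.
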
Research Challenges
- Design of the integrated schema
  - Iterative process with input from domain experts
- Detecting changes in underlying sources
  - Push capabilities are slowly being offered
  - Tools for computing what has changed
- Database maintenance
  - View maintenance problem
  - Derived data (annotations) based on update must be recomputed
    - Knowing provenance of data could be used to determine which annotations need to be recomputed

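One way the provenance idea could work is sketched below, assuming each annotation records the source records it was derived from. The class `ProvenanceIndex`, its methods, and all record and annotation identifiers are hypothetical; the slides do not specify a mechanism.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


class ProvenanceIndex:
    """Maps each source record to the annotations derived from it."""

    def __init__(self) -> None:
        self._dependents: Dict[str, Set[str]] = defaultdict(set)

    def record(self, annotation_id: str, source_ids: Iterable[str]) -> None:
        """Remember which source records an annotation was computed from."""
        for source_id in source_ids:
            self._dependents[source_id].add(annotation_id)

    def stale_annotations(self, changed_source_ids: Iterable[str]) -> Set[str]:
        """Annotations that must be recomputed after the given sources changed."""
        stale: Set[str] = set()
        for source_id in changed_source_ids:
            stale |= self._dependents.get(source_id, set())
        return stale


# Invented identifiers for illustration.
index = ProvenanceIndex()
index.record("ann1", ["repoA:rec-1", "repoB:rec-7"])
index.record("ann2", ["repoA:rec-2"])
index.record("ann3", ["repoB:rec-7"])

stale = index.stale_annotations(["repoB:rec-7"])
```

Only the annotations reachable from a changed record are recomputed; everything else in the warehouse is left untouched.
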
Vision and Expected Impact
- Advocate a “back to the roots” strategy of database technology for bioinformatics
- Fundamental change in way biologists analyze data
  - Single interface specifically designed for biologists
  - No need to become “computer scientists”
- New knowledge about design and implementation of biological type system and its operations
  - Demonstrate extensibility of modern DBMS
  - Help development of algebras for other applications