
1. Develop Database Requirements to Yield Schema and Interfaces (near term)
2. MoBIoS: Database Management for Data in Metric Spaces
Daniel P. Miranker
Univ. of Texas
What we know for sure:
Exploit Commodity Architecture
[Diagram labels: External Data/DB Sources; Curating New Content; Users; Web App Server; Computing Grid; DB]
Repository Schema and Interface Definitions
Issue:
• Database organization and data interchange should be addressed simultaneously
• Once established, difficult to change
→ Best to get this right the first time.
What we know for sure:
1. Data transfer: XML & Nexus files
2. Curate: (manage quality)
[Diagram labels: Curating New Content; Users; Web App Server; Computing Grid; DB; Schema]
Both 1 & 2 impact the schema (data provenance)
XML and Bioinformatics
• Taxonomic Markup Language (TML)
• PhyloML
• BEAST: Bayesian Evolutionary Analysis Sampling Trees
• AGAVE: Architecture for Genomic Annotation, Visualization and Exchange
Answers Start with a
Requirements Analysis
• Who
• What
• Why
• How
“Use cases”: specific examples of what is to be accomplished
A Head Start
Requirements of Phylogenetic Databases
(with Nakhleh, Barbancon, Piel & Donoghue) [BIBE ’03]
• Did a requirements analysis
• Proof of concept for a correctly normalized database schema:
1 evolutionary (tree)-edge = 1 row in the database
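To make the "one edge = one row" idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and the toy tree are illustrative assumptions, not the schema from the BIBE ’03 work.

```python
import sqlite3

# Hypothetical schema: all table and column names are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (
        node_id INTEGER PRIMARY KEY,
        tree_id INTEGER NOT NULL,
        label   TEXT                      -- taxon name; NULL for internal nodes
    );
    CREATE TABLE edge (
        tree_id   INTEGER NOT NULL,
        parent_id INTEGER NOT NULL REFERENCES node(node_id),
        child_id  INTEGER NOT NULL REFERENCES node(node_id),
        length    REAL,
        PRIMARY KEY (tree_id, child_id)   -- each node has at most one parent
    );
""")

# Toy tree ((A,B),C): root = 1, internal node = 2, leaves A = 3, B = 4, C = 5.
nodes = [(1, 1, None), (2, 1, None), (3, 1, "A"), (4, 1, "B"), (5, 1, "C")]
edges = [(1, 1, 2, 1.0), (1, 2, 3, 0.5), (1, 2, 4, 0.5), (1, 1, 5, 1.5)]
conn.executemany("INSERT INTO node VALUES (?, ?, ?)", nodes)
conn.executemany("INSERT INTO edge VALUES (?, ?, ?, ?)", edges)


def ancestors(conn, tree_id, node_id):
    """Return the ancestors of node_id, nearest first, using a recursive query."""
    rows = conn.execute(
        """
        WITH RECURSIVE up(id, depth) AS (
            SELECT parent_id, 1 FROM edge WHERE tree_id = ? AND child_id = ?
            UNION ALL
            SELECT e.parent_id, up.depth + 1
            FROM edge e JOIN up ON e.child_id = up.id
            WHERE e.tree_id = ?
        )
        SELECT id FROM up ORDER BY depth
        """,
        (tree_id, node_id, tree_id),
    ).fetchall()
    return [row[0] for row in rows]
```

Because every edge is its own row, tree traversals such as "walk from a leaf to the root" become ordinary (recursive) SQL queries rather than parsing of a stored tree string.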
Who is interested in using
Phylogenies?
• Casual Users
• Visualization
• Study Development
• Super-tree algorithms
• Simulation Studies
• Parameter Derivation
• Comparative Genomics
Super-Tree Algorithms Use-Cases
Construct phylogenies by assembling existing studies
Collect those studies by:
• Determining the minimum spanning clade for a set of taxa
• Finding all phylogenies sufficiently similar to a given phylogeny
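The first use case can be sketched in a few lines, assuming a rooted tree supplied as (parent, child) pairs, one per edge. The function name and the toy tree are hypothetical, and only the minimum-spanning-clade query is shown; tree similarity search is not covered here.

```python
def minimum_spanning_clade(edges, taxa):
    """Return the root node of the smallest clade containing every taxon in `taxa`.

    `edges` is an iterable of (parent, child) pairs, one per tree edge.
    """
    parent = {child: par for par, child in edges}

    def path_to_root(node):
        # The node itself followed by its ancestors, nearest first.
        path = [node]
        while path[-1] in parent:
            path.append(parent[path[-1]])
        return path

    taxa = list(taxa)
    if not taxa:
        raise ValueError("need at least one taxon")

    # Nodes that lie on every taxon's path to the root.
    common = set(path_to_root(taxa[0]))
    for taxon in taxa[1:]:
        common &= set(path_to_root(taxon))

    # Walking up from any taxon, the first shared node is the deepest one.
    for node in path_to_root(taxa[0]):
        if node in common:
            return node
    raise ValueError("taxa do not share a common ancestor")


# Toy tree ((A,B),(C,D)) with hypothetical internal node names.
edges = [("root", "n1"), ("n1", "A"), ("n1", "B"),
         ("root", "n2"), ("n2", "C"), ("n2", "D")]
clade_ab = minimum_spanning_clade(edges, ["A", "B"])
clade_ac = minimum_spanning_clade(edges, ["A", "C"])
```

The input shape matches an edge-per-row table, so the same pairs could be read directly from a database query.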
The MoBIoS Project
Molecular Biological Information System
Daniel P. Miranker
University of Texas
MoBIoS – A Simple Idea
Organize the Storage Manager
Around Metric Space Indexing
Relational Databases | B+ trees | 1 dimensional
Spatial Databases | R & K-D trees | 2 & 3 dimensions
Metric Databases | VP, M & GNAT trees | No dimensions, or very high dimensions
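As an illustration of the metric-database row, the following is a minimal vantage-point (VP) tree sketch. It is not MoBIoS code: the class and function names, the random choice of vantage point, and the use of Hamming distance on toy DNA strings are all assumptions made for the example. The range search prunes subtrees using only the triangle inequality, which is what lets such an index work without coordinates.

```python
import random


def hamming(a, b):
    """Hamming distance between equal-length strings (a true metric)."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


class VPNode:
    def __init__(self, point, radius, inside, outside):
        self.point = point      # vantage point stored at this node
        self.radius = radius    # median distance from the vantage point
        self.inside = inside    # subtree of points with d(point, x) <= radius
        self.outside = outside  # subtree of points with d(point, x) > radius


def build_vp_tree(points, d, rng=None):
    """Recursively build a VP tree over `points` under distance function `d`."""
    rng = rng or random.Random(0)
    points = list(points)
    if not points:
        return None
    vp = points.pop(rng.randrange(len(points)))
    if not points:
        return VPNode(vp, 0, None, None)
    dists = sorted(d(vp, p) for p in points)
    radius = dists[len(dists) // 2]
    inside = [p for p in points if d(vp, p) <= radius]
    outside = [p for p in points if d(vp, p) > radius]
    return VPNode(vp, radius,
                  build_vp_tree(inside, d, rng),
                  build_vp_tree(outside, d, rng))


def range_search(node, query, r, d, out=None):
    """Collect every indexed point x with d(query, x) <= r."""
    if out is None:
        out = []
    if node is None:
        return out
    dist = d(query, node.point)
    if dist <= r:
        out.append(node.point)
    # Triangle inequality: a match x satisfies dist - r <= d(vp, x) <= dist + r.
    if dist - r <= node.radius:   # a match may lie inside the ball
        range_search(node.inside, query, r, d, out)
    if dist + r > node.radius:    # a match may lie outside the ball
        range_search(node.outside, query, r, d, out)
    return out


seqs = ["AACC", "AACG", "ATCG", "TTCG", "TTTT", "GGCC", "AACT"]
tree = build_vp_tree(seqs, hamming)
hits = sorted(range_search(tree, "AACC", 1, hamming))
```

A production index would choose vantage points more carefully and store nodes on disk pages; the sketch only shows why a distance function alone is enough to skip parts of the data.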
Biological queries conducted
with sequential scans.
• Sequence (BLAST)
• Phylogenies (Tree of Life)
• Mass Spectra (Proteomics)
• Ligand Docking (Rational Drug Design)
Metric Space is
• a pair, M = (D, d), where
• D is a set of points
• d is a [metric] distance function with the following properties:
– d(x, y) = d(y, x) (symmetry)
– d(x, y) > 0 for x ≠ y, d(x, x) = 0 (non-negativity)
– d(x, y) <= d(x, z) + d(z, y) (triangle inequality)
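The three properties can be checked by brute force on a small sample. The sketch below is only an illustration: the function names are hypothetical, Hamming distance stands in for a biological metric, and passing the check on a sample shows consistency with the axioms, not a proof.

```python
from itertools import product


def hamming(a, b):
    """Hamming distance between equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def is_metric_on(points, d):
    """Check symmetry, non-negativity, and the triangle inequality on `points`."""
    for x, y in product(points, repeat=2):
        if d(x, y) != d(y, x):              # symmetry
            return False
        if d(x, y) < 0:                     # non-negativity
            return False
        if (x == y) != (d(x, y) == 0):      # zero exactly for identical points
            return False
    for x, y, z in product(points, repeat=3):
        if d(x, y) > d(x, z) + d(z, y):     # triangle inequality
            return False
    return True


def squared_difference(a, b):
    """Not a metric: violates the triangle inequality."""
    return (a - b) ** 2


seqs = ["AACC", "AACG", "TTCG", "TTTT"]
hamming_ok = is_metric_on(seqs, hamming)
squared_ok = is_metric_on([0, 1, 2], squared_difference)
```

Squared difference fails because d(0, 2) = 4 exceeds d(0, 1) + d(1, 2) = 2, which is the kind of violation that would make triangle-inequality pruning in an index unsafe.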
Can Biology Be Modeled by
Metrics?
• Metrics already exist for:
– Phylogenetic trees
– Ligand docking
• First Biologically Effective Metric Model of Amino Acid Substitution [Xu & Miranker 03]
→ In effect, precisely the phylogenetic relationships among sequences are exploited to form a database index.
• Metrics for proteomic mass-spectra underway
MoBIoS Architecture
(Molecular Biological Information System)
[Diagram label: phylogenies]
First Application (with Randy Linder)
Compared:
{entire Arabidopsis genome} x {“entire” rice genome}
To determine conserved pairs of primer pairs,
in O(m log n). Will repeat the study soon, faster.
When biological data is put into an RDBMS
• Primary data is stored in text or blob fields
– Annotations may be relational
Organism | Function | Sequence (BLOB)
Yeast | membrane | AACCGGTTT
Yeast | mitosis | TATCGAAA
E. coli | membrane | AGGCCTA
• Data retrieval
– Filter DB, sequential dump, O(n), to utilities
• E.g. BLAST, TreeBASE, Sequest
Homework:
Due tomorrow morning
1. Who are you (generically)?
2. Use case involving the database
Don’t know: A General Web Service
ToL Infrastructure @ SDSC
[Diagram labels: Curating New Content; Computing Grid; Web App Server; DB; Schema]