A Metric-Space Database Management Supporting Molecular Biology

Download Report

Transcript A Metric-Space Database Management Supporting Molecular Biology

The MoBIoS Project
Molecular Biological Information System
Daniel P. Miranker
Dept. of Computer Sciences &
Center for Computational Biology
and Bioinformatics
University of Texas
Weijia Xu, Rui Mao, Will
Briggs, Smriti
Ramakrishnan, Shu Wang,
Lulu Zhang
Problem:
In Life Sciencses, database management
systems (DBMS) serve as glorified file
managers.
 Little use of sophisticated data and pattern-based
retrieval
 Real scientific and technological problems
When biological data is put in to an
RDBMS
• Primary data is stored in text or blob fields
– Annotations may be relational
Organism
Function
Sequence
Yeast
membrane
AACCGGTTT
Yeast
mitosis
TATCGAAA
E. Coli
membrane
AGGCCTA
• Data retrieval
– Filter DB, sequential dump, O(n), to utilities
• E.g. BLAST,
Linear Data Scans, O(n),
Endemic in Life Sciences
 Sequences:


Mass Spectra




proteomics
Small Molecules & Protein Structure


DNA, RNA, Protein databases
Protein interaction
Rational drug design
Pathways (graphs)
Phylogenies (graphs, trees in particular)
Scope: To Find Common Ground Both
Biology and DBMS’ Have to Move
DBMS
Biological
Information
System
Metric-Space Database as the Common Ground
Metric Space is
a pair, M=(D,d),
where
 D is a set of points
 d is [metric] distance function with the following
properties:




d(x,y) = d (y,x)
d(x, y) > 0, d(x,x) = 0
d(x,z) <= d(x,y) + d(y,z)
(symmetry)
(non negativity)
(triangle inequality)
x
y
z
Definition - By Analogy
A Metric-Space Database
Management System

Extend Relational DBMS




Special indexes for metricspaces
New data types
Biological information
system

A Spatial Database
Management System:
Life science data types
Extend relational DBMS



Special indexes for 2D and
3D data; k-d and R-trees
New data types
Geographic information
systems


Topographic maps
Buildings and the like
Develop index structures to support
distance & nearest-neighbor queries
• Well studied in main-memory
– But by no means a closed problem
• In databases (external/disk based methods)
– Embryonic
– Many myths
• Often assumed to be the basis of multimedia
database systems
How to build a metric-space
index
• Three algorithmic classes [Tasan, Ozsoyoglu 04]
– Vantage points
– Hyperplanes
– Bounding spheres
Vantage Point Method [Burkhard&Keller73]
Vantage Point Method
Choose a point,VP
And a radius, R
Vantage Point Method
• Given VP, R
The predicates
Choose a point,VP
• d(VP,x) < R
• d(VP,x)  R
Divide the set into
two equal halves
• apply recursively
And a radius,R
Query, q, range r
r
q
Query, q, range r
if
• d(q,VP) > R + r
then
• all neighbors are
outside the sphere
VP
R
r
q
Multi-vantage point method
Multi-vantage point method
• Consider d(VPi, x) a
projection onto an axis
• Looks like a k-d tree
– Choose number k & d
Myths
• Solved problem; M-trees [Ciaccia et.al. 96, 97]
– I can’t get them to work on anything but their original
synthetic data generator
• Good choice for vantage points is to find
corners[Yianilos93] (farthest-first clustering)
– Might be true for euclidean spaces
– Early result, not true for our data
• High dimensional indexing always asymptotically
reduces to linear scans.
– Formal result based on an assumption of uniform data
distributions.
Comparison of Three Methods of Metric-Space Indexing
#dist. cal.: RBT VS. GHT VS. MVPT
18000
RBT
GHT
MVPT
16000
#dist cal.
14000
12000
10000
8000
6000
4000
2000
0
0
2
4
radius
6
8
10
8
10
#I/O, RBT VS. GHT. VS MVPT
800
RBT
GHT
MVPT
700
600
#IO
500
400
300
200
100
0
0
2
4
radius
6
Figure 9. Comparison of metric-space index structures: RBT, GHT, and VPT
Open problems
• Is there a general metric-space index structure that
is generally good for most work loads.
– We are optimistic mvp tree’s – further tuning will be a
useful answer
– Hyperplane methods are fair game – there is
circumstantial evidence that that is key component in
Google’s search engine.
• No work addresses clustering data pages on disk.
• Metric-space join algorithms
Biological Models are Usually
Based on Similarity
Similarity
• Biologist like scoring functions that reward each
similar feature with a positive number
• Intuitive
Distance:
• More Similar  smaller numbers
• Identical  0
But Do Metric Models Capture Biology?
• Metrics are a subset of possible mathematical models
.
Sequence Problem 1
Sequence similarity based on weighted edit
distance
Accepted weight matrices, PAM &
BLOSSUM, are not metric
 Log-odd matrices – negative values
 Defy simple algebraic
normalization[TaylorJones93,Linialetal97]
Our First Result: mPAM [Xu&Miranker04]
Dayhoffetal’s PAM Derivation[74]
• Took a set of closely related
protein sequences
• Developed a phylogenetic tree
• Counted substitutions to
transform one sequence to
another
• Tree determines a measure of
time
PAM vs. mPAM: t = 1/f
Using original substitution counts

PAM: frequency of substitution
S(a,b|t) = log P(b|a,t)/qb

mPAM: expected time between
substitutions
D(a,b) = 1/log(1 – (P(a,x)P(b,x))
x
Sequence Problem 2
• Sequences long units (identity for storage
and retrieval)
– Genes
– Chromosomes
• Analysis comprises comparing small
substrings
Soln: Sequence View
• New view type
• Breaks sequences into q-grams
create SEQUENCEVIEW rice_sview as
SELECT CREATE FRAGMENTS (…, 3, 1)
FROM …
WHERE …
USING HAMMING-DISTANCE
Materialize as an Index
D(AAA)
≤2
Genomes
Rowid
Seq
R1
CAACA
R2
ATCAAA
R3
…
{
Rowd
Offset
Logical Fragment
R1
1
A C A
R1
2
R1
3
R1
4
…
{
A A C
A C A
…
R2
1
R2
2
R2
3
R2
4
…
C A A
…
A T C
T C A
C A A
A A A
…
…
D(ACA)
≤1
D(CAA)
≤0
D(ATC)
≤1
Status
• Started with McKoi
– A Java open source object-relational DBMS
– (Think of Postgress written in Java)
• Added
 Biological data types
 Metric-space index
 Extending SQL engine (in progress)
Computed in MoBIoS
Compare Arabidopsis Genome X Rice Genome
1.
Rice
Locate nucleotide patterns of form
18 Matching
Nucleotides
Arab.
Rice Gap 400 – 3000 Long
Arab. Gap 400 – 3000 Long
primer pair candidate
2.
3.
Eliminate non-unique primer candidates
Merge overlapping primer candidates
•
Usual implementations O(n2), n = 109
18 Matching
Nucleotides
mSQL Query to locate candidate
primer pairs
SELECT merge(R1.fragment, A1.fragment)
FROM
G1_sview R1, G1_sview R2, G2_sview A1, G2_sview A2
WHERE
distance(‘HAMMINGDISTANCE',
R1.fragment,
A1.fragment)
<=
1.0
distance(‘HAMMINGDISTANCE', R2.fragment, A2.fragment) <= 1.0 AND
(FRAGOFFSET(R2.fragment)-FRAGOFFSET(R1.fragment)) >= 400 AND
(FRAGOFFSET(R2.fragment)-FRAGOFFSET(R1.fragment)) <= 3000 AND
(FRAGOFFSET(A2.fragment)-FRAGOFFSET(A1.fragment)) >= 400 AND
(FRAGOFFSET(A2.fragment)-FRAGOFFSET(A1.fragment)) <= 3000
GROUP BY R1.fragment, A1.fragment;
AND
Query
Plan
Arab. Genome,
Rice Genome, O(m)
O(n)
Offline: Build Sequence
View O(n log n)
Compare O(mlogn)
Indexed Nested Loop
Eliminate Duplicates
Eliminate Low Complexity
Primers (LZ compression)
Merge Overlapping Primers
~10,000 conserved
primer pairs candidates
Preliminary Results
• Found 13,418 possible primer pairs from MoBIoS
• 100 best candidates BLASTed for matches in GenBank
– 15 matched other plant genes and the primers
– At least 2 of 15 showed potential after PCR amplification against
Helianthus and Phalaenopsis.
MoBIoS Architecture
(Molecular Biological Information System)
Analysing Mass-Spectra
Spectrum = Histogram of Mass/Charge Ratios
of a collection peptides
Similarity = Shared peaks count = Inner Product
(0100101) • (0111100) = 2
Cosine Distance Approx. Inner Product
Drs= 1 – xrx’s/(x’rxr)1/2(x’sxs)1/2
shown store and retrieve mass-spectra
- using cosine distance, and it scales
mSQL Query for Protein
Identification by Mass-Spec.
Signature Database Look
SELECT
Prot.accesion_id, Prot.sequence
FROM protein_sequences Prot, digested_sequences DS,
mass_spectra MS
WHERE
MS.enzyme = DS.enzyme = E and
Cosine_Distance(S, MS.spectrum, range1) and
DS.accession_id = MS.accession_id = Prot.accesion_id
and
DS.ms_peak = P and MPAM250(PS, DS.sequence,
range2);
Matching Electrostatic Shape of Molecules
Still benefit from grid-services:
 Intermittently, but regularly compile (recluster) the indices O(nlog n), n > 106
 Rational drug design: O(log n) finite element solutions to traverse search tree.
 Make a service call to the grid for these operations only
 Mirror data contents to minimize I/O
 Since need is intermittant, one grid serves many MoBIoS servers
recluster
MoBIoS
Server
G
R I
D
New index
Shape match (FEM)
Distance(real)
High speed I/O
Mirror DB-Contents
Hyper-planes [Ulhmann91]
• If d(x,h1) < d(x,h2) then x assigned to h1
h1
x
h2
Develop a Hierarchical Clustering
C
A
F
E
B
D
Hierarchy of Bounding spheres, (center, radius),
• Bounding spheres may overlap
• Inspired by R-trees