Return - Department of Computer Science

Download Report

Transcript Return - Department of Computer Science

MoBIoS
A Metric-space DBMS to Support
Biological Discovery
Presenter: Enohi I. Ibekwe
Overview
•
•
•
•
•
•
•
•
MoBIoS Project
Motivation
The challenge
Established similarity measures
Metric-space distance measure
Disk-based metric tree index
MoBIoS as a DBMS
Application of MoBIoS
MoBIoS Project
• Molecular Biological Information System
• Project at UT-Austin center for computational biology
and bioinformatics.
• DBMS based on metric-space indexing techniques,
object-relational model of genomic and proteomic data
types and a database query language that embodies the
semantics of genomic and proteomic data.
Motivation
Develop a DBMS to power Biological
Information System
The Challenge
• Established biological model of similarity
measure do not form a metrics.
• Scalable disk-based metric-indexes suffer
from the Curse of dimensionality
Established Similarity Measure
(I)
•
Sequence Homology
–
–
–
–
Query Sequence
Database of sequences
Substitution Matrix (PAM / BLOSUM)
Similarity Measure
– Global Sequence Alignment (Edit distance)
– Local Sequence Alignment (Most important)
Established Similarity Measure
(II)
•
Local Sequence Alignment
–
A local sequence alignment query asks, given a query
sequence S, a database of sequences T and a
similarity matrix corresponding to an evolutionary
model, return all subsequences of T that are
sufficiently similar to a subsequence of S
–
Main issue: Result is a set of answer.
• A metric distance function must return a single
value for each pair of argument
Established Similarity Measure
(III)
• Global Sequence Alignment
– Given an alphabet A , a similarity substitution matrix M
corresponding to an evolutionary model, the global
sequence alignment for two sequences s and t is to find
a strings a and b which are obtained from s and t
respectively by inserting spaces either into or at the
ends of s and t and whose score computed using M is at
a maximum (Similarity measure) over all pairs of such
strings obtained from s and t. (example)
– Issue: Result maybe negative since substitution matrix
is based on log-odd probability. Similarity measure
favors greater positive number.
•
Metric-space Distance measure
(I)
Homology Search
•
Query Sequence: Sub strings of length q (q-grams)
•
Database of sequences: Metric indexed records of
fixed length q (indexed q-grams) strings.
•
Substitution Matrix (mPAM)
•
Similarity Measure (distance measure)
– Local Alignments is computed from global
alignment.
Metric-space Distance measure
(II)
•
mPAM substitution Matrix
–
Accepted Point Mutation Model.
–
PAM calculates scores based on frequency in which
individual pairs of amino acids substituted for each
other.
–
mPAM instead of calculating frequency of
substitutions (PAM), computes expected time
between substitution.
–
mPAM has been validated.(Validation)
Metric-space Distance measure
(III)
•
Computing Local Alignment from Global
Alignment (Algorithm)
–
Offline
1. Divide database of sequence into sub strings (qgrams)
2. Build metric-space index structure on q-grams
– Online
1. Divide query sequence into sub strings (q-grams)
2. Using global alignment as a distance function to
match query q-grams.
Disk-based metric-tree index
• Phases
• Initialization
• Searching
• Query performance metric
• Number of disk I/O ( nodes visited)
• Number of distance computation
• Options Exploited
• M-Tree
• Generalized Hyper plane tree
• MVP-Tree (optimal)
Disk-based metric-tree index
(initialization)
• M-Tree initialization
– Best case : O(nlogn);
– worst case: O(n3)
• Generalized Hyper
plane (GH-Tree)
initialization
– Best case : O(nlogn);
– worst case: O(n2)
•
•
•
GH-tree: Bi-direction
M-Tree: Bottom-up
In practice, both M-Tree and GHTree scale linearly
Disk-based metric-tree index
(Searching)
MoBIoS as a DBMS (I)
• Mckoi (Java RDBMS).
– Plus metric-space indexing
– Plus Biological data types
– Plus biological semantics
• Life science data store
– Biological sequence data
– Mass-spectrometry protein
signature
MoBIoS as a DBMS (III)
• Language Extension
– M-SQL
• Data type Extension
– Data type for Sequences (DNA,RNA,peptide)
– Data type for Mass spectrum
• Semantics Extension
– Subsequence Operators
– Local alignment
MoBIoS as a DBMS (IV)
• Semantics Extension
– Similarity (metric distance) between data types
• mPAM250
• Cosine distance
• Lk norms
• Keys Extension
– Primary key (metrickey)
– Index (metric)
Application of MoBIoS (I)
•
MS/MS Protein Identification
1. Breakdown protein into fragments called
peptide using a protease enzyme
2. Identify protein by using a mass-spectrometer
to measure the mass-charge ratio of the
fragments and comparing the experiment result
to a database of precomputed spectra.
Application of MoBIoS(II)
M-SQL Solution
Create table
protein_sequences
(accesion_id int,
sequence peptide,
primary
metrickey(sequence,
mPAM250);
Create table
digested_sequences
(accession_id int,
fragment peptide,
enzyme varchar,
ms_peak int, primary
key(enzyme,
Create index
fragment_sequence on
digested_sequences
(fragment)
metric(mPAM250);
Create table
mass_spectra
(accession_id int,
enzyme varchar,
spectrum spectrum,
primary
metrickey(spectrum,
cosine_distance);
Application of MoBIoS(III)
• M-SQL Solution
SELECT Prot.accesion_id,
Prot.sequence
FROM protein_sequences Prot,
digested_sequences DS,mass_spectra
MS
WHERE
MS.enzyme = DS.enzyme = E and
Cosine_Distance(S, MS.spectrum,
range1) and
DS.accession_id = MS.accession_id
= Prot.accesion_id and
BLAST vs MoBIoS
MoBIoS
1.
2.
3.
Molecular Biological
Information System
DBMS specialized for storage,
retrieval and mining of biological
data
Sequence Database and query
sequence is divided into q-grams
and Database is indexed offline.
BLAST
1.
2.
3.
Basic Local Alignment Search
Tool
Utility specialized for retrieval
and mining of biological data
outside a database
Only query sequence is divide
and hot-point index is done at
query time
MoBIoS Demo
• MoBIoS:
http://ccvweb.csres.utexas.edu:9080/msfoun
d/ccForm.jsp
• PDB : http://www.rcsb.org/pdb/
Conclusion
• Biological data is not random and very
likely exhibit the intrinsic structure
necessary for metric-space indexing to
succeed.
References
• http://www.cs.utexas.edu/users/mobios/Publicat
ions/miranker-mobios-final-03.pdf
• http://www.cs.utexas.edu/users/mobios/Publicat
ions/mao-bibe-03.pdf
• http://www.cs.utexas.edu/users/mobios/
• http://www.mckoi.com/database/
Appendix
Return
Appendix I- Metric
A metric-space is a set of objects S, with a
distance function d, such that given any
three objects x, y, z,
1. Non-Negativity
d(x,y) > 0 for x = y; d(x,y) = 0 for x = y
2. Symmetry
d(x,y) = d(y,x)
3. Triangular inequality
d(x,y) + d(y,z) = d(x,y)
Return
Appendix II - Sequence
• 2 RNA sequences from a DNA strand.
Return
Appendix III - PAM
Percent Accepted Mutation(PAM)
A PAM(x) substitution matrix is a look-up table in
which scores for each amino acid substitution
have been calculated based on the frequency of
that substitution in closely related proteins that
have experienced a certain amount (x) of
evolutionary divergence. (e.g PAM250)
A unit to quantify the amount of evolutionary change in a protein sequence. Based
on log-odd probability.
Return
Appendix IV – PAM250
• At this evolutionary distance (250 substitutions per hundred
residues)
Return
Appendix V - BLOSUM
Blocks Substitution Matrix (BLOSUM)
A substitution matrix in which scores for each
position are derived from observations of the
frequencies of substitutions in blocks of local
alignments in related ( e.g BLOSUM62)
A unit to quantify the amount of evolutionary change in a protein sequence. Based
on log-odd probability
Return
Appendix VI – BLOSUM62
• BLOSUM62 matrix is calculated from protein blocks such
that if two sequences are more than 62% identical
Return
Appendix VII – mPAM250
• Expected time based on 250 PAM distance as a unit.
Return
Appendix VIII – mPAM
Validation
• Based on benchmark
query set by SmithWaterman.
• Graph shows ROC50
values (Receiver
Operating Characteristics)
• Negative x- axis indicate
mPAM has better
performance
Difference between ROC50 values
using mPAM and PAM250
Return
Appendix IX - Distance measure
Global Sequence Alignment
Given an alphabet A , a similarity substitution matrix
M corresponding to an evolutionary model, the global
sequence alignment for two sequences s and t is to find
a strings a and b which are obtained from s and t
respectively by inserting spaces either into or at the
ends of s and t and whose score computed using M is
at a maximum (Similarity measure) or minimum
(distance measure) over all pairs of such strings
obtained from s and t.
Return
Appendix X – Homology Search
Build Index Structure(Offline)
1.
2.
Divide the database sequences into a set of overlapping sub
strings of length q (q-grams) with step size 1.
Build a metric-space index D based on global alignment to
support constant time lookup of exact match.
Homology Search Query (Online)
1.
2.
3.
Divide the query sequence W into overlapping sub string , F
= {wi | i =0..| W |-q }, of length q with step size 1.
For each wi in F, run range query Q(wi, r) against database D
to find a set of matching q-grams, Ri = f i,j | d( f i,j , wi) <= r,
f i,j E D wi E F }, where d is the distance function.
Using a greedy heuristic algorithm to extend and chain all
fragments in R0UR1U…Rw-t to deduce the result of
homology search based on local alignment for query W
Return
Appendix XI - GSA
Return