Transcript Slide 1

Reference-based Indexing of
Sequence Databases
Jayendra Venkateswaran, Deepak Lachwani,
Tamer Kahveci, Christopher Jermaine
University of Florida-Gainesville
www.cise.ufl.edu/~jgvenkat
VLDB 2006
Similarity Search
Given threshold ε , find sequences similar to the query sequence.
.
.
Query
=>
.
Similar Sequences
.
si
Sequence Database, S
sj
Sequence
Query
2
sk
Measure: Edit Distance
Edit Operations: Insert, Delete and Replace.
Example: P: ACGTACGTAC_GT
| |||| ||| ||
Q: A_GTACCTACCGT
Sequence Length: 12
3 Edit Operations: 2 Insertions and 1 Replace
Edit Distance is the minimum number of edit
operations needed to transform one sequence to
another.
3
Edit Distance: Complexity
Time and space complexity for computing Edit Distance between
two sequences is O(n2)
.
Query
.
Sequence Database, S
.
|S| = 100,000
.
One Sequence Comparison: 0.25 second.
Time taken for single query: 7 hours.
4
Need for Indexing
Select K sequences
as references
Query
Sequence
Database, S
Candidate Set, C
.
=>
.
Pre-compute referenceto-sequence distances
(K + |C|) << |S|
5
Query
Existing Methods
• Hierarchical Methods,
• VP-Tree (Yianilos, 1993)
• MVP-Tree (Bozkaya et al., 1997)
• M-Tree (Ciaccia et al., 1997),
• Slim-Tree (Traina et al., 2000),
• DF-Tree (Traina et al., 2002).
• DBM-Tree (Vieira et al., 2004)
• Omni (Filho et al., 2001)
• Frequency Vector (Kahveci et al., 2004).
6
Reference-based Indexing
Reference Circle
Including Query
Database Sequences
Reference Sequence
7
Query
Reference-based Indexing
Reference Circle
Including Query
Sequences outside the
reference circle (far from the
reference) are pruned.
Sequences close to
the references can
also be pruned
Database Sequences
Reference Sequence
8
Query
Reference-based Indexing
Reference
Circle
Excluding
Query
Database Sequences
Reference Sequence
9
Query
Reference-based Indexing
Reference
Circle
Excluding
Query
Sequences inside the reference circle
(close to the reference) are pruned.
Database Sequences
Reference Sequence
10
Query
Reference-based Indexing:
Bounds
Given a sequence s, reference r and query q,
Lower Bound: Minimum Distance between q and s with r as reference, |d1-d2|.
Upper Bound: Maximum Distance between q and s with r as reference, d1+d2.
Upper Bound
d1
d2
Database Sequence
Lower Bound
Reference Sequence
11
Query
Observations



Two types of pruning:
•
•
Sequences close to references.
Sequences far from references.
A good reference set should be able to
use both kinds of pruning effectively.
Each reference should prune some
part of the database not pruned by
other references.
12
Outline





Selection of References
Reference Assignment
Search Algorithm
Experimental Results
Conclusions
13
Our Contributions

Selection of References:
1. Maximum Variance Selection: Reference
with high variance of distance distributions
with other sequences in the database.
2. Maximum Pruning: A Combinatorial
approach of selecting the best reference set.

Assignment of References:
•
Each sequence has different set of
references.
14
Selection of References:
Maximum Variance (MV)
Basic Idea: Select references having more sequences close to and far
from it, and hence can prune them.
Bad
Good
Database Sequences
15
Selection of References:
Maximum Variance (MV)



Select references having sequences close
to and far away from them.
References have maximum variance of
distance distributions with other sequences
in the database.
New reference prunes some part of the
database not pruned by existing set of
references.
16
Maximum Variance: Algorithm
S1
Compute Distances
S2
S3
S4
S5
S6
S7
S8
Sequence
Database
S3
S5
=>
S7
S8
Random Subset
of Sequences
S1
S6
S2
S3
S3
S2
Sort
S4
S1
S5
S8
S6
S7
S7
S5
S8
S4
Variance of Distance
Distributions
17
Remove Sequences
Close to or Far away
from New Reference
Candidate
Reference Set
Maximum Variance: Example
a
b
f
e
g
d
c
Database Sequences
e
g
c
a
f
b
d
Maximum Variance Ordering
Reference Sequences
18
Selection of References:
Maximum Pruning (MP)




Combinatorial approach to select the
best reference set for given query set.
Select reference set that can prune more
sequences over all queries.
Sample query set Q’ following the actual
query distribution is given.
Sampling techniques to reduce the
complexity of this method.
19
Maximum Pruning: Algorithm
GAINS
Reference Set
S1
S2
S3
S4
S5
S6
Candidate References
v1 v2 v3 v4
Q1
Q2
Q3
Q4
Sample Queries, Q’
20
S1
S2
S3
S4
S5
S6
Sequence
Database
Maximum Pruning: Algorithm
GAINS
Reference Set
S1
S2
S3
S4
S5
S6
Candidate References
v1 v2 v3 v4
Q1
Q2
Q3
Q4
Sample Queries, Q’
21
S1
S2
S3
S4
S5
S6
Sequence
Database
Maximum Pruning: Algorithm
GAINS
Reference Set
S1
S2
S3
S4
S5
S6
Candidate References
S1 v2 v3 v4
Q1
Q2
Q3
Q4
Sample Queries, Q’
22
S1
S2
S3
S4
S5
S6
Sequence
Database
Maximum Pruning: Algorithm
GAINS
Reference Set
S1
S2
S3
S4
S5
S6
Candidate References
v1 S1 v3 v4
Q1
Q2
Q3
Q4
Sample Queries, Q’
23
S1
S2
S3
S4
S5
S6
Sequence
Database
Maximum Pruning: Algorithm
GAINS
Reference Set
S1
S2
S3
S4
S5
S6
Candidate References
v1 v2 S1 v4
Q1
Q2
Q3
Q4
Sample Queries, Q’
24
S1
S2
S3
S4
S5
S6
Sequence
Database
Maximum Pruning: Algorithm
GAINS
Reference Set
S1
S2
S3
S4
S5
S6
Candidate References
v1 v2 v3 S1
Q1
Q2
Q3
Q4
Sample Queries, Q’
25
S1
S2
S3
S4
S5
S6
Sequence
Database
Maximum Pruning: Algorithm
GAINS
G1
Reference Set
S1
S2
S3
S4
S5
S6
Candidate References
S1 v2 v3 v4
Q1
Q2
Q3
Q4
Sample Queries, Q’
26
S1
S2
S3
S4
S5
S6
Sequence
Database
Maximum Pruning: Algorithm
GAINS
G1
G2
Reference Set
S1
S2
S3
S4
S5
S6
Candidate References
v1 v2 S2 v4
Q1
Q2
Q3
Q4
Sample Queries, Q’
27
S1
S2
S3
S4
S5
S6
Sequence
Database
Maximum Pruning: Algorithm
GAINS
MAX()
G1
G2
G3
G4
G5
G6
Reference Set
S1
S2
S3
S4
S5
S6
v1 v2 S2 v4
Q1
Q2
Q3
Q4
Candidate References
Sample Queries, Q’
Repeat Until MAX() > 0
28
S1
S2
S3
S4
S5
S6
Sequence
Database
Maximum Pruning Example
f
Reference Set
e
d
a
<e,d>
q
{a,d,e,f}
d
b1
a
b3
b2
b
<e,a>
<a,d>
{a,b,c,d,e,f}
{a,b,c,d}
c
e
Database Sequences
Reference Sequences
Sequences pruned by a
29
Candidate
Reference
Gain
a
2
Maximum Pruning Example
f
Reference Set
e
d
a
<e,d>
q
{a,d,e,f}
d
b1
a
b3
b2
b
e
Database Sequences
Reference Sequences
Sequences pruned by a
30
<e,a>
<a,d>
{a,b,c,d,e,f}
{a,b,c,d}
c
Candidate
Reference
Gain
a
2
c
1
f
1
Outline





Selection of References
Assignment of References
Search Algorithm
Experimental Results
Conclusions
31
Assignment of References
Select K sequences
as references
Sequence Database, S
Candidate Set, C
Query
.
Query
=>
.
Pre-compute referenceto-sequence distances
Assign K references to each sequence
Increase the Number
of references to m
Query
(K + |C|) << |S|
(m + |C’|) < (K + |C|) << |S|
.
=>
Query
.
Candidate Set, C’
32
Reference Assignment: Example
Number of References = 2
ε1
f
q1
d
ba1
a
ba2
ε2
b
q2
c
e
ε3
bc1
q3
a
c
References for b
33
Outline





Selection of References
Reference Assignment
Search Algorithm
Experimental Results
Conclusions
34
Search Algorithm
Pre-compute Sequence-Reference Distances
Compute Query-Reference Distances
MAX(LB)
MIN(UB)
Lower
Bounds
Upper
Bounds
Query, q
Reference set, V
Sequence Database, S
If MAX(LB) ≤ ε ≤ MIN(UB), add s to Candidate set,
If ε > MIN(UB), add s to Result set.
If ε < MAX(LB), add s to Pruned set.
35
Outline





Selection of References
Reference Assignment
Search Algorithm
Experimental Results
Conclusions
36
Experimental Setup

Datasets
•
•
•
DNA: Alphabet size of 4 and 20000 sequences.
Protein: Alphabet size of 20 and 4000 sequences of up to 500 amino
acids.
Text: Alphabet size of 36 and 8000 sequences of length 100 each.

Size of Reference Set, m = 200.

Experiments,
•
•
Comparison with our methods
•
•
Maximum Variance with same and different reference sets (MV-S and MV-D).
Maximum Pruning with same and different reference sets (MP-S and MP-D).
Comparison with other methods
•
•
•
Frequency Vector (Kahveci et al., 2004).
Omni (Filho et al., 2001)
Others: M-Tree (Ciaccia et al., 1007), Slim-Tree (Traina et al., 2000), DBM-Tree
(Vieira et al., 2004) and DF-Tree (Traina et al., 2002).
37
Comparison of Our Methods
DNA Dataset
k=4
38
Comparison of Our Methods
DNA Dataset
Range = 8
39
Comparison with Other Methods
DNA Dataset, k = 16
40
Conclusion



References selected by Maximum Variance
and Maximum Pruning eliminates more
database sequences as compared to existing
selection strategies.
Assigning different reference set to each
sequence dramatically improves the
performance.
MP-D outperforms existing methods in
almost all the experiments.
41
Thank You
Questions ?
[email protected]
Comparison with Other Methods:
Protein Dataset
Query Range =
300
43
Assignment of References:
Memory Limitations
Main memory stores pre-computed reference-tosequence distances along with the references.
 For each [s,vi] pair (s  S, vi V), store [i,ED(s,vi)] (Takes
8 bytes).
 Given the available main memory in bytes, B
B = 8KN + zm
N: Number of sequences in the database.
K: Number of references per sequence.
z: Size of each sequence in bytes.
m: Number of references in reference set.
 Example: Given B = 1 GB, N = 10 million, z = 100 and m
= 1000, then K = 13.

44