Searching Similar Patterns

Transcript Searching Similar Patterns

An Efficient Index Structure
for String Databases
Tamer Kahveci
Ambuj K. Singh
Department of Computer Science
University of California
Santa Barbara
http://www.cs.ucsb.edu/~tamer
1
Whole/Substring Matching Problem
Find similar substrings in a database,
that are similar to a given query string
quickly, using a small index structure
(1-2 % of database size).
database string
query string
2
String Similarity
Motivation:

Base Pairs (millions)
Applications
 Genetic sequence
databases, NCBI
 Text databases, spell
checkers, web search.
 Video databases (e.g.
VIRAGE, MEDIA360)


Database size is too
large. Most of the
techniques available are
in-memory.
Space requirement of
current indexes is too
large.
Year
3
Outline
Motivation & background
Our contribution



Frequency vector, frequency distance &
wavelet transform
Multi-resolution index structure
k-NN & range queries
Experimental results
Conclusion
4
Notation
q : query string.
m,n : length of strings.
r : range query radius.
 = r/|q|: error rate.
5
String Similarity: an example
A C T - - T A G C
R
I I
D
A A T G A T A G -
6
Background
Edit operations:



Insert
Delete
Replace
Edit distance (ED) between s1 and s2 = minimum
number of edit operations to transform s1 to s2.
Finding the edit distance is costly.

O(mn) time and space if m and n are lengths of s1 and s2 if
dynamic programming is used [NW70, SW81].
7
Related Work
Lossless search

Online
 [Mye86] (Myers) reduce space requirement to O(rn), where r is
query radius.
 [WM92] (Wu, Manber) binary masks, O(rn).
 [BYN99] (Beaze-Yates, Navarro) NFA

Offline (index based)
 [Mye94] (Myers) condensed r-neighborhood.
 [BYN97] (Beaze-Yates, Navarro) dictionary.
Lossy search


[AG90] (Altschul, Gish) BLAST.
 FASTA, SENSEI, MegaBLAST, WU-BLAST, PHI-BLAST,
FLASH, QUASAR, REPUTER, MumMER.
[GWWV00] (Giladi, Walker, Wang, Volkmuth) SST-Tree
8
Outline
Motivation & background
Our contribution



Frequency vector, frequency distance &
wavelet transform
Multi-resolution index structure
k-NN & range queries
Experimental results
Conclusion
9
Frequency Vector
Let s be a string from the alphabet
={1, ..., }. Let ni be the number of
occurrences of the character i in s for
1i, then
frequency vector: f(s) =[n1, ..., n].
Example:


s = AATGATAG
f(s) = [nA, nC, nG, nT] = [4, 0, 2, 2]
10
Effect of Edit Operations on
Frequency Vector
Delete : decreases an entry by 1.
Insert : increases an entry by 1.
Replace : Insert + Delete
Example:




s = AATGATAG => f(s) = [4, 0, 2, 2]
(del. G), s = AAT.ATAG => f(s) = [4, 0, 1, 2]
(ins. C), s = AACTATAG => f(s) = [4, 1, 1, 2]
(AC), s = ACCTATAG => f(s) = [3, 2, 1, 2]
11
An Approximation to ED:
Frequency Distance (FD1)
s = AATGATAG => f(s)=[4, 0, 2, 2]
q = ACTTAGC => f(q)=[2, 2, 1, 2]
 pos = (4-2) + (2-1) = 3
 neg = (2-0) = 2
 FD1(f(s),f(q)) = 3
FD1(f(q),f(s))
 ED(q,s) = 4
FD1(f(s1),f(s2))=max{pos,neg}.
FD1(f(s1),f(s2)) ED(s1,s2).
f(s)
f(q)
12
An Illustration of Frequency Distance
& Edit Distance
Set of
strings 1
v1
Frequency
Distance
v2
Set of
strings 2
Edit Distance
13
Using Local Information: Wavelet
Decomposition of Strings
s = AATGATAC => f(s)=[4, 1, 1, 2]
s = AATG + ATAC = s1 + s2
f(s1) = [2, 0, 1, 1]
f(s2) = [2, 1, 0, 1]
1(s)= f(s1)+f(s2) = [4, 1, 1, 2]
2(s)= f(s1)-f(s2) = [0, -1, 1, 0]
14
Wavelet Decomposition of a String:
General Idea
Ai,j = f(s(j2i : (j+1)2i-1))
Bi,j = Ai-1,2j - Ai-1,2j+1
First wavelet coefficient
Second wavelet coefficient
(s)=
15
Wavelet Decomposition & ED
Define FD(s1,s2)=max{FD1, FD2}.
16
Outline
Motivation & background
Our contribution



Frequency vector, frequency distance &
wavelet transform
Multi-resolution index structure
k-NN and range queries
Experimental results
Conclusion
17
MRS-Index Structure Creation
w=2a
s1
transform
18
MRS-Index Structure Creation
s1
19
MRS-Index Structure Creation
s1
20
MRS-Index Structure Creation
s1
...
slide c times
c=box capacity
21
MRS-Index Structure Creation
s1
...
22
MRS-Index Structure Creation
s1
Ta,1
...
W=2a
23
Using Different Resolutions
s1
Ta,1
...
W=2a
...
W=2a+1
Ta+1,1
24
MRS-Index Structure
25
MRS-index properties
Relative MBR
volume (Precision)
decreases when


c increases.
w decreases.
MBRs are highly
clustered.
Box volume
Box Capacity
26
Outline
Motivation & background
Our contribution



Frequency vector, frequency distance &
wavelet transform
Multi-resolution index structure
k-NN & range queries
Experimental results
Conclusion
27
Range Queries [KS01]
s1
s2
sd
1=
w=24
...
...
...
...
w=25
...
...
...
...
w=26
...
...
...
...
2  1
w=27
...
...
...
...
3  2
208
16
64
128
28
k-Nearest Neighbor Query
[KSF+96, SK98]
k=3
29
k-Nearest Neighbor Query
k=3
r = Edit distance to 3rd closest substring
30
k-Nearest Neighbor Query
r
k=3
31
k-Nearest Neighbor Query
k=3
32
Outline
Motivation & background
Our contribution
Experimental results
Conclusion
33
Experimental Settings
w={128, 256, 512, 1024}.
Human chromosomes from
(www.ncbi.nlm.nih.gov)


chr02, chr18, chr21, chr22
Plotted results are from chr18 dataset.
Queries are selected from data set
randomly for 512  |q|  10000.
An NFA based technique [BYN99] is
implemented for comparison.
34
Experimental Results 1:
Effect of Box Capacity (10-NN)
35
Experimental Results 2:
Effect of Window Size (10-NN)
36
Experimental Results 3:
k-NN queries
37
Experimental Results 4:
Range Queries
38
Outline
Motivation & background
Our Contribution
Experimental results
Discussion & conclusion
39
Discussion
In-memory (index size is 1-2% of the
database size).
Lossless search.
3 to 45 times faster than NFA technique for kNN queries.
2 to 12 times faster than NFA technique for
range queries.
Can be used to speedup any previously
defined technique.
40
Future Work
Extend to weighted edit distance and affine
gaps.
Extend to local similarity (substring/substring)
search.
Compare the quality of answers and speed to
BLAST (lossy search).
Use as a preprocessing step to BLAST.
Apply the MRS index structure for larger
alphabet size (e.g. protein sequences.).
41
Related Work
Lossless search

Online
 [Mye86] (Myers) reduce space requirement to O(rn), where r is
query radius.
 [WM92] (Wu, Manber) binary masks, O(rn).
 [BYN99] (Beaze-Yates, Navarro) NFA

Offline (index based)
 [Mye94] (Myers) condensed r-neighborhood.
 [BYN97] (Beaze-Yates, Navarro) dictionary.
Lossy search


[AG90] (Altschul, Gish) BLAST.
 FASTA, SENSEI, MegaBLAST, WU-BLAST, PHI-BLAST,
FLASH, QUASAR, REPUTER, MumMER.
[GWWV00] (Giladi, Walker, Wang, Volkmuth) SST-Tree
42
Related Work (Similar problems)
[BYP92] (Beaze-Yates, Perleberg) only
replace is allowed.
[Gus97] (Gusfield) exact matching,
suffix trees.
[JKS00] (Jagadish, Koudas, Srivastava)
exact matching with wild-cards for
multidimensional strings, elided trees
and R-tree.
43
THANK YOU
44
Frequency Distance to an MBR
B
f(s)
FD(f(q),f(s))
FD(f(q),B)
f(q)
f(q)
45

Searching Similar Patterns

Transcript Searching Similar Patterns

Directory