Vakhitov Alexander Approximate Text Indexing.


Vakhitov Alexander
Approximate Text Indexing.
Using simple mathematical arguments, the matching probabilities in the suffix tree are bounded, and by a clever division of the search pattern sublinear time is achieved. The report is based on “A Hybrid Method for Approximate String Matching” by G. Navarro and R. Baeza-Yates.
The Task
The task is to find substrings of the long text T that approximately match our pattern P.
For example, we have text T='adbc' and P='abc'
(s is the starting position of a substring):
adbc = a+d+b+c – insertion of 'd' (s=1)
dbc = (a->d)+b+c – replacement of 'a' with 'd' (s=2)
bc = (a)+b+c – deletion of 'a' (s=3)
Errors
There are 3 kinds of transformations which introduce errors into the initial string: insertion, replacement and deletion. If we transform S into S' with such changes, then we can transform S' back into S with the same number of changes.
The minimal number of deletions, insertions and replacements needed to transform string A into string B is called the edit distance between A and B (ed(A,B)).
Example:
ed('abc','adbc') = ed('dbc','abc') = ed('bc','abc') = 1
survey -> (replace 'v' with 'g') -> surgey -> (insert 'r') -> surgery, so ed('survey','surgery') = 2
The resulting algorithm
The algorithm solves the approximate string matching problem in O(n^λ log n) time, λ ∈ (0,1) (n is the size of the text T), if the error level α < 1 − e/√σ, where
σ is the size of the alphabet,
e = 2.718...,
α = k/m,
k is the number of errors,
m is the size of the pattern P.
Plan of the report
● Some useful ideas & basic algorithms
● The main algorithm
● Analysis of the complexity of the algorithm in different cases
Dividing the pattern
Lemma. Take strings A and B with ed(A,B) ≤ k, and divide A into j substrings (Ai). Then at least one of the (Ai) appears in B with at most ⌊k/j⌋ errors.
We need k changes to transform A into B. Each change affects one of the Ai, so the k changes are distributed among the j substrings => the average number of changes per substring is k/j, and at least one substring receives at most ⌊k/j⌋ of them.
Example: ed('he likes','they like') = 3 = k;
A1='he ', A2='likes' => j=2;
ed('he ','they ') = 2; ed('likes','like') = 1 = ⌊k/j⌋
Computing edit distance
There are strings x, y. ed(x,y) = ?
x = x1x2...xm, y = y1y2...yn, xp, yq ∈ Σ
Cij = ed(x1..xi, y1..yj); C0..|x|,0..|y| is a matrix filled with the Cij.
Computing Cij:
C0,j = j; Ci,0 = i;
Ci,j = if (xi = yj) then Ci-1,j-1
else 1 + min{Ci-1,j-1, Ci-1,j, Ci,j-1}
Example: x='survey', y='surgery' (the slide shows the filled matrix: green cells mark positions where xi<>yj, red cells mark xi=yj, and an arrow points to the element used to compute each Cij).
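A minimal Python sketch of this recurrence (the function name edit_distance is mine, not from the report):

def edit_distance(x, y):
    # C[i][j] = ed(x1..xi, y1..yj), filled exactly as in the recurrence above
    C = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(len(x) + 1):
        C[i][0] = i
    for j in range(len(y) + 1):
        C[0][j] = j
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                C[i][j] = C[i - 1][j - 1]
            else:
                C[i][j] = 1 + min(C[i - 1][j - 1], C[i - 1][j], C[i][j - 1])
    return C[len(x)][len(y)]

print(edit_distance('survey', 'surgery'))   # 2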
Edit distance in the case of text T and pattern P
We need to find a substring of the text T which matches P with the minimal number of errors. Let x be the pattern and y be the text.
The matching text substring can begin at any text position, so we initialize C0,j with 0. The rest is as in the previous task.
The algorithm can store only the last column and analyze the text incrementally from the beginning, moving through the matrix column by column and filling it with Cij.
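A small Python sketch of this search variant, keeping only the last column (approx_search is my name for it, not the paper's):

def approx_search(text, pattern, k):
    # C[0][j] = 0, so a match may start at any text position; only the
    # last column of the matrix is kept while the text is read left to right.
    m = len(pattern)
    col = list(range(m + 1))              # column for the empty text prefix
    matches = []
    for j, c in enumerate(text):
        new = [0]                         # free start of a match
        for i in range(1, m + 1):
            if pattern[i - 1] == c:
                new.append(col[i - 1])
            else:
                new.append(1 + min(col[i - 1], col[i], new[i - 1]))
        col = new
        if col[m] <= k:                   # occurrence with <= k errors ends here
            matches.append((j, col[m]))
    return matches

print(approx_search('adbc', 'abc', 1))    # [(3, 1)]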
Examples of the matrices
Construction of the NFA
We build a Nondeterministic Finite Automaton which searches for text substrings that approximately match the pattern with at most k errors. It consists of k+1 rows (one per number of errors) and m+1 columns.
Transitions:
The pattern and text characters are the same (horizontal)
Insert characters into the pattern (vertical)
Replace the pattern character with the text one (solid diagonal)
Delete the pattern character (dashed diagonal)
Nondeterministic Finite Automaton
This automaton is for approximate string matching for the pattern 'survey' with 2 errors.
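As an illustration, this NFA can be simulated with bit-parallelism in the style of Wu and Manber; the sketch below only shows how the four transition types act on bit masks and is not necessarily the simulation used in the paper:

def nfa_search(text, pattern, k):
    m = len(pattern)
    B = {}                                     # B[c]: bit j set iff pattern[j] == c
    for j, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << j)
    R = [(1 << d) - 1 for d in range(k + 1)]   # row d of the NFA as a bit mask
    ends = []
    for pos, c in enumerate(text):
        Bc = B.get(c, 0)
        prev = R[0]
        R[0] = ((R[0] << 1) | 1) & Bc          # horizontal: characters match (|1 is the initial self-loop)
        for d in range(1, k + 1):
            old = R[d]
            R[d] = ((old << 1) | 1) & Bc       # horizontal: characters match
            R[d] |= prev                       # vertical: insertion into the pattern
            R[d] |= prev << 1                  # solid diagonal: replacement
            R[d] |= R[d - 1] << 1              # dashed diagonal: deletion
            R[d] |= (1 << d) - 1               # first d states are always active
            prev = old
        if R[k] & (1 << (m - 1)):
            ends.append(pos)                   # an occurrence ends at this text position
    return ends

print(nfa_search('adbc', 'abc', 1))            # [3]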
Depth-first search
DEF: the “k-neighborhood” is the set of strings that match P with at most k errors:
Uk(P) = {x ∈ Σ* : ed(x,P) ≤ k}
Searching for these strings in the text (without errors) would solve the problem, but |Uk(P)| = O(m^k σ^k) is quite large.
We can determine which strings from Uk(P) appear in the text by traversing the suffix tree of the text. Here we can use the set Ukt(P): the set of neighborhood elements which are not prefixes of other elements.
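To see how quickly Uk(P) grows, here is a small Python sketch that generates it by brute force (the helper name neighborhood is mine):

def neighborhood(pattern, k, alphabet):
    # every string reachable from the pattern by at most k single-character
    # edits (insertions, replacements, deletions), i.e. Uk(pattern)
    frontier = {pattern}
    result = {pattern}
    for _ in range(k):
        nxt = set()
        for s in frontier:
            for i in range(len(s) + 1):
                for c in alphabet:
                    nxt.add(s[:i] + c + s[i:])           # insertion
                    if i < len(s):
                        nxt.add(s[:i] + c + s[i + 1:])   # replacement
                if i < len(s):
                    nxt.add(s[:i] + s[i + 1:])           # deletion
        result |= nxt
        frontier = nxt
    return result

print(len(neighborhood('survey', 2, 'abcdefghijklmnopqrstuvwxyz')))   # tens of thousands of strings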
Algorithm for searching on the suffix tree
● Starts from the root
● Considers the string x incrementally
● Determines when ed(P,x) ≤ k
● Determines when ed(P,xy) > k for any y
Algorithm for searching on the suffix tree
– Each new character of x corresponds to a new column in the matrix (adding a character s to x <=> updating the column in O(m) time).
– A match is detected when the last element of the column is ≤ k.
– x cannot be extended to match P when all the values of the last column are > k.
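The following Python sketch mimics this search; for simplicity a plain trie of all text suffixes stands in for a real suffix tree, and the names Node, build_suffix_trie and search are mine:

class Node:
    def __init__(self):
        self.children = {}    # character -> child node
        self.positions = []   # start positions of suffixes passing through this node

def build_suffix_trie(text):
    # O(n^2) construction; a real suffix tree would be used in practice
    root = Node()
    for start in range(len(text)):
        node = root
        for c in text[start:]:
            node = node.children.setdefault(c, Node())
            node.positions.append(start)
    return root

def search(root, pattern, k):
    m = len(pattern)
    hits = set()

    def dfs(node, col):
        # col[i] = ed(pattern[:i], x), where x is the string spelled from the root
        if col[m] <= k:               # ed(P, x) <= k: every suffix below starts a match
            hits.update(node.positions)
            return
        if min(col) > k:              # ed(P, xy) > k for every extension y: prune
            return
        for c, child in node.children.items():
            new = [col[0] + 1]        # new column for x + c, computed in O(m)
            for i in range(1, m + 1):
                if pattern[i - 1] == c:
                    new.append(col[i - 1])
                else:
                    new.append(1 + min(col[i - 1], col[i], new[i - 1]))
            dfs(child, new)

    dfs(root, list(range(m + 1)))     # the column for the empty string x
    return hits

print(sorted(search(build_suffix_trie('surgery surveys'), 'survey', 2)))   # start positions of approximate occurrences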
Algorithm for searching on the suffix tree (illustration)
Partitioning the pattern
The cost of the suffix tree search is exponential in m and k, so it is better to perform j searches with pieces of length m/j and k/j errors – that is why we divide the pattern.
So, we divide our pattern into j pieces and search for them using the above algorithm. Then, for each match found ending at text position i, we check the text area [i−m−k .. i+m+k].
But the larger j is, the more text positions need to be verified; the optimal j will be found in the analysis.
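A sketch of the whole partitioning scheme in Python, reusing build_suffix_trie() and search() from the suffix-tree sketch above and approx_search() from the earlier column-based sketch (the name partitioned_search and the exact window handling are mine):

def partitioned_search(text, pattern, k, j):
    m = len(pattern)
    piece_len = -(-m // j)                       # ceil(m / j)
    pieces = [pattern[i:i + piece_len] for i in range(0, m, piece_len)]
    trie = build_suffix_trie(text)
    occurrences = set()
    for piece in pieces:
        # by the lemma, some piece occurs with at most floor(k/j) errors
        for s in search(trie, piece, k // j):
            lo = max(0, s - m - k)               # text window around the candidate
            hi = min(len(text), s + m + k)
            for end, err in approx_search(text[lo:hi], pattern, k):
                occurrences.add((lo + end, err)) # (end position, errors) of a verified match
    return sorted(occurrences)

print(partitioned_search('surgery surveys', 'survey', 2, 2))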
Searching pieces of the pattern
Let's use the NFA with a depth-first search (DFS) technique (the suffixes from the suffix tree will be the input of the automaton).
At first, we transform our NFA:
– The initial self-loop is not needed (earlier it allowed us to start matching at every position of the text);
– We remove the lower-left triangle of the automaton, because we avoid initial insertions into the pattern;
– We can start matching only with the k+1 first pattern characters.
The changes to the NFA
Using a suffix array instead of a suffix tree
The suffix array can replace the suffix tree in our algorithm. It has smaller space requirements, but the time complexity must be multiplied by log n.
The suffix array replaces nodes with intervals, and moving to a child node becomes narrowing the interval. If a node has children, then the node's interval contains the children's intervals.
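A Python sketch of this interval descent (build_suffix_array and extend_interval are my names; a naive suffix array construction is used only to keep the sketch short):

def build_suffix_array(text):
    # naive O(n^2 log n) construction, enough for a sketch
    return sorted(range(len(text)), key=lambda i: text[i:])

def extend_interval(text, sa, lo, hi, depth, c):
    # Narrow the interval [lo, hi) of suffixes whose first `depth` characters equal
    # the current string x to the interval for x + c.  Two binary searches replace
    # following a tree edge, which is where the extra log n factor comes from.
    def char_at(i):
        pos = sa[i] + depth
        return text[pos] if pos < len(text) else ''
    l, r = lo, hi
    while l < r:                       # first suffix whose next character is >= c
        mid = (l + r) // 2
        if char_at(mid) < c:
            l = mid + 1
        else:
            r = mid
    new_lo = l
    r = hi
    while l < r:                       # first suffix whose next character is > c
        mid = (l + r) // 2
        if char_at(mid) <= c:
            l = mid + 1
        else:
            r = mid
    return new_lo, l

text = 'surgery surveys'
sa = build_suffix_array(text)
lo, hi = 0, len(sa)
for depth, c in enumerate('sur'):       # descend along the string 'sur'
    lo, hi = extend_interval(text, sa, lo, hi, depth, c)
print([sa[i] for i in range(lo, hi)])   # start positions of suffixes beginning with 'sur'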
Analysis for the algorithm: the average number of nodes at level l
● For a small l, all the text suffixes (except the last l) are longer than l, so nearly n suffixes reach level l;
● The maximum number of nodes at level l is σ^l, where σ = |Σ|;
● We use the model of n balls randomly thrown into σ^l urns. The average number of filled urns is
σ^l (1 − (1 − 1/σ^l)^n) = σ^l (1 − e^(−n/σ^l)) = Θ(min{n, σ^l})
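A quick numeric check of this Θ(min{n, σ^l}) estimate (the values σ = 4 and n = 10000 below are arbitrary):

def avg_filled_urns(n, urns):
    # expected number of distinct urns hit when n balls are thrown at random
    return urns * (1 - (1 - 1 / urns) ** n)

n, sigma = 10000, 4
for l in range(1, 12):
    urns = sigma ** l
    print(l, round(avg_filled_urns(n, urns)), min(n, urns))   # the two values have the same order of magnitude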
Probability of processing a given node at depth l in the suffix tree
If l ≤ m', at least l−k text characters must match the pattern (m' is the pattern size), and if l ≥ m', at least m'−k pattern characters must match the text. We sum all the matching probabilities over the different pattern prefixes (the sums are shown on the slide).
The largest term of the first sum is the first one; applying Stirling's approximation to it, we get a bound...
Probability of processing a given node at depth l in the suffix tree
...which is
γ(α)^l · O(1/l),
where γ(α) = 1 / (σ^(1−α) · α^(2α) · (1−α)^(2(1−α))) and α = k/l.
The whole first summation is bounded by l−k times its largest term, so we get (l−k) · γ(α)^l · O(1/l) = O(γ(α)^l). The first summation decreases exponentially if γ(α) < 1, which means that
σ > e² / (1−α)²
(because e^(−1) < α^(α/(1−α)) for α ∈ [0,1]).
Probability of processing a given node at depth l in the suffix tree
...or, equivalently, α < 1 − e/√σ.
The second summation can also be bounded by this O(γ(α)^l), so the upper bound on the probability of processing a given node at depth l in the suffix tree is O(γ(α)^l).
In practice, e should be replaced by c = 1.09 (determined experimentally), because we have only found an upper bound on the probability.
Analysis of the single pattern search in the suffix tree
Using the formulas bounding the probability of matching, let's consider that at levels l ≤ L(k) all the nodes are visited, while nodes at level l > L(k) are visited with probability O(γ(k/l)^l).
Remember that the average number of visited nodes at level l (for small l) is Θ(min{n, σ^l}).
Three cases of analysis
(a) L(k) ≥ log_σ n, i.e. n ≤ σ^L(k): “small n”, an online search is preferable, no index is needed (since the total work is Θ(n));
(b) m+k < log_σ n, i.e. n > σ^(m+k): “large n”, the total cost is independent of n;
(c) L(k) ≤ log_σ n ≤ m+k: “intermediate n”, the time is sublinear in n.
Analysis of pattern partitioning
We need to perform j searches and then verify all the possible matches. We again distinguish three cases, following the previous slide:
(a) L(k/j) ≥ log_σ n, i.e. n ≤ σ^L(k/j): complexity O(n);
(b) (m+k)/j < log_σ n, i.e. n > σ^((m+k)/j): if the error level α < 1 − e/√σ, the complexity is O(n^λ) with λ < 1 – sublinear in n (using j = (m+k) / log_σ n);
(c) with the same error level as in (b) and the same j, we also get sublinear complexity.
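For illustration, the optimal number of pieces from this formula for some made-up values (n, σ, m and k below are arbitrary):

import math

n, sigma, m, k = 1_000_000, 26, 10, 2      # arbitrary example values
j = (m + k) / math.log(n, sigma)           # j = (m + k) / log_sigma(n)
print(round(j))                            # about 3 pieces for these numbers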
Other types of algorithms
● The limited depth-first search technique determines viable prefixes (the prefixes of possible pattern matches) and searches for them in the suffix tree (it is expensive and cannot be implemented on a suffix array).
● Filtering discards large parts of the text by checking a necessary condition (simpler than the matching condition). Most existing filters are based on finding substrings of the pattern without errors, and they stop working when the error level is high.
Summary & Conclusions
● The splitting technique balances between traversing too many nodes of the suffix tree and verifying too many text positions.
● The resulting index has sublinear retrieval time (O(n^λ), 0 < λ < 1) if the error level is moderate.
● In the future, more exact algorithms may appear for determining the correct number of pieces into which the pattern is divided, as well as better algorithms for verifying the text after a piece of the pattern has matched.