Presentation by Daniel Glasner

Download Report

Transcript Presentation by Daniel Glasner

Semi-Numerical String Matching
Semi-numerical String Matching





All the methods we’ve seen so far have been based
on comparisons.
We propose alternative methods of computation such
as:
Arithmetic.
Bit – operations.
The fast Fourier transform.
Semi-numerical String Matching

We will survey three examples of such methods:

The Random Fingerprint method due to Karp and
Rabin.
Shift–And method due to Baeza-Yates and Gonnet,
and its extension to agrep due to Wu and Manber.
A solution to the match count problem using the fast
Fourier transform due to Fischer and Paterson and
an improvement due to Abrahamson.


Karp-Rabin fingerprint - exact match



Exact match problem: we want to find all the
occurrences of the pattern P in the text T.
The pattern P is of length n.
The text T is of length m.
Karp-Rabin fingerprint - exact match

Arithmetic replaces comparisons.

An efficient randomized algorithm that makes an
error with small probability.
A randomized algorithm that never errors whose
expected running time is efficient.


We will consider a binary alphabet: {0,1}.
Arithmetic replaces comparisons.


Strings are also numbers, H: strings → numbers.
Let s be a string of length n,
H ( s)  i 1 2
n

ni
s(i )
Definition:
let Tr denote the n length substring of T starting at
position r.
Arithmetic replaces comparisons.

Strings are also numbers, H: strings → numbers.
T=10110101
P=0101
T=10110101
P=
0101
H(T5) = 5 =
H(P) = 5
T=10110101
P= 0101
H(T2) = 6 ≠
H(P) = 5
Arithmetic replaces comparisons.

Theorem:
There is an occurrence of P starting at position r of
T if and only if H(P) = H(Tr)

Proof:
Follows immediately from the unique
representation of a number in base 2.
Arithmetic replaces comparisons.

We can compute H(Tr) from H(Tr-1)
H (Tr )  2 H (Tr 1 )  2n T (r  1)  T (r  n  1)
T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0
H (T1 )  H (1011)  11
H (T2 )  2 11  2 4 1  0  22  16  6  H (0110)
Arithmetic replaces comparisons.

A simple efficient algorithm:

Compute H(T1).
Run over T
Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons.


Total running time O(m)?
Karp-Rabin


Let’s use modular arithmetic, this will help us keep
the numbers small.
For some integer p The fingerprint of P is defined
by
Hp(P) = H(P) (mod p)
Karp-Rabin

Lemma:
H p ( P )  {[...({[ P (1)  2(mod p)  P (2)]  2(mod p) 
P (3)}  2(mod p)  P (4))...](mod p)  P (n)}(mod p)
 H ( P )(mod p)
And during this computation no number ever
exceeds 2p.
An example
P=101111
p=7
H(P) = 47
Hp(P) = 47 (mod 7) = 5
1  2(mod 7)  0  2
2  2(mod 7)  1  5
5  2(mod 7)  1  4
4  2(mod 7)  1  2
2  2(mod 7)  1  5
5(mod 7)  5  H p ( P )
Karp-Rabin


Intermediate numbers are also kept small.
We can still compute H(Tr) from H(Tr-1).
Arithmetic:
H (Tr )  2 H (Tr 1 )  2n T (r  1)  T (r  n  1)
Modular arithmetic:
H p (Tr )  [( 2 H (Tr 1 )(mod p)) 
(2n (mod p))T (r  1)  T (r  n  1)](mod p)
Karp-Rabin


Intermediate numbers are also kept small.
We can still compute H(Tr) from H(Tr-1).
Arithmetic:
2n  2(2n1 )
Modular arithmetic:
2n (mod p)  2(2n1 (mod p))(mod p)
Karp-Rabin

How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of
T if and only if H(P) = H(Tr)
Modular arithmetic:
If there is an occurrence of P starting at position r
of T
then Hp(P) = Hp(Tr)
There are values of p for which the converse is not
true!
Karp-Rabin

Definition:
If Hp(P) = Hp(Tr) but P doesn’t occur in T starting at
position r, we say there is a false match between P
and T at position r.
If there is some position r such that there is a false
match between P and T at position r, we say there
is a false match between P and T.
Karp-Rabin

Our goal will be to choose a modulus p such that

p is small enough to keep computations efficient.
p is large enough so that the probability of a false
match is kept small.

Prime moduli limit false matches

Definition:
For a positive integer u, п(u) is the number of
primes that are less than or equal to u.

Prime number theorem (without proof):
u
u
  (u)  1.26
ln( u)
ln( u)
Prime moduli limit false matches

Lemma (without proof):
if u ≥ 29, then the product of all the primes that are
less than or equal to u is greater than 2u.

Example: u = 29, the prime numbers less than or
equal to 29 are: 2,3,5,7,11,13,17,19,23,29, their
product is
6,469,693,230 ≥ 536,870,912 = 229
Prime moduli limit false matches

Corollary:
If u ≥ 29 and x is any number less than or equal to
2u, then x has fewer than п(u) distinct prime
divisors.

Proof:
Assume x has k ≥ п(u) distinct prime divisors
q1 , …, qk then
2u ≥ x ≥ q1* …* qk but
q1* …* qk is at least as large as the product of the
first п(u) prime numbers.
Prime moduli limit false matches

Theorem:
Let I be a positive integer, and p a randomly
chosen prime less than or equal to I.
If nm ≥ 29 then
The probability of a false match between P and T is
less than or equal to п(nm) / п(I) .
Prime moduli limit false matches






Proof:
Let R be the set of positions in T where P doesn’t
begin.
nm
|
H
(
P
)

H
(
T
)
|

2

s
We have
sR
By the corollary the product has at most п(nm)
distinct prime divisors.
If there is a false match at position r then p divides
| H ( P )  H (Tr ) | thus also divides sR | H ( P )  H (Ts ) |
p must be in a set of size п(nm) but p was chosen
randomly out of a set of size п(I).
Random fingerprint algorithm

Choose a positive integer I.
Pick a random prime p less than or equal to I, and
compute P’s fingerprint – Hp(P).
For each position r in T, comput Hp(Tr) and test to
see if it equals Hp(P). If the numbers are equal
either declare a probable match or check and
declare a definite match.

Running time: excluding verification O(m).


How to choose I



The smaller I is, computations are more efficient
The larger I is, the probability of a false match
decresses.
Proposition:
When I = nm2
1. The largest number used in the algorithm
requires at most 4(log(n)+log(m)) bits.
2. The probability of a false match is at most
2.53/m.
How to choose I

Proof:
 (nm )
nm ln( nm 2 )
 1.26

2
2
 (nm )
nm ln( nm )
1  ln( n)  2 ln( m )  2.53
 
1.26 
m  ln( n)  ln( m ) 
m
Extensions

An idea: why not choose k primes?

Proposition:
when k primes are chosen randomly and
independently between 1 and I, the probability of a
false match is at most   (nm )  k



  (I ) 

Proof:
We saw that if p allows and error it is in a set of at
most п(nm) integers. A false match can occur only if
each of the independently chosen k primes is in a
set of size of at most п(nm) integers.
An illustaration

k = 4,
n = 250, m = 4000
I = 250*40002 < 232
k
  (nm )   2.53 
12

  
  10
  ( I )   4000 
k
Even lower limits on the error

When k primes are used, the probability of a false
match is at most   (n)  k


  (I ) 

Proof: Suppose a false match occurs at position r.
That means that each of the primes must divide
|H(P)-H(Tr) | ≤ 2n. There are at most п(n) primes
that divide it.
Each prime is chosen from a set of size п(I) and by
chance is a part of a set of size п(n).
Checking for error in linear time



Consider the list L of locations in T where the KarpRabin algorithm declares P to be found.
A run is a maximal interval of starting locations
l1, l2, …, lr in L such that every two numbers differ
by at most n/2.
Let’s verify a run.
Checking for error in linear time

Check the first two declared occurrences explicitly.
P = abbabbabbabbab
T = abbabbabbabbabbabbabbabbabbax…
P=
abbabbabbabbab
T = abbabbabbabbabbabbabbabbabbax…


If there is a false match stop.
Otherwise P is semi periodic with period
d = l1 – l2.
Checking for error in linear time

d is the minimal period.

P = abbabbabbabbab
T = abbabbabbabbabbabbabbabbabbax…
P=
abbabbabbabbab
T = abbabbabbabbabbabbabbabbabbax…
Checking for error in linear time
P = abbabbabbabbab
T = abbabbabbabbabbabbabbabbabbax…


For each i check that li+1 – li = d.
Check the last d characters of li for each i.
Checking for error in linear time
P = abbabbabbabbab
T = abbabbabbabbabbabbabbabbabbax…

Check l1
Checking for error in linear time
P=
abbabbabbabbab
T = abbabbabbabbabbabbabbabbabbax…

Check l2

P is semi periodic with period 3.
Checking for error in linear time
T = abbabbabbabbabbabbabbabbabbax…

Check li+1 – li = 3
Checking for error in linear time
P=
bab
T = abbabbabbabbabbabbabbabbabbax…

For each i check the last 3 characters of li.
Checking for error in linear time
P=
bab
T = abbabbabbabbabbabbabbabbabbax…

For each i check the last 3 characters of li.
Checking for error in linear time
P=
bab
T = abbabbabbabbabbabbabbabbabbax…

For each i check the last 3 characters of li.

Report a false match or approve the run.
Time analysis



No character of T is examined more than twice
during a single run.
Two runs are separated by at least n/2 positions
and each run is at least n positions long. Thus no
character of T is examined in more than two
consecutive runs.
Total verification time O(m).
Time analysis



When we have a false match we start again with a
different prime.
The expected probability of a false match is O(1/m).
We have converted the algorithm to one that never
mistakes with expected linear running time.
Why use Karp-Rabin?




It is efficient and simple.
It is space efficient.
It can be generalized to solve harder problems
such as 2-dimensional string matching.
It’s performance is backed up by a concrete
theoretical analysis.
The Shift-And
Method
The Shift-And Method

We start with the exact match problem.

Define M to be a binary n by m matrix such that:
M(i,j) = 1 iff the first i characters of P exactly match
the i characters of T ending at character j.
M(i,j) = 1 iff P[1 .. i] ≡ T[j-i+1 .. j]
The Shift-And Method


Let T = california
Let P = for
M=


1
2
3
4
5
6
7
8
9
m = 10
1
0
0
0
0
1
0
0
0
0
0
2
0
0
0
0
0
1
0
0
0
0
n=3
0
0
0
0
0
0
1
0
0
0
M(i,j) = 1 iff the first i characters of P exactly match
the i characters of T ending at character j.
How does M solve the exact match problem?
How to construct M




We will construct M column by column.
Two definitions are in order:
Bit-Shift(j-1) is the vector derived by shifting the
vector for column j-1 down by one and setting the first
bit to 1.
Example:
 0  1 
   
1   0 
BitShift(1 )  1 
   
 0  1 
1   0 
   
How to construct M


We define the n-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1 for the
positions in P where character x appears.
Example:
1 
 
 0
1 
U
(a
)

P = abaac
 
1 
 0
 
0
 
1 
U (b)   0 
 
0
0
 
0
 
0
U (c )   0 
 
0
1 
 
How to construct M


Initialize column 0 of M to all zeros
For j > 1 column j is obtained by
M ( j )  BitShift( j  1)  U (T ( j ))
An example j = 1
1 2 3 4 5 6 7 8 9 10
T=xabxabaaxa
12345
P=abaac
 0
 
 0
U ( x)   0 
 
 0
 0
 
1
1
0
2
0
3
0
4
0
5
0
2
3
4
5
6
7
8
9
1   0   0 
     
0 0 0
BitShift(0) & U (T (1))   0  &  0    0 
     
0 0 0
0 0 0
     
1
0
An example j = 2
1 2 3 4 5 6 7 8 9 10
T=xabxabaaxa
12345
P=abaac
1 
 
 0
U (a )  1 
 
1 
 0
 
1
2
1
0
1
2
0
0
3
0
0
4
0
0
5
0
0
3
4
5
6
7
8
9
1  1  1 
     
0  0  0
BitShift(1) & U (T (2))   0  & 1    0 
     
 0  1   0 
0  0  0
     
1
0
An example j = 3
1 2 3 4 5 6 7 8 9 10
T=xabxabaaxa
12345
P=abaac
0
 
1 
U (b)   0 
 
0
0
 
1
2
3
1
0
1
0
2
0
0
1
3
0
0
0
4
0
0
0
5
0
0
0
4
5
6
7
8
9
1   0   0 
     
 1   1  1 
BitShift(2) & U (T (3))   0  &  0    0 
     
 0  0  0
 0  0  0
     
1
0
An example j = 8
1 2 3 4 5 6 7 8 9 10
T=xabxabaaxa
12345
P=abaac
1 
 
 0
U (a )  1 
 
1 
 0
 
1
2
3
4
5
6
7
8
1
0
1
0
0
1
0
1
1
2
0
0
1
0
0
1
0
0
3
0
0
0
0
0
0
1
0
4
0
0
0
0
0
0
0
1
5
0
0
0
0
0
0
0
0
9
 1   1  1 
     
1   0   0 
BitShift(7) & U (T (8))   0  & 1    0 
     
 1   1  1 
 0  0  0
     
1
0
Correctness

1)
2)
For i > 1, Entry M(i,j) = 1 iff
The first i-1 characters of P match the i-1characters
of T ending at character j-1.
Character P(i) ≡ T(j).

1) is true when M(i-1,j-1) = 1.
2) is true when the i’th bit of U(T(j)) = 1.

The algorithm computes the and of these two bits.

Correctness
1 2 3 4 5 6 7 8 9 10
T=xabxabaaxa
abaac



1
2
3
4
5
6
7
8
9
1
0
1
0
1
0
0
1
0
1
1
0
1
2
0
0
1
0
0
1
0
0
0
0
3
0
0
0
0
0
0
1
0
0
0
4
0
0
0
0
0
0
0
1
0
0
5
0
0
0
0
0
0
0
0
0
0
M(4,8) = 1, this is because a b a a is a prefix of P of
length 4 that ends at position 8 in T.
Condition 1) – We had a b a as a prefix of length 3
that ended at position 7 in T ↔ M(3,7) = 1.
Condition 2) – The fourth bit of P is the eighth bit of
T ↔ The fourth bit of U(T(8)) = 1.
How much did we pay?



Formally the running time is Θ(mn).
However, the method is very efficient if n is the size
of a single or a few computer words.
Furthermore only two columns of M are needed at
any given time. Hence, the space used by the
algorithm is O(n).
agrep: The Shift-And Method with
errors

We extend the shift-and method for finding inexact
occurrences of a pattern in a text.

Reminder example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at
position 4,
it also occurs with 4 mismatches starting at position
2.
aatatccacaa
aatatccacaa
atcgaa
atcgaa
agrep

Our current goal given k find all the occurrences of
P in T with up to k mismatches.

We define the matrix Mk to be an n by m binary
matrix, such that:
Mk (i,j) = 1 iff
At least i-k of the first i characters of P match the i
characters up through character j of T.


What is M0?
How does Mk solve the k-mismatch problem?
Computing Mk




We compute Ml for all l=0, … , k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
The j’th column of Ml is given by:
Computing Mk

The first i-1 characters of P match a substring of T
ending at j-1, with at most l mismatches, and the
next pair of characters in P and T are equal.
j-1
*
*
*
*
*
*
*
*
*
*
i-1

BitShift( M ( j  1))  U (T ( j ))
l
Computing Mk

The first i-1 characters of P match a substring of T
ending at j-1, with at most l -1 mismatches.
j-1
*
*
*
*
*
*
*
*
*
*
i-1

BitShift( M
l 1
( j  1))
Computing Mk




We compute Ml for all l=1, … , k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
The j’th column of Ml is given by:
M l ( j) 
[ BitShift( M l ( j  1))  U (T ( j ))] 
BitShift( M l 1 ( j  1))
Example: M1
1 2 3 4 5 6 7 8 9 10
T=xabxabaaxa
P=
abaac
1
2
3
4
5
6
7
8
9
1
0
1
1
1
1
1
1
1
1
1
1
1
2
0
0
1
0
0
1
0
1
1
0
3
0
0
0
1
0
0
1
0
0
1
4
0
0
0
0
1
0
0
1
0
0
5
0
0
0
0
0
0
0
0
1
0
1
2
3
4
5
6
7
8
9
1
0
1
0
1
0
0
1
0
1
1
0
1
2
0
0
1
0
0
1
0
0
0
0
3
0
0
0
0
0
0
1
0
0
0
4
0
0
0
0
0
0
0
1
0
0
5
0
0
0
0
0
0
0
0
0
0
M 0=
Example: M1
1 2 3 4 5 6 7 8 9 10
T=xabxabaaxa
P= abaa
1
2
3
4
5
6
7
8
9
1
0
1
1
1
1
1
1
1
1
1
1
1
2
0
0
1
0
0
1
0
1
1
0
3
0
0
0
1
0
0
1
0
0
1
4
0
0
0
0
1
0
0
1
0
0
5
0
0
0
0
0
0
0
0
1
0
1 
 
 0
U (a )  1 
 
1 
 0
 
How much did we pay?



Formally the running time is Θ(kmn).
Again, the method is practically efficient for small n.
Still only a constant number of columns of M are
needed at any given time. Hence, the space used
by the algorithm is O(n).
The match count problem
The match-count problem

We want to count the exact number of characters
that match each of the different alignments of P
with T.
aatatccacaa
atcgaa
4
aatatccacaa
atcgaa
2
The match-count problem



We will first look at a simple algorithm which
extends the techniques we’ve seen so far.
Next, we introduce a more efficient algorithm that
exploits existing efficient methods to calculate the
Fourier transform.
We conclude with a variation that gives good
performance for unbounded alphabets.
Match-count Algorithm 1

We define the matrix MC to be an n by m integer
valued matrix, such that:
MC(i,j) =
The number of characters of P[1..i] that match T[jI+1,..,j]

How does MC solve the match-count problem?
Computing MC


Initialize column 0 of MC to all zeros
For j ≥ 1 column j is obtained by
 MC (i  1, j  1)
MC (i , j )  
 MC (i  1, j  1)  1

P (i )  T ( j )
Otherwise
Total of Θ(nm) comparisons and (simple) additions.
Match-count algorithm 2

Define a vector W that counts the matching
symbols, it’s indices are the possible alignments.
T=ababcaaa
P=abca
W(1) = 2
ababcaaa
abca
W(3) = 4
ababcaaa
abca
W(5) = 1
ababcaaa
abca
W(2) = 0
ababcaaa
abca
W(4) = 1
Match-count algorithm 2

Let’s handle one symbol at a time:
T=ababcaaa
P=abca
Ta = 1 0 1 0 0 1 1 1
Pa = 1 0 0 1
Wa(1) = 1
10100111
1001
Wa(3) = 2
Match-count algorithm 2

We have
W = Wa + Wb + Wc.

Or in the general case

W  W
 
Match-count algorithm 2

We can calculate Wα using a convolution.


Let’s rephrase the problem.
X = Tα padded with n zeros on the right.
Y = Pα padded with m zeros on the right.

We have two vectors X,Y of length m+n.

Match-count algorithm 2

Ta = 1 0 1 0 0 1 1 1
Pa = 1 0 0 1
X=101001110000
Y=100100000000
Match-count algorithm 2

In our modified representation:
W (i ) 
n  m 1
 X ( j )  Y (i  j )
j 0

Where the indices are taken modulo n+m.
W(1) =
W(2) =
< 1 0 1 0 0 1 1 1 0 0 0 0,
100100000000 >
<010100111000,
010010000000 >
Match-count algorithm 2

In our modified representation:
W (i ) 
n  m 1
 X ( j )  Y (i  j )
j 0

Where the indices are taken modulo n+m.

This is the convolution of X and the reverse of Y.

Using FFT calculating convolution takes time
O(m log(m)).
Match-count algorithm 2

The total running time is O(|∑| m log(m))

What happens if |∑| is large?

For example when |∑| =n, we get O(n m log(m))
which is actually worse than the naïve algorithm.
Match-count algorithm 3



An idea: some symbols might appear more often
than others.
Use convolutions for the frequent symbols.
Use a more simple counting method for the rest.
Rare symbols



Say α appears less than c times in P.
Record the locations of α in P
l1,…,lr r ≤c.
Go over the text, when we see α at location j we
increment W(j-l1+1) , … , W(j-lr+1+1).
Rare symbols
T = a b a b c a a a…
P=abcac
l1 = 3, l2 = 5
j = 5 → W(5-3+1)++
W(3)++
a b a b c a a a…
abcac
W(5-5+1)++
W(1)++
T = a b a b c a a a...
abcac
How much did we pay?



We can do this for all the rare symbols in one
sweep of T, for each position in T we make up to c
updates in W.
Thus handling the rare symbols will cost us O(cm).
For the frequent symbols we pay one convolution
per symbol so we pay at most
O(n/c m log(m)).
Determining c

We choose the c that gives us the best balance
n
cm  m  log( m ) 
c
c 2  n  log( m ) 
c  n  log( m )

The total running time is O(m n  log( m) )
References

Dan Gusfield,
Algorithms on Strings, Trees and Graphs.
Cambridge Univ. Press, Cambridge,1997.
The end