String Matching Algorithms
Download
Report
Transcript String Matching Algorithms
Algorithms and
Data Structures
Outline
Data Structures
Space Complexity
Case Study: string matching
Array implementation (e.g. KMP alg.)
Tree implementation (e.g. suffix tree)
/course/eleg67701-f/Topic-1b
2
Algorithm in action: data
structure transformation
Algorithm
Input data
structure
Intermediate
data structure
/course/eleg67701-f/Topic-1b
Output data
structure
3
Basic Data Structures
Scalar or “Atomic” data structures
Building blocks for other data structures
Cannot be divided into sub-elements
Integer, floating-point, character, access (pointer) types
Composite data structures
arrays, records
Data Abstraction
Abstract Data Types: A collection of data values together
with a set of well-specified operations on that data, e.g. list,
stack, queue, trees etc.
/course/eleg67701-f/Topic-1b
4
Scalar Data Structure
Physical Layout in
the Computer Memory
Conceptual View
var1
value
0238
0239
Variable name
0240
Assignment operation:
0241
var1 value;
0242
var2 var1;
0243
var1 var3;
0244
Memory address
/course/eleg67701-f/Topic-1b
value
0245
5
Composite Data Structure: Array
Physical Layout in
the Computer Memory
Conceptual View
A
v1
v2
v3
v4
v5
0
1
2
3
4
Variable name
Array A[1..5]
0238
0239
v1
0240
v2
0241
A[0] 5
v3
0242
v4
k 1
0243
v5
A[k] 11
0244
nil
Accessing array elements:
A[k+1] A[k] + 3
Memory address
/course/eleg67701-f/Topic-1b
0245
6
Data Abstraction: Tree
Physical Layout in
the Computer Memory
Conceptual View
T
T 0238
v1
v2
___
__
_
v3
___
__
_
___
__
_
v4
Accessing the elements:
T.value 12
T.left new(T)
___
__
_
___
__
_
Memory address
T.right new(T)
0238
0241
0239
v1
0240
0244
0241
nil
0242
v2
0243
nil
0244
nil
0245
v3
0247
/course/eleg67701-f/Topic-1b
..
7
Space Analysis
Storage space, like time, is another limited
resource that is important to programmers
Space requirements are also expressed as a
function of the input size
Space functions are classified in the same
manner as running times
/course/eleg67701-f/Topic-1b
8
Complexity Analysis: Sorting
Algorithm
Time-Complexity
Space-Complexity
Insertionsort
O(n2)
O(n)
Quicksort
O(n.log n)
O(n)
/course/eleg67701-f/Topic-1b
9
Space-Time Tradeoff
Reductions in running time are often possible
if we increase storage requirements
Decreasing the amount of storage used by an
algorithm usually results in longer running
times
Using an array to lookup previously computed
values can drastically increase the speed of a
function
/course/eleg67701-f/Topic-1b
10
Case Study: Searching for Patterns
Problem: find the first occurrence of
pattern P of length m inside the text S of
length n.
String matching problem
/course/eleg67701-f/Topic-1b
11
String Matching - Applications
Text editing
Term rewriting
Lexical analysis
Information retrieval
And, bioinformatics
/course/eleg67701-f/Topic-1b
12
Model for Pattern-Matching Problem
Pattern
P
Pattern
Matcher
generator
Input string
S
Pattern
Matcher
/course/eleg67701-f/Topic-1b
Yes
No
13
Array Implementation
Text S represented as an array of characters: S [1..n]
Pattern P represented as an array of characters: P [1..m]
S
a
g
c
a
g
a
a
g
a
g
P Pag Pag Pagag Pgag Pag Pga
a
g
a
g
a
g
g
t
a
Time complexity = O(m.n)
Space complexity = O(m + n)
/course/eleg67701-f/Topic-1b
14
Can we be more clever ?
When a mismatch is detected, say at
position k in the pattern string, we have
already successfully matched k-1 characters.
We try to take advantage of this to decide
where to restart matching
S
a
P
g
a
g
c
a
Pga Pag
a
g
g
a
a
g
a
g
Pga Pag
ag
g
g
a
a
g
g
/course/eleg67701-f/Topic-1b
t
a
15
Problem of Matching Keyword
PROBLEM. Given a pattern p consisting of a single
keyword and an input string s, answer “yes” if p
occurs as a substring of s, that is, if s=xpy, for some
x and y; “no” otherwise.
For convenience, we will assume p=p1p2…pm and
s=s1s2…sn where pi represents the ith character of
the pattern and sj the jth character of the input
string.
/course/eleg67701-f/Topic-1b
16
The Knuth-Morris-Pratt Algorithm
Observation: when a mismatch occurs, we may not
need to restart the comparison all way back (from
the next input position).
What to do:
Constructing a table h, called the next function, that
determines how many characters to slide the pattern to the
right in case of a mismatch during the pattern-matching
process.
Knuth, D. E., Morris, J.H. and Pratt, V. R., Fast Pattern Matching
Algorithm for Strings, SIAM J. Comput Sci., 43, 1977, 323-350
/course/eleg67701-f/Topic-1b
17
The key idea is that if we have successfully matched the prefix
p=p1p2…pi-1 of the keyword with the substring sj-i+1 sj-i+2…
sj-1 of the input string and pi = sj, then we do not need to
reprocess any of the suffix sj-i+1 sj-i+2… sj-1 since we know
this portion of the text string is the prefix of the keyword
that we have just matched.
/course/eleg67701-f/Topic-1b
18
Note that the inner while loop will iterate as long as p_i
and s_j do not match each other. Once they match, the inner
while loop terminate, both i and j will shift by one, and inner
loop repeats ...
/course/eleg67701-f/Topic-1b
19
An Important Property of the
Next Function in KMP
Algorithm
The largest k less than i such that p1p2…pk-1 is a
suffix of p1p2…pi-1 (i.e., p1…pk-1 = pi-k+1…pi-1) and
pi = pk. if there is no such i, then hi=0
/course/eleg67701-f/Topic-1b
20
Backtrack or Not Backtrack ?
Assume P(i) = S(j)for some i and j, what
should we do?
KMP algorithm chose not to backtrack on the text
S (e.g. j) for a good reason
The choice is how to shift the pattern P (e.g. i) –
i.e. by how much
If for each j, the shift of P is a small constant, then
the total time complexity is clearly linear in n
/course/eleg67701-f/Topic-1b
21
An Example
Given:
Patten:
Next funciton:
Next function:
1
a
0
0
2
b
1
1
Input string:
3
a
0
0
4
a
2
2
5
b
1
1
6
a
0
7
b
4
0
8
a
0
4
0
9
a
2
2
10
b
1
1
11
a
0
0
12
a
7
7
13
b
1
1
abaababaabacabaababaabaab.
i = 12
Scenario 1:
a b a a b a b a a b a a b
a b a a b a b a a b a c a b a a b a b a a b a a b
j = 12
What is hi = h12 = ?
Scenario 2:
hi = 7
h12 = 7, i = 7
i
a b a a b a b a a b a a b
a b a a b a b a a b a c a b a a b a b a a b a a b
j
/course/eleg67701-f/Topic-1b
22
An Example
Scenario 3:
h7 = 4, i = 4
(Contn’d)
i
a b a a b a b a a b a a b
a b a a b a b a a b a c a b a a b a b a a b a a b
j
Subsequently
i = 2, 1, 0
Finally, a match is found:
i
a b a a b a b a a b a a b
a b a a b a b a a b a c a b a a b a b a a b a a b
j
/course/eleg67701-f/Topic-1b
23
Question:
when P(i) = S(j), how much should we shift?
Pattern
Input
i=1
i
Pi
j=1
j
Sj
P
S
Observations:
We should shift P to the right
But – by how much?
One answer is: do not backtrack S(j)
/course/eleg67701-f/Topic-1b
24
Observation: Never backtrack on the input
string S.
/course/eleg67701-f/Topic-1b
25
How to Compute the Next Function?
j:= hj
hi:= hj
/course/eleg67701-f/Topic-1b
hi := j
26
How to Compute the Next Function?
j:= hj
hi:= hj
hi := j
Note: once p_i does not match p_j -- we know that j should be
the index to be found where a prefix before i matches a suffix ends at j
/course/eleg67701-f/Topic-1b
27
Interpretation of the Next Function
1
2
3
4
5
6
7
8
9
a
b
a
a
b
a
b
a
a
a
b
a
a
b
a
b
a
a
0
1
0
2
1
0
4
0
2
Note: P2 = P5
P4 = P9
Interpretation
Question: how to compute the next function?
/course/eleg67701-f/Topic-1b
28
Interpretation of the Next Function
1
2
3
4
5
6
7
8
9
a
b
a
a
b
a
b
a
a
a
b
a
a
b
a
b
a
a
0
1
0
2
1
0
4
0
2
Note: P1 = P5
P4 = P9
Interpretation
Question: how to compute the next function?
/course/eleg67701-f/Topic-1b
29
Interpretation of the Next Function
1
2
3
4
5
6
7
8
9
a
b
a
a
b
a
b
a
a
a
b
a
a
b
a
b
a
a
0
1
0
2
1
0
4
0
2
Note: P1 = P5
P4 = P9
Interpretation
Question: how to compute the next function?
/course/eleg67701-f/Topic-1b
30
KMP - Analysis
The KMP algorithm never needs to
backtrack on the text string.
preprocessing
searching
Time complexity = O(m + n)
Space complexity = O(m + n)
/course/eleg67701-f/Topic-1b
31
KMP Algorithm Complexity
Analysis Hints
What is the cost in the building of the
next function? (hint: in the code for the next
function, the operation j=h_j in the inner loop is
never executed more often than the statement i
:= i+1 in the outer loop)
What is the cost of the matching itself?
(hint: similar to the above)
/course/eleg67701-f/Topic-1b
32
Other String Matching Algorithms
The Boyer-Moore Algorithm [Boyer, R. S. and
Moore, J. E., A Fast String Searching Algorithm, CACM,
20(10), 1977, 62-72]
The Karp-Rabin Algorithm [Karp, R. M. and
Rpbin, M. O., Efficient Randomized Pattern-Matching
Algorithm, IBM J. of Res. And Develop., 32(2), 1987, 249260].
/course/eleg67701-f/Topic-1b
33
Matching of A Set of Key Words ?
Given a pattern of a set of keywords and an
input string S, answer “yes” if some
keywords occur as a substring of S, and “no”
otherwise.
How to solve this ?
/course/eleg67701-f/Topic-1b
34
How about repeatedly apply
KMP ?
What time complexity KMP algorithm will have
when do a matching of k patterns?
- Preprocessing each of the k patterns: assume each
pattern has 0(m) in length, this will take 0(km) time
- Searching each pattern will take o (n) time per
pattern
so, total time = k • o(m+n)
/course/eleg67701-f/Topic-1b
35
Question:
Can we improve the time complexity
when k is large?
Answer:
Yes, preprocessing the input string –
tree implementation.
/course/eleg67701-f/Topic-1b
36
Model for Pattern-Matching Problem
Pattern
P
Pattern
Matcher
generator
Input
string
S
Pre Processing
Pattern
Matcher
/course/eleg67701-f/Topic-1b
Yes
No
37
Tree Implementation -- suffix tree
Instead of preprocessing the pattern (P), preprocess
the text T !
Use a tree structure where all suffixes of the text are
represented;
Search for the pattern by looking for substrings of
the text;
You can easily test whether P is a substring of T
because any substring of T is the prefix of some
suffix.
/course/eleg67701-f/Topic-1b
38
Suffix Tree
Con’d
A suffix tree T for an m-character string S is a rooted directed tree with
exactly m leaves numbered 1 to m. Each internal node, other than the root,
has at least two children and each edge is labeled with a nonempty
substring of S. No two edges out of a node can have edge-labels beginning
with the same character. The key feature of the suffix tree is that for any
leaf i, the concatenation of the edge-labels on the path from the root to leaf
i exactly spells out the suffix of S that starts at position i. That is, it spells
3
out S[i…m].
x a
b
x
a
c
w c
u
c
4
c
6
2
Suffix tree for string xabxac. The node labels u and w on the two interior nodes will be used.
/course/eleg67701-f/Topic-1b
39
Note on Suffix Tree
Not all strings guaranteed to have
corresponding suffix trees
For example:
consider xabxa: it does not have a suffix tree:
because here xa is both a prefix and suffix
(I.e. xa does not necessarily ends at a leaf)
How to fix the problem: add $ - a special
“termination” character to the alphabet.
/course/eleg67701-f/Topic-1b
40
Algorithm for Constructing a
Suffix Tree
A subtree can be constructed in linear time
[Weiner73, McCreight76, Ukkonen95]
/course/eleg67701-f/Topic-1b
41
Suffix Tree
preprocessing
searching
Time complexity = O(n + m)
Space complexity = O(m + n)
/course/eleg67701-f/Topic-1b
42
Question
How to use suffix tree to help solving the
string matching problem ?
/course/eleg67701-f/Topic-1b
43
Other Tree based Methods
Suffix tree is not the only one ..
/course/eleg67701-f/Topic-1b
44