String Matching Algorithms

Download Report

Transcript String Matching Algorithms

Algorithms and
Data Structures
Outline
 Data Structures
 Space Complexity
 Case Study: string matching


Array implementation (e.g. KMP alg.)
Tree implementation (e.g. suffix tree)
/course/eleg67701-f/Topic-1b
2
Algorithm in action: data
structure transformation
Algorithm
Input data
structure
Intermediate
data structure
/course/eleg67701-f/Topic-1b
Output data
structure
3
Basic Data Structures
 Scalar or “Atomic” data structures



Building blocks for other data structures
Cannot be divided into sub-elements
Integer, floating-point, character, access (pointer) types
 Composite data structures

arrays, records
 Data Abstraction

Abstract Data Types: A collection of data values together
with a set of well-specified operations on that data, e.g. list,
stack, queue, trees etc.
/course/eleg67701-f/Topic-1b
4
Scalar Data Structure
Physical Layout in
the Computer Memory
Conceptual View
var1
value
0238
0239
Variable name
0240
Assignment operation:
0241
var1  value;
0242
var2  var1;
0243
var1  var3;
0244
Memory address
/course/eleg67701-f/Topic-1b
value
0245
5
Composite Data Structure: Array
Physical Layout in
the Computer Memory
Conceptual View
A
v1
v2
v3
v4
v5
0
1
2
3
4
Variable name
Array A[1..5]
0238
0239
v1
0240
v2
0241
A[0]  5
v3
0242
v4
k 1
0243
v5
A[k]  11
0244
nil
Accessing array elements:
A[k+1]  A[k] + 3
Memory address
/course/eleg67701-f/Topic-1b
0245
6
Data Abstraction: Tree
Physical Layout in
the Computer Memory
Conceptual View
T
T  0238
v1
v2
___
__
_
v3
___
__
_
___
__
_
v4
Accessing the elements:
T.value 12
T.left new(T)
___
__
_
___
__
_
Memory address
T.right new(T)
0238
0241
0239
v1
0240
0244
0241
nil
0242
v2
0243
nil
0244
nil
0245
v3
0247
/course/eleg67701-f/Topic-1b
..
7
Space Analysis
 Storage space, like time, is another limited
resource that is important to programmers
 Space requirements are also expressed as a
function of the input size
 Space functions are classified in the same
manner as running times
/course/eleg67701-f/Topic-1b
8
Complexity Analysis: Sorting
Algorithm
Time-Complexity
Space-Complexity
Insertionsort
O(n2)
O(n)
Quicksort
O(n.log n)
O(n)
/course/eleg67701-f/Topic-1b
9
Space-Time Tradeoff
 Reductions in running time are often possible
if we increase storage requirements
 Decreasing the amount of storage used by an
algorithm usually results in longer running
times

Using an array to lookup previously computed
values can drastically increase the speed of a
function
/course/eleg67701-f/Topic-1b
10
Case Study: Searching for Patterns
Problem: find the first occurrence of
pattern P of length m inside the text S of
length n.
 String matching problem
/course/eleg67701-f/Topic-1b
11
String Matching - Applications
 Text editing
 Term rewriting
 Lexical analysis
 Information retrieval
 And, bioinformatics
/course/eleg67701-f/Topic-1b
12
Model for Pattern-Matching Problem
Pattern
P
Pattern
Matcher
generator
Input string
S
Pattern
Matcher
/course/eleg67701-f/Topic-1b
Yes
No
13
Array Implementation
Text S represented as an array of characters: S [1..n]
Pattern P represented as an array of characters: P [1..m]
S
a
g
c
a
g
a
a
g
a
g
P Pag Pag Pagag Pgag Pag Pga
a
g
a
g
a
g
g
t
a
Time complexity = O(m.n)
Space complexity = O(m + n)
/course/eleg67701-f/Topic-1b
14
Can we be more clever ?
 When a mismatch is detected, say at
position k in the pattern string, we have
already successfully matched k-1 characters.
 We try to take advantage of this to decide
where to restart matching
S
a
P
g
a
g
c
a
Pga Pag
a
g
g
a
a
g
a
g
Pga Pag
ag
g
g
a
a
g
g
/course/eleg67701-f/Topic-1b
t
a
15
Problem of Matching Keyword
PROBLEM. Given a pattern p consisting of a single
keyword and an input string s, answer “yes” if p
occurs as a substring of s, that is, if s=xpy, for some
x and y; “no” otherwise.
For convenience, we will assume p=p1p2…pm and
s=s1s2…sn where pi represents the ith character of
the pattern and sj the jth character of the input
string.
/course/eleg67701-f/Topic-1b
16
The Knuth-Morris-Pratt Algorithm
Observation: when a mismatch occurs, we may not
need to restart the comparison all way back (from
the next input position).
What to do:
Constructing a table h, called the next function, that
determines how many characters to slide the pattern to the
right in case of a mismatch during the pattern-matching
process.
Knuth, D. E., Morris, J.H. and Pratt, V. R., Fast Pattern Matching
Algorithm for Strings, SIAM J. Comput Sci., 43, 1977, 323-350
/course/eleg67701-f/Topic-1b
17
The key idea is that if we have successfully matched the prefix
p=p1p2…pi-1 of the keyword with the substring sj-i+1 sj-i+2…
sj-1 of the input string and pi = sj, then we do not need to
reprocess any of the suffix sj-i+1 sj-i+2… sj-1 since we know
this portion of the text string is the prefix of the keyword
that we have just matched.
/course/eleg67701-f/Topic-1b
18
Note that the inner while loop will iterate as long as p_i
and s_j do not match each other. Once they match, the inner
while loop terminate, both i and j will shift by one, and inner
loop repeats ...
/course/eleg67701-f/Topic-1b
19
An Important Property of the
Next Function in KMP
Algorithm
The largest k less than i such that p1p2…pk-1 is a
suffix of p1p2…pi-1 (i.e., p1…pk-1 = pi-k+1…pi-1) and
pi = pk. if there is no such i, then hi=0
/course/eleg67701-f/Topic-1b
20
Backtrack or Not Backtrack ?
Assume P(i) = S(j)for some i and j, what
should we do?



KMP algorithm chose not to backtrack on the text
S (e.g. j) for a good reason
The choice is how to shift the pattern P (e.g. i) –
i.e. by how much
If for each j, the shift of P is a small constant, then
the total time complexity is clearly linear in n
/course/eleg67701-f/Topic-1b
21
An Example
Given:
Patten:
Next funciton:
Next function:
1
a
0
0
2
b
1
1
Input string:
3
a
0
0
4
a
2
2
5
b
1
1
6
a
0
7
b
4
0
8
a
0
4
0
9
a
2
2
10
b
1
1
11
a
0
0
12
a
7
7
13
b
1
1
abaababaabacabaababaabaab.
i = 12
Scenario 1:
a b a a b a b a a b a a b
a b a a b a b a a b a c a b a a b a b a a b a a b
j = 12
What is hi = h12 = ?
Scenario 2:
hi = 7
h12 = 7, i = 7
i
a b a a b a b a a b a a b
a b a a b a b a a b a c a b a a b a b a a b a a b
j
/course/eleg67701-f/Topic-1b
22
An Example
Scenario 3:
h7 = 4, i = 4
(Contn’d)
i
a b a a b a b a a b a a b
a b a a b a b a a b a c a b a a b a b a a b a a b
j
Subsequently
i = 2, 1, 0
Finally, a match is found:
i
a b a a b a b a a b a a b
a b a a b a b a a b a c a b a a b a b a a b a a b
j
/course/eleg67701-f/Topic-1b
23
Question:
when P(i) = S(j), how much should we shift?
Pattern
Input
i=1
i
Pi
j=1
j
Sj
P
S
Observations:
 We should shift P to the right
 But – by how much?
 One answer is: do not backtrack S(j)
/course/eleg67701-f/Topic-1b
24
Observation: Never backtrack on the input
string S.
/course/eleg67701-f/Topic-1b
25
How to Compute the Next Function?
j:= hj
hi:= hj
/course/eleg67701-f/Topic-1b
hi := j
26
How to Compute the Next Function?
j:= hj
hi:= hj
hi := j
Note: once p_i does not match p_j -- we know that j should be
the index to be found where a prefix before i matches a suffix ends at j
/course/eleg67701-f/Topic-1b
27
Interpretation of the Next Function
1
2
3
4
5
6
7
8
9
a
b
a
a
b
a
b
a
a
a
b
a
a
b
a
b
a
a
0
1
0
2
1
0
4
0
2
Note: P2 = P5
P4 = P9
 Interpretation
 Question: how to compute the next function?
/course/eleg67701-f/Topic-1b
28
Interpretation of the Next Function
1
2
3
4
5
6
7
8
9
a
b
a
a
b
a
b
a
a
a
b
a
a
b
a
b
a
a
0
1
0
2
1
0
4
0
2
Note: P1 = P5
P4 = P9
 Interpretation
 Question: how to compute the next function?
/course/eleg67701-f/Topic-1b
29
Interpretation of the Next Function
1
2
3
4
5
6
7
8
9
a
b
a
a
b
a
b
a
a
a
b
a
a
b
a
b
a
a
0
1
0
2
1
0
4
0
2
Note: P1 = P5
P4 = P9
 Interpretation
 Question: how to compute the next function?
/course/eleg67701-f/Topic-1b
30
KMP - Analysis
 The KMP algorithm never needs to
backtrack on the text string.
preprocessing
searching
Time complexity = O(m + n)
Space complexity = O(m + n)
/course/eleg67701-f/Topic-1b
31
KMP Algorithm Complexity
Analysis Hints
 What is the cost in the building of the
next function? (hint: in the code for the next
function, the operation j=h_j in the inner loop is
never executed more often than the statement i
:= i+1 in the outer loop)
 What is the cost of the matching itself?
(hint: similar to the above)
/course/eleg67701-f/Topic-1b
32
Other String Matching Algorithms
 The Boyer-Moore Algorithm [Boyer, R. S. and
Moore, J. E., A Fast String Searching Algorithm, CACM,
20(10), 1977, 62-72]
 The Karp-Rabin Algorithm [Karp, R. M. and
Rpbin, M. O., Efficient Randomized Pattern-Matching
Algorithm, IBM J. of Res. And Develop., 32(2), 1987, 249260].
/course/eleg67701-f/Topic-1b
33
Matching of A Set of Key Words ?
 Given a pattern of a set of keywords and an
input string S, answer “yes” if some
keywords occur as a substring of S, and “no”
otherwise.
 How to solve this ?
/course/eleg67701-f/Topic-1b
34
How about repeatedly apply
KMP ?
What time complexity KMP algorithm will have
when do a matching of k patterns?
- Preprocessing each of the k patterns: assume each
pattern has 0(m) in length, this will take 0(km) time
- Searching each pattern will take o (n) time per
pattern
so, total time = k • o(m+n)
/course/eleg67701-f/Topic-1b
35
Question:
Can we improve the time complexity
when k is large?
Answer:
Yes, preprocessing the input string –
tree implementation.
/course/eleg67701-f/Topic-1b
36
Model for Pattern-Matching Problem
Pattern
P
Pattern
Matcher
generator
Input
string
S
Pre Processing
Pattern
Matcher
/course/eleg67701-f/Topic-1b
Yes
No
37
Tree Implementation -- suffix tree
 Instead of preprocessing the pattern (P), preprocess
the text T !
 Use a tree structure where all suffixes of the text are
represented;
 Search for the pattern by looking for substrings of
the text;
 You can easily test whether P is a substring of T
because any substring of T is the prefix of some
suffix.
/course/eleg67701-f/Topic-1b
38
Suffix Tree
Con’d
A suffix tree T for an m-character string S is a rooted directed tree with
exactly m leaves numbered 1 to m. Each internal node, other than the root,
has at least two children and each edge is labeled with a nonempty
substring of S. No two edges out of a node can have edge-labels beginning
with the same character. The key feature of the suffix tree is that for any
leaf i, the concatenation of the edge-labels on the path from the root to leaf
i exactly spells out the suffix of S that starts at position i. That is, it spells
3
out S[i…m].
x a
b
x
a
c
w c
u
c
4
c
6
2
Suffix tree for string xabxac. The node labels u and w on the two interior nodes will be used.
/course/eleg67701-f/Topic-1b
39
Note on Suffix Tree
 Not all strings guaranteed to have
corresponding suffix trees
 For example:
consider xabxa: it does not have a suffix tree:
because here xa is both a prefix and suffix
(I.e. xa does not necessarily ends at a leaf)
 How to fix the problem: add $ - a special
“termination” character to the alphabet.
/course/eleg67701-f/Topic-1b
40
Algorithm for Constructing a
Suffix Tree
 A subtree can be constructed in linear time
[Weiner73, McCreight76, Ukkonen95]
/course/eleg67701-f/Topic-1b
41
Suffix Tree
preprocessing
searching
Time complexity = O(n + m)
Space complexity = O(m + n)
/course/eleg67701-f/Topic-1b
42
Question
 How to use suffix tree to help solving the
string matching problem ?
/course/eleg67701-f/Topic-1b
43
Other Tree based Methods
 Suffix tree is not the only one ..
/course/eleg67701-f/Topic-1b
44