Transcript Slide 1

ESA2008@Universitat Karlsruhe, Sep 15, 2008
An Online Algorithm for Finding
the Longest Previous Factors
Daisuke Okanohara
Kunihiko Sadakane
University of Tokyo
Kyushu University
Problem: Finding the longest
previous factors (matching)
• Input : A text T[0…n-1]
• At all position k, report the longest substring
T[k-len+1…k] that also occurs in previous
positions (history)
T[pos…pos+len-1] = T[k-len+1…k]
– c.f. LZ77, LZ-factorization
a b r a c a d a b r a
a b r a c a d a b r a d
(pos, len) = (0, 4)
(pos, len) = (5, 2)
Applications
• Data Compression
– LZ77, Prediction by Partial Matching
• Pattern Analysis
– Log analysis
• Data Mining
Previous approach
• Sequential search on the fly
– O(n2) time for a text of length n
• Offline- Index approach
– Read an whole text beforehand, and build an
index (suffix array/trees) for it.
– Search the match using the index
[Chen 07] [Chen 08] [Crochemore 08] [Kolpakov 01] [Larsson 99]
– 6n bytes, and O(n log n) time [Chen 08]
Suffix Arrays with Range Minimum Query
New Problem: Online finding the
longest previous factors
• Report match information just after reading
each character
– A case where we don’t know the length of data
beforehand, e.g. streaming data
• Previous approaches cannot deal with this
problem
Our approach for new problem
• Online construction of enhanced prefix
arrays
– Update an index just after reading each
character
– Although many methods used in LZ77 cannot
report the longest match, our method can.
• Succinct data structures
– Keep all information very compactly; using
about the same space for an original text
Prefix arrays
• Keep NOT suffix arrays (SA), but prefix arrays (PA)
– because when a character is added at the last of a text, SA may
cause W (n) changes, but PA not
– In PA, prefixes are sorted in the reverse-lexicographic order
T=aaaa
SA for T
0$
1 a$
2 aa$
3 aaa$
4 aaaa$
Tnew=aaaaz
SA for Tnew
PA for T
PA for Tnew
0$
4 aaaaz$
3 aaaz$
2 aaz$
1 az$
5 z$
0
$
1
$a
2 $aa
3 $aaa
4 $aaaa
0
$
1
$a
2
$aa
3 $aaa
4 $aaaa
5 $aaaaz
Our idea
• Weiner’s suffix tree construction algorithm
– Insert the suffixes from the shortest ones
– Modify it to the insert prefixes form the shortest ones
– Similar idea is used for the incremental construction of
compressed suffix arrays [Chan, et. al 2007], [Lippert 2005]
• We extend this work to the succinct version
– Our algorithm reports matching information as a byproduct of construction
– Do not require tree representation, we just use array
information
Preliminary: Dynamic Rank/Select
Dictionary (DRSD)
• For an text T[0…n-1], DRSD supports:
–
–
–
–
rank(T, c, i): return the number of c in T[0…i]
select(T, c, i): return the position of i-th c in T
insert(T, c, i): insert c at T[i]
delete(T, i): delete T[i]
• These operations can be supported in
time (O(logn) time if s < logn),
bits space
where s is the alphabet size [Lee, et. al. 07],
Preliminary:
Range Minimum Query (RMQ)
• Given an array E[0…n-1] of elements from totally
ordered set, rmq(E, l, r) returns the index of the
smallest element in E[l…r]
– i.e. rmq(E, l, r) = argmink∈[l, r]E[k]
– return the leftmost such element in the tie
• In the static case, RMQ can be supported in O(1)
time using 2n+o(n) bits space [Fischer, 2007]
• In the dynamic case, RMQ/insert/delete can be
supported in O(Tlogn) time using O(n) bits if the
lookup cost (E[i]) is O(T)
Data structures
• Keep the following data structures for T[0…k]
– Assume T[0]=$, $ is the unique smallest character
• B[0…k]: (Prefix-) BW-transformed Text
– B[i] = T[PA[i]+1] and B[i] = $ if PA[i]=k
• H[0…k]: Height Array
– will be explained in the next slide
• C[0…s-1] : Cumulative Array
– C[c] = the total number of characters c’ s.t. c’ < c in T
• s: The position for the next prefix to be inserted
T = $abaababa
i PA prefix
B
0 0
$ a
1
1
$a b
2 4
$abaa b
3 3
$aba a
4 8
$abaababa $
5 6
$abaaba b
6 2
$ab a
7 5
$abaab a
8 7
$abaabab a
H
0
1
1
3
3
0
2
2
0
T = $abaababa
i PA prefix
0
0
1
1
2
4
3
4
5
6
7
8
3
8
6
2
5
7
B
$ a
$a b
$abaa b
$aba
$abaababa
$abaaba
$ab
$abaab
$abaabab
a
$
b
a
a
a
H
0
1
1
3
3
0
2
2
0
PA stores the end
position of each prefix
(we will omit this)
Prefix stores prefixes
sorted in the reverselexicographic order
(Neither PA nor prefix
are stored explicitly)
We can examine PA[i]
by using SAlookup
operation using
O(log2n) time
as in FM-index
[Ferragina 00]
T = $abaababa
i
0
1
2
3
4
5
6
7
8
prefix
$
$a
$abaa
$aba
$abaababa
$abaaba
$ab
$abaab
$abaabab
B
a
b
b
a
$
b
a
a
a
B stores the next
character for each
prefix
H
(Burrows Wheeler’s
0 transform for prefix
1 arrays)
1
3
3
0
2
2
0
T = $abaababa
i
0
1
2
3
4
5
6
7
8
prefix
$
$a
$abaa
$aba
$abaababa
$abaaba
$ab
$abaab
$abaabab
B
a
b
b
a
$
b
a
a
a
H
0
1
1
3
3
0
2
2
0
H stores the length
of the longest
common suffix
between adjacent
prefixes
T = $abaababa
i
0
1
2
3
4
5
6
7
8
prefix
$
$a
$abaa
$aba
$abaababa
$abaaba
$ab
$abaab
$abaabab
B
a
b
b
a
$
b
a
a
a
H
0
1
1
3
3
0
2
2
0
s=4
s denotes the position
where $ in B, and
the longest
prefix is placed.
T = $abaababa
i
0
1
2
3
4
5
6
7
8
prefix
C[c] = the number of characters c’
that is smaller than c in T(=B)
B H
$ a
0
C[$]=0
$a b
1
C[a]=1
$abaa b
1
$aba a
3
$abaababa $
3
$abaaba b
0
C[b]=6
$ab a
2
$abaab a
2
$abaabab a
0
T = $abaababaa
i
0
1
2
3
4
5
6
7
8
The next character `a’ comes !
prefix
$
$a
$abaa
$aba
$abaababa
$abaaba
$ab
$abaab
$abaabab
B
a
b
b
a
$
b
a
a
a
H
0
1
1
3
3
0
2
2
0
T = $abaababaa
i
0
1
2
3
4
5
6
7
8
prefix
$
$a
$abaa
$aba
$abaababa
$abaaba
$ab
$abaab
$abaabab
B
a
b
b
a
$a
b
a
a
a
H
0
1
1
3
3
0
2
2
0
Replace $ in B[s] with a
(because $ is placed in
the position of the
longest prefix)
T = $abaababaa
i
0
1
2
3
4
5
6
7
8
prefix
$
$a
$abaa
$aba
$abaababa
$abaaba
$ab
$abaab
$abaabab
B
a
b
b
a
a
b
a
a
a
H
0
1
1
3
3
0
2
2
0
Find the position for the
new prefix $abaababaa
Count the number of
a in B[0…s-1]
= rank(B, a, s-1) = 2
T = $abaababaa
i
0
1
2
3
4
5
6
7
8
prefix
$
$a
$abaa
$abaababaa
$aba
$abaababa
$abaaba
$ab
$abaab
$abaabab
B
a
b
b
$
a
a
b
a
a
a
H
0
1
1 Insert $abaababaa
at 3rd position in a
3
3
0
2
2
0
C[a]+rank(B, a, s-1)
=3
s := C[a]+rank(B, a, s-1),
insert(B, s, $)
T = $abaababaa
i
0
1
2
3
4
5
6
7
8
prefix
$
$a
$abaa
$abaababaa
$aba
$abaababa
$abaaba
$ab
$abaab
$abaabab
B
a
b
b
$
a
a
b
a
a
a
H
0
1
3
3
0
2
2
0
Update H
This is actually
the length of the
longest match
in the history
T = $abaababa
i
0
1
2
3
4
5
6
7
8
prefix
$
$a
$abaa
$aba
$abaababa
$abaaba
$ab
$abaab
$abaabab
B
a
b
b
a
$
b
a
a
a
H
0
1
1
3
3
0
2
2
0
Recall that in the
previous step, $abaa
and $aba are placed
in the prefixes whose
B is `a’
These positions can be
found by using rank and
select
c. f.
succ(T, `c’, s)
= select(T, c, rank(T, s, c))
T = $abaababa
i
0
1
2
3
4
5
6
7
8
prefix
$
$a
$abaa
$aba
$abaababa
$abaaba
$ab
$abaab
$abaabab
B
a
b
b
a
$
b
a
a
a
H
0
1
1
3
3
0
2
2
0
RMQ(H, 4, 6) = 5,
H[5] = 0
Therefore
RMQ(H, 4, 6) + 1 is the
new value for the
next H entry
T = $abaababa
i
0
1
2
3
4
5
6
7
8
prefix
$
$a
$abaa
$aba
$abaababa
$abaaba
$ab
$abaab
$abaabab
B
a
b
b
a
$
b
a
a
a
H
0
1
1
3
3
0
2
2
0
RMQ(H, 3, 3) = 3,
H[3] = 3
Therefore
RMQ(H, 3, 3) + 1 is the
new value for the next
H entry
T = $abaababaa
i
0
1
2
3
4
5
6
7
8
prefix
$
$a
$abaa
$abaababaa
$aba
$abaababa
$abaaba
$ab
$abaab
$abaabab
B
a
b
b
$
a
a
b
a
a
a
H
0
1
4
1
3
3
0
2
2
0
rmq(H, 3, 3) + 1
rmq(H, 4, 6) + 1
T = $abaababaa
i
0
1
2
3
4
5
6
7
8
prefix
$
$a
$abaa
$abaababaa
$aba
$abaababa
$abaaba
$ab
$abaab
$abaabab
B
a
b
b
$
a
a
b
a
a
a
H
0
1
4
1
3
3
0
2
2
0
Report max(4, 1) = 4
as the length of the
longest factor
and report the position
of $abaa as SAlookup[2]
- len = 0
Report (pos=0, len=4)
as the max. matching
Overall algorithm
All operations are
rank, select, RMQ
Overall Analysis
• H is stored in 2n bits [Sadakane , Soda 02]
– naïve representation requires O(n log n) bits
– requires one SA lookup operation to decode
• B is stored in nlogs+ o(nlogs) bits
– by using dynamic rank/select dictionary
• The bottleneck of our algorithm is
rmq(H, I, r) which requires O(log3n) time
– SAlookup requires O(log2n) time
Overall Analysis (cont.)
• We can solve the online longest previous
factor problem in O(log3n) time for each
character, using nlog2s+ o(nlogs) + O(n) bits
of space
– where s is the alphabet size, and n is the length
of a text
Simulating window buffer
• If the working space is limited, we often discards
the history from the oldest ones
• We can simulate this by using the almost the same
operations as in the insertion operation
• We actually do not discard a character but ignore it
– If we actually discard an oldest character , it may cause
W(n) changes in B and H
– The effect of discarded character is remained (prefixes
are sorted according to the discarded characters)
– But this does not cause the problem if we only report
the matching information up to the history size
Experiments
• In experiment, we used a simpler data structure
(algorithm is same)
– B and H is store in the balanced binary tree
– Each leaf stores the small block of B and H
– We call this implementation as OS
• Compare OS with other offline algorithms
–
–
–
–
Require to read the whole text beforehand
CPSa, CPSd: SA+LCP with stack [Chen, et. al. 07]
CPS6n: SA with RMQ [Chen, et. al. 08]
kk-lz: mreps, specialized for σ=4 [Kolpakov 01]
Peak memory usage
in bytes per input symbol
• The space of OS is smallest in many real data
especially when the values in H is small
Runtime in milliseconds for searching
the longest previous factors
• OS is about 2~10 times slower than the
fastest ones due to the dynamic operations
Conclusion
• Solve online longest matching problem by
using enhanced prefix arrays
– Simple and easy to implement
– Require about 3~6 times space of the input
text
– Actually this is a by-product of construction of
compressed suffix trees c.f. Weiner’s algorithm
• Simple; and much room for improvements
– by using better rank/select/rmq implementation
Future work
• Construction of compressed suffix trees
– Update the parenthesis tree efficiently
– Actually, the time complexity for this is smaller
• Practical improvements
– Currently, dynamic succinct data structure is not
efficient due to cache misses, and memory
fragmentation
– Approximated version of longest matching
problem; enough for many application
Thank you for you attention !
Weiner’s suffix tree’s construction alg.
.
$a
$abraca
$abracada
$abra
$ab
$abracadab
$abracad
$abr
$a
$abraca
$abracada
$abra
$ab
$abracad
$abr
a
a
$
$abrac
$abr
$ab
$abr $abracad
$abrac
$a
$abraca
$abracada
$abra
$ab
$abracadab
$abracad
$abr
$abracadabr
$
$abrac
ba
$
$abr
$abrac
$abr
$abracad
$abracada
abr
a
$
$abrac
ab
$
$abr
$abrac
$ $abracad
$abracad
$abracada