Compressed suffix arrays and suffix trees with
Download
Report
Transcript Compressed suffix arrays and suffix trees with
Compressed suffix
arrays and suffix trees
with applications to text
indexing and string
matching
Jeffrey scott vitter
Roberto Grosssi
Agenda
A (very) short review on suffix arrays
Introduction
Problem Definition
Information theory reasoning
– Simple solution round 2
Compressed suffix arrays in ½*nloglogn +O(n) bits and O(loglogn)
access time
Rank And Select Problem definitions
–
Rank DS
Compressed suffix arrays in ε-1n + O(n) bits and O(logεn) access time
Select data structure (if time permits)
Short review on suffix arrays
A suffix array is a
sorted array of the
suffix of a string S
represented by an
array of pointers to
the suffixes of S
For example The
string TelAviv and
it’s corresponding
suffix array
S0
S1
S2
S3
S4
S5
S6
telaviv
elaviv
laviv
aviv
viv
iv
v
3 1 5 2 0 6 4
Introduction
Succinct data structures branch
Dna genome strings (small alphabet, large strings)
Mainly a Theoretical article
Problem Definition
The Algorithm Is composed of two phases
– compression
– lookup
Compress :
– given a suffix array Sa compress it to get it’s succinct
representation
lookup(i):
– Given the compressed representation return SA[i]
Some Definitions
We will deal (at first) with binary alphabet
– Σ = {a,b}
We will add a special end of string symbol #
And will set the relation between the characters to be
– a<#<b (*)
Basic Ram Model
– Log(n) word size
– Word lookup and arithmetic in constant time
Information theory reasoning
aaaa# aaab#
12345 12354
aaba#
14253
aabb#
12543
abaa#
34152
abaa#
13524
abab#
41532
abba#
15432
baaa# baab#
23451 23514
baba#
42531
babb#
25143
bbaa#
34521
bbab#
35241
bbba#
45321
bbbb#
54321
Information theory reasoning (2)
Suffix
array size nlog(n)
One to one corresponds between the
suffix array to the string
– Construction details
Number
of possible suffix arrays 2n-1
– Perfect compress n bits (the string itself)
– The cost for lookup Ω(n) see prev lecture
“Simple” solution round 2
different approach
Let’s
pack together each logn bits to
create a new alphabet.
So the text length will be n/logn and
the pattern length would be m/logn
The suffix array will take o(n) bits
Searching becomes hard (alignment)
– the text is aligned but the pattern isn’t
logn cases
“Simple” solution round 2
the text isn’t aligned the pattern occurs k bit right to a word
boundary
Need to append k bits to the pattern and check it
So we need to check 2^k cases
K~logn => n different cases to check
Assuming we know how much to pad!!
General framework
Abstract Data Type Optimization [Jacobson'89]
# distinct Data structures = C(n) => Each data
structure occupies O(log C(n)) bits.
Doesn’t guarantee the time complexity on the supported
operations
Compressed suffix arrays in
½*nloglogn +O(n) bits and O(loglogn) access time
Recursive method in nature
– Take advantage on the suffixes
Let Sa0 be the uncompressed suffix array
And N0 be it’s size (assume power of 2)
In The k phase of the compression we start with
Sak with the size N k N2
N
and create Sak+1 with the size N k 1 2
Sak+1 holds the permutation {1..Nk+1}
0
k
0
k 1
Sak+1 Construction
Create the Bk bit vector
Bk[i] = 1 iff Sak[i] is even
create the Rank vector
Rankk(j) counts the number of one bits in the
first j bits of Bk
Create the Ψk(i) vector
–
stores the 0 to 1 companion relation)
j if Sa k [i] is odd and Sa k [ j ] Sa k [i] 1
k (i)
i
otherwise
Store the even values from Sak in Sak+1
An Example
The
32 chars string T
abbabbabbabbabaaabababbabbbabba#
An Example
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
Text
a
b
b
a
b
b
a
b
b
a
b
b
a
b
a
a
Sa0
15
16
31
13
17
19
28
10
7
4
1
21
24
32
14
30
B0
0
1
0
0
0
0
1
1
0
1
0
0
1
1
1
1
Rank0
0
1
1
1
1
1
2
3
3
4
4
4
5
6
7
8
Ψ0
2
2
14
15
18
23
7
8
28
10
30
31
13
14
15
16
Example …
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
a
a
b
a
b
a
b
b
a
b
b
b
a
b
b
a
#
30
12
18
27
9
6
3
20
23
29
11
26
8
5
2
22
25
1
1
1
0
0
1
0
1
0
0
0
1
1
0
1
1
0
8
9
10
10
10
11
11
12
12
12
12
13
14
14
15
16
16
16
17
18
7
8
21
10
23
13
16
17
27
28
21
30
31
27
How To compute Sak from Sak-1
Lemma
1
– Given suffix array Sak let Bk rankk Ψk and Sak+1
Be the result of the transformation performed
by phase k we can construct Sak from Sak+1
by the following formula
Sak[i] = 2* Sak+1[rankk(Ψk(i))]+(Bk[i]-1)
– Let’s split for 2 cases
Bk[i]
is even
Bk[i] is odd
Example continue
Sa1
8
14
5
2
12
16
7
15
6
9
3
10
13
4
1
11
B1
1
1
0
1
1
1
0
0
1
0
0
1
0
1
0
0
1
2
2
3
4
5
5
5
6
6
6
7
7
8
8
8
Ψ1
1
2
9
4
5
6
1
6
9
12
14
12
2
14
4
5
Sa2
4
7
1
6
8
3
5
2
B2
1
0
0
1
1
0
0
1
1
1
1
2
3
3
3
4
Ψ2
1
5
8
4
5
1
4
8
Sa3
2
3
4
1
Rank
1
Rank
2
Compress
– We Keep l = O(loglogn) levels
–
–
–
–
All Levels but the Sal level are save implicitly
For each of the level 0..l-1 we save Bj,rankj Ψj
rankj Ψj are stored implicitly
The Size of Sal is
N l * log N l
n
2
* (log
log log n
n
2
)
log log n
N
n
log log n
* (log
) n*
O ( n)
log n
log n
log n
lookup
just compute recursively Sak[i] from Sak+1[i]
Recursion depth loglogn
All data structure going to be used have o(1) access time
O(loglogn) lookup cost
How The Data Is Stored
The Bk bit vector is stored explctiy
– O(Nk) space
– O(1) lookup
– O(Nk) preprocess time
The RankK vector is stored implicitly using Jacobson rank data
structure
– O(Nk(loglognk)/lognk) space
– O(1) lookup
– O(Nk) preprocess time
The Ψk vector is stored implicitly (using rank and select)
- O(1)acess T ime
1
3
n
- using n( k 1 ) O( k
) bits
2 2
2 log log n
- O( N k 2 k ) preprocesstime
Ψk vector representation
nk
let j be theindex of ith1 in Bk
2
Consider the 2 k symbolsin positions2 k * ( Sak [ j ] 1)..2 k * Sak [ j ] 1
for each1 i
thissymbolpreceded the(2k Sak [ j ])th suffix in T
for each 2 k bit pattern wekeep a orderedlist of indices j [1, N k ]
corsponding to it
Let’s Take a look
level0
a list : {2,14,15,18,23,28,30,31} a list 8
b list : {7,8,10,13,16,17,21,27}
b list 8
level1
aa list {}
aa list 0
ab list {9}
ab list 1
ba list {1,6,12,14} ba list 4
bb list {2,4,5}
bb list 3
levlel2
aaaa list {}
aaaa list 0
baaa list {}
baaa list 0
aaab list {}
aaab list 0
baab list {}
baab list 0
aaba list {}
aaba list 0
baba list {1} baba list 1
aabb list {}
aabb list 0
babb list {4} babb list 1
abaa list {}
abaa list 0
bbaa list {}
bbaa list 0
abab list {}
abab list 0
bbab list {}
bbab list 0
abba list {5,8} abba list 2
bbba list {}
bbba list 0
abbb list 0
bbbb list {}
bbbb list 0
abbb list {}
An Example
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
Text
a
b
b
a
b
b
a
b
b
a
b
b
a
b
a
a
Sa0
15
16
31
13
17
19
28
10
7
4
1
21
24
32
14
30
B0
0
1
0
0
0
0
1
1
0
1
0
0
1
1
1
1
Rank0
0
1
1
1
1
1
2
3
3
4
4
4
5
6
7
8
Ψ0
2
2
14
15
18
23
7
8
28
10
30
31
13
14
15
16
Example …
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
a
a
b
a
b
a
b
b
a
b
b
b
a
b
b
a
#
30
12
18
27
9
6
3
20
23
29
11
26
8
5
2
22
25
1
1
1
0
0
1
0
1
0
0
0
1
1
0
1
1
0
8
9
10
10
10
11
11
12
12
12
12
13
14
14
15
16
16
16
17
18
7
8
21
10
23
13
16
17
27
28
21
30
31
27
So What can we do with all the list’s
Concatenate them together in a lexicographical order and form
the Lk list
L1={9,1,6,12,14,2,4,5}
Let’s see how we can compute Ψk (i)
– If Bk[i] is even , it’s simply i
– Otherwise ,
– because all the prefix patterns saved are in sorted order,
– We saved in the Lk list till the point i , entries for all the odd
suffix’s before i , h=i-rank[i]
– So we can look up the h entry in Lk
And it will give us the answer
Simple example
L2={5,8,2,4}
Rank2={1,1,1,2,3,3,3,4}
B2={1,0,0,1,1,0,0,1}
Ψ2={1,5,8,4,5,1,4,8}
Ψ(3) = ?
Rank(3) = 1, h= 3-1 , L2[2] = 8
Ψ(3) =8
Ψk vector representation
Lemma 2
Given s integers in sorted order ,
each containing w bits ,where s<2w
we can store them with at most
s(2+w-floor(logs))+O(s/loglogs) bits
so that retrieving the hth integer takes constant time
Ψk vector representation
Take the first z=floor(logs) bits of each int, creating the q1..qs int
It’s easy to see that , q1<qi<qi+1<s (we take the msb bits after all)
The rest w-z bits of each int , will be ri
Si
10101010101010101010101010101
qi
1010101010101010101010101
ri
101
Ψk vector representation
Store ri in a simple array, (w-z)*s bits
Store q1..qs in a table supporting select and rank in
constant time.
The table Q is implemented in the following way
Instead of saving the number themselves,
we store q1,q2-q1,q2-q3,… qs-qs-1
in unary representation )0i1(
And add a select data structure.
Ψk vector representation
In order to get qi we simply do select(i) ,
and count the number of zeros before the ith 1
Qi = select(i) - rank(select(i))
Ψk vector representation
The q table size is
the size of the unary string is s+2z <2s + the
select overhead O(s/loglogs)
So we can output Si easily
Si=qi*2w-z+ri
Ψk vector representation
Lemma 3
We can store the concatenated list Lk used for Ψk in
n*(1/2+3/2k+1)+O(n/2kloglogn), so accessing the hth element will
k
take constant time, with preprocessing time o(n/2k+22 )
k
There are 22 lists, number them ,(even the empty ones)
Ψk vector representation
Lemma 3
We can store the concatenated list Lk used for Ψk in
n*(1/2+3/2k+1)+O(n/2kloglogn), so accessing the hth element will
k
take constant time, with preprocessing time o(n/2k+22 )
k
There are 22 lists, number them ,(even the empty ones)
Each Xi integer in the lists, 1<xi<Nk will be transformed into a new
integer by appending it’s list int representation
X` bit size is , 2K+lognk ,
Ψk vector representation
Lemma 3
We can store the concatenated list Lk used for Ψk in
n*(1/2+3/2k+1)+O(n/2kloglogn), so accessing the hth element will
k
take constant time, with preprocessing time o(n/2k+22 )
k
There are 22 lists, number them ,(even the empty ones)
Each Xi integer in the lists, 1<xi<Nk will be transformed into a new
integer by appending it’s list int representation
X` bit size is , 2K+lognk ,
After concatenating all the lists ,we have a Nk/2 sorted numbers sized
2K+lognk bits
Using lemma 2 we get.
O(1) access time
And a space bound of n(1/2+3/2k+1)+O(n/2kloglogn) bits
Sum it up (space complexity)
T henumber of levelis set tobe l loglogn
Sa l size is O(n) bits
Bk size is O(nk ) bits O(nk ) preprocessing time
Rankk size is O(nk (loglognk )/lognk ) O(nk ) preprocessing time
k size is n * (1/2 3/2k 1 ) O(n/2k loglogn) O(nk 2 2 )preprocessing time
k
the totalspace is
n
log log k
nlogn
1
1
2
n k O k
l
n
2
2
2
k 1
log k
2
l 1
1
1
3
1
n
k 1 O( k
) n log log n 6n O(
) O(n log log n)
2 log log n 2
log log n
2 2
the preprocesstimeis 1 nk 2 2 O(n)
l -1
k
Rank data structure
Due to Jacobson
Given a bit vector length n ,Rank[i] is the number of 1 bits
till I
Multilevel approach
We will slice the bit string to log2n chunks.
Between each chunk we will keep rank counter
Each chunk will be divvied into ½ * logn chunks ,
And a counter will be kept between each sub chunks
At The Bottom Level a simple Lookup table will be used.
Log2n chunks
7
Rank
14
3 101
½ logn sub chunks
Lookup table
The output 14+3+1
Rank Analysis
first level
n
n
chunks
,
each
having
a
logn
counter
,
total
of
O(
)
2
log n
logn
we have
2n
nloglogn
subchunks each havinga loglogn counter, totalof O(
)
logn
logn
T heLookuptable takes,
21/2 logn * log n * log log n O( n * log n * log log n)
o(n) totalspace
Compressed suffix arrays in ε-1n +
O(n) bits and O(logεn) access time
In order to break the space barrier we need to save less
levels =>longer lookup’s
Lets save 3 compressed levels only Sa0 Sal Sal`
L = ceil(loglogn) , l`=ceil(1/2loglogn)
using A Dictionary data structure , which Can say If an
element is member of the Dictionary, and support a rank
query, O(1) time for both queries
n
log nl *l bits
O
The Space complexity of the dictionary is
n
l
We keep in 2 dictionaries what items we have in the next
level D0 and Dl (from Sa0->Sal` Sal`->Sal
The Ψ`k function
We define the
Ψ`k function , which maps each 1 to it’s companion 0
j if Sa k [i] N k and Sa k [i] is even and Sa k [ j ] Sa k [i] 1
i otherwise
k (i)
Let’s define the φk function to be
j if Sa k [i] Nk and Sa k [ j ] Sa k [i] 1
k (i)
i otherwise
We just need to merge the indexes in Lk and L`k
Example
1
2
3
Sa1
8
14 5
B1
1
1
10
10
1
2
8
8
1
`1
k
4
5
6
7
8
9 10 11 12 13 14 15 16
2 12 16 7 15 6
0 1 1
9 4 5
3 11 13
9 11 13
1
6
6
6
0
1
7
1
0
6
8
6
level1
aa list 0
ab list {9}
ab list 1
ba list {1,6,12,14} ba list 4
bb list {2,4,5}
bb list 3
even List's
aa list {10}
aa list 1
ab list {8,11,13}
ab list 3
ba list {7,16}
ba list 2
bb list {3}
bb list 1
3
10 13
4
1
11
1 0 0 1 0 1 0 0
9 12 14 12 2 14 4 5
7 10 11 16 13 3 15 16
7 12 14 16 2 3 4 5
Odd list's
aa list {}
9
T hemerginggives
{10,8,9,11,13,6*,1,6,7,12,14,16,2,3,4,5}
The φk function implementation
Lemma 4 :We can store the concatenated list used for φk
– k =0 in n+O(n/loglogn) bits
– K>0 in n*(1+1/2k-1)+O(n/2kloglogn) , preprocess time of
k
O(n/2k +22 )
– If k>0 simply using lemma 3
– K=0
Encode a,# as 0, and b as 1.
Create a n bit vector , named l
L[f] = 0 iff the list for φ0 is a or # at the f position
We add a select and select0 data structure on top of it. O(n/loglogn)
Also we keep the number of 0 in l as c0,
Query φk(j) is done in the following way
if j = C0 , return select0(c0)
If j<c0 return select0(j)
If j>c0 return select(j-c0)
The Lookup algorithm
Sa[i] , we start walking the φk function i,i`,i``,i```
Sa0[i]+1=Sa0[i`]…
Until reaching entry found in the dictionary D0,
– Let s be the walk length
– And r the entry rank in the dictionary (how many items, already passed
to the next level?)
Using r we start walking the next level
– Let s` be the walk length
– And r` the entry rank in the dictionary
we return the following result Sal [r`]* 2l s`*2l ` s)
The walk length is , max(s,s`)<2l`<sqr(logn)
So the query time is O(sqr(logn))
The General multilevel Build
For every 0<ε<1 ,
Assume εl is an integer so 2εl<2logεn
Create all the levels , 0, εl,2εl ..l
Number of levels is ε-1+1 => lookup of O(logεn)
The General multilevel Build
1
1
n
n log n
)
) n(1 k 1 ) O( k
n O(
l
2 log log n
2
log log n k il
2
1i 1
(1 1 )n O(
n
n
n
)
) O( ) (1 1 )n O(
log log n
log n
log log n
thedicit onaries space
k
n
n log log n
)
) O(
D k O(nl l ) O(
log log n
log n
Select data structure
select(i)- returns the i 1 bit in the string
Same idea as rank , a bit more complicated
multilevel approach
At the first level we record the position of every lognloglognth bit,
– Total space o(N/loglogn)
Between each two bits, we keep the following data,
If the distance between them r>(lognloglogn)2
– we keep the absolute pos of all the indexes between them
log2nloglogn
– Other wise we keep , the relative position of each logrloglognth bit
Total space logr*loglogn <log2nloglogn = r/loglogn r<N !!!
Then we keep one more level (the same notions)
– Block size comes to the size of (lgn)4
Select data structure
After that, we keep a lookup table
For every logn/d pattern we save (d>=2)
– Number of 1 bits,
– the location of the ith 1 bit in the pattern
Same as before the space is O(n1/dlognloglogn)
The lookup is then very simple, just walk the levels,
Get a block and ask a query about him using the lookup
table.
Space complexity , O(n/loglogn)