Compressed suffix arrays and suffix trees with

Download Report

Transcript Compressed suffix arrays and suffix trees with

Compressed suffix
arrays and suffix trees
with applications to text
indexing and string
matching
Jeffrey scott vitter
Roberto Grosssi
Agenda




A (very) short review on suffix arrays
Introduction
Problem Definition
Information theory reasoning
– Simple solution round 2


Compressed suffix arrays in ½*nloglogn +O(n) bits and O(loglogn)
access time
Rank And Select Problem definitions
–


Rank DS
Compressed suffix arrays in ε-1n + O(n) bits and O(logεn) access time
Select data structure (if time permits)
Short review on suffix arrays
A suffix array is a
sorted array of the
suffix of a string S
represented by an
array of pointers to
the suffixes of S
 For example The
string TelAviv and
it’s corresponding
suffix array

S0
S1
S2
S3
S4
S5
S6
telaviv
elaviv
laviv
aviv
viv
iv
v
3 1 5 2 0 6 4
Introduction

Succinct data structures branch

Dna genome strings (small alphabet, large strings)

Mainly a Theoretical article
Problem Definition

The Algorithm Is composed of two phases
– compression
– lookup

Compress :
– given a suffix array Sa compress it to get it’s succinct
representation

lookup(i):
– Given the compressed representation return SA[i]
Some Definitions

We will deal (at first) with binary alphabet
– Σ = {a,b}

We will add a special end of string symbol #

And will set the relation between the characters to be
– a<#<b (*)

Basic Ram Model
– Log(n) word size
– Word lookup and arithmetic in constant time
Information theory reasoning
aaaa# aaab#
12345 12354
aaba#
14253
aabb#
12543
abaa#
34152
abaa#
13524
abab#
41532
abba#
15432
baaa# baab#
23451 23514
baba#
42531
babb#
25143
bbaa#
34521
bbab#
35241
bbba#
45321
bbbb#
54321
Information theory reasoning (2)
 Suffix
array size nlog(n)
 One to one corresponds between the
suffix array to the string
– Construction details
 Number
of possible suffix arrays 2n-1
– Perfect compress n bits (the string itself)
– The cost for lookup Ω(n) see prev lecture
“Simple” solution round 2
different approach
 Let’s
pack together each logn bits to
create a new alphabet.
 So the text length will be n/logn and
the pattern length would be m/logn
 The suffix array will take o(n) bits
 Searching becomes hard (alignment)
– the text is aligned but the pattern isn’t
logn cases
“Simple” solution round 2

the text isn’t aligned the pattern occurs k bit right to a word
boundary

Need to append k bits to the pattern and check it

So we need to check 2^k cases

K~logn => n different cases to check

Assuming we know how much to pad!!
General framework



Abstract Data Type Optimization [Jacobson'89]
# distinct Data structures = C(n) => Each data
structure occupies O(log C(n)) bits.
Doesn’t guarantee the time complexity on the supported
operations
Compressed suffix arrays in
½*nloglogn +O(n) bits and O(loglogn) access time

Recursive method in nature
– Take advantage on the suffixes

Let Sa0 be the uncompressed suffix array
And N0 be it’s size (assume power of 2)

In The k phase of the compression we start with

Sak with the size N k  N2
N
and create Sak+1 with the size N k 1  2
Sak+1 holds the permutation {1..Nk+1}
0
k
0
k 1
Sak+1 Construction



Create the Bk bit vector
Bk[i] = 1 iff Sak[i] is even
create the Rank vector
Rankk(j) counts the number of one bits in the
first j bits of Bk
Create the Ψk(i) vector
–
stores the 0 to 1 companion relation)
 j if Sa k [i] is odd and Sa k [ j ]  Sa k [i]  1
 k (i)  

i
otherwise



Store the even values from Sak in Sak+1
An Example
 The

32 chars string T
abbabbabbabbabaaabababbabbbabba#
An Example
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
Text
a
b
b
a
b
b
a
b
b
a
b
b
a
b
a
a
Sa0
15
16
31
13
17
19
28
10
7
4
1
21
24
32
14
30
B0
0
1
0
0
0
0
1
1
0
1
0
0
1
1
1
1
Rank0
0
1
1
1
1
1
2
3
3
4
4
4
5
6
7
8
Ψ0
2
2
14
15
18
23
7
8
28
10
30
31
13
14
15
16
Example …
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
a
a
b
a
b
a
b
b
a
b
b
b
a
b
b
a
#
30
12
18
27
9
6
3
20
23
29
11
26
8
5
2
22
25
1
1
1
0
0
1
0
1
0
0
0
1
1
0
1
1
0
8
9
10
10
10
11
11
12
12
12
12
13
14
14
15
16
16
16
17
18
7
8
21
10
23
13
16
17
27
28
21
30
31
27
How To compute Sak from Sak-1
 Lemma
1
– Given suffix array Sak let Bk rankk Ψk and Sak+1
Be the result of the transformation performed
by phase k we can construct Sak from Sak+1
by the following formula
Sak[i] = 2* Sak+1[rankk(Ψk(i))]+(Bk[i]-1)
– Let’s split for 2 cases
 Bk[i]
is even
 Bk[i] is odd
Example continue
Sa1
8
14
5
2
12
16
7
15
6
9
3
10
13
4
1
11
B1
1
1
0
1
1
1
0
0
1
0
0
1
0
1
0
0
1
2
2
3
4
5
5
5
6
6
6
7
7
8
8
8
Ψ1
1
2
9
4
5
6
1
6
9
12
14
12
2
14
4
5
Sa2
4
7
1
6
8
3
5
2
B2
1
0
0
1
1
0
0
1
1
1
1
2
3
3
3
4
Ψ2
1
5
8
4
5
1
4
8
Sa3
2
3
4
1
Rank
1
Rank
2
Compress
– We Keep l = O(loglogn) levels
–
–
–
–
All Levels but the Sal level are save implicitly
For each of the level 0..l-1 we save Bj,rankj Ψj
rankj Ψj are stored implicitly
The Size of Sal is
N l * log N l 
n
2
* (log
log log n
n
2
)
log log n
N
n
log log n
* (log
)  n*
 O ( n)
log n
log n
log n
lookup

just compute recursively Sak[i] from Sak+1[i]

Recursion depth loglogn

All data structure going to be used have o(1) access time

O(loglogn) lookup cost
How The Data Is Stored

The Bk bit vector is stored explctiy
– O(Nk) space
– O(1) lookup
– O(Nk) preprocess time

The RankK vector is stored implicitly using Jacobson rank data
structure
– O(Nk(loglognk)/lognk) space
– O(1) lookup
– O(Nk) preprocess time

The Ψk vector is stored implicitly (using rank and select)
- O(1)acess T ime
1
3
n
- using n(  k 1 )  O( k
) bits
2 2
2 log log n
- O( N k  2 k ) preprocesstime
Ψk vector representation
nk
let j be theindex of ith1 in Bk
2
Consider the 2 k symbolsin positions2 k * ( Sak [ j ]  1)..2 k * Sak [ j ]  1
for each1  i 
thissymbolpreceded the(2k Sak [ j ])th suffix in T
for each 2 k bit pattern wekeep a orderedlist of indices j  [1, N k ]
corsponding to it
Let’s Take a look
level0
a list : {2,14,15,18,23,28,30,31} a list  8
b list : {7,8,10,13,16,17,21,27}
b list  8
level1
aa list {}
aa list  0
ab list {9}
ab list  1
ba list {1,6,12,14} ba list  4
bb list {2,4,5}
bb list  3
levlel2
aaaa list {}
aaaa list  0
baaa list {}
baaa list  0
aaab list {}
aaab list  0
baab list {}
baab list  0
aaba list {}
aaba list  0
baba list {1} baba list  1
aabb list {}
aabb list  0
babb list {4} babb list  1
abaa list {}
abaa list  0
bbaa list {}
bbaa list  0
abab list {}
abab list  0
bbab list {}
bbab list  0
abba list {5,8} abba list  2
bbba list {}
bbba list  0
abbb list  0
bbbb list {}
bbbb list  0
abbb list {}
An Example
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
Text
a
b
b
a
b
b
a
b
b
a
b
b
a
b
a
a
Sa0
15
16
31
13
17
19
28
10
7
4
1
21
24
32
14
30
B0
0
1
0
0
0
0
1
1
0
1
0
0
1
1
1
1
Rank0
0
1
1
1
1
1
2
3
3
4
4
4
5
6
7
8
Ψ0
2
2
14
15
18
23
7
8
28
10
30
31
13
14
15
16
Example …
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
a
a
b
a
b
a
b
b
a
b
b
b
a
b
b
a
#
30
12
18
27
9
6
3
20
23
29
11
26
8
5
2
22
25
1
1
1
0
0
1
0
1
0
0
0
1
1
0
1
1
0
8
9
10
10
10
11
11
12
12
12
12
13
14
14
15
16
16
16
17
18
7
8
21
10
23
13
16
17
27
28
21
30
31
27
So What can we do with all the list’s



Concatenate them together in a lexicographical order and form
the Lk list
L1={9,1,6,12,14,2,4,5}
Let’s see how we can compute Ψk (i)
– If Bk[i] is even , it’s simply i
– Otherwise ,
– because all the prefix patterns saved are in sorted order,
– We saved in the Lk list till the point i , entries for all the odd
suffix’s before i , h=i-rank[i]
– So we can look up the h entry in Lk

And it will give us the answer
Simple example
 L2={5,8,2,4}
 Rank2={1,1,1,2,3,3,3,4}
 B2={1,0,0,1,1,0,0,1}
Ψ2={1,5,8,4,5,1,4,8}
 Ψ(3) = ?
 Rank(3) = 1, h= 3-1 , L2[2] = 8
 Ψ(3) =8 

Ψk vector representation

Lemma 2
Given s integers in sorted order ,
each containing w bits ,where s<2w
we can store them with at most
s(2+w-floor(logs))+O(s/loglogs) bits
so that retrieving the hth integer takes constant time
Ψk vector representation
Take the first z=floor(logs) bits of each int, creating the q1..qs int
It’s easy to see that , q1<qi<qi+1<s (we take the msb bits after all)
The rest w-z bits of each int , will be ri
Si
10101010101010101010101010101
qi
1010101010101010101010101
ri
101
Ψk vector representation
Store ri in a simple array, (w-z)*s bits
Store q1..qs in a table supporting select and rank in
constant time.
The table Q is implemented in the following way
Instead of saving the number themselves,
we store q1,q2-q1,q2-q3,… qs-qs-1
in unary representation )0i1(
And add a select data structure.
Ψk vector representation
In order to get qi we simply do select(i) ,
and count the number of zeros before the ith 1
Qi = select(i) - rank(select(i))
Ψk vector representation
The q table size is
the size of the unary string is s+2z <2s + the
select overhead O(s/loglogs)
So we can output Si easily
Si=qi*2w-z+ri
Ψk vector representation
Lemma 3
We can store the concatenated list Lk used for Ψk in
n*(1/2+3/2k+1)+O(n/2kloglogn), so accessing the hth element will
k
take constant time, with preprocessing time o(n/2k+22 )


k
There are 22 lists, number them ,(even the empty ones)
Ψk vector representation
Lemma 3
We can store the concatenated list Lk used for Ψk in
n*(1/2+3/2k+1)+O(n/2kloglogn), so accessing the hth element will
k
take constant time, with preprocessing time o(n/2k+22 )




k
There are 22 lists, number them ,(even the empty ones)
Each Xi integer in the lists, 1<xi<Nk will be transformed into a new
integer by appending it’s list int representation
X` bit size is , 2K+lognk ,
Ψk vector representation
Lemma 3
We can store the concatenated list Lk used for Ψk in
n*(1/2+3/2k+1)+O(n/2kloglogn), so accessing the hth element will
k
take constant time, with preprocessing time o(n/2k+22 )








k
There are 22 lists, number them ,(even the empty ones)
Each Xi integer in the lists, 1<xi<Nk will be transformed into a new
integer by appending it’s list int representation
X` bit size is , 2K+lognk ,
After concatenating all the lists ,we have a Nk/2 sorted numbers sized
2K+lognk bits
Using lemma 2 we get.
O(1) access time
And a space bound of n(1/2+3/2k+1)+O(n/2kloglogn) bits
Sum it up (space complexity)
T henumber of levelis set tobe l  loglogn
Sa l size is O(n) bits
Bk size is O(nk ) bits  O(nk ) preprocessing time
Rankk size is O(nk (loglognk )/lognk )  O(nk ) preprocessing time
 k size is n * (1/2  3/2k 1 )  O(n/2k loglogn) O(nk  2 2 )preprocessing time
k
the totalspace is

n

log log k


nlogn
1
1
2
  n k  O k
l
n
2
2
2


k 1
log k


2


l 1


 1
 1
3
1
n
   k 1  O( k
)   n log log n  6n  O(
)  O(n log log n)
2 log log n  2
log log n
 2 2




the preprocesstimeis 1 nk  2 2  O(n)
l -1
k
Rank data structure





Due to Jacobson
Given a bit vector length n ,Rank[i] is the number of 1 bits
till I
Multilevel approach
We will slice the bit string to log2n chunks.
Between each chunk we will keep rank counter

Each chunk will be divvied into ½ * logn chunks ,
And a counter will be kept between each sub chunks

At The Bottom Level a simple Lookup table will be used.

Log2n chunks
7
Rank
14
3 101
½ logn sub chunks
Lookup table
The output 14+3+1
Rank Analysis
first level
n
n
chunks
,
each
having
a
logn
counter
,
total
of
O(
)
2
log n
logn
we have
2n
nloglogn
subchunks each havinga loglogn counter, totalof O(
)
logn
logn
T heLookuptable takes,
21/2 logn * log n * log log n  O( n * log n * log log n)
o(n) totalspace
Compressed suffix arrays in ε-1n +
O(n) bits and O(logεn) access time





In order to break the space barrier we need to save less
levels =>longer lookup’s
Lets save 3 compressed levels only Sa0 Sal Sal`
L = ceil(loglogn) , l`=ceil(1/2loglogn)
using A Dictionary data structure , which Can say If an
element is member of the Dictionary, and support a rank
query, O(1) time for both queries
  n 
 log     nl *l bits
O
The Space complexity of the dictionary is
  n 

 l 
We keep in 2 dictionaries what items we have in the next
level D0 and Dl (from Sa0->Sal` Sal`->Sal
The Ψ`k function

We define the
Ψ`k function , which maps each 1 to it’s companion 0
 j if Sa k [i]  N k and Sa k [i] is even and Sa k [ j ]  Sa k [i]  1

i otherwise


 k (i)  

Let’s define the φk function to be
 j if Sa k [i]  Nk and Sa k [ j ]  Sa k [i]  1
k (i)  

i otherwise



We just need to merge the indexes in Lk and L`k
Example
1
2
3
Sa1
8
14 5
B1
1
1
10
10
1
2
8
8
1
 `1
k
4
5
6
7
8
9 10 11 12 13 14 15 16
2 12 16 7 15 6
0 1 1
9 4 5
3 11 13
9 11 13
1
6
6
6
0
1
7
1
0
6
8
6
level1
aa list  0
ab list {9}
ab list  1
ba list {1,6,12,14} ba list  4
bb list {2,4,5}
bb list  3
even List's
aa list {10}
aa list  1
ab list {8,11,13}
ab list  3
ba list {7,16}
ba list  2
bb list {3}
bb list  1
3
10 13
4
1
11
1 0 0 1 0 1 0 0
9 12 14 12 2 14 4 5
7 10 11 16 13 3 15 16
7 12 14 16 2 3 4 5
Odd list's
aa list {}
9
T hemerginggives
{10,8,9,11,13,6*,1,6,7,12,14,16,2,3,4,5}
The φk function implementation

Lemma 4 :We can store the concatenated list used for φk
– k =0 in n+O(n/loglogn) bits
– K>0 in n*(1+1/2k-1)+O(n/2kloglogn) , preprocess time of
k
O(n/2k +22 )
– If k>0 simply using lemma 3
– K=0









Encode a,# as 0, and b as 1.
Create a n bit vector , named l
L[f] = 0 iff the list for φ0 is a or # at the f position
We add a select and select0 data structure on top of it. O(n/loglogn)
Also we keep the number of 0 in l as c0,
Query φk(j) is done in the following way
if j = C0 , return select0(c0)
If j<c0 return select0(j)
If j>c0 return select(j-c0)
The Lookup algorithm



Sa[i] , we start walking the φk function i,i`,i``,i```
Sa0[i]+1=Sa0[i`]…
Until reaching entry found in the dictionary D0,
– Let s be the walk length
– And r the entry rank in the dictionary (how many items, already passed
to the next level?)

Using r we start walking the next level
– Let s` be the walk length
– And r` the entry rank in the dictionary

we return the following result Sal [r`]* 2l  s`*2l `  s)

The walk length is , max(s,s`)<2l`<sqr(logn)

So the query time is O(sqr(logn))
The General multilevel Build

For every 0<ε<1 ,
Assume εl is an integer so 2εl<2logεn
Create all the levels , 0, εl,2εl ..l

Number of levels is ε-1+1 => lookup of O(logεn)


The General multilevel Build
1
1
n
n log n
)
)   n(1  k 1 )  O( k
 n  O(
l
2 log log n
2
log log n k il
2
1i  1
 (1   1 )n  O(
n
n
n
)
)  O(  )  (1   1 )n  O(
log log n
log n
log log n
thedicit onaries space

k
n
n log log n
)
)  O(
D k  O(nl l )  O(

log log n
log n
Select data structure

select(i)- returns the i 1 bit in the string

Same idea as rank , a bit more complicated


multilevel approach
At the first level we record the position of every lognloglognth bit,
– Total space o(N/loglogn)


Between each two bits, we keep the following data,
If the distance between them r>(lognloglogn)2
– we keep the absolute pos of all the indexes between them

log2nloglogn
– Other wise we keep , the relative position of each logrloglognth bit


Total space logr*loglogn <log2nloglogn = r/loglogn r<N !!!
Then we keep one more level (the same notions)
– Block size comes to the size of (lgn)4
Select data structure


After that, we keep a lookup table
For every logn/d pattern we save (d>=2)
– Number of 1 bits,
– the location of the ith 1 bit in the pattern

Same as before the space is O(n1/dlognloglogn)

The lookup is then very simple, just walk the levels,


Get a block and ask a query about him using the lookup
table.
Space complexity , O(n/loglogn)