Succinct Data Structures - the David R. Cheriton School of Computer

Download Report

Transcript Succinct Data Structures - the David R. Cheriton School of Computer

Succinct Data Structures
Ian Munro
University of Waterloo
Joint work with David Benoit, Andrej Brodnik, D, Clark,
F. Fich, M. He, J. Horton, A. López-Ortiz, S.
Srinivasa Rao, Rajeev Raman, Venkatesh Raman,
Adam Storm …
How do we encode a large tree or other
combinatorial object of specialized
information
… even a static one
in a small amount of space
and still perform queries in constant time
???
Example of a Succinct Data
Structure:
The (Static) Bounded Subset
Given: Universe of n elements [0,...n-1]
and m arbitrary elements from this universe
Create: a static structure to support search
in constant time (lg n bit word and usual
operations)
Using: Essentially minimum possible # bits
n ))
... lg((m
Operation: Member query in O(1) time
(Brodnik & M.)
Focus on Trees
.. Because Computer Science is .. Arbophilic
- Directories (Unix, all the rest)
- Search trees (B-trees, binary search trees,
digital trees or tries)
- Graph structures (we do a tree based
search)
- Search indices for text (including DNA)
A Big Patricia Trie / Suffix Trie
0
1
0
1
100011



Given a large text file; treat it as bit vector
Construct a trie with leaves pointing to unique
locations in text that “match” path in trie (paths
must start at character boundaries)
Skip the nodes where there is no branching ( n-1
internal nodes)
Space for Trees
Abstract data type: binary tree
Size: n-1 internal nodes, n leaves
Operations: child, parent, subtree size, leaf
data
Motivation: “Obvious” representation of an n
node tree takes about 6 n lg n words (up,
left, right, size, memory manager, leaf
reference)
i.e. full suffix tree takes about 5 or 6 times
the space of suffix array (i.e. leaf
references only)
Succinct Representations of Trees
Start with Jacobson, then others:
There are about 4n/(πn)3/2 ordered rooted
trees, and same number of binary trees
Lower bound on specifying is about 2n bits
What are the natural representations?
Arbitrary Ordered Trees
Use parenthesis notation
 Represent the tree

As the binary string (((())())((())()())):
traverse tree as “(“ for node, then
subtrees, then “)”
 Each node takes 2 bits

Heap-like Notation for a Binary Tree
Add external nodes
Enumerate level by level
1
1
1
1
1
0
0
0
0
0
1
1
1
0
0
0
0
Store vector 11110111001000000 length2n+1
(Here don’t know size of subtrees; can be overcome.
Could use isomorphism to flip between notations)
How do we Navigate?
Jacobson’s key suggestion:
Operations on a bit vector
rank(x) = # 1’s up to & including x
select(x) = position of xth 1
So in the binary tree
leftchild(x) = 2 rank(x)
rightchild(x) = 2 rank(x) + 1
parent(x) = select(x/2)
Rank & Select
Rank -Auxiliary storage ~ 2nlglg n / lg n bits
#1’s up to each (lg n)2 rd bit
#1’s within these too each lg nth bit
Table lookup after that
Select -more complicated but similar notions
Key issue: Rank & Select take O(1) time
with lg n bit word (M. et al)
Aside: Interesting data type by itself
Other Combinatorial Objects
Planar Graphs (Lu et al)
Permutations [n]→ [n]
Or more generally
Functions [n] → [n]
But what are the operations?
Clearly π(i), but also π -1(i)
And then π k(i) and π -k(i)
Suffix Arrays (special permutations) in
linear space
Permutations: a Shortcut Notation
Let P be a simple array giving π; P[i] = π[i]
Also have B[i] be a pointer t positions back
in (the cycle of) the permutation;
B[i]= π-t[i] .. But only define B for every
tth position in cycle. (t is a constant;
ignore cycle length “round-off”)
2
4
5
13
1
8
3
12
10
So array representation
P = [8 4 12 5 13 x x 3 x 2 x 10 1]
1
2
3
4
5
6
7
8
9
10
11
12
13
Representing Shortcuts
In a cycle there is a B every t positions …
But these positions can be in arbitrary order
Which i’s have a B, and how do we store it?
Keep a vector of all positions
0 indicates no B 1 indicates a B
Rank gives the position of B[“i”] in B array
So: π(i) and π -1(i) in O(1) time & (1+ε)n lg n
bits
Theorem: Under a pointer machine model with
space (1+ ε) n references, we need time 1/ε
to answer π and π -1 queries; i.e. this is as
good as it gets.
Getting n lg n Bits: an Aside
This is the best we can do for O(1) operations
But using Benes networks:
1-Benes network is a 2 input/2 output switch
r+1-Benes network … join tops to tops
1
3
2
5
R-Benes Network
3
7
4
8
5
1
6
6
R-Benes Network
7
4
8
2
A Benes Network
Realizing the permutation
(3 5 7 8 1 6 4 2)
1
3
2
5
3
7
4
8
5
1
6
6
7
4
8
2
What can we do with it?
Divide into blocks of lg lg n gates … encode
their actions in a word. Taking advantage
of regularity of address mechanism
and also
Modify approach to avoid power of 2 issue
Can trace a path in time O(lg n/(lg lg n)
This is the best time we are able get for π
and π-1 in minimum space.
Observe: This method “violates” the pointer
machine lower bound by using
“micropointers”.
Back to the main track:
Powers of π
Consider the cycles of π
( 2 6 8)( 3 5 9 10)( 4 1 7)
Keep a bit vector to indicate the start of
each cycle
( 2 6 8 3 5 9 10 4 1 7)
Ignoring parentheses, view as new
permutation, ψ.
Note: ψ-1(i) is position containing i …
So we have ψ and ψ-1 as before
Use ψ-1(i) to find i, then bit vector (rank,
select) to find πk or π-k
Functions
Now consider arbitrary functions [n]→[n]
“A function is just a hairy permutation”
All tree edges lead to a cycle
Challenges here
Essentially write down the components in a
convenient order and use the n lg n bits to
describe the mapping (as per
permutations)
To get fk(i):
Find the level ancestor (k levels up) in a
tree
Or
Go up to root and apply f the remaining
number of steps around a cycle
Level Ancestors
There are several level ancestor techniques
using
O(1) time and O(n) WORDS.
Adapt Bender & Farach-Colton to work in
O(n) bits
But going the other way …
f-k is a set
Moving Down the tree requires
care
f-3( ) = ( )
The trick:
Report all nodes on a given
level of a tree in time
proportional to the number of
nodes, and
Don’t waste time on trees with
no answers
Final Function Result
Given an arbitrary function f: [n]→[n]
With an n lg n + O(n) bit representation we
can compute fk(i) in O(1) time and f-k(i) in
time O(1 + size of answer).
Back to Text … And Suffix Arrays
Text T[1..n] over (a,b)*# (a<#<b)
There are 2n-1 such texts, which of the n! suffix
arrays are valid?
1
2
3
4
5
6
7
8
7
5
1
8
3
6
2
b
8
b
6
a
1
a
3
b
7
a
2
#
5
M= 4
7
isn’t ..why?
1
5
8
2
3
6
SA= 4
is
a
SA-1= 4
Ascending to Max
M is a permutation so M-1 is its inverse
i.e. M-1[i] says where i is in M
Ascending-to-Max:  1  i  n-2
i)M-1[i] < M-1[n] and M-1[i+1] < M-1[n] 
M-1[i] < M-1[i+1]
ii)M-1[i] > M-1[n] and M-1[i+1] > M-1[n] 
M-1[i] > M-1[i+1]
4
4
7
7
5
1
1
5
8
8
3
2
6
3
2
6
OK
NO
Non-Nesting
Non-Nesting: 1  i,j  n-1 and M-1[i]<M-1[j]
i)M-1[i] < M-1[i+1] and M-1[j] < M-1[j+1] 
M-1[i+1] < M-1[j+1]
ii)M-1[i] > M-1[i+1] and M-1[j] > M-1[j+1] 
M-1[i+1] < M-1[j+1]
4
4
7
7
5
1
1
5
8
8
3
2
6
3
2
6
OK
NO
Characterization Theorem for Suffix
Arrays on Binary Texts
Theorem: Ascending to Max & Non-nesting
 Suffix Array
Corollary: Clean method of breaking SA into
segments
Corollary: Linear time algorithm to check
whether SA is valid
Cardinality Queries
T=
a b a a a b b a a a b a a b b#
Remember lengths longest run of a’s and of b’s
SA (broken by runs, but not stored explicitly)
8 3 | 9 4 12| 1 10 5 13|16 | 7 2 11 15|6 14
Ba,  bit vector .. If SA-1[i-1] in an “a” section store 1 in Ba,[SA-1[i]], else 0
Ba
0 0 1 1 0 0 1 1 1 0 0 1 1 0 1 1
Create rank structure on Ba, and similarly Bb, (Note these are reversed except at #)
Algorithm Count(T,P)
s ← 1; e ←n; i ← m;
while i>0 and se do
if P[i[=a then
s← rank1(Ba,s-1)+1; e←rank1(Ba,e)
else
s← na + 2 + rank1(Bb,s-1); e←na + 1 +rank1(Bb,e)
i ← i-1
Return max(e-s+1,0)
Time: O(length of query)
Listing Queries
Complex methods
Key idea: for queries of length at least d,
index every dth position .. For T and
forT(reversed)
So we have matches for T[i..n] and T[1,i-1]
View these as points in 2 space (Ferragina &
Manzini and Grossi & Vitter)
Do a range query (Alstrup et al)
Variety of results follow
General Conclusion
Interesting, and useful, combinatorial
objects can be:
Stored succinctly … O(lower bound) +o()
So that
Natural queries are performed in O(1) time
(or at least very close)
This can make the difference between using
them and not …