Transcript PPT

A Simple Optimal Representation for
Balanced Parentheses
Richard Geary, Naila Rahman, Rajeev Raman
(University of Leicester, UK)
and
Venkatesh Raman
(Institute for Mathematical Sciences, Chennai, India)
5th July 2004
CPM 2004
1
A Parentheses Data Structure
• Given: Balanced string of 2n parentheses.
( ( ( ( ))) ( ) ( ) )
• Support operations:
– ENCLOSE ( i )
– FINDCLOSE ( i ), FINDOPEN( i )
– EXCESS ( i )
• Applications to suffix tree, ordinal trees and stacksortable permutations.
5th July 2004
CPM 2004
2
Parentheses Representation
2n bits, O(n) time.
Θ(n lg n ) bits, O(1) time.
O(n) bits, O(1) time.
[Jacobson, `89]
2n+o(n) bits, O(1) time. [Munro, Raman, `01]
2n+o(n) bits, O(1) time. New data structure.
• Our new DS
– is simpler (no perfect hash tables),
– smaller o(n) term,
– uniform o(n) time and space construction algorithm.
• Implemented and shown to be quite practical
– far more compact than D/S using naïve representation,
– speed comparable to D/S using naïve representation.
5th July 2004
CPM 2004
3
XML
• XML: eXtensible Markup Language
– de facto standard for electronic data interchange.
• Document Object Model (DOM) standard API for
manipulating XML documents
– holds all data in memory,
– large memory usage.
5th July 2004
CPM 2004
4
Example XML document
person
<person>
<name>
<first>Bill</first>
<surname>Bloggs</surname>
</name>
name
dob
<dob>
<day>1</day>
<month>April</month>
<year>1961</year>
</dob>
</person>
firstname surname day
month year
• DOM NODE interface has methods PARENT(x),
NEXTSIB(x), PREVSIB(x),
LASTCHILD(x),FIRSTCHILD(x)
5th July 2004
CPM 2004
5
Obvious representation
•  2n pointers
– DOM: 3n.
• Ω(n log n) bits.
5th July 2004
CPM 2004
6
Using parentheses
1
<person>
<name>
<first>Bill</first>
<surname>Bloggs</surname>
</name>
<dob>
<day>1</day>
<month>April</month>
<year>1961</year>
</dob>
</person>
2
3
4
5
6
7
8
parentheses representation:
( ( ( ) ( ) ) ( ( ) ( ) ( ) ) )
12 3 4
5 6 7 8
2n + o(n) bits for tree structure.
5th July 2004
CPM 2004
7
Node interface ops using Parentheses DS
Node interface
PARENT
NEXTSIB
PREVSIB
LASTCHILD
5th July 2004
Parentheses DS
ENCLOSE
FINDCLOSE
FINDOPEN
FINDCLOSE, FINDOPEN
CPM 2004
8
Succinct DOM
• Succinct DOM:
– uses far less space than standard DOM,
– performance competitive with DOM.
• Node interface implemented by natural parentheses ops.
• Operations supported by parentheses data structures
– Jacobson `89,
– Munro and Raman `01,
– Our new data structure.
5th July 2004
CPM 2004
9
Our new D/S
Input: balanced string of 2n parentheses.
Assume recursive data structure to store balanced
string of 2N  2n parentheses.
If N is O(n / lg2 n)
store answers explicitly for every pair of parentheses.
Otherwise
Divide into blocks of size B  (lg N ) / 2
Number of blocks   4 N / lg N
5th July 2004
CPM 2004
10
FINDCLOSE(x)
( ( ( )
( ( ( ) ) ) ( )
(
1 2 3
4 5 6
8 9
7
( ) )
( ) ) )
10
• FINDCLOSE(3)?
• Matching parenthesis inside block – near parenthesis.
• Pre-computed table stores position of matching
parentheses for all near parentheses.
– O(1) time if near parenthesis.
– Table size is O( N (lg N 2 ))
5th July 2004
CPM 2004
11
Pioneer Parentheses
( ( ( )
1 2 3
( ( ( ) ) ) ( )
4 5 6
7
( ( ) ) ( ) ) )
8 9
10
FINDCLOSE(5)?
Matching parenthesis outside block – far parenthesis.
b(p) = block# of parenthesis at position p
 ( p) = position of match of p
q is 1st far parenthesis before p
p is pioneer if b(  ( p))!  b(  (q))
At most 2β-3 open pioneers.
Similarly at most 2β-3 close pioneers.
5th July 2004
CPM 2004
12
Pioneer Family
( ( ( )
1 2 3
( ( ( ) ) ) ( )
4 5 6
7
( ( ) ) ( ) ) )
8 9
10
( ( ) )
• Pioneer family: set of all opening and closing
pioneers along with their matching parentheses.
• Balanced string of size at most 4β-6.
5th July 2004
CPM 2004
13
Our D/S
2N
( ( ( )
( ( ( ) ) ) ( )
(
( ) )
( ) ) )
NND
O(N / lg N)
( ( ) )
Two levels of recursion.
When pioneer family is O(N/lg2N) we store explicit answers.
5th July 2004
CPM 2004
14
Space usage
NND uses O(N lg lg N / lg N) bits.
Tables use O( N lg lg N / lg N) bits.
O(N lg N)
if N is O( n / lg 2 n)
S(N )  
2 N  S( 16 N/ lg N)  O(N lg lg N/ lg N) otherwise
S(n) = 2n+ O(n lg lg n / lg n) = 2n +o(n) bits.
5th July 2004
CPM 2004
15
Pseudo-pioneers
• Near blocks: blocks which have no pioneers.
• Insert pseudo-pioneers at start and end of every
near block.
– Pseudo-pioneers do not effect FINDOPEN(x),
FINDCLOSE(x), ENCLOSE(x)
• Gap between pioneers now at most 2B = O(lg N).
5th July 2004
CPM 2004
16
NND
•
•
•
( ( ( )
( ( ( ) ) ) ( )
(
( ) )
( ) ) )
1 0 0 0
1 0 0 0 0 1 0 0
0 0 0 0
0 0 0 1
2n-bit vector used to find the pioneer for a far
parenthesis.
If pioneer at pos i in parentheses string then 1 at i in
NND.
Operations we need:
–
–
•
•
Find address of most recent 1 at position i
r = Rank(i)
p = Select(r)
Find ith 1in bit vector
p = Select(i)
We want succinct representation.
D/S should be simple and fast.
5th July 2004
CPM 2004
17
NND
Bit vector of length M with N 1s.
Gap between 1s at most (lg M)c.
t = lg M / 2 c lg lg M.
5th July 2004
CPM 2004
18
Select(i)
• Find ith 1 in bit vector.
• Array A1 stores position of every tth 1
– Space is N lg M  / t  O( M lg lg M / lg M ) bits
• Array A2 stores gaps between consecutive 1s
– Space is O( N lg lg M ) or O( M lg lg M / lg M ) bits.
• Table T1 allows us to lookup sum of upto t gap.
– Space is O( M lg lg M ) bits.
SELECT(i)
i’ = (i  1)/t 
i’’ = (i+1)/mod t
y = concat of A2[i’+1],..,A2[i’+i’’]
return A1[x] + T1[y]
5th July 2004
CPM 2004
19
Rank(i)
• Prefix sum at position i.
• Need two more arrays and tables of size at most
O(M lg lg M / lg M) bits.
5th July 2004
CPM 2004
20
Implementation Details
•
•
•
•
C++ on Sun UltraSparc-III and Pentium 4.
Implemented new and optimised Jacobson D/S.
CenterPoint XML for DOM.
Sample of 12 XML documents of varying sizes
and node counts.
• Blocksizes 32, 64, 128 and 256.
• Test was depth first tree walk, counting nodes of a
given XML type.
5th July 2004
CPM 2004
21
Space usage and performance
• Space usage for tree structure
– Std DOM: 96 bits per node.
– Jacobson: 3.3 – 16 bits per node.
– New D/S: 2.9 – 12.8 bits per node.
• Avg performance for succinct D/S relative to std
DOM
– UltraSparc: 1 to 2.5 times slower.
– Pentium 4: 1.7 to 4 times slower.
5th July 2004
CPM 2004
22
Conclusions and Future work
• Conceptually simple succinct representation for
balanced parentheses with O(1) time ops.
• o(n) time and space construction algorithm.
• Improved lower bound term for space bound.
• Relative performance very good on UltraSparc but
poorer on Pentium 4, which has small cache
– Cache optimisation is an interesting problem.
• Complete set of D/S for succinct DOM.
5th July 2004
CPM 2004
23