Transcript PPT

On Demand String Sorting over
Unbounded Alphabets
Carmel Kent
Moshe Lewenstein
Dafna Sheinwald
On-Demand String Sorting

Preprocessing a set of n strings for an efficient subsequent repeated:


Motivation, e.g.: Search engines recurrently return the next best k (k < n) pages





Pages typically ranked by relevance, but can also by values of specified field
Our Heap of Strings (HoS ) preprocesses in O (n) time and extracts next smallest in
O (log n) time + amortized O (N ) time, over all operations on all n strings


extract the next lexicographically smallest string
N = total length of the n strings.
Combines Heap and Longest Common Prefix properties
Works for unbounded alphabets
e Sorting of Strings is possible in O(n log n + N ) time.
Implementing all classic Heap operations, only paying extra O (N ) amortized, is not
simple
LCP


Def: lcp (S1 , S2) denotes the length of the largest
common prefix of S1 and S2
Folklore Lemma:


e For strings S1 m S2, S3



For strings S1 , S2,…,Sm , lcp (S1,Sm) = min1[i<m lcp (Si ,Si+1)
S2 [ S3 t lcp (S1,S2) [
Equivalently: lcp (S1,S2) >
lcp (S1,S3)
lcp (S1,S3) t S2 > S3
e For strings S1 [ S2, S3


S2 [ S3 t lcp (S1,S2) m
Equivalently: lcp (S1,S2) <
lcp (S1,S3)
lcp (S1,S3) t S2 > S3
Heap of Strings (HoS)



Binary (balanced) tree
Each node n holds a string S (n) and an lcp
value lcp (n)
Each node n satisfies the HoS property:


S(n) m S(p) for parent p of n
lcp (n) = lcp (S(n), S(p) )
Procedures for Heapify

Basic Step for 2 (BS2):




Given strings S1, S2,
compare the two, letter by letter,
until smaller is identified, and the lcp of the two is
determined.
Basic Step for 3 (BS3):



Given strings S1, S2, and S3,
compare all three, letter by letter,
until smallest, Si , is identified, and the lcp of each of the other
strings with Si is determined.
More Procedures for Heapify

Basic Step for 2, starting in position l, BS2(l ) :




Given strings S1, S2, with common prefix of length l
compare the two, letter by letter, starting from position l
until smaller, Si , is identified, and the lcp of the two is
determined.
Basic Step for 3, starting in position l, BS3(l ) :



Given strings S1, S2, and S3, with common prefix of length l
compare all three, letter by letter, starting from position l
until smallest, Si , is identified, and the lcp of each of the
other strings with Si is determined.
Heapify based on the classic O(n) process
Strings thrown into
a binary balanced tree
Bottom up,
Subtrees are made into HoS-s
Merge two little HoS-es and a root node
into one big HoS
BS3 of three nodes.
(two larger get lcp wrt smallest)
Swap, if needed, to get
smallest positioned as root
At most one subtree
gets a new root
This new root, as well as two
children, all have larger strings
than grand root, and have
lcp-s wrt it
Merge two little HoS-es and a root node
into one big HoS
Comparing lcp-s suffices.
On equality, read,
from lcp on, BS2(l), or BS3(l),
until smallest is found and
updated lcp determined.
Always record lcp with
strings found larger,
and swap, if needed,
to make smallest the parent.
Thus maintaining HoS property
If swap needed,
continue recursively
sifting down
Merge two little HoS-es and a root node
into one big HoS: Sifting Down


On each swap, a sub-HoS gets a new root.
That new root, as well as its two children, all
have larger strings than, and lcp wrt, old root



now positioned as parent of new root.
e Comparing lcp suffices to tell smallest of
On tie, read more of two (or three) strings,
starting at common lcp, until smallest
found and updated lcp is determined

Record updated lcp with larger string(s).
Sifting down in a HoS of height h


O(h) node operations
For each string comparison, at least one string
has its lcp field increase by the number of letter
comparisons made.
Heapifying into a HoS of n strings
of total length N takes O(n+N) time



A string with lcp = l never gets its prefix of
length l participating in any letter comparison
e No more than a total of O(N) letter
comparisons
e Heapifying completes in
O(n) node operations
+ a total of O(N) letter comparisons.
Extracting next smallest string
from a HoS







Extract the root
Both (now orphan) children are larger than, and have
lcp wrt, their gone parent
Comparing lcp-s suffices to find smaller of the two.
In case of a tie, BS2(l) finds smaller and updates lcp;
record updated lcp with the larger child
Promote smaller child to vacant parent position
Recurse in subtree rooted by promoted child
HoS property maintained.

Tree might become unbalanced, but not higher.
Extracting next smallest string
from a HoS
For a HoS of height h
O(h) node operations.
Thm: Sorting of n strings of total length N,
over unbounded alphabet,
Letter comparison only
is possible in O(n log n + N) time,
from common lcp on.
using O(n) space.
For each letter comparison,
at least one lcp grows.
lcp never decreases.
HoS becomes unbalanced
But height does not grow.
e No more than a total of O(N)
letter comparisons
for heapify followed by sequential
extraction of all strings
O(nlog n + N ) string sorting





String sorting is a classical problem, appears in
textbooks [Knuth, AHU]
Variants: multikey sorting, parallel sorting
Weight balanced ternary search trie [M 79]
achieves this runtime
QuickSort with average sorting time of
O(nlog n + N ) [BS, 97]
Multi phase merge sort, for enhancing cache
utilization [I, 05. IBM in the ’80s]
O(nlog n + N ) string sorting

Indexing data structures: suffix trees, suffix
arrays, BIS [AKLL, 05]




for suffixes of same string
Allow O(nlog n + N ) sorting
Some can adjust to a general set of strings (BIS)
All use O (n log n + N) just to build the data
structure and get the first result out.
Efficient On Demand Sorting

Thm: On Demand Sorting
of n strings of total length N
can be done with the extraction of the first
result in O(n + N1) time,
after which the retrieval of further results in
O(log n + Ni) time for the i-th result,
with Si Ni [ N.
Variation:
Find the smallest k < n strings

Maintain a HoS of k elements, with parents
LARGER than children.



root holds largest of smallest k
Build a HoS from arbitrary k of the set
For each remaining string in the set:



compare with root and determine lcp of the two
if new is larger than root – discard new
otherwise, discard root, and sift down
Find the smallest k < n strings
O (n log k + N) to identify k smallest of n + O ( k log k) to get these sorted.
Can HoS do additional operations
of the ordinary (integer) Heap?

Already seen: Extract min costs O(height)
yet does not maintain tree balanced



The classic delete takes the last leaf and sifts it
down from root, thus maintaining balance
Leaf loses the lcp it has gained, and compares
again its leading letters
Classic insertion by sifting up

Some nodes get their grandparent becoming their
parent, need to decrease their lcp
Insertion, creating embedded
data structure
BIS
Original nodes do not move.
Grandparent do not become
parents
BIS
When pumping up for
extraction, smallest node leaves
BIS and becomes HoS node.
BIS
BIS
BIS
HoS node never gets into
a BIS.
BIS Balanced Indexing Structures
AKLL, 2005


Adapted here from suffixes to any set of strings
BIS is an AVL tree with fixed size extra info (lcp s
and pointers) in nodes


allows to insert a string of length l to a tree of size n in
O (log n + l ) time
Deletion in O(log n)
Altogether

Thm: It is possible to construct a heap of n
strings in O(n) time and support further string
insertions and smallest string extractions in
time O(log(n) + log(m)) + O(N) amortized
over the whole sequence of heap operations,
where m is the number of strings inserted post
heapifying and were not extracted yet.
Conclusion



Combining basic elements, we support a
modern concept: On Demand
lcp is proved again an interesting, useful
measure (lcp with what?)
This is a real need: basic sort and k smallest
sort are implemented in a search engine
product