Hierachy-conscious Data Structures for String Analysis
Download
Report
Transcript Hierachy-conscious Data Structures for String Analysis
Hierarchy-conscious Data
Structures for String Analysis
Carlo Fantozzi
PhD Student (XVI ciclo)
Bioinformatics Course - June 25, 2002
Outline of the Talk
We will describe data structures
to efficiently deal with memory hierarchies
while analyzing strings
What is a memory hierarchy?
Computational models
Data structures
Memory Hierarchy
Imposed by technology
Increasing size
Decreasing speed
Orders of magnitude!
ALU
L1 inst
L1 data
L2 cache
Relevant data must be kept
in the faster levels
Who should do it?
Maybe the programmer
L3 cache
main memory
disk(s), network
Computational Models
No model describes the whole hierarchy
External memory: I/O model
RAM: limited capacity M
Disk: transfers in blocks of size B
Cost: number of I/O transfers only
Internal memory: ideal cache model
Cache RAM, RAM disk
Cache updates are automatic and optimal
Basic Data Structures: Tries
Trie: tree built on a set of strings
Every string has a path in the tree
PATRICIA trie: nonbranching nodes collapsed,
only 1st character stored
n strings Q(n) space
Loss of information
b
a
a
b
a
b
b
(a,2)
(a,1)
b
a
(b,1)
(b,1)
(a,2)
(b,2)
B-trees
Balanced search tree (enforced)
~n sorted keys per node, which
partition the keys in the subtrees
One node fits one disk block
Searching for keys matching prefix x
takes O(logBN + k/B) I/Os
Updates: O(logBN) I/Os
k1 k2 k3 … kn
String B-trees (Ferragina & Grossi)
Useful for “big” or “irregular” strings
The keys are pointers to strings
Unsorted strings are in a separate array
Who knows lexicographic order?
Each node contains a PATRICIA trie
built on the corresponding strings
search(x): O(logBn + (|x|+k)/B) I/Os
update(x): O(logBn + |x|/B) I/Os
Cache-oblivious B-trees (1)
Developed by Bender, Demaine, Farach
The B-tree is stored in memory
according to the van Emde Boas layout
h
1
1 2 3 4 5
2 3 4 5
recursively!
Balancing rule; long-distance nodes
search(x): O(logBN + k/B) I/Os
update(x): O(logBN) I/Os for suitable B
Cache-oblivious B-trees (2)
Developed by Brodal, Fagerberg and Jacob
1.
2.
3.
Build a dynamic, binary, balanced search tree
Embed it into a static, binary tree
Store it using the van Emde Boas layout
Same I/Os with a simpler data structure
Cache obliviousness:
The ideal cache model can be
simulated on a realistic, multilevel model
with constant slowdown (expected)
Suffix Trees
Flat memory model: s.t. can be built in
I/O model: studied by Farach et al.
1.
2.
3.
linear time if |S| is finite
sorting time if |S| is infinite
Build the odd tree (through aux suffix tree)
Build the even tree (use the odd tree)
Merge the two trees (through anchor nodes)
Sorting I/O complexity (optimal)
Suffix Arrays (1)
“Simplified suffix tree”
AS[i] = position of i-th suffix of S
search(x): O((|x|/B)logN + k/B) I/Os
Array construction, Manber & Myers:
1.
2.
3.
Put suffixes into buckets
Stage i: sort according to first 2i symbols
Read buckets in order
Takes 8N space and O(NlogN) I/Os
Suffix Arrays (2)
Array construction, Gonnet et al. :
cubic I/O complexity, but…
worst case analysis is too pessimistic
nearly all I/Os are bulk, i.e. sequential
Gonnet’s algorithm beats
the one of Manber & Myers in practice
Ferragina and Grossi: new algorithm
Better I/O complexity than Gonnet’s
Faster than Gonnet’s in practice