Presentation by Michal Balas

Download Report

Transcript Presentation by Michal Balas

I/O-efficient Point
Location using
Persistent B-Trees
Lars Arge, Andrew Danner, and Sha-Mayn Teh
Department of Computer Science, Duke
University (2003)
Michal Balas
1
The Planar Point Location
Problem
Storing a planar subdivision defined by N line
segments such that the region containing a
query point p can be computed efficiently
Michal Balas
2
Planar Point Location Applications



Geographic Information systems (GIS)
Spatial Databases
Graphics
Usually the datasets are larger than the size of
physical memory and must reside on disk
Michal Balas
3
Previous Works


So far, few theoretically I/O efficient
structures were developed, but all are
relatively complicated and none of them was
implemented
Vahrenhold and Hinrichs (2001) suggested a
heuristic structure that is simple and efficient
but theoretically non optimal
Michal Balas
4
Goal
find a planar point location structure that
minimizes the number of I/Os needed to
answer a query, which is efficient both in
theory and in practice.
Michal Balas
5
Lecture’s Road Map



Motivation
The Vertical Ray Shooting problem and the
need of persistent data structures
Review:





B-trees, B+ trees, and I/O model
Persistent B-trees
The modified Persistent B-tree
Experimental results
Open problems
Michal Balas
6
Vertical Ray Shooting



A generalized version of the Planar Point
Location problem
Given a set of N non-intersecting segments in
the plane, construct a data structure such
that the segment directly above a query point
p can be found efficiently
We will consider this problem.
Michal Balas
7
Example
Michal Balas
8
Vertical Ray Shooting




Based on the persistent search tree idea of
Sarnak and Tarjan (1986).
Any vertical line l in the plane introduces an
“above-below” order on the segments it
intersects.
We will “sweep” the plane from left to right
with a vertical line
Our “critical” x-axis points are the endpoints
of all segments
Michal Balas
9
Vertical Ray Shooting &
Persistent Search Trees



Sort critical points by x-values
For each critical point pi=(xi,yi) we can build a
search tree for the segments intersecting a
vertical line at xi according to the y-values (at
xi)
Until the next critical point pi+1 the tree is
static – it will change only in the next
begin/end point of a segment
Michal Balas
10
Vertical Ray Shooting &
Persistent Search Trees

Worst case analysis:


Hold a search tree to each critical point
Space: O(n2)
Michal Balas
11
Vertical Ray Shooting &
Persistent Search Trees

We should use the fact that two consecutive
trees (versions) differ only by one insertion or
deletion (assuming distinct x-values for all
endpoints).
Michal Balas
12
Vertical Ray Shooting &
Persistent Search Trees

Persistent data structure



Preserves versions. In ordinary (ephemeral) data
structures there is only one last version (every
update changes the data structure so its state
before the update can no longer be accessed)
Each update creates a version
The current version of the structure can be
modified and all versions of the structure, past
and present, can be accessed.
Michal Balas
13
Vertical Ray Shooting &
Persistent Search Trees



We would like to save a version of the search tree
for each critical point. Since we want to be space
efficient, we will use persistent search tree.
A persistent search tree differs from an ordinary
search tree in that after an insertion or deletion, the
old version of the tree can still be accessed.
Here the persistent search tree should supports
insertions and deletions in the present and queries
in the past. (partially persistent)
Michal Balas
14
Vertical Ray Shooting &
Persistent Search Trees



We will insert a segment into the persistent
search tree when its left endpoint is encountered
We will delete a segment persistently from the
tree when its right endpoint is encountered.
Two consecutive versions of the tree differ only
by a certain number of deletions and insertions
(in the distinct x-values case by 1 only)
Michal Balas
15
Vertical Ray Shooting &
Persistent Search Trees

Given a query point p=(x,y) , we will search
for the position of y in the version of the
search tree when the sweep line was at x.
Michal Balas
16
Vertical Ray Shooting &
Persistent Search Trees

Path Copying:




A balanced search tree
When x is inserted the changes are only on the
path from the root to x
Instead of copying the whole tree we will copy
only the updated path
The roots will be ordered by version
x
Michal Balas
17
Vertical Ray Shooting &
Persistent Search Trees

Path Copying:

Space: O(nlogn) – better, but not good enough
r1
r2
x
Michal Balas
18
Vertical Ray Shooting &
Persistent Search Trees

Extra Pointers :

Instead of copying the path, we will save for each
node a few pointers ( a list of left children and
right children, thought it’s a binary tree)
left
right
t1 t2
Michal Balas
19
Vertical Ray Shooting &
Persistent Search Trees

Extra Pointers :



Here there is no limitation on the # of pointers per
node
In the worst case, it will take O(logn) time to find
the relevant version per node (the pointers are in
a binary search tree) – which is not optimal
We need constant time per node
Michal Balas
20
Vertical Ray Shooting &
Persistent Search Trees

Sarnak & Tarjan solution:




Limited node copying, k extra pointers per node
k should be a small positive number (k=1 will do)
When a pointer is added to a node, if there is no
empty slot for a new pointer, we copy the node,
setting the initial left and right pointers of the copy
to their latest values.
Update the parent with the new copy, if the parent
has no free slot the process is repeated.
Michal Balas
21
Vertical Ray Shooting &
Persistent Search Trees

Sarnak & Tarjan solution - Space analysis



Amortized analysis: we will see that every set of
m operations takes O(m) space.
The potential of the structure is defined to be:
F = # live nodes – (1/k)*(# free slots in the live
nodes)
amortized space cost of update = (actual # of
nodes it creates) – DF
Michal Balas
22
Vertical Ray Shooting &
Persistent Search Trees

Sarnak & Tarjan solution - Space analysis
 We will show that amortized space cost of an update
is bounded by O(1) per update.
 If a new unused slot in node v is used, but the node is
still not full, then the actual # of new nodes created is
0, DF is (-1/k) (#free slots in live nodes decreased by
1), thus amortized space cost of this update is 1/k.
 If node copying has occurred, the actual # of new
nodes created is 1, DF is 1 (#free slots in live nodes
increased by k), thus amortized space cost of this
update is 0.
Michal Balas
23
Vertical Ray Shooting &
Persistent Search Trees

Sarnak & Tarjan solution - Space analysis
 During an update, node copying continues in the path
from node to root until the root is copied or a node
with a free slot is reached.
 The amortized space cost of node copying is 0 and of
occupying a free slot is 1/k
Michal Balas
24
Vertical Ray Shooting &
Persistent Search Trees

Sarnak & Tarjan solution - Space analysis
 The total amortized space cost of an update is
constant (0 or 1/k)
 The space of rebalance information per node is
constant
 In red-black trees, rebalancing after deletion or
insertion can be done in O(1) rotations and O(1) color
changes per update in the amortized case
 Since an insertion or deletion requires O(1) new
pointers not counting node copying, the amortized
space cost of an update is O(1)
Michal Balas
25
Vertical Ray Shooting &
Persistent Search Trees

Sarnak & Tarjan solution - Space analysis




sum up over all updates:
amortized space cost over all updates
= cn = required space – (Fend – Fstart)
Fstart=0
(we start with an empty data structure)
Fend=O(n) (according to the potential function
definition, this is an upper bound on
the potential in the end)
Required space = cn + O(n) = O(n) (this is a
bound on the number of nodes created)
Michal Balas
26
Vertical Ray Shooting &
Persistent Search Trees

Sarnak & Tarjan solution – Complexity




O(log m) query time (m is the total # of updates)
O(log n) update time (n is the current size of the
set)
O(1) amortized space per update
O(nlogn) preprocessing time
Michal Balas
27
Where are we going?
The use of Persistent Data
structures
The use of B-trees in the I/O
Model
(always preserves the previous version
of itself when it is modified)
(B-tree is the I/O model
equivalent of a search tree)
I/O efficient Persistent B-tree
(works great with totally ordered elements)
Modified I/O efficient Persistent B-tree
(only elements present in the same version of the structure need to
be comparable)
Michal Balas
28
Vertical Ray Shooting &
Persistent Search Trees

Two segments that cannot be intersected with the
same vertical line are not comparable ( “abovebelow”)

Corollary: Not all segments stored in the
persistent structure over its lifespan are
comparable
An I/O efficient structure cannot directly be
obtained using a persistent B-tree (because
standard persistent B-trees require total order
on all elements)
Michal Balas
29
Vertical Ray Shooting &
Persistent Search Trees

To make the structure I/O-efficient, we need
to modify the tree so it will only require
elements present in the same version of the
structure to be comparable
Michal Balas
30
Lecture’s Road Map



Motivation
The Vertical Ray Shooting problem and the
need of persistent data structures
Review:





B-trees, B+ trees, and I/O model
Persistent B-trees
The modified Persistent B-tree
Experimental results
Open problems
Michal Balas
31
Review: The I/O Model




Infinite disk size
M - Main Memory size
B - Block size
N - elements in the structure
M
Block I/O
Michal Balas
D
32
Review: The I/O Model - Cont



Computation can only occur on data stored in
main memory.
We are interested in the number of I/Os used
to answer a query.
The B-tree is the external memory equivalent
of the balanced search tree in internal
memory.
Michal Balas
33
Review: B-tree




A balanced search tree
All leaves are on the same level
All internal nodes (except the root) have
between B/2 and B children (q(B))
A node/leaf can be stored in O(1) blocks
Michal Balas
34
Review: B-tree - Cont



Space complexity of the tree: O(N/B) blocks
(where N is the number of elements) – linear
Tree height: O(logBN)
Insert/Delete can be done with O(logBN) I/Os
Michal Balas
35
Review: B+-tree


It is a B-tree in which all elements are stored
in the leaves.
The internal nodes contain “routing
elements”.
Michal Balas
36
B-tree Example (B+-tree)
1
d1
2
d2
3
d3
d4
3
5
4
5
6
d5
d6
Michal Balas
7
d7
37
Where are we going?
The use of Persistent Data
structures
The use of B-trees in the I/O
Model
(always preserves the previous version
of itself when it is modified)
(B-tree is the I/O model
equivalent of a search tree)
I/O efficient Persistent B-tree
(works great with totally ordered elements)
Modified I/O efficient Persistent B-tree
(only elements present in the same version of the structure need to
be comparable)
Michal Balas
38
Review: Persistent B-tree

Directed acyclic graph



The elements are in the sinks (leaves)
“routing elements” in internal nodes
Elements (and nodes) augmented with
“existence interval”


In this interval the element is “alive”
An element is “alive” - between its insert and its
delete version
Michal Balas
39
Review: Persistent B-tree - Cont

Nodes “alive” at time t form a (aB,B) B-tree,
0<a<1/2


We will work with a=1/4
Additional invariant:



A new node must contain between (a+g)B and
(1-g)B alive elements ( a > g )
For g=1/8, a=1/4, new node contains between
(3/8)B and (7/8)B alive elements
We require that g>2/B, a-g >=1/B, 2a+3g<= 1-3/B
Michal Balas
40
Review: Persistent B-tree - Cont

In order to find the appropriate root at time t,
the roots are stored in a standard B-tree


Takes O(logBN) I/Os
A node/leaf contains O(B) elements = O(1)
blocks
# Blocks needed to hold the structure: O(N/B)
Michal Balas
41
Persistent B-tree Insert
x is the element to insert into the current version of
the tree
Search the leaf l and insert x (O(logBN) I/Os)
if l contains > B elements -> Block overflow




(a)
(b)
(c)
Version-Split (copy all k alive elements from l to a new node
v and mark l as dead)
If k is in [(3/8)B,(7/8)B] - simple
If k > (7/8)B – strong overflow
If k < (3/8)B – strong underflow
Strong overflow/underflow violates the additional
invariant we defined earlier
Michal Balas
42
Persistent B-tree Insert
a)
If k is in [(3/8)B,(7/8)B] :
recursively update parent(l): persistently delete the
reference to l and insert a reference to v
Michal Balas
43
Persistent B-tree Insert - Cont
b)


If k > (7/8)B – strong overflow
split
create nodes v1, v2 each with k/2 elements.
k/2 is in ((3/8)B,(7/8)B) (this is not tight)
Update parent(l) recursively: persistently delete the reference to l and insert
two references to v1, v2
Michal Balas
44
Persistent B-tree Insert - Cont
b)


If k < (3/8)B – strong underflow
Version-split of sibling l’ of l -> obtain k’ other alive elements
(k’ is in [aB,B])
k+k’ >= 2aB, and a > g, thus k+k’ > (a+g)B (the invariant…)
1) if k+k’ <= (1-g)B: merge - create a new leaf with k+k’
elements
2) if k+k’ >(1-g)B: share – split to create two new leaves.
Update parent(l) recursively: persistently delete two references
and insert one or two
Michal Balas
45
Persistent B-tree Delete




x is the element to delete from the current version of the tree
Search the leaf l that contains and mark x as dead (O(logBN)
I/Os)
if l contains < (1/4)B alive elements -> Block underflow (this is
also a strong underflow, since k < (3/8)B )

Version-Split on a sibling node to obtain k+k’ elements.
k+k’ >= 2aB -1 , and a- g > =1/B, thus k+k’ > (a+g)B (the
invariant…)
mark l dead and create a new node v with k+k’ elements (merge)
if there is a strong overflow in v – share (as in insert)
Update parent(l) recursively: persistently delete two references
and insert one or two references
Michal Balas
46
Persistent B-tree –
Rebalance Operations
Delete
Insert
Block Overflow
Version-split
Block Underflow
Version-split
Done
0,0
Done
-1,+1
Strong Overflow
Strong Underflow
Split
Merge
Done
-2,+1
Done
Strong Overflow
-1,+2
Split
Done
Michal Balas
-2,+2
47
Persistent B-tree - Complexity
Updates: O(logBN) I/Os



search and rebalance on one path from root to leaf
What about the required space?
Michal Balas
48
Persistent B-tree - Complexity
A few observations:





A rebalance operation on leaf creates <= 2 new nodes
Once a leaf is created, at least gB updates have to be performed on it before
another rebalance operation will occur.
Two version-splits might only create one new leaf
Each time a leaf is created or a leaf version-split performed, a corresponding
insertion or deletion is performed recursively one level up the tree.
During N updates:






# leaves created <= 2N/gB = O(N/B)
# leaf version-splits<= 2N/gB
# nodes created one level up the tree <= 22N/(gB)2
By induction: # nodes created i levels up the tree <= 2i+1N/(gB)i+1
Total # nodes created <=
2N
gB
logB N

i =0
i
 2 
N
  =  
 gB  g > 2 B  B 
(it is also the # of blocks used after N updates)

Space: O(N/B) blocks
Michal Balas
49
Lecture’s Road Map



Motivation
The Vertical Ray Shooting problem and the
need of persistent data structures
Review:





B-trees, B+ trees, and I/O model
Persistent B-trees
The modified Persistent B-tree
Experimental results
Open problems
Michal Balas
50
Where are we going?
The use of Persistent Data
structures
The use of B-trees in the I/O
Model
(always preserves the previous version
of itself when it is modified)
(B-tree is the I/O model
equivalent of a search tree)
I/O efficient Persistent B-tree
(works great with totally ordered elements)
Modified I/O efficient Persistent B-tree
(only elements present in the same version of the structure need to
be comparable)
Michal Balas
51
The modified Persistent B-tree
Why do we need to modify the standard
Persistent B-tree?
Before, a few facts about standard B-tree:





The elements are in the leaves
Internal nodes contain “routing elements”
When a node v is created a reference is added to
parent(v) – normally a copy of the maximal element in
v is used as a routing element in parent(v)
Michal Balas
52
The modified Persistent B-tree
The structure contains multiple live copies of
the same element.
There may be copies of an element as routing
elements long after the element is deleted
When searching for an element in the structure
at version t we might be comparing to a copy of
a dead element.
Michal Balas
53
The modified Persistent B-tree
In this application (vertical ray shooting) not all
elements (segments) stored in the data
structure during its entire lifespan are abovebelow comparable
We cannot use the standard version of a
persistent B-tree, since it requires all elements
in the structure to be comparable.
Modification is needed!
Michal Balas
54
The modified Persistent B-tree
We want the structure to only require elements
present in the same version to be comparable
The modified structure:




Alive elements in time t form a B-tree with elements in
all nodes - internal + leaves. (not just in leaves)
# live copies of an element at any given time t <= 1
Michal Balas
55
The modified Persistent B-tree



There will be some modification to the
rebalance operations
The Insert algorithm remains
The delete algorithm is slightly modified:
Michal Balas
56
The modified Persistent B-tree

The modified delete algorithm:
When deleting an element x which is in internal
node u we need to be careful since x is associated
with a reference to a child uc of u that is still alive
1.
2.
3.
4.
5.
Find y : the predecessor of x in a leaf below u
Persistently delete y
Persistently delete x from u
Insert a live copy of y with a reference to the child uc
Perform the needed rebalance operations
Michal Balas
57
The modified Persistent B-treerebalance operations

Version-Split: copying all alive elements of u to a new node v
x
x
u
u
v
We can use x as the element associated with the reference to the new
node v , since the elements in v are a subsets of the elements in u
Michal Balas
58
The modified Persistent B-treerebalance operations

Split: when a strong overflow occurs after a version-split of u, two new nodes v, v’ are created
x
y x
u
u
y
v
v’
we promote the maximal element y in v to be associated with the reference to v in
parent(u) (instead of storing y in v).
x will be associated with the reference to v’ in parent(u).
v has one less element than it would have had using the regular split, but O(gB)
updates are still required on v before further structural changes are needed
Michal Balas
59
The modified Persistent B-treerebalance operations

Merge: when a strong underflow occurs after a version-split of u, a version-split of u’s sibling u’
is performed, and a new node v is created with the alive elements from u,u’
x y
y
u
u
u’
v
u’
X
The maximal between x and y , say y, is used as the reference to the
new node v. x is demoted and stored in the new node v
v has one more element than it would have had using the regular merge. But as
in split, O(gB) updates are still needed on v before further structural changes are
needed
Michal Balas
60
The modified Persistent B-treerebalance operations

Share: when a merge would result in a new node with a strong overflow, instead a version-split
on the two sibling nodes u and u’ is performed, and two new nodes v, v’ are created.
x y
z y
u
u’
z
u
v
u’
v’
X
The maximal element y can be reused as the reference to v’ but x cannot
be used as a reference to v. x is demoted to v and the maximal element z
in v is promoted to parent(u).
# of elements in the new node v is identical to the # of elements we would
have had using the regular share.
Michal Balas
61
The modified Persistent B-treeComplexity

Even though there is a difference in the
number of elements, the previous space
arguments still apply

Space: O(N/B) blocks
Update on the newest version: O(logBN) I/Os

Michal Balas
62
The modified Persistent B-treeSummary

A set of N non-intersecting segments in the
plane can be processed into a data structure
of size O(N/B) in O(NlogBN) I/Os such that a
vertical ray-shooting query can be answered
in O(logBN) I/Os
Michal Balas
63
The modified Persistent B-treeSummary


N updates on a persistent B-tree (standard or modified)
O(N log B N )
takes
I/Os
Goodrich et al. showed how to construct a persistent Btree structure (different from the basic one described
earlier) in O N log N  I/Os (the sorting bound)
B
B
The structure by Goodrich et al., requires that all
elements in the structure over its lifespan are
comparable
In the modified tree we cannot use that, since the
elements are not totally ordered, so this construction
complexity is not reached (so far)
M /B


Michal Balas
64
Where have we come from?
The use of Persistent Data
structures
The use of B-trees in the I/O
Model
(always preserves the previous version
of itself when it is modified)
(B-tree is the I/O model
equivalent of a search tree)
I/O efficient Persistent B-tree
(works great with totally ordered elements)
Modified I/O efficient Persistent B-tree
(only elements present in the same version of the structure need to
be comparable)
Michal Balas
65
Lecture’s Road Map



Motivation
The Vertical Ray Shooting problem and the
need of persistent data structures
Review:





B-trees, B+ trees, and I/O model
Persistent B-trees
The modified Persistent B-tree
Experimental results
Open problems
Michal Balas
66
Experimental Results





Compared the persistent B-tree and the grid
structure of Vahrenhold and Hinrichs
Implemented both using TPIE library
Used road data , containing all roads in the
US. Roads are broken at intersections
The query points were randomly sampled
from the datasets
Used also worst case artificially generated
dataset
Michal Balas
67
Experimental Results

In terms of query efficiency:


# I/Os per query, Time per query – both are much
lower in the persistent B-tree than in the Grid
structure. In synthetically generated worst case
dataset B-tree uses significantly fewer I/Os
size, Construction efficiency – grid
construction algorithm outperforms the
persistent B-tree on the real life datasets,
though on the worst case dataset the
persistent B-tree was significantly better.
Michal Balas
68
Lecture’s Road Map



Motivation
The Vertical Ray Shooting problem and the
need of persistent data structures
Review:





B-trees, B+ trees, and I/O model
Persistent B-trees
The modified Persistent B-tree
Experimental results
Open problems
Michal Balas
69
Open Problems

One major open problem is to construct the
N
N
O
(
log
) I/Os (here we saw a
structure in
M B
B
B
trivial algorithm that constructs in O( N log B N )
Michal Balas
70
Questions? ...
Michal Balas
71