Transcript R-Trees
Spatial Indexing
Many slides taken from George
Kollios, Boston University
Spatial Indexing
Point Access Methods can index only
points. What about regions?
Use the transformation technique
Z-ordering and quadtrees
New methods: Spatial Access Methods
SAMs
R-tree and variations
Problem
Given a collection of geometric objects
(points, lines, polygons, ...)
organize them on disk, to answer
spatial queries (range, nn, etc)
R-tree
z-ordering: cuts regions to pieces ->
dup. elim.
how could we avoid that?
Idea: try to extend/merge B-trees and
k-d trees
R-tree: a generalization of the B+-tree
for multidimensional spaces
Kd-Btrees
[Robinson, 81]: if f is the fanout, split
point-set in f parts; and so on, recursively
Kd-Btrees
But: insertions/deletions are tricky
(splits may propagate downwards and
upwards)
no guarantee on space utilization
R-trees
[Guttman 84] Main idea: allow parents to
overlap!
=> guaranteed 50% utilization
=> easier insertion/split algorithms.
(only deal with Minimum Bounding Rectangles
- MBRs)
R-trees
A multi-way external memory tree
Index nodes and data (leaf) nodes
All leaf nodes appear on the same level
Every node contains between m and M
entries
The root node has at least 2 entries
(children)
Example
eg., w/ fanout 4: group nearby rectangles
to parent MBRs; each group -> disk page
I
AC
G
F
B
E
D
H
J
Example
F=4
P1
P3
AC
G
F
B
E
P2 D
I
H
P4 J
A B C
D E
H I
F G
J
Example
F=4
P1
P3
AC
P1 P2 P3 P4
G
F
B
E
P2 D
I
H
P4 J
A B C
D E
H I
F G
J
R-trees - format of nodes
{(MBR; obj_ptr)} for leaf nodes
P1 P2 P3 P4
x-low; x-high
obj
y-low; y-high
ptr ...
...
A B C
R-trees - format of nodes
{(MBR; node_ptr)} for non-leaf nodes
x-low; x-high
y-low; y-high node
ptr
...
P1 P2 P3 P4
...
A B C
R-trees:Search
P1
P3
AC
P1 P2 P3 P4
G
F
B
E
P2 D
I
H
P4 J
A B C
D E
H I
F G
J
R-trees:Search
P1
P3
AC
P1 P2 P3 P4
G
F
B
E
P2 D
I
H
P4 J
A B C
D E
H I
F G
J
R-trees:Search
Main points:
every parent node completely covers its ‘children’
a child MBR may be covered by more than one
parent - it is stored under ONLY ONE of them. (ie.,
no need for dup. elim.)
a point query may follow multiple branches.
everything works for any(?) dimensionality
R-trees:Insertion
Insert X
P1
P3
AC
P1 P2 P3 P4
G
F
B
X
P2 D
I
E
H
P4 J
A B C
D E X
H I
F G
J
R-trees:Insertion
Insert Y
P1
P3
AC
P1 P2 P3 P4
G
F
B
Y
P2 D
I
E
H
P4 J
A B C
D E
H I
F G
J
R-trees:Insertion
Extend the parent MBR
P1
P3
AC
P1 P2 P3 P4
G
F
B
Y
P2 D
I
E
H
P4 J
A B C
D E Y
H I
F G
J
R-trees:Insertion
How to find the next node to insert the
new object?
Using ChooseLeaf: Find the entry that
needs the least enlargement to include Y.
Resolve ties using the area (smallest)
Other methods (later)
R-trees:Insertion
P1
If node is full then Split : ex. Insert w
P3
K
AC
W
B
E
P2 D
I
P1 P2 P3 P4
G
F
H
P4 J
A B C K
H I
D E
F G
J
R-trees:Split
Split node P1: partition the MBRs into two groups.
• (A1: plane sweep,
P1
K
AC
B
until 50% of rectangles)
W
• A2: ‘linear’ split
• A3: quadratic split
• A4: exponential split:
2M-1 choices
R-trees:Split
pick two rectangles as ‘seeds’;
assign each rectangle ‘R’ to the ‘closest’ ‘seed’
seed2
R
seed1
R-trees:Split
pick two rectangles as ‘seeds’;
assign each rectangle ‘R’ to the ‘closest’ ‘seed’:
‘closest’: the smallest increase in area
seed2
R
seed1
R-trees:Split
How to pick Seeds:
Linear:Find the highest and lowest side in each
dimension, normalize the separations, choose the
pair with the greatest normalized separation
Quadratic: For each pair E1 and E2, calculate the
rectangle J=MBR(E1, E2) and d= J-E1-E2. Choose
the pair with the largest d
R-trees:Insertion
Use the ChooseLeaf to find the leaf
node to insert an entry E
If leaf node is full, then Split, otherwise
insert there
Propagate the split upwards, if necessary
Adjust parent nodes
R-Trees:Deletion
Find the leaf node that contains the entry E
Remove E from this node
If underflow:
Eliminate the node by removing the node entries
and the parent entry
Reinsert the orphaned (other entries) into the tree
using Insert
Other method (later)
R-trees: Variations
R+-tree: DO not allow overlapping, so
split the objects (similar to z-values)
R*-tree: change the insertion, deletion
algorithms (minimize not only area but
also perimeter, forced re-insertion )
Hilbert R-tree: use the Hilbert values to
insert objects into the tree
R-trees - variations
what about static datasets (no
ins/del/upd)?
Q: Best way to pack points?
R-trees - variations
what about static datasets (no
ins/del/upd)?
Q: Best way to pack points?
A1: plane-sweep
great for queries on ‘x’;
terrible for ‘y’
R-trees - variations
what about static datasets (no
ins/del/upd)?
Q: Best way to pack points?
A1: plane-sweep
great for queries on ‘x’;
bad for ‘y’
R-trees - variations
what about static datasets (no
ins/del/upd)?
Q: Best way to pack points?
A1: plane-sweep
great for queries on ‘x’;
terrible for ‘y’
Q: how to improve?
R-trees - variations
A: plane-sweep on HILBERT curve!
R-trees - variations
A: plane-sweep on HILBERT curve!
In fact, it can be made dynamic (how?),
as well as to handle regions (how?)
R-trees - variations
Dynamic (‘Hilbert Rtree):
each point has an ‘h’value (hilbert value)
insertions: like a B-tree
on the h-value
but also store MBR, for
searches
R-trees - variations
what about other bounding shapes? (and
why?)
A1: arbitrary-orientation lines (cell-tree,
[Guenther]
A2: P-trees (polygon trees) (MB polygon:
0, 90, 45, 135 degree lines)
R-trees - variations
A3: L-shapes; holes (hB-tree)
A4: TV-trees [Lin+, VLDB-Journal 1994]
A5: SR-trees [Katayama+, SIGMOD97]
(used in Informedia)
R-trees - conclusions
Popular method; like multi-d B-trees
guaranteed utilization
good search times (for low-dim. at
least)
Informix ships DataBlade with R-trees
Spatial Queries
Spatial Queries
Given a collection of geometric objects
(points, lines, polygons, ...)
organize them on disk, to answer
point queries
range queries
k-nn queries
spatial joins (‘all pairs’ queries)
Spatial Queries
Given a collection of geometric objects
(points, lines, polygons, ...)
organize them on disk, to answer
point queries
range queries
k-nn queries
spatial joins (‘all pairs’ queries)
Spatial Queries
Given a collection of geometric objects
(points, lines, polygons, ...)
organize them on disk, to answer
point queries
range queries
k-nn queries
spatial joins (‘all pairs’ queries)
Spatial Queries
Given a collection of geometric objects
(points, lines, polygons, ...)
organize them on disk, to answer
point queries
range queries
k-nn queries
spatial joins (‘all pairs’ queries)
Spatial Queries
Given a collection of geometric objects
(points, lines, polygons, ...)
organize them on disk, to answer
point queries
range queries
k-nn queries
spatial joins (‘all pairs’ queries)
R-trees - Range search
pseudocode:
check the root
for each branch,
if its MBR intersects the query rectangle
apply range-search (or print out, if this
is a leaf)
R-trees - NN search
P1
P3
AC
G
F
B
q
E
P2 D
I
H
P4 J
R-trees - NN search
Q: How? (find near neighbor; refine...)
P1
P3
AC
G
F
B
q
E
P2 D
I
H
P4 J
R-trees - NN search
A1: depth-first search; then, range
query
P1
AC
G
F
B
q
E
P2 D
I
P3
H
P4 J
R-trees - NN search
A1: depth-first search; then, range
query P1
P3
I
AC
G
F
B
q
E
P2 D
H
P4 J
R-trees - NN search
A1: depth-first search; then, range
query P1
P3
I
AC
G
F
B
q
E
P2 D
H
P4 J
R-trees - NN search
A2: [Roussopoulos+, sigmod95]:
priority queue, with promising MBRs, and
their best and worst-case distance
main idea: Every face of any MBR
contains at least one point of an actual
spatial object!
R-trees - NN search
consider only P2 and P4, for illustration
P1
P3
AC
G
F
B
q
E
P2 D
I
H
P4 J
R-trees - NN search
best of P4
=> P4 is useless
for 1-nn
worst of P2
H
q
E
P2 D
P4 J
R-trees - NN search
what is really the worst of, say, P2?
worst of P2
q
E
P2 D
R-trees - NN search
what is really the worst of, say, P2?
A: the smallest of the two red
segments!
q
P2
Nearest-neighbor searching
Branch and bound strategy
Compute MINDIST and MINMAXDIST [RKV95]
MINDIST(p,R) is the minimum distance between
p and R with corner points l and u
the closest point in R is at least this distance away
R
u
Sqrt sum (pi-ri)2 where
p
MINDIST = 0
l
p
p
ri = li if pi < li
= ui if pi > ui
= pi otherwise
Nearest-neighbor searching
MINMAXDIST(p,R) is the minimum of the maximum distance
to each pair of faces of R
MaxDistanceToFace(p,R,k) = distance between p and
Maxk= (M1,M2,..,Mk-1mk,MK+1,..,Mn)
mi = closer of the two boundary points along i axis
Mi = farther of the two boundary points along i axis
Max2
u
p
l
Max1
Nearest-neighbor searching
MINMAXDIST(p,R) is the minimum of the maximum distance
to each pair of faces of R
MaxDistanceToFace(p,R,k) = distance between p and
Maxk= (M1,M2,..,Mk-1mk,MK+1,..,Mn)
mi = closer of the two boundary points along i axis
Mi = farther of the two boundary points along i axis
Max1
u
l
p
Max2
Pruning
ESTIMATE := smallest MINMAXDIST(p,R)
Prune an MBR R’ for which MINDIST(p,R’) is
greater than ESTIMATE.
Generalize to k-nearest neighbor searching
Maintain kth largest MINMAXDIST
Prune an MBR if MINDIST to it is larger than the
current estimate of kth MINMAXDIST
Can use objects to refine estimate
Order of searching
Depth first order
Inspect children in MINDIST order
For each node in the tree keep a list of
nodes to be visited
Prune some of these nodes in the list
Continue until the lists are empty
Another NN search
Global order [HS99]
Maintain distance to all entries in a
common list
Order the list by MINDIST
Repeat
Inspect the next MBR in the list
Add the children to the list and reorder
Until all remaining MBRs can be pruned
Spatial Join
Find all parks in a city
Find all trails that go through a forest
Basic operation
Single-scan queries
find all pairs of objects that overlap
nearest neighbor queries, range queries
Multiple-scan queries
spatial join
Algorithms
No existing index structures
Transform data into 1-d space [O89]
Partition-based spatial-merge join [PW96]
z-transform; sensitive to size of pixel
partition into tiles that can fit into memory
plane sweep algorithm on tiles
Spatial hash joins [LR96, KS97]
Sort data [BBKK01]
With index structures [BKS93, HJR97]
k-d trees and grid files
R-trees
R-tree based Join [BKS93]
S
R
Join1(R,S)
Repeat
Find a pair of intersecting entries E in R and F in
S
If R and S are leaf pages then add (E,F) to
result-set
Else Join1(E,F)
Until all pairs are examined
CPU and I/O bottleneck
Reducing CPU bottleneck
S
R
Join2(R,S,IntersectedVol)
Repeat
Until all pairs are examined
14+6 comparisons instead of 49
In general, number of comparisons equals
Find a pair of intersecting entries E in R and F in S that
overlap with IntersectedVol
If R and S are leaf pages then add (E,F) to result-set
Else Join2(E,F,CommonEF)
size(R) + size(S) + relevant(R)*relevant(S)
Reduce the product term
Using Plane Sweep
S
R
s1
s2
r1
r2
r3
Consider the extents along x-axis
Start with the first entry r1
sweep a vertical line
Using Plane Sweep
S
R
s1
s2
r1
r2
r3
Check if (r1,s1) intersect along y-dimension
Add (r1,s1) to result set
Using Plane Sweep
S
R
s1
s2
r1
r2
r3
Check if (r1,s2) intersect along y-dimension
Add (r1,s2) to result set
Using Plane Sweep
S
R
s1
s2
r1
r2
r3
Reached the end of r1
Start with next entry r2
Using Plane Sweep
S
R
s1
s2
r1
r2
r3
Reposition sweep line
Using Plane Sweep
S
R
s1
s2
r1
r2
r3
Check if r2 and s1 intersect along y
Do not add (r2,s1) to result
Using Plane Sweep
S
R
s1
s2
r1
r2
r3
Reached the end of r2
Start with next entry s1
Using Plane Sweep
S
R
s1
s2
r1
r2
r3
Total of 2(r1) + 1(r2) + 0 (s1)+ 1(s2)+ 0(r3) = 4 comparisons
Reducing I/O
Read schedule r1, s1, s2, r3
Every subtree examined only once
Consider a slightly different layout
Reducing I/O
S
R
r2
s1
r1
s2
r3
Read schedule is r1, s2, r2, s1, s2, r3
Subtree s2 is examined twice
Pinning of nodes
After examining a pair (E,F), compute the degree
of intersection of each entry
degree(E) is the number of intersections between E and
unprocessed rectangles of the other dataset
If the degrees are non-zero, pin the pages of the
entry with maximum degree
Perform spatial joins for this page
Continue with plane sweep
Reducing I/O
R
r2
S
s1
r1
s2
r3
After computing join(r1,s2),
degree(r1) = 0
degree(s2) = 1
So, examine s2 next
Read schedule = r1, s2, r3, r2, s1
Subtree s2 examined only once
References
[SK98] Optimal multi-step k-nearest neighbor
search, T. Seidl and H. Kriegel, SIGMOD 1998: 154-165.
[BBKK01] Epsilon Grid Order: An Algorithm for the
Similarity Join on Massive High-Dimensional Data,
C. Bohm, B. Braunmüller, F. Krebs and H.-P.
Kriegel, SIGMOD 2001.
[RKV95] Roussopoulos N., Kelley S., Vincent F.
Nearest Neighbor Queries. Proceedings of the
ACM-SIGMOD International Conference on
Management of Data, pages 71-79, 1995.
References
[HS99] G. R. Hjaltason and H. Samet, Distance
browsing in spatial databases, ACM Transactions on
Database Systems 24, 2 (June 1999), 265-318
[O89] Jack A. Orenstein: Redundancy in Spatial
Databases. SIGMOD Conference 1989: 294-305
[PW96] Jignesh M. Patel, David J. DeWitt: Partition
Based Spatial-Merge Join. SIGMOD Conference
1996: 259-270
[LR96] Ming-Ling Lo, Chinya V. Ravishankar:
Spatial Hash-Joins. SIGMOD Conference 1996:
247-258
References
[KS97] Nick Koudas, Kenneth C. Sevcik: Size
Separation Spatial Join. SIGMOD Conference 1997:
324-335
[HJR97] Yun-Wu Huang, Ning Jing, Elke A.
Rundensteiner: Spatial Joins Using R-trees:
Breadth-First Traversal with Global Optimizations.
VLDB 1997, 396-405
[BKS93] Thomas Brinkhoff, Hans-Peter Kriegel,
Bernhard Seeger: Efficient Processing of Spatial
Joins Using R-Trees. SIGMOD Conference 1993:
237-246