Multidimensional Access Methods

Transcript Multidimensional Access Methods

Multidimensional Access
Methods
Volker Gaede
IC-Parc, Imperial College, London
Oliver Gunther
Humboldt-Universitat, Berlin
ACM Computing Surveys, June 1998
Mohammad Reza Kolahdouzan, [email protected]
Outline
• Organization of Spatial Data
– What is Special About Spatial?
– Definitions and Queries
• Basic Data Structures
– One-Dimensional Access Method
– Main Memory Structures
• Point Access Methods
– Multidimensional Hashing
– Hierarchical Access Methods
– Space Filling Curves for Point Data
• Spatial Access Methods
–
–
–
–
Transformation
Overlapping Regions
Clipping
Multiple Layers
• Comparative Studies, Conclusion
2
What is special about Spatial Data?
• Basic properties:
– Complex structure
• object could be point or thousands of polygons
• variable tuple size
– Dynamic
• different operations (insert, delete, update) are interleaved
– Large
• e.g.: gigabytes for geographic maps
• Integration of secondary and tertiary memory
– Several proposals but no standard algebra
• no standard set of operators
• operators depends of domain (application specific)
– Not closed operators
• result of an operator can return any kind of object
– Expensive computational costs
3
What is special about Spatial Data? ...
• Special physical layer support is required for “Search”
operators (for search, update, … )
• Multidimensional access methods required.
– Hard to do: no total ordering for spatial objects that preserves
spatial proximity (i.e., no mapping from 2,3,... to dimensions to 1)
– One solution:
• single key structure, per dimension
• problem: each index is traversed independently of others
4
What is special about Spatial Data? ...
• Requirements for multidimensional access methods:
– Dynamic: keep track of changes (inserts, deletes, updates, ... )
– Secondary/tertiary storage management: not possible to have
everything in main memory
– Broad range of supported operations: not sacrifice one for others
– Independence of the input data (distribution) and insertion
sequence
– Simplicity
– Scalability
– Time efficiency of search: goal is to meet one-dimensional B-tree:
logarithmic worst case
– Space efficiency
– Concurrency and recovery: multiple concurrent accesses
– minimum impact for integration
5
Definitions
•
Multidimensional access method: support search in spatial DB
•
Point access methods (PAMs): designed to perform spatial search in
•
Spatial access methods (SAMs): can manage extended objects (lines,
•
•
•
Universe, Original Space: a d-dimensional Euclidean space, E(d)
Any point has a unique location, defined by its d coordinates
d-dimensional polytope, P: intersection of some finite number of closed
•
D-dimensional polyhedron:
point DBs
– point could have 2,3,… dimensions, but no extension.
polygons, ...)
half-spaces in E(d), such that the dimension of the smallest affine subspace
containing P is d.
– union of some polytopes.
– Divides the points in space to: its interior, boundary, and exterior
(mutually disjoint)
6
Definitions ...
•
Examples for polyhedron:
– line (polyline) = 1-dimensional polyhedron
– region (polygon) = 2-dimensional polyhedron
•
A set of k-dimensional polyhedra forms a data type,
•
•
Spatial attribute of a spatial data describes its extension o.G
synonyms:
region, … }
(e.g., {point, line,
– geometry, shape, spatial extension
– shape descriptor, shape description, shape information, geometric
description
7
Definitions ...
•
Simple solutions for indexing:
– Abstracting complex object with simpler shape, like bounding box or
sphere, and providing a link to actual data
• produces candidate solutions
• based on application, refinement may be necessary
– Decomposition, represent a shape with union of simpler shapes (e.g.,
convex polygons with bounded number of vertices).
8
Definitions ...
•
What is efficiency?
– Space: minimize number of occupied bytes
– Time: not clear!
• We care about elapsed time, but it depends on implementation,
hardware, …
• Solution: number of disk accesses for a search.
– Based on assumption that search is I/O bound (B-tree)
– May not be true for spatial objects
– Should still minimize number of disk accesses
9
Queries
•
•
•
•
No standard query language
Application domain dependent set of operators
Some more common operators (intersection, … )
Queries are usually in term of extension to SQL
10
Queries ...
• Common Queries
•
Query 1: Exact Match Query EMQ, Object Query
– returns objects that have similar extension
11
Queries ...
• Common Queries ...
•
Query 2: Point Query PQ
– returns an object that contains a point
12
Queries ...
• Common Queries ...
•
Query 3: Window Query WQ, Range query
– returns objects that have at least one point in an interval
– intervals are iso-oriented (parallel to coordinate axes)
13
Queries ...
• Common Queries ...
•
Query 4: Intersection Query IQ, Region query, Overlap Query
– returns objects that have at least one common point with another
object
14
Queries ...
• Common Queries ...
•
Query 5: Enclosure Query EQ
– returns objects that enclose another object
15
Queries ...
• Common Queries ...
•
Query 6: Containment Query CQ
– returns objects that are contained in another object
– contain and enclosure are dual
16
Queries ...
• Common Queries ...
•
Query 7: Adjacency Query AQ
– returns adjacent objects to another one
17
Queries ...
• Common Queries ...
•
Query 8: Nearest Neighbor Query NNQ
– returns an object that has the minimum distance to another one
– distance between two objects: distance between their closest points
18
Queries ...
• Common Queries ...
•
Query 9: Spatial join
– two collection R and S of spatial data
– spatial predicate θ (e.g., intersects(), contains(), adjacent(), … )
– find pair of objects
19
One Dimensional Access Methods
•
Important foundation of mutlidimensional methods
•
Access methods:
– linear hashing
– extendible hashing
– Hierarchical (e.g., B-tree)
•
Hierarchical:
– scalable
– behave well for skewed data
– nearly independent data distribution
•
Hashing:
– performance depends on distribution, and hashing function
20
One dimensional Access Methods ...
•
Linear hashing:
– divide universe [A,B) of possible hash values into binary intervals
– size of intervals: (B-A)/2^k and (B-A)/2^(k+1)
– if a bucket is full, split an interval (not necessarily the same bucket) to
two equal subintervals
• no guarantee for relieving overflowed bucket
– put overload of a bucket in overflow page, link it to original
•
Extendible hashing:
–
–
–
–
binary intervals (cells), like linear
use index in a directory for each cell, no overflow pages
if a bucket at maximal depth exceeds capacity, splits all cells into two
for skewed data, this may lead to different directory entries for same
data region
– pro: exact match takes at most two page accesses
– con: poor space utilization, non-incremental growth (doubling)
• solution: bounded-index extendible hashing
21
One dimensional Access Methods ...
•
B-tree:
–
–
–
–
–
•
organize data hierarchically
balanced
relate to nesting of intervals
interior nodes: point to immediate mutually disjoint descendants
leaf nodes: data items
comparison:
– B-tree: adaptive (size of interval depends on data)
– hash methods outperform B-tree for uniformly distributed data for:
• exact match queries, insertion, deletion
22
Main Memory Structures
•
•
•
Early methods
only work in main memory, not optimized for secondary memory
are adapted in real methods
•
our running example:
–
points, polygons (their centroids, or MBBs)
23
Main Memory Structures ...
•
k-d-tree
–
–
–
–
–
–
–
–
binary search tree
can only handle points (polygons with their centroids)
recursively divide the universe using (d-1)dimensional hyperplanes
hyperplanes:
• iso-oriented
• contain at least on data point
Interior nodes has one or two descendants, used to guide search
insert and search are straight forward
delete may require reorganizing a sub-tree
cons:
• structure sensitive to ordering of insertion
• data is all over the tree
24
Main Memory Structures ...
•
k-d-tree
25
Main Memory Structures ...
•
Adaptive k-d-tree
–
–
–
–
–
solves the problems of k-d-tree
tries to splits with about the same number of elements in each side
hyperplanes need not contain a data point
split till each subspace has some specific number of elements
result:
• splitting points not part of the data
• all data in leaves
• interior nodes contain dimension and coordinates of split
– cons:
• difficult to keep the tree balanced (when frequent inserts and deletes)
• tree still depends on order of insertion
– pro:
• very well for static environments
– con for all k-d-trees:
• for some certain distributions, no hyper-plane can split the data points evenly
26
Main Memory Structures ...
•
Adaptive k-d-tree
27
Main Memory Structures ...
•
BSP-tree
–
–
–
–
–
–
(binary space partitioning)
like k-d-tree splits the world
splits independent of other subspaces
hyper-planes have arbitrary orientations (not iso-oriented)
decomposes until number of objects is less than a threshold
interior node: hyper-planes
leaf nodes: subspaces, links to contained objects
– search for a point: straight forward
– range search: generalization of point search
– cons:
• generally not balanced, may have very deep subtress
• higher space requirements (info of the hyperplanes)
– pros:
• adapt well for different distributions
28
Main Memory Structures ...
•
BSP-tree
(binary space partitioning)
29
Main Memory Structures ...
•
BD-tree
–
–
–
–
–
binary tree
represents subdivisions of data into interval shaped regions
encode regions in bit strings (DZ expressions, z values, ST_Morton_Number)
associate bit strings to nodes
finding z values:
• left and down of hyperplanes : 0
• right and up of hyperplanes: 1
– splitting:
• make interval-shaped excision: one node is excision (interval-shaped) part,
the other one is remainder of the subspace
• result: each data bucket contain at least 1/3 of the original entries
– searching:
• first finds full prefix
• then traverses the tree based on prefix
30
Main Memory Structures ...
•
BD-tree
31
Main Memory Structures ...
•
The quadtree
–
–
–
–
–
–
–
–
–
closely related to k-d-tree
split with iso-oriented hyperplanes
difference: not binary tree any more
each node has 2^d descendants (in d dimensional space)
each node is interval shaped partitions
partitions don’t have to be equal size
decompose till a threshold
many variants
search is similar to k-d-tree
• for point query: traverses one subtree
• for region query: goes to often more than one subtree
– cons:
• not necessarily balanced
• deeper subtrees for dense regions
32
Main Memory Structures ...
•
point quadtree
– constructed by inserting points one by one
– how to insert:
• search for a point
• if not found, add it to the leaf node which search is stopped at
• divide corresponding partition to 2^d subspaces, new point in center
– deletion: requires reconstructing subtree
33
Main Memory Structures ...
•
region quadtree
– regular decomposition of the universe
– result: 2^d subspaces are equal
• very good for search operations
34
Point Access Methods
– Previous access methods designed for main memory
– Can be used for secondary memory, but performance is very
below optimum (no control over how OS accesses disks)
– PAMs: organize points in to buckets, buckets in to subspaces
– Different approaches for PAMs:
• hashing (extended, linear)
• hierarchical (tree based)
• space-filling curves
– lots of hybrid methods are also available
35
Point Access Methods ...
– Another taxonomy for PAMs: based on properties of buckets:
• disjoint of overlapped
• cover complete universe, or partial cover
• interval (box), or arbitrary shapes
36
Point Access Methods ...
– Multidimensional Extended Hashing
• Heuristics: hashing functions that preserves ordering to some extend
• idea: objects close in space should be stored close in disk
– Grid File:
•
•
•
•
superimposes d-dimensional orthogonal grid on the universe.
Grids not regular, different shapes and sizes for cells
grid directory links cells to data buckets
grid directory stored in disk, grid is in main memory
– result: no more than 2 disk accesses for exact match queries
• insert data: if bucket is full, requires cell splitting
37
Point Access Methods ...
– Grid File
– con: super-linear growth
even for uniformly
distributed data (lower left
part, 4 cells are pointed to one
bucket)
– exact match point query:
• use scales to locate cell
containing search point
• if not in memory, one
disk access to get it
• retrieve bucket that
have the objects in cell,
and compare to search
point (one more access)
– range query:
• test all cells that overlap
the search region
38
Point Access Methods ...
– EXCELL
• like grid file
• splits the universe regularly
• if insert require cell splitting, splits all cells (double the directory
size)
39
Point Access Methods ...
– Two level Grid file
•
•
•
•
idea: use a second grid file to manage grid directory
first level is root directory (coarsened level of the second level)
second level, actual grid directory
first level points to second, second to data
• pro:
– splits are confined to subdirectory regions, not effecting their
surroundings: slower directory growth.
– Second level in main memory: still 2 accesses for exact match
40
Point Access Methods ...
– Two level Grid file
41
Point Access Methods ...
– Twin Grid File
•
•
•
•
•
•
idea: second grid file: increase space utilization
objective: reduce total number of buckets (or cells) in two files
relation between files not hierarchical (like two-level)
both files spans the whole universe
data is distributed between two files dynamically
pro:
– space utilization: 90% (original grid 69%)
– superior to original grid for larger search spaces
• con: inferior to original grid for smaller query ranges
• inserting data:
–
–
–
–
if bucket overflows, first try to redistribute the data between files
could lead to bucket overflow or underflow in one
make the shift permanent if the number of buckets is minimized
otherwise split the bucket
• search requires checking both files
42
Point Access Methods ...
– Twin Grid File
43
Point Access Methods ...
– Multidimensional Linear Hashing
• no directory (unlike extended), or very small directory
• result:
– relatively little storage requirements
– can keep the related information in main memory
• early techniques:
– fail to support range queries
• multidimensional order-preserving linear hashing with partial expansions
(MOLHPE)!
– idea : partially extending the buckets without expanding the file size
at the same time
– guarantees modest file growth
– very good for uniformly distributed data, fails for nonuniform
distributions
44
Point Access Methods ...
– Hierarchical Access Methods
•
•
•
•
•
based on binary of multiway tree structure
like hashing, stores data in bucket
each bucket is leaf of a node, and a disk page
interior nodes of the tree guide search
search: top-down tree traversal
• difference between different methods: characteristics of the regions
45
Point Access Methods ...
– Hierarchical Access Methods:
• k-d-B-tree
–
–
–
–
–
–
–
–
combination of adaptive k-d-tree and B-tree
partition the universe like adaptive k-d
associates subspaces to tree nodes
interior nodes are intervals
nodes in same level are mutually disjoint
perfectly balanced (like B-tree)
search straightforward, like k-d-tree
insert: search, find the right bucket, if required split and move half
the data to it.
– Deletion: search, remove, if necessary merge node with siblings
– con: no minimum space utilization can be guaranteed
46
Point Access Methods ...
– Hierarchical Access Methods:
• k-d-B-tree
47
Point Access Methods ...
– Hierarchical Access Methods:
• LSD tree (??)
– directory is organized same as adaptive k-d-tree
– better adaptation to data distribution (in compare to fixed binary
partitioning)
– external balancing property: heights of external subtrees differ at
most by one (??)
– combines two split strategies to accommodate skewed data:
» data-dependent : based on data, tries to achieve most balanced
structure (equal number of data in both sides of split)
» distribution-dependent: split at fixed dimension and position
(know distribution is assumed)
48
Point Access Methods ...
– Hierarchical Access Methods:
• LSD tree (??)
49
Point Access Methods ...
– Hierarchical Access Methods:
• Buddy tree:
–
–
–
–
dynamic hashing scheme with tree structure (hybrid)
tree is made by consecutive insertions
cut the universe equally with iso-oriented hyperplanes
interior nodes: a partition and an interval (MBB of points or intervals
below node)
– intervals in same level nodes are mutually disjoint
– leaves are data (like other trees!!)
– each directory node has at least two entries
» so: may not be balanced
– when a node splits, MBB of two intervals are computed to reflect the
current situation
» so: tries to achieve high selectivity at directory level
– except for root, only one pointer refers to each directory page.
» so: guarantees linear growth
50
Point Access Methods ...
– Hierarchical Access Methods:
• Buddy tree:
51
Point Access Methods ...
– Hierarchical Access Methods:
• BANG file (Balanced and Nested Grid)
–
–
–
–
a hybrid method
divides the univers to intervals (boxes), similar to grid
difference: buckets regions may intersect
can form nonrectangular bucket regions by taking geometric
difference of two intervals (nesting) (?)
– increased storage utilization: redistributes data between bucket
during insertion
– balanced search tree to manage directory
52
Point Access Methods ...
– Hierarchical Access Methods:
• BANG file (Balanced and Nested Grid)
– first 3 rectangles: R1: R2, R5, R6
– then R3 and R4 in R2 and R5
– representation as bit interleaving
» * = universe
– a point search may require traversal of entire directory in depth-first manner
53
Point Access Methods ...
– Hierarchical Access Methods:
• hB-tree (?)
– utilizes k-d-B-tree to organize the space represented by interior
nodes
– difference in splitting: based on multiple attributes
– region not boxed shape
54
Point Access Methods ...
– Hierarchical Access Methods:
• BV-tree (?)
– tries to solve d-dimensional B-tree
– idea: maintain major strengths of Btree, by relaxing balancing and
space utilization
– BV-tree not balanced
– at least 33% space utilization (50% for B-tree)
55
Space Filling Curves for Point Data
– Reason that multidimensional is harder than one-dimensional:
• No total ordering that preserves spatial proximity
– work around:
• heuristic methods: total orders that preserves proximity at least
to some extend
• idea: objects that are close in original space, should also be close in
total order (one-dimensional image space)
• maybe good for point queries, not good for region queries
• all methods:
– divide the space to grids
– assign unique numbers for each grid
– index the original data based on that number
56
Space Filling Curves for Point Data ...
•
Peanu = Quad codes = Bentley = Ntrees = Location codes = z ordering
•
z-order and Hilbert are most suitable
for multidimensional access, Hilbert
better!
•
•
pro: insensitive to dimension
con: incompatible index partitions can
not be joined without recomputing
the codes of at least one of the
indices (??)
57
Spatial Access Methods
•
Previous methods are for points, not for objects with extension
•
How to do objects with extension? modifying point access methods
•
classification of methods:
– based on different techniques:
• transformation (object mapping)
• overlapping regions (object bounding)
• clipping (object duplication)
• multiple layers
– based on “base type”: primarily supported spatial data type, mostly
intervals
58
Spatial Access Methods ...
59
Spatial Access Methods ...
• Transformation
– transform object to different representation
– then use PAMs or one-dimensional access methods
– possible options:
• transform each object to higher dimensional point
• transform object to one-dimensional intervals using space-filling
curves
60
Spatial Access Methods ...
• Transformation …
– mapping to higher dimensional space
– e.g., four numbers ( = to a point in four dimensional space) for a
rectangle
– use one of the PAMs for this new point
– options:
• x and y coordinates of two diagonal corners, endpoint
transformation
• x and y coordinates of center, and height and width, midpoint
transformation
– more complex objects: approximate with rectangle or sphere!
• result: PAM provide partial result
61
Spatial Access Methods ...
• Transformation (mapping to higher dimensional space )…
– endpoint transformation for range query with search range [l,u]
– a) any point in shade areas corresponds to an interval in original space
that overlaps the search interval [l,u]
– b) containment/enclosure with interval [l,u] map into general region
query
– c) point query maps to range query
62
Spatial Access Methods ...
• Transformation (mapping to higher dimensional space )…
– midpoint transformation for range query with search range [l,u]
– a) any point in shade areas corresponds to an interval in original space
that overlaps the search interval [l,u]
– b) containment/enclosure with interval [l,u] map into general region
query
– c) point query maps to general region query
63
Spatial Access Methods ...
• Transformation (mapping to higher dimensional space ) …
– cons:
• formulation of point and range queries is more difficult in
new (dual) space
– finite search regions may map to infinite search regions in dual
space
– more complex queries with spatial predicates may not be
expressible at all
• depending on the mapping, the distribution of point in dual
space may be highly non-uniform, even if data in original
space is uniform
• image of two close objects may be far in dual space
64
Spatial Access Methods ...
• Transformation …
– Space-Filling Curves for Extended Objects
– has less drawbacks
– represent extended objects with grid cells
– equal to: represents extended object with union of several
simpler objects
– equal to: list of one-dimensional intervals that define position of
the grids.
– variations: z-ordering, Hilbert R-tree, UB-tree
65
Spatial Access Methods ...
• Transformation (Space-Filling Curves for Extended Objects) …
– z-order:
• Peano curve
• recursively splits universe to equal subspaces with iso-oriented
hypeplanes
• assign z-values to z-regions
• store z-values in standard one-dimensional index
• an approximation of the original object
• more z-regions = more accuracy = increase the size and complexity
of approximation : conflicting objectives!!
66
Spatial Access Methods ...
• Transformation (Space-Filling Curves for Extended Objects) …
– z-order:
• example: 9 regions, different shapes and sizes
• if peano code z1 if a previx of z2, then region corresponding to z1 is
enclosed in z2
67
Spatial Access Methods ...
• Overlapping regions
– idea: different data buckets correspond to mutually overlapping
subspaces
– can put any object to one bucket
– extends regions to accommodate new data
– increase search paths (due to overlap), even for point
• problem: performance, specially when objects are large in compare
to universe
– very large objects lead to ineffective index, the whole index should be
searched !!
• minor problem: ambiguity during insertion
– any subspace could be picked to enlarge
– solution:
» pick subspace that causes minimal additional overlap
» or the one that requires least enlargement
» or takes less time
68
Spatial Access Methods ...
• Overlapping regions
– R-tree:
• hierarchy of nested intervals
• nodes correspond to intervals
• intervals of descendant of a node are contained in interval of that
node !
• Same level nodes may have overlap
• leaf node: MBB and reference of the actual data
• Each node has between m (lower threshold) and M (upper threshold)
entries
• m ensures efficient storage
• R-tree is height-balanced
• search is similar to B-tree, but several intervals in each level may
satisfy the search
• provides candidate search results, requires refinement
• insertion: only one path is traversed, at each node pick the child
which requires least enlargement to cover the object
• deletion, may require adjustment in size of the covering interval
69
Spatial Access Methods ...
• Overlapping regions
– R-tree
70
Spatial Access Methods ...
• Overlapping regions
– R*-tree:
• similar to R-tree
• forced reinsert policy:
– if a node overflows, don’t split right away
– remove some (30% of M) nodes from the node, and reinsert them
• deletion and search are same as R-tree
• splitting policy:
– all R-tree policies
– minimize overlap between same level nodes (less probability for multiple
search paths)
– minimized region perimeters (regions should become squares)
– maximize storage utilization
• pro: 50% performance improvements
• con: cpu overhead for reinsert
71
Spatial Access Methods ...
• Overlapping regions
– R*-tree:
72
Spatial Access Methods ...
• Overlapping regions
– P-tree (Jagadish), JP-tree:
• idea: approximations of objects are not MBBs
• introduces m orientation (m > d)
– e.g., for d=2, m=4: two parallel to coordinate axes, two parallel to
diagonals.
•
•
•
•
Polytopes are parallel to these orientations
quality of approximation is related to m.
interior nodes are nested polytopes (not intervals)
polytopes in same level overlap
• pro: better approximation
• con: larger size
• experimental results: m=10 good for d=2 (5 times more space)
73
Spatial Access Methods ...
• Overlapping regions
– P-tree (Jagadish), JP-tree:
74
Spatial Access Methods ...
• Overlapping regions
– P-tree (Schiwietz), SP-tree:
• combine advantages of R*-tree and cell tree
• similar to R-tree, but interior nodes are nested polytopes instead of
rectangles
• idea: number of vertices of polytopes are not bounded
– good: more vertices, better approximation
– bad: should be bound to guarantee minimum fanout of interior
nodes (?)
– so: pentagon and hexagon offer best tradeoff
• if insertion or splitting leads to more vertices for bounding
rectangles, surplus vertices are moved one by one.
75
Spatial Access Methods ...
• Overlapping regions
– P-tree (Schiwietz), SP-tree:
76
Spatial Access Methods ...
• Overlapping regions
– SKD-tree
• a variant of k-d-tree, capable of storing spatially extended object
• allows regions to overlap
• keep upper and lower bound for discriminators, representing
maximal extent of the objects in two subtrees
• insertion:
– define centroid and MBB of the object.
– Compare centroid with discrimination lines to decide for next
child to be inspected
– may need to adjust lower and upper bound of extended objects
– if overflows node, split and insert new discriminators in SKD
• search as usual, start from root, at each node decide for child to
visit next
• deletion: if underflow, save remained data in siblings, remove
splitting hyperplane. if insertion, do insert !
77
Spatial Access Methods ...
• Overlapping regions
– SKD-tree
78
Spatial Access Methods ...
• Overlapping regions
– GBD tree (Generalized BD)
• GBD: balanced multi-way tree
• stores as hierarchy of MBBs
• interior nodes: MBB of (usually overlapping) MBBs of its
descendants
• leaf node: MBB of the objects whose centroids are contained in
corresponding bucket region.
• Intervals coded using DZ-expression (z-values)
• pro: faster insert and delete in compare to R-tree, because of
encoding scheme and placement by centroid
• but no advantage for search
79
Spatial Access Methods ...
• Overlapping regions
– GBD tree (Generalized BD)
80
Spatial Access Methods ...
• Overlapping regions
– PLOP hashing (piecewise linear order-preserving)
•
•
•
•
•
•
•
a variant of hashing
allows storage of extended objects without transforming into points
partition the universe same as grid file
extended objects may span more than one cell
uses d binary trees
interior nodes: (d-1) dimensional iso-oriented hyperplane
leaf nodes: d-dimensional subspaces forming the universe
• pro: superior to R-tree and R+-tree for uniformly distributed data
81
Spatial Access Methods ...
• Overlapping regions
– PLOP hashing (piecewise linear order-preserving) (??)
82
Spatial Access Methods ...
• Clipping
– no overlap between bucket regions (mutually disjoint)
– example:
• R+-tree: a variant of R-tree, no overlap between nodes in same level
• result: point queries follow single path
– cons:
• insertion: any data that spans more than one bucket has to be
subdivided along the partitioning hyperplanes. (several bucket entries
for same object)
• deletion: has to delete from several buckets
• if a method doesn’t partition complete data space, a new object may
lead to enlargement of several buckets
• deadlock: has to enlarge a bucket, but this may lead to overflow
• splitting: there might be cases that a bucket has to be split, but
there is no hyperplane that splits none of the objects in the bucket
(triggers other splits, problem for large files)
83
Spatial Access Methods ...
• Clipping
– Extended k-d-tree:
• extension of adaptive k-d-tree
• based on clipping (SKD based on overlapping)
• interior nodes: (d-1) dimensional partitioning hyperplanes,
represented by dimension and discriminator
• leaf node: rectangular subspace, and address of data page
• data page may be referenced by different leaf nodes
• insert:
–
–
–
–
start from root
test intersection with stored hyperplane
either go to one direction, or clip the object and go in both directions
overflow case, split page by new hyperplane (dimension perpendicular)
• delete:
– visit all subspaces intersecting with object
– underflow, merge with siblings
84
Spatial Access Methods ...
• Clipping
– Extended k-d-tree:
85
Spatial Access Methods ...
• Clipping
– R+-tree:
• idea: resolve problems for overlapping regions in R-tree
• same as other clippings: no overlap in same level nodes, if intersect
more than one index interval, has to be stored on different pages …
• point search: single path (faster than R-tree)
• range searches: usually more than one path
• insert:
– depending on number of intersection, may follow different
paths
– may place different fragments in different leaf nodes
– may need to enlarge interval for a corresponding leaf node,
which may overlap with other intervals, then split and reinsert
• delete:
– remove all fragments
– if underflow, merge with siblings (not always possible!!)
86
Spatial Access Methods ...
• Clipping
– R+-tree:
87
Spatial Access Methods ...
• Clipping
– Cell tree
•
•
•
•
•
•
•
•
•
•
•
goal: facilitate search for arbitrary shaped objects
decomposes universe to disjoint convex subspaces
interior node: hierarchy of nested polytopes
leaf node: one of the subspaces
convex polyhedra are subspaces of BSP (binary space partitioning)
cell tree = combination of BSP and R+-tree
non-convex objects are decomposed into convex components whose
union is original object
components don’t have to be disjoint
may have to subdivide component to several cells (clipping!)
overflow, split subspace and cell tree, distributed data between two
nodes. Split may propagate up the tree (?)
search: start from root, specify subspaces (using BSP partitioning),
go to corresponding sub-branch .
88
Spatial Access Methods ...
• Clipping
– Cell tree
89
Spatial Access Methods ...
• Multiple Layers
– a variant of overlapping
– data regions in different layers may overlap
– difference:
•
•
•
•
layers are organized in a hierarchy
each layer partitions the complete universe in different way
data layers within layers are disjoint
data regions do not adapt to the spatial extensions of the
corresponding data objects
– inserting object:
• find lowest layer that its hyperplanes don’t split the object
• if there is any, insert in to it
• if overflow, split the data region by introducing new hyperplane and
distributes the entries accordingly
• object intersect with hyperplane are moved to a higher layer
• result: large object in higher layers, close objects in lower layers
90
Spatial Access Methods ...
• Multiple Layers
– pro:
• possibly higher selectivity during searching due to restricted
overlap of different layers (?)
– con:
• suffers from fragmentation, bad for some distributions
• certain queries require inspection of all layers
• don’t know how to cluster object that are close, but in different
layers
• some ambiguity for the layer that should place the object
91
Spatial Access Methods ...
• Multiple Layers
– Multilayer Grid file
•
•
•
•
consists of ordered sequence of grid layers
each layer corresponds to separate grid file
each layer cover the whole universe
insert: into first grid file in the sequence that does not imply any
clipping of the object
• size of the bucket regions typically increase within the sequence
(larger object go to later layers)
• if a new object can not be stored in any of the current layers
without clipping, a new layer has to be allocated.
• Exact query: determine from scales which grid file in supposed to
hold the search interval
• other search queries: traversing sequence of layers, performing
search on each grid file
• pro: superior to original grid file
• cons: low space utilization, expensive directory maintenance
92
Spatial Access Methods ...
• Multiple Layers
– Multilayer Grid file
93
Spatial Access Methods ...
• Multiple Layers
– R-File:
•
•
•
•
•
•
idea: avoid problems of multilayer grid file
uses a single directory
universe cut same as BANG file (equal parts)
z-ordering to encode resulting bucket regions
bucket regions may overlap, but there is no clipping (?)
splitting algorithm:
– instead of cutting to two halves, keep the original region, make one for
one of the two halves, move data intervals to new bucket if they are
completely contained in the region
– pick the half that distributes data between two halves most even (?)
• pro: competitive to R-tree
• con: partitions the whole space, R-tree partitions the part that
contains object
– so: poor performance for non-uniformly distributed data
94
Spatial Access Methods ...
• Multiple Layers
– R-File (?)
95
Comparative Studies
Experimental Results
• Results confusing !
• Why?
–
–
–
–
–
–
–
uniformly distributed?
data type?
a lot of overlaps?
real data?
different variant (R+, R+ with bla, … )
different queries
...
96
Comparative Studies
Experimental Results
• Search performance for: R-tree, k-d-B-tree, R+-tree
uniformly distributed rectangles of varying size)
(10,000
– k-d-B-tree can never compete with R-tree variants.
– Not much difference between R and R+
difficult to code)
(R+ is significantly more
– R+ performs better when there is less overlap between
rectangles
• Performance for point access methods: hB-tree, BANG file,
two-level grid file, buddy tree.
– Buddy tree and BANG outperform two others
– Performance varies with data distribution and query range size
• for clustered data and query range of size 10%, no difference
between buddy and BANG
• query range of 0.1% of size of the universe, buddy performs twice
as fast
97
Comparative Studies
Experimental Results ...
• For extended objects: R-tree and PLOP hashing with buddy
tree and BANG file (enhanced with transformation technique to handle
rectangles)
– buddy and BANG outperformed for all data distributions
– (measurement for number of pages accessed, and not cpu time)
• R*-tree with several variants of R-tree (performance for point,
intersection and enclosure queries, also spatial join, for different data distributions)
– R* is the winner for queries, best storage utilization and
insertion time.
– (again, only disk access was measured)
• Another similar benchmark:
– Hilbert R-tree even better than R*-tree
98
Comparative Studies
Experimental Results ...
• Performance of clipping, overlapping and transformation on
top of buddy-tree, R*-tree and two-level grid file
– buddy with clipping and grid file failed completely (huge files)
– Transformation technique: fast insertion, low storage
utilization.
– R*-tree: long insertion times, good storage utilization.
– Buddy with overlapping superior to buddy with transformation
for intersection and containment queries.
– Performance of overlapping techniques decreases with large
query regions, still better than buddy with transformation.
– Buddy with overlapping better than R* for some queries.
99
Comparative Studies
Experimental Results ...
• Static and Dynamic variants of skd-tree with packed R-tree
– for large page sizes, skd-tree outperforms r-tree for page
access per search operation
– Space requirements of skd-tree is higher than R-tree
– Containment queries are answered faster in skd-tree
• skd-tree with extended k-d-tree enhanced with overflow
pages:
– skd-tree is superior
– k-d-tree performs well for uniformly distributed data
• R-tree with cell tree and R+-tree
(convex polygons used instead of polygons)
(two clipping-based access methods)
– Cell tree requires twice as much space.
– Average number of page accesses per search is less for cell
tree (even better with increase in size of the DB and query
region)
100
Comparative Studies
Experimental Results ...
• Original cell tree with cell tree with oversize shelves, R*tree and hB-tree (real catographic data):
– slight performance advantage of cell tree with oversize shelves
in compare to R* and hB-tree, but much better from original
cell tree
• query time of KD2B-tree (paged version of KD2-tree) and R-tree for
different queries:
– KD2B-tree outperforms R-tree for all queries
• PMR-quadtree with R*-tree and R+tree for indexing line
segments:
– performance for all is about the same
– R+: best insertion performance (but depends on page size)
– R*: occupies the least space, inferior to R+ for seacrhing line
segments
– PMR-quadtree not dependent to page size
101
Comparative Studies
Experimental Results ...
• Similar benchmark to previous one, two quadtree variants
with R*-tree and R+tree, but used polygons instead line
segments (and on top of a commercial object-oriented system):
– R+ and quadtrees outperform R* for point queries.
• A variant of zkdB+-tree with grid file, R-tree and R+-tree
(for insertion, deletion, search operations and storage utilization):
– z-ordering and grid file good for insertion and deletion, but
poor for search.
– R and R+: moderate insertion and deletion, superior search
– R+ slightly better than R, but very poor space utilization (not goor
choice for general purpose applications)
• R-file and R-tree:
– R-file 10-20% performance advantage for highly overlapped
rectangles
102
Comparative Studies
Experimental Results ...
• Different space-filling curves: z-order, Gray coding, Hilbert
coding:
– Hilbert mapping (form multidimensional to line) is superior to all
• SUMMARY:
– Best performers:
•
•
•
•
•
•
•
buddy
cell tree with oversize shelves
Hilbert tree
KD2B-tree
PMR-quadtree
R+-tree
R*-tree
– But still hardly comparable
103
Comparative Studies
Theoretical Results
• Very difficult to perform if stick to all modeling assumptions
• So very few results
• links to different studies are in the paper, good luck.
104
Conclusions
• Different spatial access methods
• No one is superior to others in whatever sense
• A method is a clear winner by a benchmark,
inferior by another benchmark!
– Reason:
• So many different criteria for optimality
• So many parameters to define performance
– Example:
• A good access method for iso-oriented rectangles, may not be good
for arbitrarily oriented lines.
• A good access method for dense data may not be good for sparse
data.
• An optimized index method for point queries may be inefficient for
region query
• A good method for static environment may not be good for an
environment which has too many insertion/deletion.
105
Conclusions ...
• Technology transfer
– Pick the method that is easy to understand and
implement and robust.
– Performance not that much important
– Try to optimize performance by highly tuned
implementation.
– Examples:
• Quadtree for SICAD and SmallWorld GIS.
• R-tree by Informix
• Z-ordering by Oracle.
106

Multidimensional Access Methods

Transcript Multidimensional Access Methods

Directory