No Slide Title

Download Report

Transcript No Slide Title

Multidimensional Data Structures
• Why do we need indexing?
– Quick access
• Search methods
– sequential search
– binary search
– balanced search tree (B or B+ tree)
• Why do we need new indexing structures
– Traditional data is one-dimensional
– multimedia data is multidimensional
• in general, if a given information has k features, it can be
represented by a k-dimensional space, where each dimension
corresponds to one feature
What kind of queries may be
expected?
• Given a set of points in k-dimensional space
– we may want to see if the point is in the set or not (exact
match)
– we may want to find the closest points to the given point
(similarity based search or nearest neighbor queries)
– given a region, we may want to find all the points in the
given region (range query)
• Approach
– divide the space into regions
– insert the new object corresponding region
– if the region is full, split the region
• Query
– determine which regions are required to answer the
query, and limit the search to these regions
Multidimensional Data Structures
•
•
•
•
•
K-d Trees
Point Quadtrees
MX-Quadtrees
R-Trees
Many others exist
– We do not discuss them in the class
K-d Trees
• Used to store k dimensional point data
• Not used to store region data
• A 2-d tree (k=2) stores 2-dimensional point data
while 3-d tree stores 3-dimensional point data, ..
2-d trees
• Node Structure
nodetype = record
INFO: infotype;
XVAL: real;
YVAL: real;
LLINK: nodetype;
RLINK: nodetype;
end
INFO
LLINK
XVAL
YVAL
RLINK
 INFO filed is any user-defined type
 XVAL and YVAL denote the coordinates of a point
associated with the node
 LLINK and RLINK fields point to two children
2-d Trees
• 2-d tree is a binary tree satisfying the following
properties
– If N is node in the tree such that level(N) is even, then
• every node M in the subtree rooted at N.LLINK satisfies
M.XVAL < N.XVAL
• every node P in the subtree rooted at N.RLINK satisfies
P.XVAL  N.XVAL
– If N is node in the tree such that level(N) is odd, then
• every node M in the subtree rooted at N.LLINK satisfies
M.YVAL < N.YVAL
• every node P in the subtree rooted at N.RLINK satisfies
P.YVAL  N.YVAL
2-d Trees: Insertion/Search
• To insert N into a tree pointed by T,
–
–
–
–
Check if N and T agree on their XVAL and YVAL
IF so, just overwrite T and we are done
Else branch left if N.XVAL < T.XVAL and branch right otherwise
Suppose P is the child. If N and P agree on their XVAL and YVAL,
overwrite P and we are done, else branch left if N.YVAL < P.YVAL and
branch right otherwise
– Repeat this procedure, branching on XVALs when we are at even levels,
and YVALs when we are at odd levels in the tree
A2
30,50
60
A1 >= 30
A1 < 30
(30,50) 
60,10
40
A2 < 10
(45,20)

20
20
40
(60,10)
A1
60
A2 >= 10
45,20
A1 < 45
A1 >= 45
Another example of 2-d tree
City
(XVAL,YVAL)
Banja Luka
(19,45)
Derventa
(40,50)
Toslic
(38,38)
Tuzla
(54,35)
Sinj
(4,4)
City
(XVAL,YVAL)
Banja Luka
(19,45)
Derventa
(40,50)
Toslic
(38,38)
Tuzla
(54,35)
Sinj
(4,4)
Example of Insertion
City
(XVAL,YVAL)
Banja Luka
(19,45)
Derventa
(40,50)
Toslic
(38,38)
Tuzla
(54,35)
Sinj
(4,4)
Example of Insertion
City
(XVAL,YVAL)
Banja Luka
(19,45)
Derventa
(40,50)
Toslic
(38,38)
Tuzla
(54,35)
Sinj
(4,4)
Example of Insertion
City
(XVAL,YVAL)
Banja Luka
(19,45)
Derventa
(40,50)
Toslic
(38,38)
Tuzla
(54,35)
Sinj
(4,4)
Example of Insertion
City
(XVAL,YVAL)
Banja Luka
(19,45)
Derventa
(40,50)
Toslic
(38,38)
Tuzla
(54,35)
Sinj
(4,4)
Deletion in 2-d Trees
• Suppose we wish to delete (x,y) from a 2-d tree T
– search for a node N in T with N.XVAL = x and N.YVAL = y
– if N is a leaf node, then set the LLINK and RLINK fields of
N’s parent to NIL and return N to appropriate storage
– otherwise, either the subtree rooted at N.LLINK (Tl) or the
subtree rooted at N.RLINK (Tr) is non-empty
• Step1: Find a “candidate replacement” node R that occurs either in
Tl or in Tr
• Step2: Replace all of N’s non-link fields by those of R
• Step3: Recursively delete R from Tl or Tr (whichever is applicable)
– the above recursion is guaranteed to terminate because Tl (Tr)
has strictly smaller height than the original tree T
Finding Candidate Replacement Node
• The candidate replacement node R must bear the same
spatial relation to all nodes P in both Tl or Tr that N
bore to P
• I.e, if P is to the southwest of N, then P must be to the
southwest of R, if P is to the northwest of N, then P
must be to the northwest of R, ….
• This means R must satisfy the following properties:
– 1. every node M in Tl is such that: M.XVAL < R.XVAL if
level(N) is even and M.YVAL < R.YVAL if level(N) is odd
– 2. every node M in Tr is such that: M.XVAL  R.XVAL if
level(N) is even and M.YVALR.YVAL if level(N) is odd
Finding Candidate Replacement Node
• If Tr is not empty, and level(N) is even, then any node
in Tr with smallest possible XVAL in Tr is a candidate
replacement node
• If Tr is empty, then we might not be able to find a
candidate replacement node from Tl
• In this case, find the node R’ in Tl with the smallest
possible XVAL filed. Replace N with this
• Set N.RLINK = N.LLINK and set N.LLINK = NIL
• Recursively delete R’
Range queries in 2-d trees
• A range query with respect to a 2-d tree T is a query
that specifies a point (xc,yc) and a distance r
• the answer is the set of all points (x,y) in T such that
(x,y) lies within distance r of (xc,yc)
• I.e., a range query defines a circle of radius r centered
at (xc,yc), and expects to find all points in the 2-d tree
that lie within the circle
• recall that each node N in T implicitly represents a
region RN
• If the circle specified in a query has no intersection
with RN, then there is no point in searching the
subtree rooted at N
Example of the Range query
(xc,yc) = (35,46)
r = 9.5
K-d trees
• It is a binary tree
• Every node contains a data record, a left pointer and
a right pointer
• At every level of the tree, a different attribute of the
tree is used as the discriminator in a round-robin
fashion
• All algorithms for 2-d trees generalize in the
obvious way to k-d trees
Point Quadtrees
• Point quad trees always split regions into 4 parts
• In a 2-d tree, node N splits a region into two by drawing one
line through the point (N.XVAL,N.YVAL)
• In a point quadtree, node N splits the region it represents by
drawing both horizontal and vertical line through the point
(N.XVAL,N.YVAL)
• These 4 parts are called the NW, SW, NE, SE quadrants
determined by node N; each of these corresponds to a child of
N
• Node Structure
qtnodetype = record
INFO: infotype;
XVAL: real;
YVAL: real;
NW,SW,NE,SE: qtnodetype;
end
Insertion into Point Quadtrees
City
(XVAL,YVAL)
Banja Luka
(19,45)
Derventa
(40,50)
Toslic
(38,38)
Tuzla
(54,35)
Sinj
(4,4)
Insertion into Point Quadtrees
City
(XVAL,YVAL)
Banja Luka
(19,45)
Derventa
(40,50)
Toslic
(38,38)
Tuzla
(54,35)
Sinj
(4,4)
Insertion into Point Quadtrees
City
(XVAL,YVAL)
Banja Luka
(19,45)
Derventa
(40,50)
Toslic
(38,38)
Tuzla
(54,35)
Sinj
(4,4)
Insertion into Point Quadtrees
City
(XVAL,YVAL)
Banja Luka
(19,45)
Derventa
(40,50)
Toslic
(38,38)
Tuzla
(54,35)
Sinj
(4,4)
Insertion into Point Quadtrees
City
(XVAL,YVAL)
Banja Luka
(19,45)
Derventa
(40,50)
Toslic
(38,38)
Tuzla
(54,35)
Sinj
(4,4)
Insertion into Point Quadtrees
City
(XVAL,YVAL)
Banja Luka
(19,45)
Derventa
(40,50)
Toslic
(38,38)
Tuzla
(54,35)
Sinj
(4,4)
Deletion in Point Quadtrees
• If the node being deleted is a leaf node, deletion is
trivial: we just set the appropriate link filed of node
N’s parent to NIL and return node to storage
• otherwise, as in the case of 2-d trees, we need to find
an appropriate replacement node
Expanded Node Type
• Expand the node structure to the following
qtnodetype = record
INFO: infotype;
XVAL,YVAL: real;
XLB,YLB,XUB,YUB: real  {-,+}
NW,SW,NE,SE: qtnodetype;
end
• When inserting a node N into T we need to ensure that
– If N is the root node, then N.XLB = - , N.YLB = - , N.XUB =
+, N.YUB = +
• If P is the parent of N (assume w=P.XUB-P.XLB and
h=P.YUB-Y.YLB), then
Case
N=P.NW
N=P.SW
N=P.NE
N=P.SE
N.XLB
P.XLB
P.XLB
P.XLB+w*.5
P.XLB+w*.5
N.XUB
P.XLB+w*.5
P.XLB+w*.5
P.XUB
P.XUB
N.YLB
N.YUB
P.YLB+h*.5
P.YUB
P.YLB
P.YLB+h*.5
P.YLB+h*.5
P.YUB
P.YLB
P.YLB+h*.5
Deletion in Point Quadtrees
• When deleting an interior node N, we must find a
replacement node R in one of the subtrees of N such that
–
–
–
–
every other node R1 in N.NW is to the northwest of R
every other node R2 in N.SW is to the southwest of R
every other node R3 in N.NE is to the northeast of R
every other node R4 in N.SE is to the southeast of R
• In general, it may not always be possible to find such a
replacement node
– deletion of an interior node N may require reinsertion of all
nodes in the subtrees of N
– In the worst case, this may require almost all nodes to be
reinserted
Range Searches in Point Quadtree
• Similar to that of 2-d trees
• each node in a point quadtree represents a region
• do not search regions that do not intersect the circle
defined by the query
The MX-Quadtree
• For both 2-d trees and point quadtrees,
– the shape of the tree depends on the order in which
objects are inserted
– the split (into 2 for 2-d and 4 for point quad) may be
uneven depending on exactly where the point
(N.XVAL,N.YVAL) is located inside the region
represented by N
• MX-quadtrees (MX stands for matrix) attempt to
– ensure that the shape (and height) of the tree are
independent of the number of nodes present in the tree as
well as the order of insertion of these nodes
– provide efficient deletion and search algorithms
The MX-Quadtree
• The map being represented is split up into a grid of size
(2k2k) of some k
• the application developer is free to choose k to reflect the
desired granularity, but once chosen, this must be kept fixed
• Node structure: exactly same as that of point quadtrees,
except that the root represents the region specified by
XLB=0, XUB=2k, YLB=0, YUB=2k
• When a region gets split, it gets split in the middle
– the regions represented by the four children of N (w denotes the
width of the region represented by N
Child
NW
SW
NE
SE
XLB
N.XLB
N.XLB
N.XLB+w/2
N.XLB+w/2
XUB
N.XLB+w/2
N.XLB+w/2
N.XLB+w
N.XLB+w
YLB
N.YLB+w/2
N.YLB
N.YLB+w/2
N.YLB
YUB
N.YLB+w
N.YLB+w/2
N.YLB+w
N.YLB+w/2
Insertion into MX-Quadtree
Insertion into MX-Quadtree
Deletion in MX-Quadtrees
• It is fairly simple because all points are represented
at the leaf level
• To delete N,
– First set the appropriate link of N’s parent (M) to NIL
– Check if all the four link fields of M are NIL
– If so, examine M’s parent (P), find the link field P.dir1 =
M, set P.dir1 = NIL, and see if P’s four link fields are
NIL
– If so, continue this process
– Complexity of deletion is O(k)
Range Queries in MX-Quadtrees
• Handled exactly the same way as for point
quadtrees, but there are 2 differences
– the content of XLB,XUB,YLB,YUB fields are different
– as points are stored at the leaf level, checking to see if a
point is in the circle defined by the range query needs to
be performed only at the leaf level
R-Trees
• R-trees are used to store rectangular regions of an image or
map
• R-trees are particularly useful in storing very large amounts
of data on disk
• They provide a convenient way of minimizing the number
of disk accesses
• Each R-tree has an associated order, K
• Each non-leaf node contains at most K rectangles and at
least K/2 rectangles (except root) (I.e, each non-root node
must be at least half full)
• This makes R-trees appropriate for disk based retrieval
(because each disk access brings back at least K/2
rectangles)
R-Trees
Manipulate two kinds of rectangles: “real” and “group”
R-Tree
• Node Structure
rtnodetype = record
Rec1, …. RecK: rectangle;
P1,…PK: rtnodetype;
end
Insertion into an R-tree
Insertion into an R-tree
Insertion into an R-tree
An incorrect insertion into an R-tree
Deletion in R-Trees
• Deletion may cause a node to “underflow” because
an R-Tree must contain at least K/2 rectangles (real or
group) (Recall B+-trees)
• When we delete a rectangle, we must make sure that the
node is not underfull
Comparison of the four data structures
• k-d trees
– easy to implement
– k-d tree with k nodes may have height k
– since the trees are binary, search and insertion are
expensive
• point quadtrees
–
–
–
–
easy to implement
comparison requires comparison of 2 attributes
deletion is difficult
complexity of range queries O(2n) where n is the
number of records in the tree
Comparison of the four data structures
• MX-quadtrees
– height is at most O(n) where the region represented is
composed of (2n2n) cells
– insertion, deletion, search: O(n)
– range search is very efficient O(N+2h) where N is the
number of points in the answer to the query and h is the
height of the tree
• R-tree
– insertion, deletion, search, same as MX-quadtrees
– since large number of rectangles are stored in each node,
they are appropriate for disk accesses
– the bounding rectangles may overlap (I.e., we may have
to search via multiple paths)
Other multidimensional data
structures
•
•
•
•
•
K-d-B trees
hB-tree
PMR-tree
R*-tree
….
Commercial Systems
• Informix’s MapInfo Geocoding datablade: allows
assignment of latitudinal and longitudinal elements to
records
• Informix’s Spatial datablade: employs R-tree
• Oracle Universal server provides a spatial data option and is
based on quadtree technology
• Intergraph’s Land Information System allows integration of
survey data, imagery, etc., Allows to create
temporal/historical view of landuse
• ESRI provides ARC/INFO system
– the spatial database engine works with geographic data stored in
Oracle, Informix, Sybase, Microsoft SQL server, and DB2
– Interesting to see what data structure do they employ!!