Transcript 14-multidim

Multidimensional Indexing:
Spatial Data Management &
High Dimensional Indexing
Types of Spatial Data
Point Data
 Points in a multidimensional space
 E.g., Raster data such as satellite imagery,
where each pixel stores a measured value
 E.g., Feature vectors extracted from text
Region Data
 Objects have spatial extent with location and
boundary
 DB typically uses geometric approximations
constructed using line segments, polygons,
etc., called vector data.
Spatial Indexing
 Point Access Methods (PAMs) vs Spatial Access
Methods (SAMs)
 PAM: index only point data
 Hierarchical (tree-based) structures
 Multidimensional Hashing
 Space filling curve
 SAM: index both points and regions
 Transformations
 Overlapping regions
 Clipping methods (non-overlapping)
 Data partitioning vs Space partitioning
Types of Spatial Queries
 Spatial Range Queries
 Find all cities within 50 miles of Troy
 Query has associated region (location, boundary)
 Answer includes overlapping or contained data regions
 Nearest-Neighbor Queries
 Find the 10 cities nearest to Troy
 Results must be ordered by proximity
 Spatial Join Queries
 Find all cities near a lake
 Expensive, join condition involves regions and proximity
Applications of Spatial Data
 Geographic Information Systems (GIS)
 E.g., ESRI’s ArcInfo; OpenGIS Consortium
 Geospatial information
 All classes of spatial queries and data are common
 Computer-Aided Design/Manufacturing
 Store spatial objects such as surface of airplane fuselage
 Range queries and spatial join queries are common
 Multimedia Databases
 Images, video, text, etc. stored and retrieved by content
 First converted to feature vector form; high dimensionality
 Nearest-neighbor queries are the most common
High Dimensional Indexing
• Requirements
o Fast range/window query search (range query
o Fast similarity search
 Similarity range query
 K-nearest neighbour query (KNN query)
Feature Base Similarity Search
Complex Objects
Feature Vectors
Feature extraction
and transformation
Index for range/
similarity Search
Index construction
Nasdaq
Similarity Queries
Retrieval by Colour
Given a
sample image
Similarity Search based on sample image
in color composition
Query Requirement
Window/Range query: Retrieve data
points fall within a given range along
each dimension.
Designed to support range retrieval,
facilitate joins and similarity search (if
applicable).
Query Requirement
• Similarity queries:
Similarity range and KNN queries
• Similarity range query: Given a query point, find all data
points within a given distance r to the query point.
r
•KNN query: Given a query point,
find the K nearest neighbours,
in distance to the point.
Kth NN
Single-Dimensional Indexes
B+ trees are fundamentally single-dimensional indexes.
When we create a composite search key B+ tree, e.g., an
index on <age, sal>, we effectively linearize the 2dimensional space since we sort entries first by age and
then by sal.
Consider entries:
<11, 80>, <12, 10>
<12, 20>, <13, 75>
80
70
60
50
40
30
20
10
B+ tree
order
11
12
13
Multidimensional Indexes
A multidimensional index clusters entries so as
to exploit “nearness” in multidimensional space.
Keeping track of entries and maintaining a
balanced index structure presents a challenge!
Consider entries:
<11, 80>, <12, 10>
<12, 20>, <13, 75>
80
70
60
50
40
30
20
10
Spatial
clusters
11
12
13
B+ tree
order
Motivation for
Multidimensional Indexes
Spatial queries (GIS, CAD).
 Find all hotels within a radius of 5 miles from the
conference venue.
 Find the city with population 500,000 or more that
is nearest to Kalamazoo, MI.
 Find all cities that lie on the Nile in Egypt.
 Find all parts that touch the fuselage (in a plane
design).
Similarity queries (content-based retrieval).
 Given a face, find the five most similar faces.
Multidimensional range queries.
 50 < age < 55 AND 80K < sal < 90K
What’s the difficulty?
An index based on spatial location needed.
 One-dimensional indexes don’t support
multidimensional searching efficiently.
 Hash indexes only support point queries; want
to support range queries as well.
 Must support inserts and deletes gracefully.
Ideally, want to support non-point data as
well (e.g., lines, shapes).
Multi-dimensional Indexes
Multi-key Indexes
Grid Files
Partitioned Hash Indexes
kd-Trees
Quad Trees
R Trees
Bitmap indexes
Partitioned hash function
Idea:
Key1
010110 1110010
h1
h2
Key2
Example:
h1(toy)
h1(sales)
h1(art)
.
.
h2(10k)
h2(20k)
h2(30k)
h2(40k)
.
.
Insert
=0
=1
=1
=01
=11
=01
=00
000
001
010
011
100
101
110
111
<Fred>
<Joe><Sally>
<Fred,toy,10k>,<Joe,sales,10k>
<Sally,art,30k>
h1(toy)
=0
000
<Fred>
h1(sales) =1
001
<Joe><Jan>
h1(art)
=1
010
<Mary>
.
011
.
<Sally>
h2(10k)
=01
100
h2(20k)
=11
101
<Tom><Bill>
h2(30k)
=01
110
<Andy>
h2(40k)
=00
111
.
.
• Find Emp. with Dept. = Sales  Sal=40k
h1(toy)
=0
000
h1(sales) =1
001
h1(art)
=1
010
.
011
.
h2(10k)
=01
100
h2(20k)
=11
101
h2(30k)
=01
110
h2(40k)
=00
111
.
.
• Find Emp. with Sal=30k
<Fred>
<Joe><Jan>
<Mary>
<Sally>
<Tom><Bill>
<Andy>
h1(toy)
=0
000
<Fred>
h1(sales) =1
001
<Joe><Jan>
h1(art)
=1
010
<Mary>
.
011
.
<Sally>
h2(10k)
=01
100
h2(20k)
=11
101
<Tom><Bill>
h2(30k)
=01
110
<Andy>
h2(40k)
=00
111
.
.
• Find Emp. with Dept. = Sales
Grid File
Hashing methods for multidimensional
points (extension of Extensible hashing)
Idea: Use a grid to partition the
space each cell is associated with one
page
Two disk access principle (exact match)
Grid File
 Start with one bucket for the whole
space.
 Select dividers along each
dimension. Partition space into cells
 Dividers cut all the way.
 Each cell corresponds to 1 disk
page.
 Many cells can point to the same
page.
 Cell directory potentially
exponential in the number of
dimensions
Grid File Implementation
Dynamic structure using a grid directory
 Grid array: a 2 dimensional array with
pointers to buckets (this array can be large,
disk resident) G(0,…, nx-1, 0, …, ny-1)
 Linear scales: Two 1 dimensional arrays that
are used to access the grid array (main
memory) X(0, …, nx-1), Y(0, …, ny-1)
Example
Buckets/Disk
Blocks
Grid Directory
Linear scale
Y
Linear scale X
Grid File Search
 Exact Match Search: at most 2 I/Os assuming linear scales fit in memory.
 First use liner scales to determine the index into the cell directory
 access the cell directory to retrieve the bucket address (may cause 1
I/O if cell directory does not fit in memory)
 access the appropriate bucket (1 I/O)
 Range Queries:
 use linear scales to determine the index into the cell directory.
 Access the cell directory to retrieve the bucket addresses of buckets
to visit.
 Access the buckets.
Grid File Insertions
 Determine the bucket into which insertion must occur.
 If space in bucket, insert.
 Else, split bucket
 how to choose a good dimension to split?
 If bucket split causes a cell directory to split do so and adjust
linear scales.
 insertion of these new entries potentially requires a complete
reorganization of the cell directory--- expensive!!!
Grid File Deletions
 Deletions may decrease the space utilization. Merge
buckets
 We need to decide which cells to merge and a
merging threshold
 Buddy system and neighbor system
 A bucket can merge with only one buddy in each dimension
 Merge adjacent regions if the result is a rectangle
Grid File Example
(N=6)
A
1
A
6
2
A 1
5
3
4
2
3
4
5
6
Grid File Example
(N=6)
A
B
1
A
6
B
7
2
8
9
5
12
3
10
11
4
A 1
2
3
3
5
4
7
85
B
4
6
9
11 12
2
10
6
Grid File Example
(N=6)
A
14
1
B
6
7
13
15
8
B
C
B
2
9
5
C
A
A 1
3
7
5
8
15
13
7 14
8 10
B
2
4
6
9
C
3
5
10
12
3
10
11
4
11 12
Grid File Example
(N=6)
A
D
1
14
B
6
7
13
16
15
8
D
B
B
C
B
C
B
2
9
5
C
A
A 1
2
3
7
8
15
13
3 16
5
8
13
4 14
7
8 10
5
6
B
2
4
6
C
3
5
10
12
3
10
11
4
D 7
14 15
9
11 12
Grid File Example
(N=6)
y4
A
y3
y2
H D
I
F
B
E
G
y1
C
x1 x2
x3
x4
A
H
D
F
B
A
I
D
F
B
A
I
G
F
B
E
E
G
F
B
C
C
C
C
B
Kd-Trees
 Binary partitioning of space. Split of
the form a < V & a >= V for some
attribute (Internal nodes)
The dimensions to “cut” or “split”
alternate among all dimensions
Doesn’t have to span the whole dim
(unlike Grid Files)
Leaves are blocks that hold the points
Kd-Trees Example
kd
A
B
C
y1
x1
D
E
B
yC1
A
C
BD
CD
BB C
EE F
BC
CD
y2
F
x1
x2
xD2
D
yE2
E
F
KdTrees Example
kd
A
1
2
y9
y8
6
x2
3
4
7
5
8
y8
9
x5
y7
y6 10
y5
y4
y3
11
y2
12
13
y3
y7 2
x1
15
14
16
y2
17
18
y1
15
19
20
21
x1
x2
x3 x4
x5
16 1
17
x3
x4
x8
10 3
4
7
y5
y9
6
x8 x9
x9
x7 18
8
y4
5
x6
9 y1
y6
13
11
x6 x7
y2
12
14
x8
21 19
20
kDB Trees Example
kdB
B
y9
C
1
2
x2
3
4
B
5
y8
B
7
8
y7
y6 10
11
y5
y4
y3 15
y2
9
E
C x4
x5
C
15
F
E
x9
5
4
D x
3
14
17
18
y1
19
20
21
x1
x2
D
x3 x4
x5
x6 x7
x8 x9
F
6
x6
17
y1
7
y5
13
9
y6
y2
16
18
10
x8
3
13
16 1
y9
2
y2
D
12
y7
x1
y8
6
y3
F
y4
x8
21 19
20
11
E
x7
8
12
14
Region Quadtree
A
2 3
1
4 5
7 8
6 9 10 13 14
15 16
11 12 17 18 19
NW
1
2
3
NE
B
4
SW
SE
C
5
F
6 D 11 12
7
8
9 10
13 14 E 19
15 16 17 18
Point Quad-tree
(50,50)
(75,75) (25,25)
(20,88)
(0,100)
(100,100)
(88,65)
(52,15)
(92,1)
(0,0)
(100,0)
(75,25)
The R-Tree
The R-tree is a tree-structured index
that remains balanced on inserts and
deletes.
Each key stored in a leaf entry is
intuitively a box, or collection of
intervals, with one interval per
dimension.
Root of
R Tree
Example in 2-D:
Y
X
Leaf
level
R-Tree Properties
Leaf entry = < n-dimensional box, rid >
 key value is a box.
 Box is the tightest bounding box for a data object.
Non-leaf entry = < n-dim box, ptr to child
node >
 Box covers all boxes in child node (in fact,
subtree).
All leaves at same distance from root.
Nodes can be kept 50% full (except root).
 Can choose a parameter m that is <= 50%, and
ensure that every node is at least m% full.
Example of an R-Tree
Leaf entry
R1
R4
R3
R8
R9
R10
Index entry
R11
R5 R13
R14
R12
R6
R15
Spatial object
approximated by
bounding box R8
R7
R18
R17
R16
R19
R2
Example R-Tree (Contd.)
R1 R2
R3 R4 R5
R8 R9 R10 R11R12
R6 R7
R13R14
R15 R16
R17R18R19
Search for Objects Overlapping
Box Q
Start at root.
1. If current node is non-leaf, for each
entry <E, ptr>, if box E overlaps Q,
search subtree identified by ptr.
2. If current node is leaf, for each entry
<E, rid>, if E overlaps Q, rid identifies
an object that might overlap Q.
Note: May have to search several subtrees at each node!
(In contrast, a B-tree equality search goes to just one leaf.)
Improving Search Using Constraints
It is convenient to store boxes in the Rtree as approximations of arbitrary regions,
because boxes can be represented
compactly.
But why not use convex polygons to
approximate query regions more
accurately?
 Will reduce overlap with nodes in tree, and
reduce the number of nodes fetched by
avoiding some branches altogether.
 Cost of overlap test is higher than bounding
box intersection, but it is a main-memory cost,
and can actually be done quite efficiently.
Generally a win.
Insert Entry <B, ptr>
Start at root and go down to “best-fit” leaf
L.
 Go to child whose box needs least enlargement
to cover B; resolve ties by going to smallest area
child.
If best-fit leaf L has space, insert entry and
stop. Otherwise, split L into L1 and L2.
 Adjust entry for L in its parent so that the box
now covers (only) L1.
 Add an entry (in the parent node of L) for L2.
(This could cause the parent node to recursively
split.)
Splitting a Node During
Insertion
The entries in node L plus the newly inserted
entry must be distributed between L1 and L2.
Goal is to reduce likelihood of both L1 and L2
being searched on subsequent queries.
Idea: Redistribute so as to minimize area of
L1 plus area of L2.
Exhaustive algorithm is too slow;
quadratic and linear heuristics are
described in the paper.
GOOD SPLIT!
BAD!
R-Tree Variants
 The R* tree uses the concept of forced reinserts to
reduce overlap in tree nodes. When a node overflows,
instead of splitting:
 Remove some (say, 30% of the) entries and reinsert them into
the tree.
 Could result in all reinserted entries fitting on some existing
pages, avoiding a split.
 R* trees also use a different heuristic, minimizing box
perimeters rather than box areas during insertion.
 Another variant, the R+ tree, avoids overlap by inserting
an object into multiple leaves if necessary.
 Searches now take a single path to a leaf, at cost of
redundancy.
GiST
 The Generalized Search Tree (GiST) abstracts the
“tree” nature of a class of indexes including B+ trees
and R-tree variants.
 Striking similarities in insert/delete/search and even
concurrency control algorithms make it possible to provide
“templates” for these algorithms that can be customized to
obtain the many different tree index structures.
 B+ trees are so important (and simple enough to allow further
specialization) that they are implemented specially in all
DBMSs.
 GiST provides an alternative for implementing other tree
indexes in an ORDBS.
Comments on R-Trees
Deletion consists of searching for the entry to
be deleted, removing it, and if the node
becomes under-full, deleting the node and then
re-inserting the remaining entries.
Overall, works quite well for 2D and 3D
datasets. Several variants (notably, R+ and R*
trees) have been proposed; widely used.
Can improve search performance by using a
convex polygon to approximate query shape
(instead of a bounding box) and testing for
polygon-box intersection.
Bitmap Index
Bitmap index: specialized index that takes
advantage
 Read-mostly data: data produced from scientific
experiments can be appended in large groups
Fast operations
 “Predicate queries” can be performed with bitwise
logical operations
• Predicate ops: =, <, >, <=, >=, range,
• Logical ops:
AND, OR, XOR, NOT
 They are well supported by hardware
Easy to compress, potentially small index size
Each individual bitmap is small and frequently
used ones can be cached in memory
Bitmap Index
 Can be useful for stable columns with few values
 Bitmap:
 String of bits: 0 (no match) or 1 (match)
 One bit for each row
 Bitmap index record
 Column value
 Bitmap
 DBMS converts bit position into row identifier.
Bitmap Index Example
Faculty Table
RowId
1
2
3
4
5
6
7
8
9
10
11
12
FacSSN
098-55-1234
123-45-6789
456-89-1243
111-09-0245
931-99-2034
998-00-1245
287-44-3341
230-21-9432
321-44-5588
443-22-3356
559-87-3211
220-44-5688
… FacRank
Asst
Asst
Assc
Prof
Asst
Prof
Assc
Asst
Prof
Assc
Prof
Asst
Bitmap Index on FacRank
FacRank
Asst
Assc
Prof
Bitmap
110010010001
001000100100
000101001010
Compressing Bitmaps:
Run Length Encoding
Bit vector for Assc: 0010000000010000100
Runs: 2, 8, 4
Can we do: 10 1000 100? Is this
unambiguous?
Fixed bits per run (max=4)
 0010 1000 0100
Variable bits per run
 1010 11101000 110100
 11101000 is broken into: 1110 (#bits), 1000
(value)
• i.e., 4 bits are required, value is 8
Operation-efficient Compression Methods
Based on variations of Run Length Compression
Uncompressed:
0000000000001111000000000 ......0000001000000001111111100000000 .... 0000100
Compressed:
12, 0, 0, 0, 0, 1000, 8, 0, 0, 0, 0, 0, 0, 0, 0, 1000
Store very short sequences as-is
AND/OR/COUNT operations:
Can uncompress on the fly
Indexing High-Dimensional Data
 Typically, high-dimensional datasets are collections of
points, not regions.
 E.g., Feature vectors in multimedia applications.
 Very sparse
 Nearest neighbor queries are common.
 R-tree becomes worse than sequential scan for most datasets
with more than a dozen dimensions.
 As dimensionality increases contrast (ratio of distances
between nearest and farthest points) usually
decreases; “nearest neighbor” is not meaningful.
 In any given data set, advisable to empirically test contrast.