2008_179_R-tree - University of California

Download Report

Transcript 2008_179_R-tree - University of California

Guest Lecture Tonight
•
•
•
•
•
Room 205 in EBII (under Dr. Keoghs office)
Starts at 6:10pm
Attendance is mandatory
Best behavior is mandatory
Having an intelligent question is mandatory
Spatial Access Methods
Dr Eamonn Keogh
Computer Science & Engineering Department
University of California - Riverside
Riverside,CA 92521
[email protected]
Mathematical Preliminaries I
y
What is the distance between the
points Blue (1,1) and Red (5,7) ?
(5,7)
The formula for the Euclidian distance
between two points is…
D( Blue , Red )  (x)  (y )
2
2
D( Blue , Red )  (1  5) 2  (1  7) 2
D( Blue , Red )  7.21
(1,1)
x
Consider a classic database…
Name
Marios Pizza
ID
1
Type Phone Location Grade
ITA 888-1212 244, 365 D
Joes Bugers
2
US
848-1298
34, 764
A
Tinas Mexican
3
MEX
878-1333
123, 32
A
Sues Pasta
4
ITA
878-1342
876, 65
B
Assume that record(2).Name returns “Joes Bugers”
and record(1).Location returns (244, 365) etc.
Name
Marios Pizza
ID
1
Type Phone Location Grade
ITA 888-1212 244,365 D
Joes Bugers
2
US
848-1298
34,764
A
Tinas Mexican
3
MEX
878-1333
123,32
A
Sues Pasta
4
ITA
878-1342
876,65
B
Given such a database we can easily answer queries such as
• List all Mexican restaurants.
• List all Grade A restaurants.
However, classic databases do not allow queries such as
• List all Mexican restaurants within five miles of UCR
• List the 5 pizza restaurants nearest to 91 and 60.
These kinds of queries are called spatial queries.
One possible way to do spatial queries
Algorithm One_Nearest_Neighbor_Query(location)
best_so_far = infinity;
for i = 1 to
number_items_in_database
retrieve record(i) from database;
if distance(location,record(i).location )
best_so_far = distance(location,record(i).location )
NNpointer= i;
end;
end;
disp(‘The nearest neighbor is’,record(NNpointer ).name );
end;
But this does not work well because…
…so we need to index the data
My informal definition of indexing. Organizing the data
such that queries can be answered in sub linear time.
Indexing some kinds of data is trivial…
We can index the names field simply by creating 26
files that have the names plus pointers back to the
original database.
Adams Fish House
Al’s Burgers
Al’s Pizza
Amom Tacos
23
34
1
45
Bills Sushi
Bobs Burgers
Bombs Away!
Brian’s eats
56
78
12
99
Can o Beans
Cocos
Champagnes
Chris's Fish Hous
So, we call always index 1-dimensional data (if you can sort it, you
can index it), such that we can answer 1-nearest neighbor queries by
accessing just O(log(n) ) of the database. (n is the number of items in the database).
But we can’t sort 2 dimensional data…
This is an important problem which has attracted a great deal of
research interest….
We can in fact index 2 (and higher) dimensional reasonably efficiently
with special data structures known as Spatial Access Methods (SAM)
or Multidimensional Access Methods.
Although there are many variations we will focus on the R-Tree,
introduced by Guttman in the 1984 SIGMOD conference.
We have seen the formula for the distance between two points. We
will find it useful to have a formula for the distance between a
point and the closest possible point within an Minimum Bounding
Rectangle MBR…
1) Suppose we have a query
point Q and one known point R
Could any of the points in the
MBR be closer to Q than R is?
R = (1,7)
Q = (3,5)
MBR = {(6,1),(8,4)}
We have seen the formula for the distance between two points. We
will find it useful to have a formula for the distance between a
point and the closest possible point within an Minimum Bounding
Rectangle MBR…
Suppose we have a query point
Q and one known point B
Could any of the points in the
MBR be closer to Q than B is?
Q = (3,5)
MBR = {(6,1),(8,4)}
B = (1,1)
The formula for the distance between a point and the closest
possible point within an MBR
MBR = {(L.x,L.y)(U.x,U.y)}
Q = (x,y)
MINDIST(Q,MBR)
if L.x < x < U.x and L.y < y < U.y then 0
elseif L.x < x < U.x then min( (L.y -y)2 , (U.y -y)2 )
elseif ….
We can visualize the
locations of interest simply
as points in space...
…then both nearest
neighbor queries and range
queries have a simple
geometric interpretation…
Q1
Q2
Name
Marios Pizza
Joes Bugers
Tinas Mexican
ID
1
2
3
Imagine that each point
has a ID pointer back to
the original database…
Type Phone Location Grade
ITA 888-1212 244,365 D
US
848-1298
MEX
878-1333
34,764
123,32
2
A
A
23
1
We can group clusters of
datapoints into “boxes”, called
Minimum Bounding Rectangles
(MBRs).
R1
R2
Each MBR can be represented with
just two points. The lower left
corner, and the upper right corner.
R4
R5
R3
R6
R9
R7
R8
We can further recursively group
MBRs into larger MBRs….
…these nested MBRs are organized
as a tree (called a spatial access tree
or a multidimensional tree).
R10
R11
R10 R11 R12
R1 R2 R3
R12
R4 R5 R6
R7 R8 R9
Data nodes containing points
R10
{(1,0),(7,9)}
R1
R12
At the leave nodes we have
the location, and a pointer to
the record in question.
At the internal nodes, we just
have MBR information.
R2 R3
{(1,3),(5,4)} {(2,0),(7,9)} {(1,1),(7,8)}
(3,4) 77
(1,3) 88
(2,3) 22
(5,4) 13
(2,2) 47
(3,0) 86
(7,9) 52
(5,1) 32
(1,4) 45
(5,6) 27
(7,8) 73
Data nodes
We can define a function, MINDIST(point, MBR), which tells us
the minimum possible distance between any point and any MBR, at
any level of the tree.
MINDIST(point, MBR) = 5
MINDIST(point, MBR) = 0
We can use the MINDIST(point, MBR), to do fast search..
R10
R11
R10 R11 R12
R1 R2 R3
R4 R5 R6
R7 R8 R9
Data nodes containing points
R12
We can use the MINDIST(point, MBR), to do fast search..
0
R10
R11
10
17
R10 R11 R12
R1 R2 R3
R4 R5 R6
R7 R8 R9
Data nodes containing points
R12
We now go to disk, and retrieve all the data objects whose pointers are in the green
node. We measure the true distance between our query and those objects. Let us
imagine two scenarios, the closest object, the “best-so-far” has a value of..
• 1.5 units
(we are done searching!)
• 4.0 units
(we have to look in R2, but then we are done)
0
R10
10
17
R10 R11 R12
R11
0
2
10
R1 R2 R3
R4 R5 R6
R7 R8 R9
Data nodes containing points
R12
R10
R11
A simplifying assumption to
make coding the R-Tree easier
We have seen that searching R_trees is
easy!
But building the MBRs looks like a
non-trivial problem.
R12
R10 R11 R12
R1 R2 R3
I will allow a simplifying assumption to
make it much easer to understand and
code…
R4 R5 R6
R7 R8 R9
Data nodes containing points
12,12
We have a collection
of datapoints in 2
dimensions.
Before we begin, we must decide
on the max number of points per
leaf MBR, lets say 3…
0,0
12,12
The definition of an MBR is the
smallest (axis parallel) rectangle
that can contain a set of points.
We can represent the MBR with
just 2 points, the lower left
corner (L) and the upper right
corner (U).
MBR1 = {(L.x,L.y)(U.x,U.y)}
MBR1 = {(1,1)(9.6,8.9)}
0,0
This is the root of our R-Tree
MBR = {(1,1)(9.6,8.9)}
MBR1 = {(1,1)(9.6,8.9)}
MBR2 = {(1,1)(4.3,3.95)}
MBR4
MBR3
MBR2
MBR5
MBR3 = {(1,3.95)(4.3,8.9)}
MBR4 =
MBR1 = {(1,1)(9.6,8.9)}
MBR2 = {(1,1)(4.3,3.95)}
MBR3 = {(1,3.95)(4.3,8.9)}
MBR10 = {(1,3.95)(2.65,6.42)}
MBR12
MBR11
MBR4
(1.1, 4.4)
(1.3, 4.1)
(1.5, 4.3)
MBR13
MBR10
MBR2
MBR5
44
34
12
MBR4 =
MBR11 = {(1,6.42)(2.65,8.9)
Under this definition of an R-tree (which is not the classic
definition)
• Every internal node has exactly 4 children.
• Every MBR at a given level of the tree is mutually
exclusive (a point cannot be in two MBRs at a given level).
• Every leaf node is in one of these two classes
•It does not cover any points, and therefore contains a null pointer.
•It covers at least one point, and has a pointer to a file which contains those points,
together with the primary key.
Project Report Format
• A new file on the webpage gives details on
the format of your final report (the white
binders)
Title Page
Requirements Analysis.
Requirements Elicitation.
System Specification.
System Design:
Program Design:
Coding:
Unit and Integration Testing:
System Testing:
Acceptance Testing:
A Maintenance Plan: Assume that like the vast majority of software that is commissioned, your
contract requires you to maintain the software.
Conclusions: Assume that you write this just before handing in your document. Summarize briefly
what you did for you project. Address the following; what most surprised you about the process of
creating the project. What would you do differently if you had to do it again?
References: Every, book, paper, webpage you used must be cited in a standard format. You could use
the American Psychological format [1, 2], or the IEEE standard [3], or any other format so long as you
are consistent and complete. You should also reference any standard template libraries used in your
code.
Appendices: Including source code, printed in 2 “pages” per page format. You should also list a
professional quality, one-page resume for each member of the group (use the same template for each).
Acknowledgements: Your chance to thank anyone who helped you complete the project.
Two (identical) Cd roms in a high quality plastic wallet: Containing source code, test data, all the
weekly archives, and a readme.txt file that contains a brief (one line) description of all the files.
How I grade the project…
Grading Rubric for CS179. Database project.
Group name (all last names):
The project binder is in the correct basic format, and is neat. The format is consistent.
1
2
3
4
5
Notes:
The project is well documented (the students carefully referenced, the books, papers, URLs that they used)
1
Notes:
2
3
4
5
The students made some effort to do elicitation. The came up with interview questions in advance, they noted the answers to
the questions. (optional, they also use some other elicitation technique )
1
2
3
4
5
Notes:
The students made some effort to do requirements specification. The specification considers things like
the minimum hardware requirements, usability factors, and user scenarios, privacy issues. The
specification actually says what the system will do under normal scenarios, and how it will handle errors.
1
Notes:
2
3
4
5
The students actually designed the project (as opposed to going straight to coding) Design may include flowcharts or
pseudocode or English descriptions.
1
2
3
4
5
4
5
Notes:
The code is documented, indented etc..
1
Notes:
2
3
The project was tested. The students demonstrated how they tested the project. There was unit testing and integration testing.
1
2
3
4
5
4
5
Notes:
The overall quality of the finished product
1
2
3
Notes:
General remarks.
Tentative grade:
Final grade:
(may be modified after second pass).
Special Circumstances: Students in same group get different grades because…