An Incremental Refining Spatial Join Algorithm for Estimating Query

Download Report

Transcript An Incremental Refining Spatial Join Algorithm for Estimating Query

An Incremental Refining Spatial
Join Algorithm for Estimating
Query Results in GIS
Wan D. Bae, Shayma Alkobaisi, Scott T. Leutenegger
Department of Computer Science
University of Denver
{wbae, salkobai, leut}@cs.du.edu
Outline
•
•
•
•
Introduction
Motivation
Spatial Join Estimation
IRSJ Algorithm
– Sampling
– Joining
– Statistics
• Experiments
• Conclusion
Introduction
• GIS data is used to describe the geometry and
location of geographic phenomena.
• GIS data is represented in two ways:
– Raster: divides the world into cells.
– Vector: defines features based on coordinate-based
structures (e.g. point, line, polygon).
• This paper targets vector data.
Introduction Cont.
• Geographic or spatial queries are applied to
spatially indexed databases.
– e.g. containment and intersection.
• We focus on finding the number of
intersections of two spatial datasets (spatial
joins).
– e.g. number of roads that intersect rivers in the US.
Spatial Joins
• Spatial joins relate two data sets that share locations
in space.
• The processing of spatial queries can be accelerated
when some spatial indexing such as R-tree exists.
• Spatial joins of two R-trees can be done by applying
synchronized tree traversals on both R-tree nodes to
find intersecting items.
Spatial Joins Cont.
R: Rivers
R1
r1
r2
R2
r3
r2
r1
S1
r4
R2
R1
S: Cities
r4
r5
s1
s2
S2
s3
S1
s1
r5
r3
s4
S2
s2
s3
s4
Motivation
• GIS supports very large data sets
finding exact
answers to spatial queries can be very time
consuming.
• In GIS data analysis a fast estimation of the final
result that has error bounded to 2%-10% can do the
job.
• So, provide an approximate answer through an
incremental refining process.
• Thus, allow for more interactive data exploration.
Examples
1. “What are the intersections of mineral plants
and radiometric ages areas in the US?”
2. “Where do mineral resources intersect
geochemical sediments in the US?”
Examples Cont.
Locations of geochemical sediments in CO
Locations of mineral resources in CO
Intersections
Spatial Join Estimation
• Parametric: uses some properties of data
distribution to present a formula for the
estimation. (e.g. power law, fractal dimension).
• Histograms: keep certain information for
different regions of the data to be used when a
query is given. (e.g. Geometric Histogram,
Euler Histogram).
Spatial Join Estimation Cont.
• Sampling: uses smaller data sets (samples) to
calculate an estimate of the final result by
applying the join on the sample.
Incremental Refining Join Process
Dataset 1 (R)
Sampling
Dataset 2 (S)
WQ
Samples
# intersections
(intermediate
result)
Statistics
Final Estimation
w/ CI
Report
Incremental Process
User
Random Sampling
• We assume that both data sets are indexed using Rtrees.
• Samples are chosen from one R-tree called the outer
relation R.
• Samples are used as window queries to query the
inner relation S.
• Randomness:
– Acceptance/Rejection method: inclusion probability is
proportional to some parameter of the item sampled.
Tuple and Page Level Sampling
• Tuple-level:
– A page is selected at random from R and one tuple
(MBR) of that page is chosen at random.
• Page-level:
– A page is selected at random from R and all tuples
(MBRs) of that page are used as a sample.
Window Query
• The chosen MBR from one data set (R-tree)
serves as a window query to the other data set
to find the intersections.
• The query returns all the objects from the
second data set that overlaps with the query
window.
• The number of intersections found is used in
the process of finding an approximate answer
to the query.
Window Query Example
R: Rivers
R1
r1
r2
R2
r3
r2
r1
S1
r4
R2
R1
S: Cities
r4
r5
s1
s2
S2
s3
S1
s1
r5
r3
s4
S2
s2
s3
s4
Estimated Value and Confidence
Interval
• Estimated Value: the statistic computed from sample
information.
• Population Proportion: fraction indicating the part of
the sample having a particular interest.
• Confidence interval: an interval that estimates a
population parameter within a range of possible values
at specified probability.
– The specified probability is called the level of confidence.
E  zc

p1  pˆ  N  n

n 1
N
IRSJt Algorithm
1. C
0; CI
0 {count, confidence interval}
2. repeat
3.
for i = 0 to k do
4.
L
Choose leaf from R at random
5.
M
MBR of a randomly chosen tuple within L
6.
I
number of intersections of a Window Query (M,S)
7.
C C+I
8.
end for
9.
CI
Compute confidence interval using C
10.
EV
Compute estimated value using C
11. until The desired confidence interval Cf attained
Experiments
(settings and data sets)
•
•
•
•
IRSJ compared to full R-tree join.
Confidence level set to 95%.
Varied buffer size and data size.
Data sets:
–
–
Synthetic: U x S, S x U, U x U (# of tuples in each
relation varied from 100,000 to 600,000).
Real: from the U.S. Geological Survey:
1.
2.
Mineral Resources in the US 2005 (300,432 tuples).
Geochemistry of unconsolidated sediments in the US 2001
(199,850 tuples).
Experiments Cont.
(Synthetic data results)
Estimated Value
U-600K x S-400K
Synthetic Cont.
Confidence Interval
U-600K x S-400K
Synthetic Cont.
I/Os with 10% buffer
Synthetic Cont.
R-tree join Ratio to IRSJt
Experiments Cont.
(Real data results)
Estimated Value
Real Cont.
Confidence Interval
Real Cont.
Buffer
Size
CI = 10
CI = 5
CI = 3
CI=2
CI=1
R-join
I/Os
5%
10%
233
182
403
333
770
752
1896
1164
8159
3795
27756
24756
Node
Accesses
5%
10%
780
668
2591
2136
6933
9028
18299
19671
87233
85822
262200
262200
I/O and Node Accesses of IRSJt and a full R-tree join
Conclusion
• Proposed Incremental Refining Spatial Join:
– Page-level
– Tuple-level
• Experimental results showed:
– IRSJ provides a reasonably accurate estimate in much
earlier stages than the exact answer obtained by full R-tree
join.
– IRSJt performs better than IRSJp.
– As the data size increased, the improvement of IRSJt over
full R-tree join increased.
Dataset 2 (S)
Dataset 1 (R)
Sampling
WQ
# intersections
(intermediate
result)
Samples
Statistics
Final Estimation
w/ CI
Dataset
Output / Input
Report
Incremental Joining Process
User
Process
R: Rivers
L1
r1
r2
L2
r3
r2
r1
r3
S1
r4
L2
L1
S: Cities
r4
r5
s1
s2
S2
s3
s4
S1
s1
r5
S2
s2
r3
s3
s4
R: Rivers
L1
r1
r2
L2
r3
r4
L2
L1
r2
r1
r4
r5
r3
r5
R: Rivers
L1
L2 L3
ST1
r1
L2
L4
L5
L6
ST3
ST2
r3
r2
r4
r6
r10
r9
r12
r5
r11
L5
r2
r4
L3
L6
r6
r7
L1
L4
r3
r12
r10
r1
r8
r11
r5
r9
r7
r8
R: Rivers
L1
r1
L2
r3
r2
r4
L2 L3
r6
L4
r10
r9
L5
L6
r12
r5
r11
L5
r2
r4
L3
L6
r6
r7
L1
L4
r3
r12
r10
r1
r8
r11
r5
r9
r7
r8
End
Yes
No
Dataset 1 (R)
Dataset 2 (S)
Sampling
Samples
Output / Input
User
Report
WQ
# intersections
(intermediate
result)
Dataset
Stop
Statistics
Process
Final Estimation
w/ CI
Transition step
R: Rivers
R1
R2
R3
R4
ST1
L1
L1
r1
ST2
L2
L3
L2
r3
L4
L3
r2
r4
L5
r6
r9
r10
r12
R3
R2
r2
R4
r9
L3
L2
L7
L4
R1
r4
L6
L5
r11
r5
r6
r7
r3
L1
r1
r12
r10
L7
r8
L4
L8
L6
L8
R: Rivers
R1
R2
R3
R4
ST1
R1
ST2
R2
R3
R4
L 1 L2
L 3 L4
L5 L6
L7 L8
L1
L2
L3
L4
L5
L6
L7
L8
r1 r2
r3 r4
r5 r6
r7 r8
r9 r10
r11 r12
r13 r14 r15
r16 r17 r18