Mining Spatial Data

Download Report

Transcript Mining Spatial Data

Mining Spatial Data
Irena Yasnitsky
Xing Xiong
Agenda
 What is Spatial data
 What makes spatial data mining different
 Discovery of Spatial Association Rules
by Krzysztof Koperski and Jiawei Han
 Z-values, MBRs, R*-trees…
 Efficient Polygon Amalgamation methods
by X. Zhou, D. Truffet, and J. Han
What is Spatial Data?
 Objects of types:




points
lines
polygons
etc.
 Used in/for:

GIS - Geographic Information Systems

GPS - Global Positioning System

Environmental studies

etc …
What makes mining Spatial data different?
 Huge objects


Polygon (3 7.5, 12 -56, 1 2, 3 7.5)
Very large object
 Heavy computations


intersect(Polygon1, line2)
contain(Polygon1, Polygon2)
 Usually maintain large index tables


Z-value
Each Spatial object can be indexed by one or
many Z-values
Examples of Spatial Patterns
 Historic Examples


1855 Asiatic Cholera in London :
A water pump identified as the source…
Fluoride and healthy gums near Colorado river…
Modern Examples

Cancer clusters to investigate environment health hazards…

Crime hotspots for planning police patrol routes…

Bald eagles nest on tall trees near open water…
Unusual warming of Pacific ocean (El Nino) affects
weather in USA…

Sources:
 K. Koperski and J. Han.
[2] Discovery of Spatial Association Rules
In Geographic Information Databases
In Proc. 4th Int'l Symp. on Large Spatial Databases
(SSD'95), pp. 47-66, Portland, Maine, Aug. 1995.
 X. Zhou, D. Truffet, and J. Han
[3] Efficient Polygon Amalgamation Methods
for Spatial OLAP and Spatial Data Mining
In SSD'99, pp. 167-187, Hong Kong, Aug. 1999.
[1] Discovery of Spatial Association Rules
In Geographic Information Databases
 A spatial association rule is a rule indicating
certain association relationship among a set of
spatial and possibly some non-spatial predicates.
Example:
“Most big cities in Canada are
close to the CanadaU.S. border”
 A strong rule indicates that the patterns in the rule
have relatively frequent occurrences in the database and
strong implication relationships.
Discovery of Spatial Association Rules (cont.)
More formally,
 A spatial association rule is a rule in the form of
P 1  …  Pm  Q 1  …  Qn (c%) ,
where at least one of the predicates P 1 ,…, Pm , Q 1 ,…, Qn is a spatial predicate,
and c% is the confidence of the rule [which indicates that c% of objects satisfying
the antecedent of the rule will also satisfy the consequent of the rule].
 A set of predicates P is large in set S at level k if the support of P is no less than
its minimum support threshold σ`k for level k, and all ancestors of P from the
concept hierarchy are large at their corresponding levels.
 The confidence of a rule “P  Q/S” is high at level k if its confidence is no less
than its corresponding minimum confidence threshold φ`k
 A rule “P  Q/S” is strong if predicate “P  Q” is large in set S and the
confidence of “P  Q/S” is high.
Discovery of Spatial Association Rules (cont.)
Method:
Resembles
Apriori +
Hierarchy…
 Step 5:
Find large predicates and mine rules (Fine_predicate_DB);
 Step 4:
Fine_predicate_DB :=
refined spatial computation(Large_Coarse_predicate_DB);
 Step 3:
Large_Coarse_predicate_DB := filter with min support(Coarse predicate DB);
 Step 2:
Coarse_predicate_DB := coarse spatial computation(Task_relevant_DB);
 Step 1: Task_relevant_DB := extract task relevant objects(SDB, RDB);
Discovery of Spatial Association Rules (cont.)
Example problem:
 The study of spatial association relationships is confined to British Columbia,
with the following database relations for organizing and representing spatial
objects:
1. town(name, type, population, geo, …).
2. road(name, type, geo , …).
3. water(name, type, geo , …).
4. mine(name, type, geo , …).
 Suppose a user is interested in finding within the map of British
Columbia the strong spatial association relationships between large towns
and other “near by” objects including mines, country boundary, water
(sea, lake, or river) and major highways
 A predicate close_to(A, B) says that a spatial objects A and B
are close one to another, and g_close_to is a predefined generalized
predicate which covers a set of spatial predicates: intersect,
adjacent to, contains, close_to.
Discovery of Spatial Association Rules (cont.)
Example GeoMiner query:
discover spatial association rules
inside British Columbia
from road R, water W, mines M, boundary B
in relevance to town T
where
g_close_to(T.geo, X.geo) and
X in {R, W, M, B} and
T.type = “large” and
R.type in {divided_highway} and
W.type in {sea, ocean, large_lake, large_river} and
B.admin_region_1 in “B.C.”and
B.admin_region_2 in “U.S.A.”
Discovery of Spatial Association Rules (cont.)
Note: “close_to” is a conditiondependent predicate and is
defined by a set of knowledge rules.
For example, the first of the following rule states:
X is a town
and
Y is a country
X is close_to Y,
then
if their distance
is within 80 kms
Rules:
close_to(X,Y )  is_a(X, town)  is_a(Y, country)  dist(X, Y, d)  d = 80 km
close_to(X,Y )  is_a(X, town)  is_a(Y, road)  dist(X, Y, d)  d = 5 km
Discovery of Spatial Association Rules (cont.)
Level
1
Water
Sea
Large
river
river
Lake
Small
river
Fraser
river
2
3
4
g_close_to
Intersects
Adjacent
to
Intersects
Hierarchy for data relations
Close_to
Not_disjoint
Inside
Covered
by
Equal
Contains
Covers
Inside
Contains
Hierarchy of topological relations
Discovery of Spatial Association Rules (cont.)
Step 1:

Task_relevant_DB := extract task relevant objects(SDB, RDB);
The set of relevant data is retrieved by execution of the data retrieval
methods of the data mining query, which extracts the following data
sets whose spatial portion is inside B.C.:
(1) Towns
only large towns;
(2) Roads
only divided highways 2 ;
(3) Water
only seas, oceans, large
lakes and large rivers;
any mines;
(4) Mines
(5) boundary only the boundary
of B.C., and U.S.A.
...
g_close_to(T.geo, X.geo)
and X in {R, W, M, B} and
T.type = “large” and
R.type in
{divided_highway} and
W.type in {sea, ocean,
large_lake, large_river}
and B.admin_region_1 in
“B.C.”and
B.admin_region_2 in
“U.S.A.” ...
Discovery of Spatial Association Rules (cont.)
Step 2:
Coarse_predicate_DB :=
coarse spatial computation(Task_relevant_DB);
The “generalized_close_to” (g_close_to) relationship between (large)
towns and the other four classes of entities is computed at a relatively coarse
resolution level. [MBR data structure or R*trees and other approximations] – Later…
At this level we can already mine:
is_a(X, large_town)  g_close_to(X, water): (80%)
is_a(X, large_town)  g_close_to(X, sea)  g_close_to (X, us_boundary):(92%)
Town
Water
Road
Boundary
Victoria
Juan_de_Fuca_Strait
Highway_1, Highway_17
US
Saanich
Juan_de_Fuca_Strait
Highway_1, Highway_17
US
Mine
Highway_97
Prince
Pentincton
Okanagan_Lake
Highway_97
US
Allala
…
…
…
…
…
Discovery of Spatial Association Rules (cont.)
Step 3:
Large_Coarse_predicate_DB :=
filtering_with_min_support(Coarse predicate DB);
Refined computation is performed on the large predicate sets, i.e.,
those retained in the g_close_to table. Each g_close_to predicate is replaced by one or a
set of concrete predicate(s) such as intersect, adjacent_to, close_to, inside, etc.
Town
Water
Road
Boundary
Victoria
<adjacent_to, Juan_de_Fuca_Strait>
<intersects, Highway_1>,
<intersects, Highway_17>
<close_to, US>
Saanich
<adjacent_to, Juan_de_Fuca_Strait>
<intersects, Highway_1>,
<close_to, Highway_17>
<close_to, US>
<intersects, Highway_97>
Prince
Pentincton
<adjacent_to, Okanagan_Lake >
<intersects, Highway_97>
<close_to, US>
…
…
…
…
Discovery of Spatial Association Rules (cont.)
Step 4:
Large_Coarse_predicate_DB :=
filtering_with_min_support(Coarse predicate DB);
The levelbylevel detailed computation of large
predicates and the corresponding association rules is
presented as follows.
The computation starts at the topmost concept level and
computes large predicates at this level.
k
Large k-predicate set
Min support
= 20
for level 1
count
1
<adjacent_to, water>
32
1
<intersects, highway>
29
1
<close_to, highway >
29
1
<close_to, us_boundary >
28
2
<adjacent_to, water>, <intersects, highway>
25
2
<adjacent_to, water>, <close_to, us_boundary >
23
2
<intersects, highway>, <close_to, us_boundary >
26
3
<adjacent_to, water>, <intersects, highway>, <close_to, us_boundary >
22
Discovery of Spatial Association Rules (cont.)
Step 5:
Find large predicates and mine rules(Fine_predicate_DB);
After mining rules at the highest level of the concept hierarchy,
large kpredicates can be computed in the same way at the lower
concept levels, which results in tables
k
Level 2:
Min support
= 10
for level 2
Large k-predicate set
count
1
<adjacent_to, sea>
21
1
<adjacent_to, large_river>
11
1
<close_to, us_boundary >
28
1
<intersects, provincial_highway>
21
1
<close_to, provincial_ highway >
24
2
<adjacent_to, sea>, <close_to, us_boundary >
15
2
<close_to, us_boundary >, <intersects, provincial_ highway>
19
2
<adjacent_to, sea>, <close_to, provincial_highway >
11
2
<close_to, us_boundary >, <close_to, provincial_highway >
22
3
<adjacent_to, sea>, < close_to, provincial_highway >, <close_to, us_boundary >
10
Discovery of Spatial Association Rules (cont.)
Level 3:
k
Large k-predicate set
count
1
<adjacent_to, georgia_straight>
9
1
<adjacent_to, fraser_river>
10
1
<close_to, us_boundary >
28
2
<adjacent_to, georgia_straight>, <close_to, us_boundary >
7
Min
support = 7
for level 3
A rule example:
is_a(X, large town)  adjacent(X, sea)  close_to (X, us_boundary) : (100%)
The mining process stops
at the lowest level of the
hierarchies or when an
empty large 1predicate set
is derived.
Z-order
…
The “generalized_close_to” (g_close_to)
relationship between (large) towns and the other four
classes of entities is computed at a relatively
coarse resolution level.
[MBR data structure or R*trees and other
approximations] – Later…
R*-tree
MBR
Using slides from book of: Shashi Shekhar and Sanjay Chawla
MBR data structure
MBR – Minimal Bounding Rectangle
Allows rough computation – to filter non-relevant data
When used for indexing – is stored in R-trees
Is simply a rectangle of leftmost, rightmost, upper and
lower bound of a spatial object
What objects
intersect with
A?
Using MBR data structure
L1
O1
F1
L3
R1
A
R2
R3
O2
L2
L4
Using MBR data structure (cont.)
L1
O1
F1
L3
R1
L2
Filter out ones
whose MBRs do
not intersect A’s
MBR
O2
A
R2
L4
R3
Using MBR data structure (cont.)
Now we make exact
calculations on the
reduced set of
objects!
L1
F1
R1
O2
A
L4
Using MBR data structure (cont.)
We’ve got
L1, R1 and O2
L1
Note: to check whether MBRs
R1
O2
intersect, we only compare there
minimums and maximums; the
intersect(O1, O2) operation for
arbitrary spatial objects is very
expensive.
A
L4
In this example we apply
exact intersect operation five times
instead of ten.
R-Tree
Properties of R-trees
•Balanced
• Nodes are rectangle
• child’s rectangle within parent’s
• possible overlap among rectangles!
Implementation of find operation
Search root to identify relevant children
Search selected children recursively
R-Tree example
 Z-order
Z-order
Will be presented
by Xing Xiong…
 Efficient Polygon
Amalgamation algorithm
Efficient Polygon Amalgamation Methods for
Spatial OLAP and Spatial Data Mining
--Xiaofang Zhou, David Truffet, And Jiawei Han
Introduction
The polygon amalgamation operation
computes the boundary of the union of a set
of polygons.
This presentation will introduce two efficient
polygon amalgamation methods.
Introduction
 Why are we interested in amalgamation?
It’s a fundamental operation for emerging new
applications such as spatial OLAP and spatial data
mining.
Example:
 Roll up operation in OLAP.
 Group similar spatial objects into clusters in data
mining.
Introduction
•Important observation: Only boundary polygons contribute to the target
polygon.
•Question: How can we exclude the internal polygons from our computation?
Introduction
I’m going to introduce two efficient methods for identifying
internal polygons without retrieving them from databases.
 Adjacency-based method
 Occupancy-based method
A simple approach
• t(S):Target polygon
• Boundary polygon: A polygon which shares its boundary with that of t(S).
• Touching: a line k touch k’ if there exists one and only one common point(termed joint point)
between k and k’, such that the joint point is an end point of k and not an end point of k’. For
example, (b,c) touches(e,f) at point c.
• Preprocessing: whenever a line k touches k’:(m,n) at point v, we replace k’ with two line
segments(m,v), (v,n). For example, (f,e) is replaced by (e,c) and (c, f).
A simple approach

If S is the set of polygons to be amalgamated, t(S) is the targeted
polygon, a line segment is on the boundary of t(S) if and only if it has no
identical line segments in S . If we remove all identical line segments in S, the
remaining line segments will form the target polygon t(S).
A simple approach
Adjacency-based approach
 Two polygons are adjacent if they have at least one pair of identical
line segments.
 The adjacency table of a set of polygons P is defined as a two column
table ADJACENCY (p, p’) where p,p’ are identifiers of polygons in P.
 A tuple (p,p’) is in the table ADJACENCY if and only if p and p’ are
adjacent.
Basic idea: Use adjacency table to identify boundary polygons. Remove
identical line segments in boundary polygons to get t(S). Most of the
internal polygons will not be retrieved.
Adjacency-based approach
•Identify the set of boundary polygons: S.
Basic idea: a polygon is on the boundary of S if it’s adjacent
to some polygons which don’t belong to S
Adjacency-based approach
Shadow ring problem:
• S: p2, p3, p4 are boundary polygons.
• P1 will not be involved in the computation. So those line segments of
p2, p3 and p4 which were originally shared with p1 have lost their
counterpart. These line segments couldn’t be removed and form a shadow
ring.
Adjacency-based approach

1.
2.
Solution to the shadow ring problem:
Identify S+: the set of sub-boundary polygons which are internal
polygons adjacent to boundary polygons.
We extract both S and S+, remove all identical line segments in
them. We get the target polygon and possibly a shadow ring. But
this time we know the shadow ring is formed by line segments of
sub-boundary polygons S+. So we then remove the line segments
belonging to S+. The remaining line segments will form the
boundary polygon.
Adjacency-based approach
Occupancy-based approach
 Z-values: a spatial data access method which establishes certain
relationship between the data space and spatial objects or their
approximation bounding rectangles. It approximates a given object’s
shape by recursively decomposing the embedding data space into
smaller data space known as Peano cells.
 Z-values is commonly used spatial indexing mechanism. Our
occupancy-based approach is built on top of a simple extension of Zvalues.
Occupancy-based approach
The following is a description of one way to assign Z-values.
Examples: p1:14, p2: 1221
Occupancy-based approach
1. The accuracy of approximation can be improved by assigning multiple Zvalues to a polygon.(e.g., p1 can be approximated by three Peano cells: 141,
142, 143).
2. As we can see, this decomposing can go on forever. In this algorithm, a
number(termed HWM) is chosen such that a cell is to be further decomposed
into quadrants when the number of polygons overlapping with the cell exceeds
the HWM.
Occupancy-based approach
 Z-values with Occupancy
idea: Let C be the set of all Peano cells with which S polygons overlap
with. If cC is not completely occupied by S polygons, we call c boundary
cell. An S polygon overlapping with a boundary cell is likely to be a
boundary polygon.
Occupancy-based approach
Special case: when one Peano cell M is totally occupied by polygons and its neighbor
(another cell N) has no overlapping polygons. Boundary cell can’t be identified.
We introduce(z, p, 0) if polygon p is adjacent to z but not inside z, then z can be
identified as boundary cell.
Occupancy-based approach

Two extra points:

1.
2.

1.
2.
Occupancy-based approach will not produce shadow rings. Because:
Any line segment which doesn’t overlap with boundary cells is not part of the
target polygon, and thus can be discarded.
Any line segment inside a boundary cell either is part of the target polygon, or
its counterpart from another polygon must also be inside this boundary cell.
Fuzzy amalgamation.
This is a desirable advantage of Occupancy-based method over two other
methods.
Instead of defining internal cells as those with 100% occupancy, we can adjust
to a low threshold(e.g, 95%). This is useful when user wants to ignore data
noises.
Performance
 In comparison with SIMPLE algorithm which retrieves all the source
polygons, both adjacency-based and occupancy-based algorithm fetch
a smaller number of polygons.
 SIMPLE algorithm processes all the line segments in the source
polygons. Adjacency-based algorithm processes the line segments of
the boundary and sub-boundary polygons. Occupancy-based
algorithm just processes the line segments which intersects with
boundary Peano cells.
 Experimental results show that occupancy-based method has achieved
a remarkably better overall performance, in particular with an HWM
which puts an average of 5-7 polygons to a cell.
And now
discussion
© Peter de Sève 1997