Spatial Data Mining - COW :: Ceng

Download Report

Transcript Spatial Data Mining - COW :: Ceng



Introduction
Methods for Knowledge Discovery in Spatial
Databases
◦
◦
◦
◦
◦



Generalization-Based Knowledge Discovery
Methods Using Clustering
Methods Exploring Spatial Associations
Using Approximation and Aggregation
Mining in Image Databases
Future Directions
Conclusion
References
Our objectives:
◦ Describe existing spatial data mining methods
◦ Give a general perspective of the field’s current
state
◦ Summarize the paper’s description of the state of
spatial data mining in 1996.

A spatial database is a database that is
optimized to store and query data that is
related to objects in space, including points,
lines and polygons. While typical databases
can understand various numeric and
character types of data, additional
functionality needs to be added for databases
to process spatial data types. These are
typically called geometry or feature.







Database systems use indexes to quickly look up
values and the way that most databases index
data is not optimal for spatial queries. Instead,
spatial databases use a spatial index to speed up
database operations.
In addition to typical SQL queries such as SELECT
statements, spatial databases can perform a wide
variety of spatial operations. A few examples are:
Spatial Measurements
Spatial Functions
Spatial Predicates
Constructor Functions
Observer Functions

Data mining is a field at the intersection of computer
science and statistics, it is the process that attempts
to discover patterns in large data sets. It utilizes
methods at the intersection of artificial intelligence,
machine learning, statistics, and database systems.
The overall goal of the data mining process is to
extract information from a data set and transform it
into an understandable structure for further use.
Aside from the raw analysis step, it involves database
and data management aspects, data preprocessing,
model and inference considerations, interestingness
metrics, complexity considerations, post-processing
of discovered structures, visualization, and online
updating.


“Spatial data mining, or knowledge discovery in
spatial database, refers to the extraction of implicit
knowledge, spatial relations, or other patterns not
explicitly stored in spatial databases.” (Koperski
and Han, 1995)
Data mining, or knowledge discovery in databases,
refers to the “ discovery of interesting, implicit, and
previously unknown knowledge from large
databases.” (Frawley et al, 1992)
WHAT’S THE DIFFERENCE?
DATA = The (WHAT) dimension.
an attribute of an object.
SPATIAL DATA = (WHERE) & (WHAT)
Attribute data referenced to a specific location.
The Attributes of spatial objects are:
- Highly dependant on location
- Often influenced by neighboring objects
Object
What
Milk
Is cold
Yogurt
Is warm
Butter
Is warm
Object Where
What
Milk
In fridge (X1,Y1)
Is cold
Milk
On table (X2,Y2)
Is warm
Object
Where
What
SPATIAL DATA
MINING:
Milk
In fridge (X1,Y1)
Is warm
THE FRIDGE IS
BROKEN!
Yogurt
In fridge (X1,Y1)
Is warm
Butter
In fridge (X1,Y1)
Is warm
Object() : Location() -> Characteristic()
(Milk) : (In Fridge) -> [should be] -> (Cold)




To understand spatial data
To discover relationships between spatial and
non spatial data
To capture the general characteristics in a
concise way
To build spatial knowledge-bases



Various kinds of rules can be discovered from
databases in general
Spatial Characteristic Rule: General
description for spatial data.
Example: A rule describing the general price
range of houses in various geographic
regions in a city.




Spatial Discriminant Rule: General description
of the features discriminating or contrasting a
class of spatial data from other classes.
Example: Comparison of price ranges of
houses in different geographical regions.
Spatial Association Rule: Describes the
implication of one or a set of features by
another set of features in spatial databases.
Example: Associating the price range of the
houses with nearby spatial features, like
beaches.





Thematic Maps:
Presents the spatial distribution of attributes.
Differs from general maps where the
objective is to present the positions of
objects.
Used for discovering different rules
2 ways to represent a thematic map:
◦ Raster
◦ Vector




Image Databases:
Special kind of spatial databases where data
almost entirely consists of images or pictures.
Used in remote sensing, medical imaging
Usually stored in form of grid arrays
representing the image intensity in spectral
ranges.





Spatial Data Structures:
Consists of points, lines, rectangles, etc.
Spatial Computations:
Spatial join is one of the most expensive
spatial operations.
Map overlay is an important operation for
geographic information systems
SD mining algorithms must
efficiently overcome:
 The huge volume of
spatial data
 The complexity of spatial
data types/structures
 The complexity of spatial
accessing/query methods
 Expensive spatial
processing operations
HUGE
S-DB
www.jupiterimages.co
m
Object =
Where is:
Citizen{Brad}
GO!
Which highways cross
Nat’n Park boundaries?
= spatial JOIN
◦
◦
◦
◦
◦
Generalization-Based Knowledge Discovery
Methods Using Clustering
Methods Exploring Spatial Associations
Using Approximation and Aggregation
Mining in Image Databases

Spatial Data
◦ Geometric (location, area, perimeter)
◦ Topological (adjacency, inclusion)

Non-Spatial Data
◦ Stored in a traditional database with an attribute
that is a pointer to the spatial description (Aref et
al. - 1991)

Need for background knowledge in the form
of concept hierarchies : spatial and nonspatial
◦ Non-Spatial:
◦ Spatial: Counties -> Provinces -> Larger Regions

Two generalization-baseds algorithms presented by
Lu et al. - 1993:
◦ Spatial Data Dominant Generalization:
 Generalization of the spatial objects continues until the "spatial
generalization threshold" is reached.
 Non-spatial data is analyzed.
◦ Non-Spatial Data Dominant Generalization:
 The non-spatial data is generalized into higher concept level.
 Neighbouring areas with similar generalized attributes are
merged.

Dependent upon the concept hierarchies and need for
algorithms that does not use these hierarchies




No need for a background information like
concept hierarchies
Foundation of clusters directly from the data
A similar approach in machine learning is
called "unsupervised learning"
Clustering algorithms:
◦ PAM (Kaufmann and Rousseeuw – 1990)
◦ CLARAN (Kaufmann and Rousseeuw – 1990)
◦ CLARANS (Ng and Han - 1994)
 SD (CLARANS) – Spatial Dominant Approach
 NSD (CLARANS) – Non-Spatial Dominant Approach




n objects and k clusters
Selecting the most representative point for
each cluster
Most centrally located point in a cluster –
medoid
Computationally inefficient
Total Cost = 20
10
10
10
9
9
9
8
8
8
Arbitrary
choose k
object as
initial
medoids
7
6
5
4
3
2
1
Assign
each
remaining
object to
nearest
medoids
7
6
5
4
3
2
1
0
0
0
1
2
3
4
5
6
7
8
9
10
0
K=2
Do loop
Until no
change
1
2
3
4
5
6
7
8
9
10
7
6
5
4
3
2
1
0
0
10
3
4
5
6
7
8
9
10
10
Compute
total cost of
swapping
9
If quality is
improved.
2
Randomly select a
nonmedoid object,Oramdom
Total Cost = 26
Swapping O
and Oramdom
1
8
7
6
9
8
7
6
5
5
4
4
3
3
2
2
1
1
0
0
0
1
2
3
4
5
6
7
8
9
10
0
1
2
3
4
5
6
7
8
9
10
29
Very similar to PAM
 Sampling is the difference
 The idea:
If the sample is selected in a random
manner, it will be representing the data correctly
and, therefore, CLARA can deal with larger data sets
with respect to PAM.
 The drawback:
May not do the best clustering -> An object
is a medoid and it is not selected when sampling

Tries to mix PAM and CLARA
 The idea:
CLARA -> All stages with same sample
CLARANS -> At each step a random
sample
 The chance to miss the potential solutions is
decreased to minimum
 Experimentally shown as more efficient than
PAM and CLARA
 Also, detection of outliers i.e points that are
not belong to any cluster


Based upon CLARANS, two spatial mining
algorithms were developed:
◦ SD (CLARANS) – Spatial Dominant Approach
◦ NSD (CLARANS) – Non-Spatial Dominant Approach
 Clustering of objects spatially by CLARANS
 Attribute-oriented induction on non-spatial description of
the objects
 Result: Description of each cluster by it’s relative nonspatial attributes

Spatial Cluster: Downtown Edmunton

Non-Spatial (Attribute) Cluster:
50% Commercial,
40% Residental,
10% Public Services
 Attribute-oriented generalization on non-spatial
description of the objects


For each generalized tuple, clustering of the
objects by CLARANS
In the case of an overlap between clusters, the
merging of those clusters




Spatial Cluster:
Region East of 50 St,
South of 35 St,
North of Whitemud.
Non-Spatial (Attribute) Cluster:
Mostly Industrial

Two Drawbacks:
◦ The assumption that all objects to be clustered are
stored in main memory
 Requirement of disk-based methods
 Integration with spatial access methods like R* tree
◦ Efficiency of the algorithm can be improved by
some modifications
 Focusing on the representative objects when selecting
medoids
 Restricting the access to certain objects that do not
actually contribute to the computation




Presented because of the non-availability or
time consuming construction of the R-Trees
An algorithm presented by Zhang et al. –
1996
Can be used with any clustering algorithm
like CLARANS
Two concepts that are used in the algorithm:
◦ Clustering Feature
 A triple summarizing information about subclusters of
points
◦ CF Tree
 A balanced tree that stores clustering features
Clustering Feature: CF = (N, LS, SS)
N: Number of data points
LS: Ni=1=Xi
SS: Ni=1=Xi2
CF = (5, (16,30),(54,190))
10
9
8
7
6
5
4
3
2
1
0
0
1
2
3
4
5
6
7
8
9
10
(3,4)
(2,6)
(4,5)
(4,7)
(3,8)





Desire to discover rules that associate spatial
objects with other spatial objects
Association rules -> Agrawal et al. – 1993
Association rules for spatial databases ->
Koperski and Han – 1995
Example Rule: is_a(x, school) -> close_to(x,
park) (80%)
Other predicates: intersects, overlap, disjoint,
left_of, west_of, close_to, far_away

Support ( Support (X) )
Probability that a tuple contains X

Confidence ( X->Y )
Probability that a tuple having X also contains Y


Predefined thresholds (minimum support,
minimum confidence) to determine
associations
Strong Rule: A Rule that has a support, no
less than minimum support; and has a
confidence, no less than minimum confidence



Methods that answer where the groups of data
are
More interesting: Why the clusters are there?
Rephrased: What are the characteristics of the
clusters in terms of the features that are close
to them?
C1
C1
C2
Cluster C1 centered at
X1,Y1
Cluster C2 centered at
X2,Y2
C2
45% of the objects in Clusters
C1 & C2 are close to feature
(River)





CRH algorithm presented by Knorr and Ng –
1995
CRH: C (encompassing circle), R (isothetic
rectangle), H (convex hull)
Using filters to reduce the candidate features
from a large number of features
Input: Any number of features and one
cluster
Output: List of relevant features and points
located from a defined distance from features





There have been studies in this field, on the
automatic recognition and categorization of
objects
Usual systems are composed of 3
components:
Data focusing
Feature extraction
Classifaction learning



Data Focusing: Increases the overall efficiency
of the system by first identifying the portion
of the image being analyzed that is most
likely to containt the target object
Feature Extraction: Extracts interesting
features from the data. Pattern recognition
methods are used
Classification Learning: Discriminates the
target objects from other objects that look
alike



This system has about 80% accuracy.
It is difficult for experts to provide
classifications with 100% certainty
False classifications can produce large errors
because they are treated as negative
examples
“The variety of yet unexplored topics and problems makes knowledge discovery in spatial
databases an attractive and challenging research field.”

Identified Future Directions for spatial data mining:
◦ Data Mining in Spatial Object Oriented DB
◦ Alternative Clustering Techniques:
 Clustering overlapping objects, Fuzzy Clustering of Spatial Data
◦ Mining under uncertainty:
 Evidential reasoning, Fuzzy sets approaches
◦ Spatial Data Deviation and Evolution Rules:
◦
◦
◦
◦
◦
 Rule application to data that changes over time
Interleaved Generalization (spatial and non-spatial)
Generalization of Temporal Spatial Data (data evolution)
Parallel Data Mining (multi-processor systems)
Spatial Data Mining Query Language
Multidimensional Rule Visualization and Multiple Thematic
Maps
Topic
Currently Active
Research Field
References
DM in Spatial Obj-Oriented DB
Yes
>10 (1997 – 2012)
Alternative / Fuzzy Spatial
Clustering
Yes
1996,
>10 (1997 – 2012)
Mining under uncertainty
Yes
>10 (1997 – 2012)
Deviation / Evolution Rules
Yes
>10 (1997 – 2012)
Interleaved Generalization
Vague…
Generalization of Temporal
Spatial Data
Yes
Parallel Data Mining
merged
Spatial Data Mining Query
Language
Yes: GeoMiner, GMQL,
Spatial SQL
1997 (h&k),
>10 (1997 – 2012)
Visualization Topics
Yes – esp. GIS
>10 (1997 – 2012)
>10 (1997 – 2012)
1991, 1994, 1996,


Data Mining / Knowledge Discovery of Spatial
Data is a large, active research area.
While it was a “young” field at the time this
survey paper was written, it is quickly
maturing in applications such as:
◦ Geographic Information Systems
◦ Medical Imaging
◦ Robotics Navigation







http://www.cubrid.org/wp-content/uploads/2011/09/2deminsional-mbr.png
http://www.cubrid.org/wp-content/uploads/2011/09/geometryobject-model-examples.png
http://www.cubrid.org/wp-content/uploads/2011/09/mercatorprojection.png
http://www.lib.unc.edu/reference/gis/images/spatial1.jpg
http://postgis.refractions.net/documentation/casestudies/northda
kota/Map_Service_01.jpg
http://postgis.refractions.net/documentation/casestudies/northda
kota/4D_Screen.jpg
http://rmbl.info/gis/layers.jpg/layers-full.jpg

Krzysztof Koperski, Junas Adhikary, Jiawei Han.
Spatial Data Mining: Progress and Challenges
Survey Paper. Workshop on Research Issues on
Data Mining and Knowledge Discovery, 1996









W. G. Aref and H. Samet. Extending DBMS with Spatial Operations. In Proc. 2nd Symp.
SSD’91, pp.299-318, Zurich, Switzerland, Aug. 1991.
W. Lu, J. Han, and B. C. Ooi. Discovery of General Knowledge in Large Spatial Databases.
In Proc. Far East Workshop on Geographic Information Systems pp. 275-289, Singapore,
June 1993.
L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster
Analysis. John Wiley & Sons, 1990.
R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. In
Proc. 1994 Int. Conf. Very Large Data Bases, pp. 144-155, Santiago, Chile, September
1994.
T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: an Efficient Data Clustering Method for
Very Large Databases. In Proc. 1996 ACM-SIGMOD Int. Conf. Management of Data,
Montreal, Canada, June 1996.
R. Agrawal, T. Imielinski, and A. Swami. Mining Association Rules Between Sets of Items
in Large Databases. In Proc. 1993 ACM-SIGMOD Int. Conf. Management of Data, pp.
207-216, Washington, D.C., May 1993.
K. Koperski and J. Han. Discovery of Spatial Association Rules in Geogrpahic Information
Databases. In Proc. 4th Int’l Symp. On Large Spatial Databases (SSD’95), pp. 47-66,
Portland, Maine, August 1995.
E. Knorr and R. T. Ng. Applying Computational Geometry Concepts to Discovering
Spatial Aggregate Proximity Relationships. In Technical Report, University of British
Columbia, 1995.
Paper Review and Slides by Brad Danielson