Spatial Data Mining CSE 6331, Fall 1999
Transcript Spatial Data Mining CSE 6331, Fall 1999
Spatial Data Mining
CSE 6331, Fall 1999
What I shall be talking about….
Introduction (What, Why and Where)
Orthodox Methods and some primitives
Spatial Data Structures and Computations
Data Organization and Spatial Data Mining Architecture
Methods for knowledge discovery
– Generalized-based Approach
• Spatial-Data-Dominant Generalization
• Non-Spatial-Data-Dominant Generalization
– Spatial Association Rules
• Geominer : Tool for Spatial Data Mining
• Future Work
• Conclusions & References
Introduction (What, Why and
Where Spatial Data Mining)
• Used for mining data in spatial databases and databases with huge
amounts of data.
• Spatial data is defined as distance-related data in an object.
– A spatial object contains both spatial and non-spatial data.
– A spatial object is stored using spatial indexes e.g. R-Trees etc.
• A spatial database stores spatial objects and spatial relationships
between these objects.
• Spatial Data Mining is used for knowledge discovery in areas e.g.
Geographical Information Systems (GIS)
– Work great for numeric attributes.
– Gives out realistic outputs.
– These methods assume statistical independence among spatial data
– Cannot model non-linear rules and cannot symbolic values very
• Two techniques borrowed from Machine Learning are widely used in
– Learning From Example
– Specialization and Generalization
• Rules: Following types of possible rules can be generated from the
– Characteristic Rules
– Discriminant Rules
– Association Rule
• Thematic Maps: Present spatial distribution of a single/multiple
attributes. Used for discovering association rules. Typically, we
represent non-spatial attributes in this map.
Spatial Data Structures,
• Spatial Data Structures (Points, rectangles, circles etc.)
• Indices for spatial data structures are made using
– Multi-dimensional trees
• Spatial Computation Techniques
– Spatial Joins: Quite expensive. Uses R*-Trees for efficiency.
First step finds all possible pairs of intersecting objects. Second
step uses detailed methods to check for intersection.
– Map overlays are used in GIS.
Data Organization of a Spatial
• Spatial data (e.g. seismic data) consists of spatial and non-spatial
– Non-spatial attributes-- stored in relational databases.
– Another attribute in the database points to the spatial description of
– Spatial data is described using two properties -- geometric and
• Knowledge discovery methods based on spatial data use both spatial
and non-spatial data.
Spatial Data Mining Architecture
Generalization Based Knowledge
• Requires existence of background knowledge (concept hierarchies) for
both spatial and non-spatial data.
• Concept hierarchies are typically given by domain experts.
Spatial Attribute Concept
Generalization Based Knowledge
• Attribute Oriented Induction
– Generalize the specific values of the task-relevant attributes to
– Remove attributes when:
• further generalization is not possible OR
• the number of distinct values for an attribute at the higher
levels exceeds the generalization threshold for that attribute.
– Merge identical tuples. Keep quantitative measure of tuples
merged to allow quantitative presentation of knowledge acquired.
• Two main algorithms:
– Spatial-Data-Dominant Generalization
– Non-Spatial-Dominant Generalization
• Collect all task-relevant data sets
• Generalize the spatial data first by merging the spatial regions
according to the description in the concept hierarchy. Do not
generalize beyond spatial generalization threshold limit (indicates max.
number of spatial regions to create).
• Non-spatial data is analyzed for each spatial object using the attributeoriented approach.
• Asymptotic run time of this algorithm is O(N lg N) where N=# of
• Collect task-relevant data
• Perform attribute-oriented induction on non-spatial attributes.
– Generalization threshold indicates the stopping point for the
• Pointers to spatial objects are collected and put alongwith the generalized
• Neighboring areas with same generalized value for the non-spatial
attributes are merged together using “adjacent_to” spatial predicate.
• Complexity is O(N lg N) where N = # of spatial objects.
What’s the problem with
Generalized Data Mining
• Concept hierarchies can only be done by the domain expert.
• Concept hierarchies may not be known in advance in unexplored
• Interestingness and quality of the mined characteristic rules will
depend on the given concept hierarchy.
Spatial Association Rules
• Why we
– All the methods discussed previously only mine characteristic
rules. We sometimes need to find relationships (spatial and nonspatial) between attributes that belong to different classes.
• A spatial association rule is of the form X -->Y (c%), where X and Y
are sets of spatial or nonspatial predicates and c% is the confidence of
• Examples-– is_a(X,City) and has(X, beach)-->close_to(X, sea) (Spatial RHS)
– Is_a(X, student) --> goes_to(X, school) -- not a spatial association
• Concepts of minimum support and confidence are used to generate
rules. Different values at each level .
Topological Relations Hierarchy
Spatial Association Rules
• Use two-step spatial computation technique to reduce computations.
• Computation starts at the high level of spatial predicates like g_close_to
• More detailed and finer, but more expensive, spatial computations are
applied at lower concept levels only to those patterns that are large at the
level of the predicate g_close_to.
– Filtration of large patterns saves a computations since much fewer spatial
association relationships left at the lower concept levels.
• Still the user is required to provide concept hierarchies.
GeoMiner: A Spatial Data
• Can mine characteristic, comparison and association rules.
• Support spatial data cube module construction, Spatial OLAP and
Spatial Data Mining modules.
• Uses a spatial data mining language called GMQL, extension of spatial
• Functions on top of DBMiner. Directs all non-spatial data related
functions to DBMiner.
Example- Mine characteristic rules
MINE CHARACTERISTICS AS ``USA_states''
WITH RESPECT TO states_census.geo, statename, pop,
MINE SPATIAL ASSOCIATIONS AS "Golf Courses"
WITH RESPECT TO Washington.geo, Washington_Golf_Courses.geo, CFCC
FROM Washington_Golf_Courses, Washington
WHERE CLOSE_TO(Washington.geo, Washington_Golf_Courses.geo, "3 km")
AND Washington.CFCC "Golf Course"
SET SUPPORT THRESHOLD 0.4
SET CONFIDENCE THRESHOLD 0.5
Future Work and Directions
• Using multiple thematic maps.
• Consider interleaving spatial and nonspatial generalizations to get
more efficient results.
• Using parallel machines or distributed farms of workstations can
achieve good speed-up.
• Extending Spatial Data Mining to use OO based approaches.
• Using mature statistical methods may produce interesting new
• Use in Spatio-temporal databases to study data deviation and evolution
• Spatial data mining is a promising field of research with wide
applications in GIS, medical imaging, robot motion planning, etc.
• Although, the field is quite young, a number of algorithms and
techniques have been proposed to discover various kinds of knowledge
from spatial data. I surveyed some of the existing methods for spatial
data mining and mentioned their strengths and weaknesses. The
variety of yet unexplored topics and problems makes knowledge
discovery in spatial databases an attractive and challenging research
• Wei Lu and Jiawei Han, `` Discovery of General Knowledge in Large
Spatial Databases'', Proc. of 1993 Far East Workshop on Geographic
Information Systems (FEGIS'93), Singapore, June 1993, pp. 275-289
• J. Han, K. Koperski, and N. Stefanovic, `` GeoMiner: A System
Prototype for Spatial Data Mining'', Proc. 1997 ACM-SIGMOD Int'l
Conf. On Management of Data(SIGMOD'97), Tucson, Arizona, May
• K. Koperski and J. Han, `` Discovery of Spatial Association Rules in
Geographic Information Databases'', Proc. 4th Int'l Symp. on Large
Spatial Databases (SSD95), Maine, Aug. 1995, pp. 47-66.