Spatial Data Mining CSE 6331, Fall 1999

download report

Transcript Spatial Data Mining CSE 6331, Fall 1999

Spatial Data Mining
CSE 6331, Fall 1999
Ajay Gupta
[email protected]
[email protected]
What I shall be talking about….
•
•
•
•
•
Introduction (What, Why and Where)
Orthodox Methods and some primitives
Spatial Data Structures and Computations
Data Organization and Spatial Data Mining Architecture
Methods for knowledge discovery
– Generalized-based Approach
• Spatial-Data-Dominant Generalization
• Non-Spatial-Data-Dominant Generalization
– Spatial Association Rules
• Geominer : Tool for Spatial Data Mining
• Future Work
• Conclusions & References
Introduction (What, Why and
Where Spatial Data Mining)
• Used for mining data in spatial databases and databases with huge
amounts of data.
• Spatial data is defined as distance-related data in an object.
– A spatial object contains both spatial and non-spatial data.
– A spatial object is stored using spatial indexes e.g. R-Trees etc.
• A spatial database stores spatial objects and spatial relationships
between these objects.
• Spatial Data Mining is used for knowledge discovery in areas e.g.
Geographical Information Systems (GIS)
Orthodox Methods
• Advantages
– Work great for numeric attributes.
– Gives out realistic outputs.
• Disadvantages
– These methods assume statistical independence among spatial data
objects.
– Cannot model non-linear rules and cannot symbolic values very
well.
• Two techniques borrowed from Machine Learning are widely used in
this field:
– Learning From Example
– Specialization and Generalization
Some primitives
• Rules: Following types of possible rules can be generated from the
database:
– Characteristic Rules
– Discriminant Rules
– Association Rule
• Thematic Maps: Present spatial distribution of a single/multiple
attributes. Used for discovering association rules. Typically, we
represent non-spatial attributes in this map.
Spatial Data Structures,
Computations
• Spatial Data Structures (Points, rectangles, circles etc.)
• Indices for spatial data structures are made using
– Multi-dimensional trees
– R-Trees/R*-Trees
• Spatial Computation Techniques
– Spatial Joins: Quite expensive. Uses R*-Trees for efficiency.
First step finds all possible pairs of intersecting objects. Second
step uses detailed methods to check for intersection.
– Map overlays are used in GIS.
Data Organization of a Spatial
Database
• Spatial data (e.g. seismic data) consists of spatial and non-spatial
objects/attributes.
– Non-spatial attributes-- stored in relational databases.
– Another attribute in the database points to the spatial description of
the object.
– Spatial data is described using two properties -- geometric and
topological.
• Knowledge discovery methods based on spatial data use both spatial
and non-spatial data.
Spatial Data Mining Architecture
From User
From
DBMS
Domain
Knowledge
Controller
DB
Interface
Focussing
Component
Pattern
Extraction
Knowledge Base
Evaluation
Discovery
Generalization Based Knowledge
Discovery
• Requires existence of background knowledge (concept hierarchies) for
both spatial and non-spatial data.
• Concept hierarchies are typically given by domain experts.
Computers
Software
Application
Software
Word
Procesors
MS Office
Notepad
Hardware
System
Software
Browsers
Netscape
Compilers
Internet
Explorer
Linkers
Memory
Modules
RAMs
Hard
Disks
CPUs
Intel-based
Processors
8086
8088
Motorolabased
Processors
Spatial Attribute Concept
Hierarchy
USA
Texas
Dallas
Irving
Beltline
Road
Tarrant
Carrolton
Walnut
Hill
Maryland
Marsh
Lane
Arlington
Frankford
Drive
Euless
Montgomery
North
Potomac
German
town
Fredrick
Baltimore
Spring
Wells
Generalization Based Knowledge
Discovery
• Attribute Oriented Induction
– Generalize the specific values of the task-relevant attributes to
higher levels.
– Remove attributes when:
• further generalization is not possible OR
• the number of distinct values for an attribute at the higher
levels exceeds the generalization threshold for that attribute.
– Merge identical tuples. Keep quantitative measure of tuples
merged to allow quantitative presentation of knowledge acquired.
• Two main algorithms:
– Spatial-Data-Dominant Generalization
– Non-Spatial-Dominant Generalization
Spatial Data-Dominant
Generalization
• Collect all task-relevant data sets
• Generalize the spatial data first by merging the spatial regions
according to the description in the concept hierarchy. Do not
generalize beyond spatial generalization threshold limit (indicates max.
number of spatial regions to create).
• Non-spatial data is analyzed for each spatial object using the attributeoriented approach.
• Asymptotic run time of this algorithm is O(N lg N) where N=# of
spatial objects
Spatial-Data-Dominant
Generalization
Non-Spatial-Data Dominant
Generalization
• Collect task-relevant data
• Perform attribute-oriented induction on non-spatial attributes.
– Generalization threshold indicates the stopping point for the
generalization.
• Pointers to spatial objects are collected and put alongwith the generalized
non-spatial data.
• Neighboring areas with same generalized value for the non-spatial
attributes are merged together using “adjacent_to” spatial predicate.
• Complexity is O(N lg N) where N = # of spatial objects.
Non-Spatial-Data Dominant
Generalization
What’s the problem with
Generalized Data Mining
Techniques
• Concept hierarchies can only be done by the domain expert.
• Concept hierarchies may not be known in advance in unexplored
domains.
• Interestingness and quality of the mined characteristic rules will
depend on the given concept hierarchy.
Spatial Association Rules
• Why we
– All the methods discussed previously only mine characteristic
rules. We sometimes need to find relationships (spatial and nonspatial) between attributes that belong to different classes.
• A spatial association rule is of the form X -->Y (c%), where X and Y
are sets of spatial or nonspatial predicates and c% is the confidence of
the rule.
• Examples-– is_a(X,City) and has(X, beach)-->close_to(X, sea) (Spatial RHS)
– Is_a(X, student) --> goes_to(X, school) -- not a spatial association
rule.
• Concepts of minimum support and confidence are used to generate
rules. Different values at each level .
Topological Relations Hierarchy
g_close_to
not_disjoint
INTERSECTS
adjacent_to
intersects
INSIDE
covered_
by
close_to
CONTAINS
inside
covers
EQUAL
contains
Spatial Association Rules
• Use two-step spatial computation technique to reduce computations.
• Computation starts at the high level of spatial predicates like g_close_to
(generalized close_to).
• More detailed and finer, but more expensive, spatial computations are
applied at lower concept levels only to those patterns that are large at the
level of the predicate g_close_to.
– Filtration of large patterns saves a computations since much fewer spatial
association relationships left at the lower concept levels.
• Still the user is required to provide concept hierarchies.
GeoMiner: A Spatial Data
Mining Tool
• Can mine characteristic, comparison and association rules.
• Support spatial data cube module construction, Spatial OLAP and
Spatial Data Mining modules.
• Uses a spatial data mining language called GMQL, extension of spatial
SQL.
• Functions on top of DBMiner. Directs all non-spatial data related
functions to DBMiner.
Geo-Characterizer
Example- Mine characteristic rules
MINE CHARACTERISTICS AS ``USA_states''
ANALYZE sum(pop)
WITH RESPECT TO states_census.geo, statename, pop,
med_fam_income, with_bachelor_degp
FROM states_census
Geo-Associator
MINE SPATIAL ASSOCIATIONS AS "Golf Courses"
WITH RESPECT TO Washington.geo, Washington_Golf_Courses.geo, CFCC
FROM Washington_Golf_Courses, Washington
WHERE CLOSE_TO(Washington.geo, Washington_Golf_Courses.geo, "3 km")
AND Washington.CFCC "Golf Course"
SET SUPPORT THRESHOLD 0.4
SET CONFIDENCE THRESHOLD 0.5
Future Work and Directions
• Using multiple thematic maps.
• Consider interleaving spatial and nonspatial generalizations to get
more efficient results.
• Using parallel machines or distributed farms of workstations can
achieve good speed-up.
• Extending Spatial Data Mining to use OO based approaches.
• Using mature statistical methods may produce interesting new
techniques.
• Use in Spatio-temporal databases to study data deviation and evolution
rules.
Conclusion
• Spatial data mining is a promising field of research with wide
applications in GIS, medical imaging, robot motion planning, etc.
• Although, the field is quite young, a number of algorithms and
techniques have been proposed to discover various kinds of knowledge
from spatial data. I surveyed some of the existing methods for spatial
data mining and mentioned their strengths and weaknesses. The
variety of yet unexplored topics and problems makes knowledge
discovery in spatial databases an attractive and challenging research
field.
References
• Wei Lu and Jiawei Han, `` Discovery of General Knowledge in Large
Spatial Databases'', Proc. of 1993 Far East Workshop on Geographic
Information Systems (FEGIS'93), Singapore, June 1993, pp. 275-289
• J. Han, K. Koperski, and N. Stefanovic, `` GeoMiner: A System
Prototype for Spatial Data Mining'', Proc. 1997 ACM-SIGMOD Int'l
Conf. On Management of Data(SIGMOD'97), Tucson, Arizona, May
1997.
• K. Koperski and J. Han, `` Discovery of Spatial Association Rules in
Geographic Information Databases'', Proc. 4th Int'l Symp. on Large
Spatial Databases (SSD95), Maine, Aug. 1995, pp. 47-66.