#### Transcript Spatial Data Mining CSE 6331, Fall 1999

Spatial Data Mining CSE 6331, Fall 1999 Ajay Gupta [email protected] [email protected] What I shall be talking about…. • • • • • Introduction (What, Why and Where) Orthodox Methods and some primitives Spatial Data Structures and Computations Data Organization and Spatial Data Mining Architecture Methods for knowledge discovery – Generalized-based Approach • Spatial-Data-Dominant Generalization • Non-Spatial-Data-Dominant Generalization – Spatial Association Rules • Geominer : Tool for Spatial Data Mining • Future Work • Conclusions & References Introduction (What, Why and Where Spatial Data Mining) • Used for mining data in spatial databases and databases with huge amounts of data. • Spatial data is defined as distance-related data in an object. – A spatial object contains both spatial and non-spatial data. – A spatial object is stored using spatial indexes e.g. R-Trees etc. • A spatial database stores spatial objects and spatial relationships between these objects. • Spatial Data Mining is used for knowledge discovery in areas e.g. Geographical Information Systems (GIS) Orthodox Methods • Advantages – Work great for numeric attributes. – Gives out realistic outputs. • Disadvantages – These methods assume statistical independence among spatial data objects. – Cannot model non-linear rules and cannot symbolic values very well. • Two techniques borrowed from Machine Learning are widely used in this field: – Learning From Example – Specialization and Generalization Some primitives • Rules: Following types of possible rules can be generated from the database: – Characteristic Rules – Discriminant Rules – Association Rule • Thematic Maps: Present spatial distribution of a single/multiple attributes. Used for discovering association rules. Typically, we represent non-spatial attributes in this map. Spatial Data Structures, Computations • Spatial Data Structures (Points, rectangles, circles etc.) • Indices for spatial data structures are made using – Multi-dimensional trees – R-Trees/R*-Trees • Spatial Computation Techniques – Spatial Joins: Quite expensive. Uses R*-Trees for efficiency. First step finds all possible pairs of intersecting objects. Second step uses detailed methods to check for intersection. – Map overlays are used in GIS. Data Organization of a Spatial Database • Spatial data (e.g. seismic data) consists of spatial and non-spatial objects/attributes. – Non-spatial attributes-- stored in relational databases. – Another attribute in the database points to the spatial description of the object. – Spatial data is described using two properties -- geometric and topological. • Knowledge discovery methods based on spatial data use both spatial and non-spatial data. Spatial Data Mining Architecture From User From DBMS Domain Knowledge Controller DB Interface Focussing Component Pattern Extraction Knowledge Base Evaluation Discovery Generalization Based Knowledge Discovery • Requires existence of background knowledge (concept hierarchies) for both spatial and non-spatial data. • Concept hierarchies are typically given by domain experts. Computers Software Application Software Word Procesors MS Office Notepad Hardware System Software Browsers Netscape Compilers Internet Explorer Linkers Memory Modules RAMs Hard Disks CPUs Intel-based Processors 8086 8088 Motorolabased Processors Spatial Attribute Concept Hierarchy USA Texas Dallas Irving Beltline Road Tarrant Carrolton Walnut Hill Maryland Marsh Lane Arlington Frankford Drive Euless Montgomery North Potomac German town Fredrick Baltimore Spring Wells Generalization Based Knowledge Discovery • Attribute Oriented Induction – Generalize the specific values of the task-relevant attributes to higher levels. – Remove attributes when: • further generalization is not possible OR • the number of distinct values for an attribute at the higher levels exceeds the generalization threshold for that attribute. – Merge identical tuples. Keep quantitative measure of tuples merged to allow quantitative presentation of knowledge acquired. • Two main algorithms: – Spatial-Data-Dominant Generalization – Non-Spatial-Dominant Generalization Spatial Data-Dominant Generalization • Collect all task-relevant data sets • Generalize the spatial data first by merging the spatial regions according to the description in the concept hierarchy. Do not generalize beyond spatial generalization threshold limit (indicates max. number of spatial regions to create). • Non-spatial data is analyzed for each spatial object using the attributeoriented approach. • Asymptotic run time of this algorithm is O(N lg N) where N=# of spatial objects Spatial-Data-Dominant Generalization Non-Spatial-Data Dominant Generalization • Collect task-relevant data • Perform attribute-oriented induction on non-spatial attributes. – Generalization threshold indicates the stopping point for the generalization. • Pointers to spatial objects are collected and put alongwith the generalized non-spatial data. • Neighboring areas with same generalized value for the non-spatial attributes are merged together using “adjacent_to” spatial predicate. • Complexity is O(N lg N) where N = # of spatial objects. Non-Spatial-Data Dominant Generalization What’s the problem with Generalized Data Mining Techniques • Concept hierarchies can only be done by the domain expert. • Concept hierarchies may not be known in advance in unexplored domains. • Interestingness and quality of the mined characteristic rules will depend on the given concept hierarchy. Spatial Association Rules • Why we – All the methods discussed previously only mine characteristic rules. We sometimes need to find relationships (spatial and nonspatial) between attributes that belong to different classes. • A spatial association rule is of the form X -->Y (c%), where X and Y are sets of spatial or nonspatial predicates and c% is the confidence of the rule. • Examples-– is_a(X,City) and has(X, beach)-->close_to(X, sea) (Spatial RHS) – Is_a(X, student) --> goes_to(X, school) -- not a spatial association rule. • Concepts of minimum support and confidence are used to generate rules. Different values at each level . Topological Relations Hierarchy g_close_to not_disjoint INTERSECTS adjacent_to intersects INSIDE covered_ by close_to CONTAINS inside covers EQUAL contains Spatial Association Rules • Use two-step spatial computation technique to reduce computations. • Computation starts at the high level of spatial predicates like g_close_to (generalized close_to). • More detailed and finer, but more expensive, spatial computations are applied at lower concept levels only to those patterns that are large at the level of the predicate g_close_to. – Filtration of large patterns saves a computations since much fewer spatial association relationships left at the lower concept levels. • Still the user is required to provide concept hierarchies. GeoMiner: A Spatial Data Mining Tool • Can mine characteristic, comparison and association rules. • Support spatial data cube module construction, Spatial OLAP and Spatial Data Mining modules. • Uses a spatial data mining language called GMQL, extension of spatial SQL. • Functions on top of DBMiner. Directs all non-spatial data related functions to DBMiner. Geo-Characterizer Example- Mine characteristic rules MINE CHARACTERISTICS AS ``USA_states'' ANALYZE sum(pop) WITH RESPECT TO states_census.geo, statename, pop, med_fam_income, with_bachelor_degp FROM states_census Geo-Associator MINE SPATIAL ASSOCIATIONS AS "Golf Courses" WITH RESPECT TO Washington.geo, Washington_Golf_Courses.geo, CFCC FROM Washington_Golf_Courses, Washington WHERE CLOSE_TO(Washington.geo, Washington_Golf_Courses.geo, "3 km") AND Washington.CFCC "Golf Course" SET SUPPORT THRESHOLD 0.4 SET CONFIDENCE THRESHOLD 0.5 Future Work and Directions • Using multiple thematic maps. • Consider interleaving spatial and nonspatial generalizations to get more efficient results. • Using parallel machines or distributed farms of workstations can achieve good speed-up. • Extending Spatial Data Mining to use OO based approaches. • Using mature statistical methods may produce interesting new techniques. • Use in Spatio-temporal databases to study data deviation and evolution rules. Conclusion • Spatial data mining is a promising field of research with wide applications in GIS, medical imaging, robot motion planning, etc. • Although, the field is quite young, a number of algorithms and techniques have been proposed to discover various kinds of knowledge from spatial data. I surveyed some of the existing methods for spatial data mining and mentioned their strengths and weaknesses. The variety of yet unexplored topics and problems makes knowledge discovery in spatial databases an attractive and challenging research field. References • Wei Lu and Jiawei Han, `` Discovery of General Knowledge in Large Spatial Databases'', Proc. of 1993 Far East Workshop on Geographic Information Systems (FEGIS'93), Singapore, June 1993, pp. 275-289 • J. Han, K. Koperski, and N. Stefanovic, `` GeoMiner: A System Prototype for Spatial Data Mining'', Proc. 1997 ACM-SIGMOD Int'l Conf. On Management of Data(SIGMOD'97), Tucson, Arizona, May 1997. • K. Koperski and J. Han, `` Discovery of Spatial Association Rules in Geographic Information Databases'', Proc. 4th Int'l Symp. on Large Spatial Databases (SSD95), Maine, Aug. 1995, pp. 47-66.