Knowledge discovery for spatial data

Spatial Knowledge Discovery: the IST-SPIN! project
Michael May
German National Research Center for Information Technology (GMD)
SPIN! Michael May, EC-GIS 2000, 29.6.00
1. Introduction: Spatial Knowledge Discovery
Motivation
• The GIS revolution brought an explosion of geographically referenced data, but few tools for automatically extracting useful information
• One very interesting development is interactive thematic maps (CDV (Dykes 1997), SAGE (Haining 1998), Descartes (Andrienko & Andrienko 1999))
• These leave room for complementary methods:
– Multivariate dependencies are hard to visualize
– Visual identification of patterns is subjective
Knowledge Discovery in Databases (KDD)
Characterization:
"Knowledge discovery in databases is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data."
(Fayyad, Piatetsky-Shapiro, Smyth 1996)
Data Mining: example tasks
• Predicting crop yield from data on past yield, soil condition and climate
• Detecting credit fraud by spotting unusual transaction patterns
• Classifying stars using spectral data
Knowledge Discovery Cycle
[Figure: the knowledge discovery cycle, iterating through Selection & Transformation, Data Mining, and Visualization & Interpretation]
Kepler Preprocessing
• Data selection
• Cleaning
• Transformation
Kepler Exploration
• Drill down: exploring the data at different levels of aggregation
• Descriptive statistics & visualization
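Drill-down can be pictured as recomputing the same summary at successively finer grouping levels. The following is a minimal sketch in Python, not Kepler's implementation; the records, the field names (`region`, `district`, `yield_t`) and the helper `aggregate` are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical records; the field names are assumptions for illustration.
RECORDS = [
    {"region": "North", "district": "A", "yield_t": 4.0},
    {"region": "North", "district": "B", "yield_t": 6.0},
    {"region": "South", "district": "C", "yield_t": 3.0},
    {"region": "South", "district": "C", "yield_t": 5.0},
]

def aggregate(records, keys, value):
    """Return the mean of `value` for each combination of the grouping `keys`."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[k] for k in keys)].append(rec[value])
    return {group: sum(vals) / len(vals) for group, vals in groups.items()}

# Coarse level: one mean per region.
coarse = aggregate(RECORDS, ["region"], "yield_t")
# Drill down: the same statistic per (region, district).
fine = aggregate(RECORDS, ["region", "district"], "yield_t")
```

Adding a key to the grouping list is the drill-down step; removing one rolls the view back up.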
Kepler Data Mining
• Decision trees (DTI)
• Regression trees (RT)
• Subgroup discovery (Midos)
• k-nearest neighbor (k-NN)
• ILP (Foil)
Kepler Visualization
• Decision trees
• Subgroups
• Rules
Data Mining vs. GIS
Data Mining:
• system generates hypothesis
• search (and visualization) in abstract space
• inductive generalizations exceeding content of database
GIS:
• user generates hypothesis
• visualization in geographical space
• shows what’s inside the data
Both techniques are exploratory
2. SPIN!: Spatial Mining for Data of Public Interest
SPIN!: Spatial Mining for Data of Public Interest
• German National Research Center for Information Technology (GMD)
• University of Bari, Italy
• School of Geography, University of Leeds, UK
• Dialogis Software & Services GmbH, Bonn, Germany
• Professional GeoSystems (PGS), Amsterdam, Holland
• Metropolitan and Victoria Univ., Manchester, MIMAS
• IITP, Russian Academy of Sciences, Moscow
• GeoForschungszentrum Potsdam, Germany
IST-1999-10536 SPIN!, Duration: 1/2000-12/2002
Coordination: GMD, [email protected]
SPIN! Objectives
• Develop a system architecture integrating state-of-the-art GIS and Data Mining functionality in an open, extensible, internet-enabled architecture
• Adapt to spatial data:
– inductive logic programming learning methods
– Bayesian Markov Chain Monte Carlo
• Develop new visualization for Data Mining in GIS
• Develop new visualization of temporal and spatial data
• Apply the system to:
– seismic and volcano data analysis (with GFZ)
– web-based dissemination of census data (with ONS and MIMAS)
Level 1: Data access and management
Provided by the data mining platform Kepler
• data access to heterogeneous and distributed data
sources (RDBMS, flat file, spatial data)
• data query and transformation (restriction, projection,
union, join, calculated rows)
• exploratory non-spatial visualization
• organizing and documenting analysis tasks
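The query and transformation operations listed above correspond to standard relational operators. The sketch below illustrates them in plain Python on lists of dictionaries; it is not Kepler's API. The tables, column names and function names are hypothetical, and "calculated rows" is read here as a value derived for each row, which is an assumption.

```python
# Two hypothetical tables represented as lists of dicts.
DISTRICTS = [
    {"id": 1, "name": "D1", "population": 1000, "area_km2": 10.0},
    {"id": 2, "name": "D2", "population": 3000, "area_km2": 20.0},
]
INCOMES = [
    {"id": 1, "income": 21000},
    {"id": 2, "income": 19000},
]

def restrict(table, predicate):
    """Restriction: keep only the rows that satisfy the predicate."""
    return [row for row in table if predicate(row)]

def project(table, columns):
    """Projection: keep only the named columns."""
    return [{c: row[c] for c in columns} for row in table]

def union(left, right):
    """Union: rows of both tables without duplicates (same columns assumed)."""
    result = []
    for row in left + right:
        if row not in result:
            result.append(row)
    return result

def join(left, right, key):
    """Equi-join: combine rows whose `key` values match."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]

def calculate(table, name, fn):
    """Add a derived value `name`, computed by `fn`, to every row."""
    return [{**row, name: fn(row)} for row in table]

joined = join(DISTRICTS, INCOMES, "id")
with_density = calculate(
    joined, "density", lambda r: r["population"] / r["area_km2"]
)
dense = project(
    restrict(with_density, lambda r: r["density"] > 120), ["name", "density"]
)
```

Each operator takes tables and returns a new table, so steps can be chained, as in the last three statements.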
Level 2: Internet-enabled map viewer
• Lava/Magma Java-based internet GIS developed by
Professional GeoSystems (PGS)
• support for zooming, panning etc.
• Excellent scalability through client-side caching
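The slides do not describe how Lava/Magma caches data on the client. As a generic illustration of why client-side caching reduces server load, here is a minimal least-recently-used tile cache in Python. The class `TileCache`, the `(zoom, x, y)` tile key, the default capacity and the `fake_fetch` stand-in are all assumptions, not part of the PGS software.

```python
from collections import OrderedDict

class TileCache:
    """Least-recently-used cache for map tiles keyed by (zoom, x, y)."""

    def __init__(self, fetch, capacity=256):
        self._fetch = fetch            # callable that loads a tile from the server
        self._capacity = capacity      # maximum number of tiles kept on the client
        self._tiles = OrderedDict()
        self.server_requests = 0       # counts how often the server was contacted

    def get(self, zoom, x, y):
        key = (zoom, x, y)
        if key in self._tiles:
            self._tiles.move_to_end(key)      # mark as most recently used
            return self._tiles[key]
        self.server_requests += 1
        tile = self._fetch(zoom, x, y)
        self._tiles[key] = tile
        if len(self._tiles) > self._capacity:
            self._tiles.popitem(last=False)   # evict the least recently used tile
        return tile

def fake_fetch(zoom, x, y):
    """Stand-in for a network request; returns a placeholder tile."""
    return f"tile-{zoom}-{x}-{y}"

cache = TileCache(fake_fetch, capacity=2)
cache.get(1, 0, 0)
cache.get(1, 0, 0)   # served from the cache, no second server request
```

Panning back to an area already viewed is then answered locally, so the server only sees requests for tiles the client has not held recently.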
Level 3: Interactive thematic maps
• Knowledge-based map design (Andrienko & Andrienko 1999)
• Dynamic maps allowing interactive manipulation
[Figure: diagram with the elements: rule base on map design, map designer, data characterization (types and relationships), selected data subset]
Level 4: Automated cluster detection: GAM/K
• Searches for localised spatial clustering
• Examines circles of varying sizes that cover the region of interest
• Compares the relative frequency in each circle with the expected value
• Retains significant circles
• Applies kernel smoothing
(Openshaw 1998, 2000)
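The steps above can be sketched in a few lines of Python. This is a simplified illustration of the circle-scanning idea, not Openshaw's GAM/K code: the Poisson significance test, the Epanechnikov kernel, the threshold `alpha` and all function names are assumptions, and the input points are synthetic.

```python
import math

def poisson_upper_tail(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected)."""
    if observed == 0:
        return 1.0
    term = math.exp(-expected)          # P(X = 0)
    cdf = term
    for k in range(1, observed):
        term *= expected / k            # P(X = k) from P(X = k - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)

def significant_circles(points, centres, radii, alpha=0.01):
    """points: (x, y, cases, population). Returns (cx, cy, r, excess) tuples."""
    total_cases = sum(p[2] for p in points)
    total_pop = sum(p[3] for p in points)
    rate = total_cases / total_pop      # overall relative frequency
    kept = []
    for cx, cy in centres:
        for r in radii:
            inside = [p for p in points
                      if math.hypot(p[0] - cx, p[1] - cy) <= r]
            observed = sum(p[2] for p in inside)
            expected = rate * sum(p[3] for p in inside)
            if expected > 0 and poisson_upper_tail(observed, expected) < alpha:
                kept.append((cx, cy, r, observed - expected))
    return kept

def smoothed_surface(circles, x, y):
    """Kernel-smoothed excess at (x, y); bandwidth is each circle's radius."""
    total = 0.0
    for cx, cy, r, excess in circles:
        u = math.hypot(x - cx, y - cy) / r
        if u < 1.0:
            total += excess * 0.75 * (1.0 - u * u)   # Epanechnikov weight
    return total

# Synthetic 5 x 5 grid: one case per 100 people, except an excess at (1, 1).
POINTS = [(x, y, 12 if (x, y) == (1, 1) else 1, 100)
          for x in range(5) for y in range(5)]
CENTRES = [(x, y) for x in range(5) for y in range(5)]
circles = significant_circles(POINTS, CENTRES, radii=[0.5, 1.0])
```

Evaluating `smoothed_surface(circles, x, y)` over a fine grid would give a surface whose peaks mark candidate clusters for display on the map.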
Level 5: Explaining clusters and spatial phenomena
Assume we have produced a classification or clustering using either Descartes or GAM.
Which attributes are associated with a cluster and could potentially explain it?
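One simple way to approach this question is to rank attributes by how well they separate areas inside a cluster from areas outside it, which is also the first step a decision tree learner takes. The Python sketch below uses information gain for the ranking; the choice of measure, the attribute names and the four example areas are assumptions made for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` after splitting on `attribute`."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical areas; `in_cluster` would come from Descartes or GAM.
AREAS = [
    {"unemployment": "high", "housing": "old", "in_cluster": True},
    {"unemployment": "high", "housing": "new", "in_cluster": True},
    {"unemployment": "low", "housing": "old", "in_cluster": False},
    {"unemployment": "low", "housing": "new", "in_cluster": False},
]

# Attributes ordered from most to least informative about cluster membership.
ranking = sorted(
    ["unemployment", "housing"],
    key=lambda a: information_gain(AREAS, a, "in_cluster"),
    reverse=True,
)
```

In this toy data `unemployment` separates the cluster perfectly while `housing` carries no information, so a tree learner would split on `unemployment` first.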
Example: GIS & decision trees
[Figure: a thematic map shown together with a decision tree]
Using Inductive Logic Programming
• Learning approach based on first-order predicate logic
• Can express relations between instances
• Greater representational power than attribute-value learners using ‘single-table data’
• Topological relations such as adjacent_to, close_to, inside can be included in the search for explanations
crime_hotspot(X) :-
    city(Z),
    high_unemployment(Z),
    train_station(Y),
    inside(Y,Z),
    close_to(X,Y).
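To make the meaning of the rule concrete, the sketch below evaluates it in Python over a handful of ground facts. It only checks a given rule against data and does not perform the ILP learning step; the fact sets and the identifiers (`c1`, `s1`, `a1`, ...) are hypothetical.

```python
# Ground facts (hypothetical identifiers, for illustration only).
CITY = {"c1", "c2"}
HIGH_UNEMPLOYMENT = {"c1"}
TRAIN_STATION = {"s1", "s2"}
INSIDE = {("s1", "c1"), ("s2", "c2")}      # (station, city)
CLOSE_TO = {("a1", "s1"), ("a2", "s2")}    # (area, station)

def crime_hotspot(x):
    """True if X is close to a station Y inside a high-unemployment city Z."""
    return any(
        z in HIGH_UNEMPLOYMENT and (y, z) in INSIDE and (x, y) in CLOSE_TO
        for z in CITY
        for y in TRAIN_STATION
    )

hotspots = sorted(a for a in ("a1", "a2") if crime_hotspot(a))
```

The body of the clause relates three different objects (area, station, city), which is exactly what a single-table attribute-value representation cannot express directly.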
Application challenge
Eruption of Merapi volcano, Java, Indonesia
Merapi volcano in central Java
Tasks for Merapi application
• Estimating possible future eruptions
• Combining information about land use/land cover, infrastructure and population to make a damage assessment
• Disseminating information for volcano risk mitigation over the Internet
Web-based dissemination of census data
• MIMAS disseminates UK census data to the UK academic sector
• SPIN! technology is used to add value beyond the mere distribution of data
• UK Unitary Development Plans; selected application area: Manchester Stockport
– Forecasting numbers of houses needed
– Allocation of land
– Development control
Conclusion
• Integration of Data Mining and GIS is a logical
progression of spatial data analysis technology
• Integrating interactive statistical maps in the knowledge
discovery process improves the visualization and
interpretation step of KDD
• Map-based data analysis can be supplemented by data mining methods that suggest potential explanations of patterns
• First prototype expected by the end of the year 2000!