Geographic Data Mining

Download Report

Transcript Geographic Data Mining

Geographic Data Mining
Marc van Kreveld
Seminar for GIVE
Block 1, 2003/2004
About …
• A form of geographical analysis
• Current topic of interest in GIS research
(and database research and AI
research)
• Finding hidden information in large
collections of geographic data
This seminar
• Learning about a topic together
• Presenting to each other + interaction
• Added value by good examples:
– for important concepts, algorithms
– possibly self-thought of, or extended
– referring to GIS data and issues (hence the
GIS course prerequisite)
• Written assignment: joint survey
Material
• Book by Harvey Miller and Jiawei Han
(editors): selected chapters
• Possibly: papers from conference
proceedings
• Mostly provided by me
Weeks
• Week 36-46
• Probably:
– Not September 4 (this Thursday)
– Not in week 40 (Sept. 29 & Oct. 2)
– Not October 23
• The above depending on participation!
Overview of Geographic
Data Mining & Knowledge
Discovery
Chapter 1 of the book
•
•
•
•
KDD: knowledge discovery in databases
Data warehouses
Data mining
Geographic aspects of the above
Knowledge Discovery in
Databases (KDD)
• Large databases contain interesting
patterns: non-random properties and
relationships that are:
– valid (general enough to apply to new data)
– novel (non-trivial and unexpected)
– useful (leads to effective action: decision
making or investigation)
– ultimately understandable (simple, and
interpretable by humans)
Knowledge Discovery in
Databases (KDD)
• Because of quantity of data nowadays
• Because we want information, not data
• Because computing power allows it
KDD opposed to statistics
• Statistics
– small and clean numeric database
– scientifically sampled
– specific questions in mind
• KDD: none of the above
KDD techniques
•
•
•
•
•
Statistics
Machine learning
Pattern recognition
Numeric search (?)
Scientific visualization
Data warehouse
• Large repository of data
• For analytical processing
(DB: transactional processing)
• Heterogeneous: different sources and
formats
(DB: homogeneous)
• Supports OLAP tools (OnLine Analytical
Processing)
OLAP Example
• Measure of interest: sales
• Dimensions of interest: item, store, week
• (item, store, week)  money
[quantity sold times price ]
OLAP Example
• 2-dim. aggregation:
(item, store, . )  money
• Another 2-dim. aggregation: sales by
store and by week
• 1-dim. aggregation: sales by week (all
items and stores)
• Data cube: all 2d possible aggregations,
different types of summaries
KDD steps
•
•
•
•
•
•
Data selection
Data pre-processing
Data enrichment
Data reduction and projection
Data mining
Interpretation and reporting
Presence of steps and order not fixed
KDD steps
• Data selection: which records, variables
chosen?
KDD steps
• Data selection
• Data pre-processing: removing noise,
duplicate records, handling missing
data, …
KDD steps
• Data selection
• Data pre-processing
• Data enrichment: combining the
selected data with external data
KDD steps
•
•
•
•
Data selection
Data pre-processing
Data enrichment
Data reduction and projection:
reduction in number, reducing
dimension
KDD steps
•
•
•
•
•
Data selection
Data pre-processing
Data enrichment
Data reduction and projection
Data mining: uncovering information,
interesting patterns
KDD steps
•
•
•
•
•
•
Data selection
Data pre-processing
Data enrichment
Data reduction and projection
Data mining
Interpretation and reporting: evaluating,
understanding, communicating
Data mining
•
•
•
•
•
Segmentation
Dependency analysis
Deviation and outlier analysis
Trend detection
Generalization and characterization
DM - segmentation
Description:
• Clustering: finding a
finite set of implicit
classes
• Classification:
mapping data items
into pre-defined
classes
Techniques:
• Cluster analysis
• Bayesian
classification
• Decision or
classification trees
• Artificial neural
networks
DM - segmentation
given classes
clustering
classification
DM – dependency analysis
Description:
• Finding rules to
predict the value of
some attribute based
on other attributes
(4, 12, 0.24)
(3, 14, 0.21)
(7, 13, 0.43)
(2, 9, 0.78)
(11, 11, 0.55)
Techniques:
• Bayesian networks
• Association rules
(5, 11, ???)
(???, 12, 0.51)
DM – dependency analysis
• Confidence and support measures for
association rules of the form:
[ if X then Y ]:
confidence =
#(X and Y in DB) / #(X in DB)
support =
#(X and Y in DB) / #(all in DB)
DM – deviation & outlier
analysis
Description:
• Finding data with
unusual deviations
(=errors, or data of
particular interest)
Techniques:
• Clustering, other
mining methods
• Outlier analysis
DM – trend detection
Description:
• Finding lines, curves,
summarizing the
database (often as a
function over time)
Techniques:
• Regression
• Sequential pattern
extraction
DM – generalization and
characterization
Description:
• Obtaining compact
descriptions of the
data
Techniques:
• Summary rules
• Attribute-oriented
induction
higher level concept
concept hierarchy
low level concept
Visualization and
knowledge discovery
• KDD is difficult to automate  steered
by human intelligence
• Visualization helps to understand the
data and which data mining techniques
to try
KD + geography
• Special case of KDD
• Other special cases
– marketing
– biology
– astronomy
• Main features: location, distance, dimensionality (with dependent dimensions)
KD + geography
(attr1, attr2, attr3, attr4); attr’s are numbers
and (relatively) independent: statistics
(attr1, attr2, attr3, attr4); attr’s can also be on
other measurement scales: KDD
(attr1, attr2, attr3, attr4); attr’s are often
dependent and can be shapes: KD + geography
Often: (lat., long., attr1, attr2, …)
or: (shape description, attr1, attr2, …)
KD + geography
• Study of scalable versions of DM tasks
(in lat. and long.)
• Certain dimensions can be non-metric
(travel time need not be symmetric)
• DM in data that is not in the form of
tuples: sets of thematic map layers
Geographic data mining
• Spatial segmentation (clustering,
classification)
• Spatial dependency (spatial association
rules)
• Spatial trend detection
• Geographic characterization and
generalization
GDM – spatial
association rules
• Example: If a location is within 500 m
from water and the average winter
temperature is at least –2 degrees
then there are frogs around
distance
relationship
GDM – spatial trend
detection
• Patterns of change with respect to
neighborhood of some object
• Example: (North America) Further from
Pacific ocean  fewer earthquakes
GDM - applications
•
•
•
•
Map interpretation
Remote sensing interpretation
Environmental mapping (soil type, etc.)
Extracting spatio-temporal patterns for
cyclones, crimes
• Spatial interaction (movement/flow of
people, capital, goods)
Conclusions
• GDM & GKD is an extension of (tool for)
geographical analysis
• GDM is different from DM due to
–
–
–
–
–
Geographic spaces, not attribute space
Neighborhood is extremely important
Scale issues
Data is different
Applications (interesting patterns to mine
for) are different
This seminar on GDM
• First: chapters from the book
–
–
–
–
–
–
CH 1: GDM & KD: an overview
(today)
CH 2: Paradigms for spatial and spatio-temporal DM(11-9)
CH 3: Fundamentals of spatial DW for GKD
(15-9)
CH 7: Algorithms and applications of SDM (Ronny) (18-9)
CH 8: Spatial clustering in DM
(22-9)
CH 6: Modeling spatial dependencies
(25-9)
(not: 29-9 and 2-10)
– CH 9: Detecting outliers
(6-10)
– CH 10: Knowledge construction based on GVis and KDD
– CH 14: Mining mobile trajectories
This seminar
• All PowerPoint presentations on the Web
page of the course
• Survey paper or written exam; possible
topics for survey:
–
–
–
–
Hierarchical clustering
Clustering with obstacles
Proximity relationship mining
…
• Or: joint survey of (geometric)
algorithms for GDM
Each presentation
• The chapter contents
• Additional (spatial) examples
(from the Web links or self-constructed)
• Detect and present algorithmic problems
that appear  together: report on
algorithmic issues in GDM
• Present your chapter; don’t be afraid of
overlap with other chapters