PP Geographic analysis

Download Report

Transcript PP Geographic analysis

Geographical analysis
Overlay, cluster analysis, autocorrelation, trends, models, network
analysis, spatial data mining
Geographical analysis
• Combination of different geographic data sets
or themes by overlay or statistics
• Discovery of patterns, dependencies
• Discovery of trends, changes (time)
• Development of models
• Interpolation, extrapolation, prediction
• Spatial decision support, planning
• Consequence analysis (What if?)
Example overlay
• Two subdivisions with labeled regions
soil
Soil
Soil
Soil
Soil
vegetation
type
type
type
type
1
2
3
4
Birch forest
Beech forest
Mixed forest
Birch forest
on soil type 2
Kinds of overlay
• Two subdivisions with the same boundaries
- nominal and nominal
Religion and voting per municipality
- nominal and ratio
Voting and income per municipality
- ratio and ratio
Average income and age of employees
• Two subdivisions with different boundaries
Soil type and vegetation
• Subdivision and elevation model
Vegetation and precipitation
Kinds of overlay, cont’d
• Subdivision and point set
quarters in city, occurrences of violence on the
street
• Two elevation models
elevation and precipitation
• Elevation model and point set
elevation and epicenters of earthquakes
• Two point sets
money machines, street robbery locations
• Network and subdivision, other network,
elevation model
Result of overlay
• New subdivision or map layer, e.g. for further
processing
• Table with combined data
• Count, surface area
Soil
Type
Type
Type
Type
….
1
2
3
4
Vegetation
Area
Beech
Birch
Mixed
Beech
….
30
15
8
2
ha
ha
ha
ha
#patches
2
2
1
1
Buffer and overlay
• Neighborhood analysis: data of a theme
within a given distance (buffer) of objects
of another theme
Sightings of nesting locations of the great blue heron (point set)
Rivers; buffer with width 500 m of a river
Overlay  Nesting locations great blue heron near river
Overlay: ways of combination
• Combination (join) of attributes
• One layer as selection for the other
– Vegetation types only for soil type 2
– Land use within 1 km of a river
Overlay in raster
• Pixel-wise operation, if the rasters have the
same coordinate (reference) system
Pixel-wise
AND
Forest
Population increase
above 2% per year
Both
Overlay in vector
• E.g. the plane sweep algorithm as given in
Computational Geometry (line segment
intersection), to get the overlay in a
topological structure
• Using R-trees as an indexing structure to find
intersections of boundaries
Combined (multi-way) overlays
• Site planning, new construction sites depending
on multiple criteria
• Another example (earth sciences):
Parametric land classification: partitioning of
the land based on chosen, classified themes
Elevation
Annual precipitation
Types of rock
Overlay: partitioning
based on the three themes
Analysis point set
• Points in an attribute space: statistics, e.g.
regression, principal component analysis,
dendrograms
(area, population, #crimes)
(12, 34.000,
(14, 45.000,
(15, 41.000,
(17, 63.000,
(17, 66.000,
……
……
#crimes
#population
34)
31)
14)
82)
79)
Analysis point set
• Points in geographical space without associated
value: clusters, patterns, regularity, spread
Actual average nearest
neighbor distance
versus expected Av. NN.
Dist. for this number of
points in the region
For example: volcanoes in a region; crimes in a city
Analysis point set
• Points in geographical space with value: up to
what distance are measured values “similar”
(or correlated)?
11 10
12
13
12
19
14
16
18
21
21
20
22
17
15
16
Analysis point set
• Temperature at location x and 5 km away from
x is expected to be nearly the same
• Elevation (in Switzerland) at location x and 5
km away from x is not expected to be related
(even over 1 km), but it is expected to be
nearly the same 100 meters away
• Other examples:
– depth to groundwater
– soil humidity
– nitrate concentration in the soil
Analysis point set
• Points in geographical space with value:
auto-correlation (~ up to what distance are
measured values “similar”, or correlated)
11 10
12
13
12
19
14
16
18
21
21
20
22
17
15
16
n points 
(n choose 2) pairs;
each pair has a
distance and a
difference in value
difference 2
Average
difference 2
 observed
expected
2
difference
distance
Classify distances and
determine average per class
distance
Observed variogram
Average
2
difference
 observed
expected
2
difference
Model variogram (linear)
σ
sill
2
nugget
distance
range
distance
Smaller distances 
more correlation, smaller variance
Importance auto-correlation
• Descriptive statistic of a data set: describes
the distance-dependency of auto-correlation
• Interpolation based on data further away than
the range is nonsense
11
10
12
16
20
13
14
??
21
16
19
18
21
12
22
15
17
range
Importance auto-correlation
• If the range of a geographic variable is small,
more sample point measurements are needed
to obtain a good representation of the
geographic variable through spatial
interpolation
 influences cost of an analysis or decision
procedure, and quality of the outcome of the
analysis
Analysis subdivision
• Nominal subdivision: auto-correlation
(~ clustering of equivalent classes)
• Ratio subdivision: auto-correlation
PvdA
CDA
VVD
Auto-correlation
No auto-correlation
Auto-correlation nominal subdivision
Join count statistic:
PvdA
CDA
VVD
• 22 neighbor relations (adjacencies)
among 12 provinces
• Pr(province A = VVD and province B
= VVD) = 4/12 * 3/11
• E(VVD adj. VVD) = 22 * 12/132 = 2
• Reality: 4 times
• E(CDA adj. PvdA) = 5.33; reality
once
4/12 * 4/11 * 2 * 22
Geographical models
• Properties of (geographical) models:
–
–
–
–
–
–
selective (simplification, more ideal)
approximative
analogous (resembles reality)
structured (usable, analyzable, transformable)
suggestive
re-usable (usable in related situations)
Geographical models
• Functions of models:
–
–
–
–
–
–
psychological (for understanding, visualization)
organizational (framework for definitions)
explanatory
constructive (beginning of theories, laws)
communicative (transfer scientific ideas)
predictive
Example: forest fire
• Is the Kröller-Müller museum well enough
protected against (forest)fire?
• Data: proximity fire dept., burning properties
of land cover, wind, origin of fire
• Model for: fire spread
Time neighbor pixel on fire: [1.41 *] b * ws * (1- sh) * (0.2 + cos )
b = burn factor
ws = wind speed
 = angle wind – direction pixel
sh = soil humidity
Forest fire
Wind, speed 3
Forest; burn factor 0.8
Heath; burn factor 0.6
Road; burn factor 0.2
Museum
Soil
humidity
Origin
< 3 minutes
< 6 minutes
< 9 minutes
> 9 minutes
Forest fire model
• Selective: only surface cover, humidity and
wind; no temperature, seasonal differences, …
• Approximative: surface cover in 4 classes; no
distinction in forest type, etc., pixel based so
direction discretized
• Structured: pixels, simple for definition relations
between pixels
• Re-usable: approach/model also applies to
other locations (and other spread processes)
Network analysis
• When distance or travel time on a network
(graph) is considered
• Dijkstra’s shortest path algorithm
• Reachability measure for a destination:
potential value
potential (i )   w c

j ij
j
wj = weight origin j
 = distance decay parameter
c ij = distance cost between
origin j and destination i
Example reachability
• Law Ambulance Transport: every location
must be reachable within 15 minutes (from
origin of ambulance)
Example reachability
• Physician’s practice:
- optimal practice size: 2350 (minimum: 800)
- minimize distance to practice
- improve current situation with as few changes
as possible
Current
situation: 16
practices,
30.000 people,
average 1875
per practice
Computed,
improved
situation: 13
practices
Example in table
Original
New
16
13
Number of practice locations
9
7
Number of practices < 800 size
2
0
3957
4624
Average travel distance (km)
0,9
1,2
Largest distance (km)
5,2
5,4
Number of practices
Number of people > 3 km
Analysis elevation model
• Landscape shape recognition:
- peaks and pits
- valleys and ridges
- convexity, concavity
• Water flow, erosion,
watershed regions,
landslides, avalanches
Spatial data mining
• Finding spatial patterns in large spatial data
sets
– within one spatial data set
– across two or more data sets
• With time: spatio-temporal data mining
Spatial data mining and
computation
• “Geographic data mining involves the
application of computational tools to reveal
interesting patterns in objects and events
distributed in geographic space and across time”
(Miller & Han, 2001)
• Large data sets 
attempt to carefully define interesting
patterns (to avoid finding non-interesting
patterns)  advanced algorithms needed
for efficiency
Clustering?
• Are the people clustered in this room?
 How do we define a cluster?
• In spatial data mining we have objects/
entities with a location given by coordinates
• Cluster definitions involve distance between
locations
Clustering - options
•
•
•
•
Determine
Determine
Determine
Determine
whether clustering occurs
the degree of clustering
the clusters
the largest cluster
• Determine the outliers
Co-location
• Are the men clustered?
• Are the women clustered?
• Is there a co-location of men and women?
 co-location pattern
Co-location
• Like before, we may be interested in
–
–
–
–
–
is there co-location?
the degree of co-location
the largest co-location
the co-locations themselves
the objects not involved in co-location
Spatio-temporal data
• Locations have a time stamp
• Interesting patterns involve space and time
• Example here: time-stamped point set
Trajectory data
• Entities with a trajectory (time-stamped motion
path)
• Interesting patterns involve subgroups
with similar heading, expected arrival,
joint motion, ...
• n entities = trajectories; n = 10 – 100,000
• t time steps; t = 10 – 100,000
 input size is nt
• m size subgroup (unknown); m = 10 – 100,000
Trajectory data
•
•
•
•
Migration patterns of animals
Trajectories of tornadoes
Tracking of (suspect) individuals for security
Lifelines of people for social behavior
Example pattern in trajectories
• What is the location visited by most entities?
location =
circular region of
specified radius
Example pattern in trajectories
• What is the location visited by most entities?
location =
circular region of
specified radius
4 entities
Example pattern in trajectories
• What is the location visited by most entities?
location =
circular region of
specified radius
3 entities
Example pattern in trajectories
• Compute buffer of each trajectory
Example pattern in trajectories
• Compute buffer of each trajectory
• Compute the arrangement
of the buffers and the cover
count of each cell
1
1
1
2
0
1
1
Example pattern in trajectories
• One trajectory has t time stamps; its buffer
can be computed in O(t log t) time
• All buffers can be computed in O(nt log t) time
• The arrangement can be computed in
O(nt log (nt) + k) time, where k = O((nt)2) is
the complexity of the arrangement
• Cell cover counts are determined in O(k) time
Example pattern in trajectories
• Total: O(nt log (nt) + k) time
• If the most visited location is visited by
m entities, this is O(nt log (nt) + ntm)
• Note: input size is nt ;
n entities, each with location at t moments
Patterns in entity data
Spatial data
Spatio-temporal data
• n points (locations)
• Distance is important
• n trajectories, each
has t time steps
• Distance is timedependent
– clustering pattern
• Presence of
attributes (e.g.
male/female):
– co-location patterns
– flock pattern
– meet pattern
• Heading and speed
are important and are
also time-dependent
Patterns in trajectories
• n trajectories, each with t time steps
 n polygonal lines with t vertices
Patterns in trajectories
• Flock and meet patterns: large enough subset
that has same “character” during a time interval
– close to each other
– same direction of motion
– ...
Flock: changing location
Meet: fixed location
• Determine the longest duration pattern
Patterns in trajectories
• Longest flock: given a radius r and subset
size m, determine the longest time interval for
which the m entities were within each other’s
proximity (circle radius r)
Time = 0 1 2 3 4 5 6 7 8
m=3
longest flock
in [ 1.9 , 6.4 ]
Patterns in trajectories
• Computing the longest flock is NP-hard
• This remains true for radius cr approximations
with c < 2
• A radius 2 approximation of the longest flock can
be computed in time O(n2 t log n)
... meaning: if the longest flock for radius r
has duration , then we surely find a flock of
duration   for radius 2r
Patterns in trajectories
fixed radius
m=3
flock
meet
fixed subset
Patterns in trajectories
• Go into 3D (space-time) for algorithms
time
4
3
2
1
0
flock
meet
Patterns in trajectories
Exact radius results
flock
meet
NP-hard
O(n2t2 (n2 log n + t))
Patterns in trajectories
Approximate radius results
flock
O(n2t log n)
factor 2
meet
O((n2t log n) / (m2))
factor 1+
Patterns in trajectories
• Flock and meet patterns require algorithms in 3dimensional space (space-time)
• Exact algorithms are inefficient  only suitable
for smaller data sets
• Approximation can reduce running time with an
order of magnitude
Summary
• There are many types of geographical analysis,
it is the main task of a GIS
• Overlay and buffer analysis are most important
• Statistics is also very important
• Spatial and spatio-temporal data mining gives
new types of analysis of geographic data