CPSC 601.82 Lecture 8


Statistical Analysis in GIS
Dr. M. Gavrilova




Importance of correct data representation
Variance and covariance
Autocorrelation
Applications to pattern analysis and
geometric modeling
Four colors, three dimensions,
and two plots to visualize five
data points
http://www.math.yorku.ca/SCS/Gallery/
Steven Skiena, Stony Brook, NY
http://www.cs.sunysb.edu/skiena
http://www.math.yorku.ca/SCS/Gallery/
Results of a poll of
happiness from the World
Values Survey project of
people throughout the
world in relation to
economy, GNP per capita.
Many countries,
particularly those in Latin
America, had higher marks
for happiness than their
economic situation would
predict.
Conclusion is based on the
assumption that happiness
should be linearly related
to GNP.
An organized collection of computer
hardware, software, geographic data, and
personnel designed to efficiently capture,
store, update, manipulate, analyze, and
display all forms of geographically
referenced data.
Provides
◦ an efficient and generally reliable means of obtaining knowledge about spatial processes,
◦ a way of maximizing our knowledge of spatial processes with the minimum of error.








Spatial Data
location and attribute  Pi (x, y, z)
Spatial Stochastic Processes
statistics and inference
Spatial is special
spatial autocorrelation
spatial non-stationarity
proximity
The Space Shuttle Challenger exploded shortly after take-off in January
1986. Cause: failure of the O-ring seals used to isolate the fuel supply
from burning gases. Graph from the Report of the Presidential Commission
on the Space Shuttle Challenger Accident, 1986.
NASA staff had analysed the data on the relation between temperature
and number of O-ring failures (out of 6), but they had excluded
observations where no O-rings failed, believing that they were
uninformative.
These were the main observations showing no failures at warm temperatures
(65-80 degF).
Apart from the disastrous omission of the observations
with 0 failures, two changes to the original graph:
1. drawing a smoothed curve to fit the points
2. removing the background grid, which obscures the data
give a graph which shows excessive risks associated
with both high and low temperatures.


Reanalysis of the O-ring data involved fitting a
logistic regression model. This provides a
predicted extrapolation (black curve) of the
probability of failure to the low (31 degF)
temperature at the time of the launch and
confidence bands on that extrapolation (red
curves). See also Tappin, L. (1994). "Analyzing
data relating to the Challenger disaster".
Mathematics Teacher, 87, 423-426
There's not much data at low temperatures (the
confidence band is quite wide), but the predicted
probability of failure is uncomfortably high.
Would you take a ride on Challenger when the
weather is cold?
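The kind of model described above can be sketched in a few lines. This is only an illustration of fitting a logistic curve and extrapolating it to a cold temperature: the temperatures and outcomes below are made-up toy values, NOT the Challenger measurements, the function name `fit_logistic` is hypothetical, and the confidence bands from the reanalysis are not computed.

```python
import math

def fit_logistic(xs, ys, iters=25):
    """Fit P(failure) = 1 / (1 + exp(-(b0 + b1 * (x - mean)))) by Newton's method.

    xs: predictor values (e.g. temperature), ys: 0/1 outcomes.
    Returns a function that predicts the failure probability at a given x.
    """
    x_mean = sum(xs) / len(xs)          # centering keeps the 2x2 system well conditioned
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            c = x - x_mean
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * c)))
            w = p * (1.0 - p)
            g0 += y - p                 # gradient of the log-likelihood
            g1 += (y - p) * c
            h00 += w                    # entries of X^T W X
            h01 += w * c
            h11 += w * c * c
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return lambda x: 1.0 / (1.0 + math.exp(-(b0 + b1 * (x - x_mean))))

# Toy data (illustrative only): temperature in degF, 1 = at least one failure.
toy_temps = [50, 55, 55, 60, 60, 65, 65, 70, 70, 75, 75, 80]
toy_fails = [1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0]

predict = fit_logistic(toy_temps, toy_fails)
p_cold = predict(31)   # extrapolation far below the observed range
p_warm = predict(75)
print(f"P(failure) at 31 degF: {p_cold:.2f}, at 75 degF: {p_warm:.2f}")
```

Because 31 degF lies well outside the range of the toy observations, the prediction there is an extrapolation, which is the reason the reanalysis reports a wide confidence band.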
The French engineer, Charles Minard (1781-1870), illustrated
the disastrous result of Napoleon's failed Russian campaign of
1812. The graph shows the size of the army by the width of the
band across the map of the campaign on its outward and return
legs, with temperature on the retreat shown on the line graph
at the bottom. Many consider Minard's original the best
statistical graphic ever drawn.
• Samples and populations consist of individuals.
• Values of certain attributes are called
observations (e.g.: age, income).
• Attributes vary across individuals, and they are
called variables.
• Variables are described by distributions and their
parameters (e.g.: Normal, Poisson).
• A random variable X assumes its value according
to the outcome of a chance experiment (coin, dice).

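Variance is the sum of squared deviations
from the mean divided by the number of observations n
(population) or by n-1 (sample).
Sample Variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
Population Variance: $\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$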
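As a small worked example of the two divisors, the sketch below computes both variances for a made-up list of values. The function name `variance` and the sample values are illustrative assumptions.

```python
def variance(values, sample=True):
    """Sum of squared deviations from the mean, divided by n-1 (sample) or n (population)."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((v - mean) ** 2 for v in values)
    return squared_deviations / (n - 1 if sample else n)

ages = [2, 4, 4, 4, 5, 5, 7, 9]             # hypothetical observations, mean = 5
pop_var = variance(ages, sample=False)      # 32 / 8
sample_var = variance(ages, sample=True)    # 32 / 7
print(pop_var, sample_var)
```

The sample variance is always the larger of the two, and the difference shrinks as n grows.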
Spatial autocorrelation is a measure of the similarity of
objects within an area.
Jay Lee and Louis K. Marion, 2001

The formula to compute Moran’s index is the following:
$$M = \frac{n}{A} \cdot \frac{\sum_{i,j} w_{ij} z_i z_j}{\sum_i z_i^2}$$

where n is the number of individual points,

A – area of the bounding polygon, i.e. the total area of the
map including all points

z_i – value of the parameter (attribute) measured for point i

w_ij is computed according to the following rule, where min(d_ij) is
the smallest of all distances between all pairs of points:
$$w_{ij} = \frac{\min_{ij}(d_{ij})}{d_{ij}}$$
In this formula, the distance d_ij is computed using the
Euclidean, supremum or Manhattan metric.
Since d_ii is equal to 0, w_ii would be infinite, so cases
where i = j should be excluded. This leaves n^2 - n
ordered pairs of points.
◦ The sum over all i,j means that ALL ORDERED PAIRS of
points (i.e. the order in which the pair ij is considered
matters) should be included in the formula.
Sometimes, only pairs of sample points within a
specific distance from each other are considered.
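The index and weights described above can be sketched as follows. This follows the slide's definitions literally (raw attribute values z_i, normalization by n/A, weights min(d)/d_ij over all ordered pairs with i ≠ j). Standard textbook definitions of Moran's I usually use deviations from the mean and normalize by the sum of weights, so treat this as a sketch of the slide's variant only. The function names, the Euclidean default, and the example values are assumptions, and all points are assumed to be distinct.

```python
import math

def slide_weights(points):
    """w_ij = min(d) / d_ij for every ordered pair i != j (Euclidean distance)."""
    n = len(points)
    d = {(i, j): math.dist(points[i], points[j])
         for i in range(n) for j in range(n) if i != j}
    d_min = min(d.values())             # smallest distance over all pairs
    return {pair: d_min / dist for pair, dist in d.items()}

def moran_index(points, z, area):
    """M = (n / A) * sum_ij(w_ij * z_i * z_j) / sum_i(z_i^2), pairs with i == j excluded."""
    n = len(points)
    w = slide_weights(points)
    numerator = sum(w_ij * z[i] * z[j] for (i, j), w_ij in w.items())
    denominator = sum(v * v for v in z)
    return (n / area) * numerator / denominator

# Hypothetical example: four points on the corners of a unit square.
pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
attr = [-1.0, -1.0, 1.0, 1.0]
weights = slide_weights(pts)
m = moran_index(pts, attr, area=1.0)
print(len(weights), m)
```

To restrict the sum to pairs within a specific distance, the dictionary of weights can be filtered before the numerator is computed.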
Example: autocorrelation on a grid.
Sample points are combined in one cell. The size
and location of the cell define the
autocorrelation parameters.
Consider all pairs of GRID CELLS, where XC
and YC now denote the coordinates of the
center of each grid cell and the attribute z
for each grid cell is the sum of the combined
attributes of all points that belong to this
cell.
Result: insight on pattern analysis and
correlation can be obtained.
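The grid aggregation described above can be sketched as follows: each point is assigned to a square cell, attributes are summed per cell, and each occupied cell is represented by its center (XC, YC). The function name, the `origin` parameter, and the sample values are illustrative assumptions.

```python
import math
from collections import defaultdict

def aggregate_to_grid(points, values, cell_size, origin=(0.0, 0.0)):
    """Return (centers, z): cell-center coordinates and summed attribute per occupied cell."""
    sums = defaultdict(float)
    for (x, y), v in zip(points, values):
        col = math.floor((x - origin[0]) / cell_size)
        row = math.floor((y - origin[1]) / cell_size)
        sums[(col, row)] += v           # attribute of a cell = sum over its points
    centers, z = [], []
    for (col, row), total in sorted(sums.items()):
        xc = origin[0] + (col + 0.5) * cell_size
        yc = origin[1] + (row + 0.5) * cell_size
        centers.append((xc, yc))
        z.append(total)
    return centers, z

# Hypothetical sample points and attributes.
sample_points = [(0.2, 0.3), (0.8, 0.1), (1.5, 0.5), (2.2, 2.9)]
sample_values = [1, 2, 3, 4]
centers, z = aggregate_to_grid(sample_points, sample_values, cell_size=1.0)
print(centers, z)
```

Changing `cell_size` or `origin` changes which points share a cell, which is why the choice of grid affects the resulting autocorrelation estimate.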
Analysis of instances of patients undergoing cardiac
catheterization, and location of those instances,
i.e. city blocks.
Primary question: spatial variation of heart disease:
random or non-random pattern?
Secondary question: relationship between disease
occurrence and social and demographic factors
(Spatial Regression).
Analysis results are affected by grid size
• prone to subjective choices
• constrained by spatial resolution of data
Solving the problem by
• using a non-arbitrary grid(s)
• implementing a “guided” selection of the square unit
area or grid size
• Definition of a city-block grid based on the main
division in the city, i.e. using the squared grid
centered on the intersection between Center Street
and Center Avenue as the main axes of the
geometric plan thus created.
• Grid regularity decreases as distance increases
from its center.
• L_p norms provide flexibility to adjust the grid’s size
and shape accordingly.
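For reference, the L_p distance between two points can be computed as in the sketch below, where p = 1 gives the Manhattan metric, p = 2 the Euclidean metric, and p = infinity the supremum metric. The function name `lp_distance` is an assumption.

```python
import math

def lp_distance(a, b, p=2.0):
    """L_p (Minkowski) distance between points a and b, for p >= 1 or p = math.inf."""
    diffs = [abs(u - v) for u, v in zip(a, b)]
    if math.isinf(p):
        return max(diffs)               # supremum (Chebyshev) metric
    return sum(d ** p for d in diffs) ** (1.0 / p)

a, b = (0.0, 0.0), (3.0, 4.0)
for p in (1, 2, math.inf):
    print(p, lp_distance(a, b, p))
```

For the same pair of points the distance shrinks as p grows, so the set of points within a fixed distance of a center changes shape from a diamond (p = 1) through a circle (p = 2) to a square (p = infinity).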
Application of varying L_p norms
Varying spatial weights for spatial
autocorrelation
Autocorrelation analysis at varying scales
(CDA, community)
Data: 2001/1996 census
Spatial Correlation Estimate
Statistic = "moran" Sampling = "free"
Correlation = 0.1429
Variance = 0.001341
Std. Error = 0.03662
Normal statistic = 3.921
Normal p-value (2-sided) = 8.802e-5
Null Hypothesis: No spatial autocorrelation
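The reported figures can be cross-checked with a short calculation: the standard error is the square root of the variance, and the two-sided p-value follows from the normal statistic. The sketch assumes the usual standardization z = (I - E[I]) / SE; the expected value under the null hypothesis is not shown in the output, so only the ratio I / SE is computed here, and the small gap to the reported 3.921 is consistent with a slightly negative E[I].

```python
import math

def two_sided_normal_p(z):
    """Two-sided p-value of a standard normal statistic: 2 * (1 - Phi(|z|))."""
    return math.erfc(abs(z) / math.sqrt(2.0))

correlation = 0.1429      # Moran correlation from the output above
variance = 0.001341
normal_statistic = 3.921

std_error = math.sqrt(variance)
ratio = correlation / std_error             # ignores E[I], which the output does not show
p_value = two_sided_normal_p(normal_statistic)
print(f"std. error = {std_error:.5f}, I/SE = {ratio:.3f}, p = {p_value:.3e}")
```

A p-value this small leads to rejecting the null hypothesis of no spatial autocorrelation at any conventional significance level.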
Sensitivity of Spatial Autocorrelation to
L_p norm
spatial weight
Proposed method useful in determining
best distance
best spatial weight
In context of multivariate spatial regression
“best” = lowest variance


The Calgary Journal, Regional publication,
“Researchers link heart disease to urban
lifestyles” on SPARCS activity profile, Oct. 26 –
Nov. 8, 2005
High risk of heart attack: male, high
education, married
Variable          # cells*      Min.  Max.  Mean   St. dev.  Sum     Skew  Kurt.
Oil spill counts  44 (2,741)    0     3     0.02   0.162     53      9.85  113.6
Flight counts     2151 (2,741)  0     309   13.75  27.12     37,681  4.21  25.6
The mean and the standard deviation summarize the central
tendency and the statistical dispersion of the data;
large skewness (asymmetry) and kurtosis (from the Greek for
"bulging") values indicate highly skewed distributions or
lack of normality in the data.
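These summary statistics can be computed from the central moments, as in the sketch below. It uses population (divide-by-n) moments and reports excess kurtosis (0 for a normal distribution); whether the table above uses the same conventions is not stated, so this is an assumption, as are the function name and the toy counts.

```python
def moments_summary(values):
    """Mean, standard deviation, skewness and excess kurtosis from central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "mean": mean,
        "std": m2 ** 0.5,
        "skew": m3 / m2 ** 1.5,
        "excess_kurtosis": m4 / m2 ** 2 - 3.0,
    }

# Toy cell counts (illustrative only): mostly zeros with one large value.
toy_counts = [0] * 9 + [10]
summary = moments_summary(toy_counts)
print(summary)
```

Count data dominated by zeros, like the toy list here, produce strongly positive skewness and kurtosis, which is the pattern the table shows for both variables.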


Our exploratory analyses indicate that there
is a positive spatial autocorrelation within
datasets for all variables.
An initial overview of the statistical
distribution and normality of each of the
variables selected for this study indicated
absence of normality in the data.
Exploratory Spatial Analysis of Illegal Oil Discharges Detected off Canada’s Pacific Coast.
Norma Serra-Sogas, Patrick O’Hara, Rosaline Canessa, Stefania Bertazzon and Marina Gavrilova



Proper statistical analysis is important
Variance and autocorrelation are two
important vehicles for data analysis
Combining these measures with various
metrics, hierarchical structures, grids,
attributes and also data filtering/visualization
methods is a direction of current research.