Spatial Statistics Lecture

Download Report

Transcript Spatial Statistics Lecture

University at Albany School of Public Health
EPI 621, Geographic Information Systems and Public Health
Introduction to
Smoothing and Spatial
Regression
Glen Johnson, PhD
Lehman College / CUNY School of Public Health
[email protected]
Consider points distributed in space
“Pure” Point process:
Only coordinates locating
some “events”.
Set of points,
S ={s1, s2, … , sn}
_____________________
Examples include
• location of burglaries
• location of disease cases
• location of trees, etc.
Points represent locations of something
that is measured.
Values of a random variable, Z, are
observed for a set S of locations, such
that the set of measurements are
Z(s) ={Z(s1), Z(s2), … , Z(sn)}
___________________________
Examples include
• cases and controls (binary outcome)
identified by location of residence
• Population-based count
(integer outcome) tied to geographic
centroids
• PCBs measured in mg/kg
(continuous outcome) in soil cores
taken at specific point locations
Example of a Pure Point Process: Baltimore Crime Events
Question: How to interpolate a smoothed surface that
shows varying “intensity” of the points?
(source: http://www.people.fas.harvard.edu/~zhukov/spatial.html)
Kernel Density Estimation
From: Cromely and McLafferty. 2002. GIS and Public Health.
Kernel Density Estimation
Estimate “intensity” of events at regular grid points
as a function of nearby observed events. General
formula for any point x is:
æ x - xi ö
1
kç
÷
å
nh i=1 è h ø
n
where xi are “observed” points for i = 1, …, n locations
in the study area, k(.) is a kernel function that assigns
decreasing weight to observed points as they approach
the bandwidth h. Points that lie beyond the bandwidth,
h, are given zero weighting.
Results from Kernel Density Smoothing in R
Baltimore Crime Locations (Kernel Density)
Bandwidth = 0.1
Bandwidth = 0.15
160000
140000
120000
100000
80000
Bandwidth = 0.007
Bandwidth = 0.05
60000
40000
20000
0
Kernel Density Surface of Bike Share Locations in NYC
Source: http://spatialityblog.com/2011/09/29/spatial-analysis-of-nyc-bikeshare-maps/
Examples of Values Observed at Point Locations, Z(s) :
Question: How to interpolate a smoothed surface that
captures variation in Z(s)?
First, consider “deterministic”
approaches to spatial interpolation:
• Deterministic models do not
acknowledge uncertainty.
• Only real advantage is simplicity; good
for exploratory analysis
• Several options, all with limitations. We
will consider Inverse Distance Weighted
(IDW) because of its common usage.
Inverse Distance Weighted Surface
Interpolation
Define
search
parameters
Interpolate value at point s0 as
n
Z ( s0 )   i Z ( si )
i 1
for n neighboring observed values Z ( si ),
where the weight i 
p
d 0,
i
n
 d0, ip
i 1
for distance d .
Define power of
distance-decay
function
Illustration: Tampa Bay sediment total organic carbon
True “geostatistical” models assume the data,
Z(S) = {Z(s1), Z(s2), … , Z(sn)}, are a partial
realization of a random field.
Note that the set of locations
S are a subset of some
2-dimensional spatial domain
D, that is a subset of the
real plane.
General Protocol:
1. Characterize properties of spatial
autocorrelation through variogram
modeling;
2. Predict values for spatial locations where no
data exist, through Kriging.
A semivariogram is defined as
1
2
 (h)  E( Z ( s )  Z ( s  h ))
2
for distance h between the two locations, and is
estimated as
1
ˆ (h j ) 
2nh
nh
 (Z (s )
i 1
i
 Z ( si  h ))
2
for nh pairs separated by distance hj (called a “lag”).
After repeating for different lags, say j =1, … 10, the
semivariance can be plotted as a function of
distance.
Given any location si, all other locations are treated as
within distance h if they fall within a search window
defined by the direction, lag h, angular tolerance and
bandwidth.
bandwidth
Adapted from Waller and Gotway. Applied Spatial Statistics for Public Health. Wiley, 2004.
Example semivariogram cloud for pairwise differences (red
dots) , with the average semivariance for each lag (blue +),
and a fitted semivariogram model (solid blue line)
Characteristics of a semivariogram
Range = the distance within which positive spatial
autocorrelation exists
Nugget = spatial discontinuity + observation error
Sill = maximum semivariance
If the variogram form does not depend on direction,
the spatial process is isotropic.
If it does depend on direction, it is anisotropic.
Multiple semivariograms for
different
directions. Note
changing
parameter is the
range.
Surface map of
semivariance
shows values
more similar in
NW-SE direction
and more different
in SW-NE
Kriging then uses semivariogram model
results to define weights used for interpolating
values where no data exists.
The result is called the “Best Linear Unbiased
Predictor”. The basic form is
p
Z ( s0 )   i Z ( si )
i 1
Where the λi assign weights to neighboring
values according to semivariogram modeling
that defines a distance-decay relation within
the range, beyond which the weight goes to
zero.
Several variations of Kriging:
• Simple (assumes known mean)
• Ordinary (assumes constant mean, though
unknown) [our focus this week]
• Universal (non-stationary mean)
• Cokriging (prediction based on more than one
inter-related spatial processes)
• Indicator (probability mapping based on binary
variable) [you will see in the lab work]
• Block (areal prediction from point data)
• And other variations …
Example of two types of
Kriging for California O3:
1. Ordinary Kriging
(Detrended,
Anisotropic)
-continuous surface
2. Indicator Kriging
- probability isolines
What if point locations are centroids of polygons and
the value Z(si) represents aggregation within polygon i ?
With polygon data, we can still define
neighbors as some function of Euclidean
distance between polygon centroids, as we
do for point-level data,
but now we have other ways to define
neighbors and their weights …
Defining spatial “Neighborhoods”
Raster or
Lattice:
Rook
Queen
- 1st order
Queen
- 2nd order
i
Spatial Regression Modeling as a
method for both
• assessing the effects of covariates
and…
• smoothing a response variable