Spatial Statistics - The University of Texas at Dallas
Descriptive Statistics for
Spatial Distributions
Review Standard Descriptive Statistics
Centrographic Statistics for Spatial Data
Mean Center, Centroid, Standard Distance Deviation, Standard Distance Ellipse
Density Kernel Estimation, Mapping
Briggs Henan University 2010
Spatial Analysis: successive levels of sophistication
1. Spatial data description: classic GIS capabilities
– Spatial queries & measurement
– Buffering, map layer overlay
2. Exploratory Spatial Data Analysis (ESDA):
– Searching for patterns and possible explanations
– GeoVisualization through data graphing and mapping
– Descriptive spatial statistics: Centrographic statistics
3. Spatial statistical analysis and hypothesis testing
– Are data “to be expected” or are they “unexpected” relative to some statistical model, usually of a random process
4. Spatial modeling or prediction
– Constructing models (of processes) to predict spatial outcomes (patterns)
Standard Statistical Analysis
Two parts:
1. Descriptive statistics
Concerned with obtaining summary measures to describe
a set of data
For example, the mean and the standard deviation
2. Inferential statistics
Concerned with making inferences from samples about a population
Similarly, we have Descriptive and
Inferential Spatial Statistics
Spatial Statistics
Descriptive Spatial Statistics: Centrographic Statistics (This time)
– Single, summary measures of a spatial distribution
– Spatial equivalents of mean, standard deviation, etc.
Inferential Spatial Statistics: Point Pattern Analysis (Next time)
– Analysis of point location only: no quantity or magnitude (no attribute variable)
– Quadrat Analysis
– Nearest Neighbor Analysis, Ripley’s K function
Spatial Autocorrelation (Weeks 5 and 6)
– One attribute variable with different magnitudes at each location
– The Weights Matrix
– Global Measures of Spatial Autocorrelation (Moran’s I, Geary’s C, Getis/Ord Global G)
– Local Measures of Spatial Autocorrelation (LISA and others)
Prediction with Correlation and Regression (Week 7)
– Two or more attribute variables
– Standard statistical models
– Spatial statistical models
Standard Statistical Analysis:
A Quick Review
1. Descriptive statistics
– Concerned with obtaining summary measures to
describe a set of data
– Calculate a few numbers to represent all the data
– we begin by looking at one variable (“univariate”)
• Later, we will look at two variables (bivariate)
Three types:
– Measures of Central Tendency
– Measures of Dispersion or Variability
– Frequency distributions
I hope you are already familiar with these. I will quickly review the main ideas.
Standard Descriptive Statistics
Central Tendency
• Central Tendency: single summary measure for one variable:
1. mean (average)
2. median (middle value)
--50% larger and 50% smaller
--rank order data and select middle number
3. mode (most frequently occurring)
Formula for mean:
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$
These may be obtained in ArcGIS by:
--opening a table, right clicking on column heading, and selecting Statistics
--going to ArcToolbox>Analysis>Statistics>Summary Statistics
ADMIN_NAME      Illiteracy-Prcnt   Rank order
Beijing          3.11               1
Liaoning         3.48               2
Tianjin          3.52               3
Taiwan           3.9                4
Shanghai         3.97               5
Guangdong        4.02               6
Heilongjiang     4.16               7
Shanxi           4.42               8
Jilin            4.44               9
Xinjiang         4.64              10
Hebei            4.83              11
Guangxi          5.61              12
Hunan            5.87              13
Jiangxi          6.49              14
Hong Kong        6.5               15
Henan            7.36              16
Hubei            7.69              17
Chongqing        7.8               18
Shandong         7.96              19
Jiangsu          8.05              20
Nei Mongol       8.14              21
Shaanxi          8.19              22
Hainan           8.65              23
Macao            8.7               24
Zhejiang         9.36              25
Ningxia         10.09              26
Sichuan         10.24              27
Fujian          10.38              28
Yunnan          13.29              29
Anhui           14.49              30
Guizhou         14.58              31
Qinghai         16.68              32
Gansu           17.77              33
Xizang          37.77              34
Sum            296.15

Calculation of mean and median
Mean: 296.15 / 34 = 8.71
Median: (7.69 + 7.8)/2 = 7.745, or about 7.75 (there are 2 “middle values”, ranks 17 and 18)
Note: data for Taiwan is included
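The mean and median above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the list simply repeats the 34 illiteracy percentages from the table, and the variable names are illustrative.

```python
from statistics import mean, median

# Illiteracy percentages for the 34 units in the table (Taiwan included),
# already in rank order.
illiteracy = [
    3.11, 3.48, 3.52, 3.90, 3.97, 4.02, 4.16, 4.42, 4.44, 4.64, 4.83, 5.61,
    5.87, 6.49, 6.50, 7.36, 7.69, 7.80, 7.96, 8.05, 8.14, 8.19, 8.65, 8.70,
    9.36, 10.09, 10.24, 10.38, 13.29, 14.49, 14.58, 16.68, 17.77, 37.77,
]

n = len(illiteracy)
total = sum(illiteracy)
mean_value = mean(illiteracy)      # sum of values divided by n
median_value = median(illiteracy)  # average of the two middle values when n is even

print(f"n = {n}, sum = {total:.2f}")
print(f"mean = {mean_value:.2f}, median = {median_value:.3f}")
```

Because n is even, the median is the average of the 17th and 18th ranked values, which is why the slide shows two "middle values".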
Standard Descriptive Statistics
Variability or Dispersion
• Dispersion: measures of spread or variability
– Variance
• average squared distance of observations from mean
– Standard Deviation (square root of variance)
• “average” distance of observations from the mean
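Formulae for variance (n = N, the number of observations)
Definition Formula:
$$\sigma^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{N}$$
Computation Formula:
$$\sigma^2 = \frac{\sum_{i=1}^{n} X_i^2 - \left[\left(\sum_{i=1}^{n} X_i\right)^2 / N\right]}{N}$$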
These may be obtained in ArcGIS by:
--opening a table, right clicking on column heading, and selecting Statistics
--going to ArcToolbox>Analysis>Statistics>Summary Statistics
ADMIN_NAME      Illiteracy-Prcnt   (X - Xmean)   (X - Xmean) squared
Anhui           14.49               5.780         33.40500009
Beijing          3.11              -5.600         31.3632942
Fujian          10.38               1.670          2.787917734
Gansu           17.77               9.060         82.07827067
Guangdong        4.02              -4.690         21.99885891
Guangxi          5.61              -3.100          9.611823616
Guizhou         14.58               5.870         34.45344715
Hainan           8.65              -0.060          0.003635381
Hebei            4.83              -3.880         15.05668244
Heilongjiang     4.16              -4.550         20.70517656
Henan            7.36              -1.350          1.823294204
Hubei            7.69              -1.020          1.041000087
Hunan            5.87              -2.840          8.067270675
Nei Mongol       8.14              -0.570          0.325235381
Jiangsu          8.05              -0.660          0.435988322
Jiangxi          6.49              -2.220          4.929705969
Jilin            4.44              -4.270         18.23541185
Liaoning         3.48              -5.230         27.35597656
Ningxia         10.09               1.380          1.903588322
Qinghai         16.68               7.970         63.51621185
Shaanxi          8.19              -0.520          0.270705969
Shandong         7.96              -0.750          0.562941263
Shanghai         3.97              -4.740         22.47038832
Shanxi           4.42              -4.290         18.40662362
Sichuan         10.24               1.530          2.340000087
Taiwan           3.9               -4.810         23.1389295
Tianjin          3.52              -5.190         26.93915303
Xizang          37.77              29.060        844.466506
Xinjiang         4.64              -4.070         16.5672942
Yunnan          13.29               4.580         20.97370597
Zhejiang         9.36               0.650          0.422117734
Chongqing        7.8               -0.910          0.828635381
Hong Kong        6.5               -2.210          4.885400087
Macao            8.7               -0.010          0.000105969
Sum            296.15               0.000       1361.370297
Mean             8.710294118
Variance        40.04030285
StanDev          6.3277
Calculation of Variance and Standard Deviation
Variance from Definition Formula: 1361.370 / 34 = 40.04
Variance from Computation Formula: [3940.924 - (296.15 * 296.15)/34] / 34 = 40.04
Standard Deviation = $\sqrt{40.04}$ = 6.33
Note: data for Taiwan is included
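As a check on the two variance formulae, the sketch below computes the variance both ways from the same 34 illiteracy values and then takes the square root. It uses only the standard library; the variable names are illustrative, and the divisor is N (population variance), as on the slide.

```python
import math

# Illiteracy percentages for the 34 units (Taiwan included).
illiteracy = [
    3.11, 3.48, 3.52, 3.90, 3.97, 4.02, 4.16, 4.42, 4.44, 4.64, 4.83, 5.61,
    5.87, 6.49, 6.50, 7.36, 7.69, 7.80, 7.96, 8.05, 8.14, 8.19, 8.65, 8.70,
    9.36, 10.09, 10.24, 10.38, 13.29, 14.49, 14.58, 16.68, 17.77, 37.77,
]

n = len(illiteracy)
sum_x = sum(illiteracy)
mean_value = sum_x / n

# Definition formula: average squared deviation from the mean.
var_definition = sum((x - mean_value) ** 2 for x in illiteracy) / n

# Computation formula: [sum(X^2) - (sum(X))^2 / N] / N.
sum_x2 = sum(x * x for x in illiteracy)
var_computation = (sum_x2 - sum_x ** 2 / n) / n

std_dev = math.sqrt(var_definition)

print(f"variance (definition)  = {var_definition:.2f}")
print(f"variance (computation) = {var_computation:.2f}")
print(f"standard deviation     = {std_dev:.2f}")
```

The computation formula needs only the running sums of X and X squared, which is why it was convenient for hand calculation; both forms give the same value apart from floating-point rounding.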
Classic Descriptive Statistics: Univariate
Frequency distributions
A count of the frequency with which values occur on a variable
[Figure: bar chart of US population by age group (under 15, 15 to 29, 30 to 44, 45 to 59, 60 to 74, 75 and older); vertical axis from 0 to 70000]
US population, by age group: 50 million people age 45-59 (data for 2000)
Source: US Bureau of the Census: Statistical Abstract of the US, http://www.census.gov/compendia/statab/
Often represented by the area under a frequency curve
[Figure: the same age-group chart with a frequency curve drawn over the bars. The area under the curve represents 100% of the data.]
In ArcGIS, you may obtain frequency counts
on a categorical variable via:
--ArcToolbox>Analysis>Statistics>Frequency
Frequency Distributions for China Province Data
Symmetric Distribution (percent urban)
– Height of bar shows frequency
– There are 16 provinces with percent urban between 38.1% and 50.8% (mode)
– Mode = (38.1+50.8)/2 = 44.5
– Mean = 48.97
– Median = 44.0
– Symmetric distribution: mean = median = mode
Skewed Distribution (right skew) (illiteracy)
– “Tail” extends to the right; the mean is “pulled” to the right
– Height of bar shows frequency
– There are 17 provinces with illiteracy between 5.4% and 10.7% (mode)
– Mode = (5.4+10.7)/2 = 8.05
– Mean = 8.7
– Median = (7.69 + 7.8)/2 = 7.75
– Right-skewed distribution: mean > median
Frequency Distributions for China Province Data: Variability
Standard deviation: a measure of “the average” distance of each observation from the mean
Symmetric Distribution (percent urban)
– Standard deviation = 14.8
Skewed Distribution (right skew) (illiteracy)
– “Tail” extends to the right
– Standard deviation = 6.33
– On average, illiteracy values are closer to the mean. There is less “spread” in this data
Caution: these values are incorrect!
• Why?
• It is incorrect to calculate a simple (unweighted) mean of percentages
– Each percentage has a different base population
• Should calculate weighted mean
$$\bar{X} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
where $w_i$ = population of each province
• Very common error in GIS because we use aggregated data frequently
Correct Values!
• Unweighted mean = 8.7
• Weighted mean = 7.75
• Weighted mean is smaller. Why?
• The largest provinces have lower illiteracy; the highest rates are in small provinces

Largest provinces
ADMIN_NAME       Illiteracy-Prcnt   Pop2008
Guangdong         4.02              95,440,000
Henan             7.36              94,290,000
Shandong          7.96              94,172,300

Small provinces
ADMIN_NAME       Illiteracy-Prcnt   Pop2008
Ningxia          10.09               6,176,900
Qinghai          16.68               5,543,000
Xizang (Tibet)   37.77               2,870,000
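The effect of weighting can be seen with just the six provinces listed on this slide. This is a minimal sketch: because it uses only these six rows rather than all 34 units, its results differ from the 8.7 and 7.75 quoted for the full data set, and the variable names are illustrative.

```python
# (province, illiteracy percent, 2008 population) for the six provinces on the slide.
rows = [
    ("Guangdong", 4.02, 95_440_000),
    ("Henan", 7.36, 94_290_000),
    ("Shandong", 7.96, 94_172_300),
    ("Ningxia", 10.09, 6_176_900),
    ("Qinghai", 16.68, 5_543_000),
    ("Xizang (Tibet)", 37.77, 2_870_000),
]

# Unweighted mean treats every province as equally important.
unweighted = sum(x for _, x, _ in rows) / len(rows)

# Weighted mean: sum(w_i * x_i) / sum(w_i), with population as the weight.
total_weight = sum(w for _, _, w in rows)
weighted = sum(x * w for _, x, w in rows) / total_weight

print(f"unweighted mean = {unweighted:.2f}")
print(f"weighted mean   = {weighted:.2f}")
```

The small, high-illiteracy provinces dominate the unweighted mean but contribute very little population, so the weighted mean is pulled toward the rates of the three large provinces.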
ADMIN_NAME      Illiteracy-Prcnt (x)   Pop2008 (w)        x*w
Anhui           14.49                     61,350,000        888,961,500
Beijing          3.11                     22,000,000         68,420,000
Fujian          10.38                     36,040,000        374,095,200
Gansu           17.77                     26,281,200        467,016,924
Guangdong        4.02                     95,440,000        383,668,800
Guangxi          5.61                     48,160,000        270,177,600
Guizhou         14.58                     37,927,300        552,980,034
Hainan           8.65                      8,540,000         73,871,000
Hebei            4.83                     69,888,200        337,560,006
Heilongjiang     4.16                     38,253,900        159,136,224
Henan            7.36                     94,290,000        693,974,400
Hubei            7.69                     57,110,000        439,175,900
Hunan            5.87                     63,800,000        374,506,000
Nei Mongol       8.14                     24,137,300        196,477,622
Jiangsu          8.05                     76,773,000        618,022,650
Jiangxi          6.49                     44,000,000        285,560,000
Jilin            4.44                     27,340,000        121,389,600
Liaoning         3.48                     43,147,000        150,151,560
Ningxia         10.09                      6,176,900         62,324,921
Qinghai         16.68                      5,543,000         92,457,240
Shaanxi          8.19                     37,620,000        308,107,800
Shandong         7.96                     94,172,300        749,611,508
Shanghai         3.97                     19,210,000         76,263,700
Shanxi           4.42                     34,106,100        150,748,962
Sichuan         10.24                     81,380,000        833,331,200
Taiwan           3.9                      23,140,000         90,246,000
Tianjin          3.52                     11,760,000         41,395,200
Xizang          37.77                      2,870,000        108,399,900
Xinjiang         4.64                     21,308,000         98,869,120
Yunnan          13.29                     45,430,000        603,764,700
Zhejiang         9.36                     51,200,000        479,232,000
Chongqing        7.8                      31,442,300        245,249,940
Hong Kong        6.5                       7,003,700         45,524,050
Macao            8.7                         542,400          4,718,880
Sum            296.15                  1,347,382,600     10,445,390,141

Calculation of weighted mean
Unweighted mean: 296.15 / 34 = 8.71
Weighted mean: 10,445,390,141 / 1,347,382,600 = 7.75
Note: we should also calculate a weighted standard deviation
Centrographic Statistics
Descriptive statistics for spatial distributions
Mean Center
Centroid
Standard Distance Deviation
Standard Distance Ellipse
Density Kernel Estimation
(Add Frequency Distributions and mapping—use GeoDA to produce)
Centrographic Statistics
Measures of Centrality
– Mean Center
– Centroid
– Weighted mean center
– Center of Minimum Distance
Measures of Dispersion
– Standard Distance
– Standard Deviational Ellipse
• Two dimensional (spatial) equivalents of standard
descriptive statistics for a single-variable (univariate).
• Used for point data
– May be used for polygons by first obtaining the centroid of
each polygon
• Best used to compare two distributions with each other
– 1990 with 2000
– males with females
(O&U Ch. 4 p. 77-81)
Mean Center
• Simply the mean of the X and the mean of the Y
coordinates for a set of points
• Sum of differences between the mean X and all
other Xs is zero (same for Y)
• Minimizes sum of squared distances between itself and all points:
$$\min \sum_{i=1}^{n} d_{iC}^2$$
Distant points have large effect:
Values for Xinjiang will have larger effect
Provides a single point summary measure
for the location of a set of points
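The minimizing property can be illustrated numerically. The sketch below uses the five example points that appear in the worked example later in this deck; the function names and the 0.5-unit test shifts are illustrative choices, not part of the lecture.

```python
# Five example points (X, Y) from the worked example later in the deck.
points = [(2, 3), (4, 7), (7, 7), (7, 3), (6, 2)]

def mean_center(pts):
    """Mean of the X coordinates and mean of the Y coordinates."""
    n = len(pts)
    return (sum(x for x, _ in pts) / n, sum(y for _, y in pts) / n)

def sum_sq_dist(center, pts):
    """Sum of squared distances from `center` to every point."""
    cx, cy = center
    return sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in pts)

mc = mean_center(points)
at_center = sum_sq_dist(mc, points)

# Move the candidate center half a unit in each direction and recompute.
shifts = [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.5), (0.0, -0.5)]
nearby = [sum_sq_dist((mc[0] + dx, mc[1] + dy), points) for dx, dy in shifts]

print("mean center:", mc)
print("sum of squared distances at mean center:", round(at_center, 2))
print("sum of squared distances at shifted centers:", [round(v, 2) for v in nearby])
```

Every shifted location gives a larger sum of squared distances than the mean center, which is the sense in which the mean center is the "least squares" location.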
Centroid
• The equivalent for polygons of the mean center for a point
distribution
• The center of gravity or balancing point of a polygon
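• If the polygon is composed of straight line segments between nodes, the centroid is often approximated by the “average X, average Y” of the nodes (there is an example later); the exact center of gravity weights by area, so the two can differ when nodes are unevenly spaced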
• Calculation sometimes approximated as center of bounding
box
– Not good
• By calculating the centroids for a set of polygons, we can apply Centrographic Statistics to polygons
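For comparison, the sketch below computes both the simple average of the nodes and the area-weighted center of gravity of a polygon, using the standard "shoelace" formulas. The five nodes are the example polygon used later in this deck. This is an illustration of one common technique, not necessarily the method any particular GIS package uses, and it assumes a simple (non-self-intersecting) polygon.

```python
def polygon_area_centroid(vertices):
    """Return (area, cx, cy) for a simple polygon given as a list of (x, y) nodes."""
    twice_area = 0.0
    cx = cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]  # wrap around to close the polygon
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    # Dividing by the signed area keeps the result valid for either node order.
    cx /= 3.0 * twice_area
    cy /= 3.0 * twice_area
    return abs(twice_area) / 2.0, cx, cy

nodes = [(2, 3), (4, 7), (7, 7), (7, 3), (6, 2)]

# Simple average of the node coordinates.
vertex_mean = (
    sum(x for x, _ in nodes) / len(nodes),
    sum(y for _, y in nodes) / len(nodes),
)

area, cx, cy = polygon_area_centroid(nodes)

print("average of nodes:       ", vertex_mean)
print("area-weighted centroid: ", (round(cx, 3), round(cy, 3)))
print("polygon area:           ", area)
```

The two answers are close but not identical for this polygon, because the nodes are not evenly distributed around its boundary.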
Centroids for Provinces of China
Warning:
Centroid may not be inside its polygon
• For Gansu Province, China, centroid is
within neighboring province of Qinghai
• Problem arises with crescent-shaped polygons
Weighted Mean Center
• Produced by weighting each X and Y
coordinate by another variable (Wi)
• Centroids derived from polygons can be
weighted by any characteristic of the polygon
– For example, the population of a province
$$\bar{X} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}, \qquad \bar{Y} = \frac{\sum_{i=1}^{n} w_i y_i}{\sum_{i=1}^{n} w_i}$$
Calculating the centroid of a polygon or the mean center of a set of points.
[Figure: five points plotted on axes running from 0 to 10: (2,3), (4,7), (7,7), (7,3), (6,2)]
(same example data as for area of polygon)

ID            X      Y
1             2      3
2             4      7
3             7      7
4             7      3
5             6      2
sum          26     22
Centroid/MC   5.2    4.4

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}, \qquad \bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n}$$
Calculating the weighted mean center. Note how it is pulled toward the high weight point.
[Figure: the same five points plotted on axes running from 0 to 10: (2,3), (4,7), (7,7), (7,3), (6,2)]

i       X     Y     weight    wX        wY
1       2     3     3,000      6,000     9,000
2       4     7       500      2,000     3,500
3       7     7       400      2,800     2,800
4       7     3       100        700       300
5       6     2       300      1,800       600
sum    26    22     4,300     13,300    16,200
w MC                            3.09      3.77

$$\bar{X} = \frac{\sum_{i=1}^{n} w_i X_i}{\sum_{i=1}^{n} w_i}, \qquad \bar{Y} = \frac{\sum_{i=1}^{n} w_i Y_i}{\sum_{i=1}^{n} w_i}$$
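The weighted mean center calculation above can be reproduced directly. This is a minimal sketch using the five example points and weights from the table; the variable names are illustrative.

```python
# (X, Y, weight) for the five example points in the table.
points = [
    (2, 3, 3000),
    (4, 7, 500),
    (7, 7, 400),
    (7, 3, 100),
    (6, 2, 300),
]

total_w = sum(w for _, _, w in points)
sum_wx = sum(w * x for x, _, w in points)
sum_wy = sum(w * y for _, y, w in points)

# Weighted mean center: each coordinate is weighted by w before averaging.
wmc_x = sum_wx / total_w
wmc_y = sum_wy / total_w

print(f"sum of weights = {total_w}, sum wX = {sum_wx}, sum wY = {sum_wy}")
print(f"weighted mean center = ({wmc_x:.2f}, {wmc_y:.2f})")
```

Point 1 at (2, 3) carries about 70% of the total weight, so the weighted mean center moves from (5.2, 4.4) toward it.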
Center of Minimum Distance or Median Center
• Also called point of minimum aggregate travel
• That point (MD) which minimizes the sum of distances between itself and all other points (i):
$$\min \sum_{i=1}^{n} d_{iMD}$$
• No direct solution. Can only be derived by approximation
• Not a determinate solution. Multiple points may meet this criterion (see next bullet).
• Same as Median center:
– Intersection of two orthogonal lines (at right angles to each other), such that each line has half of the points to its left and half to its right
– Because the orientation of the axes for the lines is arbitrary, multiple points may meet this criterion.
Source: Neft, 1966
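One way to approximate the point of minimum aggregate travel is an iterative re-weighting scheme often called Weiszfeld's algorithm. The lecture does not name a specific method, so this is a sketch under that assumption; the function names, tolerance, and starting point (the mean center) are illustrative choices. It uses the five example points from earlier in this deck.

```python
import math

def total_distance(center, pts):
    """Sum of straight-line distances from `center` to every point."""
    cx, cy = center
    return sum(math.hypot(x - cx, y - cy) for x, y in pts)

def center_of_minimum_distance(pts, tol=1e-9, max_iter=1000):
    """Approximate the point minimizing total distance (Weiszfeld-style iteration)."""
    # Start from the mean center.
    cx = sum(x for x, _ in pts) / len(pts)
    cy = sum(y for _, y in pts) / len(pts)
    for _ in range(max_iter):
        num_x = num_y = denom = 0.0
        for x, y in pts:
            d = math.hypot(x - cx, y - cy)
            if d < 1e-12:
                # Simplification: skip a point that coincides with the estimate
                # to avoid dividing by zero.
                continue
            num_x += x / d
            num_y += y / d
            denom += 1.0 / d
        if denom == 0.0:
            break
        new_cx, new_cy = num_x / denom, num_y / denom
        shift = math.hypot(new_cx - cx, new_cy - cy)
        cx, cy = new_cx, new_cy
        if shift < tol:
            break
    return cx, cy

points = [(2, 3), (4, 7), (7, 7), (7, 3), (6, 2)]
mean_c = (sum(x for x, _ in points) / 5, sum(y for _, y in points) / 5)
md = center_of_minimum_distance(points)

print("mean center:", mean_c, "total distance:", round(total_distance(mean_c, points), 3))
print("approx. center of minimum distance:", md, "total distance:", round(total_distance(md, points), 3))
```

Each step replaces the estimate with an average of the points weighted by the inverse of their current distance, so distant points pull less strongly than they do on the mean center.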
Median and Mean
Centers for US Population
Median Center:
Intersection of a north/south and an
east/west line drawn so half of
population lives above and half
below the e/w line, and half lives to
the left and half to the right of the n/s
line
Mean Center:
Balancing point of a weightless map,
if equal weights placed on it at the
residence of every person on census
day.
Source: US Statistical Abstract 2003
Standard Distance Deviation
• Represents the standard deviation of the
distance of each point from the mean center
• Is the two dimensional equivalent of
standard deviation for a single variable
• Given by:
$$SDD = \sqrt{\frac{\sum_{i=1}^{n} (X_i - X_c)^2 + \sum_{i=1}^{n} (Y_i - Y_c)^2}{N}}$$
which by Pythagoras reduces to:
$$SDD = \sqrt{\frac{\sum_{i=1}^{n} d_{iC}^2}{N}}$$
Essentially the average distance of points from the center.
Formula for the standard deviation of a single variable, for comparison:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{N}}$$
Or, with weights:
$$SDD_w = \sqrt{\frac{\sum_{i=1}^{n} w_i (X_i - X_c)^2 + \sum_{i=1}^{n} w_i (Y_i - Y_c)^2}{\sum_{i=1}^{n} w_i}}$$
Provides a single unit measure of the spread or dispersion of a distribution.
We can also calculate a weighted standard distance analogous to the weighted mean center.
Standard Distance Deviation Example
[Figure: the five points (2,3), (4,7), (7,7), (7,3), (6,2) plotted on axes running from 0 to 10, with a circle with radius = SDD = 2.9]

i           X      Y      (X - Xc)^2   (Y - Yc)^2
1           2      3      10.2          2.0
2           4      7       1.4          6.8
3           7      7       3.2          6.8
4           7      3       3.2          2.0
5           6      2       0.6          5.8
sum        26     22      18.8         23.2
Centroid    5.2    4.4

sum of sums   42.0
divide N       8.4
sq rt          2.90

$$sdd = \sqrt{\frac{\sum_{i=1}^{n} (X_i - X_c)^2 + \sum_{i=1}^{n} (Y_i - Y_c)^2}{N}}$$
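The worked example can be reproduced in a few lines. This is a minimal sketch using the same five points; the variable names are illustrative, and the divisor is N as in the formula above.

```python
import math

# The five example points (X, Y).
points = [(2, 3), (4, 7), (7, 7), (7, 3), (6, 2)]
n = len(points)

# Mean center (Xc, Yc).
xc = sum(x for x, _ in points) / n
yc = sum(y for _, y in points) / n

# Sums of squared deviations in each dimension.
sum_sq_x = sum((x - xc) ** 2 for x, _ in points)
sum_sq_y = sum((y - yc) ** 2 for _, y in points)

# Standard distance deviation: square root of the mean squared distance.
sdd = math.sqrt((sum_sq_x + sum_sq_y) / n)

print(f"mean center = ({xc:.1f}, {yc:.1f})")
print(f"sum (X - Xc)^2 = {sum_sq_x:.1f}, sum (Y - Yc)^2 = {sum_sq_y:.1f}")
print(f"SDD = {sdd:.2f}")
```

Strictly, this is the root of the mean squared distance from the center rather than the plain average distance, which is why the slide describes it as "essentially" the average distance.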
Standard Deviational Ellipse: concept
• Standard distance deviation is a good single measure
of the dispersion of the points around the mean center,
but it does not capture any directional bias
– doesn’t capture the shape of the distribution.
• The standard deviation ellipse gives dispersion in two
dimensions
• Defined by 3 parameters
– Angle of rotation
– Dispersion (spread) along major axis
– Dispersion (spread) along minor axis
The major axis defines the
direction of maximum spread
of the distribution
The minor axis is perpendicular to it
and defines the minimum spread
Standard Deviational Ellipse: calculation
• Formulae for calculation may be found in references
such as
– Lee and Wong pp. 48-49
– Levine, Chapter 4, pp.125-128
• Basic concept is to:
– Find the axis going through maximum dispersion (thus derive the angle of rotation)
– Calculate the standard deviation of the points along this axis (thus derive the length (radius) of the major axis)
– Calculate the standard deviation of the points along the axis perpendicular to the major axis (thus derive the length (radius) of the minor axis)
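The three steps can be sketched with the closed-form solution for a 2 x 2 covariance matrix. This is one standard way to obtain the axes, stated here as an assumption rather than as the formula used in the cited references; software such as CrimeStat or ArcGIS may apply additional scaling or degrees-of-freedom corrections, so axis lengths from those packages can differ. The data are the five example points from earlier in this deck, and the variable names are illustrative.

```python
import math

points = [(2, 3), (4, 7), (7, 7), (7, 3), (6, 2)]
n = len(points)

# Mean center.
xc = sum(x for x, _ in points) / n
yc = sum(y for _, y in points) / n

# Variances and covariance of the coordinates (divisor N).
sxx = sum((x - xc) ** 2 for x, _ in points) / n
syy = sum((y - yc) ** 2 for _, y in points) / n
sxy = sum((x - xc) * (y - yc) for x, y in points) / n

# Step 1: angle of the axis of maximum dispersion,
# measured counter-clockwise from the X axis.
theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)

# Steps 2 and 3: variances along the rotated axes
# (the eigenvalues of the covariance matrix).
half_sum = (sxx + syy) / 2.0
half_diff = math.hypot((sxx - syy) / 2.0, sxy)
sd_major = math.sqrt(half_sum + half_diff)
sd_minor = math.sqrt(half_sum - half_diff)

print(f"rotation = {math.degrees(theta):.1f} degrees from the X axis")
print(f"std. dev. along major axis = {sd_major:.2f}")
print(f"std. dev. along minor axis = {sd_minor:.2f}")
```

In this unscaled form the squared major and minor standard deviations add up to the squared standard distance deviation, so the ellipse splits the same total dispersion into two directions.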
Mean Center & Standard Deviational Ellipse:
example
There appears to be no
major difference
between the location of
the software and the
telecommunications
industry in North
Texas.
Implementation in ArcGIS
In ArcToolbox
Median Center for a set of points
Standard deviation ellipse
Centroid for a set of points
Standard distance
• To calculate centroid for a set of polygons, with ArcGIS:
– ArcToolbox>Data Management Tools>Features>Feature to Point (requires ArcInfo)
• To calculate using GeoDA:
– Tools>Shape>Polygons to Centroids
Density Kernel Estimation
• commonly used to “visually enhance” a point pattern
• Is an example of “exploratory spatial data analysis”
(ESDA)
[Figure: two kernel density maps of the same point pattern, one with Kernel=10,000 and one with Kernel=5,000; each legend runs from low to high]
• SIMPLE Kernel option (see example above)
– A “neighborhood” or kernel is defined around each grid cell consisting of all grid cells with centers within the specified kernel (search) radius
– The number of points that fall within that neighborhood is totaled
– The point total is divided by the area of the neighborhood to give the grid cell’s value
• Density KERNEL option
– A smoothly curved surface is fitted over each point
– The surface value is highest at the location of the point, and diminishes with increasing distance from the point, reaching zero at the kernel distance from the point.
– Volume under the surface equals 1 (or the population value if a population variable is used)
– Uses quadratic kernel function described in Silverman (1986, p. 76, equation 4.5).
– The density at each output grid cell is calculated by adding the values of all the kernel surfaces where they overlay the grid cell center.
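The two options can be compared for a single grid cell with the sketch below. It rests on two assumptions: the simple option is approximated by counting points inside a circle of the search radius and dividing by that circle's area, and the kernel is taken to have the form K(u) = (3/π)(1 − u²)² for u < 1, scaled so that the volume under each surface is 1. The function names, test cell, and radius are hypothetical, and the points are the five example points from earlier in this deck.

```python
import math

def simple_density(cell, pts, radius):
    """Points within `radius` of the cell center, divided by the circle's area."""
    cx, cy = cell
    count = sum(1 for x, y in pts if math.hypot(x - cx, y - cy) <= radius)
    return count / (math.pi * radius ** 2)

def kernel_density(cell, pts, radius):
    """Sum of the kernel surfaces (each with unit volume) at the cell center."""
    cx, cy = cell
    total = 0.0
    for x, y in pts:
        # Squared distance from the point to the cell center, in units of the radius.
        u2 = ((x - cx) ** 2 + (y - cy) ** 2) / radius ** 2
        if u2 < 1.0:
            # Highest at the point itself, falling to zero at the kernel radius.
            total += 3.0 / (math.pi * radius ** 2) * (1.0 - u2) ** 2
    return total

points = [(2, 3), (4, 7), (7, 7), (7, 3), (6, 2)]
cell = (5.0, 4.0)   # hypothetical grid cell center
radius = 3.0        # hypothetical search radius, in the units of the data

print("simple density:", simple_density(cell, points, radius))
print("kernel density:", kernel_density(cell, points, radius))
```

In the simple option every point inside the radius counts equally, while in the kernel option nearby points contribute more than points near the edge of the search radius, which is what produces the smoother surface.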
Implementation in ArcGIS
• If you specify a “population field”, the software calculates as if there are that number of points at that location.
• The search radius:
• the size of the neighborhood or
kernel which is successively
defined around every cell (simple
kernel) or each point (density
kernel)
• Output cell size:
• Size of each raster cell
• Search radius and output cell size are
based on measurement units of the
data (here it is feet)
• It is good to “round” them (e.g.
to 10,000 and 1,000)
What have we learned today?
• We have learned about descriptive spatial
statistics, often called Centrographic
Statistics
• Next time, we will learn about Inferential
Spatial Statistics
Project for you
• The China data on my web site has population data for
the provinces of China in 2008
• Obtain population counts for 2000, 1990 and/or any
other year
• Calculate the weighted mean center of China’s
population for each year
• Be sure to use the same set of geographic units each
time
– For example, if you do not have data for Taiwan or Hong
Kong for one year, omit these geographic units for all years
Texts
O’Sullivan, David and David Unwin, 2010. Geographic
Information Analysis. Hoboken, NJ: John Wiley, 2nd ed.
Other Useful Books:
Mitchell, Andy 2005. ESRI Guide to GIS Analysis Volume 2: Spatial
Measurement & Statistics. Redlands, CA: ESRI Press.
Allen, David W 2009. GIS Tutorial II: Spatial Analysis Workbook.
Redlands, CA: ESRI Press.
Wong, David W.S. and Jay Lee 2005. Statistical Analysis of Geographic
Information. Hoboken, NJ: John Wiley, 2nd ed.
Ned Levine and Associates, CrimeStat III Manual, Washington, D.C.: National Institute of Justice, 2004 with later updates.
http://www.icpsr.umich.edu/CrimeStat/
Density Kernel Estimation
Silverman, B.W. 1986. Density Estimation for Statistics and
Data Analysis. New York: Chapman and Hall.