Lecture 8 – 10 Spatial Correspondence

Spatial Correspondence of
Areal Distributions
• Quadrat and nearest-neighbor analysis
deal with a single distribution of points
• Often, we want to compare the
distributions of two or more variables
• The coefficient of areal correspondence
and the chi-square statistic perform this
task
© Arthur J. Lembo, Jr.
Cornell University
Coefficient of Areal
Correspondence
• Simple measure of the extent to which
two distributions correspond to one
another
– Compare wheat farming to areas of minimal
rainfall
• Based on the approach of overlay
analysis
Overlay Analysis
• Two distributions of interest are mapped
at the same scale and the outline of one
is overlaid with the other
Coefficient of Areal
Correspondence
• CAC is the ratio between the area of the
region where the two distributions
overlap (AreaC) and the total area covered
by either distribution, where AreaA and
AreaB are the areas covered by only one
of the two distributions
$$\mathrm{CAC} = \frac{\mathrm{Area}_C}{\mathrm{Area}_A + \mathrm{Area}_B + \mathrm{Area}_C}$$
• Examples
– No overlap: 0 / (1 + 1 + 0) = 0
– Partial overlap: .4 / (.6 + .6 + .4) = .25
– Complete overlap: 1 / 1 = 1
Result of CAC
• Where there is no correspondence, CAC is
equal to 0
• Where there is total correspondence, CAC is
equal to 1
• CAC provides a simple measure of the extent
of spatial association between two distributions,
but it cannot provide any information about the
statistical significance of the relationship
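The ratio can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `cac` and its argument names are assumptions, with `area_a` and `area_b` taken as the areas covered by only one distribution and `area_c` as the overlap.

```python
def cac(area_a, area_b, area_c):
    """Coefficient of areal correspondence: overlap / total area covered."""
    total = area_a + area_b + area_c
    if total == 0:
        raise ValueError("the two distributions cover no area")
    return area_c / total


print(cac(1, 1, 0))        # no overlap
print(cac(0.6, 0.6, 0.4))  # partial overlap
print(cac(0, 0, 1))        # complete overlap
```

The three calls mirror the no-overlap, partial-overlap, and complete-overlap cases, so the results run from 0 through 1.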
Resemblance Matrix
• Proposed by Court (1970)
$$\frac{\mathrm{SumLikeAreas} - \mathrm{SumUnlikeAreas}}{\mathrm{TotalArea}}$$
• Advantages over CAC
– Limits are –1 to +1 with a perfect negative
correspondence given a value of –1
– Sampling distribution is roughly normal, so you can
test for statistical significance
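A rough Python sketch of this index follows. It rests on assumptions that the slide does not spell out: "like" area is read as area where the two distributions agree, "unlike" area as area where they disagree, and the total area is taken as the sum of the two. The function name `resemblance` is hypothetical.

```python
def resemblance(like_area, unlike_area):
    """(sum of like areas - sum of unlike areas) / total area."""
    total = like_area + unlike_area  # assumption: every unit is like or unlike
    if total == 0:
        raise ValueError("total area must be positive")
    return (like_area - unlike_area) / total


print(resemblance(10, 0))  # everything agrees
print(resemblance(0, 10))  # nothing agrees
print(resemblance(5, 5))   # agreement and disagreement balance
```

Under these assumptions the value reaches +1 when all area is like and -1 when all area is unlike, matching the stated limits.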
Chi-Square Statistic
• Measures the strength of association
between two distributions
• Class Example
– Relationship between wheat yield and
precipitation
– Two maps showing high and low yields and
high and low precipitation
[Figure: two maps, one showing areas of high precipitation and one showing areas of high yield]
[Figure: the high-precipitation and high-yield areas combined on one map]
Chi-Square
• By combining the distributions on one map, we can
better understand the relationship between the
two distributions
• In this example we are using a grid
– The finer the grid, the more precise the measurement
• Four possibilities exist
– Low rainfall, low yield
– Low rainfall, high yield
– High rainfall, low yield
– High rainfall, high yield
Chi-Square
• Record the total number of occurrences
into a table of observed frequencies
                   WHEAT
                High    Low
PRECIP.  High     8       2
         Low      5      13
Chi-Square
• Create a table of expected frequencies
using probability statistics (% High rain *
# of high yield cells)
– Row total * column total / table total
                        WHEAT
                High                 Low
PRECIP.  High   (10*13)/28 = 4.6     (10*15)/28 = 5.4
         Low    (13*18)/28 = 8.4     (18*15)/28 = 9.6

• Rounded to whole cells

                   WHEAT
                High    Low
PRECIP.  High     5       5
         Low      8      10
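The row total * column total / table total rule can be checked with a short Python sketch. The variable names are illustrative; the observed counts are the class example values, with rows taken as precipitation and columns as wheat yield.

```python
observed = [[8, 2],    # high precipitation: high yield, low yield
            [5, 13]]   # low precipitation:  high yield, low yield

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
table_total = sum(row_totals)

# Expected count for each cell: row total * column total / table total
expected = [[r * c / table_total for c in col_totals] for r in row_totals]

for row in expected:
    print([round(value, 2) for value in row])
```

Before rounding to whole cells, the four expected values are about 4.64, 5.36, 8.36, and 9.64.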
Compute Chi-Square
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
• Therefore, in our example we have

(precipitation/yield)   Observed   Expected
High/High                   8          5
High/Low                    2          5
Low/High                    5          8
Low/Low                    13         10

$$\chi^2 = \frac{(8 - 5)^2}{5} + \frac{(2 - 5)^2}{5} + \frac{(5 - 8)^2}{8} + \frac{(13 - 10)^2}{10} = 5.625$$
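The sum above can be reproduced with a small Python sketch. The function name `chi_square` is an assumption for illustration. The second call uses the unrounded expected frequencies, which is not what the slide does, to show how much the rounding matters.

```python
def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over paired cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [8, 2, 5, 13]
rounded_expected = [5, 5, 8, 10]  # whole-cell values used in the example
exact_expected = [130 / 28, 150 / 28, 234 / 28, 270 / 28]

print(chi_square(observed, rounded_expected))
print(round(chi_square(observed, exact_expected), 2))
```

With the rounded expected values the result is 5.625, as in the example; with the unrounded values it rises to roughly 7.05.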
Interpreting Chi Square
• Zero indicates no relationship
• Large numbers indicate stronger relationship
• Or, a table of significance can be consulted to
determine if the specific value is statistically
significant
• The fact that we have shown that there is a
correlation between variables does NOT mean
that we have found out anything about WHY
this is so. In our analysis we might state our
assumptions as to why this is so, but we would
need to perform other analyses to show
causation.
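As a sketch of the significance lookup: a 2 x 2 table has one degree of freedom, and for that case the upper-tail probability can be computed from the complementary error function in the standard library. The function name `p_value_df1` is hypothetical, and 3.841 is the usual tabulated 0.05 critical value for one degree of freedom.

```python
import math


def p_value_df1(chi2):
    """Upper-tail probability of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


CRITICAL_05_DF1 = 3.841  # tabulated 0.05 critical value, 1 degree of freedom

chi2 = 5.625  # value from the wheat yield and precipitation example
print(chi2 > CRITICAL_05_DF1)
print(round(p_value_df1(chi2), 3))
```

This shortcut only holds for one degree of freedom; larger tables need a full chi-square table or a statistics library.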
If you don’t have Chi-Square
values
• Yule’s Q
• Value of Yule’s Q always
lies between –1 and +1
• Value of 0 indicates no
relationship
• Value of +1 indicates a positive
relationship
• Value of –1 indicates a negative
relationship
$$Q = \frac{(8 \times 13) - (2 \times 5)}{(8 \times 13) + (2 \times 5)} = \frac{104 - 10}{104 + 10} = .82$$
Analysis of Election 2000
Polygon to Polygon
Point to Polygon
Assessing Our Cultural Divide: Results from the 2000 Presidential Election
ANALYSIS OF SPATIAL AUTOCORRELATION
JOIN COUNT ANALYSIS
Join Count Analysis is a method of spatial autocorrelation that evaluates the
statistical significance of clustering among neighboring polygons. Based
upon the total number of counties won by each candidate (Gore: 588; Bush:
2214), the expected number of adjacent counties that voted for the same
candidate (i.e. two adjacent counties voting for Bush) was computed. In
addition, the actual number of adjacent counties that voted for the same
candidate was also computed using spatial analysis techniques in ArcView
GIS. The results were as follows:
Table 1. Expected vs. Actual Joins of Adjacent Counties Voting for the
Same Candidate
                    Expected    Actual
Gore/Gore Joins        438        879
Bush/Bush Joins       5516       6253
Assuming an independent random process, we computed the z-score, or
number of standard deviations away from the mean, for each candidate’s
specified number of joins (Z(Gore/Gore) = 15.47; Z(Bush/Bush) = 8.75).
At the 95% confidence level, a purely random sample drawn from a population
whose true mean is 0 would fall within a z-score range of +/- 1.96. Both
values were far greater than 1.96 in magnitude, indicating
significant positive spatial autocorrelation. Therefore, the join count
analysis showed that clustering exists within the county voting patterns.
Inferred from this analysis is the observation that regionalized voting
patterns existed in the 2000 Presidential Election.
Arthur J. Lembo, Jr.; Ph.D.
Cornell University
Paul Overberg
USAToday
ABSTRACT
Although the 2000 Presidential election was one
of the closest in recent history, many
commentators noted that the voting patterns
appeared to exhibit a “cultural divide”, with urban
areas voting for Al Gore, and rural areas voting for
George W. Bush. Because most of the comments
are based on a subjective view of the county
voting patterns, this project attempts to provide a
quantifiable measure of the voting patterns
exhibited during the 2000 election. Specifically,
we were interested in determining if a statistically
significant clustering pattern existed based on
county-wide results, and whether each candidate won
the counties of his assumed cultural association
(Gore: urban; Bush: rural).
To test these hypotheses, two separate spatial
analysis methods were performed on county-wide
voting patterns within the United States. The first
method utilized a principle of spatial
autocorrelation called join count analysis to
determine if voting patterns exhibited evidence of
spatial clustering. The second method used map
overlay to determine the likelihood of cities
falling within either Bush or Gore counties.
ANALYSIS OF SPATIAL CORRESPONDENCE
OVERLAY ANALYSIS
A second analysis was used to determine the likelihood of a county
with urban areas voting for either candidate. For this study, four
categories were evaluated: counties with small cities (under 50,000),
medium sized cities (50,000 – 75,000), large sized cities (greater than
75,000), and no cities. Based on the percentage of counties won by
each candidate (Gore: 22%; Bush: 78%) we computed the random
probability that a city would fall within a Bush county or a Gore
county. This probability allowed us to determine the expected
number of cities that would be located within Gore counties or Bush
counties. The actual number of cities located in a Gore county or
Bush county was determined using overlay analysis with ArcView.
Similar to the previous example, z-scores were computed for each of
the categories as follows:
$$Z = \frac{(O - E)^2}{pqn}$$
where O is the observed number of cities falling within
a county, E is the expected number of cities falling within a county, p is the probability
of a city falling in a Bush county, q is the probability of a city falling in a Gore County,
and n is the total number of cities.
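A hedged Python sketch of the formula as written, applied to the large-city counts reported in Table 2. The function name is hypothetical, and n is assumed here to be the observed number of large cities (184 + 119); the poster does not state which n was used. Note that, as written, the statistic is a squared deviation divided by the binomial variance npq, that is, the square of the usual binomial z-score.

```python
def z_statistic(observed, expected, p, q, n):
    """Z as defined above: (O - E)^2 / (p * q * n)."""
    return (observed - expected) ** 2 / (p * q * n)


p_bush, q_gore = 0.78, 0.22  # share of counties won by each candidate
n_large = 184 + 119          # assumption: n = observed count of large cities

print(round(z_statistic(184, 66, p_bush, q_gore, n_large), 1))   # Gore counties
print(round(z_statistic(119, 238, p_bush, q_gore, n_large), 1))  # Bush counties
```

Under these assumptions the two results land close to the 267 and 272 listed for large cities in Table 2.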
Table 2. Cities Falling Inside a County Won by Either Bush or Gore
                   Expected  Expected  Observed  Observed     Z        Z
                     Gore      Bush      Gore      Bush     Gore     Bush
Large (> 75K)          66       238       184       119      267      272
Medium (50-75K)        54       196       147        98      470       55
Small (<50K)          544      1273      2030      1236    4,998        3
No City               427      1588       347      1690       18       29
As previously stated, at the 95% confidence level a purely random sample
drawn from a population whose true mean is 0 would fall within a z-score
range of +/- 1.96. Table 2 shows that each of the z-score values exceeds
1.96. This implies a significant correlation between votes for Al Gore and
counties with cities, and between votes for George W. Bush and counties
without cities (rural areas).
Conclusion
Figure 1. Examples of Cities in Relation to the Distribution of Counties. These examples from New York
and Minnesota show that although Bush (in red) won a majority of the counties, the cities appear clustered
primarily within the few counties in which Gore won (in blue). For example, in Minnesota, a majority of the
cities exist within Hennepin County, while in New York, virtually every county Gore won has a city within its
border.
This analysis provided quantifiable evidence that positive spatial autocorrelation
(clustering) of voting patterns existed during the 2000 Presidential Election. Also, the
analysis showed a high statistical correlation between urbanized areas and county votes
for Al Gore. Further analysis is necessary to better understand causation (e.g. ethnicity,
income, age); however, both analyses indicate that geographic regions (e.g. urban areas)
may have played a large role in the vote determination for Election 2000.
Data Provided Courtesy of Election Data Services, and USAToday
Election 2000 Results
• Join Count Analysis
Table 1. Expected vs. Actual Joins of Adjacent Counties Voting for the Same Candidate
                    Expected    Actual
Gore/Gore Joins        438        879
Bush/Bush Joins       5516       6253
Z(Gore/Gore) = 15.47; Z(Bush/Bush) = 8.75
Not mutually exclusive
from large cities.
We must account for this
• Overlay Analysis
Table 2. Cities Falling Inside a County Won by Either Bush or Gore
                   Expected  Expected  Observed  Observed     Z        Z
                     Gore      Bush      Gore      Bush     Gore     Bush
Large (> 75K)          66       238       184       119      267      272
Medium (50-75K)        54       196       147        98      470       55
Small (<50K)          544      1273      2030      1236    4,998        3
No City               427      1588       347      1690       18       29
Election 2000 Results
• There was obvious spatial
autocorrelation in the way people
voted. That is, Bush counties and Gore
counties were highly clustered
• Also, there is a very high correlation
between urbanized counties and votes for
Gore, and between non-urbanized counties
and votes for Bush
Analysis of Environmental
Justice
Point in Polygon Analysis
By
Greg Thorhaug
css620 project – Spring 2001
Erie Chi-Squared
Poverty
          AREA           Expected SITES   Observed SITES   CHI-Squared Statistic
TOTAL     2784133584     162              162              474.4022216 (SUM)
Low       2743262089     159.6218         131              5.132182846
Medium    37710244.45    2.194241         24               216.6996098
High      3161250.755    0.183943         7                252.570429

Minority
          AREA           Expected SITES   Observed SITES   CHI-Squared Statistic
TOTAL     2784133584     162              162              86.84539322 (SUM)
Low       2668104627     155.248639       132              3.481506946
Medium    20151280.46    1.172539799      7                28.96216607
High      95877675.83    5.578821209      23               54.40172021

CHIINV = 5.991476357
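The poverty figures in the Erie table appear consistent with expected site counts that are proportional to the area of each class. The Python sketch below reproduces them under that assumption; the function and variable names are illustrative and not taken from the original project.

```python
TOTAL_AREA = 2784133584
TOTAL_SITES = 162

poverty_area = {"Low": 2743262089, "Medium": 37710244.45, "High": 3161250.755}
poverty_observed = {"Low": 131, "Medium": 24, "High": 7}


def chi_square_by_area(areas, observed, total_area, total_sites):
    """Expected sites proportional to area; returns (expected, chi-square)."""
    expected = {k: total_sites * a / total_area for k, a in areas.items()}
    chi2 = sum((observed[k] - expected[k]) ** 2 / expected[k] for k in areas)
    return expected, chi2


expected, chi2 = chi_square_by_area(
    poverty_area, poverty_observed, TOTAL_AREA, TOTAL_SITES
)
print(expected)
print(chi2, chi2 > 5.991476357)  # compare with the CHIINV value in the table
```

The same function can be run on the minority areas and observed counts to check the second sum.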
Summary
• Spatial data analysis is possible through basic
statistical methods
• More in-depth analysis is possible using spatial
statistics
• GIS software may be used to prepare data for
statistical analysis
• Spatial data analysis techniques provide a
powerful tool for analyzing GIS data, and
enable users to solve creative problems
Cross Tabulation
• Assume we have two 9-cell land cover maps, one from 1980 and one from 2000, with three categories:
A, B, and C.

Ground Reference Data    Interpreted Land Cover Data    Cross Tabulated Grid
A B B                    A A B                          AA BA BB
B B C                    B C C                          BB BC CC
B A C                    A A B                          BA AA CB
• You can see that the resulting cross tabulation provides a pixel-by-pixel comparison of the
interpreted land cover types for the two dates. So, for the upper left hand cell, the 1980 land cover
was A, and the 2000 land cover also had the value A. Therefore, this is a match between
the 1980 data and the 2000 data. However, in the lower right cell you can see that the 1980 data
indicated a value of C, while the 2000 value was B. This is not a match, and would indicate an
error between the two sources.
• We can now quantify the results in a matrix, as shown below. This matrix is often called a
confusion matrix.

     A  B  C
A    2  2  0
B    0  2  1
C    0  1  1
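The tally from cross-tabulated grid to matrix can be sketched in Python as follows. Two assumptions are made for illustration: the grids are read row by row, and the first letter of each cross-tabulated pair is treated as the ground reference value and the second as the map value, which is the reading that reproduces the matrix values in this section. The variable names are hypothetical.

```python
from collections import Counter

CLASSES = ["A", "B", "C"]

# Grids read row by row (assumed assignment of grids to roles)
reference = ["A", "B", "B",
             "B", "B", "C",
             "B", "A", "C"]
mapped = ["A", "A", "B",
          "B", "C", "C",
          "A", "A", "B"]

# Count each (map value, reference value) combination, cell by cell
pairs = Counter(zip(mapped, reference))

# Rows are map classes, columns are ground reference classes
matrix = [[pairs[(m, r)] for r in CLASSES] for m in CLASSES]

for label, row in zip(CLASSES, matrix):
    print(label, row)
```

Using a `Counter` means combinations that never occur, such as a map value of C over a reference value of A, simply count as 0.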
Confusion Matrix
• The matrix below shows the comparison of
the two hypothetical data sets: the 1980 data
set and the 2000 data set.
• As an example, geographic features that were
classified as A on the map in 1980, and actually
were still A in 2000, are counted in the upper left
cell of the matrix, which has the value 2 (there were two
pixels that met this criterion). This means that 2
units in the overall map that were classified as A actually are A.
The same holds for the classifications
of B and C.
• But there may have been times where the 1980
value was A and the 2000 value was B. In this
case, the 2 in the top row of the
matrix says that there are 2 units of something
that was A in 1980, but is now B in 2000.

                       Ground Reference
                       A   B   C
Map              A     2   2   0
Classification   B     0   2   1
                 C     0   1   1

• We can begin to add these numbers up by adding
an additional row and column. But what do these
numbers tell us?

                       Ground Reference
                       A   B   C   Total
Map              A     2   2   0     4
Classification   B     0   2   1     3
                 C     0   1   1     2
           Total       2   5   2
Comparing the maps
• The bottom row tells us that there were
two cells that were A, five cells that
were B, and two cells that were C. The
rightmost column tells us that we
mapped four cells as A, three cells as
B, and two cells as C. Adding up the
diagonal cells says there were 5 cells
where we actually got it right.

                       Ground Reference
                       A   B   C   Total
Map              A     2   2   0     4
Classification   B     0   2   1     3
                 C     0   1   1     2
           Total       2   5   2

• So, the overall map comparison is
really a function of:
– Total cells on the diagonal / total
number of cells.
• (2 + 2 + 1) / (2 + 2 + 0 + 0 + 2 + 1 + 0
+ 1 + 1) = 5/9 = .55, or 55% agreement
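The diagonal-over-total calculation can be written as a short Python sketch, using the matrix values from this example (rows are map classes, columns are ground reference classes). The variable names are illustrative.

```python
matrix = [[2, 2, 0],
          [0, 2, 1],
          [0, 1, 1]]

# Cells on the diagonal are the ones where map and reference agree
diagonal = sum(matrix[i][i] for i in range(len(matrix)))
total = sum(sum(row) for row in matrix)
agreement = diagonal / total

print(diagonal, total, round(agreement, 3))
```

The exact ratio is 5/9, or about 0.556.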
Other Accuracy Assessment
• The total correspondence of our example is 55%. But that only tells us part of the story. What if we were really
interested in classification B? Were there changes in classification B? Even here, there are two different ways of
interpreting that question:
– If I were interested in mapping all the areas of B, how well did I get them all? This is called the map Producer’s Accuracy.
That is, how well did we produce a map of classification B?
– If I were to use the map to find B, how successful would I be? This is called the map User’s Accuracy. That is, how much
confidence should a user of the map have in a given classification?
• To compute the map user’s accuracy, we divide the number correct within a row by the total number in
the whole row. Staying with our example of classification B:
– We said that we had two cells where B was correct. However, we actually said that there were three cells that contained B
(in other words, we incorrectly called a cell B when it should have been C). Therefore, we have:
– 2 correct B values / 3 total values = .66 user’s accuracy.
– This means that if we were to use this map and look for the classification of B, we would be correct 66% of the time.
• To compute the map producer’s accuracy, we divide the number correct within a column by the total
number in the whole column. Staying with our example of classification B:
– We said that we had two cells where B was correct. However, there were five cells that should have
been B. Therefore, we have:
– 2 correct B values / 5 total values that should be B = .4 producer’s accuracy
– This means that the map produced only 40% of all the B’s that were out there.

                       Ground Reference
                       A   B   C   Total
Map              A     2   2   0     4
Classification   B     0   2   1     3
                 C     0   1   1     2
           Total       2   5   2
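Both measures for classification B can be sketched in Python as below. The function names are assumptions for illustration; rows of the matrix are map classes and columns are ground reference classes, as in the table above.

```python
CLASSES = ["A", "B", "C"]
matrix = [[2, 2, 0],   # rows: map classification
          [0, 2, 1],   # columns: ground reference
          [0, 1, 1]]


def users_accuracy(matrix, i):
    """Correct cells for class i divided by its row total."""
    return matrix[i][i] / sum(matrix[i])


def producers_accuracy(matrix, i):
    """Correct cells for class i divided by its column total."""
    return matrix[i][i] / sum(row[i] for row in matrix)


b = CLASSES.index("B")
print(round(users_accuracy(matrix, b), 2), round(producers_accuracy(matrix, b), 2))
```

The user's accuracy for B is 2/3, which the example truncates to .66, and the producer's accuracy is 2/5 = .4.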
User and Producer Accuracy
• To test your understanding of all this,
compute the user’s and producer’s accuracy
for classifications A and C.
• This also gives us some indication of the
nature of the errors. For instance, it appears
that we confused classification A with
classification B (we said on two occasions
that B was A). By understanding the nature
of the errors, perhaps we can go back, look
over our process, and correct for that
mistake.

                       Ground Reference
                       A   B   C   Total   User’s Accuracy
Map              A     2   2   0     4
Classification   B     0   2   1     3       .66
                 C     0   1   1     2
           Total       2   5   2
Producer’s Accuracy        .4