Statistical Data Analysis

Chapter 9 - Montello and Sutton
An Introduction to Scientific
Research Methods in Geography
Overview
Statistical data analysis
Statistical description
Statistical inference
Geospatial Analysis
Data Analysis
Set of display and mathematical
techniques
Logical and conceptual considerations
Allows us to:
Extract meaning from systematically
collected measurements
Communicate that meaning to others
Geographers and Data
Geographers view data as statistical
(complex and imperfect) rather than
deterministic
Three reasons:
Imperfect sample of larger population
Measurement involves error
Phenomena are expressions of complex
sets of many interacting variables
Statistical Description
Goal: summarize potentially important
properties of our data using
Statistics - summary indices that describe
the sample (parameters describe the population)
Properties:
Central tendency
Variability / dispersion
Form / shape of distribution
Relationships
Central Tendency
Average or representative value
Three most common:
Mode - most frequent
Median - middle value
Mean (“average”)
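The three measures can be compared on a small data set. The following is a minimal sketch using Python's standard `statistics` module; the elevation values are hypothetical and chosen only to show how one extreme value pulls the mean away from the mode and median.

```python
import statistics

# Hypothetical sample: elevations (m) at seven survey points,
# including one unusually high point.
elevations = [120, 135, 135, 150, 160, 175, 420]

mode = statistics.mode(elevations)      # most frequent value
median = statistics.median(elevations)  # middle value of the sorted data
mean = statistics.mean(elevations)      # arithmetic average

print("mode:", mode, "median:", median, "mean:", mean)
```

In this made-up sample the single high value raises the mean well above the median, which is one reason the median is often preferred for skewed data.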
Variability / Dispersion
Tells how data points differ from the central
tendency
Indicates how representative the central tendency is:
more representative when variability is low
Three common:
Range - distance between high and low
Variance - average of squared deviations from the mean
Standard deviation - square root of the variance
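As a worked example of these three measures, the sketch below computes them by hand on a hypothetical set of scores. It uses the population form of the variance (dividing by n); the sample form divides by n - 1 instead and is available as `statistics.variance`.

```python
import statistics

# Hypothetical scores, chosen so the arithmetic comes out evenly.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

# Range: distance between the highest and lowest values.
data_range = max(scores) - min(scores)

# Variance: average of the squared deviations from the mean
# (population form, dividing by n).
mean = statistics.mean(scores)
variance = sum((x - mean) ** 2 for x in scores) / len(scores)

# Standard deviation: square root of the variance.
std_dev = variance ** 0.5

print(data_range, variance, std_dev)
```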
Form / Distribution I
Shape of entire data set
Modality - number of local modes
Unimodal - one mode, as in the normal or “bell-shaped” curve
Bimodal - two modes
Skewness - asymmetry of the distribution
Positive - mostly low and medium scores, with a tail of high scores
Negative - mostly medium and high scores, with a tail of low scores
Symmetry - two halves mirror each other around the central tendency
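The sign of skewness can be checked numerically. This is a sketch of one common definition (the moment-based coefficient), not necessarily the one used in the chapter; the function name `skewness` and the data are illustrative assumptions.

```python
def skewness(values):
    """Moment-based skewness: the mean cubed deviation divided by
    the mean squared deviation raised to the power 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

# Hypothetical data sets.
right_tailed = [1, 1, 2, 2, 2, 3, 3, 4, 9]    # mostly low scores, long high tail
left_tailed = [10 - x for x in right_tailed]  # mirror image of the above
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_tailed), skewness(left_tailed), skewness(symmetric))
```

A positive result corresponds to positive skew, a negative result to negative skew, and a value of zero to a symmetric distribution.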
Form / Distribution II
Derived scores
Describe the value of individual scores
relative to the rest of the data set
Three common:
Rank - 1, 2, 3, etc.
Percentile rank - percentage of the data
that is less than the score in question
z-score - standard deviation units above or
below the mean of the data set
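The three derived scores can be computed for a single value as follows. This is a minimal sketch on hypothetical test scores; ranking from the highest score down and using the population standard deviation are assumptions made for the illustration.

```python
import statistics

# Hypothetical scores and the score we want to describe.
scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]
target = 85

# Rank: position in the ordered data (here 1 = highest score).
rank = sorted(scores, reverse=True).index(target) + 1

# Percentile rank: percentage of the data that is less than the score.
percentile_rank = 100 * sum(s < target for s in scores) / len(scores)

# z-score: standard deviation units above or below the mean.
z_score = (target - statistics.mean(scores)) / statistics.pstdev(scores)

print(rank, percentile_rank, round(z_score, 2))
```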
Relationships I
Systematic (consistent) patterns of high
or low values across pairs of variables
Linear relationship - two variables form
a straight line when graphed
Positive (or direct) - high value A has high
value B; low value A has low value B
Negative (or indirect) - high value A has
low value B; low value A has high value B
Relationships II
Relationship strength - degree that
patterns hold across all cases
Correlation coefficient - measure of
relationship strength; its square gives the
proportion of variance the two variables share
Regression analysis - expresses
relationship as an equation that predicts
the values of Y (criterion variable) as a
function of X (predictor variable)
Monotonic relationship - goes up or
down; not necessarily in a straight line
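The ideas on the two Relationships slides can be sketched in a few lines. The code below computes a Pearson correlation coefficient, a least-squares regression line, and a rank-based correlation for a monotonic but nonlinear relationship. The helper names (`pearson`, `least_squares`, `ranks`) and all data are hypothetical, and the rank correlation shown is the Spearman approach, which the slides do not name.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def least_squares(xs, ys):
    """Slope and intercept of the line predicting Y from X."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

def ranks(values):
    """Rank positions (1 = smallest); assumes no tied values."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

# Hypothetical predictor (X) and criterion (Y) values.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson(x, y)                     # strength and direction
slope, intercept = least_squares(x, y)
predicted_at_6 = intercept + slope * 6

# A monotonic but nonlinear relationship: Y always rises with X.
cubes = [v ** 3 for v in x]
r_linear = pearson(x, cubes)               # below 1: not a straight line
r_rank = pearson(ranks(x), ranks(cubes))   # 1: perfectly monotonic

print(r, r ** 2, slope, intercept, r_linear, r_rank)
```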
Statistical Inference I
Goal: Draw informed guesses about
likely patterns in population, based on
sample data evidence
Assign probabilities to guesses
Sampling distribution - distribution of a
sample statistic based on all possible
samples of a given size, from a given
population
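For a very small population, a sampling distribution can be listed in full. The sketch below enumerates every possible sample of size 2 (drawn without replacement) from a hypothetical four-value population and records the mean of each.

```python
from itertools import combinations

# Hypothetical population of four values.
population = [2, 4, 6, 8]
sample_size = 2

# Every possible sample of the given size, and the mean of each one.
samples = list(combinations(population, sample_size))
sample_means = [sum(s) / len(s) for s in samples]

population_mean = sum(population) / len(population)
mean_of_sample_means = sum(sample_means) / len(sample_means)

print(sorted(sample_means))
print(population_mean, mean_of_sample_means)
```

The list of sample means is the sampling distribution of the mean for this population and sample size. Its average equals the population mean, even though individual samples miss it.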
Statistical Inference II
Assumptions:
Distribution is normal and variances are
equal
Data values are independent
Model specification (such as linearity,
inclusive of relevant predictor constructs)
Statistical Inference III
Two approaches:
Estimation
Point estimate - guess about specific
parameter value
Confidence interval - range of values
around the point estimate, stated with a
probability (confidence level)
Hypothesis Testing
Null hypothesis (H0) states an exact value for
the parameter
Alternative hypothesis (HA) states that the
parameter does not equal the null value
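Both approaches can be illustrated on one sample. This is a rough sketch that assumes a normal (z) approximation for the sample mean; with a sample this small a t distribution would usually be preferred. The data, the null value of 50, and the 0.05 significance level are all hypothetical.

```python
import statistics
from statistics import NormalDist

# Hypothetical sample of 12 measurements.
sample = [52, 48, 55, 51, 49, 53, 54, 50, 56, 47, 52, 53]
n = len(sample)

# Estimation: point estimate and 95% confidence interval for the mean.
point_estimate = statistics.mean(sample)
std_error = statistics.stdev(sample) / n ** 0.5
z_crit = NormalDist().inv_cdf(0.975)  # about 1.96 for 95% confidence
ci = (point_estimate - z_crit * std_error,
      point_estimate + z_crit * std_error)

# Hypothesis testing: H0 says the population mean is exactly 50;
# HA says it is not 50.
null_value = 50
z_stat = (point_estimate - null_value) / std_error
p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))
reject_null = p_value < 0.05

print(point_estimate, ci, z_stat, p_value, reject_null)
```

Under this approximation the two approaches agree: H0 is rejected at the 0.05 level exactly when the null value falls outside the 95% confidence interval.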
Statistical Inference IV
Four possible outcomes, based on:
Two possible truths (H0 is true; H0 is false)
Two possible decisions (reject H0 and
accept HA; fail to reject H0)
Two types of errors:
Type I - reject H0 when H0 is true
Type II - fail to reject H0 when H0 is false
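The two error rates can be estimated by simulation. The sketch below repeatedly draws samples from a normal population and applies a two-sided z-test with a known standard deviation. The function name `rejects_null`, the sample size, the effect size of 0.5, and the random seed are illustrative assumptions.

```python
import random
from statistics import NormalDist, mean

random.seed(0)  # fixed seed so the sketch is repeatable
ALPHA = 0.05
Z_CRIT = NormalDist().inv_cdf(1 - ALPHA / 2)

def rejects_null(true_mean, null_mean=0.0, sigma=1.0, n=30):
    """Draw one sample and report whether a two-sided z-test rejects H0."""
    sample = [random.gauss(true_mean, sigma) for _ in range(n)]
    z = (mean(sample) - null_mean) / (sigma / n ** 0.5)
    return abs(z) > Z_CRIT

trials = 2000

# H0 is true (true mean equals the null value): every rejection is a Type I error.
type_1_rate = sum(rejects_null(0.0) for _ in range(trials)) / trials

# H0 is false (true mean is 0.5): every failure to reject is a Type II error.
type_2_rate = sum(not rejects_null(0.5) for _ in range(trials)) / trials

print(type_1_rate, type_2_rate)
```

The Type I rate should land near the chosen significance level, while the Type II rate depends on how far the truth is from the null value and on the sample size.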
Geospatial Analysis
Geography data are different:
They are spatially distributed
Have location, extent or size, shape,
pattern, connectivity, etc.
They represent natural and human earth-surface features and processes
Spatiality is the focus or is central to the
analysis
Spatiality
Influences the accuracy of inferential
statistical analyses of nonspatial variables
Spatial autocorrelation exists when there
are patterns of spatial dependence – places
are “like” other places
Distance decay – near things are “more like”
each other than things further away
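Spatial autocorrelation is commonly summarized with an index such as Moran's I, which the slides do not name. It is used here only as one possible illustration. The sketch assumes a rectangular grid, rook (edge-sharing) neighbors with equal weights, and hypothetical cell values.

```python
def morans_i(grid):
    """Moran's I for a rectangular grid, using rook (edge-sharing)
    neighbors with a weight of 1 for each neighboring pair."""
    rows, cols = len(grid), len(grid[0])
    values = [v for row in grid for v in row]
    n = len(values)
    mean = sum(values) / n
    cross_products = 0.0
    weight_total = 0
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    cross_products += (grid[r][c] - mean) * (grid[nr][nc] - mean)
                    weight_total += 1
    squared_devs = sum((v - mean) ** 2 for v in values)
    return (n / weight_total) * cross_products / squared_devs

# Hypothetical grids: similar values clustered together vs. alternating.
clustered = [[1, 1, 0, 0],
             [1, 1, 0, 0],
             [1, 1, 0, 0],
             [1, 1, 0, 0]]
checkerboard = [[(r + c) % 2 for c in range(4)] for r in range(4)]

print(morans_i(clustered), morans_i(checkerboard))
```

Positive values indicate that neighboring places tend to be alike (positive spatial autocorrelation); negative values indicate that neighbors tend to differ.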
Areal Units
Which areal units to use?
Problems:
Using data from a continuous source but treating it with
discrete spatial analysis techniques
Politicization of unit determination (like
gerrymandering)
Modifiable Areal Unit Problem (MAUP) – effect
that theoretically arbitrary areal geometries have
on geographic analysis
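A toy example shows how the choice of areal units alone can change a result. The sketch below aggregates the same hypothetical grid of cell values into two zones in two different ways; the function name `zone_means` and the zone labels are assumptions made for the illustration.

```python
from statistics import mean

# Hypothetical cell values: low in the west, high in the east.
grid = [[2, 2, 8, 8],
        [2, 2, 8, 8],
        [2, 2, 8, 8],
        [2, 2, 8, 8]]

def zone_means(grid, zone_of):
    """Average the cell values within each zone.
    zone_of(row, col) returns the zone label for a cell."""
    zones = {}
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            zones.setdefault(zone_of(r, c), []).append(value)
    return {label: mean(values) for label, values in sorted(zones.items())}

# Zoning 1: two horizontal bands. Zoning 2: two vertical bands.
north_south = zone_means(grid, lambda r, c: "north" if r < 2 else "south")
west_east = zone_means(grid, lambda r, c: "west" if c < 2 else "east")

print(north_south)
print(west_east)
```

With horizontal bands the two zones look identical; with vertical bands they differ sharply, although the underlying cell values never changed.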
Questions
 Why is data analysis in geography usually
conceptualized in statistical (probabilistic) terms?
 What is meant by strength and form of statistical
relationships?
 What is the purpose of statistical inference? Why are
statistical inferences necessarily and ultimately
uncertain?
 What are two types of correct decisions and two
types of errors possible when hypothesis testing?
 What is spatial autocorrelation, what forms can it
take, and why is it so important to geographic data
analysis?