Transcript Chapter_9
Chapter 9
Statistical Data Analysis
An Introduction to Scientific
Research Methods in Geography
Montello and Sutton
Data Analysis
Helps us achieve the four scientific goals of
description, prediction, explanation, and
control
Statistical Data Analysis
Three primary reasons geographers treat data
in a statistical fashion
http://rlv.zcache.com/knowledge_is_power_do_statistics_stats_humor_flyer-p2440846222778564182dwj5_400.jpg
Statistical Description
Descriptive Statistics
Parameters
Central Tendency
Mode
Median
Mean: $\bar{X}$ (sample), $\mu$ (population)
Arithmetic mean
When would you use the median or the mode
instead of the mean?
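A minimal Python sketch of the three measures, using made-up income figures (the data are hypothetical, not from the text). It hints at one answer to the question above: a single extreme value pulls the mean upward but leaves the median and mode alone.

```python
import statistics

# Hypothetical incomes in thousands of dollars; the last value is an outlier.
incomes = [30, 35, 35, 40, 45, 50, 400]

mean_income = statistics.mean(incomes)      # pulled upward by the outlier
median_income = statistics.median(incomes)  # middle value, resistant to outliers
mode_income = statistics.mode(incomes)      # most frequent value

print("mean:", round(mean_income, 1))
print("median:", median_income)
print("mode:", mode_income)
```

The mode is also the only one of the three that makes sense for purely categorical data.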
Descriptive Statistics
Variability
Range
= largest value – smallest value
Variance
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
Standard Deviation
$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$
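The measures of variability can be computed directly from their definitions. The sketch below uses the population forms (dividing by N) and a small set of hypothetical scores chosen so the arithmetic is easy to check by hand.

```python
import math

# Hypothetical scores, treated as a complete population.
values = [2, 4, 4, 4, 5, 5, 7, 9]

N = len(values)
mu = sum(values) / N                               # population mean
value_range = max(values) - min(values)            # largest minus smallest
variance = sum((x - mu) ** 2 for x in values) / N  # mean squared deviation
std_dev = math.sqrt(variance)                      # back in the original units

print("range:", value_range)
print("variance:", variance)
print("standard deviation:", std_dev)
```

For a sample rather than a population, the usual estimator divides by N - 1 instead of N.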
Descriptive Statistics
Form
Modality
Skewness
Positive
Negative
Symmetry
Unimodal – Bell-shaped
Normal Distribution
http://people.eku.edu/falkenbergs/images/skewness.jpg
Descriptive Statistics
Derived Scores
Percentile Rank
Highest – 99th percentile
Where is the median?
Z-score
Standard deviation units above or below the mean
$z = \frac{x - \mu}{\sigma}$
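A short sketch of the two derived scores. The test scale (mean 100, standard deviation 15) is hypothetical, and converting a z-score to a percentile rank this way assumes the scores are normally distributed.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Standard deviation units that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

mu, sigma = 100.0, 15.0  # hypothetical population mean and standard deviation

z_high = z_score(130.0, mu, sigma)  # two standard deviations above the mean
z_mean = z_score(100.0, mu, sigma)  # exactly at the mean

# Percentile rank under the normality assumption: percent of scores below x.
pct_high = 100 * NormalDist().cdf(z_high)
pct_mean = 100 * NormalDist().cdf(z_mean)

print(z_high, round(pct_high, 1))
print(z_mean, round(pct_mean, 1))
```

In a symmetric distribution such as the normal, the mean and the median coincide, so a z-score of 0 sits at the 50th percentile.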
Descriptive Statistics
Relationship
Linear Relationship
Positive
Negative
Relationship Strength
Weak, strong, no relationship
Correlation Coefficient
Between -1 and 1
0 – no relationship
Regression Analysis
Criterion variables (Y)
Predictor variables (X)
http://hosting.soonet.ca/eliris/remotesensing/LectureImages/correlation.gif
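As a rough illustration of relationship strength and regression, the sketch below computes a Pearson correlation coefficient and a least-squares line by hand. The five (X, Y) pairs are hypothetical, and Pearson's r is only one of several correlation coefficients.

```python
import math

# Hypothetical paired observations: predictor X and criterion Y.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of squares and cross-products of deviations from the means.
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)
s_yy = sum((yi - mean_y) ** 2 for yi in y)

r = s_xy / math.sqrt(s_xx * s_yy)    # correlation, always between -1 and 1
slope = s_xy / s_xx                  # change in Y per unit change in X
intercept = mean_y - slope * mean_x  # predicted Y when X is 0

print("r:", round(r, 3))
print("Y =", round(intercept, 2), "+", round(slope, 2), "* X")
```

Here r is positive and fairly strong, so the fitted line slopes upward.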
Correlation – Causation?
http://xkcd.com/552/
“Correlation doesn’t imply causation, but it does
waggle its eyebrows suggestively and gesture
furtively while mouthing ‘look over there’.” - XKCD
Statistical Inference
Inferential Statistics
Statistics
Sampling error
Given our sample statistics, we infer our
parameters
Assign probabilities to our guesses
Power and difficulty of inferential statistics
come from deriving probabilities about how
likely it is that sample patterns reflect
population patterns
Inferential Statistics
Sampling distribution
Ex: sampling distribution of means – shows the
probability that a single sample would have a
mean within some given RANGE of values
Central limit theorem – sampling distribution
of sample means will be normal with a mean
equal to the population mean and a standard
deviation equal to the population standard
deviation divided by the square root of the
sample size
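The central limit theorem can be checked by simulation. The sketch below is an illustration under assumed settings: the population is exponential (strongly skewed, with mean 1 and standard deviation 1), each sample has 30 scores, and the seed is fixed only to make the run repeatable.

```python
import math
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

POP_MEAN = 1.0   # mean of an exponential distribution with rate 1
POP_SD = 1.0     # its standard deviation is also 1
n = 30           # size of each sample
num_samples = 5000

# Draw many samples and record the mean of each one.
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.pstdev(sample_means)
predicted_sd = POP_SD / math.sqrt(n)  # sigma / sqrt(n)

print("mean of sample means:", round(mean_of_means, 3))
print("sd of sample means:", round(sd_of_means, 3))
print("predicted sd:", round(predicted_sd, 3))
```

Even though the population is skewed, the simulated sampling distribution centers near the population mean with a spread close to the predicted value.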
Inferential Statistics
Generation of sampling distributions
Assumptions
Distributional assumptions
Nonparametric
Parametric
Normality
Homogeneity of variance
Independence of scores
Correct specification of models
Estimation and Hypothesis Testing
Estimation
Point estimation
Confidence Interval
Usually 95%
Hypothesis Testing
Null hypothesis
A hypothesis about the exact (point) value of a
parameter or set of parameters
Use sample statistics to make an inference about
the probable truth of our null hypothesis
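The 95% confidence interval listed under Estimation can be sketched as follows. The numbers are hypothetical, and the sketch assumes the population standard deviation is known and the sampling distribution of the mean is normal.

```python
import math
from statistics import NormalDist

# Hypothetical sample summary.
x_bar = 50.0  # sample mean (the point estimate)
sigma = 10.0  # population standard deviation, assumed known
n = 100       # sample size
confidence = 0.95

standard_error = sigma / math.sqrt(n)
# z value that leaves (1 - confidence) / 2 in each tail; about 1.96 for 95%.
z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

lower = x_bar - z_crit * standard_error
upper = x_bar + z_crit * standard_error

print("point estimate:", x_bar)
print("95% CI:", round(lower, 2), "to", round(upper, 2))
```

The point estimate is a single best guess; the interval attaches a probability statement to a range of guesses.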
Hypothesis Testing
Alternative Hypothesis
Hypothesis that the parameter does not equal the exact value
hypothesized in the null
A range rather than an exact value
Modus Tollens
If A is true, then B is true
B is not true: therefore, A is not true
B is true: therefore, ???
Useful for disconfirming, not confirming!
Example
From a recent nationwide study it is known that the
typical American watches 25 hours of television per
week, with a population standard deviation of 5.6 hours.
Suppose 50 Denver residents are randomly sampled
with an average viewing time of 22 hours per week and a
standard deviation of 4.8. Are Denver television viewing
habits different from nationwide viewing habits?
Step 1: State your null and alternative hypotheses
$H_0: \mu = 25$
$H_A: \mu \neq 25$
(where $\mu$ is the mean weekly viewing time of all Denver residents)
What is this saying?
Example
Step 2: Determine your appropriate test statistic and its sampling
distribution assuming the null is true
We are testing a sample mean where n>30 and so a z distribution can
be used
Step 3: Calculate the test statistic from your sample data
Given: $\bar{X} = 22$, $\mu = 25$, $s = 4.8$, $\sigma = 5.6$, $n = 50$
$z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{22 - 25}{5.6 / \sqrt{50}} = -3.79$
Step 4: Compare the empirically obtained test statistic to the null
sampling distribution
P value: p < .001 (two-tailed p ≈ .00015)
OR Critical value at .05 significance level: z = ±1.96
Decision: Reject the null hypothesis
-3.79 is less than -1.96: reject
The p value is very small, less than .05 and even .01: reject
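The four steps of the Denver example can be reproduced in a few lines. This is a sketch of a two-tailed one-sample z test using the figures given in the example; the variable names are illustrative.

```python
import math
from statistics import NormalDist

# Figures from the example.
mu_0 = 25.0   # hypothesized (national) mean hours per week
sigma = 5.6   # population standard deviation
x_bar = 22.0  # Denver sample mean
n = 50        # Denver sample size
alpha = 0.05  # significance level

# Step 3: test statistic.
standard_error = sigma / math.sqrt(n)
z = (x_bar - mu_0) / standard_error

# Step 4: compare with the null sampling distribution (standard normal).
p_two_tailed = 2 * NormalDist().cdf(-abs(z))
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
reject_null = abs(z) > z_crit

print("z:", round(z, 2))
print("two-tailed p:", round(p_two_tailed, 5))
print("critical value: +/-", round(z_crit, 2))
print("reject null:", reject_null)
```

Both routes agree: the test statistic falls beyond the critical value, and the p value is far below .05.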
Error
You have made either a correct inference
or a mistake
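Type I error: rejecting a null hypothesis that is true; its probability is the significance level, α
Type II error: failing to reject a null hypothesis that is false; its probability is β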
http://www.mirrorservice.org/sites/home.ubalt.edu/ntsbarsh/Businessstat/error.gif
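A simulation can make the Type I error rate concrete. In the sketch below the null hypothesis is true by construction (every sample is drawn from a normal population with mean 25 and standard deviation 5.6, borrowing the national figures from the television example), so every rejection is a Type I error. The normal population, trial count, and seed are assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist, fmean

rng = random.Random(0)  # fixed seed so the run is repeatable

mu_0, sigma, n, alpha = 25.0, 5.6, 50, 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
standard_error = sigma / math.sqrt(n)

trials = 4000
false_rejections = 0
for _ in range(trials):
    # The null is true here: the sample really comes from a mean-25 population.
    sample = [rng.gauss(mu_0, sigma) for _ in range(n)]
    z = (fmean(sample) - mu_0) / standard_error
    if abs(z) > z_crit:
        false_rejections += 1  # rejecting a true null is a Type I error

type_1_rate = false_rejections / trials
print("observed Type I error rate:", type_1_rate)
```

Over many trials the observed rate should land close to the chosen α of .05.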
Data in Space and Place
Spatiality is a focus in geography, unlike other disciplines
Spatial autocorrelation
First Law of Geography: Everything is related to everything else,
but near things are more related than distant things
Positive vs. negative spatial autocorrelation
A violation of the important statistical assumption of
independence
Ex: If it’s raining in my backyard, I can say with a high degree of
confidence it’s raining in my neighbor’s backyard, but my level of
confidence that it is raining across town is lower, and 300 miles
away even lower
Variogram
http://www.innovativegis.com/basis/Papers/Other/ASPRSchapter/
Default_files/image023.png
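A variogram summarizes spatial autocorrelation by plotting how dissimilar values become as the distance between them grows. The sketch below computes an empirical semivariance, half the average squared difference between values a given lag apart, for a hypothetical transect of evenly spaced elevation readings. The data and the one-dimensional layout are simplifying assumptions.

```python
# Hypothetical elevations measured at evenly spaced points along a transect.
elevations = [10, 11, 13, 14, 16, 15, 13, 12, 10, 9]

def semivariance(values, lag):
    """Half the mean squared difference between values `lag` steps apart."""
    pairs = [(values[i], values[i + lag]) for i in range(len(values) - lag)]
    return sum((a - b) ** 2 for a, b in pairs) / (2 * len(pairs))

# Semivariance at lags of 1, 2, and 3 steps.
gamma = {lag: semivariance(elevations, lag) for lag in (1, 2, 3)}

for lag, value in gamma.items():
    print("lag", lag, "semivariance", round(value, 3))
```

Semivariance rising with lag is the signature of positive spatial autocorrelation: near things are more alike than distant things.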
Data in Space and Place
“Spatial data are special” – a special difficulty
Which areal units should be used to analyze
geographic data
Modifiable Areal Unit Problem
Gerrymandering
Geographic phenomena are often scale
dependent
Must identify the scale of a phenomenon and collect
and organize data in units of that size
Data aggregation issues
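The modifiable areal unit problem can be shown with a toy grid. In this sketch, nine hypothetical precincts with equal turnout are grouped into three districts in two different ways; the precinct values and both zoning schemes are invented for illustration.

```python
# Percent of the vote for Party A in a 3 x 3 grid of equal-turnout precincts.
precincts = [
    [55, 55, 55],
    [55, 55, 55],
    [10, 10, 10],
]

def districts_won(districts):
    """Count districts where Party A's average share exceeds 50 percent."""
    return sum(1 for d in districts if sum(d) / len(d) > 50)

# Zoning scheme 1: each row of precincts forms a district.
by_rows = [list(row) for row in precincts]
# Zoning scheme 2: each column of precincts forms a district.
by_columns = [list(col) for col in zip(*precincts)]

overall_share = sum(sum(row) for row in precincts) / 9
won_by_rows = districts_won(by_rows)
won_by_columns = districts_won(by_columns)

print("overall share for A:", overall_share)
print("districts won (rows):", won_by_rows)
print("districts won (columns):", won_by_columns)
```

The underlying data never change, yet Party A wins two of three districts under one zoning and none under the other, which is the mechanism behind gerrymandering.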
Discussion Questions
What measure of central tendency is best for nominal
data?
When pollsters tell you that a candidate is favored by
44% of likely voters, plus or minus 3 percent, what is the
44% and what is the plus/minus 3%?
A survey of all users of a park in 1980 found the average
number of people per party to be 3.5. In a random
sample of 35 parties in 2000 the average was 2.9. If you
wanted to test if the number of persons per party in 2000
was different from the number in 1980, what would your
null and alternative hypotheses be?
In the United States, we presume that someone is
innocent. If a guilty person were found to be not guilty,
what type of error would this be?
A researcher finds that a particular learning software has
an effect on students’ test scores, when actually it does
not. What type of error is this?