Transcript Slide 1
You will go through the process of
science, learning how statistics is
applied to a study of saguaros.
OBSERVATION
Saguaros seem to occur in different
numbers on the north versus south slope
of Gates Pass.
Gates Pass
www.gamineral.org/t04-gates_pass.html
RESEARCH QUESTION
Based on the observation that saguaros
seem to occur in different numbers on the
north and south slopes of Gates Pass:
What descriptive question (versus causal
question) could you ask?
Descriptive: Is there a difference in saguaro
density between north- and south-facing
slopes? Only requires a count to answer.
Causal: Why is saguaro density affected by
whether the slope faces north versus south?
Requires controlled studies of many factors to answer.
LITERATURE REVIEW
Not needed to come up with the multiple
hypotheses for this question because it is
a descriptive question – there either is not
an effect or there is an effect of slope
direction on saguaro density.
MULTIPLE HYPOTHESES
The null hypothesis (H0) – the default that
there is no relationship between two
measured phenomena. There is no difference.
Source: http://freshspectrum.com/i-am-the-null-hypothesis/
H0: There is no difference in Saguaro
density between north and south-facing
slopes.
MULTIPLE HYPOTHESES
The alternative hypotheses (H1, H2)
– there is a relationship between two
measured phenomena. There is a difference.
H1: Saguaro density is greater on the
north-facing slope than on the south-facing slope.
H2: Saguaro density is greater on the
south-facing slope than on the north-facing slope.
DEDUCTIONS
What evidence would you need to be convinced
each hypothesis is correct or incorrect?
In other words, by how much would the
densities have to differ for you to be convinced
that the direction of the slope affects saguaro
density?
We will come back to this later…but in “real life”
you are supposed to come up with deductions
before collecting any data.
TESTS: Three Data Sets
             North Slope   South Slope
Data set A   99            101
Data set B   90            110
Data set C   80            120
Imagine that these are three possible sets of data that
you could have collected by counting saguaros on a
north and south slope. Note that I have kept the total
number of saguaros the same (200) for each data set.
TESTS: THREE EXAMPLES
Here are the same data displayed in a bar graph
TENTATIVE CONCLUSION
For each data set (A-C) did the evidence convince
you that the differences in densities were
significant enough to warrant ruling out the null
hypothesis (that the distribution of saguaros on the
two slopes was just random) and tentatively
concluding that slope did affect saguaro density?
TENTATIVE CONCLUSION
Perhaps it would help to know
what the chance is that the
distribution is random?
p = probability this would happen randomly?
             North Slope   South Slope   p
Data set A   99            101           p=?%
Data set B   90            110           p=?%
Data set C   82            118           p=?%
TENTATIVE CONCLUSION
But how do we determine
the probability that the
distribution is random?
STATISTICS
If you are comparing counts, then use the
chi-square test (chi is pronounced “kie”).
Example: count of saguaros on two slopes.
If you are comparing averages, then use
the t-test.
Example: comparing average height of
saguaros on two slopes.
Chi-Square Test
You compare the actual counts to what
the expected counts would be if the
distribution were random.
In our case, with a total of 200 saguaros
counted on both slopes, what would the
expected distribution be if they were
distributed perfectly randomly?
Expected
North Slope   South Slope
100           100
Chi-Square Test
The chi-square test tells you the probability
of seeing a difference at least this large between
the observed and expected counts by chance alone,
that is, if the null hypothesis is true.
              Observed   Expected   Difference
North Slope   99         100        -1
South Slope   101        100        1
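As a rough sketch of what a chi-square calculator does with two counts, the Python below works through the 99 versus 101 example by hand. It assumes the expected count is an even split of the total and applies no continuity correction. With two categories (1 degree of freedom) the p value can be written with the complementary error function, so only the standard library is needed. The function name chi_square_two_counts is made up for this illustration and is not taken from the Excel file.

```python
import math

def chi_square_two_counts(north, south):
    """Chi-square test of two counts against an even split (1 degree of freedom)."""
    total = north + south
    expected = total / 2  # H0: saguaros are split evenly between the slopes
    # Sum of (observed - expected)^2 / expected over both slopes.
    chi2 = ((north - expected) ** 2 + (south - expected) ** 2) / expected
    # For 1 degree of freedom the upper-tail probability has a closed form.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

chi2, p = chi_square_two_counts(99, 101)
print(f"chi-square = {chi2:.2f}, p = {p:.2f}")
```

For 99 and 101 this gives a chi-square of 0.02 and a p value of about 0.89.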
Chi-Square Test
Use my Excel file online
Type in the numbers in the gray boxes and then hit enter.
2 categories    Category 1   Category 2   P value
Your Data >>>   99           101          0.89 <<< significant if <0.05
Your Data >>>   90           110          0.16 <<< significant if <0.05
TENTATIVE CONCLUSION
Using my Excel file online,
you would come up with these
probabilities.
p = probability this would happen randomly
             North Slope   South Slope   p
Data set A   99            101           p=0.89=89%
Data set B   90            110           p=0.16=16%
Data set C   82            118           p=0.01=1%
DEDUCTIONS
Which brings us back to deductions.
What probability of being wrong are we willing
to risk?
The worst mistake you can make in science (Type 1
error) is to conclude that there is a relationship when
it was really random.
Are we willing to be wrong 89% of the time?
16% of the time? 1% of the time?
Remember, this is the magic number:
scientists most often use 5% (p=0.05).
DEDUCTIONS
H0: There is no difference in Saguaro
density between north and south-facing
slopes.
D0: The p value for the chi square test will
be 0.05 or greater when comparing
saguaro densities.
DEDUCTIONS
H1: There is a difference in Saguaro
density between north and south-facing
slopes.
D1: The p value for the chi square test will
be less than 0.05 when comparing saguaro
densities.
TENTATIVE CONCLUSION
For data sets A and B, because p>0.05, there is no
significant difference in saguaro density, so slope
direction is unlikely to affect saguaro density.
For data set C, because p<0.05, saguaro density is
significantly greater on the south slope, so slope
direction likely affects saguaro density.
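The decision rule can be sketched in a few lines of Python. This is only an illustration under the same assumptions as the slides: an even split expected under H0, two categories (1 degree of freedom), and a 0.05 threshold. The names ALPHA, p_value_even_split, and data_sets are invented here. Data set C appears in the slides both as 80/120 and as 82/118, so both versions are included.

```python
import math

ALPHA = 0.05  # significance threshold used in the slides

def p_value_even_split(north, south):
    """Chi-square p value for two counts against an even split (1 df)."""
    expected = (north + south) / 2
    chi2 = ((north - expected) ** 2 + (south - expected) ** 2) / expected
    return math.erfc(math.sqrt(chi2 / 2))  # exact for 1 degree of freedom

data_sets = {
    "A": (99, 101),
    "B": (90, 110),
    "C (80/120)": (80, 120),
    "C (82/118)": (82, 118),
}

results = {}
for name, (north, south) in data_sets.items():
    p = p_value_even_split(north, south)
    results[name] = (p, p < ALPHA)
    verdict = "reject H0" if p < ALPHA else "fail to reject H0"
    print(f"Data set {name}: p = {p:.3f} -> {verdict}")
```

Only the two versions of data set C fall below the 0.05 threshold; A and B do not.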
This could be your table
Table 1. Number of saguaros per hectare on
the north and south slope of Gates Pass near
Tucson, AZ as counted September 1, 2014.
Significance determined by the chi square test.
             North slope   South slope   Significant?
Data set A   99            101           No; p=0.89
Data set B   90            110           No; p=0.16
Data set C   80            120           Yes; p=0.005
This could be your graph
Figure 1. Number of saguaros per hectare on
the north and south slope of Gates Pass near
Tucson, AZ as counted September 1, 2014.
T-Test Example
The t-test tells you the probability that the
difference between two averages is random,
and considers variability within the data.
For example
H0: There is not a difference in saguaro
heights between north- and south-facing
slopes.
H1: There is a difference in saguaro heights
between north- and south-facing slopes.
Sample Data
Table 1. Saguaro height (in meters) on the
north and south slope of Gates Pass near
Tucson, AZ as measured September 1, 2014.
North-facing slope: 3.5, 0.2, 0.5, 1.5, 3.1, 0.8, 1.2
South-facing slope: 1.0, 3.2, 3.5, 4.2, 0.8, 0.7, 3.4
Sample Data
Table 2. Average saguaro height (in meters) for
7 saguaros measured on the north and south
slope of Gates Pass near Tucson, AZ on
September 1, 2014.
North-facing slope: 1.5
South-facing slope: 2.4
Are these significantly different?
It depends on sample size and
amount of variability in the data.
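To make "sample size and variability" concrete, here is a small Python sketch that summarizes the two height samples from the table. It uses the standard library's statistics module; the sample standard deviation is used as the measure of variability, which is a choice made for this illustration rather than something stated in the slides.

```python
import statistics

# Heights in meters, copied from the sample data table.
north = [3.5, 0.2, 0.5, 1.5, 3.1, 0.8, 1.2]
south = [1.0, 3.2, 3.5, 4.2, 0.8, 0.7, 3.4]

for label, heights in (("North", north), ("South", south)):
    mean = statistics.mean(heights)
    sd = statistics.stdev(heights)  # sample standard deviation (divides by n - 1)
    print(f"{label}: n = {len(heights)}, mean = {mean:.1f} m, sd = {sd:.1f} m")
```

The spread within each slope (roughly 1.3 m and 1.5 m) is larger than the 0.9 m gap between the two averages, and each sample has only 7 plants, which hints that the difference may not turn out to be significant.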
[Figure: two dot plots of number of saguaros by height (m), axis from 1.0 to 4.0, each marking the north slope average and the south slope average.]
Lots of variability: Probably not significantly different.
Little variability: Probably significantly different.
T-Test
Use my Excel file online
Click on the t-test tab at the bottom.
Group 1: 3.5, 0.2, etc.
Group 2: 1.0, 3.2, etc.
P value: 0.272
TENTATIVE CONCLUSION?
If P<0.05
then we tentatively conclude that
there is a significant difference
because there is less than 5%
chance that it could have
happened randomly.
If P>0.05
then we tentatively conclude that
there is NOT a significant
difference because there is
more than 5% chance that it
could have happened randomly.
TENTATIVE CONCLUSION?
Average saguaro height on the north slope (1.5 m)
is not significantly different (p=0.27) from average
saguaro height on the south slope (2.4 m).
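For readers who want to see where a p value like this comes from, the sketch below runs a two-sample t-test on the height data using only Python's standard library. It assumes a pooled-variance (Student's) two-tailed test, which is a guess about what the Excel file does, and it gets the p value by numerically integrating the t distribution with Simpson's rule. The function names t_pdf and two_sample_t_test are invented for this example.

```python
import math
import statistics

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sample_t_test(a, b, steps=2000):
    """Pooled-variance t-test; returns (t, df, two-tailed p). steps must be even."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / df
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    t = (statistics.mean(a) - statistics.mean(b)) / se
    # Simpson's rule for the area under the t density between 0 and |t|.
    upper = abs(t)
    h = upper / steps
    area = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area *= h / 3
    p = 1 - 2 * area  # chance of a |t| at least this large if H0 is true
    return t, df, p

north = [3.5, 0.2, 0.5, 1.5, 3.1, 0.8, 1.2]
south = [1.0, 3.2, 3.5, 4.2, 0.8, 0.7, 3.4]

t, df, p = two_sample_t_test(north, south)
print(f"t = {t:.2f}, df = {df}, p = {p:.3f}")
```

Under these assumptions the result is a p value of about 0.27, in line with the value reported from the Excel file.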
REVIEW
Before you collect data (i.e., in your
proposal) you have to decide how much
difference is enough to convince you that
there is cause and effect going on versus
just random chance.
Statistics (e.g., chi-square and t-test) can
be used to calculate the probability that
the difference is random.
Use p<0.05 to rule out the null hypothesis
and tentatively conclude there are
significant differences that suggest cause
and effect.