Data Analysis, Your Research Questions, and Writing


Week 7: Making use of the Badger
Mill Creek Field trip
Data Analysis, Your Research Questions, and
Writing
Zoo 511
Spring 2009
Outline
• Field trip review
• Statistics and data analysis (with examples in
Excel)
• Badger Mill research questions/hypotheses
• Writing a scientific paper
• Enter data
Today’s goals
• Provide a basic background on how to use and
interpret common statistical tests
• Prepare you to generate questions for your
paper, and to analyze data to answer these
questions
• Get all data entered!
Why use statistics?
Are there more orange spotted sunfish in
pools or runs?
Pool: 2, 7, 3 (total 12)
Run: 5, 4, 1 (total 10)
Why use statistics?
Are there more orange spotted sunfish in
pools or runs?
Pool
2
Run
5
3
1
5
6
• Statistics help us find patterns in the face of variation, and draw
inferences beyond our sample sites.
• Statistics help us tell our story; they are not the story in themselves!
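To make the point concrete, here is a minimal Python sketch that summarizes the sunfish counts from the first example slide. It assumes the flattened counts read as Pool = 2, 7, 3 and Run = 5, 4, 1 (totals 12 and 10); the variable names are illustrative.

```python
from statistics import mean, stdev

# Counts per site from the first example slide (the grouping is an assumption).
pool = [2, 7, 3]
run = [5, 4, 1]

for name, counts in (("Pool", pool), ("Run", run)):
    # The total and mean describe the pattern; the standard deviation describes the variation.
    print(f"{name}: total={sum(counts)}, mean={mean(counts):.2f}, sd={stdev(counts):.2f}")
```

Pools have the higher mean (4.00 vs. 3.33 fish per site), but the standard deviation within each habitat is larger than 2 fish, so the raw totals alone cannot settle the question.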
Important note
“Data” is the plural form of datum.
WRONG: Data was analyzed using Microsoft Excel.
RIGHT: Data were analyzed using Microsoft Excel.
When in doubt, substitute the word “apples” for “data”, and
ask if your sentence makes sense.
A Few Necessary Terms
Categorical Variable: Discrete groups, such as Type
of Reach (Riffle, Run, Pool)
Continuous Variable: Measurements along a
continuum, such as Flow Velocity
What type of variable is “Mottled Sculpin/m2”?
What type of variable is “Substrate Type”?
A Few Necessary Terms
Explanatory Variable: Independent variable. On x-axis.
The variable you use to predict another variable.
Response Variable: Dependent variable. On y-axis.
The variable that is hypothesized to depend on/be
predicted by the explanatory variable.
A Few Necessary Terms
Mean: The expected value of a random variable, or
the average of a set of observations
Variance: A measure of how far the observed values
differ from the expected value, i.e. the mean (Standard
deviation is the square root of variance).
Normal distribution: a symmetrical probability
distribution described by a mean and variance. An
assumption of many standard statistical tests.
[Figure: example normal curves N(μ1, σ1), N(μ1, σ2), and N(μ2, σ2)]
Statistical Tests
Hypothesis Testing: In statistics, we are always
testing a Null Hypothesis (H0) against an alternative
hypothesis (Ha).
p-value: The probability of observing our data or
more extreme data assuming the null hypothesis
is correct
Statistical Significance: We reject the null
hypothesis if the p-value is below a set value (α),
usually 0.05.
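The p-value definition above can be illustrated with a minimal sketch. Note that this uses an exact permutation test, not the t-test, ANOVA, or regression covered in these slides, because it computes "the probability of data at least as extreme under the null" by direct counting. It assumes the sunfish counts from the earlier slide read as Pool = 2, 7, 3 and Run = 5, 4, 1; the function name `permutation_p_value` is hypothetical.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation p-value for a difference in means."""
    pooled = list(group_a) + list(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    as_extreme = 0
    total = 0
    # Under the null hypothesis the group labels are interchangeable,
    # so every way of relabeling the sites is equally likely.
    for idx in combinations(range(len(pooled)), len(group_a)):
        chosen = set(idx)
        a = [pooled[i] for i in range(len(pooled)) if i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point ties.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            as_extreme += 1
    return as_extreme / total

pool = [2, 7, 3]  # assumed reading of the slide's counts
run = [5, 4, 1]
alpha = 0.05
p = permutation_p_value(pool, run)
print(f"p = {p:.2f}; reject H0: {p < alpha}")
```

For these counts, 18 of the 20 possible relabelings give a difference at least as large as the observed one, so p = 0.90 and the null hypothesis is not rejected.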
Statistical Tests: Appropriate Use
For our data, the response variable will always be
continuous.
T-test: A categorical explanatory variable with only
2 options.
ANOVA: A categorical explanatory variable with >2
options.
Regression: A continuous explanatory variable
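The selection rules above can be written as a small decision function. This is only a sketch of the slide's logic for a continuous response variable; the function name `choose_test` and its parameters are hypothetical.

```python
def choose_test(explanatory_is_categorical, n_groups=None):
    """Pick a test for a continuous response, following the rules above."""
    if not explanatory_is_categorical:
        return "regression"
    if n_groups is None or n_groups < 2:
        raise ValueError("a categorical explanatory variable needs at least 2 groups")
    return "t-test" if n_groups == 2 else "ANOVA"

print(choose_test(True, n_groups=2))  # e.g. two sites
print(choose_test(True, n_groups=3))  # e.g. riffle, run, pool
print(choose_test(False))             # e.g. flow velocity
```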
Student’s T-Test
Tests the statistical significance of the
difference between means from two
independent samples
Null hypothesis: No difference between means.
Compares the mean response between the 2 groups of a
categorical variable
[Figure: Mottled Sculpin/m2 at Cross Plains and Salmo Pond]
Precautions and Limitations
• Meet Assumptions
• Observations from data with a normal
distribution (histogram)
• Samples are independent
• Assumed equal variance (this assumption
can be relaxed)
• No other sample biases
• Interpreting the p-value
Walk through t-test
Mottled Sculpin/m2
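As a rough companion to the walkthrough, the sketch below computes the equal-variance (pooled) Student's t statistic by hand using only the Python standard library. The class data are not reproduced here, so it reuses the sunfish counts from the earlier slide, read as Pool = 2, 7, 3 and Run = 5, 4, 1 (an assumption); `pooled_t` is a hypothetical helper name.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(sample_a, sample_b):
    """Return (t, df) for Student's two-sample t-test with equal variances."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    # Pooled variance: the sample variances weighted by their degrees of freedom.
    pooled_var = ((n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)) / df
    std_err = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / std_err, df

pool = [2, 7, 3]  # assumed reading of the slide's counts
run = [5, 4, 1]
t, df = pooled_t(pool, run)
T_CRITICAL = 2.776  # two-tailed critical value for alpha = 0.05 with df = 4
print(f"t = {t:.3f}, df = {df}, significant: {abs(t) > T_CRITICAL}")
```

Here t is about 0.343 with 4 degrees of freedom, well below the critical value, so the difference between pools and runs is not statistically significant for this small sample.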
Analysis of Variance (ANOVA)
Tests the statistical significance of the
difference between means from two or
more independent groups
[Figure: groups compared are Riffle, Run, and Pool]
Null hypothesis: No difference between means.
Precautions and Limitations
• Meet Assumptions
• Samples are independent and identically
distributed (iid).
• Assumed equal variance among groups
• Residuals are normally distributed
• Groups are classified correctly
• No other sample biases
• Interpreting the p-value
Walk through ANOVA
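The sketch below shows how the one-way ANOVA F statistic is built from between-group and within-group variation. The counts are hypothetical, invented purely for illustration, and `one_way_anova_f` is a hypothetical helper name; it is not the Excel procedure used in class.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Between-group sum of squares: how far each group mean sits from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of observations around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical sculpin counts per site, invented for illustration only.
riffle = [6, 5, 7]
run = [4, 3, 5]
pool = [2, 1, 3]
f_stat, df_b, df_w = one_way_anova_f([riffle, run, pool])
print(f"F({df_b}, {df_w}) = {f_stat:.1f}")
```

A large F means the group means differ by much more than the scatter within groups would suggest; the p-value comes from comparing F to the F distribution with the two degrees of freedom shown.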
Simple Linear Regression
• Analyzes relationship between two
continuous variables: predictor and response
• Null hypothesis: there is no relationship
(slope=0)
[Figure: least-squares line (regression line: y=mx+b) with residuals marked]
Residuals
Residuals are the distances from observed points
to the best-fit line
Residuals always sum to zero
Regression chooses the best-fit line to minimize
the sum of squared residuals. It is called the Least
Squares Line.
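The least-squares line and its residuals can be sketched in a few lines of Python. The data are hypothetical, invented for illustration only, and `least_squares` is a hypothetical helper name; the slope and intercept formulas are the standard closed-form least-squares solution.

```python
from statistics import mean

def least_squares(x, y):
    """Return (slope, intercept) of the least-squares line y = m*x + b."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return slope, y_bar - slope * x_bar

# Hypothetical data, invented for illustration only.
flow_velocity = [1, 2, 3, 4]
darters_per_m2 = [0.12, 0.27, 0.32, 0.47]

m, b = least_squares(flow_velocity, darters_per_m2)
# Residual = observed value minus the value predicted by the line.
residuals = [yi - (m * xi + b) for xi, yi in zip(flow_velocity, darters_per_m2)]
print(f"slope = {m:.2f}, intercept = {b:.2f}")
print(f"sum of residuals = {sum(residuals):.1e}")
```

The residuals sum to zero up to floating-point rounding, as stated above.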
Precautions and Limitations
• Meet Assumptions
• Relationship is linear (not exponential,
quadratic, etc)
• X is measured without error
• For any given value of X, sampled Y’s are
independent
• Normal distribution of residual errors
• Interpret the p-value and R-squared value.
P-value: probability of observing your data (or
more extreme data) if no relationship existed.
R-Squared indicates how much variance in the
response variable is explained by the
explanatory variable.
• Indicates the strength of the relationship; you
can think of this as a measure of predictability
If this is low, other variables likely play a role. If
this is high, it DOES NOT NECESSARILY INDICATE A
SIGNIFICANT RELATIONSHIP!
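R-squared can be computed directly from the observed and fitted values, as in this minimal sketch. The numbers are hypothetical (the fitted values come from an invented line Y = 0.02 + 0.11X), and `r_squared` is a hypothetical helper name.

```python
def r_squared(observed, fitted):
    """Share of the variance in the response explained by the fitted line."""
    y_bar = sum(observed) / len(observed)
    # Total variation of the response around its mean.
    ss_total = sum((y - y_bar) ** 2 for y in observed)
    # Variation left over after fitting the line.
    ss_residual = sum((y - f) ** 2 for y, f in zip(observed, fitted))
    return 1 - ss_residual / ss_total

# Hypothetical observations and fitted values, invented for illustration only.
observed = [0.12, 0.27, 0.32, 0.47]
fitted = [0.13, 0.24, 0.35, 0.46]
r2 = r_squared(observed, fitted)
print(f"R-squared = {r2:.3f}")
```

Note that this calculation never uses the sample size, which is one reason a high R-squared from a handful of points can still come with a high p-value.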
R-Squared and P-value
• High R-Squared, low p-value (significant relationship)
• Low R-Squared, low p-value (significant relationship)
• High R-Squared, high p-value (NO significant relationship)
• Low R-Squared, high p-value (NO significant relationship)
Walk through Regression 1
Residual vs. Fitted Value Plots
[Figure: Observed Values (Points) and Model Values (Line)]
Residual Plots Can Help Test Assumptions
• “Normal” Scatter
• Fan Shape: Unequal Variance
• Curve (linearity)
Have we violated any assumptions?
If assumptions are violated:
• Try transforming data (log transformation, square
root transformation)
• Most of these tests are fairly robust to violations of
assumptions of normality and equal variance (only
be concerned if obvious problems exist)
• Diagnostics (residual plots, histograms) should NOT
be reported in your paper. Rather, a statement that
diagnostic tests were performed to assure that
assumptions of a linear regression were not violated
is sufficient.
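The transformations mentioned above can be sketched as follows. The counts are hypothetical, and adding 1 before taking the log is a common convention for count data that contain zeros, assumed here rather than taken from the slides.

```python
from math import log10, sqrt

# Hypothetical right-skewed fish counts, invented for illustration only.
counts = [0, 1, 1, 2, 3, 5, 40]

# Adding 1 before taking logs keeps zero counts defined (an assumed convention).
log_counts = [log10(c + 1) for c in counts]
sqrt_counts = [sqrt(c) for c in counts]

print([round(v, 2) for v in log_counts])
print([round(v, 2) for v in sqrt_counts])
```

Both transformations pull in the large value far more than the small ones, which often makes residuals look more symmetric; rerun the diagnostics on the transformed data to check.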
Walk through regression 2, with residual plots
Statistical significance
[Figure, panel 1: Darters/m2 vs. Flow Velocity; R2=0.85, p=0.045, Y=0.02+0.1X]
[Figure, panel 2: Darters/m2 vs. Flow Velocity; R2=0.6, p=0.055, Y=0.02+0.1X]
Take home message: using 0.05 as a cutoff for
significance is ARBITRARY! Use your p-values as one of multiple
tools for interpreting your results (especially because you will likely
have small sample sizes).
Statistical vs. biological significance
[Figure: Darters/m2 vs. Flow Velocity; R2=0.85, p=0.045, Y=0.02+0.1X]
• In the observed range of flow velocities, you would
expect a difference of 0.2 fish per m2.
• If your reach contained 100 m2 of habitat, you would
expect a difference of 20 fish.
Take home message: there is no magic number to determine
biological significance. YOU need to think about what your results
mean, and interpret them in an ecological context.
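The arithmetic behind the example above can be sketched as follows. The slope comes from the fitted line Y=0.02+0.1X; the width of the observed flow-velocity range is not stated on the slide, so the value of 2 used here is an assumption chosen to match the slide's 0.2 fish per m2.

```python
slope = 0.1            # fish per m2 per unit of flow velocity, from Y = 0.02 + 0.1X
velocity_range = 2.0   # ASSUMED width of the observed flow-velocity range
habitat_area_m2 = 100  # reach area used in the example above

density_difference = slope * velocity_range                # fish per m2
fish_difference = density_difference * habitat_area_m2     # fish per reach

print(f"{density_difference:.1f} fish per m2, {fish_difference:.0f} fish per reach")
```

Whether a difference of this size matters ecologically is a judgment call that the calculation itself cannot make.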
Part 2: Your Questions
Read the handout!!!!
Your questions should be specific
and answerable
WRONG
What habitat do fish
prefer?
In what kind of stream
are brown trout most
likely to be found?
RIGHT
Does sculpin CPUE differ
among geomorphic
units?
Is brown trout density
related to flow velocity?
Example Questions
Does sculpin CPUE differ among
geomorphic units?
Is brown trout density related to
flow velocity?
[Figure: Sculpin CPUE (sculpin per minute) by geomorphic unit: RIFFLE, RUN, POOL]
[Figure: Brown Trout/m2 vs. Current Velocity (m/s)]
Other data sources
Last year’s data: all of the same information was
collected from the same place, around the same
time of year. Replication!
USGS: http://waterdata.usgs.gov/nwis/uv?05435943
Think about these data sources as you generate
your questions.
Two questions with a
supporting paragraph for
each are due Monday by
10:00 am via email.
Name your file: Classday_Lastname_Questions.doc
(e.g., Wednesday_Hansen_Questions.doc)
Part 3: Writing
Read the handout!!!!
Why Write?
• Gain experience articulating thoughts
• Writing is a learning experience
• It is the currency of communication (in
science, law, business, etc…)
Order of a scientific paper
(see handout!)
1. Title
2. Abstract
3. Introduction – set up your study
4. Methods – study site, data analyses
5. Results – analyses, reference tables and
figures here
6. Discussion – interpret results
7. Literature Cited
8. Tables and figures
Think before you write
• Analysis → results: figures & numbers
• Search the literature → context
Outline
• Start with basic parts
• Add subsections
• Add topic sentences
• This will take some time, but will make
your paper much easier to write and of
much higher quality!!
Writing
Start with what you know
– Methods: two parts
• Sampling: site description and sampling
techniques relevant to your hypothesis
• Statistical analysis
– Results
• Report the findings
• What did your analyses reveal?
• FIGURES SHOULD STAND ALONE!!!!
Note on results
• Make ecology the subject of your sentences,
not statistics. Statistics help you tell your
story; they are not your story in themselves.
WRONG: Linear regression showed that there was a significant
positive relationship with a p-value of 0.04 and an R2 of 0.81
between brown trout abundance and flow velocity.
RIGHT: Brown trout abundance increased with increasing flow
velocity (R2=0.81, p=0.04).
Intro and discussion:
Why does it matter? What does it mean?
• Introduction
– What is the context of the study?
– Set up the experiment
• Discussion
– What do the results mean?
– Was your hypothesis correct?
– What is interesting/exciting about your
findings?
Writing
The last steps
• Abstract:
– for most, the hardest part of writing a
scientific paper
– Short summary of the important points of
the paper
• Title
– Short, sweet, descriptive
• Literature Cited
Peer Review…?
• Criticism is important… “constructive
criticism” is best!
• Two types: Internal and External. Point
of internal review is to make external
review go well
• Reviews need to be taken seriously
Lab
Enter Badger Mill Creek
Data