P2 - Excel, Basic Statsx


The Data Collection and Statistical
Analysis in IB Biology
Part II – Basic Stats, Standard Deviation and Variability
John Gasparini
The Munich International School
Remember our two species of butterflies?
http://www.zipcodezoo.com/hp350/Adelpha_basiloides_0.jpg
Spot Celled Sister (Adelpha basiloides)
http://4.bp.blogspot.com/-M8r6KZeMas/TWWe7aOgiWI/AAAAAAAAAP8/ck9cmUdqfas/s1600/Adelpha_cytherea_ButterflyPhotography-BB_Blogspot_JGJ.jpg
Smooth-Banded Sister (Adelpha cytherea)
Research Question
"Is there a significant difference
in proboscis length and body
mass between A. basiloides and
A. cytherea?”
These are closely related species
from the Nymphalidae family. Both are found in the tropics of
Central America and both feed
on the nectar of flowers.
Imagine that you have collected data on the proboscis length and body mass of our two
butterfly species. Record it properly. You must be neat to reduce problems later!
Give the raw data tables
proper titles
Include uncertainties!
Be consistent in your
number of decimal places.
Don’t use more than the
sensitivity limits of your
instrument.
What is the number of
butterflies sampled for
each species?
n = 15
What is the total number
of butterflies sampled?
Total sampled in both
species = 30
Now that we have recorded our raw data in an organized fashion it is time to
calculate some basic statistics for our datasets…
We’ll start with three “Measurements of Central Tendency.”
Each is a summary score that tries in some way to represent a set of scores. It is a
single score generated from a dataset that in some way is typical of the distribution
of scores.
1) Mode
2) Median
3) Mean (average)
Fancy name. Don’t get caught up
in it. These are easy stats and
you know most of them already.
Mode: This is the score or value that occurs most
frequently in a dataset.
What is the Mode of
this dataset?
Answer: 23.5
Why? The value 23.5 occurs the
most in the dataset – twice to be
exact.
Not very complicated…
Mode: This is the score or value that occurs most
frequently in a dataset.
Datasets can be amodal,
monomodal, bimodal and
multimodal.
(You should be able to figure out what
these terms mean.)
Note: this dataset is different
from the one before, which was
monomodal.
Which of these terms
would best describe the
dataset to the left?
Answer = Amodal, as there is no repeating value
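One way to check these labels is to count how often each value occurs. The sketch below is a minimal Python illustration and is not part of the original slides. The function name describe_modality and the sample values are hypothetical, invented only for this example.

```python
from collections import Counter


def describe_modality(values):
    """Return (label, modes) for a non-empty list of scores."""
    counts = Counter(values)            # how many times each value occurs
    highest = max(counts.values())      # the largest number of repeats
    if highest == 1:
        return "amodal", []             # no value repeats, so there is no mode
    modes = sorted(v for v, c in counts.items() if c == highest)
    labels = {1: "monomodal", 2: "bimodal"}
    return labels.get(len(modes), "multimodal"), modes


# Hypothetical proboscis lengths (mm), invented for illustration only.
one_mode = [23.1, 23.5, 22.9, 23.5, 23.2]
no_mode = [23.1, 23.5, 22.9, 23.4, 23.2]
two_modes = [23.1, 23.5, 23.5, 23.1, 23.2]

for data in (one_mode, no_mode, two_modes):
    print(describe_modality(data))
```

The function treats a dataset as amodal when no value repeats, which matches the answer given above.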
Median: This is the middle point of the scores in a dataset.
50% of the scores are above the median, and 50% are
below it.
The median is a point, and it does
not have to be an actual score in
that distribution.
What is the Median of
this dataset? = 23.2
Think about what the median would
be for a dataset with an even number
of samples – e.g. Median value of the
dataset 10, 7, 8 and 6?
= 7.5 (sorted: 6, 7, 8, 10; the median is the midpoint of the two middle values, 7 and 8)
Mean: This is the average value of the dataset and
all of you should be able to calculate this easily…
OK. So all of this is made terribly easy
if you learn to use Excel properly.
Click on the image below and watch the podcast on how to use Excel
to calculate Modes, Medians, and Means within a spreadsheet. You
need to master these skills.
http://www.youtube.com/watch?v=ziQcGGBvH00&feature=youtu.be
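Outside Excel, the same calculations can be sketched in a few lines of Python. This is only an illustration using Python's standard statistics module; it is not the method shown in the podcast. The dataset is the small even-sized example from the median slide (10, 7, 8 and 6).

```python
import statistics

scores = [10, 7, 8, 6]  # even-sized example dataset from the median slide

# median() sorts the scores and, for an even count, averages the middle two
median = statistics.median(scores)   # (7 + 8) / 2

# mean() is the sum of the scores divided by the number of scores
mean = statistics.mean(scores)       # 31 / 4

print(median)  # 7.5
print(mean)    # 7.75
```

Note that 7.5 is not one of the four scores, which is the point made above: the median does not have to be an actual score in the distribution.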
Now what we need to do is graph
the data in Excel. This, too, is fairly
easy. View the podcast below to
see how this is done.
Do not forget all of the
rules that you have
learned over the years
on what is expected in
terms of graphical
presentation of data!
(Remember these from 6th
grade?)
http://youtu.be/-WsEgIbfbug
For Graphs…
• Be neat, and make the graph large enough to be easily read.
• Use a pencil and a ruler, if constructing the graph by hand.
• Each axis should have a LABEL and the UNITS of measurement.
• The independent variable should be on the X-axis, and the
  dependent variable should be on the Y-axis.
• Scale the axes properly so that the data is effectively displayed.
• Use the appropriate type of graph - line graph, scatter plot, bar
  graph, etc.
• Data points should be properly positioned relative to the axes
  scales.
Using Excel, we’ve generated the graph shown below...
Now what does it tell us?
How would you analyze these results?
What conclusions would you draw in viewing this graph?
What it tells us is that A. cytherea has a higher
mean proboscis length than A. basiloides.
But this is only part of the picture and is a 9th and 10th grade analysis
of the datasets.
We need to go further in our statistical analysis because Mean values
are not always accurate representative scores!
Why?
Well… because the mean is a measure of the central tendency of the
dataset, but it tells us NOTHING, NOTHING! about the spread of the
data.
The data points that we are analyzing could be tightly clustered around
the mean or they could have high variability.
Range is a simple and easy to compute measure of variability in a
dataset:
(Max sample value – Min sample value) = RANGE
What is the RANGE of this small dataset?
54, 56, 67, 72, 19, 52, 56, 56, 66, 68, 57, 58, 63
(72 – 19) =
53 = RANGE
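The range calculation above can be sketched in Python as a quick check. This is a minimal illustration, not part of the original slides. The second calculation, which drops the single lowest value, is an added step to show how strongly one extreme value drives the range.

```python
data = [54, 56, 67, 72, 19, 52, 56, 56, 66, 68, 57, 58, 63]

# Range = maximum sample value - minimum sample value
data_range = max(data) - min(data)
print(data_range)  # 53

# Added illustration: remove only the lowest value (19) and recompute
without_lowest = sorted(data)[1:]
range_without_lowest = max(without_lowest) - min(without_lowest)
print(range_without_lowest)  # 20
```

A single low value accounts for most of the range here, so the range alone says little about how the other twelve values are spread.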
This large range value suggests that there is a great
deal of variability in our dataset, but here we can
see that RANGE is also limited in that it tells us
nothing about the variability within the distribution.
When we plot out the dataset on a simple number line, one can see the flaw in
relying just on the MEAN and RANGE values as measurements of central tendencies
and variability:
56, 67, 72, 11, 56, 56, 66, 19, 68, 57, 58, 63
The Mean (X) of this dataset = 54.1
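[Number line: 11 and 19 sit far to the left; 56, 56, 56, 57, 58, 63, 66, 67, 68 and 72 cluster on the right; the mean (X) is marked at 54.1.]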
The vast majority of values are clustered
around this end of the distribution. The
mean is not in the middle of this cluster, as it
has been affected by the outliers, 11 and 19.
This dataset has a skewed
distribution!
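The pull of the two outliers on the mean can be checked numerically. The sketch below is a minimal Python illustration and is not part of the original slides. Comparing the mean with the median, and recomputing the mean without 11 and 19, are added steps used only to make the skew visible.

```python
import statistics

data = [56, 67, 72, 11, 56, 56, 66, 19, 68, 57, 58, 63]

mean = statistics.mean(data)        # 649 / 12
median = statistics.median(data)    # midpoint of the two middle values, 57 and 58

print(round(mean, 1))  # 54.1
print(median)          # 57.5

# Added illustration: the mean of the clustered values only (outliers removed)
clustered = [x for x in data if x not in (11, 19)]
mean_clustered = statistics.mean(clustered)   # 619 / 10
print(mean_clustered)  # 61.9
```

The mean sits below the median and well below the centre of the main cluster, which is what the number line shows.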
About 68% of the data lie within +/- 1 s.d. of the mean (for normally distributed data)!
The greater the SD value, the greater the variability!
How do you calculate the standard deviation of a dataset?
We are going to leave the mathematics behind this measure of variability to
your math teachers, but you have to be able to calculate S.D. values in Excel.
Follow the link to a podcast tutorial on using Excel to calculate standard deviation:
http://youtu.be/90YWFllx1EA
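For comparison with the Excel approach, standard deviation can also be sketched with Python's standard statistics module. This is only an illustration and not the podcast's method. The dataset is the skewed example from earlier in these slides. Which variant the podcast uses is not stated here, so both are shown: stdev() divides by n - 1 (sample S.D., like Excel's STDEV.S) and pstdev() divides by n (population S.D., like Excel's STDEV.P).

```python
import statistics

data = [56, 67, 72, 11, 56, 56, 66, 19, 68, 57, 58, 63]

mean = statistics.mean(data)

# Sample standard deviation: divides the summed squared deviations by n - 1
sample_sd = statistics.stdev(data)

# Population standard deviation: divides by n instead
population_sd = statistics.pstdev(data)

print(round(mean, 1), round(sample_sd, 1), round(population_sd, 1))

# Mean +/- 1 S.D. gives the lower and upper ends of an S.D. error bar
lower, upper = mean - sample_sd, mean + sample_sd
print(round(lower, 1), round(upper, 1))
```

For measurements taken from a sample of a larger population, as with the butterflies, the sample version is the usual choice.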
Error bars are a graphical representation of the variability of data. Error bars
can be used to represent range, standard deviation or other measures of
variability. In IB Biology STANDARD DEVIATION ERROR BARS will be most
useful.
SET A – the bar (mean) for A is
higher than B
SET B – the S.D. error bar is longer
for B than A
How do you put standard deviation error bars on the graphs that
you generate?
Follow the link to a podcast tutorial on putting error bars on graphs in Excel:
http://youtu.be/oV0vbQlp9AI
What do error bars tell us?
The overlap of error bars gives us a clue
as to the significance of the results!
LOTS OF OVERLAP = LOTS OF SHARED DATA
Results are NOT LIKELY TO BE SIGNIFICANTLY
DIFFERENT! The difference between means is
most likely due to chance
NO OVERLAP = VERY LITTLE SHARED DATA
Results ARE LIKELY TO BE SIGNIFICANTLY
DIFFERENT! The difference between means is
most likely to be REAL
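The overlap rule of thumb can be sketched in Python by treating each error bar as the interval from mean - 1 S.D. to mean + 1 S.D. This is a minimal illustration, not part of the original slides. The function names sd_interval and intervals_overlap and all three datasets are hypothetical, invented for the example. As stated above, overlap is only a clue and not a formal test of significance.

```python
import statistics


def sd_interval(values):
    """Return (mean - 1 S.D., mean + 1 S.D.) using the sample S.D."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return mean - sd, mean + sd


def intervals_overlap(a, b):
    """True if the two (low, high) intervals share any values."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical measurements, invented for illustration only.
set_a = [12.0, 12.4, 11.8, 12.1, 12.2]
set_b = [12.3, 11.5, 13.0, 12.6, 11.9]
set_c = [15.1, 15.4, 14.9, 15.2, 15.0]

# Error bars overlap: difference not likely to be significant
print(intervals_overlap(sd_interval(set_a), sd_interval(set_b)))  # True

# Error bars do not overlap: difference likely to be real
print(intervals_overlap(sd_interval(set_a), sd_interval(set_c)))  # False
```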
a. SET B
b. SET B
c. SET A
d. SET B
e. SET A
Let’s look back at our original data and
try to answer the first half of our
research question.
“Is there a significant
difference in proboscis
length between A. basiloides
and A. cytherea?”
Now, given your knowledge about what the standard
deviation of a dataset represents, what should your
conclusion be in regards to the proboscis lengths of
A. cytherea and A. basiloides?
Lots of overlap in
SD error bars!
NO! The two datasets
contain too much shared
data to conclusively state
that a significant difference
exists between the proboscis
lengths of these butterflies.
But what about when we look at the mean body mass values for
the two species?
There is some overlap. This one is hard to call.
We need another statistical test to tell us if there is a difference
in these data sets. Something more refined…