mrsgreenbiology.files.wordpress.com

Topic 1 Statistical Analysis
What you need to know:
1.1.1 State that error bars are a graphical representation of the
variability of data.
1.1.2 Calculate the mean and standard deviation of a set of values.
1.1.3 State that the term standard deviation is used to summarize the
spread of values around the mean, and that 68% of the values fall
within one standard deviation of the mean.
1.1.4 Explain how the standard deviation is useful for comparing the
means and the spread of data between two or
more samples.
1.1.5 Deduce the significance of the difference between two sets of
data using calculated values for t and the appropriate tables.
1.1.6 Explain that the existence of a correlation does not establish
that there is a causal relationship between two values.
Many advertisements for products from
cosmetics to cars use Science to sell them.
Look at these examples:
Why do people use Science in
advertisements?
People trust science!
Why?
Because they believe that Scientists have
carried out research and that the data they
have collected can be trusted.
This is because of the rigorous nature of the
Scientific Method
TOK Link!
Can you always believe
scientists?
Lies, damned lies and
statistics!
What has statistics got to do
with Biology?!
•As Scientists we make observations
•We come up with a hypothesis
•We carry out experimental work
•We collect data
•But, what does the data tell us?
•Data needs to be processed
•Data needs to be analysed
Assessment Statement
1.1.2 Calculate the mean and standard deviation of
a set of values.
1.1.3 State that the term standard deviation is used
to summarize the spread of values around the
mean, and that 68% of the values fall within one
standard deviation of the mean.
1.1.4 Explain how the standard deviation is useful
for comparing the means and the spread of data
between two or more samples.
The Mean
When you carried out experiments in the past
you will probably have calculated an average
or mean of your results.
The mean is a measure of the central
tendency (middle value) of the data.
But the mean isn’t always the best way of
representing the central tendency of our data.
It depends on what type of data we have.
There are three measures of central
tendency
Data collected from an experiment falls into 3
types:
Data Type    Example                                     Central Tendency
Nominal      Cats, red cars; usually frequency counts    mode
Ordinal      Ranked - 1st, 2nd or relative data          median
Interval     On a scale (measured) - mm, °C etc.         mean
Statistical Analysis
Measures of central tendency (of a set of data)
Mean - the sum of the values divided by the number of values
Median - the middle value when the values are placed in order
Mode - the most frequent value
Task
Find the mean, median and mode of the following data points.
10, 14, 10, 12, 10, 12, 10, 11, 9 and 11
Mean
= (10+14+10+12+10+12+10+11+9+11) / 10
= 109/10
= 10.9
Mode
= 10
Median
9, 10, 10, 10, 10, 11, 11, 12, 12, 14
Even number of values so (10+11) / 2
= 10.5
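The same task can be checked on a computer. This is a minimal sketch using Python's built-in statistics module; the variable names are just illustrative.

```python
import statistics

# Data points from the task above
data = [10, 14, 10, 12, 10, 12, 10, 11, 9, 11]

mean = statistics.mean(data)      # sum of the values / number of values
median = statistics.median(data)  # middle of the sorted values (average of the two middle ones here)
mode = statistics.mode(data)      # most frequent value

print(f"mean = {mean}, median = {median}, mode = {mode}")
```

The three results should match the hand calculation: 10.9, 10.5 and 10.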
Task
Look at the examples on the next three slides.
Which measure of central tendency would you
use in each situation and why?
As part of a promotion for Justin Bieber’s new CD,
you can win 1 million VND by guessing how
many tracks will be on the album!
To help him make the best guess, Huy has written
down the number of tracks on each of his other
Justin Bieber CDs.
His results are as follows:
10, 14, 10, 12, 10, 10, 11, 9 and 11
What number should he guess and why?
Indonesia is a country with 237 million people.
Most of the people are extremely poor, living in
huts. There is a small, but growing “middle
class” and there are a few extremely wealthy
people.
Does mean, mode
or median give the
best idea of the
“average” income?
A fertilizer company developed a new high-potassium fertilizer that increased the yield of
rice. In their advertisements, what number
should they use to state the percentage increase
in yield - the mean, median or mode?
Let’s look at some data
Class A height/cm +/- 0.1cm    Class B height/cm +/- 0.1cm
150                            166
162                            164
182                            163
165                            159
177                            162
175                            166
166                            165
163                            168
172                            167
185                            164
        Class A height/cm +/- 0.1cm    Class B height/cm +/- 0.1cm
        150                            166
        162                            168
        182                            167
        165                            169
        177                            170
        175                            166
        166                            171
        163                            168
        172                            167
        178                            168
Mean    169                            168
What does this tell us?
Looking at the mean heights for the two classes
one might conclude that...
...the heights of the students in each class are
pretty similar
However, the mean doesn’t tell us anything
about the spread of data
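A quick sketch of this point in Python. It assumes the height values in the second table are read row by row (Class A, then Class B) and that the final 169 / 168 pair is the row of means, which the arithmetic supports.

```python
import statistics

# Heights in cm. Assumption: the flattened table alternates Class A, Class B,
# and its last row (169, 168) holds the two means rather than data.
class_a = [150, 162, 182, 165, 177, 175, 166, 163, 172, 178]
class_b = [166, 168, 167, 169, 170, 166, 171, 168, 167, 168]

def summarise(heights):
    """Return (mean, range) for a list of heights."""
    return statistics.mean(heights), max(heights) - min(heights)

mean_a, range_a = summarise(class_a)
mean_b, range_b = summarise(class_b)

print(f"Class A: mean {mean_a} cm, range {range_a} cm")
print(f"Class B: mean {mean_b} cm, range {range_b} cm")
```

The means differ by only 1 cm, but the range of Class A is more than six times that of Class B.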
Range and variability
Range
In this case, the range is small and the data is clustered around the mean
In this case the range is large and the data is much more variable
In both cases the graphs show a normal distribution (a bell shaped curve) and the mean
lies in the middle of the frequency distribution
Standard Deviation
Is a measure of how much the data varies
from the mean.
Is meaningful only for data with a normal
distribution.
In a normal distribution, 68% of data lie within 1 standard deviation of the mean
95% of data lie within 2 standard deviations of the mean
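The 68% and 95% figures can be checked from the shape of the normal curve itself. A minimal sketch, assuming Python 3.8 or later (for statistics.NormalDist); the helper name fraction_within is made up for this example.

```python
from statistics import NormalDist

# A standard normal distribution: mean 0, standard deviation 1
standard_normal = NormalDist(mu=0, sigma=1)

def fraction_within(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

print(f"within 1 SD: {fraction_within(1):.1%}")
print(f"within 2 SD: {fraction_within(2):.1%}")
```

The exact values are a little over 68% and a little over 95%, which is why the rounded figures are the ones to remember.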
Calculating Standard Deviation
You will be given the formula (phew!)
You are expected to be able to calculate
STD using a graphic or scientific calculator
You can also use Excel for this
•Standard deviation calculator
1. Click on the cell where you wish the answer to appear
2. Type: =stdev(
3. Highlight the data using your mouse
4. Close brackets
5. Press enter
Key to the formula: standard deviation, sum, mean, data point, number of data points
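The formula image itself is not reproduced here. As a sketch, the code below assumes the sample standard deviation, s = square root of [ sum of (data point - mean)^2 / (number of data points - 1) ], which is what Excel's =stdev( calculates. It reuses the ten data points from the mean, median and mode task.

```python
import math
import statistics

data = [10, 14, 10, 12, 10, 12, 10, 11, 9, 11]

def sample_std(values):
    """Sample standard deviation, dividing by (n - 1) as Excel's STDEV does."""
    n = len(values)                                       # number of data points
    mean = sum(values) / n                                # mean
    sum_of_squares = sum((x - mean) ** 2 for x in values) # sum of squared deviations
    return math.sqrt(sum_of_squares / (n - 1))

s = sample_std(data)
print(round(s, 2))                       # 1.45
print(round(statistics.stdev(data), 2))  # the built-in function gives the same value
```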
What does it mean?
A small STD indicates that the data is
clustered closely around the mean
A large STD indicates a wider spread
around the mean
We may add STD as an error bar on a graph
Assessment Statements
1.1.1 State that error bars are a graphical
representation of the variability of data.
Error bars
Error bars can be added to graphs to show the
range of data or the STD
This shows us how data is spread
What does the size of the error bar tell us?
Bigger error bars show a greater spread of data
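To see what the ends of an error bar would be, here is a small sketch that works out mean +/- 1 STD for the two classes. It assumes the Class A and Class B heights are read row by row from the second height table; error_bar is a name invented for this example.

```python
import statistics

# Heights in cm (assumption: flattened table read as alternating Class A, Class B)
class_a = [150, 162, 182, 165, 177, 175, 166, 163, 172, 178]
class_b = [166, 168, 167, 169, 170, 166, 171, 168, 167, 168]

def error_bar(values):
    """Return (lower, mean, upper) for an error bar of mean +/- 1 standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return mean - sd, mean, mean + sd

for name, heights in (("Class A", class_a), ("Class B", class_b)):
    lower, mean, upper = error_bar(heights)
    print(f"{name}: mean {mean} cm, error bar from {lower:.1f} to {upper:.1f} cm")
```

Class A gets a much longer bar than Class B, even though the two means are almost the same.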
The blue line represents the height of year 12
students in the school. What might the red
line represent?
What if distribution isn’t normal?
Then the mean does not lie at the centre of the
frequency distribution and you cannot use
STD to show the spread of the data
Assessment Statement
1.1.5 Deduce the significance of the difference
between two sets of data using calculated
values for t and the appropriate tables.
Student t Test
The t test tells us the probability (P) that the difference between two sets of
data is due to chance
If P = 1 the two sets of data are exactly the same
If P = 0 the two sets of data are not at all the same
The higher the value of P the more the data overlap
Smaller overlap = more significant results
How do we use a t test?
A researcher wishes to learn
whether the pH of soil affects
seed germination of a plant
found in forests near her home.
She filled 10 flower pots with acid
soil (pH 5.5) and 10 flower pots
with neutral soil (pH 7.0) and
planted 100 seeds in each pot.
The number of seeds that
germinated in each pot is shown below.
Acid Soil pH 5.5    Neutral Soil pH 7.0
42                  43
45                  51
40                  56
37                  40
41                  32
41                  54
48                  51
50                  55
45                  50
46                  48
Hypothesis
The researcher is testing whether soil pH affects
germination of the herb.
Her hypothesis (H1) states that the mean
germination at pH 5.5 is different from the mean
germination at pH 7.0.
The null hypothesis (H0) states that there is no
significant difference in mean germination between the two soils.
Putting the data into a programme to calculate
the t value gives us an answer of 1.66
GraphPad QuickCalcs: t test calculator
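The t value can also be worked out in a few lines of code. This sketch assumes an unpaired t test with pooled (equal) variances, and assumes the germination table is read row by row (acid, then neutral). The slides do not say which version of the test the calculator used, but this one reproduces the 1.66.

```python
import math
import statistics

# Seeds germinated per pot (assumption: table values alternate acid, neutral)
acid = [42, 45, 40, 37, 41, 41, 48, 50, 45, 46]
neutral = [43, 51, 56, 40, 32, 54, 51, 55, 50, 48]

def pooled_t(sample1, sample2):
    """Return (t, degrees of freedom) for an unpaired t test with pooled variance."""
    n1, n2 = len(sample1), len(sample2)
    df = n1 + n2 - 2
    var1 = statistics.variance(sample1)  # sample variance, divides by n - 1
    var2 = statistics.variance(sample2)
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    difference = abs(statistics.mean(sample1) - statistics.mean(sample2))
    return difference / standard_error, df

t, df = pooled_t(acid, neutral)
print(f"t = {t:.2f}, degrees of freedom = {df}")
```

The means are 43.5 (acid) and 48 (neutral), and the result is t = 1.66 with 18 degrees of freedom.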
We can look this value up in a t table.
The t table tells you how confident you can be
that your values are different
t-test table
1. Select the column with the probability that you want,
e.g. 0.05 means you can be 95% confident that the difference is not due to chance.
2. Select the row for degrees of freedom.
3. For two data sets the number of degrees of freedom is equal to
(n1 + n2) - 2. In this case (10 + 10) - 2 = 18
4. Compare the critical value in the table with your t-value.
5. The results are significant if the t-value is greater than the
critical value.
So, our critical value from the t table (18 degrees of freedom, P = 0.05) is 2.10
Our calculated t value is 1.66
If t < critical value we accept the null hypothesis
If t > critical value we reject the null hypothesis
In this case 1.66 (t) < 2.10 (critical value)
So we accept the null hypothesis.
There is no significant evidence that pH affects the germination of the plant.
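The decision rule can be written as a short check. This is only a sketch: the dictionary holds a three-row excerpt of two-tailed critical values at P = 0.05 from a standard t table (about 2.10 for 18 degrees of freedom), and the names CRITICAL_T_P05 and is_significant are invented for the example.

```python
# Two-tailed critical values of t at P = 0.05, keyed by degrees of freedom.
# Excerpt only; a full t table has a row for every value of degrees of freedom.
CRITICAL_T_P05 = {10: 2.228, 18: 2.101, 20: 2.086}

def is_significant(t_value, degrees_of_freedom):
    """True if t is greater than the critical value, so the null hypothesis is rejected."""
    critical = CRITICAL_T_P05[degrees_of_freedom]
    return abs(t_value) > critical

t_value = 1.66                 # calculated t value from the germination data
df = (10 + 10) - 2             # (n1 + n2) - 2

if is_significant(t_value, df):
    print("Reject the null hypothesis: the difference is significant")
else:
    print("Do not reject the null hypothesis: the difference is not significant")
```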
Limitations to the t test
For the t test to be applied:
The data must have a normal distribution
Must have a sample size of at least 10
Assessment Statement
1.1.6 Explain that the existence of a correlation
does not establish that there is a causal
relationship between two values.
Correlation: Relationship between
two quantities such that when one
quantity changes the other does too
Correlation and Causation
A phrase used to emphasize that correlation between two variables does not
automatically imply that one causes the other.
Correlation: The more firemen fighting
a fire, the bigger the fire is going to be.
Causation: Firemen make fires bigger
Correlation: As ice cream sales increase, the rate of drowning deaths
increases sharply.
Causation: Ice cream causes drowning
Correlation: Since the 1950s, both the atmospheric CO2 level and crime
levels have increased sharply.
Causation: Atmospheric CO2 causes crime
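A sketch of how a "third variable" can produce a strong correlation. All the numbers below are invented purely for illustration (they are not real sales or drowning figures): daily temperature is assumed to drive both ice cream sales and the number of people swimming.

```python
import math

# Invented illustrative values, NOT real data
temperature = [10, 15, 20, 25, 30, 35]      # daily temperature in degrees C
ice_cream_sales = [22, 29, 41, 52, 58, 71]  # ice creams sold per day
drownings = [1, 2, 2, 4, 5, 6]              # drowning incidents

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(ss_x * ss_y)

print(f"ice cream vs drownings:   r = {pearson_r(ice_cream_sales, drownings):.2f}")
print(f"temperature vs ice cream: r = {pearson_r(temperature, ice_cream_sales):.2f}")
print(f"temperature vs drownings: r = {pearson_r(temperature, drownings):.2f}")
```

Ice cream sales and drownings are strongly correlated in this made-up data set, yet neither causes the other: both simply rise with temperature.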
Determining Causation
Imagine you did badly on a test and guessed
that the cause was not studying.
How could you prove this?
If one could rewind history, and change only one
small thing, then causation could be observed.
The same student writing the same test under the
same circumstances but having studied the
night before.
A major goal of scientific experiments
is to control variables as best as
possible.
We could run an experiment on identical twins who were known to consistently get the
same grades on their tests.
One twin is sent to study for six hours while the other is sent to the ...
If their test scores suddenly diverged by a large degree, this would be strong evidence
that studying had a causal effect on test scores.
Correlation between studying and test scores would almost certainly ...
Headline
“Diet of fish can prevent teen violence.”
Participants were a group of 3-year-olds given an “enriched diet,
exercise, and cognitive stimulation.” They were compared to
a control group who did not go through this same program.
By age 23 they were 64% less likely than a control group of
children not on the program to have criminal records.
Assume, of course, that the enriched diet included fish.
Note, also, that the media article does not mention what the
other kids ate or did.
Does the data support the headline?
What are some “third variable” explanations?
How could you reword the headline?
Headline
“Higher beer prices cut gonorrhea rates”
The research suggests “that raising the price of a six-pack of beer by
20 cents would cut gonorrhea rates by almost 9%”
Researchers considered gonorrhea rates from 1981 to 1995 among
teens and young adults in states that raised the legal drinking age
or increased the state beer tax.
“Of the 36 beer tax increases that we reviewed, gonorrhea rates
declined among teens aged 15 to 19 in 24 instances. Among young
adults aged 20 to 24, they declined in 26 instances.”
Important side note: 1981 is also when the CDC recognized AIDS and
HIV; condoms protect against both HIV and gonorrhea.
Does the data support the headline?
What are some “third variable” explanations?
How could you reword the headline?
Headline
“Luckiest people” born in summer
Online public survey (40,000 people)
Those born in May were most likely to consider themselves lucky;
those born in October had most negative views of their life.
People who took part in the survey gave their birthdates and
rated the degree to which they saw themselves as lucky or
unlucky
The poll found there was a summer-winter divide between people
born from March to August and those born from September to
February.
50% of those born in May considered themselves lucky; 43% of
those born in October.
It isn’t clear when the survey took place (i.e., what month)
Does the data support the headline?
What are some “third variable” explanations?
How could you reword the headline?
See some interesting trends on this website:
Gapminder World