Statistics for A2 biology

Statistics is . . .
The branch of mathematics dealing with . . .
• Uncertainty, or
• Probability
Checking for Patterns: the Villagers
• Two groups of men from separate villages.
• One group looks taller.
• Is there a real difference?
• Could these just be two samples from the same group, with sampling error?
• What is the chance that both samples could have come from the same population?
Hypotheses: support or reject
In Science, we can’t prove that there is a difference between
the two villages.
We make predictions (hypotheses), carry out experiments
and examine the data. We can then say one of two things:
• The hypothesis is supported by the data,
• The hypothesis is not supported and can therefore be rejected.
When using statistics, we make two hypotheses:
• The Null Hypothesis, H0, proposes that there is no pattern at all, e.g. the observed difference in heights in the two villages is a result of sampling error. Both villages are part of the same population.
• The alternative hypothesis, H1, proposes that there is a pattern, e.g. the observed difference in heights in the two villages cannot be accounted for by pure chance.
Checking for Patterns: the light-seeking fleas
[Graph: % positively phototaxic (0 to 50) against time in hours (0 to 100)]
• fleas are tested at various times after hatching
• we record how many move towards the light (positive
phototaxis)
• there seems to be a clear positive correlation
• what is the probability that this apparent pattern is a result of
pure chance?
Hypotheses for the flea experiment
• The Null Hypothesis, H0, proposes that there is no pattern at all: the apparent correlation between time since hatching and percentage of positively phototaxic fleas is the result of pure chance.
• The alternative hypothesis, H1, proposes that there is a pattern: the apparent correlation between time since hatching and percentage of positively phototaxic fleas cannot be accounted for by pure chance.
Checking for patterns: genetic ratios
• Gregor Mendel, the founder of genetics, crossed
two pea plants with green pods. Both were
heterozygous for the recessive characteristic yellow
pods. In the next generation 428 plants had green
pods and 152 yellow pods.
• According to the theory, the expected ratio is 3:1.
The actual ratio is 2.82:1.
• Could this deviation from the ideal ratio be
produced by pure chance or is there a significant
deviation, so that there is not adequate support for
the hypothesis?
Hypotheses for the genetics cross
• The Null Hypothesis, H0, proposes that the deviation from the expected 3:1 ratio could have been produced by pure chance.
• The alternative hypothesis, H1, proposes that the deviation from the expected 3:1 ratio could not have been produced by pure chance; in other words, it is not a 3:1 ratio.
• These hypotheses may seem rather strange, as the expected ratio is in the null hypothesis. This time we want support for the null hypothesis, as this means that the interpretation of a 3:1 ratio is correct.
• The golden rule is not broken: null hypotheses always propose that variations are produced by chance.
The Villagers
We ask the men to stand in order of height.
What do you notice about the ranked men?
The villagers (2)
Position in the line (men of equal height share a position):
1  1  3  3  5  5  7  8  9  9  9  12  13  13  13  16  16  16  19  20
Rank (equal heights given the average rank):
1.5  1.5  3.5  3.5  5.5  5.5  7  8  10  10  10  12  14  14  14  17  17  17  19  20
BLUE VILLAGE: 1.5, 1.5, 3.5, 3.5, 5.5, 7, 10, 12, 14, 17
ORANGE VILLAGE: 5.5, 8, 10, 10, 14, 14, 17, 17, 19, 20
1) We give a number for the rank of each.
2) With equal ranks, we give the average of the sequence of equals
3) Collect the ranks for the two villages into two groups
Average ranks are quite different (BLUE 7.55; ORANGE 13.45), but
could this happen by chance?
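Step 2, giving tied men the average of the ranks they occupy, can be sketched in Python. This is only an illustrative sketch: the function name `average_ranks` is an assumption, and the input is the men's simple positions in the line from the slide, used here as a stand-in for the heights, which the slides do not give.

```python
def average_ranks(values):
    """Rank values from smallest (rank 1) upward; tied values share the
    average of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run for as long as the next value ties with this one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = ((pos + 1) + (end + 1)) / 2  # average of positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

# Simple positions from the slide (tied men share a position)
positions = [1, 1, 3, 3, 5, 5, 7, 8, 9, 9, 9, 12, 13, 13, 13, 16, 16, 16, 19, 20]
ranks = average_ranks(positions)
print(ranks)
```

The printed list should match the averaged ranks on the slide (1.5, 1.5, 3.5, 3.5, ... 19, 20), and the ranks always add up to n(n + 1)/2, here 210.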
The Mann-Whitney Test (1)
Average rank for BLUE is 7.55;
Average rank for ORANGE 13.45;
This looks pretty convincing, but to be sure, we have to allow for the
size of samples as a big difference in ranks gives more certainty when
the sample is big.
First we calculate the total of the ranks in each group:
R1 = Σ(ranks for blue village) = 75.5
R2 = Σ(ranks for orange village) = 134.5
(Σ, the Greek letter sigma, means “the sum of”.)
Now, we calculate two values for Mann-Whitney’s U, as follows:
U1 = n1.n2 + 0.5n1(n1 + 1) - R1
U2 = n1.n2 + 0.5n2(n2 + 1) - R2
(n1 and n2 are the numbers in the samples, in this case both 10.)
The Mann-Whitney Test (2)
R1 = Σ(ranks for blue village) = 75.5
R2 = Σ(ranks for orange village) = 134.5
Now, we calculate two values for Mann-Whitney’s U, as follows:
U1 = n1.n2 + 0.5n1(n1 + 1) - R1
   U1 = (10 x 10) + 0.5 x 10 x (10 + 1) - 75.5
      = 100 + 55 - 75.5 = 79.5
U2 = n1.n2 + 0.5n2(n2 + 1) - R2
   U2 = (10 x 10) + 0.5 x 10 x (10 + 1) - 134.5
      = 100 + 55 - 134.5 = 20.5
As a check, U1 + U2 should equal n1.n2: 79.5 + 20.5 = 100.
The lower of the two values of U (20.5) counts for the next stage: significance testing.
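The calculation can be checked with a short Python sketch. It applies the two formulas exactly as written, with (n + 1) in the middle term, to the rank lists for the two villages; the function name `mann_whitney_u` is an assumption made for illustration.

```python
# Ranks for each village, as listed on the villagers slide
blue = [1.5, 1.5, 3.5, 3.5, 5.5, 7, 10, 12, 14, 17]
orange = [5.5, 8, 10, 10, 14, 14, 17, 17, 19, 20]

def mann_whitney_u(ranks_1, ranks_2):
    """Return (U1, U2) from the ranks of two samples."""
    n1, n2 = len(ranks_1), len(ranks_2)
    r1, r2 = sum(ranks_1), sum(ranks_2)        # R1 and R2, the rank totals
    u1 = n1 * n2 + 0.5 * n1 * (n1 + 1) - r1
    u2 = n1 * n2 + 0.5 * n2 * (n2 + 1) - r2
    return u1, u2

u1, u2 = mann_whitney_u(blue, orange)
print(u1, u2, min(u1, u2))  # 79.5 20.5 20.5
```

A useful check on the arithmetic is that U1 + U2 always equals n1 x n2 (here 100).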
The Mann-Whitney Test (3)
Firstly let us return to the hypotheses for this problem:
• The Null Hypothesis H0: the observed difference in heights in the two villages is a result of sampling error. Both villages are part of the same population.
• The alternative hypothesis H1: the observed difference in heights in the two villages cannot be accounted for by pure chance.
We look up the smaller of the two values of U in a significance table.
This gives us the probability that a difference this large could arise if the null hypothesis were true, i.e. if both villages were part of the same population and the difference were a result of sampling error.
Critical values of U at the 5% Significance Level

n1 \ n2    3    4    5    6    7    8    9   10
   3       -    -    -    1    1    2    2    3
   4       -    -    1    2    3    4    4    5
   5       -    1    2    3    5    6    7    8
   6       1    2    3    5    6    8   10   11
   7       1    3    5    6    8   10   12   14
   8       2    4    6    8   10   13   15   17
   9       2    4    7   10   12   15   17   20
  10       3    5    8   11   14   17   20   23
  11       3    6    9   13   16   19   23   26
(- = no value given)
• The title says that this table is for a probability of 5%, or p = 0.05.
• We look up the number where the correct values for n1 and n2 meet.
• If our calculated value for U is less than or equal to the critical value, then the probability of such a result under the null hypothesis is less than 5% (p<0.05).
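The look-up rule can be written as a small Python sketch. The dictionary holds only three entries copied from the n1 = 10 row of the 5% table; the names `CRITICAL_U_5_PERCENT` and `reject_null` are assumptions made for illustration.

```python
# A few critical values from the 5% table, keyed by (n1, n2)
CRITICAL_U_5_PERCENT = {
    (10, 8): 17,
    (10, 9): 20,
    (10, 10): 23,
}

def reject_null(u1, u2, n1, n2, table=CRITICAL_U_5_PERCENT):
    """True when the smaller U is less than or equal to the critical value."""
    return min(u1, u2) <= table[(n1, n2)]

# U values recalculated from the rank totals R1 = 75.5 and R2 = 134.5
u1 = 10 * 10 + 0.5 * 10 * (10 + 1) - 75.5
u2 = 10 * 10 + 0.5 * 10 * (10 + 1) - 134.5
significant = reject_null(u1, u2, 10, 10)
print(significant)  # True
```

Note that, unlike most tests, a small value of U is what counts as significant here.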
The Mann-Whitney Test (4)
The probability (p) that the observed difference in heights in the two
villages is a result of sampling error and that both villages are part of the
same population is given by:
p<0.05
We therefore reject the null hypothesis.
We can say that there is a significant difference between the heights of
men in the two villages.
We can say that the difference is significant at the 5% level.
(To find out whether the difference is also significant at a stricter level, such as 1%, we would need a different table giving the critical values for that level.)
The advantage of replicates
Critical values of U for a One-Tailed Test at the 0.05 Significance Level or a Two-Tailed Test at the 0.1 Significance Level (N.B. in biology, a probability of 0.1 is not significant).

n1\n2   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
  2     -   -   -   -   -   -   1   1   1   1   2   2   2   3   3   3   4   4   4
  3     -   -   -   1   2   2   3   3   4   5   5   6   7   7   8   9   9  10  11
  4     -   -   1   2   3   4   5   6   7   8   9  10  11  12  14  15  16  17  18
  5     -   1   2   4   5   6   8   9  11  12  13  15  16  18  19  20  22  23  25
  6     -   2   3   5   7   8  10  12  14  16  17  19  21  23  25  26  28  30  32
  7     -   2   4   6   8  11  13  15  17  19  21  24  26  28  30  33  35  37  39
  8     1   3   5   8  10  13  15  18  20  23  26  28  31  33  36  39  41  44  47
  9     1   3   6   9  12  15  18  21  24  27  30  33  36  39  42  45  48  51  54
 10     1   4   7  11  14  17  20  24  27  31  34  37  41  44  48  51  55  58  62
 11     1   5   8  12  16  19  23  27  31  34  38  42  46  50  54  57  61  65  69
 12     2   5   9  13  17  21  26  30  34  38  42  47  51  55  60  64  68  72  77
 13     2   6  10  15  19  24  28  33  37  42  47  51  56  61  65  70  75  80  84
 14     2   7  11  16  21  26  31  36  41  46  51  56  61  66  71  77  82  87  92
 15     3   7  12  18  23  28  33  39  44  50  55  61  66  72  77  83  88  94 100
 16     3   8  14  19  25  30  36  42  48  54  60  65  71  77  83  89  95 101 107
 17     3   9  15  20  26  33  39  45  51  57  64  70  77  83  89  96 102 109 115
 18     4   9  16  22  28  35  41  48  55  61  68  75  82  88  95 102 109 116 123
 19     4  10  17  23  30  37  44  51  58  65  72  80  87  94 101 109 116 123 130
 20     4  11  18  25  32  39  47  54  62  69  77  84  92 100 107 115 123 130 138
(- = no value given)

Note how the critical values increase with bigger samples. As a significant difference requires a low number, doing many replicates makes it easier to demonstrate a significant difference.
Tailed-ness in the U test
One and two tailed tests
• In a two tailed test, your alternative hypothesis simply proposes that there is a difference between the two groups compared.
• It proposes that the groups are different but not which group has the higher values.
• In a one tailed test, your alternative hypothesis proposes that there is a difference between the two groups compared, with a definite direction (either the first or the second group has the higher values).
• In a good experiment, the investigator should be able to make a prediction with direction, and one-tailed tests are the rule.
testing for correlation (1)
• As with the villagers, the statistical test used for the flea example uses ranking, but in a different way.
• Check out this graph showing “perfect” correlation.
[Graph: six data points, with ranks 1 to 6 marked on both axes]
• Now, give each data point a rank on the x axis . . .
• And then on the y axis.
• Now make a table of pairs of rankings . . .

x ranks   1   2   3   4   5   6
y ranks   1   2   3   4   5   6

There is a perfect match.
testing for correlation(2)
• Now, check out this graph showing a less than perfect correlation.
[Graph: six data points, with ranks marked on both axes]
• Again, give each data point a rank on the x axis . . .
• And then on the y axis.
• Now make a table of pairs of rankings . . .

x ranks   1   2   3   4   5=  5=
y ranks   1   3   2   5   5   6

The rankings no longer match perfectly.
testing for correlation (3)
• Now, check out this graph showing no apparent correlation.
[Graph: six data points, with ranks marked on both axes]
• Again, give each data point a rank on the x axis . . .
• And then on the y axis.
• Now make a table of pairs of rankings . . .

x ranks   1   2   3   4   5   6
y ranks   5=  2   4   1   5=  3

There are a lot of mismatches in the rankings.
testing for correlation (4)
Spearman’s rank correlation test (1)
• Spearman’s test starts with a table comparing rankings on the x and y axes,
• It gives a single number, called the correlation coefficient,
• The symbol for this is r,
• r is in the range +1.0 to -1.0,
• When r = +1.0, this means a perfect positive correlation, with the line of best fit going up from bottom left to top right, and the rankings the same on both the x and the y axis,
• When r = -1.0, this means a perfect negative correlation, with the line of best fit going down from top left to bottom right, and the rankings exactly the opposite on the x and the y axis,
• When r = 0, there is no correlation,
• When r is between 0 and either -1 or +1, there is a weaker correlation.
[Example scatter graphs: r = -1.0, r = +1.0, r = -0.6, r = 0.0, r = +0.8]
Spearman’s rank correlation test (2)
Calculating the correlation coefficient, r
Steps: tabulate data; rank for x; rank for y; calculate difference between ranks (d); square this (d²).

hrs after    %             x ranks   y ranks   d            d²
hatching     phototaxic    (rx)      (ry)      (rx - ry)
5            3             1         1          0           0
18           5             2         2          0           0
25           12            3         4         -1           1
31           7             4         3          1           1
42           17            5         5          0           0
50           22            6         6          0           0
88           35            7         7          0           0
64           37            8         8.5       -0.5         0.25
72           46            9         10        -1           1
80           37            10        8.5        1.5         2.25

Σd² = 5.5 (sum of squared deviations)
n = 10 (number of pairs)
Spearman’s rank correlation test (3)
Calculating the correlation coefficient, r (cont.)
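r is calculated according to this equation:
$$ r = 1 - \frac{6 \sum d^2}{n^3 - n} $$
(Σd² = 5.5 is already a sum of squares, so it goes into the equation as it stands.)
$$ r = 1 - \frac{6(5.5)}{1000 - 10} $$
$$ \Rightarrow r = 1 - \frac{33}{990} $$
$$ r = +0.967 $$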
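The equation can be checked against the rank columns of the flea table with a short Python sketch. The function name `spearman_r` is an assumption; the sketch uses the simple formula on this slide, which treats Σd² as the sum of the squared rank differences.

```python
# Rank columns (rx and ry) from the flea table
x_ranks = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y_ranks = [1, 2, 4, 3, 5, 6, 7, 8.5, 10, 8.5]

def spearman_r(rx, ry):
    """Spearman's rank correlation coefficient from paired ranks."""
    n = len(rx)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))  # sum of squared differences
    return 1 - 6 * sum_d2 / (n ** 3 - n)

sum_d2 = sum((a - b) ** 2 for a, b in zip(x_ranks, y_ranks))
r = spearman_r(x_ranks, y_ranks)
print(sum_d2, round(r, 3))  # 5.5 0.967
```

Identical rankings give r = +1.0 and exactly reversed rankings give r = -1.0, which is a quick way to test any implementation.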
Spearman’s rank correlation test (3)
Significance of the correlation coefficient, r
Consider this graph:
Add a best fit line:
• The points are “all” on the best fit line,
• The ranks are the same for “all” points,
• So the correlation coefficient is: r = + 1.0
• But what does this mean?
• Actually, nothing, because you can always draw a straight line through two points.
• If the next point comes here (blue dot) . . . then the correlation looks “safer”,
• but if it comes here (green dot) . . . then the correlation looks highly unlikely.
Significance of the correlation test (2)
We look up the calculated value for r in a significance table.
This gives us the probability for the null hypothesis: that the apparent correlation on the graph could have been obtained by pure chance.
• We find the correct line in the table. ν (the Greek letter nu) is the number of data points: in this case 10.
• We see where our calculated value for r fits on the line. In this case it is to the right of the biggest number.
• The value of r is greater than the critical value for p = 0.01 (2-tailed test) or p = 0.005 (1-tailed test). So the null hypothesis is very unlikely and we have excellent support for the alternative hypothesis.

Significance level (one-tailed)   0.05     0.025    0.01     0.005
Significance level (two-tailed)   0.1      0.05     0.02     0.01
ν = 5                             0.900    1.000    1.000    -
ν = 6                             0.829    0.886    0.943    1.000
ν = 7                             0.714    0.786    0.893    0.929
ν = 8                             0.654    0.738    0.833    0.881
ν = 9                             0.600    0.683    0.783    0.833
ν = 10                            0.564    0.648    0.745    0.794
ν = 11                            0.523    0.623    0.736    0.818
ν = 12                            0.497    0.591    0.703    0.780
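Reading along the line for ν = 10 can be sketched in Python as follows. The list of critical values is copied from the table; the name `strictest_level` and the use of "greater than or equal to" as the cut-off are assumptions made for this sketch.

```python
ONE_TAILED_LEVELS = [0.05, 0.025, 0.01, 0.005]
CRITICAL_R_NU_10 = [0.564, 0.648, 0.745, 0.794]  # table line for nu = 10

def strictest_level(r, levels=ONE_TAILED_LEVELS, critical=CRITICAL_R_NU_10):
    """Return the smallest one-tailed level whose critical value r reaches,
    or None if r is below every critical value on the line.
    abs() ignores direction: in a one-tailed test the sign of r must also
    match the predicted direction."""
    reached = [lvl for lvl, crit in zip(levels, critical) if abs(r) >= crit]
    return min(reached) if reached else None

# r from the flea data: sum of d squared = 5.5, n = 10
r = 1 - 6 * 5.5 / (10 ** 3 - 10)
level = strictest_level(r)
print(level)  # 0.005
```

A value of r below 0.564 would return None, meaning the correlation is not significant even at the 5% level for 10 data points.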
Significance of the correlation test (3)
One and two tailed tests
• In a two tailed test, your alternative hypothesis simply proposes that there is some sort of a correlation between the variables x and y . . .
• It proposes that the variables are linked but not whether it is a positive or a negative correlation.
• In a one tailed test, your alternative hypothesis proposes that there is a correlation between the variables x and y with a definite direction (either positive or negative),
• In a good experiment, the investigator should be able to make a prediction with direction, and one-tailed tests are the rule.
Significance of the correlation test (4)
Final conclusion on the flea experiment
• We return to the original null hypothesis and give its probability . . .
The probability (p) that the observed positive correlation between the age of the fleas (in hours since hatching) and % positive phototaxis could have been produced by purely random variation is given by p < 0.005 (or p < 0.5%) . . .
• We then give the “other side of the coin”: the support for the alternative hypothesis . . .
. . . so that the hypothesis that phototaxic behaviour of fleas is related to the age of the fleas is supported at the 0.5% level.
The statistical advantage of thoroughness

Significance level (one-tailed)   0.05     0.025    0.01     0.005
Significance level (two-tailed)   0.1      0.05     0.02     0.01
ν = 5                             0.900    1.000    1.000    -
ν = 6                             0.829    0.886    0.943    1.000
ν = 7                             0.714    0.786    0.893    0.929
ν = 8                             0.654    0.738    0.833    0.881
ν = 9                             0.600    0.683    0.783    0.833
ν = 10                            0.564    0.648    0.745    0.794
ν = 11                            0.523    0.623    0.736    0.818
ν = 12                            0.497    0.591    0.703    0.780
ν = 13                            0.475    0.566    0.673    0.745
ν = 14                            0.457    0.545    0.646    0.716
ν = 15                            0.441    0.525    0.623    0.689
ν = 20                            0.377    0.450    0.534    0.591
ν = 25                            0.366    0.400    0.475    0.526
ν = 30                            0.305    0.364    0.432    0.476
ν = 35                            0.282    0.336    0.399    0.442
ν = 40                            0.263    0.314    0.373    0.413
ν = 50                            0.235    0.280    0.332    0.368
ν = 60                            0.214    0.255    0.303    0.335
ν = 70                            0.198    0.236    0.280    0.310
ν = 80                            0.185    0.221    0.262    0.290
ν = 90                            0.174    0.208    0.247    0.273
ν = 100                           0.165    0.197    0.234    0.259

• Look down the column for p = 0.01 (two-tailed).
• With bigger samples, the critical value becomes smaller,
• This means it is easier to show that there is a correlation,
• Unfortunately, this means spending more time collecting data . .
• but it is worth it, to get a conclusive result.
Problems with correlation: 1 non-linearity
The data in the table relate the number of reptile and amphibian species to the size of an oceanic island.

area / km²   species count   (rx)   (ry)
3            5               1      1
18           7               2      2
128          9               3      3
245          13              4      4
1130         20              5      6
1300         18              6      5
5632         50              7      8
6213         45              8      7
45405        85              9      10
53060        78              10     9

The rankings are very similar and clearly, this will be a significant correlation . . .
r = 0.964, with ν = 10, p < 0.001
But look at the graph!
[Two graphs: no. of species against area of island / km², one with linear scales and one with logarithmic scales]
This is not a linear relationship; the graph looks right when both scales are logarithmic.
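The island figures can be reproduced with a short Python sketch, which also shows why the rank test is unaffected by the curve in the graph: taking logarithms changes the shape of the plot but not the order of the points, so the ranks stay the same. The function names are assumptions, and `simple_ranks` only works because this data set has no tied values.

```python
import math

area = [3, 18, 128, 245, 1130, 1300, 5632, 6213, 45405, 53060]   # km2
species = [5, 7, 9, 13, 20, 18, 50, 45, 85, 78]

def simple_ranks(values):
    """Rank from smallest (1) to largest; assumes there are no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman_r(rx, ry):
    n = len(rx)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * sum_d2 / (n ** 3 - n)

rx, ry = simple_ranks(area), simple_ranks(species)
r = spearman_r(rx, ry)
log_rx = simple_ranks([math.log10(a) for a in area])

print(round(r, 3))    # 0.964
print(log_rx == rx)   # True: a log scale does not change the ranks
```

So a high rank correlation says that species number rises with area, but says nothing about whether the rise follows a straight line.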
Problems with correlation: 2 “rogue points”
Spreadsheets use statistical techniques to calculate the equation for
the “best fit line”.
But a single “rogue point” (ringed) can distort the line considerably.
The true line is more like the one shown in red.
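It is often better to draw the line yourself. You need to decide which points to ignore, and whether the relationship is linear.
[Graph: scatter plot with a ringed “rogue point”, the spreadsheet’s best fit line and the truer line in red; axes run from 0 to 12 and from 0 to 30]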
Problems with correlation: 3
IS CORRELATION THE SAME AS CAUSATION?
Consider this graph:
[Graph: number of public houses against number of places of worship]
Nobody would suggest that an increase in the number of churches, mosques, synagogues and other places of worship causes an increase in the number of public houses.
They are both related to a third variable . . .
The size of the community.
Problems with correlation: 4
IS CORRELATION THE SAME AS CAUSATION?
SMOKING AND LUNG CANCER
Even the tobacco companies cannot deny the correlation between
cigarette consumption and risk of lung cancer.
But they have brought the idea of causation into question . . .
. . . suggesting a third variable which has nothing to do with smoking, e.g. a certain gene, which has two effects: one to increase the risk of cancer and two to make a person more likely to take up cigarette smoking.
HIV AND AIDS
A very controversial hypothesis suggested that the presence of particles of the human immunodeficiency virus (HIV) was not the cause of AIDS but just another symptom.
The true cause was suggested to be the reckless and irresponsible lifestyle of the patient.
Genetic Ratios: are deviations significant?
• Mendel crossed two pea plants with green pods. Both were
heterozygous for the recessive characteristic yellow pods.
• In the offspring, 428 plants had green pods and 152 had yellow pods.
• The expected ratio is 3:1. The actual ratio is 2.82:1.
• Could this deviation from the ideal ratio be produced by pure chance or is there a significant deviation?
• This is a job for the χ² test (chi squared).
• This test compares actual numerical patterns with expected patterns
and gives the probability that chance could have caused the deviations.
Checking Genetic Ratios with χ²
Enter the observed values into the first column of a table (O = observed),
Calculate the values expected for a “perfect ratio”: total offspring = 580; ¾ of this is 435 and ¼ is 145,
Enter these values in the second column (E = expected),
In the third column, calculate deviations from the expected values (O - E),
Square this value in the fourth column, and
in the final column divide by the expected.
χ² is the sum of the final column.

O      E      O - E   (O - E)²   (O - E)² / E
428    435    -7      49         0.113
152    145     7      49         0.338
                      χ² =       0.451
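The table can be reproduced with a few lines of Python. This is a minimal sketch of the steps above; the variable names are chosen for illustration only.

```python
observed = [428, 152]   # green pods, yellow pods
ratio = [3, 1]          # expected 3:1 ratio

total = sum(observed)                                      # 580 offspring
expected = [total * part / sum(ratio) for part in ratio]   # 3/4 and 1/4 of the total

# (O - E) squared, divided by E, summed over both classes
chi_squared = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print(expected)               # [435.0, 145.0]
print(round(chi_squared, 3))  # 0.451
```

The same lines work for any expected ratio, for example `ratio = [9, 3, 3, 1]` with four observed counts.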
Checking Genetic Ratios with χ²: 2
THE SIGNIFICANCE TEST
The value of 0.451 for χ² does not mean anything yet.
First, we must look up the value in significance tables,
As with the Spearman’s rank table, there are lines in the χ² table for different values of ν, the number of degrees of freedom,
For χ², this is the number of categories minus one. Here there are two categories (green pods and yellow pods), so in this case ν = 1,
We see where our calculated value fits on this line,
It is well below the critical value for p = 0.05, so we give the probability of the null hypothesis as:
p > 0.05

        Significance level
ν       0.05     0.01     0.005    0.001
1       3.84     6.64     7.88     10.83
2       5.99     9.21     10.60    13.82
3       7.82     11.34    12.84    16.27
4       9.49     13.28    14.86    18.46
5       11.07    15.09    16.75    20.52
6       12.59    16.81    18.55    22.46
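The look-up can be sketched in Python using the first three lines of the table. The names `CRITICAL_CHI2` and `probability_statement` are assumptions made for this sketch, which treats a calculated value equal to the critical value as significant.

```python
LEVELS = [0.05, 0.01, 0.005, 0.001]
CRITICAL_CHI2 = {            # degrees of freedom -> critical values from the table
    1: [3.84, 6.64, 7.88, 10.83],
    2: [5.99, 9.21, 10.60, 13.82],
    3: [7.82, 11.34, 12.84, 16.27],
}

def probability_statement(chi2, categories):
    """Return 'p > 0.05' or 'p < level' for the strictest level reached."""
    df = categories - 1      # degrees of freedom
    passed = [lvl for lvl, crit in zip(LEVELS, CRITICAL_CHI2[df]) if chi2 >= crit]
    if not passed:
        return "p > 0.05"
    return f"p < {min(passed)}"

statement = probability_statement(0.451, 2)
print(statement)  # p > 0.05
```

For comparison, a χ² of 7.0 with two categories would lie between 6.64 and 7.88 and so give p < 0.01.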
Checking Genetic Ratios with χ²: 3
THE SIGNIFICANCE TEST: 2
What does this probability mean?
Let us return to the null hypothesis:
• The Null Hypothesis, H0, proposes that the deviation from the expected 3:1 ratio could have been produced by pure chance.
As the probability is greater than 5%, we cannot reject the null hypothesis!
At first, this looks like a failure, until we realize that this is just what we want:
There is no significant deviation from a 3:1 ratio, so the data support the interpretation that this is a “good” 3:1 ratio.
NO tailed-ness in the χ² test
One and two tailed tests
As hypotheses for χ² tests predict “fit” or “no-fit” and have no direction, there are no one-tailed or two-tailed tests.
Statistics: which test?
PURPOSE: To compare two groups, e.g. heights of trees from different woods, or speed of breakdown of protein by two different enzymes.
WHICH TEST? THE MANN-WHITNEY U TEST
REQUIRES: At least 6 in each group; in an experiment, 6 replicates!

PURPOSE: To check for correlation between two variables, e.g. effect of temperature on metabolic rate.
WHICH TEST? SPEARMAN’S RANK CORRELATION TEST
REQUIRES: Different sources give a minimum of between 8 and 15 data points.

PURPOSE: To check for goodness of fit to a numerical pattern, e.g. are woodlice randomly distributed in a choice chamber?
WHICH TEST? THE χ² TEST
REQUIRES: 2 numbers