Licence
This presentation is © 2010-11, Anne Segonds-Pichon.
This presentation is distributed under the Creative Commons Attribution-Non-Commercial-Share Alike 2.0 licence. This means that you are free:
to copy, distribute, display, and perform the work
to make derivative works
Under the following conditions:
Attribution. You must give the original author credit.
Non-Commercial. You may not use this work for commercial purposes.
Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under a licence identical to this one.
Please note that:
For any reuse or distribution, you must make clear to others the licence terms of this work.
Any of these conditions can be waived if you get permission from the copyright holder.
Nothing in this licence impairs or restricts the author's moral rights.
Full details of this licence can be found at
http://creativecommons.org/licenses/by-nc-sa/2.0/uk/legalcode
Introduction to statistics
with GraphPad Prism 5
Anne Segonds-Pichon
([email protected])
Experimental design
Power analysis
Sample size
Experimental design
Think stats!!
• Translate your biological question into stats.
• What type of data are you going to collect?
e.g. Categorical or quantitative?
• Very important:
Difference between technical and biological replicates.
• Technical replicates involve taking one sample from
one tube and analysing it across multiple conditions.
• Biological replicates are different samples measured
across multiple conditions.
Good experimental design relies on 2
principles:
• replication: the more times something is
repeated, the greater the confidence of ending up
with a genuine result.
• randomization: experimental subjects must be
allocated to treatment groups at random
Common errors in the design of experiments:
• experiments done on an ad hoc basis
• e.g. very different group size or time variability not
taken into account
• control and treatment done on different days
• inappropriate choice of treatment group
• e.g. different age group within genotypes
• experiment too large or too small
• e.g. groups so small, no stat analysis can be done,
or only non parametric
Power analysis
• Typically used to estimate a sufficient sample size
• Definition of power: probability of detecting
the specified effect at the specified significance
level.
• Also: the power of a statistical test is the
probability that the test will reject the null
hypothesis when the null hypothesis is false.
• Arbitrarily, accepted power: 80% to 90%
The power analysis depends on the
relationship between 6 variables:
• the effect size of biological interest
• the standard deviation
• the significance level
• the desired power of the experiment
• the sample size
• the alternative hypothesis (i.e. one or two-sided test)
Fix any five of these and a mathematical
relationship can be used to estimate the sixth.
1 The effect size of biological interest
• the larger the effect size, the smaller the
experiment will need to be to detect it.
• measure of effect size: Cohen’s d
with d = (Mean 1 – Mean 2) / pooled SD
• effect size conventions:
• d = 0.20 – small
• d = 0.50 – medium
• d = 0.80 – large
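The formula for Cohen’s d can be sketched in a few lines of Python. This is only an illustration: the function name cohens_d and the two data sets are hypothetical, and the pooled SD is assumed to be the usual weighted combination of the two sample variances.

```python
import math
import statistics


def cohens_d(group1, group2):
    """Cohen's d: difference between the two means divided by the pooled SD."""
    n1, n2 = len(group1), len(group2)
    var1 = statistics.variance(group1)  # sample variance (N-1 denominator)
    var2 = statistics.variance(group2)
    # Pooled SD: each variance is weighted by its degrees of freedom
    pooled_sd = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (statistics.mean(group1) - statistics.mean(group2)) / pooled_sd


# Hypothetical measurements, for illustration only
treatment = [8.1, 8.6, 9.0, 8.4, 8.9]
control = [7.9, 8.2, 8.5, 8.0, 8.4]
print(f"Cohen's d = {cohens_d(treatment, control):.2f}")
```

The result can then be read against the conventions above (0.20 small, 0.50 medium, 0.80 large).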
2 The standard deviation
• ideally: pilot study
• if not: difficult to estimate
• solutions:
• literature
• previous experiments
• best and worst case based on lowest
and highest of the available estimates
3 The significance level
• usually 5% (p<0.05)
• p-value is ‘the probability that a result at least
as extreme as the one actually found could have
been found if the null hypothesis were true’.
• e.g. a difference between 2 means (treatment and
control) as ‘big’ as the one actually found in the
experiment could be found even if the treatment had
no effect.
• Don’t throw away a p-value=0.051 !
4 The desired power of the experiment:
• ~80%
5 The sample size
• that’s the whole point!
• Home Office: 3 Rs:
• replacement, refinement, reduction
Effect size
Standard deviation
Sample size
6 The alternative hypothesis
• is it a one or two-sided test?
Good news: G*Power can do it for you …
… if you know all the parameters.
• with no prior knowledge, the trickiest parameters are
effect size and standard deviation.
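To show how the six variables fit together, here is a rough sketch of a sample size calculation for comparing two group means. It is not what G*Power does internally: it assumes the common normal approximation n per group = 2 * ((z_alpha + z_power) / d)^2, which gives slightly smaller numbers than a calculation based on the t distribution. The function name n_per_group is hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80, two_sided=True):
    """Approximate sample size per group for a standardised effect size d.

    Normal approximation only; a t-based calculation gives slightly
    larger values.
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2) if two_sided else z.inv_cdf(1 - alpha)
    z_power = z.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)


# Conventional small, medium and large effect sizes
for label, d in [("small", 0.2), ("medium", 0.5), ("large", 0.8)]:
    print(f"{label} (d = {d}): about {n_per_group(d)} per group")
```

With a 5% two-sided significance level and 80% power, this gives about 393, 63 and 25 subjects per group for small, medium and large effects: the larger the effect size, the smaller the experiment.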
Power Analysis
[Figure: two plots of Variable (axis from 7.0 to 9.0) for Sample 1 and Sample 2]
Alternative: the resource equation method:
• more appropriate in biology than medical experiments
• more about the existence of a difference than the size
of it
E = N – T where:
E is the error degrees of freedom
N is the total number of experimental units
T is the number of treatment combinations
Example: 5 treatment groups and 6 mice in each group
E = 30 – 5 = 25
Rule of thumb: E should be between 10 and 20
To be handled with care !
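The resource equation is simple enough to check by hand, but a small sketch makes the rule of thumb explicit. The function names and the search over group sizes are illustrative assumptions, not part of the method itself.

```python
def error_df(n_total, n_treatments):
    """Resource equation: E = N - T."""
    return n_total - n_treatments


def group_sizes_within_rule(n_treatments, low=10, high=20, max_size=50):
    """Group sizes (equal groups assumed) for which E falls between low and high."""
    return [
        size
        for size in range(2, max_size + 1)
        if low <= error_df(size * n_treatments, n_treatments) <= high
    ]


# Example from the slide: 5 treatment groups of 6 mice
e = error_df(5 * 6, 5)
print(f"E = {e}, within 10-20: {10 <= e <= 20}")
print("Group sizes giving 10 <= E <= 20:", group_sizes_within_rule(5))
```

For 5 treatment groups, E = 25 is above the suggested range; groups of 3 to 5 animals would give E between 10 and 20.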
Common errors in the statistical
analysis of experiments:
• failure to do any statistical analysis on
numerical data
• failure to screen raw data for errors
• inappropriate standardisation
• misinterpretation of p-values
• inappropriate or incorrect statistical analysis
and in particular the Student’s t-test
Remember:
Stats are all about understanding and controlling
variation.
[Figure: two panels contrasting signal and noise]
If the noise is low then the signal is detectable …
= statistical significance
… but if the noise (i.e. interindividual variation) is large
then the same signal will not be detected
= no statistical significance
In a statistical test, the ratio of signal to noise
determines the significance.
“To consult a statistician after an experiment is finished
is often merely to ask him to conduct a post-mortem examination.
He can perhaps say what the experiment died of.”
R.A.Fisher, 1938
Qualitative data
• = not numerical
• = values taken = usually names (also nominal)
• e.g. variable sex: male or female
• Values can be numbers but not numerical
• e.g. group number = numerical label but not unit of
measurement
• Qualitative variables with an intrinsic order in their
categories = ordinal
• Particular case: qualitative variable with 2
categories: binary or dichotomous
• e.g. alive/dead or male/female
Analysis of qualitative data
Example of data (cats and dogs.xlsx):
• Cats and dogs trained to line dance
• 2 different rewards: food or affection
• Is there a difference between the rewards?
• Is there a significant relationship between my 2
variables?
– are the animals rewarded by food more likely to line
dance than those rewarded by affection?
• To answer this question:
– a Chi-square test
Chi-square test
• In a chi-square test, the observed frequencies for two or
more groups are compared with expected frequencies
by chance.
– With observed frequency = collected data
• Example with the cats and dogs.xlsx
Chi-square test (2)
Expected frequency =
(row total)*(column total)/grand total
Did they dance? * Type of Training * Animal Crosstabulation
(Count, with % within "Did they dance?" in brackets)

Animal  Did they dance?  Food as Reward  Affection as Reward  Total
Cat     Yes              26 (81.3%)      6 (18.8%)            32 (100.0%)
Cat     No               6 (16.7%)       30 (83.3%)           36 (100.0%)
Cat     Total            32 (47.1%)      36 (52.9%)           68 (100.0%)
Dog     Yes              23 (48.9%)      24 (51.1%)           47 (100.0%)
Dog     No               9 (47.4%)       10 (52.6%)           19 (100.0%)
Dog     Total            32 (48.5%)      34 (51.5%)           66 (100.0%)
Example: expected frequency of cats line
dancing after having received food as a
reward:
Probability of line dancing: 32/68
Probability of receiving food: 32/68
Expected proportion: (32/68)*(32/68) = 0.22
Expected frequency: 22% of 68 = 15.1
Did they dance? * Type of Training * Animal Crosstabulation
(Count, with Expected Count in brackets)

Animal  Did they dance?  Food as Reward  Affection as Reward  Total
Cat     Yes              26 (15.1)       6 (16.9)             32 (32.0)
Cat     No               6 (16.9)        30 (19.1)            36 (36.0)
Cat     Total            32 (32.0)       36 (36.0)            68 (68.0)
Dog     Yes              23 (22.8)       24 (24.2)            47 (47.0)
Dog     No               9 (9.2)         10 (9.8)             19 (19.0)
Dog     Total            32 (32.0)       34 (34.0)            66 (66.0)
For the cats:
Chi² = (26-15.1)²/15.1 + (6-16.9)²/16.9 +
(6-16.9)²/16.9 + (30-19.1)²/19.1 = 28.4
Is 28.4 big enough
for the test to be significant?
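The calculation for the cats can be reproduced with a short sketch. It assumes a plain Pearson chi-square on a 2x2 table with no continuity correction; the function name chi_square_2x2 is hypothetical. For 1 degree of freedom the p-value can be obtained from the complementary error function, so no statistics package is needed. The value of 28.4 is recovered when the unrounded expected frequencies are used.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square (no continuity correction) for a 2x2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected frequency = (row total) * (column total) / grand total
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Rows: danced yes / no; columns: food / affection
cats = [[26, 6], [6, 30]]
dogs = [[23, 24], [9, 10]]

chi2_cats, p_cats = chi_square_2x2(cats)
chi2_dogs, p_dogs = chi_square_2x2(dogs)
print(f"Cats: chi2 = {chi2_cats:.1f}, p = {p_cats:.2g}")
print(f"Dogs: chi2 = {chi2_dogs:.3f}, p = {p_dogs:.3f}")
```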
The Null hypothesis and the error types
• The null hypothesis (H0): H0 = no effect
• e.g.: the animals rewarded by food are as likely to line
dance as those rewarded by affection
• The aim of a statistical test is to reject or not reject H0.
                      True state of H0
Statistical decision  H0 True                        H0 False
Reject H0             Type I error (False Positive)  Correct (True Positive)
Do not reject H0      Correct (True Negative)        Type II error (False Negative)
• Traditionally, a test or a difference is said to be
“significant” if the probability of type I error is: α ≤ 0.05
• High specificity = low False Positives = low Type I error
• High sensitivity = low False Negatives = low Type II error
Chi-square test: results
[Figure: bar charts of counts (Dance Yes, Dance No) by reward (Food, Affection), one panel for dogs and one for cats]
• In our example:
cats are more likely to line dance if they are given food as
reward than affection (p<0.0001) whereas dogs don’t mind
(p=0.908).
Quantitative data
• They take numerical values (units of
measurement)
• They can be discrete (values vary by finite
specific steps) or continuous (any values)
• They can be described by a series of
parameters:
– Mean, variance, standard deviation, standard
error and confidence interval
The mean
• Definition: average of all values in a column
• It can be considered as a model because it
summarises the data
– Example: a group of 5 persons: number of
friends of each member of the group: 1, 2, 3, 3
and 4
• Mean: (1+2+3+3+4)/5 = 2.6 friends per person
– Clearly a hypothetical value
• How can we know that it is an accurate
model?
– Difference between the real data and the model
created
The mean (2)
• Calculate the magnitude of the differences between
each data point and the mean:
– Total error = sum of differences = 0
(From Field, 2000)
• No errors!
– Positive and negative differences cancel each other out.
Sum of Squared errors (SS)
• To avoid the problem of the direction of the error: we
square them
– Instead of sum of errors: sum of squared errors (SS): SS = Σ(x – mean)²
• SS gives a good measure of the accuracy of the model
• But: dependent upon the amount of data: the more data, the
higher the SS.
• Solution: to divide the SS by the number of observations (N)
• As we are interested in measuring the error in the sample to
estimate the one in the population, we divide the SS by N-1
instead of N and get the variance: s² = SS/(N-1)
Variance and standard deviation
• Problem with variance: measure in squared units
– For more convenience, the square root of the variance is
taken to obtain a measure in the same unit as the original
measure:
• the standard deviation
– S.D. = √(SS/(N-1)) = √(s²) = s
• The standard deviation is a measure of how well the mean
represents the data
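The steps from total error to standard deviation can be followed on the "number of friends" example (1, 2, 3, 3 and 4). The sketch below only reproduces the arithmetic; variable names are illustrative.

```python
import math

friends = [1, 2, 3, 3, 4]  # data from the example above
n = len(friends)

mean = sum(friends) / n                    # the model: 2.6 friends per person
errors = [x - mean for x in friends]       # difference between data and model
total_error = sum(errors)                  # positives and negatives cancel out
ss = sum(err ** 2 for err in errors)       # sum of squared errors
variance = ss / (n - 1)                    # s² = SS/(N-1)
sd = math.sqrt(variance)                   # back to the original units

print(f"mean = {mean:.1f}, total error = {total_error:.1f}")
print(f"SS = {ss:.2f}, variance = {variance:.2f}, SD = {sd:.2f}")
```

For these data SS = 5.2, the variance is 1.3 and the SD is about 1.14 friends.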
Standard deviation
Small S.D:
data close to the mean:
mean is a good fit of the data
Large S.D.:
data distant from the mean:
mean is not an accurate representation
SD and SEM (SEM = SD/√N)
• Many scientists are confused about the
difference between the standard deviation
(SD) and the standard error of the mean
(SEM).
– The SD quantifies how much the values vary from
one another (scatter or spread).
• The SD does not change predictably as you acquire more
data.
– The SEM tells you how much variability there is in
the sample mean across samples from the same
population.
• The SEM gets smaller as your samples get larger:
– the mean of a large sample is likely to be closer to the true
mean than is the mean of a small sample.
SD and SEM
The SD quantifies the scatter of the data.
The SEM quantifies how far the sample
mean is likely to be from the true population mean.
SD or SEM ?
• If the scatter is caused by biological
variability, it is important to show the
variation.
– Report the SD rather than the SEM.
• Better, show a graph of all data points, or perhaps report
the largest and smallest values; there is no reason to
report only the mean and SD.
• If you are using an in vitro system with no
biological variability, the scatter can only
result from experimental imprecision (no
biological meaning).
– Report the SEM: the SD is less useful here.
• The SEM gives your readers a sense of
how well you have determined the mean.
Confidence interval
• 95% of sample means lie within +/- 1.96 SEM of the
population mean (assuming a normal sampling distribution)
- So limits of 95% CI:
[Mean - 1.96 SEM; Mean + 1.96 SEM]
- SEM = SD/√N
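As a sketch of how SD, SEM and the 95% CI relate, the snippet below computes all three for one sample. The data are hypothetical and the function name mean_sem_ci is illustrative. It uses the 1.96 multiplier quoted above, which is a large-sample approximation; with small samples a multiplier taken from the t distribution would give a wider interval.

```python
import math
import statistics


def mean_sem_ci(values, z=1.96):
    """Return mean, SD, SEM and the approximate 95% CI of the mean."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)      # sample SD (N-1 denominator)
    sem = sd / math.sqrt(n)            # SEM = SD/sqrt(N)
    ci = (mean - z * sem, mean + z * sem)
    return mean, sd, sem, ci


# Hypothetical body lengths (cm), for illustration only
lengths = [88.0, 91.5, 93.0, 89.5, 95.0, 90.0, 92.5, 94.0, 87.5, 91.0]
mean, sd, sem, (lower, upper) = mean_sem_ci(lengths)
print(f"mean = {mean:.1f}, SD = {sd:.2f}, SEM = {sem:.2f}")
print(f"approximate 95% CI: [{lower:.1f}; {upper:.1f}]")
```

Note that the SEM is always smaller than the SD for N > 1, which is why error bars based on the SEM look tighter than those based on the SD.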
Error bars
Error bar | Type | Description
Standard deviation (SD) | Descriptive | Typical or average difference between the data points and their mean.
Standard error (SEM) | Inferential | A measure of how variable the mean will be, if you repeat the whole study many times.
Confidence interval (CI), usually 95% CI | Inferential | A range of values you can be 95% confident contains the true mean.
[Figures: error bars for groups A and B (y-axis: dependent variable), one panel per case below]
SE gap ~ 2, n=3: ~ 2 x SE: p~0.05
SE gap ~ 4.5, n=3: ~ 4.5 x SE: p~0.01
SE gap ~ 1, n>=10: ~ 1 x SE: p~0.05
SE gap ~ 2, n>=10: ~ 2 x SE: p~0.01
CI overlap ~ 1, n=3: ~ 1 x CI: p~0.05
CI overlap ~ 0.5, n=3: ~ 0.5 x CI: p~0.05
CI overlap ~ 0.5, n>=10: ~ 0.5 x CI: p~0.05
CI overlap ~ 0, n>=10: ~ 0 x CI: p~0.01
Analysis of quantitative data
• Check for normality
• Choose the correct statistical test to answer
your question:
– There are 2 types of statistical tests:
• Parametric tests with 4 assumptions to be met by the
data,
• Non-parametric tests with no or few assumptions (e.g.
Mann-Whitney test) and/or for qualitative data (e.g. χ2
test).
Assumptions of Parametric Data
• All parametric tests have 4 basic assumptions that
must be met for the test to be accurate.
1) Normally distributed data
– Normal shape, bell shape, Gaussian shape
• Transformations can be made to make data suitable
for parametric analysis
Assumptions of Parametric Data (2)
• Frequent departure from normality:
– Skewness: lack of symmetry of a distribution
– Kurtosis: measure of the degree of peakedness in
the distribution
• The two distributions below have the same variance and
approximately the same skew, but differ markedly in
kurtosis.
Assumptions of Parametric Data (3)
2) Homogeneity in variance
• The variance should not change systematically
throughout the data
3) Interval data
• The distance between points of the scale should
be equal at all parts along the scale
4) Independence
• Data from different subjects are independent
– Values corresponding to one subject do not influence
the values corresponding to another subject.
– Important in repeated measures experiments
Analysis of quantitative data
• Is there a difference between my groups regarding the
variable I am measuring?
– e.g.: are the mice in group A heavier than those in
group B?
• Tests with 2 groups:
– Parametric: t-test
– Non parametric: Mann-Whitney/Wilcoxon rank sum test
• Tests with more than 2 groups:
– Parametric: Analysis of variance (one-way ANOVA)
– Non parametric: Kruskal Wallis
• Is there a relationship between my 2 (continuous)
variables?
– e.g.: is there a relationship between the daily intake in
calories and an increase in body weight?
• Test: Correlation (parametric or non-parametric)
Comparison between 2 groups:
t-test
• Basic idea:
– When we are looking at the differences between scores for
2 groups, we have to judge the difference between their
means relative to the spread or variability of their scores
• Ex: comparison of 2 groups control and treatment
t-test (4)
• 3 types:
– Independent t-test
• it compares means for two independent
groups of cases.
– Paired t-test
• it looks at the difference between two
variables for a single group:
– the second sample is the same as the first after
some treatment has been applied
– One-Sample t-test
• it tests whether the mean of a single variable
differs from a specified constant (often 0)
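The three variants differ only in what is compared and in how the standard error is built. The sketch below computes the t statistic and its degrees of freedom for each; it assumes equal variances for the independent test (pooled variance) and leaves out the p-value, which requires the t distribution and is best taken from a statistics package. Function names and data are hypothetical.

```python
import math
import statistics


def independent_t(a, b):
    """Independent t-test (equal variances assumed): returns (t, df)."""
    n1, n2 = len(a), len(b)
    pooled_var = (
        (n1 - 1) * statistics.variance(a) + (n2 - 1) * statistics.variance(b)
    ) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (statistics.mean(a) - statistics.mean(b)) / se, n1 + n2 - 2


def one_sample_t(values, mu0=0.0):
    """One-sample t-test against the constant mu0: returns (t, df)."""
    n = len(values)
    se = statistics.stdev(values) / math.sqrt(n)
    return (statistics.mean(values) - mu0) / se, n - 1


def paired_t(first, second):
    """Paired t-test: a one-sample t-test on the pairwise differences."""
    differences = [y - x for x, y in zip(first, second)]
    return one_sample_t(differences, 0.0)


# Hypothetical scores, for illustration only
before = [12.0, 15.0, 11.0, 14.0, 13.0]
after = [14.0, 18.0, 12.0, 17.0, 13.5]
print("independent:", independent_t(after, before))
print("paired:     ", paired_t(before, after))
```

The same numbers give a much larger t when analysed as pairs, because the between-subject variability is removed from the noise.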
Example: coyote.xlsx
• Question: are male coyotes bigger
than females?
• First step: what do my data look like?
– 4 assumptions for parametric tests
– Plot the data
Assumptions for parametric tests
Histogram of Coyote: Freq. dist. (histogram)
[Figure: histograms of counts per bin centre for females and males, shown at three bin widths]
Counts OK here
but if several groups of different sizes,
go for percentages
Normality
[Figure: box plot of coyote length (cm) for males and females, annotated as follows]
Maximum
Upper Quartile (Q3) 75th percentile
Interquartile Range (IQR)
Median
Lower Quartile (Q1) 25th percentile
Smallest data value > lower cutoff
Cutoff = Q1 – 1.5*IQR
Outlier
Independent t-test: example
coyote.pzf
[Figures: body length (cm) of female and male coyotes, plotted with standard error, standard deviation and confidence interval error bars]
Independent t-test: results
coyote.xlsx
Males tend to be longer than females
but not significantly so (p=0.1045).
Homogeneity in variance
What about the power of the analysis?
You would need a sample 3 times bigger
to reach the accepted power of 80%.
But is a 2.3 cm difference between genders biologically relevant?
Another example of t-test: height husband wife.xlsx
[Figure: height of husbands and wives, and the difference in height within each couple]
Normality
Dependent t-test: example
height husband wife.xlsx
[Figure: height (cm) of husbands and wives, and the difference in height within each couple]
Paired t-test = One sample t-test
Results
No test for homogeneity of variances
Comparison of more than 2 means
• Why can’t we do several t-tests?
– Because it increases the familywise error
rate.
• What is the familywise error rate?
– The error rate across tests conducted on
the same experimental data.
Familywise error rate
• Example: if you want to compare 3 groups and you carry out 3 t-tests, each with a 5% level of significance
• The probability of not making a type I error in any one test is 95% (= 1 – 0.05)
– the overall probability of no type I errors is:
0.95 * 0.95 * 0.95 = 0.857
– So the probability of making at least one type I error is
1 – 0.857 = 0.143 or 14.3%
– The probability has increased from 5% to 14.3%!
– If you compare 5 groups instead of 3, the familywise error rate is
40%! (= 1 – 0.95^n, with n = 10 pairwise comparisons)
• Solution for multiple comparisons: Analysis of variance
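The arithmetic above generalises to any number of groups. The sketch below assumes, as the slide does, that the tests are independent and that every pair of groups is compared; the function names are illustrative.

```python
from math import comb


def pairwise_comparisons(n_groups):
    """Number of possible 2-group comparisons among n_groups."""
    return comb(n_groups, 2)


def familywise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one type I error across n_tests independent tests."""
    return 1 - (1 - alpha) ** n_tests


for groups in (3, 5):
    tests = pairwise_comparisons(groups)
    rate = familywise_error_rate(tests)
    print(f"{groups} groups: {tests} t-tests, familywise error rate = {rate:.1%}")
```

This reproduces the 14.3% for 3 groups (3 tests) and the 40% for 5 groups (10 tests).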
Analysis of variance
• Extension of the 2 group comparison of a t-test but with a slightly different logic:
– If you want to compare 5 means, for example, you
can compare each mean with another
• It gives you 10 possible 2-group comparisons
– Complicated ! So, the logic of the t-test cannot be directly
transferred to the analysis of variance (=ANOVA)
• Instead the ANOVA compares variances:
– If variance between the 5 means > variance within the 5 groups
(random error)
• then the means must be more spread out than they would have been by
chance.
Analysis of variance
• The statistic for ANOVA is the F ratio.
• F = Variance between the groups / Variance within the groups (individual variability)
• F = Variation explained by the model (= systematic) / Variation explained by unsystematic factors (= random variation)
• If the variance amongst sample means is greater
than the error/random variance, then F>1
– In an ANOVA, you test whether F is significantly higher
than 1 or not.
Analysis of variance
Source of variation | Sum of Squares | df | Mean Square | F     | p-value
Between Groups      | 2.665          | 4  | 0.6663      | 8.423 | <0.0001
Within Groups       | 5.775          | 73 | 0.0791      |       |
Total               | 8.44           | 77 |             |       |
• Variance (= SS/(N-1)) is the mean square: Mean Square = SS/df
– df: degrees of freedom, with df = N-1
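The F ratio in the table can be rebuilt from the sums of squares and degrees of freedom. The sketch below just redoes that arithmetic with the values from the table; the helper name mean_square is illustrative. Because the table entries are rounded, the result agrees with the reported F only to about two decimal places.

```python
def mean_square(sum_of_squares, df):
    """Mean square = sum of squares divided by its degrees of freedom."""
    return sum_of_squares / df


# Values taken from the ANOVA table above
ss_between, df_between = 2.665, 4
ss_within, df_within = 5.775, 73

ms_between = mean_square(ss_between, df_between)   # variance between groups
ms_within = mean_square(ss_within, df_within)      # variance within groups
f_ratio = ms_between / ms_within

print(f"MS between = {ms_between:.4f}, MS within = {ms_within:.4f}")
print(f"F = {f_ratio:.2f}")
print(f"Total: SS = {ss_between + ss_within:.2f}, df = {df_between + df_within}")
```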
[Figure: hypothetical model showing between groups variability, within groups variability and total sum of squares]
Parametric tests assumptions
Normality
Analysis of variance: Post hoc tests
• The ANOVA is an “omnibus” test: it tells you that
there is (or not) a difference between your means
but not exactly which means are significantly
different from which other ones.
– To find out, you need to apply post hoc tests.
– These post hoc tests should only be used when the
ANOVA finds a significant effect.
Analysis of variance: example
protein expression.xlsx
Homogeneity in variance
F=0.6702/0.07896=8.49
Post hoc tests
[Figures: protein expression for cell groups A to E, plotted on linear and log scales]
Correlation
• A correlation coefficient is an index number that measures:
– The magnitude and the direction of the relation between 2
variables
– It is designed to range in value between -1 and +1
Correlation
• Most widely-used correlation coefficient:
– Pearson product-moment correlation coefficient
“r”
• The 2 variables do not have to be measured in the same
units but they have to be proportional (meaning linearly
related)
– Coefficient of determination:
• r is the correlation between X and Y
• r² is the coefficient of determination:
– It gives you the proportion of variance in Y that can be
explained by X, expressed as a percentage.
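A minimal sketch of the Pearson coefficient and the coefficient of determination follows. The function name pearson_r and the data are hypothetical; they are not taken from the roe deer file.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient between x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of cross-products and sums of squares around the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data, for illustration only
parasite_burden = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
body_mass = [26.0, 24.5, 23.0, 23.5, 20.0, 19.5]

r = pearson_r(parasite_burden, body_mass)
print(f"r = {r:.2f}, r² = {r ** 2:.2f} ({r ** 2:.0%} of the variance in Y explained by X)")
```

A negative r indicates that Y decreases as X increases; r² ignores the direction and keeps only the strength of the linear relationship.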
Correlation: example
roe deer.xlsx
• Is there a relationship between parasite burden
and body mass in roe deer?
[Figure: body mass against digestive parasites for male and female roe deer]
Correlation: example
roe deer.xlsx
There is a negative correlation between
parasite load and fitness but this relationship
is only significant for the males
(p=0.0049 vs. females: p=0.2940).
Exercises
• Arachnophobia
– Is it as scary to look at the picture of a spider
as at a real one?
• Cane toad
– Is the proportion of cane toads infected by
intestinal parasites the same in 3 different areas
of Queensland?
• Colorectal cancer (CRC)
– Are DNA scores higher in people with colorectal
cancer?
• Migration of neutrophils
– Do neutrophils go further depending on which
inhibitor you use?
Arachnophobia
[Figure: plot of the differences]
Answer:
If you are an arachnophobe,
it is scarier to look at a real spider
than at the picture of one
(p=0.0310).
Cane toad
[Figure: number of infected and uninfected toads in Rockhampton, Bowen and Mackay]
Answer:
The proportion of cane toads infected
by intestinal parasites varies significantly
between the 3 different areas of Queensland
(p=0.0359), the animals being more likely
to be parasitized in Rockhampton and Mackay
than in Bowen.
Colorectal cancer
[Figure: frequency histograms of DNA scores (bin centres 0 to 35) for CRC and no CRC]
Answer:
Higher DNA scores appear
to be associated
with a greater likelihood of CRC
(p=0.0007).
Migration of neutrophils
[Figures: migrated neutrophils (% of total) for groups A, B and C (x-axis labelled Genotypes)]
Answer:
There is a significant difference between
the 3 inhibitors, with inhibitor C being
the most effective
and inhibitor A the least (p<0.0001).
* Outcome ** Predictor
Choosing the Correct Statistical Test (adapted from “Choosing the correct statistics”
developed by James D. Leeper, Ph.D)