Transcript Caution

Caution
There are three kinds of lies: lies, damned lies, and statistics
Attributed to Mark Twain or Benjamin Disraeli!
Probably Sir Charles Dilke
caution.1
Caution
The Unknown
As we know
There are known knowns
There are things we know we know
We also know
There are known unknowns
That is to say
We know there are some things
We do not know
But there are also unknown unknowns
The ones we don't know
We don't know
Donald Rumsfeld (Secretary of Defense 1975-1977 and 2001-2006)
12 Feb. 2002, Department of Defense news briefing.
caution.2
Caution
In Psychology there is great reliance on the “p” value.
Over the past twenty years, serious flaws have been
pointed out with this reliance. In general, reporting a
confidence interval is recommended.
Thomas W. Nix and J. Jackson Barnette, 1998, “The Data Analysis Dilemma: Ban or
Abandon. A Review of Null Hypothesis Significance Testing” Research In The Schools
5(2) 3-14 also “A Review of Hypothesis Testing Revisited: Rejoinder to Thompson,
Knapp, and Levin” Research In The Schools 5(2) 55-57.
J.L. Moran et al., 2004, “A farewell to p-values” Critical Care and Resuscitation 6
130-137.
R. Hubbard and R.M. Lindsay, 2008, “Why p Values Are Not a Useful Measure of
Evidence in Statistical Significance Testing” Theory and Psychology 18 69-88.
caution.3
Caution
The p-value debate is ongoing
The ASA's statement on p-values: context, process, and purpose
Ronald L. Wasserstein & Nicole A. Lazar
The American Statistician 2016, DOI: 10.1080/00031305.2016.1154108
Q: Why do so many colleges and grad schools teach p = .05?
A: Because that's still what the scientific community and journal editors use.
Q: Why do so many people still use p = 0.05?
A: Because that's what they were taught in college or grad school.
With a more readable discussion
The mismeasure of scientific significance by Trevor Butterworth 7 Mar 2016
Statistical Tests, P-values, Confidence Intervals, and Power: A Guide to Misinterpretations
Greenland S., Senn SJ., Rothman KJ., Carlin JB., Poole C., Goodman SN., Altman DG.
reprinted by the European Journal of Epidemiology
caution.4
Caution
1. P-values can indicate how incompatible the data are with a specified statistical model.
2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
4. Proper inference requires full reporting and transparency.
5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
The ASA's statement on p-values: context, process, and purpose
Ronald L. Wasserstein & Nicole A. Lazar
The American Statistician 2016 , DOI: 10.1080/00031305.2016.1154108
caution.5
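Principle 5 is easy to demonstrate numerically. The following is a minimal simulation sketch in Python (the sample sizes and effect sizes are illustrative assumptions, not taken from the ASA statement): a tiny effect measured on a huge sample can produce a far smaller p-value than a large effect measured on a small sample.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Tiny effect (0.05 SD) with very large samples
a = rng.normal(0.00, 1.0, 100_000)
b = rng.normal(0.05, 1.0, 100_000)

# Large effect (0.8 SD) with small samples
c = rng.normal(0.0, 1.0, 20)
d = rng.normal(0.8, 1.0, 20)

# The first p-value is typically orders of magnitude smaller, even though the
# effect it reflects is practically negligible.
print("tiny effect, n = 100000 per group: p =", stats.ttest_ind(a, b).pvalue)
print("large effect, n = 20 per group:    p =", stats.ttest_ind(c, d).pvalue)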
Caution
Stop taking the p (Plus magazine, Marianne Freiberger, 12 April 2016), on why a
time-honoured statistical tool is becoming problematic.
For a discussion and an amusing cartoon see The American Statistical
Association's Statement on the Use of P Values - Jim Frost - 23 March,
2016
caution.6
Caution
As pointed out by Johansson
1. p is uniformly distributed under the null
hypothesis and can therefore never indicate
evidence for the null.
2. p is conditioned solely on the null hypothesis and is
therefore unsuited to quantify evidence, because
evidence is always relative in the sense of being
evidence for or against a hypothesis relative to
another hypothesis.
3. p designates probability of obtaining evidence
(given the null), rather than strength of evidence.
caution.9
Caution
4. p depends on unobserved data and subjective
intentions and therefore implies, given the
evidential interpretation, that the evidential
strength of observed data depends on things that
did not happen and subjective intentions.
T. Johansson, 2011, “Hail the impossible: p-values, evidence, and
likelihood” Scandinavian Journal of Psychology, 52, 113–125.
caution.10
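Johansson's first point can be checked directly by simulation. A minimal sketch, assuming normally distributed data and a two-sample t-test (this example is not taken from Johansson's paper):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
pvals = np.empty(10_000)
for i in range(pvals.size):
    x = rng.normal(0.0, 1.0, 20)   # both groups come from the same population,
    y = rng.normal(0.0, 1.0, 20)   # so the null hypothesis is true
    pvals[i] = stats.ttest_ind(x, y).pvalue

# Under a true null the p-values are (approximately) uniform on [0, 1]:
# about 5% fall below 0.05, about 50% fall below 0.5, and so on.
print("fraction of p < 0.05:", (pvals < 0.05).mean())
print("fraction of p < 0.50:", (pvals < 0.50).mean())

Because every value of p is equally likely when the null is true, observing any particular p cannot be evidence in favour of the null.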
Caution
...an unreasonable yet widespread practice is the labelling
of all randomized trials as either positive or negative on
the basis of whether the P value for the primary outcome
is less than 0.05. This view is overly simplistic. P values
should be interpreted as a continuum wherein the
smaller the P value, the greater the strength of the
evidence for a real treatment effect.
Confidence intervals are also useful in indicating the
range of uncertainty around the estimated treatment
effect.
The Primary Outcome Fails — What Next?
Stuart J. Pocock and Gregg W. Stone
The New England Journal of Medicine 2016 375 861-870 DOI:
10.1056/NEJMra1510064
caution.11
Caution
For an alternate but equally jaundiced view see
Valen E. Johnson
Revised standards for statistical evidence
Proceedings of the National Academy of Sciences of the United States of America 2013 110(48)
19313–19317.
Which is nicely summarised in
Erika Check Hayden
Weak statistical standards implicated in scientific irreproducibility
Nature 11 November 2013.
and
Geoff Cumming
The problem with p values: how significant are they, really?
The new statistics: why and how.
Psychol. Sci. 2014 25 7–29 DOI: 10.1177/0956797613504966
Partially opposed by
Savalei, V. and Dunn, E.
Is the call to abandon p-values the red herring of the replicability crisis?
Frontiers In Psychology 2015 6 DOI: 10.3389/fpsyg.2015.00245
caution.12
Caution
P values, the 'gold standard' of statistical validity, are not as reliable
as many scientists assume.
However, it seems to get the explanation of hypothesis testing wrong!
Regina Nuzzo
Scientific method: Statistical errors
Nature 2014 506 150–152, 13 February 2014
Victoria Savalei and Elizabeth Dunn
Is the call to abandon p-values the red herring of the replicability crisis?
Frontiers in Psychology 2015 6
DOI: 10.3389/fpsyg.2015.00245
caution.13
Caution
A review of p-values in the biomedical literature from 1990 to 2015
shows that these widely misunderstood statistics are being used
increasingly, instead of better metrics of effect size or uncertainty.
Evolution of Reporting p-values in the Biomedical Literature, 1990-2015.
David Chavalarias, Joshua David Wallach, Alvin Ho Ting Li, John P. A. Ioannidis
Journal of the American Medical Association 2016 315(11) 1141-1148 DOI:
10.1001/jama.2016.1952
Misleading p-values showing up more often in biomedical journal articles - ScienceDaily
- 15 March 2016
caution.14
Caution
This paper serves to demonstrate that the practice of using one, two,
or three asterisks (according to a type-I-risk α of either 0.05, 0.01, or
0.001) in significance testing, as given particularly with regard to
empirical research in psychology, is in no way in accordance with the
Neyman-Pearson theory of statistical hypothesis testing. Claiming a
posteriori that even a low type-I-risk α leads to significance merely
discloses a researcher’s self-deception. The authors emphasise that
by using sequential sampling procedures instead of fixed sample sizes
the “practice of asterisks” would not arise.
The misuse of asterisks in hypothesis testing
Dieter Rasch, Klaus D. Kubinger, Jörg Schmidtke and Joachim Häusler
Psychology Science, Volume 46(2), 2004, 227-242.
caution.15
Caution
“Like elaborately plumed birds…we preen and strut and
display our t-values.” That was Edward Leamer’s
uncharitable description of his profession in 1983.
Mr. Leamer, an economist at the University of California in
Los Angeles, was frustrated by empirical economists’
emphasis on measures of correlation over underlying
questions of cause and effect, such as whether people
who spend more years in school go on to earn more in later
life.
“Cause and defect” The Economist, 13 August 2009, p. 68
“Instrumental variables help to isolate causal relationships. But they can be
taken too far.”
caution.16
Caution
Hardly anyone, he wrote gloomily, “takes anyone else’s
data analyses seriously”.
To make his point, Mr. Leamer showed how different (but
apparently reasonable) choices about which variables to
include in an analysis of the effect of capital punishment
on murder rates could lead to the conclusion that the
death penalty led to more murders, fewer murders, or had
no effect at all.
“Let’s take the con out of econometrics”, by Edward Leamer, American
Economic Review 73(1), March 1983
caution.17
Caution
Confidence intervals have frequently been proposed as a
more useful alternative to null hypothesis significance
testing, and their use is strongly encouraged in the APA
Manual (American Psychological Association 2009 Publication Manual of the
American Psychological Association (6th ed.). Washington, DC).
The misunderstandings surrounding p-values and
confidence intervals are particularly unfortunate because
they constitute the main tools by which psychologists
draw conclusions from data.
Robust misinterpretation of confidence intervals
Hoekstra, R., Morey, R.D., Rouder, J.N. et al. Psychon Bull Rev (2014) 21: 1157.
DOI: 10.3758/s13423-013-0572-3
caution.18
Caution
“it encouraged, among other things, the use of confidence
intervals, because ‘it is hard to imagine a situation in which
a dichotomous accept–reject decision is better than
reporting an actual p-value or, better still, a confidence
interval’” (Wilkinson and TFSI, 1999, p. 599).
Wilkinson, L. and APA Task Force on Statistical Inference (1999) Statistical
methods in psychology journals: Guidelines and explanations American
Psychologist, 54, 594–604.
caution.19
Caution
“calling confidence intervals ‘in general, the best reporting
strategy’” (APA, 2001, p. 22; APA, 2009, p. 34).
American Psychological Association, 2001. Publication manual of the
American Psychological Association (5th ed.). Washington, DC.
American Psychological Association, 2009. Publication manual of the American
Psychological Association (6th ed.). Washington, DC.
caution.20
Caution
“A confidence interval is a numerical interval constructed
around the estimate of a parameter. Such an interval does
not, however, directly indicate a property of the
parameter; instead, it indicates a property of the
procedure, as is typical for a frequentist technique.”
Robust misinterpretation of confidence intervals
Hoekstra, R., Morey, R., Rouder, J. and Wagenmakers, E. (2014). Robust
misinterpretation of confidence intervals Psychonomic Bulletin and Review,
21(5), 1157-1164 DOI: 10.3758/s13423-013-0572-3
Reformers say psychologists should change how they report their results, but
does anyone understand the alternative? BPS Research Digest
caution.21
Caution
Ridding science of shoddy statistics will require scrutiny of every step, not merely the
last one.
There is no statistic more maligned than the p value. Hundreds of papers and blogposts
have been written about what some statisticians deride as 'null hypothesis significance
testing' (NHST; see, for example). NHST deems whether the results of a data analysis
are important on the basis of whether a summary statistic (such as a p value) has
crossed a threshold. Given the discourse, it is no surprise that some hailed as a victory
the banning of NHST methods (and all of statistical inference) (Trafimow, D. & Marks, M.,
Basic Appl. Soc. Psych. 37, 1–2 2015 DOI: 10.1080/01973533.2015.1012991).
Statistics: P values are just the tip of the iceberg
Jeffrey T. Leek & Roger D. Peng Nature 2015 520 612
DOI: 10.1038/520612a
P values, the 'gold standard' of statistical validity, are not as reliable as many scientists assume
Regina Nuzzo, Nature 2014 506 150-152
On the challenges of drawing conclusions from p-values just below 0.05, Daniël Lakens, PeerJ 2015 3:e1142 DOI:
10.7717/peerj.1142
caution.22
Caution
Recent developments in psychology are showing apparently reasonable but inherently flawed
positions against data testing techniques (often called hypothesis testing techniques, even when
they do not test hypotheses but assume them true for testing purposes).
These positions are such as banning testing explicitly and most inferential statistics implicitly,
recommending substituting confidence intervals for null hypothesis significant testing (NHST)
explicitly and for all other data testing implicitly, and recommending research preregistration as a
solution to the low publication of non-significant results.
In reading Woolston's articles, readers' comments to such articles, and the related literature, it
appears that philosophical misinterpretations are not getting through and still need to be re-addressed
today. I believe that a chief source of misinterpretations is the current NHST
framework, an incompatible mishmash between the testing theories of Fisher and of Neyman-Pearson.
The resulting misinterpretations have both a statistical and a theoretical background. Statistical
misinterpretations of p-values have been addressed elsewhere, thus I reserve this article for
resolving theoretical misinterpretations regarding statistical significance.
Perezgonzalez J.D. "The meaning of significance in data testing" Frontiers in Psychology 2015 6:1293. DOI:
10.3389/fpsyg.2015.01293
Woolston, C. “Psychology journal bans P values” Nature 2015 519:9. DOI: 10.1038/519009f
Woolston, C. “Online debate erupts to ask: is science broken?” Nature 2015 519:393. DOI: 10.1038/519393f
caution.23
Caution
Please mark each of the statements below “true” or “false”. False means
that the statement does not follow logically from the stated result.
Also note that all, several, or none of the statements may be correct:
Hoekstra, R., Morey, R., Rouder, J., and Wagenmakers, E. (2014). Robust misinterpretation of
confidence intervals Psychonomic Bulletin and Review, 21(5), 1157-1164 DOI:
10.3758/s13423-013-0572-3
caution.24
Caution
The 95% confidence interval for the mean ranges from 0.1 to 0.4!
1. The probability that the true mean is greater than 0 is at least 95%. [ ] true / [ ] false
2. The probability that the true mean equals 0 is smaller than 5%. [ ] true / [ ] false
3. The “null hypothesis” that the true mean equals 0 is likely to be incorrect. [ ] true / [ ] false
4. There is a 95% probability that the true mean lies between 0.1 and 0.4. [ ] true / [ ] false
5. We can be 95% confident that the true mean lies between 0.1 and 0.4. [ ] true / [ ] false
6. If we were to repeat the experiment over and over, then 95% of the time the true mean falls between 0.1 and 0.4. [ ] true / [ ] false
caution.25
Caution
Statements 1–4 assign probabilities to parameters or hypotheses,
something that is not allowed within the frequentist
framework.
caution.26
Caution
Statements 5 and 6 mention the boundaries of the confidence interval
(i.e., 0.1 and 0.4), whereas a confidence interval can be
used to evaluate only the procedure and not a specific
interval.
caution.30
Caution
To sum up, all six statements are incorrect. Note that
all six err in the same direction of wishful thinking.
caution.32
Caution
The correct statement, which was absent from the
list, is the following:
The 95% confidence interval for the mean ranges from 0.1 to 0.4!
7. If we were to repeat the experiment over and over, then 95% of the time
the confidence intervals contain the true mean. [ ] true / [ ] false
caution.33
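A small simulation makes statement 7 concrete: the “95%” describes the long-run behaviour of the interval-construction procedure, not any single interval. A sketch (the true mean, sample size, and normality are illustrative assumptions, not from Hoekstra et al.):

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_mean, n, reps = 0.25, 30, 10_000
covered = 0
for _ in range(reps):
    sample = rng.normal(true_mean, 1.0, n)
    m = sample.mean()
    se = sample.std(ddof=1) / np.sqrt(n)
    lo, hi = stats.t.interval(0.95, n - 1, loc=m, scale=se)
    covered += (lo <= true_mean <= hi)

# Roughly 95% of the intervals contain the true mean; any single interval
# either does or does not, and we cannot tell which from that interval alone.
print("empirical coverage:", covered / reps)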
Caution
Suppose you have a treatment that you suspect may alter performance on a
certain task.
You compare the means of your control and experimental groups (say 20
subjects in each sample).
Further, suppose you use a simple independent means t-test and your result
is significant (t = 2.7, d.f. = 18, p = 0.01).
Please mark each of the statements below as “true” or “false.” “False”
means that the statement does not follow logically from the above
premises. Also note that several or none of the statements may be correct
(Gigerenzer 2004).
Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33, 587–
606.
caution.34
Caution
Which statements are in fact true? Recall that a p-value is the probability of the observed data
(or of more extreme data points), given that the null hypothesis H0 is true, defined in symbols as
p(D|H0). This definition can be rephrased in a more technical form by introducing the statistical
model underlying the analysis (Gigerenzer et al., 1989, chapter 3).
1. You have absolutely disproved the null hypothesis (that is, there is no difference between the population means). [ ] true / [ ] false
2. You have found the probability of the null hypothesis being true. [ ] true / [ ] false
3. You have absolutely proved your experimental hypothesis (that there is a difference between the population means). [ ] true / [ ] false
4. You can deduce the probability of the experimental hypothesis being true. [ ] true / [ ] false
5. You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision. [ ] true / [ ] false
6. You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99% of occasions. [ ] true / [ ] false
Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J., Krüger, L., 1989. The Empire of
Chance. How Probability Changed Science and Every Day Life. Cambridge University Press,
Cambridge, UK.
caution.35
Caution
Statements 1 and 3 are easily detected as being false, because a significance test can never disprove the null
hypothesis or the (undefined) experimental hypothesis. They are instances of the illusion of certainty (Gigerenzer,
2002).
Gigerenzer, G., 2002. Calculated Risks: How to Know When Numbers Deceive You. Simon and Schuster, New York (UK
edition: Reckoning with Risk: Learning to Live with Uncertainty. Penguin, London).
caution.36
Caution
Statements 2 and 4 are also false. The probability p(D|H0) is not the same as p(H0|D), and more generally, a significance
test does not provide a probability for a hypothesis. The statistical toolbox, of course, contains
tools that would allow estimating probabilities of hypotheses, such as Bayesian statistics.
caution.37
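The gap between p(D|H0) and p(H0|D) can be made concrete with Bayes' rule. The numbers below are illustrative assumptions (they are not from Gigerenzer): suppose only 10% of the hypotheses a field tests are real effects, tests are run at α = 0.05, and power is 0.80. Then the probability that the null is true given a significant result is far larger than 0.05:

# Illustrative Bayes' rule calculation; the prior and the power are assumptions.
prior_h1 = 0.10                  # assumed share of tested hypotheses that are real effects
prior_h0 = 1.0 - prior_h1
alpha = 0.05                     # P(significant | H0)
power = 0.80                     # P(significant | H1)

p_sig = alpha * prior_h0 + power * prior_h1
p_h0_given_sig = alpha * prior_h0 / p_sig

print("P(significant | H0) =", alpha)                 # 0.05 by construction
print("P(H0 | significant) = %.2f" % p_h0_given_sig)  # about 0.36 under these assumptions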
Caution
Statement 5 also refers to a probability of a hypothesis. This is because if one rejects the null hypothesis, the
only possibility of making a wrong decision is if the null hypothesis is true. Thus, it makes
essentially the same claim as Statement 2 does, and both are incorrect.
caution.40
Caution
Statement 6 amounts to the replication fallacy (Gigerenzer, 1993, 2000). Here, 1% (p = 0.01) is taken to imply
that such significant data would reappear in 99% of the repetitions. Statement 6 could be made
only if one knew that the null hypothesis was true. In formal terms, p(D|H0) is confused with
1 − p(D).
caution.41
Caution
Gigerenzer, G., 1993. The superego, the ego, and the id in statistical reasoning. In: Keren, G., Lewis, C. (Eds.), A
Handbook for Data Analysis in the Behavioral Sciences: Methodological Issues. Erlbaum, Hillsdale, NJ, pp. 311–339.
Gigerenzer, G., 2000. Adaptive Thinking: Rationality in the Real World. Oxford University Press, New York.
caution.42
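The replication fallacy in statement 6 can also be illustrated by simulation. Even on the optimistic assumption that the observed effect is exactly the true effect, the chance that a replication of the quiz's experiment (two groups of 20, α = 0.05) comes out significant is only the test's power, nowhere near 99%. A sketch (the effect size below is an assumption derived from t = 2.7):

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 20
d = 2.7 * np.sqrt(2 / n)   # standardised effect implied by t = 2.7 with 20 per group

reps, significant = 10_000, 0
for _ in range(reps):
    x = rng.normal(0.0, 1.0, n)
    y = rng.normal(d, 1.0, n)
    if stats.ttest_ind(x, y).pvalue < 0.05:
        significant += 1

# Roughly 0.70-0.75 of replications are significant under these assumptions,
# not the 99% suggested by reading p = 0.01 as a replication probability.
print("replication rate:", significant / reps)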
Caution
We need to make substantial changes to how we conduct research.
First, in response to heightened concern that our published research literature is
incomplete and untrustworthy, we need new requirements to ensure research integrity.
These include pre-specification of studies whenever possible, avoidance of selection and
other inappropriate data-analytic practices, complete reporting, and encouragement of
replication.
Second, in response to renewed recognition of the severe flaws of null-hypothesis
significance testing (NHST), we need to shift from reliance on NHST to estimation and
other preferred techniques. The new statistics refers to recommended practices,
including estimation based on effect sizes, confidence intervals, and meta-analysis. The
techniques are not new, but adopting them widely would be new for many researchers,
as well as highly beneficial.
This article explains why the new statistics are important and offers guidance for their
use. It describes an eight-step (see below) new-statistics strategy for research with
integrity, which starts with formulation of research questions in estimation terms, has
no place for NHST, and is aimed at building a cumulative quantitative discipline.
Cumming, G., “The New Statistics: Why and How” Psychological Science January 2014 25(1) 7-29 DOI:
10.1177/0956797613504966
caution.43
Caution
1. Formulate research questions in estimation terms. To use estimation thinking, ask
“How large is the effect?” or “To what extent . . . ?” Avoid dichotomous
expressions such as “test the hypothesis of no difference” or “Is this treatment
better?”
2. Identify the effect sizes that will best answer the research questions. If, for
example, the question asks about the difference between two means, then that
difference is the required effect size. If the question asks how well a model
describes some data, then the effect size is a measure of goodness of fit.
3. Declare full details of the intended procedure and data analysis. Pre-specify as
many aspects of your intended study as you can, including sample sizes. A fully
pre-specified study is best.
4. After running the study, calculate point estimates and confidence intervals for
the chosen effect sizes. For an experiment, the estimated difference between
the means is 16.9, 95% confidence interval [6.1, 27.7].
Cumming, G., “The New Statistics: Why and How” Psychological Science January 2014 25(1) 7-29 DOI:
10.1177/0956797613504966
caution.44
Caution
5. Make one or more figures, including confidence intervals. Use error bars to
depict 95% confidence intervals.
6. Interpret the effect sizes and confidence intervals. In writing up results,
discuss the effect size estimates, which are the main research outcome, and the
confidence interval lengths, which indicate precision. Consider theoretical and
practical implications, in accord with the research aims.
7. Use meta-analytic thinking throughout. Think of any single study as building on
past studies and leading to future studies. Present results to facilitate their
inclusion in future meta-analyses. Use meta-analysis to integrate findings
whenever appropriate.
8. Report. Make a full description of the research, preferably including the raw
data, available to other researchers. This may be done via journal publication or
posting to some enduring publicly available online repository. Be fully transparent
about every step, including data analysis and especially about any exploration or
selection, which requires the corresponding results to be identified as
speculative.
Cumming, G., “The New Statistics: Why and How” Psychological Science January 2014 25(1) 7-29 DOI:
10.1177/0956797613504966
caution.45
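As a minimal illustration of steps 4 and 6, the sketch below computes a point estimate and 95% confidence interval for a difference between two group means. The data are simulated for illustration and are not Cumming's example:

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
control = rng.normal(100, 15, 40)
treatment = rng.normal(110, 15, 40)

diff = treatment.mean() - control.mean()
v1 = treatment.var(ddof=1) / treatment.size
v2 = control.var(ddof=1) / control.size
se = np.sqrt(v1 + v2)
# Welch-Satterthwaite degrees of freedom (equal variances not assumed)
df = (v1 + v2) ** 2 / (v1 ** 2 / (treatment.size - 1) + v2 ** 2 / (control.size - 1))
lo, hi = stats.t.interval(0.95, df, loc=diff, scale=se)

# Report the estimate and its interval, in the spirit of "estimation thinking".
print(f"difference between means: {diff:.1f}, 95% CI [{lo:.1f}, {hi:.1f}]")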
Caution
One approach is Bayesian statistics, which is a subset of the field of statistics in
which the evidence about the true state of the world is expressed in terms of
degrees of belief or, more specifically, Bayesian probabilities. One formulation of the
"key ideas of Bayesian statistics" is "that probability is orderly opinion, and that
inference from data is nothing other than the revision of such opinion in the light of
relevant new information."
The lack of reproducibility of scientific research undermines public confidence in
science and leads to the misuse of resources when researchers attempt to replicate
and extend fallacious research findings. Using recent developments in Bayesian
hypothesis testing, a root cause of non-reproducibility is traced to the conduct of
significance tests at inappropriately high levels of significance. Modifications of
common standards of evidence are proposed to reduce the rate of non-reproducibility of scientific research by a factor of 5 or greater.
Johnson, V.E., “Revised standards for statistical evidence”. Proceedings of the
National Academy of Sciences 2013 110(48) 19313-19317 DOI: 10.1073/pnas.1313476110
H. R. N. van Erp, R. O. Linger and P. H. A. J. M. van Gelder “An outline of the Bayesian decision
theory” AIP Conf. Proc. 1757, 050001 (2016) DOI: 10.1063/1.4959057
caution.46
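To illustrate "inference as the revision of opinion", here is a minimal conjugate-updating sketch: a Beta prior over a success probability is revised by binomial data into a Beta posterior. The prior and the data are illustrative assumptions, not taken from Johnson or van Erp et al.:

from scipy import stats

prior_a, prior_b = 1, 1            # uniform Beta(1, 1) prior: no strong initial opinion
successes, failures = 14, 6        # assumed observed data

posterior = stats.beta(prior_a + successes, prior_b + failures)
lo, hi = posterior.interval(0.95)  # 95% credible interval for the success probability

print(f"posterior mean: {posterior.mean():.2f}")
print(f"95% credible interval: [{lo:.2f}, {hi:.2f}]")
print(f"P(success probability > 0.5 | data): {1 - posterior.cdf(0.5):.3f}")

Unlike a confidence interval, the credible interval here is a direct probability statement about the parameter, which is exactly what the quiz statements earlier in this section mistakenly attribute to frequentist intervals.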
Caution
Recent advances in Bayesian hypothesis testing have led to the development of
uniformly most powerful Bayesian tests, which represent an objective, default class
of Bayesian hypothesis tests that have the same rejection regions as classical
significance tests. Based on the correspondence between these two classes of tests,
it is possible to equate the size of classical hypothesis tests with evidence
thresholds in Bayesian tests, and to equate p values with Bayes factors.
An examination of these connections suggests that recent concerns over the lack of
reproducibility of scientific studies can be attributed largely to the conduct of
significance tests at unjustifiably high levels of significance. To correct this
problem, evidence thresholds required for the declaration of a significant finding
should be increased to 25-50:1, and to 100-200:1 for the declaration of a highly
significant finding. In terms of classical hypothesis tests, these evidence standards
mandate the conduct of tests at the 0.005 or 0.001 level of significance.
Johnson, V.E., Revised standards for statistical evidence. Proceedings of the
National Academy of Sciences 2013 110(48) 19313-19317 DOI: 10.1073/pnas.1313476110
caution.47
Do Not Forget
The first rule of performing a project
1. The supervisor is always right
The second rule of performing a project
2. If the supervisor is wrong, rule 1 applies
caution.48
caution.49