Transcript Slide 1

Alternatives to Null Hypothesis Significance Testing
and Variable-Based Modeling
James W. Grice
Oklahoma State University
Department of Psychology
Presented to researchers and staff of Walter Reed Army Research Institute, Silver Spring, MD, April 14th, 2015.
Null Hypothesis Significance Testing (NHST)
α = pcrit = .05
Thoughts running through the researcher’s mind:
Do I have an effect?
Are my results significant?
Is my hypothesis supported?
NHST
Do I have any effects?
Are my results significant?
Are my hypotheses supported?
NHST
The Null Ritual:
1. Set up a statistical null hypothesis of “no mean difference” or
“zero correlation.” Don’t specify the predictions of your research
hypothesis or of any alternative substantive hypotheses.
2. Use 5% as a convention for rejecting the null. If significant,
accept your research hypothesis. Report the result as p < 0.05, p
< 0.01, or p < 0.001 (whichever comes next to the obtained p-value).
3. Always perform this procedure.
p. 588, Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33, 587-606.
NHST
Linear relationship between optimism and visiting a doctor
after detecting a lump in the breast.
$$ r_{xy} = \frac{\sum z_x z_y}{n - 1} = -.18 $$
Assumption-laden NHST
[Diagram: Optimism → Delay visit to doctor, r = -.18*]
Assumptions
• Linearity
• Random Sampling
• Bivariate Normal Population Distribution
• Homoscedasticity
• Continuous variables
• Independence of pairs of observations
• Ho is true
• “p ≤ .05” is proper significance level
Goal is to estimate a population parameter; here, the population correlation
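The correlation on this slide is the sum of the products of paired z-scores divided by n - 1. A minimal sketch of that arithmetic follows. The six pairs of scores and the names `optimism` and `delay_days` are hypothetical, invented for illustration; they are not data from the study.

```python
from math import sqrt


def z_scores(values):
    """Standardize values using the sample standard deviation (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def pearson_r(x, y):
    """r_xy = sum(z_x * z_y) / (n - 1)."""
    zx, zy = z_scores(x), z_scores(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)


# Hypothetical scores for six women (NOT from the study).
optimism = [12, 15, 9, 20, 17, 11]
delay_days = [30, 10, 25, 14, 21, 18]

r = pearson_r(optimism, delay_days)
print(f"r = {r:.3f}")
```

Note that the single number r summarizes the whole group; it says nothing about how any one woman's optimism relates to her own delay.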
NHST
Hypotheses:
Ho : ρxy = 0
HA : ρxy > 0 or ρxy < 0
where ρxy is the population correlation
NHST
[Figure: sampling distribution of r under Ho : ρxy = 0, with critical values rcrit = ±.169 at pcrit = .05]
Sampling Distribution : Distribution of possible outcomes (r values) with
assumptions being fulfilled.
NHST
[Figure: sampling distribution of r under Ho, with pcrit = .05 and rcrit = ±.169; the observed values -.18 and +.18 each cut off a tail area of .0185, so pobs = .037]
Specifically: Given the assumptions, pobs is the probability of
obtaining a result at least as extreme as +/- .18 in a repeated,
random sampling scheme.
This is all you get!
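The "repeated, random sampling scheme" can be imitated by simulation. The sketch below assumes every listed assumption holds (two independent normal variables, so Ho is true) and uses n = 135 and r = -.18 from the slides. The number of simulated samples, the seed, and all function names are arbitrary choices made for illustration.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation from raw scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def simulate_null_r(n, n_samples, seed=1):
    """Draw r values from samples in which the true correlation is zero."""
    rng = random.Random(seed)
    rs = []
    for _ in range(n_samples):
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [rng.gauss(0, 1) for _ in range(n)]
        rs.append(pearson_r(x, y))
    return rs


N = 135            # sample size reported on the slides
R_OBS = -0.18      # observed correlation reported on the slides
N_SAMPLES = 3000   # assumed number of simulated samples

null_rs = simulate_null_r(N, N_SAMPLES)

# Proportion of simulated r values at least as extreme as +/- .18.
p_obs = sum(abs(r) >= abs(R_OBS) for r in null_rs) / N_SAMPLES

# The value that |r| exceeds in only 5% of simulated samples.
r_crit = sorted(abs(r) for r in null_rs)[int(0.95 * N_SAMPLES)]

print(f"simulated p_obs  = {p_obs:.3f}")
print(f"simulated r_crit = {r_crit:.3f}")
```

With enough samples the two printed values should land near the .037 and .169 shown on the slide. The simulation answers only the question posed on the slide: how often would a result this extreme turn up if all the assumptions were true?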
NHST
Things you may want, but do not get from the p-value…
“Bakan (1966) and Thompson (1996, 1999) catalogue some of the most common:
1. A p value is the probability the results will replicate if the study is conducted again (false).
2. We should have more confidence in p values obtained with larger Ns than smaller Ns (this is not
only false but backwards).
3. A p value is a measure of the degree of confidence in the obtained result (false).
4. A p value automates the process of making an inductive inference (false, you still have to do that
yourself—and most don’t bother).
5. Significance testing lends objectivity to the inferential process (it really doesn’t).
6. A p value is an inference from population parameters to our research hypothesis (false, it is only an
inference from sample statistics to population parameters).
7. A p value is a measure of the confidence we should have in the veracity of our research hypothesis
(false).
8. A p value tells you something about the members of your sample (no it doesn’t).
9. A p value is a measure of the validity of the inductions made based on the results (false).
10. A p value is the probability the null is true (or false) given the data (it is not).
11. A p value is the probability the alternative hypothesis is true (or false; this is false).
12. A p value is the probability that the results obtained occurred due to chance (very popular but
nevertheless false).”
p. 73. Lambdin, C. (2011) Significance tests as sorcery: Science is empirical—significance tests are not. Theory &
Psychology, 22(1) 67–90.
NHST
“The 16th edition of a highly influential textbook, Gerrig and
Zimbardo’s Psychology and Life (2002), portrays the null ritual
as statistics per se and calls it the ‘backbone of psychological
research’ ” (p. 46).
p. 589, Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33, 587-606.
NHST
[Diagram: Optimism → Delay visit to doctor, r = -.18*]
Assumptions
• Linearity
• Random Sampling
• Bivariate Normal Population Distribution
• Homoscedasticity
• Continuous variables
• Independence of pairs of observations
• Ho is true
Hypotheses:
Ho : ρxy = 0
HA : ρxy > 0 or ρxy < 0
Goal: ? ≤ ρxy ≤ ?
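For reference, the conventional way to fill in the question marks is a confidence interval based on the Fisher z transformation. The slides do not show this calculation, so the sketch below is an assumption about how the interval would typically be built. It relies on the same assumptions listed above (random sampling, bivariate normality, and so on); the function name `fisher_ci` is hypothetical.

```python
from math import atanh, sqrt, tanh


def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for a population correlation via Fisher's z."""
    z = atanh(r)                 # transform r to the z scale
    se = 1 / sqrt(n - 3)         # standard error on the z scale
    return tanh(z - z_crit * se), tanh(z + z_crit * se)


# r and n as reported on the slides.
lower, upper = fisher_ci(-0.18, 135)
print(f"{lower:.3f} <= rho_xy <= {upper:.3f}")
```

The interval is an inference about ρxy only; if the population itself is ill-defined, so is the interval.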
NHST
Population of Women
All women over 40 years of age?
Only women without a history of breast cancer in their families?
Only women who have had children?
Only American women?
Population correlation often has no empirical reality
NHST
Population of Women
“…researchers may find themselves assuming that their sample is a
random sample from an imaginary population. Such a population
has no empirical existence, but is defined in an essentially
circular way—as that population from which the sample may be
assumed to be randomly drawn. At the risk of the obvious,
inferences to imaginary populations are also imaginary.”
Berk, R. A. & Freedman, D. A. (2003). Statistical assumptions as empirical commitments. In T. G. Blomberg
and S. Cohen (eds.), Law, Punishment, and Social Control: Essays in Honor of Sheldon Messinger, 2nd
ed., pp. 235-254, Aldine de Gruyter.
NHST
The authors did not draw a random sample!
What of the other assumptions as well?
Assumptions
• Linearity
• Random Sampling
• Bivariate Normal Population Distribution
• Homoscedasticity
• Continuous variables
• Independence of pairs of observations
• Ho is true
Hypotheses:
Ho : ρxy = 0
HA : ρxy > 0 or ρxy < 0
Goal: ? ≤ ρxy ≤ ?
NHST
[Figure: sampling distribution of r with rcrit = ±.169 and observed r = ±.18; pobs = ?]
The correlation (r = -.18, n = 135) is statistically significant (p = .038).
I have an effect. My result is significant. My hypothesis is supported.
Statisticians: “We have corrections for some assumption violations.”
NHST
“These adjustments will be successful only under restrictive
assumptions whose relevance to the social world is dubious. Moreover,
adjustments require new layers of technical complexity, which tend to
distance the researcher from the data. Very soon, the model rather than
the data will be driving the research.”
Berk & Freedman (2003).
NHST
Paul Meehl: NHST is “one of the worst things that ever
happened in the history of psychology” (p. 817; Journal of
Consulting and Clinical Psychology, 46, 806-834).
Ioannidis, J. P. (2005). Why most published research
findings are false. PLoS Med, 2(8), e124.
NHST
A few references…
Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33, 587-606.
Lambdin, C. (2011). Significance tests as sorcery: Science is empirical—significance tests are not. Theory & Psychology, 22(1), 67-90.
Ziliak, S. & McCloskey, D. (2008). The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice and Lives. Ann Arbor: University of Michigan Press.
McCloskey, D. (1995). The insignificance of statistical significance. Scientific American, 272, 32-33.
Cohen, J. (1994). The earth is round (p < 0.05). American Psychologist, 49, 997-1003.
Branch, M. (2014). Malignant side effects of null-hypothesis significance testing. Theory & Psychology, 24(2), 256-277.
Nuzzo, R. (2014). Statistical errors. Nature, 506, 151-152.
Ioannidis, J. P. (2005). Why most published research findings are false. PLoS Med, 2(8), e124.
What must we do?
Some suggest…
1. Replace or supplement p-values with confidence intervals and
effect sizes
2. Replace NHST with Bayesian statistics
Others suggest…
Attempt a Gestalt shift:
1. De-emphasize mean and variance-based statistics
2. Think in terms of patterns
3. Focus on accuracy
4. Create analogical (particularly iconic) models
5. …all of this will require that we take our numbers more seriously
Effect Sizes and Confidence Intervals
Hypothetical Results from Four Studies:
1. R2 = .67; p = .002; CI.95 = .40 to .94
2. R2 = .67; p = .002; CI.95 = .40 to .94
3. R2 = .67; p = .002; CI.95 = .40 to .94
4. R2 = .67; p = .002; CI.95 = .40 to .94
Notice the large effect sizes, small p-values, and
moderately wide confidence intervals (df = 1,10)
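Identical summary statistics can come from very different patterns of data. The slides' four hypothetical data sets (df = 1, 10) are not included in the transcript, so the sketch below substitutes Anscombe's quartet, a well-known set of four small data sets (n = 11 each) that share nearly the same R2. It is a stand-in chosen for illustration, not the speaker's data.

```python
from math import sqrt

# Anscombe's quartet (a stand-in; NOT the data behind the slides).
X_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    "I": (X_123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                  7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                   6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                    6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def r_squared(x, y):
    """Squared Pearson correlation (R2 for a one-predictor regression)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (sxy / sqrt(sxx * syy)) ** 2


results = {name: r_squared(x, y) for name, (x, y) in ANSCOMBE.items()}
for name, r2 in results.items():
    print(f"Data set {name}: R2 = {r2:.2f}")
```

Plotting the four sets shows a linear trend, a curve, a line with one outlier, and a vertical cluster with one extreme point, even though the effect sizes agree. Effect sizes and intervals alone cannot distinguish them.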
Effect Sizes and Confidence Intervals
R2 = .67; df = 1, 10; p = .002
Effect Sizes and Confidence Intervals
• “LOT [optimism] scores were related inversely to delay…”
• “Consistent with theory and prior research, overall, optimism explained both delay and…” (p. 205)
• Optimism was a significant predictor of delay
Effect Sizes and Confidence Intervals
A Study in Terror Management Theory
Norenzayan, A. & Hansen, I. (2006). Belief in Supernatural Agents in the face of death. Personality and Social
Psychology Bulletin, 32, 174-187.
• Random assignment to one of two groups:
  1. Write about favorite food
  2. Write about personal death
• Memory task to clear your short term memory
• “How strongly do you believe in God?”
  1 (Not at all)   2   3   4 (Midpoint)   5   6   7 (Very Strongly)
Effect Sizes and Confidence Intervals
$$ t_{obs} = \frac{\bar{x}_D - \bar{x}_F}{\sqrt{\dfrac{s_p^2}{n_D} + \dfrac{s_p^2}{n_F}}} $$
Assumption-laden NHST
[Diagram: Thought of Death → Belief in God, t(64) = 2.18*]
Assumptions
• Random assignment (or sampling)
• Normal population distributions
• Homogeneity of population variances
• Continuous dependent variable
• Independence of observations
• Ho is true
• “p ≤ .05” is proper significance level
Goal is to estimate two population parameters, µDeath and µFood, and the difference between them.
Effect Sizes and Confidence Intervals
Hypotheses: Ho : μFood = μDeath; HA : μFood > μDeath or μFood < μDeath
MDeath = 4.39 (SD = 1.64), MFood = 3.42 (SD = 1.97), t(64) = 2.18, p
< .033, d = .54 (medium effect using Cohen’s conventions),
CI.95: .08 to 1.86.
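The reported values can be approximately reproduced from the summary statistics alone. The sketch below assumes equal group sizes of 33 (the slides give only df = 64, so the split is a guess) and uses a tabled critical value of 1.998 for t with 64 degrees of freedom. The p-value comes from numerically integrating the t density, so no statistics library is needed. All variable and function names are hypothetical.

```python
from math import gamma, pi, sqrt


def t_pdf(t, df):
    """Density of Student's t distribution."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)


def two_tailed_p(t, df, steps=2000):
    """Two-tailed p via Simpson's rule on [0, |t|] (steps must be even)."""
    b = abs(t)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3          # P(0 < T < |t|)
    return 2 * (0.5 - area)


# Means and SDs from the slide; group sizes are ASSUMED equal.
m_death, sd_death, n_death = 4.39, 1.64, 33
m_food, sd_food, n_food = 3.42, 1.97, 33
T_CRIT_64 = 1.998                 # tabled t(.975, df = 64)

df = n_death + n_food - 2
pooled_var = ((n_death - 1) * sd_death ** 2
              + (n_food - 1) * sd_food ** 2) / df
se_diff = sqrt(pooled_var / n_death + pooled_var / n_food)

diff = m_death - m_food
t_obs = diff / se_diff
cohen_d = diff / sqrt(pooled_var)
ci_low = diff - T_CRIT_64 * se_diff
ci_high = diff + T_CRIT_64 * se_diff
p_obs = two_tailed_p(t_obs, df)

print(f"t({df}) = {t_obs:.2f}, p = {p_obs:.3f}, d = {cohen_d:.2f}")
print(f"CI.95: {ci_low:.2f} to {ci_high:.2f}")
```

Every quantity here is computed from four aggregate numbers. None of them indicates how many individual participants actually responded in the way the theory predicts.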
Effect Sizes and Confidence Intervals
Output from a Bayesian estimation program
Accuracy
“In contrast [to traditional statistical methods],
ODA maximizes the accuracy of a model.”
(Yarnold, P., & Soltysik, R. (2005). Optimal Data Analysis. APA,
Washington, DC, p. 4.)
Accuracy & Patterns
Focus on patterns and accuracy using the
Percent Correct Classification (PCC) index
Accuracy & Patterns
[Diagram: Thought of Death → Increased Religiosity, t(64) = 2.18*]
MDeath = 4.39 (SD = 1.64), MFood = 3.42 (SD = 1.97), t(64) = 2.18, p
< .033, d = .54 (medium effect using Cohen’s conventions),
CI.95: .08 to 1.86.
Observation Oriented Modeling (OOM) shows that the pattern of results makes
no sense with regard to Terror Management Theory when examined at the level of
the individuals in the study and when we attempt to take our numbers seriously.
Persons & Patterns, not Aggregates
[Path diagram: Daily PTSD symptoms, Daily NA, and Number of standard drinks/day; path coefficients shown: 0.13***, 0.42***, and -0.14 (-0.02); *** p < .001]
• Diary data for 54 women. Plenty of within-person data!
(Cohn, Hagman, Moore, Mitchell, Ehlke (2014). Psychology of Addictive Behaviors, 28, 114-126.)
• “Statisticism”: in part, a failure to recognize the difference
between an aggregate statistical effect and the cause-effect
processes at the level of the persons (Lamiell, J. T., 2013, New Ideas in
Psychology, 31, 65-71).
• How many individual women fit this causal model?
Persons & Patterns, not Aggregates
“Indeed, only six women responded to the survey on all
14 days, and the median number of completed days was
equal to 11. The median PCC value was equal to 44.35,
indicating general incongruity between the relative
changes in PTSD and negative affect observations
across all days and all women. More specifically, PCC
values for only 23 women exceeded 50%, and of those
only eight patterns 1) passed the eye test, 2) included
seven or more days of observations, and 3) showed
some variability in the observations.” Grice et al., in press.
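A person-level PCC of the kind described in the quote can be sketched as follows. The slides do not give the exact OOM procedure, so this is one simple reading: for each woman, count the day-to-day changes in which PTSD symptoms and negative affect move in the same direction (both up, both down, or both unchanged), and express that count as a percentage. The three diaries and all names are hypothetical.

```python
from statistics import median


def sign(value):
    """Return -1, 0, or +1 according to the sign of value."""
    return (value > 0) - (value < 0)


def pcc_same_direction(ptsd, neg_affect):
    """Percent of day-to-day changes where both series move the same way."""
    matches = 0
    n_changes = len(ptsd) - 1
    for day in range(1, len(ptsd)):
        ptsd_change = sign(ptsd[day] - ptsd[day - 1])
        na_change = sign(neg_affect[day] - neg_affect[day - 1])
        if ptsd_change == na_change:
            matches += 1
    return 100.0 * matches / n_changes


# Hypothetical diaries (NOT data from the study): (PTSD, negative affect).
diaries = {
    "woman_A": ([3, 4, 4, 2, 5, 6, 3], [2, 3, 3, 1, 4, 5, 2]),
    "woman_B": ([5, 3, 4, 4, 2, 3], [1, 2, 2, 3, 3, 1]),
    "woman_C": ([2, 3, 1, 4, 4], [1, 2, 3, 2, 2]),
}

pcc = {name: pcc_same_direction(p, na) for name, (p, na) in diaries.items()}
for name, value in pcc.items():
    print(f"{name}: PCC = {value:.1f}%")
print(f"median PCC = {median(pcc.values()):.1f}%")
```

Each woman gets her own PCC, so the question "how many individual women fit this causal model?" is answered by counting persons rather than by interpreting an aggregate coefficient.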
Inferences
Rather than seeking:
1. An inference to a population parameter : ? ≤ µDeath - µFood ≤ ?
2. An inference about aggregate statistics (in Bayesian analysis)
We are seeking:
Inference to best explanation. Why are the data patterned in
such and such a manner?
Philosophical Realism
Aristotle
• Philosophical Realism: AKA “Reasoned common sense”
• Natural science (epistēmē) is demonstrable knowledge of nature through its causes
• Causes inhere in the things themselves and are knowable; this is causality
• Thing-based rather than event-based ontology
• Cause: Material, Formal, Efficient, and Final
Philosophical Realism
[Diagram: Thought of Death → Increased Religiosity, t(64) = 2.18*]
? ≤ µDeath - µFood ≤ ?
Philosophical Realism
St. Thomas Aquinas
Philosophical Realist
Analogical (Iconic) Models
Integrated Model from Bill Powers’ Perceptual Control Theory
Powers, W.T. (2008). Living control systems III: Modeling behavior. Montclair, NJ: Benchmark Publications.
Analogical (Iconic) Models
http://ccl.northwestern.edu/netlogo/
https://www.youtube.com/watch?v=AJXFiO-ULv0
What must we do?
So…Forget NHST!
Attempt a Gestalt shift:
1. De-emphasize mean and variance-based statistics
2. Think in terms of patterns
3. Focus on accuracy
4. Create analogical (particularly iconic) models
5. …all of this will require that we take our numbers more seriously
The End
http://www.idiogrid.com/OOM