Lecture notes

Hypothesis testing and effect sizes
Sylvain Chartier
Laboratory for Computational
Neurodynamics and Cognition
Centre for Neural Dynamics
Example
Preliminary test of a virtual reality (VR) anxiety-provoking tool using a
sample of participants with obsessive-compulsive disorder (OCD). In
order to be considered OCD, a difference of 1 SD (35) must be
observed.
Example
The results of this study suggest that, in VR, the anxiety of those with OCD
is higher than that of healthy controls. As shown in Table 1,
the OCD group and the control group differed in
checking time. OCD characteristics can account for these differences
in behavior and anxiety.
In conclusion, the results suggest that VR is a valuable method for
anxiety provocation.
Beliefs in the H0
• The data are the result of chance (i.e. that H0 is true);
• Making a Type I error if H0 is rejected (i.e. rejecting H0 when it
should not have been rejected);
• That an experimental replication produces statistically significant
results (by calculating 1-p);
• That the decision (to reject H0 or not) is correct;
• Obtaining results at least as extreme as these if H0 is true.
Probability of what?
p(D|H) ≠ p(H|D)
Observed p
The probability of being a female
given that I am pregnant
The probability of being pregnant
given that I am a female
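To see why the two conditional probabilities differ, Bayes' theorem can be applied to a few numbers. This is only a minimal sketch: the prior and the two likelihoods below are made-up illustrative values, not figures from the study, and the function name is hypothetical.

```python
def posterior_h0(prior_h0, p_data_given_h0, p_data_given_h1):
    """P(H0 | D) from Bayes' theorem with two competing hypotheses."""
    joint_h0 = prior_h0 * p_data_given_h0
    joint_h1 = (1.0 - prior_h0) * p_data_given_h1
    return joint_h0 / (joint_h0 + joint_h1)

# Hypothetical inputs: H0 and H1 equally likely a priori; the data have
# probability 0.05 under H0 and 0.50 under H1.
p_d_given_h0 = 0.05
p_h0_given_d = posterior_h0(0.5, p_d_given_h0, 0.50)
print(f"p(D|H0) = {p_d_given_h0:.3f}, p(H0|D) = {p_h0_given_d:.3f}")
```

With these assumed inputs p(H0|D) comes out near 0.09, not 0.05. Going from p(D|H) to p(H|D) requires a prior and an alternative, which a p-value does not contain.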
H0=?
• Since we cannot prove H1, we will try to refute H0.
Hypothesis testing
H0: μ = 0
H1: μ ≠ 0
[Figure: H0 and the two H1 regions]
• H0 has a probability of 1/∞ of being true -> 0
• H1 has a probability of (∞-1)/∞ of being true -> 1
We will always reject H0!
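One practical reading of this point: if the true effect is not exactly zero, a large enough sample will make even a negligible effect statistically significant. The sketch below assumes an idealized one-sample z test in which the observed mean equals the true effect. The effect of 0.01 SD and the sample sizes are hypothetical.

```python
import math
from statistics import NormalDist

def two_sided_p(effect, sd, n):
    """Two-sided p-value of a one-sample z test of H0: mu = 0."""
    z = effect / (sd / math.sqrt(n))
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

# Hypothetical true effect of one hundredth of a standard deviation.
for n in (100, 10_000, 1_000_000):
    print(n, two_sided_p(effect=0.01, sd=1.0, n=n))
```

The effect is the same in all three runs. Only the sample size changes, and with it the decision about H0.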
What to do?
Confidence interval?
Which hypotheses would you reject? B, C, D, E, F
[Figure: confidence intervals for experiments A to F plotted against Treatment Effect, with 0 marked on the axis]
For example, B:
-The confidence interval does not cross zero.
-So the results for that experiment are statistically significant (p < 0.05).
-We have substantial evidence that the difference is not really zero.
What to do?
Confidence interval + equivalence
[Figure: confidence intervals for experiments A to F plotted against Treatment Effect, with a zone of scientific or clinical indifference around 0]
Which hypotheses would you reject? C, D
Which hypotheses would you consider equivalent? A, B
Which hypotheses would you consider ambiguous? E, F
Example
Preliminary test of a virtual reality anxiety-provoking tool using a
sample of participants with obsessive-compulsive disorder (OCD). In
order to be considered OCD, a difference of 1 SD (35) must be
observed.
Example
[Figure: the confidence interval plotted against the zone from -1 SD to +1 SD around 0]
A 95% confidence interval gives the following bounds:
[16.82, 74.99]
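The comparison between a confidence interval and the zone of indifference can be written as a small decision rule. This is a sketch of one reading of the figures, not a procedure stated in the slides: an interval entirely outside the zone is labelled "reject", one entirely inside is "equivalent", and anything straddling a boundary is "ambiguous". The function name and labels are hypothetical. The margin of 35 and the bounds 16.82 and 74.99 come from the example.

```python
def classify_interval(lower, upper, margin):
    """Compare a confidence interval with the indifference zone [-margin, +margin]."""
    if lower > margin or upper < -margin:
        return "reject"        # interval lies entirely outside the zone
    if lower >= -margin and upper <= margin:
        return "equivalent"    # interval lies entirely inside the zone
    return "ambiguous"         # interval straddles a zone boundary

# Example from the slides: 1 SD = 35, 95% CI from 16.82 to 74.99.
print(classify_interval(16.82, 74.99, margin=35))
```

Under this rule the example interval is ambiguous. It excludes zero, so the difference is statistically significant, but it also includes values smaller than the 1 SD required.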
What about the effect size?
$$F = df \cdot \frac{R^2}{1 - R^2} = t^2$$
$$61 \cdot \frac{R^2}{1 - R^2} = 3.156^2 \Rightarrow R^2 = 0.14$$
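Solving the relation between t, df and R² for R² gives R² = t² / (t² + df). A minimal sketch of that arithmetic, using t = 3.156 and df = 61 from the example (the function name is hypothetical):

```python
def r_squared_from_t(t, df):
    """Solve df * R2 / (1 - R2) = t**2 for R2."""
    return t ** 2 / (t ** 2 + df)

r2 = r_squared_from_t(3.156, 61)
print(f"R2 = {r2:.2f}")
```

This reproduces the value of 0.14 reported for the example.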
Confidence intervals around
effect sizes
Getting confidence intervals on means, mean differences, Z-scores, standard
deviations, regression coefficients, etc. is quite simple.
However, things are not that simple for standardized effect sizes and
bounded effect sizes (correlation coefficient, R², proportion, etc.). We have
to use numerical methods.
What about nonpivotal quantities?
Pivotal quantity: a function of observations and unobservable parameters
whose probability distribution does not depend on unknown parameters.
Examples: Mean, mean difference, Z-score, standard deviation, bivariate
correlation, regression coefficient, etc.
Nonpivotal quantity: a function of observations and unobservable
parameters whose probability distribution depends on unknown parameters.
Examples: standardized effect sizes (standardized mean differences,
standardized regression coefficients, coefficients of variation, etc.) and effect
sizes that are bounded (correlation coefficients, squared multiple correlation
coefficients, proportions, etc.).
Confidence intervals for such effects cannot be obtained by inverting their
corresponding test statistic (as can be done for pivotal quantities).
Confidence intervals around
effect sizes
The effect size of the study is 0.81, which is considered large
according to Cohen.
However, if we compute the confidence interval, we get [0.279, 1.31].
In other words, we have no idea about the real utility of VR to elicit
compulsive behaviors.
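An interval of this kind can be approximated numerically by searching for the noncentrality parameters of the noncentral t distribution that place the observed t at the 97.5th and 2.5th percentiles. The sketch below rests on assumptions not stated in the slides: that 0.81 is Cohen's d obtained as 2t/√df with two equal groups, with t = 3.156 and df = 61 taken from the R² computation. All function names are hypothetical, and the integration is a simple midpoint rule, not a library routine.

```python
import math

def nct_cdf_in_delta(t, df, steps=2000):
    """Return a function delta -> P(T <= t) for T ~ noncentral t(df, delta).

    Uses P(T <= t) = E[Phi(t * sqrt(V / df) - delta)] with V ~ chi-square(df),
    integrated with a midpoint rule (adequate for moderate df, not for df < 3).
    """
    upper = df + 12.0 * math.sqrt(2.0 * df)   # far into the chi-square tail
    h = upper / steps
    log_norm = (df / 2.0) * math.log(2.0) + math.lgamma(df / 2.0)
    nodes = []
    for i in range(steps):
        v = (i + 0.5) * h
        log_pdf = (df / 2.0 - 1.0) * math.log(v) - v / 2.0 - log_norm
        nodes.append((math.exp(log_pdf) * h, t * math.sqrt(v / df)))

    def cdf(delta):
        # Phi(x - delta) written with erfc from the standard library
        return sum(w * 0.5 * math.erfc((delta - x) / math.sqrt(2.0))
                   for w, x in nodes)
    return cdf

def solve_decreasing(f, target, lo, hi, iters=50):
    """Bisection for f(x) = target when f decreases in x."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def d_confidence_interval(t, df, level=0.95):
    cdf = nct_cdf_in_delta(t, df)
    alpha = 1.0 - level
    delta_low = solve_decreasing(cdf, 1.0 - alpha / 2.0, t - 10.0, t + 10.0)
    delta_high = solve_decreasing(cdf, alpha / 2.0, t - 10.0, t + 10.0)
    scale = 2.0 / math.sqrt(df)   # assumption: d = 2t / sqrt(df), equal groups
    return delta_low * scale, delta_high * scale

low, high = d_confidence_interval(3.156, 61)
print(f"d = {2 * 3.156 / math.sqrt(61):.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

Under these assumptions the bounds should land in the same neighbourhood as the interval quoted above. Exact agreement is not expected, since the group sizes and the software used for the slides are not given.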
Conclusion
Suppose we asked a runner how long it takes to complete 42 km,
and the answer was somewhere between 3 and 15 hours. We would be
skeptical. So why don't we apply the same skepticism to our
statistical analyses?