Power & Effect Size
POWER AND EFFECT SIZE
Previous Weeks
A few weeks ago I made a small chart outlining all the different statistical tests we've covered (week 9)
I want to complete that chart using information from the past week
Most of this is a repeat – but a few new tests have been added
Important that you are familiar with these tests, know when they are appropriate to use, and how to run (most of) them in SPSS
Excused from running ANCOVA, RM ANOVA
When to use specific statistical tests…
# of IV (format) | # of DV (format) | Examining… | Test/Notes
1 (continuous) | 1 (continuous) | Association | Pearson Correlation (r)
1 (continuous) | 1 (continuous) | Prediction | Simple Linear Regression (m + b)
Multiple (continuous) | 1 (continuous) | Prediction | Multiple Linear Regression (m + b)
# of IV (format) | # of DV (format) | Examining… | Test/Notes
1 (grouping, 2 levels) | 1 (continuous) | Group differences | When one group is a 'known' population = One-Sample t-test
1 (grouping, 2 levels) | 1 (continuous) | Group differences | When both groups are independent = Independent Samples t-test
1 (grouping, 2 levels) | 1 (continuous) | Group differences | When both groups are dependent = Paired Samples t-test
1 (grouping, ∞ levels) | 1 (continuous) | Group differences | One-Way ANOVA, with Post-Hoc (F ratio)
# of IV (format) | # of DV (format) | Examining… | Test/Notes
∞ (grouping, ∞ levels) | 1 (continuous) | Group differences and interactions | Factorial ANOVA with Post-Hoc and/or Estimated Marginal Means (F ratio)
∞ (grouping, ∞ levels) | 1 (continuous) | Group differences, interactions, controlling for confounders | ANCOVA (Analysis of CoVariance) with Estimated Marginal Means (F ratio)
∞ (grouping, ∞ levels) | 1 (continuous) | Group differences, interactions, controlling for confounders in a related sample (e.g., longitudinal) | Repeated Measures ANOVA with Estimated Marginal Means (F ratio)
Tonight…
A break from learning a new statistical ‘test’
Focus will be on two critical statistical ‘concepts’
Statistical Power
Related to Alpha/Statistical Significance
Brief overview of Effect Size
Statistically significant results vs Meaningful results
First, a quick review of error in testing…
Example Hypothesis
Pretend my master's thesis topic is the influence of exercise on body composition
I believe people who exercise more will have lower %BF
To study this:
I draw a sample and group subjects by how much they exercise – High and Low Exercise Groups (this is my IV)
I also assess %BF in each subject as a continuous variable (DV)
I plan to see if the two groups have different mean %BF
My hypotheses (H0 and HA):
HA: There is a difference in %BF between the groups
H0: There is no difference in %BF between the groups
Example Continued
Now I'm going to run my statistical test, get my test statistic, and calculate a p-value
I've set alpha at the standard 0.05 level
By the way, what statistical test should I use…?
My final decision on my hypotheses is going to be based on that p-value:
I could reject the null hypothesis (accept HA)
I could accept the null hypothesis (reject HA)
Statistical Errors…
Since there are two potential decisions (and only one of them can be correct), there are two possible errors I can make:
Type I Error: we could reject the null hypothesis although it was really true (should have accepted the null)
Type II Error: we could fail to reject the null hypothesis when it was really untrue (should have rejected the null)
HA: There is a difference in %BF between the groups
H0: There is no difference in %BF between the groups
There are really 4 potential outcomes, based on what is "true" and what we "decide":
What is True | Our Decision: Reject H0 | Our Decision: Accept H0
H0 | Type I Error | Correct
HA | Correct | Type II Error
Statistical Errors…
Remember – my final decision is based on the p-value:
If p ≤ 0.05, our decision is reject H0
If p > 0.05, our decision is accept H0
Statistical Errors…
In my analysis, I find:
High Exercise Group mean %BF = 22%
Low Exercise Group mean %BF = 26%
p = 0.08
What is my decision? Is it possible I've made an error in my decision?
Accept H0
There is NOT a difference in %BF between the groups
Why is that my decision? The means ARE different!
I can't be confident that the 4% difference between the two groups is not due to random sampling error
Possible Error…?
If I did make an error, what type would it be?
Type II Error
When you find a p-value greater than alpha, the only possible error is Type II error
When you find a p-value less than alpha, the only possible error is Type I error
If p ≤ 0.05, our decision is reject H0
If p > 0.05, our decision is accept H0
Our p = 0.08, so we accepted H0 – the only possible error is Type II
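The decision rule, and which error each decision risks, can be sketched in Python (an illustration of mine, not part of the lecture's SPSS workflow):

```python
def decide(p_value, alpha=0.05):
    # Reject H0 when p <= alpha; otherwise accept (fail to reject) H0.
    # Whichever decision we make, only one error type is possible:
    # rejecting risks Type I, accepting risks Type II.
    if p_value <= alpha:
        return ("reject H0", "Type I")
    return ("accept H0", "Type II")

# The thesis example: p = 0.08 with alpha = 0.05
print(decide(0.08))  # ('accept H0', 'Type II')
print(decide(0.03))  # ('reject H0', 'Type I')
```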
Possible Error…?
Compare Type I and Type II error like this:
The only concern when you find statistical significance (p < 0.05) is Type I Error
Is the difference between groups REAL or due to Random Sampling Error?
Thankfully, the p-value tells you exactly what the probability of that random sampling error is
In other words, the p-value tells you how likely Type I error is
But, does the p-value tell you how likely Type II error is?
The probability of Type II error is better provided by Power
Possible Error…?
Probability of Type II error is provided by Power
Statistical Power is related to β, the probability of Type II error (Power = 1 – β)
We will not discuss the specific calculation of power in this class
SPSS can calculate this for you
Power is related to Alpha, but:
Alpha is the probability of having Type I error
A lower number is better (i.e., 0.05 vs 0.01 vs 0.001)
Power is the probability of NOT having Type II error
The probability of being right (correctly rejecting the null hypothesis)
A higher number is better (typical goal is 0.80)
Let’s continue this in the context of my ‘thesis’ example
Statistical Errors…
In my analysis, I found:
High Exercise Group mean %BF = 22%
Low Exercise Group mean %BF = 26%
p = 0.08
Decided to accept the null
What do I do when I don't find statistical significance?
What happens when the result does not reflect expectations?
First, consider the situation
Should it be statistically significant?
The most obvious thing you need to consider is whether you REALLY should have found a statistically significant result
Just because you wanted your test to be significant doesn't mean it should be
This wouldn't be Type II error – it would just be the correct decision!
In my example, researchers have shown in several studies that exercise does influence %BF
This result 'should' be statistically significant, right?
If the answer is yes, then you need to consider power
In my ‘thesis’
This result 'should' be statistically significant, right?
Probably an issue with Statistical Power
This scenario plays out at least once a year between myself and a grad student working on a thesis or research project
How can I increase the chance that I will find statistically significant results?
Why was this analysis not statistically significant?
What can I do to decrease the chance of Type II error?
Several different factors influence power
Your ability to detect a true difference
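The lecture leaves the actual power calculation to SPSS, but the way these factors interact can be sketched with a standard normal approximation for a two-group comparison (the function, the sample sizes, and the critical z values for the two alpha levels are my own illustrative choices):

```python
import math

def normal_cdf(x):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Two-sided critical z values for two common alpha levels
Z_CRIT = {0.05: 1.960, 0.10: 1.645}

def approx_power(d, n_per_group, alpha=0.05):
    # Normal approximation to two-sample t-test power:
    # P(Z > z_crit - d * sqrt(n/2)), where d is the standardized mean difference
    return normal_cdf(d * math.sqrt(n_per_group / 2.0) - Z_CRIT[alpha])

# A bigger effect, more subjects, or a looser alpha all raise power:
print(round(approx_power(0.5, 30), 2))              # baseline: power near 0.5
print(round(approx_power(0.5, 64), 2))              # more subjects per group
print(round(approx_power(0.8, 30), 2))              # larger true difference
print(round(approx_power(0.5, 30, alpha=0.10), 2))  # alpha raised to 0.10
```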
How can I increase Power?
1) Increase Alpha level
Changing alpha from 0.05 to 0.10 will increase your power (better chance of finding significant results)
Downsides to increasing your alpha level?
This will increase the chance of Type I error!
This is rarely acceptable in practice
Only really an option when working in a new area:
Researchers are unsure of how to measure a new variable
Researchers are unaware of confounders to control for
How can I increase Power?
2) Increase N
Sample size is directly used when calculating p-values
Including more subjects will increase your chance of finding statistically significant results
Downsides to increasing sample size?
More subjects means more time/money
More subjects is ALWAYS a better option if possible
How can I increase Power?
3) Use fewer groups/variables (simpler designs)
Related to sample size but different
'Use fewer groups' NOT 'use fewer subjects'
More groups negatively affects your degrees of freedom
Remember, df is calculated with # groups and # subjects
Lots of variables, groups and interactions make it more difficult to find statistically significant differences
The purpose of the Family-wise error rate is to make it harder to find significant results!
Downsides to fewer groups/variables?
Sometimes you NEED to make several comparisons and test for interactions – unavoidable
How can I increase Power?
4) Measure variables more accurately
If variables are poorly measured (sloppy work, broken equipment, outdated equipment, etc.) this increases measurement error
More measurement error decreases confidence in the result
For example, perhaps I underestimated %BF in my 'low exercise' group? This could lead to Type II Error.
More of an internal validity problem than a statistical problem
Downsides to measuring more accurately?
None – if you can afford the best tools
How can I increase Power?
5) Decrease subject variability
Subjects will have various characteristics that may also be correlated with your variables
SES, sex, race/ethnicity, age, etc.
These variables can confound your results, making it harder to find statistically significant results
When planning your sample (to enhance power), select subjects that are very similar to each other
This is a reason why repeated measures tests and paired samples are more likely to have statistically significant results
Downside to decreasing subject variability?
Will decrease your external validity – generalizability
If you only test women, your results do not apply to men
How can I increase Power?
6) Increase magnitude of the mean difference
If your groups are not different enough, make them more different!
For example, instead of measuring just high and low exercisers, perhaps I compare marathon runners vs completely sedentary people?
Compare a 'very' high exercise to a 'very' low exercise group
Sampling at the extremes, getting rid of the middle group
Downsides to using the extremes?
Similar to decreasing subject variability, this will decrease your external validity
Questions on Power/Increasing Power?
The Catch-22 of Power and P-values
I've mentioned this previously – but once you are able to draw a large sample, this will ruin the utility of p/statistical significance
The larger your sample, the more likely you'll find statistically significant results
Sometimes minuscule differences between groups or tiny correlations are 'significant'
This becomes relevant once sample size grows to 100-150 subjects per group
Once you approach 1000 subjects, it's hard not to find p < 0.05
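A quick sketch of why: the test statistic for a correlation grows with the square root of n, so a fixed tiny r drifts toward significance as n rises. The snippet below uses a normal approximation to the t distribution (adequate at these sample sizes); the sample sizes are illustrative:

```python
import math

def normal_cdf(x):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def corr_p_approx(r, n):
    # t statistic for H0: rho = 0; two-sided p via a normal approximation
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    return 2.0 * (1.0 - normal_cdf(abs(t)))

# The same tiny correlation, r = 0.10, at growing sample sizes:
for n in (30, 150, 1000):
    print(n, round(corr_p_approx(0.10, n), 4))
# n = 30 gives p well above 0.05; by n = 1000, p is far below it
```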
Example from the most highly cited paper in Psych, 2004…
This paper was the first to find a link between playing video games/TV and aggression in children:
Every correlation in this table except 1 has p < 0.05
Do you remember what a correlation of 0.10 looks like?
[Scatterplot of two variables with r = 0.10]
Do you see a relationship between these two variables?
What now?
This realization has led scientists to begin to avoid p-values (or at least avoid just reporting p-values)
Moving towards reporting with 95% confidence intervals
Especially in areas of research where large samples are common (epidemiology, psychology, sociology, etc.)
Some people interpret 'statistically significant' as being 'important'
We've mentioned several times this is NOT true
Statistically significant just means it's likely not Type I error
Can have 'important' results that aren't statistically significant
Effect Size
To get an idea of how 'important' a difference or association is, we can use Effect Size
There are over 40 different types of effect size
Depends on the statistical test used
SPSS will NOT always calculate effect size
Effect size is like a 'descriptive' statistic that tells you about the magnitude of the association or group difference
Not impacted by statistical significance
Effect size can stay the same even if the p-value changes
Present the two together when possible
The goal is not to teach you how to calculate effect size, but to understand how to interpret it when you see it
Effect Size
Understanding effect size from correlations and regressions is easy (and you already know it):
r², the coefficient of determination
% Variance accounted for
Pearson correlations between %BF and 3 variables:
r = 0.54, r = -0.92, r = 0.70
Which of the three correlations has the most important association with %BF?
r² = 0.29, r² = 0.85, r² = 0.49
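The arithmetic behind those r² values, with placeholder variable names since the slide doesn't identify the three variables:

```python
# Squaring each r gives the proportion of variance in %BF accounted for
correlations = {"variable A": 0.54, "variable B": -0.92, "variable C": 0.70}

for name, r in correlations.items():
    print(f"{name}: r = {r:+.2f}, r^2 = {r ** 2:.2f}")
# variable A: r = +0.54, r^2 = 0.29
# variable B: r = -0.92, r^2 = 0.85
# variable C: r = +0.70, r^2 = 0.49
# The sign drops out when squaring: r = -0.92 accounts for the most
# variance (85%), so it is the most 'important' association with %BF
```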
Interpreting Effect Size
Usually, guidelines are given for interpreting the
effect size
Help you to know how important the effect is
Only a guide, you can use your own brain to compare
In general, r2 is interpreted as:
0.01
or smaller, a Trivial Effect
0.01 to 0.09, a Small Effect
0.09 to 0.25, a Moderate Effect
> 0.25, a Large Effect
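These guidelines are easy to encode (the exact boundary handling is my choice; the slide doesn't say which side the 0.09 and 0.25 cut-points fall on):

```python
def interpret_r2(r2):
    # Thresholds follow the r^2 guideline from the lecture
    if r2 <= 0.01:
        return "Trivial"
    if r2 < 0.09:
        return "Small"
    if r2 <= 0.25:
        return "Moderate"
    return "Large"

print(interpret_r2(0.005))  # Trivial
print(interpret_r2(0.15))   # Moderate
print(interpret_r2(0.29))   # Large
```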
Effect Size in Regression
Two regression equations contain 4 predictors of %BF. Each 'model' is statistically significant. Here are their r² values:
0.29 and 0.15
Which has the largest effect size? Do either of the regression models have a large effect size?
The 0.29 model is the most important, and has a 'large' effect size.
The 0.15 model is of 'moderate' importance.
Effect Size for Group Differences
Effect size in t-tests and ANOVAs is a bit more complicated
In general, effect size is a ratio of the mean difference between two groups and the standard deviation
Does this remind you of anything we've previously seen?
Z-score = (Score – Mean)/SD
Effect size, when calculated this way, is basically determining how many standard deviations apart the two groups are
E.g., an effect size of 1 means the two groups differ by 1 standard deviation (this would be a big difference)!
Example
When working with t-tests, calculating effect size as the mean difference/SD is called Cohen's d
< 0.1: Trivial effect
0.1-0.3: Small effect
0.3-0.5: Medium effect
> 0.5: Large effect
The next slide is the result of a repeated measures t-test from a past lecture; we'll calculate Cohen's d
Paired-Samples t-test Output
Mean difference = 2.9, Std. Deviation = 5.2
Cohen's d = 0.55, a large effect size
Essentially, the weight loss program reduced body weight by just about half a standard deviation
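The same calculation from the rounded summary values (which gives 0.56, a hair above the reported 0.55; the slide's figure presumably comes from unrounded SPSS output):

```python
# Cohen's d for a paired-samples t-test:
# mean of the difference scores / SD of the difference scores
mean_diff = 2.9
sd_diff = 5.2
d = mean_diff / sd_diff
print(round(d, 2))  # 0.56 -> a Large effect on the scale above (> 0.5)
```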
Other example
I sample a group of 100 ISU students and find their average IQ is 103
Recall, the population mean for IQ is 100, SD = 15
I run a one-sample t-test and find it to be statistically significant (p < 0.05)
However, the effect size is…
0.2, a Small Effect
Interpretation: While this difference is likely not due to random sampling error – it's not very important either
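Putting the two numbers side by side makes the significance-vs-importance contrast concrete (a sketch assuming the sample SD matches the population value of 15):

```python
import math

n, sample_mean, pop_mean, sd = 100, 103, 100, 15

# Test statistic: the 3-point difference sits about 2 standard errors
# above 100, so p < 0.05
t = (sample_mean - pop_mean) / (sd / math.sqrt(n))
print(t)  # 2.0

# Effect size: the same difference is only 0.2 standard deviations
d = (sample_mean - pop_mean) / sd
print(d)  # 0.2 -> a Small effect, despite statistical significance
```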
Other types of effect sizes
SPSS will not calculate Cohen's d for t-tests
However, it will calculate effect size for ANOVAs (if you request it)
Not Cohen's d, but Partial Eta Squared (η²)
Similar to r², interpreted the same way (same scale)
Here is last week's cancer example
Does Tumor Size and Lymph Node Involvement affect Survival Time?
I'll re-run and request effect size…
Notice, η² can be used for the entire 'model', or each main effect and interaction individually
How would you describe the effect of Tumor Size, or our interaction?
Trivial to Small Effect – how did we get a significant p-value?
Other factors not in our model are also very important
Notice that the r² is equal to the η² of the full model
The advantage of η² is that you can evaluate individual effects
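Partial η² itself is just a ratio of sums of squares from the ANOVA table. A sketch with made-up sums of squares (the lecture's actual SPSS output isn't reproduced in this transcript):

```python
def partial_eta_sq(ss_effect, ss_error):
    # Partial eta squared: SS_effect / (SS_effect + SS_error),
    # computed separately for each main effect and interaction
    return ss_effect / (ss_effect + ss_error)

# Hypothetical values: a main effect whose sum of squares is small
# relative to the error term
print(round(partial_eta_sq(12.0, 388.0), 2))  # 0.03 -> Small on the r^2 scale
```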
Effect Size Summary
Many other types of effect sizes are out there – I just wanted to show you the effect sizes most commonly used with the tests we know:
Correlation and Regression: r²
T-tests: Cohen's d
ANOVA: Partial eta squared (η²) and/or r²
You are responsible for knowing:
The general theory behind effect size/why to use them
What tests they are associated with
How to interpret them
QUESTIONS ON POWER?
EFFECT SIZE?
Upcoming…
In-class activity
Homework:
Cronk – Read Appendix A (pg. 115-19) on Effect Size
Holcomb Exercises 21 and 22
No out-of-class SPSS work this week
Things are slowing down – next week we'll discuss non-parametric tests
Chi-Square and Odds Ratio