effect size indicator
RMTD 404
Lecture 8
Power
Recall what you learned about statistical errors in Chapter 4:
• Type I Error: Finding a difference when there is no true difference in the
populations (i.e., incorrectly rejecting a true null hypothesis), designated by
α.
• Type II Error: Not finding a difference when there is a true difference in the
populations (i.e., incorrectly retaining a false null hypothesis), designated by
β.
Power is the probability of finding a difference when there is a true difference
in the populations (i.e., correctly rejecting a false null hypothesis), designated
1-β.
Factors affecting power
There are four key factors that influence the power of a statistical test:
1. The alpha (α) that a researcher chooses
2. The magnitude of the true population difference (effect size)
3. The sample size
4. The statistical test used
Let’s try some of these in R (http://homepages.luc.edu/~rwill5/code.html)
Alpha’s influence on power
A small alpha (α) makes the critical
value more extreme so that less of
the alternative distribution is
allocated to the rejection region.
Hence, we have less power with
smaller alphas.
[Figure: rejection regions shown for Alpha = .10 and Alpha = .05]
A larger α makes the critical value less
extreme so that more of the
alternative distribution is allocated
to the rejection region. Hence, we
have more power with larger
alphas.
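To make the effect of α concrete, here is a minimal Python sketch. It assumes a two-tailed z test (a normal approximation, not an exact t-based calculation), and `shift` is a hypothetical name for the distance between the null and alternative means measured in standard errors.

```python
from statistics import NormalDist

def approx_power(shift, alpha):
    """Approximate power of a two-tailed z test.

    shift: distance between the null and alternative means, in standard errors
           (hypothetical parameter name used only for this sketch).
    alpha: Type I error rate chosen by the researcher.
    """
    z = NormalDist()                     # standard normal distribution
    crit = z.inv_cdf(1 - alpha / 2)      # critical value; more extreme as alpha shrinks
    upper = 1 - z.cdf(crit - shift)      # area of the alternative beyond +crit
    lower = z.cdf(-crit - shift)         # area of the alternative beyond -crit
    return upper + lower

# Same alternative distribution, three choices of alpha
for alpha in (0.01, 0.05, 0.10):
    print(f"alpha = {alpha:.2f}  power = {approx_power(2.0, alpha):.3f}")
```

With the alternative held fixed, the printed power should rise as α moves from .01 to .10, which is the pattern described above.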
Effect size’s influence on power
A small effect size makes the critical
value more extreme on the alternative
distribution so that less of that
distribution’s area is allocated to the
rejection region. Hence, we have less
power with smaller effect sizes.
A larger effect size makes the critical
value less extreme on the alternative
distribution so that more of that
distribution’s area is allocated to the
rejection region. Hence, we have
more power with larger effect sizes.
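The same idea can be sketched numerically. The example below assumes a one-sample design, the relation δ = d√n that is defined later in this lecture, and a normal approximation to two-tailed power; the sample size of 25 is a hypothetical value chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

def power_for_effect(d, n, alpha=0.05):
    """Approximate two-tailed power for a one-sample test with effect size d."""
    z = NormalDist()
    delta = abs(d) * sqrt(n)             # delta = d * sqrt(n)
    crit = z.inv_cdf(1 - alpha / 2)      # two-tailed critical value
    return (1 - z.cdf(crit - delta)) + z.cdf(-crit - delta)

n = 25  # hypothetical sample size, held fixed while d varies
for d in (0.20, 0.50, 0.80):
    print(f"d = {d:.2f}  power = {power_for_effect(d, n):.2f}")
```

Holding n and α constant isolates the effect size, so any change in the printed power comes from d alone.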
Sample size’s influence on power
A small sample size makes the critical
value more extreme on the alternative
distribution so that less of that
distribution’s area is allocated to the
rejection region. Hence, we have less
power with smaller sample sizes.
A larger sample size makes the critical
value less extreme on the
alternative distribution so that
more of that distribution’s area is
allocated to the rejection region.
Hence, we have more power with
larger sample sizes.
Influence of sample size & variance on power
Recall that the central limit theorem defines the standard error of the mean as
$$\sigma_{\bar{X}} = \frac{\sigma_X}{\sqrt{N}}$$
Hence, as sample size increases, the size of the standard error of the mean
decreases. As the sample size decreases, the size of the standard error of
the mean increases.
Similarly, as σx decreases, the standard error of the mean would also decrease,
indicating that effects are easier to detect with more homogeneous
populations.
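A short Python sketch of this relationship follows. The σ and N values are hypothetical and were picked only to show the direction of the change.

```python
from math import sqrt

def standard_error(sigma, n):
    """Standard error of the mean: sigma divided by the square root of n."""
    return sigma / sqrt(n)

# Hypothetical population standard deviations and sample sizes
for sigma in (15.0, 10.0):
    for n in (25, 100, 400):
        print(f"sigma = {sigma:4.1f}  N = {n:3d}  SE = {standard_error(sigma, n):.2f}")
```

Quadrupling N halves the standard error, and a more homogeneous population (smaller σ) lowers it at every sample size.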
Influence of statistical test on power
One last note: different statistical tests provide different levels of power, all
other things being equal. A test gains power from the assumptions
that it makes about the data being analyzed. Of course, if these
assumptions are invalid, then the data-based decision that you make
based on a hypothesis test may also be invalid.
Estimating Sample Size
A good reason to perform power analysis is that these computations allow you
to estimate the sample size that you would need to detect what you
believe is a meaningful effect size.
An important component of the power analysis computation is the effect size
indicator. You must specify the size of the effect you wish to detect in order
to determine the sample size that you must use.
However, the previous equations have suggested that you must have your data
in hand in order to get the effect size indicator. Hence, you need an
estimate of the effect size in order to perform power and sample size
computations.
So how many people do I need?
We will use my favorite table thus far, the Power/Delta Table, which is entered with
$$\delta = d\sqrt{n}$$
Sources of effect size estimation
There are three ways to estimate the size of the effect that you’ll want to
detect:
• Prior Research: You can estimate the effect size from prior studies that give the necessary statistics. This will allow you to detect effect sizes similar to those found by other researchers in similar studies.
• Professional Judgment: Based on your own experiences, you may be able to identify an effect size that is substantively interesting. This will allow you to detect effect sizes that have real-world meaning, based on your experiences.
• Convention: You can also use Cohen’s rule of thumb (e.g., small = .20, medium = .50, large = .80). This approach is probably only advisable when you don’t have enough information to perform the estimation using either of the previous approaches.
An effect size indicator for t-tests
One measure of the magnitude of an effect, an effect size indicator, depicts
the magnitude of the effect scaled in population standard deviation units.
$$d = \frac{\mu_X - \mu_0}{\sigma_X}$$ (parameter version)
If we want to estimate d from observed data, then we can transform the
equation to:
$$d = \frac{\bar{X} - \mu_0}{s_X}$$ (statistic version)
A rule of thumb for interpreting d is that:
d = .20 is a small effect size
d = .50 is a medium effect size
d = .80 is a large effect size
A visual depiction of d
[Figure: pairs of overlapping distributions for four values of d]
d = .20, %overlap = 85
d = .50, %overlap = 66
d = .80, %overlap = 53
d = 1.10, %overlap = 41
Another Effect Size Indicator for t-tests
A similar measure of the magnitude of an effect is the squared point-biserial
correlation, which is similar to the measures of association that we
discussed in the context of the chi-square test. Rather than depicting the
magnitude of the effect on the population standard deviation scale (as is
the case for d), the squared point-biserial correlation indicates the
proportion of shared variance between the independent and dependent
variable.
$$r_{pb}^2 = \frac{t_{observed}^2}{t_{observed}^2 + df}$$
A rule of thumb for interpreting $r_{pb}^2$ is that:
$r_{pb}^2$ = .01 is a small effect size
$r_{pb}^2$ = .06 is a medium effect size
$r_{pb}^2$ = .14 is a large effect size
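As a quick illustration, the squared point-biserial correlation can be computed directly from an observed t and its degrees of freedom. In this sketch the function names are hypothetical, and treating the rule-of-thumb values as lower cutoffs (with anything under .01 labeled "below small") is an assumption made for the example. The inputs t = 2.23 and df = 15 are taken from the reporting example in this lecture.

```python
def r_pb_squared(t_observed, df):
    """Squared point-biserial correlation: t^2 / (t^2 + df)."""
    return t_observed ** 2 / (t_observed ** 2 + df)

def rule_of_thumb(r2):
    """Label r_pb^2 using .01 / .06 / .14 as lower cutoffs (an assumption of this sketch)."""
    if r2 >= 0.14:
        return "large"
    if r2 >= 0.06:
        return "medium"
    if r2 >= 0.01:
        return "small"
    return "below small"

r2 = r_pb_squared(2.23, 15)   # t(15) = 2.23
print(f"r_pb^2 = {r2:.2f} ({rule_of_thumb(r2)})")
```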
Reporting effect size indicators
To provide a more informative substantive interpretation, we would report and
interpret the effect size indicators. So, we might say something like the
following.
The difference in means for students in Program A (31.21) and Program B (37.86) is too large to be accounted for by sampling error, t(15) = 2.23, p = .02. In addition, this effect size is quite large ($r_{pb}^2$ = .25), indicating that the observed difference is not an artifact of a large sample size.

Scores of females (M = 3.56) were higher than the scores of males (M = 2.21), and this difference was statistically significant and the effect size was moderate, t(20) = 4.41, p < .0001, d = .55.

There was a statistically significant difference between the mean ratings of husbands and wives ($\bar{D}$ = 2.31, p = .01), but the effect size indicated that this difference is probably trivial ($r_{pb}^2$ = .003).
Effect size calculations: One-sample t-test
For the one-sample t-test, d is estimated as:
$$d = \frac{\bar{X} - \mu_0}{s_X}$$
We can interpret the observed value as defined by Cohen’s rule-of-thumb
criteria with values of .8 indicating large effect sizes.
The d index is very important in planning a study, because you need to specify a
meaningful effect size that you’d like to detect in order to determine the
sample size required to detect that difference.
As you saw earlier, power depends on both the effect size and the sample size. We
use the statistic δ (delta) = d[f(n)] to represent this combination, where the
particular function of n is defined differently for each individual test.
For the one-sample t-test, the function of n is $\sqrt{n}$. Specifically,
$$\delta = d\sqrt{n}$$
Given δ as defined here, we can determine the power of the
one-sample t test from the table of power on p.678.
Back to the example we had for the one-sample t-test:
The mean GRE score of 300 students in the School of Education at LUC is 565, and
the standard deviation equals 75. We know the mean of the GRE test-taker
population is 500. Thus, $\bar{X} = 565$, $\mu_0 = 500$, and $s_X = 75$.
$$d = \frac{\bar{X} - \mu_0}{s_X} = \frac{565 - 500}{75} = 0.87$$
Then $\delta = d\sqrt{n} = 0.87 \times \sqrt{300} = 15.07$. From the Appendix Power table, for δ = 15.07 with
α = 0.05, the power is beyond 0.99. This means that, if the true effect is this large,
we have a better than 99% chance of correctly rejecting the null hypothesis that our
students’ GRE mean is 500. The chance of making a Type II error is less than 1%.
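The arithmetic in the GRE example can be checked with a few lines of Python. This is only a sketch: it rounds d to two decimals as the hand calculation does, and it replaces the Appendix Power table with a normal approximation to two-tailed power, which is an assumption of the example and not necessarily how the table was built.

```python
from math import sqrt
from statistics import NormalDist

x_bar, mu_0, s_x, n = 565, 500, 75, 300   # values from the GRE example

d = round((x_bar - mu_0) / s_x, 2)        # rounded to two decimals, as in the hand calculation
delta = d * sqrt(n)                       # delta = d * sqrt(n)

# Normal approximation to two-tailed power at alpha = .05 (assumption of this sketch)
z = NormalDist()
crit = z.inv_cdf(1 - 0.05 / 2)
power = (1 - z.cdf(crit - delta)) + z.cdf(-crit - delta)

print(f"d = {d:.2f}, delta = {delta:.2f}, power beyond .99: {power > 0.99}")
```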
Sometimes the researcher is interested in knowing how large a sample she
should have in her study in order to obtain a certain power.
For example, a researcher wants to set power at .80 when she thinks (based on
previous experience or literature) the effect size of her study is around
d = 0.20. According to the Appendix Power table, for power = .80 and
α = 0.05, δ must equal 2.80.
Now that we have δ and d, we can simply solve for n.
$$\delta = d\sqrt{n} \quad\Rightarrow\quad n = \left(\frac{\delta}{d}\right)^2 = \left(\frac{2.80}{0.20}\right)^2 = 196$$
Therefore, if the researcher wants to have an 80% chance of rejecting the null
hypothesis when the effect size is 0.2, she will have to use a random sample of
196 participants.
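Solving δ = d√n for n is easy to wrap in a small helper. The function name below is hypothetical, and δ is still read from the power table by hand; the sketch only automates the last step and rounds up to a whole participant.

```python
from math import ceil

def required_n(delta, d):
    """Solve delta = d * sqrt(n) for n and round up to a whole participant."""
    n_exact = (delta / d) ** 2
    # round() first so floating-point noise does not push an exact answer past the ceiling
    return ceil(round(n_exact, 6))

print(required_n(2.80, 0.20))   # power = .80, alpha = .05, d = 0.20
```

Because n grows with the square of δ/d, halving the effect size you want to detect roughly quadruples the required sample.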
[Example]
The literature shows that the mean influence score of peer pressure is 520 with a
standard deviation of 80. An investigator would like to show that a minor
change in conditions will produce scores with a mean of only 500. He plans
to run a t test to compare his sample mean with a population mean of
520.
Effect size:
$$d = \frac{500 - 520}{80} = -0.25$$
Only the magnitude of d matters for power, so we work with d = 0.25. If the sample size is 100, then δ is:
$$\delta = d\sqrt{n} = 0.25 \times \sqrt{100} = 2.5$$
Checking the Appendix Power table, the power = .71.
What sample sizes would be needed to raise power to .70, .80, and .90?
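(1) To have power = .70 with α = .05, δ is close to 2.50.
δ = 2.40, power = 0.67
δ = 2.50, power = 0.71
You can use interpolation:
$$\frac{2.5 - \delta}{2.5 - 2.4} = \frac{0.71 - 0.70}{0.71 - 0.67}$$
which gives δ = 2.475.
To still detect d = 0.25 with δ = 2.475:
$$\delta = d\sqrt{n} \quad\Rightarrow\quad n = \left(\frac{\delta}{d}\right)^2 = \left(\frac{2.475}{0.25}\right)^2 = 98.01 \approx 99 \text{ (round up)}$$
(2) To have power = .80 with α = .05, δ is close to 2.8.
$$\delta = d\sqrt{n} \quad\Rightarrow\quad n = \left(\frac{\delta}{d}\right)^2 = \left(\frac{2.8}{0.25}\right)^2 = 125.44 \approx 126$$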
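(3) To have power = .90 with α = .05, δ is in between 3.20 and 3.30.
δ = 3.20, power = 0.89
δ = 3.30, power = 0.91
Use interpolation:
$$\frac{3.30 - \delta}{3.30 - 3.20} = \frac{0.91 - 0.90}{0.91 - 0.89}$$
which gives δ = 3.25.
To still detect d = 0.25 with δ = 3.25:
$$\delta = d\sqrt{n} \quad\Rightarrow\quad n = \left(\frac{\delta}{d}\right)^2 = \left(\frac{3.25}{0.25}\right)^2 = 169$$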
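The three sample-size calculations for this example follow the same two steps: interpolate δ between neighboring table rows, then solve for n. The sketch below reproduces them in Python. The function names are hypothetical, and the (δ, power) pairs are the table rows quoted in the example.

```python
from math import ceil

def interpolate_delta(target_power, low_row, high_row):
    """Linear interpolation between two (delta, power) rows of the power table."""
    (delta_lo, power_lo), (delta_hi, power_hi) = low_row, high_row
    fraction = (target_power - power_lo) / (power_hi - power_lo)
    return delta_lo + fraction * (delta_hi - delta_lo)

def required_n(delta, d):
    """Solve delta = d * sqrt(n) for n, rounding up (round() guards against float noise)."""
    return ceil(round((delta / d) ** 2, 6))

d = 0.25  # magnitude of the effect size from the example

delta_70 = interpolate_delta(0.70, (2.40, 0.67), (2.50, 0.71))
delta_80 = 2.80                                   # read directly from the table
delta_90 = interpolate_delta(0.90, (3.20, 0.89), (3.30, 0.91))

for power, delta in ((0.70, delta_70), (0.80, delta_80), (0.90, delta_90)):
    print(f"power = {power:.2f}  delta = {delta:.3f}  n = {required_n(delta, d)}")
```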
Effect Sizes: Two Independent-Samples
The effect size index for the two independent sample t-test is defined as
follows.
$$d = \frac{\mu_1 - \mu_2}{s_{pooled}} \approx \frac{\bar{X}_1 - \bar{X}_2}{s_{pooled}}$$
$s_{pooled}$ is defined as the common standard deviation (recall that we typically
assume that the variances are equal).
$$s_{pooled} = \sqrt{\frac{s_{X_1}^2 + s_{X_2}^2}{2}}$$ for equal-sized samples.
$s_{X_1}$ and $s_{X_2}$ can be known from the population, estimated based on prior
research, or estimated from the data.
In the case of unequal sample sizes, we pool the variance as we do when
computing the t-test. Recall what the pooled variance does: it estimates
the population variance, weighting each sample variance by its sample
size. Hence, the pooled variance is an estimate of the population variance
that weights each case in the study equally.
So, we can rewrite d for the unequal sample size case as follows.
$$d = \frac{\bar{X}_1 - \bar{X}_2}{s_{pooled}} = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}}$$
And we need to calculate δ to find the power. The δ for the two-sample case is
defined as
$$\delta = d\sqrt{\frac{n}{2}}$$
and we also need to know what n to use when it’s not the same for the two groups.
When we deal with power for the t-test with unequal sample sizes, we need a
single value of n to work with the power tables, so we need to combine
the sample sizes from the two groups. The formula for the effective
sample size is based on the harmonic mean.
$$n_h = \frac{2}{\dfrac{1}{n_1} + \dfrac{1}{n_2}} = \frac{2 n_1 n_2}{n_1 + n_2}$$
Note that when we have unequal sample sizes, we need more participants to
achieve the same level of power as a study in which sample sizes are equal
(a balanced study). Consider the following two ways of dividing 100
participants into two groups.
$$n_h = \frac{2 n_1 n_2}{n_1 + n_2} = \frac{2 \times 40 \times 60}{40 + 60} = 48$$
compared to n = 50 in balanced studies.
In this case, 100 people in the unbalanced design have power equivalent to a
balanced study with only 96 people.
What’s the point? Balance your samples when possible.
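The cost of imbalance is easy to see in code. In this sketch the function name is hypothetical, and the 40/60 and 50/50 splits are the ones from the comparison above.

```python
def effective_n(n1, n2):
    """Harmonic-mean (effective) per-group sample size: 2*n1*n2 / (n1 + n2)."""
    return 2 * n1 * n2 / (n1 + n2)

for n1, n2 in ((40, 60), (50, 50)):
    n_h = effective_n(n1, n2)
    print(f"n1 = {n1}, n2 = {n2}: effective n per group = {n_h:.0f}, "
          f"equivalent balanced total = {2 * n_h:.0f}")
```

Trying more lopsided splits of the same 100 participants (for example 20 and 80) shows the effective sample size shrinking further.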
Let’s calculate the power of the two independent samples t-test that was
shown on p.13 in the t-test slide set.
IV: Teacher’s happiness (0=low happiness; 1=high happiness)
DV: Student’s achievement
Group Statistics: Follow-up Reading std score
COMPOSITE SEX    N      Mean       Std. Deviation    Std. Error Mean
MALE             117    50.4083    10.37854          .95950
FEMALE           138    51.3812    9.03615           .76921
Effect size:
$$d = \frac{\bar{X}_1 - \bar{X}_2}{s_{pooled}}, \qquad s_{pooled} = \sqrt{\frac{s_{X_1}^2 + s_{X_2}^2}{2}}$$
Solving $\delta = d\sqrt{n/2}$ for n gives us
$$n = 2\left(\frac{\delta}{d}\right)^2$$
which is the sample size per group. With the unequal sample size, we use $n_h$ in place of n.
d = (51.3812 - 50.4083) / sqrt((10.3785^2 + 9.0362^2)/2) = .1
nh = (2*117*138)/(117+138) = 126.6, truncated to 126
δ = .1*sqrt(126/2) = 0.7937254
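The same quantities can be computed from the Group Statistics table without intermediate rounding. This sketch uses the sample-size-weighted pooled standard deviation from the unequal-n formula and the unrounded harmonic mean, so its δ can differ slightly from the hand calculation; the variable names are hypothetical.

```python
from math import sqrt

# Values from the Group Statistics table
n1, mean1, sd1 = 117, 50.4083, 10.37854   # MALE row
n2, mean2, sd2 = 138, 51.3812, 9.03615    # FEMALE row

# Pooled standard deviation, weighting each variance by its degrees of freedom
pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
s_pooled = sqrt(pooled_var)

d = (mean2 - mean1) / s_pooled            # effect size
n_h = 2 * n1 * n2 / (n1 + n2)             # harmonic-mean sample size per group
delta = d * sqrt(n_h / 2)                 # delta for the two-sample case

print(f"s_pooled = {s_pooled:.3f}, d = {d:.3f}, n_h = {n_h:.1f}, delta = {delta:.2f}")
```

A δ this small sits near the bottom of a power table, which suggests the study had little chance of detecting an effect of this size.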
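Summary:
One-Sample T-Test (effect size, delta (for power estimate), and sample size):
$$d = \frac{\bar{X} - \mu_0}{s_X}, \qquad \delta = d\sqrt{n}, \qquad n = \left(\frac{\delta}{d}\right)^2$$
Two-Independent Samples Test (effect size, delta (for power estimate), and
per-group sample size):
$$d = \frac{\bar{X}_1 - \bar{X}_2}{s_{pooled}}, \qquad s_{pooled} = \sqrt{\frac{s_{X_1}^2 + s_{X_2}^2}{2}}, \qquad \delta = d\sqrt{\frac{n}{2}}, \qquad n = 2\left(\frac{\delta}{d}\right)^2$$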
Effect Sizes: Matched-Samples
The d index for the matched sample t-test is defined as:
$$d_D = \frac{\bar{X}_1 - \bar{X}_2}{s_{(X_1 - X_2)}}$$
where $s_{(X_1 - X_2)}$ is the standard deviation of the difference scores.
A problem arises: To calculate $s_{(X_1 - X_2)}$, we need to know the correlation
between X1 and X2.
According to the variance sum law:
$$\sigma_{(X_1 - X_2)}^2 = \sigma_{X_1}^2 + \sigma_{X_2}^2 - 2\rho\,\sigma_{X_1}\sigma_{X_2}$$
To solve the problem, we make the general assumption of homogeneity of
variance: $\sigma_{X_1}^2 = \sigma_{X_2}^2 = \sigma^2$.
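So the variance sum law can be revised:
$$\sigma_{(X_1 - X_2)}^2 = \sigma_{X_1}^2 + \sigma_{X_2}^2 - 2\rho\,\sigma_{X_1}\sigma_{X_2} = \sigma^2 + \sigma^2 - 2\rho\sigma^2 = 2\sigma^2 - 2\rho\sigma^2 = 2\sigma^2(1 - \rho)$$
So $\sigma_{(X_1 - X_2)} = \sigma\sqrt{2(1 - \rho)}$.
The statistic form:
$$s_{(X_1 - X_2)} = s\sqrt{2(1 - r)}$$
Then we have to come up with the best guess of the correlation between X1 and
X2 to calculate $\sigma_{(X_1 - X_2)}$. And δ is defined as $\delta = d\sqrt{n}$.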
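To see how the guessed correlation changes things, the sketch below applies the statistic form of the variance sum law under the homogeneity assumption. All input values (s = 10, a mean difference of 2, n = 25 pairs, and the three correlations) are hypothetical, as are the function names.

```python
from math import sqrt

def sd_of_difference(s, r):
    """Standard deviation of difference scores: s * sqrt(2 * (1 - r))."""
    return s * sqrt(2 * (1 - r))

def matched_delta(mean_diff, s, r, n):
    """delta = d * sqrt(n), with d computed from the difference-score SD."""
    d = mean_diff / sd_of_difference(s, r)
    return d * sqrt(n)

# Hypothetical inputs: common SD, expected mean difference, and number of pairs
s, mean_diff, n = 10.0, 2.0, 25
for r in (0.0, 0.5, 0.8):
    print(f"r = {r:.1f}  s_diff = {sd_of_difference(s, r):.2f}  "
          f"delta = {matched_delta(mean_diff, s, r, n):.2f}")
```

A higher correlation between the paired scores shrinks the standard deviation of the differences, which raises d and δ for the same mean difference.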