Sample Sizes for IE
Power Calculations
Overview
General question: How large does the sample need to
be to credibly detect a given effect size?
What does "credibly" mean here?
We can be reasonably sure that the difference between the
treatment group and the comparison group is due to the
program and not to chance
Randomization removes bias, but it does not remove
noise. To reduce noise, we need a large sample size.
But how large is large?
Measuring Impact
At the end of an experiment, we will compare the outcome of
interest in the treatment and the comparison groups.
We are interested in the difference:
Mean in treatment - Mean in control = Effect size
For example: mean of the malaria prevalence in villages with ITN
distribution vs. mean of malaria prevalence in villages with no
ITNs
To draw conclusions from that effect size, we need it to be
estimated with precision, since there is always variability in
the data
If many other unobserved factors affect outcomes,
it is harder to say whether the treatment had an effect
Precise outcomes
Low Standard Deviation
[Figure: histogram of the number of villagers exposed to malaria (horizontal axis: value, vertical axis: frequency). Blue = treatment. Group means are 50 and 60.]
Some noise
Medium Standard Deviation
[Figure: histogram of the same outcome with a medium standard deviation (horizontal axis: value, vertical axis: frequency). Group means are 50 and 60.]
Very noisy
High Standard Deviation
[Figure: histogram of the same outcome with a high standard deviation (horizontal axis: value, vertical axis: frequency). Group means are 50 and 60.]
Confidence Intervals
We only work with data that are a sample of the
population. To assess whether an estimate holds for
the entire population, we need a measure of its reliability
A 95% confidence interval for an effect size is built so
that, across the samples we could have drawn from the
same population, about 95% of the intervals constructed
this way would contain the true effect.
The standard error (se) of the estimate in the sample
captures both the size of the sample and the variability
of the outcome
It is larger with a small sample and with a variable outcome
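As a minimal sketch of how the effect size, its standard error, and an approximate 95% confidence interval fit together, the Python snippet below computes all three for two groups. The data are made up for illustration, the function name `effect_with_ci` is hypothetical, and the interval relies on a normal approximation (z = 1.96), which is an assumption rather than something stated in the slides.

```python
import math
from statistics import mean, variance

def effect_with_ci(treatment, comparison, z=1.96):
    """Return (effect, standard error, approximate 95% CI) for a difference in means."""
    effect = mean(treatment) - mean(comparison)
    # Standard error of the difference between two independent sample means:
    # it shrinks with larger samples and grows with more variable outcomes.
    se = math.sqrt(variance(treatment) / len(treatment)
                   + variance(comparison) / len(comparison))
    return effect, se, (effect - z * se, effect + z * se)

# Hypothetical outcome data, invented for illustration only.
treatment = [48, 52, 50, 47, 53, 49, 51, 50]
comparison = [58, 62, 60, 57, 63, 59, 61, 60]

effect, se, ci = effect_with_ci(treatment, comparison)
print(f"effect = {effect}, se = {se:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

With so few observations a t-based interval would be somewhat wider; the normal approximation is used here only to keep the sketch short.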
Two Types of Errors
First type of error: conclude that there is an effect, when in fact
there is no effect.
The level of your test is the probability that you will falsely conclude that
the program has an effect, when in fact it does not.
So with a level of 5%, a program with no effect will be wrongly declared
effective in only 5% of samples.
Common choices for the level: α = 5%, 10%, or 1%
Rule of thumb: if the estimated effect is more than twice its
standard error, it is statistically significant at roughly the 5%
level
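The "twice the standard error" rule of thumb can be checked with a small simulation. The sketch below draws both groups from the same distribution, so the true effect is zero, and counts how often the rule would still declare an effect. The group size, outcome distribution, number of experiments, and the function name `false_positive_rate` are all assumptions chosen for illustration.

```python
import math
import random
from statistics import mean, variance

def false_positive_rate(n_per_group=50, n_experiments=2000, seed=0):
    """Share of no-effect experiments in which |effect| exceeds 2 standard errors."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_experiments):
        # Both groups come from the same distribution: the true effect is zero.
        treatment = [rng.gauss(50, 10) for _ in range(n_per_group)]
        comparison = [rng.gauss(50, 10) for _ in range(n_per_group)]
        effect = mean(treatment) - mean(comparison)
        se = math.sqrt(variance(treatment) / n_per_group
                       + variance(comparison) / n_per_group)
        if abs(effect) > 2 * se:
            false_positives += 1
    return false_positives / n_experiments

rate = false_positive_rate()
print(f"False positive rate: {rate:.3f}")
```

The printed rate should land close to 5%, which is what a 5% significance level means: roughly one no-effect experiment in twenty is flagged by chance.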
Two Types of Errors
Second type of error: you fail to reject that the program
had no effect, when in fact it does have an effect.
The power of a test is the probability of finding a
significant effect in the RCT when there truly is one
Only with a significant effect can you cleanly influence
policy
Power calculations are a tool to see how likely we are
to find a significant effect for a given sample size
What you Need for a Power Calculation
Significance level
- This is conventionally set at 5%.
- With a lower level (false positives are less likely), we
need a larger sample to detect the effect
Power level
- A power level of 80% says: if there is a true effect, you
will detect it 80% of the time in a sample of the given
size
- Larger sample -> more power
The mean and the variability of the outcome in the comparison group
- From previous surveys conducted in similar settings
- The larger the variability, the larger the sample
needed for a given power
The effect size that we want to detect
- What is the smallest effect that should prompt a policy
response?
- The smaller the expected effect size, the larger the
sample size needed
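The four ingredients above combine into a standard textbook approximation for the sample size per group when comparing two means with individual-level randomization: n = 2 (z_{1-α/2} + z_{power})² σ² / δ². The sketch below implements that formula. It is an illustration under a normal approximation with equal group sizes and equal variances, not necessarily the method used by any particular software, and `sample_size_per_group` is a hypothetical name.

```python
import math
from statistics import NormalDist

def sample_size_per_group(effect, sd, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided test of a difference in means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_power = NormalDist().inv_cdf(power)          # quantile matching the desired power
    n = 2 * ((z_alpha + z_power) * sd / effect) ** 2
    return math.ceil(n)

# Detect an effect of 0.2 standard deviations at 5% significance and 80% power.
print(sample_size_per_group(effect=0.2, sd=1.0))              # → 393
# A stricter significance level, a noisier outcome, or a smaller effect all raise n.
print(sample_size_per_group(effect=0.2, sd=1.0, alpha=0.01))
print(sample_size_per_group(effect=0.2, sd=2.0))
print(sample_size_per_group(effect=0.1, sd=1.0))
```

Because the effect enters squared in the denominator, halving the effect to be detected roughly quadruples the required sample.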
How to Determine Effect Size
What is the smallest effect that would justify adopting the
program (in terms of cost-benefit)?
This sets the minimum effect size we would want to be able to test for
Common danger: using an effect size that is too optimistic leads
to a sample size that is too small
How large an effect you can detect with a given sample depends
on how variable the outcome is.
Example: if all children have very similar diarrhea prevalence without a
program, a very small impact will be easy to detect
The standardized effect size is the effect size divided by the
standard deviation of the outcome
Common effect sizes are: 0.20 (small); 0.40 (medium); 0.50 (large)
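To make the link between outcome variability and detectable effect concrete, the sketch below inverts the usual two-group sample size approximation to get the minimum detectable effect for a fixed sample, and also computes a standardized effect size. It assumes individual-level randomization, equal group sizes, and a normal approximation; the function names and the example numbers are hypothetical.

```python
import math
from statistics import NormalDist

def minimum_detectable_effect(n_per_group, sd, alpha=0.05, power=0.80):
    """Smallest true effect detectable with the given power (normal approximation)."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return z * sd * math.sqrt(2 / n_per_group)

def standardized_effect(effect, sd):
    """Effect size expressed in standard deviations of the outcome."""
    return effect / sd

# Same sample size, two hypothetical outcomes with different variability.
noisy = minimum_detectable_effect(n_per_group=393, sd=10)
stable = minimum_detectable_effect(n_per_group=393, sd=5)
print(f"MDE with sd=10: {noisy:.2f}; MDE with sd=5: {stable:.2f}")
print(standardized_effect(effect=2, sd=10))  # → 0.2
```

Halving the standard deviation halves the smallest effect the same sample can detect, which is the point of the diarrhea example: when outcomes are very similar without the program, even a small impact stands out.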
Design Factors to Take into Account
Availability of a Baseline
A baseline can help reduce the needed sample size since it:
1. Removes some variability in the data, increasing precision
2. Can be used to stratify and create subgroups
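One common way to quantify the first point: if the baseline measure of the outcome has correlation r with the endline outcome, controlling for it leaves roughly a fraction (1 - r²) of the outcome variance unexplained, so the required sample shrinks by about that factor. The sketch below applies this to the standard two-group sample size approximation. The correlation values and the function name `sample_size_with_baseline` are assumptions for illustration, and the formula presumes a simple linear adjustment for the baseline.

```python
import math
from statistics import NormalDist

def sample_size_with_baseline(effect, sd, baseline_corr=0.0, alpha=0.05, power=0.80):
    """Approximate per-group sample size when adjusting for a baseline covariate."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    # Adjusting for the baseline leaves a share (1 - r^2) of the outcome variance.
    residual_variance = sd ** 2 * (1 - baseline_corr ** 2)
    return math.ceil(2 * z ** 2 * residual_variance / effect ** 2)

print(sample_size_with_baseline(effect=0.2, sd=1.0))                     # → 393
print(sample_size_with_baseline(effect=0.2, sd=1.0, baseline_corr=0.5))  # → 295
```

In practice the correlation between baseline and endline would have to be guessed from earlier surveys, so the saving is only as reliable as that guess.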
The level of randomization
Whenever treatment occurs at a group level, this reduces
power relative to randomization at individual level
Cluster (Group) Randomization
Program                                  Level of randomization
Rural Water Project: Water Guard         Individual
Rural Water Project: Spring Improvement  Village
Community-based Monitoring in Uganda     Village
HIV/AIDS Education                       School-level
Implications from Group Design
The outcomes for all the individuals within a unit
may be correlated
All villagers affected by spring improvements at same time
All students at school with trained teachers may have
benefited from information
The sample size needs to be adjusted for this
correlation
The more correlation within the group, the more we
need to adjust the standard errors
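The usual way to express this adjustment is the design effect, 1 + (m - 1)ρ, where m is the number of individuals per group and ρ is the intra-cluster correlation. Dividing the total sample by the design effect gives an "effective" sample size. The sketch below assumes equal-sized groups; the numbers and function names are made up for illustration.

```python
def design_effect(cluster_size, icc):
    """Variance inflation from randomizing groups of equal size instead of individuals."""
    return 1 + (cluster_size - 1) * icc

def effective_sample_size(n_total, cluster_size, icc):
    """Number of independent observations the clustered sample is 'worth'."""
    return n_total / design_effect(cluster_size, icc)

# Hypothetical study: 20 villages of 50 people each, intra-cluster correlation 0.1.
deff = design_effect(cluster_size=50, icc=0.1)
ess = effective_sample_size(n_total=1000, cluster_size=50, icc=0.1)
print(f"design effect = {deff:.1f}, effective sample size = {ess:.0f}")
```

In this example 1,000 surveyed individuals carry about as much information as 170 independent ones, which is why standard errors must be inflated when outcomes within a group are correlated.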
Implications
It is extremely important to randomize an adequate number of
groups.
Typically the number of individuals within groups matters less
than the number of groups
Big increases in power usually happen only when the number
of groups that are randomized increases
If you randomize at the level of the district, with one treated
district and one control district, you have 2 observations!
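A rough calculation illustrates why adding groups helps more than adding individuals within groups. The sketch below approximates power for a group-randomized design using a normal approximation and the design effect 1 + (m - 1)ρ, and compares doubling the group size with doubling the number of groups. The effect size, intra-cluster correlation, and design numbers are hypothetical, and the small contribution of the opposite tail of the test is ignored.

```python
import math
from statistics import NormalDist

def clustered_power(effect, sd, clusters_per_arm, cluster_size, icc, alpha=0.05):
    """Approximate power of a two-arm group-randomized design (normal approximation)."""
    n_per_arm = clusters_per_arm * cluster_size
    deff = 1 + (cluster_size - 1) * icc           # design effect for equal-sized groups
    se = sd * math.sqrt(2 * deff / n_per_arm)     # standard error of the difference in means
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(effect / se - z_alpha)

base = clustered_power(0.2, 1.0, clusters_per_arm=20, cluster_size=20, icc=0.1)
bigger_groups = clustered_power(0.2, 1.0, clusters_per_arm=20, cluster_size=40, icc=0.1)
more_groups = clustered_power(0.2, 1.0, clusters_per_arm=40, cluster_size=20, icc=0.1)
print(f"base: {base:.2f}, double group size: {bigger_groups:.2f}, "
      f"double number of groups: {more_groups:.2f}")
```

Both alternatives survey the same number of people, yet under these assumptions doubling the number of groups raises power far more than doubling the size of each group.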
Conclusions
Power calculations involve some guesswork
Sometimes we do not have the right information to
conduct them properly
However, it is important to do them to:
Avoid launching studies that will have no power at all: a waste
of time and money
Allocate the appropriate resources to the studies that you
decide to conduct (and not too much)
If you have a fixed budget, determine whether the project
is feasible at all
Software: http://sitemaker.umich.edu/group-based/optimal_design_software