Transcript Slide 1

Sample Size
Calculation
PD Dr. Rolf Lefering
IFOM - Institut für Forschung in der Operativen Medizin
Universität Witten/Herdecke
Campus Köln-Merheim
Sample Size Calculation
sample size
uncertainty
costs & effort & time
Sample Size Calculation
Single study group
- continuous measurement
- count of events
Comparative trial (2 or more groups)
- continuous measurement
- count of events
Confidence Interval
Which true value is compatible
with the observation?
Confidence interval
... range where the true value lies with a
high probability (usually 95%)
Confidence Interval
Example:
56 patients with open fractures, 9 developed an infection (16%)
Sample: n = 56, infection rate 16%
All patients with open fractures: true value ???
Confidence Interval
Formula for event rates
n = sample size
p = percentage
CI95 = p ± 1.96 * sqrt( p * (100 - p) / n )
Example: n = 56, p = 16%
CI95 = 16 ± 1.96 * sqrt( (16 * 84) / 56 ) = 16 ± 9.6
[6.4 ; 25.6]
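The open-fracture example can be checked with a few lines of Python. This is a minimal sketch of the normal-approximation formula for event rates; the function name `ci95_rate` is only illustrative and not part of the slides.

```python
import math

def ci95_rate(p_percent, n, z=1.96):
    """Approximate 95% CI for an event rate given in percent (normal approximation)."""
    half_width = z * math.sqrt(p_percent * (100 - p_percent) / n)
    return p_percent - half_width, p_percent + half_width

# Example from the slide: 56 patients, infection rate 16%
low, high = ci95_rate(16, 56)
print(f"{low:.1f} to {high:.1f}")  # 6.4 to 25.6
```

The approximation becomes unreliable for very small samples or rates close to 0% or 100%.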
Confidence Interval
95% confidence interval around a 20% incidence rate
[Figure: 95% confidence interval; y-axis: incidence rate (%), 0 to 50; x-axis: sample size, 10 to 150]
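How the interval around a 20% incidence rate narrows with growing sample size can be tabulated as follows. This is a rough sketch using the normal-approximation formula for event rates; the chosen sample sizes and the clipping of the lower limit at 0% are assumptions made for illustration.

```python
import math

def ci95_half_width(p_percent, n, z=1.96):
    """Half-width of the approximate 95% CI for an event rate in percent."""
    return z * math.sqrt(p_percent * (100 - p_percent) / n)

rate = 20  # incidence rate in percent
for n in (10, 20, 50, 100, 150):
    hw = ci95_half_width(rate, n)
    lower = max(0.0, rate - hw)  # a rate cannot fall below 0%
    print(f"n = {n:3d}: {lower:4.1f}% to {rate + hw:4.1f}%")
```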
Confidence Interval
Formula for continuous variables
Mean:
M = mean
SE = standard error
SD = standard deviation
n = sample size
Remember:
SE = SD / sqrt(n)
CI95 = M ± 1.96 * SE
1.65 for 90%
1.96 for 95%
2.58 for 99%
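A small sketch of the confidence interval for a mean, including the multipliers for 90%, 95% and 99%. The values mean = 50, SD = 10 and n = 25 are hypothetical and chosen only for illustration; they do not come from the slides.

```python
import math
from statistics import NormalDist

def ci_mean(mean, sd, n, level=0.95):
    """CI for a mean: M +/- z * SE, with SE = SD / sqrt(n) (normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.65, 1.96, 2.58
    se = sd / math.sqrt(n)
    return mean - z * se, mean + z * se

# Hypothetical data: mean 50, SD 10, 25 cases
for level in (0.90, 0.95, 0.99):
    low, high = ci_mean(50, 10, 25, level)
    print(f"{level:.0%}: {low:.2f} to {high:.2f}")
```

For small samples, the multiplier is usually taken from the t distribution instead of the normal distribution, which widens the interval slightly.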
Sample Size Calculation
Comparative
trials
„What is the sample size to show that early
weight-bearing therapy is better ?“
„Which key should I press here now ?“
„What is the sample size to show that
early weight bearing therapy, as
compared to standard therapy, is able
to reduce the time until return to work
from 10 weeks to 8 weeks, where time
to work has a SD of 3 ?“
36 cases per group !
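The figure of 36 cases per group can be reproduced with the usual normal-approximation formula for comparing two means. The sketch below assumes a two-sided α of 0.05 and 80% power, which this slide does not state explicitly; the function name is illustrative.

```python
import math
from statistics import NormalDist

def n_per_group_means(diff, sd, alpha=0.05, power=0.80):
    """Cases per group to detect a difference of means (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * sd / diff) ** 2)

# Return to work: 10 weeks versus 8 weeks, SD = 3
print(n_per_group_means(diff=2, sd=3))  # 36
```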
Outcome Measures
Survival
Organ failure
Hospital stay
Recurrence
Complication rate
Sepsis
Lab values
Wound infection
Mobility
Wellbeing
Pain
Fear
Independence, autonomy
Depression
Fatigue
Social status
Blood pressure
Anxiety
Select Outcome Measure
• Relevance
Does this endpoint convince the patient / the scientific
community?
• Reliability; measurability
Can the outcome be measured easily, with little
variation, even by different observers?
• Sensitivity
Does the intervention lead to a significant change in the
outcome measure?
• Robustness
How much is the endpoint influenced by other factors?
Select Outcome Measure
• Primary endpoint
Main hypothesis or core question; aim of the study
Statistics: confirmatory
• Secondary endpoints
Other interesting questions, additional endpoints
Statistics: exploratory
(could be confirmatory in case of a large difference)
Advantage: prospective selection in the study protocol
• Retrospectively selected endpoints
Selected when the trial is done, based on subgroup differences
Statistics: ONLY exploratory!
Sample Size Calculation
Sample
size
Certainty
α error
Power
Difference
to be detected
Statistical Testing
A statistical test
is a method (or tool) to decide whether
an observed difference* is really present
or just based on variation by chance
* This holds for a test for difference, the type most frequently applied in medicine.
Statistical Testing
Test for difference
„Intervention A is better than B“
Test for equivalence
„Intervention A and B have the same effect“
Test for non-inferiority
„Intervention A is not worse than B“
Statistical Testing
How a test procedure works
1. Want to show: there is a difference
2. Assume:
there is NO difference between the groups;
(„equal effects“, null-hypothesis)
3. Try to disprove this assumption:
- perform study / experiment
- measure the difference
4. Calculate:
the probability that such a difference could
occur although the assumption („no
difference“) was true
= p-value
Statistical Testing
statistical test for difference:
The p-value is the probability that a
difference at least as large as the observed one
occurs just by chance, i.e. if there is in truth no difference
Statistical Testing
statistical test for difference :
p indicates how compatible the data are with
„no difference“
Statistical Testing
Null hypothesis: „Germany and Spain are equally strong soccer teams!“
Trial: game tonight, n = 6 goals, result 6 : 0 for Germany
Statistical test: p = 0.031
The p-value says:
How big is the chance that one of two equally strong teams scores
6 goals, and the other one none?
Spain could still be as strong as Germany, but the chance is small (3.1%)
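The p-value of the soccer example corresponds to a two-sided sign test: if both teams are equally strong, each goal goes to either side with probability 0.5. A minimal sketch, with an illustrative function name:

```python
from math import comb

def two_sided_sign_test(goals_a, goals_b):
    """Two-sided p-value under the null hypothesis that each goal is a 50:50 event."""
    n = goals_a + goals_b
    k = max(goals_a, goals_b)
    # probability that one particular team scores at least k of the n goals
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # either team could have been the lucky one, so double the tail
    return min(1.0, 2 * tail)

print(two_sided_sign_test(6, 0))  # 0.03125
```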
Statistical Testing
                    small sample    large sample
small difference    p = 0.68        p = 0.05
large difference    p = 0.05        p < 0.001
Statistical Testing
The more cases are included, the better
„equality“ can be disproved
Example: drug A has a success rate of 80%, while drug B is
better with a healing rate of 90%
sample size    drug A (80%)    drug B (90%)    p-value
20             8/10            9/10            0.53
40             16/20           18/20           0.38
100            40/50           45/50           0.16
200            80/100          90/100          0.048
400            160/200         180/200         0.005
1000           400/500         450/500         <0.001
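The p-values in this example appear to match a chi-square test for a 2x2 table without continuity correction; that is an inference from the numbers, not something the slide states. A sketch using only the standard library (for one degree of freedom the p-value equals erfc(sqrt(chi2 / 2))):

```python
import math

def chi2_p_value(events_a, n_a, events_b, n_b):
    """Chi-square test (1 df, no continuity correction) comparing two event rates."""
    a, b = events_a, n_a - events_a
    c, d = events_b, n_b - events_b
    n = n_a + n_b
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return math.erfc(math.sqrt(chi2 / 2))

for per_group in (10, 20, 50, 100, 200, 500):
    successes_a = per_group * 8 // 10  # drug A: 80%
    successes_b = per_group * 9 // 10  # drug B: 90%
    p = chi2_p_value(successes_a, per_group, successes_b, per_group)
    print(f"total n = {2 * per_group:4d}: p = {p:.3f}")
```

The observed rates stay at 80% and 90% throughout; only the sample size changes the p-value.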
Statistical Testing
A „significant“ p-value ...
does NOT prove the size
of the difference,
but only excludes equality!
Statistical Testing
p-value large (> 0.05)
The observed difference is probably caused by chance only,
or the sample size is not sufficient to exclude chance
→ null hypothesis is maintained: „no difference“
p-value small (≤ 0.05)
Chance alone is not sufficient to explain this difference
→ there is a systematic difference
→ null hypothesis is rejected: „significant difference“
Statistical Testing
Errors
The decision
- for a difference (significance, p ≤ 0.05)
- or against it („equality“, not significant, p > 0.05)
is not certain but only based on a probability (p-value). Therefore, errors are
possible:
Type 1 error:
Decision for a difference although there is none
=> wrong finding
Type 2 error:
Decision for „equality“ although there is a difference
=> missed finding
Statistical Testing
Errors
                             Truth: no difference               Truth: difference
Test says: significant       type 1 error (α), wrong finding    correct
Test says: not significant   correct                            type 2 error (β), missed finding
Statistical Testing
                  type 1 error (α)                          type 2 error (β)
                  „wrong finding“                           „missed finding“
Fire detector     wrong alarm                               no alarm in case of fire
Court             conviction of an innocent                 a criminal is set free
Clinical study    difference was „significant“ by chance    difference was missed
Power
“What is the Power of the study ?”
Type 2 error β
probability to miss a difference
Power = 1 - β
probability to detect a difference
Power depends on:
- the magnitude of the difference
- the sample size
- the variation of the outcome measure
- the significance level (α)
Power
“What is the Power of the study ?”
POWER
is the probability to detect a
certain difference X with the
given sample size n as
significant (at level α).
“Does the study have enough power
to detect a difference of size X ?”
Power
When to perform power calculations?
1. Planning phase – sample size calculation:
if the assumed difference really exists, what risk
would I take to miss this difference ?
2. Final analysis – in case of a non-significant result:
what size of difference could be rejected with the
present data ?
Power
Example
Clinical trial:
Laparoscopic versus open appendectomy
Endpoint:
Maximum post-operative pain intensity
(VAS 0-100 points)
Patients:
30 cases per group
Results:
lap.: 28 (SD 18)
open: 32 (SD 17)
p = 0.38, not significant!
What is the power of the study ???
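The power of this trial depends on which difference one wants to detect. The sketch below uses a normal approximation with a common SD of 17.5 (an assumed rough average of the two reported SDs) and a two-sided α of 0.05; the candidate differences of 4, 10 and 15 VAS points are chosen only for illustration.

```python
import math
from statistics import NormalDist

def power_two_means(diff, sd, n_per_group, alpha=0.05):
    """Approximate power to detect a difference of means with equal group sizes."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    se_diff = sd * math.sqrt(2 / n_per_group)  # standard error of the difference
    return NormalDist().cdf(abs(diff) / se_diff - z_alpha)

sd = 17.5  # assumption: rough average of SD 18 and SD 17
for diff in (4, 10, 15):
    print(f"difference {diff:2d} points: power = {power_two_means(diff, sd, 30):.2f}")
```

Under these assumptions, 30 cases per group give only a small chance of detecting a difference as small as the observed 4 points, so the non-significant result says little about whether such a difference exists.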
Sample Size Calculation
Sample
size
Certainty
α error
Power
Difference
to be detected
Sample Size Calculation
Sample size
α = 0.05
β = 0.20
Difference to be detected
α error: risk to find a difference by chance
β error: risk to miss a real difference
Sample Size Calculation
Sample size
α = 0.05
β = 0.20
PT & PC, or difference & SD
Event rates: percentages in the treatment group (PT) and the control group (PC)
Continuous measures: difference of means and standard deviation (SD)
Sample Size Calculation
Continuous Endpoints
SD unknown
if the variation (standard deviation) is not known,
the expected advantage could be expressed as
„effect size“
which is the difference in units of the (unknown) SD
Example:
• pain values are at least 1 SD below the control group (effect size = 1.0)
• the difference will be at least half a SD (effect size = 0.5)
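With an effect size, the sample size formula for two means no longer needs the SD itself. A minimal sketch using the normal approximation, assuming a two-sided α of 0.05 and 80% power; the function name is illustrative.

```python
import math
from statistics import NormalDist

def n_per_group_effect_size(effect_size, alpha=0.05, power=0.80):
    """Cases per group for a given effect size (difference / SD), normal approximation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

print(n_per_group_effect_size(1.0))  # 16
print(n_per_group_effect_size(0.5))  # 63
```

Halving the effect size roughly quadruples the required sample size. Calculations based on the t distribution give slightly larger numbers.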
Sample Size Calculation
Continuous Endpoints
Test with non-parametric rank statistics
• non-normal distribution, or non-metric values
• Mann-Whitney U-test; Wilcoxon test
Use t-Test for sample size calculation
and add 10% of cases
Sample Size Calculation
Guess …
How many patients are needed to show that
a new intervention is able to reduce the
complication rate from 20% to 14% ?
(α = 0.05; β = 0.20, i.e. 80% power)
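One way to check a guess is the common normal-approximation formula for comparing two event rates. The sketch below uses the unpooled-variance version without continuity correction; other variants and dedicated software give somewhat different numbers, so the result is an approximation and not the slide's official answer.

```python
import math
from statistics import NormalDist

def n_per_group_rates(p_control, p_treatment, alpha=0.05, power=0.80):
    """Cases per group to detect a difference between two event rates (proportions)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p_control - p_treatment) ** 2)

# Complication rate reduced from 20% to 14%
print(n_per_group_rates(0.20, 0.14))  # 612
```

So roughly 600 patients per group are needed, far more than most people guess for a six percentage point reduction.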
Sample Size Calculation
Dupont WD, Plummer WD. Power and Sample Size Calculations: A Review and Computer Program. Contr. Clin. Trials (1990) 11:116-128
http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/PowerSampleSize
Sample Size Calculation
Multiple Testing
• More than one experimental / treatment group
• Multiple outcome measures
• Multiple follow-up time points
• Interim analyses
• Subgroup analyses
Multiple testing increases the risk of
spurious significant results
Overall statistical error in 8 tests at the 0.05 level:
α = 1 - 0.95^8 = 1 - 0.66 = 0.34
Multiple Testing
                                 correct    at least 1 error
• 1 test (with 5% error)         95%        5%
• 2 tests (with 5% error each)   90.25%     9.75% (= 4.75% + 4.75% + 0.25%)
• 3 tests
• 4 tests
• 5 tests
• …..
Multiple Testing
                                 correct    at least 1 error
• 1 test (with 5% error)         95%        5%
• 2 tests (with 5% error each)   90.2%      9.8%
• 3 tests                        85.7%      14.3%
• 4 tests                        81.5%      18.5%
• 5 tests                        77.4%      22.6%
• …..
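The percentages follow from 1 - 0.95^k, assuming the k tests are independent. A short sketch that reproduces them (the function name is illustrative):

```python
def familywise_error(k, alpha=0.05):
    """Probability of at least one false-positive result in k independent tests."""
    return 1 - (1 - alpha) ** k

for k in (1, 2, 3, 4, 5, 8):
    err = familywise_error(k)
    print(f"{k} test(s): all correct {1 - err:.1%}, at least 1 error {err:.1%}")
```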
Multiple Testing
What could you do?
• Select ONE primary and multiple secondary questions
• Combination of endpoints
multiple complications → „negative event“
multiple time points → AUC, maximum value, time to normal
multiple endpoints → sum score acc. to O'Brien
• Adjustment of p-values,
i.e. each endpoint is tested at a stricter α level
e.g. Bonferroni: k tests at level α / k
(5 tests at the 1% level, instead of 1 test at the 5% level)
• A priori ordered hypotheses
predefine the order of tests (each at 5% level)
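The Bonferroni rule is easy to apply in code. In this sketch the five p-values are hypothetical and serve only to illustrate the stricter threshold of α / k.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Test each p-value at level alpha / k, where k is the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

# Hypothetical p-values of 5 endpoints: each is tested at the 1% level
p_values = [0.004, 0.02, 0.03, 0.20, 0.65]
print(bonferroni_significant(p_values))  # [True, False, False, False, False]
```

Only the first endpoint stays significant, although three of the five p-values are below 0.05.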
Interim Analysis
• Fixed sample size: end of trial
• Sequential design: after each case
• Group sequential design: after each step
• Adaptive design: after each step
Interim Analysis
from:
TR Fleming, DP Harrington, PC O'Brien
Designs for group sequential tests. Contr. Clin. Trials (1984) 5: 348-361