Sample Size and Power


Sample Size and Power
Laura Lee Johnson
[email protected]
Tuesday, November 2, 2004
Objectives
• Intuition behind power and sample
size calculations
• Common sample size formulas for
the tests
• Tying the first three lectures
together
Take Away Message
• Get some input from a statistician
– This part of the design is vital and
mistakes can be costly!
• Take all calculations with a few grains
of salt
– “Fudge factor” is important!
• Analysis Follows Design
Outline
• Power
• Basic Sample Size Information
• Examples (see text for more)
• Changes to the basic formula
• Multiple comparisons
• Poor proposal sample size statements
• Conclusion and Resources
Power Depends on Sample Size
• Power = 1-β = P( reject H0 | H1 true )
– “Probability of rejecting the null
hypothesis if the alternative
hypothesis is true.”
• More subjects → higher power
Power is Affected by…
• Variation in the outcome (σ²)
– ↓ σ² → power ↑
• Significance level (α)
– ↑ α → power ↑
• Difference (effect) to be detected (δ)
– ↑ δ → power ↑
• One-tailed vs. two-tailed tests
– Power is greater in one-tailed tests than in comparable two-tailed tests
Power Changes
• 2n = 32, 2 sample test, 81% power,
δ=2, σ = 2, α = 0.05, 2-sided test
• Variance/Standard deviation
– σ: 2 → 1 Power: 81% → 99.99%
– σ: 2 → 3 Power: 81% → 47%
• Significance level (α)
– α : 0.05 → 0.01 Power: 81% → 69%
– α : 0.05 → 0.10 Power: 81% → 94%
Power Changes
• 2n = 32, 2 sample test, 81% power, δ=2,
σ = 2, α = 0.05, 2-sided test
• Difference to be detected (δ)
– δ : 2 → 1 Power: 81% → 29%
– δ : 2 → 3 Power: 81% → 99%
• Sample size (n)
– n: 32 → 64 Power: 81% → 98%
– n: 32 → 28 Power: 81% → 75%
• One-tailed vs. two-tailed tests
– Power: 81% → 88%
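These numbers can be checked with the usual normal-approximation power formula. A minimal sketch in Python (the helper name two_sample_power is illustrative only):

from scipy.stats import norm

def two_sample_power(n_per_group, delta, sigma, alpha=0.05, two_sided=True):
    # Approximate power of a two-sample z-test for a difference in means
    z_crit = norm.ppf(1 - alpha / 2) if two_sided else norm.ppf(1 - alpha)
    ncp = delta / (sigma * (2.0 / n_per_group) ** 0.5)  # noncentrality
    return norm.cdf(ncp - z_crit)

print(two_sample_power(16, delta=2, sigma=2))                   # ~0.81 (baseline)
print(two_sample_power(16, delta=2, sigma=1))                   # ~0.9999
print(two_sample_power(16, delta=2, sigma=2, two_sided=False))  # ~0.88 (one-tailed)
print(two_sample_power(32, delta=2, sigma=2))                   # ~0.98 (n doubled)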
Power should be….?
• Phase III: industry minimum = 80%
• Some say Type I error = Type II
error
• Many large “definitive” studies
have power around 99.9%
• Proteomics/genomics studies: aim for high power because Type II error is a bear!
Power Formula
• Depends on study design
• Not hard, but can be VERY algebra
intensive
• May want to use a computer
program or statistician
Outline
• Power
• Basic Sample Size Information
• Examples (see text for more)
• Changes to the basic formula
• Multiple comparisons
• Rejected sample size statements
• Conclusion and Resources
Sample Size Formula Information
• Variables of interest
– type of data, e.g. continuous, categorical
• Desired power
• Desired significance level
• Effect/difference of clinical importance
• Standard deviations of continuous outcome variables
• One or two-sided tests
Sample Size and Study Design
• Randomized controlled trial (RCT)
• Block/stratified-block randomized trial
• Equivalence trial
• Non-randomized intervention study
• Observational study
• Prevalence study
• Measuring sensitivity and specificity
Sample Size and Data Structure
• Paired data
• Repeated measures
• Groups of equal sizes
• Hierarchical data
Notes
• Non-randomized studies looking
for differences or associations
– require larger sample to allow
adjustment for confounding factors
• Absolute sample size is of interest
– surveys sometimes take % of
population approach
More Notes
• Base the sample size calculation on the study’s primary outcome variable
– If secondary outcome variables are considered important, make sure the sample size is sufficient for them too
• Increase the ‘real’ sample size to reflect loss to follow up, expected response rate, lack of compliance, etc.
– Make the link between the calculation and the increase explicit (see the sketch below)
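A minimal sketch of that link, assuming (purely for illustration) that the calculation called for 43 evaluable subjects and you anticipate 15% loss to follow-up:

import math

n_required = 43              # evaluable subjects from the sample size calculation
loss_to_follow_up = 0.15     # anticipated dropout (assumed for illustration)
n_to_enroll = math.ceil(n_required / (1 - loss_to_follow_up))
print(n_to_enroll)           # 51 subjects to enroll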
Purpose→Formula→Analysis
• Demonstrate superiority
– Sample size sufficient to detect
difference between treatments
• Demonstrate that treatments are equally effective
– Equivalence trial or a ‘negative’ trial
– The sample size required to demonstrate equivalence is larger than that required to demonstrate a difference
Outline
• Power
• Basic sample size information
• Examples (see text for more)
• Changes to the basic formula
• Multiple comparisons
• Rejected sample size statements
• Conclusion and Resources
Sample Size in Clinical Trials
• Two groups
• Continuous outcome
• Mean difference
• Similar ideas hold for other outcomes
Phase I: Dose Escalation
• Dose limiting toxicity (DLT) must
be defined
• Decide a few dose levels (e.g. 4)
• At least three patients will be
treated on each dose level (cohort)
• Not a power or sample size
calculation issue
Phase I (cont.)
• Enroll 3 patients
• If 0/3 patients develop DLT
– Escalate to new dose
• If DLT is observed in 1 of 3 patients
– Expand cohort to 6
– Escalate if 3/3 new patients do not
develop DLT (i.e. 1/6 develop DLT)
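A minimal sketch of that decision rule in Python (the function name and return labels are illustrative only):

def three_plus_three(dlt_first_3, dlt_next_3=None):
    # Decision for one dose level under the 3+3 rule sketched above
    if dlt_first_3 == 0:
        return "escalate to next dose level"
    if dlt_first_3 == 1:
        if dlt_next_3 is None:
            return "expand cohort to 6 patients"
        if dlt_next_3 == 0:              # 1/6 overall with DLT
            return "escalate to next dose level"
        return "stop: maximum tolerated dose exceeded"
    return "stop: maximum tolerated dose exceeded"   # 2 or more of first 3 with DLT

print(three_plus_three(0))       # escalate
print(three_plus_three(1))       # expand cohort to 6
print(three_plus_three(1, 0))    # escalate (1/6 with DLT)
print(three_plus_three(1, 1))    # stop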
Phase I (cont.)
• Maximum Tolerated Dose (MTD)
– Dose level immediately below the
level at which ≥2 patients in a cohort
of 3 to 6 patients experienced a DLT
• Usually go for “safe dose”
– MTD or a maximum dosage that is
pre-specified in the protocol
Phase I Note
• Entry of patients to a new dose level does not occur until all patients at the previous level have been followed past the time window in which toxicity is assessed
• Not a power or sample size
calculation issue
Phase II Designs
• Screening of new therapies
• Not to prove efficacy
• Sufficient activity to be tested in a
randomized study
• Issues of safety paramount
• Small number of patients
Phase II Design Problems
• Placebo effect
• Investigator bias
• Might be unblinded or single
blinded treatment
• Regression to the mean
Phase II Example: Two-Stage
Optimal Design
• Single arm, two stage, using an
optimal design & predefined
response
• Rule out response probability of
20% (H0: p=0.20)
• Level that demonstrates useful
activity is 40% (H1:p=0.40)
• α = 0.10, β = 0.10
Phase II:
Two-Stage Optimal Design
• Seek to rule out undesirably low
response probability
– E.g. only 20% respond (p0=0.20)
• Seek to rule out p0 in favor of p1;
shows “useful” activity
– E.g. 40% are stable (p1=0.40)
Two-Stage Optimal Design
• Let α = 0.1 (10% probability of
accepting a poor agent)
• Let β = 0.1 (10% probability of
rejecting a good agent)
• Charts in Simon (1989) paper with
different p1 – p0 amounts and
varying α and β values
Table from Simon (1989)
Blow up: Simon (1989) Table
Phase II Example
• Initially enroll 17 patients.
– If 0-3 of the 17 have a clinical response, stop accrual and conclude the agent is not active
• If ≥ 4/17 respond, then accrual will
continue to 37 patients.
Phase II Example
• If 4-10 of the 37 respond, this is insufficient activity to continue
• If ≥ 11/37 respond then the agent will be
considered active.
• Under this design, if the null hypothesis were true (20% response probability), there is a 55% probability of early termination.
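The 55% figure can be verified from the binomial distribution of first-stage responses; a quick sketch:

from scipy.stats import binom

# P(0-3 responses among the first 17 patients | true response rate 0.20)
p_early_stop = binom.cdf(3, 17, 0.20)
print(round(p_early_stop, 2))   # ~0.55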
Sample Size Differences
• If the null hypothesis (H0) is true
• Using two-stage optimal design
– On average 26 subjects enrolled
• Using a 1-sample test of proportions
– 34 patients
– If feasible
• Using a 2-sample randomized test of
proportions
– 86 patients per group
Phase II: Historical Controls
• Want to double disease X survival
from 15.7 months to 31 months.
• α = 0.05, one tailed, β = 0.20
• Need 60 patients, about 30 in each of 2 arms; can accrue 1/month, so accrual alone would take 5 years
• Need 36 months of follow-up
• Use historical controls
Phase II: Historical Controls
• Old data set from 35 patients treated at
NCI with disease X, initially treated
from 1980 to 1999
• Currently 3 of 35 patients alive
• Median survival time for historical
patients is 15.7 months
• Almost like an observational study
• Use Dixon and Simon (1988) method for
analysis
Phase III Survival Example
• Primary objective: determine if
patients with metastatic melanoma
who undergo Procedure A have a
different overall survival compared
with patients receiving standard of
care (SOC)
• Trial is a two arm randomized
phase III single institution trial
Number of Patients to Enroll?
• 1:1 ratio between the two arms
• 80% power to detect a difference
between 8 month median survival and
16 month median survival
• Two-tailed α = 0.05
• 24 months of follow-up after the last
patient has been enrolled.
• 36 months of accrual
Phase III Survival
• Look at nomograms (Schoenfeld
and Richter). Can use formulas
• Need 38/arm, so let’s try to recruit
42/arm – total of 84 patients
• Anticipate approximately 30
patients/year entering the trial
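A rough cross-check of the nomogram value, as a sketch only: Schoenfeld's event-count formula for the log-rank test, combined with an assumed exponential survival model, uniform accrual over 36 months, and 24 further months of follow-up (all assumptions for illustration, not the protocol calculation):

from math import exp, log
from scipy.stats import norm

hr = 8.0 / 16.0                            # hazard ratio implied by the two medians
z = norm.ppf(0.975) + norm.ppf(0.80)       # two-sided alpha = 0.05, 80% power
events_needed = 4 * z**2 / log(hr) ** 2    # Schoenfeld: ~65 events total

def p_event(median, accrual=36.0, follow_up=24.0):
    # P(event observed by the analysis) under exponential survival,
    # uniform accrual over [0, accrual], analysis at accrual + follow_up
    lam, end = log(2) / median, accrual + follow_up
    return 1 - (exp(-lam * follow_up) - exp(-lam * end)) / (lam * accrual)

p_avg = (p_event(8) + p_event(16)) / 2
print(round(events_needed), round(events_needed / p_avg / 2))   # ~65 events, ~37 per arm

The per-arm count lands close to the 38/arm figure quoted above; the small difference reflects the approximations.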
Sample Size Example
• Study effect of new sleep aid
• 1 sample test
• Baseline to sleep time after taking the
medication for one week
• Two-sided test, α = 0.05, power = 90%
• Difference = 1 (4 hours of sleep to 5)
• Standard deviation = 2 hr
Sleep Aid Example
• 1 sample test
• 2-sided test, α = 0.05, 1 − β = 90%
• σ = 2 hr (standard deviation)
• δ = 1 hr (difference of interest)

n = (z_{1-α/2} + z_{1-β})² σ² / δ² = (1.960 + 1.282)² · 2² / 1² = 42.04 → 43
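The same arithmetic in a short Python sketch (rounding up to the next whole subject):

from math import ceil
from scipy.stats import norm

def one_sample_n(delta, sigma, alpha=0.05, power=0.90):
    # n = (z_{1-alpha/2} + z_{1-beta})^2 * sigma^2 / delta^2
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return ceil(z**2 * sigma**2 / delta**2)

print(one_sample_n(delta=1, sigma=2))   # 43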
Sample Size:
Change Effect or Difference
• Change difference of interest from 1hr
to 2 hr
• n goes from 43 to 11
n = (1.960 + 1.282)² · 2² / 2² = 10.51 → 11
Sample Size: Change Power
• Change power from 90% to 80%
• n goes from 11 to 8
• (Small sample: start thinking about
using the t distribution)
n = (1.960 + 0.841)² · 2² / 2² = 7.85 → 8
Sample Size:
Change Standard Deviation
• Change the standard deviation from 2
to 3
• n goes from 8 to 18
n = (1.960 + 0.841)² · 3² / 2² = 17.65 → 18
Sleep Aid Example: 2 Sample
• Original design (2-sided test, α = 0.05, 1-β =
90%, σ = 2hr, δ = 1 hr)
• Two sample randomized parallel design
• Needed 43 in the one-sample design
• In 2-sample need twice that, in each group!
• 4 times as many people are needed in this
design
n = 2 (z_{1-α/2} + z_{1-β})² σ² / δ² = 2 (1.960 + 1.282)² · 2² / 1² = 84.1 → 85 per group → 170 total!
Sample Size:
Change Effect or Difference
• Change difference of interest from 1hr
to 2 hr
• n goes from 170 to 44 (total)
n = 2 (1.960 + 1.282)² · 2² / 2² = 21.02 → 22 per group → 44 total
Sample Size: Change Power
• Change power from 90% to 80%
• n goes from 44 to 32
n = 2 (1.960 + 0.841)² · 2² / 2² = 15.69 → 16 per group → 32 total
Sample Size:
Change Standard Deviation
• Change the standard deviation from 2
to 3
• n goes from 32 to 72
n = 2 (1.960 + 0.841)² · 3² / 2² = 35.31 → 36 per group → 72 total
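All four two-sample results above come from the same formula with the factor of 2 per group; a sketch:

from math import ceil
from scipy.stats import norm

def two_sample_n_per_group(delta, sigma, alpha=0.05, power=0.90):
    # per-group n = 2 * (z_{1-alpha/2} + z_{1-beta})^2 * sigma^2 / delta^2
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return ceil(2 * z**2 * sigma**2 / delta**2)

print(two_sample_n_per_group(1, 2))                 # 85 per group (170 total)
print(two_sample_n_per_group(2, 2))                 # 22 per group (44 total)
print(two_sample_n_per_group(2, 2, power=0.80))     # 16 per group (32 total)
print(two_sample_n_per_group(2, 3, power=0.80))     # 36 per group (72 total)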
Conclusion
• Changes in the detectable difference have
HUGE impacts on sample size
– 20 point difference → 25 patients/group
– 10 point difference → 100 patients/group
– 5 point difference → 400 patients/group
• Changes in α, β, σ, the number of samples, and whether it is a 1- or 2-sided test can all have a large impact on your sample size calculation
2N = 4 (z_{1-α/2} + z_{1-β})² σ² / δ²
Sample Size:
Matched Pair Designs
• Similar to 1-sample formula
• Means (paired t-test)
– Mean difference from paired data
– Variance of differences
• Proportions
– Based on discordant pairs
Examples in the Text
• Several with paired designs
• Two and one sample means
• Proportions
• How to take pilot data and design the next study
Outline
• Power
• Basic Sample Size Information
• Examples (see text for more)
• Changes to the basic formula / Observational studies
• Multiple comparisons
• Rejected sample size statements
• Conclusion and Resources
Unequal #s in Each Group
• Ratio of cases to controls
• Use if you want λ patients randomized to the treatment arm for every patient randomized to the placebo arm
• Take no more than 4-5 controls/case
n₂ = λ n₁ → λ controls for every case

n₁ = (z_{1-α/2} + z_{1-β})² (σ₁² + σ₂²/λ) / δ²
K:1 Sample Size Shortcut
• Use equal variance sample size formula: TOTAL sample size increases by a factor of (k+1)² / (4k)
• Total sample size for two equal groups = 26; want 2:1 ratio
• 26 · (2+1)² / (4·2) = 26 · 9/8 = 29.25 ≈ 30
• 20 in one group and 10 in the other
Unequal #s in Each Group:
Fixed # of Cases
• Case-Control Study
• Only so many new devices
• Sample size calculation says n=13
cases and controls are needed
• Only have 11 cases!
• Want the same precision
• n0 = 11 cases
• kn0 = # of controls
How many controls?
k = n / (2 n₀ − n)

• k = 13 / (2·11 − 13) = 13 / 9 = 1.44
• k n₀ = 1.44 · 11 ≈ 16 controls (and 11 cases)
– Same precision as 13 controls and 13 cases
# of Events is Important
• Cohort of exposed and unexposed
people
• Relative Risk = R
• Prevalence in the unexposed
population = π1
Formulas and Example
R = (risk of event in exposed group) / (risk of event in unexposed group)

n₁ = (z_{1-α/2} + z_{1-β})² (R + 1) / (R − 1)² = # of events in unexposed group

n₂ = R n₁ = # of events in exposed group

n₁ and n₂ are the numbers of events in the two groups required to detect a relative risk of R with power 1 − β

N = n₁ / π₁ = # of subjects per group
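A sketch of how the pieces fit together, with purely illustrative inputs (R = 2, π₁ = 0.05, 90% power, two-sided α = 0.05); the constant in n₁ follows the rare-event approximation above:

from math import ceil
from scipy.stats import norm

R, pi1 = 2.0, 0.05
z = norm.ppf(0.975) + norm.ppf(0.90)
n1 = z**2 * (R + 1) / (R - 1) ** 2    # events needed in the unexposed group
n2 = R * n1                           # events needed in the exposed group
N = ceil(n1 / pi1)                    # subjects needed per group
print(ceil(n1), ceil(n2), N)          # ~32 and ~64 events; ~631 subjects per group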
# of Covariates and # of Subjects
• At least 10 subjects for every variable
investigated
– In logistic regression
– No general justification
– This is about stability, not power
– Peduzzi et al. (1985): too few subjects per variable gives biased regression coefficients and variance estimates
• Principal component analysis (PCA) (Thorndike 1978, p. 184): N ≥ 10m + 50 or even N ≥ m² + 50
Balanced designs are easier to
analyze
• Equal numbers in two groups is the easiest to handle
• If you have more than two groups, equal sample sizes are still easiest
• Complicated design = simulations
– Done by the statistician
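A minimal sketch of the simulation idea for the simple two-group case (a real design would substitute its own data-generating and analysis steps):

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

def simulated_power(n_per_group, delta, sigma, alpha=0.05, n_sims=2000):
    # Fraction of simulated trials in which the two-sample t-test rejects H0
    rejections = 0
    for _ in range(n_sims):
        control = rng.normal(0.0, sigma, n_per_group)
        treated = rng.normal(delta, sigma, n_per_group)
        if ttest_ind(control, treated).pvalue < alpha:
            rejections += 1
    return rejections / n_sims

print(simulated_power(16, delta=2, sigma=2))   # roughly 0.8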
Outline
• Power
• Basic Sample Size Information
• Examples (see text for more)
• Changes to the basic formula
• Multiple comparisons
• Rejected sample size statements
• Conclusion and Resources
Multiple Comparisons
• If you have 4 groups
– All 2 way comparisons of means
– 6 different tests
• Bonferroni: divide α by # of tests
– 0.025/6 ≈ 0.0042
• High-throughput laboratory tests
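Plugging the Bonferroni-adjusted α into the usual formula shows the cost in sample size; a sketch using the two-sample sleep-aid numbers (δ = 1, σ = 2, 90% power, illustrative only) with 6 tests:

from math import ceil
from scipy.stats import norm

def two_sample_n_per_group(delta, sigma, alpha=0.05, power=0.90):
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return ceil(2 * z**2 * sigma**2 / delta**2)

print(two_sample_n_per_group(1, 2))                   # 85 per group, unadjusted
print(two_sample_n_per_group(1, 2, alpha=0.05 / 6))   # ~123 per group, Bonferroni-adjusted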
DNA Microarrays/Proteomics
• Same formula (Simon et al. 2003)
– α = 0.001 and β = 0.05
– Possibly stricter
• Simulations (Pepe 2003)
– based on pilot data
– k0 = # of genes taken forward for further study
– k1 = rank of the genes you want to be sure to capture
– Power based on P[ Rank(g) ≤ k0 | True Rank(g) ≤ k1 ]
Outline
• Power
• Basic Sample Size Information
• Examples (see text for more)
• Changes to the basic formula
• Multiple comparisons
• Rejected sample size statements
• Conclusion and Resources
Rejected Sample Size
Statements
• "A previous study in this area recruited
150 subjects and found highly
significant results (p=0.014), and
therefore a similar sample size should
be sufficient here."
– Previous studies may have been
'lucky' to find significant results, due
to random sampling variation.
No Prior Information
• "Sample sizes are not provided
because there is no prior information
on which to base them."
• Find previously published information
• Conduct small pre-study
• For a very preliminary pilot study, sample size calculations are not usually necessary
Variance?
• No prior information on standard
deviations
– Give the size of difference that may
be detected in terms of number of
standard deviations
Number of Available Patients
• "The clinic sees around 50 patients a year, of
whom 10% may refuse to take part in the
study. Therefore over the 2 years of the
study, the sample size will be 90 patients. "
• Although most studies need to balance
feasibility with study power, the sample size
should not be decided on the number of
available patients alone.
• If you know # of patients is an issue, can
phrase in terms of power
Outline
• Power
• Basic Sample Size Information
• Examples (see text for more)
• Changes to the basic formula
• Multiple comparisons
• Rejected sample size statements
• Conclusion and Resources
Conclusions
• Changes in the detectable difference have
HUGE impacts on sample size
– 20 point difference → 25 patients/group
– 10 point difference → 100 patients/group
– 5 point difference → 400 patients/group
• Changes in α, β, σ, the number of samples, and whether it is a 1- or 2-sided test can all have a large impact on your sample size calculation
2N = 4 (z_{1-α/2} + z_{1-β})² σ² / δ²
No Estimate of the Variance?
• Make a sample size or power table
• Use a wide variety of possible
standard deviations
• Protect with high sample size if
possible
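A sketch of such a table for the one-sample sleep-aid setup, scanning a range of assumed standard deviations and detectable differences (the grids are illustrative):

from math import ceil
from scipy.stats import norm

def one_sample_n(delta, sigma, alpha=0.05, power=0.90):
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return ceil(z**2 * sigma**2 / delta**2)

for sigma in (1.5, 2.0, 2.5, 3.0):
    row = {delta: one_sample_n(delta, sigma) for delta in (0.5, 1.0, 2.0)}
    print(f"sigma = {sigma}: {row}")
# e.g. sigma = 2.0 gives n = 169, 43, 11 for delta = 0.5, 1.0, 2.0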
Analysis Follows Design
• Questions → Hypotheses → Experimental Design → Samples → Data → Analyses → Conclusions
• Take all of your design information to a
statistician early and often
– Guidance
– Assumptions
Resources: General Books
• Altman (1991) Practical Statistics for Medical
Research. Chapman and Hall
• Bland (2000) An Introduction to Medical
Statistics, 3rd. ed. Oxford University Press
• Armitage, Berry and Matthews (2002)
Statistical Methods in Medical Research, 4th
ed. Blackwell, Oxford
• Fisher and Van Belle (1996, 2004) Wiley
• Simon et al. (2003) Design and Analysis of
DNA Microarray Investigations. Springer
Verlag
Sample Size Specific Tables
• Continuous data: Machin et al. (1998)
Statistical Tables for the Design of Clinical
Studies, Second Edition Blackwell, Oxford
• Categorical data: Lemeshow et al. (1996)
Adequacy of sample size in health studies.
Wiley
• Sequential trials: Whitehead, J. (1997) The
Design and Analysis of Sequential Clinical
Trials, revised 2nd. ed. Wiley
• Equivalence trials: Pocock SJ. (1983) Clinical
Trials: A Practical Approach. Wiley
Resources: Articles
• Simon R. Optimal two-stage
designs for phase II clinical trials.
Controlled Clinical Trials. 10:1-10,
1989.
• Thall, Simon, Ellenberg. A two-stage design for choosing among several experimental treatments and a control in clinical trials. Biometrics. 45(2):537-547, 1989.
Resources: Articles
• Schoenfeld, Richter. Nomograms for
calculating the number of patients needed
for a clinical trial with survival as an
endpoint. Biometrics. 38(1):163-170, 1982.
• Bland JM and Altman DG. One and two sided
tests of significance. British Medical Journal
309: 248, 1994.
• Pepe, Longton, Anderson, Schummer.
Selecting differentially expressed genes from
microarray experiments. Biometrics.
59(1):133-142, 2003.
Resources: URLs
• Sample size calculations simplified
– http://www.tufts.edu/~gdallal/SIZE.HTM
• Statistics guide for research grant
applicants, St. George’s Hospital Medical
School
– http://www.sghms.ac.uk/depts/phs/guide/size.htm
• Software: nQuery, EpiTable, SeqTrial, PS
(http://www.mc.vanderbilt.edu/prevmed/ps/)
Questions?