January 13, 2011

Statistics in Biomedical Research
RISE Program 2011
Los Angeles Biomedical Research Institute
at Harbor-UCLA Medical Center
January 13, 2011
Peter D. Christenson
Scientific Decision Making
Setting:
Two groups of animals: one gets a new molecule
you have created, the other group doesn't.
Measure the relevant outcome in all animals.
How do we decide if the molecule has an effect?
Balancing Risk: Business
You have a vending machine business.
The machines have a dollar bill reader.
You can set the reader to be loose or strict.
Setting too high → rejects valid bills, loses customers.
Setting too low → accepts bogus bills, you lose.
Need to balance both errors in some way.
Balancing Risk: U.S. Legal System
Need to decide guilty or innocent.
Jury or judge measures degree of guilt.
Civil case: lower degree needed than criminal case.
Setting degree high → frees suspects who are guilty.
Setting degree low → jails suspects who are innocent.
Need to balance both errors in some way.
Balancing Risk: Personal
Investments have expected returns of 1% to 20%.
As expected returns ↑, the chances they don’t materialize ↑.
You choose your degree of risk.
Setting degree high → chances of losing ↑.
Setting degree low → chances of missing big $ ↑.
Need to balance both errors in some way.
Balancing Risk: Scientific Research
Perform experiment on 20 mice to measure an effect.
Only 20 mice may not be representative.
Need to decide if effect is real or random.
You choose a minimal degree of effect → Call real.
Setting degree high → chances* of missing effect ↑.
Setting degree low → chances** of wrong + result ↑.
Need to balance both errors in some way.
What would you want chances for * and for ** to be?
Scientific Decision Making
Setting: Two groups, one gets drug A, one gets
placebo (B). Measure outcome.
Subjects may respond very differently.
How do we decide if the drug has “an effect”?
Perhaps: Say yes if the mean outcome of those
receiving drug is greater than the mean of the
others? Or twice as great? Or the worst responder on
drug was better than the best on placebo? Other?
Meaning or Randomness?
This is the goal of science in general.
The role of statistics is to give an objective
way to make those decisions.
Meaning or Randomness?
Scientific inference: Perform experiment.
Make a decision: Is it real or random?
Quantify chances that our decision is
correct or not.
Other areas of life:
Suspect guilty? Nobel laureate's opinion?
Make a decision: Is it real or random?
Usually cannot quantify.
Specialness of Scientific Research
Scientific method:
Assume the opposite of what we think.
Design the experiment so that our opinions cannot
influence the outcome.
Say in advance exactly how we will reach a conclusion, i.e.,
make a decision from the experiment.
Tie our hands behind our back. Do experiment.
Make decision. Find the chances (from
calculations, not opinion) that we are wrong.
Experimental conclusions are not expert opinion.
Decision Making
We first discuss using a medical device to make
decisions about a patient.
These decisions could be right or wrong.
We then make an analogy to using an experiment to
make decisions about a scientific question.
These decisions could be right or wrong.
Decision Making
The next eight slides will make an analogy to
how conclusions or decisions from
experiments are made.
The numbers are made-up.
Mammograms are really better than this.
Decision Making: Diagnosis
[Scale: Mammogram Spot Darkness, from 0 (Definitely Not Cancer) to 10 (Definitely Cancer).]
A particular woman with cancer may not have a 10.
Another woman without cancer may not have 0.
How is the decision made
for intermediate darkness?
Need graph here of the overlap in the CA and nonCA groups.
It needs to correspond to the %s in the next few
slides.
Decision Making: Diagnosis
Suppose a study found the mammogram rating
(0-10) for 1000 women who definitely have
cancer by biopsy (truth).
Use What Cutoff?
Proportion of 1000 Women with spot darkness above each value:
Mammogram Spot Darkness    Proportion
≥0                         1000/1000
>2                         990/1000
>4                         900/1000
>6                         600/1000
>8                         100/1000
>10                        0/1000
Decision Making: Sensitivity
Sensitivity = Chances of correctly detecting disease.
Cutoff for Spot Darkness    Mammogram Sensitivity
≥0                          100%
>2                          99%
>4                          90%
>6                          60%
>8                          10%
>10                         0%
Why not just choose a low cutoff and detect almost
everyone with disease?
Decision Making Continued
Suppose a study found the mammogram rating
(0-10) for 1000 women who definitely do NOT
have cancer by biopsy (truth).
Use What Cutoff?
Proportion of 1000 Women with spot darkness at or below each value:
Mammogram Spot Darkness    Proportion
<0                         0/1000
≤2                         350/1000
≤4                         700/1000
≤6                         900/1000
≤8                         950/1000
≤10                        1000/1000
Decision Making: Specificity
Specificity = Chances of correctly NOT detecting disease.
Cutoff for Spot Darkness    Mammogram Specificity
<0                          0%
≤2                          35%
≤4                          70%
≤6                          90%
≤8                          95%
≤10                         100%
Decision Making: Tradeoff
Cutoff    Sensitivity    Specificity
0         100%           0%
2         99%            35%
4         90%            70%
6         60%            90%
8         10%            95%
10        0%             100%
Choice of cutoff depends on whether the diagnosis is a
screening or a final one. For example:
Cutoff=4 : Call disease in 90% with it and 30% without.
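The tradeoff table can be rebuilt directly from the two made-up studies. The sketch below is only an illustration: the dictionary and function names are invented here, and the counts are the ones quoted on the slides.

```python
# Counts from the two made-up studies above (1000 women each).
# cancer_above[c]: women WITH cancer whose spot darkness exceeds cutoff c
# (at cutoff 0 every woman is called positive, as on the sensitivity slide).
# healthy_at_or_below[c]: women WITHOUT cancer whose darkness is at or below c.
cancer_above = {0: 1000, 2: 990, 4: 900, 6: 600, 8: 100, 10: 0}
healthy_at_or_below = {0: 0, 2: 350, 4: 700, 6: 900, 8: 950, 10: 1000}
N_PER_STUDY = 1000


def tradeoff(cutoff):
    """Return (sensitivity %, specificity %, false positive %) for a cutoff."""
    sensitivity = 100 * cancer_above[cutoff] / N_PER_STUDY
    specificity = 100 * healthy_at_or_below[cutoff] / N_PER_STUDY
    return sensitivity, specificity, 100 - specificity


for cutoff in sorted(cancer_above):
    sens, spec, false_pos = tradeoff(cutoff)
    print(f"cutoff={cutoff:2d}  sensitivity={sens:5.1f}%  "
          f"specificity={spec:5.1f}%  false positives={false_pos:5.1f}%")
```

A screening test might accept the many false positives at cutoff 4 to catch 90% of cancers; a final diagnosis would lean toward a stricter cutoff.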
Graphical Representation of Tradeoffs
Make Decision: If Spot>6, Decide CA. If Spot≤6, Decide Not CA.
[Figure: overlapping curves of # of women vs. Mammogram Spot Darkness (0 to 10) for True Non-CA Patients and True CA Patients, with the cutoff at 6. The regions on either side of the cutoff are labeled Decide A=B and Decide A≠B. Area under curve = Probability.]
\\\ = Specificity = 90%.
/// = Sensitivity = 60%.
Tradeoffs From a Stricter Cutoff
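[Figure: the same curves over Mammogram Spot Darkness (0 to 10), with the cutoff moved to 8 and the regions on either side labeled Decide A=B and Decide A≠B. Specificity = 95%. Sensitivity = 10%.]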
Decision Making for Diagnosis: Summary
As sensitivity increases, specificity decreases and
vice-versa.
Cannot increase both sensitivity and specificity
together.
We now develop sensitivity and specificity to test or
decide scientific claims. Analogy:
True Disease ↔ True claim, real effect.
Decide Disease ↔ Decide effect is real.
But in experiments we can increase both sensitivity and
specificity together.
Decision Making
End of analogy.
Back to our original problem in experiments.
Scientific Decision Making
Setting: Two groups, one gets drug A, one
gets placebo (B). Measure outcome.
How do we decide if the drug has an effect?
Perhaps: Say yes if the mean of those
receiving drug is greater than the mean of the
placebo group? Other decision rules?
Let’s just try an arbitrary decision rule:
Let Δ = Group A Mean minus Group B Mean
Decide that A>B if Δ>2. [Not just Δ>0.]
Eventual Graphical Representation
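Make Decision:
If Δ>2, then Decide A≠B.
If Δ≤2, then Decide A=B.
[Figure: two bell curves over Δ = Group A Mean minus Group B Mean (axis from -2 to 6), one for True No Effect (A=B) and one for True Effect (A≈B+2.2), with the cutoff at 2. 90% of the no-effect curve lies at or below the cutoff (Decide A=B); 60% of the true-effect curve lies above it (Decide A≠B).]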
1. Where do these curves come from?
2. What are the consequences of using cutoff=2?
Question 2 First
2. What are the consequences of using cutoff=2?
Answer:
If the effect is real (A≠B), there is a 60% chance of
deciding so. [Actually, if in particular A is 2.2 more
than B.] This is the experiment’s sensitivity, more
often called power.
If the effect is not real (A=B), there is a 90% chance of deciding
so. This is the experiment’s specificity. More often,
100% minus specificity is called the level of significance.
Question 2 Continued
What if cutoff=1 was used instead?
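[Figure: the same two curves over Δ, centered at 0 and at 2.2, with the cutoff moved to 1. About 60% of the no-effect curve lies at or below the cutoff; about 85% of the true-effect curve lies above it.]
If the effect is real (Δ=A-B=2.2), there is about an 85%
chance of deciding so. Sensitivity ↑ (from 60%).
If the effect is not real (Δ=A-B=0), there is about a 60% chance of
deciding so. Specificity ↓ (from about 90%).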
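How moving the cutoff trades power against specificity can be sketched with two normal curves. This is a rough illustration, not the calculation behind the slides: the width of the curves (`SE`) is an assumption, chosen here so that cutoff=2 gives 90% specificity. The slide percentages were read off a drawing, so expect the same direction of change rather than identical numbers.

```python
from statistics import NormalDist

std_normal = NormalDist()
TRUE_EFFECT = 2.2  # A is 2.2 more than B when the effect is real (from the slides)

# ASSUMPTION: the slides do not give the width of the bell curves. Here the
# standard error of Δ is chosen so that cutoff=2 has 90% specificity.
SE = 2 / std_normal.inv_cdf(0.90)


def specificity(cutoff):
    """Chance of deciding A=B (Δ <= cutoff) when there is truly no effect."""
    return NormalDist(0, SE).cdf(cutoff)


def power(cutoff):
    """Chance of deciding A≠B (Δ > cutoff) when the true difference is 2.2."""
    return 1 - NormalDist(TRUE_EFFECT, SE).cdf(cutoff)


for cutoff in (2, 1):
    print(f"cutoff={cutoff}: specificity={specificity(cutoff):.0%}, "
          f"power={power(cutoff):.0%}")
```

Lowering the cutoff from 2 to 1 raises power and lowers specificity, which is the tradeoff the slide describes.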
Typical Choice of Cutoff
Require specificity to be 95%. This means there is only a 5%
chance of wrongly declaring an effect. → Need overwhelming
evidence, beyond a reasonable (5%) doubt, to make a claim.
[Figure: the two curves over Δ = Group A Mean minus Group B Mean (axis from -2 to 6), with the cutoff set so that Specificity = 95%. Power is then ~45%.]
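Fixing specificity at 95% turns the choice of cutoff into a lookup on the no-effect curve. A minimal sketch, assuming a normal curve and a hypothetical standard error of 1.5 for Δ (the slides do not state one), so the printed power is illustrative only:

```python
from statistics import NormalDist

# ASSUMPTION: hypothetical standard error for Δ; the slides do not state one.
SE = 1.5
TRUE_EFFECT = 2.2  # the true difference used in the slides

no_effect = NormalDist(0, SE)
real_effect = NormalDist(TRUE_EFFECT, SE)

# Cutoff that leaves only a 5% chance of Δ exceeding it when A=B.
cutoff = no_effect.inv_cdf(0.95)

false_positive = 1 - no_effect.cdf(cutoff)
power = 1 - real_effect.cdf(cutoff)
print(f"cutoff={cutoff:.2f}  false positive chance={false_positive:.0%}  "
      f"power={power:.0%}")
```

With the false positive chance pinned at 5%, the cutoff lands above the true effect in this example, so fewer than half of such experiments would detect it.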
Strength of the Scientific Method
Scientists (and their journals and the FDA) require overwhelming
evidence, beyond a reasonable (5%) doubt, not just a
“preponderance of the evidence,” which would be specificity=50%.
Similar to a US criminal court. So much stronger than expert opinion.
[Figure: the two curves with the cutoff set for only a 5% chance of a false positive claim. Power ~45%.]
How can we increase power above this 45%, but maintain the
chances of a false positive conclusion at ≤5%?
Are we just stuck with knowing that many true conjectures will
be thrown away as collateral damage to this rigor?
How to Increase Power
To answer this, we need to go into how the curves are made.
So, we take a detour for the next 8 slides to show this.
Back to Question 1
1. Where do the curves in the last figure come
from?
Answer:
You specify three quantities: (1) where their peaks
are (the experiment’s detectable difference),
and how wide they are (which is determined by
(2) natural variation and (3) the # of subjects or
animals or tissue samples, N).
Those specifications give a unique set of “bell-shaped” curves. How?
A “Law of Large Numbers”
Suppose individuals have values ranging from Lo to
Hi, but the % with any particular value could be
anything, say:
[Figure: two example distributions of individual values (N=1), Prob vs. value from Lo to Hi.]
You choose a sample of 2 of these individuals, and
find their average. What value do you expect the
average to have?
A “Law of Large Numbers”
In both cases, values near the center will be more
likely:
[Figure: the two distributions of the average for N=2, Prob vs. value from Lo to Hi.]
Now choose a sample of 4 of these individuals, and
find their average. What value do you expect the
average to have?
A “Law of Large Numbers”
In both cases, values near the center will be more
likely:
[Figure: the two distributions of the average for N=4, Prob vs. value from Lo to Hi.]
Now choose a sample of 10 of these individuals, and
find their average. What value do you expect the
average to have?
A “Law of Large Numbers”
In both cases, values near the center will be more
likely:
[Figure: the two distributions of the average for N=10, Prob vs. value from Lo to Hi.]
Now choose a sample of 50 of these individuals, and
find their average. What value do you expect the
average to have?
A “Law of Large Numbers”
In both cases, values near the center will be more
likely:
[Figure: the two distributions of the average for N=50, Prob vs. value from Lo to Hi.]
A remarkable fact is that not only is the mean of the
sample expected to be close to the mean of
“everyone” if N is large enough, but we know the exact
probabilities of how close, and the shape of the curve.
Summary: Law of Large Numbers
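[Figure: the distribution of individual values (N=1) and the distribution of the value of the mean of N subjects (large N), Prob vs. value from Lo to Hi, with the SD (↔) marked on each.]
SD is about 1/6 of the total range.
SD ≈ 1.25 x average deviation from the center.
Large N: SD(Mean) = SD/√N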
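The rule SD(Mean) = SD/√N can be checked by simulation. The sketch below assumes a flat (uniform) distribution between Lo=0 and Hi=1 as one possible "anything" shape; the function name, seed, and number of repeats are arbitrary choices made for this illustration.

```python
import random
from math import sqrt
from statistics import pstdev

random.seed(2011)  # fixed seed so the sketch is repeatable


def sd_of_sample_means(n, repeats=5000):
    """SD of the means of `repeats` samples, each of n uniform(0, 1) values."""
    means = [sum(random.random() for _ in range(n)) / n for _ in range(repeats)]
    return pstdev(means)


SD_INDIVIDUALS = sqrt(1 / 12)  # exact SD of a single uniform(0, 1) value

for n in (1, 2, 4, 10, 50):
    simulated = sd_of_sample_means(n)
    predicted = SD_INDIVIDUALS / sqrt(n)
    print(f"N={n:2d}  simulated SD(Mean)={simulated:.4f}  "
          f"SD/sqrt(N)={predicted:.4f}")
```

The two columns should agree closely at every N, even though the individual values are not bell-shaped at all.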
Law of Large Numbers: Another View
[Figure: the same curves, rescaled.]
You can make the range of possible
values for a mean as small as you like
by choosing a large enough sample.
Also the shape will always be a bell
curve if the sample is large enough.
Scientific Decision Making
So, where are we?
We can now answer the basic
dilemma we raised.
Repeat earlier slide:
Strength of the Scientific Method
Scientists (and their journals and the FDA) require overwhelming
evidence, beyond a reasonable (5%) doubt, not just a
“preponderance of the evidence,” which would be specificity=50%.
Similar to a US criminal court. So much stronger than expert opinion.
[Figure: the two curves for N = 50 with the cutoff set for only a 5% chance of a false positive claim. Power ~45%.]
How can we increase power, but maintain the chances of a
false positive conclusion at ≤5%?
Are we just stuck with knowing that many true conjectures will
be thrown away as collateral damage to this rigor?
Scientific Decision Making
So, the answer is that by choosing N large
enough, the mean has to be in a small range.
That narrows the curves.
That in turn increases the chances that we will
find the effect in our study, i.e., its power.
The next slide shows this.
[Figure: three pairs of curves, each with the cutoff set for 95% specificity.]
Fix the max chances of a false positive claim at 5%.
Find N that gives the power you want.
N = 50: 45% Power
N = 75: 74% Power
N = 88: 80% Power
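The way power climbs with N while the false positive chance stays at 5% can be sketched with a normal approximation. The effect size below is an assumption, calibrated so that N = 88 gives 80% power; the slides do not state the value behind the figure, so the other percentages printed here need not match the figure.

```python
from math import sqrt
from statistics import NormalDist

z = NormalDist()
ALPHA = 0.05                       # max chance of a false positive claim
z_crit = z.inv_cdf(1 - ALPHA / 2)  # two-sided cutoff in standard-error units

# ASSUMPTION: hypothetical effect size (Δ divided by SD), calibrated so that
# N = 88 gives 80% power; the slides do not state the value they used.
EFFECT_IN_SD_UNITS = (z_crit + z.inv_cdf(0.80)) / sqrt(88)


def power(n):
    """Approximate power when SD(Mean) shrinks like SD/sqrt(n)."""
    return z.cdf(EFFECT_IN_SD_UNITS * sqrt(n) - z_crit)


for n in (50, 75, 88):
    print(f"N={n}: power={power(n):.0%}")
```

Because the cutoff is defined in standard-error units, it stays at the 5% level for every N; only the separation between the curves, and hence the power, changes.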
Putting it All Together
In many experiments, five factors are inter-related.
Specifying four of these determines the fifth:
1. Study size, N.
2. Power, usually 80% to 90% is used.
3. Acceptable false positive chance, usually 5%.
4. Magnitude of the effect to be detected (Δ).
5. Heterogeneity among subjects or units (SD).
The next 2 slides show how these factors are typically
examined, and easy software to do the calculations.
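As a rough illustration of "specifying four determines the fifth", the sketch below solves for study size from the other four factors. It assumes two equal groups and the common normal-approximation formula; the function name and the example inputs (Δ = 5, SD = 10) are hypothetical and not taken from any study in this talk.

```python
from math import ceil
from statistics import NormalDist


def subjects_per_group(delta, sd, alpha=0.05, power=0.80):
    """Normal-approximation size of each of two equal groups."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided false positive chance
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sd / delta) ** 2)


# Hypothetical inputs, for illustration only: detect Δ = 5 when SD = 10.
print(subjects_per_group(delta=5, sd=10))              # 80% power
print(subjects_per_group(delta=5, sd=10, power=0.90))  # 90% power needs more
```

Dedicated study size software typically uses the t distribution rather than the normal, so its answers run slightly larger for small groups.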
Quote from An LA BioMed Protocol
The following table presents detectable differences, with p=0.05 and
80% power, for different study sizes.
Total Number    Detectable Difference in    Detectable Difference in
of Subjects     Change in Mean MAP          Change in Mean Number of
                (mm Hg)(1)                  Vasopressors(2)
20              10.9                        0.77
40              7.4                         0.49
60              6.0                         0.39
80              5.2                         0.34
100             4.6                         0.30
120             4.2                         0.27
Thus, with a total of the planned 80 subjects, we are 80% sure to detect
(p<0.05) group differences if treatments actually differ by at least 5.2 mm
Hg in MAP change, or by a mean 0.34 change in number of vasopressors.
Software for Previous Slide
Pilot data: SD=8.19 for ΔMAP in 36 subjects.
For p-value<0.05, power=80%, N=40/group, the
detectable Δ of 5.2 in the previous table is found as:
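The calculation can be approximated by hand. The sketch below is not the software used for the protocol: it assumes two equal groups and a normal approximation, with the pilot SD of 8.19 from the slide. It comes out near 5.1 for N=40/group rather than the table's 5.2, a gap that is plausibly due to the software using the t distribution.

```python
from math import sqrt
from statistics import NormalDist

SD = 8.19  # pilot SD for ΔMAP, from the slide


def detectable_difference(n_per_group, sd=SD, alpha=0.05, power=0.80):
    """Normal-approximation detectable Δ for two equal groups."""
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return multiplier * sd * sqrt(2 / n_per_group)


for total in (20, 40, 60, 80, 100, 120):
    delta = detectable_difference(total // 2)
    print(f"total N={total:3d}  detectable Δ ≈ {delta:.1f} mm Hg")
```

The approximation tracks the protocol table more closely as the groups get larger, where the normal and t distributions nearly coincide.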
Study Size: May Not Be Based on Power
Precision refers to how well a measure is estimated.
Margin of error = the ± value (half-width) of the 95%
confidence interval (sorry – not discussed here).
Smaller margin of error ←→ greater precision.
To achieve a specified margin of error, solve the CI
formula for N.
Polls: N ≈ 1000→ margin of error on % ≈ 1/√N ≈ 3%.
Pilot Studies, Phase I, Some Phase II: Power not
relevant; may have a goal of obtaining an SD for
future studies.
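For the poll example, the margin of error and the N needed to reach it can be sketched as follows. The CI formula is not covered in this talk, so the usual normal-approximation interval for a proportion is assumed here, with p = 0.5 as the worst case; the function names are invented for the illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)  # about 1.96 for a 95% confidence interval


def margin_of_error(n, p=0.5):
    """Half-width of the 95% CI for a proportion p estimated from n people."""
    return Z95 * sqrt(p * (1 - p) / n)


def n_for_margin(margin, p=0.5):
    """Smallest n whose margin of error is at most `margin`."""
    return ceil((Z95 / margin) ** 2 * p * (1 - p))


print(f"N=1000: margin of error = {margin_of_error(1000):.1%}")
print(f"N needed for a 3% margin: {n_for_margin(0.03)}")
```

With p = 0.5 the margin is 0.98/√N, which is where the 1/√N rule of thumb for polls comes from.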
Study Design Considerations
Statistical Components of Protocols
• Target population / source of subjects.
• Quantification of aims, hypotheses.
• Case definitions, endpoints quantified.
• Randomization plan, if one will be used.
• Masking, if used.
• Study size: screen, enroll, complete.
• Use of data from non-completers.
• Justification of study size (power, precision, other).
• Methods of analysis.
• Mid-study analyses.
Resources, Software, and References
Professional Statistics Software Package
Comprehensive, but steep learning curve: SAS, SPSS, Stata.
[Screenshot labels: Output. Stored data; accessible. Enter code; syntax.]
Microsoft Excel for Statistics
• Primarily for descriptive statistics.
• Limited output.
Typical Statistics Software Package
Select Methods from Menus
www.ncss.com
www.minitab.com
www.stata.com
$100 - $500
Output after menu selection
Data in spreadsheet
Free Statistics Software: Mystat
www.systat.com
Free Study Size Software
www.stat.uiowa.edu/~rlenth/Power
http://gcrc.labiomed.org/biostat
This and other biostat talks posted.
Recommended Textbook: Making Inference
Design issues
Biases
How to read papers
Meta-analyses
Dropouts
Non-mathematical
Many examples
Thank You
Nils Simonson, in Furberg & Furberg, Evaluating Clinical Research
Outline
Meaning or randomness?
Decisions, truth and errors.
Sensitivity and specificity.
Laws of large numbers.
Experiment size and study power.
Study design considerations.
Resources, software, and references.