Transcript Slide 1
Using and understanding
numbers in health news
and research
Heejung Bang, PhD
Department of Public Health
Weill Medical College of Cornell University
1
A rationale for today’s talk
Coffee is bad yesterday, but good today, and bad again tomorrow.
“It's the cure of the week or the killer of the
week, the danger of the week.” says B.
Kramer.
“I've seen so many contradictory studies
with coffee that I've come to ignore them
all.” says D. Berry.
What to believe? For a while, you may just
keep drinking coffee.
2
Hardly a day goes by without a new headline about the supposed health risks or benefits of something…
Are these headlines justified?
Often, the answer is NO.
3
R. Peto phrases the nature of the conflict
this way: “Epidemiology is so beautiful and
provides such an important perspective on
human life and death, but an incredible
amount of rubbish is published,”
by which he means the results of
observational studies that appear daily in
the news media and often become the
basis of public-health recommendations
about what we should or should not do.
4
3 major reasons for coffee-like situations
Confounding
Multiple testing
Faulty design/sample selection
5
Topics to be covered today
1. Numbers in press release
2. Lies, Damn Lies & Statistics
3. Association vs. Causation
4. Experiment (e.g., RCT) vs. Observational study
5. Replicate or Perish
6. Hierarchy of evidence and study design
7. Meta-analysis
8. Multiple testing
9. Same words, different meanings?
10. Data sharing
11. Other Take-Home messages
6
1. Numbers in press release
No p-value, no odds or hazards ratio in a press release!
-- Ask people on the street: "What is a p-value?"
-- Only we may laugh if I make a statistical joke using 0.05, 1.96, 95%, etc.
7
What is a p-value?
In statistical hypothesis testing, the p-value is the probability of obtaining a result at least as extreme as a given data point, under the null hypothesis.
-- If there is no hypothesis, there is no test and no p-value.
In current statistical training and practice, statistical testing and the p-value are overly emphasized.
However, the p-value (one number, between 0 and 1) can be useful for decision making.
-- you cannot say "it depends" all the time, although it may be true.
8
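To make the definition concrete, here is a minimal sketch in Python with hypothetical numbers (60 heads in 100 coin flips, chosen only for illustration), estimating a p-value by simulation under the null hypothesis of a fair coin:

```python
# Minimal sketch (hypothetical numbers): estimating a p-value by simulation.
# Null hypothesis: the coin is fair. Observed: 60 heads in 100 flips.
# The p-value is the probability of a result at least as extreme as the one
# observed, in either direction, when the null hypothesis is true.
import random

n, observed = 100, 60
reps, extreme = 20_000, 0
for _ in range(reps):
    heads = sum(random.random() < 0.5 for _ in range(n))
    if abs(heads - n / 2) >= abs(observed - n / 2):
        extreme += 1
print(extreme / reps)  # roughly 0.057: close to, but not below, the usual 0.05 cutoff
```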
Numerator & denominator
Always try to check the numerator and the denominator (and when, and for how long).
Try to read the footnotes under *
-- A 100% increase can be 1 → 2 cases.
-- A 20% event rate can be 1 out of 5 samples.
9
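A tiny worked example (hypothetical counts) of why the numerator and denominator matter:

```python
# Minimal sketch (hypothetical counts): the same headline number can describe
# very different situations, so always check the numerator and denominator.
before, after = 1, 2
relative_increase = (after - before) / before            # 1.0, i.e., "100% increase"
print(f"{relative_increase:.0%} increase, but only {after - before} extra case")

events, sample = 1, 5
print(f"{events / sample:.0%} event rate, from {events} event in {sample} people")
```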
Large Number myths
With large N, one is more likely to find a difference when a difference truly exists – the notion of statistical power.
However, many fundamental problems (e.g., bias, confounding and wrong sample selection) CANNOT be cured by large N. (more later)
Combining multiple incorrect stories can create more serious problems than reporting a single incorrect story. (more later in meta-analysis)
N>200,000 needed to detect a 20% reduction in mortality (Mann, Science 1990)
Means (and the t-test) can be very dangerous because, with large N, everything is significant.
-- Perhaps, for DNA and race, Watson should see the entire distribution or SD!
10
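A minimal simulation sketch of the last point, using an arbitrary 0.02-SD difference in means that no one would call clinically meaningful:

```python
# Minimal sketch (simulated data): with a very large N, even a tiny,
# practically meaningless difference in means comes out "statistically
# significant". The 0.02-SD difference is an arbitrary illustration.
import random
from statistics import mean, stdev
from math import sqrt, erf

random.seed(0)
n = 200_000
a = [random.gauss(0.00, 1.0) for _ in range(n)]
b = [random.gauss(0.02, 1.0) for _ in range(n)]   # trivial true difference

se = sqrt(stdev(a) ** 2 / n + stdev(b) ** 2 / n)
z = (mean(b) - mean(a)) / se
p = 1 - erf(abs(z) / sqrt(2))                      # two-sided normal p-value
print(f"z = {z:.2f}, p = {p:.1e}")                 # p is far below 0.05
```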
2. Lies, damned lies & statistics
There are three kinds of lies --B Disraeli & M Twain
--- Title speaks for itself
“J Robins makes statistics tell the truth: Numbers in
the service of health” (Harvard Gazette interview)
If numbers/statistics are properly generated and
used, they can be the best piece of empirical
evidence.
--- some empirical evidence is almost always good to
have
--- it is hard to fight with numbers (and age)!
11
Some Advice
No statistics is better than bad statistics.
Just present your data (e.g., N=3) when statistics are not necessary.
Descriptive statistics vs. inferential statistics
If you use the wrong statistics, you can end up in the news.
See ‘Statistical flaw trips up study of bad stats’.
Nature 2006
12
3. Association vs. Causation
The #1 error in health news: Association = Causation
In 1748, D. Hume stated ‘we may define a
cause to be an object followed by another…
where, if the first object had not been, the
second never had existed.’
---this is a true cause!
A more profound quote from Hume is
‘All arguments concerning existence are
founded on the relation of cause and effect.’
13
Misuses and abuses of “causes”
You may want to avoid the words 'cause', 'responsible', 'influence', 'impact' or 'effect' in your paper or press release (esp. the title) if results are obtained from observational studies. Instead, you may use 'association' or 'correlation'.
Often, 'may/might' is not enough.
The media misuse this and the public misunderstands it, severely.
--- Every morning, we hear that new causes of some disease have been found.
14
50% risk reduction, 20% risk reduction, and so on.
If you add them up, by now all causes of cancer (& many other diseases) should have been identified.
Almost all are association, not causation.
-- there is an exceedingly large number of associated and correlated factors, compared to true causes.
-- a survey found 246 suggested coronary risk factors. Hopkins & Williams (1981)
-- I believe cancer has >1,000 suggested risk factors.
Too many 'don't do's is no better than 'do anything'.
15
Why Association ≠ Causation?
Confounders, aka third variable(s)
The biggest threat to any observational study.
Definition of ‘confound’:
vt. Throw (things) into disorder; mix up;
confuse. (Oxford Dictionary)
However, confounders CANNOT be defined
in terms of statistical notions alone (Pearl)
16
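A minimal simulation sketch of confounding, using the grey hair vs. heart attack example from the next slide; the age dependence is made up purely for illustration:

```python
# Minimal sketch (simulated data): a confounder can create an association
# without any causal link. Here age drives both grey hair and heart attacks;
# grey hair and heart attacks then look "associated" even though neither
# causes the other. All numbers are invented for illustration.
import random

random.seed(1)
rows = []
for _ in range(10_000):
    age = random.uniform(30, 90)
    grey_hair = random.random() < (age - 30) / 60          # more likely with age
    heart_attack = random.random() < 0.002 * (age - 30)    # more likely with age
    rows.append((grey_hair, heart_attack))

def rate(hair_status):
    group = [ha for gh, ha in rows if gh == hair_status]
    return sum(group) / len(group)

print(f"heart-attack rate, grey hair:    {rate(True):.3f}")
print(f"heart-attack rate, no grey hair: {rate(False):.3f}")  # noticeably lower
```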
Confounder samplers
Grey Hair vs. heart attack
Stork vs. birth rate
Rock & Roll vs. HIV
Eating late & weight gain?
Drinking (or match-carrying) & lung cancer
No father’s name & infant mortality
Long leg & skin cancer
Vitamins/HRT, too?
Any remedy?
-- The first thing to do is use common sense: think about any other (hidden) factor or alternative explanation.
17
Common sense & serendipity
Common sense is the basis for most of the
ideas for designing scientific
investigations. --- M Davidian
although we should not ignore the
importance of serendipity in science
18
By the way, why are 'causes' so important?
If causes can be removed, susceptibility ceases to
matter (Rose 1985) and the outcome will not occur.
Neither associated nor correlated factors have this
power.
Gladly, some efforts have been made:
‘Distinguishing Association from Causation:
A Backgrounder for Journalists’ from
American Council on Science and Health
19
Greenland’s Dictum (Science 1995)
There is nothing sinful about going out and getting
evidence, like asking people how much do you
drink and checking breast cancer records.
There’s nothing sinful about seeing if that evidence
correlates.
There’s nothing sinful about checking for
confounding variables.
The sin comes in believing a causal hypothesis is
true because your study came up with a positive
result, or believing the opposite because your
study was negative.
20
Association to causation?
In 1965, Hill proposed a set of the following causal criteria:
1. Strength
2. Consistency
3. Specificity
4. Temporality (i.e., cause before effect)
5. Biological gradient
6. Plausibility
7. Coherence
8. Experiment
9. Analogy
However, Hill also said "None of my nine viewpoints can bring indisputable evidence for or against the cause-and-effect hypothesis and none can be required as a sine qua non."
21
Another big problem:
bias and faulty design/samples
Selection bias: the distortion of a statistical analysis, due to
the method of collecting samples.
The easiest way to cheat (intentionally or unintentionally):
-- Make group 1 vs. group 2 = healthy people vs. sick people.
-- Oftentimes, treatment looks bad in observational studies. Why?
-- Do a survey among your friends only.
-- People are different from the beginning?? (e.g., vegetarians vs. meat-lovers, HRT users vs. non-users)
Case-control study & matching: easy to say but hard to do
correctly.
-- Vitamin C and cancer
For any comparison: FAIRNESS is most important!
-- Numerous other biases exist
22
Would you believe these p-values?
(Cameron and Pauling, 1976)
This famous study failed to replicate 16 or so times! Pauling received two Nobel Prizes.
23
4. Experiment vs.
Observational study
Although the arguing from experiments and
observations by induction be no demonstration of
general conclusions, yet it is the best way of arguing
which the nature of things admits of. --- I Newton
Newton’s "experimental philosophy" of science:
Science should not, as Descartes argued, be based
on fundamental principles discovered by reason,
but based on fundamental axioms shown to be true
by experiments.
24
Why are clinical trials important?
The Randomized Controlled Trial (RCT) is the most common form of experiment on humans.
'Average causal effects' can be estimated from an experiment.
-- To know the true effect of treatment within a person, one would have to be treated and untreated at the same time.
Experimentation trumps observation. (the power of the coin flip! Confounders disappear.)
Very difficult to cheat in RCTs (due to
randomization and protocol).
“Causality: God knows but humans need a time
machine. When God is busy and no time machine
is available, a RCT would do.”
25
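A minimal simulation sketch of the "power of the coin flip": randomization balances a confounder (here, age) without ever measuring it. The numbers are arbitrary:

```python
# Minimal sketch (simulated data): randomization tends to balance confounders,
# known and unknown, between arms, so a difference in outcomes can be read as
# an (average) causal effect of treatment.
import random
from statistics import mean

random.seed(2)
ages, treated = [], []
for _ in range(10_000):
    ages.append(random.uniform(30, 90))      # a potential confounder
    treated.append(random.random() < 0.5)    # coin flip: ignores age entirely

print(f"mean age, treated:   {mean(a for a, t in zip(ages, treated) if t):.1f}")
print(f"mean age, untreated: {mean(a for a, t in zip(ages, treated) if not t):.1f}")
# The two means are nearly identical. In an observational study, treatment
# choice could easily depend on age (or on something never measured at all).
```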
Problems/issues of RCTs
Restrictive settings
Human subjects under experiment
Can be unethical or infeasible
Short-term only
1-2 treatments, 1-2 doses only
Limited generalizability
Other issues: blinding, drop-out, compliance
26
Problems/issues of observational
studies
Bias & confounding
Post-hoc arguments about biological plausibility
must be viewed with some skepticism since the
human imagination seems capable of developing a
rationale for most findings, however unanticipated
(Ware 2003).
i.e., retrospective rationalization.
We are tempted to ‘Pick & Choose’!
Data-dredging, fishing expeditions, significance-chasing (p<0.05)
Observational studies can overcome some limitations of RCTs.
27
Ideal attitudes
RCTs and observational studies should complement each other, rather than compete.
-- because real-life stories can be complicated.
When RCTs and observational studies conflict, generally (not always) go with RCTs.
Even if you conduct an observational study, try to think in an RCT way. (e.g., 1-2 a priori hypotheses, a protocol, a data analysis plan; ask yourself 'Is this result likely to replicate in an RCT?')
28
Quotes for observational studies
The data are still no more than observational, no
matter how sophisticated the analytic
methodology – anonymous reviewer
Observational studies are not a substitute for
clinical trials no matter how sophisticated the
statistical adjustment may seem – D. Freedman
No fancy statistical analysis is better than the
quality of the data. Garbage in, garbage out, as
they say. So whether the data is good enough to
need this level of improvement, only time will tell.
– J. Robins
Remark: However, advanced statistical techniques, such as causal inference methods, may help.
29
Some studies are difficult
Diet/alcohol: type/amount; how to measure?
Do you remember what you ate last week?
Exercise/physical activity/SES: Can we
measure? Do you tell the truth?
-- people tend to say ‘yes’, ‘moderately’
Long term cumulative effects
Positive thinking and spirituality?
Quality and value of life: How to define and
measure
-- priceless?
30
5. Replicate or perish
Publish or perish: Old era
vs. Replicate or perish: New era
Replicability of the scientific findings can never be
overemphasized. Results being ‘significant’ or
‘predictive’ without being replicable misinform the
public and needlessly expend time and resources,
and they are no service to investigators and
science –S. Young
Given that we currently have too many findings,
often with low credibility, replication and rigorous
evaluation become as important as or even more
important than discovery - J. Ioannidis (2006)
-- Pay more attention to 2nd study!
31
Examples of highly cited heart-disease
studies that were later contradicted
(Ioannidis 2005)
-- The Nurses Health Study, showing a 44% relative risk
reduction in coronary disease in women receiving
hormone therapy. Later refuted by Women's Health
Initiative, which found that hormone treatment
significantly increases the risk of coronary events.
-- Two large cohort studies, the Health Professionals Follow-Up Study and the Nurses Health Study, and an RCT all found that vitamin E was associated with a significantly reduced risk of coronary artery disease. But larger randomized trials subsequently showed no benefit of vitamin E on coronary disease.
32
More Ioannidis
Ioannidis (2005) serves as a reminder of the
perils of small trials, nonrandomized trials, and
those using surrogate markers.
He concludes "Evidence from recent trials, no
matter how impressive, should be interpreted
with caution when only one trial is available. It is
important to know whether other similar or larger
trials are still ongoing or being planned.
Therefore, transparent and thorough trial
registration is of paramount importance to limit
premature claims [of] efficacy."
33
More Freedman
Modeling, the search for significance, the
preference for novelty, and lack of interest
in assumptions --- these norms are likely
to generate a flood of nonreproducible
results
34
6. Hierarchy of evidence and study design
35
What goes on top?
The ANSWER is total evidence.
An RCT can provide strong evidence for a causal effect, especially if its findings are replicated by other studies.
36
When you read an article, you may check the study design
Cross-sectional study: which comes first? What is the cause and what is the effect?
e.g., depression vs. obesity
Prospective cohort studies: much better, but still not causal
Prospective is generally better than
retrospective
RCT is better than non-RCT
37
7. Meta-analysis
Statistical technique for systematic literature review
There are 3 things you should not watch being made:
law, sausage & meta-analysis
No data collection, but nothing is free.
Can you find all studies in the universe, including the ones in researchers' file drawers? Or at least an unbiased subsample? Can Google or PubMed do it? NO!
Publication bias (favoring positive studies), language bias, etc.
A much bigger problem in observational studies than in RCTs.
Combining multiple incorrect stories is worse than one incorrect story.
38
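For the mechanics only, a minimal sketch of a fixed-effect (inverse-variance) meta-analysis with made-up study estimates; note that none of this arithmetic fixes publication bias:

```python
# Minimal sketch (hypothetical estimates): a fixed-effect, inverse-variance
# meta-analysis. Each study contributes its effect estimate weighted by the
# inverse of its variance. The mechanics are simple; the hard part is whether
# the studies you can find are an unbiased sample of all studies ever done.
log_ors = [0.30, 0.10, 0.45, -0.05]      # log odds ratios from four invented studies
variances = [0.04, 0.02, 0.09, 0.03]     # their sampling variances

weights = [1 / v for v in variances]
pooled = sum(w * e for w, e in zip(weights, log_ors)) / sum(weights)
pooled_se = (1 / sum(weights)) ** 0.5
print(f"pooled log OR = {pooled:.3f}  (SE = {pooled_se:.3f})")
```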
Funny (real) titles of papers about
meta-analysis
Meta-analysis: apples and oranges, or fruitless
Apples and oranges (and pears, oh my!): the
search for moderators in meta-analysis
Of apples and oranges, file drawers and garbage:
why validity issues in meta-analysis will not go
away
Meta analysis/shmeta-analysis
Meta-analysis of clinical trials: a consumer's
guide.
Publication bias in situ
39
8. Multiple testing
Multiple testing/comparisons refers to the testing
of more than one hypothesis at a time.
When many hypotheses are tested, and each test
has a specified Type I error probability (α), the
probability that at least 1 Type I error is committed
increases with the number of hypotheses.
Bonferroni method: test each hypothesis at α = 0.05/(# of tests).
A thorny issue for many researchers.
-- Bonferroni might be the most hated statistician in history.
-- 'Escaping the Bonferroni iron claw in ecological studies' by García et al. (2004)
40
Two errors
Type I (false positive: rejecting H0 when it is true)
vs. Type II (false negative: accepting H0 when it is
false)
-- Controlling Type I is more important in statistics and in court. (e.g., innocent → guilty: disaster!)
-- In other fields, Type II can be more important.
α=p=0.05 – is this the law in science? Do you commit only 5% errors in your life?
α=5% seems reasonable for one research question/publication.
41
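A minimal simulation sketch of the two error types at α = 0.05; the effect size (0.3 SD) and sample size (50 per group) are arbitrary illustrations:

```python
# Minimal sketch (simulated data): the two error types. Testing at alpha = 0.05,
# we estimate the Type I error rate when the null is true (no real difference)
# and the Type II error rate for one particular small true effect and sample size.
import random
from statistics import mean, stdev
from math import sqrt, erf

def two_sided_p(a, b):
    se = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    z = (mean(b) - mean(a)) / se
    return 1 - erf(abs(z) / sqrt(2))   # normal approximation

random.seed(3)
def rejection_rate(true_diff, reps=2000, n=50):
    rejections = 0
    for _ in range(reps):
        a = [random.gauss(0, 1) for _ in range(n)]
        b = [random.gauss(true_diff, 1) for _ in range(n)]
        rejections += two_sided_p(a, b) < 0.05
    return rejections / reps

print(f"Type I error rate (true diff = 0):    {rejection_rate(0.0):.3f}")       # ~0.05
print(f"Type II error rate (true diff = 0.3): {1 - rejection_rate(0.3):.3f}")   # roughly 0.7
```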
42
Multiple testing in different forms
Subgroup analyses
-- You should always do subgroup analyses but
never believe them. – R. Peto
-- Multiple testing adjustment and cross-validation
may be solutions.
Trying different cutpoints (e.g., tertiles, quintiles,
etc.)
-- A priori chosen cutpoints or multiple testing
adjustment can be solutions.
Nothing is free. To look more, you have to pay.
43
Multiple testing
(underlying mechanism)
Lottery tickets should not be free. In random and
independent events as the lottery, the probability of
having a winning number depends on the N of tickets
you have purchased. When one evaluates the
outcome of a scientific work, attention must be given
not only to the potential interest of the ‘significant’
outcomes but also to the N of ‘lottery tickets’ the
authors have ‘bought’. Those having many have a
much higher chance of ‘winning a lottery prize’ than of
getting a meaningful scientific result. It would be unfair
not to distinguish between significant results of well-planned, powerful, sharply focused studies, and those from 'fishing expeditions', with a much higher probability of catching an old truck tyre than of a really big fish. --- García et al. (2004)
44
Multiple testing disaster I
In the 1970s, every disease was reported to
be associated with an HLA allele
(schizophrenia, hypertension.... you name
it!). Researchers did case-control studies with 40 antigens, so there was a very high probability that at least one result would be significant. This result was then reported without any mention of the fact that it was the most significant of 40 tests --- R. Elston
45
Multiple testing disaster II
Association between reserpine (then a popular antihypertensive) and breast cancer. Shapiro (2004) gave the history. His team published initial results that were extensively covered by the media, with a huge impact on the research community. When the results did not replicate, he confessed that the initial findings were chance findings, due to thousands of comparisons involving hundreds of outcomes and hundreds of exposures. He hopes that we have learned from his mistake for the future.
46
Multiple testing disaster III
You are what your mother eats (Mathews et al.
2008).
All over the news and the internet: over 50,000 Google hits in the first week.
Numerous comparisons were conducted.
Sodium, calcium, potassium, etc. were significant (p<0.05), but sodium was dismissed with the claim that it is hard to measure accurately.
--possible ‘pick and choose’!
Other problems: lack of biological credibility,
difficulty in dietary data.
47
Leaving no trace (Shaffer 2007)
Usually these attempts through which the
experimenter passed don't leave any
traces; the public will only know the result
that has been found worth pointing out;
and as a consequence, someone
unfamiliar with the attempts which have
led to this result completely lacks a clear
rule for deciding whether the result can or
can not be attributed to chance.
48
If you keep testing without
controlling α
Everything is Dangerous – S. Young
It is fairly easy to find risk factors for premature
morbidity or mortality. Indeed, given a large
enough study and enough measured factors and
outcomes, almost any potentially interesting
variable will be linked to some health outcome –
Christenfeld et al. 2004.
Even checking 1,000 correlations can be a sin – S. Young
The only thing to fear is fear itself…………………….
…..………………………………and everything else
49
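A minimal simulation sketch of the "fishing expedition": 1,000 pure-noise "risk factors" tested against a pure-noise outcome, using a rough normal approximation for the correlation test:

```python
# Minimal sketch (pure noise): test 1000 random "risk factors" against a random
# outcome. With no real effects at all, roughly 5% come out "significant" at
# p < 0.05 - about 50 publishable-looking findings from nothing.
import random
from statistics import mean, stdev
from math import sqrt, erf

random.seed(4)
n, n_factors = 200, 1000
outcome = [random.gauss(0, 1) for _ in range(n)]

def corr_p(x, y):
    mx, my = mean(x), mean(y)
    r = sum((a - mx) * (b - my) for a, b in zip(x, y)) / ((len(x) - 1) * stdev(x) * stdev(y))
    z = r * sqrt(len(x) - 3)            # rough normal approximation
    return 1 - erf(abs(z) / sqrt(2))

significant = 0
for _ in range(n_factors):
    factor = [random.gauss(0, 1) for _ in range(n)]
    significant += corr_p(factor, outcome) < 0.05
print(f"'significant' factors out of {n_factors}: {significant}")  # around 50
```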
Multiple testing adjustment
In RCTs: mandatory (by FDA)
-- If not, more (interim) looks would eventually give you what you want.
In genetic/genomic studies: almost mandatory
-- think about the # of genes!
In observational studies: almost infeasible
Realistic strategies can be:
1. α=5% for one hypothesis. Adjust for multiple testing, or state clearly how many tests/comparisons you conducted.
2. Think and act in RCT ways.
50
Replication, again
is the universal solution for multiplicity and subgroup analyses (Vandenbroucke 2008)
In genome-wide analyses, it is a prerequisite for publication (Khoury et al. 2007)
-- However, replication is often left to someone else! The strategy of splitting the data into two parts, one for testing and one for verification, can be considered.
51
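A minimal sketch of the split-sample (testing/verification) idea with pure-noise data; the 0.14 correlation cutoff is just an approximate 5% threshold for this sample size, not a recommendation:

```python
# Minimal sketch (simulated, pure-noise data): split-sample "replication".
# Candidate factors are screened on a discovery half, and only those that also
# hold in the held-out verification half are kept. A crude internal check,
# not a substitute for an independent study.
import random

random.seed(5)
n, n_factors = 400, 100
records = [[random.gauss(0, 1) for _ in range(n_factors)] for _ in range(n)]
outcome = [random.gauss(0, 1) for _ in range(n)]

half = n // 2
disc_x, ver_x = records[:half], records[half:]
disc_y, ver_y = outcome[:half], outcome[half:]

def abs_corr(col, xs, ys):
    m = len(xs)
    mx = sum(row[col] for row in xs) / m
    my = sum(ys) / m
    sxy = sum((row[col] - mx) * (y - my) for row, y in zip(xs, ys))
    sxx = sum((row[col] - mx) ** 2 for row in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return abs(sxy / (sxx * syy) ** 0.5)

cutoff = 0.14  # roughly the two-sided 5% threshold for n = 200
discovered = [c for c in range(n_factors) if abs_corr(c, disc_x, disc_y) > cutoff]
replicated = [c for c in discovered if abs_corr(c, ver_x, ver_y) > cutoff]
print(f"'discovered' factors: {len(discovered)}, surviving verification: {len(replicated)}")
```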
9. Same words, different meanings?
Professionals vs. lay terms (or even among scientific disciplines): not always the same
e.g., risk, hazard, odds, likelihood, rate,
prevalence, incidence, valid, unbiased,
consistent, cost-effective (≠cheap),
efficient, SD vs. SE
People on the street may not distinguish
RCT from observational study.
As easy and intuitive as possible but should
be correct
52
10. Data sharing
Although … is tax-supported, its data are not
available to us. ….Policies governing data
dissemination need to be reconsidered, although
due regard must be paid to patient confidentiality.
Only by thorough scrutiny can error be avoided.
Transparency is the best assurance of scientific
quality. (Freedman & Pettiti 2005)
Open the data to the public (after sufficient de-identification).
“Alas, we don’t have the process” –D. Kenney
responding to ‘File-drawer problem, revisited’ by
Young & Bang (2004)
53
So, who is responsible?
Authors/scientists: lack of integrity; pressure for grants, publications, CV; wanting to be famous or on the news; publishing for the sake of publishing
Editors: too many journals; reviewers are busy; 2-week review times; shortage of statistical reviewers
Media: do not think critically; aim to surprise or shock people
Lay persons: like more shocking news; may not use common sense
We are all responsible for all ---Dostoevsky
(Rose’s Epi book)
54
Try to remember as scientists
If false positive and false negative results continue to be produced with disturbing frequency, it may well be true that 'we are fast becoming a nuisance to society…. people don't take us seriously anymore, and when they do… we may unintentionally do more harm than good' --- Trichopoulos.
Remember: it is extremely difficult to un-shock the shocked.
Any researcher who uses observational studies may want to keep one question in mind when doing research: 'Is this result likely to be reproducible in an RCT (if one ever happens)?'
Transparency!!!! Ultimately, data to be shared.
55
Try to remember as readers
RCT vs. Obs studies (remember hierarchy)
RCT is the best available but not perfect
Use your common sense, but do not forget serendipity (Start from the null. Ask 'Why', rather than 'Why not'?)
Check N (denom & num, large N does not fix bias)
Think about Third variables (i.e., confounders)
Be careful about meta-analysis
Do not worship p-value (<0.05)
Perhaps, by Chance? Multiple testing. How many
questions/analyses? Anything hidden?
56
Today’s Quotes
In God we trust; all others must bring data (protocol and SAS output). --- W. Deming (& K. Griffin)
All models are wrong, but some are useful. --- G. Box
Do whatever you want. But you should be responsible for what you do.
57
Useful reading (research articles)
Bang, H. (2009) Introduction to Observational Studies.
Young, SS. and Bang, H. (2004) The file-drawer problem,
revisited. Science.
Taubes, G. (1995) Epidemiology faces its limits. Science.
Freedman, DA. (2008) Oasis or Mirage. Chance.
Shapiro, S. (2004) Looking to the 21st century: have we
learned from our mistakes, or are we doomed to compound
them?
Ioannidis, JPA. (2006) Evolution and translation of research
findings: From bench to where? PLOS.
Breslow, NE. (2003) Are statistical contributions to medicine
undervalued? Biometrics.
Austin, PC. (2006) Testing multiple statistical hypotheses
resulted in spurious associations: a study of astrological
signs and health.
Begg, C. (2001) The search for cancer risk factors: when can we stop looking?
58
Useful reading (newspaper articles)
Do we really know what makes us healthy? --- NYT
2007
Scientists do the numbers: Coffee is good for you -- no, it's bad. Epidemiological studies can come up with some crazy results, causing some critics to wonder if they're really worthwhile. --- LAT 2007
Women's Health Studies leave questions in place of
certainty --- NYT 2006
Why so much medical research is rot --- The
Economist 2007.
59
Do we really know
what makes us healthy?
‘We know exactly why certain people
commit suicide. We don’t know, within the
ordinary concepts of causality, why certain
others don't commit suicide. …. We know
a great deal more about the causes of
physical disease than we do about the
causes of physical health.’ --- Scott Peck,
MD, in the book ‘The Road Less
Travelled’.
60
Founded: 1804
Population: 3700
Altitude: 432
Total: 5936
Mean: 1978.66
SD: 1640.98
Modified from "Entering Hillsville", by Dana Fradon, 1977, The New Yorker
61