Numbers in health news and research


Transcript: Numbers in health news and research

Using and understanding numbers in health news and research
Heejung Bang
1
A rationale for today’s talk
 Coffee is bad yesterday, but good today and bad again tomorrow.
 “It's the cure of the week or the killer of the week, the danger of the week,” says B. Kramer.
 “I've seen so many contradictory studies with coffee that I've come to ignore them all,” says D. Berry.
 What to believe? For a while, you may just keep drinking coffee.
2
Hardly a day goes by without a new
headline about the supposed health risks
or benefits of something…
Are these headlines justified?
Often, the answer is NO.
3
 Vitamin D and calcium supplements: take them or leave them? How to follow the changing recommendations without making yourself dizzy. (Harv Womens Health Watch 2012)
4
R. Peto phrases the nature of the conflict
this way: “Epidemiology is so beautiful and
provides such an important perspective on
human life and death, but an incredible
amount of rubbish is published,”
- the results of observational studies that
appear daily in the news media and often
become the basis of public-health
recommendations about what we should
or should not do.
Personal: This is not just Epi’s problem. We just receive more attention + higher impact...
5
 Are statistical contributions to medicine undervalued? … Breslow. Biometrics 2003
 Is statistical method of any value in medical research? … Greenwood. Lancet 1924
 Publication as prostitution … Frey. 2003
6
Major reasons for coffee-like situations
 Confounding
 Multiple testing
 Faulty design/sample selection
 Incorrect analysis
 Exaggeration (in reporting)
7
Topics to be covered
1. Numbers in press release (& p-value)
2. Lies, Damn Lies & Statistics
3. Association vs. Causation
4. Experiment (e.g., RCT) vs. Observational study
5. Replicate or Perish
6. Hierarchy of evidence and study design
7. Meta-analysis
8. Multiple testing (and Multiple modeling)
9. Incorrect analysis
10. Same words, different meanings?
11. Probability blind, option blind, Innumeracy
12. Data sharing
13. Take-home messages -- So, who are responsible?
8
1. Numbers in press release
 No p-value, no odds or hazards ratio in press release!
-- Ask people on the street “What is p-value?”
-- Only we may laugh if I make a statistical joke using 0.05, 1.96, 95%, and 80% power.
9
 “What is the most important letter in medicine?”
---- Some say the answer may be “P”.
Remark: Three most influential numbers in biomedicine
may be (smaller) P-value, (smaller) N, and (larger)
Impact factor.
10
Sample, real titles
 “The earth is round (P < .05)” by J. Cohen. Am Psychol 1994.
 “A dirty dozen: twelve p-value misconceptions” by S. Goodman. Semin Hematol 2008.
 “What is a p-value anyway? 34 stories to help you actually understand statistics” by A.J. Vickers. 2010.
11
What is P-value?
Almighty P-value ---- Peck (1971)
12
 P-value is the probability of obtaining a test statistic at least as extreme as the statistic observed, under the null hypothesis (and every other assumption made).
-- If there is no hypothesis, there is no test and no p-value.
 In current statistical training and practice, statistical testing/p-values are overly emphasized.
 However, the p-value (1 number, 0-1) can be useful for decision making.
-- you cannot say “it depends” all the time, although it is true.
-- do we ever know a “clinical significance” measure?
13
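The “at least as extreme” definition above can be made concrete with a tiny worked example (a sketch; the 60-heads-in-100-tosses data are hypothetical):

```python
from math import comb

def binom_two_sided_p(k, n, p0=0.5):
    """Two-sided p-value for observing k successes in n trials under the
    null H0: success probability = p0 (here, a fair coin).
    'At least as extreme' = all outcomes no more probable than k itself."""
    probs = [comb(n, i) * p0**i * (1 - p0)**(n - i) for i in range(n + 1)]
    p_k = probs[k]
    return sum(p for p in probs if p <= p_k)

# Hypothetical data: 60 heads in 100 tosses of a possibly biased coin.
p = binom_two_sided_p(60, 100)
print(round(p, 4))  # ~0.057: just misses the conventional 0.05 cutoff
```

Note that the p-value only exists relative to a stated null hypothesis, exactly as the slide says: with no hypothesis, the `p0` argument (and hence the number) has no meaning.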
Goodman (2008)
14
Clinical Trials as a Religion
(Rimm and Bortin 1978)
17
Numerator & denominator
 Always try to check the numerator and denominator (and when, how long) and absolute vs. relative.
 Try to read the footnote under *
-- 100% increase can be 1 → 2 cases
-- 20% event rate can be 1 out of 5 samples
23
Large Number myths
With large N, one will more likely declare a difference when a difference truly exists – the notion of statistical power.
 However, many fundamental problems (e.g., bias, confounding and wrong sample selection) CANNOT be cured by large N. (more later)
 Combining multiple incorrect stories can create more serious problems than reporting a single incorrect story. (more later in meta)
 N>200,000 needed to detect a 20% reduction in mortality (Mann, Science 1990)
 t-test can be very dangerous b/c with large N, every difference is significant
-- Perhaps, for DNA and race, Watson should see the entire distribution or SD!
-- Disparity can’t be zero --- J. Kaufman.
24
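Two claims on this slide can be checked with back-of-the-envelope arithmetic (a sketch: the 0.005-SD difference and the 0.5% baseline mortality are assumed illustrative values, not figures taken from Mann's paper):

```python
from math import sqrt, erfc

def z_test_p(diff_sd, n_per_group):
    """Two-sided p-value for a two-sample z-test of a mean difference
    expressed in SD units, with n subjects per group."""
    z = diff_sd / sqrt(2.0 / n_per_group)
    return erfc(z / sqrt(2.0))  # P(|Z| > z) for a standard normal

# A difference of 0.005 SD -- clinically trivial -- becomes "significant"
# once N is huge, and is invisible with a modest N.
print(z_test_p(0.005, 1_000_000))  # p < 0.001
print(z_test_p(0.005, 1_000))      # p ~ 0.9

def n_per_arm(p1, p2, z_alpha=1.959964, z_power=1.281552):
    """Approximate n per arm to detect event rates p1 vs p2
    (two-sided alpha=0.05, 90% power), normal-approximation formula."""
    return (z_alpha + z_power) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2

# Assumed baseline mortality of 0.5% with a 20% relative reduction:
print(round(2 * n_per_arm(0.005, 0.004)))  # total N on the order of 190,000
```

The second calculation shows why a mortality trial chasing a 20% relative reduction in a rare outcome needs a total N in the low hundreds of thousands, consistent with the Mann figure quoted above.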
 "The 2nd most important letter in medical research may be N."
 Some say “Statisticians tend to disagree on many things but all seem to agree ‘N: larger is better.’”
 At times, N is not N.
- asking 10 people “Will you vote for Obama?” vs. asking 1 person the same question 10 times.
 303 studies can suffer the same problems (e.g., same biases, same method, virtually same sample).
 What will you do if 1000 genes are significant?
25
2. Lies, damned lies & statistics
 There are three kinds of lies -- B. Disraeli & M. Twain
--- Title speaks for itself
 “J. Robins makes statistics tell the truth: Numbers in the service of health” (Harvard Gazette interview)
 If numbers/statistics are properly generated and used, they can be the best piece of empirical evidence.
--- some empirical evidence is almost always good to have
--- it is hard to fight with numbers (and age)!
--- score vs. detailed reviews
--- if Disraeli were alive, he might amend his line to “lies, damned lies, revealed by statistics.” (Newsweek 2008)
26
Some Advice
 No statistics is better than bad statistics.
 Just present your data (e.g., N=3) when statistics are not necessary.
 Descriptive statistics vs. inferential statistics
27
3. Association vs. Causation
 #1
error in health news, Association=Causation
 ‘We may define a cause to be an object
followed by another… where, if the first object
had not been, the second never had existed.’
(Hume, 1748)
---this is a true cause!
‘All arguments concerning existence are
founded on the relation of cause and effect.’ -Hume.
28
Misuses and abuses of “causes”
 You may avoid the words ‘cause’, ‘responsible’, ‘influence’, ‘impact’ or ‘effect’ in your paper or press release (esp., title), if results are obtained from observational studies. Instead you may use ‘association’ or ‘correlation’.
 Often, “may/might” is not enough.
 The media misuses and the public misunderstands, severely
--- Every morning, we hear new causes of some disease are found.
29
 50% risk reduction, 20% risk reduction, and so on. If you add up, by now all causes of cancer (& many other diseases) should have been identified.
 Almost all are associations, not causation.
-- there are an exceedingly large number of associated and correlated factors, compared to true causes.
-- a survey of 246 suggested coronary risk factors. Hopkins & Williams (1981)
-- I believe cancer has >1000 risk factors.
 ‘Too many don’t dos’ is no better than ‘do anything’.
30
Why Association ≠ Causation?
Confounders
 aka, third variable(s)
 Biggest threat to any observational study
 Definition of ‘confound’: vt. Throw (things) into disorder; mix up; confuse. (Oxford Dictionary)
 Yet, confounders can’t be defined in terms of statistical notions alone (Pearl)
31
Pearl (1998)
Why There Is No Statistical Test For
Confounding, Why Many Think There Is,
And Why They Are Almost Right.
32
“Potential” confounder samplers
 Grey hair vs. heart attack
 Stork vs. birth rate
 Rock & Roll vs. HIV
 Eating late & weight gain?
 Drinking (or match-carrying) & lung cancer
 No father’s name & infant mortality
 Long leg & skin cancer
 Vitamins/HRT, too?
 Autism & TV watching
 Autism & rainfall
Any remedy?
-- first thing to do: use common sense. Think about any
other (hidden) factor or alternative explanation.
33
Common sense & serendipity
Common sense is the basis for most of the
ideas for designing scientific investigations.
--- Davidian
On the other hand, we should not ignore the
importance of serendipity in science
e.g., discovery of aspirin continues to be
an accidental sequence of events related
only by serendipity (Fürstenwerth 2011)
Serendipity is a dirty word…. Bailar (1976)
34
By the way,
why are ‘causes’ so important?
 If causes can be removed, susceptibility ceases to
matter (Rose 1985) and the outcome will not occur.
Neither associated nor correlated factors have this
power.
Remark: For treatment/intervention, we should know
causes. But for screening, causes are not
necessary. Correlates can be useful.
35
Greenland’s Dictum (Science 1995)
There is nothing sinful about going out and getting
evidence, like asking people how much do you
drink and checking breast cancer records.
There’s nothing sinful about seeing if that evidence
correlates.
There’s nothing sinful about checking for
confounding variables.
The sin comes in believing a causal hypothesis is
true because your study came up with a positive
result, or believing the opposite because your
study was negative.
36
Association to causation?
In 1965, Hill proposed the following set of causal viewpoints:
1. Strength
2. Consistency
3. Specificity
4. Temporality (i.e., cause before effect)
5. Biological gradient
6. Plausibility
7. Coherence
8. Experiment
9. Analogy
However, Hill also said “None of my nine viewpoints can bring indisputable evidence for or against the cause-and-effect hypothesis and none can be required as a sine qua non”.
37
Another big problem:
bias and faulty design/samples
 Selection bias: the distortion of a statistical analysis due to the method of collecting samples.
 An easy way to cheat (intentionally or unintentionally)
-- Make group 1 : group 2 = healthy people : sick people.
-- Oftentimes, treatment looks bad in observational studies. Why?
-- Do a survey among your friends only
-- People are different from the beginning?? (e.g., vegetarians vs. meat-lovers, HRT users vs. non-users)
 Case-control study & matching: easy to say but hard to do correctly.
-- Vitamin C and cancer
 For any comparison: FAIRNESS is most important!
-- Numerous other biases exist
38
Would you believe these p-values?
(Cameron and Pauling, 1976)
This famous study failed to replicate 16 or so times! Pauling received two Nobel Prizes.
39
A catalogue of biases
 Yes-saying bias
 Social desirability bias
 Forced choice bias
 Omitted variables bias
 Data dredging bias
 Hot topic bias
 All’s well literature bias
 etc, etc, etc.
40
True random sample?
 The pitfalls of non-random or defective random samples and lack of control groups are well documented in elementary texts (Aliaga and Gunderson 1999). Of course, true random samples are painful, if not impossible, to take. If the sample only contains swans from Europe, you are at risk in drawing conclusions about Australian swans. That is buyer beware. (Taleb 2001; Lund 2007)
 Better or worse convenience sample.
41
4. Experiment vs.
Observational study
 Although the arguing from experiments and observations by induction be no demonstration of general conclusions, yet it is the best way of arguing which the nature of things admits of. --- I. Newton
 Newton’s "experimental philosophy" of science: Science should not, as Descartes argued, be based on fundamental principles discovered by reason, but on fundamental axioms shown to be true by experiments.
42
Why are clinical trials important?
 The Randomized Controlled Trial (RCT) is the most common form of experiment on humans.
 ‘Average causal effects’ can be estimated from an experiment.
-- To know an ‘individual causal effect’, one should be treated and untreated at the same time.
 Experimentation trumps observation (the power of the coin-flip!) and Design trumps analysis.
 Difficult to cheat in RCTs (due to randomization, protocol and prior registration)…….
 “Causality: God knows, but humans need a time machine. When God is busy and no time machine is available, an RCT would do.”
43
Nearly causal effect (N=2)
44
Problems/issues of RCTs
 Restrictive and unnatural settings
 Human subjects under experiments
 Can be unethical or infeasible
 Short terms
 1-2 treatments, 1-2 doses only
 Limited generalizability
 Other issues: blinding, drop-out, compliance
 RCT may do only 1 thing well…
45
Problems/issues of obs studies
 Bias & confounding
 Post-hoc arguments about biological plausibility must be viewed with some skepticism, since the human imagination seems capable of developing a rationale for most findings, however unanticipated (Ware 2003). i.e., retrospective rationalization.
 We are tempted to ‘Pick & Choose’!
 Data-dredging, fishing expeditions, significance-chasing (p<0.05)
 Observational studies can overcome some limitations of RCTs. Important in the CER era.
46
Ideal attitudes

RCTs and observational studies should be
complementary each other, rather than competing.
--because real life stories can be complicated.
 When RCTs and observational studies conflict,
generally (not always) go with RCTs.
 Even if you conduct a observational study, try to
think in a RCT way. (e.g., a priori 1-2 hypothesis,
protocol, data analysis plan, assume someone is
watching you and your analysis can be audited, ask
yourself ‘Is this result likely to replicate in RCT?’)
47
Bad but helpful press?
 The data are still no more than observational, no matter how sophisticated the analytic methodology – anonymous reviewer
 Observational studies are not a substitute for clinical trials no matter how sophisticated the statistical adjustment may seem – D. Freedman
 No fancy statistical analysis is better than the quality of the data. Garbage in, garbage out, as they say. So whether the data is good enough to need this level of improvement, only time will tell. – J. Robins
48
Any statistical remedies?
 There are advanced statistical techniques (causal inference) that may help ………….. (or hurt), because powerful tools can be dangerous if not handled with care.
49
Most ideal research direction?
 Observation -> Experiment -> Observation (…. possible repeats)
 Before an RCT, we need convincing observational and/or experimental evidence.
 Total evidence: Basic science (e.g., laboratory or animal setting), Observation and Experiment show similar results
50
Common sense before evidence?
Parachute use to prevent death and major trauma
related to gravitational challenges: systematic
review of randomised controlled trials (Smith
and Pell, BMJ 2003)
Authors stated “our search strategy did not find
any RCT of this parachute.”
What this study adds: Individuals who insist that all interventions need to be validated by an RCT need to come down to earth with a bump.
51
Some studies are difficult
 Diet/alcohol: type/amount; how to measure? Do you remember what you ate last week?
 Exercise/physical activity/SES: Can we measure? Do you tell the truth?
-- people tend to say ‘yes’, ‘moderately’
 Long-term cumulative effects (e.g., diet, skincare, 2nd-hand smoking, calcium, multivitamins, organic food?)
 Positive thinking and spirituality?
 Quality and value of life: How to define and measure?
-- priceless?
52
5. Replicate or perish
 Publish or perish: Old era
vs. Replicate/Validate or perish: New era
 Replication of scientific findings can never be overemphasized. Results being ‘significant’ or ‘predictive’ without being replicated misinform the public and needlessly expend time and resources, and they are no service to investigators and science – S. Young
 Given that we currently have too many findings, often with low credibility, replication and rigorous evaluation become as important as or even more important than discovery – J. Ioannidis (2006)
-- Pay more attention to the 2nd or 3rd study!
53
Examples of highly cited heart-disease
studies that were later contradicted
(Ioannidis 2005)
-- The Nurses Health Study, showing a 44% relative risk
reduction in coronary disease in women receiving
hormone therapy. Later refuted by Women's Health
Initiative, which found that hormone treatment
significantly increases the risk of coronary events.
-- Two large cohort studies, the Health Professionals
Follow-Up Study and the Nurses Health Study, and a
RCT all found that vitamin E was associated with a
significantly reduced risk of coronary artery disease.
But larger randomized trials subsequently showed no
benefit of vitamin E on coronary disease
54
Alternative explanations
for vitamin E (by Greenland)
The doses and formulations across all these trials were not identical. Plus, there have been arguments for decades, mostly ignored by the trialists, that the natural and synthetic compounds showing vitamin E activity for deficiency prevention are very different in their spectrum of effects. To my knowledge there has been no trial of a size capable of showing anything that has examined the natural mix of E-active compounds found in foods, including at least 4 d-isomers of tocopherol and 4 d-isomers of tocotrienols; the crap used in the trials I've seen is a synthetic mix of the d and l-alpha isomer (50% each). While some vitamins like C are convincingly alike in natural and synthetic forms, E is far from that.
55
More Ioannidis
 Ioannidis (2005) serves as a reminder of the perils of small trials, nonrandomized trials, and those using surrogate markers.
 He concludes "Evidence from recent trials, no matter how impressive, should be interpreted with caution when only one trial is available. It is important to know whether other similar or larger trials are still ongoing or being planned. Therefore, transparent and thorough trial registration is of paramount importance to limit premature claims [of] efficacy."
56
More Freedman
Modeling, the search for significance, the
preference for novelty, and lack of interest
in assumptions --- these norms are likely
to generate a flood of nonreproducible
results.
57
Any message for us?
 Not only patients and the general public, but young researchers run a grave risk of chasing a false positive.
 We may need more evidence or time for trans-fat, salt, 3rd-hand smoking, vitamins C, E, D, etc.
 Unwise to ignore small effects. Why do women live longer than men (on average)? Perhaps many small effects contribute, such as genetic, environmental, behavioral/lifestyle, biological, social…..
58
Do you want this to happen?
Ileal-lymphoid-nodular hyperplasia, non-specific colitis, and pervasive developmental disorder in children.
By Wakefield et al. Inflammatory Bowel Disease Study Group, University Department of Medicine, Royal Free Hospital and School of Medicine,
London, UK. Lancet. 1998 Feb 28;351(9103):637-41.
Comment in:
Lancet. 2000 Jul 8;356(9224):161-2.
Lancet. 2000 Jul 8;356(9224):160-1.
Lancet. 1998 Mar 21;351(9106):907-8; author reply 908-9.
Lancet. 1998 Feb 28;351(9103):611-2.
Lancet. 2002 Jun 15;359(9323):2051-2.
Lancet. 1998 Mar 21;351(9106):906; author reply 908-9.
Lancet. 1998 Mar 21;351(9106):906-7; author reply 908-9.
Lancet. 1998 Mar 21;351(9106):907; author reply 908-9.
Lancet. 1998 Mar 21;351(9106):905; author reply 908-9.
Lancet. 1998 May 2;351(9112):1358.
Lancet. 2000 Jul 8;356(9224):161.
Lancet. 1998 May 2;351(9112):1355; author reply 1356.
Lancet. 1998 May 2;351(9112):1355; author reply 1356.
Lancet. 1998 May 2;351(9112):1355-6; author reply 1356.
Lancet. 1998 May 2;351(9112):1356.
Lancet. 1998 May 2;351(9112):1357.
Lancet. 1998 May 2;351(9112):1357.
Lancet. 2004 Mar 6;363(9411):747-9.
Lancet. 2004 Mar 6;363(9411):820-1.
Lancet. 2004 Mar 6;363(9411):821-2.
Lancet. 2004 Mar 6;363(9411):822-3.
Lancet. 2004 Mar 6;363(9411):823-4.
Lancet. 1998 Mar 21;351(9106):907; author reply 908-9.
Lancet. 1998 Mar 21;351(9106):905-6; author reply 908-9.
Lancet. 1998 May 2;351(9112):1357-8.
Lancet. 1998 Jul 18;352(9123):234-5.
59
To err is human but…
60
But does this plot not bother you?
AUC=1?
61
Validation can be more tricky than we think:
Longevity debate: Chips to blame?
 At the heart of a feverish debate over the validity of a recent genome-wide association study (GWAS) of centenarians is the authors' possible misuse of gene chips in different testing groups, part of an ongoing issue affecting other GWAS research.
 Original and validation studies can make the same mistake.
62
Lesson to learn from the
longevity gene study
 The reanalysis "should be finished within the week, and it will be passed on to the editors at Science," he noted.
"It's a study that took 15 years to get to
where it is today and a few more weeks
isn't too long to wait," said Perls. "Don't
rush it, get the right answer."
63
6. Hierarchy of evidence in study design
64
What goes on top?
ANSWER is total evidence.
An RCT can provide strong evidence for a causal effect, especially if its findings are replicated by other studies and meta-analyses.
Disclaimer: This hierarchy does not adjust for quality. Say, an RCT can be worse than case reports, etc.
65
When you read the article, you may
check the study design
 Cross-sectional study: which is first? What is cause and what is effect? e.g., depression vs. obesity
 Prospective cohort studies: generally much better than cross-sectional, but still not causal
 Prospective is generally better than retrospective
 RCT is generally better than non-RCT
 Cross-sectional is better for national surveys
66
7. Meta-analysis
 Statistical technique + systematic literature review
 There are 3 things you should not watch being made: law, sausage & meta-analysis
 Combine or not?
 No new data collection, but nothing is free.
 In file drawers? Unbiased subsample? Google or PubMed? Clinicaltrials.gov?
 Publication bias (favoring positive studies), etc.
 Man bites dog >> Dog bites man.
 Much bigger problem in obs studies than RCTs.
 Combining multiple incorrect stories is worse than one incorrect story.
 With N=infty, length of CI=0.
67
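The core arithmetic of fixed-effect (inverse-variance) pooling can be sketched as follows; all effect sizes and standard errors below are made-up. It also shows the point about the CI: as studies pile up, the pooled interval collapses whether or not the inputs are biased.

```python
from math import sqrt

def fixed_effect_pool(effects, ses):
    """Inverse-variance (fixed-effect) pooled estimate and standard error."""
    weights = [1.0 / se**2 for se in ses]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = 1.0 / sqrt(sum(weights))
    return pooled, pooled_se

# Hypothetical log-risk-ratios and SEs from five observational studies:
effects = [-0.20, -0.10, -0.30, -0.15, -0.25]
ses = [0.10, 0.12, 0.15, 0.11, 0.09]
est, se = fixed_effect_pool(effects, ses)
print(f"pooled={est:.3f}, 95% CI=({est - 1.96*se:.3f}, {est + 1.96*se:.3f})")

# Garbage in, garbage out: pooling the same biased estimate 100 times
# just makes the CI collapse around the bias.
_, se_biased = fixed_effect_pool([-0.20] * 100, [0.10] * 100)
print(f"pooled SE of 100 identical biased studies: {se_biased:.4f}")
```

The second call illustrates why combining multiple incorrect stories is worse than one: the pooled SE shrinks by a factor of 10, manufacturing false precision around a shared bias.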
Funny (real) titles about meta
 Meta-analysis: apples and oranges, or fruitless
 Apples and oranges (and pears, oh my!): the search for moderators in meta-analysis
 Of apples and oranges, file drawers and garbage: why validity issues in meta-analysis will not go away
 Meta-analysis/shmeta-analysis
 Meta-analysis of clinical trials: a consumer's guide
 Publication bias in situ
68
 Popularity of meta and BigData will continue.
- Big will be Bigger.
- No one wants to read 100 original papers
- Decision on cumulative evidence
- Another mousetrap in science?
69
8. Multiple testing
“By design, ignorance or wicked mind”
70
 Multiple testing/comparisons refers to testing more than 1 hypothesis.
 When many hypotheses are tested, and each test has a specified Type I error probability (α), the probability that at least 1 Type I error is committed increases with the number of hypotheses.
 Bonferroni method: α = 0.05/# of tests (e.g., 0.01 for 5 tests)
 Many researchers’ thorny issue.
71
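The inflation described above is one line of arithmetic: with m independent tests each at level α, P(at least one false positive) = 1 − (1 − α)^m. A minimal sketch:

```python
def family_wise_error(m, alpha=0.05):
    """P(at least one Type I error) across m independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** m

def bonferroni_alpha(m, alpha=0.05):
    """Per-test level that caps the family-wise error rate at about alpha."""
    return alpha / m

print(round(family_wise_error(5), 3))   # ~0.226 for 5 tests
print(round(family_wise_error(40), 3))  # ~0.871: e.g., case-control with 40 antigens
print(bonferroni_alpha(5))              # 0.01, matching the slide's example
```

With 40 tests, a "significant" finding is close to guaranteed under the null, which is the mechanism behind the HLA episode described later in the talk.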
Bonferroni might be the most hated
statistician in history. Real paper titles:
-- Escaping the Bonferroni iron claw in
ecological studies
-- To Bonferroni or not to Bonferroni
-- A farewell to Bonferroni
But Dunn described it as “So simple, so general”.
72
Two errors
 Type I (false positive: rejecting H0 when it is true) vs. Type II (false negative: accepting H0 when it is false)
-- Controlling Type I is more important in stat and court. (e.g., innocent → guilty: disaster!)
-- Type II can be more important (e.g., CIA or IRS).
 α=p=0.05 – is this the law in science?
 α=5% seems reasonable for 1 research question/publication.
73
Multiple testing in different forms
1. Subgroup analyses
 You should always do subgroup analyses but never believe them. – R. Peto
 We plead guilty. – D. Berry
- Multiple testing adjustment and cross-validation may be solutions. And “common sense” or “time will tell”.
75
2. Trying different cutpoints or finding
optimal cutpoint (e.g., tertiles, quintiles, etc.)
- A priori chosen cutpoints or multiple testing
adjustment can be solutions.
3. Interim analyses (e.g., for abstract
preparation)
76
4. "If you torture the data enough, nature will
always confess.”…. R. Coase
77
Underlying mechanism
 No free lunch. To look more, there is a price.
 If not, as you look more often, you are the winner (e.g., a diligent pharmaceutical company).
 Yet, even statisticians do not always agree – to adjust vs. not to adjust, or how to adjust.
 This is a reason why we emphasize “1 primary hypothesis” in general. But less strict for secondary hypotheses/analyses.
78
Multiple testing
(~lottery mechanism?)
Lottery tickets should not be free. In random and independent events such as the lottery, the probability of having a winning number depends on the N of tickets you have purchased. When one evaluates the outcome of a scientific work, attention must be given not only to the potential interest of the ‘significant’ outcomes but also to the N of ‘lottery tickets’ the authors have ‘bought’. Those having many have a much higher chance of ‘winning a lottery prize’ than of getting a meaningful scientific result. It would be unfair not to distinguish between significant results of well-planned, powerful, sharply focused studies, and those from ‘fishing expeditions’, with a much higher probability of catching an old truck tyre than of a really big fish. --- García et al. (2004)
79
Multiple testing disaster I
In the 1970s, every disease was reported to be associated with an HLA allele (schizophrenia, hypertension.... you name it!). Researchers did case-control studies with 40 antigens, so there was a very high probability that at least one was a significant result. This result was reported without any mention of the fact that it was the most significant of 40 tests --- R. Elston
80
Multiple testing disaster II
Association between reserpine (then a popular antihypertensive) and breast cancer. Shapiro (2004) gave the history. His team published initial results that were extensively covered by the media, with a huge impact on the research community. When the results did not replicate, he confessed that the initial findings were due to chance among thousands of comparisons involving hundreds of outcomes and hundreds of exposures. He hopes that we learned for the future from his mistake.
81
Multiple testing disaster III
 You are what your mother eats (Mathews et al. 2008).
-- Using 740 British women, foetal sex is associated with maternal diet at conception. 56% of women in the highest third of preconceptional energy intake bore boys, compared with 45% in the lowest third. Intakes during pregnancy were not associated with sex, suggesting that the foetus does not manipulate maternal diet.
-- Supports hypotheses predicting investment in costly male offspring when resources are plentiful. Dietary changes may explain the falling proportion of male births in industrialized countries. Relevant to the current debate about the artificial selection of offspring sex in fertility treatment and commercial ‘gender clinics’.
82
 All over the place on the news and internet. Over 50,000 Google hits in the 1st week.
 Numerous comparisons were conducted.
 Sodium, calcium, potassium, etc. were significant (p<0.05), but sodium was dismissed claiming it is hard to measure accurately.
-- possible ‘pick & choose’!
 Other problems: lack of biological credibility, inherent difficulty in dietary data, N=740.
83
“Fooled by randomness?”
"Did you test broccoli?” (Young et al. 2009)
84
 Since recorded time, humans have attempted to control the gender of children. Undoubtedly very many failed attempts were unpublished and went into the file drawer. That a few likely false positives are in the literature is not surprising at all. (S. Young)
 “…more genetic than environmental. …. if genetic variability exists, it is of a very low order of magnitude.” (Edwards 1962)
85
Referee 3
Comments to the Author(s)
Splendid. Your analysis of the 'data' presented by
Mathews et al confirms what should have been the
suspicions of any graduate with a minimal
understanding of statistics. It is a shame you do not
have the time to go into the details of the other flaws
in the paper that you refer to in paragraph one of
your introduction. As an educator I applaud your
work. It is poor science, such as that presented by
Mathews et al (the aim of which appears to be little
more than personal publicity and gratification) that
so damages the public understanding of Science
and Scientists in our Society.
86
 You are what your mother eats
 Blame your mother
vs.
 You aren’t what your mother eats
 Bran makes the man (WSJ)
87
Leaving no trace (Shaffer 2007)
Usually these attempts through which the
experimenter passed don't leave any
traces; the public will only know the result
that has been found worth pointing out;
and as a consequence, someone
unfamiliar with the attempts which have
led to this result completely lacks a clear
rule for deciding whether the result can or
can not be attributed to chance.
89
If you keep testing without
controlling α
 Everything is dangerous. – S. Young
 It is fairly easy to find risk factors for premature morbidity or mortality. Indeed, given a large enough study and enough measured factors and outcomes, almost any potentially interesting variable will be linked to some health outcome. – Christenfeld et al. 2004.
 It is foolish to ask ‘Are the effects of A and B different?’ They are always different – for some decimal place. – J. Tukey
The only thing to fear is fear itself……………………. …..………………………………and everything else.
90
Or it would become
91
 Exploring the randomness can be a fatal attraction to some researchers, including myself. – K. Bernstein
 No adjustments are needed for multiple comparisons – K.J. Rothman, Epidemiology (1990)
92
Multiple testing adjustment
 In RCTs: mandatory (by FDA)
-- If not, more (interim) looks would lead to what you want.
 In genetic/genomic studies: almost mandatory
-- think about the # of genes!
 In (large) observational studies: almost infeasible
-- e.g., so many investigators, so many manuscripts
Realistic strategies can be:
1. α=5% for 1 hypothesis. Adjust for multiple testing, or clarify how many tests/comparisons you conducted. Or state “We did not adjust.”
2. Think and act in RCT ways.
3. (Empirical) Bayes has been suggested.
93
Peto et al. (NEJM 2008)
94
A fundamental question
Multiple testing is testing "Intention" rather than "Science" or "True status".
p=0.03 can be significant for 1 comparison but non-significant for 2 comparisons.
Most RCTs test intention, not treatment; namely, ITT.
95
Importance of randomness
Society needs to be more cognizant of randomness.
Society wants a cause for every occurring event.
Is an extra hurricane or two this year really due to
El-Niño and/or global warming? Are the two less
murders your city had this year more attributable
to the new police patrol car (the newspaper
headline) or two less severe domestic disputes
(randomness)? Although we seek for causes
whenever we can, some phenomena may be
explained by randomness and/or as extreme
events with non-zero probabilities. (Lund 2007)
96
Yet, importance of false negative
 Nutrition, physical activity, environmental exposure, skincare, positive mind, spirituality may have small effects, but their total or PH (public health) impacts can be huge.
- how many times do you take statins or HAART?
- how many times do you eat, breathe or walk?
 RCTs and rigorous statistical testing may not address these problems adequately.
 Even small – but cumulative – effects can be HUGELY important.
97
Replication, again
is a universal solution for multiplicity and subgroup analyses (Vandenbroucke 2008).
In genome-wide analyses, it is a prerequisite for publication (Khoury et al. 2007).
-- Replication is generally for/by someone else! Internal validation (e.g., split sample for testing and verification) can be the next best option, if N is large enough.
98
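The split-sample idea can be illustrated with a toy simulation (all numbers hypothetical, and the correlation test is a rough normal approximation): dredge a training half for the most "significant" of 50 pure-noise predictors, then test that single winner once in the held-out half.

```python
import random
from math import sqrt, erfc

random.seed(0)  # deterministic toy example

def two_sided_p_from_corr(r, n):
    """Approximate two-sided p-value for a sample correlation r of n pairs."""
    return erfc(abs(r * sqrt(n)) / sqrt(2.0))

def corr(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (v - mb) for x, v in zip(a, b))
    den = sqrt(sum((x - ma) ** 2 for x in a) * sum((v - mb) ** 2 for v in b))
    return num / den

n, m = 200, 50  # 200 subjects, 50 pure-noise predictors
y = [random.gauss(0, 1) for _ in range(n)]
X = [[random.gauss(0, 1) for _ in range(n)] for _ in range(m)]

half = n // 2
# Step 1: dredge the training half for the most "significant" predictor.
train_ps = [two_sided_p_from_corr(corr(x[:half], y[:half]), half) for x in X]
best = min(range(m), key=lambda j: train_ps[j])
# Step 2: test that single predictor, once, in the held-out half.
test_p = two_sided_p_from_corr(corr(X[best][half:], y[half:]), half)
print(f"training min p = {train_ps[best]:.4f}, held-out p = {test_p:.4f}")
```

With 50 noise screens, the minimum training p-value falls below 0.05 about 92% of the time (1 − 0.95⁵⁰), while the single confirmatory test keeps its nominal 5% false-positive rate; the held-out half exposes the dredging.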
 By comparing 5,417 cases and 5,417 controls from the Psychiatric Genomics Consortium, Neale’s team found that none of the 30 SNPs were significantly associated with ASD risk.
http://www.the-scientist.com/?articles.view/articleNo/38030/title/Genetic-Test-for-Autism-Refuted/
99
9. Incorrect analysis
(a few historical samplers)
 Simpson's paradox is a statistical
paradox wherein a trend that appears in
each of several groups reverses when the
groups are combined.
-- Famous example: the Berkeley sex-bias
case in graduate school admissions.
 If you use the wrong stats, you can end up
in the news. See 'Statistical flaw trips up
study of bad stats'. Nature 2006
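A made-up two-department example in the spirit of the Berkeley case: one group leads within every department yet trails overall, because it applied more often to the harder department.

```python
# (admitted, applicants) per department; all numbers invented.
men = {"easy": (80, 100), "hard": (20, 100)}
women = {"easy": (9, 10), "hard": (55, 190)}

def rate(group, dept):
    admitted, applied = group[dept]
    return admitted / applied

def overall(group):
    admitted = sum(a for a, _ in group.values())
    applied = sum(n for _, n in group.values())
    return admitted / applied

for dept in ("easy", "hard"):
    assert rate(women, dept) > rate(men, dept)  # women lead per dept

print(overall(men), overall(women))  # 0.5 0.32 -- men lead when pooled
```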
100
Show me the data
 For $$ data (e.g., income, medical costs,
housing prices), reporting the mean alone can be
problematic; the median should accompany it.
-- fyi, the Am Stat Assoc does not report mean
& SD in its annual salary survey.
 Did you know the impact factor calculation is
based on the mean, and Thomson Scientific
refused to provide the median?
-- Science/Nature/NEJM may like the 'mean'.
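A toy illustration with invented incomes of why the median belongs next to the mean for skewed $$ data:

```python
import statistics

# One very high earner drags the mean far from the "typical" value.
incomes = [30_000, 35_000, 40_000, 45_000, 50_000, 2_000_000]
print(round(statistics.mean(incomes)))    # 366667 -- describes no one
print(round(statistics.median(incomes)))  # 42500 -- the typical income
```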
101
Longevity genes debate:
Chips to blame?
Both experts and Illumina agree that if
scientists find something interesting, they
always need to double-check their results
with some other method, like TaqMan. "If it
looks OK you can't be certain it's OK," said
Goldstein. "For really important findings
you want to separately genotype it."
102
10. Same words, different meanings?
 Professional's vs. layman's terms (or even
among scientific disciplines): not always the
same.
e.g., risk, hazard, odds, likelihood, rate, prevalence,
incidence, valid, unbiased, consistent, cost-effective
(≠cheap), efficient, SD vs. SE, regress.
 People on the street may not distinguish an RCT
from an observational study.
 Be as easy and intuitive as possible, but be
correct.
 Association ≠ Effect (stat's fault! effect size, main
effect, interaction effect, effect modification)
103
 There are many stat terms that may shock
lay people/politicians - e.g.,
normal, discrimination, collapse, bias,
inferior, defier/noncomplier, ignorable.
104
 X and Y are variables to statisticians but
parameters to clinicians.
 X and X2 are nonlinear to epi/clinicians
but a linear model to statisticians.
 X: independent variable/covariate vs.
exogenous variable.
 Rate can be a number of different things -
e.g., tax rate, birth rate, response rate,
hazard rate.
 Multivariate vs. Multiple vs. Multivariable
105
Multiple faces of Case-control?
Since it seems unlikely that (mis-)use of
the term will go away, some
clarity would be introduced ....
‘When I use a word,’ Humpty Dumpty said in
a rather scornful tone, ‘it means just what I
choose it to mean—neither
more nor less.’ ……... T. Marshall
106
11. Option blind, Probability blind
107
 Intuition vs. calculated probability
 Simpson's paradox & Jensen's inequality
can still be counterintuitive
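Jensen's inequality in two lines: for a convex f (here f(x)=x², chosen for illustration), E[f(X)] ≥ f(E[X]).

```python
# Fair coin on {0, 2}: the mean of the squares exceeds the square
# of the mean whenever X is not constant.
xs = [0, 2]
mean_of_squares = sum(x ** 2 for x in xs) / len(xs)  # E[X^2] = 2.0
square_of_mean = (sum(xs) / len(xs)) ** 2            # (E[X])^2 = 1.0
print(mean_of_squares >= square_of_mean)  # True
```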
108
12. Data sharing
 Although … is tax-supported, its data are not
available to us. …. Policies governing data
dissemination need to be reconsidered, although
due regard must be paid to patient confidentiality.
Only by thorough scrutiny can error be avoided.
Transparency is the best assurance of scientific
quality. (Freedman & Petitti 2005)
 Open the data to the public (after sufficient
de-identification).
 If there is a will, there is a way - especially for
data sharing.
109
Secret, Hostage or Monopoly Science?
Data are cheap-N-fair?
Rich get richer?
110
 "Alas, we don't have the process" -
Science editor (2004) responding to 'File-drawer
problem, revisited'.
 Sharing Clinical Trial Data — A Proposal
from the International Committee of
Medical Journal Editors… NEJM 2016
111
Any realistic solutions?
 There's a simple fix. If the top dozen or so medical
journals refused to consider publication of any
research results without a pledge from the authors
to make the raw data available for follow-up
analysis, the problem would disappear.
-- DR Bacon, M.D.
 Management and incentives…. Back to Deming?
 Some CDC and NIH data can be obtained with no
or minimal restrictions.
 Many genetic/genomic data are on the internet.
 You open first….
112
You must be the change you
wish to see - M. Gandhi
How many Editorials?
 Managing UK research data for future
use: The BMJ is now asking authors for data
sharing statements (BMJ 2009)
 Sharing Clinical Trial Data — A Proposal
from the International Committee of
Medical Journal Editors (NEJM 2016)
113
13. So, who are responsible?
 Authors/scientists: lack of integrity; pressure for
grants, publications, jobs; CV and ego; wanting to be
famous or in the news; publishing for its own sake;
we = mammals with COIs.
 Editors: too many journals, busy reviewers, 2-week
review times, shortage of statistical reviewers.
 Media: don't think critically; aim to surprise or
shock people.
 Lay persons: like more shocking news; may not use
common sense.
We are all responsible for all ---Dostoevsky
(epigraph in Rose’s Epi book)
114
Try to remember as scientists
 If false positive and false negative results
continue to be produced with disturbing
frequency, it may well be true that 'we are fast
becoming a nuisance to society… people
don't take us seriously anymore, and when
they do… we may unintentionally do more
harm than good' --- Trichopoulos.
 It is extremely difficult to un-shock the
shocked; un-scaring is not as easy as scaring.
 Transparency! Ultimately, data to be shared.
 The permeation of statistical research with
the experimental spirit.. Greenwood/Hill
(1924/1953)
115
They don’t take you seriously…
 Typical medical advice: "this is good for
you, but we do not recommend you
start doing it." Same as when they told
folks not to stock up on Tamiflu a few
years ago when there was an increased risk
of bird flu. And same as when they say
"don't panic, don't worry" whenever there
is some extreme risk of disease spreading.
 Way to go, 'researchers'! What in the
world would we do without you!
116
Try to remember as readers
 RCT vs. Obs studies (pros and cons).
 The RCT is the best available, but do not worship it.
 Use your common sense, while not forgetting serendipity (ask 'Why?' rather
than 'Why not?'; some results may be too good to be true).
 Try to love the 'null' hypothesis too.
 Data are much more important than statistical methods - garbage in,
garbage out.
 Check N (e.g., denominator & numerator; a large N does not fix bias).
 Are samples/data independent or dependent? Or a biased sample?
 Think about third variables (i.e., confounders).
 Be careful about meta-analysis. But if well done, there is nothing like a
meta-analysis.
 Do not be governed by the p-value (<0.05).
 Perhaps by chance? Multiple testing: how many questions/analyses?
Anything hidden?
 Researchers' COIs.
117
Fig. 1. The nine circles of scientific hell (with apologies to Dante and xkcd).
Neuroskeptic Perspectives on Psychological Science
2012;7:643-644
Copyright © by Association for Psychological Science
119
Today’s Quotes
 In God we trust; all others must bring data
(protocol and SAS output). --- W. Deming (&
S. Young & K. Griffin)
 All models are wrong, but some are useful.
--- G. Box
 Far better an approximate answer to the right
question, which is often vague, than an exact
answer to the wrong question, which can
always be made precise --- J. Tukey
120
Protocol, protocol, protocol

NIH cancels massive U.S. children’s study, 2014

The National Children’s Study (NCS), which has
struggled to get off the ground and has already
cost more than $1.2 billion,..

“there is no protocol,..”
http://news.sciencemag.org/funding/2014/12/nih-cancels-massive-u-s-children-s-study
121
Yet, Observations
can still be beautiful
 smoking & lung cancer and heart disease
 second-hand smoking & lung cancer
 BMI/waist, blood pressure, SES on health
outcomes
 alcohol consumption & liver disease
 asbestos & cancer
 vinyl chloride & angiosarcoma
 Cats have 9 lives (Diamond, Nature 1988)
122
RCT?
You first
123