Statistically significant


Statistical Guideline of Nature
Ji-Qian Fang
School of Public Health
Sun Yat-Sen University
2008.10
Challenge to Nature Medicine
• An editorial of Nature Medicine (2005)
<Statistically significant> :
“Some of the articles published in Nature and
Nature Medicine were criticized due to the
deficiency in statistical issues”.
What happened?
• Emili García-Berthou and Carles Alcaraz (Girona
Univ., Spain) published an article in BMC Medical
Research Methodology (May 2004).
They reviewed 181 research papers published in Nature (2001)
and found that 38% of them contained at least one
statistical error.
• Since then, a series of critical articles has been
published. One of them, written by Robert Matthews
(The Financial Times), analyzed the statistical
methodology of the articles in Nature Medicine
(2000). It reported that 31% of the authors had
misunderstood the meaning of the P-value, and that some
reported P-values with unnecessary precision (0.002387).
Independent statistical “audit”
• Nature Medicine invited two experts from
Columbia University to carry out a “statistical
audit”, in particular to evaluate 21 articles published
in 2003 against a consolidated list of statistical
criteria.
• They found that some papers contained almost no
quantitative analysis, and some involved very
complicated statistical and mathematical issues.
Most of them used only a little statistical
testing, but described it so incompletely that
one could hardly assess whether it was
appropriate or not.
Checklist of statistical adequacy
1. Reported n at start of study and for each analysis
2. Provided sample size calculation or justification
Examples
We believed that . . . the incidence of symptomatic
deep venous thrombosis or pulmonary embolism or
death would be 4% in the placebo group and 1.5%
in the ardeparin sodium group. Based on 0.9 power
to detect a significant difference (P=0.05, two-sided),
976 patients were required for each study group.
To compensate for non-evaluable patients, we
planned to enroll 1000 patients per group
• To have an 85% chance of detecting as
significant (at the two sided 5% level) a five point
difference between the two groups in the mean SF-36
general health perception scores, with an
assumed standard deviation of 20 and a loss to
follow up of 20%, 360 women in each group
(720 in total) were required.
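The two sample size statements above can be checked with a short sketch. It assumes the usual normal-approximation formulas; the continuity correction in the first function is an assumption about how the quoted 976 was obtained, and the function names are illustrative only.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_two_proportions(p1, p2, alpha=0.05, power=0.90, continuity=True):
    """Per-group n for a two-sided comparison of two proportions."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    delta = abs(p1 - p2)
    n = (z_a * sqrt(2 * p_bar * (1 - p_bar))
         + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / delta ** 2
    if continuity:
        # Continuity correction (assumed, not stated in the quoted example)
        n = n / 4 * (1 + sqrt(1 + 4 / (n * delta))) ** 2
    return ceil(n)


def n_two_means(delta, sd, alpha=0.05, power=0.85, loss=0.0):
    """Per-group n for a two-sided comparison of two means, inflated for loss to follow up."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    n = 2 * ((z_a + z_b) * sd / delta) ** 2
    return ceil(n / (1 - loss))


# First example: 4% vs 1.5%, power 0.9, two-sided alpha 0.05
print(n_two_proportions(0.04, 0.015))
# Second example: 5-point difference, SD 20, power 85%, 20% loss to follow up
print(n_two_means(5, 20, power=0.85, loss=0.20))
```

Under these assumptions the sketch should agree with the quoted figures of 976 and 360 per group; without the continuity correction the first calculation gives a noticeably smaller number, which is why a report should say which formula was used.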
3. Identified all statistical methods unambiguously
4. If statistical methods were described adequately,
were any of them clearly inappropriate?
Example
All data analysis was carried out according to a
preestablished analysis plan. Proportions were
compared by χ² tests with continuity correction or
Fisher’s exact test when appropriate. Mean serum
retinol concentrations were compared by t test. . .
Two sided significance tests were used throughout.
• Multivariate analyses were conducted
with logistic regression. The durations of episodes
and signs of disease were compared by using
proportional hazards regression.
Methods for additional analyses, such as subgroup
analyses and adjusted analyses:
Example
Proportions of patients responding were compared
between treatment groups with the Mantel-Haenszel
chi-squared test, adjusted for the stratification variable,
methotrexate use.
• . . . it was planned to assess the relative benefit of
CHART in an exploratory manner in subgroups:
age, sex, performance status, stage, site, and histology.
To test for differences in the effect of CHART, a
chi-squared test for interaction was performed, or when
appropriate a chi-squared test for trend (131).
5. Provided alpha for all statistical tests
6. Specified whether tests were one-sided or
two-sided
7. Stated whether the data met the
assumptions of the test
8. Reported actual P values for primary
analyses
Example
Both samples were approximately normally
distributed (Shapiro-Wilk test: P1=0.466; P2=0.482),
and equality of the two population variances was not
rejected at the 0.10 significance level (F=1.345; P=0.261),
so the two independent samples t test was used (t=4.137;
df=18; P=0.001). The results indicated a
statistically significant difference between the effects of
the two drugs at the two-tailed significance level of 0.05,
and the average increase in Hb concentration was higher in
patients taking the new drug, which could also be
seen from the 95% confidence interval of the
difference between the two population means (3.829, 11.731).
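A reader can check that the reported t statistic and confidence interval are consistent with each other. The sketch below assumes the interval is symmetric, of the form difference ± t_crit × SE; the critical value 2.1009 is the standard table value for 18 degrees of freedom and is supplied by hand, and the function name is illustrative.

```python
def t_from_ci(lower, upper, t_crit):
    """Recover the point estimate, its standard error and t from a symmetric CI."""
    diff = (lower + upper) / 2          # midpoint of the interval
    se = (upper - lower) / 2 / t_crit   # half-width divided by the critical value
    return diff, se, diff / se


T_CRIT_DF18 = 2.1009  # two-sided 5% critical value of t with 18 df (table value)

diff, se, t = t_from_ci(3.829, 11.731, T_CRIT_DF18)
print(f"difference = {diff:.3f}, SE = {se:.3f}, t = {t:.3f}")
```

The recovered t should be close to the reported 4.137, which is the kind of cross-check that complete reporting (test, statistic, df, P and CI) makes possible.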
9. Were the statistical measures (mean, standard
error, standard deviation, etc.) reported, and
were they clearly labeled?
Example
The results show that the mean ± SD of IL-2 for
the experimental group (n=31) was 16.00 IU/mL ±
7.50 IU/mL and for the control group (n=30) was
20.00 IU/mL ± 8.00 IU/mL; the difference between
the two group means was 4.00 IU/mL, and the 95%
CI of the difference was (0.0304, 7.9696) IU/mL.
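Because the example labels its summary statistics clearly (mean ± SD and n for each group), the confidence interval can be recomputed from them. The sketch below assumes a pooled-variance interval and a critical value of 2.000 (the table value for 60 degrees of freedom); both are assumptions about how the quoted interval was obtained, and the function name is illustrative.

```python
from math import sqrt


def pooled_ci(mean1, sd1, n1, mean2, sd2, n2, t_crit):
    """CI for mean2 - mean1 using the pooled (equal-variance) standard error."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se = sqrt(pooled_var * (1 / n1 + 1 / n2))
    diff = mean2 - mean1
    return diff - t_crit * se, diff + t_crit * se


# Experimental group: 16.00 ± 7.50 (n=31); control group: 20.00 ± 8.00 (n=30)
lower, upper = pooled_ci(16.00, 7.50, 31, 20.00, 8.00, 30, t_crit=2.000)
print(f"95% CI: ({lower:.4f}, {upper:.4f}) IU/mL")
```

With these inputs the interval should match the reported one to four decimal places. Had the ± values been standard errors rather than standard deviations, the result would be quite different, which is why the label matters.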
10. Was the unit of analysis clearly stated in all
comparisons?
11. Are mean and standard deviation used to
describe data sets that may be non-normally
distributed or when the sample size is very small?
Results of Blood Gas Analysis (X̄ ± S_X)
Group        n    Age (year)     pH           PaCO2          PaO2         SaO2
Experiment   12   63.00±15       7.36±0.17    63.00±15       9.25±1.91    85.12±5.99
Control      10   62.50±12.49    7.38±0.19    63.00±13.69    9.16±1.96    86.45±7.11
What are the problems?
12. Explanation of unusual or complex statistical
methods
Example
In order to compare the effects of common feed, feed with
plasma protein and feed with bioprotein on the weight
gain of weaned piglets, 30 weaned piglets
were matched into 10 blocks by sex, age in days and
baseline weight. Then the 3 piglets in each block were
randomly assigned to 1 of the 3 treatment groups. After 10
days, the changes in weight from baseline were measured.
---- Randomized block design
The mean ± SD change in weight was 3.33 kg ± 0.48 kg
for the common feed group, 3.83 kg ± 0.61 kg for the
plasma protein group, and 4.10 kg ± 0.68 kg for the bioprotein group.
Results of two-way ANOVA under the significance
level of 0.05 indicated statistically significant differences
among 3 treatment groups (F=6.8112, P=0.0063). Similar
results were found among 10 blocks (F=2.7407, P=0.0328).
---- Results of ANOVA
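The design above implies the degrees of freedom of the ANOVA table, and with 2 numerator degrees of freedom the upper-tail P value of an F statistic has a simple closed form, P = (1 + 2F/df2)^(-df2/2). The sketch below uses that identity to check the reported treatment P value; the function name is illustrative.

```python
def f_pvalue_df1_2(f_stat, df2):
    """Upper-tail P for an F statistic with 2 numerator df (closed form)."""
    return (1 + 2 * f_stat / df2) ** (-df2 / 2)


treatments, blocks = 3, 10
df_treat = treatments - 1          # 2
df_block = blocks - 1              # 9
df_error = df_treat * df_block     # 18 in a randomized block design

p_treat = f_pvalue_df1_2(6.8112, df_error)
print(f"df = ({df_treat}, {df_error}), P = {p_treat:.4f}")
```

The closed form applies only when the numerator has 2 degrees of freedom, so the block effect (9 and 18 degrees of freedom) would need an F distribution routine from a statistics library instead.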
13. Explanation of data exclusions, if any
Example
• The primary analysis was intention-to-treat and
involved all patients who were randomly assigned
• One patient in the alendronate group was lost to
follow up; thus data from 31 patients were
available for the intention-to-treat analysis. Five
patients were considered protocol violators . . .
Consequently, 26 patients remained for the per-protocol analyses.
Protocol deviations
• Authors should report all departures from the
protocol, including unplanned changes to
interventions, examinations, data collection, and
methods of analysis.
• The nature of the protocol deviation and the
exact reason for excluding participants after
randomization should always be reported.
14. Explained reasons for any discrepancy
between initial n and n for each analysis
Example
Initially, the 60 rats were randomly divided into 4
groups of 15 each: a control group and 3 dose levels.
However, at the end of the first
week, 2 rats in the low dose group escaped; on
day 40, 1 rat in the high dose group and
1 in the control group escaped …
15. Explained method of treatment assignment (randomization, if
any)
Example
Determination of whether a patient would be treated by
streptomycin and bed-rest (S case) or by bed-rest
alone (C case) was made by reference to a statistical series
based on random sampling numbers drawn up for each
sex at each centre by Prof. Bradford Hill; the details of
the series were unknown to any of the investigators or
to the coordinator and were contained in a set of sealed
envelopes, each bearing on the outside only the name
of the hospital and a number. After acceptance of a patient
by the panel, the envelope was opened at the central office;
the card inside told the medical officer of the centre if the
patient was to be an S or a C case.
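A modern equivalent of such a concealed allocation series can be sketched in a few lines. This is only an illustration of the idea (one independent random series per sex at each centre), not the procedure used in the trial; the centre names, series length and seeds are hypothetical.

```python
import random


def allocation_series(n_envelopes, seed):
    """Simple (unrestricted) random series of 'S' and 'C' allocations."""
    rng = random.Random(seed)  # fixed seed so the series can be audited later
    return [rng.choice(["S", "C"]) for _ in range(n_envelopes)]


# Hypothetical strata: one series for each sex at each centre
strata = [(centre, sex) for centre in ("Centre 1", "Centre 2") for sex in ("M", "F")]
envelopes = {
    stratum: allocation_series(20, seed=i) for i, stratum in enumerate(strata)
}

# Envelope k of a stratum is opened only after patient k has been accepted
for stratum, series in envelopes.items():
    print(stratum, "".join(series))
```

Simple randomization like this does not guarantee equal group sizes within a stratum; a report should state whether blocking or another restriction was used.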
16. Explained any data transformation
Example
18 patients with acute encephalitis B in a
clinic were randomly allocated to 3 groups. Each
group received a different treatment (A, B or C),
and the number of days with fever was
measured as the effect of treatment.
• Consider the two assumptions of one-way ANOVA.
The distribution of days with fever is positively skewed
rather than normal, and the ratio S²max / S²min
is close to 10, so the assumption of homogeneity of
variances is also rejected. Therefore, a square root
transformation of the number of days with fever is applied…
• The transformed values were used in the one-way
ANOVA, which showed no significant
difference in the average days with fever (on the square
root scale) among the three treatments.
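The check described above, comparing the largest and smallest group variances before and after a square root transformation, can be sketched as follows. The data are hypothetical and are not the values from the example; the function names are illustrative.

```python
from math import sqrt
from statistics import variance


def variance_ratio(groups):
    """Ratio of the largest to the smallest sample variance across groups."""
    variances = [variance(g) for g in groups]
    return max(variances) / min(variances)


def sqrt_transform(groups):
    """Apply a square root transformation to every observation."""
    return [[sqrt(x) for x in g] for g in groups]


# Hypothetical days with fever for 3 groups of 6 patients (NOT the source data)
fever_days = [
    [1, 2, 2, 3, 3, 4],
    [3, 4, 5, 6, 7, 9],
    [6, 8, 10, 13, 16, 21],
]

print("ratio before:", round(variance_ratio(fever_days), 2))
print("ratio after: ", round(variance_ratio(sqrt_transform(fever_days)), 2))
```

When the variance grows with the mean, as in count-like data, the square root scale usually brings the group variances much closer together, which is the justification a report should give for the transformation.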
17. Discussed adjustments for multiple testing
Example
Multiple comparison with Bonferroni adjustment
(alpha level of 0.0167) revealed that the effects of the two
treatments with protein were significantly higher than
that of common feed, while the difference between the
two treatments with protein was not statistically
significant.
----Multiple comparison
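The adjusted alpha level of 0.0167 follows from dividing 0.05 by the number of pairwise comparisons among 3 groups. The sketch below shows that arithmetic, together with the family-wise error rate that would result without adjustment; the latter assumes independent tests, which is a simplification, and the function names are illustrative.

```python
from math import comb


def bonferroni_alpha(alpha, n_groups):
    """Number of pairwise comparisons and the Bonferroni-adjusted alpha."""
    m = comb(n_groups, 2)
    return m, alpha / m


def familywise_error(alpha, m):
    """Chance of at least one false positive in m tests (assumes independence)."""
    return 1 - (1 - alpha) ** m


m, adjusted = bonferroni_alpha(0.05, 3)
print(f"{m} comparisons, adjusted alpha = {adjusted:.4f}")
print(f"unadjusted family-wise error rate = {familywise_error(0.05, m):.4f}")
```

With 3 groups the unadjusted error rate is already well above 0.05, and it grows quickly as more groups are compared.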
For graphs
18. Were effect sizes distorted? (by truncation of y
axis, etc.)
[Figure: bar charts of the number of Grade 3A hospitals by region (Beijing, Tianjin, Hebei, Shanxi, Inner Mongolia); y axis "Number of hospitals", with tick labels from 20 to 50]
What are the problems?
19. Were error bars unlabeled?
20. Were error bars absent?
[Figure: bar charts of Cholesterol (mg/dL) for the Normal and Patient groups]
• What is the height for?
• What are the bars for?
• What are the stars for?
Summary
Three errors are particularly common
• Multiple comparisons: When making multiple
statistical comparisons on a single data set,
authors should explain how they adjusted the
alpha level to avoid an inflated Type I error rate,
or they should select statistical tests appropriate
for multiple groups (such as ANOVA rather than
a series of t-tests).
• Normal distribution: Many statistical tests
require that the data be approximately normally
distributed; when using these tests, authors
should explain how they tested their data for
normality. If the data do not meet the assumptions
of the test, then a non-parametric alternative
should be used instead.
• Small sample size: When the sample size is small
(less than about 10), authors should use tests
appropriate to small samples or justify their use
of large-sample tests.
Thanks