Case-control studies
Data analysis considerations for (clinical) research
Jarno Tuimala
2015-09-08
Schedule 2015
LECTURES
Haartman instituutti, Haartmaninkatu 3, pieni luentosali 14:15-15:45
1. Tue 1.9.2015
Otto Helve: Introduction and curriculum of a clinical investigator
2. Wed 2.9.2015
Jussi Merenmies: Evaluating results from a randomised controlled trial
Erkki Isometsä: Clinical Epidemiology: observational studies
3. Tue 8.9.2015
Jarno Tuimala: Statistical considerations for research plans
4. Wed 9.9.2015
Ritva Loponen, Harriet Colliander: Clinical trial registrations and submissions to the authorities
5. Tue 22.9.2015
Mikael Knip: Research in international setting
Principles of experimental design
• Ronald A. Fisher (1935)
1. Comparison (results are in relation to something)
2. Replication (several obs. units per group)
3. Randomization (randomly allocate units to groups)
4. Blocking (take confounding into account)
5. Factorial experiments (study interactions)
• Originally built on a foundation of analysis of variance (ANOVA), and aimed at agricultural experiments
Question 1
The world’s oldest clinical trial
• Bible, Book of Daniel, 1:3-16.
• Treatment group: Four boys from
Israel were given just water and
vegetables.
• Control group: Another group of
boys received meat and wine from
the king's table.
• After ten days the groups were
visually compared, and the
treatment group was found to be
healthier than the control group.
• In addition, the treatment group was
ten times better in all matters of
wisdom and understanding than the
control group.
1. Comparison
2. Replication
3. Randomization
• What design principles were used in this
experiment?
• What are the response and explanatory
variables?
• How would you make the experiment better?
• What statistical method(s) would you use for
analyzing this?
Explanation
• Design principles used:
• Comparison (two groups)
• Replication (several individuals in both groups)
• Variables:
• Response variables: health, knowledge
• Explanatory variables: diet
• Make a better study:
• Measure at baseline, i.e., at the start of the study
• Allocate individuals to groups randomly
Things to consider
• Hypothesis
• Outcome measures
• Data sources
• Registries
• Experiments
• Data management
• Study design
• Observational and experimental design
• Sample size
• Statistical analysis
• Reporting
Hypothesis
Study objectives
• Testable hypotheses?
• Primary and secondary questions?
• Example:
• Primary: Does smoking cause lung cancer?
• Secondary: Are old smokers in worse shape than old nonsmokers?
Outcome measures
• What will be measured?
• Does the individual get the disease (yes/no)?
• How long does it take for the individual to get the disease
(time)?
• How severe is the disease (laboratory tests, various scores
or gradings)?
• Proxies?
• Example
• Do smokers get cancer more often than non-smokers?
• Does it take longer for non-smokers to get cancer than for
smokers?
Smoking and cancer
• Objective: Find out whether smoking causes cancer
• Hypothesis: Smokers get cancer more often than
non-smokers
• Next:
• What kind of data is needed to test this?
• Where to get data to test this?
Data sources
Smoking and cancer
• Hypothesis: Smokers get cancer more often than
non-smokers
• Data needs, at least:
• Two groups: smokers, non-smokers
• Data: Smoking status, cancer end-point status
Registry or experimental study?
• Experimental
• Expose individuals to tobacco smoke?
• Not ethical -> registry study
• A review from an ethical board is needed
• Registry
• If strictly registry-based, no ethical board review needed
• If patients or their relatives are contacted, a review is
mandatory
Registries (examples)
• National
• Hospital discharge registry (HILMO) [THL]
• Cancer registry [Cancer Society / THL]
• Causes of Death [Statistics Finland]
• Medications [KELA]
• New special reimbursements for medicines [KELA]
• ASA Registry [TTL]
• Local
• Hospital registries
• Studies
• Health 2000 / 2011
Registry study example
• Easy to assess whether individual has or has had lung
cancer
• Much harder to assess whether they smoked or not
• Health 2000 / 2011 helps
• Use Health 2000 or 2011 data to pick the smokers and
non-smokers
• Link with other registries (cancer registry) to assess the
cancer status
• Do you need to collect other variables?
Confounding
Causal inference
Causality is (often) the aim
• Causal effects?
– The amount of total damage of a fire and the number
of firemen at the site are strongly correlated. Do the
firemen cause the damages?
– More of the lung cancer patients are smokers than
non-smokers. Does smoking cause lung cancer?
• Evidence based medicine...
Confounding
[Diagram with three nodes: confounder, cause, outcome]
Confounding
[The same diagram for the example: occupation (confounder), smoking (cause), lung cancer (outcome)]
Question 2
• The amount of total damage of a fire and the number of
firemen at the site are strongly correlated. Do the
firemen cause the damages?
• For the firemen example, what could be the:
1. Outcome
2. Cause
3. Confounder(s)
Explanation
• Confounding:
– The variable must be independently associated with the outcome
– The variable must be associated with the exposure under study in the source population
– It should not lie on the causal pathway between exposure and disease
http://bolt.mph.ufl.edu/6050-6052/unit-1/causation/
Smoking and cancer
• Hypothesis: Smokers get cancer more often than
non-smokers
• Data needs, at least:
• Two groups: smokers, non-smokers
• Data: Smoking status, cancer end-point status [from
Health 2000 or 2011]
• Data: Occupational exposure [from ASA registry], Age,
Sex [from Health 2000 or 2011]
Note on causality
[Timeline figure with the labels Exposure, Cancer and Time]
This is the right way!
Note on causality
Don’t do this!
[Timeline figure with the labels Heart attack, Statins and Time]
Confounding by indication,
example
• Mikkola R, Heikkinen J, Lahtinen J, Paone R,
Juvonen T, Biancari F. Does blood transfusion affect
intermediate survival after coronary artery bypass
surgery? Scandinavian journal of surgery : SJS :
official organ for the Finnish Surgical Society and
the Scandinavian Surgical Society 102: 110-6, 2013.
Confounding by indication
• The patient’s condition affects the way treatments or
medication are allocated (confounding by severity).
• So, business as usual, but it creates problems during
epidemiological (observational) studies.
Confounding by indication
• If the effect of treatment is not adjusted for the initial condition of the patient, the risk of drawing a wrong conclusion is high!
Solutions
• The previous example is a type of confounding by indication
called confounding by severity.
• Usual statistical methods, such as multivariable regression, do not adjust for unmeasured variables, which are often important in this kind of situation.
• And even when it is measured, the severity of disease is a royal pain to adjust for!
– Propensity score adjustment, inverse-probability weighting
(Rubin) or instrumental variable methods (factor analysis and
structural equation modeling) might work better.
– If possible, better to use a controlled trial, where patients can
be randomized to treatment and no-treatment (or placebo).
– Remember natural experiments, also!
– In other words, this is not necessarily very easy...
Causal pathway and confounders
[Diagram nodes: socioeconomic status, alcohol, occupation, tobacco, lung cancer, CYP2D6]
Study designs
• Observational studies
• Case-control studies
• Cohort studies
• Treatment studies
• Randomized Controlled Trials (RCTs)
Case-control study
[Figure, built up over five slides: Initiation; Disease status; Sampling; Matching; Exposure? Axes: Age and Time. At sampling, a case has the disease and a control doesn’t have the disease.]
Smoking and cancer
• Hypothesis: Smokers get cancer more often than
non-smokers
• Data needs, at least:
• Two groups: smokers, non-smokers
• Data: Smoking status, cancer end-point status [from
Health 2000 or 2011]
• Data: Occupational exposure [from ASA registry], Age,
Sex [from Health 2000 or 2011]
• Study design: Case-control study
• Cases and controls sampled from Health 2011
Designed experiments
Treatment studies - Factorial design
• In designed experiments!
• Sometimes used in clinical trials, also
• A factor is a manipulated phenomenon, or a treatment, presumed to affect the experiment, e.g.:
• Name of the factor: factor levels
• Sex: male and female rats
• vitamin C: low and high level
• Factorial designs have at least two distinct factors
Full factorial design, terms
[Table: the factor Sex (levels Female, Male) crossed with the factor Diet (levels Normal, Chocolate); the four cells are labelled Group 1 to Group 4]
• The base is the number of factor levels and the exponent gives the number of factors. Thus, there is a family of full factorial designs that can be written as 2^k.
• The full factorial design shown on the previous slides is often written as 2^2 (or 2x2) and gives 2*2=4 different combinations or treatments.
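The counting rule above can be checked by enumerating the design. This is a minimal Python sketch, not part of the original slides; the function name `full_factorial` is made up for illustration, and the numbering of the groups is arbitrary (the slide does not say which cell is which group).

```python
from itertools import product

# Factors and their levels as named on the slide (Sex x Diet).
factors = {
    "Sex": ["Female", "Male"],
    "Diet": ["Normal", "Chocolate"],
}


def full_factorial(factors):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


design = full_factorial(factors)
for number, treatment in enumerate(design, start=1):
    # Group numbering here is arbitrary, not taken from the slide.
    print(f"Group {number}: {treatment}")
```

With k two-level factors the list has 2^k entries, so adding a third two-level factor would give 8 treatments.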
Question 3
• We have selected to use a case-control study.
• Could a similar hypothesis be studied with other designs? What about:
• Cohort study?
• Trial?
• Factorial design?
• And why or why not?
Explanation
• Smoking and lung cancer can be studied by:
• Cohort study
• But it can’t be studied with:
• Trials
• Factorial designs
• Why?
• Time from exposure to cancer -> prospective cohort
study
• Why not?
• Unethical to expose individuals to tobacco smoke on
purpose
Sample size
Sample size
• How many individuals do you need to have (in both groups) in order to be able
to find a statistically significant difference (between the groups)?
• Essential step!
• Many published studies are under-powered
• R. Tsang, L. Colley, L. D. Lynd. Inadequate statistical power to detect
clinically significant differences in adverse event rates in randomized
controlled trials. Journal of Clinical Epidemiology, 62:609–616, 2009.
• Educated guesswork
• Very straightforward: Go to the library, search for experiments similar to the one you are going to perform, and see how large a sample size was used in those.
• Formal power analysis
• Should be done before the experiment is conducted.
• Will complement the educated guesswork, or be worked out even without it.
• Can be used for estimating any one of the things listed on the following slide, if the other four are known or guessed.
These affect the sample size
• Desired power ↑ -> sample size ↑
• Desired significance level (the ”p-value” cut-off) ↑ -> sample size ↓
• Effect size ↑ -> sample size ↓
• Possibly estimated by a pilot study
• Amount of random variation ↑ -> sample size ↑
• Possibly estimated by a pilot study
• Desired levels for Type I and Type II errors
• Usually
• Type I (alpha) = 0.05 (the ”p-value” cut-off), false positives
• Type II (beta) = 0.20, false negatives; power = 1 – beta = 0.80
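The directions of these effects can be tried out with a rough calculation. This is a sketch only, not the method used in the lecture: it assumes a two-sided comparison of two proportions with the usual normal approximation and equal group sizes. The function name `n_per_group` and the proportions 0.30 and 0.20 are hypothetical; in a case-control study they would stand for the exposure prevalence among cases and among controls.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate group size for comparing two proportions (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # Type I error part
    z_beta = NormalDist().inv_cdf(power)           # power = 1 - beta
    variance = p1 * (1 - p1) + p2 * (1 - p2)       # random variation
    effect = p1 - p2                               # effect size
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


baseline = n_per_group(0.30, 0.20)
print("alpha 0.05, power 0.80:", baseline)
print("higher power (0.90):", n_per_group(0.30, 0.20, power=0.90))
print("looser alpha (0.10):", n_per_group(0.30, 0.20, alpha=0.10))
print("larger effect (0.40 vs 0.20):", n_per_group(0.40, 0.20))
```

A dedicated power-analysis tool is the better choice for a real study plan; the sketch is only meant to show which way each quantity moves the sample size.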
Power for a case-control study
Analysis
Statistical analysis plan
Write it before you have the data!
1. Introduction
2. Data sources
3. Analysis objectives
4. Analysis sets / populations / subgroups
5. Endpoints and covariates
6. Handling of missing values
7. Other data conventions
8. Statistical procedures
9. Adjustment for confounders, etc.
10. Sensitivity analyses
11. Rationale for deviation (during the analysis) from this plan
12. Quality control plan
13. Programming plans
14. References
15. Appendices
Adapted from https://www.pfizer.com/files/research/research_clinical_trials/Clinical_Data_Access_Request_Sample_SAP.pdf
Data manipulation
Data manipulation
• Missing values
• Not all individuals necessarily have values for all
variables
• For example, some individuals might lack information on age or sex
• Solutions
• Remove from the analysis all individuals with at least
one missing value
• Impute, or estimate, the missing values using
information from other variables
• SPSS offers, for example, a pairwise deletion option, but it can bias the results
Example
Individual  Age  Sex  Smoking  Cancer
1           64   M    S        1
2           79   M    S        0
3           ??   M    NS       0
4           91   M    NS       1
5           83   F    S        1
6           65   F    NS       0
7           90   F    NS       0
Example - imputation
Individual  Age  Sex  Smoking  Cancer
1           64   M    S        1
2           79   M    S        0
3           90   M    NS       0
4           91   M    NS       1
5           83   F    S        1
6           65   F    NS       0
7           90   F    NS       0
Example – case-wise deletion
Individual  Age  Sex  Smoking  Cancer
1           64   M    S        1
2           79   M    S        0
4           91   M    NS       1
5           83   F    S        1
6           65   F    NS       0
7           90   F    NS       0
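The two options can be sketched on the example table. This is an illustration only: the slides do not say how the imputed age of 90 was obtained, so the sketch uses plain mean imputation, which is a different and simpler method and gives a different value. The helper names `casewise_delete` and `impute_mean` are made up for this sketch.

```python
from statistics import mean

# The example table; None marks the missing age of individual 3.
rows = [
    {"id": 1, "age": 64, "sex": "M", "smoking": "S", "cancer": 1},
    {"id": 2, "age": 79, "sex": "M", "smoking": "S", "cancer": 0},
    {"id": 3, "age": None, "sex": "M", "smoking": "NS", "cancer": 0},
    {"id": 4, "age": 91, "sex": "M", "smoking": "NS", "cancer": 1},
    {"id": 5, "age": 83, "sex": "F", "smoking": "S", "cancer": 1},
    {"id": 6, "age": 65, "sex": "F", "smoking": "NS", "cancer": 0},
    {"id": 7, "age": 90, "sex": "F", "smoking": "NS", "cancer": 0},
]


def casewise_delete(rows):
    """Keep only individuals with no missing value in any variable."""
    return [r for r in rows if all(v is not None for v in r.values())]


def impute_mean(rows, key):
    """Fill missing values of one variable with the mean of the observed ones."""
    fill = mean(r[key] for r in rows if r[key] is not None)
    return [dict(r, **{key: fill}) if r[key] is None else dict(r) for r in rows]


complete = casewise_delete(rows)
imputed = impute_mean(rows, "age")
print(len(complete), "individuals left after case-wise deletion")
print("imputed age for individual 3:", round(imputed[2]["age"], 1))
```

Case-wise deletion loses one individual out of seven here; imputation keeps everyone but adds an assumption about the missing value.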
Smoking and cancer
• Hypothesis: Smokers get cancer more often than
non-smokers
• Data needs, at least:
• Two groups: smokers, non-smokers
• Data: Smoking status, cancer end-point status [from
Health 2000 or 2011]
• Data: Occupational exposure [from ASA registry]
• Study design: Case-control study
• Cases and controls sampled from Health 2011
• Analysis:
• Missing values for explanatory variables are imputed
Statistical analyses
Odds ratio – a measure of association
• Odds for cancer | smoker: 12 / 8 = 1.5
• Odds for cancer | non-smoker: 36 / 180 = 0.2
• Odds ratio = 1.5 / 0.2 = 7.5
• The odds of lung cancer for a smoker are 7.5 times the odds for a non-smoker
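The arithmetic can be reproduced in a few lines. This is a sketch, not the lecture's own code; the function name `odds_ratio` is made up, and the confidence interval is the simple Wald (log-scale normal approximation) interval, which is an assumption of this sketch and will not match the exact interval reported with Fisher's test.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def odds_ratio(a, b, c, d, level=0.95):
    """Odds ratio with a Wald confidence interval.

    a, b: exposed with / without the disease
    c, d: unexposed with / without the disease
    """
    estimate = (a / b) / (c / d)
    std_error = sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log(OR)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    lower = exp(log(estimate) - z * std_error)
    upper = exp(log(estimate) + z * std_error)
    return estimate, lower, upper


# Counts from the slide: smokers 12 / 8, non-smokers 36 / 180.
estimate, lower, upper = odds_ratio(12, 8, 36, 180)
print(f"OR = {estimate:.2f} ({lower:.2f}-{upper:.2f})")
```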
Chi square test for odds ratio
• Is smoking associated with the cancer status?
Pearson's Chi-squared test with Yates' continuity correction
data: m
X-squared = 18.6247, df = 1, p-value = 1.591e-05
Warning message:
In chisq.test(m) : Chi-squared approximation may be incorrect
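The output above looks like R's `chisq.test`. As a cross-check, the statistic can be recomputed by hand for the 2x2 table from the odds ratio slide (12, 8, 36, 180). This is a sketch with a made-up function name, `chisq_yates_2x2`; it also returns the smallest expected count, since an expected count below 5 is the usual reason for the warning shown.

```python
from math import erfc, sqrt


def chisq_yates_2x2(a, b, c, d):
    """Chi-squared test with Yates' correction for a 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    # Yates' correction shrinks |ad - bc| by n / 2 (never below zero).
    diff = max(abs(a * d - b * c) - n / 2, 0)
    statistic = n * diff ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom the upper tail equals erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(statistic / 2))
    smallest_expected = min(row1, row2) * min(col1, col2) / n
    return statistic, p_value, smallest_expected


statistic, p_value, smallest_expected = chisq_yates_2x2(12, 8, 36, 180)
print(f"X-squared = {statistic:.4f}, p-value = {p_value:.3e}")
print(f"smallest expected count = {smallest_expected:.2f}")
```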
Fisher’s exact test for odds ratio
Fisher's Exact Test for Count Data
data: m
p-value = 5.103e-05
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
2.575804 22.522438
sample estimates:
odds ratio
7.407224
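The exact p-value comes from the hypergeometric distribution of the table with its margins held fixed. The following is a sketch of that idea, not R's implementation; the function name `fisher_exact_two_sided` is made up, and the two-sided rule used (sum all tables that are no more probable than the observed one) is an assumption about which definition is wanted. The odds ratio printed by R is a conditional estimate, which is why it is 7.41 rather than the 7.5 computed from the raw counts; the sketch does not reproduce that estimate or its interval.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for a 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(k):
        # Probability of k in the top-left cell, margins fixed.
        return comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)

    smallest = max(0, row1 + col1 - n)
    largest = min(row1, col1)
    observed = prob(a)
    # Small tolerance so that ties with the observed table are included.
    return sum(
        p for p in map(prob, range(smallest, largest + 1))
        if p <= observed * (1 + 1e-7)
    )


p_exact = fisher_exact_two_sided(12, 8, 36, 180)
print(f"p-value = {p_exact:.3e}")
```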
What is a P-value?
• Technicalities:
– Null hypothesis: the odds ratio is equal to one
– Alternative hypothesis: the odds ratio is different from one
• The P-value gives us the probability that we would a) get a test statistic (here, X-squared) value at least as extreme as the one observed, or b) observe a data set at least as extreme, if the null hypothesis is true.
• Usually the P-value is compared to a cut-off, say 0.05,
and if the P-value is smaller than the cut-off, the
result is called statistically significant.
What is a P-value?
• P-values are used for testing hypotheses: one P-value per hypothesis!
– Does smoking predispose individuals for lung cancer?
– Does a larger exposure (more cigarettes smoked) give rise
to larger risk?
• If there is no hypothesis to be tested, do not
generate a P-value!
• P-value is not the whole story. Pay attention to the
effect size, also. More on this later.
What is a confidence interval?
• The 95% confidence interval can be thought of as the counterpart of a p-value with a cut-off of 0.05.
• If the same experiment were repeated, say, a hundred times, the confidence interval would contain the true population value (of the OR) on average 95 times out of a hundred.
• If the 95% confidence interval for an odds ratio does
not include one, the result is statistically significant at a
0.05 risk level.
• Used for giving an idea of how imprecise the result is.
Which test to use - table
Types of your dependent variable, in column order: (1) Interval/Ratio (Normality assumed); (2) Interval/Ratio (Normality not assumed), Ordinal; (3) Dichotomy (Binomial)
• Compare two unpaired groups: (1) Unpaired t test; (2) Mann-Whitney test; (3) Fisher's test
• Compare two paired groups: (1) Paired t test; (2) Wilcoxon test; (3) McNemar's test
• Compare more than two unmatched groups: (1) ANOVA; (2) Kruskal-Wallis test; (3) Chi-square test
• Compare more than two matched groups: (1) Repeated-measures ANOVA; (2) Friedman test; (3) Cochran's Q test
• Find relationship between two variables: (1) Pearson correlation; (2) Spearman correlation; (3) Cramer's V
• Predict a value with one independent variable: (1) Linear/Non-linear regression; (2) Non-parametric regression; (3) Logistic regression
• Predict a value with multiple independent variables or binomial variables: (1) Multiple linear/non-linear regression; (2) Poisson regression, survival analysis; (3) Multiple logistic regression
Adapted from http://yatani.jp/HCIstats/HomePage
Adjusting for confounding
Stratification
occupation                     cases                    controls
                               non-smokers   smokers    non-smokers   smokers
Housewives and white-collars   36            12         180           8
Other occupations              10            6          56            5

strata (Mantel & Haenszel, 1956)

hw & wc       Lung cancer   No lung cancer
Smokers       12            8
Non-smokers   36            180

other         Lung cancer   No lung cancer
Smokers       6             5
Non-smokers   10            56
Separate analysis
hw & wc       Lung cancer   No lung cancer
Smokers       12            8
Non-smokers   36            180
OR=7.4 (2.6-22.5)

other         Lung cancer   No lung cancer
Smokers       6             5
Non-smokers   10            56
OR=6.5 (1.4-32.9)
Stratified analysis
occupation                     cases                    controls
                               non-smokers   smokers    non-smokers   smokers
Housewives and white-collars   36            12         180           8
Other occupations              10            6          56            5
• Mantel-Haenszel’s test: A Chi Square test with weighting over a stratification variable
– OR = 7.2 (3.3 – 15.9)
– Effect of smoking is significant even when the
confounding variable (occupation) is adjusted for.
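The pooled estimate can be recomputed from the two strata. This is a sketch of the Mantel-Haenszel pooled odds ratio only (no test statistic or confidence interval); the function name `mantel_haenszel_or` is made up for illustration. The stratum-specific values it prints are plain cross-product odds ratios, so they differ slightly from the 7.4 and 6.5 shown under "Separate analysis", which look like conditional estimates.

```python
def mantel_haenszel_or(strata):
    """Mantel-Haenszel pooled odds ratio.

    Each stratum is (a, b, c, d):
    a = exposed cases, b = exposed controls,
    c = unexposed cases, d = unexposed controls.
    """
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator


# Smokers are the exposed group; counts from the stratified tables.
strata = [
    (12, 8, 36, 180),  # housewives and white-collars
    (6, 5, 10, 56),    # other occupations
]

for a, b, c, d in strata:
    print("stratum OR:", round(a * d / (b * c), 2))
pooled = mantel_haenszel_or(strata)
print("pooled OR:", round(pooled, 2))
```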
Regression modeling
Response variable   Example                                        Regression method
Continuous          Height of a person                             Linear regression
Dichotomous         Disease / no disease [case-control studies]    Logistic regression
Count               Number of naevi                                Poisson regression [cohort studies, and others]
Time                Time to death [cohort studies]                 Cox’s regression
• These are very general and flexible methods
• Several explanatory variables can be used in the model
• Interactions between explanatory variables can be modeled
• If you know these, you seldom need anything else, since e.g., t-test, ANOVA, and ANCOVA can all be performed using linear (regression) models.
Logistic regression
• Regression:
– Allows adjusting for several confounders and covariates at the same time
– Different types for different purposes
• Linear, logistic, Poisson, survival time, ...
• Logistic regression:
– The response (dependent) variable has two possible
values (yes / no)
– Estimates an odds ratio, confidence interval and a p-value for every variable or variable’s level.
Age, occupation and smoking
Variable     Level          OR (95% CI)
Age          <45            1
             45-54          1.91 (0.61-6.75)
             55-64          2.05 (0.68-7.24)
             >65            3.35 (1.07-12.18)
Occupation   housewife      1
             white-collar   0.92 (0.42-1.91)
             other          0.97 (0.46-1.97)
Smoking      no             1
             yes            9.97 (4.22-25.28)
• Effect of smoking is adjusted for both age and occupation at the same time.
• Note that after adjustment the OR is higher than the raw OR!
Causal pathway and confounders
[Diagram nodes: socioeconomic status, alcohol, occupation, tobacco, lung cancer, CYP2D6]
Causal pathway and confounders
[Diagram nodes: socioeconomic status, alcohol, tobacco, lung cancer, CYP2D6]
Genotype and smoking
• Observation: Smoking is associated with lung
cancer.
• Tobacco industry: observed association
between smoking and lung cancer could be
explained by some cancer predisposing
genotype that also creates a craving for
nicotine.
CYP2D6 genotype and smoking
• Hypothesis: Carriers of CYP2D6 inactivating allele(s) metabolize chemicals in tobacco faster than others, which makes these individuals smoke more often than others.
• Observation: Risk of lung cancer for carriers of inactivating mutation is 0.69 (95% CI = 0.52-0.90).
Pharmacogenetics. 1998 Jun;8(3):227-38.
CYP2D6 and smoking
Smokers   Genotype +   Genotype -
Case      95           233
Control   165          304
Fisher's Exact Test for Count Data
p-value = 0.06609
95 percent confidence interval:
0.5467929 1.0296625
sample estimates:
odds ratio
0.7514752
http://carcin.oxfordjournals.org/content/18/6/1203.full.pdf
Logistic regression for CYP2D6
Coefficients:
             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)   -2.6825      0.2694   -9.956    <2e-16 ***
smoking        2.4245      0.2759    8.787    <2e-16 ***
cyp2d6        -0.3116      0.1507   -2.068    0.0386 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Null deviance: 1329.1 on 1052 degrees of freedom
Residual deviance: 1189.7 on 1050 degrees of freedom
AIC: 1195.7
OR = 0.73 (0.54-0.99)
http://carcin.oxfordjournals.org/content/18/6/1203.full.pdf
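The odds ratio quoted under the output follows from the cyp2d6 coefficient by exponentiation. This is a minimal sketch of that step; the function name `coef_to_or` is made up, and the Wald interval (estimate plus or minus z times the standard error, on the log scale) is an assumption of this sketch, since the slides do not say how their interval was computed.

```python
from math import exp
from statistics import NormalDist


def coef_to_or(estimate, std_error, level=0.95):
    """Turn a logistic regression coefficient into an OR with a Wald interval."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return (
        exp(estimate),
        exp(estimate - z * std_error),
        exp(estimate + z * std_error),
    )


# Estimates and standard errors copied from the output above.
coefficients = {"smoking": (2.4245, 0.2759), "cyp2d6": (-0.3116, 0.1507)}

odds_ratios = {name: coef_to_or(*values) for name, values in coefficients.items()}
for name, (ratio, lower, upper) in odds_ratios.items():
    print(f"{name}: OR = {ratio:.2f} ({lower:.2f}-{upper:.2f})")
```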
CYP2D6 and smoking - predictions
Smoking            Pred = 0   Pred = 1
Status = 0         241        469
Status = 1         15         328

CYP2D6             Pred = 0   Pred = 1
Status = 0         247        463
Status = 1         98         245

Smoking + CYP2D6   Pred = 0   Pred = 1
Status = 0         241        469
Status = 1         15         328

Compare the tables: No effect?!
http://carcin.oxfordjournals.org/content/18/6/1203.full.pdf
Regression modeling
Cox regression example
Question 4
• We have collected data on
• Response variable:
• Lung cancer
• Explanatory variables:
• Smoking
• Age
• Sex
• Occupation
• What statistical method(s) would you use to assess the association of explanatory variables and lung cancer?
Smoking and cancer
• Hypothesis: Smokers get cancer more often than non-smokers
• Data needs, at least:
• Two groups: smokers, non-smokers
• Data: Smoking status, cancer end-point status [from Health
2000 or 2011]
• Data: Occupational exposure [from ASA registry], age, sex
[from Health 2011]
• Study design: Case-control study
• Cases and controls sampled from Health 2011
• Analysis:
• Missing values for explanatory variables are imputed
• Confounders are adjusted for using logistic regression
Clinical relevance
Statistical and clinical significance
• Even if the result is statistically significant, it may not be clinically significant
– Minimal clinically important difference (MCID)
• MCID has to be decided before the study
– Sometimes it is known beforehand, sometimes not,
and it has to be based on an educated guess.
• For case-control studies, MCID can also be
thought of as, e.g., how much some new
predictors help in setting the diagnosis.
COPD
• For forced expiratory volume in one second (FEV1) an
increase of about 100 mL, which can be perceived by
patients, is sometimes considered MCID.
• Bronchodilators in healthy persons:
– Salbutamol: FEV1 increase of 62 mL (0 – 152 mL)
• Bronchodilators in COPD patients (FEV1 in litres):
– Pre-salbutamol: 1.29 (0.80-2.12)
– Post-salbutamol: 1.53 (1.19-2.58)
– Post-placebo: 1.40 (1.36-1.42)
– Post-caffeine: 1.36 (1.31-1.41) [5 mg / kg, for asthma]
– Post-indacaterol: 1.71 (1.63-1.78)
– Post-formoterol: 1.65 (1.59-1.70)
COPD. 2005 Mar;2(1):111-24; Chest. 2008 Aug;134(2):387-93; Caffeine for asthma (Cochrane Review)
COPD
Smoking and cancer
• Hypothesis: Smokers get cancer more often than non-smokers
• Data needs, at least:
• Two groups: smokers, non-smokers
• Data: Smoking status, cancer end-point status [from Health
2000 or 2011]
• Data: Occupational exposure [from ASA registry], age, sex
[from Health 2011]
• Study design: Case-control study
• Cases and controls sampled from Health 2011
• Analysis:
• Missing values for explanatory variables are imputed
• Confounders are adjusted for using logistic regression
• Clinical relevance is set at OR > 2
Presenting results
Odds ratio – a measure of association
• Odds for cancer | smoker: 12 / 8 = 1.5
• Odds for cancer | non-smoker: 36 / 180 = 0.2
• Odds ratio = 1.5 / 0.2 = 7.5
• The odds of lung cancer for a smoker are 7.5 times the odds for a non-smoker
Graphical representation of the table
Age, occupation and smoking
Variable     Level          OR (95% CI)
Age          <45            1
             45-54          1.91 (0.61-6.75)
             55-64          2.05 (0.68-7.24)
             >65            3.35 (1.07-12.18)
Occupation   housewife      1
             white-collar   0.92 (0.42-1.91)
             other          0.97 (0.46-1.97)
Smoking      no             1
             yes            9.97 (4.22-25.28)
• Effect of smoking is adjusted for both age and occupation at the same time.
• Note that after adjustment the OR is higher than the raw OR!
Risk theatre
Non-smoker cases / 10 years
Smokers cases / 10 years
Doll & Hill 191566
Statins – base rate (absolute risk)
Statins – relative risk
Statins – benefits versus adverse effects
Statins – risk theatre
Statins – NNI / risk theatre
Smoking and cancer
• Hypothesis: Smokers get cancer more often than non-smokers
• Data needs, at least:
• Two groups: smokers, non-smokers
• Data: Smoking status, cancer end-point status [from Health 2000 or
2011]
• Data: Occupational exposure [from ASA registry], age, sex [from
Health 2011]
• Study design: Case-control study
• Cases and controls sampled from Health 2011
• Analysis:
• Missing values for explanatory variables are imputed
• Confounders are adjusted for using logistic regression. No subgroup
analyses are planned.
• Clinical relevance is set at OR > 2
• Results are presented as a regression table and graphically
Reporting guidelines
Reporting guidelines
• STROBE
• STrengthening the Reporting of OBservational studies in
Epidemiology
• CONSORT
• CONsolidated Standards of Reporting Trials
• Follow these!
STROBE
Methods
Study design (4): Present key elements of study design early in the paper
Setting (5): Describe the setting, locations, and relevant dates, including periods of recruitment, exposure, follow-up, and data collection
Participants (6):
(a) Give the eligibility criteria, and the sources and methods of case ascertainment and control selection. Give the rationale for the choice of cases and controls
(b) For matched studies, give matching criteria and the number of controls per case
Variables (7): Clearly define all outcomes, exposures, predictors, potential confounders, and effect modifiers. Give diagnostic criteria, if applicable
Data sources/measurement (8*): For each variable of interest, give sources of data and details of methods of assessment (measurement). Describe comparability of assessment methods if there is more than one group
Bias (9): Describe any efforts to address potential sources of bias
Study size (10): Explain how the study size was arrived at
Quantitative variables (11): Explain how quantitative variables were handled in the analyses. If applicable, describe which groupings were chosen and why
Statistical methods (12):
(a) Describe all statistical methods, including those used to control for confounding
(b) Describe any methods used to examine subgroups and interactions
(c) Explain how missing data were addressed
(d) If applicable, explain how matching of cases and controls was addressed
(e) Describe any sensitivity analyses
STROBE
Results
Participants (13*):
(a) Report numbers of individuals at each stage of study, eg numbers potentially eligible, examined for eligibility, confirmed eligible, included in the study, completing follow-up, and analysed
(b) Give reasons for non-participation at each stage
(c) Consider use of a flow diagram
Descriptive data (14*):
(a) Give characteristics of study participants (eg demographic, clinical, social) and information on exposures and potential confounders
(b) Indicate number of participants with missing data for each variable of interest
Outcome data (15*): Report numbers in each exposure category, or summary measures of exposure
Main results (16):
(a) Give unadjusted estimates and, if applicable, confounder-adjusted estimates and their precision (eg, 95% confidence interval). Make clear which confounders were adjusted for and why they were included
(b) Report category boundaries when continuous variables were categorized
(c) If relevant, consider translating estimates of relative risk into absolute risk for a meaningful time period
Data management
Reproducible research
• To sum up the previous steps:
• Data gathering
• Data analysis
• Data presentation
• Working habit:
• Document everything
• Everything is a text file
• Save in an open file format
• Files should be human readable
• Tie your files together
• Have a data management plan
• Organization, (long-term) storage, availability
• Use versioning on all files
Clinical trials at Duke
• Potti et al. studied chemosensitivity of cancer cell lines.
• Results were going to be applied in a clinical trial.
• And so it begins...
• http://bioinformatics.mdanderson.org/Supplements/ReproRsch-All/Modified/StarterSet/index.html
Summary in two minutes
• Coombes et al. delved into the analysis...
• Doxorubicin
• Sensitive / resistant labels were reversed in the analysis
• Some samples in the test data were duplicated
• Some samples are labeled both sensitive and resistant
• Cisplatin and pemetrexed
• Gene lists were off by one, the correct list does not
differentiate the cell lines
• Some genes are not on arrays that were used
• Sensitive / resistant labels are again reversed
• And the list goes on, see:
http://arxiv.org/pdf/1010.1092.pdf
Meticulous documentation
• Protect the individuals recruited for the study
• Protect your co-workers and co-authors
• Protect yourself
• Work openly.
• Everybody makes mistakes. Embrace and learn
from them!
Smoking and cancer
• Hypothesis: Smokers get cancer more often than non-smokers
• Data needs, at least:
• Two groups: smokers, non-smokers
• Data: Smoking status, cancer end-point status [from Health 2000 or 2011 and
cancer registry]
• Data: Occupational exposure [from ASA registry], age, sex [from Health 2011]
• Study design: Case-control study
• Cases and controls sampled from Health 2011
• Analysis:
• Missing values for explanatory variables are imputed
• Confounders are adjusted for using logistic regression. No subgroup analyses
are planned.
• Clinical relevance is set at OR > 2
• Results are presented as a regression table and graphically
• Data management
• Primary data is available from the mentioned registries. Code documenting data
manipulation and analyses is published as supplementary information with the
article.
Wrap-up
Summary
• The analysis methods are coupled to the study
design which is itself affected by the hypotheses
• Write the analysis plan before you have the data
• Prepare for small deviations, but don’t change the major
themes
• Learn regression methodology, it will serve you well
• Make a data management plan. Document
everything
• Learn from mistakes. Everybody makes them.
• Also, protect the innocent. It’s better to have a horrible
end than horrors without end.
Question 5 - Homework
10 but the official told Daniel, "I am afraid of my lord the king, who has assigned your food and drink. Why
should he see you looking worse than the other young men your age? The king would then have my head
because of you."
11 Daniel then said to the guard whom the chief official had appointed over Daniel, Hananiah, Mishael
and Azariah,
12 "Please test your servants for ten days: Give us nothing but vegetables to eat and water to drink.
13 Then compare our appearance with that of the young men who eat the royal food, and treat your
servants in accordance with what you see."
14 So he agreed to this and tested them for ten days.
15 At the end of the ten days they looked healthier and better nourished than any of the young men who
ate the royal food.
16 So the guard took away their choice food and the wine they were to drink and gave them vegetables
instead.
17 To these four young men God gave knowledge and understanding of all kinds of literature and
learning. And Daniel could understand visions and dreams of all kinds.
18 At the end of the time set by the king to bring them in, the chief official presented them to
Nebuchadnezzar.
19 The king talked with them, and he found none equal to Daniel, Hananiah, Mishael and Azariah; so they
entered the king's service.
20 In every matter of wisdom and understanding about which the king questioned them, he found them
ten times better than all the magicians and enchanters in his whole kingdom.
• Does this description fulfill the STROBE guidelines? If not, what is missing?