Causal Story

Using Domain Knowledge to Construct Causal Models
from Clinical Observational Data
Scott A. Malec1, Rastegar M. Majid2, Hongfang Liu2, Elmer V. Bernstam1, Peng Wei3, Trevor Cohen1
1 The School of Biomedical Informatics, The University of Texas Health Science Center at Houston, 2 Mayo Clinic, Rochester, MN,
3 Biostatistics Department, Division of Quantitative Sciences, University of Texas MD Anderson Cancer Center, Houston, TX
Introduction
• Adverse drug events (ADEs) burden health systems worldwide and
pose a lethal danger to individuals.
• Example: Vioxx, a COX-2 inhibitor, and heart attacks.
• Drug safety evaluation does not end with FDA approval, since not all
scenarios can be anticipated from randomized controlled trials; hence
the ongoing need for pharmacovigilance (PV), the post-marketing
surveillance of pharmaceuticals and other therapies.
• Problem: Spontaneous Reporting Systems (e.g., FAERS) have been the
primary source of data for PV, but these are known to be incomplete
and inaccurate. Electronic Health Records (EHR), which reflect routine
clinical practice, could also be used, but these present other
challenges, e.g., confounding, or exogenous sources of bias between
variables[1,2,3].
• Motivation: we would like to use EHR as a data source for PV, but it
contains exogenous sources of bias, known as “confounding”. We are
seeking to overcome confounding by using the literature to identify
confounding variables with which to perform statistical “confounding
adjustment” and to inform and construct causal models.
• Solution: to determine causality given clinical observational data
(derived from EHR) and a drug/ADE pair, synthesize a causal Bayesian
network from these inputs together with their causal story.
• What we propose is tantamount to simulating abductive reasoning, or
reasoning from observational data to a parsimonious plausible theory.
Background
• Conventional adjustment methods: “explaining away” regression
(Figure A), empirical identification of confounding and lasso
regression[1], meta-analysis.
• Literature-based Discovery (LBD)[4]: identify novel therapeutic
hypotheses from implicit relationships between concepts.
• “Discovery patterns” semantically constrain relationships between
concepts: CAUSES, ISA, INHIBITS, etc.[5,6] (Figure B).
• Causal Story: relevant concepts (nodes) and constraints (edges) that
pertain to the variables of interest, which can be identified with LBD.
• Causal methods indicate direction of influence, beyond correlation
(Figure E).
• Pearl’s formalization of causality: graph + structural equations.
• Counterfactual intervention: the data-generating mechanism is
artificially perturbed. Causal Bayesian network (Figures C & D).
Data and Methods
Data
• UTHealth BIG subset: ~2.2 million free-text outpatient clinical
notes, processed with MedLEE to identify drugs/ADEs[8].
• Mayo Clinic data: ~3.6 million free-text in- and outpatient clinical
notes from 131,000 patients for whom the Mayo Clinic was their home
of primary care, a cohort with extensive records, processed with
medTagger and medXN (http://www.ohnlp.org) for concept
identification and medication detection, respectively[9].
• Co-occurrence based; no patient-level or temporal modeling.
Knowledge
• SemMedDB version 24_32 (version 1.5 of SemRep), accessed using the
EpiphaNet LBD system (http://epiphanet.uth.tmc.edu) for
convenience.
• Given a predictor/outcome (drug/ADE) pair, identify concepts within
its causal orbit and the causal relationship, if any, between the
drug's pharmacological class and the ADE (e.g., Ibuprofen ∈ NSAIDs,
NSAIDs → GIB). This is the LBD-identified “causal story”.
• Two tiers: a predictor tier (drug + confounders) and an outcome
tier (ADE).
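One way to read the two-tier arrangement is as background knowledge for a causal search: no edge may point from the outcome tier back into the predictor tier. The sketch below illustrates that reading only. The helper `forbidden_edges` and the variable names are hypothetical, and this is neither the authors' code nor Tetrad's API.

```python
from itertools import product

def forbidden_edges(tiers):
    """Return directed edges (src, dst) that run from a later tier to an earlier one.

    `tiers` is an ordered list of lists of variable names; tier 0 comes first.
    """
    forbidden = set()
    for later in range(len(tiers)):
        for earlier in range(later):
            # Every variable in a later tier is barred from causing
            # any variable in an earlier tier.
            forbidden.update(product(tiers[later], tiers[earlier]))
    return forbidden

# Hypothetical example: predictor tier (drug + confounders), outcome tier (ADE).
tiers = [["ibuprofen", "aspirin", "hemorrhoids"], ["GIB"]]
banned = forbidden_edges(tiers)
```

With these tiers, `banned` contains edges such as `("GIB", "ibuprofen")`, while `("ibuprofen", "GIB")` remains allowed.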
Reference Standard
• 165 positive and 234 negative examples of drug/ADE pairs[7],
developed through manual curation.
• Four ADEs: acute kidney injury (AKI), acute liver injury (ALI),
gastrointestinal bleeding (GIB), myocardial infarction (MI).
Method and Evaluation
1. For each drug/ADE pair, use DPs to identify “causal stories”
(Figure F).
2. Create statistical models using multivariate logistic regression as
a baseline.
Discovery Patterns (DPs) (“manually selected”)
• Alternate etiology DPs: CAUSES, PREDISPOSES[10].
• “True Confounder” DP: TREATS+COEXISTS_WITH[2,10].
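The discovery patterns above can be pictured as filters over subject-predicate-object predications. The following is a minimal sketch, not the authors' pipeline. The triples are invented toy data, `causal_story` is a hypothetical helper, and the reading of TREATS+COEXISTS_WITH as "drug TREATS C, C COEXISTS_WITH ADE" is an assumption.

```python
from collections import defaultdict

# Hypothetical toy predications (subject, predicate, object); not SemMedDB output.
PREDICATIONS = [
    ("ibuprofen", "TREATS", "arthritis"),
    ("arthritis", "COEXISTS_WITH", "gastrointestinal bleeding"),
    ("alcohol", "CAUSES", "gastrointestinal bleeding"),
    ("peptic ulcer", "PREDISPOSES", "gastrointestinal bleeding"),
    ("ibuprofen", "TREATS", "fever"),
]

def causal_story(drug, ade, predications):
    """Collect candidate concepts for a drug/ADE pair using two discovery patterns."""
    by_pred = defaultdict(set)
    for subj, pred, obj in predications:
        by_pred[pred].add((subj, obj))

    # Alternate etiology: C CAUSES ADE or C PREDISPOSES ADE.
    alternate = {
        subj
        for pred in ("CAUSES", "PREDISPOSES")
        for subj, obj in by_pred[pred]
        if obj == ade and subj != drug
    }

    # "True confounder" (assumed reading): drug TREATS C and C COEXISTS_WITH ADE.
    treated = {obj for subj, obj in by_pred["TREATS"] if subj == drug}
    confounders = {
        subj for subj, obj in by_pred["COEXISTS_WITH"]
        if obj == ade and subj in treated
    }
    return {"alternate_etiology": alternate, "true_confounder": confounders}

story = causal_story("ibuprofen", "gastrointestinal bleeding", PREDICATIONS)
```

On the toy triples, `story` lists "alcohol" and "peptic ulcer" as alternate etiologies and "arthritis" as a confounder candidate. "fever" is dropped because it has no link to the ADE.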
Tables and Charts
Figure A: Coefficient shrinkage with multivariate logistic regression for Nevirapine/GIB (UTHealth). With 0 additional variables (confounders), the coefficient is 16.788.
Figure B: LBD and discovery patterns.
Figure C: Confounding vs. conditional independence.
Figure D: Several directed graphs may fit any set of data (referred to as partial ancestral graphs, or PAGs).
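The contrast named in Figure C, confounding vs. conditional independence, can be illustrated with a small simulation. This is a sketch under stated assumptions rather than a reproduction of the figure: Z is the only cause of both X and Y, the data are synthetic, and conditional independence is checked with the standard first-order partial correlation formula. The function names are hypothetical.

```python
import math
import random

def corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)

def partial_corr(x, y, z):
    """Correlation of x and y after linearly removing the influence of z."""
    rxy, rxz, ryz = corr(x, y), corr(x, z), corr(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Synthetic data: Z -> X and Z -> Y, with no edge between X and Y.
rng = random.Random(1)
z = [rng.gauss(0, 1) for _ in range(20000)]
x = [zi + rng.gauss(0, 1) for zi in z]
y = [zi + rng.gauss(0, 1) for zi in z]

marginal = corr(x, y)                # theoretical value 0.5: X and Y look associated
conditional = partial_corr(x, y, z)  # theoretical value 0: association vanishes given Z
```

Here the marginal association is induced entirely by Z, so conditioning on Z removes it.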
•
Predict risk with instantiated parameterized models[13].
•
Integrate more causal knowledge.
•
Causal models for MI performed well in both data sets.
•
The incorporation of minimal causal knowledge germane to a
predictor/outcome pair (the pharmacological class of the drug) usually
improves performance.
•
However, for UTHealth/GIB there was no gain between models that were
agnostic vs. gnostic of the causal drug pharmacological class/ADE
association. ROC performance also took an unexplained hit for Mayo/ALI.
•
Expand variety of causal discovery algorithms.
•
Create models that combine concepts from multiple
discovery pathways for more complete causal stories.
Figure A, additional variable 2 (HIV infections): Nevirapine/GIB coefficient 6.219.
Structural equation:
$f_Y(x) = \beta x + \varepsilon$, where $\varepsilon$ is the “error term”.
X causes Y if X is in the structural equation that defines Y.
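To make the structural-equation view concrete, the sketch below simulates data from a linear structural equation for Y that also includes a confounder Z, then compares the unadjusted and Z-adjusted slopes for X. It is an illustration under assumptions: the poster's Figure A uses logistic regression, whereas this uses ordinary least squares for brevity, and the coefficients 0.8 and 1.5 and all function names are arbitrary choices.

```python
import random

def simulate(n=20000, beta=0.5, seed=0):
    """Draw (x, z, y) rows where Z confounds the effect of X on Y."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.gauss(0, 1)                       # confounder
        x = 0.8 * z + rng.gauss(0, 1)             # exposure depends on Z
        y = beta * x + 1.5 * z + rng.gauss(0, 1)  # structural equation for Y
        rows.append((x, z, y))
    return rows

def cov(a, b):
    """Population covariance of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / len(a)

def unadjusted_slope(rows):
    """Least-squares slope of Y on X alone."""
    x, _, y = zip(*rows)
    return cov(x, y) / cov(x, x)

def adjusted_slope(rows):
    """Least-squares coefficient of X when Z is also in the model."""
    x, z, y = zip(*rows)
    sxx, szz, sxz = cov(x, x), cov(z, z), cov(x, z)
    sxy, szy = cov(x, y), cov(z, y)
    return (sxy * szz - sxz * szy) / (sxx * szz - sxz ** 2)

rows = simulate()
naive = unadjusted_slope(rows)   # inflated by the Z -> X and Z -> Y paths
adjusted = adjusted_slope(rows)  # shrinks toward the structural beta of 0.5
```

Adding the confounder to the model shrinks the coefficient of X toward its structural value, which is the "explaining away" behaviour that Figure A reports for Nevirapine/GIB.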
Simulating Abductive Reasoning in silico
“No causation without manipulation.” – Paul W. Holland (1986)
Figure A, additional variable 3 (hemorrhoids): Nevirapine/GIB coefficient 2.579.
•
By synthesizing relevant domain knowledge with observational
data into a causal Bayesian network that explains the data, we
simulated aspects of abductive reasoning.
•
Constrained by “discovery patterns” (DPs), domain knowledge
(“causal story”) arrives in the form of causally germane concepts
(nodes) and relationships (edges)[3,5].
Figure A, additional variable 4 (aspirin): Nevirapine/GIB coefficient 0.362.
Figure E: Regression graph vs. causal graph.
ADE DP
CAUS
AKI
TCOE
CAUS
ALI
GIB
0.5957 (0.5565)
CAUS
0.8146 (0.8146)
PRED
Mayo Clinic Baselines
AUC (Mayo): with drug class (without drug class)
•
NA
UnAdj: 0.6128 0.6154 (0.52447)
Adj: 0.5053-0.6128 0.7292 (0.6181) UnAdj: 0.5000 0.7303 (0.5166)
TCOE
CAUS
AMI
AUC (UTHealth): with drug class (without drug class)
UnAdj: 0.5670
0.7169 (0.5838)
Adj: 0.4520-0.5297
0.6165 (0.5222)
PRED
TCOE
Figure F: Rejected LBD concepts.
UTHealth BIG
Baselines
UnAdj: 0.6530
0.8015 (0.8015)
Adj: 0.6818-0.7189
0.8562 (0.8573)
PRED
UnAdj: 0.5112 0.7461 (0.5606)
Adj: 0.5032-0.5196
0.6368 (0.549)
TCOE
0.6368 (0.549)
0.6773 (0.6172)
UnAdj: 0.6813
0.6413 (0.5045)
0.67 (0.6072)
•
0.8777 (0.5733)
UnAdj: 0.5790
This DP-filtered knowledge enforces both parsimony and
consistency with what is known, thereby improving reasoning
quality and enhancing causal detection performance, much as
human experts (and even certain toddlers) leverage knowledge
derived from experience to rapidly identify salient features of an
unfamiliar or novel problem space[11,12].
Abduction “satisfices rather than maximizes its response to the
agent's cognitive target”[11].
Concluding Remarks
•
Our LBD-enhanced causal modeling method offers a
significant performance boost over traditional statistical
methods for the task of detecting causal drug/ADE
associations from clinical observational data.
•
Our method should be broadly applicable in any area of
biomedicine that utilizes observational data as input.
References
1. Li Y, Salmasian H, Vilar S, Chase H, Friedman C, Wei Y. A method for controlling complex confounding
effects in the detection of adverse drug reactions using electronic health records. J Am Med Inform Assoc.
2014;21(2):308–14.
2. Pearl J. Causality: models, reasoning, and inference. 2nd ed. New York: Cambridge University Press; 2009.
3. Pearl J, Glymour M, Jewell N. Causal inference in statistics: a primer. Chichester, UK: Wiley; 2014.
4. Swanson D. Fish oil, Raynaud’s syndrome, and undiscovered public knowledge. Perspect Biol Med.
1986;30(1):7–18.
5. Hristovski D, Friedman C, Rindflesch TC, Peterlin B. Exploiting semantic relations for literature-based
discovery. AMIA Annu Symp Proc. 2006:349–53.
6. Shang N, Xu H, Rindflesch T, Cohen T. Identifying plausible adverse drug reactions using knowledge
extracted from the literature. J Biomed Inform. 2014;52:293–310.
7. Ryan P, Schuemie M, Welebob E, Duke J, Valentine S, Hartzema A. Defining a reference set to support
methodological research in drug safety. Drug Saf. 2013;36:S33–47.
8. Friedman C. A broad-coverage natural language processing system. AMIA Annu Symp Proc. 1995:270–4.
0.8777 (0.5941)
Limitations
9. Torii M, Wagholikar K, Liu H. Using machine learning for concept extraction on clinical documents from
multiple data sources. J Am Med Inform Assoc. 2011;18(5):580–7.
0.8616 (0.578)
•
10. Malec SA, Wei P, Bernstam EV, Myneni S, Cohen T. Using literature to identify confounding variables in
clinical observational data. AMIA Annu Symp Proc, Chicago. 2016; 118–173.
11. Gabbay DM, Woods J. A practical logic of cognitive systems, volume 2, the reach of abduction: insight and
trial. Amsterdam, Netherlands: Elsevier; 2005.
12. Rogers TT, McClelland JL. Semantic cognition: a parallel distributed processing approach. Cambridge, MA:
MIT Press; 2004.
13. Darwiche A. Modeling and reasoning with Bayesian networks. Cambridge, UK: Cambridge University Press; 2009.
0.7639 (0.5417)
UnAdj: 0.5397
Future Work and Conclusion
With causal models, we see performance gains of 5–30% over
traditional statistical methods.
Figure A, additional variable 1 (AIDS): Nevirapine/GIB coefficient 6.673.
4. Evaluation: calculate ROCs for statistical and causal models.
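For the evaluation step, the area under the ROC curve (the AUC reported in the results table) can be computed directly from scores and reference-standard labels. The sketch below uses the standard rank-based formulation: AUC is the probability that a randomly chosen positive pair outscores a randomly chosen negative pair. The function name `roc_auc` and the example scores are hypothetical, and this is not the authors' evaluation code.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: iterable of 1 (positive drug/ADE pair) or 0 (negative pair)
    scores: iterable of model scores, higher meaning "more likely causal"
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical scores for two positive and two negative pairs.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

The pairwise form is quadratic in the number of examples, which is acceptable for a reference set of a few hundred drug/ADE pairs.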
Discussion of Results
3. Use causal discovery tools (Tetrad, http://github.com/bd2kccd,
using Fast Greedy Search [FGS]) to produce plausible causal
hypotheses (in graph form) to explain the data.
0.7557 (0.5335)
0.7170 (0.5235)
•
Data transformation issues. Current results reflect ~75% of the
UTHealth data set vs. 97% of the Mayo Clinic data set.
•
Truly novel causal associations may be overlooked because they
have been historically under-recognized in the literature.
Implications
•
Prioritize drugs for regulatory review.
•
Open domain applications (any area with observational data and a
source of structured knowledge).
ADEs: AKI: acute kidney injury, ALI: acute liver injury, GIB: gastrointestinal bleeding, MI: myocardial infarction
Discovery Patterns (DPs): CAUS: CAUSES, PRED: PREDISPOSES, TCOE: TREATS + COEXISTS_WITH
Miscellaneous: UnAdj: unadjusted correlation coefficients, Adj: confounding-adjusted correlation coefficients
Acknowledgements
This research was supported by US National Library of Medicine grant R01 LM011563, NCATS
UL1 TR000371, and NIH/BD2K supplement R01 LM011563-02S1.
Please contact the first author via email: [email protected]