Risk modeling of colorectal cancer using machine learning algorithms on a
hybridized genealogical and clinical dataset
David P. Taylor, MS 1,2, Lisa A. Cannon-Albright, PhD 1, Randall W. Burt, MD 1,3, Jason P. Jones, PhD 1,2,
Carol Sweeney, PhD 1, Marc S. Williams, MD 1,2, Peter J. Haug, MD 1,2
1 University of Utah, Salt Lake City, UT;
Introduction
Although colorectal cancer (CRC) is the third most
common cancer diagnosed in the United States
and the second leading cause of death among
cancers, a majority of US adults are not being
screened regularly or appropriately. Knowledge of
increased risk can be a motivating factor in
deciding to be screened. Family history is a well-established risk factor for CRC and the
incorporation of family history in electronic health
records (EHRs) in a structured format is gaining
momentum.1 A comprehensive CRC risk model
that includes a variety of risk/protective factors has
been developed.2 However, data were generally
self-reported and family history of CRC was limited
to affected first-degree relatives (FDRs). In an
ongoing research project, we have investigated risk
from combinations of affected first-, second-, and
third-degree relatives (FDRs, SDRs, and TDRs,
respectively) in more than 2.3 million probands in
the Utah Population Database (UPDB), a
population-based, electronic, genealogical
resource that also contains cancer registry
records.3 Many individuals in the UPDB also have
clinical records in the Intermountain Healthcare
system. Clinical data for particular CRC risk
factors are available electronically starting in 1993;
however, availability is sporadic until the 2000-2002 time frame.
Our goal is to build a risk model for CRC that
includes objective clinical and family history data,
and in particular, to compare the family history
component of risk with the component from
available clinical/behavioral factors. An important
outcome of this work will be to determine if typically
available, electronic, clinical and administrative
data can improve the predictive power of a risk
model beyond that obtained using just family
history, and if so, how the model may be
generalized to other settings.
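The comparison described above can be made concrete with the metric the Methods name for it, the area under the ROC curve (AUC). What follows is a minimal sketch, assuming each model produces one numeric risk score per person. The helper `auc`, the labels, and both score lists are hypothetical and are not data from this study.

```python
def auc(labels, scores):
    """AUC as the probability that a random case outscores a random control.

    Ties between a case and a control count as half a win.
    """
    case_scores = [s for y, s in zip(labels, scores) if y == 1]
    control_scores = [s for y, s in zip(labels, scores) if y == 0]
    if not case_scores or not control_scores:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Invented example: 1 = case, 0 = control.
labels = [1, 1, 0, 0, 0, 0]
family_only = [0.6, 0.2, 0.5, 0.1, 0.1, 0.3]  # hypothetical family-history-only scores
combined = [0.7, 0.4, 0.5, 0.1, 0.2, 0.3]     # hypothetical family + clinical scores

print("family history only:", auc(labels, family_only))
print("family + clinical:  ", auc(labels, combined))
```

A higher AUC for the combined scores on held-out data would suggest that the clinical variables add predictive power beyond family history alone.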
Contact Information
David Taylor
[email protected]
2 Intermountain Healthcare, Salt Lake City, UT;
Influence of Family History on CRC Risk
In an analysis of 2,327,327 individuals included in ≥3
generation family histories, 10,556 had a diagnosis of
CRC. An increased number of affected FDRs influences
risk much more than affected SDRs or TDRs do.
However, when combined with a positive first-degree
family history, a positive second- and third-degree
family history can significantly increase risk. Age at
diagnosis of CRC in affected relatives contributes
significantly to risk estimates. Even a diagnosis
between 60 and 69 years of age in an affected FDR
raises risk to a level comparable to that of having an
affected FDR without respect to age at diagnosis.
3 Huntsman Cancer Institute, Salt Lake City, UT

Number of affected FDRs | Number of affected SDRs | Number of affected TDRs | N (probands) | Familial relative risk (FRR) | Lower CI (95% level) | Upper CI (95% level)
0 | 0 | 0 | 1,470,367 | 0.83 | 0.81 | 0.86
0 | 1 | 2 | 20,321 | 1.33 | 1.13 | 1.55
1 | NA | NA | 87,089 | 1.91 | 1.82 | 2.00
≥1 | NA | NA | 94,931 | 2.05 | 1.96 | 2.14
≥1 (dx age <50) | NA | NA | 6,291 | 3.31 | 2.79 | 3.89
≥1 (dx age 60-69) | NA | NA | 25,084 | 2.22 | 2.04 | 2.40
1 | 1 | 0 | 8,836 | 1.88 | 1.59 | 2.20
1 | 1 | ≥3 | 1,357 | 3.28 | 2.44 | 4.31
2 | NA | NA | 6,966 | 3.01 | 2.66 | 3.38
Mother and father both affected | NA | NA | 450 | 4.97 | 2.72 | 8.34
3 | NA | NA | 762 | 4.43 | 3.24 | 5.90
Table 1: Selected familial relative risks (FRR) for constellations
of affected FDRs, SDRs, and TDRs.
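The poster does not state how the FRRs and confidence limits in Table 1 were estimated. As a rough illustration only, here is a minimal sketch that assumes an FRR is the ratio of observed to expected CRC cases among probands with a given constellation, and that the observed count is Poisson distributed, with limits from Byar's approximation. The function name `frr_with_ci` and the counts are hypothetical.

```python
import math


def frr_with_ci(observed, expected, z=1.96):
    """Return (FRR, lower, upper) for observed vs. expected case counts.

    Assumption: observed cases follow a Poisson distribution; the limits
    use Byar's approximation to the exact Poisson interval.
    """
    if observed <= 0 or expected <= 0:
        raise ValueError("observed and expected must be positive")
    frr = observed / expected
    lower_count = observed * (
        1 - 1 / (9 * observed) - z / (3 * math.sqrt(observed))
    ) ** 3
    o1 = observed + 1
    upper_count = o1 * (1 - 1 / (9 * o1) + z / (3 * math.sqrt(o1))) ** 3
    return frr, lower_count / expected, upper_count / expected


# Hypothetical counts, not taken from the poster.
frr, lower, upper = frr_with_ci(observed=100, expected=50.0)
print(f"FRR = {frr:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

Under these assumptions, an interval that excludes 1.0 indicates a risk that differs from the population expectation at the chosen confidence level.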
Methods for Model Incorporating Clinical Variables and Family History
Data included in analysis from 1993 until reference date (most data are from 1998 and on)
1 Select cases
• Seen as an inpatient or outpatient at Intermountain at least once during the years 1996-2000, at age 18 or older
• Member of a UPDB dataset containing individuals who are part of a ≥3 generation family history
• Diagnosed (for the first time) with CRC ≥ year 2000
• CRC site/histology codes specified by Utah Cancer Registry
2 Select controls
• Same criteria as cases (except diagnosis)
Matching criteria:
• +/- 1 year on birth year
• Sex
• Has to have visit/stay 1 year prior to case Dx date or anytime after
• 1:10 case/control matching ratio
3 Gather clinical data
• Exposure data ending at a ‘Reference’ date 6 months before the diagnosis date for each case, and at the same reference date for controls matched to that case
• Data gathered: other diagnoses, medications, BMI, screening procedures, tobacco and alcohol use
4 Collect and aggregate family history data
• Collect CRC and other cancer histories for all known FDRs, SDRs, and TDRs of cases and controls
• Tally counts of affected relatives per case/control by cancer
5 Build and test models
• Prepare dataset for analysis
• Withhold part of dataset for testing or utilize bootstrap or cross-validation
• Compare performance of models using Area Under the Curve (AUC) of an ROC
[Figure 1 timeline. Rows: Case, Control. Time points: Birth (+/- 1 year), 1996, 2000, 1 year prior to Dx, 6 months prior to Dx (reference date), CRC Dx. Annotations: Must have ≥1 visit/stay documented in this time period; Control must have visit/stay after this date; First CRC Dx date must be > year 2000]
Figure 1: Example timelines for a case and matched control included in dataset

Selected Preliminary Results
Finding | Presence in cases (%) | Presence in controls (%) | Year data are first available
Colonoscopy | 4.4 | 9.5 | 1997
Flexible sigmoidoscopy | 2.7 | 4.3 | 1997
FOBT | 7.5 | 10.1 | 1995
Adenoma | 2.2 | 2.4 | 1999
Advanced adenoma | 1.5 | 1.1 | 1999
IBD | 5.2 | 5.2 | 1994
Diabetes | 18.8 | 14.1 | 1994
Tobacco use | 2.9 | 1.6 | 2004
Statin order | 8.6 | 7.7 | 1997
HRT order | 4.4 | 3.2 | 1997
Table 2: Comparisons between 789 cases and 7886
controls for presence of selected findings in the EHR.
[Figure 2 bar chart. Vertical axis: percentage of individuals, 0% to 40%. Categories: Colonoscopy, Flex Sig, FOBT, Adenoma, Advanced Adenoma. Series: All, 0 aff FDRs, ≥1 aff FDR]
Figure 2: Percentages of individuals from a sample of 128,927 relatives of cases and
controls, not known to be deceased, between 50 and 100 years of age, not known to
have CRC, with evidence of having been screened at least once and (separately)
evidence of adenomas in the EHR. Chart shows percentages for all individuals, those
with 0 affected FDRs, and those with ≥1 affected FDR.

Challenges
• Documented medication order does not mean the medication was taken by patient
• Not able to determine lifetime exposure from available dataset, which represents 1993 and later
• Other missing/incomplete data (e.g., patient might be seen outside Intermountain; incomplete documentation such as tobacco/alcohol or aspirin use; can’t distinguish “no” from “unknown” status for many variables)
• Erroneous data (e.g., BMI out of possible range, duplicate records). (Known erroneous data disregarded and only individuals with single identifier included in study.)
• Limited contribution of individual risk/protective factors to CRC risk
Despite the challenges we are not able to mitigate,
we hope the high-quality family history data as well
as screening and disease diagnosis data may still
provide valuable insights into CRC risk, with further
refinement of research objectives, study design,
and the data analyzed.

Future Work
• Determine whether an expanded set of cases and matched controls improves performance
• Use familial relative risk estimates, current ages, and local morbidity and mortality rates in a sample of patients to estimate absolute risks of acquiring CRC over a period of time
• Estimate gaps in risk-appropriate screening based on review of EHR in “relatives” sample

References
1. Feero WG, Bigley MB, Brinner KM, et al. New Standards and Enhanced Utility for Family Health History Information in the Electronic Health Record: An Update from the American Health Information Community's Family Health History Multi-Stakeholder Workgroup. J Am Med Inform Assoc. 2008 Nov-Dec;15(6):723-8.
2. Freedman AN, Slattery ML, Ballard-Barbash R, et al. Colorectal cancer risk prediction tool for white men and women without known susceptibility. J Clin Oncol. 2009 Feb 10;27(5):686-93.
3. Taylor DP, Burt RW, Williams MS, Haug PJ, Cannon-Albright LA. Population-based family-history-specific risks for colorectal cancer: a constellation approach. Gastroenterology. (In press)