Biomarkers of Asthma Ancillary Study of CAMP
Case Study for Clinical
Relevancy: Asthma
Scott T. Weiss, M.D., M.S.
Professor of Medicine
Harvard Medical School
Director, Center for Genomic Medicine
Director, Program in Bioinformatics
Associate Director, Channing Laboratory
Brigham and Women’s Hospital
Boston, MA
Outline
• Context: focus on process and data
• Overview of Asthma DBP
• Smoking as an example of the data issues
• Predicting COPD in those with asthma
• Predicting asthma exacerbations
• Genetic prediction of asthma exacerbations: current status
• DNA collection
• Lessons Learned
• Conclusions
Context
• Channing Lab - extensive genetics &
pharmacogenetics resources focused on
airways diseases
• Faculty with clinical, epidemiology, genetic,
and bioinformatics training and experience
• Multidisciplinary collaborative research track
record
• Good i2b2 driver: from bench to clinic
• Strong focus and direction for Cores
Broad Goals of Channing Program in
Predictive Medicine
• Genetic variation → clinical practice
• Disease risk (asthma diagnosis)
• Natural history (exacerbations)
• Individual response to medication (pharmacogenetics)
• Develop predictive tests (genetic and nongenetic) in
Channing populations
• Validate these tests in Partners asthma cohort (PAC) at
least as proof of concept
I2B2 Airways DBP: Overview
[Flow diagram: boxes as extracted; arrows not recoverable]
• Partners Clinical Services
• RPDR
• Extract data from Airways Disease patients
• Extract relevant quantitative and coded phenotypes
• Extract important phenotypes from text: NLP
• Develop statistical models
• Predict clinical outcomes after adjustment for covariates
• RPDR: Recruit, validate, genotype
Before we start
• Numerous important covariates
• e.g. age, tobacco, comorbidities,
medications
• Adjust outcomes for covariates
• Some (e.g. age, gender, Dx, encounter)
readily available
• Obtained through Core 4
• Others require substantial effort e.g.
medications, tobacco use, comorbid
conditions
• Collaboration - NLP experts in Core 1
Phenotypes from text
• Extract specific data items
– Medication
– Smoking status
– Diagnoses (Co-morbidity)
• Extract findings to assist with case
selection
• Extract findings to assist with clinical
predictions
Smoking Status- Examples
Smoker:
SOCIAL HISTORY: The patient is married with four grown daughters, uses tobacco, has wine with dinner.
Non-Smoker:
SOCIAL HISTORY: The patient is a nonsmoker. No alcohol.
SOCIAL HISTORY: Negative for tobacco, alcohol, and IV drug abuse.
Past Smoker:
BRIEF RESUME OF HOSPITAL COURSE: 63 yo woman with COPD, 50 pack-yr tobacco (quit 3 wks ago), spinal stenosis, ...
???:
SOCIAL HISTORY: The patient lives in rehab, married. Unclear smoking history from the admission note…
Hard to pick:
HOSPITAL COURSE: ... It was recommended that she receive …We also added Lactinax, oral form of Lactobacillus acidophilus to attempt a repopulation of her gut.
Hard to pick:
SH: widow, lives alone, 2 children, no tob/alcohol.
Smoking -Text Processing
No. Cases: 2796
No. Attributes: 50
No. Classes: 5
Cases per class (manually classified):
Denies smoking     146
Never smoked       427
Past smoker        952
Current Smoker    1010
Control cases      261
Smoking Status
Preliminary results
• Raw sample ~ 20,000 reports
• Feature extraction >3000
• Feature selection 25 - 1000
• “Gold standard” sample cases ~ 2,800
• Correct classification rate 46 - 81%
(compared to Gold Standard)
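The feature extraction and selection steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the function names (tokenize, ngrams, select_features) are hypothetical, and ranking n-grams by how many reports contain them is an assumed stand-in for whatever selection criterion was really used.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a report and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def ngrams(tokens, n):
    """Return all n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def select_features(reports, n, k):
    """Keep the k n-grams that occur in the most reports.

    Assumption: document frequency is used here only as a simple,
    illustrative selection rule.
    """
    doc_freq = Counter()
    for report in reports:
        doc_freq.update(set(ngrams(tokenize(report), n)))
    return [gram for gram, _ in doc_freq.most_common(k)]


# Example snippets taken from the smoking-status slide.
REPORTS = [
    "The patient is a nonsmoker. No alcohol.",
    "Negative for tobacco, alcohol, and IV drug abuse.",
    "uses tobacco, has wine with dinner",
]
TOP_UNIGRAMS = select_features(REPORTS, n=1, k=2)
```

Raising n moves from one-grams to bi-grams and tri-grams; k plays the role of the "Feature selection 25 - 1000" setting.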
Smoking Status
Preliminary results
Data Set          Classification Method  Test Cases  No. Features  % Correctly Classified
Stemmed one-gram  Naïve Bayes            CV 10x      917           80.92
Stemmed one-gram  Naïve Bayes            CV 10x      231           80.46
One-gram          SVM                    Split 2/3   50            79.70
One-gram          Naïve Bayes            Split 2/3   50            78.02
Bi-gram           SVM                    Split 2/3   25            49.57
Bi-gram           Naïve Bayes            Split 2/3   25            70.73
Tri-gram          SVM                    Split 2/3   25            44.63
Tri-gram          Naïve Bayes            Split 2/3   25            65.05
More …
Baseline performance
Increase, combine features should improve performance
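To make the Naïve Bayes rows above concrete, here is a minimal sketch of a multinomial Naïve Bayes classifier over one-gram features with add-one smoothing. It is an assumption-laden toy: the class and function names are hypothetical, the training notes are invented (not RPDR data), and the slides do not say which Naïve Bayes variant or toolkit was used.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a note and split it into alphabetic one-gram tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        # Words never seen in training carry no information and are skipped.
        tokens = [w for w in tokenize(text) if w in self.vocab]
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            score += sum(math.log((counts[w] + 1) / denom) for w in tokens)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def correct_rate(model, texts, labels):
    """Fraction of notes whose predicted class matches the gold label."""
    hits = sum(model.predict(t) == y for t, y in zip(texts, labels))
    return hits / len(labels)


# Invented toy notes, for illustration only.
TRAIN = [
    ("uses tobacco, has wine with dinner", "current"),
    ("smokes one pack per day", "current"),
    ("the patient is a nonsmoker", "never"),
    ("negative for tobacco and alcohol", "never"),
    ("quit tobacco 3 wks ago", "past"),
    ("former smoker, quit in 1990", "past"),
]
model = NaiveBayes().fit([t for t, _ in TRAIN], [y for _, y in TRAIN])
```

The "% Correctly Classified" column corresponds to correct_rate evaluated on held-out notes (a 2/3 split or 10-fold cross-validation), not on the training notes as in this toy.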
Data Extraction
“Raw” Patient Data
Data Mining Pipeline
• Text Processing: word/pattern filters, stemming, lexicon matching, parsing, …
• Feature Analysis: classification, clustering, statistical analysis, …
“Smart Data”
• Medications
• Smoking status
• Co-morbidity
Asthma Preceding COPD
• Significant overlap of asthma and COPD DX
• Common denominator = smoking
• Asthma is known to precede and predict the
development of COPD independent of
smoking
• Could we develop a multivariate clinical
predictor that would predict which
asthmatics would get COPD?
Study Design
Source: Partners Healthcare
Research Patient Data
Repository (RPDR).
RPDR: MGH, BWH, etc
clinical repository for
researchers.
Training: 9349 asthmatics
(843 COPD, 8506 controls)
first encounter 1988-1998.
Test: A future set of 992
asthmatics (46 COPD, 946
controls) first encounter
from 1999-2002.
Data Collection
Criteria: Patients observed for
at least 5 years, at least 18 years old
at the first encounter, and
race, sex, height, weight,
and smoking available.
Comorbidities: International
Classification of Diseases,
9th Revision (ICD-9) codes
as admission diagnosis or
ER primary diagnosis (104)
COPD: ICD-9 code for
“Chronic Bronchitis”,
“Emphysema”, “Chronic
Airways Obstruction, not
otherwise specified.”
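The inclusion criteria and COPD definition above could be applied to patient records along these lines. This is a sketch under stated assumptions: the Patient record and its field names are hypothetical, and the ICD-9 category codes (491, 492, 496) are my reading of the three named diagnoses, since the slides name the diagnoses but do not list codes.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed ICD-9 categories: 491 chronic bronchitis, 492 emphysema,
# 496 chronic airway obstruction. Not taken from the slides.
COPD_ICD9_PREFIXES = ("491", "492", "496")


@dataclass
class Patient:
    """Hypothetical record layout, for illustration only."""
    age_at_first_encounter: int
    years_observed: float
    race: Optional[str] = None
    sex: Optional[str] = None
    height_cm: Optional[float] = None
    weight_kg: Optional[float] = None
    smoking: Optional[str] = None
    icd9_codes: Tuple[str, ...] = ()


def is_eligible(p):
    """Observed >= 5 years, >= 18 at first encounter, covariates present."""
    required = (p.race, p.sex, p.height_cm, p.weight_kg, p.smoking)
    return (
        p.years_observed >= 5
        and p.age_at_first_encounter >= 18
        and all(value is not None for value in required)
    )


def has_copd(p):
    """True if any recorded ICD-9 code falls in an assumed COPD category."""
    return any(code.startswith(COPD_ICD9_PREFIXES) for code in p.icd9_codes)


example = Patient(45, 7.5, "white", "F", 165.0, 70.0, "past", ("493.90", "496"))
```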
Analysis
Model: A Bayesian network was generated from the
training set of 9349 asthmatics (843 COPD, 8506
controls) encountered between 1988 and 1998 from
104 comorbidities and race, gender, age, smoking.
Results: The risk of COPD is modulated by gender,
race, and smoking history, and 14 comorbidities:
Viral and chlamydial infections, diabetes mellitus,
volume depletion, acute myocardial infarction,
intermediate coronary syndrome, cardiac
dysrhythmias, heart failure, acute upper respiratory
infections, acute bronchitis and bronchiolitis,
pneumonia, early or threatened labor, normal
delivery, shortness of breath, respiratory distress.
Network Model
Validation
Propagation: a Bayesian network can compute the
probability distribution of any variable given an
instance of some or all the other variables.
Test data: a future set of 992 asthmatics (46 COPD,
946 controls) first encounter from 1999-2002.
Prediction: for each patient, predict the probability of
COPD given the other elements in the network (comorbidities and demographics).
Validation: compare the predicted with the observed
COPD status.
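The propagation step described above can be illustrated with a toy three-node network, computing P(COPD | evidence) by brute-force enumeration. Everything numeric here is hypothetical: the two parent variables, the priors, and the conditional probability table are invented for illustration and are not estimates from the study, whose network covers 14 comorbidities plus demographics.

```python
from itertools import product

# Hypothetical parameters, NOT values from the study.
P_SMOKER = 0.3
P_PNEUMONIA = 0.1
P_COPD_GIVEN = {  # P(COPD = True | smoker, pneumonia)
    (True, True): 0.40,
    (True, False): 0.20,
    (False, True): 0.15,
    (False, False): 0.05,
}


def joint(smoker, pneumonia, copd):
    """Joint probability of one full assignment of the three variables."""
    p = P_SMOKER if smoker else 1 - P_SMOKER
    p *= P_PNEUMONIA if pneumonia else 1 - P_PNEUMONIA
    p_copd = P_COPD_GIVEN[(smoker, pneumonia)]
    return p * (p_copd if copd else 1 - p_copd)


def prob_copd(evidence):
    """P(COPD = True | evidence).

    evidence maps any subset of {"smoker", "pneumonia"} to True/False;
    unobserved variables are summed out.
    """
    numerator = denominator = 0.0
    for smoker, pneumonia, copd in product([True, False], repeat=3):
        assignment = {"smoker": smoker, "pneumonia": pneumonia}
        if any(assignment[name] != value for name, value in evidence.items()):
            continue
        p = joint(smoker, pneumonia, copd)
        denominator += p
        if copd:
            numerator += p
    return numerator / denominator
```

For validation, each test patient's observed comorbidities and demographics would be passed as evidence and the resulting probability compared with the observed COPD status. Enumeration is only practical for tiny networks; real tools use more efficient propagation algorithms.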
Predictive Validation
One variable at a time
Asthma Exacerbations
• Asthma attacks involve worsening of asthma
symptoms including bronchoconstriction and
inflammatory response
• Major cause of morbidity and mortality in asthma
• 11.7 million Americans have an exacerbation
every year (3.9 million children)
• In US children, exacerbations are the third leading
cause of hospitalizations (198,000 occurrences
per year)
• Cost of asthma exacerbations US=4 billion
dollars, Partners=20 million dollars
RPDR Exacerbation Prediction
[Figure: ROC curves, sensitivity vs. specificity, both axes 0 to 1; legend: .67 any ER/hosp visits, >2 ER/hosp visits, >3 ER/hosp visits]
Genetic Prediction of Asthma Exacerbation
Objective
Predict asthma exacerbation from genetic data
Subjects
290 CAMP participants
• Not on steroids
• Followed for 10+ years
• Have genetic data available
Phenotype
Case: Reported overnight hospitalization(s) (n=83)
Control: No overnight hospitalizations or ER visits (n=207)
Genotype
2443 SNPs from 349 candidate genes
• In Hardy-Weinberg equilibrium among controls
• Minor allele frequency > 0.05
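The two SNP quality filters above can be sketched from genotype counts. This is a minimal illustration with hypothetical function names; the chi-square cutoff of 3.84 (1 degree of freedom, alpha = 0.05) is an assumption, since the slides do not state which test or threshold was used for Hardy-Weinberg equilibrium.

```python
def minor_allele_frequency(n_aa, n_ab, n_bb):
    """Frequency of the rarer allele, from counts of the three genotypes."""
    total_alleles = 2 * (n_aa + n_ab + n_bb)
    freq_a = (2 * n_aa + n_ab) / total_alleles
    return min(freq_a, 1 - freq_a)


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic comparing observed genotype counts with the
    counts expected under Hardy-Weinberg equilibrium (p^2, 2pq, q^2)."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


def passes_qc(n_aa, n_ab, n_bb, maf_min=0.05, chi2_max=3.84):
    """Keep a SNP if MAF > maf_min and it does not depart from HWE.

    chi2_max=3.84 is an assumed cutoff, not one given in the slides.
    In the study design, the HWE counts would come from controls only.
    """
    return (
        minor_allele_frequency(n_aa, n_ab, n_bb) > maf_min
        and hwe_chi_square(n_aa, n_ab, n_bb) <= chi2_max
    )
```

For example, counts of (81, 18, 1) match Hardy-Weinberg proportions for an allele frequency of 0.9 and pass, while (90, 0, 10) has the same allele frequency but no heterozygotes and fails.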
Exacerbation Model
132 of 2443 SNPs
in 55 of 349 genes
predict
exacerbation
Validation
Method: Prediction on fitted values
Result: Area under the ROC curve (AUROC) is 0.97
AUROC measures accuracy as a trade-off between sensitivity and specificity
AUROC      Rating
0.5 - 0.6  Fail
0.6 - 0.7  Poor
0.7 - 0.8  Fair
0.8 - 0.9  Good
0.9 - 1.0  Excellent
AUROC = 0.97
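As a rough illustration of what the AUROC number means, the sketch below computes it as the probability that a randomly chosen case receives a higher predicted score than a randomly chosen control (ties count one half), then maps the value onto the rating scale above. The function names are hypothetical, and treating each band's lower bound as inclusive, and anything under 0.5 as "Fail", are assumptions, since the scale leaves the boundaries ambiguous.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise case/control comparison.

    labels: 1 for cases (e.g. hospitalized), 0 for controls.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in cases
        for k in controls
    )
    return wins / (len(cases) * len(controls))


# Lower bounds of the rating bands, highest first.
RATINGS = [(0.9, "Excellent"), (0.8, "Good"), (0.7, "Fair"),
           (0.6, "Poor"), (0.5, "Fail")]


def rating(auc):
    """Map an AUROC value to its band (lower bound inclusive: assumption)."""
    for lower, label in RATINGS:
        if auc >= lower:
            return label
    return "Fail"  # assumption: values below 0.5 are also rated "Fail"
```

The pairwise form is quadratic in the number of subjects, which is fine at this study's scale (83 cases, 207 controls).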
Cross-Validation
Method: 20-fold cross-validation to test robustness
1. Data is split into 20 groups
2. One group is held out as the independent set and the remaining 19 are used to quantify
the model
3. Step 2 is repeated until each group has served as the independent set
Result: AUROC is 0.84 (good)
AUROC = 0.84
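The 20-fold procedure above can be sketched as an index splitter. This is a generic illustration with a hypothetical function name, not the authors' code; how subjects were actually assigned to groups (for instance, whether cases and controls were balanced across folds) is not stated in the slides.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    Each sample lands in exactly one test fold; the other k-1 folds
    form the training set used to fit the model for that round.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 290 CAMP participants split into 20 groups, as on the slide.
SPLITS = list(k_fold_indices(290, 20))
```

Because every prediction is made for a subject the model did not see during fitting, the cross-validated AUROC (0.84) is a less optimistic estimate than the value obtained by predicting on fitted values (0.97).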
Partners Asthma DNA collection #1
• Recruit Partners asthma patients
• Partners Asthma Center, NWH, MGH
• High quality spirometric phenotyping
• Blood for DNA extraction and storage
• Children and adults
• High cost (>$1000/subject)
• Low intensity: 6 months, only 100 subjects recruited
• Doctors and patients need education
Partners Asthma DNA collection #2
• Recruit Partners asthma cohort patients
• Leverage CRIMSON blood samples
• Leverage data mart for phenotype data
• Blood for DNA extraction and storage
• Children and adults, cases and controls
• Low cost (<$30/subject)
• High intensity: 9 months, >3000 subjects recruited
Figure 1
Data Flow for Asthma DBP
Channing: ADMPN#; send to RPDR
RPDR: converts ADMPN# to MRN, sends to pathology
Pathology (Crimson): MRN, Crimson ID#, ADMPN; sends back to Channing with sample for DNA extraction
Figure 1 Legend
De-identified data file is analyzed by Channing and subjects for DNA collection are selected. The file is sent to
RPDR, converted back to MR#, and sent to Crimson. Samples are identified and given a Crimson ID# ≡
ADMPN, and each sample is sent back to Channing.
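The identifier hand-offs in Figure 1 can be sketched as two lookups. This is purely illustrative: the function names, dictionary layouts, and ID values are hypothetical, and the real RPDR and Crimson systems are not simple dictionaries. The point it shows is that the MRN is used only between RPDR and pathology, and only the de-identified ADMPN and the Crimson ID travel back to Channing.

```python
def rpdr_reidentify(selected_admpns, admpn_to_mrn):
    """RPDR step: map selected de-identified IDs (ADMPN) back to MRNs."""
    return {
        admpn: admpn_to_mrn[admpn]
        for admpn in selected_admpns
        if admpn in admpn_to_mrn
    }


def crimson_pull_samples(admpn_to_mrn_subset, crimson_id_by_mrn):
    """Pathology step: locate samples by MRN and label them for return.

    The returned records carry the ADMPN and Crimson ID but no MRN.
    """
    shipped = []
    for admpn, mrn in admpn_to_mrn_subset.items():
        if mrn in crimson_id_by_mrn:
            shipped.append({"admpn": admpn, "crimson_id": crimson_id_by_mrn[mrn]})
    return shipped


# Hypothetical identifiers, for illustration only.
ADMPN_TO_MRN = {"A1": "M100", "A2": "M200", "A3": "M300"}
CRIMSON_ID_BY_MRN = {"M100": "C-9", "M300": "C-7"}

requested = rpdr_reidentify(["A1", "A2"], ADMPN_TO_MRN)
returned = crimson_pull_samples(requested, CRIMSON_ID_BY_MRN)
```

Here subject A2 is selected but has no available sample, so only A1 comes back to Channing.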
Recruitment for DBP from Crimson at BWH: Asthma Cases by Utilization and Race
[Figure: recruitment counts by period; x-axis: May, May-Jun, May-Jul, May-Aug, May-Sept, May-Oct, May-Nov, May-Dec, May-Jan, May-Feb, May-Mar; y-axis: -10 to 370; series: Hi Cauc, Lo Cauc, Hi Af Am, Lo Af Am]
Recruitment for DBP from Crimson at BWH: Asthma Cases and Controls by Race
[Figure: recruitment counts by period; x-axis: May, May-Jun, May-Jul, May-Aug, May-Sept, May-Oct, May-Nov, May-Dec, May-Jan, May-Feb, May-Mar; y-axis: -40 to 1310; series: Cauc Asthma, Cauc Controls, Af Am Asthma, Af Am Controls]
Summary of Samples to 04/07/08
Running total:
High African American: 111
Low African American: 222
Controls African American: 880
High Caucasian: 59
Low Caucasian: 454
Controls Caucasian: 1,341
Lessons learned 1
• Get what you ask for
• Regular meetings, regular meetings
• Negotiate your demands
• Tools are not enough
• Leverage your peers
• Recruiting patients is hard work
• IRB is hard work
Lessons learned 2
• You can never have enough statistics or
bioinformatics
• Genotyping and its technologies are
secondary
• The RPDR data are dirty!
• Listen to Shawn
• Be flexible
Summary:
Airways disease as a driver for i2b2
• “Typical” complex disease challenge
• Big impact on health care system
• Potential for large clinical impact
• Core 1: Extracting phenotypes from free
text; statistical models
• Core 2: Viewer for CRC
• Core 4: Data provisioning
Conclusions
• The stronger the existing program, the
more successful the I2B2 collaboration
• Communication is key
• Fit the question to the data not the other
way around
• Data access will be an issue for the future
Collaborators (and what they did)
• Scott, Zak, John, and Susanne: money, project
management, IRB, and big picture
• Ross: Channing bioinformatics, file structures, geek to
geek translation with the cores, beta testing, 850
collection, IRB, links to other genetic bioinformatics tools
and projects
• Shawn and Vivian: asthma and control data mart
• Anne, LJ, James: nongenetic predictors in CAMP
• Marco and Blanca: nongenetic predictors in PAC
• Marco and Blanca: genetic predictors in CAMP
• Marco and Blanca: genetic predictors in PAC
• Lynn: Crimson
Acknowledgments:
Ross Lazarus
Blanca E. Himes
Marco F. Ramoni
Isaac Kohane
Shawn Murphy
Susanne Churchill
Anne Fuhlbrigge
LJ Wei
James Signorovitch
Lynn Bry