Clinical Natural Language Processing


Clinical Natural Language
Processing: Part I
Guergana K. Savova, PhD
Children's Hospital Boston and
Harvard Medical School
Investigators
(in alphabetical order)
• Children's Hospital Boston and HMS (site PI: Guergana Savova)
• MIT (site PI: Peter Szolovits)
• MITRE Corporation (site PI: Lynette Hirschman)
• Seattle Group Health (site PI: David Carrell)
• SUNY Albany (site PI: Ozlem Uzuner)
• University of California, San Diego (site PI: Wendy Chapman)
• University of Colorado (site PI: Martha Palmer)
• University of Pittsburgh (site PI: Henk Harkema)
• University of Utah and Intermountain Healthcare (site PI: Peter Haug)
Special Acknowledgement
• Our talented super software developers
– Vinod Kaggal, lead
– Dingcheng Li
– Pei Chen
– James Masanz
Overview
• Part 1:
– Background and objectives of SHARP 4 cNLP project
– Year 1 achievements
– Clinical Text Analysis and Knowledge Extraction System (cTAKES)
– Year 2 proposed projects
– Graphical User Interface to cTAKES: demo
• Part 2:
– cTAKES: demo
Aims
• Information extraction (IE): transformation of unstructured text into structured representations and merging clinical data extracted from free text with structured data
– Entity and Event discovery
– Relation discovery
– Normalization template: Clinical Element Model (CEM)
• Overarching goal
– high-throughput phenotype extraction from clinical free text based on standards and the principles of interoperability
– general-purpose clinical NLP tool with applications to the majority of all imaginable use cases
Processing Clinical Notes
A 43-year-old woman was diagnosed with type 2 diabetes mellitus by her family physician 3 months before this presentation. Her initial blood glucose was 340 mg/dL. Glyburide 2.5 mg once daily was prescribed. Since then, self-monitoring of blood glucose (SMBG) showed blood glucose levels of 250-270 mg/dL. She was referred to an endocrinologist for further evaluation.
On examination, she was normotensive and not acutely ill. Her body mass index (BMI) was 18.7 kg/m2 following a recent 10 lb weight loss. Her thyroid was symmetrically enlarged and ankle reflexes absent. Her blood glucose was 272 mg/dL, and her hemoglobin A1c (HbA1c) was 10.3%. A lipid profile showed a total cholesterol of 261 mg/dL, triglyceride level of 321 mg/dL, HDL level of 48 mg/dL, and an LDL of 150 mg/dL. Thyroid function was normal. Urinalysis showed trace ketones.
She adhered to a regular exercise program and vitamin regimen, smoked 2 packs of cigarettes daily for the past 25 years, and limited her alcohol intake to 1 drink daily. Her mother's brother was diabetic.
Clinical Element Model
Disorder CEM
text: diabetes mellitus
code: 73211009
subject: patient
relative temporal context: 3 months ago
negation indicator: not negated
Medication CEM
text: Glyburide
code: 315989
subject: patient
frequency: once daily
negation indicator: not negated
strength: 2.5 mg
Tobacco Use CEM
text: smoking
code: 365981007
subject: patient
relative temporal context: 25 years
negation indicator: not negated
Disorder CEM
text: diabetes mellitus
code: 73211009
subject: family member
relative temporal context:
negation indicator: not negated
A 43-year-old woman was diagnosed with type 2 diabetes
mellitus by her family physician 3 months before this
presentation. Her initial blood glucose was 340 mg/dL.
Glyburide 2.5 mg once daily was prescribed. Since then,
self-monitoring of blood glucose (SMBG) showed blood
glucose levels of 250-270 mg/dL. She was referred to an
endocrinologist for further evaluation.
On examination, she was normotensive and not acutely
ill. Her body mass index (BMI) was 18.7 kg/m2 following
a recent 10 lb weight loss. Her thyroid was
symmetrically enlarged and ankle reflexes absent. Her
blood glucose was 272 mg/dL, and her hemoglobin A1c
(HbA1c) was 10.3%. A lipid profile showed a total
cholesterol of 261 mg/dL, triglyceride level of 321
mg/dL, HDL level of 48 mg/dL, and an LDL of 150 mg/dL.
Thyroid function was normal. Urinalysis showed trace
ketones.
She adhered to a regular exercise program and vitamin
regimen, smoked 2 packs of cigarettes daily for the
past 25 years, and limited her alcohol intake to 1
drink daily. Her mother's brother was diabetic.
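The four CEMs on this slide can be pictured as plain records. The sketch below is only an illustration: the `Cem` dataclass and `patient_problems` helper are hypothetical names, the fields mirror the slide labels rather than the CEM specification, and medication-specific fields (frequency, strength) are left out for brevity. It shows why the subject field matters: the patient's diabetes and her uncle's diabetes share a code but must not be conflated when building a problem list.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Cem:
    """Hypothetical flat record; field names follow the slide labels."""
    kind: str
    text: str
    code: str
    subject: str
    negated: bool = False
    relative_temporal_context: Optional[str] = None


# Values copied from the four CEMs shown on the slide.
CEMS: List[Cem] = [
    Cem("Disorder", "diabetes mellitus", "73211009", "patient",
        relative_temporal_context="3 months ago"),
    Cem("Medication", "Glyburide", "315989", "patient"),
    Cem("Tobacco Use", "smoking", "365981007", "patient",
        relative_temporal_context="25 years"),
    Cem("Disorder", "diabetes mellitus", "73211009", "family member"),
]


def patient_problems(cems: List[Cem]) -> List[Cem]:
    """Keep asserted (non-negated) disorders whose subject is the patient."""
    return [c for c in cems
            if c.kind == "Disorder" and c.subject == "patient" and not c.negated]


if __name__ == "__main__":
    for cem in patient_problems(CEMS):
        print(cem.text, cem.code, cem.relative_temporal_context)
```

Filtering on subject and negation keeps the family-history mention out of the patient's own problem list.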
Comparative Effectiveness
Disorder CEM
text: diabetes mellitus
code: 73211009
subject: patient
relative temporal context: 3 months ago
negation indicator: not negated
Medication CEM
text: Glyburide
code: 315989
subject: patient
frequency: once daily
negation indicator: not negated
strength: 2.5 mg
Tobacco Use CEM
text: smoking
code: 365981007
subject: patient
relative temporal context: 25 years
negation indicator: not negated
Disorder CEM
text: diabetes mellitus
code: 73211009
subject: family member
relative temporal context:
negation indicator: not negated
Compare the effectiveness of different treatment
strategies (e.g., modifying target levels for glucose,
lipid, or blood pressure) in reducing cardiovascular
complications in newly diagnosed adolescents and
adults with type 2 diabetes.
Compare the effectiveness of traditional behavioral
interventions versus economic incentives in
motivating behavior changes (e.g., weight loss,
smoking cessation, avoiding alcohol and substance
abuse) in children and adults.
Meaningful Use
Disorder CEM
text: diabetes mellitus
code: 73211009
subject: patient
relative temporal context: 3 months ago
negation indicator: not negated
Medication CEM
text: Glyburide
code: 315989
subject: patient
frequency: once daily
negation indicator: not negated
strength: 2.5 mg
Tobacco Use CEM
text: smoking
code: 365981007
subject: patient
relative temporal context: 25 years
negation indicator: not negated
Disorder CEM
text: diabetes mellitus
code: 73211009
subject: family member
relative temporal context:
negation indicator: not negated
• Maintain problem list
• Maintain active med list
• Record smoking status
• Provide clinical summaries for each office visit
• Generate patient lists for specific conditions
• Submit syndromic surveillance data
Clinical Practice
Disorder CEM
text: diabetes mellitus
code: 73211009
subject: patient
relative temporal context: 3 months ago
negation indicator: not negated
Medication CEM
text: Glyburide
code: 315989
subject: patient
frequency: once daily
negation indicator: not negated
strength: 2.5 mg
• Provide problem list and meds from the visit
Applications
• Meaningful use of the EMR
• Comparative effectiveness
• Clinical investigation
– Patient cohort identification
– Phenotype extraction
• Epidemiology
• Clinical practice
• …
How does NLP fit?
• Demo pipeline, v1
– All medications in Mayo dataset extracted with cTAKES (NLP method)
– Processed 360,452 notes for 10,000 patients
– 3,442,000 CEMs were created
– Processing time: 1.6 sec/doc
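As a rough back-of-the-envelope check on these figures, the snippet below derives total processing time and per-note averages. It assumes the notes were processed one after another in a single process, which the slide does not state; with parallel workers the wall-clock time would be lower.

```python
# Figures taken from the slide.
NOTES = 360_452
PATIENTS = 10_000
CEMS_CREATED = 3_442_000
SECONDS_PER_DOC = 1.6

# Assumption: strictly sequential processing (not stated on the slide).
total_hours = NOTES * SECONDS_PER_DOC / 3600
notes_per_patient = NOTES / PATIENTS
cems_per_note = CEMS_CREATED / NOTES

print(f"total processing time: {total_hours:.1f} hours")
print(f"notes per patient:     {notes_per_patient:.1f}")
print(f"CEMs per note:         {cems_per_note:.2f}")
```

Under that assumption the run takes roughly 160 hours, with about 36 notes per patient and between 9 and 10 medication CEMs per note.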
Year 1
Y1 Technical and Scientific Activities
• Gold standard corpus development:
– corpus creation methodology
– de-id and PHI surrogate generation tools
– seed corpus generation (PAD, pneumonia, breast cancer)
– annotation schema development based on CEM normalization target
– annotation guidelines and pilot annotations
– gold standard annotations are in progress
• Type System for software development
• Development of Evaluation workbench
• Methods development
– entity and event discovery
– relation discovery
Y1 Software Deliverables (cTAKES modules)
[Timeline figure, July 2010 through June 2011: Dependency Parser; Smoking Status Classifier; Drug Profile Module; CEM 'orderMedAmb' Population; Full-Cycle Pipeline v1]
SHARP Security Roundtable for Cloud-Deployed cNLP
• May 23-24, 2011
• Participants: SHARP 1, SHARP 4, health care organizations, the Veterans Administration, industry, and other research institutions
• Providing guidance to institutions seeking to use cloud technologies to support development and application of cNLP tools
• A set of recommendations for the novel legal and governance issues regarding the proper stewardship and use of clinical data
SHARP Collaborations
• SHARP 1:
– Around security in a cloud computing environment
• SHARP 3 (SMaRT):
– Around extraction of data from the clinical narrative
– i2b2 database for data persistence?
Partnerships
• NCBC-funded initiatives
– Integrating Informatics and Biology to the Bedside (i2b2)
– Integrating Data for Analysis, Anonymization and Sharing (iDASH)
– Ontology Development and Information Extraction (ODIE)
• Veterans Administration
• R01s
– Shared annotated lexical resource
– Temporal relation discovery for the clinical domain
– Multi-source integrated platform for answering clinical questions
• University of York (UK), University of Trento (Italy), Brandeis University (USA)
• eMERGE, PGRN (Pharmacogenomics Research Network)
clinical Text Analysis and
Knowledge Extraction System
(cTAKES)
Overview
• Goal:
– Phenotype extraction
– Generic: to be used for a variety of retrievals and use cases
– Expandable: at the information model level and methods
– Modular
– Cutting edge technologies: best methods combining existing practices and novel research with rapid technology transfer
• Terminology agnostic: able to plug in any terminology
• Best software practices (80M+ notes)
• Stand-alone tool easily pluggable within other platforms/toolsets
• Apache v2.0 license
• http://sourceforge.net/projects/ohnlp/
• Commitment to both R and D in R&D
cTAKES Adoption
• May 2011:
– 2306 downloads*
• i2b2 NLP cell integration; relevance to CTSAs
• eMERGE (SGH, NW)
• PGRN (HMS, NW)
• Extensions: Yale (YTEX), MITRE
* Source: http://sourceforge.net/project/stats/?group_id=255545&ugn=ohnlp&type=&mode=alltime
cTAKES Technical Details
• Open source
• Apache v2.0 license
• http://sourceforge.net/projects/ohnlp/
• Java 1.5
• Framework
• IBM’s Unstructured Information Management Architecture (UIMA) open source framework, Apache project
• Methods
• Natural Language Processing methods (NLP)
• Based on standards and conventions to foster interoperability
• Application
• High-throughput system
cTAKES: Components
• Sentence boundary detection (OpenNLP technology)
• Tokenization (rule-based)
• Morphologic normalization (NLM’s LVG)
• POS tagging (OpenNLP technology)
• Shallow parsing (OpenNLP technology)
• Named Entity Recognition
– Dictionary mapping (lookup algorithm)
– Machine learning (MAWUI)
– types: diseases/disorders, signs/symptoms, anatomical sites, procedures, medications
• Negation and context identification (NegEx)
• Dependency parser
• Drug Profile module
• Smoking status classifier
• CEM normalization module
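To make the order of the components concrete, here is a toy sketch of four of the stages: sentence splitting, tokenization, dictionary lookup, and a NegEx-style negation check. It is not cTAKES code. The dictionary entries, trigger list, window size, and function names are all assumptions chosen for illustration, and the negation check is far simpler than the published NegEx algorithm.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical mini-dictionary: surface form -> semantic type.
DICTIONARY: Dict[str, str] = {
    "diabetes mellitus": "disease/disorder",
    "glyburide": "medication",
    "chest pain": "sign/symptom",
}
# Simplified trigger list; real NegEx also handles scope and pseudo-negations.
NEGATION_TRIGGERS = ("no", "not", "denies", "without")
WINDOW = 5  # number of tokens before a mention to scan for a trigger


def split_sentences(text: str) -> List[str]:
    """Naive rule: split on whitespace that follows ., ! or ?"""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence: str) -> List[str]:
    """Words (with inner hyphens/apostrophes) or single punctuation marks."""
    return re.findall(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*|[^\sA-Za-z0-9]", sentence)


def find_entities(tokens: List[str]) -> List[Tuple[str, str, bool]]:
    """Dictionary lookup over token windows, flagging negated mentions."""
    lowered = [t.lower() for t in tokens]
    hits = []
    for term, sem_type in DICTIONARY.items():
        term_tokens = term.split()
        n = len(term_tokens)
        for i in range(len(lowered) - n + 1):
            if lowered[i:i + n] == term_tokens:
                preceding = lowered[max(0, i - WINDOW):i]
                negated = any(t in NEGATION_TRIGGERS for t in preceding)
                hits.append((term, sem_type, negated))
    return hits


def run_pipeline(text: str) -> List[Tuple[str, str, bool]]:
    results = []
    for sentence in split_sentences(text):
        results.extend(find_entities(tokenize(sentence)))
    return results


if __name__ == "__main__":
    note = ("She was diagnosed with type 2 diabetes mellitus. "
            "Glyburide was prescribed. She denies chest pain.")
    for hit in run_pipeline(note):
        print(hit)
```

Each stage consumes the output of the previous one, which is the same chaining idea the component list describes, though the real modules rely on trained models and a full terminology.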
Output Example: Drug Object
• “Tamoxifen 20 mg po daily started on March 1, 2005.”
• Drug
– Text: Tamoxifen
– Associated code: C0351245
– Strength: 20 mg
– Start date: March 1, 2005
– End date: null
– Dosage: 1.0
– Frequency: 1.0
– Frequency unit: daily
– Duration: null
– Route: Enteral Oral
– Form: null
– Status: current
– Change Status: no change
– Certainty: null
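For intuition about how a sentence like the one above might be turned into such an object, the sketch below uses a single regular expression. This is not how the cTAKES Drug Profile module works; the pattern, the `ROUTES` table, and the `DrugMention` class are hypothetical and cover only this one sentence shape.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical abbreviation table: route abbreviation -> normalized route.
ROUTES = {"po": "Enteral Oral"}

# Hypothetical pattern covering "<Drug> <n> mg po daily [started on <date>]".
PATTERN = re.compile(
    r"(?P<text>[A-Z][a-z]+)\s+"
    r"(?P<strength>\d+(?:\.\d+)?\s*mg)\s+"
    r"(?P<route>po)\s+"
    r"(?P<freq_unit>daily)"
    r"(?:\s+started on\s+(?P<start>[A-Z][a-z]+ \d{1,2}, \d{4}))?"
)


@dataclass
class DrugMention:
    text: str
    strength: str
    route: str
    frequency_unit: str
    start_date: Optional[str] = None
    end_date: Optional[str] = None  # stays None when the text gives no end date


def parse_drug(sentence: str) -> Optional[DrugMention]:
    """Return a DrugMention if the sentence matches the toy pattern."""
    m = PATTERN.search(sentence)
    if m is None:
        return None
    return DrugMention(
        text=m.group("text"),
        strength=m.group("strength"),
        route=ROUTES[m.group("route")],
        frequency_unit=m.group("freq_unit"),
        start_date=m.group("start"),
    )


if __name__ == "__main__":
    print(parse_drug("Tamoxifen 20 mg po daily started on March 1, 2005."))
```

Attributes the sentence does not mention remain `None`, mirroring the `null` entries in the drug object.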
Conversion to CEMs
[Diagram: cTAKES Drug NER output (JCas) is passed to a CAS Consumer, which applies a FreeMarker transform template to produce the CEM]
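The template step can be illustrated with a small stand-in. FreeMarker is a Java templating engine; the sketch below uses Python's `string.Template` instead, purely to show the idea of filling a CEM-shaped template from extracted drug attributes. The template layout, the `drug_to_cem` helper, and the dictionary keys are assumptions, not the project's actual template.

```python
from string import Template

# Hypothetical CEM-shaped template; the real project used a FreeMarker template.
CEM_TEMPLATE = Template(
    "Medication CEM\n"
    "text: $text\n"
    "code: $code\n"
    "strength: $strength\n"
    "frequency: $frequency_unit\n"
)

# Attribute values taken from the Tamoxifen drug object example.
DRUG = {
    "text": "Tamoxifen",
    "code": "C0351245",
    "strength": "20 mg",
    "frequency_unit": "daily",
}


def drug_to_cem(drug: dict) -> str:
    """Fill the template; a missing key raises KeyError."""
    return CEM_TEMPLATE.substitute(drug)


if __name__ == "__main__":
    print(drug_to_cem(DRUG))
```

Keeping the output shape in a template means the target format can change without touching the extraction code.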
Year 2 and Forward
The patient returns to the outpatient clinic today for follow-up
the patient will complete his thiotepa dose today , and he will return tomorrow for the last dose of his thiotepa .
His donor completed stem-cell collection yesterday
[Figure, built up over several slides: the passage annotated with semantic roles (Agent, Loc, Theme, Purpose, Action), a coreference link (“patient’s donor”), and temporal relations (OVERLAP, TERMINATES), ending in a timeline: donor stem-cell collection completed; patient return to clinic, thiotepa dose; final thiotepa dose]
Courtesy of Martha Palmer
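The final timeline on these slides can be approximated with a very small sketch. It assumes each event carries an explicit temporal expression (yesterday, today, tomorrow) that maps to a day offset; the `Event` class, the offsets table, and the event descriptions are illustrative only. Real temporal relation discovery has to infer relations such as OVERLAP and TERMINATES from context rather than read them off a lookup table.

```python
from dataclasses import dataclass
from typing import List

# Assumed mapping from temporal expressions to day offsets from the visit date.
DAY_OFFSETS = {"yesterday": -1, "today": 0, "tomorrow": 1}


@dataclass(frozen=True)
class Event:
    description: str
    anchor: str  # temporal expression found in the note


# Events paraphrased from the example passage, in the order they appear.
EVENTS: List[Event] = [
    Event("patient returns to outpatient clinic", "today"),
    Event("patient completes thiotepa dose", "today"),
    Event("patient returns for last thiotepa dose", "tomorrow"),
    Event("donor completes stem-cell collection", "yesterday"),
]


def timeline(events: List[Event]) -> List[Event]:
    """Order events chronologically; sorted() is stable, so ties keep text order."""
    return sorted(events, key=lambda e: DAY_OFFSETS[e.anchor])


if __name__ == "__main__":
    for event in timeline(EVENTS):
        print(f"{event.anchor:>9}: {event.description}")
```

Sorting puts the donor's stem-cell collection first and the final thiotepa dose last, matching the order of the slide's timeline.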
Y2 Proposed Deliverables
• Release of a library of de-identification tools (Sept, 2011)
– MIST
– MIT/SUNY
• Evaluation workbench (Sept, 2011)
• cTAKES Side Effects module (Aug, 2011)
• Modules for relation extraction (Dec, 2011)
– Semantic role labeler
– Relation classifier
– Integration of CLEAR-TK (University of Colorado)
• End-to-end tool, v2 (cTAKES v2) (April, 2012)
– NLP to populate CEMs for Diseases, Signs/Symptoms, Procedures, Labs, Anatomical sites
– Integration of LexGrid/LexEVS services
Development Challenges and Opportunities
• Open source strategy
• Release early, release often
• Test-driven development with continuous integration
• All milestones measured by what we can get IRB and DUA approved and deployed with real or de-identified clinical data
Courtesy of David Carrell
Partnerships
• Strengthen existing SHARP collaborations
– Initiate collaborations with SHARP 2 around usability
– SHARP 1: methods for data security in a cloud-deployed framework
– i2b2: the glue between SHARP 3 and SHARP 4
• Non-SHARP collaborations
Graphical User Interface (GUI)
to cTAKES: a Prototype
Pei Chen
Children's Hospital Boston
cTAKES as a Service
• Objectives
1. Demo cTAKES prototype web application
– Empower end users to leverage cTAKES
2. Gather feedback for future cTAKES GUI
3. Potential system integrations with other applications (e.g., i2b2, ARC, Web Annotator)
• Developed within i2b2 to integrate cTAKES in the i2b2 NLP cell
cTAKES Web Application
http://chipweb2.chip.org/cTakes_webservice_trunk/index.html
Single clinical note
Technologies
• Front-End (Web GUI): ExtJS, JavaScript
• Middleware (Web Services): Java, Apache CXF, JSON
• Back-End (cTAKES): Java, UIMA
Deployment Considerations
Deployment Model
Security
Performance
Licensing (UMLS, Apache, GPL v.3)