Mayo Clinical Text Analysis


Clinical Text Analysis and
Knowledge Extraction System
(cTAKES)
Guergana Savova
1
Overview
• cTAKES and open-source components
• cTAKES and non-open-source components
• Current efforts
2
Open Health Natural Language
Processing Consortium
• www.ohnlp.org (part of caBIG Vocabulary Knowledge
Center web presence)
• Goal
• foster an open-source collaborative community around
clinical NLP that can deliver best-of-breed annotators,
leverage the dynamic features of UIMA flow-control, and
establish the infrastructure for clinical NLP.
• Two open source releases as part of OHNLP
• Mayo’s pipeline for processing clinical notes (cTAKES)
• IBM’s pipeline for processing medical notes (MedKAT)
and pathology reports (MedKAT/P)
3
4
5
clinical Text Analysis and
Knowledge Extraction System
(cTAKES)
6
Enterprise Data Trust
[Diagram: Mayo Clinic Enterprise Data Trust, one logical data environment; labeled systems: PRIME, MIDIA, MAGIC, MCLSS, DSS, Others]
7
Overview
• Developed at Mayo
• Design principles:
• Information extraction from the clinical narrative
• Generic – to be used for a variety of retrievals and
use cases
• Expandable – at the information model level and
methods
• Modular
• Scalable and robust to meet the rigours of a clinical
research production environment (80M+ notes)
8
cTAKES Technical Details
• Open source release March 15, 2009
• www.ohnlp.org
• Downloads: Documentation and Downloads
• Technical details: Publications
• Framework
• IBM’s Unstructured Information Management
Architecture (UIMA) open source framework
• Methods
• Natural Language Processing methods (NLP)
• Application
• High-throughput phenotype extraction system
(80M+ notes; 80B+ tokens)
9
cTAKES: Components
• Core components
• Sentence boundary detection (OpenNLP)
• Tokenization (rule-based)
• Morphologic normalization (NLM’s “norm”)
• POS tagging (OpenNLP)
• Shallow parsing (OpenNLP)
• Named Entity Recognition
• Diseases/disorders, signs/symptoms, procedures,
anatomical sites, medications
• Dictionary mapping (lookup algorithm)
• Machine learning (MAWUI)
• Negation and status identification (NegEx)
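To make the last two named entity steps concrete, here is a minimal sketch of dictionary lookup followed by a negation check. It is not the cTAKES lookup algorithm or the actual NegEx implementation: the toy dictionary, the three negation cues, the five-token window, and the function names are all assumptions chosen for illustration.

```python
import re

# Hypothetical toy dictionary: surface term -> semantic type (not the cTAKES dictionary).
DICTIONARY = {
    "chest pain": "sign/symptom",
    "diabetes": "disease/disorder",
    "pneumonia": "disease/disorder",
}

# A few negation cues in the spirit of NegEx; a real trigger list is much larger.
NEGATION_CUES = {"no", "denies", "without"}
WINDOW = 5  # assumed number of tokens a cue can reach to its right


def tokenize(text):
    """Lowercase word tokens; each punctuation mark becomes its own token."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def find_entities(text, max_len=3):
    """Longest-match dictionary lookup over token n-grams, with a negation flag."""
    tokens = tokenize(text)
    entities = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate first so "chest pain" wins over shorter terms.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in DICTIONARY:
                match = (candidate, n)
                break
        if match is None:
            i += 1
            continue
        term, n = match
        # Flag the entity as negated if a cue appears in the preceding window.
        preceding = tokens[max(0, i - WINDOW):i]
        negated = any(tok in NEGATION_CUES for tok in preceding)
        entities.append({"text": term, "type": DICTIONARY[term], "negated": negated})
        i += n
    return entities


for entity in find_entities("Patient denies chest pain. History of diabetes."):
    print(entity)
```

The window here ignores sentence boundaries for brevity; in a pipeline like the one listed above, the sentence boundary detector would normally limit how far a negation cue can reach.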
10
cTAKES example
11
Annotated Corpora
• Linguistic gold standard
• 273 clinical notes; 100,650 tokens; 7,299 sentences
• Annotated for sentence boundaries, tokens, POS
tags (IAA=0.993) and shallow parses (IAA
PSA=0.905, IAA Kappa = 0.854)
• NE gold standard
• 160 notes; 47,975 tokens; 1,466 NEs of type
Disorder
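The agreement figures on this slide can be illustrated with a short sketch. It assumes that "PSA" stands for positive specific agreement and that "Kappa" is Cohen's kappa for two annotators; the slide does not spell either out. The labels and spans in the example are invented and unrelated to the Mayo corpus.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label sequences")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def positive_specific_agreement(spans_a, spans_b):
    """PSA = 2a / (2a + b + c), where a counts annotations made by both annotators."""
    spans_a, spans_b = set(spans_a), set(spans_b)
    both = len(spans_a & spans_b)
    only_one = len(spans_a ^ spans_b)  # b + c: annotations made by exactly one annotator
    if both == 0 and only_one == 0:
        return 1.0
    return 2 * both / (2 * both + only_one)


# Invented chunk labels for six tokens from two annotators.
annotator_1 = ["NP", "VP", "NP", "PP", "NP", "VP"]
annotator_2 = ["NP", "VP", "NP", "NP", "NP", "VP"]
print(round(cohen_kappa(annotator_1, annotator_2), 3))

# Invented chunk spans as (start, end) character offsets.
print(round(positive_specific_agreement({(0, 2), (3, 5), (6, 9)},
                                        {(0, 2), (3, 5), (7, 9)}), 3))
```

Kappa corrects raw agreement for chance, which is why it can sit below a span-level measure such as PSA on the same annotations.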
12
Evaluation Overview - I
• Sentence boundary detector (accuracy)
• Comparable to Buyko et al., 2006
• Tokenizer (accuracy)
• Accuracy=0.9490
• Baseline space-delimited tokenizer accuracy =
0.7156
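The gap between the rule-based tokenizer and the space-delimited baseline can be illustrated with a small sketch. The rules below are a toy stand-in, not the cTAKES tokenizer, and the accuracy definition (share of gold tokens reproduced with identical character offsets) is an assumption, since the slide does not define how accuracy was computed. The sentence and gold tokens are invented.

```python
import re

TEXT = "Pt. denies pain; takes 2.5 mg daily."
GOLD = ["Pt", ".", "denies", "pain", ";", "takes", "2.5", "mg", "daily", "."]


def whitespace_tokenize(text):
    """Baseline: split on whitespace only."""
    return text.split()


def rule_based_tokenize(text):
    """Toy rules: keep decimals and hyphenated words whole, split other punctuation."""
    return re.findall(r"\d+(?:\.\d+)?|\w+(?:-\w+)*|[^\w\s]", text)


def spans(text, tokens):
    """Character offsets of each token, located left to right in the text."""
    result, cursor = set(), 0
    for token in tokens:
        start = text.index(token, cursor)
        cursor = start + len(token)
        result.add((start, cursor))
    return result


def token_accuracy(text, predicted, gold):
    """Fraction of gold tokens whose exact offsets the tokenizer reproduced."""
    gold_spans = spans(text, gold)
    return len(spans(text, predicted) & gold_spans) / len(gold_spans)


print("baseline  :", token_accuracy(TEXT, whitespace_tokenize(TEXT), GOLD))
print("rule-based:", token_accuracy(TEXT, rule_based_tokenize(TEXT), GOLD))
```

The baseline loses every token that has punctuation attached, such as "pain;" and "daily.", which is the kind of error that punctuation-aware rules remove.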
13
Evaluation overview - II
• POS tagger (accuracy)
• Comparable to Buyko et al., 2006
• Shallow parser (CoNLL script)
• Comparable to Buyko et al., 2006
14
Evaluation overview - III
• NER
• Main sources of errors
• Abbreviations and WSD, e.g. “Dr.” with a mapping
to “diabetic retinopathy”, “is” -> “immune
suppression”
• Lexical variations and complex level of synonymy:
“bladder showed very mild trabeculation” -> “trabeculated
bladder”
15
cTAKES Publication
• Preliminary results:
• Savova, Guergana; Kipper-Schuler, Karin;
Buntrock, James and Chute, Christopher. 2008.
UIMA-based clinical information extraction
system. LREC 2008: Towards enhanced
interoperability for large HLT systems: UIMA for
NLP.
• Manuscript with detailed system description and
evaluation under review at JAMIA
16
clinical Text Analysis and
Knowledge Extraction System
(cTAKES):
Non-Open-Source Components
17
Components in Production
• Drug profile
• WSD module for 50 ambiguities
• Patient smoking status discovery
• Document level
• Patient level
18
Expansion of cTAKES
Information/Annotation Model with
Medication-specific Attributes
Drug mention class
drug mention text: Drug Mention Element
associated code primary: Associated Code Element
associated code secondary: Associated Code Element
context: Context Element
negation: boolean
start date: Start Date Element
end date: End Date Element
dosage: Dosage Element
frequency: Frequency Element
frequency unit: Frequency Unit Element
duration: Duration Element
route: Route Element
form: Form Element
change status: Drug Change Status Element
strength: Strength Element
Values extracted from text:
Tamoxifen 20 mg po daily started on March 1, 2005
drug mention text: Tamoxifen
associated code primary: C0351245
associated code secondary: null
context: current
negation: false
start date: March 1, 2005
end date: null
dosage: 1.0
frequency: 1.0
frequency unit: daily
duration: null
route: Enteral_Oral
form: null
change status: noChange
strength: 20 mg
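The drug mention class and the Tamoxifen example can be mirrored in a few lines of code. This is a plain-Python sketch, not the UIMA type system used by cTAKES: the class name `DrugMention`, the snake_case field names, and the simplification of every element type to a string, float, or boolean are assumptions. The values are the ones shown on the slide, with `null` represented as `None`.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class DrugMention:
    """Stand-in for the drug mention class; element types are simplified."""
    drug_mention_text: str
    associated_code_primary: Optional[str] = None
    associated_code_secondary: Optional[str] = None
    context: Optional[str] = None
    negation: bool = False
    start_date: Optional[str] = None
    end_date: Optional[str] = None
    dosage: Optional[float] = None
    frequency: Optional[float] = None
    frequency_unit: Optional[str] = None
    duration: Optional[str] = None
    route: Optional[str] = None
    form: Optional[str] = None
    change_status: Optional[str] = None
    strength: Optional[str] = None


# Values for "Tamoxifen 20 mg po daily started on March 1, 2005", as listed on the slide.
tamoxifen = DrugMention(
    drug_mention_text="Tamoxifen",
    associated_code_primary="C0351245",
    context="current",
    negation=False,
    start_date="March 1, 2005",
    dosage=1.0,
    frequency=1.0,
    frequency_unit="daily",
    route="Enteral_Oral",
    change_status="noChange",
    strength="20 mg",
)

# Print only the attributes that were actually filled from the text.
for name, value in asdict(tamoxifen).items():
    if value is not None:
        print(f"{name}: {value}")
```

Attributes that the sentence does not mention (end date, duration, form, secondary code) simply stay unset, matching the `null` entries on the slide.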
19
cTAKES publication
• Manuscript with detailed algorithm description
and evaluation under review with Cancer
Epidemiology, Biomarkers and Prevention
journal
20
Current Efforts - I
• Medication side-effect extraction
• Anaphoric relations and coreference (ODIE)
• In collaboration with Chapman
• Semantic processing of the clinical text (in
collaboration with Palmer, Martin and Ward)
• Treebanking (deep parses)
• Predicate-argument structure and semantic labeling
(PropBanking)
• UMLS relations (except temporal relations)
21
Current Efforts - II
• Temporal relation discovery
• In collaboration with Palmer, Martin and Ward
• Pending funding decision, to start July 1, 2010
• Clinical lexical resources
• In collaboration with Chapman and Elhadad
• A la Treebank and clinical named entities with
attributes and modifiers
• Pending funding decision, to start July 1, 2010
22
Questions?
23