Slide-ppt - DePaul University

BioNLP, Information Extraction from
Radiology Reports
Emilia Apostolova
College of Computing and Digital Media
DePaul University
BioNLP – conferences and shared tasks
• Pacific Symposium on Biocomputing
• Intelligent Systems for Molecular Biology
• Association for Computational Linguistics
• North American Chapter of the Association for Computational Linguistics
• BioNLP
• BioCreative
• TREC Genomics
• IClef

Information Extraction (in BioMedicine)
The NLP Pipeline
• Lexical Analysis – tokenization, morphological analysis, linguistic lexicons.
• Syntactic Analysis – Part of Speech Tagging, Chunking, Parsing.
• Semantic Analysis – Lexical Semantic Interpretation, Semantic Interpretation of Utterances.
NLP Pipeline Frameworks
• GATE - General Architecture for Text Engineering.
• Apache UIMA - Unstructured Information Management Architecture.
• GeneWays - a system for automatically extracting, analyzing, visualizing and integrating molecular pathway data from the research literature.
• PASTA - Protein Structures and Information Extraction from Biological Texts.
Lexical Analysis - Tokenization
Segmenting text into linguistic tokens – words and sentences.
• Abbreviations - The Study was conducted within the U.S.
• Apostrophes - IL-10's cytokine synthesis inhibitory activity
• Hyphenation - co-operate, cooperate
• Multiple formats - 464,285.23 and 464285.23
• Sentence boundary detection - :, ;, -
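The tokenization cases above can be sketched with a single regular expression. This is a minimal illustration, not the tokenizer used in the talk: the pattern, the name TOKEN_PATTERN and the function tokenize are assumptions made for this example, and a real system would also need an abbreviation lexicon.

```python
import re

# Illustrative pattern only; the alternatives are tried left to right.
TOKEN_PATTERN = re.compile(
    r"""
    (?:[A-Za-z]\.){2,}            # dotted abbreviations such as U.S.
  | \d{1,3}(?:,\d{3})+(?:\.\d+)?  # numbers with thousands separators: 464,285.23
  | \d+(?:\.\d+)?                 # plain numbers: 464285.23
  | \w+(?:-\w+)*                  # words, keeping internal hyphens: IL-10, co-operate
  | 's                            # possessive clitic split off as its own token
  | [^\w\s]                       # any other single punctuation mark
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split text into a flat list of tokens using TOKEN_PATTERN."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("IL-10's cytokine synthesis inhibitory activity"))
print(tokenize("The Study was conducted within the U.S."))
```

Note that the final period of "U.S." is absorbed into the abbreviation, so a sentence boundary at that position is no longer visible to a later sentence splitter. This is the ambiguity the Abbreviations bullet points at.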
Lexical Analysis – Morphological Analysis
Link surface variants of a lexical element to its canonical base form, e.g. inflections (activat-es, activat-ed, activat-ing) and derivations (activation).
Porter stemmer – lexicon-free approach. Finds the longest match of a word against a list of English derivational and inflectional suffixes.
Two-level morphology – a finite-state approach that applies a series of parallel transducers to input tokens. (fly -> flies)
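The lexicon-free idea can be illustrated with a toy longest-suffix stripper. This is not the Porter algorithm, which applies several ordered rule phases with conditions on the stem. The suffix list and the min_stem threshold below are assumptions chosen so that the activat- examples from the slide collapse to one stem.

```python
# Toy suffix list (an assumption for illustration, not Porter's rule set).
SUFFIXES = ["ion", "ing", "ed", "es", "e", "s"]


def strip_suffix(word, min_stem=3):
    """Remove the longest matching suffix, keeping at least min_stem letters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for w in ["activates", "activated", "activating", "activation", "activate"]:
    print(w, "->", strip_suffix(w))
```

The result is a stem, not necessarily a dictionary word. Mapping surface forms to true base forms needs rules such as those of two-level morphology.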
Syntactic Level – Part of Speech Tagging
activation – POS noun, singular
activate – POS verb, present, non-3rd person singular
active – POS adjective
report?
Syntactic Level - Parsing
A natural language parser is a program that works out the
grammatical structure of sentences, for instance, which
groups of words go together (as "phrases") and which
words are the subject or object of a verb.
The Stanford Dependency Parser - a Java implementation
of probabilistic natural language parsers, trained on the
Penn Treebank.
Semantic Level – Lexical Interpretation
• Selectional Restrictions:
transitive verbs: inhibit [something], transcribe [something]
semantic restrictions: inhibit [Process], transcribe [Nucleic Acid]
Syntactically admissible, but semantically invalid:
to inhibit amino acids
to transcribe cell growth
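A selectional restriction check can be sketched as a lookup of the semantic type a verb requires against the semantic type of its object. The two small dictionaries below are hypothetical stand-ins for this example; in practice the types would come from a lexical resource, and the type label given to "amino acids" is only a placeholder.

```python
# Hypothetical mini-lexicons for illustration only.
VERB_OBJECT_TYPE = {
    "inhibit": "Process",
    "transcribe": "Nucleic Acid",
}
NOUN_TYPE = {
    "cell growth": "Process",
    "amino acids": "Amino Acid",  # placeholder type label
    "DNA": "Nucleic Acid",
}


def is_semantically_valid(verb, obj):
    """True if the object's semantic type is the one the verb selects for."""
    required = VERB_OBJECT_TYPE.get(verb)
    actual = NOUN_TYPE.get(obj)
    return required is not None and required == actual


print(is_semantically_valid("inhibit", "cell growth"))
print(is_semantically_valid("inhibit", "amino acids"))
print(is_semantically_valid("transcribe", "cell growth"))
```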
Discourse Level - Pragmatics
• Discourse referents; what entities does a given message refer to?
• What background knowledge is needed to understand a given message?
• How do the beliefs of speaker and hearer interact in the interpretation of a message?
• What is a relevant answer to a given question?
• Summarization, Translation, Dialog Systems, Natural Language Generation.
Lexical resources for (Bio)NLP
• Princeton WordNet
• NLM UMLS lexicon and Metathesaurus.
• The Open Biomedical Ontologies
Text and Image Integration
Automatic Image Annotation
Where? Woman (Population Group), Right breast (Body Part,
Organ, or Organ Component)
How? Mammography (Diagnostic Procedure)
What? Calcification (Pathologic Function), Lesion (Finding),
Carcinoma, Papillary (Neoplastic Process)
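The Where / How / What grouping above can be sketched as a mapping from semantic type to annotation facet. The mapping is read directly off this one slide example. The names FACET_BY_TYPE and group_by_facet are made up for illustration, and how the terms and their semantic types are identified in the first place is not shown here.

```python
from collections import defaultdict

# Semantic type -> annotation facet, as in the slide example.
FACET_BY_TYPE = {
    "Population Group": "Where",
    "Body Part, Organ, or Organ Component": "Where",
    "Diagnostic Procedure": "How",
    "Pathologic Function": "What",
    "Finding": "What",
    "Neoplastic Process": "What",
}


def group_by_facet(concepts):
    """Group (term, semantic_type) pairs under Where / How / What."""
    facets = defaultdict(list)
    for term, semantic_type in concepts:
        facet = FACET_BY_TYPE.get(semantic_type)
        if facet is not None:  # skip types with no assigned facet
            facets[facet].append(term)
    return dict(facets)


concepts = [
    ("Woman", "Population Group"),
    ("Right breast", "Body Part, Organ, or Organ Component"),
    ("Mammography", "Diagnostic Procedure"),
    ("Calcification", "Pathologic Function"),
    ("Lesion", "Finding"),
    ("Carcinoma, Papillary", "Neoplastic Process"),
]
print(group_by_facet(concepts))
```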
IE from Clinical Texts – Radiology and
Pathology Reports
Northwestern University Medical School
Department of Radiology
Imaging Informatics
Radiology Reports
Sample Radiology Report
Patient Name: XXXXXXX, XXXXX
Medical Record Number: XXXXXXXXXX DOB: XXXX.XX.XX Sex: F
Accession Number: XXXXXXXX
Study Requested: DIG MAMMOGRAM SCREENING (3300000)
Scheduled Date and Time: XXXX.XX.XX 13:02:00.0000
Requesting Physician: XXXXXXX,
Reason for Exam: V76.12
----------------------------Radiological Report---------------------------------
Comparison is made to previous exams dated XX/XX/XX.
CLINICAL HISTORY: Seventy-two year old woman for screening exam. Patient has a family history of breast cancer, sister age
sixty years old. Patient has a history of a previous left breast benign biopsy.
TECHNIQUE: Mammograms were obtained using digital technique.
FINDINGS: There is dense fibroglandular tissue bilaterally. No dominant masses or clustered microcalcifications suggestive of
malignancy are seen.
1. NO SPECIFIC FEATURES OF MALIGNANCY SEEN EITHER BREAST.
2. NO SIGNIFICANT CHANGE WHEN COMPARED WITH PRIOR STUDIES.
3. ANNUAL SCREENING MAMMOGRAM IS RECOMMENDED.
CODE (1): NEGATIVE
Attending Radiologist: XXXXXXX, MD
Date Signed off: XXXXXX, Transc. by: NS
NLP for Clinical Texts
• Document retrieval – case finding.
• Subject recruitment – identify patients that can benefit from a study.
• Surveillance – monitoring disease outbreaks.
• Discovery of disease-drug associations.
• Discovery of disease-finding associations.
IE from Radiology Reports
Automatic Section Segmentation
Demographics
History
Comparison
Technique
Findings
Impression
Recommendation
Sign off
Dataset
215,000 free-text radiology reports selected randomly from 3 million reports over a period of 9 years and representing 24 different types of diagnostic procedures.
Method – Training Set
• Hand-crafted rules for automatic extraction of a training set. Common boundary patterns: e.g. section Findings is the text between a known Findings header and the next known section header:
^(finding|observation|discussion)s?:
^(\W*)(finding|observation|discussion)s?(\W*)$
• 3,000 automatically segmented “high-confidence” radiology reports, containing all 8 sections of interest.
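The boundary-pattern idea can be sketched in Python as follows. The two slide patterns are merged into one expression and restricted to a single line. The vocabulary for the closing header (NEXT_WORDS) and the function names are assumptions for this example, not the rules actually used to build the training set.

```python
import re

# Header vocabulary taken from the two patterns on the slide.
FINDINGS_WORDS = r"(?:finding|observation|discussion)s?"
# Hypothetical vocabulary for the header that closes the section.
NEXT_WORDS = r"(?:impression|conclusion|recommendation)s?"

FLAGS = re.IGNORECASE | re.MULTILINE


def header_pattern(words):
    # Either "Findings:" at the start of a line, or the word alone on its line.
    return re.compile(rf"^[^\w\n]*{words}(?::|[^\w\n]*$)", FLAGS)


FINDINGS_HEADER = header_pattern(FINDINGS_WORDS)
NEXT_HEADER = header_pattern(NEXT_WORDS)


def extract_findings(report):
    """Return the text between a Findings header and the next known header."""
    start = FINDINGS_HEADER.search(report)
    if start is None:
        return None
    end = NEXT_HEADER.search(report, start.end())
    stop = end.start() if end else len(report)
    return report[start.end():stop].strip()
```

Reports where extract_findings returns None, or where other sections cannot be located in the same way, would be excluded from a "high-confidence" set under this scheme.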
Method
• Classification task - each sentence from a radiology report is assigned to one of 8 predefined report sections.
• Sentence features used for training a classifier:
Sentence Orthography - Possible orthographic types are All Capitals, Mixed Case, or presence of a Header pattern, such as a phrase at the beginning of a line followed by a colon.
Previous Sentence Boundary - Formatting boundary separating the current and previous text sentences. Possible values are white space containing new lines, white space without new lines, non-alphabetic characters, or the beginning of the file.
Following Sentence Boundary - Formatting boundary separating the current and next text sentences. Possible values are white space containing new lines, white space without new lines, non-alphabetic characters, or the end of the file.
Cosine Vector Distance - Distance from the current sentence to each of the eight sections' word vectors.
Exact Header Match - This feature specifies if the sentence contains a header identified as belonging to one of the sections in the training data.
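Two of these features, Sentence Orthography and Cosine Vector Distance, can be sketched as below. This is a minimal illustration under stated assumptions: the word vectors are plain term-frequency counts, distance is taken as 1 minus cosine similarity, and the header regex is a guess at "a phrase at the beginning of a line followed by a colon". The talk does not specify the weighting or the exact patterns, and all function names are made up for this example.

```python
import math
import re
from collections import Counter


def word_vector(text):
    """Bag-of-words term-frequency vector for a piece of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # an empty vector is treated as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def orthography(sentence):
    """Classify a sentence as 'header', 'all_caps', or 'mixed_case'."""
    if re.match(r"^\s*[A-Za-z][A-Za-z ]*:", sentence):
        return "header"
    letters = [ch for ch in sentence if ch.isalpha()]
    if letters and all(ch.isupper() for ch in letters):
        return "all_caps"
    return "mixed_case"


def distance_features(sentence, section_vectors):
    """Distance from the sentence to each section's word vector."""
    vec = word_vector(sentence)
    return {name: cosine_distance(vec, sv) for name, sv in section_vectors.items()}
```

In use, section_vectors would hold one vector per report section, built from the automatically segmented training reports, and the resulting distances would be passed to the classifier alongside the other features.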
Work in Progress
• Identify named entities within sections using a controlled vocabulary – findings, diseases, observations, anatomical organs, imaging modalities.
• Negation Discovery.
• Identify relationships between named entities of interest, for example what observations are associated with a diagnosis.
• Use radiology report text to support automatic annotation of medical images.
Q/A