Automated recognition of malignancy
mentions in biomedical literature
BMC Bioinformatics 2006, 7:492
Speaker: Yu-Ching Fang
Advisors: Hsueh-Fen Juan and
Hsin-Hsi Chen
1
Outline
• Background
• Methods
• Results
• Discussion
• Conclusion
2
Background - Motivation
• The rapid proliferation of biomedical literature
makes it increasingly difficult for researchers
to peruse, query, and synthesize it for biomedical
knowledge gain.
• Comparatively little biomedical text mining work
has been performed to identify disease-related
objects and concepts.
• Related works about automated disease entity
recognition often do not perform well.
• More extensive work on medical entity class
recognition is necessary.
3
Related works
1. Automated extractors for the
identification of gene and protein names.
2. Automated entity recognition applied to the
identification of phenotypic and disease
objects.
3. A machine-learning algorithm to extract
gene-disorder relations.
4. Extract phenotypic attributes from Online
Mendelian Inheritance in Man (OMIM).
4
Goal
• Develop a named entity recognizer (MTag),
an entity tagger for recognizing clinical
descriptions of malignancy presented in
text. MTag is based upon the probability
model Conditional Random Fields (CRFs).
• Minimize manual efforts and still perform
with high accuracy.
5
Conditional Random Fields (CRFs)
• CRFs are probabilistic tagging models that give
the conditional probability of a possible tag
sequence t = t1, ... tn given the input token
sequence o = o1,..., on (Ryan McDonald and
Fernando Pereira, 2005).
• For example, the identification of gene mentions
in text can be implemented as a tagging task.
[Figure: an example token sequence o with its tag sequence t]
Each tag marks whether a token begins (B), continues (I), or is outside (O) of a gene mention.
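As a rough sketch of the B/I/O scheme, the snippet below turns token-level mention spans into one tag per token. The example sentence, the span, and the helper name `bio_tags` are illustrative assumptions and are not taken from the slides.

```python
def bio_tags(tokens, mentions):
    """Return one B/I/O tag per token.

    mentions: list of (start, end) token index pairs, end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end in mentions:
        tags[start] = "B"              # first token of the mention
        for i in range(start + 1, end):
            tags[i] = "I"              # remaining tokens of the mention
    return tags

# Hypothetical sentence; tokens 0-1 are assumed to form one gene mention.
tokens = ["MYCN", "oncogene", "is", "amplified", "in", "neuroblastoma"]
tags = bio_tags(tokens, [(0, 2)])
```

Framing recognition this way lets a sequence model predict one tag per token and still recover multi-token mentions.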
6
Conditional Random Fields (CRFs)
[Figure: the CRF formulation relating the input token sequence to the tag sequence (John Lafferty et al., 2001)]
7
Methods – Task definition
• To develop an automated method that would
accurately identify and extract strings of text
corresponding to a clinician’s or researcher’s
reference to cancer (malignancy).
• Label “Malignancy”: the full noun phrase
encompassing a mention of a cancer subtype.
• For example, “neuroblastoma”, “localized
neuroblastoma” and “primary extracranial
neuroblastoma” were considered to be distinct
mentions of malignancy.
• Directly adjacent prepositional phrases were not
allowed, such as “cancer <of the lung>”.
8
Two corpus combination
1. The first corpus concentrated upon a specific
malignancy (neuroblastoma) and consisted of
1,000 randomly selected abstracts identified by
querying PubMed with the query terms
"neuroblastoma" and "gene".
2. The second corpus consisted of 600 abstracts
previously selected as likely containing gene
mutation instances for genes commonly
mutated in a wide variety of malignancies.
9
Two corpus combination (cont.)
• 1000+600-158=1442 abstracts
(eliminating 158 abstracts that appeared to
be non-topical, had no abstract body, or
were not written in English.)
• The abstracts were manually annotated for
tokenization, part-of-speech assignments, and
malignancy named entity recognition.
10
Two corpus combination (cont.)
• Annotations were performed on all
documents by experienced annotators
with biomedical knowledge.
• Discrepancies were resolved through
forum discussions.
• A total of 7,303 malignancy mentions were
identified in the document set.
11
MTag algorithm
• MTag was developed using the probability model
Conditional Random Fields (CRFs).
• CRFs model the conditional probability of a tag
sequence given an observation sequence.
• O is an observation sequence, or a sequence of
tokens in the text.
• t is a corresponding tag sequence in which each
tag labels the corresponding token with either
Malignancy (meaning that the token is part of a
malignancy mention) or Other.
O: Lung cancer may be related to gene mutation.
t: <Malignancy><Malignancy><Other><Other><Other><Other><Other><Other>
12
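A minimal sketch of how a tag sequence like the one above could be turned back into mention strings. The function name `extract_mentions` is hypothetical, and merging consecutive Malignancy tokens into one mention is an assumption (with only two labels, directly adjacent mentions cannot be told apart).

```python
from itertools import groupby

tokens = "Lung cancer may be related to gene mutation".split()
tags = ["Malignancy"] * 2 + ["Other"] * 6

def extract_mentions(tokens, tags, label="Malignancy"):
    """Join each run of consecutive tokens carrying `label` into one string."""
    mentions = []
    for tag, group in groupby(zip(tokens, tags), key=lambda pair: pair[1]):
        if tag == label:
            mentions.append(" ".join(tok for tok, _ in group))
    return mentions

mentions = extract_mentions(tokens, tags)
```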
MTag algorithm (cont.)
• CRFs are based on a set of feature functions,
fi(tj, tj-1, O).
O: Lung cancer may be related to gene mutation.
t: <Malignancy><Malignancy><Other><Other><Other><Other><Other><Other>
• For example, one feature can represent the
probability that the token "cancer" is tagged with
the label Malignancy given the presence of "lung"
as the previous token.
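For reference, the standard linear-chain CRF of Lafferty et al. (2001) combines feature functions of the form fi(tj, tj-1, O) as shown below. The weights \lambda_i, the normalizer Z(O), and the sequence length n are standard notation assumed here; they do not appear on the slides, and the token position j is left implicit in the feature arguments.

```latex
P(t \mid O) = \frac{1}{Z(O)} \exp\Big( \sum_{j=1}^{n} \sum_{i} \lambda_i \, f_i(t_j, t_{j-1}, O) \Big),
\qquad
Z(O) = \sum_{t'} \exp\Big( \sum_{j=1}^{n} \sum_{i} \lambda_i \, f_i(t'_j, t'_{j-1}, O) \Big)
```

Training fits the weights \lambda_i to the annotated corpus, and tagging picks the sequence t with the highest P(t | O).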
13
MTag algorithm (cont.)
• MTag considers many textual features when
deciding whether a word comprises all or part
of a malignancy mention.
• Word-based features: The frequency of each
string of 2, 3, or 4 adjacent characters (character
n-grams) within each word of the training text
was calculated.
For example, lung (lu, lun, lung, un, ung, ng)
• The differential frequency of each n-gram within
words manually tagged as being malignancy
mentions was considered as a series of features.
For example: lung (bigram: 3/6, trigram: 2/6, four-gram: 1/6)
14
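The "lung" example can be reproduced with a few lines of Python. This is only a sketch of character n-gram extraction; the helper name `char_ngrams` is hypothetical, and how MTag turns these counts into differential-frequency features is not detailed on the slide.

```python
from collections import Counter

def char_ngrams(word, sizes=(2, 3, 4)):
    """Return all character n-grams of the given sizes, grouped by size."""
    word = word.lower()
    return [word[i:i + n] for n in sizes for i in range(len(word) - n + 1)]

grams = char_ngrams("lung")

# Share of each n-gram size among all n-grams of the word.
size_counts = Counter(len(g) for g in grams)
shares = {n: f"{count}/{len(grams)}" for n, count in size_counts.items()}
```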
MTag algorithm (cont.)
• Orthographic features included the usage
and distribution of punctuation, alternative
spellings, and case usage.
• Domain-specific features comprised a
lexicon of 5,555 malignancies and a
regular expression for tokens containing
the suffix -oma.
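A sketch of what token-level orthographic and domain-specific features might look like. The regular expression, the three-term stand-in lexicon, and the feature names are all assumptions made for illustration; the slides do not give MTag's actual pattern or feature definitions.

```python
import re

# Assumed pattern: a token of letters/hyphens ending in "oma" or "omas".
OMA_SUFFIX = re.compile(r"^[a-z-]+omas?$", re.IGNORECASE)

# Tiny stand-in for the 5,555-term lexicon (hypothetical entries).
LEXICON = {"neuroblastoma", "lung cancer", "leukemia"}

def token_features(token):
    """Binary features for one token; feature names are illustrative."""
    return {
        "has_oma_suffix": bool(OMA_SUFFIX.match(token)),
        "in_lexicon": token.lower() in LEXICON,
        "is_capitalized": token[:1].isupper(),
        "has_punctuation": any(not c.isalnum() for c in token),
    }

features = token_features("Neuroblastoma")
```

A suffix rule like this also fires on non-malignancy words such as "coma", which is one reason to treat it as a feature for the model to weigh and not as a hard rule.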
15
Evaluation
• The evaluation set: 432 abstracts
- 2,031 sentences containing mentions of
malignancy
- 3,752 sentences without mentions
• A mention was counted as correctly identified
only if the predicted and manually labeled tags
matched exactly in content and in both boundaries.
• The performance of MTag was calculated
according to precision, recall and F-measure.
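The exact-match criterion and the three measures can be sketched as follows, assuming the balanced F-measure (harmonic mean of precision and recall). The tuple layout and the example mentions are hypothetical.

```python
def evaluate(predicted, gold):
    """Exact-match precision, recall, and F-measure.

    predicted, gold: sets of (sentence_id, start, end, text) tuples; a
    prediction counts only if the identical tuple appears in gold.
    """
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure

# Hypothetical mentions: one exact match, one boundary error, one miss.
gold = {(1, 0, 2, "lung cancer"), (2, 3, 4, "neuroblastoma"), (3, 1, 2, "leukemia")}
predicted = {(1, 0, 2, "lung cancer"), (2, 2, 4, "localized neuroblastoma")}
p, r, f = evaluate(predicted, gold)
```

Under this criterion a boundary error costs twice: the prediction is a false positive and the gold mention is a false negative.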
16
Results - MTag performance
• Two separate training experiments were
performed, with and without the inclusion
of malignancy-specific features (a lexicon of
malignancy mentions and a list of
indicative suffixes).
17
MTag performance (cont.)
MTag model: all
biological feature sets   Evaluation set                                Precision   Recall   F-measure
Yes                       Neuroblastoma-specific and genome-specific    0.846       0.831    0.838
Yes                       Neuroblastoma-specific                        0.88        0.87     0.88
Yes                       Genome-specific                               0.77        0.69     0.73
No                        Neuroblastoma-specific and genome-specific    0.851       0.818    0.834
18
MTag performance (cont.)
• As expected, the extractor performed with
higher accuracy on the more narrowly
defined corpus (neuroblastoma).
• At least for this class of entities, the
extractor performs the task of identifying
malignancy mentions efficiently without the
use of a specialized lexicon.
19
Extraction versus string matching
• String matching: the NCI (National Cancer
Institute) neoplasm ontology, a term list of
5,555 malignancies, was used as a lexicon
to identify malignancy mentions.
• Lexicon terms were individually queried
against text by case-insensitive exact
string matching.
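A minimal sketch of a lexicon-based baseline of this kind. The two-term lexicon and the example sentence are hypothetical, and the use of word boundaries is an assumption; the slides only specify case-insensitive exact matching.

```python
import re

def match_lexicon(text, lexicon):
    """Return (start, end, term) for every case-insensitive exact match."""
    hits = []
    for term in lexicon:
        # Word boundaries are an assumption; the slides only say "exact".
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        hits.extend((m.start(), m.end(), term) for m in pattern.finditer(text))
    return sorted(hits)

lexicon = ["leukemia", "lung cancer"]  # hypothetical stand-in for the NCI list
text = "Lung cancer and acute myeloid leukaemia (AML) were studied."
hits = match_lexicon(text, lexicon)
```

Only "Lung cancer" is found here: the spelling variant "leukaemia" and the acronym "AML" fall outside what exact matching can recover.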
20
Extraction versus string matching (cont.)
• 39 abstracts (202 malignancy mentions) were
randomly selected from the testing set
(432 abstracts).
• MTag (the automated extractor) and string
matching were both applied to these abstracts.
21
Extraction versus string matching (cont.)
• MTag identified 190 of the 202 mentions
correctly (94.1%), while the NCI list
identified only 85 mentions (42.1%), all of
which were also identified by the extractor.
22
Extraction versus string matching (cont.)
• Change the lexicon for string matching:
- NCI list: 85 of 202 mentions (42.1%)
- Malignancy mentions identified in the
manually curated training set annotations
(1,010 documents): 79 of 202 mentions (39.1%)
• Combining the manually derived lexicon with the
NCI lexicon yielded 124 of 202 matches (61.4%).
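The percentages reported for the 39-abstract comparison follow directly from the counts on these slides, as this short check shows. The dictionary keys are just descriptive labels chosen here.

```python
total = 202  # malignancy mentions in the 39 sampled abstracts

counts = {
    "MTag": 190,
    "NCI list": 85,
    "training-set lexicon": 79,
    "combined lexicons": 124,
}

# Share of the 202 mentions found by each method, in percent.
found_pct = {name: round(100 * n / total, 1) for name, n in counts.items()}

# Mentions that string matching with the combined lexicons did not find.
missed_by_matching = total - counts["combined lexicons"]
```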
23
Extraction versus string matching (cont.)
• 202 - 124 = 78 malignancy mentions were missed
by string matching with the combined lists; 68 of
them were positively identified by MTag.
• These 68 malignancy mentions included:
- Minor variations in spelling and form
(e.g., "leukaemia" versus "leukemia")
- Acronyms (e.g., "AML" in place of
"acute myeloid leukemia")
- New mentions of malignancies that were in
neither the NCI list nor the training set.
• This suggests that MTag contributes a
significant learning component.
24
Application to MEDLINE
• MTag was used to extract mentions of
malignancy from all MEDLINE abstracts
through 2005.
• 15,433,668 documents
• A total of 9,153,340 redundant mentions
and 580,002 unique mentions (ignoring
case) were identified.
25
Application to MEDLINE (cont.)
• The 25 mentions found in the greatest number of
abstracts by MTag
26
Application to MEDLINE (cont.)
• Six false positives: pulmonary, fibroblasts,
neoplastic, neoplasm metastasis, extramural,
and abdominal
• Only "extramural" is not frequently associated
with malignancy descriptions.
• The remaining five phrases are likely the result
of the extractor:
- failing to properly define mention boundaries in
certain cases, for example "neoplasm" vs.
"neoplasm metastasis".
- relying on an otherwise indicative character
string shared between a true positive and a
false positive (e.g., "opl" in "brain neoplasm"
and "neoplastic").
27
Application to MEDLINE (cont.)
• To assess document-level precision, 100
abstracts identified by MTag were
randomly selected for each of the
malignancies "breast cancer" and
"adenocarcinoma".
• Manual evaluation of these abstracts
showed that all of the articles were true
positives.
28
MTag input and output
• Input: MTag directly accepts files
downloaded from PubMed in MEDLINE
format.
• Output: the extractor writes its results as
text or HTML files.
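For orientation, a MEDLINE-format file stores each field as a four-character tag, a hyphen, and a value, with wrapped lines indented. The sketch below reads such a file into records. It is not MTag's own input code, the function name `parse_medline` is hypothetical, and the sample record is invented.

```python
def parse_medline(text):
    """Parse MEDLINE-format text into a list of {tag: [values]} records."""
    records, current, tag = [], {}, None
    for line in text.splitlines():
        if not line.strip():                      # blank line ends a record
            if current:
                records.append(current)
            current, tag = {}, None
        elif line[:4].strip() and line[4:6] == "- ":   # e.g. "PMID- " or "AB  - "
            tag = line[:4].strip()
            current.setdefault(tag, []).append(line[6:].strip())
        elif tag is not None:                     # indented continuation line
            current[tag][-1] += " " + line.strip()
    if current:
        records.append(current)
    return records

# Invented sample record for illustration only.
sample = (
    "PMID- 00000000\n"
    "TI  - A hypothetical title.\n"
    "AB  - Neuroblastoma is a childhood\n"
    "      malignancy.\n"
)
records = parse_medline(sample)
```

The abstract text under the `AB` tag is what an extractor like MTag would tokenize and tag.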
29
MTag HTML output
30
Discussion
• It is evident that an F-measure of 0.83 is
not sufficient as a stand-alone approach
for curation tasks.
• However, such an approach provides
highly enriched material for manual
curators to utilize further.
• Substantial improvement and efficiency
• MTag appeared to be accurately predicting
malignancy mentions.
31
Discussion (cont.)
• Analysis of mis-annotations would likely
suggest additional features and/or
heuristics that could boost performance
considerably.
• There may be no need for extensive
domain-specific lexicons, because the addition of
biological features provided very little
boost to the recall rate.
32
Conclusion
• MTag is one of the first directed efforts to
automatically extract entity mentions in a
disease-oriented domain with high
accuracy.
• MTag substantially outperformed
information retrieval methods using
specialized lexicons.
• When combined with expert evaluation of
output, MTag can assist with vocabulary
building for the cancer entity class.
33
Thank you for your attention
34