
A Simple Algorithm for
Identifying Abbreviation
Definitions in Biomedical Text
A. S. Schwartz & M. A. Hearst
UC Berkeley
Presented by Jing Jiang
The Problem – to Identify
Acronyms
• To identify <“short form”, “long form”>
pairs from biomedical text:
– Short form is abbreviation of long form
– There exists character mapping from short
form to long form
– Example:
• Gcn5-related N-acetyltransferase (GNAT)
• A non-trivial problem:
– Words in long form may be skipped
– Internal letters in long form may be used
Previous Work
• Machine learning approach
– Linear regression (Chang et al.)
– Encoding and compression (Yeates et al.)
• Heuristic approach
– Rule-based
– Factors considered include:
• Distance between definition and abbreviation
• Number of stop words
• Capitalization
Step 1: Identifying Candidates
• Consider only two cases:
– long form ‘(’ short form ‘)’
– short form ‘(’ long form ‘)’
• Short form:
– No more than 2 words
– Between 2 and 10 chars
– At least one letter
– First char alphanumeric
• Long form:
– Adjacent to short form
– No more than min(|A| + 5, |A| * 2) words,
where |A| is the number of chars in the short form
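The candidate rules above can be sketched in Java roughly as follows. The class and method names (`CandidateRules`, `isValidShortForm`, `maxLongFormWords`) are hypothetical, and the sketch assumes that the 2 to 10 character limit and |A| both count every character of the trimmed short form, including spaces; the slide does not say.

```java
// Hypothetical helper for Step 1; names are illustrative, not from the slides.
final class CandidateRules {

    /** Checks the four short-form rules listed on the slide. */
    static boolean isValidShortForm(String shortForm) {
        String s = shortForm.trim();
        // Between 2 and 10 chars (assumption: spaces are counted).
        if (s.length() < 2 || s.length() > 10) {
            return false;
        }
        // No more than 2 words.
        if (s.split("\\s+").length > 2) {
            return false;
        }
        // First char alphanumeric.
        if (!Character.isLetterOrDigit(s.charAt(0))) {
            return false;
        }
        // At least one letter.
        for (int i = 0; i < s.length(); i++) {
            if (Character.isLetter(s.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    /** Upper bound on the long form's length in words: min(|A| + 5, |A| * 2). */
    static int maxLongFormWords(String shortForm) {
        int a = shortForm.trim().length(); // |A|, assumed to be the char count
        return Math.min(a + 5, a * 2);
    }
}
```

For the short form GNAT, |A| = 4, so the long-form candidate is limited to min(9, 8) = 8 words.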
Step 2: Identifying Correct
Long Forms
• Scanning from right to left, find the shortest
long form that matches the short form:
– Each character in the short form must
match a character in the long form, in order
– The first character of the short form must
match the initial character of the first word
in the long form
Java Code for Finding
the Best Long Form for
a Given Short Form
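A minimal Java sketch of the right-to-left matching from Step 2, written for this transcript rather than copied from the slide. The class name `AbbreviationMatcher` and the parameter names are assumptions. The input `candidateLongForm` is taken to be the text adjacent to the short form, already cut to the word limit from Step 1.

```java
// Sketch of Step 2; an illustration of the described rules, not the slide's listing.
final class AbbreviationMatcher {

    /**
     * Returns the shortest suffix of candidateLongForm, cut at a space,
     * whose characters cover shortForm when both are scanned right to left.
     * Returns null if the short form cannot be matched.
     */
    static String findBestLongForm(String shortForm, String candidateLongForm) {
        int lIndex = candidateLongForm.length() - 1;
        for (int sIndex = shortForm.length() - 1; sIndex >= 0; sIndex--) {
            char curr = Character.toLowerCase(shortForm.charAt(sIndex));
            // Spaces and punctuation in the short form need no match.
            if (!Character.isLetterOrDigit(curr)) {
                continue;
            }
            // Move left until curr is found. For the first short-form
            // character, also require the match to start a word.
            while ((lIndex >= 0
                        && Character.toLowerCase(candidateLongForm.charAt(lIndex)) != curr)
                    || (sIndex == 0 && lIndex > 0
                        && Character.isLetterOrDigit(candidateLongForm.charAt(lIndex - 1)))) {
                lIndex--;
            }
            if (lIndex < 0) {
                return null; // ran off the left edge: no match
            }
            lIndex--;
        }
        // Extend left to the nearest space so the long form starts on a whole word.
        int start = candidateLongForm.lastIndexOf(' ', lIndex) + 1;
        return candidateLongForm.substring(start);
    }
}
```

Called with the short form GNAT and the candidate text "the enzyme Gcn5-related N-acetyltransferase", the sketch returns "Gcn5-related N-acetyltransferase". For the pair <5-HT, serotonin> it returns null, since no "h" can be found to the left of the matched "t".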
Evaluation
• 1000 randomly selected MEDLINE
abstracts
– 82% recall, 95% precision
• Medstract Gold Standard Evaluation
Corpus
– 82% recall, 96% precision
– Compared with
• 83% recall, 80% precision (Chang et al., linear
regression)
• 72% recall, 98% precision (Pustejovsky et al.,
heuristics)
Missing Pairs
• Skipped characters in short form
– <CNS1, cyclophilin seven suppressor>
• No match
– <5-HT, serotonin>
• Out of order
– <ATN, anterior thalamus>
• Partial match
– <Pol I, RNA polymerase I>
Discussion
• Pros:
– Simple method
– Decent performance
• Questions:
– Tradeoff between complexity of rules
and performance
– Generality of the heuristic rules
– Heuristics vs. machine learning
Mining MEDLINE for Implicit
Links between Dietary
Substances and Diseases
P. Srinivasan & B. Libbus
U. Iowa
Presented by Jing Jiang
The Goal – to Discover Implicit
Links between Topics
• Open discovery
– Start from topic A
– Navigate through intermediate topics B1,
B2, etc.
– Reach terminal topics C1, C2, etc.
• Closed discovery
– Start from topics A and C
– Find connections B1, B2, etc.
General model for discovering
implicit links between topics
Terminology
• Topic Profile: a set of terms that are
highly related to the topic, together
with weights assigned to each term
• MeSH: Medical Subject Headings
• UMLS types: Unified Medical
Language System semantic types
Open Discovery Algorithm
• Input:
– Topic A
– Two sets of UMLS types ST-B & ST-C
– Threshold M
• Output:
– Terms related to A and of some type in
ST-C
Open Discovery Algorithm
(cont.)
• Build topic A’s profile AP
• For each type in ST-B, select M top
terms B1, B2, etc. from AP
• Build Bi’s profiles BPi
• Build combined profile CP from BPs
limited to types in ST-C
• Remove terms directly linked to A
from CP
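The steps above can be sketched in Java as below. This is an illustration under several assumptions that the slides do not settle: profiles are supplied by a caller-provided function `profileOf`, UMLS types come from a lookup map `typesOf`, the BP profiles are combined by summing weights, and a term counts as "directly linked to A" when it already appears in A's profile. All names are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch of open discovery; not the authors' implementation.
final class OpenDiscoverySketch {

    static Map<String, Double> discover(
            String topicA, Set<String> stB, Set<String> stC, int m,
            Function<String, Map<String, Double>> profileOf, // topic -> weighted terms
            Map<String, Set<String>> typesOf) {              // term -> UMLS types

        // Build topic A's profile AP.
        Map<String, Double> ap = profileOf.apply(topicA);

        // For each type in ST-B, select the M top-weighted terms of that type from AP.
        Set<String> bTerms = new LinkedHashSet<>();
        for (String type : stB) {
            ap.entrySet().stream()
                .filter(e -> typesOf
                    .getOrDefault(e.getKey(), Collections.<String>emptySet())
                    .contains(type))
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(m)
                .forEach(e -> bTerms.add(e.getKey()));
        }

        // Build each Bi's profile and merge into CP, keeping only ST-C types.
        // Assumption: weights from different BPs are summed.
        Map<String, Double> cp = new HashMap<>();
        for (String b : bTerms) {
            for (Map.Entry<String, Double> e : profileOf.apply(b).entrySet()) {
                Set<String> types =
                    typesOf.getOrDefault(e.getKey(), Collections.<String>emptySet());
                if (!Collections.disjoint(types, stC)) {
                    cp.merge(e.getKey(), e.getValue(), Double::sum);
                }
            }
        }

        // Remove terms directly linked to A (assumption: terms already in AP).
        cp.keySet().removeAll(ap.keySet());
        cp.remove(topicA);
        return cp;
    }
}
```

The returned map holds the candidate C terms with their combined weights; sorting it by weight would give a ranked list for inspection.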
Building Profile for Topic A
• Search PubMed for A
• Extract MeSH terms from relevant
documents
• Compute TF * IDF
– TF: # occurrences of the term in retrieved
document set
– IDF: log(N/TF)
– N: # retrieved documents
• Normalize the weight vector
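The weighting above can be sketched in Java as follows, taking the slide's formulas at face value: weight = TF * log(N/TF). Two assumptions are made that the slide does not state: each retrieved document is represented as a set of MeSH terms, so a term is counted at most once per document, and "normalize" means dividing by the Euclidean length of the weight vector. The class name `TopicProfileSketch` is hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of profile building; the PubMed search itself is out of scope.
final class TopicProfileSketch {

    /** docs: one set of MeSH terms per document retrieved for the topic. */
    static Map<String, Double> buildProfile(List<Set<String>> docs) {
        int n = docs.size(); // N: number of retrieved documents

        // TF: occurrences of each term in the retrieved document set.
        Map<String, Integer> tf = new HashMap<>();
        for (Set<String> doc : docs) {
            for (String term : doc) {
                tf.merge(term, 1, Integer::sum);
            }
        }

        // weight = TF * log(N / TF), as given on the slide.
        Map<String, Double> weights = new HashMap<>();
        double sumOfSquares = 0.0;
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            double w = e.getValue() * Math.log((double) n / e.getValue());
            weights.put(e.getKey(), w);
            sumOfSquares += w * w;
        }

        // Normalize to unit Euclidean length (assumption about the norm).
        double norm = Math.sqrt(sumOfSquares);
        if (norm > 0) {
            weights.replaceAll((term, w) -> w / norm);
        }
        return weights;
    }
}
```

Under this formula a term that appears in every retrieved document gets weight 0, because log(N/TF) = log(1) = 0.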
Testing with Turmeric
• Topic A: Turmeric
• ST-B:
– Gene or Genome
– Enzyme
– Amino Acid, Peptide or Protein
• ST-C:
– Body Part, Organ or Organ Component
– Disease or Syndrome
– Neoplastic Process
• M: 5, 10, 15
Results
• B terms:
– 37% recall, 38% precision (compared
with manually identified terms)
• C terms:
– 67% recall, 67% precision (compared
with manual results)
Novel C MeSH Terms
Discussion
• Pros:
– Simple method
– Domain knowledge (MeSH terms, UMLS types)
to shape search direction
• Questions:
– TF & IDF?
– Longer path?
– What relationships?
– Co-occurrence = link?
End of Talk