Codifying Semantic Information Presentation

Codifying Semantic
Information in Medical
Questions Using Lexical
Sources
Paul E. Pancoast
Arthur B. Smith
Chi-Ren Shyu
Research Purpose


To find a method for classifying medical questions that are asked by clinicians
Hypothesis - Simply indexing by keywords isn’t enough to:
distinguish questions with different meanings but similar wording, or
group questions with similar meanings but different words.
Definitions





Semantic Information – the meaning of the
words
Syntactic Information – the parts of speech of
the words (word type, sentence part)
Medical Questions – a question asked by a
clinician
Lexical Sources – sources of words and
vocabularies
UMLS – Unified Medical Language System
UMLS





Ambitious project of the National Library of
Medicine, begun in 1986
Help researchers retrieve and integrate
electronic biomedical information from a variety
of sources
Links over 100 controlled vocabularies
Assigns unique identifiers to medical concepts
and strings
Maps the hierarchical relationships between the
medical concepts
Why Bother?
(To classify medical questions?)





Clinicians have questions when treating patients
Researchers have gathered collections of these
questions
No good method exists to classify the questions
How many times has a particular question been
asked?
Which questions should receive priority for
evidence-based answers?
Examples




What is the best way to treat acute pharyngitis?
How should I approach a patient with a sore
throat?
What should I do with a patient with diabetes
and insulin resistance?
What should I do with a patient with diabetes
who is resistant to taking insulin?
Methods
Source Questions




American researcher – observed clinicians
at work
British researchers – questions sent in by
clinicians – answered by researchers
Australian researchers – questions sent in
by clinicians – answered by researchers
4083 total questions
Methods
Source Vocabulary

MRCON – a table from the Metathesaurus



Lists the medical concepts by unique identifiers (CUI)
and each string associated with a concept
unique (string => 1 concept)
ambiguous (string => 2+ concepts)



COLD – ambient temperature, viral respiratory infection,
chronic obstructive lung disease
2,247,454 strings associated with concepts
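The unique/ambiguous distinction above can be sketched as a lookup from string to the set of concepts it names. This is only an illustration: the rows and identifiers below are made-up placeholders (not real UMLS CUIs or the actual MRCON file layout), and `build_index` and `classify` are hypothetical helper names.

```python
from collections import defaultdict

# Hypothetical (concept id, string) rows. The ids are placeholders,
# not real UMLS CUIs, and this is not the actual MRCON file format.
ROWS = [
    ("C_AMBIENT_TEMPERATURE", "cold"),
    ("C_VIRAL_RESPIRATORY_INFECTION", "cold"),
    ("C_CHRONIC_OBSTRUCTIVE_LUNG_DISEASE", "cold"),
    ("C_PHARYNGITIS", "pharyngitis"),
    ("C_PHARYNGITIS", "sore throat"),
]

def build_index(rows):
    """Map each lowercased string to the set of concept ids it names."""
    index = defaultdict(set)
    for concept_id, string in rows:
        index[string.lower()].add(concept_id)
    return index

def classify(index, string):
    """Return 'unique', 'ambiguous', or 'unmatched' for a string."""
    concepts = index.get(string.lower(), set())
    if not concepts:
        return "unmatched"
    return "unique" if len(concepts) == 1 else "ambiguous"

index = build_index(ROWS)
for term in ("COLD", "sore throat", "approach"):
    print(term, "->", classify(index, term))
```

Note that one concept may have several strings (as with the placeholder pharyngitis concept here) and each string is still unique; ambiguity is a property of the string, not of the concept.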
Non-medical Lexicon – from Roget’s Thesaurus


Query objects (why, when, how), identifiers (I, you,
he), modifiers (soon, frequently)
749 terms in this lexicon
String Matching




Parsing program (written in C)
Separates individual questions into 3-word, 2-word, and 1-word windows
Matches the window against MRCON and our
lexicon
Generates a report of:



Total number of words parsed
Number of matches from unique, ambiguous, nonmedical lists
Strings that didn’t match any of the lists
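The window matching described above can be sketched roughly as follows. The original program was written in C and its exact matching rules are not given here, so this Python version makes assumptions: it tries the longest window first, consumes the matched words, and checks the lists in the order they are supplied. The tiny word lists and the names `tokenize` and `match_question` are hypothetical.

```python
import re
from collections import Counter

def tokenize(question):
    """Lowercase the question and split it into words (keeps inner hyphens)."""
    return re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*", question.lower())

def match_question(question, lists, max_window=3):
    """Match 3-, 2-, then 1-word windows against each named list.

    `lists` maps a list name to a set of lowercase strings.
    Returns (counts, unmatched_words). Assumption: longest match wins
    and matched words are not reused in later windows.
    """
    words = tokenize(question)
    counts = Counter(total_words=len(words))
    unmatched = []
    i = 0
    while i < len(words):
        for size in range(min(max_window, len(words) - i), 0, -1):
            window = " ".join(words[i:i + size])
            source = next(
                (name for name, strings in lists.items() if window in strings),
                None,
            )
            if source is not None:
                counts[source] += 1
                i += size
                break
        else:
            # No window starting at this word matched any list.
            unmatched.append(words[i])
            i += 1
    return counts, unmatched

# Toy stand-ins for the MRCON and non-medical lists.
lists = {
    "unique": {"sore throat", "acute pharyngitis", "patient"},
    "ambiguous": {"cold"},
    "nonmedical": {"how", "should", "i", "a", "with"},
}
counts, unmatched = match_question(
    "How should I approach a patient with a sore throat?", lists
)
print(dict(counts), unmatched)
```

With these toy lists, "sore throat" is matched as one 2-word string rather than two separate words, which is the point of trying the wider windows first.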
Results



String – individual word or words that matched
Hits – how often the string was found
Words – total number of matching words (some strings have more than one
word in them)
Source             Strings    Hits      Words     % match
MRCON Unique       4,534      24,844    30,186    42.3%
MRCON Ambiguous    574        9,256     9,769     13.7%
Nonmedical         208        16,768    17,783    24.9%
Unmatched          2,321                13,624    19.1%
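As a quick arithmetic check, the percentages in the table above appear to be shares of the total word count (not of strings or hits). That reading is an inference from the numbers rather than something stated on the slide; the sketch below recomputes the shares under that assumption.

```python
# Word counts per list, copied from the results table.
words = {
    "MRCON unique": 30186,
    "MRCON ambiguous": 9769,
    "Nonmedical": 17783,
    "Unmatched": 13624,
}
total = sum(words.values())

def percent(count, total):
    """Share of the total, as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)

shares = {name: percent(count, total) for name, count in words.items()}
for name, share in shares.items():
    print(f"{name}: {share}%")
```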
Results



100 strings occurred 7,850 times, or
57.6% of the 13,624 unmatched words
712 strings => 3+ hits, 85% of all hits
Our focus was on strings that didn’t match
one of the source vocabularies


19.1% didn’t match
Hypothesis that additional terms not found in
MRCON will be important for indexing
Results

Unmatched words – 2+ occurrences

Word type       Unique words   Total number   Percent
Verb            261            3676           31.7%
Noun            186            2356           20.3%
Preposition     9              2544           21.9%
Adj/Adv/Conj    103            1095           9.5%
Mix *           72             810            7.0%
Pronoun         10             614            5.3%
Integer         70             502            4.3%

* can be more than one word type, depending on the context.
"Attacks", "step", and "process" can all be nouns or verbs
Discussion

MRCON – selected because of its low rate of ambiguous string-CUI combinations
89% unique string matches
11% ambiguous string matches
Other tables have greater word coverage, but more ambiguity for each word
Discussion


Our word-matching results were similar to those of other researchers
Cimino matched 43% of words with Meta-1 (we had 56% MRCON matches)
Computers & Biomedical Research. Aug 1992;25(4):366-373.
Hersh matched 60% of words to a medical terminology & names dictionary (we had 79% combined lexicon matches)
Proceedings/AMIA Annual Fall Symposium. p. 1997.
Discussion


Stop words (prepositions, conjunctions, pronouns) are commonly removed by most
normalization tools,
but they provide valuable contextual information.




Blood FOR an HIV-positive patient
Blood FROM an HIV-positive patient
Aspirin AND warfarin
Aspirin OR warfarin
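The effect of stop-word removal on these examples can be sketched as follows. The stop list is a small hypothetical one chosen for illustration, not the list used by any particular normalization tool, and `keywords` is a made-up helper name.

```python
# Hypothetical stop list, for illustration only.
STOP_WORDS = frozenset({"a", "an", "the", "for", "from", "and", "or"})

def keywords(phrase, stop_words=STOP_WORDS):
    """Return the set of words left after dropping stop words."""
    return frozenset(w for w in phrase.lower().split() if w not in stop_words)

pairs = [
    ("Blood for an HIV-positive patient", "Blood from an HIV-positive patient"),
    ("Aspirin and warfarin", "Aspirin or warfarin"),
]
for first, second in pairs:
    same_without_stop_words = keywords(first) == keywords(second)
    same_with_stop_words = keywords(first, frozenset()) == keywords(second, frozenset())
    print(first, "|", second, "|", same_without_stop_words, same_with_stop_words)
```

Once the stop words are dropped, each pair reduces to the same keyword set, so a keyword index cannot tell the two phrases apart; keeping the stop words preserves the difference.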
Discussion

Integers




186 distinct integers or integer word
combinations
Occurred 647 times
Additional modification of concepts
Hyperkalemia – 5.3 mEq/L and 8.7 mEq/L

Both are hyperkalemia, but the evaluation and
management are markedly different
Discussion

Verbs – largest category of unmatched
words


Include action and relation concepts
Non-medical lexicon contained some


Treats, attends, increases, lessens, reduce, follows, starts,
can, should, is, equal, improve
Verb tense changes the meaning of a
question


In a patient TAKING antibiotics
In a patient who TOOK antibiotics
Discussion

Verbs may be conceptually related to
medical concepts





Diagnose => Diagnosis
Treat => Treatment
Evaluate => Evaluation
Prescribe => Prescription
In these cases the verb (relationship) is
not equivalent to the noun (concept)
Summary

We developed an application to




Parse individual words from collections of medical
questions
Match the words (phrases) with lexical sources,
codified by the UMLS
Our results were better than those of previous
investigators (for percentage of matched words)
We still have some work to do….
Related Experiments

We attempted to cluster questions by
sequences of semantic types

Initial attempts mostly clustered common
phrases such as “How should I” and “What is
the”

We may repeat this method after discarding
‘stop phrases’
Future Work


Family Practice Inquiries Network (FPIN)
has 200 questions that have associated
MeSH terms manually assigned by
librarians.
We will look at these question-term
groups for clustering purposes (with the
hypothesis that they will not make distinct
clusters).
Future Work
I will work with researchers at NLM to apply
MetaMap to medical questions
Extract triplets (Medical Concept-Allowable
Relation-Medical Concept) from questions,
e.g. Drug-treats-Disease
Insert the triplets into a vector-space
model and look for clusters
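One possible shape for the vector-space step is sketched below. It is an assumption about how such a model might look, not the planned implementation: the triplets are written by hand (no MetaMap call is shown), the relation names are hypothetical, each question becomes a binary vector over the triplet vocabulary, and cosine similarity stands in for whatever measure a clustering step would use.

```python
import math

def vectorize(triplets, vocabulary):
    """Binary vector: 1.0 where the question contains that triplet."""
    present = set(triplets)
    return [1.0 if t in present else 0.0 for t in vocabulary]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hand-written, hypothetical (concept, relation, concept) triplets per question.
questions = {
    "q1": [("antibiotic", "treats", "pharyngitis")],
    "q2": [("antibiotic", "treats", "pharyngitis"),
           ("sore throat", "symptom_of", "pharyngitis")],
    "q3": [("insulin", "treats", "diabetes")],
}
vocabulary = sorted({t for triplets in questions.values() for t in triplets})
vectors = {q: vectorize(t, vocabulary) for q, t in questions.items()}

print(cosine(vectors["q1"], vectors["q2"]))  # share one triplet
print(cosine(vectors["q1"], vectors["q3"]))  # share none
```

Questions that share triplets score above zero and would tend to fall in the same cluster; questions with no triplets in common score zero regardless of how many words they share.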
Thank-you!!
???