KNOWLEDGE-BASED METHOD FOR
DETERMINING THE MEANING OF
AMBIGUOUS BIOMEDICAL TERMS
USING INFORMATION CONTENT
MEASURES OF SIMILARITY
Bridget McInnes
Ted Pedersen
Ying Liu
Genevieve B. Melton
Serguei Pakhomov
OBJECTIVE OF THIS WORK

Develop and evaluate a method that can disambiguate terms in biomedical text by exploiting similarity information extrapolated from the Unified Medical Language System
Evaluate the efficacy of Information Content-based similarity measures over path-based similarity measures for Word Sense Disambiguation (WSD)
WORD SENSE DISAMBIGUATION
Word sense disambiguation is the task of determining the appropriate sense of a term given the context in which it is used.
TERM: tolerance
Possible senses: Drug Tolerance, Immune Tolerance
Example: Buspirone attenuates tolerance to morphine in mice with skin cancer
SENSE INVENTORY: UNIFIED MEDICAL LANGUAGE SYSTEM

Unified Medical Language System (UMLS)

Semantic Network
Metathesaurus
  ~1.7 million biomedical and clinical concepts; integrated semi-automatically
  CUIs (Concept Unique Identifiers), linked:
    Hierarchical: PAR/CHD and RB/RN
    Non-hierarchical: SIB, RO
  Sources viewed together or independently
    Medical Subject Headings (MSH)
SPECIALIST Lexicon
  Biomedical and clinical terms, including variants
WORD SENSE DISAMBIGUATION
Buspirone attenuates tolerance to morphine in mice with skin cancer
Drug Tolerance: C0013220
Immune Tolerance: C0020963
Concept Unique Identifiers: CUIs
SENSERELATE ALGORITHM

Each possible sense of a target word is assigned a score
[sum similarity between it and its surrounding terms]
Assign target word the sense with highest score

Proposed by Patwardhan and Pedersen 2003 using WordNet


UMLS::SenseRelate is a modification of this algorithm using
information from the UMLS
NEXT UP: an example
SENSERELATE EXAMPLE
Buspirone attenuates tolerance to morphine in mice with skin cancer

Possible senses of the target word "tolerance":
Drug Tolerance: C0013220
Immune Tolerance: C0020963

Concepts of the surrounding terms:
Buspirone: C0006462
Morphine: C0026549
Mice: C0026809
Skin cancer: C0007114

Drug Tolerance
Score = 0.09 + 0.09 + 0.16 + 0.11 = 0.45
Immune Tolerance
Score = 0.09 + 0.09 + 0.05 + 0.04 = 0.27

Drug Tolerance has the higher score, so it is the sense assigned to "tolerance"
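The scoring in this example can be sketched in a few lines of Python. This is a minimal illustration rather than the UMLS::SenseRelate implementation: the function name `sense_relate` and the `SIM` lookup table are hypothetical, and although the similarity values are the ones shown in the example, the pairing of each value with a particular surrounding term is an assumption made only so that the code runs.

```python
from typing import Callable, Dict, Iterable, Tuple


def sense_relate(senses: Iterable[str],
                 context: Iterable[str],
                 similarity: Callable[[str, str], float]) -> Tuple[str, Dict[str, float]]:
    """Score each candidate sense by summing its similarity to every
    surrounding concept, then return the highest-scoring sense."""
    context = list(context)
    scores = {s: sum(similarity(s, c) for c in context) for s in senses}
    best = max(scores, key=scores.get)
    return best, scores


# Candidate senses of "tolerance" and the concepts of the surrounding terms.
DRUG_TOLERANCE, IMMUNE_TOLERANCE = "C0013220", "C0020963"
CONTEXT = ["C0006462", "C0026549", "C0026809", "C0007114"]

# ASSUMPTION: the values come from the example, but which value belongs
# to which surrounding term is guessed for illustration.
SIM = {
    (DRUG_TOLERANCE, "C0006462"): 0.09,
    (DRUG_TOLERANCE, "C0026549"): 0.09,
    (DRUG_TOLERANCE, "C0026809"): 0.16,
    (DRUG_TOLERANCE, "C0007114"): 0.11,
    (IMMUNE_TOLERANCE, "C0006462"): 0.09,
    (IMMUNE_TOLERANCE, "C0026549"): 0.09,
    (IMMUNE_TOLERANCE, "C0026809"): 0.05,
    (IMMUNE_TOLERANCE, "C0007114"): 0.04,
}

best, scores = sense_relate(
    [DRUG_TOLERANCE, IMMUNE_TOLERANCE],
    CONTEXT,
    lambda sense, concept: SIM[(sense, concept)],
)
print(best, scores)
```

Passing the similarity measure in as a function keeps the scoring loop independent of whether a path-based or an IC-based measure is used.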
SENSERELATE ASSUMPTION
An ambiguous word is often used in the sense
that is most similar to the sense of the
terms that surround it
SENSERELATE COMPONENTS

Identifying the concepts of surrounding terms

Calculating semantic similarity
IDENTIFYING THE CONCEPTS OF THE SURROUNDING TERMS
Use the SPECIALIST Lexicon to identify the terms, then map each term to a CUI by a string match against the MRCONSO table in the UMLS

Buspirone attenuates tolerance to morphine in mice with skin cancer

SPECIALIST LEXICON
...
skin cancer
skin grafting
skin disease
...

MRCONSO
...
skin cancer      C0007114
skin grafting    C0037297
skin disease     C0037274
...
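The two-step lookup can be sketched as follows. This is only an illustration under stated assumptions: `LEXICON` and `MRCONSO` are tiny in-memory stand-ins for the real resources, the greedy longest-match strategy in `find_terms` is an assumption (the slides say only that terms are identified with the lexicon), and the function names are hypothetical.

```python
# Toy stand-ins; the real SPECIALIST Lexicon and MRCONSO table are large files.
LEXICON = {"buspirone", "morphine", "mice",
           "skin cancer", "skin grafting", "skin disease"}
MRCONSO = {
    "buspirone": "C0006462",
    "morphine": "C0026549",
    "mice": "C0026809",
    "skin cancer": "C0007114",
    "skin grafting": "C0037297",
    "skin disease": "C0037274",
}


def find_terms(tokens, lexicon, max_len=3):
    """Segment tokens into lexicon terms, preferring the longest match
    that starts at each position (assumed strategy)."""
    terms, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n]).lower()
            if candidate in lexicon:
                terms.append(candidate)
                i += n
                break
        else:
            i += 1  # no lexicon term starts at this token
    return terms


def map_to_cuis(terms, mrconso):
    """Exact string match of each term against the MRCONSO stand-in."""
    return {t: mrconso[t] for t in terms if t in mrconso}


sentence = "Buspirone attenuates tolerance to morphine in mice with skin cancer"
terms = find_terms(sentence.split(), LEXICON)
cuis = map_to_cuis(terms, MRCONSO)
print(cuis)
```

The target word "tolerance" is deliberately absent from the toy lexicon, so only the surrounding terms are mapped.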
SEMANTIC SIMILARITY MEASURES

Path-based measures
  Path
  Wu and Palmer
  Leacock and Chodorow
  Nguyen and Al-Mubaid
Information content (IC)-based measures
  Resnik
  Lin
  Jiang and Conrath
PATH-BASED SIMILARITY MEASURES

Use only the path information obtained from a taxonomy

Path measure
  sim(c1,c2) = 1 / minpath(c1,c2)
  where minpath is the shortest path between the two concepts
Wu and Palmer, 1994
  sim(c1,c2) = (2*depth(LCS(c1,c2))) / (depth(c1)+depth(c2))
  where LCS is the least common subsumer of the two concepts
Leacock and Chodorow, 1998
  sim(c1,c2) = -log( minpath(c1,c2) / (2D) )
  where D is the total depth of the taxonomy
Nguyen and Al-Mubaid, 2006
  sim(c1,c2) = log( 2 + (minpath(c1,c2) - 1) * (D - depth(LCS(c1,c2))) )
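A small sketch may help make the four path-based formulas concrete. It rests on several assumptions that are not in the slides: the `PARENT` links form a hypothetical single-parent tree over the example concepts, path length and depth are counted in nodes (so the root has depth 1 and a concept is at path length 1 from itself), and `nam` follows the reading log(2 + (minpath - 1) * (D - depth(LCS))). It does not call UMLS::Similarity.

```python
import math

# ASSUMPTION: hypothetical child -> parent links, for illustration only.
PARENT = {
    "Drug Related Disorder": "Disease",
    "Drug Tolerance": "Drug Related Disorder",
    "Neoplasm": "Disease",
    "Neoplastic Disease": "Neoplasm",
    "Malignant Neoplasm": "Neoplastic Disease",
    "Skin cancer": "Malignant Neoplasm",
}


def ancestors(c):
    """Return [c, parent(c), ..., root]."""
    chain = [c]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def depth(c):
    return len(ancestors(c))  # node count, so the root has depth 1


def lcs(c1, c2):
    """Least common subsumer: the deepest concept that is an ancestor of both."""
    seen = set(ancestors(c1))
    return next(a for a in ancestors(c2) if a in seen)


def minpath(c1, c2):
    """Number of nodes on the shortest path between c1 and c2 via their LCS."""
    d = depth(lcs(c1, c2))
    return (depth(c1) - d) + (depth(c2) - d) + 1


D = max(depth(c) for c in PARENT)  # total depth of the toy taxonomy


def path(c1, c2):
    return 1 / minpath(c1, c2)


def wup(c1, c2):
    return 2 * depth(lcs(c1, c2)) / (depth(c1) + depth(c2))


def lch(c1, c2):
    return -math.log(minpath(c1, c2) / (2 * D))


def nam(c1, c2):
    return math.log(2 + (minpath(c1, c2) - 1) * (D - depth(lcs(c1, c2))))


pair = ("Drug Tolerance", "Skin cancer")
print(path(*pair), wup(*pair), lch(*pair), nam(*pair))
```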
PATH-BASED SIMILARITY MEASURES
Use only the path information obtained from a taxonomy
[Figure: taxonomy fragment containing the following concepts]
Disease: C0012634
Drug Related Disorder: C0277579
Drug Tolerance: C0013220
Neoplasm: C1302761
Neoplastic Disease: C1882062
Malignant Neoplasm: C0006826
Skin cancer: C0007114
INFORMATION CONTENT-BASED MEASURES

Incorporate the probability of the concepts

IC = -log(P(concept))
P(concept)
  Calculated by summing the probability of the concept and the probability of its descendants
  Probabilities are obtained from an external corpus
Resnik, 1995
  sim(c1,c2) = IC(LCS(c1,c2))
Jiang and Conrath, 1997
  sim(c1,c2) = 1 / (IC(c1) + IC(c2) - 2*IC(LCS(c1,c2)))
Lin, 1998
  sim(c1,c2) = (2*IC(LCS(c1,c2))) / (IC(c1) + IC(c2))
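The IC-based measures can be sketched in the same spirit. Everything concrete here is an assumption for illustration: the `PARENT` links, the `COUNTS` (invented corpus frequencies, not UMLSonMedline values), and the choice to return infinity from `jcn` when the denominator is zero. The sketch shows only how the probability of a concept accumulates the counts of its descendants and how the three measures use the resulting IC.

```python
import math

# ASSUMPTION: hypothetical child -> parent links and invented corpus counts.
PARENT = {
    "Drug Related Disorder": "Disease",
    "Drug Tolerance": "Drug Related Disorder",
    "Neoplasm": "Disease",
    "Neoplastic Disease": "Neoplasm",
    "Malignant Neoplasm": "Neoplastic Disease",
    "Skin cancer": "Malignant Neoplasm",
}
COUNTS = {
    "Disease": 10, "Drug Related Disorder": 20, "Drug Tolerance": 30,
    "Neoplasm": 10, "Neoplastic Disease": 5, "Malignant Neoplasm": 15,
    "Skin cancer": 10,
}


def ancestors(c):
    """Return [c, parent(c), ..., root]."""
    chain = [c]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def lcs(c1, c2):
    """Deepest concept that subsumes both c1 and c2."""
    seen = set(ancestors(c1))
    return next(a for a in ancestors(c2) if a in seen)


# Each concept's count is added to the concept itself and to all of its
# ancestors, so a concept's total covers the concept plus its descendants.
TOTALS = dict.fromkeys(COUNTS, 0)
for concept, n in COUNTS.items():
    for node in ancestors(concept):
        TOTALS[node] += n
N = sum(COUNTS.values())


def ic(c):
    return -math.log(TOTALS[c] / N)


def res(c1, c2):
    return ic(lcs(c1, c2))


def jcn(c1, c2):
    distance = ic(c1) + ic(c2) - 2 * ic(lcs(c1, c2))
    return math.inf if distance == 0 else 1 / distance  # assumed convention


def lin(c1, c2):
    return 2 * ic(lcs(c1, c2)) / (ic(c1) + ic(c2))


pair = ("Malignant Neoplasm", "Skin cancer")
print(res(*pair), jcn(*pair), lin(*pair))
```

With these toy counts the root has probability 1 and therefore IC 0, so any pair whose only common subsumer is the root gets a Resnik and Lin similarity of 0.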
IC-BASED SIMILARITY MEASURES
[Figure: taxonomy fragment combining path information with the probability of concepts obtained from an external corpus]
Disease: C0012634
Drug Related Disorder: C0277579
Drug Tolerance: C0013220
Neoplasm: C1302761
Neoplastic Disease: C1882062
Malignant Neoplasm: C0006826
Skin cancer: C0007114
EXPERIMENTAL FRAMEWORK

Use the open-source UMLS::Similarity package to obtain the similarity between the terms and possible senses in the SenseRelate algorithm
Path information: parent/child relations in the MSH source
Information content: calculated using the UMLSonMedline dataset created by NLM
  Consists of concepts from the 2009AB UMLS and the frequency with which they occurred in Medline, obtained using the Essie Search Engine (Ide et al. 2007)
  Medline: database of citations of biomedical/clinical articles
EVALUATION DATA: MSH WSD

MSH-WSD dataset (Jimeno-Yepes et al. 2011)
203 target words (ambiguous words) from Medline
  106 terms, e.g. tolerance
  88 acronyms, e.g. CA (calcium, california)
  9 mixtures, e.g. bat (brown adipose tissue)
Each target word has ~187 instances (Medline abstracts)
  abstract = ~500 words
Each target word in the instances is assigned a concept from MSH by exploiting the MSH concepts manually assigned to the abstract
Average of 2.08 possible senses per target word
Majority sense over all the target words is 54.5%
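For readers unfamiliar with the majority-sense baseline, the sketch below shows one way such a figure could be computed. It is an assumption-laden illustration: the data are invented placeholders (not MSH-WSD instances), the function name is hypothetical, and pooling all instances before dividing is an assumption, since the slides do not say whether the 54.5% is averaged per instance or per target word.

```python
from collections import Counter


def majority_sense_accuracy(gold_by_word):
    """Accuracy obtained by always choosing each target word's most
    frequent gold sense, pooled over all instances (assumed averaging)."""
    correct = total = 0
    for senses in gold_by_word.values():
        correct += Counter(senses).most_common(1)[0][1]
        total += len(senses)
    return correct / total


# Invented toy data: target word -> gold sense label of each instance.
toy_gold = {
    "word1": ["sense_a", "sense_a", "sense_a", "sense_b", "sense_b"],
    "word2": ["sense_a"] * 4 + ["sense_b"] * 4,
}
baseline = majority_sense_accuracy(toy_gold)
print(baseline)
```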
RESULTS
[Figure: bar chart of accuracy on MSH-WSD for the baseline, the path-based measures (path, lch, wup, nam) and the IC-based measures (res, jcn, lin). The baseline is 0.55; the measures range from 0.69 to 0.74.]
COMPARISON ACROSS SUBSETS OF MSH-WSD
[Figure: bar chart of accuracy for Baseline, SenseRelate, MRD and 2-MRD on the Terms, Acronyms, Mixture and Overall subsets of MSH-WSD]
WINDOW SIZES

Use the terms surrounding the target word within a specified window: 1, 2, 5, 10, 25, 50, 60, 70
WINDOW SIZE = 2
Buspirone attenuates tolerance to morphine in mice with skin_cancer
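A window of this kind can be sketched as a simple slice around the target word. The helper name `context_window` is hypothetical, and the sketch assumes the window counts every token on each side of the target; the slides do not state whether the window counts all words or only terms that map to concepts.

```python
def context_window(tokens, target_index, size):
    """Return up to `size` tokens on each side of the target, excluding it."""
    left = tokens[max(0, target_index - size):target_index]
    right = tokens[target_index + 1:target_index + 1 + size]
    return left + right


tokens = "Buspirone attenuates tolerance to morphine in mice with skin_cancer".split()
window = context_window(tokens, tokens.index("tolerance"), 2)
print(window)  # ['Buspirone', 'attenuates', 'to', 'morphine']
```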
COMPARISON OF WINDOW SIZES FOR LIN
[Figure: accuracy of lin for window sizes 0, 1, 2, 5, 10, 25, 50, 60 and 70; accuracy ranges from 0.5 to 0.74]
SURROUNDING TERMS
Not all terms have a concept in the UMLS; therefore, not all surrounding terms in the window are mapped to CUIs
WINDOW SIZES VERSUS MAPPED TERMS
window size    number of mappings (lin)
0              0
1              0.27
2              0.79
5              1.85
10             3.49
25             7.6
50             12.96
60             14.28
70             15.64
FUTURE WORK: MAPPING TERMS

Currently looking at mapping the terms to CUIs using information from the concept mapping system MetaMap
Obtain the terms from MetaMap and do a dictionary look-up in MRCONSO
  Hypothesis: the terms obtained by MetaMap are more accurate than those obtained using the SPECIALIST Lexicon
Obtain the CUIs from MetaMap
  Hypothesis: the CUIs obtained by MetaMap will be more accurate than the dictionary look-up
OBJECTIVE #1
Develop and evaluate a method that can disambiguate terms in biomedical text by exploiting similarity information extrapolated from the UMLS

UMLS::SenseRelate obtains a statistically significantly higher disambiguation accuracy than the baseline
On par with previous unsupervised methods for terms
OBJECTIVE #2
Evaluate the efficacy of IC-based similarity measures over path-based measures on a secondary task

There is no statistically significant difference between the accuracies obtained by the IC-based measures
There is a statistically significant difference between the IC-based measures and the path-based measures
TAKE HOME MESSAGE:
An ambiguous word is often used in the sense
that is most similar to the sense of the concepts
of the terms that surround it
RESOURCES

Software:
  UMLS::SenseRelate
    http://search.cpan.org/dist/UMLS-SenseRelate/
  UMLS::Similarity
    http://search.cpan.org/dist/UMLS-Similarity/
Data:
  MSH-WSD
    http://wsd.nlm.nih.gov/collaboration.shtml
THANK YOU
QUESTIONS?