JIGSAW: an Algorithm for Word Sense Disambiguation
EVALITA 2007
Evaluation of NLP Tools for Italian
JIGSAW: an Algorithm for
Word Sense Disambiguation
Dipartimento di Informatica
University of Bari
Pierpaolo Basile
Giovanni Semeraro
EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy
Word Sense Disambiguation
Word Sense Disambiguation (WSD) is the
problem of selecting a sense for a word
from a set of predefined possibilities
The sense inventory usually comes from a
dictionary or thesaurus
Approaches: knowledge-intensive methods,
supervised learning, and (sometimes)
bootstrapping
All Words WSD
Attempt to disambiguate all open-class words in
a text:
“He put his suit over the back of the chair”
How?
Knowledge-based approaches
Use information from dictionaries
Position in a semantic network
Use discourse properties
Minimally supervised approaches
Most frequent sense
JIGSAW
Knowledge-based WSD algorithm
Disambiguation of words in a text by
exploiting WordNet senses
Combination of three different strategies
to disambiguate nouns, verbs, adjectives
and adverbs
Main motivation: the effectiveness of a
WSD algorithm is strongly influenced by
the POS tag of the target word
JIGSAW algorithm
Input: document d = {w1, w2, ... , wh}
Output: list of WordNet synsets X = {s1, s2, ... ,
sk}
each element si is obtained by disambiguating the
target word wi
based on the information obtained from WordNet
about words in the context
context C of the target word: a window of n words to the left
and another n words to the right, for a total of 2n surrounding
words
For each word JIGSAW adopts a different
strategy based on POS tag
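A minimal sketch of the context window described above, assuming the document is already tokenized into a list of words. The function name `context_window` and its parameters are illustrative and do not come from the JIGSAW implementation.

```python
def context_window(words, i, n):
    """Return up to n words to the left and n words to the right of words[i]."""
    left = words[max(0, i - n):i]      # clipped at the start of the document
    right = words[i + 1:i + 1 + n]     # clipped at the end of the document
    return left + right


sentence = "he put his suit over the back of the chair".split()
# Target word "suit" (index 3) with n = 2 surrounding words on each side.
print(context_window(sentence, 3, 2))
```

Near the document boundaries the window is simply shorter than 2n words; whether JIGSAW pads or shifts the window in that case is not stated in the slides.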
JIGSAW_nouns: the idea
Based on Resnik's algorithm [Resnik95] for
disambiguating noun groups
Given a set of nouns W={w1,w2, ... ,wn}
from document d:
each wi has an associated sense inventory
Si={si1, si2, ... , sik} of possible senses
Goal: assign to each wi the most
appropriate sense sih ∈ Si, according to the
similarity of wi with the other nouns in W
JIGSAW_nouns: semantic similarity
“The white cat is hunting the mouse”
w = cat
C = {mouse}
Wcat = {cat#1, cat#2} = {02037721, 00847815}
T = {mouse#1, mouse#2} = {02244530, 03651364}
Cat#1: feline mammal…
Cat#2: computerized axial tomography…
Mouse#1: any of numerous small rodents…
Mouse#2: a hand-operated electronic device…
[Figure: similarity links between the senses of cat and the senses of mouse, labelled 0.726, 0.0, 0.0 and 0.107]
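The similarity between two senses can be sketched with the Leacock-Chodorow measure, -log(len / (2·D)), where len is the number of nodes on the shortest IS-A path and D is the depth of the taxonomy. The taxonomy below is a hypothetical toy hierarchy, not real WordNet data, so the values it produces are not the ones shown on the slide.

```python
import math

# Hypothetical toy IS-A taxonomy (child -> parent); NOT real WordNet data.
PARENT = {
    "animal": "entity",
    "mammal": "animal",
    "placental_mammal": "mammal",
    "carnivore": "placental_mammal",
    "feline": "carnivore",
    "cat": "feline",
    "rodent": "placental_mammal",
    "mouse": "rodent",
    "artifact": "entity",
    "device": "artifact",
    "computer_mouse": "device",
}


def path_to_root(node):
    """Return the list of nodes from `node` up to the root, node included."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def path_length(a, b):
    """Number of nodes on the shortest IS-A path between a and b."""
    pa, pb = path_to_root(a), path_to_root(b)
    common = next(x for x in pa if x in pb)  # lowest common ancestor
    return pa.index(common) + pb.index(common) + 1


def taxonomy_depth():
    """Length (in nodes) of the longest path from any node to the root."""
    nodes = set(PARENT) | set(PARENT.values())
    return max(len(path_to_root(n)) for n in nodes)


def lch_similarity(a, b):
    """Leacock-Chodorow similarity on the toy taxonomy."""
    return -math.log(path_length(a, b) / (2.0 * taxonomy_depth()))


print(lch_similarity("cat", "mouse"))           # animal sense of mouse
print(lch_similarity("cat", "computer_mouse"))  # device sense of mouse
```

In this toy hierarchy the animal sense of mouse lies closer to cat than the device sense does, so it receives the higher similarity.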
JIGSAW_nouns: MSS support
Wcat={cat#1,cat#2}
T={mouse#1,mouse#2}
MostSpecificSubsumer (MSS) between the words:
{placental_mammal}
Give more importance to senses that are
hyponyms of the MSS
Combine MSS support with semantic
similarity
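A minimal sketch of the MSS idea, assuming a hypothetical toy hierarchy in place of WordNet: the MSS of two senses is their lowest common ancestor, and only candidate senses that are hyponyms of it receive support. Sense labels such as `cat#1` and the intermediate nodes are illustrative only.

```python
# Hypothetical toy IS-A hierarchy (child -> parent); NOT real WordNet data.
PARENT = {
    "placental_mammal": "entity",
    "carnivore": "placental_mammal",
    "feline": "carnivore",
    "cat#1": "feline",
    "rodent": "placental_mammal",
    "mouse#1": "rodent",
    "tomography": "entity",
    "cat#2": "tomography",
    "device": "entity",
    "mouse#2": "device",
}


def hypernym_path(node):
    """Return the list of nodes from `node` up to the root, node included."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def most_specific_subsumer(a, b):
    """Lowest node that is an ancestor of (or equal to) both a and b."""
    ancestors_b = set(hypernym_path(b))
    for node in hypernym_path(a):
        if node in ancestors_b:
            return node
    return None


def is_hyponym_of(node, subsumer):
    """True if `subsumer` is a proper ancestor of `node`."""
    return subsumer in hypernym_path(node)[1:]


mss = most_specific_subsumer("cat#1", "mouse#1")
supported = [s for s in ["cat#1", "cat#2"] if is_hyponym_of(s, mss)]
print(mss, supported)
```

Here only `cat#1` lies below the MSS, so it is the only sense of cat that would gain support from the pair (cat, mouse).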
Differences between
JIGSAW_nouns and Resnik
Leacock-Chodorow measure to calculate
similarity (instead of Information Content)
a Gaussian factor G, which takes into account
the distance between words in the text
a factor R, which takes into account the synset
frequency score in WordNet
a parameterized search for the MSS (Most
Specific Subsumer) between two concepts
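The slides do not give the exact form of G and R, so the following is only a sketch under stated assumptions: G is taken to be an unnormalized Gaussian of the distance between word positions (with an assumed width `sigma`), and R is taken to decrease linearly with the rank k of the synset in WordNet's frequency ordering (with an assumed `penalty`). Both function names and all parameter values are hypothetical.

```python
import math


def gaussian_factor(pos_i, pos_j, sigma=2.0):
    """Weight that decays with the distance between two word positions.

    The Gaussian shape follows the slide; sigma is an assumed parameter.
    """
    distance = abs(pos_i - pos_j)
    return math.exp(-(distance ** 2) / (2.0 * sigma ** 2))


def rank_factor(k, n_senses, penalty=0.8):
    """Assumed linear decay over the sense rank k (0 = most frequent sense)."""
    if n_senses <= 1:
        return 1.0
    return 1.0 - penalty * k / (n_senses - 1)


print(gaussian_factor(3, 4), gaussian_factor(3, 7))  # near word vs. far word
print(rank_factor(0, 5), rank_factor(4, 5))          # first vs. last sense
```

With these choices, nearby context words and frequent senses weigh more, which is the behaviour the two factors are described as providing.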
JIGSAW_verbs: the idea
Try to establish a relation between verbs
and nouns (distinct IS-A hierarchies in
WordNet)
Verb wi disambiguated using:
nouns in the context C of wi
nouns in the description (gloss + WordNet
usage examples) of each candidate synset for
wi
JIGSAW_verbs: algorithm [1/4]
For each candidate synset sik of wi
compute nouns(i, k): the set of nouns in the
description of sik
for each wj in C and each synset sik, compute the
highest similarity maxjk
maxjk is the highest similarity value for wj with respect to the nouns
related to the k-th sense of wi (using the Leacock-Chodorow measure)
JIGSAW_verbs: algorithm [2/4]
I play basketball and soccer
wi=play
C={basketball, soccer}
1. (70) play -- (participate in games or sport; "We played hockey all
afternoon"; "play cards"; "Pele played for the Brazilian teams in
many important matches")
2. (29) play -- (play on an instrument; "The band played all night
long")
3. …
nouns(play,1): game, sport, hockey, afternoon, card, team, match
nouns(play,2): instrument, band, night
…
nouns(play,35): …
JIGSAW_verbs: algorithm [3/4]
wi=play
C={basketball, soccer}
nouns(play,1): game, sport, hockey, afternoon, card, team, match
[Figure: the senses of each noun in nouns(play,1) (game1 … gamek, sport1 … sportm) are compared with the senses of basketball (basketball1 … basketballh) to obtain similarity values]
MAXbasketball = MAXi Sim(wi, basketball), with wi ∈ nouns(play,1)
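The maximum over the description nouns can be sketched as follows. The table `TOY_SIM` holds made-up word-level scores standing in for Leacock-Chodorow values (in the slide, the word-level similarity itself comes from comparing the senses of the two nouns); the names `sim` and `max_similarity` are illustrative.

```python
# Hypothetical word-to-word similarity scores; NOT computed from WordNet.
TOY_SIM = {
    ("basketball", "game"): 0.80,
    ("basketball", "sport"): 0.70,
    ("basketball", "hockey"): 0.65,
    ("basketball", "instrument"): 0.10,
    ("soccer", "game"): 0.75,
    ("soccer", "sport"): 0.70,
}


def sim(a, b):
    """Symmetric lookup in the toy table; unknown pairs score 0.0."""
    return TOY_SIM.get((a, b), TOY_SIM.get((b, a), 0.0))


def max_similarity(context_noun, description_nouns):
    """Highest similarity between one context noun and the nouns of a sense."""
    return max((sim(context_noun, n) for n in description_nouns), default=0.0)


nouns_play_1 = ["game", "sport", "hockey", "afternoon", "card", "team", "match"]
nouns_play_2 = ["instrument", "band", "night"]
print(max_similarity("basketball", nouns_play_1))
print(max_similarity("basketball", nouns_play_2))
```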
JIGSAW_verbs: algorithm [4/4]
finally, an overall similarity score, φ(i, k),
between sik and the whole context C is computed:
\[ \varphi(i,k) = R(k) \cdot \frac{\sum_{w_j \in C} G(pos(w_i), pos(w_j)) \cdot max_{jk}}{\sum_{h} G(pos(w_i), pos(w_h))} \]
the synset assigned to wi is the one with the
highest φ value
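A sketch of the overall score as a Gaussian-weighted average of the maxjk values, scaled by R(k). The Gaussian width `sigma`, the R values and the maxjk values used in the example are assumptions chosen for illustration; only the structure of the computation follows the slide.

```python
import math


def gaussian_factor(pos_i, pos_j, sigma=2.0):
    """Assumed Gaussian weight on the distance between two word positions."""
    return math.exp(-((pos_i - pos_j) ** 2) / (2.0 * sigma ** 2))


def overall_score(rank_weight, target_pos, context, max_sims):
    """Score of one candidate synset against the whole context.

    rank_weight: R(k) for the candidate synset
    target_pos:  position of the target verb in the text
    context:     list of (noun, position) pairs
    max_sims:    noun -> max_jk for this candidate synset
    """
    weights = [gaussian_factor(target_pos, pos) for _, pos in context]
    total = sum(weights)
    if total == 0:
        return 0.0
    weighted = sum(w * max_sims.get(noun, 0.0)
                   for w, (noun, _) in zip(weights, context))
    return rank_weight * weighted / total


# "I play basketball and soccer": play is at position 1.
context = [("basketball", 2), ("soccer", 4)]
score_1 = overall_score(1.0, 1, context, {"basketball": 0.80, "soccer": 0.75})
score_2 = overall_score(0.9, 1, context, {"basketball": 0.10, "soccer": 0.10})
best = max([(score_1, "play#1"), (score_2, "play#2")])[1]
print(score_1, score_2, best)
```

Because the denominator normalizes the Gaussian weights, the fraction is a weighted mean of the maxjk values, so closer context nouns pull the score toward their own similarity.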
JIGSAW_others
Based on the WSD algorithm proposed by
Banerjee and Pedersen [Banerjee02] (inspired
by Lesk)
Idea: compute the overlap between the
glosses of each candidate sense for the target
word and the glosses of all words in its context
assign the synset with the highest overlap
score
if ties occur, the most common synset in WordNet is
chosen
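A simplified sketch of gloss overlap with the tie-breaking rule above. It counts shared words only, whereas the adapted Lesk algorithm uses a richer overlap scoring; the glosses, the stopword list and the function names are hypothetical.

```python
# Hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "in", "that", "to"}


def tokenize(text):
    """Lowercased set of non-stopword tokens."""
    return {t for t in text.lower().split() if t not in STOPWORDS}


def gloss_overlap(gloss_a, gloss_b):
    """Number of distinct words shared by two glosses."""
    return len(tokenize(gloss_a) & tokenize(gloss_b))


def disambiguate(candidate_glosses, context_glosses):
    """Return the index of the best candidate sense.

    candidate_glosses must be ordered from most to least common sense, so
    that ties are resolved in favour of the most common one.
    """
    scores = [sum(gloss_overlap(g, c) for c in context_glosses)
              for g in candidate_glosses]
    # max() returns the first maximal index, i.e. the most common sense on ties.
    return max(range(len(scores)), key=scores.__getitem__)


candidates = [
    "sloping land beside a body of water",
    "a financial institution that accepts deposits",
]
context = ["money placed in a financial account"]
print(disambiguate(candidates, context))
```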
Evaluation
EVALITA All-Words-Task
disambiguate all words in a text
Dataset
16 texts in Italian
about 5000 words (tagged with ItalWordNet senses)
Processing
WSD needs other NLP steps:
Text normalization and Tokenization
Part-Of-Speech Tagging (based on ACOPOST)
Lemmatization (based on Morph-it! Resource)
META
Evaluation (Results)
system                 precision  recall  attempted
JIGSAW                 0.560      0.414   73.95%
Baseline (1st sense)   0.669      0.669   100%
Comments:
results are encouraging considering that our system
exploits only ItalWordNet
the pre-processing phases, lemmatization and POS tagging, introduce errors:
77.66% lemmatization precision
76.23% POS-tagging precision
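The three reported measures are linked: assuming the usual all-words scoring, where precision is computed over attempted instances and recall over all instances, recall equals precision times the attempted fraction. A short check on the reported figures (the function name is illustrative):

```python
def recall_from(precision, attempted):
    """Recall implied by precision (over attempted instances) and coverage."""
    return precision * attempted


jigsaw_recall = round(recall_from(0.560, 0.7395), 3)
baseline_recall = round(recall_from(0.669, 1.0), 3)
print(jigsaw_recall, baseline_recall)  # 0.414 0.669
```

Both values agree with the recall column, which also shows that part of the gap to the baseline comes from the instances JIGSAW did not attempt.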
Conclusions
Knowledge-based WSD algorithm
Different strategy for each POS tag
Uses only WordNet (ItalWordNet) and some
heuristics
Advantage: the same strategy can be used for Italian and
English [Basile07]
Drawback: low precision (for now)
Future Work
Including new knowledge sources:
Web
(e.g. Topic Signature)
Wikipedia
(e.g. Wikipedia similarity)
Do not include resources that are available for only
a few languages
Use other heuristics:
Statistical distribution of senses instead of WordNet frequency
WordNet domains
References
S. Banerjee and T. Pedersen. An adapted Lesk algorithm for word sense disambiguation using WordNet. In CICLing’02: Proc. 3rd Int’l Conf. on Computational Linguistics and Intelligent Text Processing, pages 136–145, London, UK, 2002. Springer-Verlag.
P. Basile, M. de Gemmis, A.L. Gentile, P. Lops, and G. Semeraro. JIGSAW algorithm for word sense disambiguation. In SemEval-2007: 4th Int. Workshop on Semantic Evaluations, pages 398–401. ACL press, 2007.
C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. In C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, pages 305–332. MIT Press, 1998.
P. Resnik. Disambiguating noun groupings with respect to WordNet senses. In Proceedings of the Third Workshop on Very Large Corpora, pages 54–68. Association for Computational Linguistics, 1995.
Backup slides
Results (detail)
               total  valid  precision
nouns          2405   1923   0.556
verbs          1479   1118   0.375
others          694    330   0.676
proper nouns    159    126   0.913
Comments:
verbs are highly polysemous
proper nouns generally have only one sense
the lemmatizer and POS tagger perform worst on adjectives
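A short calculation on the detailed figures, assuming that "valid" counts the instances the system could actually process for each category (the slides do not define the column). Under that assumption, valid / total gives a per-category coverage.

```python
# (total, valid) per category, copied from the detailed results.
counts = {
    "nouns": (2405, 1923),
    "verbs": (1479, 1118),
    "others": (694, 330),
    "proper nouns": (159, 126),
}

# Assumed definition: coverage = valid / total.
coverage = {name: valid / total for name, (total, valid) in counts.items()}
overall = (sum(valid for _, valid in counts.values())
           / sum(total for total, _ in counts.values()))

for name, value in coverage.items():
    print(f"{name}: {value:.3f}")
print(f"overall: {overall:.3f}")
```

On these numbers the "others" category has by far the lowest coverage, which is consistent with the comment about the lemmatizer and POS tagger on adjectives.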
JIGSAW_nouns: The idea
Inspired by the idea proposed by Resnik
[Resnik95]
[ w1 w2 … wn ]
[s11 s12 … s1k] [s21 s22 … s2h] [sn1 sn2 … snm]
[0.4 0.3 … 0.5] [0.2 0.3 … 0.4] [0.6 0.1 … 0.2]
The most plausible assignment of senses to a set of
co-occurring nouns is the one that maximizes the
relatedness of meanings among the senses chosen
Relatedness is measured by computing a score for
each sij: the confidence with which sij is the most
appropriate synset for wi
The score is a way to assign credit to word senses
JIGSAW_nouns: The support
[ w1 w2 … wn ]
[s11 s12 … s1k] [s21 s22 … s2h] [sn1 sn2 … snm]
most specific subsumer (MSS)
the more similar two words are, the more informative will
be the most specific concept that subsumes both of them
[Figure: semantic similarity scores attached to the MSS; values shown: 0.56 and 0.15]
JIGSAW_nouns: The idea
[ w1 w2 … wn ], with w1 = cat and w2 = mouse
[s11 s12 … s1k] [s21 s22 … s2h] [sn1 sn2 … snm]
cat (feline mammal): [0.56 0.0 … 0.0]
mouse (rodent): [0.56 0.0 … 0.56]
wn: [0.0 0.0 … 0.0]
MSS = placental mammal
[Figure: WordNet hierarchy fragment. Placental mammal subsumes Carnivore and Rodent; Carnivore subsumes Feline, felid, which subsumes Cat (feline mammal); Rodent subsumes Mouse (rodent)]