JIGSAW: an Algorithm for Word Sense Disambiguation

EVALITA 2007
Evaluation of NLP Tools for Italian
JIGSAW: an Algorithm for
Word Sense Disambiguation
Dipartimento di Informatica
University of Bari
Pierpaolo Basile
Giovanni Semeraro
EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy
Word Sense Disambiguation
Word Sense Disambiguation (WSD) is the problem of selecting a sense for a word from a set of predefined possibilities
The sense inventory usually comes from a dictionary or thesaurus
Approaches: knowledge-intensive methods, supervised learning, and (sometimes) bootstrapping
All Words WSD
• Attempt to disambiguate all open-class words in a text:
  • “He put his suit over the back of the chair”
• How?
  • Knowledge-based approaches
    • Use information from dictionaries
    • Position in a semantic network
    • Use discourse properties
  • Minimally supervised approaches
  • Most frequent sense
JIGSAW
Knowledge-based WSD algorithm
Disambiguation of words in a text by exploiting WordNet senses
Combination of three different strategies to disambiguate nouns, verbs, adjectives and adverbs
Main motivation: the effectiveness of a WSD algorithm is strongly influenced by the POS tag of the target word
JIGSAW algorithm
• Input: document d = {w1, w2, ..., wh}
• Output: list of WordNet synsets X = {s1, s2, ..., sk}
  • each element si is obtained by disambiguating the target word wi
  • based on the information obtained from WordNet about words in the context
    • context C of the target word: a window of n words to the left and another n words to the right, for a total of 2n surrounding words
• For each word, JIGSAW adopts a different strategy based on its POS tag
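The context window and the per-POS dispatch described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the tag codes "n" and "v" are assumptions made for the example.

```python
from typing import List

def context_window(words: List[str], i: int, n: int) -> List[str]:
    """Return up to n words to the left and n words to the right of words[i]."""
    left = words[max(0, i - n):i]
    right = words[i + 1:i + 1 + n]
    return left + right

def choose_strategy(pos_tag: str) -> str:
    """Map a coarse POS tag (hypothetical codes) to the procedure that handles it."""
    if pos_tag == "n":
        return "JIGSAW_nouns"
    if pos_tag == "v":
        return "JIGSAW_verbs"
    return "JIGSAW_others"  # adjectives and adverbs

tokens = ["he", "put", "his", "suit", "over", "the", "back", "of", "the", "chair"]
print(context_window(tokens, 3, 2))  # ['put', 'his', 'over', 'the']
print(choose_strategy("n"))          # JIGSAW_nouns
```

Near the start or end of the document the window is simply truncated, so fewer than 2n words may be returned.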
JIGSAW_nouns: the idea
Based on Resnik's algorithm [Resnik95] for disambiguating noun groups
Given a set of nouns W = {w1, w2, ..., wn} from document d:
each wi has an associated sense inventory Si = {si1, si2, ..., sik} of possible senses
Goal: assign to each wi the most appropriate sense sih ∈ Si, according to the similarity of wi with the other nouns in W
JIGSAW_nouns: semantic similarity
“The white cat is hunting the mouse”
w = cat, C = {mouse}
Wcat = {cat#1, cat#2} = {02037721, 00847815}
T = {mouse#1, mouse#2} = {02244530, 03651364}
cat#1: feline mammal…
cat#2: computerized axial tomography…
mouse#1: any of numerous small rodents…
mouse#2: a hand-operated electronic device…
[Figure: similarity links between the senses of cat and the senses of mouse, labeled 0.726, 0.0, 0.0 and 0.107]
JIGSAW_nouns: MSS support
[Figure: the MostSpecificSubsumer {placental_mammal} linking Wcat = {cat#1, cat#2} and T = {mouse#1, mouse#2}]
MostSpecificSubsumer (MSS) between words
Give more importance to senses that are hyponyms of the MSS
Combine MSS support with semantic similarity
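To make the notion of most specific subsumer concrete, here is a small sketch over a hand-made IS-A fragment. The dictionary below is a toy stand-in for WordNet (an assumption for illustration, not real WordNet data), and the function names are hypothetical.

```python
from typing import Dict, List, Optional

# Toy IS-A links (child -> parent); a hand-made fragment, not real WordNet data.
HYPERNYM: Dict[str, Optional[str]] = {
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "placental_mammal",
    "mouse": "rodent",
    "rodent": "placental_mammal",
    "placental_mammal": None,
}

def path_to_root(concept: str) -> List[str]:
    """List the concept followed by all of its ancestors, most specific first."""
    path = [concept]
    while HYPERNYM[path[-1]] is not None:
        path.append(HYPERNYM[path[-1]])
    return path

def most_specific_subsumer(a: str, b: str) -> Optional[str]:
    """Return the deepest concept that subsumes both a and b, if any."""
    ancestors_of_b = set(path_to_root(b))
    for concept in path_to_root(a):
        if concept in ancestors_of_b:
            return concept
    return None

print(most_specific_subsumer("cat", "mouse"))  # placental_mammal
```

Senses of the target word that are hyponyms of the returned concept are the ones that would receive extra support.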
Differences between JIGSAW_nouns and Resnik
• Leacock-Chodorow measure to calculate similarity (instead of Information Content)
• a Gaussian factor G, which takes into account the distance between words in the text
• a factor R, which takes into account the synset frequency score in WordNet
• a parameterized search for the MSS (Most Specific Subsumer) between two concepts
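For reference, the Leacock-Chodorow measure is defined as -log(len / (2 · D)), where len is the length of the shortest path between two concepts (counted in nodes) and D is the maximum depth of the taxonomy. The sketch below only evaluates that formula; the path lengths and the depth value are made up for the example.

```python
import math

def leacock_chodorow(path_len_nodes: int, max_depth: int) -> float:
    """Leacock-Chodorow similarity: -log(len / (2 * D))."""
    if path_len_nodes < 1 or max_depth < 1:
        raise ValueError("path length and depth must be positive")
    return -math.log(path_len_nodes / (2.0 * max_depth))

# Hypothetical taxonomy depth, chosen only for the example.
D = 16
close_pair = leacock_chodorow(3, D)  # concepts two edges apart
far_pair = leacock_chodorow(9, D)    # concepts eight edges apart
print(close_pair > far_pair)  # True
```

Shorter paths give higher similarity, and the score depends only on path length and taxonomy depth, not on corpus frequencies as Information Content does.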
JIGSAW_verbs: the idea
Try to establish a relation between verbs and nouns (distinct IS-A hierarchies in WordNet)
Verb wi is disambiguated using:
  nouns in the context C of wi
  nouns in the description (gloss + WordNet usage examples) of each candidate synset for wi
JIGSAW_verbs: algorithm [1/4]
• For each candidate synset sik of wi
  • compute nouns(i, k): the set of nouns in the description for sik
  • for each wj in C and each synset sik, compute the highest similarity maxjk
    • maxjk is the highest similarity value for wj with respect to the nouns related to the k-th sense for wi (using the Leacock-Chodorow measure)
JIGSAW_verbs: algorithm [2/4]
I play basketball and soccer
wi=play
C={basketball, soccer}
1. (70) play -- (participate in games or sport; "We played hockey all afternoon"; "play cards"; "Pele played for the Brazilian teams in many important matches")
2. (29) play -- (play on an instrument; "The band played all night long")
3. …
nouns(play,1): game, sport, hockey, afternoon, card, team, match
nouns(play,2): instrument, band, night
…
nouns(play,35): …
JIGSAW_verbs: algorithm [3/4]
wi=play
C={basketball, soccer}
nouns(play,1): game, sport, hockey, afternoon, card, team, match
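[Figure: the senses game1, game2, …, gamek and sport1, sport2, …, sportm of the nouns in nouns(play,1) are compared by similarity with the senses basketball1, …, basketballh of the context noun basketball]
$$\mathrm{MAX}_{\mathit{basketball}} = \max_{w_i \in \mathit{nouns}(\mathit{play},1)} \mathrm{Sim}(w_i, \mathit{basketball})$$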
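The maximization step can be sketched as below. The similarity values in TOY_SIM are invented for illustration; in JIGSAW the similarity would come from the Leacock-Chodorow measure over WordNet, and the function names here are hypothetical.

```python
from typing import Callable, Dict, List

def max_similarities(
    context_nouns: List[str],
    description_nouns: List[str],
    sim: Callable[[str, str], float],
) -> Dict[str, float]:
    """For each context noun wj, return its highest similarity to any noun in nouns(i, k)."""
    return {
        wj: max((sim(wj, noun) for noun in description_nouns), default=0.0)
        for wj in context_nouns
    }

# Made-up similarity values, for illustration only.
TOY_SIM = {
    ("basketball", "game"): 2.1,
    ("basketball", "sport"): 1.8,
    ("soccer", "game"): 2.0,
    ("soccer", "sport"): 1.9,
}

def toy_sim(a: str, b: str) -> float:
    """Look up the toy similarity of a pair; unknown pairs score 0.0."""
    return TOY_SIM.get((a, b), 0.0)

result = max_similarities(["basketball", "soccer"], ["game", "sport", "hockey"], toy_sim)
print(result)  # {'basketball': 2.1, 'soccer': 2.0}
```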
JIGSAW_verbs: algorithm [4/4]
• finally, an overall similarity score, φ(i, k), between sik and the whole context C is computed:
$$\varphi(i,k) = R(k) \cdot \frac{\sum_{w_j \in C} G(\mathit{pos}(w_i), \mathit{pos}(w_j)) \cdot \mathit{max}_{jk}}{\sum_{h} G(\mathit{pos}(w_i), \mathit{pos}(w_h))}$$
• the synset assigned to wi is the one with the highest φ value
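The overall score can be read as a position-weighted average of the maxjk values, scaled by the rank factor R(k). The sketch below follows that reading under stated assumptions: the exact form of G and R is not given on the slides, so G is taken to be an unnormalized Gaussian of the position difference with a hypothetical sigma, R(k) is passed in as a plain number, and the denominator is summed over the same context words as the numerator.

```python
import math
from typing import Dict

def gaussian_factor(pos_i: int, pos_j: int, sigma: float = 2.0) -> float:
    """Assumed form of G: decays with the distance between two word positions."""
    return math.exp(-((pos_i - pos_j) ** 2) / (2.0 * sigma ** 2))

def phi(rank_factor: float, target_pos: int,
        context_pos: Dict[str, int], max_sim: Dict[str, float]) -> float:
    """Rank factor times the G-weighted average of max_jk over the context words."""
    weights = {w: gaussian_factor(target_pos, p) for w, p in context_pos.items()}
    total = sum(weights.values())
    if total == 0.0:
        return 0.0
    weighted = sum(weights[w] * max_sim.get(w, 0.0) for w in weights)
    return rank_factor * weighted / total

# "I play basketball and soccer": play is at position 1.
context = {"basketball": 2, "soccer": 4}
# Invented max_jk values and rank factors for two candidate senses of "play".
phi_sense1 = phi(1.0, 1, context, {"basketball": 2.1, "soccer": 2.0})
phi_sense2 = phi(0.9, 1, context, {"basketball": 0.9, "soccer": 0.8})
print(phi_sense1 > phi_sense2)  # True
```

With these invented numbers the first sense (participate in games or sport) scores higher, so it would be the one assigned to the verb.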
JIGSAW_others
• Based on the WSD algorithm proposed by Banerjee and Pedersen [Banerjee02] (inspired by Lesk)
• Idea: compute the overlap between the glosses of each candidate sense of the target word and the glosses of all words in its context
  • assign the synset with the highest overlap score
  • if ties occur, the most common synset in WordNet is chosen
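A simplified gloss-overlap scorer in the spirit of Lesk is sketched below. It counts shared content words only, which is a simplification and not the adapted Lesk scoring of Banerjee and Pedersen; the glosses, stopword list and function names are invented for the example. Candidates are assumed to be ordered from most to least common, so picking the first maximum implements the tie-breaking rule.

```python
from typing import List, Set

STOPWORDS = {"a", "an", "the", "of", "or", "in", "on", "and", "to", "for"}

def content_words(gloss: str) -> Set[str]:
    """Lowercase a gloss, split on whitespace and drop stopwords."""
    return {t for t in gloss.lower().split() if t not in STOPWORDS}

def pick_sense(candidate_glosses: List[str], context_glosses: List[str]) -> int:
    """Return the index of the candidate gloss sharing most words with the context glosses."""
    context_bag: Set[str] = set()
    for gloss in context_glosses:
        context_bag |= content_words(gloss)
    scores = [len(content_words(g) & context_bag) for g in candidate_glosses]
    # list.index returns the first maximum, i.e. the most common sense on ties.
    return scores.index(max(scores))

# Invented glosses for two senses of an adjective, most common sense first.
candidates = ["having a low temperature", "lacking emotional warmth"]
print(pick_sense(candidates, ["the temperature of the air outside"]))  # 0
print(pick_sense(candidates, ["an emotional state"]))                  # 1
```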
Evaluation
• EVALITA All-Words Task
  • disambiguate all words in a text
• Dataset
  • 16 texts in Italian
  • about 5000 words (tagged with ItalWordNet)
• Processing
  • WSD needs other NLP steps:
    • Text normalization and tokenization
    • Part-of-speech tagging (based on ACOPOST)
    • Lemmatization (based on the Morph-it! resource)
Evaluation (Results)
system                precision  recall  attempted
JIGSAW                0.560      0.414   73.95%
Baseline (1st sense)  0.669      0.669   100%
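The three columns are tied together by the standard definitions: precision is correct / attempted, recall is correct / total, and attempted is the fraction attempted / total, so recall equals precision times the attempted fraction. A quick check on the reported figures (the function name is hypothetical):

```python
def recall_from(precision: float, attempted_fraction: float) -> float:
    """Recall = precision * coverage, where coverage is the fraction of words attempted."""
    return precision * attempted_fraction

print(round(recall_from(0.560, 0.7395), 3))  # 0.414 (JIGSAW)
print(round(recall_from(0.669, 1.0), 3))     # 0.669 (baseline)
```

The gap between JIGSAW's precision and recall is therefore explained entirely by the words it left unattempted.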
• Comments:
  • results are encouraging considering that our system exploits only ItalWordNet
  • the pre-processing phases (lemmatization and POS tagging) introduce errors:
    • 77.66% lemmatization precision
    • 76.23% POS-tagging precision
Conclusions
Knowledge-based WSD algorithm
Different strategy for each POS tag
Uses only WordNet (ItalWordNet) and some heuristics
• Advantage: the same strategy can be used for Italian and English [Basile07]
• Drawback: low precision (for now)
Future Work
• Include new knowledge sources:
  • Web (e.g. Topic Signature)
  • Wikipedia (e.g. Wikipedia similarity)
  • Do not include resources that are available for only a few languages
• Use other heuristics:
  • Statistical distribution of senses instead of WordNet frequency
  • WordNet domains
References
• S. Banerjee and T. Pedersen. An adapted Lesk algorithm for word sense disambiguation using WordNet. In CICLing’02: Proc. 3rd Int’l Conf. on Computational Linguistics and Intelligent Text Processing, pages 136–145, London, UK, 2002. Springer-Verlag.
• P. Basile, M. de Gemmis, A.L. Gentile, P. Lops, and G. Semeraro. JIGSAW algorithm for word sense disambiguation. In SemEval-2007: 4th Int. Workshop on Semantic Evaluations, pages 398–401. ACL Press, 2007.
• C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. In C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, pages 305–332. MIT Press, 1998.
• P. Resnik. Disambiguating noun groupings with respect to WordNet senses. In Proceedings of the Third Workshop on Very Large Corpora, pages 54–68. Association for Computational Linguistics, 1995.
Backup slides
Results (detail)
              total  valid  precision
nouns         2405   1923   0.556
verbs         1479   1118   0.375
others        694    330    0.676
proper nouns  159    126    0.913
• Comments:
  • the polysemy of verbs is high
  • proper nouns generally have only one sense
  • the lemmatizer and POS tagger perform worst on adjectives
JIGSAW_nouns: The idea
• Inspired by the idea proposed by Resnik [Resnik95]
[Figure: nouns [w1 w2 … wn] with sense inventories [s11 s12 … s1k], [s21 s22 … s2h], [sn1 sn2 … snm] and example score vectors [0.4 0.3 0.5], [0.2 0.3 0.4], [0.6 0.1 0.2]]
• The most plausible assignment of senses to a set of co-occurring nouns is the one that maximizes the relatedness of meanings among the senses chosen
• Relatedness is measured by computing a score for each sij
  • the confidence with which sij is the most appropriate synset for wi
• The score is a way to assign credit to word senses
[Resnik95] P. Resnik. Disambiguating noun groupings with respect to WordNet senses. In Proceedings of the Third Workshop on Very Large Corpora, pages 54–68. Association for Computational Linguistics, 1995.
JIGSAW_nouns: The support
[Figure: nouns [w1 w2 … wn] with sense inventories [s11 s12 … s1k], [s21 s22 … s2h], [sn1 sn2 … snm]; pairs of nouns are linked to their most specific subsumer (MSS), with semantic similarity scores 0.56 and 0.15]
The more similar two words are, the more informative will be the most specific concept that subsumes both of them
JIGSAW_nouns: The idea
[Figure: cat (w1) and mouse (w2) with their sense inventories; the hierarchy Placental mammal > Carnivore > Feline, felid > Cat (feline mammal) and Placental mammal > Rodent > Mouse (rodent) gives MSS = placental mammal; the sense scores shown are 0.56 or 0.0, with 0.56 next to the senses feline mammal and rodent]