Lexical Semantics and Word Sense Disambiguation

Natural Language Processing
COMPSCI 423/723
Rohit Kate
Lexical Semantics and
Word Sense Disambiguation
Some of the slides have been adapted from Raymond
Mooney’s NLP course at UT Austin.
Basic Steps of Natural Language Processing
[Pipeline figure: Sound waves → (phonetics) → Words → (syntactic processing) → Parses → (semantic processing) → Meaning → (pragmatic processing) → Meaning in context]
Lexical Semantics
• Study of meanings of words
– How to represent word meanings?
– How are they related to each other?
• Synonyms, antonyms, hypernyms (more general) ,
hyponyms (more specific)
• A word can have multiple meanings, e.g. “I am
going to the bank.”
• Compositionality: How meanings of individual
words combine to give meaning of a
sentence
– Many exceptions: “kick the bucket”
Lexeme, Lexicon & Lemma
• Lexeme: Smallest unit of language which has
a meaning (roughly dictionary entry), e.g. run
– Takes various inflected word forms, e.g. runs,
running, ran
– conduct (verb) is a different lexeme from conduct
(noun)
• Lexicon: A finite set of lexemes (roughly
dictionary)
• Lemma: The canonical or basic form that
represents the lexeme, e.g. run
Lemmatization
• The process of mapping word forms to
their lemmas, e.g. running => run
• Typically done using morphological
analysis
• Often done in NLP to avoid data
sparsity, but depending on the
application sometimes it may be best to
keep the word forms
Lemmatization is not Trivial
• May depend on the context
– He found the ball => find
– He will found the Institute => found
• Depends on the part of speech
– He conducted the orchestra => conduct
(verb)
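For illustration, a minimal lemmatization sketch assuming NLTK's WordNetLemmatizer (the tool choice and examples are illustrative); note that the lemmatizer has to be told the part of speech and still cannot resolve genuinely context-dependent cases:

# A minimal lemmatization sketch, assuming NLTK and its 'wordnet' data are installed.
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)
wnl = WordNetLemmatizer()

print(wnl.lemmatize("running", pos="v"))    # run
print(wnl.lemmatize("ran", pos="v"))        # run
print(wnl.lemmatize("conducted", pos="v"))  # conduct
print(wnl.lemmatize("conduct", pos="n"))    # conduct (noun lexeme, unchanged)

# The POS tag alone is not always enough: 'found' as a verb maps to 'find',
# even though in "He will found the Institute" the lemma should stay 'found'.
print(wnl.lemmatize("found", pos="v"))      # find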
Stemming
• Reduce a word to its “stem”
• Relates to lemmatization but the stem need
not be a word itself
– May reduce compute, computational, computing
all to comput
• The purpose of stemming is to bring variant
forms of a word together, not to map a word
onto its canonical form
• Porter’s stemmer is a well known simple
algorithm for stemming that is mostly based
on removing well known suffixes
• Less linguistically motivated and less effective
than lemmatization, but easy to do and often
serves the purpose
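A short stemming sketch assuming NLTK's PorterStemmer (one available implementation of Porter's algorithm; the word list is illustrative):

# Stemming with the Porter algorithm via NLTK.
from nltk.stem import PorterStemmer

ps = PorterStemmer()
for w in ["compute", "computational", "computing", "computer"]:
    print(w, "->", ps.stem(w))
# All variants reduce to the stem "comput", which is not itself a word.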
Word Senses
• A word sense is a particular meaning of a
word
• Senses of a word may be entirely different
with no relations, called homonyms
– Bank: money bank, river bank
• Senses of a word may be related, called
polysemes
– Bank: financial institution, building of the financial
institution, storage of blood (blood bank)
• No hard threshold to distinguish between
polysemy and homonymy, it’s a matter of
degree
When I use a word it means
just what I choose it to mean
- neither more nor less.
How Many Senses Does a Word Have?
• Not always an easy question
– Drive the car
– Drive to school
– Drive me mad
How Many Senses Does a Word Have?
• Dictionaries (or humans) may differ on how
many senses a word has
• Typically dictionaries or linguistic resources
give very fine-grained senses of a word, but
for NLP that may not be needed (in fact that
may hurt)
• WordNet has 34 senses for drive
Relations Between Senses
• Synonyms: When two senses of two words
are identical or very similar, e.g. buy &
purchase
– Could be tested by substitution
• I bought/purchased a car.
– There is probably no perfect synonymy; even near
synonyms may differ in some contexts, e.g. water and
H2O
• Synonymy is best defined for senses not
words
– Home purchase is a long process.
– *Home buy is a long process.
Relations Between Senses
• Antonyms: Senses of words with opposite
meanings, e.g. long/short, rise/fall
• While antonyms are very different because
they have opposite meanings, they are also
very similar because they share all other
aspects, e.g. long and short are both degrees of
length
• It is often difficult to distinguish between
synonyms and antonyms if automatically
extracted from a corpus using measures of
context similarity
– This is good.
– This is nice.
– This is bad.
Relations Between Senses
• Hyponyms: A sense of a word is more
specific than a sense of another word,
e.g. apple is a hyponym of fruit
• Hypernyms: Opposite of hyponym, e.g.
fruit is a hypernym of apple
• Meronyms: Part-whole relation, e.g.
wheel is a meronym of car
• Holonyms: Opposite of meronyms, e.g.
car is a holonym of wheel
WordNet
• A computational resource for English
sense relations (a lexical database)
• Available for free, browse or download:
http://wordnet.princeton.edu/
• Developed by famous cognitive
psychologist George Miller and a team
at Princeton University
• Database of word senses and their
relations
WordNet
• Synset (synonym set): Set of near synonyms
in WordNet
– Basic primitive of WordNet
– Each synset expresses a semantic concept
– Example synset: {drive, thrust, driving force}
• The entry for each word shows all the synsets
(senses) the word appears in, a description, and
sometimes example usage
• About 140,000 words and 109,000 synsets
• Synsets (not individual words) are connected
by various sense relations
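A small sketch of querying these synsets and relations programmatically, assuming the NLTK interface to WordNet:

# Querying WordNet through NLTK (assumes the 'wordnet' corpus is downloaded).
from nltk.corpus import wordnet as wn

# All synsets (senses) in which the word "drive" appears, with their glosses.
for s in wn.synsets("drive"):
    print(s.name(), "-", s.definition())

# A synset's member lemmas and some of its sense relations.
car = wn.synset("car.n.01")
print([l.name() for l in car.lemmas()])   # car, auto, automobile, machine, motorcar
print(car.hypernyms())                    # [Synset('motor_vehicle.n.01')]
print(car.part_meronyms()[:3])            # parts of a car, e.g. accelerator.n.01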
Some WordNet Synset
Relationships
• Antonym: front ↔ back
• Similar: unquestioning ↔ absolute
• Cause: kill → die
• Entailment: breathe → inhale
• Holonym: chapter → text (part-of)
• Meronym: computer → cpu (whole-of)
• Hyponym: tree → plant (specialization)
• Hypernym: fruit → apple (generalization)
A WordNet Snapshot
[Figure: synsets linked by sense relations. The synset {car, auto, automobile, machine, motorcar} has hypernym {motor vehicle, automotive vehicle}, hyponyms {cab, taxi, taxicab, hack} and {ambulance}, and meronym {accelerator, gas pedal, gas}.]
WordNets for Other
Languages
• EuroWordNet: Individual WordNets for some
European languages (Dutch, Italian, Spanish,
German, French, Czech, and Estonian) which
are also interconnected by interlingual links
http://www.illc.uva.nl/EuroWordNet/
• WordNets for some Asian languages:
– Hindi:
• http://www.cfilt.iitb.ac.in/wordnet/webhwn/
– Marathi:
• http://www.cfilt.iitb.ac.in/wordnet/webmwn/
– Japanese:
• http://nlpwww.nict.go.jp/wn-ja/index.en.html
WordNet Senses
• WordNet senses (like many dictionary senses) tend
to be very fine-grained
• “play” as a verb has 35 senses, including
– play a role or part: “Gielgud played Hamlet”
– pretend to have certain qualities or state of mind: “John
played dead.”
• Difficult to disambiguate to this level for people and
computers. Only expert lexicographers are perhaps
able to reliably differentiate senses
• Not clear such fine-grained senses are useful for NLP
• Several proposals for grouping senses into coarser,
easier to identify senses (e.g. homonyms only)
Word Sense Disambiguation
(WSD)
• Task of automatically selecting the correct
sense for a word
• Many tasks in NLP require disambiguation of
ambiguous words
– Question Answering
– Information Retrieval
– Machine Translation
– Text Mining
– Phone Help Systems
• Understanding how people disambiguate
words is an interesting problem that can
provide insight into psycholinguistics
WSD Tasks
• Lexical sample task:
– Choose one or more ambiguous words
each with a sense inventory
– Disambiguate occurrences of those
specific words in a corpus
• All words task:
– In a corpus, disambiguate every word with
a sense tag from a broad-coverage
lexical database (e.g. WordNet).
Supervised Learning for WSD
• Treat as a classification problem with the
potential senses for the target word as the
classification labels
• Decide appropriate features and a
classification method (Naïve Bayes, MaxEnt,
decision lists etc.)
• Train using data labeled with the correct word
senses
• Use the trained classifier to disambiguate
instances of the target word in the test corpus
Feature Engineering
• The success of machine learning requires
instances to be represented using an
effective set of features that are correlated
with the categories of interest
• Feature engineering can be a laborious
process that requires substantial human
expertise and knowledge of the domain
• In NLP it is common to extract many (even
thousands of) potentially useful features and use a
learning algorithm that works well with many
relevant and irrelevant features
Contextual Features
• Surrounding bag of words
• POS of neighboring words
• Local collocations
• Syntactic relations
Experimental evaluations indicate that all of
these features are useful; the best results
come from integrating all of these cues in the
disambiguation process.
Surrounding Bag of Words
• Unordered individual words near the
ambiguous word.
• Words in the same sentence.
• May include words in the previous sentence
or surrounding paragraph.
• Gives general topical cues of the context.
• May use feature selection to determine a
smaller set of words that help discriminate
possible senses.
• May just remove common “stop words” such
as articles, prepositions, etc.
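A minimal sketch of binary bag-of-words context features, assuming a recent scikit-learn and two invented contexts for the ambiguous word “bank”:

from sklearn.feature_extraction.text import CountVectorizer

contexts = [
    "he deposited the check at the bank before noon",    # money-bank context
    "they walked along the bank of the river at dusk",   # river-bank context
]
# binary=True records word presence/absence; stop_words removes articles etc.
vec = CountVectorizer(binary=True, stop_words="english")
X = vec.fit_transform(contexts)

print(vec.get_feature_names_out())   # the retained context words
print(X.toarray())                   # one binary feature vector per occurrence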
POS of Neighboring Words
• POS of the word narrows down the senses
• Also use part-of-speech of immediately
neighboring words.
• Provides evidence of local syntactic context.
• P-i is the POS of the word i positions to the
left of the target word.
• Pi is the POS of the word i positions to the
right of the target word.
• Typical to include features for:
P-3, P-2, P-1, P1, P2, P3
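A sketch of extracting the P-3 … P3 features, assuming NLTK's POS tagger (the helper name and example sentence are illustrative):

import nltk  # assumes the NLTK POS tagger data has been downloaded

def pos_window_features(tokens, target_index, width=3):
    """POS tags of the words up to `width` positions left/right of the target."""
    tags = [tag for _, tag in nltk.pos_tag(tokens)]
    feats = {}
    for i in range(-width, width + 1):
        if i != 0:
            j = target_index + i
            feats[f"P{i}"] = tags[j] if 0 <= j < len(tags) else "<pad>"
    return feats

tokens = "He played the piano at the concert".split()
print(pos_window_features(tokens, tokens.index("played")))
# e.g. {'P-3': '<pad>', 'P-2': '<pad>', 'P-1': 'PRP', 'P1': 'DT', 'P2': 'NN', 'P3': 'IN'}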
Local Collocations
• Specific lexical context immediately adjacent to the word.
• For example, to determine if “interest” as a noun refers to
“readiness to give attention” or “money paid for the use of
money”, the following collocations are useful:
– “in the interest of”
– “an interest in”
– “interest rate”
– “accrued interest”
• Ci,j is a feature of the sequence of words from local position i to j
relative to the target word.
– C-2,1 for “in the interest of” is “in the of”
• Typical to include:
– Single word context: C-1,-1 , C1,1, C-2,-2, C2,2
– Two word context: C-2,-1, C-1,1 ,C1,2
– Three word context: C-3,-1, C-2,1, C-1,2, C1,3
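A sketch of the Ci,j collocation features, with an illustrative helper that excludes the target word itself from the extracted sequence:

def collocation_feature(tokens, target_index, i, j):
    """Words from position i to j relative to the target, excluding the target."""
    start = max(0, target_index + i)
    window = tokens[start: target_index + j + 1]
    words = [w for k, w in enumerate(window, start=start) if k != target_index]
    return " ".join(words)

tokens = "he acted in the interest of justice".split()
t = tokens.index("interest")
print(collocation_feature(tokens, t, -2, 1))    # "in the of"   (C-2,1)
print(collocation_feature(tokens, t, -1, -1))   # "the"         (C-1,-1)
print(collocation_feature(tokens, t, 1, 2))     # "of justice"  (C1,2)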
Syntactic Relations
(Ambiguous Verbs)
• For an ambiguous verb, it is very useful to
know its direct object.
– “played the game”
– “played the guitar”
– “played the risky and long-lasting card game”
– “played the beautiful and expensive guitar”
– “played the big brass tuba at the football game”
– “played the game listening to the drums and the
tubas”
• May also be useful to know its subject:
– “The game was played while the band played.”
– “The game that included a drum and a tuba was
played on Friday.”
Syntactic Relations
(Ambiguous Nouns)
• For an ambiguous noun, it is useful to
know what verb it is an object of:
– “played the piano and the horn”
– “wounded by the rhinoceros’ horn”
• May also be useful to know what verb it
is the subject of:
– “the bank near the river loaned him $100”
– “the bank is eroding and the bank has
given the city the money to repair it”
Syntactic Relations
(Ambiguous Adjectives)
• For an ambiguous adjective, it is useful to
know the noun it is modifying.
– “a brilliant young man”
– “a brilliant yellow light”
– “a wooden writing desk”
– “a wooden acting performance”
Using Syntax in WSD
• Produce a parse tree for a sentence using a
syntactic parser, e.g.
(S (NP (ProperN John)) (VP (V played) (NP (DET the) (N piano))))
• For ambiguous verbs, use the head word of
its direct object and of its subject as features.
• For ambiguous nouns, use verbs for which it
is the object and the subject as features.
• For ambiguous adjectives, use the head word
(noun) of its NP as a feature.
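A sketch of extracting such syntactic-relation features, assuming spaCy's dependency parser and its small English model (the slides do not prescribe a particular parser):

import spacy

nlp = spacy.load("en_core_web_sm")   # assumes the model has been installed
doc = nlp("John played the piano")

for token in doc:
    if token.dep_ == "dobj":     # head word of the verb's direct object
        print("verb:", token.head.text, "| direct object:", token.text)
    if token.dep_ == "nsubj":    # head word of the verb's subject
        print("verb:", token.head.text, "| subject:", token.text)
# For an ambiguous noun like "piano", the governing verb ("played") can be used
# as a feature; for an ambiguous verb, the object head ("piano") can.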
Feature Vectors
A small example
Training examples:

  Example   P1    P-1   C1,1     Category
  1         NP    IN    guitar   play1
  2         DT    RB    band     play2
  3         VBN   NN    good     play1
  4         DT    IN    string   play1

[Figure: at classification time a new occurrence of “play”, represented by the
same features (P1, P-1, C1,1, …), is fed to the trained Naïve Bayes classifier
to predict its sense.]
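A runnable sketch of this pipeline on the toy table above, assuming scikit-learn's DictVectorizer and Naive Bayes; the test instance is invented:

from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB

# The four training examples from the table above.
train_feats = [
    {"P1": "NP",  "P-1": "IN", "C1,1": "guitar"},
    {"P1": "DT",  "P-1": "RB", "C1,1": "band"},
    {"P1": "VBN", "P-1": "NN", "C1,1": "good"},
    {"P1": "DT",  "P-1": "IN", "C1,1": "string"},
]
train_labels = ["play1", "play2", "play1", "play1"]

vec = DictVectorizer()                 # one-hot encodes each feature=value pair
X = vec.fit_transform(train_feats)
clf = MultinomialNB().fit(X, train_labels)

# Disambiguate a new occurrence of "play".
test = vec.transform([{"P1": "DT", "P-1": "IN", "C1,1": "string"}])
print(clf.predict(test))               # e.g. ['play1']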
Evaluation of WSD
• “In vitro”:
– Corpus developed in which one or more ambiguous words
are labeled with explicit sense tags according to some sense
inventory.
– Corpus used for training and testing WSD and evaluated
using accuracy (percentage of labeled words correctly
disambiguated).
• Use most-frequent-sense selection as a baseline.
• “In vivo”:
– Incorporate WSD system into some larger application
system, such as machine translation, information retrieval, or
question answering.
– Evaluate relative contribution of different WSD methods by
measuring performance impact on the overall system on
final task (accuracy of MT, IR, or QA results).
Evaluating Categorization
• Evaluation must be done on test data that are
independent of the training data (usually a
disjoint set of instances).
• Classification accuracy: c/n where n is the
total number of test instances and c is the
number of test instances correctly classified
by the system.
• Results can vary based on sampling error
due to different training and test sets.
• Average results over multiple training and test
sets (splits of the overall data) for the best
results.
N-Fold Cross-Validation
• Ideally, test and training sets are independent
on each trial.
– But this would require too much labeled data.
• Partition data into N equal-sized disjoint
segments.
• Run N trials, each time using a different
segment of the data for testing, and training
on the remaining N−1 segments.
• This way, at least test-sets are independent.
• Report average classification accuracy over
the N trials.
• Typically, N = 10.
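A minimal N-fold cross-validation sketch, assuming scikit-learn and synthetic data in place of a real sense-tagged corpus:

import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 50))   # 200 instances, 50 binary features
y = rng.integers(0, 2, size=200)         # two possible senses

# 10 trials, each testing on a different held-out segment.
scores = cross_val_score(MultinomialNB(), X, y,
                         cv=KFold(n_splits=10, shuffle=True, random_state=0))
print(scores.mean())   # average classification accuracy over the 10 trials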
Learning Curves
• In practice, labeled data is usually rare and
expensive.
• Would like to know how performance varies
with the number of training instances.
• Learning curves plot classification accuracy
on independent test data (Y axis) versus
number of training examples (X axis).
N-Fold Learning Curves
• Want learning curves averaged over
multiple trials.
• Use N-fold cross validation to generate
N full training and test sets.
• For each trial, train on increasing
fractions of the training set, measuring
accuracy on the test data for each point
on the desired learning curve.
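A sketch of producing such averaged curves, assuming scikit-learn's learning_curve helper and the same kind of synthetic data (it trains on increasing fractions of each fold's training data and tests on the held-out fold):

import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 50))
y = rng.integers(0, 2, size=200)

sizes, train_scores, test_scores = learning_curve(
    MultinomialNB(), X, y, cv=10, train_sizes=np.linspace(0.1, 1.0, 5))
print(sizes)                      # number of training examples at each point
print(test_scores.mean(axis=1))   # test accuracy averaged over the 10 folds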
WSD “line” Corpus
• Example WSD corpus
• 4,149 examples from newspaper
articles containing the word “line.”
• Each instance of “line” labeled with one
of 6 senses from WordNet.
• Each example includes a sentence
containing “line” and the previous
sentence for context.
Senses of “line”
• Product: “While he wouldn’t estimate the sale price, analysts
have estimated that it would exceed $1 billion. Kraft also told
analysts it plans to develop and test a line of refrigerated
entrees and desserts, under the Chillery brand name.”
• Formation: “C-LD-R L-V-S V-NNA reads a sign in Caldor’s book
department. The 1,000 or so people fighting for a place in line
have no trouble filling in the blanks.”
• Text: “Newspaper editor Francis P. Church became famous for
an 1897 editorial, addressed to a child, that included the line
“Yes, Virginia, there is a Santa Claus.”
• Cord: “It is known as an aggressive, tenacious litigator. Richard
D. Parsons, a partner at Patterson, Belknap, Webb and Tyler,
likes the experience of opposing Sullivan & Cromwell to “having
a thousand-pound tuna on the line.”
• Division: “Today, it is more vital than ever. In 1983, the act was
entrenched in a new constitution, which established a tricameral
parliament along racial lines, with separate chambers for
whites, coloreds and Asians but none for blacks.”
• Phone: “On the tape recording of Mrs. Guba's call to the 911
emergency line, played at the trial, the baby sitter is heard
begging for an ambulance.”
Experimental Data for WSD of
“line”
• Sample equal number of examples of each
sense to construct a corpus of 2,094.
• Represent as simple binary vectors of word
occurrences in 2 sentence context.
– Stop words eliminated
– Stemmed to eliminate morphological variation
• Final examples represented with 2,859 binary
word features.
Learning Curves for WSD of
“line” [Mooney, 1996]
Discussion of Learning Curves
for WSD of “line”
• Naïve Bayes and Perceptron give the best
results.
• Both use a weighted linear combination of
evidence from many features.
• Symbolic systems that try to find a small set
of relevant features tend to overfit the training
data and are not as accurate.
• Nearest neighbor method that weights all
features equally is also not as accurate.
• Of symbolic systems, decision lists work the
best.
SenseEval
• Standardized international “competition” on
WSD
http://www.senseval.org/
• Organized by the Association for
Computational Linguistics (ACL) Special
Interest Group on the Lexicon (SIGLEX).
• Competitions:
– Senseval 1: 1998
– Senseval 2: 2001
– Senseval 3: 2004
– Under SemEval 1: 2007
– Under SemEval 2: 2010
Other Approaches to WSD
• Dictionary based methods
– Lesk algorithm: Choose the sense whose
dictionary gloss shares the most words with the
context (see the sketch after this list)
• Semi-supervised learning
– Bootstrap from a small number of labeled
examples to exploit unlabeled data
• Train a classifier on the labeled data
• Test on the unlabeled data and treat the instances with
high confidence of disambiguation as (weakly) labeled
• Iterate till no more high confidence instances can be
found
– Exploit “one sense per discourse”
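A toy sketch of the simplified Lesk idea using WordNet glosses via NLTK (an illustrative implementation, not the exact original algorithm):

from nltk.corpus import wordnet as wn   # assumes the 'wordnet' corpus is downloaded

def simplified_lesk(word, context_words):
    """Pick the sense whose gloss (plus examples) overlaps most with the context."""
    context = {w.lower() for w in context_words}
    best, best_overlap = None, -1
    for sense in wn.synsets(word):
        gloss_words = sense.definition().split()
        for ex in sense.examples():
            gloss_words += ex.split()
        overlap = len(context & {w.lower() for w in gloss_words})
        if overlap > best_overlap:
            best, best_overlap = sense, overlap
    return best

sent = "he sat on the bank of the river and watched the water".split()
print(simplified_lesk("bank", sent))   # likely the sloping-land (river bank) sense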
Issues in WSD
• What is the right granularity of a sense
inventory?
• Integrating WSD with other NLP tasks
– Syntactic parsing
– Semantic role labeling
– Semantic parsing
• Does WSD actually improve performance on
some real end-user task?
– Information retrieval
– Information extraction
– Machine translation
– Question answering
Homework 5
• Which WordNet synset (sense) do you think
is the right one for the word “position” in the
sentence “He kept on debating but his
position was not clear.”? What feature(s)
would be useful for a WSD system to decide
that?