Transcript

Foundations of Statistical Natural Language Processing
Word Sense Disambiguation
2014.05.10
Minho Kim
([email protected])
Motivation
• Computationally determining which sense of a word is activated by
its use in a particular context.
– E.g. I am going to withdraw money from the bank.
• One of the central challenges in NLP.
• Needed in:
– Machine Translation: For correct lexical choice.
– Information Retrieval: Resolving ambiguity in queries.
– Information Extraction: For accurate analysis of text.
Senses and ambiguity
• Many words have different meanings
(senses) in different contexts
– E.g. bank → river bank; financial institution
• The problem is complicated in general by the fact that the “senses” of a particular word may differ only subtly.
Homonymy and Polysemy
POS Tagging
• Some words are used in different parts of
speech
– “They're waiting in line at the ticket office.” → Noun
– “You should line a coat with fur.” → Verb
• The techniques used for tagging and sense disambiguation are a bit different.
– For tagging the local context is heavily used – looking
at the use of determiners and predicates and the like.
– For word sense disambiguation the techniques look at
a broader context of the word. Tagging is explored in
Chapter 10.
Methodological Preliminaries
• Corpus Based Approaches
– Rely on corpus evidence.
– Supervised and unsupervised learning
– Train a model using tagged or untagged corpus.
– Probabilistic/Statistical models.
• Knowledge Based Approaches
– Rely on knowledge resources like WordNet, thesauri, etc.
– May use grammar rules for disambiguation.
– May use hand coded rules for disambiguation.
• Hybrid Approaches
– Use corpus evidence as well as semantic relations from WordNet.
Corpus Based Approaches
• Supervised and unsupervised learning
– In supervised learning, each occurrence of the word is labeled with its actual “sense”
– Supervised learning tends to be a
classification task
– Unsupervised tends to be a clustering task
• Providing labeled corpora is expensive
• Knowledge sources to help with task
– Dictionaries, thesauri, aligned bilingual texts
Pseudowords
• When one has difficulty coming up with sufficient training and test data, one technique is to create “pseudowords” from an existing corpus.
• E.g. replace banana and door with the pseudoword “banana-door”.
– The ambiguous set is the text with
pseudowords.
– The disambiguated set is the original text.
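A minimal sketch of how such a pseudoword corpus might be built; the tokenized sentence and helper name are illustrative assumptions, only the banana/door pair comes from the slide.

```python
# Minimal sketch: build a pseudoword corpus by conflating two real words.
# Only the pair ("banana", "door") comes from the slide; the rest is illustrative.

def make_pseudoword_corpus(tokens, w1="banana", w2="door"):
    pseudo = f"{w1}-{w2}"
    ambiguous = []   # test version: both words collapsed into the pseudoword
    gold = []        # original tokens serve as the disambiguated (gold) answers
    for tok in tokens:
        if tok in (w1, w2):
            ambiguous.append(pseudo)
            gold.append(tok)   # the true "sense" is simply the original word
        else:
            ambiguous.append(tok)
            gold.append(tok)
    return ambiguous, gold

tokens = "close the door and eat a banana".split()
ambiguous, gold = make_pseudoword_corpus(tokens)
# ambiguous contains 'banana-door' twice; gold keeps 'door' and 'banana'
```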
Upper and lower bounds
• Upper and lower bounds on performance
– Upper bound is usually defined as human
performance
– Lower bound is given by the simplest possible
algorithm
• Most Frequent Class
• Naïve Bayes
• Evaluation measure
– Precision, Recall, F-measure
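A small sketch of these evaluation measures as they are usually applied to WSD, assuming a system may abstain on some instances (abstentions lower recall but not precision); the function and data are illustrative.

```python
# Sketch of WSD evaluation: precision = correct/attempted, recall = correct/total,
# F-measure = harmonic mean of the two. A prediction of None means the system abstained.

def evaluate(predictions, gold):
    attempted = sum(1 for p in predictions if p is not None)
    correct = sum(1 for p, g in zip(predictions, gold) if p is not None and p == g)
    precision = correct / attempted if attempted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

print(evaluate(["bank/1", None, "bank/2"], ["bank/1", "bank/1", "bank/1"]))
# precision = 0.5, recall ≈ 0.33, F ≈ 0.4
```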
Supervised Disambiguation
Classification and Clustering
[Figure: classification vs. clustering – a model assigns items A, B, C to Class A, Class B, Class C]
Sense Tagged Corpus
<p>
BSAA0011-00018403   서양에만     서양/NNG + 에/JKB + 만/JX
BSAA0011-00018404   젤리가       젤리/NNG + 가/JKS
BSAA0011-00018405   있는         있/VV + 는/ETM
BSAA0011-00018406   것이         것/NNB + 이/JKC
BSAA0011-00018407   아니라       아니/VCN + 라/EC
BSAA0011-00018408   우리         우리/NP
BSAA0011-00018409   나라에서도   나라/NNG + 에서/JKB + 도/JX
BSAA0011-00018410   앵두         앵두/NNG
BSAA0011-00018411   사과         사과__05/NNG
BSAA0011-00018412   모과         모과__02/NNG
BSAA0011-00018413   살구         살구/NNG
BSAA0011-00018414   같은         같/VA + 은/ETM
BSAA0011-00018415   과일로       과일__01/NNG + 로/JKB
BSAA0011-00018416   '과편'을     '/SS + 과편/NNG + '/SS + 을/JKO
BSAA0011-00018417   만들어       만들/VV + 어/EC
BSAA0011-00018418   먹었지만     먹__02/VV + 었/EP + 지만/EC
BSAA0011-00018419   수박은       수박__01/NNG + 은/JX
BSAA0011-00018420   물기가       물기/NNG + 가/JKS
BSAA0011-00018421   너무         너무/MAG
BSAA0011-00018422   많고         많/VA + 고/EC
BSAA0011-00018423   펙틴질이     펙틴질/NNG + 이/JKS
BSAA0011-00018424   없어         없/VA + 어/EC
BSAA0011-00018425   가공해       가공__01/NNG + 하/XSV + 아/EC
BSAA0011-00018426   먹지         먹__02/VV + 지/EC
BSAA0011-00018427   못했다.      못하/VX + 았/EP + 다/EF + ./SF
</p>
(Each line gives the sentence ID, the word form, and its morphological analysis; the double-underscore numbers, e.g. 사과__05, are sense tags. English gloss: “Jelly is not something found only in the West; in Korea, too, ‘gwapyeon’ was made from fruits such as cherries, apples, quinces and apricots, but watermelon, with too much water and no pectin, could not be processed and eaten.”)
Notational Conventions
Supervised task
• The idea here is that there is a training set of exemplars in which each word that needs to be disambiguated is tagged with the correct “sense” of the word.
• The task is to correctly classify the word sense in the test set, using the statistical properties gleaned from the training set for the occurrence of the word in a particular context.
• This chapter explores two approaches to this
problem
– Bayesian approach and information theoretic
approach
Bayesian Classification
Prior Probability
• Prior probability: the probability before we consider any additional knowledge
P(A)
Conditional probability
• Sometimes we have partial knowledge
about the outcome of an experiment
• Conditional (or Posterior) Probability
• Suppose we know that event B is true
• The probability that A is true given the
knowledge about B is expressed by
P(A | B)

P(A | B) = P(A, B) / P(B)
http://ai.stanford.edu/~paskin/gm-short-course/lec1.pdf
Conditional probability (cont)
P(A, B) = P(A | B) P(B) = P(B | A) P(A)

• Note: P(A,B) = P(A ∩ B)
• Chain Rule
• P(A, B) = P(A|B) P(B) = The probability that A and B both happen is the probability that B happens times the probability that A happens, given B has occurred.
• P(A, B) = P(B|A) P(A) = The probability that A and B both happen is the probability that A happens times the probability that B happens, given A has occurred.
• Multi-dimensional table with a value in every cell giving the probability of that specific state occurring
Chain Rule
P(A,B) = P(A|B) P(B) = P(B|A) P(A)

P(A, B, C, D, …) = P(A) P(B|A) P(C|A,B) P(D|A,B,C) …
Chain Rule → Bayes' rule
P(A,B) = P(A|B) P(B) = P(B|A) P(A)

P(A|B) = P(B|A) P(A) / P(B)        Bayes' rule

Useful when one quantity is easier to calculate; a trivial consequence of the definitions we saw, but it's extremely useful.
Bayes' rule
P(A|B) = P(B|A) P(A) / P(B)

Bayes' rule translates causal knowledge into diagnostic knowledge.
For example, if A is the event that a patient has a disease, and B is the event that she displays a symptom, then P(B | A) describes a causal relationship, and P(A | B) describes a diagnostic one (that is usually hard to assess directly).
If P(B | A), P(A) and P(B) can be assessed easily, then we get P(A | B) for free.
Example
• S: stiff neck, M: meningitis
• P(S|M) = 0.5, P(M) = 1/50,000, P(S) = 1/20
• I have a stiff neck, should I worry?
P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002
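The same arithmetic as a quick check (illustrative script, nothing beyond the numbers on the slide):

```python
# Quick check of the meningitis example: P(M|S) = P(S|M) P(M) / P(S)
p_s_given_m = 0.5
p_m = 1 / 50_000
p_s = 1 / 20

p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)   # 0.0002
```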
(Conditional) independence
• Two events A and B are independent of each other if
P(A) = P(A|B)
• Two events A and B are conditionally independent of each other given C if
P(A|C) = P(A|B,C)
Back to language
• Statistical NLP aims to do statistical inference for
the field of NLP
– Topic classification
• P( topic | document )
– Language models
• P (word | previous word(s) )
– WSD
• P( sense | word)
• Two main problems
– Estimation: P is unknown: estimate P
– Inference: We estimated P; now we want to find (infer) the topic of a document, or the sense of a word
Language Models (Estimation)
• In general, for language events, P is
unknown
• We need to estimate P (or a model M of the language)
• We'll do this by looking at evidence about what P must be, based on a sample of data
Estimation of P
• Frequentist statistics
– Parametric
– Non-parametric (distribution free)
• Bayesian statistics
– Bayesian statistics measures degrees of belief
– Degrees are calculated by starting with prior beliefs
and updating them in face of the evidence, using
Bayes theorem
• 2 different approaches, 2 different philosophies
Inference
• The central problem of computational Probability
Theory is the inference problem:
• Given a set of random variables X1, … , Xk and
their joint density P(X1, … , Xk), compute one or
more conditional densities given observations.
– Compute
• P(X1 | X2, …, Xk)
• P(X3 | X1)
• P(X1, X2 | X3, X4)
• Etc.
• Many problems can be formulated in these
terms.
Bayes decision rule
• w: ambiguous word
• S = {s1, s2, …, sn}: senses for w
• C = {c1, c2, …, cn}: contexts of w in a corpus
• V = {v1, v2, …, vj}: words used as contextual features for disambiguation
• Bayes decision rule
– Decide sj if P(sj | c) > P(sk | c) for all sk ≠ sj
• We want to assign w to the sense s' where
s' = argmax_sk P(sk | c)
Bayes classification for WSD
• We want to assign w to the sense s’ where
s’ = argmaxsk P(sk | c)
• We usually do not know P(sk | c) but we can
compute it using Bayes rule
P(sk|c) = P(c, sk) / P(c) = P(c|sk) P(sk) / P(c)

s' = argmax_sk P(sk|c) = argmax_sk P(c|sk) P(sk) / P(c) = argmax_sk P(c|sk) P(sk)
Naïve Bayes classifier
s' arg max P(c|sk)P(sk)
sk
• Naïve Bayes classifier widely used in
machine learning
• Estimate P(c | sk) and P(sk)
Naïve Bayes classifier
s' arg max P(c|sk)P(sk)
sk
• Estimate P(c | sk) and P(sk)
• w: ambiguous word
• S = {s1, s2, …, sn } senses for w
• C = {c1, c2, …, cn } context of w in a corpus
• V = {v1, v2, …, vj } words used as contextual features for
disambiguation
• Naïve Bayes assumption:
P(c|sk) = P({vj | vj in c} | sk) = ∏_{vj in c} P(vj | sk)
Naïve Bayes classifier
P(c|sk) = P({vj | vj in c} | sk) = ∏_{vj in c} P(vj | sk)
• Naïve Bayes assumption:
– Two consequences
– All the structure and linear ordering of words within the context is ignored → bag-of-words model
– The presence of one word in the context is independent of the others
• Not true but model “easier” and very “efficient”
• “easier” “efficient” mean something specific in the
probabilistic framework
– We’ll see later (but easier to estimate parameters and more
efficient inference)
– Naïve Bayes assumption is inappropriate if there are strong dependencies, but often it does very well (partly because the decision may be optimal even if the assumption is not correct)
Naïve Bayes for WSD
s' arg max P(c|sk)P(sk)
Bayes decision rule
sk
P(c|sk)  P( {vj|vj in c} | sk)   P(vj | sk)
Naïve Bayes assumption
vj in c
s'  arg max  P(vj | sk)P(sk)
sk
vj in c
P(vj | sk) 
C (vj , sk)
C ( sk)
Count of vj when sk
Estimation
C ( sk)
P(sk) 
C ( w)
Prior probability of sk
34
Naïve Bayes Algorithm for WSD
• TRAINING (aka Estimation)
• for all senses sk of w do
– for all words vj in the vocabulary do
P(vj | sk) = C(vj, sk) / C(sk)
– end
• end
• for all senses sk of w do
P(sk) = C(sk) / C(w)
• end
Naïve Bayes Algorithm for WSD
• TESTING (aka Inference or Disambiguation)
• for all senses sk of w do
– for all words vj in the context window c do
score(sk) = P(sk|c) ∝ P(c|sk) P(sk)
– end
• end
• Choose s' such that
s' = argmax_sk score(sk)
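A minimal sketch of the training and testing loops above; the toy sense-tagged contexts and the add-one smoothing are illustrative assumptions (the slides use unsmoothed relative frequencies).

```python
import math
from collections import Counter, defaultdict

# Toy sense-tagged data: (context words, sense of the ambiguous word "bank").
# The examples and the add-one smoothing are illustrative, not from the slides.
training = [
    (["withdraw", "money", "account"], "bank/finance"),
    (["interest", "loan", "credit"],   "bank/finance"),
    (["river", "muddy", "water"],      "bank/river"),
    (["fish", "river", "shore"],       "bank/river"),
]

def train(data):
    sense_counts = Counter(sense for _, sense in data)      # C(sk)
    word_counts = defaultdict(Counter)                      # C(vj, sk)
    vocab = set()
    for context, sense in data:
        for w in context:
            word_counts[sense][w] += 1
            vocab.add(w)
    return sense_counts, word_counts, vocab

def disambiguate(context, sense_counts, word_counts, vocab):
    total = sum(sense_counts.values())
    best, best_score = None, float("-inf")
    for sense, c_sk in sense_counts.items():
        score = math.log(c_sk / total)                       # log P(sk)
        denom = sum(word_counts[sense].values()) + len(vocab)
        for w in context:
            if w in vocab:                                   # ignore unseen words
                score += math.log((word_counts[sense][w] + 1) / denom)  # smoothed P(vj|sk)
        if score > best_score:
            best, best_score = sense, score
    return best

params = train(training)
print(disambiguate(["money", "credit"], *params))   # -> 'bank/finance'
```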
An information-theoretic approach
Information theoretic approach
• Look for key words (informants) that disambiguate the sense of the word.
Flip-Flop Algorithm
t → translations of the ambiguous word
x → possible values of the indicator

The algorithm works by searching for a partition of the translations t (the senses) and of the indicator values x that maximizes the mutual information between the two partitions. The algorithm stops when the increase becomes insignificant.

[Figure: stepping through the flip-flop algorithm for the French word prendre]
Disambiguation process
• Once the partitions P and Q (the indicator-word values) have been determined, disambiguation is simple:
1. For every occurrence of the ambiguous word, determine the value xi of the indicator.
2. If xi is in Q1, assign the occurrence to sense 1; if not, assign it to sense 2.
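A small sketch of this final step, assuming the flip-flop algorithm has already produced the partition Q1/Q2 of indicator values; the French indicator words and sense labels are illustrative.

```python
# Sketch of the final disambiguation step for French "prendre", assuming the
# flip-flop algorithm has already partitioned the indicator values (here, the
# object of the verb). The indicator sets and sense labels are illustrative.
Q1 = {"mesure", "note", "exemple"}     # -> sense 1 (e.g. "to take")
Q2 = {"décision", "parole"}            # -> sense 2 (e.g. "to make")

def disambiguate(indicator_value):
    return "sense 1" if indicator_value in Q1 else "sense 2"

print(disambiguate("mesure"))    # sense 1
print(disambiguate("décision"))  # sense 2
```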
Decision Lists
Decision Lists and Trees
• Very widely used in Machine Learning.
• Decision trees used very early for WSD research (e.g.,
Kelly and Stone, 1975; Black, 1988).
• Represent disambiguation problem as a series of
questions (presence of feature) that reveal the sense
of a word.
– List decides between two senses after one positive answer
– Tree allows for decision among multiple senses after a
series of answers
• Uses a smaller, more refined set of features than “bag
of words” and Naïve Bayes.
– More descriptive and easier to interpret.
Decision List for WSD
(Yarowsky, 1994)
• Identify collocational features from sense tagged data.
• Word immediately to the left or right of target :
– I have my bank/1 statement.
– The river bank/2 is muddy.
• Pair of words to immediate left or right of target :
– The world’s richest bank/1 is here in New York.
– The river bank/2 is muddy.
• Words found within k positions to left or right of target,
where k is often 10-50 :
– My credit is just horrible because my bank/1 has made
several mistakes with my account and the balance is very
low.
Building the Decision List
• Sort the order of collocation tests using the log of conditional probabilities.
• Words most indicative of one sense (and not the other) will be ranked highly.

abs( log( P(S=1 | Fi = collocation_i) / P(S=2 | Fi = collocation_i) ) )
Computing DL score
– Given 2,000 instances of “bank”, 1,500 for bank/1 (financial
sense) and 500 for bank/2 (river sense)
• P(S=1) = 1,500/2,000 = .75
• P(S=2) = 500/2,000 = .25
– Given “credit” occurs 200 times with bank/1 and 4 times with
bank/2.
• P(F1=“credit”) = 204/2,000 = .102
• P(F1=“credit”|S=1) = 200/1,500 = .133
• P(F1=“credit”|S=2) = 4/500 = .008
– From Bayes Rule…
• P(S=1|F1=“credit”) = .133*.75/.102 = .978
• P(S=2|F1=“credit”) = .008*.25/.102 = .020
– DL Score = abs (log (.978/.020)) = 3.89
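The same computation as a short script, using only the counts from the slide (the small difference from 3.89 comes from the slide rounding the intermediate probabilities):

```python
import math

# Decision-list score for the feature "credit" near "bank", using the counts
# from the slide (1,500 bank/1, 500 bank/2; "credit" occurs 200 times with
# bank/1 and 4 times with bank/2).
n1, n2 = 1500, 500
f1, f2 = 200, 4
total = n1 + n2

p_s1 = n1 / total                       # P(S=1) = 0.75
p_s2 = n2 / total                       # P(S=2) = 0.25
p_f  = (f1 + f2) / total                # P(F1="credit") = 0.102

p_s1_given_f = (f1 / n1) * p_s1 / p_f   # Bayes rule: ≈ 0.978
p_s2_given_f = (f2 / n2) * p_s2 / p_f   # ≈ 0.020

dl_score = abs(math.log(p_s1_given_f / p_s2_given_f))
print(round(dl_score, 2))               # ≈ 3.91 (3.89 on the slide after rounding)
```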
Using the Decision List
• Sort by DL score, then go through the test instance looking for a matching feature. The first match reveals the sense…
DL-score   Feature              Sense
3.89       credit within bank   Bank/1 (financial)
2.20       bank is muddy        Bank/2 (river)
1.09       pole within bank     Bank/2 (river)
0.00       of the bank          N/A
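A sketch of this first-match loop, assuming each feature is represented as a simple test over the context tokens; the predicates and the default sense are illustrative.

```python
# Sketch of applying a sorted decision list: walk down the list and let the
# first matching feature decide the sense. The feature tests are simple
# membership/substring checks and are purely illustrative.
decision_list = [
    (3.89, lambda ctx: "credit" in ctx,             "bank/1 (financial)"),
    (2.20, lambda ctx: "is muddy" in " ".join(ctx), "bank/2 (river)"),
    (1.09, lambda ctx: "pole" in ctx,               "bank/2 (river)"),
]
default_sense = "bank/1 (financial)"   # e.g. fall back to the most frequent sense

def classify(context_tokens):
    for score, test, sense in decision_list:   # list is already sorted by DL score
        if test(context_tokens):
            return sense
    return default_sense

print(classify("my credit is horrible because my bank made mistakes".split()))
# -> bank/1 (financial)
```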
Using the Decision List
[Decision cascade: CREDIT? → BANK/1 (financial); IS MUDDY? → BANK/2 (river); POLE? → BANK/2 (river)]
Support Vector Machine (SVM)
Ch. 15
Linear classifiers: Which Hyperplane?
• Lots of possible solutions for a, b, c.
• Some methods find a separating
hyperplane, but not the optimal one
[according to some criterion of expected goodness]
– E.g., perceptron
• Support Vector Machine (SVM) finds an optimal* solution.
– Maximizes the distance between the hyperplane and the “difficult points” close to the decision boundary
– One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions

[Figure: a separating line; this line represents the decision boundary: ax + by − c = 0]
Sec. 15.1
Another intuition
• If you have to place a fat separator between classes, you have fewer choices, and so the capacity of the model has been decreased
Sec. 15.1
Support Vector Machine (SVM)
• SVMs maximize the margin around the separating hyperplane.
• A.k.a. large-margin classifiers
• The decision function is fully specified by a subset of the training samples, the support vectors.
• Solving SVMs is a quadratic programming problem
• Seen by many as the most successful current text classification method*

[Figure: support vectors define the maximized margin; other separators give a narrower margin]

* but other discriminative methods often perform very similarly
From Text to Feature Vectors
• My/pronoun grandfather/noun used/verb to/prep
fish/verb along/adv the/det banks/SHORE of/prep
the/det Mississippi/noun River/noun. (S1)
• The/det bank/FINANCE issued/verb a/det check/noun
for/prep the/det amount/noun of/prep interest/noun. (S2)
       P-2   P-1   P+1    P+2   fish   check   river   interest   SENSE TAG
S1     adv   det   prep   det   Y      N       Y       N          SHORE
S2     —     det   verb   det   N      Y       N       Y          FINANCE
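A minimal sketch of turning rows like these into vectors and training a linear SVM; scikit-learn is an assumed toolkit (the slides do not name one), and with only two training sentences the example is purely illustrative.

```python
# Sketch: encode the S1/S2 feature rows above and train a linear SVM.
# scikit-learn is an assumed toolkit; the slides do not prescribe one.
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

X_dicts = [
    {"P-2": "adv", "P-1": "det", "P+1": "prep", "P+2": "det",
     "fish": 1, "check": 0, "river": 1, "interest": 0},      # S1
    {"P-2": "",    "P-1": "det", "P+1": "verb", "P+2": "det",
     "fish": 0, "check": 1, "river": 0, "interest": 1},      # S2
]
y = ["SHORE", "FINANCE"]

vec = DictVectorizer()
X = vec.fit_transform(X_dicts)        # one-hot encodes the POS-valued features

clf = LinearSVC()
clf.fit(X, y)

test = {"P-1": "det", "P+1": "prep", "river": 1}
print(clf.predict(vec.transform([test])))   # expected: ['SHORE'] (resembles S1)
```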
K-NN
[Figure: scatter plot of ‘+’ and ‘o’ training points with an unlabeled ‘?’ point to be classified by its nearest neighbours]
Supervised Approaches – Comparisons

Approach                              Average Precision   Average Recall   Corpus                                                           Average Baseline Accuracy
Naïve Bayes                           64.13%              Not reported     Senseval3 – All Words Task                                       60.90%
Decision Lists                        96%                 Not applicable   Tested on a set of 12 highly polysemous English words            63.9%
Exemplar Based disambiguation (kNN)   68.6%               Not reported     WSJ6 containing 191 content words                                63.7%
SVM                                   72.4%               72.4%            Senseval 3 – Lexical sample task (disambiguation of 57 words)    55.2%
Perceptron trained HMM                73.74%              67.60%           Senseval3 – All Words Task                                       60.90%
Dictionary-Based
Discrimination
Overview
• In this section the use of dictionaries and thesauri for word sense disambiguation is explored.
• Lesk (1986) explores the use of a dictionary.
• Yarowsky (1992) explores the use of Roget's thesaurus.
• Dagan & Itai (1994) explore the use of a bilingual dictionary.
• Also, a careful examination of the distributional properties of words may provide additional cues: commonly, an ambiguous word may not appear with more than one meaning in any given text.
Disambiguation based on sense definitions
Thesaurus-based disambiguation
• Algorithm proposed by Walker (1987)
1. Given: context c
2. for all senses sk of w do
3.    score(sk) = Σ_{vj in c} δ(t(sk), vj)
4. end
5. choose s' s.t. s' = argmax_sk score(sk)
where t(sk) is the subject code of sense sk, and δ(t(sk), vj) = 1 iff t(sk) is one of the subject codes of vj, and 0 otherwise.
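A sketch of Walker's scoring loop under an assumed toy subject-code lexicon; the codes, words, and sense inventory are illustrative.

```python
# Sketch of Walker's thesaurus-based scoring: count how many context words
# share the subject code of each sense. The toy subject-code lexicon and the
# sense inventory for "bank" are illustrative.
subject_codes = {            # subject codes listed for each context word
    "money":   {"FINANCE"},
    "loan":    {"FINANCE"},
    "water":   {"NATURE"},
    "fishing": {"NATURE", "SPORT"},
}
senses = {                   # t(sk): the subject code of each sense of the target word
    "bank/finance": "FINANCE",
    "bank/river":   "NATURE",
}

def walker_score(context, senses, subject_codes):
    scores = {}
    for sense, code in senses.items():
        # delta(t(sk), vj) = 1 iff the sense's code is among vj's subject codes
        scores[sense] = sum(1 for vj in context if code in subject_codes.get(vj, set()))
    return max(scores, key=scores.get), scores

best, scores = walker_score(["money", "loan", "water"], senses, subject_codes)
print(best, scores)   # bank/finance {'bank/finance': 2, 'bank/river': 1}
```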
Yarowsky's adaptation
• Context is simply a 100-word window centered on the word to be disambiguated.
• The algorithm adds a word to a thesaurus category if it occurs more often than chance in contexts of that category. For instance, “Navratilova” occurs far more often than chance in “sports” contexts if you are analyzing news articles.
• One can view these words as key markers in the context that guide the disambiguation process.
Thesaurus-based disambiguation
Disambiguation based on translations in a second-language corpus
• The insight behind this methodology is that words that have multiple senses in English tend to manifest themselves as different words in other languages. If you have a body of translations available to draw upon, you can use it for disambiguation.
Using a second-language corpus
One sense per discourse, one
sense per collocation
• One sense per discourse – the sense of a
target word is highly consistent within any
given document
• One sense per collocation. Nearby words
provide strong and consistent clues to the
sense of the word.
Example of one sense per
discourse
Yarowsky's Algorithm
Unsupervised disambiguation
Sense Tagging and Sense
Discrimination
• Sense tagging – the ability to tag occurrences of a word with one sense or another
• Sense discrimination – the ability to recognize that the senses of different occurrences of a word differ, without worrying about what the actual senses are
• Example
– K-means
– EM-algorithm
K-means Demo
1. The user sets the number of clusters they'd like (e.g. K = 5).
2. Randomly guess K cluster centre locations.
3. Each data point finds out which centre it is closest to (thus each centre “owns” a set of data points).
4. Each centre finds the centroid of the points it owns…
5. …and jumps there.
6. …Repeat until terminated!
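A compact sketch of the loop just described, on toy 2-D points; NumPy, the data, and the initialization scheme are assumptions.

```python
import numpy as np

# Sketch of the k-means loop above: guess K centres, assign each point to its
# closest centre, move each centre to the centroid of the points it owns, repeat.
# NumPy and the toy data are assumptions; the slides describe only the algorithm.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

def kmeans(points, k=2, iterations=20):
    centres = points[rng.choice(len(points), size=k, replace=False)]  # step 2
    for _ in range(iterations):                                       # step 6
        # step 3: each point finds the centre it is closest to
        dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        owner = dists.argmin(axis=1)
        # steps 4-5: each centre jumps to the centroid of the points it owns
        for j in range(k):
            if np.any(owner == j):
                centres[j] = points[owner == j].mean(axis=0)
    return centres, owner

centres, owner = kmeans(points, k=2)
print(centres)   # roughly the two cluster means, near (0, 0) and (5, 5)
```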
Disambiguation using clustering
Decide s’ where,
s'  arg max [log P( sk )   log P(v j | sk )]
sk
v j c
Unsupervised Clustering