CSCE 771
Natural Language Processing
Lecture 22
Word Similarity
Topics
• Word similarity
• Thesaurus-based word similarity
• Introduction to distributional word similarity
Readings:
• NLTK book, Chapter 2 (WordNet)
• Text, Chapter 20
April 8, 2013
Overview
Last Time (Programming)
• Features in NLTK
• NL queries → SQL
• NLTK support for interpretations and models
• Propositional and predicate logic support
• Prover9
Today
• Last lecture's slides 25-29: Features in NLTK
• Computational lexical semantics
Readings:
• Text, Chapters 19, 20
• NLTK book, Chapter 10
Next Time: Computational Lexical Semantics II
Figure 20.1 Possible sense tags for bass

Chapter 20 – Word Sense Disambiguation (WSD)
• Machine translation
• Supervised vs. unsupervised learning
• Semantic concordance – a corpus with words tagged with their sense tags
Feature Extraction for WSD
Feature vectors
• Collocation features – words and POS tags at specific positions around the target word:
  [w_{i-2}, POS_{i-2}, w_{i-1}, POS_{i-1}, w_i, POS_i, w_{i+1}, POS_{i+1}, w_{i+2}, POS_{i+2}]
• Bag-of-words – unordered set of neighboring words; represent the most frequent content words with a membership vector, e.g. [0,0,1,0,0,0,1] means the 3rd and 7th most frequent content words are present
• Window of nearby words/features
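To make the two feature types concrete, here is a minimal sketch (not from the slides); the ±2 window, the '<pad>' token, and the use of nltk.pos_tag are assumptions.

import nltk  # nltk.pos_tag needs the NLTK tagger models downloaded

def wsd_features(tokens, i, vocab):
    """Collocation and bag-of-words features for the target word tokens[i].
    vocab: the most frequent content words, in rank order."""
    tagged = nltk.pos_tag(tokens)
    colloc = []
    for offset in (-2, -1, 0, 1, 2):
        j = i + offset
        # word and POS tag at each position in the +/-2 window
        colloc.extend(tagged[j] if 0 <= j < len(tagged) else ('<pad>', '<pad>'))
    neighbors = set(tokens[:i] + tokens[i + 1:])
    bow = [1 if w in neighbors else 0 for w in vocab]  # membership vector
    return colloc, bow

print(wsd_features(['he', 'plays', 'bass', 'in', 'a', 'band'], 2,
                   ['band', 'fish', 'plays']))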
Naïve Bayes Classifier
• w – word vector
• s – sense tag vector
• f – feature vector [w_i, POS_i] for i = 1, ..., n

\hat{s} = \arg\max_{s \in S} P(s \mid f)

Approximate by frequency counts – but how practical?
Looking for a Practical Formula

\hat{s} = \arg\max_{s \in S} P(s \mid f) = \arg\max_{s \in S} \frac{P(f \mid s)\, P(s)}{P(f)}

Still not practical.
Naïve == Assume Independence

P(f \mid s) \approx \prod_{j=1}^{n} P(f_j \mid s)

Now practical, but realistic? Dropping P(f), which is constant over senses:

\hat{s} = \arg\max_{s \in S} P(s) \prod_{j=1}^{n} P(f_j \mid s)
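A tiny worked example with made-up numbers: suppose bass has senses s1 = fish with prior P(s1) = 0.6 and s2 = music with P(s2) = 0.4, and the context contributes two features, "catch" and "boat", with P(catch|fish) = 0.10, P(boat|fish) = 0.05, P(catch|music) = 0.001, P(boat|music) = 0.002. Then score(fish) = 0.6 × 0.10 × 0.05 = 3 × 10^-3 and score(music) = 0.4 × 0.001 × 0.002 = 8 × 10^-7, so the classifier returns fish.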
Training = Count Frequencies

Prior probability of sense s_i for word w_j:

P(s_i) = \frac{count(s_i, w_j)}{count(w_j)}

Maximum likelihood estimator (20.8):

P(f_j \mid s_i) = \frac{count(f_j, s_i)}{count(s_i)}
Decision List Classifiers
• Naïve Bayes decisions are hard for humans to examine and understand
• A decision list classifier is like a "case" statement: a sequence of (test, returned-sense-tag) pairs, applied in order; the first test that succeeds determines the sense
Figure 20.2 Decision List Classifier Rules
WSD Evaluation, Baselines, Ceilings
• Extrinsic evaluation – evaluating WSD embedded in end-to-end applications (in vivo)
• Intrinsic evaluation – evaluating WSD by itself (in vitro); sense accuracy
• Corpora – SemCor, SENSEVAL, SemEval
• Baseline – most frequent sense (WordNet sense 1)
• Ceiling – gold standard: human experts with discussion and agreement
Similarity of Words or Senses
• Generally we will say "words" but mean similarity of word senses
• Similarity vs. relatedness: car and bicycle are similar; car and gasoline are related but not similar
• Similarity of words
• Similarity of phrases/sentences (not usually done)
Figure 20.3 Simplified Lesk Algorithm
gloss/sentence overlap
Simplified Lesk example
"The bank can guarantee deposits will eventually cover future tuition costs because it invests in adjustable-rate mortgage securities."
Here "deposits" and "mortgage" overlap with the gloss of the financial sense of bank, so that sense wins.
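A minimal sketch of the simplified Lesk idea using NLTK's WordNet glosses (NLTK 3 method names; the stopword set and whitespace tokenization are crude assumptions):

from nltk.corpus import wordnet as wn

def simplified_lesk(word, sentence):
    """Pick the sense of `word` whose gloss + examples overlap most
    with the words of `sentence`."""
    stop = {'the', 'a', 'an', 'of', 'in', 'it', 'will', 'can', 'because'}
    context = set(sentence.lower().split()) - stop
    best, best_overlap = None, -1
    for sense in wn.synsets(word):
        signature = set(sense.definition().lower().split())
        for example in sense.examples():
            signature |= set(example.lower().split())
        overlap = len(signature & context)
        if overlap > best_overlap:
            best, best_overlap = sense, overlap
    return best

print(simplified_lesk('bank',
    'The bank can guarantee deposits will eventually cover future '
    'tuition costs because it invests in adjustable rate mortgage securities'))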
Corpus Lesk
• Giving equal weight to every overlapping word just does not seem right
• Instead, weight the overlap words by inverse document frequency:
  idf_i = \log(N_{docs} / n_i), where n_i = number of documents containing w_i
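Corpus Lesk then scores an overlap by summing idf weights rather than counting words equally; a sketch, where n_docs and the document-frequency table df are assumed to come from some corpus:

import math

def idf(w, n_docs, df):
    return math.log(n_docs / df[w])

def weighted_overlap(signature, context, n_docs, df):
    # each overlapping word contributes its idf, not a flat 1
    return sum(idf(w, n_docs, df) for w in signature & context)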
SENSEVAL competitions
http://www.senseval.org/
Check the Senseval-3 website.
SemEval-2 – Evaluation Exercises on Semantic Evaluation (ACL SIGLEX event)

Task  Name (Area)
#1   Coreference Resolution in Multiple Languages (Coref)
#2   Cross-Lingual Lexical Substitution (Cross-Lingual, Lexical Substitution)
#3   Cross-Lingual Word Sense Disambiguation (Cross-Lingual, Word Senses)
#4   VP Ellipsis - Detection and Resolution (Ellipsis)
#5   Automatic Keyphrase Extraction from Scientific Articles
#6   Classification of Semantic Relations between MeSH Entities in Swedish Medical Texts
#7   Argument Selection and Coercion (Metonymy)
#8   Multi-Way Classification of Semantic Relations Between Pairs of Nominals
#9   Noun Compound Interpretation Using Paraphrasing Verbs (Noun Compounds)
#10  Linking Events and their Participants in Discourse (Semantic Role Labeling, Information Extraction)
#11  Event Detection in Chinese News Sentences (Semantic Role Labeling, Word Senses)
#12  Parser Training and Evaluation using Textual Entailment
#13  TempEval 2 (Time Expressions)
#14  Word Sense Induction
#15  Infrequent Sense Identification for Mandarin Text to Speech Systems
#16  Japanese WSD (Word Senses)
#17  All-words Word Sense Disambiguation on a Specific Domain (WSD-domain)
#18  Disambiguating Sentiment Ambiguous Adjectives (Word Senses)
20.4.2 Selectional Restrictions and Preferences
• The verb eat → its theme (object) must have the feature [+Food]
• Katz and Fodor (1963) used this idea to rule out senses that were not consistent
• WSD of dish:
(20.12) "In our house, everybody has a career and none of them includes washing dishes," he says.
(20.13) In her tiny kitchen, Ms. Chen works efficiently, stir-frying several simple dishes, including ...
• The verbs disambiguate: wash → [+Washable] theme; stir-fry → [+Edible] theme
Resnik's Model of Selectional Association
How much does a predicate tell you about the semantic class of its arguments?
• eat → tells you a lot
• was, is, to be, ... → tell you very little
• The selectional preference strength of a verb is indicated by two distributions:
1. P(c): how likely the direct object is to be in class c
2. P(c|v): the distribution of expected semantic classes for the particular verb v
• The greater the difference between these distributions, the more information the verb provides
Relative Entropy – Kullback-Leibler Divergence
Given two distributions P and Q:

D(P \parallel Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}    (eq 20.16)

Selectional preference strength:

S_R(v) = D(P(c \mid v) \parallel P(c)) = \sum_{c} P(c \mid v) \log \frac{P(c \mid v)}{P(c)}
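A small sketch of these two quantities; the classes and probabilities are purely illustrative:

import math

def kl(p, q):
    """D(P || Q) over a shared set of classes."""
    return sum(p[c] * math.log(p[c] / q[c], 2) for c in p if p[c] > 0)

p_c = {'food': 0.2, 'person': 0.5, 'artifact': 0.3}              # P(c)
p_c_given_eat = {'food': 0.9, 'person': 0.05, 'artifact': 0.05}  # P(c|eat)

s_eat = kl(p_c_given_eat, p_c)  # selectional preference strength S_R(eat)
print(s_eat)                    # large: 'eat' is choosy about its object

# selectional association of 'eat' with 'food': food's share of S_R(eat)
a_eat_food = (p_c_given_eat['food']
              * math.log(p_c_given_eat['food'] / p_c['food'], 2)) / s_eat
print(a_eat_food)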
Resnik’s model of Selectional Association
High and Low Selectional Associations – Resnik 1996
20.5 Minimally Supervised WSD: Bootstrapping
• "Supervised and dictionary methods require large hand-built resources"
• Bootstrapping (semi-supervised or minimally supervised learning) addresses the no-data problem
• Start with a seed set and grow it.
Yarowsky Algorithm Preliminaries
Idea of bootstrapping: "create a larger training set from a small set of seeds"
Heuristics, illustrated with the senses of bass:
1. One sense per collocation – within one sentence, both senses of bass are not used
2. One sense per discourse – Yarowsky showed that in 37,232 examples of bass occurring in a discourse, there was only one sense per discourse
Yarowsky Algorithm
Goal: learn a word-sense classifier for a word
Input: Λ0, a small seed set of labeled instances of each sense
1. Train a classifier on the seed set Λ0
2. Label the unlabeled corpus V0 with the classifier
3. Select the examples Δ in V that the classifier is most confident about
4. Λ1 = Λ0 ∪ Δ
5. Repeat
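A schematic sketch of this loop; train, classify, and the confidence threshold are placeholders for whatever supervised classifier is plugged in:

def yarowsky(seed_labeled, unlabeled, train, classify, threshold=0.9):
    """seed_labeled: list of (example, sense); unlabeled: set of examples.
    train(labeled) -> model; classify(model, x) -> (sense, confidence)."""
    labeled = list(seed_labeled)
    while True:
        model = train(labeled)                       # step 1
        confident = []
        for x in list(unlabeled):                    # step 2: label corpus
            sense, confidence = classify(model, x)
            if confidence >= threshold:              # step 3: keep only the
                confident.append((x, sense))         #         confident ones
                unlabeled.discard(x)
        if not confident:                            # nothing new: stop
            return model
        labeled.extend(confident)                    # step 4: grow seed set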
Figure 20.4 Two senses of plant
plant 1 – manufacturing plant ...
plant 2 – flora, plant life
2009 Survey of WSD by Navigli
http://www.dsi.uniroma1.it/~navigli/pubs/ACM_Survey_2009_Navigli.pdf
Figure 20.5 Samples of bass sentences from the WSJ (Wall Street Journal)
Word Similarity: Thesaurus-Based Methods
Figure 20.6 Path distances in the hierarchy – WordNet, of course (pruned)
Figure 20.6 Path-Based Similarity

sim_path(c1, c2) = 1 / (pathlen(c1, c2) + 1), where pathlen counts the edges on the shortest path between c1 and c2
WN Hierarchy

# WordNet examples from the NLTK book (NLTK 2 / Python 2, as on the slides)
import nltk
from nltk.corpus import wordnet as wn

right = wn.synset('right_whale.n.01')
orca = wn.synset('orca.n.01')
minke = wn.synset('minke_whale.n.01')
tortoise = wn.synset('tortoise.n.01')
novel = wn.synset('novel.n.01')

print "LCS(right, minke)=", right.lowest_common_hypernyms(minke)
print "LCS(right, orca)=", right.lowest_common_hypernyms(orca)
print "LCS(right, tortoise)=", right.lowest_common_hypernyms(tortoise)
print "LCS(right, novel)=", right.lowest_common_hypernyms(novel)

# path similarity
print "Path similarities"
print right.path_similarity(minke)      # 0.25
print right.path_similarity(orca)       # 0.166666666667
print right.path_similarity(tortoise)   # 0.0769230769231
print right.path_similarity(novel)      # 0.0434782608696
Wordnet in NLTK
• http://nltk.org/_modules/nltk/corpus/reader/wordnet.html
• http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html (partially in Chapter 02 of the NLTK book, but a different version)
• http://grey.colorado.edu/mingus/index.php/Objrec_Wordnet.py – code for similarity; runs for a while, lots of results
https://groups.google.com/forum (3/4/10)
Hi,
I was wondering if it is possible for me to use NLTK + WordNet to group (noun) words together via similar meanings?
Assuming I have 2000 words or topics, is it possible for me to group them together according to similar meanings using NLTK? So that at the end of the day I would have different groups of words that are similar in meaning? Can that be done in NLTK? And possibly be able to detect salient patterns emerging (trends in topics etc.)?
Is there a further need for a word classifier based on the CMU BOW toolkit to classify words to get them into categories, or would the above grouping be good enough? Is there a need to classify words further? How would one classify words in NLTK effectively?
Really hope you can enlighten me.
FM
Response from Steven Bird (3/7/10)
2010/3/5 Republic <[email protected]>:
> Assuming I have 2000 words or topics. Is it possible for me to group
> them together according to similar meanings using NLTK?
You could compute WordNet similarity (pairwise), so that each word/topic is represented as a vector of distances, which could then be discretized, so each vector would have a form like this: [0,2,3,1,0,0,2,1,3,...]. These vectors could then be clustered using one of the methods in the NLTK cluster package.
> So that at the end of the day I would have different groups of words
> that are similar in meaning? Can that be done in NLTK? and possibly be
> able to detect salient patterns emerging? (trend in topics etc...).
This suggests a temporal dimension, which might mean recomputing the clusters as more words or topics come in.
It might help to read the NLTK book sections on WordNet and on text classification, and also some of the other cited material.
-Steven Bird
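A sketch of Bird's suggestion; the word list, the choice of each word's first noun sense, and k = 2 clusters are assumptions:

import numpy
from nltk.corpus import wordnet as wn
from nltk.cluster import KMeansClusterer, euclidean_distance

words = ['cat', 'dog', 'car', 'truck', 'apple', 'pear']
synsets = [wn.synsets(w, pos=wn.NOUN)[0] for w in words]  # first noun sense

# each word becomes a vector of its similarities to all the words
vectors = [numpy.array([s1.path_similarity(s2) or 0.0 for s2 in synsets])
           for s1 in synsets]

clusterer = KMeansClusterer(2, euclidean_distance)
assignments = clusterer.cluster(vectors, assign_clusters=True)
print(list(zip(words, assignments)))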
More general? (Stack Overflow)

import nltk
from nltk.corpus import wordnet as wn

waiter = wn.synset('waiter.n.01')
employee = wn.synset('employee.n.01')

# all lemma names in the hyponym closure (in NLTK 3, use s.lemma_names())
all_hyponyms_of_waiter = list(set([w.replace("_", " ")
    for s in waiter.closure(lambda s: s.hyponyms())
    for w in s.lemma_names]))
all_hyponyms_of_employee = ...  # built the same way, starting from employee

if 'waiter' in all_hyponyms_of_employee:
    print 'employee more general than waiter'
elif 'employee' in all_hyponyms_of_waiter:
    print 'waiter more general than employee'
else:
    ...  # else branch elided on the slide

http://stackoverflow.com/questions/...-semantic-hierarchies-relations-in--nltk
>>> help(wn)
...
 |  res_similarity(self, synset1, synset2, ic, verbose=False)
 |      Resnik Similarity:
 |      Return a score denoting how similar two word senses are, based on the
 |      Information Content (IC) of the Least Common Subsumer (most specific
 |      ancestor node).

http://grey.colorado.edu/mingus/index.php/Objrec_Wordnet.py
Similarity based on a hierarchy (= ontology)
Information Content word similarity
IC(c) = -log P(c), where P(c) is the probability that a randomly selected word in a corpus is an instance of concept c
Resnik Similarity / WordNet

sim_resnik(c1, c2) = -log P(LCS(c1, c2))

In NLTK's wordnet module:
 |  res_similarity(self, synset1, synset2, ic, verbose=False)
 |      Resnik Similarity:
 |      Return a score denoting how similar two word senses are, based on the
 |      Information Content (IC) of the Least Common Subsumer (most specific
 |      ancestor node).
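A usage sketch (real NLTK calls, NLTK 3 names; 'ic-brown.dat' ships with the nltk_data wordnet_ic package):

from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')   # IC counts from the Brown corpus
dog, cat = wn.synset('dog.n.01'), wn.synset('cat.n.01')
print(dog.res_similarity(cat, brown_ic))   # -log P(LCS): higher = more similar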
Fig 20.7 WordNet with Lin P(c) values (change for Resnik!!)
Lin Variation (1998)
• Commonality: IC(common(A, B))
• Difference: IC(description(A, B)) - IC(common(A, B))
• sim_Lin(A, B) = common(A, B) / description(A, B)
• For WordNet: sim_Lin(c1, c2) = 2 log P(LCS(c1, c2)) / (log P(c1) + log P(c2))
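A worked instance of the last formula, plugging in the P(c) values shown in Fig 20.7 (P(hill) = 0.0000189, P(coast) = 0.0000216, and P(geological-formation) = 0.00176 for their lowest common subsumer):

sim_Lin(hill, coast) = 2 log 0.00176 / (log 0.0000189 + log 0.0000216) ≈ 0.59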
Extended Lesk
Based on:
1. glosses
2. glosses of hypernyms and hyponyms
Example:
• drawing paper: paper that is specially prepared for use in drafting
• decal: the art of transferring designs from specially prepared paper to a wood, glass or metal surface
• Lesk score = sum of squares of the lengths of the common phrases
• Here the common phrases are "paper" (length 1) and "specially prepared" (length 2): 1^2 + 2^2 = 5
Figure 20.8 Summary of Thesaurus Similarity Measures
Wordnet similarity functions
• path_similarity() – shortest-path based
• lch_similarity() – Leacock-Chodorow
• wup_similarity() – Wu-Palmer
• res_similarity() – Resnik (needs an IC dictionary)
• jcn_similarity() – Jiang-Conrath (needs an IC dictionary)
• lin_similarity() – Lin (needs an IC dictionary)
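A quick comparison call for each (NLTK 3; res/jcn/lin take an information-content dictionary):

from nltk.corpus import wordnet as wn, wordnet_ic

ic = wordnet_ic.ic('ic-brown.dat')
dog, cat = wn.synset('dog.n.01'), wn.synset('cat.n.01')

print(dog.path_similarity(cat))     # shortest path
print(dog.lch_similarity(cat))      # Leacock-Chodorow
print(dog.wup_similarity(cat))      # Wu-Palmer
print(dog.res_similarity(cat, ic))  # Resnik
print(dog.jcn_similarity(cat, ic))  # Jiang-Conrath
print(dog.lin_similarity(cat, ic))  # Lin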
Problems with thesaurus-based methods
• We don't always have a thesaurus
• Even with one, there are recall problems: missing words, missing phrases
• Thesauri work less well for verbs and adjectives, which have less hyponymy structure
(Distributional Word Similarity, D. Jurafsky)
Distributional models of meaning
• Vector-space models of meaning
• Offer higher recall than hand-built thesauri, probably with less precision
(Distributional Word Similarity, D. Jurafsky)
Word Similarity: Distributional Methods
(20.31) The tezgüino example:
• A bottle of tezgüino is on the table.
• Everybody likes tezgüino.
• Tezgüino makes you drunk.
• We make tezgüino out of corn.
• What do you know about tezgüino?
Term-document matrix
• Collection of documents
• Identify a collection of important, discriminatory terms (words)
• Matrix: terms × documents
• Term frequency tf_{w,d} = count of term w in document d
• Each document is a vector in Z^|V| (Z = integers; N = natural numbers would be more accurate but perhaps misleading)
• Example on the next slide
(Distributional Word Similarity, D. Jurafsky)
Example term-document matrix
Subset of terms = {battle, soldier, fool, clown}

            As You Like It   12th Night   Julius Caesar   Henry V
  battle          1               1              8           15
  soldier         2               2             12           36
  fool           37              58              1            5
  clown           6             117              0            0

(Distributional Word Similarity, D. Jurafsky)
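Reading the rows as vectors, cosine similarity compares words by the documents they occur in; a sketch using the counts above:

import math

counts = {                       # rows of the term-document matrix above
    'battle':  [1, 1, 8, 15],
    'soldier': [2, 2, 12, 36],
    'fool':    [37, 58, 1, 5],
    'clown':   [6, 117, 0, 0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

print(cosine(counts['battle'], counts['soldier']))  # high: similar profiles
print(cosine(counts['fool'], counts['soldier']))    # low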
Figure 20.9 Term-in-context matrix for word similarity
Window of 20 words (10 before, 10 after) from the Brown corpus
Pointwise Mutual Information
• Use tf-idf (inverse document frequency) weighting instead of raw counts
• The idf intuition again: words that occur in every document tell you little
• Pointwise mutual information (PMI): do events x and y co-occur more than if they were independent?

PMI(x, y) = \log_2 \frac{P(x, y)}{P(x) P(y)}

• PMI between words: compare word-context co-occurrence to chance
• Positive PMI (PPMI): replace negative PMI values with zero
Computing PPMI
• Matrix F with W rows (words) and C columns (contexts); f_ij is the frequency of word w_i in context c_j
• p_{ij} = f_{ij} / \sum_{i}\sum_{j} f_{ij};  p_{i*} = \sum_j p_{ij};  p_{*j} = \sum_i p_{ij}
• PPMI_{ij} = \max(\log_2 (p_{ij} / (p_{i*} p_{*j})), 0)
Example computing PPMI
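In place of the slide's worked example, a numpy sketch of the same computation on a toy count matrix (the counts are made up):

import numpy as np

f = np.array([[0., 1., 2.],   # toy counts f_ij: 2 words x 3 contexts
              [3., 0., 1.]])

p = f / f.sum()                        # joint probabilities p_ij
p_w = p.sum(axis=1, keepdims=True)     # row marginals   p_i*
p_c = p.sum(axis=0, keepdims=True)     # column marginals p_*j

with np.errstate(divide='ignore', invalid='ignore'):
    pmi = np.log2(p / (p_w * p_c))
ppmi = np.where(pmi > 0, pmi, 0.0)     # clamp -inf and negatives to zero
print(ppmi)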
Figure 20.10
Figure 20.11
Figure 20.12
Figure 20.13
Figure 20.14
Figure 20.15
Figure 20.16
http://www.cs.ucf.edu/courses/cap5636/fall2011/nltk.pdf – how to do this in NLTK

NLTK 3.0a1 released: February 2013
This version adds support for NLTK's graphical user interfaces.
http://nltk.org/nltk3-alpha/

"Which similarity function in nltk.corpus.wordnet is appropriate for finding the similarity of two words? I want to use a function for word clustering and the Yarowsky algorithm to find similar collocations in a large text."

http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Linguistics
http://en.wikipedia.org/wiki/Portal:Linguistics
http://en.wikipedia.org/wiki/Yarowsky_algorithm
http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html